Coding Fairness: Detecting Demographic-Related Coding Discrepancies in ICD Code Assignments

Ying Yin; Stuart J Nelson; Yijun Shao; Charles Faselis; Ali Ahmed; Qing Zeng-Treitler

. 2026 Feb 14;2025:1474–1481.

PMCID: PMC12919552 PMID: 41726471

Abstract

Coded clinical data are crucial in biomedical informatics research. While it is well known that electronic medical records often contain coding errors, numerous studies rely on International Classification of Diseases (ICD) codes for phenotyping in cohort assembly, statistical analysis, and AI modeling. Although fairness has become an important focus in AI research, the potential biases embedded in coded clinical data have received less attention. In this study, we employed a race- and sex-agnostic AI phenotyping model to assess coding fairness across 203 ICD code blocks within the Veterans Health Administration Corporate Data Warehouse. Our findings revealed variability in coding consistency across demographic subgroups defined by sex, race, and ethnicity. Notably, over 50% of the code blocks exhibited statistically significant differences in discrepancies between AI-generated and ICD-based phenotypes across these demographic groups. These results suggest the need to recognize and address demographic-related coding discrepancies to ensure coding fairness.

Introduction

Clinical coding systems, such as the International Classification of Diseases (ICD), are fundamental for billing and administrative purposes in clinical settings. Coded clinical data in electronic health records (EHRs) are also widely used in clinical research to identify patients with specific traits or medical conditions, a task known as phenotyping, for creating patient cohorts and measuring clinical outcomes.^1,2 For instance, the electronic Medical Records and Genomics (eMERGE) Network has developed a set of EHR-based phenotypes,³ most of which rely on existing structured data, particularly ICD codes.

The medical coding process is inherently complex and labor-intensive, requiring trained coders to manually assign codes based on clinical notes and other documentation.⁴ In recent years, coders have often used software tools to facilitate the process. Software tools, however, are imperfect and depend on the data provided by their users. The coding process is thus susceptible to errors due to factors such as coder fatigue, inadequate training, variations in coding standards, and resource limitations.⁵ Our previous research also demonstrated considerable variability and inconsistency in ICD code assignment over time and across different clinical settings, underscoring the challenges in achieving reliable and reproducible coding.⁶

Fairness, or the elimination of bias, is a foundational bioethics principle. With the increasing reliance on artificial intelligence (AI) as a powerful tool in healthcare, the concept of AI bias has also gained significant attention from both the public and scientific communities.^7-9 AI bias often mirrors existing human biases, potentially leading to discriminatory decisions and the perpetuation of health disparities. Such biases frequently originate from training data that reflect systemic inequities. Efforts to mitigate AI bias have primarily focused on ensuring equitable representation or outcomes across demographic categories such as race, ethnicity, sex, socioeconomic status, and geographic location.¹⁰

Given the volume of clinical data and the heterogeneous nature of coding errors, disparities in clinical coding are challenging to detect. In this context, we introduce the concept of ‘coding fairness’, an issue deeply embedded within the coding process itself. Previous research has revealed disparities in the sufficiency and accuracy of ICD coding across race and ethnicity for specific diseases.^11,12 These coding disparities could be caused by provider bias, lack of specialist care, or phenotypic differences. For example, opioid misuse may be under-coded in older adults or female patients because they typically have a lower misuse rate.¹³ Even though the latent biases within clinical coding can significantly influence downstream AI model predictions and further propagate into a biased AI model, there is a lack of methods to systematically detect coding quality across different demographic subgroups.

To improve ICD-based phenotyping, researchers have incorporated alternative data sources, especially free-text notes processed through natural language processing (NLP), to enhance phenotype definitions beyond coding.^14-16 We also developed AI phenotyping models that leverage both structured and unstructured EHR data to generate AI-based phenotypes aligned with ICD code blocks. These AI phenotypes outperformed clinically assigned ICD code blocks in predicting outcomes, such as mortality and hospitalization, suggesting that they more accurately capture a patient’s overall health status.¹⁷

Building upon these findings, we further developed a race- and sex-agnostic AI phenotyping model. By removing these demographic variables, the model generates phenotypes based solely on clinical features (e.g., medications, procedures, and clinical notes), providing a more neutral benchmark against which to compare the original ICD phenotypes. Given the tens of thousands of ICD codes with widely varying prevalence, this study focused on the ICD code block level to assess broader disease category classifications. The primary objective was to compare discrepancies between AI-derived and ICD-based phenotypes across demographic groups, thereby enabling a systematic examination of coding biases.

Methods

Data Source:

The data for this analysis is sourced from the Veterans Health Administration (VHA) Corporate Data Warehouse (CDW), a nationwide database for U.S. veterans. The CDW contains comprehensive data from 2000 to 2024, including demographics, clinical encounters, diagnoses, procedures, medications, clinical notes, etc. The study was approved by the VA Institutional Review Board, Washington, DC, VA Medical Center.

ICD Code Blocks:

The ICD-10 codebook is structured hierarchically into approximately 260 code blocks across all 22 chapters. Due to the limited on-site obstetrics and prenatal care services the VHA offers, we excluded diagnosis code blocks related to these domains (O and P). We also omitted code blocks lacking substantive diagnostic value, such as R (symptoms and signs) and U–Z (causes of morbidity and administrative codes). As a result, we included 203 ICD-10 code blocks. For the ICD-9-CM codes, we identified corresponding code blocks using general equivalence mappings (GEM) provided by the Centers for Disease Control and Prevention (CDC).
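The chapter-level exclusions described above can be sketched as a simple filter. This is an illustrative simplification, not the study's code: it screens individual ICD-10 codes by their first letter, whereas the study worked with code blocks, and the function name `is_included` and the sample codes are hypothetical.

```python
import re

# Chapter letters excluded in the text: O and P (obstetric and perinatal),
# R (symptoms and signs), and U through Z.
EXCLUDED_LETTERS = set("OPR") | set("UVWXYZ")


def is_included(icd10_code: str) -> bool:
    """Return True if an ICD-10 code falls in a chapter kept for analysis."""
    code = icd10_code.strip().upper()
    if not re.match(r"^[A-Z][0-9]", code):
        raise ValueError(f"not an ICD-10 style code: {icd10_code!r}")
    return code[0] not in EXCLUDED_LETTERS


# Hypothetical sample of codes, for illustration only.
codes = ["I25.10", "O80", "R07.9", "Z00.00", "N18.3", "E11.9"]
kept = [c for c in codes if is_included(c)]
print(kept)
```

A block-level version would apply the same test to the first code of each block range rather than to individual codes.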

Data Preparation and Feature Extraction

The VHA divides its healthcare system geographically into 18 Veterans Integrated Service Networks (VISNs). We randomly sampled 9,000 outpatient visits from each VISN from 2014, 2016, and 2018. This sampling strategy ensured equal representation across VISNs and included data from both the ICD-9-CM and ICD-10 eras. We excluded visits that had no diagnosis assigned from the previously specified ICD code blocks or that lacked clinical notes, resulting in a final cohort of 269,934 outpatient visits for analysis (Figure 1). For each visit, we extracted medical prescriptions (grouped by VA drug class) and procedures (grouped by three-digit Current Procedural Terminology (CPT) codes) within a seven-day window. Additionally, we retrieved clinical notes from these visits for topic modeling.
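The sampling design can be illustrated with a small stratified-sampling sketch. It assumes the 9,000 visits were drawn per VISN per year, which seems likely because the final cohort exceeds 18 × 9,000 but is not stated explicitly. The record fields (`visit_id`, `visn`, `year`) and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_visits(visits, per_stratum, seed=0):
    """Draw up to `per_stratum` visits at random from each (VISN, year) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for visit in visits:
        strata[(visit["visn"], visit["year"])].append(visit)
    sampled = []
    for key in sorted(strata):
        pool = strata[key]
        k = min(per_stratum, len(pool))  # small strata contribute all they have
        sampled.extend(rng.sample(pool, k))
    return sampled


# Toy data: 3 VISNs x 3 years, 10 visits per stratum (illustrative only).
years = (2014, 2016, 2018)
visits = [
    {"visit_id": i, "visn": 1 + i % 3, "year": years[(i // 3) % 3]}
    for i in range(90)
]
sample = sample_visits(visits, per_stratum=4)
print(len(sample))
```

Sampling the same number of visits from every stratum is what gives each VISN equal weight, regardless of its patient volume.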

Topic Modeling:

We used Latent Dirichlet Allocation (LDA), an unsupervised machine learning technique, to identify hidden topics in the clinical notes. The model was trained on 367,986 clinical notes from 2018 visits. Notes underwent pre-processing, including tokenization and removal of punctuation, standard and domain-specific stop words, and low-frequency terms. The topic model was trained to extract 1000 topics. While each topic theoretically appears in every note with some proportion, we established an empirical threshold based on the virtual word count (topic proportion × note word count). A topic was considered present only when its virtual word count exceeded 4.¹⁸ The trained topic model was then applied to notes extracted from 2014 and 2016 visits for inference.
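The virtual word count rule can be written in a few lines. The sketch below is illustrative: the function name `topics_present` and the example proportions are hypothetical, and only the threshold of 4 comes from the text.

```python
def topics_present(topic_proportions, note_word_count, threshold=4.0):
    """Return indices of topics whose virtual word count exceeds the threshold.

    Virtual word count = topic proportion x note word count.
    """
    return [
        k
        for k, proportion in enumerate(topic_proportions)
        if proportion * note_word_count > threshold
    ]


# A hypothetical 100-word note with five topics (proportions sum to 1).
proportions = [0.50, 0.30, 0.15, 0.03, 0.02]
present = topics_present(proportions, note_word_count=100)
print(present)
```

Because the threshold is applied to a count rather than a proportion, the same topic proportion can count as present in a long note and absent in a short one.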

Machine Learning Modeling for Code Block Prediction:

A two-tier neural network architecture was developed to predict ICD block assignments, utilizing 2187 input features: patient age at the visit, LDA-derived topics from clinical notes, medications (VA drug class), and procedures (CPT codes). The model architecture mirrored the ICD hierarchy, predicting both ICD chapters (level 1) and ICD blocks (level 2) (Figure 2). Binary cross-entropy loss was minimized across both levels.

Figure 2: Deep Learning Model Architecture for Code Block Prediction
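To make the two-level objective concrete, the following sketch computes binary cross-entropy over chapter-level and block-level labels and adds the two terms. The unweighted sum is an assumption; the text states only that the loss was minimized across both levels. The function names are hypothetical, and the study itself used PyTorch rather than plain Python.

```python
import math


def bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over independent binary labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def two_level_loss(chapter_probs, chapter_targets, block_probs, block_targets):
    """Assumed combination: level 1 (chapter) loss plus level 2 (block) loss."""
    return bce(chapter_probs, chapter_targets) + bce(block_probs, block_targets)


# Toy example with 2 chapters and 3 blocks (illustrative values only).
loss = two_level_loss([0.9, 0.1], [1, 0], [0.8, 0.2, 0.1], [1, 0, 0])
print(round(loss, 4))
```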

The dataset was divided into training (64%), validation (16%), and test (20%) subsets. The model was trained with mini-batch stochastic optimization using the Adam optimizer, a batch size of 256, a learning rate of 0.0001, and a dropout rate of 0.25. The model achieving the highest micro-F1 score for level 2 predictions was selected, with early stopping triggered after 10 epochs of no improvement to mitigate overfitting. This model produced probabilistic multi-label predictions for 203 ICD blocks as AI phenotypes, representing a ‘fuzzy’ classification with probabilities ranging from 0 to 1.
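The split proportions and the early-stopping rule can be sketched as follows. This is a minimal illustration under stated assumptions: the split is made at the visit level by random shuffling (the text does not say whether visits from the same patient were kept together), and both function names are hypothetical.

```python
import random


def split_indices(n, seed=0):
    """Shuffle n indices and split them 64% / 16% / 20% (train / validation / test)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(0.20 * n)
    n_val = round(0.16 * n)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx


def early_stopping_epoch(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) given per-epoch validation micro-F1 scores."""
    best_score, best_epoch, stop_epoch = float("-inf"), -1, -1
    for epoch, score in enumerate(val_scores):
        stop_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, stop_epoch


train_idx, val_idx, test_idx = split_indices(1000)
# Hypothetical validation curve: improves for 3 epochs, then plateaus.
best_epoch, stop_epoch = early_stopping_epoch([0.1, 0.2, 0.3] + [0.25] * 20)
print(len(train_idx), len(val_idx), len(test_idx), best_epoch, stop_epoch)
```

The 64/16/20 proportions are what results from holding out 20% for testing and then holding out 20% of the remainder for validation.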

Coding Fairness Analysis:

To assess the difference between AI and ICD-based phenotype assignment, we calculated the absolute difference between the model’s probability estimates and the binary (0 or 1) values of the assigned code blocks for each visit. The differences in each ICD block were aggregated by demographic groups to evaluate coding fairness. We used a Student’s t-test to determine whether coding agreement differed significantly between minority and majority demographic groups. A p-value less than 0.05 was considered indicative of a statistically significant difference.
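The difference score and the group comparison can be sketched with the standard library alone. The sketch computes per-visit absolute differences and a pooled-variance Student's t statistic. The p-value here is a normal approximation to the t distribution, which is reasonable only for large samples; in practice an exact p-value would more likely come from a library routine such as `scipy.stats.ttest_ind`. All function names and toy values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def abs_differences(probs, labels):
    """Per-visit |predicted probability - binary ICD assignment| for one code block."""
    return [abs(p - y) for p, y in zip(probs, labels)]


def student_t(a, b):
    """Pooled-variance Student's t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (assumes large samples)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Toy difference scores for one code block in two groups (illustrative only).
group_a = abs_differences([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0])
group_b = abs_differences([0.5, 0.4, 0.3, 0.6], [1, 0, 1, 0])
t_stat, dof = student_t(group_a, group_b)
print(round(mean(group_a), 3), round(mean(group_b), 3), round(t_stat, 3), dof)
```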

Software:

Topic modeling was conducted using MALLET (version 2.0), a Java library (Java version 1.8). All other analyses were performed in Python (version 3.7), using the scikit-learn, SciPy, PyTorch, and Matplotlib libraries for deep neural network development, data analysis, and data visualization.

Results

Study Cohort

The study cohort (n=269,934) comprised a patient population with the following demographic distribution: 8% female, 20% African American, and 5% Hispanic (Table 1). Each selected visit was treated as an independent encounter, with patients counted at the visit level, potentially resulting in duplicate patient entries across multiple visits.

Table 1: Patients’ Demographics

Characteristic	#	%
Total	269,934
Sex
Male	249,090	92%
Female	20,844	8%
Race
White	191,910	71%
Black	53,959	20%
Other	24,065	9%
Ethnicity
Non-Hispanic	245,386	91%
Hispanic	13,857	5%
Unknown	10,691	4%

Race- and sex-agnostic AI phenotyping model performance and group discrepancy

To evaluate the AI phenotypes generated by the models, we compared the model predictions against the originally assigned ICD blocks, treating these assignments as our ‘silver standard’ due to potential limitations in the initial assignments. Over the 203 ICD code blocks, the model achieved an average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.76, with a standard deviation of 0.11. This indicates a moderate level of overall predictive performance. Notably, our model exhibited a robust performance when applied to code blocks with higher prevalence rates. The model delivered an average AUC of 0.92 (SD = 0.02) for the top 10 most frequently occurring code blocks (Table 2) and a slightly lower but still commendable AUC of 0.88 (SD = 0.05) for the 50 most common ones. Since we know that ICD codes are imperfect, a lower AUC does not necessarily indicate that the phenotype model is less accurate.
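For readers less familiar with the metric, AUC can be computed as the probability that a randomly chosen visit with the code block assigned receives a higher predicted probability than a randomly chosen visit without it. The sketch below implements that definition directly and shows how the same calculation can be restricted to a demographic subgroup. It is a quadratic-time illustration with hypothetical names and toy data; scikit-learn's `roc_auc_score` would be the usual choice for real data.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """Compute AUC separately within each demographic group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = roc_auc([scores[i] for i in idx], [labels[i] for i in idx])
    return result


# Toy predictions for one code block (illustrative only).
scores = [0.9, 0.8, 0.3, 0.1, 0.7, 0.2, 0.6, 0.4]
labels = [1, 0, 1, 0, 1, 0, 0, 1]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]
print(roc_auc(scores, labels), auc_by_group(scores, labels, groups))
```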

Table 2: Model Performance for Code Block Prediction

Code blocks	AUC (%), mean (± SD)	F1 (%), mean (± SD)
All 203 blocks	76 (± 11)	9 (± 17)
Top 10 most prevalent blocks	92 (± 2)	58 (± 12)
Top 20 most prevalent blocks	91 (± 3)	50 (± 16)
Top 50 most prevalent blocks	88 (± 5)	31 (± 22)

Our predictive model was intentionally designed without sex or race as input features. This allows us to examine the discrepancies between the AI model and ICD code-based phenotyping across demographic groups. As illustrated in Figure 3, using code block 162 (Acute kidney failure and chronic kidney disease) as an example, we observed inconsistent model performance across demographic subgroups. The AUC for predicting this code block is significantly higher for females (AUC = 0.958) than for males (AUC = 0.930), despite there being far more males in our sample. Similar differences were observed in terms of race and ethnicity. The fact that race- and sex-agnostic AI models performed differently for different demographic groups suggests coding disparities.

Figure 3: ROC curve of code block 162 (Acute kidney failure and chronic kidney disease) by a) Sex, b) Race, and c) Ethnicity

AI model prediction-ICD assignment agreement and group discrepancies

To quantify coding disparities across all code blocks, we calculated the absolute difference between AI-predicted probabilities and actual ICD assignments for each code block, stratified by race, ethnicity, and sex. Statistical analysis of these differences across demographic subgroups revealed substantial disparities: 78% of code blocks showed statistically significant differences in AI-ICD agreement between Black and White patients, 33% between Hispanic and non-Hispanic patients, and 58% between male and female patients.

As illustrated in Figure 4A, among the 118 code blocks exhibiting significant differences in AI-ICD agreement between sexes, the AI-ICD difference was smaller for 67 code blocks (indicated in blue) and greater for 51 code blocks (indicated in red) among females. This suggests a variable pattern of AI performance across sex-based subgroups. Corresponding results comparing the differences between White and Black patients are depicted in Figure 4B, and differences between Hispanic and non-Hispanic patients are shown in Figure 4C.
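The tallies reported above (number of significant blocks, and how many of those show a smaller or larger difference in one group) follow from a simple count over per-block results. The sketch below shows one way to compute them. The tuple layout, the function name, and every value in the toy input are hypothetical and are not taken from the study.

```python
def summarize_blocks(results, alpha=0.05):
    """Summarize per-block comparisons.

    results: list of (block_name, mean_diff_group_a, mean_diff_group_b, p_value)
    """
    significant = [r for r in results if r[3] < alpha]
    return {
        "n_blocks": len(results),
        "n_significant": len(significant),
        "share_significant": len(significant) / len(results),
        "smaller_in_a": sum(1 for r in significant if r[1] < r[2]),
        "larger_in_a": sum(1 for r in significant if r[1] > r[2]),
    }


# Toy per-block results (hypothetical names, scores, and p-values).
toy_results = [
    ("block A", 0.10, 0.15, 0.001),
    ("block B", 0.20, 0.12, 0.030),
    ("block C", 0.08, 0.09, 0.400),
    ("block D", 0.05, 0.11, 0.049),
]
summary = summarize_blocks(toy_results)
print(summary)
```

Applied to the sex comparison in the text, the same arithmetic gives 118 of 203 blocks, or about 58%, split into 67 blocks with a smaller difference and 51 with a larger difference among females.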

Table 4 presents the ICD code blocks with the most significant differences in AI-ICD agreement across the demographic subgroups. For example, across racial groups, White patients had a larger AI-ICD difference score for “Ischemic heart diseases” (0.119 vs. 0.075 for Black patients) and for “Other disorders of ear” (0.070 vs. 0.038). Conversely, Black patients showed a larger difference for “Mental and behavioral disorders due to psychoactive substance use” (0.168 vs. 0.134 for White patients). Among ethnic subgroups, “Chronic lower respiratory diseases” showed a difference of 0.074 for Hispanic patients versus 0.102 for non-Hispanic patients. For sex-based subgroups, “Diabetes mellitus” showed a difference of 0.112 for females versus 0.170 for males.

Table 4: Top ICD Code Blocks with Most Significant Disparities in AI-ICD Agreement Among Race, Ethnicity, and Sex

ICD code block	AI-ICD Difference Score*
Top code blocks for race	Black	White
Ischemic heart diseases	0.075	0.119
Other disorders of ear	0.038	0.070
Mental and behavioral disorders due to psychoactive substance use	0.168	0.134
Disorders of thyroid gland	0.046	0.072
Schizophrenia, schizotypal, delusional, and other non-mood psychotic disorders	0.065	0.043
Top code blocks for Ethnicity	Hispanic	non-Hispanic
Chronic lower respiratory diseases	0.074	0.102
Ischemic heart diseases	0.083	0.111
Diseases of arteries, arterioles and capillaries	0.038	0.053
Anxiety, dissociative, stress-related, somatoform and other nonpsychotic mental disorders	0.192	0.171
Mood [affective] disorders	0.188	0.170
Top code blocks for Sex	Female	Male
Ischemic heart diseases	0.044	0.115
Diseases of male genital organs	0.044	0.106
Diabetes mellitus	0.112	0.170
Noninflammatory disorders of female genital tract	0.041	0.004
Disorders of adult personality and behavior	0.044	0.017

*For each group, the AI-ICD difference score represents the average absolute discrepancy between the AI model’s predicted probabilities and the binary ICD assignments for each code block per visit. The group with the greater AI-ICD difference is bolded.

Discussion

We had previously assessed the reliability of ICD code assignments in the VHA CDW during the transition from ICD-9-CM to ICD-10-CM. Using clusters of equivalent codes derived from the US CDC GEM tables, we found that many of the 687 most-used code clusters (47.0% to 69.9%) had ICD-10-CM assignments that differed significantly from those predicted based on ICD-9-CM codes.⁶ The current study does not aim to correct coding inconsistencies. Without a true gold standard, the absolute truth remains unknown. However, we extracted contextual information surrounding outpatient visits, which is the type of information used by trained professionals to assign codes. After training a race- and sex-agnostic AI model, we compared the AI-predicted code blocks with ICD-based code blocks to measure fairness variations across demographic subgroups. Our findings suggest significant coding discrepancies across demographic groups, affecting over half of all examined code blocks.

Clinical data extracted from EHRs are the cornerstone of observational studies in clinical research.^2,19 Researchers commonly acknowledge that EHR data is inherently “noisy” and invest considerable effort in data preprocessing and cleaning to enhance its quality. However, they typically focus on the main exposure and outcome variables within the context of specific projects. With the rise of AI and the remarkable ability of deep learning models to process large amounts of data, some assume that big data and AI can automatically mitigate errors. This assumption, however, is questionable. As AI models become more complex, often containing tens of thousands to millions, or even billions, of parameters, the adage “garbage in, garbage out” remains valid. Inaccuracies or biases in coding practices compromise the reliability of predictive algorithms and limit the generalizability of research findings across diverse patient populations, potentially skewing clinical and epidemiological insights.

The fairness of AI systems depends to a significant degree on the fairness of the underlying data.²⁰ Biased training data inevitably propagates bias into the resulting models. Our findings demonstrate that AI-ICD agreement varies considerably across specific demographic groups and clinical domains. The observed differential AI-ICD performance underscores the need for further investigation into underlying factors such as disparate clinical documentation or inherent biases influencing model behavior.

A common assumption about AI bias is that model performance tends to favor majority groups, a phenomenon often attributed to the greater data availability for these populations. However, our analysis reveals a more nuanced picture: minority groups do not consistently experience a higher level of disagreement with the AI models. For instance, in Figure 4B, among the 165 code blocks showing significant differences between Black and White patients, AI-predicted scores aligned more closely with assigned ICD codes for Black patients in 80% of these blocks. This finding suggests that AI models can serve as powerful tools to uncover and quantify biases in clinical coding practices, offering a unique lens through which to evaluate fairness.

Clinical Implication

The identification of substantial coding discrepancies across demographic groups highlights the need for increased awareness and training among clinical coders. Coders should be aware that unconscious biases or systemic inconsistencies may exist in code assignments,²¹ potentially impacting the accuracy and fairness of patient records. EHR vendors play a critical role in shaping how coding workflows are implemented. The study suggests that vendors should consider integrating bias-detection tools and real-time coding support features into their platforms to counter coding bias and enable continuous quality assurance.²²

For AI model developers, the findings highlight the critical importance of scrutinizing the fairness of training data. Since AI models trained on biased coding data can perpetuate and even amplify existing disparities,²³ developers should incorporate fairness audits and bias mitigation strategies throughout model development. This includes systematically evaluating model performance across demographic subgroups and applying techniques such as adversarial training, data augmentation, and fairness-aware algorithms.²⁴

Limitations

Although the study was carefully designed to detect coding bias systematically, it has several limitations. First, we did not have a gold standard to validate the AI model’s performance. Establishing a gold standard through chart review for all code blocks is extremely time-consuming. Nevertheless, such a standard would offer a more definitive picture of the ground truth, enabling us to better validate the model’s predictions and assess its reliability. In a separate but related study, we found that using AI-predicted phenotypes consistently led to more accurate predictions of hard outcomes of death or rehospitalization, suggesting that AI-predicted phenotypes themselves are more accurate.¹⁷ In addition, disease prevalence can vary across demographic subgroups due to underlying physiological and genetic differences. As a result, variations in coding may reflect these biological factors rather than human bias. Our model can be improved by incorporating additional features, such as lab tests and genetic variables, to better capture these true underlying differences. Finally, we did not include longitudinal data in our analysis. During our manual review, we observed that some codes appear to be carried over from previous visits. This inheritance may not be evident from the current clinical notes, which makes it difficult for the model to predict the corresponding codes.

Future Directions

In our future research, we aim to enhance model performance by utilizing cloud computing resources to train with larger datasets. We intend to investigate the root causes of coding variability, employing explainable AI techniques to gain deeper insights. Extending this analysis to other healthcare datasets (such as All-of-Us data) and more diverse populations will strengthen the generalizability and utility of our findings. We envision using these insights to develop standardized frameworks for identifying and addressing coding inconsistency.

Acknowledgments

This work was supported by VA HSRD grant 1I21HX003278-01A1, and by AHRQ grant R01 HS28450-01A1.

References

1. Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. May 2 2012;13(6):395–405. doi:10.1038/nrg3208.
2. Casey JA, Schwartz BS, Stewart WF, Adler NE. Using Electronic Health Records for Population Health Research: A Review of Methods and Applications. Annu Rev Public Health. 2016;37:61–81. doi:10.1146/annurev-publhealth-032315-021353.
3. Gottesman O, Kuivaniemi H, Tromp G, et al. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med. Oct 2013;15(10):761–71. doi:10.1038/gim.2013.72.
4. Dixit RA, Boxley CL, Samuel S, Mohan V, Ratwani RM, Gold JA. Electronic Health Record Use Issues and Diagnostic Error: A Scoping Review and Framework. J Patient Saf. Jan 1 2023;19(1):e25–e30. doi:10.1097/PTS.0000000000001081.
5. Doktorchik C, Lu M, Quan H, Ringham C, Eastwood C. A qualitative evaluation of clinically coded data quality from health information manager perspectives. Health Inf Manag. Jan 2020;49(1):19–27. doi:10.1177/1833358319855031.
6. Nelson SJ, Yin Y, Trujillo Rivera EA, et al. Are ICD codes reliable for observational studies? Assessing coding consistency for data quality. Digit Health. Jan-Dec 2024;10:20552076241297056. doi:10.1177/20552076241297056.
7. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. Dec 18 2018;169(12):866–872. doi:10.7326/M18-1990.
8. Char DS, Shah NH, Magnus D. Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. Mar 15 2018;378(11):981–983. doi:10.1056/NEJMp1714229.
9. Green BL, Murphy A, Robinson E. Accelerating health disparities research with artificial intelligence. Front Digit Health. 2024;6:1330160. doi:10.3389/fdgth.2024.1330160.
10. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical Machine Learning in Healthcare. Annu Rev Biomed Data Sci. Jul 2021;4:123–144. doi:10.1146/annurev-biodatasci-092820-114757.
11. Hill EJ, Sharma J, Wissel B, et al. Parkinson’s disease diagnosis codes are insufficiently accurate for electronic health record research and differ by race. Parkinsonism Relat Disord. Sep 2023;114:105764. doi:10.1016/j.parkreldis.2023.105764.
12. Coleman KJ, Stewart C, Waitzfelder BE, et al. Racial-Ethnic Differences in Psychiatric Diagnoses and Treatment Across 11 Health Care Systems in the Mental Health Research Network. Psychiatr Serv. Jul 1 2016;67(7):749–57. doi:10.1176/appi.ps.201500217.
13. Workman TE, Kupersmith J, Ma P, et al. A Comparison of Veterans with Problematic Opioid Use Identified through Natural Language Processing of Clinical Notes versus Using Diagnostic Codes. Healthcare (Basel). Apr 6 2024;12(7). doi:10.3390/healthcare12070799.
14. Wang P, Garza M, Zozus M. Cancer Phenotype Development: A Literature Review. Stud Health Technol Inform. 2019;257:468–472.
15. Sivakumar K, Nithya NS, Revathy O. Phenotype Algorithm based Big Data Analytics for Cancer Diagnose. J Med Syst. Jul 4 2019;43(8):264. doi:10.1007/s10916-019-1409-z.
16. Liao KP, Ananthakrishnan AN, Kumar V, et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS One. 2015;10(8):e0136651. doi:10.1371/journal.pone.0136651.
17. Yin Y, Shao Y, Ma P, Zeng-Treitler Q, Nelson SJ. Machine-Learned Codes from EHR Data Predict Hard Outcomes Better than Human-Assigned ICD Codes. Mach Learn Knowl Extr. Jun 2025;7(2):36. doi:10.3390/make7020036.
18. Shao Y, Zhang S, Raman VK, et al. Artificial intelligence approaches for phenotyping heart failure in U.S. Veterans Health Administration electronic health record. ESC Heart Fail. Jun 14 2024. doi:10.1002/ehf2.14787.
19. Cowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol. Jan 2017;106(1):1–9. doi:10.1007/s00392-016-1025-6.
20. Jindal A. Misguided Artificial Intelligence: How Racial Bias is Built Into Clinical Models. Brown J Hosp Med. 2023;2(1):38021. doi:10.56305/001c.38021.
21. Torres JM, Hessler-Jones D, Yarbrough C, Tapley A, Jimenez R, Gottlieb LM. An online experiment to assess bias in professional medical coding. BMC Med Inform Decis Mak. Jun 20 2019;19(1):115. doi:10.1186/s12911-019-0832-x.
22. Rozier MD, Patel KK, Cross DA. Electronic Health Records as Biased Tools or Tools Against Bias: A Conceptual Model. Milbank Q. Mar 2022;100(1):134–150. doi:10.1111/1468-0009.12545.
23. Perets O, Stagno E, Yehuda EB, et al. Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias. medRxiv. Apr 12 2024. doi:10.1101/2024.04.09.24305594.
24. Hasanzadeh F, Josephson CB, Waters G, Adedinsewo D, Azizi Z, White JA. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. NPJ Digit Med. Mar 11 2025;8(1):154. doi:10.1038/s41746-025-01503-7.

[r1-9502] 1.Jensen PB, Jensen LJ, Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. May 2 2012;13(6):395–405. doi: 10.1038/nrg3208. doi:10.1038/nrg3208. [DOI] [PubMed] [Google Scholar]

[r2-9502] 2.Casey JA, Schwartz BS, Stewart WF, Adler NE. Using Electronic Health Records for Population Health Research: A Review of Methods and Applications. Annu Rev Public Health. 2016;37:61–81. doi: 10.1146/annurev-publhealth-032315-021353. doi:10.1146/annurev-publhealth-032315-021353. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r3-9502] 3.Gottesman O, Kuivaniemi H, Tromp G, et al. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med. Oct 2013;15(10):761–71. doi: 10.1038/gim.2013.72. doi:10.1038/gim.2013.72. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r4-9502] 4.Dixit RA, Boxley CL, Samuel S, Mohan V, Ratwani RM, Gold JA. Electronic Health Record Use Issues and Diagnostic Error: A Scoping Review and Framework. J Patient Saf. Jan 1 2023;19(1):e25–e30. doi: 10.1097/PTS.0000000000001081. doi:10.1097/PTS.0000000000001081. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r5-9502] 5.Doktorchik C, Lu M, Quan H, Ringham C, Eastwood C. A qualitative evaluation of clinically coded data quality from health information manager perspectives. Health Inf Manag. Jan 2020;49(1):19–27. doi: 10.1177/1833358319855031. doi:10.1177/1833358319855031. [DOI] [PubMed] [Google Scholar]

[r6-9502] 6.Nelson SJ, Yin Y, Trujillo Rivera EA, et al. Are ICD codes reliable for observational studies? Assessing coding consistency for data quality. Digit Health. Jan-Dec 2024;10:20552076241297056. doi: 10.1177/20552076241297056. doi:10.1177/20552076241297056. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r7-9502] 7.Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. Dec 18 2018;169(12):866–872. doi: 10.7326/M18-1990. doi:10.7326/M18-1990. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r8-9502] 8.Char DS, Shah NH, Magnus D. Implementing Machine Learning in Health Care - Addressing Ethical Challenges. N Engl J Med. Mar 15 2018;378(11):981–983. doi: 10.1056/NEJMp1714229. doi:10.1056/NEJMp1714229. [DOI] [PMC free article] [PubMed] [Google Scholar]

9. Green BL, Murphy A, Robinson E. Accelerating health disparities research with artificial intelligence. Front Digit Health. 2024;6:1330160. doi:10.3389/fdgth.2024.1330160.

10. Chen IY, Pierson E, Rose S, Joshi S, Ferryman K, Ghassemi M. Ethical Machine Learning in Healthcare. Annu Rev Biomed Data Sci. Jul 2021;4:123–144. doi:10.1146/annurev-biodatasci-092820-114757.

11. Hill EJ, Sharma J, Wissel B, et al. Parkinson’s disease diagnosis codes are insufficiently accurate for electronic health record research and differ by race. Parkinsonism Relat Disord. Sep 2023;114:105764. doi:10.1016/j.parkreldis.2023.105764.

12. Coleman KJ, Stewart C, Waitzfelder BE, et al. Racial-Ethnic Differences in Psychiatric Diagnoses and Treatment Across 11 Health Care Systems in the Mental Health Research Network. Psychiatr Serv. Jul 1 2016;67(7):749–57. doi:10.1176/appi.ps.201500217.

13. Workman TE, Kupersmith J, Ma P, et al. A Comparison of Veterans with Problematic Opioid Use Identified through Natural Language Processing of Clinical Notes versus Using Diagnostic Codes. Healthcare (Basel). Apr 6 2024;12(7). doi:10.3390/healthcare12070799.

14. Wang P, Garza M, Zozus M. Cancer Phenotype Development: A Literature Review. Stud Health Technol Inform. 2019;257:468–472.

15. Sivakumar K, Nithya NS, Revathy O. Phenotype Algorithm based Big Data Analytics for Cancer Diagnose. J Med Syst. Jul 4 2019;43(8):264. doi:10.1007/s10916-019-1409-z.

16. Liao KP, Ananthakrishnan AN, Kumar V, et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS One. 2015;10(8):e0136651. doi:10.1371/journal.pone.0136651.

17. Yin Y, Shao Y, Ma P, Zeng-Treitler Q, Nelson SJ. Machine-Learned Codes from EHR Data Predict Hard Outcomes Better than Human-Assigned ICD Codes. Mach Learn Knowl Extr. Jun 2025;7(2):36. doi:10.3390/make7020036.

18. Shao Y, Zhang S, Raman VK, et al. Artificial intelligence approaches for phenotyping heart failure in U.S. Veterans Health Administration electronic health record. ESC Heart Fail. Jun 14 2024. doi:10.1002/ehf2.14787.

19. Cowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol. Jan 2017;106(1):1–9. doi:10.1007/s00392-016-1025-6.

20. Jindal A. Misguided Artificial Intelligence: How Racial Bias is Built Into Clinical Models. Brown J Hosp Med. 2023;2(1):38021. doi:10.56305/001c.38021.

21. Torres JM, Hessler-Jones D, Yarbrough C, Tapley A, Jimenez R, Gottlieb LM. An online experiment to assess bias in professional medical coding. BMC Med Inform Decis Mak. Jun 20 2019;19(1):115. doi:10.1186/s12911-019-0832-x.

22. Rozier MD, Patel KK, Cross DA. Electronic Health Records as Biased Tools or Tools Against Bias: A Conceptual Model. Milbank Q. Mar 2022;100(1):134–150. doi:10.1111/1468-0009.12545.

23. Perets O, Stagno E, Yehuda EB, et al. Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias. medRxiv. Apr 12 2024. doi:10.1101/2024.04.09.24305594.

24. Hasanzadeh F, Josephson CB, Waters G, Adedinsewo D, Azizi Z, White JA. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. NPJ Digit Med. Mar 11 2025;8(1):154. doi:10.1038/s41746-025-01503-7.

