Applied Clinical Informatics. 2025 Apr 16;16(2):337–344. doi: 10.1055/a-2491-3872

Extracting International Classification of Diseases Codes from Clinical Documentation Using Large Language Models

Ashley Simmons 1, Kullaya Takkavatakarn 2,3, Megan McDougal 1, Brian Dilcher 4, Jami Pincavitch 5, Lukas Meadows 6, Justin Kauffman 7,8, Eyal Klang 7,8, Rebecca Wig 9, Gordon Smith 10, Ali Soroush 7,8,11, Robert Freeman 8, Donald J Apakama 12, Alexander W Charney 13,14,15,16, Roopa Kohli-Seth 17, Girish N Nadkarni 2,7,8, Ankit Sakhuja 7,8,17
PMCID: PMC12020521  PMID: 39608761

Abstract

Background  Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

Objective  The primary objective was to evaluate the performance of different LLMs in ICD-10-CM code extraction and compare it with that of a human coder.

Methods  We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes of authentic patient cases from the American Health Information Management Association VLab for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.

Results  Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted a significantly higher number of unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from −0.02 to 0.01. When focusing on the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in extraction of codes varied among LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

Conclusion  Current LLMs have poor performance in extraction of ICD-10-CM codes from inpatient notes when compared against the human coder.

Keywords: ICD-10, artificial intelligence, billing, human–computer interaction, billing systems

Background and Significance

Medical coding is an important part of the United States healthcare system in the 21st century. Healthcare organizations hire and train a substantial workforce proficient in abstracting medical codes from clinical records. 1 This workforce then supports and submits claims for reimbursement adhering to regulatory requirements, handles insurance denials, supports research endeavors, aids public health surveillance, and ensures a faithful representation of a patient's medical history in the electronic health records (EHRs). 2 3 International Classification of Diseases (ICD) codes developed by the World Health Organization (WHO) are used to document specific diagnostic and procedural information such as medical history, surgical history, and problem lists. The ICD codes are currently in their tenth revision (ICD-10). 3 ICD-10-Clinical Modification (ICD-10-CM) is a variant of ICD-10 adopted by the United States government that adds detail to the ICD-10 codes developed by the WHO and contains around 68,000 diagnosis codes. 3

Computerized assistive coding (CAC) technologies are currently used to improve the workflow of medical coding professionals. The American Health Information Management Association (AHIMA) defines CACs as “computer software that automatically generates a set of medical codes for review, validation, and use based upon provider clinical documentation.” 4 Their performance, however, is still far below that of medical coding professionals. 5 6 These CACs are thus used as semi-automated processes that augment human workflows. 4

Recent studies have investigated the use of CACs enhanced with natural language processing (NLP) to improve coding efficiency. However, these systems often fall short in handling the nuanced, heterogeneous, and ambiguous medical terminology typical of clinical documentation. A notable challenge is the system's misinterpretation of conditional language or hypothetical scenarios, such as instructions advising patients to return if specific symptoms arise. Rather than recognizing these as hypothetical, CACs frequently code the mentioned symptoms as active conditions, mistakenly assuming the patient currently experiences them, highlighting the need for more sophisticated language models, like large language models (LLMs), that can better interpret context and improve accuracy. 7

With the advent of LLMs, there are new opportunities for further refinement of CACs. Recent studies have shown superior performance of LLMs in many tasks such as answering United States Medical Licensing Examination (USMLE) questions 8 or taking the Bar Exam. 9 Recent literature has also highlighted LLMs' ability to comprehend and synthesize complex medical texts, enabling them to effectively extract relevant information from clinical narratives. 10 This has generated considerable interest in exploring the use of LLMs to enhance coding processes by improving accuracy in identifying diagnoses and procedures from free-text notes. However, recent literature indicates that LLMs face notable challenges; for example, they often struggle to generate accurate diagnoses when provided with ICD codes 11 or to produce billing codes based on code descriptions. 12 Despite their promise, LLMs have yet to undergo rigorous evaluation in the highly specialized task of extracting ICD-10-CM codes directly from patient notes, raising important questions about the reliability and accuracy of these models in clinical coding tasks.

Objectives

Our primary objective in the present study was to assess the performance of LLMs in extraction of ICD-10-CM codes against a human coder. Our secondary objectives were to assess the performance of LLMs in extraction of category ICD-10-CM codes and ICD-10-CM codes for primary diagnoses, and to identify the reasons for discrepancies between the various LLMs and the human coder in extraction of ICD-10-CM codes.

Methods

For this study, we evaluated six available LLMs, namely, GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b. With permission from the American Health Information Management Association (AHIMA), we used deidentified patient notes of authentic patient cases from the AHIMA VLab 13 for this study. The notes were selected at random from an overall pool of 88 notes from the AHIMA VLab. AHIMA VLab is a virtual practice environment for health information education. It includes deidentified authentic patient charts that are used by students for coding exercises, chart analysis, general orientation to medical record forms, and indexing. 13 AHIMA inpatient authentic patient cases comprise deidentified patient encounters, and we used inpatient notes that included both history and physical notes and progress notes. The AHIMA authentic patient cases can be accessed through My AHIMA Learning Center, an online portal, 13 at https://myahima.brightspace.com/d2l/home/6681 .

These notes were presented to both the LLMs and the human coder (AS) for extraction of ICD-10-CM codes. The human coder extracted ICD-10-CM codes for billing purposes as per the current standard of practice. In addition to mastery-level certification, the coder has 11 years of practical experience in medical coding and serves as an Assistant Professor for undergraduate students in a Health Informatics and Information Management Program, specializing in medical coding. We used the “random” module of Python 3.8.3 to assign a random number to each patient note, ensuring the notes were blinded from the human coder. We used a standardized prompt that was consistent across all LLMs: “Please code the following note using the ICD-10 CM inpatient guidelines from 2022.” The human coder separately analyzed the same notes and extracted ICD-10-CM code(s) using 3M Encoder, 14 as is the standard practice. The applicable 2022 ICD-10-CM Official Coding Guidelines specific to inpatient settings 3 were applied to each note. Each note was evaluated individually to ensure that the assigned codes adhered to the official coding guidelines. To evaluate inter-coder reliability, another human coder (MM) extracted ICD-10-CM codes in a random 20% subset of patients. We further performed an exploratory analysis on a randomly chosen subset of 10% of patient notes to identify reasons for discrepancy between the human coder and the LLMs in extraction of ICD-10-CM codes. The selection of a 10% subset aligns with previous literature. 15 Details of the statistical analysis are provided in the Supplementary Materials (available in the online version).
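The sampling steps described above can be sketched with Python's standard `random` module. This is only an illustration under stated assumptions: the note identifiers, the seeds, the range of study numbers, and the helper names `blind_notes` and `random_subset` are hypothetical and are not taken from the study.

```python
import random

def blind_notes(note_ids, seed=None):
    """Map each note to a unique random study number (hypothetical helper)."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no two notes share a number.
    study_numbers = rng.sample(range(1000, 10000), k=len(note_ids))
    return dict(zip(note_ids, study_numbers))

def random_subset(items, fraction, seed=None):
    """Draw a random subset holding the given fraction of items (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(list(items), k)

# Hypothetical identifiers standing in for the pool of 88 notes.
pool = [f"note_{i:02d}" for i in range(1, 89)]
selected = random.Random(2022).sample(pool, 50)              # 50 notes chosen at random
blinded = blind_notes(selected, seed=2022)                   # note -> random study number
reliability_subset = random_subset(selected, 0.20, seed=1)   # 20% for the second coder
discrepancy_subset = random_subset(selected, 0.10, seed=2)   # 10% for discrepancy review
```

Fixing a seed, as done here, makes the selection reproducible; whether the study fixed a seed is not stated.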

Results

We included 50 patient notes, including 23 history and physical notes and 27 progress notes. The human coder extracted a total of 165 unique ICD-10-CM codes, with good concordance between the two human coders (kappa = 0.86). As shown in Fig. 1A, the number of unique ICD-10-CM codes extracted by LLMs varied from 221 for Gemini Advanced to 658 for Llama 2-70b. There was a significant difference in the number of unique ICD-10-CM codes extracted across the LLMs (p < 0.001). This variation in the number of unique codes extracted highlights differences in how each model interprets medical text, likely influenced by nuances embedded in their training data and methodologies. The median (IQR) number of ICD-10-CM codes extracted by the human coder was 4 (2–6). Among the LLMs, the median (IQR) numbers of ICD-10-CM codes extracted were as follows: GPT-3.5: 7 (4–10); GPT-4: 6 (4–8); Claude 2.1: 6 (4–8); Claude 3: 8 (6–10); Gemini Advanced: 5 (5–7); and Llama 2-70b: 11 (7–21).
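As a small illustration of the summary statistic reported above, the median and interquartile range of per-note code counts can be computed with the standard library. The counts below are made up for the example, and the quartile method (`inclusive`) is an assumption; the study does not state which method was used.

```python
import statistics

def median_iqr(counts):
    """Return (median, (Q1, Q3)) for a list of per-note code counts."""
    q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return q2, (q1, q3)

# Hypothetical per-note code counts, not study data.
example_counts = [2, 3, 4, 4, 5, 6, 7]
summary = median_iqr(example_counts)
```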

Fig. 1.

Number of ICD-10-CM codes identified ( A ). Percentage agreement between individual large language models (LLMs) and human coder in extraction of ICD-10-CM codes ( B ).

Performance with ICD-10-CM Codes

GPT-4 achieved the highest percent agreement with the human coder for ICD-10-CM code extraction among the LLMs at 15.2%, followed by Claude 3 (12.7%), GPT-3.5 (12.4%), Gemini Advanced (12.2%), Claude 2.1 (9.9%), and Llama 2-70b (1.4%; Fig. 1B). Llama 2-70b extracted the highest number of unique ICD-10-CM codes (658), but this extensive range likely contributed to its comparatively poorer performance. This over-inclusiveness suggests that the model may have interpreted nuances in the clinical text too broadly, resulting in identification of extraneous or less relevant codes. The Cohen's kappa values were poor, ranging from −0.02 to 0.01, suggesting minimal to no agreement between the LLMs and the human coder (Table 1). The precision and recall of the LLMs were also poor. Claude 3 had the highest recall (0.36), while GPT-4 achieved the highest precision (0.21). The reasons for discrepancies in ICD-10-CM codes extracted by LLMs versus those extracted by the human coder are shown in Table 2. GPT-4 had the lowest percentage of discrepancies due to coding for signs and symptoms when a more specific diagnosis code was available (3%). This suggests that GPT-4 prioritized relevant codes over more general codes. Most discrepancies for GPT-4 were due to its extraction of codes for diagnoses not confirmed by providers (60%), followed by hallucinations (21%). In contrast, Llama 2-70b, which had the poorest performance, produced the highest number of discrepant codes (63). Most of its discrepancies were due to coding for unconfirmed diagnoses (54%) and use of nonspecific codes when more specific options were available (19%). Other models showed mixed results, each with distinct areas of difficulty. For example, Claude 3 exhibited a high discrepancy rate related to unconfirmed diagnoses (57%) and hallucinations (18%), suggesting it may face challenges in accurately interpreting clinical notes. Similarly, Gemini Advanced had a high rate of missed diagnoses (40%), potentially reflecting limitations in recognizing specific details essential for precise coding. Subgroup analysis of history and physical notes, as well as progress notes, revealed consistent results in percent agreement between the LLMs and the human coder (Fig. 1B and Table 1). When focusing solely on the primary diagnosis, Claude 3 yielded a percent agreement of 26% and a kappa value of 0.25, followed by Claude 2.1 (percent agreement 20% and kappa 0.20) and GPT-4 (percent agreement 18% and kappa 0.17; Supplementary Table S1, available in the online version).
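To make the metrics in this section concrete, the sketch below computes precision, recall, and percent agreement for a single note from two sets of codes. The exact definitions used in the study are given in its Supplementary Materials, so the formulas here are assumptions: in particular, percent agreement is taken to be the share of matched codes among all codes assigned by either coder. The function name and the example codes are illustrative only.

```python
def compare_code_sets(human_codes, model_codes):
    """Compare two sets of ICD-10-CM codes, treating the human coder as reference."""
    human, model = set(human_codes), set(model_codes)
    matched = human & model
    union = human | model
    return {
        # Share of model-assigned codes that the human coder also assigned.
        "precision": len(matched) / len(model) if model else 0.0,
        # Share of human-assigned codes that the model recovered.
        "recall": len(matched) / len(human) if human else 0.0,
        # Assumed definition: matched codes over all codes assigned by either coder.
        "percent_agreement": 100 * len(matched) / len(union) if union else 0.0,
    }

# Illustrative codes only; not taken from the study notes.
human = {"J06.9", "I10", "E11.9"}
model = {"J06.9", "I10", "R05.9", "R07.9", "R53.83"}
result = compare_code_sets(human, model)
```

In this example, two of the five model codes match (precision 0.40) and two of the three human codes are recovered (recall 0.67), while the three extra symptom codes pull percent agreement down to about 33%.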

Table 1. Performance of LLMs for the ICD-10-CM code extraction compared to certified coding specialist.

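Model | Metric | All notes | History and physical notes | Progress notes
GPT-3.5 | Kappa | 0.01 | 0.02 | −0.01
GPT-3.5 | Precision | 0.17 | 0.18 | 0.16
GPT-3.5 | Recall | 0.31 | 0.30 | 0.31
GPT-4 | Kappa | 0.01 | 0.02 | −0.01
GPT-4 | Precision | 0.21 | 0.23 | 0.19
GPT-4 | Recall | 0.35 | 0.37 | 0.33
Claude 2.1 | Kappa | −0.02 | −0.02 | −0.01
Claude 2.1 | Precision | 0.15 | 0.16 | 0.14
Claude 2.1 | Recall | 0.22 | 0.21 | 0.24
Claude 3 | Kappa | −0.01 | 0.01 | −0.02
Claude 3 | Precision | 0.16 | 0.19 | 0.14
Claude 3 | Recall | 0.36 | 0.40 | 0.32
Gemini Advanced | Kappa | −0.02 | −0.02 | 0.01
Gemini Advanced | Precision | 0.19 | 0.21 | 0.18
Gemini Advanced | Recall | 0.25 | 0.27 | 0.24
Llama 2-70b | Kappa | −0.02 | 0.01 | −0.02
Llama 2-70b | Precision | 0.02 | 0.03 | 0.01
Llama 2-70b | Recall | 0.07 | 0.08 | 0.06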

Abbreviation: LLMs, large language models.
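For readers less familiar with Cohen's kappa, the following sketch shows the standard two-rater calculation when each candidate code is treated as a binary decision (assigned or not assigned). How the study defined the set of candidate codes is described in its Supplementary Materials; the `universe` argument here is an assumption made for illustration, as are the function name and the example values.

```python
def cohens_kappa(human_codes, model_codes, universe):
    """Cohen's kappa for two raters making assigned/not-assigned decisions per code."""
    universe = set(universe)
    human = set(human_codes) & universe
    model = set(model_codes) & universe
    n = len(universe)
    both = len(human & model)
    neither = n - len(human | model)
    p_observed = (both + neither) / n
    p_human, p_model = len(human) / n, len(model) / n
    # Agreement expected by chance if the two raters assigned codes independently.
    p_expected = p_human * p_model + (1 - p_human) * (1 - p_model)
    if p_expected == 1:
        return 0.0  # degenerate case: chance agreement is already perfect
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example with ten candidate codes labeled A to J (not study data).
candidates = list("ABCDEFGHIJ")
kappa = cohens_kappa({"A", "B", "C"}, {"A", "B", "D", "E"}, candidates)
```

In the toy example, observed agreement is 0.70 and chance agreement is 0.54, giving a kappa of about 0.35. Values near zero, as in Table 1, indicate agreement no better than chance.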

Table 2. Reasons for discrepancy in extracted ICD-10-CM codes between individual LLMs when compared against the human coder (evaluated in 10% random subset).

Issue | GPT-3.5 | GPT-4 | Claude 2.1 | Claude 3 | Gemini Advanced | Llama 2-70b
Coding for symptoms, signs of conditions despite the presence of a more specific diagnosis code | 4 (20%) | 1 (3%) | 5 (22%) | 4 (12%) | 1 (4%) | 10 (16%)
Coding for diagnoses not confirmed by providers | 7 (35%) | 17 (60%) | 6 (26%) | 19 (57%) | 5 (20%) | 34 (54%)
Use of nonspecific codes | 5 (25%) | 1 (3%) | 1 (4%) | 3 (10%) | 2 (8%) | 12 (19%)
Hallucinations | 2 (10%) | 6 (21%) | 8 (35%) | 6 (18%) | 7 (28%) | 6 (10%)
Missed diagnosis | 2 (10%) | 3 (10%) | 3 (13%) | 1 (3%) | 10 (40%) | 1 (1%)
Total discrepant codes | 20 | 28 | 23 | 33 | 25 | 63

Abbreviation: LLMs, large language models.

The complexity of clinical notes was categorized into three groups based on tertiles of the number of ICD-10-CM codes extracted by human coders: low (<3 codes), medium (3–6 codes), and high (>6 codes). There was no significant difference in Cohen's kappa for the LLMs across cases of varying complexity ( Supplementary Table S2 , available in the online version).
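The complexity grouping described above amounts to a simple threshold rule on the number of codes the human coder assigned to a note. A minimal sketch, with a hypothetical function name:

```python
def complexity_group(n_human_codes):
    """Assign a note to a complexity group using the cut-offs stated in the text."""
    if n_human_codes < 3:
        return "low"      # fewer than 3 codes
    if n_human_codes <= 6:
        return "medium"   # 3 to 6 codes
    return "high"         # more than 6 codes

groups = [complexity_group(n) for n in (2, 3, 6, 7)]
```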

Performance with Category ICD-10-CM Codes

There were 146 unique category ICD-10-CM codes extracted by the human coder (Fig. 2A). GPT-4 achieved the highest percent agreement at 26.4%, followed by GPT-3.5 (23.6%), Claude 3 (21.3%), Claude 2.1 (20.8%), Gemini Advanced (20.6%), and Llama 2-70b (10%; Fig. 2B). Cohen's kappa values were again poor, ranging from −0.01 to 0.03, suggesting minimal to no agreement between the LLMs and the human coder (Table 3). When focusing on the primary diagnosis, Claude 2.1 and Claude 3 achieved the best performance with a percent agreement of 36% and a kappa value of 0.35, followed by GPT-4 (percent agreement 34%, kappa 0.33) and Gemini Advanced (percent agreement 30%, kappa 0.31; Supplementary Table S3, available in the online version).

Fig. 2.

Number of category ICD-10-CM codes identified ( A ). Percentage agreement between individual large language models (LLMs) and human coder in extraction of category ICD-10-CM codes ( B ).
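A category-level comparison can be sketched by reducing each full ICD-10-CM code to its three-character category (the part before the decimal point) before comparing sets. The helper name is hypothetical, and the study's own preprocessing is not described in the text, so this is an assumption about one reasonable way to do it.

```python
def category_code(code):
    """Reduce a full ICD-10-CM code to its three-character category."""
    return code.strip().upper().replace(".", "")[:3]

# Two different full codes can collapse into the same category.
full_codes = ["E87.0", "E87.1", "R53.83", "I10"]
categories = sorted({category_code(c) for c in full_codes})
```

Because distinct full codes can share a category, a model that picks the wrong subcode can still match at the category level, which is consistent with the higher agreement seen for category codes.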

Table 3. Performance of LLMs for the category ICD-10-CM code extraction compared to certified coding specialist.

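Model | Metric | All notes | History and physical notes | Progress notes
GPT-3.5 | Kappa | 0.02 | 0.01 | −0.01
GPT-3.5 | Precision | 0.31 | 0.29 | 0.32
GPT-3.5 | Recall | 0.51 | 0.52 | 0.50
GPT-4 | Kappa | 0.03 | 0.01 | 0.02
GPT-4 | Precision | 0.34 | 0.35 | 0.33
GPT-4 | Recall | 0.55 | 0.56 | 0.55
Claude 2.1 | Kappa | −0.01 | 0.01 | −0.02
Claude 2.1 | Precision | 0.29 | 0.32 | 0.26
Claude 2.1 | Recall | 0.43 | 0.41 | 0.44
Claude 3 | Kappa | −0.01 | −0.01 | 0.01
Claude 3 | Precision | 0.26 | 0.30 | 0.22
Claude 3 | Recall | 0.53 | 0.58 | 0.49
Gemini Advanced | Kappa | −0.01 | −0.01 | 0.02
Gemini Advanced | Precision | 0.31 | 0.35 | 0.26
Gemini Advanced | Recall | 0.39 | 0.44 | 0.34
Llama 2-70b | Kappa | −0.01 | 0.01 | −0.01
Llama 2-70b | Precision | 0.13 | 0.15 | 0.12
Llama 2-70b | Recall | 0.29 | 0.30 | 0.28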

Abbreviation: LLMs, large language models.

Discussion

In this study, we benchmarked the performance of LLMs in extracting ICD-10-CM codes from narrative documentation in patient charts. We conducted this evaluation by comparing the performance of these models against that of a human coder. The LLMs evaluated in this study included GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b. We found that all evaluated LLMs had poor concordance, precision, and recall in extraction of ICD-10-CM codes when compared to the human coder. GPT-4, however, achieved the best performance in extraction of both ICD-10-CM codes and category ICD-10-CM codes. When focusing only on the primary diagnosis, Claude 3 showed the best performance across extraction of both ICD-10-CM codes and category ICD-10-CM codes. The higher rates of agreement in extracting category ICD-10-CM codes compared to granular ICD-10-CM codes likely stem from the broader scope of category codes, which require less specificity and detail. These three-character codes represent general diagnostic categories, making it easier for models to align them with clinical documentation. However, while this reduced complexity contributes to improved agreement, the need for adequate specificity still leads to poor overall performance, as evidenced by the low percent agreement and Cohen's kappa values across all models. We further evaluated the reasons for discrepancy in extraction of ICD-10-CM codes by individual LLMs when compared against the human coder.

Since the introduction of GPT-3.5, there has been steady interest in exploring the capabilities of LLMs in various areas. Recent studies have shown that GPT-3.5, one of the first LLMs available, achieved a passing score in the USMLE 8 and passed two portions of the Bar Exam, evidence and torts. 9 These are complex examinations that are specific to their professional fields: the USMLE for medicine and the Bar Exam for law. USMLE questions span a diverse range of topics in medicine that include clinical medicine, basic science, and bioethics. Similarly, passing a Bar Exam requires an in-depth understanding of the law and legal language. The fact that an LLM, which has not been trained specifically for this purpose, can perform so well in such specific professional examinations has led to considerable excitement about the potential of such models.

The LLMs, however, fail to replicate similar performance for more specific tasks. For example, in a study that evaluated the ability of GPT-3.5 to answer questions related to the field of nephrology, the results were much less impressive, with an accuracy rate of only 51%. 16 The authors used questions from the Nephrology Self-Assessment Program and the Kidney Self-Assessment Program. 17 18 Both of these resources are used to enhance and refresh clinical knowledge in the field of nephrology and to prepare for the American Board of Internal Medicine Nephrology Board Examination. This accuracy was well below the passing thresholds of 75% for the Nephrology Self-Assessment Program and 76% for the Kidney Self-Assessment Program. Another recent study that evaluated the performance of LLMs on the Nephrology Self-Assessment Program and the Kidney Self-Assessment Program found that GPT-4 achieved a much better performance with 73.3% correct answers, 19 although this was still below the passing threshold. Performance of Claude 2 and Llama was much worse, with only 54.4 and 30.6% correct responses, respectively. Another study that evaluated GPT-3.5's performance on questions from similar resources but with a focus on glomerular diseases, a group of highly specific kidney diseases, found that GPT-3.5's accuracy dropped further to 45%. 16 LLMs have shown similar suboptimal performance in self-assessment tests designed for other specialties such as gastroenterology, 20 ophthalmology, 21 and urology. 22 Thus, it seems that even though LLMs may perform well on general professional examinations, they do not perform well when more specific knowledge of the field is required.

It is therefore not surprising that LLMs in our study were unable to perform well in the highly specialized task of extracting ICD-10-CM codes from inpatient notes. The training required to become a medical coder is complex and includes a comprehensive education in medical terminology, pathophysiology, anatomy, and pharmacology, in addition to the coding terminology itself. Coders must learn to parse medical records and tease out the right diagnostic codes, while separating out the verbiage that discusses symptomatology or warning signs. The task therefore requires an in-depth understanding of the ICD-10-CM system and clinical documentation, and a strong command of the English language.

Our study highlights the limitations of LLMs in extracting ICD-10-CM codes from inpatient notes. While the human coder extracted a total of 165 unique ICD-10-CM codes, the number of unique ICD-10-CM codes extracted by each of the LLMs was much higher. Gemini Advanced extracted 221 ICD-10-CM codes, the fewest among the LLMs studied. Claude 2.1 was next and extracted 238 ICD-10-CM codes. This was followed by 268 ICD-10-CM codes with GPT-4, 305 with GPT-3.5, 332 with Claude 3, and finally 658 ICD-10-CM codes with Llama 2-70b, the highest number of all. As shown in Table 2, there were multiple reasons for this discrepancy. Some of these codes resulted from the inability of individual LLMs to distinguish symptom codes from diagnosis codes as established in the ICD-10-CM Official Coding Guidelines. According to the guidelines, codes for conditions and signs or symptoms falling within categories R00-R94 should only be used when a more specific diagnosis cannot be made even after all the facts bearing on the case have been investigated, or when a more precise diagnosis was not available for any other reason. 3 For example, in a case where the patient presented with chest pain, cough, and fatigue but was diagnosed with upper respiratory infection, the codes for chest pain, cough, and fatigue were also extracted by one of the LLMs. Because there was a precise diagnosis code for the upper respiratory infection, the sign and symptom codes were not necessary, and their inclusion inflated the code count.
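The symptom-code rule described above lends itself to a simple screening check. The sketch below flags codes in the R00-R94 range (the range stated in the text) whenever the same note also carries at least one code outside that range. It is a deliberately simplified assumption, not the guideline logic itself and not a method used in the study: the official guidelines allow symptom codes in some circumstances, so flagged codes would need human review. The function names are hypothetical.

```python
import re

def is_symptom_code(code):
    """True if the code falls in categories R00-R94 (range as stated in the text)."""
    match = re.fullmatch(r"R(\d{2})(\.\w+)?", code.strip().upper())
    return bool(match) and 0 <= int(match.group(1)) <= 94

def flag_symptom_codes_for_review(codes):
    """Return symptom codes that co-occur with at least one non-symptom code."""
    symptom = [c for c in codes if is_symptom_code(c)]
    has_other_diagnosis = any(not is_symptom_code(c) for c in codes)
    return symptom if has_other_diagnosis else []

# Illustrative codes mirroring the example in the text: an upper respiratory
# infection coded alongside chest pain, cough, and fatigue.
flagged = flag_symptom_codes_for_review(["J06.9", "R07.9", "R05.9", "R53.83"])
```

When only symptom codes are present, nothing is flagged, since in that situation a symptom code may be the most specific option available.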

The LLMs at times also failed to accurately identify all secondary diagnoses or assigned additional diagnoses without supporting clinical documentation. For example, the LLM identified elevated sodium levels listed within the lab results and assigned diagnosis code E87.0 (hyperosmolality and hypernatremia) without any corresponding physician documentation to validate the diagnosis. This is an example of the LLM disregarding the coding guidelines outlined in Section I. A. 19, which emphasize that diagnosis codes should be assigned solely based on the diagnostic statements provided by the healthcare provider within the notes, rather than on the clinical criteria used by the provider to establish the diagnosis (e.g., lab values). 3 As shown in Table 2, there were also instances of hallucinations, where the LLM coded diagnoses not present anywhere in the note, and of use of nonspecific codes. For example, in one case the LLM extracted the ICD-10-CM code “Z63.0 - Problems in relationship with spouse or partner” from a note regarding a 3-year-old pediatric patient, despite no documentation suggesting any relationship or partner-related issues.

The identified trend in the LLM code assignments sequencing further suggests that the systems arranged the codes based on numerical order as abstracted directly from the clinical notes provided, rather than prioritizing the codes based on hierarchical coding guidance. This also emphasizes the limitations in LLMs' understanding of the hierarchy involved in coding sequencing.

Our results are consistent with prior literature showing mediocre performance of LLMs when working with ICD codes. Spark NLP, a much smaller NLP model, has shown much better performance in extraction of ICD-10-CM codes than GPT-3.5 and GPT-4. 23 In comparison to a success rate of 76% achieved by Spark NLP, the overall accuracies of GPT-3.5 and GPT-4 were only 26 and 36%, respectively. Although LLMs are adept at generating natural text and understanding broad contexts, specialized models like Spark NLP, which are optimized for specific tasks and trained on healthcare-specific data, may achieve greater accuracy in domain-specific tasks. Despite this, there is unique value in exploring the capabilities of general-purpose LLMs in this domain, as LLMs bring the versatility to adapt across multiple tasks and coding contexts, potentially reducing the need for multiple, highly specific models. Recent literature has shown that LLMs struggle to generate diagnoses when provided with ICD codes. 11 A recent study showed that LLMs have difficulty in generating billing codes when provided with code descriptions. 12 Among GPT-3.5, GPT-4, Gemini Advanced, and Llama 2-70b, this study found that GPT-4 had the best performance in generating ICD-10-CM codes when provided with code descriptions. The performance was still poor, with a match rate of only 33.9%.

We found similar results in this study using unstructured clinical notes as input for LLMs where GPT-4 had the best, albeit still poor, performance in extraction of ICD-10-CM codes when compared against the human coder. Our current work builds systematically on the evolving literature on LLMs and benchmarks their performance against that of the human coder. As human coders are used by hospitals for extraction of ICD codes as current standard of practice, this study provides an effective benchmark for future LLM research in this highly specialized area. Our study further identifies reasons for discrepancies in the outputs of LLMs, highlighting potential avenues for future research for effective integration of LLMs in clinical coding.

Our findings suggest that LLMs are not yet fully reliable for independent use in healthcare coding workflows. For healthcare organizations considering integration, a hybrid approach that combines LLMs with human oversight may offer a practical pathway, enhancing coding accuracy while minimizing risks. Looking ahead, there are promising avenues for enhancing LLMs in clinical coding tasks. Advanced prompt engineering and integration of retrieval-augmented generation techniques could tailor responses more effectively to the specific nuances of clinical documentation, thereby improving accuracy. Incorporating these methods into a hybrid workflow may help organizations gradually implement LLMs in a way that balances innovation with quality assurance.

Although our study provides important insights into the performance of LLMs for extraction of ICD-10-CM codes, it is important to interpret these results while understanding the limitations of the study. We only investigated extraction of ICD-10-CM codes based on inpatient notes and the results are therefore not generalizable to extraction of ICD-10-CM codes based on outpatient notes or to extraction of ICD-10 procedure codes. As our goal was to benchmark the performance of available LLMs to that of the human coder, we used LLMs at their default hyperparameter settings and a standardized prompt to generate responses. It is important to acknowledge that utilization of different prompts can elicit differing responses. Our study also does not utilize retrieval-augmented generation or fine-tuning, which could potentially further enhance the performance of LLMs.

Conclusion

In conclusion, our study benchmarks the performance of LLMs in the highly specialized task of extraction of ICD-10-CM codes from inpatient notes, against the human coder. Although GPT-4 exhibited the highest overall performance in ICD-10-CM code extraction, it still fell short. These findings suggest that integrating LLMs into coding practices in the near term should involve hybrid workflows that combine machine suggestions with human oversight to enhance accuracy. Additionally, the findings underscore the importance of targeted strategies in LLM development, such as advanced prompt engineering, retrieval-augmented generation, and model fine-tuning, to further refine LLM performance and inform future integration into healthcare coding practices.

Clinical Relevance Statement

Although large language models (LLMs) have shown superior performance in various tasks, our study shows that LLMs perform poorly in the highly specific task of extracting ICD-10-CM codes from patient notes when compared against a human coder. This poor performance is due to a variety of factors such as inappropriate coding for signs and symptoms, coding for diagnoses not confirmed by providers, hallucinations, use of nonspecific codes, and missed diagnoses. These findings underscore the need for further refinement of LLMs to enhance their accuracy in clinical coding tasks.

Multiple-Choice Questions

  1. Which large language model extracted the highest number of unique ICD-10-CM codes?

    a. GPT-3.5

    b. GPT-4

    c. Claude 2.1

    d. Llama 2-70b

    Correct Answer : The correct answer is option d. Llama 2-70b extracted the highest number of unique ICD-10-CM codes.

  2. Which large language model had the highest percent agreement with the human coder for ICD-10-CM code extraction?

    a. Claude 3

    b. GPT-3.5

    c. GPT-4

    d. Gemini Advanced

    Correct Answer : The correct answer is option c. GPT-4 had the highest percent agreement with the human coder for ICD-10-CM code extraction.

Acknowledgment

The authors thank AHIMA VLab for granting permission to use the deidentified inpatient notes for the study.

Funding Statement

This study was funded by the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases NIH/NIDDK grant (grant no.: K08DK131286).

Conflict of Interest: G.N.N. is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, Siemens Healthineer, and Variant Bio, and serves as a scientific advisory board member for Renalytix and Pensieve. He also has equity in Renalytix, Pensieve, and Verici. All remaining authors have declared no conflicts of interest.

Protection of Human and Animal Subjects

Human and animal subjects were not included in the project.

# Equal contribution as first author.

* Equal contribution as senior author.

Supplementary Material

10-1055-a-2491-3872-s202406ra0183.pdf (38.8KB, pdf)

References

  • 1. Zippia. Medical Biller Coder Demographics and Statistics in the US. Accessed May 25, 2024 at: https://www.zippia.com/medical-biller-coder-jobs/demographics/
  • 2. AHIMA. Certified Coding Specialist. Accessed May 25, 2024 at: https://www.ahima.org/certification-careers/certification-exams/ccs/
  • 3. Centers for Medicare & Medicaid Services. ICD-10-CM Official Guidelines for Coding and Reporting. Updated 2022. Accessed May 25, 2024 at: https://www.cms.gov/files/document/fy-2022-icd-10-cm-coding-guidelines-updated-02012022.pdf
  • 4. Campbell S, Giadresco K. Computer-assisted clinical coding: a narrative review of the literature on its benefits, limitations, implementation and impact on clinical coding professionals. HIM J. 2020;49(01):5–18. doi: 10.1177/1833358319851305.
  • 5. Stanfill M H, Marc D T. Health information management: implications of artificial intelligence on healthcare data and information management. Yearb Med Inform. 2019;28(01):56–64. doi: 10.1055/s-0039-1677913.
  • 6. Nguyen A N, Truran D, Kemp M et al. Computer-assisted diagnostic coding: effectiveness of an NLP-based approach using SNOMED CT to ICD-10 mappings. AMIA Annu Symp Proc. 2018;2018:807–816.
  • 7. Perera S, Sheth A, Thirunarayan K et al. Challenges in understanding clinical notes. Proceedings of the 2013 International Workshop on Data Management & Analytics for Healthcare - DARE '13. 2013.
  • 8. USMLE. United States Medical Licensing Examination. Accessed December 8, 2023 at: USMLE.org
  • 9. Bommarito M J, Katz D M. GPT Takes the Bar Exam. SSRN. 2022.
  • 10. Van Veen D, Van Uden C, Blankemeier L et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30(04):1134–1142. doi: 10.1038/s41591-024-02855-5.
  • 11. Lee S A, Timothy L. Do large language models understand medical codes? arXiv. 2024. doi: 10.48550/arXiv.2403.10822.
  • 12. Soroush A, Glicksberg B S, Zimlichman E et al. Large language models are poor medical coders—benchmarking of medical code querying. NEJM AI. 2024;1(05):AIdbp2300040.
  • 13. AHIMA VLab. 2023. Accessed June 26, 2023 at: https://myahima.brightspace.com/
  • 14. 3M. AHIMA; 2023. Accessed June 26, 2023 at: https://myahima.brightspace.com
  • 15. van Melle M A, Zwart D LM, Poldervaart J M et al. Validity and reliability of a medical record review method identifying transitional patient safety incidents in merged primary and secondary care patients' records. BMJ Open. 2018;8(08):e018576. doi: 10.1136/bmjopen-2017-018576.
  • 16. Miao J, Thongprayoon C, Cheungpasitporn W. Assessing the accuracy of ChatGPT on core questions in glomerular disease. Kidney Int Rep. 2023;8(08):1657–1659. doi: 10.1016/j.ekir.2023.05.014.
  • 17. American Society of Nephrology. Kidney Self-Assessment Program. Accessed at: https://www.asn-online.org/education/ksap/
  • 18. nephSAP. Nephrology Self-Assessment Program. Accessed at: https://nephsap.org/
  • 19. Wu S, Koo M, Blum L et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024;1(02):AIdbp2300092.
  • 20. Suchman K, Garg S, Trindade A J. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology Self-Assessment Test. Am J Gastroenterol. 2023;118(12):2280–2282. doi: 10.14309/ajg.0000000000002320.
  • 21. Mihalache A, Popovic M M, Muni R H. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141(06):589–597. doi: 10.1001/jamaophthalmol.2023.1144.
  • 22. Deebel N A, Terlecki R. ChatGPT performance on the American Urological Association Self-assessment Study Program and the potential influence of artificial intelligence in urologic training. Urology. 2023;177:29–33. doi: 10.1016/j.urology.2023.05.010.
  • 23. Kocaman V. Comparing Spark NLP for Healthcare and ChatGPT in Extracting ICD10-CM Codes from Clinical Notes. Accessed April 20, 2024 at: https://www.johnsnowlabs.com/comparing-spark-nlp-for-healthcare-and-chatgpt-in-extracting-icd10-cm-codes-from-clinical-notes/

Articles from Applied Clinical Informatics are provided here courtesy of Thieme Medical Publishers