
This is a preprint. It has not yet been peer reviewed by a journal.


medRxiv
[Preprint]. 2024 Nov 23:2024.04.29.24306573. [Version 3] doi: 10.1101/2024.04.29.24306573

Benchmarking Large Language Models for Extraction of International Classification of Diseases Codes from Clinical Documentation

Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher, Jami Pincavitch, Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Smith, Ali Soroush, Robert Freeman, Donald J Apakama, Alexander W Charney, Roopa Kohli-Seth, Girish N Nadkarni, Ankit Sakhuja
PMCID: PMC11601733  PMID: 39606395

Abstract

Background

Healthcare reimbursement and coding depend on accurate extraction of International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes from clinical documentation. Attempts to automate this task have had limited success. This study aimed to evaluate the performance of large language models (LLMs) in extracting ICD-10-CM codes from unstructured inpatient notes and to benchmark them against a human coder.

Methods

This study compared the performance of GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b against a human coder in extracting ICD-10-CM codes from unstructured inpatient notes. We presented deidentified inpatient notes from American Health Information Management Association VLab authentic patient cases to the LLMs and to the human coder for extraction of ICD-10-CM codes, using a standard prompt for all models. The human coder analyzed the same notes using the 3M Encoder, adhering to the 2022 ICD-10-CM coding guidelines.
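The abstract does not reproduce the study's prompt wording. As a minimal sketch of the setup it describes, assuming a shared template filled with each note's text so that every model receives an identical instruction (the template wording below is hypothetical):

```python
# Hypothetical prompt template; the study's actual wording is not given
# in the abstract.
PROMPT_TEMPLATE = (
    "You are a medical coding assistant. Extract all ICD-10-CM codes "
    "supported by the following inpatient note. Return one code per line, "
    "using the most specific code available.\n\n"
    "Note:\n{note}"
)

def build_prompt(note_text: str) -> str:
    """Fill the shared template so every model receives the same prompt."""
    return PROMPT_TEMPLATE.format(note=note_text)
```

Using one fixed template across all six models keeps the comparison about the models themselves rather than prompt variation.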

Results

In this study, we analyzed 50 inpatient notes, comprising 23 history and physicals and 27 progress notes. The human coder identified 165 unique codes, with a median of 4 codes per note. The LLMs extracted varying median numbers of codes per note: GPT-3.5: 7, GPT-4: 6, Claude 2.1: 6, Claude 3: 8, Gemini Advanced: 5, and Llama 2-70b: 11. GPT-4 had the best performance, though agreement with the human coder was poor: 15.2% for overall extraction of ICD-10-CM codes and 26.4% for extraction of category-level ICD-10-CM codes.
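The abstract reports agreement at two granularities, full codes and category-level codes (the three-character ICD-10-CM category, e.g. "I50" from "I50.9"), without specifying the agreement formula. A minimal sketch, assuming per-note set overlap (codes both sources extracted, divided by the union of all codes either extracted); the example codes are fabricated for illustration:

```python
def category(code: str) -> str:
    """ICD-10-CM category: the first three characters, e.g. 'I50.9' -> 'I50'."""
    return code.replace(".", "")[:3]

def agreement(llm_codes: set[str], coder_codes: set[str]) -> float:
    """Assumed metric: shared codes over the union of all extracted codes."""
    union = llm_codes | coder_codes
    return len(llm_codes & coder_codes) / len(union) if union else 1.0

# Fabricated codes for a single note, for illustration only.
llm = {"E11.9", "I10", "N17.9", "J18.9"}
coder = {"E11.9", "I10", "N17.9"}

full_agreement = agreement(llm, coder)  # 3 shared of 4 total -> 0.75
category_agreement = agreement({category(c) for c in llm},
                               {category(c) for c in coder})
```

Category-level agreement can only be equal to or higher than full-code agreement, which matches the pattern reported (26.4% vs. 15.2% for GPT-4).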

Conclusion

Current LLMs perform poorly at extracting ICD-10-CM codes from inpatient notes when compared against a human coder.

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
