
This is a preprint. It has not yet been peer reviewed by a journal.


medRxiv [Preprint]. 2024 Aug 13:2024.08.12.24311870 [Version 1]. doi: 10.1101/2024.08.12.24311870

Large Language Models Improve the Identification of Emergency Department Visits for Symptomatic Kidney Stones

Cosmin A Bejan, Amy M Reed, Matthew Mikula, Siwei Zhang, Yaomin Xu, Daniel Fabbri, Peter J Embí, Ryan S Hsi
PMCID: PMC11361237  PMID: 39211884

Abstract

Background

Recent advances in large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) have generated significant interest in the scientific community, yet the potential of these models in clinical settings remains largely unexplored. This study investigated the ability of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were caused by symptomatic kidney stones.

Methods

Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance the performance of GPT-4, GPT-3.5, and Llama-2, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. We also implemented fairness assessment and bias mitigation methods to investigate potential disparities introduced by these LLMs with respect to race and gender. A clinical expert manually assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The evaluation compared the LLMs against traditional machine learning models (logistic regression, extreme gradient boosting, and light gradient boosting machine) and a baseline system that identifies kidney stone visits from International Classification of Diseases (ICD) codes.
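For illustration only (the full text and code are not archived here), a zero-shot classification call of the kind described above might look like the following minimal sketch. The prompt wording, label set, model name, and use of the OpenAI Python client are assumptions, not the authors' implementation.

```python
# Illustrative zero-shot classification of an ED report with GPT-4.
# A minimal sketch, not the authors' code; the prompt, labels, and
# model name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a clinical text classifier. Read the emergency department "
    "report and answer with exactly one word: 'Yes' if the visit was "
    "caused by symptomatic kidney stones, otherwise 'No'."
)

def classify_ed_report(report_text: str) -> str:
    """Return 'Yes' or 'No' for a single de-identified ED report."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for reproducible labels
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()
```

Few-shot prompting of the kind described above would extend the messages list with labeled example reports as additional user/assistant turns, and prompt augmentation would prepend extra context (for example, demographics or prior disease history) to the report text.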

Results

The best results were achieved by GPT-4 (macro-F1=0.833, 95% confidence interval [CI]=0.826–0.841) and GPT-3.5 (macro-F1=0.796, 95% CI=0.796–0.796), both statistically significantly better than the ICD-based baseline (macro-F1=0.71). Ablation studies revealed that the pre-trained GPT-3.5 model benefited from fine-tuning under the same parameter configuration. Adding demographic information and prior disease history to the prompts allowed the LLMs to make more accurate decisions. The bias evaluation found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to effectively model racial diversity. The analysis of the explanations provided by GPT-4 demonstrated this model's advanced capabilities in understanding clinical text and reasoning with medical knowledge.
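Macro-F1 point estimates with 95% CIs of the kind reported above are commonly obtained by bootstrap resampling over the test set. The sketch below shows one standard way to compute them, assuming arrays of gold labels y_true and model predictions y_pred; the abstract does not specify the authors' exact CI procedure, so this is an assumption.

```python
# Illustrative macro-F1 with a bootstrap 95% CI. A minimal sketch;
# the resampling scheme is an assumption, not necessarily the
# authors' exact procedure.
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_with_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Return (point estimate, CI lower, CI upper) for macro-F1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    point = f1_score(y_true, y_pred, average="macro")
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        # Resample test examples with replacement and re-score.
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, lo, hi
```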

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.


Articles from medRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
