Disambiguation of acronyms in clinical narratives with large language models

Amila Kugic; Stefan Schulz; Markus Kreuzthaler

doi:10.1093/jamia/ocae157

. 2024 Jun 25;31(9):2040–2046. doi: 10.1093/jamia/ocae157

Disambiguation of acronyms in clinical narratives with large language models

Amila Kugic ¹, Stefan Schulz ², Markus Kreuzthaler ^3,^✉

PMCID: PMC11339513 PMID: 38917444

Abstract

Objective

To assess the performance of large language models (LLMs) for zero-shot disambiguation of acronyms in clinical narratives.

Materials and Methods

Clinical narratives in English, German, and Portuguese were applied for testing the performance of four LLMs: GPT-3.5, GPT-4, Llama-2-7b-chat, and Llama-2-70b-chat. For English, the anonymized Clinical Abbreviation Sense Inventory (CASI, University of Minnesota) was used. For German and Portuguese, at least 500 text spans were processed. The output of LLM models, prompted with contextual information, was analyzed to compare their acronym disambiguation capability, grouped by document-level metadata, the source language, and the LLM.

Results

On CASI, GPT-3.5 achieved 0.91 in accuracy. GPT-4 outperformed GPT-3.5 across all datasets, reaching 0.98 in accuracy for CASI, 0.86 and 0.65 for two German datasets, and 0.88 for Portuguese. Llama models only reached 0.73 for CASI and failed severely for German and Portuguese. Across LLMs, performance decreased from English to German and Portuguese processing languages. There was no evidence that additional document-level metadata had a significant effect.

Conclusion

For English clinical narratives, acronym resolution by GPT-4 can be recommended to improve readability of clinical text by patients and professionals. For German and Portuguese, better models are needed. Llama models, which are particularly interesting for processing sensitive content on premise, cannot yet be recommended for acronym resolution.

Keywords: natural language processing, large language models, electronic health records, acronyms

Introduction

Acronyms function as shortcuts for words or longer phrases. Their correct resolution and disambiguation fully depends on the text around the acronym and the interpretation of the reader. In clinical narratives, acronyms are especially difficult to resolve due to a highly specialized and constantly evolving language, which lacks standardization. This is very relevant, because acronyms in clinical texts are usually not introduced, can have several meanings, are often only used in local scopes, and thus constitute a barrier to understanding not only for patients but also for physicians¹, even affecting patient safety. Automated disambiguation of acronyms in clinical narratives would therefore not only support understanding and readability for patients and professionals but further help improve important natural language processing (NLP) tasks in healthcare, such as information retrieval, machine translation, and text summarization. To automatically disambiguate acronyms, deep learning, language modeling, and statistical modeling techniques have been applied. Kashyap et al² incorporated PubMed Central full-text articles for detecting and extracting acronym–expansion pairs and trained a logistic regression model. The prediction performance reached an average accuracy of 0.879. Skreta et al³ combined UMLS term embeddings with reverse substitution to create an abbreviation disambiguation pipeline with convolutional neural networks. On the Clinical Abbreviation Sense Inventory (CASI)⁴ dataset, a methodology that combined the concept hierarchies during pre-training, augmented them with related medical concepts extracted from an embedding space and the global context of the clinical narrative, scored 0.841 in accuracy. Adams et al⁵ applied the contextualized representation of a word from local context and metadata in combination with predefined short-form expansion inventories, by drawing Gaussian embeddings jointly from word and metadata prior densities, and by processing surrounding words with a Bayesian Skip-Gram model. This creates a variational distribution over the latent meaning cell, which outperformed deep learning strategies. For the CASI dataset, it scored a weighted mean F1-measure across five pre-training runs of 0.710. Building on these techniques, large language models (LLMs) present a promising new direction for clinical NLP. LLMs have become popular through wide-spread access to conversational agents, which respond to complex prompts and deliver information on a wide range of topics. ChatGPT, created by OpenAI, was one of the first LLM-based chatbots accessible for the public. It mimics conversations and generates text responses to complex prompts. Pre-trained on huge general text data and further fine-tuned to NLP tasks, it applies the transformer architecture introduced by Vaswani et al⁶. LLMs soon attracted great interest, and various predictions for their application in medicine were reported. Dave et al⁷ listed potential use cases and functions of ChatGPT in a review, which covered search engine quality, patient monitoring, risk factor assessment, and medical education support. Thapa and Adhikari⁸ expressed future opportunities for streamlining literature reviews, summarizing complex findings and generating novel research hypotheses based on pre-trained concepts and patterns already existent in the LLM. Patel and Lam⁹ discussed circumstances in which synthetic dataset creation and information extraction help fix inherent problems of healthcare providers with narrative data, such as balancing manual documentation time and workload, and supporting clinicians in day-to-day documentation. Baker et al¹⁰ performed a randomized controlled trial comparing the use of an LLM with typing and dictation to support documentation of patients’ history of present illness (HPI) and demonstrated an increase in text quality and length, although false information was reported by the LLM in 36% of documents. Ramachandran et al¹¹ performed a prompt-based extraction of social determinants of health, eg, alcohol, drugs, living and employment status, in a one-shot prompt setting with GPT-4, yielding an overall F1-measure of 0.652 on test data of the Social History Annotation Corpus (SHAC). This is further put into perspective by the authors, who compared this performance to the best performing teams of the 2022 n2c2 challenge¹² on the same dataset, which turned out to be comparable to the seventh best performing team. Similarly, the MEDIQA-Chat 2023 Shared Tasks centered on summarization¹³, text classification, and generation of interactions between patients and clinicians. All tasks encompass synthetic dialogue creation, automated text processing, and data augmentation, for which a large portion of teams relied on the use of GPT-3 or GPT-4. The best performing team was reported to have scored 0.780 in accuracy for header classification. In these described applications, the LLM applies the contextual understanding of the prompt to generate text-based answers from the basis of the knowledge gained from pretraining on huge text corpora. This understanding of the context in a prompt could help solve acronym disambiguation. In previous work¹⁴, we had tested and compared GPT-3 to a clinical text mining approach for German acronym resolution and found that GPT-3 outperformed the text mining baseline on a small dataset. The work presented here aims to extend our previous work by testing the same methodology with various parameters to better understand the acronym resolution and disambiguation capabilities of LLMs.

Objectives

For acronym disambiguation, a zero-shot prompt scenario is tested, ie, only once prompting a LLM without examples, in order to answer the following research questions:

Do LLMs outperform existing approaches for clinical acronym resolution?
Would more advanced LLMs achieve better results in comparison to smaller LLMs?
Is there a performance decrease when switching from English to non-English clinical narratives?
Does the addition of document-level metadata deliver higher performance results?

Methods

Data

Acronym definition

We define “acronym” as a short form characterized by capital letters representing syllables or word initials of the corresponding long form. We do not make a distinction between initialisms and other acronyms.

Acronym selection and context extraction

The selection of acronyms from clinical narratives was done in two ways. The acronym was either chosen prior to extracting the acronym (German 3A, English 3A, English 17k) or text spans with valid acronym candidates were extracted with the following rule-based expression: \\s[A-Z][A-Z0-9]{1-4}\\s. The latter approach was followed for German 100 and Portuguese 500 datasets. When the rule-based expression recognized a valid acronym, 50 characters to the left and right of the acronym were extracted to form the text span or context of the prompt. The shortness of the spans was motivated by the need to ensure the non-identifiability of the data.

The Portuguese corpus contained large passages of fully capitalized text. In order to control the unwanted effect of selecting too many capitalized words as they fulfil the acronym filtering criteria, text snippets with a rate of more than 30% of uppercase characters were removed.

Preprocessing

The only preprocessing performed on German and Portuguese texts was the removal of line breaks prior to extracting the context around each acronym. For English datasets, no preprocessing was performed.

Dataset characteristics

Very short acronyms have shown to be highly ambiguous. In clinical narratives, the resolution of their full-length meaning is only possible from their surrounding context¹. Based on this knowledge, our investigation resolved acronyms between two and five characters in length. Table 1 shows acronym and text span counts and lengths, as well as the possible long-form resolutions, per dataset. Possible long-form resolutions were either inferred by the annotator or available from the source dataset, cf., Adams et al⁵. All German and Portuguese text spans had a context length of 100 characters, while the English datasets varied in length, with a mean context length of 369 or 384 characters. The data were fully anonymized.

Table 1.

Distribution of acronyms, text spans (context), mean text span (context) length, and possible long-form expansions (targets) for each acronym per dataset.

Dataset	Distinct acronym count	Text span count	Mean text span length	Possible long-form expansions
English 17k	41	17 873	369	150
English 3A	3	500	384	12
German 3A	3	500	100	17
German 100	76	100	100	116
Portuguese 500	225	500	100	570

Open in a new tab

English 17k and English 3A

Two subsets of the anonymized Clinical Abbreviation Sense Inventory (CASI)⁴ dataset from the University of Minnesota were curated. The English 17k dataset was filtered down from 37 502 to the test set of 17 873 following the guidelines set by Adams et al⁵, to eliminate dataset inconsistencies. According to the authors, the filtering does not empirically impact performance.

To compare the performance of LLMs on the same subset of data, we chose the three acronyms “RA”, “MS”, and “MI”, known for their difficulty and ambiguity, from the CASI dataset, representing a total of 500 text spans from clinical narratives named English 3A. This choice followed Link et al¹⁵, where text spans were extracted from electronic health records (EHRs) from the Veterans Affairs Healthcare Centers Data from the Million Veterans Project¹⁶, to develop a semi-supervised ensemble machine learning method for acronym disambiguation.

The metadata for our investigations consisted of section header information from CASI. For example, “MS” can be expanded to the long forms “morphine sulfate” or “multiple sclerosis”, and from the context of the narrative, the correct resolution is chosen by the LLM. The metadata in the form of “history”, “medications”, “assessment”, “plan”, etc., is expected to provide additional semantic context.

German 3A and German 100

Both German datasets stemmed from manually de-identified narratives from dermatology, cardiology, and oncology departments of an Austrian hospital network. A domain specialist chose three of the most frequent and ambiguous acronyms (“VA”, “HT”, “AP”). The metadata for both datasets consisted of either the name of the clinical specialty or disease. For instance, the expansion candidates for “AP” included “Alkalische Phosphatase”, “Angina pectoris”, and “Atrial pacing”, and possible metadata to enrich the prompt, such as “Kardiologie” (cardiology) or “Kolonkarzinom” (colon carcinoma), depended on the dataset. This constituted the German 3A dataset. For the German 100 dataset, 100 text spans with 76 distinct acronyms were extracted from clinical narratives of the same dataset.

Portuguese 500

The curated 500 text spans in Brazilian Portuguese were extracted from the multi-institutional and multi-specialty corpus SemClinBr.¹⁷ For lack of provenance metadata, we used names of signs, symptoms, and disorders, enclosed by manually assigned semantic tags, available from the source dataset. For example, in “[…] BCR hipofonese de B1 […]”, the acronym BCR was selected for resolution, and the Portuguese terms for heart disease, nighttime dizziness, and chest pains had been annotated as symptoms and disorders. Supplied to the LLM as metadata, these are expected to support the correct resolution of “BCR” in the sense of “Bulhas cardíacas regulares”, ie, regular heart sounds.

LLM selection

Four LLMs were chosen to ascertain the base-level performance for this NLP task. Our selection included two proprietary LLMs from the company OpenAI, GPT-3.5 and GPT-4, due to their popularity and accessibility to users worldwide, and two open-source LLMs, Llama-2-7b-chat and Llama-2-70b-chat from the company Meta (see Data Availability for additional information). Llama was chosen because the models are non-proprietary and can be run on premise.

Processing pipeline

To analyze the acronym resolution performance of each LLM in this scenario, four runs per text span were performed, where each span consisted of the acronym and its surrounding context. Each run corresponded to one prompt sent to a LLM for acronym resolution, and adjusted slightly for a certain parameter, ie, choice of language model, choice of prompt, addition of additional context through metadata. Two prompt combinations were chosen:

LLM + prompt with instructions + context;
LLM + prompt with instructions + context + metadata.

The implementation of our workflow used the available application programming interface (API) by OpenAI. The language models “gpt-3.5-turbo”, “gpt-4”, “llama-2-7b-chat”, and “llama-2-70b-chat” were applied as base models without any fine-tuning to the problem domain. Only the required fields “messages” and “model” were specified; for all other optional variables, default values were set. The goal was to not skew the results and to ascertain the base performance of these large models for the given task. The role of “system” for all prompts always remained the same and was set to: “This system should act as a medical acronym disambiguation tool.” The role of the “user” included the wording of the whole prompt, which consisted of the information imported from the chosen acronyms and abbreviations datasets prepared for our investigation. The prompts were formulated in the language of the texts to be processed. Variations to the prompts were included describing metadata types, eg, headers, department, and document-level information. Figure 1 outlines the steps for prompting LLMs for acronym resolution with an English narrative example.

Figure 1. — Visualization of the processing pipeline to prompt four different large language models (LLMs) for the disambiguation of an acronym with prompt combinations (i) and (ii).

These are the prompts in English corresponding to (i) prompting for a resolution of an acronym with only the context it appears in, or (ii) prompting for a resolution with context and additional metadata information:

“What is the resolution of the acronym ACRONYM in the following clinical context: CONTEXT. The answer should be kept short and concise. The acronym resolution should be given out in the following format: short form, long form. The answer should not contain any further explanations.”
“What is the resolution of the acronym ACRONYM in the following clinical context: CONTEXT. This relates to the following section in the clinical narrative: METADATA. The answer should be kept short and concise. The acronym resolution should be given out in the following format: short form, long form. The answer should not contain any further explanations.”

The tokens ACRONYM, CONTEXT, METADATA were variables to be replaced in each prompt with the information from each dataset surrounded by quotation marks. The prompts for German and Portuguese were also adapted due to section information in each narrative not being available, which is why the metadata for the German narratives consisted of the patient’s department or the main diagnosis. Similar adjustments had to be made for Portuguese, with the use of signs, symptoms, and disorders, available by document-level annotations.

Formatting clauses and limiting explanations were needed to mitigate unnecessary generation and explanations of the narratives contained in the prompt and to facilitate extraction and resolution of the searched acronym. LLM responses contained the expansions of the acronyms, which were saved for evaluation.

Annotation

Two types of annotation workflows were followed, which either corresponded to a manual evaluation and resolution of acronyms through a domain expert or evaluating the resolution of acronyms through automated means.

For all datasets excluding English 17k, a domain expert manually annotated all long-form resolution candidates for each acronym. Every line in the final dataset consisted of the selected acronym, context, metadata, and its resolution by the LLM for each of the four experiments. For each acronym resolution candidate, the annotator had to indicate whether it was “correct” or “incorrect.” The domain expert analyzed the prompt-based resolutions for correctness and plausibility.

Due to the provided correct resolutions in the English 17k dataset and the dataset being too large for a full manual evaluation of the generated resolution candidates, regular expressions, string similarity, and mapping tables were utilized to automatically ascertain the correct or incorrect resolution of each term candidate.

Annotation criteria

For the classification of “correct” resolutions, the following criteria were applied for all datasets:

expansion correct according to the context and the annotator’s knowledge,
expansion unambiguous (only one expansion provided),
expansion clearly identified in the LLM result,
expansion allowed in the sample language or in English (eg, for “FE” in Portuguese, both “fração de ejeção” and “ejection fraction” are allowed),
text surrounding the expansion not misleading,
redundancy tolerated (eg, “DM” in text with “DM II” expanded into “Diabetes mellitus type 2”).

Evaluation

According to other acronym disambiguation publications²^,³^,¹³, accuracy (correctly predicted instances/total number of instances) was chosen as a metric to gauge how well acronyms were resolved with the implementation of various prompts. Because of the automated workflow for the English 17k dataset, we evaluated the results of that dataset by extracting 500 random automated classifications. The deviations between these 500 automated and manually annotated classifications of this subset were measured as error rate.

Results

Tables 2 and 3 provide accuracy scores for experiments per dataset and LLM. GPT-4 outperformed GPT-3.5 across languages and showed a significantly higher performance, based on confidence intervals, in comparison to German and Portuguese, with an optimum accuracy of 0.978 for GPT-4.

Table 2.

Overall accuracy scores for prompting GPT-3.5 and GPT-4 models per dataset (with 0.95 confidence intervals).

Datasets	Prompt combination (i) GPT-3.5 + context	Prompt combination (ii) GPT-3.5 + context + metadata	Prompt combination (i) GPT-4 + context	Prompt combination (ii) GPT-4 + context + metadata
English 3A	0.85 [0.82–0.89]	0.88 [0.84–0.91]	0.97 [0.95–0.98]	0.98 [0.97–0.99]
German 3A	0.41 [0.36–0.45]	0.37 [0.33–0.42]	0.65 [0.61–0.69]	0.59 [0.54–0.63]
German 100	0.74 [0.64–0.82]	0.72 [0.62–0.81]	0.86 [0.78–0.92]	0.85 [0.76–0.91]
Portuguese 500	0.74 [0.70–0.78]	0.76 [0.72–0.80]	0.88 [0.85–0.91]	0.89 [0.86–0.91]

Open in a new tab

Table 3.

Overall accuracy scores for prompting Llama-2-7b-chat and Llama-2-70b-chat models per dataset (with 0.95 confidence intervals).

Datasets	Prompt combination (i) Llama-2-7b-chat + context	Prompt combination (ii) Llama-2-7b-chat + context + metadata	Prompt combination (i) Llama-2-70b-chat + context	Prompt combination (ii) Llama-2-70b-chat + context + metadata
English 3A	0.73 [0.69–0.77]	0.72 [0.68–0.76]	0.69 [0.65–0.73]	0.70 [0.66–0.74]
German 3A	0.02 [0.01–0.03]	0.04 [0.03–0.06]	0.08 [0.06–0.11]	0.10 [0.08–0.13]
German 100	0.34 [0.25–0.44]	0.34 [0.25–0.44]	0.41 [0.31–0.51]	0.45 [0.35–0.55]
Portuguese 500	0.16 [0.13–0.19]	0.15 [0.12–0.18]	0.28 [0.24–0.32]	0.29 [0.25–0.33]

Open in a new tab

As mentioned, the results of the English 3A dataset cover only three acronyms. This is why we assessed in an additional experiment the performance of the complete English 17k dataset. Here, GPT-3.5 reached an accuracy of 0.91 with a 95% confidence interval [0.90,0.91]. The assessed error rate for the automated evaluation of this dataset was 0.014.

For both Llama models, the best performing result was 0.73 in accuracy for the English 3A dataset. In all other experiments, Llama showed a poor performance.

In addition, and in contrast to the crisp ChatGPT responses, Llama's output was mostly overly verbose, including text definitions and context information. It was also characterized by an admixture of the source language with English fragments. Not rarely, it hallucinated non-existing words and terms, eg, “Herz-Tasternalis”, “Tibutation”, “Insuficiência Olárge Oblast”, “Peso Averrado”, “Microsoft Contin”, up to nonsense formulations, eg, “Athérsepatrum (German for Pulse)”, or “VA = Vitalienari, a term used in cardiology to describe the absence of any vital signs, such as pulse, breathing rate, or blood pressure”.

Finally, there was no evidence that additional document-level metadata had a significant effect, cf., largely overlapping confidence intervals in Table 2 with prompt combination (ii).

Discussion

The discussion is structured according to the research questions, cf., Objectives.

Research question A: Do LLMs outperform existing approaches for clinical acronym resolution?

A one-to-one comparison with state-of-the-art approaches described in the Introduction section cannot be performed for all publications, even with all works using the same dataset (CASI) for testing. Known inconsistencies with that dataset (duplicate entries, no resolution of short forms, etc.) lead to changes in the selection of acronyms for training and test sets in each of these studies. Skreta et al³ highlighted this issue when comparing their methodology to related work. On CASI alone, Skreta et al³ used a subset of 65 abbreviations, whereas Kashyap et al² only 52 that occurred in PubMed Central articles. We followed Adams et al⁵ cleaning of the CASI dataset and used the same 41 acronyms. Therefore, our method can best be compared with theirs, even though the number of text spans is not identical (Adams et al⁵: 18 233 vs. 17 873, due to our filtering). Similar to our methodology, Agrawal et al¹⁸ used GPT-3.5 for acronym expansion, also following Adams’ filtering⁵ and achieved a comparable performance of 0.86 in accuracy on the same 41 acronyms with 18 164 text spans.

From our reported final results, we conclude that a zero-shot disambiguation of clinical acronyms via LLMs can deliver similar or even better performance results in comparison to state-of-the-art methods.

Research question B: Would more advanced LLMs achieve better results in comparison to smaller LLMs?

The performance jump between GPT-4 and GPT-3.5, and Llama-2-7b-chat and Llama-2-70b-chat, respectively, is evident, and it is plausible that larger model sizes pay off.

GPT-4 outperformed GPT-3.5 both with and without metadata inclusion. For both datasets (English 3A and German 3A), the switch to the GPT-4 model produced a statistically significant improvement based on 95% confidence intervals. Related works corroborate this trend, with Scheschenja et al¹⁹, who compared both models for patient education prior to interventional radiology procedures, and reported a significantly better accuracy for GPT-4. Similarly, Taloni et al²⁰ tested both models’ and humans’ ability for self-assessments in ophthalmology, in which GPT-4 outperformed its predecessor. Even in the analysis of Llama models, Llama-2-70b-chat outperformed Llama-2-7b-chat for the German 3A and Portuguese 500 datasets.

Research question C: Is there a performance decrease when switching from English to non-English clinical narratives?

We can assume that LLMs have uneven training datasets based on source language distribution. Even supporting a high number of languages, LLMs were primarily designed for English, for which by far more content can be retrieved from the Web. It is plausible that the contextual understanding and performance of such LLMs decrease as less data are available.

Dreano et al²¹ even showed that interlanguage variations in Llama-2 models exist, when analyzing English paired languages. This partly also explains the decreased performance results from the Llama-2 models in our investigation.

Another explanation is the overall poorer text quality of the German and Portuguese clinical narratives with very compact formulations, abundance of short forms, ad-hoc shortened words, and ellipses.

Additionally, the results between English and the other two languages might be influenced by the context length. For English, the input to the LLM was, on average, three times longer, whereas the context length for German and Portuguese snippets had been strictly limited to 100 characters. Feeding LLMs with as much data as possible might yield better acronym resolutions at the expense of runtime performance, which would be something worth investigating more thoroughly.

To summarize, the results of our investigation point towards a large decrease in performance, when switching the prompt and dataset language from English to German and Portuguese, which is especially evident in Llama models. In those instances, fine-tuned LLMs on the problem domain would probably increase performance results quite significantly.

Research question D: Does the addition of document level metadata deliver higher performance results?

For OpenAI models applied on the English 3A dataset, the section header information as metadata just slightly improved the results, while for Llama models decreases were noted, both without statistical significance. One main improvement was found for the acronym “MS”. In most cases where the resolution would be “morphine sulfate”, the LLM would still resolve it as “multiple sclerosis”. The addition of the section header information “discharge medications” resulted in a correct prediction of the acronym. Conversely, the opposite is also true, which means the addition of section header information changed a correct resolution into a false one, although this occurred in fewer cases in comparison to improvements by the models.

For the German 3A dataset, the opposite trend compared to English was found in OpenAI models, ie, the addition of department information on average corrected 24 acronym resolutions per experiment. However, the LLM changed its prediction compared to the experiment without additional metadata on average 48 times. This is also the reason why the accuracy decreased for that dataset, which means the addition of department level metadata introduced more ambiguity.

To summarize, we cannot deny that document-level metadata may help in the resolution of the acronyms in some cases, although the overall analysis does not provide evidence for a significant effect.

Limitations

It became obvious that cross-language comparisons are difficult because of the diversity of the source data. In the three-acronym studies, CASI contained acronyms in their most common sense, whereas in the German corpus (in which the number of acronyms is higher), non-standard senses prevailed, such as “Herzton” for “HT”. These acronym senses rarely occur in publicly accessible corpora. In the Portuguese corpus, acronyms were even more frequent, and non-acronym capitalized words were common.

Regarding the choice of language models, this investigation focused on four pre-trained LLMs, which can be characterized as foundation models (FMs)²². The benefits of FMs are their versatility and ability to perform a wide variety of tasks, while at the same time requiring less task-specific data. Limitations of FMs include unspecific results not tailored to the task, as well as hallucinations, ie, generating plausible sounding but inaccurate or irrelevant information. Additionally, the size and complexity of FMs require extensive hardware resources for on-premises deployment and/or overhead costs. Wornow et al²³ reviewed the use of fine-tuned FMs for clinical narratives and expected that clinical FMs would achieve better predictive performance in comparison to traditional ML models, require even less data, and enable simpler and cheaper model deployments. The disadvantages of fine-tuned FM are the need for high-quality domain specific labelled data, possibility of overfitting, and maintenance. Fine-tuning can also reduce the generalizability of the language model.

Especially regarding acronym disambiguation, while fine-tuned FMs would be able to improve their performance on a single-language dataset, creating a clinical cross-lingual fine-tuned FM would be a challenging task, not just from a privacy perspective.

Finally, we emphasize that our work can only be seen as a snapshot of a high dynamic technological evolution, hardly ever seen in the history of computer science.

Conclusion and outlook

We presented an application-oriented analysis of clinical acronym resolution with LLMs for clinical narratives in English, German, and Portuguese. Experiments were performed to ascertain whether the inherent contextual understanding of LLMs can be of use for this NLP task with regard to the processing of this text genre. The results showed clear potential for applying these methods to English narratives, while for German and Portuguese, further processing or fine-tuning is needed. In future investigations, we will focus on investigating the impact of text span length, the effect of metadata for very short spans, as well as on combinations between LLMs and web mining approaches, in order to optimize the acronym resolution quality. A focus should always be set on investigating the performance of LLMs that can be run within a safe clinical communication network because of significant concerns regarding the processing of data with personal health information by cloud-based services, such as ChatGPT.

Acknowledgments

We want to thank the colleagues of the Health Artificial Intelligence Lab, HAILab-PUCPR, for using their Portuguese corpus for this investigation, and the team from the University of Minnesota for making their Clinical Abbreviation Sense Inventory publicly available.

Contributor Information

Amila Kugic, Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, 8036 Graz, Austria.

Stefan Schulz, Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, 8036 Graz, Austria.

Markus Kreuzthaler, Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, 8036 Graz, Austria.

Author contributions

Amila Kugic designed the study, implemented the code, curated the necessary resources, executed the analysis, and wrote the manuscript. Stefan Schulz annotated the results and revised the manuscript. Markus Kreuzthaler validated the annotated corpora, helped with the conceptualization of the study, and revised the manuscript.

Funding

This research has received funding from the European Union's Horizon Research and Innovation Programme under grant agreement No 101057062 (AIDAVA, https://aidava.eu/).

Conflicts of interest

None declared.

Data availability

The Python API from the company OpenAI can be found under https://github.com/openai/openai-python. For proprietary models, GPT-3.5 and GPT-4, no additional hosting website was required. For the open-source models, Llama-2-7b-chat and Llama-2-70b-chat, the following hosting website was utilized: https://www.llama-api.com. Both these models are also available for download on Hugging Face from https://huggingface.co/meta-llama/Llama-2-7b-chat and https://huggingface.co/meta-llama/Llama-2-70b-chat. The Clinical Abbreviation Sense Inventory (CASI) from the University of Minnesota is available here: https://hdl.handle.net/11299/137703. The Portuguese corpus source files are linked in the following repository: https://github.com/HAILab-PUCPR. The datasets in German, generated and analyzed during this study, cannot be shared publicly due to local ethics regulations. The data can be made accessible on reasonable request to the corresponding author in consultation with the institutional review board of the Medical University of Graz.

References

1. Schwarz CM, Hoffmann M, Smolle C, et al. Structure, content, unsafe abbreviations, and completeness of discharge summaries: a retrospective analysis in a University Hospital in Austria. J Eval Clin Pract. 2021;27(6):1243-1251. [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Kashyap A, Burris H, Callison-Burch C, Boland MR. The CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR): a Method for predicting acronym sense from neonatal clinical notes. Int J Med Inform. 2020;137:104101. [DOI] [PubMed] [Google Scholar]
3. Skreta M, Arbabi A, Wang J, et al. Automatically disambiguating medical acronyms with ontology-aware deep learning. Nat Commun. 2021;12(1):5319. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Moon S, Pakhomov S, Melton G. Clinical abbreviation sense inventory. 2012. Accessed September 22, 2023. http://conservancy.umn.edu/handle/11299/137703
5. Adams G, Ketenci M, Bhave S, et al. Zero-Shot clinical acronym expansion via latent meaning cells. Proc Mach Learn Res. 2020;136:12-40. [PMC free article] [PubMed] [Google Scholar]
6. Vaswani A, Shazeer N, Parmar N, et al. Attention is All you Need. Adv Neural Inf Process Syst. 2017;30.
7. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Thapa S, Adhikari S. ChatGPT, Bard, and large language models for biomedical research: opportunities and pitfalls. Ann Biomed Eng. 2023;51(12):2647-2651. [DOI] [PubMed] [Google Scholar]
9. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digital Health. 2023;5(3):e107-e108. [DOI] [PubMed] [Google Scholar]
10. Baker HP, Dwyer E, Kalidoss S, et al. ChatGPT’s ability to assist with clinical documentation: a randomized controlled trial. J Am Acad Orthop Surg. 2023;32(3):123-129. 10.5435/JAAOS-D-23-00474 [DOI] [PubMed] [Google Scholar]
11. Ramachandran GK, Fu Y, Han B, et al. Prompt-based extraction of social determinants of health using few-shot learning. In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Toronto, Canada: Association for Computational Linguistics; 2023:385-393. 10.18653/v1/2023.clinicalnlp-1.41 [DOI] [Google Scholar]
12. Lybarger K, Yetisgen M, Uzuner Ö. The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc. 2023;30(8):1367-1378. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Ben Abacha A, Yim W, Adams G, et al. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In: Naumann T, Ben Abacha A, Bethard S, et al. , eds. In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Toronto, Canada: Association for Computational Linguistics; 2023:503-513. 10.18653/v1/2023.clinicalnlp-1.52 [DOI]
14. Kugic A, Kreuzthaler M, Schulz S. Clinical acronym disambiguation via ChatGPT and BING. Stud Health Technol Inform. 2023;309:78-82. [DOI] [PubMed] [Google Scholar]
15. Link NB, Huang S, Cai T, Million Veteran Program, et al. Binary acronym disambiguation in clinical notes from electronic health records with an application in computational phenotyping. Int J Med Inform. 2022;162:104753. [DOI] [PubMed] [Google Scholar]
16. Gaziano JM, Concato J, Brophy M, et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214-223. [DOI] [PubMed] [Google Scholar]
17. Oliveira LESe, Peters AC, da Silva AMP, et al. SemClinBr—a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. J Biomed Semant. 2022;13(1):13. [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Agrawal M, Hegselmann S, Lang H, et al. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022:1998–2022. 10.18653/v1/2022.emnlp-main.130 [DOI]
19. Scheschenja M, Viniol S, Bastian MB, et al. Feasibility of GPT-3 and GPT-4 for in-depth patient education prior to interventional radiological procedures: a comparative analysis. Cardiovasc Intervent Radiol. 2023;47(2):245-250. 10.1007/s00270-023-03563-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Taloni A, Borselli M, Scarsi V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13(1):18562. 10.1038/s41598-023-45837-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Dreano S, Molloy D, Murphy N. Embed_Llama: Using LLM embeddings for the metrics shared task. In: Koehn P, Haddow B, Kocmi T, et al. , eds. In: Proceedings of the Eighth Conference on Machine Translation. Singapore: Association for Computational Linguistics; 2023:738-745. 10.18653/v1/2023.wmt-1.60 [DOI]
22. Scott IA, Zuccon G. The new paradigm in machine learning—foundation models, large language models and beyond: a primer for physicians. Intern Med J. 2024;54(5):705-715. [DOI] [PubMed] [Google Scholar]
23. Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 2023;6(1):135. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[ocae157-B1] 1. Schwarz CM, Hoffmann M, Smolle C, et al. Structure, content, unsafe abbreviations, and completeness of discharge summaries: a retrospective analysis in a University Hospital in Austria. J Eval Clin Pract. 2021;27(6):1243-1251. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B2] 2. Kashyap A, Burris H, Callison-Burch C, Boland MR. The CLASSE GATOR (CLinical Acronym SenSE disambiGuATOR): a Method for predicting acronym sense from neonatal clinical notes. Int J Med Inform. 2020;137:104101. [DOI] [PubMed] [Google Scholar]

[ocae157-B3] 3. Skreta M, Arbabi A, Wang J, et al. Automatically disambiguating medical acronyms with ontology-aware deep learning. Nat Commun. 2021;12(1):5319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B4] 4. Moon S, Pakhomov S, Melton G. Clinical abbreviation sense inventory. 2012. Accessed September 22, 2023. http://conservancy.umn.edu/handle/11299/137703

[ocae157-B5] 5. Adams G, Ketenci M, Bhave S, et al. Zero-Shot clinical acronym expansion via latent meaning cells. Proc Mach Learn Res. 2020;136:12-40. [PMC free article] [PubMed] [Google Scholar]

[ocae157-B6] 6. Vaswani A, Shazeer N, Parmar N, et al. Attention is All you Need. Adv Neural Inf Process Syst. 2017;30.

[ocae157-B7] 7. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B8] 8. Thapa S, Adhikari S. ChatGPT, Bard, and large language models for biomedical research: opportunities and pitfalls. Ann Biomed Eng. 2023;51(12):2647-2651. [DOI] [PubMed] [Google Scholar]

[ocae157-B9] 9. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digital Health. 2023;5(3):e107-e108. [DOI] [PubMed] [Google Scholar]

[ocae157-B10] 10. Baker HP, Dwyer E, Kalidoss S, et al. ChatGPT’s ability to assist with clinical documentation: a randomized controlled trial. J Am Acad Orthop Surg. 2023;32(3):123-129. 10.5435/JAAOS-D-23-00474 [DOI] [PubMed] [Google Scholar]

[ocae157-B11] 11. Ramachandran GK, Fu Y, Han B, et al. Prompt-based extraction of social determinants of health using few-shot learning. In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Toronto, Canada: Association for Computational Linguistics; 2023:385-393. 10.18653/v1/2023.clinicalnlp-1.41 [DOI] [Google Scholar]

[ocae157-B12] 12. Lybarger K, Yetisgen M, Uzuner Ö. The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc. 2023;30(8):1367-1378. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B13] 13. Ben Abacha A, Yim W, Adams G, et al. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In: Naumann T, Ben Abacha A, Bethard S, et al. , eds. In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Toronto, Canada: Association for Computational Linguistics; 2023:503-513. 10.18653/v1/2023.clinicalnlp-1.52 [DOI]

[ocae157-B14] 14. Kugic A, Kreuzthaler M, Schulz S. Clinical acronym disambiguation via ChatGPT and BING. Stud Health Technol Inform. 2023;309:78-82. [DOI] [PubMed] [Google Scholar]

[ocae157-B15] 15. Link NB, Huang S, Cai T, Million Veteran Program, et al. Binary acronym disambiguation in clinical notes from electronic health records with an application in computational phenotyping. Int J Med Inform. 2022;162:104753. [DOI] [PubMed] [Google Scholar]

[ocae157-B16] 16. Gaziano JM, Concato J, Brophy M, et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J Clin Epidemiol. 2016;70:214-223. [DOI] [PubMed] [Google Scholar]

[ocae157-B17] 17. Oliveira LESe, Peters AC, da Silva AMP, et al. SemClinBr—a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. J Biomed Semant. 2022;13(1):13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B18] 18. Agrawal M, Hegselmann S, Lang H, et al. Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics; 2022:1998–2022. 10.18653/v1/2022.emnlp-main.130 [DOI]

[ocae157-B19] 19. Scheschenja M, Viniol S, Bastian MB, et al. Feasibility of GPT-3 and GPT-4 for in-depth patient education prior to interventional radiological procedures: a comparative analysis. Cardiovasc Intervent Radiol. 2023;47(2):245-250. 10.1007/s00270-023-03563-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B20] 20. Taloni A, Borselli M, Scarsi V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13(1):18562. 10.1038/s41598-023-45837-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ocae157-B21] 21. Dreano S, Molloy D, Murphy N. Embed_Llama: Using LLM embeddings for the metrics shared task. In: Koehn P, Haddow B, Kocmi T, et al. , eds. In: Proceedings of the Eighth Conference on Machine Translation. Singapore: Association for Computational Linguistics; 2023:738-745. 10.18653/v1/2023.wmt-1.60 [DOI]

[ocae157-B22] 22. Scott IA, Zuccon G. The new paradigm in machine learning—foundation models, large language models and beyond: a primer for physicians. Intern Med J. 2024;54(5):705-715. [DOI] [PubMed] [Google Scholar]

[ocae157-B23] 23. Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 2023;6(1):135. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Disambiguation of acronyms in clinical narratives with large language models

Amila Kugic, MSc

Stefan Schulz, MD

Markus Kreuzthaler, PhD

Abstract

Objective

Materials and Methods

Results

Conclusion

Introduction

Objectives

Methods

Data

Acronym definition

Acronym selection and context extraction

Preprocessing

Dataset characteristics

Table 1.

English 17k and English 3A

German 3A and German 100

Portuguese 500

LLM selection

Processing pipeline

Figure 1.

Annotation

Annotation criteria

Evaluation

Results

Table 2.

Table 3.

Discussion

Research question A: Do LLMs outperform existing approaches for clinical acronym resolution?

Research question B: Would more advanced LLMs achieve better results in comparison to smaller LLMs?

Research question C: Is there a performance decrease when switching from English to non-English clinical narratives?

Research question D: Does the addition of document level metadata deliver higher performance results?

Limitations

Conclusion and outlook

Acknowledgments

Contributor Information

Author contributions

Funding

Conflicts of interest

Data availability

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases