Journal of the American Medical Informatics Association (JAMIA)
2024 Apr 17;31(9):1892–1903. doi: 10.1093/jamia/ocae078

Generative large language models are all-purpose text analytics engines: text-to-text learning is all you need

Cheng Peng 1, Xi Yang 2,3, Aokun Chen 4,5, Zehao Yu 6, Kaleb E Smith 7, Anthony B Costa 8, Mona G Flores 9, Jiang Bian 10,11, Yonghui Wu 12,13
PMCID: PMC11339507  PMID: 38630580

Abstract

Objective

To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning.

Methods

We formulated 7 key clinical NLP tasks as text-to-text learning problems and solved them using one unified generative clinical LLM, GatorTronGPT, developed using the GPT-3 architecture with up to 20 billion parameters. We adopted soft prompts (ie, trainable vectors) with a frozen LLM, where the LLM parameters were not updated (ie, frozen) and only the vectors of the soft prompts were updated, a technique known as prompt tuning. We added soft prompts as a prefix to the input layer and optimized them during prompt tuning. We evaluated the proposed method on the 7 clinical NLP tasks and compared it with previous task-specific solutions based on transformer models.

Results and Conclusion

The proposed approach achieved state-of-the-art performance for 5 out of 7 major clinical NLP tasks using one unified generative LLM. Our approach outperformed previous task-specific transformer models by ∼3% for concept extraction and 7% for relation extraction applied to social determinants of health, 3.4% for clinical concept normalization, 3.4%-10% for clinical abbreviation disambiguation, and 5.5%-9% for natural language inference. Our approach also outperformed a previously developed prompt-based machine reading comprehension (MRC) model, GatorTron-MRC, for clinical concept and relation extraction. The proposed approach can deliver the “one model for all” promise from training to deployment using a unified generative LLM.

Keywords: prompt tuning, large language model, text generation, clinical natural language processing, transformer model

Introduction

Solving multiple natural language processing (NLP) tasks with a unified model is an attractive promise of deep learning. Over the past decade, NLP researchers have successfully simplified feature engineering with vector representations1 and pre-trained transformer models2 that can be adapted to multiple downstream tasks. These advances have led to the development of large language models (LLMs), which could potentially fulfill the promise of “one model solving all NLP tasks.” LLMs have dominated NLP in recent years, with generative LLMs particularly excelling at solving multiple NLP tasks through a unified sequence-to-sequence learning (ie, text-to-text learning)3 architecture, bypassing the need for generating vectors as an intermediate step. In the general domain, researchers have explored an encoder–decoder LLM, T5,4 to approach all NLP tasks as a “text-to-text” problem. However, the T5 model requires fine-tuning, meaning each subtask still needs a task-specific T5 variation, which is still one step away from the ideal of “one model solving all tasks.”

This study seeks to apply the text-to-text learning architecture to formulate major clinical NLP tasks and solve them using a unified generative LLM through prompt tuning, thus fulfilling the “one model solving all tasks” promise. We formulated 7 major clinical NLP tasks as text-to-text learning problems, including (1) clinical concept extraction, (2) clinical relation extraction, (3) clinical abbreviation disambiguation, (4) natural language inference (NLI), (5) medication attribute filling, (6) clinical concept normalization, and (7) progress note understanding. We solved the 7 tasks using GatorTronGPT,5 a generative clinical LLM developed from scratch using a GPT-3 architecture and trained on 277 billion words of clinical and general English text. To solve the 7 tasks using a single unified model, we applied prompt tuning with a frozen LLM, where the original parameters of GatorTronGPT remained unchanged during the tuning process. We used soft prompts6 initialized by a bidirectional Long Short-Term Memory (LSTM)7 network to instruct GatorTronGPT to generate accurate responses for different tasks. Our approach achieved state-of-the-art performance on 5 out of the 7 clinical NLP tasks, demonstrating the potential of generative clinical LLMs as all-purpose text analytics tools capable of solving all major clinical NLP tasks. Soft prompting with frozen LLMs shows promise in facilitating the deployment of a unified generative LLM for diverse artificial intelligence (AI) applications.

Background

Understanding human language is a complex and challenging task. Researchers approach NLP using many intermediate subtasks, such as named entity recognition (NER) to identify important concepts, and word sense disambiguation (WSD) to determine the correct meaning of ambiguous words.8 Similarly, clinical NLP has several major tasks including clinical concept extraction, clinical concept normalization, medical relation extraction, WSD, NLI, medication attribute filling, and clinical note understanding.8 Many open clinical NLP challenges have been organized to tackle these tasks.9–13

Historically, NLP researchers developed different machine learning models for specific tasks. For example, sequence labeling models such as conditional random fields (CRFs)14 are widely used for NER, and classification models such as support vector machines15 for WSD. With the advent of deep learning, the field of NLP began to challenge the “one model per task” approach, aiming instead to solve multiple tasks with a unified model. Early deep learning methods for NLP involved training distributed word representations using shallow embedding algorithms such as word2vec,16 then integrating these embeddings into deep neural network architectures such as convolutional neural networks1,17,18 or recurrent neural networks, particularly those implemented using LSTMs,19–21 to solve various downstream tasks.

Not long after, the transformer2,22 emerged, a neural network architecture built on a self-attention mechanism. A single pre-trained transformer model, such as BERT2 (an encoder-only transformer), can be adapted for multiple tasks through fine-tuning. While transformers have been widely used in many clinical NLP studies,23,24 adopting pre-trained transformers for different tasks typically requires adding task-specific layers and fine-tuning pre-trained parameters. This approach ultimately generates multiple task-specific models, deviating from the goal of “one model for all tasks.”

Solving NLP tasks through text-to-text learning originates in the sequence-to-sequence learning architecture3 proposed for machine translation and later adopted by the transformer model. This architecture utilizes an “encoder” component to process the input sequence (ie, source language) and a “decoder” component for generating the output sequence (ie, target language). Recently, researchers have pre-trained very large transformer models with billions or even hundreds of billions of parameters using text data containing billions or even trillions of words.25,26 These very large transformer models, known as LLMs, can be categorized into 3 main types: (1) encoder-only models, (2) decoder-only models, and (3) encoder–decoder models. LLMs using decoder-only and encoder–decoder transformers are “generative” as they can produce coherent and contextually relevant text such as emails, articles, and computer code.

Before 2021, encoder-only LLMs such as BERT dominated clinical NLP and were widely used for tasks such as clinical concept extraction23 and relation extraction.27 Researchers have explored encoder-only transformers to unify NLP tasks as span extraction28 or machine reading comprehension (MRC)29 tasks, unifying information extraction tasks such as clinical concept extraction and relation extraction. We previously developed GatorTron,30 a widely recognized encoder-only LLM in the clinical domain with up to 8.9 billion parameters. However, encoder-only LLMs require additional task-specific layers, which hinders their application in fulfilling the “one model for all tasks” promise.

More recently, generative LLMs such as the decoder-only GPT-3 model25 and the encoder–decoder T5 model4 have demonstrated strong text-to-text learning capabilities. In the general domain, Raffel et al4 first formulated multiple NLP tasks using text-to-text learning and solved them using T5-based LLMs. However, their approach still requires the fine-tuning of T5 model parameters, resulting in multiple task-specific variations of T5 models, making it still a “one model per task” solution.4 Agrawal et al31 recently examined the few-shot learning ability of ChatGPT in solving multiple clinical information extraction tasks. In the clinical domain, we have developed GatorTronGPT,5 a generative clinical LLM trained from scratch using 277 billion words of clinical, biomedical, and general English text. We evaluated GatorTronGPT in solving biomedical relation extraction and biomedical question answering tasks.

Prompt-based learning is the key technology widely used to instruct LLMs to generate textual content of interest. Unlike traditional fully supervised training and pre-training then fine-tuning, where a task-specific model is trained from scratch or fine-tuned on the target task/dataset (as shown in Figure 1), prompt-based learning formulates downstream tasks as a text generation problem by adding prompts to the input text to control the model output. Therefore, we can use prompts to control text generation without updating the LLM parameters, which provides a foundation for formulating multiple clinical NLP tasks as text-to-text generation using a unified NLP model. There are two main types of prompts: (1) “hard prompts,” which are textual inputs composed by humans providing information about the target task, and (2) “soft prompts,” which are trainable embeddings (or virtual tokens) learned by machine learning models.6 In a recent study,32 we explored soft prompts6 and demonstrated that machines can learn soft prompts more effectively than human-composed hard prompts for clinical concept and relation extraction.
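To make the distinction concrete, the following minimal Python sketch (our illustration, not the authors' code; the prompt text, sizes, and names are assumptions) contrasts a human-composed hard prompt with a trainable soft prompt:

    import torch
    import torch.nn as nn

    # A "hard" prompt is discrete text composed by a human, tokenized together with the input.
    hard_prompt = "Extract all drug names from the following note: "
    model_input = hard_prompt + "Colchicine 0.6 mg Tablet PO DAILY."

    # A "soft" prompt is a set of m trainable vectors (virtual tokens) in the model's
    # embedding space of dimension e; it has no textual form and is learned by gradient descent.
    m, e = 20, 1024
    soft_prompt = nn.Parameter(torch.randn(m, e))

    def prepend_soft_prompt(input_embeddings: torch.Tensor) -> torch.Tensor:
        # input_embeddings: (batch, n, e) -> output: (batch, m + n, e)
        batch_size = input_embeddings.size(0)
        prompts = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompts, input_embeddings], dim=1)

    print(prepend_soft_prompt(torch.randn(2, 8, e)).shape)  # torch.Size([2, 28, 1024])

In this setup only the soft prompt parameters would receive gradients; the language model itself is untouched.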

Figure 1.

Three learning strategies for clinical NLP tasks. Different shapes represent task-specific text data in the “Input Data” frame and different task-specific soft prompts in the “Soft Prompt” frame. (A) Fully supervised training. Seven individual models are trained from scratch on eight task-specific datasets. The outputs are obtained using task-specific objectives. (B) Pre-training to fine-tuning. Seven individual models are fine-tuned on eight task-specific datasets based on a pre-trained model. The outputs are obtained using task-specific objectives. (C) Pre-training to prompting. Seven individual soft prompts are trained on eight task-specific datasets based on a frozen pre-trained language model. The outputs are obtained from a unified text generation module.

In this study, we seek to revisit the promise of “one unified model for all” and examine to what extent a single unified generative LLM can solve major clinical NLP tasks. Solving major clinical NLP tasks using a unified model is very attractive as it can significantly reduce training and deployment costs in real-world applications. This study differs from the previous study using the encoder–decoder T5 model4 in three ways: (1) we explored a generative LLM based on a decoder-only architecture, specifically GPT-3;25 (2) we applied prompt tuning and fixed the LLM parameters (ie, frozen) during tuning, whereas the T5 study4 updated the LLM parameters during fine-tuning; and (3) we adopted soft prompts instead of hard prompts composed by human experts.4 The proposed approach demonstrates the capability to solve all major clinical NLP tasks using a single unified generative LLM through soft prompting.

Methods

Clinical NLP tasks and datasets

We explored the following 7 major clinical NLP tasks:

Clinical concept extraction aims to identify concepts with important clinical meanings (eg, medications, treatments, adverse drug events). We used 2 benchmark datasets for evaluation: the 2018 n2c2 challenge (Track 2) dataset,33 focusing on the extraction of medications and adverse drug events (ADEs) (referred to as the drug-ADE dataset), and the 2022 n2c2 challenge (Track 2) dataset,34 focusing on social determinants of health (referred to as the SDoH dataset). The drug-ADE dataset consists of 505 discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-III database,35 annotated with 9 categories of clinical concepts (eg, drug, drug attributes, ADEs). The SDoH dataset consists of 5 categories of SDoH concepts and 9 categories of SDoH-associated attribute concepts.

Clinical concept normalization aims to standardize clinical concepts using standard concept identifiers, such as those defined in the Unified Medical Language System (UMLS). This task is typically approached through information retrieval and solved using vector space models.36 We used a disorder mention dataset developed for SemEval-2015 Task 14 for evaluation.37 Specifically, the task is to detect disorder mentions and normalize them to a Concept Unique Identifier (CUI) within the UMLS/SNOMED-CT terminology. The dataset used is the ShARe corpus,38 which consists of 531 deidentified clinical notes (a mix of discharge summaries and radiology reports) selected from the MIMIC-II clinical database.39

Clinical relation extraction aims to establish semantic relations among clinical concepts (eg, drugs and adverse events). We used the drug-ADE dataset developed by the 2018 n2c2 challenge33 and the SDoH dataset developed by the 2022 n2c2 challenge34 for evaluation; the drug-ADE dataset contains 8 categories of relations among drugs, drug-associated attributes, and ADEs, while the SDoH dataset consists of 28 categories of relations between SDoH concepts and SDoH-associated attributes.

Clinical abbreviation disambiguation, a special case of WSD, involves determining the correct meaning of ambiguous abbreviations in clinical narratives (eg, “AB” for abortion). This task is typically solved using classification models. We utilized a widely used abbreviation dataset developed by the University of Minnesota (UMN), which consists of 37 500 instances, 75 unique abbreviations, and 351 senses.40

Natural language inference, also known as recognizing textual entailment, involves determining whether a given hypothesis can be inferred from a given premise, which is typically approached using classification models. We used MedNLI as the benchmark dataset, which was annotated by clinicians based on the medical history of patients.41 In NLI, there are three possible relationships between a hypothesis and a corresponding premise: entailment (ie, the hypothesis logically follows from the premise), contradiction (ie, the hypothesis and premise are logically incompatible), and neutral (ie, the hypothesis is neither logically entailed nor contradicted by the premise).

Medication attribute filling aims to determine the contextual attributes of medications documented in clinical narratives, which has typically been solved using multiple classification and information extraction models. For evaluation, we used the Contextualized Medication Event Dataset, derived from the 2022 n2c2 challenge (Track 1), which focuses on the multidimensional context of medication changes.13 This task is divided into two subtasks: event classification and context classification. The first subtask aims to classify medication mentions into one of 3 categories: Disposition, NoDisposition, or Undetermined. The second subtask aims to classify the contextual information for Disposition events across 5 orthogonal dimensions: Action, Negation, Temporality, Certainty, and Actor.

Progress note understanding aims to determine the causal relations between the “assessment” and “plan” sections of a progress note. For evaluation, we used the benchmark dataset developed by the 2022 n2c2 challenge (Track 3),10 in which a subset of 5000 physician-written daily progress notes across 84 note types was sampled from the MIMIC-III database. The annotated corpus contained 768 progress notes and 5934 labels for four predefined relations: (1) Direct, (2) Indirect, (3) Neither, and (4) Not Relevant. The four relations correspond to the providers’ judgment on whether a diagnosis presented in the Plan Subsection was the primary reason for hospitalization (Direct), a secondary health problem or diagnosis to the main problem or diagnosis (Indirect), an issue that was not documented (Neither), or not a diagnosis or problem (Not Relevant).

Generative LLMs

We explored GatorTronGPT, a generative clinical LLM developed in our previous study.5 GatorTronGPT was trained using a GPT-3 architecture and is available in two versions, with 5 billion and 20 billion parameters, both trained on 277 billion words of text comprising (1) 82 billion words of clinical text from approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text.42

Prompting strategies

We adopted soft prompting with frozen LLMs, which demonstrated good performance in our previous studies29,32 based on an encoder-only LLM, GatorTron.30 Soft prompts were used to instruct GatorTronGPT to generate correct responses for the 7 clinical NLP tasks. Specifically, the soft prompts were initialized as vectors of random values and optimized through backpropagation, while the LLM parameters remained frozen (ie, not updated). This technique of using only trainable soft prompts, also known as prompt tuning, contrasts with the discrete “hard” text-based prompts typically used by ChatGPT, which generally include task-specific instructions and sometimes task illustrations.6 Given the input tokens {x_1, x_2, …, x_n}, where n is the number of tokens, the GatorTronGPT model first embeds the tokens to form a matrix X_e ∈ ℝ^(n×e), where e is the dimension of the embedding space. The soft prompts are represented as a parameter matrix P ∈ ℝ^(m×h), initialized using an LSTM model, where m is the length of the prompt and h is the hidden size. The encoded prompt parameter matrix P_e ∈ ℝ^(m×e) is then concatenated with the embedded input X_e to form a single matrix [P_e; X_e] ∈ ℝ^((m+n)×e), which then flows through the rest of the model layers. During training, the model parameters of GatorTronGPT were frozen and only the LSTM parameters were updated. The LSTM parameters were not shared among tasks. Figure 2 illustrates the architecture of soft prompting with GatorTronGPT for text generation.
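The following minimal PyTorch sketch illustrates this setup (our illustration under stated assumptions: the embedding layer and transformer stack of the frozen LLM are replaced by toy stand-ins, and all sizes are hypothetical):

    import torch
    import torch.nn as nn

    class SoftPromptEncoder(nn.Module):
        # Encodes m randomly initialized prompt vectors (P in R^(m x h)) with a
        # bidirectional LSTM to produce the encoded prompts P_e in R^(m x e).
        def __init__(self, m: int, h: int, e: int):
            super().__init__()
            self.prompt_seed = nn.Parameter(torch.randn(m, h))
            self.lstm = nn.LSTM(h, e // 2, bidirectional=True, batch_first=True)

        def forward(self, batch_size: int) -> torch.Tensor:
            seed = self.prompt_seed.unsqueeze(0).repeat(batch_size, 1, 1)
            encoded, _ = self.lstm(seed)
            return encoded  # (batch, m, e)

    def forward_with_soft_prompt(llm_embed, llm_body, prompt_encoder, input_ids):
        # llm_embed/llm_body stand in for the frozen LLM's embedding layer and
        # transformer layers; only prompt_encoder receives gradients.
        x_e = llm_embed(input_ids)                     # X_e: (batch, n, e)
        p_e = prompt_encoder(input_ids.size(0))        # P_e: (batch, m, e)
        return llm_body(torch.cat([p_e, x_e], dim=1))  # [P_e; X_e]: (batch, m+n, e)

    # Toy stand-ins to demonstrate shapes; a real run would also freeze the LLM, eg,
    # for p in llm.parameters(): p.requires_grad = False
    embed, body = nn.Embedding(1000, 1024), nn.Identity()
    encoder = SoftPromptEncoder(m=20, h=512, e=1024)
    out = forward_with_soft_prompt(embed, body, encoder, torch.randint(0, 1000, (2, 16)))
    print(out.shape)  # torch.Size([2, 36, 1024])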

Figure 2.

Overall architecture of soft prompting with GatorTronGPT for text generation. Task-specific soft prompts are randomly initialized and encoded by an LSTM model. The encoded soft prompts are first concatenated with the embeddings of the input text and the target text and then passed into the GatorTronGPT model. All GatorTronGPT parameters are frozen, and only the LSTM weights are updated during soft prompting.

Formulating clinical NLP tasks as text-to-text learning

To solve all 7 clinical NLP tasks using a unified LLM, we reformatted all benchmark datasets into a text-to-text learning format, where both the input and output are in text form. Table 1 provides examples for each of the 7 tasks. Specifically, we used “template” phrases to convert various forms of gold-standard annotations into natural-language facts. For clinical concept normalization, we converted the UMLS concept CUIs into their official concept names to fit the text-to-text learning architecture. We concatenated a soft prompt as a prefix to the embeddings derived from the input text, denoted as [prompt, input_text]; each annotated sample was then constructed as [prompt, input_text, target_text]. During soft prompting, we used the gold-standard target_text to optimize the soft prompts. During inference, we converted the generated target_text back to the format of the gold-standard annotations for evaluation.
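As a minimal sketch of this template-based conversion (our illustration; the template wording follows the drug example in Table 1, and the regular-expression inversion is an assumption about one workable implementation):

    import re

    def annotations_to_target(entities):
        # Render (type, span) gold annotations as natural-language facts.
        return " ".join(f"The extracted {etype} entity is {span};" for etype, span in entities)

    def target_to_annotations(generated):
        # Invert the template so generated text can be scored against gold annotations.
        return re.findall(r"The extracted (\w+) entity is ([^;]+);", generated)

    target = annotations_to_target([("drug", "Colchicine"), ("strength", "0.6 mg")])
    assert target_to_annotations(target) == [("drug", "Colchicine"), ("strength", "0.6 mg")]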

Table 1.

Examples for formulating 7 clinical NLP tasks as text-to-text generation.

Task Input data Annotation Converted answer
Clinical concept extraction 6. Colchicine 0.6 mg Tablet Sig: One (1) Tablet PO DAILY (Daily) as needed for Gout flare/pain.
  • Drug: Colchicine

  • Strength: 0.6 mg

  • Form: Tablet, Tablet

  • Dosage: One (1)

  • Frequency: DAILY (Daily) as needed

  • Route: PO

  • Reason: Gout flare/pain

The extracted drug entity is Colchicine; the extracted strength entity is 0.6 mg; the extracted form entity is Tablet; the extracted dosage entity is One (1); the extracted frequency entity is DAILY (Daily) as needed; the extracted route entity is PO; the extracted reason entity is Gout flare/pain.
Clinical relation extraction Social History: Lives at [** Hospital6 3355 **], smoking 28 year pack hx, etoh remote, former IVDU (used once), [** Company 2318 **] bus driver for 18 years
  • (Lives, at [** Hospital6 3355 **], Living status-Type),

  • (smoking, 28 year pack hx, Tobacco-Amount),

  • (etoh, remote, Alcohol-Status),

  • (former IVDU, IVDU, Drug-Method),

  • (former IVDU, used once, Drug-Frequency),

  • (bus driver, for 18 years, Employment-Duration)

The relation between “Lives” and “at [** Hospital6 3355 **]” is “Living status-Type”; the relation between “smoking” and “28 year pack hx” is “Tobacco-Amount”; the relation between “etoh” and “remote” is “Alcohol-Status”; the relation between “former IVDU” and “IVDU” is “Drug-Method”; the relation between “former IVDU” and “used once” is “Drug-Frequency”; the relation between “bus driver” and “for 18 years” is “Employment-Duration.”
Clinical concept normalization
  • Past Medical History:

  • Arthritis

  • carpal tunnel

  • shingles right arm 2000

  • needs right knee replacement

  • left knee replacement in [**2010**]

  • thyroidectomy 1978

  • cholecystectomy [**1981**]

  • hysterectomy 2001

  • h/o LGIB 2000-2001 after taking baby ASA 81 QOD

  • Social History:

  • Her husband died recently.

  • (arthritis, Arthritis, C0003864)

  • (carpal tunnel, Carpal Tunnel Syndrome, C0007286)

  • (shingles, Herpes zoster disease, C0019360)

  • (LGIB, Lower gastrointestinal hemorrhage, C0024050)

The normalized string of the disorder concept “arthritis” is “Arthritis”; the normalized string of the disorder concept “carpal tunnel” is “Carpal Tunnel Syndrome”; the normalized string of the disorder concept “shingles” is “Herpes zoster disease”; the normalized string of the disorder concept “LGIB” is “Lower gastrointestinal hemorrhage.”
Clinical abbreviation disambiguation |CEA|173|175|LABORATORY DATA|PAIN: Negative. ADL STATUS: Energy: Low. Eating: She is eating well. Sleeping: She is sleeping well. Maintaining weight: Yes. LABORATORY DATA: Normal except for an elevated CEA at 6.1 but as the patient has been cutting back her smoking it has gone from 6.9 to 6.1. CHEMOTHERAPY/RADIATION THERAPY HISTORY: The patient has had no chemotherapy or hormone therapy. carcinoembryonic antigen The sense of the abbreviation “CEA” is “carcinoembryonic antigen.”
NLI
  • Premise: The patient was seen by his primary care physician after he had complained of a one-week history of dyspnea on exertion and jaw tightness.

  • Hypothesis: The patient has symptoms of a CHF exacerbation.

Entailment The hypothesis that “The patient has symptoms of a CHF exacerbation” is entailment to the premise that “The patient was seen by his primary care physician after he had complained of a one-week history of dyspnea on exertion and jaw tightness.”
Medication attribute filling Context: “least moderate risk, positive stress test. After discussion with Dr. Camacho, our plan will be continue her on aspirin, beta-blocker, and a low-dose ACE inhibitor through her regimen. She refuses to take a cholesterol-lowering agent due to previous concerns with myalgias. She will be referred for cardiac catheterization within the next several days with a goal to better define her coronary anatomy for the possibility of percutaneous coronary intervention versus cardiac bypass surgery.”
  • Event Classification:

  • (cholesterol-lowering agent, Disposition)

  • Context Classification:

  • (Action, Start)

  • (Negation, Negated)

  • (Temporality, Present)

  • (Certainty, Certain)

  • (Actor, Patient)

  • Event Classification:

  • The category of medication event “cholesterol-lowering agent” is “Disposition.”

  • Context Classification:

  • The category of disposition event “cholesterol-lowering agent” from the dimension of Action is “Start.” The category of disposition event “cholesterol-lowering agent” from the dimension of Negation is “Negated.” The category of disposition event “cholesterol-lowering agent” from the dimension of Temporality is “Present.” The category of disposition event “cholesterol-lowering agent” from the dimension of Certainty is “Certain.”

  • The category of disposition event “cholesterol-lowering agent” from the dimension of Actor is “Patient.”

Progress note understanding
  • Assessment: “45 year old male with no known CAD and aspirin allergy who presented with chest pain and symptoms concerning for unstable angina, now with STE changes on ECG in the setting of chest pain. Now chest pain free.”

  • Plan: “PUMP: Patient with some ECG changes mildly concerning for possible LVH, although does not meet diagnostic criteria on current ECG ' s. - Baseline TTE today to assess.”

Direct The relation between the given assessment and plan subsection is Direct.

The ground-truth annotations are highlighted in bold.

Evaluation metrics

For clinical concept extraction and relation extraction, we used the strict micro-averaged precision, recall, and F1-score, aggregated across all concept and relation categories, for evaluation. For clinical concept normalization, we used both strict and relaxed precision, recall, and F1-score. For clinical abbreviation disambiguation, we used strict precision, recall, and F1-score. For NLI, we used accuracy. For medication attribute filling and progress note understanding, we used the F1-score. We examined two GatorTronGPT models, with 5 and 20 billion parameters, and compared them with previous transformer-based models.
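For reference, micro-averaging pools true positives (TP), false positives (FP), and false negatives (FN) across all categories before computing the scores, so frequent categories weigh more. A small sketch with illustrative counts:

    def micro_prf(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Hypothetical per-category (TP, FP, FN) counts, pooled before scoring.
    per_category = {"Drug": (90, 10, 5), "ADE": (40, 8, 12)}
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    print(micro_prf(tp, fp, fn))  # (0.878..., 0.884..., 0.881...)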

Experimental settings

We developed our prompt-based GatorTronGPT model using the NeMo-Megatron library from NVIDIA and the PyTorch Lightning library. We used a grid search to optimize hyperparameters, including the LSTM hyperparameters, the token length of soft prompts, the number of training epochs, and the training batch size. We used 8 NVIDIA A100-80G GPUs for soft prompting. For inference, we used 1 A100-80G GPU for GatorTronGPT-5B and 4 A100-80G GPUs for GatorTronGPT-20B. Supplementary Table S1 provides the training and inference time for each task.
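A minimal sketch of such a grid search follows (the grid values and the training function are placeholders, not the settings actually used):

    from itertools import product

    grid = {
        "prompt_length": [10, 20, 40],  # token length of the soft prompt
        "lstm_hidden": [512, 1024],
        "epochs": [5, 10],
        "batch_size": [8, 16],
    }

    def train_and_evaluate(cfg):
        # Placeholder for one prompt-tuning run returning a validation F1 score.
        return 0.0

    best_score, best_cfg = -1.0, None
    for values in product(*grid.values()):
        cfg = dict(zip(grid, values))
        score = train_and_evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    print(best_score, best_cfg)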

Results

The proposed soft prompting with a frozen generative LLM, GatorTronGPT, solved 7 clinical NLP tasks using a single unified model and achieved performance better than (on 5 out of 7 tasks) or comparable to the best results reported in the challenges or in subsequent studies. Figure 3 summarizes the comparison results and Table 2 provides detailed evaluation scores.

Figure 3.

Summary of the comparison results of GatorTronGPT models on 7 clinical NLP tasks.

Table 2.

Comparison of soft prompting using frozen GatorTronGPT with previous models.

1. Clinical concept extraction
Datasets: 2018 n2c2 (first three columns), 2022 n2c2 (last three columns)
Model Precision Recall F1 Precision Recall F1
BiLSTM-CRF (best in challenge)33 0.8973 0.8939 0.8956
BERT23 0.8887 0.8728 0.8807 0.8160 0.8483 0.8318
GatorTron30 0.8759 0.9038 0.8896 0.8181 0.8508 0.8341
BERT-MRC29 0.9159 0.8942 0.9018 0.8496 0.8353 0.8424
GatorTron-MRC29 0.9199 0.9012 0.9059 0.8521 0.8396 0.8451
GatorTronGPT-5B 0.9168 0.9057 0.9045 0.8598 0.8415 0.8589
GatorTronGPT-20B 0.9189 0.9081 0.9060 0.8615 0.8424 0.8615
2. Clinical relation extraction

Datasets: 2018 n2c2 (first three columns), 2022 n2c2 (last three columns)
Model Precision Recall F1 Precision Recall F1
BiLSTM-CRF (best in challenge)33 0.9715 0.9548 0.9630
BERT23 0.9598 0.9438 0.9545 0.7998 0.7620 0.7807
GatorTron30 0.9719 0.9482 0.9599 0.7970 0.7631 0.7798
BERT-MRC29 0.9722 0.9489 0.9604 0.8432 0.8371 0.8415
GatorTron-MRC29 0.9724 0.9488 0.9605 0.8445 0.8390 0.8428
GatorTronGPT-5B 0.9734 0.9492 0.9610 0.8551 0.8472 0.8520
GatorTronGPT-20B 0.9746 0.9506 0.9614 0.8560 0.8491 0.8529
3. End-to-end clinical concept extraction and normalization

Dataset: SemEval-2015 Task 14
Model Strict Precision Strict Recall Strict F1 Relaxed Precision Relaxed Recall Relaxed F1
ezDI (best in challenge)43 0.783 0.732 0.757 0.815 0.761 0.787
ULisboa44 0.779 0.705 0.740 0.806 0.729 0.765
UTH-CCB45 0.778 0.696 0.735 0.797 0.714 0.753
GatorTronGPT-5B 0.810 0.751 0.782 0.833 0.769 0.795
GatorTronGPT-20B 0.812 0.772 0.791 0.839 0.785 0.813
4. Clinical abbreviation disambiguation
Dataset: UMN abbreviation
Model Precision Recall F1
BiLSTM46 0.8810 0.8960 0.8840
DistilBERT48 0.9254 0.9358 0.9263
BioBERT47 0.9479 0.9597 0.9505
BlueBERT51 0.9430 0.9518 0.9432
GatorTronGPT-5B 0.9854 0.9832 0.9842
GatorTronGPT-20B 0.9849 0.9830 0.9836

5. NLI
Dataset: MedNLI
Model Accuracy
BioBERT47 0.8050
ClinicalBERT49 0.8270
BioMegatron50 0.8390
GatorTron30 0.8670
GatorTronGPT-5B 0.8832
GatorTronGPT-20B 0.8946
6. Medication attribute filling
Dataset: 2022 n2c2 Track 1
Model Event (F1) Context (Accuracy)
KEEPHA13 (best in challenge) 0.9379 -
GatorTron30 0.9379 0.9126
GatorTronS5 0.9362 0.9080
RoBERTa52 0.8588 0.8417
RoBERTa-MIMIC52 0.9251 0.9121
GatorTronGPT-5B 0.9332 0.8937
GatorTronGPT-20B 0.9346 0.8941

7. Progress note understanding
Dataset: 2022 n2c2 Track 3
Model F1
CMU (best in challenge)10 0.8212
Yale10 0.8133
UWisc10 0.8119
UDuisburg10 0.8034
GatorTronGPT-5B 0.7875
GatorTronGPT-20B 0.7954

Abbreviations: CMU, Carnegie Mellon University; UDuisburg, University of Duisburg–Essen; UWisc, University of Wisconsin.

Best evaluation scores are highlighted in bold.

Clinical concept extraction

The GatorTronGPT-20B model achieved state-of-the-art performance on both the drug-ADE and SDoH datasets. For drug-ADE concept extraction, GatorTronGPT-20B outperformed task-specific models such as BERT and GatorTron by 1.6%-2.5% and achieved performance comparable to GatorTron-MRC, an MRC architecture implemented using hard prompts based on an encoder-only LLM, GatorTron. For the SDoH dataset, which has more concept types and overlapping concepts, the GatorTronGPT-20B model outperformed task-specific models by ∼3% and outperformed the GatorTron-MRC model by 1.6%. We observed performance improvements on both datasets when scaling up from the GatorTronGPT-5B to the GatorTronGPT-20B model.

Clinical relation extraction

The GatorTronGPT-20B model achieved state-of-the-art performance on both benchmark datasets. For drug-ADE relation extraction, GatorTronGPT-20B outperformed task-specific BERT and GatorTron by 0.1% and 0.7%, respectively, and achieved performance comparable to GatorTron-MRC. For SDoH relation extraction, GatorTronGPT-20B outperformed other task-specific models by ∼7% and outperformed GatorTron-MRC by 1%.

Clinical concept normalization

Table 2 compares GatorTronGPT-20B with the top 3 systems developed in the original challenge. GatorTronGPT-20B achieved state-of-the-art performance, outperforming the best system from the original challenge by 3.4% and 2.6% in strict and relaxed F1 scores, respectively.

Clinical abbreviation disambiguation

GatorTronGPT-5B achieved a state-of-the-art F1 score of 0.9842, outperforming previous task-specific transformer models by 3.4%-10% on the UMN abbreviation disambiguation benchmark dataset.

Natural language inference

The GatorTronGPT-20B model achieved state-of-the-art accuracy, outperforming previous task-specific transformers by 5.5%-9% and outperforming our previous GatorTron model by 2.8%.

Medication attribute filling

For the event classification task, GatorTronGPT-20B outperformed task-specific RoBERTa models by 0.9%-7.6% and achieved performance comparable to our previously developed GatorTron and GatorTronS,5 an encoder-only clinical LLM developed using synthetic clinical text generated by GatorTronGPT. For context classification, our previously developed GatorTron achieved the best accuracy.

Progress note understanding

GatorTronGPT-20B achieved performance comparable to the NLP system ranked fourth in this challenge, 2.6% lower than the best performance. The systems from the top 4 teams, Carnegie Mellon University (CMU), Yale, University of Wisconsin (UWisc), and University of Duisburg–Essen (UDuisburg), are all ensembles of multiple models,10 whereas the proposed method is a single-model solution.

Discussion

Text-to-text learning through generative LLMs offers a unified solution across all clinical NLP tasks. Although a previous study in the general English domain examined the encoder–decoder T5 model, that approach is still one step away from the promise of “one model for all NLP tasks.” This study advances that work by exploring soft prompting with a frozen generative LLM, GatorTronGPT, developed from scratch using 277 billion words of mixed clinical and general English text with a GPT-3 architecture and up to 20 billion parameters, thereby fulfilling the goal of maintaining a single model from training to deployment. Our experiments show that the proposed approach achieved state-of-the-art performance on 5 out of 7 major clinical NLP tasks. The proposed solution also outperformed our previously developed encoder-only model GatorTron30 and the hard-prompt-based GatorTron-MRC models,29 demonstrating the versatility and efficiency of soft prompting with frozen generative LLMs across diverse clinical NLP tasks. We observed consistent performance improvements when scaling up the size of GatorTronGPT, indicating that larger generative LLMs enhance text-to-text learning. This study, along with our previous study,5 demonstrates that generative clinical LLMs can serve as all-purpose text analytics engines.

Our approach can directly benefit the real-world deployment of AI systems in healthcare. Real-world clinical decision support and healthcare applications require many complex AI modules, ranging from information extraction and standardization to various classification and language understanding components. In this study, we successfully solved 7 major clinical NLP tasks at the phrase, sentence, and document levels, covering diverse tasks including information retrieval, classification, and language understanding. As more AI-enabled tools are developed for healthcare, the burden of deploying and maintaining them grows significantly. Training with frozen LLMs is more parameter-efficient since only a small fraction of the model parameters (eg, 1.3%-5.5% in our experiments) is updated during training, which greatly reduces the computing cost for practical applications. The success of using a single generative LLM to solve these major clinical NLP tasks suggests that the proposed soft prompting with frozen generative LLMs holds promise for unifying diverse and complex AI modules in healthcare applications.
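The trainable-parameter fraction can be computed directly from the module parameters, as in this sketch (module sizes are illustrative stand-ins, not the actual GatorTronGPT dimensions; with a billion-parameter frozen LLM the fraction drops to a few percent or less):

    import torch.nn as nn

    llm = nn.Linear(4096, 4096)  # stand-in for the frozen LLM
    for p in llm.parameters():
        p.requires_grad = False

    prompt_encoder = nn.LSTM(1024, 512, bidirectional=True)  # the trainable part

    def trainable_fraction(*modules):
        params = [p for m in modules for p in m.parameters()]
        trainable = sum(p.numel() for p in params if p.requires_grad)
        return trainable / sum(p.numel() for p in params)

    print(f"{trainable_fraction(llm, prompt_encoder):.1%}")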

This study demonstrates that generative LLMs, such as GatorTronGPT, are more versatile than encoder-only LLMs such as BERT. Generative LLMs, designed for autoregressive text generation, use a decoder module to efficiently capture linguistic sequences and generate human-like responses. In generative LLMs, both input and output are sequences, which conveniently allows for formulating multiple NLP tasks using text-to-text learning. In contrast, encoder-only transformer models are designed to derive vector representations of input tokens and require additional task-specific layers for different tasks.

To realize the “one model for all” concept from training to deployment, we adopted soft prompting with frozen generative LLMs. This study, along with our previous study29 using encoder-only LLMs, shows that machines can learn “soft” prompts that are more effective and robust than “hard” prompts composed by humans, for both generative and encoder-only LLMs. Previous studies have reported that generative LLMs such as ChatGPT are very sensitive to hard prompts; even minor changes in prompts can lead to dramatically different responses. Our method enables machines to learn robust soft prompts, freeing researchers from the time-consuming task of prompt engineering. Freezing the LLM, that is, keeping its parameters unchanged during prompting, is critical to achieving “one model for all,” as any fine-tuning of the LLM will generate task-specific variations, deviating from the goal. We suspect that fine-tuning an LLM for a specific task will reduce its generalizability to other tasks. This could potentially be alleviated by tuning generative LLMs on many different tasks, that is, multitask instruction learning. Recent studies53–55 have reported that multitask instruction learning could further improve the performance of generative LLMs.

Hallucination remains a challenge due to the probabilistic nature of text-to-text learning algorithms. We conducted an error analysis to examine the hallucinations made by GatorTronGPT. Table 3 shows the three types of hallucinations we observed: (1) nonlogical, (2) irrelevant, and (3) interpretable. To ensure the security and safety of healthcare applications, hallucinations must be controlled to a minimal level. Previously, we proposed a solution29 that constrains free-text generation to an MRC architecture, restricting LLMs to use only words present in the given context when generating results. However, that solution is only applicable to information extraction and is not viable for other NLP tasks such as WSD and NLI. Our current approach, based on soft prompting with frozen generative LLMs, is generalizable to almost all major clinical NLP tasks.
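As a minimal sketch of that MRC-style restriction (our illustration of the general idea; a real implementation would apply the mask over the tokenizer's vocabulary at each decoding step):

    import torch

    def mask_to_context(logits: torch.Tensor, context_ids: set) -> torch.Tensor:
        # Disallow every vocabulary token that does not appear in the input context,
        # so the model can only copy words from the given passage.
        masked = torch.full_like(logits, float("-inf"))
        idx = torch.tensor(sorted(context_ids))
        masked[idx] = logits[idx]
        return masked

    # Example: vocabulary of size 10; only tokens {2, 5, 7} occur in the context.
    logits = torch.randn(10)
    print(mask_to_context(logits, {2, 5, 7}).argmax())  # always 2, 5, or 7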

Table 3.

Observed hallucinations of the clinical relation extraction task on the 2018 n2c2 dataset.

Input data Ground truth Hallucination
1. Nonlogical hallucination: the LLM generated sequences without logic. Disp: * 60 80 mg syringe * Refills: * 1 * 8. [s2] Acetaminophen [e2] 500 mg Tablet Sig: 1-2 Tablets PO [s1] Q6H (every 6 hours) as needed [e1] for pain. The relation between “Q6H (every 6 hours) as needed” and “Acetaminophen” is “Frequency-Drug. − Non-responder—Non-responder—Non-responder
8. docusate sodium 100 mg [s1] Capsule [e1] Sig: One (1) Capsule PO BID (2 times a day). 7. [s2] senna [e2] 8.6 mg Tablet Sig: One (1) Tablet PO BID (2 times a day) as needed for constipation. The relation between “Capsule” and “senna” is “No-relation. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14:
2. Irrelevant hallucination: the LLM generated readable but irrelevant answers. Disp: * 30 Tablet (s) * Refills: * 0 * 24. [s2] clopidogrel [e2] 75 mg Tablet Sig: [s1] One (1) [e1] Tablet PO DAILY (Daily) The relation between “One (1)” and “clopidogrel” is “Dosage-Drug. 1. The drug clopidogrel is a non-steroidal anti-platelet drug that is used to prevent platelet aggregation and reduce the risk of thrombosis in patients with acute coronary syndromes. It is also used to prevent thrombosis in patients with non-cardiac surgery. 2. Clopidogrel is a prodrug that is converted to its active form by the enzyme CYP2C19.
Medications on Admission: atenolol 25 mg daily aspirin 81 mg daily lipitor 10 mg QOD (every other am) prednisone 10 mg daily tamsulosin SR 0.4 mg evening multivitamin 1 tab daily fish oil capsule 1000 mg twice a day [s2] systane lubricant eye [e2] drops 1 gtt [s1] TID [e1]. The relation between “TID” and “systane lubricant eye” is “Frequency-Drug. I have been using nonpreserved systane eye drops for a few years.
3. Interpretable hallucination: the LLM generated a different way of the answer that is interpretable. OXYCODONE—20 mg Tablet Sustained Release 12 hour—3 (Three) Tablet (s) by mouth every morning (60 mg), 1 tablet every 2 pm (20 mg) and 3 tablets every evening (60 mg) [s2] PANTOPRAZOLE [e2] [PROTONIX]—40 mg Tablet, Delayed Release (E.C.)—[s1] 1 [e1] Tablet (s) by mouth once day The relation between “1” and “PANTOPRAZOLE” is “Dosage-Drug. The relation type between the Drug entity “PANTOPRAZOLE” and Dosage entity “1” is “has_dosage.”
Disp: * 30 Tablet (s) * Refills: * 2 * 3. [s2] Fluticasone—Salmeterol [e2] 250 - 50 mcg/Dose Disk with Device Sig: One (1) [s1] Disk with Device [e1] Inhalation Hospital 1 (2 times a day). The relation between “Disk with Device” and “Fluticasone—Salmeterol” is “Form-Drug The relation type between the Drug entity “Fluticasone—Salmeterol” and Form entity “Disk with Device” is “is_a.”

The ground truth relations are highlighted in bold.

Conclusion

This study demonstrates that soft prompting with frozen generative LLMs can serve as an all-purpose clinical text analytics engine that solves various major clinical NLP tasks. Freezing the LLM, that is, keeping its parameters unchanged during prompt tuning, is the key to achieving a “one unified model for all” solution. The proposed approach offers a cost-effective way to adopt generative LLMs for various healthcare applications.

Supplementary Material

ocae078_Supplementary_Data

Acknowledgments

We would like to thank the i2b2 and n2c2 challenge organizers for providing the annotated corpora. We gratefully acknowledge the support of NVIDIA Corporation and the NVIDIA AI Technology Center (NVAITC) UF program. We would like to thank the UF Research Computing team, led by Dr Erik Deumens, for providing computing power through the UF HiperGator-AI cluster.

Contributor Information

Cheng Peng, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States.

Xi Yang, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL 32610, United States.

Aokun Chen, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL 32610, United States.

Zehao Yu, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States.

Kaleb E Smith, NVIDIA, Santa Clara, CA 95051, United States.

Anthony B Costa, NVIDIA, Santa Clara, CA 95051, United States.

Mona G Flores, NVIDIA, Santa Clara, CA 95051, United States.

Jiang Bian, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL 32610, United States.

Yonghui Wu, Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL 32610, United States.

Author contributions

Cheng Peng and Yonghui Wu were responsible for the overall design, development, and evaluation of this study. Cheng Peng, Aokun Chen, Zehao Yu, and Kaleb E. Smith had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Cheng Peng and Yonghui Wu did the bulk of the writing; Kaleb E. Smith, Anthony B. Costa, Mona G. Flores, and Jiang Bian also contributed to the writing and editing of this manuscript. All authors reviewed the manuscript critically for scientific content, and all authors gave final approval of the manuscript for publication.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This study was partially supported by a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-2018C3-14754), a grant from the National Cancer Institute, R01CA246418, grants from the National Institute on Aging, NIA R56AG069880, R01AG080624, R01AG083039, R01AG080991, R01AG084236, R01AG084178, R01AG076234, and R33AG062884, Ed and Ethel Moore Alzheimer’s Disease Research Program 23A09, National Institute of Allergy and Infectious Diseases, NIAID R01AI172875, National Heart, Lung, and Blood Institute, NHLBI R01HL169277, the Cancer Informatics Shared Resource supported by the UF Health Cancer Center and the UF Clinical and Translational Science Institute Biomedical Informatics Program. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions.

Conflicts of interest

K.E.S., A.B.C., and M.G.F. are employed by NVIDIA. There are no other competing financial or non-financial interests. The work presented in this study was conducted exclusively within the University of Florida Health.

Data availability

The 2018 n2c2 and 2022 n2c2 datasets are available from the n2c2 website after data use agreement (see https://n2c2.dbmi.hms.harvard.edu/ for details). The SemEval-2015 dataset is available from: https://alt.qcri.org/semeval2015/task14/. The UMN abbreviation dataset is available from: https://conservancy.umn.edu/handle/11299/137703. The MedNLI dataset is available from https://physionet.org/content/mednli/1.0.0/.

References

  • 1. Collobert R, Weston J, Bottou L, et al.  Natural language processing (almost) from scratch. J Mach Learn Res. 2011;12:2493-2537. https://dl.acm.org/doi/10.5555/1953048.2078186 [Google Scholar]
  • 2. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171-4186. 10.18653/v1/N19-1423 [DOI]
  • 3. Sutskever I, Vinyals O, Le QV.  Sequence to sequence learning with neural networks. Adv Neural Inf Process Syst. 2014;27:3104-3112. https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html [Google Scholar]
  • 4. Raffel C, Shazeer N, Roberts A, et al.  Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1):5485-5551. https://dl.acm.org/doi/10.5555/3455716.3455856 [Google Scholar]
  • 5. Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. NPJ Digit Med. 2023;6(1):1-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Liu P, Yuan W, Fu J, et al.  Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv. 2023;55(9):1-35. 10.1145/3560815 [DOI] [Google Scholar]
  • 7. Graves A, Fernández S, Schmidhuber J.  Bidirectional LSTM networks for improved phoneme classification and recognition. Artificial Neural Networks: Formal Models and Their Applications—ICANN 2005. Springer; 2005:799-804. 10.1007/11550907_126 [DOI] [Google Scholar]
  • 8. Nadkarni PM, Ohno-Machado L, Chapman WW.  Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544-551. 10.1136/amiajnl-2011-000464 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Uzuner Ö, South BR, Shen S, et al.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc. 2011;18(5):552-556. 10.1136/amiajnl-2011-000203 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Gao Y, Dligach D, Miller T, et al.  Progress note understanding–assessment and plan reasoning: overview of the 2022 N2C2 track 3 shared task. J Biomed Inform. 2023;142:104346. 10.1016/j.jbi.2023.104346 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Suominen H, Salanterä S, Velupillai S, et al.  Overview of the ShARe/CLEF eHealth evaluation lab 2013. In: Forner P, Müller H, Paredes R, eds. Information Access Evaluation. Multilinguality, Multimodality, and Visualization. Springer; 2013:212-231. 10.1007/978-3-642-40802-1_24 [DOI] [Google Scholar]
  • 12. Sun W, Rumshisky A, Uzuner O.  Evaluating temporal relations in clinical text: 2012 i2b2 challenge. J Am Med Inform Assoc. 2013;20(5):806-813. 10.1136/amiajnl-2013-001628 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Mahajan D, Liang JJ, Tsou C-H, et al.  Overview of the 2022 n2c2 shared task on contextualized medication event extraction in clinical notes. J Biomed Inform. 2023;144:104432. 10.1016/j.jbi.2023.104432 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc; 2001:282-289.
  • 15. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM; 1992. 10.1145/130385.130401 [DOI]
  • 16. Mikolov T, Sutskever I, Chen K, et al.  Distributed representations of words and phrases and their compositionality. In: Burges CJ, Bottou L, Welling M, eds. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2013. https://proceedings.neurips.cc/paper_files/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf [Google Scholar]
  • 17. Wu Y, Jiang M, Xu J, et al.  Clinical named entity recognition using deep learning models. AMIA Annu Symp Proc. 2017;2017:1812-1819. https://www.ncbi.nlm.nih.gov/pubmed/29854252 [PMC free article] [PubMed] [Google Scholar]
  • 18. Yang X, Bian J, Fang R, et al.  Identifying relations of medications with adverse drug events using recurrent convolutional neural networks and gradient boosting. J Am Med Inform Assoc. 2020;27(1):65-72. 10.1093/jamia/ocz144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics; 2016:260-270. 10.18653/v1/N16-1030 [DOI]
  • 20. Yang X, Bian J, Wu Y.  Detecting medications and adverse drug events in clinical notes using recurrent neural networks. In: Liu F, Jagannatha A, Yu H, eds. Proceedings of the 1st International Workshop on Medication and Adverse Drug Event Detection.  PMLR; 2018:1-6. https://proceedings.mlr.press/v90/yang18a.html [Google Scholar]
  • 21. Yang X, Lyu T, Li Q, et al.  A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med Inform Decis Mak. 2019;19(Suppl 5):232. 10.1186/s12911-019-0935-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Vaswani A, Shazeer N, Parmar N, et al.  Attention is all you need. Adv Neural Inf Process Syst. 2017;30. https://proceedings.neurips.cc/paper/7181-attention-is-all [Google Scholar]
  • 23. Yang X, Bian J, Hogan WR, et al.  Clinical concept extraction using transformers. J Am Med Inform Assoc. 2020;27(12):1935-1942. 10.1093/jamia/ocaa189 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Si Y, Wang J, Xu H, et al.  Enhancing clinical concept extraction with contextual embeddings. J Am Med Inform Assoc. 2019;26(11):1297-1304. 10.1093/jamia/ocz096 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. Proceedings of the Advances in Neural Information Processing System.  2020;33:1877-1901. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html [Google Scholar]
  • 26. Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv [cs.CL], 2023. http://arxiv.org/abs/2307.09288.
  • 27. Yang X, Yu Z, Guo Y, et al. Clinical relation extraction using transformer-based models. arXiv [cs.CL], 2021. http://arxiv.org/abs/2107.08957.
  • 28. Joshi M, Chen D, Liu Y, et al.  SpanBERT: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist. 2020;8:64-77. 10.1162/tacl_a_00300 [DOI] [Google Scholar]
  • 29. Peng C, Yang X, Yu Z, et al.  Clinical concept and relation extraction using prompt-based machine reading comprehension. J Am Med Inform Assoc. 2023;30(9):1486-1493. 10.1093/jamia/ocad107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Yang X, Chen A, PourNejatian N, et al.  A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194. 10.1038/s41746-022-00742-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Agrawal M, Hegselmann S, Lang H, et al. Large language models are few-shot clinical information extractors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022;1998-2022. https://aclanthology.org/2022.emnlp-main.130.pdf
  • 32. Peng C, Yang X, Smith KE, et al. Model tuning or prompt tuning? A study of large language models for clinical concept and relation extraction. J Biomed Inform. 2024;153. [DOI] [PMC free article] [PubMed]
  • 33. Henry S, Buchan K, Filannino M, et al.  2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J Am Med Inform Assoc. 2020;27(1):3-12. 10.1093/jamia/ocz166 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Lybarger K, Yetisgen M, Uzuner Ö.  The 2022 n2c2/UW shared task on extracting social determinants of health. J Am Med Inform Assoc. 2023;30(8):1367-1378. 10.1093/jamia/ocad012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Johnson AEW, Pollard TJ, Shen L, et al.  MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. 10.1038/sdata.2016.35 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Davenport M.  Introduction to modern information retrieval. 3rd ed. J Med Libr Assoc. 2012;100(1):75-75. 10.3163/1536-5050.100.1.017 [DOI] [Google Scholar]
  • 37. Elhadad N, Pradhan S, Gorman S, et al. SemEval-2015 task 14: analysis of clinical text. In: Nakov P, Zesch T, Cer D, et al. , eds. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics; 2015:303-310. 10.18653/v1/S15-2051 [DOI]
  • 38. Bodenreider O, McCray AT.  Exploring semantic groups through visual approaches. J Biomed Inform. 2003;36(6):414-432. 10.1016/j.jbi.2003.11.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Saeed M, Lieu C, Raber G, et al.  MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. Computers in Cardiology. IEEE; 2002:641-644. 10.1109/CIC.2002.1166854 [DOI] [PubMed] [Google Scholar]
  • 40. Moon S, Pakhomov S, Melton G. Clinical abbreviation sense inventory. 2012. Accessed October 30, 2023. http://conservancy.umn.edu/handle/11299/137703
  • 41. Shivade C. Mednli-a natural language inference dataset for the clinical domain. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics;2019:1586-1596. [Google Scholar]
  • 42. Gao L, Biderman S, Black S, et al. The pile: an 800GB dataset of diverse text for language modeling. arXiv [cs.CL]. 2020. http://arxiv.org/abs/2101.00027.
  • 43. Pathak P, Patel P, Panchal V, et al. ezDI: a supervised NLP system for clinical narrative analysis. In: Nakov P, Zesch T, Cer D, et al. , eds. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics;2015:412-416. 10.18653/v1/S15-2071 [DOI]
  • 44. Leal A, Martins B, Couto F. ULisboa: recognition and normalization of medical concepts. In: Nakov P, Zesch T, Cer D, et al. , eds. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).  Association for Computational Linguistics; 2015:406-411. 10.18653/v1/S15-2070 [DOI] [Google Scholar]
  • 45. Xu J, Zhang Y, Wang J, et al. UTH-CCB: the participation of the SemEval 2015 challenge—task 14. In: Nakov P, Zesch T, Cer D, et al. , eds. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).  Association for Computational Linguistics; 2015:311-314. 10.18653/v1/S15-2052 [DOI] [Google Scholar]
  • 46. Cevik M, Mohammad Jafari S, Myers M, et al.  Sequence labeling for disambiguating medical abbreviations. J Healthc Inform Res. 2023;7(4):501-526. 10.1007/s41666-023-00146-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Lee J, Yoon W, Kim S, et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. 10.1093/bioinformatics/btz682 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Sanh V, Debut L, Chaumond J, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2020. Accessed November 1, 2023. http://arxiv.org/abs/1910.01108
  • 49. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Rumshisky A, Roberts K, Bethard S, et al. , eds. Proceedings of the 2nd Clinical Natural Language Processing Workshop.  Association for Computational Linguistics; 2019:72-78. 10.18653/v1/W19-1909 [DOI] [Google Scholar]
  • 50. Shin H-C, Zhang Y, Bakhturina E, et al. BioMegatron: larger biomedical domain language model. In: Webber B, Cohn T, He Y, et al. , eds. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:4700-4706. 10.18653/v1/2020.emnlp-main.379 [DOI]
  • 51. Peng Y, Yan S, Lu Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Demner-Fushman D, Cohen KB, Ananiadou S, et al. , eds. Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics; 2019:58-65. 10.18653/v1/W19-5006 [DOI]
  • 52. Chen A, Yu Z, Yang X, et al.  Contextualized medication information extraction using transformer-based deep learning architectures. J Biomed Inform. 2023;142:104370. 10.1016/j.jbi.2023.104370 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. 2022. Accessed March 23, 2023. http://arxiv.org/abs/2212.13138. [DOI] [PMC free article] [PubMed]
  • 54. Wei J, Bosma M, Zhao VY, et al. Finetuned language models are zero-shot learners. International Conference on Learning Representations. 2022. https://openreview.net/pdf?id=gEZrGCozdqR
  • 55. Chung HW, Hou L, Longpre S, et al. Scaling instruction-finetuned language models. arXiv [cs.LG]. 2022. http://arxiv.org/abs/2210.11416.


