
This is a preprint.

It has not yet been peer reviewed by a journal.


Research Square
[Preprint]. 2024 Jun 28:rs.3.rs-4559971. [Version 1] doi: 10.21203/rs.3.rs-4559971/v1

Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing

Enshuo Hsu 1, Kirk Roberts 2
PMCID: PMC11230489  PMID: 38978609

Abstract

The performance of deep learning-based natural language processing systems depends on large amounts of labeled training data which, in the clinical domain, are not easily available or affordable. Weak supervision and in-context learning offer partial solutions to this issue, particularly using large language models (LLMs), but their performance still trails traditional supervised methods with moderate amounts of gold-standard data. Moreover, inference with LLMs is computationally heavy. We propose an approach that combines LLM fine-tuning and weak supervision with virtually no domain knowledge that still achieves consistently dominant performance. Using a prompt-based approach, the LLM is used to generate weakly-labeled data for training a downstream BERT model. The weakly supervised model is then further fine-tuned on small amounts of gold standard data. We evaluate this approach using Llama2 on three different n2c2 datasets. With no more than 10 gold standard notes, our final BERT models weakly supervised by fine-tuned Llama2-13B consistently outperformed out-of-the-box PubMedBERT by 4.7–47.9% in F1 scores. With only 50 gold standard notes, our models achieved performance close to that of fully fine-tuned systems.

Keywords: Natural language processing, Large language models, Electronic health records, Weak supervision

Introduction

Deep learning-based natural language processing (NLP) has achieved remarkable success in the open domain. However, achieving optimal performance in the clinical domain faces many challenges. First, training such complex architectures often requires a large labeled corpus1. Second, specific subpopulations (e.g., rare diseases, minority ethnicities) are often under-represented in clinical notes2, magnifying the consequences of underpowered datasets. Third, even with sufficient notes available in electronic health records (EHRs), the protection of patient privacy makes access to the corpus challenging. Finally, manual annotation of a gold standard is not only a labor-intensive task, it also requires advanced clinical knowledge for interpretation of the text in clinical notes3,4. In recent years, approaches including weak supervision and in-context learning have been developed to address these challenges1,4.

Weak supervision, which utilizes labeling functions (LFs) to generate noisy weak labels for model training, has already been adopted in the clinical domain3,5–11. Despite its promise, weak supervision still requires significant resources to construct LFs. The rule-based approach requires domain experts to handcraft decision rules5–9. The ontology-based approach requires that the concepts of interest be included in existing ontologies or dictionaries10,11. Data programming requires significant efforts from programmers who have a thorough understanding of the clinical data3.

In-context learning, in which pre-trained large language models (LLMs) are prompted to predict textual outputs, is a relatively new method. In theory, it requires few (“few-shot”) or even no (“zero-shot”) training data12,13. However, recent studies raised concerns about underperformance14–16 and instability17 in the medical domain. Despite the appealing idea, at this point, there is no strong evidence to support the use of in-context learning as the frontline approach in a medical NLP system. Furthermore, due to the model sizes (measured as the number of parameters), LLM inference requires significant computation resources. We estimate that performing in-context learning with Llama2-13B, a 13 billion-parameter model18, for the 2018 i2b2 benchmark19 (a subset of 505 discharge summaries from the MIMIC-III dataset20) requires 3.3 × 10^12 floating point operations (FLOPs) per input sentence. On the other hand, predicting with a Bidirectional Encoder Representations from Transformers (BERT) model with 110 million parameters only requires 4.4 × 10^10 FLOPs per input sentence. This computational difference results in a dramatic difference in GPU time such that running inference over the entire collection of MIMIC-III discharge summaries would take an estimated 727 days on an NVIDIA A100 GPU while predicting with BERT would only take around 18 hours (Fig. 2).

Figure 2.

Figure 2

Benchmarking GPU hours with MIMIC-III discharge summaries. The 505 discharge summaries in the 2018 i2b2 challenge were used to project the entire collection of discharge summaries. Running on an NVIDIA A100 GPU, Llama2-13B requires 727 days of GPU time, while PubMedBERT only requires about 18 hours.

Recently, a few attempts have been made to combine the benefits of both weak supervision and in-context learning21,22. However, to our knowledge, there is no evaluation of an end-to-end approach in the medical domain that prompts an LLM for weak supervision and fine-tunes smaller models on the downstream task gold standard. The benefits and limitations of this method in a practical scenario where a small number of annotated notes are available have not been evaluated. Furthermore, fine-tuning LLMs, which has shown significant benefits in recent studies23, has not been considered in such pipelines. Therefore, we propose an LLM-powered weak supervision approach that 1) minimizes the domain expertise needed for rule-crafting and data programming and removes the dependency on ontologies by using the LLM to create weak labels, 2) leverages the latest prompt-based supervised fine-tuning (SFT) techniques to fine-tune LLMs, 3) consistently achieves dominant performance by weakly supervising and fine-tuning BERT24 models for downstream tasks, and 4) avoids the computational burden of deploying LLMs in the production environment.

In this study, we evaluated four experimental settings as detailed in Table 1. The primary method, Llama-SFTn-WS-BERTn, starts with supervised fine-tuning (SFT) of Llama2-13B with a certain number (n) of gold standard notes in the training set. The fine-tuned Llama model then performs in-context learning on the rest of the training set to generate weak labels. We use the weak labels to perform weak supervision (WS) on BERT, followed by final fine-tuning with gold standards. Considering the high GPU memory requirement of SFT, we also propose a compact version, Llama-WS-BERTn, in which the SFT of Llama2 is omitted and Llama2 is used out-of-the-box to perform weak supervision on BERT. For comparison, we evaluated two baselines, Llama-SFTn and BERTn, in which Llama2-13B and PubMedBERT, respectively, were fine-tuned with n gold standard notes. Details are described in the Methods section.

Table 1.

Experimental settings

Notation Description Product
Proposed methods Llama-SFTn-WS-BERTn Llama2-13B is supervised fine-tuned (SFT) with n gold standard notes in the training set, then performs few-shot in-context learning to generate weak labels. Weakly supervise BERT and fine-tune BERT with n gold standard notes. A weakly supervised and fine-tuned BERT
Llama-WS-BERTn Llama2-13B out-of-the-box performs few-shot in-context learning to generate weak labels. Weakly supervise BERT and fine-tune BERT with n gold standard notes. A weakly supervised and fine-tuned BERT
Baselines Llama-SFTn Llama2-13B is supervised fine-tuned with n gold standard notes in the training set. A fine-tuned Llama
BERTn Fine-tune BERT with n gold standard notes. A fine-tuned BERT

We evaluated three widely used biomedical benchmarks, the 201225, 201426, and 201819 Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenges for temporal relation extraction, protected health information (PHI) de-identification, and adverse drug events (ADEs) and medication properties extraction, respectively.

This study demonstrates a robust usage of LLMs that requires minimal to zero human input while achieving significant improvement in well-established benchmarks. We hypothesize that this approach is a safe and effective means of augmenting existing supervised clinical NLP approaches by inserting this simple technique between the now-standard pre-training and fine-tuning steps.

Results

LLM inference is computationally expensive

On the 2018 benchmark, which contains a subset of 505 MIMIC-III discharge summaries, Llama2-13B spent 147 GPU hours in total, with a median of 16 minutes (Q1–Q3 = 11–22 minutes) to create weak labels for each note. Since the computation for inference scales linearly with the number of input instances, we fit a linear regression model to project the total GPU time for labeling all 59,652 discharge summaries in the MIMIC-III dataset. The projected time on a single NVIDIA A100 GPU is 727 days. PubMedBERT took 9 minutes in total, with a median of 1 second (Q1–Q3 = 0.7–1.4 seconds) per note. The projected time for labeling all the discharge summaries in MIMIC-III is 18 hours and 16 minutes.

LLM-generated weak labels

For the 2012 benchmark, the out-of-the-box Llama2-13B and the fine-tuned Llama-SFT3 generated weak labels with 9,804 and 20,402 entities, respectively. The median numbers of entities per sentence were 2 and 3, respectively. For the 2014 benchmark, Llama2-13B and Llama-SFT3 generated weak labels with 18,062 and 15,190 entities, respectively, with a median of 1 entity per sentence. For the 2018 benchmark, Llama2-13B and Llama-SFT3 generated weak labels with 53,177 and 56,169 entities, with medians of 1 and 4 entities per sentence, respectively. Our post-processing algorithm was able to handle the majority of LLM predictions, with less than 1% of sentences failing due to inconsistent output formats (Table 2).

Table 2.

Summary of LLM-generated weak labels

Benchmarks (Training set) 2012 2014 2018
Features
Notes 190 790 303
Sentences 5,995 34,101 46,228
Total entities 17,933 17,401 50,951
Entities per sentence, median [Q1, Q3] 3 [2, 4] 0 [0, 0] 0 [0, 0]
Entities per sentence, mean (Std Dev) 3.14 (2.34) 0.51 (1.72) 1.1 (3.13)
Entities per note, median [Q1, Q3] 82 [56, 120] 18 [13, 27] 147 [96, 224]
Weak labels generated by Llama2-13B
Post-processing failed, sentences (%) 2 2 22
Total entities 9,804 18,062 53,177
Entities per sentence, median [Q1, Q3] 2 [1, 3] 1 [1, 1] 1 [1, 3]
Entities per note, median [Q1, Q3] 46 [31, 61] 19 [12, 29] 165 [102, 232]
Weak labels generated by Llama-SFT3
Post-processing failed, sentences (%) 14 27 82
Total entities 20,402 15,190 56,169
Entities per sentence, median [Q1, Q3] 3 [2, 5] 1 [1, 4] 4 [2, 7]
Entities per note, median [Q1, Q3] 88 [59, 132] 15 [10, 21] 181 [111, 246]
Weak labels generated by Llama-SFT5
Post-processing failed, sentences (%) 76 34 48
Total entities 18,676 15,107 48,143
Entities per sentence, median [Q1, Q3] 3 [2, 5] 1 [1, 3] 4 [2, 7]
Entities per note, median [Q1, Q3] 82 [57, 119] 14 [9, 22] 147 [90, 215]
Weak labels generated by Llama-SFT10
Post-processing failed, sentences (%) 69 28 25
Total entities 18,386 13,743 46,223
Entities per sentence, median [Q1, Q3] 3 [2, 5] 1 [1, 2] 4 [2, 7]
Entities per note, median [Q1, Q3] 81 [54, 117] 15 [11, 20] 147 [84, 207]
Weak labels generated by Llama-SFT50
Post-processing failed, sentences (%) 72 54 59
Total entities 18,420 18,415 47,688
Entities per sentence, median [Q1, Q3] 3 [2, 5] 1 [1, 3] 4 [2, 7]
Entities per note, median [Q1, Q3] 77 [57, 118] 17 [12, 25] 150 [87, 212]

Proposed method: Llama-SFTn-WS-BERTn

Our primary proposed method, Llama-SFTn-WS-BERTn, consistently achieved dominant performance in most experiments across the three benchmarks. In the extremely low-resourced setting in which only 3 gold standard notes were used, on the 2012 events benchmark, the 2012 time expression benchmark, the 2014 benchmark, and the 2018 benchmark, Llama-SFT3-WS-BERT3 achieved F1 scores of 0.7765, 0.7538, 0.6336, and 0.7747, while the baseline Llama-SFT3 achieved 0.7418, 0.6045, 0.5898, and 0.6252, and BERT3 achieved 0.5953, 0.2753, 0.3083, and 0.6555. Llama-SFT3-WS-BERT3 outperformed the Llama-SFT3 baseline by 3.5–15.0% and the BERT3 baseline by 11.9–47.9% in F1 score. When 10 gold standard notes were used, Llama-SFT10-WS-BERT10 achieved F1 scores of 0.8466, 0.8448, 0.6942, and 0.8005, which is 3.2–14.6% higher than the Llama-SFT10 baseline and 4.7–16.8% higher than the BERT10 baseline. In the relatively annotation-abundant scenario in which 50 gold standard notes were used, Llama-SFT50-WS-BERT50 achieved performance close to fully supervised BERT models, falling only 2.8%, 2.5%, 6.1%, and 2.2% lower in F1 score. On the 2012 time expression benchmark, however, the F1 score of Llama-SFT50-WS-BERT50 was slightly lower than that of BERT50, by 1.3% (Fig. 3).

Figure 3.

Figure 3

Weakly supervised end models fine-tuned on 3, 5, 10, and 50 gold standard notes from the training set compared to BERT models without weak supervision. (A) 2012 i2b2 challenge events extraction F1 score and (B) temporal expression extraction F1 score. (C) 2014 i2b2 challenge Strict micro F1 score and (D) Relaxed micro F1 score. (E) 2018 i2b2 challenge Strict micro F1 score and (F) Lenient micro F1 score.

Proposed method: Llama-WS-BERTn

The compact method Llama-WS-BERTn showed improved performance on most benchmarks. On the 2012 events benchmark, Llama-WS-BERTn and Llama-SFTn had similar performance, with differences of 0.5–2.5% for n from 3 to 50, while Llama-WS-BERTn outperformed BERTn by up to 17.1%. On the 2012 temporal expression benchmark, Llama-WS-BERTn and Llama-SFTn had similar performance when n was less than 10, while Llama-WS-BERT10 and Llama-WS-BERT50 outperformed Llama-SFT10 and Llama-SFT50 by 7.9% and 13.5%, respectively. On the 2014 benchmark, Llama-WS-BERTn and Llama-SFTn had similar performance except at n = 5. Llama-WS-BERTn outperformed BERTn by 3.5–27.5%. On the 2018 benchmark, Llama-WS-BERTn outperformed Llama-SFTn and BERTn by 5.6–11.8% and 1.1–8.8%, respectively. Overall, Llama-WS-BERTn performs similarly to or better than the Llama-SFTn baseline while dominating the BERTn baseline on most benchmarks (Fig. 3).

Baseline methods

On the 2012 benchmarks, under the low-resource setting (n < 10), Llama-SFTn performed better than BERTn; when n = 50, however, BERT50 outperformed Llama-SFT50. On the 2014 benchmark, Llama-SFTn outperformed BERTn across the board by 5–28.2%. On the 2018 benchmark, BERTn outperformed Llama-SFTn by 1.5–7.4% (Fig. 3).

Llama3 large language model

As a stand-alone sensitivity analysis, we evaluated a recently published large language model, Llama3, with 70 billion parameters27. We applied the compact Llama-WS-BERTn method, which does not require fine-tuning the LLM. On the 2018 benchmark, Llama3-70B consistently outperformed Llama2-13B by 1.1–6.1% under the Llama-WS-BERTn setting, while weak supervision with the fine-tuned Llama2 (Llama-SFTn-WS-BERTn) still achieved consistently higher performance (Figure S1).

Discussion

We proposed an LLM-powered weak supervision system that requires minimal to zero domain knowledge and improves the performance of clinical information extraction by 4.7–47.9% over the BERT baseline when no more than 10 gold standard notes were used for training. When 50 gold standard notes were used, our system achieved performance similar to a fully supervised BERT, with a 2.2–6.1% difference. The method showed an overall benefit at low training sizes across the three benchmarks. Considering the computational burden of fine-tuning LLMs, we also proposed a compact version using Llama2 out-of-the-box that achieved improved performance across the board. The products of our methods are fine-tuned BERT models with 110 million parameters. Compared to modern LLMs, which often have billions of parameters, the compact size makes model deployment more computationally efficient. Our framework (i.e., LLM, SFT, prompt templates, and post-processing algorithms) is domain-independent and can be applied to most medical information extraction systems. We expect the performance of this framework to improve further as more medically-focused LLMs become available. We conclude that the proposed method is a generalizable and effortless booster for low-training-size scenarios.

This study is one of the early works exploring the potential use of LLMs in the medical domain. Recent studies have debated the feasibility and performance of in-context learning for information extraction14,15,17,28. Following the ideas of LLM-powered labeling functions29 and clinical knowledge distillation21, we proposed a robust alternative that combines supervised fine-tuning of LLMs, in-context learning, and weak supervision to achieve consistently dominant performance. As a knowledge-free alternative to labeling functions, our study also points out a direction in which current weak supervision methods could be freed from their heavy reliance on domain expert input and ontologies.

On the 2012 time expression benchmark, when 50 gold standard notes were used for training, our Llama-SFT50-WS-BERT50 performed slightly worse, by 1.3%, than the BERT50 baseline. This finding is consistent with a recent weak supervision study that showed a negative impact when a large number of training notes was provided3. The most likely explanation is that when gold standards are adequate to provide the model with correct knowledge, the noise in the weak labels outweighs the benefits. However, the performance drops in such cases with our approach are quite small, suggesting that the approach offers enduring upsides with little chance of catastrophic loss, unlike other LLM use cases.

On the 2014 and the 2018 benchmarks, we observed reversed results between the two baselines Llama-SFTn and BERTn, in which Llama-SFTn performed better on the 2014 PHI de-identification task while BERTn performed better on the 2018 ADE & medication extraction task. One explanation is that since Llama2 is a general-domain model while PubMedBERT is a biomedical model, the former might have advantages in solving non-medical problems such as PHI identification while the latter has advantages in solving medical problems such as extracting medication terms.

In a stand-alone sensitivity analysis, we explored the newer and larger version of Llama, Llama3-70B, and demonstrated that though the choice of LLMs plays a role in the performance, adopting our proposed methods gives consistent benefits.

Despite the promising results, this study does have a few limitations. First, unlike other weak supervision studies in which a large number of unlabeled notes were processed by LFs2,3, for computational considerations we chose benchmarks with relatively small sample sizes. We would expect the performance of our approach to increase with larger weakly-labeled datasets, though this requires further experimentation. However, even with fewer than 800 notes, the LLM was able to generate weak labels that dramatically improved performance. Second, as an initial work, we did not evaluate many different LLMs. We selected Llama2-13B as the main LLM and explored Llama3-70B on the side, based on their reported performance in the medical domain and their open-source and lightweight nature30. Other open-source LLMs should be evaluated in future studies. Third, to keep the study focused, we did not evaluate different settings in supervised fine-tuning (e.g., prompt templates, learning rate), in-context learning (e.g., prompt templates, the number of few-shot examples), post-processing (e.g., label harmonization), or BERT model fine-tuning. We followed reported best practices for those12,14,18. We expect the performance to improve further if those details are carefully tuned.

Conclusion

In conclusion, we proposed a novel method that combines LLMs and weak supervision for high-performance medical information extraction while minimizing domain knowledge dependence. Our method shows a consistent benefit. Further performance improvements are anticipated with more refined in-context learning and fine-tuning.

Methods

Figure 1 provides an overview of our approach. We first constructed a prompt template with a system prompt, an instruction, few-shot examples sampled from the training set, and an input/output placeholder. For a given set of n gold standard notes, we fine-tuned Llama2-13B via prompt-based supervised fine-tuning (SFT). We then used the fine-tuned Llama2 for few-shot in-context learning on the unannotated notes to generate weak labels. The weak labels were used to fine-tune (“weakly supervise”) a BERT model. The BERT model was then fine-tuned with the gold standard notes to achieve optimal performance.

Figure 1.

Figure 1

Methodology flowchart. A prompt template is constructed with a few random sentences from the training set as few-shot examples. Given a set of annotated gold standard notes, we first fine-tune Llama2-13B, then use the fine-tuned model to perform in-context learning to weakly supervise a PubMedBERT model. Finally, we fine-tune the BERT model with the gold standard notes and use it in the production environment.

Benchmarks

We used datasets and tasks from the 201225, 201426, and 201819 Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenges as benchmarks.

The 2012 i2b2 challenge focused on temporal relation extraction with 310 annotated clinical notes. Entities include 1) clinically significant events (“EVENT”), such as problems, tests, treatments, clinical departments, admissions, and transfers between departments, and 2) temporal expressions (“TIMEX3”), which are date, time, duration, or frequency phrases. For this study, the F1 scores for events and time expressions are used as the main metrics, while the temporal relations between events and time expressions are not evaluated.

The 2014 i2b2 challenge de-identification track focused on extracting Health Insurance Portability and Accountability Act (HIPAA) protected health information (PHI) from 1304 annotated clinical notes. We used the i2b2-PHI entities which include 7 types of PHI. We used strict and relaxed micro F1 scores as the main metrics.

The 2018 i2b2 challenge track 2 focused on the extraction of adverse drug events (ADEs) and medication properties from 505 discharge notes. The concept extraction task defined 9 entity types: drug, strength, form, dosage, frequency, route, duration, reason, and ADE. We used the Strict and Lenient micro F1 as the main metrics.

Prompt templates

We prepared a prompt template for each benchmark task, as highlighted in Fig. 1 and listed in Table S1. Our design follows recent studies in prompt engineering14,21 and includes 4 sections: 1) a system prompt, in which a role is assigned to Llama2 to provide context and to avoid triggering the safety features of the LLM; 2) an instruction, which is a narrative description of the background (e.g., medical notes), task (e.g., named entity recognition, entity types), and expected output (i.e., the entity text and the entity type); 3) few-shot examples, in which 8 randomly sampled sentences and the corresponding gold standard labels were listed in JavaScript Object Notation (JSON) format; and 4) an input placeholder, in which each sentence's text was placed before the prompt was fed to the LLM. The LLM outputs text following the "[/INST]" special token, which we collected for post-processing.
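As an illustration, the four-section template can be assembled as below. This is a minimal sketch: the system prompt, instruction wording, helper names, and examples are hypothetical, not the exact template from Table S1.

```python
import json

# Hypothetical system prompt and instruction; illustrative only.
SYSTEM = "You are a clinical NLP assistant that extracts entities from medical notes."
INSTRUCTION = (
    "Extract all drug, strength, and ADE entities from the sentence. "
    "Return a JSON list of objects with 'entity' and 'type' keys."
)

def build_prompt(few_shot_examples, input_sentence):
    """Assemble a Llama2-chat style prompt: system prompt, instruction,
    few-shot examples in JSON, then the input sentence before [/INST]."""
    shots = "\n".join(
        f"Sentence: {sent}\nOutput: {json.dumps(labels)}"
        for sent, labels in few_shot_examples
    )
    return (
        f"[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n"
        f"{INSTRUCTION}\n\n{shots}\n\n"
        f"Sentence: {input_sentence}\nOutput: [/INST]"
    )

examples = [("Patient was given aspirin 81 mg.",
             [{"entity": "aspirin", "type": "Drug"},
              {"entity": "81 mg", "type": "Strength"}])]
prompt = build_prompt(examples, "Metoprolol caused bradycardia.")
```

Serializing the few-shot labels with `json.dumps` shows the model the same JSON format it is asked to produce for the input sentence.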

Supervised fine-tuning (SFT) Llama2

We used the prompt template described in the previous section to perform SFT. Each sentence in the gold standard notes was placed in the input placeholder and fed to Llama2. Note that SFT is auto-regressive; thus, the labels were appended after the "[/INST]" special token following the input sentences. Following the original SFT hyperparameters18, we used a cosine learning rate schedule with a 2 × 10^−5 initial learning rate and a weight decay of 0.1. The sequence length was 4096. We trained for 2 epochs. Due to limited GPU memory, we set the batch size to 1.
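The construction of an auto-regressive SFT example can be sketched as follows. The `build_sft_example` helper and the exact concatenation are assumptions; the hyperparameter values mirror those reported above, with illustrative key names.

```python
import json

def build_sft_example(prompt_prefix, sentence, gold_entities):
    """Append the gold labels after the [/INST] token so the auto-regressive
    loss is computed over the JSON label text (a sketch, not the exact format)."""
    prompt = f"{prompt_prefix}\nSentence: {sentence}\nOutput: [/INST]"
    return prompt + " " + json.dumps(gold_entities)

# Hyperparameters mirroring those reported above: cosine schedule,
# 2e-5 initial learning rate, weight decay 0.1, 2 epochs, batch size 1,
# 4096-token sequences. Key names are illustrative.
SFT_CONFIG = {
    "lr_scheduler_type": "cosine",
    "learning_rate": 2e-5,
    "weight_decay": 0.1,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "max_seq_length": 4096,
}

text = build_sft_example("[INST] Extract entities.",
                         "Aspirin 81 mg daily.",
                         [{"entity": "Aspirin", "type": "Drug"}])
```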

Few-shot in-context learning

LLMs have limitations on the number of input tokens due to their transformer architecture. Llama2-13B has a limit of 4096 input tokens. Including entire clinical notes in a prompt would often exceed the maximum input length. Therefore, we performed in-context learning at the sentence level. We sentence-segmented each note with the spaCy 3.5.4 Sentencizer31. Sentences were placed in the input placeholder, and the output was collected after the "[/INST]" special token for post-processing. To maximize reproducibility, we set the top-k parameter to 1, which disabled random sampling of generated tokens. To increase text generation speed, we set the maximum output length to 128 tokens.
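The sentence-level inference loop can be sketched as below. The naive regex splitter stands in for the spaCy Sentencizer, and `llm_generate` is a placeholder for the fine-tuned Llama2 call; the decoding settings mirror those described above, with illustrative key names.

```python
import re

def split_sentences(note):
    """Naive stand-in for spaCy's Sentencizer: split on sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]

# Greedy decoding settings from the text: top-k = 1 disables random sampling,
# and output is capped at 128 tokens for speed. Key names are illustrative.
GEN_KWARGS = {"top_k": 1, "do_sample": False, "max_new_tokens": 128}

def label_note(note, llm_generate):
    """Run in-context learning sentence by sentence; `llm_generate` is a
    placeholder for the actual LLM inference call."""
    return [llm_generate(sent, **GEN_KWARGS) for sent in split_sentences(note)]

sents = split_sentences("Pt admitted on 3/4. Started lisinopril 10 mg.")
```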

Large language models

Clinical notes often include PHI and are restricted from sharing. LLMs that are only available through an API (e.g., GPT-3, GPT-4)17,32 can be limiting in real-world scenarios. An ideal LLM for our system meets three criteria: 1) it is open-source and can be deployed locally, 2) it is lightweight enough for making inferences on a local server, and 3) it has high performance in the medical domain. Llama2 is a pre-trained open-source large language model that comes in different architecture sizes from 7 billion to 70 billion parameters and has demonstrated competitive performance in both open-domain and biomedical NLP benchmarks30. The 7 billion-parameter version (“Llama2-7B”) loaded in 16-bit floating point can fit in a GPU with 14 GB of vRAM, while the 13 billion-parameter version (“Llama2-13B”) fits in 26 GB of vRAM. We chose Llama2-13B for a balance of performance and computation cost. As a stand-alone sensitivity analysis, we also evaluated a recently published large language model, Llama3, with 70 billion parameters27.

Post-processing

To minimize human effort, our post-processing was designed to be automatic, robust, and generalizable across tasks. The steps were: 1) generated-text extraction, which extracts all generated text after the "[/INST]" special token; in cases where excessive text was generated after the intended JSON output, for instance a new "[INST]" block generated by Llama2, we truncated it. 2) JSON formatting, which uses a simple regular expression to extract the "\{.*?\}" patterns into a JSON list. 3) Entity recovery, which uses the extracted entity text to identify the span in the input sentence. 4) Entity type filtering, which removes irrelevant entity types that Llama2 created and that are not among the entity types of the benchmark task. We used exact, case-sensitive string matching to minimize potential bias from human interpretation. By the end of post-processing, we obtained a list of entities with the span, entity text, and entity type for each clinical note.
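The four steps above can be sketched as a single function; the helper name and sample output are illustrative, not the authors' implementation. (The sketch applies type filtering before span recovery per entity; the recovered entities are the same either way.)

```python
import json
import re

def postprocess(generated, sentence, allowed_types):
    """Sketch of the four post-processing steps: truncate runaway text,
    extract JSON objects, recover spans by exact case-sensitive match,
    and filter out-of-schema entity types."""
    # 1) Truncate anything generated after a spurious new [INST] block.
    generated = generated.split("[INST]")[0]
    # 2) Extract {...} patterns with a simple regular expression.
    objs = []
    for m in re.findall(r"\{.*?\}", generated):
        try:
            objs.append(json.loads(m))
        except json.JSONDecodeError:
            continue  # inconsistent output format; such sentences may fail
    entities = []
    for obj in objs:
        text, etype = obj.get("entity", ""), obj.get("type", "")
        # 4) Drop entity types outside the benchmark's schema.
        if etype not in allowed_types:
            continue
        # 3) Recover the span via exact, case-sensitive string matching.
        start = sentence.find(text)
        if start != -1:
            entities.append({"start": start, "end": start + len(text),
                             "text": text, "type": etype})
    return entities

ents = postprocess(
    '[{"entity": "aspirin", "type": "Drug"}, {"entity": "mild", "type": "Severity"}]',
    "He took aspirin yesterday.",
    {"Drug", "Strength", "ADE"},
)
```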

Weak supervision

We used one of the latest state-of-the-art biomedical BERT models, PubMedBERT33 (denoted as BERT), in this study. To evaluate the scenario where only a few annotated notes are available for training, the BERT model was first fine-tuned with weak labels from (N − ns) notes, followed by fine-tuning with gold labels from ns notes, where N is the total number of training notes and ns ∈ {3, 5, 10, 50}. To ensure the ns notes were representative, we selected those whose entity counts were closest to the median entity count among all notes in the official training set. The formula below defines the selected subset Sns:

S_ns = { note_i : i ∈ top-n_s argmin( | # of entities in note_i − median # of entities | ) }
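A minimal sketch of this selection rule, assuming entity counts per note are already available (the note identifiers are hypothetical):

```python
from statistics import median

def select_gold_notes(entity_counts, n_s):
    """Pick the n_s notes whose entity counts are closest to the median
    entity count over the training set, per the selection rule above."""
    med = median(entity_counts.values())
    ranked = sorted(entity_counts, key=lambda i: abs(entity_counts[i] - med))
    return ranked[:n_s]

counts = {"note_a": 4, "note_b": 80, "note_c": 18, "note_d": 21, "note_e": 19}
chosen = select_gold_notes(counts, 3)
```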

Fine-tuning BERT

Fine-tuning with weak labels and gold standard data follows similar methods, with a few differences in hyperparameters (Table S2). To segment notes into shorter chunks that the BERT models could process, we sentence-segmented the notes with spaCy. For each sentence, word tokenization was performed using the WordPiece algorithm implemented in the Python transformers module (version 4.30.2) and based on a pre-defined vocabulary.

For fine-tuning, the development set was divided into a training set (80%) and a validation set (20%), unless specified in Table S2. Model weights were saved as checkpoints after each training period (“epoch”), and optimal checkpoint weights were selected during validation as our final NLP model. For efficiency, an early stopping criterion of 8 consecutive non-improving epochs was used. The NLP models were implemented using Python 3.9.7, PyTorch 2.0.1, and transformers 4.30.2. All computations were performed on a server with 8 NVIDIA A100 80GB GPUs.
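The checkpoint selection with early stopping can be sketched as follows; `select_checkpoint` is a hypothetical helper operating on per-epoch validation scores.

```python
def select_checkpoint(val_scores, patience=8):
    """Track the best validation score per epoch and stop after `patience`
    consecutive non-improving epochs; return the best epoch index."""
    best_epoch, best_score, stale = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: 8 consecutive non-improving epochs
    return best_epoch

# Validation F1 peaks at epoch 1, then plateaus; training stops early.
best = select_checkpoint([0.60, 0.71, 0.70] + [0.69] * 10, patience=8)
```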

Benchmarking FLOPs and GPU time

The corpus in the 2018 benchmark is a subset of 505 discharge summaries from the MIMIC-III20 database. We calculated the FLOPs for inferencing one input sentence with Llama2-13B following34,

N_tokens × (2N + 2 × n_layer × n_ctx × d_attn) = 128 × (2 × 13,015,864,320 + 2 × 40 × 400 × 4096) ≈ 3.348 × 10^12

where N_tokens denotes the number of tokens Llama2 outputs; N denotes the total number of parameters in the model; n_layer denotes the number of layers in the model; n_ctx denotes the input context length in tokens, which we estimated from the length of the prompt template; and d_attn denotes the dimension of the attention output. We monitored the FLOPs for PubMedBERT with PyTorch's built-in profiler. The GPU time for each note was monitored during inference with Llama2-13B and prediction with PubMedBERT. We randomly sampled 50 to 500 notes and fitted a linear regression line to model the relationship between the number of notes and the total GPU time. A projection was made to estimate the total GPU time required for all the discharge summaries in the MIMIC-III database.
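A minimal sketch of the FLOPs estimate and the linear projection, using the values from the formula above. The projection helper is an ordinary least-squares fit, an assumption about the exact regression used.

```python
def llama_flops_per_sentence(n_tokens=128, n_params=13_015_864_320,
                             n_layer=40, n_ctx=400, d_attn=4096):
    """Forward-pass FLOPs per output token ~ 2N + 2*n_layer*n_ctx*d_attn,
    multiplied by the number of output tokens (the estimate above)."""
    return n_tokens * (2 * n_params + 2 * n_layer * n_ctx * d_attn)

def project_gpu_time(sampled_notes, sampled_hours, total_notes):
    """Fit an ordinary least-squares line to (note count, GPU hours) samples
    and extrapolate to the full collection; a sketch of the projection step."""
    n = len(sampled_notes)
    mx, my = sum(sampled_notes) / n, sum(sampled_hours) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(sampled_notes, sampled_hours))
             / sum((x - mx) ** 2 for x in sampled_notes))
    intercept = my - slope * mx
    return slope * total_notes + intercept

flops = llama_flops_per_sentence()  # ~3.35e12, matching the figure in the text
```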

FUNDING

This work was partially supported by awards from the National Institutes of Health, including the National Institute of Biomedical Imaging and Bioengineering (NIBIB: R21EB029575) and the National Institute of Allergy & Infectious Diseases (NIAID: R21AI164100).


Footnotes

COMPETING INTERESTS

None

Supplementary Files

This is a list of supplementary files associated with this preprint.

Contributor Information

Enshuo Hsu, University of Texas Health Science Center at Houston.

Kirk Roberts, University of Texas Health Science Center at Houston.

DATA AVAILABILITY STATEMENT

The benchmark datasets used in this study are publicly available. Registration is required via the DBMI portal (https://portal.dbmi.hms.harvard.edu/). Once approved, dataset requests can be made through the n2c2 NLP Research Data Sets webpage (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/). The source datasets are managed by the Department of Biomedical Informatics, Harvard Medical School.

References

  • 1. Zhang J., Hsieh C.-Y., Yu Y., Zhang C. & Ratner A. A Survey on Programmatic Weak Supervision. Preprint at http://arxiv.org/abs/2202.05433 (2022).
  • 2. Dong H. et al. Ontology-driven and weakly supervised rare disease identification from clinical notes. BMC Med Inform Decis Mak 23, 86 (2023).
  • 3. Datta S. & Roberts K. Weakly supervised spatial relation extraction from radiology reports. JAMIA Open 6, ooad027 (2023).
  • 4. Ge Y., Guo Y., Das S., Al-Garadi M. A. & Sarker A. Few-shot learning for medical text: A review of advances, trends, and opportunities. Journal of Biomedical Informatics 104458 (2023). doi: 10.1016/j.jbi.2023.104458.
  • 5. Cusick M. et al. Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation. J Psychiatr Res 136, 95–102 (2021).
  • 6. Wang H. et al. A Weakly-Supervised Named Entity Recognition Machine Learning Approach for Emergency Medical Services Clinical Audit. Int J Environ Res Public Health 18, 7776 (2021).
  • 7. Singhal S., Hegde B., Karmalkar P., Muhith J. & Gurulingappa H. Weakly Supervised Learning for Categorization of Medical Inquiries for Customer Service Effectiveness. Front Res Metr Anal 6, 683400 (2021).
  • 8. Agnikula Kshatriya B. S. et al. Identification of asthma control factor in clinical notes using a hybrid deep learning model. BMC Med Inform Decis Mak 21, 272 (2021).
  • 9. Shen Z. et al. Classifying the lifestyle status for Alzheimer's disease from clinical notes using deep learning with weak supervision. BMC Med Inform Decis Mak 22, 88 (2022).
  • 10. Fries J. A. et al. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun 12, 2017 (2021).
  • 11. Dhrangadhariya A. & Müller H. Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation. JAMIA Open 6, ooac107 (2023).
  • 12. Brown T. et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020).
  • 13. Dong Q. et al. A Survey on In-context Learning. Preprint at http://arxiv.org/abs/2301.00234 (2023).
  • 14. Agrawal M., Hegselmann S., Lang H., Kim Y. & Sontag D. Large Language Models are Few-Shot Clinical Information Extractors. Preprint at https://doi.org/10.48550/arXiv.2205.12689 (2022).
  • 15. Moradi M., Blagec K., Haberl F. & Samwald M. GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain. Preprint at https://doi.org/10.48550/arXiv.2109.02555 (2022).
  • 16. Jimenez Gutierrez B. et al. Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again. In Findings of the Association for Computational Linguistics: EMNLP 2022, 4497–4512 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022). doi: 10.18653/v1/2022.findings-emnlp.329.
  • 17. Liu Y. et al. Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models. Preprint at http://arxiv.org/abs/2304.01852 (2023).
  • 18. Touvron H. et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. Preprint at https://arxiv.org/abs/2307.09288 (2023).
  • 19. Henry S., Buchan K., Filannino M., Stubbs A. & Uzuner O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. J Am Med Inform Assoc 27, 3–12 (2019).
  • 20. Johnson A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci Data 3, 160035 (2016).
  • 21. Meoni S., De la Clergerie E. & Ryffel T. Large Language Models as Instructors: A Study on Multilingual Clinical Entity Extraction. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 178–190 (Association for Computational Linguistics, Toronto, Canada, 2023).
  • 22. Frei J. & Kramer F. Annotated dataset creation through large language models for non-English medical NLP. Journal of Biomedical Informatics 104478 (2023). doi: 10.1016/j.jbi.2023.104478.
  • 23. Karkera N., Acharya S. & Palaniappan S. K. Leveraging pre-trained language models for mining microbiome-disease relationships. BMC Bioinformatics 24, 290 (2023).
  • 24. Devlin J., Chang M.-W., Lee K. & Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2019).
  • 25. Sun W., Rumshisky A. & Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. J Am Med Inform Assoc 20, 806–813 (2013).
  • 26. Stubbs A., Kotfila C. & Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform 58, S11–S19 (2015).
  • 27. Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/.
  • 28. Gutiérrez B. J. et al. Thinking about GPT-3 In-Context Learning for Biomedical IE? Think Again. Preprint at http://arxiv.org/abs/2203.08410 (2022).
  • 29. Smith R., Fries J. A., Hancock B. & Bach S. H. Language Models in the Loop: Incorporating Prompting into Weak Supervision. Preprint at http://arxiv.org/abs/2205.02318 (2022).
  • 30. Touvron H. et al. LLaMA: Open and Efficient Foundation Language Models. Preprint at https://doi.org/10.48550/arXiv.2302.13971 (2023).
  • 31. Honnibal M. & Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear 7, 411–420 (2017).
  • 32. OpenAI. GPT-4 Technical Report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
  • 33. Gu Y. et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthcare 3, 1–23 (2022).
  • 34. Kaplan J. et al. Scaling Laws for Neural Language Models. Preprint at http://arxiv.org/abs/2001.08361 (2020).

