Abstract
Large language models (LLMs) such as ChatGPT are fine-tuned on large and diverse instruction-following corpora, and can generalize to new tasks. However, these instruction-tuned LLMs often perform poorly in specialized medical natural language understanding (NLU) tasks that require domain knowledge, granular text comprehension, and structured data extraction. To bridge the gap, we: (1) propose a unified prompting format for 7 important NLU tasks, (2) curate an instruction-tuning dataset, MNLU-Instruct, utilizing diverse existing open-source medical NLU corpora, and (3) develop BioMistral-NLU, a generalizable medical NLU model, through fine-tuning BioMistral on MNLU-Instruct. We evaluate BioMistral-NLU in a zero-shot setting, across 6 important NLU tasks, from two widely adopted medical NLU benchmarks: BLUE and BLURB. Our experiments show that BioMistral-NLU outperforms the original BioMistral, as well as the proprietary LLMs ChatGPT and GPT-4. Our dataset-agnostic prompting strategy and instruction tuning step over diverse NLU tasks enhance LLMs’ generalizability across diverse medical NLU tasks. Our ablation experiments show that instruction-tuning on a wider variety of tasks, even when the total number of training instances remains constant, enhances downstream zero-shot generalization.
Introduction
Fine-tuning large language models (LLMs) on a diverse collection of instruction-following datasets enables LLMs to generalize across a wide range of new tasks in a zero- or few-shot setting.1,2 Following this instruction fine-tuning phase, medical foundation LLMs3 have demonstrated great performance in various medical tasks, which require in-depth medical domain knowledge and logical reasoning ability4, such as medical exams4, common sense reasoning5,6 and diagnostic reasoning.3 This generalizability is particularly crucial for tasks with limited annotated data, where fine-tuning is infeasible. Meanwhile, using a safe, generalized model can also mitigate the safety risks associated with training task-specific models using sensitive medical data.
Despite their superior generalizability in some areas, instruction-tuned LLMs can underperform smaller-scale, fine-tuned language models in some specialized medical natural language understanding (NLU) tasks. These tasks require models to understand, interpret, and respond to human language meaningfully.7 Examples of medical NLU tasks include information extraction8,9 and sentence classification10, which generate additional labels for medical text, facilitating real-time information retrieval and secondary medical research11. The performance gap may be due to the current foundation LLMs’ instruction-tuning phase, which focuses primarily on natural language generation (NLG) tasks that allow for free-text, unconstrained outputs.1 Although many NLG tasks require complex logical reasoning, these skills do not directly translate to nuanced NLU tasks.
To bridge this gap, we propose a unified prompting format for 7 widely studied medical NLU tasks identified in a recent survey12, employing span extraction and multi-choice question-answering (QA). Utilizing this unified format, we create an instruction-tuning dataset, MNLU-Instruct, from diverse existing open-source medical NLU corpora. We fine-tune a high-performing biomedical LLM, BioMistral5, on MNLU-Instruct, resulting in a new, generalizable medical NLU model we call BioMistral-NLU. We evaluate the generalizability of BioMistral-NLU, using zero-shot, dataset-agnostic prompts, on two widely adopted benchmark datasets: the Biomedical Language Understanding Evaluation (BLUE)13 and the Biomedical Language Understanding and Reasoning Benchmark (BLURB)14. Collectively, the benchmarks include 15 biomedical datasets with 6 important NLU task categories, across both clinical and biomedical domains. In our evaluation, BioMistral-NLU outperforms the original BioMistral, as well as ChatGPT and GPT-4, on the macro average across all tasks. Our ablation experiments demonstrate that instruction-tuning on a broader variety of tasks, even with a constant total number of training instances, improves downstream zero-shot generalization.
Related work
Medical NLU
Within this broad category of medical NLU, there is extensive research on specific NLU tasks in clinical and biomedical domains, such as Information Extraction (IE) and Document Classification (DC).15 To develop a comprehensive understanding of medical NLU, previous research curates two NLU benchmark datasets: the Biomedical Language Understanding Evaluation (BLUE)13 and the Biomedical Language Understanding and Reasoning Benchmark (BLURB)14. These two benchmarks encompass multiple important medical NLU tasks and are widely adopted to evaluate various LLMs for their medical NLU capabilities.16–18
Previous studies explore the ability of task-agnostic LLMs to perform medical NLU tasks. For example, Agrawal et al. (2022)19 demonstrate LLMs’ potential for clinical NLU tasks through few-shot in-context learning (ICL). Hu et al. (2023)9 evaluate ChatGPT on two clinical NER datasets, representing a subset of NLU tasks. Wang et al. (2023)17 propose a novel prompting strategy for multiple clinical NLU tasks using proprietary LLMs such as ChatGPT20 and GPT-421. However, they only evaluate the LLMs on a few samples from each task within the BLUE benchmark. Similarly, Chen et al. (2023)18 and Feng et al. (2024)16 systematically evaluate multiple LLMs using the BLURB benchmark14. Although ChatGPT and GPT-4 outperform other LLMs, they considerably underperform the in-domain fine-tuned systems. This performance gap highlights the need for more generalized systems for medical NLU.
Instruction tuning for medical NLU
Instruction tuning involves fine-tuning a pre-trained LM on a diverse collection of instruction-following tasks and thus enables the LM to understand and follow natural language instructions, and generalize to previously unseen tasks in zero-shot or few-shot settings.1,22 Instruction-tuning datasets typically encompass a wide range of natural language processing (NLP) tasks presented in an instructional format, including reasoning, question-answering, dialogue, and summarization.23 Utilizing instruction tuning, previous research has developed systems focused on generalizing to a limited subset of NLU tasks in the general domain, such as IE24–28 and Named Entity Recognition (NER)29,30.
Several previous studies aim to adapt instruction-tuning to the medical domain, with a major focus on dialogue-based chatbots, such as ChatDoctor31 and MedAlpaca6. Other medical foundation LLMs, like MedGemini3 and Taiyi32, show potential for diverse NLU tasks but lack comprehensive evaluation. Previous system development has often focused on a limited subset of medical NLU tasks. For example, Luo et al. (2022)33 explore Table QA; Zhao et al. (2024)30 focus on NER; Sainz et al. (2023)26 focus on IE; and Rohanian et al. (2023)34 and Tran et al. (2024)35 focus on QA, IE, and text generation. However, the application of these models to other NLU tasks, such as sentence similarity and natural language inference, has not yet been explored. To the best of our knowledge, there is no comprehensive system development and evaluation across all medical NLU tasks for their generalizability. Therefore, in this work, we aim to bridge this gap by evaluating our proposed system in a zero-shot setting using two widely adopted benchmarks, encompassing 7 important medical NLU tasks.
Methods
In this section, we will introduce the task formulation, and outline the three-step approach to creating our generalized LLM across medical NLU tasks.
Task formulation
We reformulate the NLU problem as text generation tasks. Our learning objective M for the medical NLU system is defined by the function M : (I, X, T) → O. Specifically, given a user instruction I, associated medical text X, and NLU task labels T, the model M is instructed to output the system output O, where I, X, T, O correspond to sequences of tokens.
We developed a unified prompt format, which simplifies evaluation across diverse NLU task outputs, and potentially facilitates knowledge transfer when the system is fine-tuned for a wider range of NLU tasks.
Unified Medical NLU format
Building on prior research outlined in the Related Work section, we develop our unified NLU format that focuses on seven critical NLU tasks. Six of these NLU tasks are directly adapted from the BLUE and BLURB benchmarks, including named entity recognition (NER), document classification (DC), relation extraction (RE), multi-choice question-answering (QA), natural language inference (NLI), and semantic text similarity (STS). We also incorporate event extraction (EE), which is extensively researched in the medical domain.36 In EE, each event consists of a trigger and multiple arguments that characterize the event. The event trigger extraction (ETE) and event argument extraction (EAE) can be considered as NER. The event argument classification (EAC) classifies the event argument into a subtype, and can be considered as sequence classification. Table 1 demonstrates the input-output format for each medical NLU task.
Table 1:
The task-agnostic prompt format for 7 medical NLU tasks: named entity recognition (NER), event extraction (EE), document classification (DC), relation extraction (RE), multi-choice question-answering (QA), natural language inference (NLI), and semantic text similarity (STS). Event trigger extraction (ETE), event argument extraction (EAE), and event argument classification (EAC) are all components of the EE task. Variables inside {} are derived from each dataset instance.
| Task | Input prompt format | Output prompt format |
|---|---|---|
| NER/ ETE | Extract all relevant medical named entities from the medical text below. Focus on identifying following entities: {type1}, {type2}, … . {text} | {type1}:{span1}… … {spann} |
| EAE | What is the {type} attribute of the {trigger} ‘{span}’ in the medical text below? {text} | {trigger} - {attribute}:{span1} |
| EAC | What is the {type} attribute of the {trigger} ‘{span}’ in the medical text below? {text} {options} | {trigger} - {attribute}:{option} |
| DC | Which options best describe cancer hallmark from the medical text below? {text} {options} | {option} |
| RE | What is the relation between the {type1} entity ‘{span1}’ and the {type2} entity ‘{span2}’ from the medical text below? {text} {options} | {option} |
| QA | {question} {text} {options} | {option} |
| NLI | What is the relation between the premise and hypothesis? Premise: {premise}. Hypothesis: {hypothesis} {options} | {option} |
| STS | How similar are the two sentences below? Sentence 1: {sentence1}. Sentence 2: {sentence2}. {options} | {option} |
Those seven NLU tasks can be summarized into three categories: (1) token classification, (2) sequence classification, and (3) sequence regression.
NER, ETE, and EAE are token classification tasks, which assign a class label to each token in the input sequence.* In token classification, the input includes the user instruction I with pre-defined token labels, and the medical text X. In the output O, each line includes all the token annotations associated with a specific label. Each line starts with a class label, followed by the corresponding positive tokens in the order they appear in X. Contiguous positive tokens are grouped into text spans (entities), separated by “…”, for example, “Disease: fever…headache”. If no tokens are classified as entities, O is “None”. More specifically, NER classifies each token as a possible named entity.
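The output serialization described above can be illustrated with a minimal sketch. It assumes per-token labels are available with a hypothetical `"O"` marker for negative tokens, and it omits labels that have no spans; neither detail is specified in the text, and the function name is illustrative only.

```python
from typing import Dict, List

SPAN_SEPARATOR = "\u2026"  # the ellipsis character used between spans


def format_token_classification(
    tokens: List[str], labels: List[str], label_order: List[str]
) -> str:
    """Serialize per-token labels into one output line per class label."""
    spans: Dict[str, List[str]] = {label: [] for label in label_order}
    current_label, current_tokens = None, []
    # A trailing sentinel flushes the last open span.
    for token, label in list(zip(tokens, labels)) + [("", None)]:
        if label != current_label:
            if current_label in spans:
                spans[current_label].append(" ".join(current_tokens))
            current_label, current_tokens = label, []
        current_tokens.append(token)
    lines = [
        f"{label}: {SPAN_SEPARATOR.join(spans[label])}"
        for label in label_order
        if spans[label]
    ]
    # Assumption: labels without any span are omitted from the output.
    return "\n".join(lines) if lines else "None"


tokens = ["Patient", "has", "fever", "and", "severe", "headache"]
labels = ["O", "O", "Disease", "O", "Disease", "Disease"]
print(format_token_classification(tokens, labels, ["Disease"]))
# prints: Disease: fever…severe headache
```

One limitation of this sketch is that two adjacent entities of the same type would be merged into a single span, since only per-token class labels are used.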
EAC, DC, RE, QA, and NLI are sequence classification tasks, which assign a class label to the entire input token sequence (see Table 4). In sequence classification, the user instruction I specifies pre-defined class labels as multiple choices, which is a commonly adopted format in instruction-tuning, for example, “(B) fevers happens with headache”.1 The system output O is always one or more multi-choice options. In DC, the medical text X is the document. In RE, X is the corresponding medical text snippet with labeled named entities. In NLI, X is a pair of a premise and a hypothesis. In QA, user instruction I involves the task question, and X is the corresponding medical text.
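As a rough illustration of the multi-choice format, the sketch below appends lettered options to an instruction and maps letters in the system output back to class labels. The function names, the newline layout, and the letter-matching rule are assumptions made for this example, not the authors' implementation.

```python
import re
import string
from typing import List


def build_multichoice_prompt(instruction: str, text: str, options: List[str]) -> str:
    """Append lettered options, e.g. "(A) ...", to the instruction and text."""
    lettered = [
        f"({string.ascii_uppercase[i]}) {option}" for i, option in enumerate(options)
    ]
    return "\n".join([instruction, text] + lettered)


def parse_selected_options(output: str, options: List[str]) -> List[str]:
    """Map letters such as "(B)" in the model output back to option strings."""
    selected = []
    for letter in re.findall(r"\(([A-Z])\)", output):
        index = string.ascii_uppercase.index(letter)
        # Ignore letters that do not correspond to an offered option.
        if index < len(options) and options[index] not in selected:
            selected.append(options[index])
    return selected


options = ["entails", "neutral", "contradicts"]
prompt = build_multichoice_prompt(
    "What is the relation between the premise and hypothesis?",
    "Premise: The patient has a fever. Hypothesis: The patient is febrile.",
    options,
)
print(parse_selected_options("(A) entails", options))  # prints: ['entails']
```

Returning a list rather than a single label accommodates tasks such as DC, where the output may contain more than one option.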
Table 4:
Sequence classification and regression datasets used in the evaluation.
| Task | Dataset | Multi-choice options |
|---|---|---|
| DC | HoC | 10 cancer hallmarks |
| QA | PubMedQA | yes / maybe / no |
| QA | BioASQ | yes / no |
| RE | GAD | 2 gene-disease relations |
| RE | DDI | 4 drug-drug interactions |
| RE | ChemProt | 5 chemical-protein relations |
| RE | i2b2-2010 | 8 medical problem relations |
| NLI | MedNLI | entails / neutral / contradicts |
| STS | BioSSES | 5 similarity score definitions |
STS is a sequence regression task, which assigns a numeric score to the entire input. In this study, we explore the widely researched task of sequence regression: calculating the semantic text similarity (STS) score between two sentences. In the user instruction I of STS, the STS scores correspond to the scoring criteria from the original publication, and are presented as multi-choice options, for example, ‘(A) The two sentences are on different topics (score 0).’
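To make the regression-as-multi-choice idea concrete, the following sketch recovers an integer score from a selected option and computes a Pearson correlation against gold scores. The `(score N)` suffix follows the example quoted above; the other option texts and the gold scores are hypothetical placeholders, and unparseable outputs are not handled here.

```python
import math
import re
from typing import List, Optional


def option_to_score(output: str) -> Optional[int]:
    """Extract the integer from a '(score N)' suffix in the selected option."""
    match = re.search(r"\(score\s+(\d+)\)", output)
    return int(match.group(1)) if match else None


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


outputs = [
    "(A) The two sentences are on different topics (score 0).",
    "(C) placeholder criterion text (score 2).",  # hypothetical option wording
    "(E) placeholder criterion text (score 4).",  # hypothetical option wording
]
predicted = [option_to_score(o) for o in outputs]
gold = [0.4, 1.8, 3.6]  # hypothetical gold similarity scores
print(predicted, round(pearson(predicted, gold), 3))
```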
MNLU-Instruct dataset
Focusing on the 7 medical NLU tasks outlined in Table 1, we construct the instruction-tuning dataset, MNLU-Instruct, by searching extensively for publicly available clinical and biomedical NLU datasets outside of BLUE and BLURB. To better assess the generalizability of our proposed system, we intentionally avoid adding any QA datasets to the MNLU-Instruct dataset, using QA tasks as novel tasks specifically for assessment purposes. Instead, beyond NLU tasks, we additionally incorporate three medical summarization tasks, which require text summarization and understanding abilities similar to those needed for the QA tasks. Meanwhile, given the limited availability of public medical datasets for NLI and STS, we incorporate datasets from the general domain, including SNLI, Multi-NLI, and STS-B. As a result, we derive the MNLU-Instruct dataset with the train splits from 33 publicly available datasets shown in Table 2.
Figure 1:
Instruction-tuning dataset (MNLU-Instruct), system development, and downstream evaluation for BioMistral-NLU.
Table 2:
The MNLU-Instruct dataset, which is used for fine-tuning: NLU and summarization datasets and tasks curated from existing open-source medical corpora. Due to the page limit for this manuscript, the citations for the datasets will be published in our project GitHub.
| Task | Datasets used for instruction-tuning |
|---|---|
| NER | i2b2 2006 DeID, i2b2 2011 Coreference, i2b2 2012 Temporal, i2b2 2014 DeID, GENIA, LINNAEUS, tmVar, DrugProt, BioRed, NLM-Gene, ClinicalIE, BC4CHEMD, GNorm, PubMed PICO, PICO-Data |
| EE | i2b2 2009 Medication, i2b2 2018 ADE, n2c2 2022 SDoH |
| DC | i2b2 2006 Smoking, i2b2 2008 Obesity, n2c2 2018, 2024 SemEval Task 2, TrialStop, MTSamples |
| RE | i2b2 2011 Coreference, i2b2 2012 Temporal, EUADR, DrugProt, BioRed |
| NLI | BioNLI, SNLI, Multi-NLI |
| STS | STS-B |
| Summ | PubMedSum, CDSR, AciDemo |
We construct the NLU input-output pairs in MNLU-Instruct through the task-agnostic prompting strategy shown in Table 1, which directly adapts pre-defined label names from the original publications. We additionally expand abbreviated label names, e.g., from ‘GENERIF’ to ‘Gene reference into a function (function of a gene)’. To increase the variability of MNLU-Instruct, for every NLU input-output pair, we randomly shuffle the order of task labels. Specifically, token labels in token classification tasks, as well as multi-choice options in sequence classification and regression tasks, are randomly shuffled. When train splits are unavailable or datasets have very few input-output pairs, we utilize the entire datasets for training. The complete set of dataset labels, prompts, and statistics can be found in our project GitHub repository (https://github.com/uw-bionlp/BioMistral-NLU).
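The label preparation step can be sketched as follows. The expansion table holds only the one example given in the text, and the function name and the use of a seeded `random.Random` are assumptions for illustration.

```python
import random
from typing import Dict, List

# Hypothetical expansion table; the single entry is the example from the text.
LABEL_EXPANSIONS: Dict[str, str] = {
    "GENERIF": "Gene reference into a function (function of a gene)",
}


def prepare_labels(labels: List[str], rng: random.Random) -> List[str]:
    """Expand abbreviated label names, then return them in a random order."""
    expanded = [LABEL_EXPANSIONS.get(label, label) for label in labels]
    rng.shuffle(expanded)  # shuffles the copy, leaving the input list untouched
    return expanded


rng = random.Random(13)  # a fixed seed keeps dataset construction reproducible
for _ in range(2):
    # Each input-output pair receives its own label order.
    print(prepare_labels(["GENERIF", "Disease", "Chemical"], rng))
```

Shuffling per instance, rather than once per dataset, means the model cannot rely on a label or option always appearing in the same position.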
BioMistral-NLU system development
We hypothesize that instruction-tuning on a diverse, yet relevant set of tasks improves the generalizability of LLMs on medical NLU tasks. To verify this hypothesis, we fine-tune a high-performing medical LLM on MNLU-Instruct and evaluate it in a zero-shot setting.
We chose BioMistral-7B-DARE, the state-of-the-art open-source LLM on multiple medical QA tasks, as our baseline system. For simplicity, we refer to BioMistral-7B-DARE as BioMistral in this work. We fine-tune all parameters of BioMistral on MNLU-Instruct, resulting in BioMistral-NLU-FT. However, fine-tuning LLMs in specialized domains can potentially degrade their original generalization ability across broader tasks.38 To mitigate this risk and preserve the versatility of the original BioMistral, we utilize DARE39, as suggested by Labrak et al. (2024)5. This approach integrates model parameters from BioMistral-NLU-FT and BioMistral, without additional training, and creates the merged system BioMistral-NLU.
The experiment is conducted using the alignment-handbook† package. Following the recommendations in the alignment-handbook GitHub discussion, we set the number of epochs to 3, the batch size to 16, and the learning rate to 2e-04 with a warm-up ratio of 0.1, using 4 A100 GPUs. The remaining hyperparameters follow the default alignment-handbook configuration. For inference, we use the vllm package‡ and set the temperature to 0. The whole fine-tuning process for BioMistral-NLU takes less than one day.
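For readers unfamiliar with DARE, the core drop-and-rescale idea can be sketched on flat parameter lists. This is a toy illustration under stated assumptions: the drop rate, the choice of reference model for computing deltas, and the function name are not given in the text, and real merges operate on full model tensors with dedicated tooling.

```python
import random
from typing import List


def dare_merge(
    base: List[float],
    fine_tuned: List[float],
    drop_rate: float,
    rng: random.Random,
) -> List[float]:
    """Drop each delta with probability drop_rate, rescale the rest, add to base."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)  # keeps the expected delta unchanged
    merged = []
    for b, f in zip(base, fine_tuned):
        delta = f - b  # parameter change introduced by fine-tuning
        keep = rng.random() >= drop_rate
        merged.append(b + (delta * scale if keep else 0.0))
    return merged


base = [1.0, 2.0, 3.0, 4.0]        # stands in for the reference model's weights
fine_tuned = [1.5, 2.0, 2.0, 5.0]  # stands in for the fine-tuned weights
print(dare_merge(base, fine_tuned, drop_rate=0.5, rng=random.Random(0)))
```

No gradient updates are involved, which matches the description of merging without additional training.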
Experiment setup
In this section, we will introduce our evaluation datasets, evaluation metrics, and comparative systems.
Evaluation datasets and metrics
We evaluate BioMistral-NLU in a zero-shot setting using BLURB and BLUE. Due to the sensitivity involved in the deployment of clinical note-based corpora, we exclude two inaccessible datasets from BLUE, ShARe/CLEF40 and MedSTS41. Some datasets are included in both benchmarks, resulting in a total of 7 tasks and 15 unique datasets evaluated. We develop the evaluation datasets using the unified prompt format outlined in Table 1; the entity types and multi-choice options for these datasets are shown in Tables 3 and 4.
Table 3:
NER datasets used in the evaluation.
| Dataset | Named entities |
|---|---|
| BC2GM | Gene |
| BC5-chemical | Chemical |
| BC5-disease | Disease |
| NCBI-disease | Disease |
| JNLPBA | Protein, Cell type, RNA, Cell line, DNA |
| EBM PICO | Interventions, Participants, Outcomes |
For consistency with prior studies, we utilize the same evaluation criteria as BLUE13 and BLURB14. Token classification tasks are evaluated using F1 scores at the entity level, except for the PICO dataset, which is evaluated at the token level. When class labels are balanced, as in NLI and QA, sequence classification tasks are evaluated using accuracy. When class labels are imbalanced, as in RE, sequence classification tasks are evaluated using F1. For the sequence regression task, STS, system outputs are converted to numerical integer scores and evaluated based on Pearson correlation.
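A minimal sketch of entity-level F1 under exact matching is shown below. It assumes entities are compared as (document id, entity type, span text) triples, since generated outputs carry no character offsets; the benchmark scorers may differ in detail, and scores here are on a 0 to 1 scale rather than the 0 to 100 scale used in the reported results.

```python
from typing import Set, Tuple

Entity = Tuple[int, str, str]  # (document id, entity type, span text)


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Micro-averaged F1 where an entity counts only on an exact match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, "Gene", "mouse Id - 1"), (0, "Gene", "p53")}
predicted = {(0, "Gene", "Id - 1"), (0, "Gene", "p53")}
print(entity_f1(gold, predicted))  # prints: 0.5
```

In this example the boundary mismatch on the first entity costs both a false positive and a false negative, which is why exact matching is sensitive to span boundaries.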
Comparative systems
We compare our proposed system, BioMistral-NLU, with our baseline, BioMistral, as well as other high-performing systems.
Proprietary LLMs: ChatGPT20 and GPT-421. We reference prior research that evaluates these proprietary LLMs on BLURB.16,18,§ Note that ChatGPT’s performance is reported under one-shot ICL, while GPT-4’s performance is based on randomly selected 3-shot examples for NER tasks and zero-shot for other tasks. Additionally, their prompts are strategically optimized for each dataset, resulting in competitive systems. Given that Feng et al. (2024)16 demonstrated GPT-4’s superiority over FLAN-T5-XXL42, PMC-LLaMA-13B43, and Zephyr-7B-Beta44, we excluded these systems from further evaluation.
Open-source LLMs: BioMistral¶, and LLaMA-3.1-8B-Instruct||45. In our controlled experiments, we evaluate open-source LLMs using our proposed unified prompting formats shown in Table 1. The evaluation is conducted in a zero-shot setting, except for NER datasets. Since our desired token classification output prompt format is less common during those open-source LLMs’ instruction tuning phase, we additionally incorporate an explanation for the output formats and two in-context examples. For each inference query, the 2-shot examples are randomly selected from the training split of each dataset. We ensure that the outputs from the 2-shot examples are distinct from each other to prevent bias toward a specific answer.
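The selection of in-context examples with distinct outputs could look like the sketch below. The rejection strategy (shuffle, then keep the first examples whose outputs have not been seen) and all names are assumptions; the text only states that the two examples are random and have different outputs.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str]  # (input prompt, expected output)


def sample_distinct_shots(
    pool: List[Example], k: int, rng: random.Random
) -> List[Example]:
    """Randomly pick k examples whose outputs are pairwise distinct."""
    shuffled = pool[:]  # copy so the training split is not reordered
    rng.shuffle(shuffled)
    shots: List[Example] = []
    seen_outputs = set()
    for prompt, output in shuffled:
        if output not in seen_outputs:
            shots.append((prompt, output))
            seen_outputs.add(output)
        if len(shots) == k:
            return shots
    raise ValueError("not enough examples with distinct outputs")


pool = [
    ("text without entities", "None"),
    ("another text without entities", "None"),
    ("text mentioning fever", "Disease: fever"),
]
shots = sample_distinct_shots(pool, k=2, rng=random.Random(7))
print([output for _, output in shots])
```

Calling the function once per inference query, with a shared random generator, gives each query its own pair of examples.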
Results
Following the practice in BLURB14, we average system performance across datasets for an overview. As shown in Table 5, BioMistral-NLU outperforms the baseline BioMistral with an increase in the macro average score of 19.7 for BLURB and 16.7 for BLUE. Meanwhile, BioMistral-NLU outperforms the proprietary models, achieving an increase in the macro average score of 9.0 over ChatGPT, and 2.7 over GPT-4 for BLURB. Our results demonstrate that instruction-tuning on diverse medical NLU tasks using our unified format effectively improves the LLMs’ generalizability to unseen NLU datasets. In this section, we will analyze the results and characterize the gaps between the systems.
Table 5:
Our proposed system, BioMistral-NLU’s zero-shot performance on 15 unseen medical NLU datasets from 2 benchmarks: BLURB (labeled by †) and BLUE (labeled by ∗). Bold indicates superior performance over other open-source LLMs, which utilize the same, dataset-agnostic prompts as BioMistral-NLU, under the zero-shot setting, except for two extra random examples for NER tasks. Underline indicates better performance over ChatGPT one-shot and GPT-4 three-shot ICL. ChatGPT and GPT-4 utilize dataset-specific prompts and few-shot examples, and therefore have advantages over our proposed system. ‘-’ indicates that the performance is not measured by prior research.
Comparison across systems
Comparing BioMistral-NLU with the baseline BioMistral, we observe an average performance increase of 33.7 for NER tasks and 8.2 for other tasks. This difference may originate from the instruction-tuning phase of BioMistral. While the number of NER task instances might be relatively small during BioMistral’s instruction-tuning phase, the other tasks utilize a QA prompting strategy and are likely similar to some of BioMistral’s instruction-tuning tasks. This finding motivates instruction-tuning on a wider variety of NLU tasks to improve the LLM’s generalizability.
Comparing BioMistral-NLU with proprietary LLMs on the BLURB benchmark, we observe that BioMistral-NLU achieves an average F1 score 9.7 points higher than GPT-4 across NER tasks. However, for other BLURB tasks, BioMistral-NLU scores on average 2.0 points higher than ChatGPT and 5.4 points lower than GPT-4. Given that GPT-4 is significantly larger in terms of parameter size and has been instruction-tuned on much more diverse corpora, its superior generalization ability for other tasks involving more complex reasoning is consistent with the empirical scaling law.1,46
Error analysis
We observe that for NER tasks, a major source of error for BioMistral-NLU is the nuanced task of accurately identifying exact named entity boundaries. For example, in the BC2GM gene NER dataset, the predicted named entity is ‘Id - 1’, whereas the gold named entity is ‘mouse Id - 1’. To better understand the prevalence of this discrepancy, we evaluate 5 NER datasets using a relaxed criterion, where two named entities are considered equivalent if their spans overlap. Using this relaxed criterion, we observe an average improvement of 15.5 in F1 across the 5 NER datasets from the original entity-level F1.
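The relaxed criterion can be sketched as follows, assuming each entity has been aligned to character offsets in the source text; how spans are located in the text is not described, so the offset representation and function names are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def spans_overlap(a: Span, b: Span) -> bool:
    """True when the two half-open intervals share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def relaxed_f1(gold: List[Span], predicted: List[Span]) -> float:
    """F1 where a span counts as correct if it overlaps any span on the other side."""
    matched_pred = sum(any(spans_overlap(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(spans_overlap(g, p) for p in predicted) for g in gold)
    if matched_pred == 0 or matched_gold == 0:
        return 0.0
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    return 2 * precision * recall / (precision + recall)


# 'mouse Id - 1' occupies offsets 0-12; the prediction 'Id - 1' occupies 6-12.
print(relaxed_f1(gold=[(0, 12)], predicted=[(6, 12)]))  # prints: 1.0
```

Under exact matching the same prediction would score 0, so the gap between the two scores gives a rough measure of how many errors are boundary errors.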
In all RE tasks, BioMistral-NLU demonstrates recall rates that are 10 to 70 points higher than its precision, suggesting a tendency to identify many false positive relationships. One major source of these false positives is the occurrence of interactions between entities, which do not fit into any of the pre-defined relation categories of interest. As a result, BioMistral-NLU assigns a wrong relation label instead of recognizing no relation.
In the sequence regression dataset, BioSSES, BioMistral-NLU tends to predict intermediate similarity scores (such as scores of 2 or 3) rather than extreme scores (0, 1, 4, or 5).
Discussion
In this section, we will evaluate the impact of instruction dataset composition, focusing on two components: instruction-tuning tasks and domains.
Impact of instruction-tuning tasks
We aim to assess the impact of instruction-tuning task selection from two perspectives: (1) its relevance to downstream tasks and (2) its task diversity. Focusing on these two perspectives, we fine-tune the baseline system, BioMistral, with different subsets of tasks used to build BioMistral-NLU. We evaluate the fine-tuned system on the 4 RE datasets from Table 5 in a zero-shot setting, and compare the macro-average F1 scores across the 4 RE datasets.
To study the impact of task relevance, we first construct two instruction-tuning setups: (1) with the RE task (w/ RE) and (2) with the DC task (w/o RE). We chose the DC task because DC employs a similar QA prompting format to RE and it contains 6 diverse datasets from Table 2. To study the impact of task diversity, besides DC and RE, we additionally include 2 and 4 more randomly selected tasks from Table 2.
All fine-tuning experiments are controlled using a fixed number of 50,000 data instances and running for three epochs. We maintain an equal number of instances for each task (i.e., 50,000/k instances per task when fine-tuning with k tasks), and randomly sample fine-tuning instances from all datasets within the same task.
More specifically, our experiment settings are:
- w/ RE:
  - 1 task: RE
  - 3 tasks: RE, NLI, NER
  - 5 tasks: RE, NLI, NER, EE, STS
- w/o RE:
  - 1 task: DC
  - 3 tasks: DC, NLI, NER
  - 5 tasks: DC, NLI, NER, EE, STS
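The equal-per-task sampling used in these settings can be sketched as below. Instances from all datasets of a task are assumed to be pooled into one list, and the per-task count is rounded down when the total is not divisible by the number of tasks; both choices, like the function name, are assumptions for illustration.

```python
import random
from typing import Dict, List


def sample_balanced(
    instances_by_task: Dict[str, List[str]], total: int, rng: random.Random
) -> Dict[str, List[str]]:
    """Draw total // k instances, without replacement, from each of the k tasks."""
    per_task = total // len(instances_by_task)
    sampled = {}
    for task, instances in instances_by_task.items():
        if len(instances) < per_task:
            raise ValueError(f"task {task!r} has fewer than {per_task} instances")
        sampled[task] = rng.sample(instances, per_task)
    return sampled


# Toy pools standing in for the pooled instances of each task.
pools = {
    "RE": [f"re-{i}" for i in range(10)],
    "NLI": [f"nli-{i}" for i in range(10)],
    "NER": [f"ner-{i}" for i in range(10)],
}
subset = sample_balanced(pools, total=12, rng=random.Random(0))
print({task: len(items) for task, items in subset.items()})
# prints: {'RE': 4, 'NLI': 4, 'NER': 4}
```

Holding the total fixed while varying the number of tasks isolates the effect of task diversity from the effect of training set size.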
After BioMistral is fine-tuned with the same number of instances, we observe the following from Figure 2: (1) Overall, setting 1 (with RE) consistently outperforms setting 2 (without RE), due to its relevance to the RE datasets used in downstream evaluation; (2) In both settings, system performance increases with the number of fine-tuning tasks, demonstrating the benefits of fine-tuning with multiple tasks; (3) When fine-tuning on a single task, the performance improvement on downstream tasks depends on the similarity between fine-tuning task and the downstream task.
Figure 2:
Average zero-shot performance on the 4 RE datasets, after instruction-tuning on 50k instances.
Impact of instruction-tuning domain
After demonstrating the benefits of diverse instruction-tuning tasks, we now examine individual tasks. Note that the BLUE benchmark includes both biomedical and clinical datasets: biomedical data is derived from scientific publications, while clinical data consists of semi-structured clinical notes from patients.47 In this section, we assess how domain selection affects downstream generalizability.
We follow a similar experimental setup as described in the previous section, fine-tuning BioMistral for three epochs over 25,000 data instances. The fine-tuned system is evaluated on six biomedical NER datasets from Table 5 in a zero-shot setting, using macro average F1 scores. Instruction-tuning NER datasets from MNLU-Instruct** are divided into biomedical and clinical splits. Our experiments include fine-tuning on a single split (BioMed / Clinical) and both splits (Both). We additionally combine single splits or include additional instances, creating a similar experiment setting with 50k instances. We use the 2-shot BioMistral as the baseline system.
From Figure 3, we observe the following: (1) Instruction-tuning on the BioMed domain alone consistently outperforms tuning on the Clinical domain alone when using the same number of instances. (2) Compared to the baseline, instruction-tuning on the Clinical domain negatively impacts downstream performance on the BioMed domain. (3) Combining instances from both domains improves downstream generalizability to the BioMed domain, even with the same total number of instances. (4) Increasing the number of instances from the BioMed or Both domains improves performance, while adding more instances solely from the Clinical domain decreases performance.
Figure 3:
Average zero-shot performance on 6 biomedical NER datasets, when finetuned on different domains.
Conclusion
In this work, we introduce a unified prompting format for 7 important medical NLU tasks, and develop an instruction-tuning dataset based on publicly available clinical and biomedical corpora. Our experiment demonstrates that fine-tuning across diverse medical NLU datasets improves the system’s generalizability in a zero-shot setting with dataset-agnostic prompt tuning. Our ablation study underscores the necessity for instruction tuning across diverse medical NLU tasks, including domain-specific lexicon and common biomedical tasks.
Our NLU-focused instruction-tuning pipeline could be applied to other LLMs beyond BioMistral. We conducted an initial fine-tuning experiment with the LLaMA-3.1-8B model, using the same experiment configuration and MNLU-Instruct dataset as for BioMistral-NLU. While there were improvements for some evaluation datasets, the overall performance gains were much smaller than those of BioMistral-NLU. In the future, we aim to better understand the optimal instruction-tuning configurations for different LLMs.
Our future work will also focus on further improving the generalized LLM’s zero-shot performance on medical NLU tasks and narrowing its gap to in-domain fine-tuned systems. Since LLMs are often known to struggle with adhering to in-context annotation guidelines48, our future work will focus on integrating nuanced task descriptions from annotation guidelines into both the fine-tuning and inference stages.26 Future work could also involve a self-verification step49 or using a knowledge base as augmentation50 to reduce false positives in the sequence classification tasks.
Acknowledgement
This work was supported by the National Institutes of Health (NIH), National Cancer Institute (Grant No. 1R01CA248422-01A1), National Library of Medicine (Grant No. 2R15LM01320902A1), and National Center for Advancing Translational Sciences of the National Institutes of Health (Grant No. UL1TR002319). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
* Tasks such as NER are often treated as sequence labeling tasks in the NLP field.37 In this work, we refer to them as token classification tasks for consistency with BLURB.14
† https://github.com/huggingface/alignment-handbook
‡ https://github.com/vllm-project/vllm
§ GPT-4 version: gpt-4-0613. ChatGPT version: GPT-3.5, though the exact version was not specified in the original publication.
|| https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
¶ https://huggingface.co/BioMistral/BioMistral-7B
** We also include event triggers as named entities.
References
- 1. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, et al. Scaling Instruction-Finetuned Language Models. arXiv e-prints. 2022 arXiv-2210.
- 2. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
- 3. Saab K, Tu T, Weng WH, Tanno R, Stutz D, Wulczyn E, et al. Capabilities of Gemini Models in Medicine. 2024.
- 4. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375. 2023.
- 5. Labrak Y, Bazoge A, Morin E, Gourraud PA, Rouvier M, Dufour R. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. arXiv preprint arXiv:2402.10373. 2024.
- 6. Han T, Adams LC, Papaioannou JM, Grundmann P, Oberhauser T, Löser A, et al. MedAlpaca–an open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247. 2023.
- 7. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. 2018.
- 8. Xie Q, Chen Q, Chen A, Peng C, Hu Y, Lin F, et al. Me LLaMA: Foundation Large Language Models for Medical Applications. arXiv preprint arXiv:2402.12749. 2024.
- 9. Hu Y, Ameer I, Zuo X, Peng X, Zhou Y, Li Z, et al. Zero-shot clinical entity recognition using chatgpt. arXiv preprint arXiv:2303.16416. 2023.
- 10. Chen S, Li Y, Lu S, Van H, Aerts HJ, Savova GK, et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. Journal of the American Medical Informatics Association. 2024;31(4):940–8. doi: 10.1093/jamia/ocad256.
- 11. Ajmal S, Ahmed AAI, Jalota C. Natural language processing in improving information retrieval and knowledge discovery in healthcare conversational agents. Journal of Artificial Intelligence and Machine Learning in Management. 2023;7(1):34–47.
- 12. Wang B, Xie Q, Pei J, Chen Z, Tiwari P, Li Z, et al. Pre-trained language models in biomedical domain: A systematic survey. ACM Computing Surveys. 2023;56(3):1–52.
- 13. Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. BioNLP 2019. 2019:58.
- 14. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH). 2021;3(1):1–23.
- 15. Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, et al. Deep learning in clinical natural language processing: a methodical review. Journal of the American Medical Informatics Association. 2020;27(3):457–70. doi: 10.1093/jamia/ocz200.
- 16. Feng H, Ronzano F, LaFleur J, Garber M, de Oliveira R, Rough K, et al. Evaluation of large language model performance on the Biomedical Language Understanding and Reasoning Benchmark. medRxiv. 2024:2024–05.
- 17. Wang Y, Zhao Y, Petzold L. Are large language models ready for healthcare? A comparative study on clinical language understanding. Machine Learning for Healthcare Conference. PMLR; 2023. pp. 804–23.
- 18. Chen Q, Sun H, Liu H, Jiang Y, Ran T, Jin X, et al. An extensive benchmark study on biomedical text generation and mining with ChatGPT. Bioinformatics. 2023;39(9):btad557. doi: 10.1093/bioinformatics/btad557.
- 19. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large Language Models are Few-Shot Clinical Information Extractors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
- 20. OpenAI. Introducing ChatGPT. 2022. Accessed: 2024-04-12. https://openai.com/blog/chatgpt
- 21. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
- 22. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022;35:27730–44.
- 23. Zhang S, Dong L, Li X, Zhang S, Sun X, Wang S, et al. Instruction tuning for large language models: A survey. arXiv preprint arXiv:2308.10792. 2023.
- 24. Wang X, Zhou W, Zu C, Xia H, Chen T, Zhang Y, et al. InstructUIE: multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085. 2023.
- 25. Jiao Y, Zhong M, Li S, Zhao R, Ouyang S, Ji H, et al. Instruct and extract: Instruction tuning for on-demand information extraction. arXiv preprint arXiv:2310.16040. 2023.
- 26. Sainz O, García-Ferrero I, Agerri R, de Lacalle OL, Rigau G, Agirre E. Gollie: Annotation guidelines improve zero-shot information-extraction. arXiv preprint arXiv:2310.03668. 2023.
- 27. Wang C, Liu X, Chen Z, Hong H, Tang J, Song D. DeepStruct: Pretraining of language models for structure prediction. arXiv preprint arXiv:2205.10475. 2022.
- 28. Lu Y, Liu Q, Dai D, Xiao X, Lin H, Han X, et al. Unified structure generation for universal information extraction. arXiv preprint arXiv:2203.12277. 2022.
- 29. Zhou W, Zhang S, Gu Y, Chen M, Poon H. Universalner: Targeted distillation from large language models for open named entity recognition. arXiv preprint arXiv:2308.03279. 2023.
- 30. Zhao J, Liu C, Liang J, Li Z, Xiao Y. A Novel Cascade Instruction Tuning Method for Biomedical NER. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2024. pp. 11701–5.
- 31. Yunxiang L, Zihan L, Kai Z, Ruilong D, You Z. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070. 2023. doi: 10.7759/cureus.40895.
- 32. Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, et al. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. Journal of the American Medical Informatics Association. 2024:ocae037. doi: 10.1093/jamia/ocae037.
- 33. Luo M, Saxena S, Mishra S, Parmar M, Baral C. Biotabqa: Instruction learning for biomedical table question answering. arXiv preprint arXiv:2207.02419. 2022.
- 34. Rohanian O, Nouriborji M, Clifton DA. Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing. arXiv preprint arXiv:2401.00579. 2023. doi: 10.1016/j.artmed.2024.103007.
- 35. Tran H, Yang Z, Yao Z, Yu H. BioInstruct: instruction tuning of large language models for biomedical natural language processing. Journal of the American Medical Informatics Association. 2024:ocae122. doi: 10.1093/jamia/ocae122.
- 36. Frisoni G, Moro G, Carbonaro A. A survey on event extraction for natural language understanding: Riding the biomedical literature wave. IEEE Access. 2021;9:160721–57.
- 37. He Z, Wang Z, Wei W, Feng S, Mao X, Jiang S. A survey on recent advances in sequence labeling from deep learning models. arXiv preprint arXiv:2011.06727. 2020.
- 38. Ainsworth SK, Hayase J, Srinivasa S. Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836. 2022.
- 39. Yu L, Yu B, Yu H, Huang F, Li Y. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099. 2023.
- 40. Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. Information Access Evaluation. Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings 4. Springer; 2013. pp. 212–31.
- 41. Wang Y, Afzal N, Fu S, Wang L, Shen F, Rastegar-Mojarad M, et al. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation. 2020;54:57–72.
- 42. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research. 2024;25(70):1–53.
- 43. Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association. 2024:ocae045. doi: 10.1093/jamia/ocae045.
- 44. Tunstall L, Beeching E, Lambert N, Rajani N, Rasul K, Belkada Y, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944. 2023.
- 45. AI at Meta. Introducing Meta Llama 3: The most capable openly available LLM to date. 2024. Accessed: 2024-04-18. https://ai.meta.com/blog/meta-llama-3/
- 46. Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. 2020.
- 47. Wu S, Liu H. Semantic characteristics of NLP-extracted concepts in clinical notes vs. biomedical literature. AMIA Annual Symposium Proceedings. vol 2011. American Medical Informatics Association; 2011. p. 1550.
- 48. Zhang M, Yan H, Zhou Y, Qiu X. Promptner: A prompting method for few-shot named entity recognition via k nearest neighbor search. arXiv preprint arXiv:2305.12217. 2023.
- 49. Gero Z, Singh C, Cheng H, Naumann T, Galley M, Gao J, et al. Self-verification improves few-shot clinical information extraction. arXiv preprint arXiv:2306.00024. 2023.
- 50. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems. 2020;33:9459–74.




