Journal of the American Medical Informatics Association (JAMIA)
. 2024 Jun 4;31(9):1821–1832. doi: 10.1093/jamia/ocae122

BioInstruct: instruction tuning of large language models for biomedical natural language processing

Hieu Tran 1, Zhichao Yang 2, Zonghai Yao 3, Hong Yu 4,5,6,7,
PMCID: PMC11339494  PMID: 38833265

Abstract

Objectives

To enhance the performance of large language models (LLMs) in biomedical natural language processing (BioNLP) by introducing a domain-specific instruction dataset and examining its impact when combined with multi-task learning principles.

Materials and Methods

We created BioInstruct, a dataset of 25 005 instructions for instruction-tuning LLMs (LLaMA 1 and 2, 7B and 13B versions). The instructions were created by prompting the GPT-4 language model with 3 seed samples randomly drawn from 80 human-curated instructions. We employed Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We then evaluated these instruction-tuned LLMs on several BioNLP tasks, which can be grouped into 3 major categories: question answering (QA), information extraction (IE), and text generation (GEN). We also examined whether the categories (eg, QA, IE, and GEN) of instructions impact model performance.

Results and Discussion

Compared with LLMs that were not instruction-tuned, our instruction-tuned LLMs demonstrated marked performance gains: 17.3% in QA on average accuracy, 5.7% in IE on average F1, and 96% in generation tasks on average GPT-4 score. Our 7B-parameter instruction-tuned LLaMA 1 model was competitive with, or even surpassed, other biomedical LLMs that were also fine-tuned from LLaMA 1 with vast domain-specific data or a variety of tasks. Our results also show that the performance gain is significantly higher when instruction fine-tuning is conducted with closely related tasks. Our findings align with observations from multi-task learning, suggesting synergies between related tasks.

Conclusion

The BioInstruct dataset serves as a valuable resource, and the LLMs instruction-tuned on it lead to the best-performing BioNLP applications.

Keywords: instruction tuning, large language models, question answering, natural language inference, information extraction, text generation, multi-task learning

Introduction

Large language models (LLMs), including GPTs, have made a significant impact on natural language processing (NLP) applications.1–6 In the clinical domain, efforts have been made to fine-tune LLMs.7–10 Yet, this method can be resource-intensive and is at risk of overfitting, more so when faced with limited or low-quality clinical data.7 In contrast, "instruction tuning" has emerged within the NLP community as a promising alternative to such exhaustive fine-tuning of LLMs.11 Stemming from Instruction Fine-tuning,12 it enables models to adapt to and perform new tasks more effectively through natural language instructions alone. Innovations by Mishra et al13 and Wang et al14 have laid the groundwork for instruction tuning by harmonizing crowdsourced instructions. This approach, expanded upon by Sanh et al2 and Wei et al,11 strives for adaptability to novel instructional tasks. Subsequent efforts, like those by Chung et al,12 amplify the technique, spotlighting task diversity, augmented model scale, and integrated chain-of-thought, with Ouyang et al15 introducing a unique reinforcement learning perspective. Nevertheless, despite its notable progress in general NLP scenarios, the biomedical field finds itself underrepresented, primarily attributable to the lack of tailored instruction sets.16,7 Addressing this lacuna, our study introduces a comprehensive biomedical natural language processing (BioNLP) instruction dataset, curated with limited human intervention.

Specifically, we introduce BioInstruct, a dataset comprising more than 25 000 natural language instructions along with their corresponding inputs and outputs. Drawing inspiration from recent work that leverages the GPT language model for data generation,17 we collect BioInstruct in a fully automated manner. We prompt a pre-trained LLM, GPT-4, with a sample of 3 examples (as seeds) from our manually collected triplets (instructions, input, output), then instruct the model to generate new instructions as illustrated in Table S5. Through this process, we automatically produce over 25 000 diverse triplets consisting of instructions, inputs, and outputs spanning a range of biomedical NLP tasks. We subsequently use BioInstruct to fine-tune both LLaMA 118 and LLaMA 2.19

To evaluate LLMs' abilities in biomedical applications, we introduced a benchmark comprising several BioNLP tasks, which can be grouped into 3 major categories: question answering, information extraction, and text generation.

  • The QA (Question Answering) category includes question answering and Natural Language Inference (NLI) tasks, which play a pivotal role in healthcare by allowing accurate retrieval of specific knowledge from vast medical repositories, aiding doctors in diagnostics and treatment planning.20

  • The IE (Information Extraction) category aims to identify and extract relevant data from unstructured clinical text automatically, enabling healthcare professionals to make well-informed decisions and thus improving patient care.21

  • The Text Generation category has the potential to revolutionize patient care by summarizing conversations into clinical notes22–26 or generating a patient's assessment given the patient's symptoms.27,28

Our findings offer insights into the diverse impacts of BioInstruct across these tasks. We find that LLMs fine-tuned with BioInstruct outperformed LLMs without BioInstruct on QA, IE, and generation tasks by 17.3%, 5.7%, and 96%, respectively. Motivated by the relatively low improvement in IE, we conducted data ablation experiments to explore which types of tasks in BioInstruct contribute to these improvements. We observe that the set of contributing fine-tuning tasks is unique to each evaluation task. When evaluating text generation, LLMs fine-tuned on all tasks from BioInstruct outperform LLMs fine-tuned on any single task, but this is not the case when evaluating on QA and IE.

In addition, experiments show that a 7B-parameter LLaMA 1 model18 fine-tuned on BioInstruct can perform competitively with, or even outperform, baseline models such as PMC-LLaMA 7B,29 Asclepius 7B,30 MedAlpaca 7B,31 and ChatDoctor32 across several benchmarks. Notably, all these models, including ours, are derived from the foundational LLaMA 1 model,18 ensuring a fair basis for comparison. Specifically, our instruction-tuned LLaMA 1 model outperforms Asclepius30 and PMC-LLaMA29 on 4 out of 4 question-answering tasks, and it also exceeds MedAlpaca31 on the NLI task. This suggests that BioInstruct is particularly effective at augmenting the model's ability to apply medical knowledge. We also notice a positive correlation between the number of generated examples and downstream task performance, implying that the performance of models trained on BioInstruct can be further enhanced simply by expanding its size. To summarize, our research contributions are as follows:

  • We introduce a new benchmark to evaluate LLMs' ability on 3 categories of BioNLP tasks: QA to evaluate medical knowledge, IE to evaluate clinical extraction, and text generation to evaluate applied clinical skills.

  • We introduce a new instruction-tuning dataset, BioInstruct, which is specifically tailored for the biomedical domain. We find that LLMs fine-tuned on BioInstruct significantly improve performance on the benchmark compared with competitive baselines.

  • We further explore which types of tasks in BioInstruct contribute to these improvements through a multi-task learning framework. We observe that the set of contributing fine-tuning tasks depends on the evaluation task. This inspires future work to predict contributing tasks given a new evaluation task.

Related work

LLMs in BioNLP

The use of LLMs in the field of NLP has shown remarkable potential and achieved significant milestones. The fine-tuning of these models on specific tasks has resulted in breakthroughs across a wide range of applications, including translation, text generation, and question-answering. In BioNLP, LLMs have also begun to play a crucial role. These models, when fine-tuned on biomedical text corpora, have shown promising results in medical information extraction, biomedical literature summarization, and answering medical questions. Models like BioBERT,33 ClinicalBERT,7 and the recent MedAlpaca,31 which introduces an open-source collection of medical conversational AI models and training data tailored for LLMs, have advanced biomedical text-mining capabilities. Additionally, fine-tuned LLMs such as PMC-LLaMA,29 Asclepius,30 and ChatDoctor32 have shown impressive performance in various biomedical tasks, setting new benchmarks in the field.

Traditional fine-tuning versus instruction tuning

Traditional fine-tuning and instruction tuning offer contrasting methods for adapting LLMs to specific tasks. Traditional fine-tuning, as used in models like BioBERT33 and ClinicalBERT,7 involves extensive retraining of a pre-trained model on task-specific datasets. This approach often yields highly specialized models but requires substantial, high-quality data and can be prone to overfitting. On the other hand, instruction tuning, exemplified by works like Wei et al,11 focuses on the model's ability to follow natural language instructions, promoting flexibility and generalizability across various tasks with less dependency on extensive task-specific data. This approach is particularly beneficial in scenarios where acquiring large annotated datasets is challenging. However, instruction tuning may not always achieve the same level of task-specific accuracy as traditional fine-tuning with a dedicated dataset. The choice between these methodologies depends on factors like data availability, adaptability needs, and desired task performance.

Instruction tuning in BioNLP

Obtaining large-scale supervised data can be expensive and time-consuming. A popular solution in recent literature has been to use language models (LMs) for automatic data generation and augmentation.34,35 Different from those approaches, Self-Instruct17 is not specific to a particular task (say, QA or NLI) but instead bootstraps new task definitions that may not have been defined before by NLP practitioners (though potentially still important for real users).1,14,36 Recent work by Wei et al11 highlighted that instruction tuning (fine-tuning language models on a collection of datasets described via instructions) substantially improves the zero-shot performance of LLMs on unseen tasks. In parallel with our work, Zhang et al37 also proposed generating instruction data with GPT-4 models and fine-tuning the LLaMA model (so-called AlpaCare). The major differences are that (1) they use a set of clinician-crafted tasks as their seed tasks, resulting in a different distribution of generated tasks; and (2) they evaluate the performance of their instructed model on different benchmarks and employ different metrics.

Methodology

Our methodology incorporates 2 primary components: the creation of our BioInstruct dataset and the subsequent fine-tuning of several LLMs.

Instruction collection

We introduce BioInstruct, a dataset comprising 25 005 natural language instructions tailored to a broad array of biomedical and clinical NLP tasks. Each entry in this dataset is structured with a natural language instruction, an associated input, and the expected output resulting from the instruction’s execution.

Taking inspiration from the “Self-Instruct” methodology,17 the collection of BioInstruct is a fully automated process. This process requires only an initial set of 80 manually constructed seed tasks, which can be produced within roughly 3 hours of human effort. These seed examples span a diverse range of biomedical and clinical NLP tasks, covering areas such as answering biomedical questions, summarizing, assessing eligibility for clinical trials, and determining differential diagnosis, among others (some examples of seed tasks are shown in Table S6). During the data collection phase, we prompted the pretrained GPT-4 language model with 3 examples randomly selected from seed tasks, guiding it to generate new samples (illustrated in Table S5). We prompted GPT-4 to create 25 005 natural language instructions in July 2023.

Among the GPT-4-created instructions, we plot the top 20 most common root verbs and their top 4 direct noun objects of the BioInstruct dataset in Figure 1. Overall, we see quite diverse intents and textual formats in these instructions. Additionally, we used GPT-4 to classify the instructions into major categories such as Question Answering, Generation, Information Extraction, and Others (eg, trial recommendation). We validated the GPT-4 classification performance on 1000 test examples; the accuracy was 87.3%. Moreover, to provide deeper insight into the reliability of these category predictions, we conducted an inter-rater reliability analysis with 3 independent raters who classified each example into the designated major categories. This process resulted in a Krippendorff's alpha of 0.83, reflecting strong consensus among the raters and underscoring the consistency and precision of our categorization approach. The distribution of these categories is visually represented in the pie chart shown in Figure 1. The data generation prompt consists of examples in a structured format, making it easier for the pre-trained GPT-4 language model to generate new examples. Each example in the dataset contains 3 fields:

Figure 1. Distribution of our BioInstruct dataset. (A) Task type distribution of 25 005 natural language instructions. (B) The top 20 most common root verbs (inner circle) and their top 4 direct noun objects (outer circle) in the generated instructions.

  1. An instruction describing the task (eg, “Given a lengthy patient education material, provide a concise summary that preserves the crucial information, while ensuring it remains accessible for patients”).

  2. The input argument that instantiates the instruction, creating a specific example of the task.

  3. A textual output reflecting a correct execution of the instruction given the input arguments.
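For concreteness, a minimal sketch of how one such triplet might be represented follows. The instruction text reuses the example above; the input and output values are hypothetical illustrations, not records from the released dataset.

```python
# A minimal, illustrative BioInstruct-style record. The instruction reuses the
# example from the text; the input/output values are hypothetical.
example_record = {
    "instruction": (
        "Given a lengthy patient education material, provide a concise summary "
        "that preserves the crucial information, while ensuring it remains "
        "accessible for patients"
    ),
    "input": "High blood pressure, or hypertension, is a condition in which ...",
    "output": "Your blood pressure is higher than it should be. It often has no ...",
}
```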

To promote diversity, we only incorporate a new instruction into the task pool if its ROUGE-L similarity with existing instructions falls below 0.7. Instructions containing particular keywords (eg, image, picture, graph), which are typically not handled by LMs, are disregarded. Invalid generations are identified and filtered out using heuristics: the instruction is too long (more than 150 words) or too short (fewer than 3 words), or the instance output is a repetition of the input.
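The generate-and-filter loop described above can be sketched as follows. This is an illustration rather than the authors' exact pipeline: `call_gpt4` is a hypothetical wrapper around the chat API that returns candidate triplets, and the thresholds mirror those stated in the text (ROUGE-L below 0.7, length between 3 and 150 words).

```python
import random

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
BANNED_KEYWORDS = ("image", "picture", "graph")


def is_valid(record: dict) -> bool:
    """Heuristic filters described in the text."""
    instruction = record["instruction"]
    n_words = len(instruction.split())
    if n_words > 150 or n_words < 3:                          # too long / too short
        return False
    if any(k in instruction.lower() for k in BANNED_KEYWORDS):
        return False
    if record["output"].strip() == record["input"].strip():   # output repeats input
        return False
    return True


def is_diverse(instruction: str, pool: list) -> bool:
    """Keep an instruction only if its ROUGE-L vs every pooled instruction is < 0.7."""
    return all(
        scorer.score(existing["instruction"], instruction)["rougeL"].fmeasure < 0.7
        for existing in pool
    )


def expand_pool(seed_tasks: list, target_size: int, call_gpt4) -> list:
    """Bootstrap new (instruction, input, output) triplets from 3 random seed examples."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        demos = random.sample(seed_tasks, 3)   # 3 in-context demonstrations
        for record in call_gpt4(demos):        # hypothetical GPT-4 API wrapper
            if is_valid(record) and is_diverse(record["instruction"], pool):
                pool.append(record)
    return pool
```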

Instruction tuning

Instruction Tuning refers to continually training an LLM with NLP tasks formatted as natural language instructions and model responses treated as task outputs.11 Beyond solving structured natural language tasks, Alpaca38 turned an English-centric LLM, LLaMA 1,18 into an open-ended chat model, delivering GPT-like performance in English by training on data distilled from GPT itself.17 Our work follows a similar approach, performing instruction tuning on LLaMA 118 and LLaMA 219 using our BioInstruct dataset.

The core idea behind Instruction Tuning is to utilize natural language as a versatile interface, allowing models to process instructions in the same way they would handle any other text input. This significantly broadens their usability across various domains and tasks, leveraging their pre-trained capabilities to interpret and act upon instructions directly. Instruction Tuning thus encourages models to become more flexible in their application, moving towards an intuitive understanding of tasks through human-like instructions.

An example of how instruction tuning is performed is shown in Table S7. This table presents a typical instruction-tuning prompt format: an instruction clearly defines a task, followed by an input providing contextual background, and the model generates an appropriate response based on the instruction. This approach exemplifies the Instruction Tuning process, illustrating how models are trained to perform specialized tasks by using task descriptions and inputs to guide the generation of precise outputs.
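As a rough illustration of this setup, the sketch below assembles an instruction-tuning prompt and attaches LoRA adapters with the Hugging Face transformers and peft libraries. The prompt wording follows a generic Alpaca-style template rather than the exact template in Table S7, and the LoRA hyperparameters (rank, alpha, target modules) are assumed values, not the authors' configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer


def build_prompt(instruction: str, context: str) -> str:
    """Assemble an Alpaca-style instruction-tuning prompt (assumed template)."""
    return (
        "Below is an instruction that describes a task, paired with an input that "
        "provides further context. Write a response that appropriately completes "
        "the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{context}\n\n"
        "### Response:\n"
    )


base_model = "meta-llama/Llama-2-7b-hf"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of all base parameters.
# The hyperparameters below are assumed, not the authors' configuration.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of weights are trainable
```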

Multi-task exploration

To find out which tasks in BioInstruct contributed the most to performance on the evaluation benchmark, we first grouped the samples in BioInstruct into the following BioNLP application categories: QA, IE, generation, and other tasks, comprising 22.8%, 33.8%, 33.5%, and 10% of the data, respectively. We then fine-tuned models on subsets (single or multiple tasks) of BioInstruct as mentioned previously and evaluated them on the benchmark. We excluded NLI, as there are few such samples in the BioInstruct dataset.
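A minimal sketch of how such category subsets could be assembled for the ablations is shown below; it assumes each BioInstruct record carries a `category` label produced by the GPT-4 classifier described earlier, which is an assumption about the data layout rather than a documented field name.

```python
from itertools import combinations

CATEGORIES = ["QA", "IE", "GEN"]   # "Other" is left out of the ablation


def subset(records: list, categories: set) -> list:
    """Select records whose (assumed) category label is in the given set."""
    return [r for r in records if r.get("category") in categories]


def ablation_training_sets(records: list) -> dict:
    """Build single-task and two-task training subsets plus the full mixture."""
    runs = {}
    for k in (1, 2):
        for combo in combinations(CATEGORIES, k):
            runs["+".join(combo)] = subset(records, set(combo))
    runs["ALL"] = records
    return runs
```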

Experimental setup

In this section, we detail our experimental setup, the datasets employed, and the evaluation strategy adopted for assessing the performance of our instruction-tuned LLMs in various BioNLP tasks. A 2-sided t test was used to determine the significance of improvements between different models. All significance tests were evaluated at α=0.05. Furthermore, all experiments were conducted using 2 Nvidia A100 GPUs, each with 40 GB of memory. The CPU used was an Intel Xeon Gold 6230 processor, and the system was equipped with 192 GB of RAM.
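As an illustration of the significance testing, the snippet below runs a 2-sided t test with scipy on hypothetical per-run scores for two models; the paper does not detail the exact run protocol behind each reported mean, so the numbers are placeholders.

```python
from scipy import stats

# Hypothetical per-run accuracies for an instruction-tuned model and its baseline.
instructed = [31.4, 31.6, 31.5]
baseline = [27.4, 27.5, 27.5]

t_stat, p_value = stats.ttest_ind(instructed, baseline)   # 2-sided by default
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:   # alpha = 0.05, as used in the paper
    print("difference is statistically significant at alpha = 0.05")
```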

Benchmark datasets

Our experiments spanned multiple subtasks, covering major NLP tasks within the biomedical domain. We focused on recent tasks to reduce data leakage, as GPT-4 was trained on information available up until September 2021:

Multiple Choice Question Answering (MCQA):

  • MedQA-USMLE: 39 A subtask containing multiple-choice questions from the United States Medical Licensing Examination.

  • MedMCQA: 40 Features medical multiple-choice questions sourced from various textbooks and clinical scenarios.

  • PubMedQA: 41 A benchmark dataset for biomedical research question answering derived from PubMed abstracts.

  • BioASQ MCQA: 42 A comprehensive subtask designed for biomedical semantic indexing and question answering.

Natural Language Inference (NLI):

  • MedNLI: 43 A medical NLI subtask crafted from clinical narratives, which tasks models with determining the relationship between premise and hypothesis sentences. We excluded NLI from the multi-task exploration, as there were few such samples in the BioInstruct dataset.

Clinical Information Extraction:

  • Medication Status Extraction: 44 A subtask focusing on the extraction of medication-related information from clinical narratives, such as drug names, dosages, and administration routes.

  • Clinical Coreference Resolution: 44 This involves identifying phrases in clinical texts that refer to the same entity, facilitating understanding patient narratives.

Generation task about clinical skills:

  • Conv2note: 22 This is a subtask (A) of the MediQA-Chat 2023 Challenge. The goal is to generate a section of a structured clinical note based on a given patient-doctor conversation, with an emphasis on accurately capturing medical details.

  • ICliniq: 32 In this subtask, given a patient’s detailed description of their concerns or symptoms, the aim is to provide a compassionate, reassuring, and informative medical response that addresses their worries and offers professional guidance on the next steps or actions to take. We evaluate this using a random subset of 100 samples from the ICliniq dataset.

Evaluation metrics

For the multiple choice QA and NLI tasks, we employed accuracy as the primary metric.

In the clinical information extraction tasks, performance was gauged using precision, recall, F1 score, and conditional accuracy. The latter specifically measured the correctness of medication status classification, indicating how many statuses were correctly identified.
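One plausible reading of conditional accuracy is: among medications that were correctly extracted, the fraction whose predicted status matches the gold status. The sketch below implements that reading; the exact matching and normalization rules used in the paper are not specified, so treat this as an assumption.

```python
def conditional_accuracy(pred: dict, gold: dict) -> float:
    """pred/gold map a medication mention to its status (eg, 'active', 'discontinued').

    Conditional ACC = correct statuses / medications that were correctly extracted.
    """
    matched = [m for m in pred if m in gold]   # correctly extracted medications
    if not matched:
        return 0.0
    correct = sum(pred[m] == gold[m] for m in matched)
    return correct / len(matched)


# Hypothetical annotations for one note.
gold = {"metformin": "active", "lisinopril": "discontinued"}
pred = {"metformin": "active", "aspirin": "active"}
print(conditional_accuracy(pred, gold))   # 1.0: the one matched medication has the right status
```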

For the generation task, we utilized the GPT-4 API to evaluate Coherence, Naturalness, and Completeness scores, which assess the quality and relevance of the generated contents. This decision was informed by recent studies,45,46 which demonstrated GPT-4's strong alignment with human evaluation in various generative tasks. Those works showed GPT-4's evaluations of coherence, naturalness, and completeness to closely mirror human judgments, suggesting its reliability in assessing the nuanced aspects of generated content. In the clinical NLP domain, a previous study47 compared GPT-4 evaluation and human evaluation on 100 patient-doctor conversations and showed that GPT-4 evaluation generally aligns with human evaluation (Figures 3 and 4 of that study47). To further refine our evaluation, we incorporated the Bertscore_F1 metric. This metric, derived from the embeddings of the BERT model, gauges the semantic similarity between the generated and reference texts. Another significant metric we introduced was Concept_F1, which calculates the overlap of medical entities between 2 documents, offering insight into the model's grasp of medical knowledge and terminology. Additionally, the Bleurt score, an advanced metric designed specifically for generation tasks, was employed. It measures the alignment of the model's output with the human-written reference, both syntactically and semantically.
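For the automatic metrics, BERTScore can be computed with the bert-score package, and Concept_F1 is an entity-overlap F1. The sketch below assumes a hypothetical `extract_medical_concepts` helper (eg, a clinical NER tagger), since the paper does not specify which concept extractor was used; the example sentences are illustrative.

```python
from bert_score import score as bertscore


def concept_f1(generated: str, reference: str, extract_medical_concepts) -> float:
    """F1 over the sets of medical entities found in the two documents."""
    gen = set(extract_medical_concepts(generated))   # hypothetical NER helper
    ref = set(extract_medical_concepts(reference))
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# BERTScore F1 between a generated note and its reference (illustrative sentences).
candidates = ["Patient reports intermittent chest pain relieved by rest."]
references = ["The patient describes chest pain that comes and goes and improves with rest."]
_, _, f1 = bertscore(candidates, references, lang="en")
print(f"BertScore_F1: {f1.mean().item():.3f}")
```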

Baseline models

In our study, we compared the performance of BioInstruct with several prominent baselines in the BioNLP domain:

Asclepius 30 is a specialized clinical LLM, further fine-tuned from LLaMA 1 on synthetic clinical notes derived from publicly available biomedical literature. Preliminary evaluations show that Asclepius can effectively handle tasks on real clinical notes, showing its promise for real-world healthcare applications.

ChatDoctor 32 is a language model intended as a health assistant, designed to provide users with medical information, advice, and guidance. It is fine-tuned from the LLaMA 1 7B model with dialogue-based instruction-tuning data.

MedAlpaca 31 is a model that has been further fine-tuned on Alpaca,38 which is an instruction-tuned variant of LLaMA. Its primary focus is on assisting with medical dialogues and handling question-answering tasks.

PMC-LLaMA 29 is a domain-adapted LLaMA 1 model that was pretrained on 4.8M biomedical academic papers and 30K medical textbooks, and was further fine-tuned with a comprehensive dataset tailored for instruction tuning, covering areas like medical question-answering and conversational dialogues.

Results

Performance on QA and NLI tasks

How does instruction tuning perform on QA and NLI tasks?

Table 1 details the zero-shot performance across various models on multiple-choice question answering (QA) tasks and NLI tasks within the biomedical domain, including MedQA-USMLE, MedMCQA, PubmedQA, BioASQ MCQA, and MedNLI. Focusing on the LLaMA series, the instructed variants consistently outperform their non-instructed counterparts, demonstrating the utility of instruction tuning in enhancing model performance. For instance, in the MedQA-USMLE task, instruction-tuned LLaMA 1 7B outperformed its non-instructed variant by a 95% confidence interval (CI) of 3.96-4.10, P <.05. Similar improvements were observed in the MedMCQA, where the instructed version surpassed the non-instructed LLaMA 1 7B by a 95% CI of 6.2-6.6, P <.05. The BioASQ MCQA and PubmedQA tasks further illustrate the strength of instruction tuning. LLaMA 1 13B Instruct exhibited superior performance, with notable improvements in BioASQ MCQA, achieving a score of 84.29 compared to the non-instructed score of 69.86, with a 95% CI of 11.41-17.45, P <.05. Similarly, in PubmedQA, the instructed variant recorded an improvement with a 95% CI of 2.23-5.51, P <.05 over the non-instructed model. Moreover, instruction-tuned LLaMA 2 models also showed remarkable performance leaps. In the MedNLI task, LLaMA 2 13B Instruct achieved an improvement with a 95% CI of 4.57-6.95, P <.05, compared to its non-instructed version. The consistency of improvement across multiple models and tasks validates the effectiveness of instruction tuning in leveraging model capabilities for specific domain tasks. These statistically significant improvements emphasize the critical role of instruction tuning in enhancing the specialized performance of language models, particularly in demanding domains like medical question answering and inference tasks.

Table 1.

Zero-shot performance of the original, instructed, and some baseline models on multiple-choice QA tasks and natural language inference tasks, P <.05; 95% confidence interval.

Model MedQA-USMLE MedMCQA PubmedQA BioASQ MCQA MedNLI
AlpaCare LLaMA 1 7B 28.7 ± 0.0722 33.91 ± 0.0097 62.9 ± 0.2104 68.43 ± 2.7543 33.3 ± 0.3143
AlpaCare LLaMA 2 7B 34.88 ± 0.0466 36.04 ± 0.0643 55.52 ± 1.0427 67.14 ± 2.8654 35.96 ± 0.1209
Asclepius 7B 29.08 ± 0.2405 30.46 ± 0.0953 40.14 ± 0.3667 72.14 ± 1.439 38.83 ± 0.0302
ChatDoctor 30.92 ± 0.1097 31.01 ± 0.1182 63.08 ± 0.4116 74.86 ± 0.601 40.14 ± 1.5672
MedAlpaca 7B 37.61 ± 0.0467 34.37 ± 0.079 48.82 ± 0.3415 61.57 ± 5.9762 36.6 ± 2.1421
PMC-LLaMA 7B 27.73 ± 0.0743 26.94 ± 0.1485 55.8 ± 1.4445 61.43 ± 1.4289 33.71 ± 1.1738
LLaMA 1 7B 27.47 ± 0.0628 25.06 ± 0.1601 47.54 ± 0.5771 70.14 ± 2.7543 33.59 ± 0.9334
LLaMA 1 7B Instruct 31.5 ± 0.0038 31.46 ± 0.0561 64.36 ± 1.6479 81.14 ± 1.8933 43.27 ±1.9985
LLaMA 1 13B 33.85 ± 0.8164 31.76 ± 0.0473 54.88 ± 0.4572 69.86 ± 1.4991 34.62 ± 2.0661
LLaMA 1 13B Instruct 36.76 ± 0.121 34.03 ± 0.4909 58.75 ± 1.3148 84.29 ± 2.0862 34.63 ± 0.5435
LLaMA 2 7B 30.37 ± 1.0608 29.41 ± 0.0377 57.2 ± 0.7538 65.86 ± 3.1138 33.35 ± 0.016
LLaMA 2 7B Instruct 37.25 ± 0.3621 36.14 ± 0.0254 63.97 ± 0.7363 84.76 ± 4.4353 34.96 ± 1.3937
LLaMA 2 13B 34.89 ± 0.5802 32.37 ± 0.3702 67.76 ± 0.7174 73.14 ± 2.97 35.94 ± 1.006
LLaMA 2 13B Instruct 39.32 ±0.0998 36.54 ±0.27 71.73 ±1.0781 87.38 ±4.8823 41.7 ± 0.1315

Bold values indicate best performance.

Performance on IE tasks

How does instruction tuning perform on medication status extraction?

Table 2 showcases the performance of various models on the clinical information extraction task, specifically focused on medication status. Notably, LLaMA 2 7B Instruct recorded the highest F1 score of 75.63, significantly outperforming its non-instructed counterpart (95% CI: 3.40-4.42, P <.05). This was paralleled by a substantial boost in precision, achieving the top score of 75.44 compared with 65.58 from its non-instructed version. Similarly, LLaMA 1 7B Instruct showed a notable increase in recall from 71.95 to 84.43, leading to an F1 improvement with a 95% CI of 2.10-2.68, P <.05. These improvements underline the effectiveness of instruction tuning in enhancing the model's ability to accurately extract and classify medication status, substantiating the method's value in clinical applications.

Table 2.

One-shot performance of the baseline models, original LLaMA, and instructed LLaMA on the medication status extraction and coreference resolution tasks.

Models: Medication status extraction (Precision, Recall, F1, Conditional ACC), then Coreference resolution (Precision, Recall, F1)
PMC-LLaMA 7B 68.08 78 67.28 ± 0.3213 71.21 ± 0.3311 56.56 47.37 45.97 ± 0.1935
ChatDoctor 63.07 83.56 67.6 ± 0.0232 80.31 ± 0.3565 58.29 56.18 54.02 ± 0.0366
LLaMA 1 7B 67.61 71.95 66.58 ± 0.0601 78.45 ± 0.2048 60.14 39.19 43.8 ± 0.1914
LLaMA 1 7B Instruct 62.83 84.43 68.97 ± 0.241 82.52 ±0.6844 67.58 52.52 55.14 ± 0.1765
LLaMA 2 7B 65.58 89.16 71.72 ± 0.2313 76.69 ± 0.1564 61.48 45.97 50.03 ± 0.3343
LLaMA 2 7B Instruct 75.44 82.83 75.63 ±0.3701 82.3 ± 0.0664 72.58 58.69 61.24 ±0.439

Conditional ACC measured the correctness of medication status classification, indicating how many statuses were correctly identified conditional on the extracted medications, P <.05; 95% confidence interval. Bold values indicate best performance.

How does instruction tuning perform on clinical coreference resolution?

Table 2 also illustrates the performance of various models on the clinical information extraction task of coreference resolution. LLaMA 2 7B Instruct led with an F1 score of 61.24, markedly better than its non-instructed version, which posted an F1 of 50.03. This improvement (95% CI: 10.88-11.54, P <.05) highlights the substantial enhancement brought about by instruction tuning. LLaMA 1 7B Instruct also saw a significant F1 increase to 55.14 (95% CI: 10.56-11.86, P <.05). The advancements in precision and recall across both models accentuate the advantage of instruction tuning in mastering the intricacies of coreference resolution within the clinical domain, which is essential for constructing coherent and comprehensive patient narratives.

Performance on generation tasks

How does instruction tuning perform on short Dialogue2Note summarization?

Table 3 encapsulates the performances of various models on Conv2note, a challenging task aimed at converting doctor-patient conversations into clinical notes. The instruction-tuned LLaMA 1 7B model demonstrated significant improvements across all metrics compared with its non-instructed version, with a marked increase in Coherence from 2.74 to 4.49, Completeness from 2.76 to 4.54, and Naturalness from 2.44 to 4.42, outperforming the baseline LLaMA 1 7B by 95% confidence intervals of 1.747-1.753, 1.777-1.782, and 1.975-1.985, respectively, all with P <.05. Similarly, the LLaMA 2 7B Instruct model also exhibited notable improvements, achieving higher scores in all categories than the non-instructed version, which highlights the effectiveness of instruction tuning in enhancing the model's ability to generate coherent, complete, and natural notes from conversations.

Table 3.

One-shot performance of the original model, instructed model, and some baseline models on MediQA-Task A: doctor-patient conversation to clinical note and ICliniq: doctor-patient question answering, P <.05; 95% confidence interval.

Models: GPT-4 metrics (Coherence, Completeness, Naturalness), then other metrics (Concept_F1, BertScore_F1, Bleurt)
Conv2note
PMC-LLaMA 7B 2.84 ± 0.0031 2.91 ± 0.0010 2.46 ± 0.0016 17.91 ± 0.3291 85.91 ± 0.5463 46.63 ± 0.2789
ChatDoctor 4.47 ± 0.0004 4.46 ± 0.0016 4.37 ± 0.0004 29.42 ± 0.0491 86.99 ± 0.3024 51.34 ± 0.1637
LLaMA 1 7B 2.74 ± 0.0020 2.76 ± 0.0016 2.44 ± 0.0038 17.87 ± 0.5227 85.28 ± 0.3021 46.95 ± 1.8605
LLaMA 1 7B Instruct 4.49 ±0.0011 4.54 ±0.0012 4.42 ±0.0014 31.13 ±0.1979 88.45 ±0.0218 53.19 ±0.0231
LLaMA 2 7B 2.80 ± 0.0037 2.96 ± 0.0009 2.57 ± 0.0060 18.93 ± 0.9404 84.92 ± 0.0894 45.95 ± 0.2743
LLaMA 2 7B Instruct 4.45 ± 0.0025 4.41 ± 0.0011 4.28 ± 0.0043 27.87 ± 0.1094 88.06 ± 0.0254 51.67 ± 0.0252
Doctor-patient QA
PMC-LLaMA 7B 2.31 ± 0.0060 2.64 ± 0.0132 2.34 ± 0.0039 15.85 ± 0.1743 84.02 ± 1.6380 46.55 ± 0.1897
ChatDoctor 4.81 ± 0.0002 4.86 ± 0.0007 4.62 ± 0.0002 18.37 ± 0.5444 86.25 ± 0.1389 48.25 ± 0.8187
LLaMA 1 7B 2.4 ± 0.0042 2.35 ± 0.0027 2.28 ± 0.0093 17.91 ± 0.1545 83.08 ± 0.0483 47.57 ± 0.8279
LLaMA 1 7B Instruct 4.7 ± 0.0010 4.74 ± 0.0011 4.58 ± 0.0017 20.3 ± 0.0185 87.38 ±0.0245 50.28 ±0.0225
LLaMA 2 7B 1.95 ± 0.0171 2.14 ± 0.0101 1.8 ± 0.0018 15.35 ± 0.5389 84.77 ± 0.2790 46.21 ± 0.3205
LLaMA 2 7B Instruct 4.86 ±0.0005 4.9 ±0.0011 4.71 ±0.0009 22.63 ±0.1026 86.35 ± 0.0327 48.34 ± 0.0453

Bold values indicate best performance.

How does instruction tuning perform on doctor-patient QA?

Table 3 also illustrates the performances of several models on ICliniq, a challenging task aimed at mimicking doctor-patient question answering. LLaMA 2 7B Instruct not only surpassed its non-instructed counterpart significantly, with improvements in Coherence (1.95-4.86), Completeness (2.14-4.9), and Naturalness (1.8-4.71), but also outperformed the PMC-LLaMA 7B and ChatDoctor models. These improvements are statistically significant, with 95% confidence intervals of 2.90-2.93, 2.748-2.772, and 2.908-2.912, respectively, all with P <.05, showcasing instruction tuning's pivotal role in optimizing models for complex QA tasks. LLaMA 2 7B Instruct also set a new benchmark by achieving the highest scores across almost all metrics, further underscoring the impact of targeted instruction tuning in enhancing model performance on specialized tasks.

Multi-task instruction tuning result

In the previous section, we found that models instruction-tuned with BioInstruct outperformed models without BioInstruct. In this section, we attempt to find out how BioInstruct impacts model performance. We summarize our findings by answering the following 4 questions:

Which task-specific tuning results in the best model performance?

Table 4 contains the results of our LLaMA 2 model instruction-tuned on only a single task. For each evaluation subtask, the model instruction-tuned on the same task as the evaluation task outperforms models instruction-tuned on other tasks, with the exception of 2 non-IE subtasks: BioASQ MCQA and Conv2note.

Table 4.

Performance on different tasks when trained on a single task, and performance gain when the model is trained with one additional task besides the evaluation task.

Eval task | Performance when trained on a single task (Train task: QA, IE, GEN) | Gain from one additional task (w. QA, w. IE, w. GEN)
QA 53.46 52.80 52.27 n/a 0.07% −0.94%
IE 69.07 72.43 65.17 2.04% n/a −3.74%
GEN 4.36 4.39 4.42 2.15% −2.00% n/a

For the QA task, average accuracy was calculated using results from 4 datasets: MedQA-USMLE, MedMCQA, PubmedQA, and BioASQ MCQA. For the IE task, average F1 scores from Medication Status Extraction and Coreference Resolution were used. Lastly, for the Generation task, average scores based on GPT-4 metrics from the Conv2note and Doctor-Patient QA datasets were evaluated. Bold values indicate best performance.

Which additional task-specific tuning improves performance the most?

We also explore synergies between tasks in a multi-task scenario. Table 4 also contains the results of our LLaMA 2 model instruction-tuned on 2 tasks combined. During instruction tuning, we fix 1 training task to be the same as the evaluation task and vary the additional task among the tasks other than the evaluation task. We find that models fine-tuned with some additional tasks outperformed those fine-tuned without the additional task, but not every additional task improves performance on the evaluation task. We summarize our findings as follows:

  1. Additional IE task helps QA (0.07%) and vice versa (2.04%).

  2. The additional QA task benefits the generative task by 2.15%. However, the generative task does not benefit QA.

Does fine-tuning on more tasks achieve better performance?

Finally, we explore whether more training tasks lead to better performance. We verify the following statement for different evaluation tasks: the model fine-tuned on all tasks outperforms models fine-tuned on a single task or 2 tasks, as described previously. We found that this statement is true for all generative tasks, but it does not apply to any of the IE tasks, as shown in Figure 2.

Figure 2. Performance of different tasks in BioInstruct. Each scatter plot corresponds to a subtask to evaluate. Each colored dot inside the scatter plot represents a different training task. The black dot represents the baseline performance of LLaMA 2 7B without BioInstruct fine-tuning. The purple dot represents the performance of LLaMA 2 7B fine-tuned on all BioInstruct tasks. We then ablate BioInstruct. Above each scatter plot, we provide the best single fine-tuning task in the first row. In the second row, we also provide the best fine-tuning task in addition to the specific task A, where task A is the same as the evaluation task.

How much data is needed for instruction tuning to make a difference?

We found that 5000 samples from BioInstruct significantly improve performance on most evaluation tasks, as shown in Figure 3. However, adding more fine-tuning samples still improves performance. Further investigation into performance dynamics revealed a nuanced response to increasing instruction volumes across different task categories. Specifically, while QA and IE tasks generally benefited from a larger set of instructions, a notable dip in IE performance at 10 000 samples suggested a temporary challenge in generalization, which was subsequently overcome as instruction diversity increased at 25 000 samples. Conversely, generation tasks demonstrated a steady improvement in performance with the addition of more instructions, reinforcing the positive impact of data volume on model capability. This analysis confirms that, beyond the initial significant improvements seen with 5000 samples, further enriching the instruction pool continues to enhance model performance, highlighting the importance of both the quantity and diversity of instructional data.

Figure 3. Performance on different evaluation tasks when LLaMA 2 7B is fine-tuned on a varying number of instruction samples in BioInstruct.

Discussion

In this study, we attempted to find out whether the impact of BioInstruct is tailored to specific BioNLP applications. For this, we draw on synergies from multi-task learning,48,49 where studies have shown that similar tasks lead to greater improvement than dissimilar tasks (Table 4). As shown in Figure 2, the highest performance gain occurs when the evaluation task and the tuning instructions are similar. In addition, our results show that instruction tuning with additional QA tasks improves performance when evaluating on IE and generative tasks, compared with tuning without additional QA tasks. This supports previous findings in transfer learning: intermediate training on reasoning tasks tends to transfer best to various downstream tasks.50

As shown in Figure 3, instruction tuning with more data improved performance for most tasks except 2 clinical QA tasks, MedQA-USMLE and MedMCQA, especially when compared with the biomedical QA tasks PubmedQA and BioASQ. These biomedical QA tasks primarily focus on information retrieval from biomedical literature, where an extensive dataset provides a richer context and a broader knowledge base for the model to understand and generate accurate responses. On the other hand, MedQA-USMLE and MedMCQA present a distinct set of challenges, as they are centered around clinical questions, which tend to be more specific and scenario-based and require a depth of understanding in clinical reasoning. The narrower focus and the need for practical application of knowledge in these clinical tasks might not benefit as much from the addition of more data, especially if the added data is more general to biomedical contexts.

Unfortunately, there was no single instruction-tuning task combination that achieved the best performance across all BioNLP tasks (Figure 2); the best combination varied depending on the specific evaluation task. This is consistent with previous studies in the general NLP domain,50–52 and it remains unclear how to identify such a subset.53 Future work could explore task embeddings for meta-learning54,55 in the BioNLP domain.

Limitation

Despite the merits of this study, there are several limitations that could be improved upon in future research. First, we adopted an empirical sampling approach from prior research,17 choosing 3 demonstrations from a pool of examples to construct the BioInstruct dataset. Future research could benefit from experimenting with the number of demonstrations as a hyperparameter. Second, our selection of demonstrations was random. Previous works have indicated that choosing demonstrations with semantic similarities can enhance performance in downstream tasks,56 whereas others have suggested benefits from a more diverse selection.57 An in-depth analysis of the balance between similarity and diversity could provide more insights for future investigations. Additionally, a comparison with closed-source models is missing in our current evaluation, which could provide a more comprehensive benchmark of our model’s capabilities. This aspect will be considered in subsequent studies to better position our findings within the broader landscape of model performances. Finally, we found that adding more fine-tuning samples would still improve the performance of some tasks. In future work we will explore the threshold for the best performance.

Conclusion

We introduce BioInstruct, an automatically generated dataset of natural language instructions and their corresponding inputs and outputs. Our experiments show that LLMs trained on BioInstruct outperformed not only LLMs without BioInstruct but also other strong LLMs fine-tuned on extensive amounts of domain-specific data. Our results also show that performance gains were largest when instruction tuning used instructions from similar BioNLP applications, and that the performance gain was steepest over the first 5000 instructions.

Supplementary Material

ocae122_Supplementary_Data

Acknowledgments

We thank our colleagues and reviewers for their constructive comments and insights that greatly improved the manuscript.

Contributor Information

Hieu Tran, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, United States.

Zhichao Yang, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, United States.

Zonghai Yao, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, United States.

Hong Yu, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003, United States; Department of Medicine, University of Massachusetts Medical School, Worcester, MA 01655, United States; Center for Biomedical and Health Research in Data Sciences, Miner School of Computer and Information Sciences, University of Massachusetts Lowell, Lowell, MA 01854, United States; Center for Healthcare Organization and Implementation Research, VA Bedford Health Care, Bedford, MA 01730, United States.

Author contributions

Hieu Tran is the implementer who built the NLP system. He was responsible for building and analyzing various NLP models and methods proposed in this paper and was also one of the writers of the final paper. Zhichao Yang participated in the formulation, discussion, and writing of the project with his question answering and information extraction experience, providing many valuable views and suggestions from the perspective of NLP or BioNLP research. Zonghai Yao participated in the formulation, discussion, and writing of the project with his text generation experience, providing many valuable views and suggestions from the perspective of NLP or BioNLP research. Hong Yu is the planner of the whole project. She was responsible for writing the proposal for the project, planning the direction and progress of the whole project, discussing and suggesting the establishment of the NLP system, and writing the final paper.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

Research reported in this study was supported by the National Institute of Nursing Research and the National Institute of Mental Health of the National Institutes of Health under award numbers 1R01NR020868 and R01MH125027, respectively. This work was supported with funding by the National Center on Homelessness among Veterans, US Department of Veterans Affairs Homeless Programs Office. The content is solely the responsibility of the authors and does not necessarily represent NIMH, NINR, NIH, US Department of Veterans Affairs, or the US government.

Conflicts of interest

The authors declare no competing interests.

Data availability

The data will be released here: https://github.com/bio-nlp/BioInstruct.

References

  • 1. Brown TB, Mann B, Ryder N, et al.  Language models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. Advances in Neural Information Processing Systems. Vol 33. 2020:1877-1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. [Google Scholar]
  • 2. Sanh V, Webson A, Raffel C, et al.  2021. Multitask prompted training enables zero-shot task generalization. CoRR; abs/2110.08207. https://arxiv.org/abs/2110.08207.
  • 3. Chowdhery A, Narang S, Devlin J, et al.  2022. PaLM: scaling language modeling with pathways. arXiv, arXiv:220402311, preprint: not peer reviewed. https://arxiv.org/abs/2204.02311.
  • 4. Longpre S, Hou L, Vu T, et al.  2023. The flan collection: designing data and methods for effective instruction tuning. https://arxiv.org/abs/2301.13688.
  • 5. OpenAI. 2023. GPT-4 Technical Report. arXiv, arXiv:230308774, preprint: not peer reviewed. https://api.semanticscholar.org/CorpusID:257532815.
  • 6. Yang Z, Yao Z, Tasmin M, et al.  2023. Performance of multimodal GPT-4V on USMLE with image: potential for imaging diagnostic support with explanations. medRxiv: not peer reviewed. https://www.medrxiv.org/content/early/2023/10/26/2023.10.26.23297629.
  • 7. Alsentzer E, Murphy JR, Boag W, et al Publicly available clinical BERT embeddings. In: Rumshisky A, Roberts K, Bethard S, Naumann T, eds. Proceedings of the 2nd Clinical Natural Language Processing Workshop. Minneapolis, MN: Association for Computational Linguistics; 2019:72-78. 10.18653/v1/W19-1909. [DOI]
  • 8. Peng Y, Yan S, Lu Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Demner-Fushman D, Cohen KB, Ananiadou S, Tsujii J, eds. Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy. Association for Computational Linguistics; 2019:58–65.
  • 9. Van Aken B, Papaioannou JM, Mayrdorfer M, Budde K, Gers FA, Loeser A. Clinical outcome prediction from admission notes using self-supervised knowledge integration. In: Merlo P, Tiedemann J, Tsarfaty R, eds. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online. Association for Computational Linguistics; 2021:881–893.
  • 10. Hernandez E, Mahajan D, Wulff J, et al. Do we still need clinical language models? In: Conference on Health, Inference, and Learning. PMLR; 2023:578–597.
  • 11. Wei J, Bosma M, Zhao VY, et al.  2021. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652. preprint: not peer reviewed. https://api.semanticscholar.org/CorpusID:237416585.
  • 12. Chung HW, Hou L, Longpre S, et al.  2022. Scaling instruction-finetuned language models. arXiv, arXiv:221011416, preprint: not peer reviewed.
  • 13. Mishra S, Khashabi D, Baral C, Hajishirzi H. Cross-task generalization via natural language crowdsourcing instructions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics; 2022:3470-3487. https://aclanthology.org/2022.acl-long.244.
  • 14. Wang Y, Mishra S, Alipoormolabashi P, et al. Super-natural instructions: generalization via declarative instructions on 1600+ NLP tasks. In: Goldberg Y, Kozareva Z, Zhang Y, eds. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics; 2022:5085–5109.
  • 15. Ouyang L, Wu J, Jiang X, et al.  Training Language Models to Follow Instructions with Human Feedback. 2022;13. https://arxiv.org/abs/2203.02155. [Google Scholar]
  • 16. Rasmy L, Xiang Y, Xie Z, Tao C, Zhi D.  Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med. 2021;4(1):86. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Wang Y, Kordi Y, Mishra S, et al. Self-instruct: aligning language model with self generated instructions. In: Rogers A, Boyd-Graber J, Okazaki N, eds. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics; 2023:13484–13508.
  • 18. Touvron H, Lavril T, Izacard G, et al.  2023. LLaMA: open and efficient foundation language models. arXiv, arXiv:230213971, preprint: not peer reviewed.
  • 19. Touvron H, Martin L, Stone KR, et al.  2023. LLaMA 2: open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, preprint: not peer reviewed. https://api.semanticscholar.org/CorpusID:259950998.
  • 20. Jin Q, Yuan Z, Xiong G, et al.  Biomedical question answering: a survey of approaches and challenges. ACM Comput Surv. 2022;55(2):1-36. [Google Scholar]
  • 21. Wang Y, Wang L, Rastegar-Mojarad M, et al.  Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34-49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Ben Abacha A, Yim W, Adams G, Snider N, Yetisgen M. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In: Proceedings of the 5th Clinical Natural Language Processing Workshop. Association for Computational Linguistics; 2023:503-513. https://aclanthology.org/2023.clinicalnlp-1.52.
  • 23. Wang J, Yao Z, Yang Z, et al.  2023. NoteChat: a dataset of synthetic doctor-patient conversations conditioned on clinical notes. https://api.semanticscholar.org/CorpusID:264439568.
  • 24. Wang J, Yao Z, Mitra A, Osebe S, Yang Z, Yu H. UMASS_BioNLP at MEDIQA-Chat 2023: can LLMs generate high-quality synthetic note-oriented doctor-patient conversations? In: Naumann T, Ben Abacha A, Bethard S, Roberts K, Rumshisky A, eds. Proceedings of the 5th Clinical Natural Language Processing Workshop, Toronto, Canada. Association for Computational Linguistics; 2023:460-471. .
  • 25. Krishna K, Khosla S, Bigham JP, Lipton ZC. Generating SOAP notes from doctor-patient conversations using modular summarization techniques. In: Zong C, Xia F, Li W, Navigli R, eds. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics; 2021:4958–4972.
  • 26. Yao Z, Schloss BJ, Selvaraj SP. Improving summarization with human edits. In: Bouamor H, Pino J, Bali K, eds. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics; 2023:2604–2620.
  • 27. Zeng G, Yang W, Ju Z, et al. MedDialog: large-scale medical dialogue datasets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:9241-9250. https://aclanthology.org/2020.emnlp-main.743.
  • 28. Yang Z, Yu H. Generating accurate electronic health assessment from medical graph. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020:3764-3773. https://aclanthology.org/2020.findings-emnlp.336. [DOI] [PMC free article] [PubMed]
  • 29. Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc. 2024;31(9):1833-1843. 10.1093/jamia/ocae045. [DOI] [PMC free article] [PubMed]
  • 30. Kweon S, Kim J, Kim J, et al.  2023. Publicly shareable clinical large language model built on synthetic clinical notes. arXiv, arXiv:230900237, preprint: not peer reviewed.
  • 31. Han T, Adams LC, Papaioannou JM, et al.  2023. MedAlpaca—an open-source collection of medical conversational AI models and training data. arXiv, arXiv:230408247, preprint: not peer reviewed.
  • 32. Li Y, Li Z, Zhang K, Dan R, Zhang Y.  2023. ChatDoctor: a medical chat model fine-tuned on LLaMA model using medical domain knowledge. arXiv, arXiv:230314070, preprint: not peer reviewed. [DOI] [PMC free article] [PubMed]
  • 33. Lee J, Yoon W, Kim S, et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Schick T, Schütze H. Generating datasets with pretrained language models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2021:6943-6951. https://aclanthology.org/2021.emnlp-main.555.
  • 35. Meng Y, Michalski M, Huang J, Zhang Y, Abdelzaher T, Han J. Tuning language models as training data generators for augmentation-enhanced few-shot learning. In: International Conference on Machine Learning. PMLR; 2023:24457-24477.
  • 36. Honovich O, Scialom T, Levy O, Schick T. Unnatural instructions: tuning language models with (almost) no human labor. In: Rogers A, Boyd-Graber J, Okazaki N, eds. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics; 2023:14409–14428.
  • 37. Zhang X, Tian C, Yang X, Chen L, Li Z, Petzold LR.  2023. AlpaCare: instruction-tuned large language models for medical application. arXiv, arXiv:231014558, preprint: not peer reviewed.
  • 38. Taori R, Gulrajani I, Zhang T, et al.  2023. Stanford Alpaca: an instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
  • 39. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P.  What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Appl Sci. 2021;11(14):6421. [Google Scholar]
  • 40. Pal A, Umapathi LK, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In: Conference on Health, Inference, and Learning. PMLR; 2022:248-260.
  • 41. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubmedQA: a dataset for biomedical research question answering. In: Inui K, Jiang J, Ng V, Wan X, eds. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics; 2019:2567–2577.
  • 42. Tsatsaronis G, Balikas G, Malakasiotis P, et al.  An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 2015;16(1):138. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Shivade C, et al. MedNLI-a natural language inference dataset for the clinical domain. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics; 2019:1586-1596.
  • 44. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. In: Goldberg Y, Kozareva Z, Zhang Y, eds. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics; 2022:1998-2022.
  • 45. Fu J, Ng SK, Jiang Z, Liu P.  2023. GPTscore: evaluate as you desire. arXiv, arXiv:230204166, preprint: not peer reviewed.
  • 46. Liu Y, Iter D, Xu Y, Wang S, Xu R, Zhu C. G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Bouamor H, Pino J, Bali K, eds. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics; 2023:2511–2522.
  • 47. Zhang H, Chen J, Jiang F, et al. HuatuoGPT, towards taming language model to be a doctor. In: Bouamor H, Pino J, Bali K, eds. Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics; 2023:10859–10885.
  • 48. Zhang Y, Yang Q.  An overview of multi-task learning. Nat Sci Rev. 2018;5(1):30-43. https://api.semanticscholar.org/CorpusID:90063862. [Google Scholar]
  • 49. Li F, Jin Y, Liu W, Rawat BPS, Cai P, Yu H.  Fine-tuning bidirectional encoder representations from transformers (BERT)-based models on large-scale electronic health record notes: an empirical study. JMIR Med Inform. 2019;7(3):e14830. https://api.semanticscholar.org/CorpusID:202567532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Pruksachatkun Y, Phang J, Liu H, et al. Intermediate-task transfer learning with pretrained language models: when and why does it work? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:5231-5247. https://aclanthology.org/2020.acl-main.467.
  • 51. McCann B, Keskar NS, Xiong C, Socher R.  2018. The natural language decathlon: multitask learning as question answering. ArXiv; abs/1806.08730. https://api.semanticscholar.org/CorpusID:49393754.
  • 52. Wang A, Hula J, Xia P, et al. Can you tell me how to get past sesame street? Sentence-level pretraining beyond language modeling. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2019:4465-4476. https://aclanthology.org/P19-1439.
  • 53. Aribandi V, Tay Y, Schuster T, et al. ExT5: towards extreme multi-task scaling for transfer learning. In: International Conference on Learning Representations. 2022. https://api.semanticscholar.org/CorpusID:260421295.
  • 54. Vu T, Wang T, Munkhdalai T, et al. Exploring and predicting transferability across NLP tasks. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:7882-7926. https://aclanthology.org/2020.emnlp-main.635.
  • 55. Kim J, Asai A, Ilharco G, Hajishirzi H.  2023. TaskWeb: selecting better source tasks for multi-task NLP. ArXiv; abs/2305.13256. https://api.semanticscholar.org/CorpusID:258832390.
  • 56. Liu J, Shen D, Zhang Y, Dolan B, Carin L, Chen W. What makes good in-context examples for GPT-3? In: Agirre E, Apidianaki M, Vulić I, eds. Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. Association for Computational Linguistics; 2022:100-114. https://aclanthology.org/2022.deelio-1.10.
  • 57. Su H, Kasai J, Wu CH, et al.  2022. Selective annotation makes language models better few-shot learners. ArXiv; abs/2209.01975. https://api.semanticscholar.org/CorpusID:252089424.
  • 58. Ainslie J, Lee-Thorp J, de Jong M, Zemlyanskiy Y, Lebrón F, Sanghai S. GQA: training generalized multi-query transformer models from multi-head checkpoints. In: Bouamor H, Pino J, Bali K, eds. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore. Association for Computational Linguistics; 2023:4895–4901.
  • 59. Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. In: Zong C, Xia F, Li W, Navigli R, eds. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics; 2021:4582–4597.
  • 60. Hu EJ, Shen Y, Wallis P, et al.  2021. LoRA: low-rank adaptation of large language models. arXiv, arXiv:210609685, preprint: not peer reviewed.
  • 61. Aghajanyan A, Zettlemoyer L, Gupta S. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In: Zong C, Xia F, Li W, Navigli R, eds. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online. Association for Computational Linguistics; 2021:7319–7328.


