Author manuscript; available in PMC: 2025 Aug 25.
Published in final edited form as: Nat Lang Process Inf Syst. 2025 Jul 1;15836:80–94. doi: 10.1007/978-3-031-97141-9_6

How important is domain-specific language model pretraining and instruction finetuning for biomedical relation extraction?

Aviv Brokman 1, Ramakanth Kavuluru 2
PMCID: PMC12367199  NIHMSID: NIHMS2101756  PMID: 40843288

Abstract

Major technical advances in the general NLP domain are often subsequently applied to the high-value, data-rich biomedical domain. The past few years have seen generative language models (LMs), instruction finetuning, and few-shot learning become foci of NLP research. As such, generative LMs pretrained on biomedical corpora have proliferated and biomedical instruction finetuning has been attempted as well, all with the hope that domain specificity improves performance on downstream tasks. Given the nontrivial effort in training such models, we investigate what, if any, benefits they have in the key biomedical NLP task of relation extraction. Specifically, we address two questions: (1) Do LMs trained on biomedical corpora outperform those trained on general domain corpora? (2) Do models instruction finetuned on biomedical datasets outperform those finetuned on assorted datasets or those simply pretrained? We tackle these questions using existing LMs, testing across four datasets. In a surprising result, general-domain models typically outperformed biomedical-domain models. However, biomedical instruction finetuning improved performance to a similar degree as general instruction finetuning, despite having orders of magnitude fewer instructions. Our findings suggest it may be more fruitful to focus research effort on larger-scale biomedical instruction finetuning of general LMs over building domain-specific biomedical LMs.

Keywords: language models, biomedical information extraction, instruction finetuning, pretraining

1. Introduction

Biomedical entities and relations among them are at the forefront of biomedical knowledge discovery. Novel relational information among biomedical entities is often conveyed through text in scientific literature or clinical notes. For instance, protein-protein interactions (to understand disease etiology and progression), gene-disease associations (to identify potential drug targets), drug-disease treatment relations (to spot off-label usage or assess potential for repositioning), and drug-gene interactions (to design targeted therapies) are often discussed in literature. The extracted relations are often used to build knowledge graphs to facilitate knowledge discovery and knowledge-based search systems [2,26]. Adverse event relations (linking medications and adverse effects) are often reported in clinical text and their extraction facilitates more effective post-market pharmacovigilance [5,30]. As such, biomedical relation extraction (RE) is a frequently pursued task in the BioNLP community, featuring in many shared task series (e.g., BioCreative, i2b2, n2c2). Our current contribution is about high-level questions regarding the value of biomedical domain-specific language models (LMs) and instruction finetuning for biomedical RE. In the rest of this section, we overview the recent methods landscape relevant to RE and frame our contributions in that context.

The last few years have seen a proliferation of transformer-based generative (encoder-decoder and decoder-only) LMs. In part, these have become popular because tasks such as summarization, paraphrasing, and code generation are fundamentally generative. But generative models have advantages over encoder-only LMs even in some tasks that were previously addressed with the latter. Encoder models were largely designed for classification problems, where a classifier is trained on top of the final layer embedding of the [CLS] token. They have been used in myriad other ways — notably in named entity recognition (NER) and RE where embeddings of spans are fed to classifiers [12,56]. These methods finetune weights from scratch and therefore require ample data. But few- and zero-shot learning have become focal areas of NLP research; their maturation promises improved scalability of NLP, as manual creation of large datasets is time consuming. Few-/zero-shot learning in encoder models is often achieved by aligning the NLP task objective function with the pretraining objective function. Prompting with templates containing [MASK] tokens to be predicted has been the few-/zero-shot method for encoder-only models [42,49]. Using such a strategy largely rests on whether the desired NLP task is a classification task or can be formulated as one. In summation, encoder-only LMs are well-suited to full finetuning (as opposed to the few-shot setting) for a variety of non-generative tasks. The additional layering and customization required on top of encoder models for RE can be obviated in generative models, by contrast, given their relatively flexible approach of formulating RE with straightforward natural language templates. This particularly comes in handy in few- and zero-shot learning.

In the past few years, instruction finetuning (IFT) in the general domain has become a major research theme for generative LMs [35,41,52]. In its basic form, IFT converts examples of a wide variety of labeled datasets into pairs of natural language instructions (along with input text) and responses (containing the correct output), and then trains a generative LM on them. This typically improves few-shot performance by aligning the model towards the goals that humans have. Then, for a new task, the corresponding instructions are generally more in-distribution than they would have been sans IFT.

Most advances in NLP are originally developed for the general domain, and only later adapted to the biomedical domain. Recently, this has manifested in the development of biomedical versions of popular general-domain LMs. Intuitively, it is sensible to do so because the distribution of text in biomedical NLP tasks is more similar to biomedical text than to general-domain text, and because biomedical-specific tokenizers can more parsimoniously represent biomedical entities. Likewise, inspired by the striking success of IFT, Parmar et al. [36] instruction finetuned BART-Base [27] on a biomedical meta-dataset they assembled (called BoX) to create In-BoXBART.

Biomedical generative LMs are quickly being trained, and we can safely assume that larger biomedical models will be trained in the near future. Given the considerable labor and monetary costs of creating biomedical-specific LMs, it is prudent to ask if they are worthwhile. To some degree, this has already been tested in biomedical versions of T5 [38,40] and BART [27,53] in the context of full finetuning, yielding underwhelming improvements in scores. However, the tasks tested largely require extremely short, simple generations. Plausibly, these models would differentiate themselves from their general-domain counterparts on more complex tasks, especially in the few-shot context. We investigate whether this is the case in the high-value task of RE³. Specifically, we ask:

  • Do generative models pretrained on PubMed-based scientific text outperform those trained on general-domain text?

  • How do biomedical and general-domain instruction finetuned models perform compared to their base models?

  • Are the answers to these questions different in the fully finetuned and few-shot settings?

We test these questions across four biomedical RE datasets using a variety of open biomedical and general domain LMs. We make our code and prompting templates available at https://github.com/bionlproc/DS-PT-IFT.

2. Related Work

2.1. Generative Relation Extraction

RE has traditionally been conducted via a pipeline in two steps — NER followed by relation classification (RC) between pairs of entities from NER. However, this pipeline approach was joined by end-to-end approaches that employ a joint loss function for both NER and RC phases beginning with [34]. Among them, Zeng et al. [55] used a copy mechanism to tackle generative RE without an intermediate NER step; several extensions followed suit [14,18,54]. One advantage of their strategy is that generation can naturally handle discontinuous entities (e.g., hand pain in the phrase hand and arm pain). Discontinuous entities pose such a challenge to NER that there is an entire line of research dedicated to solving it [11,51]. But these copy-based models require training neural network components from scratch, and are therefore not well-suited to few-shot learning.

Cabot et al. [20] use BART to directly generate relation triples using special tokens to demarcate roles in the triple; this still requires training randomly initialized special tokens. Luo et al. [33] finetune BioGPT on natural language templates, aligning the current NLP task’s output form most closely with the pretraining strategy. At test time, relations are extracted from generated text using regular expressions. We adopt this general strategy of Luo et al. [33] in the present study. Wadhwa et al. [48] finetune Flan-T5 [52] using templates that convert relations into a target sequence of structured data, rather than natural language. We note that in preliminary experiments, our performance was much lower with templates of this form.

2.2. Biomedical Language Models

Early transformer-based biomedical LMs had encoder-only architectures, usually based on BERT [10], and were trained on some combination of PubMed abstracts, PubMed Central full-text articles (PMC), and MIMIC-III [22] clinical notes. BioBERT [25] was initialized using weights from BERT and continually pretrained on PubMed and PMC. Alsentzer et al. [1] continued pretraining BioBERT on MIMIC-III, yielding ClinicalBERT. Likewise, Gururangan et al. [16] continually pretrained RoBERTa [31] on additional full-text biomedical articles from S2ORC [32], obtaining BioMed-RoBERTa. Models trained from scratch on biomedical text generally perform better than their continually trained counterparts. The BERT-based BlueBERT [37] and PubMedBERT [15], and the RoBERTa-based Bio-LM [28] are a few such models. Shin et al. [43] trained versions of BioMegatron, both from scratch and through continued pretraining from Megatron [44]. BioELECTRA [23], a biomedical version of the high-performing ELECTRA [9], was trained from scratch on PubMed and PMC.

As generative models (with encoder-decoder and decoder-only architectures) have risen in popularity, PubMed-based generative models have been trained from scratch as well. Biomedical versions of T5 [40] (SciFive [38]) and GPT (BioGPT [33] and BioMedLM [4]) have all been trained from scratch on PubMed and/or PMC corpora. BioBART [53] is a BART [27] model continually pretrained on PubMed text. SciFive and BioBART showed negligible improvement on most biomedical tasks, and modest improvements in a small number of tasks, over their general-domain counterparts. However, we believe the generative capabilities of SciFive and BioBART are under-explored by [38] and [53] because the tasks tackled either require modest generation (e.g., a multiple-choice answer) or are evaluated by ROUGE scores, which only provide rough estimates of generation quality compared to human evaluation. Luo et al. [33] tested BioGPT on three end-to-end RE datasets, where it substantially outperformed the comparably-sized GPT-2 Medium [39]. We are unable to find any performance scores for BioGPT-Large on end-to-end RE. BioMedLM [4] scored much higher than the comparably-sized GPT-Neo 2.7B model on three QA tasks. In summary, there is evidence that on very short generation tasks, in a fully finetuned setup, biomedical encoder-decoder LMs outperform general-domain counterparts. But there is a dearth of information about the efficacy of decoder-only models in more complex generation tasks, and even more so for encoder-decoder models. Few-shot performance patterns are especially unknown.

2.3. Instruction Finetuning

Wei et al. [52] explore the effect of IFT on zero-shot performance by instruction finetuning on 60 tasks, constructing 10 templates per task, finding consistent improvements in performance. Sanh et al. [41] conduct similar experiments on a different set of datasets, yielding the T0 model. Chung et al. [8] build on these previous attempts, scaling finetuning up to 1,836 tasks, and release the Flan-T5 series of instruction finetuned models. InstructGPT [35] is trained using reinforcement learning from human feedback (RLHF) [7,45] on instructions to align GPT-3 [6] to human preferences. Wang et al. [50] create synthetic instruction data for IFT using GPT-3, attaining competitive performance with the more expensively trained InstructGPT. In-BoXBART [36], the only attempt at biomedical instruction finetuning (to our knowledge), is an IFT version of BART-Base trained on 32 biomedical tasks, with one template per task, making its scale of IFT considerably smaller than that of other IFT models.

3. Task Setup

Training:

RE instances consist of a source text and a set of semi-structured relations that are asserted in it. To finetune generative LMs on a relation extraction dataset, we convert its training examples into sequences of natural language before finetuning the models in the usual manner. In essence, the natural language sequences contain the source text along with an instruction to extract relations followed by the relations expressed in natural language form. Figure 1 illustrates our approach for the drug combination extraction (DCE) task of extracting combination drug therapies from scientific text [46] as we formalize below.

Fig. 1: Conversion of a training example of the DCE dataset into natural language sequences.

Let (x, y) be a training example, where x is the input text and y denotes the relations present in it. To finetune LMs, we construct a single prompting template T = (T_s, T_t) for each dataset. A template consists of a source transformation function T_s and a target transformation function T_t, which return a source sequence x_T, containing the source text x along with an instruction to guide generation, and a target sequence y_T, conveying the relations y in the form of natural language:

T_s(x) = x_T
T_t(y) = y_T

For encoder-decoder models, x_T and y_T are supplied to the encoder and decoder modules, respectively. For decoder-only models, the sequences x_T and y_T are joined into one sequence with a function J, which then serves as a single input to the model. We take J to be

J(x_T, y_T) = x_T ⊕ "\n\n" ⊕ y_T,

where ⊕ represents concatenation. Our formulation of J separates the source and target sequences with two line breaks, though other formulations can be used.
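As a concrete illustration, the following minimal Python sketch implements one possible T_s, T_t, and J for the DCE task. The instruction text, relation phrasing, and label names (POS, COMB) are illustrative assumptions rather than the exact templates used in our experiments; the actual templates are available in our code repository.

```python
# Illustrative template functions for the DCE task (wording and labels are
# hypothetical; see the repository for the actual templates).

def t_source(x: str) -> str:
    """T_s: wrap the source text with an instruction that guides generation."""
    return ("Extract the combinations of drugs given together and their effect "
            f"from the following text.\n\nText: {x}")

def t_target(relations: list[dict]) -> str:
    """T_t: express the annotated relations as natural language."""
    parts = []
    for rel in relations:
        drugs = ", ".join(rel["drugs"])
        effect = "a positive effect" if rel["label"] == "POS" else "an unclarified effect"
        parts.append(f"{drugs} are given in combination with {effect}.")
    return " ".join(parts) if parts else "No drug combinations are described."

def join(x_t: str, y_t: str) -> str:
    """J: for decoder-only models, concatenate source and target with two line breaks."""
    return f"{x_t}\n\n{y_t}"

# Example: a decoder-only training sequence for one toy (x, y) pair.
x = "Patients received aspirin and clopidogrel, which reduced stroke risk."
y = [{"drugs": ["aspirin", "clopidogrel"], "label": "POS"}]
training_sequence = join(t_source(x), t_target(y))
```

For encoder-decoder models, t_source(x) and t_target(y) would instead be passed separately to the encoder and decoder, without join().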

Evaluation:

For validation and test examples, we only have source sequences, and our aim is to generate the target sequences. For encoder-decoder models, x_T again serves as input to the encoder. For decoder-only models, the input sequence is

J(x_T, "") = x_T ⊕ "\n\n",

i.e., the source sequence followed by the two-line-break separator; the model generates the target as a continuation.

The generated output sequence of the model, ŷ_T, must be transformed into a structured format so that predicted relations can be compared to annotated relations to calculate performance metrics. We design an information extractor for each template using regular expressions.
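A minimal sketch of such an extractor is shown below for the hypothetical DCE template sketched earlier; the regular expression is tied to that illustrative wording and would differ for the templates actually used.

```python
import re

# Regex-based extractor matching the hypothetical DCE template above.
# One extractor is designed per template; this one is illustrative.
REL_PATTERN = re.compile(
    r"(?P<drugs>[^.]+?) are given in combination with "
    r"(?P<effect>a positive effect|an unclarified effect)\."
)

def extract_relations(generated: str) -> set[tuple[frozenset, str]]:
    """Turn generated natural language back into structured (drug-set, label) relations."""
    relations = set()
    for match in REL_PATTERN.finditer(generated):
        drugs = frozenset(d.strip() for d in match.group("drugs").split(","))
        label = "POS" if match.group("effect").startswith("a positive") else "COMB"
        relations.add((drugs, label))
    return relations

# Example: parse a generated target sequence into predicted relations.
predicted = extract_relations(
    "aspirin, clopidogrel are given in combination with a positive effect."
)
```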

4. Experimental Setup

4.1. Main experiments

We conduct two experiments to investigate the effect of domain-specificity: (1) We compare biomedical- and general-domain pretrained models and (2) we compare biomedical instruction fine-tuned, general-domain instruction fine-tuned, and base model performances. All experiments are repeated in fully finetuned and few-shot settings.

  • Biomedical vs general-domain pretraining: In our first experiment, we compare the efficacy of biomedical LMs and their closest general-domain language equivalent in terms of architecture and size. We evaluate performance of the biomedical decoder LMs BioGPT (350M) [33], BioGPT-Large (1.5B) [33], and BioMedLM (2.7B) [4] and compare to their approximate general domain equivalents, GPT2-Medium (350M), GPT2-XL (1.5B) [39], and GPT-Neo (2.7B) [3]. We also evaluate performance of the biomedical encoder-decoder LMs SciFive-Base (220M), SciFive-Large (770M) (both PubMed versions) [38], BioBART-Base (140M), and BioBART-Large (400M) [53] and compare to the general domain equivalents T5-Base (220M), T5-Large (770M) [40], BART-Base (140M), and BART-Large (400M) [27].

  • Instruction finetuning vs base models: In our second experiment, we compare the efficacy of IFT models and their base models. In our literature review, we found no biomedical LMs that have themselves been instruction finetuned; rather, general-domain models have been instruction finetuned with general-domain datasets and, separately, with biomedical datasets. For general-domain instructions, we compare the Flan-T5 [52] series of models to the base T5 models [40] they were trained from. For biomedical instructions, we compare In-BoXBART [36] to BART-Base [27], from which it was trained.

  • Few-shot setting: We repeat the experiments from the above two scenarios, but in the few-shot setting. We emphasize that we finetune the models rather than perform in-context learning. We experiment with 16-shot and 64-shot learning. Few-shot performance values are averaged over five runs with different random seeds for selecting training examples (see the sampling sketch below). Barring a single exception, fully finetuned decoder-only models performed far worse (F1 < 30) than encoder-decoder models, so we did not perform few-shot experiments on decoder-only models.
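The seeded selection of few-shot training examples can be summarized by the following minimal Python sketch; the seed values and the toy training split are placeholders, not the exact configuration used in our runs.

```python
import random

def sample_few_shot(train_examples: list, k: int, seed: int) -> list:
    """Reproducibly select k training examples for one few-shot run."""
    return random.Random(seed).sample(train_examples, k)

# Toy stand-in for a dataset's training split; seeds are hypothetical placeholders.
train_examples = [f"example-{i}" for i in range(1362)]
for k in (16, 64):
    for seed in (0, 1, 2, 3, 4):  # five runs per setting
        subset = sample_few_shot(train_examples, k, seed)
        # ... finetune on `subset` and evaluate; reported scores average over seeds.
```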

4.2. Datasets

We use four public biomedical RE datasets for our experiments. For datasets without an established train/validation split, we select 20% of the training dataset to serve as a validation set.

  1. CDR [29] is a BioCreative V dataset consisting of PubMed abstracts for which all mentions of chemicals and diseases are annotated, as well as all relations between them describing chemical-induced diseases. Relations all belong to a single class and are annotated at the entity level. 1500 abstracts are split into train, validation, and test sets of equal size.

  2. ChemProt [24] is a BioCreative VI dataset consisting of PubMed abstracts for which all mentions of chemical compounds/drugs and genes/proteins are annotated, as well as all intra-sentence relations between the two entity types belonging to a set of relation types of interest. Relations belong to the classes agonist, antagonist, upregulate, downregulate, and substrate, and are annotated at the mention level. Since relations are annotated at the mention level, an example may contain multiple annotated relations that are conceptually identical but have different mentions. The dataset is split into 1,020 training, 612 validation, and 800 test examples.

  3. The drug combinations extraction (DCE) [46] dataset consists of sentences selected from PubMed abstracts containing multiple drugs, annotated for drugs given in combination. Sets of drugs are annotated as consisting of a drug combination with a positive effect, a drug combination with an effect not clarified by the text, or not a drug combination. Though some studies combine the latter two relation types [21,46], we model them as-is. The dataset is split into 1362 training and 272 test instances.

  4. The DDI [17] dataset consists of texts from DrugBank and Medline annotated for drugs/chemicals and relations holding for pairs of co-occurring drugs. Mentions of drugs are classified as generic drug name, brand-name drug, drug group, or non-drug active substance. Relations holding for drug pairs are pharmacokinetic mechanism, combination effect, advice regarding interaction, or interaction in the absence of supporting information. Relations are annotated at the mention level. The dataset consists of 784 DrugBank documents and 233 Medline abstracts, split into 714 train and 303 test examples.

We calculate precision, recall, and F1 for all tasks. Evaluation is done in a strict setting: a relation is only considered correct if all of its entities are correct and its relation type, if applicable, is correct. An entity is correct only if it is an exact string match with the gold entity. For CDR, the predicted entity must match one of the mentions corresponding to the MeSH ID of the gold entity.
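The sketch below illustrates this strict scoring, assuming relations are represented as hashable tuples and scores are micro-averaged over documents; the CDR-specific matching against any mention of the gold MeSH ID is omitted.

```python
# Strict micro-averaged precision/recall/F1 over sets of predicted and gold relations.
# Relations are hashable tuples, e.g., (entity_1, relation_type, entity_2).

def micro_prf(pred_by_doc: dict, gold_by_doc: dict) -> tuple[float, float, float]:
    tp = fp = fn = 0
    for doc_id, gold in gold_by_doc.items():
        pred = pred_by_doc.get(doc_id, set())
        tp += len(pred & gold)   # exact matches on all entities and relation type
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```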

5. Results

All results are displayed in Table 1, primarily grouped into model series (e.g., T5) with an in-group ordering by model size. Before moving into the main results, we note two high-level trends among the datasets and model series: (1) Performance was generally higher for CDR and DCE than for ChemProt and DDI, perhaps owing to the higher number of relation types in the latter datasets. (Henceforth, we refer to CDR and DCE as high-performing and ChemProt and DDI as low-performing when explaining other results.) (2) Encoder-decoder models performed much better than decoder-only models, with the exception of BioGPT for CDR. In fact, decoder-only models performed so poorly that we largely limit our further discussion to encoder-decoder models. Potentially, this is because decoder-only models may be more amenable to free-form generative outputs such as summaries and less suitable for RE.

Table 1:

Comparison of model F1-scores across a range of model types and sizes for four biomedical RE datasets.

Model | # Param. | CDR 16-shot | CDR 64-shot | CDR Full | DCE 16-shot | DCE 64-shot | DCE Full | ChemProt Full | DDI Full

GPT-2-Medium | 350M | - | - | 16.3 | - | - | 0.2 | 14.7 | 14.2
BioGPT (b) | 350M | - | - | 44.1 | - | - | 0.7 | 9.7 | 17.1
GPT-2-XL | 1.5B | - | - | 16.1 | - | - | 0 | 14.9 | 24.3
BioGPT-Large (b) | 1.5B | - | - | 15.6 | - | - | 1.2 | 11.1 | 21.4
GPT-Neo-2.7B | 2.7B | - | - | 10.0 | - | - | 1.5 | 11.7 | 24.0
BioMedLM (b) | 2.7B | - | - | 7.9 | - | - | 11.9 | 17.1 | 19.1

BART-Base | 110M | 17.6 | 27.4 | 36.1 | 8.9 | 30.6 | 38.5 | 11.3 | 12.4
BioBART-Base (b) | 110M | 29.0 | 33.1 | 38.0 | 8.9 | 24.1 | 41.8 | 9.6 | 9.5
In-BoXBART (†) | 110M | 21.4 | 28.9 | 36.4 | 10.4 | 26.7 | 45.1 | 10.5 | 13.4
BART-Large | 400M | 26.4 | 32.5 | 40.1 | 9.6 | 30.4 | 41.7 | 15.7 | 17.0
BioBART-Large (b) | 400M | 30.7 | 33.0 | 40.1 | 8.7 | 29.2 | 43.3 | 13.2 | 13.7

T5-Small | 60M | 18.3 | 20.3 | 29.4 | 22.5 | 18.5 | 32.3 | 0.3 | 2.9
Flan-T5-Small (*) | 60M | 18.9 | 22.2 | 37.1 | 10.7 | 21.2 | 33.6 | 1.3 | 2.9
T5-Base | 220M | 30.3 | 29.8 | 42.8 | 30.0 | 39.3 | 50.3 | 15.6 | 15.4
SciFive-Base (b) | 220M | 28.8 | 10.1 | 41.5 | 0 | 21.6 | 47.6 | 12.4 | 11.9
Flan-T5-Base (*) | 220M | 30.5 | 33.1 | 45.0 | 30.0 | 33.8 | 44.1 | 16.2 | 15.2
T5-Large | 770M | 33.9 | 37.7 | 50.8 | 25.7 | 42.4 | 47.6 | 27.3 | 27.5
SciFive-Large (b) | 770M | 22.7 | 5.9 | 46.3 | 3.1 | 3.7 | 46.1 | 22.3 | 20.6
Flan-T5-Large (*) | 770M | 35.9 | 38.4 | 50.4 | 27.4 | 45.1 | 48.9 | 26.9 | 32.0

Few-shot values are averages across five different random seeds for selecting training examples. Models marked with "b" are biomedically pretrained. Models marked with "*" are instruction finetuned with general-domain instructions, and those marked with "†" are instruction finetuned with biomedical instructions. Note that the scores in this table are not necessarily the current state of the art for each dataset, because our goal is to compare biomedical pretraining and IFT to corresponding general-domain models within the generative LM landscape for end-to-end RE.

Biomedical vs general-domain pretraining:

Overall, general-domain pretraining was superior to biomedical-domain pretraining in full and few-shot finetuning experiments (Figures 2 and 3). In full finetuning, this pattern held for T5 models across all datasets and for BART models in low-performing datasets. Notable exceptions are the BART models in high-performing datasets, where biomedical models outperformed, or performed similarly to, corresponding general-domain models, and BioGPT, which substantially outperformed its GPT-2-Medium counterpart. However, BioGPT performed poorly in all other circumstances. In few-shot finetuning, biomedical T5 models performed much worse than general-domain models (Figure 3). Biomedical pretraining had inconsistent effects in BART models, improving performance for CDR while generally lowering performance for DCE.

Fig. 2: Comparison of fully finetuned LMs pretrained on general-domain and biomedical text.

Fig. 3: Comparison of few-shot finetuned LMs pretrained on general-domain and biomedical text.

Effect of instruction finetuning:

In full finetuning, IFT versions of models generally performed similarly to or slightly better than their base versions (Figure 4). The small benefits of IFT mostly manifested in the high-performing datasets. In few-shot finetuning, both general-domain and biomedical instructions consistently improved performance, and to a similar degree (Figure 5), even though there were orders of magnitude fewer biomedical instruction datasets.

Fig. 4: Comparison of fully finetuned base and instruction-finetuned LMs.

Fig. 5: Few-shot comparison of base and instruction-finetuned LMs on high-performing datasets.

6. Discussion

We were surprised that decoder-only models performed so poorly and so much worse than encoder-decoder models. Luo et al. [33] found BioGPT to attain state-of-the-art (SOTA) performance on CDR, DDI, and KD-DTI [19]. While our setup differed from theirs in some respects (e.g., we did not remove examples containing no relations, and we differed in the number of training epochs and in how we determined whether predicted entities matched gold entities), our performance on DDI was drastically lower than theirs, and our results on two additional datasets, DCE and ChemProt, are far below SOTA. An unusual finding is that only for CDR is BioGPT's performance better than that of the larger BioGPT-Large and BioMedLM models. Preliminary experiments with Llama 2-7B [47] were similarly fruitless, so we do not report its scores. As encoder-decoder models performed so much better, future practical generative RE efforts may be more fruitful with encoder-decoder architectures; and if additional models are to be pretrained on biomedical corpora (PubMed), encoder-decoder models may be a safer bet.

However, training additional generative LMs (even encoder-decoder types) on PubMed generally yielded inferior models, at least as far as RE is concerned. This result held in both full and few-shot finetuning contexts. We found this pattern counterintuitive, since most of the datasets we experimented with consist of annotated texts from PubMed — one might expect that models trained on PubMed itself would be particularly well suited for these tasks, as they fall squarely inside of the pretraining corpus distribution. We hypothesize that (1) the sheer quantity of tokens available for training on general-domain text outweighs the benefits of domain-specificity, and/or (2) that biomedical-domain models may learn inferior linguistic representations as biomedical models are trained on a comparatively narrow subset of the distribution of the English language, and RE requires linguistic sophistication. Thus, even for encoder-decoder models, training biomedical LMs from scratch may not be judicious.

As for the IFT experiments, our results do not suggest that domain-specific training is altogether pointless. In the full finetuning case, IFT models achieved roughly the same performance as their base models, though they converged in far fewer epochs during training. In the few-shot setting, we found that small-scale biomedical IFT produced performance gains comparable to large-scale general-domain IFT. In-BoXBART was finetuned on 32 biomedical tasks with one template per task. By contrast, Flan-T5 models were finetuned on 1,836 tasks with multiple templates per task, and Wei et al. [52] found that performance continued to climb until ≈ 300 tasks. This suggests that IFT with additional biomedical datasets could yield strong benefits. While there are far fewer annotated biomedical datasets than general-domain datasets, the biomedical meta-dataset BigBIO [13], currently containing over 100 datasets, would be a valuable resource for biomedical IFT. Alternatively, synthetic biomedical instruction examples could also be generated à la Wang et al. [50]. Some nuance is required when interpreting the few-shot performance of In-BoXBART on the CDR dataset. Among the tasks that In-BoXBART was instruction finetuned on is the NER task for CDR. It is possible that some of In-BoXBART's capabilities on RE for CDR come from this sort of leakage. Beyond our previous suggestions for further research, we note that our conclusions are limited to RE, a particularly high-value application of LMs; other biomedical tasks should be evaluated as well.

To conclude, the fast evolving landscape of public LMs and their use in many fields naturally leads to domain specific models and IFT. In this paper, we took the first step toward rigorous assessment of the need for domain specificity in these models for biomedical RE. It would be interesting to see if our findings apply to other tasks (e.g., biomedical summarization) and domains (e.g., legal). Additionally, more detailed experiments are needed to tease out different types of errors caused by pretrained domain-specific LMs in comparison with general LMs, to see if there is a way to combine them for improved performance.

Footnotes

3. Here we are explicitly dealing with end-to-end RE, where entities must be extracted by the RE system and are not provided a priori.

References

  • 1. Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, McDermott M: Publicly available clinical BERT embeddings. In: Proc. of the 2nd Clinical Natural Language Processing Workshop. pp. 72–78 (2019)
  • 2. Bakal G, Talari P, Kakani EV, Kavuluru R: Exploiting semantic patterns over biomedical knowledge graphs for predicting treatment and causative relations. Journal of Biomedical Informatics 82, 189–199 (2018)
  • 3. Black S, Gao L, Wang P, Leahy C, Biderman S: GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow (2021). https://doi.org/10.5281/zenodo.5297715
  • 4. Bolton E, Venigalla A, Yasunaga M, Hall D, Xiong B, Lee T, Daneshjou R, Frankle J, Liang P, Carbin M, et al.: BioMedLM: A 2.7B parameter language model trained on biomedical text. arXiv preprint arXiv:2403.18421 (2024)
  • 5. Botsis T, Buttolph T, Nguyen MD, Winiecki S, Woo EJ, Ball R: Vaccine adverse event text mining system for extracting features from vaccine safety reports. J. of the American Medical Informatics Association 19(6), 1011–1018 (2012)
  • 6. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Sutskever I, Amodei D: Language models are few-shot learners. In: Advances in Neural Information Processing Systems. vol. 33, pp. 1877–1901 (2020)
  • 7. Christiano PF, Leike J, Brown T, Martic M, Legg S, Amodei D: Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017)
  • 8. Chung HW, Hou L, Longpre S, Zoph B, Tay Y, Fedus W, Li E, Wang X, Dehghani M, Brahma S, et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  • 9. Clark K, Luong M, Le QV, Manning CD: ELECTRA: Pre-training text encoders as discriminators rather than generators. In: 8th International Conference on Learning Representations (2020)
  • 10. Devlin J, Chang MW, Lee K, Toutanova K: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. of NAACL: HLT. pp. 4171–4186 (2019)
  • 11. Dirkson A, Verberne S, Kraaij W: FuzzyBIO: A proposal for fuzzy representation of discontinuous entities. In: Proc. of the 12th Workshop on Health Text Mining and Information Analysis. pp. 77–82 (2021)
  • 12. Eberts M, Ulges A: Span-based joint entity and relation extraction with transformer pre-training. In: Proc. of the EACL. pp. 2006–2013 (2020)
  • 13. Fries JA, Weber L, Seelam N, Altay G, Datta D, Garda S, Kang M, Su R, Kusa W, Cahyawijaya S, et al.: BigBIO: A framework for data-centric biomedical natural language processing. arXiv preprint arXiv:2206.15076 (2022)
  • 14. Giorgi J, Bader G, Wang B: A sequence-to-sequence approach for document-level relation extraction. In: Proc. of the 21st Workshop on Biomedical Language Processing. pp. 10–25 (2022)
  • 15. Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H: Domain-specific language model pretraining for biomedical natural language processing (2020)
  • 16. Gururangan S, Marasović A, Swayamdipta S, Lo K, Beltagy I, Downey D, Smith NA: Don't stop pretraining: Adapt language models to domains and tasks. In: Proc. of the ACL. pp. 8342–8360 (2020)
  • 17. Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T: The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of Biomedical Informatics 46(5), 914–920 (2013)
  • 18. Hou Y, Xia Y, Wu L, Xie S, Fan Y, Zhu J, Che W, Qin T, Liu TY: Discovering drug-target interaction knowledge from biomedical literature (2021)
  • 19. Hou Y, Xia Y, Wu L, Xie S, Fan Y, Zhu J, Qin T, Liu TY: Discovering drug–target interaction knowledge from biomedical literature. Bioinformatics 38(22), 5100–5107 (2022)
  • 20. Huguet Cabot PL, Navigli R: REBEL: Relation extraction by end-to-end language generation. In: Findings of the ACL: EMNLP 2021. pp. 2370–2381 (2021)
  • 21. Jiang Y, Kavuluru R: End-to-end n-ary relation extraction for combination drug therapies. In: IEEE 11th International Conf. on Healthcare Informatics. pp. 72–80 (2023)
  • 22. Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, Moody B, Szolovits P, Anthony Celi L, Mark RG: MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1–9 (2016)
  • 23. Kanakarajan KR, Kundumani B, Sankarasubbu M: BioELECTRA: Pretrained biomedical text encoder using discriminators. In: Proc. of the 20th Workshop on BioNLP. pp. 143–154 (2021)
  • 24. Krallinger M, Rabal O, Akhondi SA, Pérez MP, Santamaría J, Rodríguez GP, Tsatsaronis G, Intxaurrondo A, López JA, Nandal U, et al.: Overview of the BioCreative VI chemical-protein interaction track. In: Proc. of the Sixth BioCreative Challenge Evaluation Workshop. vol. 1, pp. 141–146 (2017)
  • 25. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019)
  • 26. Lever J, Gakkhar S, Gottlieb M, Rashnavadi T, Lin S, Siu C, Smith M, Jones MR, Krzywinski M, Jones SJ: A collaborative filtering-based approach to biomedical knowledge discovery. Bioinformatics 34(4), 652–659 (2018)
  • 27. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L: BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proc. of the ACL. pp. 7871–7880 (2020)
  • 28. Lewis P, Ott M, Du J, Stoyanov V: Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art. In: Proc. of the 3rd Clinical Natural Language Processing Workshop. pp. 146–157 (2020)
  • 29. Li J, Sun Y, Johnson RJ, Sciaky D, Wei C, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z: BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: J. Biol. Databases Curation (2016)
  • 30. Liu F, Jagannatha A, Yu H: Towards drug safety surveillance and pharmacovigilance: current progress in detecting medication and adverse drug events from electronic health records. Drug Safety 42(1), 95–97 (2019)
  • 31. Liu Y, Ott M, Goyal N, Du J, Joshi M, Stoyanov V: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • 32. Lo K, Wang LL, Neumann M, Kinney R, Weld D: S2ORC: The Semantic Scholar Open Research Corpus. In: Proc. of the ACL. pp. 4969–4983 (2020)
  • 33. Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu TY: BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics 23(6) (2022)
  • 34. Miwa M, Sasaki Y: Modeling joint entity and relation extraction with table representation. In: Proc. of EMNLP. pp. 1858–1869 (2014)
  • 35. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  • 36. Parmar M, Mishra S, Purohit M, Luo M, Mohammad M, Baral C: In-BoXBART: Get instructions into biomedical multi-task learning. In: Findings of the ACL: NAACL 2022. pp. 112–128 (2022)
  • 37. Peng Y, Yan S, Lu Z: Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In: Proc. of the Workshop on BioNLP. pp. 58–65 (2019)
  • 38. Phan LN, Anibal JT, Tran H, Chanana S, Bahadroglu E, Peltekian A, Altan-Bonnet G: SciFive: a text-to-text transformer model for biomedical literature. arXiv preprint arXiv:2106.03598 (2021)
  • 39. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
  • 40. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
  • 41. Sanh V, Webson A, Raffel C, Bach SH, Sutawika L, Alyafeai Z: Multitask prompted training enables zero-shot task generalization. In: International Conference on Learning Representations (2022)
  • 42. Schick T, Schütze H: Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proc. of the 16th EACL. pp. 255–269 (2021)
  • 43. Shin HC, Zhang Y, Bakhturina E, Puri R, Patwary M, Shoeybi M, Mani R: BioMegatron: Larger biomedical domain language model. In: Proc. of EMNLP. pp. 4700–4706 (2020)
  • 44. Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B: Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019)
  • 45. Stiennon N, Ouyang L, Wu J, Ziegler D, Lowe R, Voss C, Radford A, Amodei D, Christiano PF: Learning to summarize with human feedback. In: Advances in Neural Information Processing Systems. vol. 33, pp. 3008–3021 (2020)
  • 46. Tiktinsky A, Viswanathan V, Niezni D, Meron Azagury D, Shamay Y, Taub-Tabib H, Hope T, Goldberg Y: A dataset for n-ary relation extraction of drug combinations. In: Proc. of NAACL: HLT. pp. 3190–3203 (2022)
  • 47. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Scialom T: Llama 2: Open foundation and fine-tuned chat models (2023)
  • 48. Wadhwa S, Amir S, Wallace BC: Revisiting relation extraction in the era of large language models. In: Proc. of the ACL (2023)
  • 49. Wang H, Xu C, McAuley J: Automatic multi-label prompting: Simple and interpretable few-shot classification. In: Proc. of NAACL: HLT. pp. 5483–5492 (2022)
  • 50. Wang Y, Kordi Y, Mishra S, Liu A, Smith NA, Khashabi D, Hajishirzi H: Self-Instruct: Aligning language models with self-generated instructions. In: Proc. of the ACL. pp. 13484–13508 (2023)
  • 51. Wang Y, Yu B, Zhu H, Liu T, Yu N, Sun L: Discontinuous named entity recognition as maximal clique discovery. In: Proc. of the ACL. pp. 764–774 (2021)
  • 52. Wei J, Bosma M, Zhao V, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022)
  • 53. Yuan H, Yuan Z, Gan R, Zhang J, Xie Y, Yu S: BioBART: Pretraining and evaluation of a biomedical generative language model. In: Proc. of the 21st Workshop on Biomedical Language Processing. pp. 97–109 (2022)
  • 54. Zeng D, Zhang H, Liu Q: CopyMTL: Copy mechanism for joint extraction of entities and relations with multi-task learning. In: Proc. of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 9507–9514 (2020)
  • 55. Zeng X, Zeng D, He S, Liu K, Zhao J: Extracting relational facts by an end-to-end neural model with copy mechanism. In: Proc. of the ACL. pp. 506–514 (2018)
  • 56. Zhong Z, Chen D: A frustratingly easy approach for entity and relation extraction. In: Proc. of NAACL: HLT. pp. 50–61 (2021)
