AMIA Annual Symposium Proceedings. 2026 Feb 14;2025:1071–1080.

Leveraging Large Language Models for Cancer Vaccine Adjuvant Name Extraction from Biomedical Literature

Hasin Rehana 1,2, Jie Zheng 3, Feng-Yu Yeh 3, Benu Bansal 1,4, Nur Bengisu Çam 5, Christianah Jemiyo 1, Brett McGregor 1, Arzucan Özgür 5, Yongqun He 3, Junguk Hur 1
PMCID: PMC12919462  PMID: 41726492

Abstract

This study explores cancer vaccine adjuvant name recognition using Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT), Large Language Model Meta AI (Llama), and Gemma. The models were tested in zero- and few-shot learning paradigms using AdjuvareDB and Vaccine Adjuvant Compendium (VAC) datasets. Prompts were designed to extract adjuvant names and assess the impact of contextual details. Notably, Llama-3.2 3B achieved a Recall of up to 68.7% (72.5% with manual validation) on the VAC dataset with four-shot prompting, although its Precision and F1-score were lower. In contrast, GPT-4o, with additional contextual interventions, achieved a Precision of 65.9%, Recall of 79.7%, and F1-score of 69.8% on the AdjuvareDB dataset. Gemma-2 9B also demonstrated moderate few-shot gains, peaking at 63.6% F1-score. These LLMs outperformed BioBERT, a model widely used for biomedical text mining, highlighting the potential of general-purpose LLMs for automatic vaccine adjuvant name extraction and contributing to advancements in vaccine research.

Introduction

An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response [1]. Identifying adjuvant names from cancer vaccine studies is essential for advancing research and enhancing immunotherapies. While traditional adjuvants like aluminum salts and oil-in-water emulsions have been widely used in prophylactic vaccines, the complex and immunosuppressive tumor microenvironment demands more potent and specialized adjuvants tailored for therapeutic cancer applications [2].

The discovery and development of vaccine adjuvants rely heavily on curated databases that compile critical information on adjuvant properties, usage, and safety. Several prominent databases have emerged as invaluable tools for researchers. For example, AdjuvareDB (http://tmliang.cn/adjuvaredb) is a comprehensive web-based database compiling information on candidate adjuvants in clinical use [3], providing detailed records of adjuvant composition, function, and other key attributes essential for understanding their immunotherapeutic applications. Equally important, Vaxjo (https://violinet.org/vaxjo/) is a centralized web-based database and analysis platform designed to curate, store, and analyze vaccine adjuvants and their roles in vaccine development [4]. Vaxjo offers details on adjuvant names, components, structure, appearance, storage conditions, preparation methods, function, safety, and associations with specific vaccines, facilitating an in-depth exploration of adjuvant characteristics. Another key resource is the VAC database (https://vac.niaid.nih.gov/), which provides comprehensive records for each adjuvant, including Vaccine Ontology (VO) ID, detailed properties of the adjuvant, preclinical and clinical usage data, associated publications, product grade, and available formulations. By centralizing and standardizing critical information on adjuvants, these databases support informed decision-making for experimental and clinical applications and accelerate the development of next-generation vaccines.

Identifying and characterizing potent adjuvants is critical for advancing cancer vaccine development, as these components directly influence the magnitude and quality of immune responses. However, manually extracting adjuvant-related information from the rapidly growing biomedical literature is labor-intensive and inefficient. This presents a significant bottleneck in discovering and optimizing adjuvant formulations for cancer vaccines. Recent advances in Artificial Intelligence (AI)-driven text mining offer promising solutions to address this challenge and support immunotherapy research. AI is transforming the expensive and time-consuming drug development process by improving patient selection and streamlining target discovery, especially in oncology. Although obstacles exist, such as limited data accessibility and a lack of trained staff, AI can enhance cancer vaccination effectiveness through novel adjuvant design and by improving customized therapy [5]. In recent years, computational approaches have emerged as powerful tools for advancing cancer vaccine research. Machine learning (ML) techniques have been applied to predict immunogenic epitopes, analyze clinical trial outcomes, and identify novel antigens. Kumar et al. highlighted how advancements in AI, particularly ML and computational modeling, have enabled the precise prediction and optimization of neoantigens, improved vaccine design, and facilitated the creation of personalized cancer vaccines [6]. Domain-specific models, such as BioBERT and BioClinicalBERT, built on the BERT architecture and pre-trained on large biomedical corpora, have also been widely adopted for biomedical named entity recognition [7,8].

Large language models (LLMs) are AI models designed to process and generate human-like text by learning patterns and relationships within vast textual datasets [9]. In the biomedical domain, LLMs have demonstrated immense potential in addressing challenges posed by unstructured data. Their capabilities extend across a broad spectrum of natural language processing (NLP) applications, including named entity recognition (NER), text summarization, translation, and contextual understanding. Studies have demonstrated the efficacy of Generative Pretrained Transformers (GPT), Large Language Model Meta AI (Llama), and Gemma in tasks such as named entity recognition and relation extraction, which are critical for understanding the interplay between vaccine components and their immunological outcomes.

One study created and evaluated a novel annotation schema for oncology information using LLMs, demonstrating that although GPT-4 exhibited superior performance in extracting comprehensive oncological histories from clinical notes, substantial enhancements are still required for dependable use in clinical research and patient care documentation [10]. Hou et al. illustrated that localized fine-tuning of Llama models via the Quantized Low-Rank Adaptation (QLoRA) algorithm could proficiently produce physician letters in radiation oncology, achieving significant therapeutic advantages and efficiency with minimal computational resources [11]. Ghali et al. applied Gemma in their system for biomedical named entity recognition on PubMed data, demonstrating its effectiveness in extracting structured clinical information [12].

Using LLMs, Ferber et al. developed an automated pipeline that accurately matches cancer patients to clinical trials, identifying 93.3% of relevant trials and achieving matches in 92.7% of cases [13]. This approach streamlines patient-trial matching and may even outperform qualified medical professionals. Although research leveraging LLMs for cancer vaccine adjuvant recognition is still nascent, a few pioneering studies underscore its potential. For instance, VaxLLM fine-tuned an LLM to annotate vaccine components, including adjuvants, in Brucella vaccines [14]. Similar methodologies could be adapted to cancer vaccine research, focusing on extracting adjuvant-specific entities from clinical trial data. Studies on oncology guidelines and personalized oncology have also explored LLMs for tasks like zero-shot learning, achieving notable accuracy improvements through few-shot training [15]. While these applications are not directly centered on cancer vaccine adjuvants, they highlight the broader utility of LLMs in biomedical research and their promise to advance this niche area.

This manuscript investigates the application of LLMs for recognizing cancer vaccine adjuvant names from biomedical literature. By harnessing the advanced NLP capabilities of LLMs, this study proposes a systematic framework for extracting adjuvants referenced in PubMed abstracts and cancer vaccine trials. This approach not only facilitates a more comprehensive understanding of the role of adjuvants in cancer immunotherapy but also underscores the potential of LLMs to advance biomedical research through data-driven insights.

Methods

Overall Methodology

LLM-driven automation has greatly benefited tasks such as extracting meaningful information from scientific literature, annotating clinical data, and identifying relationships between entities. Our approach integrates LLMs into a comprehensive pipeline for automated adjuvant name recognition, as illustrated in Figure 1(a). The pipeline begins with annotated datasets comprising clinical trials from AdjuvareDB and PubMed abstracts from the VAC database. These documents are preprocessed and standardized to ensure consistency across inputs. Three LLMs are employed, and their performance is compared against the domain-specific baseline, BioBERT. Task-specific prompts are constructed for each LLM under zero-shot and few-shot learning settings to guide adjuvant name extraction. Model outputs are postprocessed for consistency and evaluated using Precision, Recall, and F1-score to assess extraction accuracy and completeness.

Figure 1.

Overview of the methodology and dataset distribution. (a) Workflow for adjuvant name recognition. (b) Year-wise distribution of the PubMed articles in the VAC dataset.

Dataset and Data Preprocessing

This study utilized datasets for cancer vaccine adjuvant name recognition from two primary sources: gold standard annotated clinical trial records from the AdjuvareDB website and PubMed abstracts annotated in the VAC database. The AdjuvareDB dataset included 104 trials manually annotated by the AdjuvareDB team; however, since seven trials lacked specific annotations or interventions, only 97 clinical trials were considered. The VAC dataset comprised 290 abstracts from clinical and preclinical studies collected from PubMed. As shown in Figure 1(b), the yearly distribution of these abstracts indicates a significant increase in cancer vaccine adjuvant-related studies, emphasizing the growing complexity of manual curation. A detailed breakdown of the datasets, including annotation sources, is presented in Table 1; together, these datasets form a robust foundation for developing and evaluating automated adjuvant name recognition methods.

Table 1.

Datasets. Overview of datasets used for this study.

Dataset type and Source | Annotation Source | Entries | Dataset Specifications
Clinical trial dataset | AdjuvareDB | 97 | Title: The official title of each clinical trial. Brief Summary: A concise description of the trial objectives and design. Interventions: Detailed descriptions of the trial interventions.
PubMed abstract dataset | VAC | 290 | Title: The title of the manuscript. Abstract: Abstract of the manuscript. Substances: A list of substances from PubMed records.

The datasets were prepared to support structured, document-level inference. Each record was formatted with key metadata fields (e.g., unique identifier, title, abstract or trial description, and optional intervention/substance lists) and used as input to a single prompt. For comparison with supervised methods, the same datasets were used to fine-tune BioBERT using token-level BIO annotations (B-Adjuvant, I-Adjuvant, O), which supported span-level recognition and evaluation. An 80:20 train-validation split was used for both datasets. To support postprocessing and evaluation, we compiled a curated dictionary of known adjuvant names by aggregating terms from AdjuvareDB and the VAC database. All entries were normalized (e.g., lowercased, punctuation-stripped) and were reviewed by domain experts to ensure coverage of synonyms, abbreviations, and naming variants. The dictionary served as a controlled vocabulary for exact-match validation as well as to filter and standardize model outputs during postprocessing.
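As an illustration of the token-level BIO scheme described above, the sketch below assigns B-Adjuvant/I-Adjuvant/O labels to a token sequence. The whitespace tokenization and case-insensitive span matching here are simplified assumptions for illustration, not the study's exact preprocessing (BioBERT, for instance, uses WordPiece subword tokenization).

```python
# Minimal sketch of converting annotated adjuvant mentions to BIO tags.
# Whitespace tokenization is an illustrative simplification.

def bio_tag(tokens, adjuvant_mentions):
    """Assign B-Adjuvant/I-Adjuvant/O labels to a token list."""
    tags = ["O"] * len(tokens)
    for mention in adjuvant_mentions:
        span = [s.lower() for s in mention.split()]
        n = len(span)
        for i in range(len(tokens) - n + 1):
            if [t.lower() for t in tokens[i:i + n]] == span:
                tags[i] = "B-Adjuvant"
                for j in range(i + 1, i + n):
                    tags[j] = "I-Adjuvant"
    return tags

tokens = "The vaccine was formulated with GLA - SE adjuvant".split()
print(bio_tag(tokens, ["GLA - SE"]))
```

With this scheme, multi-token adjuvant names receive one B-Adjuvant label followed by I-Adjuvant labels, which supports the span-level recognition and evaluation mentioned above.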

Language Models for Adjuvant Recognition

This study employed three general-purpose LLMs, GPT, Llama, and Gemma, to extract adjuvant names from biomedical literature. Their performance was compared against BioBERT, a domain-specific model fine-tuned on biomedical corpora, to benchmark effectiveness under different learning settings.

  • GPT: GPT models, developed by OpenAI, represent a groundbreaking class of LLMs designed to process and generate natural language text [16]. The transformer-based architecture allows them to process large volumes of unstructured data, making them invaluable for applications like literature mining and clinical data analysis. Our study employed GPT-4o for cancer vaccine adjuvant identification.

  • Llama: Llama is a family of open-source LLMs developed by Meta AI, designed to achieve cutting-edge performance across various benchmarks [17]. These models are available in various configurations, including lightweight versions optimized for resource-constrained devices. For this study, we utilized the Llama-3.2 3B instruction fine-tuned model, which offered a good balance between performance and output consistency.

  • Gemma: Gemma is an open-source language model developed by Google DeepMind, designed using the same foundational architecture and infrastructure as their proprietary Gemini models but released for general-purpose, open-access use [18]. In this study, we used the Gemma-2 9B instruction fine-tuned model for its balance between model size and general-purpose performance. Similar to GPT and Llama, Gemma was evaluated in zero-shot and few-shot settings using prompt-based querying without fine-tuning.

  • BioBERT as a Comparative Model: To establish a supervised baseline, we fine-tuned BioBERT [7] on our dataset. BioBERT is a BERT-based model pretrained on large-scale biomedical corpora, including PubMed and PMC articles. The model was fine-tuned using the Hugging Face library with a learning rate of 2e-5, a batch size of 16, a maximum sequence length of 512, and 30 training epochs. We used the AdamW optimizer and cross-entropy loss, masking padding tokens during training.

The LLMs were evaluated solely via prompt engineering, without access to labeled training data. All LLM experiments were conducted using zero-shot and few-shot prompting with a temperature setting of 0.0001 and a maximum token limit of 100. BioBERT served as a reference point for assessing how general-purpose LLMs operating in zero-shot and few-shot settings compare to a domain-specific fine-tuned language model.
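As a sketch of how the decoding settings above map onto a request, the following builds a chat-completion-style payload with the near-deterministic temperature and token cap. The model name, message wording, and payload structure follow the widely used chat API format and are illustrative assumptions; the study's actual prompts are in its GitHub repository.

```python
# Sketch of request parameters for near-deterministic extraction.
# The system/user message text here is an illustrative placeholder.

def build_request(model_name, prompt_text):
    return {
        "model": model_name,
        "messages": [
            {"role": "system",
             "content": "Extract cancer vaccine adjuvant names as TSV."},
            {"role": "user", "content": prompt_text},
        ],
        "temperature": 0.0001,  # near-deterministic decoding
        "max_tokens": 100,      # cap on generated output length
    }

payload = build_request("gpt-4o", "PMID: 12345\nTitle: ...\nAbstract: ...")
print(payload["temperature"], payload["max_tokens"])
```

A very low temperature makes the sampling distribution nearly greedy, which favors reproducible extractions, while the small token cap suits the short TSV outputs expected here.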

Prompt Engineering

Effective prompt design is crucial for guiding LLMs to accurately extract targeted information. We designed a series of carefully crafted prompts tailored for each input type to extract cancer vaccine adjuvant names from PubMed abstracts and clinical trial records. Each prompt was applied to a single document (e.g., a PubMed abstract or a clinical trial record), with input data including the unique identifier (PMID/NCT number), title, article/trial data, and optionally the “substances” or “interventions” fields. Including these structured fields was hypothesized to provide additional domain-specific context, especially in disambiguating adjuvant mentions that appear with other biomedical terms or abbreviations. This document-level isolation was fundamental to ensuring data integrity and preventing cross-document “bleed” (where information such as an adjuvant name could be erroneously attributed to another study). In the zero-shot setting, we constructed prompts that explicitly defined the task, key instructions, output format, and task input. To evaluate the impact of additional structured context, prompts were tested with and without the inclusion of substances/interventions to determine their role in enhancing extraction accuracy. The general prompt structures are detailed in Figure 2. All zero- and few-shot prompts are available in our GitHub repository (https://github.com/hurlab/Vaccine-Adjuvant-LLM).

Figure 2.

Sample prompt for cancer vaccine adjuvant recognition (a) for PubMed abstract dataset, (b) for clinical trial dataset.

In the few-shot setting, each prompt included up to four demonstration examples consisting of PubMed abstracts or clinical trial records (with PMID/NCT number, title, abstract/clinical trial description, and optionally additional substances/interventions), followed by the corresponding tab-separated values (TSV)-formatted output. These examples were concatenated into a single prompt prior to appending the new document to be processed. This design enabled the LLMs to learn the expected extraction behavior through multiple demonstrations, without explicit fine-tuning.
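The few-shot prompt assembly described above can be sketched as simple string concatenation of demonstrations followed by the query document. The instruction wording, field labels, and "Done" delimiter placement below are illustrative assumptions, not the study's exact templates.

```python
# Sketch of assembling a few-shot prompt: up to four (document, gold TSV)
# demonstrations, then the new document to process.

def build_few_shot_prompt(demos, new_doc, instruction):
    parts = [instruction]
    for doc, tsv_output in demos:  # demos: list of (document text, gold TSV)
        parts.append(f"Input:\n{doc}\nOutput:\n{tsv_output}\nDone")
    parts.append(f"Input:\n{new_doc}\nOutput:")
    return "\n\n".join(parts)

demos = [("PMID: 111\nTitle: Example trial\nAbstract: ... GLA-SE ...",
          "111\tGLA-SE")]
prompt = build_few_shot_prompt(
    demos,
    "PMID: 222\nTitle: New study\nAbstract: ...",
    "Extract all vaccine adjuvant names as TSV (ID<TAB>adjuvant).")
print(prompt.count("Output:"))
```

Ending the prompt at "Output:" cues the model to continue with the TSV rows for the new document, mirroring the demonstrated extraction behavior.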

Scalability and Deployment

Although our prompts were designed to process one document at a time to ensure accurate attribution, this did not limit scalability. We implemented a parallel workflow that distributed document analysis across multiple processing threads. These processes operated independently and submitted requests to the language model concurrently, significantly increasing throughput. This architecture was designed to allow throughput to scale almost linearly with the number of worker processes assigned. Therefore, the primary bottleneck shifted from sequential processing to the rate at which the LLM vendor’s API could serve requests. We increased the worker pool size up to the vendor-imposed rate cap to maximize throughput. Furthermore, we implemented a shared back-off coordination mechanism to gracefully handle transient API errors and rate limiting. When any process encountered a rate-limit error, it signaled all others to temporarily pause. This design prevented a cascade of failed requests and allowed the system to automatically recover and continue processing. By decoupling the logic of single-document analysis from the parallel execution architecture, our workflow achieved high-throughput scalability without compromising analytical integrity, making it suitable for large-scale biomedical literature analysis.
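The shared back-off coordination described above can be sketched with worker threads and a shared pause flag. The study does not specify its concurrency primitives, so the thread pool, `threading.Event` signal, retry count, and stubbed rate-limit error below are all illustrative assumptions.

```python
# Sketch of the parallel workflow with shared back-off: workers submit LLM
# requests independently; when any worker hits a (simulated) rate-limit
# error, it signals all others to pause before their next request.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

pause = threading.Event()  # set -> every worker waits before its next call

class RateLimitError(Exception):
    """Stand-in for a vendor API rate-limit error (illustrative)."""

def process_document(doc, call_llm, max_retries=3):
    """Run single-document extraction, honoring the shared back-off signal."""
    for _ in range(max_retries):
        while pause.is_set():      # another worker hit the rate limit
            time.sleep(0.01)
        try:
            return call_llm(doc)
        except RateLimitError:
            pause.set()            # tell every worker to back off
            time.sleep(0.05)       # illustrative back-off interval
            pause.clear()
    return None                    # gave up after max_retries

# Stub LLM call: fails once with a rate-limit error, then succeeds.
calls, lock = {"n": 0}, threading.Lock()
def flaky_llm(doc):
    with lock:
        calls["n"] += 1
        first = calls["n"] == 1
    if first:
        raise RateLimitError()
    return f"{doc}\tadjuvant"

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda d: process_document(d, flaky_llm),
                            ["doc1", "doc2", "doc3"]))
print(results)
```

Because single-document logic (`process_document`) is decoupled from the pool, throughput scales with the worker count until the vendor's rate cap becomes the bottleneck, as described above.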

Postprocessing

We employed a systematic postprocessing pipeline to clean and standardize the LLM responses. The process involved removing extraneous text, standardizing the formatting, deduplicating extracted entities, and enforcing TSV format. Each output row included the document’s unique identifier (PMID for PubMed and NCT ID for clinical trials) and corresponding adjuvant name(s). Each response was checked for the “Done” marker to signal the end of processing for the input document. This safeguard helped identify incomplete or truncated outputs. To ensure terminology consistency, we cross-referenced all extracted terms against a curated dictionary of known adjuvant names, compiled from AdjuvareDB, the VAC database, and additional literature sources. This dictionary served as a controlled vocabulary for validating and normalizing model outputs. Entries that did not match any dictionary term were flagged for manual review. Our postprocessing step ensured high-quality, structured data suitable for downstream analysis and evaluation.
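The postprocessing steps above can be sketched as follows. The normalization rule (lowercasing, punctuation stripping) matches the dictionary preparation described earlier, but the exact regex and the dictionary entries here are illustrative assumptions.

```python
# Sketch of postprocessing: check the "Done" completion marker, enforce TSV
# rows, deduplicate, and split results into dictionary-validated names vs.
# names flagged for manual review.
import re

def normalize(name):
    # Lowercase and strip punctuation (hyphens kept); illustrative rule.
    return re.sub(r"[^\w\s-]", "", name.lower()).strip()

def postprocess(response, doc_id, dictionary):
    if "Done" not in response:           # truncated/incomplete output
        return None
    seen, validated, flagged = set(), [], []
    for line in response.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or parts[0] != doc_id:
            continue                     # drop commentary/malformed rows
        name = parts[1].strip()
        key = normalize(name)
        if key in seen:
            continue                     # deduplicate
        seen.add(key)
        (validated if key in dictionary else flagged).append(name)
    return validated, flagged            # flagged -> manual review

dictionary = {normalize(n) for n in ["GLA-SE", "Poly-ICLC", "Montanide ISA 51"]}
response = "12345\tGLA-SE\n12345\tgla-se\n12345\tNovelAdj-X\nDone"
print(postprocess(response, "12345", dictionary))
```

Here `gla-se` is dropped as a case-variant duplicate of `GLA-SE`, and the out-of-dictionary `NovelAdj-X` is flagged rather than discarded, since it may be a valid adjuvant missing from the gold standard.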

Performance Evaluation

The study evaluated each model’s performance using Precision, Recall, and F1-score. These metrics collectively provide a robust framework for evaluating a model’s accuracy, comprehensiveness, and balance in its outputs. Analyzing these metrics identified areas for targeted improvement and ensured alignment with the intended objectives.

The evaluation process employed a combination of automated and manual validation methods to ensure the accuracy and reliability of the evaluation scores. This two-step validation, automated followed by manual review, ensures that our evaluation metrics accurately reflect model performance. Automated validation served as the initial step in the pipeline, applying an exact-match (case-insensitive) criterion to compare the model’s outputs against a curated dictionary of predefined mappings. The dictionary, meticulously compiled and validated in advance, acted as the reference for determining correctness. Automated validation was efficient in quickly identifying outputs that matched the expected results. However, its limitations in handling ambiguous, context-dependent, or nuanced cases necessitated further scrutiny through manual validation. Mismatched cases identified during the automated process were subjected to manual validation to address the limitations. All authors of this study served as validators and thoroughly reviewed each mismatched output to determine its correctness. At least two validators reviewed each case independently, ensuring that each instance was examined from multiple perspectives, reducing the likelihood of oversight or bias. In cases where the two initial validators disagreed, the instance was forwarded to a third validator. The third validator reviewed the case independently and provided the final decision, resolving discrepancies and ensuring fair and accurate validation. Findings were carefully documented throughout the manual validation process, including the reasons for disagreements and their resolution.
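The metrics above follow the standard definitions over exact-match counts. As a worked sketch (with illustrative counts, not values from the tables):

```python
# Precision, Recall, and F1 from true-positive, false-positive, and
# false-negative counts, as used in the evaluation.

def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = prf1(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

Manual validation affects these scores by reclassifying counts: a prediction initially counted as a false positive under exact matching becomes a true positive once reviewers judge it correct, which is why the manual columns in the results tables exceed the automated ones.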

Results and Discussion

Although GPT-4o, Llama-3.2 3B, and Gemma-2 9B generally followed the desired output format when extracting adjuvant names from the given biomedical literature, they exhibited minor inconsistencies, such as formatting markers or header variations. In addition, we tested the Llama-3.2 1B variant during prompt development but excluded it from formal experiments due to unstable output formatting (data not shown). These inconsistencies emphasize the importance of our postprocessing step to ensure output normalization prior to evaluation.

Table 2 summarizes the performance of GPT-4o, Llama-3.2 3B, Gemma-2 9B, and BioBERT models across automated and manual validation processes on the VAC dataset. GPT-4o consistently outperformed Llama-3.2 3B and the BioBERT baseline across both automated and manually validated outputs. Notably, manual validation scores were consistently higher than automated ones, underscoring the value of human-in-the-loop evaluation in nuanced biomedical tasks. With zero-shot, GPT-4o achieved an F1-score of 39.5% (automated) and 61.5% (manual), which increased to 51.5% (automated) and 69.2% (manual) at four-shot without substances. The inclusion of substance information further improved GPT-4o’s F1-score, peaking at 51.8% (automated) and 69.1% (manual) with four-shot.

Table 2.

Comparative Results on VAC dataset. Direct comparison of GPT-4o, Llama-3.2 3B, Gemma-2 9B, and BioBERT models, highlighting the differences in performance metrics (Precision, Recall, and F1-score) with and without substances.

Automated Validation Manual Validation
Models Shots TP FP FN P (%) R (%) F1 (%) P (%) R (%) F1 (%)
GPT-4o (Without Substances) 0 155 319 157 32.7 49.8 39.5 57.0 66.8 61.5
1 161 270 151 37.3 51.5 43.2 64.1 70.5 67.1
2 164 256 148 39.0 52.6 44.8 64.5 70.2 67.2
3 177 224 135 44.2 56.8 49.7 67.3 70.5 68.8
4 185 221 127 45.5 59.2 51.5 67.1 71.5 69.2
GPT-4o (With Substances) 0 151 329 161 31.5 48.5 38.2 58.4 67.7 62.7
1 164 276 148 37.3 52.7 43.7 64.0 70.1 66.9
2 166 267 146 38.4 53.2 44.6 64.6 70.1 67.2
3 186 231 126 44.6 59.5 51.0 66.8 70.8 68.7
4 188 227 124 45.4 60.4 51.8 67.2 71.0 69.1
Llama-3.2 3B Instruct (Without Substances) 0 120 445 192 21.2 38.4 27.3 42.2 52.5 46.8
1 192 420 120 31.4 61.6 41.6 52.0 71.8 60.3
2 192 396 120 32.6 61.4 42.6 54.1 70.9 61.4
3 204 386 108 34.6 65.4 45.2 54.0 70.8 61.3
4 214 428 98 33.4 68.7 44.9 51.5 72.5 60.2
Llama-3.2 3B Instruct (With Substances) 0 116 434 196 21.1 37.2 26.9 43.5 50.1 46.6
1 178 368 134 32.7 57.2 41.6 57.2 69.0 62.6
2 189 350 123 35.1 60.7 44.5 58.8 70.7 64.2
3 197 316 115 38.4 63.0 47.7 60.9 70.0 65.1
4 198 332 114 37.4 63.6 47.1 59.5 69.9 64.3
Gemma-2 9B Instruct (Without Substances) 0 157 344 155 31.3 50.2 38.6 - - -
1 147 216 165 40.6 47.0 42.7 - - -
2 175 253 137 40.9 56.2 47.3 - - -
3 184 258 128 41.6 59.0 48.8 - - -
4 186 244 126 43.3 59.6 50.1 - - -
Gemma-2 9B Instruct (With Substances) 0 148 385 164 27.8 47.4 35.0 - - -
1 174 283 138 38.1 55.7 45.2 - - -
2 176 269 136 39.6 56.4 46.5 - - -
3 176 257 136 40.6 56.4 47.2 - - -
4 180 254 132 41.5 57.8 48.3 - - -
BioBERT (Without Substances) - 38 75 28 33.6 57.6 42.5 - - -
BioBERT (With Substances) - 40 55 25 46.7 38.9 42.4 - - -

TP: True Positive, FP: False Positive, FN: False Negative, P: Precision, R: Recall, F1: F1-score

GPT-4o benefited the most from additional shots and contextual information. In contrast, Llama-3.2 3B started with a low zero-shot F1-score of 27.3% but benefited significantly from few-shot learning and substance information, reaching up to 47.1% (automated) and 64.3% (manual) in the best configuration. Interestingly, while GPT-4o began with stronger Precision, both models improved Recall more effectively with added shots, suggesting that few-shot learning enhanced entity identification strategies. Additionally, the Gemma-2 9B model showed moderate improvements with few-shot learning and substance information, achieving up to 50.1% F1 (automated) without substances and 48.3% with substances, at four-shot. Manual validation was performed for GPT-4o and Llama-3.2 3B, the two models that demonstrated the strongest performance across automated metrics. The baseline BioBERT model, although it achieved a relatively higher Precision of 46.7%, lagged significantly behind both LLMs, with an F1 of only 42.5% (without substances) and 42.4% (with substances). This finding highlighted the limitations of baseline NER models in emerging or nuanced domains like cancer vaccine adjuvant extraction, where general-purpose LLMs offered superior flexibility.

Automated validation provided speed and consistency in identifying exact matches, while manual validation introduced the expertise and judgment necessary for handling edge cases and ambiguities. This two-step validation process improved the reliability and comprehensiveness of the evaluation scores, ensuring they accurately reflect the model’s performance and provide actionable insights for further refinement and improvement. Furthermore, during the manual review, we observed that some of the adjuvant names identified by the LLMs were valid but not included in the ‘gold standard’ dataset. In the current evaluation, such adjuvants were marked as incorrect simply because they were not part of the gold standard reference. This highlights potential gaps in the existing annotations and suggests that LLMs can identify novel or overlooked biomedical entities.

Table 3 provides a comparative evaluation of GPT-4o, Llama-3.2 3B, Gemma-2 9B, and BioBERT on the AdjuvareDB dataset, highlighting the significant impact of domain-specific information ‘interventions’ in the clinical trial data and few-shot prompting on model performance. GPT-4o consistently outperformed other models, with F1-scores rising from 47.1% (zero-shot, no interventions) to a peak of 69.8% (one- and three-shot, with interventions), demonstrating its strong ability to generalize with minimal examples when guided by contextual knowledge. Llama-3.2 3B also benefited from interventions, improving from 26.5% (zero-shot, no interventions) to 55.7% (three-shot, with interventions), although it still lagged behind GPT-4o in all configurations. Gemma-2 9B showed moderate performance gains, particularly with domain-specific interventions, improving from 38.2% (zero-shot, no interventions) to 63.6% (two-shot, with interventions). While it did not reach GPT-4o’s accuracy, Gemma consistently outperformed Llama in lower-shot and intervention-enhanced setups, indicating its solid adaptability to biomedical tasks. In contrast, BioBERT performed poorly, with F1-scores below 16%, likely due to the small dataset size and sparsity of target entities.

Table 3.

Comparative results on AdjuvareDB dataset annotated by the AdjuvareDB team. Direct comparison of GPT-4o, Llama-3.2 3B, Gemma-2 9B, and BioBERT models, highlighting the differences in performance metrics (Precision, Recall, and F1-score) with and without interventions.

Models Shots TP FP FN P (%) R (%) F1 (%)
GPT-4o (Without Interventions) 0 44 46 53 48.9 45.4 47.1
1 43 32 54 57.5 44.7 50.3
2 44 29 53 60.4 45.0 51.6
3 43 30 54 59.2 44.3 50.7
4 44 30 53 59.5 45.4 51.5
GPT- 4o (With Interventions) 0 77 52 20 59.5 79.4 68.0
1 77 47 20 62.0 79.7 69.8
2 73 40 24 64.5 74.9 69.3
3 72 37 25 65.9 74.2 69.8
4 73 42 24 63.3 75.3 68.8
Llama-3.2 3B Instruct (Without Interventions) 0 27 80 70 25.2 27.8 26.5
1 40 116 57 25.7 41.2 31.7
2 41 106 56 27.8 42.3 33.6
3 41 76 56 35.1 42.6 38.5
4 42 94 55 30.7 43.0 35.8
Llama-3.2 3B Instruct (With Interventions) 0 43 77 54 35.8 44.0 39.4
1 68 91 29 42.9 70.1 53.2
2 71 91 26 43.9 73.2 54.9
3 69 83 28 45.6 71.5 55.7
4 69 84 28 45.0 70.8 55.0
Gemma-2 9B Instruct (Without Interventions) 0 39 68 58 36.4 40.2 38.2
1 39 50 58 43.8 40.2 41.9
2 45 65 52 40.9 46.4 43.5
3 49 60 48 44.8 50.5 47.5
4 51 64 46 44.3 52.6 48.1
Gemma-2 9B Instruct (With Interventions) 0 67 94 30 41.6 69.1 51.9
1 67 53 30 55.8 69.1 61.8
2 70 54 27 56.6 72.5 63.6
3 68 53 29 56.3 70.4 62.6
4 70 56 27 55.3 71.8 62.5
BioBERT (Without Interventions) - 6 50 16 10.7 27.3 15.4
BioBERT (With Interventions) - 9 111 12 7.5 42.9 12.8

TP: True Positive, FP: False Positive, FN: False Negative, P: Precision, R: Recall, F1: F1-score

Figure 3 highlights the improvement of the F1-score as the number of few-shot examples increases under different experimental settings. On both the VAC (Figures 3a and 3b) and AdjuvareDB datasets (Figures 3c and 3d), GPT-4o consistently achieved the highest performance across most of the shot levels, reinforcing its superiority in both zero-shot and few-shot scenarios. The effect of few-shot learning is evident between 0 and 1 shot, where all models, especially Llama-3.2 3B, experienced significant improvements. Including additional context information, such as substances or interventions, led to a further performance boost. Particularly, all the models benefited notably from using interventions in the AdjuvareDB dataset (Figures 3c and 3d), supporting the hypothesis that domain-specific context helps reduce ambiguity in entity recognition.

Figure 3.

Performance plots showing F1-score (%) across varying numbers of few-shot examples for the LLMs: (a) VAC dataset without substances, (b) VAC dataset with substances, (c) AdjuvareDB dataset without interventions, and (d) AdjuvareDB dataset with interventions.

Our findings suggest that although BioBERT was fine-tuned on both datasets, its performance on the AdjuvareDB dataset remained constrained, likely due to the small size of the dataset and class imbalance. In contrast, LLMs like GPT-4o, Llama-3.2 3B, and Gemma-2 9B demonstrated stronger and more flexible performance, particularly when supported by few-shot prompting and domain-specific additional context. One key takeaway is that the instruction-following capabilities and contextual reasoning capabilities of LLMs offer a significant advantage for low-resource biomedical extraction tasks, especially when paired with high-quality prompt design.

Error Analysis

To evaluate the performance and limitations of the LLMs, we conducted a detailed error analysis of the mismatched cases by manually reviewing each prediction. This analysis revealed several recurring issues that impacted extraction accuracy, including lack of naming standardization, partial matches, and incomplete extractions. A prominent source of error was the lack of standardization in adjuvant naming across biomedical literature. For instance, the model extracted full names and abbreviations inconsistently, such as identifying ‘GLA-SE’ as ‘GLA-Stable Emulsion (SE)’ in some cases and ‘Glucopyranosyl lipid-A’ in others. In other instances, the model captured concentration information or formulation details (e.g., ‘2% SE’) that were not explicitly part of the gold standard but relevant to the context, such as with ‘Stable Emulsion (SE)’.

Additionally, adjuvants like ‘Advax’ were represented in multiple naming conventions, such as ‘Advax delta inulin’, ‘Delta inulin’, and ‘Delta Inulin (DI)’, which caused discrepancies despite being semantically equivalent. Partial matches were also a frequent challenge, especially for compound or multi-word adjuvant names. For example, ‘Stable Emulsion (SE)’ was sometimes extracted as a standalone term and other times as part of a more specific formulation like ‘GLA-SE’, leading to mismatches when the model correctly identified only a subset of the full expression. In addition, incomplete extractions were noted where only part of a multi-word adjuvant was captured, missing essential components. For instance, in the case of ‘GLA-SE’, the model sometimes extracted only the ‘GLA’ component or interpreted it solely as ‘Glucopyranosyl lipid-A’, omitting the emulsion context. These errors and their causes are summarized in Table 4, which presents representative examples, issues identified through manual validation, and comments on their root causes. This categorization helps us understand the limitations of current LLM approaches in handling domain-specific, non-standardized biomedical terminology.

Table 4.

Breakdown of LLM errors in adjuvant name recognition

Gold Standard → Predicted | Issue resolved by manual evaluation | Cause/Comment
3M-052-SE → TLR7/8-based adjuvant 3M-052 in stable emulsion (3M-052-SE) | Lack of standardization | Extracted additional molecular information
Advax → Advax delta inulin, Delta inulin polysaccharide adjuvant, Delta Inulin (DI) | Lack of standardization | Multiple naming conventions and variations extracted
Stable Emulsion (SE) → 2% SE | Lack of standardization | Concentration information also extracted
GLA-SE → Glucopyranosyl lipid-A | Lack of standardization and partial extraction | Extracted the full name for one component and missed the other
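The variant mismatches summarized in Table 4 suggest that much of the manual validation could be approximated by a normalization step before scoring. The sketch below is purely illustrative: the synonym table is hypothetical and tiny (a real pipeline would draw on a curated resource such as the Vaccine Ontology), and the containment heuristic in `lenient_match` is one possible way to credit partial extractions like 'GLA' for 'GLA-SE'.

```python
import re

# Hypothetical synonym table mapping variant spellings to canonical keys.
SYNONYMS = {
    "glucopyranosyl lipid-a": "gla",
    "stable emulsion": "se",
    "delta inulin": "advax",
}

def normalize(name):
    """Map an extracted adjuvant mention to a canonical lowercase key."""
    s = name.lower().strip()
    s = re.sub(r"\(.*?\)", "", s)             # drop parenthetical abbreviations, e.g. '(DI)'
    s = re.sub(r"\b\d+(\.\d+)?%\s*", "", s)   # drop concentrations, e.g. '2% '
    s = re.sub(r"\s+", " ", s).strip()        # collapse leftover whitespace
    return SYNONYMS.get(s, s)

def lenient_match(gold, pred):
    """Accept a prediction if normalized forms match or one contains the other."""
    g, p = normalize(gold), normalize(pred)
    return g == p or g in p or p in g
```

Under this heuristic, ‘Delta Inulin (DI)’ resolves to the same key as ‘Advax’, and ‘Glucopyranosyl lipid-A’ is credited as a partial match for ‘GLA-SE’ — the same resolutions the manual evaluation applied in Table 4.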

Conclusion

The automation of adjuvant name extraction is a timely advancement, given the overwhelming scale of biomedical literature. The findings of this study underscore the potential of LLMs in addressing domain-specific challenges in cancer vaccine research. Our results show that models like GPT-4o, Llama-3.2 3B, and Gemma-2 9B are not only capable of identifying cancer vaccine adjuvant names across large and diverse datasets but can also rival or exceed domain-specific models like BioBERT under well-designed prompt configurations. Our study illustrates that LLMs not only reduce the burden of manual curation but also introduce opportunities for capturing rare, emerging, or inconsistently labeled adjuvant entities. The structured prompt design used in this study served as a robust mechanism for evaluating how contextual information and few-shot examples influence model performance. However, our findings also highlight the continued necessity of manual validation to resolve formatting inconsistencies and ensure gold-standard alignment, especially given that LLMs are still in the early stages of evolution. Nonetheless, their rapid development and demonstrated effectiveness suggest enormous potential for transforming biomedical information extraction tasks.

Building on these findings, our future research will focus on developing curation pipelines that leverage LLMs to assist in the identification and extraction of novel biomedical entities. This curation system will support domain experts by streamlining the curation process, enhancing dataset completeness, and reducing manual workload, ultimately improving the scalability of biomedical knowledge resources. Expanding beyond cancer vaccine adjuvants, our work will also target infectious diseases, addressing broader public health challenges. We plan to explore larger variants of Llama, Gemma, and other open-access LLMs to further enhance the flexibility and efficiency of our methodology. Additionally, we aim to integrate structured biomedical knowledge, such as vaccine ontology, into our data preprocessing and prompt design pipelines to improve semantic accuracy and deepen contextual understanding of adjuvants. By refining model generalizability and expanding our annotated datasets, our goal is to develop a robust, scalable framework that supports the extraction of vaccine-related knowledge and accelerates immunotherapy research.

Acknowledgment

The study was supported by the U.S. National Institute of Allergy and Infectious Diseases (NIAID; U24AI171008 to Y.H. and J.H.).

Availability

The source code is available at https://github.com/hurlab/Vaccine-Adjuvant-LLM.

References

  • 1. Zhao T, Cai Y, Jiang Y, He X, Wei Y, Yu Y, et al. Vaccine adjuvants: Mechanisms and platforms. Signal Transduct Target Ther. 2023;8(1):283. doi: 10.1038/s41392-023-01557-7.
  • 2. Temizoz B, Kuroda E, Ishii KJ. Vaccine adjuvants as potential cancer immunotherapeutics. Int Immunol. 2016;28(7):329–38. doi: 10.1093/intimm/dxw015.
  • 3. Ren D, Jin J, Xiong S, Xia D, Zhao X, Guo H, et al. AdjuvareDB: A comprehensive database for candidate adjuvant compendium in clinic. Clin Transl Med. 2024;14(4)
  • 4. Sayers S, Ulysse G, Xiang Z, He Y. VaxJo: A web-based vaccine adjuvant database and its application for analysis of vaccine adjuvants and their uses in vaccine development. J Biomed Biotechnol. 2012;2012:831486. doi: 10.1155/2012/831486.
  • 5. Zhang W-Y, Zheng X-L, Coghi PS, Chen J-H, Dong B-J, Fan X-X. Revolutionizing adjuvant development: Harnessing AI for next-generation cancer vaccines. Front Immunol. 2024;15:1438030. doi: 10.3389/fimmu.2024.1438030.
  • 6. Kumar A, Dixit S, Srinivasan K, Vincent PDR. Personalized cancer vaccine design using AI-powered technologies. Front Immunol. 2024;15:1357217. doi: 10.3389/fimmu.2024.1357217.
  • 7. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. doi: 10.1093/bioinformatics/btz682.
  • 8. Hu Y, Chen Q, Du J, Peng X, Keloth VK, Zuo X, et al. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc. 2024;31(9):1812–20.
  • 9. Naveed H, Khan AU, Qiu S, Saqib M, Anwar S, Usman M, et al. A comprehensive overview of large language models. arXiv. 2023.
  • 10. Palepu A, Dhillon V, Niravath P, Weng W-H, Prasad P, Saab K, et al. Exploring large language models for specialist-level oncology care. arXiv. 2024.
  • 11. Hou Y, Bert C, Gomaa A, Lahmer G, Höfler D, Weissmann T, et al. Fine-tuning a local Llama-3 large language model for automated privacy-preserving physician letter generation in radiation oncology. Front Artif Intell. 2025;7:1493716. doi: 10.3389/frai.2024.1493716.
  • 12. Ghali M-K, Farrag A, Sakai H, Baz HE, Jin Y, Lam S. GAMedX: Generative AI-based medical entity data extractor using large language models. arXiv. 2024.
  • 13. Ferber D, Hilgers L, Wiest IC, Leßmann M-E, Clusmann J, Neidlinger P, et al. End-to-end clinical trial matching with large language models. arXiv. 2024.
  • 14. Li X, Zheng Y, Hu J, Zheng J, Wang Z, He Y. VaxLLM: Leveraging fine-tuned large language model for automated annotation of Brucella vaccines. bioRxiv. 2024:2024.11.25.625209.
  • 15. Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Network Open. 2023;6(11):e2343689. doi: 10.1001/jamanetworkopen.2023.43689.
  • 16. Radford A. Improving language understanding by generative pre-training. 2018.
  • 17. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv. 2023.
  • 18. Team G, Mesnard T, Hardin C, Dadashi R, Bhupatiraju S, Pathak S, et al. Gemma: Open models based on Gemini research and technology. arXiv. 2024.
