Abstract
Background
Vaccines are crucial for preventing infectious diseases; however, they may also be associated with adverse events (AEs). Conventional analysis of vaccine AEs relies on manual review and assignment of AEs to terms in terminology or ontology, which is a time-consuming process and constrained in scope. This study explores the potential of using Large Language Models (LLMs) and LLM text embeddings for efficient and comprehensive vaccine AE analysis.
Results
We used the Llama-3 LLM to extract AE information from FDA-approved vaccine package inserts for 111 licensed vaccines, including 15 influenza vaccines. Llama-3 achieved over 80% accuracy in extracting AE text from the package inserts. Text embeddings were then generated for each vaccine’s AE description using the nomic-embed-text and mxbai-embed-large models. To further evaluate the text embeddings, the vaccines were clustered using two methods: (1) LLM text embedding-based clustering and (2) ontology-based semantic similarity analysis. The ontology-based method mapped AEs to the Human Phenotype Ontology (HPO) and the Ontology of Adverse Events (OAE), with semantic similarity calculated using Lin’s method. Compared to the semantic similarity analysis, the LLM approach captured more differential AE profiles. Furthermore, LLM-derived text embeddings were used to develop a Lasso logistic regression model to predict whether a vaccine is “Live” or “Non-Live”. The term “Non-Live” refers to all vaccines that do not contain live organisms, including inactivated and mRNA vaccines. A comparative analysis showed that, despite similar clustering patterns, the nomic-embed-text model outperformed the mxbai-embed-large model, achieving 80.00% sensitivity, 83.06% specificity, and 81.89% accuracy in 10-fold cross-validation. Our analysis of the AE LLM embeddings identified many AE patterns, examples of which are demonstrated.
Conclusion
This study demonstrates the effectiveness of LLMs for automated AE extraction and analysis, and shows that LLM text embeddings capture latent information about AEs, enabling more comprehensive knowledge discovery. Our findings suggest that LLMs have substantial potential for improving vaccine safety and public health research.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13326-025-00331-8.
Keywords: FDA package inserts, Vaccine, Adverse event, Large language models (LLM), Llama-3 model, Vaccine ontology (VO), Human phenotype ontology (HPO), Ontology of adverse events (OAE)
Background
Vaccines have been widely used to effectively protect humans against various infectious diseases, such as influenza and COVID-19 [1]. However, they are also associated with various adverse events (AEs), although such events are typically rare and mild. For example, many influenza vaccines are associated with AEs such as fever and rash in a small population. Sometimes, more severe AEs such as Guillain-Barré syndrome (GBS, a rare neurological disorder) might occur [2]. Additionally, mRNA COVID-19 vaccines were associated with an excess risk of serious adverse events [3]. Therefore, it is important to systematically monitor and analyze various AEs associated with different vaccines to identify the vaccine AE patterns and underlying mechanisms.
Ontologies have traditionally been used to classify and analyze adverse events. For example, vaccine and drug AE case reports are usually annotated using MedDRA [4]. However, because the MedDRA hierarchy is not well established, the Ontology of Adverse Events (OAE) [5] has often been used to classify different AEs after mapping the MedDRA AE records to OAE terms [6]. The Human Phenotype Ontology (HPO) [7, 8] ontologically classifies various phenotypes and can also be used to analyze AEs. The hierarchical relationships within an ontology, along with the weights assigned to its terms, can be used to analyze semantic similarity. This method relies on a predefined ontology structure, taking into account the hierarchy and relationships between terms to compute similarity, and it helps in understanding and comparing the semantic proximity of different terms [9]. Both the HPO and OAE can support such semantic similarity analysis. The characteristics of and relations among different vaccines can be analyzed using the Vaccine Ontology (VO) [10], a community-based biomedical ontology focusing on the classification of vaccine knowledge from various aspects such as vaccine type, vaccine platform, and targeted disease or pathogen.
Large Language Models (LLMs) have been widely used to effectively extract and summarize useful information from various types of text [11]. LLMs are built using billions of parameters, allowing them to understand complex language patterns, semantics, and context. One of the key technologies enabling LLMs to understand and represent text is text embedding. Text embedding is a process by which words, phrases, or even entire documents are transformed into numerical vectors that capture semantic meaning. These vectors allow models to quantify the relationship between different pieces of text, which helps in a variety of tasks, such as text classification, information retrieval, and sentiment analysis. A recent study showed that GPT LLM could be used to accurately identify and catalog vaccine adverse events within clinical reports from the Vaccine Adverse Event Reporting System (VAERS) [12]. Another commonly used large language model is the Llama-3 instruction-tuned model, which is fine-tuned and optimized for dialogue and chat use cases. It surpasses many available open-source chat models on common benchmarks [13].
In this manuscript, we hypothesized that integrating ontologies with LLM could significantly enhance vaccine AE studies. To address the hypothesis, we applied an LLM to extract AE information from FDA-approved vaccine package inserts. Subsequently, the text embedding capability of LLM was leveraged to transform vaccine AE text into a high-dimensional vector space. Hierarchical clustering was employed to group vaccines based on these embeddings, allowing us to examine the alignment between LLM-derived similarities and established vaccine characteristics. To investigate the characteristics of LLM text-embedding methods, we compared the vaccine similarity patterns generated by LLM with those obtained using traditional ontology-based semantic methods. Moreover, we explored the potential of LLM-generated embeddings for machine learning tasks by training a classifier to predict whether a vaccine was classified as “Live” or “Non-Live”. This approach aimed to evaluate the LLM’s capability to extract latent information from text, highlighting its potential for faster vaccine analysis. By leveraging text embeddings, this method bypasses the complexity of ontology mapping, enabling efficient exploration and knowledge discovery from biomedical annotation texts relevant to our research.
Methods
Retrieval of vaccine package inserts data and annotation of vaccines
All of the FDA vaccine package insert PDF files were downloaded from https://www.fda.gov/vaccines-blood-biologics/vaccines/vaccines-licensed-use-united-states. Vaccines that lacked package insert files or had only approval letters were excluded from the analysis.
The vaccines were manually annotated with VO terms based on the vaccine names. Unmapped vaccines were added to the VO automatically using the ROBOT template function [14], guided by the VO vaccine design pattern.
Extraction of the ‘adverse reactions’ section using the Llama-3 model
We utilized the Llama-3 model [13] to extract the text of the ‘Adverse Reactions’ section of the FDA package insert documents, which is typically located on the first or second page of each package insert in a downloaded PDF document. Llama-3 is a family of models developed by Meta Inc. and available in both 8B and 70B parameter sizes (pre-trained or instruction-tuned). For this study, we used the 8B model.
The prompt we used for processing data with our Llama-3 model was as follows:
Using this data: {pdf_content}. Respond to this prompt: Extract the text under the section ‘ADVERSE REACTIONS’ without adding any introductory phrases or sentences.
Provide only the adverse reactions listed.
Respond directly. Please do NOT say words like ‘Here is the extracted text:’ and ‘Sure, here is the extracted text under the section ‘ADVERSE REACTIONS’:‘.
Only provide the text under the ‘ADVERSE REACTIONS’ heading.
Respond based solely on the provided data. Do NOT use your own knowledge in the response.
Here, {pdf_content} represents the text content extracted from the first or second page of the PDF using PyPDFLoader from langchain_community.document_loaders (a Python library). We developed a Python program to process individual PDF files automatically.
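The prompt construction can be sketched as follows. This is a minimal illustration rather than the program used in the study: the helper names `load_first_pages` and `build_prompt` are hypothetical, and the call that sends the prompt to Llama-3 is deliberately omitted because the serving setup is not described here.

```python
# Prompt text copied from the Methods; {pdf_content} is the only placeholder.
PROMPT_TEMPLATE = """Using this data: {pdf_content}. Respond to this prompt: Extract the text under the section ‘ADVERSE REACTIONS’ without adding any introductory phrases or sentences.
Provide only the adverse reactions listed.
Respond directly. Please do NOT say words like ‘Here is the extracted text:’ and ‘Sure, here is the extracted text under the section ‘ADVERSE REACTIONS’:‘.
Only provide the text under the ‘ADVERSE REACTIONS’ heading.
Respond based solely on the provided data. Do NOT use your own knowledge in the response."""


def load_first_pages(pdf_path, n_pages=2):
    """Hypothetical helper: return the text of the first n_pages pages of a PDF."""
    # Imported lazily so the rest of the sketch runs without langchain installed.
    from langchain_community.document_loaders import PyPDFLoader

    pages = PyPDFLoader(str(pdf_path)).load()  # one Document per page
    return "\n".join(page.page_content for page in pages[:n_pages])


def build_prompt(pdf_content):
    """Hypothetical helper: fill the template with the extracted page text."""
    return PROMPT_TEMPLATE.format(pdf_content=pdf_content)


example_prompt = build_prompt("6 ADVERSE REACTIONS: fever, rash")
```

The resulting string would then be passed to whichever local Llama-3 runtime is in use.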
Vaccine AEs LLM text embedding
We employed the nomic-embed-text [15] and mxbai-embed-large model [16] for text embedding of the extracted AE descriptions. The nomic-embed-text model, developed by Nomic AI, is an open-source embedding model optimized for large-scale text analysis. It effectively encodes both short and long texts, outperforming OpenAI’s text-embedding-ada-002 and text-embedding-3-small across various NLP tasks. The mxbai-embed-large model by mixedbread.ai is a state-of-the-art embedder that outperforms commercial models like OpenAI’s text-embedding-3-large and rivals models 20 times its size. Notably, it was trained without MTEB data overlap, demonstrating strong generalization across diverse domains, tasks, and text lengths. In our vaccine AE research, each text segment was encoded into vector representations of 768 dimensions using the nomic-embed-text model and 1,024 dimensions using the mxbai-embed-large model. The text embedding vectors were used as features for each vaccine’s adverse reactions. Pearson correlation coefficients were calculated pairwise between vaccines.
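As an illustration of the pairwise comparison, the sketch below computes Pearson correlation coefficients between embedding vectors. The four-dimensional vectors and vaccine names are toy placeholders (real vectors would have 768 or 1,024 dimensions), and the function names are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(embeddings):
    """Return a nested dict of pairwise correlations keyed by vaccine name."""
    names = list(embeddings)
    return {a: {b: pearson(embeddings[a], embeddings[b]) for b in names}
            for a in names}


# Toy embeddings for three hypothetical vaccines.
toy_embeddings = {
    "vaccine_A": [0.1, 0.4, -0.2, 0.3],
    "vaccine_B": [0.2, 0.8, -0.4, 0.6],  # exactly 2 x vaccine_A
    "vaccine_C": [0.3, -0.1, 0.5, -0.2],
}
corr = correlation_matrix(toy_embeddings)
```

Because the Pearson coefficient is invariant to scaling, `vaccine_A` and `vaccine_B` correlate perfectly even though their vector magnitudes differ.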
Mapping vaccine AE text to HPO and OAE ontologies and calculating semantic similarity
We standardized vaccine AE concepts through mapping to HPO and OAE terms using the tool BioPortal Annotator (https://bioportal.bioontology.org/annotator) [17] and an in-house tool OntoRetriever (available at https://github.com/Cetomato/OntoRetriever). Each vaccine was associated with a set of ontology terms. To ensure both completeness and accuracy in the mapping process, we manually reviewed all AE-to-HPO/OAE term mappings, with validation provided by vaccine experts to confirm their correctness and scientific validity.
Lin’s method [18] in the ontologyX packages suite [19] was employed to calculate the similarity of these ontology sets. A key step in this method is to define the Information Content (IC) of each ontology term. The IC of a term is inversely proportional to the number of its child terms; that is, terms with more children have lower IC. The semantic similarity between two terms is calculated based on the IC of their shared ancestor terms.
The frequency of a term t is defined as:

$$\mathrm{freq}(t) = \frac{n_t}{N}$$

where $n_t$ is the count of terms t’ that include t and its children, and N represents the total number of terms within the ontology corpus.

Thus, the information content is defined as:

$$\mathrm{IC}(t) = -\log\big(\mathrm{freq}(t)\big)$$

In an ontology, two terms can share parents. IC-based methods calculate term similarity by determining the information content of their most specific common ancestor, often termed the Most Informative Common Ancestor (MICA). The equation for calculating the term similarity is defined as:

$$\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2 \times \mathrm{IC}\big(\mathrm{MICA}(t_1, t_2)\big)}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)}$$
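To make the calculation concrete, the following sketch applies these definitions to a small, hypothetical ontology: the frequency of a term is the number of terms it covers (itself plus its descendants) divided by N, the IC is the negative logarithm of that frequency, and Lin's similarity is twice the IC of the MICA divided by the sum of the two term ICs. The term names and hierarchy are invented for illustration and are not taken from HPO or OAE; the study itself used the ontologyX R packages.

```python
import math

# Hypothetical toy ontology (term -> list of parents); not real HPO/OAE content.
PARENTS = {
    "adverse event": [],
    "local AE": ["adverse event"],
    "systemic AE": ["adverse event"],
    "injection-site pain": ["local AE"],
    "injection-site swelling": ["local AE"],
    "fever": ["systemic AE"],
    "headache": ["systemic AE"],
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


N = len(PARENTS)
# n_t: number of terms that are t itself or one of its descendants.
COUNT = {t: sum(1 for u in PARENTS if t in ancestors(u)) for t in PARENTS}
IC = {t: -math.log(COUNT[t] / N) for t in PARENTS}


def lin_similarity(t1, t2):
    """Lin similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    common = ancestors(t1) & ancestors(t2)
    mica_ic = max(IC[t] for t in common)  # IC of the most informative common ancestor
    denom = IC[t1] + IC[t2]
    return 2 * mica_ic / denom if denom > 0 else 1.0


sim_local = lin_similarity("injection-site pain", "injection-site swelling")
sim_cross = lin_similarity("injection-site pain", "fever")
```

In this toy hierarchy the two injection-site terms share the informative ancestor "local AE" and obtain a positive similarity, whereas "injection-site pain" and "fever" share only the root, whose IC is zero.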
Vaccine clustering based on adverse events
Vaccines were first clustered based on LLM text embeddings. We calculated pairwise Pearson correlation coefficients for vaccines based on text embedding vectors. The results of the coefficients were further used to perform hierarchical clustering of the vaccines using Euclidean distance and complete linkage. All the vaccines from our collection were used for the clustering based on the LLM text embeddings and were further annotated with vaccine attributes from VO.
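A minimal pure-Python sketch of this clustering step is given below. It assumes that each row of the correlation matrix serves as the feature vector of a vaccine when Euclidean distances are computed; that reading, the toy matrix, and the function names are assumptions, and the study itself performed this step in R.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(rows, n_clusters):
    """Agglomerative clustering with complete linkage on a list of feature vectors."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(euclidean(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters


# Toy correlation matrix for four hypothetical vaccines.
toy_corr = [
    [1.00, 0.90, 0.10, 0.05],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.85],
    [0.05, 0.10, 0.85, 1.00],
]
groups = complete_linkage(toy_corr, n_clusters=2)
```

With this toy input, vaccines 0 and 1 end up in one cluster and vaccines 2 and 3 in the other, mirroring the two correlated blocks in the matrix.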
To identify the optimal number of clusters, we utilized the silhouette method [20]. Specifically, we computed the average silhouette width for cluster solutions ranging from 2 to 15 clusters. The silhouette width measures how similar an object is to its own cluster compared to other clusters, with higher values indicating better-defined clusters. For the selected cluster number, we calculated internal validation metrics to assess clustering quality using the Silhouette Score and Dunn Index.
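Both validation metrics can be illustrated with a short sketch. The definitions used here are the standard ones (silhouette width (b - a) / max(a, b) per sample; Dunn index as the smallest between-cluster distance divided by the largest within-cluster distance), which may differ in detail from the R implementation used in the study. The points and labels are toy values.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_silhouette(points, labels):
    """Average silhouette width over all samples."""
    n = len(points)
    total = 0.0
    for i in range(n):
        dists = {}  # cluster label -> distances from point i to members of that cluster
        for j in range(n):
            if j != i:
                dists.setdefault(labels[j], []).append(euclidean(points[i], points[j]))
        own = dists.pop(labels[i], [])
        if not own or not dists:
            continue  # singleton cluster or single-cluster solution: width taken as 0
        a = sum(own) / len(own)                            # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())   # mean distance to nearest other cluster
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    pairs = [(i, j) for i in range(len(points)) for j in range(i + 1, len(points))]
    between = min(euclidean(points[i], points[j]) for i, j in pairs if labels[i] != labels[j])
    within = max(euclidean(points[i], points[j]) for i, j in pairs if labels[i] == labels[j])
    return between / within


toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels follow the two visible groups
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels cut across the groups
dunn_good = dunn_index(toy_points, [0, 0, 1, 1])
```

Selecting the number of clusters then amounts to repeating the clustering for each candidate count (2 to 15 in the study) and keeping the count with the largest average silhouette width.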
Furthermore, we conducted comparative clustering analyses using both the above text embedding-based and ontology-based semantic similarity analyses. For such a comparison, only the data of the set of influenza vaccines were used. For the ontology-based semantic similarity analysis approach, we first calculated a semantic similarity matrix of vaccines using Lin’s Method [18]. Using the results from either of the two methods, we further performed hierarchical clustering of the vaccines using Euclidean distance and complete linkage.
The resulting clusters of the above clustering analyses were visualized as heatmaps generated using the ComplexHeatmap [21] package in R [22]. From the heatmaps, we were able to determine whether different classifications are associated with specific characteristics of the vaccines.
Building and evaluating a machine learning model to predict if a vaccine is live
We attempted to classify vaccines as ‘Live’ or ‘Non-Live’ using a machine learning approach. LASSO logistic regression and Support Vector Machine (SVM) with a linear kernel were employed for this task, utilizing the glmnet [23] and e1071 [24] R packages, respectively. We first standardized both the training and test datasets through mean centering and scaling to ensure that all features were comparable. To address potential class imbalance in the binary classification task, we computed class weights, assigning higher weights to the minority class.
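The preprocessing can be sketched as follows. Two details are assumptions, since the text does not specify them: the test set is scaled with the mean and standard deviation estimated on the training set, and class weights follow the common "balanced" formula n / (k × n_c). The label counts reuse the Live/Non-Live totals reported in Table 1.

```python
import math


def fit_scaler(train_rows):
    """Estimate per-feature mean and sample standard deviation on the training set."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(col) / n for col in columns]
    # Sample standard deviation (n - 1); constant features fall back to 1.0.
    sds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1)) or 1.0
           for col, m in zip(columns, means)]
    return means, sds


def transform(rows, means, sds):
    """Center and scale rows with previously estimated statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def class_weights(labels):
    """Assumed 'balanced' weighting: n / (number of classes * class count)."""
    classes = sorted(set(labels))
    return {c: len(labels) / (len(classes) * labels.count(c)) for c in classes}


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # toy feature rows
means, sds = fit_scaler(train)
scaled = transform(train, means, sds)

labels = ["Live"] * 28 + ["Non-Live"] * 83       # counts from Table 1
weights = class_weights(labels)
```

Under this weighting the minority "Live" class receives roughly three times the weight of the "Non-Live" class.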
Next, we applied LASSO regression [25] for feature selection, determining the optimal hyperparameter lambda.1se (the largest λ whose cross-validated error is within one standard error of the minimum) using 10-fold cross-validation. This process resulted in the development of a LASSO logistic regression classification model. Based on the selected features, we also trained an SVM model. To optimize the performance of the SVM model, we tuned the hyperparameter C (cost) using 10-fold cross-validation and selected the optimal value. Through these procedures, we aimed to enhance both the predictive accuracy and generalization capability of the models.
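The one-standard-error rule behind this choice of penalty can be illustrated in a few lines. The sketch assumes that cross-validation has already produced a mean error and a standard error for each candidate λ; the numbers below are hypothetical and are not results from this study.

```python
def one_standard_error_rule(lambdas, cv_mean, cv_se):
    """Return (lambda at minimum CV error, largest lambda within one SE of that minimum)."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[i_min], max(eligible)


# Hypothetical cross-validation summary for five candidate penalties.
lambdas = [0.001, 0.01, 0.05, 0.10, 0.20]
cv_mean = [0.62, 0.55, 0.50, 0.53, 0.70]
cv_se = [0.04, 0.04, 0.04, 0.04, 0.05]

lam_min, lam_1se = one_standard_error_rule(lambdas, cv_mean, cv_se)
```

Choosing the larger penalty trades a small increase in cross-validated error for a sparser model, which matters when the number of embedding dimensions exceeds the number of vaccines.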
To evaluate and visualize model performance, we first divided the dataset into a training set (70%) and a test set (30%). The models’ performance was assessed using the Receiver Operating Characteristic (ROC) curve to evaluate classification accuracy and discriminative ability. The pROC package [26] was used to generate ROC curves for the training, testing, and full datasets.
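A small sketch of the split and of the area under the ROC curve follows. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as one half), which gives the same value as integrating the empirical ROC curve. Whether the study's split was stratified is not stated, so the unstratified shuffle, the seed, and the toy scores here are assumptions.

```python
import random


def train_test_split(n_items, train_fraction=0.7, seed=0):
    """Shuffle indices and split them into training and test index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * n_items)
    return indices[:cut], indices[cut:]


def auc(labels, scores):
    """Area under the ROC curve; labels are 1 (positive) or 0 (negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


train_idx, test_idx = train_test_split(111)

# Toy predicted probabilities for three positives and four negatives.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

In the toy example, 11 of the 12 positive-negative pairs are ordered correctly, giving an AUC of 11/12.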
To comprehensively assess the optimized models’ robustness, a final 10-fold cross-validation evaluation [27] was conducted on the entire dataset. Key performance metrics, including accuracy, sensitivity, specificity, and Area Under the Curve (AUC), were calculated. Lastly, we compared the performance differences between the two classification models.
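The fold construction and the reported metrics can be sketched as below. Treating "Live" as the positive class is an assumption made for this illustration, as are the fold assignment scheme and the toy predictions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def confusion_metrics(y_true, y_pred, positive="Live"):
    """Sensitivity, specificity, and accuracy for a binary classification."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),  # share of live vaccines correctly identified
        "specificity": tn / (tn + fp),  # share of non-live vaccines correctly identified
        "accuracy": (tp + tn) / len(pairs),
    }


folds = k_fold_indices(111, k=10)

# Toy predictions: 4 of 5 live and 8 of 10 non-live vaccines classified correctly.
y_true = ["Live"] * 5 + ["Non-Live"] * 10
y_pred = ["Live"] * 4 + ["Non-Live"] + ["Non-Live"] * 8 + ["Live"] * 2
metrics = confusion_metrics(y_true, y_pred)
```

In a full evaluation, each fold would serve once as the held-out set, and the metrics would be averaged over the ten folds.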
We employed the shapr package [28] in R to perform SHapley Additive exPlanations (SHAP) analysis on the Lasso logistic regression classification model. SHAP values were calculated for each feature, and the features were subsequently ranked in descending order based on the absolute value of their SHAP values.
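For intuition, the sketch below uses a simplification that is not the shapr procedure: for a linear model whose features are treated as independent, the SHAP value of feature j for a sample x reduces to w_j (x_j - mean(x_j)) on the log-odds scale. The shapr package estimates conditional expectations and can account for feature dependence, so its values may differ. Coefficients and data here are toy values.

```python
def linear_shap(coefs, x, feature_means):
    """SHAP values of a linear model under the feature-independence assumption."""
    return [w * (xi - m) for w, xi, m in zip(coefs, x, feature_means)]


def rank_features(coefs, rows):
    """Rank feature indices by mean absolute SHAP value, largest first."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    mean_abs = [0.0] * len(coefs)
    for row in rows:
        for j, phi in enumerate(linear_shap(coefs, row, means)):
            mean_abs[j] += abs(phi) / n
    order = sorted(range(len(coefs)), key=lambda j: mean_abs[j], reverse=True)
    return order, mean_abs


# Toy model: feature 0 was dropped by LASSO (coefficient 0).
toy_coefs = [0.0, 2.0, -0.5]
toy_rows = [[1.0, 0.0, 4.0], [2.0, 1.0, 0.0], [3.0, 2.0, 2.0]]
order, mean_abs = rank_features(toy_coefs, toy_rows)
```

Features removed by LASSO have zero coefficients and therefore zero SHAP values, so only the selected embedding dimensions appear at the top of the ranking.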
Results
Project design and workflow
The overall project workflow is shown in Fig. 1. Specifically, we first collected all available package inserts for FDA-licensed vaccines. Then, we used the LLM Llama-3 to extract text information related to adverse events. Following this, we applied LLM text embedding to convert the adverse event text into high-dimensional vectors. Based on these feature vectors, we performed cluster analysis on all vaccines. Furthermore, a machine learning classification model was built on the LLM embedding vectors to classify vaccines as live or non-live, and its performance was evaluated using 10-fold cross-validation. To compare the differences between LLM-based and ontology-based clustering results, we conducted an in-depth study of influenza vaccines. The adverse event text of these vaccines was mapped to the HPO and OAE ontologies, and semantic similarity was used to cluster the vaccines. The results were compared with those obtained from the LLM method on the same subset of influenza vaccines.
Fig. 1.
Overall project workflow. Blue, green, and orange colors present AE text extraction, ontology-based clustering, and text embedding-based clustering and classification, respectively
Package inserts of 111 licensed vaccines collected
A total of 111 licensed vaccines from the FDA website were collected. Most of these vaccines have unique trade names. These vaccines were used for further vaccine AE data processing and analysis.
We standardized these vaccines by mapping them to VO. Out of the 111 FDA-approved vaccines we analyzed, 69 were successfully identified within VO. The remaining 42 vaccines were incorporated into VO following the VO vaccine design pattern (Fig. 2). The ‘vaccine component’ for vaccines includes two parts: ‘vaccine adjuvant’ and ‘pathogen organism component in vaccine’, in which the organism quality indicates whether the organism is live attenuated or inactivated. The vaccine roles, such as ‘recombinant vector vaccine role’ and ‘RNA vaccine role’ are used to represent different FDA vaccine platforms. A one-to-one mapping of all 111 downloaded vaccines to the VO is available at: https://github.com/vaccineontology/VO/blob/master/src/templates/vo_FDA.csv.
Fig. 2.
VO vaccine design pattern for FDA-approved vaccines. The pattern shows the relationships between FDA vaccines and their associated components and other properties. Blue-edged boxes represent material entities, and black-edged boxes represent quality or role entities
The VO vaccine attributes include FDA indication, whether the vaccine is recombinant, live, mRNA-based, monovalent, or multivalent, and whether an adjuvant was used. The vaccine design pattern allows us to systematically extract these key attributes. Table 1 provides basic statistics of the vaccines based on the attributes annotated in VO.
Table 1.
Vaccine attribute statistics
Attribute | Overall (N = 111) |
---|---|
Type | |
COVID-19 Vaccine, mRNA | 3 (2.7%) |
Diphtheria & Tetanus | 13 (11.7%) |
Hepatitis vaccine | 7 (6.3%) |
Influenza Vaccine | 31 (27.9%) |
Measles, Mumps and Rubella | 8 (7.2%) |
Meningococcal Vaccine | 8 (7.2%) |
Other | 24 (21.6%) |
Papillomavirus | 3 (2.7%) |
Pneumococcal Vaccine | 4 (3.6%) |
Respiratory Syncytial Virus Vaccine | 4 (3.6%) |
Rotavirus Vaccine | 3 (2.7%) |
Zoster Vaccine | 3 (2.7%) |
Recombinant | |
No | 101 (91.0%) |
Yes | 10 (9.0%) |
Live | |
No | 83 (74.8%) |
Yes | 28 (25.2%) |
mRNA | |
No | 108 (97.3%) |
Yes | 3 (2.7%) |
Valent | |
Monovalent | 79 (71.2%) |
Multivalent | 32 (28.8%) |
Adjuvant | |
No | 104 (93.7%) |
Yes | 7 (6.3%) |
Approval Year | |
~1990 | 9 (8.1%) |
1991–2000 | 6 (5.4%) |
2000–2010 | 40 (36.0%) |
2011–2024 | 48 (43.2%) |
NA | 8 (7.2%) |
Efficient vaccine AE extraction using Llama-3
In 87 out of 111 package inserts, Llama-3 accurately extracted the corresponding vaccine AE text. For results initially identified as incorrect, re-extraction and manual review often led to accurate outcomes due to the inherent randomness of large language models. Although the process involved manual review, the overall workload of literature information mining was significantly reduced due to the usage of LLM. The vaccine AE extraction results are available at: https://github.com/vaccineontology/VO-LLM/blob/main/results/vaccine_anno_df_all_in_one.xlsx (Column Adverse.reactions).
Vaccine clustering based on LLM text embedding feature
In our study, we utilized the nomic-embed-text [15] and mxbai-embed-large [16] models to generate vector embeddings for the adverse reactions of 111 vaccines. Specifically, the nomic-embed-text model produced 768-dimensional vectors, while the mxbai-embed-large model generated 1,024-dimensional vectors. These embeddings served as feature representations for each vaccine’s adverse reactions in our analysis. These vaccines were subsequently clustered using Euclidean distance and complete linkage. Notably, many of the resulting clusters exhibit a strong correlation with vaccine attributes, such as indication and whether the vaccine is live or non-live. Figure 3 presents the clustering results based on the nomic-embed-text model. A clear “Influenza Vaccine” cluster, encompassing the majority of influenza vaccines, highlights the strong correlation between clustering and the “Vaccine” attribute. Furthermore, live vaccines were clustered into smaller, more homogeneous groups of 2–5 vaccines, forming “blocks” within the “Live” attribute. By analyzing the effect of the approval year on vaccine clusters, we did not find clear evidence of significant temporal shifts in the clustering patterns (Fig. 3). Supplementary Figure S1 presents the clustering results of vaccines based on the mxbai-embed-large model, which exhibits characteristics consistent with Fig. 3.
Fig. 3.
Vaccine clustering based on Adverse Event embedding profiles and attribute correlations. The AE embedding profiles from the nomic-embed-text model were used for vaccine clustering. The heatmap (left) shows the AE-based correlation matrix, with dendrograms indicating hierarchical clustering. Colored bars (right) annotate key vaccine attributes, including indication, recombinant status, live virus status, mRNA technology, valency, adjuvant presence, and approval year
The clustering analysis identified an optimal seven-cluster solution using the silhouette method, which maximized the average silhouette width (Fig. 4). The mean silhouette coefficient for this solution was 0.14, indicating that the clusters were moderately well-defined. Additionally, the computed Dunn Index was 0.219, suggesting a reasonable degree of separation between clusters while maintaining compactness within each cluster.
Fig. 4.
Optimal clustering analysis of vaccine AE profiles. (A) Silhouette analysis for determining the optimal number of clusters. The x-axis shows the number of clusters, and the y-axis represents the average silhouette width. The peak at 7 clusters indicates the optimal clustering solution. (B) Silhouette plot for the 7-cluster solution (n = 111). Each horizontal segment represents a sample, colored by its assigned cluster, with segment length indicating the silhouette width of the sample
Comparison of ontology-based and LLM embedding approaches
We developed OntoRetriever (available at https://github.com/Cetomato/OntoRetriever), a tool that can be adapted to extract ontology terms from text when provided with the OBO file of other ontologies. Currently, it supports the HPO and OAE ontologies. The map_text_to_hpo function within this tool offers this capability. The “Distance” column can be understood as the confidence score of the extracted ontology terms, allowing for a manual review of mappings with low confidence.
Table 2 presents partial results of adverse event extraction for influenza vaccines and their mapping to OAE using OntoRetriever. AE_Word represents adverse event-related words that are found in the text, while OAE_ID and OAE_Term correspond to the mapped ontology ID and term. Distance indicates the semantic distance between the adverse event word and the ontology term, with 0 denoting an exact match and larger values denoting weaker matches. The majority of terms were exact matches, allowing experts to focus only on those with higher distances, thereby reducing the need for manual intervention. The table illustrates representative cases of exact matches, moderate matches, and poor matches.
Table 2.
Extraction of adverse event terms from text and their mapping to OAE by OntoRetriever
AE_Word | OAE_ID | OAE_Term | Distance |
---|---|---|---|
Swelling | OAE:0009307 | swelling AE | 0 |
Irritability | OAE:0001105 | irritability AE | 0 |
Headache | OAE:0000377 | headache AE | 0 |
… | … | … | … |
Drowsiness | OAE:0001550 | sleepiness AE | 44.41 |
Pain or tenderness in the abdomen | OAE:0004591 | abdominal tenderness AE | 51.23 |
Muscle aches/arthralgia | OAE:0000383 | muscle ache AE | 62 |
… | … | … | … |
Loss of appetite | OAE:0007547 | lack of satiety AE | 98.35 |
Runny nose | OAE:0000335 | nasal congestion AE | 116.18 |
Serum creatine kinase | OAE:0000529 | blood creatine phosphokinase level increased AE | 146.09 |
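The distance-based triage described for Table 2 can be sketched as a simple filter. The rows are copied from Table 2; the review threshold of 50 is a hypothetical cut-off chosen only for illustration, and the function name is not part of OntoRetriever.

```python
# Rows taken from Table 2: (AE_Word, OAE_ID, OAE_Term, Distance).
mappings = [
    ("Swelling", "OAE:0009307", "swelling AE", 0.0),
    ("Headache", "OAE:0000377", "headache AE", 0.0),
    ("Drowsiness", "OAE:0001550", "sleepiness AE", 44.41),
    ("Loss of appetite", "OAE:0007547", "lack of satiety AE", 98.35),
    ("Runny nose", "OAE:0000335", "nasal congestion AE", 116.18),
]

REVIEW_THRESHOLD = 50.0  # hypothetical cut-off, not a value from the study


def needs_review(rows, threshold=REVIEW_THRESHOLD):
    """Return mappings whose distance exceeds the threshold, worst first."""
    flagged = [row for row in rows if row[3] > threshold]
    return sorted(flagged, key=lambda row: row[3], reverse=True)


flagged_words = [row[0] for row in needs_review(mappings)]
```

Exact matches (distance 0) pass without review, while weaker matches such as "Runny nose" mapped to "nasal congestion AE" are queued for expert inspection.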
To conduct an in-depth comparison of clustering patterns for influenza vaccines based on LLM text embeddings and ontology semantics, we selected a set of 15 influenza vaccines, each of which has its own vaccine brand name and associated package insert document. Figure 5A illustrates the clustering results obtained using the LLM nomic-embed-text embedding method. In contrast, the heatmap generated using ontology-based semantic similarity analysis showed higher similarity scores, as evidenced by the prevalence of red colors (Fig. 5B). Interestingly, the LLM method identified a more differential AE profile among the vaccines (Fig. 5A and B). Nevertheless, the majority of clustering patterns are similar between the two methods. When the clustering tree is cut at the line shown in Fig. 5, the vaccines generally fall into three distinct groups. Notably, the FluMist Live Attenuated vaccine stands out as being markedly different from the rest. Supplementary Figure S3 displays the similarity of influenza vaccines’ adverse reactions based on the OAE ontology.
Fig. 5.
Comparative heatmap analysis of influenza vaccine clustering using LLM and Ontology-based semantic similarity. (A) The clustering of 15 influenza vaccines based on their AE profiles using LLM nomic-embed-text embedding. (B) Clustering of the same 15 influenza vaccines based on ontology-based semantic similarity analysis. Colored boxes highlight the main clusters identified in both analyses. The FluMist Live Attenuated vaccine, circled in red, is consistently separated in both methods, indicating a distinct AE profile. The red dashed line in both (A) and (B) highlights the cut tree line
Vaccine classification based on ontology and LLM text embedding features
As shown in Fig. 3, our clustering analysis based on the LLM embeddings of vaccine AEs grouped vaccines according to different features, such as whether a vaccine is live or non-live. Therefore, we hypothesized that the vaccine AE records contain hidden or latent information about vaccine attributes. To test this hypothesis, we focused on classifying vaccines as live or non-live using LLM embeddings.
The VO classifies various vaccines based on different features such as ‘live’ or ‘non-live’. For example, the following semantic axiom is typically used to define whether a vaccine is a live attenuated vaccine:
‘has quality’ some ‘vaccine organism live attenuated’.
In this study, we classified vaccines as ‘live’ or ‘non-live’ using LLM text embedding features. A LASSO logistic regression model was developed, and ROC curves were generated for the training, test, and complete datasets. Feature selection using LASSO is shown in Fig. 6A. These selected features were important for the model to overcome overfitting on the training set and ensure the model’s generalizability, though they do not hold biomedical significance. The corresponding AUC values based on the nomic-embed-text embedding features were 0.93, 0.85, and 0.91, respectively (see Fig. 6B). When comparing performance with the mxbai-embed-large model, we found that its results were poorer, possibly due to the longer length of its embedding vectors. The corresponding AUC values for this model were 0.83, 0.78, and 0.82, respectively (see Supplementary Figure S2).
Fig. 6.
Performance of LASSO logistic regression model for classifying vaccines as ‘live’ or ‘non-live’ using LLM text embedding features. (A) Feature importance plot showing the coefficients of selected features from LASSO regression. (B) ROC curves for the training, test, and full datasets using nomic-embed-text embeddings
To assess the robustness of the LASSO model’s classification performance, we employed 10-fold cross-validation on the entire dataset. The mean sensitivity, specificity, and overall accuracy of the model were found to be 80.00%, 83.06%, and 81.89%, respectively. The SVM classifier built based on the selected features performed worse, likely due to the high complexity of the model.
We also applied SHAP analysis to identify key features in model predictions. The top four LASSO-selected features are shown in Supplementary Figure S4. Although SHAP effectively highlighted important features, our 768-dimensional vaccine AE representation lacked biomedically interpretable features, making it difficult to associate SHAP values directly with specific AEs or AE clusters.
Discussion
This study made three primary contributions. Firstly, we leveraged LLMs to efficiently process and extract itemized adverse events from FDA-approved vaccine package inserts with high accuracy. Secondly, we compared two methods, namely an ontology-based semantic similarity analysis and an LLM text embedding method, for analyzing and classifying the relationships between these vaccines based on their adverse event profiles. Lastly, we explored the potential of LLM text embeddings to classify vaccines based on vaccine attributes, such as live or non-live. Our findings demonstrate that LLM text embeddings can quickly analyze biomedical descriptive texts, bypassing complex ontology mapping steps. The embedding-based methods can also complement ontology-based clustering and classification. This approach helps uncover potential vaccine classifications, encapsulates latent information within the text [29], and may serve as a valuable tool for medical knowledge discovery.
In our study, the selection of Llama-3, nomic-embed-text, and mxbai-embed-large embeddings was primarily driven by the need to address privacy concerns regarding biomedical data and the requirement for local deployment. These factors were essential to ensure the protection of sensitive medical information and maintain full control over the research process. While GPT-based embeddings are powerful, they often require cloud-based operation, which raises privacy concerns, especially when handling sensitive medical data.
Traditionally, ontology provides an important knowledge representation platform that classifies the knowledge of entities in a specific domain and the relations among these entities. For example, the VO classifies vaccines and their attributes, such as vaccine type and whether a vaccine is live or non-live. Furthermore, ontology can be used to support clustering analysis based on ontology semantic similarity. However, the development of ontology is time-consuming, and the information recorded in ontology may be incomplete. In this study, we used the FDA vaccine package insert documents to build LLM embeddings and further used them to perform vaccine clustering analysis. Our results showed that the vaccine AE text, when recorded in embeddings, contains global information that can be used to support machine learning studies, such as vaccine clustering. We did not use metadata such as the manufacturing methods and storage conditions, which could introduce new information and potentially affect the clustering patterns of vaccines. These embedding-based clustering methods even show some distinct advantages compared with the classical ontology semantic similarity method. Furthermore, our study demonstrated that the LLM embedding-based classification method was able to classify vaccine attributes, such as whether a vaccine is live or non-live. Therefore, such an LLM embedding-based method has the potential to automatically extract useful vaccine attributes.
In mapping the adverse event description texts to ontology terms, we took several measures to ensure the quality of the mappings. First, all mappings of vaccine adverse event terms to ontology terms were manually reviewed for accuracy. Experts in the field of vaccines also participated in the review process to guarantee both the accuracy and professional integrity of the mappings.
To further enhance the accuracy and completeness of these mappings in future work, we developed an advanced mapping tool, OntoRetriever. This tool leverages large language models and text embeddings for mapping, uses text-embedding vector similarity to evaluate the results of multiple mappings, and provides a final score for each mapping. Our testing with HPO and OAE validated the accuracy of the tool. The tool takes the OBO format as input, and tools like ROBOT [14] can be used to convert OWL format files to OBO format. Therefore, OntoRetriever can be adopted for use with other ontologies.
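The general idea of scoring candidate mappings by embedding similarity can be sketched as follows. This is an illustrative sketch, not OntoRetriever's actual code: the function names, the toy vectors, and the candidate term labels are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_candidates(phrase_vec, candidate_vecs):
    """Return (term, score) pairs sorted from best to worst match."""
    scored = [(term, cosine_similarity(phrase_vec, vec))
              for term, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for the embedding of an AE phrase and of two candidate terms.
phrase_vec = [0.7, 0.7, 0.1]
candidates = {
    "candidate_term_1": [0.6, 0.8, 0.0],
    "candidate_term_2": [0.0, 0.1, 0.9],
}
ranking = score_candidates(phrase_vec, candidates)
best_term, best_score = ranking[0]
```

Keeping the full ranking rather than only the top hit makes it possible to flag mappings whose best score is low or whose top two candidates are nearly tied, which are natural targets for manual review.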
While our study did not find missing mappings, we would like to note that possible missing or misaligned mappings could influence the results of our semantic similarity calculations. Specifically, if a vaccine adverse event term is not mapped to an ontology term or is incorrectly mapped, this could lead to false negatives, where two similar adverse events are mistakenly classified as dissimilar. On the other hand, if unrelated adverse event terms are mapped to the same ontology term or closely related events are mapped to different terms, it could result in false positives, causing the model to incorrectly group dissimilar events as similar.
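To make the effect of a misaligned mapping concrete, the sketch below computes Lin's similarity, 2 × IC(MICA) / (IC(t1) + IC(t2)), on a toy hierarchy. The term names and probabilities are hypothetical and are not taken from HPO or OAE. Information content is taken as IC(t) = log(1 / p(t)).

```python
import math

# Toy is-a hierarchy (child -> parents); term names are hypothetical.
PARENTS = {
    "root": [],
    "skin_reaction": ["root"],
    "neuro_reaction": ["root"],
    "erythema": ["skin_reaction"],
    "swelling": ["skin_reaction"],
    "seizure": ["neuro_reaction"],
}
# Toy annotation probabilities p(term); the root covers everything, so p = 1.
PROB = {"root": 1.0, "skin_reaction": 0.5, "neuro_reaction": 0.5,
        "erythema": 0.25, "swelling": 0.25, "seizure": 0.5}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC(t) = log(1 / p(t)); rarer terms carry more information."""
    return math.log(1.0 / PROB[term])

def lin_similarity(t1, t2):
    """Lin similarity using the most informative common ancestor (MICA)."""
    common = ancestors(t1) & ancestors(t2)
    ic_mica = max(information_content(t) for t in common)
    denom = information_content(t1) + information_content(t2)
    return 2 * ic_mica / denom if denom > 0 else 1.0

same_branch = lin_similarity("erythema", "swelling")
cross_branch = lin_similarity("erythema", "seizure")
```

In this toy setting, two terms under the same parent share an informative ancestor, while two terms in different branches share only the root, whose information content is zero. An AE wrongly mapped into another branch would therefore look unrelated to its true neighbors, which is the false-negative case described above.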
One surprising result of our study is the discovery of latent information about vaccine AEs through LLM text embeddings. Specifically, the analysis revealed a potential association between vaccine AEs and whether the vaccine is live or inactivated, even though this relationship was not explicitly stated in the AE text. By leveraging the features extracted from text embeddings, we successfully built a classification model to distinguish live vaccines from non-live ones, demonstrating strong predictive performance. This finding not only validates the effectiveness of text embeddings in capturing hidden patterns but also suggests that this approach could be a powerful tool for uncovering latent medical knowledge.
Our LLM-based text embedding method has shown its capability to efficiently and accurately identify new patterns from vaccine package insert documents. The potential applications of LLM-based text embeddings extend beyond vaccine safety and may provide new insights across various healthcare research areas, including drug discovery, disease classification, and personalized medicine. However, an important challenge remains: evaluating the scalability and generalizability of our methodology across larger datasets or different vaccine types. Note that our systematic study used the package inserts of all 111 available FDA-licensed vaccines, which form the largest vaccine package insert document dataset in English. It is possible to use the same method to analyze other vaccine or drug package inserts. However, adverse event descriptions in FDA-approved package inserts follow specific styles, while other package inserts may present the information differently. This variability in presentation style could lead to inconsistencies in text vectorization, making it challenging to compare adverse event profiles across different datasets. Meanwhile, for data from different sources, clustering based on embedding features may reveal entirely novel patterns, indicating the need for further exploration. In the future, we plan to apply our methods to more datasets and compare results from datasets with different data formats.
Our study demonstrated high performance in using LLMs to automatically process and extract itemized adverse events from vaccine package insert text. For this LLM study, we developed Python code that automatically processes individual package insert documents. Over 80% of the extracted information was accurate, and re-running the extraction process could complement the results, enhancing overall accuracy. Moreover, only minimal manual intervention was required throughout the entire LLM-based vaccine AE extraction process, a significant advantage over traditional manual methods. In the future, to further reduce manual intervention, we may introduce multiple language models, each generating its output independently, with the outputs then aggregated through a final decision-making model.
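One simple way to combine repeated extraction runs, or the outputs of several models, is majority voting over normalized AE terms. The sketch below is an assumption about how such aggregation could work, not the pipeline used in this study. The function name, the vote threshold, and the example terms are illustrative.

```python
from collections import Counter

def aggregate_runs(runs, min_votes):
    """Keep AE terms reported by at least min_votes runs.

    runs: list of lists of extracted AE strings, one list per run or model.
    """
    votes = Counter()
    for run in runs:
        # Normalize case and whitespace; count each term once per run.
        votes.update({term.strip().lower() for term in run})
    return sorted(term for term, n in votes.items() if n >= min_votes)

# Illustrative outputs from three hypothetical extraction runs.
runs = [
    ["Fever", "Injection site pain", "Headache"],
    ["fever", "injection site pain"],
    ["Fever ", "Rash"],
]
consensus = aggregate_runs(runs, min_votes=2)
```

Setting `min_votes=1` instead yields the union of all runs, which matches the idea of re-running the extraction to complement earlier results. A higher threshold trades recall for precision.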
Our comparative analysis shows that the LLM method outperformed ontology-based approaches in identifying differential AE profiles among vaccines. The higher similarity scores from the ontology-based method were likely due to its focus on annotation terms within the hierarchical structure. In contrast, LLM embeddings capture richer, more nuanced representations of text, understanding complex semantic information, context, and relationships [30], which allows them to better handle intricate medical terminology and contextual nuances in vaccine adverse event descriptions.
LLM embeddings excel in adapting to complex and evolving contexts, handling terms that may not fit rigid ontologies. Their ability to generalize across various contexts enables them to effectively address emerging adverse events or new expressions, something ontology-based methods struggle with due to their fixed structures. Overall, LLM embeddings provide a dynamic and context-aware approach, enabling more sensitive identification of differential AE profiles and allowing for faster and more efficient analysis of biomedical texts, such as adverse events. They can serve as a valuable complement to ontology-based methods.
Using the LLM embeddings, we were able to discover many latent adverse event (AE) patterns. For example, our approach achieved an accuracy of over 81% in classifying live and non-live vaccines. By examining the AE profiles of live and non-live vaccines, we found that live vaccines tend to induce systemic, respiratory, and inflammatory responses such as fever and fatigue. Our analysis found that the live attenuated influenza vaccine FluMist Quadrivalent is associated with asthma and recurrent wheezing, a phenomenon not observed in other influenza vaccines. This finding fits the general adverse event pattern of live vaccines. On the other hand, non-live vaccines are commonly associated with local skin and muscular symptoms and neurological reactions, such as swelling, redness, local injection site pain, and movement disorders. These profiles are consistent with previous studies, such as an analysis of AEs reported in VAERS [6].
By examining the adverse event profiles of the misclassified vaccines, we found that misclassified live vaccines tend to have AE profiles typical of non-live vaccines, and vice versa. For example, PRIORIX (Measles, Mumps, and Rubella Vaccine, Live) and IXCHIQ (Chikungunya Vaccine, Live), both live vaccines, were incorrectly classified as non-live, likely due to their local reactions such as pain and redness. Conversely, non-live vaccines such as SPIKEVAX, Gardasil, Typhim Vi, and H1N1 2009 Monovalent were incorrectly classified as live, likely due to their association with systemic reactions such as fever and fatigue, which are usually seen with live vaccines.
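The two kinds of misclassification described above correspond to false negatives and false positives if "Live" is treated as the positive class, which is an assumption made here for illustration. The sketch below shows how sensitivity, specificity, and accuracy follow from predicted and true labels. The labels are toy values, not the study's predictions.

```python
def classification_metrics(y_true, y_pred, positive="Live"):
    """Compute sensitivity, specificity, and accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),   # share of live vaccines found
        "specificity": tn / (tn + fp),   # share of non-live vaccines found
        "accuracy": (tp + tn) / len(pairs),
    }

# Toy labels only: one live vaccine missed, one non-live vaccine flagged as live.
y_true = ["Live", "Live", "Non-Live", "Non-Live", "Non-Live"]
y_pred = ["Live", "Non-Live", "Non-Live", "Non-Live", "Live"]
metrics = classification_metrics(y_true, y_pred)
```

In a cross-validation setting, these quantities would be computed on the held-out predictions of each fold and then pooled or averaged.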
Our study acknowledges certain biases in the dataset, which may affect the generalizability and interpretation of the results. Two factors contributing to bias are the geographic scope and the vaccine type composition of the dataset. Since the dataset consists of FDA-approved vaccines, vaccine package inserts from other countries may contain different components or follow distinct regulatory guidelines. As a result, our conclusions may not be directly applicable to vaccines used in other parts of the world. This bias may limit the external validity of our study, highlighting the need for caution when generalizing the findings beyond the U.S. context. Additionally, inactivated vaccines are more prevalent than non-inactivated ones among FDA-approved vaccines, which may influence the development of classification models. Note that our study addressed this type bias by assigning weights to each type. The effect of these biases deserves further investigation.
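The weighting scheme is not detailed in this section. One common choice, shown here purely as an assumption, is inverse-frequency ("balanced") class weights, where each class receives the weight n_samples / (n_classes × class_count). The label counts below are toy values, not the study's class distribution.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy imbalance: 8 non-live vs 2 live examples (hypothetical counts).
labels = ["Non-Live"] * 8 + ["Live"] * 2
weights = balanced_class_weights(labels)
```

With such weights, each class contributes the same total weight to the training loss, so the minority class is not overwhelmed by the majority class.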
Overall, our research demonstrates the power of LLMs in efficiently extracting and analyzing itemized vaccine adverse events. Our future work includes an extension of the methods for the analysis of other vaccines’ information and an automatic conversion of the LLM-generated results to an ontological representation. Such work will significantly help us better understand vaccine adverse events, leading to advanced public health research.
Abbreviations
- AE: Adverse Event
- LLM: Large Language Model
- HPO: Human Phenotype Ontology
- VO: Vaccine Ontology
- GBS: Guillain-Barré syndrome
- OAE: Ontology of Adverse Events
- VAERS: Vaccine Adverse Event Reporting System
- IC: Information Content
- MICA: Most Informative Common Ancestor
- AUC: Area Under the Curve
- ROC: Receiver Operating Characteristic
- SHAP: SHapley Additive exPlanations
- SVM: Support Vector Machine
Author contributions
ZW designed the entire study pipeline and conducted data analysis and visualization. JZ and XL mapped the text to the HPO ontology. JZ conducted ontology data integration and analysis. YH proposed the study and overall project design and served as the vaccine domain expert in result interpretation. All authors contributed to writing and revising the paper. All authors approved the submission.
Funding
This collaborative project is supported by an NIH grant U24AI171008 (Y. He and J. Zheng) and a Teaching Reform Project grant No. 2023jcjg0107 from Peking Union Medical College (Z. Wang).
Data availability
OntoRetriever supports extracting adverse reaction terms from text and mapping them to the most relevant ontology terms. It can be accessed at https://github.com/Cetomato/OntoRetriever and currently supports the HPO and OAE ontologies. The HPO mapping results and the associated vaccine information and annotations are available in the following GitHub repository: https://github.com/vaccineontology/VO-LLM.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Zhigang Wang, Email: wangzg@pumc.edu.cn.
Yongqun He, Email: yongqunh@med.umich.edu.
References
- 1. Conceição Silva F, De Luca PM, Lima-Junior JdaC. Vaccine development against infectious diseases: state of the art, new insights, and future directions. Vaccines (Basel). 2023;11:1632.
- 2. Martín Arias LH, Sanz R, Sáinz M, Treceño C, Carvajal A. Guillain-Barré syndrome and influenza vaccines: A meta-analysis. Vaccine. 2015;33:3773–8.
- 3. Fraiman J, Erviti J, Jones M, Greenland S, Whelan P, Kaplan RM, et al. Serious adverse events of special interest following mRNA COVID-19 vaccination in randomized trials in adults. Vaccine. 2022;40:5798–805.
- 4. Mozzicato P. MedDRA: an overview of the medical dictionary for regulatory activities. Pharm Med. 2009;23:65–75.
- 5. He Y, Sarntivijai S, Lin Y, Xiang Z, Guo A, Zhang S, et al. OAE: the ontology of adverse events. J Biomed Semant. 2014;5:29.
- 6. Sarntivijai S, Xiang Z, Shedden KA, Markel H, Omenn GS, Athey BD, et al. Ontology-based combinatorial comparative analysis of adverse events associated with killed and live influenza vaccines. PLoS ONE. 2012;7:e49941.
- 7. Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207–17.
- 8. Gargano MA, Matentzoglu N, Coleman B, Addo-Lartey EB, Anagnostopoulos AV, Anderton J, et al. The human phenotype ontology in 2024: phenotypes around the world. Nucleic Acids Res. 2024;52:D1333–46.
- 9. Sánchez D, Batet M. A semantic similarity method based on information content exploiting multiple ontologies. Expert Syst Appl. 2013;40:1393–9.
- 10. He Y, Cowell L, Diehl A, Mobley H, Peters B, Ruttenberg A, et al. VO: vaccine ontology. Nat Prec. 2009;1–1.
- 11. Raiaan MAK, Mukta MSH, Fatema K, Fahad NM, Sakib S, Mim MMJ, et al. A review on large language models: architectures, applications, taxonomies, open issues and challenges. IEEE Access. 2024;12:26839–74.
- 12. Li Y, Li J, He J, Tao C. AE-GPT: using large language models to extract adverse events from surveillance reports-A use case with influenza vaccine adverse events. PLoS ONE. 2024;19:e0300919.
- 13. Introducing Meta Llama 3. The most capable openly available LLM to date [Internet]. Meta AI. [cited 2024 Jun 6]. Available from: https://ai.meta.com/blog/meta-llama-3/
- 14. Jackson RC, Balhoff JP, Douglass E, Harris NL, Mungall CJ, Overton JA. ROBOT: A tool for automating ontology workflows. BMC Bioinformatics. 2019;20:407.
- 15. Nussbaum Z, Morris JX, Duderstadt B, Mulyar A. Nomic Embed: Training a Reproducible Long Context Text Embedder [Internet]. arXiv; 2025 [cited 2025 Mar 7]. Available from: http://arxiv.org/abs/2402.01613
- 16. Lee S, Shakir A, Koenig D, Lipp J. Open Source Strikes Bread - New Fluffy Embedding Model [Internet]. 2024 [cited 2024 Jun 6]. Available from: https://www.mixedbread.ai/blog/mxbai-embed-large-v1
- 17. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res. 2009;37:W170–3.
- 18. Lin D. An Information-Theoretic Definition of Similarity. In: Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann; 1998. pp. 296–304.
- 19. Greene D, Richardson S, Turro E. OntologyX: a suite of R packages for working with ontological data. Bioinformatics. 2017;33:1104–6.
- 20. Nisha, Kaur PJ. Cluster quality based performance evaluation of hierarchical clustering method. 2015 1st International Conference on Next Generation Computing Technologies (NGCT) [Internet]. 2015 [cited 2025 Mar 19]. pp. 649–53. Available from: https://ieeexplore.ieee.org/abstract/document/7375201
- 21. Gu Z, Eils R, Schlesner M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics. 2016;32:2847–9.
- 22. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2013.
- 23. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
- 24. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien [Internet]. 2023. Available from: https://CRAN.R-project.org/package=e1071
- 25. Ranstam J, Cook JA. LASSO regression. Br J Surg. 2018;105:1348.
- 26. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.
- 27. Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 2015;48:2839–46.
- 28. Sellereite N, Jullum M. shapr: an R-package for explaining machine learning models with dependence-aware Shapley values. J Open Source Softw. 2019;5:2027.
- 29. Keraghel I, Morbieu S, Nadif M. Beyond words: a comparative analysis of LLM embeddings for effective clustering. In: Miliou I, Piatkowski N, Papapetrou P, editors. Advances in intelligent data analysis XXII. Cham: Springer Nature Switzerland; 2024. pp. 205–16.
- 30. Petukhova A, Matos-Carvalho JP, Fachada N. Text Clustering with LLM Embeddings [Internet]. arXiv; 2024 [cited 2024 Aug 29]. Available from: http://arxiv.org/abs/2403.15112