PLOS One
. 2024 Mar 21;19(3):e0300919. doi: 10.1371/journal.pone.0300919

AE-GPT: Using Large Language Models to extract adverse events from surveillance reports: A use case with influenza vaccine adverse events

Yiming Li 1, Jianfu Li 2, Jianping He 1, Cui Tao 2,*
Editor: Vincenzo Bonnici
PMCID: PMC10956752  PMID: 38512919

Abstract

Though vaccines are instrumental in global health, mitigating infectious diseases and pandemic outbreaks, they can occasionally lead to adverse events (AEs). Recently, Large Language Models (LLMs) have shown promise in effectively identifying and cataloging AEs within clinical reports. Utilizing data from the Vaccine Adverse Event Reporting System (VAERS) from 1990 to 2016, this study evaluates LLMs’ capability for AE extraction. A variety of prevalent LLMs, including GPT-2, GPT-3 variants, GPT-4, and Llama 2, were evaluated using the influenza vaccine as a use case. The fine-tuned GPT-3.5 model (AE-GPT) stood out with an averaged micro F1 score of 0.704 for strict match and 0.816 for relaxed match. The encouraging performance of AE-GPT underscores LLMs’ potential in processing medical data, indicating a significant stride towards advanced AE detection and suggesting the approach is likely generalizable to other AE extraction tasks.

Introduction

Vaccines are a vital component of public health and have been instrumental in preventing infectious illnesses [1, 2]. Nowadays, we possess vaccines that cope with over 20 life-threatening diseases, contributing to enhanced health and longevity for people across all age groups [3]. Each year, vaccinations save between 3.5 and 5 million individuals from fatal diseases such as diphtheria, tetanus, pertussis, influenza, and measles [3]. However, vaccine-related adverse events (AEs), although rare, can occur after immunization. By September 2023, the Vaccine Adverse Event Reporting System (VAERS) had received more than 1,791,000 vaccine AE reports, of which 9.5% were classified as serious, including events that result in death, hospitalization, or significant disability [4–6]. Vaccine AEs can cause a range of side effects in individuals, from mild, temporary symptoms to severe complications [7–10]. They may also give rise to vaccine hesitancy among healthcare providers and recipients [11].

Understanding AEs following vaccinations is vital to surveilling the effective implementation of immunization programs [12]. Such vigilance ensures the continuous safety of vaccination campaigns, allowing for prompt responses when AEs occur. This not only aids in recognizing early warning signs but also fosters hypotheses regarding potential new vaccine AEs or shifts in the frequency of known ones, ultimately contributing to the refinement and development of vaccines [13].

Therefore, the extraction of AEs and AE-related information plays a pivotal role in advancing our understanding of conditions such as syndromes and other system disorders that can emerge after vaccination. VAERS functions as a spontaneous reporting system for adverse events post-vaccination, serving as the national early warning mechanism to flag potential safety issues with U.S.-licensed vaccines [14–16]. VAERS collects structured information such as age, medical background, and vaccine type. It also includes a short narrative from the reporters describing symptoms, medical history, diagnoses, treatments, and their temporal information [14]. Although it does not establish causal relationships between AEs and vaccines, VAERS detects possible safety concerns warranting deeper investigation through robust systems and study designs [14]. Du et al. employed advanced deep learning algorithms to detect nervous system disorder-related events in cases of Guillain-Barre syndrome (GBS) linked to influenza vaccines from the VAERS reports [17]. Through the evaluation of different machine learning and deep learning methods, including domain-specific BERT models such as BioBERT and VAERS BERT, their research demonstrated the superior performance of deep learning techniques over traditional machine learning methods (i.e., conditional random fields with extensive features) [17].

Nowadays, with the popularity of artificial intelligence (AI) surging, a remarkable breakthrough has emerged: the development of large language models (LLMs) [18–20]. These cutting-edge AI constructs have redefined the way computers understand and generate human language, leading to unprecedented advancements in various aspects [21]. These models, powered by advanced machine learning techniques, have the capacity to comprehend context, semantics, and nuances, allowing them to generate coherent and contextually relevant text [19]. For example, the Generative Pre-trained Transformer (GPT), developed by OpenAI, represents a pioneering milestone in the realm of AI [22]. Built on a massive dataset, GPT is a state-of-the-art language model that excels in generating coherent and contextually relevant text [23]. Since its inception, GPT has exhibited exceptional capabilities, ranging from crafting imaginative narratives and aiding content creation to facilitating language translation and engaging in natural conversations with virtual assistants [24, 25]. Its impact across various domains has highlighted its potential to revolutionize human-computer interaction, making significant strides towards machines truly understanding and interacting with human language [26]. Another model, Llama 2, available free of charge, takes a different approach: its model weights are openly released, allowing researchers to inspect, fine-tune, and deploy it themselves, setting it apart from closed-access models like GPT [27].

LLMs represent a significant leap forward in natural language processing (NLP), enabling applications ranging from text generation and translation to sentiment analysis and Chatbot virtual assistants [28]. By learning patterns from vast amounts of text data, these LLMs have the potential to bridge the gap between human communication and machine understanding, opening up new avenues for communication, information extraction, and problem-solving [29]. Hu et al. examined ChatGPT’s potential for clinical named entity recognition (NER) in a zero-shot context, comparing its performance with GPT-3 and BioClinicalBERT on synthetic clinical notes [30]. ChatGPT outperformed GPT-3, although BioClinicalBERT still performed better. The study demonstrates ChatGPT’s promising utility for zero-shot clinical NER tasks without requiring annotation [30].

In this study, we aim to develop AE-GPT, an automatic adverse event extraction tool based on LLMs, with a specific focus on adverse events following the influenza vaccine. The influenza vaccine, being one of the most frequently reported vaccines in VAERS, serves as a prominent use case for our investigation. Our choice is motivated not only by the vaccine’s substantial reporting frequency but also by its significance in public health. As we delve into extracting AE-related entities, the influenza vaccine provides a robust and relevant context for evaluating the performance of our proposed AE-GPT framework. While our study centers on the influenza vaccine as a use case, the framework’s applicability extends to other vaccine types, enriching the generalizability of our findings across the broader domain of vaccine safety surveillance. Unlike several studies that have explored the application of LLMs in executing NER tasks, our focus extends beyond zero-shot learning (inference from the pretrained model), providing comprehensive performance comparisons between pretrained LLMs, fine-tuned LLMs, and traditional language models. This research aims to address this gap by providing a thorough examination of LLMs’ capabilities, specifically focusing on their performance within the NER task, and also proposes the advanced fine-tuned model AE-GPT, which specializes in AE-related entity extraction. Our investigation not only involves utilizing the pretrained model for inference but also enhancing the LLMs’ NER performance through fine-tuning with the customized dataset, thus providing a deeper understanding of their potential and effectiveness.

Materials and methods

Fig 1 presents an overview of the study framework. Our investigation commenced with zero-shot entity recognition, involving the direct input of user prompts into pretrained LLMs (GPT & Llama 2). To achieve a comprehensive assessment of LLMs’ effectiveness in clinical entity recognition, our analysis covered a range of LLMs, namely GPT-2, GPT-3, GPT-3.5, GPT-4, and Llama 2. Furthermore, to enhance their performance, we performed fine-tuning on these LLMs using annotated data, followed by utilizing user prompts to facilitate result inference.

Fig 1. Overview of the study framework.


Data source and use case

VAERS functions as an advanced alert mechanism jointly overseen by the Centers for Disease Control and Prevention (CDC) and the U.S. Food and Drug Administration (FDA), playing a pivotal role in identifying potential safety concerns associated with FDA-approved vaccines [31, 32]. As of August 2023, VAERS has documented more than 1,781,000 vaccine-related AEs [32].

The influenza vaccine plays a significant role in preventing millions of illnesses and visits to doctors due to flu-related symptoms each year [33]. For example, in the 2021–2022 flu season, flu vaccination was estimated to have prevented around 1.8 million flu-related illnesses, resulting in approximately 1,000,000 fewer medical visits, 22,000 fewer hospitalizations, and nearly 1,000 fewer deaths attributed to influenza [34]. A study conducted in 2021 highlighted that among adults hospitalized with flu, those who had received the flu vaccine had a 26% reduced risk of needing intensive care unit (ICU) admission and a 31% lower risk of flu-related mortality compared to individuals who had not been vaccinated [35].

However, influenza vaccines also have been associated with a range of potential adverse effects, such as pyrexia, hypoesthesia, and even rare conditions like GBS [36]. Among them, GBS ranks as the primary contributor to acute paralysis in developed nations, and continues to be the most frequently documented serious adverse event following trivalent influenza vaccination in the VAERS database, with a report rate of 0.70 cases per 1 million vaccinations [37–39]. This rare autoimmune disorder, GBS, affects the peripheral nervous system, characterized by rapidly advancing, bilateral motor neuron paralysis that typically arises subsequent to an acute respiratory or gastrointestinal infection [37, 40–42].

As a use case, this study focuses on symptom descriptions (referred to as narrative safety reports) that include GBS and symptoms frequently linked with GBS. In particular, our interest lies in reports following the administration of diverse influenza virus vaccines, including FLU3, FLU4, H5N1, and H1N1. To enable a direct performance comparison with traditional language models, we employed the identical dataset used by Du et al. in their previous study [17]. This dataset comprises a total of 91 annotated reports. In the context of understanding the development of GBS and other nervous system disorders, we explored six entity types that collectively capture significant clinical insights within VAERS reports: investigation, nervous_AE, other_AE, procedure, social_circumstance, and temporal_expression. investigation refers to lab tests and examinations, including entities like “neurological exam” and “lumbar puncture” [17]. nervous_AE (e.g., “tremors”, “Guillain-Barré syndrome”) involves symptoms and diseases related to nervous system disorders, whereas other_AE (e.g., “complete bowel incontinence”, “diarrhea”) covers other symptoms and diseases [17]. procedure addresses clinical interventions applied to the patient, including vaccination, treatment and therapy, and intensive care, featuring instances such as “flu shot” and “hospitalized” [17]. social_circumstance records events associated with the social environment of a patient, for example, “smoking” and “alcohol abuse” [17]. temporal_expression concerns temporal expressions with prepositions, like “for 3 days” and “on Friday morning” [17].
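As an illustration of the annotation scheme, the snippet below tags an invented narrative with the six entity types (a hypothetical sketch: the note text and annotations are fabricated for illustration, not drawn from the VAERS dataset):

```python
# Hypothetical annotated example illustrating the six entity types.
note = ("Pt received flu shot on Friday morning; developed tremors and "
        "diarrhea for 3 days. Lumbar puncture performed. Pt smokes.")

annotations = [
    {"type": "procedure",           "text": "flu shot"},
    {"type": "temporal_expression", "text": "on Friday morning"},
    {"type": "nervous_AE",          "text": "tremors"},
    {"type": "other_AE",            "text": "diarrhea"},
    {"type": "temporal_expression", "text": "for 3 days"},
    {"type": "investigation",       "text": "Lumbar puncture"},
    {"type": "social_circumstance", "text": "smokes"},
]

# Sanity check: every annotated string actually appears in the note.
assert all(a["text"] in note for a in annotations)
```

This mirrors how a single narrative can contribute entities of several types at once, which is exactly the setting the NER models must handle.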

Models

To fully investigate the performance of LLMs on the NER task, GPT models and Llama 2 were leveraged.

GPT

GPT represents a groundbreaking advancement in the realm of NLP and artificial intelligence [43]. Developed by OpenAI, GPT stands as a remarkable example of the transformative capabilities of large-scale neural language models [44]. At its core, GPT is founded upon the innovative Transformer architecture, a model that has revolutionized the field by effectively capturing long-term dependencies within sequences, making it exceptionally well suited for tasks involving language understanding and generation [45, 46]. The GPT family has multiple versions. GPT-2, with 1.5 billion parameters, is capable of generating extensive sequences of text while adapting to the style and content of arbitrary inputs [47]; it can also perform various NLP tasks, such as classification [47]. GPT-3, with 175 billion parameters, takes these capabilities even further. It is an autoregressive language model trained with 96 layers on a combination of 560GB+ web corpora, internet-based book corpora, and Wikipedia datasets, each weighted differently in the training mix [48, 49]. The GPT-3 model is available in four versions, Davinci, Curie, Babbage, and Ada, which differ in the number of trainable parameters: 175, 13, 6.7, and 2.7 billion, respectively [48, 50]. GPT-4 has reportedly grown in size by roughly a factor of 1,000, reaching a magnitude of 170 trillion parameters, a substantial increase compared to GPT-3.5’s 175 billion parameters [51]. One of the most notable improvements in GPT-4 is the expanded context length: in GPT-3.5 the context length is 2,048 tokens [51], whereas GPT-4 supports 8,192 or 32,768 tokens depending on the specific version, an increase of 4 to 16 times [51]. In terms of generated output, GPT-4 can not only accommodate multimodal input but also produce up to 24,000 words (equivalent to 48 pages) [51], an eight-fold increase over GPT-3.5, which is constrained to 3,000 words (equivalent to 6 pages) [51].

The rationale behind GPT’s design stems from the understanding that pre-training, involving the unsupervised learning of language patterns from vast textual corpora, can provide a strong foundation for subsequent fine-tuning on specific tasks [44]. This enables GPT to acquire a sophisticated understanding of grammar, syntax, semantics, and even world knowledge, essentially learning to generate coherent and contextually relevant text [52].

GPT’s architecture comprises multiple layers of self-attention mechanisms, which allow the model to weigh the importance of different words in a sentence based on their contextual relationships [53, 54]. This intricate layering, coupled with the model’s considerable parameters, empowers GPT to process and generate complex linguistic structures, making it a versatile tool for a wide range of NLP tasks, including text completion, translation, summarization, and even creative writing [55].

Llama 2

Llama 2 emerges as a cutting-edge advancement in the domain of natural language processing, marking a significant evolution in the landscape of language models [27]. Developed as an extension of its predecessor, Llama, this model represents an innovative step forward in harnessing the power of transformers for language understanding and generation [56]. The architecture of Llama 2 is firmly rooted in the Transformer framework, which has revolutionized the field by enabling the modeling of complex dependencies in sequences.

The rationale behind Llama 2’s conception rests upon the recognition that while pre-training large language models on diverse text corpora is beneficial, customizing their multi-layer self-attention architecture for linguistic structures can further optimize their performance [56].

Experiment setup

Dataset split

In this study, we partitioned the dataset into a training set and a test set using an 8:2 ratio, where 72 VAERS reports were designated for the training set and the remaining 19 reports for the test set.
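The 8:2 partition can be sketched as follows (a minimal sketch; the seeded shuffle is an assumption, since the paper does not state how reports were assigned to each split):

```python
import random

def split_dataset(report_ids, train_frac=0.8, seed=42):
    """Shuffle report IDs and split them into training and test sets."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)   # reproducible shuffle (assumed)
    cut = int(len(ids) * train_frac)   # floor gives 72 of 91 for training
    return ids[:cut], ids[cut:]

# 91 annotated VAERS reports -> 72 training / 19 test
train, test = split_dataset(range(91))
print(len(train), len(test))  # -> 72 19
```

Note that flooring 91 × 0.8 reproduces the 72/19 split reported above.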

Pretrained model inference

We first inferred results using the available pretrained LLMs. The GPT-2 model source has been made publicly available; we used TFGPT2LMHeadModel as the pretrained GPT-2 model to test its ability on this NER task [57]. Llama 2 is also an open-source LLM, which can be accessed through Meta AI [56].

To evaluate the performance of the pretrained models, we conducted several experiments, selecting the temperature and max tokens settings (shown in Table 1) that yielded the best results. Temperature, a hyperparameter, influences the randomness of the generated text: higher values, such as 1.0 or above, increase diversity, while lower values, like 0.5 or below, produce more focused outputs. Meanwhile, max tokens determines the maximum length of the generated text, serving to control and limit the length of the output. We employed prompts (as depicted in Table 1) that adeptly articulate our objective, are comprehensible to the LLMs, and additionally aid in the efficient extraction of results.
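As a sketch of how these settings come together, the snippet below assembles a chat-completion-style request for the zero-shot GPT-3.5 prompt of Table 1 (the `build_request` helper is illustrative, not the authors' code; the actual API client call is omitted):

```python
def build_request(note, model="gpt-3.5-turbo", temperature=0.8, max_tokens=1000):
    """Assemble a chat-completion-style request dict for zero-shot extraction.

    The prompt text mirrors the GPT-3.5 row of Table 1; sending the request
    to a real API endpoint is left out of this sketch.
    """
    prompt = ("Please extract all names of investigation, nervous AE, "
              "other AE, procedure, social circumstance, and timestamp "
              "from this note, and put them in a list\n\n" + note)
    return {
        "model": model,
        "temperature": temperature,   # higher -> more diverse output
        "max_tokens": max_tokens,     # caps the length of the generated text
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Pt developed weakness 3 days after flu shot.")
```

The same helper shape applies to the other models; only the prompt string, temperature, and token limit change per Table 1.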

Table 1. Prompts and hyperparameters of pretrained models.
- GPT-2. Prompt: "Please extract all names of investigation, nervous AE, other AE, procedure, social circumstance, and timestamp from this note, and put them in a list". Temperature: 1.0. Max tokens: 1,000.
- GPT-3 (ada, babbage, curie, davinci). Prompt: "Answer the question based on the context below, and if the question can’t be answered based on the context, say 'I don’t know' / Context: [note] / --- / Question: Please extract all names of [timestamp/nervous_AE/other_AE/procedure/investigation/social circumstance] from this note / Answer:". Temperature: 0. Max tokens: 150.
- GPT-3.5 and GPT-4. Prompt: "Please extract all names of investigation, nervous AE, other AE, procedure, social circumstance, and timestamp from this note, and put them in a list". Temperature: 0.8. Max tokens: 1,000.
- Llama 2-7b-chat and 2-13b-chat. Prompt: "Please extract all names of [timestamp/nervous AE/other AE/procedure/investigation/social circumstance] from this note: [note]". Temperature: 0.6. Max tokens: 512.

Inference for the pretrained GPT models was executed on a server equipped with 8 Nvidia A100 GPUs, where each GPU provided a memory capacity of 80GB. Meanwhile, the pretrained Llama models were inferred on a server, which included 5 Nvidia V100 GPUs, each offering a memory capacity of 32GB.

Model fine-tuning

Fine-tuning of the GPT models is facilitated through OpenAI’s API calls, with the exceptions that the GPT-2 model’s fine-tuning stems from GPT2sQA and that fine-tuning for GPT-4 has not yet been made accessible [58]. For Llama 2 models, the fine-tuning process begins with HuggingFace. Subsequently, the model’s embeddings are automatically fine-tuned and updated. Throughout the process, the temperature remains consistent. The format requirements for training set templates differ among models, depending on whether they are instruction-based or not. Fig 2 presents an example of the question answering-based training set used by GPT-2, which begins with the question “Please extract all the names of nervous_AE from the following note”. The question is followed by the answer with annotations, in which the entities (i.e., Guillain Barre Syndrome, quadriplegic, GBS) and their starting character offsets (i.e., 0, 141, 212) are indicated. The training example ends with the context (Guillain Barre Syndrome. Onset on…). Fig 3 shows an example illustrating the structured-format training set tailored for GPT-3, where the prompt and annotations are required. In the prompt, only the original report is needed because of the predetermined NER template embedded in GPT-3, while the annotations include the entity types and the entities. For instance, as shown in Fig 3, “Guillain Barre Syndrome. Onset…” is the original description from the VAERS reports. Within the annotations section, all the involved entity types (nervous_AE, timestamp, investigation, other_AE, and procedure) are listed, with ’Guillain Barre Syndrome’, ’quadriplegic’, and ’GBS’ being the entities classified under nervous_AE. Fig 4 shows an instruction-based training example used for GPT-3.5 and Llama 2-chat, which employs prompt instructions to guide and refine the model’s responses, ensuring more accurate and contextually relevant outputs; the process of human-machine interaction is imitated. In this scenario, three roles are identified: the system, user, and assistant. The system outlines the task to be accomplished by GPT, stipulating "You are an assistant adept at named entity recognition." Unlike the structured-format training set, in addition to the original VAERS reports, users are also required to clarify the task with a specific question, e.g., "Please extract all the nervous_AE entities in the following note." The annotations section only includes the anticipated responses that users expect GPT to provide.
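A minimal sketch of one such instruction-based record in the system/user/assistant chat format is shown below (the `make_chat_example` helper and the JSON-list answer serialization are illustrative assumptions, not the authors' exact format):

```python
import json

def make_chat_example(note, entity_type, entities):
    """Build one instruction-based fine-tuning record with the three roles
    (system/user/assistant) described above; serialized as one JSONL line."""
    return {
        "messages": [
            {"role": "system",
             "content": "You are an assistant adept at named entity recognition."},
            {"role": "user",
             "content": f"Please extract all the {entity_type} entities "
                        f"in the following note: {note}"},
            {"role": "assistant",
             # The anticipated answer (here: a JSON list of entity strings).
             "content": json.dumps(entities)},
        ]
    }

record = make_chat_example(
    "Guillain Barre Syndrome. Onset on day 3; patient became quadriplegic.",
    "nervous_AE",
    ["Guillain Barre Syndrome", "quadriplegic"],
)
line = json.dumps(record)  # one line of the training JSONL file
```

One record of this shape would be emitted per (report, entity type) pair when assembling the training file.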

Fig 2. One example of a question answering-based training set.

Fig 2

Fig 3. One example of the structured format training set.

Fig 3

Fig 4. One example of the instruction-based training set.

Fig 4

For model fine-tuning, we selected the initial hyperparameters as outlined in Table 2. Typically, these settings are based on defaults, except for GPT-3, where non-default values outperformed the defaults. As for prompts, they differ slightly due to model-specific training set needs.

Table 2. Prompts and hyperparameters of model fine-tuning.
- GPT-2. Prompt: "Please extract all the names of [timestamp/nervous_AE/other_AE/procedure/investigation/social circumstance] from the following note". Training set format: question answering-based. Temperature: 1.0. Max tokens: 1,000.
- GPT-3 (ada, babbage, curie, davinci). Prompt: JSON format specified by OpenAI. Training set format: structured. Temperature: 0.8. Max tokens: 1,000.
- GPT-3.5. Prompt: "Please extract all the [timestamp/nervous_AE/other_AE/procedure/investigation/social circumstance] in the following note: [note]". Training set format: instruction-based. Temperature: 1.0. Max tokens: 4,096.
- Llama 2-7b-chat and 2-13b-chat. Prompt: JSON format specified by Llama. Training set format: instruction-based. Temperature: 1.0. Max tokens: 4,096.

Fine-tuning for the pretrained GPT models was executed on a server equipped with 8 Nvidia A100 GPUs, where each GPU provided a memory capacity of 80GB. Meanwhile, the pretrained Llama models were fine-tuned on a server, which included 5 Nvidia V100 GPUs, each offering a memory capacity of 32GB.

Post-processing

In the post-processing stage, we addressed instances of nested entities, as depicted in Fig 5 (“Muscle strength” vs “Muscle strength decreased”). To effectively handle this, we adopted a strategy wherein entities possessing the longest spans were retained, while the nested entities were excluded from consideration. In the examples illustrated in Fig 5, the investigation term "Muscle strength" was eliminated, resulting in nervous_AE "Muscle strength decreased" for the final output. This procedure ensured a streamlined and accurate representation of the entities within the given context.
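The longest-span-wins rule can be sketched as follows (an illustrative sketch; the entity tuple layout and character offsets are hypothetical):

```python
def drop_nested(entities):
    """Keep only entities whose character span is not strictly contained
    inside another entity's span (longest-span-wins rule)."""
    kept = []
    for i, (start, end, label, text) in enumerate(entities):
        nested = any(
            (s <= start and end <= e) and (s, e) != (start, end)
            for j, (s, e, _, _) in enumerate(entities) if j != i
        )
        if not nested:
            kept.append((start, end, label, text))
    return kept

# The Fig 5 example: "Muscle strength" is nested inside
# "Muscle strength decreased", so only the longer span survives.
spans = [
    (0, 15, "investigation", "Muscle strength"),
    (0, 25, "nervous_AE", "Muscle strength decreased"),
]
result = drop_nested(spans)
```
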

Fig 5. One example of nested entities.

Fig 5

Evaluation

The performance of the LLMs was evaluated by metrics, including precision, recall, and F1. These evaluations were conducted under two distinct matching criteria: exact matching, which required identical entity boundaries, and relaxed matching, which took into consideration overlapping entity boundaries.

Precision = True positives / (True positives + False positives)

Recall = True positives / (True positives + False negatives)

F1 = 2 × Precision × Recall / (Precision + Recall)
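In code, the two matching criteria might be implemented as follows (an illustrative sketch over (start, end) character spans; the paper's exact scorer, including micro-averaging over entity types, is not reproduced here):

```python
def prf(gold, pred, relaxed=False):
    """Precision/recall/F1 over (start, end) spans.

    Strict: spans must match exactly.
    Relaxed: any character overlap between spans counts as a match.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    if relaxed:
        tp_p = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
        tp_r = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    else:
        tp_p = sum(1 for p in pred if p in gold)
        tp_r = sum(1 for g in gold if g in pred)

    precision = tp_p / len(pred) if pred else 0.0
    recall = tp_r / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 23), (30, 42)]
pred = [(0, 23), (28, 42)]             # second span overlaps but is not exact
print(prf(gold, pred))                 # strict:  P = R = F1 = 0.5
print(prf(gold, pred, relaxed=True))   # relaxed: P = R = F1 = 1.0
```

The example shows why relaxed scores in Table 4 sit above the strict scores in Table 3: boundary mismatches are forgiven under overlap matching.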

Results

Tables 3 and 4 present the NER performance across various LLMs using strict F1 and relaxed F1 metrics, respectively. In terms of evaluation metrics, strict F1 requires an exact match in both content and position between the predicted and true segments, while relaxed F1 allows for partial matches, providing a more lenient evaluation of model performance. Notably, the GPT-3.5 model emerges as the frontrunner in this NER task. Remarkably, the GPT-3, 3.5, and 4 models surpass the Llama models significantly. Within the GPT model family, performance improves significantly with each successive version upgrade. GPT-3-davinci, in particular, achieved the highest performance among all GPT-3 models.

Table 3. NER performance comparison on VAERS reports by strict F1.

GPT-2 GPT-3 GPT-3.5 GPT-4 Llama
ada babbage curie davinci 2-7b-chat 2-13b-chat
Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Pretrained Fine-tuned Pretrained Fine-tuned
investigation 0 0 0 0.389 0 0.304 0 0.214 0 0.344 0.241 0.667 0.304 0.097 0.024 0.099 0.114
nervous_AE 0 0 0 0.319 0 0.277 0 0.356 0 0.459 0.208 0.727 0.458 0.102 0.024 0.173 0.096
other_AE 0 0 0 0.17 0 0.17 0 0.183 0 0.351 0.165 0.638 0.412 0.234 0.041 0.213 0.042
procedure 0 0 0 0.39 0 0.388 0 0.511 0 0.464 0.059 0.716 0.02 0.283 0.066 0.189 0.077
social_circumstance 0 0 0 0 0 0 0 0 0 0 0.133 0.5 0 0 0 0 0
temporal_expression 0 0 0 0.457 0 0.545 0 0.504 0 0.583 0.252 0.76 0.323 0.202 0.062 0.013 0.053
Microaverage 0 0 0 0.335 0 0.356 0 0.359 0 0.436 0.183 0.704 0.308 0.16 0.046 0.134 0.07

Note: Scores are averages over 10 runs.

Table 4. NER performance comparison on VAERS reports by relaxed F1.

GPT-2 GPT-3 GPT-3.5 GPT-4 Llama
ada babbage curie davinci 2-7b-chat 2-13b-chat
Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Fine-tuned Pretrained Pretrained Fine-tuned Pretrained Fine-tuned
investigation 0 0 0 0.463 0 0.448 0 0.321 0 0.53 0.289 0.795 0.464 0.13 0.047 0.128 0.21
nervous_AE 0 0.047 0 0.478 0 0.408 0 0.483 0 0.584 0.308 0.872 0.658 0.277 0.047 0.378 0.192
other_AE 0 0.021 0 0.243 0 0.243 0 0.286 0 0.412 0.278 0.704 0.486 0.309 0.062 0.295 0.042
procedure 0 0.08 0 0.419 0 0.464 0 0.534 0 0.522 0.094 0.743 0.305 0.318 0.077 0.245 0.088
social_circumstance 0 0 0 0 0 0 0 0.286 0 0 0.133 0.5 0 0 0 0 0
temporal_expression 0 0.075 0 0.705 0 0.743 0 0.628 0 0.729 0.613 0.886 0.673 0.519 0.198 0.093 0.093
Microaverage 0 0.049 0 0.457 0 0.484 0 0.456 0 0.54 0.329 0.816 0.515 0.269 0.101 0.221 0.118

Note: Scores are averages over 10 runs.

Interestingly, the performance of the fine-tuned GPT-3 (davinci) model closely rivals that of the pretrained GPT-4; indeed, the fine-tuned GPT-3 model outperforms both the pretrained GPT-3.5 and the pretrained GPT-4 models.

Tables 3 and 4 show the NER performance for various entity types, encompassing investigation, nervous_AE, other_AE, procedure, social_circumstance, and temporal_expression respectively. Among these categories, temporal_expression exhibits the highest performance, followed by nervous_AE and procedure.

However, it is worth noting that LLMs encounter significant challenges in extracting social_circumstance entities: the fine-tuned GPT-3.5 model achieves the highest F1 score in this category at only 0.5. Across all models evaluated, the fine-tuned GPT-3.5 generally delivers the best performance, the exception being investigation extraction, where the pretrained GPT-3.5 excels in precision. Therefore, we propose the fine-tuned GPT-3.5 model and name it “AE-GPT”.

Discussion

Our research has yielded remarkable insights into the capabilities of LLMs in the context of NER. With a specific focus on the performance of these models, we have achieved substantial achievements throughout this study. Additionally, we are pleased to introduce the advanced fine-tuned models collectively known as AE-GPT, which have demonstrated exceptional prowess in the extraction of AE related entities. Our work showcases the success of leveraging pretrained LLMs for inference and fine-tuning them with a customized dataset, underscoring the effectiveness and potential of this approach.

In the realm of LLMs for AE NER tasks, the fine-tuned GPT-3.5-turbo model (AE-GPT) notably stood out, demonstrating superior performance compared to its competing models. Interestingly, the process of fine-tuning seemed to have a significant effect on the capabilities of certain models. For instance, both the fine-tuned GPT-3 and GPT-3.5 showed enhanced performance, even outstripping the more advanced but not fine-tuned GPT-4. This suggests that fine-tuning with AE datasets could have equipped GPT models with a more profound insight into the domain, whereas the generic, broad knowledge base of the pretrained GPT-4 may not have been as optimized for this particular task. However, this fine-tuning effect was not universally observed. Despite similar attempts at enhancement, GPT-2 did not exhibit substantial improvements when fine-tuned. One plausible explanation is that GPT-2’s underlying architecture and training might have specialized in tasks like text completion rather than NER tasks [59]. Its core strengths may not align as seamlessly with the demands of AE NER, making fine-tuning less effective for this model. On the other hand, the performance of Llama remained stagnant across both its iterations and fine-tuning attempts. This could be indicative of a plateau in the model’s learning capacity for the AE NER task, or perhaps the fine-tuning process or data did not sufficiently align with the model’s strengths. Another possibility is that Llama’s architecture inherently lacks certain features or capacities that make the GPT series more adaptable to the AE NER task. The limited size of the dataset may also contribute to overfitting, which degrades performance. Further investigation might be needed to discern the specific factors influencing Llama’s performance.

Compared to the work carried out by Du et al., which focused on conventional machine learning-based and deep learning-based methods [17], AE-GPT performs better than the best model in their work (the highest exact-match micro-averaged F1 score of 0.6802, by ensembles of BioBERT and VAERS BERT; the highest lenient-match micro-averaged F1 score of 0.8078, by Large VAERS BERT) [17]. AE-GPT’s (the fine-tuned GPT-3.5 model’s) enhanced performance in extracting specific entities like investigations, various adverse events, social circumstances, and timestamps can be attributed to its vast pretraining on diverse datasets and its inherent architectural advantages, allowing it to capture broader contextual nuances. Meanwhile, the ensembles of BioBERT and VAERS BERT, despite their biomedical specialization, might have limitations in adaptability across diverse data representations, leading to their comparative underperformance. However, when focusing on procedure extraction, the domain-specific nature of the BioBERT and VAERS BERT ensemble might provide a more attuned understanding of the intricate and context-dependent nature of medical procedures. This specificity could overshadow GPT-3.5’s broad adaptability, explaining the latter’s lesser effectiveness in that particular extraction task.

Our study embarked on a comprehensive comparison of prevalent large language models, encompassing GPT-2, various versions of GPT-3, GPT-3.5, GPT-4, and Llama 2, specifically focusing on their aptitude to extract AE-related entities. Crucially, both pretrained and fine-tuned iterations of these models were scrutinized. Based on its exhaustive nature, this research stands as one of the most holistic inquiries to date into the performance of LLMs in the NER domain. Furthermore, it carves a niche by exploring the impact of fine-tuning on LLMs for NER tasks, distinguishing our efforts from other existing research and reinforcing the study’s unique contribution to the field.

While our study offers valuable insights, it is not without its limitations. The dataset utilized in this research is relatively constrained, comprising only 91 VAERS reports. This limited scope might impede the generalizability of our findings to broader contexts. Moreover, it’s noteworthy that we primarily focused on VAERS reports, which differ in structure and content from traditional clinical reports, potentially limiting the direct applicability of our findings to other medical documentation.

In our forthcoming endeavors, we aim to incorporate fine-tuning experiments with GPT-4, especially as it becomes accessible for such tasks in the fall of 2023. This will not only add another dimension to our current findings but also ensure that our research remains at the cutting edge, reflecting the latest advancements in the world of LLMs.

Error analysis

While AE-GPT (the fine-tuned GPT-3.5 model) has demonstrated commendable performance in recognizing a majority of entity types, it exhibits inherent limitations. Table 5 shows the error statistics of AE-GPT across various entity types. Our classification of error types remains consistent with that of Du et al., ensuring easier comparison [17]. ’Boundary mismatch’ denotes discrepancies in the span range of entities between machine-annotated and human-annotated results. ’False positive’ refers to entities identified by the proposed model that are not present in the gold standard, while ’false negative’ indicates entities the model failed to extract. ’Incorrect entity type’ pertains to instances where, although the entity’s span range is accurate, the entity itself has been misclassified. The model’s predominant challenges lie in boundary mismatches, false positives, and false negatives, which can be attributed to several factors. The quality and representativeness of the training data play a significant role; inconsistent or limited annotations can lead to mismatches and incorrect identifications [60, 61]. The inherent complexity of distinguishing similar and potentially overlapping entities adds to the challenge. Additionally, textual ambiguity and the trade-off between specialization during fine-tuning and the generalization from the model’s original vast pretraining can impact accuracy. While GPT-3.5 is powerful, capturing all the nuances of a specialized NER task can still pose challenges.

Table 5. Statistics of AE-GPT prediction errors on different entity types.

Entity type | Boundary Mismatch (out of human-annotated entities) | False Positive (out of machine-annotated entities) | False Negative (out of human-annotated entities) | Incorrect Entity Type (out of machine-annotated entities)
investigation | 13/66, 19.7% | 20/90, 22.22% | 6/66, 9.1% | 5/90, 5.56%
nervous_AE | 28/175, 16% | 15/169, 8.88% | 30/175, 17.14% | 1/169, 0.59%
other_AE | 12/169, 7.1% | 21/132, 15.91% | 63/169, 37.28% | 3/132, 2.27%
procedure | 4/156, 2.56% | 28/140, 20% | 46/156, 29.49% | 2/140, 1.43%
social_circumstance | 0/2, 0% | 1/2, 50% | 1/2, 50% | 0/2, 0%
temporal_expression | 21/141, 14.89% | 6/130, 4.62% | 22/141, 15.6% | 0/130, 0%
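The error taxonomy in Table 5 can be sketched as a small classifier over annotated spans. The (start, end, type) tuple representation and the precedence of categories (exact match, then boundary mismatch, then incorrect type, then false positive) are illustrative assumptions, not the exact scoring procedure used in the study.

```python
def classify_prediction(pred, gold):
    """Assign one error category to a machine-annotated span.
    pred is a (start, end, type) tuple; gold is a list of such tuples."""
    if pred in gold:
        return "correct"
    for g in gold:
        # Same type with overlapping but unequal spans -> boundary mismatch.
        if pred[2] == g[2] and pred[0] < g[1] and g[0] < pred[1]:
            return "boundary_mismatch"
    for g in gold:
        # Identical span but a different label -> incorrect entity type.
        if (pred[0], pred[1]) == (g[0], g[1]):
            return "incorrect_entity_type"
    return "false_positive"
```

Gold spans that match no prediction under any of these rules would be tallied as false negatives, which is why Table 5 reports those against the human-annotated denominator.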

In particular, AE-GPT tends to miss specific procedure names, such as "IV immunoglobulin" and "flu vax." Likewise, it exhibits a heightened likelihood of failing to recognize entities related to social circumstances. This underscores the necessity for an improved and broader domain-specific vocabulary within GPT-3.5.

Moreover, AE-GPT frequently mistakes general terms such as "injection" and "vaccinated" for exact procedure names, and fails to extract the actual vaccine names that follow them in the text. This misinterpretation results in concurrent false positive and false negative errors.

Another noteworthy limitation of AE-GPT is its proneness to splitting errors. For instance, given the phrase "unable to move his hands, arms or legs," the model often erroneously segments it into "unable to move his hands" and "arms or legs," revealing a shortcoming in its language understanding.

In our next steps, we intend to improve rare entity extraction, such as social circumstances, by leveraging ontologies and terminologies in these specific domains. We also plan to enhance the embeddings within LLMs to broaden their coverage of these rare entities. Furthermore, expanding our dataset to include drug AEs is on our agenda. We will also introduce clinical notes and biomedical literature to further enrich the dataset. This increased data volume will enable LLMs to better distinguish nuances between entity classes, such as procedure vs. investigation and nervous_AE vs. other_AE.

In future investigations, we also acknowledge the importance of assessing the statistical significance of identified AEs. While the current study primarily focuses on evaluating the performance of different pretrained and fine-tuned LLMs on the NER task, we have previously conducted statistical analyses using structured data [62, 63]. Assessing the statistical significance of AEs represents a crucial avenue for further exploration. Our plan is to integrate the data extracted from unstructured text with the previously collected structured data and apply rigorous statistical methods, such as hypothesis testing and significance thresholds. This approach aims to systematically evaluate the significance of AEs within the context of the Influenza vaccine use case, providing a more comprehensive understanding and enhancing the robustness and reliability of our findings.
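As one illustration of the disproportionality testing such an analysis might employ, a reporting odds ratio (ROR) with a 95% confidence interval can be computed from a 2x2 contingency table of report counts. The formula is a standard pharmacovigilance measure; the counts in the example are hypothetical and not drawn from this study.

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR with an approximate 95% CI for a 2x2 report-count table:
    a = target AE, target vaccine;  b = other AEs, target vaccine;
    c = target AE, other vaccines;  d = other AEs, other vaccines.
    Assumes all four counts are nonzero."""
    ror = (a * d) / (b * c)
    # Standard error of log(ROR) via the Woolf approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, (lower, upper)
```

A signal is conventionally flagged when the lower bound of the interval exceeds 1, which is one concrete form the "significance thresholds" mentioned above could take.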

Conclusion

In conclusion, our comprehensive exploration of LLMs in the context of NER, including the development of our specialized AE extraction model AE-GPT, highlights the profound implications of our findings and marks a significant achievement as the first paper to evaluate both pretrained and fine-tuned LLMs for NER. Our specialized fine-tuned model, AE-GPT, demonstrates the ability to tailor LLMs to domain-specific tasks, offering promising avenues for addressing real-world challenges, particularly the extraction of AE-related entities. Our research underlines the broader significance of LLMs in advancing natural language understanding and processing, with implications spanning various fields, from healthcare and biomedicine to information retrieval and beyond. As we continue to harness the potential of LLMs and refine their performance, we anticipate further breakthroughs that will drive innovation and enhance the utility of these models across diverse applications.

Abbreviations

AE

adverse event

AI

artificial intelligence

CDC

Centers for Disease Control and Prevention

FDA

Food and Drug Administration

GBS

Guillain-Barre syndrome

GPT

Generative Pre-trained Transformer

ICU

intensive care unit

LLM

Large Language Model

NER

named entity recognition

NLP

natural language processing

VAERS

Vaccine Adverse Event Reporting System

Data Availability

All data underlying the findings described in this manuscript are freely available at https://www.kaggle.com/datasets/yimingli99/ae-gpt-data.

Funding Statement

This article was partially supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health under Award Numbers R01AI130460 and U24AI171008. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Di Renzo L, Franza L, Monsignore D, Esposito E, Rio P, Gasbarrini A, et al. Vaccines, Microbiota and Immunonutrition: Food for Thought. Vaccines (Basel). 2022. Feb 15;10(2):294. doi: 10.3390/vaccines10020294 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Li Y, Lundin SK, Li J, Tao W, Dang Y, Chen Y, et al. Unpacking adverse events and associations post COVID-19 vaccination: a deep dive into vaccine adverse event reporting system data. Expert Review of Vaccines. 2024. Dec 31;23(1):53–9. doi: 10.1080/14760584.2023.2292203 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Vaccines and immunization [Internet]. [cited 2023 Sep 14]. Available from: https://www.who.int/health-topics/vaccines-and-immunization
  • 4.Vaccine Adverse Event Reporting System [Internet]. U.S. Department of Health and Human Services; About VAERS. Available from: https://vaers.hhs.gov/about.html
  • 5.Wide-ranging Online Data for Epidemiologic Research (CDC WONDER) [Internet]. Centers for Disease Control and Prevention (CDC); CDC WONDER. Available from: https://wonder.cdc.gov/
  • 6.Fraiman J, Erviti J, Jones M, Greenland S, Whelan P, Kaplan RM, et al. Serious adverse events of special interest following mRNA COVID-19 vaccination in randomized trials in adults. Vaccine. 2022. Sep 22;40(40):5798–805. doi: 10.1016/j.vaccine.2022.08.036 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Possible Side effects from Vaccines | CDC [Internet]. 2023 [cited 2023 Sep 14]. Available from: https://www.cdc.gov/vaccines/vac-gen/side-effects.htm
  • 8.McNeil MM, Weintraub ES, Duffy J, Sukumaran L, Jacobsen SJ, Klein NP, et al. Risk of anaphylaxis after vaccination in children and adults. J Allergy Clin Immunol. 2016. Mar;137(3):868–78. doi: 10.1016/j.jaci.2015.07.048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Strebel PM, Sutter RW, Cochi SL, Biellik RJ, Brink EW, Kew OM, et al. Epidemiology of Poliomyelitis in the United States One Decade after the Last Reported Case of Indigenous Wild Virus-Associated Disease. Clinical Infectious Diseases. 1992;14(2):568–79. doi: 10.1093/clinids/14.2.568 [DOI] [PubMed] [Google Scholar]
  • 10.Babazadeh A, Mohseni Afshar Z, Javanian M, Mohammadnia-Afrouzi M, Karkhah A, Masrour-Roudsari J, et al. Influenza Vaccination and Guillain–Barré Syndrome: Reality or Fear. J Transl Int Med. 2019. Dec 31;7(4):137–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Paterson P, Meurice F, Stanberry LR, Glismann S, Rosenthal SL, Larson HJ. Vaccine hesitancy and healthcare providers. Vaccine. 2016. Dec 20;34(52):6700–6. doi: 10.1016/j.vaccine.2016.10.042 [DOI] [PubMed] [Google Scholar]
  • 12.Machingaidze S, Wiysonge CS. Understanding COVID-19 vaccine hesitancy. Nat Med. 2021. Aug;27(8):1338–9. doi: 10.1038/s41591-021-01459-7 [DOI] [PubMed] [Google Scholar]
  • 13.Varricchio F, Iskander J, Destefano F, Ball R, Pless R, Braun MM, et al. Understanding vaccine safety information from the Vaccine Adverse Event Reporting System. Pediatr Infect Dis J. 2004. Apr;23(4):287–94. doi: 10.1097/00006454-200404000-00002 [DOI] [PubMed] [Google Scholar]
  • 14.Patricia Wodi A, Marquez P, Mba-Jonas A, Barash F, Nguon K, Moro PL. Spontaneous reports of primary ovarian insufficiency after vaccination: A review of the vaccine adverse event reporting system (VAERS). Vaccine. 2023. Feb 24;41(9):1616–22. doi: 10.1016/j.vaccine.2022.12.038 [DOI] [PubMed] [Google Scholar]
  • 15.Shimabukuro TT, Nguyen M, Martin D, DeStefano F. Safety monitoring in the Vaccine Adverse Event Reporting System (VAERS). Vaccine. 2015. Aug;33(36):4398–405. doi: 10.1016/j.vaccine.2015.07.035 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Li Y, Li J, Dang Y, Chen Y, Tao C. Temporal and Spatial Analysis of COVID-19 Vaccines Using Reports from Vaccine Adverse Event Reporting System. JMIR Preprints [Internet]. [cited 2023 Sep 11]; Available from: https://preprints.jmir.org/preprint/51007 doi: 10.2196/preprints.51007 [DOI] [Google Scholar]
  • 17.Du J, Xiang Y, Sankaranarayanapillai M, Zhang M, Wang J, Si Y, et al. Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning. Journal of the American Medical Informatics Association. 2021. Jul 1;28(7):1393–400. doi: 10.1093/jamia/ocab014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.y Arcas BA. Do Large Language Models Understand Us? Daedalus. 2022. May 1;151(2):183–97. [Google Scholar]
  • 19.Chen J, Liu Z, Huang X, Wu C, Liu Q, Jiang G, et al. When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities. 2023. [Google Scholar]
  • 20.Li Y, Bubeck S, Eldan R, Giorno AD, Gunasekar S, Lee YT. Textbooks Are All You Need II: phi-1.5 technical report. 2023. [Google Scholar]
  • 21.Monteith B, Sung M. TechRxiv. Preprint. 2023. Unleashing the Economic Potential of Large Language Models: The Case of Chinese Language Efficiency. Available from: 10.36227/techrxiv.23291831.v1 [DOI] [Google Scholar]
  • 22.Cheng K, Li Z, Li C, Xie R, Guo Q, He Y, et al. The Potential of GPT-4 as an AI-Powered Virtual Assistant for Surgeons Specialized in Joint Arthroplasty. Ann Biomed Eng. 2023. Jul 1;51(7):1366–70. doi: 10.1007/s10439-023-03207-z [DOI] [PubMed] [Google Scholar]
  • 23.Lee JS, Hsiang J. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information. 2020. Sep 1;62:101983. [Google Scholar]
  • 24.Biswas S. Prospective Role of Chat GPT in the Military: According to ChatGPT. Qeios [Internet]. 2023. Feb 27 [cited 2023 Aug 15]; Available from: https://www.qeios.com/read/8WYYOD [Google Scholar]
  • 25.Imamguluyev R. The Rise of GPT-3: Implications for Natural Language Processing and Beyond. International Journal of Research Publication and Reviews. 2023. Mar 3;4:4893–903. [Google Scholar]
  • 26.Ueda D, Walston SL, Matsumoto T, Deguchi R, Tatekawa H, Miki Y. Evaluating GPT-4-based ChatGPT’s Clinical Potential on the NEJM Quiz. medRxiv. 2023. Jan 1;2023.05.04.23289493. [Google Scholar]
  • 27.Roumeliotis KI, Tselikas ND, Nasiopoulos DK. Llama 2: Early Adopters’ Utilization of Meta’s New Open-Source Pretrained Model. Preprints. 2023;2023. [Google Scholar]
  • 28.Gong Y. Multilevel Large Language Models for Everyone. 2023. [Google Scholar]
  • 29.Hagendorff T. Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods. 2023. [Google Scholar]
  • 30.Hu Y, Ameer I, Zuo X, Peng X, Zhou Y, Li Z, et al. Zero-shot Clinical Entity Recognition using ChatGPT [Internet]. arXiv.org. 2023. Available from: doi: 10.48550/arXiv.2303.16416 [DOI] [Google Scholar]
  • 31.Gringeri M, Battini V, Cammarata G, Mosini G, Guarnieri G, Leoni C, et al. Herpes zoster and simplex reactivation following COVID-19 vaccination: new insights from a vaccine adverse event reporting system (VAERS) database analysis. Expert Rev Vaccines. 2022. May;21(5):675–84. doi: 10.1080/14760584.2022.2044799 [DOI] [PubMed] [Google Scholar]
  • 32.VAERS—Data [Internet]. [cited 2023 Aug 17]. Available from: https://vaers.hhs.gov/data.html
  • 33.Vega-Briceño LE, Abarca V K, Sánchez D I. [Flu vaccine in children: state of the art]. Rev Chilena Infectol. 2006. Jun;23(2):164–9. [DOI] [PubMed] [Google Scholar]
  • 34.Centers for Disease Control and Prevention [Internet]. 2023 [cited 2023 Aug 11]. Benefits of Flu Vaccination During 2021–2022 Flu Season. Available from: https://www.cdc.gov/flu/about/burden-averted/2021-2022.htm
  • 35.Ferdinands JM, Thompson MG, Blanton L, Spencer S, Grant L, Fry AM. Does influenza vaccination attenuate the severity of breakthrough infections? A narrative review and recommendations for further research. Vaccine. 2021. Jun 23;39(28):3678–95. doi: 10.1016/j.vaccine.2021.05.011 [DOI] [PubMed] [Google Scholar]
  • 36.Du J, Cai Y, Chen Y, Tao C. Trivalent influenza vaccine adverse symptoms analysis based on MedDRA terminology using VAERS data in 2011. Journal of Biomedical Semantics. 2016. May 13;7(1):13. [Google Scholar]
  • 37.Wang DJ, Boltz DA, McElhaney J, McCullers JA, Webby RJ, Webster RG. No evidence of a link between influenza vaccines and Guillain–Barre syndrome–associated antiganglioside antibodies. Influenza Other Respir Viruses. 2012. May;6(3):159–66. doi: 10.1111/j.1750-2659.2011.00294.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Cw A, Bc J, Jd L. The Guillain-Barré syndrome: a true case of molecular mimicry. Trends in immunology [Internet]. 2004. Feb [cited 2023 Aug 12];25(2). Available from: https://pubmed.ncbi.nlm.nih.gov/15102364/ [DOI] [PubMed] [Google Scholar]
  • 39.Vellozzi C, Burwen DR, Dobardzic A, Ball R, Walton K, Haber P. Safety of trivalent inactivated influenza vaccines in adults: Background for pandemic influenza vaccine safety monitoring. Vaccine. 2009. Mar 26;27(15):2114–20. doi: 10.1016/j.vaccine.2009.01.125 [DOI] [PubMed] [Google Scholar]
  • 40.Grabenstein JD. Guillain-Barré Syndrome and Vaccination: Usually Unrelated. Hosp Pharm. 2000. Feb 1;35(2):199–207. [Google Scholar]
  • 41.Vucic S, Kiernan MC, Cornblath DR. Guillain-Barré syndrome: an update. J Clin Neurosci. 2009. Jun;16(6):733–41. [DOI] [PubMed] [Google Scholar]
  • 42.Yu RK, Usuki S, Ariga T. Ganglioside Molecular Mimicry and Its Pathological Roles in Guillain-Barré Syndrome and Related Diseases. Infect Immun. 2006. Dec;74(12):6517–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Natuva R, Puppala SSS. Chat GPT—a Boon or Bane to Academic Cardiology? Indian Journal of Clinical Cardiology. 2023. Aug 17;26324636231185644. [Google Scholar]
  • 44.Hou W, Ji Z. GeneTuring tests GPT models in genomics. bioRxiv. 2023. Mar 13;2023.03.11.532238. doi: 10.1101/2023.03.11.532238 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Topal MO, Bas A, van Heerden I. Exploring Transformers in Natural Language Generation: GPT, BERT, and XLNet. [Google Scholar]
  • 46.Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023. Mar 19;11(6):887. doi: 10.3390/healthcare11060887 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Schneider ETR, de Souza JVA, Gumiel YB, Moro C, Paraiso EC. A GPT-2 Language Model for Biomedical Texts in Portuguese. In: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS). 2021. p. 474–9. [Google Scholar]
  • 48.Olmo A, Sreedharan S, Kambhampati S. GPT3-to-plan: Extracting plans from text using GPT-3. 2021. [Google Scholar]
  • 49.Gokaslan A, Cohen V. OpenWebText Corpus. [Google Scholar]
  • 50.Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are Few-Shot Learners. In: Advances in Neural Information Processing Systems [Internet]. Curran Associates, Inc.; 2020. [cited 2023 Aug 24]. p. 1877–901. Available from: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html [Google Scholar]
  • 51.Koubaa A. GPT-4 vs. GPT-3.5: A Concise Showdown. Preprints [Internet]. 2023. Mar 24; Available from: doi: 10.20944/preprints202303.0422.v1 [DOI] [Google Scholar]
  • 52.Budzianowski P, Vulić I. Hello, It’s GPT-2 –How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems. 2019. [Google Scholar]
  • 53.Ghojogh B, Ghodsi A. Attention mechanism, transformers, BERT, and GPT: tutorial and survey. 2020; [Google Scholar]
  • 54.Vijayasarathi M S. A, Tanuj G. Application of ChatGPT in medical science and Research. 2023. Jul 1;3:1480–3. [Google Scholar]
  • 55.Roy K, Zi Y, Narayanan V, Gaur M, Sheth A. Knowledge-Infused Self Attention Transformers. 2023. [Google Scholar]
  • 56.Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. [Google Scholar]
  • 57.OpenAI GPT2 [Internet]. [cited 2023 Sep 8]. Available from: https://huggingface.co/docs/transformers/model_doc/gpt2
  • 58.Tarlaci F. GPT2sQA [Internet]. 2023. [cited 2023 Sep 6]. Available from: https://github.com/ftarlaci/GPT2sQA [Google Scholar]
  • 59.Austin J. The Book of Endless History: Authorial Use of GPT2 for Interactive Storytelling. In: Cardona-Rivera RE, Sullivan A, Young RM, editors. Interactive Storytelling. Cham: Springer International Publishing; 2019. p. 429–32. (Lecture Notes in Computer Science). [Google Scholar]
  • 60.Li Y., Peng X., Li J., Peng S., Pei D., Tao C., et al. Development of a Natural Language Processing Tool to Extract Acupuncture Point Location Terms. In: 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI) [Internet]. 2023. p. 344–51. Available from: doi: 10.1109/ICHI57859.2023.00053 [DOI] [Google Scholar]
  • 61.Li Y, Tao W, Li Z, Sun Z, Li F, Fenton S, et al. Artificial intelligence-powered pharmacovigilance: A review of machine and deep learning in clinical text-based adverse drug event detection for benchmark datasets. J Biomed Inform. 2024. Mar 4;104621. doi: 10.1016/j.jbi.2024.104621 [DOI] [PubMed] [Google Scholar]
  • 62.Luo C, Jiang Y, Du J, Tong J, Huang J, Lo Re V, et al. Prediction of post‐vaccination Guillain‐Barré syndrome using data from a passive surveillance system. Pharmacoepidemiol Drug Saf. 2021. May;30(5):602–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Du J, Cai Y, Chen Y, He Y, Tao C. Analysis of Individual Differences in Vaccine Pharmacovigilance Using VAERS Data and MedDRA System Organ Classes: A Use Case Study With Trivalent Influenza Vaccine. Biomed Inform Insights. 2017. Apr 11;9:1178222617700627. doi: 10.1177/1178222617700627 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Vincenzo Bonnici

27 Dec 2023

PONE-D-23-33399
AE-GPT: Using Large Language Models to Extract Adverse Events from Surveillance Reports-A Use Case with Influenza Vaccine Adverse Events
PLOS ONE

Dear Dr. Tao,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Feb 10 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vincenzo Bonnici, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure: 

"This article was partially supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health under Award Numbers R01AI130460 and U24AI171008."

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." 

If this statement is not correct you must amend it as needed. 

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

"This article was partially supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health under Award Numbers R01AI130460 and U24AI171008."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"This article was partially supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health under Award Numbers R01AI130460 and U24AI171008."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. In the online submission form, you indicated that [Data are available upon request with proper IRB approval and DUA.]. 

All PLOS journals now require all data underlying the findings described in their manuscript to be freely available to other researchers, either 1. In a public repository, 2. Within the manuscript itself, or 3. Uploaded as supplementary information.

This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If your data cannot be made publicly available for ethical or legal reasons (e.g., public availability would compromise patient privacy), please explain your reasons on resubmission and your exemption request will be escalated for approval. 


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article presents a study that explores the capabilities of the most renowned Large Language Models in addressing the clinical Named Entity Recognition (NER) problem in a zero-shot context. The authors employ Large Language Models, starting from pharmacovigilance reports collected on VAERS, to identify and categorize adverse events.

However, the article is not presented adequately, especially in the introduction and materials and methods sections. In the introduction, the authors focus too much on explaining the adopted LLMs and too little on the problem to be solved and the literature related to it. There is no mention of the subsequent phase of identifying statistically significant AEs. Systems for investigating and studying VAERS reports are mentioned but not explained. The motivations for choosing the influenza vaccine are reported in the materials and methods, but it would be preferable to at least allude to them in the introduction, which is currently too focused on LLMs (methods). The definition of NER in a zero-shot context is unclear or absent.

The Experimental Setup section seems more like a draft. In the "Dataset Split" subsection, I would replace "20%" with the exact number. In the "Pretrained Models" section, tuning of the temperature and max token settings is performed, but these parameters are not introduced.

The displayed tables have almost no captions, failing to specify the LLM to which they refer. In the post-processing phase, it is not clear how nested entities are treated.

Regarding the statistical significance of the results, the entire dataset contains few reports (91). Is it possible to expand the dataset? Are there no other reports for the influenza vaccine, or are they not suitable, and why?

Only one dataset split into a training set and a validation set is performed, and the F1 score is calculated only once to assess performance on the validation set. The significance of the F1 score performance could be overly dependent on the data used for validation or training, given the limited dataset size. It would be interesting to perform multiple iterations for a more robust evaluation.

Finally, the paper presents an interesting study, and the results seem promising, but the exposition of the work is not adequate.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Mar 21;19(3):e0300919. doi: 10.1371/journal.pone.0300919.r002

Author response to Decision Letter 0


14 Feb 2024

Comments from Reviewers:

We would like to thank the reviewers for their constructive comments. We have carefully addressed them. Please see our point-by-point responses below.

Reviewer #1: The article presents a study that explores the capabilities of the most renowned Large Language Models in addressing the clinical Named Entity Recognition (NER) problem in a zero-shot context. The authors employ Large Language Models, starting from pharmacovigilance reports collected on VAERS, to identify and categorize adverse events.

However, the article is not presented adequately, especially in the introduction and materials and methods sections. In the introduction, the authors focus too much on explaining the adopted LLMs and too little on the problem to be solved and the literature related to it.

Thank you for the constructive feedback. While we understand the reviewer's concerns, the primary focus of this paper is to explore and compare the performance of different pretrained and fine-tuned large language models (LLMs), including GPTs and Llamas, specifically in the context of Named Entity Recognition (NER). We aim to present a comprehensive analysis of LLMs' capabilities in NER tasks, with adverse events following the influenza vaccine serving as a relevant use case. We believe that maintaining the emphasis on LLMs in the introduction aligns with the core objectives of our study; nevertheless, we are open to further clarifying the connection between our primary research goal and the use case provided.

There is no mention of the subsequent phase of identifying statistically significant AEs.

To address the concern "There is no mention of the subsequent phase of identifying statistically significant AEs.", we have incorporated a section in the discussion that explicitly outlines our approach to identifying statistically significant AEs in the subsequent phases of our study. It elaborates on the statistical methods, significance thresholds, and any adjustments made for multiple comparisons, ensuring a comprehensive and transparent presentation of our analytical approach. This addition clarifies our methodology for identifying and interpreting statistically significant AEs, enhancing the overall robustness and completeness of our study.

Systems for investigating and studying VAERS reports are mentioned but not explained.

In our study, we consider VAERS reports as a valuable dataset in the NER task. While we briefly referred to 'systems' in the context of related studies, it's important to note that our primary objective is to assess the capabilities of LLMs in handling adverse event information extraction. The mention of systems in this context serves to acknowledge prior research that utilized VAERS reports as a dataset for similar investigations. We will ensure that the manuscript explicitly highlights the dataset's role and its distinction from the main focus of our study, which is the performance evaluation of LLMs in the context of AE extraction.

The motivations for choosing the influenza vaccine are reported in the materials and methods, but it would be preferable to at least allude to them in the introduction, which is currently too focused on LLMs (methods). The definition of NER in a zero-shot context is unclear or absent.

We have addressed the comment “The motivations for choosing the influenza vaccine are reported in the materials and methods, but it would be preferable to at least allude to them in the introduction, which is currently too focused on LLMs (methods).” and “The definition of NER in a zero-shot context is unclear or absent.” in the last paragraph of Introduction.

The Experimental Setup section seems more like a draft. In the "Dataset Split" subsection, I would replace "20%" with the exact number. In the "Pretrained Models" section, tuning of the temperature and max token settings is performed, but these parameters are not introduced.

We have updated this information in the "Dataset Split" and "Pretrained model inference" subsections. Thank you for your suggestion.

The displayed tables have almost no captions, failing to specify the LLM to which they refer.

We edited Table 3 and Table 4 to address your comment “The displayed tables have almost no captions, failing to specify the LLM to which they refer.”

In the post-processing phase, it is not clear how nested entities are treated.

In response to the reviewer's comment regarding the treatment of nested entities in the post-processing phase, we appreciate the opportunity to clarify our approach. As detailed in the manuscript, our strategy involves addressing instances of nested entities, as exemplified in Figure 5 ("Muscle strength" vs "Muscle strength decreased"). To effectively manage this, we prioritize entities with the longest spans while excluding the nested entities from consideration. For instance, in the scenarios outlined in Figure 5, the investigation term "Muscle strength" is omitted, resulting in the final output of the nervous_AE being "Muscle strength decreased." This systematic approach ensures a streamlined and accurate representation of entities within the given context, enhancing the robustness of our post-processing methodology.
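The longest-span rule described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' actual post-processing code; `resolve_nested` and its `(start, end, text, label)` tuple format are hypothetical choices made for this example.

```python
def resolve_nested(entities):
    """Keep only the longest spans; drop any entity fully nested inside
    an already-kept longer span. Each entity is (start, end, text, label)."""
    kept = []
    # Consider longer spans first so nested (shorter) spans are discarded.
    for ent in sorted(entities, key=lambda e: e[1] - e[0], reverse=True):
        start, end = ent[0], ent[1]
        if any(k[0] <= start and end <= k[1] for k in kept):
            continue  # fully contained in a longer kept span -> drop
        kept.append(ent)
    return kept

# The Figure 5 example: "Muscle strength" is nested inside
# "Muscle strength decreased", so only the longer span survives.
entities = [
    (10, 25, "Muscle strength", "Investigation"),
    (10, 35, "Muscle strength decreased", "nervous_AE"),
]
print(resolve_nested(entities))
# → [(10, 35, 'Muscle strength decreased', 'nervous_AE')]
```

Sorting by span length before filtering ensures that containment is always checked against spans that could actually enclose the current one.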

Regarding the statistical significance of the results, the entire dataset contains few reports (91). Is it possible to expand the dataset? Are there no other reports for the influenza vaccine, or are they not suitable, and why?

In response to the reviewer's comment regarding the size of the dataset, we appreciate the valuable suggestion to expand the dataset. While it is feasible to increase the dataset size to potentially enhance results, it's essential to note that one of the primary objectives of our study is to compare the performance of LLMs with traditional language models. To maintain consistency and enable a meaningful comparison, we utilized the dataset from Du et al.'s study, titled "Extracting postmarketing adverse events from safety reports in the VAERS using deep learning." This shared dataset allows us to directly compare our results with those obtained using similar methodologies, ensuring a fair evaluation of the LLM's performance against traditional models.

Only one dataset split into a training set and a validation set is performed, and the F1 score is calculated only once to assess performance on the validation set. The significance of the F1 score performance could be overly dependent on the data used for validation or training, given the limited dataset size. It would be interesting to perform multiple iterations for a more robust evaluation.

We conducted ten iterations of our experiments; regrettably, this information was omitted from the initial reporting. We have now rectified this omission by adding a note below Table 3 and Table 4 acknowledging the ten iterations. We appreciate the reviewer's keen observation.
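A repeated-holdout evaluation of the kind discussed here could be organized as below. This is a sketch under assumptions: `repeated_holdout` and its `evaluate` callable (which would fine-tune a model on `train` and return an F1 on `val`) are hypothetical, and the paper's actual iteration procedure may differ.

```python
import random
import statistics

def repeated_holdout(reports, evaluate, n_iter=10, test_frac=0.2, seed=0):
    """Run n_iter random train/validation splits and summarize the F1 scores.

    `evaluate(train, val)` is a placeholder for model fine-tuning and scoring;
    returns (mean F1, standard deviation of F1) across iterations.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        shuffled = list(reports)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))  # e.g. 72 of 91 reports
        train, val = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, val))
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting the mean and standard deviation over iterations, rather than a single F1 score, makes the result less dependent on any one split of a small dataset.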

Finally, the paper presents an interesting study, and the results seem promising, but the exposition of the work is not adequate.

We believe that the manuscript details a technically sound investigation into the effectiveness of Large Language Models (LLMs) in identifying and cataloging adverse events (AEs) from VAERS, using influenza vaccines as a use case. Rigorous experimentation was undertaken, employing various prevalent LLMs such as GPT-2, GPT-3 variants, GPT-4, and Llama2, with a particular focus on the fine-tuned GPT 3.5 model (AE-GPT), given that GPT-4 was not available for fine-tuning. The achieved 0.704 averaged micro F1 score for strict match and 0.816 for relaxed match attests to the performance of our methodology and the suitability of our chosen models. The conclusions drawn in the manuscript are a direct reflection of the meticulously collected data, supporting the assertion that LLMs, particularly AE-GPT, exhibit promising potential for advanced AE detection.
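The strict and relaxed matching criteria behind these F1 scores can be illustrated compactly. This sketch assumes strict match means exact boundaries with the same entity type and relaxed match means any character overlap with the same type; the paper's exact definitions may differ, and `span_f1` is a name chosen for this example.

```python
def span_f1(pred, gold, relaxed=False):
    """Micro F1 over entity spans given as (start, end, type) tuples."""
    def match(p, g):
        if p[2] != g[2]:                        # entity types must agree
            return False
        if relaxed:
            return p[0] < g[1] and g[0] < p[1]  # any character overlap
        return p[0] == g[0] and p[1] == g[1]    # exact boundaries
    tp_p = sum(any(match(p, g) for g in gold) for p in pred)
    tp_g = sum(any(match(p, g) for p in pred) for g in gold)
    precision = tp_p / len(pred) if pred else 0.0
    recall = tp_g / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Under this definition, a prediction whose boundary is off by one token scores zero under strict matching but full credit under relaxed matching, which is why relaxed F1 (0.816) exceeds strict F1 (0.704).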

Attachment

Submitted filename: A rebuttal letter .docx

pone.0300919.s001.docx (15.7KB, docx)

Decision Letter 1

Vincenzo Bonnici

7 Mar 2024

AE-GPT: Using Large Language Models to Extract Adverse Events from Surveillance Reports-A Use Case with Influenza Vaccine Adverse Events

PONE-D-23-33399R1

Dear Dr. Tao,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at http://www.editorialmanager.com/pone/ and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vincenzo Bonnici, PhD

Academic Editor

PLOS ONE

Acceptance letter

Vincenzo Bonnici

12 Mar 2024

PONE-D-23-33399R1

PLOS ONE

Dear Dr. Tao,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vincenzo Bonnici

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: A rebuttal letter .docx

    pone.0300919.s001.docx (15.7KB, docx)

    Data Availability Statement

    All data underlying the findings described in this manuscript are freely available at https://www.kaggle.com/datasets/yimingli99/ae-gpt-data.

