Abstract
Background
As clinical trials scale up and grow more complex, researchers are facing mounting challenges, including inefficient participant recruitment, complex data management, and limited risk monitoring. These issues not only increase the workload for clinical researchers but also compromise trial reliability and safety, potentially elevating the risk of trial failure. Large language models (LLMs), as an emerging technology in natural language processing (NLP), exhibit notable advantages across various tasks, such as information extraction and relation classification.
Main text
With domain-specific pre-training and fine-tuning, LLMs present promising potential in clinical trial tasks such as automated patient-trial matching and the extraction and processing of trial data, which are anticipated to reduce time and financial costs. Additionally, they offer valuable insights for scientific rationale, medical decision-making, and trial endpoint prediction. In this context, an increasing number of studies have begun to explore the applications of LLMs in the design and conduct of clinical trials.
Conclusion
This paper provides a review of LLM applications in clinical trials with an emphasis on real-world integration. Comparative advantages over traditional NLP models, technical limitations, and future implementation challenges are also discussed. This narrative review aims to highlight the potential of LLMs in clinical trial workflows and clarify key challenges and future directions.
Keywords: Clinical trials, Large language models, Natural language processing, LLMs, Clinical data management
Background
As the most rigorous and widely adopted reference standard for evaluating medical interventions, clinical trials play a pivotal role in several key areas, including advancing medical progress, verifying the efficacy of treatment regimens, and ensuring the safety of medications for patients. Since 1747, when James Lind conducted the world’s first systematically documented randomized controlled trial (RCT), testing citrus fruit for the treatment of scurvy, the standardized design and conduct of clinical trials have not only become a cornerstone of evidence-based medicine but have also provided a reliable scientific foundation for modern medical innovations [1]. By implementing rigorously controlled study designs, standardized data collection processes, and methodologically sound statistical analyses, clinical trials can minimize research bias and provide robust medical evidence to support evidence-based decisions in clinical practice. In the twenty-first century, the number of clinical trials has increased dramatically. According to data from ClinicalTrials.gov, the number of registered trials increased from 2119 in 2000 to 477,220 in 2023, representing a compound annual growth rate of over 9% [2].
With ongoing innovation and rapid advancements in experimental drugs and therapeutic regimens [3–6], the challenges faced by clinical trials are becoming increasingly complex, drawing widespread attention from both the medical community and society at large. One of the most prominent challenges is recruiting qualified participants in compliance with the trial protocol within the preset timeline and ensuring their continued participation [7]. A systematic study of 151 RCTs in the UK showed that up to 44% of trials did not achieve their intended participant recruitment goals, and a median of 11% of participants discontinued follow-up for various reasons [8]. Accurately matching clinical trials with potential participants also presents challenges. A study on cancer clinical trials demonstrated that over half (55.6%) of patients were ineligible for participation at the healthcare institution where they received treatment, and an additional 21.5% were excluded for failing to meet enrollment criteria [9]. Assessing treatment effects based on clinical indicators, particularly distinguishing the effects of therapeutic interventions from the natural progression of the disease, represents another challenge in clinical trials [10]. This requires researchers to establish multidimensional data integration and analysis systems and to systematically track and monitor participants’ long-term clinical performance. Additionally, several critical challenges in clinical trials urgently require resolution, including establishing a trial risk assessment system, real-time monitoring of adverse events (AEs) [11–13], the standardized management of vast clinical data, and the safeguarding of participants’ private data, all of which demand innovative solutions from next-generation information technologies.
In recent years, artificial intelligence (AI) technologies, including machine learning (ML) [14, 15], deep learning (DL) [16], computer vision (CV), and the Internet of Medical Things (IoMT), have made substantial progress in the field of clinical trials. These advances include the use of computer vision for cellular imaging analysis to assess the immunological response of subjects following drug treatment [17], the application of DL to analyze retinal angiography images in clinical trials to evaluate hemodynamic features [18], and the use of random forest algorithms to predict the approval likelihood of drugs in clinical trials for chronic obstructive pulmonary disease [19]. As one of the essential branches of AI, natural language processing (NLP) technology uses computational models to process large-scale text, enabling both semantic understanding and generation of natural language [20]. Researchers have applied NLP technology to clinical documents, such as electronic health records (EHRs), to transform unstructured data into clinical insights [21]. Since the advent of generative pre-trained transformer-1 (GPT-1) and bidirectional encoder representations from transformers-base (BERT-base) in 2018, large language models (LLMs), as a class of pre-trained models characterized by a large number of parameters, have had a revolutionary impact on the AI field. Based on the Transformer architecture and large-scale data training, LLMs exhibit emergent abilities in contextual understanding, instruction following, and deductive reasoning, demonstrating exceptional interactive capabilities and professional proficiency in NLP tasks such as named entity recognition (NER), relation extraction (RE), and relation classification (RC) [22, 23].
Foundational LLMs, exemplified by GPT, BERT, and LLaMA, following domain-specific pre-training and subsequent medical fine-tuning, have been widely applied in clinical practice, including medical inquiry, clinical decision support, and the interpretation of medical imaging and electrocardiogram reports [24–33]. However, their systematic evaluation and integration in clinical trials remain in early stages. Current research mainly focuses on limited domains such as subject screening and participant data extraction, with little exploration and optimization in other aspects of clinical trials. A comprehensive review is therefore necessary to understand their potential contributions across the various stages of clinical trials.
This review presents the relevant applications of LLMs in the design and conduct stages of clinical trials, technical advances, and future challenges, across four main dimensions: (1) the applications of LLMs in clinical trial design, specifically involving three key aspects: extracting and analyzing research elements and reasons for termination to inform scientific rationale and protocol development, optimizing eligibility criteria and evaluating the association between criteria complexity and trial termination risk, and enhancing research ethics and informed consent processes; (2) the deployment of LLMs in the conduct of clinical trials, covering four key stages: participant recruitment and screening, data acquisition and management, safety monitoring, and trial outcome prediction; (3) the advantages of LLMs in comparison with traditional NLP approaches, their existing technical limitations, and the corresponding technical strategies; (4) the challenges LLMs encounter in clinical trial settings, including data privacy, quality and blinding control, model transparency and trustworthiness, legal and regulatory concerns, and standardization of evaluation and application, as well as corresponding potential solutions. This study comprehensively summarizes the current and projected applications of LLMs in clinical trial workflows. This review aims to help clinical researchers better understand the potential applications of LLMs across various stages of clinical trials, while the discussion of technical characteristics and implementation challenges provides valuable insights for the promotion and future development of LLMs in this context.
LLMs in clinical trial design
In this section, we focus on the applications of LLMs in three key aspects: (1) protocol design and scientific rationale, (2) recruitment and exclusion criteria, and (3) ethics and informed consent. We outline how LLMs can assist in extracting and synthesizing essential trial components, such as PICO elements, thereby supporting the construction of scientifically grounded protocols. The discussion then turns to the optimization of eligibility criteria, emphasizing the potential of LLMs to analyze and flag overly complex or exclusionary terms linked to trial failure, and retrieve historical patterns to support recruitment design. Furthermore, we explore the potential of LLMs to identify ethical risks and enhance the customization and clarity of informed consent materials, particularly by simplifying technical language for laypersons (Fig. 1).
Fig. 1.
Applications of large language models in clinical trial design. The left boxes illustrate three steps of clinical trial design and their specific contents, in the following top-to-bottom order: establishment of the research background and objectives, protocol development, and ethical approval with informed consent. The right boxes demonstrate how large language models (LLMs) can assist researchers in optimizing and accelerating specific tasks in each design phase
Scientific rationale and protocol design
Extracting and analyzing previous research elements
During clinical trial design, researchers typically review previous studies to extract key information, including Patient, Intervention, Comparison, and Outcome (PICO) elements, efficacy and safety data, eligibility criteria, and reasons for trial termination, in order to inform hypothesis generation and protocol development. However, with the exponential growth of medical literature in recent years, researchers are increasingly challenged to stay abreast of contemporary advancements [34]. Additionally, extracting key trial information from abstracts or full texts of research reports often proves to be both time-consuming and labor-intensive [35]. Several applications, such as RobotReviewer [36], ResearchScreener [37], DistillerSR [38], and Abstrackr [39], have already been developed to extract information from scientific articles or abstracts, assess research quality, and draw inferences regarding treatment effects. However, these technologies are primarily based on traditional semi-automated frameworks (e.g., logistic regression, support vector machines), which face limitations such as limited flexibility and challenges in generalizability.
As an emerging NLP technology, LLMs are trained on large-scale datasets using unsupervised learning techniques and are able to extract and analyze key elements from prior studies with higher efficiency and accuracy. For example, Lee et al. developed a GPT-4-based pipeline, named SEETrials, which is capable of automatically extracting safety and efficacy data from the abstracts of multiple myeloma clinical trials [40]. Whitton and Hunter utilized a fine-tuned BERT model for the automatic extraction of key findings from RCT reports and the automatic generation of corresponding tables [41].
The PICO framework serves as a fundamental component in formulating clinical questions. The capability of LLMs to automatically extract PICO information from clinical trial reports not only accelerates comprehensive analyses of previous studies but also facilitates the formation of new research hypotheses. Mutinda et al. utilized BioBERT to extract PICO information, standardized it using the Unified Medical Language System (UMLS), and ultimately converted it into structured data, facilitating subsequent statistical analyses and visualization [42]. Utilizing domain-specific training datasets can notably enhance the completeness and accuracy of models in PICO information extraction tasks. Wang et al. compared general-purpose and mixed-domain pretrained BERT models; those pretrained specifically for the medical domain exhibited significant advantages in the accuracy of PICO extraction [43]. Few-shot learning (FSL) has also been demonstrated to be an effective approach in low-resource settings. Ghosh et al. developed AlpaPICO, a PICO information extraction model that leverages in-context learning with minimal prompting. Unlike traditional methods that require additional training, the researchers input annotated context into AlpaCare to bolster its performance in low-resource environments, thereby bypassing the need for complex supervised training and fine-tuning [44].
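To make the target output of such pipelines concrete, the toy sketch below substitutes simple cue-phrase rules for a trained BERT extractor and emits the same kind of structured PICO record. The patterns and the example abstract are purely illustrative, not any published method.

```python
import re

# Toy stand-in for a learned PICO extractor: the BERT-based systems described
# above learn these mappings from annotated corpora; here hand-written cue
# phrases illustrate the structured record such a pipeline produces.
PICO_CUES = {
    "population": r"(?:patients?|adults?|participants?) with ([\w\s-]+?)(?:,|\.| were| received)",
    "intervention": r"(?:received|treated with|randomized to) ([\w\s-]+?)(?:,|\.| or )",
    "comparison": r"(?:versus|compared with|or) ([\w\s-]+?)(?:,|\.)",
    "outcome": r"(?:primary (?:outcome|endpoint) was) ([\w\s-]+?)(?:,|\.)",
}

def extract_pico(abstract: str) -> dict:
    """Return a dict of PICO elements found in a trial abstract (None if absent)."""
    results = {}
    for element, pattern in PICO_CUES.items():
        match = re.search(pattern, abstract, flags=re.IGNORECASE)
        results[element] = match.group(1).strip() if match else None
    return results

abstract = ("Adults with chronic heart failure were randomized to sacubitril "
            "or placebo. The primary outcome was cardiovascular mortality.")
pico = extract_pico(abstract)
```

The downstream steps described above (UMLS normalization, tabulation) would then operate on the `pico` dictionary rather than on free text.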
Guiding research direction through evidence synthesis
In the frontier areas of medicine where consensus has yet to be reached, contradictory research findings frequently arise. Building an evidence-based medical system rooted in diverse evidence helps guide researchers in identifying gaps in medical knowledge [45]. Traditionally, identifying and resolving such contradictions requires extensive manual review of the literature. LLMs can assist researchers in capturing differences in conclusions across studies on similar topics, while offering high-quality, multidimensional evidence-based medical insights. Xie et al. employed the GPT-4-1106-preview model to evaluate ChatGPT’s effectiveness in identifying conflicting arguments. Tests conducted on a PubMed dataset demonstrated that the model achieved a recall of 0.903 in detecting inconsistencies in ternary assertions. Notably, ChatGPT’s reasoning when interpreting medical evidence does not depend on manual annotation; rather, it relies on its inherent logical reasoning capabilities to assess the relevance of each claim to the core hypothesis of the issue [46]. This study highlights ChatGPT’s objectivity and critical approach in evaluating conflicting medical evidence, offering novel methodological support for clinical trials grounded in evidence-based rationale.
Learning from historical trial terminations
LLMs can systematically analyze the factors influencing trial continuation and summarize termination reasons from historical trials, providing valuable insights for clinical researchers seeking to optimize the design phase and thereby improve completion or success rates. Wang et al. devised an innovative approach by applying the sciBERT-NN model to a large dataset comprising 76,950 clinical trial records from ClinicalTrials.gov and MEDLINE, aiming to predict the likelihood of clinical trial publication and analyze characteristics influencing trial completion [47]. Razuvayevskaya et al. employed a fine-tuned BERT model to perform a comprehensive analysis and classification of 28,561 prematurely terminated clinical trials from the ClinicalTrials.gov database. This model systematically summarized the reasons for clinical trial failure—99% of prematurely terminated trials could be grouped based on their reasons for failure within a classification framework comprising six primary categories and 15 subcategories. Results revealed a critical finding: drug trials terminated at various stages generally lacked robust genetic support [48]. The analysis of common features of terminated trials using LLMs encourages researchers to more precisely explore and validate potential targets at the molecular level, offering novel insights into clinical trial design.
Recruitment and exclusion criteria
Designing appropriate eligibility criteria is one of the most challenging tasks in clinical trial planning. Researchers typically define the target population based on study objectives, disease characteristics, and the nature of the intervention. At the same time, individuals with potential safety risks or factors that may interfere with outcome assessment must be excluded to ensure internal validity and control for confounding variables. This process often involves referencing eligibility criteria from prior similar studies and clinical practice guidelines and is usually refined through multidisciplinary discussions [49]. Moreover, trial designers need to strike a delicate balance between scientific rigor and practical feasibility, especially in the absence of established precedents, since overly restrictive criteria may inadvertently exclude patients who could otherwise benefit from participation, reducing the generalizability of findings and contributing to recruitment delays or even trial terminations.
On the one hand, LLMs can analyze the complexity of eligibility criteria and their association with trial termination risks and identify specific terms within recruitment criteria that are strongly linked to trial failure. Peterson et al. applied ScispaCy to analyze the complexity of eligibility criteria across thousands of clinical trials. They found that from 2008 to 2018, the median number of unique words in eligibility criteria increased by 95%, while trial termination rates rose by 17.6%. ScispaCy also uncovered linguistic patterns associated with high termination risks. For example, exclusion criteria in neuropsychiatric trials containing medical terms such as “brain,” “central nervous system,” and “psychiatric” were strongly linked to higher trial failure rates [50]. This insight highlights that LLMs have the potential to offer early warnings about the risks of overly complex or exclusionary criteria, enabling investigators to proactively revise them before trial initiation.
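The complexity signal tracked by Peterson et al. can be approximated with a few lines of text processing. The sketch below computes the unique-word count they reported and flags the risk-associated terms quoted above; the term list and example criteria are illustrative, not the published method.

```python
import re

def criteria_complexity(criteria_text: str) -> dict:
    """Crude complexity profile of an eligibility-criteria section:
    unique-word count plus a flag for risk-associated terms.
    The risk-term list is illustrative (examples cited in the text)."""
    words = re.findall(r"[a-z]+", criteria_text.lower())
    risk_terms = {"brain", "psychiatric"}
    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "risk_terms_present": sorted(set(words) & risk_terms),
    }

example = ("Exclusion: prior psychiatric hospitalization; "
           "known brain metastases; pregnancy.")
profile = criteria_complexity(example)
```

A designer could run such a profile over draft criteria and compare it against historical trials with similar complexity scores before finalizing the protocol.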
On the other hand, LLMs can automate the extraction of recruitment criteria from prior similar studies, offering trial designers structured, experience-based insights that help standardize eligibility criteria across similar trials. Datta et al. introduced a GPT-4-based model, AutoCriteria, which automates the extraction of recruitment eligibility criteria from clinical trial reports with high accuracy [51]. Lim et al. employed an LLM-assisted pipeline to systematically analyze phase III clinical trials involving heart failure with reduced ejection fraction (HFrEF). The quantitative analysis demonstrated that 69% of trials used the New York Heart Association functional classification as the primary inclusion criterion, while acute coronary syndrome and valvular heart disease were the most common exclusion criteria [52]. This finding holds significant reference value for future clinical studies related to HFrEF.
Ethics and informed consent
Identifying potential ethical issues
All clinical trials involving human subjects must undergo review by an institutional review board (IRB), which is responsible for safeguarding participant rights and evaluating the ethical acceptability of trial protocols [53]. Given lengthy review timelines [54] and the limited oversight capacity of IRBs [55], clinical trial investigators can utilize LLMs to conduct ethical self-assessments of trial designs prior to IRB submission. LLMs can learn and understand complex ethical issues, automatically assess the compliance of trials with local regulations, and provide decision support. Sridharan and Sivaramakrishnan conducted a pioneering study, which evaluated four LLMs (Google Bard, Claude, GPT-3.5, and GPT-4) on seven validated cases that covered multiple key ethical concerns, including recruitment eligibility criteria, concerns related to vulnerable populations, the disclosure of information in the informed consent, risk–benefit assessments, and the rationale for using placebos. The results of the study showed that all the evaluated LLMs were able to respond to all the questions posed in the given cases upon receiving prompts. Moreover, LLMs produced more specific and actionable suggestions when given more detailed prompts [56]. This study highlighted that, when given detailed prompts, LLMs may complement early-stage protocol development by helping researchers identify areas that could raise concerns during formal IRB review.
Optimizing informed consent
Obtaining fully informed consent from participants is the cornerstone of protecting their rights and safety in clinical trials. When drafting informed consent forms (ICFs), researchers often need to establish a framework, based on the clinical trial protocol, that includes the study objectives, participant responsibilities, and potential risks or benefits. However, due to the cognitive gap between patients and professionals, it is often challenging for participants to fully comprehend the medical terms and concepts presented in the ICFs [57, 58]. Furthermore, a patient’s racial and religious background may also influence their perception of certain clinical trials [59]. Therefore, after establishing the core framework of the ICFs, researchers often need to conduct a readability check of the drafted version and develop different versions or supplementary materials tailored to different populations.
Recent studies have highlighted the potential of LLMs to assist researchers in drafting personalized ICFs tailored to specific cases and to facilitate participants’ comprehension of abstract medical concepts and principles in ICFs. Sridharan and Sivaramakrishnan examined the ability of four LLMs to generate ICFs across seven ethically challenging cases. The results showed that all tested LLMs not only identified key ethical risks but also generated tailored ICFs that included risk disclosure and participant rights, adapted to the specific cases [56]. By annotating complex technical terms, adopting concise phrasing, and simplifying paragraph structure, LLMs might effectively reduce the difficulty participants face in understanding ICFs. Campillos-Llanos et al. employed the GPT-3.5 model to optimize clinical trial announcements for Spanish patients. Although 28.9% of the definitions still required further refinement, GPT-3.5 generally demonstrated satisfactory performance in defining and simplifying medical terminology [60]. Ali et al. applied GPT-4 to reduce the comprehension difficulty of the ICF from the college freshman level to the eighth-grade level. After optimization, the frequency of technical terms and passive voice in the simplified text notably decreased, and complex long sentences were effectively split into several shorter sentences. All simplified versions passed strict review by legal and medical professionals [61].
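A minimal sketch of how such a simplification request might be framed programmatically is shown below. The prompt wording and function name are assumptions for illustration; the studies above do not publish their exact prompts.

```python
def build_simplification_prompt(icf_text: str, grade_level: int = 8) -> str:
    """Assemble a prompt asking an LLM to rewrite informed consent text
    at a target reading level. Wording is illustrative only; the
    constraints mirror the changes reported by Ali et al. (fewer
    technical terms, active voice, shorter sentences)."""
    return (
        f"Rewrite the following informed consent text at a US grade-{grade_level} "
        "reading level. Replace technical terms with plain-language equivalents "
        "(keep the original term in parentheses), prefer active voice, and split "
        "long sentences. Do not remove any risk disclosure.\n\n"
        f"---\n{icf_text}\n---"
    )

prompt = build_simplification_prompt("Participants may experience transient erythema.")
```

In practice the generated draft would still require the legal and medical review described above before use with participants.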
LLMs in clinical trial conduct and operations
In this section, we review the applications of LLMs in the conduct and operational management of clinical trials, with a particular emphasis on four key stages: (1) participant recruitment and screening, (2) data collection and management, (3) safety monitoring, and (4) trial outcome prediction. We first review how LLMs facilitate automated patient-trial matching and eligibility screening with high accuracy and reduced costs, by transforming unstructured documents and aligning them with structured criteria. We then explore the potential of LLMs to assist in collecting and structuring trial data, improving data quality through automated processing and cleaning, and enabling data mapping across heterogeneous datasets. In the safety domain, LLMs enhance adverse event detection and drug-drug interaction tracking through natural language understanding and prompt-based inference. Moreover, we highlight their emerging role in predicting trial outcomes, including the simulated generation of trial scenarios (Fig. 2).
Fig. 2.
Applications of large language models in the conduct and operations of clinical trials. The four boxes at the top depict the main steps and related content of clinical trial conduct, in the following left-to-right order: participant recruitment, follow-up and data collection, safety monitoring, and data management and analysis. The boxes below outline specific tasks that large language models (LLMs) can be deployed to automate at each of these stages
Participant recruitment and eligibility screening
Insufficient patient recruitment is a primary cause of clinical trial termination [3]. Previous efforts, such as enhancing social media outreach [62] and providing incentives to participants [63], have partially addressed insufficient enrollment. However, the effectiveness of these methods remains uncertain. Patient recruitment typically begins with eligibility screening, a process in which researchers compare trial inclusion/exclusion criteria against patient information, most of which is stored as unstructured data in EHRs. This process is labor-intensive and time-consuming—on average, screening a single patient takes approximately 45 min [64, 65]. Against this backdrop, previous studies have attempted to leverage traditional AI to assist with patient–trial matching [66, 67]. However, there is a notable discrepancy between the unstructured free-text data in EHRs and the structured eligibility criteria. This inconsistency in data format severely hampers the efficiency of trial matching. In recent years, with the widespread adoption of standardized medical coding systems such as the International Classification of Diseases (ICD) [68], Current Procedural Terminology (CPT) [69], and Unified Medical Language System (UMLS) [70], the standardized expression of medical concepts has substantially enhanced the ability of LLMs to parse, understand, and reconcile terms across EHR information and trial eligibility criteria, laying the foundation for efficient and precise automated patient-trial matching and eligibility screening.
Recent advancements in LLM-based trial-matching frameworks have demonstrated promising gains in matching accuracy and data security. TrialGPT, developed by Jin et al., precisely scores individual patient criteria and demonstrated superior accuracy in ranking and filtering candidate trials compared to traditional linear aggregation analysis [71]. Yuan et al. developed an innovative LLM-PTM matching model with privacy enhancement methods. Compared to baseline models, LLM-PTM, through its analysis of de-identified patient data, demonstrated 6% and 8.4% improvement in F1-scores for patient-eligibility matching and patient-trial matching, respectively [72]. This advancement not only enhanced matching accuracy but also effectively safeguarded patient privacy, providing a viable solution to data security concerns in clinical trial recruitment.
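Once an LLM has normalized both the patient record and the trial criteria into structured form, the final eligibility comparison is deterministic. The sketch below covers only that comparison step; the field names and criteria are hypothetical, and real frameworks such as those above score each criterion rather than applying hard rules.

```python
def screen_patient(patient, criteria):
    """Check a structured patient record against structured eligibility
    criteria, returning (eligible, reasons). In the pipelines described
    above, an LLM produces these structured inputs from free text."""
    reasons = []
    lo, hi = criteria["age_range"]
    if not lo <= patient["age"] <= hi:
        reasons.append("age out of range")
    if criteria["required_diagnosis"] not in patient["diagnoses"]:
        reasons.append("missing required diagnosis")
    if set(patient["diagnoses"]) & set(criteria["exclusion_diagnoses"]):
        reasons.append("exclusion diagnosis present")
    return (not reasons, reasons)

criteria = {
    "age_range": (18, 75),
    "required_diagnosis": "HFrEF",
    "exclusion_diagnoses": ["acute coronary syndrome"],
}
eligible, why = screen_patient(
    {"age": 64, "diagnoses": ["HFrEF", "hypertension"]}, criteria)
```

Recording the `reasons` list, rather than only a yes/no verdict, mirrors how such systems let coordinators audit why a candidate was excluded.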
The impressive performance of LLMs in reducing the time and economic costs of patient-trial matching also warrants attention. Beattie et al. utilized GPT-4 to assess whether 74 patients with head and neck cancer met the inclusion criteria for a phase II clinical trial. Through the continuous refinement of LLM-driven EHR analysis under different prompts, GPT-4 maintained good accuracy while reducing the average matching time to 7.9–12.4 min per patient, with costs lowered to just $0.15–$0.27 per case [73].
The integration of LLMs with oncogenomics databases is spearheading a new direction in cancer patient-drug trial matching studies, paving the way for precision oncology therapies. Wu et al. developed MonoMiner based on the OMIM knowledge base of 4461 monogenic pathogenic genes. MonoMiner retrieves disease-gene pairs from EHRs using specific trigger words and ranks them by similarity to known entries in the OMIM database [74]. Xu et al. developed OncoCTMiner, an innovative clinical oncology database that incorporates an automated patient-trial matching function. OncoCTMiner can intelligently match patients’ genomic mutation profiles and other “omics” data with the enrollment criteria of clinical trials in the database. Additionally, users can filter the matching results based on multidimensional information such as the patient’s clinical treatment history and trial recruitment status, ultimately producing an optimized list of recommended clinical trials [75]. Such integrations mark a promising step toward more targeted enrollment in oncology trials at the genetic level.
Data collection and management
Data acquisition and standardization
In traditional RCTs, clinical data collection is highly structured and typically predefined during trial design through case report forms (CRFs) [76] and data management plans (DMPs) [77]. These documents specify variables, timing, coding standards, and access control. During phases such as baseline data collection, intervention administration recording, and follow-up data collection, researchers often need to manually enter and cross-check large amounts of free-text information, which must then be converted into structured data. These processes are not only time-consuming and labor-intensive, but also prone to input errors, inconsistent terminology, and inter-site variability, all of which can compromise data quality. Conventional methods for information extraction typically rely on rule-based or lexicon-based encoding approaches [78], as well as conditional random fields [79] and neural networks (NNs) [80]. However, these techniques often suffer from limited generalizability and poor performance on long or complex textual inputs. Against this backdrop, automatically collecting and transforming clinical text into structured electronic data has become an urgent challenge.
The utilization of multimodal LLMs (M-LLMs) to extract and standardize baseline information or participant responses from unstructured text is an emerging solution. For instance, M-LLMs integrated with optical character recognition (OCR) technology can efficiently extract baseline characteristics from handwritten physician notes, paper-based reports, or non-interactive databases, and convert them into editable and searchable electronic text [81]. Laique et al. utilized OCR to convert cancer genomics data stored in PDF format from The Cancer Genome Atlas (TCGA) into structured text, and subsequently applied a fine-tuned ClinicalBERT model to identify and classify genomic entities, offering a novel pathway for processing cancer research data [82]. Similarly, M-LLM-based approaches have been explored for extracting structured data from pathology documents [83] and unstructured medical records [84]. However, the application of LLMs for automated extraction or standardization of trial intervention data, such as dosage and timing, as well as group-specific differences between treatment and control arms, remains largely undeveloped. Future work should expand LLMs’ applications in these areas, integrating real-time data capture and remote monitoring to help researchers efficiently interpret trial data characteristics.
Data processing and cleaning
After trial data have been collected and standardized, further data processing and cleaning are typically required. Specifically, validation rules predefined in the DMPs are applied to identify missing values, outliers, and inconsistencies within the structured dataset. Researchers then issue data queries to investigate these anomalies and implement corrections. The revised rules are employed to remove erroneous data, impute missing values, and rectify records that do not meet the established standards, thereby ensuring data accuracy and completeness [85]. Throughout this process, LLMs, by leveraging contextual modeling, can efficiently detect and correct anomalous data within the dataset, enhancing the automation and quality of data cleaning. As a context-aware data processing pipeline, LLMClean is capable of constructing semantic dependency models based on context, thereby identifying and correcting invalid, anomalous, or inconsistent entries in clinical datasets [86]. Databonsai is a data preprocessing tool that integrates multiple LLMs. Leveraging the contextual awareness of LLMs and unique batch processing capabilities, Databonsai can automatically perform tasks such as annotation, classification, anomaly detection, and standardization across multiple tabular datasets based on user-defined requirements. It is particularly well-suited for real-world data and auxiliary clinical trial databases [87]. In practice, the deployment of LLMs for data cleaning must follow the pre-specified DMPs to ensure compliance with privacy safeguards, data quality, and blinding control, which are discussed in detail in the Data quality, privacy, and blinding control section.
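The rule-driven half of this workflow, which runs before any LLM is involved, can be sketched as a small validator that turns DMP-style rules into data queries. The rule set and records below are illustrative only.

```python
def apply_validation_rules(records, rules):
    """Flag records that violate DMP-style validation rules.
    Each rule is (field, predicate, message). Missing values are
    always flagged; other checks apply the rule's predicate."""
    queries = []
    for i, rec in enumerate(records):
        for field, predicate, message in rules:
            value = rec.get(field)
            if value is None:
                queries.append((i, field, "missing value"))
            elif not predicate(value):
                queries.append((i, field, message))
    return queries

# Illustrative rules: a plausibility range and a controlled vocabulary.
rules = [
    ("systolic_bp", lambda v: 60 <= v <= 250, "out of plausible range"),
    ("visit", lambda v: v in {"baseline", "week4", "week8"}, "unknown visit code"),
]
records = [
    {"systolic_bp": 128, "visit": "baseline"},
    {"systolic_bp": 300, "visit": "week4"},   # implausible value
    {"visit": "wk8"},                         # missing field, bad code
]
queries = apply_validation_rules(records, rules)
```

Tools such as LLMClean extend this pattern by inferring context-dependent rules (e.g., cross-field dependencies) that are hard to enumerate by hand.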
Data mapping and integration
Clinical trial data typically require mapping and integration into large datasets to facilitate analysis and sharing. This process commonly involves terminology standardization, variable alignment, and format conversion, whereby these steps are guided by variable naming conventions and target data structures specified in DMPs [88]. LLM-based comprehensive architectures have demonstrated considerable potential for cross-dataset mapping. Kimura et al. combined BioBERT, GPT-3.5, and retrieval-augmented generation (RAG) techniques to map drug names to the standardized RxNorm terminology system [89]. Adams et al. integrated GPT-3 embeddings with vector matching methods to align clinical trial data with public health registry data, subsequently mapping these data to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) [90]. This framework not only maintained high-precision matching but also reduced data mapping time to a matter of hours, thereby substantially enhancing the capacity of small teams to participate in data sharing platforms.
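The embedding-plus-vector-matching idea behind these frameworks can be illustrated with a toy matcher; character-bigram count vectors stand in for LLM embeddings so the sketch runs without a model, and the vocabulary is a made-up stand-in for a terminology system such as RxNorm:

```python
# Sketch of embedding-based terminology mapping. Character-bigram count
# vectors replace LLM embeddings here for self-containment; a real
# pipeline would embed terms with a model and match against a standard
# vocabulary (e.g., RxNorm or an OMOP CDM concept table).
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: counts of character bigrams (LLM stand-in)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_to_vocabulary(raw_term, vocabulary):
    """Return the vocabulary entry whose vector is closest to the raw term."""
    vec = embed(raw_term)
    return max(vocabulary, key=lambda term: cosine(vec, embed(term)))

vocab = ["acetaminophen", "ibuprofen", "amoxicillin"]
print(map_to_vocabulary("acetaminophen 500 mg tab", vocab))
```

The nearest-neighbor step is the same whether the vectors come from bigram counts or from GPT-style embeddings; only the embedding function changes.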
Safety monitoring
Tracking adverse events
In clinical trials involving investigational drugs, devices, or novel therapies, tracking adverse events (AEs) is a critical component of safety assessment and subject protection [91]. Active surveillance of AEs typically relies on regular follow-up assessments, laboratory testing, and review of data within participants’ EHRs or trial records. AEs, especially severe AEs, are often required to be reported and entered into the system within 24 h [92]. Given the time-sensitive nature of this process, an increasing number of studies have begun to explore the potential of LLMs to detect AEs from trial records automatically. Li et al. demonstrated the strong performance of LLMs in extracting influenza vaccine-related AEs from annotated reports [93], while Hu et al. investigated the impact of various prompt strategies on the performance of LLMs in identifying vaccine-associated AEs [94]. Sivarajkumar et al. further advanced this application by using LLMs to automatically adjudicate cardiovascular AEs from real-world, unstructured clinical trial records. Their LLM-based pipeline first extracts relevant information from free-text documents and then applies expert guidelines and a Tree-of-Thoughts reasoning framework to identify cardiovascular events [95]. Currently, LLMs are deployed primarily for the detection and adjudication of AEs, serving a diagnostic rather than prognostic role. Future efforts should focus on integrating such LLM-driven analytic frameworks into electronic data capture systems or mobile applications to enable real-time, proactive AE monitoring throughout the course of clinical trials.
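The extract-then-adjudicate pattern used in such pipelines can be outlined schematically; keyword matching replaces the LLM extraction and Tree-of-Thoughts reasoning steps here so the example runs, and the candidate and guideline term lists are illustrative only:

```python
# Simplified two-stage sketch of the extract-then-adjudicate pattern.
# Keyword matching stands in for the LLM extraction and reasoning steps;
# the term lists are invented, not from any cited guideline.
import re

CANDIDATE_TERMS = ["chest pain", "syncope", "rash", "headache"]
CARDIOVASCULAR = {"chest pain", "syncope"}  # guideline category of interest

def extract_candidate_events(note):
    """Stage 1: pull candidate AE mentions from free text (LLM stand-in)."""
    text = note.lower()
    return [t for t in CANDIDATE_TERMS if re.search(r"\b" + t + r"\b", text)]

def adjudicate(events):
    """Stage 2: keep only events matching the category of interest."""
    return sorted(set(events) & CARDIOVASCULAR)

note = "Day 3: subject reported mild rash and an episode of chest pain."
print(adjudicate(extract_candidate_events(note)))
```

Separating the two stages mirrors the cited pipeline: extraction can be audited independently of adjudication, which matters when outputs feed 24-hour reporting obligations.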
Detecting drug-drug interactions
In clinical trials, managing drug-drug interactions (DDIs) is a critical aspect of trial design and patient safety. LLMs offer promising tools to enhance this process by automatically detecting potential DDIs, particularly those related to pharmacokinetics and synergistic effects. For example, Zirkle et al. developed the BioBERT-directional DDI model based on 325 manually annotated FDA drug labels. This model successfully identified pharmacokinetic drug-drug interactions (PK-DDIs) with an accuracy of 0.82 and achieved 100% accuracy in identifying target drugs [96]. Li et al. utilized CancerGPT, a GPT-2-based model, to predict the synergistic effects of paired drugs across different organ sites via few-shot prompting. The results showed that for areas with rich prior knowledge, such as the endometrium and bone tissue, data-driven models exhibited better predictive performance. However, in prediction tasks for areas lacking sufficient external information, such as soft tissue and the urinary tract, CancerGPT’s FSL performance notably surpassed that of data-driven models [97]. This study highlighted the value of few-shot prompting in situations where the types and characteristics of real-world data differ from training data. In the future, researchers can leverage LLMs to uncover drug combinations with greater efficacy or risk, and tailor protocol adaptations for specific subgroups affected by DDIs, enhancing therapeutic outcomes while maintaining trial-level generalizability.
Trial outcome prediction
In feasibility assessments and interim analyses, leveraging LLMs to predict trial outcomes (such as primary/secondary endpoints and drug efficacy) can provide early insights for protocol optimization and resource allocation, promoting more evidence-based medical decision-making. Dougherty et al. applied BART to predict the outcomes of clinical trials involving COMP360, a drug developed for patients with treatment-resistant depression. The research team transcribed recordings of conversations between participants and their psychotherapists into text. BART then assigned emotion and arousal scores to these transcripts and predicted participants’ treatment responses at 3 and 12 weeks to assess the overall efficacy of COMP360 [98]. Agarwal et al. used TRANSMED to predict the duration of hospitalization and improvements in pulmonary ventilation in patients receiving a novel COVID-19 treatment [99]. This study highlighted that integrating static information (e.g., patient baseline characteristics) with dynamic datasets (e.g., longitudinal imaging or laboratory results) may enhance model performance, particularly in rare or novel diseases.
Recently, researchers have proposed an innovative approach using multi-LLM pipelines to simulate clinical trials for drug efficacy evaluation. Goldenholz et al. developed a pipeline composed of three different LLMs to simulate randomized trials of cenobamate for seizure control. The pipeline employed LLaMa 2 to generate clinical notes based on real-world data; the Mistral model then summarized these notes, and Claude 2 aggregated the summaries and predicted the 50% response rate between the placebo and treatment groups [100]. Future studies could explore integrating both trial simulation and data analysis into a single LLM, thereby simplifying the entire pipeline.
It is noteworthy that most RCTs rely on blinding to minimize bias. Unrestricted use of LLMs to predict trial outcomes may inadvertently reveal critical intervention information or influence assessors’ judgments, thereby compromising the integrity of the blinding mechanism. Therefore, such applications must be carefully contextualized and confined to appropriate settings, such as non-blinded or open-label trials, retrospective analyses, or simulation frameworks.
Comparison between LLMs and traditional NLP models in clinical trials
Based on the Transformer architecture and large-scale pre-training, LLMs demonstrate better performance than traditional NLP models. This section elucidates four major performance advantages of LLMs over traditional NLP models: contextual understanding, few-shot learning, dynamic text generation, and generalization/multitask capability. However, challenges including hallucinated outputs, prompt sensitivity, and limited capacity for knowledge updates may compromise their reliability. By examining the technical advantages that LLMs currently demonstrate over traditional approaches alongside their remaining limitations in the context of clinical trial scenarios, this analysis helps trial investigators better understand and evaluate the outputs of LLMs (Fig. 3).
Fig. 3.
Performance advantages, technical limitations, and solutions for large language models in clinical trials. The upper four sections illustrate the major performance advantages of large language models (LLMs) compared to traditional natural language processing models, and explain how these advantages support LLM applications in clinical trials. These advantages are, from left to right: contextual understanding, few-shot learning, dynamic text generation, and generalization with multitask capability. The middle sections present the current technical limitations of LLMs with the integration of trial scenarios, including output hallucination, prompt sensitivity, and challenges in updating the knowledge base. The lower sections list potential technical solutions to mitigate these limitations
Performance advantages
Contextual understanding
Compared to traditional NLP models (e.g., dictionary-based models, N-gram models, hidden Markov models, support vector machines) constrained by fixed window sizes and linear, sequence-based text input, the Transformer architecture of LLMs demonstrates notable advantages in capturing long-range dependencies and semantic connections, namely, superior contextual understanding [101, 102]. During long-term follow-up, trial records often chronicle the progression of illness and the medication history of subjects over months or even years. Superior contextual understanding allows LLMs to detect AEs or functional changes in post-treatment subjects, and establish long-range connections with baseline characteristics, assisting researchers in evaluating treatment efficacy and identifying potential response patterns over time.
Few-shot learning
Few-shot learning (FSL) is a learning approach designed for LLMs that leverages a limited number of annotated instances as prompts to learn specific tasks [103]. In the absence of sufficient external training data to serve as ground truth, traditional NLP models demand extremely high precision in textual representation, and the subsequent refinement processes are often highly complex, resulting in reduced flexibility [104]. FSL enables LLMs to perform specific tasks using only a small number of labeled examples, without requiring extensive architectural modifications or additional supervised training. This allows for rapid adaptation to trial-related tasks in low-resource settings, such as extracting clinical information of subjects with rare or novel diseases, or developing more precise informed consent materials for children, older adults, or minority populations.
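A minimal sketch of FSL-style prompt construction is shown below; the extraction task, example notes, and output format are all hypothetical, chosen only to show how a handful of labeled instances are packed into the prompt itself:

```python
# Minimal sketch of few-shot prompting: labeled examples go directly
# into the prompt, so the model adapts to the task without retraining.
# The task, notes, and format are hypothetical.

def build_few_shot_prompt(examples, query):
    """Assemble an extraction prompt from (text, label) example pairs."""
    lines = ["Extract the primary diagnosis from each note.", ""]
    for text, label in examples:
        lines.append(f"Note: {text}")
        lines.append(f"Diagnosis: {label}")
        lines.append("")
    lines.append(f"Note: {query}")
    lines.append("Diagnosis:")
    return "\n".join(lines)

examples = [
    ("55M admitted with crushing substernal pain.", "myocardial infarction"),
    ("Child with barking cough and stridor.", "croup"),
]
prompt = build_few_shot_prompt(examples, "72F with polyuria and HbA1c 9.1%.")
print(prompt)
```

The resulting string is sent to the model as-is; swapping the two example pairs for rare-disease cases is all that adaptation requires, which is the point of FSL in low-resource settings.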
Dynamic text generation
Leveraging the unique “emergent” ability [105], LLMs can generate text coherently, adjusting in real time based on immediate context. This dynamic adaptability offers practical value in clinical trial documentation. Medical writing often involves complex terminology and potential ambiguity. LLMs help ensure clarity and coherence by adjusting word choice and writing style based on contextual meaning and specific trial settings. For example, in the analysis of drug efficacy informed by participant data, LLMs typically adopt a precise and objective narrative style, often making greater use of specialized terminology. In contrast, when aiding participants in understanding informed consent documentation, LLMs employ more accessible language, replacing technical terms with plainer wording to improve text comprehensibility.
Generalization and multitask capability
LLMs trained on general-purpose datasets can transfer the experience and knowledge acquired in specific tasks to other related tasks, showcasing exceptional generalization abilities within condensed timeframes. In contrast, traditional NLP models often require different model parameters to be set for different scenarios, leading to reduced flexibility. Additionally, their unified Transformer architecture supports efficient knowledge sharing and parallel processing, enabling LLMs to perform multiple tasks concurrently [106]. As near “plug-and-play” tools, LLMs might perform baseline feature extraction and participant eligibility screening at the same time, reducing the burden on researchers when designing complex pipelines.
Technical limitations
Output hallucination
Hallucination often refers to the generation of unfounded details to support a viewpoint [107]. This error arises from the presence of false or highly redundant information in the training data [108], the use of high-uncertainty sampling algorithms [109], as well as exposure bias between text generation and inference [110]. In clinical trials, hallucinated outputs can distort trial outcome predictions, misguide eligibility assessments, or even fabricate patient data to fit expected conclusions. Mitigation strategies include credibility-based data weighting, controllable generation methods [111], and retrieval-augmented generation (RAG) [112].
Prompt sensitivity
LLMs may generate markedly different responses to similar prompts for the same task, such as when synonyms are used or the prompt style is adjusted, indicating that the quality of the model’s output is highly sensitive to prompt variations [113]. Considering the diverse medical record formats and rich free-text information in clinical trials, reducing the prompt sensitivity in a controlled manner will assist LLMs in adapting to different user input styles or various trial categories. Future research may explore soft prompts with learnable vectors and prompt calibration, in which meta-prompts are iteratively scored and adjusted based on user intent to improve LLM adaptability to downstream tasks [114, 115].
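One simple way to quantify prompt sensitivity is to paraphrase the same request and measure how often the answers agree; in this sketch a deliberately brittle stub stands in for a real LLM call, and the eligibility task is invented for illustration:

```python
# Sketch of a prompt-sensitivity probe: run paraphrased prompts for the
# same task and report the agreement rate of the answers. The model is
# a stub chosen to be wording-sensitive; a real probe would call an LLM.
from collections import Counter

def stub_model(prompt):
    # Hypothetical stand-in whose answer flips with slight rewording.
    return "eligible" if "inclusion criteria" in prompt else "not eligible"

def agreement_rate(prompts, model):
    """Fraction of prompts yielding the most common answer."""
    answers = [model(p) for p in prompts]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

variants = [
    "Does this subject meet the inclusion criteria?",
    "Is this subject eligible for enrolment?",
    "Check the inclusion criteria for this subject.",
]
print(agreement_rate(variants, stub_model))
```

An agreement rate well below 1.0 across paraphrases is a warning sign that outputs depend on wording rather than content, which is exactly the failure mode calibration methods aim to reduce.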
Challenges in knowledge update
After deployment, static LLMs lack the ability to automatically update their knowledge base, while frequent retraining often incurs substantial costs. Furthermore, LLMs may encounter catastrophic forgetting when learning new tasks [116]. These limitations might hinder LLMs from continuously learning and integrating external data or expert consensus, which might eventually impair the timeliness and accuracy of LLM-generated recommendations when matching participants to clinical trials, or of analyses involving rare diseases or advanced medications. The current strategy for updating the knowledge base mainly involves leveraging external retrieval information to override outdated content in the model’s output, such as data augmentation and internet augmentation [117].
Implementation challenges and potential solutions
In this section, we discuss the practical challenges of deploying LLMs in clinical trials and related solutions, including (1) data quality control, privacy protection, and blinding requirements; (2) limited transparency in decision-making, biases in data and algorithms, and difficulty in reliability and explainability assessment; (3) legal and regulatory hurdles; and (4) standardization of LLM evaluation and application. This discussion will offer insights for the promotion and future applications of LLMs in clinical trials (Fig. 4).
Fig. 4.
Challenges and potential solutions in the applications of large language models in clinical trials. This figure highlights four major challenges associated with the application of large language models (LLMs) in clinical trials: requirements of data quality, privacy, and blinding control; legal and administrative impediments; the imperative for standardization in evaluation and practice; and concerns surrounding model credibility and interpretability. Furthermore, the figure delineates techniques or cross-sector collaboration approaches, presenting potential solutions corresponding to each of the listed challenges
Data quality, privacy, and blinding control
During the protocol development of clinical trials, investigators typically establish DMPs in advance to address practical challenges in data management, such as quality control, privacy protection, and blinding control. (i) A primary concern for researchers is ensuring that LLMs used for data management do not become new sources of data contamination. As discussed in the Comparison between LLMs and traditional NLP models in clinical trials section, LLMs may compromise data quality in clinical trials by generating inconsistent, inaccurate, or biased outputs due to their sensitivity to prompts and potential misinterpretation of context. Embedding predefined rules into prompts or implementing contextual constraint mechanisms are key strategies to guide LLMs in generating outputs that meet established quality standards [118]. (ii) The private information of participants may be unintentionally leaked during the output process, potentially being maliciously exploited by cybercriminals. For data privacy safeguarding, the deployment of LLMs should incorporate multiple data encryption techniques, such as symmetric/asymmetric encryption [119] and k-anonymity [120]. In recent years, the emergence of fast fully homomorphic encryption (TFHE) has enabled LLMs to perform computation directly on encrypted data, balancing privacy protection with processing efficiency [121]. (iii) LLM-generated outputs for data management may inadvertently reveal group allocation to researchers, compromising the blinding of the trial. Therefore, researchers should implement role-based access control (RBAC) to restrict LLM access to randomization data [122] and establish output auditing mechanisms to prevent the generation of any information that could indirectly reveal intervention assignments.
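As one concrete privacy safeguard, a k-anonymity check on quasi-identifiers can be sketched in a few lines; the field names and records below are invented for illustration:

```python
# Sketch of a k-anonymity check on quasi-identifiers before trial data
# reaches an LLM: every combination of quasi-identifier values must be
# shared by at least k records, otherwise a subject may be re-identified.
# Field names and records are illustrative.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs >= k times."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())

records = [
    {"age_band": "60-69", "zip3": "100", "arm": "A"},
    {"age_band": "60-69", "zip3": "100", "arm": "B"},
    {"age_band": "70-79", "zip3": "100", "arm": "A"},
]
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))  # False: one record is unique
```

Records failing the check would be generalized (e.g., widening age bands) or suppressed before being passed to any model, alongside the encryption measures described above.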
Model credibility, interpretability, and bias
Compared to traditional NLP models, LLMs often lack transparency in their reasoning and decision-making processes, a challenge commonly referred to as the “black box” [123]. The limited interpretability of LLM frameworks may undermine the credibility of their outputs, particularly when LLMs are applied to tasks such as trial protocol optimization or patient stratification. Current research efforts, including model distillation [124], hybrid architectures [125], and explainable AI (XAI) techniques [126, 127], have proven effective in enhancing interpretability. However, most of these methods are developed for general tasks and have yet to be systematically validated in clinical trial settings. Integrating multiple XAI techniques to support causal reasoning and decision traceability may help build greater trust in LLM-assisted analysis in the future.
LLMs applied in clinical trials are vulnerable to algorithmic and data biases, potentially leading to unequal trial outcomes, skewed patient recruitment, and diminished credibility and fairness of recommendations [128]. Such biases may stem from underrepresented populations in historical datasets, subjective endpoint annotations, or limited language coverage [129]. To mitigate issues of data and algorithmic bias, future efforts could focus on improving the quality of training data by leveraging real-world data for pre-training, or employing advanced algorithms such as fairness-aware training or adversarial debiasing to mitigate biased samples [130].
Legal and regulatory concerns
The application of LLMs in clinical trials entails various complex legal concerns. (i) Data privacy protection: The training and fine-tuning of LLMs require large volumes of trial data, which contain a substantial amount of sensitive personal information. Legal and regulatory frameworks across the globe, such as the European Union’s General Data Protection Regulation (GDPR) and the US Health Insurance Portability and Accountability Act (HIPAA), impose stringent requirements on the protection of patients’ personal information during LLM training, transmission, and storage [131]. (ii) Intellectual property disputes: Some developers use medical databases to train LLMs without proper authorization, infringing on the intellectual property rights of the data owners. Additionally, the copyright and originality issues surrounding the use of LLMs’ outputs are also a topic of ongoing debate. (iii) Assignment of responsibility: How to address the medical disputes and responsibility issues stemming from LLM-mediated decisions is an urgent challenge. The amendment to Section 1557 of the US Affordable Care Act stipulates that physicians or medical decision-makers are “liable for medical decisions made in reliance on clinical algorithms,” indicating that LLMs do not yet possess independent diagnostic and decision-making authority [132]. (iv) Misinformation and academic integrity: The advanced ability of LLMs to closely mimic human writing might lead to the creation of inaccurate or even false information, resulting in misleading conclusions that could undermine academic integrity [133].
Recent legislative developments, such as the European Union’s Artificial Intelligence Act (2024), have initiated the classification of artificial intelligence systems based on risk levels; however, numerous challenges persist. Current legal frameworks frequently lack specific guidance for clinical trial applications, while inconsistencies in global regulatory approaches may hinder the achievement of unified oversight [134, 135]. To keep pace with the rapid advancement of LLMs, regulatory frameworks must be adaptive, evidence-based, and capable of maintaining equilibrium between safety and innovation. Mechanisms such as regulatory sandboxes may offer viable solutions by facilitating controlled clinical testing within clearly defined legal parameters.
Standardization of evaluation and practice
To date, regulatory agencies such as the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have not issued specific guidelines for evaluating or implementing LLMs in clinical trials. Establishing unified standards and practical guidelines is therefore essential for facilitating their integration into clinical trial workflows. Given that manual assessment remains the primary approach for evaluating LLMs in the medical field [136], a “human-in-the-loop” strategy provides a practical framework for standard evaluation through human–machine collaboration. In this approach, clinical experts prompt models to measure the consistency between the LLMs’ outputs and professional opinions, and subsequently provide improvement recommendations. Moreover, quantitative metrics, including ROC curves and domain-specific checklists such as STAGER and QUEST, have been proposed to assess the LLMs’ output quality [137, 138]. Industry-specific benchmarks are equally important; for example, CTBench provides a comprehensive test suite for evaluating GPT-4’s ability to extract baseline characteristics of clinical trial participants [139]. Future guideline development should incorporate feedback from real-world clinical trial design and conduct, alongside expert consensus, while balancing potential benefits, performance limitations, and application risks. Additionally, open platforms, shared toolkits, and structured feedback mechanisms are needed to facilitate best-practice dissemination and collaborative progress in trial workflows.
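Of the quantitative metrics mentioned, ROC analysis is straightforward to sketch; the snippet below computes the area under the ROC curve via the rank-sum formulation, using made-up adjudication labels and model confidence scores:

```python
# Sketch of one quantitative evaluation metric mentioned above: area
# under the ROC curve, computed as the probability that a randomly
# chosen positive outscores a randomly chosen negative (ties count 0.5).
# Labels and scores are invented illustration data.

def roc_auc(labels, scores):
    """AUC from binary labels (1 = correct output) and model scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Expert adjudication of LLM outputs vs. the model's confidence scores
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7]
print(roc_auc(labels, scores))
```

In a human-in-the-loop evaluation, the labels would come from clinical experts adjudicating LLM outputs, making the AUC a summary of how well model confidence tracks expert judgment.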
Conclusion
The integration of LLMs in clinical trials encompasses a range of critical processes, including protocol development, informed consent, patient recruitment, safety monitoring, data management, and outcome prediction. Currently, some applications, such as trial data mapping and simulation-based evaluations, remain at the exploratory stage with limited validation, while applications targeting specific tasks like real-time adverse event monitoring remain underdeveloped. Going forward, clinical researchers must ensure that LLM deployment aligns with both the trial protocol and conduct management plan, and adheres to relevant legal and regulatory requirements. The establishment of standardized industry benchmarks, enhancement of transparency in model outputs, and active mitigation of data and algorithmic biases will be essential for unlocking the full potential of LLMs for broader and safer integration into clinical trials.
Acknowledgements
All the figures were created based on the tools provided by Biorender.com (accessed on 28/07/2025).
Abbreviations
- RCT
Randomized controlled trial
- AE
Adverse event
- DL
Deep learning
- NLP
Natural language processing
- EHR
Electronic health record
- GPT
Generative pre-trained transformer
- BERT
Bidirectional encoder representations from transformers
- LLM
Large language model
- NER
Named entity recognition
- DDI
Drug-drug interaction
- PICO
Patient, Intervention, Comparison, Outcome
- FSL
Few-shot learning
- IRB
Institutional review board
- ICF
Informed consent form
- DMP
Data management plan
- RAG
Retrieval-augmented generation
- M-LLM
Multimodal large language model
- OCR
Optical character recognition
- XAI
Explainable artificial intelligence
Authors' contributions
Writing-original draft, A.Q.L., Z.H.W., A.M.J.; Conceptualization, Q.C., B.F.T., P.L.; Investigation, A.Q.L., W.Z.H.; Writing-review and editing, A.Q.L., W.Z.H., A.M.J., L.C., C.Q., L.X.Z., W.M.M., W.Y.G., D.Q.Z., M.J.X., G.D.C., S.K.P., H.Z.W., L.Z., H.G.Z., X.P.D., Y.X.W., J.Z., Q.C., B.F.T., P.L.; Visualization, A.Q.L., W.Z.H. All authors have read and agreed to the published version of the manuscript.
Authors' twitter handles
X: @PengL_Robin.
Funding
This work was supported by grants from the Science and Technology Innovation Program of Hunan Province (No. 2023RC3074).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Anqi Lin, Zhihan Wang, and Aimin Jiang contributed equally to this work.
Contributor Information
Quan Cheng, Email: chengquan@csu.edu.cn.
Bufu Tang, Email: tangbufu@zju.edu.cn.
Peng Luo, Email: luopeng@smu.edu.cn.
References
- 1. Buchanan WW, Kean CA, Rainsford KD, Kean WF. Clinical therapeutic trials. Inflammopharmacology. 2024;32(1):61–71.
- 2. Su L, Liu S, Li G, Xie C, Yang H, Liu Y, et al. Trends and characteristics of new drug approvals in China, 2011–2021. Ther Innov Regul Sci. 2023;57(2):343–51.
- 3. Chen J, Lin A, Luo P. Advancing pharmaceutical research: a comprehensive review of cutting-edge tools and technologies. Curr Pharm Anal. 2024;21(1):1–19.
- 4. Liu Y, Zhang S, Liu K, Hu X, Gu X. Advances in drug discovery based on network pharmacology and omics technology. Curr Pharm Anal. 2024;21(1):33–43.
- 5. Wang Z, Zhao Y, Zhang L. Emerging trends and hot topics in the application of multi-omics in drug discovery: a bibliometric and visualized study. Curr Pharm Anal. 2024;21(1):20–32.
- 6. Lin A, Fang X, Jiang A, Qi C, Gan W, Zhu L, et al. Large language models in drug development: current progress and future directions. Curr Mol Pharmacol. 2025;18(1):1–5.
- 7. Hung M, Mohajeri A, Almpani K, Carberry G, Wisniewski JF, Janes K, et al. Successes and challenges in clinical trial recruitment: the experience of a new study team. Med Sci (Basel). 2024;12(3):39.
- 8. Walters SJ, Bonacho Dos Anjos Henriques-Cadby I, Bortolami O, Flight L, Hind D, Jacques RM, et al. Recruitment and retention of participants in randomized controlled trials: a review of trials funded and published by the United Kingdom Health Technology Assessment Programme. BMJ Open. 2017;7(3):e015276.
- 9. Unger JM, Vaidya R, Hershman DL, Minasian LM, Fleury ME. Systematic review and meta-analysis of the magnitude of structural, clinical, and physician and patient barriers to cancer clinical trial participation. J Natl Cancer Inst. 2019;111(3):245–55.
- 10. de Aquino CH. Methodological issues in randomized clinical trials for prodromal Alzheimer’s and Parkinson’s disease. Front Neurol. 2021;12:694329.
- 11. Gu T, Jiang A, Zhou C, Lin A, Cheng Q, Liu Z, et al. Adverse reactions associated with immune checkpoint inhibitors and bevacizumab: a pharmacovigilance analysis. Int J Cancer. 2023;152(3):480–95.
- 12. Shen J, Hu R, Lin A, Jiang A, Tang B, Liu Z, et al. Characterization of second primary malignancies post CAR T-cell therapy: real-world insights from the two global pharmacovigilance databases of FAERS and VigiBase. EClinicalMedicine. 2024;73:102684.
- 13. Zhou C, Peng S, Lin A, Jiang A, Peng Y, Gu T, et al. Psychiatric disorders associated with immune checkpoint inhibitors: a pharmacovigilance analysis of the FDA Adverse Event Reporting System (FAERS) database. EClinicalMedicine. 2023;59:101967.
- 14. Zhang J, Li H, Tao W, Zhou J. GseaVis: an R package for enhanced visualization of gene set enrichment analysis in biomedicine. Med Res. 2025;1(1):131–5.
- 15. Fang Y, Kong Y, Rong G, Luo Q, Liao W, Zeng D. Systematic investigation of tumor microenvironment and antitumor immunity with IOBR. Med Res. 2025;1:136–40.
- 16. Lin A, Qi C, Li M, Guan R, Imyanitov EN, Mitiushkina NV, et al. Deep learning analysis of the adipose tissue and the prediction of prognosis in colorectal cancer. Front Nutr. 2022;9:869263.
- 17. Bannon D, Moen E, Schwartz M, Borba E, Kudo T, Greenwald N, et al. DeepCell Kiosk: scaling deep learning-enabled cellular image analysis with Kubernetes. Nat Methods. 2021;18(1):43–5.
- 18. Lee CS, Tyring AJ, Wu Y, Xiao S, Rokem AS, DeRuyter NP, et al. Generating retinal flow maps from structural optical coherence tomography with artificial intelligence. Sci Rep. 2019;9(1):5694.
- 19. Calzetta L, Pistocchini E, Chetta A, Rogliani P, Cazzola M. Experimental drugs in clinical trials for COPD: artificial intelligence via machine learning approach to predict the successful advance from early-stage development to approval. Expert Opin Investig Drugs. 2023;32(6):525–36.
- 20. Jerfy A, Selden O, Balkrishnan R. The growing impact of natural language processing in healthcare and public health. Inquiry. 2024;61:469580241290095.
- 21. Liu Y, Wang H, Zhou H, Li M, Hou Y, Zhou S, et al. A review of reinforcement learning for natural language processing and applications in healthcare. J Am Med Inform Assoc. 2024;31(10):2379–93.
- 22. Annepaka Y, Pakray P. Large language models: a survey of their development, capabilities, and applications. Knowl Inf Syst. 2025;67:2967–3022.
- 23. Denecke K, May R, Rivera-Romero O. Transformer models in healthcare: a survey and thematic analysis of potentials, shortcomings and risks. J Med Syst. 2024;48(1):23.
- 24. Zhu L, Mou W, Luo P. Ensuring consistency and accuracy in evaluating ChatGPT-4 for clinical recommendations. Clin Gastroenterol Hepatol. 2025;23(1):189–90.
- 25. Zhu L, Lai Y, Mou W, Zhang H, Lin A, Qi C, et al. ChatGPT’s ability to generate realistic experimental images poses a new challenge to academic integrity. J Hematol Oncol. 2024;17(1):27.
- 26. Zhu L, Mou W, Lai Y, Chen J, Lin S, Xu L, et al. Step into the era of large multimodal models: a pilot study on ChatGPT-4V(ision)’s ability to interpret radiological images. Int J Surg. 2024;110(7):4096–102.
- 27. Lin A, Zhu L, Mou W, Yuan Z, Cheng Q, Jiang A, et al. Advancing generative artificial intelligence in medicine: recommendations for standardized evaluation. Int J Surg. 2024;110(8):4547–51.
- 28. Wan P, Huang Z, Tang W, et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat Med. 2024;30:2878–85.
- 29. Liu C, Wei M, Qin Y, Zhang M, Jiang H, Xu J, et al. Harnessing large language models for structured reporting in breast ultrasound: a comparative study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4). Ultrasound Med Biol. 2024;50(11):1697–703.
- 30. Günay S, Öztürk A, Yiğit Y. The accuracy of Gemini, GPT-4, and GPT-4o in ECG analysis: a comparison with cardiologists and emergency medicine specialists. Am J Emerg Med. 2024;84:68–73.
- 31. Beşler MS, Oleaga L, Junquero V, Merino C. Evaluating GPT-4o’s performance in the official European Board of Radiology exam: a comprehensive assessment. Acad Radiol. 2024;31(11):4365–71.
- 32. Zhu L, Mou W, Hong C, Yang T, Lai Y, Qi C, et al. The evaluation of generative AI should include repetition to assess stability. JMIR Mhealth Uhealth. 2024;12:e57978.
- 33. Gan W, Ouyang J, She G, Xue Z, Zhu L, Lin A, et al. ChatGPT’s role in alleviating anxiety in total knee arthroplasty consent process: a randomized controlled trial pilot study. Int J Surg. 2025;111(3):2546–57.
- 34. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9):e1000326.
- 35. Gates A, Gates M, Sebastianski M, Guitard S, Elliott SA, Hartling L. The semi-automation of title and abstract screening: a retrospective exploration of ways to leverage Abstrackr’s relevance predictions in systematic and rapid reviews. BMC Med Res Methodol. 2020;20(1):1–9.
- 36. Marshall IJ, Kuiper J, Banner E, Wallace BC. Automating biomedical evidence synthesis: RobotReviewer. Proc Conf Assoc Comput Linguist Meet. 2017;2017:7–12.
- 37. Chai KEK, Lines RLJ, Gucciardi DF, Ng L. Research Screener: a machine learning tool to semi-automate abstract screening for systematic reviews. Syst Rev. 2021;10(1):93.
- 38. Hamel C, Kelly SE, Thavorn K, Rice DB, Wells GA, Hutton B. An evaluation of DistillerSR’s machine learning-based prioritization tool for title/abstract screening – impact on reviewer-relevant outcomes. BMC Med Res Methodol. 2020;20:256.
- 39.Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018;7(1):45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Lee K, Paek H, Huang LC, Hilton CB, Datta S, Higashi J, et al. SEETrials: leveraging large language models for safety and efficacy extraction in oncology clinical trials. Inform Med Unlocked. 2024;50:101589. [DOI] [PMC free article] [PubMed]
- 41. Whitton J, Hunter A. Automated tabulation of clinical trial results: a joint entity and relation extraction approach with transformer-based language representations. Artif Intell Med. 2023;144:102661.
- 42. Mutinda FW, Liew K, Yada S, Wakamiya S, Aramaki E. Automatic data extraction to support meta-analysis statistical analysis: a case study on breast cancer. BMC Med Inform Decis Mak. 2022;22(1):158.
- 43. Wang Q, Liao J, Lapata M, Macleod M. PICO entity extraction for preclinical animal literature. Syst Rev. 2022;11:209.
- 44. Ghosh M, Mukherjee S, Ganguly A, Basuchowdhuri P, Naskar SK, Ganguly D. AlpaPICO: extraction of PICO frames from clinical trial documents using LLMs. Methods. 2024;226:78–88.
- 45. Kirk-Smith MD, Stretch DD. Evidence-based medicine and randomized double-blind clinical trials: a study of flawed implementation. J Eval Clin Pract. 2001;7(2):119–23.
- 46. Xie S, Zhao W, Deng G, He G, He N, Lu Z, et al. Utilizing ChatGPT as a scientific reasoning engine to differentiate conflicting evidence and summarize challenges in controversial clinical questions. J Am Med Inform Assoc. 2024;31(7):1551–60.
- 47. Wang S, Šuster S, Baldwin T, Verspoor K. Predicting publication of clinical trials using structured and unstructured data: model development and validation study. J Med Internet Res. 2022;24(12):e38859.
- 48. Razuvayevskaya O, Lopez I, Dunham I, Ochoa D. Genetic factors associated with reasons for clinical trial stoppage. Nat Genet. 2024;56(9):1862–7.
- 49. Kim ES, Bruinooge SS, Roberts S, Ison G, Lin NU, Gore L, et al. Broadening eligibility criteria to make clinical trials more representative: American Society of Clinical Oncology and Friends of Cancer Research joint research statement. J Clin Oncol. 2017;35(33):3737–44.
- 50. Peterson JS, Plana D, Bitterman DS, Johnson SB, Aerts HJWL, Kann BH. Growth in eligibility criteria content and failure to accrue among National Cancer Institute (NCI)-affiliated clinical trials. Cancer Med. 2023;12:4715–24.
- 51. Datta S, Lee K, Paek H, Manion FJ, Ofoegbu N, Du J, et al. AutoCriteria: a generalizable clinical trial eligibility criteria extraction system powered by large language models. J Am Med Inform Assoc. 2024;31:375–85.
- 52. Lim YMF, Asselbergs FW, Bagheri A, Denaxas S, Tay WT, Voors A, et al. Eligibility of Asian and European registry patients for phase III trials in heart failure with reduced ejection fraction. ESC Heart Fail. 2024;11(6):3559–71.
- 53. Capili B, Anastasi JK. Ethical research and the institutional review board: an introduction. Am J Nurs. 2024;124:50–4.
- 54. van Wijk RPJ, van Dijck JTJM, Timmers M, van Veen E, Citerio G, Lingsma HF, et al. Informed consent procedures in patients with an acute inability to provide informed consent: policy and practice in the CENTER-TBI study. J Crit Care. 2020;59:6–15.
- 55. Lynch HF, Rosenfeld S. Institutional review board quality, private equity, and promoting ethical human subjects research. Ann Intern Med. 2020;173:558–62.
- 56. Sridharan K, Sivaramakrishnan G. Leveraging artificial intelligence to detect ethical concerns in medical research: a case study. J Med Ethics. 2025;51(2):126–34.
- 57. Bergenmar M, Molin C, Wilking N, Brandberg Y. Knowledge and understanding among cancer patients consenting to participate in clinical trials. Eur J Cancer. 2008;44:2627–33.
- 58. Sand K, Eik-Nes NL, Loge JH. Readability of informed consent documents (1987–2007) for clinical trials: a linguistic analysis. J Empir Res Hum Res Ethics. 2012;7:67–78.
- 59. Jimenez R, Zhang B, Joffe S, Nilsson M, Rivera L, Mutchler J, et al. Clinical trial participation among ethnic/racial minority and majority patients with advanced cancer: what factors most influence enrollment? J Palliat Med. 2013;16:256–62.
- 60. Campillos-Llanos L, Ortega-Riba F, Terroba AR, Valverde-Mateos A, Capllonch-Carrión A. CLARA-MeD tool – a system to help patients understand clinical trial announcements and consent forms in Spanish. Stud Health Technol Inform. 2024;316:95–9.
- 61. Ali R, Connolly ID, Tang OY, Mirza FN, Johnston B, Abdulrazeq HF, et al. Bridging the literacy gap for surgical consents: an AI-human expert collaborative approach. NPJ Digit Med. 2024;7:63.
- 62. Darmawan I, Bakker C, Brockman TA, Patten CA, Eder M. The role of social media in enhancing clinical trial recruitment: scoping review. J Med Internet Res. 2020;22:e22810.
- 63. Parkinson B, Meacock R, Sutton M, Fichera E, Mills N, Shorter GW, et al. Designing and using incentives to support recruitment and retention in clinical trials: a scoping review and a checklist for design. Trials. 2019;20:624.
- 64. Woo M. An AI boost for clinical trials. Nature. 2019;573:S100–2.
- 65. Penberthy LT, Dahman BA, Petkov VI, DeShazo JP. Effort required in eligibility screening for clinical trials. J Oncol Pract. 2012;8:365–70.
- 66. Alexander M, Solomon B, Ball DL, Sheerin M, Dankwa-Mullan I, Preininger AM, et al. Evaluation of an artificial intelligence clinical trial matching system in Australian lung cancer patients. JAMIA Open. 2020;3:209–15.
- 67. Hassanzadeh H, Karimi S, Nguyen A. Matching patients to clinical trials using semantically enriched document representation. J Biomed Inform. 2020;105:103406.
- 68. Pine M, Tompkins C. Evolution of the International Classification of Diseases—from hierarchical classification to linguistic nuance. JAMA Netw Open. 2024;7:e246474.
- 69. Leslie-Mazwi TM, Bello JA, Tu R, Nicola GN, Donovan WD, Barr RM, et al. Current procedural terminology: history, structure, and relationship to valuation for the neuroradiologist. AJNR Am J Neuroradiol. 2016;37:1972–6.
- 70. Jing X. The unified medical language system at 30 years and how it is used and published: systematic review and content analysis. JMIR Med Inform. 2021;9:e20675.
- 71. Jin Q, Wang Z, Floudas CS, et al. Matching patients to clinical trials with large language models. Nat Commun. 2024;15:9074.
- 72. Yuan J, Tang R, Jiang X, Hu X. Large language models for healthcare data augmentation: an example on patient-trial matching. AMIA Annu Symp Proc. 2024;2023:1324–33.
- 73. Beattie J, Owens D, Navar AM, Schmitt LG, Taing K, Neufeld S, et al. Large language model augmented clinical trial screening. medRxiv. 2024. 10.1101/2024.08.27.24312646.
- 74. Wu DW, Bernstein JA, Bejerano G. Discovering monogenic patients with a confirmed molecular diagnosis in millions of clinical notes with MonoMiner. Genet Med. 2022;24:2091–102.
- 75. Xu Q, Liu Y, Sun D, Huang X, Li F, Zhai J, et al. OncoCTMiner: streamlining precision oncology trial matching via molecular profile analysis. Database (Oxford). 2023;2023:baad077.
- 76. Bernd CL. Clinical case report forms design—a key to clinical trial success. Drug Inf J. 1984;18:3–8.
- 77. Everyone needs a data-management plan. Nature. 2018;555:286.
- 78. Weeks HL, Beck C, McNeer E, Williams ML, Bejan CA, Denny JC, et al. medExtractR: a targeted, customizable approach to medication extraction from electronic health records. J Am Med Inform Assoc. 2020;27:407–18.
- 79. Jonnalagadda S, Cohen T, Wu S, Gonzalez G. Enhancing clinical concept extraction with distributional semantics. J Biomed Inform. 2012;45:129–40.
- 80. López-Úbeda P, Martín-Noguerol T, Juluru K, Luna A. Natural language processing in radiology: update on clinical applications. J Am Coll Radiol. 2022;19(11):1271–85.
- 81. James JK, Maran T, Rice MP, Hunt TS, Peterson KJ, Hogan WJ, et al. Experience with an optical character recognition search application for review of outside medical records. Mayo Clin Proc Digit Health. 2024;2:511–4.
- 82. Laique SN, Hayat U, Sarvepalli S, Vaughn B, Ibrahim M, McMichael J, et al. Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports. Gastrointest Endosc. 2021;93:750–7.
- 83. Shahid F, Hsu MH, Chang YC, Jian WS. Using generative AI to extract structured information from free text pathology reports. J Med Syst. 2025;49:36.
- 84. Garcia-Carmona AM, Prieto ML, Puertas E, Beunza JJ. Leveraging large language models for accurate retrieval of patient information from medical reports: systematic evaluation study. JMIR AI. 2025;4:e68776.
- 85. Van den Broeck J, Cunningham SA, Eeckels R, Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2:e267.
- 86. Biester F, Abdelaal M, Del Gaudio D. LLMClean: context-aware tabular data cleaning via LLM-generated OFDs. In: Tekli J, et al., editors. New trends in database and information systems (ADBIS 2024). Cham: Springer; 2025. p 68–78.
- 87. Databonsai. https://github.com/alvin-r/databonsai. Accessed 22 July 2025.
- 88. Sharma DK, Solbrig HR, Prud’hommeaux E, Pathak J, Jiang G. Standardized representation of clinical study data dictionaries with CIMI archetypes. AMIA Annu Symp Proc. 2017;2016:1119–28.
- 89. Kimura E, Kawakami Y, Inoue S, Okajima A. Mapping drug terms via integration of a retrieval-augmented generation algorithm with a large language model. Healthc Inform Res. 2024;30:355–63.
- 90. Adams MC, Perkins ML, Hudson C, Madhira V, Akbilgic O, Ma D, et al. Breaking digital health barriers through a large language model-based tool for automated Observational Medical Outcomes Partnership mapping: development and validation study. J Med Internet Res. 2025;27:e69004.
- 91. Cornelius VR, Sauzet O, Williams JE, Ayis S, Farquhar-Smith P, Ross JR, et al. Adverse event reporting in randomised controlled trials of neuropathic pain: considerations for future practice. Pain. 2013;154:213–20.
- 92. Wallace S, Myles PS, Zeps N, Zalcberg JR. Serious adverse event reporting in investigator-initiated clinical trials. Med J Aust. 2016;204:231–3.
- 93. Li Y, Li J, He J, Tao C. AE-GPT: using large language models to extract adverse events from surveillance reports—a use case with influenza vaccine adverse events. PLoS ONE. 2024;19:e0300919.
- 94. Hu Y, Chen Q, Du J, Peng X, Keloth VK, Zuo X, et al. Improving large language models for clinical named entity recognition via prompt engineering. J Am Med Inform Assoc. 2024;31:1812–20.
- 95. Sivarajkumar S, Ameri K, Li C, Wang Y, Jiang M. Automating adjudication of cardiovascular events using large language models. arXiv. 2025. 10.48550/arXiv.2503.17222.
- 96. Zirkle J, Han X, Racz R, Samieegohar M, Chaturbedi A, Mann J, et al. Deep learning-enabled natural language processing to identify directional pharmacokinetic drug-drug interactions. BMC Bioinformatics. 2023;24:413.
- 97. Li T, Shetty S, Kamath A, Jaiswal A, Jiang X, Ding Y, et al. CancerGPT for few shot drug pair synergy prediction using large pretrained language models. NPJ Digit Med. 2024;7(1):40.
- 98. Dougherty RF, Clarke P, Atli M, Kuc J, Schlosser D, Dunlop BW, et al. Psilocybin therapy for treatment resistant depression: prediction of clinical outcome by natural language processing. Psychopharmacology. 2025;242(7):1553–61.
- 99. Agarwal K, Choudhury S, Tipirneni S, Mukherjee P, Ham C, Tamang S, et al. Preparing for the next pandemic via transfer learning from existing diseases with hierarchical multi-modal BERT: a study on COVID-19 outcome prediction. Sci Rep. 2022;12:10748.
- 100. Goldenholz DM, Goldenholz SR, Habib S, Westover MB. Inductive reasoning with large language models: a simulated randomized controlled trial for epilepsy. Epilepsy Res. 2025;211:107532.
- 101. López-Úbeda P, Martín-Noguerol T, Escartín J, Luna A. Role of natural language processing in automatic detection of unexpected findings in radiology reports: a comparative study of RoBERTa, CNN, and ChatGPT. Acad Radiol. 2024;31(12):4833–42.
- 102. Geevarghese R, Sigel C, Cadley J, Chatterjee S, Jain P, Hollingsworth A, et al. Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing. J Clin Pathol. 2024;78(2):135–8.
- 103. Sung F, Yang Y, Zhang L, Xiang T, Torr PHS, Hospedales TM. Learning to compare: relation network for few-shot learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2018. p 1199–208.
- 104. Ge Y, Guo Y, Das S, Al-Garadi MA, Sarker A. Few-shot learning for medical text: a review of advances, trends, and opportunities. J Biomed Inform. 2023;144:104458.
- 105. Sung F, Yang Y, Zhang L, Xiang T, Torr PHS, Hospedales TM. Learning to compare: relation network for few-shot learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018. p 1199–208.
- 106. Duan J, Zhang S, Wang Z, Jiang L, Qu W, Hu Q, et al. Efficient training of large language models on distributed infrastructures: a survey. arXiv. 2024. arXiv:2407.20018.
- 107. Zhang Y, Li Y, Cui L, Cai D, Liu L, Fu T, et al. Siren’s song in the AI ocean: a survey on hallucination in large language models. Comput Linguist. 2025. 10.1162/COLI.a.16.
- 108. Lee N, Ping W, Xu P, Patwary M, Fung PN, Shoeybi M, et al. Factuality enhanced language models for open-ended text generation. Adv Neural Inf Process Syst. 2022;35:34586–99.
- 109. Lee K, Ippolito D, Nystrom A, Zhang C, Eck D, Callison-Burch C, et al. Deduplicating training data makes language models better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022. p 8424–45.
- 110. Wang C, Sennrich R. On exposure bias, hallucination and domain shift in neural machine translation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. p 3544–52.
- 111. Rashkin H, Reitter D, Tomar GS, Das D. Increasing faithfulness in knowledge-grounded dialogue with controllable features. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. 2021. p 704–18.
- 112. Malik S, Kharel H, Dahiya DS, Ali H, Blaney H, Singh A, et al. Assessing ChatGPT4 with and without retrieval-augmented generation in anticoagulation management for gastrointestinal procedures. Ann Gastroenterol. 2024;37(5):514–26.
- 113. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv. 2023;55(10):1–35.
- 114. Levi E, Brosh E, Friedmann M. Intent-based prompt calibration: enhancing prompt optimization with synthetic boundary cases. arXiv. 2024. 10.48550/arXiv.2402.03099.
- 115. Peng C, Yang X, Smith KE, Yu Z, Chen A, Bian J, et al. Model tuning or prompt tuning? A study of large language models for clinical concept and relation extraction. J Biomed Inform. 2024;153:104630.
- 116. Zhang Z, Fang M, Chen L, Namazi-Rad MR, Wang J. How do large language models capture the ever-changing world knowledge? A review of recent advances. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. p 8289–311.
- 117. Amugongo LM, Mascheroni P, Brooks S, Doering S, Seidel J. Retrieval augmented generation for large language models in healthcare: a systematic review. PLoS Digit Health. 2025;4(6):e0000877.
- 118. Balancing accuracy and user satisfaction: the role of prompt engineering in AI-driven healthcare solutions. Front Artif Intell. 2025;8:1–10.
- 119. Hosseinzadeh M, Mohammed AH, Rahmani AM, Alenizi FA, Zandavi SM, Yousefpoor E, et al. A secure routing approach based on league championship algorithm for wireless body sensor networks in healthcare. PLoS ONE. 2023;18(10):e0290119.
- 120. El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc. 2008;15(5):627–37.
- 121. Gong Y, Chang X, Mišić J, Mišić VB, Wang J, Zhu H. Practical solutions in fully homomorphic encryption: a survey analyzing existing acceleration methods. Cybersecurity. 2024;7:5.
- 122. Jayabalan M, O’Daniel T. Access control and privilege management in electronic health record: a systematic literature review. J Med Syst. 2016;40(12):261.
- 123. Danilevsky M, Qian K, Aharonov R, Katsis Y, Kawas B, Sen P. A survey of the state of explainable AI for natural language processing. Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 2020. p 447–59.
- 124. Wang L, Yoon KJ. Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Trans Pattern Anal Mach Intell. 2022;44(6):3048–68.
- 125. Kierner S, Kucharski J, Kierner Z. Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: a scoping review. J Biomed Inform. 2023;144:104428.
- 126. Ribeiro M, Singh S, Guestrin C. Why should I trust you?: explaining the predictions of any classifier. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2016. p 97–101.
- 127. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4768–77.
- 128. Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12–22.
- 129. Choi Y, Yu W, Nagarajan MB, Teng P, Goldin JG, Raman SS, et al. Translating AI to clinical practice: overcoming data shift with explainability. Radiographics. 2023;43(5):e220105.
- 130. Jumreornvong O, Perez AM, Malave B, Mozawalla F, Kia A, Nwaneshiudu CA. Biases in artificial intelligence application in pain medicine. J Pain Res. 2025;18:1021–33.
- 131. Minssen T, Vayena E, Cohen IG. The challenges for regulating medical use of ChatGPT and other large language models. JAMA. 2023;330(4):315–6.
- 132. Shumway DO, Hartman HJ. Medical malpractice liability in large language model artificial intelligence: legal review and policy recommendations. J Osteopath Med. 2024;124(7):287–90.
- 133. Májovský M, Černý M, Kasal M, Komarc M, Netuka D. Artificial intelligence can generate fraudulent but authentic-looking scientific medical articles: Pandora’s box has been opened. J Med Internet Res. 2023;25:e46924.
- 134. Cha S. Towards an international regulatory framework for AI safety: lessons from the IAEA’s nuclear safety regulations. Humanit Soc Sci Commun. 2024;11(1):1–13.
- 135. Baldassarre A, Padovan M. Regulatory and ethical considerations on artificial intelligence for occupational medicine. Med Lav. 2024;115(2):e2024013.
- 136. Lee J, Park S, Shin J, et al. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak. 2024;24:366.
- 137. Tam TYC, Sivarajkumar S, Kapoor S, Stolyar AV, Polanska K, McCarthy KR, et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med. 2024;7:258.
- 138. Chen J, Zhu L, Mou W, Lin A, Zeng D, Qi C, et al. STAGER checklist: standardized testing and assessment guidelines for evaluating generative artificial intelligence reliability. iMetaOmics. 2024;1(1):e7.
- 139. Neehal N, Wang B, Debopadhaya S, Dan S, Murugesan K, Anand V, et al. CTBench: a comprehensive benchmark for evaluating language model capabilities in clinical trial design. arXiv. 2024. 10.48550/arXiv.2406.17888.
Data Availability Statement
No datasets were generated or analysed during the current study.