Chinese Medical Journal
2024 Sep 19;137(21):2529–2539. doi: 10.1097/CM9.0000000000003302

Leveraging foundation and large language models in medical artificial intelligence

Io Nam Wong 1, Olivia Monteiro 1, Daniel T Baptista-Hon 1, Kai Wang 2, Wenyang Lu 4, Zhuo Sun 3,4, Sheng Nie 5, Yun Yin 6
Editor: Jing Ni
PMCID: PMC11556979  PMID: 39497256

Abstract

Recent advancements in the field of medical artificial intelligence (AI) have led to the widespread adoption of foundation models and large language models. This review explores their applications within medical AI, introducing a novel classification framework that categorizes them as disease-specific, general-domain, and multi-modal models. The paper also addresses key challenges in data acquisition and augmentation, including issues related to data volume, annotation, multi-modal fusion, and privacy. Additionally, it discusses the evaluation, validation, limitations, and regulation of medical AI models, emphasizing their transformative potential in healthcare. The importance of continuous improvement, data security, standardized evaluation, and collaborative approaches is highlighted to ensure the responsible and effective integration of AI into clinical applications.

Keywords: Artificial intelligence, Foundation model, Large language model, Multi-modal, Data security, Medical AI, Segment anything model, ChatGPT, Disease-specific model, General-domain model, Data privacy, Hallucination, Data annotation

Introduction

The year 2023 marked a turning point in medical artificial intelligence (AI), driven by revolutionary foundation models (FMs) such as OpenAI’s GPT-n series and derived large language models (LLMs) such as Chat generative pre-trained transformer (ChatGPT). These FMs and LLMs have led to remarkable advancements in medical AI globally. Notably, Google’s medical LLM, known as Med-PaLM, has achieved expert-level accuracy in addressing questions from the U.S. Medical Licensing Examination (USMLE).[1] The popularity of using FMs or LLMs in medical diagnoses has increased, as observed in major academic journals. Leading publishers such as Springer Nature have featured various articles dedicated to FMs and LLMs for medical imaging.[2,3] There has been a significant surge in the inaugural curation of articles focusing on digital health reviews, including the AI in Medicine series published by NEJM.[4] Furthermore, the landscape for implementing medical AI has matured, as demonstrated by initiatives such as the call for global medical imaging data sharing[5] and the U.S. Food and Drug Administration (FDA)’s draft guidelines for regulating medical AI.[6] These breakthroughs highlight the potential of FMs and LLMs to enhance medical education and support healthcare professionals in making informed clinical decisions.

This review paper presents a comprehensive overview of the significant progress made by researchers worldwide in medical AI using FMs or LLMs in recent times, shedding light on the advancements and challenges encountered in leveraging AI technologies for medical applications [Figure 1].

Figure 1.

FMs and LLMs for medical AI. The diverse medical data types can be effectively employed in clinical settings by leveraging different medical AI models. AI: Artificial intelligence; ChatGPT: Chat generative pre-trained transformer; FMs: Foundation models; GPT: Generative pre-trained transformer; IRENE: A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics; LLMs: Large language models; RETFound: Foundation model for retinal images; Med-PaLM: A large language model designed to provide high quality answers to medical questions; MedSAM: Segment anything model in medical images; MedLSAM: Localize and segment anything model for 3D CT images; MRI: Magnetic resonance imaging; Neuro-Oph GPT: Neuro-ophthalmology generative pre-trained transformer; OpenMEDLab: An open-source platform for multi-modality foundation models in Medicine; PathChat: A vision-language AI assistant for pathology; PLIP: Pathology language-image pretraining; SAM-Med2D: A specialized version of SAM for 2D medical image segmentation; UNI: A general-purpose self-supervised model for pathology; Virchow: A million-slide digital pathology foundation model; 2D: Two dimensional; 3D: Three dimensional.

FMs and LLMs for Medical AI

Recently, medical AI has placed a significant emphasis on developing FMs and LLMs. Remarkably, the Nature journal series has published comprehensive reviews of FM/LLM-based medical AI.[2,3] Furthermore, Zhang and Metaxas[7] provided an insightful retrospective and prospective evaluation of the challenges and opportunities in FMs for medical image analysis, introducing the concept of a “foundation model lineage” to guide and summarize advancements in medical AI. Building on successful models such as ChatGPT, researchers are utilizing advanced self-supervised pre-training techniques and abundant training data to train FMs, which in turn power LLMs with extensive parameters and exceptional capabilities. Thus, FMs serve as building blocks for language models, with LLMs resulting from fine-tuning and scaling up FMs. The primary differences between foundation models and large language models lie in their sizes and capabilities.[8] Foundation models are typically trained on substantial amounts of text data and are known for their generalizability and adaptability.[9] Language models, in contrast, retain the general language understanding of foundation models but are scaled-up versions that excel at particular tasks.[10] However, the boundary between these two classes of models has recently become blurred, mainly because of the increased availability of data and the evolution of pre-training and fine-tuning techniques.[8,11]
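The pre-train-then-fine-tune relationship described above can be illustrated with a minimal sketch: a frozen pre-trained encoder supplies general-purpose features, and only a small task-specific head is updated for the downstream task. The encoder here is a fixed random projection standing in for a real FM, and the data are synthetic; this is an illustrative toy, not an actual medical model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pre-trained" encoder: a frozen random projection standing in for an FM.
W_frozen = rng.normal(size=(16, 8))

def encode(x):
    return np.tanh(x @ W_frozen)  # frozen features; never updated

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic downstream task (e.g. disease vs. healthy labels).
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tuning: gradient descent on the small task head only.
w_head = np.zeros(8)
feats = encode(X)
for _ in range(500):
    p = sigmoid(feats @ w_head)
    grad = feats.T @ (p - y) / len(y)
    w_head -= 0.5 * grad

acc = ((sigmoid(feats @ w_head) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

The frozen encoder is what lets a single FM serve many downstream tasks: only the lightweight head differs per task.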

In this review, our focus is on FM/LLM-powered medical AI. We propose a classification framework based on modality and task specificity to categorize these into three types. (1) Disease-specific models: These models concentrate on specific medical conditions, tailoring their capabilities to address particular diseases or medical-related tasks. (2) General-domain models: These models exhibit broad applicability across various medical domains, providing versatile solutions for a range of tasks and applications in the medical field. (3) Multi-modal models: These models integrate information from diverse modalities, including clinical text, medical images, and other data types, to enhance the comprehensive understanding of patient’s health conditions and to support accurate diagnoses, personalized treatments, and informed decision-making.

These modality- and task-based classifications provide a framework for understanding the diverse applications of FMs and LLMs in medical AI [Table 1].

Table 1.

Summary of three types of medical AI models and their clinical applications.

Categories Model name Applications
Disease-specific models RETFound Diagnosing and predicting the prognosis of eye diseases; Predicting systemic diseases (heart failure, stroke, and Parkinson’s disease) after fine-tuning
Neuro-Oph GPT Neuro-ophthalmology
UNI Computational pathology tasks of varying diagnostic difficulty: Nuclear segmentation; Primary and metastatic cancer detection; Cancer grading and subtyping; Biomarker screening and molecular subtyping; Organ transplant assessment; Several pan-cancer classifications
Virchow Tile-level pan-cancer detection and subtyping; Slide-level biomarker prediction
General-domain models ChatGPT Diagnosing common and rare clinical cases, medical question answering
Med-PaLM Medical question answering, medical image classification and radiology report summarization
MedSAM Medical image segmentation
SAM-Med2D Analysis of 2D medical images
MedLSAM Analysis of 3D medical images
Multi-modal models PathChat Analysis of pathology images
PLIP Analysis of pathology images using open-source data
OpenMEDLab Medical image processing, medical question answering
IRENE Diagnosing eight different lung diseases

AI: Artificial intelligence; ChatGPT: Chat generative pre-trained transformer; IRENE: A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics; Med-PaLM: A large language model designed to provide high quality answers to medical questions; MedSAM: Segment anything model in medical images; MedLSAM: Localize and segment anything model for 3D CT images; Neuro-Oph GPT: Neuro-ophthalmology generative pre-trained transformer; OpenMEDLab: An open-source platform for multi-modality foundation models in Medicine; PathChat: A vision-language AI assistant for pathology; PLIP: Pathology language-image pretraining; SAM-Med2D: A specialized version of SAM for 2D medical image segmentation; UNI: A general-purpose self-supervised model for pathology; Virchow: A million-slide digital pathology foundation model; 2D: Two dimensional; 3D: Three dimensional.

Disease-specific models

In 2023, significant progress was made in developing disease-specific models, particularly those for analyzing retinal and other pathological images.[12] This progress was facilitated by the early adoption of digitalization and the availability of extensive datasets from both public and private sources.[13,14] These disease-specific models show exceptional versatility and consistently deliver outstanding performances across diverse clinical tasks, including those involving multicenter datasets.[12,15]

In the past few months, a groundbreaking study introduced RETFound, a retinal image analysis FM. RETFound underwent self-supervised learning on over 1.64 million unlabeled retinal images and exhibited optimal performance in diagnosing and predicting the prognosis of eye diseases, as well as predicting systemic diseases after fine-tuning. The validation of RETFound’s performance involved test data from eight different centers encompassing conditions such as heart failure, stroke, and Parkinson’s disease. This significant achievement highlights the potential of intelligent diagnostic applications based on retinal images.[12] Similarly, Wenzhou Medical University collaborated with Macau University of Science and Technology to unveil the Neuro-Oph GPT digital healthcare system, a large-scale language model designed specifically for neuro-ophthalmology.[16]
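The self-supervised pre-training behind models such as RETFound follows a masked-reconstruction recipe: hide most patches of each unlabeled image and learn to predict them from the visible ones, so no diagnostic labels are needed. The sketch below uses a linear least-squares reconstructor on synthetic correlated "patches" purely to illustrate the objective; it is not the actual RETFound architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "retinal images": flat vectors of 64 patch intensities with shared
# low-rank structure, so masked patches are predictable from visible ones.
base = rng.normal(size=(500, 4))
mixing = rng.normal(size=(4, 64))
images = base @ mixing + 0.05 * rng.normal(size=(500, 64))

mask = rng.random(64) < 0.75           # mask ~75% of patches (MAE-style ratio)
visible, hidden = images[:, ~mask], images[:, mask]

# Linear reconstructor standing in for an encoder-decoder: least squares
# predicting masked patches from visible ones (purely self-supervised).
W, *_ = np.linalg.lstsq(visible, hidden, rcond=None)
recon_err = np.mean((visible @ W - hidden) ** 2)
naive_err = np.mean(hidden ** 2)       # baseline: predict zeros everywhere
print(recon_err, naive_err)
```

Because reconstruction succeeds only when the model captures image structure, the learned representation transfers to downstream diagnostic tasks after fine-tuning.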

UNI, a self-supervised vision encoder for pathology, has significantly advanced the field. UNI was pre-trained on an extensive dataset of over 100,000 whole-slide images (WSIs) covering 20 major organ types and was shown to perform well on 34 clinically relevant anatomic pathology tasks. These tasks include nuclear segmentation, primary and metastatic cancer detection, cancer grading and subtyping, biomarker screening and molecular subtyping, organ transplant assessment, and several pan-cancer classification tasks.[17] Additionally, Vorontsov et al[18] developed the Virchow model, a vision transformer with 632 million parameters trained on 1.5 million whole-slide pathology images, which demonstrated outstanding efficacy in tasks such as tile-level pan-cancer detection and subtyping, as well as slide-level biomarker prediction.

Recently, there has been a notable transition toward the adoption of multi-modal datasets. This approach allows for holistic analysis of various aspects of a disease, enabling more accurate predictions, improved diagnosis, and a deeper understanding of the underlying mechanisms,[19] as discussed further in the section on multi-modal models below.

General-domain models

Universal medical AI models are currently being developed to address the limitations of specialized models and cover a wide range of clinical applications, such as grounded radiology reports, augmented procedures, and decision support.[2] With its 175 billion parameters and a training corpus of 45 terabytes of text data, ChatGPT has demonstrated its capability to diagnose common and rare medical conditions.[20,21] In a pilot study, ChatGPT was assessed for its ability to diagnose ten rare eye diseases, demonstrating a favorable level of accuracy.[22] Eriksen et al[23] evaluated the efficacy of GPT-4 in diagnosing complex cases, revealing its exceptional performance compared with simulated human medical journal readers generated from online answers. GPT-4 achieved an accuracy rate of 57.37% (21.8 out of 38.0 clinical case challenges), outperforming 99.98% (9998 out of 10,000) of the simulated readers. This highlights the potential of universal medical AI models to address the limitations of current diagnostic processes.

By the end of 2023, global efforts to develop universal medical AI models primarily focused on two aspects: LLMs and universal segmentation models for medical imaging analysis. In a study published in Nature, scientists from Google Research and DeepMind introduced the Med-PaLM, a specialized medical language model, from a general language model through instruction fine-tuning.[1] Med-PaLM achieved a real-time expert evaluation score of 92.6% (129.6 out of 140.0 questions), which closely resembled the performance of clinical doctors (92.9%, 130/140). Additional assessments by non-medical reviewers found an accuracy rate of 80.3% (112.4/140.0) in Med-PaLM’s answers, with direct resolution of patient queries in 94.4% (132.2/140.0) of cases, compared to 95.9% (134.3/140.0) for human clinical doctors. Subsequent releases demonstrated that Med-PaLM 2 showed expert-level performance, achieving an accuracy rate of 86.5% (922/1066) when answering multiple-choice questions and open-ended questions, and engaging in answer reasoning on the USMLE.[1,24]

In medical image segmentation, researchers primarily employ the segment anything model (SAM) to develop universal segmentation models capable of segmenting any target in arbitrary image modalities.[25] Ma et al[26] introduced MedSAM, an advanced technique for medical image segmentation. MedSAM is a fine-tuned version of SAM trained on a large dataset of over one million image-mask pairs spanning 10 different imaging modalities and more than 30 types of cancer. The performance of MedSAM surpasses those of SAM and other specialized models. Kim et al[27] adapted SAM for medical video segmentation to generate Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation for Echocardiography (MediViSTA-SAM). Subsequently, two additional SAM-based models were introduced for analyzing two-dimensional (SAM-Med2D)[28] and three-dimensional (3D) medical images (MedLSAM).[29] Despite the promising generalization demonstrated by these models, a fully annotation-free universal medical image SAM model remains a milestone yet to be reached.
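The prompt-driven paradigm shared by these SAM-family models can be caricatured in a few lines: a spatial prompt (here a bounding box) restricts where foreground is predicted. The thresholding stand-in below is purely illustrative on a synthetic "scan"; real SAM-style models use learned image and prompt encoders rather than a fixed intensity threshold.

```python
import numpy as np

def segment_with_box(image, box, threshold=0.5):
    """Toy stand-in for box-prompted segmentation (SAM-style):
    predict foreground only inside the prompt box."""
    r0, c0, r1, c1 = box
    mask = np.zeros(image.shape, dtype=bool)
    mask[r0:r1, c0:c1] = image[r0:r1, c0:c1] > threshold
    return mask

# Synthetic "scan" with a bright lesion at rows 10-20, cols 12-22.
img = np.zeros((32, 32))
img[10:20, 12:22] = 1.0

# The box prompt tells the segmenter roughly where to look.
mask = segment_with_box(img, box=(8, 10, 24, 26))
print(mask.sum())  # number of foreground pixels found inside the box
```

The key design point is that one model plus different prompts replaces many per-organ segmentation models.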

Multi-modal models

Acosta et al[30] emphasized the growing availability of biomedical data and the declining costs of omics sequencing to lay the foundation for developing multi-modal AI solutions. As demonstrated by recent advancements, these models possess unparalleled capabilities in processing textual information and integrating data from multiple modalities.

Recently, Lu et al[31] successfully combined a foundational vision encoder with a pre-trained LLM to create PathChat, an innovative multi-modal system for pathological image analysis. By incorporating multiturn conversations, PathChat enhances diagnostic capabilities and enables a more comprehensive and contextual interpretation of medical images.

Many AI studies focused on pathology analysis underscore the importance of accessing substantial, high-quality, and diverse clinical data from public health organizations, such as hospitals, to construct robust medical AI models.[32–34] However, Huang et al[35] presented an alternative approach, utilizing openly available Twitter data (now referred to as X) to create an extensive OpenPath dataset consisting of over 200,000 pathology images paired with corresponding natural language descriptions to develop a pathology language-image pre-training (PLIP) model. This multi-modal medical AI outperforms previous contrastive language-image pre-trained models and addresses the significant challenge of the limited availability of publicly annotated medical images.

Another great achievement was the release of “OpenMEDLab Pudong Medical” by the Shanghai Artificial Intelligence Laboratory. This groundbreaking effort produced the world’s first group of multi-modal medical FMs, covering more than ten medical data modalities, and exemplifies the innovative endeavors of Chinese researchers to advance LLMs within the industry.[36] Notably, the models were made openly accessible, enabling broader collaboration and potential applications in the medical community.[37]

Furthermore, Zhou et al[38] published a unified framework called IRENE, which jointly learns holistic representations of medical images, unstructured chief complaints, and structured clinical information to showcase significant improvements in diagnosing eight different lung diseases.[38] These remarkable developments highlight the transformative potential of multi-modal AI solutions in healthcare.

In summary, the introduction of this classification model helps to understand the application of FMs and LLMs in medical AI more effectively. Disease-specific models are used in specialized medical condition analysis, general domain models cover a wide range of clinical applications, and multi-modal models combine diverse modalities of datasets for comprehensive disease analysis. Collectively, these models contribute to the advancement of medical AI and have the potential to improve healthcare outcomes.

Data Acquisition and Augmentation for Medical AI Models

Data acquisition and augmentation are crucial in developing robust and effective AI models. They should be carried out carefully to ensure that the acquired data are diverse, balanced, and representative while addressing biases and data privacy concerns. Several challenges should be considered when acquiring and augmenting data for AI models. Common challenges include limited data availability,[39] high annotation costs,[40] management of different data modalities,[41] and protection of patient privacy,[42] all of which significantly hinder the development of medical AI.

Despite these challenges, FMs and LLMs that have achieved breakthroughs in computer vision, natural language processing, and knowledge discovery are making their mark in medical AI. These models offer promising capabilities in medical dialogues,[43] clinical risk prediction,[44] treatment decisions,[44,45] and other healthcare applications, including PathoDuet for pathological image analysis,[46] RETFound for fundus image analysis,[12] Endo-FM for endoscopic video analysis,[47] and Med-Flamingo and Med-PaLM for medical question answering.[24,48] The growing adoption of these models in clinical diagnosis, medical dialogue, and drug development benefits both healthcare professionals and patients.

However, the effectiveness of medical AI models is hindered by long-standing issues in managing medical data, making it crucial to address these challenges for the widespread application of powerful AI systems.[2,7,49,50] Generative AI powered by FMs/LLMs offers potential solutions to healthcare data challenges. The OpenAI team demonstrated the remarkable capabilities of the GPT-4 model for medical text understanding and generation through extensive experiments involving various medical scenarios.[23,45] Moreover, Chambon et al[51] showed the potential of FMs to address data scarcity by generating high-quality X-ray images from text using stable diffusion models. These achievements have led to advancements in medical AI.

Addressing data volume challenges in medical AI

A significant challenge in the field of medical AI is the limited availability of data, particularly for rare diseases. Robust model training requires ample data; however, comprehensive datasets are often lacking under rare conditions. Furthermore, strict regulations regarding patient privacy make it difficult to access medical records even for more common diseases. The construction of real-world datasets is a resource-intensive process involving data collection, cleaning, and annotation, further complicating these challenges.[52,53]

Generative AI models powered by FMs/LLMs offer promising solutions for addressing the challenges of limited data volumes in medical applications. First, they assist in medical data augmentation by generating additional training data and leveraging their vast knowledge to enhance the diversity and informativeness of datasets. Generative models such as large diffusion models have demonstrated proficiency in this area. For instance, a study conducted at Harvard University successfully employed DALL·E to generate synthetic dermatological images for training classification models.[54] Similarly, Sun et al[55] developed the PathAsst generative foundation model that generates instruction-following data specifically tailored for training pathology-specific models. Second, FMs were trained to enable more efficient utilization of existing data resources by serving as a bridge between limited downstream data and abundant upstream data sources. For instance, Mishra et al[56] customized domain pre-trained language medical foundation models (CXR-BERT, BlueBERT, and ClinicalBERT) to improve pathological image classification for rare diseases. This demonstrates targeted training even with limited data availability. Notably, at the 2023 NeurIPS conference, OpenMEDLab initiated the MedFMC Foundation Model Prompting for Medical Image Classification Challenge, attracting over 600 global teams and significantly stimulating application-oriented research on general FMs in medical image classification tasks.[57] Furthermore, the abundance of information available on the internet enhances the accessibility of training data for FMs. High-quality medical data platforms like PubMed serve as invaluable repositories, while careful processing and validation steps ensure the utility and reliability of the acquired data.[18,58]
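The augmentation strategy described above can be sketched with simple resampling plus noise standing in for a trained generative model; a diffusion model or GAN, as in the cited studies, would model the rare-class distribution far more faithfully. All data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_rare_class(X_rare, n_new, noise_scale=0.1):
    """Toy generative augmentation: sample new examples near existing
    rare-class examples. A stand-in for a diffusion/GAN generator."""
    idx = rng.integers(0, len(X_rare), size=n_new)
    return X_rare[idx] + noise_scale * rng.normal(size=(n_new, X_rare.shape[1]))

# Only 20 real examples of a rare condition, 12 features each.
X_rare = rng.normal(loc=2.0, size=(20, 12))
X_synth = augment_rare_class(X_rare, n_new=200)

print(X_synth.shape)  # (200, 12): tenfold more rare-class training data
```

In practice the synthetic samples are mixed into the training set so the model sees a more balanced class distribution; their clinical plausibility must still be validated before use.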

Overall, generative AI models powered by FMs/LLMs offer a multifaceted approach to address the challenges posed by limited data volumes in medical AI. They facilitate data augmentation and effective utilization of existing resources, fostering advancements in medical image classification, the diagnosis of rare diseases, and other healthcare applications.

Data annotation

In addition to addressing the issue of increasing data volume, annotating collected data is a crucial step in leveraging medical AI models for disease diagnosis, treatment planning, and advancing healthcare systems. To enrich medical datasets with metadata and labels, data annotation adds valuable human expertise and meaningful context to offer valuable insights into medical education, diagnosis, and AI applications. However, data annotation itself faces challenges such as the scarcity of expert annotators and the complexity of the annotation processes. Fortunately, the scalability of FMs and LLMs presents an opportunity to address the cost implications associated with large-scale medical data annotations.[57,59,60]

Text annotation, which involves extracting crucial information from diverse medical reports, enables physicians to quickly understand patient conditions and make more accurate diagnoses. It also facilitates the creation of comprehensive patient records for longitudinal tracking and the identification of patterns in disease progression. Although human experts demonstrate high accuracy in extracting medical information, manual extraction is time-consuming. FMs, particularly LLMs, exhibit information extraction capabilities similar to those of human experts, and utilizing these models offers cost savings for healthcare professionals. The Med-PaLM 2 model, trained on an extensive dataset of medical text and codes, is a good example: it outperforms clinicians in answering medical questions, demonstrating its potential for effectively annotating medical text data.[24]
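The extraction task itself can be illustrated with a deliberately simple rule-based stand-in; a real pipeline would prompt an LLM such as Med-PaLM 2 rather than rely on hand-written patterns, and the clinical note below is fabricated.

```python
import re

def extract_findings(report: str) -> dict:
    """Toy rule-based extractor illustrating the text-annotation task.
    A real pipeline would prompt an LLM instead of using regexes."""
    fields = {
        "age": re.search(r"(\d+)-year-old", report),
        "diagnosis": re.search(r"diagnosis of ([a-z ]+?)(?:\.|,)", report, re.I),
        "medication": re.search(r"prescribed (\w+)", report, re.I),
    }
    return {k: (m.group(1).strip() if m else None) for k, m in fields.items()}

note = ("A 67-year-old male presents with a diagnosis of heart failure. "
        "He was prescribed furosemide on admission.")
print(extract_findings(note))
```

An LLM-based extractor handles paraphrase, negation, and free-text variation that brittle patterns like these cannot, which is precisely why FMs reduce annotation cost at scale.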

Medical image annotation is essential in the understanding and analysis of pathological and radiological images. The annotation of image-segmentation masks is particularly important for disease diagnosis and lesion localization. The introduction of SAM, a vision FM specializing in image segmentation, has marked a significant development in this field. Subsequent studies exploring the application of SAM for segmenting medical images have shown promising results, validating its use in image data annotation.[61] However, the direct utilization of SAM may lead to inconsistencies and unreliable results, requiring fine-tuning of medical images before annotation.[62] To address this challenge, Zhang et al[29] developed MedLSAM, a 3D CT image localization and segmentation FM based on SAM. This innovative model ensures consistent and time-efficient 3D medical image annotation regardless of the dataset size, significantly reducing annotation costs.

Data annotation has become increasingly efficient and accurate by leveraging the capabilities of these models. Text annotation benefits from the information extraction capabilities of LLMs, whereas medical image annotation is enhanced through models such as SAM and MedLSAM. These advancements have contributed to the effective utilization of annotated medical data, paving the way for improved disease diagnoses, treatment planning, and advancements in the overall healthcare system.

Multi-modal data fusion

Multi-modal data fusion in healthcare is essential for refining diagnostic precision and treatment efficacy, and FMs/LLMs offer a new perspective. The integration of diverse medical data including images, diagnostic reports, and biomedical signals provides healthcare professionals with comprehensive insights into patient health. FMs/LLMs undergo pre-training using large-scale paired multi-modal data, equipping them with the ability to understand and process multi-modal inputs. Subsequently, the transformer architecture is crucial in fusing multi-modal data in latent space during applications.[63–65]

To achieve effective fusion, multi-modal pre-training utilizes paired data samples, such as images and the corresponding text, ensuring that different modalities share representations in the latent space. This approach is commonly used in medical subfields, such as radiology and pathology, where paired data, such as images and text reports, are common. For example, Microsoft trained the BioViL model on a large dataset of chest X-rays and their corresponding radiology reports to obtain matched image and language features.[66] Similarly, Huang et al[35] curated the OpenPath dataset, which consists of pathology image-text pairs from Twitter (now referred to as X), and used it to train the PLIP model, achieving impressive zero-shot predictions across diverse image inputs.
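The shared-latent-space idea behind contrastive pre-training of this kind (as in PLIP or BioViL) can be sketched with toy embeddings: matched image-report pairs should score highest under cosine similarity, which is what enables zero-shot retrieval. Real models learn the two encoders with an InfoNCE-style contrastive loss; here the "embeddings" are synthetic noisy views of shared concept vectors.

```python
import numpy as np

rng = np.random.default_rng(7)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy paired embeddings: each image embedding and its report embedding are
# noisy views of the same underlying concept vector (the shared latent).
concepts = rng.normal(size=(5, 32))
img_emb = normalize(concepts + 0.1 * rng.normal(size=(5, 32)))
txt_emb = normalize(concepts + 0.1 * rng.normal(size=(5, 32)))

# Contrastive pre-training aims for exactly this: the cosine-similarity
# matrix has its maximum on the diagonal (matched pairs).
sim = img_emb @ txt_emb.T
retrieved = sim.argmax(axis=1)  # zero-shot retrieval: image -> best report
print(retrieved)
```

Once the modalities are aligned this way, a text query can classify images (and vice versa) without any task-specific labels.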

In contrast, LLMs leverage their powerful semantic comprehension capabilities, driven by attention mechanisms, to extend beyond language processing and handle multi-modal scenarios. Data from different modalities are combined into LLM prompts, and the resulting fused input undergoes transformer layers, in which attention mechanisms facilitate information exchange for multi-modal fusion. The success of LLMs such as GPT-4 in the medical domain demonstrates the potential of this fusion approach in healthcare.[45] For example, Moor et al[48] developed the Med-Flamingo model by concatenating images and text inputs into a unified sequence, processing them using an LLM. Med-Flamingo showed impressive few-shot learning abilities in medical image-based question-and-answer tasks.

Data privacy

The protection of privacy in medical data has always been of utmost importance, prompting governments worldwide to enforce stringent laws and regulations governing the sharing and use of sensitive healthcare information.[67] However, the emergence of AI technology, specifically FMs, presents a novel approach to addressing privacy issues. As mentioned above, FM/LLM-based generative models possess powerful generative capabilities that allow the creation of synthetic datasets suitable for training models while ensuring the exclusion of any patient-specific details that could compromise privacy. In certain studies, researchers explored the use of diffusion models to generate highly detailed 3D medical images that are devoid of sensitive information but contain the necessary features for effective model training.[68] This approach demonstrates the potential of leveraging FMs/LLMs for data generation without compromising patient privacy. However, these models can inadvertently learn and reproduce patterns from pre-trained patient data in their generated synthetic data. This raises concerns about the potential re-identification of individuals and the preservation of privacy.[69,70] Therefore, it is imperative to thoroughly de-identify pre-training data to ensure that all patient-specific information is removed or anonymized to maintain confidentiality.

To fully leverage FMs/LLMs for medical data generation, ongoing research efforts are actively addressing the complex privacy challenges associated with these models. Researchers are exploring techniques and methodologies to enhance privacy preservation, such as differential privacy, federated learning, and secure aggregation methods.[71–74] These approaches strike a balance between the utility of the generated data for training models and the protection of patient privacy. It is crucial to continue investigating and developing robust privacy-preserving mechanisms and frameworks to ensure that medical AI models in healthcare are deployed in a responsible and privacy-conscious manner. Thus, we can unlock the potential benefits of these models while safeguarding the sensitive medical information of individuals.[75–77]
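Among these techniques, the core mechanism of differentially private training (DP-SGD) is compact enough to sketch: clip each per-example gradient to bound any single patient's influence, then add calibrated Gaussian noise before averaging. The hyperparameters and gradients below are illustrative only; a real deployment would also track the cumulative privacy budget.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_aggregate(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Core DP-SGD step: clip each per-example gradient to bound its
    influence, then add calibrated Gaussian noise to the sum."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = noise_multiplier * clip_norm * rng.normal(size=total.shape)
    return (total + noise) / len(per_example_grads)

# Gradients of very different magnitudes: clipping equalizes their influence.
grads = [rng.normal(size=10) * s for s in (0.5, 3.0, 10.0)]
update = dp_aggregate(grads)
print(update.shape)
```

Clipping is what makes the noise scale meaningful: with each example's contribution bounded by `clip_norm`, the added Gaussian noise yields a formal differential-privacy guarantee for the update.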

Evaluation and Validation of Medical AI

After the training phase, it is essential to accurately gauge the performance and security metrics of a model before deploying it in real-world scenarios. However, assessing such models poses a significant challenge because of their scale and intrinsic nature. In this section, we explore three evaluation strategies for AI models, each with its advantages and limitations.

A commonly used approach involves using fixed datasets and evaluation metrics. Medical researchers have curated various datasets and metrics such as Medical Information Mart for Intensive Care III (MIMIC-III) and Biomedical Language Understanding and Reasoning Benchmark (BLURB),[78,79] which provide standardized and repeatable evaluation results and ensure fair comparisons between models. However, real-world scenarios often expose AI models to diverse and uncommon scenarios that are not adequately represented by static datasets. Moreover, datasets and metrics that align with human values are scarce, and keeping up with the pace of model development when updating metrics is challenging.[50,80]
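A fixed-benchmark evaluation of this first kind reduces to computing standard metrics against gold labels. The sketch below computes accuracy and macro-averaged F1 on a synthetic three-class diagnosis task; the labels are fabricated for illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: the mean of per-class F1 scores, so rare
    diagnoses count as much as common ones."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Synthetic benchmark: gold labels vs. model predictions.
gold = ["pneumonia", "normal", "normal", "effusion", "pneumonia", "normal"]
pred = ["pneumonia", "normal", "effusion", "effusion", "normal", "normal"]

print(accuracy(gold, pred))  # 4 of 6 correct
print(macro_f1(gold, pred, ["pneumonia", "normal", "effusion"]))
```

Because such metrics are fully deterministic given the dataset, results are repeatable and comparable across models, which is exactly the strength (and the rigidity) of this strategy.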

Another widely employed evaluation technique is expert assessment, in which human evaluators appraise the performance of a model. Chambon et al[51] invited radiologists to assess the accuracy of ChatGPT in translating radiology reports. Although human experts have the advantages of precise evaluation, flexibility, and alignment with human values, their involvement incurs high costs and their assessments may be subjective because of their backgrounds.[81]

The third approach explores the use of a sufficiently robust established model aligned with human values as a benchmark for evaluating other models. This method often circumvents the need for fixed datasets or annotations, thereby increasing the efficiency by relying solely on the inferences of the benchmark model. For instance, Chiang et al[82] validated the performance of ChatGPT in tasks such as story generation and adversarial attacks, showcasing evaluation at a human-expert level and consistent results across diverse prompts. However, identifying such a benchmark model in medicine is challenging because of domain-shift issues and the potential lack of domain expertise. Nonetheless, leveraging automated evaluation with FMs and LLMs remains a promising research avenue. Combining the strengths of human experts and automated evaluations offers the potential for higher-quality evaluation outcomes.

Limitations of Foundation and Large Language Models in Medical AI

Although FMs and LLMs can mitigate the scarcity of medical data, they have noteworthy limitations, including hallucinations, biases, and a lack of standardization.[50,83–85]

Hallucinations occur when generative AI models create content that seems plausible but is incorrect. This issue can arise from various factors related to the training data, such as quality, quantity, and inherent biases. In the medical context, erroneous information generated by medical AI can have severe consequences. Therefore, it is critical to address concerns related to hallucinations to ensure accurate medical diagnosis, decision-making, and patient care. The identification and assessment of hallucination severity are essential, and evaluation criteria should consider aspects such as factual accuracy, coherence, and consistency. For instance, benchmarks such as the medical domain hallucination test (Med-HALT) can be used to assess hallucinations in LLMs.[86] In addition, collaboration between humans and AI can help detect model-generated hallucinations, and crowdsourcing platforms can be used to collect human evaluations that support the development of reliable medical FMs. Furthermore, the development of adversarial testing methods can help identify potential triggers of hallucinations, thereby bolstering the reliability of the content generated by models.[87]
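Consistency, one of the evaluation criteria noted above, can be screened for automatically: if repeated samples of the same question disagree, the answer is flagged for human review. This is a minimal sketch of such a self-consistency check; `sample_model` is a hypothetical stand-in for sampling an LLM several times at nonzero temperature, and the samples are invented.

```python
from collections import Counter

def sample_model(question, n=5):
    # Stub: invented samples simulating an inconsistent model.
    return ["10 mg", "10 mg", "20 mg", "10 mg", "5 mg"][:n]

def consistency_flag(question, n=5, threshold=0.8):
    """Flag an answer for human review when repeated samples disagree."""
    samples = sample_model(question, n)
    top_answer, count = Counter(samples).most_common(1)[0]
    agreement = count / len(samples)
    return {"answer": top_answer,
            "agreement": agreement,
            "needs_review": agreement < threshold}

result = consistency_flag("Usual starting dose of drug X?")
print(result)  # agreement 0.6 < 0.8 -> flagged for human review
```

A consistency check of this kind cannot prove an answer is correct (a model can hallucinate consistently), but it is a cheap first filter before routing cases to expert reviewers.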

FMs/LLMs may also exhibit biases toward specific groups, regions, sexes, and other factors. These biases can originate from cultural, linguistic, demographic, and political influences present in the training data. For example, evaluations of ChatGPT's performance on medical licensing exams in the US and China revealed language biases favoring the English-language exams, reflecting the composition of the model's training data.[88,89] Since FMs/LLMs are often trained without supervision, the online training data they rely on may contain inaccuracies and biases that go unnoticed, leading to discrepancies between the model outputs and the expectations of human experts in the medical field. To mitigate this, it is crucial to involve human experts in both creating datasets and evaluating models.[90] In addition, stakeholders and developers must acknowledge the limitations of these model architectures and training methodologies to address harmful data and adversarial influences. The incorporation of adversarial training during the development phase of medical models can strengthen the defense mechanisms against detrimental content.[90]
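Such biases can be surfaced with a simple subgroup audit: stratify evaluation results by group (here, language) and compare per-group accuracy. The records below are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: list of (group, correct) pairs -> per-group accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical per-item results on English- vs. Chinese-language exams.
records = [("en", True), ("en", True), ("en", False),
           ("zh", True), ("zh", False), ("zh", False)]
scores = accuracy_by_group(records)
gap = max(scores.values()) - min(scores.values())
print(scores)               # per-group accuracy, e.g. en ~0.67, zh ~0.33
print(f"accuracy gap: {gap:.3f}")  # -> accuracy gap: 0.333
```

Reporting the worst-group accuracy and the between-group gap alongside the overall score makes this kind of disparity visible before deployment rather than after.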

As medical AI applications continue to expand, the standardization of AI models has become increasingly important. Regulatory agencies, such as the US FDA, are beginning to classify programs that perform medical tasks as medical devices.[6] In the future, medical AI models may be considered novel medical devices and subject to rigorous regulations. This standardization process involves defining the applications of medical AI models, conducting benchmark testing, creating user guidelines, and validating their effectiveness through trials. After deployment, these models must be continuously monitored and promptly adapted to changing tasks and environments.[2]

Regulation of Medical AI

The rapid development of AI in healthcare has brought about regulatory challenges that have prompted nuanced responses worldwide.[91] In the United States, the FDA’s regulatory trajectory has evolved from the initial “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD)”[92] to the subsequent “Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan”,[93] “Content of Premarket Submissions for Device Software Function”[94] and the most recent “Artificial Intelligence and Medical Products: How CBER, CDER, CDRH, and OCP are Working Together”.[95] These developments reflected a pragmatic shift from proposals to actionable guidelines.

In Europe, the European Union has introduced initiatives such as the “European Health Data Space”[96] and the “Artificial Intelligence Act”[97] to promote responsible utilization of health data. The former aims to enhance healthcare by empowering individuals to control their health data, whereas the latter designates medical diagnostic systems as high-risk AI systems and calls for targeted regulatory measures. The European Medicines Agency (EMA) has issued a draft reflection paper emphasizing the importance of enhancing AI credibility for patient safety and robust clinical research.[98] By contrast, the United Kingdom has adopted a moderate approach without introducing new laws, as outlined in its AI regulatory blueprint.[99]

In China, the National Medical Products Administration (NMPA) has issued various documents, including guidelines on deep learning-assisted decision software, principles for the registration review of AI medical devices, and announcements defining AI medical software product categories. The recently released “Summary of the Results of the First Medical Device Product Classification in 2023” provides clarity to the industry.[100]

These regulatory frameworks lay the groundwork for scientific oversight and governance of AI medical devices. They reflect the need to address the unique challenges posed by AI in healthcare and ensure the responsible and safe implementation of these technologies. These regulatory frameworks are not static and are expected to undergo further updates as AI in healthcare continues to evolve. Ongoing discussions and collaborations between regulatory bodies, researchers, healthcare professionals, and industry stakeholders will shape the future of regulatory policies and ensure the safe and effective integration of AI in healthcare.

Summary and Outlook

The introduction of medical AI models in healthcare has opened new possibilities for enhancing the efficiency of diagnosis and treatment. However, this breakthrough has been accompanied by challenges that require attention, including data security, biases, and the need for supervision during training. Despite these obstacles, there is optimism regarding the potential of medical AI models to improve healthcare outcomes through collaborations between researchers and healthcare professionals.[101–103] To facilitate progress, it is imperative to establish an environment that enables seamless data sharing supported by robust regulations.[5,104]

The integration of AI into healthcare is poised to advance in various areas. Future efforts should focus on refining medical AI models to achieve greater accuracy and specificity in disease diagnosis and treatment. This will involve continuous algorithmic improvements and training on diverse datasets to ensure optimal performance across various medical conditions.[35,105] Additionally, the development of secure frameworks for handling extensive medical data is paramount to guaranteeing patient privacy while enabling efficient analyses.[5,76,77,106]

Another crucial aspect for future exploration is the development of standardized evaluation protocols. The establishment of consistent criteria for evaluating the effectiveness and reliability of medical AI models will enable stakeholders to make well-informed decisions regarding their implementation.[50,78–80,107] Additionally, fostering collaborations among researchers, healthcare professionals, and regulatory bodies is essential to shaping and enforcing effective policies that govern the responsible integration of medical AI into clinical practice.[91,108,109]

Medical AI models have tremendous potential for transforming healthcare. With their ability to enhance diagnostics, customize treatments, and improve patient outcomes, these models can redefine the healthcare field. However, to fully harness the transformative power of medical AI, continuous research, innovation, and careful ethical considerations are required to address the challenges that arise when implementing this tool effectively.

Funding

This work was supported by grants from the Macau Science and Technology Development Fund (No. 0069/2021/AFJ) and the Macau University of Science and Technology Faculty Research Grants (No. FRG-22-022-FMD).

Conflicts of interest

None.

Footnotes

Io Nam Wong and Olivia Monteiro contributed equally to this work.

How to cite this article: Wong IN, Monteiro O, Baptista-Hon DT, Wang K, Lu WY, Sun Z, Nie S, Yin Y. Leveraging foundation and large language models in medical artificial intelligence. Chin Med J 2024;137:2529–2539. doi: 10.1097/CM9.0000000000003302

References

  • 1.Singhal K Azizi S Tu T Mahdavi SS Wei J Chung HW, et al. Large language models encode clinical knowledge. Nature 2023;620:172–180. doi: 10.1038/s41586-023-06291-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Moor M Banerjee O Abad ZSH Krumholz HM Leskovec J Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–265. doi: 10.1038/s41586-023-05881-4. [DOI] [PubMed] [Google Scholar]
  • 3.Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023;29:1930–1940. doi: 10.1038/s41591-023-02448-8. [DOI] [PubMed] [Google Scholar]
  • 4.AI in Medicine (2024 collection). Available at https://www.nejm.org/ai-in-medicine. [Last accessed on Aug 6, 2024].
  • 5.Saenz A, Chen E, Marklund H, Rajpurkar P. The MAIDA initiative: Establishing a framework for global medical-imaging data sharing. Lancet Digit Health 2024;6:e6–e8. doi: 10.1016/S2589-7500(23)00222-4. [DOI] [PubMed] [Google Scholar]
  • 6.U.S. Food and Drug Administration . Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions. 2023. Available at https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial. [Last accessed on Aug 6, 2024].
  • 7.Zhang S, Metaxas D. On the challenges and perspectives of foundation models for medical image analysis. Med Image Anal 2024;91:102996. doi: 10.1016/j.media.2023.102996. [DOI] [PubMed] [Google Scholar]
  • 8.Wornow M Xu Y Thapa R Patel B Steinberg E Fleming S, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med 2023;6:135. doi: 10.1038/s41746-023-00879-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Scott IA, Zuccon G. The new paradigm in machine learning – Foundation models, large language models and beyond: A primer for physicians. Intern Med J 2024;54:705–715. doi: 10.1111/imj.16393. [DOI] [PubMed] [Google Scholar]
  • 10.Pan SR Luo LH Wang YF Chen C Wang JP Wu XD, et al. Unifying large language models and knowledge graphs: A roadmap. IEEE Trans Knowl Data Eng 2024;36:3580–3599. doi: 10.1109/TKDE.2024.3352100. [Google Scholar]
  • 11.Myers D Mohawesh R Chellaboina VI Sathvik AL Venkatesh P Ho, YH, et al. Foundation and large language models: Fundamentals, challenges, opportunities, and social impacts. Clust Comput 2024;27:1–26. doi: 10.1007/s10586-023-04203-7. [Google Scholar]
  • 12.Zhou Y Chia MA Wagner SK Ayhan MS Williamson DJ Struyven RR, et al. A foundation model for generalizable disease detection from retinal images. Nature 2023;622:156–163. doi: 10.1038/s41586-023-06555-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.van Noordt C, Misuraca G. Artificial intelligence for the public sector: Results of landscaping the use of AI in government across the European Union. Gov Inf Q 2022;39:101714. doi: 10.1016/j.giq.2022.101714. [Google Scholar]
  • 14.Wang F, Preininger A. AI in health: State of the art, challenges, and future directions. Yearb Med Inform 2019;28:16–26. doi: 10.1055/s-0039-1677908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat Biomed Eng 2022;6:1399–1406. doi: 10.1038/s41551-022-00936-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.International Ophthalmology Times. Available at: http://www.iophthalmology.net/index/article/detail?id=13158. [Last accessed on August 6, 2024].
  • 17.Chen RJ Ding T Lu MY Williamson DFK Jaume G Song AH, et al. Towards a general-purpose foundation model for computational pathology. Nat Med 2024;30:850–862. doi: 10.1038/s41591-024-02857-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Vorontsov E Bozkuurt A Casson A Shaikovski G Zelechowski M Liu S, et al. Virchow: A million-slide digital pathology foundation model. arXiv 2023:abs/2309.07778. doi: 10.48550/arXiv.2309.07778 [Google Scholar]
  • 19.Chen X, Xie H, Tao X, Wang F, Leng M, Lei B. Artificial intelligence and multimodal data fusion for smart healthcare: Topic modeling and bibliometrics. Artif Intell Rev 2024;57:91. doi: 10.1007/s10462-024-10712-7. [Google Scholar]
  • 20.Mehnen L, Gruarin S, Vasileva M, Knapp B. ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases. medRxiv 2023.04.20.23288859. doi: 10.1101/2023.04.20.23288859. [Google Scholar]
  • 21.Sandmann S, Riepenhausen S, Plagwitz L, Varghese J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun 2024;15:2050. doi: 10.1038/s41467-024-46411-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Zheng Y Sun X Feng B Kang K Yang Y Zhao A, et al. Rare and complex diseases in focus: ChatGPT’s role in improving diagnosis and treatment. Front Artif Intell 2024;7:1338433. doi: 10.3389/frai.2024.1338433. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Eriksen AV, Möller S, Ryg J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2024;1:AI2300031. doi: 10.1056/AIp2300031. [Google Scholar]
  • 24.Singhal K Tu T Gottweis J Sayres R Wulczyn E Hou L, et al. Towards expert-level medical question answering with large language models. arXiv 2023:2305.09617. doi:10.48550/arXiv.2305.09617. [Google Scholar]
  • 25.Kirillov A Mintun E Ravi N Mao H Rolland C Gustafson L, et al. Segment anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Paris: France; 2023: IEEE Computer Society Digital Library 3992–4003. doi: 10.1109/ICCV51070.2023.00371. [Google Scholar]
  • 26.Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun 2024;15:654. doi: 10.1038/s41467-024-44824-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Kim S Kim K Hu J Chen C Lyu Z Hui R, et al. MediViSTA-SAM: Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation for Echocardiography. arXiv 2023:2309.13539. doi: 10.48550/arXiv.2309.13539. [Google Scholar]
  • 28.Cheng J Ye J Deng Z Chen J Li T Wang H, et al. SAM-Med2d. arXiv 2023:2308.16184. doi:10.48550/arXiv.2308.16184. [Google Scholar]
  • 29.Lei W, Wei X, Zhang X, Li K, Zhang S. MedLSAM: Localize and segment anything model for 3d medical images. arXiv 2023:2306.14752. doi: 10.48550/arXiv.2306.14752 [DOI] [PubMed] [Google Scholar]
  • 30.Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med 2022;28:1773–1784. doi: 10.1038/s41591-022-01981-2. [DOI] [PubMed] [Google Scholar]
  • 31.Lu MY Chen B Williamson DFK Chen RJ Zhao M Chow AK, et al. A multimodal generative AI copilot for human pathology. Nature 2024. doi: 10.1038/s41586-024-07618-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Yu K-H, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng 2018;2:719–731. doi: 10.1038/s41551-018-0305-z. [DOI] [PubMed] [Google Scholar]
  • 33.Yang Y, Sun K, Gao Y, Wang K, Yu G. Preparing data for artificial intelligence in pathology with clinical-grade performance. Diagnostics (Basel) 2023;13:3115. doi: 10.3390/diagnostics13193115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Alowais SA Alghamdi SS Alsuhebany N Alqahtani T Alshaya AI Almohareb SN, et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med Educ 2023;23:689. doi: 10.1186/s12909-023-04698-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Huang Z, Bianchi F, Yuksekgonul M, Montine TJ, Zou J. A visual–language foundation model for pathology image analysis using medical Twitter. Nat Med 2023;29:2307–2316. doi: 10.1038/s41591-023-02504-3. [DOI] [PubMed] [Google Scholar]
  • 36.Wang X Zhang X Wang G He J Li Z Zhu W, et al. OpenMEDLab: An open-source platform for multi-modality foundation models in medicine. arXiv 2024:2402.18028. doi: 10.48550/arXiv.2402.18028. [Google Scholar]
  • 37.Zhang S Wang X He J Li Y Zhu W Wang D, et al. OpenMEDLab. Available at: https://github.com/openmedlab. [Last accessed on Aug 6, 2024].
  • 38.Zhou HY Yu Y Wang C Zhang S Gao Y Pan J, et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat Biomed Eng 2023;7:743–755. doi: 10.1038/s41551-023-01045-x. [DOI] [PubMed] [Google Scholar]
  • 39.Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021;5:493–497. doi: 10.1038/s41551-021-00751-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med 2022;28:31–38. doi: 10.1038/s41591-021-01614-0. [DOI] [PubMed] [Google Scholar]
  • 41.Huang SC, Pareek A, Seyyedi S, Banerjee I, Lungren MP. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. NPJ Digit Med 2020;3:136. doi: 10.1038/s41746-020-00341-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hathaliya JJ, Tanwar S. An exhaustive survey on security and privacy issues in Healthcare 4.0. Comput Commun 2020;153:311–335. doi: 10.1016/j.comcom.2020.02.018. [Google Scholar]
  • 43.Liu G, He J, Li P, He G, Chen Z, Zhong S. PeFoMed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering. arXiv 2024:2401.02797. doi: 10.48550/arXiv.2401.02797. [Google Scholar]
  • 44.Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019;110:12–22. doi: 10.1016/j.jclinepi.2019.02.004. [DOI] [PubMed] [Google Scholar]
  • 45.Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of gpt-4 on medical challenge problems. arXiv 2023:2303.13375. doi: 10.48550/arXiv.2303.13375. [Google Scholar]
  • 46.Hua S, Yan F, Shen T, Ma L, Zhang X. Pathoduet: Foundation models for pathological slide analysis of H&E and IHC stains. Med Image Anal 2024;97:103289. doi: 10.1016/j.media.2024.103289. [DOI] [PubMed] [Google Scholar]
  • 47.Wang Z, Liu C, Zhang S, Dou Q. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. Cham: Springer Nature Switzerland, 2023. doi: 10.1007/978-3-031-43996-4_10. [Google Scholar]
  • 48.Moor M Huang Q Wu S Yasunaga M Zakka C Dalmia Y, et al. Med-flamingo: A multimodal medical few-shot learner. arXiv 2023:2307.15189. doi: 10.48550/arXiv.2307.15189. [Google Scholar]
  • 49.Thieme A, Nori A, Ghassemi M, Bommasani R, Andersen TO, Luger E. Foundation models in healthcare: Opportunities, risks & strategies forward. Extended abstracts of the 2023 CHI conference on human factors in computing systems. New York: Association for Computing Machinery; 2023. doi: 10.1145/3544549.3583177. [Google Scholar]
  • 50.Bommasani R Hudson DA Adeli E Altman R Arora S vonArx S, et al. On the opportunities and risks of foundation models. arXiv 2021:2108.07258. doi: 10.48550/arXiv.2108.07258. [Google Scholar]
  • 51.Chambon P, Bluethgen C, Langlotz CP, Chaudhari A. Adapting pretrained vision-language foundational models to medical imaging domains. arXiv 2022:2210.04133. doi: 10.48550/arXiv.2210.04133. [Google Scholar]
  • 52.Wojtara M, Rana E, Rahman T, Khanna P, Singh H. Artificial intelligence in rare disease diagnosis and treatment. Clin Transl Sci 2023;16:2106–2111. doi: 10.1111/cts.13619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Decherchi S, Pedrini E, Mordenti M, Cavalli A, Sangiorgi L. Opportunities and challenges for machine learning in rare diseases. Front Med (Lausanne) 2021;8:747612. doi: 10.3389/fmed.2021.747612. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Sagers LW, Diao JA, Groh M, Rajpurkar P, Adamson AS, Manrai AK. Improving dermatology classifiers across populations using images generated by large diffusion models. arXiv 2022:2211.13352. doi: 10.48550/arXiv.2211.13352. [Google Scholar]
  • 55.Sun Y Zhu C Zheng S Zhang K Sun L Shui Z, et al. PathAsst: Redefining pathology through generative foundation AI assistant for pathology. arXiv 2023:2305.15072. doi: 10.48550/arXiv.2305.15072. [Google Scholar]
  • 56.Mishra A, Mittal R, Jestin C, Tingos K, Rajpurkar P. Improving zero-shot detection of low prevalence chest pathologies using domain pre-trained language models. arXiv 2023:2306.08000. doi: 10.48550/arXiv.2306.08000. [Google Scholar]
  • 57.Wang D Wang X Wang L Li M Da Q Liu X, et al. A real-world dataset and benchmark for foundation model adaptation in medical image classification. Sci Data 2023;10:574. doi: 10.1038/s41597-023-02460-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Ikezogwo WO Seyfioglu MS Ghezloo F Geva D Mohammed FS Anand PK, et al. Quilt-1M: One million image-text pairs for histopathology. Adv Neural Inf Process Syst 2023;36(DB1):37995–38017. [PMC free article] [PubMed] [Google Scholar]
  • 59.Xie Q, Schenck EJ, Yang HS, Chen Y, Peng Y, Wang F. Faithful AI in medicine: A systematic review with large language models and beyond. medRxiv 2023:2023.04.18.23288752. doi: 10.1101/2023.04.18.23288752. [Google Scholar]
  • 60.Sylolypavan A, Sleeman D, Wu H, Sim M. The impact of inconsistent human annotations on AI driven clinical decision making. NPJ Digit Med 2023;6:26. doi: 10.1038/s41746-023-00773-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Mazurowski MA, Dong H, Gu H, Yang J, Konz N, Zhang Y. Segment anything model for medical image analysis: An experimental study. Med Image Anal 2023;89:102918. doi: 10.1016/j.media.2023.102918. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Zhang Y, Shen Z, Jiao R. Segment anything model for medical image segmentation: Current applications and future directions. Comput Biol Med 2024;171:108238. doi: 10.1016/j.compbiomed.2024.108238. [DOI] [PubMed] [Google Scholar]
  • 63.Krones FH, Marikkar U, Parsons G, Szmui A, Mahdi A. Review of multimodal machine learning approaches in healthcare. Available at SSRN: https://ssrn.com/abstract=4736389. doi: 10.2139/ssrn.4736389.
  • 64.Soenksen LR Ma Y Zeng C Boussioux L Villalobos Carballo K Na L, et al. Integrated multimodal artificial intelligence framework for healthcare applications. NPJ Digit Med 2022;5:149. doi: 10.1038/s41746-022-00689-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Hassani S, Dackermann U, Mousavi M, Li J. A systematic review of data fusion techniques for optimized structural health monitoring. Inform Fusion 2024;103:102136. doi: 10.1016/j.inffus.2023.102136. [Google Scholar]
  • 66.Boecking B Usuyama N Bannur S Castro DC Schwaighofer A Hyland S, et al. Making the most of text semantics to improve biomedical vision–language processing. In: Computer Vision-ECCV2022. Cham: Springer Nature Switzerland, 2022. doi: 10.1007/978-3-031-20059-5_1. [Google Scholar]
  • 67.Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Investig Med 2010;58:11–18. doi: 10.2310/JIM.0b013e3181c9b2ea. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Shibata H Hanaoka S Nakao T Kikuchi T Nakamura Y Nomura Y, et al. Practical medical image generation with provable privacy protection based on denoising diffusion probabilistic models for high-resolution volumetric images. Appl Sci 2024;14:3489. doi: 10.3390/app14083489. [Google Scholar]
  • 69.Carlini N Hayes J Nasr M Jagielski M Sehwag V Tramèr F, et al. Extracting training data from diffusion models. arXiv 2023:abs/2301.13188. doi: 10.48550/arXiv.2301.13188 [Google Scholar]
  • 70.Zhou C Li Q Li C Yu J Liu Y Wang G, et al. A comprehensive survey on pretrained foundation models: A history from BERT to chatGPT. arXiv 2023:2302.09419. doi: 10.48550/arXiv.2302.09419. [Google Scholar]
  • 71.Wei K Li J Ding M Ma C Yang HH Farokhi F, et al. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans Inform Forensics Secur 2020;15:3454–3469. doi: 10.1109/TIFS.2020.2988575. [Google Scholar]
  • 72.Girgis, A., Data D, Diggavi S, Kairouz P, Suresh AT. Shuffled model of differential privacy in federated learning. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, PMLR 2021;130:2521–2529. Available at https://proceedings.mlr.press/v130/girgis21a.html. [Last accessed on Aug 6, 2024]. [Google Scholar]
  • 73.Sun L, Qian J, Chen X. LDP-FL: Practical private aggregation in federated learning with local differential privacy. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence Main Track: 1571–1578. doi: 10.24963/ijcai.2021/217. [Google Scholar]
  • 74.Fereidooni H Marchal S Miettinen M Mirhoseini A Möllering H Nguyen TD, et al. Safelearn: Secure aggregation for private federated learning. In: 2021 IEEE Security and Privacy Workshops (SPW). IEEE. 2021: 56–62. doi: 10.1109/SPW53761.2021.00017. [Google Scholar]
  • 75.Zhuang W, Chen C, Lyu L. When foundation model meets federated learning: Motivations, challenges, and future directions. arXiv 2023:2306.15546. doi: 10.48550/arXiv.2306.15546. [Google Scholar]
  • 76.Khalid N, Qayyum A, Bilal M, Al-Fuqaha A, Qadir J. Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Comput Biol Med 2023;158:106848. doi: 10.1016/j.compbiomed.2023.106848. [DOI] [PubMed] [Google Scholar]
  • 77.Murdoch B. Privacy and artificial intelligence: Challenges for protecting health information in a new era. BMC Med Ethics 2021;22:122. doi: 10.1186/s12910-021-00687-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Johnson AE Pollard TJ Shen L Lehman LW Feng M Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016;3:160035. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Gu Y Tinn R Cheng H Lucas M Usuyama N Liu X, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthcare 2021;3:1–23. doi: 10.1145/3458754. [Google Scholar]
  • 80.Gupta T Gong W Ma C Pawlowski N Hilmkil A Scetbon M, et al. The essential role of causality in foundation world models for embodied AI. arXiv 2024:2402.06665. doi: 10.48550/arXiv.2402.06665. [Google Scholar]
  • 81.Gehrmann S, Clark E, Sellam T. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. J Artif Intell Res 2023;77:103–166. doi: 10.1613/jair.1.13715. [Google Scholar]
  • 82.Chiang CH, Lee HY. Can large language models be an alternative to human evaluations? arXiv 2023:2305.01937. doi: 10.48550/arXiv.2305.01937. [Google Scholar]
  • 83.Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large language models in medicine: The potentials and pitfalls: A narrative review. Ann Intern Med 2024;177:210–220. doi: 10.7326/M23-2772. [DOI] [PubMed] [Google Scholar]
  • 84.Tonmoy S Zaman SMM Jain V Rani A Rawte V Chadha A, et al. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv 2024:2401.01313. doi: 10.48550/arXiv.2401.01313. [Google Scholar]
  • 85.Rawte V, Sheth A, Das A. A survey of hallucination in large foundation models. arXiv 2023:2309.05922. doi: 10.48550/arXiv.2309.05922. [Google Scholar]
  • 86.Umapathi LK, Pal A, Sankarasubbu M. Med-halt: Medical domain hallucination test for large language models. arXiv 2023:2307.15343. doi: 10.48550/arXiv.2307.15343. [Google Scholar]
  • 87.Huang L Yu W Ma W Zhong W Feng Z Wang H, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv 2023:2311.05232. doi: 10.48550/arXiv.2311.05232. [Google Scholar]
  • 88.Kung TH Cheatham M Medenilla A Sillos C De Leon L Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit Health 2023;2:e0000198. doi: 10.1371/journal.pdig.0000198. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Wang X Gong Z Wang G Jia J Xu Y Zhao J, et al. ChatGPT performs on the Chinese national medical licensing examination. J Med Syst 2023;47:86. doi: 10.1007/s10916-023-01961-0. [DOI] [PubMed] [Google Scholar]
  • 90.Ferrara E. Should ChatGPT be biased? Challenges and risks of bias in large language models. First Monday 2023;28:2304.03738. doi: 10.5210/fm.v28i11.13346. [Google Scholar]
  • 91.Mennella C, Maniscalco U, De Pietro G, Esposito M. Ethical and regulatory challenges of AI technologies in healthcare: A narrative review. Heliyon 2024;10:e26297. doi: 10.1016/j.heliyon.2024.e26297. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.U.S. Food and Drug Administration . Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD). 2019. Available at https://www.fda.gov/files/medical%20devices/published/US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper.pdf. [Last accessed on August 6, 2024].
  • 93.U.S. Food and Drug Administration . Artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD) action plan. 2021. Available at https://www.fda.gov/media/145022/download?attachment. [Last accessed on August 6, 2024].
  • 94.U.S. Food and Drug Administration . Content of premarket submissions for device software functions. 2021. Available at https://www.fda.gov/media/153781/download. [Last accessed on August 6,2024].
  • 95.U.S. Food and Drug Administration , Artificial intelligence and medical products: How CBER, CDER, CDRH, and OCP are Working Together. 2024. Available at https://www.fda.gov/media/177030/download?attachment. [Last accessed on August 6, 2024].
  • 96.Hendolin M. Towards the European health data space: From diversity to a common framework. Eurohealth 2022;27:15–17. Available at https://iris.who.int/bitstream/handle/10665/352268/Eurohealth-27-2-15-17-eng.pdf?sequence=1&isAllowed=y. [Last accessed on August 6, 2024]. [Google Scholar]
  • 97.Madiega T. Artificial intelligence act. European Parliament: European Parliamentary Research Service, 2021. Available at https://www.europarl.europa.eu/RegData/etudes/BRIE/2021/698792/EPRS_BRI(2021)698792_EN.pdf. [Last accessed on August 6, 2024]. [Google Scholar]
  • 98.European Medicines Agency . Reflection paper on the use of artificial intelligence in the lifecycle of medicines. 2023. Available at https://www.ema.europa.eu/en/documents/scientific-guideline/draft-reflection-paper-use-artificial-intelligence-ai-medicinal-product-lifecycle_en.pdf. [Last accessed on August 6, 2024].
  • 99.GOV. UK . A pro-innovation approach to AI regulation. GOV.UK, 2023. Available at https://assets.publishing.service.gov.uk/media/64cb71a547915a00142a91c4/a-pro-innovation-approach-to-ai-regulation-amended-web-ready.pdf. [Last accessed on August 6, 2024]. [Google Scholar]
  • 100.National Medical Products Administration . First batch of medical device classification results for 2023. Available at https://chinameddevice.com/nmpa-medical-devices-classification/. [Last accessed on August 6, 2024].
  • 101.Dvijotham KD Winkens J Barsbey M Ghaisas S Stanforth R Pawlowski N, et al. Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians. Nat Med 2023;29:1814–1820. doi: 10.1038/s41591-023-02437-x. [DOI] [PubMed] [Google Scholar]
  • 102.Ng AY Oberije CJG Ambrózay É Szabó E Serfőző O Karpati E, et al. Prospective implementation of AI-assisted screen reading to improve early detection of breast cancer. Nat Med 2023;29:3044–3049. doi: 10.1038/s41591-023-02625-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Martinez-Gutierrez JC Kim Y Salazar-Marioni S Tariq MB Abdelkhaleq R Niktabe A, et al. Automated large vessel occlusion detection software and thrombectomy treatment times: A cluster randomized clinical trial. JAMA Neurol 2023;80:1182–1190. doi: 10.1001/jamaneurol.2023.3206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Callaway E. World’s biggest set of human genome sequences opens to scientists. Nature 2023;624:16–17. doi: 10.1038/d41586-023-03763-3. [DOI] [PubMed] [Google Scholar]
  • 105.Yang J, Soltan AAS, Eyre DW, Clifton DA. Algorithmic fairness and bias mitigation for clinical machine learning with deep reinforcement learning. Nat Mach Intell 2023;5:884–894. doi: 10.1038/s42256-023-00697-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Wenhua Z Hasan MK Jailani NB Islam S Safie N Albarakati HM, et al. A lightweight security model for ensuring patient privacy and confidentiality in telehealth applications. Comput Human Behav 2024;153:108134. doi: 10.1016/j.chb.2024.108134. [Google Scholar]
  • 107.Tanguay W Acar P Fine B Abdolell M Gong B Cadrin-Chênevert A, et al. Assessment of radiology artificial intelligence software: A validation and evaluation framework. Can Assoc Radiol J 2023;74:326–333. doi: 10.1177/08465371221135760. [DOI] [PubMed] [Google Scholar]
  • 108.NHS UK. The Topol Review. Preparing the healthcare workforce to deliver the digital future. 2019: 1–52. Available at https://topol.hee.nhs.uk/wp-content/uploads/HEE-Topol-Review-2019.pdf. [Last accessed on August 6, 2024].
  • 109.Siala H, Wang Y. SHIFTing artificial intelligence to be responsible in healthcare: A systematic review. Soc Sci Med 2022;296:114782. doi: 10.1016/j.socscimed.2022.114782. [DOI] [PubMed] [Google Scholar]

Articles from Chinese Medical Journal are provided here courtesy of Wolters Kluwer Health