The Ministry of Food and Drug Safety (MFDS) of the Republic of Korea, a regulatory body analogous to the United States Food and Drug Administration and the United Kingdom’s Medicines and Healthcare products Regulatory Agency, issued specific regulatory guidelines for approving generative artificial intelligence (AI) technologies as medical devices on January 24, 2025 [1]. Although these guidelines use the term ‘generative AI,’ they predominantly focus on the approval of AI software tools based on large language models (LLMs) and large multimodal models (LMMs) [2], the latter of which can process various types of input data, such as text, images, videos, audio, and bio-signals. While broader guidelines for approving AI models as medical devices exist elsewhere, specific regulatory guidelines for LLMs/LMMs have arguably not yet been proposed in other countries. Because the guidelines are currently available only in Korean, this editorial provides a concise overview of the South Korean MFDS guidelines, highlighting their key features and areas requiring improvement.
Scope of the MFDS Guidelines
The MFDS guidelines pertain to AI tools that use LLMs/LMMs based on large-scale foundation models, with the primary function of generating text outputs [2,3]. These guidelines do not cover generalist medical AI systems such as Google’s Med-Gemini, which can respond to a wide array of medical queries and tasks. Instead, they focus on software tools that perform more specialized and narrower healthcare-related tasks, such as drafting radiology reports for chest radiographs. The MFDS guidelines are intended for software tools directly involved in patient diagnosis, treatment, and prognosis. LLM/LMM-based software tools used in healthcare settings for other tasks, such as dictation (converting medical speech into text) and the search, retrieval, and summarization of electronic health records, are not currently recognized as medical devices by the MFDS and are excluded from these guidelines. However, some gray areas remain, and the regulatory scope may expand to include these tools if emerging needs justify their regulation as medical devices.
A Brief Summary of Key Regulatory Considerations Included in the MFDS Guidelines
The unique challenges in approving LLMs/LMMs as medical devices, compared to traditional AI, and the corresponding regulatory measures are summarized in Tables 1 and 2. Table 1 includes both general considerations and specific measures outlined in the MFDS guidelines. The list in the table intentionally excludes purely technical issues such as system requirements for robust AI operations or technical requirements to ensure data security. Detailed discussions of each characteristic of LLMs/LMMs listed in Table 1, along with the associated challenges, are available in another publication [4]. As outlined in Table 1, LLMs/LMMs pose greater and more varied risks of improper device use and inadvertent harm to patients than traditional AI systems. The MFDS guidelines acknowledge these risks, but the measures adopted to address them operate mostly indirectly, through user instructions and by limiting device use to qualified healthcare professionals, rather than by removing or directly controlling the risks. Moreover, some regulatory measures within the MFDS guidelines are somewhat abstract, and not all challenges listed in Table 1 are addressed with specific regulatory actions.
Table 1. Unique challenges in approving LLMs/LMMs as medical devices and the corresponding regulatory measures, including both general considerations and specific measures outlined in the South Korean MFDS guidelines.
LLM characteristics | Challenges | Proposed additional considerations/requirements in medical device approval | South Korean MFDS guidelines for medical device approval
---|---|---|---
Versatility and emergent abilities | Higher risk of off-label use by uninformed users compared with conventional AI | Confirmation of intended usage details; restriction of LLMs’ responses to user prompts to intended usage; explicit provision of intended usage details and warnings against off-label use to users, similar to medication package inserts | Provision of intended usage details (such as purpose and indications) and warnings against off-label use through user instructions, similar to medication package inserts
Sensitivity to variation in prompts | Inconsistency in LLM outputs | Provision of libraries of preset recommended standardized prompts, controlled medical vocabularies, and comprehensive guidance on proper prompt use, including instructions on how to structure chat sessions | Not addressed with specific regulatory actions
Stochasticity | Inconsistency in LLM outputs | Examination of the degree and nature of variability stemming from stochasticity; requirement for model adjustments for deterministic operations in high-stakes applications | Acknowledgment of the potential need to address inconsistency due to stochasticity
Complexities in the use of performance metrics | Diverse, barely standardized, abstract performance metrics for free-text outputs; subjectivity in human evaluation | Tailored selection and implementation of performance metrics; for subjective human evaluation, (a) clear guidelines for defining and grading quality attributes, (b) use of multiple expert evaluators, and (c) adequate evaluator training; a priori specification of methodologic details | Model performance to be evaluated clinically (as exemplified in Table 2) by multiple expert clinicians in the relevant fields and supplementarily by automated metrics such as BLEU, ROUGE, METEOR, precision, recall, accuracy, F1 score, BERTScore, GREEN, etc.
Misinformation and explainability/interpretability challenges | Heightened risk of automation bias (users inadvertently accepting incorrect information from LLMs) compared with conventional AI | Mandatory incorporation of auxiliary features to help users assess the credibility of AI-generated outputs, such as (a) uncertainty indicators, (b) provision of accurate and representative references and sources accompanying the information provided by LLMs, and (c) ideally, direct explanations of the reasoning behind LLM outputs; mandatory digital watermarks on AI-generated content for clear distinction, requiring review and confirmation by qualified healthcare professionals and serving as a reminder of human supervision and liability | Requirement for the use of an LLM-based medical device exclusively by qualified clinicians in the relevant fields, with explicit stipulations to be included in the user instructions
Continuous model updates | Risk of substantial changes in model performance and behavior over time | Implementation of a quality control test, such as periodically assessing result consistency using a standard set of prompts; further emphasis on guiding principles for predetermined change control plans | Inclusion of cautionary advice about potential model performance drift in the user instructions, and a general recommendation for periodic performance evaluations with no specific requirements stipulated
Additional general aspects | Inherent risks of potential incompleteness in evaluating LLMs for medical device approval | Incremental approval process of gradually expanding LLM use as more data on safety and performance are gathered through continuous monitoring; mandatory incorporation of real-time feedback systems allowing instant and easy flagging of incorrect or inconsistent responses | Not addressed with specific regulatory actions
Reprinted with an added column titled “South Korean MFDS guidelines for medical device approval,” from Park & Kim. Radiology 2024;312:e241703 [4].
‘LLM’ in the table refers to both LLM and LMM.
LLM = large language model, LMM = large multimodal model, MFDS = Ministry of Food and Drug Safety, AI = artificial intelligence, BLEU = bilingual evaluation understudy, ROUGE = recall-oriented understudy for gisting evaluation, METEOR = metric for evaluation of translation with explicit ordering, BERT = bidirectional encoder representations from transformers, GREEN = generative radiology report evaluation and error notation
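To make the role of the automated metrics listed in Table 1 more concrete, the following is a minimal, illustrative Python sketch of a ROUGE-1-style unigram-overlap check between an LLM-drafted report and a clinician-written reference. The example reports and the simple whitespace tokenization are assumptions for illustration only; such a check is not a procedure specified by the MFDS guidelines.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """ROUGE-1-style unigram overlap between a clinician-written reference
    report and an LLM-drafted report (toy whitespace tokenization)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example reports, for illustration only.
reference_report = "No focal consolidation. Mild cardiomegaly. No pleural effusion."
draft_report = "Mild cardiomegaly. No pleural effusion or focal consolidation."
print(rouge1(reference_report, draft_report))
```

Surface-overlap scores of this kind cannot tell whether a discrepancy is clinically significant, which is why the guidelines position automated metrics only as supplements to evaluation by multiple expert clinicians.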
Table 2. Exemplary grading system proposed by the Ministry of Food and Drug Safety for clinical evaluation of text outputs from LLMs/LMMs, such as draft radiology reports (translated from Korean [1]).
Grade | Details | Decision |
---|---|---|
A | No errors | Acceptable |
B | Non-clinical errors only | Acceptable |
C | Errors with clinical items: unlikely to be clinically significant | Context-dependent* |
D | Errors with clinical items: likely to be clinically significant | Unacceptable |
E | Errors with clinical items: likely to cause significant patient harm | Unacceptable |
*Acceptability depends on the specific clinical context, including the intended use, setting, target patients, and users of the AI medical device.
LLM = large language model, LMM = large multimodal model, AI = artificial intelligence
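As a minimal sketch of how the exemplary grading scheme in Table 2 might be encoded when tallying evaluation results, consider the following Python snippet; the class, dictionary, and function names are hypothetical, and the mapping simply mirrors the table.

```python
from enum import Enum

class Grade(Enum):
    """Exemplary MFDS grading of LLM/LMM text outputs (Table 2)."""
    A = "No errors"
    B = "Non-clinical errors only"
    C = "Clinical errors, unlikely to be clinically significant"
    D = "Clinical errors, likely to be clinically significant"
    E = "Clinical errors, likely to cause significant patient harm"

# Acceptability decision per grade; grade C depends on the clinical context.
DECISION = {
    Grade.A: "acceptable",
    Grade.B: "acceptable",
    Grade.C: "context-dependent",
    Grade.D: "unacceptable",
    Grade.E: "unacceptable",
}

def summarize(grades: list[Grade]) -> dict:
    """Tally graded cases, e.g., from multiple expert readers."""
    counts = {g: grades.count(g) for g in Grade}
    acceptable = sum(n for g, n in counts.items() if DECISION[g] == "acceptable")
    return {"counts": counts,
            "acceptable_fraction": acceptable / len(grades) if grades else 0.0}

print(summarize([Grade.A, Grade.B, Grade.A, Grade.C, Grade.D]))
```

In practice, grade C cases would be adjudicated against the intended use, setting, target patients, and users of the device rather than counted automatically.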
Areas for Improvement in the MFDS Guidelines
The South Korean MFDS guidelines provide a substantial initial framework but remain preliminary for several key reasons. This section identifies three major areas in which the guidelines could be enhanced.
First, the current guidelines focus predominantly on narrow-task LLMs/LMMs, leaving the broader challenges posed by generalist medical AI unresolved. The near-infinite range of inputs and outputs associated with generalist medical AI models may render them ‘untestable’ under the current medical device regulatory framework [5]. Some experts thus advocate a novel regulatory approach that treats these models more like human clinicians than conventional devices, mirroring the standards that govern human healthcare providers [6,7]. This underscores the crucial need for more comprehensive and detailed international discussions regarding the regulation of generalist medical AI systems.
Second, current regulatory measures fall short of ensuring the selection of trustworthy LLM/LMM-based medical devices and the prevention of inadvertent patient harm during their use. Instead, most measures are indirect and rely heavily on proper user behavior, guided by user instructions and warnings. For example, instead of mandating features that help users assess the credibility of AI-generated outputs, the guidelines stipulate that these devices should be used exclusively by qualified clinicians to mitigate misinformation and automation bias (Table 1). Although qualified clinicians with ample clinical knowledge and experience are likely to be better equipped to handle these issues than those without clinical expertise, there is no guarantee that they will always do so effectively [8]. A more definitive approach would therefore be to mandate that AI medical devices include tools that can explicitly explain their reasoning and express confidence levels [4,9]. Although these auxiliary features are still largely experimental for LLMs/LMMs [4,10,11,12], requiring them could drive advancements in these technologies and underscore their significance in the clinical adoption of LLM/LMM-based devices. The MFDS approach also shifts the burden of device evaluation onto healthcare practitioners, implicitly requiring them to possess sufficient literacy to critically appraise LLM/LMM-based devices rather than merely use them. Consequently, it is critical to update medical education so that healthcare professionals acquire the knowledge and skills needed for such appraisal [13].
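One experimental example of such a confidence indicator is sampling-based consistency checking, in which the same prompt is submitted several times to a nondeterministic model and the agreement among the sampled drafts is reported as a crude uncertainty proxy. The sketch below illustrates the idea in Python; the generate_report function is a hypothetical mock standing in for any actual LLM call, and the unigram-overlap similarity is a deliberately simple stand-in for more rigorous semantic comparison.

```python
import itertools
import random

# Hypothetical mock of a nondeterministic LLM call (temperature > 0);
# replace with the actual model API in a real system.
_MOCK_DRAFTS = [
    "Mild cardiomegaly. No pleural effusion. No focal consolidation.",
    "Mild cardiomegaly without pleural effusion or focal consolidation.",
    "Moderate cardiomegaly. Small left pleural effusion.",
]

def generate_report(prompt: str) -> str:
    return random.choice(_MOCK_DRAFTS)

def unigram_similarity(a: str, b: str) -> float:
    """Crude similarity between two drafts: shared unigrams / union of unigrams."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def consistency_score(prompt: str, n_samples: int = 5) -> float:
    """Average pairwise similarity of repeated drafts; lower values flag
    unstable outputs that warrant closer scrutiny by the clinician."""
    drafts = [generate_report(prompt) for _ in range(n_samples)]
    pairs = list(itertools.combinations(drafts, 2))
    return sum(unigram_similarity(a, b) for a, b in pairs) / len(pairs)

print(round(consistency_score("Draft a chest radiograph report."), 2))
```

Whether such a score genuinely reflects the reliability of a given output remains an open research question, which is precisely why mandating confidence indicators could accelerate work in this area.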
Third, the current guidelines lack explicit measures to ensure proper continuous monitoring, which is essential for safe, compliant, and effective use of AI, particularly for LLM/LMM-based devices [4,14]. Continuous monitoring is not limited to monitoring the population performance of an AI model (i.e., the performance of a model measured within a particular study population) [14,15,16] but should also include the collection and analysis of individual problem cases, monitoring of user behavior and human-AI interactions, and technical monitoring, as detailed in Table 3. Effective monitoring may thus require a dedicated information technology (IT) infrastructure designed to enable functionalities such as (but not limited to) the following:
Table 3. Comprehensive list of items to be considered in the continuous monitoring of AI medical devices.
Category | Items | Details
---|---|---
Monitoring of model performance | Population performance | Monitor AI performance drift due to model changes or updates (using a standard quality control test set or a standard set of prompts) and due to data changes (using recent institutional data)
Monitoring of model performance | Individual case analysis | Continuously collect (e.g., through an easy single-click button for instant flagging of specific cases by users) and analyze specific instances where problems, such as incorrect or inconsistent AI results, occur
Monitoring of user behavior and human-AI interactions | Analysis of human-AI disagreements | Utilize tools such as an ‘agree-with-AI (Yes or No)’ button for continuous monitoring: (a) a significant increase in agreement may indicate overreliance, complacency, or automation bias; (b) a significant increase in disagreement could signal a drift in model performance or blind distrust
Monitoring of user behavior and human-AI interactions | Usage rate | Track how frequently AI-generated results are actively viewed or opened by healthcare professionals, rather than merely being generated and neglected
Monitoring of user behavior and human-AI interactions | Misuse/off-label use | Track instances of use outside of device approval requirements, such as using an AI tool for patients who are not indicated for it or do not require it (e.g., an unexpected increase in negative AI results or a decrease in positive predictive value may signal such incidents)
Monitoring of user behavior and human-AI interactions | Cherry picking | Identify patterns where users select cases based on AI outputs, such as preferring less complex cases
Technical monitoring | AI response time | Monitor the timeliness and availability of AI-generated results
Technical monitoring | IT resource consumption | Track the data traffic and resources required for data gathering and processing
Technical monitoring | Redundancy | Identify unnecessary or unused data transfers and processing of exams or cases
Technical monitoring | Error rate | Monitor rates of data rejection and other errors
Technical monitoring | Software updates | Keep track of updates to the AI software
AI = artificial intelligence, IT = information technology
• Instant and easy flagging and collection of incorrect or inconsistent AI responses with a single mouse click, thereby minimizing digital fatigue.
• Tracking the rate of user agreement or disagreement with AI results using a simple tool such as an ‘agree-with-AI (Yes or No)’ button; abrupt changes could indicate emerging issues, with a significant increase in agreement suggesting user complacency or automation bias and a significant increase in disagreement suggesting model performance drift.
• Dashboard-type continuous display of monitoring outcomes to provide ongoing visibility and oversight.
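As a rough illustration of the kind of feedback-logging component such an IT infrastructure might include, the following Python sketch defines a minimal record for single-click flags and ‘agree-with-AI’ responses and computes an aggregate agreement rate that a dashboard could display. All class, field, and identifier names are hypothetical and are not taken from the MFDS guidelines.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    """One user interaction with an AI result (names are illustrative only)."""
    case_id: str
    user_id: str
    agrees_with_ai: bool               # 'agree-with-AI (Yes or No)' button
    flagged_as_problem: bool = False   # single-click flag for incorrect/inconsistent output
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class MonitoringLog:
    """In-memory stand-in for the monitoring store behind a dashboard."""
    events: list[FeedbackEvent] = field(default_factory=list)

    def record(self, event: FeedbackEvent) -> None:
        self.events.append(event)

    def agreement_rate(self) -> float:
        """Fraction of interactions in which the user agreed with the AI result.
        Abrupt rises may suggest automation bias; abrupt falls, performance drift."""
        if not self.events:
            return 0.0
        return sum(e.agrees_with_ai for e in self.events) / len(self.events)

    def flagged_cases(self) -> list[str]:
        """Case identifiers flagged by users for later individual review."""
        return [e.case_id for e in self.events if e.flagged_as_problem]

# Illustrative usage
log = MonitoringLog()
log.record(FeedbackEvent("case-001", "rad-07", agrees_with_ai=True))
log.record(FeedbackEvent("case-002", "rad-07", agrees_with_ai=False, flagged_as_problem=True))
print(log.agreement_rate(), log.flagged_cases())
```

A production system would persist these events in a database, link them to the deployed model version, and surface trends over time rather than a single aggregate number.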
Mandating such comprehensive monitoring tools as a condition for medical device approval would significantly help ensure the effectiveness and thoroughness of continuous monitoring practices.
CONCLUSIONS
The South Korean MFDS guidelines have established a substantial initial framework for approving LLMs/LMMs as medical devices. However, these guidelines are still in their nascent stages. Current regulations do not encompass generalist medical AI systems. The measures in place manage risks primarily indirectly, through user instructions and by restricting device use to qualified healthcare professionals. Additionally, not all challenges are addressed by specific regulatory actions. In particular, explicit measures to ensure proper continuous monitoring are lacking. A requirement for LLM/LMM-based medical devices to include auxiliary tools that can explicitly explain their reasoning and express confidence levels, coupled with a dedicated IT infrastructure to facilitate comprehensive, continuous post-market monitoring, would significantly enhance the regulatory framework. Furthermore, medical education must be updated to ensure that healthcare professionals are adequately equipped with the knowledge and skills necessary to evaluate LLM/LMM-based devices beyond their mere usage. This approach will bolster the safety and efficacy of these medical devices and prepare healthcare systems to integrate advanced AI technologies more seamlessly and responsibly.
Footnotes
Conflicts of Interest: Seong Ho Park received consulting fees from the Ministry of Food and Drug Safety of the Republic of Korea for his contributions to the development of the guidelines discussed in this article, is the Editor-in-Chief, and was not involved in the publication decision.
The remaining authors have declared no conflicts of interest related to this work.
- Conceptualization: Seong Ho Park.
- Writing—original draft: Seong Ho Park.
- Writing—review & editing: Geraldine Dean, Ernest Montañà Ortiz, Joon-Il Choi.
Funding Statement: None
References
- 1. Ministry of Food and Drug Safety. Guidelines for approving generative artificial intelligence technologies as medical devices. [accessed on March 2, 2025]. Available at: https://www.mfds.go.kr/brd/m_1060/view.do?seq=15628
- 2. World Health Organization. Ethics and governance of artificial intelligence for health: guidance on large multi-modal models. [accessed on March 2, 2025]. Available at: https://www.who.int/publications/i/item/9789240084759
- 3. Jung KH. Uncover this tech term: foundation model. Korean J Radiol. 2023;24:1038–1041. doi: 10.3348/kjr.2023.0790
- 4. Park SH, Kim N. Challenges and proposed additional considerations for medical device approval of large language models beyond conventional AI. Radiology. 2024;312:e241703. doi: 10.1148/radiol.241703
- 5. Gilbert S, Harvey H, Melvin T, Vollebregt E, Wicks P. Large language model AI chatbots require approval as medical devices. Nat Med. 2023;29:2396–2398. doi: 10.1038/s41591-023-02412-6
- 6. Blumenthal D, Patel B. The regulation of clinical artificial intelligence. NEJM AI. 2024;1:AIpc2400545
- 7. Rajpurkar P, Topol EJ. A clinical certification pathway for generalist medical AI systems. Lancet. 2025;405:20. doi: 10.1016/S0140-6736(24)02797-1
- 8. Yu F, Moehring A, Banerjee O, Salz T, Agarwal N, Rajpurkar P. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat Med. 2024;30:837–849. doi: 10.1038/s41591-024-02850-w
- 9. Park SH, Langlotz CP. Crucial role of understanding in human-artificial intelligence interaction for successful clinical adoption. Korean J Radiol. 2025;26:287–290. doi: 10.3348/kjr.2025.0071
- 10. Faghani S, Moassefi M, Rouzrokh P, Khosravi B, Baffour FI, Ringler MD, et al. Quantifying uncertainty in deep learning of radiologic images. Radiology. 2023;308:e222217. doi: 10.1148/radiol.222217
- 11. Faghani S, Gamble C, Erickson BJ. Uncover this tech term: uncertainty quantification for deep learning. Korean J Radiol. 2024;25:395–398. doi: 10.3348/kjr.2024.0108
- 12. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–1940. doi: 10.1038/s41591-023-02448-8
- 13. McCoy LG, Ci Ng FY, Sauer CM, Yap Legaspi KE, Jain B, Gallifant J, et al. Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Med Educ. 2024;24:1096. doi: 10.1186/s12909-024-06048-z
- 14. Warraich HJ, Tazbaz T, Califf RM. FDA perspective on the regulation of artificial intelligence in health care and biomedicine. JAMA. 2025;333:241–247. doi: 10.1001/jama.2024.21451
- 15. Sittig DF, Singh H. Recommendations to ensure safety of AI in real-world clinical care. JAMA. 2025;333:457–458. doi: 10.1001/jama.2024.24598
- 16. Feng J, Xia F, Singh K, Pirracchio R. Not all clinical AI monitoring systems are created equal: review and recommendations. NEJM AI. 2025;2:AIra2400657