Responsible use of large language models in gastroenterology and hepatology

Sahil Khanna

doi:10.1177/17562848261441693

. 2026 Apr 26;19:17562848261441693. doi: 10.1177/17562848261441693

Responsible use of large language models in gastroenterology and hepatology

Sahil Khanna ^1,^✉

PMCID: PMC13129269 PMID: 42080195

Abstract

Large language models (LLMs) and related generative artificial intelligence (AI) systems are rapidly entering clinical workflows, including in gastroenterology and hepatology, where text-heavy documentation, guideline-driven care, and high-volume patient messaging create a strong demand for decision support and automation. Deployment raises distinctive risks: hallucinations and unsafe recommendations, automation bias, privacy and confidentiality threats, inequitable care, intellectual property and licensing uncertainty, and unclear allocation of responsibility across clinicians, institutions, and vendors. Regulators are increasingly emphasizing a lifecycle approach to trustworthy AI, including validation, risk management, human oversight, cybersecurity, monitoring, and transparency. Frameworks for healthcare protection regulate how health information may be shared with external model providers and how outputs should be logged, audited, and retained. This review synthesizes practical guidance for responsible LLM use, organized around (1) tiered use cases and risk stratification; (2) ethical principles (beneficence, non-maleficence, autonomy, justice, and accountability); (3) core legal and regulatory considerations; and (4) operational governance, evaluation, and monitoring strategies. Actionable checklists to support institutional adoption, use, and transparency are provided. Responsible use requires aligning LLM capabilities with risk, preventing inappropriate data sharing, validating performance in representative populations, ensuring human oversight and clear accountability, mitigating harm, and maintaining an evidence-based, continuously monitored deployment lifecycle.

Keywords: clinical decision support, ethics, gastroenterology, generative artificial intelligence, governance, hepatology, large language models, law, privacy, regulation

Plain language summary

Language models in GI care

Large language models (LLMs) are powerful AI tools that can help clinicians by drafting notes, summarizing information, and assisting with patient messages. They are becoming common in gastroenterology and hepatology as the field relies heavily on written documentation and synthesis of complex clinical data. However, these tools can sometimes produce confident but incorrect information, overlook important details, or reflect biases found in their training data. Owing to these risks, LLMs must always be used with human oversight, not as independent medical decision‑makers. Privacy remains a major concern. When LLMs process patient information, organizations must follow strict rules such as Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR) to protect confidentiality, manage cybersecurity risks, and ensure that data shared with vendors is appropriately controlled. To use LLMs safely, health systems should match the tool to the level of clinical risk, from simple administrative tasks to high‑stakes decision support. They should build in safeguards like human review, high‑quality medical sources, transparency about AI use, and continuous monitoring for errors or drift in performance. LLMs also need to be checked for fairness, since they may unintentionally reproduce biased medical assumptions, which can contribute to inequitable care. Overall, LLMs can reduce burden and support high‑quality care in gastroenterology and hepatology, but only when paired with strong governance, human oversight, privacy protections, and responsible workflows.

Introduction

High volumes of unstructured text (clinic notes, procedure and pathology reports, imaging data, discharge summaries) are generated in day-to-day Gastroenterology and Hepatology (GI/Hep) practices that require continual synthesis in the context of evolving evidence, guidelines, and patient-specific factors. These characteristics make the specialty an attractive early adopter for large language model (LLM) assisted documentation, summarization, and decision support.^1,2 There are high-stakes decision points (e.g., acute gastrointestinal bleeding, endoscopy triage, decompensated cirrhosis, and hepatocellular carcinoma staging), where hallucinated or biased outputs can lead to harm. The World Health Organization (WHO) has emphasized that generative artificial intelligence (AI) can improve healthcare only if risks are identified and actively governed across the model lifecycle.^3,4

GI/Hep share many AI-governance challenges with other specialties (privacy, bias, automation bias, and unclear liability), but several features make GI/Hep a particularly relevant testing case for responsible LLM deployment. Being a procedure-intensive specialty that generates image-rich and narrative reports that must be translated into actionable recommendations (e.g., polyp pathology letters and surveillance intervals).^5,6 In addition, guideline-dense longitudinal care (IBD, chronic liver disease), high-volume patient messaging (diet/medication questions, bowel preparation, symptom flares), and high-acuity triage decisions (acute GI bleeding, decompensated cirrhosis, and referral urgency) are highly prevalent. These characteristics make LLMs attractive for summarization and communication, but also amplify the impact of omission, hallucination, and misalignment with patient-specific contraindications or local practice constraints.

This review focuses on considerations for responsible LLM use with these aspects: (i) describe plausible use cases and categorize them by risk; (ii) outline ethical principles and common failure modes; (iii) summarize relevant legal and regulatory obligations; and (iv) propose a practical governance and evaluation framework, including checklists and tables designed for multidisciplinary adoption teams. The manuscript focuses on text-based LLMs because they are the most mature and widely deployed in clinical documentation and communication. The same governance principles may extend to other generative AI (e.g., multimodal models used with endoscopic images or pathology), and additional safeguards may be needed in those situations.

Methods

A targeted review of peer-reviewed literature and policy documents published was performed, prioritizing sources relevant to LLM use, medical ethics, health data privacy, cybersecurity, and regulation. PubMed/MEDLINE was searched for English-language sources published from January 2018 through January 2026 using combinations of terms including: “large language model,” “LLM,” “ChatGPT,” “generative AI,” “foundation model” AND “gastroenterology,” “hepatology,” “endoscopy,” “colonoscopy,” “inflammatory bowel disease,” “cirrhosis,” “hepatocellular carcinoma,” as well as governance terms (“HIPAA,” “GDPR,” “EU AI Act,” “medical device,” “clinical decision support,” “cybersecurity,” “prompt injection”). Reporting guidelines for clinical AI studies (CONSORT-AI, SPIRIT-AI, DECIDE-AI) and publisher/journal policies on AI disclosure were reviewed.^7–10 Given rapid policy changes, sources from regulators and standards bodies such as the WHO, National Institute of Standards and Technology (NIST), U.S. Food and Drug Administration (FDA), Office of the National Coordinator for Health Information Technology (ONC), European data protection board (EDPB), and others are included where available. Additional references were identified by citation chaining. Given the heterogeneity of study designs and rapidly evolving policy, a formal systematic review or meta-analysis was not performed. Instead, evidence was synthesized into a practical risk-tier taxonomy and matched safeguards and evaluation expectations to intended use.

LLM-enabled workflows: A risk-tier taxonomy proposal

A practical starting point is to match the intended use, balancing benefits and harm. Deployment of LLMs can be for activities ranging from low-risk administrative tasks to high-risk decision support. Table 1 proposes a taxonomy and corresponding safeguards.^2,11,12

Table 1.

Representative LLM use cases and recommended safeguards.

Use case (example)	Typical inputs	Potential harm	Risk tier^a	Key ethical/legal issues	Recommended safeguards
Administrative support (scheduling, prior-auth draft letters)	Non-clinical or limited PHI	Workflow delays, minor errors	Tier 0–1	Confidentiality; vendor contracting	Avoid PHI where feasible; role-based access; audit logs; human review
Documentation assistance (drafting clinic notes, procedure reports)	EHR text, dictation	Incorrect documentation; propagation of errors	Tier 1–2	Accuracy; attribution; record integrity	Human-in-the-loop editing; source linking; disclaimers; lockout from ordering
Patient messaging (portal replies, prep instructions)	Patient questions; limited context	Unsafe advice; reduced trust; inequity	Tier 2–3	Informed consent; transparency; bias	Approved templates; escalation triggers; readability; language access; label AI assistance
Guideline summarization/retrieval (HCC staging, variceal prophylaxis)	Guidelines + patient context	Misapplied guideline; outdated info	Tier 2–3	Currency; transparency	RAG; citations; date-stamped sources; clinician verification
Clinical decision support (acute GI bleed risk, cirrhosis complications)	Full PHI + labs/vitals	Direct patient harm; malpractice exposure	Tier 3–4	Safety; liability; regulation	Formal validation; monitoring; version control; limited autonomy; regulatory assessment when applicable
Autonomous action (orders, referrals, medication changes)	PHI + system permissions	High-severity harm	Tier 4	Safety; authorization; accountability	Avoid/rare; least privilege; dual-signature; strong controls; incident response

Open in a new tab

Risk tiers are illustrative: Tier 0 (no clinical data), Tier 1 (limited PHI/administrative), Tier 2 (documentation/communication), Tier 3 (clinical support), Tier 4 (autonomous clinical action).

AI, artificial intelligence; EHR, electronic health record; GI, gastroenterology; HCC, hepatocellular carcinoma; LLM, large language models; PHI, protected health information; RAG, retrieval-augmented generation.

To operationalize “responsible use” into implementable safeguards, a pragmatic risk-tiering approach based on the principle that LLM-related risk increases as outputs move closer to patient care and clinical action is proposed in this manuscript. Tier assignment is driven by four primary dimensions: (1) clinical proximity and actionability (administrative support vs decision support vs autonomous actions), (2) severity and reversibility of potential harm (documentation errors vs unsafe treatment recommendations), (3) degree of autonomy and opportunity for human oversight (draft-only with mandatory review vs direct-to-patient vs automated execution), and (4) data sensitivity (no patient data vs limited PHI vs full electronic health record (EHR) context). This framework emphasizes workflow realities where time-sensitive decisions and guideline-intensive care make the consequences of confident but incorrect outputs particularly salient. When a tool spans multiple functions, it is suggested that the tier be assigned according to the highest-risk function or the function with the most direct impact on patient care.

Tier 0: No clinical data, non-clinical support

Tier 0 includes use cases that do not involve patient information or clinical decision-making (e.g., general writing assistance, generic education materials, and workflow brainstorming). The primary risks are misinformation, reputational harm, or inappropriate reliance rather than direct patient harm. Safeguards emphasize basic information governance (approved tools, acceptable-use policy) and avoidance of unintended expansion beyond the intended use, user, or context into clinical recommendations.

Tier 1: Limited PHI and operational/administrative tasks

Tier 1 covers tasks where limited identifiers or operational context may be used (e.g., drafting prior authorization narratives, referral letters, scheduling support), but the output is not intended to guide clinical care directly. Risks include confidentiality breaches, administrative errors, and propagation of inaccuracies into downstream processes. Safeguards typically center on privacy compliance (minimum necessary data, approved platforms), access controls, auditing, and human verification before external submission. For purely internal Tier 1 administrative tasks (e.g., internal routing and scheduling optimization), disclosure to patients may not be practical or meaningful, but organizational transparency and documentation are still required.

Tier 2: Clinical documentation and communication with human-in-the-loop oversight

Tier 2 includes LLM use for drafting clinical notes, procedure reports, discharge instructions, and patient portal messages, where a clinician (or trained staff under protocol) reviews and approves content prior to filing or sending. Harm can arise through incorrect documentation, omission of red flags, hallucinations, misleading instructions, or unequal communication quality across patient groups (e.g., language or literacy barriers). Tier 2 safeguards emphasize mandatory human review, source-of-truth verification against the chart, standardized templates, escalation triggers, and transparency/labeling policies for patient-facing content.

Tier 3: Clinical decision support

Tier 3 encompasses tools that synthesize patient-specific data to propose diagnoses, triage, risk stratification, management steps, or guideline application. These outputs can materially influence care decisions even when the clinician remains the final decision-maker. Potential harm is higher because errors may be clinically plausible, time-sensitive, and difficult to detect. This tier therefore requires stronger controls: clear intended-use statements and limitations, rigorous validation in representative populations and settings, structured output constraints (e.g., retrieval from vetted sources), monitoring for drift and safety signals, and defined accountability and incident response pathways.

Tier 4: Autonomous clinical action

Tier 4 includes agentic or automated execution (e.g., placing orders, changing medications, scheduling urgent procedures without clinician sign-off, or sending definitive patient instructions autonomously). This tier carries the highest risk because the opportunity for human interception is minimized, and failures may directly cause harm. Tier 4 autonomy should be avoided or tightly constrained; if contemplated, it requires the most stringent governance: least-privilege permissions, dual authorization, formal safety case documentation, continuous monitoring, robust cybersecurity controls, and, where applicable, alignment with regulatory expectations for high-risk clinical AI.

Evidence base for LLM applications in GI/Hep by risk tier

Although LLM adoption is accelerating, the published GI/Hep evidence base remains heterogeneous, ranging from controlled question-set evaluations to retrospective analyses using de-identified clinical data and early workflow studies. This variability makes a risk-tier lens useful: it aligns evaluation intensity with potential patient impact and highlights where evidence is strongest today. A few examples for Tier 0–3 relevant to GI/Hep are outlined below. Table 2 outlines representative GI/Hep studies evaluating LLM applications by risk tier.

Table 2.

Representative GI/Hep studies evaluating LLM applications by tier.

Tier	Representative use case	Study (year)	Data/inclusion	Endpoints (examples)	Key findings/limitations
0	Patient education and emotional support in cirrhosis/HCC	Yeo et al. (2023)¹³	164 cirrhosis/HCC questions; graded by transplant hepatologists	Accuracy; completeness; reproducibility	Often correct but frequently not comprehensive; gaps in decision thresholds and regional guideline variation
0	Patient-facing IBD FAQ responses	Gravina et al. (2024)¹⁴	Common IBD patient questions; evidence-controlled review	Correctness; updating/detail; risk of inaccuracies	Plausible answers with notable limitations; supports adjunct (not autonomous) role
0	GI/Hep education (exam-style questions)	Suchman et al. (2023)¹⁵; Gravina et al. (2024)¹⁶; Anvari et al. (2025)¹⁷	ACG self-assessment; residency exam GI questions; hepatology question sets	Accuracy; rationale quality; variability	Variable performance; reinforces need for verification and supervised educational use
2	Summarizing hepatology referral documents for triage	Shroff et al. (2026)¹⁸	50 patient referral records; iterative prompt; provider review	Accuracy; hallucination; time-to-triage	High accuracy, low hallucination; ~60% reduction in triage time; requires monitoring for omissions/drift
3	Colonoscopy surveillance interval recommendations	Chang et al. (2024)¹⁹	505 de-identified patient colonoscopy + pathology cases	Concordance vs guideline panel and endoscopists: reliability	Higher concordance with guideline panel than routine practice; discordant subset remains—needs guardrails + governance
3	Model-to-model variability in colonoscopy follow-up intervals	Amini et al. (2025)²⁰	Guideline-based prompts for follow-up intervals	Guideline adherence; consistency	Different LLMs produce different interval recommendations; supports need for local validation and retrieval/citation

Open in a new tab

GI, gastroenterology; HCC, hepatocellular carcinoma; Hep, hepatology; LLM, large language models.

Tier 0 in GI/Hep: No clinical data, non-clinical support

GI/Hep evidence is poised to use LLMs for patient-facing education/counseling and clinician education. In cirrhosis and hepatocellular carcinoma (HCC), ChatGPT responses to 164 curated questions were graded by transplant hepatologists and were frequently “correct” but often not “comprehensive,” with gaps in decision thresholds and regional guideline variation.¹³ Similar evaluations in IBD demonstrate that LLMs can generate plausible answers to common patient questions, but may provide incomplete, outdated, or occasionally inaccurate recommendations, supporting their role as adjuncts for education rather than stand-alone advice.^21–23 In medical education, GI/Hep studies show variable performance on exam-style questions (e.g., gastroenterology self-assessment tests and residency exam items), underscoring the need to treat LLMs as supplementary study tools and to teach verification habits.^15–17

Tier 1 in GI/Hep: Limited PHI and operational/administrative tasks

Peer-reviewed GI/Hep-specific studies of LLMs for operational tasks (e.g., prior authorization letters, appointment scheduling, triage) are limited. Nevertheless, the same evaluation endpoints used in other fields are applicable: time saved, rate of factual errors, readability, and downstream operational outcomes (e.g., reduced cycle time for authorizations) while ensuring that protected data are handled under appropriate agreements and security controls.

Tier 2 in GI/Hep: Clinical documentation and communication with human-in-the-loop oversight

Emerging GI/Hep evidence supports LLM-assisted summarization and document triage when outputs are reviewed by clinicians. A 2026 study developed an LLM assistant to summarize hepatology referral documents using predefined data elements and tested it on 50 patient records; the AI summaries were substantially shorter, had high accuracy with a low hallucination rate, and reduced triage time by ~60% compared with reviewing original referral packets.¹⁸ This illustrates a high-value, lower-risk pathway for LLM use that still requires careful validation for omissions, bias, and drift. Related work outside GI/Hep, such as LLM-generated discharge summaries and patient-friendly note transformations, highlights recurring safety concerns from omissions and inaccuracies, reinforcing the need for structured evaluation and human oversight before scaling patient-facing communication.

Tier 3 in GI/Hep: Clinical decision support

The highest-stakes GI/Hep evidence base is still early, but several studies show both promise and risk. For example, a real-world retrospective analysis entered de-identified colonoscopy and pathology data into ChatGPT-4 to generate rescreening/surveillance intervals and compared outputs with expert-panel guideline recommendations and routine endoscopist practice; ChatGPT-4 had higher concordance with guideline-based recommendations than routine practice in that study, but still produced discordant recommendations in a subset of cases (both earlier and later follow-up).¹⁹ Separate comparisons across LLMs (e.g., ChatGPT vs Bard) also demonstrate variability in guideline adherence.^20,24 Taken together, these data support Tier 3 tools only when they are tightly scoped, validated against contemporary guideline logic, instrumented to surface supporting evidence (e.g., citations/decision pathways), and deployed with explicit clinician accountability.

Multimodal and next-generation models

Finally, emerging multimodal LLM applications in endoscopy and liver pathology highlight that “generative AI” is expanding beyond text. Early studies exploring vision-enabled models for bowel preparation scoring and liver fibrosis staging suggest potential but also important performance gaps, underscoring that multimodal outputs require the same (or higher) validation thresholds as tier 2–3 text applications before clinical use.

Ethical considerations

Ethical analysis can be anchored in beneficence, non-maleficence, autonomy, justice, and accountability. LLMs create distinctive tensions because outputs are probabilistic and may be fluent yet incorrect, because training data may encode historical biases, and because interfaces can encourage overreliance.^25–27

Beneficence and non-maleficence: Patient safety and error modes

A central risk is clinically plausible but incorrect output (“hallucination”) or omission of key contraindications. Another is automation bias: clinicians and patients may accept confident answers without sufficient verification, especially under time pressure.^25,28 Mitigation requires explicit human oversight, verification workflows, and evaluation under realistic conditions, including adversarial testing (“red teaming”) and monitoring for performance drift.^29,30

LLM safety risks extend beyond hallucination (fabrication). Three distinct error classes are important in clinical GI/Hep: (1) hallucination: confidently stated but unsupported facts; (2) omission: failure to surface key contraindications, red flags, or next steps (e.g., anticoagulation reversal considerations in GI bleeding; sedation risks in advanced cirrhosis); and (3) contextual misalignment: outputs that may be “generally correct” but are not the best recommendation for an individual patient because they fail to incorporate comorbidities, drug interactions, local resource constraints, or patient preferences (Table 3). Mitigations should be matched to each error mode (retrieval/citations for hallucination; structured contraindication/red-flag checklists for omission; scenario-based validation and required human rationale for contextual misalignment).

Table 3.

Common LLM failure modes and mitigation strategies.

Failure mode	Example in GI/hepatology	Why it matters	Mitigation
Hallucinated facts or fabricated citations	Invent endoscopy findings or guideline statements	Can mislead decisions; undermines trust	RAG with vetted sources; require citations; human verification; disable citation generation when uncertain
Omitted contraindications/interactions	Suggests NSAIDs in cirrhosis/variceal bleed risk; misses anticoagulant interactions	High-severity harm	Structured checks (drug–drug/allergy); forcing functions; escalation rules
Bias and inequitable recommendations	Uses race-based eGFR adjustments or stereotypes in pain management	Amplifies disparities	Subgroup testing; bias red-teaming; disaggregated monitoring; guardrails against race-based medicine
Prompt injection/data exfiltration	Malicious text in copied note causes model to reveal PHI or ignore instructions	Security breach; unsafe behavior	Isolate untrusted text; least privilege; input/output filtering; OWASP-aligned testing
Overreliance and deskilling	Clinician accepts summary without reviewing source data	Errors propagate; reduced reasoning	Training; UI cues; mandatory verification for Tier 3–4; audit and feedback

Open in a new tab

GI, gastroenterology; LLM, large language models; OWASP, Open Worldwide Application Security Project; PHI, protected health information; RAG, retrieval-augmented generation.

Justice: Bias, inequity, and representativeness

Bias can arise from unequal representation in training data, biased documentation, or legacy clinical “dogmas” or “truths” that embed misconceptions. A study of commercial LLMs found that all produced some race-based medical misconceptions and were inconsistent across repeated runs, underscoring the need for equity-focused evaluation and governance.³ Tools should be tested in the populations they will serve, with disaggregated performance reporting and bias mitigation plans.^25,26

Autonomy and transparency: Informed use and appropriate reliance

Patients should not be unknowingly triaged or counseled by systems in ways that meaningfully affect care. Organizations should disclose when LLMs are used in patient-facing interactions or in clinician workflows that materially influence recommendations. Transparency also supports clinician autonomy, providing users with clear indications of tool purpose, limits, and contexts in which outputs become unreliable.^25,31

Respect for autonomy supports transparency about when LLMs contribute to patient-facing communication or clinical recommendations, and where feasible, an ability for patients to request a human-only alternative. Practical opt-out pathways are most applicable for Tier 2 patient messaging/letters and selected Tier 3 decision-support tools (e.g., AI-assisted triage or interval recommendations). For backend Tier 1 uses, opt-out may be impractical, but institutions should still ensure that LLM use does not materially change clinical decisions without clinician review, and should provide clear channels for patients to ask questions, request clarification, or escalate concerns.

Accountability and professional integrity

LLMs cannot assume professional responsibility, and accountability ultimately rests with clinicians and organizations. Liability standards are continuing to evolve as AI becomes more integrated into clinical practice, reinforcing the need for clear governance, documentation, and clinician training.⁴

Legal and regulatory considerations

Legal obligations differ across jurisdictions and settings, but several themes recur: confidentiality and data protection; cybersecurity and safety; documentation, traceability, and transparency; medical device and health information technology (IT) regulation when LLMs function as clinical decision support; and allocation of responsibility across clinicians, institutions, and vendors (Table 4).^31–34

Table 4.

Selected legal and regulatory touchpoints relevant to LLM use.

Domain	Instrument (examples)	Who is obligated	Practical implications for LLM deployment
Privacy and confidentiality	HIPAA; GDPR; ICO guidance on AI and data protection	Health systems, clinicians, vendors (as applicable)	Do not paste PHI into non-compliant tools; execute BAAs/DPAs; minimize data; perform DPIA; address cross-border transfers
Algorithm transparency in health IT	ONC HTI-1 (US)	Health IT developers; downstream deployers/users	Maintain documentation on intended use, limitations, evaluation, and update history; ensure user access to baseline algorithm information
Medical device oversight	FDA guidance and GMLP (US); EU AI Act (EU) (plus medical device rules)	Manufacturers/providers; sometimes deployers	Clarify intended use and claims; implement lifecycle monitoring, quality management, and controlled updates; consider when CDS crosses into regulated SaMD
Payer/coverage decisions	CMS guidance on algorithmic utilization management (US)	Medicare Advantage plans and contractors; indirectly, clinicians/health systems	Avoid sole reliance on LLM outputs for denials; preserve individualized clinician review and explainability

Open in a new tab

AI, artificial intelligence; BAA, business associate agreement; CDS, clinical decision support; CMS, Centers for Medicare and Medicaid Services; DPA, data processing agreement; DPIA, data protection impact assessment; GDPR, General Data Protection Regulation; GMLP, Good Machine Learning Practice; HIPAA, Health Insurance Portability and Accountability Act; ICO, Information Commissioner’s Office; IT, information technology; LLM, large language models; ONC HTI, Office of the National Coordinator for Health Information Technology; PHI, protected health information; SaMD, software as a medical device.

Data protection and confidentiality

When LLMs process protected health information (PHI), organizations must ensure that disclosures are available and provided, minimum necessary standards are applied, and vendor relationships are governed by appropriate agreements. Under the Health Insurance Portability and Accountability Act (HIPAA), service providers that create, receive, maintain, or transmit electronic PHI on behalf of covered entities or business associates generally function as business associates and require a business associate agreement, even in “no-view” encrypted scenarios.³²

De-identification can enable broader use, but both Safe Harbor and Expert Determination approaches retain residual re-identification risk, particularly for free-text data.³⁵ In the European Union (EU), the General Data Protection Regulation (GDPR) imposes obligations around lawful basis for processing, purpose limitation, data minimization, and cross-border transfers, all of which may affect cloud-hosted LLMs and fine-tuning workflows.³¹ Guidance from the United Kingdom on AI and data protection emphasizes accountability and governance and notes ongoing updates as legislation changes.³⁶

Medical device and healthcare IT oversight

Whether an LLM-enabled tool is regulated as a medical device depends on intended use, claims, and functional role in clinical decision-making. In the US, FDA guidance and the international “Good Machine Learning Practice” principles stress rigorous lifecycle controls, including data management, transparency, performance evaluation, and post-deployment monitoring.^37,38 The ONC HTI-1 Final Rule established algorithm transparency requirements for certain predictive decision support interventions in certified health IT.³⁴ In the EU, the AI Act introduces obligations for high-risk AI and additional transparency requirements, with staged applicability dates.^35,39

Professional liability and standard of care

Clinicians remain responsible for the care delivered, even when decision support tools are used. Liability analyses emphasize that clinicians may face risk both when they rely on faulty AI recommendations and when they unreasonably ignore reliable AI warnings, depending on the evolving standard of care.⁴ Regulatory guidance cautions against using algorithms as the sole basis for coverage decisions and emphasizes that medical necessity determinations require individualized, grounded review.⁴⁰

Operational governance for responsible LLM use

Responsible deployment requires operational controls that convert principles into practice. The NIST Artificial Intelligence Risk Management Framework and its generative AI profile provide a useful structure: govern, map, measure, and manage risks across the lifecycle, with explicit accountability and continuous monitoring.^26,30 Reporting guidelines for clinical AI studies reinforce the need to specify intended use, human–AI interaction, error analysis, and system versioning.^7–9

Governance: Roles, policies, and acceptable use

Health systems should establish a multidisciplinary governance structure that includes clinicians, nursing, informatics, legal/compliance, privacy, cybersecurity, and patient representatives. Key outputs include an acceptable-use policy, a risk-tier framework (Table 1), vendor due diligence standards, and a process for incident reporting and corrective action.^26,27,30

Technical safeguards: Architecture, security, and data controls

Common controls include retrieval-augmented generation (RAG) with curated local knowledge bases; role-based access; input/output filtering; prevention of sensitive information disclosure; and limiting agentic capabilities to least privilege. The Open Worldwide Application Security Project (OWASP) highlights prompt injection, sensitive information disclosure, excessive agency, and overreliance as recurring failure modes that should be explicitly addressed in design and testing.²⁷

Privacy and leakage risk: Why “don’t upload PHI” is not enough

Privacy risk can arise through memorization, training data extraction, especially for large models trained on sensitive data.⁴¹ In addition, targeted misinformation and model-manipulation attacks can deliberately inject incorrect biomedical facts into model behavior, reinforcing the need for robust access control and monitoring.^42–44

GDPR and international data protection

GDPR considerations for LLM deployment extend beyond “do not upload identifiable data.” Health data are “special-category” personal data, requiring both a lawful basis and a separate condition for processing special-category data, as well as transparency about purposes, recipients, and retention. LLM workflows can stress core GDPR principles: purpose limitation and data minimization (e.g., broad reuse of clinical notes for fine-tuning); accuracy (model outputs may be incorrect about individuals); and data subject rights (access, rectification, restriction, objection, and erasure). In practice, controller/processor roles must be explicit (including for cloud LLM vendors), with appropriate contractual safeguards, audit rights, and limits on downstream model training. Many clinical deployments will trigger a Data Protection Impact Assessment, particularly when using novel technology at scale on sensitive health data. Cross-border transfers and sub-processor chains must be mapped and controlled. Recent European work, including the EDPB ChatGPT Taskforce report and EDPS guidance on generative AI, emphasizes risk-based governance, clear role allocation, and practical mechanisms to uphold data subject rights. For GI/Hep implementations, a conservative approach is to prefer enterprise-grade deployments where PHI/personal data are not used for model training, to log prompts/outputs securely for quality oversight, and to validate that generated summaries or recommendations do not introduce inaccurate personal data into the record.

Evaluation and monitoring

Evaluation should be proportional to risk and should reflect real-world workflow. For Tier 1–2 applications, this includes documentation accuracy audits, clinician satisfaction, and time saved. For Tier 3–4 applications, prospective evaluation and safety monitoring are needed, ideally aligned with DECIDE-AI/CONSORT-AI guidance.^8,9,29

Responsible scholarly use: Authorship, disclosure, and publication ethics

Medical journals and publishers increasingly require transparency about generative AI use in manuscript preparation. Some journals distinguish assistive uses (e.g., grammar) from generative uses that produce substantive text, images, or references; the latter typically requires disclosure, and authors retain responsibility for accuracy and originality. ICMJE similarly emphasizes disclosure of AI-assisted technologies and reiterates that chatbots cannot be listed as authors because they cannot take responsibility for integrity and accuracy.¹⁰ Authors should also recognize intellectual property constraints: U.S. copyright guidance emphasizes human authorship requirements for copyright protection of AI-generated material.^45,46

Education

Education is an early and common entry point for LLM use by trainees in GI/Hep. Programs should teach trainees to treat LLM outputs as unverified drafts, to cross-check against primary sources and local guidelines, and to avoid entering any identifiable patient information into non-approved tools.⁴⁷ Training curricula can incorporate “AI literacy” (prompting, citation/verification habits, bias awareness, and recognizing automation bias) and can use local, institutionally governed models for low-risk educational tasks.⁴⁷ Studies in GI/Hep show that LLM performance on exam-style questions is variable, supporting use as a supplementary learning tool rather than a replacement for standard educational resources.^15–17

Preventing harm: Practical safeguards for LLM use

Preventing harm from LLMs requires a safety-by-design approach that matches safeguards to risk, recognizes that LLM outputs can be fluent yet wrong, and anticipates both clinical and cybersecurity failure modes. Guidance emphasizes human oversight, transparency, equity, and lifecycle governance, including post-release auditing and monitoring where LLMs are deployed at scale.^25,26,30,48

Start with intended use and risk tier

Harm prevention begins by defining the intended use, the user (clinician vs patient facing), the setting (inpatient vs outpatient), and the level of autonomy. Tier 1–2 applications (documentation assistance, controlled messaging) generally allow stronger human review than Tier 3 applications (clinical decision support) and should be prioritized for early adoption. Risk tiering should explicitly identify prohibited uses (e.g., autonomous ordering, medication changes, or triage decisions without clinician sign-off) and have inbuilt escalation triggers.^25,26 Table 5 outlines a tier-by-tier ethical crosswalk with the risk-tiered approach.

Table 5.

Tier-by-tier ethical crosswalk (benefits, risks, safeguards).

Dimension	Tier 0	Tier 1	Tier 2	Tier 3
Primary value	Education, drafting non-clinical content, and guideline summarization without patient data	Operational efficiency (scheduling, prior authorization drafts) with minimal identifiers	Clinician efficiency via documentation/summarization; controlled patient communication with review	Patient-specific synthesis to support decisions (diagnosis/therapy/intervals)
Main harm modes	Misinformation; overtrust; reputational harm	Privacy leakage; administrative errors affecting access	Omissions/inaccuracies in notes or patient messages; automation bias	Clinical harm from incorrect/biased recommendations; inappropriate escalation/de-escalation
Autonomy and transparency	Label as educational; avoid implying clinician relationship	Organizational transparency; patient disclosure typically not required unless patient-facing output	Disclose when content is delivered to patients/external audiences; offer human alternative on request	Strong transparency; clear limits; consider opt-out where feasible; document human decision-maker
Justice (bias) focus	Test for biased educational content	Ensure administrative workflows do not create access inequities	Audit outputs for subgroup differences; avoid race-based medicine	Rigorous subgroup testing; avoid proxy discrimination; monitor downstream outcomes
Accountability	Author/clinician responsible for any redistributed content	Institution accountable for operational decisions; vendor agreements	Clinician signs/finalizes; institution governs monitoring and reporting	Clinician retains responsibility; tool treated as CDS with auditing, documentation, and escalation pathways
Minimum safeguards	Citations to trusted sources; disclaimers; avoid medical advice	No training on PHI unless contractually permitted; logging; access controls	Human-in-the-loop review; retrieval + citations; contraindication checks; monitoring for drift	Formal validation; limited scope; real-world performance monitoring; human factors testing; incident reporting

Open in a new tab

CDS, clinical decision support; PHI, protected health information.

Safety guardrails to reduce hallucinations and unsafe recommendations

For clinical tasks, guardrails should reduce the probability that an LLM generates unsupported facts, misapplies guidelines, or omits contraindications. Practical strategies include: RAG from curated, date-stamped guideline repositories; requiring the model to quote or cite only from retrieved sources; structuring prompts to force uncertainty and differential diagnoses; and limiting output scope (e.g., “draft a note” vs “recommend therapy”). Because outputs vary stochastically, safeguards should be tested across repeated runs and across realistic prompt formats (short questions, copied notes, portal messages).^25,26,29

Human oversight and human factors: Design the human–AI team

Human oversight must be operationalized, not assumed. Interfaces should make review easy (source links, highlighted uncertainties, and structured contraindication checks) and should avoid creating false confidence. Human factors/usability engineering is widely used in medical device development to minimize use-related hazards; similar principles apply when LLM tools are embedded in EHR workflows or patient messaging systems.^48,49

Equity and bias: Test for subgroup harms and avoid race-based medicine

Explicit testing for bias and inequitable recommendations must be carried out. LLMs have been shown to reproduce race-based medical misconceptions and may yield inconsistent outputs across repeated runs, increasing the risk of differential harm.³ Implementations should include language-access evaluation (reading level and translation quality), subgroup performance reporting, and guardrails to prevent race-based heuristics unless explicitly evidence-based and current.^25,26,50

Privacy and cybersecurity: Treat LLMs as a new attack surface

LLM-enabled systems introduce cybersecurity risks that can translate into patient harm: prompt injection, sensitive data disclosure, excessive “agency” via tools/plugins, and overreliance are highlighted in the OWASP Top 10 for LLM applications.²⁷ Threat modeling should incorporate adversarial AI tactics and techniques, drawing on resources such as MITRE ATLAS (MITRE Adversarial Threat Landscape for Artificial-Intelligence Systems).^43,44 Privacy threats also include potential memorization and extraction of training data and membership inference in some settings, reinforcing the need for strict data minimization, contractual controls, and secure logging.⁴¹ Healthcare-specific risks include targeted manipulation of medical LLMs (e.g., injecting incorrect biomedical facts while preserving apparent performance), which underscores the need to control model access, monitor updates, and validate knowledge integrity after any change.⁴²

Future directions

Near-term value is likely greatest in Tier 1–2 applications (documentation, summarization, controlled patient messaging) with strong human oversight, while Tier 3–4 applications should proceed only with rigorous validation, monitoring, and clear accountability.^2,26 Policy landscapes are also evolving, reinforcing the need for ongoing compliance monitoring.^36,39

Conclusion

LLMs can help teams manage information overload and reduce clerical burden, but safe use depends on aligning capabilities with clinical risk, protecting patient data, validating performance across diverse populations, and sustaining governance throughout the deployment lifecycle. A risk-tier taxonomy, NIST-aligned governance, OWASP-informed security testing, and transparent disclosure practices provide a practical path toward responsible adoption (Boxes 1–4).

Box 1.

Point-of-care checklist for clinicians using LLM output (Tier 2–4 use cases).

Checklist when an LLM output influences clinical documentation, patient communication, or decisions.
[ ] Confirm the task: Is this documentation support, patient messaging, or clinical decision support? Match verification effort to risk tier.
[ ] Verify source-of-truth: Cross-check all clinical claims against the chart (labs/vitals/med list) and authoritative references (guidelines, drug labels).
[ ] Look for “silent failures”: missing contraindications, drug interactions, allergy conflicts, and red-flag symptoms.
[ ] Assess bias/appropriateness: Would the recommendation change inappropriately based on race/sex/age/language/insurance status?
[ ] Ensure transparency: If patient-facing or materially influencing care, disclose AI assistance according to local policy.
[ ] Document your reasoning: Record why you accepted/modified/rejected the output; do not paste unverifiable text into the medical record.
[ ] Escalate when unsure: Complex/high-stakes cases require specialist review; do not rely on LLM output as a tie-breaker.

Open in a new tab

AI, artificial intelligence; LLM, large language models.

Box 2.

Institutional pre-deployment checklist.

Checklist for enabling an LLM tool in workflows.
[ ] Define intended use and risk tier (Table 1); document what the tool must NOT do (e.g., autonomous ordering).
[ ] Data governance: confirm HIPAA/GDPR basis; execute BAA/DPA; limit data retention; define whether data can be used for training/fine-tuning.
[ ] Security review: threat model (prompt injection, data exfiltration, account misuse); penetration/red-team plan; logging and monitoring.
[ ] Clinical validation plan: representative test sets; subgroup analyses; error taxonomy; human factors/usability testing.
[ ] Update and version control: model version, knowledge cutoffs, change management (including rollback), and post-update revalidation triggers.
[ ] Patient-facing transparency: labeling, consent approach (if applicable), escalation pathways, and complaint mechanisms.
[ ] Governance: named clinical owner, privacy officer sign-off, and incident response escalation route.

Open in a new tab

AI, artificial intelligence; BAA, Business Associate Agreement; DPA, Data Processing Agreement; GDPR, General Data Protection Regulation; HIPAA, Health Insurance Portability and Accountability Act; LLM, large language models.

Box 3.

Manuscript AI-disclosure checklist.

Checklist for preparing a review article that involved any generative AI support.
[ ] Identify all AI tools used (name, version, date accessed) and whether they were assistive (grammar) or generative (content creation).
[ ] Describe where AI was used (e.g., drafting, summarization, translation, coding) and what human verification steps were performed.
[ ] Verify every factual claim and every reference against primary sources; do not cite an AI chatbot as a primary reference.
[ ] Check for plagiarism (unintended reproduction) and fabricated citations.
[ ] Ensure confidentiality: confirm no patient-identifiable information or proprietary data were entered into non-compliant systems.
[ ] Include a disclosure statement per journal/publisher policy (e.g., Methods or Acknowledgments).

Open in a new tab

AI, artificial intelligence.

Box 4.

Preventing harm in LLM workflows.

Checklist for implementing or using an LLM that touches clinical documentation, patient messaging, or decision support.
[ ] Define intended use and prohibited uses; assign a risk tier and map required safeguards.
[ ] Use data minimization: provide only what is necessary for the task; avoid free-text PHI where feasible; follow organizational policy for permitted tools.
[ ] Prefer RAG from curated, date-stamped sources (guidelines, drug labels, institutional protocols); do not allow the model to invent citations.
[ ] Add forcing functions for safety: contraindication checks (allergies, renal/hepatic dosing, pregnancy), interaction checks, and “red flag” symptom screening.
[ ] Require human review before charting, sending patient messages, or acting clinically; ensure the UI supports rapid verification (source links, uncertainty flags).
[ ] Build escalation rules: immediate clinician/specialist review for GI bleeding, decompensated cirrhosis, suspected infection/sepsis, cancer pathways, pediatrics, pregnancy, immunosuppression, or any uncertainty.
[ ] Equity checks: test outputs for language level, translation accuracy, and subgroup performance; ban race-based medicine heuristics unless explicitly evidence-based and current.
[ ] Cybersecurity controls: prompt-injection testing; input/output filtering; least-privilege tool access; audit logs; monitoring for abnormal usage patterns.
[ ] Red-team before go-live and after material updates; include clinicians, trainees, security, and patient representatives.
[ ] Incident response: define how to report, triage, and remediate unsafe outputs; track near misses and feed learnings into updates.

Open in a new tab

GI, gastroenterology; LLM, large language models; PHI, protected health information; RAG, retrieval-augmented generation.

Acknowledgments

None. Generative AI-assisted technologies were not used in the writing of this manuscript.

Footnotes

ORCID iD: Sahil Khanna Inline graphic https://orcid.org/0000-0002-7619-8338

Declarations

Ethics approval and consent to participate: Not applicable (narrative review).

Consent for publication: Not applicable.

Author contributions: Sahil Khanna: Conceptualization; Methodology; Writing – original draft; Writing – review & editing.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

The authors declare that there is no conflict of interest.

Availability of data and materials: There are no associated data.

References

1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med 2023; 29: 1930–1940. [DOI] [PubMed] [Google Scholar]
2. Wiest IC, Bhat M, Clusmann J, et al. Large language models for clinical decision support in gastroenterology and hepatology. Nat Rev Gastroenterol Hepatol 2025; 22: 773–787. [DOI] [PubMed] [Google Scholar]
3. Omiye JA, Lester JC, Spichak S, et al. Large language models propagate race-based medicine. NPJ Digit Med 2023; 6: 195. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Price WN, Gerke S, Cohen IG. Potential liability for physicians using artificial intelligence. JAMA 2019; 322: 1765–1766. [DOI] [PubMed] [Google Scholar]
5. Pellegrino R, Federico A, Gravina AG. Conversational LLM Chatbot ChatGPT-4 for colonoscopy boston bowel preparation scoring: an artificial intelligence-to-head concordance analysis. Diagnostics (Basel) 2024; 14: 2537. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Panzeri D, Laohawetwanit T, Akpinar R, et al. Assessing the diagnostic accuracy of ChatGPT-4 in the histopathological evaluation of liver fibrosis in MASH. Hepatol Commun 2025; 9: e0695. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020; 26: 1351–1363. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020; 26: 1364–1374. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Vasey B, Nagendran M, Campbell B, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 2022; 28: 924–933. [DOI] [PubMed] [Google Scholar]
10. ICoMJ Editors. Recommendations: artificial intelligence (AI)-assisted technology, https://www.icmje.org/recommendations/ (2026, accessed 2 February 2026).
11. Berry P, Dhanakshirur RR, Khanna S. Utilizing large language models for gastroenterology research: a conceptual framework. Therap Adv Gastroenterol 2025; 18: 17562848251328577. [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024; 30: 1134–1142. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721–732. [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Gravina AG, Pellegrino R, Cipullo M, et al. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients’ questions? An evidence-controlled analysis. World J Gastroenterol 2024; 30: 17–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Suchman K, Garg S, Trindade AJ. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology Self-Assessment Test. Am J Gastroenterol 2023; 118: 2280–2282. [DOI] [PubMed] [Google Scholar]
16. Gravina AG, Pellegrino R, Palladino G, et al. Charting new AI education in gastroenterology: cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Dig Liver Dis 2024; 56: 1304–1311. [DOI] [PubMed] [Google Scholar]
17. Anvari S, Lee Y, Jin DS, et al. Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions. J Can Assoc Gastroenterol 2025; 8: 58–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
18. Shroff H, Shankar A, Baron A, et al. A large language model assistant for summarizing hepatology referral documents. Am J Gastroenterol 2026; 121(4): 925–931. [DOI] [PubMed] [Google Scholar]
19. Chang PW, Amini MM, Davis RO, et al. ChatGPT4 outperforms endoscopists for determination of postcolonoscopy rescreening and surveillance recommendations. Clin Gastroenterol Hepatol 2024; 22: 1917–1925.e17. [DOI] [PubMed] [Google Scholar]
20. Amini M, Chang PW, Davis RO, et al. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: bridging the gap in healthcare settings. Endosc Int Open 2025; 13: a25865912. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Zhang Y, Wan XH, Kong QZ, et al. Evaluating large language models as patient education tools for inflammatory bowel disease: a comparative study. World J Gastroenterol 2025; 31: 102090. [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Naqvi HA, Delungahawatta T, Atarere JO, et al. Evaluation of online chat-based artificial intelligence responses about inflammatory bowel disease and diet. Eur J Gastroenterol Hepatol 2024; 36: 1109–1112. [DOI] [PubMed] [Google Scholar]
23. Lusetti F, Maimaris S, La Rosa GP, et al. Applications of generative artificial intelligence in inflammatory bowel disease: a systematic review. Dig Liver Dis 2025; 57: 1883–1889. [DOI] [PubMed] [Google Scholar]
24. Tariq R, Malik S, Khanna S. Evolving landscape of large language models: an evaluation of ChatGPT and bard in answering patient queries on colonoscopy. Gastroenterology 2024; 166: 220–221. [DOI] [PubMed] [Google Scholar]
25. World Health Organization. Ethics and governance of artificial intelligence for health: guidance on large multi-modal models. Geneva: WHO, https://iris.who.int/handle/10665/375579 (2024, accessed 2 February 2026). [Google Scholar]
26. Autio C, Dunietz J, Jain S, et al. Artificial intelligence risk management framework: generative artificial intelligence profile. NIST AI 600-1. National Institute of Standards and Technology, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf (2024, accessed 2 February 2026). [Google Scholar]
27. The OWASP Foundation. OWASP top 10 for large language model applications (v1.1), https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed 2 February 2026).
28. McGreevey JD, 3rd, Hanson CW, 3rd, Koppel R. Clinical, legal, and ethical aspects of artificial intelligence-assisted conversational agents in health care. JAMA 2020; 324: 552–553. [DOI] [PubMed] [Google Scholar]
29. Chang CT, Farah H, Gui H, et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit Med 2025; 8: 149. [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Tabassi E. Artificial intelligence risk management framework (AI RMF 1.0). NIST AI 100-1. National Institute of Standards and Technology, https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf (2023, accessed 2 February 2026). [Google Scholar]
31. Publications Office of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). OJ L, 2024/1689. https://op.europa.eu/en/publication-detail/-/publication/dc8116a1-3fe6-11ef-865a-01aa75ed71a1 (2024, accessed 2 February 2026).
32. USDoHaH Services. Guidance on HIPAA & cloud computing, https://www.hhs.gov/hipaa/for-professionals/special-topics/health-information-technology/cloud-computing/index.html (2022, accessed 2 February 2026).
33. EPa Council. Regulation (EU) 2016/679 (general data protection regulation). EUR-Lex, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679 (2016, accessed 2 February 2026). [Google Scholar]
34. OotNCfHI Technology. HTI-1 final rule: health data technology and interoperability—certification program updates algorithm transparency information sharing https://www.healthit.gov/topic/laws-regulation-and-policy/health-data-technology-and-interoperability-certification-program (accessed 2 February 2026).
35. USDoHaH Services. Guidance regarding methods for de-identification of protected health information in accordance with HIPAA, https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html (2025, accessed 2 February 2026).
36. ICsO (UK). Guidance on AI and data protection (under review due to the Data (Use and Access) Act 2025), https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/ (2023, accessed 2 February 2026).
37. U.S. Food and Drug Administration HC, MHRA. Good machine learning practice for medical device development: guiding principles, https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles (2021, accessed 2 February 2026).
38. USFaD Administration. Marketing submission recommendations for a predetermined change control plan for artificial intelligence/machine learning-enabled device software functions. https://www.fda.gov/media/166704/download Issued August 18, 2025 (accessed 2 February 2026).
39. EuropeanCommission. AI Act—regulatory framework and application timeline, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai (2025, accessed 2 February 2026).
40. CfMM Services. FAQs related to coverage criteria and utilization management requirements in CMS Final Rule CMS-4201-F. HPMS memo, https://www.aha.org/system/files/media/file/2024/02/faqs-related-to-coverage-criteria-and-utilization-management-requirements-in-cms-final-rule-cms-4201-f.pdf (2024, accessed 2 February 2026). [Google Scholar]
41. Carlini N, Tramèr F, Wallace E, et al. Extracting training data from large language models. Paper presented at 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2633–2650, https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting (2021, accessed 2 February 2026). [Google Scholar]
42. Han T, Nebelung S, Khader F, et al. Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit Med 2024; 7: 288. [DOI] [PMC free article] [PubMed] [Google Scholar]
43. MITRE ATLAS. ATLAS data repository (tactics, techniques, mitigations, case studies), https://github.com/mitre-atlas/atlas-data (accessed 2 February 2026).
44. MITRE. MITRE and microsoft collaborate to address generative AI security risks (news release), 6 November 2023, https://www.mitre.Org/news-insights/news-release/mitre-and-microsoft-collaborate-address-generative-ai-security-risks (2023, accessed 2 February 2026).
45. USC Office. Copyright office releases part 2 of artificial intelligence report (Copyrightability). NewsNet Issue 1060, 29 January, https://www.copyright.Gov/newsnet/2025/1060.html (2025, accessed 2 February 2026).
46. USC Office. Copyright registration guidance: works containing material generated by artificial intelligence (policy statement). Federal Register, https://www.govinfo.gov/content/pkg/FR-2023-03-16/html/2023-05321.htm (2023, accessed 2 February 2026). [Google Scholar]
47. Tariq R, Dilmaghani S, Advani R, et al. Perception and understanding of artificial intelligence among gastroenterology fellows and early career gastroenterologists: a nationwide cross-sectional survey study. Dig Dis Sci 2025; 70: 2655–2664. [DOI] [PubMed] [Google Scholar]
48. NAo Medicine. An artificial intelligence code of conduct for health and medicine: essential guidance for aligned action. Washington, DC: National Academies Press, https://www.nationalacademies.org/publications/29087 (2025, accessed 2 February 2026). [PubMed] [Google Scholar]
49. USFaD Administration. Human factors and medical devices, https://www.fda.gov/medical-devices/device-advice-comprehensive-regulatory-assistance/human-factors-and-medical-devices (accessed 2 February 2026).
50. Moons KGM, Damen JAA, Kaul T, et al. PROBAST + AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ 2025; 388: e082505. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr1-17562848261441693] 1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med 2023; 29: 1930–1940. [DOI] [PubMed] [Google Scholar]

[bibr2-17562848261441693] 2. Wiest IC, Bhat M, Clusmann J, et al. Large language models for clinical decision support in gastroenterology and hepatology. Nat Rev Gastroenterol Hepatol 2025; 22: 773–787. [DOI] [PubMed] [Google Scholar]

[bibr3-17562848261441693] 3. Omiye JA, Lester JC, Spichak S, et al. Large language models propagate race-based medicine. NPJ Digit Med 2023; 6: 195. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr4-17562848261441693] 4. Price WN, Gerke S, Cohen IG. Potential liability for physicians using artificial intelligence. JAMA 2019; 322: 1765–1766. [DOI] [PubMed] [Google Scholar]

[bibr5-17562848261441693] 5. Pellegrino R, Federico A, Gravina AG. Conversational LLM Chatbot ChatGPT-4 for colonoscopy boston bowel preparation scoring: an artificial intelligence-to-head concordance analysis. Diagnostics (Basel) 2024; 14: 2537. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr6-17562848261441693] 6. Panzeri D, Laohawetwanit T, Akpinar R, et al. Assessing the diagnostic accuracy of ChatGPT-4 in the histopathological evaluation of liver fibrosis in MASH. Hepatol Commun 2025; 9: e0695. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr7-17562848261441693] 7. Cruz Rivera S, Liu X, Chan A-W, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020; 26: 1351–1363. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr8-17562848261441693] 8. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020; 26: 1364–1374. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr9-17562848261441693] 9. Vasey B, Nagendran M, Campbell B, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 2022; 28: 924–933. [DOI] [PubMed] [Google Scholar]

[bibr10-17562848261441693] 10. ICoMJ Editors. Recommendations: artificial intelligence (AI)-assisted technology, https://www.icmje.org/recommendations/ (2026, accessed 2 February 2026).

[bibr11-17562848261441693] 11. Berry P, Dhanakshirur RR, Khanna S. Utilizing large language models for gastroenterology research: a conceptual framework. Therap Adv Gastroenterol 2025; 18: 17562848251328577. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr12-17562848261441693] 12. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 2024; 30: 1134–1142. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr13-17562848261441693] 13. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721–732. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr14-17562848261441693] 14. Gravina AG, Pellegrino R, Cipullo M, et al. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients’ questions? An evidence-controlled analysis. World J Gastroenterol 2024; 30: 17–33. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr15-17562848261441693] 15. Suchman K, Garg S, Trindade AJ. Chat generative pretrained transformer fails the multiple-choice American College of Gastroenterology Self-Assessment Test. Am J Gastroenterol 2023; 118: 2280–2282. [DOI] [PubMed] [Google Scholar]

[bibr16-17562848261441693] 16. Gravina AG, Pellegrino R, Palladino G, et al. Charting new AI education in gastroenterology: cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Dig Liver Dis 2024; 56: 1304–1311. [DOI] [PubMed] [Google Scholar]

[bibr17-17562848261441693] 17. Anvari S, Lee Y, Jin DS, et al. Artificial intelligence in hepatology: a comparative analysis of ChatGPT-4, Bing, and Bard at answering clinical questions. J Can Assoc Gastroenterol 2025; 8: 58–62. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr18-17562848261441693] 18. Shroff H, Shankar A, Baron A, et al. A large language model assistant for summarizing hepatology referral documents. Am J Gastroenterol 2026; 121(4): 925–931. [DOI] [PubMed] [Google Scholar]

[bibr19-17562848261441693] 19. Chang PW, Amini MM, Davis RO, et al. ChatGPT4 outperforms endoscopists for determination of postcolonoscopy rescreening and surveillance recommendations. Clin Gastroenterol Hepatol 2024; 22: 1917–1925.e17. [DOI] [PubMed] [Google Scholar]

[bibr20-17562848261441693] 20. Amini M, Chang PW, Davis RO, et al. Comparing ChatGPT3.5 and Bard recommendations for colonoscopy intervals: bridging the gap in healthcare settings. Endosc Int Open 2025; 13: a25865912. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr21-17562848261441693] 21. Zhang Y, Wan XH, Kong QZ, et al. Evaluating large language models as patient education tools for inflammatory bowel disease: a comparative study. World J Gastroenterol 2025; 31: 102090. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr22-17562848261441693] 22. Naqvi HA, Delungahawatta T, Atarere JO, et al. Evaluation of online chat-based artificial intelligence responses about inflammatory bowel disease and diet. Eur J Gastroenterol Hepatol 2024; 36: 1109–1112. [DOI] [PubMed] [Google Scholar]

[bibr23-17562848261441693] 23. Lusetti F, Maimaris S, La Rosa GP, et al. Applications of generative artificial intelligence in inflammatory bowel disease: a systematic review. Dig Liver Dis 2025; 57: 1883–1889. [DOI] [PubMed] [Google Scholar]

[bibr24-17562848261441693] 24. Tariq R, Malik S, Khanna S. Evolving landscape of large language models: an evaluation of ChatGPT and bard in answering patient queries on colonoscopy. Gastroenterology 2024; 166: 220–221. [DOI] [PubMed] [Google Scholar]

[bibr25-17562848261441693] 25. World Health Organization. Ethics and governance of artificial intelligence for health: guidance on large multi-modal models. Geneva: WHO, https://iris.who.int/handle/10665/375579 (2024, accessed 2 February 2026). [Google Scholar]

[bibr26-17562848261441693] 26. Autio C, Dunietz J, Jain S, et al. Artificial intelligence risk management framework: generative artificial intelligence profile. NIST AI 600-1. National Institute of Standards and Technology, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf (2024, accessed 2 February 2026). [Google Scholar]

[bibr27-17562848261441693] 27. The OWASP Foundation. OWASP top 10 for large language model applications (v1.1), https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed 2 February 2026).

[bibr28-17562848261441693] 28. McGreevey JD, 3rd, Hanson CW, 3rd, Koppel R. Clinical, legal, and ethical aspects of artificial intelligence-assisted conversational agents in health care. JAMA 2020; 324: 552–553. [DOI] [PubMed] [Google Scholar]

[bibr29-17562848261441693] 29. Chang CT, Farah H, Gui H, et al. Red teaming ChatGPT in medicine to yield real-world insights on model behavior. NPJ Digit Med 2025; 8: 149. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr30-17562848261441693] 30. Tabassi E. Artificial intelligence risk management framework (AI RMF 1.0). NIST AI 100-1. National Institute of Standards and Technology, https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf (2023, accessed 2 February 2026). [Google Scholar]

[bibr31-17562848261441693] 31. Publications Office of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). OJ L, 2024/1689. https://op.europa.eu/en/publication-detail/-/publication/dc8116a1-3fe6-11ef-865a-01aa75ed71a1 (2024, accessed 2 February 2026).

[bibr32-17562848261441693] 32. USDoHaH Services. Guidance on HIPAA & cloud computing, https://www.hhs.gov/hipaa/for-professionals/special-topics/health-information-technology/cloud-computing/index.html (2022, accessed 2 February 2026).

[bibr33-17562848261441693] 33. EPa Council. Regulation (EU) 2016/679 (general data protection regulation). EUR-Lex, https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679 (2016, accessed 2 February 2026). [Google Scholar]

[bibr34-17562848261441693] 34. OotNCfHI Technology. HTI-1 final rule: health data technology and interoperability—certification program updates algorithm transparency information sharing https://www.healthit.gov/topic/laws-regulation-and-policy/health-data-technology-and-interoperability-certification-program (accessed 2 February 2026).

[bibr35-17562848261441693] 35. USDoHaH Services. Guidance regarding methods for de-identification of protected health information in accordance with HIPAA, https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html (2025, accessed 2 February 2026).

[bibr36-17562848261441693] 36. ICsO (UK). Guidance on AI and data protection (under review due to the Data (Use and Access) Act 2025), https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/ (2023, accessed 2 February 2026).

[bibr37-17562848261441693] 37. U.S. Food and Drug Administration HC, MHRA. Good machine learning practice for medical device development: guiding principles, https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles (2021, accessed 2 February 2026).

[bibr38-17562848261441693] 38. USFaD Administration. Marketing submission recommendations for a predetermined change control plan for artificial intelligence/machine learning-enabled device software functions. https://www.fda.gov/media/166704/download Issued August 18, 2025 (accessed 2 February 2026).

[bibr39-17562848261441693] 39. EuropeanCommission. AI Act—regulatory framework and application timeline, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai (2025, accessed 2 February 2026).

[bibr40-17562848261441693] 40. CfMM Services. FAQs related to coverage criteria and utilization management requirements in CMS Final Rule CMS-4201-F. HPMS memo, https://www.aha.org/system/files/media/file/2024/02/faqs-related-to-coverage-criteria-and-utilization-management-requirements-in-cms-final-rule-cms-4201-f.pdf (2024, accessed 2 February 2026). [Google Scholar]

[bibr41-17562848261441693] 41. Carlini N, Tramèr F, Wallace E, et al. Extracting training data from large language models. Paper presented at 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2633–2650, https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting (2021, accessed 2 February 2026). [Google Scholar]

[bibr42-17562848261441693] 42. Han T, Nebelung S, Khader F, et al. Medical large language models are susceptible to targeted misinformation attacks. NPJ Digit Med 2024; 7: 288. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bibr43-17562848261441693] 43. MITRE ATLAS. ATLAS data repository (tactics, techniques, mitigations, case studies), https://github.com/mitre-atlas/atlas-data (accessed 2 February 2026).

[bibr44-17562848261441693] 44. MITRE. MITRE and microsoft collaborate to address generative AI security risks (news release), 6 November 2023, https://www.mitre.Org/news-insights/news-release/mitre-and-microsoft-collaborate-address-generative-ai-security-risks (2023, accessed 2 February 2026).

[bibr45-17562848261441693] 45. USC Office. Copyright office releases part 2 of artificial intelligence report (Copyrightability). NewsNet Issue 1060, 29 January, https://www.copyright.Gov/newsnet/2025/1060.html (2025, accessed 2 February 2026).

[bibr46-17562848261441693] 46. USC Office. Copyright registration guidance: works containing material generated by artificial intelligence (policy statement). Federal Register, https://www.govinfo.gov/content/pkg/FR-2023-03-16/html/2023-05321.htm (2023, accessed 2 February 2026). [Google Scholar]

[bibr47-17562848261441693] 47. Tariq R, Dilmaghani S, Advani R, et al. Perception and understanding of artificial intelligence among gastroenterology fellows and early career gastroenterologists: a nationwide cross-sectional survey study. Dig Dis Sci 2025; 70: 2655–2664. [DOI] [PubMed] [Google Scholar]

[bibr48-17562848261441693] 48. NAo Medicine. An artificial intelligence code of conduct for health and medicine: essential guidance for aligned action. Washington, DC: National Academies Press, https://www.nationalacademies.org/publications/29087 (2025, accessed 2 February 2026). [PubMed] [Google Scholar]

[bibr49-17562848261441693] 49. USFaD Administration. Human factors and medical devices, https://www.fda.gov/medical-devices/device-advice-comprehensive-regulatory-assistance/human-factors-and-medical-devices (accessed 2 February 2026).

[bibr50-17562848261441693] 50. Moons KGM, Damen JAA, Kaul T, et al. PROBAST + AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ 2025; 388: e082505. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Responsible use of large language models in gastroenterology and hepatology

Sahil Khanna

Roles

Abstract

Plain language summary

Introduction

Methods

LLM-enabled workflows: A risk-tier taxonomy proposal

Table 1.

Tier 0: No clinical data, non-clinical support

Tier 1: Limited PHI and operational/administrative tasks

Tier 2: Clinical documentation and communication with human-in-the-loop oversight

Tier 3: Clinical decision support

Tier 4: Autonomous clinical action

Evidence base for LLM applications in GI/Hep by risk tier

Table 2.

Tier 0 in GI/Hep: No clinical data, non-clinical support

Tier 1 in GI/Hep: Limited PHI and operational/administrative tasks

Tier 2 in GI/Hep: Clinical documentation and communication with human-in-the-loop oversight

Tier 3 in GI/Hep: Clinical decision support

Multimodal and next-generation models

Ethical considerations

Beneficence and non-maleficence: Patient safety and error modes

Table 3.

Justice: Bias, inequity, and representativeness

Autonomy and transparency: Informed use and appropriate reliance

Accountability and professional integrity

Legal and regulatory considerations

Table 4.

Data protection and confidentiality

Medical device and healthcare IT oversight

Professional liability and standard of care

Operational governance for responsible LLM use

Governance: Roles, policies, and acceptable use

Technical safeguards: Architecture, security, and data controls

Privacy and leakage risk: Why “don’t upload PHI” is not enough

GDPR and international data protection

Evaluation and monitoring

Responsible scholarly use: Authorship, disclosure, and publication ethics

Education

Preventing harm: Practical safeguards for LLM use

Start with intended use and risk tier

Table 5.

Safety guardrails to reduce hallucinations and unsafe recommendations

Human oversight and human factors: Design the human–AI team

Equity and bias: Test for subgroup harms and avoid race-based medicine

Privacy and cybersecurity: Treat LLMs as a new attack surface

Future directions

Conclusion

Box 1.

Box 2.

Box 3.

Box 4.

Acknowledgments

Footnotes

Declarations

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases