BMJ Health & Care Informatics. 2026 Feb 18;33(1):e102007. doi: 10.1136/bmjhci-2025-102007

Artificial intelligence translation in healthcare: an urgent call for evidence-informed policy frameworks

Chukwuebuka Anyaegbuna 1, Natasha Steele 2, April Shichu Liang 2, Stephen P Ma 2, Ivan Lopez 2, Nymisha Chilukuri 2, Kavita Patel 2, Kevin Schulman 2, Jonathan H Chen 3
PMCID: PMC12918658  PMID: 41708159

Abstract

The deployment of artificial intelligence (AI) translation tools in healthcare is accelerating rapidly, yet regulatory frameworks lag dangerously behind clinical practice. Recent data reveal that 57% of US physicians are already using or planning to adopt AI translation services within the next year. This creates a critical policy vacuum where clinicians deploy tools with variable performance across languages, risking patient safety and deepening health inequities. We examine the fractured regulatory landscape, document performance disparities between well-resourced and digitally under-represented languages, and argue for an urgent, evidence-informed policy framework centred on patient comprehension rather than linguistic accuracy.

We delineate a risk-stratified validation approach comprising two distinct tracks: a ‘Streamlined Pathway’ for tool-language combinations with robust existing evidence (eg, Spanish) and a ‘Standard Pathway’ requiring independent, prospective validation for digitally under-represented languages (eg, Haitian Creole). To ensure accountability, we propose establishing oversight bodies within the U.S. Department of Health and Human Services (HHS) or the Food and Drug Administration (FDA) to mandate pre-deployment validation and post-market monitoring. Without such action, AI translation risks creating a two-tier system where the 25.7 million Americans with non-English language preferences receive dramatically different care quality based solely on the language they speak.

Keywords: Artificial intelligence, Health Equity, Natural Language Processing, Patient Care

Introduction: an emerging crisis in healthcare equity

The rapid emergence of artificial intelligence (AI) translation tools has accelerated well beyond the capacity of our current regulatory infrastructure, creating a profound and urgent crisis at the intersection of technological innovation and healthcare equity. The legal frameworks governing language access, established decades before the advent of large language models (LLMs), have created a policy vacuum where technological adoption is surging while regulatory adaptation lags dangerously behind.

This is no longer a hypothetical future scenario. The American Medical Association’s 2024 Physician AI Sentiment Report reveals that translation services now rank as the most familiar AI use case among physicians, with 57% of respondents already using these tools or planning to adopt them within the next year.1 This represents a 30% increase from 2023, signalling a massive shift in clinical practice. Physicians report high enthusiasm for these services, placing them among the highest-value use cases for generative AI. However, this enthusiasm dramatically outpaces the development of evidence-based quality standards, creating significant patient safety risks as clinicians deploy tools with widely variable performance across languages.

For the millions of patients with non-English language preference (NELP), language barriers remain a persistent and dangerous obstacle to safe healthcare. Extensive research confirms that patients with limited English proficiency (LEP) experience systematically worse outcomes than their English-speaking counterparts.2–5 They face higher rates of adverse events involving physical harm,2 increased odds of 30-day hospital readmission3 and higher rates of avoidable emergency department revisits.4 A pivotal study of US hospitals found that adverse events among LEP patients were nearly twice as likely to result in physical harm as those among English-speaking patients (49.1% vs 29.5%), often due to errors in communication.2 Furthermore, language discordance is associated with lower patient satisfaction and a higher risk of medication errors.5

While professional human translation remains the gold standard for mitigating these risks, it is resource-intensive and subject to severe operational bottlenecks. Even at large academic medical centres, human translation for discharge documents or patient education materials can take between 1 and 5 days or be deferred altogether due to cost constraints.6 Consequently, many patient interactions, particularly those involving enduring materials rather than direct encounters, occur without adequate language support.

AI-powered translation tools offer the promise of an immediate, scalable and low-cost solution to these delays. Yet, this technological promise is shadowed by an emerging reality: the performance of these tools is not uniform. The quality of AI-generated medical translations varies dramatically depending on the volume of training data available for the target language. This ‘resource gap’ results in superior performance for languages with abundant online text, while performance degrades precipitously for digitally under-represented languages.7 Without immediate policy intervention, we risk cementing a two-tier healthcare system where speakers of widely used languages benefit from cutting-edge AI assistance, while speakers of less common languages are exposed to disproportionate medical errors.

Methods

This policy analysis draws on a targeted review of recent empirical literature on AI translation performance in healthcare settings, current US federal statutes and regulations governing language access, and published implementation guidance. We identified key studies through searches of PubMed, legal databases and federal regulatory sources. Our framework development was informed by stakeholder perspectives from clinical practice, health policy and health equity research.

The evidence gap: variable performance and safety risks

The performance divide in AI medical translation is stark, well-documented and rooted in the fundamental architecture of LLMs. While substantial research has developed methodologies for assessing translation quality, including automated metrics such as Bilingual Evaluation Understudy (BLEU) scores and expert evaluation frameworks, these approaches have historically emphasised linguistic fidelity over functional health outcomes.8 9 These models rely on vast corpora of training data to learn linguistic patterns; thus, their proficiency is directly correlated with the digital footprint of a given language.10 11
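Because BLEU-style metrics recur throughout this evidence base, a minimal sketch helps show why such surface-overlap metrics sit uneasily with clinical safety: they reward n-gram overlap, so a translation with a single dangerous word substitution can still score well. The implementation below is a simplified sentence-level BLEU (modified n-gram precision with a brevity penalty), not the full corpus-level metric, and the example sentences are hypothetical.

```python
# Simplified sentence-level BLEU: modified n-gram precision with a brevity
# penalty. A sketch for illustration, not the full corpus-level metric.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, candidate, max_n=2):
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    # Brevity penalty discourages very short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A hypothetical dosage substitution ("two" -> "ten") leaves the score high
# (~0.79 vs 1.0) even though the clinical consequences would be severe.
ref = "take two tablets every eight hours with food"
print(round(sentence_bleu(ref, ref), 3))
print(round(sentence_bleu(ref, "take ten tablets every eight hours with food"), 3))
```

This is the gap the comprehension-centred standard proposed later in this article is meant to close: a metric blind to which word changed cannot distinguish a fluent paraphrase from a dosage error.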

For ‘high-resource’ languages like Spanish, the performance of leading AI models has reached a level of maturity that challenges the traditional primacy of human translation. A 2025 study in JAMA Pediatrics evaluated the ability of OpenAI’s Generative Pre-trained Transformer 4 Omni (GPT-4o) to translate paediatric patient instructions into Spanish. The results were striking: the AI translations were not merely comparable to those of professional human translators but were often preferred by expert evaluators for their fluency and clarity, containing significantly fewer mistranslation errors than the human reference standard.12 In this specific context, the evidence suggests that AI can serve as a safe, efficient and potentially superior alternative to current workflows.

However, this success story is not universal. Performance degrades sharply for ‘digitally under-represented’ languages. Another research group that lauded AI performance for Spanish found alarming results when evaluating Haitian Creole. For this population, professional human translations consistently and significantly outperformed both Google Translate and ChatGPT.13

The nature of the errors was particularly concerning. While professional translations contained clinically significant errors in 8.3% of cases, the error rate for Google Translate was 23.3%, and for ChatGPT, it rose to 33.3%.13 In a clinical setting, an error rate of one in three is untenable. These errors included critical omissions of dosage instructions, misinterpretations of symptoms and the hallucination of medical advice not present in the original English text. For a Haitian Creole-speaking patient, the AI tool transforms from a helpful assistant into a potentially dangerous source of misinformation.
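Point estimates like these carry substantial sampling uncertainty when derived from small document samples. As a sketch of how that uncertainty could be reported, the code below computes 95% Wilson score intervals; the counts are illustrative stand-ins chosen only to approximate the quoted percentages, not figures taken from the cited study.

```python
# Wilson score interval for a binomial error proportion: a sketch of how
# small-sample error rates could be reported with uncertainty bounds.
# Counts below are illustrative only, not from the cited study.
import math

def wilson_interval(errors, n, z=1.96):
    """95% Wilson score interval for an observed proportion errors/n."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

for label, errors, n in [("human", 5, 60), ("tool A", 14, 60), ("tool B", 20, 60)]:
    lo, hi = wilson_interval(errors, n)
    print(f"{label}: {errors / n:.1%} errors (95% CI {lo:.1%} to {hi:.1%})")
```

The wide intervals at these sample sizes underline why the article argues for prospective, adequately powered validation rather than reliance on single small studies.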

This disparity creates a distinct policy challenge. If a health system deploys AI translation as a blanket solution, it inadvertently privileges patients who speak high-resource languages while exposing those who speak low-resource languages to substandard care. Currently, there is little transparency regarding these performance differences. Foundation model providers rarely disclose granular performance metrics for specific medical use cases across the long tail of languages. This leaves healthcare institutions in the dark, forced to make deployment decisions without the data necessary to distinguish between safe and unsafe applications.

Navigating a fractured regulatory landscape

The regulatory environment surrounding medical translation has become increasingly contested and uncertain. Understanding the obligations of healthcare systems requires distinguishing between three levels of legal authority, each with different degrees of enforceability and susceptibility to administrative change.

Level 1: federal statutes

At the foundation are federal statutes that require Congressional action to modify. Title VI of the Civil Rights Act of 1964 prohibits discrimination on the basis of national origin, which courts have consistently interpreted to include discrimination based on language.14 15 Section 1557 of the Affordable Care Act (ACA) reinforced this mandate, requiring covered health entities to take ‘reasonable steps to provide meaningful access’ to individuals with NELP.16 17 These statutory requirements provide durable protection and a clear legal obligation that remains enforceable regardless of shifting political priorities.

Level 2: formal regulations

At the second level are formal regulations, which implement statutory requirements through detailed rulemaking. The 2024 HHS Final Rule implementing Section 1557 provided the first explicit guidance on machine translation (referred to here as AI translation). This regulation introduced a critical distinction between ‘critical’ and ‘non-critical’ documents. For critical documents—a category that includes consent forms, discharge instructions and notices of eligibility18—the regulation mandates that ‘a qualified human translator must review and correct the content before it is given to a patient’.19 This creates a regulatory ‘human-in-the-loop’ floor for high-stakes clinical communication. However, this regulation was developed based on the performance capabilities of older model generations and defaults to a manual review process that may negate the efficiency gains of AI deployment.

Level 3: agency guidance

At the third and most fluid level is agency guidance such as the recent Dear Colleague Letter.20 Recent executive actions have introduced significant uncertainty into this landscape. Executive Order 14224, designating English as the official language of the USA,21 and subsequent memoranda from the Department of Justice have signalled a shift in enforcement priorities, emphasising the minimisation of ‘non-essential multilingual services’.22 However, legal scholars emphasise that executive orders cannot overturn existing statutes; the underlying requirements of Title VI and Section 1557 remain the law of the land.23 24

This tiered structure reveals a critical policy vacuum. While the statute is clear on the obligation to provide equity, and the Biden-era regulation establishes a method (human review) for critical documents, neither provides ongoing operational standards for validating AI quality. Healthcare institutions lack clear answers to fundamental questions: What level of accuracy is ‘good enough’ for a non-critical document? How do we validate tools across hundreds of languages where we lack internal staff expertise? Who bears liability when a hallucinated translation leads to patient harm?

Towards an evidence-informed policy framework

We propose a tiered regulatory framework that moves beyond a binary ‘human vs AI’ debate. This framework centres on a fundamental principle: the appropriate metric for AI translation quality in healthcare is patient comprehension and safety, not merely linguistic accuracy.

Core principle: patient-centred evaluation standards

Traditional translation evaluation focuses on linguistic fidelity: whether the translated text accurately conveys the semantic meaning of the source text as judged by bilingual experts. While necessary, this metric is insufficient for healthcare. A translation can be linguistically accurate yet clinically ineffective if it uses a register that is too high for the patient’s health literacy or fails to account for cultural context.

We propose shifting the evaluation standard to functional health literacy. The pivotal question is not, ‘Did the AI translate this word correctly?’, but rather ‘After reading these discharge instructions, can the patient correctly identify safe behaviours?’. This patient-centred standard creates a single, language-agnostic benchmark. It elegantly addresses the performance disparity problem because a tool that enables adequate patient comprehension meets the standard regardless of the underlying linguistic mechanisms, while a tool that produces grammatically perfect but incomprehensible translations fails the standard.

Recommendation: performance-based validation pathways

To operationalise this standard, we recommend a risk-stratified approach to validation. We propose establishing a Universal Patient Comprehension Standard requiring non-inferiority to certified human translation, the current standard of care (eg, demonstrating that patient comprehension rates are not meaningfully lower than those achieved with professional translation; specific margins should be determined through stakeholder consensus and may warrant stricter thresholds for life-threatening medication instructions). To achieve this, we delineate two distinct evidence-informed validation pathways:

Streamlined validation pathway

This pathway applies to tool-language combinations where robust validation evidence already exists in the peer-reviewed literature. For example, given the strong evidence supporting GPT-4o’s performance in Spanish medical translation, healthcare institutions should be permitted to rely on this published evidence to document compliance. Institutions would not need to conduct their own primary validation studies but would instead focus on ongoing monitoring. This pathway recognises that duplicating well-established validation studies represents an inefficient use of limited resources.

Standard validation pathway

This pathway applies to novel combinations, digitally under-represented languages (such as Haitian Creole, Tagalog or Hmong) or specific high-risk use cases lacking robust published evidence. For these scenarios, independent entities without financial conflicts of interest should conduct prospective validation studies demonstrating that the specific tool-language-use case combination meets the patient comprehension standard before offering it for clinical deployment. All validation studies should be prospectively registered in a public database analogous to ClinicalTrials.gov to prevent selective publication of favourable results. These studies should use validated assessment protocols and be sized to detect clinically meaningful differences in patient understanding. This approach ensures that all healthcare institutions, including community hospitals and rural facilities that lack research infrastructure, can confidently deploy validated tools without bearing the burden of conducting institution-specific validation studies.
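The comprehension standard and the requirement that validation studies be sized to detect clinically meaningful differences both rest on standard two-proportion statistics. As an illustration, the sketch below implements a one-sided non-inferiority z-test and a rough per-arm sample-size estimate; all rates, the 10-point margin and the thresholds are hypothetical placeholders, since the article leaves specific margins to stakeholder consensus.

```python
# Sketch of the statistics behind a comprehension-based non-inferiority
# standard. All rates, margins and sample sizes are hypothetical.
import math

def noninferiority_z(p_ai, n_ai, p_human, n_human, margin):
    """One-sided z statistic for H0: AI comprehension falls short of human
    comprehension by at least `margin` (rejecting H0 => non-inferior)."""
    diff = p_ai - p_human + margin  # > 0 favours non-inferiority
    se = math.sqrt(p_ai * (1 - p_ai) / n_ai + p_human * (1 - p_human) / n_human)
    return diff / se

def per_arm_sample_size(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate n per arm to detect p1 vs p2 (two-sided alpha=0.05, ~80% power)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical trial: 82% comprehension with AI vs 85% with human translation,
# 200 patients per arm, 10-point non-inferiority margin.
z = noninferiority_z(p_ai=0.82, n_ai=200, p_human=0.85, n_human=200, margin=0.10)
print(f"z = {z:.2f}; non-inferior at one-sided alpha=0.05: {z > 1.645}")

# Planning: patients per arm needed to distinguish 85% vs 75% comprehension.
print(per_arm_sample_size(0.85, 0.75))
```

Margin choice drives the conclusion: the same hypothetical data fail a 5-point margin while passing a 10-point one, which is why the article leaves margins to stakeholder consensus and flags stricter thresholds for life-threatening instructions.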

Governance and enforcement mechanisms

Effective implementation of these validation pathways requires dedicated oversight infrastructure. We recommend establishing oversight bodies within the U.S. Department of Health and Human Services (HHS) or the Food and Drug Administration (FDA) with explicit statutory authority to mandate pre-deployment validation studies, conduct ongoing monitoring and impose penalties for non-compliance. Critically, these bodies must include not only AI technologists but also translation experts, clinicians who work with LEP populations, patient advocates from affected language communities and health equity researchers. To address the implementation burden on smaller healthcare systems, we further recommend the creation of centralised validation consortia and federal funding mechanisms for validation studies of digitally under-represented languages, enabling resource-pooling across institutions. Indeed, robust validation evidence could ultimately support regulatory evolution, including potential revision of current human-in-the-loop requirements for tool-language combinations demonstrating non-inferiority to certified human translation.

Model versioning and post-deployment monitoring

A fundamental challenge with LLM-based translation is that models are frequently updated, often without advance notice to end-users. A tool validated using one model version may perform differently when the underlying architecture changes. Our framework therefore requires version-specific validation, mandatory notification when healthcare-deployed models are updated, and defined revalidation triggers when model architectures change substantially. Beyond pre-deployment validation, ongoing monitoring is essential. We recommend mandatory incident reporting systems for translation-related adverse events, periodic comprehension audits using standardised protocols, patient feedback mechanisms to flag translation problems directly and defined trigger metrics that would prompt revalidation when error rates exceed specified thresholds.
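The version-change and error-rate triggers described above can be sketched as a small monitoring component: revalidation fires when the deployed model version changes or when the error rate over a rolling window of reported incidents exceeds a threshold. The window size and the 10% threshold are illustrative assumptions, not proposed regulatory values.

```python
# Minimal sketch of post-deployment revalidation triggers. The window size
# and error threshold are illustrative assumptions only.
from collections import deque

class TranslationMonitor:
    def __init__(self, model_version, window=100, error_threshold=0.10):
        self.model_version = model_version
        self.window = deque(maxlen=window)  # 1 = error reported, 0 = ok
        self.error_threshold = error_threshold

    def record(self, had_error: bool):
        """Log one translation incident report (error or no error)."""
        self.window.append(1 if had_error else 0)

    def needs_revalidation(self, current_version) -> bool:
        if current_version != self.model_version:
            return True  # version-change trigger
        if len(self.window) == self.window.maxlen:
            rate = sum(self.window) / len(self.window)
            return rate > self.error_threshold  # error-rate trigger
        return False

monitor = TranslationMonitor("model-v1")
for i in range(100):
    monitor.record(had_error=(i % 5 == 0))   # simulate a 20% error rate
print(monitor.needs_revalidation("model-v1"))  # True (rate 0.20 > 0.10)
print(monitor.needs_revalidation("model-v2"))  # True (version changed)
```

In practice such a component would sit behind the mandatory incident reporting systems proposed here, with thresholds set per language and use case by the oversight body.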

Limitations and scope

Our analysis focuses primarily on the US regulatory context; similar challenges exist in other jurisdictions, and our proposed framework may inform international policy development, although specific regulatory mechanisms will differ across contexts. We acknowledge that our patient comprehension standard, while superior to purely linguistic metrics, may not fully capture cultural dimensions of effective health communication—translations that patients technically understand may still be culturally inappropriate. Validation processes should therefore incorporate community health workers and cultural mediators from target language communities, not only bilingual evaluators.

Additionally, several critical areas warrant future policy development beyond the scope of this framework. First, questions of legal liability remain unresolved; current legal frameworks may place disproportionate risk on clinicians as end-users rather than on the AI developers who create these tools, and future frameworks should explore liability standards that incentivise developers to conduct robust validation.25 Second, privacy and data security require ongoing attention; while institutional Business Associate Agreements (BAAs) cover many enterprise AI services, the transmission of protected health information to commercial LLM services warrants clear governance to prevent unauthorised data retention or use for model training. Finally, transparency regarding industry practices such as A/B testing, where different users may unknowingly receive different model versions, remains essential for informed clinical deployment.

Conclusion

The deployment of AI translation in healthcare is not a future possibility; it is a current reality. With nearly 60% of physicians already using or planning to adopt these tools, the sector has passed the tipping point of adoption. The question is no longer whether AI translation will be used in clinical care, but how it will be governed to ensure safety and equity.

Healthcare organisations cannot wait for federal clarity that may never come or may fluctuate with political cycles. They face immediate decisions about technology deployment while bearing ethical obligations to protect vulnerable patients and legal obligations to provide meaningful access. The policy vacuum is not theoretical; it affects real-world decision-making every day as hospitals navigate resource allocation under conditions of contested guidance.

Done well, AI translation could dramatically improve healthcare access for millions of Americans, reducing delays, improving patient understanding and enabling more frequent communication in patients’ preferred languages. It offers the only viable path to scaling language access to meet the growing diversity of the patient population. However, if deployed enthusiastically but without appropriate validation, AI translation risks creating a dangerous two-tier system. In this scenario, the benefits of innovation would accrue primarily to speakers of well-resourced languages, while speakers of digitally under-represented languages would face increased risks of medical errors, poorer health outcomes and deepened inequities.

For the 25.7 million Americans with NELP, and particularly for those speaking digitally under-represented languages, the development of thoughtful, evidence-informed policy frameworks is not merely an academic exercise—it is an urgent matter of health equity and patient safety.

Acknowledgements

Support for this research was provided by the Commonwealth Fund. The views presented here are those of the authors and should not be attributed to the Commonwealth Fund or its directors, officers or staff.

Footnotes

Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Patient consent for publication: Not applicable.

Ethics approval: Not applicable.

Provenance and peer review: Not commissioned; externally peer reviewed.

References

1. American Medical Association. 2024 Physician AI Sentiment Report. Chicago: American Medical Association; 2025.
2. Divi C, Koss RG, Schmaltz SP. Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19:60–7. doi: 10.1093/intqhc/mzl069
3. Chu JN, Wong J, Bardach NS, et al. Association between language discordance and unplanned hospital readmissions or emergency department revisits: a systematic review and meta-analysis. BMJ Qual Saf. 2024;33:456–69. doi: 10.1136/bmjqs-2023-016295
4. Muir KJ, Sliwinski K, Lasater KB. Reducing disparities in emergency department outcomes for individuals with limited English proficiency: the nurse work environment. Nurs Outlook. 2025;73. doi: 10.1016/j.outlook.2024.102318
5. Diamond L, Izquierdo K, Canfield D, et al. A systematic review of the impact of patient-physician non-English language concordance on quality of care and outcomes. J Gen Intern Med. 2019;34:1591–606. doi: 10.1007/s11606-019-04847-5
6. Lopez I, Velasquez DE, Chen JH, et al. Operationalising LLM-based machine-assisted translation in health systems. NPJ Digit Med. 2025;8:584. doi: 10.1038/s41746-025-01944-0
7. Demszky D, Baruah SU, Kreutzer J, et al. Closing the gap: a call for more inclusive language technologies. Washington, DC: Brookings Institution; 2024.
8. Reiter E. A structured review of the validity of BLEU. Comput Linguist. 2018;44:393–401. doi: 10.1162/coli_a_00322
9. Dew KN, Turner AM, Choi YK, et al. Development of machine translation technology for assisting health communication: a systematic review. J Biomed Inform. 2018;85:56–67. doi: 10.1016/j.jbi.2018.07.018
10. Lai VD, Ngo NT, Pouran Ben Veyseh A, et al. ChatGPT beyond English: towards a comprehensive evaluation of large language models in multilingual learning. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics; 2023:13171–99.
11. Stap D, Araabi A. ChatGPT is not a good indigenous translator. In: Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP). Toronto: Association for Computational Linguistics; 2023:163–7.
12. Ray M, Kats DJ, Moorkens J, et al. Evaluating a large language model in translating patient instructions to Spanish using a standardized framework. JAMA Pediatr. 2025;179:1026–33. doi: 10.1001/jamapediatrics.2025.1729
13. Brewster RCL, Gonzalez P, Khazanchi R, et al. Performance of ChatGPT and Google Translate for pediatric discharge instruction translation. Pediatrics. 2024;154. doi: 10.1542/peds.2023-065573
14. Title VI of the Civil Rights Act of 1964, Pub. L. No. 88-352, 78 Stat. 241.
15. Chen AH, Youdelman MK, Brooks J. The legal framework for language access in healthcare settings: Title VI and beyond. J Gen Intern Med. 2007;22:362–7. doi: 10.1007/s11606-007-0366-2
16. Department of Health and Human Services. Section 1557: ensuring meaningful access for individuals with limited English proficiency. Washington, DC: Department of Health and Human Services.
17. American Medical Association. Affordable Care Act, Section 1557: fact sheet. Chicago: American Medical Association; 2017.
18. LanguageLine Solutions. Six medical documents that must be translated according to the Civil Rights Act. Monterey, CA: LanguageLine Solutions; 2019.
19. Department of Health and Human Services, Office for Civil Rights. Nondiscrimination in health programs and activities (final rule). 2024.
20. Office for Civil Rights. Language access provisions of the final rule implementing Section 1557 of the Affordable Care Act (Dear Colleague letter). Washington, DC: Department of Health and Human Services; 2024.
21. Exec. Order No. 14224. Designating English as the official language of the United States. 2025.
22. Bondi P. Designating English as the official language of the United States. Washington, DC: Department of Justice; 2025.
23. National Immigration Law Center. Trump administration’s attempts to dismantle language access do not erase civil rights law. Los Angeles: National Immigration Law Center; 2025.
24. Derrington S, Barwise A, Martins L, et al. An emerging threat to health equity: diminished patient access to language services. Health Aff Forefront. 2025. doi: 10.1377/forefront.20251022.780704
25. Price WN II, Gerke S, Cohen IG. Liability for use of artificial intelligence in medicine. In: Solaiman B, Cohen IG, eds. Research handbook on health, AI and the law. Cheltenham: Edward Elgar Publishing; 2024:150–66.
