Abstract
Over 25 million U.S. patients with a non-English language preference face unsafe care because discharge instructions and other materials are rarely translated in time. Advances in translation assisted by large language models can close this gap, but implementation guidance is scarce. Using the Consolidated Framework for Implementation Research, we outline key considerations—innovation, individuals, inner setting, implementation process, and outer setting—to offer healthcare leaders and policymakers a practical roadmap for language model machine-assisted translation integration.
Subject terms: Machine learning, Health policy
Introduction
Patients with a non-English language preference (NELP) experience significant barriers to high-quality health care, resulting in higher emergency department visits and hospital readmission rates compared to patients with an English language preference1,2. These disparities are driven by multilevel factors, including insufficient use of interpretation services (real‑time spoken language support)3–5 as well as limited availability of bilingual providers6–9. Another key contributing factor is the lack of translated medical information (written text). When patients require translated medical information, including discharge instructions or informed consent forms, certified translators either craft the translated documents de novo or use computer-aided translation (CAT) software, such as translation memory tools, grammar checkers, or terminology managers. These processes are costly, labor-intensive, and lack necessary customization10–12. In instances where in-house expertise is insufficient, health systems rely on third-party companies for translations. Despite these workflows, health systems continue to fail to routinely translate vital documents for patients with NELP12–14, even though the Civil Rights Act of 1964 and Department of Health and Human Services (HHS) regulations require U.S. hospitals to provide reasonable access to language assistance services, including written translations15.
Machine-assisted translation (MAT) using large language models (LLMs) has the potential to bridge this gap. Before 2023, neural machine translation (NMT) was the leading automated translation technology16. However, NMT systems struggle with specialized terminology and generally offer limited contextual understanding, particularly on longer passages17. Meanwhile, existing CAT software is hampered by limited adaptability and static glossaries, requiring considerable effort from translators to refine each piece of text. In contrast, LLMs provide deeper contextual understanding, enabling them to process entire passages and handle vocabulary and linguistic nuances more precisely18. They also support multi-task learning, allowing them to translate, summarize, or simplify text—capabilities crucial for producing accessible patient materials19. Although a single-center study reported that LLM translation achieved results comparable to human translators for languages with abundant online resources, such as Spanish and Portuguese20, concerns remain regarding its accuracy in languages with less digital content. Examples include Quechua and Yorùbá, which are considered digitally underrepresented languages in machine translation research due to the scarcity of comprehensive online corpora21,22. Additionally, there remains a lack of guidance on how to safely and effectively implement LLM-supported MAT technology into healthcare, particularly as generative artificial intelligence (AI) tools in medicine expand19,23–27.
To advance care for the millions of individuals with NELP, health systems and policymakers must take the lead in implementing LLM-supported MAT. Section 1557 of the Affordable Care Act has already established a fundamental requirement: machine-generated translations must undergo human review before reaching patients28. Beyond this high‑level mandate, there is no detailed federal or professional‑body guidance on how to operationalize LLM‑based MAT (e.g., evaluation, privacy safeguards, or workflow integration), leaving health systems to devise their own policies. In this perspective, we apply the Consolidated Framework for Implementation Research (CFIR)29—a comprehensive implementation science framework encompassing five domains and up to 67 constructs—as a conceptual lens to examine these challenges and considerations. By mapping these factors across CFIR domains, we present a roadmap that can help health systems judiciously operationalize MAT (Tables 1–5).
Table 2.
Individuals domain: focuses on the characteristics of individuals involved in implementation, including their knowledge, beliefs, and personal attributes
| Areas that impact the implementation gap | Barriers to implementation | Potential solutions |
|---|---|---|
| D. Translators | LLM-based MAT integration that disrupts established workflows | EHR/IT teams can partner with translation leads to add “auto-draft” capabilities directly into translators’ current workflows. This should include integrating existing translator tools into the MAT workflow, such as CAT software |
| Design comprehensive training focusing on best practices when interacting with MAT tools, how to handle MAT outages, etc. | ||
| E. Clinicians | Poorly structured or jargon-laden clinical notes cause translation errors and reduce MAT accuracy | Encourage clinicians to improve the quality of written discharge summaries |
| Enable a two-prompt approach where the MAT LLM uses a “preparation” prompt to clean up messy notes before applying the “translation” prompt | ||
| F. Patients | Mistrust or misunderstanding of AI-driven translations | Ensure transparency regarding the use of AI in translating patient documents |
| Inform patients (via consent forms or discharge packets) that MAT is used, emphasizing final human verification | ||
| Lack of patient engagement in refining MAT processes | Patient advisory boards can be leveraged to collect direct patient feedback on clarity and acceptability of translations | |
| Tailor MAT deployment from patient feedback by offering MAT primarily in areas where patients feel comfortable (e.g., adult vs. pediatric settings), expanding usage as trust grows |
MAT machine-assisted translation, LLM large language model, EHR electronic health record.
Table 3.
Inner Setting Domain: encompasses the organizational context in which implementation occurs, including structural characteristics, available resources, and culture
| Areas that impact the implementation gap | Barriers to implementation | Potential solutions |
|---|---|---|
| G. Translator workforce | Current translator workloads are saturating teams and leading to lengthy turnaround times | LLM-based MAT should offer a completed draft for the translator to review as soon as the translation request is made in the EMR |
| Risk that low-quality LLM outputs demand more effort to manually correct compared to de novo synthesis | LLM-based MAT should follow selective deployment beginning with note types and languages that have been shown to perform well in prospective pilot studies. Deployment could be expanded gradually as data is collected and fine-tuning on MAT models is performed | |
| Reliance on third-party vendors that do not use MAT in their workflows reduces possible gains in translation efficiency | Third-party vendors and health systems should negotiate contracts for vendors to refine MAT drafts rather than translate from scratch, preserving speed and leveraging external expertise | |
| H. Organizational culture | Translators have concerns about job displacement or diminishing professional expertise | Reinforce a translator-in-the-loop model that emphasizes LLM outputs should be a first draft with the final review remaining the translator’s responsibility |
| Maintain Section 1557 of the Affordable Care Act that says machine-generated translations must undergo human review before reaching patients | ||
| Translators feel like they are losing their autonomy | Organizations should set clear MAT expectations with translation teams on a case-by-case basis. For instance, translators may opt to use MAT tools for some projects while foregoing them for others. Nevertheless, health systems should strive to increase MAT usage over time, especially when it demonstrably enhances efficiency and patient outcomes |
MAT machine-assisted translation, LLM large language model.
Table 4.
Implementation process domain: focuses on the active strategies and actions taken to adopt, execute, and sustain an intervention
| Areas that impact the implementation gap | Barriers to implementation | Potential solutions |
|---|---|---|
| I. Integration | Lack of guidance on the effective integration of MAT | Aligning with Existing Workflows: Embed MAT features directly into current EHR or translation software to minimize disruptions for translators and clinicians |
| Co-Designing with End Users: Collaborate with translators and clinicians to tailor prompts, interfaces, and error-check features based on real-world needs | ||
| Retrospective Testing: Use secure offline data to identify errors and refine the model prior to live deployment | ||
| Prospective Testing: Pilot small-scale rollouts for a single language or document type to gather feedback and measure error rates | ||
| Plan for Deployment Infrastructure: Employ centralized data storage (e.g., BigQuery) with detailed logging to enable auditing, troubleshooting, and secure PHI handling | ||
| Fine-Tuning to Meet Deployment Needs: Continuously update the LLM with new translator-approved data, targeting challenging note types or languages first | ||
| Measuring Real-World Impact: Track turnaround time, adoption rates, and patient outcomes to gauge the model’s overall value and display these in data dashboards (e.g., Looker Studio) | ||
| J. Evaluation | Lack of guidance on the effective evaluation of MAT | Translation Quality: Apply frameworks like MQM periodically to assess accuracy, fluency, and style on a sample of translated documents |
| Operational Metrics: Track turnaround times, the percentage of language-concordant materials, and error rates to inform iterative improvements | ||
| Clinical Outcomes: Monitor readmission or mortality rates in specific patient cohorts to gauge the real-world effects of MAT on care quality | ||
| Patient Understandability and Actionability: Use tools like PEMAT and patient advisory boards to ensure translations are clear, actionable, and culturally appropriate |
PHI protected health information, MAT machine-assisted translation, EHR electronic health record, LLM large language model, MQM multidimensional quality metrics, PEMAT patient education materials assessment tool.
Table 1.
Innovation domain: examines characteristics of an intervention that influence its adoption, implementation, and sustainability
| Areas that Impact the Implementation Gap | Barriers to Implementation | Potential Solutions |
|---|---|---|
| A. Patient privacy | Risk of exposing PHI to third-party vendors or unprotected APIs | Partner with model providers to obtain ZDR endpoints or run a private instance of an open-sourced model (e.g., LLaMA-3.3 8B-Instruct) |
| Risk of fine-tuning a model and having translation drafts include erroneous PHI from other cases | Scrub fine-tuning datasets of PHI using tools (e.g., Stanford de-identifier) | |
| B. LLM limitations | Need for awareness of LLM limitations like hallucinations, context loss, and bias that lead to inaccurate translations | Design comprehensive training modules for translation teams on LLM limitations and mitigation strategies |
| Reliance on automated outputs, reducing human accountability | Implement features prompting users to justify non-edits, similar to EHR clinical decision support | |
| Digitally underrepresented languages are more likely to perform poorer in LLM-based MAT | Promote equity and language inclusivity by thoroughly testing all languages used in the health system’s MAT workflow | |
| C. Operating costs | High financial burden of running open-source LLMs on-premises (e.g., GPU costs) | Run quantized versions (e.g., 8-bit or 4-bit) of LLMs to reduce GPU costs |
| Resource constrained hospitals in smaller or rural settings struggle to obtain the finances for GPUs or the overhead costs of maintaining a compute cluster | Multiple hospitals or a regional health network can partner to share costs to build and maintain a HIPAA-compliant cluster (data sent to this cluster should first be de-identified using highly accessible methods, such as the Stanford de-identifier). Costs can then be allocated proportionally based on usage, ensuring that heavy users contribute a correspondingly larger share of the operating expenses |
PHI protected health information, MAT machine-assisted translation, ZDR zero-data-retention, LLM large language model, EHR electronic health record, GPU graphics processing unit.
Table 5.
Outer setting domain: examines external influences on implementation, including policies, incentives, and interorganizational networks
| Areas that impact the implementation gap | Barriers to implementation | Potential solutions |
|---|---|---|
| K. Regulatory oversight | Need for regulatory oversight of MAT practices in healthcare | Regulatory bodies, like the Joint Commission, should extend their evaluations to examine whether MAT meets the same standards required for traditional translation methods |
| There is no standard benchmark for comparing LLM-generated versus human-generated translations | Expand existing Section 1557 compliance checks to include accuracy benchmarks for LLM outputs that include evaluation metrics described in the “Implementation Process Domain” | |
| L. Data | There is a lack of centralized, high-quality bilingual corpora for training and validation with emphasis on sufficient examples for digitally underrepresented languages | A consortium-based dataset could be built via state health agencies, academic medical centers, and smaller hospitals that pool resources to build a shared de-identified translation repository |
| M. National Policy on LLM use | Outdated national standards (e.g., CLAS) that do not address LLM-based translation needs | Update CLAS standards to include specific guidelines and mandates for LLM use in medical translation |
| Need for amendments to section 1557 of the Affordable Care Act and its implications on MAT | Revise Section 1557 to incorporate detailed requirements for LLM use, ensuring guidelines on safeguarding patient data privacy, acceptable MAT error rates and types, baseline qualifications an LLM must meet before it can be used for MAT, and clear roles and conditions when MAT can be used independently and when human review is mandatory |
MAT machine-assisted translation, LLM large language model, CLAS culturally and linguistically appropriate services.
CFIR for machine-assisted translation
I. Innovation domain
For machine-assisted translation to succeed, health systems must learn to manage LLM limitations. For example, LLMs tend to “hallucinate” (generate plausible but factually incorrect text)30, lose context (difficulty retrieving and using information located in the middle of long passages)31, and degrade human accountability (where users become reliant on the LLM and fail to evaluate its outputs)32. The implementation of LLM MAT must therefore include appropriate safeguards. Translators should be trained to recognize the common pitfalls (e.g., hallucinations, context loss, bias) of LLMs in MAT, and, to counteract human overreliance, translators could be intermittently prompted to justify their choice to leave a sentence unedited in a translation, similar to safety checks in electronic health records (EHRs) for potential medication-medication interactions33.
Patient privacy is another design quality that must be considered. Closed-source LLMs should never be used in ways that expose protected health information (PHI) via unprotected application programming interfaces. We recommend leveraging zero-data-retention (ZDR) endpoints34 (API interfaces that process each request in memory only and do not log or persist any input or output data) or private instances through industry partnerships. For example, Stanford Health Care has established a dedicated pathway to the Azure OpenAI Service under full institutional control, enabling high-throughput, privacy-compliant LLM queries35. Such a model supports an environment for LLM-based MAT while following data protection standards.
Cost considerations also play a significant role in deciding whether to adopt LLMs. Operating large open-source models at scale can be prohibitively expensive for resource-constrained health systems. For example, running a single NVIDIA A100 80GB GPU on a cloud platform (e.g., Google Cloud) may cost around $4.74 per hour (as of May 2025)36. Since payors don’t reimburse LLM use, these costs must be covered by an institution’s operating budget. Two factors, however, can offset that burden. First, as secure ZDR endpoints become more widely available, organizations can adopt a pay-as-you-go model that reduces both infrastructure demands and overall expenses. These options can help democratize access to powerful LLMs for health systems of varying sizes while maintaining patient privacy. Second, patients with NELP have consistently longer stays, higher readmission rates, and greater care costs when language-concordant services are lacking37,38. Investing in LLM-based MAT can reduce those downstream expenses and improve value-based care metrics that payors already incentivize, such as Centers for Medicare & Medicaid Services (CMS) readmission penalties.
II. Individuals domain
For translators, health systems should accommodate existing workflows. Though workflows vary, a common one involves a staff member submitting a request, then a translator is assigned the document and works in the EHR—either from scratch or using templates. In some organizations, a second translator reviews the draft before the original translator finalizes formatting and delivers the material to the patient. LLM-based MAT can be woven into this existing workflow by integrating into the EHR, enabling automatic draft generation as soon as a translation request is initiated. The translator can open the draft within the same interface they already use to make edits. Compatibility with CAT tools, allowing translators to run grammar checks or insert pre-translated templates from a memory database directly into the EHR, ensures they retain access to their preferred resources, and if a second review is standard, that reviewer can still validate the final text.
On the clinician side, translation accuracy depends on the quality of the English text. High-quality notes are well-structured, free from spelling errors, and written in plain language. To enhance accuracy, clinicians should prepare high-quality discharge summaries. Alternatively, MAT workflow may use an LLM’s zero-shot capabilities39 to first refine messy or jargon-heavy notes via a “preparation” prompt, then generate a translation with a second prompt. This two-prompt approach can be automatically applied to all incoming English text or toggled on/off by the translation team. By improving the quality of the source text, health systems can reduce the risk of inaccuracies.
Patients are also key stakeholders in MAT implementation. To secure their buy‑in, health systems must be transparent about their use of MAT and proactively seek patient input. Strategies may include brief post‑visit surveys, patient advisory boards, and focus groups dedicated to specific languages to elicit language‑specific guidance. Involving patients in this way is essential for both routine quality‑improvement loops and for the formal evaluation of LLM‑based MAT (see “Implementation Process Domain”). Additionally, surveying patients about their comfort with LLM-supported translations in various care settings (e.g., adult versus pediatric) can help create a more patient-centered approach to MAT deployment.
III. Inner setting domain
An important Inner Setting factor is the translator workforce, which in many health systems is too small to meet the increasing demand for timely translations. Translators often face excessive workloads, causing lengthy turnaround times and leaving some patients without materials in their preferred language. Integrating LLM-based MAT can offload the initial draft, shortening the translation cycle and easing the translator’s workload. This approach also allows organizations to reallocate translator time to more complex tasks. However, it is crucial to recognize that a poorly generated LLM translation may demand more effort to correct than a fully manual translation. A practical approach to mitigate this risk is to first identify which documents and language pairs MAT consistently handles well (e.g., Spanish routine discharge summaries) and where it struggles (e.g., Korean surgery informed consent forms). In the early stages, organizations might limit MAT to documents and languages where the model performs reliably while relying on human translations for more complex tasks. As data from these difficult cases accumulate, the model can be fine-tuned and iteratively improved, gradually expanding its applicability. Accuracy remains paramount: tracking turnaround times alongside quality metrics ensures faster translations still meet patient needs. By weaving these measures into routine workflows, translation teams can maintain both efficiency and quality. We detail fine-tuning and evaluation in the Implementation Process Domain section.
Many health systems rely on third-party translation services rather than (or in addition to) in-house translators. In-house teams may offer greater control over workflows, direct communication with clinicians, and familiarity with local patient populations, but they require sustained funding, staffing, and expertise across multiple languages. Third-party services can support smaller organizations who may lack specialized linguistic support, though they may introduce logistical hurdles such as limited oversight and slower turnaround times. Some of these challenges are more difficult to resolve, but others, like slower turnaround times, could be mitigated if third-party vendors adopt translation editing services. Rather than translating documents entirely de novo, an in-house LLM-based system could generate a first draft for the vendor to refine. This approach preserves the time-saving benefits of LLM-assisted translation while still leveraging the expertise and capacity of external resources.
Organizational culture is another key pillar of the Inner Setting. Health system leaders must emphasize that AI complements—rather than replaces—human translators, preserving a translator-in-the-loop model that aligns with Section 1557 of the Affordable Care Act28. This approach values the expertise of translators, many deeply rooted in local communities, and ensures patients with NELP receive the same high standard of care as English-speaking counterparts. By positioning human translators as indispensable for quality control, cultural adaptation, and patient engagement, health systems maintain equity while harnessing AI’s efficiency and scalability40.
IV. Implementation process domain
This section outlines practical steps for integrating LLM-supported MAT, drawing on the implementation of the Crisis Message Detector-1 (CMD-1), an AI-driven tool for triaging mental health messages41, as an example of embedding AI solutions into existing clinical workflows.
Aligning with existing workflows: Introduce MAT within platforms or software that translators and clinicians already use, thereby preserving established workflows. For instance, embedding an MAT draft feature into the EHR can streamline adoption. In the CMD-1 project, researchers integrated their AI model into Slack, an application providers already relied on, to minimize workflow disruptions.
Co-designing with end users: Actively involve translators and clinicians in MAT design—refining prompts, interfaces, and workflows—to address bottlenecks and foster trust. For example, if translators want an interactive interface with grading buttons, feedback fields, or error flags, co-designing these features into the EHR can improve implementation success. In CMD-1, for example, clinicians set the risk tolerance for false positives and false negatives.
Retrospective testing: A critical first step in implementation is to evaluate MAT on retrospective data before deployment. In CMD-1, the team first tested the model on historical data in an offline environment. This safe environment enabled the identification of errors without compromising patient safety. For MAT, retrospective testing should follow LLM changes (e.g., fine-tuning) or significant updates to translator workflows.
Prospective testing: After retrospective evaluation, health systems could start with small-scale deployments—limited to one frequently requested language, document type, or a few translators—and track usability, error rates, and impact over a defined period (e.g., two weeks). Once implementation is proven effective under these conditions, they could expand coverage to additional languages, documents, or translators. Key takeaways from these pilots identify areas needing improvement and highlight where MAT struggles most (e.g., certain languages or note types), guiding selective deployment and future fine-tuning. Prospective testing should follow significant model updates (e.g., post-fine-tuning) or major workflow changes.
Plan for deployment infrastructure: Even the best-performing model can fail without robust technical and operational support. Health systems must first select a secure data-storage solution, such as BigQuery42, Snowflake43, or Azure44, to store PHI. Just as CMD-1 relied on a dedicated platform to store and audit crisis-detection outputs, MAT requires a similarly comprehensive logging framework. All types of translations should be logged, whether generated entirely by a human translator or initially drafted through the LLM-supported workflow and then refined by a human. Each translation request should log a unique identifier, timestamp, LLM used, the model’s initial translation, the final version, note type, target language, translator ID, flags for harmful outputs, and metrics such as generation and editing times. Storing all of this in a single row for each translation record enables auditing, failure analysis, and model fine-tuning as part of ongoing clinical quality improvement.
Fine-tuning to meet deployment needs: Health systems can use the comprehensive logging framework to fine-tune LLM-based translation models in a supervised manner45. Specifically, final translator-approved versions (paired with the original English texts) provide gold-standard data for train/validation/test splits. The model is trained on the “train” split and monitored on the “validation” split using automatic metrics such as chrF++ or COMET, then evaluated on the “test” split to avoid overfitting. If a health system is using an open-source MAT model, it can reduce computational overhead by adopting Parameter Efficient Fine-Tuning libraries, such as Low-Rank Adaptation, which updates only a minimal subset of the model’s weights45. By analyzing prospective testing results, health systems can identify note types or languages where the model struggles and target those for fine-tuning. Over time, this iterative process broadens the LLM’s capabilities, improving performance across diverse document types and languages. It is also important to note that fine-tuning efforts may evolve from solely optimizing for translation accuracy to incorporating translator preferences. Once the model consistently drafts high-quality translations, further fine-tuning—using approaches like Direct Preference Optimization46—can tailor outputs to better align with the needs of individual translators or entire teams.
Measuring real-world impact: Success should be defined by practical outcomes, such as faster translation turnaround times, higher user satisfaction (both for translators and patients), and improved patient outcomes, rather than technical metrics alone. Tracking adoption rates and downstream effects (e.g., patient comprehension or readmission rates) offers a more comprehensive view of the model’s value. These operational metrics can be monitored in real time using data dashboards such as Looker Studio47 to ensure the system meets clinical and patient needs.
Evaluation is a critical component of implementation that can be approached through multiple methods.
Translation quality: The Multidimensional Quality Metrics (MQM) framework—which rates accuracy, fluency, terminology, style, and locale appropriateness and tags each error as critical, major, or minor—provides a robust gauge of quality48. Other healthcare MT studies have used the validated 5-point fluency-adequacy-meaning-severity rubric which scores meaning preservation and grammar while labeling errors by clinical severity49. However, these frameworks are time-intensive and are best applied periodically (e.g., monthly) on a representative sample of translations. One way to form this sample is by computing sentence embeddings of the source text and selecting a diverse subset based on similarity scores50. Ideally, manual evaluation should be performed by the translation team, leveraging their expertise for nuanced feedback that guides ongoing model improvements. For routine monitoring, we recommend a combination of automated metrics: chrF++ for character‑level similarity51 and COMET for semantic adequacy and fluency52. BLEU, which measures n-gram overlap between the LLM output and the final human-approved text53, could be included for historical comparability. However, its correlation with human judgements is inconsistent, so it should be interpreted with caution and supplemented by newer metrics that show stronger alignment with human ratings54.
Operational metrics: Translation turnaround time and the proportion of patients with NELP who receive language-concordant discharge instructions should be used to measure system efficiency, pinpoint areas for workflow refinement, and ensure equitable care delivery across diverse language groups.
Clinical outcomes: Health systems need to track process and clinical outcomes. They can focus on measures prioritized by the CMS, such as readmission and mortality rates for conditions like heart failure or Chronic Obstructive Pulmonary Disease1. These outcomes can be tracked by extracting data from clinical notes55,56 or ICD-10 codes57,58 for patients with NELP.
Patient understandability and actionability: The Patient Education Materials Assessment Tool (PEMAT), which assesses the clarity, ease of understanding, and actionability of health care materials, could be used to evaluate MAT59. Similar to MQM, it is time-consuming to administer and best applied periodically (e.g., monthly) on a representative subset of translations50. PEMAT evaluation should be carried out by patients or patient advisory boards. Another approach to evaluation involves having a clinician identify five key points from a discharge summary, then convert those into open-ended questions. A patient representative reads the translated text and answers these questions (Fig. 1). By grading the accuracy of responses, teams gauge how well LLM-supported translations convey essential clinical information.
Fig. 1. An illustrative patient-level evaluation workflow for LLM-supported MAT.

Here, a Spanish discharge summary is followed by five open-ended questions covering key clinical concepts that will be answered by the patient. Each question has a “gold-standard” answer generated from the source English note, allowing evaluators to compare the patient’s responses with the intended information. This approach measures both the understandability and actionability of the translation by assessing how well patients grasp critical details from the LLM-generated text.
V. Outer setting domain
As LLM adoption increases, organizations like the Joint Commission are needed to conduct independent quality assurance checks of MAT workflows. As part of its accreditation process, the Joint Commission assesses whether hospitals comply with Section 1557 of the Affordable Care Act, which mandates that patients with NELP receive appropriate language access services28. In doing so, the Joint Commission verifies that hospitals have processes in place to deliver patient materials that are accurate, timely, and linguistically appropriate. As health systems begin to integrate MAT, these evaluations can be extended to examine whether MAT meets the same standards required for traditional translation methods using evaluation methods described in the “Implementation Process Domain.”
Health systems and public health agencies should collaborate to create a shared clinical‑translation corpus, similar to open resources such as MIMIC‑IV, which provides de‑identified clinical notes for public research use60. Each participating institution (academic medical centers, community hospitals, rural and safety‑net facilities) would contribute de‑identified source notes across the full spectrum of document types (e.g., post‑operative instructions, discharge summaries, consent forms) paired with translator‑verified versions. The corpus would (i) maintain distinct training and held‑out test datasets and (ii) grow through a community‑driven process in which sites donate a small batch of new notes, de‑identified with methods such as “hiding in plain sight”61. Efforts should be made to collect translations for digitally underrepresented languages. Over time, this iterative, collaborative approach will enrich clinical translation data available for fine‑tuning and will maintain an evolving benchmark for evaluating LLM-based MAT.
New policy guidelines are needed to support MAT implementation, beginning with national standards for LLM use. The HHS Office of Minority Health developed the National Standards for Culturally and Linguistically Appropriate Services (CLAS) in Health and Health Care. These standards, last updated in 2013, provide a framework for culturally and linguistically appropriate care62 but require revision to ensure accurate, reliable, and timely MAT. HHS should also revisit Section 1557 of the Affordable Care Act to aid implementation. It's May 6, 2024 rule revision was the first to address machine translation28, but further guidance is needed on safeguarding patient data, acceptable error rates, and critical error reporting. Given the substantial variation in LLM performance across different languages, policy guidelines need to be language-specific, rather than a single blanket standard. For instance, health systems could rank languages based on an LLM’s internal representations63, examine the distribution of languages in the LLM’s training corpus (when made publicly available)64, or evaluate performance on an in-house testing dataset to determine the languages in which a given model meets MAT standards. The next rule revision must set baseline qualifications for LLMs used in MAT, ensuring that only suitable models are implemented.
Conclusion
Equitable access to linguistically concordant information remains a major unmet need for the millions of people in the United States who prefer a language other than English. Machine translation with LLMs offers promising solutions—reducing translator workload, shortening turnaround times, and extending translation services to resource-constrained settings. However, it also introduces new challenges, including context loss and inaccuracies in digitally underrepresented languages. For safe, equitable outcomes, health systems must continuously evaluate accuracy, address biases, and refine workflows. Moving forward, hybrid effectiveness-implementation studies across various clinical settings65 will be essential for determining whether LLM-based MAT truly reduces language-related disparities in practice.
Author contributions
I.L. and D.E.V. conceptualized the Perspective. I.L. and D.E.V. wrote the initial draft. I.L., D.E.V., J.H.C. and J.A.R. contributed to the first draft and provided critical revisions. All authors have read, reviewed, and approved of the final manuscript.
Data availability
No datasets were generated or analysed during the current study.
Competing interests
J.H.C. reported being a co-founder of Reaction Explorer LLC that develops and licenses organic chemistry education software, paid consulting fees from Sutton Pierce, Younker Hyde MacFarlane, and Sykes McAllister as a medical expert witness, and paid consulting fees from ISHI Health. J.A.R. receives funding from the National Institute on Minority Health and Health Disparities (NIMHD). The remaining authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Rawal, S. et al. Association between limited English proficiency and revisits and readmissions after hospitalization for patients with acute and chronic conditions in Toronto, Ontario, Canada. JAMA322, 1605–1607 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Lion, K. C., Lin, Y.-H. & Kim, T. Artificial intelligence for language translation: the equity is in the details. JAMA332, 1427–1428 (2024). [DOI] [PubMed] [Google Scholar]
- 3.Flores, G. The impact of medical interpreter services on the quality of health care: a systematic review. Med. Care Res. Rev.62, 255–299 (2005). [DOI] [PubMed] [Google Scholar]
- 4.Schulson, L. B. & Anderson, T. S. National estimates of professional interpreter use in the ambulatory setting. J. Gen. Intern. Med.37, 472–474 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Diamond, L. C., Schenker, Y., Curry, L., Bradley, E. H. & Fernandez, A. Getting by: underuse of interpreters by resident physicians. J. Gen. Intern. Med.24, 256–262 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Detz, A. et al. Language concordance, interpersonal care, and diabetes self-care in rural Latino patients. J. Gen. Intern. Med.29, 1650–1656 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Betancourt, J. R., Green, A. R., Carrillo, J. E. & Ananeh-Firempong, O. Defining cultural competence: a practical framework for addressing racial/ethnic disparities in health and health care. Public Health Rep.118, 293–302 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Molina, R. L. & Kasper, J. The power of language-concordant care: a call to action for medical schools. BMC Med. Educ.19, 378 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Harvey, S. M., Branch, M. R., Hudson, D. & Torres, A. Listening to immigrant Latino men in rural Oregon: exploring connections between culture and sexual and reproductive health services. Am. J. Mens. Health7, 142–154 (2013). [DOI] [PubMed] [Google Scholar]
- 10.Gavvala, S. Ensuring understanding: Language-concordant discharge instructions. Rice Univ. Baker Inst. Public Policy, Issue Brief. 10.25613/cayx-wc08 (2023).
- 11.Karpińska, P. Computer aided translation – possibilities, limitations and changes in the field of professional translation. J. Educ. Cult. Soc.8, 133–142 (2017). [Google Scholar]
- 12.Davis, S. H. et al. Translating discharge instructions for limited English-proficient families: strategies and barriers. Hosp. Pediatr.9, 779–787 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Choe, A. Y. et al. Improving discharge instructions for hospitalized children with limited english proficiency. Hosp. Pediatr.11, 1213–1222 (2021). [DOI] [PubMed] [Google Scholar]
- 14.Diamond, L. C., Wilson-Stronks, A. & Jacobs, E. A. Do hospitals measure up to the national culturally and linguistically appropriate services standards?. Med. Care48, 1080–1087 (2010). [DOI] [PubMed] [Google Scholar]
- 15.Rights (OCR), O. for C. Summary of Guidance to Federal Financial Assistance Recipients Regarding Title VI and the prohibition against national origin discrimination affecting limited English proficient persons. https://www.hhs.gov/civil-rights/for-providers/laws-regulations-guidance/guidance-federal-financial-assistance-title-vi/index.html (2007).
- 16.Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at 10.48550/arXiv.1609.08144 (2016).
- 17.Koehn, P. & Knowles, R. Six challenges for neural machine translation. In Proc. First Workshop on Neural Machine Translation (eds. Luong, T., Birch, A., Neubig, G. & Finch, A.) 28–39 (Association for Computational Linguistics, Vancouver, 2017). 10.18653/v1/W17-3204.
- 18.Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems 30 (Curran Associates, Inc., 2017).
- 19.Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature642, 442–450 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Brewster, R. C. L. et al. Performance of ChatGPT and Google Translate for Pediatric Discharge Instruction Translation. Pediatrics154, e2023065573 (2024). [DOI] [PubMed] [Google Scholar]
- 21.Ortega, J. E., Castro Mamani, R. & Cho, K. Neural machine translation with a polysynthetic low resource language. Mach. Transl.34, 325–346 (2020). [Google Scholar]
- 22.Adebara, I., Abdul-Mageed, M. & Silfverberg, M. Linguistically-Motivated Yorùbá-English Machine Translation. In Proc. of the 29th International Conference on Computational Linguistics (eds Calzolari, N. et al.) 5066–5075 (International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022).
- 23.Goh, E. et al. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat. Med. 1–6 10.1038/s41591-024-03456-y (2025). [DOI] [PMC free article] [PubMed]
- 24.Savage, T. et al. Fine tuning large language models for medicine: the role and importance of direct preference optimization. Preprint at 10.48550/arXiv.2409.12741 (2024).
- 25.Mirza, F. N. et al. Using ChatGPT to facilitate truly informed medical consent. NEJM AI1, AIcs2300145 (2024). [Google Scholar]
- 26.Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med.30, 1134–1142 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Zaretsky, J. et al. Generative artificial intelligence to transform inpatient discharge summaries to patient-friendly language and format. JAMA Netw. Open7, e240357 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Nondiscrimination in Health Programs and Activities. Federal Register https://www.federalregister.gov/documents/2024/05/06/2024-08711/nondiscrimination-in-health-programs-and-activities (2024). [PubMed]
- 29.Damschroder, L. J., Reardon, C. M., Widerquist, M. A. O. & Lowery, J. The updated consolidated framework for implementation research based on user feedback. Implement. Sci.17, 75 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Xu, Z., Jain, S. & Kankanhalli, M. Hallucination is inevitable: an innate limitation of large language models. Preprint at http://arxiv.org/abs/2401.11817 (2024).
- 31.Liu, N. F. et al. Lost in the middle: how language models use long contexts. Trans. Assoc. Comput. Linguist.12, 157–173 (2024). [Google Scholar]
- 32.Levy, A., Agrawal, M., Satyanarayan, A. & Sontag, D. Assessing the impact of automated suggestions on decision making: domain experts mediate model errors but take less initiative. In Proc. 2021 CHI Conference on Human Factors in Computing Systems 1–13 (Association for Computing Machinery, New York, NY, USA, 2021). 10.1145/3411764.3445522.
- 33.Kuperman, G. J. et al. Medication-related clinical decision support in computerized provider order entry systems: a review. J. Am. Med. Inform. Assoc.14, 29–40 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Data controls in the OpenAI platform - OpenAI API. https://platform.openai.com.
- 35.Ng, M. Y., Helzer, J., Pfeffer, M. A., Seto, T. & Hernandez-Boussard, T. Development of secure infrastructure for advancing generative AI research in healthcare at an academic medical center. Res. Sq. rs.3.rs-5095287 10.21203/rs.3.rs-5095287/v1 (2024). [DOI] [PMC free article] [PubMed]
- 36.Vedula, K. S. et al. Distilling large language models for efficient clinical information extraction. Preprint at 10.48550/arXiv.2501.00031 (2024).
- 37.Woods, A. P. et al. Limited English proficiency and clinical outcomes after hospital-based care in English-speaking countries: a systematic review. J. Gen. Intern. Med.37, 2050–2061 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Manuel, S. P., Nguyen, K., Karliner, L. S., Ward, D. T. & Fernandez, A. Association of English language proficiency with hospitalization cost, length of stay, disposition location, and readmission following total joint arthroplasty. JAMA Netw. Open5, e221842 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. In Proc. 36th International Conference on Neural Information Processing Systems 22199–22213 (Curran Associates Inc., Red Hook, NY, USA, 2022).
- 40.Bakken, S. AI in health: keeping the human in the loop. J. Am. Med. Inform. Assoc.30, 1225–1226 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Swaminathan, A. et al. Natural language processing system for rapid detection and intervention of mental health crisis chat messages. NPJ Digit. Med.6, 1–9 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.BigQuery enterprise data warehouse. Google Cloud https://cloud.google.com/bigquery.
- 43.The Snowflake AI Data Cloud - Mobilize Data, Apps, and AI. https://www.snowflake.com/content/snowflake-site/global/en.
- 44.Create Your Azure Free Account Or Pay As You Go | Microsoft Azure. https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account/search.
- 45.Zhang, X., Rajabi, N., Duh, K. & Koehn, P. Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QLoRA. In Proc. Eighth Conference on Machine Translation (eds. Koehn, P., Haddow, B., Kocmi, T. & Monz, C.) 468–481 (Association for Computational Linguistics, Singapore, 2023). 10.18653/v1/2023.wmt-1.43.
- 46.Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. In Proc. 37th International Conference on Neural Information Processing Systems 53728–53741 (Curran Associates Inc., Red Hook, NY, USA, 2023).
- 47.Looker Studio. Google for Developers https://developers.google.com/looker-studio.
- 48.Lommel, A. R., Burchardt, A. & Uszkoreit, H. Multidimensional quality metrics: a flexible system for assessing translation quality. In Proc. Translating and the Computer35 (Aslib, London, UK, 2013).
- 49.Chen, X., Acosta, S. & Barry, A. E. Evaluating the accuracy of Google translate for diabetes education material. JMIR Diabetes1, e3 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Lopez, I., Haredasht, F. N., Caoili, K., Chen, J. H. & Chaudhari, A. Embedding-driven diversity sampling to improve few-shot synthetic data generation. Preprint at 10.48550/arXiv.2501.11199 (2025).
- 51.Popović, M. chrF++: words helping character n-grams. In Proc. Second Conference on Machine Translation (eds. Bojar, O. et al.) 612–618 (Association for Computational Linguistics, Copenhagen, Denmark, 2017). 10.18653/v1/W17-4770.
- 52.Rei, R., Stewart, C., Farinha, A. C. & Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds. Webber, B., Cohn, T., He, Y. & Liu, Y.) 2685–2702 (Association for Computational Linguistics, Online, 2020). 10.18653/v1/2020.emnlp-main.213.
- 53.Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics (eds. Isabelle, P., Charniak, E. & Lin, D.) 311–318 (Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002). 10.3115/1073083.1073135.
- 54.Mathur, N., Baldwin, T. & Cohn, T. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds. Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 4984–4997 (Association for Computational Linguistics, Online, 2020). 10.18653/v1/2020.acl-main.448.
- 55.Lopez, I. et al. Clinical entity augmented retrieval for clinical information extraction. NPJ Digit. Med.8, 1–11 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Swaminathan, A. et al. Selective prediction for extracting unstructured clinical data. J. Am. Med. Inform. Assoc.31, 188–197 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Bates, B. A. et al. Validity of International Classification of Diseases (ICD)-10 diagnosis codes for identification of acute heart failure hospitalization and heart failure with reduced versus preserved ejection fraction in a national medicare sample. Circ. Cardiovasc. Qual. Outcomes16, e009078 (2023). [DOI] [PubMed] [Google Scholar]
- 58.Gothe, H. et al. Algorithms to identify COPD in health systems with and without access to ICD coding: a systematic review. BMC Health Serv. Res.19, 737 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Shoemaker, S. J., Wolf, M. S. & Brach, C. Development of the Patient Education Materials Assessment Tool (PEMAT): a new measure of understandability and actionability for print and audiovisual patient information. Patient Educ. Couns.96, 395–403 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data10, 1 (2023). [DOI] [PMC free article] [PubMed]
- 61.Carrell, D. et al. Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text. J. Am. Med. Inform. Assoc.20, 342–348 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.National Standards for Culturally and Linguistically Appropriate Services (CLAS) in Health and Health Care. Federal Register https://www.federalregister.gov/documents/2013/09/24/2013-23164/national-standards-for-culturally-and-linguistically-appropriate-services-clas-in-health-and-health (2013). [PubMed]
- 63.Li, Z. et al. Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages. In Special Track on AI Alignment 28186–28194 (Association for the Advancement of Artificial Intelligence, 2025). 10.1609/aaai.v39i27.35038.
- 64.Xie, Y. et al. Weakly supervised scene text generation for low-resource languages. Expert Syst. Appl.237, 121622 (2024). [Google Scholar]
- 65.Khoong, E. C. & Rodriguez, J. A. A research agenda for using machine translation in clinical medicine. J. Gen. Intern. Med.37, 1275–1277 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
No datasets were generated or analysed during the current study.
