JAMA Netw Open. 2025 Sep 17;8(9):e2532312. doi: 10.1001/jamanetworkopen.2025.32312

Accuracy of Artificial Intelligence vs Professionally Translated Discharge Instructions

Melissa Martos 1, Blanca Fields 2, Samuel G Finlayson 1, Nigel Hartell 2, Theresa Kim 1, Emily Larimer 1, Jason J Lau 1, Yu-Hsiang Lin 1, Taylor Salaguinto 3, Nguyen Tran 3, K Casey Lion 3
PMCID: PMC12444566  PMID: 40960827

Key Points

Question

How does the accuracy of artificial intelligence (AI)–based translation compare with professional translation of discharge instructions under routine practice conditions?

Findings

In this comparative effectiveness analysis of AI vs professional translation of 148 sections of 34 issued discharge instructions, AI translations were noninferior in some domains for Spanish instructions but consistently inferior in Chinese, Vietnamese, and Somali.

Meaning

These findings suggest that AI translation may have similar performance to professional translations for Spanish discharge instructions but requires further development and validation prior to implementation in other languages.


This comparative effectiveness analysis describes the process of implementation for artificial intelligence–based translation and investigates its accuracy for translating discharge instructions into Simplified Chinese, Somali, Spanish, and Vietnamese.

Abstract

Importance

Patients using languages other than English are a group at risk of poor health outcomes and encounter barriers to access of translated written materials. Although artificial intelligence (AI) may offer an opportunity to improve access, few studies have evaluated the accuracy and safety of AI translation for clinical care under routine practice conditions.

Objective

To investigate the accuracy of AI translation compared with professional human translation of patient-specific issued pediatric inpatient discharge instructions.

Design, Setting, and Participants

This comparative effectiveness analysis compared translations by a neural machine translation model vs professional translators using patient-specific pediatric inpatient discharge instructions received by families between May 18, 2023, and May 18, 2024, at a single academic pediatric hospital. Instructions were translated into Simplified Chinese, Somali, Spanish, and Vietnamese by professional translators and the Azure AI system and then broken into scoring sections. Two professional translators per language evaluated translations (blinded to source) on an established 5-point scale for fluency, adequacy, meaning, and error severity, with 1 indicating worst performance and 5 indicating best performance.

Exposure

AI vs professional translation.

Main Outcome and Measure

Quality of discharge instruction translation, including fluency, adequacy, meaning, and severity of errors.

Results

A total of 148 sections from 34 discharge instructions were analyzed. When considering all 4 languages together, average fluency, adequacy, and meaning were lower among AI compared with professional human translations. Among all tested languages, mean (SD) fluency for AI translations was 2.98 (1.12) compared with 3.90 (0.96) for professional translations (difference, 0.92; 95% CI, 0.83-1.01; P < .001), adequacy was 3.81 (1.14) compared with 4.56 (0.70) (difference, 0.74; 95% CI, 0.65-0.83; P < .001), meaning was 3.38 (1.15) compared with 4.28 (0.84) (difference, 0.90; 95% CI, 0.80-0.99; P < .001), and error severity was 3.53 (1.28) compared with 4.48 (0.88) (difference, 0.95; 95% CI, 0.85-1.06; P < .001). Compared with professional translations, the Spanish AI translations were noninferior in adequacy (difference, 0.08; 95% CI, −0.02 to 0.19) and error severity (difference, 0.03; 95% CI, −0.09 to 0.14) but inferior in fluency (difference, 0.38; 95% CI, 0.23-0.53) and just crossed the inferiority threshold in meaning (difference, 0.08; 95% CI, −0.04 to 0.20). The Chinese, Vietnamese, and Somali AI translations were inferior to the professional translations across all metrics, with the greatest differences for Somali.

Conclusions and Relevance

In this comparative effectiveness analysis of AI- vs professionally translated issued discharge instructions, AI-translated instructions performed similarly for Spanish but worse for other languages tested. Validation and clinical implementation of AI-based translation will require special attention to languages of lesser diffusion to prevent creating new inequities.

Introduction

Approximately 25 million individuals are estimated by the US census to speak English less than “very well,” which is associated with numerous adverse outcomes, including worse satisfaction, comprehension, treatment adherence, and clinical outcomes, higher costs, and more frequent serious safety events.1,2,3,4,5,6,7,8,9,10,11 Language barriers affect how care is delivered through both spoken and written communication. Written communication is essential because comprehension of medical information is notoriously low,12,13,14,15,16,17,18,19,20,21,22,23 particularly for those who use a language other than English for care and may receive suboptimal verbal communication.24,25,26,27 Unfortunately, timely access to professional translation of written materials is lacking due to logistical challenges, cost, and turnaround time. Current systems might deliver translations within 90 to 120 minutes in exceptional circumstances but more commonly require multiple days, limiting access to time-sensitive information.28

Modern artificial intelligence (AI)–based models may provide a means of fast, safe, and understandable translation of written materials. However, errors in care instructions can have serious (and potentially dangerous) consequences for patients. Few studies have evaluated the accuracy or safety of AI-based translation of actual issued clinical discharge instructions. Standardized instructions that meet accepted standards for readability, understandability, and completeness differ greatly from actual issued instructions, which frequently fall short.29,30,31 Actual clinical discharge instructions are thus likely less accessible to patients of varying health literacy and more difficult to translate.32,33 Given that the performance of a translation tool will vary based on the characteristics of the source text, it is critical to test these tools under the conditions in which they would be used. This study described the process of implementation for AI-based translation and investigated its accuracy in translating clinically issued discharge instructions into our top 4 institutional languages.

Methods

AI Product Selection and Enrichment

To select a suitable model, we compared 2 forms of AI technologies, a neural machine translation model (Azure AI Translator, Microsoft Corp) and a large language model (Azure Open AI, Microsoft Corp). Neural machine translation models are designed specifically for translating between languages, whereas large language models are designed to create and interpret language more broadly. Our language services team favored the performance of the neural machine translation model during preliminary testing of a set of 2 discharge instructions per language in our top 4 institutional languages other than English (Simplified Chinese, Somali, Spanish, and Vietnamese). Our informatics leaders (who would be involved in future AI translation implementation) also strongly preferred a neural machine translation model due to concern for potential hallucinations with large language models, preferring to err on the side of caution where patient-facing materials were concerned.34,35,36,37 The AI interface was built within the Seattle Children’s Hospital tenant. It subsequently underwent enrichment, a standard practice for AI implementation.38 Enrichment is critical for ensuring contextual adequacy by providing exposure to specialty-specific terms and languages of lesser diffusion, which may be underrepresented in typical model training datasets.39

To enrich the model, we provided it with professionally translated educational documents previously created and approved by the Department of Patient Education and Department of Interpreter and Translation Services at the Seattle Children’s Hospital; the number supplied reflected how many were available per language (1917 Spanish, 433 Vietnamese, and 216 Simplified Chinese). We were unable to provide additional training to the Somali model because the vendor had not enabled that option. An initial set of 2 discharge instructions per language was translated by both enriched and unenriched models for Simplified Chinese, Spanish, and Vietnamese. Certified bilingual staff compared the blinded output in detailed written evaluations, determining that, among the enriched languages, only the Vietnamese model performed better after enrichment. Primary differences related to the handling of formatting and the register of language used, with higher complexity language noted in the enriched model. Thus, the enriched model was used for Vietnamese, whereas base models were used for Simplified Chinese, Somali, and Spanish.

Discharge Instruction Selection

Patient instructions with a minimum of 75 words issued at Seattle Children’s Hospital between May 18, 2023, and May 18, 2024, were selected. We included only narrative portions of instructions written as free text and/or using 1 or more templated phrases. We did not include prepopulated sections, such as follow-up visit timing or hospital contact information. We used purposive sampling to select a range of diagnoses and specialties and avoid duplicative sampling of the same templated instruction. To achieve this, we began with a consecutive sample, skipping duplicate diagnoses, until we reached our target goal. This study was approved by the Seattle Children’s institutional review board. Informed consent was not required by the institutional review board due to the anonymous nature of the data included. This study follows the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) reporting guideline for comparative effectiveness research, including a priori study design, an explanation of misclassification mitigation, and discussion of internal and external validity of findings.40

Discharge Instruction Translation and Grading

Discharge instructions were classified by medical specialty and degree of personalization, including entirely personalized, entirely templated, or a combination. Each discharge instruction was deidentified and translated into Simplified Chinese, Somali, Spanish, and Vietnamese by professional translators and the AI tool. For scoring, each instruction was broken into sections of approximately 30 to 80 words at contextually logical break points. We sampled discharge instructions until reaching our goal of 148 sections per power calculations, resulting in 34 total instructions. A survey for scoring was built using REDCap (Research Electronic Data Capture),41,42 which displayed complete English instructions for context, followed by each scoring section displaying both the English text and its translation, blinded to translation method.

Two translators per language (distinct from those who translated the documents) completed a training block of 6 instruction sets and compared their results, with discussion and resolution of differences to ensure consistency. Afterward, each translator independently scored approximately half of the translated sections. To control for individual variation in grading and minimize the potential impact of misclassification, the AI and professional versions of the same instruction were always scored by the same translator (not in immediate succession).

Scoring was performed for the following 4 domains, which are established metrics for evaluating translation accuracy: fluency, adequacy, meaning, and severity of errors.43,44 Fluency referred to clarity and flow of writing, including appropriate vocabulary and grammar. Adequacy referred to accuracy, completeness, and appropriateness for the context. Meaning referred to maintenance of the substance of the original text, including connotation and semantics. Severity of errors referred to the expected impact errors may have on clinical care, with 1 indicating worst performance and 5 indicating best performance. Error severity was classified as clinically impactful if ranked 1 to 3 (see Table 1 for examples). Errors marked by translators as “unsure of clinical impact” were rescored by 2 pediatricians (M.M., E.L., J.J.L., and/or K.C.L.), with disagreements resolved by discussion to reach consensus. For a detailed scoring guide, see the eMethods in Supplement 1.
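The severity banding described above can be sketched in a few lines. This is a minimal illustration, not the study's scoring software: the function names and the example scores are hypothetical, and only the 1-to-3 cutoff for "clinically impactful" comes from the text.

```python
from typing import Iterable

# Cutoff taken from the text: severity scores of 1 to 3 count as clinically impactful.
IMPACTFUL_MAX_SCORE = 3


def is_clinically_impactful(severity_score: int) -> bool:
    """Return True when a 1-5 error severity score falls in the 1-3 band."""
    if not 1 <= severity_score <= 5:
        raise ValueError("severity score must be between 1 and 5")
    return severity_score <= IMPACTFUL_MAX_SCORE


def impactful_share(scores: Iterable[int]) -> float:
    """Fraction of scored sections whose errors are clinically impactful."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    return sum(is_clinically_impactful(s) for s in scores) / len(scores)


# Hypothetical scores for five sections, for illustration only (not study data).
example_scores = [5, 4, 3, 1, 5]
print(impactful_share(example_scores))  # 2 of the 5 sections score 3 or lower
```

Scores flagged by translators as "unsure of clinical impact" would need the pediatrician rescoring step before being passed to a function like this.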

Table 1. Example Scoring for Severity of Errors^a

| Score | Source text | Error |
| --- | --- | --- |
| Clinically impactful | | |
| 1 (Harms patient) | “Yellow Zone: I do not feel good. I need to take rescue medications and my controller medicines listed below to keep from getting worse.” | “Yellow Zone: I do not feel good. [omitted].” |
| 1 (Harms patient) | “Always check to make sure the [nasogastric] tube is in the right place before each feeding.” | “Always check to make sure the test tube is in the right place before each feeding.” |
| 2 (Impairs care) | “What can I expect after surgery? Changes in sensation.” | “What can I expect after surgery? Changes in poetry.” |
| 3 (Delays care) | “Call the cardiologist on-call for any of the following: limping on the leg(s) of catheter site(s)” | “Call the cardiologist on-call for any of the following: [unintelligible] on the leg(s) of catheter site(s)” |
| Not clinically impactful | | |
| 4 (Unclear effect on patient care) | “Call your Primary Healthcare Provider for any of the following: lethargy” | “Call your Primary Healthcare Provider for any of the following: excessive sleeping” |
| 5 (No effect on patient care) | “Call the Orthopedic Resident on-call at 206-987-2000 (Evenings/Nights/Weekends)” | “Call the Orthopedic Resident [omitted] at 206-987-2000 (Evenings/Nights/Weekends)” |

^a Discharge instructions were scored by professional translators blinded to source in 4 languages. Scores were assigned on a scale of 1 to 5, with 1 indicating most harmful (harms patient) and 4 to 5 indicating not clinically impactful.

Statistical Analysis

We conducted a comparative effectiveness study comparing the mean scores for AI translation with the mean scores for professional translation in each of the 4 domains. Means were compared within each language using paired t tests. Our noninferiority limit was set to a difference of less than 0.2 on the 5-point Likert scale (a 4% absolute difference), determined in collaboration with our language services colleagues as a clinically meaningful threshold. The CIs for the difference in means that included this margin of 0.2 or greater were interpreted as evidence of inferiority. Proportions of error severity scores deemed clinically impactful within each language were also compared using the paired Wilcoxon signed-rank test. We also compared the proportion of overall instructions with clinically impactful errors within each language by the McNemar test. P < .05 was considered statistically significant. Assuming an SD of 0.7 as suggested by prior work,45 α = .05, and a power of 0.80, the study was adequately powered with 148 translated sections to detect a difference of 0.2 or more.
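The power statement above can be checked approximately with a standard normal-approximation formula for a two-sided paired comparison, n = ((z_{1-α/2} + z_{power}) × SD / difference)². This is a sketch under stated assumptions: the article does not say which formula or software was used, the function name is hypothetical, and the SD of 0.7 is treated here as the SD of the paired differences.

```python
import math
from statistics import NormalDist


def sections_needed(sd: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-sided paired comparison of means.

    Assumes `sd` is the standard deviation of the paired differences.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    return math.ceil(((z_alpha + z_beta) * sd / delta) ** 2)


# Inputs from the Statistical Analysis section: SD 0.7, difference 0.2.
n = sections_needed(sd=0.7, delta=0.2)
print(n, n <= 148)  # prints: 97 True
```

Under these assumptions, roughly 97 paired sections would suffice, which is consistent with the statement that 148 sections were adequate. A t-distribution correction would raise the figure slightly.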

Results

A total of 34 discharge instructions were identified and broken into 148 scoring sections. Complete discharge instruction length ranged from 76 to 1177 words, with a median of 243 words. Pediatric medical specialties were represented in 22 (65%) and surgical in 12 (35%); among medical specialties, 15 (68%) were from general pediatrics, whereas the remainder were from subspecialty services (Table 2).

Table 2. Characteristics of the Issued Discharge Instructions Taken From Seattle Children’s Hospital Between 2023 and 2024.

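| Characteristic | No. (%) of discharge instructions (N = 34) |
| --- | --- |
| Field | |
| Medical | 22 (65) |
| Medical: Cardiology | 1 (3) |
| Medical: General pediatrics | 15 (44) |
| Medical: Hematology-oncology | 1 (3) |
| Medical: Neonatal intensive care | 2 (6) |
| Medical: Rehabilitation | 2 (6) |
| Medical: Rheumatology | 1 (3) |
| Surgical | 12 (35) |
| Surgical: Cardiac surgery | 1 (3) |
| Surgical: General surgery | 3 (9) |
| Surgical: Oral and maxillofacial surgery | 2 (6) |
| Surgical: Orthopedics | 3 (9) |
| Surgical: Plastic and craniofacial surgery | 1 (3) |
| Surgical: Transplant surgery | 1 (3) |
| Surgical: Urology | 1 (3) |
| Type of instructions | |
| Free texted | 6 (18) |
| Combination of free texted and templated | 1 (3) |
| Templated | 27 (79) |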

When considering all 4 languages together, mean fluency, adequacy, and meaning were lower among AI compared with professional human translations. Among all tested languages, mean (SD) fluency for AI translations was 2.98 (1.12) compared with 3.90 (0.96) for professional translations (difference, 0.92; 95% CI, 0.83-1.01; P < .001), adequacy was 3.81 (1.14) compared with 4.56 (0.70) (difference, 0.74; 95% CI, 0.65-0.83; P < .001), meaning was 3.38 (1.15) compared with 4.28 (0.84) (difference, 0.90; 95% CI, 0.80-0.99; P < .001), and error severity was 3.53 (1.28) compared with 4.48 (0.88) (difference, 0.95; 95% CI, 0.85-1.06; P < .001).

AI performance varied based on language tested (Figure 1 and Table 3). In Spanish, AI translations were noninferior in adequacy and error severity, but fluency (3.82 vs 4.20; difference, 0.38; 95% CI, 0.23-0.53; P < .001) was significantly worse for AI translations compared with professional translations. Meaning was not significantly different between Spanish AI and professional translations, but AI translation failed to meet the noninferiority threshold with a CI at the limit of 0.20 (4.46 vs 4.54 for AI vs professional; difference, 0.08; 95% CI, −0.04 to 0.20). Meanwhile, AI translations were inferior in every category for Simplified Chinese, Somali, and Vietnamese (Table 3). In Simplified Chinese, AI translations performed consistently worse than professional translations, including for fluency (difference, 0.89; 95% CI, 0.72-1.06), adequacy (difference, 0.49, 95% CI, 0.34-0.64), meaning (difference, 0.95, 95% CI, 0.77-1.14), and error severity (difference, 0.91, 95% CI, 0.69-1.12), all significant (P < .001 for all); similar differences were found between AI and professional translations in Vietnamese (fluency: difference, 0.70; 95% CI, 0.55-0.84; adequacy: difference, 0.54; 95% CI, 0.40-0.68; meaning: difference, 0.59; 95% CI, 0.42-0.76; error severity: difference, 0.83; 95% CI, 0.66-1.00; P < .001 for all). AI translations performed worst in Somali, where there were significant and large differences in every category, including fluency (difference, 1.71; 95% CI, 1.53-1.89), adequacy (difference, 1.86; 95% CI, 1.68-2.03), meaning (difference, 1.97; 95% CI, 1.80-2.13), and error severity (difference, 2.05; 95% CI, 1.86-2.23; P < .001 for all).

Figure 1. Comparison of Mean Fluency, Adequacy, Meaning, and Error Severity Scores for Professionally Translated vs Artificial Intelligence (AI)–Translated Discharge Instructions.

Discharge instructions (n = 148 for each category) were scored by professional translators blinded to source in 4 languages. Scores were assigned on a scale of 1 to 5, with 1 indicating worst performance and 5 indicating best performance. P values are based on paired t tests testing for differences between professional and AI translation of issued discharge instructions (rather than noninferiority). Significant differences were found between professional and AI translations in all categories for all languages other than Spanish, where the only significant difference was found in fluency.

Table 3. Mean Fluency, Adequacy, Meaning, and Error Severity Scores for Professionally Translated vs AI-Translated Discharge Instructions.

Mean score (95% CI)

| Variable | Fluency | Adequacy | Meaning | Error severity |
| --- | --- | --- | --- | --- |
| Simplified Chinese | | | | |
| Professional | 3.91 (3.76 to 4.06) | 4.72 (4.61 to 4.82) | 4.26 (4.12 to 4.41) | 4.35 (4.19 to 4.51) |
| AI | 3.02 (2.84 to 3.20) | 4.23 (4.06 to 4.40) | 3.31 (3.13 to 3.49) | 3.45 (3.25 to 3.64) |
| Difference | 0.89 (0.72 to 1.06) | 0.49 (0.34 to 0.64) | 0.95 (0.77 to 1.14) | 0.91 (0.69 to 1.12) |
| P value^a | <.001 | <.001 | <.001 | <.001 |
| Somali | | | | |
| Professional | 3.79 (3.65 to 3.94) | 4.45 (4.33 to 4.58) | 4.28 (4.14 to 4.41) | 4.47 (4.34 to 4.60) |
| AI | 2.08 (1.98 to 2.18) | 2.59 (2.45 to 2.74) | 2.31 (2.19 to 2.43) | 2.43 (2.29 to 2.57) |
| Difference | 1.71 (1.53 to 1.89) | 1.86 (1.68 to 2.03) | 1.97 (1.80 to 2.13) | 2.05 (1.86 to 2.23) |
| P value^a | <.001 | <.001 | <.001 | <.001 |
| Spanish | | | | |
| Professional | 4.20 (4.05 to 4.35) | 4.70 (4.59 to 4.80) | 4.54 (4.43 to 4.65) | 4.69 (4.57 to 4.81) |
| AI | 3.82 (3.68 to 3.97) | 4.61 (4.52 to 4.71) | 4.46 (4.35 to 4.57) | 4.66 (4.56 to 4.77) |
| Difference | 0.38 (0.23 to 0.53) | 0.08 (−0.02 to 0.19) | 0.08 (−0.04 to 0.20) | 0.03 (−0.09 to 0.14) |
| P value^a | <.001 | .06 | .09 | .32 |
| Vietnamese | | | | |
| Professional | 3.70 (3.54 to 3.87) | 4.36 (4.24 to 4.47) | 4.03 (3.90 to 4.17) | 4.42 (4.26 to 4.57) |
| AI | 3.01 (2.84 to 3.18) | 3.82 (3.68 to 3.95) | 3.45 (3.31 to 3.59) | 3.59 (3.39 to 3.78) |
| Difference | 0.70 (0.55 to 0.84) | 0.54 (0.40 to 0.68) | 0.59 (0.42 to 0.76) | 0.83 (0.66 to 1.00) |
| P value^a | <.001 | <.001 | <.001 | <.001 |

Abbreviation: AI, artificial intelligence.

^a P values are based on paired t tests comparing professional and AI translation of issued discharge instructions.
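The noninferiority rule from the Statistical Analysis section (a margin of 0.2, with any CI that reaches 0.2 or greater read as failing noninferiority) can be applied to the Spanish rows of Table 3 as a worked example. This is a minimal sketch: the function and variable names are hypothetical, and the differences are assumed to be professional minus AI, as the table values suggest.

```python
NONINFERIORITY_MARGIN = 0.2  # margin stated in the Statistical Analysis section


def classify(ci_upper: float, margin: float = NONINFERIORITY_MARGIN) -> str:
    """Label a domain from the upper 95% CI bound of (professional - AI).

    A bound that reaches the margin counts as failing noninferiority.
    """
    return "noninferior" if ci_upper < margin else "not noninferior"


# Upper 95% CI bounds for the Spanish differences, copied from Table 3.
spanish_ci_upper = {
    "fluency": 0.53,
    "adequacy": 0.19,
    "meaning": 0.20,
    "error severity": 0.14,
}

for domain, upper in spanish_ci_upper.items():
    print(f"{domain}: {classify(upper)}")
```

With these inputs, adequacy and error severity are labeled noninferior, whereas fluency and meaning are not. Meaning fails only because its upper bound sits exactly at the 0.20 limit, matching the description in the Results.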

The numbers of clinically impactful errors (Table 1) were also significantly greater in AI than in professional translations in every language other than Spanish (Figure 2). In Spanish, only 10 sections (7%) in both AI and professional translations (z = 0.84, P = .40) had errors that were clinically impactful, whereas in Somali, 136 (92%) (AI) vs 19 (13%) (professional) (z = 10.43, P < .001) were clinically impactful. Differences were also significant in Simplified Chinese (77 [52%] vs 29 [20%], z = 7.17, P < .001) and Vietnamese (60 [41%] vs 20 [14%], z = 7.96, P < .001). When comparing errors at the level of the overall instruction set, AI-translated instructions were significantly more likely to have clinically impactful errors compared with human translations in Chinese (30 [88.2%] vs 20 [58.8%], P = .004), Somali (33 [97.1%] vs 11 [32.4%], P < .001), and Vietnamese (28 [82.4%] vs 13 [38.2%], P < .001); there was no difference for Spanish (8 [23.5%] vs 4 [11.8%], P = .22) (eTable in Supplement 1). Notably, this analysis does not account for differences in length of each instruction set.
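For the instruction-level comparison, a McNemar test depends only on the discordant pairs, that is, instructions in which one version contained a clinically impactful error and the other did not. The sketch below implements the exact (binomial) form of the test. Whether the study used the exact or the chi-square form is not stated, the function name is hypothetical, and the discordant counts shown are invented for illustration because the article reports only marginal totals.

```python
from math import comb


def mcnemar_exact_p(ai_only: int, professional_only: int) -> float:
    """Two-sided exact McNemar P value from the two discordant-pair counts."""
    n = ai_only + professional_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(ai_only, professional_only)
    # Binomial(n, 0.5) probability of a split at least as lopsided as observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only (not reported in the study):
# impactful error in only the AI version vs only the professional version.
p = mcnemar_exact_p(ai_only=9, professional_only=2)
print(round(p, 3))  # prints: 0.065
```

Because concordant pairs do not enter the calculation, two languages with the same marginal percentages can yield different P values depending on how the errors overlap across versions.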

Figure 2. Frequency of Clinically Impactful Errors by Language and Translation Type.

Discharge instructions were scored by professional translators blinded to source in 4 languages (n = 148 per category). Scores were assigned on a scale of 1 to 5, with 1 indicating most harmful (harms patient) and 4 to 5 indicating not clinically impactful. P values are based on paired Wilcoxon signed-rank testing for significant differences between professional and artificial intelligence (AI) translation of issued discharge instructions (rather than noninferiority). No significant differences were found in proportions of errors in Spanish, but all other groups showed significantly worse performance of the AI.

Discussion

In this study comparing AI-based vs professional human translations in 4 languages, our findings reveal that although the AI model produced translated discharge instructions that were largely comparable to professionally translated ones in Spanish, the AI translations were inferior across the board in Simplified Chinese, Somali, and Vietnamese. In Spanish, AI translations were noninferior in adequacy and error severity but inferior in meaning; AI translations were inferior in fluency for all languages tested. Our findings also revealed a high proportion of clinically impactful errors for translations in languages other than Spanish, highlighting the need for caution when considering AI implementation in medical settings in its current form. These results align with prior studies,45,46,47 which demonstrate fair accuracy for prevalent languages but notable underperformance for those of lesser diffusion.

The body of literature on AI-based translation for medical care is limited but rapidly expanding. Prior studies43,45 have highlighted the potential utility of AI models for translating discharge instructions. Other work48,49 has investigated models applied to other translation needs, such as medication instructions and educational information. Notably, most studies have evaluated translations of standardized, curated content,45,50 which may not fully reflect the complexities of documents in clinical practice. Our study makes an important contribution to the literature by examining AI performance on actual issued clinical discharge instructions, using a robust, fully powered sample to evaluate performance across multiple languages. Furthermore, although other studies43,45 have used volunteer native speakers or bilingual physicians for evaluation, we engaged professional translators, whose linguistic expertise may offer particularly discerning assessment. Consistent with previous findings, our study underscores both the promise of these models and the ongoing need for improvement.

Medical AI translation technology is evolving rapidly. Section 1557 of the Patient Protection and Affordable Care Act prohibits automated translation of medical documents without human review “when accuracy is essential,” “when the source documents or materials contain complex, nonliteral or technical language,” or “when the underlying text is critical to the rights, benefits, or meaningful access” of individuals using a language other than English for care.51,52 However, as AI models improve, less time and effort may be needed for human review and editing. The need for rapid-turnaround written document translation continues to increase as more platforms, such as patient portals, become central to care delivery and traditional means of communication become less supported. AI translation may offer an opportunity to address these key areas of need, but differences in context, language conventions, and formatting in each scenario require tailored implementation.

Of particular importance is ensuring that the technology evolves to support languages of lesser diffusion and those linguistically different from English. Our model’s performance in Spanish was similar to that of professional translators, but that was not the case for the other languages tested; others have found similar patterns.45,46,47 Worse performance of machine translation is a known phenomenon when working between languages that are syntactically different, such as English and Chinese.53,54 Furthermore, the limited availability of quality translated training documents for languages of lesser diffusion presents a challenge for quality model output in these languages.39,55 Prevalent languages in the US, such as Spanish, have more infrastructure to support them, including the widespread availability of interpreters and translated education materials, websites, and prescriptions.56,57,58,59 The addition of AI supporting language access in Spanish is important, but gaps in access are most profound among languages of lesser diffusion, which may present the greatest potential for important breakthroughs in health care equity. There is great promise for technology to improve the care for patients using a language other than English, but as we implement new technology we must make concerted efforts to include patients who use languages less prevalent locally. Rather than settling for lower-quality translations in these languages or excluding them altogether, we must prioritize enhancing AI performance in these other languages to ensure equitable language access.

How can we improve AI performance to serve patients and families using all languages? First, we can better train AI models to navigate specific uses in diverse languages on both the development and user (eg, hospital system) sides. Further development of translated written materials and methods to overcome inadequate training materials for languages of lesser diffusion is essential.55,60,61,62 Our efforts to enrich the translation models using local materials provided useful insights, most notably that enrichment did not always improve model performance, highlighting the need for evaluation prior to implementation. Nonetheless, it is essential that vendors offer the ability to locally enrich models in every language for equitable access to AI resources. Limited professionally translated local enrichment materials also form a barrier for less prevalent languages.63 Enrichment materials should be thoughtfully selected to ensure that they are clearly written for patients and families so that output translations will be understandable. Next, once an AI model is implemented, rigorous monitoring by an overseeing body must ensure that output quality does not decrease with time, a phenomenon known as dataset shift that occurs due to differences between actual and training data.64 Lastly, we can improve partnerships across institutions and vendors to avoid “reinventing the wheel” at each institution. Because AI models are generally proprietary technology, our ability to collaborate is often limited. We can support improvement across the board by openly sharing professionally translated enrichment documents and training sets. Such sharing will require partnerships among institutions and navigation of content ownership that ensures that translators who have contributed their intellectual property are appropriately compensated for their efforts. 
Through greater collaboration, we can take advantage of market diversity to improve technology and better serve patients and families using all languages.

As AI translation technology advances, its potential to improve patient care increases; however, careful evaluation is essential to avoid unintended consequences that could deepen inequities. Innovations must be introduced thoughtfully, focusing on understanding the experiences and needs of the families they aim to support. While pursuing progress, we must avoid making assumptions about how families prioritize translation quality versus speed. Achieving equitable and effective outcomes will require patient and family input at every stage of AI implementation.

Limitations

Our study had limitations. As with any study investigating AI implementation in medicine, the validity of the results applies to the specific AI model used, and models are rapidly changing. Furthermore, enriching an AI model to address a specific context adequately38 leads to further differentiation of models, making reproducibility challenging. However, it is reassuring that our results are consistent with others using other models. Although it can be hypothesized that similarly less common languages may perform less well, our results cannot be generalized to languages other than those tested. Further work exploring AI translation accuracy in other contexts and adult settings may also be informative. Despite blinding, translators may have been able to detect when translations were created by the AI due to nuances in language and fluency. Additionally, there may have been differences in scoring between individual translators; however, each translator scored both the AI and professional translations for that language. Thus, although conclusions that can be drawn about differences between languages are limited, differences within languages are meaningful. Additionally, we used a professional vendor for human translations to ensure they were representative of a standard approach and thus do not have access to data comparing time required for professional vs AI translations.

Conclusions

In this comparative effectiveness analysis of blinded translator review of AI vs professional translation of 148 sections of actual discharge instructions into 4 languages, our AI model produced translations that were largely noninferior to professional translations for Spanish but underperformed in all other languages tested, consistent with prior literature. The results included a high proportion of clinically impactful errors in AI translations. Although current models may perform well for translating discharge instructions for widely spoken languages, considerable work remains to improve performance for languages that are less prevalent in the US. Thoughtful and inclusive implementation of this technology may lead to more equitable care for patients and families.

Supplement 1.

eMethods. Translation Scoring Guide

eTable. Proportion of Issued Discharge Instructions With Clinically Impactful Errors, Comparing Professional to AI Translations

Supplement 2.

Data Sharing Statement

References

  • 1. Wilson E, Chen AH, Grumbach K, Wang F, Fernandez A. Effects of limited English proficiency and physician language on health care comprehension. J Gen Intern Med. 2005;20(9):800-806. doi: 10.1111/j.1525-1497.2005.0174.x
  • 2. Morales LS, Cunningham WE, Brown JA, Liu H, Hays RD. Are Latinos less satisfied with communication by health care providers? J Gen Intern Med. 1999;14(7):409-417.
  • 3. Levas MN, Cowden JD, Dowd MD. Effects of the limited English proficiency of parents on hospital length of stay and home health care referral for their home health care-eligible children with infections. Arch Pediatr Adolesc Med. 2011;165(9):831-836. doi: 10.1001/archpediatrics.2011.61
  • 4. John-Baptiste A, Naglie G, Tomlinson G, et al. The effect of English language proficiency on length of stay and in-hospital mortality. J Gen Intern Med. 2004;19(3):221-228. doi: 10.1111/j.1525-1497.2004.21205.x
  • 5. Karliner LS, Auerbach A, Nápoles A, Schillinger D, Nickleach D, Pérez-Stable EJ. Language barriers and understanding of hospital discharge instructions. Med Care. 2012;50(4):283-289. doi: 10.1097/MLR.0b013e318249c949
  • 6. Karliner LS, Kim SE, Meltzer DO, Auerbach AD. Influence of language barriers on outcomes of hospital care for general medicine inpatients. J Hosp Med. 2010;5(5):276-282. doi: 10.1002/jhm.658
  • 7. Lion KC, Wright DR, Desai AD, Mangione-Smith R. Costs of care for hospitalized children associated with preferred language and insurance type. Hosp Pediatr. 2017;7(2):70-78. doi: 10.1542/hpeds.2016-0051
  • 8. Flores G, Abreu M, Schwartz I, Hill M. The importance of language and culture in pediatric care: case studies from the Latino community. J Pediatr. 2000;137(6):842-848. doi: 10.1067/mpd.2000.109150
  • 9. Khan A, Yin HS, Brach C, et al.; Patient and Family Centered I-PASS Health Literacy Subcommittee. Association between parent comfort with English and adverse events among hospitalized children. JAMA Pediatr. 2020;174(12):e203215. doi: 10.1001/jamapediatrics.2020.3215
  • 10. Divi C, Koss RG, Schmaltz SP, Loeb JM. Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care. 2007;19(2):60-67. doi: 10.1093/intqhc/mzl069
  • 11. Castro MRH, Schwartz H, Hernandez S, et al. The association of limited English proficiency with morbidity and mortality after trauma. J Surg Res. 2022;280:326-332. doi: 10.1016/j.jss.2022.07.044
  • 12. Yin HS, Dreyer BP, Ugboaja DC, et al. Unit of measurement used and parent medication dosing errors. Pediatrics. 2014;134(2):e354-e361. doi: 10.1542/peds.2014-0395
  • 13. Grover G, Berkowitz CD, Lewis RJ. Parental recall after a visit to the emergency department. Clin Pediatr (Phila). 1994;33(4):194-201. doi: 10.1177/000992289403300401
  • 14. Al-Harthy N, Sudersanadas KM, Al-Mutairi M, et al. Efficacy of patient discharge instructions: a pointer toward caregiver friendly communication methods from pediatric emergency personnel. J Family Community Med. 2016;23(3):155-160. doi: 10.4103/2230-8229.189128
  • 15. Isaacman DJ, Purvis K, Gyuro J, Anderson Y, Smith D. Standardized instructions: do they improve communication of discharge information from the emergency department? Pediatrics. 1992;89(6 pt 2):1204-1208. doi: 10.1542/peds.89.6.1204
  • 16. Stevens PK, Penprase B, Kepros JP, Dunneback J. Parental recognition of postconcussive symptoms in children. J Trauma Nurs. 2010;17(4):178-182; quiz 183-184. doi: 10.1097/JTN.0b013e3181ff2789
  • 17. Fosarelli P, DeAngelis C, Kaszuba A. Compliance with follow-up appointments generated in a pediatric emergency room. Am J Prev Med. 1985;1(3):23-29. doi: 10.1016/S0749-3797(18)31409-0
  • 18. Hemphill RR, Santen SA, Howell JM, Altieri MF. Follow-up compliance in febrile children: a comparison of two systems. Acad Emerg Med. 1998;5(10):996-1001.
  • 19. Liberman DB, Shelef DQ, He J, McCarter R, Teach SJ. Low rates of follow-up with primary care providers after pediatric emergency department visits for respiratory tract illnesses. Pediatr Emerg Care. 2012;28(10):956-961. doi: 10.1097/PEC.0b013e31826c6dde
  • 20. Polacek TL, Fitzmaurice L, Sharma V, Kennedy C, Glaros A. Follow-up appointment compliance of discharged patients from a pediatric emergency department. Nurse Pract. 1996;21(3).
  • 21. Zorc JJ, Scarfone RJ, Li Y, et al. Scheduled follow-up after a pediatric emergency department visit for asthma: a randomized trial. Pediatrics. 2003;111(3):495-502. doi: 10.1542/peds.111.3.495
  • 22. Zorc JJ, Chew A, Allen JL, Shaw K. Beliefs and barriers to follow-up after an emergency department asthma visit: a randomized trial. Pediatrics. 2009;124(4):1135-1142. doi: 10.1542/peds.2008-3352
  • 23. Zorc JJ, Scarfone RJ, Li Y. Predictors of primary care follow-up after a pediatric emergency visit for asthma. J Asthma. 2005;42(7):571-576. doi: 10.1080/02770900500215947
  • 24. Glick AF, Farkas JS, Nicholson J, et al. Parental management of discharge instructions: a systematic review. Pediatrics. 2017;140(2):e20164165. doi: 10.1542/peds.2016-4165
  • 25. Wang NE, Gisondi MA, Golzari M, van der Vlugt TM, Tuuli M. Socioeconomic disparities are negatively associated with pediatric emergency department aftercare compliance. Acad Emerg Med. 2003;10(11):1278-1284.
  • 26. Jones AR, Brandow M, Smith E. After the emergency department: a study of pediatric injury follow-up. N C Med J. 1998;59(5):307-309.
  • 27. Harris LM, Dreyer BP, Mendelsohn AL, et al. Liquid medication dosing errors by Hispanic parents: role of health literacy and English proficiency. Acad Pediatr. 2017;17(4):403-410. doi: 10.1016/j.acap.2016.10.001
  • 28. Davis SH, Rosenberg J, Nguyen J, et al. Translating discharge instructions for limited English-proficient families: strategies and barriers. Hosp Pediatr. 2019;9(10):779-787. doi: 10.1542/hpeds.2019-0055
  • 29. Unaka NI, Statile A, Haney J, Beck AF, Brady PW, Jerardi KE. Assessment of readability, understandability, and completeness of pediatric hospital medicine discharge instructions. J Hosp Med. 2017;12(2):98-101. doi: 10.12788/jhm.2688
  • 30. Austad K, Lee JH, Lanney H, et al. Evaluating the quality and equity of patient hospital discharge instructions. BMC Health Serv Res. 2025;25(1):291. doi: 10.1186/s12913-025-12410-8
  • 31. DeSai C, Janowiak K, Secheli B, et al. Empowering patients: simplifying discharge instructions. BMJ Open Qual. 2021;10(3):e001419. doi: 10.1136/bmjoq-2021-001419
  • 32. Burns ST, Amobi N, Chen JV, O’Brien M, Haber LA. Readability of patient discharge instructions. J Gen Intern Med. 2022;37(7):1797-1798. doi: 10.1007/s11606-021-06988-y
  • 33. Kim W, Kim I, Baltimore K, Imtiaz AS, Bhattacharya BS, Lin L. Simple contents and good readability: improving health literacy for LEP populations. Int J Med Inform. 2020;141:104230. doi: 10.1016/j.ijmedinf.2020.104230
  • 34. Guerreiro NM, Alves DM, Waldendorf J, et al. Hallucinations in large multilingual translation models. Trans Assoc Comput Linguist. 2023;11:1500-1517. doi: 10.1162/tacl_a_00615
  • 35. Hendy A, Abdelrehim M, Sharaf A, et al. How good are GPT models at machine translation? a comprehensive evaluation. arXiv. Preprint posted online February 18, 2023. doi: 10.48550/arXiv.2302.09210
  • 36. Metz C. A.I. is getting more powerful, but its hallucinations are getting worse. New York Times. May 5, 2025. Accessed July 7, 2025. https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html
  • 37. Moore-Coyler R. Why is AI hallucinating more frequently, and how can we stop it? Live Science. June 21, 2025. Accessed July 7, 2025. https://www.livescience.com/technology/artificial-intelligence/ai-hallucinates-more-frequently-as-it-gets-more-advanced-is-there-any-way-to-stop-it-from-happening-and-should-we-even-try
  • 38. Nassiri K, Akhloufi MA. Recent advances in large language models for healthcare. BioMedInformatics. 2024;4(2):1097-1143. doi: 10.3390/biomedinformatics4020062
  • 39. Thompson B, Dhaliwal MP, Frisch P, Domhan T, Federico M. A shocking amount of the web is machine translated: insights from multi-way parallelism. arXiv. Preprint posted online June 5, 2024. doi: 10.18653/v1/2024.findings-acl.103
  • 40. Berger ML, Mamdani M, Atkins D, Johnson ML. Good research practices for comparative effectiveness research: defining, reporting and interpreting nonrandomized studies of treatment effects using secondary data sources: the ISPOR Good Research Practices for Retrospective Database Analysis Task Force Report–Part I. Value Health. 2009;12(8):1044-1052. doi: 10.1111/j.1524-4733.2009.00600.x
  • 41. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)–a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377-381. doi: 10.1016/j.jbi.2008.08.010
  • 42. Harris PA, Taylor R, Minor BL, et al.; REDCap Consortium. The REDCap consortium: building an international community of software platform partners. J Biomed Inform. 2019;95:103208. doi: 10.1016/j.jbi.2019.103208
  • 43. Taira BR, Kreger V, Orue A, Diamond LC. A pragmatic assessment of Google Translate for emergency department instructions. J Gen Intern Med. 2021;36(11):3361-3365. doi: 10.1007/s11606-021-06666-z
  • 44. Chen X, Acosta S, Barry AE. Machine or human? evaluating the quality of a language translation mobile app for diabetes education material. JMIR Diabetes. 2017;2(1):e13. doi: 10.2196/diabetes.7446
  • 45. Brewster RCL, Gonzalez P, Khazanchi R, et al. Performance of ChatGPT and Google Translate for pediatric discharge instruction translation. Pediatrics. 2024;154(1):e2023065573. doi: 10.1542/peds.2023-065573
  • 46. Khoong EC, Steinbrook E, Brown C, Fernandez A. Assessing the use of Google Translate for Spanish and Chinese translations of emergency department discharge instructions. JAMA Intern Med. 2019;179(4):580-582. doi: 10.1001/jamainternmed.2018.7653
  • 47. Patil S, Davies P. Use of Google Translate in medical communication: evaluation of accuracy. BMJ. 2014;349:g7392. doi: 10.1136/bmj.g7392
  • 48. Cornelison BR, Al-Mohaish S, Sun Y, Edwards CJ. Accuracy of Google Translate in translating the directions and counseling points for top-selling drugs from English to Arabic, Chinese, and Spanish. Am J Health Syst Pharm. 2021;78(22):2053-2058. doi: 10.1093/ajhp/zxab224
  • 49. Chen X, Acosta S, Barry AE. Evaluating the accuracy of Google Translate for diabetes education material. JMIR Diabetes. 2016;1(1):e3. doi: 10.2196/diabetes.5848
  • 50. Grimm DR, Lee YJ, Hu K, et al. The utility of ChatGPT as a generative medical translator. Eur Arch Otorhinolaryngol. 2024;281(11):6161-6165. doi: 10.1007/s00405-024-08708-8
  • 51. Centers for Medicare & Medicaid Services, US Department of Health and Human Services. Accessed August 11, 2025. https://www.hhs.gov/civil-rights/for-individuals/section-1557/fs-limited-english-proficiency/index.html
  • 52. Office for Civil Rights (OCR). Limited English Proficiency (LEP). August 8, 2024. Accessed August 11, 2025. https://www.hhs.gov/civil-rights/for-individuals/special-topics/limited-english-proficiency/index.html
  • 53. Luo J, Cherry C, Foster G. To diverge or not to diverge: a morphosyntactic perspective on machine translation vs human translation. Trans Assoc Comput Linguist. 2024;12:355-371. doi: 10.1162/tacl_a_00645
  • 54. Deng D, Xue N. Translation divergences in Chinese–English machine translation: an empirical investigation. Comput Linguist. 2017;43(3):521-565. doi: 10.1162/COLI_a_00292
  • 55. Kuwanto G, Akyürek AF, Tourni IC, Li S, Jones AG, Wijaya D. Low-resource machine translation training curriculum fit for low-resource languages. arXiv. Preprint posted online November 29, 2021. doi: 10.1007/978-981-99-7025-4_39
  • 56. Diamond LC, Wilson-Stronks A, Jacobs EA. Do hospitals measure up to the national culturally and linguistically appropriate services standards? Med Care. 2010;48(12):1080-1087. doi: 10.1097/MLR.0b013e3181f380bc
  • 57. Graves JM, Moore M, Gonzalez C, Ramos J, Nguyen L, Vavilala MS. Too little information: accessibility of information about language services on hospital websites. J Immigr Minor Health. 2020;22(3):433-438. doi: 10.1007/s10903-020-00978-8
  • 58. Weiss L, Gany F, Rosenfeld P, et al. Access to multilingual medication instructions at New York City pharmacies. J Urban Health. 2007;84(6):742-754. doi: 10.1007/s11524-007-9221-3
  • 59. Burkle CM, Anderson KA, Xiong Y, Guerra AE, Tschida-Reuter DA. Assessment of the efficiency of language interpreter services in a busy surgical and procedural practice. BMC Health Serv Res. 2017;17(1):456. doi: 10.1186/s12913-017-2425-7
  • 60. Tapo AA, Coulibaly B, Diarra S, et al. Neural machine translation for extremely low-resource African languages: a case study on Bambara. arXiv. Preprint posted online November 10, 2020. doi: 10.18653/v1/2020.loresmt-1.3
  • 61. Faheem MA, Wassif KT, Bayomi H, Abdou SM. Improving neural machine translation for low resource languages through non-parallel corpora: a case study of Egyptian dialect to modern standard Arabic translation. Sci Rep. 2024;14(1):2265. doi: 10.1038/s41598-023-51090-4
  • 62. Karakanta A, Dehdari J, van Genabith J. Neural machine translation for low-resource languages without parallel corpora. Mach Transl. 2018;32(1):167-189. doi: 10.1007/s10590-017-9203-5
  • 63. Lion KC, Lin YH, Kim T. Artificial intelligence for language translation: the equity is in the details. JAMA. 2024;332(17):1427-1428. doi: 10.1001/jama.2024.15296
  • 64. Finlayson SG, Subbaswamy A, Singh K, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med. 2021;385(3):283-286. doi: 10.1056/NEJMc2104626

Articles from JAMA Network Open are provided here courtesy of American Medical Association
