JMIR Formative Research. 2026 Jan 12;10:e79676. doi: 10.2196/79676

Evaluating Spanish Translations of Emergency Department Discharge Instructions by a Large Language Model: Tool Validation and Reliability Study

Jossie A Carreras Tartak 1, Ryan CL Brewster 2, Daniela Arango Isaza 3, Antonio Berumen Martinez 3, Ana Grafals 1, Phanidhar Adusumilli 4, Ted Fitzgerald 4, Roger Orcutt 1, Larry A Nathanson 1, Adrian D Haimovich 1
Editor: Alicia Stone
Reviewed by: Pramod Bharadwaj Chandrashekar, Robert Frederking, Uday Kiran Chilakalapalli
PMCID: PMC12835839  PMID: 41525688

Abstract

When given a sample of 100 emergency department discharge instructions, Claude Sonnet, a large language model, produced accurate Spanish translations as evaluated by Spanish-speaking physicians and medical interpreters.

Keywords: artificial intelligence, machine learning, machine translation, language, disparities

Introduction

Language-concordant emergency department (ED) discharge instructions are an essential component of equitable care for patients who prefer a language other than English [1-4]. ED discharge instructions are often complex, combining standardized templates with personalized clinician-written text. In most cases, patients who prefer a language other than English still receive instructions in English [5]. When translation is attempted, clinicians often rely informally on tools such as Google Translate that are not auditable, are generally not institutionally approved for clinical use, and have known performance limitations for long or technically detailed documents [6-8].

Large language models (LLMs) offer a promising auditable and institutionally governable approach to addressing this equity gap [6,9]. Because reproducibility in patient-care processes requires controlled models rather than open chat interfaces such as ChatGPT (OpenAI), institutions will need to deploy translation LLMs directly [6]. To our knowledge, no formative pilot studies have evaluated LLMs for translating ED discharge instructions. We therefore conducted a feasibility study using real ED discharge instructions to assess LLM translation performance in preparation for clinical implementation.

Methods

Study Design

This was a single-center feasibility study at an urban academic medical center. We iteratively developed a translation prompt using Claude Sonnet (version 3.5; Anthropic) accessed via protected health information (PHI)–compliant Amazon Web Services (Amazon), testing each version on batches of 10 to 20 randomly sampled free-text discharge instructions provided to ED patients between July 1 and December 31, 2024. Claude Sonnet 3.5 was selected as it balances cost, performance, and speed and was available in our PHI-compliant environment.
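The paper states only that Claude Sonnet 3.5 was accessed through a PHI-compliant Amazon Web Services environment; the deployment details are not specified. A minimal sketch of such a setup, assuming AWS Bedrock with the boto3 SDK (the model ID, region, and system prompt below are illustrative assumptions; the study's actual prompt is in Multimedia Appendix 2):

```python
import json

# Assumed Bedrock model ID for Claude 3.5 Sonnet; illustrative only.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Hypothetical translation instruction; not the study's actual prompt.
SYSTEM_PROMPT = (
    "You are a medical translator. Translate the following emergency "
    "department discharge instructions into Spanish, preserving all "
    "clinical content and formatting."
)

def build_request(instructions: str) -> dict:
    """Construct an Anthropic Messages API payload for Bedrock."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2048,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": instructions}],
    }

def translate(instructions: str, region: str = "us-east-1") -> str:
    """Send one discharge instruction to Claude via Bedrock and return
    the Spanish translation. Requires PHI-compliant AWS credentials."""
    import boto3  # deferred so request construction is testable offline
    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId=MODEL_ID, body=json.dumps(build_request(instructions))
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```

Keeping the model pinned to a fixed version and a fixed prompt, as here, is what makes the translation process auditable and reproducible in a way that open chat interfaces are not.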

Following prompt development, a team of independent evaluators (2 native Spanish-speaking physicians and 2 certified medical interpreters) reviewed a translated set of 100 randomly sampled free-text discharge instructions. We used a rubric adapted from prior studies consisting of a 5-point Likert scale across 5 domains (Multimedia Appendix 1) designed so that items rated a 3 or lower would be deemed substandard or unacceptable for use in a clinical setting [9]. Scores of 3 or lower in any domain required written explanation and were escalated for further review.
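The escalation rule described above (any domain rated 3 or lower requires a written explanation and further review) can be sketched as a simple data structure. The class and field names below are hypothetical, chosen only to mirror the rubric's 5 domains and threshold:

```python
from dataclasses import dataclass

DOMAINS = ("completeness", "fluency", "meaning", "severity", "overall")
THRESHOLD = 3  # scores of 3 or lower are substandard or unacceptable

@dataclass
class Review:
    reviewer: str          # "physician" or "interpreter"
    scores: dict           # domain -> 1..5 Likert score
    explanation: str = ""  # required when any score <= THRESHOLD

def needs_escalation(review: Review) -> bool:
    """A translation is escalated if any domain scored 3 or lower."""
    return any(review.scores[d] <= THRESHOLD for d in DOMAINS)
```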

Our primary outcome was the proportion of discharge instructions that scored a 3 or lower in any one domain. The secondary outcome was the mean Likert score for each of the 5 domains stratified by reviewer type (interpreter vs physician). Descriptive analyses were performed in Python (version 3.11).
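The descriptive analyses were performed in Python; the paper does not state how the confidence intervals were computed. A minimal sketch of both outcomes, assuming a normal-approximation 95% CI for the domain means (function names are illustrative):

```python
import math
from statistics import mean, stdev

def mean_ci(scores, z=1.96):
    """Mean Likert score with a normal-approximation 95% CI
    (illustrative; the study does not specify its CI method)."""
    m = mean(scores)
    if len(scores) > 1 and stdev(scores) > 0:
        half = z * stdev(scores) / math.sqrt(len(scores))
    else:
        half = 0.0
    return round(m, 1), round(m - half, 1), round(m + half, 1)

def primary_outcome(samples, threshold=3):
    """Proportion of samples with any domain scored at or below
    the threshold; each sample is a dict of domain -> score."""
    flagged = sum(1 for s in samples if min(s.values()) <= threshold)
    return flagged / len(samples)
```

With near-ceiling scores like those in Table 1, the standard deviation is close to zero, which is why several reported intervals collapse to a single value (e.g., 5.0, 95% CI 5.0-5.0).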

Ethical Considerations

This study was approved by the Beth Israel Deaconess Institutional Review Board (2024P000315). All abstracted data were deidentified.

Results

Of the 100 samples translated using the designed prompt (Multimedia Appendix 2), the mean Likert scores by domain were as follows: 5.0 (95% CI 5.0-5.0) for completeness, 4.8 (95% CI 4.8-4.8) for fluency, 4.9 (95% CI 4.9-4.9) for meaning, 5.0 (95% CI 5.0-5.0) for severity, and 4.9 (95% CI 4.9-4.9) overall (Table 1; example translations are provided in Multimedia Appendices 3 and 4). One sample was given a score of 3 by a single reviewer in the domains of meaning and overall quality because the term "concussion" was translated as conmoción cerebral (full redacted translation in Multimedia Appendix 3). On adjudication, the translation was deemed clinically acceptable because conmoción cerebral is one of several accepted translations of "concussion," along with concusión.

Table 1.

Interpreter and physician evaluator scores for Spanish translations (N=100).

Domain        | Mean interpreter score (95% CI) | Mean physician score (95% CI)
Completeness  | 5.0 (5.0-5.0)                   | 5.0 (5.0-5.0)
Fluency       | 4.8 (4.7-4.9)                   | 4.8 (4.7-4.9)
Meaning       | 5.0 (5.0-5.0)                   | 4.9 (4.9-4.9)
Severity      | 5.0 (5.0-5.0)                   | 5.0 (5.0-5.0)
Overall       | 5.0 (5.0-5.0)                   | 4.9 (4.9-4.9)

Discussion

In this feasibility pilot study, we found that Claude Sonnet produced clinically acceptable Spanish translations of ED discharge instructions. The one case flagged for further review reflected regional differences in Spanish vocabulary, suggesting that future LLM prompts could incorporate patient nationality or dialect to improve comprehensibility.

Our results align with prior work on standardized discharge instructions as well as free-text instructions from pediatric settings [9,10]. Free-text instructions may contain grammatical, dictation, and typographical errors; missing information; formatting issues; and overly complicated medical terminology, any of which might compromise translation quality. A recent study (n=20) in the pediatric setting showed comparable quality between interpreter translation and OpenAI's GPT-4o model [10]. Our study did not directly compare LLM outputs with interpreter translations but instead included interpreters as reviewers.

Our single-center results may not apply to institutions that have different discharge instruction processes or lack access to PHI-compliant LLMs. Moreover, our study was limited to Spanish. Further testing will be needed to establish the safety of LLM translation before live implementation.

Acknowledgments

We would like to thank Shari Gold-Gomez, Ana Torres, Natalia Chilcote, and Marie Rodriguez for their contributions to this study. RCLB was affiliated with the Department of Pediatrics at Beth Israel Deaconess Medical Center at the time of the study and is currently affiliated with the Department of Pediatrics at Stanford University School of Medicine. DAI and ABM were affiliated with the Department of Medicine at Beth Israel Deaconess Medical Center at the time of the study. DAI is currently affiliated with the Department of Medicine at the University of California, San Francisco. ABM is currently affiliated with the Department of Medicine at Boston Medical Center.

Abbreviations

ED: emergency department
LLM: large language model
PHI: protected health information

Multimedia Appendix 1

Interpretation rating guide.

Multimedia Appendix 2

Large language model prompt.

Multimedia Appendix 3

Translation with low score.

Multimedia Appendix 4

Sample translation.

Data Availability

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Footnotes

Authors' Contributions: Conceptualization: ADH (lead), JACT (equal), RCLB (supporting)

Data curation: ADH

Formal analysis: ADH

Funding acquisition: ADH (lead), JACT (equal)

Investigation: DAI (lead), ABM (equal)

Methodology: ADH (lead), JACT (equal), AG (supporting), RCLB (supporting)

Project administration: ADH (lead), JACT (equal)

Resources: LAN (lead), RO (supporting)

Software: TF (lead), PA (supporting)

Supervision: ADH (lead), JACT (equal)

Writing—original draft: JACT

Writing—review and editing: JACT (lead), ADH (supporting)

Conflicts of Interest: None declared.

References

1. Khoong EC, Sherwin EB, Harrison JD, Wheeler M, Shah SJ, Mourad M, Khanna R. Impact of standardized, language-concordant hospital discharge instructions on postdischarge medication questions. J Hosp Med. 2023 Sep;18(9):822-828. doi: 10.1002/jhm.13172
2. Samuels-Kalow ME, Stack AM, Porter SC. Effective discharge communication in the emergency department. Ann Emerg Med. 2012 Aug;60(2):152-9. doi: 10.1016/j.annemergmed.2011.10.023
3. Gutman CK, Lion KC, Fisher CL, Aronson PL, Patterson M, Fernandez R. Breaking through barriers: the need for effective research to promote language-concordant communication as a facilitator of equitable emergency care. J Am Coll Emerg Physicians Open. 2022 Feb;3(1):e12639. doi: 10.1002/emp2.12639
4. Lion KC, Lin Y, Kim T. Artificial intelligence for language translation: the equity is in the details. JAMA. 2024 Nov 5;332(17):1427-1428. doi: 10.1001/jama.2024.15296
5. Isbey S, Badolato G, Kline J. Pediatric emergency department discharge instructions for Spanish-speaking families: are we getting it right? Pediatr Emerg Care. 2022 Feb 1;38(2):e867-e870. doi: 10.1097/PEC.0000000000002470
6. Lopez I, Velasquez DE, Chen JH, Rodriguez JA. Operationalizing machine-assisted translation in healthcare. NPJ Digit Med. 2025 Sep 30;8(1):584. doi: 10.1038/s41746-025-01944-0
7. Khoong EC, Steinbrook E, Brown C, Fernandez A. Assessing the use of Google Translate for Spanish and Chinese translations of emergency department discharge instructions. JAMA Intern Med. 2019 Apr 1;179(4):580-582. doi: 10.1001/jamainternmed.2018.7653
8. Taira BR, Kreger V, Orue A, Diamond LC. A pragmatic assessment of Google Translate for emergency department instructions. J Gen Intern Med. 2021 Nov;36(11):3361-3365. doi: 10.1007/s11606-021-06666-z
9. Ray M, Kats DJ, Moorkens J, Rai D, Shaar N, Quinones D, Vermeulen A, Mateo CM, Brewster RCL, Khan A, Rader B, Brownstein JS, Hron JD. Evaluating a large language model in translating patient instructions to Spanish using a standardized framework. JAMA Pediatr. 2025 Sep 1;179(9):1026-1033. doi: 10.1001/jamapediatrics.2025.1729
10. Brewster RCL, Gonzalez P, Khazanchi R, Butler A, Selcer R, Chu D, Aires BP, Luercio M, Hron JD. Performance of ChatGPT and Google Translate for pediatric discharge instruction translation. Pediatrics. 2024 Jul 1;154(1):e2023065573. doi: 10.1542/peds.2023-065573


