Abstract
When given a sample of 100 emergency department discharge instructions, Claude Sonnet, a large language model, produced accurate Spanish translations as evaluated by Spanish-speaking physicians and medical interpreters.
Keywords: artificial intelligence, machine learning, machine translation, language, disparities
Introduction
Language-concordant emergency department (ED) discharge instructions are an essential component of equitable care for patients who prefer a language other than English [1-4]. ED discharge instructions are often complex, combining standardized templates with personalized clinician-written text. In most cases, patients who prefer a language other than English still receive instructions in English [5]. When translation is attempted, clinicians often rely informally on tools such as Google Translate, which are not auditable, are generally not institutionally approved for clinical use, and have known performance limitations for long or technically detailed documents [6-8].
Large language models (LLMs) offer a promising auditable and institutionally governable approach to addressing this equity gap [6,9]. Because reproducibility in patient-care processes requires controlled models rather than open chat interfaces such as ChatGPT (OpenAI), institutions will need to deploy translation LLMs directly [6]. To our knowledge, no formative pilot studies have evaluated LLMs for translating ED discharge instructions. We therefore conducted a feasibility study using real ED discharge instructions to assess LLM translation performance in preparation for clinical implementation.
Methods
Study Design
This was a single-center feasibility study at an urban academic medical center. We iteratively developed a translation prompt using Claude Sonnet (version 3.5; Anthropic) accessed via protected health information (PHI)–compliant Amazon Web Services (Amazon), testing each version on batches of 10 to 20 randomly sampled free-text discharge instructions provided to ED patients between July 1 and December 31, 2024. Claude Sonnet 3.5 was selected as it balances cost, performance, and speed and was available in our PHI-compliant environment.
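The batch sampling used during prompt development can be illustrated with a minimal Python sketch. The study does not describe its sampling code, so the function name `draw_batch`, the fixed seed, and the placeholder note identifiers are all assumptions made for illustration only.

```python
import random


def draw_batch(note_ids, rng, min_size=10, max_size=20):
    """Draw one random batch of min_size to max_size note IDs without replacement."""
    if len(note_ids) < min_size:
        raise ValueError("not enough notes to fill a batch")
    # Pick a batch size in the allowed range, capped by the pool size.
    size = rng.randint(min_size, min(max_size, len(note_ids)))
    return rng.sample(note_ids, size)


# Hypothetical pool of deidentified note identifiers (not study data).
note_ids = [f"note-{i:04d}" for i in range(1, 501)]

# A seeded generator lets the same batch be re-drawn later for audit.
rng = random.Random(2024)
batch = draw_batch(note_ids, rng)
```

Seeding the generator is a design choice rather than a requirement: it makes each prompt iteration traceable to the exact notes it was tested on.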
Following prompt development, a team of independent evaluators (2 native Spanish-speaking physicians and 2 certified medical interpreters) reviewed a translated set of 100 randomly sampled free-text discharge instructions. We used a rubric adapted from prior studies consisting of a 5-point Likert scale across 5 domains (Multimedia Appendix 1) designed so that items rated a 3 or lower would be deemed substandard or unacceptable for use in a clinical setting [9]. Scores of 3 or lower in any domain required written explanation and were escalated for further review.
Our primary outcome was the proportion of discharge instructions that scored a 3 or lower in any one domain. The secondary outcome was the mean Likert score for each of the 5 domains stratified by reviewer type (interpreter vs physician). Descriptive analyses were performed in Python (version 3.11).
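As a rough illustration of how these two outcomes could be computed, the following Python sketch works on a handful of made-up ratings. The record layout, the helper names (`make_rating`, `flagged_samples`, `primary_outcome`, `mean_scores_by_reviewer_type`), and the example scores are assumptions; the study's actual analysis code is not described.

```python
from collections import defaultdict
from statistics import mean

DOMAINS = ("completeness", "fluency", "meaning", "severity", "overall")
SUBSTANDARD_MAX = 3  # scores of 3 or lower are treated as substandard


def make_rating(sample_id, reviewer_type, *scores):
    """Build one hypothetical rating record (one reviewer, one sample)."""
    return {
        "sample_id": sample_id,
        "reviewer_type": reviewer_type,
        "scores": dict(zip(DOMAINS, scores)),
    }


def flagged_samples(ratings):
    """Sample IDs with a score of 3 or lower in any domain from any reviewer."""
    return {
        r["sample_id"]
        for r in ratings
        if any(s <= SUBSTANDARD_MAX for s in r["scores"].values())
    }


def primary_outcome(ratings):
    """Proportion of samples flagged in at least one domain."""
    sample_ids = {r["sample_id"] for r in ratings}
    return len(flagged_samples(ratings)) / len(sample_ids)


def mean_scores_by_reviewer_type(ratings):
    """Mean Likert score for each (reviewer type, domain) pair."""
    pooled = defaultdict(list)
    for r in ratings:
        for domain, score in r["scores"].items():
            pooled[(r["reviewer_type"], domain)].append(score)
    return {key: mean(values) for key, values in pooled.items()}


# Made-up example ratings, not study data.
ratings = [
    make_rating(1, "physician", 5, 5, 5, 5, 5),
    make_rating(1, "interpreter", 5, 4, 5, 5, 5),
    make_rating(2, "physician", 5, 5, 3, 5, 3),
    make_rating(2, "interpreter", 5, 5, 5, 5, 5),
]

proportion_flagged = primary_outcome(ratings)
means = mean_scores_by_reviewer_type(ratings)
```

In this toy data set, sample 2 is flagged because one reviewer scored it 3 for meaning and overall quality, so half of the samples meet the primary outcome definition.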
Ethical Considerations
This study was approved by the Beth Israel Deaconess Institutional Review Board (2024P000315). All abstracted data were deidentified.
Results
Across the 100 samples translated using the designed prompt (Multimedia Appendix 2), mean Likert scores by domain were as follows: 5.0 (95% CI 5.0-5.0) for completeness, 4.8 (95% CI 4.8-4.8) for fluency, 4.9 (95% CI 4.9-4.9) for meaning, 5.0 (95% CI 5.0-5.0) for severity, and 4.9 (95% CI 4.9-4.9) overall (Table 1; example translations are provided in Multimedia Appendices 3 and 4). One sample received a score of 3 from a single reviewer in the domains of meaning and overall quality because the term “concussion” was translated as conmoción cerebral (full redacted translation in Multimedia Appendix 3). On adjudication, the translation was deemed clinically acceptable because conmoción cerebral is one of several accepted translations of “concussion,” along with concusión.
Table 1. Interpreter and physician evaluator scores for Spanish translations (N=100).
| Domain | Mean interpreter scores (95% CI) | Mean physician scores (95% CI) |
| Completeness | 5.0 (5.0-5.0) | 5.0 (5.0-5.0) |
| Fluency | 4.8 (4.7-4.9) | 4.8 (4.7-4.9) |
| Meaning | 5.0 (5.0-5.0) | 4.9 (4.9-4.9) |
| Severity | 5.0 (5.0-5.0) | 5.0 (5.0-5.0) |
| Overall | 5.0 (5.0-5.0) | 4.9 (4.9-4.9) |
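The degenerate intervals in the table (for example, 5.0-5.0) are what a standard interval formula yields when nearly every rating is identical. The sketch below uses a normal-approximation 95% CI for a mean; this method is an assumption, since the study does not state how its intervals were calculated, and the score distributions shown are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Mean with a normal-approximation confidence interval."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least 2 scores to estimate spread")
    m = mean(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return m, m - half_width, m + half_width


# Hypothetical distributions, not study data.
all_fives = [5] * 100             # every rating is 5
mostly_fives = [5] * 80 + [4] * 20  # 80 fives and 20 fours

ceiling = mean_ci(all_fives)
mixed = mean_ci(mostly_fives)
```

With every rating equal to 5, the standard deviation is zero and the interval collapses onto the mean. With the hypothetical 80/20 split, the mean is 4.8 and the interval rounds to 4.7-4.9.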
Discussion
In this feasibility pilot study, we found that Claude Sonnet produced clinically acceptable Spanish translations of ED discharge instructions. The one case flagged for further review reflected regional differences in Spanish vocabulary, which suggests that future LLM prompts could incorporate a patient's nationality or dialect to improve comprehensibility.
Our results align with prior work on standardized discharge instructions as well as free-text instructions from pediatric settings [9,10]. Free-text instructions are prone to grammatical, dictation, and typographical errors, missing information, formatting issues, and overly complicated medical terminology, any of which might compromise translation quality. A recent study (n=20) in the pediatric setting showed comparable quality between interpreter translation and the GPT-4o model from OpenAI [10]. Our study did not directly compare LLM outputs with interpreter outputs but instead included interpreters as reviewers.
Our single-center results may not apply to institutions that have different discharge instruction processes or lack access to PHI-compliant LLMs. Moreover, our study was limited to Spanish. Further testing will be needed to establish the safety of LLM translation before live implementation.
Acknowledgments
We would like to thank Shari Gold-Gomez, Ana Torres, Natalia Chilcote, and Marie Rodriguez for their contributions to this study. RCLB was affiliated with the Department of Pediatrics at Beth Israel Deaconess Medical Center at the time of the study and is currently affiliated with the Department of Pediatrics at Stanford University School of Medicine. DAI and ABM were affiliated with the Department of Medicine at Beth Israel Deaconess Medical Center at the time of the study. DAI is currently affiliated with the Department of Medicine at the University of California, San Francisco. ABM is currently affiliated with the Department of Medicine at Boston Medical Center.
Abbreviations
- ED: emergency department
- LLM: large language model
- PHI: protected health information
Multimedia Appendix 1: Interpretation rating guide.
Multimedia Appendix 2: Large language model prompt.
Multimedia Appendix 3: Translation with low score.
Multimedia Appendix 4: Sample translation.
Data Availability
The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.
Footnotes
Authors' Contributions:
Conceptualization: ADH (lead), JACT (equal), RCLB (supporting)
Data curation: ADH
Formal analysis: ADH
Funding acquisition: ADH (lead), JACT (equal)
Investigation: DAI (lead), ABM (equal)
Methodology: ADH (lead), JACT (equal), AG (supporting), RCLB (supporting)
Project administration: ADH (lead), JACT (equal)
Resources: LAN (lead), RO (supporting)
Software: TF (lead), PA (supporting)
Supervision: ADH (lead), JACT (equal)
Writing—original draft: JACT
Writing—review and editing: JACT (lead), ADH (supporting)
Conflicts of Interest: None declared.
References
1. Khoong EC, Sherwin EB, Harrison JD, Wheeler M, Shah SJ, Mourad M, Khanna R. Impact of standardized, language-concordant hospital discharge instructions on postdischarge medication questions. J Hosp Med. 2023 Sep;18(9):822–828. doi: 10.1002/jhm.13172. https://escholarship.org/uc/item/qt9j5394sm
2. Samuels-Kalow ME, Stack AM, Porter SC. Effective discharge communication in the emergency department. Ann Emerg Med. 2012 Aug;60(2):152–9. doi: 10.1016/j.annemergmed.2011.10.023.
3. Gutman CK, Lion KC, Fisher CL, Aronson PL, Patterson M, Fernandez R. Breaking through barriers: the need for effective research to promote language-concordant communication as a facilitator of equitable emergency care. J Am Coll Emerg Physicians Open. 2022 Feb;3(1):e12639. doi: 10.1002/emp2.12639. https://linkinghub.elsevier.com/retrieve/pii/EMP212639
4. Lion KC, Lin Y, Kim T. Artificial intelligence for language translation: the equity is in the details. JAMA. 2024 Nov 05;332(17):1427–1428. doi: 10.1001/jama.2024.15296.
5. Isbey S, Badolato G, Kline J. Pediatric emergency department discharge instructions for Spanish-speaking families: are we getting it right? Pediatr Emerg Care. 2022 Feb 01;38(2):e867–e870. doi: 10.1097/PEC.0000000000002470.
6. Lopez I, Velasquez DE, Chen JH, Rodriguez JA. Operationalizing machine-assisted translation in healthcare. NPJ Digit Med. 2025 Sep 30;8(1):584. doi: 10.1038/s41746-025-01944-0. https://doi.org/10.1038/s41746-025-01944-0
7. Khoong EC, Steinbrook E, Brown C, Fernandez A. Assessing the use of Google Translate for Spanish and Chinese translations of emergency department discharge instructions. JAMA Intern Med. 2019 Apr 01;179(4):580–582. doi: 10.1001/jamainternmed.2018.7653. https://europepmc.org/abstract/MED/30801626
8. Taira BR, Kreger V, Orue A, Diamond LC. A pragmatic assessment of Google Translate for emergency department instructions. J Gen Intern Med. 2021 Nov;36(11):3361–3365. doi: 10.1007/s11606-021-06666-z. https://europepmc.org/abstract/MED/33674922
9. Ray M, Kats DJ, Moorkens J, Rai D, Shaar N, Quinones D, Vermeulen A, Mateo CM, Brewster RCL, Khan A, Rader B, Brownstein JS, Hron JD. Evaluating a large language model in translating patient instructions to Spanish using a standardized framework. JAMA Pediatr. 2025 Sep 01;179(9):1026–1033. doi: 10.1001/jamapediatrics.2025.1729.
10. Brewster RCL, Gonzalez P, Khazanchi R, Butler A, Selcer R, Chu D, Aires BP, Luercio M, Hron JD. Performance of ChatGPT and Google Translate for pediatric discharge instruction translation. Pediatrics. 2024 Jul 01;154(1):e2023065573. doi: 10.1542/peds.2023-065573.
