NIHPA Author Manuscript; available in PMC: 2025 Sep 23.
Published before final editing as: J Gen Intern Med. 2025 Jul 25. doi: 10.1007/s11606-025-09758-2

Performance of Large Language Model-generated Spanish Discharge Material

Eduardo Pérez-Guerrero 1, Asad Aali 2, Emanuel Irizarry 3, Nicole Corso 4, Jason Hom 4, Fatima Rodriguez 5, Christine Santiago 4
PMCID: PMC12451971  NIHMSID: NIHMS2108300  PMID: 40715961

Introduction:

Language barriers in discharge instructions disproportionately affect non-English-speaking patients, often contributing to suboptimal post-discharge care and higher readmission rates1. Such disparities can be mitigated by offering clear, medically accurate translated materials. However, professional human translation services often cannot meet the high volume and rapid turnaround that discharges require2. Large language models (LLMs), such as GPT-4o (OpenAI, San Francisco), may offer an alternative by generating accurate, readable, and scalable Spanish discharge documentation.

This study investigates whether GPT-4o can produce Spanish translations of English cardiology discharge documents while preserving semantic fidelity. Cardiology was chosen because discharges are often post-procedural or involve complex regimens, such as diuretic titration or evolving anticoagulation plans. We used multiple natural language processing (NLP) metrics to evaluate whether LLM-generated translations maintain clinical meaning and rival human translations.

Methods:

We collected 131 cardiology discharge-related documents from an academic medical center. The documents were patient-friendly materials explaining pathophysiology, treatment, and post-care instructions for common conditions and procedures. Each was originally written in English (Human English, HE) and professionally translated into Spanish (Human Spanish, HS). GPT-4o produced Spanish translations (LLM Spanish, LS) from HE and back-translated English versions (LLM English, LE) from HS. The project did not include patient identifiers, and no chart review was involved.
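The translation pipeline above can be sketched with the OpenAI Python SDK. The prompt wording, temperature setting, and helper names below are illustrative assumptions; the study does not report its actual prompts.

```python
# Sketch of the HE -> LS and HS -> LE translation steps. The prompt text
# and function names are illustrative assumptions, not the authors' own.

def build_translation_messages(text: str, target_language: str) -> list[dict]:
    """Assemble a chat request asking the model to translate one document."""
    system = (
        "You are a professional medical translator. Translate the patient "
        f"discharge material below into {target_language}, preserving all "
        "clinical meaning, medication names, dosages, and instructions."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]

def translate(client, text: str, target_language: str) -> str:
    """Call GPT-4o to translate one document (requires an OpenAI API key)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_translation_messages(text, target_language),
        temperature=0,  # minimize sampling variation across documents
    )
    return response.choices[0].message.content
```

With `client = openai.OpenAI()`, each English document HE would yield `LS = translate(client, HE, "Spanish")`, and each professional Spanish document HS would yield the back-translation `LE = translate(client, HS, "English")`.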

Translation quality was assessed using same-language metrics (BLEU and chrF: 0–100; BERTscore and COMET: 0–1) and cross-language metrics (XLM-RoBERTa, LaBSE, mBERT, and COMET-QE: 0–1). These metrics are recognized in the clinical LLM research community for measuring lexical and semantic fidelity and have been used in prior work3. BLEU and chrF capture lexical overlap between texts, with higher scores indicating stronger word alignment. BERTscore and COMET capture semantic overlap. XLM-RoBERTa, LaBSE, mBERT, and COMET-QE assess how well meaning is retained across languages. Scores ≥ 0.80 and ≥ 80 are considered excellent3. We used multiple metrics to capture complementary aspects of translation fidelity, from word-level match (BLEU, chrF), to semantic similarity using contextual embeddings (BERTscore, LaBSE, mBERT), to overall translation quality assessed by model-based scoring (COMET, COMET-QE). This combination strengthened the validity of our findings by showing consistency across methods. These thresholds have not been validated as markers of clinical acceptability, since clinical translations have traditionally been produced by humans; here they serve to benchmark how closely LS aligns with HS when both are compared against the HE reference.
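To make the two metric families concrete, the sketch below implements a minimal word-level n-gram overlap score in the spirit of BLEU and a cosine-similarity comparison of the kind applied to LaBSE or mBERT sentence embeddings. It uses only the standard library, so it is a simplified stand-in for the actual metric implementations (which add brevity penalties, tokenization rules, and learned embeddings), not a reproduction of them.

```python
import math
from collections import Counter

def ngram_overlap(candidate: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU-style score (0-100): geometric mean of clipped n-gram
    precisions, omitting the brevity penalty of full BLEU."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand)-n+1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref)-n+1))
        # Clipped matches: a candidate n-gram counts only as often as it
        # appears in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(matches / total)
    if min(precisions) == 0:
        return 0.0
    return 100 * math.exp(sum(math.log(p) for p in precisions) / max_n)

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cross-language metrics compare sentence embeddings roughly like this:
    embed HE and LS with a multilingual encoder, then take the cosine."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

An identical candidate and reference score 100 on the lexical metric, while paraphrases score lower lexically even when a multilingual encoder would place them close together in embedding space, which is why lexical and semantic metrics can diverge.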

Results:

Translation quality comparisons between GPT-4o-generated discharge instructions and professional human translations are summarized in Figures 1 and 2. In same-language comparisons (Figure 1), GPT-4o-generated LE demonstrated moderate lexical similarity (BLEU, chrF) but high semantic alignment (BERTscore, COMET) with original HE. GPT-4o-generated LS showed lower lexical similarity scores compared to reference HS but maintained high semantic alignment scores. Cross-language comparisons (Figure 2) confirmed that GPT-4o-generated translations consistently achieved high semantic alignment scores, comparable to or exceeding professional translation standards.

Figure 1.

Same-Language Comparisons. BLEU and chrF range 0–100; BERTscore and COMET range 0–1 (higher = better).

Figure 2.

Cross-Language Comparisons. XLM-RoBERTa, LaBSE, mBERT, and COMET-QE range 0–1 (higher = better).

Discussion:

Our study demonstrates that LLM-generated Spanish discharge materials preserved the meaning of the original English instructions, performing similarly to, and occasionally better than, professional translations across key NLP metrics. Lexical similarity scores were lower in GPT-4o-generated Spanish translations than in English back-translations, possibly reflecting GPT-4o's predominantly English training data and intrinsic linguistic differences between English and Spanish. Despite these lexical variations, semantic alignment remained consistently strong across both languages.

These findings suggest LLMs could function as scalable translation tools. Professional services often struggle to meet the demands of rapid, high-volume discharges, and LLM-generated materials may bridge this gap. Prior work demonstrates LLM capability in simplifying complex medical documents3,4, summarizing clinical documentation5, and drafting patient messages6. Building on these successes, LLMs offer promise for multilingual patient education7.

This initial evaluation focused on generic discharge educational materials, which are typically translated once and reused. In such cases, human translators may not face the same volume challenges. Nevertheless, these results provide a foundation for future testing in individualized cases. Future studies should investigate efficiency gains and cognitive burdens associated with professional translators reviewing LLM-generated drafts, as seen in operational models utilizing AI-generated patient message drafts6, and should also validate these NLP metrics through real-world human evaluations of readability, understandability, actionability, and cultural sensitivity.

This study has several limitations. It did not include human assessments for hallucinations—an established concern with LLM-generated outputs5—or evaluations for potential harmfulness. Human oversight remains essential for patient-facing content.

Funding declaration:

This manuscript did not receive any funding.

Footnotes

Conflict of interest:

The authors have no competing interests to disclose with respect to this manuscript's content.

References

1. Diamond L, Izquierdo K, Canfield D, Matsoukas K, Gany F. A Systematic Review of the Impact of Patient-Physician Non-English Language Concordance on Quality of Care and Outcomes. J Gen Intern Med. 2019;34(8):1591–1606. doi: 10.1007/s11606-019-04847-5
2. Khoong EC, Fernandez A. Addressing Gaps in Interpreter Use: Time for Implementation Science Informed Multi-Level Interventions. J Gen Intern Med. 2021;36(11):3532–3536. doi: 10.1007/s11606-021-06823-4
3. Van Veen D, Van Uden C, Blankemeier L, et al. Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. Nat Med. 2024;30(4):1134–1142. doi: 10.1038/s41591-024-02855-5
4. Zaretsky J, Kim JM, Baskharoun S, et al. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw Open. 2024;7(3):e240357. doi: 10.1001/jamanetworkopen.2024.0357
5. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large Language Models in Medicine. Nat Med. 2023;29(8):1930–1940. doi: 10.1038/s41591-023-02448-8
6. Garcia P, Ma SP, Shah S, et al. Artificial Intelligence-Generated Draft Replies to Patient Inbox Messages. JAMA Netw Open. 2024;7(3):e243201. doi: 10.1001/jamanetworkopen.2024.3201
7. Gulati V, Roy SG, Moawad A, et al. Transcending Language Barriers: Can ChatGPT Be the Key to Enhancing Multilingual Accessibility in Healthcare? J Am Coll Radiol. Published online June 14, 2024. doi: 10.1016/j.jacr.2024.05.009
