Author manuscript; available in PMC: 2024 Oct 15.
Published in final edited form as: Nat Med. 2024 Feb 27;30(4):1134–1142. doi: 10.1038/s41591-024-02855-5

Extended Data Table 4 | Summarization baselines

Dataset             Baseline              BLEU    ROUGE-L   BERTScore   MEDCON

Open-i              Ours                  46.0    68.2      94.7        64.9
                    ImpressionGPT [52]    -       65.4      -           -

MIMIC-CXR           Ours                  29.6    53.8      91.5        55.6
                    RadAdapt [27]         18.9    44.5      90.0        -
                    ImpressionGPT [52]    -       47.9      -           -

MIMIC-III           Ours                  11.5    34.5      89.0        36.5
                    RadAdapt [27]         16.2    38.7      90.2        -
                    Med-PaLM M [25]       15.2    32.0      -           -

Patient questions   Ours                  10.7    37.3      92.5        59.8
                    ECL° [53]             -       50.5      -           -

Progress notes      Ours                  3.4     27.2      86.1        31.5
                    CUED [54]             -       30.1      -           -

Dialogue            Ours                  26.9    42.9      90.2        59.9
                    ACI-Bench° [41]       -       45.6      -           57.8

Comparison of our general approach (GPT-4 using in-context learning, ICL) against baselines specific to each individual dataset. We note that the aim of our study is not to achieve state-of-the-art quantitative results, especially given the discordance between NLP metrics and reader study scores. A dash (-) indicates that the metric was not reported; a ° indicates that the dataset was pre-processed differently.
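ROUGE-L is the only metric reported for every baseline in the table, so the comparison can be summarized as a per-dataset difference on that column. The following is a minimal Python sketch that uses only the ROUGE-L values listed above; the names `ROUGE_L` and `rouge_l_gaps` are illustrative and do not come from the study, and the ° caveat about differing pre-processing still applies to the ECL and ACI-Bench rows.

```python
# ROUGE-L scores copied from the table: dataset -> (ours, {baseline: score}).
# Variable and function names are illustrative assumptions, not from the study.
ROUGE_L = {
    "Open-i": (68.2, {"ImpressionGPT": 65.4}),
    "MIMIC-CXR": (53.8, {"RadAdapt": 44.5, "ImpressionGPT": 47.9}),
    "MIMIC-III": (34.5, {"RadAdapt": 38.7, "Med-PaLM M": 32.0}),
    "Patient questions": (37.3, {"ECL": 50.5}),    # pre-processed differently
    "Progress notes": (27.2, {"CUED": 30.1}),
    "Dialogue": (42.9, {"ACI-Bench": 45.6}),       # pre-processed differently
}


def rouge_l_gaps(table):
    """Return {(dataset, baseline): ours - baseline}, rounded to one decimal."""
    return {
        (dataset, name): round(ours - score, 1)
        for dataset, (ours, baselines) in table.items()
        for name, score in baselines.items()
    }


if __name__ == "__main__":
    for (dataset, name), gap in rouge_l_gaps(ROUGE_L).items():
        print(f"{dataset:18s} vs {name:14s} {gap:+.1f}")
```

A positive gap means the general approach (GPT-4 using ICL) scores higher on ROUGE-L than the dataset-specific baseline; a negative gap means the baseline scores higher.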