Extended Data Table 4 | Comparison of our general approach against dataset-specific baselines.
| Dataset | Baseline | BLEU | ROUGE-L | BERTScore | MEDCON |
|---|---|---|---|---|---|
| Open-i | Ours | 46.0 | 68.2 | 94.7 | 64.9 |
| | ImpressionGPT [52] | - | 65.4 | - | - |
| MIMIC-CXR | Ours | 29.6 | 53.8 | 91.5 | 55.6 |
| | RadAdapt [27] | 18.9 | 44.5 | 90.0 | - |
| | ImpressionGPT [52] | - | 47.9 | - | - |
| MIMIC-III | Ours | 11.5 | 34.5 | 89.0 | 36.5 |
| | RadAdapt [27] | 16.2 | 38.7 | 90.2 | - |
| | Med-PaLM M [25] | 15.2 | 32.0 | - | - |
| Patient questions | Ours | 10.7 | 37.3 | 92.5 | 59.8 |
| | ECL° [53] | - | 50.5 | - | - |
| Progress notes | Ours | 3.4 | 27.2 | 86.1 | 31.5 |
| | CUED [54] | - | 30.1 | - | - |
| Dialogue | Ours | 26.9 | 42.9 | 90.2 | 59.9 |
| | ACI-Bench° [41] | - | 45.6 | - | 57.8 |
Comparison of our general approach (GPT-4 with in-context learning) against baselines specific to each individual dataset. We note that the focus of our study is not achieving state-of-the-art quantitative results, especially given the discordance between NLP metrics and reader-study scores. A dash (-) indicates that the metric was not reported; a degree symbol (°) indicates that the dataset was pre-processed differently.
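For reference, the ROUGE-L column above measures overlap between a generated summary and the reference via the longest common subsequence (LCS). The following is a minimal pure-Python sketch of the LCS-based F-measure over whitespace tokens; it is an illustration only, not the evaluation code used in this study (reported scores typically come from the official `rouge-score` package, which also applies tokenization and stemming):

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L: F-measure based on the longest common subsequence (LCS)
    between reference and candidate token sequences (whitespace-split)."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # Dynamic-programming table: dp[i][j] = LCS length of ref[:i] and cand[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / n   # fraction of candidate tokens in the LCS
    recall = lcs / m      # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("no acute cardiopulmonary process", "no acute process")` yields an LCS of 3 tokens, giving precision 1.0, recall 0.75, and F1 ≈ 0.857.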