Author manuscript; available in PMC: 2024 Oct 15.

Published in final edited form as: Nat Med. 2024 Feb 27;30(4):1134–1142. doi: 10.1038/s41591-024-02855-5

Extended Data Table 2

Dataset descriptions

Dataset	Task	Number of samples	Avg. number of input tokens	Avg. number of target tokens	Lexical variance

Open-i	Radiology reports	3.4K	52 ± 22	14 ± 12	0.11
MIMIC-CXR	Radiology reports	128K	75 ± 31	22 ± 17	0.08
MIMIC-III	Radiology reports	67K	160 ± 83	61 ± 45	0.09
MeQSum	Patient questions	1.2K	83 ± 67	14 ± 6	0.21
ProbSum	Progress notes	755	1,013 ± 299	23 ± 16	0.15
ACI-Bench	Dialogue	126	1,512 ± 467	211 ± 98	0.04

Task Instructions

Task	Instruction

Radiology reports	“Summarize the radiology report findings into an impression with minimal text.”
Patient questions	“Summarize the patient health query into one question of 15 words or less.”
Progress notes	“Based on the progress note, generate a list of 3–7 problems (a few words each) ranked in order of importance.”
Dialogue	“Summarize the patient/doctor dialogue into an assessment and plan.”

Top, description of the six open-source datasets, which span a wide range of token lengths and lexical variance (the ratio of unique words to total words). Bottom, instructions for each of the four summarization tasks.
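
As a rough illustration of the lexical variance column, the ratio of unique words to total words could be computed along the following lines. This is a minimal sketch and not the authors' code: the function name `lexical_variance`, the lowercasing, the regex-based word splitting, and the choice to pool all texts in a dataset before taking the ratio are assumptions, since the table does not specify the tokenization or the level of aggregation.

```python
import re


def lexical_variance(texts):
    """Return the ratio of unique words to total words over a list of texts.

    Assumption: words are lowercased runs of letters, digits, or apostrophes,
    and all texts are pooled before the ratio is taken.
    """
    words = [w for t in texts for w in re.findall(r"[a-z0-9']+", t.lower())]
    # Guard against an empty collection to avoid division by zero.
    return len(set(words)) / len(words) if words else 0.0


# Hypothetical toy inputs, not drawn from any of the datasets in the table.
samples = ["No acute cardiopulmonary process.", "No acute process."]
print(round(lexical_variance(samples), 2))
```

Under this pooled definition the ratio tends to fall as the amount of text grows, because common words repeat, so values are most comparable between corpora of similar size.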