Author manuscript; available in PMC: 2024 Oct 15.
Published in final edited form as: Nat Med. 2024 Feb 27;30(4):1134–1142. doi: 10.1038/s41591-024-02855-5

Extended Data Table 2

Models

Dataset descriptions

Dataset     Task               Number of samples   Avg. number of tokens (input)   Avg. number of tokens (target)   Lexical variance
Open-i      Radiology reports  3.4K                52 ± 22                         14 ± 12                          0.11
MIMIC-CXR   Radiology reports  128K                75 ± 31                         22 ± 17                          0.08
MIMIC-III   Radiology reports  67K                 160 ± 83                        61 ± 45                          0.09
MeQSum      Patient questions  1.2K                83 ± 67                         14 ± 6                           0.21
ProbSum     Progress notes     755                 1,013 ± 299                     23 ± 16                          0.15
ACI-Bench   Dialogue           126                 1,512 ± 467                     211 ± 98                         0.04

Task Instructions

Task               Instruction
Radiology reports  “Summarize the radiology report findings into an impression with minimal text.”
Patient questions  “Summarize the patient health query into one question of 15 words or less.”
Progress notes     “Based on the progress note, generate a list of 3–7 problems (a few words each) ranked in order of importance.”
Dialogue           “Summarize the patient/doctor dialogue into an assessment and plan.”

Top, description of six open-source datasets with a wide range of token length and lexical variance, defined as the ratio of unique words to total words. Bottom, instructions for each of the four summarization tasks.
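As an illustration of the lexical variance definition in the caption (unique words divided by total words), the following is a minimal sketch rather than the authors' implementation. The function name `lexical_variance`, the lowercasing, the whitespace tokenization and the pooling of all texts into one word list are assumptions made for this example; the table does not specify how words were tokenized or aggregated.

```python
from typing import Iterable


def lexical_variance(texts: Iterable[str]) -> float:
    """Ratio of unique words to total words across a collection of texts.

    Assumption: words are obtained by lowercasing and splitting on
    whitespace; a different tokenizer would give different values.
    """
    # Pool every word from every text into a single list.
    words = [word for text in texts for word in text.lower().split()]
    if not words:
        # No words at all: return 0.0 instead of dividing by zero.
        return 0.0
    return len(set(words)) / len(words)


if __name__ == "__main__":
    # Two short, made-up impressions: 6 words in total, 4 of them unique.
    sample = ["No acute findings", "No acute disease"]
    print(round(lexical_variance(sample), 2))
```

Under this definition a lower value means more repeated vocabulary, and the ratio tends to shrink as the pooled text grows, so the value depends on how much text is pooled as well as on how repetitive it is.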