Extended Data Table 2

Dataset descriptions

Dataset | Task | Number of samples | Avg. number of tokens (input) | Avg. number of tokens (target) | Lexical variance |
---|---|---|---|---|---|
Open-i | Radiology reports | 3.4K | 52 ± 22 | 14 ± 12 | 0.11 |
MIMIC-CXR | Radiology reports | 128K | 75 ± 31 | 22 ± 17 | 0.08 |
MIMIC-III | Radiology reports | 67K | 160 ± 83 | 61 ± 45 | 0.09 |
MeQSum | Patient questions | 1.2K | 83 ± 67 | 14 ± 6 | 0.21 |
ProbSum | Progress notes | 755 | 1,013 ± 299 | 23 ± 16 | 0.15 |
ACI-Bench | Dialogue | 126 | 1,512 ± 467 | 211 ± 98 | 0.04 |

Task instructions

Task | Instruction |
---|---|
Radiology reports | “Summarize the radiology report findings into an impression with minimal text.” |
Patient questions | “Summarize the patient health query into one question of 15 words or less.” |
Progress notes | “Based on the progress note, generate a list of 3–7 problems (a few words each) ranked in order of importance.” |
Dialogue | “Summarize the patient/doctor dialogue into an assessment and plan.” |

Top, description of six open-source datasets with a wide range of token lengths and lexical variance, defined as the ratio of unique words to total words. Bottom, instructions for each of the four summarization tasks.
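
The two summary statistics in the top table can be illustrated with a minimal sketch. It rests on assumptions not stated here: the tokenizer (lowercased word matching via a regular expression) is a simple stand-in, lexical variance is pooled over all texts in a collection rather than averaged per sample, and the spread is the sample standard deviation. The helper names `tokenize`, `lexical_variance`, and `token_length_stats` are hypothetical, so values computed this way need not match the table.

```python
import re
from statistics import mean, stdev


def tokenize(text):
    """Lowercase the text and extract word tokens.

    Assumption: a simple regex tokenizer; the tokenizer behind the
    table's token counts is not specified in this section.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def lexical_variance(texts):
    """Ratio of unique words to total words, pooled over all texts.

    Assumption: pooling over the whole collection (rather than
    averaging per-sample ratios) is one plausible reading of the caption.
    """
    words = [word for text in texts for word in tokenize(text)]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def token_length_stats(texts):
    """Return (mean, sample standard deviation) of per-text token counts."""
    counts = [len(tokenize(text)) for text in texts]
    spread = stdev(counts) if len(counts) > 1 else 0.0
    return mean(counts), spread


if __name__ == "__main__":
    # Toy inputs for illustration only; not drawn from any listed dataset.
    sample_texts = [
        "No acute cardiopulmonary process.",
        "No acute process.",
        "Mild cardiomegaly.",
    ]
    print(lexical_variance(sample_texts))
    print(token_length_stats(sample_texts))
```

Under the pooled definition, a collection that reuses the same vocabulary across many texts yields a low ratio, while a collection with varied wording yields a higher one.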