Extended Data Table 4 | Comparison of our general approach against dataset-specific baselines.
| Dataset | Baseline | BLEU | ROUGE-L | BERTScore | MEDCON |
|---|---|---|---|---|---|
| Open-i | Ours | 46.0 | 68.2 | 94.7 | 64.9 |
| | ImpressionGPT [52] | - | 65.4 | - | - |
| MIMIC-CXR | Ours | 29.6 | 53.8 | 91.5 | 55.6 |
| | RadAdapt [27] | 18.9 | 44.5 | 90.0 | - |
| | ImpressionGPT [52] | - | 47.9 | - | - |
| MIMIC-III | Ours | 11.5 | 34.5 | 89.0 | 36.5 |
| | RadAdapt [27] | 16.2 | 38.7 | 90.2 | - |
| | Med-PaLM M [25] | 15.2 | 32.0 | - | - |
| Patient questions | Ours | 10.7 | 37.3 | 92.5 | 59.8 |
| | ECL° [53] | - | 50.5 | - | - |
| Progress notes | Ours | 3.4 | 27.2 | 86.1 | 31.5 |
| | CUED [54] | - | 30.1 | - | - |
| Dialogue | Ours | 26.9 | 42.9 | 90.2 | 59.9 |
| | ACI-Bench° [41] | - | 45.6 | - | 57.8 |
Comparison of our general approach (GPT-4 with in-context learning) against baselines specific to each individual dataset. We note that the focus of our study is not achieving state-of-the-art quantitative results, especially given the discordance between NLP metrics and reader-study scores. A dash (-) indicates that the metric was not reported; a degree symbol (°) indicates that the dataset was pre-processed differently.
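For reference, the ROUGE-L column above measures overlap between a generated summary and the reference via the longest common subsequence (LCS). The following is a minimal pure-Python sketch of the LCS-based F-measure over whitespace tokens; it is an illustration only, not the evaluation code used in this study (reported scores typically come from the official `rouge-score` package, which also applies tokenization and stemming):

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L: F-measure based on the longest common subsequence (LCS)
    between reference and candidate token sequences (whitespace-split)."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # Dynamic-programming table: dp[i][j] = LCS length of ref[:i] and cand[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / n   # fraction of candidate tokens in the LCS
    recall = lcs / m      # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("no acute cardiopulmonary process", "no acute process")` yields an LCS of 3 tokens, giving precision 1.0, recall 0.75, and F1 ≈ 0.857.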