Sci Data. 2023 Sep 6;10:586. doi: 10.1038/s41597-023-02487-3

Table 6.

Results of the summarization models evaluated at the full-note level (test set 1).

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | MEDCON |
| --- | --- | --- | --- | --- |
| **Transcript-copy-and-paste** | | | | |
| longest speaker turn | 27.84 | 9.32 | 23.44 | 32.37 |
| longest doctor turn | 27.47 | 9.23 | 23.20 | 32.33 |
| 12 speaker turns | 33.16 | 10.60 | 30.01 | 39.68 |
| 12 doctor turns | 35.88 | 12.44 | 32.72 | 47.79 |
| transcript | 32.84 | 12.53 | 30.61 | 55.65 |
| **Retrieval-based** | | | | |
| train-UMLS | 43.87 | 17.55 | 40.47 | 33.30 |
| train-sent | 41.59 | 15.50 | 38.20 | 26.17 |
| **BART-based** | | | | |
| BART | 41.76 | 19.20 | 34.70 | 43.38 |
| BART (Division) | 51.56 | 24.06 | 45.92 | 47.23 |
| BART + FT-SAMSum | 40.87 | 18.96 | 34.60 | 41.55 |
| BART + FT-SAMSum (Division) | 53.46 | 25.08 | 48.62 | 48.23 |
| BioBART | 39.09 | 17.24 | 33.19 | 42.82 |
| BioBART (Division) | 49.53 | 22.47 | 44.92 | 43.06 |
| **LED-based** | | | | |
| LED | 28.37 | 5.52 | 22.78 | 30.44 |
| LED (Division) | 34.15 | 8.01 | 29.80 | 32.67 |
| LED + FT-PubMed | 27.19 | 5.30 | 21.80 | 27.44 |
| LED + FT-PubMed (Division) | 30.46 | 6.93 | 26.66 | 32.34 |
| **OpenAI (w/o fine-tuning)** | | | | |
| Text-Davinci-002 | 41.08 | 17.27 | 37.46 | 47.39 |
| Text-Davinci-003 | 47.07 | 22.08 | 43.11 | 57.16 |
| ChatGPT | 47.44 | 19.01 | 42.47 | 55.84 |
| GPT-4 | 51.76 | 22.58 | 45.97 | 57.78 |

Simple retrieval-based methods provided strong baselines, with better out-of-the-box performance than the LED models and the full-note BART models. For the fine-tuned BART and LED models, division-based generation, in which the note is produced section by section rather than in one pass (see the sketch below), generally worked better. OpenAI models with simple prompts gave competitive outputs despite no additional fine-tuning or dynamic prompting.
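To make the "(Division)" rows concrete, the following is a minimal sketch of the division idea: each note section is generated from its relevant transcript span and the sections are concatenated into the full note. The model choice, section names, and transcript chunks here are hypothetical illustrations, not the authors' exact pipeline.

```python
# A minimal sketch of division-based generation. Assumptions: a generic
# BART summarizer from Hugging Face transformers, hypothetical section
# names, and hypothetical pre-segmented transcript chunks.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Hypothetical mapping from a note division to the transcript chunk that
# is relevant to it (how chunks are obtained is out of scope here).
transcript_chunks = {
    "SUBJECTIVE": "Doctor: What brings you in today? Patient: My knee ...",
    "ASSESSMENT AND PLAN": "Doctor: Let's order an X-ray and start ...",
}

# Generate each division separately, then concatenate into a full note.
note_sections = []
for section, chunk in transcript_chunks.items():
    summary = summarizer(chunk, max_length=128, min_length=16)[0]["summary_text"]
    note_sections.append(f"{section}:\n{summary}")

print("\n\n".join(note_sections))
```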
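For readers who want to compute scores in the format of the ROUGE columns above, here is a minimal sketch using the Hugging Face `evaluate` package; this is an assumption for illustration, as the paper's exact scoring scripts may differ, and the MEDCON column additionally requires UMLS concept extraction, which is not shown. The note texts are hypothetical placeholders.

```python
# Minimal ROUGE scoring sketch (requires: pip install evaluate rouge_score).
import evaluate

# Hypothetical system output and reference clinical note (full-note level).
predictions = [
    "CHIEF COMPLAINT: Knee pain. HISTORY OF PRESENT ILLNESS: The patient ..."
]
references = [
    "CHIEF COMPLAINT: Right knee pain. HISTORY OF PRESENT ILLNESS: ..."
]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)

# compute() returns F-measures in [0, 1]; the table reports them scaled by 100.
for key in ("rouge1", "rouge2", "rougeL"):
    print(f"{key}: {100 * scores[key]:.2f}")
```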