Table 6. ROUGE-1/2/L and MEDCON scores for the evaluated summarization approaches.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | MEDCON |
---|---|---|---|---|
Transcript-copy-and-paste | ||||
longest speaker turn | 27.84 | 9.32 | 23.44 | 32.37 |
longest doctor turn | 27.47 | 9.23 | 23.20 | 32.33 |
12 speaker turns | 33.16 | 10.60 | 30.01 | 39.68 |
12 doctor turns | 35.88 | 12.44 | 32.72 | 47.79 |
transcript | 32.84 | 12.53 | 30.61 | 55.65 |
Retrieval-based | ||||
train_UMLS | 43.87 | 17.55 | 40.47 | 33.30 |
train_sent | 41.59 | 15.50 | 38.20 | 26.17 |
BART-based | ||||
BART | 41.76 | 19.20 | 34.70 | 43.38 |
BART (Division) | 51.56 | 24.06 | 45.92 | 47.23 |
BART + FT_SAMSum | 40.87 | 18.96 | 34.60 | 41.55 |
BART + FT_SAMSum (Division) | 53.46 | 25.08 | 48.62 | 48.23 |
BioBART | 39.09 | 17.24 | 33.19 | 42.82 |
BioBART (Division) | 49.53 | 22.47 | 44.92 | 43.06 |
LED-based | ||||
LED | 28.37 | 5.52 | 22.78 | 30.44 |
LED (Division) | 34.15 | 8.01 | 29.80 | 32.67 |
LED + FT_PubMed | 27.19 | 5.30 | 21.80 | 27.44 |
LED + FT_PubMed (Division) | 30.46 | 6.93 | 26.66 | 32.34 |
OpenAI (w/o fine-tuning) | ||||
Text-Davinci-002 | 41.08 | 17.27 | 37.46 | 47.39 |
Text-Davinci-003 | 47.07 | 22.08 | 43.11 | 57.16 |
ChatGPT | 47.44 | 19.01 | 42.47 | 55.84 |
GPT-4 | 51.76 | 22.58 | 45.97 | 57.78 |
Simple retrieval-based methods provided strong baselines, with better out-of-the-box performance than the LED models and the full-note BART models. In general, division-based generation worked better for the fine-tuned BART and LED models. OpenAI models with simple prompts produced competitive outputs despite no additional fine-tuning or dynamic prompting.
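For reference, the ROUGE-1/2/L columns above can be reproduced with a standard scorer such as the `rouge_score` package. The snippet below is a minimal sketch, not the paper's actual evaluation harness: the `average_rouge` helper and the demo pair are illustrative assumptions, and the MEDCON column (UMLS-concept-based F1) requires a separate medical concept extractor that is not shown here.

```python
# Minimal sketch: average ROUGE-1/2/L F1 over (reference note, generated note) pairs.
# Requires the `rouge_score` package (pip install rouge-score).
from rouge_score import rouge_scorer


def average_rouge(pairs):
    """pairs: iterable of (reference_note, generated_note) strings.

    Returns mean F1 per ROUGE variant, scaled to 0-100 as in the table above.
    """
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    count = 0
    for reference, generated in pairs:
        scores = scorer.score(reference, generated)  # dict of Score(precision, recall, fmeasure)
        for key in totals:
            totals[key] += scores[key].fmeasure
        count += 1
    return {key: 100.0 * total / count for key, total in totals.items()}


if __name__ == "__main__":
    # Toy example only; real evaluation would iterate over the test-set note pairs.
    demo = [(
        "Patient reports right knee pain for two weeks.",
        "The patient has had pain in the right knee for two weeks.",
    )]
    print(average_rouge(demo))
```

Corpus-level scores reported in papers may differ slightly depending on tokenization, stemming, and whether scores are averaged per note or computed over concatenated summaries.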