Table 7. Evaluation scores on the subjective division.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|
| Retrieval-based | | | | | | | |
| train-UMLS | 41.70 | 23.45 | 31.64 | 72.10 | 39.01 | 23.04 | 41.60 |
| train-sent | 41.12 | 20.62 | 29.20 | 70.78 | 37.94 | 18.86 | 39.47 |
| BART-based | | | | | | | |
| BART | 48.19 | 25.81 | 30.13 | 68.93 | 43.83 | 44.41 | 47.97 |
| BART (Division) | 47.25 | 26.05 | 31.21 | 70.05 | 43.55 | 44.20 | 48.16 |
| BART + FT-SAMSum | 46.33 | 25.52 | 29.88 | 68.68 | 45.01 | 43.21 | 47.70 |
| BART + FT-SAMSum (Division) | 52.44 | 30.44 | 35.83 | 72.41 | 44.51 | 47.84 | 51.08 |
| BioBART | 45.79 | 23.65 | 28.96 | 68.49 | 41.09 | 41.10 | 45.87 |
| BioBART (Division) | 46.29 | 25.99 | 32.43 | 70.30 | 42.99 | 41.14 | 47.33 |
| LED-based | | | | | | | |
| LED | 24.81 | 5.29 | 11.00 | 55.60 | 30.68 | 20.19 | 30.04 |
| LED (Division) | 31.27 | 8.31 | 15.99 | 56.94 | 25.40 | 24.03 | 31.22 |
| LED + FT-PubMed | 23.48 | 4.72 | 10.49 | 54.46 | 20.32 | 17.91 | 26.40 |
| LED + FT-PubMed (Division) | 26.03 | 6.17 | 12.93 | 56.41 | 19.19 | 20.46 | 27.78 |
| OpenAI (w/o FT) | | | | | | | |
| Text-Davinci-002 | 29.73 | 12.38 | 20.13 | 58.98 | 36.70 | 32.47 | 37.22 |
| Text-Davinci-003 | 33.29 | 15.24 | 23.76 | 60.63 | 38.06 | 36.14 | 39.73 |
| ChatGPT | 32.70 | 14.05 | 22.69 | 65.14 | 39.48 | 38.21 | 41.49 |
| GPT-4 | 41.20 | 19.02 | 26.56 | 63.34 | 43.18 | 44.25 | 44.93 |
BART-based models performed at similar levels whether they generated at the full-note or the division level, and in general outperformed the other model classes. As in the full-note evaluation, retrieval-based methods provided competitive baselines.
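For readers who want to reproduce this style of scoring, the sketch below illustrates how per-example metrics could be computed and macro-averaged into table columns like those above. It is a minimal illustration, assuming the off-the-shelf `rouge-score` and `bert-score` Python packages, and is not the evaluation pipeline used in this work; BLEURT and MEDCON are omitted for brevity and would be added analogously with their own scorers.

```python
# Minimal sketch of macro-averaged evaluation metrics (not the authors' released code).
# Assumes: pip install rouge-score bert-score
from statistics import mean

from rouge_score import rouge_scorer
from bert_score import score as bert_score


def evaluate(predictions, references):
    """Return macro-averaged metrics (in percent) over prediction/reference pairs."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge1, rouge2, rougeL = [], [], []
    for pred, ref in zip(predictions, references):
        s = scorer.score(ref, pred)  # reference first, candidate second
        rouge1.append(s["rouge1"].fmeasure)
        rouge2.append(s["rouge2"].fmeasure)
        rougeL.append(s["rougeL"].fmeasure)

    # BERTScore is computed in one batched call; the F1 component is reported.
    _, _, f1 = bert_score(predictions, references, lang="en")

    metrics = {
        "ROUGE-1": 100 * mean(rouge1),
        "ROUGE-2": 100 * mean(rouge2),
        "ROUGE-L": 100 * mean(rougeL),
        "BERTScore": 100 * f1.mean().item(),
    }
    # Table 7's "Average" column is the mean over all six metric columns;
    # here only the implemented subset is averaged.
    metrics["Average"] = mean(metrics.values())
    return metrics
```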