Sci Data. 2023 Sep 6;10:586. doi: 10.1038/s41597-023-02487-3

Table 7.

Results of the summarization models on the subjective division, test set 1.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|
| **Retrieval-based** | | | | | | | |
| train-UMLS | 41.70 | 23.45 | 31.64 | 72.10 | 39.01 | 23.04 | 41.60 |
| train-sent | 41.12 | 20.62 | 29.20 | 70.78 | 37.94 | 18.86 | 39.47 |
| **BART-based** | | | | | | | |
| BART | 48.19 | 25.81 | 30.13 | 68.93 | 43.83 | 44.41 | 47.97 |
| BART (Division) | 47.25 | 26.05 | 31.21 | 70.05 | 43.55 | 44.20 | 48.16 |
| BART + FT-SAMSum | 46.33 | 25.52 | 29.88 | 68.68 | 45.01 | 43.21 | 47.70 |
| BART + FT-SAMSum (Division) | 52.44 | 30.44 | 35.83 | 72.41 | 44.51 | 47.84 | 51.08 |
| BioBART | 45.79 | 23.65 | 28.96 | 68.49 | 41.09 | 41.10 | 45.87 |
| BioBART (Division) | 46.29 | 25.99 | 32.43 | 70.30 | 42.99 | 41.14 | 47.33 |
| **LED-based** | | | | | | | |
| LED | 24.81 | 5.29 | 11.00 | 55.60 | 30.68 | 20.19 | 30.04 |
| LED (Division) | 31.27 | 8.31 | 15.99 | 56.94 | 25.40 | 24.03 | 31.22 |
| LED + FT-PubMed | 23.48 | 4.72 | 10.49 | 54.46 | 20.32 | 17.91 | 26.40 |
| LED + FT-PubMed (Division) | 26.03 | 6.17 | 12.93 | 56.41 | 19.19 | 20.46 | 27.78 |
| **OpenAI (w/o fine-tuning)** | | | | | | | |
| Text-Davinci-002 | 29.73 | 12.38 | 20.13 | 58.98 | 36.70 | 32.47 | 37.22 |
| Text-Davinci-003 | 33.29 | 15.24 | 23.76 | 60.63 | 38.06 | 36.14 | 39.73 |
| ChatGPT | 32.70 | 14.05 | 22.69 | 65.14 | 39.48 | 38.21 | 41.49 |
| GPT-4 | 41.20 | 19.02 | 26.56 | 63.34 | 43.18 | 44.25 | 44.93 |

BART-based models performed at similar levels whether generating at the full-note or division level, and their performance was in general better than that of the other model classes. As in the full-note evaluation, the retrieval-based methods provided competitive baselines.
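For readers who want to reproduce scores of this kind, the following is a minimal sketch of how the lexical and embedding-based metrics in the table could be computed with the Hugging Face `evaluate` library. The example texts are illustrative only; BLEURT requires an additional checkpoint-backed package, and MEDCON (a UMLS concept-based F1 score) needs a separate medical concept extractor, so neither is shown here.

```python
# Hedged sketch: computing ROUGE and BERTScore for a summarization output.
# The example prediction/reference pair is hypothetical, not from the dataset.
import evaluate

predictions = ["patient reports chest pain radiating to the left arm"]
references = ["patient complains of chest pain radiating down the left arm"]

# ROUGE-1 / ROUGE-2 / ROUGE-L F-measures (reported here scaled to 0-100).
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

# BERTScore: token-level semantic similarity from contextual embeddings.
bertscore = evaluate.load("bertscore")
bert_scores = bertscore.compute(
    predictions=predictions, references=references, lang="en"
)

for name in ("rouge1", "rouge2", "rougeL"):
    print(f"{name}: {100 * rouge_scores[name]:.2f}")
print(f"BERTScore F1: {100 * bert_scores['f1'][0]:.2f}")
```

In practice these metrics would be computed over all system outputs in the test set and then aggregated; how the paper combines the six metrics into the Average column is not specified in this excerpt.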