Sci Data. 2023 Sep 6;10:586. doi: 10.1038/s41597-023-02487-3

Table 10.

Results of the summarization models on the assessment_and_plan division, test set 1.

Model                               ROUGE-1  ROUGE-2  ROUGE-L  BERTScore  BLEURT  MEDCON  Average
Retrieval-based
train_UMLS                            44.59    21.50    29.66      70.39   44.77   24.70    42.94
train_sent                            41.28    19.73    28.02      69.48   43.18   18.79    40.28
BART-based
BART                                   0.00     0.00     0.00       0.00   29.05    0.00     7.26
BART (Division)                       43.31    20.59    26.55      67.49   40.99   32.30    42.73
BART + FT_SAMSum                       1.52     0.49     0.87      35.38   19.79    1.00    14.28
BART + FT_SAMSum (Division)           43.89    21.37    27.56      68.09   41.96   31.33    43.08
BioBART                                0.00     0.00     0.00       0.00   29.05    0.00     7.26
BioBART (Division)                    42.44    19.44    26.42      67.57   43.88   31.12    43.00
LED-based
LED                                    0.00     0.00     0.00       0.00   29.05    0.00     7.26
LED (Division)                        28.23     6.13    12.44      55.75   27.78   21.94    30.27
LED + FT_PubMed                        0.00     0.00     0.00       0.00   29.05    0.00     7.26
LED + FT_PubMed (Division)            28.00     5.99    13.07      55.68   20.95   25.01    29.33
OpenAI (w/o FT)
Text-Davinci-002                      30.90    12.27    21.44      61.01   44.98   35.04    40.64
Text-Davinci-003                      35.41    14.86    25.38      63.97   49.18   46.40    46.19
ChatGPT                               36.43    12.50    23.32      63.56   48.21   43.71    44.89
GPT-4                                 38.16    14.12    24.90      64.26   49.41   42.36    45.44
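The Average column is consistent with a mean over four terms: the three ROUGE variants averaged into a single ROUGE score, then that score, BERTScore, BLEURT, and MEDCON weighted equally. A minimal sketch of that computation (the function name is ours, not from the paper):

```python
def average_score(rouge1, rouge2, rougeL, bertscore, bleurt, medcon):
    # Collapse the three ROUGE variants into one score, then average
    # it with BERTScore, BLEURT, and MEDCON, each weighted equally.
    rouge_mean = (rouge1 + rouge2 + rougeL) / 3
    return (rouge_mean + bertscore + bleurt + medcon) / 4

# Text-Davinci-003 row from the table:
print(round(average_score(35.41, 14.86, 25.38, 63.97, 49.18, 46.40), 2))  # → 46.19
```

This also reproduces the degenerate rows: a model that outputs nothing scores 0 on every metric except BLEURT (29.05), giving 29.05 / 4 = 7.26, exactly as reported for BART, BioBART, and LED.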

As with the objective_exam and objective_results divisions, the BART and LED full-note generation models suffered a significant drop on the assessment_and_plan division. This may be attributable to this text appearing later in the input sequence. The OpenAI models were in general the best performers, with the BART division-based models next best.