Sci Data. 2023 Sep 6;10:586. doi: 10.1038/s41597-023-02487-3

Table 9.

Results of the summarization models on the objective_results division, test set 1.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | BLEURT | MEDCON | Average |
|---|---|---|---|---|---|---|---|
| **Retrieval-based** | | | | | | | |
| train_UMLS | 30.26 | 14.89 | 29.87 | 66.24 | 37.25 | 8.91 | 34.35 |
| train_sent | 40.52 | 18.21 | 38.87 | 73.33 | 45.79 | 12.45 | 41.03 |
| **BART-based** | | | | | | | |
| BART | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| BART (Division) | 30.48 | 19.16 | 27.80 | 66.64 | 43.07 | 21.56 | 39.27 |
| BART + FT_SAMSum | 20.79 | 0.46 | 20.67 | 54.54 | 28.32 | 0.77 | 24.40 |
| BART + FT_SAMSum (Division) | 29.45 | 18.01 | 26.63 | 66.43 | 40.75 | 20.17 | 38.01 |
| BioBART | 17.50 | 0.00 | 17.50 | 52.44 | 25.33 | 0.00 | 22.36 |
| BioBART (Division) | 35.38 | 14.33 | 32.79 | 68.40 | 47.63 | 15.69 | 39.81 |
| **LED-based** | | | | | | | |
| LED | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| LED (Division) | 14.04 | 4.97 | 11.08 | 48.86 | 9.61 | 7.86 | 19.09 |
| LED + FT_PubMed | 0.00 | 0.00 | 0.00 | 0.00 | 5.45 | 0.00 | 1.36 |
| LED + FT_PubMed (Division) | 10.48 | 3.64 | 8.32 | 42.43 | 7.13 | 8.86 | 16.48 |
| **OpenAI (w/o fine-tuning)** | | | | | | | |
| Text-Davinci-002 | 41.48 | 20.12 | 39.95 | 70.61 | 50.79 | 24.42 | 44.92 |
| Text-Davinci-003 | 44.92 | 25.21 | 43.84 | 72.35 | 55.87 | 29.37 | 48.90 |
| ChatGPT | 34.50 | 17.75 | 30.84 | 66.68 | 48.51 | 22.28 | 41.29 |
| GPT-4 | 37.65 | 19.94 | 35.73 | 68.33 | 48.50 | 26.73 | 43.67 |
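
All scores above are reported on a 0–100 scale. As a rough illustration of how such scores can be computed for a single reference/prediction pair, the sketch below uses the `rouge_score` and `bert_score` packages; the package choices and settings (stemming, BERTScore's default English model) are assumptions rather than the paper's exact configuration, and BLEURT and MEDCON are omitted because they require a trained checkpoint and UMLS access, respectively.

```python
# Minimal sketch: lexical (ROUGE) and semantic (BERTScore) overlap for one
# reference/prediction pair, scaled to 0-100. Toy clinical text for demonstration.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "RESULTS: Hemoglobin A1c is 6.8. Lipid panel within normal limits."
prediction = "RESULTS: A1c of 6.8; lipids normal."

# ROUGE-1/2/L F-measures (stemming enabled here as an assumption).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = {k: 100 * v.fmeasure for k, v in scorer.score(reference, prediction).items()}

# BERTScore F1 with the library's default English model.
_, _, f1 = bert_score([prediction], [reference], lang="en")
scores["bertscore"] = 100 * f1.mean().item()

print(scores)
```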

As with objective_exam, the BART and LED full-note generation models suffered a significant drop on the objective_results division. This may be attributable to the higher sparsity of this division, its small amount of content (sometimes only 2–3 sentences), and the fact that this text appears later in the sequence. The OpenAI models were in general the strongest performers, with the BART division-based models next best.
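
To make the contrast concrete, the sketch below shows one way a division-based pipeline can be organized: rather than generating the full note in one pass (where late-appearing divisions risk being truncated or under-weighted), each division is generated independently and the pieces are concatenated. The model name, the conditioning-by-prefix scheme, and the restriction to the two divisions named in this section are illustrative assumptions, not the paper's implementation.

```python
# Hedged illustration of division-based note generation with a generic
# Hugging Face summarization pipeline. facebook/bart-large-cnn is a
# placeholder checkpoint, not the paper's fine-tuned model.
from transformers import pipeline

DIVISIONS = ["objective_exam", "objective_results"]  # subset named in the text

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def generate_note(transcript: str) -> str:
    sections = []
    for division in DIVISIONS:
        # Prefixing the target division is one simple way to condition a
        # single seq2seq model; one model per division is another option.
        out = summarizer(f"{division}: {transcript}",
                         max_length=128, truncation=True)
        sections.append(f"{division.upper()}\n{out[0]['summary_text']}")
    return "\n\n".join(sections)
```

One practical upside of this decomposition, consistent with the results above, is that each target sequence is short and each division always appears at the start of its own generation problem, sidestepping the late-position issue noted for the full-note models.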