2024 Feb 2;37(2):471–488. doi: 10.1007/s10278-024-00985-3

Table 5.

Performance of PEGASUS in the external test set

| Test set | BARTScore + PET (↑) | PEGASUSScore + PET (↑) | ROUGE-1 (↑) | ROUGE-2 (↑) | ROUGE-L (↑) | BLEU (↑) | BERTScore (↑) |
|---|---|---|---|---|---|---|---|
| Internal test | -1.47 [-1.48, -1.46] | -1.44 [-1.45, -1.42] | 53.8 [53.4, 54.2] | 30.9 [30.5, 31.4] | 40.0 [39.6, 40.5] | 24.7 [24.2, 25.1] | 0.747 [0.735, 0.739] |
| External test using Physician 1’s style | -1.66 [-1.70, -1.62] | -1.72 [-1.77, -1.67] | 38.6 [36.9, 40.2] | 14.8 [13.5, 16.1] | 26.2 [24.9, 27.6] | 11.1 [9.9, 12.3] | 0.671 [0.662, 0.679] |
| External test using Physician 2’s style | -1.68 [-1.73, -1.63] | -1.67 [-1.72, -1.61] | 38.5 [36.5, 40.5] | 15.9 [14.1, 17.8] | 29.2 [27.2, 31.3] | 11.5 [9.8, 13.4] | 0.679 [0.668, 0.691] |
| External test using Physician 3’s style | -1.73 [-1.78, -1.68] | -1.75 [-1.81, -1.69] | 42.2 [40.6, 43.8] | 18.1 [16.5, 19.7] | 30.0 [28.4, 31.8] | 13.3 [11.8, 14.9] | 0.688 [0.679, 0.697] |

Note that BARTScore and PEGASUSScore compute the log-probability of generating one text given another text, so they range from negative infinity to 0. The other metrics (ROUGE, BLEU and BERTScore) compute the F1 score of n-gram overlap or semantic similarity, ranging from 0 to 1 (or 0 to 100% when converted to a percentage). In this table, ROUGE and BLEU are reported as percentages and BERTScore on the 0 to 1 scale. For all of these metrics, a higher value (less negative or more positive) indicates better performance. Data are shown as mean [2.5th percentile, 97.5th percentile].
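As a rough illustration of the two ideas in this note, the sketch below shows (a) why an averaged token log-probability always falls between negative infinity and 0, and (b) one common way to obtain a "mean [2.5th percentile, 97.5th percentile]" summary. The note does not say how the percentiles were derived, so the bootstrap resampling used here is an assumption, and the function names (`mean_log_prob`, `percentile`, `bootstrap_summary`) and the example scores are hypothetical, not taken from the paper.

```python
import math
import random
import statistics


def mean_log_prob(token_probs):
    """Average log-probability of a token sequence.

    Each probability lies in (0, 1], so each log is <= 0 and the
    average lies in (-inf, 0]; values closer to 0 are better.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def bootstrap_summary(scores, n_resamples=1000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of bootstrap means.

    Assumption: per-report scores are resampled with replacement; the
    paper's actual procedure is not described in this table note.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.fmean(resample))
    means.sort()
    return statistics.fmean(means), percentile(means, 2.5), percentile(means, 97.5)


if __name__ == "__main__":
    # Hypothetical per-report ROUGE-L scores, made up for illustration only.
    example_scores = [28.1, 31.4, 25.9, 33.0, 29.7, 27.2, 30.5, 26.8]
    mean, low, high = bootstrap_summary(example_scores)
    print(f"{mean:.1f} [{low:.1f}, {high:.1f}]")
```

With this layout, a perfect match under a log-probability metric would score 0 (every token predicted with probability 1), which is why the less negative values in the internal test row indicate better performance than the external test rows.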