2024 Feb 2;37(2):471–488. doi: 10.1007/s10278-024-00985-3

Table 5.

Performance of PEGASUS in the external test set

Test set	BARTScore + PET (↑)	PEGASUSScore + PET (↑)	ROUGE-1 (↑)	ROUGE-2 (↑)	ROUGE-L (↑)	BLEU (↑)	BERTScore (↑)
Internal test	-1.47 [-1.48, -1.46]	-1.44 [-1.45, -1.42]	53.8 [53.4, 54.2]	30.9 [30.5, 31.4]	40.0 [39.6, 40.5]	24.7 [24.2, 25.1]	0.747 [0.735, 0.739]
External test using Physician 1’s style	-1.66 [-1.70, -1.62]	-1.72 [-1.77, -1.67]	38.6 [36.9, 40.2]	14.8 [13.5, 16.1]	26.2 [24.9, 27.6]	11.1 [9.9, 12.3]	0.671 [0.662, 0.679]
External test using Physician 2’s style	-1.68 [-1.73, -1.63]	-1.67 [-1.72, -1.61]	38.5 [36.5, 40.5]	15.9 [14.1, 17.8]	29.2 [27.2, 31.3]	11.5 [9.8, 13.4]	0.679 [0.668, 0.691]
External test using Physician 3’s style	-1.73 [-1.78, -1.68]	-1.75 [-1.81, -1.69]	42.2 [40.6, 43.8]	18.1 [16.5, 19.7]	30.0 [28.4, 31.8]	13.3 [11.8, 14.9]	0.688 [0.679, 0.697]


Note that BARTScore and PEGASUSScore compute the log-probability of generating one text given another, so they range from negative infinity to 0. The other metrics (ROUGE, BLEU and BERTScore) compute the F1 score of n-gram overlap or semantic similarity and range from 0 to 1 (or 0 to 100 when expressed as a percentage); in this table, ROUGE and BLEU are shown as percentages and BERTScore on the 0 to 1 scale. For all of these metrics, a higher value (less negative or more positive) indicates better performance. Data are shown as mean [2.5th percentile, 97.5th percentile].
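To make the "mean [2.5th percentile, 97.5th percentile]" format concrete, the following is a minimal sketch of how such a cell could be produced from per-report scores. It assumes the percentiles are taken over bootstrap resamples of the test set, which the table note does not state. The function names (`percentile`, `bootstrap_summary`, `format_cell`), the resample count, and the seed are illustrative choices, not the authors' code.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_summary(scores, n_resamples=1000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile).

    Assumption: the interval is the percentile interval of the mean over
    bootstrap resamples (sampling the scores with replacement).
    """
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    return (
        statistics.fmean(scores),
        percentile(resampled_means, 2.5),
        percentile(resampled_means, 97.5),
    )


def format_cell(mean, low, high, digits=1):
    """Render a table cell in the 'mean [low, high]' layout used above."""
    return f"{mean:.{digits}f} [{low:.{digits}f}, {high:.{digits}f}]"


if __name__ == "__main__":
    # Hypothetical per-report ROUGE-1 scores, for illustration only.
    example_scores = [41.0, 35.5, 44.2, 38.9, 36.1, 40.3, 42.7, 37.8]
    print(format_cell(*bootstrap_summary(example_scores)))
```

Passing `digits=3` gives the three-decimal layout used in the BERTScore column.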