Table 5.

| | BARTScore + PET (↑) | PEGASUSScore + PET (↑) | ROUGE-1 (↑) | ROUGE-2 (↑) | ROUGE-L (↑) | BLEU (↑) | BERTScore (↑) |
|---|---|---|---|---|---|---|---|
| Internal test | -1.47 [-1.48, -1.46] | -1.44 [-1.45, -1.42] | 53.8 [53.4, 54.2] | 30.9 [30.5, 31.4] | 40.0 [39.6, 40.5] | 24.7 [24.2, 25.1] | 0.747 [0.735, 0.739] |
| External test using Physician 1’s style | -1.66 [-1.70, -1.62] | -1.72 [-1.77, -1.67] | 38.6 [36.9, 40.2] | 14.8 [13.5, 16.1] | 26.2 [24.9, 27.6] | 11.1 [9.9, 12.3] | 0.671 [0.662, 0.679] |
| External test using Physician 2’s style | -1.68 [-1.73, -1.63] | -1.67 [-1.72, -1.61] | 38.5 [36.5, 40.5] | 15.9 [14.1, 17.8] | 29.2 [27.2, 31.3] | 11.5 [9.8, 13.4] | 0.679 [0.668, 0.691] |
| External test using Physician 3’s style | -1.73 [-1.78, -1.68] | -1.75 [-1.81, -1.69] | 42.2 [40.6, 43.8] | 18.1 [16.5, 19.7] | 30.0 [28.4, 31.8] | 13.3 [11.8, 14.9] | 0.688 [0.679, 0.697] |
Note that BARTScore and PEGASUSScore compute the log-probability of generating one text given another, so they range from negative infinity to 0. The other metrics (ROUGE, BLEU and BERTScore) compute the F1 score of n-gram overlap or semantic similarity, ranging from 0 to 1 (or 0 to 100 when expressed as a percentage); here ROUGE and BLEU are shown as percentages and BERTScore on the 0 to 1 scale. For all of these metrics, a higher value (less negative or more positive) indicates better performance. Data are shown as mean [2.5th percentile, 97.5th percentile].
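The "mean [2.5th percentile, 97.5th percentile]" format used in each cell can be illustrated with a small sketch. The note does not say how the percentiles were obtained, so the code below assumes, purely for illustration, that they come from bootstrap resampling of per-example scores. The function names, the resample count, and the sample scores are all hypothetical and are not taken from the paper.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def mean_with_percentile_interval(scores, n_resamples=2000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile).

    Assumption: the percentiles are taken over the means of bootstrap
    resamples (sampling with replacement) of the per-example scores.
    """
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    return (
        statistics.fmean(scores),
        percentile(resampled_means, 2.5),
        percentile(resampled_means, 97.5),
    )


def format_cell(mean, low, high, digits=2):
    """Render one table cell as 'mean [low, high]'."""
    return f"{mean:.{digits}f} [{low:.{digits}f}, {high:.{digits}f}]"


# Hypothetical per-example log-probability scores (not data from the table).
example_scores = [-1.2, -1.5, -1.7, -1.4, -1.6, -1.3, -1.8, -1.5]
print(format_cell(*mean_with_percentile_interval(example_scores)))
```

Because log-probability scores such as BARTScore are at most 0, a cell like `-1.47 [-1.48, -1.46]` reads the same way as the percentage-based cells: values closer to 0 are better, and a narrower bracket indicates less variation across resamples under the assumed procedure.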