Preprint. arXiv:2402.14359v2 (version 2, posted 2025 May 2; originally published 2024 Feb 22).

Table 3.

Performance of various summarization systems under different metrics on the PubMed dataset. Black cells indicate the best result, while yellow cells denote the second best. Overall, the smaller LongT5 competes well with Llama2 across metrics. Specifically, FM-based methods and human annotation tend to favor Llama2, in contrast to existing metrics that primarily rely on n-gram overlap calculations, such as ROUGE.

               | Single-score               | QA/Verification-based | Facet-aware
Model          | ROUGE-L  BERTScore  ACU    | QuestEval  G-EVAL     | FM(Llama2)  FM(GPT-3.5)  FM(GPT-4)  Human
---------------+----------------------------+-----------------------+---------------------------------------------
GPT-3.5        | 0.2109   0.8408     0.1914 | 0.2333     0.9143     | 0.7691      0.6343       0.6623     0.6780
Llama2         | 0.2223   0.8408     0.2126 | 0.2678     0.7633     | 0.8769      0.7228       0.7120     0.7704
LongT5         | 0.2832   0.8534     0.2533 | 0.2699     0.8367     | 0.7719      0.6591       0.6818     0.7241
LongT5-block   | 0.2345   0.8408     0.2128 | 0.2496     0.4939     | 0.7207      0.6283       0.6628     0.6782
BigBird        | 0.2240   0.8317     0.2127 | 0.2376     0.4939     | 0.6687      0.5947       0.5649     0.6186
BigBird-block  | 0.2127   0.8383     0.1918 | 0.2392     0.4327     | 0.7347      0.6475       0.6167     0.6317
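
The caption's contrast hinges on how overlap-based metrics such as ROUGE-L score a summary: they reward shared word sequences rather than shared facets. The sketch below is a minimal, illustrative ROUGE-L computation (longest-common-subsequence F1 over whitespace tokens). It is an assumption-laden teaching example, not the evaluation code used in the paper; actual experiments would typically rely on a maintained package such as rouge-score.

```python
# Illustrative ROUGE-L (LCS-based F1), assuming whitespace tokenization.
# This is a sketch for intuition, not the paper's evaluation pipeline.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS precision (vs. candidate length) and recall (vs. reference length)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical candidate/reference pair: same content, different phrasing,
# so the sequence-overlap score stays low despite semantic equivalence.
print(round(rouge_l_f1("the model summarizes the study well",
                       "the study is summarized well by the model"), 4))
```

Because such a score depends entirely on surface token order, two summaries covering identical facets in different wording can diverge sharply, which is consistent with the table's pattern of overlap-based metrics favoring LongT5 while FM-based scores and human annotation favor Llama2.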