Table 3. Performance of various summarization systems under different metrics on the PubMed dataset. Black cells indicate the best result, while yellow cells denote the second best. In general, the smaller LongT5 competes well with Llama2 across metrics. Specifically, FM-based methods and human annotation tend to favor Llama2, in contrast to existing metrics that rely primarily on n-gram overlap, such as ROUGE.
 | Single-score |  |  | QA/Verification-based |  | Facet-aware |  |  | 
---|---|---|---|---|---|---|---|---|---
Model | ROUGE-L | BERTScore | ACU | QuestEval | G-EVAL | FM(Llama2) | FM(GPT-3.5) | FM(GPT-4) | Human
GPT-3.5 | 0.2109 | 0.8408 | 0.1914 | 0.2333 | 0.9143 | 0.7691 | 0.6343 | 0.6623 | 0.6780 |
Llama2 | 0.2223 | 0.8408 | 0.2126 | 0.2678 | 0.7633 | 0.8769 | 0.7228 | 0.7120 | 0.7704 |
LongT5 | 0.2832 | 0.8534 | 0.2533 | 0.2699 | 0.8367 | 0.7719 | 0.6591 | 0.6818 | 0.7241 |
LongT5-block | 0.2345 | 0.8408 | 0.2128 | 0.2496 | 0.4939 | 0.7207 | 0.6283 | 0.6628 | 0.6782 |
BigBird | 0.2240 | 0.8317 | 0.2127 | 0.2376 | 0.4939 | 0.6687 | 0.5947 | 0.5649 | 0.6186 |
BigBird-block | 0.2127 | 0.8383 | 0.1918 | 0.2392 | 0.4327 | 0.7347 | 0.6475 | 0.6167 | 0.6317 |
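For reference, the single-score columns can be reproduced with standard open-source packages. The snippet below is a minimal sketch, assuming one candidate summary and one reference abstract and using the `rouge-score` and `bert-score` libraries; it is not the exact evaluation pipeline behind the table, and the example texts are hypothetical.

```python
# Minimal sketch: ROUGE-L and BERTScore for a single candidate/reference pair.
# Assumes `pip install rouge-score bert-score`; texts below are illustrative only.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The study reports improved outcomes with the new treatment."
reference = "The trial found that the new treatment significantly improved patient outcomes."

# ROUGE-L: longest-common-subsequence F-measure between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore: similarity of contextual token embeddings; the F1 component is reported.
_, _, f1 = bert_score([candidate], [reference], lang="en", verbose=False)

print(f"ROUGE-L F1:   {rouge_l:.4f}")
print(f"BERTScore F1: {f1.mean().item():.4f}")
```

In practice, scores such as those in Table 3 are averaged over all candidate summaries in the test split rather than computed for a single pair.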