Table 3. Performance of various summarization systems under different metrics on the PubMed dataset. Black cells indicate the best result, while yellow cells denote the second best. In general, the smaller LongT5 competes well with Llama2 across metrics. Specifically, FM-based methods and human annotation tend to favor Llama2, in contrast to existing metrics that rely primarily on n-gram overlap, such as ROUGE.
 | Single-score |  |  | QA/Verification-based |  | Facet-aware |  |  | 
---|---|---|---|---|---|---|---|---|---
Model | ROUGE-L | BERTScore | ACU | QuestEval | G-EVAL | FM(Llama2) | FM(GPT-3.5) | FM(GPT-4) | Human
GPT-3.5 | 0.2109 | 0.8408 | 0.1914 | 0.2333 | 0.9143 | 0.7691 | 0.6343 | 0.6623 | 0.6780 |
Llama2 | 0.2223 | 0.8408 | 0.2126 | 0.2678 | 0.7633 | 0.8769 | 0.7228 | 0.7120 | 0.7704 |
LongT5 | 0.2832 | 0.8534 | 0.2533 | 0.2699 | 0.8367 | 0.7719 | 0.6591 | 0.6818 | 0.7241 |
LongT5-block | 0.2345 | 0.8408 | 0.2128 | 0.2496 | 0.4939 | 0.7207 | 0.6283 | 0.6628 | 0.6782 |
BigBird | 0.2240 | 0.8317 | 0.2127 | 0.2376 | 0.4939 | 0.6687 | 0.5947 | 0.5649 | 0.6186 |
BigBird-block | 0.2127 | 0.8383 | 0.1918 | 0.2392 | 0.4327 | 0.7347 | 0.6475 | 0.6167 | 0.6317 |
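For reference, the single-score columns can be reproduced with standard open-source packages. The snippet below is a minimal sketch, assuming one candidate summary and one reference abstract and using the `rouge-score` and `bert-score` libraries; it is not the exact evaluation pipeline behind the table, and the example texts are hypothetical.

```python
# Minimal sketch: ROUGE-L and BERTScore for a single candidate/reference pair.
# Assumes `pip install rouge-score bert-score`; texts below are illustrative only.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "The study reports improved outcomes with the new treatment."
reference = "The trial found that the new treatment significantly improved patient outcomes."

# ROUGE-L: longest-common-subsequence F-measure between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BERTScore: similarity of contextual token embeddings; the F1 component is reported.
_, _, f1 = bert_score([candidate], [reference], lang="en", verbose=False)

print(f"ROUGE-L F1:   {rouge_l:.4f}")
print(f"BERTScore F1: {f1.mean().item():.4f}")
```

In practice, scores such as those in Table 3 are averaged over all candidate summaries in the test split rather than computed for a single pair.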