Table 1:
Category | Definition | Corresponding evaluation metrics included in this study |
---|---|---|
Lexical overlap-based metrics | These metrics measure the overlap between the generated text and the reference in terms of textual units, such as n-grams or word sequences. | ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L, ROUGE-LSUM, BLEU, CHRF, METEOR, CIDEr
Embedding-based metrics | These metrics measure the semantic similarity between the generated and reference texts using pretrained embeddings. | ROUGE-WE-1, ROUGE-WE-2, ROUGE-WE-3, BERTScore, MoverScore |
Graph-based metrics | These metrics construct graphs using entities and their relations extracted from the sentences, and evaluate the summary based on these graphs. | RadGraph |
Text generation-based metrics | These metrics assess the quality of generated text by framing evaluation as a text generation task using sequence-to-sequence language models. | BARTScore, BARTScore+PET, PEGASUSScore+PET, T5Score+PET, PRISM
Supervised regression-based metrics | These metrics require human annotations to train a parametrized regression model to predict human judgments for the given text. | S3-pyr, S3-resp |
Question answering-based metrics | These metrics formulate the evaluation process as a question-answering task by guiding the model with various questions. | UniEval |
Reference-free metrics | These metrics do not require the reference text to assess the quality of the generated text. Instead, they compare the generated text against the source document. | SummaQA, BLANC, SUPERT, Stats-compression, Stats-coverage, Stats-density, Stats-novel trigram |
Note that we included 17 different evaluation methods to assess model performance. Because several methods encompass multiple variants, this yields a total of 30 metrics. A detailed overview of these metrics can be found in Appendix S4; a brief illustrative sketch of the first two categories follows below.
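As a concrete illustration of how the first two categories in Table 1 are typically computed, the sketch below scores a single generated summary against a reference with a lexical overlap-based metric (ROUGE) and an embedding-based metric (BERTScore). It is a minimal sketch using the publicly available `rouge_score` and `bert_score` Python packages; the example texts are invented for illustration, and the exact toolkits, model checkpoints, and configurations used in this study may differ (see Appendix S4).

```python
# Minimal sketch: score one generated summary against one reference.
# Assumes the `rouge_score` and `bert_score` packages are installed;
# BERTScore downloads a pretrained contextual embedding model on first use.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The patient shows mild cardiomegaly with no pleural effusion."  # illustrative text
candidate = "Mild cardiomegaly is present; no pleural effusion is seen."     # illustrative text

# Lexical overlap: ROUGE counts shared n-grams and longest common subsequences.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)  # dict of Score(precision, recall, fmeasure)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Embedding-based: BERTScore matches candidate and reference tokens in a
# pretrained embedding space and reports precision, recall, and F1.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```

By contrast, reference-free metrics such as SUPERT or BLANC would compare the candidate against the source document rather than the reference summary.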