Table 1:
Category | Definition | Corresponding evaluation metrics included in this study |
---|---|---|
Lexical overlap-based metrics | These metrics measure the overlap between the generated text and the reference in terms of textual units, such as n-grams or word sequences. | ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L, ROUGE-LSUM, BLEU, CHRF, METEOR, CIDEr
Embedding-based metrics | These metrics measure the semantic similarity between the generated and reference texts using pretrained embeddings. | ROUGE-WE-1, ROUGE-WE-2, ROUGE-WE-3, BERTScore, MoverScore |
Graph-based metrics | These metrics construct graphs using entities and their relations extracted from the sentences, and evaluate the summary based on these graphs. | RadGraph |
Text generation-based metrics | These metrics assess the quality of generated text by framing evaluation as a text generation task using sequence-to-sequence language models. | BARTScore, BARTScore+PET, PEGASUSScore+PET, T5Score+PET, PRISM
Supervised regression-based metrics | These metrics require human annotations to train a parametrized regression model to predict human judgments for the given text. | S3-pyr, S3-resp |
Question answering-based metrics | These metrics formulate the evaluation process as a question-answering task by guiding the model with various questions. | UniEval |
Reference-free metrics | These metrics do not require the reference text to assess the quality of the generated text. Instead, they compare the generated text against the source document. | SummaQA, BLANC, SUPERT, Stats-compression, Stats-coverage, Stats-density, Stats-novel trigram |
Note that we included 17 different evaluation methods to assess model performance. Because several methods encompass multiple variants, this yields a total of 30 metrics. A detailed overview of these metrics can be found in Appendix S4; a brief illustrative sketch of the first two categories follows below.
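As a concrete illustration of how the first two categories in Table 1 are typically computed, the sketch below scores a single generated summary against a reference with a lexical overlap-based metric (ROUGE) and an embedding-based metric (BERTScore). It is a minimal sketch using the publicly available `rouge_score` and `bert_score` Python packages; the example texts are invented for illustration, and the exact toolkits, model checkpoints, and configurations used in this study may differ (see Appendix S4).

```python
# Minimal sketch: score one generated summary against one reference.
# Assumes the `rouge_score` and `bert_score` packages are installed;
# BERTScore downloads a pretrained contextual embedding model on first use.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The patient shows mild cardiomegaly with no pleural effusion."  # illustrative text
candidate = "Mild cardiomegaly is present; no pleural effusion is seen."     # illustrative text

# Lexical overlap: ROUGE counts shared n-grams and longest common subsequences.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)  # dict of Score(precision, recall, fmeasure)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# Embedding-based: BERTScore matches candidate and reference tokens in a
# pretrained embedding space and reports precision, recall, and F1.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```

By contrast, reference-free metrics such as SUPERT or BLANC would compare the candidate against the source document rather than the reference summary.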