
Table 2.

All evaluation metrics included in this study and their respective categories

Lexical overlap-based metrics: measure the overlap between the generated text and the reference in terms of textual units, such as n-grams or word sequences. Corresponding metrics: ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L, ROUGE-LSUM, BLEU, CHRF, METEOR, CIDEr.

Embedding-based metrics: measure the semantic similarity between the generated and reference texts using pretrained embeddings. Corresponding metrics: ROUGE-WE-1, ROUGE-WE-2, ROUGE-WE-3, BERTScore, MoverScore.

Graph-based metrics: construct graphs from the entities and relations extracted from the sentences and evaluate the summary based on these graphs. Corresponding metrics: RadGraph.

Text generation-based metrics: assess the quality of the generated text by framing evaluation as a text generation task using sequence-to-sequence language models. Corresponding metrics: BARTScore, BARTScore + PET, PEGASUSScore + PET, T5Score + PET, PRISM.

Supervised regression-based metrics: require human annotations to train a parametrized regression model that predicts human judgments for a given text. Corresponding metrics: S3-pyr, S3-resp.

Question answering-based metrics: formulate the evaluation process as a question-answering task by guiding the model with various questions. Corresponding metrics: UniEval.

Reference-free metrics: do not require the reference text to assess the quality of the generated text; instead, they compare the generated text against the source document. Corresponding metrics: SummaQA, BLANC, SUPERT, Stats-compression, Stats-coverage, Stats-density, Stats-novel trigram.

Note that we included 17 different evaluation methods to assess model performance. Because each method may encompass multiple variants, there are 30 metrics in total. A detailed overview of these metrics can be found in Appendix 3.
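To make the first two categories concrete, the following is a minimal sketch, not taken from the study's code, of scoring a single candidate summary against a reference with a lexical overlap-based metric (ROUGE) and an embedding-based metric (BERTScore). It assumes the open-source rouge_score and bert_score Python packages; the example texts are hypothetical, and the study's own implementations and settings may differ.

```python
# Minimal sketch: one lexical overlap-based metric (ROUGE) and one
# embedding-based metric (BERTScore) on a toy, hypothetical example.
# Assumes the open-source `rouge_score` and `bert_score` packages;
# the paper's exact implementations and settings may differ.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "No acute cardiopulmonary abnormality."  # hypothetical reference summary
candidate = "No acute abnormality in the chest."     # hypothetical generated summary

# Lexical overlap: ROUGE-1/2/L count n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
for name, s in rouge.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")

# Embedding-based: BERTScore matches tokens via contextual embeddings,
# so paraphrases can score well even with little exact word overlap.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")
```

The contrast illustrates why both families appear in the table: ROUGE rewards exact n-gram overlap and therefore penalizes paraphrases, whereas BERTScore matches tokens in embedding space and can credit semantically equivalent wording.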