Table 9:
Pearson correlation with human annotations for each single metric, alongside the average correlation of the metric ensembles that include that metric. The last row reports the correlation of the best-performing metric ensemble (Coverage, BARTScore, Distilled).
| Metric | Pearson Correlation (Single) | Pearson Correlation (Avg In Ensemble) |
|---|---|---|
| Coverage (Cov) | .457 | .544 |
| BARTScore | .539 | .550 |
| CTC | .507 | .546 |
| Entailment | .453 | .539 |
| BERTScore | .482 | .535 |
| Reviser | .324 | .528 |
| FactScore | .444 | .536 |
| Distilled | .564 | .556 |
| Best Ensemble | N/A | .583 |
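
The two columns can be illustrated with a minimal sketch. It is not the authors' code: the table does not say how an ensemble combines its metrics, so the sketch assumes an unweighted mean of z-normalized scores and assumes that an ensemble is any subset of two or more metrics. The names `pearson`, `zscore`, `ensemble_scores`, and `correlation_table` are hypothetical, and the toy scores are made up for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def zscore(x):
    """Standardize scores so metrics on different scales can be averaged."""
    n = len(x)
    m = sum(x) / n
    sd = sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]


def ensemble_scores(metric_scores, names):
    """ASSUMPTION: an ensemble is the unweighted mean of z-scored metrics."""
    normed = [zscore(metric_scores[name]) for name in names]
    return [sum(col) / len(col) for col in zip(*normed)]


def correlation_table(metric_scores, human):
    """Return (single, avg_in_ensemble, best_ensemble, best_correlation)."""
    names = sorted(metric_scores)
    # "Single" column: each metric correlated with human scores on its own.
    single = {n: pearson(metric_scores[n], human) for n in names}
    # ASSUMPTION: ensembles are all subsets of size >= 2.
    ensembles = {}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            combined = ensemble_scores(metric_scores, combo)
            ensembles[combo] = pearson(combined, human)
    # "Avg In Ensemble" column: mean over every ensemble containing the metric.
    avg_in = {}
    for n in names:
        vals = [r for combo, r in ensembles.items() if n in combo]
        avg_in[n] = sum(vals) / len(vals)
    best = max(ensembles, key=ensembles.get)
    return single, avg_in, best, ensembles[best]


# Toy, made-up scores for five summaries (not data from the paper).
toy_human = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_metrics = {
    "metric_a": [0.20, 0.10, 0.40, 0.35, 0.50],
    "metric_b": [0.90, 0.70, 0.80, 0.95, 1.00],
    "metric_c": [3.0, 1.0, 4.0, 2.0, 5.0],
}
single, avg_in, best, best_r = correlation_table(toy_metrics, toy_human)
```

Here `single` corresponds to the "Single" column, `avg_in` to the "Avg In Ensemble" column, and `best` with `best_r` to the "Best Ensemble" row, under the stated assumptions.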