Author manuscript; available in PMC: 2024 Apr 30.

Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:10520–10542. doi: 10.18653/v1/2023.acl-long.587

Table 10:

Statistics for each candidate generation method. Rel. stands for Relevance and is measured by BERTScore F1 overlap with the reference. Faith. stands for faithfulness and is measured by the FactScore (as defined in §4.2). Extract. stands for the extractive density (level of copy-and-paste) as defined by Grusky et al. (2018). The first 6 rows (Mask-And-Fill and Swaps) construct negative examples for faithfulness calibration. The next two rows form the positive candidate set for faithfulness. The last two (diverse beam) form candidates for relevance calibration.

		Clinical			Chemical			Biomedical
	Candidate Method	Rel.	Faith.	Extract.	Rel.	Faith.	Extract.	Rel.	Faith.	Extract.
Faith. Contrast	Mask-And-Fill (Low)	0.98	0.52	1.55	0.99	0.75	3.24	0.97	0.73	4.92
	Mask-And-Fill (High)	0.97	0.52	1.44	0.97	0.73	2.90	0.95	0.71	4.05
	Swap Intrinsic (Low)	0.94	0.52	1.64	0.97	0.70	2.92	0.98	0.71	4.70
	Swap Intrinsic (High)	0.90	0.52	1.82	0.95	0.65	2.62	0.97	0.67	4.13
	Swap Extrinsic (Low)	0.94	0.52	1.64	0.97	0.70	2.92	0.98	0.68	4.44
	Swap Extrinsic (High)	0.90	0.52	1.82	0.95	0.65	2.62	0.97	0.64	3.79
	Paraphrase	0.90	0.52	1.26	0.94	0.77	3.06	0.92	0.73	4.00
	Reference	1.00	0.52	1.96	1.00	0.76	3.54	1.00	0.74	5.78
Rel. Rank	Diverse Beam (PRIMERA)	0.84	0.53	2.65	0.87	0.85	9.66	0.86	0.86	12.90
	Diverse Beam (LongT5)	0.83	0.52	2.06	0.86	0.83	7.46	0.85	0.82	8.39
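
To make the Extract. column easier to interpret, the following is a minimal sketch of extractive density in the sense of Grusky et al. (2018): the sum of squared lengths of the greedily matched extractive fragments, divided by the summary length in tokens. The function names, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the preprocessing actually used to produce the numbers in Table 10 is not stated here and may differ.

```python
from typing import List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans of the summary that also occur in the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest article match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched span, or past one novel token if nothing matched.
        i += max(best, 1)
    return fragments


def extractive_density(article_text: str, summary_text: str) -> float:
    """Sum of squared fragment lengths, normalized by summary length (assumed whitespace tokens)."""
    article = article_text.lower().split()
    summary = summary_text.lower().split()
    if not summary:
        return 0.0
    fragments = extractive_fragments(article, summary)
    return sum(len(f) ** 2 for f in fragments) / len(summary)


if __name__ == "__main__":
    source = "the patient was admitted with chest pain"
    print(extractive_density(source, "patient admitted with chest pain"))
```

Under this definition, a summary with no tokens in common with the source scores 0, and longer copied spans raise the score quadratically, which is consistent with the Diverse Beam rows showing the highest Extract. values in each domain.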