Table 7:

Basic statistics for single-document (SDS) and multi-document (MDS) summarization datasets. For MDS, the number of source words is aggregated across documents. Compression ratio is the average ratio of source words to summary words. Extractiveness metrics (coverage and density) come from Grusky et al. (2018) and, for consistency, are calculated with the official code on the validation set of each dataset. spaCy tokenization is performed before extracting fragments. Other corpus statistics are taken from either the corresponding paper or Table 1 in Sharma et al. (2019). Entries are marked N/A when the dataset is private (Krishna et al., 2020) or the statistic is too expensive to compute (Liu et al., 2018a). The Gigaword SDS dataset comes from the annotated Gigaword dataset (Graff et al., 2003; Napoles et al., 2012).

	Dataset	# Docs	Comp. Ratio	Coverage	Density	Summary # words	Summary # sents	Source # words
SDS	Gigaword (Rush et al., 2015a)	4mn	3.8	0.58	1.1	8.3	1	31.4
	CNN/DM (Nallapati et al., 2016)	312k	13.0	0.80	3.0	55.6	3.8	789.9
	Newsroom (Grusky et al., 2018)	1.2mn	43.0	0.82	9.6	30.4	1.4	750.9
	XSum (Narayan et al., 2018)	226k	18.8	0.57	0.89	23.3	1.0	431.1
	Arxiv (Cohan et al., 2018)	215k	39.8	0.92	3.7	292.8	9.6	6,913.8
	PubMed (Cohan et al., 2018)	133k	16.2	0.90	5.9	214.4	6.9	3,224.4
	BigPatent (Sharma et al., 2019)	1.3mn	36.4	0.86	2.4	116.5	3.5	3,572.8

MDS	WikiSum (Liu et al., 2018a)	2.3mn	264.0	N/A	N/A	139.4	N/A	36,802.5
	Multi-News (Fabbri et al., 2019)	56k	8.0	0.68	3.0	263.7	10	2,103.5
	SOAP (Krishna et al., 2020)	7k	4.7	N/A	N/A	320	N/A	1,500
	CLINSUM (ours)	110k	45.2	0.83	13.1	261.9	17.7	11,838.7
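
The compression ratio, coverage, and density columns can be illustrated with a minimal sketch of the greedy extractive-fragment procedure described by Grusky et al. (2018). This is not the official code used for the table: the function names `extractive_fragments` and `summary_stats` are hypothetical, tokens are assumed to be pre-split (the table uses spaCy tokenization), and the values are per example, whereas the table reports averages over a validation set.

```python
from typing import List, Tuple


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans of the summary that also occur in the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest article match starting at summary position i
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best  # skip past the matched span
        else:
            i += 1  # novel token, not part of any fragment
    return fragments


def summary_stats(article: List[str], summary: List[str]) -> Tuple[float, float, float]:
    """Return (coverage, density, compression ratio) for one article-summary pair."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    frags = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in frags) / n      # share of summary tokens copied from the source
    density = sum(len(f) ** 2 for f in frags) / n  # average squared fragment length per summary token
    compression = len(article) / n                 # source words per summary word
    return coverage, density, compression


# Toy example with whitespace tokenization (an assumption made for brevity).
article = "the cat sat on the mat today".split()
summary = "the cat sat quietly".split()
print(summary_stats(article, summary))  # (0.75, 2.25, 1.75)
```

In the toy example the only fragment is "the cat sat" (3 of 4 summary tokens), which gives coverage 3/4, density 9/4, and compression 7/4. High coverage with low density indicates many short copied spans, while high density indicates long verbatim copies.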