Author manuscript; available in PMC: 2021 Jun 24.
Published in final edited form as: Proc Conf. 2021 Jun;2021:4794–4811. doi: 10.18653/v1/2021.naacl-main.382

Table 7:

Basic statistics for single-document (SDS) and multi-document (MDS) summarization datasets. For MDS, the number of source words is aggregated across documents. Compression ratio is the average ratio of source words to summary words. The extractiveness metrics (coverage and density) come from Grusky et al. (2018) and, for consistency, are calculated with the official code over the validation set of each dataset. spaCy tokenization is performed before extracting fragments. Other corpus statistics are taken from either the corresponding paper or Table 1 in Sharma et al. (2019). Entries are marked N/A where the dataset is private (Krishna et al., 2020) or the statistic is too expensive to compute (Liu et al., 2018a). The Gigaword SDS dataset comes from the annotated Gigaword dataset (Graff et al., 2003; Napoles et al., 2012).

                                                Comp.   Extractiveness     Summary           Source
Type  Dataset                           # Docs  Ratio   Coverage  Density  # words  # sents  # words
SDS   Gigaword (Rush et al., 2015a)     4mn     3.8     0.58      1.1      8.3      1        31.4
SDS   CNN/DM (Nallapati et al., 2016)   312k    13.0    0.80      3.0      55.6     3.8      789.9
SDS   Newsroom (Grusky et al., 2018)    1.2mn   43.0    0.82      9.6      30.4     1.4      750.9
SDS   XSum (Narayan et al., 2018)       226k    18.8    0.57      0.89     23.3     1.0      431.1
SDS   Arxiv (Cohan et al., 2018)        215k    39.8    0.92      3.7      292.8    9.6      6,913.8
SDS   PubMed (Cohan et al., 2018)       133k    16.2    0.90      5.9      214.4    6.9      3,224.4
SDS   BigPatent (Sharma et al., 2019)   1.3mn   36.4    0.86      2.4      116.5    3.5      3,572.8

MDS   WikiSum (Liu et al., 2018a)       2.3mn   264.0   N/A       N/A      139.4    N/A      36,802.5
MDS   Multi-News (Fabbri et al., 2019)  56k     8.0     0.68      3.0      263.7    10       2,103.5
MDS   SOAP (Krishna et al., 2020)       7k      4.7     N/A       N/A      320      N/A      1,500
MDS   CLINSUM (ours)                    110k    45.2    0.83      13.1     261.9    17.7     11,838.7
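
To make the caption's metrics concrete, the following is a minimal sketch of how coverage, density, and compression ratio can be computed for a single source-summary pair, following the extractive-fragment definitions in Grusky et al. (2018). It is not the official code used for the table. The function names are illustrative, inputs are assumed to be already tokenized (the caption uses spaCy for that step), and no lowercasing or other normalization is applied. The table reports these quantities averaged over each dataset's validation set.

```python
from typing import List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans the summary shares with the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest article match starting at summary[i]
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best  # skip past the matched span
        else:
            i += 1  # novel token, not part of any fragment
    return fragments


def coverage(article: List[str], summary: List[str]) -> float:
    """Fraction of summary tokens that lie inside an extractive fragment."""
    frags = extractive_fragments(article, summary)
    return sum(len(f) for f in frags) / len(summary)  # assumes a non-empty summary


def density(article: List[str], summary: List[str]) -> float:
    """Average squared fragment length per summary token."""
    frags = extractive_fragments(article, summary)
    return sum(len(f) ** 2 for f in frags) / len(summary)


def compression_ratio(article: List[str], summary: List[str]) -> float:
    """Source words divided by summary words."""
    return len(article) / len(summary)


if __name__ == "__main__":
    # Toy example with whitespace tokenization (an assumption for illustration).
    source = "the cat sat on the mat while the dog slept".split()
    target = "the cat slept on the mat".split()
    print(extractive_fragments(source, target))
    print(coverage(source, target), density(source, target),
          compression_ratio(source, target))
```

In the toy example every summary token is copied from the source, so coverage is 1.0, while density (14/6, about 2.33) stays modest because the copied spans are short. High coverage together with high density, as in the CLINSUM row, indicates that summaries reuse long contiguous spans of the source.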