Author manuscript; available in PMC: 2024 Apr 30.

Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:10520–10542. doi: 10.18653/v1/2023.acl-long.587

Table 1:

Statistics for long-form scientific summarization datasets. The biomedical dataset is from Cohan et al. (2018), the recipe to recreate the clinical dataset is from Adams et al. (2022), and the chemical dataset is from this work.

Statistic	Clinical	Chemical	Biomedical
Train Size	41,705	115,956	119,924
Validation Size	940	1,000	6,633
Test Size	1,861	2,000	6,658
Source Tokens	8,175	5,364	3,092
Reference Tokens	416	216	205
Extractive Coverage	0.66	0.90	0.88
Extractive Density	1.97	3.53	5.87
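
The table does not say how Extractive Coverage and Extractive Density are computed. A minimal sketch follows, assuming the commonly used definitions of Grusky et al. (2018): coverage is the fraction of summary tokens that fall inside a fragment copied from the source, and density is the average squared fragment length per summary token. The function names, the greedy fragment matching, and the whitespace tokenization are assumptions made for illustration; they are not taken from this work, so the sketch is not expected to reproduce the numbers above exactly.

```python
from typing import List, Tuple


def extractive_fragments(source: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest token spans shared by summary and source.

    Assumed procedure: for each summary position, find the longest run of
    tokens that also appears contiguously in the source, record it, and
    continue after that run.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(source):
            if summary[i] == source[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(source)
                       and summary[i + k] == source[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched span, or one token if nothing matched.
        i += max(best, 1)
    return fragments


def coverage_and_density(source: List[str], summary: List[str]) -> Tuple[float, float]:
    """Return (coverage, density) for one tokenized source/summary pair."""
    if not summary:
        return 0.0, 0.0
    fragments = extractive_fragments(source, summary)
    n = len(summary)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


# Toy example with whitespace tokenization (an assumption, not the paper's tokenizer).
src = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
cov, den = coverage_and_density(src, ref)
print(round(cov, 3), round(den, 3))  # 0.833 2.167
```

In the toy example, the summary shares the fragments "the cat" and "on the mat" with the source, so coverage is 5/6 and density is (2² + 3²)/6. Under these definitions, a dataset with high coverage but low density reuses source vocabulary mostly in short spans.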