Author manuscript; available in PMC: 2024 Apr 30.
Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2023 Jul;2023:10520–10542. doi: 10.18653/v1/2023.acl-long.587

Table 1:

Statistics for long-form scientific summarization datasets. The biomedical dataset is from Cohan et al. (2018), the recipe to recreate the clinical dataset is from Adams et al. (2022), and the chemical dataset is introduced in this work.

Statistic              Clinical    Chemical    Biomedical
Train Size               41,705     115,956       119,924
Validation Size             940       1,000         6,633
Test Size                 1,861       2,000         6,658
Source Tokens             8,175       5,364         3,092
Reference Tokens            416         216           205
Extractive Coverage        0.66        0.90          0.88
Extractive Density         1.97        3.53          5.87
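
The table does not say how Extractive Coverage and Extractive Density are computed. A minimal sketch follows, assuming they use the common extractive-fragment definitions of Grusky et al. (2018): coverage is the fraction of reference tokens that fall inside a span copied from the source, and density is the average squared length of those spans. The function names, the whitespace tokenization, and the toy sentences are illustrative assumptions, not the authors' code or data.

```python
def extractive_fragments(source_tokens, summary_tokens):
    """Greedily collect the longest token spans shared with the source."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Find the longest source span that matches the summary starting at i.
        for j in range(len(source_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(source_tokens)
                   and summary_tokens[i + k] == source_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # Token never appears in the source; skip it.
    return fragments


def coverage_and_density(source_tokens, summary_tokens):
    """Return (coverage, density) for one source/reference pair."""
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(source_tokens, summary_tokens)
    n = len(summary_tokens)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    return coverage, density


# Toy example (hypothetical text, whitespace tokenization assumed).
source = "the patient was admitted with chest pain and discharged home".split()
summary = "patient admitted with chest pain".split()
cov, den = coverage_and_density(source, summary)
print(cov, den)
```

Under these definitions, a dataset can have high coverage but modest density when references reuse many source words in short spans, which is one way to read the Chemical column against the Biomedical column. Dataset-level figures would be averages of the per-example values.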