[Preprint]. 2025 Aug 25:rs.3.rs-7216581. [Version 1] doi: 10.21203/rs.3.rs-7216581/v1

Table 3.

Statistics of the eight standardized biomedical datasets used in this study: the source and aim of each task, the training and test sizes (in tokens), the number of entity types, and the number of entities in each dataset.

Dataset (source and task)	Training Size (tokens)	Test Size (tokens)	Entity Types	Entities
MIMIC III (information relating to patients)	36.4k	6.4k	12	8.7k
BC5CDR (extracting relationships between chemicals and diseases)	228.8k	122.2k	2	28.8k
Med-Mentions (annotated with UMLS concepts)	847.9k	593.6k	1	340.9k
NCBI Disease (PubMed abstracts annotated with disease names)	134.0k	20.5k	4	6.3k
Reddit-Impacts (clinical impacts and social impacts collected from Reddit)	30.0k	6.0k	2	0.2k