Table 3.
Statistics of the eight standardized biomedical datasets we used: the source and aim of each task, the training and test sizes (in tokens), the number of entity types, and the number of entities in each dataset.
| Dataset (source and task aim) | Training Size (tokens) | Test Size (tokens) | Entity Types | Entities |
|---|---|---|---|---|
| MIMIC III (information relating to patients) | 36.4k | 6.4k | 12 | 8.7k |
| BC5CDR (extracting relationships between chemicals and diseases) | 228.8k | 122.2k | 2 | 28.8k |
| Med-Mentions (annotated with UMLS concepts) | 847.9k | 593.6k | 1 | 340.9k |
| NCBI Disease (PubMed abstracts annotated with disease names) | 134.0k | 20.5k | 4 | 6.3k |
| Reddit-Impacts (clinical impacts and social impacts collected from Reddit) | 30.0k | 6.0k | 2 | 0.2k |