Bioinformatics. 2023 Nov 24;39(12):btad716. doi: 10.1093/bioinformatics/btad716

Table 3.

Baseline corpora statistics.

Statistic                                   GSC+           EHR
Number of documents                         228            100
Number of annotations                       2773           1815
Unique HPO concepts                         461            252
Total number of tokens in the corpus        5724           59 470
Unique tokens in the corpus                 1035           5672
Annotations containing “canonical” tokens   2362 (85.2%)   1450 (79.9%)
Total “canonical” tokens                    3685 (64.4%)   1095 (19.3%)
Unique “canonical” tokens                   571 (55.2%)    260 (23.7%)