2019 Jun 14;27(1):22–30. doi: 10.1093/jamia/ocz075

Table 1.

Statistics of the data set. Rare words are words that occur only once in the data. Unknown words are words that are not seen in the training set.

| Item | Training | Development |
|---|---|---|
| Document | 242 | 61 |
| Entities | 41 171 | 9776 |
| Nest level 1 entity (flat entities) | 41 109 | 9760 |
| Nest level 2 entity | 61 | 16 |
| Nest level 3 entity | 1 | 0 |
| Polysemous entity | 47 | 13 |
| Textually nested entity | 15 | 3 |
| ADE | 785 | 174 |
| Dosage | 3401 | 820 |
| Drug | 13 109 | 3114 |
| Duration | 499 | 93 |
| Form | 5340 | 1311 |
| Frequency | 5075 | 1205 |
| Reason | 3105 | 750 |
| Route | 4479 | 996 |
| Strength | 5378 | 1313 |
| Unknown words / Unique words |  | 17.00% |
| Rare words / Unique words | 37.19% | 37.69% |
| EUNKs / All entities |  | 2.67% |
| ERAREs / All entities | 1.89% | 3.88% |

Abbreviations: ADE, adverse drug event; EUNKs, entities that contain unknown words; ERAREs, entities that contain rare words.
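The word-level and entity-level ratios defined in the caption and abbreviations can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`word_stats`, `entity_fraction`), the toy tokens, and the choice to count rare words within each split separately are assumptions made for illustration, and tokenization and case normalization are assumed to have been done beforehand.

```python
from collections import Counter


def word_stats(train_tokens, dev_tokens):
    """Return rare/unknown word sets and their ratios over unique words.

    Assumption: "rare" means a word occurs exactly once within a split,
    and "unknown" means a development word absent from the training split.
    """
    train_counts = Counter(train_tokens)
    dev_counts = Counter(dev_tokens)

    rare_train = {w for w, c in train_counts.items() if c == 1}
    rare_dev = {w for w, c in dev_counts.items() if c == 1}
    unknown_dev = set(dev_counts) - set(train_counts)

    def ratio(subset, vocab):
        # Guard against an empty vocabulary.
        return len(subset) / len(vocab) if vocab else 0.0

    return {
        "rare_train": rare_train,
        "rare_dev": rare_dev,
        "unknown_dev": unknown_dev,
        "rare_ratio_train": ratio(rare_train, train_counts),
        "rare_ratio_dev": ratio(rare_dev, dev_counts),
        "unknown_ratio_dev": ratio(unknown_dev, dev_counts),
    }


def entity_fraction(entities, word_set):
    """Fraction of entities (lists of tokens) containing any word in word_set."""
    if not entities:
        return 0.0
    hits = sum(1 for tokens in entities if any(t in word_set for t in tokens))
    return hits / len(entities)


# Toy data, invented for illustration only.
train_tokens = ["aspirin", "81", "mg", "daily", "aspirin", "po"]
dev_tokens = ["warfarin", "5", "mg", "daily", "daily"]
dev_entities = [["warfarin"], ["5", "mg"], ["daily"]]

stats = word_stats(train_tokens, dev_tokens)
eunk_ratio = entity_fraction(dev_entities, stats["unknown_dev"])
erare_ratio = entity_fraction(dev_entities, stats["rare_dev"])
```

Here `eunk_ratio` corresponds to the "EUNKs / All entities" row and `erare_ratio` to the "ERAREs / All entities" row, computed on the development split of the toy data.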