Table 2.
Summary of annotated electronic health record documents used to train the named entity recognition model.
Variable | Annotated text spans, Phase 1 | Annotated text spans, Phase 2 |
---|---|---|
History of violence | 391 | 350 |
History of self-harm | 559 | 397 |
Formal education | 174 | 200 |
Medication | 1774 | 3860 |
Benefits recipient | 188 | 195 |
Drug/alcohol use disorder | 190 | 130 |
(Parental) suicide | 19 | 77 |
Psychiatric admission | 332 | 260 |
Total | 3627 | 5469 |
Text spans are words or word combinations that refer to the concept of interest (the variable), as selected by the manual annotator. The model was trained in two phases: first using GATE software and then using Prodigy, an active learning-based annotation tool. The annotated documents shown in this table constituted the “gold-standard” training dataset used in model development. EHR, electronic health record.