Table 2.
Summary of annotated electronic health record (EHR) documents used to train the named entity recognition model.
| Variable | Annotated text spans, Phase 1 | Annotated text spans, Phase 2 |
|---|---|---|
| History of violence | 391 | 350 |
| History of self-harm | 559 | 397 |
| Formal education | 174 | 200 |
| Medication | 1,774 | 3,860 |
| Benefits recipient | 188 | 195 |
| Drug/alcohol use disorder | 190 | 130 |
| (Parental) suicide | 19 | 77 |
| Psychiatric admission | 332 | 260 |
| Total: | 3,627 | 5,469 |
Text spans are words or word combinations that refer to the concept of interest (the variable), as selected by the manual annotator. The model was trained in two phases: the first used GATE software and the second used Prodigy, an active learning-based annotation tool. The annotated documents summarized in this table constituted the “gold-standard” training dataset used in model development. EHR, electronic health record.
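
To make the notion of a text span concrete, the following is a minimal sketch of one way annotated spans could be represented and counted per variable, as in the table above. The `Span` class, the helper functions, and the example sentence are hypothetical illustrations; they are not the export format of GATE or Prodigy and are not taken from the study data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical record for one annotated text span."""
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # the variable the span refers to


def make_span(document: str, phrase: str, label: str) -> Span:
    """Build a span for the first occurrence of `phrase` in `document`."""
    start = document.index(phrase)  # raises ValueError if the phrase is absent
    return Span(start, start + len(phrase), label)


def span_text(document: str, span: Span) -> str:
    """Return the words covered by the span."""
    return document[span.start:span.end]


def count_spans(spans) -> Counter:
    """Count annotated spans per variable."""
    return Counter(span.label for span in spans)


# Invented sentence for illustration only; not from any real record.
document = "Patient was started on sertraline after a previous overdose."
spans = [
    make_span(document, "sertraline", "Medication"),
    make_span(document, "previous overdose", "History of self-harm"),
]

counts = count_spans(spans)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Summing such per-variable counts over all annotated documents in a phase would give one column of the table.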