Table 2.
Statistics of Clinical Note Datasets with the Natural Language Processing Pipeline and Manual Annotations
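| | MIMIC-III Disch | MIMIC-III Rad | Tayside Brain Img |
|---|---|---|---|
| \|T\| | 59,652 | 522,279 | 156,618 |
| \|D\| | 127,150 | 109,096 | 7,761 |
| Weakly labelled positive mention-UMLS pairs | 15,598 | 13,907 | 1,137 |
| Weakly labelled negative mention-UMLS pairs | 74,217 | 65,171 | 2,898 |
| Documents with one or more rare diseases (SemEHR) | 37,110 | 73,589 | 7,321 |
| Documents with one or more rare diseases (SemEHR+WS) | 10,568 | 21,102 | 2,855 |
| Documents sampled | 500 | 1,000 | 5,000 |
| Mention-UMLS pairs sampled | 1,073 | 198 | 279+4 |
| Sampled documents with one or more rare diseases (SemEHR) | 312 | 145 | 273 |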
|T|, number of documents; |D|, number of mention-UMLS pairs. The rows that follow give, in order: the numbers of weakly labelled positive and negative mention-UMLS pairs; the numbers of documents associated with one or more rare diseases detected by SemEHR and by SemEHR+WS (i.e., further with weak supervision); and the number of documents sampled, the number of mention-UMLS pairs sampled, and the number of sampled documents with one or more rare diseases identified by SemEHR. For the Tayside data, 4 new positive mention-UMLS pairs were identified from the reports during manual annotation (shown as "+4").