2023 May 5;23:86. doi: 10.1186/s12911-023-02181-9

Table 2.

Statistics of Clinical Note Datasets with the Natural Language Processing Pipeline and Manual Annotations

            MIMIC-III Disch   MIMIC-III Rad   Tayside Brain Img
|T|                  59,652         522,279             156,618
|D|                 127,150         109,096               7,761
|Dweak+|             15,598          13,907               1,137
|Dweak-|             74,217          65,171               2,898
|TRD|                37,110          73,589               7,321
|TRDweak|            10,568          21,102               2,855
|Tann|                  500           1,000               5,000
|Dann|                1,073             198               279+4
|TRDann|                312             145                 273

|T|, number of documents; |D|, number of mention-UMLS pairs; |Dweak+|, |Dweak-|, number of weakly labelled positive and negative mention-UMLS pairs, respectively; |TRD|, |TRDweak|, number of documents associated with one or more rare diseases detected by SemEHR and by SemEHR+WS (i.e., SemEHR followed by weak supervision), respectively; |Tann|, |Dann|, |TRDann|, number of documents sampled, number of mention-UMLS pairs sampled, and number of sampled documents with one or more rare diseases identified by SemEHR, respectively. For the Tayside data, 4 new positive mention-UMLS pairs in |Dann| were identified from the reports during manual annotation.
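As a reading aid, the ratio |TRDweak| / |TRD| indicates how many of the documents flagged by SemEHR remain flagged once weak supervision is applied. The following is a minimal sketch of that arithmetic, not part of the original pipeline. It assumes that the documents counted in |TRDweak| are a subset of those counted in |TRD|; the names `TABLE2` and `retained_fraction` are hypothetical and introduced only for illustration.

```python
# Counts copied from Table 2 (|TRD| and |TRDweak| rows only).
# TABLE2 and retained_fraction are illustrative names, not from the paper.
TABLE2 = {
    "MIMIC-III Disch": {"TRD": 37110, "TRDweak": 10568},
    "MIMIC-III Rad": {"TRD": 73589, "TRDweak": 21102},
    "Tayside Brain Img": {"TRD": 7321, "TRDweak": 2855},
}


def retained_fraction(dataset: str) -> float:
    """Return |TRDweak| / |TRD| for one dataset.

    Assumption: documents in |TRDweak| are a subset of those in |TRD|,
    so the ratio can be read as the share of SemEHR-flagged documents
    that are still flagged after weak supervision.
    """
    row = TABLE2[dataset]
    return row["TRDweak"] / row["TRD"]


if __name__ == "__main__":
    for name in TABLE2:
        print(f"{name}: {retained_fraction(name):.1%}")
```

Under the stated assumption, a lower ratio means that weak supervision filtered out a larger share of the documents initially flagged by SemEHR.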