Author manuscript; available in PMC: 2023 Apr 18.
Published in final edited form as: J Biomed Inform. 2022 Sep 5;134:104175. doi: 10.1016/j.jbi.2022.104175

Table 1.

Summary of EHR datasets by phenotype. The filter-positive set is the set of patients passing the clinically established filter criterion defined in the Method section. The labeled set is the small set of patients with manually curated gold-standard labels, randomly sampled from the filter-positive set. The prevalence is the proportion of subjects with positive phenotype status among those for whom labels are provided.

EHR Platform   Phenotype       Sample Size of Filter Positive Patients   No. of Labeled Samples (%)   Prevalence (%)
MGB            Asthma          7289                                      183 (2.5 %)                  47.5
MGB            Breast Cancer   2002                                      94 (4.7 %)                   77.6
MGB            COPD            3021                                      153 (5.1 %)                  43.1
MGB            Depression      10189                                     252 (2.5 %)                  54.8
MGB            Epilepsy        2225                                      117 (5.3 %)                  47.9
MGB            Hypertension    19853                                     390 (2.0 %)                  79.0
MGB            SCZ             456                                       108 (23.7 %)                 17.6
MGB            T1DM            2111                                      128 (6.1 %)                  16.4
MGB            RA              987                                       153 (15.5 %)                 36.6
MGB            CAD             3793                                      186 (4.9 %)                  37.1
MGB            CD              519                                       136 (26.2 %)                 53.7
MGB            UC              476                                       126 (26.5 %)                 49.2
MGB            T2DM            3460                                      280 (8.1 %)                  35.7
MGB            MS              136                                       101 (74.3 %)                 52.5
MGB            Stroke          2052                                      128 (6.2 %)                  36.7

BWH            Pseudogout      12035                                     365 (3.0 %)                  21.9

BCH            ARDS            2201                                      44 (1.9 %)                   40.9
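
The percentage in parentheses next to each labeled-sample count appears to be the share of filter-positive patients that received a gold-standard label. The sketch below recomputes that share for three rows of Table 1 as a sanity check. The function name `labeled_percent` and the `counts` dictionary are illustrative assumptions, not part of the original study code.

```python
def labeled_percent(n_labeled: int, n_filter_positive: int) -> float:
    """Return the percentage of filter-positive patients that were labeled.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if n_filter_positive <= 0:
        raise ValueError("n_filter_positive must be a positive integer")
    return 100.0 * n_labeled / n_filter_positive


# (filter-positive count, labeled count) copied from three rows of Table 1.
counts = {
    "Asthma": (7289, 183),
    "SCZ": (456, 108),
    "MS": (136, 101),
}

for phenotype, (n_filter_positive, n_labeled) in counts.items():
    pct = labeled_percent(n_labeled, n_filter_positive)
    print(f"{phenotype}: {n_labeled} labeled ({pct:.1f} %)")
```

Rounded to one decimal place, the three computed shares agree with the values reported in the table (2.5 %, 23.7 % and 74.3 %).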