Author manuscript; available in PMC: 2023 Apr 18.

Published in final edited form as: J Biomed Inform. 2022 Sep 5;134:104175. doi: 10.1016/j.jbi.2022.104175

Table 1.

Summary of EHR datasets by phenotype. The filter-positive set is the set of patients who pass the clinically established filter criterion defined in the Methods section. The labeled set is the small subset of patients with manually curated gold-standard labels, randomly sampled from the filter-positive set; the percentage in parentheses is its share of the filter-positive set. Prevalence is the proportion of patients with positive phenotype status among those in the labeled set.

EHR Platform	Phenotype	Sample Size of Filter Positive Patients	No. of Labeled Samples (%)	Prevalence (%)
MGB	Asthma	7289	183 (2.5 %)	47.5
	Breast Cancer	2002	94 (4.7 %)	77.6
	COPD	3021	153 (5.1 %)	43.1
	Depression	10189	252 (2.5 %)	54.8
	Epilepsy	2225	117 (5.3 %)	47.9
	Hypertension	19853	390 (2.0 %)	79.0
	SCZ	456	108 (23.7 %)	17.6
	T1DM	2111	128 (6.1 %)	16.4
	RA	987	153 (15.5 %)	36.6
	CAD	3793	186 (4.9 %)	37.1
	CD	519	136 (26.2 %)	53.7
	UC	476	126 (26.5 %)	49.2
	T2DM	3460	280 (8.1 %)	35.7
	MS	136	101 (74.3 %)	52.5
	Stroke	2052	128 (6.2 %)	36.7

BWH	Pseudogout	12035	365 (3.0 %)	21.9

BCH	ARDS	2201	44 (1.9 %)	40.9
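
As an illustration of how the parenthesized percentages in Table 1 relate to the counts, the following minimal sketch recomputes the labeled share for three MGB rows. The function name `labeled_percentage` and the one-decimal rounding are assumptions made for this example; they are not taken from the authors' code.

```python
def labeled_percentage(n_labeled: int, n_filter_positive: int) -> float:
    """Share of filter-positive patients with a gold-standard label, in percent.

    Hypothetical helper: rounding to one decimal place is an assumption
    chosen to match the precision shown in Table 1.
    """
    if n_filter_positive <= 0:
        raise ValueError("filter-positive count must be positive")
    return round(100.0 * n_labeled / n_filter_positive, 1)


# (labeled count, filter-positive count) copied from Table 1, MGB rows.
rows = {
    "Asthma": (183, 7289),
    "SCZ": (108, 456),
    "MS": (101, 136),
}

percentages = {name: labeled_percentage(*counts) for name, counts in rows.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct} %")
```

Prevalence would be computed the same way, but with the number of positive gold-standard labels in the numerator and the labeled-set size in the denominator; those positive counts are not listed in the table, so they are not reproduced here.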