Author manuscript; available in PMC: 2024 Nov 1.

Published in final edited form as: J Biomed Inform. 2023 Sep 29;147:104507. doi: 10.1016/j.jbi.2023.104507

Table 2.

Summary of Datasets I and II for model development and evaluation

	Dataset I (N = 3150)		Dataset II (N = 200)

	Clinician-confirmed TGD patients (N=1575) n (%)	Non-TGD patients filtered by keyword search (N=1575) n (%)	TGD patients by chart review (N=180) n (%)	Non-TGD patients by chart review (N=20) n (%)
Age in years, mean (SD)	35.94 (16.04)	60.92 (18.0)	34.52 (15.48)	57.85 (20.27)

Race, n (%)
Asian	77 (4.89)	37 (2.35)	8 (4.44)	1 (5.0)
Black	116 (7.37)	84 (5.33)	12 (6.67)	2 (10.0)
More than one race	50 (3.17)	6 (0.38)	6 (3.33)	0 (0.0)
Other	177 (11.24)	116 (7.37)	24 (13.33)	2 (10.0)
White	1155 (73.33)	1332 (84.57)	130 (72.22)	15 (75.0)

Ethnicity, n (%)
Hispanic	22 (1.40)	41 (2.60)	9 (5.0)	1 (5.0)
Non-Hispanic	1351 (85.78)	1321 (83.87)	146 (81.11)	15 (75.0)
Other	202 (12.83)	213 (13.52)	25 (13.89)	4 (20.0)

Patients with keywords, n (%)
Diagnoses	957 (60.76)	0 (0.0)	103 (57.22)	0 (0.0)
Procedures	422 (26.8)	0 (0.0)	10 (5.56)	3 (15.0)
Clinical notes	1402 (89.02)	0 (0.0)	172 (95.56)	17 (85.0)

Patients with missing gender fields, n (%)	884 (56.13)	691 (43.87)	84 (46.67)	15 (75.0)
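
The n (%) cells above appear to use each column's N as the denominator. The following is a minimal sketch, under that assumption, of how a reader might recompute a percentage and check that the mutually exclusive categories in a column sum to its N. The helper names `percent` and `is_consistent` are illustrative and do not come from the paper; the example values are the race rows of the Dataset II TGD column (N = 180).

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


def is_consistent(count: int, reported_pct: float, total: int,
                  tol: float = 0.01) -> bool:
    """Check a reported percentage against the one recomputed from the count.

    tol is an assumed rounding tolerance, not a value taken from the paper.
    """
    return abs(percent(count, total) - reported_pct) <= tol


# Race rows for "TGD patients by chart review" (Dataset II), as (n, %) pairs
# copied from the table.
N_DATASET2_TGD = 180
race_dataset2_tgd = {
    "Asian": (8, 4.44),
    "Black": (12, 6.67),
    "More than one race": (6, 3.33),
    "Other": (24, 13.33),
    "White": (130, 72.22),
}

# Mutually exclusive categories should add up to the column denominator.
total_count = sum(n for n, _ in race_dataset2_tgd.values())
print("counts sum to N:", total_count == N_DATASET2_TGD)

# Each reported percentage should match count / N within the tolerance.
for label, (n, pct) in race_dataset2_tgd.items():
    print(label, is_consistent(n, pct, N_DATASET2_TGD))
```

The same two checks can be run on any other column by swapping in its N and its rows. The "Patients with keywords" rows are not mutually exclusive, so only the per-cell percentage check applies to them.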