. 2022 Mar 14;10(3):e32903. doi: 10.2196/32903

Table 1.

Patient and note distribution for each data set considered in this study.

Data set				I^a (N=717)		II^a (N=5611)		III (N=93,277)		IV^b (N=93,277)		V^c (N=75,692)
Train set, n (%)				430 (59.9)		3360 (59.88)		55,966 (59.99)		55,966 (59.99)		38,381 (50.71)
Validation set, n (%)				143 (19.9)		1123 (20.01)		18,655 (19.99)		18,655 (19.99)		18,655 (24.65)
Test set, n (%)				144 (20.1)		1128 (20.10)		18,656 (20)		18,656 (20)		18,656 (24.65)
Age (years), mean (SD)				60 (23)		58 (23)		59 (23)		59 (23)		53 (23)
Gender, n (%)
		Men	306 (42.7)		2381 (42.43)		51,876 (55.61)		51,876 (55.61)		43,765 (57.82)
		Women	410 (57.2)		3229 (57.55)		41,396 (44.38)		41,396 (44.38)		31,925 (42.18)
		Unknown	1 (0.1)		1 (0.02)		5 (0.005)		5 (0.005)		2 (0.003)
Total notes, n (%)				4245 (100)		34,368 (100)		545,468 (100)		871,753 (100)		544,907 (100)
	Train set			2480 (58.42)		20,500 (59.65)		326,934 (59.94)		653,219 (74.93)		326,373 (59.89)
	Validation set			704 (16.58)		6698 (19.49)		109,726 (20.12)		109,726 (12.59)		109,726 (20.14)
	Test set			794 (18.70)		6494 (18.89)		108,808 (19.95)		108,808 (12.48)		108,808 (19.97)

^aData sets I and II are subsets of data set III.

^bData set IV represents the hybrid data set of labeled and unlabeled notes considered for the weak supervision experiment.

^cData set V contains the set of unlabeled notes from IV.
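
The split percentages in Table 1 can be recomputed from the patient counts. The following is a minimal sketch of that arithmetic, not the authors' code. The counts are copied from the table, while the names `PATIENT_SPLITS` and `split_shares` are hypothetical and introduced only for illustration. Recomputed values may differ from the printed ones in the last digit, which presumably reflects how the table was rounded.

```python
# Patient counts per split, copied from Table 1 (data sets I, II, III, and V).
# Data set IV lists the same patient counts as data set III, so it is omitted.
PATIENT_SPLITS = {
    "I": {"train": 430, "validation": 143, "test": 144},
    "II": {"train": 3360, "validation": 1123, "test": 1128},
    "III": {"train": 55_966, "validation": 18_655, "test": 18_656},
    "V": {"train": 38_381, "validation": 18_655, "test": 18_656},
}


def split_shares(counts):
    """Return each split's share of the total, in percent (hypothetical helper)."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    for data_set, counts in PATIENT_SPLITS.items():
        shares = split_shares(counts)
        rounded = {name: round(pct, 2) for name, pct in shares.items()}
        # Print the data set label, its total N, and the recomputed percentages.
        print(data_set, sum(counts.values()), rounded)
```

Summing the three splits for each data set reproduces the N given in the column header, which is a quick way to check a transcription of the table.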