2022 Mar 14;10(3):e32903. doi: 10.2196/32903

Table 1.

Patient and note distribution for each data set considered in this study.

| Data set | I^a (N=717) | II^a (N=5611) | III (N=93,277) | IV^b (N=93,277) | V^c (N=75,692) |
|---|---|---|---|---|---|
| Train set, n (%) | 430 (59.9) | 3360 (59.88) | 55,966 (59.99) | 55,966 (59.99) | 38,381 (50.71) |
| Validation set, n (%) | 143 (19.9) | 1123 (20.01) | 18,655 (19.99) | 18,655 (19.99) | 18,655 (24.65) |
| Test set, n (%) | 144 (20.1) | 1128 (20.10) | 18,656 (20) | 18,656 (20) | 18,656 (24.65) |
| Age (years), mean (SD) | 60 (23) | 58 (23) | 59 (23) | 59 (23) | 53 (23) |
| Gender: men, n (%) | 306 (42.7) | 2381 (42.43) | 51,876 (55.61) | 51,876 (55.61) | 43,765 (57.82) |
| Gender: women, n (%) | 410 (57.2) | 3229 (57.55) | 41,396 (44.38) | 41,396 (44.38) | 31,925 (42.18) |
| Gender: unknown, n (%) | 1 (0.1) | 1 (0.02) | 5 (0.005) | 5 (0.005) | 2 (0.003) |
| Total notes, n (%) | 4245 (100) | 34,368 (100) | 545,468 (100) | 871,753 (100) | 544,907 (100) |
| Notes in train set, n (%) | 2480 (58.42) | 20,500 (59.65) | 326,934 (59.94) | 653,219 (74.93) | 326,373 (59.89) |
| Notes in validation set, n (%) | 704 (16.58) | 6698 (19.49) | 109,726 (20.12) | 109,726 (12.59) | 109,726 (20.14) |
| Notes in test set, n (%) | 794 (18.70) | 6494 (18.89) | 108,808 (19.95) | 108,808 (12.48) | 108,808 (19.97) |

^a Data sets I and II are subsets of data set III.

^b Data set IV is the hybrid data set of labeled and unlabeled notes used in the weak supervision experiment.

^c Data set V contains the set of unlabeled notes from data set IV.
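
The patient split percentages in Table 1 follow directly from the counts. The sketch below is an illustration added here, not part of the study's code: it copies the patient counts from the table, checks that the three splits add up to the reported N for each data set, and recomputes each split's share. The names `SPLITS`, `TOTALS`, `split_shares`, and `check_totals` are hypothetical. Recomputed shares may differ from the printed ones in the last digit, depending on how the published values were rounded.

```python
# Patient counts per split, copied from Table 1 (data sets I to V).
SPLITS = {
    "I":   {"train": 430,   "validation": 143,   "test": 144},
    "II":  {"train": 3360,  "validation": 1123,  "test": 1128},
    "III": {"train": 55966, "validation": 18655, "test": 18656},
    "IV":  {"train": 55966, "validation": 18655, "test": 18656},
    "V":   {"train": 38381, "validation": 18655, "test": 18656},
}

# Reported patient totals (N) from the table header.
TOTALS = {"I": 717, "II": 5611, "III": 93277, "IV": 93277, "V": 75692}


def split_shares(counts):
    """Return each split's share of patients as an unrounded percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def check_totals(splits, totals):
    """Return the data sets whose split counts do not add up to the reported N."""
    return [
        name
        for name, counts in splits.items()
        if sum(counts.values()) != totals[name]
    ]


if __name__ == "__main__":
    print("data sets with mismatched totals:", check_totals(SPLITS, TOTALS))
    for name, counts in SPLITS.items():
        shares = split_shares(counts)
        print(name, {split: round(pct, 2) for split, pct in shares.items()})
```

The same two helpers could be pointed at the note counts to run the equivalent check on the lower half of the table.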