Table 1.
Patient and note distribution for each data set considered in this study.
| Data set | Iᵃ (N=717) | IIᵃ (N=5611) | III (N=93,277) | IVᵇ (N=93,277) | Vᶜ (N=75,692) |
| --- | --- | --- | --- | --- | --- |
| Train set, n (%) | 430 (59.9) | 3360 (59.88) | 55,966 (59.99) | 55,966 (59.99) | 38,381 (50.71) |
| Validation set, n (%) | 143 (19.9) | 1123 (20.01) | 18,655 (19.99) | 18,655 (19.99) | 18,655 (24.65) |
| Test set, n (%) | 144 (20.1) | 1128 (20.10) | 18,656 (20) | 18,656 (20) | 18,656 (24.65) |
| Age (years), mean (SD) | 60 (23) | 58 (23) | 59 (23) | 59 (23) | 53 (23) |
| Gender, n (%) | | | | | |
| Men | 306 (42.7) | 2381 (42.43) | 51,876 (55.61) | 51,876 (55.61) | 43,765 (57.82) |
| Women | 410 (57.2) | 3229 (57.55) | 41,396 (44.38) | 41,396 (44.38) | 31,925 (42.18) |
| Unknown | 1 (0.1) | 1 (0.02) | 5 (0.005) | 5 (0.005) | 2 (0.003) |
| Total notes, n (%) | 4245 (100) | 34,368 (100) | 545,468 (100) | 871,753 (100) | 544,907 (100) |
| Train set | 2480 (58.42) | 20,500 (59.65) | 326,934 (59.94) | 653,219 (74.93) | 326,373 (59.89) |
| Validation set | 704 (16.58) | 6698 (19.49) | 109,726 (20.12) | 109,726 (12.59) | 109,726 (20.14) |
| Test set | 794 (18.70) | 6494 (18.89) | 108,808 (19.95) | 108,808 (12.48) | 108,808 (19.97) |
ᵃ Data sets I and II are subsets of data set III.
ᵇ Data set IV is the hybrid data set of labeled and unlabeled notes used for the weak supervision experiment.
ᶜ Data set V contains the unlabeled notes from data set IV.