2024 Jan 25;26:e48443. doi: 10.2196/48443

Table 2.

Protected health information statistics for the 4 sentence groups (English-only sentences, Chinese-only sentences, mixed Chinese-English sentences, and numeric or symbolic sentences).

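| Protected health information type | English-only sentences, training set, n | English-only sentences, test set, n | Chinese-only sentences, training set, n | Chinese-only sentences, test set, n | Chinese-English mixed sentences, training set, n | Chinese-English mixed sentences, test set, n | Numeric or symbolic sentences, training set, n | Numeric or symbolic sentences, test set, n | Entire corpus, training set, n | Entire corpus, test set, n |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Date^a | 12,436 | 2346 | 2146 | 458 | 4530 | 919 | 19,850 | 3724 | 38,962 | 7447 |
| Age^a | 2444 | 533 | 21 | 5 | 639 | 128 | 0 | 0 | 3104 | 666 |
| Name^a | 100 | 18 | 273 | 44 | 1339 | 240 | 0 | 0 | 1712 | 302 |
| - Patient | 9 | 9 | 106 | 12 | 42 | 11 | 0 | 0 | 157 | 32 |
| - Person | 8 | 3 | 6 | 1 | 128 | 32 | 0 | 0 | 142 | 36 |
| - Doctor | 83 | 6 | 161 | 31 | 1169 | 197 | 0 | 0 | 1413 | 234 |
| Location^a | 5030 | 1014 | 4812 | 1108 | 5330 | 1132 | 4 | 3 | 15,176 | 3257 |
| - Named location | 17 | 10 | 11 | 3 | 271 | 69 | 0 | 0 | 299 | 82 |
| - Nationality | 21 | 7 | 3 | 1 | 17 | 10 | 0 | 0 | 41 | 18 |
| - Region | 6 | 11 | 0 | 0 | 10 | 1 | 0 | 0 | 16 | 12 |
| - Country | 355 | 52 | 6 | 5 | 171 | 44 | 0 | 0 | 532 | 101 |
| - City | 196 | 39 | 26 | 6 | 510 | 101 | 0 | 0 | 732 | 146 |
| - Hospital | 1552 | 266 | 73 | 11 | 1915 | 393 | 0 | 0 | 3540 | 670 |
| - Department | 1616 | 391 | 2278 | 542 | 762 | 111 | 0 | 0 | 4656 | 1044 |
| - Room | 1007 | 182 | 1200 | 267 | 248 | 53 | 4 | 3 | 2459 | 505 |
| - Number | 0 | 0 | 1162 | 256 | 25 | 1 | 0 | 0 | 1187 | 257 |
| - School | 28 | 6 | 15 | 2 | 642 | 159 | 0 | 0 | 685 | 167 |
| - Generic location | 218 | 45 | 37 | 15 | 708 | 173 | 0 | 0 | 963 | 233 |
| - Market | 12 | 3 | 1 | 0 | 50 | 17 | 0 | 0 | 63 | 20 |
| Profession^a | 575 | 107 | 82 | 9 | 1566 | 449 | 0 | 0 | 2223 | 565 |
| ID^a | 54 | 0 | 192 | 26 | 197 | 5 | 7 | 0 | 449 | 31 |
| - ID number | 43 | 0 | 98 | 11 | 170 | 2 | 7 | 0 | 318 | 13 |
| - Medical record | 1 | 0 | 92 | 14 | 19 | 3 | 0 | 0 | 112 | 17 |
| Total number of sentences | 20,639 | 4018 | 7526 | 1650 | 13,604 | 2873 | 19,861 | 3727 | 61,630 | 12,268 |

^a Coarse-grained protected health information.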