Table 3.
PHI distributions in the i2b2/UTHealth 2014 de-identification corpus
PHI category | # in training data | # in test data | Total # in corpus |
---|---|---|---|
NAME: PATIENT | 1,316 | 879 | 2,195 |
NAME: DOCTOR | 2,885 | 1,912 | 4,797 |
NAME: USERNAME | 264 | 92 | 356 |
PROFESSION | 234 | 179 | 413 |
LOCATION: HOSPITAL | 1,437 | 875 | 2,312 |
LOCATION: ORGANIZATION | 124 | 82 | 206 |
LOCATION: STREET | 216 | 136 | 352 |
LOCATION: CITY | 394 | 260 | 654 |
LOCATION: STATE | 314 | 190 | 504 |
LOCATION: COUNTRY | 66 | 117 | 183 |
LOCATION: ZIP CODE | 212 | 140 | 352 |
LOCATION: OTHER | 4 | 13 | 17 |
AGE | 1,233 | 764 | 1,997 |
DATE | 7,507 | 4,980 | 12,487 |
CONTACT: PHONE | 309 | 215 | 524 |
CONTACT: FAX | 8 | 2 | 10 |
CONTACT: EMAIL | 4 | 1 | 5 |
CONTACT: URL | 2 | 0 | 2 |
CONTACT: IPADDRESS | 0 | 0 | 0 |
ID: SSN | 0 | 0 | 0 |
ID: MEDICAL RECORD | 611 | 422 | 1,033 |
ID: HEALTH PLAN | 1 | 0 | 1 |
ID: ACCOUNT | 0 | 0 | 0 |
ID: LICENSE | 0 | 0 | 0 |
ID: VEHICLE | 0 | 0 | 0 |
ID: DEVICE | 7 | 8 | 15 |
ID: BIO ID | 1 | 0 | 1 |
ID: ID NUMBER | 261 | 195 | 456 |
Total # of tags | 17,410 | 11,462 | 28,872 |
Average PHI per file | 22.03 | 22.30 | 22.14 |
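
As a consistency check on the table, the per-category counts can be summed and compared against the "Total # of tags" row. The following is a minimal sketch, not part of the corpus tooling; the names `PHI_COUNTS` and `column_totals` are illustrative assumptions, and the numbers are transcribed from the rows above.

```python
# Counts transcribed from Table 3: category -> (training, test).
# PHI_COUNTS and column_totals are hypothetical names used only for this check.
PHI_COUNTS = {
    "NAME: PATIENT": (1316, 879),
    "NAME: DOCTOR": (2885, 1912),
    "NAME: USERNAME": (264, 92),
    "PROFESSION": (234, 179),
    "LOCATION: HOSPITAL": (1437, 875),
    "LOCATION: ORGANIZATION": (124, 82),
    "LOCATION: STREET": (216, 136),
    "LOCATION: CITY": (394, 260),
    "LOCATION: STATE": (314, 190),
    "LOCATION: COUNTRY": (66, 117),
    "LOCATION: ZIP CODE": (212, 140),
    "LOCATION: OTHER": (4, 13),
    "AGE": (1233, 764),
    "DATE": (7507, 4980),
    "CONTACT: PHONE": (309, 215),
    "CONTACT: FAX": (8, 2),
    "CONTACT: EMAIL": (4, 1),
    "CONTACT: URL": (2, 0),
    "CONTACT: IPADDRESS": (0, 0),
    "ID: SSN": (0, 0),
    "ID: MEDICAL RECORD": (611, 422),
    "ID: HEALTH PLAN": (1, 0),
    "ID: ACCOUNT": (0, 0),
    "ID: LICENSE": (0, 0),
    "ID: VEHICLE": (0, 0),
    "ID: DEVICE": (7, 8),
    "ID: BIO ID": (1, 0),
    "ID: ID NUMBER": (261, 195),
}


def column_totals(counts):
    """Return (training, test, corpus) totals summed over all categories."""
    train = sum(tr for tr, _ in counts.values())
    test = sum(te for _, te in counts.values())
    return train, test, train + test


train_total, test_total, corpus_total = column_totals(PHI_COUNTS)
print(train_total, test_total, corpus_total)  # 17410 11462 28872
```

The sums reproduce the reported totals of 17,410 training tags, 11,462 test tags, and 28,872 tags overall.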