Author manuscript; available in PMC: 2016 Aug 9.
Published in final edited form as: J Biomed Inform. 2015 Aug 28;58(Suppl):S20–S29. doi: 10.1016/j.jbi.2015.07.020

Table 3.

PHI distributions in the i2b2/UTHealth 2014 de-identification corpus

| PHI category | # in training data | # in test data | Total # in corpus |
|---|---|---|---|
| NAME: PATIENT | 1,316 | 879 | 2,195 |
| NAME: DOCTOR | 2,885 | 1,912 | 4,797 |
| NAME: USERNAME | 264 | 92 | 356 |
| PROFESSION | 234 | 179 | 413 |
| LOCATION: HOSPITAL | 1,437 | 875 | 2,312 |
| LOCATION: ORGANIZATION | 124 | 82 | 206 |
| LOCATION: STREET | 216 | 136 | 352 |
| LOCATION: CITY | 394 | 260 | 654 |
| LOCATION: STATE | 314 | 190 | 504 |
| LOCATION: COUNTRY | 66 | 117 | 183 |
| LOCATION: ZIP CODE | 212 | 140 | 352 |
| LOCATION: OTHER | 4 | 13 | 17 |
| AGE | 1,233 | 764 | 1,997 |
| DATE | 7,507 | 4,980 | 12,487 |
| CONTACT: PHONE | 309 | 215 | 524 |
| CONTACT: FAX | 8 | 2 | 10 |
| CONTACT: EMAIL | 4 | 1 | 5 |
| CONTACT: URL | 2 | 0 | 2 |
| CONTACT: IPADDRESS | 0 | 0 | 0 |
| ID: SSN | 0 | 0 | 0 |
| ID: MEDICAL RECORD | 611 | 422 | 1,033 |
| ID: HEALTH PLAN | 1 | 0 | 1 |
| ID: ACCOUNT | 0 | 0 | 0 |
| ID: LICENSE | 0 | 0 | 0 |
| ID: VEHICLE | 0 | 0 | 0 |
| ID: DEVICE | 7 | 8 | 15 |
| ID: BIO ID | 1 | 0 | 1 |
| ID: ID NUMBER | 261 | 195 | 456 |
| Total # of tags | 17,410 | 11,462 | 28,872 |
| Average PHI per file | 22.03 | 22.3 | 22.14 |
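
The per-category counts in Table 3 can be cross-checked against the reported totals with a few lines of Python. The following is a minimal sketch, not part of the original paper: the counts are transcribed from the table, while the names `COUNTS`, `column_totals`, and `group_totals` are hypothetical helpers introduced here for illustration. Grouping by the text before the colon (NAME, LOCATION, CONTACT, ID) is an assumption about how the categories nest.

```python
# Counts transcribed from Table 3: category -> (training, test).
COUNTS = {
    "NAME: PATIENT": (1316, 879),
    "NAME: DOCTOR": (2885, 1912),
    "NAME: USERNAME": (264, 92),
    "PROFESSION": (234, 179),
    "LOCATION: HOSPITAL": (1437, 875),
    "LOCATION: ORGANIZATION": (124, 82),
    "LOCATION: STREET": (216, 136),
    "LOCATION: CITY": (394, 260),
    "LOCATION: STATE": (314, 190),
    "LOCATION: COUNTRY": (66, 117),
    "LOCATION: ZIP CODE": (212, 140),
    "LOCATION: OTHER": (4, 13),
    "AGE": (1233, 764),
    "DATE": (7507, 4980),
    "CONTACT: PHONE": (309, 215),
    "CONTACT: FAX": (8, 2),
    "CONTACT: EMAIL": (4, 1),
    "CONTACT: URL": (2, 0),
    "CONTACT: IPADDRESS": (0, 0),
    "ID: SSN": (0, 0),
    "ID: MEDICAL RECORD": (611, 422),
    "ID: HEALTH PLAN": (1, 0),
    "ID: ACCOUNT": (0, 0),
    "ID: LICENSE": (0, 0),
    "ID: VEHICLE": (0, 0),
    "ID: DEVICE": (7, 8),
    "ID: BIO ID": (1, 0),
    "ID: ID NUMBER": (261, 195),
}


def column_totals(counts):
    """Return (training, test, corpus) tag totals summed over all categories."""
    train = sum(tr for tr, _ in counts.values())
    test = sum(te for _, te in counts.values())
    return train, test, train + test


def group_totals(counts):
    """Sum corpus-wide counts by the prefix before the colon (hypothetical grouping)."""
    groups = {}
    for category, (tr, te) in counts.items():
        group = category.split(":")[0]
        groups[group] = groups.get(group, 0) + tr + te
    return groups


train_total, test_total, corpus_total = column_totals(COUNTS)
print(train_total, test_total, corpus_total)  # 17410 11462 28872

# Print each top-level group, largest first, with its share of all tags.
for group, n in sorted(group_totals(COUNTS).items(), key=lambda kv: -kv[1]):
    print(f"{group:<12}{n:>7}  {n / corpus_total:.1%}")
```

The summed columns reproduce the "Total # of tags" row, which confirms that the merged CONTACT: IPADDRESS and ID: SSN rows each carry zero counts.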