Table 2.
Complete list of all 28 features annotated by the NLP pipeline
| Lexical | Frequency | Medical dictionary | Known PHI |
|---|---|---|---|
| Part of Speech | Term Frequency (Token) | # matches HL7 2.5 | # matches US Census Names |
| Part of Speech (Binned) | Term Frequency (Token, Part of Speech) | # matches HL7 3.0 | |
| Capitalization | | # matches ICD9 CM | # matches for pattern HOSPITAL |
| Word or Number | | # matches ICD10 CM | # matches for pattern AGE |
| Length | | # matches ICD10 PCS | # matches for pattern DATE |
| | | # matches LOINC | # matches for pattern DOCTOR |
| | | # matches MESH | # matches for pattern LOCATION |
| | | # matches RXNORM | # matches for pattern PATIENT |
| | | # matches SNOMED | # matches for pattern ID |
| | | # matches COSTAR | # matches for pattern PHONE |
| | | # consecutive tokens any dictionary | # consecutive tokens any pattern |
In the lexical phase, each word token is annotated with its part of speech and capitalization. In the frequency phase, each word is annotated with its frequency of appearance in public and private medical texts. In the dictionary phase, each word is compared against lists of standard medical concepts from UMLS sources. In the known PHI phase, tokens and phrases are compared against suspicious patterns of HIPAA identifiers.
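
As a rough illustration of how the four phases could combine into a single per-token feature row, the following minimal Python sketch may help. It is not the pipeline's actual implementation: the dictionary contents, the name list, the regular expressions, and the names `DICTIONARIES`, `CENSUS_NAMES`, `PATTERNS`, `capitalization`, and `annotate` are all hypothetical placeholders. Part-of-speech tagging is omitted because it requires an external tagger, and each "# matches" feature is simplified to a 0/1 indicator per token.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the real resources; the entries are illustrative only.
DICTIONARIES = {
    "ICD9_CM": {"diabetes", "hypertension"},
    "RXNORM": {"metformin", "aspirin"},
}
CENSUS_NAMES = {"smith", "johnson"}
PATTERNS = {
    "DATE": re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "PHONE": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def capitalization(token):
    """Coarse capitalization class for a token (assumed categories)."""
    if not any(ch.isalpha() for ch in token):
        return "none"
    if token.isupper():
        return "upper"
    if token[0].isupper():
        return "initial"
    return "lower"


def annotate(tokens, term_counts):
    """Return one feature dict per token, covering the four phases."""
    rows = []
    for tok in tokens:
        low = tok.lower()
        row = {
            "token": tok,
            # Lexical phase (part of speech omitted in this sketch)
            "capitalization": capitalization(tok),
            "word_or_number": "number" if tok.isdigit() else "word",
            "length": len(tok),
            # Frequency phase: lookup in precomputed corpus counts
            "term_frequency": term_counts.get(low, 0),
        }
        # Dictionary phase: one indicator per medical vocabulary
        for name, terms in DICTIONARIES.items():
            row["matches_" + name] = int(low in terms)
        # Known PHI phase: name list plus identifier patterns
        row["matches_census_names"] = int(low in CENSUS_NAMES)
        for name, pattern in PATTERNS.items():
            row["matches_pattern_" + name] = int(pattern.fullmatch(tok) is not None)
        rows.append(row)
    return rows


if __name__ == "__main__":
    counts = Counter({"took": 120, "metformin": 45, "on": 900})
    for row in annotate(["Smith", "took", "metformin", "on", "12/03/2014"], counts):
        print(row)
```

In this sketch a token such as "Smith" would be flagged by the name list but by none of the medical dictionaries, whereas "metformin" would show the opposite profile; contrasts of this kind are what the feature groups in Table 2 are designed to expose.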