Table 1.
| | | HMS Scrubber | MeDS | MIT deid | MIST | HIDE |
|---|---|---|---|---|---|---|
| Main technique | Rule-based | X | X | X | n/a | n/a |
| | ML-based | n/a | n/a | n/a | X | X |
| | Programming language | Java | Java | Perl | Python | Python |
| | ML algorithm | n/a | n/a | n/a | CRF (Carafe) | CRF (CRFsuite) |
| | Input documents | XML/txt | HL7/txt | txt | txt/XML-inline/json | XML/txt/HL7 |
| | HIPAA compliant | X | X | X | ¹ | ¹ |
| | Regular expressions (#) | ~50 | ~40 | ~90 | ² | ² |
| | PHI markers (e.g., Mr.) | X | X | X | ³ | -- |
| | Part-of-speech information | -- | X | -- | -- | -- |
| | String similarity techniques (e.g., edit distance, fuzzy matching) | -- | X | -- | -- | -- |
| Dictionaries* (size) | Person names | ~101K | ~280K | ~96K ⁴ | -- | -- |
| | Geographic places | | ~167K | ~4K | -- | -- |
| | US area code | -- | -- | ~380 | -- | -- |
| | Medical phrases | -- | ~50 | ~28 | -- | -- |
| | Medical terms | -- | ~80K | ~175K | -- | -- |
| | Companies | -- | ~200 | ~500 | -- | -- |
| | Ethnicities | -- | ~120 | ~195 | -- | -- |
| | Common words | -- | ~220K | ~50K | -- | -- |
| Machine learning features | Contextual window | n/a | n/a | n/a | 3-words | 4-words |
| | Morphological (#) | n/a | n/a | n/a | 22 | 34 |
| | Syntactic | n/a | n/a | n/a | -- | -- |
| | Semantic | n/a | n/a | n/a | -- | -- |
| | From dictionaries | n/a | n/a | n/a | ⁵ | ⁵ |

*HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).
*MeDS’ dictionary sources: Ispell, 2005 SS Death Index, Regenstrief Medical Record System, UMLS, MeSH.
*MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.
1 HIPAA compliance depends on the types of PHI instances used for training.
2 Both MIST and HIDE use regular expressions to derive morphological features from tokens (e.g., all_caps_token ‘^[A-Z]+$’).
3 Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).
4 The person name dictionaries comprise lists of names, last names and name prefixes.
5 These systems are tailored to derive features from dictionaries; however, they are not distributed with any default dictionary.
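
The regular-expression-derived morphological features and the contextual window listed for the ML-based systems can be illustrated with a minimal Python sketch. Only the `all_caps_token` pattern comes from the table notes; the other feature names, the helper functions `morphological_features` and `token_features`, and the reading of the window as a number of tokens on each side are assumptions made for illustration, not the actual feature sets of MIST or HIDE.

```python
import re

# Only all_caps_token is taken from the table notes; the other two
# patterns are hypothetical examples of the same kind of feature.
MORPH_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "init_cap_token": re.compile(r"^[A-Z][a-z]+$"),
    "all_digits_token": re.compile(r"^[0-9]+$"),
}


def morphological_features(token):
    """Return the names of all patterns that the token matches."""
    return [name for name, pat in MORPH_PATTERNS.items() if pat.match(token)]


def token_features(tokens, i, window=3):
    """Build features for tokens[i]: its morphology plus neighbouring words.

    Assumption: `window` counts tokens on each side of position i.
    """
    feats = {"morph": morphological_features(tokens[i])}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            # Lower-cased neighbour, keyed by its signed offset.
            feats[f"word[{offset:+d}]"] = tokens[j].lower()
    return feats


if __name__ == "__main__":
    sentence = ["Mr.", "Smith", "was", "seen", "at", "MGH"]
    for idx in range(len(sentence)):
        print(sentence[idx], token_features(sentence, idx, window=1))
```

A feature dictionary of this kind would typically be produced for every token and passed to a CRF trainer; the encoding expected by a specific CRF toolkit is not shown here.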