2012 Jul 27;12:109. doi: 10.1186/1471-2288-12-109

Table 1.

Main characteristics of the de-identification tools

Characteristic                                                      HMS Scrubber  MeDS      MIT deid        MIST                  HIDE
Main technique
  Rule-based                                                        X             X         X               n/a                   n/a
  ML-based                                                          n/a           n/a       n/a             X                     X
Programming language                                                Java          Java      Perl            Python                Python
ML algorithm                                                        n/a           n/a       n/a             CRF (Carafe)          CRF (CRFsuite)
Input documents                                                     XML/txt       HL7/txt   txt             txt/XML-inline/json   XML/txt/HL7
HIPAA compliant                                                     X             X         X               note 1                note 1
Regular Expressions (#)                                             ~50           ~40       ~90             note 2                note 2
PHI markers (e.g., Mr.)                                             X             X         X               note 3                --
Part-of-speech information                                          --            X         --              --                    --
String similarity techniques (e.g. edit distance, fuzzy matching)   --            X         --              --                    --
Dictionaries* (size)
  Person names                                                      ~101K         ~280K     ~96K (note 4)   --                    --
  Geographic places                                                               ~167K     ~4K             --                    --
  US area code                                                      --            --        ~380            --                    --
  Medical phrases                                                   --            ~50       ~28             --                    --
  Medical terms                                                     --            ~80K      ~175K           --                    --
  Companies                                                         --            ~200      ~500            --                    --
  Ethnicities                                                       --            ~120      ~195            --                    --
  Common words                                                      --            ~220K     ~50K            --                    --
Machine Learning features
  Contextual window                                                 n/a           n/a       n/a             3-words               4-words
  Morphological (#)                                                 n/a           n/a       n/a             22                    34
  Syntactic                                                         n/a           n/a       n/a             --                    --
  Semantic                                                          n/a           n/a       n/a             --                    --
  From dictionaries                                                 n/a           n/a       n/a             note 5                note 5

*HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).

*MeDS’ dictionary sources: Ispell, 2005 Social Security Death Index, Regenstrief Medical Record System, UMLS, MeSH.

*MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.

1 Depends on the types of PHI instances used for training.

2 Both MIST and HIDE use regular expressions to derive the morphological features from the tokens (e.g., all_caps_token ‘^[A-Z]+$’).

3 Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).

4 Person names dictionaries comprise lists of names, last names and name prefixes.

5 These systems are designed to derive features from dictionaries; however, they are not distributed with any default dictionary.
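
To make the machine-learning feature rows more concrete, the following is a minimal Python sketch of how regular-expression-based morphological features, dictionary-lookup features and a contextual window could be computed for a single token. It is not the code used by MIST or HIDE. Only the all_caps_token pattern is taken from the notes above; the other pattern names, the function token_features, the feature-key format and the reading of the window as a number of words on each side are assumptions made for illustration.

```python
import re

# Only all_caps_token comes from the table notes; the other two patterns
# are hypothetical examples of morphological features.
MORPHOLOGICAL_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "init_cap_token": re.compile(r"^[A-Z][a-z]+$"),
    "all_digits_token": re.compile(r"^[0-9]+$"),
}


def token_features(tokens, index, dictionaries=None, window=3):
    """Return a feature dict for tokens[index] (illustrative only).

    dictionaries maps a dictionary name to a set of lowercase entries.
    window is assumed to mean the number of words taken on each side.
    """
    dictionaries = dictionaries or {}
    token = tokens[index]
    features = {"token": token.lower()}

    # Morphological features: one boolean per matching regular expression.
    for name, pattern in MORPHOLOGICAL_PATTERNS.items():
        if pattern.match(token):
            features[name] = True

    # Dictionary features: flag membership in each user-supplied word list.
    for dict_name, entries in dictionaries.items():
        if token.lower() in entries:
            features["in_" + dict_name] = True

    # Contextual window: neighbouring words keyed by their signed offset.
    for offset in range(-window, window + 1):
        position = index + offset
        if offset != 0 and 0 <= position < len(tokens):
            features["w[%+d]" % offset] = tokens[position].lower()

    return features
```

For example, calling token_features(["Seen", "by", "Dr", "SMITH", "in", "Boston"], 3, {"person_names": {"smith"}}) would flag the token as all caps and as a member of the hypothetical person_names list, and would add the surrounding words as context features. Dictionaries of this kind would have to be supplied by the user, since neither system ships with default ones.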