2012 Jul 27;12:109. doi: 10.1186/1471-2288-12-109

Table 1.

Main characteristics of the de-identification tools

		HMS Scrubber	MeDS	MIT deid	MIST	HIDE
Main technique	Rule-based	X	X	X	n/a	n/a
	ML-based	n/a	n/a	n/a	X	X
Programming language		Java	Java	Perl	Python	Python
ML algorithm		n/a	n/a	n/a	CRF (Carafe)	CRF (CRFsuite)
Input documents		XML/txt	HL7/txt	txt	txt/XML-inline/json	XML/txt/HL7
HIPAA compliant		X	X	X	¹	¹
Regular Expressions (#)		~50	~40	~90	²	²
PHI markers (e.g., Mr.)		X	X	X	³	--
Part-of-speech information		--	X	--	--	--
String similarity techniques (e.g. edit distance, fuzzy matching)		--	X	--	--	--
Dictionaries* (size)	Person names	~101K	~280K	~96K⁴	--	--
	Geographic places		~167K	~4K	--	--
	US area code	--	--	~380	--	--
	Medical phrases	--	~50	~28	--	--
	Medical terms	--	~80K	~175K	--	--
	Companies	--	~200	~500	--	--
	Ethnicities	--	~120	~195	--	--
	Common words	--	~220K	~50K	--	--
Machine Learning features	Contextual window	n/a	n/a	n/a	3-words	4-words
	Morphological (#)	n/a	n/a	n/a	22	34
	Syntactic	n/a	n/a	n/a	--	--
	Semantic	n/a	n/a	n/a	--	--
	From dictionaries	n/a	n/a	n/a	⁵	⁵

*HMS Scrubber’s dictionary sources: 1990 US Census (person and place names).

*MeDS’ dictionary sources: Ispell, 2005 SS Death Index, Regenstrief Medical Record System, UMLS, MeSH.

*MIT deid’s dictionary sources: 1990 US Census, MIMIC II Database, Atkinson’s Spell Checking Oriented Word Lists, UMLS.

¹ HIPAA compliance depends on the types of PHI instances used for training.

² Both MIST and HIDE use regular expressions to derive the morphological features from the tokens (e.g., all_caps_token ‘^[A-Z]+$’).

³ Within MIST, PHI markers are used only for detecting companies (e.g., “Ltd.”).

⁴ The person name dictionaries comprise lists of names, last names and name prefixes.

⁵ These systems are designed to derive features from dictionaries; however, they are not distributed with any default dictionary.
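
As an illustration of footnote ² and the "Contextual window" row, the following minimal sketch shows how regular expressions could turn tokens into morphological features and how those features could be gathered over a window of neighbouring tokens. Only the all_caps_token pattern comes from the table notes. The other feature names and patterns, the function names, and the reading of the window size as "tokens on each side" are assumptions made for this example and do not reproduce the actual feature sets of MIST or HIDE.

```python
import re

# Only "all_caps_token" is taken from the table notes; the remaining
# patterns are hypothetical examples of morphological features.
MORPHOLOGICAL_PATTERNS = {
    "all_caps_token": re.compile(r"^[A-Z]+$"),
    "init_cap_token": re.compile(r"^[A-Z][a-z]+$"),
    "all_digits_token": re.compile(r"^[0-9]+$"),
    "has_digit": re.compile(r"[0-9]"),
}


def morphological_features(token):
    """Return the names of all patterns that match the token."""
    return [
        name
        for name, pattern in MORPHOLOGICAL_PATTERNS.items()
        if pattern.search(token)
    ]


def window_features(tokens, index, size=1):
    """Collect features for tokens[index] and its neighbours.

    `size` is the number of tokens considered on each side (an assumption;
    the table does not say how the window is counted). Each feature name is
    suffixed with the signed offset of the token it was derived from.
    """
    features = []
    for offset in range(-size, size + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            for name in morphological_features(tokens[position]):
                features.append(f"{name}[{offset:+d}]")
    return features


if __name__ == "__main__":
    sentence = ["Dr", "SMITH", "42"]
    print(window_features(sentence, 1, size=1))
```

In a CRF-based tagger, feature lists of this kind would be produced for every token and passed to the learner together with the PHI labels from the training corpus.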