A De-identifier for Medical Discharge Summaries

Özlem Uzuner; Tawanda C Sibanda; Yuan Luo; Peter Szolovits

doi:10.1016/j.artmed.2007.10.001

. Author manuscript; available in PMC: 2009 Jan 1.

Published in final edited form as: Artif Intell Med. 2007 Nov 28;42(1):13–35. doi: 10.1016/j.artmed.2007.10.001

A De-identifier for Medical Discharge Summaries^¹

Özlem Uzuner ¹, Tawanda C Sibanda ², Yuan Luo ³, Peter Szolovits ⁴

PMCID: PMC2271040 NIHMSID: NIHMS39403 PMID: 18053696

Abstract

Objective

Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals.

The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain foreign and misspelled PHI; they also contain PHI that are ambiguous with non-PHI. These complications are compounded by the linguistic characteristics of clinical records. For example, medical discharge summaries, which are studied in this paper, are characterized by fragmented, incomplete utterances and domain-specific language; they cannot be fully processed by tools designed for lay language.

Methods and Results

In this paper, we show that we can de-identify medical discharge summaries using a de-identifier, Stat De-id, based on support vector machines and local context (F-measure = 97% on PHI). Our representation of local context aids de-identification even when PHI include out-of-vocabulary words and even when PHI are ambiguous with non-PHI within the same corpus. Comparison of Stat De-id with a rule-based approach shows that local context contributes more to de-identification than dictionaries combined with hand-tailored heuristics (F-measure = 85%). Comparison with two well-known named entity recognition (NER) systems, SNoW (F-measure = 94%) and IdentiFinder (F-measure = 36%), on five representative corpora show that when the language of documents is fragmented, a system with a relatively thorough representation of local context can be a more effective de-identifier than systems that combine (relatively simpler) local context with global context. Comparison with a Conditional Random Field De-identifier (CRFD), which utilizes global context in addition to the local context of Stat De-id, confirms this finding (F-measure = 88%) and establishes that strengthening the representation of local context may be more beneficial for de-identification than complementing local with global context.

Keywords: automatic de-identification of narrative patient records, local lexical context, local syntactic context, dictionaries, sentential global context, syntactic information for de-identification

1 Introduction

Medical discharge summaries can be a major source of information for many studies. However, like all other clinical records, discharge summaries contain explicit personal health information (PHI) which, if released, would jeopardize patient privacy. In the United States, the Health Information Portability and Accountability Act (HIPAA) provides guidelines for protecting the confidentiality of patient records. Paragraph 164.514 of the Administrative Simplification Regulations promulgated under the HIPAA states that for data to be treated as de-identified, it must clear one of two hurdles.

An expert must determine and document “that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.”
Or, the data must be purged of a specified list of seventeen categories of possible identifiers, i.e., PHI, relating to the patient or relatives, household members and employers, and any other information that may make it possible to identify the individual [1]. Many institutions consider the clinicians caring for a patient and the names of hospitals, clinics, and wards to fall under this final category because of the heightened risk of identifying patients from such information [2, 3].

Of the seventeen categories of PHI listed by HIPAA, the following appear in medical discharge summaries: first and last names of patients, of their health proxies, and of their family members; identification numbers; telephone, fax, and pager numbers; geographic locations; and dates. In addition, names of doctors and hospitals are frequently mentioned in discharge summaries; for this study, we add them to the list of PHI. Given discharge summaries, our goal is to find the above listed PHI and to replace them with either anonymous tags or realistic surrogates.

Medical discharge summaries are characterized by fragmented, incomplete utterances and domain-specific language. As such, they cannot be effectively processed by tools designed for lay language text such as news articles [4]. In addition, discharge summaries contain some words that can appear both as PHI and non-PHI within the same corpus, e.g., the word Huntington can be both the name of a person, “Dr. Huntington”, and the name of a disease, “Huntington’s disease”. They also contain foreign and misspelled words as PHI, e.g., John misspelled as Jhn and foreign variants such as Ioannes. These complexities pose challenges to de-identification.

An ideal de-identification system needs to identify PHI perfectly. However, while anonymizing the PHI, such a system needs to also protect the integrity of the data by maintaining all of the non-PHI, so that medical records can later be processed and retrieved based on their inclusion of these terms. Almost all methods that determine whether a target word², i.e., the word to be classified as PHI or non-PHI, is PHI base their decision on a combination of features related to the target itself, to words that surround the target, and to discourse segments containing the target. We call the features extracted from the words surrounding the target and from the discourse segment containing the target the context of the target. In this paper, we are particularly interested in comparing methods that rely on what we call local context, by which we mean the words that immediately surround the target (local lexical context) or that are linked to it by some immediate syntactic relationship (local syntactic context), and global context, which refers to the relationships of the target with the contents of the discourse segment containing the target. For example, the surrounding k-tuples of words to the left and right of a target are common components of local context, whereas a model that selects the highest probability interpretation of an entire sentence by a Markov model employs sentential global context (where the discourse segment is a sentence).

In this paper, we present a de-identifier, Stat De-id, which uses local context to de-identify medical discharge summaries. We treat de-identification as a multi-class classification task; the goal is to consider each word in isolation and to decide whether it represents a patient, doctor, hospital, location, date, telephone, ID, or non-PHI. We use Support Vector Machines (SVMs), as implemented by LibSVM [5], trained on human-annotated data as a means to this end.

Our representation of local context benefits from orthographic, syntactic, and semantic characteristics of each target word and the words within a ±2 context window of the target. Other models of local context have used the features of words immediately adjacent to the target word; our representation is more thorough as it includes (for a ±2 context) local syntactic context, i.e., the features of words that are linked to the target by syntactic relations identified by a parse of the sentence. This novel representation of local syntactic context uses the Link Grammar Parser [6], which can provide at least a partial syntactic parse even for incomplete and fragmented sentences [7]. Note that syntactic parses can be generally regarded as sentential features. However, in our corpora, more than 40% of the sentences only partially parse. The features extracted from such partial parses represent phrases rather than sentences and contribute to local context. For sentences that completely parse, our representation benefits from syntactic parses only to the extent that they help us relate the target to its immediate neighbors (within 2 links), again extracting local context.

On five separate corpora obtained from Partners Healthcare and Beth Israel Deaconess Medical Center, we show that despite the fragmented and incomplete utterances and the domain-specific language that dominate the text of discharge summaries, we can capture the patterns in the language of these documents by focusing on local context; we can use these patterns for de-identification. Stat De-id, presented in this paper, is built on this hypothesis. It finds more than 90% of the PHI even in the face of ambiguity between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI.

We compare Stat De-id with a rule-based heuristic+dictionary approach [8] two named entity recognizers, SNoW [9] and IdentiFinder [10], and a Conditional Random Field De-identifier (CRFD). SNoW and IdentiFinder also use local context; however, their representation of local context is relatively simple and, for named entity recognition (NER), is complemented with information from sentential global context, i.e., the dependencies of entities with each other and with non-entity tokens in a single sentence. CRFD, developed by us for the studies presented in this paper, employs the exact same local context used by Stat De-id and reinforces this local context with sentential global context. In this manuscript, we refer to sentential global context simply as global context. Because medical discharge summaries contain many short, fragmented sentences, we hypothesize that global context will add limited value to local context for de-identification, and that strengthening the representation of local context will be more effective for improving de-identification. We present experimental results to support this hypothesis: On our corpora, Stat De-id significantly outperforms all of SNoW, IdentiFinder, CRFD, and the heuristic+dictionary approach.

The performance of Stat De-id is encouraging and can guide research in identification of entities in corpora with fragmented, incomplete utterances and even domain-specific language. Our results show that even on such corpora, it is possible to create a useful representation of local context and to identify the entities indicated by this context.

2 Background and Related Work

A number of investigators have developed methods for de-identifying medical corpora or for recognizing named entities in non-clinical text (which can be directly applied to at least part of the de-identification problem). The two main approaches taken have been either (a) use of dictionaries, pattern matching, and local rules or (b) statistical methods trained on features of the word(s) in question and their local or global context. Our work on Stat De-id falls into the second of these traditions and differs from others mainly in its use of novel local context features determined from a (perhaps partial) syntactic parse of the text.

2.1 De-identification

Most de-identification systems use dictionaries and simple contextual rules to recognize PHI [8, 11]. Gupta et al. [11], for example, describe the DeID system which uses the U.S. Census dictionaries to find proper names, employs patterns to detect phone numbers and zip codes, and takes advantage of contextual clues (such as section headings) to mark doctor and patient names. Gupta et al. report that, after scrubbing with DeID, of the 300 reports scrubbed, two reports still contained accession numbers, two reports contained clinical trial names, three reports retained doctors’ names, and three reports contained hospital or lab names.

Beckwith et al. [12] present a rule-based de-identifier for pathology reports. Unlike our discharge summaries, pathology reports contain significant header information. Beckwith et al. identify PHI that appear in the headers (e.g., medical record number and patient name) and remove the instances of these PHI from the narratives. They use pattern-matchers to find dates, IDs, and addresses; they utilize well-known markers such as Mr., MD, and PhD to find patient, institution, and physician names. They conclude their scrubbing by comparing the narrative text with a database of proper names. Beckwith et al. report that they remove 98.3% of unique identifiers in pathology reports from three institutions. They also report that on average 2.6 non-PHI phrases per record are removed.

The de-identifier of Berman [13] takes advantage of standard nomenclature available in UMLS. This system assumes that words that do not correspond to nomenclature and that are not in a standard list of stop words are PHI and need to be removed. As a result, this system produces a large number of false positives.

Sweeney’s Scrub system [3] employs numerous experts each of which specializes in recognizing a single class of personally-identifying information, e.g., person names. Each expert uses lexicons and morphological patterns to compute the probability that a given word belongs to the personally-identifying information class it specializes in. The expert with the highest probability determines the class of the word. On a test corpus of patient records and letters, Scrub identified 99–100% of personally-identifying information. Unfortunately, Scrub is a proprietary system and is not readily available for use.

To identify patient names, Taira et al. [14] use a lexical analyzer that collects name candidates from a database and filters out the candidates that match medical concepts. They refine the list of name candidates by applying a maximum entropy model based on semantic selectional restrictions—the hypothesis that certain word classes impose semantic constraints on their arguments, e.g., the verb vomited implies that its subject is a patient. They achieve a precision of 99.2% and recall of 93.9% on identification of patient names in a clinical corpus.

De-identification resembles NER. NER is the task of identifying entities such as people, places, and organizations in narrative text. Most NER tasks are performed on news and journal articles. However, given the similar kinds of entities targeted by de-identification and NER, NER approaches can be relevant to de-identification.

2.2 Named Entity Recognition

Much NER work has been inspired by the Message Understanding Conference (MUC) and by the Entity Detection and Tracking task of Automatic Content Extraction (ACE) conference organized by the National Institute of Standards and Technology. Technologies developed for ACE-2007, for example, have been designed for and evaluated on several individual corpora: a 65000-word Broadcast News corpus, a 47500-word Broadcast Conversations corpus, a 60000-word Newswire corpus, a 47500-word Weblog corpus, a 47500-word Usenet corpus, and a 47500-word Conversational Telephone Speech corpus [15].

One of the most successful named entity recognizers, among the NER systems developed for and outside of MUC and ACE, is IdentiFinder [10]. IdentiFinder uses a Hidden Markov Model (HMM) to learn the characteristics of names that represent entities such as people, locations, geographic jurisdictions, organizations, and dates. For each entity class, IdentiFinder learns a bigram language model, where a word is defined as a combination of the actual lexical unit and various orthographic features. To find the names and classes of all entities, IdentiFinder computes the most likely sequence of entity classes in a sentence given the observed words and their features. The information obtained from the entire sentence constitutes IdentiFinder’s global context.

Isozaki and Kazawa [16] use SVMs to recognize named entities in Japanese text. They determine the entity type of each target word by employing features of the words within two words of the target (a ±2 word window). The features they use include the part of speech and the structure of the word, as well as the word itself.

Roth and Yih’s SNoW system [9] labels the entities and their relationships in a sentence. The relationships expressed in the sentence constitute SNoW’s global context and aid it in creating a final hypothesis about the entity type of each word. SNoW recognizes names of people, locations, and organizations.

Our de-identification solution combines the strengths of some of the abovementioned systems. Like Isozaki et al., we use SVMs to identify the class of individual words (where the class is one of seven categories of PHI or the class non-PHI); we use orthographic information as well as part of speech and local context as features. Like Taira et al., we hypothesize that PHI categories are characterized by their local lexical and syntactic context. However, our approach to de-identification differs from prior NER and de-identification approaches in its use of deep syntactic information obtained from the output of the Link Grammar Parser [6]. We benefit from this information to capture local syntactic context even when parses are partial, i.e., input text contains fragmented and incomplete utterances. We enrich local lexical context with local syntactic context and thus create a more thorough representation of local context. We use our newly defined representation of local context to identify PHI in clinical text.

3 Definitions

We define the PHI found in medical discharge summaries as follows:

Patients: include the first and last names of patients, their health proxies, and family members. Titles, such as Mr., are excluded, e.g., “Mrs. [Lunia Smith]_patient was …”.
Doctors: include medical doctors and other practitioners. Again titles, such as Dr., are not considered part of PHI, e.g., “He met with Dr. [John Doe]_doctor”.
Hospitals: include names of medical organizations. We categorize the entire institution name as PHI including common words such as hospital, e.g., “She was admitted to [Brigham and Women’s Hospital]_hospital”.
IDs: refer to any combination of numbers and letters identifying medical records, patients, doctors, or hospitals, e.g., “Provider Number: [12344]_ID”.
Dates: HIPAA specifies that years are not considered PHI, but all other elements of a date are. We label a year appearing in a date as PHI if the date appears as a single lexical unit, e.g., 12/02/99, and as non-PHI if the year exists as a separate token, e.g., 23 March, 2006. This decision was motivated by the fact that many solutions to de-identification and NER classify entire tokens as opposed to segments of a token. Also, once identified, dates such as 12/02/99 can be easily post-processed to separate the year from the rest.
Locations: include geographic locations such as cities, states, street names, zip codes, and building names and numbers, e.g., “He lives in [Newton]_location”.
Phone numbers: include telephone, pager, and fax numbers.

4 Hypotheses

We hypothesize that we can de-identify medical discharge summaries even when the documents contain many fragmented and incomplete utterances, even when many words are ambiguous between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI. Given the nature of the domain-specific language of discharge summaries, we hypothesize that a thorough representation of local context will be more effective for de-identification than (relatively simpler) local context enhanced with global context; in this manuscript, local context refers to the characteristics of the target and of the words within a ±2 context window of the target whereas global context refers to the dependencies of entities with each other and with non-entity tokens in a sentence.

5 Corpora

We tested our methods on five different corpora, three of which were developed from a corpus of 48 discharge summaries from various medical departments at the Beth Israel Deaconess Medical Center (BIDMC), the fourth of which consisted of authentic data including actual PHI from 90 discharge summaries of deceased patients from Partners HealthCare, and the fifth of which came from a corpus of 889 de-identified discharge summaries, also from Partners. The sizes of these corpora and the distribution of PHI within them are shown in Table 1. The collection and use of these data were approved by the Institutional Review Boards of Partners, BIDMC, State University of New York at Albany, and Massachusetts Institute of Technology.

Table 1.

Number of words in each PHI category in the corpora. Word counts depend on the number and format of inserted surrogates.

Category	Number of tokens
Category	Random corpus	Ambiguous corpus	Out-of-vocabulary corpus	Authentic corpus	Challenge corpus
Non-PHI	17,874	19,275	17,875	112,669	444,127
Patient	1,048	1,047	1,037	294	1,737
Doctor	311	311	302	738	7,697
Location	24	24	24	88	518
Hospital	600	600	404	656	5,204
Date	735	736	735	1,953	7,651
ID	36	36	36	482	5,110
Phone	39	39	39	32	271

Category	Number of ambiguous tokens in the ambiguous corpus	Number of ambiguous tokens in the challenge corpus
Non-PHI	3,787	39,374
Patient	514	158
Doctor	247	1,083
Location	24	44
Hospital	86	1,910
Date	201	81
ID	0	4
Phone	0	1

Corpus	Patients in names dict.	Doctors in names dict.	Locations in location dict.	Hospitals in hospital dict.	Dates in month dict.	Non-PHI in names dict.	Non-PHI in location dict.	Non-PHI in hospitals dict.	Non-PHI in month dict.
Random	86.45%	86.50%	87.5%	87.5%	12.65%	15.87%	9.19%	14.10%	0.07%
Authentic	78.57%	70.33%	54.55%	80.18%	21.97%	16.12%	10.19%	12.74%	0.02%
Ambiguous	86.53%	86.50%	100%	87.5%	12.64%	19.53%	10.50%	14.03%	0.08%
OoV	2.51%	1.99%	0%	19.56%	12.65%	15.87%	9.19%	14.10%	0.07%
Challenge	14.10%	17.20%	11.40%	26.59%	5.15%	15.36%	11.32%	8.61%	0.06%

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.34%	96.92%	97.63%
IFinder	PHI	62.21%	75.83%	68.35%*
H+D	PHI	93.67%	66.56%	77.82%*
CRFD	PHI	81.94%	81.17%	81.55%*
Stat De-id	Non-PHI	99.53%	99.75%	99.64%
IFinder	Non-PHI	96.15%	92.92%	94.51%*
H+D	Non-PHI	95.07%	99.31%	97.14%*
CRFD	Non-PHI	98.91%	99.05%	98.98%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.31%	96.62%	97.46%
SNoW	PHI	95.18%	97.63%	96.39%*
Stat De-id	Non-PHI	99.64%	99.82%	99.73%
SNoW	Non-PHI	99.75%	99.48%	99.61%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.46%	95.24%	96.82%
IFinder	PHI	26.17%	61.98%	36.80%*
H+D	PHI	82.67%	87.30%	84.92%*
CRFD	PHI	91.16%	84.75%	87.83%*
Stat De-id	Non-PHI	99.84%	99.95%	99.90%
IFinder	Non-PHI	98.68%	94.19%	96.38%*
H+D	Non-PHI	99.58%	99.39%	99.48%*
CRFD	Non-PHI	99.62%	99.86%	99.74%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.40%	93.75%	96.02%
SNoW	PHI	96.36%	91.03%	93.62%*
Stat De-id	Non-PHI	99.90%	99.98%	99.94%
SNoW	Non-PHI	99.86%	99.95%	99.90%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	96.37%	94.27%	95.31%
IFinder	PHI	45.52%	69.04%	54.87%*
H+D	PHI	79.69%	44.25%	56.90%*
CRFD	PHI	81.84%	78.08%	79.92%*
Stat De-id	Non-PHI	99.18%	99.49%	99.34%
IFinder	Non-PHI	95.23%	88.22%	91.59%*
H+D	Non-PHI	92.52%	98.39%	95.36%*
CRFD	Non-PHI	98.12%	98.78%	98.45%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	95.75%	93.24%	94.48%
SNoW	PHI	92.93%	91.57%	92.24%*
Stat De-id	Non-PHI	99.33%	99.59%	99.46%
SNoW	Non-PHI	99.17%	99.31%	99.24%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	94.02%	92.08%	93.04%
IFinder	PHI	50.26%	67.16%	57.49%*
H+D	PHI	58.35%	30.08%	39.70%*
SNoW	PHI	91.80%	87.83%	89.77%*
CRFD	PHI	74.15%	71.15%	72.62%*
Stat De-id	Non-PHI	98.28%	98.72%	98.50%
IFinder	Non-PHI	92.26%	85.48%	88.74%*
H+D	Non-PHI	86.19%	95.31%	90.52%*
SNoW	Non-PHI	97.34%	98.27%	97.80%*
CRFD	Non-PHI	95.84%	96.89%	96.37%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.12%	96.77%	97.44%
IFinder	PHI	52.44%	54.62%	53.51%*
H+D	PHI	88.24%	24.79%	38.71%*
CRFD	PHI	82.01%	78.71%	80.32%*
Stat De-id	Non-PHI	99.54%	99.74%	99.64%
IFinder	Non-PHI	93.52%	92.97%	93.25%*
H+D	Non-PHI	90.32%	99.53%	94.70%*
CRFD	Non-PHI	98.43%	99.01%	98.72%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.69%	97.37%	98.03%
IFinder	PHI	25.10%	49.10%	33.20%*
H+D	PHI	36.24%	55.84%	43.95%*
CRFD	PHI	86.37%	84.79%	85.57%*
Stat De-id	Non-PHI	99.83%	99.92%	99.86%
IFinder	Non-PHI	97.25%	92.47%	94.80%*
H+D	Non-PHI	97.67%	94.95%	96.29%*
CRFD	Non-PHI	99.55%	99.65%	99.60%*

Feature	Class	Precision	Recall	F-measure
Target words	Non-PHI	91.61%	98.95%	95.14%
Target words	PHI	86.26%	42.03%	56.52%
Lexical bigrams	Non-PHI	95.61%	98.10%	96.84%†
Lexical bigrams	PHI	85.43%	71.14%	77.63%
Syntactic bigrams	Non-PHI	96.96%	98.72%	97.83%
Syntactic bigrams	PHI	90.76%	80.20%	85.15%
POS information	Non-PHI	94.85%	98.38%	96.58%†
POS information	PHI	86.38%	65.84%	74.73%
Dictionary	Non-PHI	88.99%	99.26%	93.85%
Dictionary	PHI	81.92%	21.41%	33.95%
MeSH	Non-PHI	86.49%	100%	92.75%•
MeSH	PHI	0%	0%	0%‡
Orthographic	Non-PHI	86.49%	100%	92.75%•
Orthographic	PHI	0%	0%	0%‡

Feature	Class	Precision	Recall	F-measure
Target words	Non-PHI	98.79%	99.94%	99.36%
Target words	PHI	97.64%	67.38%	79.74%
Lexical bigrams	Non-PHI	98.46%	99.83%	99.14%†
Lexical bigrams	PHI	92.75%	58.47%	71.73%
Syntactic bigrams	Non-PHI	98.55%	99.87%	99.21%†
Syntactic bigrams	PHI	94.66%	60.97%	74.17%
POS information	Non-PHI	97.95%	99.63%	98.78%
POS information	PHI	81.99%	44.64%	57.81%
Dictionary	Non-PHI	97.11%	99.89%	98.48%
Dictionary	PHI	88.11%	21.14%	34.10%
MeSH	Non-PHI	96.37%	100%	98.15%‡
MeSH	PHI	0%	0%	0%
Orthographic	Non-PHI	96.39%	99.92%	98.12%‡
Orthographic	PHI	22.03%	0.61%	1.19%

Feature	Class	Precision	Recall	F-measure
Target words	Non-PHI	96.90%	99.87%	98.36%
Target words	PHI	96.05%	49.56%	65.38%
Lexical bigrams	Non-PHI	97.34%	99.69%	98.50%
Lexical bigrams	PHI	91.99%	56.87%	70.29%
Syntactic bigrams	Non-PHI	97.50%	99.74%	98.61%
Syntactic bigrams	PHI	93.44%	59.61%	72.79%
POS information	Non-PHI	96.04%	99.42%	97.70%
POS information	PHI	79.33%	35.24%	48.80%
Dictionary	Non-PHI	94.26%	99.90%	96.99%
Dictionary	PHI	69.70%	3.79%	7.19%
MeSH	Non-PHI	94.05%	100%	96.93%
MeSH	PHI	0%	0%	0%
Orthographic	Non-PHI	96.05%	99.60%	97.79%
Orthographic	PHI	84.67%	35.30%	49.83%

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.04%	96.49%	97.26%
SNoW	PHI	96.50%	95.08%	95.78%*
Stat De-id	Non-PHI	99.67%	99.82%	99.74%
SNoW	Non-PHI	99.53%	99.67%	99.60%*

Method	Class	Precision	Recall	F-measure
Stat De-id	PHI	98.98%	96.96%	97.96%
SNoW	PHI	98.73%	93.81%	96.21%*
Stat De-id	Non-PHI	99.90%	99.97%	99.93%
SNoW	Non-PHI	99.80%	99.96%	99.88%*

Corpus	Feature	Class	Precision	Recall	F-measure
Random	All Stat De-id/All CRFD	Non-PHI	99.53%	99.75%	99.64%
	All Stat De-id/All CRFD	PHI	98.34%	96.92%	97.63%
	All local context features of SNoW	Non-PHI	98.79%	99.36%	99.08%*
	All local context features of SNoW	PHI	95.76%	92.23%	93.96%*
	All local context features of IdentiFinder	Non-PHI	99.35%	99.18%	99.27%*
	All local context features of IdentiFinder	PHI	94.33%	95.85%	95.33%*
Authentic	All Stat De-id/All CRFD	Non-PHI	99.84%	99.95%	99.90%
	All Stat De-id/All CRFD	PHI	98.46%	95.24%	96.82%
	All local context features of SNoW	Non-PHI	99.72%	99.94%	99.83%*
	All local context features of SNoW	PHI	98.42%	92.67%	95.46%*
	All local context features of IdentiFinder	Non-PHI	99.66%	99.92%	99.79%*
	All local context features of IdentiFinder	PHI	97.75%	91.04%	94.28%*
Ambiguous	All Stat De-id/All CRFD	Non-PHI	99.18%	99.49%	99.34%
	All Stat De-id/All CRFD	PHI	96.37%	94.27%	95.31%
	All local context features of SNoW	Non-PHI	98.15%	98.98%	98.56%*
	All local context features of SNoW	PHI	92.51%	87.11%	89.73%*
	All local context features of IdentiFinder	Non-PHI	97.62%	98.49%	98.05%*
	All local context features of IdentiFinder	PHI	88.86%	83.42%	86.06%*
OoV	All Stat De-id/All CRFD	Non-PHI	99.54%	99.74%	99.64%
	All Stat De-id/All CRFD	PHI	98.12%	96.77%	97.44%
	All local context features of SNoW	Non-PHI	98.95%	99.45%	99.20%*
	All local context features of SNoW	PHI	96.10%	92.67%	94.33%*
	All local context features of IdentiFinder	Non-PHI	99.12%	99.17%	99.14%*
	All local context features of IdentiFinder	PHI	94.23%	93.87%	94.05%*
Challenge	All Stat De-id/All CRFD	Non-PHI	99.83%	99.92%	99.86%
	All Stat De-id/All CRFD	PHI	98.69%	97.37%	98.03%
	All local context features of SNoW	Non-PHI	99.50%	99.86%	99.68%*
	All local context features of SNoW	PHI	97.72%	92.14%	94.85%*
	All local context features of IdentiFinder	Non-PHI	99.50%	99.89%	99.70%*
	All local context features of IdentiFinder	PHI	98.21%	92.15%	95.08%*

Predicted	Non-PHI	Patient	Doctor	Location	Hospital	Date	ID	Phone
Actual	Non-PHI	Patient	Doctor	Location	Hospital	Date	ID	Phone
Non-PHI	112,605	2	4	0	17	33	8	0
Patient	12	280	1	0	0	0	0	1
Doctor	24	1	711	0	2	0	0	0
Location	19	0	3	51	14	0	0	1
Hospital	54	0	3	4	595	0	0	0
Date	58	0	0	0	4	1,891	0	0
ID	3	0	0	0	0	0	479	0
Phone	11	0	0	0	0	1	1	19

PERMALINK

A De-identifier for Medical Discharge Summaries1

Özlem Uzuner, Ph.D.

Tawanda C Sibanda, M.Eng.

Yuan Luo, M.S.

Peter Szolovits, Ph.D.

Abstract

Objective

Methods and Results

1 Introduction

2 Background and Related Work

2.1 De-identification

2.2 Named Entity Recognition

3 Definitions

4 Hypotheses

5 Corpora

Table 1.

5.1 Corpus Populated with Random PHI

5.2 Corpus Populated with Ambiguous PHI

Table 2.

5.3 Corpus Populated with Out-of-Vocabulary PHI

5.4 Authentic Discharge Summary Corpus

5.5 Challenge Corpus

6 Methods: Stat De-id

6.1 Support Vector Machines

6.2 Knowledge Representation

6.3 Lexical and Orthographic Features

6.3.1 The target itself

6.3.2 Lexical Bigrams

6.3.3 Capitalization

6.3.4 Punctuation

6.3.5 Numbers

6.3.6 Word Length

6.4 Syntactic Features

6.4.1 Part of Speech

6.4.2 Syntactic Bigrams

Link Grammar Parser

Using Link Grammar Parser Output as an SVM Input

6.5 Semantic Features

6.5.1 MeSH ID

6.5.2 Dictionary Information

Table 3.

6.5.3 Section Headings

7 Baseline Approaches

7.1 Heuristic+dictionary Scheme

7.2 SNoW

7.3 IdentiFinder

7.4 Conditional Random Field De-identifier (CRFD)

8 Evaluation Methods

8.1 Precision, Recall, and F-measure

8.2 Statistical Significance

9 Results and Discussion

9.1 De-identifying Random and Authentic Corpora

Table 4.

Table 5.

Table 6.

Table 7.

9.2 De-identifying the Ambiguous Corpus

Table 8.

Table 9.

Table 10.

9.3 De-identifying the Out-of-Vocabulary Corpus

Table 11.

Table 12.

Table 13.

9.4 De-identifying the Challenge Corpus

Table 14.

Table 15.

9.5 Feature Importance

Table 16.

Table 17.

Table 18.

9.6 Local vs. Global Context

Table 19.

10 Multi-Class SVM Results and Implications for Future Research

Table 20.

Table 21.

11 Conclusions

Acknowledgments

Footnotes

A De-identifier for Medical Discharge Summaries^¹