Abstract
Effective de-identification methods are needed to support reuse of electronic health record data for research and other purposes. We investigated using two different text-processing systems in tandem as a strategy for de-identification of clinical notes. We ran 100 outpatient notes through deid.pl, from MIT’s PhysioToolkit, followed by MedLEE, and we manually compared the output with the original notes to determine the amount of protected health information (PHI) retained. Pipelining resulted in an overall error rate of 2%, with 2 personal names retained in output: one initial and one surname that is a common English word used in medicine. All retained PHI was transformed into standardized medical concepts, making re-identification less likely. Pipelining using deid.pl improved performance of MedLEE in excluding PHI from output and may be a useful strategy for de-identifying clinical data while providing computer-readable output.
Introduction
The American Recovery and Reinvestment Act of 2009 (ARRA),1 which provides funding for electronic health record adoption, specifies increased privacy protection of clinical data. This includes the eighteen types of protected health information (PHI) enumerated by HIPAA. Eight of these were studied by Uzuner et al2 and included age, clinician name, date less than a year, hospital, location, patient name, telephone number, and identifiers such as medical record numbers and social security numbers.
De-identification of clinical notes has the potential to add utility to the large quantity of text documents that will be created if electronic health records are adopted as hoped. These notes, which can provide more detailed patient information than commonly used data such as billing or coded diagnosis data, cannot be exchanged on a large scale without a reasonable de-identification algorithm. The notes could be additionally useful if accurately coded into computer-readable formats.
Clinical notes have the potential to be used to accomplish such goals as quality improvement, public health surveillance, and research; specific activities may include performing automated public health reporting, monitoring guideline adherence and quality, and detecting adverse events. However, using these clinical data for secondary purposes increases the likelihood that PHI will leave the protection of a health care facility and end up at institutions such as research facilities and quality and safety organizations.
Systems that have been tested for their ability to remove PHI have reached a high level of de-identification using a variety of methods, including semantic analysis,3,4 template or pattern matching,5,6 and concept or dictionary matching,4,5,7,8 with fine-tuning using quantitative methods such as probability tables,4,5 pairwise matching,8 and conditional random fields.9 Many of the systems studied use more than one approach within a system, but studies have tended to use a single fully functional system at a time.
In a recent paper, Morrison, et al. examined MedLEE, a general natural language processing system designed to extract medical concepts from clinical text, as a potential de-identification system.10 It was found that MedLEE retained 3.2% of PHI in the output but that the PHI was transformed to standardized medical concepts, making it less recognizable as PHI. In this study, we attempted to improve performance by pairing MedLEE with a system that used a complementary strategy for handling PHI. The two approaches can be likened to the liver and kidney.10 Most de-identification systems function like the liver; they are designed to identify specific phrases that appear to be PHI and to tag them just as the liver targets and removes specific toxins. MedLEE functions more like the kidney, eliminating everything except that which is identified as being medically codable.
We therefore created a pipeline by applying MedLEE to the output of a freely available de-identification system that works by identifying PHI, like a liver identifies toxins. We selected deid.pl,11 which is part of the PhysioToolkit12 created at MIT, because it is open-source and has good documented performance in recognizing PHI.
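The complementary roles of the two stages can be illustrated with a toy sketch. This is not how deid.pl or MedLEE is implemented; the patterns, the word list, and the function names (`liver_stage`, `kidney_stage`, `pipeline`) are hypothetical stand-ins chosen only to show a tag-the-PHI stage followed by a keep-only-the-medical-content stage.

```python
import re

# Hypothetical toy vocabulary; a real concept extractor such as MedLEE
# relies on a full grammar and lexicon, not a word list.
MEDICAL_TERMS = {"cough", "fever", "hypertension"}

# Hypothetical PHI patterns for the "liver" stage (targets specific PHI).
PHI_PATTERNS = [
    re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+"),  # titled names
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),               # phone numbers
]


def liver_stage(text: str) -> str:
    """Tag anything that looks like PHI; leave all other text untouched."""
    for pattern in PHI_PATTERNS:
        text = pattern.sub("[PHI]", text)
    return text


def kidney_stage(text: str) -> list:
    """Keep only tokens recognized as medical terms; drop everything else."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [token for token in tokens if token in MEDICAL_TERMS]


def pipeline(text: str) -> list:
    """Run the PHI-tagging stage first, then the concept-keeping stage."""
    return kidney_stage(liver_stage(text))


note = "Dr. Smith saw the patient for cough and fever. Call 212-555-0100."
print(liver_stage(note))
print(pipeline(note))

# A name the first stage misses is still dropped by the second stage,
# because it is not a recognized medical term.
print(pipeline("Seen by Jones for hypertension"))
```

In this toy setting a name leaks only if it evades the pattern stage and also collides with a recognized medical term.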
Methods
We used the corpus from Morrison et al.,10 which included physician outpatient notes from Columbia University Medical Center from November 2004 through April 2005. These notes were written by resident and attending physicians from general Internal Medicine as well as specialists including neurology, hematology, and oncology. The notes were unstructured with varied formats, including clinic notes and referral letters.
The de-identification software, deid.pl, is available on PhysioNet.12 The open-source package includes code and dictionaries to remove protected health information from narrative notes using lexical look-up tables, regular expressions, and heuristics. It replaces provider and patient names with category tags, and it substitutes dates with consistent, shifted dates, preserving attributes such as day of the week, season, and the patient’s age.
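One simple way to obtain consistent, shifted dates that keep the day of the week is to add the same whole number of weeks to every date belonging to a patient. The sketch below illustrates that idea only; it is not deid.pl's actual algorithm, and the hash-derived offset, the `max_weeks` bound, and the function names are assumptions made for illustration.

```python
import hashlib
from datetime import date, timedelta


def patient_offset_days(patient_id: str, max_weeks: int = 8) -> int:
    """Derive a stable per-patient shift that is a whole number of weeks.

    Assumption: the offset comes from a hash of a hypothetical patient ID.
    A real system would need a secret so the shift cannot be recomputed.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    weeks = 1 + digest[0] % max_weeks  # between 1 and max_weeks
    return 7 * weeks


def shift_date(original: date, patient_id: str) -> date:
    """Shift a date by the patient's fixed offset."""
    return original + timedelta(days=patient_offset_days(patient_id))


visit = date(2005, 3, 14)
follow_up = date(2005, 4, 11)
shifted_visit = shift_date(visit, "patient-001")
shifted_follow_up = shift_date(follow_up, "patient-001")

# The weekday and the interval between visits are unchanged by the shift.
print(visit.weekday() == shifted_visit.weekday())
print((shifted_follow_up - shifted_visit) == (follow_up - visit))
```

Because the offset is a multiple of seven days and at most a few weeks, the weekday is preserved exactly, while season and age are preserved only approximately.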
The 100 notes we used from the prior study were already pre-processed for use with MedLEE, with handling of special characters, removal of whitespace, and addition of breaks in case of excess text without punctuation. We downloaded the deid.pl code, added to each note the prefix and suffix required by the system, and ran the program, a process that took approximately 4 hours. The deid.pl system may be tuned to a site by including local names of clinicians, patients, hospitals, and locations in its dictionaries, but we used the existing dictionaries to test its most straightforward use, relying on its heuristics and common names. Because of this, we do not consider the evaluation to be a test of deid.pl’s optimized performance. We then ran MedLEE on the unmodified text output from deid.pl, resulting in XML-tagged parsed concepts.
A physician who is board certified in Preventive Medicine with formal training in Public Health and Biomedical Informatics manually reviewed the notes and output. The PHI in the notes was characterized and summed by the eight types of PHI identified by Uzuner et al:2 patient name, clinician name, hospital, identifiers (e.g., social security numbers, medical record numbers), date (except for year), location, phone number, and age >89.
We followed an algorithm similar to that of the previous study. We sought to identify processing errors that allowed PHI into the output in any form by comparing the 100 original notes with the corresponding XML-tagged output. We treated first and last names as separate units but locations and hospital names as a single unit. If any part of the unit was allowed into output, we considered it an error. For example, if the apartment number of an address was erroneously allowed into output as a lab result value, it counted as a PHI leak. We considered all clinical sites as hospital data types, including names of specific clinics or locations that would reveal the clinical site. Identifiers were any string that could be traced to a single individual, such as a medical record number. We also considered email addresses to be identifiers because they tend to be linked with only one person and could potentially be used to identify an individual.
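The unit-based counting rule can be sketched as follows. In the study the comparison was done by manual review; the code below is only an illustration, and the `PhiUnit` class, the tokenizer, and the example strings are hypothetical. Naive token matching like this would over-count in practice, since a number in the output may legitimately be a lab value.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PhiUnit:
    phi_type: str  # e.g. "patient", "location"
    parts: tuple   # the tokens that make up one unit of PHI


def tokenize(text: str) -> set:
    """Lowercased alphanumeric tokens found in the output text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def count_leaks(units, output_text: str) -> Counter:
    """Count one error per unit if any part of that unit appears in output."""
    tokens = tokenize(output_text)
    leaks = Counter()
    for unit in units:
        if any(part.lower() in tokens for part in unit.parts):
            leaks[unit.phi_type] += 1
    return leaks


units = [
    PhiUnit("patient", ("Jane",)),  # first and last names are separate units
    PhiUnit("patient", ("Doe",)),
    PhiUnit("location", ("12", "Elm", "Street", "Apt", "7B")),  # single unit
]
output = "labtest: glucose value 7b"
print(count_leaks(units, output))
```

Here the apartment number alone is enough to count the whole address as one leaked location, while neither name unit is counted.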
The output from the deid.pl software with tagged PHI was compared to the PHI in the previously marked up original notes. The MedLEE XML-tagged output from the second step was then compared to the deid.pl output and original notes to determine the level of PHI retained. We compared these results to the prior results from processing the notes with MedLEE alone.
We calculated the proportion of PHI in the original notes that ended up in the output. This is equivalent to the false negative rate (1–sensitivity) of a de-identification system (i.e., the PHI that is not identified and is therefore inappropriately left in the note).
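As a worked illustration of this measure, the sketch below computes the proportion of leaked PHI from the totals reported in Table 1 (28 leaks for MedLEE alone and 17 for the pipeline, out of 818 PHI instances). The function name is ours, not part of either system.

```python
def false_negative_rate(phi_leaked: int, phi_total: int) -> float:
    """Proportion of PHI instances that reached the output (1 - sensitivity)."""
    if phi_total <= 0:
        raise ValueError("phi_total must be positive")
    return phi_leaked / phi_total


# Totals taken from Table 1.
medlee_alone = false_negative_rate(28, 818)
pipeline = false_negative_rate(17, 818)

print(f"MedLEE alone: {medlee_alone:.1%}")  # 3.4%
print(f"Pipeline:     {pipeline:.1%}")      # 2.1%
print(f"Pipeline sensitivity: {1 - pipeline:.1%}")
```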
We also examined the output of 5 of the notes after running MedLEE alone and after running the pipeline to determine whether concept recognition had been affected by deid.pl. The output was manually inspected side-by-side for differences in XML-coded concepts to determine whether MedLEE changed its interpretation after running deid.pl.
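The side-by-side inspection was manual, but the underlying comparison amounts to a set difference over the coded concepts from each run. A minimal sketch, assuming the concepts have already been extracted from the XML into plain lists (the concept names below are hypothetical examples, not study data):

```python
def compare_concepts(medlee_alone, pipeline_output) -> dict:
    """Report which coded concepts appear in only one of the two outputs."""
    alone, piped = set(medlee_alone), set(pipeline_output)
    return {
        "only_medlee_alone": sorted(alone - piped),
        "only_pipeline": sorted(piped - alone),
        "shared": sorted(alone & piped),
    }


# Hypothetical concept lists for one note.
alone_concepts = ["cough", "hypertension", "chest pain"]
piped_concepts = ["cough", "hypertension", "fever"]

diff = compare_concepts(alone_concepts, piped_concepts)
print(diff["only_medlee_alone"])
print(diff["only_pipeline"])
print(diff["shared"])
```

Two notes with identical coded concepts would yield empty lists for both "only" keys.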
Results
Both systems were able to process all 100 notes. After the deid.pl system ran on the notes, some text appeared to be missing from several notes, from a few words to multiple lines, particularly when a large amount of PHI was nearby. MedLEE handled the deid.pl output without problems, although a systematic examination of potential missing medical concepts was not performed.
Out of a total of 818 PHI instances, deid.pl alone missed 191 (23%), including 12 of 119 patient names and 9 of 170 clinician names, for a total of 21 missed names out of 289 (7%). The two systems in tandem fared better, with only two names appearing in output. One of those was a clinician’s first initial, which was interpreted as a common laboratory measurement, and the other was a patient’s surname that is a common English word used frequently in medicine. Both of the names had been missed in the previous study using MedLEE alone.
After running deid.pl, one identifier was not recognized, in this case a medical record number, which MedLEE then ignored. Fourteen telephone numbers remained, with 13 of those representing 4-digit beeper numbers and one an extension, none of which subsequently appeared in MedLEE output. Five of the eight ages over 89 were not recognized by the systems run in tandem; anecdotally, neither system recognized any age well, with ages under 89 rarely recognized although this was not quantified.
Two date errors were detected in the output after the pipeline, the same two errors that were in the output of MedLEE alone; in both cases, the date did not appear in a standard date format and the result was a single digit retained in output. In the first case, the phrase “PAP11/03 wnl Today: ” was retained as:
normalfinding:within normal limits
idref>> 93
parsemode>> mode5
quantity>> [3,[idref,91]]
sectname>> report unknown section item
sid>> 3
timeper>> today
idref>> 95
The other error occurred from the phrase “Apr,03” (details modified for patient protection).
Upon manual examination of the two resulting outputs, MedLEE alone and MedLEE after deid.pl, two of the five notes examined had identical coded concepts. The remaining three had minimal differences only. One notable difference was the lack of the concept mammography in one note’s post-pipeline output. This was due to the word mammo being classified as a name in deid.pl. Other differences included CC misinterpreted as craniocaudal rather than the correct meaning, chief complaint, after being processed by MedLEE alone but not after the pipeline. The term Appt was also misinterpreted as activated partial prothromboplastin time after the pipeline but not after MedLEE alone.
Discussion
Using deid.pl serially with MedLEE improves MedLEE’s performance in removing PHI, but some leakage persisted. The rate at which clinician and patient names leaked into the output dropped from 3.5% to 0.7%, a fivefold improvement. After examining the names that leaked, it is not clear that the rate can be improved very much; one name was a medically relevant English word, and the other was a first initial. Although people’s initials are considered PHI by HIPAA unless a qualified statistician determines that the risk of re-identification is very small,13 in neither case was the output information at all recognizable as a name.
Deid.pl was previously evaluated by Neamatullah et al.11 The researchers found performance that was notably better than what we found, with our “errors” equivalent to their false negatives. They found that deid.pl had an error rate of 0% for patient names excluding initials (0 of 54), 100% for patient initials (2 of 2), and 0.5% (3 of 593) for clinician names. The likely reason for the difference is the unavailability of a local dictionary, which is how deid.pl is intended to be used. Studies done on deid.pl without local dictionaries resulted in a false negative rate of 7% for all types of names, as opposed to 1% with the customized dictionary.11
Performance on deid.pl identification of hospitals, clinician names, patient names, and locations would improve with local dictionary use, and we caution readers not to use the deid.pl results in Table 1 as the optimal deid.pl performance. However, failing to optimize the local dictionaries probably did not affect the pipeline performance very much. The two missed names would likely have leaked anyway (unless common English words used in medicine were eliminated, as well as single letters). The six leaked hospital names might have been avoided, however, dropping the total errors to 13 (1.6%).
Table 1.
PHI allowed into output from MedLEE only, deid.pl only, and deid.pl followed by MedLEE. An “error” implies that PHI was leaked to the output. Error columns give number (percent); the PHI column gives the total number of PHI instances of each type.
PHI Type | MedLEE errors* | deid.pl errors | Both errors | PHI |
---|---|---|---|---|
Patient | 4 (3) | 12 (10) | 1 (1) | 119 |
Clinician | 6 (4) | 9 (5) | 1 (1) | 170 |
Identifiers | 0 (0) | 1 (2) | 0 (0) | 41 |
Hospital | 7 (7) | 80 (77) | 6 (6) | 103 |
Telephone | 1 (3) | 14 (38) | 0 (0) | 37 |
Location | 3 (8) | 20 (54) | 2 (5) | 37 |
Age >89 | 5 (63) | 6 (75) | 5 (63) | 8 |
Date | 2† (1) | 49 (16) | 2 (1) | 302 |
Total | 28 (3) | 191 (23) | 17 (2) | 818 |
*From Morrison et al.10
†Errors not originally detected.
We chose to evaluate the pipeline strategy using an “out of the box” approach for one main reason. If these systems were to be used for de-identification on a larger scale, it could involve a multitude of different sites. It is unlikely that each site has a local dictionary, making its use unwieldy. We were interested in how it would do without customization to clarify potential real-world application of the systems.
Age was not recognized well by either system. Understandably, the overlap of one- or two-digit numbers with laboratory results and other measurements makes this the most challenging data type for systems to handle. However, many of these ages refer to family members in the family history section rather than to the patient, making them less likely to enable re-identification. Several of these appeared in output as a year. This would be less conspicuous in the 1990s, when one would expect dates to be in that range, but might provide more identifiability in the current decade. The systems also did not handle beeper numbers; this may be because the number of digits used for beepers varies between institutions. The difference in patterns contributing to errors may also apply to medical record numbers, which can vary in format and length.
Dates were handled fairly well by the two systems together, with only a single digit of two dates leaking through, the same two that were missed by MedLEE alone. Although deid.pl did not change MedLEE’s error rate for dates, a potential benefit of running MedLEE through a system like deid.pl may be the increased consistency of input. If MedLEE could be trained to handle the standardized output created by deid.pl, such as dates, MedLEE’s performance could potentially improve. In the prior study, although MedLEE allowed only two dates into the output, it anecdotally did not recognize dates reliably. More consistent recognition of dates would likely improve its ability to handle concepts such as time.
In this study, we did not comprehensively assess the loss of useful information from the MedLEE output, although we examined five notes to determine potential differences in concept processing. Such loss might occur because deid.pl inappropriately identified a medical phrase as PHI, or because a change in a sentence caused MedLEE to fail to parse some other part of the sentence. We did discover that deid.pl removed some amount of text from the notes, which was clear when examining the deid.pl output compared with the original note. However, in the five notes we examined, the pipelined MedLEE output did not appear to have been affected meaningfully compared to output from MedLEE alone. MedLEE did handle some terms differently after running deid.pl; one term from a note, Appt, appeared in output (albeit misclassified) only after the pipeline processing.
Some of the errors appeared unavoidable given both human typing error and collisions. The dates that were missed were irregularly formatted with no spaces, and the transformed dates, consisting of a single digit, were nearly undetectable in the output. Distinguishing medical terms from names may also be unavoidable. Although the concept mammography was excluded from output after deid.pl and MedLEE ran in tandem, this was solely due to its recognition of mammo as a name by deid.pl. This exemplifies one potential downside of removing all names that could be misclassified as medical concepts; important information could be excluded.
It may be very difficult to remove certain forms of PHI, such as names that are common English words and that are not in a context that is easy to identify as a name. The ultimate goal is not to hide PHI fragments such as lone initials, but to protect patient identities. If it can be demonstrated that an individual cannot be re-identified from leakage of PHI fragments such as initials, then the goal of protecting patient privacy may still be achieved. For example, the corpus may achieve k-anonymity.14, 15 The advantage of processing with MedLEE is that it provides coded output; the advantage of processing with another complementary system is that the exclusion of PHI from output improves with little appreciable reduction in medical concept recognition.
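To make the k-anonymity criterion mentioned above concrete, the sketch below checks whether every combination of quasi-identifier values in a set of records is shared by at least k records. The records, field names, and function names are hypothetical and do not come from the study corpus.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers) -> int:
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


def is_k_anonymous(records, quasi_identifiers, k: int) -> bool:
    """True if every quasi-identifier combination occurs in at least k records."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical released records containing only leaked fragments.
records = [
    {"initial": "J", "age_band": "40-49"},
    {"initial": "J", "age_band": "40-49"},
    {"initial": "K", "age_band": "50-59"},
]
fields = ["initial", "age_band"]

print(smallest_group_size(records, fields))
print(is_k_anonymous(records, fields, k=2))
print(is_k_anonymous(records[:2], fields, k=2))
```

In this example the single record with initial K forms a group of size one, so the full set fails the check for k = 2 while the first two records alone pass it.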
Conclusion
Combining complementary de-identification systems in a pipeline improved performance over MedLEE alone, with the overall error rate dropping from 3.4% to 2.1% and the error rate for names dropping from 3.5% to 0.7%, and with the remaining PHI masked as erroneous medical codes. A by-product of the approach is the encoded medical information. Using complementary approaches, one that selects PHI and the other that selects non-PHI (analogous to the liver and kidney), does result in improved performance. Nevertheless, performance is not perfect, and manual review would be required to remove all PHI.
Acknowledgments
Research on natural language processing and the continuing development of MedLEE was funded by R01 LM007659 and R01 LM008635 from the National Library of Medicine. Research on the evaluation of MedLEE as a de-identification and syndromic surveillance tool was funded by R01 LM06910 and P01 HK000029.
References
- 1. United States Congress. American Recovery and Reinvestment Act of 2009. 2009 January 6.
- 2. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007 Sep–Oct;14(5):550–63. doi: 10.1197/jamia.M2444.
- 3. Ruch P, Baud RH, Rassinoux AM, Bouillon P, Robert G. Medical document anonymization with a semantic lexicon. Am Med Inform Assoc Proc. 2000:729–33.
- 4. Taira RK, Bui AA, Kangarloo H. Identification of patient name references within medical documents using semantic selectional restrictions. Am Med Inform Assoc Proc. 2002:757–61.
- 5. Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. Am Med Inform Assoc Proc. 1996:333–7.
- 6. Beckwith BA, Mahaadevan R, Balis UJ, Kuo F. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med Inform Decis Mak. 2006;6:12. doi: 10.1186/1472-6947-6-12.
- 7. Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol. 2004;121(2):176–86. doi: 10.1309/E6K3-3GBP-E5C2-7FYU.
- 8. Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Archives of Pathology & Laboratory Medicine. 2003;127(6):680–6. doi: 10.5858/2003-127-680-CMDS.
- 9. Wellner B, Huyck M, Mardis S, Aberdeen J, Morgan A, Peshkin L, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc. 2007;14(5):564–73. doi: 10.1197/jamia.M2435.
- 10. Morrison FP, Li L, Lai AM, Hripcsak G. Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes? J Am Med Inform Assoc. 2009;16(1):37–9. doi: 10.1197/jamia.M2862.
- 11. Neamatullah I, Douglass MM, Lehman LW, Reisner A, Villarroel M, Long WJ, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32. doi: 10.1186/1472-6947-8-32.
- 12. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. 2000;101(23):E215–20. doi: 10.1161/01.cir.101.23.e215.
- 13. NIH. Research Repositories, Databases, and the HIPAA Privacy Rule. NIH Publication Number 04-5489. 2004 [cited 2009 March 11]. Available from: http://privacyruleandresearch.nih.gov/researchrepositories.asp
- 14. Sweeney L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems. 2002;10(5):557–70.
- 15. Malin BA, Sweeney L. A secure protocol to distribute unlinkable health data. AMIA Annual Symposium Proceedings/AMIA Symposium. 2005:485–9.