Abstract
Accessing both structured and unstructured clinical data is a high priority for research efforts. However, HIPAA requires that data meet or exceed a de-identification standard to assure that protected health information (PHI) is removed. This is particularly difficult for unstructured clinical free text, although natural language processing (NLP) systems can be trained to de-identify clinical text automatically. Evaluating such systems, however, requires manually annotated reference standards, and manual annotation of clinical note documents is a costly and time-consuming process. An annotation schema must be created that supports building reliable and valid reference standards for evaluating NLP systems on the de-identification task. We describe the inductive creation of an annotation schema and subsequent reference standard. We also provide estimates of the accuracy of human annotators for this particular task.
Introduction
Access to clinical data is essential for medical research, but clinical information is rich in protected health information (PHI), which severely restricts its use for research purposes. These restrictions require compliance with the Health Insurance Portability and Accountability Act (HIPAA) to protect the confidentiality of patient health care information1. HIPAA provides guidelines for the use of PHI under one of two alternatives: a statistical proof that any shared PHI cannot be traced to an individual, or removal of 18 types of identifiers to comply with the “Safe Harbor” method. The majority of clinical information is stored as free-text data. Because this data is unstructured and contains identifiers without a consistent format, removal of the 18 types of PHI identifiers becomes a significant challenge. Manual de-identification is prohibitively time consuming and requires multiple independent reviews for acceptable accuracy2,3. Automated approaches to de-identify clinical documents are available, but they are often developed and evaluated on specific document types and may not generalize to all types of datasets3,4. Even with the most successful automated tools, there remains the possibility that an identifier will slip through into a “deidentified” document, and certain categories of “missed” identifiers may be considered more sensitive than others. In addition, clinical narratives are inherently information rich; even if all HIPAA identifiers are removed, specific keywords or scenarios may still allow re-identification.
The Department of Veterans Affairs (VA) seeks to improve the healthcare of veterans through clinical research, but the privacy and confidentiality of veterans’ healthcare information must be maintained. The VA has been a leader in electronic medical records (EMR), with clinical care documented electronically in the Computerized Patient Record System (CPRS) in use at all VA health care facilities. VA clinical narrative documents create unique challenges for automated de-identification: these notes tend to be highly templated with pre-defined section headings, require entry of free text in a narrative style, and often embed information stored elsewhere in the VA EMR. This project was undertaken to create a manually de-identified reference standard from a variety of VA CPRS narrative documents using an annotation schema that prioritizes categories of PHI. The process by which the annotation schema was applied and refined involved a human inductive learning process and is described in this paper. The reference standard built from the annotation task illustrates the type and density of PHI in heterogeneous VA CPRS in-patient clinical document types and provides a tool to evaluate the recall and precision of automated de-identification systems.
Methods
Study Population and Document Corpus
This study was conducted at one VA health care facility that provides tertiary care to veterans across multiple states. Clinical narrative documents were drawn from a VA study assessing the inter-rater reliability of VA Infection Preventionists (IP) categorizing healthcare-acquired bloodstream infections. The study population was a random selection of 120 inpatient hospitalizations at one VA health care facility from Aug 2000 through Dec 2005 of adults who were eligible for review of a hospital-acquired infection because blood was sampled and one or more microorganisms were isolated. Clinical documents relevant for IP review were abstracted from the electronic health records of these hospitalizations to compile sets of related documents. The documents in each of the 120 hospitalization sets were authored by healthcare providers from several clinical domains and covered a broad range of document types, including at least one admission history and physical and one discharge summary; all microbiology reports; and other text documents authored during the days surrounding positive blood cultures. There were 9 document types, with a total of 5,875 notes for the entire cohort: admission history and physicals (180), discharge summaries (128), physician progress notes (1,152), nursing notes (1,566), consultant notes (332), microbiology reports (1,758), procedure notes (212), infusion services notes (31), and chest radiograph reports (521).
Our goal was to build a reference standard that was a representative sample of the cohort and could adequately evaluate automated NLP de-identification tools. The sample size of the test set for the reference standard was determined by a power analysis using recall (sensitivity) as the parameter of interest. We sampled 240 documents across the 9 document types in the complete corpus, with 20 documents for the training set and 220 for the test set. For each document type, a unique patient was sampled only once, but one patient could contribute multiple documents of different types. The number sampled from each document type was calculated based on previously reported prevalence of PHI5 as well as oversampling of the lengthier document types.
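As an illustration of this sampling design, the sketch below draws a weighted, stratified sample by document type while taking each patient at most once within a type. The weights, field names, and allocation rule are simplifying assumptions standing in for the study's calculation based on power analysis, reported PHI prevalence, and document length; they are not the actual procedure used.

```python
import random
from collections import defaultdict

def stratified_sample(documents, weights, total_n, seed=0):
    """Draw a weighted stratified sample of documents by type, taking
    each patient at most once within a given document type.

    documents: list of dicts with 'doc_id', 'patient_id', 'doc_type' keys
    weights:   dict mapping doc_type -> relative sampling weight
    total_n:   total number of documents to sample
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc in documents:
        by_type[doc["doc_type"]].append(doc)

    weight_sum = sum(weights.values())
    sample = []
    for doc_type, docs in by_type.items():
        # Proportional allocation by weight; a placeholder for the study's
        # calculation based on reported PHI prevalence and document length.
        n_type = round(total_n * weights.get(doc_type, 0) / weight_sum)
        rng.shuffle(docs)
        seen_patients = set()
        picked = 0
        for doc in docs:
            if picked >= n_type:
                break
            if doc["patient_id"] in seen_patients:
                continue  # one document per patient within each type
            seen_patients.add(doc["patient_id"])
            sample.append(doc)
            picked += 1
    return sample

# Hypothetical usage (field names and weights are illustrative):
# corpus = [{"doc_id": 1, "patient_id": "p001", "doc_type": "discharge summary"}, ...]
# weights = {"discharge summary": 4, "nursing note": 1, "microbiology report": 1}
# test_set = stratified_sample(corpus, weights, total_n=220)
```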
Study Design
Creation of the Annotation Schema
An annotation schema that classifies PHI in detail and ranks its “privacy severity” would be optimal both for creating a reference standard and for evaluating the performance of de-identification tools. We sought to build on the annotation schema developed in the i2b2 challenge6, where selected HIPAA PHI were grouped in categories that distinguish direct patient identifiers from those of healthcare providers. The i2b2 PHI categories included: patients, doctors, hospitals, IDs, dates, locations, phone numbers, and age greater than 89. The physician author (JM) modified the i2b2 annotation schema by classifying each category as a patient or healthcare provider identifier that would be sensitive outside, or only within, the institution where the data were collected, and by assigning privacy severity levels (Table 1). PHI that could allow anyone outside the institution to identify an individual patient, such as a patient/proxy name or a number such as a social security number, was considered of highest severity. On the other hand, although HIPAA considers a laboratory accession number PHI, identifiers that could reasonably be used for re-identification only by someone within the institution were considered of lowest severity. A training instruction manual was prepared that outlined the PHI categories to be marked and provided specific examples. A PHI annotation was to begin at the start of a PHI phrase and end at its completion, capturing instances of PHI rather than word tokens. Our annotation task was therefore at the instance level rather than the token level often used in similar studies.
Table 1.
Annotation Schema with description of PHI categories and examples marking instances of PHI
| PHI category (Description) | Examples | Sensitivity |
|---|---|---|
| Patient name (names of patient, health proxies, family members) | SMITH, JOHN Q; Mr. Smith | High |
| Patient identifier sensitive outside the institution (numbers, letters that identify patient or proxy) | His SSN is 000-00-000; Spouse’s number is 999-9999 | High |
| Health care worker name (name of HCW. Excludes titles as Dr., MD, or RN) | JONES, JANE MD; Author: Jones, James E Jr | Medium |
| Health care worker identifier sensitive outside the institution (numbers, letters that identify HCW) | Page at (111–0000); Send the fax to (999)999–9999 | Medium |
| Health care facility (health care facilities, labs, nursing homes, “non-generic” in- or out-patient locations) | ANYCITY Hospital; UMC MICU; UMC 5West 15A | Medium |
| Locations (such as cities, street names, zip codes, address) | 5 Main St, City, State 00000; He lives in City, State | Medium |
| Dates (all elements of a date, including year and/or time if in same phrase as month/date. Ignore weekdays.) | Jul 4, 2001@01:00; 1/01; Monday, Jan 26, 2001 | Medium |
| Age > 89 | He is a 91 yo male | Medium |
| Identifier sensitive within the institution (numbers, letters representing codes that could feasibly identify an individual within institution only) | TECH CODE: 123; Accession #: MICRO 01 111 11122233/jlb | Low |
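For readers who want to apply the schema programmatically, the category-to-severity mapping in Table 1 can be captured as a simple lookup table. The sketch below is an illustration only; it is not Knowtator's schema format, and the shortened category keys are labels we chose for readability rather than identifiers used in the study.

```python
from collections import Counter

# Illustrative summary of the Table 1 schema: PHI category -> privacy severity.
# The short category keys are our own labels, not identifiers from the study.
PHI_SEVERITY = {
    "patient_name": "high",
    "patient_id_outside_institution": "high",
    "hcw_name": "medium",
    "hcw_id_outside_institution": "medium",
    "healthcare_facility": "medium",
    "location": "medium",
    "date": "medium",
    "age_over_89": "medium",
    "id_within_institution": "low",
}

def severity(category: str) -> str:
    """Look up the privacy severity level for an annotated PHI category."""
    return PHI_SEVERITY[category]

# Example: tally a handful of annotated categories by severity level.
annotations = ["date", "hcw_name", "patient_name", "date", "id_within_institution"]
print(Counter(severity(c) for c in annotations))
# Counter({'medium': 3, 'high': 1, 'low': 1})
```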
Application and Refinement of the Schema
Two annotators (A1 and A2) worked independently to apply the annotation schema to unmarked documents, first in a training set and then in the test set. The annotators were both non-clinical graduate students in Biomedical Informatics with experience working with clinical documents. The physician (A3) referred to the instruction manual during an initial training session with the two annotators, and questions that arose were discussed until consensus was reached. Annotation was performed using Knowtator7, an open-source text annotation tool whose functionality includes defining annotation schemas, merging reviews, calculating inter-annotator agreement (IAA), and exporting output in XML format.
An annotation schema was built in Knowtator defining the various PHI categories as classes. Both annotators then independently marked PHI categories on the training set, using the fast annotation mode in Knowtator release version 1.8 for all annotation tasks. A consensus review of the 20 training documents was done by the physician who established the annotation schema. The physician and the two annotators then met to review disagreements, which were settled by the physician, and to reach consensus on specific examples that were not easily resolved. Most questions involved marking span, but others concerning whether and how specific text should be annotated led to consultation with a local NLP expert (SM) or the VA privacy officer. The 220 documents in the test set were then reviewed and annotated independently by the two annotators in five batches of 44 documents each. After completion of each batch, the physician performed a consensus review of the merged documents from the two initial manual de-identification reviews. Examples found during the consensus review of the first two batches that were not specifically covered in the annotation schema instructions were reviewed and clarified using an iterative, inductive process. The consensus review identified missed and wrongly selected elements, corrected annotator mistakes, and produced a final document set to serve as the reference standard.
Statistical evaluation of Annotation process
For each batch of annotation, inter-annotator agreement (IAA) was calculated for the two independent reviewers using the standard formula8,9: IAA = 2 × matches / (2 × matches + non-matches). Recall was also calculated for each independent annotator against the final reference set. The time needed to manually de-identify notes was assessed as well.
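To make these metrics concrete, the sketch below computes IAA, recall, and precision from annotation counts. The exact-match criterion (same span and PHI category) and the counts themselves are illustrative assumptions, not study data.

```python
def iaa(matches: int, non_matches: int) -> float:
    """Inter-annotator agreement as defined above:
    IAA = 2 x matches / (2 x matches + non-matches)."""
    return 2 * matches / (2 * matches + non_matches)

def recall(true_positives: int, false_negatives: int) -> float:
    """Annotator recall against the consensus reference standard."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives: int, false_positives: int) -> float:
    """Annotator precision against the consensus reference standard."""
    return true_positives / (true_positives + false_positives)

# Placeholder counts for one batch (not study data). A "match" is assumed
# to require that both annotators marked the same span with the same PHI
# category; anything marked by only one annotator is a non-match.
print(f"IAA:       {iaa(matches=950, non_matches=100):.2f}")                  # 0.95
print(f"Recall:    {recall(true_positives=960, false_negatives=40):.2f}")     # 0.96
print(f"Precision: {precision(true_positives=960, false_positives=30):.2f}")  # 0.97
```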
Results
There were 9 document types in the corpus of 220 documents used in the test set. Annotators completed a total of 5,312 PHI annotations (Table 2). This is a conservative count of PHI because annotation was at the instance level rather than the token level, in which multiple tokens may make up one instance (e.g., FName MI LName). The annotation task took approximately two hours for an annotator to complete a batch of 44 notes. Using the fast annotate mode in the Knowtator tool, identifying, marking, and categorizing an instance of PHI by type took a median of 2 seconds for annotator A1 and 3 seconds for annotator A2.
Table 2.
PHI Counts by Document Type
| Note type | No. notes | Total PHI | Mean PHI per note | Min PHI | Max PHI |
|---|---|---|---|---|---|
| Chest Radiograph | 10 | 73 | 7.3 | 3 | 16 |
| Consultant Notes | 30 | 523 | 17.4 | 6 | 75 |
| Discharge Notes | 40 | 1930 | 48.3 | 6 | 543 |
| History and Physical | 40 | 1238 | 31 | 5 | 217 |
| Infusion Services Notes | 6 | 66 | 11 | 6 | 18 |
| Physician Progress Notes | 34 | 910 | 26.8 | 6 | 86 |
| Microbiology | 10 | 76 | 7.6 | 6 | 13 |
| Nursing Notes | 30 | 293 | 9.8 | 6 | 24 |
| Procedure Notes | 20 | 203 | 10.2 | 1 | 17 |
Among the 9 document types, discharge summaries and history and physicals had the highest prevalence of PHI, while chest radiographs and microbiology reports had the lowest (Table 2). One discharge summary was substantially longer than the others, and 543 PHI were identified in that single note. However, no improvement in agreement was noted when this longer document was excluded from the agreement evaluation as an outlier.
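The per-note figures in Table 2 are simple averages (total PHI divided by number of notes). As a quick check of the table's arithmetic, the short sketch below reproduces that column and the corpus-wide density from the table's own counts.

```python
# Recompute the "Mean PHI per note" column of Table 2 from its counts
# (note counts and PHI totals are copied from the table).
phi_counts = {
    "Chest Radiograph":         (10, 73),
    "Consultant Notes":         (30, 523),
    "Discharge Notes":          (40, 1930),
    "History and Physical":     (40, 1238),
    "Infusion Services Notes":  (6, 66),
    "Physician Progress Notes": (34, 910),
    "Microbiology":             (10, 76),
    "Nursing Notes":            (30, 293),
    "Procedure Notes":          (20, 203),
}

for note_type, (n_notes, n_phi) in phi_counts.items():
    print(f"{note_type}: {n_phi / n_notes:.1f} PHI per note")

total_notes = sum(n for n, _ in phi_counts.values())
total_phi = sum(p for _, p in phi_counts.values())
print(f"Corpus: {total_phi} PHI in {total_notes} notes "
      f"({total_phi / total_notes:.1f} PHI per note)")
```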
The vast majority of annotated PHI (91.04%) was of “medium severity.” This level included the largest single category, dates (3,380; 63.3%), as well as HCW names (940; 17.7%), facilities (364; 6.85%), locations (134; 2.52%), and HCW identifiers sensitive outside the institution (18; 0.34%). PHI categorized as “high severity” accounted for only 4.46% of the annotations and included patient identifiers sensitive outside the facility (12; 0.23%) as well as patient and proxy names (225; 4.24%). Identifiers sensitive only within the institution, categorized as “low severity,” accounted for 239 instances (4.5% of the annotations).
IAA between the independent annotators, as well as each annotator’s recall against the consensus reference, was calculated for each batch and for the entire reference set (Table 3). Performance was lowest for batch 1, then improved and plateaued in subsequent batches. Overall IAA was 0.94 between the independent annotators. Recall and precision were 0.95 and 0.97, respectively, for annotator 1; for annotator 2, recall and precision were both 0.96. In addition to the PHI annotated by either annotator, the consensus review identified 25 PHI that both annotators missed, in 18 of the 220 documents reviewed. None of the PHI missed by both independent annotators was of high severity; the missed categories were 12 partial dates, 7 locations, 2 partial healthcare provider names, and 4 identifiers sensitive only within the institution. The 25 PHI missed by both annotators came from documents that tended to be long, dense in PHI, and tedious to review. Some of the missed PHI were in “unexpected” locations and often in an uncommon format, such as a city written entirely in lower case or a healthcare provider’s title placed between the first and last name.
Table 3.
IAA and Recall by Batch
| Batch | IAA A1:A2 | Recall (A1) | Recall (A2) |
|---|---|---|---|
| 1 | 0.90 | 0.93 | 0.94 |
| 2 | 0.96 | 0.97 | 0.97 |
| 3 | 0.93 | 0.96 | 0.95 |
| 4 | 0.93 | 0.94 | 0.96 |
| 5 | 0.96 | 0.97 | 0.96 |
| Overall | 0.94 | 0.95 | 0.96 |
A1 = Annotator 1, A2 = Annotator 2
Discussion and Limitations
We developed and then applied an annotation schema through an inductive learning process. By training annotators with examples and encouraging questions for clarification before they annotated a training set, we were able to induce general rules from specific instances. Feedback from consensus review and continued discussion of ambiguous PHI classifications after completion of the first two batches of the test set solidified annotator certainty in the annotation task. Examples of ambiguous PHI classifications that were resolved included whether days of the week and hospital unit locations should be annotated. During this process, the group noted that days of the week were often included in medication instructions and that their removal could decrease clinical usefulness. In addition, “generic”-sounding hospital unit locations (such as “MICU”) were referenced throughout documents in the study corpus, and their removal would negatively affect the usability of the documents.
An IAA of 0.90 between the two annotators was observed even for the first of the five batches of annotations, and IAA quickly plateaued at 0.96 by the second batch. Recall followed a similar trend and plateaued at 0.97. Good agreement from the outset, without the need for a larger training set, was likely due to awareness of the i2b2 de-identification challenge, extensive discussion and training prior to the task, and the fact that annotation for de-identification is relatively simple compared with other clinical annotation tasks. However, the maximum IAA did not surpass 0.96 and recall was capped at 0.97, despite iterative review and discussion of disagreements after each batch. Most disagreements appeared to arise from failing to annotate PHI because of fatigue in lengthy documents, or from misclassifying PHI because the previously used class remained highlighted in the fast annotate mode. Since these errors did not appear to be knowledge based, we did not anticipate further improvement in performance with human annotation of additional documents. Improvements might be achieved by presenting reviewers with small sections of the documents as “snippets” to reduce fatigue and by combining machine-aided annotation tools with human review. In addition, although the use of experienced reviewers participating in an inductive learning process was optimal for developing the annotation schema, this process was not an accurate representation of how new annotators would perform when applying the schema guidelines.
No automated system can be perfect in removing all PHI. One objective was to build a reference standard from a broad variety of clinical document types to evaluate automated NLP de-identification tools. In addition to supporting estimates of recall and precision, our modified schema classifies PHI types by level of “privacy severity,” providing additional information about the PHI that may remain in scrubbed documents when evaluating automated tools. “Low severity” PHI carries less risk of re-identifying an individual, but only 4.5% of PHI were of this level. On the other hand, only 4.46% of the annotated PHI were considered “high severity” and therefore the most critical to remove. The reference standard identifies PHI categories likely to be missed, as well as document types at higher risk of missed PHI when processed by selected automated tools, and could be used in multiple ways: to initially train or retrain an automated tool; to highlight the higher-risk areas to target during subsequent manual reviews after an automated preprocessing step; and to provide policy makers with additional information when deciding whether to allow the release of processed documents.
From our review of the corpus, it was unclear whether assessing compliance with the removal of all 18 types of PHI identifiers is, by itself, sufficient to consider a clinical narrative document safe from re-identification. Contextual information in the social history or rare medical conditions could potentially be pieced together, increasing the risk of re-identifying a unique individual. Consider these fictitious examples: “due to mental status changes Mr. PHI initiated a high speed car chase that led to his arrest over the holiday weekend” or “her rare blood type required contacting multiple blood banks over the state.” In addition, medical errors commonly occur, and although the focus is on patient confidentiality, there could be liability for healthcare systems if information regarding care is identifiable to an institution or a specific healthcare provider. Lastly, using linked sets of documents increases the risk of re-identification even if only partial PHI is left in one document, because it can be pieced together with partial PHI from another.
Conclusion
An inductive learning process was used to develop and then apply an annotation schema that provides detail on PHI type and level of privacy severity. The methods developed will inform VA policy makers about the prevalence and type of PHI in specific document sources. The reference standard built from the annotation task will be important for ongoing VA research efforts that focus on using clinical free-text documents for research purposes. Despite removal of all HIPAA identifiers in the corpus, concerns were raised as to whether this alone is sufficient to consider the documents safe from re-identification.
Acknowledgments
This study was supported by resources at the VA Salt Lake City Health Care System, by CDC VA Prevention Epicenter funding, and by the VA Consortium for Healthcare Informatics Research.
References
- 1. Standards for privacy of individually identifiable health information: final rule. 67 Federal Register 53181, 2002 (codified at 45 CFR 160 and 164).
- 2. Douglas M, et al. Computer-assisted deidentification of free text in the MIMIC II database. Computers in Cardiology. 2004;19(22):341–344.
- 3. Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. Proc AMIA Annu Fall Symp. 1996:333–337.
- 4. Gupta D, et al. Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol. 2004;121(2):176–186. doi:10.1309/E6K3-3GBP-E5C2-7FYU.
- 5. Dorr DA, et al. Assessing the difficulty and time cost of de-identification in clinical narratives. Methods Inf Med. 2006;45:246–252.
- 6. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic deidentification. J Am Med Inform Assoc. 2007:550–63. doi:10.1197/jamia.M2444.
- 7. Knowtator reference: http://knowtator.sourceforge.net/
- 8. Roberts A, Gaizauskas R, Hepple M, et al. The CLEF corpus: semantic annotation of clinical text. AMIA Annu Symp Proc. 2007:625–9.
- 9. Hripcsak G, Heitjan DF. Measuring agreement in medical informatics reliability studies. J Biomed Inform. 2002 Apr;35(2):99–110. doi:10.1016/s1532-0464(02)00500-2.
