Abstract
Patient privacy is a major concern when allowing data sharing and the flow of health information. Hence, de-identification and anonymization techniques are used to protect patient health information while supporting secondary uses of data that advance the healthcare system and improve patient outcomes. Several de-identification tools have been developed for free text; this research, however, focuses on developing a notes de-identification and adjudication framework that has been tested for i2b2 searches. The aim is to facilitate clinical notes research without an additional HIPAA approval process or consent by a clinician or patient, especially for narrative free-text notes such as physician and nursing notes. In this paper, we build a scalable, accurate, and maintainable pipeline for notes de-identification utilizing natural language processing, with a REDCap database as a method of adjudication verification. The system is deployed at enterprise scale, where researchers can search and visualize over 45 million de-identified notes hosted in an i2b2 instance.
Introduction
Electronic health record (EHR) systems provide a promising resource of data to accelerate data-driven solutions and research. In general, there are two forms of data in healthcare systems: structured and unstructured data. Structured data refers to a pre-defined data model with associated value sets which are often stored in a database, such as diagnosis codes and patient vital signs (e.g. heart rate). Unstructured data, on the other hand, does not follow a pre-defined format with specific values; it can be found in wide-ranging clinical notes and reports such as physician progress notes, nursing free-text patient assessments, procedure and operative reports, and radiology and pathology reports. Structured data's consistency makes it easier to integrate for research purposes, whereas unstructured data requires more preprocessing, normalization, and transformation before it can be utilized by researchers. As a result, most unstructured data is viewed at the point of care by clinicians but often goes unmanaged as an enterprise asset for supporting healthcare system analysis and clinical research.
Clinical notes represent interactions between patients and the healthcare system arising from episodes of patient care in which healthcare providers record the observations, impressions, care or treatment plans, and other activities. Therefore, they contain observational data, family history, and physician interpretations which are key to help better understand patients’ health conditions and predict early diagnosis and treatment. Text notes may include elements recorded in discrete structured format or non-discrete narrative format using different documentation methods ranging from standardized templates that integrate discrete phrases or elements from the EHR to hand-written or dictated notes that later get scanned or transcribed into the EHR. Free-text narrative notes are the traditional method for healthcare professionals to record their practice without limitations. Nursing specialties, such as psychiatric nursing, rely heavily on narrative notes as they allow nurses to pull together events and information in a meaningful way within a subjectively experienced environment, as well as to document time-oriented events1. While clinicians generally value flexibility and efficiency, those reusing data often value structure and standardization2.
Clinical notes are stored in a free-text format and often contain identifiable or confidential information that poses challenges when attempting to de-identify to the Safe Harbor criteria. While de-identification tools have been developed for free text, not many have been developed or tested to support integration with self-service query tools that predominantly integrate structured data organized by standard ontologies. The University of Kansas Medical Center (KUMC), the Medical College of Wisconsin (MCW), and many other medical centers provide researchers access to de-identified, structured patient data through the i2b2 data repository platform3, 4. While structured data is de-identified by removing required fields and obscuring dates, often by date shifting or reducing temporal resolution, it lacks cohesive representation of patient documentation; specifically, observations and decisions recorded in free-text notes. Having these notes de-identified allows researchers to define more robust computable phenotypes versus using structured data alone. Additionally, providing datasets from i2b2 that contain de-identified free text allows us to disclose the minimum information necessary to advance research while protecting patient privacy.
Background and Significance
Information extraction methods were a focus of development mostly between 1987 and 1998, initially sponsored by federal agencies outside of the biomedical domain. Information extraction supporting clinical research has been increasingly catalyzed since the National Institutes of Health started the Clinical and Translational Science Award (CTSA) program in 20075. To extract clinical documentation or clinical text, a de-identification step is required to ensure the removal of the 18 HIPAA identifiers described below.
There have been several challenges within the research community focused on clinical notes de-identification, such as the 2006 i2b2 de-identification challenge6, the 2014 i2b2/UTHealth Shared Task7, and the CEGS N-GRID shared task8. These led researchers to develop various de-identification tools over the past few years9, 10, 11, 12, 13, 14. Most of the de-identification tools reportedly achieve high accuracy; however, their performance drops significantly when applied to real-world datasets. In addition, they often do not meet the scalability requirements of Clinical Integrated Data Repositories (CIDR), and fail to address the following crucial facets: “How can we gain trust in the de-identification process?”, “How can we ensure patient privacy and confidentiality when sharing de-identified clinical notes?”, and “What are the required precautions and procedures an institution must follow to ensure privacy?”. Hence, the goal of this study is to build a reliable, enterprise-level framework that allows sharing of de-identified clinical notes across multiple medical centers. This paper describes our approach to clinical notes de-identification across two medical centers to support researchers via self-service i2b2 queries and to augment research dataset requests with de-identified notes.
Materials and Methodology
HIPAA Privacy Rule Definition
The Health Insurance Portability and Accountability Act (HIPAA), enacted in 1996, requires that covered entities protect health information while allowing data sharing and the use of health information. Its Privacy Rule ensures the proper protection of patient privacy when permitting data sharing and the use of health information for research purposes, and defines the rules for data disclosure and use without obtaining patient consent.
According to HIPAA, there are two de-identification approaches, the Safe Harbor method and the statistical method, both of which involve significant human resources to manually examine EHR content for de-identification. The Safe Harbor method of the HIPAA Privacy Rule can be applied automatically to clinical narrative text using NLP methods, allowing for faster and cheaper de-identification of clinical text15. Under the Safe Harbor method, there are 18 protected health information (PHI) identifiers that must be removed for a document to be considered de-identified (shown in Table 1). The removal of these identifiers may result in information loss; however, it is necessary for supporting data reuse and clinical text mining. The de-identification task is time-consuming and labor-intensive; hence, many tools have been developed to remove the PHI automatically instead of relying on humans.
Table 1.
HIPAA Privacy Rule identifiers.
| | HIPAA identifiers | | HIPAA identifiers |
|---|---|---|---|
| 1 | Names | 10 | Account numbers |
| 2 | All geographical address elements smaller than state | 11 | Certificate numbers |
| 3 | All Dates elements related to the individual (except year) | 12 | Vehicle serial numbers and identifiers |
| 4 | Phone numbers | 13 | Device serial numbers and identifiers |
| 5 | Fax numbers | 14 | Web resource locators (URLs) and links |
| 6 | Email addresses | 15 | IP addresses |
| 7 | Social security numbers | 16 | Biometric identifiers (e.g. fingerprint) |
| 8 | Medical Record numbers | 17 | Full face photographic images |
| 9 | Health plan beneficiary numbers | 18 | Any unique identifying number, code, or characteristic |
HERON i2b2-based Data Repository
The Kansas University Hospital uses the Epic EHR, where research data is derived mainly from Epic's Clarity relational database, which contains more than 7,000 tables with over 60,000 columns3. Data from Clarity are extracted, transformed, and loaded (ETL) using Structured Query Language (SQL) and the Python programming language into an i2b2-compatible star schema, de-identified, and transferred to a separate server to be accessed by researchers using the i2b2 application. Since 2010, the KUMC HERON i2b2-based research repository has had over 70 releases that have increased the richness of clinical, sociodemographic, and administrative information, such as nursing flowsheets, tumor/cystic fibrosis/trauma/cardiac catheterization registries, patient self-reported findings via the patient portal, and social history data in addition to basic patient demographics, diagnoses, laboratory results, and medication data. Incorporating free-text findings from multi-disciplinary flowsheet documentation, clinician notes, and reports remained a major gap in our goal of providing comprehensive clinical data.
The Medical College of Wisconsin (MCW) also uses Epic for its electronic medical record system and extracts from the Clarity reporting system the tables necessary to support the Clinical Research Data Warehouse, which has a translation layer supporting the ETL of de-identified data into an i2b2 star schema. In addition to Epic, MCW integrates a number of historical single-use systems, including the NAACCR tumor registry and IntelliDose for chemotherapy, to augment and provide a richer longitudinal context for their patient population going back to 2004. MCW continues to integrate other clinical systems on its campus to enrich specific domains, such as EKG for cardiovascular research and genomics for precision medicine.
Enterprise-scale de-identification framework
Figure 1 shows the flow of the de-identification and adjudication process for loading data to the i2b2 de-identified server for researchers to search and visualize. The system has two steps: notes de-identification and notes adjudication. The notes de-identification step removes all identifiable information, such as names of patients, locations, identification numbers, dates, and ages over 89, in order to meet HIPAA Privacy Rule criteria. It includes text preprocessing, regular expression processing, named entity recognition, date shifting, PHI seeding, blacklist de-identification, and whitelisting of identified information to be retained. During notes adjudication, notes are audited to ensure PHI removal, and de-identification performance is evaluated before the notes are released.
Figure 1.
Overview of the enterprise-scale de-identification framework
Notes De-identification
1. Text Preprocessing
Text in the EHR usually comes from different sources such as dictation software, direct keyboard entry, and templated forms that integrate EHR structured data. Since text is authored in different formats, text preprocessing is crucial for the de-identification process and can significantly increase data consistency. During preprocessing, we normalized character encodings so all notes share the same encoding, and we corrected sentences broken by Epic record boundary limitations, where a single clinical note can be stored across multiple database records; we concatenated broken records into whole notes and restored missing line breaks. In addition, we cleaned the text by stripping punctuation that could prevent identification of certain PHI (e.g. backslash, underscore), and we fixed camel-case text by adding spaces between words concatenated into compound words.
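The preprocessing steps above can be sketched as follows. This is an illustrative Python sketch, not the production pipeline; the input shape (a list of record fragments belonging to one note) is a hypothetical simplification.

```python
import re

def preprocess(records):
    """Merge multi-record note fragments and normalize the text.

    `records` is a list of text fragments that together form one
    clinical note (hypothetical input shape for illustration).
    """
    # Concatenate fragments split by the EHR's record-size limit.
    text = "".join(records)
    # Normalize encoding so all notes share one representation.
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    # Strip punctuation that can hide PHI from pattern matching.
    text = text.replace("\\", " ").replace("_", " ")
    # Split camel-case compounds: "JohnSmith" -> "John Smith".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return text
```

Restoring line breaks and handling institution-specific encodings would require additional rules beyond this sketch.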
2. Regular Expression Rules
A set of regular expressions was developed to de-identify several identifiers including MRNs, addresses, emails, phone numbers, and zip codes. We defined 24 different regular expressions to search for patterns where these identifiers may be mentioned. Table 2 shows some examples of the regular expressions that were used. When a regular expression finds PHI, the identifier is masked and replaced according to its type. For example, 07328644 is replaced with [xx-xx-xx-xx-xx] and Justin@gmail.com is replaced with xxx@xxx.xxx.
Table 2.
Regular expression examples for different types of PHI.
| | Regular Expression | Example | Type |
|---|---|---|---|
| 1 | ([0-9]{1,2})[/\\-]([0-9]{4}) | Month/Year | Date |
| 2 | [-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\\.[a-zA-Z]{2,4} | Justin@gmail.com | Email |
| 3 | ([0-9]{1,2})\\-([0-9]{1,2})\\-([0-9]{1,2})\\-([0-9]{1,2})\\-([0-9]{1,2}) | 01-23-45-67 | MRN |
| 4 | (1[\\-|\\.]){0,1}\\({0,1}([0-9]{3})\\){0,1}[\\-|\\s|\\.]{0,1}([0-9]{3})[\\-|\\s|\\.]{0,1}([0-9]{4})\\b | 505-779-4055 | Phone number |
| 5 | \\d{5}((-)\\d{4})? | 87123 | Zip code |
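As a minimal illustration of the masking step, the Python sketch below applies a small, simplified subset of such patterns. These are not the paper's exact 24 expressions, and the masks shown are simplified; the order of application matters (e.g. phone numbers before zip codes, so digit runs are classified by the more specific pattern first).

```python
import re

# Simplified patterns for three PHI types (illustrative only).
PATTERNS = [
    (re.compile(r"[-0-9a-zA-Z.+_]+@[-0-9a-zA-Z.+_]+\.[a-zA-Z]{2,4}"), "xxx@xxx.xxx"),
    (re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{5}(-\d{4})?\b"), "[ZIP]"),
]

def mask_phi(text):
    """Replace each matched identifier with a type-specific mask."""
    for pattern, mask in PATTERNS:
        text = pattern.sub(mask, text)
    return text
```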
3. Named Entity Recognition
Named Entity Recognition (NER), also known as entity extraction, is used to identify entities in text and classify them into predefined classes such as person name, location, and organization. There are many well-known NER tools such as NLTK16, Stanford CoreNLP17, and spaCy18. Most of these tools ship with models pre-trained using machine learning algorithms. We evaluated named entity recognition software and several trained models from the Stanford NLP group17, Apache OpenNLP19, and the MITRE Identification Scrubber Toolkit (MIST). A combination of three named entity recognition models developed by Stanford consistently outperformed OpenNLP and MIST20. The named entity recognition software and models from the Stanford NLP group achieved “out of the box” performance of 92.6% on a test set of patient records.
Integrating named entity recognition into our NLP pipeline increased de-identification performance, demonstrating the importance of NER for detecting named entities that regular expressions fail to identify. Figure 2 illustrates an example of applying the named entity recognition module to notes.
Figure 2.
Example of notes after applying NER.
4. Date Shifting
Dates related to patient events, such as surgery date and birth date, are shifted for anonymization, but the relationships among dates within a patient's record are preserved by assigning a random offset to each patient. Date shifting is an important feature for notes usability, since it preserves the chronological order of events, allowing users to better understand a patient's history and timeline. Therefore, when the regular expression component finds a date pattern, instead of masking it with its type (e.g. [DATE]), the algorithm adds the patient's random offset to all dates within the same note, as shown in Figure 3 with an example offset of 10.
Figure 3.
Example of notes after applying date shifting.
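A minimal sketch of per-patient date shifting is shown below, assuming MM/DD/YYYY dates and that the patient's random offset has already been drawn elsewhere; the production tool handles many more date formats than this single pattern.

```python
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def shift_dates(text, offset_days):
    """Shift every MM/DD/YYYY date by the patient's fixed offset.

    Using one offset for all dates in a patient's notes preserves the
    intervals between events, keeping the timeline interpretable.
    """
    def repl(match):
        month, day, year = (int(g) for g in match.groups())
        shifted = datetime(year, month, day) + timedelta(days=offset_days)
        return shifted.strftime("%m/%d/%Y")
    return DATE_RE.sub(repl, text)
```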
5. PHI Seeding
Typically, EHR data contains clinical notes, patient demographics, and encounter information stored in the same database. Since the de-identification process removes patient information from clinical notes, and this information can be easily obtained from the database, we utilize the EHR data and feed patient information to the tool to improve its performance. In addition to using existing PHI, we also defined new PHI; for example, the age of the patient at the time of the visit is considered PHI if the age is above 89. While this information might not be available in some EHR data, it can be calculated from existing PHI such as patient birth date and visit date. A list of PHI and new PHI was created for each patient, and we then used regular expressions to remove any occurrences of these PHI in the clinical text notes. This component can be enabled or disabled depending on whether there is access to the EHR data.
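The seeding idea can be sketched as follows, assuming each patient's identifiers arrive as a simple dictionary from the EHR database; the field names here are hypothetical, and a real implementation would cover many more identifier fields.

```python
import re

def build_seed_patterns(patient):
    """Compile word-boundary patterns from structured EHR fields so
    known identifiers are scrubbed wherever they appear in a note.

    `patient` is a hypothetical dict of identifiers for illustration.
    """
    terms = [patient.get("first_name"), patient.get("last_name"),
             patient.get("mrn")]
    return [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
            for t in terms if t]

def scrub_seeded(text, patterns):
    """Mask every occurrence of the patient's known identifiers."""
    for pattern in patterns:
        text = pattern.sub("[PHI]", text)
    return text
```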
6. Blacklist and Whitelist Matching
A blacklist and a whitelist are used to handle automatic de-identification and identification of difficult terms that the regular expressions and NER tool are unable to identify. With input from domain experts, we define blacklist terms that the algorithm failed to de-identify; these are names mislabeled by the NER algorithm due to the lack of such terms in the model's training data (e.g. Wauwatosa, Shankar), or geographic locations such as county and city names that NER is unable to remove. To prevent the tool from over-scrubbing non-PHI terms, we create a whitelist of terms that can be easily mistaken for PHI. For example, ED, ADVAIR, and Sarcoidosis are medical terms added to the whitelist because they are mislabeled as entity names by the NER tool.
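The interplay of the two lists can be sketched as below, assuming tokens arrive with a flag from the upstream regex/NER components. The terms shown are examples from the text; hash-based sets give constant-time lookups, which matters as the lists grow.

```python
# Illustrative term lists; real lists are institution-specific and
# much larger (KUMC's blacklist grew to 1,857 terms).
BLACKLIST = {"wauwatosa", "shankar"}          # always scrub
WHITELIST = {"ed", "advair", "sarcoidosis"}   # never scrub

def apply_lists(tokens):
    """Resolve each (token, flagged_as_phi) pair against both lists.

    Whitelisted tokens pass through even if an upstream component
    flagged them; blacklisted tokens are scrubbed even if none did.
    """
    out = []
    for token, flagged in tokens:
        low = token.lower()
        if low in WHITELIST:
            out.append(token)
        elif flagged or low in BLACKLIST:
            out.append("[NAME]")
        else:
            out.append(token)
    return out
```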
Notes Adjudication and Evaluation
Initial Evaluation
In general, there are multiple methods to verify the performance of clinical notes de-identification. First, standard NLP metrics can be used, such as recall (sensitivity), precision (positive predictive value), and the F measure, the harmonic mean of recall and precision20. Second, the performance of automated systems can be compared with that of humans; for example, a study conducted by the Children's Hospital Medical Center, Cincinnati, Ohio, USA compared the performance of automated systems with that of two native English speakers20. Third, one de-identification system can be evaluated against another standard system, such as the MITRE Identification Scrubber Toolkit (MIST)21. For this work, we reported recall and precision at the record level: if at least one identifier is not completely de-identified, the record is not properly de-identified, and if at least one token is mistakenly de-identified, the record is considered over-scrubbed.
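Under this record-level definition, the metrics can be computed as in the sketch below. This is one plausible reading of the paper's definition, not its exact implementation: a cleanly de-identified record counts as a true positive, a record with any leaked identifier as a false negative, and an over-scrubbed record as a false positive.

```python
def record_level_metrics(records):
    """Compute record-level precision and recall.

    `records` is a list of (has_phi_leak, has_over_scrub) pairs,
    one per reviewed record.
    """
    tp = fn = fp = 0
    for leak, over_scrub in records:
        if leak:              # at least one identifier survived
            fn += 1
        elif over_scrub:      # a non-PHI token was removed
            fp += 1
        else:                 # cleanly de-identified
            tp += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, recall
```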
Our de-identification method was initially tested at MCW by creating a stratified 48,000-patient-record test set containing 22 types of patient records drawn randomly from 48M patient records within the CRDW. This full 48,000-record set was used for performance evaluation, and we randomly sampled 1,000 records from it to report accuracy. On a single PC (Intel Core 2 Quad CPU @3GHz and 8GB memory) the software de-identified 110 records/sec. After correcting the errors found in the first pass over the 1,000 patient records, we identified 27 errors in a second-pass evaluation; most were failures to remove some parts of patient names (first, last, middle, and initials), and some of the regular expressions generated false positives. Therefore, the system was improved by adding the PHI seeding component, which allowed identification of patient names and other related health information stored in the Epic system so that these identifiers get scrubbed if they appear in the notes. Another challenge we faced was over-scrubbing of some terms (e.g. drug names and procedure names); we addressed this by expanding the whitelist terms and creating several JUnit tests22 to run regression validation of the regular expressions.
After this initial testing, several validation audits on different de-identified note types were performed to test whether patient names were successfully scrubbed. We found 55 patient name leaks in a sample of 11,367 Discharge Summaries (0.48%), 45 PHI leaks in a sample of 2,000 Echo Notes (2.25%), and 87 PHI leaks in a sample of 5,000 Therapy Notes (1.74%); most of the identified PHI were patient nicknames. This audit prompted us to implement a new leak mitigation strategy that blacklists patients' preferred names from the Epic Patient tables, since patients may be referred to by their nicknames instead of their actual names.
Enterprise-scale Notes Adjudication
To be trusted, the automatic de-identification tool must provide “acceptable” accuracy, but what counts as acceptable performance can vary depending on many factors, such as the final purpose of the de-identified documents, the legal agreements that could be imposed to avoid re-identification, and the fact that some PHI categories are more sensitive than others21. The KUMC Medical Informatics team worked with the KU health system privacy team to determine standards for an acceptable number of notes to review and a follow-up review process for false negatives. They also set up guidelines for reviewers and annotators to follow to eliminate bias and standardize the adjudication and annotation process, and provided the reviewers with proper training.
Figure 4 describes the flow of the adjudication process for loading de-identified notes to the i2b2 de-identified server for researchers to search. Once notes are de-identified, 30 randomly selected notes per note type or measure are selected for review. If a note or measure is found with patient identifiers (i.e. a false negative), it requires modification of the de-identification tool to address the PHI, additional analysis of 30 newly sampled notes or measures before approval, and discussion with the Data Request Oversight Committee (DROC), formed by representatives from the KU health system and university. Once reviewers from both entities indicate approval for release within the REDCap project, the de-identified text and corresponding facts are automatically incorporated into the KUMC clinical data warehouse, HERON, for the next release. If the research impact is insignificant, the note type is blacklisted (e.g. a nursing free-text field used to record a nurse practitioner's pager).
Figure 4.
De-identification adjudication process.
REDCap is used to validate and approve the fitness of the de-identification on the various types of unstructured data, integrating de-identified notes and personal health identifiers from i2b2 with auditing tools that support adjudication in REDCap23. First, we utilized de-identification software using NLP methods to de-identify clinical free text. Second, we used REDCap to store de-identified data and hold the audit files that track the review process. We created two complementary REDCap projects: the first holds all audit files, and the second manages the review process for dual adjudication and commentary, documenting the review process and all related information including the number of notes reviewed, the number of notes not properly de-identified (false negatives), the number of notes accurately de-identified (true positives), the number of notes falsely de-identified (false positives), a risk-level assessment, the release approval decision, and any additional comments by the reviewers. Note types approved by both reviewers are then loaded to the i2b2 de-identified server for researchers to search. Figure 5 shows the REDCap de-identification review and adjudication tool for note-by-note-type review.
Figure 5.
De-identification review and adjudication tool for note-by-note type hosted in REDCap.
Final Evaluation
We first evaluated nursing flowsheets using the proposed adjudication process, considering 4,010 different flowsheet types and randomly sampling 30 notes per type. After running the de-identification tool on this dataset, two annotators reviewed a total of 48,377 notes; the average precision was 74.8% and the average recall was 84.5% at the record level. The inter-rater agreement between the two specialists was K=90.8%, which is considered almost perfect agreement. As a result, 3,254 flowsheet types were approved and released in HERON, while 756 types were rejected due to performance results and their insignificant impact on the research community.
Throughout the flowsheet reviewing process, we used the reported errors to enhance the system; we updated the regular expressions to handle other date formats (e.g. MM/DD) and phone numbers written without dashes. We also observed that both the blacklist and whitelist could be expanded based on the institution's domain knowledge. For example, adding local counties and cities to the blacklist increased overall performance; we therefore added more terms to the KUMC blacklist, resulting in an expanded list of 1,857 terms. We used JUnit for automated regression testing to ensure our tool still removes previously tested PHI after incorporating new changes.
Consequently, we gained more confidence in the de-identification process after several rounds of flowsheet audits and tool modifications that improved overall de-identification performance. We then reviewed physician notes, auditing 100 clinical notes (240 records; notes in Epic are stored in multiple records due to a 4,000-character column size restriction) from 44 different note types including progress notes, discharge summaries, and H&P notes. 4 errors were identified in the 240 records, achieving 98.1% recall and 57% precision at the record level. To investigate the low precision, we computed precision and recall at the instance level (tokens classified as PHI or non-PHI) and found the tool achieved 99.7% recall and 94% precision. Reporting precision and recall at the instance level is more accurate and interpretable than at the record level, since record-level evaluation scores whole records instead of tokens: a record is labeled a false negative if at least one PHI token was not de-identified, and a false positive if at least one token is mistakenly de-identified. However, for enterprise-scale evaluation we used the record level, since instance-level evaluation is inefficient and time-consuming when reviewing thousands of notes.
At enterprise scale across all note types at KUMC, our pipeline de-identified 90,881,400 records in 4.8 days at a processing rate of 218 records/second, and we released millions of de-identified notes including: 112 types of physician notes (34 million), 4,010 types of flowsheets (6 million), pathology reports (380 thousand), radiology exams (2.2 million), and radiology impressions (2.2 million) into HERON for i2b2 free-text search and de-identified data delivery. This experiment was conducted with 32 threads on a JVM limited to 4 GB, running on Linux with Intel Xeon E5-4617 2.9 GHz cores. Initially, performance degraded with extensive whitelist and blacklist additions, but the method was enhanced by using hashing techniques for efficient search of blacklist and whitelist terms. We also modified the tool to run multithreaded, allowing multiple notes to be processed simultaneously to enhance scalability.
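The multithreaded processing can be sketched as follows. The actual tool runs on the JVM; Python is used here purely for illustration, with a trivial placeholder standing in for the full de-identification pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def deidentify(note):
    """Placeholder for the full pipeline (preprocessing, regexes,
    NER, date shifting, seeding, blacklist/whitelist)."""
    return note.replace("Justin", "[NAME]")

def run_parallel(notes, workers=32):
    """Process notes concurrently, mirroring the 32-thread setup.

    Each note is independent, so the work parallelizes cleanly;
    executor.map preserves the input order of the results.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(deidentify, notes))
```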
Discussion
In this paper, we introduced an enterprise-scale framework for clinical note de-identification and adjudication. Our note de-identification process is a hybrid approach combining NLP techniques, rule-based taggers, dictionaries, and domain knowledge. It follows a modular design where modules are created independently and can be easily modified or replaced to facilitate software adoption and reusability: for example, the named entity module can be replaced, the whitelist and blacklist can be updated based on institutional needs, and additional regular expressions can be added. We complemented these technical advances with an adjudication method using REDCap for tracking and auditing the de-identified notes review process with health system privacy officials. Together, these advances allow the release of notes for i2b2 searches and let clinical researchers visualize notes in the i2b2 timeline plug-in. Figure 6 shows an example of an i2b2 query incorporating free-text search for “AD-PKD” (autosomal dominant polycystic kidney disease) in the discharge summaries, allowing clinical researchers to review the richer context provided by the narrative.
Figure 6.
i2b2 query using text search on de-identified notes.
The de-identification system has been deployed at two medical institutions (KUMC and MCW) after an intensive evaluation using different datasets of various note types, and it has been made public (https://bitbucket.org/MCW_BMI/notes-deidentification) to foster further collaboration.
Similar to continuous quality improvement cycles, several experiments were performed to evaluate the tool and enhance system performance. At first, an initial test was conducted on 1,000 notes stratified from 48,000 patient records of 22 types, and 27 errors were identified. In addition, we evaluated the performance of patient name removal by running the tool on a dataset consisting of 11,367 Discharge Summaries, 2,000 Echo Notes, and 5,000 Therapy Notes, and found 187 patients who were called by their preferred names (i.e. nicknames) instead of their actual names, which the tool failed to identify. This audit led to a new leak mitigation strategy that leverages the patient preferred names from one of the EHR Patient tables to customize blacklisting.
Notes adjudication is an important component of the overall de-identification process: it promotes trust, transparency, and collaboration with the privacy office, and it allows the institution to formulate an acceptable threshold for errors and prioritize improvements in the de-identification process. At the same time, it is very time-consuming. Using REDCap in our work helped track the review process and generate reports of review outcomes, such as the approved note types, comments on rejected note types, and disagreements between the two reviewers.
Following the notes adjudication process, we conducted an evaluation on a large dataset of nursing flowsheets, considering 4,010 flowsheet types and applying the adjudication process to each type. Two annotators reviewed 48,377 random notes drawn from the various note types. The system achieved an average precision of 74.8% and a recall of 84.5% (record level). Throughout this process, we used the reported errors to enhance the system by adding more regular expressions and expanding the blacklist and whitelist terms based on the institution's domain knowledge. We also utilized hashing techniques when searching blacklist and whitelist terms, and modified the tool to run multithreaded to enable better scalability. As a result, 3,254 flowsheet types were approved and released in HERON, while 756 types were rejected due to performance results and their insignificant impact on the research community (e.g. a text field used to store the pager number of clinical team members).
The several rounds of flowsheet audits and system modifications increased confidence in the overall de-identification process and engagement with the health system privacy office, which was critical before releasing de-identified free-text data and incorporating other types of notes. A key safeguard is the structure of our data sharing and data use agreements: while institutional review board approval is not required, data requests are reviewed and approved, and recipients of de-identified data must safeguard the received data as if it were a limited dataset and notify the medical informatics team if any PHI is detected. After flowsheets, we reviewed physician notes and audited 100 clinical notes from 44 different note types; a total of 4 errors were identified in the 240 records, achieving 98.1% recall (record level). This result shows that adjudicating physician notes after nursing flowsheets helped achieve high precision and recall, since several improvements were incorporated into the de-identification tool during the review process. However, our initial performance assessment focused on de-identification; future work and continuous improvement are needed to refine and expand the whitelists where our method erroneously scrubs medical terminology.
There were several limitations encountered during this study. First, a single note in the Epic EHR may be stored across multiple database records due to a field size limitation of 4,000 characters. This affected note visualization and readability in the i2b2 timeline, so methods were altered to merge the records and store them in a single Character Large Object (CLOB) field instead of a varchar. Second, the default Epic notes terminology is defined using local or inherited non-standard terminologies, with nearly 50% of clinical notes assigned to a generic “progress note” category. We plan to deploy a standard ontology such as LOINC to improve i2b2 text search time and facilitate clinical notes aggregation and exchange across research consortia.
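The fragment-merging step above can be sketched as a simple group-and-join before loading into the CLOB column. The `(note_id, line_no, text)` record layout is an assumed simplification of the Epic schema, not the actual table structure.

```python
# Sketch: merge Epic note fragments (<= 4000 chars each) into one string
# per note before storing in a CLOB column. The (note_id, line_no, text)
# layout is an assumed simplification of the source schema.
from collections import defaultdict

def merge_note_fragments(records):
    """records: iterable of (note_id, line_no, text) tuples.
    Returns {note_id: full_text} with fragments joined in line order."""
    fragments = defaultdict(list)
    for note_id, line_no, text in records:
        fragments[note_id].append((line_no, text))
    return {
        note_id: "".join(text for _, text in sorted(parts))
        for note_id, parts in fragments.items()
    }

rows = [(101, 2, "continued in a second record."),
        (101, 1, "A long progress note is "),
        (102, 1, "A short note.")]
merged = merge_note_fragments(rows)
print(merged[101])
# → A long progress note is continued in a second record.
```

Merging before de-identification also avoids scrubbing errors at fragment boundaries, where an identifier could otherwise be split across two records.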
Conclusion
In this paper, we introduced a framework for note de-identification that accurately de-identifies free-text EHR notes while preserving clinical content, and we described our notes adjudication process using a REDCap project. Our experimental results demonstrated the high performance and scalability of the proposed framework. The system was tested at two medical institutions and is currently deployed for i2b2 text search and data requests. We continue to release new note types, prioritizing them based on their significance to the research community.
Acknowledgements
This work was supported by a CTSA grant from NCATS awarded to the University of Kansas for Frontiers: University of Kansas Clinical and Translational Science Institute (# UL1TR002366) and the Medical College of Wisconsin Clinical and Translational Science Institute (# UL1TR001436). The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH or NCATS.
References
1. Hall JM, Powell J. Understanding the person through narrative. Nursing Research and Practice. 2011;2011.
2. Rosenbloom ST, Denny JC, Xu H, Lorenzi N, Stead WW, Johnson KB. Data from clinical notes: a perspective on the tension between structure and flexible documentation. Journal of the American Medical Informatics Association. 2011;18(2):181–6. doi: 10.1136/jamia.2010.007237.
3. Waitman LR, Warren JJ, Manos EL, Connolly DW, editors. Expressing observations from electronic medical record flowsheets in an i2b2 based clinical data repository to support research and quality improvement. AMIA Annual Symposium Proceedings; 2011. American Medical Informatics Association.
4. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). Journal of the American Medical Informatics Association. 2010;17(2):124–30. doi: 10.1136/jamia.2009.000893.
5. Dehghan A, Kovacevic A, Karystianis G, Keane JA, Nenadic G. Combining knowledge- and data-driven methods for de-identification of clinical narratives. J Biomed Inform. 2015;58 Suppl:S53–9. doi: 10.1016/j.jbi.2015.06.029.
6. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association. 2007;14(5):550–63. doi: 10.1197/jamia.M2444.
7. Stubbs A, Kotfila C, Uzuner O. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. J Biomed Inform. 2015;58 Suppl:S11–9. doi: 10.1016/j.jbi.2015.06.007.
8. Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID shared tasks Track 1. Journal of Biomedical Informatics. 2017;75S:S4–S18. doi: 10.1016/j.jbi.2017.06.011.
9. Lee H-J, Wu Y, Zhang Y, Xu J, Xu H, Roberts K. A hybrid approach to automatic de-identification of psychiatric notes. Journal of Biomedical Informatics. 2017;75:S19–S27. doi: 10.1016/j.jbi.2017.06.006.
10. Friedlin FJ, McDonald CJ. A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association. 2008;15(5):601–10. doi: 10.1197/jamia.M2702.
11. Neamatullah I, Douglass MM, Li-wei HL, Reisner A, Villarroel M, Long WJ, et al. Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making. 2008;8(1):32. doi: 10.1186/1472-6947-8-32.
12. Yang X, Lyu T, Li Q, Lee C-Y, Bian J, Hogan WR, et al. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Medical Informatics and Decision Making. 2019;19(5):232. doi: 10.1186/s12911-019-0935-4.
13. Dehghan A, Kovacevic A, Karystianis G, Keane JA, Nenadic G. Combining knowledge- and data-driven methods for de-identification of clinical narratives. Journal of Biomedical Informatics. 2015;58:S53–S9. doi: 10.1016/j.jbi.2015.06.029.
14. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics. 2017;75:S34–S42. doi: 10.1016/j.jbi.2017.05.023.
15. Meystre SM, Ferrandez O, Friedlin FJ, South BR, Shen S, Samore MH. Text de-identification for privacy protection: a study of its impact on clinical text information content. J Biomed Inform. 2014;50:142–50. doi: 10.1016/j.jbi.2014.01.011.
16. Loper E, Bird S. NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028. 2002.
17. Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D. The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations; 2014.
18. Honnibal M, Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear. 2017;7(1).
19. Kottmann J, Margulies B, Ingersoll G, Drost I, Kosin J, Baldridge J, et al. Apache OpenNLP. 2011. opennlp.apache.org.
20. Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. Journal of the American Medical Informatics Association. 2013;20(1):84–94. doi: 10.1136/amiajnl-2012-001012.
21. Fernandes AC, Cloete D, Broadbent MT, Hayes RD, Chang C-K, Jackson RG, et al. Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records. BMC Medical Informatics and Decision Making. 2013;13(1):71. doi: 10.1186/1472-6947-13-71.
22. Gamma E, Beck K. JUnit. 2006.
23. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377–81. doi: 10.1016/j.jbi.2008.08.010.