AMIA Annual Symposium Proceedings. 2021 Jan 25;2020:648–657.

Comparative Study of Various Approaches for Ensemble-based De-identification of Electronic Health Record Narratives

Youngjun Kim 1, Paul M Heider 1, Stéphane M Meystre 1,2
PMCID: PMC8075417  PMID: 33936439

Abstract

De-identification of electronic health record narratives is a fundamental natural language processing task for protecting patient information privacy. We explore different types of ensemble learning methods to improve clinical text de-identification and present two ensemble-based approaches for combining multiple predictive models. The first method selects an optimal subset of de-identification models by greedy exclusion. This ensemble pruning allows one to save computational time or physical resources while achieving similar or better performance than the ensemble of all members. The second method uses a sequence of words to train a sequential model. For this sequence labelling-based stacked ensemble, we employ search-based structured prediction and bidirectional long short-term memory algorithms. We create ensembles consisting of de-identification models trained on two clinical text corpora. Experimental results show that our ensemble systems can effectively integrate predictions from individual models and offer better generalization across two different corpora.

Introduction

Ensemble learning is a meta-algorithm that uses the outputs of individual classifiers to reduce their errors and improve accuracy. The approach has attracted natural language processing (NLP) researchers as a convenient and effective way to combine multiple predictive models. In general, a classifier ensemble has a two-tier learning structure in which the first layer consists of a set of individual classifiers and the second layer serves to combine their outputs. Although the performance of individual models in the first layer is of primary importance, more thorough consideration should be given to effective model integration by the metaclassifier in the second layer.

Our focus is on ensemble-based methods for electronic health record (EHR) narrative text de-identification. De-identification involves detecting and hiding or removing pre-defined categories of identifiers (personally identifiable information, PII) such as social security numbers or biometric information. In practice, eighteen categories of PII are listed in the Health Insurance Portability and Accountability Act (HIPAA) privacy rule ‘safe harbor’ method.1 De-identification has been considered a fundamental task in clinical NLP and plays an important role in protecting patient information privacy while making clinical data shareable with other research communities, especially as EHRs become more prevalent.

In our previous research on EHR narrative de-identification2, we applied a variety of information extraction (IE) methods, including deep learning, shallow learning, and rule-based approaches. We analyzed de-identification model diversity by grouping similar models based on their outputs. To combine multiple de-identification models, we presented three ensemble methods: voting, the decision template method (DTM)3, and stacked generalization4. When evaluated on the 2014 i2b2 (informatics for integrating biology and the bedside) de-identification challenge corpus5,6 (called ‘2014 i2b2’ hereafter), we showed that ensemble methods consisting of a diverse set of PII extraction models could improve EHR de-identification.2

In this subsequent research, we advance our approach in two directions. First, we explore various types of ensemble architectures to examine whether they can produce more accurate predictions. We employ additional ensemble learning methods beyond the three approaches we already experimented with. We propose an ensemble pruning method that automatically determines the voting threshold (i.e., the number of members voting for one annotation) and the optimal combination of de-identification models. In our previous study2, we created a stacked learning ensemble by training an SVM (support vector machine)7 classifier that used each PII concept from the individual classifiers as a training example. Besides this concept-level SVM model, we propose a sequence labelling-based (or word token-level) stacked ensemble that uses a sequence of words to train the sequential model. We implement two stacked generalization methods based on search-based structured prediction (SEARN)8 and bidirectional long short-term memory (Bi-LSTM)9,10, a variant of the recurrent neural network (RNN) algorithm. We compare the performance of concept-based ensemble methods against sequence labelling-based stacked learning ensembles.

Second, we aim to increase the generalization of de-identification by combining multiple datasets. We exploit another text collection created for the 2016 CEGS (centers of excellence in genomic science) N-GRID (neuropsychiatric genome-scale and RDoC-individualized domains) shared task11 (called ‘2016 N-GRID’) for training and evaluation of de-identification models. Extending our previous work, which used only the 2014 i2b2 data, we build de-identification models using the 2016 N-GRID data or a combination of both datasets. We investigate how well each model performs when trained with either corpus or with the union of both corpora. We find that the union model provides more generalizable de-identification across the two corpora. Unlike the union approach, where all data sets must be accessible to train a new model, our ensemble methods require only models trained on each data source. Our experimental results show that most ensemble systems can effectively integrate predictions from individual models to improve de-identification performance.

Background

Clinical text de-identification has been the focus of several shared tasks5,6,11,12. In the 2006 i2b2 de-identification task12, eight PII categories were defined: patient, doctor, hospital, location, age, date, phone number, and ID. The target PII categories for the 2014 i2b2 challenge5,6 and the 2016 N-GRID challenge11 included name, profession, location, age, date, contact, and ID. Each category had one or more subcategories. For example, the name category had patient, doctor, and username subcategories.

The publicly available challenge data facilitated further research into the identification of PII. The de-identification systems that performed well in the challenges used machine learning approaches or hybrid methods that combined machine learning and rule-based approaches. Wellner et al.13 trained a sequence labeling model using conditional random fields (CRF)14 with regular expressions to better extract certain PII categories such as dates. Their approach produced a phrase F1-score of 97.36% on the 2006 i2b2 test set. Yang and Garibaldi15 also employed the CRF algorithm to create their de-identification models. For some of the PII subcategories that were less commonly mentioned in the training set, they built a rule-based system using keywords and regular expression patterns. Their system yielded overall precision, recall, and F1-scores of 96.45%, 90.92%, and 93.60%, respectively, on the 2014 i2b2 challenge test data. As a good example of the transition toward neural network approaches, Liu et al.16 presented an ensemble that combined four de-identification components: a rule-based system, a CRF model, and two variants of Bi-LSTM models. Their best-scoring system produced an F1-score of 91.43% on the 2016 N-GRID test data.

Recently, context-dependent representations have been employed to improve text de-identification performance. Unlike context-independent (or static) word embeddings17,18, which ignore the context in which a word appears, context-dependent word embedding models19-21 can assign different vector representations to a word depending on its usage in context. Khin et al.22 and Lee et al.23 reported improved classification accuracy when the learning architecture included ELMo19 (embeddings from language models) representations. They used the general-domain pretrained ELMo models. Alsentzer et al.24 trained BERT20 (bidirectional encoder representations from transformers) models with the MIMIC-III (medical information mart for intensive care) v1.4 database25 containing various types of clinical notes. They reported that MIMIC-based BERT models were less beneficial than models trained on general texts or biomedical literature because the phrases identified as PII were replaced with tags in MIMIC texts.

As seen in the above-mentioned studies, clinical text de-identification has reached nearly mature performance, especially with Bi-LSTM algorithms. To further improve performance, we first apply ensemble pruning to text de-identification. The hill climbing search method26,27 is one of the most commonly used pruning approaches. It begins with an ensemble containing either none or all of the individual models and searches the space of sub-ensembles by greedy inclusion or exclusion of one model at a time. For this study, we apply backward elimination28,29, which shrinks the ensemble by excluding individual models without loss of accuracy.

Methods

We describe the PII categories to be identified and outline the distribution of labeled data. Then we explain how we train each individual de-identification model and present our ensemble-based methods.

Data description

We aim to extract PII phrases whose categories were defined in the 2014 i2b2 de-identification challenge. The same categories were used for the 2016 CEGS N-GRID shared task. Each PII instance was assigned one of seven categories. Table 1 shows the number of annotated instances of each category in the test set of each corpus. The column named “Instance Ratio” in Table 1 shows the ratio of the number of instances in the two data sets. Date, location, and name categories are common in both corpora. Compared to the 2014 i2b2 data, the 2016 N-GRID data contains more instances of the profession, age, and location categories, but fewer instances of the ID and contact categories. For example, the 2016 N-GRID data contains 33 ID instances, only about 5% of the number in the 2014 i2b2 data.

Table 1.

The numbers of PII instances and tokens in each data set.

2014 i2b2 Test 2016 N-GRID Test
Instances Tokens Instances Tokens Instance Ratio
Name 2,883 5,655 2,404 3,515 0.83
Profession 179 345 1,010 1,685 5.64
Location 1,813 3,046 3,771 5,749 2.08
Age 764 836 2,354 2,575 3.08
Date 4,980 19,784 3,821 10,035 0.77
Contact 218 664 126 623 0.58
ID 625 1,723 33 72 0.05
Total 11,462 32,053 13,519 24,254 1.18

Instance ratio = the ratio of the number of instances in two data sets (= 2016 N-GRID Test / 2014 i2b2 Test)

Table 2 shows the number of documents, sentences, and PII instances found in each corpus, as well as their average number per document. The column named “Average Ratio” in Table 2 shows the ratio of these average values between the two data sets. The 2016 N-GRID test set consists of 400 documents and 13,519 PII instances. It contains many more sentences (about 4.48 times) and tokens (about 2.78 times) per document than the 2014 i2b2 data. However, the prevalence of PII instances is relatively lower in the 2016 N-GRID data given the number of tokens. PII phrases in the 2016 N-GRID corpus are also shorter, averaging about 1.8 tokens (= 60.64 tokens / 33.80 instances) per PII phrase, whereas PII phrases in the 2014 i2b2 data average about 2.8 tokens (= 62.36 / 22.30).

Table 2.

Corpora characteristics.

2014 i2b2 Test 2016 N-GRID Test
Count Average Count Average Average Ratio
Documents 514 1.00 400 1.00 1.00
Sentences 22,047 42.89 76,877 192.19 4.48
Tokens 444,245 864.29 962,251 2405.63 2.78
PII instances 11,462 22.30 13,519 33.80 1.52
PII tokens 32,053 62.36 24,254 60.64 0.97

Average ratio = the ratio of average values between two data sets (= 2016 N-GRID Test / 2014 i2b2 Test)

In addition to quantitative analysis, we also examined how the two corpora differ in content. We assigned UMLS (Unified Medical Language System)30 Metathesaurus concepts to phrases in the text. We used MetaMap31 to recognize these concepts and compared their occurrences across the two corpora. We found that some concepts related to mental illness were more frequent in the 2016 N-GRID data, which is not a surprise considering that the corpus only includes psychiatric intake notes. For example, anxiety, suicidal-behavior, violent-behavior, psychosis, mood, and depression were relatively frequent. The 2014 i2b2 data covered a diverse set of medical conditions. Concepts associated with common diseases, such as hypertension, edema, blood-pressure, chest-pain, and diabetes, often appeared.
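As an illustration of this kind of content comparison, the sketch below ranks concepts by how much more frequent they are in one corpus than in the other, assuming MetaMap has already produced a flat list of concept names per corpus. The function name, the smoothing constant, and the input format are our own assumptions, not part of the original analysis.

```python
# Hedged sketch: compare UMLS concept frequencies between two corpora
# given concept name lists already extracted (e.g., by MetaMap).
from collections import Counter

def relative_frequency_ratio(concepts_a, concepts_b, smoothing=1.0):
    """Return concepts ranked by (relative freq in corpus A) / (relative freq in corpus B)."""
    count_a, count_b = Counter(concepts_a), Counter(concepts_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    ratios = {}
    for concept in set(count_a) | set(count_b):
        freq_a = (count_a[concept] + smoothing) / (total_a + smoothing)
        freq_b = (count_b[concept] + smoothing) / (total_b + smoothing)
        ratios[concept] = freq_a / freq_b
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# e.g., relative_frequency_ratio(ngrid_concepts, i2b2_concepts)[:10] would surface
# concepts such as "anxiety" or "psychosis" for the psychiatric intake corpus.
```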

De-identification models

As in our previous de-identification study2, we built 12 de-identification models using deep learning, shallow learning, and rule-based approaches. Our goal was to promote diversity among the ensemble members by employing various algorithms, even though some of them were trained on the same training data.

We consider two of them as external resources because their model or rules were learned on different corpora than the target data. One of the external resources is the MITRE identification scrubber toolkit32 (called ‘MIST’), a machine learning-based system trained on the 2006 i2b2 de-identification task data. The other system is the PhysioNet de-identification software package33 (called ‘PhysioNet deid’), a rule-based system tuned primarily to de-identify PII in nursing notes and discharge summaries.

For the remaining machine learning-based classifiers, we first trained the models with the 2014 i2b2 training data. We performed a 10-fold cross validation on the training set and tuned the parameters of each classifier to maximize the micro-averaged F1-score. We hypothesize that this optimization process with respect to the cross-validated performance measure can also improve performance on the test data. Then, we trained the new models by reusing the classifier configuration optimized for the 2014 i2b2 model. For each classification method, we created one model with the 2016 N-GRID data and one model with the union of both corpora.
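The tuning loop described above can be sketched as follows. This is a minimal illustration: train_model, predict, and micro_f1 are hypothetical helpers standing in for each toolkit's own training and scoring commands, and the example grid is illustrative rather than the grids actually searched.

```python
# Hedged sketch of 10-fold cross-validated hyperparameter selection that
# maximizes the mean micro-averaged F1-score.
from itertools import product
from statistics import mean

def tune_hyperparameters(documents, folds, param_grid, train_model, predict, micro_f1):
    """Pick the configuration with the best mean micro-averaged F1 over the folds."""
    best_config, best_score = None, -1.0
    for values in product(*param_grid.values()):
        config = dict(zip(param_grid.keys(), values))
        scores = []
        for held_out in folds:                       # folds = list of held-out document-id sets
            train = [d for d in documents if d.id not in held_out]
            test = [d for d in documents if d.id in held_out]
            model = train_model(train, **config)
            scores.append(micro_f1(predict(model, test), test))
        if mean(scores) > best_score:
            best_config, best_score = config, mean(scores)
    return best_config

# Example (values are assumptions, not the authors' grid):
# tune_hyperparameters(docs, folds, {"learning_rate": [0.005, 0.01], "epochs": [80, 100]}, ...)
```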

We developed three de-identification models that use deep learning: two versions of Bi-LSTM with a CRF layer (called ‘LSTM-CRF’) and one version of LSTM without CRF (called ‘LSTM’). We used pre-trained word embeddings for each word in all processed sentences. The 100-dimensional GloVe (global vectors for word representation)18 embeddings built with the 2014 dump of English Wikipedia were used for all Bi-LSTM models. In addition to these deep learning architectures with multiple hidden layers, we applied various classical algorithms based on shallow architectures. They can be divided into two approaches. One is based on structured learning algorithms, including CRF, MEMM (maximum entropy Markov models)34, SEARN, MIRA (margin infused relaxed algorithm)35, and structural SVM36. These algorithms have been widely used for named entity recognition (NER) tasks because of their ability to model interdependent output variables (i.e., the tags of the words arranged in a sentence). The other is token classification-based methods that classify each word independently. We implemented SVM7 and OGD (online gradient descent)37 classifiers. Table 3 shows the software libraries and hyperparameter configurations used for the learning algorithms. Readers can refer to our previous de-identification work2 for more details.

Table 3.

Description of the learning algorithms.

Method Description Software Hyperparameters
LSTM-CRF v.N38 Bi-LSTM with a CRF layer NeuroNER39 learning rate = 0.01, 80 epochs
LSTM-CRF v.L38 Bi-LSTM with a CRF layer Lample et al.38 learning rate = 0.005, 100 epochs
CRF14 Conditional random fields Wapiti40 ε (interval for stopping) = 0.004
MEMM34 Maximum entropy Markov models Wapiti40 ε (interval for stopping) = 0.004
SEARN8 Search-based structured prediction Vowpal Wabbit41 learning rate = 0.05, 20 epochs
MIRA35 Margin infused relaxed algorithm Miralium42 max. weight updates = 0.1, 30 epochs
LSTM v.L38 Bi-LSTM without CRF Lample et al.38 learning rate = 0.05, 100 epochs
SVM7 Support vector machines LIBLINEAR43 negative example weight = 0.01
OGD37 Online gradient descent Vowpal Wabbit41 learning rate = 0.1, 10 epochs
Struct. SVM36 Structural SVM MITIE44 default settings
MIST32 MITRE identification scrubber MIST32 default settings
PhysioNet deid33 PhysioNet de-identification PhysioNet deid33 default settings

Concept-based ensemble methods

We briefly introduce the ensemble methods applied to our previous de-identification research and describe the newly proposed methods for this study. First, we describe the concept-based (or phrase-based) ensemble methods. Given a set of PII instances as training examples, each method builds a model that assigns test examples to True (matched the reference annotation) or False labels.

Voting:

We applied the same voting strategy that was used effectively in our previous concept extraction tasks2,45-47. It collects the PII terms produced by a set of de-identification models and outputs all terms whose number of votes meets the voting threshold. When two PII terms overlapped, we chose the term that received more votes. For terms with the same number of votes, we chose the term produced by the model that performed better during cross-validation. These tie-break rules were also applied to the pruned voting ensemble and the decision template method3 described below.
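A minimal sketch of this voting rule follows, assuming each model's output is a set of (start, end, category) spans and that a hypothetical cv_rank dictionary (model name to rank, 0 being best) encodes the cross-validation ordering used for tie-breaking.

```python
# Hedged sketch of threshold-based voting with overlap tie-breaking.
from collections import Counter

def overlaps(a, b):
    """Two (start, end, category) spans overlap if their character ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

def vote(model_outputs, threshold, cv_rank):
    """model_outputs: dict model_name -> set of (start, end, category) predictions."""
    votes, best_rank = Counter(), {}
    for name, spans in model_outputs.items():
        for span in spans:
            votes[span] += 1
            best_rank[span] = min(best_rank.get(span, len(cv_rank)), cv_rank[name])
    kept = []
    # prefer more votes, then the span proposed by the better cross-validated model
    for span in sorted(votes, key=lambda s: (-votes[s], best_rank[s])):
        if votes[span] >= threshold and not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return kept
```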

Voting pruned:

Ensemble pruning aims to reduce the number of ensemble members while achieving similar or better performance than the ensemble of all members. Unlike stacked ensembles, which implicitly assign a weight to each classifier, ensemble pruning completely excludes individual classifiers. The pruned ensemble therefore saves the computational time or physical resources that would otherwise be consumed by the excluded members.

We implemented a new ensemble pruning method that jointly selects the voting threshold and a subset of de-identification models. The procedure involved the following steps: for each voting threshold, we began with an ensemble containing all individual models. At each step, we greedily excluded the one model whose removal yielded the highest F1-score on the training data. These steps continued until the performance of the pruned voting ensemble no longer improved. This procedure was repeated for each voting threshold to obtain a subset of de-identification models, and we then chose the voting threshold that yielded the best performance. For example, from the twelve 2014 i2b2 models, the pruned ensemble included four models and only considered PII terms labeled by at least two models.
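The backward-elimination loop for a fixed voting threshold can be sketched as below, assuming a hypothetical ensemble_f1 scorer that runs the voting ensemble of the given members and evaluates it on the training data.

```python
# Hedged sketch of greedy backward elimination for one voting threshold.
def prune_by_backward_elimination(members, threshold, ensemble_f1):
    current = list(members)
    current_score = ensemble_f1(current, threshold)
    while len(current) > 1:
        # score every sub-ensemble obtained by dropping exactly one member
        scored = []
        for removed in current:
            subset = [m for m in current if m != removed]
            scored.append((ensemble_f1(subset, threshold), subset))
        best_score, best_subset = max(scored, key=lambda s: s[0])
        if best_score <= current_score:
            break                      # stop once excluding a member no longer improves F1
        current, current_score = best_subset, best_score
    return current, current_score

# An outer loop (not shown) repeats this for each voting threshold and keeps the
# threshold/subset pair with the best cross-validated F1-score.
```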

Decision Template Method (DTM)3:

For each PII instance, a decision profile was created in the same way as in our previous study2. As we added more de-identification models from two data sets, the decision profile matrix included 22 rows (corresponding to 22 de-identification models) and two columns for binary decisions. To each row, [1, 0] was assigned when a model identified the PII instance, [0, 1] otherwise. Then, we aggregated the decision profiles to construct a decision template (DT) for each class label (True or False). We used Hamming distance to find the most similar DT for each test instance.
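A minimal sketch of the decision template computation follows, assuming the decision profiles are stored as an (instances × models × 2) array; applying a Hamming-style (element-wise absolute difference) distance to the averaged templates is our assumption about the exact metric used.

```python
# Hedged sketch of the decision template method over binary decision profiles.
import numpy as np

def build_templates(profiles, labels):
    """profiles: (n_instances, n_models, 2) array where row i of a profile is
    [1, 0] if model i identified the PII instance and [0, 1] otherwise;
    labels: list of True/False reference labels."""
    labels = np.asarray(labels)
    return {cls: profiles[labels == cls].mean(axis=0) for cls in (True, False)}

def classify(profile, templates):
    """Assign the class whose decision template is closest to the test profile."""
    distances = {cls: np.abs(profile - template).sum() for cls, template in templates.items()}
    return min(distances, key=distances.get)
```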

Stacked SVM:

Similarly to Voting and DTM, we reused the SVM-based stacked ensemble from our previous application. We defined features to consider the degree of agreement and consistency between de-identification models. For example, we counted how many different models produced a predicted PII. We used the LIBLINEAR software package43 for SVM classification with a linear kernel. When two PII instances overlapped, we chose the one with the higher confidence score produced by the SVM metaclassifier.
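A simplified sketch of the concept-level metaclassifier is shown below, using scikit-learn's LinearSVC (a LIBLINEAR wrapper). The feature set shown, per-model agreement indicators plus a vote count, only illustrates the kind of agreement and consistency features described above and is not the authors' exact configuration.

```python
# Hedged sketch of the concept-level stacked SVM over candidate PII phrases.
import numpy as np
from sklearn.svm import LinearSVC

def candidate_features(candidate, model_outputs, model_names):
    """One row per candidate: per-model agreement indicators plus the vote count."""
    indicators = [1.0 if candidate in model_outputs[name] else 0.0 for name in model_names]
    return indicators + [sum(indicators)]

def train_meta_svm(candidates, model_outputs, model_names, gold_labels):
    X = np.array([candidate_features(c, model_outputs, model_names) for c in candidates])
    y = np.array(gold_labels)          # True if the candidate matches a reference annotation
    meta = LinearSVC(C=1.0)            # linear kernel via LIBLINEAR
    meta.fit(X, y)
    return meta

# At test time, meta.decision_function(X) supplies a confidence score that can be
# used to resolve overlapping candidate phrases (keep the higher-scoring one).
```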

Sequence labelling-based ensemble methods

For the newly introduced sequence labelling-based stacked generalization methods, we started by tokenizing the input text. For each token, we then encoded the prediction of each model using BIO tags. Finally, we labeled each token with a PII category. We only used the outputs of the individual de-identification models as input. No tie-break policy was needed because these sequence labeling-based methods allow only one tag per word and do not generate overlapping PII annotations.
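The token-level input encoding can be sketched as follows; the data structures and offset conventions are assumptions made for illustration.

```python
# Hedged sketch: rewrite one member model's predicted spans as per-token BIO tags.
def encode_bio(tokens, predicted_spans):
    """tokens: list of (start, end) character offsets for one sentence;
    predicted_spans: list of (start, end, category) predicted by one model."""
    tags = ["O"] * len(tokens)
    for start, end, category in predicted_spans:
        inside = [i for i, (ts, te) in enumerate(tokens) if ts >= start and te <= end]
        for position, i in enumerate(inside):
            tags[i] = ("B-" if position == 0 else "I-") + category
    return tags

# One categorical feature per member model per token:
# features[i] = [encode_bio(tokens, outputs[m])[i] for m in model_names]
```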

Stacked SEARN:

Unlike the SEARN model used as one of the individual de-identification models, the SEARN-based stacked ensembles were built with features defined solely from the outputs of each individual model. Each model assigned a PII tag to each word token. We then used the predicted tags of the current word, the two preceding words, and the two following words. We used Vowpal Wabbit41, a fast out-of-core online learning system, and took advantage of its built-in n-gram feature generation to create trigram features from the sequence of predicted tags. We set the number of training iterations to 20 and the initial learning rate to 0.025.
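A sketch of the tag-window feature extraction is given below. In the actual system these features are passed to Vowpal Wabbit, whose n-gram option generates the trigram features, so the feature naming and padding conventions here are illustrative only.

```python
# Hedged sketch of the +/-2 tag window used as stacked SEARN features.
def tag_window_features(tag_sequences, position, window=2):
    """tag_sequences: dict model_name -> list of predicted BIO tags for one sentence."""
    features = []
    for name, tags in tag_sequences.items():
        for offset in range(-window, window + 1):
            index = position + offset
            tag = tags[index] if 0 <= index < len(tags) else "PAD"
            features.append(f"{name}_{offset:+d}={tag}")
    return features

# Example:
# tag_window_features({"crf": ["O", "B-DATE", "I-DATE"]}, 1)
# -> ['crf_-2=PAD', 'crf_-1=O', 'crf_+0=B-DATE', 'crf_+1=I-DATE', 'crf_+2=PAD']
```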

Stacked LSTM:

Finally, we used a Bi-LSTM network with a CRF classifier as the last layer. The inputs of the network contained only the outputs of each individual de-identification model; no word or character-based representations were used. We used the NER system of Reimers and Gurevych48 to train the Bi-LSTM models. We created an embedding that maps the output of each individual model into a 10-dimensional vector. The concatenation of these features was fed into the LSTM layers. We used two stacked bidirectional LSTM hidden layers, each with 100 recurrent units. We set the learning rate to 0.001 and trained the model with the RMSProp gradient descent method49 for 25 epochs, applying 25% dropout to the recurrent units.
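A hedged PyTorch sketch of this architecture is shown below (the original system used Reimers and Gurevych's implementation48); the CRF output layer and the RMSProp training loop are omitted for brevity, and the inter-layer dropout here only approximates the recurrent dropout used in the paper.

```python
# Hedged sketch of the stacked Bi-LSTM meta-classifier over member-model tags.
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    def __init__(self, num_models, tag_vocab_size, num_labels,
                 embed_dim=10, hidden_size=100, dropout=0.25):
        super().__init__()
        # one embedding table per member model (all sharing the same tag vocabulary)
        self.embeddings = nn.ModuleList(
            [nn.Embedding(tag_vocab_size, embed_dim) for _ in range(num_models)])
        self.lstm = nn.LSTM(input_size=num_models * embed_dim, hidden_size=hidden_size,
                            num_layers=2, bidirectional=True, batch_first=True,
                            dropout=dropout)   # inter-layer dropout (approximation)
        self.output = nn.Linear(2 * hidden_size, num_labels)  # a CRF layer would sit on top

    def forward(self, tag_ids):
        # tag_ids: (batch, seq_len, num_models) integer-encoded predicted tags
        embedded = torch.cat(
            [emb(tag_ids[:, :, i]) for i, emb in enumerate(self.embeddings)], dim=-1)
        hidden, _ = self.lstm(embedded)
        return self.output(hidden)   # per-token label scores

# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)  # 25 epochs in the paper
```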

Similarly to the individual models, each ensemble classifier was trained to optimize the micro-averaged F1-score. Their thresholds (e.g., voting threshold in the voting ensemble) and hyperparameter values (e.g., dropout rate in the stacked LSTM) were determined after 10-fold cross validation with the training set.

Results

We present experimental results for each de-identification model and for the six ensemble methods. We measured recall, precision, and the F1-score (harmonic mean of recall and precision with equal weight). We used the 2016 CEGS N-GRID shared task11 evaluation script to calculate performance measures. Our models were evaluated on the 2014 i2b2/UTHealth de-identification challenge corpus (‘2014 i2b2’) and the 2016 CEGS N-GRID shared task11 corpus (‘2016 N-GRID’).
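For reference, strict-entity micro-averaged scoring can be sketched as follows; this mirrors, but is not, the official challenge evaluation script.

```python
# Hedged sketch of micro-averaged precision/recall/F1 with strict entity matching:
# a prediction counts as correct only if span and PII category match exactly.
def micro_prf(predicted, reference):
    """predicted, reference: lists (one per document) of sets of (start, end, category)."""
    tp = sum(len(p & r) for p, r in zip(predicted, reference))
    n_pred = sum(len(p) for p in predicted)
    n_ref = sum(len(r) for r in reference)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```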

Performance of individual de-identification methods

For each machine learning-based de-identification method, we created three different models: one trained on the 2014 i2b2 training data (‘2014 model’), one trained on the 2016 N-GRID data (‘2016 model’), and one trained on the union of both corpora (‘2014 + 2016 model’). Table 4 shows the performance of the individual de-identification models. We report the micro-averaged F1-score with strict entity matching, where both the text span and the PII category must exactly match the reference annotation. The left and right halves of the table present the results on the 2014 i2b2 and 2016 N-GRID test sets, respectively. For each test set, we show how well the 2014, 2016, and 2014 + 2016 models performed. The best results are shown in bold for each trained model and test configuration. For the external de-identification systems (MIST and PhysioNet deid), we report one result for each data set because they did not use the training data.

Table 4.

Accuracy (F1-scores) of individual de-identification methods on the 2014 and 2016 test sets.

Method 2014 i2b2 Test (%) 2016 N-GRID Test (%)
2014 model 2016 model 2014 + 2016 model 2014 model 2016 model 2014 + 2016 model
LSTM-CRF v.N 94.43 73.21 95.29 (+0.86) 77.80 90.82 91.14 (+0.32)
LSTM-CRF v.L 94.60 70.65 95.51 (+0.91) 71.91 91.07 90.40 (–0.67)
CRF 94.15 73.80 94.90 (+0.75) 73.33 88.92 89.14 (+0.22)
MEMM 93.78 74.47 94.28 (+0.50) 72.01 88.19 88.46 (+0.27)
SEARN 93.08 71.89 93.49 (+0.41) 74.68 89.38 89.61 (+0.23)
MIRA 93.72 73.79 93.88 (+0.16) 74.27 88.80 89.41 (+0.61)
LSTM v.L 92.83 67.76 94.62 (+1.79) 66.89 90.18 89.75 (–0.43)
SVM 93.17 72.39 93.77 (+0.60) 75.15 89.16 89.23 (+0.07)
OGD 93.04 70.56 93.38 (+0.34) 74.67 89.27 89.53 (+0.26)
Struct. SVM 79.23 54.43 77.02 (–2.21) 59.47 76.79 73.45 (–3.34)
MIST 54.21 28.70
PhysioNet deid 46.73 23.40

Overall, F1-scores on the 2016 test data were about 4% lower than those on the 2014 test data, presumably because of the relatively low prevalence of PII instances in the 2016 N-GRID corpus. Deep learning approaches outperformed shallow learning methods, and Bi-LSTM with a CRF layer outperformed Bi-LSTM without CRF on all data sets. Among the shallow learning methods, CRF and SEARN produced the highest F1-scores. In general, methods based on structured learning algorithms (e.g., CRF and MEMM) performed better than token classification-based methods on the 2014 i2b2 data. On the 2016 N-GRID data, however, token classification-based methods such as SVM and OGD outperformed the CRF model. If we had optimized the hyperparameters on the 2016 N-GRID training data instead of reusing the values optimized for the 2014 i2b2 data, we would expect the results of the 2016 N-GRID models to correspond to those of the 2014 i2b2 models. The structural SVM models (struct. SVM) did not perform well because they were trained with a predefined feature set designed for general-domain NER tasks; with the same features defined for the other classifiers, the struct. SVM would likely achieve better results.

Each method performed much better when the model was trained with its own training corpus rather than with the other corpus. For example, on the 2014 i2b2 Test, the LSTM-CRF v.L with the 2014 i2b2 yielded an F1-score of 94.60% but only 70.65% with the 2016 N-GRID. Exploiting another corpus provided added value. Most models trained on the union of two corpora achieved better performance on the target corpus. The numbers in parentheses in the columns named ‘2014+2016 model’ indicate the difference before and after merging two datasets. LSTM v.L and MIRA models increased F1-scores by 1.79% (= 94.62% − 92.83%) on the 2014 i2b2 and 0.61% (= 89.41% − 88.80%) on the 2016 N-GRID, respectively. Importantly, the 2014+2016 models offered better generalization across two corpora. They substantially outperformed the models trained with the other corpus.

Performance of ensemble methods

We evaluated three different ensemble configurations: (1) ensembles consisting of de-identification models trained on the 2014 i2b2 data (‘2014 models’), (2) ensembles of models trained on the 2016 N-GRID data (‘2016 models’), and (3) ensembles combining all models from (1) and (2) (‘2014 + 2016 models’). The 2014 + 2016 ensembles included 22 different classifiers (10 models from the 2014 i2b2 data, 10 models from the 2016 N-GRID data, and two external de-identification systems).

Table 5 shows the performance of these ensembles. As we did in Table 4, we report the micro-averaged F1-score with strict entity matching. For the evaluation of each test set, the corresponding training data was used for training. For example, for the 2014 i2b2 test set, the voting threshold of the 2014 + 2016 voting models was determined with the outputs of each model for the 2014 i2b2 training data. Using the outputs of each model as features, the 2014 + 2016 stacked LSTM was trained with the sentences contained in the 2014 i2b2 training data.

Table 5.

Accuracy (F1-scores) of ensemble methods on the 2014 and 2016 test sets.

Method 2014 i2b2 Test (%) 2016 N-GRID Test (%)
2014 models 2014 + 2016 models 2016 models 2014 + 2016 models
Voting 95.38 94.53 (–0.85) 92.09 90.60 (–1.49)
Voting pruned *95.60 *95.93 (+0.33) *92.52 *92.57 (+0.05)
DTM 95.42 95.31 (–0.11) 92.11 90.91 (–1.20)
Stacked SVM *95.70 *95.85 (+0.15) *92.64 *92.69 (+0.05)
Stacked SEARN 95.36 *95.86 (+0.50) 92.38 92.28 (–0.10)
Stacked LSTM 95.32 95.66 (+0.34) 92.10 92.23 (+0.13)

For each trained model and test configuration, the best results appear in boldface. Results that are not significantly different from the best result at the 95% significance level are preceded by an asterisk (*). For example, the stacked SVM with the 2014 models produced the highest F1-score of 95.70% on the 2014 i2b2 test set, which is significantly better than all other ensembles and individual models except the pruned voting ensemble. For statistical significance testing, we used paired t-tests to compare the best performing system with the other methods. The pruned voting models outperformed the corresponding voting ensembles that used all models. A more detailed comparison is presented in the Discussion.

Adding models trained with different corpora contributed to the performance of the ensemble methods. The numbers in parentheses in the ‘2014 + 2016 models’ columns indicate the difference between using the models trained on a single dataset and using the models from both data sources; the higher the gain, the greater the benefit from the other data source. The voting and DTM methods of the 2014 + 2016 models are exceptions because they were unable to efficiently integrate the lower-performing models trained on the other corpus. Voting pruned, stacked SVM, and stacked LSTM achieved better results on both test sets. The pruned voting and stacked SVM models reached the highest F1-scores of 95.93% on the 2014 i2b2 data and 92.69% on the 2016 N-GRID data, respectively. These ensembles were more beneficial for the 2016 N-GRID data, and they produced higher F1-scores than the best individual models (95.51% increased to 95.93% for the 2014 i2b2; 91.14% increased to 92.69% for the 2016 N-GRID). Sequence labelling-based stacked generalization (stacked SEARN and stacked LSTM) yielded lower F1-scores than the stacked SVM, but comparable results when all models were included.

Discussion

We further discuss the results of the SVM-based stacked ensemble (stacked SVM) with both the 2014 and 2016 models on the 2016 N-GRID test set. Table 6 shows the stacked SVM performance for each category. The results on the right half of the table were calculated with binary token matching, where each PII term is evaluated on a per-token basis, regardless of the category. This ensemble yielded overall precision, recall, and F1-scores of 98.57%, 92.86%, and 95.63%, respectively, with binary token matching. For the date, age, name, and contact categories, the stacked SVM reached over 90% F1-score with strict entity matching. Although contact PII categories (e.g., phone and e-mail) were less common, they were identified relatively accurately because they are usually recorded in standardized formats. More work is needed to improve extraction of the profession and location categories, which are written less formally and with a richer vocabulary.

Table 6.

Results of SVM-based stacked ensemble (stacked SVM) on the 2016 N-GRID test set for each category.

Category Count Strict entity (%) Binary token (%)
Precision Recall F1-score Precision Recall F1-score
Name 2,404 95.98 93.43 94.69 98.08 96.26 97.16
Profession 1,010 90.27 72.57 80.46 94.70 76.87 84.86
Location 3,771 92.54 82.92 87.47 97.06 85.71 91.03
Age 2,354 97.75 95.75 96.74 97.99 95.01 96.48
Date 3,821 97.54 96.57 97.05 99.26 98.50 98.88
Contact 126 94.78 86.51 90.46 99.13 90.93 94.85
ID 33 79.17 57.58 66.67 97.14 70.83 81.93
Total 13,519 95.45 90.08 92.69 98.57 92.86 95.63

From the outputs of the stacked SVM, we analyzed the examples whose presence was successfully identified in the text but whose category was not correctly assigned. This type of error was most common in the categories shown in Table 7. For example, organization is one of the subcategories of location, of which 22, 14, and 12 were misclassified as hospital, city, and profession subcategories, respectively. The last column shows some challenging examples that may be used interchangeably as multiple subcategories.

Table 7.

The five most common subcategories of the misclassification errors.

Reference subcategory Classified as Examples
Organization Hospital (22) City (14) Profession (12) Edgemore House, Nestle, Facebook
Patient Doctor (19) Hospital (11) City (3) Galloway, KC, Harriet
Hospital City (12) Doctor (10) Organization (6) Odessa, Lewis Clark, GBMC
Country City (13) State (6) Organization (3) Jordan, Georgia, PNG
City Hospital (11) Organization (5) Country (3) Fountain Hills, Vallejo, Villa Rica

Finally, we compare the results of the voting ensembles with and without pruning. Ensemble pruning achieved higher performance than the ensemble using all models, and the pruning method was effective even when models trained with different datasets participated in the voting ensemble. We examined how well the pruned ensemble performed at each voting threshold. Figure 1 shows the F1-scores of the pruned voting ensembles and of the ensembles with all 22 models; the gap between the two curves indicates their F1-score difference. The graphs on the left and right represent the results on the 2014 i2b2 and 2016 N-GRID test sets, respectively. Note that the y-axis scale in each graph does not start at zero, to focus on the value ranges of interest.

Figure 1.

Accuracy (F1-scores) of pruned ensembles and ensembles with all members.

Results for voting thresholds ranging from 1 to 12 are presented; no pruning occurred at the higher thresholds. The solid line with circle markers represents the pruned voting method. The pruned voting ensembles produced significantly higher F1-scores (at the 95% significance level) than the corresponding ensembles that included all members on the 2014 i2b2 and 2016 N-GRID test sets up to voting thresholds of 9 and 11, respectively. When the threshold was set to three and 14 models were excluded from voting, the pruned ensemble with eight members* achieved the highest F1-score (95.93%) on the 2014 i2b2 data. Similarly, at the same voting threshold, the pruned ensemble with nine members† obtained an F1-score of 92.57%, with no significant difference from the highest score on the 2016 N-GRID data.

Conclusion

We explored various types of ensemble learning methods and confirmed that they can improve de-identification of EHR narratives. Experimental results show that two concept-based ensemble methods, the stacked SVM and the pruned voting ensemble, consistently performed better than the individual classifiers and the other ensemble approaches. Our pruning method can construct an efficient ensemble by selecting an optimal subset of de-identification models. We also sought to better utilize the two clinical text corpora from different sources and found that exploiting both corpora improved generalization. Most ensemble systems consisting of models trained on the two datasets yielded better results than ensembles whose models were trained on either dataset alone.

Our ensemble approaches can provide a convenient solution even in situations where texts or labeled annotations are not accessible and only models or components can be used. Beyond evaluation on corpora that share the same PII category definitions, our future research includes generalization across data sets annotated with heterogeneous PII categories, for example, the 2006 i2b2 de-identification challenge12 and PhysioNet50 corpora. Another promising direction for future work is to create Bi-LSTM models with dependency-based word embeddings51,52 or contextualized word representations19-21.

Acknowledgments

This project was partly supported by the National Institute of General Medical Sciences (R41GM116479). De-identified clinical records used in this research were provided by the 2014 i2b2/UTHealth NLP Shared Task (funded by NIH NLM 2U54LM008748 and NIH NLM 5R13LM011411) and the 2016 CEGS N-GRID challenge (funded by NIH P50 MH106933 and NIH 4R13LM011411).

Footnotes

* Six of the 2014 models (LSTM-CRF v.N, LSTM-CRF v.L, CRF, SEARN, LSTM v.L, and struct. SVM) and two of the 2016 models (LSTM-CRF v.N and LSTM-CRF v.L).

† Seven of the 2016 models (LSTM-CRF v.N, LSTM-CRF v.L, CRF, SEARN, LSTM v.L, OGD, and struct. SVM), one of the 2014 models (LSTM-CRF v.N), and PhysioNet deid.


References

1. HIPAA Privacy Rule. 45 CFR Part 160, Part 164 (A, E). U.S. Department of Health and Human Services; 2008.
2. Kim Y, Heider P, Meystre SM. Ensemble-based methods to improve de-identification of electronic health record narratives. AMIA Annu Symp Proc. 2018:663–72.
3. Kuncheva LI, Bezdek JC, Duin RP. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition. 2001;34(2):299–314.
4. Wolpert DH. Stacked generalization. Neural Netw. 1992;5:241–59.
5. Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task track 1. J Biomed Inform. 2015;58:S11–S19. doi: 10.1016/j.jbi.2015.06.007.
6. Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015;58:S20–S29. doi: 10.1016/j.jbi.2015.07.020.
7. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–97.
8. Daumé H, Langford J, Marcu D. Search-based structured prediction. Machine Learning. 2009;75(3):297–325.
9. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735.
10. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.
11. Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: overview of 2016 CEGS N-GRID shared tasks track 1. J Biomed Inform. 2017;75:S4–S18. doi: 10.1016/j.jbi.2017.06.011.
12. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. JAMIA. 2007;14(5):550–63. doi: 10.1197/jamia.M2444.
13. Wellner B, Huyck M, Mardis S, et al. Rapidly retargetable approaches to de-identification in medical records. JAMIA. 2007;14(5):564–73. doi: 10.1197/jamia.M2435.
14. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th ICML. 2001:282–89.
15. Yang H, Garibaldi JM. Automatic detection of protected health information from clinic narratives. J Biomed Inform. 2015;58:S30–S38. doi: 10.1016/j.jbi.2015.06.015.
16. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform. 2017;75:S34–S42. doi: 10.1016/j.jbi.2017.05.023.
17. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
18. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. Proceedings of EMNLP 2014. 2014:1532–43.
19. Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations. Proceedings of the 2018 NAACL-HLT. 2018:2227–37.
20. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 NAACL-HLT. 2019:4171–86.
21. Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proceedings of the 27th COLING. 2018:1638–49.
22. Khin K, Burckhardt P, Padman R. A deep learning architecture for de-identification of patient notes: implementation and evaluation. arXiv preprint arXiv:1810.01570. 2018.
23. Lee K, Filannino M, Uzuner Ö. An empirical test of GRUs and deep contextualized word representations on de-identification. Stud Health Technol Inform. 2019;264:218–22. doi: 10.3233/SHTI190215.
24. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019:72–78.
25. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035. doi: 10.1038/sdata.2016.35.
26. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP. Ensemble diversity measures and their application to thinning. Information Fusion. 2005;6(1):49–62.
27. Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning. 2000;40(2):139–57.
28. Banfield R, Hall L, Bowyer K, Kegelmeyer WP. Ensemble diversity measures and their application to thinning. Information Fusion. 2005;6:49–62.
29. Partalas I, Tsoumakas G, Vlahavas IP. Focused ensemble selection: a diversity-based method for greedy ensemble selection. Proceedings of the 2008 ECAI. 2008:117–21.
30. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):D267–70. doi: 10.1093/nar/gkh061.
31. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. JAMIA. 2010;17(3):229–36. doi: 10.1136/jamia.2009.002733.
32. Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE identification scrubber toolkit: design, training, and assessment. Int J Med Inform. 2010;79(12):849–59. doi: 10.1016/j.ijmedinf.2010.09.007.
33. Neamatullah I, Douglass MM, Lehman L-wH, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32. doi: 10.1186/1472-6947-8-32.
34. McCallum A, Freitag D, Pereira FC. Maximum entropy Markov models for information extraction and segmentation. Proceedings of the 17th ICML. 2000:591–98.
35. Crammer K, Singer Y. Ultraconservative online algorithms for multiclass problems. J Mach Learn Res. 2003;3(Jan):951–91.
36. Joachims T, Finley T, Yu C-NJ. Cutting-plane training of structural SVMs. Machine Learning. 2009;77(1):27–59.
37. Bottou L. Online learning and stochastic approximations. On-line Learning in Neural Networks. 1998:9–42.
38. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. Proceedings of the 2016 NAACL-HLT. 2016:260–70.
39. Dernoncourt F, Lee JY, Szolovits P. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. Proceedings of the 2017 EMNLP: System Demonstrations. 2017:97–102.
40. Lavergne T, Cappé O, Yvon F. Practical very large scale CRFs. Proceedings of the 48th ACL. 2010:504–13.
41. Langford J, Li L, Strehl A. Vowpal Wabbit online learning project: technical report. http://hunch.net, 2007.
42. Favre B. Java implementation of MIRA learning for sequences. https://github.com/benob/miralium.
43. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9(Aug):1871–74.
44. King DE. MITIE: library and tools for information extraction. https://github.com/mit-nlp/MITIE.
45. Kim Y, Riloff E. Stacked generalization for medical concept extraction from clinical notes. Proceedings of BioNLP 2015. 2015:61–70.
46. Kim Y, Riloff E, Hurdle JF. A study of concept extraction across different types of clinical notes. AMIA Annual Symposium Proceedings. 2015:737–46.
47. Kim Y, Meystre S. A study of medical problem extraction for better disease management. Stud Health Technol Inform. 2019;264:193–97. doi: 10.3233/SHTI190210.
48. Reimers N, Gurevych I. Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. Proceedings of the 2017 EMNLP. 2017:338–48.
49. Tieleman T, Hinton G. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012;4(2):26–31.
50. Douglass M, Clifford GD, Reisner A, Moody GB, Mark RG. Computer-assisted de-identification of free text in the MIMIC II database. Computers in Cardiology. IEEE; 2004:341–44.
51. Levy O, Goldberg Y. Dependency-based word embeddings. Proceedings of the 52nd ACL. 2014:302–08.
52. Komninos A, Manandhar S. Dependency based embeddings for sentence classification tasks. Proceedings of the 2016 NAACL-HLT. 2016:1490–500.
