AMIA Annual Symposium Proceedings. 2021 Jan 25;2020:648–657.

Comparative Study of Various Approaches for Ensemble-based De-identification of Electronic Health Record Narratives

Youngjun Kim 1, Paul M Heider 1, Stéphane M Meystre 1,2
PMCID: PMC8075417  PMID: 33936439

Abstract

De-identification of electronic health record narratives is a fundamental natural language processing task for protecting patient information privacy. We explore different types of ensemble learning methods to improve clinical text de-identification and present two ensemble-based approaches for combining multiple predictive models. The first method selects an optimal subset of de-identification models by greedy exclusion. This ensemble pruning allows one to save computational time or physical resources while achieving similar or better performance than the ensemble of all members. The second method uses a sequence of words to train a sequential model. For this sequence labelling-based stacked ensemble, we employ search-based structured prediction and bidirectional long short-term memory algorithms. We create ensembles consisting of de-identification models trained on two clinical text corpora. Experimental results show that our ensemble systems can effectively integrate predictions from individual models and offer better generalization across two different corpora.

Introduction

Ensemble learning is a meta-algorithm that uses the outputs of individual classifiers to reduce their errors and improve accuracy. The approach has attracted natural language processing (NLP) researchers as a convenient and effective way to combine multiple predictive models. In general, a classifier ensemble has a two-tier learning structure in which the first layer consists of a set of individual classifiers and the second layer serves to combine their outputs. Although the performance of individual models in the first layer is of primary importance, more thorough consideration should be given to effective model integration by the metaclassifier in the second layer.

Our focus is on ensemble-based methods for electronic health record (EHR) narrative text de-identification. De-identification involves detecting and hiding or removing pre-defined categories of identifiers (personally identifiable information, PII) such as social security numbers or biometric information. In practice, eighteen categories of PII are listed in the Health Insurance Portability and Accountability Act (HIPAA) privacy rule ‘safe harbor’ method.1 De-identification has been considered a fundamental task in clinical NLP and plays an important role in protecting patient information privacy while making clinical data shareable with other research communities, especially as EHRs become more prevalent.

In our previous research on EHR narrative de-identification2, we applied a variety of information extraction (IE) methods, including deep learning, shallow learning, and rule-based approaches. We analyzed de-identification model diversity by grouping similar models based on their outputs. To combine multiple de-identification models, we presented three ensemble methods: voting, the decision template method (DTM)3, and stacked generalization4. When evaluated on the 2014 i2b2 (informatics for integrating biology and the bedside) de-identification challenge corpus5,6 (called ‘2014 i2b2’ hereafter), we showed that ensemble methods consisting of a diverse set of PII extraction models could improve EHR de-identification.2

In this subsequent research, we advance our approach in two directions. First, we explore various types of ensemble architectures to examine whether they can produce more accurate predictions. We employ additional ensemble learning methods beyond the three approaches we already experimented with. We propose an ensemble pruning method that automatically determines the voting threshold (i.e., the number of members voting for one annotation) and the optimal combination of de-identification models. In our previous study2, we created a stacked learning ensemble by training an SVM (support vector machine)7 classifier that used each PII concept from the individual classifiers as a training example. Besides this concept-level SVM model, we propose a sequence labelling-based (or word token-level) stacked ensemble that uses a sequence of words to train the sequential model. We implement two stacked generalization methods based on search-based structured prediction (SEARN)8 and bidirectional long short-term memory (Bi-LSTM)9,10, a variant of the recurrent neural network (RNN) algorithm. We compare the performance of concept-based ensemble methods against sequence labelling-based stacked learning ensembles.

Second, we aim to increase the generalization of de-identification by combining multiple datasets. We exploit another text collection created for the 2016 CEGS (centers of excellence in genomic science) N-GRID (neuropsychiatric genome-scale and RDoC-individualized domains) shared task11 (called ‘2016 N-GRID’) for training and evaluation of de-identification models. Extending our previous work, which used only the 2014 i2b2 data, we build de-identification models using the 2016 N-GRID data or a combination of both datasets. We investigate how well each model performs when trained with either corpus or with the union of both corpora. We find that the union model provides more generalizable de-identification across the two corpora. Unlike the union approach, where all data sets must be accessible to train a new model, our ensemble methods require only models trained on each data source. Our experimental results show that most ensemble systems can effectively integrate predictions from individual models to improve de-identification performance.

Background

Clinical text de-identification has been the focus of several shared tasks5,6,11,12. In the 2006 i2b2 de-identification task12, eight PII categories were defined: patient, doctor, hospital, location, age, date, phone number, and ID. The target PII categories for the 2014 i2b2 challenge5,6 and the 2016 N-GRID challenge11 included name, profession, location, age, date, contact, and ID. Each category had one or more subcategories. For example, the name category had patient, doctor, and username subcategories.

The publicly available challenge data facilitated further research into the identification of PII. The de-identification systems that performed well in the challenges used machine learning approaches or hybrid methods that combined machine learning and rule-based approaches. Wellner et al.13 trained a sequence labeling model using conditional random fields (CRF)14 with regular expressions to better extract certain PII categories such as dates. Their approach produced a phrase F1-score of 97.36% on the 2006 i2b2 test set. Yang and Garibaldi15 also employed the CRF algorithm to create their de-identification models. For some of the PII subcategories that were less commonly mentioned in the training set, they built a rule-based system using keywords and regular expression patterns. Their system yielded overall precision, recall, and F1-scores of 96.45%, 90.92%, and 93.60%, respectively, on the 2014 i2b2 challenge test data. As a good example of the transition toward neural network approaches, Liu et al.16 presented an ensemble that combined four de-identification components: a rule-based system, a CRF model, and two variants of Bi-LSTM models. Their best-scoring system produced an F1-score of 91.43% on the 2016 N-GRID test data.

Recently, context-dependent representations have been employed to improve text de-identification performance. Unlike context-independent (or static) word embeddings17,18, which ignore the context in which a word appears, context-dependent word embedding models19-21 can assign different vector representations to a word depending on its usage in context. Khin et al.22 and Lee et al.23 reported improved classification accuracy when the learning architecture included ELMo19 (embeddings from language models) representations. They used the general-domain pretrained ELMo models. Alsentzer et al.24 trained BERT20 (bidirectional encoder representations from transformers) models with the MIMIC-III (medical information mart for intensive care) v1.4 database25 containing various types of clinical notes. They reported that MIMIC-based BERT models were less beneficial than models trained on general texts or biomedical literature because the phrases identified as PII were replaced with tags in MIMIC texts.

As seen in the above-mentioned studies, clinical text de-identification has reached nearly mature performance, especially with Bi-LSTM algorithms. To further improve performance, we first apply ensemble pruning to text de-identification. The hill climbing search method26,27 is one of the most commonly used pruning approaches. It begins with an ensemble containing either none or all of the individual models and searches the space of sub-ensembles by greedy inclusion or exclusion of one model at a time. For this study, we apply backward elimination28,29, which shrinks the ensemble by excluding individual models without loss of accuracy.

Methods

We describe the PII categories to be identified and outline the distribution of labeled data. Then we explain how we train each individual de-identification model and present our ensemble-based methods.

Data description

We aim to extract PII phrases whose categories were defined in the 2014 i2b2 de-identification challenge. The same categories were used for the 2016 CEGS N-GRID shared task. Each PII instance was assigned one of seven categories. Table 1 shows the number of annotated instances of each category in the test set of each corpus. The column named “Instance Ratio” in Table 1 shows the ratio of the number of instances in the two data sets. Date, location, and name categories are common in both corpora. Compared to the 2014 i2b2 data, the 2016 N-GRID data contains more instances of the profession, age, and location categories, but fewer instances of the ID and contact categories. For example, the 2016 N-GRID data contains 33 ID instances, only about 5% of the number in the 2014 i2b2 data.

Table 1.

The numbers of PII instances and tokens in each data set.

2014 i2b2 Test 2016 N-GRID Test
Instances Tokens Instances Tokens Instance Ratio
Name 2,883 5,655 2,404 3,515 0.83
Profession 179 345 1,010 1,685 5.64
Location 1,813 3,046 3,771 5,749 2.08
Age 764 836 2,354 2,575 3.08
Date 4,980 19,784 3,821 10,035 0.77
Contact 218 664 126 623 0.58
ID 625 1,723 33 72 0.05
Total 11,462 32,053 13,519 24,254 1.18

Instance ratio = the ratio of the number of instances in two data sets (= 2016 N-GRID Test / 2014 i2b2 Test)

Table 2 shows the number of documents, sentences, and PII instances found in each corpus, as well as their average number per document. The column named “Average Ratio” in Table 2 shows the ratio of these average values between the two data sets. The 2016 N-GRID test set consists of 400 documents and 13,519 PII instances. It contains many more sentences (about 4.48 times) and tokens (about 2.78 times) per document than the 2014 i2b2 data. However, the prevalence of PII instances is relatively lower in the 2016 N-GRID data given the number of tokens. PII phrases in the 2016 N-GRID corpus are also shorter, averaging about 1.8 tokens (= 60.64 tokens / 33.80 instances) per PII phrase, whereas PII phrases in the 2014 i2b2 data average about 2.8 tokens (= 62.36 / 22.30).

Table 2.

Corpora characteristics.

2014 i2b2 Test 2016 N-GRID Test
Count Average Count Average Average Ratio
Documents 514 1.00 400 1.00 1.00
Sentences 22,047 42.89 76,877 192.19 4.48
Tokens 444,245 864.29 962,251 2405.63 2.78
PII instances 11,462 22.30 13,519 33.80 1.52
PII tokens 32,053 62.36 24,254 60.64 0.97

Average ratio = the ratio of average values between two data sets (= 2016 N-GRID Test / 2014 i2b2 Test)

In addition to quantitative analysis, we also examined how the two corpora differ in content. We assigned UMLS (Unified Medical Language System)30 Metathesaurus concepts to phrases in the text. We used MetaMap31 to recognize these concepts and compared their occurrences across the two corpora. We found that some concepts related to mental illness were more frequent in the 2016 N-GRID data, which is not a surprise considering that the corpus only includes psychiatric intake notes. For example, anxiety, suicidal-behavior, violent-behavior, psychosis, mood, and depression were relatively frequent. The 2014 i2b2 data covered a diverse set of medical conditions. Concepts associated with common diseases, such as hypertension, edema, blood-pressure, chest-pain, and diabetes, often appeared.
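As an illustration of this kind of content comparison, the sketch below ranks concepts by how much more frequent they are in one corpus than in the other, assuming MetaMap has already produced a flat list of concept names per corpus. The function name, the smoothing constant, and the input format are our own assumptions, not part of the original analysis.

```python
# Hedged sketch: compare UMLS concept frequencies between two corpora
# given concept name lists already extracted (e.g., by MetaMap).
from collections import Counter

def relative_frequency_ratio(concepts_a, concepts_b, smoothing=1.0):
    """Return concepts ranked by (relative freq in corpus A) / (relative freq in corpus B)."""
    count_a, count_b = Counter(concepts_a), Counter(concepts_b)
    total_a, total_b = sum(count_a.values()), sum(count_b.values())
    ratios = {}
    for concept in set(count_a) | set(count_b):
        freq_a = (count_a[concept] + smoothing) / (total_a + smoothing)
        freq_b = (count_b[concept] + smoothing) / (total_b + smoothing)
        ratios[concept] = freq_a / freq_b
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# e.g., relative_frequency_ratio(ngrid_concepts, i2b2_concepts)[:10] would surface
# concepts such as "anxiety" or "psychosis" for the psychiatric intake corpus.
```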

De-identification models

As in our previous de-identification study2, we built 12 de-identification models using deep learning, shallow learning, and rule-based approaches. Our goal was to promote diversity among the ensemble members by employing various algorithms, even though some of them were trained on the same training data.

We consider two of them as external resources because their model or rules were learned on different corpora than the target data. One of the external resources is the MITRE identification scrubber toolkit32 (called ‘MIST’), a machine learning-based system trained on the 2006 i2b2 de-identification task data. The other system is the PhysioNet de-identification software package33 (called ‘PhysioNet deid’), a rule-based system tuned primarily to de-identify PII in nursing notes and discharge summaries.

For the remaining machine learning-based classifiers, we first trained the models with the 2014 i2b2 training data. We performed a 10-fold cross validation on the training set and tuned the parameters of each classifier to maximize the micro-averaged F1-score. We hypothesize that this optimization process with respect to the cross-validated performance measure can also improve performance on the test data. Then, we trained the new models by reusing the classifier configuration optimized for the 2014 i2b2 model. For each classification method, we created one model with the 2016 N-GRID data and one model with the union of both corpora.
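The tuning loop described above can be sketched as follows. This is a minimal illustration: train_model, predict, and micro_f1 are hypothetical helpers standing in for each toolkit's own training and scoring commands, and the example grid is illustrative rather than the grids actually searched.

```python
# Hedged sketch of 10-fold cross-validated hyperparameter selection that
# maximizes the mean micro-averaged F1-score.
from itertools import product
from statistics import mean

def tune_hyperparameters(documents, folds, param_grid, train_model, predict, micro_f1):
    """Pick the configuration with the best mean micro-averaged F1 over the folds."""
    best_config, best_score = None, -1.0
    for values in product(*param_grid.values()):
        config = dict(zip(param_grid.keys(), values))
        scores = []
        for held_out in folds:                       # folds = list of held-out document-id sets
            train = [d for d in documents if d.id not in held_out]
            test = [d for d in documents if d.id in held_out]
            model = train_model(train, **config)
            scores.append(micro_f1(predict(model, test), test))
        if mean(scores) > best_score:
            best_config, best_score = config, mean(scores)
    return best_config

# Example (values are assumptions, not the authors' grid):
# tune_hyperparameters(docs, folds, {"learning_rate": [0.005, 0.01], "epochs": [80, 100]}, ...)
```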

We developed three de-identification models that use deep learning: two versions of Bi-LSTM with a CRF layer (called ‘LSTM-CRF’) and one version of LSTM without CRF (called ‘LSTM’). We used pre-trained word embeddings for each word in all processed sentences. The 100-dimensional GloVe (global vectors for word representation)18 embeddings built with the 2014 dump of English Wikipedia were used for all Bi-LSTM models. In addition to these deep learning architectures with multiple hidden layers, we applied various classical algorithms based on shallow architectures. They can be divided into two approaches. One is based on structured learning algorithms, including CRF, MEMM (maximum entropy Markov models)34, SEARN, MIRA (margin infused relaxed algorithm)35, and structural SVM36. These algorithms have been widely used for named entity recognition (NER) tasks because of their ability to model interdependent output variables (i.e., the tags of the words arranged in a sentence). The other is token classification-based methods that classify each word independently. We implemented SVM7 and OGD (online gradient descent)37 classifiers. Table 3 shows the software libraries and hyperparameter configurations used for the learning algorithms. Readers can refer to our previous de-identification work2 for more details.

Table 3.

Description of the learning algorithms.

Method Description Software Hyperparameters
LSTM-CRF v.N38 Bi-LSTM with a CRF layer NeuroNER39 learning rate = 0.01, 80 epochs
LSTM-CRF v.L38 Bi-LSTM with a CRF layer Lample et al.38 learning rate = 0.005, 100 epochs
CRF14 Conditional random fields Wapiti40 ε (interval for stopping) = 0.004
MEMM34 Maximum entropy Markov models Wapiti40 ε (interval for stopping) = 0.004
SEARN8 Search-based structured prediction Vowpal Wabbit41 learning rate = 0.05, 20 epochs
MIRA35 Margin infused relaxed algorithm Miralium42 max. weight updates = 0.1, 30 epochs
LSTM v.L38 Bi-LSTM without CRF Lample et al.38 learning rate = 0.05, 100 epochs
SVM7 Support vector machines LIBLINEAR43 negative example weight = 0.01
OGD37 Online gradient descent Vowpal Wabbit41 learning rate = 0.1, 10 epochs
Struct. SVM36 Structural SVM MITIE44 default settings
MIST32 MITRE identification scrubber MIST32 default settings
PhysioNet deid33 PhysioNet de-identification PhysioNet deid33 default settings

Concept-based ensemble methods

We briefly introduce the ensemble methods applied to our previous de-identification research and describe the newly proposed methods for this study. First, we describe the concept-based (or phrase-based) ensemble methods. Given a set of PII instances as training examples, each method builds a model that assigns test examples to True (matched the reference annotation) or False labels.

Voting:

We applied the same voting strategy that was used effectively in our previous concept extraction tasks2,45-47. It collects the PII terms produced by a set of de-identification models and outputs all terms whose number of votes meets the voting threshold. When two PII terms overlapped, we chose the term that received more votes. For terms with the same number of votes, we chose the term produced by the model that performed better during cross-validation. These tie-break rules were also applied to the pruned voting ensemble and the decision template method3 described below.
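A minimal sketch of this voting rule follows, assuming each model's output is a set of (start, end, category) spans and that a hypothetical cv_rank dictionary (model name to rank, 0 being best) encodes the cross-validation ordering used for tie-breaking.

```python
# Hedged sketch of threshold-based voting with overlap tie-breaking.
from collections import Counter

def overlaps(a, b):
    """Two (start, end, category) spans overlap if their character ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]

def vote(model_outputs, threshold, cv_rank):
    """model_outputs: dict model_name -> set of (start, end, category) predictions."""
    votes, best_rank = Counter(), {}
    for name, spans in model_outputs.items():
        for span in spans:
            votes[span] += 1
            best_rank[span] = min(best_rank.get(span, len(cv_rank)), cv_rank[name])
    kept = []
    # prefer more votes, then the span proposed by the better cross-validated model
    for span in sorted(votes, key=lambda s: (-votes[s], best_rank[s])):
        if votes[span] >= threshold and not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return kept
```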

Voting pruned:

Ensemble pruning aims to reduce the number of ensemble members while achieving similar or better performance than the ensemble of all members. Unlike stacked ensembles, which implicitly assign a weight to each classifier, ensemble pruning completely excludes individual classifiers. The pruned ensemble therefore saves the computational time or physical resources that would otherwise be consumed by the excluded members.

We implemented a new ensemble pruning method that jointly selects the voting threshold and a subset of de-identification models. The procedure involved the following steps: for each voting threshold, we began with an ensemble containing all individual models. At each step, we greedily excluded the one model whose removal yielded the highest F1-score on the training data. These steps continued until the performance of the pruned voting ensemble no longer improved. This procedure was repeated for each voting threshold to obtain a subset of de-identification models, and we then chose the voting threshold that yielded the best performance. For example, from the twelve 2014 i2b2 models, the pruned ensemble included four models and only considered PII terms labeled by at least two models.
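The backward-elimination loop for a fixed voting threshold can be sketched as below, assuming a hypothetical ensemble_f1 scorer that runs the voting ensemble of the given members and evaluates it on the training data.

```python
# Hedged sketch of greedy backward elimination for one voting threshold.
def prune_by_backward_elimination(members, threshold, ensemble_f1):
    current = list(members)
    current_score = ensemble_f1(current, threshold)
    while len(current) > 1:
        # score every sub-ensemble obtained by dropping exactly one member
        scored = []
        for removed in current:
            subset = [m for m in current if m != removed]
            scored.append((ensemble_f1(subset, threshold), subset))
        best_score, best_subset = max(scored, key=lambda s: s[0])
        if best_score <= current_score:
            break                      # stop once excluding a member no longer improves F1
        current, current_score = best_subset, best_score
    return current, current_score

# An outer loop (not shown) repeats this for each voting threshold and keeps the
# threshold/subset pair with the best cross-validated F1-score.
```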

Decision Template Method (DTM)3:

For each PII instance, a decision profile was created in the same way as in our previous study2. As we added more de-identification models from two data sets, the decision profile matrix included 22 rows (corresponding to 22 de-identification models) and two columns for binary decisions. To each row, [1, 0] was assigned when a model identified the PII instance, [0, 1] otherwise. Then, we aggregated the decision profiles to construct a decision template (DT) for each class label (True or False). We used Hamming distance to find the most similar DT for each test instance.
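A minimal sketch of the decision template computation follows, assuming the decision profiles are stored as an (instances × models × 2) array; applying a Hamming-style (element-wise absolute difference) distance to the averaged templates is our assumption about the exact metric used.

```python
# Hedged sketch of the decision template method over binary decision profiles.
import numpy as np

def build_templates(profiles, labels):
    """profiles: (n_instances, n_models, 2) array where row i of a profile is
    [1, 0] if model i identified the PII instance and [0, 1] otherwise;
    labels: list of True/False reference labels."""
    labels = np.asarray(labels)
    return {cls: profiles[labels == cls].mean(axis=0) for cls in (True, False)}

def classify(profile, templates):
    """Assign the class whose decision template is closest to the test profile."""
    distances = {cls: np.abs(profile - template).sum() for cls, template in templates.items()}
    return min(distances, key=distances.get)
```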

Stacked SVM:

Similarly to Voting and DTM, we reused the SVM-based stacked ensemble from our previous application. We defined features to consider the degree of agreement and consistency between de-identification models. For example, we counted how many different models produced a predicted PII. We used the LIBLINEAR software package43 for SVM classification with a linear kernel. When two PII instances overlapped, we chose the one with the higher confidence score produced by the SVM metaclassifier.
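A simplified sketch of the concept-level metaclassifier is shown below, using scikit-learn's LinearSVC (a LIBLINEAR wrapper). The feature set shown, per-model agreement indicators plus a vote count, only illustrates the kind of agreement and consistency features described above and is not the authors' exact configuration.

```python
# Hedged sketch of the concept-level stacked SVM over candidate PII phrases.
import numpy as np
from sklearn.svm import LinearSVC

def candidate_features(candidate, model_outputs, model_names):
    """One row per candidate: per-model agreement indicators plus the vote count."""
    indicators = [1.0 if candidate in model_outputs[name] else 0.0 for name in model_names]
    return indicators + [sum(indicators)]

def train_meta_svm(candidates, model_outputs, model_names, gold_labels):
    X = np.array([candidate_features(c, model_outputs, model_names) for c in candidates])
    y = np.array(gold_labels)          # True if the candidate matches a reference annotation
    meta = LinearSVC(C=1.0)            # linear kernel via LIBLINEAR
    meta.fit(X, y)
    return meta

# At test time, meta.decision_function(X) supplies a confidence score that can be
# used to resolve overlapping candidate phrases (keep the higher-scoring one).
```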

Sequence labelling-based ensemble methods

For the newly introduced sequence labelling-based stacked generalization methods, we started by tokenizing the input text. For each token, we then encoded the prediction of each model using BIO tags. Finally, we labeled each token with a PII category. We only used the outputs of the individual de-identification models as input. No tie-break policy was needed because these sequence labeling-based methods allow only one tag per word and do not generate overlapping PII annotations.
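The token-level input encoding can be sketched as follows; the data structures and offset conventions are assumptions made for illustration.

```python
# Hedged sketch: rewrite one member model's predicted spans as per-token BIO tags.
def encode_bio(tokens, predicted_spans):
    """tokens: list of (start, end) character offsets for one sentence;
    predicted_spans: list of (start, end, category) predicted by one model."""
    tags = ["O"] * len(tokens)
    for start, end, category in predicted_spans:
        inside = [i for i, (ts, te) in enumerate(tokens) if ts >= start and te <= end]
        for position, i in enumerate(inside):
            tags[i] = ("B-" if position == 0 else "I-") + category
    return tags

# One categorical feature per member model per token:
# features[i] = [encode_bio(tokens, outputs[m])[i] for m in model_names]
```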

Stacked SEARN:

Unlike the SEARN model used as one of the individual de-identification models, the SEARN-based stacked ensembles were built with features defined solely from the outputs of each individual model. Each model assigned a PII tag to each word token. We then used the predicted tags of the current word, the two preceding words, and the two following words. We used Vowpal Wabbit41, a fast out-of-core online learning system, and took advantage of its built-in n-gram feature generation to create trigram features from the sequence of predicted tags. We set the number of training iterations to 20 and the initial learning rate to 0.025.
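A sketch of the tag-window feature extraction is given below. In the actual system these features are passed to Vowpal Wabbit, whose n-gram option generates the trigram features, so the feature naming and padding conventions here are illustrative only.

```python
# Hedged sketch of the +/-2 tag window used as stacked SEARN features.
def tag_window_features(tag_sequences, position, window=2):
    """tag_sequences: dict model_name -> list of predicted BIO tags for one sentence."""
    features = []
    for name, tags in tag_sequences.items():
        for offset in range(-window, window + 1):
            index = position + offset
            tag = tags[index] if 0 <= index < len(tags) else "PAD"
            features.append(f"{name}_{offset:+d}={tag}")
    return features

# Example:
# tag_window_features({"crf": ["O", "B-DATE", "I-DATE"]}, 1)
# -> ['crf_-2=PAD', 'crf_-1=O', 'crf_+0=B-DATE', 'crf_+1=I-DATE', 'crf_+2=PAD']
```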

Stacked LSTM:

Finally, we used a Bi-LSTM network with a CRF classifier as the last layer. The inputs of the network contained only the outputs of each individual de-identification model; no word or character-based representations were used. We used the NER system of Reimers and Gurevych48 to train the Bi-LSTM models. We created an embedding that maps the output of each individual model into a 10-dimensional vector. The concatenation of these features was fed into the LSTM layers. We used two stacked bidirectional LSTM hidden layers, each with 100 recurrent units. We set the learning rate to 0.001 and trained the model with the RMSProp gradient descent method49 for 25 epochs, applying 25% dropout to the recurrent units.
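A hedged PyTorch sketch of this architecture is shown below (the original system used Reimers and Gurevych's implementation48); the CRF output layer and the RMSProp training loop are omitted for brevity, and the inter-layer dropout here only approximates the recurrent dropout used in the paper.

```python
# Hedged sketch of the stacked Bi-LSTM meta-classifier over member-model tags.
import torch
import torch.nn as nn

class StackedBiLSTM(nn.Module):
    def __init__(self, num_models, tag_vocab_size, num_labels,
                 embed_dim=10, hidden_size=100, dropout=0.25):
        super().__init__()
        # one embedding table per member model (all sharing the same tag vocabulary)
        self.embeddings = nn.ModuleList(
            [nn.Embedding(tag_vocab_size, embed_dim) for _ in range(num_models)])
        self.lstm = nn.LSTM(input_size=num_models * embed_dim, hidden_size=hidden_size,
                            num_layers=2, bidirectional=True, batch_first=True,
                            dropout=dropout)   # inter-layer dropout (approximation)
        self.output = nn.Linear(2 * hidden_size, num_labels)  # a CRF layer would sit on top

    def forward(self, tag_ids):
        # tag_ids: (batch, seq_len, num_models) integer-encoded predicted tags
        embedded = torch.cat(
            [emb(tag_ids[:, :, i]) for i, emb in enumerate(self.embeddings)], dim=-1)
        hidden, _ = self.lstm(embedded)
        return self.output(hidden)   # per-token label scores

# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)  # 25 epochs in the paper
```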

Similarly to the individual models, each ensemble classifier was trained to optimize the micro-averaged F1-score. Their thresholds (e.g., voting threshold in the voting ensemble) and hyperparameter values (e.g., dropout rate in the stacked LSTM) were determined after 10-fold cross validation with the training set.

Results

We present experimental results for each de-identification model and for the six ensemble methods. We measured recall, precision, and the F1-score (harmonic mean of recall and precision with equal weight). We used the 2016 CEGS N-GRID shared task11 evaluation script to calculate performance measures. Our models were evaluated on the 2014 i2b2/UTHealth de-identification challenge corpus (‘2014 i2b2’) and the 2016 CEGS N-GRID shared task11 corpus (‘2016 N-GRID’).
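For reference, strict-entity micro-averaged scoring can be sketched as follows; this mirrors, but is not, the official challenge evaluation script.

```python
# Hedged sketch of micro-averaged precision/recall/F1 with strict entity matching:
# a prediction counts as correct only if span and PII category match exactly.
def micro_prf(predicted, reference):
    """predicted, reference: lists (one per document) of sets of (start, end, category)."""
    tp = sum(len(p & r) for p, r in zip(predicted, reference))
    n_pred = sum(len(p) for p in predicted)
    n_ref = sum(len(r) for r in reference)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```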

Performance of individual de-identification methods

For each machine learning-based de-identification method, we created three different models: one trained on the 2014 i2b2 training data (‘2014 model’), one trained on the 2016 N-GRID data (‘2016 model’), and one trained on the union of both corpora (‘2014 + 2016 model’). Table 4 shows the performance of the individual de-identification models. We report the micro-averaged F1-score with strict entity matching, where both the text span and the PII category must exactly match the reference annotation. The left and right halves of the table present the results on the 2014 i2b2 and 2016 N-GRID test sets, respectively. For each test set, we show how well the 2014, 2016, and 2014 + 2016 models performed. The best results are shown in bold for each trained model and test configuration. For the external de-identification systems (MIST and PhysioNet deid), we report one result for each data set because they did not use the training data.

Table 4.

Accuracy (F1-scores) of individual de-identification methods on the 2014 and 2016 test sets.

Method 2014 i2b2 Test (%) 2016 N-GRID Test (%)
2014 model 2016 model 2014 + 2016 model 2014 model 2016 model 2014 + 2016 model
LSTM-CRF v.N 94.43 73.21 95.29 (+0.86) 77.80 90.82 91.14 (+0.32)
LSTM-CRF v.L 94.60 70.65 95.51 (+0.91) 71.91 91.07 90.40 (–0.67)
CRF 94.15 73.80 94.90 (+0.75) 73.33 88.92 89.14 (+0.22)
MEMM 93.78 74.47 94.28 (+0.50) 72.01 88.19 88.46 (+0.27)
SEARN 93.08 71.89 93.49 (+0.41) 74.68 89.38 89.61 (+0.23)
MIRA 93.72 73.79 93.88 (+0.16) 74.27 88.80 89.41 (+0.61)
LSTM v.L 92.83 67.76 94.62 (+1.79) 66.89 90.18 89.75 (–0.43)
SVM 93.17 72.39 93.77 (+0.60) 75.15 89.16 89.23 (+0.07)
OGD 93.04 70.56 93.38 (+0.34) 74.67 89.27 89.53 (+0.26)
Struct. SVM 79.23 54.43 77.02 (–2.21) 59.47 76.79 73.45 (–3.34)
MIST 54.21 28.70
PhysioNet deid 46.73 23.40

Overall, F1-scores on the 2016 test data were about 4% lower than those on the 2014 test data, presumably because of the relatively low prevalence of PII instances in the 2016 N-GRID corpus. Deep learning approaches outperformed shallow learning methods, and Bi-LSTM with a CRF layer outperformed Bi-LSTM without CRF on all data sets. Among the shallow learning methods, CRF and SEARN produced the highest F1-scores. In general, methods based on structured learning algorithms (e.g., CRF and MEMM) performed better than token classification-based methods on the 2014 i2b2 data. On the 2016 N-GRID data, however, token classification-based methods such as SVM and OGD outperformed the CRF model. If we had optimized the hyperparameters on the 2016 N-GRID training data instead of reusing the values optimized for the 2014 i2b2 data, we would expect the results of the 2016 N-GRID models to correspond to those of the 2014 i2b2 models. The structural SVM models (struct. SVM) did not perform well because they were trained with a predefined feature set designed for general-domain NER tasks; with the same features defined for the other classifiers, the struct. SVM would likely achieve better results.

Each method performed much better when the model was trained with its own training corpus rather than with the other corpus. For example, on the 2014 i2b2 Test, the LSTM-CRF v.L with the 2014 i2b2 yielded an F1-score of 94.60% but only 70.65% with the 2016 N-GRID. Exploiting another corpus provided added value. Most models trained on the union of two corpora achieved better performance on the target corpus. The numbers in parentheses in the columns named ‘2014+2016 model’ indicate the difference before and after merging two datasets. LSTM v.L and MIRA models increased F1-scores by 1.79% (= 94.62% − 92.83%) on the 2014 i2b2 and 0.61% (= 89.41% − 88.80%) on the 2016 N-GRID, respectively. Importantly, the 2014+2016 models offered better generalization across two corpora. They substantially outperformed the models trained with the other corpus.

Performance of ensemble methods

We evaluated three different ensemble configurations: (1) ensembles consisting of de-identification models trained on the 2014 i2b2 data (‘2014 models’), (2) ensembles of models trained on the 2016 N-GRID data (‘2016 models’), and (3) ensembles combining all models from (1) and (2) (‘2014 + 2016 models’). The 2014 + 2016 ensembles included 22 different classifiers (10 models from the 2014 i2b2 data, 10 models from the 2016 N-GRID data, and two external de-identification systems).

Table 5 shows the performance of these ensembles. As we did in Table 4, we report the micro-averaged F1-score with strict entity matching. For the evaluation of each test set, the corresponding training data was used for training. For example, for the 2014 i2b2 test set, the voting threshold of the 2014 + 2016 voting models was determined with the outputs of each model for the 2014 i2b2 training data. Using the outputs of each model as features, the 2014 + 2016 stacked LSTM was trained with the sentences contained in the 2014 i2b2 training data.

Table 5.

Accuracy (F1-scores) of ensemble methods on the 2014 and 2016 test sets.

Method 2014 i2b2 Test (%) 2016 N-GRID Test (%)
2014 models 2014 + 2016 models 2016 models 2014 + 2016 models
Voting 95.38 94.53 (–0.85) 92.09 90.60 (–1.49)
Voting pruned *95.60 *95.93 (+0.33) *92.52 *92.57 (+0.05)
DTM 95.42 95.31 (–0.11) 92.11 90.91 (–1.20)
Stacked SVM *95.70 *95.85 (+0.15) *92.64 *92.69 (+0.05)
Stacked SEARN 95.36 *95.86 (+0.50) 92.38 92.28 (–0.10)
Stacked LSTM 95.32 95.66 (+0.34) 92.10 92.23 (+0.13)

For each trained model and test configuration, the best results appear in boldface. Results that are not significantly different from the best result at the 95% significance level are preceded by an asterisk (*). For example, the stacked SVM with the 2014 models produced the highest F1-score of 95.70% on the 2014 i2b2 test set, which is significantly better than all other ensembles and individual models except the pruned voting ensemble. For statistical significance testing, we used paired t-tests to compare the best performing system with the other methods. The pruned voting models outperformed the corresponding voting ensembles that used all models. A more detailed comparison is presented in the Discussion.

Adding models trained with different corpora contributed to the performance of the ensemble methods. The numbers in parentheses in the ‘2014 + 2016 models’ columns indicate the difference between using the models trained on a single dataset and using the models from both data sources; the higher the gain, the greater the benefit from the other data source. The voting and DTM methods of the 2014 + 2016 models are exceptions because they were unable to efficiently integrate the lower-performing models trained on the other corpus. Voting pruned, stacked SVM, and stacked LSTM achieved better results on both test sets. The pruned voting and stacked SVM models reached the highest F1-scores of 95.93% on the 2014 i2b2 data and 92.69% on the 2016 N-GRID data, respectively. These ensembles were more beneficial for the 2016 N-GRID data, and they produced higher F1-scores than the best individual models (95.51% increased to 95.93% for the 2014 i2b2; 91.14% increased to 92.69% for the 2016 N-GRID). Sequence labelling-based stacked generalization (stacked SEARN and stacked LSTM) yielded lower F1-scores than the stacked SVM, but comparable results when all models were included.

Discussion

We further discuss the results of the SVM-based stacked ensemble (stacked SVM) with both the 2014 and 2016 models on the 2016 N-GRID test set. Table 6 shows the stacked SVM performance for each category. The results on the right half of the table were calculated with binary token matching, where each PII term is evaluated on a per-token basis, regardless of the category. This ensemble yielded overall precision, recall, and F1-scores of 98.57%, 92.86%, and 95.63%, respectively, with binary token matching. For the date, age, name, and contact categories, the stacked SVM reached over 90% F1-score with strict entity matching. Although contact PII categories (e.g., phone and e-mail) were less common, they were identified relatively accurately because they are usually recorded in standardized formats. More work is needed to improve extraction of the profession and location categories, which are written less formally and with a richer vocabulary.

Table 6.

Results of SVM-based stacked ensemble (stacked SVM) on the 2016 N-GRID test set for each category.

Category Count Strict entity (%) Binary token (%)
Precision Recall F1-score Precision Recall F1-score
Name 2,404 95.98 93.43 94.69 98.08 96.26 97.16
Profession 1,010 90.27 72.57 80.46 94.70 76.87 84.86
Location 3,771 92.54 82.92 87.47 97.06 85.71 91.03
Age 2,354 97.75 95.75 96.74 97.99 95.01 96.48
Date 3,821 97.54 96.57 97.05 99.26 98.50 98.88
Contact 126 94.78 86.51 90.46 99.13 90.93 94.85
ID 33 79.17 57.58 66.67 97.14 70.83 81.93
Total 13,519 95.45 90.08 92.69 98.57 92.86 95.63

From the outputs of the stacked SVM, we analyzed the examples whose presence was successfully identified in the text but whose category was not correctly assigned. This type of error was most common in the categories shown in Table 7. For example, organization is one of the subcategories of location, of which 22, 14, and 12 were misclassified as hospital, city, and profession subcategories, respectively. The last column shows some challenging examples that may be used interchangeably as multiple subcategories.

Table 7.

The five most common subcategories of the misclassification errors.

Reference subcategory Classified as Examples
Organization Hospital (22) City (14) Profession (12) Edgemore House, Nestle, Facebook
Patient Doctor (19) Hospital (11) City (3) Galloway, KC, Harriet
Hospital City (12) Doctor (10) Organization (6) Odessa, Lewis Clark, GBMC
Country City (13) State (6) Organization (3) Jordan, Georgia, PNG
City Hospital (11) Organization (5) Country (3) Fountain Hills, Vallejo, Villa Rica

Finally, we compare the results of the voting ensembles with and without pruning. Ensemble pruning achieved higher performance than the ensemble using all models, and the pruning method was effective even when models trained with different datasets participated in the voting ensemble. We examined how well the pruned ensemble performed at each voting threshold. Figure 1 shows the F1-scores of the pruned voting ensembles and of the ensembles with all 22 models; the gap between the two curves indicates their F1-score difference. The graphs on the left and right represent the results on the 2014 i2b2 and 2016 N-GRID test sets, respectively. Note that the y-axis scale in each graph does not start at zero, to focus on the value ranges of interest.

Figure 1.

Accuracy (F1-scores) of pruned ensembles and ensembles with all members.

Results for voting thresholds ranging from 1 to 12 are presented; no pruning occurred at the higher thresholds. The solid line with circle markers represents the pruned voting method. The pruned voting ensembles produced significantly higher F1-scores (at the 95% significance level) than the corresponding ensembles that included all members on the 2014 i2b2 and 2016 N-GRID test sets up to voting thresholds of 9 and 11, respectively. When the threshold was set to three and 14 models were excluded from voting, the pruned ensemble with eight members* achieved the highest F1-score (95.93%) on the 2014 i2b2 data. Similarly, at the same voting threshold, the pruned ensemble with nine members† obtained an F1-score of 92.57%, with no significant difference from the highest score on the 2016 N-GRID data.

Conclusion

We explored various types of ensemble learning methods and confirmed that they can improve de-identification of EHR narratives. Experimental results show that two concept-based ensemble methods, the stacked SVM and the pruned voting ensemble, consistently performed better than the individual classifiers and the other ensemble approaches. Our pruning method can construct an efficient ensemble by selecting an optimal subset of de-identification models. We also sought to better utilize the two clinical text corpora from different sources and found that exploiting both corpora improved generalization. Most ensemble systems consisting of models trained on the two datasets yielded better results than ensembles whose models were trained on either dataset alone.

Our ensemble approaches can provide a convenient solution even in situations where texts or labeled annotations are not accessible and only models or components can be used. Beyond evaluation on corpora that share the same PII category definitions, our future research includes generalization across data sets annotated with heterogeneous PII categories, for example, the 2006 i2b2 de-identification challenge12 and PhysioNet50 corpora. Another promising direction for future work is to create Bi-LSTM models with dependency-based word embeddings51,52 or contextualized word representations19-21.

Acknowledgments

This project was partly supported by the National Institute of General Medical Sciences (R41GM116479). De-identified clinical records used in this research were provided by the 2014 i2b2/UTHealth NLP Shared Task (funded by NIH NLM 2U54LM008748 and NIH NLM 5R13LM011411) and the 2016 CEGS N-GRID challenge (funded by NIH P50 MH106933 and NIH 4R13LM011411).

Footnotes

* Six of the 2014 models (LSTM-CRF v.N, LSTM-CRF v.L, CRF, SEARN, LSTM v.L, and struct. SVM) and two of the 2016 models (LSTM-CRF v.N and LSTM-CRF v.L).

† Seven of the 2016 models (LSTM-CRF v.N, LSTM-CRF v.L, CRF, SEARN, LSTM v.L, OGD, and struct. SVM), one of the 2014 models (LSTM-CRF v.N), and PhysioNet deid.


References

1. HIPAA Privacy Rule. 45 CFR Part 160, Part 164 (A, E). U.S. Department of Health and Human Services; 2008.
2. Kim Y, Heider P, Meystre SM. Ensemble-based methods to improve de-identification of electronic health record narratives. AMIA Annu Symp Proc. 2018:663–72.
3. Kuncheva LI, Bezdek JC, Duin RP. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition. 2001;34(2):299–314.
4. Wolpert DH. Stacked generalization. Neural Netw. 1992;5:241–59.
5. Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task track 1. J Biomed Inform. 2015;58:S11–S19. doi: 10.1016/j.jbi.2015.06.007.
6. Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015;58:S20–S29. doi: 10.1016/j.jbi.2015.07.020.
7. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–97.
8. Daumé H, Langford J, Marcu D. Search-based structured prediction. Machine Learning. 2009;75(3):297–325.
9. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735.
10. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.
11. Stubbs A, Filannino M, Uzuner Ö. De-identification of psychiatric intake records: overview of 2016 CEGS N-GRID shared tasks track 1. J Biomed Inform. 2017;75:S4–S18. doi: 10.1016/j.jbi.2017.06.011.
12. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. JAMIA. 2007;14(5):550–63. doi: 10.1197/jamia.M2444.
13. Wellner B, Huyck M, Mardis S, et al. Rapidly retargetable approaches to de-identification in medical records. JAMIA. 2007;14(5):564–73. doi: 10.1197/jamia.M2435.
14. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th ICML. 2001:282–89.
15. Yang H, Garibaldi JM. Automatic detection of protected health information from clinic narratives. J Biomed Inform. 2015;58:S30–S38. doi: 10.1016/j.jbi.2015.06.015.
16. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform. 2017;75:S34–S42. doi: 10.1016/j.jbi.2017.05.023.
17. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
18. Pennington J, Socher R, Manning CD. GloVe: global vectors for word representation. Proceedings of EMNLP 2014. 2014:1532–43.
19. Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations. Proceedings of the 2018 NAACL-HLT. 2018:2227–37.
20. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 NAACL-HLT. 2019:4171–86.
21. Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proceedings of the 27th COLING. 2018:1638–49.
22. Khin K, Burckhardt P, Padman R. A deep learning architecture for de-identification of patient notes: implementation and evaluation. arXiv preprint arXiv:1810.01570. 2018.
23. Lee K, Filannino M, Uzuner Ö. An empirical test of GRUs and deep contextualized word representations on de-identification. Stud Health Technol Inform. 2019;264:218–22. doi: 10.3233/SHTI190215.
24. Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop. 2019:72–78.
25. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035. doi: 10.1038/sdata.2016.35.
26. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP. Ensemble diversity measures and their application to thinning. Information Fusion. 2005;6(1):49–62.
27. Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning. 2000;40(2):139–57.
28. Banfield R, Hall L, Bowyer K, Kegelmeyer WP. Ensemble diversity measures and their application to thinning. Information Fusion. 2005;6:49–62.
29. Partalas I, Tsoumakas G, Vlahavas IP. Focused ensemble selection: a diversity-based method for greedy ensemble selection. Proceedings of the 2008 ECAI. 2008:117–21.
30. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):D267–70. doi: 10.1093/nar/gkh061.
31. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. JAMIA. 2010;17(3):229–36. doi: 10.1136/jamia.2009.002733.
32. Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE identification scrubber toolkit: design, training, and assessment. Int J Med Inform. 2010;79(12):849–59. doi: 10.1016/j.ijmedinf.2010.09.007.
33. Neamatullah I, Douglass MM, Lehman L-wH, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32. doi: 10.1186/1472-6947-8-32.
34. McCallum A, Freitag D, Pereira FC. Maximum entropy Markov models for information extraction and segmentation. Proceedings of the 17th ICML. 2000:591–98.
35. Crammer K, Singer Y. Ultraconservative online algorithms for multiclass problems. J Mach Learn Res. 2003;3(Jan):951–91.
36. Joachims T, Finley T, Yu C-NJ. Cutting-plane training of structural SVMs. Machine Learning. 2009;77(1):27–59.
37. Bottou L. Online learning and stochastic approximations. On-line Learning in Neural Networks. 1998:9–42.
38. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. Proceedings of the 2016 NAACL-HLT. 2016:260–70.
39. Dernoncourt F, Lee JY, Szolovits P. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. Proceedings of the 2017 EMNLP: System Demonstrations. 2017:97–102.
40. Lavergne T, Cappé O, Yvon F. Practical very large scale CRFs. Proceedings of the 48th ACL. 2010:504–13.
41. Langford J, Li L, Strehl A. Vowpal Wabbit online learning project: technical report. http://hunch.net, 2007.
42. Favre B. Java implementation of MIRA learning for sequences. https://github.com/benob/miralium.
43. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9(Aug):1871–74.
44. King DE. MITIE: library and tools for information extraction. https://github.com/mit-nlp/MITIE.
45. Kim Y, Riloff E. Stacked generalization for medical concept extraction from clinical notes. Proceedings of BioNLP 2015. 2015:61–70.
46. Kim Y, Riloff E, Hurdle JF. A study of concept extraction across different types of clinical notes. AMIA Annual Symposium Proceedings. 2015:737–46.
47. Kim Y, Meystre S. A study of medical problem extraction for better disease management. Stud Health Technol Inform. 2019;264:193–97. doi: 10.3233/SHTI190210.
48. Reimers N, Gurevych I. Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. Proceedings of the 2017 EMNLP. 2017:338–48.
49. Tieleman T, Hinton G. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. 2012;4(2):26–31.
50. Douglass M, Clifford GD, Reisner A, Moody GB, Mark RG. Computer-assisted de-identification of free text in the MIMIC II database. Computers in Cardiology. IEEE; 2004:341–44.
51. Levy O, Goldberg Y. Dependency-based word embeddings. Proceedings of the 52nd ACL. 2014:302–08.
52. Komninos A, Manandhar S. Dependency based embeddings for sentence classification tasks. Proceedings of the 2016 NAACL-HLT. 2016:1490–500.
