Journal of the American Medical Informatics Association: JAMIA. 2021 Aug 7;28(10):2093–2100. doi: 10.1093/jamia/ocab128

Improving domain adaptation in de-identification of electronic health records through self-training

Shun Liao 1,2, Jamie Kiros 3, Jiyang Chen 3, Zhaolei Zhang 1,2, Ting Chen 3
PMCID: PMC8449604  PMID: 34363664

Abstract

Objective

De-identification is a fundamental task in electronic health records to remove protected health information entities. Deep learning models have proven to be promising tools to automate de-identification processes. However, when the target domain (where the model is applied) is different from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as the domain adaptation issue. In de-identification, domain adaptation issues can make the model unreliable in deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain.

Materials and Methods

We introduce a self-training framework to address the domain adaptation issue by leveraging unlabeled data from the target domain. We validate the effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with supervised learning that directly deploys the model trained on the source domain.

Results

In summary, our proposed framework improves the F1-score by 5.38 (on average) when compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test dataset, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.80). The method also increases the F1-score by 10.86 when mimic-discharge is the training dataset and mimic-radiology is the test dataset.

Conclusion

Our work demonstrates an effective self-training framework to boost the domain adaptation performance for the de-identification task for electronic health records.

Keywords: medical language processing, de-identification, domain adaptation

INTRODUCTION

Electronic health records (EHRs) are commonly recommended, as they can improve efficiency and accuracy in clinical management.1 In many countries, certain types of privacy-related information have to be removed from these records before they can be shared and analyzed for research purposes.2,3 However, the de-identification procedures are often time-consuming because these records are presented in free text format. For instance, “John visited the hospital in December” needs to be de-identified into “[PatientNameTag] visited the hospital in [DateTag].” In this case, an automated de-identification system is recommended to minimize the manual work and maximize the benefits from a large EHR system.2

Recently, automated de-identification systems have surpassed human-level performance in several public de-identification datasets.4,5 However, these systems often experience significant performance drops when the test dataset (from the target domain) is very different from the training dataset (from the source domain); this problem is known as the domain adaptation issue.6,7 For example, i2b2-2006 and i2b2-2014 are 2 different datasets from the same data repository, but a model trained on i2b2-2006 only achieves a test F1-score of 87.5 on i2b2-2014, while achieving 98.8 on the test cohort of i2b2-2006.8 In this case, the de-identification systems will fail to identify some personal identifiers.9 To solve this domain adaptation issue in the context of EHR de-identification, we propose an effective self-training framework.

The motivation of this work is to train the de-identification model on both labeled data from the source domain and unlabeled data from the target domain. This differs from other methods, which mostly use only the labeled data from the source domain.4,5,7 In most deployments, unlabeled data from the target domain are readily available and incur no additional data acquisition cost.10 Our framework is partially inspired by pseudo-label methods in few-shot problems,11 which usually consist of 3 steps: (1) training the model on labeled data in the training set, (2) generating pseudo-labels for the unlabeled data using the trained model, and (3) retraining the model on both labeled and pseudo-labeled data. More details on pseudo-label methods are given in Related Work.

We benchmark the proposed framework on 4 de-identification datasets, 2 of which are publicly available (i2b2-2006 and i2b2-2014) and 2 of which are privately curated (mimic-radiology and mimic-discharge).8,12 We consider domain adaptation (or “cross-domain”) as the main scenario for evaluation. In the cross-domain experiments, the model is trained on a dataset from the source domain and tested on a different dataset from the target domain. We conducted the experiments in 2 scenarios. In the first scenario, data from the source domain are labeled and data from the target domain are unlabeled. This mirrors a common deployment scenario, in which the de-identification model trained by one hospital (source domain) is deployed on the unlabeled medical records of another hospital (target domain). The second scenario introduces a small amount of additional labeled data from the target domain. In previous work, Hartman et al7 reported that the inclusion of a few labels from the target domain could significantly increase the domain adaptation performance.

In addition, we conducted 2 ablation experiments to further validate the robustness of the proposed approach. In the first ablation experiment, we intentionally removed some of the labels from the source domain to generate unlabeled data. This can be considered as a “simplified” case of cross-domain experiment, as both the training and testing come from the same domain.13 In the second ablation experiment, we investigate how compatible the proposed self-training framework is with pretraining methods. Similar to self-training, pretraining methods can also make use of unlabeled data from the testing domain7,14; however, these methods only fine-tune the word-embedding layer, which does not maximize the usage of the unlabeled data. In contrast, self-training methods use the unlabeled data to retrain the entire model and are more likely to increase the performance for domain adaptation.

In summary, this work proposes a self-training framework to address the domain adaptation issue. We evaluated the proposed framework on 4 de-identification datasets in various experimental scenarios, and the proposed framework increased the performance in all scenarios.

Related work

Two types of automated de-identification systems are available: rule based and learning based.15 The earliest automated systems for medical free text, described in the late 1990s, were mostly rule based.16 These rule-based systems often have a high false positive rate due to the simplicity of the model, which limits their usage.15 Subsequent work applied learning algorithms, such as support vector machines, to manually engineered features.17 Recently, researchers started applying deep neural networks in de-identification, and various network architectures have been proposed, including the attention function, gating function, and conditional random field.4,18 These learning-based methods have achieved precision and recall of 97% or higher on some public de-identification datasets,4,5 surpassing human-level performance, which had a precision of 81% and a recall of 98% in de-identification.2 Moreover, there are some hybrid solutions that combine these 2 methods.19 A detailed review can be found in Yogarajan et al’s work.9 In this work, we focus on learning-based methods.

Multiple algorithms have been proposed to address domain adaptation issues in de-identification.6,7,20 For example, Lee et al6 proposed a solution that leverages existing corpora to de-identify psychiatric notes. In recent work, Hartman et al7 replaced the general-purpose word embedding with a more context-specific word embedding, which is generated by fine-tuning the embedding using the unlabeled data from the target domain. Our framework shares the same intuition with Hartman et al’s work in utilizing unlabeled data from the target domain to boost model performance. Intuitively, the main difference is that Hartman et al used unlabeled data to fine-tune the word-embedding layer, while our self-training framework uses unlabeled data to retrain the entire model. Along this line, we explore combining our self-training framework with different pretraining methods to investigate whether the combination can further boost the performance.

Our framework is inspired by the pseudo-label methods.11 Similar methods have been successfully applied to semi-supervised learning in domains such as text classification and image recognition.21,22 For example, both Unsupervised Data Augmentation and NoisyStudent use unlabeled data to boost the accuracy of the model. In addition, recent works have applied self-training to improve robustness or domain adaptation in general image recognition.23,24 The effectiveness of self-training in enhancing robustness in natural language is much less explored, and one challenge is the discrete nature of language.22 To the best of our knowledge, this work is the first to explore the effectiveness of the self-training method in improving the robustness of de-identification models.

MATERIALS AND METHODS

Data source

We evaluated the effectiveness of our proposed framework on 4 standard de-identification datasets, 2 publicly available and 2 privately curated. The 2 public datasets (i2b2-2006 and i2b2-2014) are available on the i2b2 National Center for Biomedical Computing.8 The 2 private datasets are subsets that were generated from the MIMIC-III (Medical Information Mart for Intensive Care III) dataset: mimic-radiology and mimic-discharge.12 The general characteristics of each dataset are summarized in Table 1.

Table 1.

Dataset characteristics.

Dataset | Domain Description | # Patients | # Tokens | # PHI Types | # PHI
i2b2-2006 | Discharge notes | 889 | 487 000 | 8 | 19 500
i2b2-2014 | Diabetic longitudinal records | 296 | 758 000 | 23 | 28 800
mimic-radiology | Radiology notes in ICUs | 1000 | 205 000 | 8 | 4100
mimic-discharge | Discharge notes in ICUs | 1000 | 128 000 | 9 | 40 800

This table summarizes the characteristics of each dataset, including the number of patients, the number of tokens in records, the number of PHI types/classes, and the number of PHI tokens. Each row indicates the summary of one dataset, and the column contains the characteristics of such a dataset. The symbol “#” indicates the “number of” in this table.

ICU: intensive care unit; PHI: protected health information.

Prior to release, the 4 datasets were all manually de-identified by replacing personal identifiers with realistic surrogate information (eg, replacing a patient’s name with a common name); we evaluate our systems on this surrogate protected health information (PHI). Throughout the paper, the term PHI is used to mean such surrogate PHI. While all datasets used comply with the Safe Harbor method of de-identification under the US Health Insurance Portability and Accountability Act (HIPAA), the specific guidelines regarding which entities are considered PHI vary across datasets. For example, i2b2-2014 identifies 23 different types of PHI, while i2b2-2006 identifies only 8 PHI types. The detailed de-identification guidelines are described in previous work by Hartman et al.8

Evaluation metrics

In this work, we use the strict F1-score as the primary metric in order to be consistent with the named entity recognition (NER) literature.26 According to the strict F1-score, the predicted PHI entity is considered correct if and only if the predicted entity shares the same annotation, starting word, and ending word with the ground truth. For example, suppose that in the sentence “John Snow visited hospital,” the span “John Snow” is annotated with a [PatientNameTag]. The strict F1-score considers the prediction as correct if and only if “John Snow” is predicted as one entity with [PatientNameTag] as the label.
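The strict matching rule can be illustrated with a minimal sketch. It is not the evaluation code used in this work; the `(start, end, label)` entity representation, the function name `strict_f1`, and the handling of the empty case are assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start_token, end_token, label),
# with both token indices inclusive.
Entity = Tuple[int, int, str]


def strict_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Entity-level F1 in which a prediction counts only when its start,
    end, and label all match a gold entity exactly."""
    if not gold and not predicted:
        return 1.0  # assumed convention when there is nothing to find
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "John Snow visited hospital": tokens 0 and 1 form one Name entity.
gold = {(0, 1, "NAME")}
exact = {(0, 1, "NAME")}    # whole span with the right label
partial = {(0, 0, "NAME")}  # only "John" is tagged

print(strict_f1(gold, exact))    # 1.0
print(strict_f1(gold, partial))  # 0.0
```

Under this rule a partially tagged name receives no credit, which is why the strict score is a conservative measure for de-identification.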

In domain adaptation experiments, we consider the strict F1-score of Name PHI instead of all PHI, because the PHI types differ between the source and target domains; the Name PHI includes the names of both patients and clinicians.8 We choose the Name PHI because it is labeled consistently across the 4 datasets, and it is more difficult to achieve a high F1-score for this PHI.5,27 When training and testing with the same dataset, we report the weighted average (micro) strict F1-score across all PHI types.

Each dataset is separated into training (60%), validation (20%), and testing (20%) cohorts. During each experiment, we use 3 distinct random seeds to initialize the neural network; the reported performances are averaged over these 3 random seeds.

Proposed self-training framework

The main idea of the proposed framework is to utilize unlabeled samples from the target domain to increase the relevant sample size for training. There are 3 main steps in our framework: (1) train a model on labeled samples from the training dataset in source domain, (2) use the trained model to generate pseudo-labels on unlabeled samples from the training dataset in target domain, and (3) retrain the model with a combination of labeled and pseudo-labeled samples. In practice, steps 2 and 3 can iterate multiple times to further boost performance. We illustrate our framework in Figure 1.
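The 3 steps can be sketched as a short loop. This is a toy illustration, not the implementation used in this work: the "model" is a hypothetical 1-dimensional threshold classifier, the pseudo-labels are hard labels, and the model is refit from scratch in each round, whereas the framework described here continues training a neural network with soft pseudo-labels and noise.

```python
from typing import List, Tuple

Example = Tuple[float, int]  # (feature value, class label 0 or 1)


def fit_threshold(data: List[Example]) -> float:
    """Toy 'model': the midpoint between the two class means."""
    zeros = [x for x, y in data if y == 0]
    ones = [x for x, y in data if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict(threshold: float, x: float) -> int:
    return int(x > threshold)


def self_train(labeled: List[Example], unlabeled: List[float], rounds: int) -> float:
    # Step 1: train on labeled source-domain data only.
    model = fit_threshold(labeled)
    for _ in range(rounds):
        # Step 2: generate pseudo-labels for unlabeled target-domain data.
        pseudo = [(x, predict(model, x)) for x in unlabeled]
        # Step 3: retrain on true labels plus pseudo-labels.
        model = fit_threshold(labeled + pseudo)
    return model


source = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
target = [1.5, 2.0, 6.0, 6.5]  # same task, shifted distribution

print(self_train(source, target, rounds=0))  # 2.5
print(self_train(source, target, rounds=1))  # 3.25
```

In this toy case one round of self-training moves the decision threshold toward the shifted target data, which is the intuition behind using unlabeled target-domain samples.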

Figure 1.


Algorithm overview. We illustrate the 3 stages of the proposed self-training framework. In the first initial training stage, the model is trained on labeled data (source domain) only. Then, the second pseudo-labeling stage generates pseudo-labels for unlabeled data (target domain). The third stage is retraining, which retrains the model on both true and pseudo-labels. The pseudo-labeling and retraining stages can be looped for multiple iterations. Notably, the noise is turned on only during training, not during label generation.

As shown in Figure 1, an important step is to add noise during training but remove the noise when generating pseudo-labels. The noise is mandatory because it ensures that the model's output differs from the generated pseudo-labels. Without the noise, the model would produce the same output as the pseudo-labels, which makes the loss, and consequently the gradient, equal to zero. We identify 2 critical types of noise: data augmentation and activation dropout.28 Data augmentation consists of word-level switchout and character-level switchout.22 Switchout replaces a word or character with a random word or character, respectively. For example, character-level switchout can change “John Snow visited hospital” to “John Znow visited hospital” if the “S” is switched into “Z.” Similarly, word-level switchout can change “John Snow visited hospital” to “John Sun visited hospital” if “Snow” is switched into “Sun.” To control the noise strength, we introduce a hyperparameter that controls the probability of each type of switchout.
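A minimal sketch of the two switchout augmentations follows. The function names, the replacement alphabet (ASCII letters), and the toy vocabulary are assumptions for illustration; the actual implementation may sample replacements differently.

```python
import random
import string
from typing import List, Sequence


def char_switchout(word: str, prob: float, rng: random.Random) -> str:
    """Replace each character with a random ASCII letter with probability `prob`."""
    return "".join(
        rng.choice(string.ascii_letters) if rng.random() < prob else ch
        for ch in word
    )


def word_switchout(
    tokens: Sequence[str], vocab: Sequence[str], prob: float, rng: random.Random
) -> List[str]:
    """Replace each token with a random vocabulary word with probability `prob`."""
    return [rng.choice(vocab) if rng.random() < prob else tok for tok in tokens]


rng = random.Random(0)
sentence = ["John", "Snow", "visited", "hospital"]
vocab = ["Sun", "Lee", "clinic", "left"]  # hypothetical replacement vocabulary

noisy_words = word_switchout(sentence, vocab, prob=0.1, rng=rng)
noisy_chars = [char_switchout(tok, prob=0.1, rng=rng) for tok in sentence]
print(noisy_words, noisy_chars)
```

Both functions preserve the sequence length, so token-level labels stay aligned with the noisy input. A sampled replacement may by chance equal the original character or word. Setting `prob` to 0 disables the noise, as required during pseudo-label generation.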

De-identification model architecture

Regarding architecture, we follow a 3-block network structure similar to the network in previous work.8 The first block is preprocessing, and the second one is embedding. The last block consists of a 2-layer long short-term memory network and a conditional random field. More details are presented in Supplementary Appendix A.

We implemented the de-identification baseline model through a PyTorch NER library, Flair.29 Our implementation has only around 500 lines of code, about 20% of the size of the repository used in the previous publication.8 Our implementation is available online (github.com/shun1024/deid-self-training).

Training details

In the initial training phase (shown in Figure 1), we applied a standard supervised learning training strategy with data augmentation, including word and character switchout. The model is trained by minimizing the cross-entropy loss over labeled data with stochastic gradient descent (SGD).30 We tuned 3 hyperparameters in the validation cohort with a grid search strategy, including learning rate [0.001, 0.003, 0.01], batch size [16, 32],17,31 and dropout rate [0.1, 0.2]. Then, in the second phase, we generate pseudo-labels for all unlabeled data with the trained model that has the best performance in the validation cohort. Unlike true labels, a pseudo-label is a continuous vector that represents the predictive probabilities of the network. As a reminder, data augmentation and dropout must be turned off during pseudo-label generation.
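Because a pseudo-label is a probability vector rather than a single class, the cross-entropy is taken against a soft target. The sketch below shows this for one token, assuming a plain softmax output; the function names are hypothetical, and the actual model also includes a conditional random field, which is not modeled here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(target_probs: List[float], logits: List[float]) -> float:
    """Cross-entropy between a (possibly soft) target and the softmax output."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def grad_wrt_logits(target_probs: List[float], logits: List[float]) -> List[float]:
    """Gradient of the loss above with respect to the logits: softmax - target."""
    return [p - t for p, t in zip(softmax(logits), target_probs)]


logits = [2.0, -1.0, 0.5]
hard_label = [1.0, 0.0, 0.0]    # a true label is a one-hot vector
pseudo_label = softmax(logits)  # a pseudo-label produced by the same model

print(soft_cross_entropy(hard_label, logits))
print(grad_wrt_logits(pseudo_label, logits))  # [0.0, 0.0, 0.0]
```

The last line shows that the gradient vanishes when the model output equals its own pseudo-label, which is consistent with the reason given for injecting noise during training but not during label generation.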

In the retraining phase (the third phase in Figure 1), we continue to train the best performing model from the first phase and apply a similar training strategy (cross-entropy loss, stochastic gradient descent, and data augmentation). The difference is that each mini-batch contains half a batch of true labeled data and half a batch of pseudo-labeled data. In addition, we retrain the model for a total of 50 epochs and use the current model to update the pseudo-labels every 5 epochs. We tune the same 3 hyperparameters in the same validation cohort, and the best performing model is chosen for testing.
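The half-and-half mini-batch composition and the pseudo-label refresh schedule can be sketched as follows. The helper names are hypothetical, and details such as how leftover samples are handled or whether the refresh happens at the end of every fifth epoch are assumptions, not statements about the released code.

```python
import random
from typing import Iterator, List, Sequence


def mixed_batches(
    labeled: Sequence, pseudo_labeled: Sequence, batch_size: int, rng: random.Random
) -> Iterator[List]:
    """Yield mini-batches made of half true-labeled and half pseudo-labeled samples."""
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be a positive even number")
    half = batch_size // 2
    labeled = rng.sample(list(labeled), len(labeled))  # shuffled copies
    pseudo = rng.sample(list(pseudo_labeled), len(pseudo_labeled))
    # Assumption: stop once the smaller pool cannot fill another half batch.
    for i in range(min(len(labeled), len(pseudo)) // half):
        yield labeled[i * half:(i + 1) * half] + pseudo[i * half:(i + 1) * half]


def refresh_epochs(total_epochs: int = 50, every: int = 5) -> List[int]:
    """Epochs (1-based) after which the pseudo-labels are regenerated."""
    return [e for e in range(1, total_epochs + 1) if e % every == 0]


labeled = [("L", i) for i in range(8)]
pseudo = [("P", i) for i in range(20)]
batches = list(mixed_batches(labeled, pseudo, batch_size=4, rng=random.Random(0)))

print(len(batches))      # 4
print(refresh_epochs())  # [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
```

Fixing the ratio inside each mini-batch keeps the true labels from being outweighed when the unlabeled pool is much larger than the labeled one.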

We do not include other common training techniques, such as sharpening and confident unlabeled data selection,22 because preliminary experiments show that they have limited influence on improving domain adaptation performance of the de-identification task.

Experimental setup

Two categories of experiments were conducted on the 4 de-identification datasets.

Cross domain experiments

This experiment evaluates the domain adaptation performance where the model is trained and tested on different datasets, eg, trained on dataset A (i2b2-2014) and tested on dataset B (mimic-discharge). We considered 2 scenarios: (1) B has no labels and (2) B has a few labels. In the first scenario, the model trained on A is directly deployed on dataset B. Notably, the model selected for testing is the one that performs best on the validation cohort of dataset A. In the second scenario, the model trained on A is fine-tuned with some additional labels from dataset B. In this scenario, the model selected for testing is the one that performs best on the validation cohort of dataset B. Regarding the metric, we consider the strict F1-score on Name PHI for the testing cohort of B, because A and B might have different numbers of PHI types.

Ablation experiments

We consider 2 ablation experiments to investigate 2 individual components of the proposed framework: the domain of unlabeled data and the pretrained embedding of neural network. In the first ablation experiment, we investigate the proposed framework when labeled and unlabeled data both come from the source domain. In the second ablation experiment, we investigate the compatibility of the proposed self-training framework with various pretraining methods in the cross-domain.

Unlabeled Data from Source Domain Experiments. In these experiments, labeled and unlabeled samples come from the same domain. Given a dataset A, a fraction of the labels in the training samples is intentionally removed so that only 3% to 30% of the samples remain labeled. The experiment with 100% labeled training samples is equivalent to a fully supervised learning baseline. We use a strict micro F1-score to evaluate the performance, which averages over all output classes.
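The label-removal step can be sketched as a random split of the training samples. This is an illustration only: the function name, the sentence-level granularity, and the `(tokens, tags)` sample format are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[List[str], List[str]]  # (tokens, tags), hypothetical format


def split_labeled_unlabeled(
    samples: Sequence[Sample], keep_fraction: float, rng: random.Random
) -> Tuple[List[Sample], List[List[str]]]:
    """Keep labels for `keep_fraction` of the samples and strip them from the rest."""
    n_keep = round(len(samples) * keep_fraction)
    keep_idx = set(rng.sample(range(len(samples)), n_keep))
    labeled = [s for i, s in enumerate(samples) if i in keep_idx]
    unlabeled = [tokens for i, (tokens, _tags) in enumerate(samples) if i not in keep_idx]
    return labeled, unlabeled


# 100 toy sentences; keep labels for 10% of them.
data = [(["John", "visited"], ["NAME", "O"]) for _ in range(100)]
labeled, unlabeled = split_labeled_unlabeled(data, keep_fraction=0.10, rng=random.Random(0))
print(len(labeled), len(unlabeled))  # 10 90
```

With `keep_fraction=1.0` the unlabeled pool is empty, which corresponds to the fully supervised baseline.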

Different Pretraining Methods Experiments. In principle, the proposed self-training framework is orthogonal to pretraining methods because they are applied at different network layers. Therefore, we investigate the combination of the proposed framework with different pretraining methods, and performance is evaluated in the same setting as the cross-domain without-labels experiments.

RESULTS

The results of the 4 experiments are presented in the same order as described in the Experimental Setup. The first section presents the cross-domain results, both without and with labels from the target domain. Following these 2 experiments, we also present an explanation of the improvement in domain adaptation. The second section presents the 2 ablation experiments on the domain of unlabeled data and the effect of pretrained embedding.

Cross-domain experiments

This section presents the cross-domain experiments in which the model is trained on one dataset but tested on another. The first 2 subsections present the results for the 2 subscenarios, respectively: the target domain has no labels, and the target domain has some labels. The third subsection examines how self-training improves domain adaptation performance. In these experiments, we consider the strict F1-score on Name PHI.

No labels from target domain

In deployment, the target domain usually has only unlabeled data. We simulate this by removing the labels in the dataset from the target domain. Table 2 summarizes the model performance in different settings in which different source and target datasets are used. The results are organized by source dataset (each row) and by target dataset (each column). In each cell, the first number is the performance of the model trained with only labeled data from the source dataset; the second number is the performance of the model trained with both labeled and unlabeled data through the proposed self-training framework; and the last number in parentheses is the absolute improvement.

Table 2.

Results of Name PHI in cross domain without labels.

Source | i2b2-2006 | i2b2-2014 | Mimic-discharge | Mimic-radiology
i2b2-2006 | 96.61/– (–) | 74.51/82.11 (+7.60) | 52.01/56.71 (+4.70) | 67.88/74.91 (+7.03)
i2b2-2014 | 76.61/85.41 (+8.80) | 96.09/– (–) | 69.01/71.13 (+2.12) | 78.28/80.93 (+2.65)
mimic-discharge | 51.29/60.61 (+9.32) | 64.12/65.33 (+1.21) | 90.36/– (–) | 70.36/81.22 (+10.86)
mimic-radiology | 62.09/64.41 (+2.32) | 69.45/71.01 (+1.56) | 80.01/86.35 (+6.34) | 96.08/– (–)

In this experiment, the source domain contains labeled data, and the target domain only has unlabeled data. Each cell contains the results of testing performance on the target dataset (column). In each cell, the first number is the performance of the model trained with labeled data, and the second number is the performance on both labeled and unlabeled data. The last number in parentheses is the absolute improvement. Each number is a strict F1-score on Name PHI.

PHI: protected health information.

As shown in Table 2, there is a significant performance drop when the model is deployed on a different dataset. For instance, the model trained on i2b2-2014 has a testing performance of 96.09 (F1-score in Name PHI) in i2b2-2014 but only 76.61 in i2b2-2006. Through self-training, the domain adaptation performance is boosted from 76.61 to 85.41 (+8.80) when training on i2b2-2014 and testing on i2b2-2006. On average, the improvement is 8.2 for the pairs within the i2b2 datasets (i2b2-2006 and i2b2-2014) and 8.6 for the pairs within the mimic datasets (mimic-discharge and mimic-radiology), while the proposed framework yields an average F1-score improvement of 3.86 in the other pairwise experiments.
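The averages quoted above can be reproduced from the improvements listed in Table 2. The short script below is only an arithmetic check; the grouping helper `family` is a convenience introduced here.

```python
# Absolute Name PHI F1 improvements from Table 2, keyed by (source, target).
improvements = {
    ("i2b2-2006", "i2b2-2014"): 7.60,
    ("i2b2-2006", "mimic-discharge"): 4.70,
    ("i2b2-2006", "mimic-radiology"): 7.03,
    ("i2b2-2014", "i2b2-2006"): 8.80,
    ("i2b2-2014", "mimic-discharge"): 2.12,
    ("i2b2-2014", "mimic-radiology"): 2.65,
    ("mimic-discharge", "i2b2-2006"): 9.32,
    ("mimic-discharge", "i2b2-2014"): 1.21,
    ("mimic-discharge", "mimic-radiology"): 10.86,
    ("mimic-radiology", "i2b2-2006"): 2.32,
    ("mimic-radiology", "i2b2-2014"): 1.56,
    ("mimic-radiology", "mimic-discharge"): 6.34,
}


def family(name: str) -> str:
    """Dataset family: 'i2b2' or 'mimic'."""
    return name.split("-")[0]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


overall = mean(improvements.values())
within_i2b2 = mean(v for (s, t), v in improvements.items() if family(s) == family(t) == "i2b2")
within_mimic = mean(v for (s, t), v in improvements.items() if family(s) == family(t) == "mimic")
across = mean(v for (s, t), v in improvements.items() if family(s) != family(t))

print(round(overall, 2))       # 5.38
print(round(within_i2b2, 2))   # 8.2
print(round(within_mimic, 2))  # 8.6
print(round(across, 2))        # 3.86
```

The overall mean of 5.38 matches the average improvement reported in the abstract.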

In Figure 2, we illustrate the results of training on i2b2-2014 with 3 random seeds. The proposed framework significantly improves the domain adaptation performance. Results for other datasets are presented in Supplementary Figure S1.

Figure 2.


Cross domain results using i2b2-2014 as source domain. We illustrate the results for training on i2b2-2014 but testing on different datasets. In each plot, the x-axis shows 2 methods: the baseline and our self-training framework. Baseline trains the model with only labeled data, and our method uses both labeled and unlabeled data. The y-axis indicates the strict F1-score in Name PHI. The 95% confidence interval and the median are calculated based on 3 random seeds for model initialization.

With partial labels from target domain

Even though additional labels can be costly to acquire, it is still possible to obtain a few labels for the data from the target domain. In the following experiments, we introduced different percentages of labels from the target domain, and the remaining part of the target dataset remains unlabeled.

To maintain consistency, we only introduce Name PHI labels to the target domain. For example, “1%” means that 1% of sentences in the target dataset have the Name PHI labels and 99% are unlabeled. We conduct the experiments by first training the model on the source domain and then fine-tuning it with the labeled data from the target domain. The self-training framework then further retrains the fine-tuned model with unlabeled data from the target domain (Table 3).

Table 3.

Results for Name PHI in cross domain with partially labeled samples.

Percentage of Labeled Tokens | i2b2-2014 -> i2b2-2006: Number of Tokens | i2b2-2014 -> i2b2-2006: Name F1-Score | i2b2-2006 -> i2b2-2014: Number of Tokens | i2b2-2006 -> i2b2-2014: Name F1-Score
0.1% | 758 | 78.13/86.12 (+7.99) | 487 | 74.93/82.94 (+8.01)
0.5% | 3790 | 88.14/90.95 (+2.81) | 2435 | 86.95/89.07 (+3.02)
1.0% | 7580 | 91.78/93.09 (+1.31) | 4870 | 92.00/93.23 (+1.23)
1.5% | 11 370 | 95.01/95.96 (+0.95) | 7305 | 94.55/95.25 (+0.70)

In this experiment, we fine-tune the model, trained in the source domain, with additional labels from the target domain. i2b2-2014 -> i2b2-2006 means that the source domain is i2b2-2014 and the target domain is i2b2-2006. Each row shows the percentage of labeled samples from the target domain. In each cell, the first number is the performance for training with labeled data only, and the second number is for the proposed self-training framework. The last number in parentheses is the absolute improvement.

PHI: protected health information.

Table 3 shows that the proposed self-training framework improves the domain adaptation performance when different percentages of the samples in the target domain are labeled. We observed that a few additional labeled samples significantly boosted domain adaptation performance. For example, given 1.5% labeled sentences from target domain, the fine-tuned model achieved an F1-score of 95.01 when trained on i2b2-2014 and tested on the i2b2-2006. The proposed self-training further improves the performance by using unlabeled data, and the improvement is larger when the percentage of labeled data is smaller, eg, +7.99 when 0.1% of the data are labeled in adapting the model trained on i2b2-2014 to i2b2-2006.

Reasons for performance improvement

In the cross-domain without-labels experiments, we observe that the proposed self-training framework improves the domain adaptation performance significantly. In this section, we investigate how the self-training framework improves performance. To answer this question, we gather all the wrong entities from the testing cohort (target domain), in which the wrong entity is the entity with the wrong predictions after the initial training phase. As a reminder, an entity is a list of words, eg, “John Snow” is the Name entity and “John” is a word. We monitor what kind of wrong entity is more likely to be corrected by the proposed self-training framework. We take the source domain as i2b2-2014 and target domain as i2b2-2006 as an example and conduct the investigation for the first 20 epochs of retraining.

We present the figures in Supplementary Appendix C. According to Supplementary Figure S2, we observe that self-training (after 20 epochs of retraining) helps to correct the whole entity when it contains at least 1 correct word. Specifically, among entities with at least 1 correct word, 37.8% are corrected by self-training, whereas only 11.8% of the remaining entities are corrected. Intuitively, if an entity has a correct word, the self-training framework can make the whole entity more compatible with the target domain, which finally corrects the whole entity.

In addition to entity-based analysis, we present the word-based analysis in Supplementary Figure S3. We observed that the self-training framework is more likely to correct a word with a higher predicted probability of a true label. As shown in epoch 0 (just after the initial phase), the median probability of the corrected words is around 20%, while it is around 0% for the uncorrected words. In addition, we observe that self-training gradually improves the probability of true labels for most words, so that some of them are eventually corrected by the proposed self-training framework. We hypothesize that the improvement from self-training is largely due to gradually adapting the model to target domain distribution, so that some borderline errors can be corrected. For the cases in which the predictions are very wrong, the framework does not significantly improve the performance.

Ablation experiments

Unlabeled data from source domain experiments

In this section, we present experiments where labeled and unlabeled data come from the same domain. Given a dataset, we intentionally remove a certain percentage of labels. The experiments were conducted with 4 different percentages of labels: 3%, 10%, 30%, and 100% (eg, 10% means 90% of labels are removed). Notably, 100% is equivalent to a fully supervised learning setting as all training samples have labels. Because training and testing cohorts are from the same domain, we use the micro strict F1-score as the metric for these experiments.

In a fully supervised setting, our baseline model achieves performance comparable with other state-of-the-art models (Table 4). For example, in the i2b2-2014 dataset, our model achieves a strict micro F1-score of 94.01, which is close to the 95.47 previously achieved by Tang et al.32 We observed that our framework improves the performance at every percentage of labels. We illustrate the results of i2b2-2014 in Figure 3. The results for the other datasets are presented in Supplementary Figure S4.

Table 4.

Results for fully supervised and unlabeled data from source domain experiments.

Dataset | 100% (Fully Supervised) | 30% | 10% | 3%
i2b2-2006 | 97.15/– (–) | 96.21/96.57 (+0.36) | 93.54/94.01 (+0.47) | 87.11/89.20 (+2.10)
i2b2-2014 | 94.01/– (–) | 91.62/92.07 (+0.45) | 87.29/88.71 (+1.42) | 78.28/82.31 (+4.03)
mimic-discharge | 98.21/– (–) | 97.46/97.69 (+0.23) | 94.67/95.51 (+0.84) | 90.74/92.57 (+1.83)
mimic-radiology | 97.83/– (–) | 96.01/96.52 (+0.51) | 93.75/94.47 (+0.72) | 89.79/90.51 (+0.72)

In this experiment, both labeled and unlabeled data came from the same domain. The numbers in each column are the percentage of the remaining labels. For example, 30% indicates that the model is trained on 30% of labeled and 70% of unlabeled samples. In each cell, the first number is the performance for supervised learning and the second number is for our self-training framework. The numbers in parentheses are the absolute improvements. Each number is a micro strict F1-score.

PHI: protected health information.

Figure 3.


Unlabeled data from source domain results using i2b2-2014. We illustrate the results for training with different percentages of labeled data on i2b2-2014. The title of each plot indicates the percentage of labeled data (eg, 3% means that 97% of data are unlabeled). The fully supervised is equivalent to training with 100% labeled data. The x-axis shows the supervised baseline and our self-training framework, and the y-axis shows the micro strict F1-score on all protected health information.

Different methods for pretrained embeddings

We investigate 3 embeddings: GloVe (the embedding used in the aforementioned experiments), BERT (a contextual word embedding), and the matching GloVe. The matching GloVe embedding is generated by fine-tuning GloVe on the unlabeled data from the target domain. We conducted the experiments with i2b2-2014 as the source domain and i2b2-2006 as the target domain and show the results in Supplementary Appendix E.

As shown in Supplementary Table S1, the contextual embedding approach (BERT) does not improve the domain adaptation performance; a similar observation was reported in other NER studies.15 Fine-tuning the GloVe embedding improves the domain adaptation performance, and the proposed self-training framework further improves upon the matching GloVe embedding.

DISCUSSION

Although our framework is effective, there is still a gap in domain adaptation when the target domain is different from the source domain. For example, in the cross-domain without-labels experiments, the model trained on i2b2-2006 achieved an F1-score of 62.09 in de-identifying the Name PHI of mimic-radiology, and the proposed self-training framework only increased the score by 2.32. We only consider Name PHI for domain adaptation experiments because it is annotated with a consistent guideline. For other PHI without consistent guidelines, such as Username, we suspect that the deep learning model would suffer from a more significant domain adaptation issue. This observation highlights the risk because it is often unknown how different the source and target domains could be. Moreover, even though a few additional labels can significantly improve the domain adaptation performance, it is time-consuming to acquire new annotations in every deployment. Therefore, we suggest a careful design and a comprehensive evaluation with consideration of the domain adaptation issue.

In this work, we mainly explored using a self-training framework to address the domain adaptation issue. We focused on the self-training framework because of its promising performance in robustness, as shown in recently published work.33 However, alternative algorithms are also worth exploring in future studies.31,34,35 A recent survey in natural language processing has grouped domain adaptation algorithms into 3 categories, namely model-centric, data-centric, and hybrid.36 Data-centric algorithms focus on improving domain adaptation by including additional unlabeled data; our framework belongs to this category. There are also model-centric algorithms, such as the domain-invariant feature learning proposed by Zhao et al.35 Their method was motivated by the reasoning that invariant model features should improve the robustness of the model in the target domain. Exploring different methods to address the domain adaptation issue in de-identification will be our next step.
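To make the data-centric idea concrete, the following is a minimal pseudo-labeling sketch, not the framework evaluated in this article. It assumes a toy nearest-centroid classifier on 1-dimensional features in place of a neural sequence tagger, and the function names, the confidence threshold of 0.9, and the synthetic data are all hypothetical. The target-domain labels are used only to score the result, never for training.

```python
import numpy as np


def fit_centroids(x, y, n_classes):
    # One centroid per class: the mean of that class's training points.
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, x):
    # Softmax over negative squared distances to each centroid.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(logits)
    return prob / prob.sum(axis=1, keepdims=True)


def self_train(x_src, y_src, x_tgt, n_classes=2, rounds=3, threshold=0.9):
    # Start from the model trained on labeled source data only.
    centroids = fit_centroids(x_src, y_src, n_classes)
    for _ in range(rounds):
        prob = predict_proba(centroids, x_tgt)
        pseudo = prob.argmax(axis=1)
        keep = prob.max(axis=1) >= threshold  # confident pseudo-labels only
        # Retrain on source labels plus confident target pseudo-labels.
        x_all = np.concatenate([x_src, x_tgt[keep]])
        y_all = np.concatenate([y_src, pseudo[keep]])
        centroids = fit_centroids(x_all, y_all, n_classes)
    return centroids


def accuracy(centroids, x, y):
    return float((predict_proba(centroids, x).argmax(axis=1) == y).mean())


# Source domain: two well-separated classes. Target domain: same classes, shifted by +1.2.
x_src = np.array([[-1.0], [0.0], [1.0], [3.0], [4.0], [5.0]])
y_src = np.array([0, 0, 0, 1, 1, 1])
x_tgt = x_src + 1.2
y_tgt = y_src.copy()  # held out, used for evaluation only

acc_before = accuracy(fit_centroids(x_src, y_src, 2), x_tgt, y_tgt)
acc_after = accuracy(self_train(x_src, y_src, x_tgt), x_tgt, y_tgt)
print(acc_before, acc_after)
```

In this toy setting the source-only model misclassifies the one target point that the shift pushed across the decision boundary, and retraining on confident pseudo-labels moves the boundary enough to recover it. The threshold matters: a pseudo-label that is confidently wrong would be reinforced rather than corrected, which is one reason self-training is not guaranteed to help.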

In addition to algorithm development, understanding why domain adaptation algorithms work is another active research topic.37 For example, Carmon et al33 showed that including unlabeled data could reduce the sample complexity of labeled data needed to achieve a certain level of robustness. In other words, by using unlabeled data, fewer labeled data are required to achieve a certain level of domain adaptation performance, such as 90% accuracy in the target domain. However, Carmon et al derived their theoretical guarantee for a simple Gaussian estimation problem, and it cannot be generalized to deep learning problems. Our empirical experiments have shown that our self-training framework can boost domain adaptation performance, in addition to robustness to noise; however, an exact theoretical proof and a general explanation are not known because of the complexity of deep learning models. In summary, we would like to highlight that the domain adaptation issue remains an open problem in both algorithm development and theoretical understanding. More research efforts are required to address this issue, especially in clinical scenarios.

CONCLUSION

This work aims to address the domain adaptation issue in automated de-identification systems. In our experiments, we showed that the F1-score could drop when the model is trained on one dataset but tested on another. To address this issue, we proposed a self-training framework that improves domain adaptation performance by learning from unlabeled data in the target domain. To evaluate our proposed framework, we conducted experiments on 4 de-identification datasets, and the proposed framework yielded an average F1-score improvement of 5.38, with a maximum of 10.86. Even though our proposed self-training framework is effective, we highlight that the domain adaptation issue remains an open problem. Therefore, careful evaluation is recommended before deployment. In future research, we will investigate the use of current de-identification systems in real-world workflows and compare different domain adaptation methods for the EHR de-identification problem.

FUNDING

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

AUTHOR CONTRIBUTIONS

SL and TC developed and implemented the core algorithm. All the authors substantially contributed to the design of the work, revising it critically, and the final approval. They have also helped in resolving related questions.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.


ACKNOWLEDGMENTS

The authors would like to thank Shengyang Sun, Hyunmin Lee, and Vivek Natarajan for their feedback on the manuscript.

DATA AVAILABILITY STATEMENT

i2b2-2006 and i2b2-2014 are available at www.i2b2.org/. For the MIMIC-III dataset, the records are available at mimic.physionet.org/, and the personal health information annotations are private.

CONFLICT OF INTEREST STATEMENT

We have no competing interests to declare.

REFERENCES

  • 1. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med 2018; 1: 18.
  • 2. Neamatullah I, Douglass MM, Lehman LH, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak 2008; 8: 32.
  • 3. Cohen IG, Mello MM. HIPAA and protecting health information in the 21st Century. JAMA 2018; 320 (3): 231–2.
  • 4. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc 2017; 24 (3): 596–606.
  • 5. Liu Z, Tang B, Wang X, Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform 2017; 75S: S34–S42.
  • 6. Lee HJ, Zhang Y, Roberts K, Xu H. Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation. AMIA Annu Symp Proc 2017; 2017: 1070–9.
  • 7. Hartman T, Howell MD, Dean J, et al. Customization scenarios for de-identification of clinical notes. BMC Med Inform Decis Mak 2020; 20: 14.
  • 8. Murphy SN, Weber G, Mendis M, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc 2010; 17 (2): 124–30.
  • 9. Yogarajan V, Pfahringer B, Mayo M. A review of automatic end-to-end de-identification: is high accuracy the only metric? Appl Artif Intell 2020; 34 (3): 251–69.
  • 10. Ben-David S, Urner R. On the hardness of domain adaptation and the utility of unlabeled target samples. In: ALT 2012: Algorithmic Learning Theory. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 7568. Berlin, Germany: Springer; 2012: 139–53.
  • 11. Lee D-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: ICML 2013 Workshop on Challenges in Representation Learning; 2013.
  • 12. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3: 160035.
  • 13. Ren M, Triantafillou E, Ravi S, et al. Meta-learning for semi-supervised few-shot classification. arXiv, doi: https://arxiv.org/abs/1803.00676, 2 Mar 2018, preprint: not peer reviewed.
  • 14. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL HLT 2019 - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2019: 4171–86.
  • 15. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc 2007; 14 (5): 550–63.
  • 16. Grouin C, Zweigenbaum P. Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches. Stud Health Technol Inform 2013; 192: 476–80.
  • 17. Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol 2010; 10: 70.
  • 18. Yadav S, Ekbal A, Saha S, Bhattacharyya P. Deep learning architecture for patient data de-identification in clinical records. In: Proceedings of the Clinical Natural Language Processing Workshop (Clinical NLP); 2016: 32–41.
  • 20. Lee HJ, Wu Y, Zhang Y, Xu J, Xu H, Roberts K. A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 2017; 75S: S19–S27.
  • 21. Kouw WM, Loog M. A review of domain adaptation without target labels. arXiv, doi: https://arxiv.org/abs/1901.05335, 16 Jan 2019, preprint: not peer reviewed.
  • 22. Xie Q, Dai Z, Hovy E, Luong MT, Le QV. Unsupervised data augmentation for consistency training. arXiv, doi: https://arxiv.org/abs/1904.12848, 5 Nov 2020, preprint: not peer reviewed.
  • 23. Xie Q, Luong MT, Hovy E, Le QV. Self-training with noisy student improves ImageNet classification. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020.
  • 24. Inoue N, Furuta R, Yamasaki T, Aizawa K. Cross-domain weakly-supervised object detection through progressive domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018: 5001–9.
  • 25. Raghunathan A, Xie SM, Yang F, Duchi J, Liang P. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv, doi: https://arxiv.org/abs/2002.10716, 25 Feb 2020, preprint: not peer reviewed.
  • 26. Goyal A, Gupta V, Kumar M. Recent named entity recognition and classification techniques: a systematic review. Comput Sci Rev 2018; 29: 21–43.
  • 27. Deleger L, Molnar K, Savova G, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc 2013; 20 (1): 84–94.
  • 28. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014; 15 (56): 1929–58.
  • 29. Akbik A, Bergmann T, Blythe D, Rasul K, Schweter S, Vollgraf R. FLAIR: an easy-to-use framework for state-of-the-art NLP. In: NAACL HLT 2019 - Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2019: 54–59.
  • 30. Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks. In: NIPS ’12: Proceedings of the 25th International Conference on Neural Information Processing Systems, Vol. 1; 2012: 1223–31.
  • 31. Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS. Deep learning for visual understanding: a review. Neurocomputing 2016; 187: 27–48.
  • 32. Tang B, Jiang D, Chen Q, Wang X, Yan J, Shen Y. De-identification of clinical text via Bi-LSTM-CRF with neural language models. AMIA Annu Symp Proc 2019; 2019: 857–63.
  • 33. Carmon Y, Raghunathan A, Schmidt L, Liang P, Duchi JC. Unlabeled data improves adversarial robustness. arXiv, doi: https://arxiv.org/abs/1905.13736, 4 Dec 2019, preprint: not peer reviewed.
  • 34. Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. arXiv, doi: https://arxiv.org/abs/2005.14165, 22 Jul 2020, preprint: not peer reviewed.
  • 35. Zhao H, Des Combes RT, Zhang K, Gordon GJ. On learning invariant representations for domain adaptation. In: 36th International Conference on Machine Learning (ICML 2019); 2019.
  • 36. Ramponi A, Plank B. Neural unsupervised domain adaptation in NLP: a survey. arXiv, doi: https://arxiv.org/abs/2006.00632, 28 Oct 2020, preprint: not peer reviewed.
  • 37. Wei C, Shen K, Chen Y, Ma T. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv, doi: https://arxiv.org/abs/2010.03622, 16 Jun 2021, preprint: not peer reviewed.



Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
