A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records

Sicheng Zhou; Nan Wang; Liwei Wang; Ju Sun; Anne Blaes; Hongfang Liu; Rui Zhang

doi:10.1016/j.csbj.2023.08.018

2023 Aug 22;22:32–40. doi: 10.1016/j.csbj.2023.08.018

PMCID: PMC10480628 PMID: 37680211

Abstract

Objective

Transformer-based language models are prevailing in the clinical domain due to their excellent performance on clinical NLP tasks. The generalizability of those models is usually ignored during the model development process. This study evaluated the generalizability of CancerBERT, a Transformer-based clinical NLP model, along with classic machine learning models, i.e., conditional random field (CRF), bi-directional long short-term memory CRF (BiLSTM-CRF), across different clinical institutes through a breast cancer phenotype extraction task.

Materials and methods

Two clinical corpora of breast cancer patients were collected from the electronic health records of the University of Minnesota (UMN) and Mayo Clinic (MC), and annotated following the same guideline. We developed three types of NLP models (i.e., CRF, BiLSTM-CRF and CancerBERT) to extract cancer phenotypes from clinical texts. We evaluated the generalizability of the models on different test sets with different learning strategies (model transfer vs locally trained). We also assessed the association between the entity coverage score and model performance.

Results

We manually annotated 200 and 161 clinical documents at UMN and MC. The corpora of the two institutes were found to have higher similarity between the target entities than the overall corpora. The CancerBERT models obtained the best performances among the independent test sets from two clinical institutes and the permutation test set. The CancerBERT model developed in one institute and further fine-tuned in another institute achieved reasonable performance compared to the model developed on local data (micro-F1: 0.925 vs 0.932).

Conclusions

The results indicate that the CancerBERT model has superior learning ability and generalizability among the three types of clinical NLP models for our named entity recognition task. It has an advantage in recognizing complex entities, e.g., entities with different labels.

Keywords: Natural language processing, Electronic health records, Information extraction, Generalizability

Highlights

• Evaluated the impact of corpus heterogeneity on the generalizability of a series of NLP models, including CRF, BiLSTM-CRF and BERT-based models.
• Novel analysis of the robustness of machine learning models using a permutation dataset.
• Compared two strategies for transferring machine learning models between clinical institutes: i) direct transfer and ii) continuous fine-tuning.

1. Introduction

With the widespread adoption of electronic health records (EHR), natural language processing (NLP) methods have gained momentum in leveraging clinical texts for clinical decision support and research purposes [1]. The number of articles on PubMed about “clinical NLP” and “EHR” increased from 24 in 2017 to 70 in 2022. NLP methods have been widely adopted in applications such as real-time cancer case identification [2], medical prescription classification [3], and automatic extraction of heart failure information from EHR [4]. NLP also plays an important role in clinical data management, including clinical data cleaning [5], [6], data interoperability improvement [7], [8] and EHR data capture [9], [10]. These NLP methods can be broadly classified into symbolic and statistical approaches [11]. Symbolic approaches dominated the clinical domain in the early years because they could meet the basic information needs of many clinical applications without requiring a large amount of annotated data, which demands intensive human labor [12]. Furthermore, the interpretability of their results was an advantage [12]. However, symbolic approaches suffer from a notable limitation in terms of portability [1]. In recent years, benefiting from the development of large language models (transformer-based models [13]) and the growth of annotated data in the clinical domain, statistical approaches have developed rapidly and achieved remarkable performance on various clinical NLP tasks. Nonetheless, a substantial gap exists between the performance of these large language models and the understanding of their generalizability [14]. Successfully deployed clinical NLP systems are often developed based on data from a single healthcare institute, and their performance can vary when they are applied in different healthcare institutes [15].
The variations in EHR platforms, clinical documentation rules, and conventions across healthcare institutions can lead to divergent clinical texts even when documenting highly similar medical events. These variations encompass both syntactic and semantic differences [16], and they tend to accumulate across the entire clinical corpus. Consequently, the impact of these variations on the generalizability of NLP models developed within a single healthcare institution remains an important research gap.

Limited research has been conducted to investigate the generalizability of clinical NLP models. For instance, Sohn et al. examined the portability of a rule-based NLP system designed to identify asthma patients within a clinical cohort [1]. The NLP system was developed based on data from a single hospital and was subsequently evaluated using an external cohort, revealing a significant drop in performance due to variations in clinical documentation [1]. Fan et al. externally evaluated the cTAKES tool for a part-of-speech tagging task and found the performance dropped by 5 % when the tool was tested on data from another clinical institute. In another study, a rule-based NLP system was developed to identify patients with a family history of pancreatic cancer and tested on data from two different clinical institutions [17]. The findings indicated that, for highly specific NLP tasks, the rule-based system demonstrated portability as long as the rules remained simple and could be updated using new data. Liu et al. [18] evaluated the performance of an NLP system in detecting smoking status across multiple institutions. The system incorporated both machine learning and rule-based components. The results suggested that moderate efforts were required to make the NLP system portable, including annotating additional data to further train the machine learning module and incorporating new rules based on the new data. In a recent study, a rule-based NLP model was developed to extract social determinants from the clinical notes. Though clinical notes were mostly consistent in describing social determinants at geographically distinct institutions, the accuracy dropped by around 6 % when externally evaluating the model. Institution-specific modification of rules is necessary to maintain the generalizability of the model [19]. These studies primarily focused on assessing the generalizability of rule-based and traditional machine learning methods. 
However, there is currently a dearth of studies investigating the generalizability of transformer-based models in the clinical domain [14]. Khambete et al. explored the generalization of the SciBERT model on a clinical sentiment classification task. SciBERT was trained on data from one medical specialty in MIMIC III, and when it was tested on data from other medical specialties, the AUC dropped by about 8 %. However, the MIMIC III data cannot reflect the data heterogeneity among different clinical institutes [20].

Transformer-based models have demonstrated outstanding performance in various clinical NLP tasks. However, the development of these models necessitates substantial computing resources and human effort for data annotation. Furthermore, privacy concerns often prevent the sharing of annotated clinical data, posing challenges to the development of generalizable models. Understanding the generalizability of Transformer-based models is therefore important in the clinical NLP domain: if models are portable and generalizable across clinical institutions, significant labor and computational resources can be saved by not training similar models at each site. We recently introduced CancerBERT, a cancer-specific language model designed to identify eight breast cancer phenotypes using clinical texts from breast cancer patients obtained from the University of Minnesota's M Health Fairview (UMN) [21]. CancerBERT was developed based on the Biomedical Language Understanding Evaluation Bidirectional Encoder Representations from Transformers (BlueBERT) language model [22] and exhibited exceptional performance on the breast cancer phenotype extraction task. However, the CancerBERT models were exclusively developed and evaluated using the EHR corpus from UMN. The objective of this study is to assess the generalizability of CancerBERT models by evaluating their performance on a separate corpus collected from Mayo Clinic (MC), another healthcare institution. Additionally, we conducted evaluations and comparisons with other benchmark models, including conditional random field (CRF) models and bi-directional long short-term memory CRF (BiLSTM-CRF) models. We evaluated the generalizability of the models from various perspectives using a clinical information extraction task. The contributions of this study include:

1. We assessed the impact of corpus heterogeneity on the generalizability of Bidirectional Encoder Representations from Transformers (BERT) based NLP models.
2. We constructed a permutation dataset to analyze the robustness of the models.
3. We compared two strategies for transferring models between clinical institutions: i) direct transfer and ii) continuous fine-tuning.

The remainder of this paper is structured as follows: Section 2 describes the methodology, covering the data preparation and experimental setup, including the details of the datasets used and the evaluation metrics for each step. The results of the experiments are summarized in Section 3, following the same order as Section 2. In Section 4, we interpret the results, draw connections with previous studies, and explore the implications of our findings. We also discuss the limitations of the current study in that section and propose directions for future research.

2. Methods and materials

2.1. Overview of the study

This study was approved by the institutional review boards (IRB) of UMN and MC. Fig. 1 shows the pipeline of the study. We collected the clinical texts of breast cancer patients from the EHRs of the two institutes (UMN and MC). The MC team annotated their corpus following the same annotation guideline previously defined by the UMN team [21]. The CancerBERT_UMN models, along with the benchmark CRF_UMN and BiLSTM-CRF_UMN models developed on the UMN corpus, were originally designed to extract eight types of breast cancer phenotypes (i.e., Hormone receptor type, Hormone receptor status, Tumor size, Tumor site, Cancer grade, Histological type, Cancer laterality, and Cancer stage) from the clinical texts. We externally evaluated the performance of the models trained at the UMN site using data from the MC site. Additionally, we compared two transfer learning strategies for the NLP models: (1) UMN models further fine-tuned at the MC site (MC refined models) and (2) models trained solely on local data at the MC site (MC models). Furthermore, we conducted two experiments to explore the advantages of BERT-based models over traditional BiLSTM-CRF and CRF models, which served as baseline models, in terms of model robustness.

2.2. Data sources

The data used in this study was obtained from the EHRs of two clinical institutes, i.e., the UMN Clinical Data Repository, and MC. The EHR data from the UMN contains the health records of 17,970 breast cancer patients from the years 2001–2018. The EHR data from MC contain 54,050 breast cancer patients from 2000 to 2020.

2.3. Manual annotation and corpora comparison

The annotation schema was introduced in the previous study [21]. We used the same schema to annotate the clinical texts extracted from the MC EHR. We randomly sampled one hundred breast cancer patients, and for each patient, we randomly sampled one clinical note and one pathology report within 90 days after the diagnosis date for annotation. This resulted in 81 clinical notes and 80 pathology reports. Two annotators independently annotated 10 % of the texts, and the inter-annotator agreement (IAA), measured by Cohen’s Kappa, was 0.80. Then, one annotator completed all annotation tasks.
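
As a rough illustration of how an agreement score of this kind can be computed, the sketch below implements Cohen's Kappa for two annotators. The function name `cohens_kappa`, the token-level labels, and the toy label sequences are assumptions made for illustration; the source does not state at which granularity agreement was measured.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels (not study data).
annotator_1 = ["O", "SIZE", "O", "GRADE", "O", "O", "STAGE", "O"]
annotator_2 = ["O", "SIZE", "O", "O", "O", "O", "STAGE", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 7 of 8 tokens, chance agreement is 0.5, and the resulting kappa is 0.75.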

To better understand the impact of corpus differences across sites on the performance of NLP methods, we compared the annotated clinical texts from the two sites. Several basic corpus statistics, such as the number of sentences, the average length of the sentences, and the number of unique tokens, were summarized and compared. Additionally, we measured the similarity between the two corpora by representing them as normalized term frequency-inverse document frequency (TF-IDF) vectors and calculating the cosine similarity [1]. Each corpus vector contains the average TF-IDF values for the unique terms in the corpus. The average TF-IDF value of term t is defined in Eq. (1), where N is the total number of documents in the corpus, TF_i(t) is the frequency of token t in the ith document divided by the total number of tokens in the ith document, and IDF(t) is the log of N divided by the number of documents containing token t.

$$\mathrm{TF\text{-}IDF}(t) = \sum_{i=1}^{N} \mathrm{TF}_{i}(t) \cdot \frac{\mathrm{IDF}(t)}{N} \tag{1}$$
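
The corpus comparison can be sketched as follows, under stated assumptions. The sketch builds one average TF-IDF vector per corpus following Eq. (1) and compares two such vectors with cosine similarity. The function names, the toy token lists, and the reading of IDF(t) as log(N / number of documents containing t) are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def corpus_vector(documents):
    """Average TF-IDF value per term (Eq. 1); documents are lists of tokens."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    vector = Counter()
    for doc in documents:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])  # assumed form of IDF(t)
            vector[term] += tf * idf / n_docs
    return dict(vector)


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpora (hypothetical tokens, not study data).
corpus_a = [
    ["er", "positive", "grade", "2"],
    ["her2", "negative", "left", "breast"],
    ["grade", "3", "right", "breast"],
]
corpus_b = [["er", "positive", "tumor"], ["grade", "2", "left"]]

vec_a, vec_b = corpus_vector(corpus_a), corpus_vector(corpus_b)
similarity = cosine_similarity(vec_a, vec_b)
```

Because cosine similarity divides by both vector norms, the vectors do not need to be normalized beforehand in this sketch.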

Besides the corpus-level similarity, we also calculated the similarity of each breast cancer phenotype category (phenotype similarity) between the two corpora. Similarly, each breast cancer phenotype was represented by a normalized TF-IDF vector that contains the average TF-IDF values of distinct terms under each breast cancer phenotype category. Cosine similarity was calculated for the same breast cancer phenotype category from the two corpora. The phenotype similarity is used to investigate the relationship between performance changes resulting from model transfer and the similarity between the corpora. Specifically, we calculated the Pearson correlation coefficients [23] between the phenotype similarities and the performance drops of each model. The Pearson correlation coefficient is defined in Eq. (2), where $X_{i}$ refers to the phenotype similarity between the two corpora, $\bar{X}$ is the mean of the similarity scores of all phenotypes, $Y_{i}$ is the performance drop (change in F1 score) for the corresponding phenotype for a specific model, and $\bar{Y}$ is the mean of the performance drops for all phenotypes.

$$r = \frac{\sum_{i} (X_{i} - \bar{X})(Y_{i} - \bar{Y})}{\sqrt{\sum_{i} (X_{i} - \bar{X})^{2} \cdot \sum_{i} (Y_{i} - \bar{Y})^{2}}} \tag{2}$$

The performance drops were quantified as the differences between the performances of UMN models evaluated on the UMN test set [21] and the performances of UMN models directly evaluated on the MC test set.
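
A minimal sketch of the Pearson correlation coefficient in Eq. (2) is given below. The function name `pearson_r` and the input values are assumptions: the numbers are deliberately artificial and perfectly linear, chosen to make the computation easy to check, and are not similarities or performance drops from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Artificial values: higher similarity paired with a smaller drop.
toy_similarity = [0.70, 0.80, 0.90, 1.00]
toy_drop = [0.40, 0.30, 0.20, 0.10]
r = pearson_r(toy_similarity, toy_drop)
```

With these artificial inputs the relationship is exactly linear and decreasing, so r is -1 up to floating-point error.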

2.4. Portability evaluation for breast cancer phenotype extraction models

We evaluated the generalizability of three types of machine learning models in this study, i.e., CRF, BiLSTM-CRF, and CancerBERT [21] models. The CRF and BiLSTM-CRF are conventional machine learning models that have demonstrated efficacy in named entity recognition (NER) tasks within the clinical domain [24], [25], [26]. For the CRF and BiLSTM-CRF baseline models, we experimented with both GloVe (trained on Wikipedia) [27] and Word2vec (trained on Google News) embeddings [28] as input. The CancerBERT models were developed based on the BERT language model and showed superior performance compared to the CRF and BiLSTM-CRF models for our breast cancer phenotype extraction task [21].

The MC annotated data were divided into a training set (60 %), a development set (10 %), and a test set (30 %). The data splitting, model training and fine-tuning procedures were consistent with our previous study [21]. For all CancerBERT variants, the number of parameters remains the same (about 336 million) during the pre-training stage, and a softmax layer was added for NER prediction during the fine-tuning stage. The fine-tuned CancerBERT models use the parameter sharing inherent in the BERT architecture: the weights in the transformer layers are shared across all input tokens. In this study, we evaluated the performance of the following three sets of models on the MC corpus test set.

1. Models originally trained only on the UMN corpus [21]. These UMN models (CRF_UMN, BiLSTM-CRF_UMN, CancerBERT_{UMN_397}, CancerBERT_{UMN_997}) were evaluated on MC data (test set) to test their portability through the task of extracting the breast cancer phenotypes from the clinical texts.
2. UMN models continuously fine-tuned on the MC corpus, including CRF_{MC_refined}, BiLSTM-CRF_{MC_refined}, and CancerBERT_{MC_refined}. To refine the UMN models, we fine-tuned all the UMN locally trained models on the training and development sets of the MC annotated data to obtain the corresponding MC refined models. For example, the CancerBERT_{UMN_397} model was further fine-tuned on the MC annotated data to obtain the CancerBERT_{MC_refined} model.
3. CancerBERT model trained only on the MC corpus (without UMN data). Following the same steps [21], we developed another model, CancerBERT_{MC_397}, which was pre-trained only on the MC corpus (about 5 million clinical documents) and fine-tuned on the training and development sets of the MC annotated data as a benchmark model.

The breast cancer phenotype extraction can be framed as a NER task, and phenotype-level (named entity) evaluation was applied. We used the micro-average (assigning equal weight to each sample) F1 score and the macro-average (assigning equal weight to each category) F1 score as metrics, and both exact match and lenient match were used following the i2b2 standard [29]. The F1 score was calculated as the harmonic mean of the precision and recall scores and provides a better assessment of model performance when data is imbalanced. All experiments were conducted 10 times, and one-way ANOVA tests were conducted to ascertain whether there were significant differences between the mean performances of different models at a 95 % confidence level. When ANOVA reveals statistically significant differences, it does not specify which groups differ from each other. Thus, we further conducted pairwise t-tests with Bonferroni correction, which adjusts the significance level for multiple comparisons so that the probability of making a Type I error remains controlled and is not inflated by multiple testing [30], [31]. The model with the highest performance was considered significantly better than the others if all pairwise t-tests were significant.
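
The difference between micro- and macro-averaged F1 under exact and lenient matching can be sketched as follows. This is an illustrative simplification, not the i2b2 evaluation script: entities are assumed to be tuples of (document id, start offset, end offset, label) with an exclusive end offset, a lenient match is assumed to mean any span overlap with the same label, and all names and toy entities are hypothetical.

```python
def f1_score(gold, pred, lenient=False):
    """Entity-level F1 for sets of (doc_id, start, end, label) tuples."""
    def hit(entity, others):
        if not lenient:
            return entity in others  # exact match: same span and label
        doc, start, end, label = entity
        return any(
            doc == o_doc and label == o_label and start < o_end and o_start < end
            for o_doc, o_start, o_end, o_label in others
        )

    matched_pred = sum(hit(p, gold) for p in pred)
    matched_gold = sum(hit(g, pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, pred, lenient=False):
    """Micro F1 pools all entities; macro F1 averages per-label F1 scores."""
    micro = f1_score(gold, pred, lenient)
    labels = sorted({e[3] for e in gold | pred})
    per_label = [
        f1_score(
            {e for e in gold if e[3] == lab},
            {e for e in pred if e[3] == lab},
            lenient,
        )
        for lab in labels
    ]
    macro = sum(per_label) / len(per_label) if per_label else 0.0
    return micro, macro


# Hypothetical gold and predicted entities (not study data).
gold = {("d1", 0, 2, "SIZE"), ("d1", 5, 6, "GRADE"), ("d2", 3, 4, "SIZE")}
pred = {("d1", 0, 2, "SIZE"), ("d1", 5, 7, "GRADE"), ("d2", 8, 9, "SIZE")}

exact_micro, exact_macro = micro_macro_f1(gold, pred)
lenient_micro, lenient_macro = micro_macro_f1(gold, pred, lenient=True)
```

In the toy data, the predicted GRADE span overlaps the gold span without matching it exactly, so it counts only under lenient matching; this is the kind of gap that separates the strict and lenient columns in the results tables.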

2.5. Permutation test set evaluation for UMN models

The utilization of a permutation dataset allows for the simulation of data variations, thereby providing insights into the generalizability of models when confronted with new data [14], [32]. In this study, we employed the manually created permutation test set to investigate the impact of data variances on model performance. Specifically, the BiLSTM-CRF_UMN, CRF_UMN, and CancerBERT_UMN models were evaluated using this permutation test set. The entity permutation was used since entities are the core of the NER task. To create the permutation test set, we first identified all distinct entities in MC annotated data that are not in the UMN annotated data; then we replaced the entities in the UMN test set with entities that were randomly sampled from MC unique entities under the same category. This approach resulted in a permutation dataset consisting of novel combinations of target entities and their corresponding contextual information. Evaluating the models on this permutation test set enabled us to assess their capacity to learn contextual information of entities rather than relying solely on entity memorization [32]. We evaluated all models developed using the UMN data using the permutation test set. The evaluation schema is the same as the portability evaluation of the models introduced in the previous section.
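
The entity permutation procedure described above can be sketched as follows. The data layout (sentences as lists of text segments with an optional label), the function name `build_permutation_set`, the fixed random seed, and the example entity strings are all assumptions for illustration; the source does not describe the implementation.

```python
import random


def build_permutation_set(test_sentences, source_entities, target_entities, seed=0):
    """Swap labelled entities for entities seen only at the target site.

    test_sentences: list of sentences, each a list of (text, label) segments,
        where label is None for context text.
    source_entities / target_entities: dict mapping label -> set of entity strings.
    """
    rng = random.Random(seed)
    # Entities found at the target site but not at the source site, per category.
    novel = {
        label: sorted(entities - source_entities.get(label, set()))
        for label, entities in target_entities.items()
    }
    permuted = []
    for sentence in test_sentences:
        new_sentence = []
        for text, label in sentence:
            if label is not None and novel.get(label):
                text = rng.choice(novel[label])  # same category, new surface form
            new_sentence.append((text, label))
        permuted.append(new_sentence)
    return permuted


# Hypothetical entity inventories and one test sentence (not study data).
umn_entities = {"GRADE": {"grade 2", "grade 3"}, "LATERALITY": {"left", "right"}}
mc_entities = {"GRADE": {"grade 2", "nottingham grade ii"}, "LATERALITY": {"left", "lt"}}
umn_test = [[
    ("invasive carcinoma ,", None),
    ("grade 2", "GRADE"),
    (",", None),
    ("left", "LATERALITY"),
    ("breast", None),
]]

permuted = build_permutation_set(umn_test, umn_entities, mc_entities)
```

The context segments are left untouched, so each permuted sentence pairs a familiar context with an unfamiliar entity, which is what allows contextual learning to be separated from entity memorization.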

2.6. Evaluation of model generalizability with entity coverage ratio (ECR)

The assessment of machine learning model portability often involves quantifying the changes in standard metrics such as F1 score, precision, and recall when applied to different test sets. However, this approach offers only a broad understanding of portability and lacks fine-grained analysis [33]. To analyze the model portability from a different perspective, we adapted the ECR proposed in a previous study for our NER task [33]. The ECR measures to what degree a target entity in the test set has been seen in the training set with the same category. It is defined in Eq. (3):

$$\mathrm{ECR}(e_{i}) = \begin{cases} 0, & C = 0 \\ \dfrac{\sum_{k=1}^{K} \frac{\#(e_{i}^{tr,k})}{C^{tr}} \cdot \#(e_{i}^{te,k})}{C^{te}}, & \text{otherwise} \end{cases} \tag{3}$$

where $C^{tr} = \sum_{k=1}^{K} \#(e_{i}^{tr,k})$ and $C^{te} = \sum_{k=1}^{K} \#(e_{i}^{te,k})$. Here $e_{i}$ refers to a test entity, $\#(e_{i}^{tr,k})$ is the number of occurrences of entity $e_{i}$ in the training set with label $k$, and $\#(e_{i}^{te,k})$ is the number of occurrences of entity $e_{i}$ in the test set with label $k$. The ECR scores range from 0 to 1 and reflect the difficulty of predicting an entity: an entity with a score of 0 was never seen in the training set and is the most difficult to predict, while higher scores indicate easier entities [33].

In our study, we calculated the ECR scores for the entities in the test set of the UMN annotated data. We further divided the entities in the test set into different groups based on their ECR values, i.e., 0 ≤ ECR < 0.33, 0.33 ≤ ECR < 0.67, 0.67 ≤ ECR < 1, and ECR = 1. We evaluated the UMN models (i.e., the CancerBERT_{UMN_397}, BiLSTM-CRF_UMN, and CRF_UMN models developed solely on UMN data) on the UMN test set and investigated the relationships between ECR values and the models’ performance on the corresponding entities.
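
The ECR calculation and the grouping described above can be sketched as follows. The count tables keyed by (entity text, label), the function names, and the toy counts are assumptions for illustration. The sketch also assumes that the zero case of the definition corresponds to an entity that never occurs in the training set.

```python
from collections import Counter


def entity_coverage_ratio(entity, train_counts, test_counts):
    """ECR of a test entity; counts are Counters keyed by (entity_text, label)."""
    labels = {
        lab
        for ent, lab in list(train_counts) + list(test_counts)
        if ent == entity
    }
    c_tr = sum(train_counts[(entity, k)] for k in labels)
    c_te = sum(test_counts[(entity, k)] for k in labels)
    if c_tr == 0 or c_te == 0:
        return 0.0  # assumed zero case: entity unseen in training
    weighted = sum(
        train_counts[(entity, k)] / c_tr * test_counts[(entity, k)] for k in labels
    )
    return weighted / c_te


def ecr_group(score):
    """Assign an ECR score to one of the four groups used in the analysis."""
    if score == 1.0:
        return "ECR = 1"
    if score >= 0.67:
        return "0.67 <= ECR < 1"
    if score >= 0.33:
        return "0.33 <= ECR < 0.67"
    return "0 <= ECR < 0.33"


# Hypothetical counts (not study data): "ii" is ambiguous between two labels.
train_counts = Counter({("left", "LATERALITY"): 4, ("ii", "GRADE"): 3, ("ii", "STAGE"): 1})
test_counts = Counter({
    ("left", "LATERALITY"): 2,
    ("ii", "GRADE"): 1,
    ("ii", "STAGE"): 1,
    ("lt", "LATERALITY"): 1,
})

ecr_left = entity_coverage_ratio("left", train_counts, test_counts)
ecr_ii = entity_coverage_ratio("ii", train_counts, test_counts)
ecr_lt = entity_coverage_ratio("lt", train_counts, test_counts)
```

In the toy counts, an entity always seen with the same label scores 1, an unseen entity scores 0, and an entity whose label distribution differs between the training and test sets falls in between (0.5 here).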

2.7. Experiment environments at UMN and MC

The corpora comparison and portability evaluation were conducted at MC, where the models were fine-tuned and evaluated on a CentOS server with one Nvidia Tesla V100 GPU. The permutation data evaluation and ECR analysis were conducted at UMN on a Windows server with one Nvidia Tesla V100 GPU. We used the TensorFlow 1.15 framework in all settings.

3. Results

3.1. Comparison of UMN vs MC corpora

We compared the corpora of UMN and MC from various perspectives (Table 1). Overall, the UMN corpus has more clinical documents per patient (190.5 vs 140.8), while the average length of each clinical document was shorter compared to MC (259 tokens vs 361 tokens). Table 1 also displays the number of breast cancer phenotypes annotated from the clinical texts of both sites, with the distinct number of terms for each phenotype entity indicated in parentheses. More entities related to breast cancer phenotypes were annotated from the UMN data. The IAA scores for annotations are 0.91 and 0.80 at UMN and MC sites, respectively. The similarities of phenotypes between the two corpora were calculated using the TF-IDF. The results revealed that breast cancer phenotypes exhibited higher similarity scores compared to the overall corpus (phenotype similarity average: 0.9088 vs overall: 0.5411).

Table 1.

Statistics for annotated pathology reports and clinical texts from UMN and MC. Numbers in parentheses indicate the unique number of terms for each phenotype concept. Phenotype Similarity refers to the cosine similarity of each breast cancer phenotype category between corpora.

	UMN	MC	Phenotype Similarity
Number of annotated documents	200	161	NA
Number of sentences	10452	6165	NA
Number of tokens	266079	110980	NA
Number of unique tokens	9151	5735	NA
Average number of tokens in the sentence	25.5	18.0	NA
Hormone receptor type	1673 (29)	417 (18)	0.9923
Hormone receptor status	436 (14)	178 (12)	0.9815
Tumor size	540 (305)	393 (262)	0.8076
Tumor site	329 (173)	187 (135)	0.7399
Cancer grade	271 (15)	234 (23)	0.8810
Cancer laterality	1192 (4)	1846 (10)	0.9833
Cancer stage	173 (38)	207 (55)	0.9965
Histological type	1070 (95)	726 (77)	0.8880

3.2. Portability evaluations of machine learning-based models and BERT-based models

The portability evaluation results (strict match (lenient match) F1 scores) for the two classic NER models, i.e., CRF and BiLSTM-CRF, are shown in Table 2. We used two word embeddings as input for the models (Glove Wikipedia 6B [27] and Word2Vec Google News [28]), and evaluated the models using MC test set. The results show the performances of models that were directly obtained from the UMN site (CRF_UMN, BiLSTM-CRF_UMN) and the models that were further fine-tuned on MC data (CRF_{MC_refined}, BiLSTM-CRF_{MC_refined}).

Table 2.

Portability evaluation results (strict match (lenient match) F1 scores) for CRF and BiLSTM-CRF models. Note that models with subscript “UMN” indicate models trained only on UMN corpus and models with subscript “MC_refined” are UMN models with continuous fine tuning on MC corpus.

Word embeddings	GloVe Wikipedia 6B		GloVe Wikipedia 6B		Word2Vec Google News		Word2Vec Google News
Models	CRF_UMN	CRF_{MC_refined}	BiLSTM-CRF_UMN	BiLSTM-CRF_{MC_refined}	CRF_UMN	CRF_{MC_refined}	BiLSTM-CRF_UMN	BiLSTM-CRF_{MC_refined}
Hormone Receptor type	0.939 (0.941)	0.944 (0.954)	0.925 (0.925)	0.943 (0.954)	0.945 (0.948)	0.948 (0.951)	0.939 (0.939)	0.918 (0.929)
Hormone Receptor status	0.527 (0.527)	0.542 (0.542)	0.794 (0.794)	0.876* (0.876*)	0.529 (0.529)	0.491 (0.491)	0.867 (0.867)	0.837 (0.837)
Tumor size	0.224 (0.294)	0.694 (0.749)	0.392 (0.509)	0.711* (0.813*)	0.244 (0.350)	0.663 (0.724)	0.367 (0.421)	0.592 (0.738)
Tumor site	0.053 (0.136)	0.303 (0.600)	0.266 (0.400)	0.472 (0.758)	0.044 (0.128)	0.321 (0.596)	0.205 (0.323)	0.361* (0.668*)
Tumor grade	0.647 (0.681)	0.881 (0.925*)	0.903 (0.903)	0.916* (0.916)	0.647 (0.685)	0.861 (0.913)	0.849 (0.849)	0.881 (0.881)
Tumor laterality	0.846 (0.846)	0.935 (0.935)	0.882 (0.882)	0.952* (0.952*)	0.853 (0.853)	0.934 (0.934)	0.874 (0.874)	0.930 (0.930)
Cancer stage	0.773 (0.773)	0.868 (0.868)	0.578 (0.578)	0.891* (0.891*)	0.632 (0.632)	0.873 (0.873)	0.593 (0.593)	0.838 (0.838)
Histological type	0.829 (0.896)	0.937 (0.964)	0.847 (0.917)	0.934 (0.959)	0.845 (0.907)	0.934 (0.965)	0.839 (0.921)	0.899 (0.917)
F1 Macro average	0.605 (0.637)	0.763 (0.817)	0.698 (0.738)	0.837* (0.889*)	0.593 (0.629)	0.753 (0.806)	0.691 (0.723)	0.782 (0.842)
F1 Micro average	0.803 (0.822)	0.883 (0.905)	0.785 (0.814)	0.893* (0.922*)	0.804 (0.825)	0.880 (0.903)	0.777 (0.802)	0.853 (0.885)

Note: The scores are averages over 10 runs. Text in bold indicates the highest performance; * indicates that the score is statistically higher than those of the other methods (confidence level: 95 %).

Table 3 shows the portability evaluation results (strict match (lenient match) F1 scores) for the BERT-based models. We evaluated the CancerBERT models developed at UMN (CancerBERT_UMN), along with two benchmark BERT-based models (BERT-large origin [34] and BlueBERT [22]). The CancerBERT_UMN model has two variants: one with 997 frequency-based customized words in its vocabulary (CancerBERT_{UMN_997}) and the other with 397 knowledge-based customized words (CancerBERT_{UMN_397}). Each UMN model (Table 3, sub-column UMN) was directly evaluated on the MC test set. These models were then further fine-tuned on MC data to obtain MC refined models (Table 3, sub-column MC refined) and evaluated again. In addition, we evaluated the CancerBERT_{MC_397} model trained solely on the MC corpus for comparison. Table 3 gives a comprehensive overview of the performances of models developed using different transfer learning strategies.

Table 3.

Portability evaluation results (strict match (lenient match) F1 scores) for BERT-based models on MC test data set. The column of “UMN” includes models only trained on the UMN corpus, and the column “MC refined” contains models with fine-tuning on MC corpus. The column “MC only” is the model trained only on MC corpus.

Entity type	BERT-large Origin		BlueBERT (PubMed+MIMIC III)		CancerBERT_{UMN_997}		CancerBERT_{UMN_397}		CancerBERT_{MC_397}
	UMN	MC refined	UMN	MC refined	UMN	MC refined	UMN	MC refined	MC only
Hormone Receptor type	0.926 (0.935)	0.967 (0.969)	0.897 (0.911)	0.977 (0.977)	0.923 (0.963)	0.984 (0.988)	0.942 (0.947)	0.975 (0.981)	0.993* (0.993*)
Hormone Receptor status	0.816 (0.816)	0.910 (0.910)	0.897 (0.897)	0.932 (0.932)	0.842 (0.842)	0.901 (0.901)	0.819 (0.819)	0.926 (0.926)	0.943* (0.943*)
Tumor size	0.633 (0.737)	0.837 (0.903)	0.440 (0.648)	0.839 (0.915)	0.595 (0.664)	0.765 (0.813)	0.745 (0.781)	0.864 (0.928*)	0.862 (0.907)
Tumor site	0.666 (0.739)	0.609 (0.760)	0.308 (0.671)	0.590 (0.790)	0.186 (0.742)	0.733* (0.792)	0.709 (0.759)	0.601 (0.786)	0.661 (0.832*)
Tumor grade	0.827 (0.827)	0.909 (0.922)	0.886 (0.886)	0.859 (0.939)	0.869 (0.869)	0.891 (0.891)	0.846 (0.846)	0.927* (0.943*)	0.863 (0.933)
Tumor laterality	0.896 (0.896)	0.928 (0.928)	0.936 (0.936)	0.954 (0.954)	0.928 (0.928)	0.939 (0.939)	0.903 (0.903)	0.959 (0.959)	0.962 (0.962)
Cancer stage	0.774 (0.774)	0.934 (0.934)	0.799 (0.799)	0.953 (0.953)	0.806 (0.806)	0.870 (0.870)	0.829 (0.829)	0.949 (0.949)	0.950 (0.950)
Histological type	0.793 (0.888)	0.926 (0.950)	0.794 (0.902)	0.931 (0.958)	0.828 (0.923)	0.849 (0.922)	0.815 (0.914)	0.934 (0.965)	0.950* (0.981*)
Macro average	0.724 (0.829)	0.874 (0.908)	0.744 (0.831)	0.879 (0.927)	0.747 (0.842)	0.867 (0.889)	0.828 (0.849)	0.892 (0.930)	0.898* (0.932)
Micro average	0.843 (0.877)	0.905 (0.922)	0.817 (0.876)	0.917 (0.943)	0.829 (0.886)	0.903 (0.925)	0.864 (0.906)	0.925 (0.947)	0.932* (0.952*)

Note: The scores are averages over 10 runs. Text in bold indicates the highest performance; * indicates that the score is statistically higher than those of the other methods (confidence level: 95 %).

Based on Tables 2 and 3, all the models show different levels of performance drop on the MC test set compared to their performances on the original UMN test set. After the models were further fine-tuned on MC data, their performances increased significantly.

3.3. Permutation test set evaluation for UMN models

For each type of model, we chose the one with the best performance to evaluate on the permutation dataset. Table 4 shows the performances of the models, as well as the extent of performance changes (measured by F1 scores) compared to their performances on the normal test set. The results show that the CancerBERT_{UMN_397} model has the smallest drop in macro- and micro-averaged F1 scores among the three models. This finding highlights the robustness of the CancerBERT_{UMN_397} model in handling new data from a different institution.

Table 4.

Evaluation results (strict match (lenient match) F1 scores) for CancerBERT_{UMN_397}, BiLSTM-CRF_UMN, and CRF_UMN models on permutation dataset. Changed F1 score columns show the differences between F1 scores on the permutation set compared to the F1 scores on the normal test set.

Entities	CRF_UMN	Changed F1 score	BiLSTM-CRF_UMN	Changed F1 score	CancerBERT_{UMN_397}	Changed F1 score
Hormone Receptor type	0.295 (0.646)	–0.649 (–0.308)	0.448 (0.559)	–0.495 (–0.395)	0.634 (0.842)	–0.341* (–0.139*)
Hormone Receptor status	0.061 (0.061)	–0.481 (–0.481)	0.000 (0.000)	–0.876 (–0.876)	0.564 (0.564)	–0.362* (–0.362*)
Tumor size	0.434 (0.582)	–0.260 (–0.167)	0.659 (0.832)	–0.052* (0.019*)	0.742 (0.921)	–0.122 (0.007)
Tumor site	0.314 (0.568)	0.110* (–0.018*)	0.220 (0.615)	–0.252 (–0.143)	0.267 (0.736)	–0.334 (–0.050)
Tumor grade	0.596 (0.596)	–0.285 (–0.329)	0.575 (0.575)	–0.341 (–0.341)	0.733 (0.733)	–0.194* (–0.210*)
Tumor laterality	0.001 (0.001)	–0.934 (–0.934)	0.000 (0.000)	–0.952 (–0.952)	0.620 (0.620)	–0.339* (–0.339*)
Cancer stage	0.450 (0.450)	–0.418 (–0.418)	0.310 (0.310)	–0.581 (–0.581)	0.863 (0.863)	–0.086* (–0.086*)
Histological type	0.435 (0.794)	–0.502 (–0.170)	0.296 (0.718)	–0.638 (–0.241)	0.552 (0.818)	–0.382* (–0.147*)
F1 Macro average	0.323 (0.462)	–0.440 (–0.355)	0.313 (0.451)	–0.524 (–0.438)	0.622 (0.761)	–0.270* (–0.169*)
F1 Micro average	0.354 (0.558)	–0.529 (–0.347)	0.210 (0.328)	–0.683 (–0.594)	0.616 (0.722)	–0.309* (–0.225*)


Note: Bold text indicates the smallest change in F1 score. Results are averages over 10 runs; * indicates that the change in F1 score is statistically significantly smaller than those of the other methods (confidence level: 95%).
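As a quick consistency check on Table 4, the macro-average row can be recomputed as the unweighted mean of the eight per-entity F1 scores. The sketch below does this for the strict-match scores of two of the models; the dictionary keys and the function name are illustrative choices, not names from the study. The micro average cannot be recomputed this way, because it depends on per-entity counts that the table does not report.

```python
# Per-entity strict-match F1 scores on the permutation set, copied from Table 4
# in row order (Hormone Receptor type ... Histological type).
strict_f1 = {
    "CRF_UMN": [0.295, 0.061, 0.434, 0.314, 0.596, 0.001, 0.450, 0.435],
    "CancerBERT_UMN_397": [0.634, 0.564, 0.742, 0.267, 0.733, 0.620, 0.863, 0.552],
}


def macro_average(scores):
    """Unweighted mean of per-entity scores (every entity type counts equally)."""
    return sum(scores) / len(scores)


macro = {model: round(macro_average(f1), 3) for model, f1 in strict_f1.items()}
print(macro)  # {'CRF_UMN': 0.323, 'CancerBERT_UMN_397': 0.622}
```

Both values agree with the "F1 Macro average" row. Small discrepancies for other columns would not be surprising, since the reported scores are themselves rounded averages over 10 runs.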

Fig. 2 shows the models’ performances (strict F1 scores) on different test sets. The original test set is the UMN test set, while the portability test set is the MC test set. All models were UMN models (CRF_UMN, BiLSTM-CRF_UMN, and CancerBERT_{UMN_397}) trained solely on UMN data. The CancerBERT_{UMN_397} model achieved the most consistent performance across the different test sets.

3.4. Evaluation of model generalizability with ECR

Fig. 3 shows the performances (strict F1 scores) of different types of models for entities under different ECR groups: 1) 0 ≤ ECR < 0.33, 2) 0.33 ≤ ECR < 0.67, 3) 0.67 ≤ ECR < 1 and 4) ECR = 1. All three types of models achieved relatively high performances for groups 3 & 4. The CancerBERT_{UMN_397} model obtained significantly better performances in groups 1 & 2 compared to the other two models.
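The grouping used in Fig. 3 can be sketched as a simple binning function. This is a minimal illustration that assumes the ECR of each test entity has already been computed; the function name `ecr_group` is hypothetical, and only the four interval boundaries come from the text above.

```python
def ecr_group(ecr: float) -> int:
    """Map an entity coverage ratio in [0, 1] to the group index used in Fig. 3."""
    if not 0.0 <= ecr <= 1.0:
        raise ValueError(f"ECR must lie in [0, 1], got {ecr}")
    if ecr == 1.0:
        return 4  # group 4: ECR = 1
    if ecr >= 0.67:
        return 3  # group 3: 0.67 <= ECR < 1
    if ecr >= 0.33:
        return 2  # group 2: 0.33 <= ECR < 0.67
    return 1      # group 1: 0 <= ECR < 0.33


# Hypothetical ECR values, for illustration only.
print([ecr_group(v) for v in (0.0, 0.33, 0.5, 0.67, 0.99, 1.0)])  # [1, 2, 2, 3, 3, 4]
```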

4. Discussions

A significant proportion of pertinent information resides within unstructured formats in EHR data. Before the era of NLP, information extraction from EHR data mainly relied on the use of structured data (e.g., medical codes) and key term searches of unstructured clinical texts, yielding suboptimal performance [35], [36], [37]. Owing to its robust capabilities and rapid development, NLP has emerged as a pivotal tool for extracting this information, thereby enhancing clinical decision-making, administrative reporting, and academic research [11]. Clinical texts found in EHRs differ significantly from general language due to their inclusion of professional terminology, medical jargon, acronyms, and abbreviations. The semantic and contextual information conveyed within clinical texts extracted from EHRs can vary considerably across different healthcare institutions. These variations should be carefully considered during the development of NLP algorithms tailored for clinical applications. We compared the corpora of UMN and MC from various perspectives. The UMN annotated data contained a greater number of unique phenotypes (673 vs 592), while the MC corpus had a higher density of breast cancer phenotypes within the clinical texts (3.7 phenotypes/100 tokens vs 2.1 phenotypes/100 tokens). Remarkably, the similarity between breast cancer phenotypes was significantly higher than the overall token similarity between the two corpora (0.9088 vs 0.5411), which corroborates findings from a prior investigation [1]. This indicates that clinicians may use consistent clinical language when describing specific medical concepts within their clinical texts, thereby establishing a foundation for the transferability of NLP models among diverse clinical institutes. For instance, clinicians commonly employ standardized medical language such as "HER-2/neu positive" to describe receptor status and "Grade 2" to denote tumor grade.

In this study, we mainly explored the generalizability of the BERT-based CancerBERT models, along with two classic machine learning models, i.e., CRF and BiLSTM-CRF. To assess the generalizability of these models, we evaluated their performance on the cancer phenotype extraction task. The results show that when the UMN models were directly evaluated on MC test data, only the CancerBERT models achieved reasonable performance, indicating their advantage in portability over the BiLSTM-CRF and CRF models. Furthermore, after refining the models using MC data, the CancerBERT models consistently achieved the best overall performance and demonstrated greater stability: the average drop in micro F1 score was 0.067 for the CancerBERT models (CancerBERT_UMN vs CancerBERT_{MC_refined}), compared with 0.078 for CRF (CRF_UMN vs CRF_{MC_refined}) and 0.092 for BiLSTM-CRF (BiLSTM-CRF_UMN vs BiLSTM-CRF_{MC_refined}). Although the CancerBERT_{MC_397} model trained from scratch using MC data obtained the best overall performance, the performance of the CancerBERT_{MC_refined} model remained comparable, with an F1 score only 0.007 lower than that of the CancerBERT_{MC_397} model. These results indicate that BERT-based models can be effectively transferred to other clinical institutes while maintaining a relatively high level of performance, and that the transfer requires minimal effort, as only a small amount of annotated data is needed to fine-tune the models rather than training them from scratch on a new corpus. We calculated the correlation between the similarities of cancer phenotypes of the two corpora and the performance drops (UMN models evaluated on the UMN test set versus UMN models directly evaluated on the MC test set). The Pearson correlation coefficients are –0.678, –0.345, and –0.712 for the CRF_UMN, BiLSTM-CRF_UMN, and CancerBERT_{UMN_397} models respectively, indicating medium (BiLSTM-CRF_UMN) to strong (CRF_UMN, CancerBERT_{UMN_397}) negative correlations. Our findings suggest that the portability of these NLP models is positively influenced by the similarity among the targeted entities, such as cancer phenotypes, across different corpora. When the entities exhibit higher similarity scores between distinct corpora, the model is better able to retain its original performance when transferred between the corpora.
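The correlation analysis described above can be illustrated with a short, self-contained sketch of the Pearson coefficient. The input values below are hypothetical and chosen only to show the expected sign (higher phenotype similarity paired with a smaller F1 drop); they are not the study's data, and `pearson_r` is an illustrative helper rather than the authors' code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# HYPOTHETICAL per-entity values, for illustration only (not from the paper):
# cross-corpus similarity of each phenotype type, and the F1 drop of a model
# when moved from its original test set to the other institute's test set.
similarity = [0.95, 0.90, 0.85, 0.80, 0.70]
f1_drop = [0.02, 0.05, 0.04, 0.10, 0.15]

r = pearson_r(similarity, f1_drop)
print(f"Pearson r = {r:.3f}")  # negative: higher similarity, smaller drop
```

A negative coefficient, as reported for all three UMN models, means that entity types with more similar wording across the two corpora tend to lose less performance under transfer.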

The assessment of models on the permutation test set serves as an indicator of whether the model is effectively capturing the underlying patterns in the text or merely memorizing the phenotypes present in the training set. Our findings reveal a significant advantage of the CancerBERT model over the CRF and BiLSTM-CRF models in this permutation set evaluation. For instance, Table 4 shows that the strict match (lenient match) macro-average F1 scores dropped by only 0.270 (0.169) for the CancerBERT_{UMN_397} model, while for the CRF_UMN and BiLSTM-CRF_UMN models, the corresponding macro-average F1 scores dropped by 0.440 (0.355) and 0.524 (0.438), respectively. Practically, the CancerBERT_{UMN_397} model identified 9.24 fewer phenotypes per patient on the permutation test set compared to the normal test set. In comparison, the CRF_UMN and BiLSTM-CRF_UMN models identified 22.83 and 24.61 fewer phenotypes per patient, respectively. These results show that among the three types of models, the CancerBERT model exhibits superior robustness and a greater ability to capture the contextual information of entities, enabling it to effectively handle novel entity variants that were previously unseen.

The ECR was employed to conduct an in-depth analysis of the portability evaluation results. Notably, the CancerBERT_{UMN_397} model performed significantly better in all groups, with particularly pronounced advantages for phenotypes in Groups 1 and 2 (ECR < 0.67). These groups contain target test phenotypes that either appeared in the training sets with different labels or were absent from the training sets, which makes phenotypes in these two groups relatively challenging to extract. The results indicate that the CancerBERT_{UMN_397} model has a stronger ability to learn the target phenotypes and their contexts than the BiLSTM-CRF_UMN and CRF_UMN models.

The study has several limitations. We evaluated the generalizability of machine learning models through a single NLP task, the extraction of breast cancer phenotypes from clinical texts within two clinical institutes. This is a common NER task in the clinical domain, but many other clinical NLP tasks, such as relation extraction and text classification, merit additional investigation. Our investigation primarily focused on the performances of NLP models in the NER task, while the downstream implications of divergent NLP performance on clinical applications warrant further exploration in future studies. Additionally, our evaluation of model generalizability was restricted to two clinical corpora, and conducting further investigations involving corpora from additional institutes would enhance our understanding of the generalizability of NLP models.

5. Conclusions

In this study, we evaluated the generalizability of three types of NLP models (CRF, BiLSTM-CRF, and CancerBERT) in the clinical domain through an information extraction task to identify breast cancer phenotypes in clinical texts obtained from the UMN and MC. The models’ generalizability was evaluated from different perspectives. Notably, the CancerBERT models emerged as the most adept in terms of generalizability, as they exhibited superior capabilities in learning the contextual intricacies of target phenotypes and effectively accommodating the textual variations encountered in clinical texts. Our results show the CancerBERT models trained in a single institute can be transferred to other institutes and achieve comparable performance at low costs. Furthermore, the CancerBERT model shows remarkable robustness compared to the other two models assessed. This study represents the first attempt to comprehensively analyze the generalizability of BERT-based models within the clinical domain, and the insights garnered from our research hold significant implications for guiding the seamless transfer and adoption of NLP models across diverse clinical institutes. In the future, we plan to further collect the annotation data from other clinical institutes and evaluate the model generalizability on the relation extraction task to obtain a more comprehensive evaluation. Also, we will extrinsically evaluate the influence of model generalizability on downstream disease prediction tasks.

Funding statement

This work is partially supported by the National Center for Complementary and Integrative Health (NCCIH) under grant number R01AT009457 (Zhang), the National Institute on Aging under grant number R01AG078154 (Zhang/Xu), and the University of Minnesota Clinical and Translational Science Institute (CTSI), supported by the National Center for Advancing Translational Sciences under grant number UL1TR002494. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health (NIH).

CRediT authorship contribution statement

Sicheng Zhou: Methodology, Software, Writing – original draft. Nan Wang: Data curation, Writing – review & editing. Liwei Wang: Data curation, Writing – review & editing. Ju Sun: Supervision, Writing – review & editing. Anne Blaes: Supervision, Writing – review & editing. Hongfang Liu: Resources, Supervision, Writing – review & editing. Rui Zhang: Conceptualization, Resources, Methodology, Supervision, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

NA.

Data Availability

The data underlying this article cannot be shared publicly due to the privacy of patient health information.

References

1. Sohn S., Wang Y., Wi C.I., Krusemark E.A., Ryu E., Ali M.H., Juhn Y.J., Liu H. Clinical documentation variations and NLP system portability: a case study in asthma birth cohorts across institutions. J Am Med Inform Assoc. 2018;25(3):353–359. doi: 10.1093/jamia/ocx138.
2. Xie F., Lee J., Munoz-Plaza C.E., Hahn E.E., Chen W. Application of text information extraction system for real-time cancer case identification in an integrated healthcare organization. J Pathol Inform. 2017;8(1):48. doi: 10.4103/jpi.jpi_55_17.
3. Carchiolo V., Longheu A., Reitano G., Zagarella L. Medical prescription classification: a NLP-based approach. In: 2019 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE; 2019. pp. 605–609.
4. Vijayakrishnan R., Steinhubl S.R., Ng K., Sun J., Byrd R.J., Daar Z., Williams B.A., Defilippi C., Ebadollahi S., Stewart W.F. Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record. J Card Fail. 2014;20(7):459–464. doi: 10.1016/j.cardfail.2014.03.008.
5. Mavrogiorgos K., Mavrogiorgou A., Kiourtis A., Zafeiropoulos N., Kleftakis S., Kyriazis D. Automated rule-based data cleaning using NLP. In: 2022 32nd Conference of Open Innovations Association (FRUCT). IEEE; 2022. pp. 162–168.
6. Valmianski I., Frost N., Sood N., Wang Y., Liu B., Zhu J.J., Karumuri S., Finn I.M., Zisook D.S. SmartTriage: A system for personalized patient data capture, documentation generation, and decision support. In: Machine Learning for Health. PMLR; 2021. pp. 75–96.
7. Manias G., Mavrogiorgou A., Kiourtis A., Kyriazis D. SemAI: A novel approach for achieving enhanced semantic interoperability in public policies. In: Artificial Intelligence Applications and Innovations: 17th IFIP WG 12.5 International Conference, AIAI 2021, Hersonissos, Crete, Greece, June 25–27, 2021, Proceedings. Springer International Publishing; 2021. pp. 687–699.
8. Digan W., Névéol A., Neuraz A., Wack M., Baudoin D., Burgun A., Rance B. Can reproducibility be improved in clinical natural language processing? A study of 7 clinical NLP suites. J Am Med Inform Assoc. 2021;28(3):504–515. doi: 10.1093/jamia/ocaa261.
9. Kaufman D.R., Sheehan B., Stetson P., Bhatt A.R., Field A.I., Patel C., Maisel J.M. Natural language processing–enabled and conventional data capture methods for input to electronic health records: a comparative usability study. JMIR Med Inform. 2016;4(4). doi: 10.2196/medinform.5544.
10. Devine E.B., Van Eaton E., Zadworny M.E., Symons R., Devlin A., Yanez D., Yetisgen M., Keyloun K.R., Capurro D., Alfonso-Cristancho R., Flum D.R. Automating electronic clinical data capture for quality improvement and research: the CERTAIN validation project of real world evidence. eGEMs. 2018;6:1. doi: 10.5334/egems.211.
11. Wen A., Fu S., Moon S., El Wazir M., Rosenbaum A., Kaggal V.C., Liu S., Sohn S., Liu H., Fan J. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. NPJ Digit Med. 2019;2(1):1–7. doi: 10.1038/s41746-019-0208-8.
12. Wang Y., Wang L., Rastegar-Mojarad M., Moon S., Shen F., Afzal N., Liu S., Zeng Y., Mehrabi S., Sohn S., Liu H. Clinical information extraction applications: a literature review. J Biomed Inform. 2018;77:34–49. doi: 10.1016/j.jbi.2017.11.011.
13. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
14. Li L., Chen X., Ye H., Bi Z., Deng S., Zhang N., Chen H. On robustness and bias analysis of BERT-based relation extraction. In: China Conference on Knowledge Graph and Semantic Computing. Singapore: Springer; 2021. pp. 43–59.
15. Fan J.W., Prasad R., Yabut R.M., Loomis R.M., Zisook D.S., Mattison J.E., Huang Y. Part-of-speech tagging for clinical text: wall or bridge between institutions? In: AMIA Annual Symposium Proceedings. vol. 2011. American Medical Informatics Association; 2011. p. 382.
16. Friedman C., Kra P., Rzhetsky A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform. 2002;35(4):222–235. doi: 10.1016/s1532-0464(03)00012-1.
17. Mehrabi S., Krishnan A., Roch A.M., Schmidt H., Li D., Kesterson J., Beesley C., Dexter P., Schmidt M., Palakal M., Liu H. Identification of patients with family history of pancreatic cancer-Investigation of an NLP system portability. Stud Health Technol Inform. 2015;216:604.
18. Liu M., Shah A., Jiang M., Peterson N.B., Dai Q., Aldrich M.C., Chen Q., Bowton E.A., Liu H., Denny J.C., Xu H. A study of transportability of an existing smoking status detection module across institutions. In: AMIA Annual Symposium Proceedings. vol. 2012. American Medical Informatics Association; 2012. p. 577.
19. Magoc T., Allen K.S., McDonnell C., Russo J.P., Cummins J., Vest J.R., Harle C.A. Generalizability and portability of natural language processing system to extract individual social risk factors. Int J Med Inform. 2023. doi: 10.1016/j.ijmedinf.2023.105115.
20. Khambete M.P., Su W., Garcia J.C., Badgeley M.A. Quantification of BERT diagnosis generalizability across medical specialties using semantic dataset distance. AMIA Summits Transl Sci Proc. 2021;2021:345.
21. Zhou S., Wang N., Wang L., Liu H., Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc. 2022. doi: 10.1093/jamia/ocac040.
22. Peng Y., Chen Q., Lu Z. An empirical study of multi-task learning on BERT for biomedical text mining. arXiv preprint arXiv:2005.02799. 2020.
23. Cohen I., Huang Y., Chen J., Benesty J. Pearson correlation coefficient. Noise Reduct Speech Process. 2009:1–4.
24. Liu Z., Tang B., Wang X., Chen Q. De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform. 2017;75:S34–S42. doi: 10.1016/j.jbi.2017.05.023.
25. Chapman A.B., Peterson K.S., Alba P.R., DuVall S.L., Patterson O.V. Detecting adverse drug events with rapidly trained classification models. Drug Saf. 2019;42(1):147–156. doi: 10.1007/s40264-018-0763-y.
26. Unanue I.J., Borzeshi E.Z., Piccardi M. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. J Biomed Inform. 2017;76:102–109. doi: 10.1016/j.jbi.2017.11.007.
27. Pennington J., Socher R., Manning C. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. pp. 1532–1543.
28. Mikolov T., Sutskever I., Chen K., et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst. 2013:3111–3119.
29. Yang X., Bian J., Hogan W.R., Wu Y. Clinical concept extraction using transformers. J Am Med Inform Assoc. 2020;27(12):1935–1942. doi: 10.1093/jamia/ocaa189.
30. Kim H.Y. Analysis of variance (ANOVA) comparing means of more than two groups. Restor Dent Endod. 2014;39(1):74–77. doi: 10.5395/rde.2014.39.1.74.
31. Armstrong R.A. When to use the Bonferroni correction. Ophthalmic Physiol Opt. 2014;34(5):502–508. doi: 10.1111/opo.12131.
32. Schutte D., Vasilakes J., Bompelli A., Zhou Y., Fiszman M., Xu H., Kilicoglu H., Bishop J.R., Adam T., Zhang R. Discovering novel drug-supplement interactions using SuppKG generated from the biomedical literature. J Biomed Inform. 2022;131. doi: 10.1016/j.jbi.2022.104120.
33. Fu J., Liu P., Zhang Q. Rethinking generalization of neural models: a named entity recognition case study. Proc AAAI Conf Artif Intell. 2020;34(05):7732–7739.
34. Devlin J., Chang M.W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
35. Botsis T., Hartvigsen G., Chen F., Weng C. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma. 2010;2010:1.
36. Coquet J., Bozkurt S., Kan K.M., Ferrari M.K., Blayney D.W., Brooks J.D., Hernandez-Boussard T. Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients. J Biomed Inform. 2019;94. doi: 10.1016/j.jbi.2019.103184.
37. Halpern Y., Horng S., Choi Y., Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc. 2016;23(4):731–740. doi: 10.1093/jamia/ocw011.
