IEEE Access. 2022 Mar 8;10:31513–31523. doi: 10.1109/ACCESS.2022.3157854

How Do Your Biomedical Named Entity Recognition Models Generalize to Novel Entities?

Hyunjae Kim, Jaewoo Kang
PMCID: PMC9014470  PMID: 35582496

Abstract

The volume of biomedical literature on new biomedical concepts is rapidly increasing, which necessitates reliable biomedical named entity recognition (BioNER) models that can identify new and unseen entity mentions. However, it is questionable whether existing models can effectively handle them. In this work, we systematically analyze three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that although current best models achieve state-of-the-art performance on benchmarks in terms of overall performance, they have limitations in identifying synonyms and new biomedical concepts, indicating that their generalization abilities are overestimated. We also investigate the failure cases of models and identify several difficulties in recognizing unseen mentions in biomedical literature: (1) models tend to exploit dataset biases, which hinders their ability to generalize, and (2) several biomedical names have novel morphological patterns with weak name regularity, and models fail to recognize them. We apply a statistics-based debiasing method to our problem as a simple remedy and show that it improves generalization to unseen mentions. We hope that our analyses and findings facilitate further research into the generalization capabilities of NER models in a domain where their reliability is of utmost importance.

Keywords: Bioinformatics (in engineering in medicine and biology), natural language processing, text mining

I. Introduction

More than 3,000 biomedical papers are published per day on average [1], [2]. Searching these documents efficiently and extracting useful information from them would be of great help to researchers and practitioners in the field. Biomedical named entity recognition (BioNER), which involves identifying biomedical named entities in unstructured text, is a core task for this purpose since entities extracted by BioNER systems are used as important features in many downstream tasks such as drug-drug interaction extraction [3].

One important desideratum of BioNER models is to be able to generalize to unseen entity mentions. This generalization capability is of paramount importance in the biomedical domain due to the following reasons. First, various expressions for a biomedical entity (i.e., synonyms) continue to be made. For instance, pharmaceutical companies come up with marketing-appropriate names such as Gleevec to replace old names (usually identifiers) such as CGP-57148B and STI-571, whereas entities in other domains such as countries and companies are relatively unchanged. Second, new biomedical entities and concepts such as the novel coronavirus disease 2019 (COVID-19) constantly emerge, which can have a direct impact on human life and health.

In contrast to the importance of generalizing to new entities in the biomedical literature, there has been little systematic analysis of the generalizability of BioNER models. While recent works have made great efforts to push the state of the art (SOTA) on various benchmarks [4]–[7], it is questionable whether high overall performance on a benchmark indicates true generalization. We conducted a pilot study to check whether current BioNER models are reliable in identifying new entities. Specifically, we trained BioBERT [7] on the NCBI-disease corpus [8], and then tested how many spans containing the novel entity COVID-19 the model could extract from PubMed sentences. The model extracted only 45.7% of all the spans, although it achieved high overall performance on NCBI-disease (90.5% in recall). From this, we conclude that existing BioNER models may have limitations in identifying unseen entities, and that their generalizability should be explored in a more systematic way beyond measuring overall performance.

In this work, we analyze how well existing BioNER models generalize to unseen mentions. First, we define three types of recognition abilities that BioNER models should possess:

  • Memorization: The most basic ability is to identify the entity mentions that were seen during training. We call this type of mention a memorizable mention. If there is no label inconsistency, even a simple rule-based model can recognize memorizable mentions easily.

  • Synonym generalization: Biomedical names are expressed in various forms, even when they refer to the same biomedical concepts. For instance, Motrin and Ibuprofen are the same entity, but their surface forms are highly different [9]. A BioNER model should be robust to these morphological variations (i.e., synonyms).

  • Concept generalization: While synonym generalization deals with recognizing new surface forms of existing entities, concept generalization refers to generalization to novel entities or concepts that did not exist before. New biomedical concepts such as COVID-19 are sometimes very different from conventional entities in terms of their surface forms and the context in which they appear, which makes them difficult to identify.

Based on the three recognition types that we define, we partition the entity mentions in the test sets (or validation sets) into three splits according to mention and CUI (Concept Unique Identifier) overlaps with the training sets, as shown in Table 1. This gives us several advantages. First, we can compare models’ generalization abilities in detail. For instance, we find that the gap in performance between BioBERT and BERT [10] comes mainly from synonym and concept generalization, not memorization (Section III). Also, our classification is simple and can easily be adopted for other datasets and other downstream tasks in the biomedical domain such as relation extraction and normalization. We focus on two popular BioNER benchmarks in this work: NCBI-disease [8] and BC5CDR [11].

TABLE 1. The Number of Mentions in the Mem, Syn, and Con Splits of Benchmarks. Each Split (i.e., Mem, Syn, and Con) Corresponds to Each Recognition Type (i.e., Memorization, Synonym Generalization, and Concept Generalization). The Table Shows That Current BioNER Benchmarks are Overrepresented by the Mentions in the Mem Splits (i.e., Memorizable Mentions).

Dataset Type Validation Test
Mem Syn Con Mem Syn Con
NCBI-disease Disease 599 (62.4%) 196 (20.4%) 165 (17.2%) 515 (65.4%) 191 (24.3%) 81 (10.3%)
BC5CDRdis Disease 2,807 (63.4%) 922 (20.8%) 695 (15.7%) 2,642 (62.3%) 960 (22.6%) 642 (15.1%)
BC5CDRchem Drug/chem. 3,294 (61.2%) 510 (9.5%) 1,581 (29.4%) 3,438 (64.3%) 456 (8.5%) 1,453 (27.2%)

On the three test splits, we investigate the generalizability of existing BioNER models. Despite their SOTA performance on the benchmarks, they have limitations in their generalizability. Specifically, the models perform well on memorizable mentions, but find it difficult to identify unseen mentions. For the disease mentions in the BC5CDR corpus, BioBERT achieved a recall of 93.3% on memorizable mentions, but only 74.9% and 73.7% on synonyms and new concepts, respectively. Also, the models cannot recognize the newly emerging biomedical concept COVID-19 well. Surprisingly, BioBERT recognized only 3.4% of spans containing COVID-19 when trained on BC5CDR. From these observations, we conclude that existing BioNER models achieve high performance on benchmarks, but their generalizability is overestimated.

Also, we identify several difficulties in recognizing unseen mentions. First, through a qualitative analysis of error cases on the Syn and Con splits, we find that BioNER models can rely on the class distributions of each word in the training set, reducing the models’ abilities to generalize. Since BioNER datasets are relatively small for training large neural networks, models may be sensitive to such dataset bias. Second, after examining the failure on COVID-19, we conclude that models are not robust to new entities when they do not follow conventional surface patterns. This is an important issue to be addressed since many biomedical entities have rare morphologies (see Table 8 for examples), and such entities will continue to appear in the biomedical literature.

TABLE 8. Disease Entities With Rare Surface Forms and the Performance of BioBERT With/Without Our Debiasing Method.

Test Entity COVID-19 47, XXY African Iron Overload Bejel Geographic Tongue Pinta PMM2-CDG Precocious Puberty VACTERL Association
Frequency 5,287 1,292 39 112 391 124 370 6,903 685
BioBERT 45.7 1.0 6.2 45.5 66.8 50.6 29.0 71.1 31.0
+ Debias. 47.3 2.1 17.4 49.8 76.5 55.0 34.6 90.0 43.8
Improvement + 1.6 + 1.1 + 11.2 + 4.3 + 9.7 + 4.4 + 5.6 + 18.9 + 12.8

The two difficulties can be viewed as models’ biases on statistical cues and surface patterns. In order to show they are addressable, we apply a simple statistics-based debiasing method [12]. Specifically, we use the class distributions of words in the training set as bias prior distributions. This reduces the training signals from words whose surface forms are very likely to be entities (or non-entities), mitigating models’ bias on class distributions and name regularity. In experiments, we demonstrate our debiasing method consistently improves the generalization to synonyms, new concepts, and entities with unique forms including COVID-19.

To sum up, we make the following contributions:1

  • We first define memorization, synonym generalization, and concept generalization and systematically investigate existing BioNER models in this regard.

  • We raise the overestimation issue in terms of BioNER models’ generalizability to unseen mentions and provide empirical evidence to support our claim.

  • We identify two types of bias as the main difficulty in generalization in BioNER and show that they are addressable using a current debiasing method.

II. Data Preparation

A. Partitioning Benchmarks

We describe how we partition benchmarks. Several BioNER datasets provide entity mentions and also CUIs that link the entity mentions to their corresponding biomedical concepts in databases. We utilize overlaps in mentions and CUIs between training and test sets in the partitioning process. Let $(x_i, m_i, c_i)$ be the $i$-th data example of a total of $N$ examples in a test set, where $x_i$ is the $i$-th sentence, $m_i = (m_{i,1}, \ldots, m_{i,k_i})$ is a list of entity mentions, and $c_i = (c_{i,1}, \ldots, c_{i,k_i})$ is a list of CUIs, with $k_i$ being the number of entity mentions (or CUIs) in the sentence. We partition all mentions $\{m_{i,j}\}$ in the original test set into three splits as follows:

$$\text{Mem} = \{\, m_{i,j} \mid m_{i,j} \in \mathcal{M}_{\text{train}} \,\},$$
$$\text{Syn} = \{\, m_{i,j} \mid m_{i,j} \notin \mathcal{M}_{\text{train}},\ c_{i,j} \in \mathcal{C}_{\text{train}} \,\},$$
$$\text{Con} = \{\, m_{i,j} \mid m_{i,j} \notin \mathcal{M}_{\text{train}},\ c_{i,j} \notin \mathcal{C}_{\text{train}} \,\},$$

where $\mathcal{M}_{\text{train}}$ is the set of all entity mentions in the training set, and $\mathcal{C}_{\text{train}}$ is the set of all CUIs in the training set. We describe the partitioning process in detail in the Appendix.
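To make the rule above concrete, the following is a minimal sketch of the partitioning in Python. It assumes that mentions and CUIs are available as plain strings and that each test example is a (sentence, mentions, cuis) triple; the function name partition_mentions and the data layout are our illustration, not the authors’ released code.

```python
# Minimal sketch of the Mem/Syn/Con split described above (illustrative only).
def partition_mentions(test_examples, train_mentions, train_cuis):
    """Assign every test mention to the Mem, Syn, or Con split."""
    splits = {"Mem": [], "Syn": [], "Con": []}
    for sentence, mentions, cuis in test_examples:
        for mention, cui in zip(mentions, cuis):
            if mention in train_mentions:      # surface form seen during training
                splits["Mem"].append(mention)
            elif cui in train_cuis:            # new surface form, known concept
                splits["Syn"].append(mention)
            else:                              # unseen surface form and concept
                splits["Con"].append(mention)
    return splits

# Toy example (mentions and CUIs are made up):
train_mentions = {"colorectal cancer", "seizures"}
train_cuis = {"D015179", "D012640"}
test_examples = [("...", ["colorectal carcinoma", "COVID-19"], ["D015179", "C000657245"])]
print({k: len(v) for k, v in partition_mentions(test_examples, train_mentions, train_cuis).items()})
# {'Mem': 0, 'Syn': 1, 'Con': 1}
```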

B. Datasets

We use two popular BioNER benchmarks with CUIs to systematically investigate models’ memorization, synonym generalization, and concept generalization abilities. Additionally, we automatically construct a dataset consisting of the novel entity COVID-19.

1). NCBI-Disease

The NCBI-disease corpus [8] is a collection of 793 PubMed articles with manually annotated disease mentions and the corresponding concepts in Medical Subject Headings (MeSH) or Online Mendelian Inheritance in Man (OMIM).

2). BC5CDR

The BC5CDR corpus [11] was proposed for disease name recognition and chemical-induced disease (CID) relation extraction tasks. The corpus is a collection of 1,500 PubMed articles with manually annotated disease and chemical mentions and the corresponding concepts in MeSH. We denote the disease-type dataset as BC5CDRdis and the chemical-type dataset as BC5CDRchem.

3). COVID-19

We construct a dataset to see if a model trained on current benchmarks can identify the newly emerging biomedical concept COVID-19. We sampled 5,000 sentences containing “COVID-19” from PubMed abstracts published through March 2021 and annotated all COVID-19 occurrences in the sentences, which results in 5,237 labels. Note that only the exact term “COVID-19” was considered; synonyms for COVID-19 were not considered in this dataset creation process.
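As a rough illustration of how such exact-match labels can be produced, the sketch below marks every literal occurrence of “COVID-19” in a sampled sentence as a gold span; the function name and the character-offset output format are our assumptions, not the exact construction script.

```python
import re

def find_covid_spans(sentence, term="COVID-19"):
    # Every literal occurrence of the term becomes one gold (start, end, text) span.
    return [(m.start(), m.end(), term) for m in re.finditer(re.escape(term), sentence)]

print(find_covid_spans("Treatment of COVID-19 patients with hypoxia."))
# [(13, 21, 'COVID-19')]
```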

C. Split Statistics

Table 1 shows the statistics of the splits of the benchmarks. We found that a significant portion of the benchmarks corresponds to the Mem split, implying that current BioNER benchmarks are highly skewed towards memorizable mentions. In Section III, we discuss the overestimation problem that such overrepresentation of memorizable mentions may cause.

III. Generalizability of BIONER Models

This section describes baseline models and evaluation metrics and analyzes the three recognition abilities of the models.

A. Baseline Models

We use four current best neural net-based models and two traditional dictionary-based models as our baseline models. See the Appendix for implementation details.

1). Neural Models

We use BioBERT [7], BlueBERT [14], and PubMedBERT [13]. The models are all pretrained language models (PLMs) for the biomedical domain, with similar architectures. They are different in their vocabularies, weight initialization, and training corpora. See the Appendix for more details. Also, we use BERT [10] to compare general and domain-specific PLMs in terms of generalization in BioNER.

2). Dictionary Models

Traditional approaches in the field of BioNER are based on pre-defined dictionaries [15]. To compare the generalization abilities of traditional and recent approaches, we set up two types of simple dictionary-based extractors as baseline models. DICTtrain uses all the entity mentions in a training set (i.e., $\mathcal{M}_{\text{train}}$) as a dictionary and classifies text spans as entities when the dictionary includes the spans. If candidate spans overlap, the longest one is selected. DICTsyn expands the dictionary to include the entity mentions in the training set as well as their synonyms, which are pre-defined in biomedical databases.
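The following is a simplified sketch of a DICTtrain-style extractor with longest-match selection. It assumes whitespace tokenization and a lowercased dictionary of mention strings; the matching code actually used in the paper may differ in these details.

```python
# Greedy longest-match dictionary extractor (illustrative sketch).
def dict_extract(tokens, dictionary, max_len=8):
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(min(len(tokens), i + max_len), i, -1):  # try longest span first
            if " ".join(tokens[i:j]).lower() in dictionary:
                match = (i, j)
                break
        if match:
            spans.append(match)   # token-level (start, end) of the matched entity
            i = match[1]          # continue after the match, so spans never overlap
        else:
            i += 1
    return spans

dictionary = {"atrial septal defects", "seizures"}
print(dict_extract("the patient had atrial septal defects and seizures".split(), dictionary))
# [(3, 6), (7, 8)]
```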

B. Metrics

Following conventional evaluation in BioNER, we use precision (P), recall (R), and the F1 score (F1) at the entity level to measure overall performance [16]. We use only recall when evaluating the three recognition abilities (i.e., on the Mem, Syn, and Con splits) since it is impossible to classify false positives into a recognition type. For COVID-19, we use a relaxed version of recall: if “COVID-19” is contained in a predicted span, we classify the prediction as a true positive.
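For clarity, a small sketch of how the relaxed recall can be computed is shown below, assuming predictions are character-offset spans over the sentence; variable names are illustrative.

```python
def relaxed_recall(gold_count, predicted_spans, sentence, term="COVID-19"):
    # A predicted span counts as a true positive if its text contains the term.
    true_positives = sum(term in sentence[s:e] for s, e in predicted_spans)
    # Recall over the gold occurrences of the term in this sentence.
    return min(true_positives, gold_count) / gold_count if gold_count else 0.0

sent = "The 2019 novel coronavirus pneumonia (COVID-19) is an ongoing global pandemic."
print(relaxed_recall(1, [(38, 46)], sent))  # predicted span covering "COVID-19" -> 1.0
```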

C. Results

1). Overall Results

Table 2 shows the performance of the baseline models. BioBERT outperforms the other baseline models on NCBI-disease in terms of overall performance. For the BC5CDR corpus, PubMedBERT is the best performing model. BERT performs worse than the domain-specific PLMs, but far better than the dictionary models. DICTsyn outperforms DICTtrain in recall due to its larger biomedical dictionary, but its precision scores generally decrease. Note that the performance of DICTsyn on Mem is lower than that of DICTtrain because of annotation inconsistency between benchmarks and biomedical databases. We elaborate on this in the Appendix.

TABLE 2. Performance of Current BioNER Models on NCBI-Disease, BC5CDRdis, and BC5CDRchem. The Best Scores are Highlighted in Bold, and the Second Best Scores are Underlined.
Test
Training Model Overall In-depth COVID-19
P R F1 Mem Syn Con
NCBI-disease PubMedBERT [13] 86.6 88.9 87.7 94.5 81.1 77.7 36.0
BlueBERT [14] 85.8 89.5 87.6 95.9 80.4 76.7 13.8
BioBERT [7] 86.7 90.5 88.6 95.5 80.9 84.1 45.7
BERT [10] 83.8 87.6 85.6 95.0 79.0 70.7 18.7
DICTsyn 50.7 58.0 54.1 88.5 13.8 0.0 0.0
DICTtrain 52.7 55.4 54.0 88.8 0.0 0.0 0.0
BC5CDRdis PubMedBERT [13] 83.1 87.3 85.2 93.1 78.3 75.7 2.2
BlueBERT [14] 82.2 86.6 84.4 93.2 76.9 72.7 0.8
BioBERT [7] 82.4 86.3 84.3 93.3 74.9 73.7 3.4
BERT [10] 78.5 81.4 79.9 91.5 64.0 63.4 0.8
DICTsyn 75.4 67.8 71.4 96.0 32.9 0.0 0.0
DICTtrain 75.9 61.4 67.8 96.7 0.0 0.0 0.0
BC5CDRchem PubMedBERT [13] 92.1 94.2 93.1 98.3 85.5 88.2 -
BlueBERT [14] 92.8 92.9 92.8 98.0 81.6 86.0 -
BioBERT [7] 92.1 93.1 92.6 97.8 82.1 87.0 -
BERT [10] 89.8 90.0 89.9 97.0 72.4 81.1 -
DICTsyn 71.5 62.2 66.5 95.9 32.7 1.4 -
DICTtrain 71.2 58.8 64.6 96.2 0.0 0.0 -

Memorization is easily achieved compared with the other two abilities. Although the dictionary models are the simplest type of BioNER model without learnable parameters, they work well on Mem. The degree of difficulty in recognizing synonyms and new concepts varies from dataset to dataset: the models’ performance on Con is lower than on Syn for BC5CDRdis, but the opposite holds for BC5CDRchem.

2). Overestimation of Models

The neural models perform well on Mem, but they achieve relatively low performance on Syn and Con across all benchmarks. For instance, BioBERT achieved 93.3% recall on Mem, but only 74.9% and 73.7% recall on Syn and Con, respectively. Also, the neural models perform poorly on COVID-19 despite their high F1 scores. BioBERT performed best, but its score is only 45.7% recall. Even more surprisingly, all the models hardly identify COVID-19 when trained on BC5CDRdis. To sum up, current BioNER models have limitations in their generalizability.

As shown in Table 1, a large number of entity mentions in existing BioNER benchmarks belong to Mem. This overrepresentation of memorizable mentions can lead to an overestimation of the generalization abilities of models: we may believe a model generalizes well because of its high benchmark performance, when in fact it is largely fit to memorizable mentions. Taking these results into account, we would like to emphasize that researchers should be wary of falling into the trap of overall performance and misinterpreting a model’s high overall performance as generalization ability at validation and inference time.

3). Effect of Domain-Specific Pretraining

Domain-specific PLMs consistently outperform BERT on Syn and Con. These results show that pretraining on domain-specific corpora mostly affects synonym generalization and concept generalization. On the other hand, BERT and the domain-specific PLMs achieve similar performance on Mem because memorization does not require much domain-specific knowledge and the models have the same architecture and capacity.

In particular, we find that the gap in performance between BERT and the domain-specific PLMs is drastic in the ability to generalize to abbreviations. Table 3 shows the neural models’ performance on abbreviations in the Syn splits of NCBI-disease and BC5CDRdis.2 On NCBI-disease, BioBERT is very robust to abbreviations, and the gap in performance between BioBERT and BERT is 24.4% in recall. BioBERT also significantly outperforms the other domain-specific PLMs, resulting in high performance on the Syn split of NCBI-disease. On the other hand, PubMedBERT is the best on BC5CDRdis, outperforming BERT by 19.7% in recall.

TABLE 3. Performance of Neural Models on the Abbreviations in the Syn Splits. 32.7% of Mentions are Abbreviations in the Syn Split of NCBI-Disease, While BC5CDRdis Has Only 7.2% Abbreviations. The Best Scores are Highlighted in Bold.
Model NCBI-disease (32.7%) BC5CDRdis (7.2%)
PubMedBERT [13] 84.8 71.5
BlueBERT [14] 81.5 70.4
BioBERT [7] 89.6 69.6
BERT [10] 65.2 52.8

IV. Analysis

In this section, we analyze which factors make generalization to unseen biomedical names difficult, based on the failures of models on (1) the Syn and Con splits, and (2) COVID-19. For simplicity, we focus only on BERT and BioBERT.

A. Dataset Bias

We qualitatively analyze the error cases of BioBERT by sampling a total of 100 incorrect predictions from the Syn and Con splits of BC5CDRdis. We found that 36% of the error cases occur because the model tends to rely on statistical cues in the dataset and make biased predictions. Table 4 shows examples of the biased predictions.

TABLE 4. Examples of Biased Predictions of BioBERT. Entity Mentions (Ground-Truth Labels) are Displayed in Blue. Model Predictions are Highlighted With Yellow Boxes.

Example
[1] Acute encephalopathy and cerebral vasospasm after multiagent chemotherapy […]
[2] […] 14 with anterior infarction (ANT-MI) and eight with inferior infarction (INF-MI).
[3] Two patients needed a lateral tarsorrhaphy for persistent epithelial defects.

In the first example, the model failed to extract the whole phrase “acute encephalopathy.” All the words “encephalopathy” in the training set are labeled as “B,”3 so the model classified the word as “B,” resulting in an incorrect prediction. In the second example, there are four entity mentions: two are the full names “anterior infarction” and “inferior infarction,” and the others are their corresponding abbreviations “ANT-MI” and “INF-MI.” As the abbreviations are enclosed in parentheses after the full names, it should generally be easy for a model to identify the abbreviations if it can extract the full names. Interestingly, although BioBERT correctly predicted the full names in the example, it failed to recognize their abbreviations. This is because “MI” is only labeled as “B” in the training set, so the model was convinced that “MI” is only associated with the label “B.” In the last example, about 73% of the occurrences of the word “defects” are labeled as “I” in the training set as components of entity mentions such as birth defects and atrial septal defects. However, the word “epithelial” is only labeled as “O,” so the model did not predict the phrase “epithelial defects” as an entity.

From these observations, we hypothesize that BioNER models are biased to class distributions in datasets. Specifically, models tend to over-rely on the class distributions of each word in the training set, causing the models to fail when the class distribution shifts in the test set.

B. Weak Name Regularity

Name regularity refers to patterns in the surface forms of entities [18], [19]. For example, many disease names have patterns such as “__ disease” and “__ syndrome.” These patterns are regarded as useful features for identifying unseen mentions and are often implemented in NER systems after being handcrafted. However, little analysis has been done on the difficulties a model can face when extracting novel entities that do not have common name patterns such as COVID-19. In this section, we hypothesize that the cause of models’ failure to recognize COVID-19 is its rare morphology and perform detailed analyses to support the hypothesis.

1). Cause of Failing to Recognize COVID-19

We have already confirmed in Table 2 that models fail to recognize COVID-19. To see if the cause of this failure is the rare surface form of COVID-19, we replace all occurrences of “COVID-19” in the COVID-19 dataset with the more disease-like mention “COVID,” while maintaining the context. Interestingly, BioBERT recognizes the entity well after the replacement, as shown in Table 5.

TABLE 5. Performance of BioBERT on COVID-19 and Synthetically Generated Mention “COVID”.
Test
Training COVID-19 COVID
NCBI-disease 45.7 85.8 (+ 40.1)
BC5CDRdis 3.4 55.1 (+ 51.7)

Next, we train models with entity mentions having surface forms similar to COVID-19 and see how the performance on COVID-19 changes. First, we randomly generate 3–5 capital letters and 1–3 numbers. We then combine the generated letters and numbers using the pattern “{Abbreviation}-{Number}” and create pseudo entities such as IST-5, CHF-113, and SRS-3517 (see the sketch after Table 6). We randomly select 1 or 10 entity mentions in the training set that are abbreviations and replace them with different pseudo entities. We then train BioBERT on the modified training set and test how well it recognizes COVID-19. As shown in Table 6, augmenting COVID-19-like name patterns improves the ability to recognize COVID-19.

TABLE 6. Performance of BioBERT on the COVID-19 Dataset When Trained With Name Patterns Similar to COVID-19.
Training
Model NCBI-disease BC5CDRdis
BioBERT 45.7 3.4
+ Replace. (1) 51.0 (+ 5.3) 14.8 (+ 11.4)
+ Replace. (10) 56.6 (+ 10.9) 27.6 (+ 24.2)
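For reference, a sketch of the “{Abbreviation}-{Number}” pseudo-entity generation described above is given below; the exact sampling procedure and the way replacements are injected into the training set may differ from the authors’ script.

```python
import random
import string

def generate_pseudo_entity(rng=random):
    # 3-5 random capital letters joined with 1-3 random digits, e.g. "IST-5" or "CHF-113".
    letters = "".join(rng.choices(string.ascii_uppercase, k=rng.randint(3, 5)))
    number = "".join(rng.choices(string.digits, k=rng.randint(1, 3)))
    return f"{letters}-{number}"

random.seed(0)
print([generate_pseudo_entity() for _ in range(3)])  # three entities with the COVID-19-like pattern
```

In the experiment, 1 or 10 abbreviation mentions in the training set are overwritten with such pseudo entities before fine-tuning.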

Note that the low performance on COVID-19 is not due to a lack of sufficient context. Models fail even if there is enough information in the context to determine that COVID-19 is a disease, e.g., “treatment of COVID-19 patients with hypoxia” and “The 2019 novel coronavirus pneumonia (COVID-19) is an ongoing global pandemic with a worldwide death toll.” Also, a small amount of training data is not the cause of the failure. We trained BioBERT on the MedMentions corpus [20], which contains several times more disease mentions than NCBI-disease and BC5CDRdis, but the model extracted only 12.7% of the COVID-19 mentions. From these observations, we conclude that the biggest difficulty in recognizing COVID-19 is generalizing to a novel surface form.

2). Comparison of NCBI-Disease and BC5CDR

When trained on NCBI-disease versus BC5CDRdis, the gap in the models’ performance on COVID-19 is remarkable. This can be caused by three factors. First, the BC5CDR corpus contains a number of chemical mentions with the pattern “{Abbreviation}-{Number}” such as “MK-486” and “FLA-63,” so models can learn that the pattern is associated with the chemical type rather than the disease type. Second, NCBI-disease contains several times more abbreviations than BC5CDRdis in the training set, which could help generalization to COVID-19, which is also an abbreviation. Lastly, NCBI-disease has the entity “EA-2” in the training set, whose pattern is similar to COVID-19, while BC5CDRdis does not have any disease entity with the pattern. Replacing “EA-2” with “EA” dramatically reduces the performance of BioBERT from 45.7 to 11.2, which supports our claim.

C. Debiasing Method

We hypothesize that BioNER models tend to rely on class distributions and name regularity observed during training, making it difficult to generalize to unseen entities, especially entities with rare patterns (e.g., COVID-19). To support our hypothesis and see if such bias can be handled, we adopt the bias product method [21], a debiasing method that is effective in alleviating dataset biases in various NLP tasks such as visual question answering and natural language inference.

1). Formulation

Bias product [21] trains an original model together with a biased model such that the original model does not learn much from spurious cues. Let $p_{ij}$ be the probability distribution over $C$ target classes of the original model at the $j$-th word in the $i$-th sentence, and $b_{ij}$ be that of the biased model. We add $\log p_{ij}$ and $\log b_{ij}$ element-wise, and then calculate a new probability distribution $\hat{p}_{ij}$ by applying the softmax function over the $C$ classes as follows:

$$\hat{p}_{ij} = \mathrm{softmax}\left(\log p_{ij} + \log b_{ij}\right).$$

We minimize the negative log-likelihood between the combined probability distribution $\hat{p}_{ij}$ and the ground-truth label. This assigns low training signals to words with highly skewed class distributions. As a result, it prevents the original model from being biased towards statistical cues in datasets. Note that only the original model is updated, and the biased model is fixed during training. At inference, we use only the probability distribution of the original model, $p_{ij}$.

In previous works, biased models are usually pretrained neural networks that take hand-crafted features as input [21]–[24]. On the other hand, [12] used data statistics as the probability distributions of the biased model, which is computationally efficient and performs well. Similarly, we calculate the class distribution of each word using the training set, and then use these statistics. The probability that our biased model assigns to the $k$-th class for a word $w$ is defined as follows:

$$b^{(k)}(w) = \frac{\sum_{i=1}^{N}\sum_{j=1}^{T_i} \mathbb{1}[w_{ij} = w]\,\mathbb{1}[y_{ij} = k]}{\sum_{i=1}^{N}\sum_{j=1}^{T_i} \mathbb{1}[w_{ij} = w]},$$

where $N$ is the number of sentences in the training set, $T_i$ is the length of the $i$-th sentence, and $w_{ij}$ is the $j$-th word in the $i$-th sentence. The indicator $\mathbb{1}[y_{ij} = k]$ equals 1 if the ground-truth label of the word $w_{ij}$ is the $k$-th class, and 0 otherwise.
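A compact sketch combining the two steps above is given below: the word-level class statistics act as the fixed biased model, and the product of the two distributions (sum of log-probabilities followed by a softmax) is trained with the standard negative log-likelihood. The PyTorch-style tensor shapes and function names are our assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F
from collections import Counter, defaultdict

def word_class_priors(train_sentences, train_labels, num_classes, smooth=1e-6):
    # Class distribution of each word in the training set (the "biased model").
    counts = defaultdict(Counter)
    for words, labels in zip(train_sentences, train_labels):
        for w, y in zip(words, labels):          # y is an integer class index
            counts[w][y] += 1
    priors = {}
    for w, c in counts.items():
        total = sum(c.values())
        priors[w] = torch.tensor(
            [(c[k] + smooth) / (total + smooth * num_classes) for k in range(num_classes)]
        )
    return priors

def bias_product_loss(logits, bias_probs, gold_labels):
    # logits:     (seq_len, num_classes) from the original NER model
    # bias_probs: (seq_len, num_classes) fixed word-level priors, no gradient
    # Adding log p + log b and renormalizing implements the bias product.
    combined = F.log_softmax(logits, dim=-1) + torch.log(bias_probs)
    return F.cross_entropy(combined, gold_labels)
```

At test time, only the original model’s probabilities are used, as described above.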

2). Effect of Debiasing

We explore how the debiasing method affects models’ generalization abilities. Table 7 shows the models’ performance changes after applying the debiasing method. The method slightly decreases memorization performance because it reduces the models’ bias towards memorizable mentions and their class distributions. On the other hand, the method consistently improves performance on Syn and Con across the benchmarks. Debiasing methods usually decrease overall performance on benchmarks [22], [24], which is consistent with our results. With recent efforts to reduce bias while maintaining overall performance [24], our debiasing method could be improved in future work. Also, the debiasing method changes the model’s behavior and corrects the errors in the first and third examples in Table 4.

TABLE 7. Performance of BioBERT and BERT With/Without Our Debiasing Method on NCBI-Disease, BC5CDRdis, and BC5CDRchem. ↑ and ↓ Indicate Performance Increases and Decreases When Using the Method, Respectively. The Best Scores are Highlighted in Bold.
Overall In-depth
Model P R F1 Mem Syn Con
NCBI-disease
BioBERT 86.7 90.5 88.6 95.5 80.9 84.1
+ Debias. 85.0 90.2 87.5 94.2 ↓ 81.5 ↑ 86.1 ↑
BERT 83.8 87.6 85.6 95.0 79.0 70.7
+ Debias. 82.0 87.0 84.5 93.1 ↓ 80.0 ↑ 73.2 ↑
BC5CDRdis
BioBERT 82.4 86.3 84.3 93.2 74.9 73.7
+ Debias. 80.4 86.3 83.3 92.3 ↓ 77.1 ↑ 74.7 ↑
BERT 78.5 81.4 79.9 91.5 64.0 63.4
+ Debias. 75.7 81.0 78.3 89.9 ↓ 66.2 ↑ 64.7 ↑
BC5CDRchem
BioBERT 92.1 93.1 92.6 97.8 82.1 87.0
+ Debias. 91.2 93.1 92.1 97.1 ↓ 82.8 ↑ 88.2 ↑
BERT 89.8 90.0 89.9 97.0 72.4 81.1
+ Debias. 87.3 90.7 89.0 96.6 ↓ 73.7 ↑ 83.9 ↑

We also examine whether our debiasing method can improve generalizability to entities with weak name regularity. Before testing the method, we crawled a list of rare diseases and their descriptions from the NORD (National Organization for Rare Disorders) database4 based on our hypothesis that rare diseases tend to have more unique surface forms than common diseases. Disease names were filtered out if BioBERT trained on NCBI-disease successfully extracted them from their descriptions. Since descriptions provide sufficient context to recognize entities, e.g., “African iron overload is a rare disorder characterized by abnormally elevated levels of iron in the body,” an entity’s surface form is likely to be rare if a model fails to recognize the entity from its description. Thus, we assumed that the diseases remaining after filtering have weak name regularity. Finally, we obtained 8 diseases from the database and collected PubMed abstracts in which the diseases appear. Table 8 shows the list of diseases and their frequencies of occurrence. All the disease names differ from conventional patterns, and their CUIs do not appear in the NCBI-disease training data. We tested the debiased model on these diseases, along with COVID-19, using the same relaxed version of recall as for COVID-19. As shown in Table 8, our debiasing method consistently improved generalization to these rare patterns.

3). Side Effects of Debiasing

Our debiasing method prevents models from over-trusting the class distributions and surface forms of mentions, sometimes making the models predict spans of text as entities that have never appeared as such in the training set. Although this exploration helps the debiased models find unseen mentions, as shown in Table 7 and Table 8, it also comes with side effects. To analyze them, we sample 100 cases from the test set of BC5CDRdis that the original BioBERT model predicted correctly but the debiased one did not.

Among these cases, we find 23 abnormal predictions of the debiased model and classify them into three categories, as shown in Table 9. The most frequent type is predicting spans that are not noun phrases. As shown in the first example in the table, the model predicted “Loss of” even though it is an incomplete phrase. The model also predicted the word “infarcted” as an entity although the word is an adjective and is only labeled as “O” in the training set. The second type is related to name regularity. We found that the model sometimes excluded strong patterns from its predictions. For instance, as shown in Table 9, the model predicted entities without “syndrome” and “injury.” When using the debiasing method, there can be a trade-off between performance on entities with weak name regularity and those with strong name regularity. Lastly, the model occasionally predicts special symbols. As shown in the last row of the table, the model predicted the word “sarcomas” together with a comma. The model also recognized parentheses as entities. From these results, we conclude that the debiasing method can lead to abnormal predictions by encouraging models to predict classes of words and spans that are rare (or never appeared) during training.

TABLE 9. Side Effects of Debiasing. Entity Mentions (Ground-Truth Labels) are Displayed in Blue. Model Predictions are Highlighted With Yellow Boxes.
Example
Not noun phrase (15/23)
[1] Loss of righting ability was scored at […]
[2] […] a target to protect infarcted myocardium.
w/o Name regularity (5/23)
[1] Takotsubo syndrome (or apical ballooning syndrome) secondary to Zolmitriptan.
[2] […] upregulation of kidney injury molecule (KIM)-1 […]
Special symbol (3/23)
[1] […] a wide range of cancers including sarcomas, lymphoma, gynecologic and testicular cancers.
[2] Global longitudinal (GLS), circumferential (GCS), and radial strain (GRS) were […]

V. Related Work

A. BioNER Models

In recent years, BioNER has received significant attention for its potential applicability to various downstream tasks in biomedical information extraction. Traditional methods in BioNER are based on hand-crafted rules [25]–[27] or biomedical dictionaries [28], [29]. However, these methods require the knowledge and labour of domain experts and are also vulnerable to unseen entity mentions. With the development of deep learning and the advent of large training datasets, researchers shifted their attention to neural models [4], [30], which are based on recurrent neural networks (RNNs) with conditional random fields (CRFs) [31]. These models automatically learn useful features from datasets without the need for human labour and achieve competitive performance in BioNER. The performance of BioNER models has been further improved with the introduction of multi-task learning on multiple biomedical corpora [5], [6], [32]. Several works demonstrated the effectiveness of jointly learning the BioNER task and other biomedical NLP tasks [33]–[36]. Recently, pretrained language models such as BioBERT achieved SOTA results in many tasks such as relation extraction and question answering, as well as in BioNER [7], [13], [14].

B. Generalization to Unseen Mentions

Generalization to unseen mentions has been an important research topic in the field of NER [37]–[39]. Despite recent attempts to analyze the generalization of NER models in the general domain [18], [40]–[42], there are few such studies in the biomedical domain. Several studies investigated the transferability of BioNER models across datasets [43], [44]. In contrast, we study generalization to new and unseen mentions based on our new data partitioning method. Note that these studies did not split benchmarks and evaluated models based only on overall performance, so our method could be applied to their experimental setups in future work.

C. Dataset Bias

While many recent studies have pointed out dataset bias problems in various NLP tasks such as sentence classification [45]–[47] and visual question answering [48], no prior work has raised bias problems regarding BioNER benchmarks. Our work is the first to deal with dataset bias in BioNER and to demonstrate the effectiveness of a debiasing method. Recent works found that low label consistency (the degree of label agreement of an entity on the training set) decreases the performance of models on general NER benchmarks [40], [41]. In this work, we show that high label consistency can also harm generalization when the label distribution of the test set differs from that of the training set.

VI. Conclusion

In this work, we thoroughly explored the memorization, synonym generalization, and concept generalization abilities of existing BioNER models. We found current best NER models are overestimated, tend to rely on dataset biases, and have difficulty recognizing entities with novel surface patterns. Finally, we showed that the generalizability can be improved using a current debiasing method. We hope that our work can provide insight into the generalization abilities of BioNER models and new directions for future work.

Acknowledgment

The authors would like to thank Jinhyuk Lee, Mujeen Sung, Minbyul Jeong, Sean Yi, Gangwoo Kim, Wonjin Yoon, and Donghee Choi for the helpful feedback.

Biographies


Hyunjae Kim received the B.S. degree in mathematics and convergence software (double major) from Sogang University, South Korea, in 2017. He is currently pursuing the Ph.D. degree in computer science with Korea University, South Korea. His current research interests include developing training or data generation methods for low-resource domains, such as the biomedical domain.


Jaewoo Kang received the B.S. degree in computer science from Korea University, Seoul, South Korea, in 1994, the M.S. degree in computer science from the University of Colorado Boulder, CO, USA, in 1996, and the Ph.D. degree in computer science from the University of Wisconsin–Madison, WI, USA, in 2003. From 1996 to 1997, he was a Technical Staff Member at AT&T Labs Research, Florham Park, NJ, USA. From 1997 to 1998, he was a Technical Staff Member with Savera Systems Inc., Murray Hill, NJ, USA. From 2000 to 2001, he was the CTO and a Co-Founder of WISEngine Inc., Santa Clara, CA, USA, and Seoul. From 2003 to 2006, he was an Assistant Professor with the Department of Computer Science, North Carolina State University, Raleigh, NC, USA. Since 2006, he has been a Professor with the Department of Computer Science, Korea University, where he also works as the Department Head of Bioinformatics for Interdisciplinary Graduate Program.

Appendix I. Details in Partitioning Benchmarks

We classify mentions whose surface forms appear in $\mathcal{M}_{\text{train}}$ but whose CUIs do not appear in $\mathcal{C}_{\text{train}}$ into Mem for single-type datasets (e.g., NCBI-disease and BC5CDR); for a multi-type dataset (e.g., MedMentions), such mentions are classified into Con. Since there are entity mentions that are mapped to more than one CUI, $c_{i,j}$ does not have to be a single CUI and may be a list of CUIs. In this case, we classify a mention into Con if none of the CUIs in the list is included in $\mathcal{C}_{\text{train}}$, and into Syn otherwise. We classify mentions with the unknown CUI “−1” into Con because unknown concepts in the training and test sets are usually different. We lowercase mentions and remove punctuation from them when partitioning benchmarks.
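A small sketch of the mention normalization and the multi-CUI rule described above is shown below, assuming mentions and CUIs are plain strings; the function names are illustrative.

```python
import string

def normalize(mention):
    # Lowercase and strip punctuation before checking mention overlap.
    return mention.lower().translate(str.maketrans("", "", string.punctuation))

def classify_unseen_mention(cuis, train_cuis):
    # For a mention not in the training mentions: Con only if none of its CUIs were seen.
    return "Con" if all(c not in train_cuis for c in cuis) else "Syn"

print(normalize("Wilms' tumour"))                               # "wilms tumour"
print(classify_unseen_mention(["D009369", "-1"], {"D009369"}))  # "Syn"
```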

Appendix II. Model Comparison

Our neural baseline models (i.e., BERT, BioBERT, BlueBERT, and PubMedBERT) share the same model architecture: a Transformer-based encoder [49] with a linear classifier. They differ in the vocabulary, initialization method, and training corpus used during pre-training, as summarized in Table 10. First, BERT is trained on Wikipedia and the BookCorpus [50] from scratch using a vocabulary built from those corpora. BioBERT and BlueBERT are initialized with BERT’s weights and further trained on PubMed articles. Additionally, BlueBERT is trained on the MIMIC-III corpus, which consists of clinical notes. PubMedBERT is also trained on the PubMed corpus, but it is trained from scratch with a vocabulary built from PubMed.

TABLE 10. Differences Between Pretrained Language Models. Vocab. and Init. Indicate Vocabulary and Initialization.

Model Vocab. Init. Corpus
PubMedBERT [13] PubMed - PubMed
BlueBERT [14] Wiki+Books BERT PubMed+MIMIC
BioBERT [7] Wiki+Books BERT PubMed
BERT [10] Wiki+Books - Wiki+Books

Appendix III. Implementation Details

In the experiments, we used the public PyTorch implementation provided by [7].5 We used the bert-base-cased model for BERT,6 the biobert-base-cased-v1.1 model for BioBERT,7 the bluebert_pubmed_uncased_L-24_H-1024_A-16 model for BlueBERT,8 and the BiomedNLP-PubMedBERT-base-uncased-abstract model for PubMedBERT.9 The maximum input sequence length is set to 128; sentences longer than 128 tokens are divided into multiple sentences at the preprocessing stage. We trained and tested our models on a single Quadro RTX 8000 GPU. For our synonym dictionaries, we used the July 2012 version of MEDIC [51] and the November 2019 version of the CTD (Comparative Toxicogenomics Database), provided by Sung et al. [9].

For all models, we used a batch size of 64 and searched the learning rate over {1e-5, 3e-5, 5e-5}. For our debiasing method, we smooth the probability distribution of the biased model using temperature scaling [52], since excessive penalties for bias can hinder the learning process. We searched the temperature parameter over {none, 1.1}, where none indicates that temperature scaling is not applied. We chose the best hyperparameters based on the F1 score on the development set. The selected hyperparameters are described in Table 11. Note that all results are averaged over 5 runs with randomly selected seeds.

TABLE 11. Best Configurations of Model Hyperparameters.

Dataset Model Learning Rate Temperature
NCBI-disease PubMedBERT 5e-5 -
BlueBERT 5e-5 -
BioBERT 5e-5 none
BioBERT+ Debias. 3e-5 none
BERT 5e-5 -
BERT+ Debias. 3e-5 1.1
BC5CDRdis PubMedBERT 5e-5 -
BlueBERT 5e-5 -
BioBERT 5e-5 -
BioBERT+ Debias. 5e-5 1.1
BERT 5e-5 -
BERT+ Debias 5e-5 1.1
BC5CDRchem PubMedBERT 5e-5 -
BlueBERT 5e-5 -
BioBERT 5e-5 -
BioBERT+ Debias. 5e-5 1.1
BERT 3e-5 -
BERT+ Debias. 1e-5 1.1

The original BioBERT model [7] was trained not only on the training set but also on the development set, after the best hyperparameters had been chosen based on the development set. This approach generally improves performance when the number of training examples is insufficient and is commonly used in many BioNER studies. In contrast, we did not use the development set for training models, resulting in lower performance of BERT and BioBERT compared to the performance reported by Lee et al. [7].

Appendix IV. Annotation Inconsistency in Biomedical Databases

As shown in Table 2, the performance of DICTsyn on Mem is lower than that of DICTtrain because there exists annotation inconsistency between benchmarks and biomedical databases. For example, “seizures” and “generalized seizures” are entities with the same concept in the databases, so the dictionary of DICTsyn includes both “seizures” and “generalized seizures.” However, in BC5CDRdis only “seizures” is annotated. Since dictionary models predict the longest text spans that are in their dictionaries, DICTsyn predicts “generalized seizures,” resulting in an incorrect prediction. Also, the dictionary models cannot generalize to new concepts, yet DICTsyn achieved a recall of 1.4 on the Con split of BC5CDRchem due to annotation inconsistency, i.e., there are mentions with the same surface forms but different CUIs.

Appendix V. Tokenization Issue

Following Lee et al. [7], we split words into multiple tokens at punctuation. For example, “COVID-19” is split into the three tokens “COVID,” “-,” and “19.” This tokenization makes it easy to deal with nested entities: if “SARS-CoV” were not split into subtokens, a model could not detect “SARS” as a disease. However, this tokenization is not optimal for predicting the whole word “SARS-CoV” as a virus.
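As an illustration, a simple regex-based version of this punctuation splitting is shown below; the actual preprocessing follows the scripts of Lee et al. [7] and may handle more cases.

```python
import re

def split_on_punctuation(text):
    # Split at every non-word character while keeping the punctuation as its own token.
    return [t for t in re.split(r"([\W])", text) if t and not t.isspace()]

print(split_on_punctuation("COVID-19"))  # ['COVID', '-', '19']
print(split_on_punctuation("SARS-CoV"))  # ['SARS', '-', 'CoV']
```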

To see whether the dramatically low performance is due to this tokenization issue, we preprocessed “COVID-19” as a single word and tested BioBERT on it. As a result, the performance of BioBERT improved from 45.7 to 54.1 and from 3.4 to 15.4 when trained on NCBI-disease and BC5CDRdis, respectively. Although the change in tokenization clearly boosts performance, the performance improvements shown in Table 5 and Table 6 are not explained by the tokenization issue alone. The main reason for the failure to recognize COVID-19 is that models are vulnerable to unique name patterns.

Funding Statement

This work was supported in part by the National Research Foundation of Korea under Grant NRF-2020R1A2C3010638; in part by the Korea Health Technology Research and Development Project through the Korea Health Industry Development Institute (KHIDI) funded by the Ministry of Health & Welfare, Republic of Korea, under Grant HR20C0021; and in part by the Ministry of Science and ICT (MSIT) under the ICT Creative Consilience Program (IITP-2022-2020-0-01819) Supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP), South Korea.

Footnotes

1

Code and datasets are available at https://github.com/dmis-lab/bionergeneralization.

2

Note that BC5CDRchem is excluded from this experiment since it is not easy to distinguish abbreviations from other chemical entities such as identifiers and formulas due to their similar forms.

3

“B” denotes the beginning of an entity in the BIO tagging scheme [16], [17].

References

  • [1].Landhuis E., “Scientific literature: Information overload,” Nature, vol. 535, no. 7612, pp. 457–458, Jul. 2016. [DOI] [PubMed] [Google Scholar]
  • [2].Wang L. L., Lo K., Chandrasekhar Y., Reas R., Yang J., Eide D., Funk K., Kinney R., Liu Z., Merrill W., and Mooney P., “CORD-19: The COVID-19 open research dataset,” in Proc. 1st Workshop NLP COVID-19 ACL, 2020. [Online]. Available: https://aclanthology.org/2020.nlpcovid19-acl.1/ [Google Scholar]
  • [3].Liu S., Tang B., Chen Q., and Wang X., “Drug-drug interaction extraction via convolutional neural networks,” Comput. Math. Methods Med., vol. 2016, pp. 1–8, Jan. 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Habibi M., Weber L., Neves M., Wiegandt D. L., and Leser U., “Deep learning with word embeddings improves biomedical named entity recognition,” Bioinformatics, vol. 33, no. 14, pp. i37–i48, Jul. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Wang X., Zhang Y., Ren X., Zhang Y., Zitnik M., Shang J., Langlotz C., and Han J., “Cross-type biomedical named entity recognition with deep multi-task learning,” Bioinformatics, vol. 35, no. 10, pp. 1745–1752, May 2019. [DOI] [PubMed] [Google Scholar]
  • [6].Yoon W., So C. H., Lee J., and Kang J., “CollaboNet: Collaboration of deep neural networks for biomedical named entity recognition,” BMC Bioinf., vol. 20, no. S10, p. 249, May 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Lee J., Yoon W., Kim S., Kim D., Kim S., So C. H., and Kang J., “BioBERT: A pre-trained biomedical language representation model for biomedical text mining,” Bioinf., vol. 36, no. 4, pp. 1234–1240, Feb. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Doğan R. I., Leaman R., and Lu Z., “NCBI disease corpus: A resource for disease name recognition and concept normalization,” J. Biomed. Inform., vol. 47, pp. 1–10, Feb. 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Sung M., Jeon H., Lee J., and Kang J., “Biomedical entity representations with synonym marginalization,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 3641–3650. [Google Scholar]
  • [10].Devlin J., Chang M.-W., Lee K., and Toutanova K., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Jun. 2019, pp. 4171–4186. [Google Scholar]
  • [11].Li J., Sun Y., Johnson R. J., Sciaky D., Wei C.-H., Leaman R., Davis A. P., Mattingly C. J., Wiegers T. C., and Lu Z., “BioCreative V CDR task corpus: A resource for chemical disease relation extraction,” Database, vol. 2016, 2016, Art. no. baw068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Ko M., Lee J., Kim H., Kim G., and Kang J., “Look at the first sentence: Position bias in question answering,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 1109–1121. [Google Scholar]
  • [13].Gu Y., Tinn R., Cheng H., Lucas M., Usuyama N., Liu X., Naumann T., Gao J., and Poon H., “Domain-specific language model pretraining for biomedical natural language processing,” ACM Trans. Comput. Healthcare, vol. 3, no. 1, pp. 1–23, Jan. 2022. [Google Scholar]
  • [14].Peng Y., Yan S., and Lu Z., “Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets,” in Proc. 18th BioNLP Workshop Shared Task, 2019, pp. 58–65. [Google Scholar]
  • [15].Rindflesch T. C., Tanabe L., Weinstein J. N., and Hunter L., “EDGAR: Extraction of drugs, genes and relations from the biomedical literature,” Biocomputing, vol. 1999, pp. 517–528, Jan. 2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Sang E. F. T. K. and Meulder F. D., “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proc. 7th Conf. Natural Lang. Learn. (HLT-NAACL), 2003, pp. 1–6. [Google Scholar]
  • [17].Ramshaw L. A. and Marcus M. P., “Text chunking using transformation-based learning,” in Natural Language Processing Using Very Large Corpora (Text, Speech and Language Technology), vol. 11, Armstrong S., Church K., Isabelle P., Manzi S., Tzoukermann E., and Yarowsky D., Eds. Dordrecht, The Netherlands: Springer, 1999, doi: 10.1007/978-94-017-2390-9_10. [DOI] [Google Scholar]
  • [18].Lin H., Lu Y., Tang J., Han X., Sun L., Wei Z., and Yuan N. J., “A rigorous study on named entity recognition: Can fine-tuning pretrained model lead to the promised land?” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 7291–7300. [Google Scholar]
  • [19].Ghaddar A., Langlais P., Rashid A., and Rezagholizadeh M., “Context-aware adversarial training for name regularity bias in named entity recognition,” Trans. Assoc. Comput. Linguistics, vol. 9, pp. 586–604, Jul. 2021. [Google Scholar]
  • [20].Mohan S. and Li D., “MedMentions: A large biomedical corpus annotated with UMLS concepts,” in Proc. Conf. Automated Knowl. Base Construct. (AKBC), Amherst, MA, USA, May 2019. [Google Scholar]
  • [21].Clark C., Yatskar M., and Zettlemoyer L., “Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases,” in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 4069–4082. [Google Scholar]
  • [22].He H., Zha S., and Wang H., “Unlearn dataset bias in natural language inference by fitting the residual,” in Proc. 2nd Workshop Deep Learn. Approaches Low-Resource NLP (DeepLo), 2019, pp. 132–142. [Google Scholar]
  • [23].Karimi Mahabadi R., Belinkov Y., and Henderson J., “End-to-end bias mitigation by modelling biases in corpora,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 8706–8716. [Google Scholar]
  • [24].Utama P. A., Moosavi N. S., and Gurevych I., “Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance,” in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 8717–8729. [Google Scholar]
  • [25].Fukuda K.-I., Tsunoda T., Tamura A., and Takagi T., “Toward information extraction: Identifying protein names from biological papers,” in Proc. PAC Symp. Biocomput., vol. 707, no. 18, 1998, pp. 707–718. [PubMed] [Google Scholar]
  • [26].Proux D., Rechenmann F., Julliard L., Pillet V., and Jacq B., “Detecting gene symbols and names in biological texts,” Genome Informat., vol. 9, pp. 72–80, Jan. 1998. [PubMed] [Google Scholar]
  • [27].Narayanaswamy M., Ravikumar K., and Vijay-Shanker K., “A biological named entity recognizer,” in Biocomputing. HI, USA: World Scientific, 2003, pp. 427–438. [DOI] [PubMed] [Google Scholar]
  • [28].Hettne K. M., Stierum R. H., Schuemie M. J., Hendriksen P. J. M., Schijvenaars B. J. A., Mulligen E. M. V., Kleinjans J., and Kors J. A., “A dictionary to identify small molecules and drugs in free text,” Bioinformatics, vol. 25, no. 22, pp. 2983–2991, Nov. 2009. [DOI] [PubMed] [Google Scholar]
  • [29].Gerner M., Nenadic G., and Bergman C. M., “LINNAEUS: A species name identification system for biomedical literature,” BMC Bioinf., vol. 11, p. 85, Feb. 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Sahu S. and Anand A., “Recurrent neural network models for disease name recognition using domain invariant features,” in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, 2016, pp. 2216–2225. [Google Scholar]
  • [31].Lafferty J., McCallum A., and Pereira F. C., “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. 18th Int. Conf. Mach. Learn. (ICML), 2001, pp. 282–289. [Google Scholar]
  • [32].Crichton G., Pyysalo S., Chiu B., and Korhonen A., “A neural network multi-task learning approach to biomedical named entity recognition,” BMC Bioinf., vol. 18, no. 1, p. 368, Dec. 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Leaman R. and Lu Z., “TaggerOne: Joint named entity recognition and normalization with semi-Markov Models,” Bioinformatics, vol. 32, pp. 2839–2846, Sep. 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Watanabe T., Tamura A., Ninomiya T., Makino T., and Iwakura T., “Multi-task learning for chemical named entity recognition with chemical compound paraphrasing,” in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 6245–6250. [Google Scholar]
  • [35].Zhao S., Liu T., Zhao S., and Wang F., “A neural multi-task learning framework to jointly model medical named entity recognition and normalization,” in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 817–824. [Google Scholar]
  • [36].Peng Y., Chen Q., and Lu Z., “An empirical study of multi-task learning on BERT for biomedical text mining,” in Proc. 19th SIGBioMed Workshop Biomed. Lang. Process., 2020, pp. 205–214. [Google Scholar]
  • [37].Augenstein I., Derczynski L., and Bontcheva K., “Generalisation in named entity recognition: A quantitative analysis,” Comput. Speech Lang., vol. 44, pp. 61–83, Jul. 2017. [Google Scholar]
  • [38].Taille B., Guigue V., and Gallinari P., “Contextualized embeddings in named-entity recognition: An empirical study on generalization,” in Proc. Eur. Conf. Inf. Retr., Cham, Switzerland: Springer, 2020, pp. 383–391. [Google Scholar]
  • [39].Agarwal O., Yang Y., Wallace B. C., and Nenkova A., “Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models,” 2020, arXiv:2004.04123. [Google Scholar]
  • [40].Fu J., Liu P., Zhang Q., and Huang X., “Rethinking generalization of neural models: A named entity recognition case study,” in Proc. AAAI, 2020, pp. 7732–7739. [Google Scholar]
  • [41].Fu J., Liu P., and Neubig G., “Interpretable multi-dataset evaluation for named entity recognition,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 6058–6069. [Google Scholar]
  • [42].Agarwal O., Yang Y., Wallace B. C., and Nenkova A., “Interpretability analysis for named entity recognition to understand system predictions and how they can improve,” Comput. Linguistics, vol. 47, no. 1, pp. 117–140, Apr. 2021. [Google Scholar]
  • [43].Giorgi J. M. and Bader G. D., “Transfer learning for biomedical named entity recognition with neural networks,” Bioinformatics, vol. 34, no. 23, pp. 4087–4094, Dec. 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Giorgi J. M. and Bader G. D., “Towards reliable named entity recognition in the biomedical domain,” Bioinformatics, vol. 36, no. 1, pp. 280–286, Jan. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [45].Poliak A., Naradowsky J., Haldar A., Rudinger R., and Van Durme B., “Hypothesis only baselines in natural language inference,” in Proc. 7th Joint Conf. Lexical Comput. Semantics, 2018, pp. 180–191. [Google Scholar]
  • [46].Niven T. and Kao H.-Y., “Probing neural network comprehension of natural language arguments,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 4658–4664. [Google Scholar]
  • [47].McCoy T., Pavlick E., and Linzen T., “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 3428–3448. [Google Scholar]
  • [48].Agrawal A., Batra D., Parikh D., and Kembhavi A., “Don’t just assume; look and answer: Overcoming priors for visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4971–4980. [Google Scholar]
  • [49].Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L., and Polosukhin I., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008. [Google Scholar]
  • [50].Zhu Y., Kiros R., Zemel R., Salakhutdinov R., Urtasun R., Torralba A., and Fidler S., “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 19–27. [Google Scholar]
  • [51].Davis A. P., Wiegers T. C., Rosenstein M. C., and Mattingly C. J., “MEDIC: A practical disease vocabulary used at the comparative toxicogenomics database,” Database, vol. 2012, Mar. 2012, Art. no. bar065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Guo C., Pleiss G., Sun Y., and Weinberger K. Q., “On calibration of modern neural networks,” in Proc. 18th Int. Conf. Mach. Learn. (ICML), 2017, pp. 1321–1330. [Google Scholar]
