Abstract
Discriminating matched named entity pairs and identifying the entities’ canonical forms are critical in text mining tasks. More precise named entity normalization benefits subsequent text analytic applications. We built a named entity normalization model with a novel edge weight updating neural network. We verify our model’s performance on the NCBI Disease, BC5CDR Disease, and BC5CDR Chemical datasets, which are widely used named entity normalization datasets in the bioinformatics field. We also tested our model on our own financial named entity normalization dataset to validate its efficacy for more general applications. Using the constructed dataset, we differentiate named entity pairs. When tested on these four datasets, our proposed model achieved state-of-the-art named entity normalization performance in terms of various evaluation metrics.
Keywords: Named entity normalization, Edge weight updating neural network, Text mining in bioinformatics, Text mining in finance, Named entity graph
Introduction
Text mining technology is evolving rapidly thanks to the exponential growth in the number of text-rich documents available online, and as a result, it is being widely applied in a range of domains such as finance and bioinformatics. Text mining aims to extract information from documents to derive valuable insights. Documents subject to analysis contain many named entities, which are proper names that denote unique objects such as organizations, products, persons, and locations. The technique used to extract named entities from documents is called named entity recognition (NER, henceforth). Named entity normalization (NEN, henceforth), in turn, involves matching extracted named entities that share the same identity and is pivotal for text mining tasks.
More specifically, in the biomedical domain, disease names and chemicals in drugs often have different surface forms while sharing the same concept. Named entities with different surface forms that share the same concept can be divided into the following categories: (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. For example, “hepatomegaly” and “liver enlarged” do not have matching strings, but the two disease names have identical meanings, and thus, these two named entities are synonyms. Biomedical named entities have a wide variety of different surface forms compared with entities from other text sources. More accurate named entity normalization techniques will potentially improve the quality of downstream tasks. Moreover, matching entity pairs such as “International Business Machines” and “IBM”, which are examples of acronyms, is critical in financial text mining applications. Linking entities with the same identity enables accurate sentiment analysis on firms and products. Furthermore, evaluating the impact of news on the stock market requires a connection between news articles and related firms. Given the wide range of named entities in bioinformatics and finance documents, the total number of tokens to be calculated for text clustering and classification is enormous.
The early NEN models explored knowledge-based or rule-based approaches [1–4]. Rules for named entity matching that are generated from domain knowledge are valid only for the dataset for which they were created, and rule-based models are not robust to neologisms. To overcome this lack of robustness, models based on machine learning have been introduced [5–7]. However, machine learning models are limited to specific fields such as bioinformatics NEN and chemical engineering NEN due to the lack of NEN datasets in other domains. Our research aims to construct a fully automated NEN model that can be applied to various other domains. To test our model’s robustness in a different domain, we also apply it to an NEN dataset in finance.
An automated named entity normalization model reduces the burden of hand-mined information extraction tasks. Clear linkage between entities with different forms, such as abbreviations and acronyms, aids in more accurate sentiment analysis. The named entity normalization model also benefits the creation of more comprehensible document classification and clustering. The primary contributions of our study are (1) constructing a better performing NEN model using an Edge Weight Updating Neural Network and (2) applying our proposed model to bioinformatics NEN and financial NEN tasks.
The proposed method, that is, the edge weight updating neural network, consists of four parts: (1) ground truth entity graph construction, (2) similarity-based entity graph construction, (3) edge weight updating neural network training, and (4) edge weight updating neural network inferencing. The main concept behind the Edge Weight Updating Neural Network is to minimize the difference between the Ground Truth Entity Graph’s edge weight distributions and the Similarity-Based Entity Graph’s edge weight distributions. By minimizing the difference between the edge weight distributions of the two graphs, entity embeddings capture more accurate information on semantic similarity between matching entities.
Our proposed model is evaluated on three widely used bioinformatics datasets (NCBI Disease, BC5CDR Disease, and BC5CDR Chemical) and its performance is compared with other cutting-edge models. Furthermore, to validate the efficacy of our proposed model in general NEN tasks, we construct a financial NEN dataset with state-of-the-art NER using BERT [8]. Using the constructed dataset, we propose a deep learning model to solve more practical financial NEN tasks. Our dataset incorporates major challenges in entity matching: (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. Compared with other recent NEN models, our proposed model shows higher accuracies on all datasets used in the experiments. Because our model is tested with not only bioinformatics NEN datasets but also a financial NEN dataset, the results support its efficacy in general NEN tasks.
The remainder of this paper is organized as follows. Section 2 describes related work. The structure of our proposed model is described in Sect. 3. Experiment settings for testing model performance are provided in Sect. 4. Section 4.1 presents an overview of the datasets we used for evaluations, including a brief explanation of the pre-constructed NEN datasets from the bioinformatics domain. Section 4.1 also provides an overview of data preprocessing and financial NEN dataset construction, with examples. In Sect. 5, we present the details regarding the qualitative and quantitative analyses we conducted on the models. Finally, in Sect. 6, we present our conclusions.
Related Work
The bioinformatics, chemical engineering, and materials science domains actively adopt cutting-edge deep learning frameworks for NEN tasks. According to Cho et al. [9], various products exist for recognizing and normalizing named entities in biomedical fields, such as ProMiner [3] and MetaMap [10]. DNorm [11] and TaggerOne [12] also used machine learning models, pairwise ranking scoring and semi-Markov models, respectively, for NEN processing. In genetic engineering, GenNorm [13] and GNAT [14] are used to normalize gene names. ChemSpot [15] uses Conditional Random Fields for NER and NEN tasks in chemical engineering. Weston et al. [16] developed the MatScholar [16] Python repository to perform general NLP tasks on materials science texts, including entity normalization.
The studies and products above used NEN datasets concentrated on specific domains. ShARe/CLEF [17] is one of the widely used NEN datasets for bioinformatics and is made up of clinical notes. The NCBI [18] dataset contains PubMed abstracts for disease name normalization tasks. TAC2017ADR [19] aims to link identical drug labels. The BC2GM [20], BioNLP09 [21], and BioNLP-OST19 [22] datasets deal with genes, proteins, and bacteria, respectively. In chemical engineering, SCAI [23] and IUPAC [24] are available for research on chemical name matching. Similar to chemical names, Weston et al. [16] developed a dataset for materials engineering to normalize entities to a canonical form.
Applying machine learning algorithms in the financial domain is gaining increasing attention. One major branch is stock movement forecasting using various deep learning mechanisms [25, 26]. Thanks to the rapid development of unstructured data processing techniques, studies applying text mining techniques to financial fields have increased in number. Gupta et al. [27] illustrated the trends in applying text mining in finance. Among the many related text mining applications in finance, NEN can be applied to various areas of financial research and practice. When text mining techniques are applied to real-world problems, NER and NEN models are run beforehand as preprocessing steps. However, NEN datasets for the financial domain are scant, and there is a need to develop a dataset targeting financial NEN.
Many researchers have developed targeted datasets for more general NEN tasks in domains such as user comments, product description, and financial invoices. For example, in their study, Jijkoun et al. [28] used user comments from newspaper websites. Sun et al. [29] performed normalization of product entity names, for which the dataset was developed by the authors. The study conducted by Francis et al. [30] on financial invoices is the most relevant one to our study. However, Francis et al. focused on insurance, telecommunications, banking, and tax companies using the following entities: International Bank Account Number (IBAN) of the beneficiary, invoice number, invoice date, and due date [30]. The focus of our study is on more general financial entity normalization, which covers entities from all financial sectors. Previous studies using the datasets illustrated above used various machine learning and deep learning models.
There are similarities between string matching methodologies in various other fields and NEN research. Sun et al. [29] proposed NEN for product names using a pre-constructed product entity linkage dictionary. In semantic string matching, Siamese Neural Networks are widely used [31–33]. Krivosheev et al. [34] used a Siamese Graph Neural Network for company name normalization. We need to extend NEN on company names to NEN on a wide range of product names and legal entities. The Siamese RNN model successfully captures the morphological similarity between strings [35]. Niu et al. [36] applied attention mechanisms to medical concept normalization. Furthermore, the evolution of Transformer-based models enables the adoption of pre-trained language models such as BERT [8] for entity linking problems [37].
The major developments in recent NEN research are as follows. D’Souza et al. [1] proposed an early NEN model using a rule-based approach, which requires comparatively more human input when generating the rules. The model is static and, thus, new rules may need to be created when applying the model to other datasets. NEN models that use more advanced machine learning and deep learning techniques can be more effective. Leaman et al. [12] used a semi-Markov model, Li et al. [38] used a word-level CNN model, and Wright and Dustin [39] and Phan et al. [40] used models based on BiGRU and BiLSTM, respectively. However, BERT achieved state-of-the-art performance in many general text mining and natural language processing (NLP) challenges. Compared with the four models illustrated above, the most recent studies, such as the BERT Ranking model [6] and BioSyn [7], take full advantage of BERT by training the model based on BERT embeddings. The BERT Ranking model [6] used a ranking-based objective function, and BioSyn [7] used Synonym Marginalization as the objective function for training. Our proposed model optimizes BERT embedding vectors with an edge weight updating neural network over named entity graphs. Our proposed model successfully captures the ground truth linkage between named entity graphs, achieving the highest accuracies. Previous NEN studies focus mainly on NEN datasets from a specific domain. To test the efficacy of our model in more general NEN tasks, we evaluate our model with NEN datasets from both the bioinformatics domain and the financial domain.
Many NEN studies explore semi-supervised learning models. Our proposed model is motivated by one of the leading semi-supervised models on images, Edge-Labeling Graph Neural Network for Few-shot Learning [41] (EGNN). The major difference between EGNN and our model is that EGNN labels an edge in each round of training, whereas our model updates edge weights for the top K connected entities. By capturing more node and edge information simultaneously in each round of training, the proposed model shows better performance compared with other NEN models.
Proposed Method
Our proposed model, Edge Weight Updating Neural Network, consists of four major parts.
Ground Truth Entity Graph construction.
Similarity-Based Entity Graph construction.
Edge Weight Updating Neural Network training.
Edge Weight Updating Neural Network inferencing.
The basic idea behind the Edge Weight Updating Neural Network is to minimize the difference between the Ground Truth Entity Graph’s edge weight distributions and the Similarity-Based Entity Graph’s edge weight distributions. The detailed flow diagram of the Edge Weight Updating Neural Network is presented in Fig. 1. Our motivation for constructing the Edge Weight Updating Neural Network is to provide more positive and negative samples for training at once. Training a model to minimize the difference between ground truth data and reconstructed data is more widely used in deep learning models in computer vision, such as GAN [42]. We apply these ideas to text mining and create a model that is trained to minimize the difference between the edge weight distributions of the ground truth entity graph and those of the similarity-based entity graph. In Fig. 1, AWS is our query entity. First, with the vanilla BERT encoder, unrelated entities such as Apple, Inc. and NYSE might have higher similarity scores (edge weights between two entities) than the entity Amazon AWS. Then, the similarity-based graph is constructed for the given query entities according to the current similarity scores. The graph’s edge weight distribution is compared with the ground truth entity graph’s edge weight distribution. Entity embeddings are trained with the Kullback–Leibler divergence [43] loss between the two graphs. After iterating through these steps, the BERT encoder is trained to calculate the two entities’ ground truth similarity score.
Fig. 1.
Diagram of edge weight updating neural network with the query entity, AWS
We use the BERT model for the named entity embeddings, for two main reasons. First, there exist various pretrained BERT models that serve specific purposes, such as BioBERT [44] for bioinformatics documents, FinBERT [45] for finance documents, and PatentBERT [46] for patent documents. Second, compared to other language embedding models such as Word2Vec [47], Glove [48], and Fasttext [49], the BERT model’s WordPiece tokenizer is more robust in handling the out-of-vocabulary problem. The numbers of out-of-vocabulary entities under the pretrained language models listed above are tabulated in Table 1.
Table 1.
Number of out of vocabulary entities using pretrained language models
| Dataset (total # entities) | Word2Vec | Glove | Fasttext | BERT-based models |
|---|---|---|---|---|
| NCBI disease (73,181) | 8469 | 7991 | 7279 | 0 |
| BC5CDR disease (73,548) | 8526 | 8077 | 7319 | 0 |
| BC5CDR chemical (407,428) | 87,859 | 70,683 | 70,091 | 0 |
| Financial NEN dataset (24,195) | 3871 | 4300 | 3615 | 0 |
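The zero out-of-vocabulary counts for BERT-based models follow from subword tokenization: a word missing from the vocabulary is split into known pieces rather than discarded. The sketch below illustrates greedy longest-match-first segmentation in the style of WordPiece. The toy vocabulary and the function name `wordpiece_tokenize` are assumptions made for illustration; this is not the tokenizer shipped with any pretrained model.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical); real vocabularies are far larger.
toy_vocab = {"hep", "##ato", "##mega", "##ly", "liver"}
print(wordpiece_tokenize("hepatomegaly", toy_vocab))
```

With a realistic vocabulary that also contains single characters, almost every entity string can be segmented, which is why the last column of Table 1 is zero.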
Detailed steps for constructing the Ground Truth Named Entity Graph, building the Similarity-Based Entity Graph, and training and inferencing the Edge Weight Updating Neural Network are presented in Sects. 3.1, 3.2, 3.3 and 3.4, respectively.
Ground Truth Entity Graph Construction
Ground Truth Entity Graph construction is based on the entity mentions in each dataset and their mapped concept IDs. Figure 2 demonstrates the steps for building the graph.
For the NEN corpus, each entity is annotated with one or more concept IDs. For example in Fig. 2, entities A, B, and C share the same concept ID, ID_1. Then, entities A, B, and C are fully connected in the entity graph. Other entity pairs, D - E (concept ID: ID_2) and F - G (concept ID: ID_3) are linked. The training dataset for each NEN corpus has query entities with corresponding concept ID. If query entity Q has a concept ID of ID_1, then, query entity Q will be linked to entities A, B, and C in the pre-constructed graph. As the constructed graph is the ground truth graph, each edge weight in the graph is 1.
Fig. 2.

Ground truth named entity graph construction
We iterate over all the entities in the training sets, which include the reference dictionary entity table and the query entity table. The graph created by the steps above is the Ground Truth Entity Graph, which is the reference (target) graph that the Similarity-Based Entity Graph will try to match.
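The construction described above can be sketched in a few lines. This is a minimal illustration under the assumption that annotations arrive as a mapping from entity name to a list of concept IDs; the function name `build_ground_truth_graph` and the toy annotations are hypothetical and mirror the Fig. 2 example.

```python
from collections import defaultdict
from itertools import combinations


def build_ground_truth_graph(entity_to_concepts):
    """Return {frozenset({a, b}): 1.0} for entities sharing a concept ID."""
    members = defaultdict(set)
    for entity, concept_ids in entity_to_concepts.items():
        for cid in concept_ids:
            members[cid].add(entity)

    edges = {}
    for group in members.values():
        # Entities under the same concept ID are fully connected.
        for a, b in combinations(sorted(group), 2):
            edges[frozenset((a, b))] = 1.0  # ground truth edge weight
    return edges


# Hypothetical annotations mirroring the Fig. 2 example; Q is a query entity.
annotations = {
    "A": ["ID_1"], "B": ["ID_1"], "C": ["ID_1"], "Q": ["ID_1"],
    "D": ["ID_2"], "E": ["ID_2"],
    "F": ["ID_3"], "G": ["ID_3"],
}
graph = build_ground_truth_graph(annotations)
```

Storing each edge as a `frozenset` keeps the graph undirected, so the edge between Q and A is found regardless of the order in which the two entities are given.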
Similarity-Based Entity Graph Construction
For each query entity, the Similarity-Based Entity Graph is constructed as follows. Graph edges are calculated using BERT embedding vector similarities. We use BioBERT [44] for the initial BERT embeddings of the bioinformatics NEN corpora and the original BERT [8] for the initial BERT embeddings of the financial NEN corpus.
For example, in Fig. 3, let the BERT embedding of query entity Q be $e_Q \in \mathbb{R}^{768}$, where 768 is the vector length of BERT embeddings. Similarly, the BERT-based embeddings of the entities in the dictionary set are denoted as $e_{D_i} \in \mathbb{R}^{768}$ for $i = 1, \dots, N$. Because the BERT embedding has a fixed length of 768, all of our embedding vectors have a vector length of 768.
To calculate the edge weights based on entity similarities, we calculate inner products between query entities and dictionary entities. Because matrix multiplication is fast, computing the similarity between every query entity and every dictionary entity takes relatively little time. The largest dataset, BC5CDR Chemical, with matrix sizes of (407,454 * 768) by (1317 * 768), takes about 0.77 s on CPU and 0.27 s on GPU. The fastest calculation time on CPU is for NCBI Disease, (72,887 * 768) by (1,587 * 768), which takes 0.20 s. The fastest calculation time on GPU is for the Finance NEN dataset, (20,071 * 768) by (20,071 * 768), which takes 0.042 s. Let $\langle \cdot, \cdot \rangle$ denote the inner product and let $S_Q$ be the set of similarities between query entity Q and all the entities in the dictionary; then the similarity between each query entity and each dictionary entity is expressed as Eq. 1,

$$S_Q = \{\, s(Q, D_i) \mid i = 1, \dots, N \,\}, \quad \text{where } s(Q, D_i) = \langle e_Q, e_{D_i} \rangle \tag{1}$$

where $e_Q$ and $e_{D_i}$ are the embedding vectors of query entity Q and the $i$-th dictionary entity $D_i$, and $N$ is the number of dictionary entities.
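As a small illustration of Eq. 1, the sketch below computes the inner product between one query embedding and each dictionary embedding. The 3-dimensional vectors and the names `inner_product` and `similarity_scores` are assumptions chosen for readability; in practice the same computation would be a single matrix product over 768-dimensional BERT embeddings.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def similarity_scores(query_vec, dictionary):
    """Map each dictionary entity name to its inner product with the query."""
    return {name: inner_product(query_vec, vec)
            for name, vec in dictionary.items()}


# Toy embeddings (hypothetical values, 3 dimensions instead of 768).
query = [1.0, 2.0, 0.0]
dictionary = {
    "A": [1.0, 1.0, 0.0],
    "B": [0.0, 2.0, 1.0],
    "C": [-1.0, 0.0, 5.0],
}
scores = similarity_scores(query, dictionary)
```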
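We normalize the similarity scores by dividing each by the maximum similarity score in the query entity’s similarity score set, $\max(S_Q)$. For the Similarity-Based Entity Graph, the top K edges based on similarity score are selected. The highlighted blue region in the entity similarity table for query entity Q in Fig. 3 demonstrates the edge weight determination steps when $K = 5$. Mathematically, edge weights are calculated using Eq. 2,

$$w(Q, D_i) = \begin{cases} \dfrac{s(Q, D_i)}{\max(S_Q)} & \text{if } D_i \in \mathrm{TopK}(S_Q), \\ 0 & \text{otherwise,} \end{cases} \tag{2}$$

where $s(Q, D_i)$ is the similarity between query entity Q and dictionary entity $D_i$, $S_Q$ is the set of these similarities for Q, and $\mathrm{TopK}(S_Q)$ is the set of the K dictionary entities with the highest similarity scores.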
For each training epoch, which is illustrated in Sect. 3.3, edge weights are updated. Updated entity embedding vectors generate new similarity scores that alter the edge weights in the graph.
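The normalization and top K selection can be sketched as follows. The function name `top_k_edge_weights` and the toy scores are hypothetical, and the sketch assumes the maximum similarity score is positive so that division by it preserves the ranking.

```python
def top_k_edge_weights(scores, k):
    """Normalize scores by their maximum and keep the k largest as edges."""
    max_score = max(scores.values())  # assumed positive in this sketch
    normalized = {name: s / max_score for name, s in scores.items()}
    ranked = sorted(normalized, key=normalized.get, reverse=True)
    # Entities outside the top k get no edge (implicit weight 0).
    return {name: normalized[name] for name in ranked[:k]}


# Toy similarity scores for one query entity (hypothetical values).
scores = {"A": 8.0, "B": 10.0, "C": 6.0, "D": 7.0,
          "E": 2.0, "F": 5.0, "G": 1.0}
edges = top_k_edge_weights(scores, k=5)
```

Because the embeddings change after every training step, this function would be called again with fresh scores each time the graph is rebuilt.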
Fig. 3.
Entity matching graph based on entity similarity construction
Edge Weight Updating Neural Network Training
Fig. 4.
Minimizing the edge weight distributions in edge weight updating neural network for query entity Q
The main concept of the Edge Weight Updating Neural Network is to minimize the difference between the edge weights’ discrete distribution for each query entity in the Ground Truth Entity Graph and in the Similarity-Based Entity Graph. The Similarity-Based Entity Graph is dynamic. For each iteration in the training phase, the Similarity-Based Entity Graph is reconstructed with the BERT model’s parameters updated in the previous training iteration. As illustrated in Sect. 3.2, edge weights are calculated from the entities’ embeddings. In each training epoch of the Edge Weight Updating Neural Network, the baseline BERT model’s parameters are optimized to mimic the ground truth edge weight distributions.
Figure 4 shows the training process of our proposed model when the number of connected edges in the Similarity-Based Entity Graph is 5 ($K = 5$). Following the example in Sect. 3.2, query entity Q is connected to dictionary entities A, B, C, D, and F, and the edge weights are 0.8, 0.9, 0.6, 0.7, and 0.5, respectively. Given the Ground Truth Entity Graph in Sect. 3.1, the true edge weights for the connected edges between query entity Q and dictionary entities A, B, C, D, and F are 1, 1, 1, 0, and 0, respectively.
In the training procedure, BERT parameters are tuned to bring the edge weight distributions in the Similarity-Based Entity Graph closer to the ground truth edge weight distributions. We use the Kullback–Leibler Divergence Loss [43] (KL divergence loss, henceforth) for training our model. As the edge weight distribution is discrete, we normalize the edge weights using the Softmax function.
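We denote a graph as G, an entity as V, and an edge as E. The Ground Truth Entity Graph and the Similarity-Based Entity Graph are denoted as $G_{GT} = (V, E_{GT})$ and $G_{S} = (V, E_{S})$, respectively. The adjacency matrices for the Ground Truth Entity Graph and the Similarity-Based Entity Graph are denoted $A_{GT}$ and $A_{S}$. $P_{S}(Q)$ is the discrete distribution of edge weights of Q in the Similarity-Based Entity Graph. $P_{GT}(Q)$ is the discrete distribution of edge weights of Q in the Ground Truth Entity Graph. Our KL divergence loss is calculated using Eq. 3,

$$\mathcal{L}_{KL} = \sum_{Q} D_{KL}\big(P_{GT}(Q) \,\|\, P_{S}(Q)\big) = \sum_{Q} \sum_{i} P_{GT}(Q)_i \log \frac{P_{GT}(Q)_i}{P_{S}(Q)_i} \tag{3}$$

where $P_{GT}(Q) = \mathrm{softmax}(A_{GT}[Q, \cdot])$ and $P_{S}(Q) = \mathrm{softmax}(A_{S}[Q, \cdot])$ are taken over the K edges connected to Q.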
We use an Adam optimizer with weight decay [50], and set the batch size to 16 and the number of connected edges in the Similarity-Based Entity Graph to 30 ($K = 30$) for all datasets we test. We train our model for 50 epochs. The best scores are reported in Sect. 4.
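To make the loss concrete, the sketch below evaluates a KL divergence on the worked example from this section (edge weights 0.8, 0.9, 0.6, 0.7, 0.5 against ground truth 1, 1, 1, 0, 0). It is a plain-Python illustration, not the training code: the direction of the divergence (ground truth as the target distribution) and the helper names `softmax` and `kl_divergence` are assumptions, and no gradient step is performed.

```python
import math


def softmax(values):
    """Turn raw edge weights into a discrete probability distribution."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, assuming q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Edge weights between query Q and dictionary entities A, B, C, D, F.
similarity_weights = [0.8, 0.9, 0.6, 0.7, 0.5]
ground_truth_weights = [1.0, 1.0, 1.0, 0.0, 0.0]

p_truth = softmax(ground_truth_weights)
p_sim = softmax(similarity_weights)
loss = kl_divergence(p_truth, p_sim)
```

The loss is zero only when the two distributions coincide, so lowering it pushes the weights of true matches up and the weights of non-matches down within the top K neighbourhood.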
Edge Weight Updating Neural Network Inferencing
Fig. 5.
Inferencing the edge weight distributions in edge weight updating neural network for query entity Q
First, the fine-tuned BERT embeddings illustrated in Sect. 3.3 are used to embed unseen query entities in the test sets. With the newly computed BERT embedding vectors, we repeat the steps in Sect. 3.2 to construct a new Similarity-Based Entity Graph. For each query entity, the dictionary entity with the highest edge weight is returned as a synonym. Figure 5 demonstrates the inferencing process of the Edge Weight Updating Neural Network.
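Inference then reduces to a nearest-neighbour lookup over the dictionary. The sketch below is a toy version under stated assumptions: the 2-dimensional embeddings, the concept IDs, and the function name `normalize_entity` are hypothetical, and ranking by raw inner product is used because it gives the same order as the max-normalized edge weight when the maximum is positive.

```python
def normalize_entity(query_vec, dictionary, concept_of):
    """Return (best dictionary entity, its concept ID) for one query."""
    def score(name):
        # Inner product between the query and one dictionary embedding.
        return sum(q * d for q, d in zip(query_vec, dictionary[name]))

    best = max(dictionary, key=score)  # highest edge weight wins
    return best, concept_of[best]


# Toy fine-tuned embeddings and concept IDs (hypothetical values).
dictionary = {
    "Amazon AWS": [0.9, 0.1],
    "Apple, Inc.": [0.1, 0.8],
    "NYSE": [0.2, 0.3],
}
concept_of = {"Amazon AWS": "C1", "Apple, Inc.": "C2", "NYSE": "C3"}

prediction = normalize_entity([1.0, 0.0], dictionary, concept_of)
```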
Experiment Settings
Dataset
Named Entity Normalization Datasets in Bioinformatics
Most NEN research comes from the bioinformatics domain. To compare our model’s performance with other NEN models, we select three of the most used bioinformatics NEN datasets: NCBI Disease [18] and two datasets from Biocreative V CDR (BC5CDR, henceforth) [51].
The three datasets summarized below contain bioinformatics-related entity mentions with unique concept IDs. The main goal of these datasets is to identify the mentions that share the same concept IDs. We follow the NEN preprocessing convention for the datasets below, in which the mentions that do not exist in the concept dictionary are eliminated [40]. Bioinformatics NEN datasets usually consist of train, development, and test sets. Following previous studies, we use the train and development sets for training our model. Test sets are used for evaluations.
NCBI Disease [18] The NCBI Disease corpus provides disease mentions in different surface forms. Disease mentions in this dataset are extracted from 793 PubMed abstracts containing a total of 6,892 disease mentions, which are mapped to 790 unique disease concepts. Disease concepts are annotated by Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM). Disease mentions sharing the same disease concept are considered synonyms. Table 2 shows detailed statistics of the NCBI Disease corpus.
Biocreative V CDR Disease and Biocreative V CDR Chemical [51] The BC5CDR corpus is organized for the challenging tasks of disease named entity recognition and chemical-induced disease relation extraction. The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions [51]. The dataset contains a disease mention corpus and a chemical mention corpus. Disease mentions are mapped to MeSH IDs, similar to the NCBI Disease corpus. Chemical mentions are annotated using the Comparative Toxicogenomics Database (CTD) [52]. Mentions that share the same disease concept or chemical concept based on MeSH ID and CTD ID are considered synonyms. Detailed statistics of both the BC5CDR Disease corpus and the BC5CDR Chemical corpus are given in Table 2.
Table 2.
Data statistics of three bioinformatics NEN datasets
| Dataset | # of documents (Train) | # of documents (Dev) | # of documents (Test) | # of mentions (Train) | # of mentions (Dev) | # of mentions (Test) |
|---|---|---|---|---|---|---|
| NCBI disease | 592 | 100 | 100 | 5134 | 787 | 960 |
| BC5CDR disease | 500 | 500 | 500 | 4182 | 4244 | 4424 |
| BC5CDR chemical | 500 | 500 | 500 | 5203 | 5347 | 5385 |
Named Entity Normalization Datasets in Finance
There are no publicly open financial NEN datasets available; therefore, we constructed our own financial NEN dataset to test the performance of our proposed model in NEN tasks other than the bioinformatics domain.
Overview We construct the dataset for the financial NEN task from the annual reports (Form 10-K) of Standard and Poor’s 500 listed companies (https://github.com/sjeon7/Financial-Named-Entity-Normalization-Dataset). We aim to build a dataset that fulfills the need for financial NEN; the dataset includes (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. A detailed explanation of the primary data sources, data preprocessing steps, and dataset construction procedures follows. Figure 6 demonstrates the overall flow diagram for NEN dataset construction.
Fig. 6.

Flow diagram of the overall dataset construction
Data Source We gather the year 2019 Form 10-Ks (published in early 2020) of S&P 500 companies from the U.S. Securities and Exchange Commission (SEC) website, which is open to the public. We parse the business section of each 10-K document from 496 companies. The business section of a 10-K is considered the self-identity of a firm and presents information on main products, competitors, partners, and laws affecting the business. Among the sections in a 10-K, this section contains the largest number of entities. From the business sections of the 496 companies, 67,792 sentences were parsed.
Data Preprocessing For NER in financial documents, we implement the BERT NER model [8] using Huggingface’s Python repository. Huggingface’s NER model is trained using the CoNLL-2003 NER dataset [53]. The outputs of the BERT NER model are WordPiece tokens that we have to link together with specified rules, which are described in detail below. The CoNLL-2003 dataset has four entity types, persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC), plus one tag for tokens outside any named entity (O). We detect entities with ORG and MISC tags. From the year 2019 10-Ks of S&P 500 firms, we parse a total of 41,593 named entities.
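Linking WordPiece tokens back into entity strings can be illustrated with a short sketch. The merging rules used here (glue `##` continuation pieces to the previous token, then join consecutive tokens of the same entity type) are an assumption made for illustration and are not the exact rules used to build the dataset; the function name `merge_wordpieces` and the tagged tokens are hypothetical, with tags written in the B-/I- scheme.

```python
def merge_wordpieces(tagged_tokens, keep=("ORG", "MISC")):
    """Join WordPiece tokens into entity strings for the kept entity types."""
    entities, words, current_type = [], [], None

    def flush():
        # Emit the entity collected so far if its type is one we keep.
        if words and current_type in keep:
            entities.append(" ".join(words))

    for token, tag in tagged_tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue continuation piece to its word
            continue
        entity_type = tag.split("-")[-1] if tag != "O" else None
        if tag.startswith("B-") or entity_type != current_type:
            flush()
            words, current_type = [], entity_type
        if entity_type is not None:
            words.append(token)
    flush()
    return entities


# Hypothetical tagger output for one sentence.
tagged = [
    ("Amazon", "B-ORG"), ("Web", "I-ORG"), ("Services", "I-ORG"),
    ("is", "O"), ("based", "O"), ("in", "O"), ("Seattle", "B-LOC"),
    ("and", "O"), ("sells", "O"), ("Ki", "B-MISC"), ("##ndle", "I-MISC"),
]
print(merge_wordpieces(tagged))
```

The location entity is recognized but dropped, matching the choice above to keep only ORG and MISC tags.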
Dataset Construction With the named entities recognized in the preprocessing step described above, we construct the financial named entity normalization dataset. As mentioned in Sect. 4.1.2, our focus is to build an NEN dataset that meets the need for general text mining in finance; the dataset includes (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. We hand label a total of 7155 unique named entities into 2600 groups, with each group sharing the same identity. Table 3 shows three examples from our dataset for each type of named entity that needs to be normalized.
Table 3.
Example of financial named entity normalization dataset
| Type | Named entity | Matching named entity |
|---|---|---|
| Synonyms | Coca-Cola® | Coca-Cola |
| | COVID-19 pandemic | COVID-19 |
| | iPhone 11 Pro Max | iPhone® |
| Abbreviations | Baker Hughes company | Baker Hughes Co |
| | Comcast corporation | Comcast corp |
| | Qualcomm incorporated | Qualcomm Inc |
| Acronyms | Amazon Web Services | AWS |
| | Bank of New York Mellon | BNY Mellon |
| | New York Stock Exchange | NYSE |
| Combinations of punctuations | Apple, Inc | Apple Inc |
| | Walmart U. S | Walmart U. S |
| | Booz Allen & Hamilton | Booz Allen Hamilton |
| Descriptive phrases | EY (formerly Ernst & Young) | Ernst and Young |
| | Securities Exchange Act of 1934 | (The exchange act) |
| | Facebook (including Instagram) | Facebook® |
| NER parsing errors | Disney channel-the | Disney channel |
| | Full throttle®-a | Full throttle®) |
| | Keystone-our | Keystone foods |
Synonyms There exist entities with the suffix “®” or “™”. “Coca-Cola®” and “Coca-Cola” are the same entity. In addition, “COVID-19 Pandemic” and “COVID-19” should be linked. We generalize product model numbers, so that “iPhone 11 Pro Max” and “iPhone®” are considered identical entities.
Abbreviations Most abbreviations occur when abridging “Company” to “Co.”, “Corporation” to “Corp.”, and “Incorporated” to “Inc.”.
Acronyms Acronyms are one of the most challenging NEN tasks. Financial documents include multiple abbreviations of this kind. We avoided matching an acronym if multiple original entities could be assigned to it. For example, “Advanced Development Programs (ADP)” and “Automatic Data Processing, Inc. (ADP)” share the same acronym, “ADP”, but these should not be linked together.
Combinations of punctuations The problem of different combinations of punctuation can be solved using rule-based approaches. However, there are many entities with a combination of punctuation marks; “,”, “.”, and “&” are commonly found and used interchangeably.
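A rule-based baseline for this category can be as small as the sketch below. It is only an illustration under stated assumptions: the function name `normalize_punctuation` is hypothetical, and lowercasing is an extra choice of ours rather than something the dataset prescribes.

```python
import re


def normalize_punctuation(name):
    """Lowercase, replace ',', '.', '&' with spaces, collapse whitespace."""
    cleaned = re.sub(r"[,.&]", " ", name.lower())
    return " ".join(cleaned.split())


pairs = [
    ("Apple, Inc", "Apple Inc"),
    ("Booz Allen & Hamilton", "Booz Allen Hamilton"),
]
matches = [normalize_punctuation(a) == normalize_punctuation(b)
           for a, b in pairs]
```

Such rules handle pure punctuation variants but not wording variants: “Ernst & Young” and “Ernst and Young” still differ after this normalization, which is one reason a learned model is useful.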
Descriptive phrases Parsed named entities frequently include descriptive phrases. With or without the descriptive phrase, the root (identified) entity does not change.
NER parsing errors No NER model or entity concatenation model is perfect. If NER is conducted manually, human errors are possible too. In our dataset, one common error the model makes is appending the token that follows a “-” token. Correcting NER parsing errors is one of the important targets our NEN model aims to achieve.
Table 4.
Statistics of the financial named entity normalization dataset
| | Train | Development | Test | Total |
|---|---|---|---|---|
| # of Identical entity groups | 1710 | 800 | 90 | 2600 |
| # of Positive pairs | 4598 | 2466 | 3761 | 10,825 |
| # of Negative pairs | 7902 | 2534 | 3739 | 14,175 |
| # of Pairs total | 12,500 | 5000 | 7500 | 25,000 |
Hand-matched entity pairs are labeled positive. We also added negatively labeled pairs in which the two entities have no relationship. A total of 25,000 pairs, with 10,825 positive matching pairs and 14,175 negative pairs, are created. We separate the entity groups into a train set, a development set, and a test set such that no group appears in more than one set. This eliminates possible training bias, especially when training the model with the entities’ graph topology. Table 4 shows the statistics of our financial NEN dataset. As with the bioinformatics datasets, we use the train and development sets for training and the test set for evaluation.
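Pair generation from hand-labeled groups can be sketched as follows. The sampling scheme for negative pairs, the function name `make_pairs`, and the toy groups are assumptions for illustration; the sketch also assumes the requested number of negatives does not exceed the number of distinct cross-group pairs.

```python
import random
from itertools import combinations


def make_pairs(groups, n_negative, seed=0):
    """Build labeled pairs: 1 for same-group pairs, 0 for cross-group pairs."""
    positives = [(a, b, 1) for g in groups for a, b in combinations(g, 2)]
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negative:
        # Pick two different groups, then one entity from each.
        g1, g2 = rng.sample(range(len(groups)), 2)
        pair = tuple(sorted((rng.choice(groups[g1]), rng.choice(groups[g2]))))
        negatives.add(pair)
    return positives + [(a, b, 0) for a, b in sorted(negatives)]


# Toy identity groups (taken from the Table 3 examples).
groups = [
    ["Amazon Web Services", "AWS"],
    ["New York Stock Exchange", "NYSE"],
    ["Apple, Inc", "Apple Inc"],
]
pairs = make_pairs(groups, n_negative=4)
```

Calling the function separately on the groups assigned to each split keeps the train, development, and test pairs free of overlapping groups.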
Experiment Settings: Named Entity Normalization in Bioinformatics
We compare our proposed model’s performance with seven different biomedical NEN models. The accuracy scores presented in this study are taken from the original papers. A summary of each model is given in Table 5.
Table 5.
Models used in bioinformatics NEN datasets evaluations
| Models | Descriptions |
|---|---|
| Sieve-based [1] | This is one of the earliest NEN papers. The research used a 10-sieve pipeline, which is mostly a rule-based approach. Many studies published after this work follow similar preprocessing steps |
| Taggerone [12] | Taggerone used the semi-Markov model for both NER and NEN tasks. Taggerone was originally validated on the NCBI Disease and BC5CDR corpus |
| CNN Ranking [38] | CNN Ranking model used a word-level deep learning approach for NEN. This research did not perform better than the previous model, Taggerone. However, it was the first study that applied deep learning to NEN tasks |
| NormCo [39] | NormCo used BiGRU, which is considered to be a better performing deep learning model with text data. NormCo achieved similar accuracy scores with significantly fewer parameters |
| BNE [40] | BNE introduced two-level BiLSTM to capture both character-level and word-level information of biomedical entities, achieving increased NEN performance |
| BERT ranking [6] | BERT Ranking model is based on Transformer-based embeddings that use the pre-trained BERT [8], BioBERT [44], and ClinicalBERT [54] for their entity embeddings. For each entity, candidate concepts were retrieved and three different BERT models are fine-tuned to rank and to capture the ground truth concepts |
| TripletNet [55] | The concept of TripletNet [56] for semi-supervised learning was introduced for NEN tasks. This study uses CNN for entity embedding and shared CNN parameters are trained with TripletNet structure |
| BioSyn [7] | BioSyn uses BioBERT for entity embeddings and is trained with Synonym Marginalization, whose objective function is Marginal Maximum Likelihood (MML) |
Experiment Settings: Named Entity Normalization in Finance
The dataset we used is covered in Sect. 4.1.2. Table 6 lists the models tested for NEN in finance. BioSyn is one of the state-of-the-art NEN models, and its code is publicly available. We adapted BioSyn to the finance-domain NEN dataset and compared its performance. The experiments were conducted on an Intel Core-i9-10940X CPU with 128 GB of memory and three NVIDIA GeForce Titan RTX GPUs. To avoid possible biases caused by exogenous variables, we use the same settings for all models where applicable.
Table 6.
Models used in finance NEN datasets evaluations
| Models | Descriptions |
|---|---|
| Edit Distance [57] | Edit Distance is suitable for basic NEN tasks such as linking “Apple Inc” and “Apple Inc.”. However, Edit Distance can only capture the superficial morphological similarity between two entities. In our experiment, we calculate the Edit Distance between the two entities in each pair and train a simple classifier to determine whether they are equivalent |
| BERT [8] | BERT is a state-of-the-art model for various NLP tasks. However, for our specific tasks, the BERT model has a limitation on capturing morphological similarity between entity pairs. We use pre-trained BERT vectors with size 768 and train a simple MLP classifier with batch size 4096 to determine the linkage between entity pairs |
| Siamese GCN [58] | We use the entity graph illustrated in Sect. 3.2 and we use a pre-trained BERT vector for each entity node vector. 2-layer Siamese GCN is used in our experiment with 256 hidden nodes for the first GCN layer and 16 hidden nodes for the second GCN layer. GCN requires more epochs for training so we trained for 120 epochs for the full dataset (full batch: 17,500 entity pairs). The learning rate for ADAM optimizer for GCN is 0.01 |
| Siamese BiLSTM [59] | To train the character-level Siamese BiLSTM model, we one-hot encode the characters of the entity strings over 85 unique tokens. We stack two BiLSTM layers. The BiLSTM cells in the first layer return 64-dimensional hidden states and the BiLSTM cells in the second layer return 16-dimensional hidden states. To prevent overfitting, we train the BiLSTM model for 12 epochs with a learning rate of 0.001. The embedding dimension, 16, is the same as for the GCN |
| BioSyn [7] | The detailed model description is illustrated in Table 5 |
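The Edit Distance baseline in Table 6 relies on the Levenshtein distance [57], which can be sketched as below. The fixed threshold in `predict_match` is a hypothetical stand-in for the simple classifier trained in the experiment, whose exact form is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def predict_match(a: str, b: str, threshold: float = 0.2) -> bool:
    """Hypothetical decision rule: match when the length-normalized distance is small."""
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest <= threshold
```

With this rule, “Apple Inc” and “Apple Inc.” are linked, whereas “IBM” and “International Business Machines” are not, which illustrates why a purely morphological measure struggles with acronyms.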
For the pairwise NEN tasks, we take a slightly different approach for BioSyn and our proposed model. The convention for performance tests in bioinformatics NEN resembles a retrieval task: the model retrieves the most similar entity from the candidate concepts (mentions), and each trial is scored as correct if the retrieved entity’s concept ID matches the query entity’s concept ID. For the pairwise NEN tasks, in contrast, the model retrieves the top-k most similar entities for each of the two entities in the pair, and the pair is scored as matched if the two groups of retrieved entities share at least one concept ID. We test BioSyn and our proposed model with k = 1 and k = 3. For top-1, the model recommends the single most similar entity for each side and their concept IDs are compared; for top-3, the model recommends the three most similar entities for each side and the pair is scored by the presence of any overlapping concept ID between the two groups.
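The pairwise top-k scoring rule can be sketched as follows. The `retrieve` callable is a hypothetical placeholder for either model's retrieval step: given an entity string and k, it is assumed to return the concept IDs of the k most similar dictionary entries.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: (entity string, k) -> concept IDs of the top-k entries.
Retriever = Callable[[str, int], List[str]]


def pair_is_match(e1: str, e2: str, retrieve: Retriever, k: int) -> bool:
    """Predict a match when the top-k concept IDs of the two entities overlap."""
    return bool(set(retrieve(e1, k)) & set(retrieve(e2, k)))


def pairwise_accuracy(pairs: Iterable[Tuple[str, str, int]],
                      retrieve: Retriever, k: int) -> float:
    """Fraction of labeled pairs (e1, e2, label) whose prediction equals the label."""
    pairs = list(pairs)
    correct = sum(int(pair_is_match(e1, e2, retrieve, k)) == label
                  for e1, e2, label in pairs)
    return correct / len(pairs)
```

Increasing k makes the overlap test more permissive, so recall is expected to rise at some cost in precision, which is the pattern between the top-1 and top-3 rows of Table 8.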
Results
We conduct both quantitative and qualitative analyses. For the NCBI Disease, BC5CDR Disease, and BC5CDR Chemical datasets, we compare our proposed model’s scores with those of previous studies. Results on the bioinformatics datasets are reported as top-1 recommendation accuracy. Given the biomedical entities in the train set, each query entity is matched with its most similar entity in the dataset. If the query entity and the retrieved entity share the same concept ID, the prediction is considered correct. The financial NEN dataset is a pairwise matching corpus: the evaluated models decide whether the two named entities in a pair share an identical meaning. We also perform a qualitative analysis to assess the models’ weaknesses.
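Top-1 recommendation accuracy can be computed as in the sketch below. Cosine similarity over fixed vectors is an assumption made for the example; the similarity function actually used by each compared model may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top1_accuracy(queries, dictionary):
    """Share of queries whose most similar dictionary entry has the same concept ID.

    Both arguments are lists of (concept_id, vector) tuples.
    """
    hits = 0
    for query_id, query_vec in queries:
        best_id, _ = max(dictionary, key=lambda item: cosine(query_vec, item[1]))
        hits += best_id == query_id
    return hits / len(queries)
```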
Quantitative Analysis: Bioinformatics
Table 7.
Bioinformatics named entity normalization performance test
| Model | NCBI disease | BC5CDR disease | BC5CDR chemical |
|---|---|---|---|
| Sieve-based [1] | 84.7 | 84.1 | 90.7 |
| Taggerone [12] | 87.7 | 88.9 | 94.1 |
| CNN ranking [38] | 86.1 | – | – |
| NormCo [39] | 87.8 | 88.0 | – |
| BNE [40] | 87.7 | 90.6 | 95.8 |
| BERT ranking [6] | 89.1 | – | – |
| TripletNet [55] | 90.0 | – | – |
| BioSyn [7] | 90.7 | 92.9 | 96.6 |
| BioSyn with TF-IDF [7] | 91.1 | 93.2 | 96.6 |
| Proposed model | 91.7 | 93.4 | 96.7 |
| Proposed model with TF-IDF | 92.1 | 93.7 | 96.7 |
Bold faced numbers indicate the best scores and underlined numbers indicate the second best scores
Table 7 compares the performance of our proposed model with previous state-of-the-art models. The best scores are boldfaced and the second best scores are underlined. We also train and report our model’s performance with TF-IDF vectors added to the vanilla embedding vectors, as done in the previous state-of-the-art model, BioSyn [7]. Our proposed model achieved the highest accuracy on all three bioinformatics datasets. The largest gain over the previous state-of-the-art model, 1.0 percentage point, is on the NCBI Disease corpus. For the BC5CDR Disease and BC5CDR Chemical corpora, the gains are 0.5 and 0.1 percentage points, respectively.
The NCBI Disease corpus is a comparatively harder dataset, judging by the performance of other models. We conclude that there is significant room to increase accuracy on a relatively lower-performing dataset. The previous model already performs excellently on the BC5CDR corpus, with accuracy scores of 93.2–96.6%. In particular, earlier NEN models already exceeded 90% on the BC5CDR Chemical dataset, and the year-over-year gains have been shrinking, so improvement on this specific dataset may have reached a plateau.
Quantitative Analysis: Finance
Table 8.
Precision, Recall, F-score, Accuracy of models
| Model | Precision (%) | Recall (%) | F-score (%) | Accuracy (%) |
|---|---|---|---|---|
| Edit distance | 43.12 | 63.35 | 51.31 | 62.60 |
| BERT | 62.92 | 82.17 | 71.27 | 76.81 |
| Siamese GCN | 79.00 | 82.16 | 80.55 | 82.56 |
| Siamese BiLSTM | 75.96 | 89.98 | 82.38 | 85.15 |
| BioSyn [7] by top1 | 99.73 | 77.77 | 87.39 | 88.75 |
| BioSyn [7] by top3 | 99.38 | 89.79 | 94.34 | 94.60 |
| Proposed model by top1 | 99.82 | 90.88 | 95.14 | 95.35 |
| Proposed model by top3 | 99.68 | 98.03 | 98.85 | 98.85 |
Bold faced numbers indicate the best scores and underlined numbers indicate the second best scores
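Table 8 shows the performance of each model we test. The evaluation metrics are defined as follows, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:

$$
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP}, & \text{Recall} &= \frac{TP}{TP + FN}, \\
\text{F-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, & \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{aligned}
\tag{4}
$$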
A false positive is a pair of entities that should not be matched but that the model links. A false negative is a pair of entities that should be matched but that the model fails to link.
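The four metrics reported in Table 8 follow the standard definitions based on confusion counts, which can be computed as in this sketch. The function name and the counts in the usage note are illustrative and are not taken from our experiments.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}
```

For example, with 90 true positives, 10 false positives, 30 false negatives, and 70 true negatives, precision is 0.90, recall is 0.75, and accuracy is 0.80.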
For practical use of an NEN model in the finance domain, higher precision should be rewarded more. In practice, a model with higher precision reduces practitioners’ workload by giving more reliable entity-matching results, and it reduces the time spent double-checking the validity of entity pairs marked as matched.
Edit Distance had the lowest score across all performance evaluation indicators. The Graph Convolutional Network we use for the experiments adopts BERT vectors as entity node features. BERT and GCN have similar recall, but GCN has higher precision, which yields a higher F-score and accuracy than BERT. Our proposed model achieved the highest precision, F-score, and accuracy. Among all the models, only BioSyn and our proposed model reach a precision above 90%, and our proposed model attains the highest value (99.82% by top-1). Therefore, our proposed model is the most suitable for practical use.
Qualitative Analysis
Error Analysis
In the error analysis, we report entities for which accurate recommendations were not made. Through this analysis, we aim to recognize patterns in the cases where recommendations fail.
Table 9.
Error analysis on three biomedical NEN datasets
| Dataset | Query Entity | Retrieved Synonym Entity |
|---|---|---|
| NCBI Disease | Encephalopathy | Aids encephalopathy |
| | Nail dystrophy | Twenty nail dystrophy |
| | cdm | cdmd |
| | Copper overload | Copper deficient |
| | g m2 gangliosidosis | g m2 gangliosidosis type ii |
| BC5CDR Disease | Lung mass | Liver mass |
| | Hypoactivity | Hyperactivity |
| | htn | htx |
| | Thrombocytopenia type ii | Thrombocytopenia 2 |
| | Chronic liver disease | Chronic hepatitis |
| BC5CDR Chemical | Inorganic as | Chemicals inorganic |
| | Alcohol nicotine | Alcohol nicotinyl |
| | dph | ddph |
| | naoh | Natrolite |
| | myo inositol 1 phosphate | myo inositol 1 3 6 triphosphate |
Table 9 lists errors on the three bioinformatics NEN datasets. Our proposed model achieves over 90% accuracy on all three datasets. However, finding synonyms for short abbreviations such as “cdm”, “htn”, and “dph” appears relatively harder. In addition, the model’s performance degrades when entities share longer overlapping strings.
Table 10.
Error analysis on financial NEN dataset
| Entity 1 | Entity 2 |
|---|---|
| False positive | |
| Park Hyatt | (Marriott, Hyatt, Hilton and AccorHotels) |
| AT & T acquisition | AT & T Corp. ’ s (ATTC) |
| Bureau of Indian affairs (BIA) | Balanced budget Act of 1997 (BBA) |
| Garmin corporation | Garmin GTN Xi |
| Windows server | Microsoft teams |
| False negative | |
| (LTE) | 4 G |
| EU member state | European Union Member States |
| RPS | Renewable Portfolio Standards (RPS) |
| 737-800 Boeing 737 | (B737) |
| Cyber Security Regulation | Privacy and Cyber Security Regulation |
The financial NEN dataset is constructed from entity pairs, and our model predicts whether the two entities in a pair match. Table 10 is divided into a false-positive list and a false-negative list. Examining the false positives, we find that entities with similar meanings or with matching strings are often predicted positive although the actual label is negative.
We also examine the false negatives. Matching named entities that involve parentheses and abbreviations is where our model’s prediction is relatively unstable. Entity pairs such as “LTE” and “4 G” can be among the most difficult to predict as positive because recognizing that “LTE” and “4 G” share a meaning requires common sense. Even though our model is based on BERT, which captures semantic meaning from the sentences in which the named entities appear, its ability to use common sense beyond the information in the surrounding sentences can be limited.
Named Entity Normalization Result According to Training Progresses
Table 11.
Named entity normalization result for Epoch 0, Epoch 1, and Epoch with highest accuracy
| | Epoch 0 | Epoch 1 | Epoch with Highest Accuracy |
|---|---|---|---|
| NCBI disease: c2 deficiency | |||
| Top 1 | c2 deficiency | c2 deficiency | c2 deficiency |
| Top 2 | c3 deficiency | c6 deficiency | c2 deficient |
| Top 3 | t2 deficiency | c3 deficiency | Hereditary c2 deficiency |
| Top 4 | c5 deficiency | c2 deficient | type ii c2 deficiency |
| Top 5 | cpox deficiency | c4 deficiency | type i c2 deficiency |
| BC5CDR disease: failing left ventricle | |||
| Top 1 | Tumor cerebral ventricle | Dysfunction left ventricular | Left sided heart failure |
| Top 2 | Cerebral ventricle tumor | Left sided heart failure | Heart failure |
| Top 3 | Tumors cerebral ventricle | Remodeling left ventricular | Cardiac failure |
| Top 4 | Syndrome slit ventricle | Hypertrophy left ventricular | Heart failure left sided |
| Top 5 | Ventricle tumor cerebral | Outflow obstruction left ventricular | Right sided heart failure |
| BC5CDR Chemical: vincristine sulfate | |||
| Top 1 | Vincristine sulfate | Vincristine sulfate | Vincristine sulfate |
| Top 2 | Sulfate vincristine | Vincristine | Vincristine |
| Top 3 | Vinblastine sulfate | Voacristine | Sulfate vincristine |
| Top 4 | Sulfate vinblastine | Leurocristine | Vincristin |
| Top 5 | Riboflavin 3 sulfate | Ergocristine | Vincristin medac |
| Financial NEN: Polo Ralph Lauren Children | |||
| Top 1 | Pinky swear foundation | Polo Golf Ralph Lauren | Polo Ralph Lauren |
| Top 2 | Bath & body works Canada | Polo Ralph Lauren | Polo Ralph Lauren Children, Chaps |
| Top 3 | Ticketmaster North America | Siemens Medical Solutions USA | Polo Golf Ralph Lauren |
| Top 4 | LIP-BU TAN | Mojo Networks, Inc | Lilly International |
| Top 5 | Coca-Cola life | Polo Ralph Lauren Children, Chaps | Polo / Lauren Company, LP |
Bold-underlined entities are the entities with the same concept ID as the query entity
As the number of training epochs increases, recommendations become more accurate. We randomly selected entities from the four datasets we tested. The top-5 recommendations for the selected entities are provided for epoch 0, epoch 1, and the epoch with the best result in Sects. 5.1 and 5.2.
Table 11 shows how recommendations change as training progresses. The entity named after each dataset is the selected example (c2 deficiency, failing left ventricle, vincristine sulfate, and Polo Ralph Lauren Children), and bold-underlined entities are the entities with the same concept ID as the query entity. Across the datasets, at epoch 0 most recommended entities do not share the concept ID of the query entity. As the model is trained, the recommendations become more accurate in epoch 1. At the epochs in which the highest accuracy is achieved for each dataset, true synonyms of the query entities are successfully selected.
Based on our experiments, our proposed model has the highest precision, recall, F1 score, and accuracy. It also gives the most stable results, achieving over 98% on all four evaluation metrics by top-3 on the financial NEN dataset.
Conclusion
We introduce the Edge Weight Updating Neural Network for NEN. NEN, which matches extracted named entities that share an identity, is pivotal for many text mining tasks. We tested our model on three widely used NEN datasets: NCBI Disease, BC5CDR Disease, and BC5CDR Chemical. We also built an NEN dataset for the finance domain and used it to verify our model’s performance for general NEN applications.
The main contributions of this study are as follows. Our proposed model successfully links named entities that have the same meaning but different surface forms. The proposed model outperforms previous NEN models. We test our model not only on bioinformatics datasets, where NEN research is more active, but also on a financial NEN dataset. Its performance on NEN corpora from two distinct fields demonstrates the model’s efficacy for general NEN applications.
As with many other NEN models, performance on linking named entities with abbreviations is comparatively lower, and matching abbreviations more accurately is left for future work. The neural network model with our proposed Edge Weight Updating objective function performs better than other models. Providing more general guidelines for the number of training epochs and improving training stability are further topics for future research.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2093785 and No. 2018R1D1A1A02045842).
Declarations
Conflict of Interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Sung Hwan Jeon, Email: sjeon@dm.snu.ac.kr.
Sungzoon Cho, Email: zoon@snu.ac.kr.
References
- 1.D’Souza J, Ng V (2015) Sieve-based entity linking for the biomedical domain. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing Vol 2: Short Papers, pp 297–302
- 2.Ghiasvand O, Kate RJ (2014) Uwm: Disorder mention extraction from clinical text using crfs and normalization using learned edit distance patterns. In: SemEval@ COLING, pp 828–832
- 3.Hanisch D, Fundel K, Mevissen H-T, Zimmer R, Fluck J. Prominer: rule-based protein and gene entity recognition. BMC Bioinf. 2005;6(1):1–9. doi: 10.1186/1471-2105-6-1.
- 4.Kang N, Singh B, Afzal Z, van Mulligen EM, Kors JA. Using rule-based natural language processing to improve disease normalization in biomedical text. J Am Med Inf Assoc. 2013;20(5):876–881. doi: 10.1136/amiajnl-2012-001173.
- 5.Karadeniz I, Özgür A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinf. 2019;20(1):1–12. doi: 10.1186/s12859-019-2678-8.
- 6.Ji Z, Wei Q, Xu H. Bert-based ranking for biomedical entity normalization. AMIA Summits Trans Sci Proc. 2020;2020:269.
- 7.Sung M, Jeon H, Lee J, Kang J (2020) Biomedical entity representations with synonym marginalization. arXiv preprint arXiv:2005.00239
- 8.Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
- 9.Cho H, Choi W, Lee H. A method for named entity normalization in biomedical articles: application to diseases and plants. BMC Bioinf. 2017;18(1):451. doi: 10.1186/s12859-017-1857-8.
- 10.Aronson AR (2001) Effective mapping of biomedical text to the umls metathesaurus: the metamap program. In: Proceedings of the AMIA Symposium, American Medical Informatics Association. p 17
- 11.Leaman R, Islamaj Doğan R, Lu Z. Dnorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013;29(22):2909–2917. doi: 10.1093/bioinformatics/btt474.
- 12.Leaman R, Lu Z. Taggerone: joint named entity recognition and normalization with semi-markov models. Bioinformatics. 2016;32(18):2839–2846. doi: 10.1093/bioinformatics/btw343.
- 13.Wei C-H, Kao H-Y. Cross-species gene normalization by species inference. BMC Bioinf. 2011;12(S8):5. doi: 10.1186/1471-2105-12-S8-S5.
- 14.Hakenberg J, Gerner M, Haeussler M, Solt I, Plake C, Schroeder M, Gonzalez G, Nenadic G, Bergman CM. The gnat library for local and remote gene mention normalization. Bioinformatics. 2011;27(19):2769–2771. doi: 10.1093/bioinformatics/btr455.
- 15.Rocktäschel T, Weidlich M, Leser U. Chemspot: a hybrid system for chemical named entity recognition. Bioinformatics. 2012;28(12):1633–1640. doi: 10.1093/bioinformatics/bts183.
- 16.Weston L, Tshitoyan V, Dagdelen J, Kononova O, Trewartha A, Persson KA, Ceder G, Jain A. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J Chem Inf Model. 2019;59(9):3692–3702. doi: 10.1021/acs.jcim.9b00470.
- 17.Suominen H, Salanterä S, Velupillai S, Chapman WW, Savova G, Elhadad N, Pradhan S, South BR, Mowery DL, Jones GJ, et al. (2013) Overview of the share/clef ehealth evaluation lab 2013. In: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, pp 212–231
- 18.Doğan RI, Leaman R, Lu Z. Ncbi disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inf. 2014;47:1–10. doi: 10.1016/j.jbi.2013.12.006.
- 19.Demner-Fushman D, Shooshan SE, Rodriguez L, Aronson AR, Lang F, Rogers W, Roberts K, Tonning J. A dataset of 200 structured product labels annotated for adverse drug reactions. Sci Data. 2018;5:180001. doi: 10.1038/sdata.2018.1.
- 20.Smith L, Tanabe LK, Nee Ando RJ, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, et al. Overview of biocreative ii gene mention recognition. Genome Biol. 2008;9(S2):2. doi: 10.1186/gb-2008-9-s2-s2.
- 21.Kim J-D, Ohta T, Pyysalo S, Kano Y, Tsujii J (2009) Overview of bionlp’09 shared task on event extraction. In: Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pp 1–9
- 22.Bossy R, Deléger L, Chaix E, Ba M, Nédellec C (2019) Bacteria biotope at bionlp open shared tasks 2019. In: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pp 121–131
- 23.Kolárik C, Klinger R, Friedrich CM, Hofmann-Apitius M, Fluck J (2008) Chemical names: terminological resources and corpora annotation. In: Workshop on Building and Evaluating Resources for Biomedical Text Mining (6th Edition of the Language Resources and Evaluation Conference)
- 24.Klinger R, Kolářik C, Fluck J, Hofmann-Apitius M, Friedrich CM. Detection of iupac and iupac-like chemical names. Bioinformatics. 2008;24(13):268–276. doi: 10.1093/bioinformatics/btn181.
- 25.Arratia A, Belanche LA, Fábregues L (2019) An evaluation of equity premium prediction using multiple kernel learning with financial features. Neural Process Lett 1–18
- 26.Corba BS, Egrioglu E, Dalar AZ. Ar-arch type artificial neural network for forecasting. Neural Process Lett. 2020;51(1):819–836. doi: 10.1007/s11063-019-10117-6.
- 27.Gupta A, Dengre V, Kheruwala HA, Shah M. Comprehensive review of text-mining applications in finance. Financ Innov. 2020;6(1):1–25. doi: 10.1186/s40854-020-00205-1.
- 28.Jijkoun V, Khalid MA, Marx M, De Rijke M (2008) Named entity normalization in user generated content. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, pp 23–30
- 29.Sun C, Lin L, Liu M, Liu B, Sha X (2012) A product named entity normalization method based on entity relations. In: 2012 8th International Conference on Information Science and Digital Content Technology (ICIDT2012), vol. 1, pp 166–169
- 30.Francis S, Van Landeghem J, Moens M-F. Transfer learning for named entity recognition in financial and biomedical documents. Information. 2019;10(8):248. doi: 10.3390/info10080248.
- 31.Mueller J, Thyagarajan A (2016) Siamese recurrent architectures for learning sentence similarity. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 30
- 32.Ranasinghe T, Orasan C, Mitkov R (2019) Semantic textual similarity with siamese neural networks. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp 1004–1011
- 33.Liu B, Zhang T, Niu D, Lin J, Lai K, Xu Y (2018) Matching long text documents via graph convolutional networks. arXiv preprint arXiv:1802.07459, pp 2793–2799
- 34.Krivosheev E, Atzeni M, Mirylenka K, Scotton P, Casati F (2020) Siamese graph neural networks for data integration. arXiv preprint arXiv:2001.06543
- 35.Neculoiu P, Versteegh M, Rotaru M (2016) Learning text similarity with siamese recurrent networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp 148–157
- 36.Niu J, Yang Y, Zhang S, Sun Z, Zhang W. Multi-task character-level attentional networks for medical concept normalization. Neural Process Lett. 2019;49(3):1239–1256. doi: 10.1007/s11063-018-9873-x.
- 37.Mulang’ IO, Singh K, Prabhu C, Nadgeri A, Hoffart J, Lehmann J (2020) Evaluating the impact of knowledge graph context on entity disambiguation models. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp 2157–2160
- 38.Li H, Chen Q, Tang B, Wang X, Xu H, Wang B, Huang D. Cnn-based ranking for biomedical entity normalization. BMC Bioinf. 2017;18(11):79–86. doi: 10.1186/s12859-017-1805-7.
- 39.Wright D (2019) Normco: Deep disease normalization for biomedical knowledge base construction. PhD thesis, UC San Diego
- 40.Phan MC, Sun A, Tay Y (2019) Robust representation learning of biomedical names. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 3275–3285
- 41.Kim J, Kim T, Kim S, Yoo CD (2019) Edge-labeling graph neural network for few-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11–20
- 42.Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Adv Neural Inf Process Syst 27
- 43.Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86. doi: 10.1214/aoms/1177729694.
- 44.Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–1240. doi: 10.1093/bioinformatics/btz682.
- 45.Araci D (2019) Finbert: financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063
- 46.Lee J-S, Hsiang J (2019) Patentbert: patent classification with fine-tuning a pre-trained bert model. arXiv preprint arXiv:1906.02124
- 47.Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
- 48.Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
- 49.Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Linguist. 2017;5:135–146. doi: 10.1162/tacl_a_00051.
- 50.Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
- 51.Li J, Sun Y, Johnson RJ, Sciaky D, Wei C-H, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z (2016) Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database 2016
- 52.Davis AP, Murphy CG, Saraceni-Richards CA, Rosenstein MC, Wiegers TC, Mattingly CJ. Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucl Acids Res. 2009;37(supp 1):786–792. doi: 10.1093/nar/gkn580.
- 53.Sang EF, De Meulder F (2003) Introduction to the conll-2003 shared task: language-independent named entity recognition. (arXiv preprint cs/0306050)
- 54.Si Y, Wang J, Xu H, Roberts K. Enhancing clinical concept extraction with contextual embeddings. J Am Med Inf Assoc. 2019;26(11):1297–1304. doi: 10.1093/jamia/ocz096.
- 55.Mondal I, Purkayastha S, Sarkar S, Goyal P, Pillai J, Bhattacharyya A, Gattu M (2020) Medical entity linking using triplet network. arXiv preprint arXiv:2012.11164
- 56.Hoffer E, Ailon N (2015) Deep metric learning using triplet network. In: International Workshop on Similarity-based Pattern Recognition, Springer, pp 84–92
- 57.Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Doklady. 1966;10:707–710.
- 58.Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
- 59.Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–2681. doi: 10.1109/78.650093.




