Abstract
The number of biomedical articles published is increasing rapidly. Currently there are about 30 million articles in PubMed and over 25 million references in Medline. Among the tasks needed to analyse this literature, Biomedical Named Entity Recognition (BioNER) and Biomedical Relation Extraction (BioRE) are the most essential. In the biomedical domain, Knowledge Graphs are used to visualize the relationships between entities such as proteins, chemicals, and diseases. Scientific publications have increased dramatically as a result of the search for treatments and potential cures for the new Coronavirus, but efficiently analysing, integrating, and utilising related sources of information remains a challenge. To combat a disease effectively during a pandemic such as COVID-19, the literature must be used quickly and effectively. In this paper, we introduce a fully automated framework consisting of a BERT-BiLSTM model, a Knowledge Graph, and a Representation Learning model to extract the top diseases, chemicals, and proteins related to COVID-19 from the literature. The proposed framework uses Named Entity Recognition models for disease recognition, chemical recognition, and protein recognition, together with Chemical-Disease and Chemical-Protein Relation Extraction models. Using these models, the system extracts the entities and relations from the CORD-19 dataset, creates a Knowledge Graph from the extracted relations and entities, and performs Representation Learning on this KG to obtain the embeddings of all entities and identify the top diseases, chemicals, and proteins related to COVID-19.
Keywords: Biomedical Named Entity Recognition (BioNER), Relation Extraction (RE), Knowledge graph, Representation learning, BERT, BiLSTM
1. Introduction
Biomedical Named Entity Recognition (BioNER) and Biomedical Relation Extraction (BioRE) are becoming more and more crucial for biomedical research due to the rising volume of data that has become available in the form of unstructured text articles. There are currently approximately 25 million references in Medline. Even in more specialised subjects, it is challenging to keep up with the literature at this rate (Gajendran et al., 2020). Graphs are useful tools for numerous practical applications. They have been applied to the classification of nodes and the development of recommendation systems in social network mining, and utilised in natural language processing to answer basic queries over relational data (Chen et al., 2020). In the biomedical field, graphs have been employed to undertake medication repurposing, detect drug-target interactions, and select drugs pertinent to a disease (Doğan et al., 2014, Chen et al., 2020).
There are numerous approaches to building knowledge graphs from resources such as text or pre-existing databases (Wang et al., 2020). Knowledge graphs are frequently built from pre-existing databases, which experts in the relevant fields create using methods that range from text mining to manual curation. Manual curation requires subject-matter specialists with deep domain knowledge to assert the associations between entities, whereas automated methods use NLP techniques to quickly identify relevant associations. Automated systems are divided into three classes: rule-based, unsupervised, and supervised methods.
The proposed system takes the automated route of extracting knowledge graphs from text using Deep Learning and NLP techniques. The advantages of this approach are quick results and no need for ground-truth information. Since its outbreak, COVID-19 has been a worldwide pandemic with a high transmission rate and a substantial mortality rate that has affected millions of individuals (Gajendran et al., 2020). Scientific publications on the new Coronavirus have increased exponentially as researchers look for treatments and potential cures, but collecting, integrating, and utilising related sources of information effectively remains a challenge (Wang et al., 2020).
Scientific publications regarding COVID-19 contain various data about related diseases, proteins, chemicals, and so on, and the data in such publications are largely unstructured. Most of the articles published on COVID-19 are gathered under the name CORD-19. This research presents an Information Extraction (IE) method followed by Knowledge Graph construction in a fully automated generic pipeline.
The proposed system creates a Named Entity Recognition model using BERT-BiLSTM-CRF. This model is trained on multiple datasets, producing multiple models that are used to recognize diseases, proteins, and chemicals in the prediction dataset. The Relation Extraction models are created using SciBERT with a linear classifier and are likewise trained on multiple datasets; these models are used to extract relations such as the Chemical-Protein relation and the Chemical-Disease relation. Once the entities and relations are extracted from the prediction dataset, the system creates the Knowledge Graph with the entities as nodes and the relations as edges.
2. Related work
The literature review covers the related work relevant to the proposed system: BioNER, RE, Knowledge Graphs, and Representation Learning.
2.1. Named entity recognition
Identification of biomedical entities such as chemical substances, genes, proteins, viruses, diseases, DNAs, and RNAs is the goal of biomedical named entity recognition (BioNER) (Gajendran and Manjula, 2020). The main issue for BioNER is the procedure used to extract these entities. Dictionary-based techniques detect and extract biomedical entities using extensive dictionaries. The interaction extraction model PolySearch (Cheng et al., 2008) is a well-known example of such a system. Whatizit (Rebholz-Schuhmann, 2013), an online model with distinct categories for different BioNER types, is another example (Yoshua et al., 2003).
Machine learning techniques are now the most popular for named entity recognition. Support vector machines (Kazama et al., 2002), Hidden Markov models (Shen et al., 2003), decision trees, and naive Bayesian approaches (Nobata et al., 1999) were the first supervised machine learning techniques to be utilised. The groundbreaking study by Lafferty et al. (2001) on Conditional Random Fields (CRFs), which considered the likelihood of spatial interdependence between words, redirected researchers in a different direction (Peters et al., 2018).
The literature has shifted toward universal deep neural network models over the past five years. For instance, different variations of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Kim et al., 2019) have been applied to BioNER systems (Zhu et al., 2017); LSTM and Bi-LSTM (Yanran et al., 2015, Gajendran et al., 2020) are common RNN variations.
Bi-LSTM and CRF models are paired with a combination of different types of embeddings in a network to produce the best results (Gajendran et al., 2020; Giorgi et al., 2019; Luo et al., 2018; Habibi et al., 2017). Here, word embeddings are generated via a pre-trained lookup table, while character-level embeddings are generated by independent Bi-LSTMs for each word sequence (Habibi et al., 2017). These two results are then concatenated to form the word representations x1, x2, …, xn. Currently, Transformer-based models such as SciBERT and BioBERT are being fine-tuned for BioNER (Harnoune et al., 2021, Beltagy et al., 2019).
2.2. Relation extraction
Most of the literature on identifying associations between entities falls into rule-based and machine learning-based models. A rule-based approach relies heavily on syntactic and semantic sentence analysis for relationship extraction. Fundel et al. (2006), for instance, used a tree-based structure to identify and extract associations between entities by splitting sentences into nouns and verbs, where the nouns denote the entities and the verbs denote the associations between them.
The most popular machine learning techniques use training data from a structured, labelled dataset (supervised learning). Obtaining labelled training and testing data used to be the largest barrier to applying such machine learning algorithms to relation identification. However, this issue has been greatly mitigated by datasets produced by biomedical text mining competitions such as BioCreative and BioNLP. One of the first research projects in this area used an SVM; a later study employed a comparable SVM model to determine the polarity of food-disease relationships. Jensen et al. (2014) utilised term frequency-inverse document frequency (TF-IDF) attributes to discover food-phytochemical and food-disease relationships, whereas Quan et al. (2014) employed a CRF for both entity recognition and association extraction.
Deep learning (DL) methods have gained popularity for identifying associations between entities in recent times due to their cutting-edge performance and reduced requirement for intricate feature engineering (Percha et al., 2012). The three most used DL techniques are CNNs, RNNs, and combinations of the two (Jettakul et al., 2019). Different embeddings, such as position, character, and word-level information encoded as vectors (Zeng et al., 2014), may all serve as feature inputs for DL models. Recently, Transformer-based models, including BioBERT and SciBERT, have been employed (Devlin et al., 2019).
2.3. Knowledge graph and representation learning
Knowledge graphs (KGs) extract information from medication, disease, and gene databases, along with their relationships, to form a low-dimensional structure (Zheng et al., 2020). The advantage is that a large knowledge graph is obtained, but the data is highly generic and only structured data is accepted as input. In the COVID-19 Knowledge Graph (Domingo-Fernandez et al., 2020), the evidence text was manually encoded as (source, relation, target) triples. The advantage is that the extracted information is mostly correct apart from human error, but only a small knowledge graph is obtained and manual curation is time-consuming.
Repke and Krestel (2021) extracted knowledge graphs from a financial text corpus using NER and RE tasks; an advantage in that domain is that finding financial entities is comparatively easy with rule-based approaches. In Deep Learning-based Knowledge Graph Generation for COVID-19 (Kim et al., 2021), entities related to COVID-19 are found from dictionaries and their relations are extracted from the text corpus. The advantage is that an unsupervised method is devised to find entities and relations, but results vary massively in unsupervised methods.
KGs also struggle with the critical issue of information sparsity, which severely impairs the ability to calculate associations between items. Knowledge graph embedding, which provides a dense, low-dimensional feature space and helps to calculate the semantic links between items at low computational cost, has been designed to address this issue. Translation-based models are based on the idea that a triple (head, relation, tail) can be represented by a geometric principle such as h + r ≈ t (TransE) (Ji et al., 2015). Tensor factorization-based models are based on the idea that all triples can be transformed into a 3D binary tensor, which is then converted into entity and relation embeddings using dimensionality reduction.
Some of the problems identified in the literature survey are: (i) manual curation of entities and relations takes a very long time and yields few results; (ii) the lack of ground-truth information for correct and easy evaluation leads to lower-quality knowledge graphs; (iii) rule-based information extraction requires manual rule setting, which differs based on the information needed; (iv) external knowledge bases may not be available for the task concerned; and (v) structured data, which would be easier to extract from, is often lacking.
3. Proposed system
Fig. 1 displays the proposed system's overall architecture. Preprocessing, Feature Extraction, Named Entity Recognition, Relation Extraction, Graph Construction, and Representation Learning are the six modules that make up the proposed work, which produces the COVID-19 knowledge graph as its outcome. In the Preprocessing module, the CORD-19 abstracts are transformed into input for the language model, and the BERT model is fine-tuned to obtain better results for entity extraction. In the Feature Extraction module, the Named Entity Recognition datasets are passed through the BERT model to obtain their features. The BERT-BiLSTM-CRF model is utilised in the Named Entity Recognition module to train on the NER datasets and to predict on CORD-19. The Relation Extraction module trains the BERT model on the relation datasets and then uses the trained models to determine whether a relation exists between the entities extracted from CORD-19. In the Graph Construction module, the Knowledge Graph is constructed. In the Representation Learning module, the KG is used to train the TransD model to find embeddings of entities, which are used to compute similarity with COVID-19.
Fig. 1.
Overall architecture diagram.
3.1. Named entity recognition
Each sentence is tokenized using the custom tokenizer created during preprocessing, and the special tokens [CLS] and [SEP] are added at the beginning and end of the sentence, respectively. The tokenizer keeps common words as individual tokens and splits rarer words into meaningful sub-tokens. The sentences are then truncated or padded to the maximum length (256) for the BERT model, and a mask is defined such that real words are marked as 1 and padding tokens are marked as 0.
Finally, each sentence of length 256 is passed to the fine-tuned CORD-19 SciBERT model and the last hidden layer output, of dimension 768, is taken from the BERT model. The output produced by the BERT model has shape (1, 256, 768), where 1 denotes the number of input sentences and 256 denotes the sequence length. Each subword in the sentence is represented by a 768-dimensional vector that depends both on the subword itself and on the words around it; hence this type of embedding is called a contextualized word embedding. This output is passed to the Named Entity Recognition layers, which detect entities based on the word embeddings of the words in the sentence. The module takes the fine-tuned CORD-19 SciBERT model as input, which is used to produce the feature embeddings, and returns the contextualized word embeddings of the input dataset, regardless of whether it is a NER dataset or the CORD-19 dataset.
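As an illustration, this tokenization and feature-extraction step can be sketched with the HuggingFace transformers API; the public SciBERT checkpoint below stands in for the CORD-19 fine-tuned model used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; the paper uses SciBERT fine-tuned on CORD-19 abstracts.
name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

sentence = "Management of critically ill patients with Severe Acute Respiratory Syndrome."
# [CLS]/[SEP] are added automatically; sentences are padded/truncated to 256 tokens,
# and the attention mask is 1 for real tokens and 0 for padding, as described above.
enc = tokenizer(sentence, max_length=256, padding="max_length",
                truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])

# Last hidden layer: one 768-dimensional contextualized vector per subword token.
print(out.last_hidden_state.shape)  # torch.Size([1, 256, 768])
```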
The individual sentence features (input_ids, attention_mask, token_type_ids) are passed to the CORD-19 SciBERT model. The contextualised word embedding of each token, of size 768, is then extracted from the final hidden layer and transferred to the bidirectional LSTM layer, whose output is of size 1024 (512 for the forward LSTM and 512 for the backward LSTM). To prevent overfitting, different dropout values were experimented with during the initial trials; a dropout rate of 30% is used since it produced the best performance. The following layer is fully connected, taking the 1024-dimensional input and producing a vector with size equal to the total number of labels in the training dataset. This vector is given to the Conditional Random Field (CRF) layer, which learns the transition probabilities from the input dataset and finds the best possible tag sequence for the sentence. The model uses the negative log-likelihood loss, modelled as a minimization problem.
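The layer stack described above can be sketched in PyTorch with the pytorch-crf package. The paper reports a Keras implementation, so this is an illustrative re-expression rather than the authors' code; the dimensions follow the text (768-dimensional BERT outputs, 512 LSTM units per direction, 30% dropout, a linear layer to the label space, and a CRF trained with the negative log-likelihood).

```python
import torch.nn as nn
from torchcrf import CRF               # pip install pytorch-crf
from transformers import AutoModel

class BertBiLSTMCRF(nn.Module):
    """Sketch of the BERT-BiLSTM-CRF tagger; dimensions taken from the text."""
    def __init__(self, num_labels, bert_name="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(input_size=768, hidden_size=512,
                            batch_first=True, bidirectional=True)  # 512 + 512 = 1024
        self.dropout = nn.Dropout(0.3)                             # 30% dropout
        self.fc = nn.Linear(1024, num_labels)                      # emissions per label
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state  # (B, 256, 768)
        lstm_out, _ = self.lstm(hidden)                                      # (B, 256, 1024)
        emissions = self.fc(self.dropout(lstm_out))                          # (B, 256, labels)
        mask = attention_mask.bool()
        if labels is not None:
            # The CRF returns the log-likelihood; negate it for minimization.
            return -self.crf(emissions, labels, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # best tag sequence (Viterbi)
```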
There are now three models, capable of finding diseases, chemicals, and proteins respectively. The processed CORD-19 is fed into each model and the entity tags of the CORD-19 dataset are found. The tags, one per word token, are encoded in the BIO scheme, where B, I, and O denote the beginning, inside, and outside of an entity, respectively. Table 1 shows a sample encoding of a sentence in the BIO scheme. These tags are combined so that the final output contains the CORD-19 dataset with all the diseases, chemicals, and proteins mentioned.
Table 1.
Example encoding of a sentence in BIO scheme.
Token | Encoding in BIO |
---|---|
Management | O |
of | O |
critically | O |
ill | O |
patients | O |
with | O |
Severe | B-Disease |
Acute | I-Disease |
Respiratory | I-Disease |
Syndrome | I-Disease |
. | O |
The system takes the contextualized word embeddings of the NER training datasets and the CORD-19 dataset as input, and returns the diseases, chemicals, and proteins extracted from the CORD-19 dataset as output. Pseudocode is provided for the BERT-BiLSTM-CRF model and for the extraction of biomedical entities from the CORD-19 dataset.
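As a small illustration of how entity spans can be recovered from the predicted BIO tags (the helper name is ours, not from the paper):

```python
def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from a BIO-tagged sentence."""
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # an "O" tag closes any open span
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ["Management", "of", "critically", "ill", "patients", "with",
          "Severe", "Acute", "Respiratory", "Syndrome", "."]
tags = ["O", "O", "O", "O", "O", "O",
        "B-Disease", "I-Disease", "I-Disease", "I-Disease", "O"]
print(extract_entities(tokens, tags))
# [('Severe Acute Respiratory Syndrome', 'Disease')]  -- matches Table 1
```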
3.2. Relation extraction module
Fig. 2 shows the complete design of the Biomedical Relation Extraction (BioRE) module. The BC5CDR dataset is used for the extraction of chemical-induced disease relations, and the CHEMPROT dataset for the extraction of chemical-protein relations. The two datasets are preprocessed so that tokens are associated with their entities, and each sentence in the dataset is transformed into two segments: the first contains the two entities whose relation is in question, separated by a space, and the second contains the actual sentence from the dataset.
Fig. 2.
Relation extraction module design diagram.
An example transformation:
Epidermal growth factor receptor gefitinib. Epidermal growth factor receptor inhibitors currently under investigation include the small molecules gefitinib (Iressa, ZD1839) and erlotinib (Tarceva, OSI-774), as well as monoclonal antibodies such as cetuximab (IMC-225, Erbitux).
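A minimal sketch of this two-segment input construction, assuming the HuggingFace tokenizer for the public SciBERT checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

entity_pair = "Epidermal growth factor receptor gefitinib"   # segment A: candidate pair
sentence = ("Epidermal growth factor receptor inhibitors currently under investigation "
            "include the small molecules gefitinib (Iressa, ZD1839) and erlotinib "
            "(Tarceva, OSI-774), as well as monoclonal antibodies such as cetuximab.")

# Encoding the two segments jointly yields token_type_ids of 0 for the entity
# pair and 1 for the original sentence, which the model uses to tell them apart.
enc = tokenizer(entity_pair, sentence, max_length=512,
                padding="max_length", truncation=True, return_tensors="pt")
print(enc["token_type_ids"][0][:20])
```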
The input ids, attention mask, and token type ids of the input sentence/text are passed into SciBERT. The token embedding of the special "[CLS]" token is used to classify the sentence/text into one of the relations: the 768-dimensional "[CLS]" vector is passed into a fully connected layer, which produces an output of size equal to the total number of relation types in the dataset. The AdamW optimizer is used to train the model.
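The classification head described above can be sketched as follows; this is an illustrative re-expression, and the learning rate is an assumption.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RelationClassifier(nn.Module):
    """[CLS]-based relation classifier, sketched from the description above."""
    def __init__(self, num_relations, bert_name="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.fc = nn.Linear(768, num_relations)   # 768 -> number of relation types

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls_vec = out.last_hidden_state[:, 0, :]   # embedding of the [CLS] token
        return self.fc(cls_vec)                    # logits over relation types

model = RelationClassifier(num_relations=2)        # e.g. BC5CDR: relation / no relation
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
```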
The BC5CDR dataset is fed into a SciBERT model fine-tuned for the Chemical-Induced Disease relation, and the CHEMPROT dataset into a separate SciBERT model fine-tuned for finding relations between chemicals and proteins. The CORD-19 dataset with entity mentions is then preprocessed so that only sentences with two or more entities are forwarded to the models; cell line and cell type entities are not considered for this count. Based on the types of entities present, each sentence is fed to one of the two models, which predicts whether a relation exists between the two entities. Finally, the two models' outputs are combined and tuples of the form (Entity1, Entity2, Relation) are generated.
The system takes the relation datasets as input to train the SciBERT models and returns the relations between entities extracted from the CORD-19 dataset. Pseudocode is provided for the RE training and the relation extraction from CORD-19.
3.3. Graph construction
The graph construction process takes the relation tuples, each of the form (Entity_1, Relation, Entity_2). The occurrence count of every entity across the tuples is computed, and only the tuples in which either entity has an occurrence count greater than 5 are considered for the next step. This helps reduce the noise in the resulting Knowledge Graph.
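A minimal sketch of this occurrence-count filter, assuming the tuples are (entity1, relation, entity2) triples:

```python
from collections import Counter

def filter_tuples(tuples, min_count=5):
    """Keep only tuples where at least one entity occurs more than min_count times."""
    counts = Counter()
    for e1, _rel, e2 in tuples:
        counts[e1] += 1
        counts[e2] += 1
    return [(e1, rel, e2) for e1, rel, e2 in tuples
            if counts[e1] > min_count or counts[e2] > min_count]
```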
Then a Neo4j graph database instance is created and all the entities from the filtered tuples are created as nodes with their names and types. The relations are created as directed edges: Chemical-to-Disease for the Chemical-Induced Disease relation and Chemical-to-Protein for the Chemical-Protein relation.
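A sketch of this step using the official Neo4j Python driver; the connection details, node labels, and relationship-type string are illustrative assumptions.

```python
from neo4j import GraphDatabase  # pip install neo4j

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def insert_relation(session, e1, e1_type, rel_type, e2, e2_type):
    # Node labels and relationship types cannot be parameterized in Cypher,
    # hence the f-string; MERGE avoids creating duplicate nodes and edges.
    session.run(
        f"MERGE (a:{e1_type} {{name: $e1}}) "
        f"MERGE (b:{e2_type} {{name: $e2}}) "
        f"MERGE (a)-[:{rel_type}]->(b)",
        e1=e1, e2=e2,
    )

with driver.session() as session:
    insert_relation(session, "zinc", "Chemical",
                    "CHEMICAL_INDUCED_DISEASE", "coronavirus", "Disease")
driver.close()
```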
Mathematically, the Knowledge Graph is defined as follows:
E: the set of nodes representing diseases, proteins, and chemicals.
R: the set of labels denoting the relationships between chemicals, diseases, and proteins.
G ⊆ E × R × E: the set of edges representing connections between entity pairs.
The model takes the relation tuples extracted from the CORD-19 dataset as input. The model returns the Knowledge Graph of COVID-19 which is stored in a Neo4j Graph Database. The pseudocode for the graph construction is given in Algorithm 5.
Algorithm 5
Graph construction.
3.4. Representation learning module
Initially, the COVID-19 Knowledge Graph is converted into triples (head entity, relation, tail entity). Since the Knowledge Graph contains only positive relations, negative relations are generated by taking two random entities and, if there is no relation between them, treating the pair as a negative relation.
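A sketch of this negative-sampling scheme (function and variable names are ours):

```python
import random

def sample_negative_triples(positive_triples, entities, n):
    """Draw random entity pairs and keep them only if no positive relation links them."""
    positive_pairs = {(h, t) for h, _r, t in positive_triples}
    relations = list({r for _h, r, _t in positive_triples})
    negatives = []
    while len(negatives) < n:
        h, t = random.sample(entities, 2)   # two distinct random entities
        if (h, t) not in positive_pairs:
            negatives.append((h, random.choice(relations), t))
    return negatives
```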
These triples are then used to train the TransD model. Each symbol (entity or relation) in TransD is represented by two vectors: the first defines the entity or relation itself, while the second is used to construct the dynamic mapping matrices. Once the TransD model is trained, the embeddings of all entities and relations are read from the model parameters. Only the embeddings of entities whose edge count is greater than or equal to 5 are taken for comparison. The COVID-19 embedding is then taken from the model and compared with all of these entities using the cosine similarity score, and the top 25 diseases, chemicals, and proteins with the highest cosine similarity scores are produced as output.
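For reference, the TransD projection and the cosine-similarity ranking can be sketched as follows. This is a schematic of the scoring geometry from Ji et al. (2015) under the simplifying assumption of equal entity and relation dimensions, not the trained model used in the paper.

```python
import torch
import torch.nn.functional as F

def transd_project(e, e_p, r_p):
    """TransD dynamic projection with equal dims:
    M_r e = (r_p e_p^T + I) e = r_p * (e_p . e) + e."""
    return r_p * torch.dot(e_p, e) + e

def transd_score(h, h_p, r, r_p, t, t_p):
    """Plausibility of a triple (h, r, t): smaller distance = more plausible."""
    return torch.norm(transd_project(h, h_p, r_p) + r
                      - transd_project(t, t_p, r_p), p=2)

def top_k_similar(target_vec, entity_vecs, k=25):
    """Rank entity embeddings by cosine similarity to the target (COVID-19) vector."""
    names = list(entity_vecs)
    matrix = torch.stack([entity_vecs[n] for n in names])
    sims = F.cosine_similarity(matrix, target_vec.unsqueeze(0), dim=1)
    scores, idx = sims.topk(min(k, len(names)))
    return [(names[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]
```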
The model takes the COVID-19 Knowledge Graph as input in triple format and returns the list of the top 25 diseases, chemicals, and proteins related to COVID-19 along with their cosine similarity scores. The pseudocode for the representation learning is given in Algorithm 6.
Algorithm 6
Representation learning.
4. Experimental setup
This section describes each dataset utilised in the system, the hyperparameters used for training, and the evaluation metrics used to judge the proposed system. The proposed method is developed in Python and makes use of the free and open-source Keras package. With a focus on facilitating quick experimentation, Keras is capable of running on top of Theano, TensorFlow, and Deeplearning4j. The TransD model is trained with a batch size of 100, an entity embedding dimension of 400, MarginLoss as the loss function, and a total of 700 epochs. The embeddings of entities with more than 5 edges are then compared with the coronavirus embedding, and the diseases, chemicals, and proteins with the highest cosine similarity scores are produced as the final result.
4.1. Dataset
4.1.1. NCBI-disease
To serve as a research resource for the biomedical natural language processing community, the NCBI disease corpus is completely annotated at the mention and concept levels (Doğan et al., 2014). Annotations are checked for corpus-wide consistency, and two annotators are allocated to each document (paired at random). The properties of the dataset are displayed in Table 2. The available tags are B-Disease, I-Disease, and O.
Table 2.
NCBI Disease dataset characteristics.
Dataset Statistics | Dataset for Training | Dataset for Development | Dataset for Testing | Total Dataset |
---|---|---|---|---|
PubMed References | 593 | 100 | 100 | 793 |
Total disease entities | 5145 | 787 | 960 | 6892 |
Specific disease entities | 1710 | 368 | 427 | 2136 |
Specific concept ID | 670 | 176 | 203 | 790 |
4.1.2. CHEMDNER
The abstracts of the CHEMDNER corpus were chosen to be representative of all significant chemical specialties. Each mention of a chemical entity was manually categorised into one of the following classes: abbreviation, family, formula, identifier, multiple, systematic, and trivial (Krallinger et al., 2015). The properties of the dataset are displayed in Table 3. The available tags are B-Chemical, I-Chemical, and O.
Table 3.
CHEMDNER dataset characteristics.
Dataset Statistics | Dataset for Training | Dataset for Development | Dataset for Testing | Total Dataset |
---|---|---|---|---|
Abstracts | 3500 | 3500 | 3000 | 10,000 |
No. of Characters | 4,883,753 | 4,864,558 | 4,199,068 | 13,947,379 |
No. of Tokens | 770,855 | 766,331 | 662,571 | 2,199,757 |
Abstracts with classes | 2916 | 2907 | 2478 | 8301 |
No. of Mentions | 29,478 | 29,526 | 25,351 | 84,355 |
No. of Chemicals | 8520 | 8677 | 7563 | 19,805 |
No. of Journals | 193 | 188 | 188 | 203 |
4.1.3. JNLPBA
The data was taken from the GENIA 3.02 corpus (Kim et al., 2003), produced by a controlled search on MEDLINE using the MeSH terms human, blood cells, and transcription factors. The properties of the dataset are shown in Table 4. The available tags are B-Protein, I-Protein, B-DNA, I-DNA, B-RNA, I-RNA, B-cell line, I-cell line, B-cell type, I-cell type, and O.
Table 4.
JNLPBA dataset characteristics.
Corpus Characteristics | Training Set | Test Set | Whole corpus |
---|---|---|---|
Abstracts | 2000 | 404 | 2404 |
Sentences | 20,546 | 4260 | 24,806 |
Words | 472,006 | 96,780 | 568,786 |
Entities | 51,291 | 8662 | 59,953 |
4.1.4. BC5CDR
Due to their essential roles in numerous fields of biomedical research and healthcare, such as drug development and safety surveillance, chemicals, diseases, and their connections are among the most frequently searched subjects by PubMed users worldwide. Although creating therapeutic compounds is the ultimate goal of drug development, other objectives include identifying harmful interactions between substances and illnesses. The properties of the dataset are shown in Table 5. Chemical-Induced Disease is the only relation available.
Table 5.
BC5CDR dataset characteristics.
Corpus Characteristics | Training Set | Testing Set | Whole Corpus |
---|---|---|---|
No. of Chosen Abstracts | 1000 | 500 | 1500 |
No. of Chemical Mentions | 10,550 | 5385 | 15,935 |
Chemical Unique Mentions | 2973 | 1435 | 4408 |
No. of Disease Mentions | 8426 | 4424 | 12,850 |
Disease Unique Mentions | 3829 | 1988 | 5817 |
Chemical Induced Disease Relations | 2050 | 1066 | 3116 |
4.1.5. CHEMPROT
The CHEMPROT corpus is a manually annotated corpus where domain experts have thoroughly categorised all references to chemicals and genes as well as any binary interactions between them that conform to a certain set of physiologically relevant relation types (CHEMPROT relation classes). The CHEMPROT track aims to encourage the creation of systems that can extract chemical-protein interactions relevant to drug discovery, precision medicine, and fundamental biomedical research. The CHEMPROT dataset's available relations are displayed in Table 6, and the dataset's corpus features are displayed in Table 7.
Table 6.
Available relations in CHEMPROT dataset.
Group | CHEMPROT relations belonging to this group |
---|---|
CPR:1 | PART_OF |
CPR:2 | REGULATOR|DIRECT_REGULATOR|INDIRECT_REGULATOR |
CPR:3 | UPREGULATOR|ACTIVATOR|INDIRECT_UPREGULATOR |
CPR:4 | DOWNREGULATOR|INHIBITOR|INDIRECT_DOWNREGULATOR |
CPR:5 | AGONIST|AGONIST-ACTIVATOR|AGONIST-INHIBITOR |
CPR:6 | ANTAGONIST |
CPR:7 | MODULATOR|MODULATOR-REGULATOR|MODULATOR-INHIBITOR |
CPR:8 | COFACTOR |
CPR:9 | SUBSTRATE|PRODUCT_OF|SUBSTRATE_PRODUCT_OF |
CPR:10 | NOT |
Table 7.
CHEMPROT dataset characteristics.
Dataset Statistics | Dataset for Training | Dataset for Development | Dataset for Testing | Total Dataset |
---|---|---|---|---|
Document | 1020 | 612 | 800 | 2432 |
Chemical | 13,017 | 8004 | 10,810 | 31,831 |
Protein | 12,752 | 7567 | 10,019 | 30,338 |
Positive Relation | 4157 | 2416 | 3458 | 10,031 |
Positive relation in one sentence | 4122 | 2412 | 3444 | 9978 |
4.1.6. CORD-19
The COVID-19 Open Research Dataset (CORD-19) was created in response to the COVID-19 epidemic by the White House and a consortium of top research organisations. The CORD-19 database contains more than 1,000,000 research publications about COVID-19, SARS-CoV-2, and related coronaviruses, including more than 350,000 full-text articles. This freely available resource is provided to the entire research community so that cutting-edge AI approaches can be used to produce fresh insights to aid in the ongoing battle against this contagious disease. The properties of the dataset are displayed in Table 8.
Table 8.
CORD-19 dataset characteristics.
Subfield | Count | % Of Corpus |
---|---|---|
Virology | 20,116 | 42.3% |
Immunology | 9875 | 20.7% |
Molecular biology | 6040 | 12.7% |
Genetics | 3783 | 8.0% |
Intensive care medicine | 3204 | 6.7% |
Other | 4595 | 9.6% |
4.2. Hyperparameters
For Named Entity Recognition, the BERT-BiLSTM-CRF model is trained with various hyperparameters to find the best possible values, which differ for the NCBI-Disease, CHEMDNER, and JNLPBA datasets. Table 9 shows the hyperparameters used for the different datasets.
Table 9.
Hyperparameters for Different NER datasets.
Dataset | No. of Labels | Epochs | Sequence Length |
---|---|---|---|
NCBI-Disease | 3 | 25 | 256 |
CHEMDNER | 3 | 25 | 256 |
JNLPBA | 11 | 32 | 256 |
For Relation Extraction, the SciBERT model is trained with various hyperparameters to find the best possible values, which differ for the BC5CDR and CHEMPROT datasets. Table 10 shows the hyperparameters used for the different datasets.
Table 10.
Hyperparameters for Relation Extraction.
Dataset | No. of Relations | Sequence Length | Epoch |
---|---|---|---|
BC5CDR | 2 | 512 | 32 |
CHEMPROT | 9 | 512 | 39 |
4.3. Performance metrics
Here, whether or not the complete entity mention is accurately detected determines the metrics of precision (P), recall (R), and F1. The following evaluation measures are utilised in the NER and RE domains:
P = TP / (TP + FP)  (1)
R = TP / (TP + FN)  (2)
F1 = 2 × P × R / (P + R)  (3)
where TP, FP, and FN represent the True Positives, False Positives, and False Negatives, respectively.
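Entity-level scores of this kind can be computed, for example, with the seqeval package; the tag sequences below are illustrative.

```python
from seqeval.metrics import precision_score, recall_score, f1_score  # pip install seqeval

# Entity-level scoring on BIO tags: a prediction counts as a true positive
# only when the full mention span and its type both match the gold annotation.
y_true = [["O", "B-Disease", "I-Disease", "I-Disease", "O"]]
y_pred = [["O", "B-Disease", "I-Disease", "O", "O"]]
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))  # 0.0 0.0 0.0 -- the span boundaries differ
```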
5. Results and discussion
5.1. Named entity recognition
For evaluation of the proposed system, the NCBI-Disease dataset is considered. The proposed system for Named Entity Recognition is BERT-BiLSTM-CRF. Table 11 shows the comparison of different variants of the proposed system: BiLSTM, LSTM-CRF, BiLSTM-CRF, and SciBERT. As shown in Table 11, the BERT-BiLSTM-CRF model has better precision and F1 values than standard SciBERT. The BERT-BiLSTM-CRF model is therefore used to evaluate the other datasets and the unstructured CORD-19 dataset.
Table 11.
Comparison with variants of Proposed System.
Model | Precision | Recall | F1 |
---|---|---|---|
BiLSTM + Word Embedding | 84.87 | 74.11 | 79.13 |
LSTM-CRF + Word and Char Embedding | 85.20 | 82.40 | 83.80 |
BiLSTM-CRF + Word Embedding | 86.75 | 87.11 | 86.93 |
SciBERT | 85.47 | 90.10 | 87.73 |
BERT-BiLSTM-CRF (Proposed System) | 88.49 | 89.02 | 88.76 |
Initially, the CORD-19 SciBERT model is taken and the individual sentence features (input ids, attention mask, token types) are fed into the BERT layer; the final hidden layer of the BERT model is taken as output, of size (sentence_length = 256, 768). Fig. 3 shows the BERT layer output.
Fig. 3.
BERT layer output.
The output of the BERT layer is then fed into the BiLSTM layer, which takes vectors of size 768 and outputs vectors of size 1024 for each word in the sentence. Since it is a bidirectional LSTM, 512 of the output dimensions come from the forward LSTM and the other 512 from the backward LSTM (as shown in Fig. 4).
Fig. 4.
BiLSTM layer output.
The output from the LSTM layer is then given to the dropout layer, where a dropout value of 0.3 is used to prevent overfitting. The output of the dropout layer is then transferred to the fully connected layer, which transforms each word's 1024-dimensional vector into a vector with the same size as the number of label types in the NER dataset.
The output from the fully connected layer for each word is passed to the Conditional Random Field (CRF) layer, which computes the log-likelihood of the tag sequence. This log-likelihood is negated to give the negative log-likelihood loss, which is used as the loss function for the model and is backpropagated.
Multiple optimizers were tried, and the optimizer finalised is Stochastic Gradient Descent (SGD) with momentum, which avoids oscillations and converges faster than the plain gradient descent algorithm. The individual sentence features are passed to the three models, and based on the predicted tags the entities are extracted by matching the tags with the words. After running this procedure for all the sentences, the extracted entities are stored in a MongoDB database for future reference. Table 12 shows the results obtained on the different biomedical named entity recognition datasets, and Table 13 shows the complete statistics of the entities found in CORD-19 based on this training.
Table 12.
Named Entity Recognition evaluation metrics.
Dataset | Precision | Recall | F1 |
---|---|---|---|
NCBI-Disease | 88.49 | 89.02 | 88.76 |
JNLPBA | 71.25 | 81.5 | 76.06 |
CHEMDNER | 90.88 | 92.25 | 91.56 |
Table 13.
Overall entities extracted from CORD-19.
Entity Type | Total no. of instances | Unique no. of instances |
---|---|---|
Disease | 100,441 | 17,672 |
Chemical | 66,212 | 7841 |
Protein | 366,963 | 129,972 |
In this section, we compare the results of the proposed system on the three gold-standard datasets with the other existing systems; the comparison is shown in Table 14. In general, the proposed framework performed poorly on the JNLPBA corpus: the model can readily fall into a local optimum during JNLPBA training, and investigation has not revealed a technique to improve this. Apart from JNLPBA, the proposed framework performed marginally better than the other existing models. The hierarchical shared learning model improved over its single-task counterpart through multi-task learning and exceeded the proposed framework on the CHEMDNER dataset, while the HunFlair model surpassed the hierarchical shared learning model on NCBI-Disease. However, the proposed framework outperformed all the models, including the multi-task models and massively pretrained models, on the NCBI-Disease dataset. One may claim that prior training on biomedical knowledge can greatly enhance entity recognition and the identification of genes and proteins, but our proposed framework performed well across all the datasets without any massive pretraining on medical data.
Table 14.
NER model performance of F1 score with existing systems.
NCBI | CHEMDNER | JNLPBA | |
---|---|---|---|
SOTA | 88.60 | 91.14 | 77.39 |
Yoon et. al (Yoon et al., 2019) | 84.69 | 88.19 | 77.39 |
BioBERT (Lee et al., 2020) | 85.63 | 90.04 | 74.94 |
HunFlair (Weber et al., 2019) | 88.26 | - | 77.60 |
Chai et. al (Chai et al., 2022) | 88.26 | 91.70 | - |
Proposed System | 88.76 | 91.56 | 76.06 |
We list the scores of the state-of-the-art (SOTA) models on different datasets as follows: scores of Xu et al. (2019) on NCBI Disease, scores of Luo et al. (2018) on CHEMDNER, and scores of Yoon et al. (2019) on JNLPBA.
Because the experiment in the previous section involved hyperparameter tuning, we define a new experiment for testing the speed and accuracy of our implementation with the default Keras framework while keeping the hyperparameters constant across all datasets. Additionally, we employ the same biomedical datasets and contextual word embeddings indicated above to ensure that the studies can be replicated. For the speed test, we used a learning rate of 0.01, an LSTM state size of 300, a batch size of 32, 10 epochs, and the SGD optimizer.
The results show that the CHEMDNER dataset takes more time than the other biomedical datasets, most likely because of its size. Another interesting observation is the large difference on the JNLPBA corpus between the tuned results and this result: almost more than a 5% difference in F1 score, as shown in Table 15. This might be due to the smaller number of epochs used, which causes a large variation in the convergence of the neural network between the JNLPBA corpus and the other two corpora. Also, the number of labels in the JNLPBA corpus is higher than in the other two corpora, so with the low epoch value there is a higher chance of labelling errors on JNLPBA.
Table 15.
Performance evaluation on Biomedical datasets.
Dataset | Time (Sec) | macro – F1 |
---|---|---|
NCBI-Disease | 301 | 0.865 |
JNLPBA | 510 | 0.719 |
CHEMDNER | 2864 | 0.884 |
5.2. Relation extraction
For evaluation of the relation extraction with the different variants of the proposed system, the BC5CDR dataset is considered. Table 16 shows the results of the comparison analysis with variants such as 1d-CNN, LSTM-SVM, LSTM-CNN, LSTM-CRF, and SciBERT. The SciBERT model obtained the best results compared with the other variants, and it is used to evaluate the CHEMPROT dataset.
Table 16.
Comparison with variants of Proposed System for RE.
Model | Precision | Recall | F1 |
---|---|---|---|
1d-CNN + Glove | 60.85 | 56.42 | 58.55 |
LSTM-SVM | 64.90 | 49.30 | 56.00 |
LSTM-CNN | 54.30 | 65.90 | 59.50 |
LSTM-CRF | 60.00 | 67.50 | 63.50 |
SciBERT (Proposed System) | 74.00 | 73.00 | 73.00 |
In the BC5CDR dataset, only positive relations are given. To balance the dataset for better training of the model and to prevent overfitting, negative relations are created: two random entities from the text that are not in a relation are taken and marked as a negative relation. For example, in abstract 227508, the entities found are shown in Fig. 5. Here, only D007022 and D008750 are in a positive relation, so negative relations are created by combining D007022 with D003000, D006973, D009270, and so on. In the CHEMPROT dataset, the classes CPR:8 and CPR:0 are left out because there are not enough samples to evaluate them correctly in the development and test sets.
Fig. 5.
Chemicals and Diseases from sample BC5CDR row.
The dataset is read into memory and converted into the format [text, entity1, entity2, relation type]. The entity1 and entity2 of the relation are concatenated using a space, and a tab is added to the end to mark the first segment of the SciBERT input; the actual text forms the second segment. The text is then passed to the SciBERT subword tokenizer to obtain the input ids corresponding to the words. The attention mask is calculated from the input ids: where there is a pad token the mask value is 0, otherwise 1. The token type ids are calculated such that segment 1 has a value of 0, segment 2 has a value of 1, and pad tokens have a value of 0. Finally, the labels are converted to tensor form, as shown in Fig. 6.
Fig. 6.
Input ids of sample BC5CDR row.
The above-mentioned output is passed to the SciBERT model. Its output is of size 512 × 768, where 512 is the length of the padded or truncated input. Using the output of the [CLS] token, the fully connected layer produces a vector with a size equal to the number of relation types in the dataset. The model is fine-tuned by backpropagating the errors. The final outcomes of fine-tuning SciBERT are displayed in Fig. 7.
Fig. 7.
Fine tuning SciBERT for BC5CDR.
Once the entities in a sentence are found by Named Entity Recognition, Chemical-Disease relation extraction is carried out if the sentence contains both chemicals and diseases, and Chemical-Protein relation extraction if it contains both chemicals and proteins. For each pair (entity_1, entity_2) in a sentence, the pair is added to the start of the text and passed to the SciBERT tokenizer to obtain the input ids, attention mask, and token types. These are then passed to the corresponding Relation Extraction model, based on the entity types, to predict whether a relation exists between them. Table 17 shows the relation extraction results obtained for the different datasets, and Table 18 shows the statistics of the relations extracted from CORD-19 based on the RE training.
Table 17.
Relation Extraction evaluation metrics.
Dataset | Precision | Recall | F1 |
---|---|---|---|
BC5CDR | 74.00 | 73.00 | 73.00 |
CHEMPROT | 72.00 | 71.00 | 71.00 |
Table 18.
Overall relations extracted from CORD-19.
Relation Type | Total no. of Instances | Unique no. of Instances |
---|---|---|
Chemical Induced Disease | 5425 | 4071 |
Chemical Protein Relation | 87,794 | 72,891 |
5.3. Graph construction
The relations extracted from the CORD-19 dataset are taken from the database as (entity 1, entity 2, relation) tuples. Only the relations whose entities occur more than 5 times are taken into consideration, and the knowledge graph is built from the final list of relations. The database contains a total of 76,962 relations, 5237 entities that occur more than five times, and 69,782 relations containing at least one entity with more than five occurrences. A Neo4j graph database is initiated, with entities as nodes and relations as edges, and all the relations are inserted to create the knowledge graph. Since there are more than 75,000 relations, Fig. 8 shows only a portion of the Knowledge Graph, focused on coronavirus (an illustrative query for retrieving this neighbourhood is sketched after Fig. 8). Red nodes denote diseases, yellow nodes chemicals, and purple nodes proteins. The edges in the figure represent the diseases, chemicals, and proteins closely related to COVID-19 among the extracted relations. The figure also shows the inter-relations between entities: for example, zinc is a chemical closely related to coronavirus as well as to age-related eye diseases. Since we are interested only in the interactions of these entities with the coronavirus, the representation learning framework is used to extract the top chemicals, diseases, and proteins related to COVID-19.
Fig. 8.
Portion of knowledge graph focused on COVID-19.
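For illustration, the neighbourhood shown in Fig. 8 can be retrieved from the Neo4j instance with a Cypher query of the following form; the connection details and node label are assumptions.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Fetch up to 100 entities directly connected to the coronavirus node.
query = """
MATCH (c:Disease {name: 'coronavirus'})-[r]-(n)
RETURN n.name AS neighbour, type(r) AS relation
LIMIT 100
"""
with driver.session() as session:
    for record in session.run(query):
        print(record["neighbour"], record["relation"])
driver.close()
```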
5.4. Representation learning
Initially, the relations list is taken and negative relations are created from it by taking random entities from different relations and combining them. Fig. 9, Fig. 10, and Fig. 11 show the top diseases, chemicals, and proteins related to COVID-19, respectively. Without any prior domain knowledge, our proposed methodology identified the most important drugs, diseases, and proteins associated with COVID-19; the biomedical research community could benefit from these derived entities by gaining a better grasp of COVID-19. In these figures, thicker edges and darker colours indicate a stronger relationship. For example, ARDS (Acute Respiratory Distress Syndrome), CRD (Chronic Respiratory Diseases), Hepatitis C, and HSV-2 (Herpes Simplex Virus) are a few diseases that patients with COVID-19 are likely to develop. Entities like polyvinylidene fluoride, poly ADP ribose, and succinate are likely to be considered as drug candidates for COVID-19. Proteins like Trypsin 1, IL-1β (Interleukin-1), kinases, DICER, and cyclooxygenase 2 have a direct association with COVID-19.
Fig. 9.
Top Diseases related to COVID-19.
Fig. 10.
Top chemicals related to COVID-19.
Fig. 11.
Top proteins related to COVID-19.
Since there is no gold-standard data available for the CORD-19 dataset, the system uses the Relation Extraction metrics as the output metrics of the system. To ensure the validity of the formed Knowledge Graph, one of the downstream tasks of representation learning is performed, and its output is verified manually by checking the relation between each mentioned entity and COVID-19. Table 19 showcases some of the evidence. A limitation of our approach is that it is evaluated only on the CORD-19 dataset and no additional resources have been employed; due to the lack of gold-standard data, we have manually verified some of the randomly identified associations. Another limitation of the proposed framework is that entities identified as closely related to COVID-19 might have an opposite association. For example, PGE2 (prostaglandin E2) is identified as a top chemical for COVID-19, but it actually increases the severity of COVID-19 disease. As future work, we intend to develop a normalisation and abbreviation expansion mechanism to apply after entity discovery. By studying these top predicted entities, domain experts can better comprehend the many associations and relationships they display with respect to COVID-19.
Table 19.
Evidences for certain highly COVID-19 related entities.
Entity | Entity Type | Evidence |
---|---|---|
ARDS | Disease | Aslan, Anolin et al. “Acute respiratory distress syndrome in COVID-19: possible mechanisms and therapeutic management.” Pneumonia (Nathan Qld.) vol. 13,1 14. 6 Dec. 2021, doi:10.1186/s41479–021–00092–9 |
CRD | Disease | Beltramo, Guillaume et al. “Chronic respiratory diseases are predictors of severe outcome in COVID-19 hospitalised patients: a nationwide study.” The European respiratory journal vol. 58,6 2004474. 9 Dec. 2021, doi:10.1183/13993003.04474–2020 |
Hepatitis C | Disease | Ronderos, Diana et al. “Chronic hepatitis-C infection in COVID-19 patients is associated with in-hospital mortality.” World journal of clinical cases vol. 9,29 (2021): 8749–8762. |
HSV-2 | Disease | Shanshal, Mohammed, and Hayder Saad Ahmed. “COVID-19 and Herpes Simplex Virus Infection: A Cross-Sectional Study.” Cureus vol. 13,9 e18022. 16 Sep. 2021, doi:10.7759/cureus.18022 |
PGE2 | Chemical | Ricke-Hoch, Melanie et al. “Impaired immune response mediated by prostaglandin E2 promotes severe COVID-19 disease.” PloS one vol. 16,8 e0255335. 4 Aug. 2021, doi:10.1371/journal.pone.0255335 |
Polyvinylidene fluoride | Chemical | Nageh, Hassan et al. “Zinc Oxide Nanoparticle-Loaded Electrospun Polyvinylidene Fluoride Nanofibers as a Potential Face Protector against Respiratory Viral Infections.” ACS omega vol. 7,17 14887–14896. 22 Apr. 2022, doi:10.1021/acsomega.2c00458 |
Poly adp ribose | Chemical | Badawy, Abdulla A-B. “Immunotherapy of COVID-19 with poly (ADP-ribose) polymerase inhibitors: starting with nicotinamide.” Bioscience reports vol. 40,10 (2020): BSR20202856. doi:10.1042/BSR20202856 |
Succinate | Chemical | Pacl, Hayden T et al. “Water-soluble tocopherol derivatives inhibit SARS-CoV-2 RNA-dependent RNA polymerase.” bioRxiv: the preprint server for biology 2021.07.13.449251. 27 Jul. 2021, doi:10.1101/2021.07.13.449251. Preprint. |
SER | Chemical | Rahbar Saadat, Yalda et al. “Host Serine Proteases: A Potential Targeted Therapy for COVID-19 and Influenza.” Frontiers in molecular biosciences vol. 8 725528. 30 Aug. 2021, doi:10.3389/fmolb.2021.725528 |
Trypsin 1 | Protein | Kim, Yeeun et al. “Trypsin enhances SARS-CoV-2 infection by facilitating viral entry.” Archives of virology vol. 167,2 (2022): 441–458. doi:10.1007/s00705–021–05343–0 |
IL-1β | Protein | Mardi, Amirhossein et al. “Interleukin-1 in COVID-19 Infection: Immunopathogenesis and Possible Therapeutic Perspective.” Viral immunology vol. 34,10 (2021): 679–688. doi:10.1089/vim.2021.0071 |
Kinase | Protein | Pillaiyar, Thanigaimalai, and Stefan Laufer. “Kinases as Potential Therapeutic Targets for Anti-coronaviral Therapy.” Journal of medicinal chemistry vol. 65,2 (2022): 955–982. doi: 10.1021/acs.jmedchem.1c00335 |
DICER | Protein | Mousavi, Seyyed Reza et al. “Dysregulation of RNA interference components in COVID-19 patients.” BMC research notes vol. 14,1 401. 29 Oct. 2021, doi:10.1186/s13104–021–05816–0 |
Cyclo Oxygenase 2 | Protein | Baghaki, Semih et al. “COX2 inhibition in the treatment of COVID-19: Review of literature to propose repositioning of celecoxib for randomized controlled studies.” Int J Infect Dis. 2020;101:29–32. doi:10.1016/j.ijid.2020.09.1466 |
6. Conclusion and future work
In this study, a general pipeline for association analysis of an unstructured dataset with regard to a certain entity is proposed. The pipeline integrates IE and KG construction without human intervention. The system also proposes fine-tuning SciBERT on the CORD-19 abstracts to create a custom SciBERT for word embeddings. For BioNER, the system proposes the BERT-BiLSTM-CRF model, which, according to the test results, outperformed SciBERT. The SciBERT model is used by the system to extract relationships between entities. After the entities are filtered to remove extraneous data, the final collection of relations is utilised to create the Knowledge Graph. The system makes use of the TransD Knowledge Graph embedding method to learn the latent representation of the created Knowledge Graph. The system's methodology is assessed exclusively on the CORD-19 dataset; no other resources have been used. Because no gold-standard data is available, the system uses the Relation Extraction metrics as its final metrics. Additionally, the final Knowledge Graph is assessed by comparing the COVID-19 embedding with those of all other entities to identify the top COVID-19-related diseases, chemicals, and proteins.
CRediT authorship contribution statement
Sudhakaran Gajendran: Conceptualization, Methodology, Software, Writing – original draft. Manjula D: Visualization, Supervision. Vijayan Sugumaran: Writing – review & editing, Validation. R. Hema: Data curation, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.compbiolchem.2022.107808.
Appendix A. Supplementary material
Supplementary material
References
- Beltagy, I., Lo, K., Cohan, A., 2019. SciBERT: a pretrained language model for scientific text. In: EMNLP/IJCNLP.
- Chai, Z., Jin, H., Shi, S., et al., 2022. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics 23, 8. doi:10.1186/s12859-021-04551-4.
- Chen, C., Akef Ebeid, I., Bu, Y., Ding, Y., 2020. Coronavirus knowledge graph: a case study. arXiv e-prints.
- Cheng, D., Knox, C., Young, N., Stothard, P., Damaraju, S., Wishart, D.S., 2008. PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 36 (Web Server issue), W399–W405. doi:10.1093/nar/gkn296.
- Devlin, J., Chang, M., Lee, K., Toutanova, K., 2019. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
- Doğan, R.I., Leaman, R., Lu, Z., 2014. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inform. 47, 1–10. doi:10.1016/j.jbi.2013.12.006.
- Domingo-Fernandez, D., Baksi, S., Schultz, B., Gadiya, Y., Karki, R., Raschka, T., Ebeling, C., Hofmann-Apitius, M., Kodamullil, A.T., 2020. COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. bioRxiv.
- Fundel, K., Küffner, R., Zimmer, R., 2006. RelEx—relation extraction using dependency parse trees. Bioinformatics 23 (3), 365–371. doi:10.1093/bioinformatics/btl616.
- Gajendran, S., Manjula, D., 2020. Biomedical named entity recognition (BNER) using word representation features based on CRF. Int. J. Creat. Res. Thoughts (IJCRT), 89120.
- Gajendran, S., Manjula, D., Sugumaran, V., 2020. Character level and word level embedding with bidirectional LSTM – dynamic recurrent neural network for biomedical named entity recognition from literature. J. Biomed. Inform. 112, 103609. doi:10.1016/j.jbi.2020.103609.
- Giorgi, J., Wang, X., Sahar, N., Shin, W.Y., Bader, G.D., Wang, B., 2019. End-to-end named entity recognition and relation extraction using pre-trained language models. arXiv:1912.13415.
- Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U., 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33 (14), i37–i48. doi:10.1093/bioinformatics/btx228.
- Harnoune, A., Rhanoui, M., Mikram, M., Yousfi, S., Elkaimbillah, Z., Asri, B.E., 2021. BERT based clinical knowledge extraction for biomedical knowledge graph construction and analysis. Comput. Methods Prog. Biomed. Update 1.
- Jensen, K., Panagiotou, G., Kouskoumvekaki, I., 2014. Integrated text mining and chemoinformatics analysis associates diet to health benefit at molecular level. PLoS Comput. Biol. 10 (1). doi:10.1371/journal.pcbi.1003432.
- Jettakul, A., Wichadakul, D., Vateekul, P., 2019. Relation extraction between bacteria and biotopes from biomedical texts with attention mechanisms and domain-specific contextual representations. BMC Bioinformatics 20 (1), 1–17. doi:10.1186/s12859-019-3217-3.
- Ji, G., He, S., Xu, L., Liu, K., Zhao, J., 2015. Knowledge graph embedding via dynamic mapping matrix. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 687–696, Beijing, China.
- Kazama, J., Makino, T., Ohta, Y., Tsujii, J., 2002. Tuning support vector machines for biomedical named entity recognition. In: Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, pp. 1–8, Philadelphia, Pennsylvania, USA.
- Kim, D., Lee, J., So, C.H., Jeon, H., Jeong, M., Choi, Y., Kang, J., 2019. A neural named entity recognition and multi-type normalization tool for biomedical text mining. IEEE Access 7, 73729–73740.
- Kim, T., Yun, Y., Kim, N., 2021. Deep learning-based knowledge graph generation for COVID-19. Sustainability 13, 2276.
- Krallinger, M., Rabal, O., Leitner, F., Vazquez, M., Salgado, D., Lu, Z., Leaman, R., et al., 2015. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform. 7 (Suppl 1), S2. doi:10.1186/1758-2946-7-S1-S2.
- Lafferty, J., McCallum, A., Pereira, F.C.N., 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data.
- Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), 1234–1240. doi:10.1093/bioinformatics/btz682.
- Luo, L., Yang, Z., Yang, P., Zhang, Y., Wang, L., Lin, H., Wang, J., 2018. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34 (8), 1381–1388. doi:10.1093/bioinformatics/btx761.
- Nobata, C., Collier, N., Tsujii, J.-I., 1999. Automatic term identification and classification in biology texts. In: Proceedings of the 5th NLPRS, pp. 369–374.
- Percha, B., Garten, Y., Altman, R.B., 2012. Discovery and explanation of drug-drug interactions via text mining. Biocomputing 2012, 410–421.
- Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L., 2018. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana.
- Quan, C., Wang, M., Ren, F., 2014. An unsupervised text mining method for relation extraction from biomedical literature. PLoS ONE 9 (7). doi:10.1371/journal.pone.0102039.
- Rebholz-Schuhmann, D., 2013. Biomedical named entity recognition, Whatizit. In: Dubitzky, W., Wolkenhauer, O., Cho, K.H., Yokota, H. (Eds.), Encyclopedia of Systems Biology. Springer, New York, NY.
- Repke, T., Krestel, R., 2021. Extraction and representation of financial entities from text. In: Consoli, S., Reforgiato Recupero, D., Saisana, M. (Eds.), Data Science for Economics and Finance. Springer, Cham.
- Shen, D., Zhang, J., Zhou, G., Su, J., Tan, C.-L., 2003. Effective adaptation of hidden Markov model-based named entity recognizer for biomedical domain. In: Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pp. 49–56, Sapporo, Japan.
- Wang, Q., Li, M., Wang, X., Parulian, N., Han, G., Ma, J., Tu, J., Lin, Y., Zhang, H., Liu, W., Chauhan, A., Guan, Y., Li, B., Li, R., Song, X., Ji, H., Han, J., Chang, S., Pustejovsky, J., Liem, D., El-Sayed, A., Palmer, M., Rah, J., Schneider, C., Onyshkevych, B., 2020. COVID-19 literature knowledge graph construction and drug repurposing report generation. arXiv:2007.00576.
- Weber, L., Sänger, M., Münchmeyer, J., Habibi, M., Leser, U., Akbik, A., 2019. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics 37 (17), 2792–2794. doi:10.1093/bioinformatics/btab042.
- Xu, K., Yang, Z., Kang, P., Wang, Q., Liu, W., 2019. Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition. Comput. Biol. Med. 108, 122–132. doi:10.1016/j.compbiomed.2019.04.002.
- Yanran, L., Wenjie, L., Fei, S., Sujian, L., 2015. Component-enhanced Chinese character embeddings. arXiv:1508.06669.
- Yoon, W., So, C.H., Lee, J., Kang, J., 2019. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics 20. doi:10.1186/s12859-019-2813-6.
- Yoshua Bengio, Ducharme, R., Vincent, P., Janvin, C., 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155.
- Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J., 2014. Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 2335–2344.
- Zheng, S., Rao, J., Song, Y., Zhang, J., Xiao, X., Fang, E., Yang, Y., Niu, Z., 2020. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Brief. Bioinform. doi:10.1093/bib/bbaa344.
- Zhu, Q., Li, X., Conesa, A., Pereira, C., 2017. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text. Bioinformatics 34 (9), 1547–1554. doi:10.1093/bioinformatics/btx815.