Multichannel Convolutional Neural Network for Biological Relation Extraction

Chanqin Quan; Lei Hua; Xiao Sun; Wenjun Bai

doi:10.1155/2016/1850404

2016 Dec 7;2016:1850404. doi: 10.1155/2016/1850404

PMCID: PMC5174749 PMID: 28053977

Abstract

The plethora of biomedical relations embedded in medical logs (records) demands researchers' attention. Previous theoretical and practical work was largely restricted to traditional machine learning techniques. However, these methods are susceptible to the “vocabulary gap” and to data sparseness, and their feature extraction cannot be automated. To address these issues, we propose a multichannel convolutional neural network (MCCNN) for automated biomedical relation extraction. The proposed model makes two contributions: (1) it enables the fusion of multiple (e.g., five) versions of word embeddings; (2) it obviates the need for manual feature engineering through automated feature learning with a convolutional neural network (CNN). We evaluated our model on two biomedical relation extraction tasks: drug-drug interaction (DDI) extraction and protein-protein interaction (PPI) extraction. For the DDI task, our system achieved an overall F-score of 70.2%, compared to 67.0% for the standard linear SVM based system, on the DDIExtraction 2013 challenge dataset. For the PPI task, we evaluated our system on the Aimed and BioInfer PPI corpora; it exceeded the state-of-the-art ensemble SVM system by 2.7% and 5.6% in F-score, respectively.

1. Introduction

DDI and PPI are two of the most typical tasks in biological relation extraction. The DDI task aims to extract the interactions among two or more drugs when these drugs are combined and act on each other in the human body; hidden drug interactions may seriously affect human health. It is therefore important to better understand drug interactions in order to reduce drug-safety accidents. In contrast, the PPI task aims to extract interaction relations among proteins, and it has recently attracted much interest in the study of biomedical relations [1, 2]. A number of databases have been created for DDI (DrugBank [3, 4]) and PPI (MINT [5], IntAct [6]). However, with the rapid growth of the biomedical literature (e.g., MedLine has doubled in size within a decade), it is hard for these databases to keep up with the latest DDIs and PPIs. Consequently, efficient DDI and PPI extraction systems have become particularly important.

Previous studies have explored many different methods for DDI and PPI tasks. The dominant techniques generally fall into three broad categories: cooccurrence based methods [7], rule-pattern based methods [8, 9], and statistical machine learning (ML) based methods [10–13]. A cooccurrence based method considers two entities to interact with each other if they occur in the same sentence. A major weakness of this method is that it tends to yield high recall but low precision.
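The trade-off of the cooccurrence approach can be made concrete with a minimal sketch. The input format, the placeholder entity names, and the toy gold labels below are assumptions made for illustration; they are not taken from any of the cited systems.

```python
from itertools import combinations

def cooccurrence_pairs(sentences):
    """Predict an interaction for every pair of distinct entities sharing a sentence.

    `sentences` is a list of entity-mention lists, one list per sentence
    (a hypothetical input format chosen for this sketch).
    """
    predicted = set()
    for entities in sentences:
        for first, second in combinations(sorted(set(entities)), 2):
            predicted.add((first, second))
    return predicted

def precision_recall(predicted, gold):
    """Compare predicted pairs against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy data: placeholder entity names and made-up gold labels.
sentences = [
    ["drug_a", "drug_b", "drug_c"],
    ["drug_b", "drug_d"],
]
gold = {("drug_a", "drug_b"), ("drug_b", "drug_d")}

predicted = cooccurrence_pairs(sentences)
precision, recall = precision_recall(predicted, gold)
```

On this toy input every gold pair is recovered, but half of the predicted pairs are spurious, which is the high-recall, low-precision behaviour described above.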

Rule and pattern based methods employ predefined patterns and rules to match the labeled sequence. Although traditional rule and pattern based methods achieve high accuracy, the effort required to design their patterns and their reduced recall limit their practical use. ML based techniques, in contrast, view the DDI or PPI task as a standard supervised classification problem, that is, deciding whether there is an interaction between two entities (binary classification) or what kind of relation holds between them (multilabel classification). Compared with cooccurrence and rule-pattern based methods, ML based methods show much better performance and generalization, and the state-of-the-art results for DDI [14] and PPI [2] are both achieved by ML based methods.

Traditional ML based methods usually collect words around the target entities as key features, such as unigrams, bigrams, and trigrams; these features are put into a bag-of-words model and encoded as one-hot (https://en.wikipedia.org/wiki/One-hot) representations, which are then fed to a traditional classifier such as an SVM. However, such representations are unable to capture semantic relations among words or phrases and fail to generalize to long context dependencies [15]. The former issue is known as the “vocabulary gap”: for example, the words “depend” and “rely”, which are cue words (interaction verbs [8]) important in biomedical relation extraction, have different one-hot representations despite their similar linguistic functions. The latter issue arises from the n-order Markov restriction, which is imposed to alleviate the “curse of dimensionality.” Moreover, the inability to extract features automatically leads to laborious manual feature design, which hinders the practical use of traditional ML based methods in extracting biomedical relation features.

To tackle these issues, in this work we employ word embedding [16, 17] (also known as distributed representation) to represent words. Unlike one-hot representation, word embedding maps words to dense real-valued vectors in a low-dimensional space, so the “vocabulary gap” can be addressed by measuring the similarity of two words as the dot product of their vectors. Whereas the one-hot model allows only a binary comparison between words (same or different), word embedding yields a graded similarity between two words via the dot product. Such a representation also has a neurological underpinning and is more consistent with the way humans think.
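The contrast between the two representations can be sketched in a few lines. The three-word vocabulary and the embedding values below are made up for illustration and do not come from any trained model; cosine similarity (a length-normalized dot product) is used so that the scores are comparable.

```python
import math

vocabulary = ["depend", "rely", "inhibit"]

def one_hot(word):
    """One-hot vector over the toy vocabulary."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalized by the vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional embeddings; the values are invented for this sketch.
embedding = {
    "depend": [0.9, 0.1, 0.3],
    "rely": [0.8, 0.2, 0.4],
    "inhibit": [-0.2, 0.9, -0.5],
}

# Distinct words are always orthogonal under one-hot encoding.
one_hot_similarity = dot(one_hot("depend"), one_hot("rely"))

# Dense vectors can place functionally similar words close together.
dense_similarity = cosine(embedding["depend"], embedding["rely"])
dense_dissimilarity = cosine(embedding["depend"], embedding["inhibit"])
```

With one-hot vectors the similarity of “depend” and “rely” is exactly zero, whereas the dense vectors give a graded score that can separate related from unrelated words.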

Building on previous research on word embedding, this work proposes a multichannel convolutional neural network (MCCNN) based on distributed word embeddings for biomedical relation extraction. The concept of a “channel” in MCCNN is inspired by three-channel RGB image processing [18]: each version of word embedding constitutes one channel and captures a different aspect of the input words. The proposed MCCNN integrates different versions of word embeddings to better represent the input words. The only input to MCCNN is the sentences that contain drug-drug pairs (in the DDI task) or protein-protein pairs (in the PPI task). By looking up the different versions of word embedding, input sentences are initialized and transformed into multichannel representations. A CNN, a robust neural network method, is then applied to extract features automatically and feed them to a Softmax layer for classification.
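The pipeline just outlined (multichannel lookup, convolution, pooling, Softmax) can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the embeddings and weights are random rather than pretrained or trained, channel responses are summed as in RGB image convolution, and the ReLU activation, the zero vector for unknown words, and all sizes are choices of this sketch rather than settings confirmed by the text.

```python
import math
import random

random.seed(0)
DIM, CHANNELS, WINDOW, FILTERS, CLASSES = 4, 2, 3, 5, 2  # assumed toy sizes

def random_vector(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

tokens = "caution when administering Entity1 with Entity2".split()
# One lookup table per channel; random stand-ins for pretrained embeddings.
tables = [{tok: random_vector(DIM) for tok in tokens} for _ in range(CHANNELS)]

def lookup(tokens, tables):
    """Return the input as [channel][position][dimension]; unknown words map to zeros."""
    return [[table.get(tok, [0.0] * DIM) for tok in tokens] for table in tables]

def convolve(x, weights, bias):
    """Slide one filter over the sentence, summing the responses of all channels."""
    features = []
    for start in range(len(x[0]) - WINDOW + 1):
        total = bias
        for channel in range(CHANNELS):
            for offset in range(WINDOW):
                word = x[channel][start + offset]
                total += sum(w * v for w, v in zip(weights[channel][offset], word))
        features.append(max(0.0, total))  # ReLU activation (an assumption)
    return features

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Each filter holds one weight vector per channel and window offset, plus a bias.
filters = [
    ([[random_vector(DIM) for _ in range(WINDOW)] for _ in range(CHANNELS)], 0.0)
    for _ in range(FILTERS)
]
output_weights = [random_vector(FILTERS) for _ in range(CLASSES)]

x = lookup(tokens, tables)
pooled = [max(convolve(x, weights, bias)) for weights, bias in filters]  # max over positions
scores = [sum(w * p for w, p in zip(row, pooled)) for row in output_weights]
probabilities = softmax(scores)
```

Each filter contributes a single pooled value regardless of sentence length, so the classifier always receives a fixed-size feature vector.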

In sum, our proposed MCCNN model makes three contributions:

(1) We propose a new model, MCCNN, to tackle DDI and PPI tasks and demonstrate that MCCNN, which relies on multichannel word embedding, is effective in extracting biomedical relation features; the proposed model automates the feature extraction process. We tested the model on the DDIExtraction 2013 challenge dataset and achieved an overall F-score of 70.2%, which outperforms the best system in the DDIExtraction challenge by 5.1% and the recent state-of-the-art linear SVM based method [14] by 3.2%.
(2) We also evaluated the proposed model on the Aimed and BioInfer PPI extraction tasks. The attained F-scores of 72.4% and 79.6% outperform the state-of-the-art ensemble SVM system by 2.7% and 5.6%, respectively.
(3) We release our code (https://github.com/coddinglxf/DDI), taking into account the model's simplicity and good performance.

In the remaining sections, Section 2 details the proposed MCCNN method, Section 3 presents and discusses the experimental results, Section 4 briefly concludes this work, and Section 5 details the implementation of MCCNN.

2. Method

In this section, we first briefly describe the concept and training algorithm for word embedding. We then introduce the multichannel word embedding and the CNN model for relation extraction in detail. Finally, we show how to train the proposed MCCNN model.

2.1. Word Embedding

Word embedding, which can capture both syntactic and semantic information from a large unlabeled corpus, has shown its effectiveness in many NLP tasks. The basic assumption behind word embedding is that words which occur in similar contexts tend to have similar meanings. Many models have been proposed to train word embeddings, such as NNLM [16], LBL [19], Glove [20], and CBOW. The CBOW model (part of word2vec [17] (https://code.google.com/archive/p/word2vec/)) is employed to train our own word embedding in this work due to its simplicity and effectiveness. CBOW takes the average embedding of the context words as the context representation, and it reduces training time by replacing the traditional final Softmax layer with a hierarchical Softmax. In addition, CBOW can further reduce training time through negative sampling. An outline of the CBOW architecture is shown in Figure 1.
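As a rough illustration of the CBOW idea, the following sketch averages the context embeddings and performs stochastic gradient updates with a negative-sampling objective. It is not the word2vec implementation: the toy vocabulary, the dimensionality, the learning rate, and the fixed list of negatives are assumptions of this sketch (word2vec draws negatives from a noise distribution), and hierarchical Softmax is not shown.

```python
import math
import random

random.seed(0)
DIM = 8  # assumed toy dimensionality
vocabulary = ["drug", "interacts", "with", "protein", "binds", "enzyme"]
input_vectors = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocabulary}
output_vectors = {w: [0.0] * DIM for w in vocabulary}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_representation(context):
    """CBOW context vector: the element-wise average of the context word embeddings."""
    return [sum(input_vectors[w][d] for w in context) / len(context) for d in range(DIM)]

def negative_sampling_loss(context, target, negatives):
    """Loss for scoring the true target high and the sampled negatives low."""
    hidden = context_representation(context)
    loss = -math.log(sigmoid(dot(output_vectors[target], hidden)))
    for word in negatives:
        loss -= math.log(sigmoid(-dot(output_vectors[word], hidden)))
    return loss

def train_step(context, target, negatives, learning_rate=0.1):
    """One SGD update on the target word (label 1) and the negatives (label 0)."""
    hidden = context_representation(context)
    hidden_gradient = [0.0] * DIM
    for word, label in [(target, 1.0)] + [(n, 0.0) for n in negatives]:
        out = output_vectors[word]
        error = sigmoid(dot(out, hidden)) - label
        for d in range(DIM):
            hidden_gradient[d] += error * out[d]  # uses the value before the update
            out[d] -= learning_rate * error * hidden[d]
    # The averaged context shares the gradient equally among the context words.
    for word in context:
        for d in range(DIM):
            input_vectors[word][d] -= learning_rate * hidden_gradient[d] / len(context)

context, target, negatives = ["drug", "with"], "interacts", ["binds", "enzyme"]
loss_before = negative_sampling_loss(context, target, negatives)
for _ in range(20):
    train_step(context, target, negatives)
loss_after = negative_sampling_loss(context, target, negatives)
```

Only the target and the few sampled negatives are updated in each step, which is why negative sampling is cheaper than normalizing over the whole vocabulary.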

2.2. Multichannel Word Embedding Input Layer

Word embedding reflects the distribution of words in an unlabeled corpus. To maximize the coverage of the word embeddings, articles from PubMed, PMC, MedLine, and Wikipedia are used for training. Five versions of word embedding are generated from these corpora. The first four are released by Pyysalo et al. [21], while the fifth is trained by CBOW on the MedLine corpus (http://www.nlm.nih.gov/databases/journal.html) (see Figure 1 for more details). The statistics of the five word embeddings are given in Table 1.

Table 1.

Statistics for five word embeddings (all with 200 dimensions).

	Vocabulary size	Training corpus
1	2515686	PMC
2	2351706	PubMed
3	4087446	PMC and PubMed
4	5443656	Wikipedia and PubMed
5	650187	MedLine

Entity1	Entity2	Generated inputs
Nabumetone	warfarin	Caution should be exercised when administering Entity1 with Entity2 since interactions have been seen with other EntityOther
Nabumetone	NSAIDs	Caution should be exercised when administering Entity1 with EntityOther since interactions have been seen with other Entity2
warfarin	NSAIDs	Caution should be exercised when administering EntityOther with Entity1 since interactions have been seen with other Entity2
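The generated inputs in the table above can be reproduced with a short sketch of entity blinding. The function name is hypothetical, and the sketch assumes that each entity string occurs exactly once in the sentence; annotated corpora provide character offsets, which would be the safer basis for replacement.

```python
from itertools import combinations

def blind_entity_pairs(sentence, entities):
    """Yield (entity1, entity2, blinded sentence) for every pair of mentions.

    Assumes each entity string occurs exactly once in the sentence.
    """
    for first, second in combinations(entities, 2):
        blinded = sentence
        for entity in entities:
            if entity == first:
                placeholder = "Entity1"
            elif entity == second:
                placeholder = "Entity2"
            else:
                placeholder = "EntityOther"
            blinded = blinded.replace(entity, placeholder)
        yield first, second, blinded

sentence = (
    "Caution should be exercised when administering Nabumetone with warfarin "
    "since interactions have been seen with other NSAIDs"
)
examples = list(blind_entity_pairs(sentence, ["Nabumetone", "warfarin", "NSAIDs"]))
```

A sentence with three entity mentions therefore yields three candidate instances, one per pair, each of which is classified separately.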

Rule 1	Anesthetics, general: exaggeration of the hypotension induced by general anesthetics

Rule 2	To minimize CNS depression and possible potentiation, barbiturates, antihistamines, narcotics, hypotensive agents or phenothiazines should be used with caution

	Train DrugBank	Train MedLine	Train Overall	Test DrugBank	Test MedLine	Test Overall
Abstract	572	142	714	158	33	191
Positive	3788	232	4020	884	95	979
Negative	22118	1547	23665	4367	345	4712
Advice	818	8	826	214	7	221
Effect	1535	152	1687	298	62	360
Mechanism	1257	62	1319	278	24	302
Int	178	10	188	94	2	96

After preprocessing and filtering rules
Positive	3767	231	3998	884	92	976
Negative	14445	1179	15624	2819	243	3062
Advice	815	7	822	214	7	221
Effect	1517	152	1669	298	62	360
Mechanism	1257	62	1319	278	21	299
Int	178	10	188	94	2	96

	Vocabulary size	Word embedding
1	9984	PMC
2	10273	PubMed
3	10399	PMC and PubMed
4	10432	Wikipedia and PubMed
5	9639	Medline

	Baseline P	Baseline R	Baseline F	One-channel P	One-channel R	One-channel F	MCCNN P	MCCNN R	MCCNN F
Advice	89.39	53.88	67.24	80.77	67.12	73.32	82.99	73.52	77.97
Effect	56.32	57.42	56.87	60.46	73.67	66.41	67.03	69.47	68.23
Mechanism	78.33	53.36	63.47	64.72	70.81	67.63	85.00	62.75	72.20
Int	93.55	30.21	45.67	82.05	33.33	47.41	75.51	38.54	51.03
Overall (micro)	70.00	52.68	60.12	66.50	67.31	66.90	75.99	65.25	70.21
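The F columns in the table above are consistent with the harmonic mean of precision and recall, which can be checked with a small sketch. The `micro_average` helper and its input format are assumptions of this sketch, showing one standard way to pool counts over classes; the per-class counts behind the reported figures are not given here.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_average(counts):
    """Pool true positives, false positives, and false negatives over all classes.

    `counts` maps a class name to a dict with keys "tp", "fp", and "fn"
    (a hypothetical format chosen for this sketch).
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    return precision, recall, f_score(precision, recall)

# Recompute two MCCNN entries from their reported precision and recall.
mccnn_overall = round(f_score(75.99, 65.25), 2)
mccnn_advice = round(f_score(82.99, 73.52), 2)
```

Both recomputed values agree with the reported 70.21 and 77.97. Small deviations can occur for other cells because the published precision and recall are themselves rounded.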

	F-score
MCCNN (with preprocessing)	70.21
MCCNN (without preprocessing)	67.80

Method	Feature sets
Kim	Word features, dependency graph features
	Word pair features, parse tree features
	Noun phrase constrained coordination features

FBK-irst	Linear features, path-enclosed tree kernels
FBK-irst	Shallow linguistic features

WBI	Features combination of other DDI methods

UTurku	Linear features, external resources
UTurku	Word features, graph features

	ADV	EFF	MEC	INT	DEC	Overall
Kim	72.5	66.2	69.3	48.3	77.5	67.0
FBK-irst	69.2	62.8	67.9	54.7	80.0	65.1
WBI	63.2	61.0	61.8	51.0	75.9	60.9
UTurku	63.0	60.0	58.2	50.7	69.6	59.4

MCCNN	78.0	68.2	72.2	51.0	79.0	70.2

	Aimed	BioInfer	Word embedding
All	6276	5461	—
1	5293	4666	PMC
2	5363	4712	PubMed
3	5404	4749	PMC and PubMed
4	5414	4762	Wikipedia and PubMed
5	4977	4328	MedLine

	Baseline P	Baseline R	Baseline F	One-channel P	One-channel R	One-channel F	MCCNN P	MCCNN R	MCCNN F
Aimed	71.62	61.25	64.27	72.28	60.82	65.58	76.41	69.00	72.45
BioInfer	78.13	73.00	72.34	76.06	79.43	77.07	81.30	78.10	79.62

	DrugBank	MedLine
DrugBank	70.8	52.6
MedLine	10.0	28.0

Datasets	Positive	Negative
BioInfer	2512	7010
Aimed	995	4812

	Aimed	BioInfer
Choi and Myaeng [22]	67.0	72.6
Yang et al. [23]	64.4	65.9
Li et al. [2]	69.7	74.0
Erkan et al. [11]	59.6	—
Miwa et al. [24]	60.8	68.1
Miwa et al. [25]	64.2	67.6

MCCNN (the proposed)	72.4	79.6

GPU	NVIDIA GeForce GTX TITAN X
CPU	Intel(R) Xeon CPU E5-2620 v3 @ 2.4 GHz
System	Windows 7
Memory	8 GB
