Journal of the American Medical Informatics Association: JAMIA. 2021 Sep 15;28(12):2571–2581. doi: 10.1093/jamia/ocab176

Distantly supervised biomedical relation extraction using piecewise attentive convolutional neural network and reinforcement learning

Tiantian Zhu 1,2, Yang Qin 1, Yang Xiang 2, Baotian Hu 1, Qingcai Chen 1,2, Weihua Peng 3
PMCID: PMC8633639  PMID: 34524450

Abstract

Objective

Various methods have been proposed to deal with erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to address the insufficient modeling of instance-label correlations when predicting biomedical relations, using deep learning and reinforcement learning.

Materials and Methods

In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from Unified Medical Language System with MEDLINE abstracts and benchmark datasets. In PACNN+RL, PACNN was introduced to encode semantic information of biomedical text, and the RL method with memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks.

Results

The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods with the F1-score of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the DDI corpus, 2011. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets.

Conclusions

The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks to facilitate biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.

Keywords: biomedical relation extraction, distant supervision, reinforcement learning, neural networks, deep learning

INTRODUCTION

The amount of biomedical literature is vastly increasing daily, triggering an urgent need to extract useful information from the literature automatically to facilitate knowledge integration and digestion.1 Relation extraction (RE) is an important step in information extraction that builds on the output of named entity recognition. In the biomedical domain, it aims to extract relational facts between bioentities mentioned in plain text, and is beneficial for many biomedical knowledge-driven applications.2–5 Previous solutions for RE included rule-based,6 unsupervised,7 and supervised learning approaches,8,9 in which the supervised learning approach has obtained substantial successes in recent years. However, a limitation is the requirement of high-quality and large volumes of corpora annotated by domain experts.

Distant supervision (DS) was proposed to alleviate this issue.10,11 DS is a form of weak supervision that can be used to generate large amounts of training data from an unlabeled corpus using well-built knowledge bases (KBs). The basic DS assumption is that if 2 entities participate in a relation according to the KBs, all sentences in the text corpus that mention these 2 entities could potentially express this relation. For instance, the entity pair DRUG(Practolol)-DISEASE(arrhythmias) is defined to be associated with the may-treat relationship in the KB Unified Medical Language System (UMLS). According to the assumption, the following sentence in PubMed would be annotated as a positive training instance of the may-treat relation: “Example 1: Practolol, although reversing the arrhythmias, tends to cause hypotension.” (PMID: 7377). On the other hand, if a pair mentioned in a sentence is not recorded in UMLS, the sentence is considered a negative instance. Such training data are then used to train a supervised model. However, this assumption does not always hold, so some sentences containing an entity pair might be mistakenly identified as positive examples. For example, the following sentence contains the same entity pair but does not assert the may-treat relation, which is a false positive (FP) annotation: “Example 2: The effective half-life of practolol was less than 15 min and doses up to 0.4 mg/kg were unable to prevent arrhythmias during adrenaline challenge.” (PMID: 26159). In this sentence, we cannot find any description that expresses the target relation. Thus, even though DS is efficient for data acquisition, it comes at the cost of data noise.
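The labeling heuristic can be illustrated with a minimal sketch. It assumes a toy KB held as a set of (head, relation, tail) triples and sentences whose entity pair is already annotated; the names `KB` and `label_sentence` and the third sentence are hypothetical and are not part of the authors' pipeline.

```python
# Toy knowledge base of (head, relation, tail) triples; illustrative only.
KB = {("practolol", "may_treat", "arrhythmias")}


def label_sentence(head, tail, relation, kb=KB):
    """Return 1 if the KB records `relation` for the entity pair, else 0.

    The sentence text is never inspected, which is why distant
    supervision can produce false positive labels.
    """
    return 1 if (head.lower(), relation, tail.lower()) in kb else 0


sentences = [
    # Example 1 from the text: a true positive.
    ("Practolol, although reversing the arrhythmias, tends to cause hypotension.",
     "Practolol", "arrhythmias"),
    # Example 2 from the text: labeled positive although the relation is not asserted.
    ("The effective half-life of practolol was less than 15 min and doses up to "
     "0.4 mg/kg were unable to prevent arrhythmias during adrenaline challenge.",
     "practolol", "arrhythmias"),
    # Hypothetical sentence whose pair is absent from the toy KB.
    ("Practolol tends to cause hypotension.", "Practolol", "hypotension"),
]
labels = [label_sentence(h, t, "may_treat") for _, h, t in sentences]
```

Both Example 1 and Example 2 receive the same positive label under this rule, which is the noise the rest of the article sets out to reduce.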

To address the noisy data challenge in DS, some works manually defined rules to improve the quality of training data.12,13 For instance, Bobić et al14 introduced the constraint for protein-protein interaction (PPI) and drug-drug interaction (DDI) RE tasks that if entities refer to the same object, the pair is filtered. However, such methods generalize poorly across tasks. To weaken the strong hypothesis of DS, many studies treat classification as a multi-instance learning problem.15–20 These methods assume that a bag of sentences mentioning an entity pair describes the same relation, and they select the one-best sentence18 or calculate intrabag attention weights19,20 to reduce the influence of noisy sentences. Such methods are effective in reducing noise, but they cannot identify the FP instances in positive bags. Recently, some works21,22 used the reinforcement learning (RL) strategy to create “clean” data. These methods treated noisy instances with a hard decision, rather than with soft attention weights. For example, Feng et al21 proposed a removal operation that only retains high-quality sentences. However, removing these sentences without additional supervision might be a misoperation. Unlike previous works, we made full use of each instance and proposed a novel RL strategy that considers the inherent connections and differences between FP and true positive examples, which is of great importance to DS.

In this article, we propose a novel biomedical RE method that combines a piecewise attentive convolutional neural network (PACNN) with a RL denoising module based on DS. The PACNN module is based on piecewise convolutional neural network (PCNN)18 with the filter attention mechanism to capture the internal relationship of biomedical sentences and relies on multichannel word embeddings that represent words as dense vectors. In the denoising module, an RL algorithm was proposed to enlarge the gap between positive and negative examples that mention the same entity pair by selecting FP instances from positive bags and further building contrastive negative bags. Inspired by the idea of tracking historical events to aid current decision making of human beings, we introduced a memory backtracking mechanism to RL to emphasize specific attention regions on the historical actions. The model was validated on 4 biomedical RE tasks. Through extensive experiments, we demonstrated our method’s effectiveness compared with DS baselines on 8 datasets. In addition, we conducted an ablation study to illustrate the contribution of different components and analyzed the newly built negative bags to validate the effectiveness of the RL module.

MATERIALS AND METHODS

Corpus building

This method was evaluated on 4 biomedical RE tasks, namely, the extraction of may-treat, may-prevent, DDI, and PPI relations.23–29 Specifically, may-prevent denotes a treatment that can be used to prevent a disease, and may-treat indicates the treatment for a particular disease. DDI refers to the compound effects of patients taking 2 or more drugs at the same time or within a certain period. Because protein function depends largely on the functional context of its interaction partners, getting a better understanding of PPIs is vital to understand the biological processes.

For may-prevent and may-treat RE tasks, DS was used to generate training data, and the left part of Figure 1 is a flow chart of DS. We first selected all triples with may-prevent and may-treat relations from the UMLS Metathesaurus as the knowledge source. UMLS is a large-scale biomedical database that contains millions of medical concepts and their corresponding relations.30 Then, we used MetaMap to perform named entity recognition with UMLS concepts on 1 million randomly selected MEDLINE abstracts (https://lhncbc.nlm.nih.gov/ii/tools/MetaMap.html). Finally, the relation of each entity pair was generated according to the heuristic rule: sentences containing concepts that are identified as being related in the UMLS are positive, and sentences with concepts considered not to be related in UMLS are negative. For DDI and PPI RE tasks, we made use of the distantly supervised set constructed by Thomas et al.31 All test sets were manually annotated and are publicly available. Table 1 lists the descriptions of the 4 distantly labeled training corpora and 8 test sets.

Figure 1.

The architecture of the proposed piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) model for distant supervised biomedical relation extraction. The proposed method has 3 parts: distant supervision (left), PACNN model (middle), and reinforcement learning method (right).

Table 1.

Description of the 4 distantly labeled sets and 8 manually labeled test sets

| Type | Training database | Training abstracts | Training positive/negative | Test corpus | Test positive/negative |
|---|---|---|---|---|---|
| Prevent | UMLS | 55 967 | 15 576/184 424 | Prevent23 | 139/261 |
| Treat | UMLS | 55 761 | 53 234/146 766 | Treat23 | 173/227 |
| DDI | Drugbank | 76 859 | 8705/191 295 | DDI corpus, 201124 | 755/6271 |
| PPI | IntAct, KUPS | 49 958 | 37 600/162 400 | AIMed25 | 1000/4834 |
| | | | | BioInfer26 | 2534/7132 |
| | | | | HPRD5027 | 163/270 |
| | | | | IEPA28 | 335/482 |
| | | | | LLL29 | 164/166 |

DDI: drug-drug interaction; PPI: protein-protein interaction; UMLS: Unified Medical Language System.

Text preprocessing

To ensure that the test and training sets do not overlap, all sentences contained in the test sets were removed from the distantly labeled sets. To reduce the impact of nontarget entities and ensure generalization during learning, we replaced the head and tail entities with the symbols “entity1” and “entity2,” respectively, and other entities were represented by the unified symbol “entity0.”32 Head and tail entities are components of the relational fact that is often represented in the form of a triple (head, relation, tail). Finally, we applied 2 rules referenced from Kim et al32 to filter out the noise instances (see Supplementary Appendix A for more examples):

Rule 1. Avoid target entities that refer to the same real-world biomedical entity. If (1) 2 entities have the same name, or (2) 1 entity is an abbreviation or acronym of the other, filter out the corresponding instances.

Rule 2. Entity pairs in a coordinate structure should be filtered out.
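The entity masking and Rule 1 can be sketched as follows. This is a simplified illustration that assumes single-token entities and a crude initials-based acronym test; the function names are hypothetical, and Rule 2 (coordinate structures) is omitted because it needs syntactic analysis.

```python
def mask_entities(tokens, head_idx, tail_idx, other_idx=()):
    """Replace head/tail with 'entity1'/'entity2' and other entities with 'entity0'."""
    out = list(tokens)
    for i in other_idx:
        out[i] = "entity0"
    out[head_idx] = "entity1"
    out[tail_idx] = "entity2"
    return out


def is_acronym(short, long):
    """Crude check (an assumption): `short` equals the initials of the words in `long`."""
    initials = "".join(word[0] for word in long.replace("-", " ").split())
    return short.lower() == initials.lower()


def violates_rule1(e1, e2):
    """Rule 1: same name, or one entity is an acronym of the other."""
    if e1.lower() == e2.lower():
        return True
    return is_acronym(e1, e2) or is_acronym(e2, e1)


masked = mask_entities(["ritonavir", "induces", "methadone"], head_idx=0, tail_idx=2)
```

An instance would be dropped when `violates_rule1` returns `True` for its target pair.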

Piecewise attentive convolutional neural network

The architecture of the PACNN is depicted in Figure 1, which consists of a multichannel input layer, a convolution layer, a piecewise max pooling layer, a filter attention layer, and a multi-instance classifier.

Multichannel input layer

In a neural network model, the embedding layer transforms discrete words into dense vectors as input. In our work, word embeddings and positional features were generated according to the given sentence $S = \{w_1, \ldots, e_1, \ldots, e_2, \ldots, w_n\}$, where $e_1$ and $e_2$ are the target entities and $n$ is the number of context words. For word embeddings, we applied 5 types of pretrained word embeddings generated using PubMed, PMC, MEDLINE articles, and Wikipedia to be in accordance with some previous studies.33,34 For positional features, we adopted the idea of Zeng et al.35 Position embeddings are vector representations of the relative distances from the current word to the head and tail entities in the sentence, respectively, and are added in each channel of the input layer.

Convolution

A convolution operation involves a filter, which is applied to a window of $\omega$ words to produce a new feature. Previous PCNNs only used a single window size $\omega$ as the filter, which is limited in identifying relations between entities at different distances. In the PACNN, the window size is set to $\omega \in \{2, 3, 4\}$ to support various semantically dependent distances. The convolutional operation can be defined as a filter $C \in \mathbb{R}^{c \times \omega \times d}$ applied to each channel input $X$, where $c$ is the number of channels and $d$ is the dimension of embeddings.

The feature $f_m$ can be calculated by the following formula:

$$f_m = \alpha\left(\sum_{j=0}^{c} X_j[m:m+\omega-1] \odot C_j + b\right) \tag{1}$$

where $\alpha$ is the activation function, $X_j[m:m+\omega-1]$ is the concatenation of embeddings of $\omega$ continuous words for the $j$-th channel, $C_j$ is the $j$-th filter of each window, $\odot$ is element-wise multiplication, and $b$ is a bias term.

After the feature $f_m$ is generated, the feature map is concatenated by:

$$f = [f_1, f_2, \ldots, f_m, \ldots, f_N] \tag{2}$$

Note that the size of the feature map is equivalent to the sentence length $N$.

Piecewise max pooling

Based on the PCNN,18 outputs from the convolutional layer with the specific filter are divided into 3 parts $\{f_{0:k}, f_{k:v}, f_{v:N}\}$ for max pooling according to the indices for $e_1$ ($k$) and $e_2$ ($v$). The max pooling is applied over each part of the feature map:

$$\hat{f} = \left[\max_{0 \le i < k}(f_i),\ \max_{k \le i < v}(f_i),\ \max_{v \le i < N}(f_i)\right] \tag{3}$$
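A minimal sketch of the piecewise max pooling in Equation 3, for a single filter and a 1-dimensional feature map. The function name is hypothetical, and the sketch assumes $0 < k < v < N$ so that no piece is empty.

```python
def piecewise_max_pool(feature_map, k, v):
    """Split a 1-D feature map at entity indices k < v and max-pool each piece.

    The pieces are [0, k), [k, v), and [v, N), following Equation 3.
    """
    if not 0 < k < v < len(feature_map):
        raise ValueError("need 0 < k < v < N so that every piece is non-empty")
    return [max(feature_map[:k]), max(feature_map[k:v]), max(feature_map[v:])]


# Toy feature map of length N = 6 with entities at positions k = 2 and v = 4.
pooled = piecewise_max_pool([0.1, 0.7, 0.3, 0.9, 0.2, 0.4], k=2, v=4)
```

Each filter therefore contributes 3 values instead of 1, which preserves coarse positional information relative to the 2 entities.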

Filter attention

After obtaining the feature vectors for filters with different window sizes $\omega$, most previous studies simply concatenate these representations. In contrast, we applied a filter attention layer after the pooling layer to weight these representations according to the contributions of the different window sizes. Denoting the feature vectors learned from the 3 filter sizes as $\{v_{\omega=2}, v_{\omega=3}, v_{\omega=4}\}$, the sentence representation $s$ can be defined as:

$$s = (\alpha_1 v_{\omega=2}) \oplus (\alpha_2 v_{\omega=3}) \oplus (\alpha_3 v_{\omega=4}) \tag{4}$$

where $v_{\omega=2}, v_{\omega=3}, v_{\omega=4}$ are sentence encoding vectors from filters of different window sizes, $\oplus$ is the concatenation operation, $\alpha_1, \alpha_2, \alpha_3$ are filter attention weights, and $\alpha_i$ is calculated by the following formula:

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}, \quad \text{where } u_i = v_i \cdot w \tag{5}$$

where $w$ is a randomly initialized parameter vector with the same dimension as $v_i$, and $u_i$ is the weight score assigned to each filter size. Attention values $\{\alpha_i\}$ are calculated by taking the softmax over $\{u_i\}$.
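The filter attention of Equations 4 and 5 can be sketched in plain Python. The vectors and the parameter vector `w` below are toy values chosen for illustration, and the score is read as a dot product between each filter vector and `w`.

```python
import math


def filter_attention(vectors, w):
    """Weight each filter-size vector by softmax(v_i . w), then concatenate."""
    scores = [sum(a * b for a, b in zip(v, w)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Concatenate the re-weighted vectors into one sentence representation.
    sentence = [a * x for a, v in zip(alphas, vectors) for x in v]
    return alphas, sentence


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors for window sizes 2, 3, 4
w = [0.5, 0.5]                                   # toy parameter vector
alphas, s = filter_attention(vectors, w)
```

In a trained model `w` would be learned jointly with the filters; here it only demonstrates that the window size with the highest score receives the largest weight.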

Training

In the training phase, sentences are packed into bags according to entity pairs for multi-instance learning. In Figure 1, the bag (type_2_diabetes, insulin) contains sentence representations $s_1, s_2, s_3, \ldots, s_{l-1}, s_l$, where $l$ is the number of sentences in the bag. A sentence-level attention $\beta$ is applied to these sentence representations to score how well each sentence matches the relation, and the attention score $\beta_i$ for the $i$-th sentence is defined as:

$$\beta_i = \frac{\exp(s_i r_i)}{\sum_{j} \exp(s_i r_j)} \tag{6}$$

where $r_i$ and $r_j$ are the trainable relation vectors associated with relations $r_i$ and $r_j$. The relation vectors are randomly initialized and are fixed after training to represent a particular relation during testing. The bag-level representation is obtained by $B = \sum_{i=0}^{l} \beta_i s_i$ and fed into a softmax layer to get the probability of each relation.

Reinforcement learning

To address the problem that instances with same entity pair and different labels are indistinguishable, we proposed a RL method to pick out the possibly negative instances from positive bags to build contrastive negative bags. It can increase the diversity of negative bags and emphasize the difference between positive and negative bags with the same entity pair. Then, the redistributed data is fed into the pretrained relation classifier PACNN. We used the performance of PACNN as the result-driven reward for a series of actions decided by the RL agent.

We first pretrained an instance-level classifier with a CNN representation model using bag labels, which can be considered as an agent A. As mentioned before, the agent is equipped with a memory backtracking mechanism; it selects FP instances in each positive bag and adds them to an additional negative bag. At each step $t$, A is at state $s_t = (x_t, y_{1:t-1})$, which is a weighted sum of the current sentence and the historical sentences for which A has already taken an action, and the weight here is calculated by the memory backtracking mechanism. The agent takes an action $y_t \in \{0, 1\}$ that decides whether sentence $x_t$ in a positive bag is FP or not. After A takes the terminate action for a positive bag $B_i$, we obtain a negative bag $B_i^-$, which consists of the chosen instances. A will receive a final reward R-1 from the PACNN model when all the selections are made. The details of the joint training process are described in Algorithm 1.

Algorithm 1.

Reinforcement Learning Algorithm

Initialize the policy network as $\Theta' = \Theta$, PACNN model as $\Phi' = \Phi$, training data $S = \{B_1, B_2, \ldots, B_j, \ldots, B_{|S|}\}$
for epoch $m = 1$ to $M$ do
  for batch $n = 1$ to $N$ do
    Batch data $B_n = \{B_{n1}, B_{n2}, \ldots, B_{nj}, \ldots, B_{n|B_n|}\}$
    Batch label $L_n = \{r_{n1}, r_{n2}, \ldots, r_{nj}, \ldots, r_{n|B_n|}\}$
    for $B_{nj} \in B_n$ do
      if $r_{nj} = 1$ ($B_{nj}$ is a positive bag) then
        Sample instance selection actions for each instance in $B_{nj}$ with $\Theta'$:
        $A_{nj} = \{a_1, a_2, \ldots, a_i, \ldots, a_{|B_{nj}|}\}$, $a_i \sim \pi_{\Theta'}(s_i, a_i)$
        for $a_i \in A_{nj}$ do
          if $a_i = 0$ ($i$-th instance in $B_{nj}$ is false positive) then
            Add $i$-th instance into new negative bag $B'_{nj}$
          end if
          Add $B'_{nj}$, $r'_{nj} = 0$ into $B_n$, $L_n$, respectively
        end for
      end if
    end for
    Compute delayed reward $R(B_n)$ from PACNN model loss
    Update the parameter $\Theta$ of policy network
  end for
  Update the parameter $\Phi$ in the PACNN model with fixed policy network $\Theta$
end for
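The bag-redistribution step of Algorithm 1 can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `policy` is a hypothetical callable returning the probability that an instance is a false positive, selected instances are copied (the positive bag is left intact), each new negative bag is appended once per positive bag, and empty negative bags are skipped.

```python
import random


def build_contrastive_bags(bags, labels, policy, rng):
    """Copy instances sampled as false positive (action 0) into new negative bags."""
    new_bags, new_labels = list(bags), list(labels)
    for bag, label in zip(bags, labels):
        if label != 1:  # only positive bags are inspected
            continue
        # Sample one action per instance: 0 = false positive, 1 = keep as positive.
        actions = [0 if rng.random() < policy(x) else 1 for x in bag]
        negative_bag = [x for x, a in zip(bag, actions) if a == 0]
        if negative_bag:  # assumption: skip bags with no selected instance
            new_bags.append(negative_bag)
            new_labels.append(0)
    return new_bags, new_labels


# Toy data and a deterministic toy policy, for illustration only.
bags = [["a treats b", "a does not treat b"], ["c and d"]]
labels = [1, 0]
toy_policy = lambda sentence: 1.0 if "not" in sentence else 0.0
out_bags, out_labels = build_contrastive_bags(bags, labels, toy_policy, random.Random(0))
```

The augmented batch would then be passed to the relation classifier, whose loss supplies the delayed reward for the sampled actions.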

At the same time, it is considered that historical events that happened closer in time could have greater impact on the current decision. We designed $w_{dis}$ to indicate the influence degree of different time steps on the current state. The detailed operation of $w_{dis}$ is as follows:

$$w_{dis}(t,i) = \frac{e^{-dis(t,i)}}{\sum_{i=1}^{t-1} e^{-dis(t,i)}} \tag{8}$$

where $dis = t - i$. Next, we use $w_{mb}$ to normalize the effects of $w_{sim}$ and $w_{dis}$, and the weight $w_{mb}$ is calculated as:

$$w_{mb}(t,i) = \frac{\lambda_i \left(w_{sim}(t,i) + w_{dis}(t,i)\right)}{\sum_{i=1}^{t-1} \lambda_i \left(w_{sim}(t,i) + w_{dis}(t,i)\right)} \tag{9}$$

$$\lambda_i = \begin{cases} \mu, & y_i = 1 \\ 1 - \mu, & y_i = 0 \end{cases} \tag{10}$$

Here, the impact factor $\lambda_i$ is introduced to give coefficients on negative ($y_i = 1$) and positive ($y_i = 0$) instances classified by the agent. In the experiment, $\mu$ is set as 0.4 by hyperparameter grid search. This can reduce the model's sensitivity and therefore increase the precision of choosing FP instances. Finally, the current state representation $s_t$ can be calculated as follows:

$$s_t = v_t + \sum_{i=1}^{t-1} w_{mb}(t,i)\, v_i \tag{11}$$
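The memory backtracking weights can be sketched as follows. Two assumptions are made: the exponent in the distance weight is taken as negative, so that steps closer in time receive larger weights as the text describes, and the similarity weights $w_{sim}$ are supplied by the caller because their definition is not given in this section. All function names are hypothetical.

```python
import math


def distance_weights(t):
    """w_dis(t, i) for i = 1..t-1: softmax over -(t - i), so recent steps weigh more."""
    exps = [math.exp(-(t - i)) for i in range(1, t)]
    total = sum(exps)
    return [e / total for e in exps]


def backtracking_weights(w_sim, w_dis, actions, mu=0.4):
    """Combine similarity and distance weights, scaled by the impact factor lambda_i."""
    lam = [mu if y == 1 else 1.0 - mu for y in actions]
    raw = [l * (s + d) for l, s, d in zip(lam, w_sim, w_dis)]
    total = sum(raw)
    return [r / total for r in raw]


def state_vector(v_t, history, w_mb):
    """s_t = v_t plus the weighted sum of the historical sentence vectors."""
    state = list(v_t)
    for weight, v_i in zip(w_mb, history):
        for d, x in enumerate(v_i):
            state[d] += weight * x
    return state


w_dis = distance_weights(4)                 # three historical steps
w_sim = [1 / 3, 1 / 3, 1 / 3]               # placeholder similarity weights
w_mb = backtracking_weights(w_sim, w_dis, actions=[1, 0, 1])
s_t = state_vector([1.0, 0.0], [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]], w_mb)
```

With $\mu = 0.4$, historical steps labeled $y_i = 0$ are scaled by 0.6 and those labeled $y_i = 1$ by 0.4 before normalization.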

We used the REINFORCE algorithm to train our agent. It is a type of policy gradient method proposed by Williams et al,36 and it maximizes the performance of the agent by updating its policy parameters. The policy is defined as follows:

$$\pi(a \mid s_t, \Theta) = P(y_t = a \mid s_t, \Theta) = P(y_t = a \mid x_t, y_{1:t-1}, \Theta) \tag{12}$$

where $\Theta$ represents the parameters in the agent model, and $a$ is a label indicating whether the current instance is false positive or not. We assume that the model has a delayed reward for a set of bags $S = \{B_1, B_2, \ldots, B_j, \ldots, B_{|S|}\}$. The reward can be formulated as follows:

$$R(S) = \frac{1}{|S|} \sum_{j=1}^{|S|} \log p(r_j \mid B_j) \tag{13}$$

where $|S|$ is the length of set $S$, $r_j$ is the relation label of bag $B_j$, and $p(r_j \mid B_j)$ is the probability given by the PACNN. Note that the relation classifier is at the bag level because it computes $p(r_j \mid B_j)$ for each bag. For a set of bags, we aim to maximize the total reward. Thus, the objective function can be defined as follows:

$$J(\Theta) = \mathbb{E}_{s_0, a_0, s_1, a_1, \ldots, s_t, a_t, \ldots}[R(S)] \tag{14}$$

Based on the policy gradient theorem,37 the parameters $\Theta$ are updated as follows:

$$\Theta \leftarrow \Theta + \lambda \sum_{t=1}^{N} v \nabla_{\Theta} \log \pi_{\Theta}(s_t, a_t) \tag{15}$$

where $N = \sum_{j=1}^{|S|} |B_j|$, for $B_j \in S$. $v$ is the value function, and for each bag $v = R(S)$.
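A minimal sketch of the reward and the REINFORCE update, assuming a logistic policy $\pi(a = 1 \mid s) = \sigma(\Theta \cdot s)$ as a stand-in for the agent network. That policy form, the learning rate, and the function names are illustrative assumptions rather than the authors' architecture.

```python
import math


def reward(bag_probs):
    """Mean log-probability the classifier assigns to each bag's label."""
    return sum(math.log(p) for p in bag_probs) / len(bag_probs)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_step(theta, states, actions, r, lr=0.1):
    """One policy-gradient update for a logistic policy.

    For pi(a=1|s) = sigmoid(theta . s), the gradient of log pi(a|s)
    with respect to theta is (a - sigmoid(theta . s)) * s.
    """
    grad = [0.0] * len(theta)
    for s, a in zip(states, actions):
        p1 = sigmoid(sum(t * x for t, x in zip(theta, s)))
        for d, x in enumerate(s):
            grad[d] += r * (a - p1) * x
    return [t + lr * g for t, g in zip(theta, grad)]


r = reward([0.9, 0.8, 0.7])  # toy bag-level probabilities from the classifier
theta = reinforce_step([0.0, 0.0], states=[[1.0, 0.0]], actions=[1], r=r)
```

Because the reward is a mean log-probability, it is never positive; practical implementations often subtract a baseline to reduce variance, which is not shown here.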

RESULTS

Performance comparison

To evaluate our approach, we compared the proposed PACNN+RL model with several baseline models in Table 2, including feature-based and neural network–based methods (see Supplementary Appendix F for details of baselines). For each task, we calculated the precision, recall, and F1-score. Our PACNN+RL model outperformed all other single models with the highest F1-score on the may-prevent dataset, may-treat dataset, and DDI corpus, 2011. For the PPI RE task, PACNN+RL achieved the highest F1-score among all single models on the AIMed, BioInfer, HPRD50, and IEPA datasets, whereas Thomas et al31 achieved the highest F1-score on the LLL dataset. The performance on different PPI datasets varies significantly, and all methods achieved relatively low performance on the AIMed dataset. This suggests that it is difficult to accurately extract PPI relations on the AIMed corpus, although our method outperformed the other single methods. We also noticed that the feature-based methods outperformed neural network–based methods on the LLL test set. However, our model achieved higher F1-scores than all neural network–based methods. As shown in Table 2, PACNN+RL achieved the highest precision on the may-treat dataset, DDI corpus, 2011, AIMed dataset, and HPRD50 dataset, which is a significant improvement compared with other models. For the remaining test sets, feature-based methods achieved the highest precision; these methods utilized manually selected and encoded lexical, syntactic, and semantic information as well as some rules. However, such methods suffer from intrinsic limitations, such as coverage or scalability among others, and PACNN+RL is superior to them in terms of recall and model robustness. Furthermore, we had initially hoped that BioBERT’s pretraining capability would improve the model, and we fused our model with it. However, the experimental results of the fusion model PACNN+RL (+BioBERT v1.0)* are mixed. This may be due to the noise in DS data, and fine-tuning BioBERT on such data will not lead to a large improvement. In addition, we compared the space and time complexity of our PACNN+RL model with other models in Supplementary Appendix B.
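As a quick consistency check on the reported metrics, the F1-score can be recomputed from precision and recall as their harmonic mean. The sketch below uses 3 of the PACNN+RL rows reported in Table 2; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for PACNN+RL, taken from Table 2.
pacnn_rl = {
    "may-prevent": (0.4155, 0.8550, 0.5592),
    "DDI corpus, 2011": (0.7320, 0.2601, 0.3838),
    "AIMed": (0.5531, 0.3754, 0.4472),
}
recomputed = {name: round(f1_score(p, r), 4) for name, (p, r, _) in pacnn_rl.items()}
```

The recomputed values agree with the reported F1-scores to 4 decimal places.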

Table 2.

Performance on the may-prevent, may-treat, DDI corpus, 2011, and PPI test sets in comparison with the state-of-the-art methods

Columns: Method, followed by Precision, Recall, and F1 score for each of the following test sets, in order: may-prevent; may-treat; DDI corpus, 2011; and the protein-protein interaction corpora AIMed, BioInfer, HPRD50, IEPA, and LLL. Rows with fewer values report results for only a subset of these test sets.
Roller and Stevenson23 0.5833 a 0.3559 0.4421 0.4832 0.5143 0.4983
Banuqitah et al38 0.54 0.72 0.62
Bobić et al14 0.325 0.437 0.373
Tikk et al39 0.283 0.866 a 0.426 0.628 a 0.365 0.462 0.569 0.687 0.622 0.710 a 0.525 0.604 0.790 a 0.573 0.664
Thomas et al31 0.330 0.441 0.377 0.256 0.784 0.386 0.404 0.667 0.503 0.457 0.851 0.594 0.500 0.872 a 0.635 0.564 0.831 0.672 a
PCNN+ONE18 0.4207 0.5000 0.4569 0.4980 0.7485 0.5981 0.2765 0.5640 0.3711 0.2217 0.5638 0.3183 0.4493 0.2181 0.2937 0.4951 0.4857 0.4904 0.6664 0.2424 0.3555 0.7138 0.0935 0.1653
CNN+ATT19 0.4236 0.6231 0.5043 0.4875 0.8011 0.6061 0.3056 0.4093 0.3499 0.2036 0.6020 0.3043 0.3854 0.2371 0.2936 0.4672 0.5428 0.5022 0.5262 0.3030 0.3846 0.6953 0.1495 0.2461
PCNN+ATT19 0.3962 0.6987 0.5057 0.4963 0.7777 0.6059 0.3184 0.3576 0.3369 0.2077 0.6301 0.3124 0.4177 0.2215 0.2895 0.4770 0.4952 0.4859 0.5831 0.2121 0.3111 0.6246 0.0934 0.1625
CNN+RL21 0.4232 0.5689 0.4854 0.5580 0.7310 0.6329 0.3394 0.3329 0.3361 0.1493 0.8087 0.2521 0.2869 0.7226 a 0.4107 0.3574 0.9428 a 0.5183 0.3978 0.5605 0.4653 0.4124 0.9906 a 0.5824
PCNN+ATT RA+BAG ATT40 0.4318 0.1418 0.2135 0.4046 0.3099 0.3510 0.1426 0.3058 0.1945 0.1868 0.4805 0.2690 0.2449 0.3829 0.2987 0.4650 0.5840 0.5177 0.4035 0.5324 0.4591 0.4818 0.4490 0.4648
BioBERT v1.0 (+ PMC)41 0.3482 0.9058 0.5030 0.4669 0.9826 a 0.6330 0.1778 0.6980 0.2834 0.2481 0.7660 0.3748 0.3952 0.6062 0.4784 0.4798 0.5828 0.5263 0.4531 0.5045 0.4774 0.6122 0.5488 0.5788
BioBERT v1.1 (+ PubMed)41 0.3547 0.9638 0.5185 0.4682 0.9419 0.6255 0.1839 0.6993 0.2912 0.2671 0.7900 0.3992 0.3879 0.6255 0.4789 0.5289 0.7301 0.6134 0.4818 0.5910 0.5308 0.5940 0.4817 0.5320
PACNN+RL (Ours) 0.4155 0.8550 0.5592 0.6666 a 0.6666 0.6666 0.7320 a 0.2601 0.3838 a 0.5531 a 0.3754 0.4472 0.5925 0.4648 0.5209 a 0.6709 a 0.6380 0.6540 a 0.6677 0.6435 0.6554 a 0.7531 0.6049 0.6709
PACNN+RL (+ BioBERT v1.0)∗ 0.3964 0.9728 a 0.5633 a 0.5113 0.9766 0.6712 a 0.2213 0.7372 a 0.3404 0.3497 0.6304 0.4498 a 0.4400 0.5320 0.4816 0.5155 0.8160 0.6318 0.5368 0.6616 0.5927 0.6466 0.5309 0.5831

CNN: convolutional neural network; DDI: drug-drug interaction; PACNN+RL: piecewise attentive convolutional neural network and reinforcement learning; PCNN: piecewise convolutional neural network; PPI: protein-protein interaction; RL: reinforcement learning.

a For each metric, the bolded value indicates the best performing classifier.

Ablation experiment

To analyze the contribution of each part of our model, we cumulatively removed components and evaluated the total performance on 3 test sets in Table 3. The w/o multichannel (random) method utilized randomly initialized word embedding. The w/o multichannel (Wikipedia and PubMed) method utilized 1-channel pretrained word embedding, and its F1-score is 0.0874 higher on average than the w/o multichannel (random) model. Compared with the 1-channel model, the PACNN+RL model improved the overall F1-scores by 0.0708 on average. The filter attention mechanism is removed from the w/o filter attention method and a simple combination of the feature vectors through different filters is used, resulting in a sharp drop on performance. The preprocessing step can reduce the potentially misleading examples and improve the F1-scores by 0.0584 on average. Another aspect to note is that w/o RL behaves worse than the proposed model, which proves the effectiveness of the RL agent. In conclusion, the multichannel architecture can improve the performance of PACNN+RL by a large margin. In addition, model performance can be further improved by the filter attention, preprocessing, and RL.

Table 3.

Performances of model with and without different components

| Method | may-prevent Precision | may-prevent Recall | may-prevent F1 score | may-treat Precision | may-treat Recall | may-treat F1 score | DDI corpus, 2011 Precision | DDI corpus, 2011 Recall | DDI corpus, 2011 F1 score |
|---|---|---|---|---|---|---|---|---|---|
| Without multichannel (random) | 0.3571 | 0.3623 | 0.3597 | 0.4960 | 0.7310 | 0.5910 | 0.3534 | 0.1245 | 0.1841 |
| Without multichannel (Wikipedia and PubMed) | 0.3991 | 0.6304 | 0.4888 | 0.4945 | 0.7953 | 0.6099 | 0.2241 | 0.4464 | 0.2984 |
| Without filter attention | 0.4055 | 0.6376 | 0.4957 | 0.4937 | 0.9122 | 0.6407 | 0.2552 | 0.6371 a | 0.3644 |
| Without preprocessing | 0.4142 | 0.7985 | 0.5455 | 0.4842 | 0.9768 a | 0.6475 | 0.1522 | 0.5828 | 0.2414 |
| Without RL | 0.4182 a | 0.7970 | 0.5486 | 0.4954 | 0.9473 | 0.6506 | 0.3434 | 0.4053 | 0.3718 |
| PACNN+RL (Ours) | 0.4155 | 0.8550 a | 0.5592 a | 0.6666 a | 0.6666 | 0.6666 a | 0.7320 a | 0.2601 | 0.3838 a |

DDI: drug-drug interaction; PACNN+RL: piecewise attentive convolutional neural network and reinforcement learning; RL: reinforcement learning.

aFor each metric, the bolded value indicates the best performing classifier.
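The average F1 differences quoted in the ablation analysis can be reproduced from the F1 columns of Table 3. The sketch below only re-does that arithmetic; the dictionary keys are shorthand labels for the table rows.

```python
# F1 scores on may-prevent, may-treat, and DDI corpus, 2011, taken from Table 3.
f1 = {
    "random": [0.3597, 0.5910, 0.1841],
    "one_channel": [0.4888, 0.6099, 0.2984],
    "no_preprocessing": [0.5455, 0.6475, 0.2414],
    "pacnn_rl": [0.5592, 0.6666, 0.3838],
}


def mean(values):
    return sum(values) / len(values)


def average_gain(better, worse):
    """Difference in mean F1 across the 3 test sets, rounded to 4 decimals."""
    return round(mean(f1[better]) - mean(f1[worse]), 4)


gain_pretrained = average_gain("one_channel", "random")
gain_multichannel = average_gain("pacnn_rl", "one_channel")
gain_preprocessing = average_gain("pacnn_rl", "no_preprocessing")
```

These evaluate to the 0.0874, 0.0708, and 0.0584 average improvements stated in the text.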

The impact of the RL denoising module

To illustrate the effectiveness of the RL denoising module, we listed 1 newly generated negative bag that comprises single/multiple sentence(s) for each task in Table 4. We can infer from Table 4 that the entity pair DRUG(“vorapaxar”)-DISEASE(“stroke”) is found to have relationship may-prevent in UMLS. Therefore, the right 2 sentences are labeled as positive instances. However, neither of these sentences expressed the may-prevent relation. In addition, patients with a history of DISEASE(“stroke”) should not take the DRUG(“vorapaxar”) under certain circumstances. It clearly indicates that our model can select FP instances from positive bags, which further improves the accuracy of a relation classifier. These cases intuitively illustrate the ability of our model to denoise distantly generated labels.

Table 4.

Examples of contrastive negative bags generated by RL agent; the target entities are in italic

| Relation | False positive sentences |
|---|---|
| may-prevent | (i) In this study, vorapaxar was discontinued in patients with a history of stroke due to excessive risk for intracranial hemorrhage after 2 years of therapy. (ii) Vorapaxar should not be used in patients with a history of stroke, transient ischemic attack, intracranial hemorrhage, or active pathological bleeding. |
| may-treat | (i) L-5418 may prove useful for grand mal epilepsy as it is less toxic than diphenylhydantoin and carbamazepine. |
| DDI | (i) One group of 33 patients was treated with 150 mg amitriptyline a day (the AMI group); 25 other patients received a daily dose of thioridazine, either 200 mg (200-THD group; n = 7) or 400 mg (400-THD group; n = 18). |
| PPI | (i) High articular levels of the angiogenetic factors VEGF and VEGF-receptor 2 as tissue healing biomarkers after single bundle anterior cruciate ligament reconstruction. (ii) Immunostaining was used to monitor VEGF treatment by examining VEGF and VEGF-receptor 2 expression. |

DDI: drug-drug interaction; PPI: protein-protein interaction; RL: reinforcement learning.

Then, we analyzed the outputs of RL from different aspects in Figure 2 and Supplementary Appendix C. Specifically, the line charts in Figure 2 display the size distribution of the newly built FP bags for different relations. The x-axis represents the number of instances per bag and the y-axis represents the number of corresponding bags. As can be seen from the line charts, the number of bags decreases as the bag size increases.

Figure 2.

The predictive results of reinforcement learning with different relations. (A-D) Each panel represents different relations. The line chart and the pie chart describe 2 aspects of the results, respectively. DDI: drug-drug interaction; PPI: protein-protein interaction.

Finally, we manually checked 400 instances that were selected as FP instances by the RL agent in a randomly sampled dataset. For each instance, the agent makes a correct decision if the sentence is manually labeled as a negative instance and our RL agent selects it as an FP instance. Otherwise, we judged that the RL agent makes a wrong decision. Specifically, for each relation, we sampled 100 sentences from the newly built negative training data. The pie charts in Figure 2 depict that the accuracy scores on may-prevent, may-treat, DDI, and PPI are 71%, 76%, 76%, and 68%, respectively. Our RL agent chooses these 400 sentences as the FP instances, among which 291 sentences are correctly selected (not describing the relation), and 109 of them are wrong. To summarize, the accuracy of our RL agent is 72.8%, which demonstrates the effectiveness of our RL agent.
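The overall accuracy follows directly from the per-relation figures, as this short check shows. The variable names are illustrative, and the counts are those stated above (100 sampled sentences per relation).

```python
# Correctly selected FP instances per relation, out of 100 sampled sentences each.
correct = {"may-prevent": 71, "may-treat": 76, "DDI": 76, "PPI": 68}
sampled_per_relation = 100

total_correct = sum(correct.values())
total_sampled = sampled_per_relation * len(correct)
total_wrong = total_sampled - total_correct
accuracy = total_correct / total_sampled
```

This gives 291 correct and 109 wrong decisions out of 400, an accuracy of 0.7275, which the text reports as 72.8%.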

DISCUSSION

In this study, we investigated the proposed PACNN+RL method to extract may-prevent, may-treat, DDI, and PPI relations from distantly supervised datasets. By combining RL module with relation classifier, our system achieved the best performance on 7 biomedical benchmark datasets, with the goal to alleviate the erroneous data issue.

Performance analysis

The overall performance comparison in Table 2 shows that our method is superior to the baselines on almost all evaluation datasets. The LLL corpus is the only one that our method does not perform best, but we also obtain a competitive result with Thomas et al.31 From Table 1, we found that the LLL test set only contains 330 instances, which is far less than the other PPI test sets. The limited size of the LLL corpus will likely lead to the significant score variance among different methods. Also, Thomas et al31 employed rich handcrafted feature vectors in their RE system, such as lexical and dependency parsing features. Following the shortest dependency path hypothesis, they created the respective dependency parse tree by using the syntactical and dependency information of edges and vertices. However, the dependency parsing features are not included in our system. These may be the major reasons for the worse performance on this dataset. It should be noted that DDI in DrugBank includes both drug synergy and DDI. However, the DDI corpus includes only DDI information from DrugBank. Ideally, it is better to remove drug synergy from DDI in DrugBank because this part of the data may interfere with the model prediction. However, for a fair comparison, we followed the benchmark data31 and did not remove this part of the data. As shown in Table 2, BioBERT achieved relatively high F1-scores on the may-prevent and may-treat datasets, but it does not perform well on the DDI corpus, 2011, test set. This may be due to the fact that the DDI distantly labeled set contains more noisy data, and BioBERT is a supervised RE model that lacks the denoising module for noise reduction. It further demonstrates that the distantly supervised RE task is challenging, and simply applying a supervised RE model to weakly supervised datasets will get unsatisfactory performance.

Table 3 demonstrates the effectiveness of multichannel word embeddings, which we attribute to the following advantages: (1) looking up the same word in different resources reduces the number of unknown words; (2) sharing among different embeddings introduces external information; and (3) relations between biomedical mentions are difficult to mine with general word embeddings alone, whereas biomedical word embeddings are well suited to the task. To validate the contribution of filter attention, we show a visualization example in Figure 3. In this case, a window of size 4 contains both the head and tail entities, and the PACNN model can directly extract information from the phrase “ritonavir induction of methadone.” When the window size is 2 or 3, only the information of the head entity can be obtained. Therefore, the attention weight is largest when the window size is 4. Finally, regarding the contribution of each component, we conclude that (1) rich semantic information improves performance, (2) filter attention is of great importance for encoding semantic information, and (3) the RL denoising module affects the overall performance of the distantly supervised model.
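Advantage (1) of the multichannel embeddings can be made concrete with a small sketch. It is not the authors' implementation: the function names, the toy vocabularies, and the 2-dimensional vectors are all hypothetical, and the only idea taken from the text is that a word counts as unknown only when every embedding resource lacks it.

```python
from typing import Dict, List, Optional

Embedding = Dict[str, List[float]]


def lookup_channels(word: str, channels: List[Embedding]) -> List[Optional[List[float]]]:
    """Return one vector per channel, or None where the channel lacks the word."""
    return [channel.get(word) for channel in channels]


def count_unknown(words: List[str], channels: List[Embedding]) -> int:
    """Count words that are missing from every channel."""
    return sum(
        1 for w in words if all(v is None for v in lookup_channels(w, channels))
    )


# Toy vocabularies invented for illustration (not real embedding resources).
general = {"induction": [0.1, 0.2], "of": [0.0, 0.3]}
biomedical = {"ritonavir": [0.5, 0.1], "methadone": [0.4, 0.6], "induction": [0.2, 0.2]}
sentence = ["ritonavir", "induction", "of", "methadone"]

unknown_general_only = count_unknown(sentence, [general])
unknown_multichannel = count_unknown(sentence, [general, biomedical])
print(unknown_general_only, unknown_multichannel)
```

In this toy case the drug names are missing from the general vocabulary but are recovered once the biomedical channel is added.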

Figure 3.

Example of filter attention weights according to different window sizes.
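The behavior visualized in Figure 3 can be sketched as a normalization over window sizes. The block below is an illustration under stated assumptions, not the PACNN formulation: it assumes that each window size yields one scalar relevance score and that attention weights are a softmax over those scores. The scores themselves are hypothetical and are chosen so that window size 4, which covers both entities in the example, scores highest.

```python
import math
from typing import Dict


def softmax_over_windows(scores: Dict[int, float]) -> Dict[int, float]:
    """Normalize per-window-size scores into attention weights that sum to 1."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical relevance scores keyed by window size.
scores = {2: 0.4, 3: 0.6, 4: 1.8}
weights = softmax_over_windows(scores)
best_window = max(weights, key=weights.get)

for window, weight in sorted(weights.items()):
    print(f"window={window}: weight={weight:.3f}")
```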

Performance on supervised RE scenarios

The previous section verified the proposed model on the DS corpora. However, many RE models are applied in supervised scenarios. To further verify the proposed model, we conducted an experiment on the supervised RE dataset EU-ADR,42 which contains gene-disease relations. Because the supervised corpus does not contain noise, we removed the RL module to evaluate the relation classifier PACNN alone. As shown in Supplementary Appendix D, BioBERT v1.0 (+PMC)41 achieved the highest F1-score among all single models. Its F1-score on the EU-ADR corpus is 0.0159 higher than that of our PACNN model, an advantage that stems from pretraining on large-scale biomedical corpora. To take advantage of BioBERT’s pretraining capability, we fused our PACNN model with BioBERT v1.0. The fusion model PACNN (+BioBERT v1.0)* outperforms all single models with the highest F1-score, an increase of 0.0197 over PACNN and 0.0038 over BioBERT v1.0 (+PMC). These results support the effectiveness of the PACNN model in RE and of the BioBERT model in understanding complex biomedical texts.
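The 3 reported differences are mutually consistent, as the short check below shows. The symbols are introduced here for illustration only: F_fusion, F_BioBERT, and F_PACNN denote the EU-ADR F1-scores of the fusion model, BioBERT v1.0 (+PMC), and PACNN, respectively.

```latex
\begin{align*}
F_{\text{fusion}} - F_{\text{PACNN}} &= 0.0197, \\
F_{\text{fusion}} - F_{\text{BioBERT}} &= 0.0038, \\
F_{\text{BioBERT}} - F_{\text{PACNN}}
  &= \left(F_{\text{fusion}} - F_{\text{PACNN}}\right)
   - \left(F_{\text{fusion}} - F_{\text{BioBERT}}\right) \\
  &= 0.0197 - 0.0038 = 0.0159.
\end{align*}
```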

UMLS vs Semantic MEDLINE Database

The construction of distantly labeled data is based on a biomedical KB, whose quality could affect the final predictions. To study this effect, we compared UMLS, which we used for training, with the Semantic MEDLINE Database (SemMedDB),43 which contains approximately 94.0 million semantic predications extracted by SemRep.44 Sentences with a specific semantic predication (subject-PREVENTS-object or subject-TREATS-object) in SemMedDB are recorded as positive instances. The results of PACNN+RL on each KB are shown in Supplementary Appendix E. The proposed model achieved better performance with SemMedDB than with UMLS on both the may-prevent and may-treat test sets. This improvement may stem from the high quality of SemMedDB, which utilizes the NLP ability of SemRep.43 SemMedDB also provides links between the biomedical literature and structured semantic predications.
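The selection of positive instances from SemMedDB can be sketched as a filter on the predicate field. The record layout, the class and function names, and the example sentences below are hypothetical and do not reflect the actual SemMedDB schema. Only the predicate names PREVENTS and TREATS and their mapping to the may-prevent and may-treat relations come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Predication:
    """Hypothetical flattened view of one semantic predication."""
    sentence: str
    subject: str
    predicate: str
    obj: str


# Predicate names are taken from the text; the labels mirror the test sets.
PREDICATE_TO_RELATION: Dict[str, str] = {
    "PREVENTS": "may-prevent",
    "TREATS": "may-treat",
}


def positive_instances(predications: List[Predication], relation: str) -> List[Predication]:
    """Keep the predications whose predicate maps to the requested relation."""
    return [p for p in predications if PREDICATE_TO_RELATION.get(p.predicate) == relation]


# Invented records for illustration.
records = [
    Predication("Drug A treats disease B.", "Drug A", "TREATS", "disease B"),
    Predication("Vaccine C prevents infection D.", "Vaccine C", "PREVENTS", "infection D"),
    Predication("Gene E interacts with gene F.", "Gene E", "INTERACTS_WITH", "gene F"),
]

may_treat = positive_instances(records, "may-treat")
may_prevent = positive_instances(records, "may-prevent")
print(len(may_treat), len(may_prevent))
```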

Limitations and future work

Because MetaMap is a rule-based approach that relies on a manually curated dictionary, it does not offer the more accurate support that deep learning models provide.45 In the future, it can be replaced by deep learning–based biomedical NLP tools, such as ScispaCy46 and LATTE.47 Another limitation of this work is that the DDI corpus has 2 versions, 2011 and 2013, and we used only the 2011 version for testing. We used the earlier version because the DDI corpus, 2013, defines 4 different types of DDI relationships (mechanism, effect, advice, int), and no KB currently covers these types, which makes the construction of a corresponding distantly supervised set a nontrivial problem. In addition, some applications focus on detecting relations among multiple entities. For example, in protein phosphorylation RE, 3 entities (a substrate, a kinase, and a site) are involved. Our method currently focuses on RE with 2 entities, and extending it to relations among multiple entities will also be a focus of future work.

CONCLUSION

In this study, we propose the PACNN+RL method, which consists of a PACNN for encoding the semantic information of biomedical text and an RL method with a memory backtracking mechanism for denoising the distantly labeled data. Our approach outperforms the baselines on 7 benchmark datasets covering may-treat, may-prevent, DDI, and PPI relations. Further experiments and analysis indicate the reasons for the effectiveness of PACNN+RL and show that the RL agent helps address the noisy data challenge. In conclusion, PACNN+RL is a versatile tool for automatic biomedical RE via biomedical literature mining techniques.

FUNDING

This work is supported by the fund of the joint project with Beijing Baidu Netcom Science Technology, the National Natural Science Foundation of China (Grant Nos. 61872113, 61876052, and 62006061), the Special Foundation for Technology Research Program of Guangdong Province (Grant No. 2015B010131010), the Strategic Emerging Industry Development Special Funds of Shenzhen (Grant No. JCYJ20180306172232154), and the CCF-Baidu Open Fund (Grant No. CCF-BAIDUOF2020004).

AUTHOR CONTRIBUTIONS

TZ proposed methods, designed and carried out the experiments, and drafted the manuscript. BH and QC supervised the research and participated in study design. YX critically revised the manuscript and made substantial contributions to interpreting the results. WP and YQ participated in manuscript review. All authors provided feedback and approved the final version of the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

The authors have no competing interests to declare.

DATA AVAILABILITY STATEMENT

The data underlying this article are available at http://112.74.48.115:9000/.

Supplementary Material

ocab176_Supplementary_Data

REFERENCES

  • 1. Zhao S, Su C, Lu Z, et al. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22 (3): bbaa057.
  • 2. Wei CH, Harris BR, Li D, et al. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford) 2012; 2012: bas041.
  • 3. Spasic I, Ananiadou S, McNaught J, et al. Text mining and ontologies in biomedicine: making sense of raw text. Brief Bioinform 2005; 6 (3): 239–51.
  • 4. Ananiadou S. Advances of biomedical text mining for semantic search. Web Sci Med Domain 2011; 5.
  • 5. Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 2013; 41 (Web Server issue): W518–W522.
  • 6. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 2001; 17 (2): 155–61.
  • 7. Ciaramita M, Gangemi A, Ratsch E, et al. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: Proceedings of IJCAI; Edinburgh, Scotland, UK; 2005: 659–64.
  • 8. Airola A, Pyysalo S, Björne J, et al. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 2008; 9 (Suppl 11): S2.
  • 9. Bui QC, Sloot PM, van Mulligen EM, et al. A novel feature-based approach to extract drug-drug interactions from biomedical text. Bioinformatics 2014; 30 (23): 3365–71.
  • 10. Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources. In: Proceedings of ISMB; Heidelberg, Germany; 1999: 77–86.
  • 11. Mintz M, Bills S, Snow R, et al. Distant supervision for relation extraction without labeled data. In: Proceedings of ACL; Singapore; 2009: 1003–11.
  • 12. Thomas P, Solt I, Klinger R, et al. Learning protein–protein interaction extraction using distant supervision. In: Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing; Hissar, Bulgaria; 2011: 25–32.
  • 13. Li G, Wu C, Vijay-Shanker K. Noise reduction methods for distantly supervised biomedical relation extraction. In: Proceedings of BioNLP; Vancouver, Canada; 2017: 184–93.
  • 14. Bobić T, Klinger R, Thomas P, et al. Improving distantly supervised extraction of drug-drug and protein-protein interactions. In: Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP; Avignon, France; 2012: 35–43.
  • 15. Riedel S, Yao L, McCallum A. Modeling relations and their mentions without labeled text. In: Proceedings of ECML PKDD 2010; Barcelona, Spain; 6323: 148–63.
  • 16. Hoffmann R, Zhang C, Ling X, et al. Knowledge-based weak supervision for information extraction of overlapping relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; Portland, Oregon; 2011: 541–50.
  • 17. Surdeanu M, Tibshirani J, Nallapati R, et al. Multi-instance multi-label learning for relation extraction. In: Proceedings of EMNLP-CoNLL; Jeju Island, Korea; 2012: 455–65.
  • 18. Zeng D, Liu K, Chen Y, et al. Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of EMNLP; Lisbon, Portugal; 2015: 1753–62.
  • 19. Lin Y, Shen S, Liu Z, et al. Neural relation extraction with selective attention over instances. In: Proceedings of ACL; Berlin, Germany; 2016: 2124–33.
  • 20. Ji G, Liu K, He S, et al. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In: Proceedings of AAAI; San Francisco, California; 2017: 3060–6.
  • 21. Feng J, Huang M, Zhao L, et al. Reinforcement learning for relation classification from noisy data. In: Proceedings of AAAI; New Orleans, Louisiana; 2018: 5779–86.
  • 22. Qin P, Xu W, Wang WY. Robust distant supervision relation extraction via deep reinforcement learning. In: Proceedings of ACL; Melbourne, Australia; 2018: 2137–47.
  • 23. Roller R, Stevenson M. Held-out versus gold standard: comparison of evaluation strategies for distantly supervised relation extraction from medline abstracts. In: Proceedings of the 6th International Workshop on Health Text Mining and Information Analysis; Lisbon, Portugal; 2015: 97–102.
  • 24. Segura-Bedmar I, Martínez Fernández P, Sánchez-Cisneros D. The 1st DDIExtraction-2011 challenge task: extraction of drug-drug interactions from biomedical texts. In: Proceedings of the 1st Challenge Task on Drug-Drug Interaction Extraction; Huelva, Spain; 2011: 1–9.
  • 25. Bunescu R, Ge R, Kate RJ, et al. Comparative experiments on learning information extractors for proteins and their interactions. Artif Intell Med 2005; 33 (2): 139–55.
  • 26. Pyysalo S, Ginter F, Heimonen J, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 2007; 8: 50.
  • 27. Fundel K, Küffner R, Zimmer R. RelEx–relation extraction using dependency parse trees. Bioinformatics 2007; 23 (3): 365–71.
  • 28. Ding J, Berleant D, Nettleton D, et al. Mining MEDLINE: abstracts, sentences, or phrases? In: Proceedings of PSB; Lihue, Hawaii; 2002: 326–37.
  • 29. Nédellec C. Learning language in logic-genic interaction extraction challenge. In: Proceedings of the 4th Learning Language in Logic Workshop; Bonn, Germany; 2005: 1–7.
  • 30. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004; 32 (Database issue): D267–D270.
  • 31. Thomas P, Bobić T, Leser U, et al. Weakly labeled corpora as silver standard for drug-drug and protein-protein interaction. In: Proceedings of the Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM) on Language Resources and Evaluation Conference (LREC); Istanbul, Turkey; 2012.
  • 32. Kim S, Liu H, Yeganova L, et al. Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 2015; 55: 23–30.
  • 33. Quan C, Hua L, Sun X, et al. Multichannel convolutional neural network for biological relation extraction. Biomed Res Int 2016; 2016: 1850404.
  • 34. Moen S, Ananiadou TSS. Distributional semantics resources for biomedical text processing. In: Proceedings of LBM; Ananiadou, Sophia; 2013: 39–44.
  • 35. Zeng D, Liu K, Lai S, et al. Relation classification via convolutional deep neural network. In: Proceedings of COLING; Dublin, Ireland; 2014: 2335–44.
  • 36. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 1992; 8 (3–4): 229–56.
  • 37. Sutton RS, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of NIPS; Denver, Colorado; 1999: 1057–63.
  • 38. Banuqitah H, Eassa F, Jambi K, et al. Two level self-supervised relation extraction from MEDLINE using UMLS. Int J Data Mining Knowl Manag Process 2016; 6 (3): 11–23.
  • 39. Tikk D, Thomas P, Palaga P, et al. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Comput Biol 2010; 6 (7): e1000837.
  • 40. Ye ZX, Ling ZH. Distant supervision relation extraction with intra-bag and inter-bag attentions. In: Proceedings of NAACL-HLT; Minneapolis, MN; 2019: 2810–9.
  • 41. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36 (4): 1234–40.
  • 42. van Mulligen EM, Fourrier-Reglat A, Gurwitz D, et al. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Inform 2012; 45 (5): 879–84.
  • 43. Kilicoglu H, Shin D, Fiszman M, et al. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics 2012; 28 (23): 3158–60.
  • 44. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003; 36 (6): 462–77.
  • 45. Zhang Y, Zhang Y, Qi P, et al. Biomedical and clinical English model packages for the Stanza Python NLP library. JAMIA 2021; 28 (9): 1892–99.
  • 46. Neumann M, King D, Beltagy I, et al. ScispaCy: Fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task; Florence, Italy; 2019: 319–27.
  • 47. Zhu M, Celikkaya B, Bhatia P, et al. LATTE: Latent type modeling for biomedical entity linking. AAAI 2020; New York, NY; 34 (05): 9757–64.

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
