Abstract
Objective
Various methods have been proposed to deal with erroneous training data in distantly supervised relation extraction (RE); however, their performance is still far from satisfactory. We aimed to address the insufficient modeling of instance-label correlations when predicting biomedical relations, using deep learning and reinforcement learning.
Materials and Methods
In this study, a new computational model called piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) was proposed to perform RE on distantly supervised data generated from Unified Medical Language System with MEDLINE abstracts and benchmark datasets. In PACNN+RL, PACNN was introduced to encode semantic information of biomedical text, and the RL method with memory backtracking mechanism was leveraged to alleviate the erroneous data issue. Extensive experiments were conducted on 4 biomedical RE tasks.
Results
The proposed PACNN+RL model achieved competitive performance on 8 biomedical corpora, outperforming most baseline systems. Specifically, PACNN+RL outperformed all baseline methods, with F1-scores of 0.5592 on the may-prevent dataset, 0.6666 on the may-treat dataset, and 0.3838 on the DDI corpus, 2011. For the protein-protein interaction RE task, we obtained new state-of-the-art performance on 4 out of 5 benchmark datasets.
Conclusions
The performance on many distantly supervised biomedical RE tasks was substantially improved, primarily owing to the denoising effect of the proposed model. It is anticipated that PACNN+RL will become a useful tool for large-scale RE and other downstream tasks to facilitate biomedical knowledge acquisition. We also made the demonstration program and source code publicly available at http://112.74.48.115:9000/.
Keywords: biomedical relation extraction, distant supervision, reinforcement learning, neural networks, deep learning
INTRODUCTION
The amount of biomedical literature is increasing rapidly every day, creating an urgent need to extract useful information from the literature automatically to facilitate knowledge integration and digestion.1 Relation extraction (RE) is an important step in information extraction that builds on the output of named entity recognition. In the biomedical domain, it aims to extract relational facts between bioentities mentioned in plain text, and is beneficial for many biomedical knowledge-driven applications.2–5 Previous solutions for RE included rule-based,6 unsupervised,7 and supervised learning approaches,8,9 among which the supervised learning approach has been the most successful in recent years. However, supervised learning requires large volumes of high-quality corpora annotated by domain experts.
Distant supervision (DS) was proposed to alleviate this issue.10,11 DS is a form of weak supervision that can be used to generate large amounts of training data from an unlabeled corpus using well-built knowledge bases (KBs). The basic DS assumption is that if 2 entities participate in a relation according to the KBs, all sentences in the text corpus that mention these 2 entities could potentially express this relation. For instance, the entity pair DRUG(“Practolol”)-DISEASE(“arrhythmias”) is defined to be associated with the may-treat relationship in the KB Unified Medical Language System (UMLS). According to the assumption, the following sentence in PubMed would be annotated as a positive training instance of the may-treat relation: “Example 1: Practolol, although reversing the arrhythmias, tends to cause hypotension.” (PMID: 7377). On the other hand, if a pair mentioned in a sentence is not recorded in UMLS, the sentence is considered a negative instance. Such training data are then used to train a supervised model. However, this assumption does not always hold, so some sentences containing an entity pair might be mistakenly identified as positive examples. For example, the following sentence contains the same entity pair but does not assert the may-treat relation, and is therefore a false positive (FP) annotation: “Example 2: The effective half-life of practolol was less than 15 min and doses up to 0.4 mg/kg were unable to prevent arrhythmias during adrenaline challenge.” (PMID: 26159). In this sentence, we cannot find any description that expresses the target relation. Thus, even though DS is efficient for data acquisition, it comes at the cost of data noise.
To address the noisy data challenge in DS, some works manually defined rules to improve the quality of training data.12,13 For instance, Bobić et al14 introduced a constraint for protein-protein interaction (PPI) and drug-drug interaction (DDI) RE tasks: if both entities refer to the same object, the pair is filtered out. However, such methods generalize poorly across tasks. To weaken the strong hypothesis of DS, many studies frame classification as a multi-instance learning problem.15–20 These methods assume that a bag of sentences mentioning an entity pair all describe the same relation, and they select the one-best sentence18 or calculate intrabag attention weights19,20 to reduce the influence of noisy sentences. Such methods are effective in reducing noise, but they cannot identify the FP instances in positive bags. Recently, some works21,22 used the reinforcement learning (RL) strategy to create “clean” data. These methods treated noisy instances with a hard decision, rather than with soft attention weights. For example, Feng et al21 proposed a removal operation that retains only high-quality sentences. However, removing sentences without additional supervision risks discarding useful instances. Unlike previous works, we made full use of each instance and proposed a novel RL strategy that considers the inherent connections and differences between FP and true positive examples, which is of great importance to DS.
In this article, we propose a novel biomedical RE method that combines a piecewise attentive convolutional neural network (PACNN) with an RL denoising module based on DS. The PACNN module is based on the piecewise convolutional neural network (PCNN)18 with a filter attention mechanism to capture the internal relationships of biomedical sentences, and it relies on multichannel word embeddings that represent words as dense vectors. In the denoising module, an RL algorithm is proposed to enlarge the gap between positive and negative examples that mention the same entity pair by selecting FP instances from positive bags and using them to build contrastive negative bags. Inspired by the way human beings track historical events to aid current decision making, we introduced a memory backtracking mechanism into RL to emphasize specific attention regions on the historical actions. The model was validated on 4 biomedical RE tasks. Through extensive experiments, we demonstrated our method’s effectiveness compared with DS baselines on 8 datasets. In addition, we conducted an ablation study to illustrate the contribution of different components and analyzed the newly built negative bags to validate the effectiveness of the RL module.
MATERIALS AND METHODS
Corpus building
This method was evaluated on 4 biomedical RE tasks, namely, the extraction of may-treat, may-prevent, DDI, and PPI relations.23–29 Specifically, may-prevent denotes a treatment that can be used to prevent a disease, and may-treat indicates the treatment for a particular disease. DDI refers to the compound effects of patients taking 2 or more drugs at the same time or within a certain period. Because protein function depends largely on the functional context of its interaction partners, a better understanding of PPIs is vital to understanding biological processes.
For the may-prevent and may-treat RE tasks, DS was used to generate training data; the left part of Figure 1 shows a flow chart of DS. We first selected all triples with may-prevent and may-treat relations from the UMLS Metathesaurus as the knowledge source. UMLS is a large-scale biomedical database that contains millions of medical concepts and their corresponding relations.30 Then, we used MetaMap (https://lhncbc.nlm.nih.gov/ii/tools/MetaMap.html) to perform named entity recognition with UMLS concepts on 1 million randomly selected MEDLINE abstracts. Finally, the relation of each entity pair was generated according to the heuristic rule: sentences containing concepts that are identified as being related in the UMLS are positive, and sentences containing concepts that are not recorded as related in the UMLS are negative. For the DDI and PPI RE tasks, we made use of the distantly supervised set constructed by Thomas et al.31 All test sets were manually annotated and are publicly available. Table 1 lists the descriptions of the 4 distantly labeled training corpora and 8 test sets.
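The labeling heuristic can be illustrated with a minimal sketch. It uses the 2 Practolol sentences quoted in the Introduction; the tiny `KB` dictionary and the `distant_label` function are hypothetical stand-ins for the UMLS lookup and are not part of the released system.

```python
# Hypothetical miniature knowledge base: (head, tail) pairs with a known relation.
KB = {("practolol", "arrhythmias"): "may-treat"}

def distant_label(sentence, head, tail, kb=KB):
    """Label a sentence by the DS heuristic: positive if the pair is in the KB."""
    text = sentence.lower()
    if head.lower() not in text or tail.lower() not in text:
        raise ValueError("both entities must be mentioned in the sentence")
    # Pairs absent from the KB are treated as negative instances.
    return kb.get((head.lower(), tail.lower()), "negative")

example_1 = "Practolol, although reversing the arrhythmias, tends to cause hypotension."
example_2 = ("The effective half-life of practolol was less than 15 min and doses up to "
             "0.4 mg/kg were unable to prevent arrhythmias during adrenaline challenge.")

labels = [distant_label(s, "Practolol", "arrhythmias") for s in (example_1, example_2)]
print(labels)
```

Both sentences receive the may-treat label because the heuristic looks only at the entity pair, which is exactly how the second sentence becomes an FP annotation.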
Figure 1.
The architecture of the proposed piecewise attentive convolutional neural network and reinforcement learning (PACNN+RL) model for distantly supervised biomedical relation extraction. The proposed method has 3 parts: distant supervision (left), PACNN model (middle), and reinforcement learning method (right).
Table 1.
Description of the 4 distantly labeled sets and 8 manually labeled test sets
| Type | Training database | Training abstracts | Training positive/negative | Test corpus | Test positive/negative |
|---|---|---|---|---|---|
| Prevent | UMLS | 55 967 | 15 576/184 424 | Prevent23 | 139/261 |
| Treat | UMLS | 55 761 | 53 234/146 766 | Treat23 | 173/227 |
| DDI | DrugBank | 76 859 | 8705/191 295 | DDI corpus, 201124 | 755/6271 |
| PPI | IntAct, KUPS | 49 958 | 37 600/162 400 | AIMed25 | 1000/4834 |
| | | | | BioInfer26 | 2534/7132 |
| | | | | HPRD5027 | 163/270 |
| | | | | IEPA28 | 335/482 |
| | | | | LLL29 | 164/166 |
DDI: drug-drug interaction; PPI: protein-protein interaction; UMLS: Unified Medical Language System.
Text preprocessing
To ensure that the test and training sets do not overlap, all sentences contained in the test sets were removed from the distantly labeled sets. To reduce the impact of nontarget entities and improve generalization during learning, we replaced the head and tail entities with the symbols “entity1” and “entity2,” respectively, and represented all other entities with the unified symbol “entity0.”32 Head and tail entities are components of a relational fact, which is often represented in the form of a triple (head, relation, tail). Finally, we applied 2 rules adapted from Kim et al32 to filter out noisy instances (see Supplementary Appendix A for more examples):
Rule 1. Target entities should not refer to the same real-world biomedical entity. If (1) the 2 entities have the same name, or (2) 1 entity is an abbreviation or acronym of the other, filter out the corresponding instances.
Rule 2. Entity pairs in a coordinate structure should be filtered out.
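The entity masking step and Rule 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the acronym test is a crude initials check rather than the procedure of Kim et al, and Rule 2 is omitted because detecting coordinate structures requires syntactic analysis.

```python
import re

def is_acronym(short, long):
    """Crude check (an assumption): `short` equals the initials of the words in `long`."""
    initials = "".join(word[0] for word in re.split(r"[\s\-]+", long) if word)
    return len(short) > 1 and short.lower() == initials.lower()

def same_entity(head, tail):
    """Rule 1: same name, or one mention is an acronym of the other."""
    if head.lower() == tail.lower():
        return True
    return is_acronym(head, tail) or is_acronym(tail, head)

def mask_entities(tokens, head_idx, tail_idx, other_idxs=()):
    """Replace head/tail mentions with entity1/entity2 and other entities with entity0."""
    masked = list(tokens)
    for i in other_idxs:
        masked[i] = "entity0"
    masked[head_idx] = "entity1"
    masked[tail_idx] = "entity2"
    return masked

tokens = "Practolol reverses arrhythmias but not hypotension".split()
masked = mask_entities(tokens, head_idx=0, tail_idx=2, other_idxs=[5])
print(masked)
print(same_entity("TNF", "tumor necrosis factor"))
```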
Piecewise attentive convolutional neural network
The architecture of the PACNN, depicted in Figure 1, consists of a multichannel input layer, a convolution layer, a piecewise max pooling layer, a filter attention layer, and a multi-instance classifier.
Multichannel input layer
In a neural network model, the embedding layer transforms discrete words into dense vectors as input. In our work, word embeddings and positional features were generated according to the given sentence where and are the target entities and is the number of context words. For word embeddings, we applied 5 types of pretrained word embeddings generated using PubMed, PMC, MEDLINE articles, and Wikipedia to be in accordance with some previous studies.33,34 For positional features, we adopted the idea of Zeng et al.35 Position embeddings are vector representations of the relative distances from the current word respectively to the head or tail entities in the sentence and are added in each channel of the input layer.
Convolution
A convolution operation involves a filter, which is applied to a window of words to produce a new feature. Previous PCNNs only used a certain window size as the filter, which is limited in identifying relation between entities with different distances. In the PACNN, the window size is set by to support various semantically dependent distances. The convolutional operation can be defined as filter in each channel input , where is the number of channels and is the dimension of embeddings.
The feature can be calculated by the following formula:
| (1) |
where is the activation function, is the concatenation of embeddings of continuous words for the -th channel, is the -th filter of each window, is element wise multiplication, and is a bias term.
After the feature is generated, the feature map is concatenated by:
| (2) |
Note that the size of feature map is equivalent to the sentence length .
Piecewise max pooling
Based on the PCNN,18 outputs from the convolutional layer with the specific filter are divided into 3 parts for max pooling according to the indices for (k) and (v). The max pooling is applied over each part of the feature map:
| (3) |
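Piecewise max pooling can be sketched in a few lines. The function below is a hypothetical illustration for a single filter; the exact segment boundaries (whether each entity position belongs to the piece before or after it) and the value used for an empty piece are assumptions.

```python
def piecewise_max_pool(feature_map, head_pos, tail_pos):
    """Split a 1-D feature map at the two entity positions and max-pool each piece."""
    first, second = sorted((head_pos, tail_pos))
    pieces = [
        feature_map[: first + 1],             # up to and including the first entity
        feature_map[first + 1 : second + 1],  # between the entities, including the second
        feature_map[second + 1 :],            # after the second entity
    ]
    # An empty piece (entity at the sentence boundary) is pooled to 0.0 by assumption.
    return [max(piece) if piece else 0.0 for piece in pieces]

feature_map = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
pooled = piecewise_max_pool(feature_map, head_pos=1, tail_pos=3)
print(pooled)
```

Compared with a single max over the whole sentence, the 3 pooled values keep coarse information about where a feature fired relative to the 2 entities.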
Filter attention
After obtaining the feature vectors for filters with different window sizes , most previous studies simply concatenate these representations. Instead, we applied a filter attention layer after the pooling layer to weight the representations according to the contribution of each window size. Denoting the feature vectors learned from 3 filter sizes as , the sentence representation can be defined as:
| (4) |
where are sentence encoding vectors from filters of different window sizes, is the concatenation operation, are filter attention weights and is calculated as the following formula:
| (5) |
where is a randomly initialized parameter vector with the same dimension as and is the weight score assigned to each filter size. Attention values are calculated by taking softmax over .
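A minimal sketch of the filter attention step is given below. It assumes that the score of each window size is the dot product between the parameter vector and the pooled feature vector; the function names and toy values are hypothetical, and in the real model the parameter vector is learned rather than fixed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def filter_attention(vectors, query):
    """Weight each per-window-size vector by softmax(dot(query, vector)), then concatenate."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    weighted = [[w * v for v in vec] for w, vec in zip(weights, vectors)]
    sentence = [v for vec in weighted for v in vec]  # concatenation of weighted vectors
    return sentence, weights

# One toy feature vector per window size (3 window sizes, dimension 2).
vectors = [[0.2, 0.1], [0.4, 0.3], [0.9, 0.8]]
query = [1.0, 1.0]  # stands in for the randomly initialized parameter vector
sentence_repr, weights = filter_attention(vectors, query)
print(weights)
```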
Training
In the training phase, sentences are packed into bags according to entity pairs for multi-instance learning. In Figure 1, the bag (type 2 diabetes, insulin) contains sentence representations , where is the number of sentences in the bag. A sentence-level attention is applied to these sentence representations to score how well each sentence matches the relation, and the attention score for the -th sentence is defined as:
| (6) |
where and are the trainable relation vectors associated with relation and . The relation vectors are randomly initialized and are fixed after training to represent a particular relation during the test. The bag-level representation is obtained by and fed into a Softmax layer to get the probability of each relation.
Reinforcement learning
To address the problem that instances with the same entity pair and different labels are indistinguishable, we proposed an RL method that picks out the possibly negative instances from positive bags to build contrastive negative bags. This increases the diversity of negative bags and emphasizes the difference between positive and negative bags with the same entity pair. Then, the redistributed data are fed into the pretrained relation classifier PACNN. We used the performance of PACNN as the result-driven reward for a series of actions decided by the RL agent.
We first pretrained an instance-level classifier with a CNN representation model using bag labels; this classifier can be considered as an agent A. As mentioned before, the agent is equipped with a memory backtracking mechanism, and it selects FP instances in each positive bag and adds them to an additional negative bag. At each step t, A is at state that is a weighted sum of the current sentence and the historical sentences on which A has already acted, and the weight here is calculated by the memory backtracking mechanism. The agent then takes an action that decides whether sentence in a positive bag is FP or not. After A takes the terminate action for a positive bag , we obtain a negative bag , which consists of the chosen instances. A receives a final reward from the PACNN model when all the selections are made. The details of the joint training process are described in Algorithm 1.
Algorithm 1.
Reinforcement Learning Algorithm
Initialize the policy network, the PACNN model, and the training data
for each epoch do
  for each batch do
    Get the batch data
    Get the batch labels
    for each bag in the batch do
      if (the bag is a positive bag) then
        Sample instance selection actions for each instance in the bag with the policy network
        for each instance in the bag do
          if (the instance is false positive) then
            Add the instance into the new negative bag
          end if
          Add the new negative bag and its label into the batch data and batch labels, respectively
        end for
      end if
    end for
    Compute the delayed reward from the PACNN model loss
    Update the parameters of the policy network
  end for
  Update the parameters Φ of the PACNN model with the policy network fixed
end for
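The bag redistribution step of Algorithm 1 can be sketched as below. This is a simplified illustration, not the released implementation: `redistribute` and `toy_policy` are hypothetical names, the policy is a stand-in for the trained agent, flagged sentences are copied rather than removed from the positive bag (an assumption), and the reward computation and parameter updates are left out.

```python
import random

def redistribute(bags, policy, rng):
    """Copy instances the policy flags as false positive into new negative bags.

    bags: list of (entity_pair, label, sentences); label 1 = positive, 0 = negative.
    policy: callable returning the probability that a sentence is a false positive.
    """
    new_bags = []
    for pair, label, sentences in bags:
        new_bags.append((pair, label, sentences))
        if label != 1:
            continue
        # Sample one binary action per instance from the policy.
        flagged = [s for s in sentences if rng.random() < policy(s)]
        if flagged:
            new_bags.append((pair, 0, flagged))  # contrastive negative bag
    return new_bags

def toy_policy(sentence):
    # Deterministic stand-in for the agent: flag sentences containing "unable".
    return 1.0 if "unable" in sentence else 0.0

bags = [
    (("practolol", "arrhythmias"), 1,
     ["entity1 reverses entity2", "entity1 was unable to prevent entity2"]),
    (("aspirin", "fever"), 0, ["entity1 and entity2 were recorded"]),
]
result = redistribute(bags, toy_policy, random.Random(0))
for bag in result:
    print(bag)
```

The new negative bag shares its entity pair with the positive bag it came from, which is what makes it contrastive.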
At the same time, it is considered that historical events that happened closer in time could have greater impact on the current decision. We designed to indicate the influence degree of different time steps on the current state. The detailed operation of is as follows:
| (8) |
where . Next, we use to normalize the effects of and , and the weight is calculated as:
| (9) |
| (10) |
Here, the impact factor is introduced to give coefficients on negative () and positive () instances classified by the agent. In the experiment, is set as by hyperparameter grid search. This can reduce the model's sensitivity and therefore increase the precision of choosing FP instances. Finally, the current state representation can be calculated as follows:
| (11) |
We used the REINFORCE algorithm to train our agent. It is a type of policy gradient method proposed by Williams et al,36 and it maximizes the performance of the agent by updating its policy parameters. The policy is defined as follows:
| (12) |
where represents the parameters in the agent model, and is a label indicating whether the current instance is false positive or not. We assume that the model receives a delayed reward for a set of bags . The reward can be formulated as follows:
| (13) |
where is the length of set , is the relation label of bag and is the probability given by the PACNN. Note that the relation classifier is at the bag-level because it computes for each bag. For a set of bags, we aim to maximize the total reward. Thus, the objective function can be defined as follows:
| (14) |
Based on the policy gradient theorem,37 the parameters are updated as follows:
| (15) |
where , for . is the value function and for each bag .
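For orientation, the generic REINFORCE update with a baseline can be written as below. This is the textbook form rather than a reconstruction of the equations above; the symbols $\theta$ (policy parameters), $\pi_\theta$ (policy), $s_t$ and $a_t$ (state and action at step $t$), $R$ (delayed reward), $b$ (baseline, for which a value function can be used), and $\alpha$ (learning rate) are assumed notation.

```latex
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{a_{1:T} \sim \pi_{\theta}}
    \left[ (R - b) \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta)
```

Subtracting the baseline does not change the expected gradient but reduces its variance, which matters when the reward arrives only after all selections in a bag have been made.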
RESULTS
Performance comparison
To evaluate our approach, we compared the proposed PACNN+RL model with several baseline models in Table 2, including feature-based and neural network–based methods (see Supplementary Appendix F for details of baselines). For each task, we calculated the precision, recall, and F1-score. Our PACNN+RL model outperformed all other single models with the highest F1-score on the may-prevent dataset, may-treat dataset, and DDI corpus, 2011. For the PPI RE task, PACNN+RL achieved the highest F1-score among all single models on the AIMed, BioInfer, HPRD50, and IEPA datasets, whereas Thomas et al31 achieved the highest F1-score on the LLL dataset. The performance on different PPI datasets varies significantly, and all methods achieved relatively low performance on the AIMed dataset. This suggests that it is difficult to accurately extract PPI relations from the AIMed corpus, although our method still outperformed the other single methods. We also noticed that the feature-based methods outperformed neural network–based methods on the LLL test set; nevertheless, our model achieved higher F1-scores than all neural network–based methods. As shown in Table 2, PACNN+RL achieved the highest precision on the may-treat dataset, DDI corpus, 2011, AIMed dataset, and HPRD50 dataset, which is a significant improvement compared with other models. For the remaining test sets, the highest precision was achieved by feature-based methods, which utilized manually selected and encoded lexical, syntactic, and semantic information as well as some rules. However, such methods suffer from intrinsic limitations, such as coverage or scalability among others, and PACNN+RL is superior to them in terms of recall and model robustness. Furthermore, we had initially hoped that BioBERT’s pretraining capability would make the model work better, and we fused our model with it. However, the experimental results of the fusion model PACNN+RL (+BioBERT v1.0)* are mixed. This may be due to the noise in DS data, on which fine-tuning BioBERT does not lead to a large improvement. In addition, we compared the space and time complexity of our PACNN+RL model with other models in Supplementary Appendix B.
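As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it for 2 PACNN+RL rows of Table 2; the helper name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for PACNN+RL, copied from Table 2.
reported = {
    "may-prevent": (0.4155, 0.8550, 0.5592),
    "DDI corpus, 2011": (0.7320, 0.2601, 0.3838),
}
for task, (p, r, f1) in reported.items():
    print(task, round(f1_score(p, r), 4), f1)
```

The DDI row shows why F1 is the headline metric: a precision of 0.7320 paired with a recall of 0.2601 still yields an F1-score of only about 0.38.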
Table 2.
Performance on the may-prevent, may-treat, DDI corpus, 2011, and PPI test sets in comparison with the state-of-the-art methods
| Method | may-prevent Precision | may-prevent Recall | may-prevent F1 score | may-treat Precision | may-treat Recall | may-treat F1 score | DDI corpus, 2011 Precision | DDI corpus, 2011 Recall | DDI corpus, 2011 F1 score | PPI AIMed Precision | PPI AIMed Recall | PPI AIMed F1 score | PPI BioInfer Precision | PPI BioInfer Recall | PPI BioInfer F1 score | PPI HPRD50 Precision | PPI HPRD50 Recall | PPI HPRD50 F1 score | PPI IEPA Precision | PPI IEPA Recall | PPI IEPA F1 score | PPI LLL Precision | PPI LLL Recall | PPI LLL F1 score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Roller and Stevenson23 | 0.5833 a | 0.3559 | 0.4421 | 0.4832 | 0.5143 | 0.4983 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| Banuqitah et al38 | – | – | – | 0.54 | 0.72 | 0.62 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| Bobić et al14 | – | – | – | – | – | – | 0.325 | 0.437 | 0.373 | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
| Tikk et al39 | – | – | – | – | – | – | – | – | – | 0.283 | 0.866 a | 0.426 | 0.628 a | 0.365 | 0.462 | 0.569 | 0.687 | 0.622 | 0.710 a | 0.525 | 0.604 | 0.790 a | 0.573 | 0.664 |
| Thomas et al31 | – | – | – | – | – | – | 0.330 | 0.441 | 0.377 | 0.256 | 0.784 | 0.386 | 0.404 | 0.667 | 0.503 | 0.457 | 0.851 | 0.594 | 0.500 | 0.872 a | 0.635 | 0.564 | 0.831 | 0.672 a |
| PCNN+ONE18 | 0.4207 | 0.5000 | 0.4569 | 0.4980 | 0.7485 | 0.5981 | 0.2765 | 0.5640 | 0.3711 | 0.2217 | 0.5638 | 0.3183 | 0.4493 | 0.2181 | 0.2937 | 0.4951 | 0.4857 | 0.4904 | 0.6664 | 0.2424 | 0.3555 | 0.7138 | 0.0935 | 0.1653 |
| CNN+ATT19 | 0.4236 | 0.6231 | 0.5043 | 0.4875 | 0.8011 | 0.6061 | 0.3056 | 0.4093 | 0.3499 | 0.2036 | 0.6020 | 0.3043 | 0.3854 | 0.2371 | 0.2936 | 0.4672 | 0.5428 | 0.5022 | 0.5262 | 0.3030 | 0.3846 | 0.6953 | 0.1495 | 0.2461 |
| PCNN+ATT19 | 0.3962 | 0.6987 | 0.5057 | 0.4963 | 0.7777 | 0.6059 | 0.3184 | 0.3576 | 0.3369 | 0.2077 | 0.6301 | 0.3124 | 0.4177 | 0.2215 | 0.2895 | 0.4770 | 0.4952 | 0.4859 | 0.5831 | 0.2121 | 0.3111 | 0.6246 | 0.0934 | 0.1625 |
| CNN+RL21 | 0.4232 | 0.5689 | 0.4854 | 0.5580 | 0.7310 | 0.6329 | 0.3394 | 0.3329 | 0.3361 | 0.1493 | 0.8087 | 0.2521 | 0.2869 | 0.7226 a | 0.4107 | 0.3574 | 0.9428 a | 0.5183 | 0.3978 | 0.5605 | 0.4653 | 0.4124 | 0.9906 a | 0.5824 |
| PCNN+ATT RA+BAG ATT40 | 0.4318 | 0.1418 | 0.2135 | 0.4046 | 0.3099 | 0.3510 | 0.1426 | 0.3058 | 0.1945 | 0.1868 | 0.4805 | 0.2690 | 0.2449 | 0.3829 | 0.2987 | 0.4650 | 0.5840 | 0.5177 | 0.4035 | 0.5324 | 0.4591 | 0.4818 | 0.4490 | 0.4648 |
| BioBERT v1.0 (+ PMC)41 | 0.3482 | 0.9058 | 0.5030 | 0.4669 | 0.9826 a | 0.6330 | 0.1778 | 0.6980 | 0.2834 | 0.2481 | 0.7660 | 0.3748 | 0.3952 | 0.6062 | 0.4784 | 0.4798 | 0.5828 | 0.5263 | 0.4531 | 0.5045 | 0.4774 | 0.6122 | 0.5488 | 0.5788 |
| BioBERT v1.1 (+ PubMed)41 | 0.3547 | 0.9638 | 0.5185 | 0.4682 | 0.9419 | 0.6255 | 0.1839 | 0.6993 | 0.2912 | 0.2671 | 0.7900 | 0.3992 | 0.3879 | 0.6255 | 0.4789 | 0.5289 | 0.7301 | 0.6134 | 0.4818 | 0.5910 | 0.5308 | 0.5940 | 0.4817 | 0.5320 |
| PACNN+RL (Ours) | 0.4155 | 0.8550 | 0.5592 | 0.6666 a | 0.6666 | 0.6666 | 0.7320 a | 0.2601 | 0.3838 a | 0.5531 a | 0.3754 | 0.4472 | 0.5925 | 0.4648 | 0.5209 a | 0.6709 a | 0.6380 | 0.6540 a | 0.6677 | 0.6435 | 0.6554 a | 0.7531 | 0.6049 | 0.6709 |
| PACNN+RL (+ BioBERT v1.0)∗ | 0.3964 | 0.9728 a | 0.5633 a | 0.5113 | 0.9766 | 0.6712 a | 0.2213 | 0.7372 a | 0.3404 | 0.3497 | 0.6304 | 0.4498 a | 0.4400 | 0.5320 | 0.4816 | 0.5155 | 0.8160 | 0.6318 | 0.5368 | 0.6616 | 0.5927 | 0.6466 | 0.5309 | 0.5831 |
CNN: convolutional neural network; DDI: drug-drug interaction; PACNN+RL: piecewise attentive convolutional neural network and reinforcement learning; PCNN: piecewise convolutional neural network; PPI: protein-protein interaction; RL: reinforcement learning.
aFor each metric, the value marked with “a” indicates the best performing classifier.
Ablation experiment
To analyze the contribution of each part of our model, we cumulatively removed components and evaluated the total performance on 3 test sets in Table 3. The w/o multichannel (random) method utilized randomly initialized word embeddings. The w/o multichannel (Wikipedia and PubMed) method utilized 1-channel pretrained word embeddings, and its F1-score is higher on average than that of the w/o multichannel (random) model. Compared with the 1-channel model, the PACNN+RL model improved the F1-scores on all 3 test sets. In the w/o filter attention method, the filter attention mechanism is removed and the feature vectors from different filters are simply combined, resulting in a sharp drop in performance. The preprocessing step reduces the number of potentially misleading examples and improves the F1-scores on average. Another aspect to note is that w/o RL performs worse than the proposed model, which supports the effectiveness of the RL agent. In conclusion, the multichannel architecture can improve the performance of PACNN+RL by a large margin. In addition, model performance can be further improved by the filter attention, preprocessing, and RL.
Table 3.
Performances of model with and without different components
| Method | may-prevent Precision | may-prevent Recall | may-prevent F1 score | may-treat Precision | may-treat Recall | may-treat F1 score | DDI corpus, 2011 Precision | DDI corpus, 2011 Recall | DDI corpus, 2011 F1 score |
|---|---|---|---|---|---|---|---|---|---|
| Without multichannel (random) | 0.3571 | 0.3623 | 0.3597 | 0.4960 | 0.7310 | 0.5910 | 0.3534 | 0.1245 | 0.1841 |
| Without multichannel (Wikipedia and PubMed) | 0.3991 | 0.6304 | 0.4888 | 0.4945 | 0.7953 | 0.6099 | 0.2241 | 0.4464 | 0.2984 |
| Without filter attention | 0.4055 | 0.6376 | 0.4957 | 0.4937 | 0.9122 | 0.6407 | 0.2552 | 0.6371 a | 0.3644 |
| Without preprocessing | 0.4142 | 0.7985 | 0.5455 | 0.4842 | 0.9768 a | 0.6475 | 0.1522 | 0.5828 | 0.2414 |
| Without RL | 0.4182 a | 0.7970 | 0.5486 | 0.4954 | 0.9473 | 0.6506 | 0.3434 | 0.4053 | 0.3718 |
| PACNN+RL (Ours) | 0.4155 | 0.8550 a | 0.5592 a | 0.6666 a | 0.6666 | 0.6666 a | 0.7320 a | 0.2601 | 0.3838 a |
DDI: drug-drug interaction; PACNN+RL: piecewise attentive convolutional neural network and reinforcement learning; RL: reinforcement learning.
aFor each metric, the value marked with “a” indicates the best performing classifier.
The impact of the RL denoising module
To illustrate the effectiveness of the RL denoising module, we listed 1 newly generated negative bag, comprising 1 or more sentences, for each task in Table 4. For example, the entity pair DRUG(“vorapaxar”)-DISEASE(“stroke”) has the may-prevent relationship in UMLS. Therefore, the 2 sentences in the right column are labeled as positive instances by DS. However, neither of these sentences expresses the may-prevent relation. On the contrary, they state that patients with a history of DISEASE(“stroke”) should not take DRUG(“vorapaxar”) under certain circumstances. This indicates that our model can select FP instances from positive bags, which further improves the accuracy of the relation classifier. These cases intuitively illustrate the ability of our model to denoise distantly generated labels.
Table 4.
Examples of contrastive negative bags generated by RL agent; the target entities are in italic
| Relation | False positive sentences |
|---|---|
| may-prevent | (i) In this study, vorapaxar was discontinued in patients with a history of stroke due to excessive risk for intracranial hemorrhage after 2 years of therapy. |
| | (ii) Vorapaxar should not be used in patients with a history of stroke, transient ischemic attack, intracranial hemorrhage, or active pathological bleeding. |
| may-treat | (i) L-5418 may prove useful for grand mal epilepsy as it is less toxic than diphenylhydantoin and carbamazepine. |
| DDI | (i) One group of 33 patients was treated with 150 mg amitriptyline a day (the AMI group); 25 other patients received a daily dose of thioridazine, either 200 mg (200-THD group; n = 7) or 400 mg (400-THD group; n = 18). |
| PPI | (i) High articular levels of the angiogenetic factors VEGF and VEGF-receptor 2 as tissue healing biomarkers after single bundle anterior cruciate ligament reconstruction. |
| | (ii) Immunostaining was used to monitor VEGF treatment by examining VEGF and VEGF-receptor 2 expression. |
DDI: drug-drug interaction; PPI: protein-protein interaction; RL: reinforcement learning.
Then, we analyzed the outputs of RL from different aspects in Figure 2 and Supplementary Appendix C. Specifically, the line charts in Figure 2 display the size distribution of the newly built FP bags for different relations. The x-axis represents the number of instances per bag, and the y-axis represents the number of corresponding bags. As can be seen from the line charts, the number of bags decreases as the bag size increases.
Figure 2.
The predictive results of reinforcement learning with different relations. (A-D) Each panel represents different relations. The line chart and the pie chart describe 2 aspects of the results, respectively. DDI: drug-drug interaction; PPI: protein-protein interaction.
Finally, we manually checked 400 instances that were selected as FP instances by the RL agent in a randomly sampled dataset. Specifically, for each relation, we sampled 100 sentences from the newly built negative training data. For each instance, the agent makes a correct decision if the sentence it selected as an FP instance is manually labeled as a negative instance; otherwise, the agent makes a wrong decision. The pie charts in Figure 2 depict the resulting accuracy on may-prevent, may-treat, DDI, and PPI. Of the 400 sentences chosen as FP instances, 291 were correctly selected (not describing the relation) and 109 were wrong. To summarize, the overall accuracy of our RL agent is 72.75% (291/400), which demonstrates the effectiveness of our RL agent.
DISCUSSION
In this study, we investigated the proposed PACNN+RL method to extract may-prevent, may-treat, DDI, and PPI relations from distantly supervised datasets. By combining the RL module, which is designed to alleviate the erroneous data issue, with the relation classifier, our system achieved the best performance on 7 biomedical benchmark datasets.
Performance analysis
The overall performance comparison in Table 2 shows that our method is superior to the baselines on almost all evaluation datasets. The LLL corpus is the only one on which our method does not perform best, although our result is competitive with that of Thomas et al.31 From Table 1, we found that the LLL test set contains only 330 instances, which is far fewer than the other PPI test sets. The limited size of the LLL corpus is likely to cause substantial score variance among different methods. Also, Thomas et al31 employed rich handcrafted feature vectors in their RE system, such as lexical and dependency parsing features. Following the shortest dependency path hypothesis, they created the respective dependency parse tree by using the syntactical and dependency information of edges and vertices. In contrast, dependency parsing features are not included in our system. These may be the major reasons for the worse performance on this dataset. It should be noted that DDI in DrugBank includes both drug synergy and DDI, whereas the DDI corpus includes only DDI information from DrugBank. Ideally, drug synergy should be removed from the DrugBank DDI data because this part of the data may interfere with the model prediction. However, for a fair comparison, we followed the benchmark data31 and did not remove this part of the data. As shown in Table 2, BioBERT achieved relatively high F1-scores on the may-prevent and may-treat datasets, but it does not perform well on the DDI corpus, 2011, test set. This may be because the DDI distantly labeled set contains more noisy data, and BioBERT is a supervised RE model that lacks a denoising module. This further demonstrates that the distantly supervised RE task is challenging, and that simply applying a supervised RE model to weakly supervised datasets yields unsatisfactory performance.
From Table 3, we demonstrate the effectiveness of the usage of multichannel word embeddings, and we consider it may be due to the following advantages: (1) it decreases the number of unknown words by looking up the same word in different resources; (2) external information has been introduced by sharing among different embeddings; and (3) it is difficult to mine relation between biomedical mentions only using general word embeddings—however, biomedical word embeddings are suitable. To validate the contribution of filter attention, we show a visualization example in Figure 3. In this case, when the window size is equal to 4, it contains both the head and tail entities, and the PACNN model can directly extract information from the phrase: “ritonavir induction of methadone.” However, when the window size is 2 or 3, only the information of the head entity can be obtained. Therefore, the attention weight is the largest when the window size is equal to 4. Finally, for the contribution of each component, we can conclude that (1) rich semantic information can improve the performance, (2) filter attention is of great importance for encoding semantic information, and (3) the RL denoising module would affect the total distantly supervised model’s performance.
Figure 3.
Example of filter attention weights according to different window sizes.
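The window-size argument above can be illustrated with a minimal sketch. It is not the PACNN implementation: the function name `windows_covering_both` is hypothetical, the token list is restricted to the quoted phrase, and the sketch only checks which contiguous windows contain both entities, without computing any attention weights.

```python
from typing import List


def windows_covering_both(tokens: List[str], head: str, tail: str,
                          size: int) -> List[List[str]]:
    """Return every contiguous window of `size` tokens that holds both entities."""
    spans = [tokens[i:i + size] for i in range(len(tokens) - size + 1)]
    return [span for span in spans if head in span and tail in span]


# Phrase quoted in the text; head entity "ritonavir", tail entity "methadone".
tokens = ["ritonavir", "induction", "of", "methadone"]

for size in (2, 3, 4):
    hits = windows_covering_both(tokens, "ritonavir", "methadone", size)
    print(size, hits)
```

Under these assumptions, only the window of size 4 spans both entities, which is consistent with the largest attention weight being assigned to that window size in Figure 3.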
Performance on supervised RE scenarios
We verified the proposed model on the DS corpora in the previous section. However, many RE models are also applicable to supervised scenarios. To further verify the proposed model, we conducted an experiment on the supervised RE dataset EU-ADR,42 which contains gene-disease relations. Because the supervised corpus does not contain noise, we removed the RL module to evaluate the relation classifier PACNN alone. As shown in Supplementary Appendix D, BioBERT v1.0 (+PMC)41 achieved the highest F1-score among all single models. Its F1-score on the EU-ADR corpus is higher than that of our PACNN model, an advantage that stems from pretraining on large-scale biomedical corpora. To take advantage of BioBERT’s pretraining capability, we fused our PACNN model with BioBERT v1.0. The fusion model PACNN (+BioBERT v1.0)* outperforms all single models with the highest F1-score, improving on both PACNN and BioBERT v1.0 (+PMC). This result demonstrates the effectiveness of the PACNN model in RE and of the BioBERT model in understanding complex biomedical texts.
UMLS vs Semantic MEDLINE Database
The construction of distantly labeled data is based on a biomedical KB, whose quality can affect the final predictions. To study this effect, we compared UMLS, which we used for training, with the Semantic MEDLINE Database (SemMedDB),43 which contains approximately 94.0 million semantic predications extracted by SemRep.44 Sentences with a specific semantic predication (subject-PREVENTS-object or subject-TREATS-object) in SemMedDB are recorded as positive instances. The results of PACNN+RL on each KB are shown in Supplementary Appendix E. The proposed model achieved better performance with SemMedDB than with UMLS on both the may-prevent and may-treat test sets. This improvement may stem from the high quality of SemMedDB, which utilizes the NLP ability of SemRep.43 In addition, SemMedDB provides links between biomedical literature and structured semantic predications.
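The selection of positive instances described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the pipeline used in this study: the record fields (`subject`, `predicate`, `object`, `sentence`) and the function `label_positive` are hypothetical and do not reflect the actual SemMedDB schema, and the records are toy placeholders.

```python
from typing import Dict, List

# Mapping from the two SemMedDB predicates named in the text to task labels.
POSITIVE_PREDICATES = {"PREVENTS": "may-prevent", "TREATS": "may-treat"}


def label_positive(predications: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep sentences whose predication is PREVENTS or TREATS as positive instances."""
    labeled = []
    for p in predications:
        relation = POSITIVE_PREDICATES.get(p["predicate"])
        if relation is not None:
            labeled.append({
                "sentence": p["sentence"],
                "head": p["subject"],
                "tail": p["object"],
                "relation": relation,
            })
    return labeled


# Toy placeholder records (hypothetical field names and content).
toy = [
    {"subject": "drug A", "predicate": "TREATS", "object": "disease B",
     "sentence": "Drug A was used to treat disease B."},
    {"subject": "drug C", "predicate": "PREVENTS", "object": "disease D",
     "sentence": "Drug C prevented disease D."},
    {"subject": "gene E", "predicate": "AFFECTS", "object": "disease F",
     "sentence": "Gene E affects disease F."},
]

positives = label_positive(toy)
print([(x["head"], x["relation"], x["tail"]) for x in positives])
```

In this sketch, predications with any other predicate are simply dropped; how negative instances are constructed is not covered here.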
Limitations and future work
Because MetaMap is a rule-based approach that relies on a manually curated dictionary, it cannot offer the more accurate support that deep learning models provide.45 In the future, it can be replaced by deep learning–based biomedical NLP tools, such as ScispaCy46 and LATTE.47 Another limitation of this work is that the DDI corpus has 2 versions, 2011 and 2013, and we used only the 2011 version for testing. We used the earlier version because the DDI corpus, 2013, defines 4 types of DDI relationships (mechanism, effect, advice, int), and no existing KB provides such typed relations, which makes the construction of a corresponding distantly supervised set a nontrivial problem. In addition, some applications focus on detecting relations among multiple entities. For example, protein phosphorylation RE involves 3 entities (a substrate, a kinase, and a site). Our method currently focuses on RE with 2 entities, and multirelation extraction will also be a focus of our future work.
CONCLUSION
In this study, we propose the PACNN+RL method, which consists of a PACNN for encoding semantic information of biomedical text and an RL method with a memory backtracking mechanism for denoising the distantly labeled data. Our approach outperforms baselines on 7 benchmark datasets covering may-treat, may-prevent, DDI, and PPI relations. Further experiments and analysis indicate the reasons for the effectiveness of PACNN+RL and show that the RL agent is helpful in addressing the noisy data challenge. In conclusion, PACNN+RL is a versatile tool to aid automatic biomedical RE via biomedical literature mining techniques.
FUNDING
This work is supported by the fund of the joint project with Beijing Baidu Netcom Science Technology, the National Natural Science Foundation of China (Grant Nos. 61872113, 61876052, and 62006061), the Special Foundation for Technology Research Program of Guangdong Province (Grant No. 2015B010131010), the Strategic Emerging Industry Development Special Funds of Shenzhen (Grant No. JCYJ20180306172232154), and the CCF-Baidu Open Fund (Grant No. CCF-BAIDUOF2020004).
AUTHOR CONTRIBUTIONS
TZ proposed methods, designed and carried out the experiments, and drafted the manuscript. BH and QC supervised the research and participated in study design. YX critically revised the manuscript and made substantial contributions to interpreting the results. WP and YQ participated in manuscript review. All authors provided feedback and approved the final version of the manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
The authors have no competing interests to declare.
DATA AVAILABILITY STATEMENT
The data underlying this article are available at http://112.74.48.115:9000/.
REFERENCES
- 1. Zhao S, Su C, Lu Z, et al. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22 (3): bbaa057.
- 2. Wei CH, Harris BR, Li D, et al. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford) 2012; 2012: bas041.
- 3. Spasic I, Ananiadou S, McNaught J, et al. Text mining and ontologies in biomedicine: making sense of raw text. Brief Bioinform 2005; 6 (3): 239–51.
- 4. Ananiadou S. Advances of biomedical text mining for semantic search. Web Sci Med Domain 2011; 5.
- 5. Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res 2013; 41 (Web Server issue): W518–W522.
- 6. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 2001; 17 (2): 155–61.
- 7. Ciaramita M, Gangemi A, Ratsch E, et al. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: Proceedings of IJCAI; Edinburgh, Scotland, UK; 2005: 659–64.
- 8. Airola A, Pyysalo S, Björne J, et al. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 2008; 9 (Suppl 11): S2.
- 9. Bui QC, Sloot PM, van Mulligen EM, et al. A novel feature-based approach to extract drug-drug interactions from biomedical text. Bioinformatics 2014; 30 (23): 3365–71.
- 10. Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources. In: Proceedings of ISMB; Heidelberg, Germany; 1999: 77–86.
- 11. Mintz M, Bills S, Snow R, et al. Distant supervision for relation extraction without labeled data. In: Proceedings of ACL; Singapore; 2009: 1003–11.
- 12. Thomas P, Solt I, Klinger R, et al. Learning protein–protein interaction extraction using distant supervision. In: Proceedings of Workshop on Robust Unsupervised and Semisupervised Methods in Natural Language Processing; Hissar, Bulgaria; 2011: 25–32.
- 13. Li G, Wu C, Vijay-Shanker K. Noise reduction methods for distantly supervised biomedical relation extraction. In: Proceedings of BioNLP; Vancouver, Canada; 2017: 184–93.
- 14. Bobić T, Klinger R, Thomas P, et al. Improving distantly supervised extraction of drug-drug and protein-protein interactions. In: Proceedings of the Joint Workshop on Unsupervised and Semi-Supervised Learning in NLP; Avignon, France; 2012: 35–43.
- 15. Riedel S, Yao L, McCallum A. Modeling relations and their mentions without labeled text. In: Proceedings of ECML PKDD 2010; Barcelona, Spain; 6323: 148–63.
- 16. Hoffmann R, Zhang C, Ling X, et al. Knowledge-based weak supervision for information extraction of overlapping relations. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; Portland, Oregon; 2011: 541–50.
- 17. Surdeanu M, Tibshirani J, Nallapati R, et al. Multi-instance multi-label learning for relation extraction. In: Proceedings of EMNLP-CoNLL; Jeju Island, Korea; 2012: 455–65.
- 18. Zeng D, Liu K, Chen Y, et al. Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of EMNLP; Lisbon, Portugal; 2015: 1753–62.
- 19. Lin Y, Shen S, Liu Z, et al. Neural relation extraction with selective attention over instances. In: Proceedings of ACL; Berlin, Germany; 2016: 2124–33.
- 20. Ji G, Liu K, He S, et al. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In: Proceedings of AAAI; San Francisco, California; 2017: 3060–6.
- 21. Feng J, Huang M, Zhao L, et al. Reinforcement learning for relation classification from noisy data. In: Proceedings of AAAI; New Orleans, Louisiana; 2018: 5779–86.
- 22. Qin P, Xu W, Wang WY. Robust distant supervision relation extraction via deep reinforcement learning. In: Proceedings of ACL; Melbourne, Australia; 2018: 2137–47.
- 23. Roller R, Stevenson M. Held-out versus gold standard: comparison of evaluation strategies for distantly supervised relation extraction from medline abstracts. In: Proceedings of the 6th International Workshop on Health Text Mining and Information Analysis; Lisbon, Portugal; 2015: 97–102.
- 24. Segura-Bedmar I, Martínez Fernández P, Sánchez-Cisneros D. The 1st DDIExtraction-2011 challenge task: extraction of drug-drug interactions from biomedical texts. In: Proceedings of the 1st Challenge Task on Drug-Drug Interaction Extraction; Huelva, Spain; 2011: 1–9.
- 25. Bunescu R, Ge R, Kate RJ, et al. Comparative experiments on learning information extractors for proteins and their interactions. Artif Intell Med 2005; 33 (2): 139–55.
- 26. Pyysalo S, Ginter F, Heimonen J, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 2007; 8: 50.
- 27. Fundel K, Küffner R, Zimmer R. RelEx–relation extraction using dependency parse trees. Bioinformatics 2007; 23 (3): 365–71.
- 28. Ding J, Berleant D, Nettleton D, et al. Mining MEDLINE: abstracts, sentences, or phrases? In: Proceedings of PSB; Lihue, Hawaii; 2002: 326–37.
- 29. Nédellec C. Learning language in logic-genic interaction extraction challenge. In: Proceedings of the 4th Learning Language in Logic Workshop; Bonn, Germany; 2005: 1–7.
- 30. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004; 32 (Database issue): D267–D270.
- 31. Thomas P, Bobić T, Leser U, et al. Weakly labeled corpora as silver standard for drug-drug and protein-protein interaction. In: Proceedings of the Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM) on Language Resources and Evaluation Conference (LREC); Istanbul, Turkey; 2012.
- 32. Kim S, Liu H, Yeganova L, et al. Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 2015; 55: 23–30.
- 33. Quan C, Hua L, Sun X, et al. Multichannel convolutional neural network for biological relation extraction. Biomed Res Int 2016; 2016: 1850404.
- 34. Moen S, Ananiadou TSS. Distributional semantics resources for biomedical text processing. In: Proceedings of LBM; Ananiadou, Sophia; 2013: 39–44.
- 35. Zeng D, Liu K, Lai S, et al. Relation classification via convolutional deep neural network. In: Proceedings of COLING; Dublin, Ireland; 2014: 2335–44.
- 36. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 1992; 8 (3–4): 229–56.
- 37. Sutton RS, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of NIPS; Denver, Colorado; 1999: 1057–63.
- 38. Banuqitah H, Eassa F, Jambi K, et al. Two level self-supervised relation extraction from MEDLINE using UMLS. Int J Data Mining Knowl Manag Process 2016; 6 (3): 11–23.
- 39. Tikk D, Thomas P, Palaga P, et al. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Comput Biol 2010; 6 (7): e1000837.
- 40. Ye ZX, Ling ZH. Distant supervision relation extraction with intra-bag and inter-bag attentions. In: Proceedings of NAACL-HLT; Minneapolis, MN; 2019: 2810–9.
- 41. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36 (4): 1234–40.
- 42. van Mulligen EM, Fourrier-Reglat A, Gurwitz D, et al. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. J Biomed Inform 2012; 45 (5): 879–84.
- 43. Kilicoglu H, Shin D, Fiszman M, et al. SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics 2012; 28 (23): 3158–60.
- 44. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform 2003; 36 (6): 462–77.
- 45. Zhang Y, Zhang Y, Qi P, et al. Biomedical and clinical English model packages for the Stanza Python NLP library. JAMIA 2021; 28 (9): 1892–99.
- 46. Neumann M, King D, Beltagy I, et al. ScispaCy: Fast and robust models for biomedical natural language processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task; Florence, Italy; 2019: 319–27.
- 47. Zhu M, Celikkaya B, Bhatia P, et al. LATTE: Latent type modeling for biomedical entity linking. AAAI 2020; New York, NY; 34 (05): 9757–64.