AMIA Annual Symposium Proceedings. 2023 Apr 29;2022:425–431.

MERIT: Minimal SupErvision Through Label Augmentation for Biomedical RelatIon ExTraction

Saman Enayati 1, Slobodan Vucetic 1
PMCID: PMC10148331  PMID: 37128402

Abstract

Relation Extraction (RE) is an important task in extracting structured data from free biomedical text. Obtaining the labeled data needed to train RE models in specialized domains such as biomedicine can be very expensive because it requires expert knowledge. Thus, it is often the case that RE models need to be trained from relatively small labeled data sets. Despite recent advances in Natural Language Processing (NLP) approaches for RE, training accurate RE models from small labeled data is still an open challenge. In this paper, we propose MERIT, a simple and effective approach for label augmentation that automatically increases the size of labeled data while introducing only moderate labeling noise. We performed extensive experiments on three benchmark biomedical RE data sets. The results demonstrate the effectiveness of MERIT compared to the baseline.

Introduction

Relation Extraction (RE) is defined as classifying a type of relationship between a pair of entities occurring in a text passage. For example, given the sentence “Ciprofloxacin has some effect on Pyelonephritis”, we can infer the relation “May Treat” between two entities Ciprofloxacin and Pyelonephritis, and form a triplet (subject, relation, object). The extracted triplets from the text can be used for knowledge base population, question answering, or information retrieval. Recent advances in deep learning allow training very accurate12 models41,19,10,39,24 when a large amount of human-annotated training data is available. However, collecting training data is a costly and human-intensive process. In some specialized domains such as information extraction from biomedical documents, human annotation is particularly challenging and costly because it can only be done by biomedical experts. Therefore, RE training data in the biomedical domain can often be very small and result in RE models with low accuracy.

Weak or distant labeling is a popular approach for addressing label scarcity in many applications, including RE18,4. Distant supervision was previously applied to heuristically align entities to a given knowledge base (KB) with little annotation effort26. However, that approach requires a KB and cannot take into consideration the context of entity co-occurrence. In weak labeling, labeling rules based on string matching are used to automatically provide labeled data29,42,18,23,31,30, which is often more efficient than using a KB for distant supervision42. However, exact string matching limits the generalizability of the rules2 and thereby causes low data coverage and labeling noise. To tackle the labeling noise issue, data programming31,30 aims to annotate the corpus by fitting a model that resolves disagreements among the rules. Approaches to address the coverage of the labeling rules include differentiable soft-assignment of the rules to the unlabeled portion of the corpus42,32,25. Although these approaches provide higher rule coverage, they still suffer from labeling noise. Moreover, generating labeling rules is a costly and inexact process that depends on the skill of a user to convert their knowledge into useful rules. There have also been efforts to reduce annotation costs in biomedical informatics through effective visual interfaces for annotation8 and semi-structured annotation16. However, these solutions are less applicable to the RE domain.

As an alternative to weak labeling, we propose a simple and efficient approach, MERIT, to automatically increase the number of labels given a small, labeled data set. Our main observation is that the nearest neighbors of a sentence representing a particular relation between entities are likely to43 represent the same relation. Thus, given a labeled sentence, we transfer its label to all its neighboring sentences. An open question to be studied in this paper is defining the neighborhood in the RE task, which requires us to define an appropriate distance measure and an appropriate distance threshold. The benefit of MERIT is that it does not require any expert knowledge, is easy to implement, and is computationally inexpensive. The proposed approach can be combined with other approaches dealing with data labeling scarcity such as weak labeling and active learning with uncertainty-based sampling34,37,38 and clustering-based sampling27,35.

Complementing recent efforts on synthetic training data augmentation28,11 and other types of label augmentation35, we show that MERIT successfully exploits the unlabeled portion of corpus to augment the limited available hand-labeled data with high-quality weak labels, which results in a more accurate RE model. We perform extensive experiments on three benchmark biomedical relation extraction datasets.

Task Formulation and Background

Given a sentence Si and an entity pair (esubj, eobj), where esubj is the subject entity and eobj is the object entity, the RE task can be transformed into a classification problem. A classifier can be built to map the relation between the subject and object entities into a predefined set of relation types R ∪ {NA}, where NA denotes that there is no relation between the pair of entities.
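As a concrete illustration of this formulation, the sketch below defines a minimal container for one RE instance; the class name, field names, and placeholder label set are ours for illustration, not artifacts of the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder label set; each benchmark defines its own relation types plus NA.
RELATIONS = ["may_treat", "NA"]

@dataclass
class REInstance:
    sentence: str
    subj: str                       # subject entity mention
    obj: str                        # object entity mention
    relation: Optional[str] = None  # one of RELATIONS, or None if unlabeled

example = REInstance(
    sentence="Ciprofloxacin has some effect on Pyelonephritis",
    subj="Ciprofloxacin", obj="Pyelonephritis", relation="may_treat")
```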

In order to represent the candidate relation pair in an input sequence, we replace the relevant entity names with their semantic types, followed by the standard preprocessing steps for RE6. An example of an input representation to the RE model is “We further show that @PROTEIN$ directly interacts with @PROTEIN$ and Rpn4.”
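A minimal sketch of this masking step is shown below; the helper name and the unmasked protein names are made up for illustration, since only the already-masked sentence appears in the paper.

```python
def mask_entities(sentence: str, subj: str, obj: str,
                  subj_type: str, obj_type: str) -> str:
    """Replace the entity mentions with their semantic types (e.g., @PROTEIN$)."""
    return (sentence.replace(subj, f"@{subj_type}$")
                    .replace(obj, f"@{obj_type}$"))

# Hypothetical unmasked sentence; only the masked form is given in the paper.
print(mask_entities(
    "We further show that Mad1 directly interacts with Cdc20 and Rpn4.",
    "Mad1", "Cdc20", "PROTEIN", "PROTEIN"))
# -> We further show that @PROTEIN$ directly interacts with @PROTEIN$ and Rpn4.
```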

We fine-tune the SciBERT model3 with a linear classification layer added on top of it to predict the relation type of a candidate pair. In other words, given a sequence Si = (w1, … , wn), where wi is the ith token in the sequence, we feed Si into SciBERT and retrieve the hidden state representation of the sequence (called CLS) along with the word representations,

(h_{CLS}, h_1, \ldots, h_n) = \mathrm{SciBERT}(w_1, \ldots, w_n),

where h_i is the representation of the ith token in a d-dimensional space. As is typical with BERT6, we use h_{CLS}, the aggregate representation of the sequence, as the input to the classification layer. A softmax layer is added to output a probabilistic label for the sentence,

z_i = \mathrm{softmax}(W_t h_{CLS}),

where W_t is the matrix of learnable parameters of the linear layer, and z_i is the vector of probabilities assigned to the relation types.
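A minimal sketch of this classifier is given below, assuming the HuggingFace transformers and PyTorch libraries and the publicly released allenai/scibert_scivocab_uncased checkpoint; the class and variable names are ours.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class REClassifier(nn.Module):
    """SciBERT encoder with a linear + softmax head over relation types."""
    def __init__(self, num_relations: int,
                 model_name: str = "allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = out.last_hidden_state[:, 0]            # aggregate [CLS] representation
        return torch.softmax(self.linear(h_cls), -1)   # z_i: probabilities over relations
```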

Methodology

The intuition behind our approach, MERIT, stems from the assumption that the neighbors of a data point (according to a distance metric) form a local community that shares the same label. We hypothesize that if such a local community exists for a labeled data point xi, we can transfer its label to its neighbors. We refer to the process of automatically assigning labels to the neighbors of labeled data points as weak labeling. As a result, we augment the training labels by utilizing expert supervision to generate high-quality weak labels, which can then be used to boost the performance of any supervised RE model.

There are two key parameters in our approach that impact the quality of weak labeling: 1) the feature representation, and 2) the distance threshold. Figure 1 illustrates the proposed approach by depicting a local community in 2-dimensional space.

Figure 1. Label augmentation through local community search. The data points marked with X denote strong labels, while the data points highlighted in bold color are the weak labels.

There are several choices for embedding RE data points as vectors, such as semantic (contextual) or static embeddings7,6. However, these representations are too general and are not explicitly designed to improve the quality of weak labeling. Therefore, we aim to generate an embedding that captures a semantic representation explicitly aimed at improving the quality of weak labeling. We apply a dependency parser to a sequence to extract the shortest dependency path (SDP) between the two entities. As has been shown in previous work20,40, the SDP can boost the performance of RE models and provide strong hints about the relation between entities.

Let T be a rooted parse tree corresponding to sequence Si. Given a pair of entities (esubj, eobj), the SDP is defined as the minimal set of tokens on the path from esubj to eobj through the dependency tree T. For example, in the sentence “Chemical caused a dose-dependent reduction in plasma Gene”, the terms (caused, reduction) are the SDP tokens between the entities (Chemical, Gene).
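The sketch below shows one way such a path could be extracted, using spaCy and networkx as stand-ins; the paper does not name the parser it used, and the model name and single-token entity assumption are ours.

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # assumption: any dependency parser would do

def shortest_dependency_path(sentence: str, subj: str, obj: str) -> list:
    """Return the tokens on the shortest path between subj and obj in the parse tree."""
    doc = nlp(sentence)
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)
    subj_i = next(t.i for t in doc if t.text == subj)
    obj_i = next(t.i for t in doc if t.text == obj)
    path = nx.shortest_path(graph, source=subj_i, target=obj_i)
    return [doc[i].text for i in path]

# Depending on the parse, this yields roughly ['Chemical', 'caused', 'reduction', ..., 'Gene'].
print(shortest_dependency_path(
    "Chemical caused a dose-dependent reduction in plasma Gene", "Chemical", "Gene"))
```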

In order to integrate the SDP information into the feature space, we take the average embedding of the SDP tokens and concatenate it with the entity representations obtained from SciBERT3,

h_i = \mathrm{concat}\left( \frac{1}{k} \sum_{j \in \mathrm{SDP}} h_j,\; h_{subj},\; h_{obj} \right),

where h_i corresponds to the final representation for sentence Si and k is the number of SDP tokens.
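A small sketch of this composition follows, assuming the token embeddings have already been obtained from SciBERT and the index arguments are known; the function name is ours.

```python
import torch

def merit_representation(hidden_states: torch.Tensor,
                         sdp_idx: list, subj_idx: int, obj_idx: int) -> torch.Tensor:
    """h_i = concat(mean of SDP token embeddings, h_subj, h_obj).

    hidden_states: (seq_len, d) token embeddings for one sentence.
    """
    sdp_avg = hidden_states[sdp_idx].mean(dim=0)          # (d,)
    h_subj = hidden_states[subj_idx]                      # (d,)
    h_obj = hidden_states[obj_idx]                        # (d,)
    return torch.cat([sdp_avg, h_subj, h_obj], dim=-1)    # (3d,)
```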

The second key decision in our framework is how to define the local community of a data point xi according to the described feature representation. We consider a threshold θ below which xj cannot belong to the local community of xi. To speed up the computation of local communities, we first cluster the data using the k-means6 clustering algorithm; since we only search for local communities within a data point's cluster, k-means clustering reduces the search space by an order of magnitude. Finally, we use cosine similarity as the distance measure to identify the local communities. The equation below defines the corresponding function:

\mathrm{LocCommunity}(x_i, x_j) = \begin{cases} x_j, & \text{if } \mathrm{distance}(x_i, x_j) \geq \theta \\ \emptyset, & \text{otherwise} \end{cases}
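The following sketch puts the two ingredients together: k-means to restrict the search space and a cosine-similarity threshold to transfer labels. It assumes scikit-learn and numpy; the function name, the dictionary-based label encoding, and the number of clusters are our choices, not values given in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def augment_labels(embeddings: np.ndarray, labels: dict,
                   theta: float = 0.9, n_clusters: int = 50) -> dict:
    """Transfer each strong label to unlabeled points in the same k-means
    cluster whose cosine similarity to the labeled point is at least theta.

    embeddings: (n, d) sentence representations; labels: {index: relation}.
    """
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    augmented = dict(labels)                       # strong labels are kept as-is
    for i, y in labels.items():
        same_cluster = np.where(cluster_ids == cluster_ids[i])[0]
        sims = cosine_similarity(embeddings[i:i + 1], embeddings[same_cluster])[0]
        for j, s in zip(same_cluster, sims):
            if j not in labels and s >= theta:
                augmented[int(j)] = y              # weak label via label transfer
    return augmented
```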

Experiments

First, we describe the characteristics of the datasets that we used. Next, we explain the experimental design. Finally, we discuss the results.

Datasets: We evaluate our approach on three benchmark biomedical RE datasets. The characteristics of these datasets are shown in Table 1. For the ChemProt and DDI datasets, we used the same train, development, and test splits as described in previous work17,14. For the PPI dataset, we utilized the AIMed corpus5 and performed 5-fold cross-validation due to the lack of a standard train/test split. The ChemProt and DDI tasks are multi-class classification problems. For ChemProt, there are 6 relation types that capture interactions between chemicals and proteins. The DDI dataset contains 5 relations that correspond to interactions between drugs: advice, effect, int, mechanism, and negative. The PPI task is a binary classification problem for extracting human protein-protein interactions.

Table 1.

Statistics of the datasets used for label augmentation.

Dataset Train Validation Test #relations
ChemProt 18k 11.2k 15.7k 6
DDI 22k 5.5k 5.7k 5
PPI 5.2k - 583 2

Evaluation metric and experiment design: We report standard precision (P), recall (R), and F1-score (F1) for binary classification, and micro-P, micro-R, and micro-F1 for multi-class classification. The evaluation metrics are as follows:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot P \cdot R}{P + R},

where TP refers to True Positives, FP to False Positives, and FN to False Negatives. For multi-class classification, the TP count in micro-P and micro-R is aggregated over all positive classes (the NA type is treated as the negative label).

Micro-F1 is computed from the total counts of TPs, FNs, and FPs across all classes. We fine-tuned SciBERT for the RE task. We set the maximum sequence length to 200, the batch size to 16, the distance similarity threshold to 0.9, the learning rate to 2e-5, and the number of epochs to 10. The remaining hyperparameters were kept at their default values.
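As a point of reference, the micro-averaged metrics over the positive relation types can be computed with scikit-learn as sketched below; restricting labels= to the positive classes (excluding NA) is our reading of the evaluation setup.

```python
from sklearn.metrics import precision_recall_fscore_support

def micro_scores(y_true, y_pred, positive_labels):
    """Micro-averaged P, R, F1 over the positive relation types (NA excluded)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=positive_labels, average="micro", zero_division=0)
    return p, r, f1

# Toy usage with hypothetical labels.
p, r, f1 = micro_scores(["rel_a", "NA", "rel_b"], ["rel_a", "rel_b", "rel_b"],
                        positive_labels=["rel_a", "rel_b"])
```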

Results: We compared the effectiveness of our approach with a random sampling baseline (RS). We ran experiments with labeling budgets of 100, 200, and 500 expert annotations. For both approaches, we randomly sampled examples from the corpus and trained the RE model on the collected samples. Compared to the RS baseline, MERIT has the added benefit of leveraging the weak labels along with the strong labels. This experiment evaluated the effectiveness of MERIT as an extension to any supervised RE model. Figure 2 demonstrates a significant improvement of our approach over the RS baseline. Because of the randomness in sample selection, we conducted all experiments over 3 independent runs and report the average performance. The results show that in the scenario with the least annotated data (100 labels), the F1 score is higher by about 0.20 compared to the RS baseline. In addition, as we increased the amount of hand-labeled data (up to 500), MERIT outperformed the baseline by up to 36% in F1 score. This shows that the weak labels did not damage the performance when larger labeled datasets were available.
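The comparison can be summarized with the hypothetical driver below, which reuses augment_labels and the REInstance container from the earlier sketches; train_and_eval stands in for fine-tuning the RE model and returning an F1 score and is not part of the paper.

```python
import random

def run_comparison(corpus, embeddings, budget, train_and_eval, theta=0.9):
    """RS trains only on the sampled budget; MERIT augments it with weak labels first."""
    sampled = random.sample(range(len(corpus)), budget)
    strong = {i: corpus[i].relation for i in sampled}     # expert annotations
    f1_rs = train_and_eval(strong)                        # random-sampling baseline
    weak = augment_labels(embeddings, strong, theta=theta)
    f1_merit = train_and_eval(weak)                       # strong + weak labels
    return f1_rs, f1_merit
```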

Figure 2. Comparison of our approach with the RS baseline on three benchmark biomedical RE datasets.

Ablation Study: To further evaluate the effect of each parameter in our approach, we performed an ablation study that measured the impact of the distance threshold and the feature representation. We conducted the experiment with a budget of 200 and threshold values in the range {0.7, 0.75, 0.80, 0.85, 0.9, 0.95} on the three datasets. As Figure 3 illustrates, as the threshold decreased from 0.95 to 0.7, more weak labels were added. This in turn led to more label noise, damaging the performance when the threshold was too low. We found that 0.90 was the optimal threshold value for all three datasets, offering a good tradeoff between the number of weak labels and their quality. This parameter could be further optimized during the learning process, but this is left for future research.
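A hypothetical sweep over these threshold values is sketched below; it reuses augment_labels and the train_and_eval stand-in from the earlier sketches.

```python
def threshold_sweep(embeddings, strong, train_and_eval,
                    thetas=(0.70, 0.75, 0.80, 0.85, 0.90, 0.95)):
    """Re-run label augmentation and training for each candidate threshold."""
    results = {}
    for theta in thetas:
        weak = augment_labels(embeddings, strong, theta=theta)
        results[theta] = {"num_weak": len(weak) - len(strong),
                          "f1": train_and_eval(weak)}
    return results
```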

Figure 3. The impact of the threshold on the final performance. A labeling budget of 200 was used for this experiment.

We also explored the impact of several different embeddings for the distance function calculation. We performed the following experiment on the ChemProt dataset with a labeling budget of 200. We considered the following choices for the embedding (a sketch constructing these variants follows the list):

  • CLS (dimension L) is an aggregate representation of all the tokens in a sentence;

  • ent-avg (L) is the average embedding of the entities in a sentence;

  • ent-sdp-avg (L) is the average embedding of the entities and SDP tokens;

  • ent-concat (2L) is the concatenation of the embeddings of the two entities in a sentence;

  • ent-words-between (3L) is the concatenation of the embeddings of the two entities along with the average representation of all the words between the two entities;

  • ent-concat-sdp-avg (3L) is the concatenation of the embeddings of the two entities along with the average representation of the SDP tokens.
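The sketch below shows one way these variants could be assembled from per-sentence SciBERT token embeddings; the function name and index arguments are assumptions made for illustration.

```python
import torch

def build_variant(name: str, h: torch.Tensor, subj_idx: int, obj_idx: int,
                  sdp_idx: list, between_idx: list, cls_idx: int = 0) -> torch.Tensor:
    """h is the (seq_len, d) matrix of token embeddings for one sentence."""
    ents = h[[subj_idx, obj_idx]]                                            # (2, d)
    if name == "CLS":
        return h[cls_idx]                                                    # (d,)
    if name == "ent-avg":
        return ents.mean(0)                                                  # (d,)
    if name == "ent-sdp-avg":
        return torch.cat([ents, h[sdp_idx]]).mean(0)                         # (d,)
    if name == "ent-concat":
        return torch.cat([h[subj_idx], h[obj_idx]])                          # (2d,)
    if name == "ent-words-between":
        return torch.cat([h[subj_idx], h[obj_idx], h[between_idx].mean(0)])  # (3d,)
    if name == "ent-concat-sdp-avg":
        return torch.cat([h[subj_idx], h[obj_idx], h[sdp_idx].mean(0)])      # (3d,)
    raise ValueError(f"unknown variant: {name}")
```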

As shown in Figure 4, both the ent-concat-sdp-avg and ent-sdp-avg representations outperformed the non-SDP-based representations. This highlights the fact that the SDP provides the information the distance function needs to group similar examples when computing the local community. The ent-concat-sdp-avg representation increased the final F1 score by 163% over CLS and by 52% over the ent-avg and ent-concat embeddings. In addition, the result for the ent-words-between embedding shows that adding unnecessary context was not beneficial for training; however, it still outperformed the F1 score of the CLS baseline by 27%. Finally, averaging the embeddings performed slightly worse than concatenating them (as illustrated in Figure 4).

Figure 4. Comparison of different feature representations on the ChemProt dataset with a labeling budget of 200.

Conclusion

Obtaining the labeled data needed to train RE models in the biomedical domain can be very expensive because it requires expert knowledge. Thus, RE models often need to be trained from relatively small labeled datasets. Despite recent advances in Natural Language Processing (NLP) approaches for RE, training accurate RE models from small data is still an open challenge. In this paper, we proposed MERIT, a simple and effective approach for label augmentation that automatically increases the size of labeled data while introducing only moderate labeling noise. To represent an entity pair, MERIT finds the shortest dependency path between the entities and concatenates the average embedding of the words on the shortest path with the embeddings of the entities. Our experiments on three benchmark biomedical RE data sets showed that the proposed representation results in superior accuracy. They further showed that the proposed weak labeling results in higher accuracy on all three RE datasets over a range of experimental conditions, including varying numbers of labeled examples.


References

  • 1.Ankerst M, Breunig MM, Kriegel HP, Sander J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod Record. 1999 Jun 1;28(2):49–60. [Google Scholar]
  • 2.Batista DS, Martins B, Silva MJ. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015 Sep. pp. 499–504.
  • 3.Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. 2019 Mar 26.
  • 4.Boudjellal N, Zhang H, Khan A, Ahmad A. Biomedical relation extraction using distant supervision. Scientific Programming. 2020 Jun 16.
  • 5.Bunescu R, Ge R, Kate RJ, Marcotte EM, Mooney RJ, Ramani AK, Wong YW. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine. 2005 Feb 1;33(2):139–55. doi: 10.1016/j.artmed.2004.07.016. [DOI] [PubMed] [Google Scholar]
  • 6.Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013 Jan 16.
  • 7.Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018 Oct 11.
  • 8.Enayati S, Yang Z, Lu B, Vucetic S. A visualization approach for rapid labeling of clinical notes for smoking status extraction. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances. 2021 Jun. pp. 24–30.
  • 9.Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. 1996 Aug 2;Vol. 96, No. 34:226–231. [Google Scholar]
  • 10.Gupta P, Rajaram S, Schütze H, Runkler T. Neural relation extraction within and across sentence boundaries. In Proceedings of the AAAI Conference on Artificial Intelligence. 2019 Jul 17;Vol. 33, No. 01:6513–6520. [Google Scholar]
  • 11.Hassantabar S, Dai X, Jha NK. STEERAGE: Synthesis of neural networks using architecture search and grow-and-prune methods. arXiv preprint arXiv:1912.05831. 2019 Dec 12.
  • 12.Hassantabar S, Wang Z, Jha NK. SCANN: Synthesis of compact and accurate neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2021 Sep 29.
  • 13.Hendrickx I, Kim SN, Kozareva Z, Nakov P, Séaghdha DO, Padó S, Pennacchiotti M, Romano L, Szpakowicz S. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv preprint arXiv:1911.10422. 2019 Nov 23.
  • 14.Herrero-Zazo M, Segura-Bedmar I, Martínez P, Declerck T. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of Biomedical Informatics. 2013 Oct 1;46(5):914–20. doi: 10.1016/j.jbi.2013.07.011. [DOI] [PubMed] [Google Scholar]
  • 15.Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967 Sep;32(3):241–54. doi: 10.1007/BF02289588. [DOI] [PubMed] [Google Scholar]
  • 16.Katic T, Pavlovski M, Sekulic D, Vucetic S. Learning semi-structured representations of radiology reports. arXiv preprint arXiv:2112.10746. 2021 Dec 20.
  • 17.Krallinger M, Rabal O, Akhondi SA, Pérez MP, Santamaría J, Rodríguez GP, Tsatsaronis G, Intxaurrondo A, López JA, Nandal U. Overview of the BioCreative VI chemical-protein interaction Track. In Proceedings of the Sixth BioCreative Challenge Evaluation Workshop. 2017 Oct;Vol. 1:141–146. [Google Scholar]
  • 18.Krasakis AM, Kanoulas E, Tsatsaronis G. Semi-supervised ensemble learning with weak supervision for biomedical relationship extraction. In Automated Knowledge Base Construction (AKBC) 2018 Nov 17.
  • 19.Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234–40. doi: 10.1093/bioinformatics/btz682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Li Z, Yang Z, Shen C, Xu J, Zhang Y, Xu H. Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text. BMC Medical Informatics and Decision Making. 2019 Jan;19(1):1–8. doi: 10.1186/s12911-019-0736-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Likas A, Vlassis N, Verbeek JJ. The global k-means clustering algorithm. Pattern Recognition. 2003 Feb 1;36(2):451–61. [Google Scholar]
  • 22.Lin H, Yan J, Qu M, Ren X. Learning dual retrieval module for semi-supervised relation extraction. In The World Wide Web Conference. 2019 May 13. pp. 1073–1083.
  • 23.Liu L, Ren X, Zhu Q, Zhi S, Gui H, Ji H, Han J. Heterogeneous supervision for relation extraction: A representation learning approach. arXiv preprint arXiv:1707.00166. 2017 Jul 1.
  • 24.Malekzadeh M, Hajibabaee P, Heidari M, Zad S, Uzuner O, Jones JH. Review of graph neural network in text classification. In Proceedings of the IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) 2021 Dec 1. pp. 0084–0091.
  • 25.Meng Y, Shen J, Zhang C, Han J. Weakly-supervised neural text classification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018 Oct 17. pp. 983–992.
  • 26.Mintz M, Bills S, Snow R, Jurafsky D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 2009 Aug. pp. 1003–1011.
  • 27.Nguyen HT, Smeulders A. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning. 2004 Jul 4. (p. 79)
  • 28.Papanikolaou Y, Pierleoni A. Dare: Data augmented relation extraction with GPT-2. arXiv preprint arXiv:2004.13845. 2020 Apr 6.
  • 29.Qu M, Ren X, Zhang Y, Han J. Weakly-supervised relation extraction by pattern-enhanced embedding learning. In Proceedings of the 2018 World Wide Web Conference. 2018 Apr 10. pp. 1257–1266.
  • 30.Ratner A, Bach SH, Ehrenberg H, Fries J, Wu S, Ré C. Snorkel: Rapid training data creation with weak supervision. The VLDB Journal. 2020 May;29(2):709–30. doi: 10.1007/s00778-019-00552-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ratner AJ, De Sa CM, Wu S, Selsam D, Ré C. Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems. 2016. p. 29. [PMC free article] [PubMed]
  • 32.Ren W, Li Y, Su H, Kartchner D, Mitchell C, Zhang C. Denoising multi-source weak supervision for neural text classification. arXiv preprint arXiv:2010.04582. 2020 Oct 9.
  • 33.Rosenberg C, Hebert M, Schneiderman H. Semi-supervised self-training of object detection models. Carnegie Mellon University. 2005.
  • 34.Seung HS, Opper M, Sompolinsky H. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory. 1992 Jul 1. pp. 287–294.
  • 35.Solmaz G, Cirillo F, Maresca F, Kumar AG. Label Augmentation with Reinforced Labeling for Weak Supervision. arXiv preprint arXiv:2204.06436. 2022 Apr 13.
  • 36.Wang M, Min F, Zhang ZH, Wu YX. Active learning through density clustering. Expert Systems with Applications. 2017 Nov 1;85:305–17. [Google Scholar]
  • 37.Wang R, Chen D, Kwong S. Fuzzy-rough-set-based active learning. IEEE Transactions on Fuzzy Systems. 2013 Nov 20;22(6):1699–704. [Google Scholar]
  • 38.Wang R, Chow CY, Kwong S. Ambiguity-based multiclass active learning. IEEE Transactions on Fuzzy Systems. 2015 Jul 1;24(1):242–8. [Google Scholar]
  • 39.Wei Q, Ji Z, Si Y, Du J, Wang J, Tiryaki F, Wu S, Tao C, Roberts K, Xu H. Relation extraction from clinical narratives using pre-trained language models. In AMIA Annual Symposium Proceedings. 2019;Vol. 2019:1236. [PMC free article] [PubMed] [Google Scholar]
  • 40.Zhang Y, Zheng W, Lin H, Wang J, Yang Z, Dumontier M. Drug–drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics. 2018 Mar 1;34(5):828–35. doi: 10.1093/bioinformatics/btx659. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Zhou W, Huang K, Ma T, Huang J. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI Conference on Artificial Intelligence. 2021 Jan 1;Vol. 35, No. 16:14612–14620. [Google Scholar]
  • 42.Zhou W, Lin H, Lin BY, Wang Z, Du J, Neves L, Ren X. Nero: A neural rule grounding framework for label-efficient relation extraction. In Proceedings of The Web Conference. 2020 Apr 20. pp. 2166–2176.
  • 43.Zhu X, Ghahramani Z. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University. 2002.

