2022 Jul 19;23(5):bbac282. doi: 10.1093/bib/bbac282

Table 2.

A summary of biomedical RE and event extraction datasets. A value of ‘-’ means that we could not find the number in the corresponding papers or websites. The SEN/DOC Levels column indicates whether relations are annotated at the ‘Sentence’, ‘Document’ or ‘Cross-sentence’ level. ‘Document’ includes abstracts, full texts and discharge records. ‘Cross-sentence’ allows the two entities in a relation to appear within three surrounding sentences.

Datasets # Doc./Sent. # Entities # Relations SEN/DOC Levels Descriptions
Protein–protein interaction
AIMed [37] 230 abstracts 4141 genes 1101 relations Sentence The AIMed dataset was created to develop and evaluate protein name recognition and protein–protein interaction (PPI) extraction. It contains 750 Medline abstracts that include the word ‘human’, with 5206 names. Two hundred abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins (DIP) [50] for PPI extraction and tagged for both 1101 protein interactions and 4141 protein names. Because negative examples of protein interactions were rare in the 200 abstracts, the authors manually selected 30 additional abstracts that mention more than one gene but contain no gene interactions.
BioInfer [6] 1100 sentences 4573 proteins 2662 relations Sentence A PPI dataset that uses ontologies defining the fine-grained types of entities (like ‘protein family or group’ and ‘protein complex’) and their relationships (like ‘CONTAIN’ and ‘CAUSE’). The authors developed a corpus of 1100 sentences containing full dependency annotation, dependency types and comprehensive annotation of bio-entities and their relationships.
BioCreative II PPI IPS [7] 1098 full-texts - - Document The BioCreative II PPI interaction pairs subtask (IPS) provides 750 and 356 full texts for the training and test sets, respectively. The full texts come with the corresponding gene mention symbols and PPI pairs.
Chemical–protein interaction
DrugProt [40] 5000 abstracts 65 561 chemicals, 61 775 genes 24 526 relations Sentence The DrugProt dataset, an extension of the ChemProt dataset, aims to promote the development of chemical-gene RE systems. It addresses 13 different chemical-gene relations, including regulatory, specific and metabolic relations.
Chemical–disease interaction
BC5CDR [9] 1500 abstracts 15 935 chemicals; 12 850 diseases 3106 relations Document BC5CDR consists of 1500 abstracts with chemical and disease mention annotations and their IDs. It annotates chemical-induced disease relations as ID pairs. Of these abstracts, 1400 were selected from a CTD-Pfizer collaboration-related dataset; the remaining 100 articles were newly curated and are used in the test set.
DDI and Drug–ADE (adverse drug effect) interaction
ADE [51] 2972 MEDLINE case reports 5063 drugs; 5776 adverse effects; 231 dosages 6821 drug-adverse effect relations; 279 drug-dosage relations Sentence The ADE dataset contains drugs and conditions, but the entities are not linked to standard database identifiers. Like most relation datasets, ADE annotates the relations (i.e. drug-ADE and drug-dosage relations) at the sentence level.
DDI13 [8] 905 documents 13 107 drugs 5028 relations Sentence The SemEval 2013 DDIExtraction dataset consists of 792 texts selected from the DrugBank database and 233 Medline abstracts. The corpus is annotated with 18 502 pharmacological substances and 5028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions.
n2c2 2018 ADE [52] 505 summaries 83 869 entities 59 810 relations - The discharge summaries are drawn from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database. The summaries were manually selected to contain at least one ADE and are annotated with nine concepts and eight relation pairs. The data are split into 303 summaries for training and 202 for testing.
Variant/gene–disease interaction
EMU [21] 110 abstracts - 179 relations Document The EMU dataset focuses on finding relationships between mutations and their corresponding disease phenotypes. The authors use ‘MeSH = mutation’ to select abstracts and MetaMap [53] to annotate them; the abstracts are divided into those containing mutations related to prostate cancer (PCa) and those related to breast cancer (BCa). They then use rules and patterns to select subsets of the PCa and BCa abstracts for annotation.
RENET2 [54] 1000 abstracts, 500 full-texts - - Document It contains 1000 abstracts (from RENET [55]) and 500 full texts from the PMC open-access subset. To improve quality, 500 of the abstracts were refined. The authors used these 500 abstracts to train the RENET2 model and performed training data expansion using the other 500 abstracts. They then used the model trained on the 1000 abstracts to construct the set of 500 full-text articles.
Drug–gene mutation
N-ary [56] - - 3462 triples; 137 469 drug–gene relations; 3192 drug–mutation relations Cross-sentence The authors use distant supervision to construct a cross-sentence drug–gene mutation RE dataset. They use 59 distinct drug–gene mutation triples from knowledge bases to extract 3462 ternary positive relation triples. Negative instances are generated by randomly sampling entity pairs/triples without an interaction.
Event extraction
GE09 [57] 1200 abstracts - 13 623 events Sentence As the first BioNLP shared task (ST), it aimed to define a bounded, well-defined GENIA event extraction (GE) task, considering both the actual needs and the state of the art in biomedical text mining (bio-TM) technology, and to pursue it as a community-wide effort.
GE11 [58] 1210 abstracts, 14 full texts 21 616 proteins 18 047 events Sentence The BioNLP ST 2011 GE task follows the task definition of the BioNLP ST 2009. BioNLP ST 2011 took the role of measuring the progress of the community and generalizing IE technology to full papers.
CG [59] 600 abstracts 21 683 entities 17 248 events; 917 relations Sentence The BioNLP ST 2013 Cancer Genetics (CG) corpus contains annotations of over 17 000 events in 600 documents. The task addresses entities and events at all levels of biological organization, from the molecular to the whole organism, and involves pathological and physiological processes.
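
The ‘Cross-sentence’ level and the distant-supervision construction described for the N-ary dataset can be illustrated with a small sketch. It is not the authors' pipeline. It assumes that ‘three surrounding sentences’ means the two mentions fit within three consecutive sentences, it handles drug–gene pairs only (not ternary triples), and the function names, the mention format and the example knowledge base are all hypothetical.

```python
import random


def candidate_pairs(mentions, max_span=3):
    """Return sorted (drug, gene) pairs whose mentions fit in `max_span` consecutive sentences.

    `mentions` is a list of (name, entity_type, sentence_index) tuples
    (a hypothetical format chosen for this sketch).
    """
    drugs = [(name, idx) for name, kind, idx in mentions if kind == "drug"]
    genes = [(name, idx) for name, kind, idx in mentions if kind == "gene"]
    # Two mentions fit in `max_span` consecutive sentences when their
    # sentence indices differ by less than `max_span`.
    return sorted(
        {(d, g) for d, di in drugs for g, gi in genes if abs(di - gi) < max_span}
    )


def label_by_distant_supervision(pairs, kb_pairs, neg_per_pos=1, seed=0):
    """Split candidate pairs into positives (found in the knowledge base)
    and randomly sampled negatives (not found in the knowledge base)."""
    positives = [p for p in pairs if p in kb_pairs]
    unlabeled = [p for p in pairs if p not in kb_pairs]
    # Never ask for more negatives than there are unlabeled candidates.
    n_neg = min(len(unlabeled), neg_per_pos * len(positives))
    negatives = random.Random(seed).sample(unlabeled, n_neg)
    return positives, negatives


if __name__ == "__main__":
    # Toy document: mention name, entity type, index of the sentence it occurs in.
    mentions = [
        ("gefitinib", "drug", 0),
        ("EGFR", "gene", 1),
        ("erlotinib", "drug", 3),
        ("KRAS", "gene", 4),
    ]
    kb = {("gefitinib", "EGFR")}  # hypothetical knowledge-base entry
    pairs = candidate_pairs(mentions)
    print(label_by_distant_supervision(pairs, kb))
```

In this toy document, gefitinib and KRAS are four sentences apart, so that pair is never proposed as a candidate. A sentence-level dataset would correspond to `max_span=1`.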