Brief Bioinform. 2022 Jul 19;23(5):bbac282. doi: 10.1093/bib/bbac282

Table 2.

A summary of biomedical RE and event extraction datasets. A value of ‘-’ means that we could not find the number in the corresponding papers or websites. The SEN/DOC Levels column indicates whether relations are annotated at the ‘Sentence’, ‘Document’ or ‘Cross-sentence’ level. ‘Document’ covers abstracts, full texts and discharge records. ‘Cross-sentence’ allows the two entities of a relation to appear within three surrounding sentences.

Datasets	# Doc./Sent.	# Entities	# Relations	SEN/DOC Levels	Descriptions
Protein–protein interaction
AIMed [37]	230 abstracts	4141 genes	1101 relations	Sentence	The AIMed dataset was built to develop and evaluate protein name recognition and protein–protein interaction (PPI) extraction. It contains 750 Medline abstracts that include the word ‘human’ and has 5206 names. For PPI extraction, 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins (DIP) [50] and tagged with 1101 protein interactions and 4141 protein names. Because negative examples of protein interactions were rare in these 200 abstracts, the authors manually selected 30 additional abstracts that mention more than one gene but contain no gene interactions.
BioInfer [6]	1100 sentences	4573 proteins	2662 relations	Sentence	A PPI dataset that uses ontologies to define fine-grained types of entities (such as ‘protein family or group’ and ‘protein complex’) and their relationships (such as ‘CONTAIN’ and ‘CAUSE’). The authors developed a corpus of 1100 sentences with full dependency annotation, dependency types and comprehensive annotation of bio-entities and their relationships.
BioCreative II PPI IPS [7]	1098 full-texts	-	-	Document	The BioCreative II PPI protein interaction pairs subtask (IPS) provides 750 and 356 full texts for training and test sets, respectively. The full text includes corresponding gene mention symbols and PPI pairs.
Chemical–protein interaction
DrugProt [40]	5000 abstracts	65 561 chemicals, 61 775 genes	24 526 relations	Sentence	The DrugProt dataset, an extension of the ChemProt dataset, aims to promote the development of chemical-gene RE systems. It addresses 13 different chemical-gene relations, including regulatory, specific and metabolic relations.
Chemical–disease interaction
BC5CDR [9]	1500 abstracts	15 935 chemicals; 12 850 diseases	3106 relations	Document	BC5CDR consists of 1500 abstracts with chemical and disease mention annotations and their IDs. It also annotates chemical-induced disease relations as ID pairs. Of these, 1400 abstracts were selected from a CTD-Pfizer collaboration-related dataset; the remaining 100 articles were newly curated and are used in the test set.
DDI and Drug–ADE (adverse drug effect) interaction
ADE [51]	2972 MEDLINE case reports	5063 drugs; 5776 adverse effects; 231 dosages	6821 drug-adverse effects; 279 drug-dosage relations	Sentence	The ADE dataset annotates drugs and conditions, but the entities are not linked to standard database identifiers. Like most relation datasets, ADE annotates its relations (i.e. drug-ADE and drug-dosage relations) at the sentence level.
DDI13 [8]	905 documents	13 107 drugs	5028 relations	Sentence	The SemEval 2013 DDIExtraction dataset consists of 792 texts selected from the DrugBank database and 233 Medline abstracts. The corpus is annotated with 18 502 pharmacological substances and 5028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions.
n2c2 2018 ADE [52]	505 summaries	83 869 entities	59 810 relations	-	The discharge summaries come from the MIMIC-III (Medical Information Mart for Intensive Care-III) clinical care database. The summaries were manually selected to contain at least one ADE and annotated with nine concepts and eight relation pairs. The data are split into 303 and 202 summaries for the training and test sets, respectively.
Variant/gene–disease interaction
EMU [21]	110 abstracts	-	179 relations	Document	The EMU dataset focuses on finding relationships between mutations and their corresponding disease phenotypes. The authors use ‘MeSH = mutation’ to select abstracts and MetaMap [53] to annotate them; the abstracts are divided into those containing mutations related to prostate cancer (PCa) and those related to breast cancer (BCa). They then use rules and patterns to select subsets of the PCa and BCa abstracts for annotation.
RENET2 [54]	1000 abstracts, 500 full-texts	-	-	Document	It contains 1000 abstracts (from RENET [55]) and 500 full texts from the PMC open-access subset. For better quality, 500 abstracts of the dataset were refined. The authors used these 500 abstracts to train the RENET2 model and expanded their training data using the other 500 abstracts. They further used the model trained on 1000 abstracts to construct 500 full-text articles.
Drug–gene mutation
N-ary [56]	-	-	3462 triples; 137 469 drug–gene relations; 3192 drug–mutation relations	Cross-sentence	The authors use distant supervision to construct a cross-sentence drug–gene mutation RE dataset. They use 59 distinct drug–gene mutation triples from the knowledge bases to extract 3462 ternary positive relation triples. The negative instances are generated by randomly sampling entity pairs/triples without interaction.
Event extraction
GE09 [57]	1200 abstracts	-	13 623 events	Sentence	As the first BioNLP shared task (ST), it aimed to define a bounded, well-defined GENIA event extraction (GE) task, considering both the actual needs and the state-of-the-art in bio-TM technology and to pursue it as a community-wide effort.
GE11 [58]	1210 abstracts, 14 full-text	21 616 proteins	18 047 events	Sentence	The BioNLP ST 2011 GE task follows the task definition of BioNLP ST 2009. BioNLP ST 2011 served to measure the progress of the community and to generalize IE technology to full papers.
CG [59]	600 abstracts	21 683 entities	17 248 events; 917 relations	Sentence	The BioNLP ST 2013 Cancer Genetics (CG) corpus contains annotations of over 17 000 events in 600 documents. The task addresses entities and events at all levels of biological organization, from the molecular to the whole organism, and involves pathological and physiological processes.
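
The negative-instance generation described for the N-ary dataset (randomly sampling entity triples without a known interaction) can be illustrated with a minimal sketch. This is not the original authors' code: the function name `sample_negatives`, its parameters and the toy entity names are all hypothetical, and the sketch enumerates every drug/gene/mutation combination rather than only those co-occurring in nearby sentences, which the real dataset construction would require.

```python
import itertools
import random


def sample_negatives(drugs, genes, mutations, positive_triples, k, seed=0):
    """Draw up to k (drug, gene, mutation) triples that are not known positives.

    Hypothetical helper for illustration only; it ignores the text spans
    in which the entities would have to co-occur.
    """
    positives = set(positive_triples)
    # Enumerate every candidate triple and keep those without a known interaction.
    candidates = [
        triple
        for triple in itertools.product(drugs, genes, mutations)
        if triple not in positives
    ]
    # A seeded generator keeps the sampling reproducible.
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


# Toy values chosen for illustration, not taken from the dataset.
positives = [("gefitinib", "EGFR", "L858R")]
negatives = sample_negatives(
    drugs=["gefitinib", "imatinib"],
    genes=["EGFR", "KIT"],
    mutations=["L858R", "T790M"],
    positive_triples=positives,
    k=3,
)
```

Sampling without replacement from the filtered candidate list guarantees that no negative duplicates another or collides with a known positive triple.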