Table 2.
Dataset | # Documents/sentences | # Entities | # Relations | Annotation level (sentence/document) | Description |
---|---|---|---|---|---|
Protein–protein interaction | |||||
AIMed [37] | 230 abstracts | 4141 genes | 1101 relations | Sentence | The AIMed dataset was built to develop and evaluate protein name recognition and protein–protein interaction (PPI) extraction. It contains 750 Medline abstracts that include the word ‘human’, with 5206 names. For PPI extraction, 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins (DIP) [50] and tagged with 1101 protein interactions and 4141 protein names. Because negative examples of protein interactions were rare in these 200 abstracts, the authors manually selected 30 additional abstracts that mention more than one gene but contain no gene interactions. |
BioInfer [6] | 1100 sentences | 4573 proteins | 2662 relations | Sentence | A PPI dataset that uses ontologies to define fine-grained types of entities (like ‘protein family or group’ and ‘protein complex’) and their relationships (like ‘CONTAIN’ and ‘CAUSE’). The corpus consists of 1100 sentences with full dependency annotation, dependency types and comprehensive annotation of bio-entities and their relationships. |
BioCreative II PPI IPS [7] | 1098 full-texts | - | - | Document | The BioCreative II PPI protein interaction pairs subtask (IPS) provides 750 and 356 full texts for training and test sets, respectively. The full text includes corresponding gene mention symbols and PPI pairs. |
Chemical–protein interaction | |||||
DrugProt [40] | 5000 abstracts | 65 561 chemicals, 61 775 genes | 24 526 relations | Sentence | The DrugProt dataset, an extension of the ChemProt dataset, aims to promote the development of chemical-gene RE systems. It covers 13 different chemical-gene relations, including regulatory, specific and metabolic relations. |
Chemical–disease interaction | |||||
BC5CDR [9] | 1500 abstracts | 15 935 chemicals; 12 850 diseases | 3106 relations | Document | BC5CDR consists of 1500 abstracts with chemical and disease mention annotations and their IDs. It also annotates chemical-induced disease relations as ID pairs. Of the abstracts, 1400 were selected from a CTD-Pfizer collaboration-related dataset; the remaining 100 articles are newly curated and are used in the test set. |
Drug–drug interaction (DDI) and drug–ADE (adverse drug effect) interaction | |||||
ADE [51] | 2972 MEDLINE case reports | 5063 drugs; 5776 adverse effects; 231 dosages | 6821 drug-adverse effect relations; 279 drug-dosage relations | Sentence | The ADE dataset contains drugs and conditions, but the entities are not linked to standard database identifiers. Like most relation datasets, ADE annotates relations (i.e. drug-ADE and drug-dosage relations) at the sentence level. |
DDI13 [8] | 905 documents | 13 107 drugs | 5028 relations | Sentence | The SemEval 2013 DDIExtraction dataset consists of 792 texts selected from the DrugBank database and 233 Medline abstracts. The corpus is annotated with 18 502 pharmacological substances and 5028 DDIs, including both pharmacokinetic (PK) and pharmacodynamic (PD) interactions. |
n2c2 2018 ADE [52] | 505 summaries | 83 869 entities | 59 810 relations | - | The discharge summaries are from the clinical care database of the MIMIC-III (Medical Information Mart for Intensive Care-III). The summaries are manually selected to contain at least one ADE and annotated with nine concepts and eight relation pairs. The data are split into 303 and 202 for training and test sets, respectively. |
Variant/gene–disease interaction | |||||
EMU [21] | 110 abstracts | - | 179 relations | Document | The EMU dataset focuses on finding relationships between mutations and their corresponding disease phenotypes. The authors use the query ‘MeSH = mutation’ to select abstracts and MetaMap [53] to annotate them; the abstracts are divided into those containing mutations related to prostate cancer (PCa) and those related to breast cancer (BCa). They then use rules and patterns to select subsets of the PCa and BCa abstracts for annotation. |
RENET2 [54] | 1000 abstracts, 500 full-texts | - | - | Document | RENET2 contains 1000 abstracts (from RENET [55]) and 500 full texts from the PMC open-access subset. For better quality, 500 of the abstracts were refined. The authors used these 500 abstracts to train the RENET2 model and used the other 500 abstracts for training data expansion. They then used the model trained on the 1000 abstracts to construct the annotations for the 500 full-text articles. |
Drug–gene mutation | |||||
N-ary [56] | - | - | 3462 triples; 137 469 drug–gene relations; 3192 drug–mutation relations | Cross-sentence | The authors use distant supervision to construct a cross-sentence drug–gene mutation RE dataset. They use 59 distinct drug–gene mutation triples from the knowledge bases to extract 3462 ternary positive relation triples. The negative instances are generated by randomly sampling the entity pairs/triples without interaction. |
Event extraction | |||||
GE09 [57] | 1200 abstracts | - | 13 623 events | Sentence | GE09, the first BioNLP shared task (ST), aimed to define a bounded, well-defined GENIA event extraction (GE) task, considering both the actual needs and the state of the art in bio-TM technology, and to pursue it as a community-wide effort. |
GE11 [58] | 1210 abstracts, 14 full-text | 21 616 proteins | 18 047 events | Sentence | The BioNLP ST 2011 GE task follows the task definition of the BioNLP ST 2009. BioNLP ST 2011 served to measure the progress of the community and the generalization of IE technology to full papers. |
CG [59] | 600 abstracts | 21 683 entities | 17 248 events; 917 relations | Sentence | The BioNLP ST 2013 Cancer Genetics (CG) corpus contains annotations of over 17 000 events in 600 documents. The task addresses entities and events at all levels of biological organization, from the molecular to the whole organism, and involves pathological and physiological processes. |
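
The distant-supervision construction described for the N-ary dataset in Table 2 (positives matched against knowledge-base triples, negatives randomly sampled from candidates without a known interaction) can be illustrated with a minimal sketch. The function name `label_candidates`, the `neg_per_pos` ratio, and the placeholder entity names are assumptions made for illustration; they are not taken from the N-ary dataset or its authors' pipeline.

```python
import random
from itertools import product


def label_candidates(drugs, genes, mutations, known_triples, neg_per_pos=1, seed=0):
    """Label co-occurring (drug, gene, mutation) candidates by distant supervision.

    Candidates found in `known_triples` become positives; negatives are
    randomly sampled from the remaining candidates (assumed: no known interaction).
    """
    candidates = list(product(drugs, genes, mutations))
    positives = [c for c in candidates if c in known_triples]
    unlabeled = [c for c in candidates if c not in known_triples]
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    k = min(len(unlabeled), neg_per_pos * len(positives))
    negatives = rng.sample(unlabeled, k)
    return positives, negatives


# Hypothetical placeholder entities, not real knowledge-base entries.
known = {("drugA", "geneX", "mut1")}
positives, negatives = label_candidates(
    drugs=["drugA", "drugB"],
    genes=["geneX"],
    mutations=["mut1", "mut2"],
    known_triples=known,
)
print(positives)
print(len(negatives))
```

In practice the candidate set would come from entity mentions co-occurring within a text span rather than from a full Cartesian product of entity lists; the product is used here only to keep the sketch self-contained.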