2017 Feb 26;2017:6213474. doi: 10.1155/2017/6213474

Table 2.

Standard corpora for the omics domain.

| Corpus | Text mining evaluation task | Brief introduction |
|---|---|---|
| JNLPBA (Joint Workshop on NLP in Biomedicine and Its Applications) [18] | Gene/protein concept extraction | 2,000 PubMed abstracts as training data and 404 PubMed abstracts as test data. |
| BioCreAtivE 2004 Task 1A dataset [19] | Gene/protein concept extraction | 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data. |
| BioCreAtivE 2 Gene Mention (GM) dataset [20] | Gene/protein concept extraction | 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data. |
| AIMED [21] | Protein-protein interaction | 225 PubMed abstracts containing 1,987 sentences with 4,075 protein mentions. |
| HPRD50 (Human Protein Reference Database) [22] | Protein-protein interaction | Sentences with protein-protein interactions from 50 PubMed abstracts. |
| BioInfer (Bio Information Extraction Resource) [23] | Protein, gene, and RNA relationships | 1,100 sentences annotated with concept names, relationships, and syntactic dependencies. |
| IEPA (Interaction Extraction Performance Assessment) [24] | Protein-protein interaction | More than 200 PubMed sentences annotated with protein-protein interactions. |
| BioCreAtivE 2.5 Elsevier Corpus [25] | Protein-protein interaction | 61 PubMed articles as training data and 62 PubMed articles as test data. |
| BC4GO Corpus [26] | Gene ontology | 1,356 distinct GO terms from 200 PubMed articles. |
| GREC Corpus [27] | Gene regulation and gene expression events | 240 PubMed abstracts annotated with gene regulation and gene expression events. |
| GETM [28] | Gene expression events | 150 PubMed abstracts annotated with gene expression events. |
| AnEM [29] | Tissue, cell, developing anatomical structure, cellular component | 500 PubMed sentences annotated with a variety of biomedical concepts. |
| CellFinder Corpus [30] | Anatomical parts, cell lines, cell types, species, and cell components | Annotations from 10 full-text PubMed articles. |