2017 Feb 26;2017:6213474. doi: 10.1155/2017/6213474

Table 2.

Standard corpora for the omics domain.

Corpus	Text mining evaluation task	Brief introduction
JNLPBA (Joint Workshop on NLP in Biomedicine and Its Applications) [18]	Gene/protein concept extraction	The corpus consists of 2,000 PubMed abstracts as training data and 404 PubMed abstracts as test data.

BioCreAtivE 2004 Task 1A dataset [19]	Gene/protein concept extraction	The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data.

BioCreAtivE 2 Gene Mention (GM) dataset [20]	Gene/protein concept extraction	The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data.

AIMED [21]	Protein-protein interaction	The corpus consists of 225 PubMed abstracts that contain 1,987 sentences with 4,075 protein mentions.

HPRD50 (Human Protein Reference Database) [22]	Protein-protein interaction	The corpus consists of sentences with protein-protein interactions from 50 PubMed abstracts.

BioInfer (Bio Information Extraction Resource) [23]	Protein, gene, and RNA relationships	The corpus consists of 1,100 sentences annotated with concept names, relationships, and syntactic dependencies.

IEPA (Interaction Extraction Performance Assessment) [24]	Protein-protein interaction	The corpus consists of more than 200 PubMed sentences annotated with protein-protein interactions.

BioCreAtivE 2.5 Elsevier Corpus [25]	Protein-protein interaction	The corpus consists of 61 PubMed articles as training data and 62 PubMed articles as test data.

BC4GO Corpus [26]	Gene ontology	The corpus consists of 1,356 distinct GO terms from 200 PubMed articles.

GREC Corpus [27]	Gene regulation and gene expression events	The corpus consists of 240 PubMed abstracts with annotations on gene regulation and gene expression events.

GETM [28]	Gene expression events	The corpus consists of 150 PubMed abstracts with annotations for gene expression events.

AnEM [29]	Tissue, cell, developing anatomical structure, cellular component	The corpus consists of 500 PubMed sentences with annotations on a variety of biomedical concepts.

CellFinder Corpus [30]	Anatomical parts, cell lines, cell types, species, and cell components	The corpus consists of annotations from 10 full-text PubMed articles.