2016 Oct 28;8:59. doi: 10.1186/s13321-016-0172-0

Table 1.

Details of the gold standard patent corpora annotated with chemical entities

Corpus (approximate size)	Number of patents	Annotated entities	Number of annotations
CEMP training set (CEMP_T) [11, 25], $\approx$ 660 thousand tokens	7000 patents (title and abstract)	ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS	33543 (without normalization)
CEMP development set (CEMP_D) [11, 25], $\approx$ 650 thousand tokens	7000 patents (title and abstract)	ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS	32142 (without normalization)
CHEBI patent corpus (chapati) [18], $\approx$ 265 thousand tokens	40 full patents (title, abstract, claims, description)	CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM	18746 (normalized to CHEBI identifiers)
BioSemantic patent corpus (BioS) [19], 11,500 pages and $\approx$ 4.2 million tokens	200 full patents (title, abstract, claims, description)	IUPAC, SMILES, InChI, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET	400125 (without normalization)