CEMP training set (CEMP_T) [11, 25] | 660 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 33543 (without normalization)
CEMP development set (CEMP_D) [11, 25] | 650 thousand tokens | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 32142 (without normalization)
ChEBI patent corpus (chapati) [18] | 265 thousand tokens | 40 full patents (title, abstract, claims, description) | CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM | 18746 (normalized to ChEBI identifiers)
BioSemantic patent corpus (BioS) [19] | 11,500 pages and 4.2 million tokens | 200 full patents (title, abstract, claims, description) | IUPAC, SMILES, InChI, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET | 400125 (without normalization)