2024 Sep 9;11:982. doi: 10.1038/s41597-024-03835-7

Table 1.

Overview of datasets for chemical NLP.

| Year | Dataset | Documents | Chemical, protein, and gene mentions | Unique IDs | Relations | Application |
|---|---|---|---|---|---|---|
| 2008 | Corbett [32] | 500 abstracts, 42 papers | 11,571 | | | NER |
| 2008 | SCAI [33] | 100 abstracts | 1,206 | | | NER |
| 2012 | ADE [39] | 300 case reports | 5,063 drugs | | 6,821 drug adverse effects; 279 drug dosage | RE |
| 2013 | DDI [43] | 1,025, including texts from DrugBank and abstracts | 18,502 drugs | | 5,028 drug-drug interactions | RE |
| 2015 | CHEMDNER [34] | 10,000 abstracts | 84,355 chemicals | | | NER |
| 2016 | BC5CDR [35] | 1,500 articles | 15,935 chemicals; 12,850 diseases | 4,409 MeSH | chemically induced diseases | NER, NEN, RE |
| 2017 | N-ary drug-gene-mutation [42] | | | | 137,469 drug–gene; 3,192 drug–mutation | RE |
| 2017 | ChemProt [40] | 2,482 abstracts | 32,514 chemicals; 30,922 genes | | chemical-protein | RE |
| 2019 | DrugProt [41] | 5,000 abstracts | 65,561 chemicals; 61,775 proteins | | 24,526 chemical-protein | RE |
| 2020 | EBED [38] | 4,200, including abstracts, paragraphs, figure legends, and patents | 16,715 chemicals; 56,059 genes | 5,161 ChEBI; 12,563 Entrez | chemically induced diseases | NER, NEN, RE |
| 2021 | ChEMU 2020 [44] | 1,500 patent extracts | 17,834 chemicals | | chemical reaction steps | NER, RE |
| 2022 | BioRED [37,75] | 1,000 abstracts | 7,021 chemicals; 12,412 genes | 1,096 MeSH; 2,605 Entrez | chemical-(chemical/disease/gene/variant) | NER, NEN, RE |
| 2024 | EnzChemRED (this work) | 1,210 abstracts | 18,887 chemicals; 13,028 proteins | 3,155 ChEBI; 2,569 UniProtKB | chemical-chemical and (chemical-chemical)-protein | NER, NEN, RE |