EnzChemRED, a rich enzyme chemistry relation extraction dataset

Po-Ting Lai; Elisabeth Coudert; Lucila Aimo; Kristian Axelsen; Lionel Breuza; Edouard de Castro; Marc Feuermann; Anne Morgat; Lucille Pourcel; Ivo Pedruzzi; Sylvain Poux; Nicole Redaschi; Catherine Rivoire; Anastasia Sveshnikova; Chih-Hsuan Wei; Robert Leaman; Ling Luo; Zhiyong Lu; Alan Bridge

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Apr 22:arXiv:2404.14209v1. [Version 1]

PMCID: PMC11188131 PMID: 38903736

Abstract

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 scores of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best-performing methods after fine-tuning with EnzChemRED to create an end-to-end pipeline for knowledge extraction from text, and apply this pipeline to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.
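To make the reported F1 scores concrete, the following is a minimal sketch of how an F1 score for chemical conversion pairs could be computed from sets of gold and predicted relations. It is an illustration only, not the authors' evaluation code: the `Relation` tuple layout, the function name `precision_recall_f1`, and all document and chemical identifiers are hypothetical placeholders, and the abstract does not specify how scores were matched or averaged.

```python
from typing import Set, Tuple

# Hypothetical representation of one chemical conversion relation:
# (document ID, substrate identifier, product identifier).
Relation = Tuple[str, str, str]


def precision_recall_f1(
    gold: Set[Relation], predicted: Set[Relation]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of relations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder data (made up for illustration, not taken from EnzChemRED).
gold = {
    ("doc1", "chem:A", "chem:B"),
    ("doc1", "chem:C", "chem:D"),
    ("doc2", "chem:E", "chem:F"),
}
predicted = {
    ("doc1", "chem:A", "chem:B"),
    ("doc2", "chem:E", "chem:F"),
    ("doc2", "chem:G", "chem:H"),  # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2%} R={recall:.2%} F1={f1:.2%}")
```

In this toy case two of the three predictions are correct and two of the three gold relations are recovered, so precision, recall, and F1 all equal 2/3. Extending the tuple with an enzyme identifier would give the stricter setting in which the linked enzyme must also match.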

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.
