MassSpecGym: A benchmark for the discovery and identification of molecules

Roman Bushuiev; Anton Bushuiev; Niek F de Jonge; Adamo Young; Fleming Kretschmer; Raman Samusevich; Janne Heirman; Fei Wang; Luke Zhang; Kai Dührkop; Marcus Ludwig; Nils A Haupt; Apurva Kalia; Corinna Brungs; Robin Schmid; Russell Greiner; Bo Wang; David S Wishart; Li-Ping Liu; Juho Rousu; Wout Bittremieux; Hannes Rost; Tytus D Mak; Soha Hassoun; Florian Huber; Justin J J van der Hooft; Michael A Stravs; Sebastian Böcker; Josef Sivic; Tomáš Pluskal

This is a preprint. It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Oct 30:arXiv:2410.23326v1. [Version 1]

PMCID: PMC11581121 PMID: 39575121

Abstract

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

Full Text Availability

The license terms selected by the author(s) for this preprint version do not permit archiving in PMC. The full text is available from the preprint server.
