2025 Sep 15;22(10):2028–2031. doi: 10.1038/s41592-025-02813-0

MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries

Corinna Brungs 1,2,#, Robin Schmid 1,3,✉,#, Steffen Heuckeroth 4, Aninda Mazumdar 5, Matúš Drexler 1, Pavel Šácha 1, Pieter C Dorrestein 3, Daniel Petras 6, Louis-Felix Nothias 7,8, Václav Veverka 1,9, Radim Nencka 1, Zdeněk Kameník 5, Tomáš Pluskal 1,
PMCID: PMC12510872  PMID: 40954295

Abstract

Untargeted high-resolution mass spectrometry is a key tool in clinical metabolomics, natural product discovery and exposomics, with compound identification remaining the major bottleneck. Currently, the standard workflow applies spectral library matching against tandem mass spectrometry (MS2) fragmentation data. Multi-stage fragmentation (MSn) yields more profound insights into substructures, enabling validation of fragmentation pathways; however, the community lacks open MSn reference data of diverse natural products and other chemicals. Here we describe MSnLib, a machine learning-ready open resource of >2 million spectra in MSn trees of 30,008 unique small molecules, built with a high-throughput data acquisition and processing pipeline in the open-source software mzmine.

Subject terms: Metabolomics, Mass spectrometry


MSnLib is a large-scale, open MSn spectral library featuring >2.3 million MSn and >357,000 MS2 spectra for 30,008 unique small molecules.

Main

Accurate structural elucidation and compound annotation in untargeted mass spectrometry (MS) typically rely on matching fragmentation spectra (MS2−n) against reference libraries1–3. However, low annotation rates remain a persistent challenge, largely due to the limited structure coverage of high-quality open spectral libraries compared with the vast known chemical space captured in compound databases such as COCONUT4, ChEMBL5 or PubChem6,7. Obtaining reference standards, especially purified natural products, is challenging and costly. In multi-stage fragmentation (MSn), precursor ions are selected and fragmented in multiple iterations, producing spectral trees with deeper structural insights8–10. While MSn provides deeper structural information than MS2, and is crucial for characterizing complex molecules and distinguishing isomers8,9,11, its widespread application is constrained by the lack of public data. Open MSn (that is, n > 2) spectral entries currently number below 2,000, in contrast to the more than 700,000 MS2 entries for the more than 30,000 compounds covered in the open libraries GNPS12, MoNA and MassBank EU13. The mzCloud database contains the most extensive MSn collection, with more than 16 million spectra for over 30,000 unique compounds (April 2024). Like other proprietary MS libraries, including Wiley, NIST23 and METLIN14, data in mzCloud are locked and cannot be downloaded in open readable formats, thereby limiting their use in other tools and the training of machine learning models.

Here we introduce MSnLib, an open large-scale MSn library that covers more than 30,000 unique compound structures in over 2.3 million MSn spectra, and which significantly expands public MS resources. We describe our scalable, high-throughput pipeline combining collaborative compound sourcing, an open-source metadata curation script, rapid data acquisition, and automated data processing implemented in the open-source software mzmine15. Large-scale spectral library efforts are challenged by limited access to diverse, often costly compound collections. Through a collaborative network, we obtained seven collections totaling 37,829 compounds and 34,413 unique structures, representing a broad chemical space across various natural and synthetic classes (Extended Data Table 1, Extended Data Fig. 1, Supplementary Note 1, Supplementary Fig. 1 and Supplementary File 1).

Extended Data Table 1.

Metadata of compounds included for the new high-throughput library-building workflow

graphic file with name 41592_2025_2813_Tab1_ESM.jpg

Each compound’s natural product information is extracted from ChEMBL, LOTUS or the Dictionary of Natural Products (Version 31.2). Clinical phases (1–4) are gathered from multiple public resources, including ChEMBL, DrugBank, DrugCentral and Broad Institute. Those queries are included in the Python script for metadata clean-up. NP, natural product.

Extended Data Fig. 1. Insights into extracted chemical properties covered by the seven compound libraries ENAMDISC, MCEBIO, MCESCAF, ENAMMOL, NIHNP, MCEDRUG, and OTAVAPEP.

Extended Data Fig. 1

If a compound (the same InChIKey) is included in multiple compound libraries, it is kept in the smallest one. Most compounds fall into the size range of small molecules with masses below 600 and logP values below 5, following Lipinski’s rule. The compound libraries show differences in their covered compound classes and atomic compositions. Supplementary Fig. 1 presents additional histograms.

Our library generation pipeline consists of three stages: metadata curation, data acquisition, and data processing (Fig. 1). The main steps include metadata clean-up and acquisition sequence creation (Fig. 1a), followed by high-throughput data acquisition (Fig. 1b), and conclude with automated spectral library generation in mzmine (Fig. 1c). High-quality metadata curation is critical. We developed a Python script to clean input structures (SMILES/InChI), removing salts and applying harmonization steps (Fig. 1a). This script also enriches compound metadata by querying numerous chemical, biological and drug databases and applying compound classifiers. Finally, unique sample identifiers (IDs) are assigned to aid in later data processing. Task orchestration via the Python-based Prefect framework manages and accelerates database queries (Supplementary Fig. 2). More details are given in Methods and Supplementary Note 2.

Fig. 1. Mass spectral library-building workflow.

Fig. 1

a, Metadata clean-up and acquisition sequence generation. The metadata clean-up process consists of structure harmonization, experiment planning, and optional database queries to retrieve general compound, drug, natural product or other information. b, Sample preparation and flow injection data-dependent acquisition. The high-throughput data acquisition uses robotic and Echo liquid handling to mix and dilute compounds for subsequent analysis with a flow injection data-dependent acquisition MSn method. c, Data processing in mzmine. The automatic library generation workflow is implemented in mzmine and incorporates support for various data formats, processing steps, compound annotation, spectral merging, quality checks and export to open library formats. DDA, data-dependent acquisition; HRMS, high-resolution mass spectrometry.

MSn data acquisition is challenging due to the long MS cycle times that are required to acquire deep and wide spectral trees (Extended Data Fig. 2). In a pilot study (Supplementary Note 3), we developed a high-throughput dual-pump flow injection method to capture high-quality MSn spectra for multiple adducts of up to 10 compounds per injection (Fig. 1b). Specifically the automatic gain control, injection time and mass resolution were optimized to raise spectral quality and signal-to-noise ratios (Extended Data Fig. 3 and Supplementary Fig. 3). Reproducible static noise signals were identified and later removed during data processing (Supplementary Fig. 4). Using the optimized method, all compounds were analyzed in both ionization modes in 23 days and 9,060 injections using auto-generated sequences with unique sample IDs.

Extended Data Fig. 2. MSn tree data acquisition.

Extended Data Fig. 2

Instrument method set-up that was used for the MSn data acquisition. Three collision energies are applied to each precursor, with Top5 selection for MS3, Top2 for MS4 and Top1 for MS5, giving a maximum of 75 scans per MS2 precursor ion. OT, Orbitrap; IT, ion trap; dd, data-dependent; HCD, higher-energy collisional dissociation.
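The scan budget in this caption can be reproduced with a short calculation. The sketch below is illustrative only and assumes that every selected precursor is fragmented at all three collision energies at every level, as stated in the caption; the function name is hypothetical.

```python
def max_scans_per_ms2_precursor(top_n=(5, 2, 1), n_energies=3):
    """Maximum number of MS3..MS5 scans triggered below one MS2 precursor.

    top_n holds the TopN setting for MS3, MS4 and MS5 (caption values).
    """
    total = 0
    precursors = 1  # start from a single MS2 precursor ion
    for n in top_n:
        precursors *= n                   # precursors selected at this level
        total += precursors * n_energies  # one scan per precursor and energy
    return total


# 5*3 (MS3) + 10*3 (MS4) + 10*3 (MS5)
print(max_scans_per_ms2_precursor())  # 75
```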

Extended Data Fig. 3. Comparison of three different Orbitrap methods influencing the noise intensity and number of signals per MS3–5 scan for 220 pure standards in 22 samples of the MCEBIO library.

Extended Data Fig. 3

a, The noise intensity, here defined by the lowest signal in each MS3 scan, is influenced by the injection time of the C trap. Internal processing by the instrument vendor normalizes the intensity values by dividing them by the injection time, leading to higher absolute noise levels for shorter injection times. A higher AGC target (Method 2) shifts to longer injection times, keeping the noise level on the same trajectory. A higher mass resolution decreases the noise level to a lower trajectory. b, The MS3–5 scan quality is compared by evaluating the number of signals in all annotated and exported MS3–5 library scans. This includes the noise removal of signals below 2.5 times the lowest signal intensity in each scan before spectral export. The method optimization increased the number of high-quality spectra with more than ten fragment signals and the overall number of exported library spectra, despite the lower scan rate. Method 1 (blue): AGC 60%, Resolution 15k; Method 2 (orange): AGC 100%, Resolution 15k; Method 3 (magenta): AGC 100%, Resolution 60k.

Addressing the otherwise laborious data processing, we implemented an automated library-building workflow in mzmine (Fig. 1c). It imports data, builds MSn trees and annotates features against curated metadata by searching for compounds as various expected adducts or in-source fragments using their exact m/z ratio. The sample’s known composition constrained annotations. The mzmine feature table facilitates manual validation with annotations and ion traces (Supplementary Fig. 5). Key automated quality checks include precursor purity and fragment annotation rates. Finally, spectra were merged on different levels and exported in open MS library formats (Methods, Supplementary Note 4 and Supplementary Fig. 6).
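As a rough illustration of the annotation step, the sketch below computes expected precursor m/z values for a few common adducts and restricts the search to the compounds known to be in one sample. It is not the mzmine implementation: the adduct list, the 5 ppm tolerance and all function names are assumptions made for this example, and in practice the adducts would be chosen per polarity.

```python
PROTON = 1.007276  # proton mass in Da

# Mass shifts for a few singly charged adducts (illustrative subset)
ADDUCT_SHIFTS = {
    "[M+H]+": PROTON,
    "[M+Na]+": 22.989221,  # sodium atom minus one electron
    "[M-H]-": -PROTON,
}


def expected_mz(mono_mass):
    """Expected precursor m/z for each adduct of a neutral monoisotopic mass."""
    return {adduct: mono_mass + shift for adduct, shift in ADDUCT_SHIFTS.items()}


def annotate(observed_mz, sample_compounds, tol_ppm=5.0):
    """Match one observed m/z against the compounds pooled in this sample only."""
    hits = []
    for name, mass in sample_compounds.items():
        for adduct, mz in expected_mz(mass).items():
            if abs(observed_mz - mz) / mz * 1e6 <= tol_ppm:
                hits.append((name, adduct))
    return hits


# Caffeine (monoisotopic mass 194.080376 Da) observed as its protonated ion
print(annotate(195.0877, {"caffeine": 194.080376}))
```

Limiting the candidate list to the known sample composition keeps the number of accidental mass matches low, which is the constraint described above.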

Application of our automated workflow to the seven compound libraries successfully generated MSn trees for 30,008 unique compounds (87% coverage), yielding 357,065 MS2 and over 2.3 million MSn spectra after merging and deduplication (Extended Data Tables 2 and 3). This establishes MSnLib as a large-scale open library and enables it to significantly expand publicly available MSn data. Achieving this high coverage required combining both ionization modes, given that many compounds were detected exclusively in positive (>12,200) or negative (>3,400) ionization, alongside those found in both (~14,300) (Fig. 2a,b, Supplementary File 1 and Supplementary Figs. 7–13). Comparison with existing libraries confirms MSnLib’s complementarity, contributing 22,700 new compounds and valuable MSn trees (Fig. 2c). The unique chemical space covered by MSnLib, projected using the data visualization method TMAP16 (Extended Data Fig. 4), underscores our method’s capability to substantially expand spectral knowledge. Supplementary Fig. 14 shows a representative MSnLib entry in mzmine’s MSn tree visualizer. Supplementary Note 5 provides a potential explanation for missed compounds.

Extended Data Table 2.

Statistics on each library and acquisition mode

graphic file with name 41592_2025_2813_Tab2_ESM.jpg

All 9,060 injections were measured in 23 days. The numbers of scans include the best and merged spectra, including the pseudo-MS2 spectrum, in which a whole fragmentation tree is merged into a single spectrum. Detailed patterns of compounds detected in positive and negative ion modes are given in Supplementary Figs. 7–13.

Extended Data Table 3.

Number of spectra for each MS level without merging (single best scan) and for individual and merged spectra (Spectype)

graphic file with name 41592_2025_2813_Tab3_ESM.jpg

More library information is given in Supplementary File 8. Spectype explanation: SINGLE_BEST_SCAN, the fragmentation scan with the highest TIC is exported for each precursor in one injection; SAME_ENERGY, merging of spectra of the same precursor and collision energy in one injection; ALL_ENERGIES, merging of spectra of the same precursor and different collision energies in one injection (in our case mostly three energies); PSEUDO_MS2, the whole MSn tree for one compound is merged into a pseudo-MS2 scan.

Fig. 2. Comparison of compounds in spectral libraries.

Fig. 2

a, Plate visualization. A 384-well plate of the MCEBIO library is shown with wells as pie charts, depending on the percentage of compounds detected in each ionization mode. Ten compounds were mixed in each well. b, Detection dependent on ionization mode. The results for both polarities and for all seven libraries including 30,008 extracted unique compounds (with stereochemistry) are shown in an UpSet plot. c, Compound comparison with open and commercial spectral libraries. This UpSet plot shows the uniqueness and overlap of compounds of the newly acquired MSnLib compared with open (that is, GNPS (ALL_GNPS_NO_PROPAGATED), MassBank EU (MASSBANK_NIST) and MoNA (LC-MS/MS spectra), data from December 2023) and two commercial spectral libraries (mzCloud (auto and reference export), data from 4 March 2024; and NIST23 data from 13 August 2024) depending on the first InChIKey block, omitting stereochemistry.

Extended Data Fig. 4. TMAP projections of the chemical space coverage of MSnLib in comparison to other open spectral libraries and the commercial spectral libraries NIST23 and mzCloud.

Extended Data Fig. 4

The libraries are divided into: a, MSnLib, containing different compound libraries; b, open libraries, including GNPS, MoNA and MassBank EU (access date 08.12.2023, 35,278 unique compounds); c, NIST23 (access date 13.08.2024, 47,461 unique compounds); and d, mzCloud (access date 04.03.2024, 30,248 unique compounds). Each node represents an individual structure based on its canonical SMILES representation. Compounds included in the corresponding library are highlighted in magenta and drawn on top of compounds from other libraries in gray. The same structure can be included in multiple libraries. e, A more fine-grained view of MSnLib coverage, resolved by the individual compound libraries.

The MS2 spectral quality of the MCEBIO library entries, part of the greater MSnLib, was evaluated using feature-based molecular networking17. The pair-wise fragmentation similarity mapped the chemical space and led to the clustering of similar compound classes, proving high spectral quality. Only 20% of the entries matched to other open databases, highlighting the novelty of the MSnLib spectra (Extended Data Fig. 5 and Supplementary Note 6). For evaluation, we matched MSnLib against a public dataset of drug-incubated bacterial cultures (MSV000096589). While other combined open libraries yielded 129 annotations, MSnLib provided 80. Crucially, MSnLib contributed 21 unique annotations, increasing the total annotated features to 150. Of these unique MSnLib matches, 14 corresponded to added drugs, and the remaining seven were potential microbial metabolites (for example, ufiprazole from omeprazole) that were not further investigated. Detailed matching results and spectral plots are provided in Supplementary Files 10 and 11.
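The annotation counts reported above fit together as simple set arithmetic. The following sketch only restates the published numbers; it assumes that the 21 unique MSnLib annotations are exactly those features not annotated by the other open libraries, and the variable names are hypothetical.

```python
other_open_libraries = 129  # features annotated by the combined open libraries
msnlib_total = 80           # features annotated by MSnLib
msnlib_unique = 21          # features annotated by MSnLib only

# Features annotated by both MSnLib and the other open libraries
shared = msnlib_total - msnlib_unique

# Union of both annotation sets
total_annotated = other_open_libraries + msnlib_unique

# Breakdown of the unique MSnLib matches
added_drugs, potential_metabolites = 14, 7

print(shared, total_annotated)
```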

Extended Data Fig. 5. Map of the chemical space covered by the MCEBIO MS2 library in positive ion mode and matches to public spectral libraries.

Extended Data Fig. 5

a, Each node represents a single MS2 spectrum (MCEBIO) and the edges between two nodes show the spectral similarity of these two spectra (cosine similarity ≥ 0.7 and at least 4 matched signals). The clustering by feature-based molecular networking (FBMN) on the GNPS web platform highlights the coverage of the chemical space for the acquired 48,069 MS2 MCEBIO spectra. The mzmine algorithm matched all spectra (nodes) against the open spectral libraries from GNPS, MoNA, and MassBank EU, with only a 20% annotation rate (colors). The top ten annotations for each scan were exported for structure clean-up and scoring of structural similarity with actual compounds analyzed. For this, the maximum common edge subgraph (MCES, https://github.com/AlBi-HHU/myopic-mces) and the Tanimoto similarity were calculated. MCES reflects the number of structural modifications, attaining a value of 0 for identical structures. MCES < 4 was chosen for similar structures, and MCES ≥ 4 was selected for lower or no similarity. Of all annotations, 94% were identical or similar. b, A subnetwork for acetophenazine and its neighbors for a distance of up to two edges. The node shapes reflect actual structures, and colors reflect MCES scores. Spectral similarity successfully clustered this molecular family of analog structures. The spectral library match of acetophenazine to an analog demonstrates how MCES = 2 often describes structural isomers differing in the position of one functional group. Piperacetazine and perphenazine yielded identical matches. Clidinium belongs to a different chemical class. Overall, most spectra contain enough information to be connected by cosine similarity but remain unmatched against public libraries. The molecular network is shared as a Cytoscape file for interactive exploration (Supplementary File 7).

We introduce MSnLib, a large-scale, open MSn spectral library featuring >2.3 million MSn and >357,000 MS2 spectra for 30,008 unique compound structures detected across both ionization modes. This resource was built with collaborative compound sourcing, a high-throughput MSn acquisition method, and an automated, open-source processing workflow in mzmine15. By contributing ~22,700 complementary compounds to existing libraries, MSnLib significantly expands public spectral knowledge. We anticipate that this unique resource will greatly enhance untargeted liquid chromatography–mass spectrometry annotation and fuel machine learning advancements that utilize detailed MSn fragmentation and encoded substructure information for structure prediction and chemical classification. We aim to expand our collaborative network for compound and data sharing to further increase the coverage and impact of MSnLib. The automated library-building workflow is extendible and allows for efficient regeneration of MS libraries.

Methods

Materials

All solvents used were of LC–MS grade. Methanol, acetonitrile, water and formic acid were obtained from Thermo Scientific. Seven compound libraries that were available to us were analyzed: the NIH NPAC ACONN collection of natural products provided by the US National Institutes of Health (NIH) with 3,988 compounds (NIHNP), a peptidomimetic library provided by OTAVAchemicals (Ontario, Canada) comprising 1,298 compounds (OTAVAPEP), the Discovery Diverse Set DDS-10 library from Enamine (Kyiv, Ukraine) with 10,240 compounds (ENAMDISC), a library mixture of 4,378 compounds purchased from Enamine and Molport (Riga, Latvia) including the 4,000 carboxylic acid fragment library (ENAMMOL), and three other libraries purchased from MedChemExpress (MCE, New Jersey, USA) containing 10,315 bioactive compounds (MCEBIO), 5,000 compounds from the MCE 5K Scaffold Library (MCESCAF), and 2,610 Food and Drug Administration-approved drugs (MCEDRUG). More information on compounds is given in Supplementary Note 1.

General workflow

  1. Metadata clean-up
    1. Manual harmonization of input metadata sheet and column names
    2. Defining jobs and running jobs.py
  2. Sequence generation

  3. Sample preparation (mixing compounds)

  4. Data acquisition

  5. Automatic MSn tree library creation and data evaluation in mzmine
    1. Combining the cleaned metadata with the acquired data

Metadata clean-up

A Python script (metadata_cleanup_prefect.py) was developed for the curation of metadata. The main purpose of this script is structure extraction, salt removal, and standardization. Structure standardization was based on the ChEMBL structure pipeline Python package18. The complete clean-up failed with the original pipeline for specific structures (when a salt was given without a dot in the structure) and required an additional initial salt removal step and a second clean-up run. The cleaned and standardized structure is used for calculating other structural information such as canonical and, if available, isomeric SMILES (simplified molecular input line entry system), InChI (international chemical identifier developed by the International Union of Pure and Applied Chemistry), InChIKey (a condensed version of InChI), logP (the logarithm of the octanol–water partition coefficient) and monoisotopic mass. This mass is used during the automatic library generation. Optionally, additional information, such as whether a compound is considered a natural product or used as a drug, can be gathered from other databases based on a name, database identifier or structure search in PubChem, ChEMBL or other public resources (Supplementary Note 2). Queries can be easily turned off or implemented in the Python code. In our code, all databases that require a local file are deactivated by default.
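To illustrate the monoisotopic mass that is carried into the library generation, the sketch below sums monoisotopic element masses over a molecular formula. This is a simplified stand-in, not the authors' script: the published workflow derives properties from the standardized structure, whereas this example assumes a plain formula string without brackets or charges and supports only a small, illustrative set of elements.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; illustrative subset
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "S": 31.97207117,
}


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C8H10N4O2'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in MONOISOTOPIC:
            raise ValueError(f"element not in this sketch's table: {element}")
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


# Caffeine, C8H10N4O2
print(round(monoisotopic_mass("C8H10N4O2"), 4))
```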

Sequence generation

An additional Python script (sequence_creation.py) was used to prepare the sequence table based on plate and well information. The sequence for each plate is built so that it is first analyzed in positive ion mode, followed by negative ionization. A file name is generated automatically and contains the date, a unique sample identifier (combination of library, plate number and well location), the method used, and the polarity. The unique sample identifier is important because it is used for the automatic annotation and extraction of the acquired spectral data. Therefore, the unique sample identifier in the metadata column needs to be matched with a substring of the acquisition file name. Here, only this identifier is important in the name, enabling the addition of other prefixes and suffixes, for example, the date, polarity or method. The script generates acquisition sequence files specific to the Xcalibur sequence layout using the flow injection–Orbitrap MSn method. This can be easily modified for other analysis platforms, depending on their layout.
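The naming scheme and the substring match can be sketched as follows. The exact ID layout used by sequence_creation.py is not given here, so the format below (library, plate and zero-padded well joined by underscores) and all function names are assumptions for illustration.

```python
from datetime import date


def sample_id(library, plate, well):
    """Hypothetical unique sample ID: library, plate number and well location."""
    return f"{library}_P{plate}_{well}"


def acquisition_file_name(run_date, library, plate, well, method, polarity):
    """Date prefix, unique sample ID, then method and polarity suffixes."""
    return f"{run_date:%Y%m%d}_{sample_id(library, plate, well)}_{method}_{polarity}"


def find_sample(file_name, known_ids):
    """Return the single sample ID contained in the file name, else None."""
    matches = [sid for sid in known_ids if sid in file_name]
    return matches[0] if len(matches) == 1 else None


name = acquisition_file_name(date(2024, 1, 15), "MCEBIO", 1, "A01", "MSn", "pos")
print(name)
print(find_sample(name, ["MCEBIO_P1_A01", "MCEBIO_P1_A02"]))
```

Because matching is by substring, IDs should not be prefixes of one another. Zero-padding the well (A01 rather than A1) avoids a file for well A10 also matching the ID of well A1.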

Sample preparation

Our mass spectral library contains seven different compound libraries. The MCEBIO library was prepared with an OT-2 liquid handler (Opentrons Labworks) to pool and dilute 10 compounds in each well of three 384-well plates, resulting in a concentration of 20 µM for each compound in a mixture of methanol and water (1:1). For the MCESCAF, MCEDRUG, OTAVAPEP, ENAMMOL and ENAMDISC libraries, the Echo 650 Liquid Handler (Beckman Coulter) was used to pool eight compounds in 384-well plates, and a CERTUS FLEX liquid dispenser (Fritz Gyger AG) diluted the samples with 80–90 µl methanol and water (mixed in a 1:1 ratio), resulting in a concentration between 8 and 12 µM. The NIHNP library was further processed at the University of California San Diego, California, USA. Up to seven compounds were pooled in 96-well plates, resulting in a concentration of 5 µM. Because the plates showed strong evaporation, they were refilled with 50–100 µl of the original solvent mixture of methanol, acetonitrile and water (4:4:2 ratio). The final concentration is therefore unknown.

Data acquisition

The flow injection–MSn analysis was performed using a Vanquish Horizon UHPLC system with two pumps coupled to an Orbitrap ID-X (Thermo Fisher Scientific) instrument. The instrument was calibrated in positive and negative ion mode with the Pierce FlexMix calibration solution prior to a library batch. Different set-ups were tested to extend the peak width for more MSn experiments, to reach the mass analyzer quickly, and to avoid sample carryover. Two pumps were connected with a T-piece. The first pump, that is, the delivery pump, ran through the autosampler. The flow for this pump was initially set to 45 µl min−1 for the injection to reach the T connection quickly. After 0.27 min the flow was decreased to 5 µl min−1 and kept constant until 1.35 min. Over the next 0.15 min the flow was increased to 45 µl min−1 and kept constant for another 1.5 min to clean the sample lines and to avoid sample carryover. The second pump, that is, the make-up pump, was used to broaden the elution profile. To maintain a combined constant flow of 55 µl min−1, the make-up pump started at 10 µl min−1, was increased to 50 µl min−1 after 0.27 min, and kept at that rate until 1.35 min. The flow was gradually decreased back to 10 µl min−1 over the next 0.15 min and kept constant for another 1.5 min. The whole run time per injection was 3 min. Both pumps used an isocratic mixture of water and acetonitrile at a 50:50 ratio, both with 0.1% formic acid. The switching of the flow speed is important for cleaning because the delivery pump runs at a low flow rate of 5 µl min−1 during most of the time of the sample delivery. It must be noted that the method can also be used with a second pump running constantly at 50 µl min−1, resulting in an altered flow rate of up to 95 µl min−1 during the analysis but with no big changes during the data acquisition.
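The two pump programs can be written as breakpoint tables and checked for the constant combined flow. The sketch below is an illustration under the assumption that the change at 0.27 min is an instantaneous step and that the later change is a linear ramp; the data layout and function names are hypothetical and unrelated to the instrument software.

```python
# (time in min, flow in µl/min); a repeated time point encodes a step change
DELIVERY_PUMP = [(0.0, 45), (0.27, 45), (0.27, 5), (1.35, 5), (1.50, 45), (3.0, 45)]
MAKEUP_PUMP = [(0.0, 10), (0.27, 10), (0.27, 50), (1.35, 50), (1.50, 10), (3.0, 10)]


def flow_at(program, t):
    """Linearly interpolate the flow rate at time t from the breakpoints."""
    for (t0, f0), (t1, f1) in zip(program, program[1:]):
        if t1 > t0 and t0 <= t <= t1:  # skip zero-length step segments
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    raise ValueError(f"time {t} min is outside the 3 min run")


def combined_flow(t):
    """Total flow after the T-piece at time t."""
    return flow_at(DELIVERY_PUMP, t) + flow_at(MAKEUP_PUMP, t)


for t in (0.0, 0.5, 1.40, 1.45, 2.5):
    print(t, round(combined_flow(t), 3))
```

With the make-up pump held constantly at 50 µl min−1 instead, the same table gives a maximum combined flow of 45 + 50 = 95 µl min−1, matching the alternative set-up mentioned above.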

The injection volume was set to 2 µl, except for that for the NIHNP library, which was set to 3 µl due to the lower concentration. H-ESI was used for ionization with a vaporizer temperature of 75 °C and ion transfer tube temperature of 275 °C. The voltages were set to 3,000 V and 2,000 V for positive and negative ionization modes, respectively. The sheath gas was set to 25 a.u. and the auxiliary gas was set to 5 a.u. No sweep gas was used. The MSn tree was built with the following main settings, with the Orbitrap as the mass analyzer: For MS1, data were analyzed from m/z 115 to 2,000 with a resolution of 30,000, a radiofrequency (RF) lens of 50%, an automatic gain control (AGC) target of 100% (40,000 a.u.), and a maximum injection time (maxIT) of 50 ms. After one MS1 scan, the three most intense ions, with a minimum intensity of 6 × 10⁵ a.u. in positive and 2 × 10⁵ a.u. in negative ionization, were picked using data-dependent acquisition with an isolation window of m/z 1.2, a resolution of 15,000, an AGC target of 30% (1.2 × 10⁴ a.u.), and maxIT of 50 ms for positive and 80 ms for negative ion mode. Three fragmentation experiments to cover different collision energies (a maximum of nine scans) were conducted. For MS2, the energies were set to 20 eV and 60 eV, and the assisted collision energy to achieve the optimal MS2 energy for further MSn stages was tested in 15 eV steps, starting at 15 eV, and increasing to 30 eV, 45 eV, 60 eV and 75 eV. From this assisted collision energy step, the top five signals, with a minimum intensity of 2 × 10⁴ a.u. in positive ion mode and 1 × 10⁴ a.u. in negative ion mode, and within the mass range of m/z 90–2,000, were isolated for MS3 with an MS1 isolation window of m/z 1.2 and an MS2 isolation window of m/z 2. The resolution was set to 60,000, the AGC target to 100% (5 × 10⁴), and the maxIT to 200 ms for the positive and 500 ms for the negative ion mode.
Three fixed collision energies of 20 eV, 40 eV and 60 eV were applied, resulting in a maximum number of 15 scans. The two most intense signals with a minimum intensity of 2 × 10⁴ a.u. in positive and 1 × 10⁴ a.u. in negative ion mode were selected from the 40 eV MS3 scans for the MS4 experiments, with an isolation window of m/z 2.2. All other settings were used as in MS3. For MS5, the two highest signals of a 40 eV MS4 scan and within a mass range of m/z 150–2,000 were further fragmented using an isolation window of m/z 3 and the same settings as MS3 and MS4. Only 40 eV and 60 eV were used as the collision energy settings, given that 20 eV produced mainly the precursor ion. Dynamic exclusion was carried out at every MSn stage, meaning that each precursor was selected three times within 200 s and was excluded for the following 70 s with a mass tolerance of m/z 0.2. Additionally, isotopes of selected precursor ions were excluded within a window of m/z 2 for unassigned isotopes. It must be noted that the maximum occurrence should be set to a number divisible by 3 to carry out experiments for all three collision energies. Before the standard analysis, multiple blank injections were analyzed, and the detected signals were added to a targeted mass exclusion list. This exclusion list was updated after running 10 sample injections and the detection of reoccurring signals in all samples. The MSn schema is presented in Extended Data Fig. 2 and for one example compound in Supplementary Fig. 14. The processing was done in mzmine. A full batch configuration file is supplied as Supplementary File 2 (mzmine_exclusion_blankprofile_pos.mzbatch) and Supplementary File 3 (mzmine_exclusion_blankprofile_neg.mzbatch). Our system showed higher background signals in empty scans or less rich fragmentation scans around m/z 149.72 and m/z 173.52; therefore, these were added to an exclusion list with a width of m/z 0.03 for all MSn levels. The fragmentation tree is presented in Extended Data Fig. 5 with an example given in Supplementary Fig. 14. All settings are listed in Supplementary Tables 1 and 2.

Automatic MSn tree library generation and data evaluation in mzmine

The automatic library generation workflow was implemented in mzmine, to provide support for MS data from various vendors and open formats, spectral processing, spectral quality assessment, and annotation based on curated metadata. A spectral library generation workflow and a flow injection workflow were added to the mzwizard module, which is embedded in mzmine. The mzwizard supports a simplified workflow set-up while still preserving full configurability of the final workflow.

The data processing and automatic library extraction were done in mzmine using the steps below. A full batch configuration file is supplied as Supplementary File 4 (mzmine_msn_library_pos.mzbatch) and Supplementary File 5 (mzmine_msn_library_neg.mzbatch), and the mzwizard configuration is provided as Supplementary File 6 (mzwizard_msn_library.mzmwizard).

  1. Import of Orbitrap MS data as .raw files or as .mzML files after conversion using the ThermoRawFileParser (https://github.com/compomics/ThermoRawFileParser) or the MSConvert script (https://proteowizard.sourceforge.io/download.html).

  2. MS data processing
    1. Denoising: mass detection on MSn with the factor of the lowest signal mass detector and noise factor of 2.5 for all MS levels
    2. Background signal removal of two known artifacts
    3. Tree building
    4. Compound annotation based on a local compound database search. Here, the monoisotopic mass is used together with various selected ion adducts and in-source fragments to calculate the precursor mass. The algorithm considers only compounds in specific samples matched by a unique sample ID substring in the file names.
  3. Spectral library export to .json, .mgf or .msp formats
    1. Scoring of the precursor isolation purity (%) of MSn spectra based on the preceding and following MS1 scan. Chimeric spectra are flagged in the output file.
      1. Export of the best spectrum for each precursor and energy (highest total ion current), with no SPECTYPE information or ‘SINGLE_BEST_SCAN’ in the library file
    2. Merging of spectra, with SPECTYPE information in the library file:
      1. ‘SAME_ENERGY’: each individual fragmentation energy, when triggered multiple times in the same sample
      2. ‘ALL_ENERGIES’: all fragmentation energies for individual precursors (three energies in our method, using the merged same_energy if available, otherwise the best one)
      3. ‘ALL_MSN_TO_PSEUDO_MS2’: combining the full MSn tree of a compound ion into a pseudo-MS2 spectrum
    3. Filtering of spectra based on a minimum of two signals above the noise threshold
  4. (Optional) Reimport of the spectral library to check the success of its generation

  5. (Optional) Alignment of all feature lists across samples and their matching to the newly generated spectral library as initial validation

  6. (Optional) Manual inspection of the spectral libraries and MSn experiments using the MSn tree visualizer (see Supplementary Fig. 15 for an example).
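
The compound annotation step in the list above (precursor mass from the monoisotopic mass plus an ion adduct, restricted to compounds whose sample ID appears in the file name) can be illustrated with a minimal Python sketch. This is not the mzmine implementation: the adduct table, the 0.005 tolerance, the `Compound` class and the sample ID format are assumptions chosen for illustration only.

```python
from dataclasses import dataclass

PROTON = 1.007276  # proton mass in Da

# Illustrative adduct table (assumption): name -> (mass shift in Da, absolute charge)
ADDUCTS = {
    "[M+H]+": (PROTON, 1),
    "[M+Na]+": (22.989218, 1),
    "[M+NH4]+": (18.033823, 1),
    "[M-H2O+H]+": (PROTON - 18.010565, 1),  # in-source water loss
    "[M-H]-": (-PROTON, 1),
}


@dataclass
class Compound:
    name: str
    monoisotopic_mass: float
    sample_id: str  # hypothetical unique ID expected as a substring of the file name


def precursor_mz(monoisotopic_mass: float, adduct: str) -> float:
    """Precursor m/z of a neutral monoisotopic mass for one adduct."""
    shift, charge = ADDUCTS[adduct]
    return (monoisotopic_mass + shift) / charge


def annotate(file_name, observed_mz, compounds, adducts, tol=0.005):
    """Return (compound name, adduct) pairs that explain observed_mz.

    Only compounds whose sample ID occurs in the file name are considered.
    """
    hits = []
    for compound in compounds:
        if compound.sample_id not in file_name:
            continue  # compound was not part of this injection
        for adduct in adducts:
            expected = precursor_mz(compound.monoisotopic_mass, adduct)
            if abs(expected - observed_mz) <= tol:
                hits.append((compound.name, adduct))
    return hits


if __name__ == "__main__":
    library = [
        Compound("caffeine", 194.080376, "P01_A01"),
        Compound("isobar", 194.080376, "P01_B07"),
    ]
    print(annotate("20240101_P01_A01_pos.mzML", 195.0877, library,
                   ["[M+H]+", "[M+Na]+"]))
```

Restricting candidates by sample ID keeps isobaric compounds from other injections out of the result, which is the reason the file names need a unique ID substring.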

The processing was performed for 11,000 compounds in 1,100 injections on a DELL XPS 15 9510 laptop with 32 GB of RAM, eight processor cores and 16 threads for speed testing of the automatic library generation.

Various information can be stored within the library file, including retention time, ion mobility, collision energy, as well as instrument, method or compound specifications.

Compound list comparison between MSnLib and other spectral databases (TMAP projection)

Prior to the comparison, we cleaned and standardized the structure of all resources in the same way and calculated the InChIKey. The comparison is based on the first InChIKey block, removing stereochemistry. The libraries used are included in the Data Availability section.
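
As a minimal sketch of the comparison described above, the following Python snippet reduces each InChIKey to its first block (the connectivity layer, which ignores stereochemistry) and compares two compound lists as sets. The function names and the example keys are illustrative assumptions and are not taken from the authors' notebooks.

```python
def first_block(inchikey: str) -> str:
    """Return the first InChIKey block (14 characters before the first hyphen)."""
    return inchikey.strip().upper().split("-")[0]


def compare_libraries(keys_a, keys_b):
    """Compare two compound lists on the first InChIKey block only."""
    a = {first_block(k) for k in keys_a}
    b = {first_block(k) for k in keys_b}
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


if __name__ == "__main__":
    # Two stereoisomers share the first block and therefore count as one structure.
    lib_a = ["QNAYBMKLOCPYGJ-REOHSDQNSA-N", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"]
    lib_b = ["QNAYBMKLOCPYGJ-UWTATZPHSA-N", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"]
    for label, blocks in compare_libraries(lib_a, lib_b).items():
        print(label, sorted(blocks))
```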

Feature-based molecular networking and matching against public mass spectral libraries

The newly generated spectral library in positive ion mode for the MCEBIO library was imported into mzmine and reprocessed to a feature list. Only MS2 and pseudo-MS2 spectra were used, resulting in 48,069 spectra, to reduce data complexity, and given that most tools are limited to use with MS2 spectra. The feature list annotation was exported as a .csv file retaining the original information, for example, compound name and adduct for later comparison. Furthermore, the list was exported with the mzmine module named molecular networking files in an .mgf data format, compatible with running feature-based molecular networking (FBMN) utilizing the Global Natural Products Social Molecular Networking (GNPS) infrastructure. The parameters for FBMN were set to precursor ion and fragment ion mass tolerances of 0.02 Da, a minimum pairs cosine value (min. pairs cos.) of 0.7 (minimum cosine score necessary to connect to experimental MS2 spectra), a network TopK value of 1,000, a minimum number of matched fragment ions of 4, maximum connected component size of 0, and a maximum shift between precursors of 500 Da.

For matching our library against the three most commonly used public spectral databases, we used LC–MS/MS spectra from the MassBank of North America (MoNA, https://mona.fiehnlab.ucdavis.edu/downloads, .json format, accessed 8 December 2023), spectra from the MassBank EU database, specifically MassBank_NIST.msp (https://github.com/MassBank/MassBank-data/releases/tag/2023.11), and spectra from the GNPS library, namely ALL_GNPS_NO_PROPOGATED (https://gnps-external.ucsd.edu/gnpslibrary, accessed 8 December 2023). Prior to matching our MSnLib’s feature list against these public spectral databases (see the first step), the original annotations were removed to retain only spectral library matches. We used the following settings: a minimum of 4 matched signals, a precursor m/z tolerance and spectral m/z tolerance of 0.005 m/z or 10 ppm, removal of the precursor, and a weighted cosine similarity with a minimum similarity of 0.6 and a weighting by the square root of the signal intensity ((m/z)^0 × I^0.5). The top five and top 10 matches were exported for further evaluation. Here, the feature ID produced by mzmine was used to compare the annotations by spectral matching with the original compound information. The structures of the matched compounds were cleaned with the same script as in the metadata clean-up and a new InChIKey string was computed. This InChIKey string was used to find spectra that were matched to the identical compound. Finally, to further evaluate the top 10 matching hits, we calculated their Tanimoto similarity, based on Morgan fingerprints (radius = 2, nBits = 2048), and determined their maximum common edge subgraph (MCES)19, using the default settings in the Python package. The FBMN visualization is provided in Supplementary File 7.
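
The matching settings in the paragraph above (square-root intensity weighting, a tolerance of 0.005 or 10 ppm, precursor removal, at least 4 matched signals and a minimum similarity of 0.6) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the mzmine algorithm: taking the larger of the absolute and ppm tolerance, pairing signals greedily, and normalizing over all signals of both spectra are choices made here for simplicity.

```python
import math


def within_tol(mz_a, mz_b, abs_tol=0.005, ppm=10.0):
    """True if two m/z values agree within the larger of abs_tol and ppm (assumption)."""
    return abs(mz_a - mz_b) <= max(abs_tol, mz_a * ppm * 1e-6)


def remove_precursor(spectrum, precursor_mz):
    """Drop signals at the precursor m/z; spectrum is a list of (mz, intensity)."""
    return [(mz, i) for mz, i in spectrum if not within_tol(precursor_mz, mz)]


def weight(mz, intensity, mz_power=0.0, int_power=0.5):
    """Signal weight (m/z)^0 x I^0.5, i.e. the square root of the intensity."""
    return (mz ** mz_power) * (intensity ** int_power)


def weighted_cosine(query, library):
    """Return (cosine score, number of matched signals) for two spectra."""
    q = [(mz, weight(mz, i)) for mz, i in query]
    lib = [(mz, weight(mz, i)) for mz, i in library]
    # All signal pairs within tolerance, strongest weight product first.
    pairs = sorted(
        ((wq * wl, iq, il)
         for iq, (mq, wq) in enumerate(q)
         for il, (ml, wl) in enumerate(lib)
         if within_tol(mq, ml)),
        reverse=True,
    )
    used_q, used_l, dot = set(), set(), 0.0
    for product, iq, il in pairs:  # greedy one-to-one assignment
        if iq in used_q or il in used_l:
            continue
        used_q.add(iq)
        used_l.add(il)
        dot += product
    norm = math.sqrt(sum(w * w for _, w in q)) * math.sqrt(sum(w * w for _, w in lib))
    score = dot / norm if norm > 0 else 0.0
    return score, len(used_q)


def is_match(query, library, min_score=0.6, min_matched=4):
    """Apply the minimum similarity and minimum matched signals thresholds."""
    score, matched = weighted_cosine(query, library)
    return score >= min_score and matched >= min_matched
```

Square-root weighting reduces the dominance of a few intense signals, so that lower-abundance fragments contribute more to the score than they would with raw intensities.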

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41592-025-02813-0.

Supplementary information

Supplementary Information (6.4MB, pdf)

Supplementary Notes 1–8, Supplementary Figs. 1–15, Supplementary Tables 1 and 2

Reporting Summary (1.6MB, pdf)
Peer Review File (4.5MB, pdf)
Supplementary File 1 (45.2MB, tsv)

Compounds included in the seven compound libraries

Supplementary File 2 (9.9KB, mzbatch)

mzmine batch configuration file for blank analysis in positive mode

Supplementary File 3 (10.2KB, mzbatch)

mzmine batch configuration file for blank analysis in negative mode

Supplementary File 4 (197.5KB, mzbatch)

mzmine batch configuration file for library analysis in positive mode

Supplementary File 5 (180.1KB, mzbatch)

mzmine batch configuration file for library analysis in negative mode

Supplementary File 6 (15KB, mzmwizard)

mzwizard configuration file for mzmine

Supplementary File 7 (2.3MB, cys)

CytoScape file for FBMN visualization

Supplementary File 8 (4.3KB, tsv)

Statistics of the seven analyzed libraries

Supplementary File 9 (43KB, mzbatch)

mzmine batch file for analysis of MassIVE dataset MSV000096589

Supplementary File 10 (6.9KB, tsv)

Results of the annotation of MassIVE dataset MSV000096589

Supplementary File 11 (33.6MB, pdf)

Results of the annotation of MassIVE dataset MSV000096589 – mirror plots

Acknowledgements

C.B. was supported by the Czech Academy of Sciences PPLZ fellowship number L200552251. The mzmine project is funded by the European Union, the BAB (funding bank for Bremen and Bremerhaven), and the Senator of Economics, Ports and Transformation Bremen (65002459). A.M. was supported by the project National Institute of Virology and Bacteriology (Programme EXCELES, ID Project Number LX22NPO5103), funded by the European Union (Next Generation EU). P.C.D. was supported by grants from the National Institutes of Health R01DK136117 and R01GM107550. Z.K. was supported by the Ministry of Education, Youth and Sports of the Czech Republic grant CZ.02.01.01/00/22_008/0004597 within the One Health framework. T.P. was supported by the Czech Science Foundation (GA CR) grant 21-11563M and by the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement Number 891397. We acknowledge core facility structural mass spectrometry of the Czech Infrastructure for Integrative Structural Biology (CIISB), Instruct-CZ Centre, supported by the Ministry of Education, Youth and Sports of the Czech Republic (MEYS CR; LM2023042) and European Regional Development Fund-Project ‘Innovation of Czech Infrastructure for Integrative Structural Biology’ (Number CZ.02.01.01/00/23_015/0008175). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank O. Mokshyna and R. Bushuiev for fruitful discussions on metadata standardization and chemical space visualization. We thank M. Ulaszewska and T. Mak for providing information about mzCloud and NIST23, respectively, and acknowledge the support of the Dagstuhl computational metabolomics communities. The conceptual basis of this project was formed at Dagstuhl Seminars 20051 and 22181. We thank F. Rooks for editing this paper.

Author contributions

C.B., R.S. and T.P. conceived the method and wrote the paper. D.P., L.-F.N., Z.K. and P.C.D. gave feedback on the method and results. C.B. and R.S. wrote the metadata clean-up Python script. C.B. performed MS data acquisition and spectral library extraction. R.S., S.H. and C.B. added new modules for the automatic library building in mzmine. Z.K., V.V. and R.N. provided compound libraries. A.M., M.D., P.S. and L.-F.N. prepared the mixing and dilution of the compound libraries. M.D. developed a script to optimize the mixing of compounds, avoiding interferences. S.H., A.M., M.D., P.S., P.C.D., D.P., L.-F.N. and Z.K. provided feedback for the paper. All authors read and approved the paper.

Peer review

Peer review information

Nature Methods thanks Juan Antonio Vizcaíno and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Data availability

All metadata files, one per compound library, and mzmine batch files for library processing were uploaded to the MERLIN (Mass spEctRal LIbrary Network) GitHub repository (https://github.com/merlin-ms) under the MIT license. All acquired flow injection–Orbitrap MSn files were deposited as .mzML and .raw files in the Zenodo datasets: 10966280 (.mzML positive and negative), 10966404 (.raw positive) and 10967081 (.raw negative) under the CC BY 4.0 license, and were uploaded to MassIVE MSV000094528 under the CC0 1.0 license. The mass spectral libraries included in MSnLib were deposited as .mgf and .json files in the Zenodo dataset 11163380 under the CC BY 4.0 license. Here, each spectral library for the individual compound libraries is uploaded as MS2 only or the full MSn library. The MS2 libraries contain the best and merged spectra for all acquired MS2 spectra, including the pseudo-MS2, in which the whole fragmentation tree is combined into a single spectrum. Polarities are kept separated, resulting in four entries for each compound library and library format. Additionally, the MS2 data are uploaded as a reference library in GNPS (https://external.gnps2.org/gnpslibrary) in the form of a default gold-level library named MSNLIB-POSITIVE and MSNLIB-NEGATIVE. We recommend using the Zenodo libraries because they contain more metadata and link back to the original data by universal spectrum identifier (USI). Regarding the metadata clean-up, the DrugBank lookup is an optional step, and the data are accessible on request. More information is available on the project website (https://go.drugbank.com). DrugCentral lookup is an optional step, and their whole database can be downloaded as a PostgreSQL dump from the company’s website (https://drugcentral.org). LOTUS lookup is an optional step, and the whole LOTUS dataset is incorporated into WIKIDATA. The metadata clean-up script contains a Prefect flow to download all relationships from WIKIDATA. Simply run the prepare_wikidata_lotus_data_prefect.py script (https://github.com/corinnabrungs/msn_tree_library). Broad Institute lookup is an optional step. The drug information can be downloaded as a .txt file from the institute’s website (https://repo-hub.broadinstitute.org/repurposing). The FBMN result can be accessed on GNPS: https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c05d34fb31ab4ee99293e722fb7eb83d. Experimental public libraries were downloaded from the corresponding webpages: GNPS, ALL_GNPS_NO_PROPAGATED (https://external.gnps2.org/gnpslibrary, accessed 8 December 2023); MassBank EU, MASSBANK_NIST (https://github.com/MassBank/MassBank-data/releases/tag/2023.11, accessed 8 December 2023); MassBank North America, LC–MS/MS spectra (https://mona.fiehnlab.ucdavis.edu/downloads, accessed 8 December 2023). The dataset used for the library evaluation can be found in MassIVE under MSV000096589. Here we used only a subset of group 1 (group1_B*.mzML) for the analysis. The mzmine batch file is supplied as Supplementary File 9 and the results of this evaluation as Supplementary Files 10 and 11.

Code availability

The metadata clean-up pipeline was implemented in Python 3.10, with the Prefect library as the task orchestration tool. The source code is available on GitHub under the free and open MIT license (https://github.com/corinnabrungs/msn_tree_library). The TMAP projections and other analysis are available as Jupyter notebooks on GitHub (https://github.com/corinnabrungs/msn_tree_library/tree/master/notebooks). The mzmine code and documentation are available on GitHub under the same license (https://github.com/mzmine/mzmine3, https://github.com/mzmine/mzmine_documentation).

Competing interests

T.P., S.H. and R.S. are co-founders of mzio GmbH, which develops the mzmine software for mass spectrometry data processing. P.C.D. is an advisor and equity holder in the companies Cybele and Sirenas; a science advisor and equity holder in bileOmix; a scientific co-founder, advisor and equity holder of Ometa, Enveda and Arome, with prior approval by the University of California San Diego; and consulted for DSM Animal Health in 2023. The other authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Corinna Brungs, Robin Schmid.

Contributor Information

Robin Schmid, Email: rschmid1789@gmail.com.

Tomáš Pluskal, Email: tomas.pluskal@uochb.cas.cz.

Extended data are available for this paper at 10.1038/s41592-025-02813-0.

Supplementary information

The online version contains supplementary material available at 10.1038/s41592-025-02813-0.

References

  1. Stein, S. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
  2. Blaženović, I. et al. Structure annotation of all mass spectra in untargeted metabolomics. Anal. Chem. 91, 2155–2162 (2019).
  3. Bittremieux, W., Wang, M. & Dorrestein, P. C. The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 18, 94 (2022).
  4. Sorokina, M., Merseburger, P., Rajan, K., Yirik, M. A. & Steinbeck, C. COCONUT online: Collection of Open Natural Products database. J. Cheminform. 13, 2 (2021).
  5. Zdrazil, B. et al. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res. 52, D1180–D1192 (2024).
  6. Kim, S. et al. PubChem 2023 update. Nucleic Acids Res. 51, D1373–D1380 (2023).
  7. de Jonge, N. F. et al. Good practices and recommendations for using and benchmarking computational metabolomics metabolite annotation tools. Metabolomics 18, 103 (2022).
  8. van der Hooft, J. J. J., Vervoort, J., Bino, R. J. & de Vos, R. C. H. Spectral trees as a robust annotation tool in LC–MS based metabolomics. Metabolomics 8, 691–703 (2012).
  9. Kasper, P. T. et al. Fragmentation trees for the structural characterisation of metabolites. Rapid Commun. Mass Spectrom. 26, 2275–2286 (2012).
  10. Vaniya, A. & Fiehn, O. Using fragmentation trees and mass spectral trees for identifying unknown compounds in metabolomics. Trends Analyt. Chem. 69, 52–61 (2015).
  11. Waridel, P. et al. Evaluation of quadrupole time-of-flight tandem mass spectrometry and ion-trap multiple-stage mass spectrometry for the differentiation of C-glycosidic flavonoid isomers. J. Chromatogr. A 926, 29–41 (2001).
  12. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
  13. Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
  14. Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
  15. Schmid, R. et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat. Biotechnol. 41, 447–449 (2023).
  16. Probst, D. & Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminform. 12, 12 (2020).
  17. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
  18. Bento, A. P. et al. An open source chemical structure curation pipeline using RDKit. J. Cheminform. 12, 51 (2020).
  19. Kretschmer, F., Seipp, J., Ludwig, M., Klau, G. W. & Böcker, S. Coverage bias in small molecule machine learning. Nat. Commun. 16, 554 (2025).

Articles from Nature Methods are provided here courtesy of Nature Publishing Group
