Abstract
Chemical formula annotation for tandem mass spectrometry (MS/MS) data is the first step toward structurally elucidating unknown metabolites. While great strides have been made toward solving this problem, the current state-of-the-art method depends on time-intensive, proprietary, and expert-parameterized fragmentation tree construction and scoring. In this work we extend our previous spectrum Transformer methodology into an energy-based modeling framework, MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formulae, for learning to rank chemical formula and adduct assignments given an unannotated MS/MS spectrum. Importantly, MIST-CF learns in a data-dependent fashion using a Formula Transformer neural network architecture and circumvents the need for fragmentation tree construction. We train and evaluate our model on a large open-access database, showing an absolute improvement of 10% in top 1 accuracy over other neural network architectures. We further validate our approach on the CASMI2022 challenge dataset, achieving nearly equivalent performance to the winning entry within the positive mode category without any manual curation or post-processing of our results. These results demonstrate an exciting strategy to more powerfully leverage MS2 fragment peaks for predicting the MS1 precursor chemical formula with data-driven learning.
Keywords: mass spectrometry, machine learning, metabolomics, graph neural networks, chemical formulae
Graphical Abstract

Introduction
The discovery of previously unknown small molecules in biological samples is rapidly expanding our knowledge of plant chemistry,1,2 cancer biology,3,4 host-microbiome interactions,5–7 and other metabolite-mediated human biology.8 Similar small molecule discoveries in the environmental sciences have led to new insights regarding the exposome and pollutant effects, resolving mysteries such as high salmon mortality rates.9,10 Increasing our ability to detect and identify the so-called “dark metabolome” with analytical chemistry techniques represents an exciting opportunity in experimental and computational chemistry.
Tandem mass spectrometry (MS/MS) is a particularly well-suited analytical technique for this, as it allows for the high throughput characterization of small molecules from complex mixtures.11 In an MS/MS experiment, both an intact ionized mass (MS1) and a set of fragment peak masses (MS2) can be measured for each unknown molecule. This fragmentation spectrum serves as a structural representation of the molecule, ideally allowing the practitioner to match a small molecule structure to each resulting spectrum based upon database spectra matches. Due to the vastness of chemical space, however, many observed spectra have no precedent; in a large public MS/MS database, 87% of observed spectra remain unannotated.12
In such instances without spectral matches, we must rely on informatics and predictive modeling to identify the likely molecular structure. This pipeline almost always begins by inferring a chemical formula from the observed spectrum (Figure 1a). There are many formula options for each observed MS1 value, especially for higher mass compounds. Specifying the chemical formula (e.g., C6H12O6, C9H11NO3, etc.) constrains the space of potential compound candidates to a set of isomers, whereas the MS1 alone only constrains the space of candidates to those with similar masses. Automated assignment is far from trivial; while the maximum formula annotation accuracy was 94% in the recent Critical Assessment of Small Molecule Identification in 2022 (CASMI2022),13 the median score was 71%, and these percentages were calculated only for the submitted predictions, rather than the total number of spectra tested. Improving the automated prediction accuracy of this step promises to improve, simplify, and speed up downstream analyses.
Figure 1: MIST-CF and SIRIUS both address the MS/MS chemical formula annotation problem.

a. Input samples are first processed, recording a tandem mass spectrum. Before assigning full molecular structures, the chemical formula can first be inferred to constrain structure assignment. b. Methodological similarities and differences between MIST-CF and SIRIUS. A candidate precursor mass is first decomposed into plausible chemical formulae and adduct pairs. MIST-CF (left) learns in a data-driven fashion to assign scores and circumvents the need of fragmentation tree construction compared to SIRIUS (right). Both methods rely on assigning subformulae (“subforms”), which are chemical formula subsets of the candidate precursor formula that match individual MS/MS peak masses.
Chemical formula annotation tools can be grouped into two categories: database dependent and database independent. Database dependent searches place restrictions on potential formulae, querying the candidate mass and spectrum against databases including NIST,14 GNPS,15 HMDB,16 or large compound libraries such as PubChem.17 Relying on databases inherently limits annotations to formulae that have already been observed. On the other hand, database independent (de novo) chemical formula annotation considers all possible chemical formulae, though the task becomes more challenging due to the larger number of candidates. A recent bottom-up computational approach, BUDDY,18 is a hybrid of the two approaches and assigns database formulae to MS2 peaks and neutral losses. By combining peak and neutral loss annotations, BUDDY generates potential candidates not present in databases. However, overall annotation performance is a function both of how the candidate space is defined and an algorithm’s ability to rank them; this limits our ability to perform a direct comparison to or analysis of BUDDY. Herein we focus specifically on de novo formula annotation for maximal flexibility using only MS/MS information.
Notably, both MZmine19,20 and SIRIUS21–23 have developed widely used methods for scoring chemical formula candidates using MS/MS information for de novo annotation. While MZmine evaluates each candidate formula based on the number of MS/MS peaks it can explain,20 SIRIUS scores them through a more expressive fragmentation tree strategy.21–23 SIRIUS first proposes candidate formulae through an exhaustive enumeration step up to a certain mass error from the observed MS1, labels the MS2 peaks with potential “subformulae” of the candidate MS1 annotation, performs an optimization to arrange these subformula annotations into a fragmentation tree, and finally computes a likelihood of the chemical formula based upon the tree and isotopic patterns. Despite their success and widespread use, these methods offer room for improvement in terms of both accuracy and speed. We recently observed program timeouts using SIRIUS for larger molecules (i.e., over 800 Da) during fragmentation tree calculations.24 An additional, lesser appreciated component of SIRIUS is that the method re-ranks candidate chemical formula based upon compound scores in the structure annotation step; this phase implicitly reposes SIRIUS as a database dependent search and led to changes in formula annotations in our previous study, particularly with respect to adduct assignments.24 It is unclear which pipeline steps lead to the high empirical formula annotation rates and the extent to which the tree score can be improved.
In this work, we present MIST-CF: Metabolite Inference with Spectrum Transformers for Chemical Formulae, an energy-based modeling approach to improving the database-independent, de novo chemical formula assignment step conditioned on both the MS1 mass and MS/MS spectrum. We previously demonstrated that Formula Transformers can be used to replace both the SVM module and fragmentation tree kernels used by CSI:FingerID during the annotation step.24 This study now demonstrates that fragmentation trees can be replaced throughout the MS/MS processing pipeline for equally accurate, fast, and robust predictions. As part of this, we develop a simple peak subformula assignment routine, thus circumventing the need for fragmentation tree construction. We rigorously evaluate the performance of MIST-CF on two datasets: a public dataset subset from the GNPS15 we term NPLIB125,26 and a variant of NPLIB1 including spectra from the commercial NIST20 dataset.14 By training and evaluating our model both with and without data from NIST20, we enable reproducible evaluation of select trained models even in the absence of a NIST20 license. MIST-CF achieves equivalent formula annotation accuracy on the positive mode CASMI2022 challenge spectra to the winning SIRIUS solution and outperforms the out-of-the-box SIRIUS assignments by a margin of 19% (i.e., 86.8% vs. 67.8% accuracy). Altogether, this demonstrates a path forward for replacing fragmentation trees in MS/MS data processing with a fully integrated deep learning structure annotation pipeline.
We release MIST-CF as an open source tool that can be easily integrated into existing pipelines, with or without retraining. It is freely available under the MIT license on Zenodo (record 8151513) and at https://github.com/samgoldman97/mist-cf.
Methods
Preliminaries
In an MS/MS experiment, an input small molecule is ionized, often by the addition of an adduct (denoted by a one-hot vector, 𝒜), measured, and fragmented in order to produce an MS/MS spectrum 𝒮. The spectrum is composed of peaks that can each be represented as (mass, intensity) pairs:
𝒮 = {(m₁, I₁), (m₂, I₂), …, (mₙ, Iₙ)}        (1)
Herein we consider only positively charged ions and peaks and assume that each fragment carries only a single positive charge; m/z and mass are used interchangeably when describing observed peaks.
In addition to the observed MS/MS spectrum 𝒮, we also have access to the MS1 measurement, denoted as the precursor mass M. The core challenge of formula annotation is to determine the chemical formula of the measured molecule given 𝒮 and M. We denote the target chemical formula as a vector of integers, 𝐟.
We represent the target chemical formula as a vector 𝐟, in which each entry 𝐟(i) corresponds to the integer count for the observed element, where the parenthetical denotes an index into the vector. In total, we consider a set of common elements, “C”, “N”, “P”, “O”, “S”, “Si”, “I”, “H”, “Cl”, “F”, “Br”, “B”, “Se”, “Fe”, “Co”, “As”, “Na”, and “K”, for formula vectors of size 18. We consider common positive mode adducts when generating chemical formula candidates: [M+H]+, [M+Na]+, [M-H2O+H]+, [M+NH4]+, [M]+, and [M-2H2O+H]+. The adduct candidate is specified by a one-hot vector 𝒜. For de novo chemical formula annotation, the precursor mass M can be decomposed into potential formula and adduct candidate options exhaustively using a highly efficient dynamic programming algorithm22 within small 1 to 5 parts-per-million (ppm) mass tolerances. The generated (formula, adduct) candidates can be further filtered based on presence in a database or using various heuristics such as number of ring double bond equivalents if desired.27
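As a concrete illustration (not part of MIST-CF's codebase), a formula string can be mapped onto this 18-dimensional count vector, and an [M+H]+ precursor m/z can be computed from standard monoisotopic element masses. The parser below is a sketch that handles only simple formula strings without parentheses, and the mass table covers only a subset of the 18 elements:

```python
import re

# Element ordering for the 18-dimensional formula vectors (order as listed in the text).
ELEMENTS = ["C", "N", "P", "O", "S", "Si", "I", "H",
            "Cl", "F", "Br", "B", "Se", "Fe", "Co", "As", "Na", "K"]

# Monoisotopic masses (Da) for a subset of elements; values from standard tables.
MONO_MASS = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915,
             "P": 30.973762, "S": 31.972071, "Na": 22.989770, "K": 38.963707}

PROTON_MASS = 1.007276  # mass added by an [M+H]+ adduct (single positive charge)

def formula_to_vector(formula):
    """Parse a simple formula string (no parentheses) into an 18-dim count vector."""
    vec = [0] * len(ELEMENTS)
    for elem, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        vec[ELEMENTS.index(elem)] += int(count) if count else 1
    return vec

def precursor_mz(vec, adduct_mass=PROTON_MASS):
    """Monoisotopic mass of the formula vector plus the adduct mass."""
    return sum(n * MONO_MASS[e] for e, n in zip(ELEMENTS, vec) if n) + adduct_mass

vec = formula_to_vector("C6H12O6")  # glucose
mz = precursor_mz(vec)              # [M+H]+ m/z, approximately 181.0707
```

The same vector representation underlies both the candidate formulae and the per-peak subformulae discussed next.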
In addition to inferring the chemical formula for the full molecule, a step common to SIRIUS fragmentation tree generation21 and CSI:FingerID28 is to derive corresponding subformula annotations for the MS/MS peaks. In this way, a spectrum can be considered as a set of (subformula, intensity) pairings, in which the set of peak masses is replaced with a set of chemical formula vectors {𝐟ⱼ}, where each chemical formula is a subset of the precursor formula, i.e., 𝐟ⱼ(i) ≤ 𝐟(i) for all i; peaks for which no formula can be assigned are excluded from the list. Because the fragment peaks are also charged, the observed masses are a summation of the mass of the subformula and its adduct. We define the MS2 peak adduct for the j-th peak as aⱼ, the mass of which must be subtracted when considering chemical formula assignment. The parts-per-million difference at each peak between the adduct-adjusted peak mass and the assigned subformula mass is referred to as δⱼ to indicate measurement error.
Approaching chemical formula annotation with an energy based model formulation
As exhaustive chemical formula candidate generation can be solved via dynamic programming,22 the key challenge in de novo formula annotation is to score how well each candidate formula matches the observed MS1 and MS2 spectra. Taking a probabilistic lens and following previous work in metabolomics and proteomics,22,29 our goal is to learn the probability of a candidate chemical formula and adduct, p(𝐟, 𝒜 | 𝒮, M). We assume the MS1 is useful only for candidate generation, and so the problem is simplified to approximating p(𝐟, 𝒜 | 𝒮).
Energy based models (EBM) are a probabilistic modeling framework drawing inspiration from physics in which a probability distribution is defined by an energy function, E(x).30
Mathematically, EBMs take the form:
p(x) = exp(−E(x)) / Σx′ exp(−E(x′))        (2)
where the denominator is referred to as a partition function and serves to normalize the energy to a valid probability, but is typically intractable to evaluate exactly. EBMs have reemerged in recent years, with work across the chemical sciences for reaction prediction,31 retrosynthesis,32 and scoring protein side-chain positions.33 These models are naturally suited for ranking applications. In our case, we factorize the probability of the candidate formula conditioned on the spectrum as an EBM of the form:
p(𝐟, 𝒜 | 𝒮) = exp(−Eθ(𝐟, 𝒜, 𝒮)) / Σ(𝐟′, 𝒜′) exp(−Eθ(𝐟′, 𝒜′, 𝒮))        (3)
where Eθ defines an arbitrary neural network energy function that takes as input the candidate formula, adduct, and full fragmentation spectrum. For any differentiable Eθ, the energy function can be learned via a softmax loss function, aggregated over minibatches, and minimized via stochastic mini-batch gradient descent:
ℒ = −log [ exp(−Eθ(𝐟, 𝒜, 𝒮)) / ( exp(−Eθ(𝐟, 𝒜, 𝒮)) + Σ(𝐟′, 𝒜′)∈𝒟 exp(−Eθ(𝐟′, 𝒜′, 𝒮)) ) ]        (4)
where 𝒟 defines a set of “decoy” formulae to which Eθ will learn to assign a low score. In practice, we sample these decoy formulae during data preprocessing from the space of formulae with equivalent masses within a 10 ppm tolerance. We further filter candidate decoys to a maximum of 256 decoys per spectrum using the FastFilter model described below in order to sample “harder” decoys. The trained model can be applied independently to each candidate formula at inference time to yield a ranked list of assignments.
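To make the softmax loss concrete, here is a minimal per-spectrum NumPy sketch (the actual model computes this in PyTorch over minibatches): the true candidate's negated energy competes against the negated decoy energies in a numerically stabilized log-softmax.

```python
import numpy as np

def ebm_nll(true_energy, decoy_energies):
    """Per-spectrum softmax negative log-likelihood over one true candidate
    and its decoys, as in the loss described above."""
    # Negated energies act as logits; the true candidate sits at index 0.
    logits = np.concatenate([[-true_energy],
                             -np.asarray(decoy_energies, dtype=float)])
    logits -= logits.max()  # log-sum-exp stabilization
    return -(logits[0] - np.log(np.exp(logits).sum()))

# A true candidate with much lower energy than its decoys yields a near-zero loss.
loss = ebm_nll(-10.0, [0.0, 0.0])
```

Because each candidate is scored independently, inference reduces to evaluating the energy for every (formula, adduct) candidate and sorting.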
Our approach is conceptually similar to SIRIUS,21–23 but differs in that SIRIUS uses a heuristic maximum a posteriori (MAP) estimator to score fragmentation trees. This requires manually setting parameters such as the frequency of observing various fragments in the data. By using a flexible energy function trained via supervised learning, MIST-CF does not require manual parametrizations nor necessitate fragmentation tree generation. SIRIUS also allows for the incorporation of isotopic information from the MS1 to help identify chemical formula candidates. We focus solely on the MS2-related score rather than isotopic MS1 information, as this can be subsequently added and is often excluded from entries in spectral databases such as GNPS.15
The MIST-CF architecture
MIST-CF parametrizes the energy function using a Chemical Formula Transformer24 we denote as Eθ. For the input 𝒮, we first attempt to label each spectrum peak with a plausible chemical subformula such that the spectrum can be represented by a set of subformulae (i.e., {(𝐟ⱼ, Iⱼ)}). Peaks are sorted by intensity and a maximum number are retained, set to 20 by default. MIST-CF then subsequently encodes these subformula peaks with a Chemical Formula Transformer, a Set Transformer variant34,35 we previously defined as part of the MIST architecture.24 The full MIST-CF pipeline is illustrated in Figure 1b, compared side-by-side to the SIRIUS model that requires a fragmentation tree calculation. We review the exact structure of this model below by describing each modeling step.
Subformula annotation
The first step in scoring a candidate formula for a spectrum is to assign a subformula to as many MS2 peaks as possible. We begin by subtracting the precursor adduct mass from all mass values in the MS2. This approach assumes that the subpeaks have the same adduct ionization as the precursor ion, but could be modified in future iterations of MIST-CF. All possible subformulae are enumerated and filtered with a ring double bond equivalent heuristic27 to remove implausible formulae and generate a candidate set of subformulae. The mass for each subformula is compared to all adduct-adjusted spectrum masses, and the peak masses are assigned their nearest subformula match within a variable ppm tolerance for each instrument type (Unknown/Ion Trap: 15 ppm; Q-ToF: 10 ppm; Orbitrap/FTICR: 5 ppm). An important contribution of this work with respect to the original MIST model24 is that this subformula assignment is now achieved with a compact and open source NumPy36 module, reducing the reliance on SIRIUS for high quality MS2 subformula annotations.
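The nearest-match step can be sketched in a few lines of NumPy. This is a simplified stand-in for the module described above: it takes already-enumerated candidate subformula masses as input, whereas the real module also performs the enumeration and RDBE filtering.

```python
import numpy as np

def assign_subformulae(peak_masses, cand_masses, ppm_tol):
    """Assign each peak its nearest candidate subformula mass within a ppm
    tolerance. peak_masses are adduct-adjusted observed m/z values;
    cand_masses are monoisotopic masses of enumerated candidate subformulae.
    Returns (peak index, candidate index) pairs for matched peaks."""
    peak_masses = np.asarray(peak_masses, dtype=float)
    cand_masses = np.asarray(cand_masses, dtype=float)
    # Pairwise relative error in ppm, shape (n_peaks, n_candidates).
    ppm_err = (np.abs(peak_masses[:, None] - cand_masses[None, :])
               / peak_masses[:, None] * 1e6)
    best = ppm_err.argmin(axis=1)
    keep = ppm_err[np.arange(len(peak_masses)), best] <= ppm_tol
    return [(int(i), int(best[i])) for i in np.where(keep)[0]]

# A peak 5 ppm from a candidate is matched; one with no candidate nearby is dropped.
matches = assign_subformulae([100.0005, 200.0], [100.0, 150.0], ppm_tol=10)
```

Peaks left unmatched here correspond to the peaks excluded from the subformula list in the Preliminaries.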
Formula embeddings
The subformula-annotated peaks are treated as integer vector inputs. Rather than pass these into a neural network directly, we first featurize each integer vector count into a sinusoidal embedding vector26,37 and concatenate the resulting output. Briefly, each integer element count, c, in the input formula is encoded by our counts-based encoder into the vector:
Enc(c) = Abs([ sin(2πc/λ₁), sin(2πc/λ₂), …, sin(2πc/λ₈) ])        (5)
where the periods (λ₁, λ₂, etc.) are set at increasing powers of two up to 256 to discriminate all possible element counts given in the input, and Abs(·) is the absolute value function that results in only non-negative embeddings. While some information may be lost with this activation, empirically, using only positive embeddings was found to be helpful in prior work and has been maintained for consistency.26 Integer encodings of dimension 8 (log2(256)) for all 18 considered elements within a formula are flattened and concatenated, leading to a full formula encoding Enc(𝐟) of dimension 144.
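The counts-based encoder can be sketched as below. The exact period schedule (powers of two from 2 through 256) is one plausible reading of the description above and should be treated as an assumption rather than the exact published parameterization:

```python
import numpy as np

def encode_count(c, max_period=256):
    """Positive sinusoidal embedding of one element count (8 dims = log2(256)).
    Assumed period schedule: increasing powers of two, 2, 4, ..., 256."""
    periods = 2.0 ** np.arange(1, int(np.log2(max_period)) + 1)
    return np.abs(np.sin(2.0 * np.pi * c / periods))

def encode_formula(vec):
    """Flatten the per-element encodings of an 18-dim formula vector (18 * 8 = 144)."""
    return np.concatenate([encode_count(c) for c in vec])
```

Each formula (or subformula) vector thus maps deterministically to a fixed 144-dimensional non-negative embedding before entering the network.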
Spectra context
In addition to the encodings, the instrument type is considered as a one-hot vector covariate distinguishing between Ion Trap, Q-ToF, Orbitrap, and FTICR instruments. We denote this instrument type one-hot as 𝐜.
Transformer architecture
Each peak subformula is first encoded into a hidden state vector hⱼ that is then passed into the Transformer, with the precursor formula included as a special formula annotation h₀. In addition to encoding the formula, we add additional features for the formula difference between the MS1 formula and each MS2 subformula candidate vector (also converted into a sinusoidal embedding), a floating point scalar value for the MS2 peak intensity (set to 1 for the MS1 precursor formula candidate), the observed mass error δⱼ between the adduct-adjusted mass and the monoisotopic mass of the MS2 subformula candidate (set to 0 for the MS1 precursor formula candidate), a one-hot encoding for the adduct type (assumed to be the same within a single spectrum), the instrument type one-hot vector 𝐜, a Boolean flag set to 1 only at the precursor MS1 formula, and the number of total peaks annotated with subformulae. These representations are embedded into hidden dimension d with a shallow single hidden layer MLP:
xⱼ = [ Enc(𝐟ⱼ) ∥ Enc(𝐟 − 𝐟ⱼ) ∥ Iⱼ ∥ δⱼ ∥ 𝒜 ∥ 𝐜 ∥ bⱼ ∥ n ]        (6)

hⱼ = MLP(xⱼ)        (7)

where ∥ denotes concatenation, bⱼ is the precursor indicator flag, and n is the number of annotated peaks.
A standard multi-head Transformer, with a slight modification to include featurized attention between peaks as described previously in the MIST architecture,24 is then used to transform these peak representations into a score:
Eθ(𝐟, 𝒜, 𝒮) = MLP(Pool(Transformer(h₀, h₁, …, hₙ)))        (8)
where Pool(·) denotes the conversion of the variable length output of the Transformer into a fixed length vector by selecting the output representation at only the special precursor position. Because certain training set examples do not include measured MS1 masses, we take care to avoid inputting the relative mass difference between the assigned precursor formula and the MS1 mass, which would otherwise be artificially set to 0. This helps avoid additional machine bias, as many of the training spectra record a theoretical MS1 mass rather than a measured one.
Model training
All models are trained using the aforementioned loss calculated using decoy formulae. We sample minibatches of B spectra, with up to D decoys sampled for each spectrum in the minibatch, resulting in a total of B × (D + 1) candidate formulae encoded per batch. To sample the most plausible “hard” negative decoys in each batch, we utilize a FastFilter module described in detail below. All models are implemented in Python version 3.8 using PyTorch version 1.9 and PyTorch Lightning38 version 1.6 and trained with the Adam optimizer.39 Hyperparameters are optimized with Ray Tune40 version 2.0. Each model was trained on a single NVIDIA RTX A5000 GPU with training times taking under 3 hours of wall time.
Generating formula candidates
By default, MIST-CF utilizes the mass decomposition algorithm embedded within the SIRIUS software23,41,42 (an independent module from the rest of their pipeline) to decompose MS1 precursor masses into candidate formula options. We do not restrict the counts of the common elements C, N, O, or H. We set the maximum number of S and P atoms to 5 and 3, respectively, and limit each halogen (i.e., F, Cl, Br, I) to a maximum of one per formula, allowing for the recovery of 96% of formulae in the public NPLIB1 dataset. This constraint can be changed at inference time as desired. The chemical filter option “COMMON” is used to generate formulae for energy-based model training, and the “RDBE” filter is applied during inference. The mass deviation is set to 10 ppm during training for all spectra unless otherwise specified. For retrospective analysis on the NPLIB1 and NIST20 datasets, we resample individual parent masses using a truncated Gaussian centered around the exact mass with a standard deviation of 1/5 the instrument-specific ppm tolerance (Unknown/Ion Trap: 15 ppm; Q-ToF: 10 ppm; Orbitrap/FTICR: 5 ppm) as defined in BUDDY.18 We utilize SIRIUS version 5.6.3.
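The parent-mass resampling step might be sketched as follows. The center and sigma (1/5 of the instrument ppm tolerance) come from the text; the truncation bound of plus or minus one full tolerance is our assumption, as the text does not specify where the Gaussian is truncated:

```python
import numpy as np

def resample_parent_mass(exact_mass, ppm_tol, rng):
    """Draw a noisy parent mass from a truncated Gaussian centered at the exact
    mass. sigma = 1/5 of the ppm tolerance; the +/- one-tolerance truncation
    bound is an illustrative assumption."""
    sigma = exact_mass * ppm_tol * 1e-6 / 5.0
    bound = exact_mass * ppm_tol * 1e-6
    while True:  # rejection sampling enforces the truncation
        delta = rng.normal(0.0, sigma)
        if abs(delta) <= bound:
            return exact_mass + delta

rng = np.random.default_rng(0)
noisy = resample_parent_mass(500.0, 5.0, rng)  # Orbitrap-like 5 ppm tolerance
```

This mimics realistic measurement error in retrospective evaluation rather than scoring candidates against artificially exact masses.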
During inference, the user can choose to bypass this step if they have their own, narrower list of potential formula candidates (e.g., generated by database search, knowledge about the chemical space being measured, or external tools such as BUDDY). In such cases, the exhaustive formula candidate generation step can be skipped and MIST-CF’s energy function can be used to directly evaluate the input candidate list.
Downselecting candidate formulae with a learned filter
Due to its energy-based formulation, the computational cost of MIST-CF scales linearly with the number of candidate precursor formulae. The evaluation of each formula requires assigning subformulae to peaks within a fragmentation spectrum, which is far faster than inducing a tree structure over subformulae but not negligible. To limit the space of candidate formulae, in addition to utilizing heuristics such as COMMON or RDBE as mentioned earlier, we train a data-driven filter we term FastFilter (Figure 2a) to further narrow the option set.
Figure 2: Large numbers of formula candidates can be quickly filtered with a simple feed forward neural network, FastFilter.

a. Generated candidate formulae for a spectrum can be prioritized with a learned model. A scalar score is generated for each candidate formula given an MS1 value and mass tolerance via a feed forward neural network. Only the top ranked formulae are selected for consideration with MIST-CF. b. The distribution of the number of candidates for all recorded MS1 spectra in our spectra libraries NPLIB1 and NIST20. All candidate chemical formulae are generated for all 6 adduct types considered with an allowed mass deviation of 10 ppm. c. FastFilter model accuracy at various top k cutoff values. The top formula is nearly always recovered within the top 256 candidates. The 95% confidence interval of the mean for recovery is shown across the 3 different formula splits considered in this work. All results are computed for a single trained FastFilter model on a large database of biologically-relevant molecules.
FastFilter is a feedforward neural network that takes as input an encoded precursor formula candidate, Enc(𝐟), and learns, in the same fashion as MIST-CF, to predict an energy value based solely on the formula, not information about the spectrum, to approximate a non-normalized likelihood of the formula. Because no spectrum information is needed, we train a single FastFilter model using a large database of biologically-relevant molecules, containing 200,388 chemical formulae extracted from various sources such as KNApSAcK,43 HMDB,16 KEGG,44 and PubChem,17 as prepared by Duhrkop et al.25 To avoid dataset leakage, all chemical formula masses that appear in spectral libraries used for model training/testing are excluded; the remaining masses are split in an 80%/10%/10% ratio for training, validation, and testing of FastFilter. We use FastFilter to select the top k candidates during inference (256 by default), or decoys during training, to further score with MIST-CF. Formulae are represented as input using the positive sinusoidal embedding vectors as in MIST-CF, and exact training parameters including learning rate, hidden size, and layers are described in the Supplementary Information.
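At inference, applying the trained filter amounts to a cheap forward pass over all encoded candidates followed by a top-k cut. The sketch below assumes a single-hidden-layer MLP with a hypothetical (W1, b1, W2, b2) parameter layout; the real depth and sizes come from hyperparameter search, so treat this as illustrative:

```python
import numpy as np

def fastfilter_select(formula_encodings, weights, k=256):
    """Score encoded formula candidates with a one-hidden-layer MLP and return
    the indices of the k lowest-energy (most plausible) candidates.
    weights = (W1, b1, W2, b2), a hypothetical parameter layout."""
    W1, b1, W2, b2 = weights
    h = np.maximum(formula_encodings @ W1 + b1, 0.0)  # ReLU hidden layer
    energies = (h @ W2 + b2).ravel()
    order = np.argsort(energies)  # lower energy = more plausible candidate
    return order[: min(k, len(order))]
```

The selected subset is then the only set of candidates MIST-CF's full Formula Transformer must score, keeping the pipeline's cost bounded.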
Datasets
We evaluate MIST-CF in terms of its ability to predict precursor chemical formulae for MS/MS spectra from NPLIB1, a public natural products dataset extracted from the GNPS database.15 NPLIB1 is prepared as in Goldman et al.26 and extracted from Duhrkop et al.25 The dataset contains positive mode MS/MS spectra for compounds under 1,500 Da containing a predefined set of elements and adducts. In total, NPLIB1 contains 10,709 spectra, 8,553 unique structures, and 5,433 unique chemical formulae. We employ chemical formula splits in which 20% of the 5,433 chemical formulae are selected randomly and added, with their corresponding spectra, to the test set. 10% of the remaining data is used for validation and early stopping.
In addition to the public data, we use the commercial NIST20 library14 to supplement the training dataset. We extract all Orbitrap high resolution positive mode mass spectra containing common elements and adducts. Examples with chemical formulae found in the test set are excluded to avoid biasing the model. In total, the combined dataset has 45,838 unique spectra, 30,950 unique 2D molecular structures, and 15,315 unique chemical formulae. By using identical public test sets, we are able to report performance metrics with and without commercial library inclusion to enable replication studies and future methodological improvements.
Baseline models
In addition to learning using the MIST architecture, we select three separate baseline neural architectures: a feed forward network (FFN) inspired by MetFID45 that acts on a binned representation of the spectrum; “MS1 Only”, a variant of the FFN in which the binned spectrum is set to 0 for all spectra; and a Transformer model that utilizes multiscale sinusoidal embeddings (Transformer).46 These models are trained and hyperparameter optimized equivalently to MIST-CF. In the FFN baseline, the binned spectrum representation is concatenated to the encoded MS1 candidate formula, the one-hot adduct representation, and the context vector, then passed into a multilayer perceptron module. The Transformer baseline concatenates the MS1 candidate formula encoding and context vector to the sinusoidal embedding at each of the top 100 most intense peaks, along with the intensity. An additional “cls” token is added to the peaks containing the mass of the MS1 candidate and an intensity of 2. These are subsequently passed into a set of multi-head attention Transformer layers. The output is pooled at the special “cls” token. A single linear layer then predicts a scalar energy value.
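The FFN baseline's binned spectrum input can be sketched as below. The bin width and mass range are illustrative assumptions (the text does not specify the binning resolution); when two peaks fall into the same bin, this sketch keeps the larger intensity:

```python
import numpy as np

def bin_spectrum(masses, intensities, max_mz=1500.0, n_bins=1500):
    """Dense fixed-width binning of an MS/MS spectrum for the FFN baseline.
    The 1 Da bin width and 1,500 Da range are assumptions for illustration."""
    binned = np.zeros(n_bins)
    idx = np.clip((np.asarray(masses, dtype=float) / max_mz * n_bins).astype(int),
                  0, n_bins - 1)
    # Keep the maximum intensity that falls into each bin.
    np.maximum.at(binned, idx, np.asarray(intensities, dtype=float))
    return binned

b = bin_spectrum([150.2, 150.4, 720.9], [0.3, 1.0, 0.5])
```

Unlike MIST-CF's subformula-labeled peak sets, this dense vector discards exact masses within a bin, which is one plausible reason the FFN baseline trails the Formula Transformer.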
Results
FastFilter limits the number of candidate formulae for MIST-CF consideration with high precision
A key limiting factor in de novo formula identification is the large size of the candidate space, particularly for molecules with higher masses. We first quantify the size of the candidate space. We process the full set of training spectra, including both NPLIB1 and NIST20, into chemical formula candidates with the permissive RDBE filter. Over 15% of spectra have greater than 5,000 formula candidates (Figure 2b). By training a light-weight model, FastFilter, to predict how likely each formula is to appear in a biologically-relevant database of small molecules, we are able to filter the candidates down to a smaller subset. We are able to recover the true formula 99% of the time within the top 256 candidates (Figure 2c). This guarantees the computational tractability of the MIST-CF pipeline, as subformula labeling would be prohibitively expensive for spectra with hundreds of thousands or millions of candidate formulae.
Chemical Formula Transformers provide meaningful representations for formula annotation
We evaluated four different architectures for encoding spectra and formulae to determine the best neural network architecture for the energy-based modeling framework described in Methods. MIST-CF uses a Transformer model to convolve upon MS2 subformula with context information concatenated to each formula (e.g., adduct, instrument type) (Figure 3a). We compare this model to a feed forward neural network (Figure 3b), a standard Transformer applied to sinusoidal embeddings of each m/z value,46 and a variant of the feed forward network that does not embed the spectrum (“MS1 only”). Importantly, all baseline neural network models are provided with the same context vectors, training set, and hyperparameter optimization schemes to enable fair comparison.
Figure 3: MIST-CF is a highly-effective architecture for learning to rank plausible MS1 formula annotations.

a. The MIST-CF architecture uses the candidate chemical formulae to generate subformulae and encode these with a Formula Transformer. b. The baseline feed forward network (FFN) separately encodes the spectra and formula before feeding their concatenation into a multilayer perceptron (MLP). c. For all spectra in the test set, the fraction recovered at various top k values for all methods is computed and shown. d-e. The top 1 accuracy for the methods is grouped by the mass of the MS1 precursor or adduct type. All results are computed for MIST-CF and the following baselines: MS1 only (a feed forward network utilizing only the chemical formula and context vector), a feed forward network, and a Transformer. All results are computed over 3 random formula splits and respective training runs as described in Methods, where the NIST20 is included in the training sets. All spectra include up to 256 candidates to select from as selected with the “COMMON” filter23 and FastFilter. Error bars and shaded regions show 95% confidence intervals of the mean.
We find that MIST-CF outperforms all other architectures tested, likely due to its ability to explicitly combine the representation of the spectrum and formula candidate, rather than via intra-architecture concatenations (Figure 3b). We test the model by holding out a number of spectra and chemical formulae from model training. For each test spectrum, up to 256 formula and adduct candidates are re-ranked by each model. MIST-CF outperforms the next best model at top 1 accuracy by a > 10% margin (Figure 3c; Table 1). Curiously, the feed forward network and Transformer network show little difference from each other, indicating that neural network architecture (i.e., FFN vs. Transformer) has less impact than the input information provided to the model (i.e., subformula labels)—the main strength of MIST-CF. All models are able to perform at least as well as the MS1 only model, consistent with our intuition that the models are learning more than just database bias.
Table 1:
MIST-CF outperforms comparable neural network baselines at chemical formula annotation from MS/MS. Models were trained to predict held out NPLIB1 test examples using training sets consisting of “NPLIB1” or “NPLIB1 + NIST20.” The best value in each column is typeset in bold. Models were evaluated using three independent formula splits. Values are shown ± standard errors of the mean.
| Training dataset | NPLIB1 | | | NPLIB1 + NIST20 | | |
|---|---|---|---|---|---|---|
| Top k | 1 | 2 | 3 | 1 | 2 | 3 |
| MS1 Only | 0.609 ± 0.002 | 0.773 ± 0.003 | 0.833 ± 0.003 | 0.623 ± 0.007 | 0.785 ± 0.008 | 0.847 ± 0.008 |
| FFN | 0.635 ± 0.009 | 0.786 ± 0.009 | 0.847 ± 0.007 | 0.652 ± 0.006 | 0.804 ± 0.009 | 0.867 ± 0.008 |
| Transformer | 0.639 ± 0.006 | 0.791 ± 0.009 | 0.851 ± 0.006 | 0.626 ± 0.014 | 0.772 ± 0.008 | 0.840 ± 0.011 |
| MIST-CF | **0.741 ± 0.010** | **0.878 ± 0.009** | **0.919 ± 0.007** | **0.769 ± 0.006** | **0.897 ± 0.007** | **0.931 ± 0.005** |
All models decrease in accuracy as compound masses increase (Figure 3d) due to the growing number of plausible candidates. Similarly, models struggle on adducts that appear less often in the training dataset, such as potassium adducts (Figure 3e). This is one area where manually parameterized models such as SIRIUS23 may generalize better. Data-driven methods such as MIST-CF are empirically better at correctly retrieving formulae with common proton or sodium adducts.
Integrating the higher quality Orbitrap training spectra from NIST20 improves performance on the same test set. MIST-CF models trained on NPLIB1 alone achieve a top 1 accuracy of 0.741, compared to 0.769 for models additionally trained on NIST20 (Table 1). This >2% absolute improvement underscores that formula prediction accuracy may be further enhanced by the release of new spectral libraries with higher quality data.12
One of the core methodological decisions in MIST-CF is the application of a Formula Transformer to the set of labeled subformulae in the MS2 spectrum. To verify that these peaks are informative for the model, we repeated benchmarking experiments for a single split of the data, this time modulating the maximum number of peaks (rank-ordered by intensity) viewable by MIST-CF.
Model performance rapidly increases with the maximum number of included peaks, sharply rising from 0.673 to 0.719 for 1 and 3 spectrum peaks, respectively; using zero MS2 peaks would be functionally equivalent to the MS1 Only baseline with an accuracy of 0.623 (Table 1). There are diminishing returns to including lower intensity peaks, with maxima of 1, 5, 10, 20, and 50 formula peaks achieving top 1 accuracies of 0.673, 0.729, 0.754, 0.756, and 0.774, respectively (Figure 4). Making the model aware of a greater number of fragments enables more generalizable and accurate predictions. This builds confidence that the model is learning more than database biases, drawing information from even minor peaks to inform predictions. The absolute difference in performance levels off beyond 20 peaks, with only marginal gains from 20 to 50. Given this marginal benefit of additional peak annotations, we maintain a default of 20 peaks unless otherwise stated to reduce the runtime of the method.
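Capping the spectrum at its most intense peaks, as in this ablation, amounts to a simple intensity-based filter applied before subformula labeling. `top_peaks` is a hypothetical helper sketching that step, with the default of 20 mirroring the setting used above.

```python
def top_peaks(mz, intensity, max_peaks=20):
    """Keep only the max_peaks most intense MS2 peaks, preserving the
    original m/z ordering of the survivors."""
    order = sorted(range(len(intensity)), key=lambda i: intensity[i], reverse=True)
    keep = sorted(order[:max_peaks])  # indices of survivors, in m/z order
    return [mz[i] for i in keep], [intensity[i] for i in keep]
```

Re-running retrieval while sweeping `max_peaks` over 1, 5, 10, 20, and 50 reproduces the shape of the ablation: each additional labeled peak gives the Formula Transformer more fragment evidence, with diminishing returns for the lowest-intensity peaks.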
Figure 4: MS/MS subpeaks drive MIST-CF performance.

a. The (sub)Formula Transformer in MIST-CF operates on only the highest intensity MS2 peaks. Here, the number of labeled MS2 peaks provided as input to the model is varied. As the number of included peaks increases, the formula retrieval accuracy does too. b. The top 1 retrieval accuracy is shown for each maximum subpeak number. All results are shown for MIST-CF trained on the joint NPLIB1 and NIST20 dataset and tested on a single test split of the data.
MIST-CF compares favorably to existing formula annotation tools
The widely used SIRIUS tool23 is the de facto state of the art for chemical formula annotation. We therefore perform a head-to-head comparison of MIST-CF and SIRIUS on the same NPLIB1 test set. Herein, MIST-CF considers all candidate MS1 formulae within 10 ppm of the recorded MS1 mass and utilizes FastFilter to down-select to 256 candidates for full evaluation. We run the SIRIUS formula module from the command line with tree and compound timeouts of 300 seconds to avoid excessive execution times. With these constraints, MIST-CF predicts a formula for every spectrum, whereas SIRIUS fails for 10.77% of the spectra (Figure 5a). Failed spectra have larger masses on average (700.80 Da) than successful ones (378.06 Da).
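A relative 10 ppm tolerance translates into an absolute mass window that scales with the precursor mass, which is one reason the candidate space grows for heavier compounds. The sketch below (`ppm_window` is a hypothetical name) shows the conversion.

```python
def ppm_window(observed_mass, ppm=10.0):
    """Absolute mass window implied by a relative parts-per-million
    tolerance around an observed MS1 mass."""
    delta = observed_mass * ppm * 1e-6
    return observed_mass - delta, observed_mass + delta
```

At 500 Da a 10 ppm tolerance spans only ±0.005 Da, while at 1000 Da it spans ±0.01 Da; wider windows at higher mass admit more decoy formulae, consistent with the accuracy trends by precursor mass reported above.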
Figure 5: MIST-CF more frequently assigns correct chemical formulae than the SIRIUS formula assignment module on the NPLIB1 test set.

a. The number of spectra for which each method is able to predict a formula, regardless of accuracy. b. The distribution of molecular masses for the spectra both methods are able to annotate (regardless of accuracy) compared to the distribution for spectra on which only MIST-CF succeeds. c. Top k accuracy for both methods. d. The rank at which each method recovers the true chemical formula; this visualization highlights that there are many spectra for which MIST-CF achieves rank 1 that SIRIUS does not predict in its top 3. MIST-CF models are trained on the joint NIST20 and NPLIB1 dataset. SIRIUS is executed with a compound and tree timeout of 300 seconds. All values are shown for 3 random formula splits of the NPLIB1 data. Error bars and shaded regions show 95% confidence intervals of the mean. Accuracy is computed with respect to the total MS1 composition (i.e., summed adduct and formula) to avoid biasing results against SIRIUS.
Model accuracy is evaluated in terms of the summed elemental composition of the formula and adduct. By design, SIRIUS does not distinguish tree scores with different adduct assignments as we do in MIST-CF (e.g., [C6H12O6+H-H2O]+ and [C6H10O5+H]+ would appear to have equivalent tree scores). On the NPLIB1 test set when using full RDBE decoys (i.e., candidate formulae not constrained with the “COMMON” filter as in Table 1), MIST-CF successfully predicts chemical formulae with a 71% top 1 accuracy (80% when considering joint formula and adduct accuracy), compared to SIRIUS’s 48% accuracy (evaluated on joint formula and adduct predictions). This represents a >20% improvement in absolute top 1 prediction accuracy (Figure 5c). On a per-spectrum basis, SIRIUS rarely makes correct predictions that MIST-CF does not (approximately 3% of spectra), whereas MIST-CF correctly predicts 36% of spectra that SIRIUS does not (Figure 5d).
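The summed-composition metric can be made concrete with a small sketch: parse each formula into element counts, add the adduct gains, and subtract any neutral losses. `parse_formula` and `summed_composition` are illustrative helpers (simple formulae only, no parentheses or charges), not the evaluation code used in this work.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a flat chemical formula like 'C6H12O6' into element counts."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if elem:  # the regex also yields an empty trailing match
            counts[elem] += int(num) if num else 1
    return counts

def summed_composition(formula, adduct_gain, adduct_loss=""):
    """Total elemental composition of a formula plus adduct gains minus
    neutral losses, used to compare (formula, adduct) assignments."""
    total = parse_formula(formula) + parse_formula(adduct_gain)
    total.subtract(parse_formula(adduct_loss))
    return {e: n for e, n in total.items() if n != 0}
```

For instance, a glucose ion with a water loss and an anhydroglucose ion with a plain protonation both sum to C6H11O5, so a metric on summed composition treats the two assignments as equivalent, which is the fairness adjustment applied here for SIRIUS.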
Beyond the improvement in accuracy, MIST-CF required approximately one-third the wall time of SIRIUS (evaluated on compounds under 700 Da for which SIRIUS does not time out) when both methods are run on a single CPU core (Supplementary Information).
MIST-CF achieves competitive out-of-the-box performance on the CASMI2022 challenge
We selected the dataset released as part of the Comparative Assessment of Small Molecule Identifications 2022 (CASMI2022)13 for additional validation. Our training datasets were generated prior to the competition announcement, minimizing the risk of training dataset bias and simulating a prospective use case. Herein, we focus only on the formula identification challenge rather than the full task of structural elucidation.
We extract all positive mode MS/MS spectra from the provided mzML files using MZmine 319 and apply both MIST-CF and SIRIUS to predict chemical formulae for each of the 304 extracted spectra. Using our constraints on element types, 296 of the 304 formulae are recoverable (97%), excluding species such as iodixanol (C35H44I6N6O15). We use an MS1 tolerance of 5 ppm.
Model performance was evaluated on three key metrics: the accuracy of predicting formula correctly, the accuracy of predicting the adduct correctly, and the accuracy of predicting the total elemental composition of the formula and adduct combined. As discussed above, we use this last metric because evaluating the accuracy of the chemical formula alone would bias the comparison for MIST-CF.
We consider three variants of SIRIUS: SIRIUS, SIRIUS (CSI:FingerID) and SIRIUS (Submission). SIRIUS (CSI:FingerID) provides formulae as re-ranked by their CSI:FingerID score when searched against PubChem. SIRIUS (Submission) uses predictions submitted by the SIRIUS authors at the latest competition under file name “duehrkop_CASMI2022.csv.” This submission reportedly used a mix of ion identity molecular networking,47 which can be used to more accurately resolve adduct types, along with manual curation to improve performance.
Encouragingly, we find that MIST-CF is competitive with SIRIUS (Submission), achieving a nearly equivalent joint formula and adduct accuracy of 0.862, despite our fully automated pipeline and no additional manual curation (Table 2). When SIRIUS (Submission) is evaluated only on the subset of 272 spectra for which a prediction was submitted, it achieves a higher accuracy than MIST-CF, which predicts all test set spectra; of the 32 spectra on which MIST-CF outperforms SIRIUS (Submission), SIRIUS (Submission) submitted predictions for only 7. This is in contrast to the 34 spectra on which SIRIUS outperforms MIST-CF. Nevertheless, when using default parameters, MIST-CF reaches a top 1 formula accuracy of 0.842 (including adduct prediction) compared to 0.516 for SIRIUS (CSI:FingerID). These results illustrate the competitiveness of MIST-CF and demonstrate that accurate, prospective formula annotation does not require computing full fragmentation trees. In addition to our default MIST-CF model, we also test a variant that uses 50 subformula peaks. Performance on the joint formula and adduct composition is equivalent; however, accuracy in predicting the formula alone decreases, as the 50-peak model appears to be marginally worse at adduct assignment.
Table 2:
Model accuracy of MIST-CF and SIRIUS evaluated on CASMI2022. “Accuracy (formula)” indicates the fraction of spectra for which the exact formula is predicted correctly. “Accuracy (adduct)” indicates the fraction of spectra for which the first predicted (formula, adduct) pair includes the correct adduct. “Accuracy (formula + adduct)” indicates the fraction of spectra for which the total elemental composition of the top predicted formula and adduct matches the elemental composition of the summed true formula and adduct. “Predicted” indicates the number of spectra for which a formula and adduct prediction was made. “SIRIUS” makes predictions with the default command line tool described above; “SIRIUS (CSI:FingerID)” predicts formula, adduct pairs using CSI:FingerID rankings against PubChem; “SIRIUS (Submission)” is taken directly from the previous CASMI2022 results. MIST-CF (20 peaks) uses the default of 20 subformula peaks, whereas MIST-CF (50 peaks) utilizes 50 subformula peaks. Accuracy (formula) and Accuracy (adduct) are not reported for SIRIUS because the method does not distinguish between formula, adduct pairs that sum to the same molecular formula, and our metrics default to the worst-case ranking for ties (i.e., somewhat unfairly penalizing methods such as SIRIUS that have more ties). All accuracies are reported for the top 1 prediction and divided by the total number of spectra (304), not the number of predicted spectra. Numbers typeset in bold indicate the best result in each column.
| Method | Accuracy (formula) | Accuracy (adduct) | Accuracy (formula + adduct) | Predicted |
|---|---|---|---|---|
| SIRIUS version 5.6.3 | - | - | 0.641 | 274 |
| SIRIUS version 5.6.3 (CSI:FingerID) | 0.516 | 0.543 | 0.678 | 254 |
| SIRIUS (Submission) | **0.865** | 0.855 | **0.868** | 272 |
| MIST-CF (20 peaks) | 0.842 | **0.901** | 0.862 | **304** |
| MIST-CF (50 peaks) | 0.822 | 0.885 | **0.868** | **304** |
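The worst-case tie-breaking convention described in the Table 2 caption can be sketched simply: every other candidate scoring at least as high as the true candidate is counted ahead of it. `worst_case_rank` is a hypothetical helper for illustration.

```python
def worst_case_rank(scores, true_idx):
    """Rank of the true candidate with pessimistic tie-breaking: any
    other candidate scoring >= the true score is counted ahead of it."""
    true_score = scores[true_idx]
    better_or_tied = sum(
        1 for i, s in enumerate(scores) if i != true_idx and s >= true_score
    )
    return better_or_tied + 1
```

Under this convention a method that assigns identical scores to several (formula, adduct) pairs (as SIRIUS does for pairs summing to the same composition) is pushed toward worse ranks, which is why per-pair accuracies are withheld for SIRIUS rather than reported unfairly.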
Conclusion
We have introduced a data-driven neural network model, MIST-CF, for inferring chemical formulae from MS/MS spectra, trained within an energy-based modeling framework. We benchmark this model extensively to show how our recent Chemical Formula Transformer architecture is uniquely suited to integrating formula and spectrum information. MIST-CF outperforms other learning-to-rank neural network architectures and matches the winning solution of the CASMI2022 competition within the positive mode category, despite using no MS1 isotopic information or manual prediction refinement.
This work defines a clear problem formulation for learning to rank MS1 candidate formulae from MS/MS data, including open source code and data splits. To address this task, we release open source, trained models with modest memory and runtime requirements, even for large molecules. In addition to these models, we develop a smaller, lightweight neural network formula filtering model to help prioritize biologically relevant chemical formula candidates. This work continues to demonstrate the efficacy of our recent Chemical Formula Transformer network architecture in thorough benchmarking comparisons.24 As part of this, we have greatly simplified and improved the Chemical Formula Transformer implementation to utilize custom formula embeddings, a simple and open source subformula assignment routine (i.e., no longer fragmentation trees), additional model inputs such as instrument type, and multiple adduct types. Applying these same changes to the Chemical Formula Transformer architecture for fingerprint prediction is likely to yield similar improvements.
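The subformula assignment idea (labeling each MS2 peak with a sub-composition of the candidate precursor whose mass matches the peak) can be illustrated with a brute-force sketch. `assign_subformulae` and the small mass table are assumptions for illustration only; practical implementations prune this exponential search rather than enumerating it.

```python
from itertools import product

# Monoisotopic masses in Da; a small subset for illustration.
MASS = {"C": 12.0, "H": 1.00783, "O": 15.99491, "N": 14.00307}

def assign_subformulae(precursor, peak_mass, tol_da=0.01):
    """Enumerate every sub-multiset of the precursor's elements and keep
    those whose monoisotopic mass matches the observed MS2 peak within
    tol_da. Brute-force sketch only; real tools prune this search."""
    elems = sorted(precursor)
    ranges = [range(precursor[e] + 1) for e in elems]
    candidates = []
    for counts in product(*ranges):
        mass = sum(n * MASS[e] for e, n in zip(elems, counts))
        if abs(mass - peak_mass) <= tol_da:
            candidates.append({e: n for e, n in zip(elems, counts) if n})
    return candidates
```

For a glucose precursor (C6H12O6), a peak near 44.026 Da is explained by the subformula C2H4O; running such an assignment per peak is what replaces fragmentation tree construction as the source of subformula labels for the Formula Transformer.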
There are many avenues for improving upon MIST-CF. We have trained models only for positive-mode data; we do not directly address the integration of MIST-CF scores with MS1 isotopic scores; we do not consider adduct switching in MS2 subformula assignment; we still rely upon SIRIUS’s algorithmic decomposition of exact masses into formula candidates due to its fast implementation; and we have not yet explored the use of forward structure-to-spectrum models26,48 for data augmentation. There is also an opportunity to combine MIST-CF with the recently reported BUDDY18 by re-ranking formulae generated by BUDDY rather than relying on the FastFilter.
Altogether, we are optimistic about the potential to integrate this model into existing pipelines for small molecule metabolite identification. By addressing the task of chemical formula annotation, this work moves us one step closer to our vision of an integrated neural network driven metabolite annotation pipeline.
Supplementary Material
The Supporting Information provides a description of the exact configuration used to rerun SIRIUS,23 timing experiments for MIST-CF, hyperparameter optimization, and the final hyperparameters utilized in experiments. This material is available free of charge via the Internet at http://pubs.acs.org.
Acknowledgement
The authors thank K. Dührkop for helpful discussions around the SIRIUS method. The authors thank P. Dorrestein, H. Russo, and S. Zuffa for additional discussions about how to best utilize the introduced methods.
Funding
This work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium. S.G. was additionally supported by the MIT-Takeda Fellowship program and the Machine Learning for Pharmaceutical Discovery and Synthesis consortium. J.P. was supported by the MIT-Accenture Fellowship.
Data and Software Availability
All code to replicate experiments, train new models, and load pre-trained models is available at https://github.com/samgoldman97/mist-cf. The code to parse NIST can be found in https://github.com/samgoldman97/nist-parser but this optional subset of training data requires a license and cannot be shared publicly. The exact repository version used in this work has been archived at Zenodo record 8151490 (data) and Zenodo record 8151513 (code).
References
- (1). Pluskal T; Torrens-Spence MP; Fallon TR; De Abreu A; Shi CH; Weng J-K The biosynthetic origin of psychoactive kavalactones in kava. Nature Plants 2019, 5, 867–878.
- (2). Torrens-Spence MP; Bobokalonova A; Carballo V; Glinkerman CM; Pluskal T; Shen A; Weng J-K PBS3 and EPS1 complete salicylic acid biosynthesis from isochorismate in Arabidopsis. Molecular Plant 2019, 12, 1577–1586.
- (3). Cao Y; Oh J; Xue M; Huh WJ; Wang J; Gonzalez-Hernandez JA; Rice TA; Martin AL; Song D; Crawford JM; Herzon SB; Palm NW Commensal microbiota from patients with inflammatory bowel disease produce genotoxic metabolites. Science 2022, 378, eabm3233.
- (4). Dang L; White DW; Gross S; Bennett BD; Bittinger MA; Driggers EM; Fantin VR; Jang HG; Jin S; Keenan MC; Marks KM; Prins RM; Ward PS; Yen KE; Liau LM; Rabinowitz JD; Cantley LC; Thompson CB; Vander Heiden MG; Su SM Cancer-associated IDH1 mutations produce 2-hydroxyglutarate. Nature 2009, 462, 739–744.
- (5). Quinn RA; Melnik AV; Vrbanac A; Fu T; Patras KA; Christy MP; Bodai Z; Belda-Ferre P; Tripathi A; Chung LK; Downes M; Welch RD; Quinn M; Humphrey G; Panitchpakdi M; Weldon KC; Aksenov A; da Silva R; Avila-Pacheco J; Clish C; Bae S; Mallick H; Franzosa EA; Lloyd-Price J; Bussell R; Thron T; Nelson AT; Wang M; Leszczynski E; Vargas F; Gauglitz JM; Meehan MJ; Gentry E; Arthur TD; Komor AC; Poulsen O; Boland BS; Chang JT; Sandborn WJ; Lim M; Garg N; Lumeng JC; Xavier RJ; Kazmierczak BI; Jain R; Egan M; Rhee KE; Ferguson D; Raffatellu M; Vlamakis H; Haddad GG; Siegel D; Huttenhower C; Mazmanian SK; Evans RM; Nizet V; Knight R; Dorrestein PC Global chemical effects of the microbiome include new bile-acid conjugations. Nature 2020, 579, 123–129.
- (6). Paik D; Yao L; Zhang Y; Bae S; D’Agostino GD; Zhang M; Kim E; Franzosa EA; Avila-Pacheco J; Bisanz JE; Rakowski CK; Vlamakis H; Xavier RJ; Turnbaugh PJ; Longman RS; Krout MR; Clish CB; Rastinejad F; Huttenhower C; Huh JR; Devlin AS Human gut bacteria produce TH17-modulating bile acid metabolites. Nature 2022, 603, 907–912.
- (7). Sato Y; Atarashi K; Plichta DR; Arai Y; Sasajima S; Kearney SM; Suda W; Takeshita K; Sasaki T; Okamoto S Novel bile acid biosynthetic pathways are enriched in the microbiome of centenarians. Nature 2021, 599, 458–464.
- (8). Wishart DS Metabolomics for Investigating Physiological and Pathophysiological Processes. Physiological Reviews 2019, 99, 1819–1875.
- (9). Bundy JG; Davey MP; Viant MR Environmental metabolomics: a critical review and future perspectives. Metabolomics 2009, 5, 3–21.
- (10). Tian Z; Zhao H; Peter KT; Gonzalez M; Wetzel J; Wu C; Hu X; Prat J; Mudrock E; Hettinger R; Cortina AE; Biswas RG; Kock FVC; Soong R; Jenne A; Du B; Hou F; He H; Lundeen R; Gilbreath A; Sutton R; Scholz NL; Davis JW; Dodd MC; Simpson A; McIntyre JK; Kolodziej EP A ubiquitous tire rubber-derived chemical induces acute mortality in coho salmon. Science 2021, 371, 185–189.
- (11). Neumann S; Böcker S Computational mass spectrometry for metabolomics: identification of metabolites and small molecules. Anal. Bioanal. Chem. 2010, 398, 2779–2788.
- (12). Bittremieux W; Wang M; Dorrestein PC The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 2022, 18, 94.
- (13). CASMI, Critical Assessment of Small Molecule Identification. http://www.casmi-contest.org/2022/index.shtml, Accessed 2022-12-01.
- (14). NIST, Tandem Mass Spectral Library; NIST, 2020.
- (15). Wang M; Carver JJ; Phelan VV; Sanchez LM; Garg N; Peng Y; Nguyen DD; Watrous J; Kapono CA; Luzzatto-Knaan T Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology 2016, 34, 828–837.
- (16). Wishart DS; Feunang YD; Marcu A; Guo AC; Liang K; Vázquez-Fresno R; Sajed T; Johnson D; Li C; Karu N; Sayeeda Z; Lo E; Assempour N; Berjanskii M; Singhal S; Arndt D; Liang Y; Badran H; Grant J; Serra-Cayuela A; Liu Y; Mandal R; Neveu V; Pon A; Knox C; Wilson M; Manach C; Scalbert A HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Research 2018, 46, D608–D617.
- (17). Kim S; Thiessen PA; Bolton EE; Chen J; Fu G; Gindulyte A; Han L; He J; He S; Shoemaker BA PubChem substance and compound databases. Nucleic Acids Research 2016, 44, D1202–D1213.
- (18). Xing S; Shen S; Xu B; Li X; Huan T BUDDY: molecular formula discovery via bottom-up MS/MS interrogation. Nature Methods 2023, 1–10.
- (19). Schmid R; Heuckeroth S; Korf A; Smirnov A; Myers O; Dyrlund TS; Bushuiev R; Murray KJ; Hoffmann N; Lu M; Sarvepalli A; Zhang Z; Fleischauer M; Dührkop K; Wesner M; Hoogstra SJ; Rudt E; Mokshyna O; Brungs C; Ponomarov K; Mutabdžija L; Damiani T; Pudney CJ; Earll M; Helmer PO; Fallon TR; Schulze T; Rivas-Ubach A; Bilbao A; Richter H; Nothias L-F; Wang M; Orešič M; Weng J-K; Böcker S; Jeibmann A; Hayen H; Karst U; Dorrestein PC; Petras D; Du X; Pluskal T Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nature Biotechnology 2023, 41, 447–449.
- (20). Pluskal T; Uehara T; Yanagida M Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Analytical Chemistry 2012, 84, 4396–4403.
- (21). Böcker S; Dührkop K Fragmentation trees reloaded. Journal of Cheminformatics 2016, 8, 1–26.
- (22). Böcker S; Letzel MC; Lipták Z; Pervukhin A SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 2009, 25, 218–224.
- (23). Dührkop K; Fleischauer M; Ludwig M; Aksenov AA; Melnik AV; Meusel M; Dorrestein PC; Rousu J; Böcker S SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nature Methods 2019, 16.
- (24). Goldman S; Wohlwend J; Stražar M; Haroush G; Xavier RJ; Coley CW Annotating metabolite mass spectra with domain-inspired chemical formula transformers. Nature Machine Intelligence 2023, 1–15, DOI: 10.1038/s42256-023-00708-3.
- (25). Dührkop K; Nothias L-F; Fleischauer M; Reher R; Ludwig M; Hoffmann MA; Petras D; Gerwick WH; Rousu J; Dorrestein PC Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nature Biotechnology 2021, 39, 462–471.
- (26). Goldman S; Bradshaw J; Xin J; Coley CW Prefix-tree Decoding for Predicting Mass Spectra from Molecules. arXiv preprint arXiv:2303.06470 2023.
- (27). Pretsch E; Bühlmann P; Affolter C Structure Determination of Organic Compounds; Springer, 2000.
- (28). Dührkop K; Shen H; Meusel M; Rousu J; Böcker S Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences 2015, 112, 12580–12585.
- (29). Zhang N; Aebersold R; Schwikowski B ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics 2002, 2, 1406–1412.
- (30). LeCun Y; Chopra S; Hadsell R; Ranzato M; Huang F A tutorial on energy-based learning. Predicting Structured Data 2006, 1.
- (31). Lin MH; Tu Z; Coley CW Improving the performance of models for one-step retrosynthesis through re-ranking. Journal of Cheminformatics 2022, 14, 1–13.
- (32). Sun R; Dai H; Li L; Kearnes S; Dai B Energy-based view of retrosynthesis. arXiv preprint arXiv:2007.13437 2020.
- (33). Du Y; Meier J; Ma J; Fergus R; Rives A Energy-based models for atomic-resolution protein conformations. International Conference on Learning Representations. 2020.
- (34). Ying C; Cai T; Luo S; Zheng S; Ke G; He D; Shen Y; Liu T-Y Do Transformers Really Perform Bad for Graph Representation? Advances in Neural Information Processing Systems 34. 2021; pp 28877–28888.
- (35). Vaswani A; Shazeer N; Parmar N; Uszkoreit J; Jones L; Gomez AN; Kaiser L; Polosukhin I Attention Is All You Need. Advances in Neural Information Processing Systems 30. 2017; pp 5998–6008.
- (36). Harris CR; Millman KJ; van der Walt SJ; Gommers R; Virtanen P; Cournapeau D; Wieser E; Taylor J; Berg S; Smith NJ; Kern R; Picus M; Hoyer S; van Kerkwijk MH; Brett M; Haldane A; Fernández del Río J; Wiebe M; Peterson P; Gérard-Marchant P; Sheppard K; Reddy T; Weckesser W; Abbasi H; Gohlke C; Oliphant TE Array programming with NumPy. Nature 2020, 585, 357–362.
- (37). Tancik M; Srinivasan PP; Mildenhall B; Fridovich-Keil S; Raghavan N; Singhal U; Ramamoorthi R; Barron JT; Ng R Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. Advances in Neural Information Processing Systems 33. 2020; pp 7537–7547.
- (38). Falcon W; The PyTorch Lightning team PyTorch Lightning. 2019; https://github.com/Lightning-AI/lightning.
- (39). Kingma DP; Ba J Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.
- (40). Liaw R; Liang E; Nishihara R; Moritz P; Gonzalez JE; Stoica I Tune: A research platform for distributed model selection and training. ICML 2018 AutoML Workshop 2018.
- (41). Böcker S; Lipták Z A fast and simple algorithm for the Money Changing Problem. Algorithmica 2007, 48, 413–432.
- (42). Dührkop K; Ludwig M; Meusel M; Böcker S Faster mass decomposition. Algorithms in Bioinformatics: 13th International Workshop, WABI 2013, Sophia Antipolis, France, September 2–4, 2013. Proceedings 13. 2013; pp 45–58.
- (43). Shinbo Y; Nakamura Y; Altaf-Ul-Amin M; Asahi H; Kurokawa K; Arita M; Saito K; Ohta D; Shibata D; Kanaya S KNApSAcK: a comprehensive species-metabolite relationship database. Plant Metabolomics 2006, 165–181.
- (44). Kanehisa M; Goto S; Kawashima S; Nakaya A The KEGG databases at GenomeNet. Nucleic Acids Research 2002, 30, 42–46.
- (45). Fan Z; Alley A; Ghaffari K; Ressom HW MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation. Metabolomics 2020, 16, 104.
- (46). Voronov G; Lightheart R; Davison J; Krettler CA; Healey D; Butler T Multiscale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data. arXiv preprint arXiv:2207.02980 2022.
- (47). Schmid R; Petras D; Nothias L-F; Wang M; Aron AT; Jagels A; Tsugawa H; Rainer J; Garcia-Aloy M; Dührkop K; Korf A; Pluskal T; Kameník Z; Jarmusch AK; Caraballo-Rodríguez AM; Weldon KC; Nothias-Esposito M; Aksenov AA; Bauermeister A; Albarracin Orio A; Grundmann CO; Vargas F; Koester I; Gauglitz JM; Gentry EC; Hövelmann Y; Kalinina SA; Pendergraft MA; Panitchpakdi M; Tehan R; Le Gouellec A; Aleti G; Mannochio Russo H; Arndt B; Hübner F; Hayen H; Zhi H; Raffatellu M; Prather KA; Aluwihare LI; Böcker S; McPhail KL; Humpf H-U; Karst U; Dorrestein PC Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nature Communications 2021, 12, 3832.
- (48). Goldman S; Li J; Coley CW Generating Molecular Fragmentation Graphs with Autoregressive Neural Networks. arXiv preprint arXiv:2304.13136 2023.