ForMileS: A Python Open-Source Program to Generate Molecular Structures for Tandem Mass Spectrometry Fragment Ions

Vinicius Kuchenbecker; Nelson H Morgon

doi:10.1021/acsomega.5c08184

2025 Oct 25;10(43):51869–51881. doi: 10.1021/acsomega.5c08184


PMCID: PMC12593108 PMID: 41210777

Abstract

Tandem mass spectrometry is a central analytical tool in chemistry, yet the fragmentation mechanisms underlying collision-induced dissociation remain incompletely understood. A key challenge is predicting fragment ion structures while preserving the essential structural features of the precursor ion. This paper introduces ForMileS (Formation of Mass SMILES), a streamlined Python open-source workflow for generating fragment ion structures with precursor-specific constraints from tandem mass spectrometry data. ForMileS employs a simplified branch-and-bound algorithm, accepting molecular formula, charge state, exact mass, and a base scaffold in SMILES format as input, along with parameters for branching, cyclicity, and bond types, via a graphical user interface. We demonstrate its application to the three main fragments of Polypropylene Glycol Octamer (PPG8), discussing the critical role of the base molecular scaffold (BMS) in the final structure set. Relative energy calculations using Density Functional Theory confirm the presence of expected structures, highlighting the lowest energy conformers. When applied to the smallest fragment of dipropylene glycol dimethyl ether (DGDE), ForMileS reveals that only linear double-bonded or cyclic structures are plausible, with the former being energetically favored. While successfully generating plausible structures, the exhaustive combinatorial charge generation step and the unrefined branch-and-bound method limit ForMileS’s performance, restricting its applicability to small molecules like C₆O₃H₁₉. This highlights the importance of future performance optimization through heuristics and energetic filters.


Introduction

Among the many mass spectrometry (MS) techniques and devices, tandem mass spectrometry (MS²) has become a widely used method for analyzing polar molecules and macromolecules such as polymers, carbohydrates, proteins, and peptides. In MS² experiments, a molecular ion (also called a precursor ion) is isolated and then fragmented using various methods, the most common being collision-induced dissociation (CID). This process generates an ensemble of fragment ions recorded in a mass spectrum. The precursor is a positively or negatively charged (ionized) molecule, with the former being more common and also referred to as a protomer. In a low-energy dissociation regime, typically one fragment retains the original charge of the precursor, while the other fragment or fragments are neutral. In other cases, such as electron transfer dissociation, additional charged fragments can be generated as well.

Over the past years, many theoretical approaches have been developed to elucidate the dissociation mechanisms, an inherently challenging problem, and to predict and explain the fragmentation process in MS², such as transition state theory and Rice–Ramsperger–Kassel–Marcus (RRKM) theory. All these approaches require, at least theoretically, knowledge of the precursor and fragment ion structures so that the reaction pathway (intrinsic reaction coordinate) can be modeled and energy barriers calculated.

In MS² performed on most common commercial devices, such as triple quadrupoles (QqQ), quadrupole time-of-flight (qToF), and higher-energy collisional dissociation Orbitraps (HCD-Orbitrap), particularly the latter, the experimental data directly include the exact mass of the fragment ions, their charge state, and the collision energy in the laboratory frame. Additionally, the precursor structure can often be partly or fully known from other analytical techniques applied before the MS² experiment. With exact mass measurements to four decimal places, as is common with Orbitrap analyzers, it is possible to apply algorithms for elemental composition determination, providing molecular formulas for any peak in the spectrum, including MS² fragment ions.

A major challenge in theoretical modeling of MS² dissociation/fragmentation pathways lies in determining the molecular structures of the fragment ions, information that is not directly available from CID/HCD experiments. Without at least a small set of theoretically plausible fragment ion structures, attempts to model reaction mechanisms and spectra are impractical.

One way to address this issue is through generation of all possible molecular structures for a fragment ion, i.e., a complete search for all constitutional isomers. Since the molecular formula of a fragment ion can be derived from the MS² experiment, alongside other structure-related information as described above, the task becomes a combinatorial problem over the constituent atoms. Generating constitutional isomers from a molecular formula is a standard task in computer-assisted structure elucidation (CASE), which employs computational algorithms implemented in dedicated programs.

Several programs have been developed to generate sets of molecular structures, notably those following a de novo approach, where MS² fragmentation data and computational tools are used to propose precursor ion candidates, as in metabolomics applications. However, as noted earlier, the focus of this work is on supporting the elucidation of fragment ion structures using available structural information from the precursor ion.

Two main approaches for MS² CASE are found in the literature: molecular graph generators (MGGs) and machine learning (ML) tools. ML methods use trained data sets, which greatly reduces search time but may lead to nonexhaustive solutions, as with stochastic methods; they can miss important structures and are highly dependent on the databases used for training. Despite that, they are very fast and address the essential problem of long, exhaustive generation in some MGG algorithms. Applications such as CFM-ID, MetFrag, MassFrontier, and PyFragMS have been developed primarily for precursor ion elucidation from MS² data, rather than for fragment ion structure generation.

MGGs tackle the combinatorial task of generating all possible structures within a framework of rules derived from experimental or theoretical data (e.g., spectra, valence, and bond order rules). This is an NP-complete problem, meaning that while a proposed solution can be quickly verified against the rules, finding the solution may require exponentially increasing computational effort as molecular formula size grows. Achieving both exhaustiveness and computational feasibility is therefore a significant challenge.

Existing MGG programs use various strategies and algorithms to balance completeness with computational constraints. The most common approach is structure assembly, in which molecular structures are built atom-by-atom from a given set of atoms using a defined algorithm. Examples include MOLGEN, MAYGEN, SURGE, MASS, and SMOG, with MOLGEN being considered a gold standard but available only under a commercial license.

Of these, MASS and its successor SMOG are notable for using adjacency matrices to represent molecular graphs. They are based on the branch-and-bound (B&B) algorithm of Igor Faradjev, a method dating back to the 1970s that remains important for molecular graph generation. Although B&B can achieve exhaustiveness with a relatively simple implementation, its computational and time cost must be weighed in every program that uses it as its base algorithm. Nonetheless, it balances simplicity with some level of exhaustiveness, depending on the application.

To manage the combinatorial explosion inherent to molecular graph generation, B&B builds structures atom-by-atom from an initial molecular scaffold, applying constraints to ensure canonicity. Therefore, the implementation and efficiency of B&B are highly dependent on the choice of constraints used by the program.

While these algorithms remain effective, their implementation has traditionally relied on compiled languages such as C, Java, or Fortran, reflecting the technological context of their original development between the 1980s and 2000s. However, the growing demand for accessible, modifiable, and integrative tools in computational chemistry has prompted a shift toward more flexible environments. In this regard, Python has emerged as a leading high-level language in scientific programming, offering concise, readable syntax and a vast ecosystem of libraries. This accessibility makes it particularly suitable for chemists without extensive experience in lower-level programming.

Within Python’s extensive ecosystem, RDKit has become a key tool for computer-assisted chemistry. Its ability to read, process, and write chemical data, including structural formats such as SMILES, facilitates interaction between chemists and computational tools. SMILES also ensures a degree of structural canonicity, which is essential for completeness in molecular graph generation.

While existing programs can generate isomers and compare them with MS² or other spectral data for graph construction or pruning, to the best of our knowledge, no B&B-based MGG written in Python has been reported that specifically combines molecular formula, charge state, exact mass, and precursor ion structural features to exhaustively generate fragment ion molecular structures for theoretical studies.

This paper introduces ForMileS (Formation of Mass SMILES), a deterministic MGG Python workflow designed to construct valid molecular graphs that are consistent with a target molecular formula for fragment ions. The program implements the established recursive B&B expansion algorithm to generate structures, commencing from a user-defined SMILES template pertinent to the precursor ion structure. The expansion process proceeds atom-by-atom, while strictly adhering to constraints such as atomic valence, bond orders, molecular formula satisfaction, and other MS²-related parameters, including charge state and exact mass. ForMileS subsequently outputs a collection of fragment ion structures generated in accordance with the precursor ion and MS² information provided by the user.

Table 1 presents a summary of the already established programs, their methods for generating structures, and their main input parameters, alongside ForMileS for comparison.

Table 1. Comparison of Features between ForMileS and Other Molecular Graph Generator Programs.

feature/program	ForMileS	MOLGEN	SMOG	RDKit EnumerateLibrary
Input definition	Molecular formula + scaffold (SMILES), optional charge, mass target, tolerance	Molecular formula, optional constraints (e.g., connectivity, symmetry reduction)	Molecular formula, connectivity rules	Reaction definition (scaffolds + R-groups or building blocks)
Generation method	Recursive graph growth with pruning (branch and bound style) + optional scaffold seeding	Exhaustive combinatorial generation of constitutional isomers with symmetry pruning	Exhaustive canonical generation using orderly generation	Enumeration by applying substituents to predefined cores (combinatorial chemistry)
Constraints available	Branching allowed/forbidden, cyclic allowed/forbidden, max double/triple bonds, valence rules	Symmetry avoidance, max valence, atom counts, connectivity	Canonical order generation, valence rules	Library constraints defined by user (lists of R-groups)
Charge handling	Adds charges postgeneration to specific atom types	Typically neutral structures (extensions exist for charged)	Typically neutral	Charges only if defined in input scaffold
Mass filtering	Postgeneration exact mass filter with tolerance	Possible via constraints in formula	Possible via rules	Not natively; must filter externally
Output formats	SMILES, PNG/SVG, optional MOL/XYZ	SMILES, SDF, in-house formats	SMILES, SDF	SMILES, Mol objects (in-memory)
User interface	GUI built in Tkinter + Command Line Interface (CLI)	CLI only (commercial academic license)	CLI	Python API (scriptable)
Unique aspect	Formula + mass tolerance targeting (for MS); Scaffold-seeded growth; Charge placement automation; Graph expansion	Very mature, optimized, symmetry exact and exhaustive	Compact canonical representation	Tight RDKit integration, combinatorial enumeration


The program is not intended to replace existing tools (which are better at exhaustive and efficient generation) but rather to complement them, representing a modern and specific workflow to perform MGG specifically in the MS² context. The main contribution of ForMileS is the integration of the aforementioned MS² specific constraints with the B&B algorithm. Indeed, as will further be demonstrated in the next sections, ForMileS has aspects of both a generator and confirmation tool for molecular structures.

Hence, ForMileS is introduced as a support tool, positioned within a specialized area between exhaustive MGG methods, which are less MS² aware, and programs dedicated to the structural elucidation of precursor-focused MS².

Implementation and Computational Method

Description of Program Functionalities and Algorithm

Figure 1 summarizes the complete ForMileS workflow in pseudocode form, highlighting how each numbered block corresponds to a functional module later described in the text. Table 2 expands these modules, translating every pseudocode block into its explicit Python implementation and indicating the main functions or objects invoked in the source code. For example, the pseudocode segment “GrowRecursive() → ProcessCompleteMolecule()” (Figure 1, lines 6–9) is detailed in Table 2 under “Graph generation,” where the recursive growth, valence validation, and pruning criteria are explained through the functions grow_recursive and valence_ok. Likewise, the pseudocode step “GenerateChargedSMILES() → FilterByMass()” is linked to the “Charge generation” and “Filtering” rows of Table 2 and elaborated in the following subsections. This cross-reference is intended to help the reader trace the execution flow from algorithmic logic (Figure 1) to implementation (Table 2) and explanatory discussion.

Figure 1. Simplified ForMileS workflow, with the most important molecular graph generation functionalities expressed as pseudocode of the Python script.

Table 2. Description of ForMileS Main Functionalities and Their Correspondence to the Implemented Script Elements.

program step	description	script elements
Graph generation	Parses target formula, then recursively grows from each scaffold: adds atoms to “open sites” only if (i) valence limits per element, (ii) allowed bond orders by element pair and global flags, (iii) structural constraints hold. When the atom deficit hits zero, the structure is considered complete.	parse_formula, open_sites, get_allowed_bond_orders, valence_ok, grow_recursive, process_complete_molecule
Ring handling	If rings are allowed, tries ring closures by adding bonds between atoms that still have valence headroom; only keeps sanitizable ringed structures.	try_ring_closures (called inside process_complete_molecule)
Bond-order permutations	For completed molecules, optionally explores higher bond orders (double/triple) bounded by global maxima to avoid combinatorial blow-up.	explore_bond_permutations (invoked from process_complete_molecule)
Duplicate control (neutral)	Every completed structure is converted to canonical SMILES; a seen set blocks duplicates immediately. Structural constraints are checked again at this boundary.	mol_to_canonical_smiles, process_complete_molecule (with seen), has_cycles, is_linear
Charge generation	For each neutral SMILES, uses RDKit’s symmetry ranks to choose exactly one representative atom per symmetry class for the allowed chargeable elements and applies the formal charge. After sanitization, canonical SMILES ensures dedup.	generate_charged_smiles (uses Chem.CanonicalRankAtoms, SetFormalCharge, Chem.MolToSmiles)
Filtering	Exact mass filter (monoisotopic) against TARGET_MASS ± TOLERANCE, with canonicalization-based dedup for the filtered set.	filter_by_mass using Descriptors.ExactMolWt
Artifacts	2D annotated PNGs (and optional SVG), optional XYZ/MOL via 3D embed + UFF optimize, plus a human-readable run summary with timings, RAM snapshots, and final counts.	smiles_to_images, generate_xyz_from_smiles, generate_mol_from_smiles, _write_run_summary
Performance snapshots	Lightweight RAM probe (psutil if available; resource fallback), timing with perf_counter().	_get_ram_mb, main block (if __name__ == "__main__")
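The "Performance snapshots" row can be illustrated with a minimal sketch. It assumes only what the table states (psutil if available, a resource fallback, timing with perf_counter()); the function names get_ram_mb and timed and all implementation details are hypothetical and are not taken from the ForMileS source.

```python
import sys
import time


def get_ram_mb():
    """Return an approximate memory figure in MB, or None if no probe exists."""
    try:
        import psutil  # optional third-party dependency
        return psutil.Process().memory_info().rss / 1024**2
    except ImportError:
        pass
    try:
        import resource  # standard library, Unix only
    except ImportError:
        return None  # e.g. Windows without psutil
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / 1024**2 if sys.platform == "darwin" else peak / 1024


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    total, seconds = timed(sum, range(1_000_000))
    print(f"sum={total} time={seconds:.4f}s ram={get_ram_mb()} MB")
```

Note that the psutil branch reports the current resident set size, whereas the resource fallback reports the peak value, so the two figures are not strictly comparable.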


Molecular Graph Generation

While ForMileS does not contribute to the B&B algorithm itself, it implements its core elements within the aforementioned workflow to recursively grow the molecular graph atom by atom. This process can be represented as a decision tree, where each level corresponds to the addition of one atom beyond the BMS atoms defined as input. The nodes represent molecular graphs, and the branches represent all possible ways of adding the next atom to valid sites with valid bonds. The search space is reduced through pruning (bounding). Details are found in Figure 1.

The program identifies the atoms and their quantities in the molecular formula provided by the user and stores them in a vector $\vec{f}_{\mathrm{target}}$. This is done with the function parse_formula().
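As an illustration of this step, the following is a minimal sketch of a formula parser. It shares the name parse_formula with the function mentioned in the text, but the body, the regular expression, and the restriction to flat formulas without parentheses are assumptions of this example, not the ForMileS implementation.

```python
import re
from collections import Counter

# One element symbol (capital letter, optional lowercase) plus optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Return element counts for a flat formula such as 'C6H14O3'."""
    counts = Counter()
    consumed = 0
    for match in _TOKEN.finditer(formula):
        element, number = match.groups()
        counts[element] += int(number) if number else 1
        consumed += len(match.group(0))
    # Any character not covered by a token means the input was malformed.
    if consumed != len(formula):
        raise ValueError(f"unparseable formula: {formula!r}")
    return counts


if __name__ == "__main__":
    print(dict(parse_formula("C6H14O3")))
```

The resulting mapping plays the role of the target vector: each key is an element and each value is the number of atoms required.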

Then, starting from the BMS, the function count_atoms() counts the atoms already present in the current molecular graph G and stores them in the vector $\vec{f}(G)$. From these values, at each recursive growth of the molecular graph, i.e., each atom-adding step, the vector $\vec{d} = \vec{f}_{\mathrm{target}} - \vec{f}(G)$ is calculated with the function atom_deficit() to evaluate the atom deficit of the current graph.

The first bound condition is $\vec{d} = \vec{0}$. On its own, this condition would still let the combinatorial space grow prohibitively with molecule size, but it satisfies the most important goal of any MGG: generating a structure that fulfills the required molecular formula.

Another important chemical constraint used for bounding is the valence of each atom. Each atom a has a current valence V(a) at each step of the graph-growing process. The parameters.json file defines a maximum valence $V_{\max}$ for each atom type, e.g., 4 for carbon and 2 for oxygen. The valence constraint used to bound the search, i.e., to prune branches, is therefore $\forall a \in G,\ V(a) \le V_{\max}(a)$. This also means that the program is limited to the explicit valence information available in the parameters file; it checks the valence with ‘if not valence_ok(current_mol, max_valence): return’.
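The valence bound can be sketched without any chemistry toolkit. In the example below the graph is a plain list of element symbols plus a dictionary of bond orders, which is an assumption made for illustration; the program itself applies valence_ok to an RDKit molecule, and the maximum valences shown are example values rather than the contents of parameters.json.

```python
# Example maximum valences (assumed values for illustration).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def current_valences(n_atoms, bonds):
    """Sum the bond orders on every atom; bonds maps (i, j) -> order."""
    valence = [0] * n_atoms
    for (i, j), order in bonds.items():
        valence[i] += order
        valence[j] += order
    return valence


def valence_ok(elements, bonds, max_valence=MAX_VALENCE):
    """True if V(a) <= V_max(a) holds for every atom a in the graph."""
    used = current_valences(len(elements), bonds)
    return all(v <= max_valence[e] for e, v in zip(elements, used))


if __name__ == "__main__":
    print(valence_ok(["C", "O"], {(0, 1): 2}))  # carbonyl-like C=O
    print(valence_ok(["C", "O"], {(0, 1): 3}))  # oxygen would exceed 2
```

A branch whose graph fails this test is abandoned at once, so none of its descendants are ever generated.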

Also, if any element x in the recursive branching tree has $\vec{d}[x] < 0$, the program has overshot the formula and the branch is pruned. This is tested by the function is_deficit_valid(), with deficit = atom_deficit(target_formula, current_atoms). Branch growth continues for each atom: the program iterates over the atoms still needed and skips the exhausted ones.
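The deficit bookkeeping described above can be sketched as follows. The names atom_deficit and is_deficit_valid follow the text, while is_complete and elements_still_needed are hypothetical helpers added for this example; all function bodies are assumptions. Hydrogens are left out of the dictionaries because, as explained later in the text, they are not handled explicitly during the B&B phase.

```python
def atom_deficit(target_formula, current_atoms):
    """Element-wise d = f_target - f(G) as a dictionary."""
    elements = set(target_formula) | set(current_atoms)
    return {e: target_formula.get(e, 0) - current_atoms.get(e, 0)
            for e in elements}


def is_deficit_valid(deficit):
    """False as soon as any element has been overshot (d[x] < 0)."""
    return all(value >= 0 for value in deficit.values())


def is_complete(deficit):
    """Goal test: every component of d equals zero."""
    return all(value == 0 for value in deficit.values())


def elements_still_needed(deficit):
    """Elements that can still be added; exhausted ones are skipped."""
    return sorted(e for e, value in deficit.items() if value > 0)


if __name__ == "__main__":
    d = atom_deficit({"C": 3, "O": 1}, {"C": 2})
    print(d, is_deficit_valid(d), is_complete(d), elements_still_needed(d))
```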

The program utilizes parameters.json to define valences and bond orders for atom additions. It employs recursive calls to attach_atom and grow_recursive functions for molecular graph construction, with the RWMol class facilitating molecule editing and candidate generation.

Upon valid molecule creation, postprocessing involves bond permutations for double/triple bonds and ring generation if permitted. To reduce combinatorial complexity, the user specifies the maximum number of allowed double and triple bonds.
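To make the role of the user-defined maximum concrete, the sketch below enumerates which single bonds of a completed skeleton could be promoted to double bonds without violating the valence limits. It is a simplified illustration under several assumptions: only double bonds are considered, the graph is a plain list of elements and bonds, and the function name double_bond_variants and the valence table are hypothetical.

```python
from itertools import combinations

# Example maximum valences (assumed values for illustration).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def double_bond_variants(elements, single_bonds, max_double):
    """Return every bond-order assignment with at most max_double double bonds."""
    variants = []
    for k in range(max_double + 1):
        for chosen in combinations(single_bonds, k):
            orders = {b: (2 if b in chosen else 1) for b in single_bonds}
            used = [0] * len(elements)
            for (i, j), order in orders.items():
                used[i] += order
                used[j] += order
            # Keep the assignment only if no atom exceeds its valence.
            if all(u <= MAX_VALENCE[e] for e, u in zip(elements, used)):
                variants.append(orders)
    return variants


if __name__ == "__main__":
    chain = [(0, 1), (1, 2)]
    for limit in (0, 1, 2):
        n = len(double_bond_variants(["C", "C", "O"], chain, limit))
        print(f"max_double={limit}: {n} variants")
```

For a chain with b single bonds the number of candidate assignments grows as the sum of binomial coefficients C(b, k) for k up to the limit, which is why capping the limit keeps this step tractable.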

Canonicity is not enforced by the branch-and-bound (B&B) growth. Instead, RDKit canonical SMILES ensures uniqueness, with virtual sets (called ‘seen’) eliminating duplicates at two stages: (i) neutral graphs from the B&B step, and (ii) charged variants from charge placement. This design collapses structures reachable via different growth orders or bond-order permutations into a single canonical representative.

At each B&B goal state (i.e., when atom-deficit is zero and constraints are met), the candidate graph is sanitized and converted to canonical SMILES before being admitted to the neutral pool. After charge localization, each charged candidate undergoes recanonicalization and deduplication.
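The interplay between recursive growth and deduplication at the goal state can be illustrated with a toy enumerator. This is a sketch under strong simplifying assumptions: only single bonds and acyclic skeletons are built, hydrogens are implicit, and a brute-force relabelling over all atom permutations stands in for RDKit canonical SMILES, which restricts it to very small graphs. The names canonical_key, grow, and enumerate_skeletons are hypothetical.

```python
from itertools import permutations

# Example maximum valences (assumed values for illustration).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def canonical_key(elements, bonds):
    """Smallest relabelling of the graph (brute force; toy sizes only)."""
    best = None
    for perm in permutations(range(len(elements))):
        relabeled = [None] * len(elements)
        for old, new in enumerate(perm):
            relabeled[new] = elements[old]
        new_bonds = tuple(sorted(tuple(sorted((perm[i], perm[j])))
                                 for i, j in bonds))
        key = (tuple(relabeled), new_bonds)
        if best is None or key < best:
            best = key
    return best


def grow(elements, bonds, deficit, seen):
    """Attach one missing atom at every open site, then recurse."""
    if all(v == 0 for v in deficit.values()):
        seen.add(canonical_key(elements, bonds))  # goal state, deduplicated
        return
    degree = [0] * len(elements)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    for element, missing in deficit.items():
        if missing == 0:
            continue  # this element is exhausted
        for site, site_element in enumerate(elements):
            if degree[site] >= MAX_VALENCE[site_element]:
                continue  # bound: no free valence at this site
            grow(elements + [element], bonds + [(site, len(elements))],
                 {**deficit, element: missing - 1}, seen)


def enumerate_skeletons(scaffold_elements, scaffold_bonds, target):
    """Return the set of unique heavy-atom skeletons containing the scaffold."""
    names = set(target) | set(scaffold_elements)
    deficit = {e: target.get(e, 0) - scaffold_elements.count(e) for e in names}
    if any(v < 0 for v in deficit.values()):
        raise ValueError("scaffold exceeds the target formula")
    seen = set()
    grow(list(scaffold_elements), list(scaffold_bonds), deficit, seen)
    return seen


if __name__ == "__main__":
    print(len(enumerate_skeletons(["C"], [], {"C": 2, "O": 1})))
```

Starting from a single carbon with two carbons and one oxygen as the target, several growth orders reach the same C-C-O chain, and the seen set collapses them so that only the C-C-O and C-O-C skeletons remain.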

Ring System Generation

If cycles are allowed by the user, when the formula is satisfied, the algorithm may attempt ring closures by adding bonds between atoms that still have valence headroom and are topologically eligible (subject to global constraints). Double/triple bond permutations are explored within user-specified maxima to capture alternative valence patterns relevant to the chemistry. Invalid proposals (valence overflow, impossible ring strain under the simple valence model, or rule violations) are rejected immediately by sanitize/validation.

Aromaticity is not hard-coded by ForMileS. Instead, it is perceived by RDKit during sanitization, i.e., the sanitize() step, which includes kekulization and aromaticity perception. The program grows graphs in an explicit bond-order form. RDKit then analyzes cycles and π-electron patterns to mark atoms/bonds as aromatic when appropriate. Because canonical SMILES respect RDKit’s aromaticity perception, resonance-equivalent representations (e.g., alternating double-bond patterns in a ring) are collapsed to a single canonical form. This prevents duplicates arising from different but equivalent Kekulé assignments. If the user limits the total number of double/triple bonds, that indirectly constrains which ring systems can become aromatic (some candidates will sanitize as nonaromatic if the π budget is insufficient).

Therefore, aromaticity is emergent from RDKit’s perception of the sanitized candidate. ForMileS does not enumerate “aromatic forms” separately. Canonical SMILES collapses resonance/kekulé variants, so aromatic duplicates are not produced.

Also, the cyclization process can explode combinatorially; the script therefore ships with a default limit of 10 structures, which the user can change. Consequently, the ring-building process is not exhaustive, unlike in other established MGGs, and would benefit from the future implementation of chemical constraints.
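A ring-closure pass with a cap can be sketched as follows. The example assumes a completed single-bond skeleton stored as a list of elements and bonds, and it interprets the limit of 10 as a cap on the ring-closed structures produced per skeleton; both points, together with the name ring_closures and the valence table, are assumptions of this illustration rather than the behavior of try_ring_closures.

```python
from itertools import combinations

# Example maximum valences (assumed values for illustration).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def ring_closures(elements, bonds, max_structures=10):
    """Add one extra bond between non-bonded atoms that have valence headroom."""
    used = [0] * len(elements)
    for i, j in bonds:
        used[i] += 1
        used[j] += 1
    bonded = {frozenset(b) for b in bonds}
    closed = []
    for i, j in combinations(range(len(elements)), 2):
        if len(closed) >= max_structures:
            break  # cap reached: the search is deliberately not exhaustive
        if frozenset((i, j)) in bonded:
            continue  # already bonded; a second bond is not a ring closure
        if used[i] < MAX_VALENCE[elements[i]] and used[j] < MAX_VALENCE[elements[j]]:
            closed.append(bonds + [(i, j)])
    return closed


if __name__ == "__main__":
    propane = [(0, 1), (1, 2)]
    print(ring_closures(["C", "C", "C"], propane))
```

The sketch performs no chemical plausibility check, so strained three-membered rings are produced alongside larger ones, consistent with the need for later validation noted in the text.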

Charge Generation and Exact Mass Filter

Charge generation initiates with neutral molecular graphs from the B&B step. This combinatorial process, while computationally intensive, is essential as multiple sites can bear charge in an MS² ion.

The parameters.json file specifies atoms capable of bearing charge. The generate_charged_smiles() function utilizes this list with SMILES strings from the B&B step. Its primary purpose is to combinatorially assign a charge to a labeled atom, store the resulting structure, and avoid duplicates using RDKit’s symmetry check. SMILES are crucial because RDKit’s Chem.MolFromSmiles() implicitly assigns hydrogens to heavy atoms.
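The choice of one charge site per symmetry class can be sketched in plain Python. The example assumes that a list of per-atom symmetry ranks is already available, in the shape returned by RDKit's Chem.CanonicalRankAtoms(mol, breakTies=False), where equivalent atoms share a rank; the rank values used below are invented for illustration, and the function name charge_sites is hypothetical.

```python
def charge_sites(elements, symmetry_ranks, chargeable):
    """Return one atom index per symmetry class among chargeable elements."""
    seen_classes = set()
    sites = []
    for index, (element, rank) in enumerate(zip(elements, symmetry_ranks)):
        if element not in chargeable or rank in seen_classes:
            continue  # not chargeable, or an equivalent atom was already taken
        seen_classes.add(rank)
        sites.append(index)
    return sites


if __name__ == "__main__":
    # Heavy atoms of a symmetric diol O-C-C-O: both oxygens are equivalent.
    print(charge_sites(["O", "C", "C", "O"], [0, 2, 2, 0], {"O"}))
    # Heavy atoms of C-O-C-C-O: the two oxygens are distinct.
    print(charge_sites(["C", "O", "C", "C", "O"], [0, 3, 2, 4, 1], {"O"}))
```

Skipping symmetry-equivalent atoms avoids producing charged structures that would collapse to the same canonical SMILES afterward.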

For experimental data-based filtering, exact mass is employed. The filter_by_mass() function uses RDKit’s Descriptors.ExactMolWt() to determine if each charged SMILES falls within a specified mass tolerance, returning those that meet the criteria.
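The tolerance test itself is simple arithmetic, shown here on molecular formulas rather than SMILES so that the sketch needs no toolkit. The monoisotopic masses are standard values for the most abundant isotopes; subtracting the electron mass for each positive charge is an assumption of this example and says nothing about how Descriptors.ExactMolWt treats charges. The names monoisotopic_mass and filter_candidates_by_mass are hypothetical.

```python
# Monoisotopic masses of the most abundant isotopes, in unified mass units.
MONOISOTOPIC = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.99491461956,
}
ELECTRON_MASS = 0.00054857990946


def monoisotopic_mass(counts, charge=0):
    """Exact mass from element counts; electrons removed for positive charge."""
    mass = sum(MONOISOTOPIC[element] * n for element, n in counts.items())
    return mass - charge * ELECTRON_MASS


def filter_candidates_by_mass(candidates, target_mass, tolerance):
    """Keep the labels whose mass lies within target_mass +/- tolerance."""
    return [label for label, counts, charge in candidates
            if abs(monoisotopic_mass(counts, charge) - target_mass) <= tolerance]


if __name__ == "__main__":
    candidates = [
        ("C3H9O+", {"C": 3, "H": 9, "O": 1}, 1),
        ("C3H7O+", {"C": 3, "H": 7, "O": 1}, 1),
    ]
    print(filter_candidates_by_mass(candidates, 61.0648, 0.005))
```

Because two hydrogens differ by about 2.016 u, even a loose tolerance separates candidates whose hydrogen counts were changed by the charge assignment.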

Although hydrogens are not initially assigned, RDKit functions explicitly account for them during mass calculation. Since charges are assigned prior to this step, the program can “sanitize” molecules by adjusting hydrogens when testing charged sites and their valences.

The Base Molecular Scaffold

Certain programs utilize Nuclear Magnetic Resonance (NMR), Infrared Spectroscopy (IR), or MS data files as inputs for molecular graph generation, providing an initial framework to reduce combinatorial space and ensure physical feasibility. However, this approach necessitates the specific data file, thereby restricting the program’s utility to empirical or database sources. ForMileS, conversely, allows the user to provide a SMILES string as the Base Molecular Scaffold (BMS), the initial substructure from which all conceivable molecular graphs are constructed. The BMS represents an indispensable molecular substructure of the precursor ion that must be conserved within the fragment ion structure. This is more than a mere constraint; it embodies a structural characteristic of the precursor ion that is maintained in the fragment ion.

In MS² experiments, as in numerous other MS applications, chemical expertise remains indispensable. The utilization of SMILES as the BMS expands possibilities, particularly for theoretical investigations, as expounded in the Introduction. The same rationale supporting this approach can also be invoked to critique it: significant lacunae persist in our comprehension of gas-phase dissociation and fragmentation mechanisms. Consequently, the selection of the BMS is a nontrivial decision and serves as the primary determinant of the final results. A dedicated subsection in the Results and Discussion section elucidates the effects of the BMS and offers guidance on its appropriate selection.

Technical Stack

As previously noted in the Introduction, the selection of Python as the programming language enhances script readability, minimizes code length, and facilitates seamless integration into larger computational pipelines. This choice also promotes the program’s adoption within the mass spectrometry (MS) domain, enabling experimental chemists with limited computer science expertise to comprehend the program’s underlying logic and even propose enhancements or modifications.

A crucial distinction of ForMileS from other MS²-context programs is its reliance on exact mass, thereby eliminating the necessity of accounting for hydrogen atoms during the combinatorial B&B steps. During the combinatorial step for generating charges, RDKit exclusively manages hydrogen atoms pertinent to the charge, such as those present in protonated alcohols. This process is streamlined by RDKit’s efficient handling of SMILES notation, which obviates the need for explicit hydrogens. Consequently, stereoisomers are not generated.

The program refrains from employing virtual fragmentation reaction techniques or data sets to assess the feasibility of dissociation products, unlike MOLGEN-MS. This approach stems from ForMileS’s design objective: to generate fundamental structures for the study of these reactions, rather than to predetermine them. While the application of reaction rules could reduce computational time, it would impose a level of reactional constraint antithetical to the program’s purpose.

Inputs and Outputs

In contrast to programs such as MASS/SMOG, SURGE, or even MOLGEN, ForMileS produces .xyz coordinate files, one of the most prevalent formats in computational chemistry, in addition to .mol files. Although MOLGEN does generate .mol files, it is a commercially licensed program. All coordinate files are generated using RDKit’s integrated 2D-to-3D functions and necessitate thorough inspection and subsequent submission to a geometry optimization routine, which can be performed using classical force-field molecular dynamics (FFMD), semiempirical methods, or density functional theory (DFT).

Finally, standard RDKit functions save images and coordinate files for the filtered SMILES list. The program also saves intermediate SMILES lists for B&B generation, charge assignment, and final exact mass filtering in the output directory.

Beyond rudimentary valence rules for atomic bonding, the absence of any mechanistic reaction reasoning in the molecular structure generation, unlike MOLGEN-MS, implies that the final result set comprises purely combinatorial and mathematically possible constitutional isomers. These isomers require the assignment of physical meaning, typically through energy calculations such as Single Point Energy, Enthalpy, or Gibbs energy. Therefore, while the lack of further constraints permits a degree of freedom in generation, it is imperative that the results undergo validation, as the program alone does not furnish chemical and physical feasibility validation for the structures and may produce chemically unsound entities, including unstable heterocycles, antiaromatic systems, and highly strained molecules.

All program parameters are input via a Graphical User Interface (GUI) and stored in a discrete parameters.json file, affording users the flexibility to experiment with varying rules for elemental orders, maximum valences, and elements capable of bearing charges. This design reflects an open-source philosophy and extends the program’s applicability across a broad spectrum of systems. Currently, the program supports commonly encountered organic elements: Li, Be, B, C, N, O, F, Cl, Br, I, Si, P, and S, with a primary focus on C, O, and N during the publication period. As testing progresses, additional elements can be incorporated.
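To show how such a file could be organized and read, the sketch below embeds an example parameter set and validates it. The key names (max_valence, allowed_bond_orders, chargeable_elements) and all values are hypothetical and chosen only to mirror the three kinds of rules mentioned above; the actual layout of the ForMileS parameters.json may differ.

```python
import json

# Hypothetical file contents; the real parameters.json may use other keys.
EXAMPLE_PARAMETERS = """
{
  "max_valence": {"C": 4, "N": 3, "O": 2},
  "allowed_bond_orders": {"C-C": [1, 2, 3], "C-O": [1, 2], "C-N": [1, 2, 3]},
  "chargeable_elements": ["O", "N"]
}
"""

REQUIRED_KEYS = ("max_valence", "allowed_bond_orders", "chargeable_elements")


def load_parameters(text):
    """Parse the JSON text and check that the expected sections exist."""
    params = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"missing parameter sections: {missing}")
    return params


if __name__ == "__main__":
    params = load_parameters(EXAMPLE_PARAMETERS)
    print(params["max_valence"]["C"], params["chargeable_elements"])
```

Keeping the rules in a separate file means an element can be added or a valence changed without touching the generation code.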

The input parameters in the GUI are the molecular formula, the BMS, the exact mass, switches enabling or disabling double and/or triple bonds, cyclic and ramified (branched) structure generation, and the maximum number of double/triple bonds allowed, if any. The user also defines whether .mol, .xyz, and .svg files will be generated and the path to the output folder.

The entirety of the program is operated through a Graphical User Interface (GUI) constructed with the native Tkinter package, ensuring compatibility and mitigating unnecessary conflicts. While this interface is ubiquitous and lightweight, its visual presentation is considered outdated, and its proper resolution is highly dependent on the personal computer’s graphical configurations.

Validation and Limitations

Given that the initial inspiration for the program’s development stemmed from the structural analysis of small ether-glycol MS² fragment ions, two representative examples from this class have been selected to demonstrate ForMileS’s capabilities. From the extensive collection of glycol Electrospray Ionization Tandem Mass Spectrometry (ESI-MS²) data accessible in databases such as mzCloud, Polypropylene Glycol Octamer (PPG8), a polycondensation polymer, was chosen to illustrate and discuss the rationale behind BMS selection and its chemical significance, owing to its well-characterized MS² profile and readily available data. Dipropylene glycol dimethyl ether (DGDE) was also selected due to its small size yet structural similarity to PPG8, serving to further exemplify the retrieval of chemical information from the results. Both examples are additionally utilized to evaluate the program’s performance and limitations. A set of small molecules reported in MAYGEN is used as well to perform benchmarking.

To illustrate the ForMileS workflow and execute theoretical energy calculations, the generated fragment structures were subjected to semiempirical geometry optimization, followed by a DFT frequency calculation to ascertain their relative Gibbs energy profiles. All computations were conducted using ORCA v.6.0. Geometry optimization employed the GFN2-xTB semiempirical method. Frequency calculations utilized the B3LYP functional with the Ahlrichs def2-SVP basis set, and outcomes were graphically represented with the lowest energy normalized to zero for the purpose of relative energy assessment.
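The normalization to the lowest energy amounts to a subtraction and a unit conversion, sketched below. The isomer labels and Gibbs energies are invented placeholder numbers, not results from this work, and reporting in kcal/mol is an assumption of the example; only the hartree-to-kcal/mol conversion factor is a standard constant.

```python
HARTREE_TO_KCAL_MOL = 627.509474  # standard conversion factor


def relative_energies(gibbs_hartree):
    """Shift all energies so that the lowest one is zero, in kcal/mol."""
    lowest = min(gibbs_hartree.values())
    return {name: (energy - lowest) * HARTREE_TO_KCAL_MOL
            for name, energy in gibbs_hartree.items()}


if __name__ == "__main__":
    # Placeholder Gibbs energies in hartree, for illustration only.
    example = {"isomer_A": -310.123456, "isomer_B": -310.120000}
    for name, value in sorted(relative_energies(example).items()):
        print(f"{name}: {value:.2f} kcal/mol")
```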

Results and Discussion

Performance Benchmarking

Table 3 compares the wall-time for graph generation in ForMileS with established MGGs: MOLGEN, MAYGEN (an open-source MGG), and OMG. All ForMileS tests were performed on an AMD Ryzen 3600 (8 cores, 16 GB RAM). Data for the other MGGs were retrieved from MAYGEN’s original publication and were not generated on the same setup. This comparison therefore provides only a raw view of performance.

Table 3. Wall-Time Comparison for ForMileS and Other Available Molecular Graph Generators.

		wall-time (s)
molecular formula	# structures	MOLGEN	MAYGEN	OMG	ForMileS
C₃Cl₂H₄	7	0.006	0.070	0.219	0.049
C₃O₃H₄	152	0.026	0.110	0.389	1.462
C₆H₆	217	0.049	0.159	0.454	1.153
Cl₂C₅H₄	217	0.028	0.116	0.659	4.930
C₅H₉ClO	334	0.009	0.147	0.745	16.268
C₆OF₂H₁₂	536	0.032	0.210	6.219	2385.403


For this comparative assessment, ForMileS was executed without any constraints: double and triple bonds, branched structures, and rings were all permitted without limit, and no base molecular scaffold (BMS) was provided. This is reflected in the fact that all programs generated precisely the same number of structures. The only restriction is intrinsic to the program's design, which does not explicitly include hydrogens in the combinatorial B&B phase. Furthermore, no charge was added, as the data set employed consists of neutral molecules.

It is evident that ForMileS exhibits poor scalability when compared with established MGG methods, reinforcing that it is not an exhaustive generator suitable for replacing existing ones. While many larger molecular formulas are available for benchmarking, the extensive generation time observed for the last molecular formula already clearly indicated the program’s significant limitation in achieving efficiency without constraints.

ForMileS outperformed MAYGEN and OMG only for the smallest molecule (C₃Cl₂H₄); for every larger formula it was slower than all three established generators. This underscores the fundamental limitation of the B&B algorithm with respect to larger systems, which is aggravated by the simplified nature of the current implementation. For truly efficient graph generation, we endorse the preeminent positions of MOLGEN and even MAYGEN as graph generators. These programs also rely on their own ‘good’ and ‘bad’ lists to guide structure generation, and, as anticipated, no naive B&B implementation can approach such efficiency. Despite this, exhaustiveness for small molecules was achieved, satisfying one of the primary concerns regarding ForMileS’s performance.
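The scaling gap can be made concrete by dividing the ForMileS wall-time by that of each reference generator. The short sketch below does this using the timings from the benchmark table in this section; the dictionary layout and the function name `slowdown` are illustrative choices. Because the reference timings were measured on different hardware, the ratios are only indicative.

```python
# Wall-times in seconds, copied from the benchmark table in this section.
wall_time = {
    "C3Cl2H4":  {"MOLGEN": 0.006, "MAYGEN": 0.070, "OMG": 0.219, "ForMileS": 0.049},
    "C3O3H4":   {"MOLGEN": 0.026, "MAYGEN": 0.110, "OMG": 0.389, "ForMileS": 1.462},
    "C6H6":     {"MOLGEN": 0.049, "MAYGEN": 0.159, "OMG": 0.454, "ForMileS": 1.153},
    "Cl2C5H4":  {"MOLGEN": 0.028, "MAYGEN": 0.116, "OMG": 0.659, "ForMileS": 4.930},
    "C5H9ClO":  {"MOLGEN": 0.009, "MAYGEN": 0.147, "OMG": 0.745, "ForMileS": 16.268},
    "C6OF2H12": {"MOLGEN": 0.032, "MAYGEN": 0.210, "OMG": 6.219, "ForMileS": 2385.403},
}


def slowdown(formula: str, reference: str) -> float:
    """ForMileS wall-time divided by the reference generator's wall-time."""
    return wall_time[formula]["ForMileS"] / wall_time[formula][reference]


for formula in wall_time:
    ratios = {ref: round(slowdown(formula, ref), 1) for ref in ("MOLGEN", "MAYGEN", "OMG")}
    print(formula, ratios)
```

A ratio below 1 means ForMileS was faster; this occurs only for C₃Cl₂H₄ against MAYGEN and OMG, while for C₆OF₂H₁₂ the ratio against MAYGEN exceeds four orders of magnitude.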

Therefore, the streamlined B&B implementation necessitates future revisions and algorithmic updates to achieve comparable efficiency with standard tools, and we present this version of the program as the foundational step toward that goal.

The computational scaling limitation highlights the importance of evaluating the influence of the BMS parameter on time, memory, and, critically, the final result set. Figure summarizes the impact of BMS selection for the three main fragments of PPG8 concerning wall-time and the quantity of structures generated.

Wall-time as a function of number of nonscaffold atoms with SMILES used as BMS and total of generated structures for a) Fragment #3, b) Fragment #1, c) Fragment #2 without rings generated, d) Fragment #2 with full generation, including rings.

For all generation tests performed with PPG8 fragments, ramified structures were permitted, along with an extensive allowance for double and triple bonds. The primary parameter investigated in conjunction with BMS was the generation of ring structures, as detailed below.

The initial and most crucial observation is that BMS significantly impacts wall-time as the system size increases. For Fragments 1 and 2, the available computational resources allowed a comprehensive investigation into the effect of the number of non-BMS atoms, ranging from no B&B growth (number of non-BMS atoms maximized) to purely combinatorial generation without BMS (which is represented as only one atom input). As Fragment 1 (Figure b) is small, its behavior is less distinct for validation, given that only four BMS were permitted. However, it is evident that with just one fewer atom in the BMS compared to the purely combinatorial no-BMS, the wall-time fluctuates around a mean value. Regarding the number of structures, the quantity remains constant whether ‘OCC’ or ‘O’ is employed, indicating convergence for this exceptionally small molecule.

It is with Fragment 2 (Figure c and d) that the anticipated behavior is more clearly discernible. The Gompertz logistic regression helps visualize that the wall-time and the number of generated structures change much less (though still significantly) near the two extremes, pure combinatorial generation (no BMS) and complete BMS. In the middle of the range, the simple decision of adding or removing an atom from the BMS can lead to very large changes in wall-time and in the number of final structures. This is also contingent on the size of the molecular formula, as observed when comparing Fragment 2 with Fragment 3. While the graphical representation may be misleading, even a transition from a full BMS to 1 fewer atom in Fragments 2 and 3 resulted in 80% and 87% more structures generated, respectively. For Fragment 3, a difference of 6 to 7 atoms in the BMS led to an 85% increase in wall-time, while yielding a 46% increase in the number of structures generated in the final set.

In the case of Fragment 3 (Figure a), for a molecule of this size, generation ceased when the number of non-BMS atoms reached 7, due to time limitations, as this step required almost 5 h and 30 min of generation. Should the process continue to the limit of no-BMS, a convergence behavior would be expected, but it would most likely be prohibitive in the time domain. Furthermore, to further mitigate time consumption for profile generation, no ring structures were permitted for Fragment 3. From the tests with Fragment 2, it is evident that cycle generation (Figure c,d) drastically increases time consumption but does not alter the overall behavior (shape of the curve); the same is assumed for larger systems. At the limit of full generation, allowing rings generated approximately 35% more structures at the expense of 65% more time. Despite this, in Fragment 2, both with and without ring generation, the last two BMS entries yielded the same number of structures, indicating convergence.

Another significant source of computational cost and delay is the fully combinatorial charge generation step. Figure presents a comparison of the actual B&B graph generation and charge generation steps for the aforementioned Fragment #2 of PPG8, along with an overall summary of Linear versus Cyclic generation.

Overlaid comparison of the time spent in the charge-generation and graph-generation steps of ForMileS for Fragment #2, with and without cycle formation.

As seen in Figure , for larger BMS, the gap between graph and charge generation narrows, with charge generation becoming the primary bottleneck. This phase, being O(n^k) with k charges across n sites, significantly impacts ForMileS’s performance. The exhaustive nature of charge generation, which considers every atom specified in parameters.json for charge assignment, currently dictates this cost. While exhaustive, future heuristic rules could potentially improve performance without compromising chemical conclusions.
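To see why an exhaustive charge step scales this way, consider a minimal sketch in which k unit charges are placed on distinct eligible atoms of every generated graph. The function names and the assumption of at most one charge per site are illustrative and do not reproduce ForMileS's internal code; they only show that the number of candidates per graph is the binomial coefficient C(n, k), which grows as O(n^k) for fixed k.

```python
from itertools import combinations
from math import comb


def charge_placements(sites, k):
    """Return every way to put k unit charges on distinct candidate sites."""
    return list(combinations(sites, k))


def charged_candidates(n_graphs: int, n_sites: int, k: int) -> int:
    """Total charged structures if every graph is combined with every placement."""
    return n_graphs * comb(n_sites, k)


# Hypothetical example: 5 eligible heavy atoms, 2 charges to place.
placements = charge_placements(range(5), 2)
print(len(placements))               # number of placements for one graph
print(charged_candidates(536, 9, 1))  # cost multiplies across all graphs
```

Any heuristic that shrinks the list of eligible sites before this enumeration reduces the total multiplicatively across all graphs.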

Figure illustrates using Fragment 2 and 3 that memory consumption follows the same growth trend as wall-time.

RAM memory as a function of number of non-BMS atoms in ForMileS run for a) PPG8 fragment #3 without cycles allowed, b) PPG8 fragment #2 with cycles allowed.

These findings indicate that the final number of generated structures and the time expended depend predominantly on the selection of the BMS, although other parameters, such as the allowance for ring formation, also exert a significant influence on wall-time. While a larger BMS can significantly reduce computational time, it concurrently diminishes the final molecular set.

This outcome would be undesirable for a new Molecular Graph Generator (MGG) intended to generate a comprehensive molecular space efficiently. However, as a tool specifically tailored to the MS² context, ForMileS can be used with a BMS judiciously chosen according to a set of criteria to generate structures suitable for dissociation studies. Nevertheless, based on these initial performance results, ForMileS is only recommended for relatively small molecules. Currently, the largest molecules tested are C₆OF₂H₁₂ without constraints and C₉O₃H₁₉ with a BMS constraint of 5 atoms, and even these necessitate several minutes for generation.

The ensuing section will elucidate and deliberate upon the chemical validity of the generated structures, alongside providing guidance for BMS selection to ensure an acceptable set of structures within a viable computational time frame.

The Role of Base Molecular Scaffold

Based on the previously discussed effect of BMS in the final number of structures generated, this is the main parameter to be evaluated when using ForMileS. The initial steps in determining the BMS involve: (i) ascertaining the molecular formula of the fragment and precursor ion structure; (ii) identifying potential precursor cleavage sites that yield a fragment with the experimentally observed exact mass; and (iii) recognizing consistent atom connectivity among all dissociation groups. For polymers, if the fragment’s exact mass corresponds to a multiple of the monomer unit, this is typically a precursor feature to be preserved during fragmentation. Finally, (iv) common structural features retained across all hypothetical precursor cleavages, as depicted in the gray dotted area of Figure a, represent the primary BMS candidate.
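Step (ii) above amounts to checking whether a hypothetical cleavage product has the same monoisotopic m/z as the observed peak. The sketch below performs that check for simple C/H/O compositions. The function names, the 5 ppm tolerance, and the example composition C₃H₇O⁺ with an observed value of 59.0491 are assumptions chosen for illustration, not values taken from the mzCloud record.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, and the electron mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
ELECTRON = 0.00054858


def parse_formula(formula: str) -> dict:
    """Turn a flat formula such as 'C3H7O' into {'C': 3, 'H': 7, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def ion_mz(formula: str, charge: int = 1) -> float:
    """Monoisotopic m/z of an ion with the given elemental composition."""
    neutral = sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())
    return (neutral - charge * ELECTRON) / abs(charge)


def matches(formula: str, observed_mz: float, ppm: float = 5.0, charge: int = 1) -> bool:
    """True if the theoretical m/z lies within the ppm tolerance of the observed one."""
    theoretical = ion_mz(formula, charge)
    return abs(theoretical - observed_mz) / theoretical * 1e6 <= ppm


print(round(ion_mz("C3H7O"), 4))
print(matches("C3H7O", 59.0491), matches("C2H3O2", 59.0491))
```

Only the cleavage combinations whose product composition passes this test need to be compared in step (iii) when looking for shared connectivity.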

a) ESI-MS² spectrum for PPG8 from the mzCloud database. The BMS used in ForMileS are highlighted together with the dissociation patterns that led to them. The mzCloud fragment structures are shown in a gray shaded area. b) PPG8 dissociation patterns and cleavage-site labeling. Adapted with permission from ref . Copyright 2011 Wiley & Sons.

Desirable BMS characteristics include: a) maximal distance from the anticipated bond fragmentation region during dissociation; and b) minimal reactivity with respect to functional group populations. These considerations are pertinent to typical CID/HCD processes occurring on the tens of μs to ms time scale. While various rearrangements can occur, some may be time-limited before the fragment reaches the analyzer. A larger fragment and a BMS more distant from the fragmentation site reduce, though do not eliminate, the likelihood that the proposed BMS is not a preserved feature of the precursor. A more reactive BMS is more likely to be transformed during dissociation, so choosing it may exclude important fragment structures.

This “cut-and-recognize patterns” process is complex across different molecules. Depending on the precursor and fragment type and size, knowledge of common neutral losses can be beneficial.

Regardless of the aforementioned guidelines, the BMS is a hypothetical choice that significantly biases the final result. Its sole commitments are to the fragment’s exact mass and the assumption that the chosen substructure is genuinely retained. The flexibility to test comes with the cost of bias, obliging the user to gather extensive data and theoretical information for BMS selection, acknowledging that it is a conjecture requiring further evaluation. Common neutral losses, as detailed in Tandem Mass Spectrometry literature, serve as valuable starting points.

Case Study 1: PPG8 Fragment Analysis

Figure a presents the ESI-MS² spectrum available at mzCloud. It was acquired by Higher-energy Collisional Dissociation (HCD) at 40% collision energy. The three main fragments are presented in order of relative intensity. The shaded area in the right corner (Figure c) shows the molecular structure reported by the mzCloud database for each fragment.

Previous works have systematically studied the pattern of bond dissociation in polyethers and polyesters. Figure b illustrates a schematic division of potential cleavage sites in PPG8, revealing a repetitive pattern despite the molecule’s size.

In the case of PPG8, for any given fragment, it is possible to exhaust the breakage patterns that would yield the exact fragment mass in accordance with experimental data. For instance, ’α+b_x’ cannot generate any fragments with an exact mass for this particular dissociative process, though it is not chemically prohibitive.

Indeed, this pattern does not represent a reaction dissociation step or mechanism; rather, it serves as a map of hypothetical cleavage sites within the precursor that would generate fragments fitting the observed exact mass. The gray dotted area in Figure a for each fragment box signifies the common structural section found in all possible cuts made in precursor ions to generate hypothetical fragments. This is not yet a chemical proposition of a reaction mechanism, but rather a structural hypothesis that currently satisfies exact mass and precursor structural features.

The common protonated PPG8 form in ESI-MS² is the end-group R–OH₂⁺, which readily loses H₂O in CID/HCD. Therefore, in the ‘α, C_x’ cleavage proposed in Figure a, end-group fragmentation is anticipated. This suggests that three carbons connected with an end oxygen group are more probable than a ‘CCOC,’ despite the latter’s potential generation via an ‘a_x to a_x+1’ process. Between ‘CCO’ and ‘COC,’ the former appears more reasonable or general, allowing for both ether and alcohol structures for this fragment’s molecular formula, whereas ‘COC’ limits the final set to molecules containing only ether functional groups. Although ether is thermodynamically more stable, for very small fragments, numerous rearrangements can occur. Thus, for larger, more complex molecules, reactivity considerations are crucial. However, for this small fragment, B&B performs well, and fewer constraints, such as no BMS, can be applied. Despite this, Figure b shows that both no BMS and ‘CCO’ yield the same converged result. Consequently, if time permits, testing different BMS options can evaluate their impact on computational time and the number of generated structures.

As is commonly applied in many cases of molecular structure set evaluation, geometry optimization followed by a frequency calculation (or even simple single-point energy calculations) can be employed to assess the energetic profile of the resulting set and to facilitate the evaluation of physically and chemically valid structures while excluding those with excessively high energy.

To exemplify this methodology, we selected the most intense fragment in the spectrum, F1+, and subjected its structure set to geometry optimization with the robust semiempirical method GFN2-xTB, followed by a straightforward DFT (B3LYP/def2-SVP) frequency calculation to determine Gibbs energies. Given that the objective is to ascertain relative energy differences between structures expected to exhibit significant variations, the B3LYP functional with the split-valence Ahlrichs def2-SVP basis set was chosen. Large systems may also substantially benefit from the incorporation of D3 dispersion corrections.

To aid in the validation of the ForMileS output set, we investigated the presence of the fragment structure proposed by mzCloud itself (Figure c). It was anticipated that at least one of the generated structures would correspond to the published fragment. Furthermore, the Gibbs energy calculation was utilized to identify the most stable structures in the gas phase. The results are summarized in Figure .

Relative Gibbs energy profile for structures generated for Fragment #1 of PPG8. Lowest energy structures are highlighted in orange.

The presence of mzCloud molecular structures for F1+ within the generated set is consistent with the appropriate selection of BMS. This observation is similarly noted for F2+ and F3+, thereby contributing to the validation of the ForMileS output.

Furthermore, two generated structures are highlighted as the lowest energy isomers within the set. While frequency calculations are sensitive to the method and basis set for accurate energy level determination, the objective herein is to isolate structures within a defined relative energy range. It is observed that these two structures exhibit an energy difference exceeding 60 kJ/mol from the subsequent low-energy isomers. Notably, the mzCloud allyl-protonated alcohol structure presents an energy level greater than 146 kJ/mol higher when compared with the secondary protonated carbonyl and the secondary carbocation adjacent to the alcohol.
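Isolating structures within a relative energy range reduces to shifting all Gibbs energies so that the minimum is zero and then applying a cutoff. The sketch below shows this with made-up energies; the isomer labels, the hartree values, and the 60 kJ/mol window are hypothetical and are not results from this study.

```python
HARTREE_TO_KJ_MOL = 2625.4996  # conversion factor from hartree to kJ/mol


def relative_gibbs(energies_hartree: dict) -> dict:
    """Shift Gibbs energies so the lowest is zero and convert to kJ/mol."""
    lowest = min(energies_hartree.values())
    return {name: (g - lowest) * HARTREE_TO_KJ_MOL
            for name, g in energies_hartree.items()}


def within_window(relative_kj_mol: dict, window: float) -> list:
    """Names of structures whose relative energy does not exceed the window."""
    return sorted(name for name, dg in relative_kj_mol.items() if dg <= window)


# Hypothetical Gibbs energies in hartree (NOT values from this work).
gibbs = {"isomer_a": -193.1000, "isomer_b": -193.0990, "isomer_c": -193.0500}

relative = relative_gibbs(gibbs)
print({name: round(dg, 1) for name, dg in relative.items()})
print(within_window(relative, window=60.0))
```

The window is a user choice; a wider one retains more isomers for further study at the cost of more follow-up calculations.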

Additional generated structures, such as the highly unstable primary linear carbocation (which represents the highest energy structure) and the unique ring structures (positioned toward the high-energy end, particularly due to ring strain in 3- and 4-membered rings), are depicted in Figure .

It is well-established that tertiary carbocations exhibit greater stability than secondary and primary carbocations, particularly in linear molecules and in the absence of electron-donating groups such as heteroatom oxygen. Although a more comprehensive discussion on this topic is feasible, it falls outside the scope of the current work.⁴⁵

The presented workflow, while straightforward, is essential for imparting chemical meaning with a reasonable degree of reliability to the ForMileS output set. Users are strongly encouraged to employ diverse computational chemistry methods and analytical techniques to verify the output. This aligns with the program’s capability to generate coordinate files.

Case Study 2: DGDE Fragment #1 Analysis

Figure a shows the spectrum for DGDE as reported in the mzCloud database. The spectrum was recorded in an HCD (Higher-energy Collisional Dissociation) cell at a relative collision energy of 40%. For this molecule, mzCloud does not provide a molecular structure for each fragment, only the MS² data.

a) ESI-MS² spectrum for DGDE b) data related to ForMileS execution for Fragment F1+ of DGDE c) Relative Gibbs energy for generated structures with the lowest in energy highlighted in green box.

This molecule possesses structural elements analogous to polypropylene glycol dimer, featuring two methyl ether end-groups rather than hydroxyl. The hypothetical BMS selection could align with the previously discussed principles. Nevertheless, as previously demonstrated and elaborated, for a molecule of this magnitude, the program can be utilized without any BMS and still successfully generate all structures within a reasonable time frame.

For F1+, particularly, significant information can be gleaned from the final output set generated with cycles included. As depicted in Figure c, only double-bonded or cyclic structures satisfy the exact mass, even in the absence of a BMS. Disallowing double bonds leads ForMileS to generate solely cyclic structures. However, as also presented, the cyclic structures exhibit an energy level at least 21 kJ/mol higher than the double-bonded structures, which represent the lowest energy in the series.

Consequently, in conjunction with quantum chemistry calculations, ForMileS contributes to the chemical understanding that F1+ for DGDE must assume a double-bonded structure to achieve a reasonable Gibbs energy level, or a cyclic structure if other factors beyond thermodynamic stability are considered. This constraint arises simply because these are the sole structures that correspond to the molecular formula/exact mass of the fragment.
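The argument that a composition can force rings or multiple bonds follows from a simple valence count: in a connected structure with N atoms, the sum of all bond orders is half the sum of the atomic valences, and N − 1 single bonds are needed just to connect the atoms, so any excess must appear as rings or as additional bond orders. The sketch below applies this count. The function name and the valence assignments (three bonds for a carbocation center, three for an oxonium oxygen) are assumptions, and C₃H₇O⁺ is used purely as an illustrative composition because the DGDE F1+ formula is not restated in this section.

```python
def rings_plus_pi_bonds(valences) -> int:
    """Rings plus extra bond orders required by a list of atomic valences.

    `valences` holds the number of bonds formed by every atom, hydrogens
    included. A double bond counts once and a triple bond counts twice.
    """
    total = sum(valences)
    if total % 2:
        raise ValueError("odd valence sum: these valences cannot all be satisfied")
    bond_order_sum = total // 2
    return bond_order_sum - (len(valences) - 1)


# Illustrative composition C3H7O+ with the charge placed on oxygen
# (oxonium oxygen forms three bonds): one ring or one double bond is required.
charge_on_oxygen = [4, 4, 4, 3] + [1] * 7

# Same composition with the charge on a carbon (carbocation forms three bonds):
# an acyclic, fully single-bonded structure becomes possible.
charge_on_carbon = [4, 4, 3, 2] + [1] * 7

print(rings_plus_pi_bonds(charge_on_oxygen))
print(rings_plus_pi_bonds(charge_on_carbon))
```

When the count is at least 1 for every admissible charge site of a composition, every generated structure must contain a ring or a multiple bond, which is the situation described above for F1+ of DGDE.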

This illustrates how the confluence of all parameters, particularly exact mass, which is an experimental value, provides theoretical evidence for the mandatory presence of double bonds within the ensemble of structures for F1+, assuming the elemental composition is accurately proposed. Notwithstanding, as previously discussed, all results from ForMileS necessitate meticulous examination within a physical and chemical context, as the current iteration of the program does not incorporate any energy calculation to filter the final results.

Conclusion

Drawing inspiration from extant molecular graph generator programs and theoretical analyses of MS² dissociation mechanisms, we introduce an innovative open-source Python-based workflow designed for generating molecular graphs within the context of MS². This workflow incorporates the well-established branch-and-bound (B&B) algorithm. Featuring a graphical user interface (GUI), ForMileS leverages parameters pertinent to both precursor and fragment ions to produce sets of fragment ion structures, a distinct focus from numerous programs that primarily emphasize precursor ion elucidation. Through its application to Polypropylene Glycol Octamer (PPG8) and Dipropylene Glycol Dimethyl Ether (DGDE) fragments, coupled with relative energy calculations, we demonstrate that the putative fragment structures are indeed present within the final generated set. Furthermore, the program’s capability to generate coordinate files facilitates its integration with computational chemistry routines, such as DFT and semiempirical calculations. The selection guidelines for the Base Molecular Scaffold (BMS) were also discussed, highlighting its relationship with precursor ions. For DGDE, the program even contributes to ascertaining the chemical significance of the requisite presence of double bonds or cyclic structures for potentially generated structures. Despite its efficacy in reducing computational time and facilitating application, BMS remains a purely hypothetical parameter that critically biases the ultimate results. Additionally, the program exhibits very limited performance compared to already published molecular graph generators and is currently restricted to small molecules, such as C₆O₃H₁₉, primarily due to the fully combinatorial charge generation step. Future endeavors will prioritize performance optimization and the implementation of additional filters to ensure chemical validation and pruning, leading to more physically and chemically meaningful final results.

Acknowledgments

We acknowledge the importance of the reviewers in strengthening this publication via a very productive discussion process. Also, we are thankful to Chemyunion LTDA for supporting and endorsing the development of this work. We also are grateful to UNICAMP Chemistry Institute Theoretical Chemistry Department for encouraging the work and publication.

Glossary

Abbreviations

MGG: Molecular Graph Generator
CASE: Computer Assisted Structure Elucidation
MS: Mass Spectrometry
MS2: Tandem Mass Spectrometry
PPG8: Polypropylene Glycol Octamer
DGDE: Dipropylene Glycol Dimethyl Ether
SMILES: Simplified Molecular Input Line Entry System
BMS: Base Molecular Scaffold

The ForMileS program is open-source, released under GNU GPL v3.0. Source code, examples, and tutorials presented in this study are openly available at https://github.com/Kuchenbecker/ForMileS, Repository ID 976333496 (repository metadata information). The data underlying this study are openly available at https://beta.mzcloud.org/DataViewer/app#/app under mzCloud IDs 17,722 (legacy ID 3032) and 5,419 (legacy ID 5474). A free Thermo Fisher account login is necessary to access the database.

The Article Processing Charge for the publication of this research was funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil (ROR identifier: 00x0ma614).

The authors declare no competing financial interest.

References

Djoumbou-Feunang Y., Pon A., Karu N., Zheng J., Li C., Arndt D., Gautam M., Allen F., Wishart D. S. CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification. Metabolites. 2019;9(4):72. doi: 10.3390/metabo9040072.
Carlo M. J., Nanney A. L. M., Patrick A. L. Energy-Resolved In-Source Collison-Induced Dissociation for Isomer Discrimination. J. Am. Soc. Mass Spectrom. 2024;35(11):2631–2641. doi: 10.1021/jasms.4c00118.
De Hoffmann, E. Mass Spectrometry: Principles and Applications, 3rd ed.; John Wiley & Sons, Inc.: New York, NY, 2013.
McQuarrie, D. A.; Simon, J. D. Physical Chemistry: A Molecular Approach; University Science Books: Sausalito, CA, 1997.
Steinfeld, J. I.; Francisco, J. S.; Hase, W. L. Chemical Kinetics and Dynamics; Prentice Hall: Upper Saddle River, NJ, 1999.
Håkansson, K.; Klassen, J. S. Ion Activation Methods for Tandem Mass Spectrometry. In Electrospray and MALDI Mass Spectrometry; Cole, R. B., Ed.; Wiley: Hoboken, NJ, 2010; pp 571–630. doi: 10.1002/9780470588901.ch16.
Asakawa D., Saikusa K. Fragmentation Efficiency of Phenethylamines in Electrospray Ionization Source Estimated by Theoretical Chemistry Calculation. Journal of Mass Spectrometry. 2022;57:e4802. doi: 10.1002/jms.4802.
Balaban A. T. Applications of Graph Theory in Chemistry. J. Chem. Inf. Comput. Sci. 1985;25(3):334–343. doi: 10.1021/ci00047a033.
Yirik, M. A.; Colpan, K. E.; Schmidt, S.; Sorokina, M.; Steinbeck, C. Review on Chemical Graph Theory and Its Application in Computer-Assisted Structure Elucidation. ChemRxiv, 2021. doi: 10.20944/preprints202111.0546.v1.
Yirik M. A., Steinbeck C. Chemical Graph Generators. PLoS Comput. Biol. 2021;17(1):e1008504. doi: 10.1371/journal.pcbi.1008504.
Molchanova M. S., Shcherbukhin V. V., Zefirov N. S. Computer Generation of Molecular Structures by the SMOG Program. J. Chem. Inf. Comput. Sci. 1996;36(4):888–899. doi: 10.1021/ci950393z.
Rieder S. R., Oliveira M. P., Riniker S., Hünenberger P. H. Development of an Open-Source Software for Isomer Enumeration. J. Cheminform. 2023;15(1):10. doi: 10.1186/s13321-022-00677-6.
Colbourn C. J., Read R. C. Orderly Algorithms for Generating Restricted Classes of Graphs. Journal of Graph Theory. 1979;3(2):187–195. doi: 10.1002/jgt.3190030210.
McKay B. D., Yirik M. A., Steinbeck C. Surge - A Fast Open-Source Chemical Graph Generator. J. Cheminform. 2022;14:24. doi: 10.1186/s13321-022-00604-9.
Yirik M. A., Sorokina M., Steinbeck C. MAYGEN: An Open-Source Chemical Structure Generator for Constitutional Isomers Based on the Orderly Generation Principle. J. Cheminform. 2021;13(1):48. doi: 10.1186/s13321-021-00529-9.
Korth M., Grimme S. “Mindless” DFT Benchmarking. J. Chem. Theory Comput. 2009;5(4):993–1003. doi: 10.1021/ct800511q.
Faulon J.-L. Stochastic Generator of Chemical Structure. 1. Application to the Structure Elucidation of Large Molecules. J. Chem. Inf. Comput. Sci. 1994;34(5):1204–1218. doi: 10.1021/ci00021a031.
Wang F., Liigand J., Tian S., Arndt D., Greiner R., Wishart D. S. CFM-ID 4.0: More Accurate ESI-MS/MS Spectral Prediction and Compound Identification. Anal. Chem. 2021;93(34):11692–11700. doi: 10.1021/acs.analchem.1c01465.
Ruttkies C., Schymanski E. L., Wolf S., Hollender J., Neumann S. MetFrag Relaunched: Incorporating Strategies beyond in Silico Fragmentation. J. Cheminform. 2016;8(1):3. doi: 10.1186/s13321-016-0115-9.
Zhou J., Weber R. J. M., Allwood J. W., Mistrik R., Zhu Z., Ji Z., Chen S., Dunn W. B., He S., Viant M. R. HAMMER: Automated Operation of Mass Frontier to Construct in Silico Mass Spectral Fragmentation Libraries. Bioinformatics. 2014;30(4):581–583. doi: 10.1093/bioinformatics/btt711.
Kostyukevich Y., Sosnin S., Osipenko S., Kovaleva O., Rumiantseva L., Kireev A., Zherebker A., Fedorov M., Nikolaev E. N. PyFragMS - A Web Tool for the Investigation of the Collision-Induced Fragmentation Pathways. ACS Omega. 2022;7(11):9710–9719. doi: 10.1021/acsomega.1c07272.
Benecke C., Gruner T., Kerber A., Laue R., Wieland T. MOLecular structure GENeration with MOLGEN, new features and future developments. Fresenius J. Anal Chem. 1997;359:23–32. doi: 10.1007/s002160050530.
Serov V. V., Elyashberg M. E., Gribov L. A. Mathematical Synthesis and Analysis of Molecular Structures. J. Mol. Struct. 1976;31(2):381–397. doi: 10.1016/0022-2860(76)80018-X.
Schneider N., Sayle R. A., Landrum G. A. Get Your Atoms in Order - An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm. J. Chem. Inf. Model. 2015;55(10):2111–2120. doi: 10.1021/acs.jcim.5b00543.
Weininger D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988;28(1):31–36. doi: 10.1021/ci00057a005.
Bogdanchikov A., Zhaparov M., Suliyev R. Python to Learn Programming. J. Phys.: Conf. Ser. 2013;423:012027. doi: 10.1088/1742-6596/423/1/012027.
Sieg J., Feldmann C. W., Hemmerich J., Stork C., Sandfort F., Eiden P., Mathea M. MolPipeline: A Python Package for Processing Molecules with RDKit in Scikit-Learn. J. Chem. Inf. Model. 2024;64(24):9027–9033. doi: 10.1021/acs.jcim.4c00863.
Weininger D., Weininger A., Weininger J. L. SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Model. 1989;29(2):97–101. doi: 10.1021/ci00062a008.
Cautereels J., Claeys M., Geldof D., Blockhuys F. Quantum Chemical Mass Spectrometry: Ab Initio Prediction of Electron Ionization Mass Spectra and Identification of New Fragmentation Pathways: Quantum Chemical Mass Spectrometry. J. Mass Spectrom. 2016;51(8):602–614. doi: 10.1002/jms.3791.
Neese F., Wennmohs F., Becker U., Riplinger C. The ORCA Quantum Chemistry Program Package. J. Chem. Phys. 2020;152(22):224108. doi: 10.1063/5.0004608.
Bannwarth C., Ehlert S., Grimme S. GFN2-xTB - An Accurate and Broadly Parametrized Self-Consistent Tight-Binding Quantum Chemical Method with Multipole Electrostatics and Density-Dependent Dispersion Contributions. J. Chem. Theory Comput. 2019;15(3):1652–1671. doi: 10.1021/acs.jctc.8b01176.
Becke A. D. Density-Functional Thermochemistry. III. The Role of Exact Exchange. J. Chem. Phys. 1993;98(7):5648–5652. doi: 10.1063/1.464913.
Lee C., Yang W., Parr R. G. Development of the Colle-Salvetti Correlation-Energy Formula into a Functional of the Electron Density. Phys. Rev. B. 1988;37(2):785–789. doi: 10.1103/PhysRevB.37.785.
Weigend F., Ahlrichs R. Balanced Basis Sets of Split Valence, Triple Zeta Valence and Quadruple Zeta Valence Quality for H to Rn: Design and Assessment of Accuracy. Phys. Chem. Chem. Phys. 2005;7(18):3297. doi: 10.1039/b508541a.
mzCloud - PPGn8 Advanced Mass Spectral Database. https://beta.mzcloud.org/DataViewer/app/dataviewer/library/reference/detail/0bbf564a-837c-41aa-becf-509bc7943d65/00559f18-7864-4aa2-b148-4661647e38d0 (accessed 2025-09-16).
mzCloud DataBase. ThermoScientific. https://mzcloud.org. https://beta.mzcloud.org/DataViewer/app/dataviewer/library/autoprocessed/detail/bc59c351-f48a-416c-be35-1994e9b2807d/dbe9b3ca-8fe4-4336-8c76-dd501ae5ecf3 (accessed 2025-05-16).
Peironcely J. E., Rojas-Chertó M., Fichera D., Reijmers T., Coulier L., Faulon J.-L., Hankemeier T. OMG: Open Molecule Generator. J. Cheminform. 2012;4(1):21. doi: 10.1186/1758-2946-4-21.
Wesdemiotis C., Solak N., Polce M. J., Dabney D. E., Chaicharoen K., Katzenmeyer B. C. Fragmentation Pathways of Polymer Ions. Mass Spectrom. Rev. 2011;30(4):523–559. doi: 10.1002/mas.20282.
Jackson A. T., Green M. R., Bateman R. H. Generation of End-group Information from Polyethers by Matrix-assisted Laser Desorption/Ionisation Collision-induced Dissociation Mass Spectrometry. Rapid Commun. Mass Spectrom. 2006;20(23):3542–3550. doi: 10.1002/rcm.2773.
Jackson A. T., Slade S. E., Thalassinos K., Scrivens J. H. End-Group Characterisation of Poly(Propylene Glycol)s by Means of Electrospray Ionisation-Tandem Mass Spectrometry (ESI-MS/MS). Anal Bioanal Chem. 2008;392(4):643–650. doi: 10.1007/s00216-008-2320-5.
Koopman J., Grimme S. Calculation of Electron Ionization Mass Spectra with Semiempirical GFNn-xTB Methods. ACS Omega. 2019;4(12):15120–15133. doi: 10.1021/acsomega.9b02011.
Grimme S., Antony J., Ehrlich S., Krieg H. A Consistent and Accurate Ab Initio Parametrization of Density Functional Dispersion Correction (DFT-D) for the 94 Elements H-Pu. J. Chem. Phys. 2010;132(15):154104. doi: 10.1063/1.3382344.
Clayden, J.; Greeves, N.; Warren, S. Organic Chemistry; OUP Oxford, 2012.
Kuki Á., Nagy L., Shemirani G., Memboeuf A., Drahos L., Vékey K., Zsuga M., Kéki S. A Simple Method to Estimate Relative Stabilities of Polyethers Cationized by Alkali Metal Ions. Rapid Commun. Mass Spectrom. 2012;26(3):304–308. doi: 10.1002/rcm.5307.
