Abstract
A comprehensive understanding of drug metabolism is crucial for advancements in drug development. Automation has improved various stages of this process, from compound procurement to data analysis, but significant challenges persist in the metabolite identification (MetID) of macromolecules due to their size, structural complexity, and associated computational demands. This study introduces new algorithms for automated Liquid Chromatography-High-Resolution Mass Spectrometry (LC-HRMS) data analysis applicable to macromolecules. A novel peak detection approach based on the most abundant mass (MaM) is presented and systematically compared with the monoisotopic mass (MiM) approach commonly used in small-molecule MetID. Additionally, three structure visualization strategies (expanded or atom-level, non-expanded or monomer-level, and a hybrid mode) are evaluated for their impact on data processing time and interpretability, based on their distinct fragmentation strategies. The workflow was validated using six diverse datasets, comprising linear and cyclic peptides and oligonucleotides with both natural and unnatural monomers, covering a molecular weight range of 700–7630 Da. A total of 970 metabolites were identified under various experimental and ionization conditions. The MaM algorithm yielded higher scores and a greater number of matches, giving greater confidence in the predicted metabolite structures, while the non-expanded visualization significantly reduced processing times (ranging from minutes to under an hour for most peptides). Furthermore, the visualization algorithm, which integrates monomer-level and atom/bond notation, enables clear localization of metabolic biotransformations. Compared to previous studies, the proposed workflow demonstrated reduced processing time, consistent detection of degradation products, and enhanced visualization capabilities, advancing automated MetID for macromolecules.
Introduction
An essential aspect of the drug development process is the comprehensive identification and characterization of the major metabolites of the drug candidate and the enzymes responsible for its metabolic transformation, commonly known as drug metabolism. These studies are crucial, as certain metabolites may exhibit superior potency or improved pharmacokinetic properties compared to the parent drugs, thereby enhancing therapeutic efficacy [1]. Conversely, some metabolites may be toxic or chemically reactive, potentially interfering with the metabolism of co-administered drugs and increasing the risk of drug-drug interactions [2, 3].
Therefore, MetID plays a vital role not only in guiding chemical modifications to improve metabolic stability and reduce toxicity, but also for informing clinical monitoring strategies and supporting personalized medicine approaches that aim to prevent adverse drug reactions. Collectively, these efforts are essential to the development of safe and effective therapeutic agents [4].
In recent years, the use of macromolecules such as peptides and oligonucleotides as therapeutic agents has grown rapidly in drug development, making MetID for these compounds increasingly important [5, 6]. However, MetID for macromolecules presents greater challenges than for small molecules, especially in data analysis and result interpretation. The large size and structural complexity of such compounds, which often consist of hundreds of atoms, lead to an exponential increase in the spectral signals that must be interpreted, a larger number of fragments to compute and compare, and difficulty in determining the specific location where the biotransformation has occurred [7]. Consequently, this complexity demands significantly more software processing time and memory.
Several tools have been developed for automated MetID, but they are primarily designed for small molecules and often struggle to process multiply charged states, which are prevalent in large biomolecules. Although some specialized approaches have been developed to address these challenges, they frequently suffer from high false-positive rates in complex biological matrices and may offer limited support for various ionization modes, as highlighted in this research [7].
In our previous publications [8, 9], we developed software solutions focused on automating data analysis, primarily for small molecules, with some applicability to macromolecules. These tools have accelerated both data processing and the review and visualization of results because they perform the following steps automatically: selecting the chromatographic peaks related to the compound of interest; retrieving the mass spectral information for each extracted peak; assigning potential structures by comparing the predicted theoretical fragmentation with the mass-to-charge ratio (m/z) values observed in the experimental spectra; and scoring potential solutions according to the fragments assigned to the spectra, either alone or by comparison with the parent fragmentation. After the results from different experimental conditions are clustered and consolidated into a single experimental entity, they are stored in the database. Once the review process is complete, a report is generated.
The primary aim of this article is to present novel algorithms and approaches for automated LC-HRMS data analysis that specifically address the challenges of MetID in macromolecules. One of the new approaches introduced in the automated workflow is a peak detection algorithm based on the MaM peak, an approach that, to our knowledge, has not been previously reported. To demonstrate its suitability for MetID of macromolecules, the algorithm is compared with the traditional MiM peak detection method, which has long been used in small-molecule MetID studies.
In addition, two visualization strategies for macromolecules are presented. In the expanded form, all atoms and intermonomer bonds are shown, whereas in the non-expanded form, the structure is represented by linking the monomer acronyms. These visualizations have direct implications for the computational process: in the non-expanded form, the structure is not subjected to virtual atom-level metabolite generation. Instead, biotransformations are applied at the monomer level, which reduces the number of potential fragments generated and leads to decreased processing time and memory consumption. The non-expanded approach is compared with the expanded one, and it is also shown how this representation can facilitate the identification of biotransformation sites.
These proposed approaches are integrated into a workflow that enables the interpretation of data acquired under diverse experimental conditions and ionization modes. To validate the applicability of this workflow, analysis was conducted on six datasets spanning a molecular weight range from 700 to 7630 Da. These datasets consist of both linear and cyclic peptides, incorporating natural and unnatural amino acids, as well as oligonucleotides. Specifically, dataset-1 comprises 9 commercially available peptides, dataset-2 includes one commercially available peptide and 4 synthetic analogues, dataset-3 involves a natural peptide hormone and 7 synthetic analogues, dataset-4 features an antisense oligonucleotide, dataset-5 contains 25 commercially available peptides, and dataset-6 is composed of a peptide hormone. Covering macromolecules of varying sizes and structural types, including linear and cyclic structures and non-standard monomers, these datasets demonstrate that the proposed methodology can be applied across a wide compound applicability domain.
Comparisons of the results obtained for certain compounds with those of prior studies have enabled an evaluation of several factors, such as the number and structure of identified metabolites, along with a consideration of the time consumed during the data processing step.
Materials and methods
Experimental data
For this study, six different experimental datasets (linear/cyclic, natural/unnatural amino acids, and an oligonucleotide dataset) were used for MetID, as shown in Table 1. The proteases and biological matrices used in the experimental incubations of these datasets represent key proteolytic environments that therapeutic peptides are likely to encounter in vivo. These include enzymes involved in gastrointestinal metabolism, where peptide hydrolysis primarily occurs, such as trypsin, chymotrypsin, elastase, and pepsin. The other proteases and matrices reflect metabolism in the liver, blood, and other physiological contexts, ensuring coverage of a broad range of relevant peptide degradation pathways [10].
Table 1. Summary of the number of compounds of each dataset, along with the molecular weight range of the compounds and the corresponding data acquisition mode. (DDA = data-dependent acquisition, DIA = data-independent acquisition).
| Dataset | Number of compounds | Molecular weight range (Da) | Data acquisition mode | Incubation conditions |
|---|---|---|---|---|
| Dataset-1 | 9 | 1282–3429 | DDA | Trypsin, Chymotrypsin, Pancreatic Elastase, and Pepsin |
| Dataset-2 | 5 | 3298–4184 | DDA | Dipeptidyl peptidase-4 (DPP-4) and neutral endopeptidase (NEP) |
| Dataset-3 | 8 | 1637–1679 | DDA and DIA | Human Serum |
| Dataset-4 | 1 | 7633 | DDA | Human Liver |
| Dataset-5 | 25 | 708–1900 | DIA | Human Cathepsin G, Human Neutrophil Elastase, Human MMP-12 catalytic domain, and Bovine pancreatic trypsin |
| Dataset-6 | 1 | 5808 | DIA | Insulin-degrading enzyme (IDE) |
The first set (dataset-1) is composed of nine commercially available peptides (secretin, calcitonin, oxytocin, octreotide, deslorelin, histrelin, goserelin, buserelin, and leuprolide), each of which was separately incubated with four selected protease enzymes: trypsin, chymotrypsin, pancreatic elastase, and pepsin. Data acquisition was performed using a Thermo Orbitrap® instrument in full scan mode with data-dependent tandem mass spectrometry (MS/MS). The detailed experimental conditions for this dataset are documented in the referenced bibliography [11]. Three of the compounds are cyclic peptides (octreotide, oxytocin, and calcitonin) and five contain unnatural amino acids (secretin, calcitonin, octreotide, deslorelin, and histrelin). Molecular weight ranges from 1282 to 3429 Da, as illustrated in Table 2.
Table 2. Dataset-1 sequence structures and their molecular weights.
| Compound name | Molecular weight (Da) | Sequence | Structure |
|---|---|---|---|
| Deslorelin | 1282.45 | H-Pyr-His-Trp-Ser-Tyr-D-Trp-Leu-Arg-Pro-NHEt | Linear |
| Goserelin | 1269.41 | Glp-His-Trp-Ser-Tyr-Ser-tBu-Leu-Arg-Pro-NHNHCONH2 | Linear |
| Buserelin | 1238.66 | Glp-His-Trp-Ser-Tyr-Ser-tBu-Leu-Arg-Pro-NHEt | Linear |
| Histrelin | 1323.5 | Glp-His-Trp-Ser-Tyr-HisBzl-Leu-Arg-Pro-NHEt | Linear |
| Leuprolide | 1209.4 | Glp-His-Trp-Ser-Tyr-D-Leu-Leu-Arg-Pro-NHEt | Linear |
| Secretin Human | 3039.41 | H-His-Ser-Asp-Gly-Thr-Phe-Thr-Ser-Glu-Leu-Ser-Arg-Leu-Arg-Glu-Gly-Ala-Arg-Leu-Gln-Arg-Leu-Leu-Gln-Gly-Leu-Val-NH2 | Linear |
| Octreotide | 1019.24 | H-D-Phe-Cys (1)-Phe-D-Trp-Lys-Thr-Cys (1)-Thr-ol | Cyclic |
| Oxytocin | 1007.19 | H-Cys (1)-Tyr-Ile-Gln-Asn-Cys (1)-Pro-Leu-Gly-NH2 | Cyclic |
| Calcitonin | 3429.71 | H-Cys (1)-Ser-Asn-Leu-Ser-Thr-Cys (1)-Val-Leu-Gly-Lys-Leu-Ser-Gln-Glu-Leu-His-Lys-Leu-Gln-Thr-Tyr-Pro-Arg-Thr-Asn-Thr-Gly-Ser-Gly-Thr-Pro-NH2 | Cyclic |
Dataset-2 consists of a commercially available peptide, glucagon-like peptide-1 (GLP-1), a 30 amino acid compound, and four synthetic analogues designed to have reduced susceptibility to enzymatic degradation: taspoglutide, exenatide, liraglutide, and semaglutide, all of them linear peptides. MetID was conducted in the presence of DPP-4 and NEP, as both enzymes are known to be involved in native GLP-1 degradation. Data acquisition employed a Thermo Orbitrap® instrument operating in full scan mode with data-dependent MS/MS, as detailed previously in the cited references [11]. The exception is semaglutide, which was incubated in dog plasma, with the two metabolites first synthesized and then spiked into the plasma; these data were collected using a Waters® ACQUITY® Ultra-Performance Liquid Chromatography system with a Vion Ion Mobility Spectrometry Quadrupole Time-of-Flight (IMS-QToF) Mass Spectrometer operated by UNIFI in data-independent mode, in collaboration with Zealand Pharma. Taspoglutide contains non-natural amino acids, and liraglutide carries a C-16 fatty acid side chain (palmitic acid). Molecular weights range from 3297 to 4184 Da, as presented in Table 3, with exenatide being the largest.
Table 3. Dataset-2: sequence structures and molecular weights of GLP-1 and its analogues.
| Compound name | Molecular weight (Da) | Sequence | Structure |
|---|---|---|---|
| GLP-1 | 3297.68 | H2N-His-Ala-Glu-Gly-Thr-Phe-Thr-Ser-Asp-Val-Ser-Ser-Tyr-Leu-Glu-Gly-Gln-Ala-Ala-Lys-Glu-Phe-Ile-Ala-Trp-Leu-Val-Lys-Gly-Arg-Gly-OH | Linear |
| Liraglutide | 3751.20 | H-His-Ala-Glu-Gly-Thr-Phe-Thr-Ser-Asp-Val-Ser-Ser-Tyr-Leu-Glu-Gly-Gln-Ala-Ala-Lys(γ-Glu-palmitoyl)-Glu-Phe-Ile-Ala-Trp-Leu-Val-Arg-Gly-Arg-Gly-OH | Linear |
| Taspoglutide | 3338.71 | H-His-Aib-Glu-Gly-Thr-Phe-Thr-Ser-Asp-Val-Ser-Ser-Tyr-Leu-Glu-Gly-Gln-Ala-Ala-Lys-Glu-Phe-Ile-Ala-Trp-Leu-Val-Lys-Aib-Arg-NH2 | Linear |
| Semaglutide | 4113.58 | H-His-Aib-Glu-Gly-Thr-Phe-Thr-Ser-Asp-Val-Ser-Ser-Tyr-Leu-Glu-Gly-Gln-Ala-Ala-Lys(γ-Glu-ADO-C18 di-acid)-Glu-Phe-Ile-Ala-Trp-Leu-Val-Arg-Gly-Arg-Gly-OH | Linear |
| Exenatide | 4184.03 | H-His-Gly-Glu-Gly-Thr-Phe-Thr-Ser-Asp-Leu-Ser-Lys-Gln-Met-Glu-Glu-Glu-Ala-Val-Arg-Leu-Phe-Ile-Glu-Trp-Leu-Lys-Asn-Gly-Gly-Pro-Ser-Ser-Gly-Ala-Pro-Pro-Pro-Ser-NH2 | Linear |
Dataset-3 includes somatostatin, a natural growth-inhibiting peptide hormone, along with seven 14-amino acid cyclic analogues. Data were collected in two acquisition modes: the first on a Thermo Q-Exactive® instrument in full scan mode with data-dependent MS/MS, and the second as High Definition MSE (HDMSE) data on a Vion IMS QTof Mass Spectrometer. The detailed experimental conditions for this dataset are documented in the referenced bibliography [11, 12]. The analogues were synthesized following a common approach in which some of the natural amino acids are substituted with non-natural or modified ones (Fig 1). Notably, these analogues feature the substitution of Phe (7) by Msa, which enhances rigidity owing to the ortho substitution, and of Trp (8) by D-Trp [13]. Additionally, various permutations involve substituting Ala (1), Cys (3), and Cys (14) with their D-amino acid equivalents, along with the substitution of Lys (4) by ornithine [13]. Molecular weights range from 1636 to 1678 Da (Table 4). Given the inherently low stability of somatostatin, a critical consideration for its pharmaceutical utility, there is great interest in evaluating whether these novel analogues (Table 4) exhibit prolonged lifetimes in human serum.
Fig 1. Structure of somatostatin and its seven modified analogues including unnatural amino acids.
All eight peptides exhibit a cyclic structure, closing through the disulfide bond (between monomer 3 and 14).
Table 4. Dataset-3 is composed of somatostatin and its seven modified analogues, with the corresponding molecular formulas and molecular weights.
| Compound name | Molecular formula | Monoisotopic mass (Da) | Structure |
|---|---|---|---|
| Somatostatin | C76H104N18O19S2 | 1636.7167 | Cyclic |
| Analogue 6 | C79H110N18O19S2 | 1678.7636 | Cyclic |
| Analogue 30 | C78H108N18O19S2 | 1664.7480 | Cyclic |
| Analogue 31 | C79H110N18O19S2 | 1678.7636 | Cyclic |
| Analogue 35 | C78H108N18O19S2 | 1664.7480 | Cyclic |
| Analogue 64 | C79H110N18O19S2 | 1678.7636 | Cyclic |
| Analogue 65 | C79H110N18O19S2 | 1678.7636 | Cyclic |
| Analogue 95 | C79H110N18O19S2 | 1678.7636 | Cyclic |
Dataset-4 includes an antisense oligonucleotide (ASO) with the formula C242H307N91O150P94 (molecular weight of 7633 Da) containing 25 monomers. ASOs are synthetic, small single-stranded nucleic acids. Data were collected using a Thermo Orbitrap® instrument in DDA mode. This dataset pertains to the incubation of the ASO in human liver tissue, a commonly studied experimental condition [14]. It enables researchers to evaluate the efficacy and selectivity of ASOs in targeting specific messenger RNA molecules within the complex environment of the liver.
In this study, dataset-5 comprises a collection of 25 structurally diverse linear and cyclic peptides, with molecular weights ranging from 708 to 1900 Da (atosiban, BIO-11006, BIO-1211, carbetocin, CSP7, deslorelin, desmopressin, felypressin, gonadorelin, iseganan, lanreotide, LDTRYLEQLHKLY, leuprolide, lypressin, M10 peptide, MMI-0100, NAS-911, octreotide, peptide T, salmon calcitonin, somatostatin, SPX-101, triptorelin, vasopressin, and vapreotide), as listed in Table 5. These compounds were incubated with four pulmonary proteases (human cathepsin G, human neutrophil elastase, human MMP-12 catalytic domain, and bovine pancreatic trypsin). Data are unavailable for the bovine pancreatic trypsin incubation of felypressin, iseganan, LDTRYLEQLHKLY, lypressin, MMI-0100, and vasopressin, and data are also unavailable for the human cathepsin G incubation of atosiban, lanreotide, and leuprolide. Data acquisition was performed using a Waters® Q-TOF instrument in data-independent mode. The data were originally used to develop an assay workflow aimed at guiding the initial chemical modifications of peptide hits in early respiratory drug discovery projects. The detailed experimental conditions for this dataset are documented in the referenced bibliography [15]. That workflow utilizes WebMetabase to detect and elucidate the structures of metabolites formed through enzymatic proteolysis. In this study, the data are used to compare the earlier results with those obtained through the new approach and to highlight the improvement in data processing time achieved with this workflow.
Table 5. Dataset-5, composed of 25 peptides, with the corresponding sequence structures and molecular weights.
| Compound name | Molecular weight (Da) | Sequence | Structure |
|---|---|---|---|
| BIO-1211 | 708.8 | 4-[(2-tolyl)-urea]-phenylacetyl-Leu-Asp-Val-Pro-OH | Linear |
| CSP7 | 815.92 | H-Phe-Thr-Thr-Phe-Thr-Val-Thr-OH | Linear |
| Peptide T | 857.87 | H-Ala-Ser-Thr-Thr-Thr-Asn-Tyr-Thr-OH | Linear |
| BIO-11006 | 1050.18 | Ac-Gly-Ala-Gln-Phe-Ser-Lys-Thr-Ala-Ala-Lys-OH | Linear |
| SPX-101 | 1179.38 | H-D-Ala-D-Ala-Leu-Pro-Ile-Pro-Leu-Asp-Glu-Thr-D-Ala-D-Ala-OH | Linear |
| M10 peptide | 1181.27 | H-Thr-Arg-Pro-Ala-Ser-Phe-Trp-Glu-Thr-Ser-OH | Linear |
| Gonadorelin | 1182.31 | H-Pyr-His-Trp-Ser-Tyr-Gly-Leu-Arg-Pro-Gly-NH2 | Linear |
| Leuprolide | 1209.42 | H-Pyr-His-Trp-Ser-Tyr-D-Leu-Leu-Arg-Pro-NHEt | Linear |
| Deslorelin | 1282.48 | H-Pyr-His-Trp-Ser-Tyr-D-Trp-Leu-Arg-Pro-NHEt | Linear |
| Triptorelin | 1311.47 | H-Pyr-His-Trp-Ser-Tyr-D-Trp-Leu-Arg-Pro-Gly-OH | Linear |
| NAS-911 | 1393.68 | H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Sar-Leu-Met(O2)-NH2 | Linear |
| LDTRYLEQLHKLY | 1691.95 | H-Leu-Asp-Thr-Arg-Tyr-Leu-Glu-Gln-Leu-His-Lys-Leu-Tyr-OH | Linear |
| MMI-0100 | 2283.68 | H-Tyr-Ala-Arg-Ala-Ala-Ala-Arg-Gln-Ala-Arg-Ala-Lys-Ala-Leu-Ala-Arg-Gln-Leu-Gly-Val-Ala-Ala-OH | Linear |
| Salmon Calcitonin | 3431.89 | H-Cys (1)-Ser-Asn-Leu-Ser-Thr-Cys (1)-Val-Leu-Gly-Lys-Leu-Ser-Gln-Glu-Leu-His-Lys-Leu-Gln-Thr-Tyr-Pro-Arg-Thr-Asn-Thr-Gly-Ser-Gly-Thr-Pro-NH2 | Linear |
| Carbetocin | 988.17 | deamino-Cys (1)-Tyr(Me)-Ile-Gln-Asn-Cys (1)-Pro-Leu-Gly-NH2 | Cyclic |
| Atosiban | 994.19 | deamino-Cys (1)-D-Tyr(Et)-Ile-Thr-Asn-Cys (1)-Pro-Orn-Gly-NH2 | Cyclic |
| Octreotide | 1019.25 | H-D-Phe-Cys (1)-Phe-D-Trp-Lys-Thr-Cys (1)-Thr-ol | Cyclic |
| Felypressin | 1040.23 | H-Cys (1)-Phe-Phe-Gln-Asn-Cys (1)-Pro-Lys-Gly-NH2 | Cyclic |
| Lypressin | 1056.23 | H-Cys (1)-Tyr-Phe-Gln-Asn-Cys (1)-Pro-Lys-Gly-NH2 | Cyclic |
| Desmopressin | 1069.22 | deamino-Cys (1)-Tyr-Phe-Gln-Asn-Cys (1)-Pro-D-Arg-Gly-NH2 | Cyclic |
| Vasopressin | 1084.24 | H-Cys (1)-Tyr-Phe-Gln-Asn-Cys (1)-Pro-Arg-Gly-NH2 | Cyclic |
| Lanreotide | 1096.33 | H-D-2Nal-Cys (1)-Tyr-D-Trp-Lys-Val-Cys (1)-Thr-NH2 | Cyclic |
| Vapreotide | 1131.38 | H-D-Phe-Cys (1)-Tyr-D-Trp-Lys-Val-Cys (1)-Trp-NH2 | Cyclic |
| Somatostatin | 1637.88 | H-Ala-Gly-Cys (1)-Lys-Asn-Phe-Phe-Trp-Lys-Thr-Phe-Thr-Ser-Cys (1)-OH | Cyclic |
| Iseganan | 1900.28 | H-Arg-Gly-Gly-Leu-Cys (1)-Tyr-Cys (2)-Arg-Gly-Arg-Phe-Cys (2)-Val-Cys (1)-Val-Gly-Arg-NH2 | Cyclic |
Dataset-6 comprises human insulin, a peptide hormone containing three disulfide bridges, one of which is located within Chain A, while the other two covalently connect Chain A to Chain B (Fig 2). Data were collected using a Waters® QTOF instrument. Insulin was analyzed following incubation with IDE, a protease widely recognized for its pivotal role in degrading and inactivating insulin. The detailed experimental conditions for this dataset are documented in the referenced bibliography [16].
Fig 2. Insulin structure with the linear visualization.
The structure of insulin consists of two peptide chains known as Chain A, comprising 21 amino acids (numbered 1–21), and Chain B, comprising 30 amino acids (numbered 22–51). The A and B chains are interconnected by two disulfide bonds (highlighted in pink and light blue), and an additional disulfide bond is formed within the A Chain (highlighted in purple).
Data preprocessing
The MassMetaSite procedure consists of three steps: (a) data reading, (b) automatic detection of the chromatographic peaks related to the parent compound and its metabolites, and (c) structure elucidation by proposing a potential metabolite structure based on the fragmentation pattern for each peak detected in the previous step.
a) Data reading. Three different acquisition files need to be defined, depending on the data. Firstly, a blank file is employed to distinguish relevant signals from background peaks. This file is crucial for investigating whether a detected peak in the incubation file is attributable to the compound of interest or was already present in the incubation matrix (blank sample). Secondly, a substrate file is utilized to analyze the fragmentation pattern of the substrate. This step is essential in the structure elucidation process, which involves comparing the fragments assigned to the spectra of the parent compound with the spectra of potential metabolites. Lastly, the incubation file contains all the products after incubation, either in vitro or in vivo, and is used to investigate and identify the metabolites formed during the incubation process.
b) Automatic detection of the chromatographic peaks. During the automated chromatographic peak detection stage, an initial spectral noise analysis is conducted. For each full scan (intensity vs. m/z), a noise level is computed by calculating the change in slope between two consecutive shortlists of ions present in the full scan, and ions below this threshold are systematically eliminated. Subsequently, the list of ions is examined across chromatographic retention times. Ions are selected based on specific m/z values to precisely determine the presence or absence of peak formation.
Following the identification of a potential peak in the incubation sample, a background analysis is performed. Specifically, for the selected m/z and retention time of the potential peak, a search is conducted to verify the presence of the peak in the blank sample. If the peak is detected in the blank, a peak alignment optimization is initiated using a combination of Hodgkin and Pearson similarity indexes computation, which allows a comprehensive comparison of both shape and peak intensity. The sample peak is excluded from the analysis whenever it exhibits similar shape and equivalent (or lower) intensity to the blank peak. The Negative Control Area Ratio is then computed, representing the quantitative ratio between the peak area in the incubation sample and the corresponding in the blank.
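The blank comparison described above can be illustrated with a minimal sketch. It uses the standard definitions of the Hodgkin index, 2Σab / (Σa² + Σb²), and the Pearson correlation coefficient on two extracted ion traces sampled at the same retention times. How the software actually combines the two indices is not specified here, so the decision rule, the threshold values, and all function names below are assumptions made for illustration only.

```python
import math

def hodgkin_index(a, b):
    """Hodgkin similarity: sensitive to both shape and magnitude of the traces."""
    num = 2.0 * sum(x * y for x, y in zip(a, b))
    den = sum(x * x for x in a) + sum(y * y for y in b)
    return num / den if den else 0.0

def pearson_index(a, b):
    """Pearson correlation: sensitive to shape only, not to absolute intensity."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def area(trace):
    """Peak area as a plain sum (unit scan spacing assumed)."""
    return sum(trace)

def is_background(sample, blank, shape_min=0.95, magnitude_min=0.9):
    """Hypothetical rule: similar shape AND equivalent-or-lower intensity."""
    similar_shape = pearson_index(sample, blank) >= shape_min
    not_more_intense = (area(sample) <= area(blank)
                        or hodgkin_index(sample, blank) >= magnitude_min)
    return similar_shape and not_more_intense

def negative_control_area_ratio(sample, blank):
    """Ratio between the peak area in the incubation sample and in the blank."""
    blank_area = area(blank)
    return area(sample) / blank_area if blank_area else float("inf")

blank = [0, 1, 4, 9, 4, 1, 0]
sample = [0, 10, 40, 90, 40, 10, 0]   # same shape, ten times more intense
print(is_background(sample, blank), negative_control_area_ratio(sample, blank))
```

In this toy case the sample peak has the same shape as the blank peak but a much larger area, so it is retained and reported with an area ratio of 10.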
Subsequently, a filtered spectrum is computed by merging all the scans within the peak retention time range. This involves the selection of m/z values that exhibit correlation within the chromatographic peak shape. Each m/z value of each filtered spectrum is compared with any of the m/z values for the metabolites of the parent compound. There are two potential options to represent the theoretical m/z of the compound of the peak under consideration: the monoisotopic or the most abundant isotope species. Additionally, the isotope pattern derived from the metabolite formula is compared to the one from the experimental spectra and a filter may be set to consider the similarity between the observed and predicted intensity for each potential isotope. In addition, m/z values from multiple charge states were also used in the analysis.
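As an illustration of how multiple charge states enter the comparison, the sketch below computes the theoretical m/z of protonated ions [M + zH]z+ from a neutral mass and matches them against observed values within a ppm tolerance. It covers positive mode only; the function names, the tolerance, and the observed m/z list are hypothetical and are not taken from the software.

```python
PROTON = 1.007276466  # mass of a proton in Da

def mz_from_mass(neutral_mass, charge):
    """Theoretical m/z of the [M + zH]z+ ion for a neutral mass in Da."""
    return (neutral_mass + charge * PROTON) / charge

def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def match_charge_states(observed_mzs, neutral_mass, max_charge=4, tol_ppm=5.0):
    """Return (charge, observed m/z, ppm error) for every value within tolerance."""
    hits = []
    for z in range(1, max_charge + 1):
        theoretical = mz_from_mass(neutral_mass, z)
        for mz in observed_mzs:
            err = ppm_error(mz, theoretical)
            if abs(err) <= tol_ppm:
                hits.append((z, mz, round(err, 2)))
    return hits

# Somatostatin monoisotopic mass from Table 4; the observed values are invented.
print(match_charge_states([819.3660, 546.5800, 500.0], 1636.7167))
```

For negative mode the same idea applies with protons removed rather than added.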
For each selected m/z value extracted from the filtered spectra, a comprehensive metabolite classification is conducted. This classification categorizes metabolites into distinct groups, including first-generation metabolites, second or higher generation metabolites, metabolites stemming from biotransformations unrecognized by the software (referred to as “red peaks” denoting unknowns), and cases where the fragment ion may arise from ion adduct formation or in-source neutral loss.
Finally, an MS/MS evaluation is conducted to examine whether m/z values observed in the parent spectrum are also present in the potential peak. The evaluation accounts for the shift expected from the obtained formula: a fragment is classified as non-shifted when the same m/z observed in the parent spectrum is also observed in the metabolite, and as shifted when its m/z in the filtered spectrum differs from the corresponding m/z in the parent spectrum.
The m/z values are scored according to multiple criteria: isotope similarity, retention time, MS/MS comparison and calculated m/z. Among all the values above the score threshold, the m/z that will represent the peak in the chromatogram is the one with the highest m/z value. This process results in a compiled list of peaks, each associated with an assigned m/z, retention time range, area, full scan filtered spectra, and MS/MS spectra.
c) Structure elucidation. The third stage of data processing is structure elucidation (Fig 3), during which the fragment ions obtained from the parent and those from the metabolite are compared.
Fig 3. The third step of the MassMetaSite procedure: structure elucidation.
This process has two starting points: the parent structure, or the metabolite structure obtained by virtual synthesis:
- Identification of metabolite fragments from fragmentation of the parent
  - Parent fragmentation: During this process, the parent molecule is fragmented, and the m/z values of the fragments are computed. There could be more than one m/z value for a single fragment due to potential hydrogen rearrangements. Fragment structures are then associated with the spectral m/z values within a user-specified tolerance.
  - Generation of metabolite fragments: Metabolite fragments are built from parent fragments using the atom map between metabolite and parent. The m/z of the resulting metabolite fragments may be shifted or equal to the parent m/z, depending on whether or not the fragment contains sites of metabolism [17].
  - Association between parent peaks and metabolite peaks: For each parent spectrum, whether MS or MS/MS, the software checks whether there are peaks with the same or shifted m/z in the associated metabolite spectrum. A shifted m/z is equal to the m/z of the parent plus the change in m/z due to the chemical modifications introduced during metabolism. This results in Substrate-Metabolite peak pairs that can be used for structural identification.
  - Matches: When substrate and metabolite fragments are identical and both peaks of the Substrate-Metabolite fragment pair have the same m/z value, the observed and calculated interpretations match. Likewise, when the metabolite fragment is different from the substrate fragment and the Substrate-Metabolite fragment pair has a shifted mass, the interpretations also match [18].
  - Mismatches: A fragment is a mismatch when the m/z is observed as non-shifted between the parent and metabolite spectra, but the atom set of the fragment corresponds to a chemical modification that would change the m/z. Similarly, a mismatch is detected when the m/z is observed as shifted between the parent and the metabolite spectra, but the atom set of the fragment corresponds to a modification that would not change the m/z of the fragment [18].
- Identification of metabolite fragments from the structure of the metabolite: Virtual fragments of the metabolites are generated based on a predefined list of metabolic biotransformation reactions [19].
  - Fragmentation of the metabolite: This is the same as the parent fragmentation, but the number of bonds that can be cut is usually lower, since fragmenting all the possible metabolites has a greater computational cost.
  - Metmatches: The fragments obtained in this way and assigned to the metabolite spectra are called metmatches. This fragmentation strategy is particularly beneficial for cyclic peptides, where the metabolite might be a linear peptide due to amide hydrolysis-induced ring opening, leading to a markedly different fragmentation pattern compared to the parent.
Scoring is done by summing the intensities of the matching peaks and of the metmatching peaks, and subtracting the sum of the intensities of the mismatching peaks. The solutions with the highest score are automatically selected by the system and reported as potential structural candidates [18].
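The match/mismatch logic and the scoring rule described above can be sketched as follows. A Substrate-Metabolite fragment pair is treated as a match when the observed behaviour (shifted or non-shifted m/z) agrees with what the proposed modification site predicts for that fragment, and as a mismatch otherwise. The class, field, and function names are hypothetical; this is a simplified reading of the description, not the MassMetaSite implementation.

```python
from dataclasses import dataclass

@dataclass
class FragmentPair:
    intensity: float             # intensity of the metabolite peak in the pair
    observed_shifted: bool       # m/z differs between parent and metabolite spectra
    contains_modification: bool  # fragment atom set includes the proposed site

def classify(pair):
    """Match when observation and prediction agree, mismatch otherwise."""
    agrees = pair.observed_shifted == pair.contains_modification
    return "match" if agrees else "mismatch"

def score(pairs, metmatch_intensities=()):
    """Sum of match and metmatch intensities minus sum of mismatch intensities."""
    total = float(sum(metmatch_intensities))
    for pair in pairs:
        if classify(pair) == "match":
            total += pair.intensity
        else:
            total -= pair.intensity
    return total

candidate = [
    FragmentPair(100.0, observed_shifted=True, contains_modification=True),    # match
    FragmentPair(50.0, observed_shifted=False, contains_modification=False),   # match
    FragmentPair(30.0, observed_shifted=True, contains_modification=False),    # mismatch
]
print(score(candidate, metmatch_intensities=[20.0]))  # 100 + 50 + 20 - 30 = 140.0
```

Evaluating this score for each candidate modification site and keeping the highest-scoring candidates mirrors the automatic selection step described in the text.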
Each experiment consisted of a set of samples, i.e., one sample per incubation time point per matrix. MassMetaSite processes each sample as a separate entity and thus generates three main pieces of information for each sample: the metabolic scheme, the spectrometry data (product ion assignment), and the outcomes (retention time, MS area, MS relative area, collision cross section, and parts per million (ppm) mass error) for each component found. WebMetabase then consolidates all these data from the individual files into a single interpretation for the entire experiment (time/matrix) and analyses which metabolite peaks from each sample can be clustered based on their retention time and m/z.
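The consolidation step can be pictured as grouping peaks from different samples whose retention time and m/z agree within tolerances. The sketch below uses a simple greedy grouping with running-mean cluster centres; the algorithm, the tolerance values, and the peak tuples are assumptions for illustration and do not describe how WebMetabase performs the clustering.

```python
def cluster_peaks(peaks, rt_tol=0.1, mz_tol=0.01):
    """Group (sample, rt_minutes, mz) peaks that agree in retention time and m/z."""
    clusters = []
    for sample, rt, mz in sorted(peaks, key=lambda p: (p[2], p[1])):
        for cluster in clusters:
            if abs(cluster["rt"] - rt) <= rt_tol and abs(cluster["mz"] - mz) <= mz_tol:
                cluster["members"].append((sample, rt, mz))
                n = len(cluster["members"])
                # Update the cluster centre as a running mean of its members.
                cluster["rt"] += (rt - cluster["rt"]) / n
                cluster["mz"] += (mz - cluster["mz"]) / n
                break
        else:
            clusters.append({"rt": rt, "mz": mz, "members": [(sample, rt, mz)]})
    return clusters

peaks = [
    ("t0", 5.02, 819.366),
    ("t30", 5.05, 819.367),
    ("t30", 7.40, 546.580),
    ("t60", 5.00, 819.365),
]
for cluster in cluster_peaks(peaks):
    print(round(cluster["rt"], 2), round(cluster["mz"], 3), len(cluster["members"]))
```

Here the three peaks near 5.0 min and m/z 819.37 are merged into one component observed at three time points, while the peak at 7.40 min remains a separate component.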
Settings/Structure visualization
In this study, data have been processed with distinct algorithms, establishing the groundwork for a comprehensive comparison among them. This research is focused on three crucial dimensions:
-Peak detection (Monoisotopic Mass and Most Abundant Mass). Different peak detection algorithms are employed depending on molecular size. The MiM corresponds to the isotopic peak with the lowest mass-to-charge ratio (m/z) and is calculated using the lightest isotope mass of each element present in the molecule. It is particularly useful for accurately determining the molecular formula, especially for smaller molecules [20]. Conversely, the MaM corresponds to the most abundant isotopic species of the molecule, taking into account the natural abundance of all isotopes of its elements, not just the lightest ones.
For larger molecules or when the monoisotopic ion is undetectable, the MaM is employed for peak detection. This choice is made because, with increasing molecular size, the heightened probability of the entire molecule containing at least one heavy isotope atom (mainly 13C) becomes more pronounced. Consequently, the MiM peak may be much more difficult to detect than the MaM peak. In addition, MaM peaks are typically the ones which are selected for triggering MSMS scans in DDA when no preferred list is provided to the acquisition software.
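The size dependence described above can be reproduced with a small isotope-pattern calculation. The sketch below convolves natural isotope abundances atom by atom, bins the result by nominal mass offset from the monoisotopic peak, and reports the MiM, the mean mass of the most abundant bin (used here as the MaM), and the offset of that bin. The isotope masses and abundances are rounded standard values, the nominal-mass binning is a simplification, and the insulin formula (C257H383N65O77S6) is the commonly reported one rather than a value taken from this study.

```python
# (mass in Da, natural abundance) per isotope, lightest isotope first.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548, 0.0107)],
    "H": [(1.00782503, 0.999885), (2.01410178, 0.000115)],
    "N": [(14.0030740, 0.99636), (15.0001089, 0.00364)],
    "O": [(15.9949146, 0.99757), (16.9991317, 0.00038), (17.9991596, 0.00205)],
    "S": [(31.9720707, 0.9499), (32.9714585, 0.0075),
          (33.9678669, 0.0425), (35.9670808, 0.0001)],
}

def mim_and_mam(formula, min_prob=1e-12):
    """Return (MiM, MaM, nominal offset of the most abundant isotopic peak)."""
    dist = {0: (1.0, 0.0)}  # offset -> (probability, probability-weighted mass)
    for element, count in formula.items():
        isotopes = ISOTOPES[element]
        base = round(isotopes[0][0])
        for _ in range(count):
            new = {}
            for off, (p, pm) in dist.items():
                for mass, abundance in isotopes:
                    key = off + round(mass) - base
                    q, qm = new.get(key, (0.0, 0.0))
                    new[key] = (q + p * abundance, qm + (pm + p * mass) * abundance)
            # Drop negligible bins to keep the distribution small.
            dist = {k: v for k, v in new.items() if v[0] >= min_prob}
    mim = sum(ISOTOPES[e][0][0] * n for e, n in formula.items())
    best = max(dist, key=lambda k: dist[k][0])
    prob, weighted_mass = dist[best]
    return mim, weighted_mass / prob, best

somatostatin = {"C": 76, "H": 104, "N": 18, "O": 19, "S": 2}   # Table 4
insulin = {"C": 257, "H": 383, "N": 65, "O": 77, "S": 6}       # assumed formula
print(mim_and_mam(somatostatin))
print(mim_and_mam(insulin))
```

With these abundances, the monoisotopic peak is still the most abundant one for somatostatin, whereas for insulin the most abundant peak lies several nominal mass units above the monoisotopic peak, which is the situation in which MaM-based detection is expected to help.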
In this study all datasets have been processed with both the MiM and MaM algorithms, except dataset-5 that has been exclusively subjected to processing with the MaM settings.
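The size dependence described above can be illustrated with a simplified calculation. The sketch below is not the MaM algorithm implemented in the software. It assumes that only carbon contributes heavy isotopes, uses an approximate 13C natural abundance of 1.07%, and models the number of 13C atoms per molecule as a binomial distribution. The carbon counts are hypothetical examples.

```python
from math import comb

P13C = 0.0107  # approximate natural abundance of carbon-13 (assumed value)


def c13_distribution(n_carbons):
    """Probability of finding k carbon-13 atoms, for k = 0..n_carbons.

    Simplification: only carbon isotopes are considered (binomial model).
    """
    return [
        comb(n_carbons, k) * P13C**k * (1 - P13C) ** (n_carbons - k)
        for k in range(n_carbons + 1)
    ]


def most_abundant_shift(n_carbons):
    """Number of 13C atoms in the most abundant isotopologue."""
    dist = c13_distribution(n_carbons)
    return max(range(len(dist)), key=dist.__getitem__)


def monoisotopic_fraction(n_carbons):
    """Fraction of molecules that contain no 13C atom at all."""
    return c13_distribution(n_carbons)[0]


if __name__ == "__main__":
    # Hypothetical carbon counts for a small, a medium and a large molecule.
    for n in (40, 150, 250):
        print(n, most_abundant_shift(n), round(monoisotopic_fraction(n), 3))
```

Under these assumptions the monoisotopic peak is the most abundant one for the 40-carbon example, whereas for the 150- and 250-carbon examples the most abundant isotopologue carries one and two 13C atoms, respectively, and the monoisotopic fraction shrinks accordingly.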
-Acquisition modes (Data-dependent acquisition and Data-independent acquisition). LC-HRMS is the preferred method for MetID, with DDA being a commonly used strategy for MS data acquisition. In DDA, precursor ions are selected based on their abundance to trigger MS/MS. In contrast, DIA methods, such as MSE and HDMSE, eliminate the risk of overlooking metabolites by avoiding precursor ion selection [21]. HDMSE is a DIA method that combines ion mobility separation with MSE data acquisition. It alternates between low and high collision energy ion mobility spectrometry-mass spectrometry scans, enabling accurate mass measurements of both precursor and product ions simultaneously. In contrast to DDA, where a specific m/z must be isolated before fragmentation, DIA provides more complex but more complete datasets.
Data from dataset-3 was acquired employing the two predetermined strategies, DIA and DDA, facilitating a comparison of outcomes obtained from both acquisition modes. Settings used for the processing of DIA (MSE/HDMSE) and DDA data for somatostatin synthetic analogues are presented in S6-S9 Files.
-Structure visualization (Expanded and non-expanded). Two visualization options are available for representing the structure of polymeric compounds like peptides or oligonucleotides during data analysis. Monomers of the compound can be depicted either in an expanded form, revealing all atoms and intermonomer bonds, or in a non-expanded form, where the structure is represented by linking the monomer acronyms. In this study, dataset-4 was processed using both visualization options, enabling a comparative analysis of processing time and providing an illustrative example of how metabolite structures are visualized after metabolic reactions using both approaches.
The choice between expanded and non-expanded monomers affects structure visualization. The non-expanded mode shows the monomer symbol, making it simpler for the user to identify the structure and the site of the biotransformation, and it is therefore the recommended option. However, the choice also has implications for the computation. The part of the structure represented as monomers does not undergo virtual metabolite structure generation: the biotransformation is applied at the monomer level rather than at the atomic level, so the resulting compound is not a fully defined chemical structure, since the exact structure obtained after the reaction is not specified. The part of the structure represented as atoms/bonds undergoes a typical virtual reaction, and a defined chemical structure is obtained for each potential metabolite. Fewer chemical structures therefore need to be constructed for the monomer-represented part of the molecule, which reduces computation time. In addition, only the typical a, b, c and x, y, z fragmentation is considered for the part of the molecule treated as monomers, which reduces the number of potential fragments generated and thus decreases time and memory consumption. For the rest of the molecule, treated as atoms/bonds, all bonds are disconnected to generate fragments, which yields a larger number of fragments.
Furthermore, there exists the option to work with a combination of both visualizations within the molecular structure. This can be achieved by selectively choosing which segments of the molecule to expand or maintain in a non-expanded state.
Data analysis
Following data consolidation, manual data interpretation by the user is conducted for peak selection and structure elucidation steps, applying diverse data analysis criteria to systematically eliminate any potential false positive metabolites. These criteria are:
Peak selection
MS area (%): Reporting with a relative area above 0.5%.
Difference between observed and calculated m/z (amu, ppm): For the MS signal, the system computes the difference between the observed and the computed m/z. The observed m/z is derived from the m/z values found in the different scans across the peak, which, in comparison with the value reported by the vendor software package, takes into account effects such as peak saturation and loss of accuracy at the top of the peak. A difference of less than 10 ppm between observed and computed values is required [22].
Value of isotopic all similarity: Quantifying the match between observed and expected isotopic patterns for peaks, where a low value suggests pattern variability.
Negative control area ratio: Establishing the ratio between peak areas in the incubation sample and the blank, with a signal observed in both considered non-specific.
Kinetics: Reflects changes in metabolite abundance over time. At time 0 (t = 0), when the incubation begins, the cluster chart should show only ions related to the parent compound. There should be no signals corresponding to metabolites at this point, as no biotransformation has occurred yet. A first-generation metabolite usually shows an exponential profile, as it is just starting to be formed. If a metabolite is further metabolized, its signal will decrease as it is consumed to generate a second-generation metabolite. Typically, a second-generation metabolite shows a sigmoidal profile, since the first-generation metabolite must be formed before it can be further metabolized [11].
Shape of the metabolite peak: Ideally, metabolite peaks should exhibit a Gaussian shape; however, in practice, peak tails may occasionally occur. It is important to distinguish these from peaks that resemble background noise or exhibit irregular shapes, such as broad or asymmetric profiles, which may suggest contamination or interference rather than the presence of a true metabolite [23].
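The numerical criteria in this list lend themselves to a simple filter. The sketch below covers only the thresholds stated above (relative area above 0.5%, mass error below 10 ppm, no signal in the blank). Isotopic similarity, kinetics and peak shape are left to manual review. The function names and dictionary keys are illustrative assumptions, not part of WebMetabase.

```python
def ppm_error(observed_mz, calculated_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - calculated_mz) / calculated_mz * 1e6


def passes_peak_selection(peak, min_rel_area=0.5, max_ppm=10.0):
    """Apply the numeric peak selection criteria to one candidate peak.

    peak: dict with the (hypothetical) keys 'rel_area_percent',
    'observed_mz', 'calculated_mz' and 'in_blank'.
    """
    if peak["rel_area_percent"] <= min_rel_area:
        return False  # below the 0.5 % relative area reporting threshold
    if abs(ppm_error(peak["observed_mz"], peak["calculated_mz"])) >= max_ppm:
        return False  # mass error of 10 ppm or more
    if peak["in_blank"]:
        return False  # also present in the negative control: non-specific
    return True


# Hypothetical candidate peak with a 5 ppm mass error.
candidate = {
    "rel_area_percent": 2.3,
    "observed_mz": 1000.005,
    "calculated_mz": 1000.000,
    "in_blank": False,
}
accepted = passes_peak_selection(candidate)
```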
Structure elucidation.
The second step of the algorithm proposes potential metabolite structures based on the fragmentation pattern of each peak detected in the peak selection step. Fig 4 illustrates the MS spectra data interpretation window, highlighting the analysis of fragment structures used to generate the score, including the count of matches and mismatches.
Fig 4. Fragmentation pattern for the M2-38 metabolite of oxytocin after 120 minutes of incubation with chymotrypsin.
On the left, full scan/data-dependent MS/MS spectra for oxytocin and M2-38 are presented, while on the right, a subset of fragment structures derived from the selected matched peaks is displayed.
This window allows for a comparison to determine if the metabolites exhibit a similar fragmentation pattern compared to the substrate fragmentation. Metabolite fragment ions may either share the same m/z as a parent fragment ion (non-shifted ion) or exhibit a defined mass shift (shifted ion).
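The distinction between shifted and non-shifted ions can be sketched as a simple comparison of fragment lists. This is an illustration only and not the MassMetaSite matching or scoring procedure. The function name, the default tolerance and the example m/z values are hypothetical.

```python
def classify_fragments(parent_mzs, metabolite_mzs, mass_shift, charge=1, ppm_tol=10.0):
    """Label each metabolite fragment relative to the parent fragments.

    Returns a dict mapping each metabolite m/z to 'non-shifted', 'shifted'
    (parent m/z plus mass_shift / charge) or 'unassigned'.
    """
    def close(a, b):
        return abs(a - b) / b * 1e6 <= ppm_tol

    labels = {}
    for mz in metabolite_mzs:
        if any(close(mz, p) for p in parent_mzs):
            labels[mz] = "non-shifted"
        elif any(close(mz, p + mass_shift / charge) for p in parent_mzs):
            labels[mz] = "shifted"
        else:
            labels[mz] = "unassigned"
    return labels


# Hypothetical fragments; the shift corresponds to the addition of water.
labels = classify_fragments(
    parent_mzs=[300.1000, 450.2000],
    metabolite_mzs=[300.1001, 468.2106, 512.0000],
    mass_shift=18.0106,
)
```

Fragments that keep the parent m/z do not contain the modified site, while shifted fragments do, which is what allows the biotransformation to be localized.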
The MS and MS/MS spectra contain five types of peaks:
Black peaks: These peaks lack fragment assignments in the parent and have no effect on the interpretation of the metabolite under consideration.
Red peaks: Represent matching peaks, and their structural interpretation aligns with the proposed metabolite structure. Clicking on red peaks reveals the assigned structure in the right panel.
Cyan peaks: Indicate mismatching peaks, and their structural interpretation contradicts the proposed metabolite structure.
Coral peaks: Correspond to metabolite matching peaks with structural information consistent with the proposed metabolite structure. However, they lack a substrate fragment match, resulting from manual editing or MassMetaSite if metabolite fragmentation is selected in the settings.
Light green peaks: Denote metabolite mismatching peaks, providing structural information contrary to the proposed structure under study. These peaks lack substrate peak matches and stem from the propagation of a manually edited peak.
It is essential to consider the isotope pattern and ensure that it aligns with the expected charge state of the metabolite. The charge of the ion significantly influences the spacing between isotopic peaks, and deviations in the observed pattern may serve as indicators of errors in charge assignment or other issues.
Furthermore, the structural assignment of the isotope pattern peaks is checked manually. If the structure assigned to a match or mismatch peak is not the expected one, the peak can be removed from the analysis, and the score is then recalculated. In addition, black peaks can be examined, and structural information can be added using the fragment structure editor if considered appropriate.
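The relationship between charge state and isotopic peak spacing mentioned above can be made concrete: adjacent isotopic peaks are separated by roughly 1.00335/z in m/z, where 1.00335 Da is the approximate mass difference between 13C and 12C. The following minimal sketch estimates the charge from a list of isotope peak positions. The function name and the example values are hypothetical.

```python
C13_C12_DIFF = 1.00335  # approximate mass difference between 13C and 12C (Da)


def infer_charge(isotope_mzs):
    """Estimate the charge state from the spacing of adjacent isotopic peaks."""
    mzs = sorted(isotope_mzs)
    if len(mzs) < 2:
        raise ValueError("at least two isotopic peaks are required")
    spacings = [b - a for a, b in zip(mzs, mzs[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Spacing is about 1.00335 / z, so z is about 1.00335 / spacing.
    return round(C13_C12_DIFF / mean_spacing)


# Hypothetical isotope cluster with a spacing of about 0.3345 m/z units.
charge = infer_charge([1000.0000, 1000.3345, 1000.6689])
```

A spacing that does not correspond to the assigned charge is one of the deviations that would prompt a manual review of the assignment.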
Processing time.
In this study, the data processing time was also recorded, including the time required to import data into WebMetabase. Notably, dataset-5 allowed a comparison with processing times previously reported in the literature [15], which were obtained with an earlier version (2021) of the same software. Processing times were also compared between the different algorithms and settings outlined in the Data Preprocessing section, since the processing time may vary depending on the peak detection algorithm employed and on the visualization chosen for the compound representation (expanded, non-expanded, or mixed).
Results and discussion
This section presents the experimental results obtained by applying our approach and algorithms to the MetID of five distinct peptide datasets and one oligonucleotide dataset. All metabolite structural assignments were checked manually and considered reliable because the fragmentation was adequate, the isotope pattern was as expected, the difference between the observed and theoretical m/z was small (<10 ppm), and the score was high.
Monoisotopic mass and most abundant mass
One of the primary objectives of this study is to conduct a comprehensive comparison between the two algorithms, MiM and MaM. To achieve this goal, datasets 1, 2, 3, 4 and 6 as previously outlined, have undergone processing with both algorithm configurations. Table 6 presents the number of identified metabolites corresponding to each dataset, based on the employed algorithm.
Table 6. Number of identified metabolites for each dataset, considering the algorithm, incubation conditions, and acquisition mode (in case of dataset-3).
| DATASET-1 | INCUBATION CONDITIONS | | | |
|---|---|---|---|---|
| | Trypsin | Chymotrypsin | Pancreatic Elastase | Pepsin |
| MiM | 34 | 42 | 39 | 35 |
| MaM | 36 | 45 | 43 | 37 |
| DATASET-2 | INCUBATION CONDITIONS | | | |
| | DPP-4 | NEP | | |
| MiM | 26 | 4 | | |
| MaM | 27 | 4 | | |
| DATASET-3 | ACQUISITION MODE | | | |
| | DDA | DIA | | |
| MiM | 50 | 111 | | |
| MaM | 50 | 111 | | |
| DATASET-4 | STRUCTURE VISUALIZATION | | | |
| | NON-EXPANDED | | | |
| MiM | 7 | | | |
| MaM | 11 | | | |
| DATASET-5 | INCUBATION CONDITIONS | | | |
| | Trypsin | MMP12 | Neutrophil Elastase | CatG |
| MaM | 31 | 60 | 70 | 77 |
| DATASET-6 | INCUBATION CONDITIONS | | | |
| | IDE | | | |
| MiM | 8 | | | |
| MaM | 12 | | | |
Notable differences between MiM and MaM algorithms are observed in compounds such as calcitonin from dataset-1 or taspoglutide from dataset-2. These variations are attributed to the larger peptide structures of these compounds. As molecular size increases, the relative intensity of the MiM tends to decrease. In such cases, the use of the MaM algorithm provides a more precise MetID in larger peptides.
The analysis of dataset-1 resulted in the identification of 150 metabolites with the MiM algorithm and 161 metabolites with the MaM algorithm. Calcitonin, a cyclic peptide, is one of the largest peptides of this dataset (3429.71 Da). The same 6 metabolites were identified with both settings: M1-2178, M2-2309, M3-1981, M4-1852, M5-499, and M6-1739, with respective retention times of 1.86, 1.91, 2.45, 2.47, 2.52, and 2.99 minutes. However, the score values differ noticeably between the two settings (Table 7). A higher score indicates a better match between the theoretical product ion m/z values and the observed m/z values in the MS/MS spectrum, and therefore a more confident structure prediction. This scoring system helps in distinguishing reliable matches from potential false positives.
Table 7. Retention times of the identified Calcitonin metabolites along with their corresponding values for score, matches, mismatches, and metmatches obtained using both algorithms.
| RT (minutes) | MaM Score | MaM Matches | MaM Mismatches | MaM MetMatches | MiM Score | MiM Matches | MiM Mismatches | MiM MetMatches |
|---|---|---|---|---|---|---|---|---|
| 1.86 | 445.4 | 2 | 1 | 13 | 126.1 | 1 | 0 | 0 |
| 1.91 | 928.3 | 10 | 0 | 17 | 523.4 | 6 | 0 | 0 |
| 2.45 | 914.2 | 12 | 2 | 28 | 175.5 | 2 | 0 | 1 |
| 2.47 | 1062.3 | 13 | 1 | 40 | 197.1 | 0 | 0 | 0 |
| 2.52 | 1864.0 | 17 | 1 | 19 | 454.9 | 6 | 0 | 0 |
| 2.99 | 1177.4 | 23 | 3 | 30 | 234.5 | 2 | 0 | 0 |
The dataset-2, consisting of GLP-1 and four synthetic analogues, comprises linear peptides with a molecular weight exceeding 3000 Da, thereby accentuating the significant differences when utilizing MaM or MiM algorithms. This contrast is evident in the case of taspoglutide, as illustrated below.
Taspoglutide (3338.71 Da) incubated with DPP-4 yielded 15 metabolite peaks with the MaM settings (M1-2175, M2-2163, M3-1966, M4-2223, M5-1925, M6-1895, M7-2222, M8-2154, M9-2255, M10-2094, M11-2147, M12-1977, M13-1396, M14-1146 and M15-407), with retention times of 2.69, 3.54, 3.74, 3.88, 3.96, 4.00, 4.26, 4.47, 4.48, 4.54, 4.54, 4.92, 5.27, 5.90, and 6.80 minutes, respectively. In contrast, 14 metabolites were identified using the MiM settings: the same as with MaM, except for M6-1895 (retention time 4.00 minutes). Eight of the metabolites correspond to first-generation products (from a single reaction) and are indicated by the green color of the peak, as shown in Fig 5. The other seven metabolites, colored brown, result from multiple enzymatic reactions. A score is calculated and reported for each metabolite. The larger number of matches in the MaM analysis contributes to higher maximum score values, which conveys greater confidence in the results obtained. For example, for metabolite M4-2223 the score with MaM is 1302.1 with 28 matching fragments, while with MiM the same metabolite has a score of 807.1 with 17 matching fragments. Other results are shown in the supporting information.
Fig 5. Extracted ion chromatograms of Taspoglutide after 24 hours of incubation with DPP-4, using both algorithms.
(Blue peak: represents the parent peptide compound, green peaks: first generation of metabolites, and brown peaks: second generation or higher).
The peptide GLP-1 (3297.68 Da) exhibits a brief half-life, primarily attributed to its swift degradation by proteases DPP-4 and NEP. MetID of GLP-1, incubated with DPP-4, revealed the presence of three metabolites: M1-137, M2-394, and M3-208, with respective retention times of 6.53, 6.58, and 6.66 minutes. Notably, M3-208 exhibits the common cleavage site reported in bibliography [24] and attributed to DPP-4, occurring between Ala (8) and Glu (9). A discernible distinction between the two algorithms lies in the appearance of false positives, as shown in Fig 6, with a notable increase observed when employing the MiM settings.
Fig 6. False positives of the GLP-1 compound using both the MaM and MiM algorithms.
It is noteworthy that the number obtained with the MiM algorithm is significantly higher.
Semaglutide, a GLP-1 analogue, underwent data collection using the HDMSE acquisition mode on a Waters® QToF instrument. Structural assignments for two degradation products with both algorithms MiM and MaM, namely M1-3446 and M2-3418, have been achieved with high mass accuracy, featuring retention times of 2.77 and 3.07 minutes, respectively (Fig 7). Consistent with prior bibliography, these metabolites arise from three distinct metabolic modifications, specifically induced by amide hydrolysis and sequential beta-oxidation in the fatty acid part [25].
Fig 7. Metabolites identified and extracted ion chromatogram of Semaglutide using MiM algorithm.
Dataset-3 (comprising somatostatin and seven synthetic analogs incubated with human serum) was analyzed with different acquisition modes to illustrate that the MetID workflow can handle data from distinct mass spectrometry acquisition techniques such as DIA and DDA.
DDA data were collected with a Thermo Scientific Q-Exactive Hybrid Quadrupole-Orbitrap Mass Spectrometer (Q-Exactive) employing full scan mode, and DIA HDMSE data were acquired using a Vion IMS QTof Mass Spectrometer. Both sets of data were processed through Mass-MetaSite and subsequently uploaded to WebMetabase for visualization via the Mass-MetaSite Batch Processor.
DDA and DIA data underwent processing with both algorithms (MiM and MaM). The results obtained show no distinctions. The identified metabolites, score values, and various parameters such as the numbers of matches, mismatches, and metmatches remain consistent across both algorithms. Considering the minimal chemical or monomer modifications within the peptide structure of these compounds, no substantial shift in molecular size was observed in this dataset.
The analysis of this dataset collected with DDA led to the identification of 17 metabolites with each of the algorithms. All the identified metabolites were produced by amide hydrolysis reactions. The principal metabolites observed arise from the loss of Ala (−71 Da) and AlaGly (−128 Da) from the linear segment of the structure (Fig 8). The incorporation of D-Trp at the eighth position improved stability relative to the parent compound somatostatin, as the synthetic analogs avoid the ring opening observed between D-Trp (8) and Lys (9). This observation aligns with previously reported findings, which highlighted that the introduction of Msa residues, coupled with the presence of D-Trp8, strengthens the aromatic side-chain interactions in somatostatin, providing greater stability [13].
Fig 8. Somatostatin (Parent compound) and major metabolites identified using both algorithms.
Metabolites M5-128 and M6-71 indicate cleavages from the tail portion of somatostatin. Additionally, M3 + 18 represents a ring-opening product occurring between DTrp (8) and Lys (9).
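The nominal mass shifts quoted for these metabolites follow directly from standard monoisotopic residue masses: removing an amino acid by amide hydrolysis lowers the mass by its residue mass, and opening a ring by hydrolysis adds one water molecule. The short sketch below reproduces this arithmetic. The dictionary and function names are illustrative.

```python
# Monoisotopic residue masses (Da) of alanine and glycine, and the mass of water.
RESIDUE_MASS = {"A": 71.03711, "G": 57.02146}
WATER = 18.01056


def residue_loss(sequence):
    """Mass change (Da) when the given residues are cleaved from the peptide."""
    return -sum(RESIDUE_MASS[aa] for aa in sequence)


loss_ala = residue_loss("A")      # loss of Ala
loss_alagly = residue_loss("AG")  # loss of Ala-Gly
ring_opening = WATER              # hydrolytic ring opening adds one water
```

Rounded to nominal mass, these values correspond to the −71, −128 and +18 labels used in the metabolite names.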
Similarly, for DIA data, the identification of key metabolites, specifically -Ala and -AlaGly, is consistent. As previously documented in bibliography [26], the analog labeled as 95 demonstrates superior stability, characterized by delayed and reduced metabolic transformations compared to other analogs. This stability is further elucidated in Fig 9, which delineates the time/response profiles of the substrate, illustrating the gradual disappearance of the peptide.
Fig 9. Substrate profiles employing In-In scaling for somatostatin, Analogue 31, 65, and 95.
Dataset-5 contains 16 linear and 12 cyclic peptides, incubated with cathepsin G, neutrophil elastase, trypsin and MMP-12. The data were collected using LC-HRMS, with analysis performed on a Synapt G2® high-definition quadrupole time-of-flight mass spectrometer (Waters®), operating in positive electrospray ionization mode.
The data processing time, using the settings outlined in the referenced study [13] together with the non-expanded structure visualization, was substantially reduced. For example, salmon calcitonin, which previously needed two hours for processing, now requires only 25 minutes with the new methodology.
As an illustrative example of this dataset, the following compound and its analogs are described, while the MetID for the other compounds can be found in the Supplementary Information. Specifically, the dataset includes somatostatin and analogs that have been synthesized over the past few decades by introducing modifications such as exchange and deletion of amino acids, ring size reduction, or disulfide bridge modification, among others [13]. These analogs, namely octreotide, lanreotide, and vapreotide, are octapeptides characterized by a shorter and consequently less flexible ring structure compared to somatostatin.
Previous bibliography reports that the ring opening from somatostatin and its analogs was only observed in the case of somatostatin, as also observed in this study [13]. Despite somatostatin being rapidly degraded by proteases, its analogs exhibit stability, as illustrated in Fig 10, which presents extracted ion chromatograms after 60 minutes of incubation with neutrophil elastase. The processing time for these compounds was 15 minutes.
Fig 10. Extracted ion chromatograms using MaM algorithms.
A) somatostatin, B) lanreotide, C) octreotide, and D) vapreotide after 60 minutes of incubation with neutrophil elastase.
Fig 11 presents a detailed MetID of somatostatin incubated with neutrophil elastase. The analysis identified the same metabolites as previously reported [13]: M1-1371, M2-1204, M3-230, M4 + 18, M5-909, M6 + 18, and M7-661, with respective retention times (RT) of 0.73, 1.60, 1.60, 1.71, 1.93, 1.93, and 2.21 minutes.
Fig 11. Summarized MetID report, with the retention time (RT) of each metabolite, from the incubation of somatostatin with neutrophil elastase.
Dataset-6 contains data for human insulin (5808 Da), a cyclic peptide with three disulfide bridges, after 2 minutes of incubation with IDE. Processing with the MaM algorithm led to the identification of 12 metabolites, designated M1-2965, M2-3315, M3-3145, M4-2973, M5-2902, M6-3452, M7-3151, M8-3032, M9-2961, M10-3289, M11-2869, and M12-2798, with respective retention times of 2.06, 7.78, 8.08, 9.14, 9.32, 9.65, 9.80, 10.38, 11.17, 11.43, 11.69, and 12.39 minutes (Fig 12). These metabolites have been previously documented in the literature and are generated through two cleavages, one within Chain A and the other within Chain B. Notably, four of them have been reported previously as major IDE-degraded insulin fragments (Fig 13) [16]. The formation of these metabolites results from cleavage within the A chain, at positions A13-14 or A14-15, and in the middle of the B chain, at positions B9-10 or B14-15.
Fig 12. Extracted ion chromatograms of Insulin after 2 minutes of incubation with IDE.
Blue peak: substrate/parent peptide, green peaks: first generation metabolites, brown peaks: second generation or higher metabolites.
In contrast, MiM identified 8 metabolites, M1-3306, M2-2971, M3-3450, M4-3150, M5-2959, M6-3287, M7-2867, and M8-2618, with respective retention times of 7.74, 9.14, 9.65, 9.78, 11.15, 11.43, 11.67, and 15.65 minutes (Fig 13). Notably, two of the major products previously reported in the literature are absent [16]. Moreover, consistent with previous observations, there is a significant difference in score values between the two algorithms, with MaM scores consistently higher due to the higher number of matches and the absence of mismatches (Table 8).
Fig 13. Four of the major products corresponding to insulin fragments, using the MaM algorithm, after incubation with IDE.
These metabolites, resulting from two distinct cleavages—one within Chain A and the other within Chain B—have been previously identified in the bibliography.
Table 8. Retention times of the identified Insulin metabolites along with their corresponding values for score, matches, mismatches, and metmatches obtained using both algorithms. NI = Non-identified metabolites.
| RT (minutes) | MaM Score | MaM Matches | MaM Mismatches | MaM MetMatches | MiM Score | MiM Matches | MiM Mismatches | MiM MetMatches |
|---|---|---|---|---|---|---|---|---|
| 2.06 | 509.6 | 3 | 0 | 0 | NI | NI | NI | NI |
| 7.78 | 1017.5 | 6 | 0 | 18 | 132.9 | 2 | 0 | 12 |
| 8.08 | 760.3 | 6 | 0 | 5 | NI | NI | NI | NI |
| 9.14 | 816.3 | 6 | 0 | 8 | 253.3 | 4 | 0 | 9 |
| 9.32 | 767.5 | 6 | 0 | 2 | NI | NI | NI | NI |
| 9.65 | 767.1 | 6 | 0 | 9 | 243.6 | 4 | 2 | 10 |
| 9.80 | 874.0 | 6 | 0 | 4 | 278.7 | 4 | 0 | 5 |
| 10.38 | 992.6 | 9 | 0 | 2 | NI | NI | NI | NI |
| 11.17 | 792.4 | 6 | 0 | 4 | 319.0 | 2 | 0 | 3 |
| 11.43 | 737.5 | 6 | 0 | 4 | 223.8 | 4 | 0 | 4 |
| 11.69 | 509.6 | 3 | 0 | 0 | 210.1 | 4 | 1 | 7 |
| 12.39 | 789.8 | 6 | 0 | 8 | NI | NI | NI | NI |
| 15.65 | NI | NI | NI | NI | 138.9 | 2 | 0 | 0 |
Structure visualization – atoms/bonds vs monomer
The analysis of biotransformation products for therapeutic oligonucleotides using LC-HRMS presents a significant challenge, primarily attributed to the high molecular weight of these compounds. Given that these oligonucleotides consist of multiple monomers susceptible to metabolic reactions, constructing a virtual set containing all potential metabolites becomes a resource-intensive task in terms of time and computational requirements. Furthermore, the extensive number of cleavable bonds amplifies the complexity of the fragmentation analysis, demanding additional time and computing resources. This study presents two fragmentation algorithms: one that performs the analysis at the monomer level (non-expanded) and another that works at the atom/bond level (expanded).
In this section, three experiments involving the incubation of ASOs in Human Liver at various timepoints are presented, comprising two sets incubated with distinct oligonucleotide strains (dataset-4). The data was acquired in a DDA mode in a Thermo Q-Exactive® spectrometer.
A total of 11 metabolites have been identified in both experiments (expanded and non-expanded) using the MaM algorithm, M1-5473, M2-5473, M3-3282, M4-2567, M5-930, M6-617, M7-616, M8-313, M9-312, M10-304, M11-304, with respective retention times of 7.17, 8.62, 14.42, 16.36, 17.11, 17.59, 17.84, 17.90, 17.92, 17.95, and 18.26 (Fig 14). The identified structures of the metabolites can be attributed to specific biotransformation reactions, including o-dealkylation, phosphoester hydrolysis, aromatic deamination, and nucleobase loss.
Fig 14. Extracted ion chromatograms of ASOs after 72 hours of incubation with the modified strain.
In contrast, using the MiM algorithm, a total of 7 metabolites have been identified (using non-expanded visualization), M1-5470, M2-5470, M3-2566, M4-617, M5-313, M6-313, and M7-304, with respective retention times of 7.17, 8.62, 16.36, 17.55, 17.89, 17.93, and 18.24. Table 9 illustrates the score value differences between the two algorithms.
Table 9. Retention times of the identified ASO metabolites along with their corresponding values for score, matches, mismatches, and metmatches obtained using both algorithms. NI = Non-identified metabolites.
| RT (minutes) | MaM Score | MaM Matches | MaM Mismatches | MaM MetMatches | MiM Score | MiM Matches | MiM Mismatches | MiM MetMatches |
|---|---|---|---|---|---|---|---|---|
| 7.17 | 2654.9 | 31 | 0 | 2 | 835.9 | 15 | 0 | 0 |
| 8.62 | 2547.5 | 32 | 0 | 5 | 896.8 | 15 | 0 | 0 |
| 14.42 | 2250.2 | 31 | 0 | 11 | NI | NI | NI | NI |
| 16.36 | 4445.7 | 62 | 6 | 33 | 788.3 | 35 | 0 | 3 |
| 17.11 | 4019.8 | 46 | 4 | 19 | NI | NI | NI | NI |
| 17.59 | 5056.3 | 67 | 3 | 18 | 426.1 | 25 | 0 | 2 |
| 17.84 | 1778.9 | 24 | 0 | 0 | NI | NI | NI | NI |
| 17.90 | 6171.2 | 101 | 6 | 19 | 541.9 | 35 | 0 | 5 |
| 17.92 | 4081.9 | 62 | 1 | 4 | 561.3 | 5 | 0 | 3 |
| 17.95 | 8142.5 | 112 | 5 | 38 | NI | NI | NI | NI |
| 18.26 | 5423.3 | 83 | 2 | 21 | 371.5 | 25 | 0 | 1 |
In addition, a paired-samples t-test was performed on the Score values reported in Tables 7–9 (excluding metabolites not found in one of the two approaches), revealing a statistically significant difference between the MiM and MaM algorithms (p = 0.0002904).
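For readers who wish to retrace this comparison, the sketch below computes a paired t statistic from the Score pairs transcribed from Tables 7–9, keeping only metabolites identified by both approaches. The exact test settings used for the reported p-value are not stated in the text, so this is an illustration of the calculation rather than a reproduction of the reported result. The variable and function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# (MaM score, MiM score) pairs transcribed from Tables 7-9,
# restricted to metabolites identified with both algorithms.
PAIRS = [
    # Table 7 (calcitonin)
    (445.4, 126.1), (928.3, 523.4), (914.2, 175.5),
    (1062.3, 197.1), (1864.0, 454.9), (1177.4, 234.5),
    # Table 8 (insulin)
    (1017.5, 132.9), (816.3, 253.3), (767.1, 243.6), (874.0, 278.7),
    (792.4, 319.0), (737.5, 223.8), (509.6, 210.1),
    # Table 9 (ASO)
    (2654.9, 835.9), (2547.5, 896.8), (4445.7, 788.3), (5056.3, 426.1),
    (6171.2, 541.9), (4081.9, 561.3), (5423.3, 371.5),
]


def paired_t_statistic(pairs):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in pairs]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


t_value, dof = paired_t_statistic(PAIRS)
```

The corresponding two-sided p-value can be obtained from the t distribution with `dof` degrees of freedom, for example with `scipy.stats.ttest_rel` applied to the two score columns.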
In Fig 15, the two distinct structure visualizations are presented for the same identified metabolite, showcasing a nucleobase loss from the parent compound and two phosphoester hydrolyses. The depiction at the bond level provides a clearer understanding of the biotransformation pathways and chemical alterations experienced by the compound. It is noteworthy to consider the processing time, which, in this specific example, is 40 minutes for the non-expanded representation and extends to 70 minutes when three of the monomers are expanded.
Fig 15. Illustration of nucleobase loss in both expanded and non-expanded structural representations of ASO.
This visualization algorithm allows monomer and atom/bond notation to be combined, making the metabolic changes in the structure easy to see. As a result, the need to expand all monomers individually is avoided, alleviating the associated high processing time. The constrained structure alignment between the substrate and the metabolite, which maintains the same orientation, facilitates interpretation of the biotransformations that occurred.
Processing time
The processing time is influenced by the size and molecular weight of the peptide, as shown in Table 10. For peptides with molecular weights between 3000 and 4000 Da, processing times range from 22 to 30 minutes when using the non-expanded visualization mode. In contrast, the expanded mode results in longer processing times, extending from 2 up to 8 hours. For peptides exceeding 4000 Da, the expanded mode becomes impractical due to excessive memory requirements and long processing times.
Table 10. Estimated processing times for peptides based on molecular weight and the used visualization approach.
| Molecular weight (Da) | Number of compounds | All monomers non-expanded (minutes) | All monomers expanded |
|---|---|---|---|
| < 1000 | 5 | 5 - 8 | 25 - 30 minutes |
| 1000 - 1200 | 15 | 8 - 13 | 30 - 40 minutes |
| 1200 - 1500 | 7 | 13 - 18 | 40 - 60 minutes |
| 1500 - 3000 | 12 | 18 - 22 | 60 - 120 minutes |
| 3000 - 4000 | 6 | 22 - 30 | 2–8 hours |
| > 4000 | 4 | 30 - 40 | * |
* Not computable due to high memory requirements and extended processing time.
This difference is due to the fragmentation method used in each mode: the non-expanded mode operates at the monomer level, limiting fragmentation to predefined ion types (e.g., a, b, c, x, y, z ions in peptide analysis), while the expanded mode simulates fragmentation at the atomic level by disconnecting all chemical bonds. As a result, the expanded approach generates a significantly higher number of theoretical fragments, increasing processing times.
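A rough counting argument shows why the two modes scale so differently. The sketch below is not the fragmentation engine of the software, whose rules are not detailed here. It assumes a linear chain, counts one set of six ion types per inter-monomer link for the monomer-level mode, and counts the ways of breaking up to `max_cuts` bonds for the bond-level mode. The monomer and bond counts in the example are hypothetical.

```python
from math import comb

ION_TYPES = ("a", "b", "c", "x", "y", "z")


def monomer_level_fragments(n_monomers):
    """Backbone fragments of a linear chain: six ion types per inter-monomer link."""
    return (n_monomers - 1) * len(ION_TYPES)


def bond_level_cut_sets(n_bonds, max_cuts=2):
    """Number of ways to break between 1 and max_cuts bonds out of n_bonds.

    This is an upper bound on the candidate fragmentations to evaluate,
    not the number of distinct fragments.
    """
    return sum(comb(n_bonds, k) for k in range(1, max_cuts + 1))


# Hypothetical 10-monomer chain containing 100 bonds in total.
monomer_count = monomer_level_fragments(10)
bond_count = bond_level_cut_sets(100, max_cuts=2)
```

In this example the monomer-level count grows linearly with chain length, whereas the bond-level count grows roughly with the square of the number of bonds when two simultaneous cleavages are allowed, which is in line with the processing times in Table 10.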
Conclusions
A new automated workflow for LC-HRMS data analysis has been developed and described, addressing challenges associated with result visualization and computational time in processing incubation data of macromolecules. This approach has proved effective for the analysis of both linear and cyclic peptides containing natural or unnatural amino acids. A total of 970 metabolites have been identified across different incubation conditions and peak detection algorithms. Furthermore, its applicability extends beyond peptides, as demonstrated by the successful processing of oligonucleotide data. The results show that the workflow can efficiently handle experimental data within a molecular weight range of 700–7630 Da. Importantly, its effectiveness has been validated with data from different acquisition modes (DDA and DIA).
WebMetabase was employed for the processing and visualization of data derived from six datasets using different algorithms in the data preprocessing step.
In larger molecules (>3000 Da), notable differences were observed between the MiM and MaM peak detection algorithms. The MaM approach identified a greater number of metabolites, including several that were missed by MiM but previously reported in the literature, as for example in the case of insulin. In these high-mass compounds, the MaM algorithm produced higher scoring and more numerous matches, indicating increased confidence in structural predictions. Additionally, it demonstrated a lower incidence of false positives, reinforcing its suitability for macromolecules.
Two visualization strategies for macromolecules are presented, expanded and non-expanded, which directly influence how biotransformations are computed. The non-expanded mode reduces preprocessing time by minimizing the number of chemical structures that must be generated during analysis, with processing times ranging from 5 minutes for small peptides (<1000 Da) up to 40 minutes for larger peptides (>4000 Da). In contrast, the expanded mode simulates fragmentation at the atomic level and requires processing times ranging from 25 minutes for small peptides (<1000 Da) to several hours for larger peptides. For peptides larger than 4000 Da, the expanded mode becomes impractical due to excessively long processing times and high memory requirements. Moreover, both strategies can be combined in a hybrid approach, allowing selective expansion of specific monomers while keeping others non-expanded, as illustrated in the oligonucleotide dataset. This flexibility enhances interpretability by enabling targeted bond-level investigation of biotransformations without incurring the computational cost of expanding all the monomers within the compound.
Supporting information
(PDF)
Data Availability
All relevant data are within the manuscript and its Supporting Information files.
Funding Statement
This work has been partially supported by Doctorats Industrials, AGAUR, Generalitat de Catalunya. Industrial Doctorate grant 00002/2023.
References
1. Evans L, Phipps R, Shanu-Wilson J, Steele J, Wrigley S. Methods for metabolite generation and characterization by NMR. Identification and Quantification of Drugs, Metabolites, Drug Metabolizing Enzymes, and Transporters. Elsevier. 2020. p. 119–50. doi: 10.1016/b978-0-12-820018-6.00004-1
2. Wu Y, Pan L, Chen Z, Zheng Y, Diao X, Zhong D. Metabolite Identification in the Preclinical and Clinical Phase of Drug Development. Curr Drug Metab. 2021;22(11):838–57. doi: 10.2174/1389200222666211006104502
3. Li AP. Overview: Evaluation of metabolism-based drug toxicity in drug development. Chem Biol Interact. 2009;179(1):1–3. doi: 10.1016/j.cbi.2008.11.013
4. Kania J. Analyzing drug metabolism: a key factor in drug development and safety assessment. J Drug Metab Toxicol. 2024;15:329.
5. Wang L, Wang N, Zhang W, Cheng X, Yan Z, Shao G, et al. Therapeutic peptides: current applications and future directions. Signal Transduct Target Ther. 2022;7(1):48. doi: 10.1038/s41392-022-00904-4
6. Moumné L, Marie A-C, Crouvezier N. Oligonucleotide Therapeutics: From Discovery and Development to Patentability. Pharmaceutics. 2022;14(2):260. doi: 10.3390/pharmaceutics14020260
7. Cuyckens F, Dillen L, Cools W, Bockx M, Vereyken L, de Vries R, et al. Identifying metabolite ions of peptide drugs in the presence of an in vivo matrix background. Bioanalysis. 2012;4(5):595–604. doi: 10.4155/bio.11.333
8. Mass Analytica. WebMetabase. 2023.
9. Mass Analytica. MassMetaSite. 2023.
10. Yao J-F, Yang H, Zhao Y-Z, Xue M. Metabolism of Peptide Drugs and Strategies to Improve their Metabolic Stability. Curr Drug Metab. 2018;19(11):892–901. doi: 10.2174/1389200219666180628171531
11. Radchenko T, Brink A, Siegrist Y, Kochansky C, Bateman A, Fontaine F, et al. Software-aided approach to investigate peptide structure and metabolic susceptibility of amide bonds in peptide drugs based on high resolution mass spectrometry. PLoS One. 2017;12(11):e0186461. doi: 10.1371/journal.pone.0186461
12. Radchenko T. New advances in metabolism prediction: Biotransformation of peptides and its implications in drug discovery. Barcelona: Universitat Pompeu Fabra. 2018. https://www.tdx.cat/handle/10803/665008
13. Martín-Gago P, Aragón E, Gomez-Caminals M, Fernández-Carneado J, Ramón R, Martin-Malpartida P, et al. Insights into structure-activity relationships of somatostatin analogs containing mesitylalanine. Molecules. 2013;18(12):14564–84. doi: 10.3390/molecules181214564
14. Basiri B, Xie F, Wu B, Humphreys SC, Lade JM, Thayer MB, Yamaguchi P, Florio M, Rock B. Introducing an in vitro liver stability assay capable of predicting the in vivo pharmacodynamic efficacy of siRNAs for IVIVC. Mol Ther Nucleic Acids. 2020;21:725–36.
15. Wesche F, De Maria L, Leek T, Narjes F, Bird J, Su W, et al. Automated high-throughput in vitro assays to identify metabolic hotspots and protease stability of structurally diverse, pharmacologically active peptides for inhalation. J Pharm Biomed Anal. 2022;211:114518. doi: 10.1016/j.jpba.2021.114518
16. Manolopoulou M, Guo Q, Malito E, Schilling AB, Tang W-J. Molecular basis of catalytic chamber-assisted unfolding and cleavage of human insulin by human insulin-degrading enzyme. J Biol Chem. 2009;284(21):14177–88. doi: 10.1074/jbc.M900068200
17. Bonn B, Leandersson C, Fontaine F, Zamora I. Enhanced metabolite identification with MS(E) and a semi-automated software for structural elucidation. Rapid Commun Mass Spectrom. 2010;24(21):3127–38. doi: 10.1002/rcm.4753
18. Cece-Esencan EN, Fontaine F, Plasencia G, Teppner M, Brink A, Pähler A, et al. Software-aided cytochrome P450 reaction phenotyping and kinetic analysis in early drug discovery. Rapid Commun Mass Spectrom. 2016;30(2):301–10. doi: 10.1002/rcm.7429
19. Zelesky V, Schneider R, Janiszewski J, Zamora I, Ferguson J, Troutman M. Software automation tools for increased throughput metabolic soft-spot identification in early drug discovery. Bioanalysis. 2013;5(10):1165–79. doi: 10.4155/bio.13.89
20. Soares R, Franco C, Pires E, Ventosa M, Palhinhas R, Koci K, et al. Mass spectrometry and animal science: protein identification strategies and particularities of farm animal species. J Proteomics. 2012;75(14):4190–206. doi: 10.1016/j.jprot.2012.04.009
21. Radchenko T, Kochansky CJ, Cancilla M, Wrona MD, Mortishire-Smith RJ, Kirk J, et al. Metabolite identification using an ion mobility enhanced data-independent acquisition strategy and automated data processing. Rapid Commun Mass Spectrom. 2020;34(12):e8792. doi: 10.1002/rcm.8792
22. Mass Analytica. https://mass-analytica.com/products/webchembase/chromatography-quality-and-multiple-signal-detection/. Accessed 2024 February 26.
23. Wahab M, Patel D, Armstrong D. Peak shapes and their measurements: the need and the concept behind total peak shape analysis. LC GC North Am. 2017. Dec;12:846–53.
24. Manandhar B, Ahn J-M. Glucagon-like peptide-1 (GLP-1) analogs: recent advances, new possibilities, and therapeutic implications. J Med Chem. 2015;58(3):1020–37. doi: 10.1021/jm500810s
25. Jensen L, Helleberg H, Roffel A, van Lier JJ, Bjørnsdottir I, Pedersen PJ, et al. Absorption, metabolism and excretion of the GLP-1 analogue semaglutide in humans and nonclinical species. Eur J Pharm Sci. 2017;104:31–41. doi: 10.1016/j.ejps.2017.03.020
26. Wrona MD, Kirk JM, Zamora I, Radchenko T, Escola A, Riera A, et al. Somatostatin analogue catabolite screening and identification using Vion IMS QTof with WebMetabase. 720006586EN. Waters. 2019.