Glycan mixture analysis by kernel component composition for matrix factorization

Pengyu Hong; Chaoshuang Xia; Yang Tang; Juan Wei; Cheng Lin

doi:10.1007/s00216-025-05777-4

Author manuscript; available in PMC: 2025 Sep 30.

Published in final edited form as: Anal Bioanal Chem. 2025 Feb 12;417(10):1975–1984. doi: 10.1007/s00216-025-05777-4

PMCID: PMC12477321 NIHMSID: NIHMS2110862 PMID: 39939417

Abstract

A major challenge in structural glycomics is the presence of isomeric glycan structures, which may not be fully resolved by separation techniques such as liquid chromatography (LC) and ion mobility spectrometry (IMS). Tandem mass spectrometry (MS/MS) can be employed following on-line separation to distinguish unresolved features, as the temporal profiles of various fragment ions reflect different combinations of those from their respective precursor ions. However, traditional principal component analysis can produce negative signals that are unrealistic for real data, and classic non-negative matrix factorization (NMF) methods may result in factors that include contributions from multiple components. This paper introduces a new variation of NMF, termed kernel component composition (KCC), which enables users to impose domain-specific prior knowledge about the components as parametric kernels. These kernel parameters are then learned directly from the data. We developed a theoretically guaranteed algorithm based on proximal gradient descent to solve the optimization problem posed by KCC and derived detailed parameter update rules when using Gaussian kernels. The effectiveness of the KCC algorithm is demonstrated through simulation tests and its application to deconvoluting chemical datasets, including LC- and IM-MS/MS analysis of isomeric glycan mixtures.

Keywords: Kernel component composition, Non-negative matrix factorization, LC–MS/MS, IM-MS/MS, Glycans, Isomer analysis

Introduction

Glycans derived from biological sources often encompass a diverse array of isomeric structures. Accurate elucidation of glycan structures requires isolation of individual isomers from the mixture using techniques such as liquid chromatography (LC) [1, 2], capillary electrophoresis (CE) [3, 4], or ion mobility (IM) spectrometry [5]. However, these methods may not always guarantee complete isomer resolution. Furthermore, in the case of IM separation, the presence of multiple conformations for each isomer adds an additional layer of complexity to the separation process [6].

Partially co-eluting isomers can be distinguished by their unique tandem mass spectrometry (MS/MS) fragmentation patterns, which may include isomer-specific diagnostic fragments or variations in the abundances of shared product ions. The output of an LC–MS/MS, CE-MS/MS, or IM-MS/MS experiment is represented as a non-negative matrix, where one dimension corresponds to a temporal separation index, such as retention time, elution time, or arrival time, while the other dimension is peak intensity, reflecting the propensity of product ions generated from one or more precursor ions at each time point. The temporal profile of a specific fragment ion, depicted as an extracted ion chromatogram (EIC), electropherogram, or arrival time distribution (ATD), results from the collective contributions of its precursor ions. The weighting of each precursor’s contribution is influenced by its relative abundance in the mixture, as well as the fragmentation efficiency and distribution of the fragmentation pathways associated with each precursor. To identify individual isomers, an algorithm is required to isolate their spectra from the output matrix.

Discovering hidden structures or features within non-negative data is a common problem in data analysis. For example, the hidden features learned from a dataset of face images could be individual facial components, such as the mouth, eyes, and nose. Non-negative matrix factorization (NMF) offers a popular solution by approximating the original data matrix as the product of two non-negative matrices, each of lower rank. Initially introduced by Paatero and Tapper [7] with an additive update rule for the lower-ranked matrices, NMF gained wider recognition following the proposal of a multiplicative update algorithm by Lee and Seung [8]. NMF has found applications in diverse fields, including image processing, text analysis, and gene expression analysis.

NMF can be formally explained as follows: given a data matrix $X$ of size $N \times M$ , where each column of $X$ represents an individual sample, we seek non-negative matrices $W$ and $Y$ of sizes $N \times R$ and $R \times M$ , respectively, whose product $WY$ minimizes the following objective function:

$\min_{W, Y \geq 0} F (W, Y) = {‖X - WY‖}_{F}^{2}$

In this formulation, $R$ is a user-defined parameter, typically much smaller than the minimum of $N$ and $M$, ensuring that the rank of $WY$ is at most $R$. The columns of $W$ represent distinct hidden features embedded within the dataset. Each column of $Y$, corresponding to an individual data sample, can be viewed as an encoding indicating the prominence of each given feature (a column of $W$) in that sample. Thus, the $j$th data sample/column is approximated as $X^{j} \approx \sum_{i = 1}^{R} W^{i} Y_{i}^{j}$. Here, $W^{i} Y_{i}^{j}$ represents how much the $i$th hidden feature $W^{i}$ contributes to the $j$th sample $X^{j}$. As the approximation of the data matrix is usually of lower rank, NMF also serves as a data compression technique. There exist many algorithms to minimize $F (W, Y)$, with the primary difference lying in the update rules for $W$ and $Y$.
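As an illustration of one such update rule, the following is a minimal sketch of the multiplicative updates of Lee and Seung [8] for the objective above. It assumes NumPy; the function names, the iteration count, and the small constant `eps` that guards against division by zero are illustrative choices rather than settings used in this work.

```python
import numpy as np

def nmf_multiplicative(X, R, n_iter=500, eps=1e-12, seed=0):
    """Approximate a non-negative N x M matrix X by W (N x R) times Y (R x M)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    W = rng.random((N, R)) + eps
    Y = rng.random((R, M)) + eps
    for _ in range(n_iter):
        # Each factor is multiplied by a non-negative ratio, so W and Y
        # stay non-negative without any explicit projection.
        Y *= (W.T @ X) / (W.T @ W @ Y + eps)
        W *= (X @ Y.T) / (W @ Y @ Y.T + eps)
    return W, Y

def frobenius_loss(X, W, Y):
    """F(W, Y): squared Frobenius norm of the reconstruction error."""
    return float(np.linalg.norm(X - W @ Y, "fro") ** 2)
```

The $j$th column of `W @ Y` is then the sum over $i$ of `W[:, i] * Y[i, j]`, matching the per-sample approximation given above.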

In many applications, it is preferable for $W$ and $Y$ to be sparse. $W$ should contain columns of non-holistic representations of hidden features, while each column of $Y$ encodes the corresponding data sample using as few learned features from $W$ as possible. Sparse coding is a highly valuable property of NMF, as it facilitates easy interpretation of the encoding. Various approaches explicitly incorporate sparseness constraints into the NMF formulation, such as the $L_{p}$ norms of both $W$ and $Y$ matrices [9]. In some applications, users may have domain-specific knowledge regarding the hidden features that can be expressed as parametric kernels whose parameters should be determined by data. For example, the “shapes” of hidden features may resemble Gaussian distributions with unknown means and variances. Such prior knowledge can be used effectively to handle noisy data and produce more interpretable results. Nevertheless, existing NMF variations do not offer adequate methods for leveraging prior knowledge.

Here we introduce a new approach termed kernel component composition (KCC), which encodes prior knowledge of hidden features as parametric kernels. While kernels are used in techniques such as kernel principal component analysis (PCA) [10] and kernel NMF [11], KCC differs significantly in its implementation. Unlike kernel PCA and kernel NMF, which employ pre-fixed kernels to transform inputs, KCC applies kernels to components and learns their unknown parameters from data. We also present a proximal style algorithm to solve the KCC problem. To illustrate the effectiveness of KCC, we first compare its performance against other algorithms using synthetic data. Subsequently, we demonstrate its utility in addressing a source separation problem in chemistry, examining outcomes from LC–MS/MS and IM-MS/MS analyses of isomeric glycan mixtures.

Methods

Theoretical methods

KCC minimizes the following loss function:

$ℓ (Y, Θ) = {‖X - f (Θ) Y‖}_{F}^{2} + λ {‖Y‖}_{1}$, subject to $Y \geq 0$ and $D_{i} \geq Θ_{i} \geq 1$, (1)

where

$X$ is an $N \times M$ matrix containing $M$ spectra of length $N$ .
$Θ \in ℝ^{R \times S}$ contains the parameters of the kernels, where $R$ is the number of kernels and $S$ is the number of parameters in each kernel. For example, for a Gaussian kernel, each row of $Θ$ contains two parameters (the mean and standard deviation of the Gaussian).
The function $f (Θ)$ produces an $N \times R$ kernelized matrix, where the $r$th column is the discretization of the kernel profile defined by the parameters specified in the $r$th row of $Θ$.
$D_{i}$, where $i \in \{1, 2, \dots, S\}$, are user-defined constants that specify the maximal values of the elements in the $i$th column of $Θ$. For example, for the LC–MS/MS data presented below, $D_{1}$ (upper bound of the mean) is the largest retention time index, and $D_{2}$ (upper bound of the standard deviation) may be determined based on a realistic estimate of the maximal chromatographic peak width. Note that for the simulated data, the lower bounds of all parameters are set to 1, but these can also be adjusted for real data.
$Y$ is a non-negative $R \times M$ matrix indicating how $X$ can be reconstructed using $f (Θ)$ .
$λ$ is a non-negative hyperparameter for incorporating the $L_{1}$ norm on $Y$ to promote sparseness (larger $λ$ leads to sparser $Y$ ).
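To make the definitions above concrete, the following sketch evaluates the loss for Gaussian kernels. It assumes NumPy, a time grid indexed from 1 to $N$, and Gaussians scaled to unit peak height; the text does not specify how the discretized kernels are normalized, so that scaling and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(theta, N):
    """f(Theta): N x R matrix whose r-th column is a discretized Gaussian.

    theta is an R x 2 array with rows (mean, standard deviation), expressed
    in index units on the grid 1..N. Unit peak height is an assumption.
    """
    t = np.arange(1, N + 1, dtype=float)[:, None]   # N x 1 time grid
    mu = theta[:, 0][None, :]                        # 1 x R means
    sigma = theta[:, 1][None, :]                     # 1 x R standard deviations
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2)

def kcc_loss(X, Y, theta, lam):
    """Squared Frobenius reconstruction error plus lam times the L1 norm of Y."""
    K = gaussian_kernel_matrix(theta, X.shape[0])
    return float(np.sum((X - K @ Y) ** 2) + lam * np.sum(np.abs(Y)))
```

With a perfect reconstruction, only the sparsity penalty remains, so the loss reduces to $λ$ times the sum of the entries of $Y$.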

The goal is to find $Θ$ and $Y$ that minimize $ℓ (Y, Θ)$ . We developed the following algorithm to find suboptimal solutions. First, we randomly initialize $Θ^{0} \in ℝ^{R \times S}$ and $Y^{0} \in ℝ^{R \times M}$ such that $1 \leq Θ_{i}^{0} \leq D_{i}$ and $Y^{0} \geq 0$ . We then iterate the following two steps:

1. Fixing $Θ^{t}$, obtain $Y^{t + 1} = \underset{Y}{argmin} ℓ (Y, Θ^{t})$, subject to $Y \geq 0$.
2. Fixing $Y^{t + 1}$, use gradient descent to obtain $Θ^{t + 1}$ that reduces $ℓ (Y^{t + 1}, Θ)$, subject to $1 \leq Θ_{i} \leq D_{i}$.

The problem in step 1 is equivalent to LASSO [12] with non-negative constraint.
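The two alternating steps can be sketched as follows for Gaussian kernels. This is a generic illustration under stated assumptions, not the derived update rules of this work: step 1 is solved approximately by proximal gradient iterations with step size $1/L$, step 2 takes a single projected gradient step per outer iteration, and the kernels have unit peak height. All function names, `n_inner`, and the default settings are hypothetical.

```python
import numpy as np

def gaussian_kernels(theta, N):
    """N x R matrix of unit-height Gaussians; theta rows are (mean, sd)."""
    t = np.arange(1, N + 1, dtype=float)[:, None]
    return np.exp(-0.5 * ((t - theta[:, 0]) / theta[:, 1]) ** 2)

def update_Y(X, K, Y, lam, n_inner=200):
    """Step 1: non-negative LASSO, solved approximately by proximal gradient."""
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_inner):
        grad = 2.0 * K.T @ (K @ Y - X)
        # Prox of lam * ||Y||_1 restricted to Y >= 0: shift, then clip at zero.
        Y = np.maximum(0.0, Y - step * (grad + lam))
    return Y

def update_theta(X, Y, theta, D, lr=1e-3):
    """Step 2: one gradient step on means and widths, projected onto [1, D]."""
    N = X.shape[0]
    t = np.arange(1, N + 1, dtype=float)[:, None]
    mu, sigma = theta[:, 0], theta[:, 1]
    K = gaussian_kernels(theta, N)
    G = 2.0 * (K @ Y - X) @ Y.T                      # d(loss)/dK, N x R
    d_mu = np.sum(G * K * (t - mu) / sigma ** 2, axis=0)
    d_sigma = np.sum(G * K * (t - mu) ** 2 / sigma ** 3, axis=0)
    new = theta - lr * np.stack([d_mu, d_sigma], axis=1)
    return np.clip(new, 1.0, np.asarray(D, dtype=float))

def kcc(X, R, D, lam=0.5, n_iter=100, lr=1e-3, seed=0):
    """Alternate the two steps from a random feasible initialization."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    D = np.asarray(D, dtype=float)
    theta = 1.0 + rng.random((R, 2)) * (D - 1.0)     # 1 <= theta_i <= D_i
    Y = rng.random((R, M))                            # Y >= 0
    for _ in range(n_iter):
        Y = update_Y(X, gaussian_kernels(theta, N), Y, lam)
        theta = update_theta(X, Y, theta, D, lr)
    return theta, Y
```

In this sketch the soft-threshold-and-clip operation in `update_Y` handles both the $L_{1}$ penalty and the non-negativity constraint, while clipping in `update_theta` keeps every parameter within its bounds. A fixed learning rate does not by itself guarantee a decrease of the loss in step 2, so a line search or a smaller rate may be needed in practice.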

Materials

The glycan standards lacto-N-fucopentaose (LNFP) I, II, and III were sourced from V-Labs, Inc. (Covington, LA, USA). Maltotriose, isomaltotriose, LNFP V, and LNFP VI were acquired from Biosynth Limited (San Diego, CA, USA). Major Mix IMS/Tof Calibration Kit was obtained from Waters Inc. (Milford, MA, USA). HPLC grade water, acetonitrile (ACN), chloroform, and formic acid were purchased from Fisher Scientific (Pittsburgh, PA, USA). All other chemicals, including methyl iodide, dimethyl sulfoxide (DMSO), sodium borodeuteride (NaBD₄), sodium acetate, and acetic acid, were acquired from MilliporeSigma (St. Louis, MO, USA).

Sample preparation

LNFP glycan standards were subjected to deutero-reduction and permethylation using procedures described in detail elsewhere [13]. Maltotriose and isomaltotriose were labeled with an ¹⁸O isotope at the reducing end, following the protocol outlined in a prior study [14].

For LC–MS/MS analysis, the five-isomer mixture was prepared with an equal concentration of each deutero-reduced and permethylated LNFP isomer, dissolved in a 50:50 ACN:water solution. For IM-MS/MS analysis, the ¹⁸O-labeled maltotriose and isomaltotriose were analyzed either individually or in a 1:1 molar ratio, at a concentration of 5 pmol/µl in a 50% ACN solution containing 0.1% formic acid.

LC- and IM-MS/MS analyses

The LC–MS/MS data of the five-LNFP isomer mixture were acquired in a previous study [13] using a solariX hybrid Qh-Fourier transform ion cyclotron resonance (FTICR) mass spectrometer (Bruker Daltonics, Bremen, Germany) equipped with a TriVersa nanoMate ion source (Advion, Ithaca, NY, USA) and a nanoACQUITY UPLC system (Waters, Milford, MA, USA). On-line reversed-phase (RP)-LC separation was conducted with a nanoACQUITY UPLC Peptide BEH C18 column at 60 °C. Electronic excitation dissociation (EED) was performed with the cathode bias set between 16 and 18 V. Further details regarding the LC separation conditions, tandem MS analysis, and data processing can be found in the previous report [13].

IM-MS/MS analysis was performed on a Select Series cyclic ion mobility (cIM) mass spectrometer (Waters Inc., Wilmslow, UK), using a nano-ESI source in direct infusion positive ionization mode. The IM separation employed a traveling wave (TW) velocity of 375 m/s, and a TW static height of 18 V. The mass spectra were acquired with 3 pushes per bin and 200 bins. Nitrogen gas (1.734 mbar) was used as the collision gas. The separation time was set to 95 ms for multi-pass cIM analysis. For IM-MS² analysis, collision-induced dissociation (CID) was performed at a collision energy of 40 eV after IM separation.

Data preprocessing

The LC–MS/MS data were internally calibrated using several fragment ions assigned with high confidence, achieving a typical mass accuracy of < 1 ppm. Isotopic clusters were identified with the SNAP algorithm (Bruker Daltonics, Bremen, Germany), applying a quality factor threshold of 0.1, a minimum relative ion abundance of 0.01%, an $S / N$ cutoff of 5, and a maximum charge state of 4. To reduce noise, a median filter with a window size of 3 was applied, and fragment abundances lower than 1% of the maximum, excluding the MS² precursor, were set to 0. After noise filtering, spectra without any detectable signals were excluded from further analysis.
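The noise-filtering steps can be sketched as below. The text does not state along which axis the median filter runs, how the ends of a profile are treated, or whether the 1% cutoff refers to a global or per-spectrum maximum; this sketch assumes an ions-by-scans matrix, filtering along the scan axis with edge replication, and a global maximum taken over the fragment rows only. The function names are hypothetical.

```python
import numpy as np

def median_filter3(row):
    """Median filter with window size 3; ends are padded by repeating the edge value."""
    padded = np.pad(row, 1, mode="edge")
    windows = np.stack([padded[:-2], padded[1:-1], padded[2:]])
    return np.median(windows, axis=0)

def denoise(X, precursor_row, rel_threshold=0.01):
    """X is an ions x scans matrix. Returns the filtered matrix and kept scan indices."""
    F = np.apply_along_axis(median_filter3, 1, X.astype(float))
    frag = np.ones(X.shape[0], dtype=bool)
    frag[precursor_row] = False                      # exclude the MS2 precursor
    cutoff = rel_threshold * F[frag].max()
    F[frag] = np.where(F[frag] < cutoff, 0.0, F[frag])
    # Keep only scans in which at least one fragment survives the filtering.
    keep = np.flatnonzero(F[frag].sum(axis=0) > 0)
    return F[:, keep], keep
```

Calling `denoise(X, precursor_row=k)` returns both the reduced matrix and the indices of the retained scans, so the retained columns can be mapped back to scan numbers.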

For IM-MS² analysis, raw datasets were processed using a custom Python script to extract the ATDs of precursor and selected fragment ions. The extracted ATDs were converted to csv files and used to generate the IM-MS² data matrices. Ion signals weaker than 0.1% of the maximum intensity were removed as noise.

Results and discussion

Simulation tests

To demonstrate the effectiveness of our model and algorithm, we initially conducted tests on synthetic data with known ground truth. The simulation tests were performed 100 times, and the results are summarized below. The average time per trial was 105.46 s in MATLAB, utilizing one core of an Intel® Xeon® E5-2637 v3 CPU (3.5 GHz, 15 MB cache). In each test, we generated a 300 × 200 data matrix $X$. Each column of $X$ constituted a linear mixture of five Gaussians with added random noise. The mean and standard deviation of the $i$th Gaussian were respectively generated as $μ_{i} = 45 i + 20 b_{i}$, where $b_{i} \sim U ([0, 1])$, and $σ_{i} = 12 + 5 a_{i}$, where $a_{i} \sim U ([0, 1])$. The mixing weights of the Gaussians in the $j$th data sample (i.e., the $j$th column of $X$) were generated as $w_{i j} = 100 z_{i j}$, where $1 \leq i \leq 5$ and $z_{i j} \sim U ([0, 1])$. The noise was randomly generated as $0.05 e_{i j}$, where $e_{i j} \sim U ([0, 1])$. This setting makes the Gaussians overlap heavily, so that it is challenging to recover the true means and standard deviations from the simulated data. Figure 1 shows a typical example of the simulated data.
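A data matrix of this kind can be generated with the short sketch below. It assumes NumPy, a time grid indexed from 1 to 300, Gaussians of unit peak height, and independent noise on every entry of $X$; the peak scaling and the function name `simulate` are assumptions, since the text does not spell them out.

```python
import numpy as np

def simulate(N=300, M=200, R=5, seed=0):
    """Return X (N x M) and the ground-truth means, widths, and mixing weights."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, R + 1)
    mu = 45 * i + 20 * rng.random(R)                 # mu_i = 45 i + 20 b_i
    sigma = 12 + 5 * rng.random(R)                   # sigma_i = 12 + 5 a_i
    t = np.arange(1, N + 1, dtype=float)[:, None]
    G = np.exp(-0.5 * ((t - mu) / sigma) ** 2)       # N x R, unit height (assumed)
    weights = 100 * rng.random((R, M))               # w_ij = 100 z_ij
    noise = 0.05 * rng.random((N, M))                # 0.05 e_ij
    return G @ weights + noise, mu, sigma, weights
```

Neighboring means are separated by 25 to 65 index units while the standard deviations lie between 12 and 17, which is what produces the heavy overlap.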

Fig. 1 — The top plot shows five Gaussians used in one typical simulation test. The bottom plots show four simulated samples (i.e., four columns in $X$ )

The KCC algorithm was executed with the following parameters: $λ = 0.5$ , a maximum of 1000 iterations, a maximum of 100 components, and a fixed learning rate of 0.001 for the gradient descent optimization (step 2). The components $f (Θ)$ and $Y$ were initialized randomly. Alternatively, better initializations could enhance learning speed. For example, applying NMF on the data initially and using derived statistics from NMF results to initialize the kernels could potentially expedite convergence. Figure 2 illustrates a typical learning result of KCC.

Fig. 2 — A typical simulation test result of KCC, with settings described in the main text, showing the decomposition results of two samples in the dataset. The five kernels learned by KCC are indicated by the five Gaussians in each plot, scaled by their contributions. The approximations (red curves) by using the learned kernels match the samples (blue dashed curves)

The following procedure was used to evaluate how well KCC can recover the parameters of Gaussians used to simulate the data in each trial. Initially, we identified the maximum absolute value element (MAVE) in each row of the output $Y$ matrix and selected the learned components corresponding to the largest five MAVEs. The average MAVE of the 5th selected component is 69.0171 higher than that of the 6th learned component. Given that the highest possible simulated weight of a component is 100, this difference indicates that the selected top five components effectively describe the data. Subsequently, we matched these selected components with the ground-truth Gaussians by minimizing the sum of the absolute discrepancy between the mean of each selected component and that of its match. Using this matching result, we also calculated the sum of the absolute difference between the standard deviation of each component and that of its corresponding ground-truth Gaussian. Over 100 tests, the average error in the mean parameter is 2.609, and the average error in the standard deviation parameter is 0.889, or around 7.4%. We thus conclude that the top five components learned by KCC closely resemble the ground-truth Gaussians.
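The selection and matching procedure can be sketched as follows. The matching is done here by exhaustive search over permutations, which is feasible for five components (120 permutations); the text does not say which matching method was actually used, and the function name `evaluate` is hypothetical.

```python
from itertools import permutations
import numpy as np

def evaluate(theta, Y, true_mu, true_sigma):
    """Return the summed absolute errors in the means and standard deviations."""
    k = len(true_mu)
    mave = np.max(np.abs(Y), axis=1)                 # one MAVE per learned component
    top = np.argsort(mave)[::-1][:k]                 # components with the largest MAVEs
    mu, sigma = theta[top, 0], theta[top, 1]

    def mean_error(perm):
        return sum(abs(mu[j] - true_mu[p]) for j, p in enumerate(perm))

    # Choose the assignment that minimizes the summed discrepancy in the means.
    best = min(permutations(range(k)), key=mean_error)
    sigma_error = sum(abs(sigma[j] - true_sigma[p]) for j, p in enumerate(best))
    return float(mean_error(best)), float(sigma_error)
```

For larger numbers of components, an assignment solver would scale better than enumerating permutations.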

The performance of the algorithm depends on the extent of overlap between Gaussians. For example, if we significantly increase the overlap by setting $μ_{i} = 20 i + 10 b_{i}$ where $b_{i} \sim U ([0, 1])$ and keeping $σ_{i} = 12 + 5 a_{i}$ where $a_{i} \sim U ([0, 1])$ , the average error of the mean parameter increases by 60% to 4.18, and the average error of the standard deviation parameter increases slightly to 0.923. Supplementary Fig. 1 shows a typical example of the simulated data and decomposition results by KCC. Given the severe overlap between neighboring peaks in this second set of synthetic data, such performance is satisfactory.

The ability of the algorithm to recover small components depends on both the strength of the component and how heavily it overlaps with major components. A minor component can be easily detected if it does not overlap with major components and is stronger than the user-defined noise level (see, for example, the detection of minor components of maltotriose in Fig. 6). If the overlap with major components is significant, the algorithm is more likely to miss the minor component. However, for isomer analysis, the presence of isomer-specific fragments can greatly improve identification of a minor component, even when it has significant overlap with a major component (see, for example, trace 96 in Fig. 5 that facilitates the detection of the blue component near the purple component in Fig. 4). The detection of minor components could be affected by initializations and noise level. For example, if all kernels are initialized close to major components, kernels supposed to capture minor components may be trapped by the residuals corresponding to the major components during the optimization process.

Fig. 6 — A The ATDs of the precursors of isomaltotriose (left), maltotriose (middle), and their mixture (right). B The learned kernels provide accurate approximations for the ATDs of the fragments produced from the mixture of isomaltotriose and maltotriose. The title of each plot indicates the *m/z* value of the corresponding fragment

Fig. 5 — The approximations of the profiles of the 34th, 96th, 113th, and 180th ions using the learned kernels shown in Fig. 4

Fig. 4 — The preprocessed LC-EED-MS/MS profile and its KCC results. A Each sample along the retention time axis represents the temporal profile, or EIC, of a product ion. The blue curve is the 257th temporal profile corresponding to the EIC of all precursors. B The KCC results on the 257th profile

Figure 3 compares the results of KCC with those of PCA and classic NMF. The PCA components exhibit negative signals, which, while mathematically valid and a natural consequence of maximizing variance, are unrealistic for real data where negative values lack meaningful interpretation. Meanwhile, the factors learned by classic NMF resemble the Gaussians used to simulate the data, but they appear as mixtures of Gaussians rather than capturing the essence of the data, where each component should be represented by a single Gaussian.

Fig. 3 — The analysis results of PCA (top, top five principal components) and classic NMF (bottom)

Component deconvolution from LC–MS/MS data

Given that KCC outperformed PCA and NMF on synthetic data, we next evaluated its utility in characterizing isomeric glycan mixtures. In the first example, KCC was applied to analyze LC–MS/MS data from a mixture of five isomeric pentasaccharides: LNFP I, II, III, V, and VI (structures shown in Scheme 1). For LC–MS/MS analysis, all glycans were analyzed in their deutero-reduced and permethylated form as sodium adducts. Deutero-reduction eliminated potential anomerism-induced chromatographic peak splitting and introduced a stable isotope label that aids in the differentiation of reducing-end and non-reducing-end fragments. Permethylation increased glycan hydrophobicity, thus enhancing separation on RPLC, and facilitated spectral interpretation by distinguishing internal and terminal fragments [15, 16]. Sodium adduction was used to minimize gas-phase glycan structure rearrangements [17], such as proton-mediated fucose migration, and to promote cross-ring fragmentation, which is crucial for distinguishing linkage isomers. EED was selected as the ion activation mode due to its ability to produce more informative tandem mass spectra for isomer differentiation [18, 19].

Scheme 1 — Structures of the isomeric glycans used in this study, represented according to the Symbol Nomenclature for Glycans (SNFG) convention

The preprocessed LC-EED-MS/MS data matrix, denoted as $X_{chem}$, is visualized in Fig. 4A. Each sample along the retention time axis represents the temporal profile, or EIC, of a product ion. The 257th temporal profile (blue curve) corresponds to the EIC of all precursors. Because each reduced glycan exhibits a single peak during chromatographic elution, the EIC of each isomer can be effectively represented by a single Gaussian kernel. Considering that a product ion may be unique to a specific isomer or common among several isomers, each sample can be modeled using either a single Gaussian kernel or a linear combination of several kernels. Note that the actual m/z value of each observed ion is not pertinent for the purpose of component identification. The values on the “mass” axis represent indices of identified isotopic clusters. Similarly, the values on the “time” axis indicate the scan number within the elution time window where precursor ions were detected, rather than the actual retention time. However, since the LC–MS/MS experiment was performed with a fixed acquisition time for each MS/MS scan, the retention time scales linearly with the scan number. This linear scaling ensures that the scan number effectively represents the retention time, facilitating accurate temporal profiling of the product ions.

Figure 4B shows that KCC learned five kernels from $X_{chem}$ , which can be used to accurately approximate the EIC of the precursors. The effective extraction of the EICs of all five isomeric glycans as single Gaussians from the matrix suggests that each isomer can generate distinct fragmentation patterns, enabling KCC to separate their signals based on fragment ion EICs.

Figure 5 shows the EICs of four fragment ions along with their fits using the five kernels identified by KCC. The fragment ion at m/z 268.116 resulted from C/Z double cleavages at an N-acetylglucosamine (GlcNAc) residue. This fragment ion can only be produced by LNFP I, V, and VI, most abundantly when the GlcNAc residue is connected at its non-reducing end via a 1 → 3-linkage (LNFP I and V). The fragment at m/z 442.205 corresponds to a C/Z fragment ion containing one fucose (Fuc) residue and one GlcNAc residue. This is most abundantly produced at a branched GlcNAc site that carries one fucose branch and another branch linked to its 3-position, as seen in LNFP II. The fragment at m/z 472.216 is an internal fragment ion of the C/Z or B/Y type containing one galactose (Gal) residue and one GlcNAc residue, which can be generated by all five isomers. Finally, the fragment at m/z 654.341 corresponds to a Y ion containing one Fuc and two hexose residues, which can only be produced by LNFP V and VI. Note that the minor peak centered around retention time index 1980 is not present in any other EICs, including that of the precursor (Fig. 4B). It is likely due to the presence of a contaminant, and not associated with the purple component, as evidenced by its visible shift from kernel 3. Thus, from the EICs of the m/z 268 and m/z 654 fragments, we can assign the common component (cyan trace) to LNFP V, the m/z 654-specific component (green trace) to LNFP VI, and the m/z 268-specific component (red trace) to LNFP I. Similarly, the m/z 442 fragment shows only one major component in its EIC (blue trace), which can be assigned to LNFP II, while the remaining component (purple trace) is assigned to LNFP III.

Component deconvolution from IM-MS/MS data

Component analysis of IM-MS/MS data presents an additional challenge due to the potential existence of multiple gas-phase conformations for a single isomer [6]. If the ATD of each conformer can be modeled by a single Gaussian kernel, fitting the ATD of each isomer may require summing multiple Gaussians. Since the exact number of Gaussians needed to accurately describe the ATD of each isomer is unknown, establishing a general parametric kernel for its ATD becomes impractical. In contrast, employing KCC at the conformation level allows for the use of a single Gaussian kernel to approximate the ATD of each conformer.

We next performed KCC analysis on the IM-MS/MS profiles of a mixture of maltotriose (Mal) and isomaltotriose (IsoMal). Figure 6A shows the ATDs of IsoMal, Mal, and their mixture. While isomaltotriose exhibits a single peak in its ATD at this mobility resolution, maltotriose shows at least three distinct mobility peaks. Figure 6B illustrates the learned kernels and their contributions to the ATDs of various diagnostic ions generated by isomaltotriose, maltotriose, and their mixture. From the IM-MS/MS data of the IsoMal/Mal mixture, four kernels were identified. Kernel 3 (red trace) is attributed to isomaltotriose, as it is the only component observed in the ATD of the fragments at m/z 275.1 ($^{0,3}A_{2}$) and m/z 437.2 ($^{0,3}A_{3}$). The $^{0,3}A$ cross-ring cleavage at the reducing end represents a characteristic fragmentation pathway for 1 → 6 and 1 → 3-linked residues, resulting in an M-90 neutral loss fragment [20]; in this instance, it generates an M-92 fragment due to the ¹⁸O-labeling at the reducing end. Therefore, the other three kernels must correspond to maltotriose. These assignments were supported by the ATD of the precursor ions for the individual maltotriose and isomaltotriose standards.

For maltotriose, the fragmentation patterns of the three kernels differ considerably, and they can be categorized into two distinct groups: one group consists of the major component (kernel 4, black trace), while the other group includes the two minor components (kernel 1, blue trace, and kernel 2, purple trace). Reducing-end fragments, such as the Y₁ ion at m/z 205.1 and the Y₂ ion at m/z 367.1, along with most neutral loss fragments like the M-62 ion at m/z 467.2, exhibit a mobility profile resembling that of the precursor, with a strong presence of the major component. In contrast, non-reducing-end fragments, including the B₁ ion at m/z 185.1, the B₂ ion at m/z 347.1, and the C₂ ion at m/z 365.1, show markedly higher contributions from the minor components. One possible explanation for these differences could be the location of sodium cation binding sites. In the major component, the sodium cation may preferentially bind near the reducing end, enhancing cross-ring fragmentation at the reducing-end residue and facilitating the detection of the reducing-end fragments, while in the minor components it tends to bind near the non-reducing end, favoring the detection of the non-reducing-end fragments. Interestingly, the M-H₂¹⁸O fragment at m/z 509.2 also displayed significantly higher contributions from the minor components. A theoretical study on the fragmentation of sodiated glucose reveals that the energy barrier for dehydration is lower in α-glucose than in β-glucose [21]. Thus, we tentatively assign the major component to the β-anomer, and the minor components to the α-anomer. Additionally, the α-anomer may have a higher conformational flexibility, which contributes to the observation of two peaks in its ATD. Finally, the difference in the anomeric configuration could also affect the preferential sodium binding site and influence the relative propensity of other fragmentation pathways as discussed above.

The significant impact of the anomeric configuration on the CID fragmentation of sodium-adducted native glycans contrasts with the findings from a previous IM-MS/MS study, which showed similar EED fragmentation behaviors for different conformations of individual permethylated glycan isomers [6]. This discrepancy may stem from the charge-remote nature of the EED process [22], where the location of the charge carrier has little effect on the fragmentation pathways. It is also worth noting that permethylation would have prevented ring-opening at the reducing end, thereby blocking fragmentation pathways via retro-aldol-type reactions. For native glycans, a recent study employing ultrahigh-resolution IMS and cryogenic ion spectroscopy showed that, while mutarotation between α- and β-anomers readily occurs in solution, it was not observed in the gas phase during a prolonged ion mobility separation [23]. Thus, each anomer would remain locked in its configuration during IM-MS/MS analysis, which could lead to different CID fragmentation behaviors, as observed here. In contrast, interconversion between conformers of the same anomeric configuration likely involves a lower energy barrier, resulting in similar fragmentation patterns, as seen for the two minor conformations of maltotriose (the blue and purple components in Fig. 6B). This result suggests that, for IM-CID-MS/MS analysis of native glycan mixtures, the KCC-learned kernels should not be grouped at the isomer level; instead, they should be grouped at the anomeric level.

The ultimate goal of deconvoluting LC- or IM-MS/MS data, beyond merely identifying individual components from overlapping peaks, is to generate deconvoluted tandem mass spectra for each component. This information is embedded in the $Y$ matrix; specifically, the deconvoluted tandem mass spectrum of the $i$th kernel can be reconstructed using the elements in the $i$th row of the $Y$ matrix. Note that the deconvoluted spectra contain only the m/z channels used for KCC analysis and may exclude low-abundance fragments with intensity below the specified noise threshold. Supplementary Fig. 2 displays the deconvoluted tandem mass spectra of the four KCC-extracted components of the isomaltotriose/maltotriose mixture, as obtained from the IM-CID-MS/MS analysis. As expected, the 1 → 6-linkage specific M-92 fragment is present only in the tandem mass spectrum of kernel 3, allowing it to be identified as isomaltotriose. Additionally, spectra 1 and 2 feature prominent M-H₂O fragments, while also showing an increased production of non-reducing-end fragments C₁, C₂, and $^{2,4}A_{2}$, which can be used to infer the anomeric configuration and/or potential cation binding sites, as discussed earlier.
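Reading the spectra off the rows of $Y$ can be sketched as follows. The sketch assumes that the columns of $Y$ are ordered like a list of fragment m/z values supplied by the user, and it normalizes each row to its base peak for display; both the normalization and the function name are illustrative assumptions.

```python
import numpy as np

def deconvoluted_spectra(Y, mz):
    """Return, for each kernel, a list of (m/z, relative abundance in %) pairs."""
    spectra = []
    for row in Y:
        base = row.max()
        # Scale to the base peak; an all-zero row yields an empty spectrum.
        rel = row / base * 100 if base > 0 else np.zeros_like(row)
        spectra.append([(m, float(a)) for m, a in zip(mz, rel) if a > 0])
    return spectra
```

Because all entries in a row share the same kernel, relative abundances within a row do not depend on how that kernel is scaled.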

Conclusions

We present a novel non-negative matrix factorization approach, KCC, for identifying hidden features in data. The features are modeled using parametric kernels, enabling the incorporation of domain-specific prior knowledge, such as peak shapes. The objective function of KCC is optimized using a proximal gradient descent style algorithm, under which the objective is theoretically guaranteed to decrease on each iteration, both in general and more specifically when applying Gaussian kernels with unknown parameters. KCC outperforms traditional methods like PCA and classic NMF in uncovering Gaussian kernels in synthetic data. We also demonstrate the utility of KCC in analytical chemistry by deconvoluting LC–MS/MS and IM-MS/MS data of glycan mixtures with isomeric structures. For LC–MS/MS analysis of reduced and permethylated glycans, KCC successfully identifies each isomer as a distinct kernel, even for isomers with significantly overlapping elution times. For IM-MS/MS analysis of native glycans, each conformation is effectively modeled by a Gaussian kernel, which can then be further grouped based on its anomeric configuration.

Supplementary Material

Supplementary Information: NIHMS2110862-supplement-Supplementary_Information.docx (121.2 KB, docx)

The online version contains supplementary material available at https://doi.org/10.1007/s00216-025-05777-4.

Acknowledgements

This work was supported by the Massachusetts Life Sciences Center and by National Institutes of Health grants R01 GM132675, R24 GM134210, and S10 RR025082.

Footnotes

Published in the topical collection featuring Current Progress in Glycosciences and Glycobioinformatics with guest editors Joseph Zaia and Kiyoko F. Aoki-Kinoshita.

Declarations

Competing interests

The authors declare no competing interests.

References

1. Veillon L, Huang Y, Peng W, Dong X, Cho BG, Mechref Y. Characterization of isomeric glycan structures by LC-MS/MS. Electrophoresis. 2017;38:2100–14.
2. Wei J, Tang Y, Bai Y, Zaia J, Costello CE, Hong P, Lin C. Toward automatic and comprehensive glycan characterization by online PGC-LC-EED MS/MS. Anal Chem. 2019;92:782–91.
3. Zhou X, Song W, Novotny MV, Jacobson SC. Fractionation and characterization of sialyl linkage isomers of serum N-glycans by CE–MS. J Sep Sci. 2022;45:3348–61.
4. Stickney M, Sanderson P, Leach FE III, Zhang F, Linhardt RJ, Amster IJ. Online capillary zone electrophoresis negative electron transfer dissociation tandem mass spectrometry of glycosaminoglycan mixtures. Int J Mass Spectrom. 2019;445:116209.
5. Chen Z, Glover MS, Li L. Recent advances in ion mobility–mass spectrometry for improved structural characterization of glycans and glycoconjugates. Curr Opin Chem Biol. 2018;42:1–8.
6. Wei J, Tang Y, Ridgeway ME, Park MA, Costello CE, Lin C. Accurate identification of isomeric glycans by trapped ion mobility spectrometry-electronic excitation dissociation tandem mass spectrometry. Anal Chem. 2020;92:13211–20.
7. Paatero P, Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994;5:111–26.
8. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–91.
9. Hoyer PO. Non-negative matrix factorization with sparseness constraints. J Mach Learn Res. 2004;5:1457–69.
10. Mika S, Schölkopf B, Smola A, Müller K-R, Scholz M, Rätsch G. Kernel PCA and de-noising in feature spaces. Adv Neural Inf Process Syst. 1998;11:536–42.
11. Li Y, Ngom A. A new kernel non-negative matrix factorization and its application in microarray data analysis. In: 2012 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). 2012. p. 371–8.
12. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58:267–88.
13. Tang Y, Wei J, Costello CE, Lin C. Characterization of isomeric glycans by reversed phase liquid chromatography-electronic excitation dissociation tandem mass spectrometry. J Am Soc Mass Spectrom. 2018;29:1295–307.
14. Hong P, Sun H, Sha L, Pu Y, Khatri K, Yu X, Tang Y, Lin C. GlycoDeNovo – an efficient algorithm for accurate de novo glycan topology reconstruction from tandem mass spectra. J Am Soc Mass Spectrom. 2017;28:2288–301.
15. Costello CE, Contado-Miller JM, Cipollo JF. A glycomics platform for the analysis of permethylated oligosaccharide alditols. J Am Soc Mass Spectrom. 2007;18:1799–812.
16. Zhou S, Dong X, Veillon L, Huang Y, Mechref Y. LC-MS/MS analysis of permethylated N-glycans facilitating isomeric characterization. Anal Bioanal Chem. 2017;409:453–66.
17. Harvey DJ, Mattu TS, Wormald MR, Royle L, Dwek RA, Rudd PM. “Internal residue loss”: rearrangements occurring during the fragmentation of carbohydrates derivatized at the reducing terminus. Anal Chem. 2002;74:734–40.
18. Yu X, Jiang Y, Chen Y, Huang Y, Costello CE, Lin C. Detailed glycan structural characterization by electronic excitation dissociation. Anal Chem. 2013;85:10017–21.
19. Wei J, Papanastasiou D, Kosmopoulou M, Smyrnakis A, Hong P, Tursumamat N, Klein JA, Xia C, Tang Y, Zaia J, et al. De novo glycan sequencing by electronic excitation dissociation MS²-guided MS³ analysis on an Omnitrap-Orbitrap hybrid instrument. Chem Sci. 2023;14:6695–704.
20. Liew CY, Yen C-C, Chen J-L, Tsai S-T, Pawar S, Wu C-Y, Ni C-K. Structural identification of N-glycan isomers using logically derived sequence tandem mass spectrometry. Commun Chem. 2021;4:92.
21. Chen J-L, Nguan HS, Hsu P-J, Tsai S-T, Liew CY, Kuo J-L, Hu W-P, Ni C-K. Collision-induced dissociation of sodiated glucose and identification of anomeric configuration. Phys Chem Chem Phys. 2017;19:15454–62.
22. Huang Y, Pu Y, Yu X, Costello CE, Lin C. Mechanistic study on electronic excitation dissociation of the cellobiose-Na⁺ complex. J Am Soc Mass Spectrom. 2015;27:319–28.
23. Warnke S, Ben Faleh A, Scutelnic V, Rizzo TR. Separation and identification of glycan anomers using ultrahigh-resolution ion mobility spectrometry and cryogenic ion spectroscopy. J Am Soc Mass Spectrom. 2019;30:2204–11.

