Abstract
Human serum glycomics is a promising method for finding cancer biomarkers but often lacks the tools for streamlined data analysis. The Glycolyzer software incorporates a suite of analytic tools capable of identifying informative glycan peaks out of raw mass spectrometry data. As a demonstration of its utility, the program was used to identify putative biomarkers for epithelial ovarian cancer from a human serum sample set. A randomized, blocked and blinded experimental design was used on a discovery set consisting of 46 cases and 48 controls. Retrosynthetic glycan libraries were used for data analysis and several significant candidate glycan biomarkers were discovered via hypothesis testing. The significant glycans were attributed to a glycan family based on glycan composition relationships and incorporated into a linear classifier motif test. The motif test was then applied to the discovery set to evaluate the disease state discrimination performance. The test provided strongly predictive results based on receiver operator characteristic curve analysis. The area under the receiver operator characteristic curve was 0.93. Using the Glycolyzer software, we were able to identify a set of glycan biomarkers that highly discriminate between cases and controls, and are ready to be formally validated in subsequent studies.
Keywords: Biomarkers, Clinical Glycomics, Data Processing, Human Serum, Ovarian Cancer
INTRODUCTION
Glycans are a common post-translational modification of proteins that consist of complex arrangements of monosaccharides that vary in size, linkage, and composition. They are instrumental to the vitality of higher organisms and are currently of considerable interest as a source for serum-based biomarkers [1-8]. Glycan cancer biomarkers are of particular importance because changes in glycosylation have been observed in globally released glycans from the serum of cancer patients [4-11] and on glycans released from targeted glycoproteins [12, 13]. Mass spectrometry is widely used for studying glycans because most compounds in a complex mixture can be simultaneously detected and identified. The masses and ionization characteristics of glycans are suitable for most modern mass spectrometers. However, the vast amount of glycan data makes it difficult to extract and organize information from mass spectra.
There have been several methods for annotating glycans incorporating combinatorial approaches (GlycoMod [14]), empirical databases (GlycoSuiteDB [15, 16], SWEET-DB [17], BOLD [18], KEGG [19], and EUROCarbDB [20]), glycobiology oriented glycan library models (Cartoonist [21], Retrosynthetic Glycan Network Libraries [22]), and tandem mass spectrometry processing algorithms (StrOligo [23], GlySpy and OSCAR [24, 25], GlycoPeakFinder [26], Glyco-Fragment [27, 28], GlycoWorkbench [29]). However, there has been little attention paid to raw glycan spectra processing. Vakhrushev and coworkers developed the SysBioWare software for processing and annotating raw glycan mass spectra [30]. This program includes several basic features including background subtraction, peak detection, noise thresholding, and data processing tools including preprocessing, smoothing, peak selection, and isotope grouping. Additional tools in the software used for glycan processing include a difference calculator that can use monosaccharide masses and a rudimentary biological filter that uses logical monosaccharide ratio statements entered by the user.
We have developed an integrated software annotation program for glycan biomarker discovery that is referred to as The Glycolyzer. The Glycolyzer contains a full data analysis pipeline in one software package to allow for minimal user intervention. The software was written in IgorPro (WaveMetrics, Portland, OR) language and the source code is available from our group website (chemgroups.ucdavis.edu/~lebrilla/Glycolyzer.zip) or by request from the authors. Although IgorPro is required to run the software, the algorithms can be viewed with text editors. The mass spectrum analysis software is a graphical user interface-based program designed for processing and analyzing carbohydrate mass spectra with a focus on clinical glycan biomarker discovery. The Glycolyzer has similarities with the SysBioWare software but goes further by incorporating a full analysis pipeline including additional algorithms for calibration, theoretical retrosynthetic library based glycan annotation, and statistical hypothesis testing. The overall workflow, including Fourier Transform Ion Cyclotron Resonance (FT-ICR) and general mass spectra data processing, is shown in Figure 1. Calibrated deconvoluted data from LC-MS experiments can be used as well by bypassing the internal preprocessing algorithms and proceeding directly to the annotation and statistics part of the pipeline.
Figure 1.
The overall data workflow for the Glycolyzer.
We used this software to discover serum-based glycan biomarkers for epithelial ovarian cancer. Epithelial ovarian cancer is the most dangerous of the gynecologic malignancies because it is typically detected late, when most patients present with advanced stages of the disease. It currently lacks diagnostic tests that are effective for screening and early detection. There are a limited number of FDA-approved blood tests available to assist in the diagnosis and monitoring of ovarian cancer, including CA 125 and HE4 [31], but the value of these tests is largely limited to monitoring disease status after treatment, or assessing the risk of malignancy when an ovarian mass has already been detected. CA 125 is elevated in only 50% of Stage I cancers, so it is not a sensitive test for early detection. It is also rather non-specific, especially in pre-menopausal women, leading to many false positive results that require diagnostic intervention [31]. Thus, novel serum-based biomarkers with improved sensitivity and specificity would be highly desirable.
We have pursued glycomics analysis using mass spectrometry to detect glycans that differ in monosaccharide composition and/or show increased or decreased expression when comparing serum from patients with ovarian cancer and healthy controls [7, 32]. Although a number of informative glycans were found that distinguished between cases and controls, detection of the glycan mass peaks has been hampered by the lack of useful bioinformatics analytic techniques. The Glycolyzer software provides a platform that can substantially automate the analysis of complex mass spectrometry data, allowing for detection and annotation of informative glycans. Glycan annotation employs a novel theoretical glycan library that has recently been published [22].
As a demonstration of the utility of this software, we have used it to distinguish a unique set of glycans in a carefully selected group of matched cases and controls. Briefly, human serum N-linked glycans (N-glycans) were profiled with the theoretical retrosynthetic N-glycan library and experimental profiles were developed based on 46 control samples. The work presented here demonstrates the high throughput capabilities of the current methodology on a matched set of cases and controls. The methodology includes isolation of N-linked glycans from human serum, mass spectrometry using MALDI-FTICR, and then bioinformatic evaluation with the Glycolyzer software. Based on these results, the discovery set is appropriate for use in a clinical validation study to evaluate the robustness of the candidate markers presented here.
MATERIALS AND METHODS
Human Serum Samples
Approval for this research protocol using clinical data and human serum samples was obtained from the Institutional Review Board of the University of California, Davis Medical Center. Human serum samples were obtained through a formal data use agreement with the Gynecologic Oncology Group (GOG). The subjects either had epithelial ovarian cancer (cancer cases) or were healthy volunteers (healthy controls). All serum samples arrived frozen and were transferred to a −75°C freezer prior to processing.
The discovery set included healthy controls (n = 48) and ovarian cancer cases (n = 46). The discovery set samples were age-matched by 5-year intervals to avoid confounding effects (40-45, 46-50, 51-55, 56-60, and 61-65 years). Disease status (case versus control) and age block were blinded outside of our laboratory prior to chemical analysis. The samples were blocked into 8 sets of 12 samples (each block contained 6 controls and 6 cancer cases) with relatively even balancing of subject ages. Following mass spectrometry data collection and annotation using the Glycolyzer software, the samples were unblinded for statistical analysis.
N-glycan release and extraction from human serum for the discovery set was carried out by the optimized methods described by Kronewitter et al. [33]. Briefly, 100 μL of serum was mixed with 100 μL digestion buffer (pH 7.5, 100 mM ammonium bicarbonate, 10 mM dithiothreitol) and heated in boiling water for 2 minutes to denature the proteins. After cooling to room temperature, 2.0 μL of Peptide N-glycosidase F (PNGase F, 500,000 units/mL, glycerol free, New England BioLabs, Ipswich, MA) was added and the mixture was incubated in a microwave reactor for 20 minutes at a constant power of 20 W. An 800 μL aliquot of chilled ethanol was then added to precipitate peptides and proteins. The solution was frozen in a −75°C freezer for 60 minutes and then centrifuged at 13,300 revolutions per minute for 20 minutes (5415 D, Eppendorf AG, Hamburg, Germany). After centrifuging, 700 μL of supernatant was removed from the precipitate and dried in a Savant AES 2010 centrifugal evaporator (Thermo Fisher Scientific, Waltham, MA). PNGase F-released glycans were then purified by graphitized carbon cartridge solid phase extraction (GCC-SPE) with an automated Gilson GX-274 ASPEC liquid handler. GCC-SPE cartridges (150 mg bed weight, 4 mL cartridge volume) were acquired from Alltech (Deerfield, IL). Three fractions of glycans were collected using increasing amounts of acetonitrile (ACN): 4 mL each of 10% ACN/H2O (v/v), 20% ACN/H2O (v/v), and 40% ACN/H2O (v/v) with 0.05% trifluoroacetic acid. Each fraction was collected and dried in a centrifugal evaporator apparatus. Fractions were reconstituted in nanopure water prior to mass spectrometry. Mass spectra were recorded on an external source MALDI-FTICR instrument (HiResMALDI, IonSpec Corporation, Irvine, CA) equipped with a 7.0 T superconducting magnet and a pulsed 355 nm Nd:YAG laser. Five spectra were collected for each fraction of each sample: the 10% ACN and 20% ACN fractions in the positive mode and the 40% ACN fraction in the negative mode.
A total of 1410 FT-ICR spectra were collected from the 94 samples. The spectra were collected in blocks (blocked by SPE fraction). The samples from the blinded, randomized sample set were analyzed sequentially on the same instrument over 2-3 days to maintain constant sample detection conditions. The mass spectra collection conditions were optimized for reproducibility by controlling several instrumental parameters during operation. The ultra-high vacuum base pressure was maintained lower than 1E-10 Torr (measured with an ion gauge). Cooling gas was used to kinetically cool the ions during ion accumulation in a hexapole prior to transfer to the ICR cell. The cooling gas pump down rate was controlled via the initial system pressure. The initial system pressure chosen was between 1E-10 and 5E-10 Torr prior to ionization and subsequent accumulation and detection. Fixing the initial pressure allowed for replicate pressure conditions in the ICR cell during detection. Under these conditions, the average coefficient of variation of glycan intensities from technical replicates from the same MALDI spot ranges from 12 to 17% [33].
DATA ANALYSIS ALGORITHMS
The Glycolyzer is a software package consisting of a graphical user interface and several modular data processing algorithms that can be linked to each other in a user defined order. All the algorithms are integrated into the platform’s user interface and can be run in series as a data analysis pipeline. Different degrees of processed data can be loaded into the software. For example, the analytical signal from the FT-ICR (ICR transient or free induction decay) can be loaded directly into the start of the pipeline and processed, or the analytical signal can be processed externally to the Glycolyzer software via instrument software (e.g., Omega8, IonSpec, Irvine, CA) and loaded in at a later point in the data analysis pipeline. This allows data from other types of mass spectrometers to be used as long as the data is already calibrated. If instrument software deconvolution is preferred rather than the Glycolyzer’s built-in deconvolution algorithm, exogenous deconvoluted monoisotopic masses can be loaded directly and the rest of the Glycolyzer’s analysis pipeline can still be applied.
Automatic Spectra Processing
Data analysis for clinical glycan sample sets requires many automated steps to assure rapid and consistent data handling. The Glycolyzer automates the full data analysis pipeline starting with the analytical signal from the instrument and concluding with biomarker elucidation. The general modules included are: data importing and exporting, FT-ICR signal pre-processing, internal calibration, noise threshold calculation, peak picking, isotope grouping and filtering, glycan annotation, intensity normalization, missing value filling, multiple spectra averaging, hypothesis testing, and multiple testing corrections. The glycans that pass the rigorous multiple testing corrected hypothesis tests are considered to be candidate biomarkers and can be incorporated into data classifiers and their diagnostic performance evaluated.
Data Importing/Exporting
Importing data from text files is facilitated by the Glycolyzer’s graphical user interface. Raw ICR transients, mass spectra, or deconvoluted monoisotopic mass lists can be loaded in as single files or as a batch. The modular pipeline of the Glycolyzer allows the user to select appropriate analysis algorithms for the data type loaded. Different levels of preprocessing previously applied to a file are taken into account by allowing data to enter the analysis pipeline at different points.
FT-ICR Pre-Processing
Fast Fourier Transforms (FFT) were performed on raw data transients obtained from Omega8 (IonSpec, Irvine, CA) data acquisition software. The Analog-to-Digital conversion (ADC) rate, magnet strength, number of zero fills, and apodization window are specified by the user. In this study, one zero-fill was used during the Fourier transform along with a Blackman apodization window. A one second transient was used. In addition, the user is able to truncate the length of the transients prior to applying the FFT to improve quantification by reducing the damping effects inherent to ICR transients. The FFT converts the transients from the time domain to the frequency domain. Many apodization windows for smoothing out the peak shapes, such as the commonly used Blackman and Hamming windows, are included in the user interface.
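The pre-processing steps described above (optional truncation, Blackman apodization, one zero-fill, FFT) can be illustrated with a minimal Python sketch. This is not the Glycolyzer's IgorPro code; the function names, the pure-Python radix-2 FFT, and the synthetic 100 Hz test signal are assumptions chosen only to keep the example self-contained.

```python
import cmath
import math

def blackman(n):
    """Blackman apodization window of length n (n >= 2)."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def transient_to_spectrum(transient, adc_rate_hz, zero_fills=1, keep_fraction=1.0):
    """Truncate, apodize, zero-fill, and transform a transient.

    Returns (frequencies in Hz, magnitudes) for the positive-frequency half.
    """
    kept = transient[:int(len(transient) * keep_fraction)]
    apodized = [s * w for s, w in zip(kept, blackman(len(kept)))]
    # Each zero-fill doubles the record length with trailing zeros.
    padded = apodized + [0.0] * (len(apodized) * (2 ** zero_fills - 1))
    n = len(padded)
    spectrum = fft(padded)
    freqs = [k * adc_rate_hz / n for k in range(n // 2)]
    mags = [abs(spectrum[k]) for k in range(n // 2)]
    return freqs, mags

# Synthetic check (hypothetical data): a 100 Hz cosine sampled at 1024 Hz.
rate = 1024.0
signal = [math.cos(2 * math.pi * 100.0 * i / rate) for i in range(1024)]
freqs, mags = transient_to_spectrum(signal, rate, zero_fills=1)
peak_freq = freqs[mags.index(max(mags))]
```

Setting keep_fraction=0.5 in this sketch mimics truncating a transient to half its length before the transform, at the cost of a broader peak.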
High Mass Accuracy Spectra Calibration
High-mass-accuracy calibration was used for the clinical samples. The error was generally less than 5 parts-per-million (ppm) root-mean-squared (RMS) mass difference of calibrant ions from calculated values across a data set. Smaller errors, e.g. 1-2 ppm, have been obtained for glycan standards (data not shown) but are challenging to achieve for large data sets. Accurate calibrations allow for accurate mass determination of unknowns. For FT-ICR instruments, the free induction decay transients need to be converted into mass spectra via the FFT and calibration equations. The Glycolyzer’s internal calibration algorithm performs a six-point calibration using six common glycan ions in each spectrum. A serum N-glycan mass profile, derived from 46 healthy controls, was used to identify the six best ions for calibrating human serum N-glycan spectra [22]. The six calibrant ions were selected from the set of twenty-eight glycans detected in 100% of the samples. The calibrant masses were converted to the frequency domain via the standard FT-ICR calibration equation [34-36].
Each calibrant ion mass was aligned to its respective monoisotopic peak in each spectrum. To identify the monoisotopic peak for alignment, the first step is to isotope-filter the frequency data and highlight monoisotopic peaks. Monoisotopic peak selection in the frequency domain is different from the mass domain because the isotopologue distributions are reversed and the neutron mass differences between isotopologue ions are non-linear in the frequency domain. For this reason, a novel deisotoping algorithm was developed specifically for the frequency domain and is presented here. Finally, graphs containing the monoisotopic-highlighted experimental data surrounding each of the six calibrant ions are presented to the user for a final visual inspection. If the wrong peak is selected by the computer, the user can manually reselect the correct peak with arrow buttons and then continue to the calibration algorithm and subsequent samples. The manual inspection step ensures proper calibration of densely packed spectra that are hard to decipher with computer algorithms alone. Final calibration is performed by fitting the calibration equation to the calibrant ions to find the equation coefficients. The optimized calibration is facilitated by a CurveFit function built into IgorPro that is based on the Levenberg-Marquardt algorithm. The Omega8 (IonSpec, Irvine, CA) and Glycolyzer internal calibration methods are compared in Supplementary Figure 1, where twelve spectra were calibrated and their root-mean-squared (RMS) mass deviations from known values recorded.
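As an illustration of the six-point fit, the following Python sketch assumes the widely used two-term form m/z = A/f + B/f^2; the exact equation used by the Glycolyzer is not reproduced in this text, so that form is an assumption. Because the assumed form is linear in A and B, a closed-form least-squares solution stands in for the Levenberg-Marquardt CurveFit routine. All numerical values (coefficients, frequencies) are hypothetical.

```python
import math

def fit_calibration(freqs_hz, masses):
    """Least-squares fit of m/z = A/f + B/f**2 (assumed two-term form)."""
    f0 = max(freqs_hz)                    # rescale so the design columns are O(1)
    u = [f0 / f for f in freqs_hz]
    # Normal equations for m = a*u + b*u**2.
    s11 = sum(x ** 2 for x in u)
    s12 = sum(x ** 3 for x in u)
    s22 = sum(x ** 4 for x in u)
    t1 = sum(x * m for x, m in zip(u, masses))
    t2 = sum(x * x * m for x, m in zip(u, masses))
    det = s11 * s22 - s12 * s12
    a_scaled = (t1 * s22 - t2 * s12) / det
    b_scaled = (s11 * t2 - s12 * t1) / det
    return a_scaled * f0, b_scaled * f0 ** 2   # undo the rescaling

def rms_ppm_error(freqs_hz, masses, A, B):
    """RMS mass error of the calibrant ions in parts-per-million."""
    errs = [((A / f + B / f ** 2) - m) / m * 1e6 for f, m in zip(freqs_hz, masses)]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

# Hypothetical coefficients and six hypothetical calibrant frequencies.
A_true, B_true = 1.0749e8, -3.0e8
calib_freqs = [107000.0, 90000.0, 75000.0, 60000.0, 48000.0, 40000.0]
calib_masses = [A_true / f + B_true / f ** 2 for f in calib_freqs]
A_fit, B_fit = fit_calibration(calib_freqs, calib_masses)
rms = rms_ppm_error(calib_freqs, calib_masses, A_fit, B_fit)
```

With real spectra the calibrant frequencies would come from the monoisotopic peaks confirmed during the visual inspection step, and the RMS ppm error gives a per-spectrum quality check.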
Noise Threshold
Separating the signal from the noise is important for peak annotation and reliable quantification. To threshold a spectrum, a limit-of-detection line is calculated. All peaks above the line are considered signal and all peaks below the line are classified as noise. One option for dynamically assigning a limit-of-detection is to manually set the threshold to a relative percentage of the base peak. Such a user-selected threshold is problematic because the cutoff is arbitrary and independent of the noise and background. In contrast, we apply threshold settings based on the standard deviation and mean intensity of the noise. The mean intensity of the noise is calculated as the mean of all intensities in the spectrum, since the number of noise data points greatly outweighs the signal. Commonly, the lower limit of detection (LLOD) is set at three-sigma above the mean noise level, but we used six-sigma above the mean noise level to further reduce the number of falsely annotated noise peaks.
The standard deviation of the noise is calculated from a histogram of all intensities in the spectrum. This histogram is presented in Supplementary Figure 2. The most common intensity in the histogram (its mode) is taken as the noise level and used as the standard deviation. Noise removal by threshold cutoffs drastically improves processing time since the subsequent algorithms are only applied to the signal. Alternatively, the standard deviation of the noise can be calculated from the full-width-at-half-maximum (FWHM) of the distribution. However, the standard deviation from this method is smaller and produces a lower threshold line. Although lower threshold cutoffs allow for higher sensitivity, they also result in less specificity as noise peaks can be detected above the threshold. This algorithm works well for data collected in this study because there are significantly more noise peaks than signal peaks detected in a spectrum.
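The thresholding rule (mean of all intensities plus six times a standard deviation read from the histogram mode) can be sketched in Python as follows. The bin count, the restriction of the histogram to the low-intensity region, and the synthetic noise used for the check are assumptions made for illustration and are not taken from the Glycolyzer source.

```python
import math
import random

def noise_threshold(intensities, n_sigma=6.0, n_bins=50):
    """Return (mean, sigma, threshold); sigma is the histogram mode.

    Assumes positive, magnitude-mode intensities.
    """
    mean_level = sum(intensities) / len(intensities)
    ordered = sorted(intensities)
    median = ordered[len(ordered) // 2]
    # Assumption: histogram only 0 to 5x the median so that a few intense
    # signal peaks do not stretch the bins.
    upper = 5.0 * median
    width = upper / n_bins
    counts = [0] * n_bins
    for value in intensities:
        if 0.0 <= value < upper:
            counts[min(int(value / width), n_bins - 1)] += 1
    mode_bin = counts.index(max(counts))
    sigma = (mode_bin + 0.5) * width          # most common intensity
    return mean_level, sigma, mean_level + n_sigma * sigma

# Hypothetical spectrum: 20000 noise points plus three intense peaks.
random.seed(0)
noise = [math.sqrt(-2.0 * math.log(1.0 - random.random())) for _ in range(20000)]
spikes = [50.0, 80.0, 120.0]
mean_level, sigma, threshold = noise_threshold(noise + spikes)
signal_points = [v for v in noise + spikes if v > threshold]
```

Lowering n_sigma to 3.0 in this sketch reproduces the more permissive three-sigma convention mentioned above and admits more noise points above the line.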
Peak Picking
The Glycolyzer program requires that each peak has a maximum and contains at least three data points. The centroid mass of each peak is derived by fitting a parabola to the top three points in each peak via parabolic regression. The fitted parabola provides a centroid mass and a corrected intensity. Apex-based intensities are used for ICR spectra because peak line shapes and corresponding areas are affected by many variables not directly related to the number of ions in the ICR cell [37]. In contrast, intensities calculated by the area under the curve work well for TOF instruments since TOF detectors are based on counting ions.
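The three-point parabolic centroid can be written out explicitly. The Python sketch below fits the unique parabola through the top three points of a peak and returns its apex; the function name and the example points are hypothetical.

```python
def parabolic_apex(xs, ys):
    """Apex (x, y) of the parabola through three (x, y) points.

    xs must be strictly increasing and the middle point should be the
    local maximum, so that the fitted parabola opens downward.
    """
    (x0, x1, x2), (y0, y1, y2) = xs, ys
    d1 = (y1 - y0) / (x1 - x0)            # first divided differences
    d2 = (y2 - y1) / (x2 - x1)
    a = (d2 - d1) / (x2 - x0)             # quadratic coefficient
    if a >= 0:
        raise ValueError("points do not describe a maximum")
    b = d1 - a * (x0 + x1)                # linear coefficient
    x_apex = -b / (2.0 * a)
    # Evaluate the Newton-form polynomial at the apex.
    y_apex = y0 + d1 * (x_apex - x0) + a * (x_apex - x0) * (x_apex - x1)
    return x_apex, y_apex

# Hypothetical peak sampled from y = 10 - 2 * (x - 1000.3)**2.
pts_x = [1000.0, 1000.25, 1000.5]
pts_y = [10 - 2 * (x - 1000.3) ** 2 for x in pts_x]
centroid_mass, apex_intensity = parabolic_apex(pts_x, pts_y)
```

The apex intensity is generally higher than the largest sampled point, which is the "corrected intensity" referred to above.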
Isotope Grouping
Current mass spectrometers commonly resolve glycans into their isotopologues. High resolving power presents the opportunity to identify the monoisotopic peak for further annotation and analysis. Several research groups have developed isotope grouping algorithms [38-42]. The Glycolyzer’s general isotope grouping workflow is based on the THRASH (Thorough High Resolution Analysis of Spectra by Horn) algorithm [43] with several modifications pertinent to MALDI ionization and glycans.
One significant improvement is the Glycolyzer’s ability to separate overlapping clusters of isotopologues. Rather than using subtractive methods for deconvoluting overlapping distributions, the theoretical overlapped models are reconstructed to reduce the propagation of fitting errors in the residual spectra. The reconstructive approach is similar to the LASSO method applied by Du and coworkers [44]; however, our model generation is permuted rather than regressed with automatic variable selection. A simplified workflow is presented in Supplementary Figure 3.
The first step for deconvolving the spectra is to identify an isotopic cluster. A cluster is a set of ions spaced apart by an isotope mass unit equal to 1.00235 Da [43], or a fraction depending on the charge state. The fraction is equal to the isotope unit divided by the charge state. A cluster can contain more than one isotopic distribution if multiple distributions overlap. Overlapped isotopic distributions are common in glycan spectra because chromatographic separation prior to mass spectrometry is typically not performed. MALDI mass spectrometry has the favorable characteristic of only producing ions with a single charge. This eliminates the need for charge deconvolution because the spacing between isotopologues is always the full mass of a neutron rather than a fractional mass related to higher charge states.
Isotope clusters are found in the spectra by a neighbor peak-finding algorithm. The algorithm looks for neighboring peaks around a principal ion that are one isotope mass unit away in both directions. A mass-error tolerance is applied to this calculation to provide a window for locating a neighboring peak apex. This mass error window allows for proper detection of neighboring peaks despite imperfect peak shapes and centroid errors. If a neighboring peak apex is within the error window, it is added to the cluster and the algorithm continues searching for additional ions to add to the cluster. Additional ions are found by making the newly added ion the principal ion and repeating the neighboring peak selection process. This peak finding process continues until there are no neighboring ions to add. If a large mass-error tolerance is selected by the user, the clustering algorithm may falsely include a second cluster if the spectrum is densely populated; this type of error is corrected later in the algorithm when the cluster is deconvoluted (see below). Conversely, if the tolerance is too small, the tail end of a cluster may be broken off and form a second cluster, which results in the assignment of extra false-positive monoisotopic peaks.
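The neighbor-finding step can be sketched in Python as a simple cluster-growing loop. The 1.00235 Da isotope unit is taken from the text; the function name, the default tolerance, and the example peak list are hypothetical.

```python
ISOTOPE_UNIT = 1.00235  # Da, isotope spacing quoted in the text

def find_clusters(peak_masses, tol_da=0.02, charge=1):
    """Group peak apexes spaced one isotope unit apart (within tol_da)."""
    spacing = ISOTOPE_UNIT / charge
    remaining = sorted(peak_masses)
    clusters = []
    while remaining:
        cluster = [remaining.pop(0)]      # lowest unassigned peak seeds a cluster
        grew = True
        while grew:                       # keep growing until no neighbor is found
            grew = False
            for m in list(remaining):
                if any(abs(abs(m - c) - spacing) <= tol_da for c in cluster):
                    cluster.append(m)
                    remaining.remove(m)
                    grew = True
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical centroided peak list.
peaks = [1000.0, 1001.002, 1002.005, 1200.0, 1500.5, 1501.503]
clusters = find_clusters(peaks, tol_da=0.02)
# A tolerance that is too tight breaks every cluster into single peaks.
fragmented = find_clusters(peaks, tol_da=0.0001)
```

The second call illustrates the failure mode described above: an overly small tolerance splits genuine clusters and would inflate the number of apparent monoisotopic peaks.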
The second step for deconvolution is to create synthetic isotopic distributions. Depending on the type of molecules detected in the spectra, the isotope distributions will change. Peptide mass spectra are often simulated with the use of an averagine. An averagine unit represents, by mass and elemental composition, the average mass of an amino acid that occurs in human proteins. Unknown peptide masses can be converted to elemental compositions by dividing the unknown mass by the averagine mass (111.1254 Da) to find the number of averagine units and then multiplying the number of units by the averagine elemental composition (C4.9384H7.7583N1.3577O1.4773S0.0417) [45]. However, N-glycans have compositions that differ from peptides. Because of the need for sugar-based isotope distribution models, an averagose model was established by An et al. [46]. Subsequently, a direct glycan analogue to the averagine was presented by Vakhrushev et al. [30], which included an average monosaccharide unit based on an equal weighting of hexose, N-acetylhexosamine, fucose, and neuraminic acid monosaccharides. We now propose a similar averagose for modeling N-glycans based on the theoretical libraries and experimentally derived glycan profiles. Experimental serum profiles generated from applying theoretical libraries to experimental spectra provide a more accurate estimate of a human averagose. The proposed serum averagose is C6.0000H9.8124N0.3733O4.3470S0.0 with an average mass of 156.64662 Da (sulfur was included as a place holder since it is not typically seen in our spectra). This new more specific averagose is compared against theoretical isotope distributions modeled from the estimates of elemental compositions with Poisson distributions [46]. Supplementary Figure 4 demonstrates that both Vakhrushev and Glycolyzer methods produce characteristics similar to the exact elemental composition model. 
In addition, a peptide averagine was applied to glycans and gave a relatively poor fit compared to the averagose methods. For deisotoping purposes, reducing the length of the ICR transients from 1.0 to 0.5 seconds (1,048,576 to 524,288 data points) improved the chi-squared fit of the model to the experimental data.
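The averagose scaling works the same way as the averagine calculation: the unknown mass is divided by the unit mass and the unit composition is multiplied by the resulting number of units. The Python sketch below uses the serum averagose values quoted in the text; the function name, the rounding to whole atoms, and the example mass are assumptions for illustration.

```python
# Serum averagose composition and unit mass as quoted in the text.
AVERAGOSE = {"C": 6.0000, "H": 9.8124, "N": 0.3733, "O": 4.3470, "S": 0.0}
AVERAGOSE_MASS = 156.64662  # Da

def estimate_composition(mass_da):
    """Estimate an elemental composition for a glycan of the given mass.

    Returns (number of averagose units, composition rounded to whole atoms).
    """
    units = mass_da / AVERAGOSE_MASS
    composition = {el: round(n * units) for el, n in AVERAGOSE.items()}
    return units, composition

# Hypothetical example: a peak at 1663.58 Da.
units, composition = estimate_composition(1663.58)
```

The estimated composition would then feed whichever isotope-distribution model is used to generate the theoretical isotopologue pattern.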
Next, we improved processing performance by filtering the clusters based on how many isotopic distributions are present in a cluster. Extensive deconvolution is not needed on single ion clusters and is reserved for larger clusters containing several monoisotopic ions and respective distributions. If there are multiple maxima within a cluster, multiple ions are expected and complete deconvolution is performed.
Theoretical isotopologue intensity distributions are then calculated based on an averagose model. If there is only one expected ion in the cluster, a simple theoretical model with one ion is created. However, if multiple ions need to be deconvoluted, combinations of multiple distributions are needed. Overlapping isotope group data reconstruction is accomplished by applying a non-linear set of sixteen ratios between the intensities of multiple clusters to build the model. Sixteen ratios of ion intensities are used to span two orders of magnitude with a small number of steps. This decreases the computer processing overhead while maintaining the desired deconvolution sensitivity. A non-linear set of ratios is chosen to provide greater detail for the ratios close to unity while still including the more apparent larger ratios. The synthetic models are created with varying numbers of mass unit offsets between the theoretical ions. The number of unit offsets is limited by the number of ions in the cluster to further speed up the processing.
Finally, the complete models are multiplied by an alignment matrix and the individual fits are evaluated with a Chi-squared test. The best chi-squared fit alignment is decomposed to identify the monoisotopic peaks and the results are recorded. The monoisotopic and isotopologue peaks are assigned and the theoretical values are subtracted from the spectrum. The algorithm then repeats clustering on the non-annotated portion of the spectra. This process repeats until all the ions above the noise threshold are assigned.
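The reconstructive fit can be illustrated with a simplified two-ion Python sketch: candidate models are built from a single-ion template at each offset and intensity ratio, and the model with the lowest chi-squared value is kept. The sixteen ratio values, the template, the variance floor, and the function names are all hypothetical; the text does not list the actual ratios or the alignment-matrix details.

```python
# Hypothetical ratio set: sixteen non-linear steps spanning two orders of
# magnitude (0.1 to 10), denser near unity.
RATIOS = [0.1, 0.2, 0.33, 0.5, 0.67, 0.8, 0.9, 1.0,
          1.1, 1.25, 1.5, 2.0, 3.0, 5.0, 7.0, 10.0]

def build_model(template, offset, ratio, length):
    """Sum of one template plus a second copy shifted by `offset` units."""
    model = [0.0] * length
    for i, t in enumerate(template):
        if i < length:
            model[i] += t
        if ratio > 0.0 and i + offset < length:
            model[i + offset] += ratio * t
    return model

def chi_squared(observed, model):
    """Chi-squared after least-squares scaling of the model to the data."""
    scale = sum(o * m for o, m in zip(observed, model)) / sum(m * m for m in model)
    floor = 0.01 * max(observed)          # avoids division by zero expectations
    return sum((o - scale * m) ** 2 / max(scale * m, floor)
               for o, m in zip(observed, model))

def best_overlap(observed, template, ratios=RATIOS):
    """Return (chi2, offset, ratio); offset 0 and ratio 0.0 mean a single ion."""
    n = len(observed)
    best = (chi_squared(observed, build_model(template, 0, 0.0, n)), 0, 0.0)
    for offset in range(1, n):
        for ratio in ratios:
            chi2 = chi_squared(observed, build_model(template, offset, ratio, n))
            if chi2 < best[0]:
                best = (chi2, offset, ratio)
    return best

# Hypothetical cluster: a second ion two units higher at half the intensity.
template = [1.0, 0.8, 0.4, 0.15, 0.05]
observed = [1000.0 * v for v in build_model(template, 2, 0.5, 7)]
best = best_overlap(observed, template)
```

In this sketch the monoisotopic peaks would be read off as the first position of the cluster and the position shifted by the winning offset.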
Glycan Annotation
The Glycolyzer provides two methods for annotating peaks using accurate mass: development mode and high-throughput mode. Tools available for use in the development mode include a broad combinatorial method for making theoretical glycans and calculating monosaccharide differences from the spectra. The brute-force combinatorial method can be adapted with biological rules input by the user. Similar “biological filters” have been described in the literature to reduce the quantity of nonsensical glycan compositions [14, 30, 47]. OmniFinder, a dynamic algorithm similar to GlycoMod [14], creates a list of all the mathematically possible glycans or glycopeptides within specified monosaccharide and/or amino acid compositions and searches for them in the spectra. The list is comprehensive but includes a high proportion of false positive hits. The nonsensical glycan false hits are largely eliminated with an array of glycan filters based on known biology.
Another useful tool in the development mode is a glycan peak relationship finder. Mass differences consistent with monosaccharide masses are indicative of an ion being a glycan or a glycoconjugate. This can be helpful with variable or unknown head groups. This information is also helpful for determining families of glycans that differ by one monosaccharide. Finding these differences requires processed spectra that contain only monoisotopic masses, because otherwise many extraneous differences involving associated isotopologues will be found. A stem-and-leaf algorithm is employed to find differences because error bars can be applied to each side of the difference. The stem-and-leaf algorithm starts by looking for imprecise monosaccharide differences and iteratively focuses in on the differences with the least root-mean-square (RMS) mass error. The adaptive algorithm allows the difference finder to work on poorly calibrated spectra. Calibrated spectra often yield RMS mass errors in the several hundreds of parts-per-billion range for monosaccharide differences. The high accuracy of correctly matched pairs allows for easy differentiation of true assignments from false ones.
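A brute-force combinatorial search with a simple biological filter can be sketched in Python as follows. This is not OmniFinder itself; the residue masses are standard monoisotopic values, while the count limits, the single core-requirement rule, and the function names are assumptions chosen for illustration.

```python
import itertools

# Monoisotopic residue masses (Da): Hex, HexNAc, Fuc (deoxyhexose), NeuAc.
RESIDUE = (162.05282, 203.07937, 146.05791, 291.09542)
WATER = 18.01056
SODIUM_ADDUCT = 22.98922      # sodium atom minus one electron

def sodiated_mass(counts):
    """[M+Na]+ m/z for a (Hex, HexNAc, Fuc, NeuAc) composition."""
    neutral = sum(n * r for n, r in zip(counts, RESIDUE)) + WATER
    return neutral + SODIUM_ADDUCT

def passes_core_filter(counts):
    """Illustrative rule only: require the N-glycan core Hex3HexNAc2."""
    return counts[0] >= 3 and counts[1] >= 2

def search(observed_mz, ppm=15.0, max_counts=(12, 8, 5, 5), use_filter=True):
    """All compositions whose [M+Na]+ mass lies within `ppm` of the peak."""
    hits = []
    for counts in itertools.product(*(range(n + 1) for n in max_counts)):
        if use_filter and not passes_core_filter(counts):
            continue
        theoretical = sodiated_mass(counts)
        if abs(observed_mz - theoretical) / theoretical * 1e6 <= ppm:
            hits.append(counts)
    return hits

# Hypothetical observed peak near the sodiated Hex5HexNAc4 glycan.
filtered_hits = search(1663.5814)
unfiltered_hits = search(1663.5814, use_filter=False)
```

Comparing the filtered and unfiltered hit lists for a set of peaks gives a feel for how much a biological rule trims the combinatorial output.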
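A basic monosaccharide difference finder can be sketched in Python as a pairwise scan over monoisotopic masses. The iterative stem-and-leaf refinement is not reproduced here; this sketch uses a single fixed tolerance, and the function name and example masses are hypothetical.

```python
# Monoisotopic residue masses (Da).
RESIDUES = {"Hex": 162.05282, "HexNAc": 203.07937,
            "Fuc": 146.05791, "NeuAc": 291.09542}

def find_monosaccharide_links(mono_masses, tol_da=0.01):
    """Pairs of monoisotopic masses that differ by one monosaccharide.

    Returns tuples (lower mass, higher mass, residue name, error in Da).
    """
    masses = sorted(mono_masses)
    links = []
    for i, low in enumerate(masses):
        for high in masses[i + 1:]:
            delta = high - low
            for name, residue in RESIDUES.items():
                if abs(delta - residue) <= tol_da:
                    links.append((low, high, name, delta - residue))
    return links

# Hypothetical deisotoped peak list.
mono_peaks = [1663.5814, 1809.6393, 1825.6342, 2116.7296]
links = find_monosaccharide_links(mono_peaks)
link_names = [name for _, _, name, _ in links]
```

Running the same function on a list that still contains isotopologue peaks would add many spurious links, which is why deisotoping comes first.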
In our glycomics studies, high-throughput annotation was achieved by bounding the glycan composition possibilities to a targeted list of N-glycans. A recently published theoretical glycan library or experimentally derived glycan profile was used as a basis for annotation [22]. In short, the N-glycan library was generated by degrading fully glycosylated complex, hybrid, and high mannose type glycans all the way to the N-linked core. The glycome is bounded by the extent of glycosylation of the starting point glycans. The retrosynthetic degradation provides a well-defined comprehensive list. Subset profiles were rapidly established by scanning the N-glycan library across a set of samples and matching the masses to well calibrated, highly resolved, peaks with masses within a 15 RMS ppm mass error cutoff. Mass profile establishment is critical for advancement from the development stage to the high-throughput biomarker analysis.
Implementation of glycan libraries improves the biomarker detection sensitivity because it focuses the hypothesis testing to only glycan masses. Reducing the number of tests allows for respective performance gains from the multiple testing corrections. The Bonferroni multiple-testing-corrections help avoid inflated Type-1 error rates. The size of the glycan profiles is large enough to test all the glycans of interest but still small enough for significant changes to be detected.
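The effect of the number of tests on the Bonferroni cutoff can be shown with a short Python sketch. The p-values, glycan labels, and test counts below are hypothetical and serve only to show how a smaller, library-bounded set of tests leaves a less stringent per-test cutoff.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff under the Bonferroni correction."""
    return alpha / n_tests

def significant(p_values, alpha=0.05, n_tests=None):
    """Names whose p-value passes the Bonferroni-corrected cutoff."""
    n = len(p_values) if n_tests is None else n_tests
    cutoff = bonferroni_cutoff(alpha, n)
    return sorted(name for name, p in p_values.items() if p <= cutoff)

# Hypothetical p-values for three glycan peaks.
p_values = {"glycan_a": 0.0004, "glycan_b": 0.004, "glycan_c": 0.03}
# Only the three library glycans are tested: cutoff = 0.05 / 3.
library_hits = significant(p_values)
# The same peaks tested alongside 97 non-glycan peaks: cutoff = 0.05 / 100.
all_peak_hits = significant(p_values, n_tests=100)
```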
The combinatorial glycan method (generating a library by iterating over all possible monosaccharide combinations) was compared with the theoretical glycan library method by examining the fraction of compositions consistent with the library relative to those that are not. The number of inconsistent combinatorial compositions increases with increasing tolerances for mass assignments. This trend is shown in Supplementary Figure 5. The drawback of using an unfiltered combinatorial library is that it generates between 40 and 60% false compositions, depending on whether protonated masses or sodiated masses are used and assuming a 15 RMS ppm mass error cutoff. There are more false compositions in the sodiated mass list because of the allowed proton-sodium exchange common to the carboxylic acid group of sialic acid. The sodium-substituted cation takes on a multiply sodiated form [M+(1+x)Na-(x)H]+ where x can be equal to or less than the number of exchangeable acid groups. An N-glycan biological filtered method is not included for comparison because the N-glycan filter is inherent in the theoretical retrosynthetic N-glycan library [22]. All of the rules are included in the glycan networks and initial starting point ions. Additionally, multiple mass error windows are included for comparison. Supplementary Figure 5 depicts the importance of high mass accuracy measurements and shows that as the mass error tolerance increases, the number of false assignments increases.
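The multiply sodiated series [M+(1+x)Na-(x)H]+ can be enumerated directly. The Python sketch below uses standard monoisotopic atomic masses; the function name and the example glycan (a neutral mass of 1931.6876 Da with one sialic acid) are hypothetical.

```python
NA = 22.98976928        # monoisotopic mass of sodium (Da)
H = 1.00782503          # monoisotopic mass of hydrogen (Da)
ELECTRON = 0.00054858   # electron mass (Da)

def sodiated_series(neutral_mass, n_acid_groups):
    """m/z of [M+(1+x)Na-(x)H]+ for x = 0 .. n_acid_groups."""
    return [neutral_mass + (1 + x) * NA - x * H - ELECTRON
            for x in range(n_acid_groups + 1)]

# Hypothetical monosialylated glycan with one exchangeable acid group.
series = sodiated_series(1931.6876, n_acid_groups=1)
```

Each proton-sodium exchange shifts the ion by about 21.98 Da, so every sialic acid adds one more candidate mass per composition to a sodiated combinatorial list.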
Since many glycans are present in families that are related by monosaccharides, identifying these differences in spectra helps confirm compositions without the need for tandem mass spectrometry or glycosidase digestion. It is critical that each spectrum is reduced to only monoisotopic peaks prior to searching for monosaccharide differences.
STATISTICS
Normalization
Normalizing spectra intensities is one of the most important operations in mass spectrometry analysis. It affects intensity values more than any other data operation. The Glycolyzer includes several normalization options: base peak intensity (BPI), total ion intensity (TI), total peak intensity (TPI), total library intensity (TLI) and select library intensity (SLI). Base peak intensity normalization converts peak intensities to a percentage relative to the most intense peak in the spectrum. However, changes in the base peak’s intensity cannot be observed and subsequent perturbations to it are propagated to other ions in the spectra. Total ion intensity is based on a sum of all data present in the unprocessed spectrum. Dividing ion intensities by the mean of all ion intensities will normalize the spectrum primarily to the noise level because of the relative sparseness of the ions as compared to the noise. Total peak intensity involves normalizing the spectra to the average peak intensity based on only peaks above the noise threshold. This is similar to the method used by Barkauskas et al. on a prostate cancer study [48] and focuses the normalization to intense peaks. The total library intensity option is similar to the total peak intensity except that only annotated peak intensities contribute to the mean total intensity divisor. This allows normalizing by only the ions of interest (N-glycans in this case). The select library intensity normalization further focuses the normalization divisor by including only a select subset of the annotated ions. Prior information on the frequency of detection of library ions in a data set (the percentage of samples containing the ion) can be used to rank the ions so only glycan ions with high detection rate are used for normalization calculations.
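The normalization options differ only in which peaks contribute to the divisor. A minimal sketch follows, assuming a spectrum is stored as a mapping from m/z to intensity; the function names and example values are hypothetical.

```python
from statistics import mean


def bpi_divisor(spectrum):
    """Base peak intensity: scales the most intense peak to 100 percent."""
    return max(spectrum.values()) / 100.0


def tpi_divisor(spectrum, noise_threshold):
    """Total peak intensity: mean of peaks above the noise threshold."""
    return mean(v for v in spectrum.values() if v > noise_threshold)


def tli_divisor(spectrum, library_mzs):
    """Total library intensity: mean of annotated peaks only.

    Passing a hand-picked subset of library m/z values gives the
    select library intensity (SLI) variant.
    """
    return mean(spectrum[mz] for mz in library_mzs if mz in spectrum)


def normalize(spectrum, divisor):
    """Divide every intensity in the spectrum by one constant divisor."""
    return {mz: v / divisor for mz, v in spectrum.items()}


# Invented intensities; 1500.000 stands in for a non-glycan peak and
# 1600.000 for a data point below the noise threshold.
spectrum = {1257.423: 200.0, 1485.534: 800.0, 1500.000: 400.0, 1600.000: 5.0}
by_bpi = normalize(spectrum, bpi_divisor(spectrum))
by_tpi = normalize(spectrum, tpi_divisor(spectrum, noise_threshold=10.0))
by_tli = normalize(spectrum, tli_divisor(spectrum, [1257.423, 1485.534]))
```

Because every option applies a single constant divisor per spectrum, relative intensities within a spectrum are unchanged; only the comparison between spectra is affected.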
Although the different normalization methods tested on this data set produced slightly different sets of significant ions, there was a high degree of similarity between results because the methods all used a constant divisor and only varied by the different sets of ions used to calculate the divisor. The strongest biomarkers were found significant regardless of normalization method. The results from this study are based on the total peak intensity method.
Spectra Averaging
Collecting multiple spectra of the same sample greatly improves the precision of the measurement. As the number of spectra, N, increases, the standard deviation of the averaged measurement decreases in proportion to the inverse square root of N [33]. Replicate spectra can be processed with the Glycolyzer, which provides the user with two options for incorporating them. The most common method averages specific ion intensities from each technical replicate prior to statistical analysis. This works best when target ions are detected in all spectra. An alternate method is to take the highest value of each ion from the set of replicates. This represents the best-case scenario of data from the sample. It can help overcome some of the variability from the MALDI ionization process, where cold spots on the matrix produce only the most intense ions. Each ion need only be detected above the threshold in one replicate of a given set to be included. Standard spectra averaging of specific ion intensities was used for the five technical replicates acquired in the discovery set.
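The two replicate-handling options can be sketched as follows. The function name is hypothetical, and averaging over only the detected replicates is an assumption; the text does not state how undetected replicates are treated in the averaging mode.

```python
from statistics import mean


def combine_replicates(values, mode="mean"):
    """Combine one ion's intensities across technical replicates.

    values: one entry per replicate, with None where the ion was not detected.
    mode "mean" averages the detected values (an assumption, see lead-in);
    mode "max" keeps the highest value, the best-case option.
    """
    detected = [v for v in values if v is not None]
    if not detected:
        return None
    if mode == "mean":
        return mean(detected)
    if mode == "max":
        return max(detected)
    raise ValueError("mode must be 'mean' or 'max'")


# Five invented replicate intensities for one ion; one replicate missed it.
replicates = [10.0, 12.0, 14.0, None, 12.0]
averaged = combine_replicates(replicates, mode="mean")
best_case = combine_replicates(replicates, mode="max")
```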
Missing Values
When extracting glycan library masses from the data, some of the ions in the profile are not detected above the noise threshold or are missed because of deisotoping errors. The absence of a peak is useful when monitoring the frequency of detection of an ion (presence or absence) across a sample set but often causes problems with downstream statistics calculations. The solution to the missing data implemented here is to look below the threshold and find the largest peak within a prescribed mass error window. Filling in noise values for missing peaks should result in higher quality biomarkers because zero values would otherwise skew the distributions of low intensity ions that are near the noise threshold cutoff. However, very low intensity peaks can be overrepresented if the number of zeroes is greater than the number of detected peaks across a data set. Although a potential problem, this scenario typically does not lead to an increased number of false positive biomarkers because the glycans with large amounts of missing values will not pass the strict hypothesis tests due to the high variance caused by the randomness of the low intensity peaks used for data filling.
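A minimal sketch of the data-filling step is shown below. It assumes the raw, unthresholded data are available as (m/z, intensity) pairs and that the mass window is expressed in ppm; the function name, the 15 ppm default, and the example data are hypothetical.

```python
def fill_missing(raw_points, target_mz, tol_ppm=15.0):
    """Return the largest raw intensity within tol_ppm of target_mz.

    raw_points: (mz, intensity) pairs from the spectrum before thresholding.
    Returns 0.0 when no data point falls inside the window.
    """
    window = target_mz * tol_ppm * 1e-6
    nearby = [inten for mz, inten in raw_points if abs(mz - target_mz) <= window]
    return max(nearby, default=0.0)


# Invented sub-threshold data around a library mass that was not annotated.
raw_points = [(1996.700, 3.1), (1996.720, 1.4), (1996.730, 2.2), (1996.900, 9.0)]
filled = fill_missing(raw_points, target_mz=1996.724)
```

The point at m/z 1996.900 is more intense but lies outside the mass window, so it is not used as the fill value.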
Multiple Statistical Hypothesis Testing
Each glycan annotated by the theoretical profile is subjected to hypothesis testing to determine if any changes are significant. Five technical replicate FT-ICR spectra from each sample are averaged prior to hypothesis testing. The natural logarithm of the intensities is used for testing to prevent the most intense ions in the spectra from overwhelming the less intense species. Furthermore, taking logarithms of the intensities improves the assumption of constant error variance and makes the data better suited for standard statistical testing [4]. Two-tailed t-tests were used for hypothesis testing. Because a large number of glycans are tested in this manner, multiple testing corrections should be employed. Bonferroni corrections are implemented to add rigor to the testing by controlling the family-wise error rate. A glycan is considered to change significantly in intensity when its Bonferroni corrected p-value is below 0.05 (n=101, the number of glycan masses in the library).
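The testing scheme can be sketched with the standard library alone. The text does not state whether a pooled or unequal-variance t-test was used; the pooled (Student) form below is an assumption, and the helper names are hypothetical. Converting the t statistic to a p-value requires the t distribution, which is left to a statistics package.

```python
from math import log, sqrt
from statistics import mean, variance


def pooled_t(controls, cases):
    """Student's two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(controls), len(cases)
    pooled_var = ((n1 - 1) * variance(controls)
                  + (n2 - 1) * variance(cases)) / (n1 + n2 - 2)
    t = (mean(cases) - mean(controls)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def bonferroni(p_value, n_tests=101):
    """Bonferroni corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def is_significant(p_value, n_tests=101, alpha=0.05):
    """True when the corrected p-value is below alpha."""
    return bonferroni(p_value, n_tests) < alpha


# Invented normalized intensities for one glycan; testing is done on logs.
controls = [log(v) for v in (100.0, 120.0, 110.0, 130.0)]
cases = [log(v) for v in (60.0, 70.0, 65.0, 75.0)]
t_stat, dof = pooled_t(controls, cases)
```

With 101 tests, an uncorrected p-value must be below roughly 0.000495 for the corrected value to fall under 0.05.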
Linear Classifier Motif Tests
The significant markers that passed the t-test were combined into a Motif Test that leverages deviations in case intensities from control mean intensities. Combining multiple markers into a diagnostic panel has been shown previously to improve discrimination [49, 50]. To obtain a score for each sample, each glycan in the Motif Test is weighted by the difference between the mean control ion intensity and the mean case ion intensity. The larger the difference between the means, the larger the weighting factor. The scoring scheme was set up by adding the absolute values of the marker ion deviations from the control mean. This allows the summation of positive and negative deviations found in the biomarker Motif Test. The net score is used to classify unknown samples: samples consistent with the motifs, and thus showing larger deviations, score higher. A separate Motif Test was developed for each ACN fraction. The results are summarized with receiver operating characteristic (ROC) curves and evaluated by the area under the curve (AUC). The AUC is calculated by geometric integration. Applying Motif Tests to the discovery set provided high AUC results for the three fractions: 10% (0.89), 20% (0.87), and 40% (0.88). The Motif Test scores from the 10%, 20% and 40% fractions can be combined, with even weighting, into a linear overall test metric. The overall test improves sensitivity and specificity and increases ROC AUC to 0.93. The ROC curve results are included in Figure 2.
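One possible reading of the scoring scheme is sketched below. The exact formula is not given in the text, so the weighted sum of absolute deviations, the trapezoidal ROC integration, and all function names are assumptions made for illustration.

```python
def motif_score(sample, control_mean, case_mean):
    """Weighted sum of absolute deviations from the control mean.

    Each ion's weight is the absolute difference between the control and
    case group means (one interpretation of the described weighting).
    """
    score = 0.0
    for ion, intensity in sample.items():
        weight = abs(case_mean[ion] - control_mean[ion])
        score += weight * abs(intensity - control_mean[ion])
    return score


def overall_score(fraction_scores):
    """Evenly weighted linear combination of per-fraction Motif scores."""
    return sum(fraction_scores) / len(fraction_scores)


def roc_auc(case_scores, control_scores):
    """Area under the ROC curve by trapezoidal integration."""
    thresholds = sorted(set(case_scores) | set(control_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in case_scores) / len(case_scores)
        fpr = sum(s >= t for s in control_scores) / len(control_scores)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented log intensities for a two-ion motif.
control_mean = {"a": 1.10, "b": 0.98}
case_mean = {"a": 1.08, "b": 1.02}
score = motif_score({"a": 1.05, "b": 0.95}, control_mean, case_mean)
```

Scores computed this way for every case and control sample can be passed to `roc_auc`; perfectly separated groups give an area of 1.0 and indistinguishable groups give 0.5.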
Figure 2.
Receiver operating characteristic curve results from applying the glycan Motif Test to the discovery sample set. The area under the curves represents a high degree of specificity and sensitivity for all three fractions independently. The “Overall” trace is an evenly weighted linear combination of the Motif scores from the 10%, 20% and 40% fractions.
Data Modeling
The data analysis pipeline was evaluated by modeling the data with perturbation analysis. Synthetic case and control mass spectra were created with perturbed intensities. A representative sample spectrum was selected and used to seed new spectra. Each intensity value was modified with a multiplicative factor generated randomly from a normal distribution using a Box-Muller simulation [51]. Several data sets were generated to include distributions in intensity values that produced coefficients of variation of 10%, 20%, 30%, 40% and 60%. The randomization was evaluated by comparing two sets of control spectra generated without a programmed biomarker. After data processing, no significant biomarkers were detected (p=0.05), indicating the data is sufficiently randomized in the model. An example plot of 48 simulated control spectra with a coefficient of variation of 60% is included in Supplementary Figure 6. At each coefficient of variation, two sets of 48 spectra were generated (one for case and the other for control), where the case set contained one glycan ion with its mean intensity value increased by 5%, 10%, 25%, 50%, 100% or 150% relative to the control. This change in abundance simulates the effect of a case biomarker ion deviating in intensity from the control. The resulting case sets contained one theoretical biomarker with increasing perturbations that could be used to test the Glycolyzer’s ability to detect biomarker changes.
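The perturbation model can be sketched as follows. The Box-Muller transform itself is standard; treating the multiplicative factor as normal with mean 1 and standard deviation equal to the target coefficient of variation is an assumption, as are the function names and the seed spectrum.

```python
import math
import random


def box_muller(rng):
    """One standard normal deviate from two uniform deviates."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so the logarithm is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def perturb(intensities, cv, rng):
    """Scale each intensity by a factor drawn from N(1, cv) (assumed form).

    At large cv a factor can be negative; how the original model handled
    that case is not described, so no clamping is applied here.
    """
    return [i * (1.0 + cv * box_muller(rng)) for i in intensities]


def simulate_set(seed_spectrum, n, cv, rng, biomarker_index=None, shift=0.0):
    """Generate n perturbed spectra, optionally raising one ion by `shift`."""
    spectra = []
    for _ in range(n):
        base = list(seed_spectrum)
        if biomarker_index is not None:
            base[biomarker_index] *= 1.0 + shift
        spectra.append(perturb(base, cv, rng))
    return spectra


rng = random.Random(0)
seed_spectrum = [1000.0, 250.0, 4000.0]  # invented seed intensities
controls = simulate_set(seed_spectrum, 48, 0.10, rng)
cases = simulate_set(seed_spectrum, 48, 0.10, rng, biomarker_index=1, shift=0.50)
```

Each combination of coefficient of variation and abundance shift yields one case set and one control set that can be passed through the same preprocessing and testing steps as experimental data.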
The synthetic sets of raw spectra were processed with the Glycolyzer’s preprocessing and statistical algorithms and the p-values and ROC AUC values were recorded. The hypothesis testing analysis was based on the single biomarker programmed into the model and the same multiple testing corrections were applied (N=101). Trend lines depicting the relationship between percent change in abundance and ROC AUC are plotted in Figure 3. Approximating the ROC AUC values at p=0.05 using linear interpolation of the data allows for the calculation of a p=0.05 cutoff line. A plot of the interpolated ROC AUC values vs. interpolated percent abundance change is shown in Supplementary Figure 7. The p=0.05 cutoff line is presented as a dashed line in Figure 3. Modeled values higher than this dashed line in Figure 3 would pass the hypothesis tests. The glycan biomarkers detected from the experimental data were overlaid to demonstrate how well the experimental data followed the trends and passed the modeled p-value cutoff line. 85% of the experimentally determined biomarkers were above the modeled cutoff line.
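The interpolation used to locate the p=0.05 cutoff can be sketched as below. The helper names and all numeric values are invented for illustration; interpolating linearly in p (rather than in log p) is an assumption.

```python
def crossing(xs, ys, target):
    """x at which the piecewise-linear curve through (xs, ys) reaches target."""
    for i in range(len(xs) - 1):
        y0, y1 = ys[i], ys[i + 1]
        if y0 != y1 and min(y0, y1) <= target <= max(y0, y1):
            frac = (target - y0) / (y1 - y0)
            return xs[i] + frac * (xs[i + 1] - xs[i])
    return None


def interpolate_at(xs, ys, x):
    """y on the piecewise-linear curve through (xs, ys) at position x."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            frac = (x - x0) / (x1 - x0)
            return ys[i] + frac * (ys[i + 1] - ys[i])
    raise ValueError("x is outside the tabulated range")


# Invented model output for one coefficient of variation.
percent_change = [5.0, 10.0, 25.0, 50.0]
corrected_p = [0.90, 0.40, 0.03, 0.001]
auc = [0.52, 0.55, 0.70, 0.85]

change_at_cutoff = crossing(percent_change, corrected_p, 0.05)
auc_at_cutoff = interpolate_at(percent_change, auc, change_at_cutoff)
```

Repeating this for every coefficient of variation gives one (percent change, AUC) point per trend line, and a linear fit through those points produces a cutoff line of the kind described above.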
Figure 3.

Evaluation of the performance of the Glycolyzer classifier using controlled modeled data. The trend lines demonstrate how the ROC analysis responds to variation within a data set and to separation in average intensities between the data sets. The dots correspond to the glycans significantly changing between disease states. The dashed line corresponds to an approximated p=0.05 cutoff value determined from the modeled data. Experimental biomarkers below the simulated p=0.05 line are marked with an asterisk in Table 1.
DISCUSSION
The Glycolyzer successfully calibrated and processed 1410 transients from the ovarian cancer discovery set and identified several candidate glycan biomarkers. The unified software approach streamlined the data analysis and allowed for results to be obtained on the same day the last spectrum was collected. The empirically derived serum glycan profile [22] (containing 101 glycan masses) developed in-house was used to filter the data. After statistical analysis, 51 glycan candidate biomarkers (39 glycan masses due to detection in multiple fractions) were identified using a p<0.05 cutoff for Bonferroni corrected p-values (n=101). The candidate markers with their monosaccharide compositions and p-values are summarized in Table 1. The full list of 101 glycans monitored is included in Supplementary Figure 8 and the compositions are included in Supplementary Figure 9. The glycan log mean intensities and standard deviation are also presented.
Table 1.
Candidate glycan biomarkers found with Bonferroni corrected p-values less than 0.05. The “ACN Elution Fraction” shows the acetonitrile fraction in which the glycan was detected and found to be significantly changing. Several glycans eluted in multiple fractions and were found significant in each of them. “Exact m/z” is the calculated mass to charge ratio of the sodiated ion in the 10% and 20% fractions or the deprotonated ion in the 40% fraction. “Intensity Relative Percent Change” depicts the magnitude by which the normalized glycan intensity is increasing or decreasing in the cancer state relative to control. The “Intensity Log10 Mean” corresponds to the logarithm of the mean intensity of the glycan from either the controls or cancer cases and the “Intensity Log10 CV” is the coefficient of variation of that mean. “Bonferroni Corrected p-value” is the t-test p-value corrected for 101 independent tests. The monosaccharide composition symbols are abbreviated: Hex=Hexose, HexNAc=N-Acetylglucosamine, Fucose=Deoxyhexose, NeuAc=Neuraminic Acid and Na/H=Sodium cation substitution for proton. “ROC AUC” stands for the area under the curve of a ROC plot for a specific ion. An asterisk indicates a borderline biomarker because the area under the curve was not greater than the modeled p=0.05 significance line. The “Motif Test ROC AUC” shows the area under the curve of a ROC plot for all the ions combined into a Motif Test for each fraction.
| ID | ACN Elution Fraction | m/z | Trend | Log Normal Mean | Log Normal STDEV | Log Patient Mean | Log Patient STDEV | Bonferroni Corrected p-Value | H | N | F | s | Na-H | Family | AUC | Motif Test AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10% | 1257.423 | −1 | 1.101 | 0.023 | 1.081 | 0.027 | 1.2E-02 | 5 | 2 | 0 | 0 | 0 | 2 | 0.75 | 0.88 |
| 2 | 10% | 1298.449 | −1 | 1.014 | 0.017 | 0.998 | 0.017 | 1.3E-03 | 4 | 3 | 0 | 0 | 0 | 2 | 0.77 | |
| 3 | 10% | 1339.476 | 1 | 1.070 | 0.024 | 1.092 | 0.029 | 6.5E-03 | 3 | 4 | 0 | 0 | 0 | 1 | 0.77 | |
| 4 | 10% | 1460.502 | −1 | 1.100 | 0.027 | 1.075 | 0.033 | 5.0E-03 | 5 | 3 | 0 | 0 | 0 | 2 | 0.76 | |
| 5 | 10% | 1485.534 | 1 | 1.225 | 0.023 | 1.257 | 0.039 | 2.5E-04 | 3 | 4 | 1 | 0 | 0 | 1 | 0.79 | |
| 6 | 10% | 1542.555 | 1 | 1.169 | 0.034 | 1.207 | 0.042 | 3.5E-04 | 3 | 5 | 0 | 0 | 0 | 1 | 0.79 | |
| 7 | 10% | 1663.581 | −1 | 1.132 | 0.025 | 1.106 | 0.025 | 1.6E-04 | 5 | 4 | 0 | 0 | 0 | 4 | 0.79 | |
| 8 | 10% | 1688.613 | 1 | 1.212 | 0.024 | 1.232 | 0.028 | 1.5E-02 | 3 | 5 | 1 | 0 | 0 | 1 | 0.73 | |
| 9 | 10% | 1793.644 | −1 | 0.977 | 0.014 | 0.959 | 0.016 | 4.4E-06 | 4 | 4 | 2 | 0 | 0 | 4 | 0.83 | |
| 10 | 10% | 1809.639 | −1 | 1.132 | 0.030 | 1.103 | 0.026 | 8.5E-05 | 5 | 4 | 1 | 0 | 0 | 4 | 0.79 | |
| 11 | 10% | 1996.724 | −1 | 1.042 | 0.017 | 1.030 | 0.018 | 4.2E-02 | 4 | 5 | 2 | 0 | 0 | 3 | 0.75 | |
| 12 | 10% | 2028.714 | −1 | 0.977 | 0.015 | 0.967 | 0.012 | 2.9E-02 | 6 | 5 | 0 | 0 | 0 | 5 | 0.71 | |
| 13 | 10% | 2158.777 | −1 | 1.032 | 0.025 | 1.008 | 0.023 | 3.1E-04 | 5 | 5 | 2 | 0 | 0 | 3 | 0.79 | |
| 14 | 20% | 1339.476 | 1 | 1.023 | 0.029 | 1.053 | 0.040 | 2.4E-03 | 3 | 4 | 0 | 0 | 0 | 1 | 0.79 | 0.85 |
| 15 | 20% | 1444.507 | −1 | 1.119 | 0.021 | 1.092 | 0.035 | 1.2E-03 | 4 | 3 | 1 | 0 | 0 | 2 | 0.78 | |
| 16 | 20% | 1485.534 | 1 | 1.274 | 0.031 | 1.303 | 0.042 | 1.3E-02 | 3 | 4 | 1 | 0 | 0 | 1 | 0.77 | |
| 17 | 20% | 1542.555 | 1 | 0.961 | 0.014 | 0.986 | 0.025 | 7.5E-06 | 3 | 5 | 0 | 0 | 0 | 1 | 0.85 | |
| 18 | 20% | 1663.581 | −1 | 1.171 | 0.028 | 1.149 | 0.034 | 3.9E-02 | 5 | 4 | 0 | 0 | 0 | 4 | 0.73 | |
| 19 | 20% | 1809.639 | −1 | 1.296 | 0.025 | 1.269 | 0.030 | 2.8E-04 | 5 | 4 | 1 | 0 | 0 | 4 | 0.79 | |
| 20 | 20% | 1955.697 | −1 | 1.059 | 0.028 | 1.034 | 0.034 | 6.5E-03 | 5 | 4 | 2 | 0 | 0 | 4 | 0.75 | |
| 21 | 20% | 2158.777 | −1 | 1.057 | 0.027 | 1.037 | 0.025 | 1.8E-02 | 5 | 5 | 2 | 0 | 0 | 3 | 0.75 | |
| 22 | 20% | 2174.772 | −1 | 1.072 | 0.026 | 1.054 | 0.022 | 1.3E-02 | 6 | 5 | 1 | 0 | 0 | 5 | 0.75 | |
| 23 | 40% | 1274.453 | −1 | 0.972 | 0.016 | 0.955 | 0.016 | 1.2E-04 | 4 | 3 | 0 | 0 | 0 | 2 | 0.81 | 0.88 |
| 24 | 40% | 2441.870 | 1 | 1.080 | 0.021 | 1.119 | 0.036 | 4.9E-07 | 6 | 5 | 1 | 1 | 0 | 5 | 0.86 | |
| 25 | 40% | 2754.948 | 1 | 0.971 | 0.014 | 0.994 | 0.025 | 2.4E-05 | 6 | 5 | 1 | 2 | 1 | 5 | 0.82 | |
| 26 | 40% | 2807.003 | 1 | 0.982 | 0.019 | 1.020 | 0.031 | 1.7E-08 | 7 | 6 | 1 | 1 | 0 | 6 | 0.87 | |
The data quality and processing improvements can be observed by selecting biomarker m/z 1809.639 from Table 1 as a case study. Plotting all of the mass spectra from the controls and juxtaposing them with all of the spectra from the cases shows that, even without data processing, the abundance has decreased on average. This is shown with the mass spectra zoom profiles in Figure 4. The improvements from data processing are illustrated in Figure 5 using box plots in which the 0, 25, 50, 75, and 100 percentiles are shown for the controls and cases. The logarithm and normalization procedures tightened the data distributions, produced more symmetric distributions and improved biomarker discernment.
Figure 4.
Comparison of mass spectra centered on the isotope envelope of m/z 1809.639 and its monoisotopic mass. The left plots correspond to the unprocessed, averaged data from the controls while the right plots correspond to the unprocessed, average data from the cases. Since the plots overlap, box plots were included to show the 0, 25, 50, 75 and 100 percentiles.
Figure 5.
Improvements from data processing. The left box plots correspond to the unprocessed, averaged data while the right box plots correspond to the same data after it is log transformed and normalized.
Twelve of the glycans identified as significant were detected in more than one elution fraction. Interestingly, all twelve had consistent trends of increasing or decreasing intensity across fractions. Although it is possible that glycan isomers were crudely separated along the SPE fractional lines, the constant trends of the glycans across fractions suggest a split fractionation of a single glycan structure. Several glycan compositions were detected with and without fucose. When the fucosylated/non-fucosylated pairs were detected in more than one fraction, the fucosylated form was more intense in the later fraction. This is consistent with the elution order observed with graphitized columns and LC/MS.
N-glycans are synthesized enzymatically with glycosyltransferases and glycosidases and are built up one monosaccharide unit at a time. This process results in families of glycans that differ from each other by only one monosaccharide. The statistically significant biomarkers detected in this study primarily come from a single family. A glycan network in Figure 6 shows 34 of the 39 glycan nodes can be linked either directly or indirectly to all other glycans in the family by single monosaccharide links. The glycan network is also coded with the change in intensity trends between the controls and the cancer cases. Detection of glycans in families increases the confidence of the glycan annotations because additional orthogonal information is used beyond exact mass.
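The family relationship can be expressed as a graph in which two compositions are linked when they differ by exactly one monosaccharide. The sketch below uses compositions written as (Hex, HexNAc, Fuc, NeuAc) tuples and ignores sodium/proton exchange; the function names are hypothetical and the five compositions are a small subset chosen from Table 1 for illustration.

```python
from itertools import combinations


def single_step(a, b):
    """True if two compositions differ by exactly one monosaccharide unit."""
    return sum(abs(x - y) for x, y in zip(a, b)) == 1


def build_network(compositions):
    """Adjacency sets linking compositions one monosaccharide apart."""
    edges = {c: set() for c in compositions}
    for a, b in combinations(compositions, 2):
        if single_step(a, b):
            edges[a].add(b)
            edges[b].add(a)
    return edges


def connected_component(edges, start):
    """All compositions reachable from start through single-unit links."""
    seen, stack = {start}, [start]
    while stack:
        for neighbor in edges[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


# (Hex, HexNAc, Fuc, NeuAc) compositions taken from Table 1.
compositions = [(3, 4, 0, 0), (3, 4, 1, 0), (3, 5, 0, 0), (3, 5, 1, 0), (5, 4, 0, 0)]
network = build_network(compositions)
family = connected_component(network, (3, 4, 0, 0))
```

Within this five-member subset the four Hex3 compositions form one connected family, while Hex5HexNAc4 is two hexose units away from its nearest neighbor and would only join once an intermediate composition is included.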
Figure 6.
A network of the statistically significant glycan biomarkers. The compositions correspond to the number of units: Hexose-N-Acetylhexosamine-Fucose-Neuraminic Acid-Sodium Substituted Proton.
Most of the neutral glycans were decreasing in intensity while most of the glycans containing sialic acid were increasing in intensity. However, one set of neutral glycans, consisting of a subfamily of glycans located in the bottom left of Figure 6 (Hex3HexNAc4, Hex3HexNAc4Fuc1, Hex3HexNAc5, Hex3HexNAc5Fuc1), was increasing. Specifically, the increase in the FA2 glycan (Hex3HexNAc4Fuc1) is consistent with reported increases detected in serum IgG and whole serum from ovarian cancer patients [13]. The FA2 nomenclature is described by Gornik et al. [52]. Kim et al. also reported the same increase in FA2 levels in serum [53]. Eight significant glycan biomarkers containing sialic acid were detected in the 20% and 40% fractions. Increases in sialylated glycan intensities are consistent with other reports in the literature where sialylated glycans have been implicated in cancer detection and metastases [54-56]. Many sialic acid-containing and sialic acid-free glycan pairs, such as Hex5HexNAc4 and Hex5HexNAc4NeuAc1, showed a trend of increasing sialic acid-containing and decreasing sialic acid-free intensities. This opposing trend is consistent with the sialic acid-free glycans being used as substrates for upregulated sialyltransferases, which produce sialylated glycans. The sialic acid pairs and mean intensities are listed in Table 2. Although we cannot confirm that the sialylated/non-sialylated pairs have the same core structure, the trends are intriguing and require further analysis.
Table 2.
Sialic acid-containing and sialic acid-free pairs of glycans. The sialic acid-free glycans decreased significantly while the sialic acid-containing glycans increased significantly.
| ID | ACN Elution Fraction | Exact m/z | Intensity Relative Percent Change | Intensity Log10 Control Mean | Intensity Log10 Control CV | Intensity Log10 Cases Mean | Intensity Log10 Cases CV | Hex | HexNAc | Fucose | NeuAc | Na/H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 10% | 2028.714 | −28.9% | 2.96 | 4.2% | 2.81 | 5.2% | 6 | 5 | 0 | 0 | 0 |
| 45 | 40% | 2441.870 | 137.4% | 3.97 | 5.1% | 4.35 | 7.2% | 6 | 5 | 1 | 1 | 0 |
| 48 | 40% | 2732.966 | 33.6% | 3.20 | 4.4% | 3.33 | 4.9% | 6 | 5 | 1 | 2 | 0 |
| 49 | 40% | 2754.948 | 68.7% | 3.07 | 4.0% | 3.30 | 6.6% | 6 | 5 | 1 | 2 | 1 |
| 35 | 20% | 2012.719 | −20.4% | 4.48 | 4.3% | 4.38 | 5.6% | 5 | 5 | 1 | 0 | 0 |
| 39 | 20% | 2325.796 | 29.4% | 2.49 | 7.4% | 2.60 | 12.6% | 5 | 5 | 1 | 1 | 1 |
| 26 | 20% | 1663.581 | −37.2% | 4.44 | 4.5% | 4.24 | 7.0% | 5 | 4 | 0 | 0 | 0 |
| 10 | 10% | 1663.581 | −48.2% | 4.30 | 4.8% | 4.01 | 6.7% | 5 | 4 | 0 | 0 | 0 |
| 33 | 20% | 1976.659 | 56.1% | 2.82 | 8.3% | 3.01 | 15.4% | 5 | 4 | 0 | 1 | 1 |
| 36 | 20% | 2122.717 | 28.7% | 2.54 | 6.9% | 2.65 | 12.1% | 5 | 4 | 1 | 1 | 1 |
| 5 | 10% | 1460.502 | −48.2% | 4.02 | 6.1% | 3.73 | 8.6% | 5 | 3 | 0 | 0 | 0 |
| 43 | 40% | 1727.601 | 29.0% | 3.59 | 4.3% | 3.70 | 3.4% | 5 | 3 | 0 | 1 | 0 |
Six of the eight glycan biomarkers contain fucose, which may indicate the presence of sialyl Lewis X motifs. Sialyl Lewis X has been documented as a marker for inflammation [57, 58] and its aberrant expression has been implicated in tumor formation and metastasis [59]. Additional structural studies would demonstrate whether the fucose is located on the core or antennae.
CONCLUDING REMARKS
The Glycolyzer software removes the data analysis bottleneck and drastically decreases the time-to-results for a clinical glycomics study. Now that the full pipeline is in place and metrics for evaluating the system have been established, each piece of the biomarker discovery pipeline can be perturbed and evaluated. Prior to the Glycolyzer, manual calibration and data analysis from a 94-sample set would take several weeks to months. The Glycolyzer can accomplish the same task in a matter of hours. The number of samples processed in this study demonstrates the potential for high throughput analysis for discovery and validation studies in the future, both for ovarian cancer and other malignancies. In addition, the Glycolyzer allowed us to identify a panel of glycan biomarkers with high sensitivity and specificity that are appropriate for formal validation testing. Although the case subjects in this study were diagnosed with ovarian cancer, it is possible that the biomarkers are not cancer specific and could represent an inflammatory response. This would need to be investigated in subsequent studies.
Supplementary Material
Supplementary Figure 1. Comparison of internal calibration mass accuracy between Omega8 (IonSpec, Irvine, CA) software and the Glycolyzer software. The root-mean-square mass errors for each glycan ion are reported from the same spectra calibrated with Omega8 and Glycolyzer algorithms. The masses correspond to the sodiated glycan ion.
Supplementary Figure 2. (Left) A histogram of the relative intensities found in a spectrum. The largest frequency bin (0.07) is the noise level and the half-width at half-maximum is used for the standard deviation of the noise. The tail of the distribution at high intensities is caused by the contribution of the glycan signals. The curve continues out to x=100 but is truncated to highlight the shape of the noise distribution. (Right) The noise threshold lines are depicted on a zoomed-in mass spectrum showing a peak near the noise level. The lowest allowed peak intensity is 6σ above the mean noise level.
Supplementary Figure 3. This workflow diagram shows the steps used for the reconstructive isotope grouping algorithm. Clusters of data consist of all of the peaks that can be found within 1.00235 Da of a principal peak in both directions. MALDI ionization helps by restricting the charge state to a single charge. Once a cluster is isolated, the cluster is aligned with various models and fit. Best fits are determined based on chi-squared testing. Only after a model is fit is the distribution subtracted from the peak list.
Supplementary Figure 4. Comparison of different isotope distribution models with regards to transient length. The black series corresponds to a full one second acquisition transient while the grey bars correspond to the same data set with the transients reduced to half-length. The “Hex-HexNAc-Fuc-NeuAc” model uses the averagose by Vakhrushev et al., the “Serum Library” model uses the Glycolyzer’s N-Glycan Library derived averagose, the “Exact Composition” uses the exact elemental composition in place of an averagose, the “HexNAc Only” model uses an averagose based on the N-Acetylhexosamine monosaccharide, and the “Peptide Averagine” uses the standard average amino acid averagine used for peptides and proteins. The p-values signify how well the theoretical isotope model fits the raw data in terms of probability. A p-value equal to unity is a perfect fit of the model and the data.
Supplementary Figure 5. A plot showing the importance of mass measurement accuracy for correct annotation when combinatorial glycan annotation approaches are applied. As the mass accuracy decreases, the rate of false compositional assignments increases. Sodiated glycans have a higher chance of false positives because the number of sodiated masses is greater than the number of protonated masses due to sodium/proton cation exchanges on sialic acid groups.
Supplementary Figure 6. Example of modeled mass spectra data. Forty-eight example spectra with a group-wide coefficient of variation set to 60% are plotted. The monoisotopic mass is also plotted to show the variation of intensity. These intensities are also summarized in a box plot to the left depicting the 0, 25, 50, 75 and 100 percentiles.
Supplementary Figure 7. Approximation of t-test p=0.05 cutoff. Interpolated ROC AUC values are plotted vs. interpolated percent change in average abundance values. A linear fit was used to approximate the data.
Supplementary Figure 8. List of all glycans monitored in this study along with the mean and standard deviation. All three fractions are included (10%, 20% and 40%). The masses used for the 10% and 20% fractions reflect the aldehyde form of the glycan and have a sodium cation as the charge carrier. The 40% fraction masses are aldehyde glycans and are listed in the deprotonated form.
Supplementary Figure 9. List of all glycan compositions monitored in this study. The monosaccharide composition symbols are abbreviated: Hex=Hexose, HexNAc=N-Acetylglucosamine, Fucose=Deoxyhexose, NeuAc=Neuraminic Acid and Na/H=Sodium cation substitution for proton.
ACKNOWLEDGEMENTS
GlycanFinder algorithms were influenced in part by IgorPro code used for combinatorial glycan model building developed by Brian H. Clowers. Eric D. Dodds helped develop the Fast Fourier Transform algorithm used to transform the raw transient data. Sample selection, age matching, and blinding were performed by Donald A. Barkauskas and David M. Rocke, who also provided insight into the statistical treatment of data. Anding Fan helped develop an application that produced the raw transient text data files from instrument-specific data files for use in the Glycolyzer.
We gratefully acknowledge the financial support provided by the National Institutes of Health (R01 GM049077). Support was also provided by a gift from the National Ovarian Cancer Coalition (NOCC), Sacramento Chapter (to G.S.L.); a UC Davis Health Systems Research Award (to K.S.L.); and an Ovarian Cancer Research Fund (OCRF) Award (to G.S.L.). We also acknowledge the Gynecologic Oncology Group Tissue Bank for providing the serum sample sets used in this study.
REFERENCES
- [1].Lebrilla CB, An HJ. The prospects of glycan biomarkers for the diagnosis of diseases. Mol Biosyst. 2009;5:17–20. doi: 10.1039/b811781k. [DOI] [PubMed] [Google Scholar]
- [2].Packer NH, von der Lieth CW, Aoki-Kinoshita KF, Lebrilla CB, et al. Frontiers in glycomics: bioinformatics and biomarkers in disease. An NIH white paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 11-13, 2006) Proteomics. 2008;8:8–20. doi: 10.1002/pmic.200700917. [DOI] [PubMed] [Google Scholar]
- [3].Turnbull JE, Field RA. Emerging glycomics technologies. Nat Chem Biol. 2007;3:74–77. doi: 10.1038/nchembio0207-74. [DOI] [PubMed] [Google Scholar]
- [4].Barkauskas DA, An HJ, Kronewitter SR, de Leoz ML, et al. Detecting glycan cancer biomarkers in serum samples using MALDI FT-ICR mass spectrometry data. Bioinformatics. 2009;25:251–257. doi: 10.1093/bioinformatics/btn610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [5].de Leoz MLA, An HJ, Kronewitter S, Kim J, et al. Glycomic approach for potential biomarkers on prostate cancer: Profiling of N-linked glycans in human sera and pRNS cell lines. Disease Markers. 2008;25:243–258. doi: 10.1155/2008/515318.
- [6].Alley WR, Jr., Madera M, Mechref Y, Novotny MV. Chip-based reversed-phase liquid chromatography-mass spectrometry of permethylated N-linked glycans: a potential methodology for cancer-biomarker discovery. Anal Chem. 2010;82:5095–5106. doi: 10.1021/ac100131e.
- [7].An HJ, Miyamoto S, Lancaster KS, Kirmiz C, et al. Profiling of glycans in serum for the discovery of potential biomarkers for ovarian cancer. Journal of Proteome Research. 2006;5:1626–1635. doi: 10.1021/pr060010k.
- [8].Bones J, Mittermayr S, O’Donoghue N, Guttman A, Rudd PM. Ultra performance liquid chromatographic profiling of serum N-glycans for fast and efficient identification of cancer associated alterations in glycosylation. Anal Chem. 2010;82:10208–10215. doi: 10.1021/ac102860w.
- [9].An HJ, Kronewitter SR, de Leoz MLA, Lebrilla CB. Glycomics and disease markers. Current Opinion in Chemical Biology. 2009;13:601–607. doi: 10.1016/j.cbpa.2009.08.015.
- [10].Lebrilla CB, An HJ. The prospects of glycan biomarkers for the diagnosis of diseases. Molecular Biosystems. 2009;5:17–20. doi: 10.1039/b811781k.
- [11].Li B, An HJ, Kirmiz C, Lebrilla CB, et al. Glycoproteomic Analyses of Ovarian Cancer Cell Lines and Sera from Ovarian Cancer Patients Show Distinct Glycosylation Changes in Individual Proteins. Journal of Proteome Research. 2008;7:3776–3788. doi: 10.1021/pr800297u.
- [12].Meany DL, Zhang Z, Sokoll LJ, Zhang H, Chan DW. Glycoproteomics for Prostate Cancer Detection: Changes in Serum PSA Glycosylation Patterns. Journal of Proteome Research. 2008;8:613–619. doi: 10.1021/pr8007539.
- [13].Saldova R, Royle L, Radcliffe CM, Abd Hamid UM, et al. Ovarian cancer is associated with changes in glycosylation in both acute-phase proteins and IgG. Glycobiology. 2007;17:1344–1356. doi: 10.1093/glycob/cwm100.
- [14].Cooper CA, Gasteiger E, Packer NH. GlycoMod - A software tool for determining glycosylation compositions from mass spectrometric data. PROTEOMICS. 2001;1:340–349. doi: 10.1002/1615-9861(200102)1:2<340::AID-PROT340>3.0.CO;2-B.
- [15].Cooper CA, Harrison MJ, Wilkins MR, Packer NH. GlycoSuiteDB: a new curated relational database of glycoprotein glycan structures and their biological sources. Nucleic Acids Research. 2001;29:332–335. doi: 10.1093/nar/29.1.332.
- [16].Cooper CA, Joshi HJ, Harrison MJ, Wilkins MR, Packer NH. GlycoSuiteDB: a curated relational database of glycoprotein glycan structures and their biological sources. 2003 update. Nucleic Acids Research. 2003;31:511–513. doi: 10.1093/nar/gkg099.
- [17].Lo A, Bunsmann P, Bohne A, Lo A, et al. SWEET-DB: an attempt to create annotated data collections for carbohydrates. Nucl. Acids Res. 2002;30:405–408. doi: 10.1093/nar/30.1.405.
- [18].Cooper CA, Wilkins MR, Williams KL, Packer NH. BOLD - A biological O-linked glycan database. Electrophoresis. 1999;20:3589–3598. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3589::AID-ELPS3589>3.0.CO;2-M.
- [19].Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita K, et al. KEGG as a glycome informatics resource. Glycobiology. 2006;16:63R–70. doi: 10.1093/glycob/cwj010.
- [20].von der Lieth CW, Freire AA, Blank D, Campbell MP, et al. EUROCarbDB: An open-access platform for glycoinformatics. Glycobiology. 2011;21:493–502. doi: 10.1093/glycob/cwq188.
- [21].Goldberg D, Sutton-Smith M, Paulson J, Dell A. Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra. Proteomics. 2005;5:865–875. doi: 10.1002/pmic.200401071.
- [22].Kronewitter SR, An HJ, de Leoz ML, Lebrilla CB, et al. The development of retrosynthetic glycan libraries to profile and classify the human serum N-linked glycome. PROTEOMICS. 2009;9:2986–2994. doi: 10.1002/pmic.200800760.
- [23].Ethier M, Figeys D, Perreault H. N-glycosylation analysis using the StrOligo algorithm. Methods Mol Biol. 2006;328:187–197. doi: 10.1385/1-59745-026-X:187.
- [24].Lapadula AJ, Hatcher PJ, Hanneman AJ, Ashline DJ, et al. Congruent strategies for carbohydrate sequencing. 3. OSCAR: An algorithm for assigning oligosaccharide topology from MSn data. Analytical Chemistry. 2005;77:6271–6279. doi: 10.1021/ac050726j.
- [25].Zhang H, Singh S, Reinhold VN. Congruent Strategies for Carbohydrate Sequencing. 2. FragLib: An MSn Spectral Library. Analytical Chemistry. 2005;77:6263–6270. doi: 10.1021/ac050725r.
- [26].Maass K, Ranzinger R, Geyer H, Lieth C.-W. v. d., Geyer R. “Glyco-Peakfinder” - denovo composition analysis of glycoconjugates. PROTEOMICS. 2007;7:4435–4444. doi: 10.1002/pmic.200700253.
- [27].Lohmann KK, Lieth C.-W. v. d. GLYCO-FRAGMENT: A web tool to support the interpretation of mass spectra of complex carbohydrates. PROTEOMICS. 2003;3:2028–2035. doi: 10.1002/pmic.200300505.
- [28].Lohmann KK, von der Lieth C-W. GlycoFragment and GlycoSearchMS: web tools to support the interpretation of mass spectra of complex carbohydrates. Nucl. Acids Res. 2004;32:W261–266. doi: 10.1093/nar/gkh392.
- [29].Ceroni A, Maass K, Geyer H, Geyer R, et al. GlycoWorkbench: a tool for the computer-assisted annotation of mass spectra of glycans. J Proteome Res. 2008;7:1650–1659. doi: 10.1021/pr7008252.
- [30].Vakhrushev SY, Dadimov D, Peter-Katalinić J. Software Platform for High-Throughput Glycomics. Analytical Chemistry. 2009;81:3252–3260. doi: 10.1021/ac802408f.
- [31].Moore RG, Jabre-Raughley M, Brown AK, Robison KM, et al. Comparison of a novel multiple marker assay vs the Risk of Malignancy Index for the prediction of epithelial ovarian cancer in patients with a pelvic mass. Am J Obstet Gynecol. 2010;203:228, e221–226. doi: 10.1016/j.ajog.2010.03.043.
- [32].Leiserowitz GS, Lebrilla C, Miyamoto S, An HJ, et al. Glycomics analysis of serum: a potential new biomarker for ovarian cancer? International Journal of Gynecological Cancer. 2007;18:470–475. doi: 10.1111/j.1525-1438.2007.01028.x.
- [33].Kronewitter SR, de Leoz ML, Peacock KS, McBride KR, et al. Human serum processing and analysis methods for rapid and reproducible N-glycan mass profiling. Journal of Proteome Research. 2010;9:4952–4959. doi: 10.1021/pr100202a.
- [34].Shi SDH, Drader JJ, Freitas MA, Hendrickson CL, Marshall AG. Comparison and interconversion of the two most common frequency-to-mass calibration functions for Fourier transform ion cyclotron resonance mass spectrometry. International Journal of Mass Spectrometry. 2000;195-196:591–598.
- [35].Zhang L-K, Rempel D, Pramanik BN, Gross ML. Accurate mass measurements by Fourier transform mass spectrometry. Mass Spectrometry Reviews. 2005;24:286–309.
- [36].Francl TJ, Sherman MG, Hunter RL, Locke MJ, et al. Experimental determination of the effects of space charge on ion cyclotron resonance frequencies. International Journal of Mass Spectrometry and Ion Processes. 1983;54:189–199.
- [37].Marshall AG, Comisarow MB, Parisod G. Theory of Fourier-Transform Ion-Cyclotron Resonance Mass Spectroscopy. III. Relaxation and Spectral-Line Shape in Fourier-Transform Ion Resonance Spectroscopy. Journal of Chemical Physics. 1979;71:4434–4444.
- [38].Wehofsky M, Hoffmann R, Hubert M, Spengler B. Isotopic deconvolution of matrix-assisted laser desorption/ionization mass spectra for substance-class specific analysis of complex samples. European Journal of Mass Spectrometry. 2001;7:39–46.
- [39].Maleknia SD, Downard KM. Charge ratio analysis method to interpret high resolution electrospray Fourier transform - ion cyclotron resonance mass spectra. International Journal of Mass Spectrometry. 2005;246:1–9.
- [40].Zhang XA, Asara JM, Adamec J, Ouzzani M, Elmagarmid AK. Data pre-processing in liquid chromatography-mass spectrometry-based proteomics. Bioinformatics. 2005;21:4054–4059. doi: 10.1093/bioinformatics/bti660.
- [41].Kaur P, O’Connor PB. Algorithms for automatic interpretation of high resolution mass spectra. Journal of the American Society for Mass Spectrometry. 2006;17:459–468. doi: 10.1016/j.jasms.2005.11.024.
- [42].Tabb DL, Shah MB, Strader MB, Connelly HM, et al. Determination of peptide and protein ion charge states by Fourier transformation of isotope-resolved mass spectra. Journal of the American Society for Mass Spectrometry. 2006;17:903–915. doi: 10.1016/j.jasms.2006.02.003.
- [43].Horn DM, Zubarev RA, McLafferty FW. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. Journal of the American Society for Mass Spectrometry. 2000;11:320–332. doi: 10.1016/s1044-0305(99)00157-9.
- [44].Du PC, Angeletti RH. Automatic deconvolution of isotope-resolved mass spectra using variable selection and quantized peptide mass distribution. Analytical Chemistry. 2006;78:3385–3392. doi: 10.1021/ac052212q.
- [45].Senko MW, Beu SC, McLafferty FW. Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. Journal of the American Society for Mass Spectrometry. 1995;6:229–233. doi: 10.1016/1044-0305(95)00017-8.
- [46].An HJ, Tillinghast JS, Woodruff DL, Rocke DM, Lebrilla CB. A New Computer Program (GlycoX) To Determine Simultaneously the Glycosylation Sites and Oligosaccharide Heterogeneity of Glycoproteins. Journal of Proteome Research. 2006;5:2800–2808. doi: 10.1021/pr0602949.
- [47].Chu CS, Ninonuevo MR, Clowers BH, Perkins PD, et al. Profile of native N-linked glycan structures from human serum using high performance liquid chromatography on a microfluidic chip and time-of-flight mass spectrometry. PROTEOMICS. 2009;9:1939–1951. doi: 10.1002/pmic.200800249.
- [48].Barkauskas DA, An HJ, Kronewitter SR, de Leoz ML, et al. Detecting glycan cancer biomarkers in serum samples using MALDI FT-ICR mass spectrometry data. Bioinformatics. 2009;25:251–257. doi: 10.1093/bioinformatics/btn610.
- [49].Zhu CS, Pinsky PF, Cramer DW, Ransohoff DF, et al. A framework for evaluating biomarkers for early detection: validation of biomarker panels for ovarian cancer. Cancer Prev Res (Phila) 2011;4:375–383. doi: 10.1158/1940-6207.CAPR-10-0193.
- [50].Cramer DW, Bast RC, Jr., Berg CD, Diamandis EP, et al. Ovarian cancer biomarker performance in prostate, lung, colorectal, and ovarian cancer screening trial specimens. Cancer Prev Res (Phila) 2011;4:365–374. doi: 10.1158/1940-6207.CAPR-10-0195.
- [51].Box GEP, Muller ME. A Note on the Generation of Random Normal Deviates. The Annals of Mathematical Statistics. 1958;29:610–611.
- [52].Gornik O, Royle L, Harvey DJ, Radcliffe CM, et al. Changes of serum glycans during sepsis and acute pancreatitis. Glycobiology. 2007;17:1321–1332. doi: 10.1093/glycob/cwm106.
- [53].Kim YG, Jeong HJ, Jang KS, Yang YH, et al. Rapid and high-throughput analysis of N-glycans from ovarian cancer serum using a 96-well plate platform. Analytical Biochemistry. 2009;391:151–153. doi: 10.1016/j.ab.2009.05.015.
- [54].Dwek MV, Ross HA, Leathem AJ. Proteome and glycosylation mapping identifies post-translational modifications associated with aggressive breast cancer. PROTEOMICS. 2001;1:756–762. doi: 10.1002/1615-9861(200106)1:6<756::AID-PROT756>3.0.CO;2-X.
- [55].Kyselova Z, Mechref Y, Al Bataineh MM, Dobrolecki LE, et al. Alterations in the serum glycome due to metastatic prostate cancer. J Proteome Res. 2007;6:1822–1832. doi: 10.1021/pr060664t.
- [56].Alley WR, Jr., Novotny MV. Glycomic analysis of sialic acid linkages in glycans derived from blood serum glycoproteins. J Proteome Res. 2010;9:3062–3072. doi: 10.1021/pr901210r.
- [57].De Graaf TW, Van der Stelt ME, Anbergen MG, van Dijk W. Inflammation-induced expression of sialyl Lewis X-containing glycan structures on alpha 1-acid glycoprotein (orosomucoid) in human sera. J Exp Med. 1993;177:657–666. doi: 10.1084/jem.177.3.657.
- [58].Brinkman-van der Linden ECM, de Haan PF, Havenaar EC, van Dijk W. Inflammation-induced expression of sialyl Lewis x is not restricted to α1-acid glycoprotein but also occurs to a lesser extent on α1-antichymotrypsin and haptoglobin. Glycoconjugate Journal. 1998;15:177–182. doi: 10.1023/a:1006972307166.
- [59].Ohyama C, Tsuboi S, Fukuda M. Dual roles of sialyl Lewis X oligosaccharides in tumor metastasis and rejection by natural killer cells. EMBO J. 1999;18:1516–1525. doi: 10.1093/emboj/18.6.1516.
Supplementary Materials
Supplementary Figure 1. Comparison of internal calibration mass accuracy between the Omega8 (IonSpec, Irvine, CA) software and the Glycolyzer software. The root-mean-square mass errors for each glycan ion are reported from the same spectra calibrated with the Omega8 and Glycolyzer algorithms. The masses correspond to the sodiated glycan ions.
Supplementary Figure 2. (Left) A histogram of the relative intensities found in a spectrum. The largest frequency bin (0.07) is the noise level and the half-width at half-maximum is used for the standard deviation of the noise. The tail of the distribution at high intensities is caused by the contribution of the glycan signals. The curve continues out to x=100 but is truncated to highlight the shape of the noise distribution. (Right) The noise threshold lines are depicted on a zoomed-in mass spectrum showing a peak near the noise level. The lowest allowed peak intensity is 6σ above the mean noise level.
Supplementary Figure 3. This workflow diagram shows the steps used for the reconstructive isotope grouping algorithm. Clusters of data consist of all of the peaks that can be found within 1.00235 Da of a principal peak in both directions. MALDI ionization helps by restricting the charge state to a single charge. Once a cluster is isolated, it is aligned with various models and fit. Best fits are determined by chi-squared testing. The distribution is subtracted from the peak list only after a model has been fit.
Supplementary Figure 4. Comparison of different isotope distribution models with regard to transient length. The black series corresponds to a full one-second acquisition transient, while the grey bars correspond to the same data set with the transients reduced to half-length. The “Hex-HexNAc-Fuc-NeuAc” model uses the averagose of Vakhrushev et al., the “Serum Library” model uses the averagose derived from the Glycolyzer’s N-Glycan Library, the “Exact Composition” model uses the exact elemental composition in place of an averagose, the “HexNAc Only” model uses an averagose based on the N-Acetylhexosamine monosaccharide, and the “Peptide Averagine” model uses the standard average amino acid averagine used for peptides and proteins. The p-values signify how well the theoretical isotope model fits the raw data in terms of probability. A p-value equal to unity indicates a perfect fit between the model and the data.
Supplementary Figure 5. A plot showing the importance of mass measurement accuracy for correct annotation when combinatorial glycan annotation approaches are applied. As the mass accuracy decreases, the rate of false compositional assignments increases. Sodiated glycans have a higher chance of false positives because the number of sodiated masses is greater than the number of protonated masses, owing to sodium/proton cation exchanges on sialic acid groups.
Supplementary Figure 6. Example of modeled mass spectra data. Forty-eight example spectra with a group-wide coefficient of variation set to 60% are plotted. The monoisotopic mass is also plotted to show the variation of intensity. These intensities are also summarized in a box plot to the left depicting the 0th, 25th, 50th, 75th, and 100th percentiles.
Supplementary Figure 7. Approximation of t-test p=0.05 cutoff. Interpolated ROC AUC values are plotted vs. interpolated percent change in average abundance values. A linear fit was used to approximate the data.
Supplementary Figure 8. List of all glycans monitored in this study along with the mean and standard deviation. All three fractions are included (10%, 20%, and 40%). The masses used for the 10% and 20% fractions reflect the aldehyde form of the glycan and have a sodium cation as the charge carrier. The 40% fraction masses are aldehyde glycans and are listed in the deprotonated form.
Supplementary Figure 9. List of all glycan compositions monitored in this study. The monosaccharide composition symbols are abbreviated: Hex=Hexose, HexNAc=N-Acetylhexosamine, Fucose=Deoxyhexose, NeuAc=N-Acetylneuraminic Acid, and Na/H=Sodium cation substitution for proton.
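
As an illustration of the noise estimate described for Supplementary Figure 2, the following Python sketch builds a histogram of relative intensities, takes the most populated bin as the noise level, uses the half-width at half-maximum as the noise standard deviation, and places the peak threshold 6σ above the noise level. It is a minimal sketch and not the Glycolyzer implementation: the function name `noise_threshold`, the fixed bin width of 0.01, the use of the upper side of the histogram for the half-width, and the use of the modal bin centre as the mean noise level are all assumptions made here for illustration.

```python
import math
import random
from collections import Counter


def noise_threshold(intensities, bin_width=0.01, n_sigma=6.0):
    """Return (noise_level, sigma, threshold) for relative intensities.

    Assumptions (illustrative, not from the paper): fixed-width bins,
    the modal bin centre as the noise level, and the upper-side
    half-width at half-maximum (HWHM) as the noise standard deviation.
    """
    # Histogram: count how many intensities fall in each bin.
    counts = Counter(math.floor(x / bin_width) for x in intensities)
    mode_bin, mode_count = max(counts.items(), key=lambda kv: kv[1])
    noise_level = (mode_bin + 0.5) * bin_width

    # Walk right from the mode until the count falls to half the maximum.
    b = mode_bin
    while counts.get(b, 0) > mode_count / 2:
        b += 1
    sigma = (b - mode_bin) * bin_width

    return noise_level, sigma, noise_level + n_sigma * sigma


# Synthetic spectrum: dense low-intensity noise plus a sparse signal tail.
random.seed(0)
noise = [abs(random.gauss(0.07, 0.02)) for _ in range(20000)]
signal = [random.uniform(1.0, 100.0) for _ in range(200)]
level, sigma, threshold = noise_threshold(noise + signal)
print(f"noise level={level:.3f}, sigma={sigma:.3f}, threshold={threshold:.3f}")
```

Because the sparse glycan signals populate only the high-intensity tail, they barely affect the modal bin, which is why a histogram mode can serve as a noise estimate without first removing the peaks.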





