Author manuscript; available in PMC: 2026 Mar 8.
Published in final edited form as: J Proteome Res. 2023 Jan 11;22(2):399–409. doi: 10.1021/acs.jproteome.2c00570

TopPICR: A Companion R Package for Top-Down Proteomics Data Analysis

Evan A Martin 1, James M Fulcher 2, Mowei Zhou 3, Matthew E Monroe 4, Vladislav A Petyuk 5
PMCID: PMC12966957  NIHMSID: NIHMS2145292  PMID: 36631391

Abstract

Top-down proteomics is the analysis of proteins in their intact form without proteolysis, thus preserving valuable information about post-translational modifications, isoforms, and proteolytic processing. However, it is still a developing field due to limitations in the instrumentation, difficulties with the interpretation of complex mass spectra, and a lack of well-established quantification approaches. TopPIC is one of the popular tools for proteoform identification. We extended its capabilities into label-free proteoform quantification by developing a companion R package (TopPICR). Key steps in the TopPICR pipeline include filtering identifications, inferring a minimal set of protein accessions explaining the observed sequences, aligning retention times, recalibrating measured masses, clustering features across data sets, and finally compiling feature intensities using the match-between-runs approach. The output of the pipeline is an MSnSet object which makes downstream data analysis seamlessly compatible with packages from the Bioconductor project. It also provides the capability for visualizing proteoforms within the context of the parent protein sequence. The functionality of TopPICR is demonstrated on top-down LC-MS/MS data sets of 10 human-in-mouse xenografts of luminal and basal breast tumor samples.

Keywords: top-down proteomics, label free quantification, FAIMS, proteoform quantification, TopPIC


INTRODUCTION

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is a widely used tool for measuring proteins in a biological sample. In general, LC-MS/MS is more robust, sensitive, accurate, and established for analyzing molecules with a relatively lower charge or mass.1,2 Therefore, the bottom-up proteomics approach, where proteins are enzymatically or chemically cleaved into smaller peptide fragments, has been the preferred mode of proteomic analysis. Despite its prevalence, bottom-up proteomics requires some caution with data interpretation as it cannot directly identify the proteins in a sample3,4 because a given peptide sequence can appear in multiple proteins or their proteoforms.5 On the other hand, top-down proteomics, the analysis of intact proteins with no introduced digestion, can identify proteoforms directly. However, it is only in recent years that top-down proteomics has gained momentum due to advances in sample preparation, instrumentation, and software development,6-8 effectively enabling the launch of the Human Proteoform Project.9 Quantitative top-down proteomics (recently reviewed by Cupp-Sutton et al.)10 has also been demonstrated using label-free, metabolic labeling, and chemical labeling approaches. Label-free quantification is the most straightforward to perform because it does not involve extra labeling protocols. Yet, a number of technical challenges still remain,5,8 especially regarding data analysis. For example, all existing top-down software tools are susceptible to errors and ambiguity due to mis-localization of post-translational modifications (PTMs)11 and isotopic errors for a given proteoform. In these cases, the same proteoform can be identified as different species, resulting in erroneous downstream quantification. Manual validation is often performed to ensure data quality but can be prohibitive in large-scale proteomics studies.

Interpreting top-down spectra is more challenging compared to tryptic peptides due to the size and complexity of the intact protein species. The complexity is due to splice isoforms and multiple modification types that include sequence truncations (e.g., due to proteolytic cleavage), PTMs, and sample prep artifacts. As a result, the search space for the top-down data is substantially larger and consequently the uncertainty of the MS/MS spectra interpretation may result in different identifications for the same proteoform species. A similar phenomenon is observed in bottom-up proteomics for data that involves PTMs (e.g., known challenge in localization of the phosphorylation residues).12 One of the solutions is to group the top-down proteoform identifications based on sequence similarity, mass, and retention time. Indeed, previous attempts in label-free quantitative top-down proteomics grouped distinct proteoform identifications based on mass and retention time within individual LC-MS data sets.13-18 To extend those mass and retention time-group identifications across data sets, Lubeckyj et al.14 applied a retention time alignment. The approach of feature finding and alignment across data sets for label-free top-down quantification has been formalized in a standalone tool called Informed-Proteomics.19 Essentially this resulted in a top-down proteomics tool with capabilities similar to the bottom-up tools MaxQuant,20,21 MultiAlign,22 and MSFragger.23 Another recent method24 clustered proteoform identifications across output from multiple software packages, ProSight PD25 and TopPIC,26 but only on the mass dimension.

Here we add to the development of TopPIC, a popular top-down MS/MS search engine, by extending its capability toward cross-data set label-free proteoform quantification. To that end, we have developed a companion R package, TopPICR. Specifically, it first aligns retention times and recalibrates proteoform masses across data sets, then applies a clustering algorithm to determine the number of individual species linked to individual proteins (Figure 1). Once clusters are determined, TopPICR computes the centroid for each cluster and extends the identifications to all data sets. It does this through a technique similar to the bottom-up match-between-runs (MBR) approach.20,21 Finally, based on the quantitative data, it creates an ExpressionSet/MSnSet object compatible with the numerous packages from the Bioconductor project27-29 for further exploratory and statistical analysis. TopPICR also contains a visualization module for examining PTMs within a protein and proteoform identifications within a gene.

Figure 1.

Workflow for processing quantitative data on proteoforms identified by TopPIC. The key steps in the method are 1) inferring a minimal set of accessions explaining all the proteoforms; 2) aligning retention times to a reference data set and recalibrating mass based on the error between the theoretical and observed masses; 3) clustering identified features by retention time and mass; and 4) matching cluster centroids in LC-MS space to all feature intensities, including unidentified ones.

METHODS

Data and TopPIC MS/MS Search Settings

To demonstrate the capabilities of TopPICR, we used the same data used by another top-down data analysis tool.19 The data consist of 10 human-in-mouse xenograft breast cancer samples30 characterized by the Clinical Proteomic Tumor Analysis Consortium (CPTAC).31 The data contain two breast cancer subtypes, basal-like and luminal B. Each subtype includes five technical replicates. Before applying the TopPICR pipeline, we analyzed the data with TopPIC version 1.5.4. We used the following TopPIC settings: a precursor window of 3 m/z, a mass error tolerance of 15 ppm, a maximum unknown mass shift of 500 Da, a minimum unknown mass shift of −150 Da, and a maximum of 1 unknown modification. MS/MS spectra were searched against a database concatenated with entries from Homo sapiens and Mus musculus SwissProt canonical sequences (37,491), splice variants (26,943), and tentative sequences from TrEMBL (95,944), as well as common contaminants. To quantify the confidence of the identifications, each protein sequence was shuffled and appended to the database as a decoy. This is similar to the well-established target/decoy approach for bottom-up proteomics.32,33

Filter Low-Confidence Identifications

False discovery rate (FDR) calculations were based on using shuffled sequences as decoys.32,33 As the first step, TopPICR controls the quality of proteoform identifications. This includes filtering out rarely occurring identifications (i.e., single-spectrum identifications colloquially known as “one-hit wonders”) and controlling the FDR based on proteoform-to-spectrum matching metrics (e.g., E-value or Q-value). Generally, single-spectrum identifications are considered suspicious even if they have a confident MS/MS to sequence matching score. Moreover, in the case of quantification studies, single-spectrum identifications are not particularly useful because they do not allow direct sample-to-sample comparison. Although, theoretically, the intensities of corresponding species can be recovered in other samples using the match-between-runs procedure (described below), the chances that the quantification will be recovered for a single-spectrum identification are low. Thus, the first step in filtering is to remove identifications with low occurrence counts. We considered filtering by both the total number of spectra tied to a particular proteoform and the number of data sets or samples in which a particular proteoform is present. The actual thresholds for the minimum number of spectrum identifications and sample presence depend on the objective of a particular project.
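The occurrence filter and the decoy-based FDR estimate can be illustrated with a minimal Python sketch. It is not the TopPICR implementation: the record fields (`proteoform`, `dataset`, `is_decoy`), the function names, and the FDR formula (decoy hits divided by target hits) are assumptions made for illustration.

```python
from collections import defaultdict

def decoy_fdr(records):
    """Estimate FDR as (# decoy hits) / (# target hits); assumed formula."""
    n_decoy = sum(1 for r in records if r["is_decoy"])
    n_target = len(records) - n_decoy
    return n_decoy / n_target if n_target else 0.0

def occurrence_filter(records, min_spectra=2, min_datasets=1):
    """Keep PrSMs whose proteoform is seen in enough spectra and data sets."""
    spectra = defaultdict(int)    # proteoform -> number of PrSMs
    datasets = defaultdict(set)   # proteoform -> data sets where it was seen
    for r in records:
        spectra[r["proteoform"]] += 1
        datasets[r["proteoform"]].add(r["dataset"])
    return [
        r for r in records
        if spectra[r["proteoform"]] >= min_spectra
        and len(datasets[r["proteoform"]]) >= min_datasets
    ]
```

Comparing `decoy_fdr` before and after `occurrence_filter` shows how removing single-spectrum identifications changes the estimated FDR.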

The TopPIC search engine provides metrics, first introduced as part of the MS-Align+ engine,33 reflecting how well the fragments from an MS/MS spectrum match the suggested amino acid sequence including modifications. Moreover, it provides more interpretable measures of confidence such as E-values and Q-values. However, we redetermine the FDR thresholds for two reasons. First, applying thresholds on a minimal number of spectra and sample observations strongly affects the overall FDR. The second reason relates to the situation when a user uses a database that includes not only canonical protein sequences but also sequences of splice isoforms and tentative protein sequences. Statistically speaking, the prior expectations are such that overall, it is more likely to observe a protein in its canonical sequence rather than in a splice isoform or tentative form. A single E-value threshold could result in an overly optimistic number of isoform and tentative sequence identifications. Thus, to control the FDR at a given level, we apply an optimization procedure across the annotation types (i.e., canonical, splice isoform, and tentative), which produces a separate E-value threshold for each type.
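A simplified Python sketch of choosing a separate E-value threshold per annotation type follows. It assumes the objective is the number of target PrSMs and that the FDR is estimated within each annotation type as decoys divided by targets; the actual TopPICR optimization (which can target genes, sequences, or proteoforms) may differ. The field names `evalue`, `is_decoy`, and `annotation` are hypothetical.

```python
def best_threshold(records, max_fdr=0.01):
    """Largest E-value cutoff whose decoy/target ratio stays <= max_fdr.

    Target counts never decrease as the cutoff grows, so the last feasible
    cutoff also maximizes the number of target identifications.
    """
    ordered = sorted(records, key=lambda r: r["evalue"])
    best, targets, decoys, i = None, 0, 0, 0
    while i < len(ordered):
        cutoff = ordered[i]["evalue"]
        # Consume every record tied at this cutoff before evaluating the FDR.
        while i < len(ordered) and ordered[i]["evalue"] == cutoff:
            if ordered[i]["is_decoy"]:
                decoys += 1
            else:
                targets += 1
            i += 1
        if targets and decoys / targets <= max_fdr:
            best = cutoff
    return best

def thresholds_by_type(records, max_fdr=0.01):
    """Optimize the cutoff independently for each annotation type."""
    by_type = {}
    for r in records:
        by_type.setdefault(r["annotation"], []).append(r)
    return {t: best_threshold(rs, max_fdr) for t, rs in by_type.items()}
```

A record passes the filter if its E-value is less than or equal to the threshold returned for its annotation type.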

Inference of a Parsimonious Set of Proteins

The identified species in top-down proteomics are much longer than bottom-up tryptic peptides. Nonetheless, in the case of protein truncations, the ambiguity in protein assignment still exists. Obviously, this problem is more significant for shorter proteolytic fragments. To resolve this, we infer a parsimonious set of full protein sequences by formulating the inference problem as a bipartite graph set cover problem and solving it with a greedy set cover algorithm used in bottom-up proteomics.34 Specifically, the steps for selecting the set of accessions are 1) find the accession that maps to the highest number of sequences; 2) assign this accession as the representative accession for these sequences; 3) remove this accession along with the assigned sequences from further iterations; and 4) repeat steps 1–3 until all sequences are assigned. This procedure results in a parsimonious protein accession to sequence mapping.
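The four steps above can be sketched in Python as a greedy set cover. The input format (a mapping from each sequence to the set of accessions containing it) and the alphabetical tie-breaking rule are assumptions for illustration, not details taken from TopPICR.

```python
def parsimonious_accessions(seq_to_acc):
    """Assign one representative accession per sequence via greedy set cover."""
    # Invert the mapping: accession -> sequences it could explain.
    acc_to_seqs = {}
    for seq, accs in seq_to_acc.items():
        for acc in accs:
            acc_to_seqs.setdefault(acc, set()).add(seq)

    assigned = {}
    unassigned = set(seq_to_acc)
    while unassigned and acc_to_seqs:
        # Step 1: accession covering the most unassigned sequences
        # (ties broken alphabetically; an assumed rule).
        best = max(sorted(acc_to_seqs),
                   key=lambda a: len(acc_to_seqs[a] & unassigned))
        covered = acc_to_seqs.pop(best) & unassigned
        if not covered:
            break  # remaining sequences map to no accession
        # Steps 2-3: assign the accession and drop it with its sequences.
        for seq in covered:
            assigned[seq] = best
        unassigned -= covered
    return assigned
```

The greedy choice does not guarantee the smallest possible cover, but it is a standard approximation for this problem.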

Retention Time Alignment Across Data Sets

Aligning retention times between data sets is a key step for enabling cross-data set analysis (Figure 2A,C). The first step in the retention time alignment is selecting a reference. Since there are no theoretical retention time values, we chose a reference data set (Figure 2). The criterion for selecting a reference is that the data set should contain the highest number of proteoform identifications. This increases the chance for having a more robust alignment since we rely on the pairwise overlap in proteoform identifications between the data sets for computing the regression model. Often the systematic error in retention times between two data sets is nonlinear.35 We use LOESS regression, a nonparametric method, to model the systematic retention time error.35 The inferred model is then used to align data sets for both identified proteoforms and unidentified features. Later this will allow us to recover unidentified features using the match-between-runs technique.
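The alignment idea can be illustrated with a small, dependency-free Python sketch. The `loess` function below is a simplified local linear regression with tricube weights and no robustness iterations, so it is only an approximation of a full LOESS fit; `align_rt` and its arguments are hypothetical names.

```python
def loess(x, y, x_new, span=0.5):
    """Local linear regression with tricube weights (simplified LOESS)."""
    n = len(x)
    k = min(n, max(2, round(span * n)))  # points per local window
    fitted = []
    for x0 in x_new:
        window = sorted((abs(xi - x0), xi, yi) for xi, yi in zip(x, y))[:k]
        h = window[-1][0] or 1.0  # window half-width
        w = [(1 - (d / h) ** 3) ** 3 for d, _, _ in window]
        if sum(w) == 0:  # all points equally far away: fall back to equal weights
            w = [1.0] * len(window)
        sw = sum(w)
        xm = sum(wi * xi for wi, (_, xi, _) in zip(w, window)) / sw
        ym = sum(wi * yi for wi, (_, _, yi) in zip(w, window)) / sw
        sxx = sum(wi * (xi - xm) ** 2 for wi, (_, xi, _) in zip(w, window))
        sxy = sum(wi * (xi - xm) * (yi - ym)
                  for wi, (_, xi, yi) in zip(w, window))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ym + slope * (x0 - xm))
    return fitted

def align_rt(rt_obs, rt_ref, rt_all, span=0.5):
    """Shift all retention times of a data set onto the reference scale.

    rt_obs / rt_ref: retention times of proteoforms shared with the reference.
    rt_all: every feature (identified or not) in the data set being aligned.
    """
    deviation = [ref - obs for obs, ref in zip(rt_obs, rt_ref)]
    shift = loess(rt_obs, deviation, rt_all, span)
    return [t + s for t, s in zip(rt_all, shift)]
```

Because the model is fitted on shared identifications but applied to `rt_all`, unidentified features are aligned as well.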

Figure 2.

Plots showing retention time and mass measurement error. A) The retention time deviation with a LOESS regression line is shown in red for sample CR33C. B) The error in parts per million between the theoretical and observed mass for sample CR33C. C) Box plots of the deviation between the reference data set and the remaining nine data sets before and after retention time alignment with outlying points removed. D) Box plots of the ppm error for all ten data sets before and after recalibrating the mass with outlying points removed.

Elimination of Systematic Mass Measurement Errors

Elimination of the systematic mass measurement error is performed using a similar two-step procedure (Figure 2B,D). In this case, however, the theoretical masses can be calculated from the proteoform identifications, obviating the need to select a reference data set. Although there are sophisticated ways of modeling the systematic mass measurement error,36-38 in the current version of TopPICR we model it as a constant corresponding to the mean of the distribution of the mass measurement errors computed in parts per million (ppm). Then the masses of both identified proteoforms and unidentified features are readjusted by subtracting the inferred systematic relative error.
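A minimal Python sketch of this constant-offset recalibration is shown below. The function names and the input layout (pairs of observed and theoretical masses) are assumptions for illustration.

```python
from statistics import mean

def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def recalibrate(masses, identified):
    """Subtract the mean ppm error from every mass in a data set.

    identified: (observed, theoretical) mass pairs for identified proteoforms.
    masses: all observed masses to correct, including unidentified features.
    """
    shift_ppm = mean(ppm_error(obs, theo) for obs, theo in identified)
    corrected = [m * (1 - shift_ppm * 1e-6) for m in masses]
    return corrected, shift_ppm
```

For example, a data set whose identified proteoforms all read 5 ppm high yields `shift_ppm` of about 5, and the corrected masses are centered on their theoretical values.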

Dimension Transform and Clustering

A given proteoform may appear in multiple data sets under different identifications. The most common source for such ambiguity is due to PTM localization followed by the presence of unknown modifications in an amino acid sequence. To alleviate this ambiguity, we cluster proteoforms across data sets based on their retention time and mass.

Clustering directly in LC-MS space, that is, in the retention time and mass dimensions, is problematic for two reasons. The first is that the Euclidean distance, or any other less commonly used distance measure, has no physical or statistical meaning if retention time units (minutes or seconds) are mixed with molecular mass units (Daltons). Thus, it is difficult to select a threshold for a cluster boundary. Typically, this problem is solved by computing the Mahalanobis distance or, when the covariance is not considered, the normalized Euclidean distance. This way the distances, scaled by the standard deviation, effectively become unitless and can be interpreted from a statistical standpoint. The second is that mass measurement errors are relative, so the same error in ppm corresponds to different absolute mass differences depending on the mass range.

Prior to clustering, we transform the mass and retention time such that the Euclidean distance can be interpreted as the number of standard deviations of the measurement errors. To transform the mass and retention time, we first estimate the standard deviations for both types of measurement error. We use the median absolute deviation as a more robust estimate of the standard deviation as the data are prone to outliers. In the case of retention time, we calculate the errors for each data set relative to the reference data set. We calculate the mass measurement error for a proteoform between the observed and theoretical value in parts per million (ppm). Moreover, we exclude PrSMs (proteoform-spectrum matches) with observed masses further than 0.5 Da from their theoretical values as they likely represent isotopic errors and have a strong outlying effect. To derive a single standard deviation estimate for each dimension, we take the median value across the data set-specific standard deviations.
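The robust standard deviation estimate can be sketched in Python as follows. The 1.4826 scaling constant, which makes the median absolute deviation consistent with the standard deviation of a normal distribution, is an assumption here; the text does not state which scaling TopPICR uses.

```python
from statistics import median

def mad_sigma(errors):
    """Robust standard deviation estimate from the median absolute deviation."""
    center = median(errors)
    # 1.4826 is the usual consistency constant for normally distributed data.
    return 1.4826 * median(abs(e - center) for e in errors)

def pooled_sigma(errors_by_dataset):
    """Median of the data set-specific estimates (one value per dimension)."""
    return median(mad_sigma(errs) for errs in errors_by_dataset.values())
```

The same two functions apply to both dimensions: pass retention time deviations from the reference, or ppm mass errors, grouped by data set.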

The retention time dimension can be directly transformed by dividing by the standard deviation

$$rt_1 - rt_0 = \mathrm{err}, \qquad \frac{rt_1}{\sigma} - \frac{rt_0}{\sigma} = \frac{\mathrm{err}}{\sigma}$$

However, the mass dimension cannot be transformed in a similar way because the errors are computed in relative terms and depend on the mass range itself

$$\frac{m_1 - m_0}{m_0} = \mathrm{err}$$

or

$$m_1 - m_0 = m_0 \times \mathrm{err}$$

Typically, the error is expressed in parts per million, that is $\mathrm{err} \times 10^{6}$. Thus, the same distance in ppm will have different values depending on the mass range if computed directly as a mass difference. The solution is to log-transform the mass dimension. After this transformation, the relative error becomes independent of the mass range

$$\log(m_1) - \log(m_0) = \log(m_0 \times (1 + \mathrm{err})) - \log(m_0) = \log(1 + \mathrm{err})$$

To express the distance in the mass dimension as a multiple of the standard deviation, we divided the logarithm of the mass by the logarithm of one plus the standard deviation

$$\frac{\log(m_1)}{\log(1 + \sigma)} - \frac{\log(m_0)}{\log(1 + \sigma)} = \frac{\log(1 + \mathrm{err})}{\log(1 + \sigma)}$$
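The two transforms can be checked numerically with a short Python sketch. The function name and the convention that the mass standard deviation is supplied in ppm are assumptions for illustration.

```python
import math

def to_sigma_units(rt, mass, sigma_rt, sigma_ppm):
    """Map (retention time, mass) to coordinates measured in standard deviations."""
    x = rt / sigma_rt
    # sigma_ppm is converted to a relative error before the log transform.
    y = math.log(mass) / math.log1p(sigma_ppm * 1e-6)
    return x, y
```

With `sigma_ppm = 5`, two masses that differ by exactly 5 ppm end up one unit apart in the transformed mass dimension, whether they are near 10 kDa or 30 kDa.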

Given the previous transforms of the retention time and mass dimensions, the Euclidean distance has a statistical interpretation as the number of standard deviations of the random measurement error. The proteoforms are clustered across all data sets using a conventional hierarchical clustering algorithm followed by cutting the tree at a given height. Note that due to the transformations, the height of the tree is a statistically interpretable parameter essentially corresponding to a Z-score of a bivariate normal distribution. In addition to the tree height, we also consider the number of member points in a cluster. Any clusters with fewer points than the minimum cluster threshold are considered noise points and do not belong to any particular cluster. The choice of the linkage method, height to cut the tree, and minimum cluster size are left as arguments for the user.
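As an illustration of cutting a tree at a given height and discarding small clusters, the following Python sketch uses single linkage, for which cutting at height h is equivalent to joining all points connected by chains of pairwise distances of at most h. Single linkage is only one of the possible linkage choices, and the label 0 for noise points is an assumed convention.

```python
import math

def cluster(points, height, min_size=2):
    """Single-linkage clusters at a cut height; small clusters become noise (0)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of points closer than the cut height.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= height:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    labels = [0] * n
    next_label = 1
    for members in groups.values():
        if len(members) >= min_size:
            for i in members:
                labels[i] = next_label
            next_label += 1
    return labels
```

The points are expected in the transformed coordinates, so `height` is expressed in standard deviations of the measurement error.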

Recovering Unidentified Features Using the Match-Between-Runs Procedure

When using data-dependent acquisition, there is no guarantee that a proteoform will be selected for fragmentation in every sample. Moreover, even if it is selected, there is no guarantee that the spectrum will be of a good, interpretable quality. Thus, although a proteoform is present and its abundance can be calculated from the ion intensities, any approach that relies exclusively on MS/MS identifications will report it as missing. To recuperate missing proteoforms, we apply a procedure similar to MBR (match-between-runs) in bottom-up proteomics.20,21 This approach relies on the assumption that the samples are similar enough to each other and the species are likely to be present in other samples if found in just one sample. Briefly, it transfers the identifications across the data sets if there are features that match within a defined retention time and mass tolerance. Thus, after performing the clustering step, we compute the cluster centroids representing the median values of the retention time and mass. Then for each cluster, we select all (that is identified and unidentified) features that fall within a user-defined tolerance around each centroid.
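A minimal Python sketch of the centroid computation and feature matching is given below. The feature fields (`rt`, `mass`) and the choice of a ppm tolerance for the mass dimension are assumptions for illustration.

```python
from statistics import median

def centroid(members):
    """Median retention time and mass of a cluster's (rt, mass) members."""
    return (median(rt for rt, _ in members),
            median(mass for _, mass in members))

def match_features(center, features, rt_tol, ppm_tol):
    """Select all features, identified or not, within tolerance of a centroid."""
    c_rt, c_mass = center
    return [
        f for f in features
        if abs(f["rt"] - c_rt) <= rt_tol
        and abs(f["mass"] - c_mass) / c_mass * 1e6 <= ppm_tol
    ]
```

Running `match_features` against the feature list of every data set transfers a cluster's identity to data sets where the proteoform was not fragmented or not identified.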

Accounting for the Error in the Monoisotopic Mass Selection

Determining the monoisotopic mass of compounds with high molecular weight is a challenging task. Unlike in bottom-up proteomics, the monoisotopic peak of the measured species is likely to be of low abundance, if detected at all. The problem is exacerbated if the proteoform itself is of low abundance. As a result, it is common to observe coeluting clusters spaced about 1 Da apart (Figure 3). To be exact, the mass difference between the major protein isotopologues corresponds to 13C–12C = 1.003355 Da. Since these satellite isotopologue clusters represent redundant measurements of the same proteoform, we group them together.

Figure 3.

LC-MS plots demonstrating the proteoform splitting problem due to errors in the monoisotopic peak determination and the steps taken to correct this problem. Points correspond to features or proteoform observations from a particular sample. Features that are close in proximity to one another in LC-MS space form clusters. All points belonging to the same cluster represent one proteoform and have the same color. A) All clusters identified for the GAPDH protein across the 10 data sets. B) Zoomed portion of the LC-MS space. In this region, each of the 14 clusters has a distinct color. Boxes denote groups of clusters. Clusters within each box appear to be coeluting and have a very similar mass. Panels C) and D) show the zoomed version of the cluster groups. The clusters are denoted as isotopologues (the same species that differ only in their isotopic composition) if the relative difference in mass is less than a predefined threshold (e.g., 5 ppm) after correcting for the 13C–12C differences. Panels E) and F) show the cluster grouping that accounts for the presence of isotopologues.

To group the isotopologue clusters, we apply the following procedure. First, we select the major cluster with the largest membership, that is, the largest number of PrSMs. Next, we compare the centroid of the largest cluster with all other cluster centroids. The retention time dimension is compared as is, and clusters that are further than the specified tolerance are not considered. The rationale is that isotopologue clusters effectively coelute since they represent the same species. The default tolerance is 3 standard deviations of the retention time measurement error. To compare the mass centroids, we subtract an integer number of 13C–12C differences to account for isotopic errors, followed by calculating the relative mass measurement error in ppm. By default, the number of 13C–12C steps is 4, but it can be adjusted by the user depending on the application needs. In our experience, 4 steps in both directions (increasing and decreasing mass) are sufficient to capture around 99% of the isotopologue clusters. If an adjusted centroid falls within a user-defined ppm tolerance, it is combined with the major cluster. The default tolerance for considering two clusters as belonging to the same isotopologue group is 3 standard deviations of the relative mass measurement error. We recursively repeat this procedure for the remaining clusters (excluding any clusters previously grouped with the major cluster). The intensities for the new aggregate clusters are summarized using either the sum or the maximum within each data set, depending on the user’s selection.
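The grouping procedure can be sketched in Python as follows. This is an iterative version of the recursion described above; the cluster fields (`rt`, `mass`, `n_prsm`) are hypothetical, and the tolerances are passed in raw units (retention time units and ppm) rather than as multiples of the standard deviation.

```python
C13_C12 = 1.003355  # Da, mass difference between 13C and 12C

def same_isotopologue(major, other, rt_tol, ppm_tol, max_steps=4):
    """True if `other` coelutes with `major` and matches after an isotope shift."""
    if abs(other["rt"] - major["rt"]) > rt_tol:
        return False
    for k in range(-max_steps, max_steps + 1):
        adjusted = other["mass"] - k * C13_C12
        if abs(adjusted - major["mass"]) / major["mass"] * 1e6 <= ppm_tol:
            return True
    return False

def group_isotopologues(clusters, rt_tol, ppm_tol, max_steps=4):
    """Repeatedly take the largest remaining cluster and absorb its satellites."""
    remaining = sorted(clusters, key=lambda c: c["n_prsm"], reverse=True)
    groups = []
    while remaining:
        major = remaining.pop(0)
        members, rest = [major], []
        for c in remaining:
            if same_isotopologue(major, c, rt_tol, ppm_tol, max_steps):
                members.append(c)
            else:
                rest.append(c)
        remaining = rest
        groups.append(members)
    return groups
```

Intensities of each returned group can then be summarized per data set with either a sum or a maximum.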

Consideration of Chromatographic and FAIMS CV Fractionation for Proteoform Quantification

Chromatographic39-42 and electrophoretic offline fractionations43-46 are often used in proteomics to reduce sample complexity and increase proteome coverage. Online gas-phase fractionation, specifically field asymmetric ion mobility spectrometry (FAIMS), has recently been used to increase proteome coverage in top-down proteomics.47-49 FAIMS effectively fractionates the ions by applying a compensation voltage (CV) that only allows certain types of species to pass. Practically, offline and online fractionation techniques pose the same challenge: redundant proteoform measurements need to be reduced to a single value per sample. This problem is conceptually similar to combining multiple tryptic peptides for the inference of a protein relative abundance value. In other words, the proteoform/fraction/CV combinations are treated the same as peptides when reducing them to the proteoform level. The advantage of such a generalization is that a wide range of roll-up algorithms can be applied.29,50-52 Specifically, we followed the roll-up algorithm,50 which scales the peptide intensities to a common reference level. The reference is typically the peptide that is quantified across the highest number of samples. The mean intensities of the nonreference peptides are then scaled to the same number as the reference peptide. The relative abundance at the protein level is computed as the mean of the scaled peptide values. When utilizing this method, we replace peptides with proteoform/fraction/CV combinations.
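A simplified Python sketch of this reference-based roll-up is shown below. It assumes log-scale intensities (so scaling becomes an additive shift), uses `None` for missing values, and computes the shift from the samples shared with the reference; these details are assumptions and may differ from the cited algorithm.

```python
from statistics import mean

def rollup(rows):
    """Reduce several measurements of one proteoform to one value per sample.

    rows: name -> list of log-scale intensities (None = missing), one per
    sample, e.g. one row per fraction/CV combination.
    """
    # Reference: the row quantified in the most samples (ties: alphabetical).
    ref_name = max(sorted(rows),
                   key=lambda k: sum(v is not None for v in rows[k]))
    ref = rows[ref_name]
    n = len(ref)

    scaled = []
    for vals in rows.values():
        shared = [i for i in range(n)
                  if vals[i] is not None and ref[i] is not None]
        if not shared:
            continue  # cannot be placed on the reference scale
        shift = mean(ref[i] for i in shared) - mean(vals[i] for i in shared)
        scaled.append([None if v is None else v + shift for v in vals])

    result = []
    for i in range(n):
        column = [row[i] for row in scaled if row[i] is not None]
        result.append(mean(column) if column else None)
    return result
```

In the usage envisioned here, each key of `rows` would be a proteoform/fraction/CV combination belonging to the same proteoform.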

Alternatively, one can select a representative fraction/CV for each proteoform. Conceptually this is similar to selecting a representative peptide per protein. It makes the most sense to select the fraction/CV that contains the most identifications for a given proteoform as the representative fraction/CV.
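Selecting a representative fraction/CV amounts to a per-proteoform count, as in this short Python sketch. The input format (one `(proteoform, fraction)` pair per identification) is an assumption; ties are resolved in favor of the fraction encountered first.

```python
from collections import Counter

def representative_fraction(prsms):
    """Map each proteoform to the fraction/CV with the most identifications."""
    counts = {}
    for proteoform, fraction in prsms:
        counts.setdefault(proteoform, Counter())[fraction] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}
```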

RESULTS

Filtering and Preprocessing Proteoform Identifications

The numbers of identifications across 10 breast cancer xenograft data sets at the spectrum, proteoform, amino acid sequence, and gene level during different filtering and preprocessing steps are shown in Table 1. Requiring an amino acid sequence (i.e., proteoform with removed modifications) to be identified with at least two spectra significantly improves the quality of the data, decreasing the FDR between 5- and 10-fold across all levels. At the next step, we optimized the PrSM E-value thresholds to achieve <1% FDR at the gene level. As described in the Methods section, FDR was calculated based on the target/decoy approach with shuffled protein sequences used as decoys. Essentially, finding the right E-value thresholds was formulated as a constrained optimization problem, with the number of genes acting as an objective function and the maximum allowed FDR acting as a constraint (<1% in this case). Note that TopPICR has the flexibility of defining the objective function based on PrSMs, proteoforms, proteoform sequences, or genes. Although E-values have a statistical interpretation, they are originally computed for PrSMs only and only within a particular data set. Once multiple data sets are collated and one wishes to optimize FDR at different identification levels, the reported E-values cannot be taken as is. Therefore, an additional optimization step is required.

Table 1. Counts and FDR (Shown in Parentheses) at the PrSM, Proteoform, Amino Acid Sequence, and Gene Level after the Individual Steps in the Pipeline^a

| Level | No filter | Occurrence filter | E-value filter | Inference | Cluster | Isotopologue cluster |
|---|---|---|---|---|---|---|
| PrSM | 23136 (1.25%) | 21697 (0.22%) | 21647 (0.079%) | 21610 (0.079%) | 20980 (0.048%) | 20980 (0.048%) |
| Proteoform | 6878 (3.84%) | 5439 (0.40%) | 5419 (0.13%) | 5419 (0.13%) | 2705 (0.11%) | 2451 (0.12%) |
| AA sequence | 3679 (7.09%) | 2240 (0.85%) | 2225 (0.32%) | 2225 (0.32%) | 2158 (0.14%) | 2158 (0.14%) |
| Gene | 1482 (17.06%) | 908 (2.09%) | 893 (0.78%) | 556 (1.26%) | 527 (0.57%) | 527 (0.57%) |

^a Counts are the union taken across all ten samples. The objective was to achieve the maximum number of identifications while keeping the FDR (computed based on decoy identifications) under 1% at the gene level.

The optimization was performed separately depending on the protein sequence annotation type and resulted in 0.4988, 0.2362, and 0.0369 (the number of significant digits is shown for reference) E-value thresholds for canonical, isoform (Varsplic), and tentative (TrEMBL) sequences, respectively. The FDR thresholds were calculated separately for the three different annotation types due to different prior expectations that an identification is a true positive. For example, a sequence identified from the TrEMBL database is less likely to be a true positive than one from the canonical database. For this reason, controlling only the overall E-value threshold could result in an uncontrolled number of false positive identifications from either the isoform or tentative annotation types. Note the overall FDR can be above the specified threshold. This is because when determining the E-value thresholds, the FDR is calculated within each annotation type. Therefore, the number of unique genes can vary within an annotation type as opposed to across all annotation types. Finally, the proteoform inference step did not change the FDR at the PrSM, proteoform, and sequence level as anticipated, but increased the gene level FDR to 1.26%.

Proteoform Level Assignment and Proteoform Visualization

In top-down LC-MS, proteoform identification ambiguity occurs at many different levels and can be more convoluted relative to bottom-up proteomics. A proteoform identification can encompass a full-length protein, a splice isoform, a coding single nucleotide variant, a proteolytic fragment of the parent protein sequence, localized or unlocalized modifications (both post-translational and sample preparation artifacts), or any combination thereof. Recently, a “five-level classification system for proteoform identifications” has been proposed,53 such that ambiguity level can be clearly communicated. Briefly, a level 1 identification contains no ambiguity, while a level 5 identification cannot unambiguously describe the gene of origin, PTM identification, PTM localization, or the entire amino acid sequence. Intermediate levels of ambiguity make up the remaining levels between 1 and 5.53 We included the ability to assign all levels of ambiguity into TopPICR and classified the identifications from 10 data sets. Figure 4 displays the percentage of PrSMs from the 10 data sets for each proteoform level. Overall, most of our identifications fall into level 2D, or in other words, ambiguity with respect to the gene of origin. This is not entirely surprising as we included the TrEMBL database within our search, which contains a large number of computationally predicted but not verified open reading frames and splice isoforms. Furthermore, because the xenografted breast cancer samples (human-in-mouse) analyzed could potentially contain proteoforms derived from the mouse host, canonical SwissProt and TrEMBL entries from Mus musculus were included as well. Therefore, it should be noted that gene-of-origin ambiguity is entirely dependent on the context of the proteome search space, which may include multiple organisms or unreviewed sequencing data. We should also note that because TopPIC produces unknown mass shifts, we treat these shifts as ambiguous, unidentified PTMs.

Figure 4.

Bar plot showing the percentage of PrSMs assigned to each proteoform level within TopPICR.

Although the above function helps to accurately describe where ambiguity exists in an identification, we also sought to spatially visualize proteoforms in the context of the parent protein sequence. As demonstrated in Table 1, there are 2–3 proteoforms, on average, mapping to an amino acid sequence. Each gene protein product was detected with about 4–5 such amino acid sequence fragments. Overall, there are about 10–11 proteoforms per gene on average. To help with interpreting the complexity of top-down identifications, TopPICR contains visualization capabilities. Figure 5 demonstrates modification position and modification type for GAPDH, a protein with a relatively large number of detected proteoforms. Proteoforms are positioned according to their mapping to the original sequence and colored according to the PrSM counts reflecting abundance. Modifications with the masses matched to the Unimod database were annotated correspondingly. For the unmatched modifications, we left the mass as the identifier. Figure 5 shows the top nine most abundant modifications. From the visualization, we can conclude that GAPDH has almost full sequence coverage. Notably, there are three modifications, pyroglutamate at Q48, disulfide bond at C152, and glutathione at C247, that have the same location across multiple proteolytic fragments. These observations suggest that these PTMs may be more likely to be true relative to the ones that are observed only on a single proteolytic fragment. Indeed, the mass corresponding to disulfide modification assigned to C152 probably reflects the previously reported disulfide bond between C152 and C156.54,55 The modification was observed across 33 different proteoforms, which substantiates the claim that this is not an artifact of the spectrum-matching procedure. It has also been claimed that this disulfide bond plays a role in GAPDH function. The glutathione modification of C247 has not been clearly demonstrated and has only been suggested.56 However, cysteine at position 247 in human GAPDH (or equivalent for other organisms) is known to be a reactive nucleophile and has been reported to be modified into S-(2-succinyl)cysteine57,58 and S-nitroso cysteine.59,60

Figure 5.

Proteoforms mapped to the original protein sequence for accession P04406 from the GAPDH gene. The ordering on the y-axis is arbitrary and is determined by the optimal sequence placement along the x-axis. The gray bar at the bottom represents the reference or parent sequence. Each observed truncated proteoform is represented by a bar and colored according to its spectral count (number of PrSMs). The positions of the modifications are shown as white rectangles and numbered according to the type of modification. The identity of the modifications was determined by matching their mass to Unimod. If there was no corresponding match, it is left as a mass in Dalton units.

Preprocessing of the Data Sets by RT Alignment and Mass Recalibration

As expected from prior experience,35 the retention time drift between data sets followed a complex nonlinear pattern (Figure 2A). The deviation in retention time between the data sets reached up to 4–5 min within the 3-h LC gradient (Figure 2A). When aligning these data sets, we set the LOESS smoothing span to 0.5. Applying this procedure not only removed the overall shift in the retention time but also reduced the standard deviation of the retention time measurement error almost 2-fold (from 100 to 62 s) (Figure 2C).
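The idea behind the alignment can be illustrated with a small sketch. This is not TopPICR's implementation (the package is written in R); it is a minimal pure-Python local linear regression with tricube weights, assuming that the retention time deviation of a data set from a reference is modeled as a smooth function of retention time and then subtracted. The function name `loess_fit`, the sinusoidal drift, and all toy values are hypothetical; only the span of 0.5 and the 3-h gradient come from the text.

```python
import math

def loess_fit(x, y, x0, span=0.5):
    """Local linear fit at x0 with tricube weights over the nearest span*n points."""
    n = len(x)
    k = max(2, math.ceil(span * n))
    # The k-th smallest distance sets the window half-width; it is widened very
    # slightly so that the k-th point keeps a nonzero weight.
    h = sorted(abs(xi - x0) for xi in x)[k - 1] * (1 + 1e-9) + 1e-12
    sw = swx = swy = swxx = swxy = 0.0
    for xi, yi in zip(x, y):
        u = abs(xi - x0) / h
        if u >= 1.0:
            continue  # outside the local window
        w = (1.0 - u ** 3) ** 3  # tricube weight
        sw += w
        swx += w * xi
        swy += w * yi
        swxx += w * xi * xi
        swxy += w * xi * yi
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate window: fall back to the weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0

# Toy data: a 3-h (10800 s) gradient with a smooth, nonlinear drift (hypothetical).
reference_rt = [100.0 * i for i in range(109)]
drift = [60.0 + 200.0 * math.sin(math.pi * t / 10800.0) for t in reference_rt]
observed_rt = [t + d for t, d in zip(reference_rt, drift)]

# Model the deviation as a smooth function of observed RT, then subtract it.
deviation = [o - r for o, r in zip(observed_rt, reference_rt)]
aligned_rt = [o - loess_fit(observed_rt, deviation, o, span=0.5) for o in observed_rt]

max_before = max(abs(d) for d in deviation)
max_after = max(abs(a - r) for a, r in zip(aligned_rt, reference_rt))
print(f"max deviation before: {max_before:.1f} s, after: {max_after:.1f} s")
```

A larger span gives a smoother but more biased correction; a smaller span follows local drift more closely at the cost of fitting noise.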

Figure 2B shows a histogram of the mass measurement error using data set CR33C as an example. All the data sets showed about the same systematic error (about 5 ppm on average), potentially due to an outdated calibration (Figure 2D). Upon correction, the ppm errors were zero-centered (Figure 2D). This recalibration allows more accurate downstream clustering and, because the tolerance window becomes symmetric and narrower, lowers the chance of wrongly grouping unrelated proteoforms.
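As a rough illustration of what removing a systematic mass error involves, the sketch below computes ppm errors against theoretical masses and removes their median. Using the median as the estimate of the systematic shift is an assumption made for this example, not a description of TopPICR's internals; the function names and all masses are hypothetical.

```python
import statistics

def ppm_error(observed_mass, theoretical_mass):
    """Relative mass measurement error in parts per million."""
    return 1e6 * (observed_mass - theoretical_mass) / theoretical_mass

def recalibrate(observed_masses, theoretical_masses):
    """Estimate the systematic shift as the median ppm error and remove it."""
    errors = [ppm_error(o, t) for o, t in zip(observed_masses, theoretical_masses)]
    shift_ppm = statistics.median(errors)
    corrected = [o / (1.0 + shift_ppm * 1e-6) for o in observed_masses]
    return corrected, shift_ppm

# Toy data: a 5 ppm systematic error plus small random errors (hypothetical values).
theoretical = [8560.62, 11639.10, 15126.40, 21892.00, 35922.00]
random_ppm = [-1.2, 0.4, 0.0, 1.1, -0.3]
observed = [t * (1.0 + (5.0 + e) * 1e-6) for t, e in zip(theoretical, random_ppm)]

corrected, shift_ppm = recalibrate(observed, theoretical)
residual_ppm = [ppm_error(c, t) for c, t in zip(corrected, theoretical)]
print(f"estimated shift: {shift_ppm:.2f} ppm")
print("residual errors (ppm):", [round(r, 2) for r in residual_ppm])
```

After the correction the errors are centered on zero, so a symmetric tolerance window can be applied in the later clustering steps.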

Clustering

The retention time and mass dimensions were transformed, as described in the Methods section, into unitless and statistically interpretable Z-scores. The standard deviations for the retention time and mass dimensions were inferred from the data and corresponded to 20 s and 1.6 ppm, respectively. We used hierarchical clustering with the single linkage method. The choice of the linkage method was justified by visual inspection of the clustering results (e.g., Figure 3). We found that one of the main problems was clustering the species with tailing chromatographic profiles because determining the retention time peak apex for such species is not robust. The single linkage method performed the best at keeping such proteoforms in one cluster.
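The transformation into Z-scores can be sketched as follows. The standard deviations (20 s and 1.6 ppm) are taken from the text; the exact form of the transform is an assumption based on the description of a log-transform of mass divided by the standard deviation of the measurement error, relying on the fact that a small difference in ln(mass) multiplied by 10^6 is approximately the difference in ppm. The function names are hypothetical.

```python
import math

SD_RT_SECONDS = 20.0  # standard deviation of the RT error (from the text)
SD_MASS_PPM = 1.6     # standard deviation of the mass error (from the text)

def to_z_space(rt_seconds, mass_da):
    """Map (RT, mass) to unitless coordinates scaled by the measurement error."""
    z_rt = rt_seconds / SD_RT_SECONDS
    # A small difference in ln(mass) times 1e6 approximates the difference in ppm.
    z_mass = 1e6 * math.log(mass_da) / SD_MASS_PPM
    return z_rt, z_mass

def z_distance(feature_a, feature_b):
    """Euclidean distance between two (RT, mass) features in Z-score space."""
    return math.dist(to_z_space(*feature_a), to_z_space(*feature_b))

# Two features that differ by one standard deviation in each dimension.
a = (3000.0, 10000.000)
b = (3020.0, 10000.016)  # +20 s and +1.6 ppm
d = z_distance(a, b)
print(f"distance in Z-score space: {d:.3f}")
```

Because both axes are expressed in standard deviations, a single distance threshold applies equally to the retention time and mass dimensions.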

When cutting the tree, we applied a height of 4, which captures 0.9996 of the data in a bivariate normal distribution. This cut height was justified both by statistical interpretation and by visual inspection of the clustering results. As an additional constraint, for these data sets, we required a cluster to have at least 2 points. A cluster is a group of proteoforms that are likely the same species but, as we emphasized earlier, may have different identifications due to ambiguity from PTM localization and unknown modifications. Prior to clustering, TopPIC identified 5,419 distinct proteoforms across all data sets. After eliminating this ambiguity by clustering, only 2,705 proteoforms remained, reducing the number of species by about half. Although it is possible that clustering bundles together some isomeric (e.g., different PTM localization) proteoforms that cannot be distinguished in RT or mass, this step is extremely useful for reducing unnecessary redundancy prior to quantitative analysis.
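The effect of the cut height and the minimum cluster size can be illustrated with a short sketch. It relies on a standard property of single linkage: cutting the tree at height h yields the connected components of the graph that links every pair of points no farther apart than h. The code is a pure-Python illustration under that equivalence, not the R routine used by TopPICR, and the points (already in Z-score space) are hypothetical. The helper `bivariate_normal_coverage` evaluates 1 - exp(-r^2/2), the fraction of a standard bivariate normal distribution lying within radius r of its center, which is one way to read the statistical interpretation of the cut height.

```python
import math
from collections import defaultdict

def single_linkage_clusters(points, cut_height=4.0, min_size=2):
    """Clusters obtained by cutting a single-linkage tree at cut_height."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cut_height:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    # Discard clusters smaller than the required minimum size.
    return [members for members in groups.values() if len(members) >= min_size]

def bivariate_normal_coverage(radius):
    """Fraction of a standard bivariate normal within `radius` of its center."""
    return 1.0 - math.exp(-radius ** 2 / 2.0)

# A tailing species (chain of points along RT), a compact pair, and a singleton.
points = [(0.0, 0.0), (3.0, 0.5), (6.0, 0.2), (9.0, 0.1),
          (40.0, 0.0), (41.0, 1.0),
          (100.0, 50.0)]
clusters = single_linkage_clusters(points, cut_height=4.0, min_size=2)
coverage = bivariate_normal_coverage(4.0)
print(clusters)
print(f"coverage at radius 4: {coverage:.5f}")
```

The chain of four points stays in one cluster even though its ends are 9 units apart, which is the behavior that makes single linkage tolerant of tailing chromatographic profiles.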

Accounting for Redundancy Due to Monoisotopic Mass Error Determination

Errors in the determination of the monoisotopic mass result in coeluting clusters that differ by about 1 Da (13C–12C = 1.00335 Da, to be precise) in the mass dimension. Less frequently, such errors extend to multiples of the 13C–12C difference. Here we allowed up to ±4 multiples of the 13C–12C difference. Next, we combined clusters by directly comparing retention time centroids and mass centroids after adjusting for multiples of the isotopic error. For this study, we set the retention time and mass tolerances to 60 s and 5 ppm, respectively. These values correspond to about 3 standard deviations, our default suggestion, in the respective dimensions. Alternatively, a user can derive the actual time and relative mass error corresponding to a certain number of standard deviations using TopPICR capabilities and then set those values as clustering tolerances. This final cluster refinement step introduced isotopologue cluster groups and further reduced proteoform redundancy from 2,705 to 2,158 species.
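The comparison of cluster centroids after adjusting for isotopic error can be sketched as below. The tolerances (60 s, 5 ppm) and the limit of ±4 multiples come from the text; the 13C–12C mass difference is entered as 1.00335 Da (its value to five decimals). The function names, the dictionary layout of a cluster, and the example centroids are hypothetical, and this is an illustration rather than TopPICR's code.

```python
C13_C12_DELTA = 1.00335  # Da, mass difference between 13C and 12C

def isotopic_offset(mass_a, mass_b, max_shift=4, ppm_tol=5.0):
    """Return k (|k| <= max_shift) if mass_b matches mass_a + k * delta, else None."""
    # Try the smallest shifts first, since they are the most frequent errors.
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        expected = mass_a + k * C13_C12_DELTA
        if abs(mass_b - expected) / expected * 1e6 <= ppm_tol:
            return k
    return None

def same_isotopologue_group(c1, c2, rt_tol=60.0, ppm_tol=5.0, max_shift=4):
    """True if two cluster centroids coelute and differ by an isotopic multiple."""
    if abs(c1["rt"] - c2["rt"]) > rt_tol:
        return False
    return isotopic_offset(c1["mass"], c2["mass"], max_shift, ppm_tol) is not None

base = {"rt": 3000.0, "mass": 15000.00}
plus_one = {"rt": 3010.0, "mass": 15001.01}    # one 13C off, coeluting
unrelated = {"rt": 3010.0, "mass": 15000.50}   # not an isotopic multiple
late = {"rt": 3200.0, "mass": 15001.00335}     # right mass, wrong RT

print(isotopic_offset(base["mass"], plus_one["mass"]))
print(same_isotopologue_group(base, plus_one),
      same_isotopologue_group(base, unrelated),
      same_isotopologue_group(base, late))
```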

Match-Between-Runs

The key purpose of this step is to recover proteoforms that either were not selected for fragmentation in certain data sets or did not produce MS/MS spectra of a quality necessary for confident identification. For this step, we used the clusters prior to isotopologue grouping since this allows us to recover unidentified proteoforms with misassigned monoisotopic peaks. The tolerances for retention time and mass were set to 60 s and 5 ppm, respectively (each ~3 standard deviations). Both tolerances can be adjusted by the user and tuned if justified. As a result of MBR, the proportion of missing values decreased from 54.3% to 22.8%.
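A minimal sketch of the matching logic is given below. It assumes that each cluster is summarized by a retention time and mass centroid and that, for a data set lacking an identification, the unidentified MS1 features within both tolerances are candidates; taking the most intense candidate is a choice made for this example only. The function name, data layout, and values are hypothetical.

```python
def match_between_runs(cluster, features, rt_tol=60.0, ppm_tol=5.0):
    """Intensity of the most intense feature within tolerance of the centroid."""
    candidates = [
        f for f in features
        if abs(f["rt"] - cluster["rt"]) <= rt_tol
        and abs(f["mass"] - cluster["mass"]) / cluster["mass"] * 1e6 <= ppm_tol
    ]
    if not candidates:
        return None  # the value stays missing for this data set
    return max(f["intensity"] for f in candidates)

cluster = {"rt": 4200.0, "mass": 11350.00}
unidentified_features = [
    {"rt": 4230.0, "mass": 11350.02, "intensity": 2.5e6},  # within both tolerances
    {"rt": 4215.0, "mass": 11350.01, "intensity": 1.0e6},  # within both tolerances
    {"rt": 4400.0, "mass": 11350.00, "intensity": 9.0e6},  # RT too far (200 s)
    {"rt": 4200.0, "mass": 11351.00, "intensity": 8.0e6},  # mass too far (~88 ppm)
]

recovered = match_between_runs(cluster, unidentified_features)
missing = match_between_runs({"rt": 900.0, "mass": 20000.0}, unidentified_features)
print(recovered, missing)
```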

Statistical Analysis

Finally, to evaluate the quantitative data, we repeated the statistical analysis performed by Park et al.19 (Figure 6) and compared our results with those obtained by the Informed-Proteomics top-down data analysis tool. Based on sample-to-sample correlations (Figure 6A), samples within each group correlated more strongly with each other than with samples from the other group. Principal component analysis (Figure 6B) easily distinguished the two groups. The first component explained 81.46% of the variance, while the second explained only 7.05% and reflected the technical noise. To detect statistically significant differentially abundant proteoforms, we applied a moderated t test followed by adjustment for multiple hypothesis testing (Figure 6C). Proteoforms with adjusted p-values below 0.05 were considered differentially abundant. Out of 2,158 proteoforms, 1,291 were identified as differentially abundant. Overall, these results are on par with the previously published results from Informed-Proteomics, where out of 3,207 proteoforms, 1,636 were called differentially abundant.19 We have not explored whether Informed-Proteomics reported a larger number of identifications due to the differences in MS/MS search or due to redundancy in the proteoform identification. Taken together, this comparison confirms that our R package effectively extends TopPIC into label-free top-down quantification analysis across data sets.
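The adjustment for multiple hypothesis testing can be illustrated with the Benjamini-Hochberg procedure. The text does not name the adjustment method, so the choice of Benjamini-Hochberg here is an assumption (the Figure 6 caption refers to a 0.05 FDR threshold), and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print([round(q, 5) for q in adjusted])
print(significant)
```

In this toy example the third and fourth tests have raw p-values below 0.05 but adjusted values just above it, so only the first two pass the threshold.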

Figure 6.

Statistical analysis of proteoform abundances from basal-like and luminal B cancer xenograft tumors. A) Heatmap showing sample-to-sample Pearson correlation on the log2-transformed proteoform abundances, B) PCA plot demonstrating separation of the two tumor types, and C) volcano plot of the differential abundance analysis using a moderated t test with a significance threshold of 0.05 FDR.

DISCUSSION

The main motivation for the development of TopPICR was to extend the TopPIC MS/MS search engine capability with cross-data set quantification. The TopPIC MS/MS search engine is gaining popularity in the top-down proteomics field due to unrestrictive licensing, open-source implementation, active development, and, finally, its unique capability of identifying proteoforms with unknown or open modifications.

The implementation of the key TopPICR step, which allows quantification across data sets, followed similar tools that rely on clustering in the LC-MS space, such as MultiAlign22 and Informed-Proteomics.19 To our knowledge, MaxQuant,20,21 one of the most popular bottom-up proteomics tools, does not cluster features in the LC-MS space. However, it does perform match-between-runs to recover the intensities of some of the missing features. Clustering features across data sets in LC-MS space is the most critical step. The novelty of our approach is in the transformation of the LC and MS dimensions (log-transform of the mass and division by the standard deviation of the random measurement error) effectively into Z-scores such that the distance becomes unitless and statistically interpretable. However, the choice of the hierarchical clustering technique was dictated by computational efficiency rather than statistical reasoning. For one, a hierarchical structure is not naturally applicable to this problem because there is no hierarchical relationship between the LC-MS features. Rather, the LC-MS clusters should be viewed as independent of each other. For example, a more relevant approach would be K-means clustering; however, it requires a priori knowledge of the number of clusters. In general, determining the number of clusters is a challenging problem.61 Thus, we believe there are further opportunities for improvement by exploring other clustering algorithms such as DBSCAN62,63 and affinity propagation.64,65

The other compromise we made is clustering within individual genes rather than across all LC-MS features regardless of their identification. The advantage of this compromise is computational efficiency, both in time and space complexity (memory usage). For example, the space complexity of hierarchical clustering is O(n²). This resulted in running out of memory on a laptop with 16 GB of RAM for larger data sets. Most importantly, we found that clustering results without partitioning by gene are hard to interpret due to the high density of the data.
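The quadratic memory cost can be made concrete with a back-of-the-envelope calculation. The sketch assumes that the clustering routine stores one 8-byte value per pair of features, that is n(n-1)/2 values, and that gene-level groups are clustered one at a time; the feature counts are hypothetical and chosen only to show the scale.

```python
def pairwise_distance_bytes(n, bytes_per_value=8):
    """Memory needed to hold all n*(n-1)/2 pairwise distances."""
    return n * (n - 1) // 2 * bytes_per_value

def partitioned_peak_bytes(group_sizes, bytes_per_value=8):
    """Peak memory when each group is clustered separately, one at a time."""
    return max(pairwise_distance_bytes(n, bytes_per_value) for n in group_sizes)

all_features = 100_000             # hypothetical number of LC-MS features
gene_groups = [100] * 1_000        # same features split into 1,000 gene groups

full = pairwise_distance_bytes(all_features)
per_gene = partitioned_peak_bytes(gene_groups)
print(f"all features at once: {full / 1e9:.1f} GB")
print(f"largest gene group:   {per_gene / 1e3:.1f} kB")
```

Under these assumptions, clustering all features at once would need about 40 GB for the distances alone, well beyond a 16 GB laptop, while the per-gene partitions fit in a few tens of kilobytes each.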

The downside to partitioning by gene is that proteoforms that were not identified in any of the data sets (and thus do not have a related gene) remain invisible to such analysis. This presents two opportunities. The first is the need for proper benchmarking of LC-MS feature clustering algorithms that can assess the quality of both the number of clusters and the cluster-membership assignment. The second, related opportunity is that robust clustering of LC-MS features, regardless of their identification, would enable quantification of the unidentified features. Top-down quantification is a nascent field with yet-to-be-established good practices, protocols, and software tools.66-68 TopPICR represents a convenient utility that works in tandem with the popular MS/MS search engine TopPIC.

ACKNOWLEDGMENTS

The authors thank In Kwon Choi and Xiaowen Liu from Tulane University for help with the interpretation of the TopPIC software output and its prompt updates. This work was supported by the NIH National Institute on Aging grant U01 AG061356 (P.I. Philip L. De Jager, Columbia University).

Footnotes

The authors declare no competing financial interest.

Contributor Information

Evan A. Martin, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

James M. Fulcher, Environmental and Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

Mowei Zhou, Environmental and Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

Matthew E. Monroe, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

Vladislav A. Petyuk, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States

Data Availability Statement

The TopPICR R package is available at https://github.com/PNNL-Comp-Mass-Spec/TopPICR. The data and code to reproduce the results presented here are available at https://github.com/PNNL-Comp-Mass-Spec/TopPICR_reproducible_code_for_JPR.

REFERENCES

  • (1).Steen H; Mann M The ABC’s (and XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol 2004, 5 (9), 699–711.
  • (2).Timp W; Timp G Beyond mass spectrometry, the next step in proteomics. Sci. Adv 2020, 6 (2), No. eaax8978.
  • (3).Plubell DL; Käll L; Webb-Robertson B-J; Bramer LM; Ives A; Kelleher NL; Smith LM; Montine TJ; Wu CC; MacCoss MJ Putting Humpty Dumpty Back Together Again: What Does Protein Quantification Mean in Bottom-Up Proteomics? J. Proteome Res 2022, 21 (4), 891–898.
  • (4).Nesvizhskii AI; Aebersold R Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell Proteomics 2005, 4 (10), 1419–1440.
  • (5).Smith LM; Kelleher NL Consortium for Top Down, P. Proteoform: a single term describing protein complexity. Nat. Methods 2013, 10 (3), 186–187.
  • (6).Zhang Z; Wu S; Stenoien DL; Pasa-Tolic L High-throughput proteomics. Annu. Rev. Anal Chem 2014, 7, 427–454.
  • (7).Chen B; Brown KA; Lin Z; Ge Y Top-Down Proteomics: Ready for Prime Time? Anal. Chem 2018, 90 (1), 110–127.
  • (8).Melby JA; Roberts DS; Larson EJ; Brown KA; Bayne EF; Jin S; Ge Y Novel Strategies to Address the Challenges in Top-Down Proteomics. J. Am. Soc. Mass Spectrom 2021, 32 (6), 1278–1294.
  • (9).Smith LM; Agar JN; Chamot-Rooke J; Danis PO; Ge Y; Loo JA; Pasa-Tolic L; Tsybin YO; Kelleher NL Consortium for Top-Down, P. The Human Proteoform Project: Defining the human proteome. Sci. Adv 2021, 7 (46), No. eabk0734.
  • (10).Cupp-Sutton KA; Wu S High-throughput quantitative top-down proteomics. Mol. Omics 2020, 16 (2), 91–99.
  • (11).LeDuc RD; Fellers RT; Early BP; Greer JB; Thomas PM; Kelleher NL The C-score: a Bayesian framework to sharply improve proteoform scoring in high-throughput top down proteomics. J. Proteome Res 2014, 13 (7), 3231–3240.
  • (12).Jiang W; Wen B; Li K; Zeng WF; da Veiga Leprevost F; Moon J; Petyuk VA; Edwards NJ; Liu T; Nesvizhskii AI; et al. Deep-Learning-Derived Evaluation Metrics Enable Effective Benchmarking of Computational Tools for Phosphopeptide Identification. Mol. Cell Proteomics 2021, 20, 100171.
  • (13).Davis RG; Park HM; Kim K; Greer JB; Fellers RT; LeDuc RD; Romanova EV; Rubakhin SS; Zombeck JA; Wu C; et al. Top-Down Proteomics Enables Comparative Analysis of Brain Proteoforms Between Mouse Strains. Anal. Chem 2018, 90 (6), 3802–3810.
  • (14).Lubeckyj RA; Basharat AR; Shen X; Liu X; Sun L Large-Scale Qualitative and Quantitative Top-Down Proteomics Using Capillary Zone Electrophoresis-Electrospray Ionization-Tandem Mass Spectrometry with Nanograms of Proteome Samples. J. Am. Soc. Mass Spectrom 2019, 30 (8), 1435–1445.
  • (15).Melani RD; Gerbasi VR; Anderson LC; Sikora JW; Toby TK; Hutton JE; Butcher DS; Negrao F; Seckler HS; Srzentic K; et al. The Blood Proteoform Atlas: A reference map of proteoforms in human hematopoietic cells. Science 2022, 375 (6579), 411–418.
  • (16).Ntai I; LeDuc RD; Fellers RT; Erdmann-Gilmore P; Davies SR; Rumsey J; Early BP; Thomas PM; Li S; Compton PD; et al. Integrated Bottom-Up and Top-Down Proteomics of Patient-Derived Breast Tumor Xenografts. Mol. Cell Proteomics 2016, 15 (1), 45–56.
  • (17).Park HM; Satta R; Davis RG; Goo YA; LeDuc RD; Fellers RT; Greer JB; Romanova EV; Rubakhin SS; Tai R; et al. Multidimensional Top-Down Proteomics of Brain-Region-Specific Mouse Brain Proteoforms Responsive to Cocaine and Estradiol. J. Proteome Res 2019, 18 (11), 3999–4012.
  • (18).Zhou M; Uwugiaren N; Williams SM; Moore RJ; Zhao R; Goodlett D; Dapic I; Pasa-Tolic L; Zhu Y Sensitive Top-Down Proteomics Analysis of a Low Number of Mammalian Cells Using a Nanodroplet Sample Processing Platform. Anal. Chem 2020, 92 (10), 7087–7095.
  • (19).Park J; Piehowski PD; Wilkins C; Zhou M; Mendoza J; Fujimoto GM; Gibbons BC; Shaw JB; Shen Y; Shukla AK; et al. Informed-Proteomics: open-source software package for top-down proteomics. Nat. Methods 2017, 14 (9), 909–914.
  • (20).Cox J; Mann M MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol 2008, 26 (12), 1367–1372.
  • (21).Tyanova S; Temu T; Cox J The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc 2016, 11 (12), 2301–2319.
  • (22).LaMarche BL; Crowell KL; Jaitly N; Petyuk VA; Shah AR; Polpitiya AD; Sandoval JD; Kiebel GR; Monroe ME; Callister SJ; et al. MultiAlign: a multiple LC-MS analysis tool for targeted omics analysis. BMC Bioinformatics 2013, 14 (1), 49.
  • (23).Kong AT; Leprevost FV; Avtonomov DM; Mellacheruvu D; Nesvizhskii AI MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 2017, 14 (5), 513–520.
  • (24).Arauz-Garofalo G; Jodar M; Vilanova M; de la Iglesia Rodriguez A; Castillo J; Soler-Ventura A; Oliva R; Vilaseca M; Gay M Protamine Characterization by Top-Down Proteomics: Boosting Proteoform Identification with DBSCAN. Proteomes 2021, 9 (2), 21.
  • (25).LeDuc RD; Taylor GK; Kim YB; Januszyk TE; Bynum LH; Sola JV; Garavelli JS; Kelleher NL ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res. 2004, 32, W340–W345.
  • (26).Kou Q; Xun L; Liu X TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 2016, 32 (22), 3495–3497.
  • (27).Huber W; Carey VJ; Gentleman R; Anders S; Carlson M; Carvalho BS; Bravo HC; Davis S; Gatto L; Girke T; et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 2015, 12 (2), 115–121.
  • (28).Gatto L; Lilley KS MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 2012, 28 (2), 288–289.
  • (29).Gatto L; Gibb S; Rainer J MSnbase, Efficient and Elegant R-Based Processing and Visualization of Raw Mass Spectrometry Data. J. Proteome Res 2021, 20 (1), 1063–1069.
  • (30).Li S; Shen D; Shao J; Crowder R; Liu W; Prat A; He X; Liu S; Hoog J; Lu C; et al. Endocrine-therapy-resistant ESR1 variants revealed by genomic characterization of breast-cancer-derived xenografts. Cell Rep 2013, 4 (6), 1116–1130.
  • (31).Tabb DL; Wang X; Carr SA; Clauser KR; Mertins P; Chambers MC; Holman JD; Wang J; Zhang B; Zimmerman LJ; et al. Reproducibility of Differential Proteomic Technologies in CPTAC Fractionated Xenografts. J. Proteome Res 2016, 15 (3), 691–706.
  • (32).Elias JE; Gygi SP Target-decoy search strategy for mass spectrometry-based proteomics. In Proteome bioinformatics; Springer, 2010; pp 55–71.
  • (33).Liu X; Sirotkin Y; Shen Y; Anderson G; Tsai YS; Ting YS; Goodlett DR; Smith RD; Bafna V; Pevzner PA Protein identification using top-down spectra. Mol. Cell Proteomics 2012, 11 (6), M111.008524.
  • (34).Zhang B; Chambers MC; Tabb DL Proteomic parsimony through bipartite graph analysis improves accuracy and transparency. J. Proteome Res 2007, 6 (9), 3549–3557.
  • (35).Smith R; Ventura D; Prince JT LC-MS alignment in theory and practice: a comprehensive algorithmic review. Brief Bioinform 2015, 16 (1), 104–117.
  • (36).Petyuk VA; Mayampurath AM; Monroe ME; Polpitiya AD; Purvine SO; Anderson GA; Camp DG 2nd; Smith RD DtaRefinery, a software tool for elimination of systematic errors from parent ion mass measurements in tandem mass spectra data sets. Mol. Cell Proteomics 2010, 9 (3), 486–496.
  • (37).Petyuk VA; Jaitly N; Moore RJ; Ding J; Metz TO; Tang K; Monroe ME; Tolmachev AV; Adkins JN; Belov ME; et al. Elimination of systematic mass measurement errors in liquid chromatography-mass spectrometry based proteomics using regression models and a priori partial knowledge of the sample content. Anal. Chem 2008, 80 (3), 693–706.
  • (38).Jung HJ; Purvine SO; Kim H; Petyuk VA; Hyung SW; Monroe ME; Mun DG; Kim KC; Park JM; Kim SJ; et al. Integrated post-experiment monoisotopic mass refinement: an integrated approach to accurately assign monoisotopic precursor masses to tandem mass spectrometric data. Anal. Chem 2010, 82 (20), 8510–8518.
  • (39).Simpson DC; Ahn S; Pasa-Tolic L; Bogdanov B; Mottaz HM; Vilkov AN; Anderson GA; Lipton MS; Smith RD Using size exclusion chromatography-RPLC and RPLC-CIEF as two-dimensional separation strategies for protein profiling. Electrophoresis 2006, 27 (13), 2722–2733.
  • (40).Sharma S; Simpson DC; Tolic N; Jaitly N; Mayampurath AM; Smith RD; Pasa-Tolic L Proteomic profiling of intact proteins using WAX-RPLC 2-D separations and FTICR mass spectrometry. J. Proteome Res 2007, 6 (2), 602–610.
  • (41).Tucholski T; Knott SJ; Chen B; Pistono P; Lin Z; Ge Y A Top-Down Proteomics Platform Coupling Serial Size Exclusion Chromatography and Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Anal. Chem 2019, 91 (6), 3835–3844.
  • (42).Gargano AFG; Shaw JB; Zhou M; Wilkins CS; Fillmore TL; Moore RJ; Somsen GW; Pasa-Tolic L Increasing the Separation Capacity of Intact Histone Proteoforms Chromatography Coupling Online Weak Cation Exchange-HILIC to Reversed Phase LC UVPD-HRMS. J. Proteome Res 2018, 17 (11), 3791–3800.
  • (43).Lee JE; Kellie JF; Tran JC; Tipton JD; Catherman AD; Thomas HM; Ahlf DR; Durbin KR; Vellaichamy A; Ntai I; et al. A robust two-dimensional separation for top-down tandem mass spectrometry of the low-mass proteome. J. Am. Soc. Mass Spectrom 2009, 20 (12), 2183–2191.
  • (44).Takemori A; Butcher DS; Harman VM; Brownridge P; Shima K; Higo D; Ishizaki J; Hasegawa H; Suzuki J; Yamashita M; et al. PEPPI-MS: Polyacrylamide-Gel-Based Prefractionation for Analysis of Intact Proteoforms and Protein Complexes by Mass Spectrometry. J. Proteome Res 2020, 19 (9), 3779–3791.
  • (45).Corbett JR; Robinson DE; Patrie SM Robustness and Ruggedness of Isoelectric Focusing and Superficially Porous Liquid Chromatography with Fourier Transform Mass Spectrometry. J. Am. Soc. Mass Spectrom 2021, 32 (1), 346–354.
  • (46).Shen X; Yang Z; McCool EN; Lubeckyj RA; Chen D; Sun L Capillary zone electrophoresis-mass spectrometry for top-down proteomics. Trends Analyt Chem. 2019, 120, 115644.
  • (47).Fulcher JM; Makaju A; Moore RJ; Zhou M; Bennett DA; De Jager PL; Qian WJ; Pasa-Tolic L; Petyuk VA Enhancing Top-Down Proteomics of Brain Tissue with FAIMS. J. Proteome Res 2021, 20 (5), 2780–2795.
  • (48).Kaulich PT; Cassidy L; Winkels K; Tholey A Improved Identification of Proteoforms in Top-Down Proteomics Using FAIMS with Internal CV Stepping. Anal. Chem 2022, 94 (8), 3600–3607.
  • (49).Gerbasi VR; Melani RD; Abbatiello SE; Belford MW; Huguet R; McGee JP; Dayhoff D; Thomas PM; Kelleher NL Deeper Protein Identification Using Field Asymmetric Ion Mobility Spectrometry in Top-Down Proteomics. Anal. Chem 2021, 93 (16), 6323–6328.
  • (50).Polpitiya AD; Qian WJ; Jaitly N; Petyuk VA; Adkins JN; Camp DG 2nd; Anderson GA; Smith RD DAnTE: a statistical tool for quantitative analysis of -omics data. Bioinformatics 2008, 24 (13), 1556–1558.
  • (51).Fischer M; Renard BY iPQF: a new peptide-to-protein summarization method using peptide spectra characteristics to improve protein quantification. Bioinformatics 2016, 32 (7), 1040–1047.
  • (52).Nikolovski N; Shliaha PV; Gatto L; Dupree P; Lilley KS Label-free protein quantification for plant Golgi protein localization and abundance. Plant Physiol 2014, 166 (2), 1033–1043.
  • (53).Smith LM; Thomas PM; Shortreed MR; Schaffer LV; Fellers RT; LeDuc RD; Tucholski T; Ge Y; Agar JN; Anderson LC; et al. A five-level classification system for proteoform identifications. Nat. Methods 2019, 16 (10), 939–940.
  • (54).Barinova KV; Serebryakova MV; Muronetz VI; Schmalhausen EV S-glutathionylation of glyceraldehyde-3-phosphate dehydrogenase induces formation of C150-C154 intrasubunit disulfide bond in the active site of the enzyme. Biochim Biophys Acta Gen Subj 2017, 1861 (12), 3167–3177.
  • (55).Barinova KV; Serebryakova MV; Eldarov MA; Kulikova AA; Mitkevich VA; Muronetz VI; Schmalhausen EV S-glutathionylation of human glyceraldehyde-3-phosphate dehydrogenase and possible role of Cys152-Cys156 disulfide bridge in the active site of the protein. Biochim Biophys Acta Gen Subj 2020, 1864 (6), 129560.
  • (56).Lind C; Gerdes R; Schuppe-Koistinen I; Cotgreave IA Studies on the mechanism of oxidative modification of human glyceraldehyde-3-phosphate dehydrogenase by glutathione: catalysis by glutaredoxin. Biochem. Biophys. Res. Commun 1998, 247 (2), 481–486.
  • (57).Blatnik M; Thorpe SR; Baynes JW Succination of proteins by fumarate: mechanism of inactivation of glyceraldehyde-3-phosphate dehydrogenase in diabetes. Ann. N.Y. Acad. Sci 2008, 1126 (1), 272–275.
  • (58).Blatnik M; Frizzell N; Thorpe SR; Baynes JW Inactivation of glyceraldehyde-3-phosphate dehydrogenase by fumarate in diabetes: formation of S-(2-succinyl)cysteine, a novel chemical modification of protein and possible biomarker of mitochondrial stress. Diabetes 2008, 57 (1), 41–49.
  • (59).Jia J; Arif A; Willard B; Smith JD; Stuehr DJ; Hazen SL; Fox PL Protection of extraribosomal RPL13a by GAPDH and dysregulation by S-nitrosylation. Mol. Cell 2012, 47 (4), 656–663.
  • (60).Jia J; Arif A; Terenzi F; Willard B; Plow EF; Hazen SL; Fox PL Target-selective protein S-nitrosylation by sequence motif recognition. Cell 2014, 159 (3), 623–634.
  • (61).Xu S; Qiao X; Zhu L; Zhang Y; Xue C; Li L Reviews on determining the number of clusters. Applied Mathematics & Information Sciences 2016, 10 (4), 1493–1512.
  • (62).Hahsler M; Piekenbrock M; Doran D dbscan: Fast Density-Based Clustering with R. Journal of Statistical Software 2019, 91 (1), 1–30.
  • (63).Ester M; Kriegel H-P; Sander J; Xu X A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96 Proceedings; AAAI, 1996; Vol. 96, pp 226–231.
  • (64).Frey BJ; Dueck D Clustering by passing messages between data points. Science 2007, 315 (5814), 972–976.
  • (65).Bodenhofer U; Kothmeier A; Hochreiter S APCluster: an R package for affinity propagation clustering. Bioinformatics 2011, 27 (17), 2463–2464.
  • (66).Donnelly DP; Rawlins CM; DeHart CJ; Fornelli L; Schachner LF; Lin Z; Lippens JL; Aluri KC; Sarin R; Chen B; et al. Best practices and benchmarks for intact protein analysis for top-down mass spectrometry. Nat. Methods 2019, 16 (7), 587–594.
  • (67).Srzentic K; Fornelli L; Tsybin YO; Loo JA; Seckler H; Agar JN; Anderson LC; Bai DL; Beck A; Brodbelt JS; et al. Interlaboratory Study for Characterizing Monoclonal Antibodies by Top-Down and Middle-Down Mass Spectrometry. J. Am. Soc. Mass Spectrom 2020, 31 (9), 1783–1802.
  • (68).LeDuc RD; Schwammle V; Shortreed MR; Cesnik AJ; Solntsev SK; Shaw JB; Martin MJ; Vizcaino JA; Alpi E; Danis P; et al. ProForma: A Standard Proteoform Notation. J. Proteome Res 2018, 17 (3), 1321–1325.
