Peptide Sequence Confidence in Accurate Mass and Time Analysis and Its Use in Complex Proteomics Experiments

Damon May; Yan Liu; Wendy Law; Matt Fitzgibbon; Hong Wang; Samir Hanash; Martin McIntosh

doi:10.1021/pr8004502

Author manuscript; available in PMC: 2018 Mar 1.

Published in final edited form as: J Proteome Res. 2008 Dec;7(12):5148–5156. doi: 10.1021/pr8004502


PMCID: PMC5831358 NIHMSID: NIHMS91415 PMID: 19367719

Abstract

We present new algorithms and a software implementation for assigning confidence to peptide sequence assignments obtained through classic accurate mass and retention time (AMT) matching techniques, as well as methods for integrating these assignments with standard proteomics workflows. The algorithms are intended to increase the number of peptides and proteins identified (and, when applicable, quantitated by isotopic labeling) among related proteomics experiments that use high-resolution mass spectrometry instrumentation. The motivations for our extensions include the need to exploit high-resolution data to support highly complex proteomics experiments, especially those involving extensive off-line fractionation, to which recent label-free workflows might not easily generalize.

Keywords: AMT, msInspect, LC-MS, IPAS

INTRODUCTION

High-resolution mass spectrometry (MS) along with tandem MS dramatically increases the precision and volume of data that can be captured from proteomics experiments compared with tandem MS using low-resolution instruments alone. In particular, the high-resolution instruments provide a more complete census of precursor ions observable in a protein mixture. Several related approaches have recently been developed to exploit these data (see Veltri et al.(1) and Mueller et al.(2) for recent reviews); they rely on direct chromatographic alignment and emphasize the use of these data for label-free quantitative proteomic analysis. Here, we instead focus on the use of high-resolution LC-MS data in more traditional experiments which use isotopic labeling for quantitative comparisons and which also involve extensive fractionation of peptides and proteins.

All approaches that exploit high-resolution data begin with the identification of peptide signatures, including location of monoisotopic masses and retention times and computation of ion intensities. Approaches to downstream processing of these discovered peptide locations diverge. Some recent methods are based on associating ions across multiple experiments by direct chromatographic alignment. However, the earlier use of high-resolution data, pioneered by the Smith Laboratory(4-6), uses an Accurate Mass and Time (AMT) method for comparing ions. AMT exploits the fact that each peptide’s location in mass and normalized retention time (NRT) is strongly related to its chemical composition. The sequences of peptide ions can be determined by matching their AMT “tags” to the tags stored in an external peptide database derived from MS/MS analysis. The AMT approach has been demonstrated to find more peptides in a single MS interrogation than tandem MS alone, and ion intensities may be used for quantitative comparison across or (when isotope labeling is used) within experiments.

One of the most challenging aspects of the AMT approach is matching the ions located in a single MS interrogation to a dense AMT database containing thousands of sequence entries, such as those derived from large-scale proteomics experiments. Determining the accuracy of each AMT assignment is a key component of any AMT workflow, because it makes it possible to control the overall error rate of the experiment (5-7).

We have developed a new algorithm for determining the confidence of sequence assignments obtained through AMT methods. The algorithm extends the approach introduced by the Smith Laboratory (6) and also the approach previously implemented in msInspect/AMT (7).

The Smith method determines match confidence in a manner directly analogous to the decoy database approach in tandem MS(8), in which False Identification Rates (FIRs) are computed. In brief, peptide locations are matched within some distance threshold to a target AMT database and then again matched within the same threshold to a decoy AMT database that contains the same sequences but with perturbed masses (e.g., 11 Daltons added to the peptide mass). The FIR is computed by comparing the numbers of matches to the target and decoy AMT databases.
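The target/decoy comparison described above can be sketched in a few lines. This is a minimal illustration, not the Smith Laboratory or msInspect/AMT implementation: it matches on mass only (real AMT matching also uses NRT), it assumes the FIR is estimated as the ratio of decoy matches to target matches, and the function names, masses, and tolerance are hypothetical.

```python
def count_matches(feature_masses, db_masses, tol_ppm):
    """Count features that have at least one database mass within tol_ppm."""
    count = 0
    for m in feature_masses:
        if any(abs(m - d) / d * 1e6 <= tol_ppm for d in db_masses):
            count += 1
    return count


def false_identification_rate(n_target_matches, n_decoy_matches):
    """Assumed estimator: decoy matches divided by target matches, capped at 1."""
    if n_target_matches <= 0:
        raise ValueError("need at least one target match")
    return min(1.0, n_decoy_matches / n_target_matches)


# Illustrative (made-up) masses in Daltons.
target_db = [1000.50, 1500.75, 2000.10]
decoy_db = [m + 11.0 for m in target_db]  # same entries, masses shifted by 11 Da
features = [1000.501, 1500.751, 1011.5001, 1800.0]

n_target = count_matches(features, target_db, tol_ppm=5.0)
n_decoy = count_matches(features, decoy_db, tol_ppm=5.0)
fir = false_identification_rate(n_target, n_decoy)
```

The same tolerance is applied to both databases, so the decoy count approximates how many target matches would arise by chance alone.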

One disadvantage of this and all FIR approaches is that the same uncertainty measure applies to the entire group of peptides identified within the region and does not distinguish between the higher- and lower-quality assignments within the group. Our approach attempts to identify parameters for determining match accuracy dynamically and to compute a per-peptide level of confidence (a match probability) - an approach analogous to that taken by PeptideProphet(9) for evaluating tandem MS measurements.

We have implemented these new algorithms for making peptide assignments and assigning match confidence within the open-source msInspect/AMT software platform(7). We also include other extensions to msInspect/AMT that support the use of these AMT-derived sequence assignments alongside sequences identified through standard tandem MS experiments. Specifically, after the assignment of peptides via AMT, msInspect/AMT automatically augments the tandem MS search results (PepXML files) to include the AMT sequences and match probabilities for use by downstream analysis tools for purposes such as protein inference (e.g., ProteinProphet(10)) or quantitation using isotope labeling (e.g., XPRESS(11), ASAPratio(12) or Q3(13) for SILAC, ICAT, Acrylamide, or ¹⁸O(14) labeling).

To demonstrate the performance of our approach we evaluated a series of experiments using isotopically labeled human plasma, having between 88 and 96 fractions each. We show that, in this uncommonly complex experiment, accurate AMT assignments in combination with tandem MS approaches can increase the yield of confidently identified and quantitated peptides and proteins in each fraction, and in each experiment, compared with MS/MS results alone. This analysis shows that the traditional AMT workflow may be particularly useful in complex experiments having extensive fractionation, to which the more recent methods that exploit high-resolution data may not generalize well.

EXPERIMENTAL PROCEDURES

We begin by presenting our algorithm for assigning confidence to sequence assignments obtained through AMT matching and also extensions to msInspect/AMT to support their use in applied experiments. We then describe the series of experiments used to evaluate these methods.

Algorithm for Assigning Peptide Sequences Using AMT and Evaluating Match Confidence

Consider a single MS interrogation in which N peptide features have been located and their retention times normalized, and which we wish to match to the locations of sequences in an AMT database (the steps needed to generate an AMT database and to extract and normalize peptide features are described below). Peptides will match the AMT database elements imperfectly, and we denote these errors in mass and retention time together as Z_i=(X_i,Y_i), where X_i and Y_i represent errors in mass and retention time, respectively. A density plot representing the distribution of Z for a single fraction is shown in Figure 1. As has been observed previously(5), the distribution of errors in each dimension contains a Gaussian component mixed with an apparently uniform component. Here we formally model these distributions with the hypothesis that the distribution components result from a latent (unobserved) dichotomous variable D_i representing correct (D_i=1) or incorrect (D_i=0) AMT assignments. Our statistical procedure estimates the latent quantities and uses these estimates as the level of confidence in each AMT match. This formulation is functionally a two-dimensional version of the approach used with tandem MS identifications by PeptideProphet, in which the distribution of the null component (a Gamma distribution in PeptideProphet) is replaced by a uniform distribution. We refer the reader to the PeptideProphet manuscript(9) for technical understanding of this approach in a context specific to proteomics and to Dempster et al.(15) for technical details of the statistical approach in general.

Figure 1.

An illustration of the mixed distribution of AMT database matches across the two-dimensional space of Mass and NRT match error. Points indicate individual AMT matches. Red box represents the near-uniform distribution of false matches. Green region indicates the bivariate normal distribution of true matches. Coloration of individual points indicates probability as assigned by the EM algorithm: reddest points indicate p=0, bluest points indicate p=0.96.

a. Represent Mass and Time Match Errors Using a Statistical Model

Formally, we model D_i marginally as a Bernoulli distribution with probability p (the total rate of correct assignments) and model Z for false matches (D_i=0) as two independent uniform distributions over the matching tolerance intervals [−t_x, t_x] and [−t_y, t_y], whose lengths we denote A_x = 2t_x and A_y = 2t_y. For true matches (D_i=1) we approximate Z with two independent normal distributions with means μ_X and μ_Y and standard deviations σ_X and σ_Y, respectively. Figure 1, as well as analysis of many such distributions, supports each of these assumptions. Thus, formally,

$$P(Z \mid D) = \begin{cases} \dfrac{1}{A_{x} A_{y}} & D = 0 \\[1ex] \dfrac{1}{\sigma_{x}}\, \phi\!\left(\dfrac{x - \mu_{x}}{\sigma_{x}}\right) \dfrac{1}{\sigma_{y}}\, \phi\!\left(\dfrac{y - \mu_{y}}{\sigma_{y}}\right) & D = 1 \end{cases}$$
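The two conditional densities can be written out directly. The sketch below is illustrative only and is not taken from msInspect/AMT; it assumes that A_x and A_y are the lengths 2t_x and 2t_y of the tolerance intervals, that ϕ is the standard normal density, and that the false-match density is zero outside the tolerance window. All function names are hypothetical.

```python
import math


def normal_pdf(v, mu, sigma):
    """Normal density: (1/sigma) * phi((v - mu) / sigma)."""
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def density_false(x, y, t_x, t_y):
    """P(Z | D=0): uniform over [-t_x, t_x] x [-t_y, t_y], zero outside."""
    if abs(x) > t_x or abs(y) > t_y:
        return 0.0
    return 1.0 / ((2.0 * t_x) * (2.0 * t_y))


def density_true(x, y, mu_x, sigma_x, mu_y, sigma_y):
    """P(Z | D=1): product of independent normal densities in mass and NRT."""
    return normal_pdf(x, mu_x, sigma_x) * normal_pdf(y, mu_y, sigma_y)


# Example with tolerances of 20 ppm and 0.15 NRT units.
f0 = density_false(0.0, 0.0, t_x=20.0, t_y=0.15)
f1 = density_true(0.0, 0.0, mu_x=0.0, sigma_x=1.0, mu_y=0.0, sigma_y=1.0)
```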

b. Estimate Model Parameters

We estimate the model parameters by their maximum likelihood estimates (MLE), making use of the Expectation-Maximization (EM) algorithm (15) as a device to compute them. We omit technical details of the EM algorithm because the iterative steps for our model are quite similar to those described in detail elsewhere (10). However, in brief, in this specific mixture model framework (this is not true for all statistical models) the EM algorithm reduces to an intuitive, simple iterative procedure in which we first replace the latent data elements D_i with their expected values (denoted $\hat{d}_{i}$), computed as if the model parameters were known, and then estimate the parameters as if the missing data elements D_i were equal to $\hat{d}_{i}$. The first step (the E-step) can be written as follows:

$$\hat{d}_{i} = P(D_i = 1 \mid Z_i) = \frac{\hat{p}\, P(Z_i \mid D_i = 1)}{(1 - \hat{p})\, P(Z_i \mid D_i = 0) + \hat{p}\, P(Z_i \mid D_i = 1)}$$

The second step (the M-step) can be expressed as $\hat{p} = \sum_i \hat{d}_{i} / N$, $\hat{\mu}_{x} = \sum_i w_{i} x_{i}$, and $\hat{\sigma}_{x} = \sqrt{\sum_i w_{i} x_{i}^{2} - \hat{\mu}_{x}^{2}}$ (similar expressions hold for $\hat{\mu}_{y}$ and $\hat{\sigma}_{y}$), where $w_{i} = \hat{d}_{i} / \sum_j \hat{d}_{j}$.
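The E-step and M-step above translate into a short iterative procedure. The following is a minimal pure-Python sketch under stated assumptions, not the R implementation shipped with msInspect/AMT: the uniform density uses interval lengths 2t_x and 2t_y, a small variance floor is added to avoid a zero standard deviation (an assumption not in the text), and the function names, starting values, and error values are illustrative.

```python
import math


def normal_pdf(v, mu, sigma):
    z = (v - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(errors, p, mu_x, sd_x, mu_y, sd_y, t_x, t_y):
    """Return d_hat_i = P(D_i = 1 | Z_i) for each (mass, NRT) error pair."""
    uniform = 1.0 / ((2.0 * t_x) * (2.0 * t_y))
    d_hat = []
    for x, y in errors:
        true_like = p * normal_pdf(x, mu_x, sd_x) * normal_pdf(y, mu_y, sd_y)
        false_like = (1.0 - p) * uniform
        d_hat.append(true_like / (true_like + false_like))
    return d_hat


def m_step(errors, d_hat):
    """Re-estimate p and the normal parameters using weights w_i = d_i / sum(d)."""
    total = sum(d_hat)
    p = total / len(d_hat)
    w = [d / total for d in d_hat]
    mu_x = sum(wi * x for wi, (x, _) in zip(w, errors))
    mu_y = sum(wi * y for wi, (_, y) in zip(w, errors))
    var_x = sum(wi * x * x for wi, (x, _) in zip(w, errors)) - mu_x ** 2
    var_y = sum(wi * y * y for wi, (_, y) in zip(w, errors)) - mu_y ** 2
    floor = 1e-12  # assumed guard against a degenerate (zero) variance
    return p, mu_x, math.sqrt(max(var_x, floor)), mu_y, math.sqrt(max(var_y, floor))


# Made-up errors: four tight matches and two scattered ones (ppm, NRT units).
errors = [(0.5, 0.01), (-0.3, -0.02), (0.1, 0.0), (-0.2, 0.015),
          (15.0, 0.12), (-18.0, -0.14)]
params = (0.5, 0.0, 5.0, 0.0, 0.05)  # p, mu_x, sd_x, mu_y, sd_y
for _ in range(20):
    d_hat = e_step(errors, *params, t_x=20.0, t_y=0.15)
    params = m_step(errors, d_hat)
```

In this toy example the four tightly clustered errors end with probabilities near 1 and the two scattered errors near 0, so the estimate of p settles close to 4/6.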

c. Choose Algorithm Starting Point and Evaluate Convergence

The iterative EM algorithm requires a starting point and also a method for determining convergence of the algorithm. The EM algorithm is quite robust to the specific choice of starting parameters, and so it is most convenient to begin with computationally simple approximations. Our starting point sets μ_X and μ_Y to the mean of all error values in the mass and RT dimensions, respectively, and we set σ_X and σ_Y to their standard deviations. The starting point for p is derived from the FIR approximated from a decoy AMT match using loose tolerances (as previously implemented in msInspect/AMT). To evaluate convergence we follow standard approaches and monitor all parameters and the complete data likelihood, but we also monitor the results of the E-steps, $\hat{d}_{i}$, which provide our assignment probability estimates. We stop when the largest change in any assigned probability between iterations is smaller than 0.5% (for a probability of 0.9, for instance, this represents a change of 0.0045), or at minimum after 30 iterations. msInspect also provides graphs which can be used to evaluate convergence, including a plot of the model parameters and probability change estimates against iterations (see Supplementary Material for an example).
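The starting point and the stopping rule can be sketched as two small helpers. This is an illustration under assumptions rather than the msInspect/AMT code: the text says only that the starting value of p is "derived from" the decoy FIR, so taking p = 1 − FIR is an assumption, as are the use of the population standard deviation and the function names.

```python
import statistics


def initial_parameters(mass_errors, nrt_errors, decoy_fir):
    """Simple starting values: sample means and standard deviations of all errors."""
    return {
        "p": 1.0 - decoy_fir,  # assumed mapping from FIR to the starting p
        "mu_x": statistics.mean(mass_errors),
        "sd_x": statistics.pstdev(mass_errors),
        "mu_y": statistics.mean(nrt_errors),
        "sd_y": statistics.pstdev(nrt_errors),
    }


def has_converged(previous, current, rel_tol=0.005):
    """True when no probability moved by more than rel_tol relative to its prior value.

    Uses <= so that probabilities that stay exactly at 0 count as unchanged.
    """
    return all(abs(c - q) <= rel_tol * q for q, c in zip(previous, current))


start = initial_parameters([1.0, -1.0, 3.0, -3.0], [0.01, -0.01, 0.02, -0.02], 0.25)
small_change = has_converged([0.9], [0.904])  # change 0.004 vs. limit 0.0045
large_change = has_converged([0.9], [0.905])  # change 0.005 vs. limit 0.0045
```

The 0.5% threshold is relative, which is why a probability of 0.9 tolerates a change of at most 0.0045.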

d. Filtering Identifications

A general filter is applied to remove the obvious errors. We remove all sequence assignments having probability less than 0.1, and those for which the second best match exceeds 0.5 or the first and second best match are within 0.1 of each other (all parameters configurable).
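The three filtering rules can be expressed as a single predicate. This is a hedged sketch, not the msInspect/AMT implementation; the function and parameter names are hypothetical, and the defaults are the thresholds quoted above. Whether "within 0.1" includes a gap of exactly 0.1 is not stated, so the strict comparison here is an assumption.

```python
def keep_assignment(best, second_best=0.0,
                    min_prob=0.1, max_second=0.5, min_gap=0.1):
    """Return True if the best AMT match for a feature survives the general filter."""
    if best < min_prob:
        return False  # best match is too unlikely
    if second_best > max_second:
        return False  # a competing match is itself plausible
    if best - second_best < min_gap:
        return False  # best and runner-up are too close to call
    return True


kept = keep_assignment(0.95, 0.02)
too_low = keep_assignment(0.05)
strong_runner_up = keep_assignment(0.9, 0.6)
too_close = keep_assignment(0.55, 0.5)
```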

Integrating Matching Results into Standardized Pipeline

The algorithm provides estimates of $\hat{d}_{i}$, the probability of AMT assignment i being correct. msInspect/AMT adds all of the AMT sequences with their matching probabilities to the PepXML files resulting from the MS/MS search from the same interrogation, so that they may be used by all downstream analysis tools that operate on this standard file format, including ProteinProphet for protein inference and quantitation tools such as XPRESS, Q3 or ASAPratio.

Interrogation of Plasma Using Isotope Labeling and Intact Protein Separation Prior to LC-MS/MS

Four independent, matched pairs of human serum pools were interrogated and compared with the Intact Protein Analysis System (IPAS) (13, 16). In brief, for each experiment, consisting of one disease pool and one control pool, the serum pools were separately depleted of the top six abundant serum proteins using a Multiaffinity Removal System (MARS) column (4.6 × 100 mm; Agilent, Wilmington, DE)(16), then intact proteins were labeled with either heavy or light acrylamide(13) and combined prior to extensive off-line separation(16). The separation strategy used an orthogonal two-dimensional HPLC system in which intact proteins are fractionated first on an anion exchange column and then on a reversed phase column for a total of 656 fractions. These fractions were pooled into 96 fractions (fewer fractions were collected in some experiments due to equipment malfunction), digested(16), then interrogated by high-resolution tandem MS on an LTQ OrbiTrap XL mass spectrometer (Thermo-Finnigan) coupled to a nanoLC 2D, a two-dimensional HPLC system (Eksigent). The spectra were acquired in a data-dependent mode in the m/z range of 400 to 1800, with selection of the five most abundant +2 or +3 ions of each MS spectrum for MS/MS analysis.

Database Search of MS/MS Data and Quantitation of Isotopically Labeled Peptide Using Tandem MS Workflow

Raw data files were converted to mzXML format using ReAdW 1.1 and Xcalibur 2.2. All mzXML files were searched using X! Tandem (2007.01.01) with an alternative scoring plugin(17) compatible with PeptideProphet. Searches were conducted against the human International Protein Index database (IPI Human v3.20) plus common contaminants. All searches used the following parameters: +/−1.5Da precursor mass error, tryptic cleavage with up to two missed cleavage sites, static modification of 71.0366Da (light acrylamide) on cysteine, potential modifications of 74.0466Da (¹³C acrylamide) on cysteine, and 15.9949Da (oxidation) on methionine. Peptide assignments were evaluated using PeptideProphet(9).

Creation of AMT Database from MS/MS Identifications and Identification of Peptide Locations in High-Resolution Data

Peptide identifications from the LC-MS/MS database search were processed using previously described methods which place retention times on a common scale (7, 18). We included in the AMT database all peptides with PeptideProphet probability ≥ 0.95. Each of the 374 mzXML files was processed by msInspect to discover all LC-MS peptide locations (19), which were filtered for quality by removing all peptides located without multiple isotopes or with a KL score exceeding 3.0 (KL is a quality score for LC-MS peptides(19)). Normalization procedures(7) were used to place their retention times on the same normalized scale as the AMT database. AMT database entries were duplicated to accommodate both light and heavy isotopic labels. We performed a first-pass match to the AMT database with loose mass and normalized retention time tolerances (defaults 20 ppm and 0.15 NRT units), and then masses were calibrated based on this initial match prior to matching using the EM algorithm.
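The loose first-pass match can be illustrated as a window search in mass (ppm) and NRT. This sketch is not the msInspect/AMT code: the data layout, function names, sequences, and masses are made up for illustration, and only the 20 ppm and 0.15 NRT default tolerances come from the text.

```python
def ppm_error(observed_mass, reference_mass):
    """Signed mass error in parts per million relative to the reference mass."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def loose_matches(feature, amt_entries, tol_ppm=20.0, tol_nrt=0.15):
    """Return (sequence, mass_error_ppm, nrt_error) for entries inside both windows.

    feature: (mass, nrt); amt_entries: iterable of (sequence, mass, nrt).
    """
    mass, nrt = feature
    matches = []
    for sequence, db_mass, db_nrt in amt_entries:
        x = ppm_error(mass, db_mass)
        y = nrt - db_nrt
        if abs(x) <= tol_ppm and abs(y) <= tol_nrt:
            matches.append((sequence, x, y))
    return matches


# Hypothetical database entries; sequences and masses are placeholders.
amt_db = [
    ("PEPTIDEA", 1000.000, 0.50),  # 10 ppm away, close in NRT: matches
    ("PEPTIDEB", 1000.050, 0.50),  # about 40 ppm away: rejected on mass
    ("PEPTIDEC", 1000.005, 0.90),  # 0.38 NRT units away: rejected on time
]
matches = loose_matches((1000.010, 0.52), amt_db)
```

The (x, y) error pairs returned by such a pass are the inputs Z_i that the mixture model later scores.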

Using Mixture Model to Assign Peptide Features to AMT Database and Augment Search Results

We next assigned each peptide location to the AMT database and inferred match confidence using the EM algorithm described above. All data following the database search were processed using an Intel Xeon 5160 3GHz processor with 16GB of memory (only 1GB of memory was given to msInspect/AMT). Creating the combined AMT database from 374 fractions consumed approximately 13 minutes. Matching of all 374 fractions to the AMT database required 181 minutes (29 seconds per fraction). Matches passing a confidence threshold (probability ≥ 0.1, configurable) were added as additional information into the results of a database search on the tandem MS data for the same run.

Processing Augmented PepXML File to Identify Quantitative Ratios and Infer Proteins

Next we computed quantitative ratios between case and control samples (light and heavy labels) using Q3(13), an algorithm specifically designed to accommodate the three-Dalton mass difference for singly-labeled peptides, and ratio information was added to the existing PepXML files. Finally, we performed protein inference with ProteinProphet(10) in order to determine the proteins present in the experiment, using all identified peptides with probability greater than 0.2, and combined all peptide-level quantitation information for each identified protein in order to determine abundance ratios.

Architecture and Software Availability

All methods are implemented as part of the msInspect/AMT platform, which is cross-platform and largely written in Java, with some statistical components (e.g., the EM algorithm) written in the R statistical language. All analytical tools described in this work are freely available and open source under the Apache 2.0 license. The tools and source code, with sample datasets and a tutorial on use of the software, may be downloaded at http://proteomics.fhcrc.org/CPL/amt.

RESULTS

A total of four experiments consisting of 374 fractions were interrogated by MS/MS. The peptide and protein identifications resulting from the traditional MS/MS analysis alone and combined with the AMT results are summarized in Table 1. The experiment-level and fraction-level data are summarized in the left and right halves of the table, respectively.

Table 1.

Summary of peptide-level and protein-level results for each experiment, with MS/MS data alone and with MS/MS data combined with AMT data. All peptide counts are with PeptideProphet or AMT probability ≥ 0.9. All protein counts are of protein groups with ProteinProphet probability ≥ 0.9. Fraction-level numbers are averages over all fractions in each experiment. Counts of unique and quantified proteins per fraction are counts of protein groups with any high-quality peptide evidence (with isotopic ratios, for quantified summary) for the group in the fraction.

Experiment (Number of Fractions)	Analysis Approach	Unique Peptides (Experiment)	Quant. Peptides (Experiment)	Unique Proteins (Experiment)	Quant. Proteins (Experiment)	Unique Peptides (Fraction Mean)	Quant. Peptides (Fraction Mean)	Unique Proteins (Fraction Mean)	Quant. Proteins (Fraction Mean)
1 (96)	MS/MS	6128	1726	841	392	722.7	255.2	145.9	77.5
1 (96)	+AMT (% increase)	7357 (20.1%)	1922 (11.4%)	1038 (23.4%)	428 (9.2%)	1165.3 (61.2%)	343.9 (34.8%)	213.9 (46.6%)	97.3 (25.5%)
2 (96)	MS/MS	7230	1408	1103	329	742.4	209.3	160.0	60.9
2 (96)	+AMT (% increase)	8204 (13.5%)	1603 (13.8%)	1227 (11.2%)	354 (7.6%)	1045.5 (40.8%)	253.3 (21.0%)	218.5 (36.6%)	71.7 (17.7%)
3 (88)	MS/MS	5843	1401	710	270	760.6	214.9	142.3	65.9
3 (88)	+AMT (% increase)	7069 (21.0%)	1632 (16.5%)	875 (23.2%)	314 (16.2%)	1069.6 (40.6%)	270.7 (26.0%)	191.5 (34.6%)	77.1 (17.0%)
4 (94)	MS/MS	7400	1774	1044	346	996.6	298.3	177.4	83.4
4 (94)	+AMT (% increase)	8815 (19.1%)	2003 (12.9%)	1335 (27.9%)	403 (16.5%)	1112.9 (11.7%)	419.9 (40.8%)	268.1 (51.1%)	108.9 (30.6%)
Mean (93.5)	MS/MS	6650.3	1577.3	924.5	334.3	805.6	244.4	156.4	69.2
Mean (93.5)	+AMT (% increase)	7861.3 (18.2%)	1790.0 (13.5%)	1118.8 (21.0%)	374.5 (12.0%)	1098.3 (36.3%)	322.0 (31.7%)	223.0 (42.6%)	88.8 (28.3%)


We first consider the identifications from MS/MS analysis alone. In total, between 5843 and 7400 (average 6650.3; see final row of Table 1) unique peptide sequences (PeptideProphet probability ≥ 0.9) were identified per experiment, and between 722.7 and 996.6 unique peptide sequences (average 805.6) were identified per individual fraction. The number of unique quantified peptides (containing at least one cysteine) ranged between 1401 and 1774 per experiment, and between 209.3 and 298.3 per fraction. On the protein level (right two columns), between 710 and 1044 protein groups (average 924.5) were identified per experiment (ProteinProphet probability ≥ 0.9) and, within each experiment, peptide evidence for between 142.3 and 177.4 protein groups was identified on average per fraction (average over all experiments 156.4). The total number of quantified proteins was between 270 and 392 (average of 334.3) per IPAS experiment.

We also characterized each protein in each experiment by the percent amino acid coverage obtained and by the number of fractions in which it was identified. On average, peptides associated with the accession number of each individual protein were observed in 11.9 fractions in a single experiment, and the median percent amino acid coverage for each protein was 16.19% (for 95% of proteins, coverage exceeds 3.74%). We report this information because, along with the goal of identifying as many proteins as possible in an experiment, another goal is to improve the ability to identify different protein isoforms(16), that is, proteins that have different chemical compositions but share the same accession number (e.g., modifications and cleavage products each differ chemically but have the same accession number), and amino acid coverage information is vital to this analysis.

We next evaluated these same performance metrics using the combined MS/MS and AMT information. The increased information for all experiments is shown in Table 1. On the experiment level (left half of the table), between 7069 and 8815 (average 7861.25; see final row of Table 1) unique peptide sequences (PeptideProphet or AMT probability ≥ 0.9) were identified via MS/MS and AMT combined, an increase of 18.2% on average over MS/MS alone. Between 1045.5 and 1165.3 unique peptide sequences (average 1098.3) were identified per fraction, an increase of 36.3%. The number of unique quantified peptides ranged between 1603 and 2003 per experiment, an increase of 13.5%, and between 253.3 and 419.9 per fraction, an increase of 31.7%. On the protein level (right two columns), between 875 and 1335 proteins (average 1118.8) were identified with high confidence per experiment (an increase of 21.0%) and, within each experiment, peptide evidence for between 191.5 and 268.1 proteins was identified on average per fraction (average over all experiments 223.0, an increase of 42.6%). The total number of quantified proteins was between 314 and 428 (average of 374.5) per IPAS experiment, an increase of 12.0%.

In every fraction, peptides and proteins were quantified that had not been quantified using traditional MS/MS-based approaches, and over all experiments an average of 1113.0 peptides per experiment were quantified in fractions in which they had not been quantified by MS/MS alone (data not shown). Across the IPAS experiments, an average of 4621.25 unique peptides per experiment (69.2% of all peptides found via MS/MS search) were identified in at least one fraction with AMT but not (in that fraction) by standard MS/MS methods.

To interpret the gain in proteins at the experiment and fraction level, one must consider not only the number of entirely new proteins ascribed to the experiment or fraction, but also the ability to increase the explained amino acid coverage of the proteins already identified. Figure 2 demonstrates this ability graphically for proteins in a single representative fraction. The horizontal axis represents the coverage based on MS/MS alone and the vertical axis represents the fold increase in coverage based on the combined analysis. Overall, of the proteins identified with high quality by MS/MS, 20% show an increase in explained amino acid sequence coverage, with a median improvement of 18% per protein.

Figure 2.

Fold increase in amino acid coverage of proteins with AMT peptide identifications (vertical axis) vs. percent amino acid coverage using only MS/MS peptide identifications (horizontal axis). Blue points show coverage increase when matching to a target AMT database; red points show a decoy database match (only 6 proteins with increased coverage).

The complementarity of AMT and MS/MS identifications, which governs the amount of increase in peptide coverage that AMT identifications provide in an MS/MS experiment, is illustrated in Figure 3. Each point represents the MS/MS (horizontal axis) and AMT (vertical axis) match probabilities for a peptide assigned by both methods in the same fraction. Region A denotes the peptides that are found by both methods with high quality (probability>0.9; 41% of peptides fall in region A). The peptides falling in Region B are those with a low MS/MS PeptideProphet score in a fraction but which are confidently identified using AMT; in this experiment, 10% of peptides fall in this range. The peptides falling in Region D are those with a low AMT probability score in a fraction but which are confidently identified using MS/MS; in this experiment, 5% of peptides fall in this range. Not visible in this figure are the peptides identified by AMT (Region C, 14% of peptides) or by MS/MS (Region E, 27% of peptides), but not by both. Overall, in this experiment, the AMT approach can thus improve the number of peptides confidently identified per fraction by approximately 14% + 10% ≈ 24% (using the rounded region percentages) compared with using MS/MS alone.

Figure 3.

MS/MS database search probability (horizontal axis) vs. AMT match probability for the same peptide (vertical axis). Region A (41% of peptides): high probability in both MS/MS and AMT. Region B (10%): high-probability in AMT but low in MS/MS. Region C (14%): high-probability matches unique to AMT. Region D (5%): high-probability in MS/MS but low in AMT. Region E (27%): matches unique to MS/MS.

Evaluating the Accuracy of Matching

The results above show the ability to increase coverage and identifications based on AMT matching with the new algorithm. We used several approaches to demonstrate the overall accuracy of our matching algorithm.

The rate of agreement between MS/MS and AMT sequences identified in each fraction, as shown in Figure 3, provides a direct demonstration that the AMT matching is of high quality. We also evaluated the rates at which the AMT assignments and high-quality MS/MS assignments for the same ion disagree. We associated MS/MS identifications with LC-MS peptide features in the same mzXML file if they fell within 5ppm and 20 seconds of each other and matched uniquely. Of those peptide features that matched the AMT database with probability ≥0.9, only 0.26% (a rate of 0.0026) disagreed with the sequence assigned by MS/MS. These rates suggest that the matching algorithms rarely create discordance between AMT and MS/MS identifications.

We also established that the increase in percentage of amino acid coverage one should expect by chance is far below that shown in Figure 2, by matching that same experiment to a decoy AMT database. The points in red show the results of this analysis. Compared with 20% of the proteins increasing their coverage in the target AMT database, only six proteins (less than 0.01%) found an increase when using the decoy database.

We also evaluated the fit of the parametric model using a quantile-quantile plot of the estimated mixed distribution against the density of the actual match data using the automated graphing functions of msInspect (as described above; see Supplementary Material for example) and found an overall high-quality agreement between the estimated parametric model and the empirical behavior of the data, suggesting that our parametric assumptions are reasonable.

Finally, we compared the spatial distribution of peptides across the two dimensions of separation within a single experiment to determine whether the MS/MS and AMT identifications were overall concordant. The similarity of the spatial distribution is evidence for the accuracy of the method because the fraction information was not used as part of the matching algorithm, and so is an independent confirmation of the accuracy of the model. To compare the spatial behavior, we selected all peptide sequences found in a selected fraction (the “origin”) and then recorded the number of these peptides found in all other fractions separately by either MS/MS or AMT methods. Figure 4(a) shows a representative distribution of these counts across all fractions for MS/MS data, and Figure 4(b) shows the distribution for AMT matches; the two charts reveal a high degree of spatial association of their identified sequences.

Figure 4.

Heatmaps describing the distribution of peptide identifications throughout the AX (horizontal axis) and RP (vertical axis) dimensions of a fractionated experiment. Red indicates many IDs, blue indicates few. Identifications charted are those peptides confidently identified via MS/MS in fraction (5,5), the reddest fraction in both charts. a) Peptides confidently identified via MS/MS. b) Peptides confidently matched via AMT.

One should expect a high degree of correlation between the quantitative ratios derived by the Q3 algorithm from MS/MS data and from AMT data only if a high degree of accuracy is obtained in our matching, because our matching algorithm does not make use of ion intensity. However, one should not expect identical quantitation by MS/MS and AMT for computed ion intensities because each method uses a different starting point for peptide abundance detection. The correlation coefficient between log AMT ratios (based on de novo discovery of peptides and quantitation) and log MS/MS ratios (based on MS/MS-driven quantitation) was 0.95 (see Figure 5); 95% of the ratios (on the raw scale) differ between AMT and MS/MS by less than 15%, and 86% of the peptide ratios differ by less than 5% between the two methods. These differences compare quite favorably with, for instance, the agreement expected between different MS/MS-based quantitation methods (e.g., Q3, XPRESS, ASAPratio).
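The two agreement measures used above, the correlation of log ratios and the share of raw-scale ratios that differ by less than a given percentage, can be computed as follows. This is a sketch with made-up ratios, not the analysis code used in the study; in particular, measuring the percentage difference relative to the MS/MS ratio is an assumption, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def fraction_within(msms_ratios, amt_ratios, rel_tol):
    """Share of peptides whose AMT ratio is within rel_tol of the MS/MS ratio."""
    close = sum(1 for m, a in zip(msms_ratios, amt_ratios)
                if abs(a - m) / m < rel_tol)
    return close / len(msms_ratios)


# Illustrative light/heavy ratios for four peptides (not real data).
msms = [0.5, 1.0, 2.0, 4.0]
amt = [0.52, 0.98, 2.2, 3.0]

r = pearson([math.log2(v) for v in msms], [math.log2(v) for v in amt])
within_15 = fraction_within(msms, amt, 0.15)
within_5 = fraction_within(msms, amt, 0.05)
```

Working on the log scale treats a two-fold increase and a two-fold decrease symmetrically, which is the usual reason for correlating log ratios rather than raw ratios.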

Figure 5.

Comparison of median per-charge-state peptide labeled isotope ratios calculated by the Q3 algorithm, based on high-quality LC-MS/MS database search results (horizontal axis) and high-quality AMT matches (vertical axis). Correlation coefficient is 0.95.

DISCUSSION

We presented 1) a new algorithm for determining the probability of correct AMT sequence assignments, 2) its implementation in a workflow that allows the incorporation of the results into a traditional tandem MS pipeline, and 3) an example using this workflow with a set of experiments of uncommon complexity. This work leads us to conclude that it is feasible to borrow strength across a large number of experiments and across fractions within an experiment to increase the peptide and protein identifications and to increase the amino acid coverage of proteins identified by MS/MS methods alone. Since our implementation makes use of standard file formats for MS/MS data processing (PepXML), it is convenient to augment an existing MS/MS-based proteomics workflow to gain the benefits of AMT data.

The AMT approach used here follows the original formulation advocated by Smith et al., which can be seen as a means to integrate high resolution data into proteomics experiments. As with the Smith formulation, our model of the distribution of AMT matching error uses both the mass component and the NRT component of the error. This is far more effective than using only one component or the other alone. The actual benefit of using both mass and time will depend on the density of the peptide features in a single interrogation, as well as the density of the AMT database being matched. For our specific example here, using a model based only on mass, or only on NRT, to make AMT assignments would result in a roughly four-fold increase in the number of ambiguous matches (matching assignments to more than one peptide).
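The effect of combining mass and NRT on match ambiguity can be illustrated with a small Python sketch. It deliberately swaps in simple tolerance windows for the probabilistic mixture model described in this paper, so it shows only why a second dimension reduces ambiguity, not how confidence is assigned. All names, tolerances, and values below are hypothetical.

```python
from typing import List, NamedTuple, Optional

class Entry(NamedTuple):
    """One AMT database entry (hypothetical structure)."""
    peptide: str
    mass: float
    nrt: float

class Feature(NamedTuple):
    """One observed LC-MS peptide feature (hypothetical structure)."""
    mass: float
    nrt: float

def candidates(feature: Feature, database: List[Entry],
               ppm_tol: Optional[float] = None,
               nrt_tol: Optional[float] = None) -> List[str]:
    """Return peptides whose entries fall inside every tolerance supplied.
    Passing None for a tolerance ignores that dimension."""
    hits = []
    for entry in database:
        if ppm_tol is not None:
            ppm_error = abs(feature.mass - entry.mass) / entry.mass * 1e6
            if ppm_error > ppm_tol:
                continue
        if nrt_tol is not None and abs(feature.nrt - entry.nrt) > nrt_tol:
            continue
        hits.append(entry.peptide)
    return hits

def count_ambiguous(features, database, ppm_tol=None, nrt_tol=None) -> int:
    """Number of features that match more than one database entry."""
    return sum(
        1 for f in features
        if len(candidates(f, database, ppm_tol, nrt_tol)) > 1
    )

# Made-up database and features, chosen so that each single dimension
# leaves one feature ambiguous while the combination resolves both.
database = [
    Entry("PEPTIDEA", 1000.5000, 0.20),
    Entry("PEPTIDEB", 1000.5020, 0.70),
    Entry("PEPTIDEC", 1500.7500, 0.40),
    Entry("PEPTIDED", 2000.0000, 0.405),
]
features = [Feature(1000.5010, 0.21), Feature(1500.7502, 0.41)]

mass_only = count_ambiguous(features, database, ppm_tol=5.0)
nrt_only = count_ambiguous(features, database, nrt_tol=0.05)
mass_and_nrt = count_ambiguous(features, database, ppm_tol=5.0, nrt_tol=0.05)
```

In this toy data the first feature is ambiguous on mass alone and the second on NRT alone, while requiring both resolves each to a single peptide. The size of the reduction in real data depends on feature and database density, as noted above.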

Our work has focused on quantitation of peptides and proteins only for isotopically labeled experiments. This may be contrasted to the more recently developed platforms that use high-resolution data(3, 20) and that emphasize label-free quantitative approaches. Those recently developed platforms circumvent the need to create an explicit AMT database and instead rely on direct chromatographic alignment of peptide locations between related series of experiments. In the AMT approach, peptide locations are associated between experiments only if they match the same entry in an external database. Each of these two approaches has advantages and disadvantages. An advantage of the direct chromatographic alignment approaches is that peptides that may never have been sequenced successfully are accessible for quantitative comparison between experiments, whereas the classic AMT approach requires the ions to have been selected for CID and sequenced with high confidence in some experiment.

However, an advantage of the AMT approach, combined with isotopic labeling, may be its suitability for evaluating complex experiments, such as those requiring off-line separation. There are several outstanding problems that still need to be solved before direct chromatographic alignment approaches may be used for these more complex workflows. For example, with fractionation, consider a series of n samples having k fractions each. Because individual peptides are likely to occur in multiple fractions (especially with intact protein separation(16)), naïve, direct chromatographic alignment of all pairs of fractions could require on the order of (nk)² alignments (precisely nk(nk – 1)/2), each with some propensity to admit and propagate errors. The classic AMT approach requires only nk alignments, one per fraction against the database. Moreover, another consequence of peptides existing in multiple fractions is the difficulty in defining peptide intensity for peptides that migrate across one or more fractions, a problem automatically accounted for in experiments using isotopic labeling. Until those computational issues are resolved, the AMT approach using isotopic labeling and the method we describe here allow the information captured by high-resolution instruments to be used in complex proteomics workflows requiring separations.
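The alignment counts in the preceding paragraph can be checked with a short Python sketch. The function names and the example values of n and k are hypothetical and are not taken from the experiments described here.

```python
def pairwise_alignments(n_samples: int, k_fractions: int) -> int:
    """Alignments needed to align every pair of LC-MS runs directly:
    nk(nk - 1)/2 for n samples with k fractions each."""
    runs = n_samples * k_fractions
    return runs * (runs - 1) // 2

def amt_alignments(n_samples: int, k_fractions: int) -> int:
    """Alignments needed when each run is aligned once to an AMT database."""
    return n_samples * k_fractions

# Hypothetical design: 4 samples, 10 fractions each (40 runs in total).
n, k = 4, 10
direct = pairwise_alignments(n, k)   # 40 * 39 / 2 = 780
via_amt = amt_alignments(n, k)       # 40
```

The gap widens quadratically: doubling the number of fractions roughly quadruples the number of direct pairwise alignments but only doubles the number of alignments to the database.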

Supplementary Material

1_si_001: NIHMS91415-supplement-1_si_001.zip (47.2 MB, zip)

2_si_002: NIHMS91415-supplement-2_si_002.pdf (14.2 MB, pdf)

ACKNOWLEDGMENT

Funding provided by National Cancer Institute grants U01 CA111273 and R01 CA107209, by Department of Defense grant W81XWH-06-1-0100, and by the Canary Foundation.

Footnotes

SUPPORTING INFORMATION AVAILABLE: A tutorial for the msInspect/AMT platform, with sample charts; a version of the msInspect/AMT software; a tutorial with sample data for demonstration of AMT matching and confidence estimation; additional charts referred to in this manuscript. This material is available free of charge at http://pubs.acs.org.

REFERENCES

1. Veltri P. Briefings in Bioinformatics. 2008;9:144–155. doi: 10.1093/bib/bbn007.
2. Mueller LN, Brusniak M, Mani DR, Aebersold R. J Proteome Res. 2007;7:51–61. doi: 10.1021/pr700758r.
3. Jaffe JD, Mani DR, Leptos KC, Church GM, Gillette MA, Carr SA. Mol Cell Proteomics. 2006;5:1927–41. doi: 10.1074/mcp.M600222-MCP200.
4. Smith RD, Anderson GA, Lipton MS, Pasa-Tolic L, Shen Y, Conrads TP, Veenstra TD, Udseth HR. Proteomics. 2002;2:513–23. doi: 10.1002/1615-9861(200205)2:5<513::AID-PROT513>3.0.CO;2-W.
5. Norbeck AD, Monroe ME, Adkins JN, Anderson KK, Daly DS, Smith RD. J Am Soc Mass Spectrom. 2005;16:1239–49. doi: 10.1016/j.jasms.2005.05.009.
6. Petyuk V, Qian WJ, Chin M, Wang H, Livesay E, Monroe ME, Adkins J, Jaitly N, Anderson D, Camp DG, 2nd, Smith DJ, Smith R. Genome Res. 2007;17:328–336. doi: 10.1101/gr.5799207.
7. May D, Fitzgibbon M, Liu Y, Holzman T, Eng J, Kemp CJ, Whiteaker J, Paulovich A, McIntosh M. J Proteome Res. 2007;6:2685–2694. doi: 10.1021/pr070146y.
8. Elias JE, Gygi SP. Nat Methods. 2006;22:2830–2832.
9. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Anal Chem. 2002;74:5383–92. doi: 10.1021/ac025747h.
10. Nesvizhskii AI, Keller A, Kolker E, Aebersold R. Anal Chem. 2003;75:4646–58. doi: 10.1021/ac0341261.
11. Han DK, Eng J, Zhou H, Aebersold R. Nat Biotechnol. 2001;19:946–951. doi: 10.1038/nbt1001-946.
12. Li XJ, Zhang H, Ranish JR, Aebersold R. Anal Chem. 2003;75:6648–57. doi: 10.1021/ac034633i.
13. Faca V, Coram M, Phanstiel D, Glukhova V, Zhang Q, Fitzgibbon M, McIntosh M, Hanash S. J Proteome Res. 2006;5:2009–18. doi: 10.1021/pr060102+.
14. Stewart II, Thompson T, Figeys D. Rapid Commun Mass Spectrom. 2001;15:2456–65. doi: 10.1002/rcm.525.
15. Dempster AP, Laird NM, Rubin DB. Journal of the Royal Statistical Society. Series B (Methodological) 1977;39:1–38.
16. Faca V, Pitteri SJ, Newcomb L, Glukhova V, Phanstiel D, Krasnoselsky A, Zhang Q, Struthers J, Wang H, Eng J, Fitzgibbon M, McIntosh M, Hanash S. J Proteome Res. 2007;6:3558–3565. doi: 10.1021/pr070233q.
17. MacLean B, Eng J, Beavis RC, McIntosh M. Bioinformatics. 2006;22:2830–2. doi: 10.1093/bioinformatics/btl379.
18. Krokhin OV, Craig R, Spicer V, Ens W, Standing KG, Beavis RC, Wilkins JA. Mol Cell Proteomics. 2004;3:908–19. doi: 10.1074/mcp.M400031-MCP200.
19. Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, May D, Eng J, Fang R, Lin C, Chen J, Goodlett D, Whiteaker J, Paulovich A, McIntosh M. Bioinformatics. 2006;22:1902–9. doi: 10.1093/bioinformatics/btl276.
20. Wang P, Coram M, Tang H, Fitzgibbon MP, Zhang H, Yi E, Aebersold R, McIntosh M. Biostatistics. 2006;8:357–367. doi: 10.1093/biostatistics/kxl015.
