Author manuscript; available in PMC: 2016 Nov 1.
Published in final edited form as: J Am Soc Mass Spectrom. 2015 Aug 4;26(11):1858–1864. doi: 10.1007/s13361-015-1228-5

Using SEQUEST with Theoretically Complete Sequence Databases

Rovshan G Sadygov 1,2,*
PMCID: PMC4607654  NIHMSID: NIHMS713022  PMID: 26238326

Abstract

SEQUEST has long been used to identify peptides/proteins from their tandem mass spectra and protein sequence databases. The algorithm has proven hugely successful for its sensitivity and specificity in identifying peptides/proteins whose sequences are present in the protein sequence databases. Here we report a new use of the algorithm: searching a complete list of theoretically possible peptides, a de novo like sequencing. We used freely available mass spectral data and determined a number of unique peptides identified by SEQUEST. Using the masses of these peptides and a mass accuracy of 0.001 Da, we created a database of all theoretically possible peptide sequences corresponding to the precursor masses. We used our recently developed algorithm to determine all amino acid compositions corresponding to a mass interval, and used a lexicographic ordering to generate theoretical sequences from the compositions. The newly generated theoretical database was many fold more complex than the original protein sequence database. We used SEQUEST to search all theoretically possible peptide sequences and identify the best matches to the spectra. We found that the SEQUEST cross-correlation score ranked the correct peptide match among the top sequence matches. The results testify to the high specificity of SEQUEST when combined with high mass accuracy for intact peptides.

Keywords: SEQUEST, mass distribution of peptides, all theoretically possible peptides, de novo peptide sequencing

Introduction

High-throughput protein identification using tandem mass spectrometry coupled to liquid chromatography is a well-established and widely used technology [1, 2]. The methodology has various implementations but can, in general, be divided into three major components: sample preparation (protein extraction, protein separation and digestion, peptide separation using chromatography), mass spectrometry, and software for protein identification using tandem mass spectra and protein sequence databases [1]. Automated protein identification is a very important component of the methodology because the number of tandem mass spectra is in the tens of thousands and manual annotation of spectra is not feasible. SEQUEST [3] was one of the first database search engines developed to automate protein identification. Along with the very few other early search approaches of the time, namely Mascot (probability based) [4], error-tolerant searching [5], and the high mass accuracy concept [6], it has contributed tremendously to the development of the proteomics field and to making it widely accessible. Since the development of the original search engines, a number of new software tools have been developed to address the diverse and growing needs of the field. We can note only a few: the probabilistic OMSSA [7], X!Tandem [8], MyriMatch [9], Byonic [10], InsPecT [11, 12], and the high mass accuracy Andromeda [13]. The concepts of probability-based peptide identification and databases have also been employed for modeling protein identifications from intact protein fragmentation [14, 15]. One of the important features of SEQUEST is its multiple scoring criteria. First, it filters the database peptide sequences for candidate peptides using the enzymatic specificity and the experimental precursor mass, including its accuracy. For each candidate peptide a preliminary score, Sp, is computed.
Sp is fast to compute, and all database peptides meeting the mass filtering criterion are assigned Sp scores. In the second stage, a certain number (500 by default) of the top Sp-scoring peptides are used for cross-correlation analysis with the experimental spectrum to generate XCorr. This step involves multiple fast Fourier transformations (FFTs) per candidate peptide and is normally slower than Sp scoring. To accelerate this process for high mass accuracy data, where the mass arrays are large, the FFTW library (the "Fastest Fourier Transform in the West") was adapted into SEQUEST [16]. XCorr reports the correlation between the experimental spectrum and the theoretical spectrum of a candidate peptide sequence, whereas Sp accounts for the total (explained) ion current. Another score, ΔCn, is the difference between the XCorr of the highest ranked peptide and that of a given peptide, normalized by the XCorr of the highest ranked peptide. As database sizes increased and more candidate sequences were correlated against the experimental spectra, it became necessary to provide the probability that a peptide identification is a true or false positive. A large number of research papers have explored different statistical approaches that employ the SEQUEST scores to assign the probability of a false or true match [17-19]. SEQUEST-identified peptides have been used for further bioinformatics confirmation of post-translational modifications (PTMs), such as phosphorylation [20-23]. In brief, SEQUEST has stimulated a large number of studies in bioinformatics and statistical approaches to automate and advance protein identification, PTM determination, quantification, and many other diverse applications of proteomics. This is reflected in the number of citations of the original SEQUEST paper, which is currently the most cited article in JASMS.
It has served as an inspiration for bioinformatics software development in proteomics, metabolomics, and other research areas that use mass spectrometry based high-throughput sequencing. In this issue of JASMS, Dr. David Tabb provides a comprehensive chronicle of SEQUEST development and of the many software tools that it has influenced. Recent review papers describe protein identification [24] and interpretation of mass spectra [25].
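The cross-correlation idea described above can be illustrated with a minimal sketch. It assumes the commonly described form of the score (the correlation at zero offset minus the mean correlation over nearby offsets, here ±75 bins), uses direct sums rather than FFTs for readability, and omits all spectrum preprocessing. The function names and the toy binned spectra are hypothetical and are not taken from the SEQUEST source code.

```python
def dot_at_offset(observed, theoretical, tau):
    """Sum of observed[i] * theoretical[i + tau] over all bins that stay in range."""
    n = len(observed)
    total = 0.0
    for i in range(n):
        j = i + tau
        if 0 <= j < n:
            total += observed[i] * theoretical[j]
    return total


def xcorr_sketch(observed, theoretical, max_offset=75):
    """Zero-offset correlation minus the mean correlation at non-zero offsets.

    Simplified illustration only: the +/-75 bin window is an assumption and
    no intensity normalization is applied.
    """
    at_zero = dot_at_offset(observed, theoretical, 0)
    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(dot_at_offset(observed, theoretical, t) for t in offsets)
    return at_zero - background / len(offsets)


def binned(peaks, size=200):
    """Toy spectrum: unit intensity at the given integer bin positions."""
    spectrum = [0.0] * size
    for p in peaks:
        spectrum[p] = 1.0
    return spectrum


observed = binned([20, 57, 90, 130])
matching = binned([20, 57, 90, 130])      # fragments line up with the observed peaks
mismatched = binned([25, 60, 100, 140])   # fragments do not line up
print(xcorr_sketch(observed, matching), xcorr_sketch(observed, mismatched))
```

Subtracting the background term penalizes candidates that correlate with the spectrum equally well at any shift, which is why a candidate whose fragments line up only at zero offset scores highest.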

In this paper, we report on our findings in using SEQUEST for a de novo like sequencing. SEQUEST was originally designed as a database search engine to identify peptides from their tandem mass spectra and protein sequence databases. Here we adapt the algorithm for small-scale sequencing of all theoretically possible peptides by making use of our algorithm for generating amino acid compositions of all theoretically possible peptides from their intact masses and the mass accuracy of intact peptides [26, 27]. We sought to find out how the SEQUEST score of a true match would fare against the large number of peptides that are analyzed in an unbiased de novo like sequencing [28-35]. Our secondary purpose was to find out how large an XCorr can be obtained. The approach may also contribute to false discovery rate control [36, 37] based on the use of decoy databases.

In the methods section we describe the workflow and generation of theoretical peptide sequences. The Results and Discussion section describes the application of the approach to study more than fourteen-hundred tandem mass spectra from a publicly available data set [38].

Methods

We start by using SEQUEST to identify peptide sequences from their tandem mass spectra and a protein sequence database (UniProt) [39]. Then, given the mass of an intact peptide and the enzymatic specificity of the protein digest, we generate the list of all theoretically possible amino acid compositions. The compositions are converted into peptide sequences using lexicographic ordering. The peptide sequences for each precursor mass are assembled into a theoretical database of candidate sequences. SEQUEST is then used to search the theoretical database of sequences with the tandem mass spectrum of the peptide. The procedure essentially amounts to a de novo like sequencing, without consideration of PTMs.

Generating Theoretical Peptide Sequences

Here we briefly review the procedure for generating peptide sequences for a given mass interval (determined by the mass of the peptide and the mass accuracy of the measurement). A peptide is a sequence of letters from a twenty-letter alphabet A whose letters correspond to the twenty amino acids. This sequence is a realization of a composition represented by a numerical vector $(a_1, a_2, \ldots, a_{20})$, whose $j$th component is the number of occurrences of the $j$th letter (amino acid) in the sequence, $j = 1, 2, \ldots, 20$. The number of amino acid compositions of peptides of length $L$ is given by the Bose-Einstein statistics:

$$\binom{N+L-1}{L} = \frac{(N+L-1)!}{L!\,(N-1)!}$$

The number of all sequences of length $L$ with a given composition is a multinomial coefficient:

$$\frac{L!}{a_1!\, a_2! \cdots a_N!}$$

and the number of all distinct sequences is $N^L$. Here $N$ (= 20) is the number of amino acids in the alphabet. The formulas are used to confirm the accuracy of the algorithms for determining amino acid compositions and for the subsequent sequence generation.
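The way these formulas can serve as a consistency check is sketched below for a small alphabet: every composition is enumerated, the number of compositions is compared with the Bose-Einstein count, and the multinomial coefficients are summed and compared with N^L. The function names are hypothetical, and the brute-force enumeration is only practical for small N and L.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb, factorial, prod


def count_compositions(n_letters, length):
    """Bose-Einstein count: number of compositions of the given length."""
    return comb(n_letters + length - 1, length)


def count_sequences(composition):
    """Multinomial coefficient L! / (a_1! a_2! ... a_N!) for one composition."""
    length = sum(composition)
    return factorial(length) // prod(factorial(a) for a in composition)


def check_consistency(n_letters, length):
    """Enumerate all compositions and confirm both closed-form counts."""
    letters = range(n_letters)
    n_compositions = 0
    total_sequences = 0
    for multiset in combinations_with_replacement(letters, length):
        counts = Counter(multiset)
        composition = [counts.get(letter, 0) for letter in letters]
        n_compositions += 1
        total_sequences += count_sequences(composition)
    return (n_compositions == count_compositions(n_letters, length)
            and total_sequences == n_letters ** length)


print(count_compositions(20, 2))   # compositions of dipeptides over 20 amino acids
print(count_sequences([2, 1, 0]))  # orderings of a composition such as {G: 2, A: 1}
print(check_consistency(4, 5))     # small alphabet, exhaustive check
```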

We have previously used our algorithm to build and study the mass distribution of all theoretically possible peptides [40] and applied it to distinguish phosphopeptides from unmodified peptides [41]. The algorithm accounts for the digest specificity and the number of missed cleavages. Here we used this algorithm to generate amino acid compositions for all sequences whose masses fit the mass of an intact peptide within a given mass accuracy. The compositions are then used by a lexicographic algorithm to generate all possible unique peptide sequences. Since the number of sequences is very large ($20^L$), we made use of the mass degeneracy of Leu and Ile by using Leu only, which reduces the complexity of the theoretical databases. This effectively reduces the number of amino acids to 19. We used full trypsin digest specificity with no missed cleavages. To reduce the complexity further, we considered only peptides with an intact mass of less than 1200 Da and assumed a mass window of 0.002 Da (2 mDa) centered on the precursor mass.
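The two steps described above (compositions within a mass window, then sequences in lexicographic order) can be sketched as follows. This is not the published algorithm [26, 27]; it is a simple depth-first enumeration under stated assumptions: a tryptic peptide is modeled as any residues other than K or R followed by a single C-terminal K or R, the proline rule is ignored, Ile is folded into Leu, and masses are [M+H]+ values built from standard monoisotopic residue masses. All names are hypothetical.

```python
from collections import Counter

# Monoisotopic residue masses in Da; Ile is omitted (isobaric with Leu).
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "N": 114.042927, "D": 115.026943, "Q": 128.058578, "K": 128.094963,
    "E": 129.042593, "M": 131.040485, "H": 137.058912, "F": 147.068414,
    "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565
PROTON = 1.007276
# Residues allowed before the C-terminus, sorted by mass so the search can prune.
INNER = sorted((aa for aa in RESIDUE_MASS if aa not in "KR"), key=RESIDUE_MASS.get)


def compositions_in_window(low, high):
    """Return (inner_residues, c_terminus) pairs whose [M+H]+ lies in [low, high]."""
    results = []

    def extend(start, chosen, mass, cterm):
        if low <= mass <= high:
            results.append(("".join(chosen), cterm))
        for idx in range(start, len(INNER)):
            new_mass = mass + RESIDUE_MASS[INNER[idx]]
            if new_mass > high:
                break  # heavier residues would overshoot as well
            chosen.append(INNER[idx])
            extend(idx, chosen, new_mass, cterm)  # idx, not idx + 1: repeats allowed
            chosen.pop()

    for cterm in "KR":
        extend(0, [], WATER + PROTON + RESIDUE_MASS[cterm], cterm)
    return results


def lexicographic_sequences(inner, cterm):
    """Yield each distinct ordering of `inner` once, in lexicographic order."""
    counts = Counter(inner)
    letters = sorted(counts)
    prefix = []

    def place():
        if len(prefix) == len(inner):
            yield "".join(prefix) + cterm
            return
        for aa in letters:
            if counts[aa] > 0:
                counts[aa] -= 1
                prefix.append(aa)
                yield from place()
                prefix.pop()
                counts[aa] += 1

    yield from place()


# Toy example: a 2 mDa window centered on the [M+H]+ of the tripeptide GAR.
center = sum(RESIDUE_MASS[aa] for aa in "GAR") + WATER + PROTON
for inner, cterm in compositions_in_window(center - 0.001, center + 0.001):
    print(inner, cterm, list(lexicographic_sequences(inner, cterm)))
```

Even this toy window returns two compositions (G plus A, and Q alone, each with a C-terminal R). The recursion is only meant to show the logic; at masses near 1200 Da the number of sequences grows very quickly, as the database sizes reported in the Results indicate.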

We used SEQUEST to search the theoretical sequence databases and identify the best matches to the spectra. Then we compared these peptides with the results that SEQUEST has identified from the UniProt database. No PTMs were considered in this study. Mass accuracy was 1 mDa for precursor ions. Figure 1 summarizes the workflow used in this study.

Figure 1.


The workflow of SEQUEST peptide identification using theoretically complete peptide sequences. The green path indicates the normal database search procedure for which SEQUEST is used. The blue path indicates the generation of theoretical peptides, creation of the theoretical FASTA database, and de novo like sequence identification with SEQUEST.

Results

To evaluate our approach we used spectra obtained from the first strong anion exchange fraction of the MCF7 cell line, 20100719_Velos1_TaGe_SA_MCF7_01.raw [38]. The mass spectra were acquired on an Orbitrap Velos, and the product ions were generated using higher energy collisional dissociation (HCD). As mentioned above, because of the computational complexity, we limited the range of peptides to those with masses less than 1200 Da. Only +2 charged peptides were considered. For each peptide we then created a separate FASTA database of all theoretical peptide sequences that fit a 2 mDa mass window around the peptide's mass. We then used these databases in SEQUEST searches to determine the best match to the corresponding tandem mass spectra. In total there were 1413 spectra in the data set.

An example of the results is the peptide sequence GAGTDDHTLIR from the human protein Annexin A5 (UniProt ID P08758). Its mass (the monoisotopic mass of the amino acid sequence plus the mass of a proton) is 1155.57528 Da. SEQUEST identifies this peptide with an XCorr value of 2.71. We used the peptide composition algorithm [26] to generate all amino acid compositions in the mass range [1155.574, 1155.576] Da. There were 802 unique compositions (after accounting for the Leu and Ile degeneracy). Using lexicographic ordering, we generated from the compositions a new peptide sequence database specifically for this peptide. The size of the database was about 9 Gb. It had more than 600 thousand candidate peptides for the spectrum. The best scoring peptide among the theoretical peptides was QGTDDHTLLR, with an XCorr of 2.75. No other theoretical peptide sequence scored higher than the true peptide sequence, GAGTDDHTLIR. We note that the two sequences differ only in the prefix: "Q" in the theoretical peptide versus "GA" in the true peptide. The annotated spectrum of the peptide is shown in Figure 2. Most of the y-ions of the peptide were observed in the tandem mass spectrum.
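The mass and the mass interval quoted for this peptide can be checked with a short calculation. The sketch below assumes standard monoisotopic residue masses, a water mass of 18.010565 Da, and a proton mass of 1.007276 Da; the function names are hypothetical.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565
PROTON = 1.007276


def mh_plus(sequence):
    """Monoisotopic mass of the unmodified peptide plus one proton."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


def mass_window(center, width=0.002):
    """Interval of the given total width (Da) centered on `center`."""
    return center - width / 2, center + width / 2


true_mass = mh_plus("GAGTDDHTLIR")
best_theoretical_mass = mh_plus("QGTDDHTLLR")
low, high = mass_window(1155.575)
print(round(true_mass, 5), round(best_theoretical_mass, 5))
print(low <= true_mass <= high, low <= best_theoretical_mass <= high)
```

Both sequences fall inside the same 2 mDa window, which is why the precursor mass alone cannot separate them and the fragment-ion evidence has to.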

Figure 2.


Annotated HCD spectrum of the peptide GAGTDDHTLIR. Blue indicates y-ions and red indicates b-ions. All y-ions except y10 were observed in the spectrum. The XCorr value of this peptide was 2.71. The search of all theoretically possible peptides using SEQUEST returns a slightly different sequence as the highest XCorr peptide, QGTDDHTLLR (XCorr = 2.75; Leu and Ile are not distinguished). This was the only theoretical peptide to score higher than the true peptide.

The peptide SGGGGGGGGSSWGGR of Heterogeneous nuclear ribonucleoprotein A0 (UniProt ID Q13151) was one of the higher mass peptides, with a mass of 1192.50899 Da. It had an XCorr value of 4.22. The [1192.508, 1192.510] Da mass interval was used to generate theoretical peptide compositions for this peptide. There were 983 unique compositions. After converting the compositions to sequences, the database size of the theoretical peptides exceeded 16 Gb. It had more than 1.2 million candidate sequences. The best scoring peptide among the theoretical peptides was the sequence GSGGGGGGGSSWNR, with an XCorr score of 4.2. SEQUEST correctly identified the true peptide among all theoretically possible peptides for this tandem mass spectrum. In this case as well, there is a long subsequence, GGGGGGGSSW, common to the true peptide and the best scoring theoretical peptide.

Table 1 summarizes the results for a sample of six spectra that were used in this study. The peptides that we chose did not, in general, have very high XCorr values. In spite of this, the true peptides were always among the highest scoring peptides in the large, unbiased databases comprising all theoretically possible peptides. This testifies to the high specificity of SEQUEST when combined with high mass accuracy for intact peptides. Among the small number of peptides in this table, the misassignments by SEQUEST included replacement of Ala and Gly by Gln, replacement of two Gly residues by Asn, and, in some cases, amino acid scrambling.
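The substitutions listed above are exactly isobaric because the residues involved have identical elemental compositions, so no precursor mass accuracy can separate them. A minimal check using the standard elemental formulas of the amino acid residues (amino acid minus one water) is sketched below; the helper names are hypothetical.

```python
from collections import Counter

# Elemental formulas of amino acid residues (amino acid minus H2O).
RESIDUE_FORMULA = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
}


def elemental_formula(residues):
    """Sum the elemental formulas of a string of residues."""
    total = Counter()
    for aa in residues:
        total.update(RESIDUE_FORMULA[aa])
    return total


def same_elemental_composition(first, second):
    """True when two residue strings contain exactly the same atoms."""
    return elemental_formula(first) == elemental_formula(second)


print(same_elemental_composition("GA", "Q"))  # Ala + Gly versus Gln
print(same_elemental_composition("GG", "N"))  # two Gly versus Asn
print(same_elemental_composition("L", "I"))   # Leu versus Ile
```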

Table 1.

Summary of the peptide sequences, their tandem MS scan numbers (from the raw file 20100719_Velos1_TaGe_SA_MCF7_01.raw [38]), and the corresponding XCorrs. Underlined are the common subsequences between the actual and theoretical peptide sequences. All precursors were +2 charged. All true peptides were among the three highest scoring peptides in their respective SEQUEST searches against the theoretical peptide databases.

Peptide            Scan     Mass (Da) (b)   XCorr (a)    Theoretical Peptide
GSGGGSSGGSIGGR     5202     1092.503        3.76/3.73    GSGGGSSGGSLNR
SGGGGGGGGSSWGGR    6946     1192.509        4.22/4.2     GSGGGGGGGSSWNR
GAGTDDHTLIR        10962    1155.575        2.71/2.75    QGTDDHTLLR
LGSLVENNER         13339    1130.580        2.23/2.31    LGSLVENGGER
IVQMTEAEVR         15962    1175.609        2.5/2.68     VLAGMTEEAVR
LTMQVSSLQR         18267    1162.624        2.5/2.4      TLAMGVSSGALR

(a) The XCorrs are for the true peptide (the first score) and the best scoring theoretical peptide (the second score).

(b) The mass shown is the peptide's monoisotopic mass plus the proton mass.

In Figure 3 we show the scatter plot of the XCorrs computed for the peptides identified from the UniProt and theoretical sequence databases for all of the spectra used in this study (1413 spectra). For 465 spectra (~33% of all spectra) the sequences identified from the theoretical and UniProt databases were identical (as mentioned above, we did not differentiate between Leu and Ile). In addition, 157 peptide sequences (11% of the total) in UniProt and the corresponding theoretical peptides had the same amino acid compositions. The complete list of all scan numbers, identified sequences, and their XCorrs is provided in the Supplementary Materials section. The XCorrs for theoretical peptides are always higher than or equal to the corresponding values for UniProt database peptides. For SEQUEST identifications, an important value has been ΔCn. This is the XCorr difference between the two highest ranked sequences, scaled by the XCorr of the highest ranked sequence. In Figure 4 we show the distribution of a similar value, which is the XCorr difference between the theoretical (XCorrTH) and UniProt (XCorrUni) database peptides, scaled by the XCorr of the theoretical peptide. The overall correlation between XCorrTH and XCorrUni was 0.82 (Pearson's correlation). Pearson's correlation coefficient between the adapted ΔCn and XCorrTH is very small, 0.06, as can be seen from Figure 4.
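The adapted ΔCn and the Pearson correlation described above can be computed as in the following sketch. The three score pairs are taken from Table 1 purely for illustration (rows where the theoretical XCorr is at least the true-peptide XCorr); they are not the full 1413-spectrum data set and will not reproduce the reported coefficients. The function names are hypothetical.

```python
from math import sqrt


def adapted_delta_cn(xcorr_th, xcorr_uni):
    """(XCorrTH - XCorrUni) / XCorrTH, the scaled score difference."""
    return (xcorr_th - xcorr_uni) / xcorr_th


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (XCorrTH, XCorrUni) pairs from three rows of Table 1, for illustration only.
pairs = [(2.75, 2.71), (2.31, 2.23), (2.68, 2.50)]
xcorr_th = [th for th, _ in pairs]
xcorr_uni = [uni for _, uni in pairs]
deltas = [adapted_delta_cn(th, uni) for th, uni in pairs]

print([round(d, 4) for d in deltas])
print(pearson(xcorr_th, xcorr_uni), pearson(deltas, xcorr_th))
```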

Figure 3.


The scatter plot of the XCorr values for the theoretical and UniProt sequences. There were 465 (from the total of 1413) spectra for which the theoretical and database sequences were identical.

Figure 4.


The distribution of the XCorr differences between the theoretical, XCorrTH, and UniProt, XCorrUni, sequences scaled by the theoretical sequence's XCorr.

We compared the results from the two sequencing strategies when a combined (forward and reversed) database is used to control the false discovery rate (FDR) in the database search. For this small dataset, 643 peptide spectrum matches (PSMs) passed the 1% FDR threshold. Of these, 202 PSMs had sequences identical to those obtained from the de novo like sequencing. A further 87 PSMs had identical amino acid compositions, thus differing only by amino acid scrambling from the corresponding sequences identified in our approach. Among the rest of the PSMs filtered at 1% FDR, there were 80 sequences that shared a subsequence at least three amino acids long between the two results. We note again that the dataset is very small, and while FDR filtering helps to control some erroneous matches, the distribution of XCorrs is not likely to represent the true sample distribution for this system. We also tested using ΔCn as a cut-off criterion (ΔCn > 0.1) in addition to FDR. The relative statistics of the PSMs identified in the de novo like sequencing and database searching did not change substantially (about 5%).
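The forward and reversed database filtering mentioned above can be illustrated with a minimal sketch. It assumes one common estimate, in which the FDR at a score threshold is the number of decoy (reversed) matches divided by the number of target (forward) matches at or above that threshold; the exact estimator and settings used in this study are not specified here. The function names and the toy PSM list are hypothetical.

```python
def reversed_decoy(sequence):
    """Decoy entry obtained by reversing a forward database sequence."""
    return sequence[::-1]


def fdr_at_threshold(psms, threshold):
    """Estimated FDR among PSMs scoring at or above `threshold`.

    `psms` is a list of (score, is_decoy) pairs. Assumption: FDR is
    estimated as decoys / targets.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_threshold_for_fdr(psms, max_fdr=0.01):
    """Lowest observed score whose estimated FDR does not exceed `max_fdr`."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best


# Toy PSM list: (XCorr, matched a reversed sequence?)
psms = [(4.0, False), (3.5, False), (3.0, False), (2.5, True), (2.0, False)]
print(reversed_decoy("GAGTDDHTLIR"))
print(fdr_at_threshold(psms, 2.0), score_threshold_for_fdr(psms, 0.01))
```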

Combined (forward and reversed) database searching is commonly used to control the false discovery rate in large-scale peptide identifications. As the peptide size increases, species-specific protein sequence databases normally contain fewer peptides of similar mass, particularly when precursor masses are determined on instruments with high resolution and mass accuracy. The current study accounted for all possible theoretical peptides, as it generated a comprehensive list of all peptides. We used a small mass window, 2 mDa, centered on the peptide mass to control the size of the theoretical databases. In most of the cases that we studied, there was a long common subsequence between the best theoretical match and the true peptide match. The long common subsequence is important because BLAST searches of the theoretical peptides will likely map to the correct proteins if the subsequences shared with the true peptides are long. The study shows that for relatively short peptides (<1200 Da) peptide mass accuracy is very important, and that it can lead to correct peptide identifications even if the protein sequence database is unbiased (non-species specific) and very large (includes all theoretically possible peptides).

We note that the current implementation of this approach requires large computational resources. It is possible to automate the approach and generate the theoretical sequence databases on the fly. However, the databases are still large, and the computation takes considerably longer than a regular database search.

Conclusions

We have implemented a workflow that allowed us to use SEQUEST scoring techniques for a de novo like peptide identification. For every spectrum search, we generated the sequences of all possible peptides, using the intact peptide mass with a mass accuracy of 1 mDa. For a given mass interval (centered on the intact peptide's mass) we first determined all possible compositions. From the compositions, we generated all theoretical sequences using a lexicographic ordering. SEQUEST was then used to search the theoretically created database against the experimental spectrum. We applied this approach to peptides with a mass of less than 1200 Da. We found that, when used with high mass accuracy for the intact peptide mass, SEQUEST was highly specific: 33% of the peptides identified in the theoretical sequence databases were the same as the corresponding original sequences in UniProt. In general, only a few theoretical sequences scored higher than the true peptide sequence in each case. In many cases there were long common subsequences between the theoretically identified sequences and the true peptides. The current results testify to the high specificity of SEQUEST.

Supplementary Material

13361_2015_1228_MOESM1_ESM

Acknowledgements

The author acknowledges support by the National Institute Of General Medical Sciences of the National Institutes of Health under Award Number R01GM112044.

Abbreviations

Da: Dalton
FDR: false discovery rate
FFT: fast Fourier transform
HCD: higher energy collisional dissociation
mDa: milliDalton
MS: mass spectrometry
PSM: peptide spectrum match
PTM: post-translational modification
Sp score: preliminary score
XCorr: cross-correlation score

Reference List

  • 1. Zhang Y, Fonslow BR, Shan B, Baek MC, Yates JR III. Protein analysis by shotgun/bottom-up proteomics. Chem Rev. 2013;113(4):2343–2394. doi: 10.1021/cr3003533.
  • 2. Walther TC, Mann M. Mass spectrometry-based proteomics in cell biology. J Cell Biol. 2010;190(4):491–500. doi: 10.1083/jcb.201004052.
  • 3. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom. 1994;5(11):976–989. doi: 10.1016/1044-0305(94)80016-2.
  • 4. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20(18):3551–3567. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
  • 5. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994;66(24):4390–4399. doi: 10.1021/ac00096a002.
  • 6. Clauser KR, Baker P, Burlingame AL. Role of accurate mass measurement (+/− 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal Chem. 1999;71(14):2871–2882. doi: 10.1021/ac9810516.
  • 7. Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, Maynard DM, Yang X, Shi W, Bryant SH. Open mass spectrometry search algorithm. J Proteome Res. 2004;3(5):958–964. doi: 10.1021/pr0499491.
  • 8. Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20(9):1466–1467. doi: 10.1093/bioinformatics/bth092.
  • 9. Tabb DL, Fernando CG, Chambers MC. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J Proteome Res. 2007;6(2):654–661. doi: 10.1021/pr0604054.
  • 10. Bern M, Kil YJ, Becker C. Byonic: advanced peptide and protein identification software. Curr Protoc Bioinformatics. 2012 doi: 10.1002/0471250953.bi1320s40. Chapter 13:Unit13.
  • 11. Tanner S, Pevzner PA, Bafna V. Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nat Protoc. 2006;1(1):67–72. doi: 10.1038/nprot.2006.10.
  • 12. Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem. 2005;77(14):4626–4639. doi: 10.1021/ac050102d.
  • 13. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, Mann M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011;10(4):1794–1805. doi: 10.1021/pr101065j.
  • 14. Meng F, Cargile BJ, Miller LM, Forbes AJ, Johnson JR, Kelleher NL. Informatics and multiplexing of intact protein identification in bacteria and the archaea. Nat Biotechnol. 2001;19(10):952–957. doi: 10.1038/nbt1001-952.
  • 15. Johnson JR, Meng F, Forbes AJ, Cargile BJ, Kelleher NL. Fourier-transform mass spectrometry for automated fragmentation and identification of 5-20 kDa proteins in mixtures. Electrophoresis. 2002;23(18):3217–3223. doi: 10.1002/1522-2683(200209)23:18<3217::AID-ELPS3217>3.0.CO;2-K.
  • 16. Sadygov RG, Zabrouskov V. Database Search of High Mass Resolution Data. Journal of Biomolecular Techniques. 2007;18(6):1.
  • 17. Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002;74(20):5383–5392. doi: 10.1021/ac025747h.
  • 18. Anderson DC, Li W, Payan DG, Noble WS. A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res. 2003;2(2):137–146. doi: 10.1021/pr0255654.
  • 19. Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007;4(11):923–925. doi: 10.1038/nmeth1113.
  • 20. Beausoleil SA, Villen J, Gerber SA, Rush J, Gygi SP. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat Biotechnol. 2006;24(10):1285–1292. doi: 10.1038/nbt1240.
  • 21. Taus T, Kocher T, Pichler P, Paschke C, Schmidt A, Henrich C, Mechtler K. Universal and confident phosphorylation site localization using phosphoRS. J Proteome Res. 2011;10(12):5354–5362. doi: 10.1021/pr200611n.
  • 22. Savitski MM, Lemeer S, Boesche M, Lang M, Mathieson T, Bantscheff M, Kuster B. Confident phosphorylation site localization using the Mascot Delta Score. Mol Cell Proteomics. 2011;10(2):M110. doi: 10.1074/mcp.M110.003830.
  • 23. Vandenbogaert M, Hourdel V, Jardin-Mathe O, Bigeard J, Bonhomme L, Legros V, Hirt H, Schwikowski B, Pflieger D. Automated phosphopeptide identification using multiple MS/MS fragmentation modes. J Proteome Res. 2012;11(12):5695–5703. doi: 10.1021/pr300507j.
  • 24. Eng JK, Searle BC, Clauser KR, Tabb DL. A face in the crowd: recognizing peptides through database search. Mol Cell Proteomics. 2011;10(11):R111. doi: 10.1074/mcp.R111.009522.
  • 25. Ma B, Johnson R. De novo sequencing and homology searching. Mol Cell Proteomics. 2012;11(2):O111. doi: 10.1074/mcp.O111.014902.
  • 26. Nefedov AV, Mitra I, Brasier AR, Sadygov RG. Examining troughs in the mass distribution of all theoretically possible tryptic peptides. J Proteome Res. 2011;10(9):4150–4157. doi: 10.1021/pr2003177.
  • 27. Nefedov AV, Sadygov RG. A parallel method for enumerating amino acid compositions and masses of all theoretical peptides. BMC Bioinformatics. 2011;12(1):432. doi: 10.1186/1471-2105-12-432.
  • 28. Zhang J, Xin L, Shan B, Chen W, Xie M, Yuen D, Zhang W, Zhang Z, Lajoie GA, Ma B. PEAKS DB: De Novo sequencing assisted database search for sensitive and accurate peptide identification. Mol Cell Proteomics. 2011 doi: 10.1074/mcp.M111.010587.
  • 29. Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A, Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun Mass Spectrom. 2003;17(20):2337–2342. doi: 10.1002/rcm.1196.
  • 30. Kim S, Gupta N, Bandeira N, Pevzner PA. Spectral dictionaries: Integrating de novo peptide sequencing with database search of tandem mass spectra. Mol Cell Proteomics. 2008 doi: 10.1074/mcp.M800103-MCP200.
  • 31. Frank AM, Savitski MM, Nielsen ML, Zubarev RA, Pevzner PA. De novo peptide sequencing and identification with precision mass spectrometry. J Proteome Res. 2007;6(1):114–123. doi: 10.1021/pr060271u.
  • 32. Johnson RS, Taylor JA. Searching sequence databases via de novo peptide sequencing by tandem mass spectrometry. Mol Biotechnol. 2002;22(3):301–315. doi: 10.1385/MB:22:3:301.
  • 33. Chi H, Sun RX, Yang B, Song CQ, Wang LH, Liu C, Fu Y, Yuan ZF, Wang HP, He SM, et al. pNovo: de novo peptide sequencing and identification using HCD spectra. J Proteome Res. 2010;9(5):2713–2724. doi: 10.1021/pr100182k.
  • 34. Tabb DL, Saraf A, Yates JR III. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Anal Chem. 2003;75(23):6415–6421. doi: 10.1021/ac0347462.
  • 35. Yan Y, Kusalik AJ, Wu FX. NovoHCD: de novo peptide sequencing from HCD spectra. IEEE Trans Nanobioscience. 2014;13(2):65–72. doi: 10.1109/TNB.2014.2316424.
  • 36. Moore RE, Young MK, Lee TD. Qscore: an algorithm for evaluating SEQUEST database search results. J Am Soc Mass Spectrom. 2002;13(4):378–386. doi: 10.1016/S1044-0305(02)00352-5.
  • 37. Elias JE, Gygi SP. Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol. 2010;604:55–71. doi: 10.1007/978-1-60761-444-9_5.
  • 38. Geiger T, Wehner A, Schaab C, Cox J, Mann M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics. 2012;11(3):M111. doi: 10.1074/mcp.M111.014050.
  • 39. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005;33(Database issue):D154–D159. doi: 10.1093/nar/gki070.
  • 40. Mitra I, Nefedov AV, Brasier AR, Sadygov RG. Improved mass defect model for theoretical tryptic peptides. Anal Chem. 2012;84(6):3026–3032. doi: 10.1021/ac203255e.
  • 41. Sadygov RG. Use of singular value decomposition analysis to differentiate phosphorylated precursors in strong cation exchange fractions. Electrophoresis. 2014 doi: 10.1002/elps.201400053.
