Method for Independent Estimation of the False Localization Rate for Phosphoproteomics

Kerry A Ramsbottom; Ananth Prakash; Yasset Perez Riverol; Oscar Martin Camacho; Maria-Jesus Martin; Juan Antonio Vizcaíno; Eric W Deutsch; Andrew R Jones

doi:10.1021/acs.jproteome.1c00827

. 2022 May 31;21(7):1603–1615. doi: 10.1021/acs.jproteome.1c00827

Method for Independent Estimation of the False Localization Rate for Phosphoproteomics

Kerry A Ramsbottom ^†, Ananth Prakash ^‡, Yasset Perez Riverol ^‡, Oscar Martin Camacho ^†, Maria-Jesus Martin ^‡, Juan Antonio Vizcaíno ^‡, Eric W Deutsch ^§, Andrew R Jones ^†,^*

PMCID: PMC9251759 PMID: 35640880

Abstract

graphic file with name pr1c00827_0007.jpg

Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.

Keywords: phosphoproteomics, false localization rate, database searching, software, statistics

Introduction

There is great research interest in studying post-translational modifications (PTMs) to proteins, due to their importance in cell signaling, as a rapid mode of proteins changing their function, and their implication in almost all known disease processes. The most widely studied reversible modifications include phosphorylation (by far the most studied one and our primary focus here), acetylation, methylation, and attachment of small proteins, such as ubiquitin and SUMO.

High-throughput tandem mass spectrometry (MS) is commonly used for detection and localization of phosphorylation sites on proteins, using so-called phosphoproteomic methods. Typically in these methods, proteins are first extracted from samples and digested with an enzyme such as trypsin, and then, phosphorylated peptides are enriched in the sample, for example, using TiO₂ or other metal ion attached to a column (affinity chromatography), to which phosphate binds preferentially. The bound peptides are then eluted and analyzed by liquid chromatography–MS.¹ In the common analysis mode used in phosphoproteomics, data dependent acquisition is performed to fragment the most abundant peptides observed. The MS² fragmentation spectra (plus the mass/charge of the intact precursor) are then used to identify peptide sequences, e.g., using sequence database search software. In this approach, the spectra are searched against a theoretical digest of the proteome (i.e. peptide sequence database) for the given species, taking account for the variable modifications selected. For phosphorylation, most users search for phosphorylation on the canonical Ser, Thr, and Tyr (STY) residues, where the vast majority of detectable phosphorylation resides in eukaryotic systems. The search engine then considers every STY residue with and without the addition of the phosphate mass (+79.97 Da), greatly increasing the size of the search space, with a corresponding reduction in statistical power for peptide identification. Confident peptide identification is governed by the quality of the match between the observed spectrum and the theoretical spectrum expected for a peptide from the sequence database, from which local statistics such as p-values or e-values are usually calculated as well as sometimes a posterior error probability (PEP). If a PEP value is calculated, 1-PEP gives the probability that a given peptide-spectrum match (PSM) is correct. There are of course many different proteomics search engines, including commercial and free and/or open source; for a review, see Verheggen et al.²

An important consideration for phosphoproteomics is the confidence that a given site within a protein has been correctly identified as being phosphorylated. Ambiguity in this regard may occur when a confident PSM has more STY residues than n, where n is the number of phosphorylation modification instances detected, that is, intact peptidoform mass = peptide sequence mass + (n * 79.97 Da). In this case, the search engine itself or a downstream analysis package calculates statistics related to each of the n phosphosites within a peptide, such as a PEP that the site has been incorrectly localized (sometimes also called local false localization rate (FLR); local FLR or other ad hoc score. As for PSMs, if an accurate PEP can be estimated, then 1-PEP_site gives the probability that the site has been correctly localized, in this case, assuming already that the PSM is definitively correct. Correct site localization can be critically important for downstream uses of data. As one example, there are completely different kinases and phosphatases involved in Ser/Thr versus Tyr phosphorylation, and thus, biological conclusions as to the up and downstream signaling pathways would be completely different. Even where ambiguity relates to different, say, Ser residues in a peptide, nearby amino acid motifs allow inference of the kinase responsible for phosphorylating the site, and thus, incorrect site determination could lead researchers to making incorrect assumptions and conclusions.

Many of the published site localization algorithms were benchmarked by the originating authors, and scores were calibrated based on synthetic data sets, with a known “ground truth”, that is, where the sites of phosphorylation were known.³ There have also been some independent efforts to benchmark different site localization tools, showing that the choice of tool does alter the global statistics,^4,5 that is, sensitivity—how many sites in a whole data set can be correctly localized at a suitable overall (global) FLR. While tools continue to improve and become more widely used for ensuring confident site localization, there remain several unsolved challenges for the field, as follows. First, it is unclear whether findings on synthetic data sets can be extrapolated to genuine biological data sets that have generally a higher level of complexity, and synthetic data sets may have amino acid frequencies near to modification sites, which do not well reflect natural samples. Second, for analysis of real experimental data sets, there are no commonly used methods for independent estimation of global FLR, for example, that allow a researcher to ask the question—how many sites have we confidently identified and how many are likely false positives in the whole data set. For regular peptide/protein identification in proteomics, decoy database search methods are now almost ubiquitously used for estimating peptide/protein false discovery rate (FDR), since they give a search engine-independent statistic that is easy to understand. There is no generally accepted method for calculating the same type of statistic for PTM site identification. Third, our groups are interested in very large-scale re-analysis of public PTM enriched data sets, via a project called PTMeXchange. We wish to have methods that allow for accurate calculation of the probability that a given PTM site has been observed in a meta-analysis of data sets, where there could be potentially multiple PSMs from different studies supporting a given site. To our knowledge, there are no suitable statistical models for combining different evidence streams.

In this work, we explore the concept of using decoy amino acid(s) for estimation of site localization statistics (i.e. global FLR), in this context, defined as one that we know cannot be modified, to model the distribution of false localizations detected from a processing pipeline, resulting from both incorrect peptide identifications and incorrect localization of sites on correct peptide identifications. We test a range of different amino acids for their suitability as a decoy in synthetic and real data sets as well as demonstrate the results obtained from several common proteome informatics pipelines. The concept of using a decoy amino acid for localization of PTMs is not a new one. It has been used before in several previous publications and approaches.⁶ However, to our knowledge, no publication has yet validated the statistics associated with the use of decoy amino acids, particularly on multiple tools, and the method of using a decoy amino acid has not gained widespread use in the field. The majority of PTM-based studies still relies on using ad hoc score thresholds for determining whether PTMs have been correctly identified or not. From the results we present, we make some recommendations as to how we believe large-scale PTM-enriched studies should be analyzed to control the local and global FLR. While we have focused on computational analysis and pipelines for phosphorylation, general approaches and conclusions are largely applicable to other types of PTM readily detected by MS. The code used for the analysis is in GitHub: https://github.com/PGB-LIV/PhosphoFLR.

Methods

Our overall goal is to demonstrate methods for controlling and understanding FLR, rather than benchmarking tools per se, although we wished to demonstrate the reproducibility of methods in different pipelines. As such, we tested four commonly used analysis pipelines: trans-proteomic pipeline (TPP)⁷ including Comet search⁸ and PTMProphet site localization;⁹ MaxQuant including PTMScore;¹⁰ ProteomeDiscover, including Mascot search and ptmRS localization;¹¹ and PEAKS DB search with A-Score.¹² We tested the effects on global FLR of selecting the localization on different amino acids as “decoy” and profiled the frequency of potential decoy amino acids relative to assumed correct STY phosphorylation sites, to see which provides the decoy distribution best matching the target distribution, that is, other STY sites to which the site could be wrongly localized. The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE¹³ partner repository with the dataset identifier PXD028840 and DOI 10.6019/PXD028840.

Four data sets were used for evaluation of methods for estimating global FLR—two synthetic data sets, one model plant phosphoproteomics data set (from Arabidopsis thaliana), and one human phosphoproteomics data set. The raw files of the four search data sets were obtained from the ProteomeXchange Consortium¹⁴ via the PRIDE repository.¹⁵ These included ten files from the PXD007058⁵ synthetic data set (files named “HCDOT” pools 1–5, reps 1 and 2), 10 files from the PXD000138 synthetic data set,¹⁶ twelve files from the PXD008355¹⁷Arabidopsis data set (rapamycin treated), and six from the PXD000612¹⁸ human data set (files randomly selected). The PXD007058 and PXD000138 data sets contain synthetic phosphopeptide libraries. The use of synthetic phosphopeptides allowed us to define FLR (through one method) by comparing the results from our search pipelines to the known phosphopeptide sequences to determine if our analyses correctly localize the phosphosites. The PXD008355 A. thaliana data set and the PXD000612 human data set are both biological data sets with unknown phosphosites.

Databases were created for the searches of each data set. The PXD007058 search database consisted of the synthetic peptides;⁵ the PXD000138 search used IPI human sequences and phosphopeptide libraries matched to the original study;¹⁶ the PXD008355 Arabidopsis search database contained Araport11¹⁹ sequences, and the PXD000612 human search database was created from the Level 1 PeptideAtlas Tiered Human Integrated Search Proteome,²⁰ containing core isoforms from neXtProt.²¹ Each search database also contained the cRAP contaminant sequences (https://www.thegpm.org/crap/, last accessed October 2021). Decoys across all four databases were generated for each entry using the Brujin method (with k = 2).²²

Using the TPP,⁷ the data set files were first searched using Comet.⁸ The resulting files were then combined and processed using PeptideProphet,²³ iProphet,²⁴ and PTMProphet.⁹ For PXD007058, in addition to searching for phosphorylation, we also searched for pyrophosphorylation modifications, which pilot searches had determined that were unintentionally present on some synthetic peptides. The key Comet search parameters for each dataset are shown in Table 1; the key parameters used for the other pipelines are shown in Supporting Information Table S1.

Table 1. Comet Search Parameters for Each Data Set.

	PXD007058 (synthetic data set)	PXD000138 (synthetic data set)	PXD008355 (Arabidopsis data set)	PXD000612 (human data set)
peptide mass tolerance	20.0 ppm	5 ppm	7 ppm	7 ppm
fragment bin tolerance	0.02 Da	0.02 Da	0.02 Da	0.02 Da
digest mode	tryptic	tryptic	tryptic	tryptic
max missed cleavages	4	4	2	2
fixed mods	carbamidomethylation (C)	carbamidomethylation (C)	carbamidomethylation (C)	carbamidomethylation (C)
variable mods	oxidation (MWP)	oxidation (M)	oxidation (M)	oxidation (M)
	phospho (STYX^a)	phospho (STYX^a)	phospho (STYX^a)	phospho (STYX^a)
	pyrophospho (STY)^b, N-terminal acetylation	N-terminal acetylation	N-terminal acetylation	N-terminal acetylation
	ammonia loss (QC)	ammonia loss (QC)	ammonia loss (QC)	ammonia loss (QC)
	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)
	deamination (NQ)	deamination (NQ)	deamination (NQ)	deamination (NQ)
max variable PTMs	5	5	5	5

Open in a new tab

X corresponds to the different decoy amino acid searched: Ala, Gly, Leu, Asp, Glu, or Pro.

Preliminary analysis of the data set detected that several peptides had been manufactured with pyrophosphate modification rather than the intended phosphate, which can cause apparent errors when comparing the results to the answer key if they are not accounted for.

Downstream Data Analysis

The data from searching with TPP were downstream processed by custom Python scripts (https://github.com/PGB-LIV/PhosphoFLR). First, the global FDR was calculated from the decoy counts, and the PSMs were filtered for 1% PSM FDR. From these filtered PSMs, a site-based file was generated giving separate localization scores for each phosphosite found on each PSM, removing decoy and contaminant hits. These site-based PSMs were ordered by a combined probability, calculated by multiplying the PSM probability by the localization probability. In the processing pipeline from TPP, iProphet calculates a probability that a given PSM is correct, and PTMProphet calculates a probability for the site assignment. We demonstrated that there is almost no meaningful correlation (r² = ∼0.01) between these probabilities (Supporting Information Figure S1), and thus, we conclude that these probabilities are sufficiently independent that they can be multiplied to arrive at a final probability that a given site’s identification is supported by the given spectrum.

For the PEAKS search, the PSM score and A-scores for targets and decoys were modeled based on the counts of targets and decoys per histogram score bin, to generate similar probability estimates (code provided in the GitHub repository). For MaxQuant and Mascot searches, the PSM probability values were calculated as 1-PEP values (reported by the pipeline natively) with the PTM probabilities being calculated innately through PTM-score/ptmRS probabilities, respectively. For the synthetic peptide search, these site-based results were then filtered further to allow comparison with the synthetic peptide known localization key. Partial peptides and PSMs with the incorrect phosphorylation count compared to what was expected from the answer key, or other additional modifications, were removed. These remaining PSMs were then compared against the synthetic peptide answer key to determine if the phosphosites had been correctly identified. For analysis of the synthetic data only, results were ordered by the corresponding site localization probability rather than the combined probability; since due to the small size of the search database, not all pipelines could produce accurate estimates of PSM probability.

The sites reported for each analysis method were ordered by combined probability, and global FLR was estimated for every ranked site, from which we can then later apply a threshold at the lowest scoring site that delivers a desired global FLR (e.g. 1, 5, or 10%), similar to the q-value approach for standard database searching. The global FLRs for all of the data sets were estimated using two methods—model FLR and decoy FLR method. For the synthetic data sets, a third method was also used, called answer key method (comparing against the known phosphosite answer key)

Model FLR Method

First, we estimated the global FLR using the combined probabilities (global model FLR). For TPP, we use the 1—final probability, to give the local FLR (PEP) for each given site scored in a ranked list. The “Global model FLR” is calculated as a running sum of the local FLR divided by the count of rows (eq 1), that is, the estimated frequency of false localizations at each row in a ranked list, divided by the total number of reported observations.

Global Model FLR Equation

Where n is the count of observations, P^PSM is the local FLR (PEP) for a given PSM identification and P^PTM is the local FLR for a given site localization.

Decoy FLR Method

We used the identification of phosphorylated decoy amino acids (e.g. worked example follows for alanine, pA), as these are known to be false localizations and can therefore be used to estimate the FLR. The counts of the phosphorylated decoy amino acids were first normalized to allow comparison with true hits; by modeling the random frequency, one would expect incorrect sites to be assigned to target STY residues. The target/decoy ratio (T_c/X_c) was determined by dividing the total count of STY residues (T_c) by the total count of the decoy amino acid residues (X_c), within the set of PSMs with a scored phosphosite (eq 2).

At each position in the ranked list, we have a count of decoy amino acids observed ∑₁ⁿ(pX_c). For every false localization of the decoy amino acid, it would be expected that there is a count of “silent” false positives within the target list, proportional to the ratio of STY amino acids divided by the total count of decoy amino acids, that is, (T_c/X_c) ∑₁(pX_c), to model the expected frequency of random wrong hits. The expected count of false positives within the targets is then multiplied by 2 (to model the normalized frequency of random wrong assignments among both the decoy amino acid and the target amino acid) to arrive at a normalized false localization count. This is a relatively conservative method to calculate global FLR, but without the correction to multiply by two, the use of a less frequent decoy amino acid would be insufficiently corrected for.

This normalized false localization count is then divided by the total count of observations (n), at a given row in the ranked list, to obtain the Model (global) FLR estimate (pX_FLR_n in eq 2).

Phosphorylated Decoy Amino Acid FLR Equation

Where T_c is the total target (STY) count, X_c is the total decoy amino acid count, pX_c is the count of phosphorylated decoy amino acid, and n is the count of observations at a given position in the ranked list.

Answer Key FLR

For the synthetic data sets, we used the synthetic peptide false localizations in a similar way; the false localization count (i.e. result not matching the answer key) was divided by the total count of sites to calculate the FLR (eq 3).

Synthetic Answer Key FLR

Where n is the count of observations, and false localization count F_c is the count of sites not matching the answer key in a given position in the ranked list.

Collapsing Observations of a Site across Multiple PSMs

When summarizing study results, it is desirable to “collapse” results where there are multiple PSMs supporting the same modification site down to a single row. There is no well-agreed method for collapsing data, although common practice when using collapsing multiple PSMs into individual reports for peptides is to use the maximum peptide score for ordering results and disregarding the count of PSMs. The rationale for this simplification is that multiple PSMs reporting for the same peptide are not independent statistical tests and thus the same wrong answer can appear in multiple PSMs. As such, a simple method for ranking final “collapsed” results for sites is simply to take the maximum final probability. However, our own profiling of data sets suggests that this method is sub-optimal. Many of the high scoring decoy hits are supported by only a single PSM, and so a collapse method that weights sites supported by a higher number of PSMs is more likely to be true than the one supported by a single PSM (Supporting Information Figure S2). For this study, we use a relative ad hoc method for collapsing multiple observations that attempts to balance maximum final probability and spectral counts. We took the maximum probability for a given site, derived from multiple PSMs and binned into final probability values for 2 decimal places. We ranked via binned final probability and then ranked within bins via the count of PSMs.

Profiling Distance Distributions from Real Identifications to Decoy Amino Acids

In order to compare between the decoy amino acids investigated, the distribution of amino acids around phosphorylation sites were compared. The phosphorylation sites obtained searching each database for phospho (STY) using TPP was first filtered for 5% model FLR. The minimum distance between an assumed correctly localized phosphorylated STY and the nearest candidate amino acid was compared, alongside the minimum distance for the nearest STY. Histograms were generated with the normalized frequencies of these distributions in order to compare between the selected decoy amino acids and STY.

Profiling Site Probabilities for Proximal Amino Acids

When analyzing the results for different decoy amino acids, we observed particular differences in the global FLR estimates for certain decoys (particularly pAla vs pGly) that could not be explained by distributions of amino acids in relation to confident target sites (above). We further explored these effects by calculating the average final probabilities for assumed correct sites with different amino acids in the −1 and +1 position relative to the site. The assumed correct sites were estimated as sites with combined probability ≥0.68. This threshold was calculated from the average minimum combined probability using a 5% FLR cut off for each of the decoy FLR estimations across all searches. These average probability distributions were calculated for the Arabidopsis and human data sets, from results of the TPP search with no decoy amino acid (pSTY), pAla decoy (pASTY), pLeu decoy (pLSTY), and pGly decoy (pGSTY).

Results

Analysis of Synthetic Data Set PXD007058

The analysis setup aimed to determine whether global FLR within genuinely modifiable residues (target amino acids) could be estimated reliably by including in the search a “decoy” amino acid, that is, not modified. We tested localization on six different amino acids to act as a decoy in parallel searches: glycine, leucine, alanine, glutamate, aspartate, and proline, to determine what effect the selection of a particular decoy had on the results obtained. The set of potential decoy amino acids was selected based on the following rationale: (i) glycine, leucine, alanine—no evidence that they can be phosphorylated in any known biological system; all are relatively frequent amino acids in most biological systems; (ii) glutamate—frequently phosphorylated²⁵ and not typically detectable as phosphorylated in most standard enrichment MS experiments and thus could be a plausible choice as a decoy; (iii) aspartate and proline were chosen as expected to be deliberately poor choices of decoy amino acids since there are known SP and SD phosphorylation motifs, which could bias estimates of global FLR. We expect that a statistically reliable choice of amino acid should have a similar distance distribution from a phosphorylation site (STY) to another truly possible phosphorylation site (STY), under the theory that incorrect localizations are more likely to amino acids nearby in the sequence.

We first searched the PX007058 synthetic dataset, which allowed us to test the three different methods of FLR estimation against a known answer. The data set was searched using TPP and the global FLR of these was calculated using the decoy phosphorylated amino acid method for six different choices of decoy amino acid, that is, in six parallel searches (Figure 1a). As described in the Methods, the global “Decoy” FLR is estimated based on the counts of hits to the decoy amino acids in the ranked list of results, adjusted for the ratio of the counts of the decoy amino acid to the target amino acid in the modified peptides that have been considered. We also show the global FLR calculation for all three methods in Figure 1b–g, split by decoy amino acid choice; that is, (i) answer key—identifying false localization by comparing to the known phosphosites (in this synthetic data set, where the truly modified site is known), (ii) decoy amino acid method, and (iii) model FLR that is based on summing local FLR calculated by the analysis software intrinsically (see Methods). The first observation we make is that the choice of decoy amino acid can have a substantial effect on the sensitivity (counts of assumed true sites), for a given estimated global FLR threshold (Table 2 and Figure 1). At 5% FLR, the lowest sensitivity is achieved with a Gly decoy (749 sites) versus the highest sensitivity with a Glu decoy (952 sites).

(a) Comparison of FLR estimation searching PXD007058 (synthetic data set) using localization on different decoy amino acids: pAla, pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic, 1 %FDR); (b–g) comparison of FLR estimation methods searching PXD007058 for each of the different decoy amino acids (TPP, fully tryptic, 1% FDR). X-axis = count of sites, y-axis is global FLR estimated as q-values.

Table 2. Counts at pX FLR (Calculated by the Decoy Method) for 1, 5, and 10% Using Each FLR Method, Searching PXD007058 (Synthetic Data Set) (TPP, Fully Tryptic, 1% FDR).

	count at 1% FLR			count at 5% FLR			count at 10% FLR
	answer key	model	decoy	answer key	model	decoy	answer key	model	decoy
pAla	702	226	799	835	700	866	935	836	921
pGly	571	203	515	778	645	749	866	791	787
pLeu	632	229	665	843	710	823	938	844	914
pAsp	535	258	207	834	721	784	929	849	805
pGlu	722	232	841	858	734	952	975	866	n/a
pPro	636	206	621	817	692	768	883	832	823

Open in a new tab

It can be seen that for pAla and pLeu decoys (Figure 1b,d), there is a close agreement between the two “empirical” methods of estimating global FLR, that is, answer key and decoy, with the model giving more conservative estimates of FLR. The pGly and pPro methods (Figure 1c,g) have good agreement between the answer key and decoy methods up to ∼700 counts, and then, the decoy method gives more conservative estimates (rising steeply), compared to the answer key method (and model). The pAsp decoy method agrees well with the model FLR but is more conservative than the answer key (Figure 1e). The pGlu decoy method is the least conservative, apparently underestimating global FLR compared to the answer key (Figure 1f). Overall, the matching of pAla and pLeu decoy FLR estimation to the answer key FLR gives some supporting evidence toward pAla and pLeu being appropriate choices for decoy amino acids. The model FLR method is shown to be more conservative that the other “empirical” FLR methods in most cases, especially in the most important regions of distribution, that is, up to 5% global FLR for example.

The estimates from Figure 1b–g and Table 2 demonstrate greater stability in the model FLR and answer key FLR across different decoy amino acids; that is, there is less variation in sensitivity at a given estimated FLR. This is to be expected since comparing the six different searches, many of the errors in localization are due to incorrect localization to a target amino acid (which largely behave the same across the six searches).

On the same synthetic data set, PXD007058, we also compared the estimation methods across four different pipelines: TPP, PEAKS, MaxQuant, and Mascot (Supporting Information Figure S3 and Table S2), searched for pSTY pASTY and pLSTY (for decoy comparison). The initial set of results from our analysis pipeline are the redundant identification of phosphorylation sites; that is, if multiple PSMs support the same site, these appear as multiple rows (not collapsed). In general, as noted in the Methods, our preference is to order these results by the final probability that a site has been observed (PSM–probability X site localization probability). This synthetic data set has a small database size and an overall small count of identifications, which makes it difficult to model PSM probability accurately. As such, for the synthetic data set only, we ordered results by site localization probability, having first accepted only PSMs with FDR <1%.

FLR was calculated using the synthetic answer key and the decoy amino acid hits. For all four pipelines tested, both the pAla and pLeu decoy methods agree well with the results from the answer key FLR method across all three pipelines, demonstrating that our method with these amino acids gives reliable FLR estimates in a software-independent manner. There are differences in the total number of sites identified at a fixed FLR threshold, depending on the pipeline applied. For this data set, TPP gives the highest sensitivity, followed by PEAKS, Mascot/ptmRS, and MaxQuant. However, our primary goal in this manuscript is not extensively to benchmark different pipelines, as there are choices of algorithm parameterization, which need to be optimized and could affect conclusions, and thus, we do not make any general conclusions about software performance for PTM analysis here.

The synthetic data set PXD000138 was also used to compare the FLR calculations using localization on different decoy amino acids (Supporting Information Figure S5 and Table S3). These results also showed a close agreement between the “empirical” methods of global FLR estimation; that is, the answer key and decoy estimation methods generally agreed with each other across the different decoy amino acids, although as observed for PXD007058, the choice of decoy amino acid affects the sensitivity (count of sites at a fixed FLR). The model estimation gives more conservative estimates of FLR across all decoy searches. In order to further compare between the decoy amino acids investigated, the distribution of amino acids around phosphorylation sites was compared for each of the synthetic sets (Supporting Information Figures S4 and S6). It would be assumed that if a phosphosite is wrongly localized, it would usually be to the nearest other STY residue than the correct site. We therefore assumed that a statistically reliable decoy amino acid will follow a similar (normalized) frequency distribution to the closest STY residue from correct hits. For the two synthetic data sets, all decoy amino acids had substantial differences between their distributions in relation to the assumed correct phosphosites, and none, in particular, matched well the distribution of the nearest STY residues. Given the artificial nature of synthetic data sets, we do not expect these distributions to reflect the reality of biological data sets, and thus, we next explored the behavior of the method on real data sets.

Biological Data Set Analysis

Data Set PXD008355

To investigate the effect of using localization on different decoy amino acids on different data sets, we compared the FLR estimations across the six different amino acids using two experimental data sets from A. thaliana and human. Figure 2 shows the decoy FLR comparisons searching the PXD008355 Arabidopsis data set with TPP. Here, we can see a similar trend as previously seen in the synthetic data set PXD007058. The FLR estimations with pGly and pPro give the most conservative performance at higher FLR values; that is, a steep rise in global FLR (Figure 2a) giving the lowest sensitivity at 10% global FLR, although there is a more complex picture at 1 and 5% FLR values (Table 4 and Figure 2b). We assume that many studies will aim to threshold at 5% global FLR; here, we observe lowest counts (of sites at 5% FLR) for the pAsp decoy and intermediate counts for pGly, pGlu, and pPro methods, and the highest counts of sites (sensitivity) for pAla and pLeu decoys. One of the challenges with accurate FLR estimation is that there can be some high-scoring incorrect localizations, and their position in the ranked list can have significant implications on the count of sites at 1% FLR (Figure 2b and Table 3). We thus would not recommend general thresholding at 1% global FLR but instead we recommend applying a 5% FLR where the global FLR estimates are likely to be robust. There is additional discussion of these high-scoring false hits in the Supporting Information (i), Figure S7 and Table S4. We also calculated the model FLR for each decoy option, demonstrating good agreement at 5% FLR between the two methods (decoy FLR vs model FLR) for pAla, pLeu, and pGly decoy options but less good agreement for other decoys (Supporting Information Figure S8).

Comparison of FLR estimation searching PXD008355 (*Arabidopsis* data set) using different decoy amino acids: pAla, pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic, 1 %FDR). (a) all PSMs; (b) zoom 5% FLR all PSMs; (c) collapsed by modified peptide, sorting by combined probability and count of supporting PSM; (d) collapsed by modified peptide, sorting by combined probability and count of supporting PSMs, zoom at 5% FLR. X-axis = count of sites; y-axis is global FLR estimated as q-values.

Table 4. Counts at pX FLR for 1, 5, and 10% Using Each Decoy Method, Searching PXD000612 (Human Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsed to Unique Sites.

	count at 1% FLR		count at 5% FLR		count at 10% FLR
	not collapsed	collapsed	not collapsed	collapsed	no collapsed	collapsed
pAla	62,050	5246	95,924	9504	103,609	11,491
pGly	58,367	5557	86,752	9267	94,536	11,193
pLeu	3822	6705	98,875	10172	106,453	12,030
pAsp	2650	809	71,968	8458	86,451	10,052
pGlu	22,563	608	78,708	7819	94,337	10,092
pPro	47,942	3256	79,821	7384	90,430	9482

Open in a new tab

Table 3. Counts at pX FLR for 1, 5, and 10% Using Each Decoy Method, Searching PXD008355 (Arabidopsis Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsing Multiple PSMs to One Row per Modified Peptide, Sorting by Combined Probability and Count of the Supporting PSMs.

	count at 1% FLR		count at 5% FLR		count at 10% FLR
	not collapsed	collapsed	not collapsed	collapsed	not collapsed	collapsed
pAla	23,104	1570	40,541	4704	44,556	5815
pGly	18,872	1469	35,939	3654	38,885	4990
pLeu	17,943	1017	42,157	4964	45,875	6068
pAsp	13,490	234	32,595	4151	39,262	5130
pGlu	21,923	2766	34,949	4385	40,532	5607
pPro	23,696	2771	35,170	4483	38,226	5221

Open in a new tab

We next explored data after collapsing multiple scores from different PSMs supporting the same site, taking the maximum probability for a given site for ranking results, along with the greatest number of supporting PSMs (see Methods). The results following this collapse step are shown in Figure 2c,d, demonstrating relatively similar trends, with considerable differences in sensitivity at 5 and 10% global FLR, with pAla and pLeu giving highest sensitivity at a 5 and 10% FLR. The statistical assumptions for the model FLR do not hold after collapse, so this method was not used.

To further investigate the selection of decoy amino acid candidates, the minimum distance between an assumed correctly localized phosphorylated STY (<5% global FLR filtered) and the nearest candidate amino acid were compared, alongside the minimum distance for the nearest STY (Figure 3). The rationale for this comparison is that in a regular search, not employing a decoy, if a phosphosite is wrongly localized, it will usually be to the nearest other STY residue than the correct site. We assume that a statistically reliable decoy amino acid will follow a similar (normalized) frequency distribution to the closest STY residue from correct hits. When comparing these distances in the Arabidopsis data set, it can be seen that Ala, Leu, and Gly follow somewhat similar frequency distributions to proximal STY, particularly, in the + positions (i.e. toward the C-terminus of the protein). Asp, Glu, and Pro are all enriched at the +1 position relative to STY, which likely partially explains the higher FLR estimates observed for the same site counts in Table 3 and Figure 2; that is, the pipeline wrongly assigns sites to Asp, Glu, and Pro more frequently than it would be other target sites (STY). We also observed in Table 3 that using a glycine decoy gave relatively low sensitivity at 5% FLR and steeply increasing FLR at higher site counts. The results for Gly in Figure 3 are thus an outlier with respective to Figure 2 as well as for the synthetic data set PXD007058 in Figure 1, in which we observed the lowest site count at 5% FLR for estimates using a Gly decoy. Our starting expectation was that Ala, Leu, and Gly would all make reliable choices as decoy amino acids, and thus, we also conducted an analysis of amino acid frequencies to attempt to explain the differences seen for Gly; results are shown in Supporting Information (ii) and Table S5. Gly and Ala have similar frequencies of observations in phosphopeptides, so this also does not explain the disparity. We further explore this phenomenon below.

Comparison of minimum distance between phosphorylated STY and the nearest target amino acid (Ala, Leu, Gly, Asp, Glu, and Pro), compared to STY distribution, searching PXD008355 (*Arabidopsis* data set).

Similar to the PXD007058 synthetic data set, a comparison between FLR estimates was made across the different pipelines: TPP, PEAKS, MaxQuant, and Mascot/ptmRS. The PXD008355 Arabidopsis data set was searched with an Ala decoy as well as a Leu and Gly decoy, and FLR estimations were calculated in the same way as before (Supporting Information Figure S9 and Table S6). In general, highest sensitivity is achieved by TPP and Mascot/ptmRS, whereas there are high-scoring decoy (amino acid) hits in the other two pipelines that lead to much lower sensitivity at a given FLR cut-off. For TPP and Mascot/ptmRS pipelines, the results from estimation with the three decoys are largely reproducible; that is, pAla and pLeu gives highest (and similar) counts of sites at a given FLR, whereas pGly gives a lower count of sites at the same FLR threshold.

In this approach, sites are ordered by final probability (PSM probability * PTM probability). An alternative approach commonly used in the field is to threshold first at say <1% FDR for PSMs or peptide, and then order purely by PTM localization score or probability. We tested a similar approach to see what effect there is on sensitivity at a given FLR for the pAla results (Supporting Information Figure S10). While ordering site localizations by PTM probability only rather than the combined PSM*PTM probability, we can see that there is lower sensitivity at 1% FLR for the PTM probability option, and almost identical sensitivity between the two options at 5 and 10% FLR (Supporting Information Table S7). We therefore conclude that it is slightly superior to model both the probability that a given PSM is correct as well as that the PTM has been correctly localized to give the best ordering of results, particularly for those highest scoring around 1% global FLR.

Data Set PXD000612

Given that we see consistent trends for the synthetic data set and Arabidopsis data set in terms of comparing decoys across different pipelines, for the final validation, we focus only on the use of TPP on one further validation set from a different species (human). We would expect some different phosphorylation motifs comparing an animal species to a plant species, which could affect decoy amino acid performance. Figure 4 and Table 4 illustrate the FLR comparison using the different decoy amino acids, searching the PXD000612 human data set. A similar trend can be seen here as in the Arabidopsis data set with pAla and pLeu giving the highest site counts at 5% FLR and pAsp giving lowest site count. On the zoomed plot (<5% FLR, Figure 4b), the same issue as for data set PXD008355 can be observed, with unstable decoy estimation at low counts due to a random factor from a few high-scoring decoys (FLR < 1%). For this data set, there is also good agreement between the model FLR and the decoy FLR for most amino acids except pGlu, where the model FLR tends to be more conservative than decoy FLR (Supporting Information Figure S11). We also show the data after collapsing multiple PSMs reporting on the same site (Figure 4c,d), giving similar trends in sensitivity at fixed FLR thresholds as for data without collapse.

Comparison of decoy FLR estimation searching PXD000612 (human data set) using different decoy amino acids: pAla, pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic, 1 % FDR). (a) no collapse, all sites shown; (b) zoom on 5% FLR; (c) collapse to unique sites; (d) zoom on <5 FLR for data collapsed to unique sites. X-axis is count of sites, y-axis is global FLR, estimated as q-values by the decoy method.

The distance between the phosphorylated STY and the nearest candidate amino acid were again compared (Supporting Information Figure S12) to further investigate the effect of decoy amino acid choice and to examine how the distributions differ between the different data sets. When comparing these distances in the human data set, a similar pattern is seen to that of the Arabidopsis data set. It can be seen that Ala and Leu again follow a somewhat similar frequency distribution to proximal STY residues, again particularly in the positive direction. Asp and Pro are again enriched at the +1 position relative to STY, which would be expected. Gly is also seen to follow a similar distribution to STY residues and therefore would be expected to be a reliable decoy amino acid, based on this measure. However, looking at the FLR comparisons seen in Figure 4, Gly can be seen to give more conservative FLR estimation (or lower site counts at 5% FLR e.g. than Ala or Leu), as was also seen in the Arabidopsis data set.

We next explored whether particular amino acids in proximity to true phosphorylation sites cause results to change. We plotted the average final probabilities for the Arabidopsis data set searched via the TPP pipeline and split according to the amino acid in the −1 (Figure 5) and +1 (Supporting Information Figure S13) position relative to the assumed correct phosphorylation site, for the data set searched with no decoy pSTY, pAla decoy, pLeu decoy, and pGly decoy. In the search with no decoy, there is a particular striking trend that sites have a lower probability when the −1 amino acid is Ser, Thr, or Tyr. This occurs because the site localization algorithm (PTMProphet in this case) has fewer ions available to discriminate the correct from incorrect localization. In the pAla and pLeu results, we see that Ala and Leu in the −1 position cause sites to have a similar reduction in final probability as Ser, Thr, and Tyr in the no decoy search, that is, final probability shifts from around ∼0.96 to ∼0.91 (Ala and Leu decoy). We interpret this to mean that they behave as statistically “good decoys”; that is, when they are present in the −1 position relative to a true site, they behave in a similar manner to STY residues. In the pGly data, there is a much larger drop off in final probabilities when G is in the −1 position (∼0.96 to ∼0.87), meaning that (most commonly) Gly-pSer sites are scored less well that phosphoserines preceded by other amino acids, and excessive probability space is being distributed to the pGly-Ser hypotheses. In the results, we observe that around 47% (Arabidopsis data) and 35% (human data) of the high scoring pGly decoys have the pGly-Ser motif. We see similar trends in the human data set (Supporting Information Figures S14 and S15). It is unclear why this particular amino acid combination causes a problem for PTM localization, but we hypothesize that the Gly-pSer bond is perhaps particularly stable during fragmentation and hence a discriminating y ion terminating with pSer is less commonly observed. We thus conclude based on what we have observed that pGly is not an ideal choice for a decoy amino acid.

Comparison of averaged final site probabilities for all peptides (final probability ≥0.68 split by amino acid in the −1 positions for the PXD008355 (*Arabidopsis* data set). (a) STY (no decoy); (b) STY with Ala decoy; (c) STY with Leu decoy; and (d) STY Gly decoy.

Discussion

In the proteomics field, prior to the widespread adoption of decoy database searching, there was a general problem with false positive results in the literature, as labs vied to report the largest number of peptides and proteins, without control of FDR. It is now accepted that all proteomics studies should control FDR at an appropriately conservative level, for example, <1% FDR at the protein-level for protein-centric studies. It has long been recognized that a similar problem exists with reporting PTM sites. The accurate discovery of a site can be crucially important for downstream interpretation since the identity of the residue (STY for canonical phosphorylation) and the proximal amino acids govern understanding of the kinase and phosphatase that regulate it. Given the interest in understanding phosphorylation (and other PTM) sites in most human diseases, adequate control of false reporting is crucial. It has been recently reported that >80% of the reported sites in a popular phosphorylation database are estimated to be false positives.²⁶ This likely resulted due to studies using overly weak FLR thresholds in publications, and results then get deposited in databases. Correct identifications tend to be reported from multiple studies, whereas random wrong site identifications tend to be seen only one or twice, and thus over time, database-level FLR creeps up.

This study is, to our knowledge, the most detailed attempt to understand how best to estimate global FLR using localization on decoy amino acids. We compare the method against the use of a statistical model, based on summing local FLR values, and the results agree reasonably but not perfectly well. We also demonstrate that the selection of a particular amino acid, even when correcting for the frequency of that amino acid in the results, does influence results more than would be desirable. We believe that our results back up that either pAla or pLeu make appropriate decoys based on their similar frequencies proximal to real phosphorylation sites (in a test case from humans and a model plant), as compared to target amino acids STY. The results for pAla and pLeu decoys also agree well with the model FLR (for the large data sets) and the answer key FLR, for the synthetic data sets. We have a slight preference to use pAla as a decoy going forward since there is a slight risk of confusion between Leu and Ile amino acids, which often cannot be distinguished by MS. In rare cases, where there are two peptides in the database, differing only by Ile/Leu, errors or inconsistencies in decoy FLR estimation could be introduced.

From the TPP pipeline, using iProphet and PTMProphet, it is possible in theory to use either the model FLR or the decoy FLR for thresholding final results. As noted above, performing a 1% global FLR threshold may be unstable (based on the decoy FLR method), depending on the chance of appearance of a few decoys high on the ranked list. If control at this level is required, for example, in cases where the exact sites of all modifications will be used in future validation studies, the model FLR would thus be preferred. For a less conservative threshold, say 5% FLR, for example, used when making more general biological conclusions from a quantitative study or pathway analysis following it, we believe that thresholding using the pAla decoy FLR method can be recommended. The rationale is that this method can be straightforwardly applied using any combination of tools and is simple to interpret. Most other pipelines in current use do not report accurate PEP values for PSMs and for site localization, allowing model FLR to be calculated reliably. We also recommend that the scores per site (final probability in the case of TPP-produced data) and the pAla “identifications” get carried forward and reported. This allows for the potential for meta-analyses and database submissions to estimate the resulting global FLR once multiple data sets have been combined.

We acknowledge that there is a slight downside to searching with a decoy amino acid in that the search spaces for PSM identification and PTM localization are both increased, leading to a potential loss in sensitivity. In our analyses, Ala residues are present at less than 1/3 the total frequency of STY residues, leading to a relatively modest increase in search spaces, say 30%. We also suggest that the proteomics field has generally accepted that doubling the PSM search database (and search time) through the inclusion of decoys is an acceptable trade-off for gaining the ability to estimate global FDR straightforwardly and transparently. It is also important to stress that the method of searching for a decoy amino acid is intended to perform global correction for different errors that can be introduced in a modification discovery pipeline (e.g. incorrect peptide identification and incorrect site localization). However, the approach does not enable calculation of local statistics for false discovery since a given decoy amino acid will not be present in all peptides and has a variable count of decoy sites in other peptides, compared to target amino acids. For local statistics, groups should rely on pipelines that can calculate accurate statistics for peptide identification and site localization. Nevertheless, our results show that adding an empirical decoy amino gives confidence that appropriate thresholds and quality control has been applied throughout the analytical pipeline, to deliver the final results from a study. While we have presented results for pAla as a decoy for phosphorylation studies, we also suggest that modified Ala could also be an appropriate decoy for other modification types, such as Lys modifications acetylation, methylation, ubiquitination SUMOylation, and so forth or for cases where multiple modifications are scored at the same time, although we have not yet profiled the amino acid distributions sufficiently to conclude that Ala is more suitable than other amino acids in these cases.

Conclusions

We have assessed six different amino acids for their ability to act as suitable decoy amino acids for the estimation of global FLR in phosphoproteomics studies. We have analyzed four data sets, two synthetic with a known answer and two biological-sample data sets. We conclude that either Ala or Leu make appropriate decoys and give reliable estimates of FLR above 1% FLR. Below 1% FLR, estimates can be unstable due to a few random high-scoring decoys. We demonstrate that the decoy-based FLR gives similar estimates to a modeled FLR for Ala and Leu decoys, based on summing local FLR values per site and based on the answer key for the synthetic data sets. We recommend that phosphoproteomics investigators should adopt the “pAla” decoy going forward, that is, the pASTY method, and report sites with appropriate global FLR control.

Acknowledgments

We are grateful for funding from BBSRC/NSF that supported this work [BB/S017054/1 and BB/S01781X/1]. J.A.V. and M.M. would also like to acknowledge EMBL core funding. E.W.D. acknowledges funding by the NSF grant DBI-1933311 and NIH grants R01GM087221 and R24GM127667.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00827.

Search parameters for each data set; log10 PTM probability versus PSM probability; boxplots of PSM supporting targets and decoys; comparison of pX FLR estimation searching PXD007058 using different pipelines; comparison of pAla/pLeu decoy FLR estimation searching PXD007058 using different pipelines; comparison of the minimum distance between phosphorylated STY and the nearest target amino acid, compared to the STY distribution, searching PXD007058; comparison of FLR estimation searching PXD000138 using localization on different decoy amino acids; comparison of FLR estimation methods searching PXD007058 for each of the different decoy amino acids; counts at pX FLR using the FLR method, searching PXD000138; comparison of the minimum distance between phosphorylated STY and the nearest target amino acid, compared to the STY distribution, searching PXD000138; comparison of FLR estimation searching PXD008355 using different decoy amino acids; counts of sites at pX decoy FLR for different thresholds using each decoy amino acid searching PXD008355; comparison of pX decoy FLR and model FLR estimation searching PXD008355 for different amino acids; comparison of amino acid frequency ratios between STY and the decoy amino acid for the identified peptides, identified phosphopeptides, and the search database; comparison of decoy FLR estimation searching PXD008355 using different pipelines for pAla, pLeu, and pGly; comparison of pAla/pLeu/pGly decoy FLR site counts searching PXD008355 using different pipelines; comparison of pAla decoy FLR estimation searching PXD008355 using different pipelines; counts of sites at pAla decoy FLR for different thresholds using each pipeline, searching PXD008355; comparison of pX decoy FLR and model FLR estimation searching PXD000612; comparison of the minimum distance between phosphorylated S, T, or Y and the nearest target amino acid, compared to the STY distribution, searching PXD000612; comparison of averaged final site probabilities for all peptides split by amino acid in the +1 positions for the PXD008355; and comparison of averaged final site probabilities for all peptides split by amino acid in the +1 and −1 positions for PXD000612 (PDF)

The authors declare no competing financial interest.

Supplementary Material

pr1c00827_si_001.pdf^{(1.6MB, pdf)}

References

Rigbolt K. T. G.; Blagoev B. Quantitative phosphoproteomics to characterize signaling networks. Semin. Cell Dev. Biol. 2012, 23, 863–871. 10.1016/j.semcdb.2012.05.006. [DOI] [PubMed] [Google Scholar]
Verheggen K.; Raeder H.; Berven F. S.; Martens L.; Barsnes H.; Vaudel M. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. Mass Spectrom. Rev. 2020, 39, 292–306. 10.1002/mas.21543. [DOI] [PubMed] [Google Scholar]
Beausoleil S. A.; Villén J.; Gerber S. A.; Rush J.; Gygi S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292. 10.1038/nbt1240. [DOI] [PubMed] [Google Scholar]
Locard-Paulet M.; Bouyssié D.; Froment C.; Burlet-Schiltz O.; Jensen L. J. Comparing 22 Popular Phosphoproteomics Pipelines for Peptide Identification and Site Localization. J. Proteome Res. 2020, 19, 1338–1345. 10.1021/acs.jproteome.9b00679. [DOI] [PubMed] [Google Scholar]
Ferries S.; Perkins S.; Brownridge P. J.; Campbell A.; Eyers P. A.; Jones A. R.; Eyers C. E. Evaluation of Parameters for Confident Phosphorylation Site Localization Using an Orbitrap Fusion Tribrid Mass Spectrometer. J. Proteome Res. 2017, 16, 3448–3459. 10.1021/acs.jproteome.7b00337. [DOI] [PubMed] [Google Scholar]
Baker P. R.; Trinidad J. C.; Chalkley R. J. Modification site localization scoring integrated into a search engine. Mol. Cell. Proteomics 2011, 10, M111.008078. 10.1074/mcp.M111.008078. [DOI] [PMC free article] [PubMed] [Google Scholar]; a Fermin D.; Walmsley S. J.; Gingras A.-C.; Choi H.; Nesvizhskii A. I. LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach. Mol. Cell. Proteomics 2013, 12, 3409–3419. 10.1074/mcp.M113.028928. [DOI] [PMC free article] [PubMed] [Google Scholar]
Deutsch E. W.; Mendoza L.; Shteynberg D.; Slagel J.; Sun Z.; Moritz R. L. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteonomics Clin. Appl. 2015, 9, 745–754. 10.1002/prca.201400164. [DOI] [PMC free article] [PubMed] [Google Scholar]
Eng J. K.; Jahan T. A.; Hoopmann M. R. Comet: an open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24. 10.1002/pmic.201200439. [DOI] [PubMed] [Google Scholar]
Shteynberg D. D.; Deutsch E. W.; Campbell D. S.; Hoopmann M. R.; Kusebauch U.; Lee D.; Mendoza L.; Midha M. K.; Sun Z.; Whetton A. D.; et al. PTMProphet: Fast and Accurate Mass Modification Localization for the Trans-Proteomic Pipeline. J. Proteome Res. 2019, 18, 4262–4272. 10.1021/acs.jproteome.9b00205. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tyanova S.; Temu T.; Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 2016, 11, 2301–2319. 10.1038/nprot.2016.136. [DOI] [PubMed] [Google Scholar]
Taus T.; Köcher T.; Pichler P.; Paschke C.; Schmidt A.; Henrich C.; Mechtler K. Universal and confident phosphorylation site localization using phosphoRS. J. Proteome Res. 2011, 10, 5354–5362. 10.1021/pr200611n. [DOI] [PubMed] [Google Scholar]; a Perkins D. N.; Pappin D. J. C.; Creasy D. M.; Cottrell J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567. . [DOI] [PubMed] [Google Scholar]
Zhang J.; Xin L.; Shan B.; Chen W.; Xie M.; Yuen D.; Zhang W.; Zhang Z.; Lajoie G. A.; Ma B. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 2012, 11, M111.010587. 10.1074/mcp.M111.010587. [DOI] [PMC free article] [PubMed] [Google Scholar]
Perez-Riverol Y.; Csordas A.; Bai J.; Bernal-Llinares M.; Hewapathirana S.; Kundu D. J.; Inuganti A.; Griss J.; Mayer G.; Eisenacher M.; et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2019, 47, D442–D450. 10.1093/nar/gky1106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Vizcaíno J. A.; Deutsch E. W.; Wang R.; Csordas A.; Reisinger F.; Ríos D.; Dianes J. A.; Sun Z.; Farrah T.; Bandeira N.; et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014, 32, 223–226. 10.1038/nbt.2839. [DOI] [PMC free article] [PubMed] [Google Scholar]
Martens L.; Hermjakob H.; Jones P.; Adamski M.; Taylor C.; States D.; Gevaert K.; Vandekerckhove J.; Apweiler R. PRIDE: the proteomics identifications database. Proteomics 2005, 5, 3537–3545. 10.1002/pmic.200401303. [DOI] [PubMed] [Google Scholar]
Marx H.; Lemeer S.; Schliep J. E.; Matheron L.; Mohammed S.; Cox J.; Mann M.; Heck A. J. R.; Kuster B. A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nat. Biotechnol. 2013, 31, 557–564. 10.1038/nbt.2585. [DOI] [PubMed] [Google Scholar]
Van Leene J.; Han C.; Gadeyne A.; Eeckhout D.; Matthijs C.; Cannoot B.; De Winne N.; Persiau G.; Van De Slijke E.; Van de Cotte B.; et al. Capturing the phosphorylation and protein interaction landscape of the plant TOR kinase. Nat Plants 2019, 5, 316–327. 10.1038/s41477-019-0378-z. [DOI] [PubMed] [Google Scholar]
Sharma K.; D’Souza R. C.; Tyanova S.; Schaab C.; Wiśniewski J. R.; Cox J.; Mann M. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 2014, 8, 1583–94. 10.1016/j.celrep.2014.07.036. [DOI] [PubMed] [Google Scholar]
Cheng C. Y.; Krishnakumar V.; Chan A. P.; Thibaud-Nissen F.; Schobel S.; Town C. D. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017, 89, 789–804. 10.1111/tpj.13415. [DOI] [PubMed] [Google Scholar]
Deutsch E. W.; Sun Z.; Campbell D. S.; Binz P.-A.; Farrah T.; Shteynberg D.; Mendoza L.; Omenn G. S.; Moritz R. L. Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics. J. Proteome Res. 2016, 15, 4091–4100. 10.1021/acs.jproteome.6b00445. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zahn-Zabal M.; Michel P.-A.; Gateau A.; Nikitin F.; Schaeffer M.; Audot E.; Gaudet P.; Duek P. D.; Teixeira D.; de Laval V. R.; et al. The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res. 2020, 48, D328–D334. 10.1093/nar/gkz995. [DOI] [PMC free article] [PubMed] [Google Scholar]
Moosa J. M.; Guan S.; Moran M. F.; Ma B. Repeat-Preserving Decoy Database for False Discovery Rate Estimation in Peptide Identification. J. Proteome Res. 2020, 19, 1029–1036. 10.1021/acs.jproteome.9b00555. [DOI] [PubMed] [Google Scholar]
Keller A.; Nesvizhskii A. I.; Kolker E.; Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392. 10.1021/ac025747h. [DOI] [PubMed] [Google Scholar]
Shteynberg D.; Deutsch E. W.; Lam H.; Eng J. K.; Sun Z.; Tasman N.; Mendoza L.; Moritz R. L.; Aebersold R.; Nesvizhskii A. I. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 2011, 10, M111.007690. 10.1074/mcp.M111.007690. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hardman G.; Perkins S.; Brownridge P. J.; Clarke C. J.; Byrne D. P.; Campbell A. E.; Kalyuzhnyy A.; Myall A.; Eyers P. A.; Jones A. R.; et al. Strong anion exchange-mediated phosphoproteomics reveals extensive human non-canonical phosphorylation. EMBO J. 2019, 38, e100847 10.15252/embj.2018100847. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kalyuzhnyy A.; Eyers P. A.; Eyers C. E.; Bowler-Barnett E.; Martin M. J.; Sun Z.; Deutsch E. W.; Jones A. R. Profiling the Human Phosphoproteome to Estimate the True Extent of Protein Phosphorylation. J. Proteome Res. 2022, 10.1021/acs.jproteome.2c00131. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

pr1c00827_si_001.pdf^{(1.6MB, pdf)}

[ref1] Rigbolt K. T. G.; Blagoev B. Quantitative phosphoproteomics to characterize signaling networks. Semin. Cell Dev. Biol. 2012, 23, 863–871. 10.1016/j.semcdb.2012.05.006. [DOI] [PubMed] [Google Scholar]

[ref2] Verheggen K.; Raeder H.; Berven F. S.; Martens L.; Barsnes H.; Vaudel M. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. Mass Spectrom. Rev. 2020, 39, 292–306. 10.1002/mas.21543. [DOI] [PubMed] [Google Scholar]

[ref3] Beausoleil S. A.; Villén J.; Gerber S. A.; Rush J.; Gygi S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292. 10.1038/nbt1240. [DOI] [PubMed] [Google Scholar]

[ref4] Locard-Paulet M.; Bouyssié D.; Froment C.; Burlet-Schiltz O.; Jensen L. J. Comparing 22 Popular Phosphoproteomics Pipelines for Peptide Identification and Site Localization. J. Proteome Res. 2020, 19, 1338–1345. 10.1021/acs.jproteome.9b00679. [DOI] [PubMed] [Google Scholar]

[ref5] Ferries S.; Perkins S.; Brownridge P. J.; Campbell A.; Eyers P. A.; Jones A. R.; Eyers C. E. Evaluation of Parameters for Confident Phosphorylation Site Localization Using an Orbitrap Fusion Tribrid Mass Spectrometer. J. Proteome Res. 2017, 16, 3448–3459. 10.1021/acs.jproteome.7b00337. [DOI] [PubMed] [Google Scholar]

[ref6] Baker P. R.; Trinidad J. C.; Chalkley R. J. Modification site localization scoring integrated into a search engine. Mol. Cell. Proteomics 2011, 10, M111.008078. 10.1074/mcp.M111.008078. [DOI] [PMC free article] [PubMed] [Google Scholar]; a Fermin D.; Walmsley S. J.; Gingras A.-C.; Choi H.; Nesvizhskii A. I. LuciPHOr: algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach. Mol. Cell. Proteomics 2013, 12, 3409–3419. 10.1074/mcp.M113.028928. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] Deutsch E. W.; Mendoza L.; Shteynberg D.; Slagel J.; Sun Z.; Moritz R. L. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteonomics Clin. Appl. 2015, 9, 745–754. 10.1002/prca.201400164. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref8] Eng J. K.; Jahan T. A.; Hoopmann M. R. Comet: an open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24. 10.1002/pmic.201200439. [DOI] [PubMed] [Google Scholar]

[ref9] Shteynberg D. D.; Deutsch E. W.; Campbell D. S.; Hoopmann M. R.; Kusebauch U.; Lee D.; Mendoza L.; Midha M. K.; Sun Z.; Whetton A. D.; et al. PTMProphet: Fast and Accurate Mass Modification Localization for the Trans-Proteomic Pipeline. J. Proteome Res. 2019, 18, 4262–4272. 10.1021/acs.jproteome.9b00205. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref10] Tyanova S.; Temu T.; Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc. 2016, 11, 2301–2319. 10.1038/nprot.2016.136. [DOI] [PubMed] [Google Scholar]

[ref11] Taus T.; Köcher T.; Pichler P.; Paschke C.; Schmidt A.; Henrich C.; Mechtler K. Universal and confident phosphorylation site localization using phosphoRS. J. Proteome Res. 2011, 10, 5354–5362. 10.1021/pr200611n. [DOI] [PubMed] [Google Scholar]; a Perkins D. N.; Pappin D. J. C.; Creasy D. M.; Cottrell J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567. . [DOI] [PubMed] [Google Scholar]

[ref12] Zhang J.; Xin L.; Shan B.; Chen W.; Xie M.; Yuen D.; Zhang W.; Zhang Z.; Lajoie G. A.; Ma B. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 2012, 11, M111.010587. 10.1074/mcp.M111.010587. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] Perez-Riverol Y.; Csordas A.; Bai J.; Bernal-Llinares M.; Hewapathirana S.; Kundu D. J.; Inuganti A.; Griss J.; Mayer G.; Eisenacher M.; et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2019, 47, D442–D450. 10.1093/nar/gky1106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] Vizcaíno J. A.; Deutsch E. W.; Wang R.; Csordas A.; Reisinger F.; Ríos D.; Dianes J. A.; Sun Z.; Farrah T.; Bandeira N.; et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014, 32, 223–226. 10.1038/nbt.2839. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] Martens L.; Hermjakob H.; Jones P.; Adamski M.; Taylor C.; States D.; Gevaert K.; Vandekerckhove J.; Apweiler R. PRIDE: the proteomics identifications database. Proteomics 2005, 5, 3537–3545. 10.1002/pmic.200401303. [DOI] [PubMed] [Google Scholar]

[ref16] Marx H.; Lemeer S.; Schliep J. E.; Matheron L.; Mohammed S.; Cox J.; Mann M.; Heck A. J. R.; Kuster B. A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nat. Biotechnol. 2013, 31, 557–564. 10.1038/nbt.2585. [DOI] [PubMed] [Google Scholar]

[ref17] Van Leene J.; Han C.; Gadeyne A.; Eeckhout D.; Matthijs C.; Cannoot B.; De Winne N.; Persiau G.; Van De Slijke E.; Van de Cotte B.; et al. Capturing the phosphorylation and protein interaction landscape of the plant TOR kinase. Nat Plants 2019, 5, 316–327. 10.1038/s41477-019-0378-z. [DOI] [PubMed] [Google Scholar]

[ref18] Sharma K.; D’Souza R. C.; Tyanova S.; Schaab C.; Wiśniewski J. R.; Cox J.; Mann M. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 2014, 8, 1583–94. 10.1016/j.celrep.2014.07.036. [DOI] [PubMed] [Google Scholar]

[ref19] Cheng C. Y.; Krishnakumar V.; Chan A. P.; Thibaud-Nissen F.; Schobel S.; Town C. D. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017, 89, 789–804. 10.1111/tpj.13415. [DOI] [PubMed] [Google Scholar]

[ref20] Deutsch E. W.; Sun Z.; Campbell D. S.; Binz P.-A.; Farrah T.; Shteynberg D.; Mendoza L.; Omenn G. S.; Moritz R. L. Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics. J. Proteome Res. 2016, 15, 4091–4100. 10.1021/acs.jproteome.6b00445. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] Zahn-Zabal M.; Michel P.-A.; Gateau A.; Nikitin F.; Schaeffer M.; Audot E.; Gaudet P.; Duek P. D.; Teixeira D.; de Laval V. R.; et al. The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res. 2020, 48, D328–D334. 10.1093/nar/gkz995. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] Moosa J. M.; Guan S.; Moran M. F.; Ma B. Repeat-Preserving Decoy Database for False Discovery Rate Estimation in Peptide Identification. J. Proteome Res. 2020, 19, 1029–1036. 10.1021/acs.jproteome.9b00555. [DOI] [PubMed] [Google Scholar]

[ref23] Keller A.; Nesvizhskii A. I.; Kolker E.; Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392. 10.1021/ac025747h. [DOI] [PubMed] [Google Scholar]

[ref24] Shteynberg D.; Deutsch E. W.; Lam H.; Eng J. K.; Sun Z.; Tasman N.; Mendoza L.; Moritz R. L.; Aebersold R.; Nesvizhskii A. I. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 2011, 10, M111.007690. 10.1074/mcp.M111.007690. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] Hardman G.; Perkins S.; Brownridge P. J.; Clarke C. J.; Byrne D. P.; Campbell A. E.; Kalyuzhnyy A.; Myall A.; Eyers P. A.; Jones A. R.; et al. Strong anion exchange-mediated phosphoproteomics reveals extensive human non-canonical phosphorylation. EMBO J. 2019, 38, e100847 10.15252/embj.2018100847. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref26] Kalyuzhnyy A.; Eyers P. A.; Eyers C. E.; Bowler-Barnett E.; Martin M. J.; Sun Z.; Deutsch E. W.; Jones A. R. Profiling the Human Phosphoproteome to Estimate the True Extent of Protein Phosphorylation. J. Proteome Res. 2022, 10.1021/acs.jproteome.2c00131. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Method for Independent Estimation of the False Localization Rate for Phosphoproteomics

Kerry A Ramsbottom

Ananth Prakash

Yasset Perez Riverol

Oscar Martin Camacho

Maria-Jesus Martin

Juan Antonio Vizcaíno

Eric W Deutsch

Andrew R Jones

Abstract

Introduction

Methods

Table 1. Comet Search Parameters for Each Data Set.

Downstream Data Analysis

Model FLR Method

Global Model FLR Equation

Decoy FLR Method

Phosphorylated Decoy Amino Acid FLR Equation

Answer Key FLR

Synthetic Answer Key FLR

Collapsing Observations of a Site across Multiple PSMs

Profiling Distance Distributions from Real Identifications to Decoy Amino Acids

Profiling Site Probabilities for Proximal Amino Acids

Results

Analysis of Synthetic Data Set PXD007058

Figure 1.

Table 2. Counts at pX FLR (Calculated by the Decoy Method) for 1, 5, and 10% Using Each FLR Method, Searching PXD007058 (Synthetic Data Set) (TPP, Fully Tryptic, 1% FDR).

Biological Data Set Analysis

Data Set PXD008355

Figure 2.

Table 4. Counts at pX FLR for 1, 5, and 10% Using Each Decoy Method, Searching PXD000612 (Human Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsed to Unique Sites.

Figure 3.

Data Set PXD000612

Figure 4.

Figure 5.

Discussion

Conclusions

Acknowledgments

Supporting Information Available

Supplementary Material

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases