Author manuscript; available in PMC: 2012 Mar 27.
Published in final edited form as: Proteomics. 2011 Sep 6;11(20):4105–4108. doi: 10.1002/pmic.201100297

Can the false-discovery rate be misleading?

Rodrigo Barboza 1,&, Daniel Cociorva 2,&, Tao Xu 2, Valmir C Barbosa 1, Jonas Perales 4, Richard H Valente 4, Felipe M G França 1, John R Yates III 2, Paulo C Carvalho 3,4,*
PMCID: PMC3313620  NIHMSID: NIHMS357606  PMID: 21834134

Abstract

The decoy-database approach is currently the gold standard for assessing the confidence of identifications in shotgun proteomic experiments. Here we demonstrate that what might appear to be a good result under the decoy-database approach for a given false-discovery rate can, in fact, be the product of overfitting. This problem has been overlooked until now and can lead to inflated identification numbers whose reliability does not correspond to the expected false-discovery rate. To remedy this, we introduce a modified version of the method, termed the semi-labeled decoy approach, which enables the statistical detection of an overfitted result.

Keywords: shotgun proteomics, overfitting, protein identification, false-discovery rate, decoy


The decoy-database approach [1,2] is currently the gold standard for assessing identifications in shotgun proteomic experiments. Briefly, this method relies on using a protein identification search engine to match experimental spectra against theoretical ones generated from a database containing target protein sequences and labeled decoys (e.g., reversed target sequences), usually present in equal numbers. According to Elias and Gygi, “one can estimate the total number of false positives (FPs) that meet specific selection criteria by doubling the number of selected decoy hits” [2]. This rationale is applied to the final lists of identifications, which include the decoy hits (i.e., hits known to be incorrect), and in the end the user generally counts only the identifications assigned to targets.

The proteomics community has developed tools rooted in this approach (e.g., DTASelect [3] and IDPicker [4]) to automatically filter out low-quality identifications. New filtration tools and methods are usually benchmarked entirely by how many peptide-spectrum matches (PSMs) are identified under a specified false-discovery rate (FDR) [5]. The ensuing effort to maximize PSMs tends to push authors toward increasingly complex discriminant functions to improve results under the same FDR. This, in turn, can give rise to a limitation not anticipated in the Elias-Gygi guideline.

Here we show that what might seem to be a good result under the decoy-database approach, henceforth referred to as the labeled decoy approach, can actually be the product of overfitting a discriminant model to the dataset. To overcome this limitation we introduce a modified decoy method, here termed the semi-labeled decoy approach, which relies on labeled decoys but also on unlabeled decoys, the latter being sequences that the discriminator does not know to be decoys. These unlabeled decoys serve as an internal error reference that helps to deal statistically with overfitting.

We demonstrate the overfitting problem and the effectiveness of our approach on datasets of mass spectra obtained by analyzing Pyrococcus furiosus and Trypanosoma cruzi lysates with an Orbitrap XL (Thermo, San Jose, CA) under conditions previously described in the literature [6]. These spectra were searched using ProLuCID [7] against three types of databases, generated as follows:

  1. The first is the widely adopted Target–Reverse database (T-R DB). In it, for every target sequence a decoy sequence is generated by reversing the target. Clearly, this produces a final database with target and decoy sequences in the same number.

  2. The second database is here termed the Target-Scrambled0-Scrambled1 database (T-S0-S1 DB). In it, for every target sequence two decoy sequences, S0 and S1, having the same length as the target, are generated by randomly scrambling the contents of the target sequence. The number of non-target sequences is twice that of the target. The reason for generating two layers of decoys is that one will serve as labeled decoys and the other as unlabeled.

  3. The third database is referred to as the Target-PairReversed-MiddleReversed database (T-PR-MR DB). In it, for every target sequence a PR and an MR sequence are generated as a function of the digestion enzyme used in the project. For each target sequence, first the peptides that the enzyme in question will produce are listed. For each one a PR peptide is generated by first swapping the two outermost amino acids, then treating pairs of the remaining amino acids as units and reversing their order. To exemplify, given the target peptide ABCDEFGHI, its PR peptide is IGHEFCDBA. The final PR sequence is obtained by concatenating all PR peptides. Similarly, for each target peptide an MR peptide is generated by first swapping the two outermost amino acids, then dividing the remaining portion in half and reversing each of the halves separately. To exemplify, the same target peptide becomes IEDCBHGFA. The final MR sequence is obtained by concatenating all MR peptides. (A code sketch reproducing both examples is given after this list.)
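To make the PR and MR constructions concrete, the following sketch (in Python, not taken from the authors' Java implementation) reproduces the two worked examples above. The handling of odd-length peptides (which residue is left unpaired in PR, and which half receives the extra residue in MR) is inferred from those examples and is therefore an assumption.

def pr_peptide(peptide: str) -> str:
    """Swap the outermost residues, then reverse the order of residue pairs in between."""
    if len(peptide) < 3:
        return peptide[::-1]
    middle = peptide[1:-1]
    pairs, i = [], len(middle)
    while i > 0:                      # collect pairs starting from the right-hand end
        start = max(0, i - 2)
        pairs.append(middle[start:i])
        i = start
    return peptide[-1] + "".join(pairs) + peptide[0]

def mr_peptide(peptide: str) -> str:
    """Swap the outermost residues, then reverse each half of the remaining portion."""
    if len(peptide) < 3:
        return peptide[::-1]
    middle = peptide[1:-1]
    half = (len(middle) + 1) // 2     # first half keeps the extra residue if odd-length
    return peptide[-1] + middle[:half][::-1] + middle[half:][::-1] + peptide[0]

assert pr_peptide("ABCDEFGHI") == "IGHEFCDBA"   # example from the main text
assert mr_peptide("ABCDEFGHI") == "IEDCBHGFA"   # example from the main text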

The T-PR-MR format is useful to reproduce the advantages of the widely adopted target-reverse approach when using two layers of decoys. As is known, randomizing each sequence independently generates higher peptide diversity than reversing each sequence, especially in proteomes with highly redundant sequences or conserved regions, such as those of mammals. The search engine will then compare each spectrum to more candidates from the randomized sequences than from the target sequences. This biases the FDR estimate, as most search engines do not consider the number of distinct peptides generated from each protein database and most FDR computations assume the numbers of comparisons to targets and decoys to be the same. As an example, the Human IPI database contains some 140% more unique tryptic peptides in the scrambled decoy sequences than in the targets. The T-PR-MR format, similarly to the T-R format, addresses these issues by keeping decoy peptide diversity acceptably close to that of the targets. We created the T-PR-MR format by empirically testing different ways of rearranging the sequences to minimize overlapping peaks in theoretically generated mass spectra of corresponding T, PR, and MR peptides. We have included the T-S0-S1 format for benchmarking purposes only, as several groups in the proteomics community still use randomized databases instead of reversed ones. However, for the reasons above and as demonstrated later by our results, we do not recommend using them.
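To check whether a given decoy scheme preserves peptide diversity for one's own database, targets and decoys can be digested in silico and the numbers of unique peptides compared. The sketch below is a minimal illustration assuming a fully tryptic cleavage rule (cut after K or R, but not before P) and protein sequences already parsed from the database; it is not the procedure used to obtain the IPI figure above.

import re
from itertools import chain

def tryptic_peptides(sequence: str, min_len: int = 6) -> set:
    """Fully tryptic in-silico digestion: cleave after K or R, except before P."""
    fragments = re.split(r'(?<=[KR])(?!P)', sequence)
    return {f for f in fragments if len(f) >= min_len}

def unique_peptide_count(sequences) -> int:
    """Count distinct tryptic peptides across a collection of protein sequences."""
    return len(set(chain.from_iterable(tryptic_peptides(s) for s in sequences)))

# Hypothetical usage: 'targets' and 'decoys' are lists of protein sequence strings
# parsed from the target and decoy entries of the search database.
# ratio = unique_peptide_count(decoys) / unique_peptide_count(targets)
# A ratio far above 1 signals the FDR bias discussed above.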

We implemented two widely adopted pattern recognition strategies to generate discriminant models based on the search results to filter out low-quality identifications. These are the well-known Bayesian discriminator and a weightless artificial neural network (WNN), known as WiSARD [8], in its recently improved form [9]. A description of these approaches is available in Supplementary File I. Next we benchmarked both strategies by using the three database formats and accepting a 1% FDR at the spectral level, calculated by dividing the number of PSMs originating from labeled decoys by the total number of PSMs. The results from the P. furiosus dataset are presented in Tables I and II.
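For concreteness, the sketch below illustrates the spectral-level 1% FDR cut just described: PSMs are ranked by discriminator score and the largest score cutoff is kept for which labeled-decoy PSMs divided by total PSMs stays at or below 1%. The sketch is in Python (the authors' implementation is in Java) and the PSM record and its field names are hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class PSM:                          # hypothetical record; field names are illustrative
    score: float                    # discriminator output (higher = more confident)
    labeled_decoy: bool             # matched a labeled decoy sequence
    unlabeled_decoy: bool = False   # matched an unlabeled decoy (semi-labeled approach)

def filter_at_fdr(psms: List[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Keep the largest score-ranked prefix whose labeled-decoy fraction is <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    best_cut, decoy_count = 0, 0
    for i, psm in enumerate(ranked, start=1):
        decoy_count += psm.labeled_decoy
        if decoy_count / i <= max_fdr:   # FDR = labeled-decoy PSMs / total PSMs, as in the text
            best_cut = i
    return ranked[:best_cut]

# The unlabeled decoys surviving the cut are what the overfitting test introduced below counts:
# accepted = filter_at_fdr(all_psms)
# s = sum(p.unlabeled_decoy for p in accepted)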

Table I. Bayesian discriminator results on the P. furiosus dataset.

T-R DB, T-S0-S1 DB, and T-PR-MR DB are the databases described in the main text. The numbers in the Spectra rows indicate how many PSMs were obtained. The numbers in the Peptides rows indicate how many unique peptides were identified.

T-R DB
            Labeled Decoys    Total Target    Total
Spectra     1073              106280          107353
Peptides    900               19532           20432

T-S0-S1 DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     1083              1105                106184          108372
Peptides    982               983                 19609           21574

T-PR-MR DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     1083              1064                106229          108376
Peptides    936               939                 19587           21462

Table II. WNN discriminator results on the P. furiosus dataset.

T-R DB, T-S0-S1 DB, and T-PR-MR DB are the databases described in the main text. The numbers in the Spectra rows indicate how many PSMs were obtained. The numbers in the Peptides rows indicate how many unique peptides were identified.

T-R DB
            Labeled Decoys    Total Target    Total
Spectra     1162              115074          116236
Peptides    1099              26142           27241

T-S0-S1 DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     1150              4917                108945          115012
Peptides    1126              4513                22714           28353

T-PR-MR DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     1152              4656                109440          115248
Peptides    1100              4291                22803           28194

The results from Tables I and II favor the WNN over the Bayesian discriminator, as the former yielded more PSMs. However, by introducing unlabeled decoys we see that the premise of having roughly the same number of false positives as of decoys does not always withstand detailed scrutiny: the results from the Bayesian discriminator appear to be consistent, but those from the WNN do not. We also note that our new DB format (T-PR-MR) yields better results (i.e., less overfitting and more PSMs), as it reflects peptide diversity better than the randomized approach.

Clearly, what lies behind the apparent success of the WNN is its increased complexity, which has enabled it to overfit the data (i.e., to achieve better separation of the labeled decoys from the rest). In this regard, even though the WNN provided more PSMs meeting the FDR criterion, the elevated number of unlabeled decoy identifications demonstrates that its results are not as reliable as those of the Bayesian discriminator.

As the number of unlabeled decoy identifications is expected to be roughly the same as that of labeled decoys in a database containing targets, labeled decoys, and unlabeled decoys in equal numbers, an overfitting p-value can be computed as $P = \Pr(X > s) = \sum_{t=s+1}^{n} \mathrm{Bin}(t; n, p)$. Here X is a random variable indicating the number of unlabeled decoys identified, s is the value of X reported by the discriminator, n is the total number of identifications, p is the expected fraction of unlabeled decoys (i.e., p = 0.01 for an FDR of 1%), and Bin is the binomial probability mass function. This approach only applies to databases in which peptide diversity can be assumed to be nearly equal among the target, decoy, and unlabeled decoy sequences. Re-analyzing the tables above, it follows that indeed only the results from the Bayesian discriminator can be taken with confidence (P ≫ 0.05) for the experiment at hand.
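As an illustration only, the overfitting p-value above is simply a binomial survival function and can be computed as in the sketch below; the use of SciPy is an assumption, and the example numbers are the spectral counts from Tables I and II.

from scipy.stats import binom

def overfitting_p_value(unlabeled_decoys: int, total_ids: int,
                        expected_fraction: float = 0.01) -> float:
    """P = Pr(X > s) for X ~ Binomial(n, p), i.e. the binomial survival function at s."""
    return binom.sf(unlabeled_decoys, total_ids, expected_fraction)

# WNN result from Table II (T-S0-S1 DB): 4917 unlabeled-decoy PSMs out of 115012
# accepted PSMs at a nominal 1% FDR, against roughly 1150 expected by chance.
p_wnn = overfitting_p_value(4917, 115012)      # effectively zero: overfitted
# Bayesian result from Table I (T-S0-S1 DB): 1105 unlabeled decoys out of 108372.
p_bayes = overfitting_p_value(1105, 108372)    # well above 0.05: acceptable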

In a handpicked analysis from the T. cruzi dataset we show that the nature of the experiment can lead to overfitting (P < 0.05) even for discriminators, like the Bayesian one, that have done well in other circumstances. These results are presented in Table III.

Table III. Bayesian discriminator results on the T. cruzi dataset.

T-S0-S1 DB and T-PR-MR DB are the databases described in the main text. The numbers in the Spectra rows indicate how many PSMs were obtained. The numbers in the Peptides rows indicate how many unique peptides were identified.

T-S0-S1 DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     12                43                  1221            1276
Peptides    9                 12                  267             288

T-PR-MR DB
            Labeled Decoys    Unlabeled Decoys    Total Target    Total
Spectra     12                20                  1235            1267
Peptides    11                17                  273             301

Table III presents a marginally overfitted result for the T-PR-MR approach (P = 0.03). Such results could be improved by employing widely adopted ad-hoc filtration strategies, for example only accepting proteins supported by at least two spectral counts or two sequence counts (a minimal sketch follows below). Nevertheless, an effective discriminator remains the core of a filtration algorithm, as it is responsible for sorting the results according to confidence.
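A minimal sketch of such an ad-hoc filter is given below; the data structures (dictionaries mapping a PSM identifier to its protein and peptide) are hypothetical and not taken from the authors' tools.

from collections import defaultdict

def supported_proteins(psm_protein, psm_peptide, min_spectra=2, min_sequences=2):
    """Keep proteins with at least min_spectra PSMs or min_sequences distinct peptides.

    psm_protein / psm_peptide: hypothetical dicts mapping a PSM identifier to the
    protein accession and peptide sequence it was assigned to, respectively.
    """
    spectral_counts = defaultdict(int)
    sequence_counts = defaultdict(set)
    for psm_id, protein in psm_protein.items():
        spectral_counts[protein] += 1
        sequence_counts[protein].add(psm_peptide[psm_id])
    return {p for p in spectral_counts
            if spectral_counts[p] >= min_spectra
            or len(sequence_counts[p]) >= min_sequences}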

In our experience, overfitting should not in general be a problem for the majority of the widely adopted tools (e.g., Scaffold (Proteome Software), DTASelect, IDPicker), as they have matured and been extensively tested by the proteomics community. Nevertheless, it is advisable to test for overfitting when experimenting with new parameters, even in a tool one has experience with. Once it has been verified that overfitting is not a problem, searching databases with two layers of decoys becomes unnecessary, avoiding both the increased search time and the loss in sensitivity caused by a database with more decoys and therefore more “distractions” for the search engine.

We strongly recommend adopting the semi-labeled decoy approach when benchmarking new tools. Without proper awareness, it would be easy to advocate in favor of our WNN discriminator over the Bayesian one. In fact, we claim we can ultimately build a filtration tool capable of outperforming any of the widely adopted filtration tools under current benchmarking standards (i.e., number of PSMs under a given FDR). As we demonstrated, however, its results would not be trustworthy.

In summary, the semi-labeled decoy approach complements the labeled decoy approach by statistically addressing the overfitting problem. The method is simple, which makes it easy for authors of filtration tools to adopt overfitting p-values. To our knowledge, this is the first strategy that can empirically demonstrate overfitting in proteomic FDR estimation. We have limited ourselves to demonstrating the approach at the spectral level, but variations can easily be developed for use at the peptide and protein levels. Most importantly, we have shown that basing one’s decision exclusively on the FDR can be misleading. The mass spectra, search databases, and Java source code generated for this study are available at: http://max.ioc.fiocruz.br/pcarvalho/overfitting/.

Supplementary Material

Supplementary File I (supplementaryFile1)

Acknowledgments

Financial support provided by: CAPES-Fiocruz 30/2006, PDTIS-Fiocruz, CNPq 306070/2007-3, FAPERJ-BBP, NIH P41 RR011823 and R01 MH067880.

Footnotes

The authors declared no conflicts of interest.

Reference List

  1. Peng J, Elias JE, Thoreen CC, Licklider LJ, Gygi SP. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J Proteome Res. 2003;2:43–50. doi: 10.1021/pr025556v.
  2. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007;4:207–214. doi: 10.1038/nmeth1019.
  3. Cociorva D, Tabb L, Yates JR. Validation of tandem mass spectrometry database search results using DTASelect. Curr Protoc Bioinformatics. 2007;Chapter 13:Unit 13.4. doi: 10.1002/0471250953.bi1304s16.
  4. Ma ZQ, Dasari S, Chambers MC, Litton MD, Sobecki SM, Zimmerman LJ, et al. IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. J Proteome Res. 2009;8:3872–3881. doi: 10.1021/pr900360j.
  5. Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat Methods. 2007;4:923–925. doi: 10.1038/nmeth1113.
  6. Washburn MP, Wolters D, Yates JR III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol. 2001;19:242–247. doi: 10.1038/85686.
  7. Xu T, Venable JD, Park SK, Cociorva D, Lu B, Liao L, et al. ProLuCID, a fast and sensitive tandem mass spectra-based protein identification program. Mol Cell Proteomics. 2006;5 S:174.
  8. Aleksander I, Thomas W, Bowden P. WiSARD, a radical new step forward in image recognition. Sensor Rev. 1984;4:120–124.
  9. Grieco BPA, Lima PMV, De Gregorio M, França FMG. Producing pattern examples from "mental" images. Neurocomputing. 2010;73:1057–1064.
