Analyzing marginal cases in differential shotgun proteomics

Paulo C Carvalho; Juliana S G Fischer; Jonas Perales; John R Yates; Valmir C Barbosa; Elias Bareinboim

doi:10.1093/bioinformatics/btq632

Bioinformatics. 2010 Nov 11;27(2):275–276.

PMCID: PMC3018820 PMID: 21075743

Abstract

Summary: We present an approach to statistically pinpoint differentially expressed proteins that have quantitation values near the quantitation threshold and are not identified in all replicates (marginal cases). Our method uses a Bayesian strategy to combine parametric statistics with an empirical distribution built from the reproducibility quality of the technical replicates.

Availability: The software is freely available for academic use at http://pcarvalho.com/patternlab.

Contact: paulo@pcarvalho.com

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Shotgun proteomics is a large-scale approach to analyzing complex peptide mixtures (e.g. mixtures originating from biological fluids or cell lysates). Briefly, the strategy is to perform protein digestion followed by peptide chromatographic separation online with tandem mass spectrometry (MS2) for protein identification (Washburn et al., 2001). The study of complex mixtures is challenging in itself because peptides are under-sampled during data acquisition by mass spectrometry.

The combined nature of sample complexity, data acquisition methodologies and under-sampling is bound to generate considerable experimental variation. Indeed, one may expect to observe some 25% additional uniquely identified proteins when comparing two technical replicates of a complex mixture (Liu et al., 2004). As we demonstrate below, this variation is largely due to peptide ions whose relative quantitation values lie near the detection threshold and therefore do not appear in all technical replicates (marginal cases).

One of the goals of proteomics is to distinguish between various states of a biological system according to protein expression differences. When common statistical approaches are applied directly to pinpoint differentially expressed proteins, without the precautions that technical reproducibility demands, many marginal cases that are likely to be artifacts of chance may be included in the results and overshadow important findings. Moreover, proteins that are truly differentially expressed may be missed (false negatives).

2 PROBLEM FORMULATION AND MODELING

Consider two biological states B₁ and B₂ and two experimental datasets, one containing replicates from state B₁, the other containing the same number of replicates from state B₂. We address the question of estimating the probability that a protein appearing in at least one replicate from state B₁ is differentially expressed with respect to state B₂, i.e. that it is found in none of the replicates from B₂.

If P is the protein in question, then our aim is to estimate the probability P(H|D), where H stands for ‘P is not detected in any replicate from B₂’ and D for ‘P appears in at least one of the replicates from B₁’. We assume throughout that the appearance of any given protein in a replicate from B₂ is subject to the same underlying laws governing its appearance in replicates from B₁, and moreover that it may occur in any of the replicates from B₂ independently with the same probability. This implies that the number of replicates from B₂ containing that protein is distributed binomially. Henceforth, we use the smoother, approximate formula of the Poisson distribution instead. Accordingly, the probability that the protein appears in u of the replicates from B₂ with mean λ is denoted by Poi(u, λ). In our estimates, we always choose the value of λ in reference to what is observed or hypothesized with respect to state B₁.
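For reference, Poi(u, λ) is taken here to be the standard Poisson probability mass function; the closed form below is the textbook definition rather than a formula quoted from the authors, and it shows that the probability of zero occurrences reduces to a simple exponential.

```latex
\mathrm{Poi}(u, \lambda) = \frac{\lambda^{u} e^{-\lambda}}{u!},
\qquad u = 0, 1, 2, \ldots
\qquad\Longrightarrow\qquad
\mathrm{Poi}(0, \lambda) = e^{-\lambda}
```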

From a Bayesian perspective, we begin by estimating the prior probability P(H) that protein P does not appear in any replicate from state B₂. If r is the number of replicates from state B₁ in which P is detected, then we set P(H) = Poi(0, r). Similarly, computing the desired probability, P(H|D), requires that first we obtain P(D|H) and P(D|not H), that is, the probabilities that P is detected in at least one replicate from B₁ conditioned, respectively, on the fact that it does not or does appear in replicates from B₂. In order to estimate either probability, we first partition the B₁-replicate proteins into four groups of approximately the same size, each corresponding to one of the categories low, medium, high, or very high, according to the average signal of each protein (e.g. spectral count, peak area, etc.) over the replicates in which it appears. Let G denote the group to which protein P belongs. Our estimates of P(D|H) and P(D|not H) are relative to G, therefore specific to a certain range of average signal. In what follows, we use f_t to denote the fraction of group-G proteins that occur in t replicates from state B₁.
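The grouping step and the fractions f_t can be sketched as follows. This is a minimal illustration, not PatternLab's implementation: the names `signals`, `assign_groups` and `replicate_fractions` are hypothetical, a signal of 0 is assumed to mean "not detected", and splitting the rank-ordered proteins into quarters is only one possible way of obtaining four groups of approximately the same size.

```python
from collections import Counter
from statistics import mean

LABELS = ("low", "medium", "high", "very high")


def average_signal(values):
    """Mean signal over the replicates in which the protein was detected."""
    return mean([v for v in values if v > 0])


def assign_groups(signals):
    """Map each detected protein to one of four signal groups.

    signals: dict of protein -> per-replicate signal in state B1
             (assumption: 0 means the protein was not detected).
    """
    detected = [p for p, values in signals.items() if any(v > 0 for v in values)]
    ranked = sorted(detected, key=lambda p: average_signal(signals[p]))
    size = len(ranked)
    # Rank-based split into quarters (an assumed way to get equal-size groups).
    return {p: LABELS[i * 4 // size] for i, p in enumerate(ranked)}


def replicate_fractions(signals, groups, label, n):
    """Return {t: f_t}: fraction of proteins in group `label` seen in t replicates."""
    members = [p for p, g in groups.items() if g == label]
    if not members:
        return {t: 0.0 for t in range(1, n + 1)}
    tally = Counter(sum(1 for v in signals[p] if v > 0) for p in members)
    return {t: tally[t] / len(members) for t in range(1, n + 1)}
```

By construction, the fractions returned for a non-empty group sum to 1 over t = 1, 2,…, n.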

We estimate P(D|H) as the sum of probabilities of pairs of independent events. If n is the total number of replicates from either state, we consider one pair for each possible number t of replicates from state B₁, t = 1, 2,…, n. The two independent events for each pair are that a randomly chosen protein from group G appears in t replicates from state B₁, and that it appears in none of the replicates from state B₂. Thus,

$$P(D \mid H) = \sum_{t=1}^{n} f_t \, \mathrm{Poi}(0, t)$$

The case of P(D|not H) is similar, but now the invalidity of H implies that we must sum up the probabilities that the randomly chosen protein from group G appears in u replicates from state B₂, for u = 1, 2,…, n. We then obtain

$$P(D \mid \text{not } H) = \sum_{t=1}^{n} f_t \sum_{u=1}^{n} \mathrm{Poi}(u, t)$$

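The desired probability, finally, follows from the Bayesian inversion formula,

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D \mid H)\, P(H) + P(D \mid \text{not } H)\, [1 - P(H)]}$$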

and is henceforth used as a p-value for all proteins in G that appear in r replicates from state B₁.
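The whole calculation can be sketched in a few lines. This is an illustration under stated assumptions, not the PatternLab code: it takes the prior to be Poi(0, r), P(D|H) to be the sum over t of f_t · Poi(0, t), P(D|not H) to be the sum over t of f_t times the sum over u = 1,…, n of Poi(u, t), and combines them with Bayes' rule. The function names and the `fractions` dictionary ({t: f_t} for group G) are hypothetical.

```python
import math


def poisson_pmf(u, lam):
    """Standard Poisson probability of observing u events with mean lam."""
    return lam ** u * math.exp(-lam) / math.factorial(u)


def marginal_p_value(r, fractions, n):
    """Estimate P(H|D) for a protein detected in r of n replicates from B1.

    fractions: dict {t: f_t} for the protein's signal group G (assumed input).
    """
    prior_h = poisson_pmf(0, r)  # P(H) = Poi(0, r)

    # P(D|H): seen in t replicates from B1 and in none from B2.
    p_d_given_h = sum(
        fractions.get(t, 0.0) * poisson_pmf(0, t) for t in range(1, n + 1)
    )

    # P(D|not H): seen in t replicates from B1 and in u >= 1 replicates from B2.
    p_d_given_not_h = sum(
        fractions.get(t, 0.0) * sum(poisson_pmf(u, t) for u in range(1, n + 1))
        for t in range(1, n + 1)
    )

    numerator = p_d_given_h * prior_h
    denominator = numerator + p_d_given_not_h * (1.0 - prior_h)
    if denominator == 0.0:
        raise ValueError("fractions must contain at least one non-zero f_t")
    return numerator / denominator


if __name__ == "__main__":
    group_fractions = {1: 0.5, 2: 0.3, 3: 0.2}  # made-up f_t values
    for r in (1, 2, 3):
        print(r, marginal_p_value(r, group_fractions, n=3))
```

Because the two likelihood terms depend only on the group, proteins in the same group differ only through the prior Poi(0, r), so under these assumptions the value decreases as r grows.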

3 DATA ACQUISITION

For evaluation of the above methodology, we used two shotgun proteomic datasets acquired by Fischer et al. (2010). Briefly, the authors employed Multi-dimensional Protein Identification Technology (MudPIT; Washburn et al., 2001) to compare the A172 cell line in two biological states, here identified with the B₁ and B₂ states of Section 2. Each state was analyzed in triplicate (i.e. n = 3). Relative quantitation was performed by spectral counting. A protein required a minimum of two peptides (thus, two unique spectral counts) to be considered.

4 RESULTS

Each of Supplementary Figures 1A, B and C shows a Venn diagram (VD) of identified proteins from B₁ and B₂ appearing in at least one, at least two, and all three replicates, respectively. Supplementary Figures 2A and B show VDs comparing uniquely identified proteins among the technical replicates from B₁ and B₂, respectively. Both Supplementary Figures 1 and 2 corroborate the great variability claimed by Liu et al. (2004).

The model described in Section 2 has been implemented as part of the PatternLab for proteomics suite (Carvalho et al., 2008). Results for the two biological states of Section 3 are shown in Supplementary Tables I and II, which assess differential expression in state B₁ relative to state B₂ and conversely (i.e. reversing the roles of the two states in the discussion of Section 2), respectively. Clearly, proteins that are more reproducible (i.e. that appear in more replicates) yield lower p-values.

The resulting algorithm was also incorporated into PatternLab's area-proportional VD module (Carvalho et al., 2010). The user can now choose between generating VDs by filtering proteins that appear in at least a certain number of replicates, or by using the new approach through a user-specified p-value. The new option can be used to eliminate proteins that cannot be claimed to be statistically differentially expressed. Supplementary Figure 3 shows a VD that considers a p-value cutoff of 0.05 for the two biological states of Section 3, instead of the replicate-cutoff criterion used in Supplementary Figure 1.
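The difference between the two filtering options can be illustrated with a small sketch. It reflects one plausible reading of the text, not PatternLab's actual module: proteins found in both states are assumed to stay in the intersection, while a protein found in only one state is kept only if its p-value does not exceed the cutoff. All names (`venn_by_replicates`, `venn_by_p_value`, the input dictionaries) are hypothetical.

```python
def venn_by_replicates(counts_1, counts_2, min_replicates):
    """Venn regions using the replicate-count criterion.

    counts_1, counts_2: dict of protein -> number of replicates with a detection.
    """
    set_1 = {p for p, k in counts_1.items() if k >= min_replicates}
    set_2 = {p for p, k in counts_2.items() if k >= min_replicates}
    return set_1 - set_2, set_1 & set_2, set_2 - set_1


def venn_by_p_value(proteins_1, proteins_2, p_values_1, p_values_2, cutoff=0.05):
    """Venn regions using the p-value criterion (assumed filtering rule).

    proteins_1, proteins_2: sets of proteins seen in at least one replicate.
    p_values_1, p_values_2: dict of protein -> p-value for proteins seen in
                            one state only; missing entries are treated as 1.0.
    """
    shared = proteins_1 & proteins_2
    only_1 = {p for p in proteins_1 - proteins_2 if p_values_1.get(p, 1.0) <= cutoff}
    only_2 = {p for p in proteins_2 - proteins_1 if p_values_2.get(p, 1.0) <= cutoff}
    return only_1, shared, only_2
```

With a cutoff of 0.05, a protein seen only in one state with a p-value of 0.2 would be dropped from the diagram under this reading, whereas a replicate-count filter would keep or drop it based solely on how many replicates contain it.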

5 FINAL CONSIDERATIONS

An alternative, simple strategy to pinpoint marginal proteins representative of a biological state is to consider only proteins that appear in a minimum number of replicates. Such an approach, however, is arbitrary and lacks proper foundation. The approach we have described, on the other hand, is well-founded and therefore amounts to a more refined method. It is useful especially in generating VDs, such as the one in Supplementary Figure 3, for the study of proteins that are representative of a given biological state. We note, in relation to VDs such as this, that uniquely identified proteins in the VD are not to be claimed as being unique to a state; instead, they are most likely differentially expressed.

Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES-Fiocruz 30/2006); Programa de Desenvolvimento Tecnológico em Insumos para Saúde (PDTIS-Fiocruz); Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); BBP grants from Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro; National Institutes of Health (NIH 5R01MH067880 and P41 RR011823).

Conflict of Interest: none declared.

REFERENCES

Carvalho PC, et al. PatternLab for proteomics: a tool for differential shotgun proteomics. BMC Bioinformatics. 2008;9:316. doi: 10.1186/1471-2105-9-316.
Carvalho PC, et al. Analyzing shotgun proteomic data with PatternLab for proteomics. Curr. Protoc. Bioinformatics. 2010; Chapter 13, Unit 15. doi: 10.1002/0471250953.bi1313s30.
Fischer JS, et al. Dynamic proteomic overview of glioblastoma cells (A172) exposed to perillyl alcohol. J. Proteomics. 2010;73:1018–1027. doi: 10.1016/j.jprot.2010.01.003.
Liu H, et al. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal. Chem. 2004;76:4193–4201. doi: 10.1021/ac0498563.
Washburn MP, et al. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001;19:242–247. doi: 10.1038/85686.
