Abstract
As a technique that allows simultaneous quantitation of proteins in multiple samples, iTRAQ (isobaric Tags for Relative and Absolute Quantitation) has gained increased interest and applications in proteomics research. Despite its success, iTRAQ data present a number of statistical challenges even after the proteins and peptides are identified and the peak areas of the reported ions are estimated for peptide intensities. In this article, we review recent studies on the analysis of iTRAQ data, the computation problems involved and the nonrandom missingness in the iTRAQ data.
Keywords and phrases: iTRAQ, ANOVA, Nonrandom missing, Bayesian hierarchical model, Mass spectrometry
1. INTRODUCTION
One main objective of proteomics research is to detect and quantify all proteins present in a biological sample. Proteins that exhibit an increase or decrease in abundance between distinct proteomes (e.g., disease and nondisease or control and treatments) are potential biomarkers. Many different techniques have been developed to simultaneously compare protein levels across multiple samples. One method that has gained increased attention is iTRAQ [10, 14, 23, 30], a shotgun technique that uses Isobaric Tags for Relative and Absolute Quantitation. Compared to other methods such as 2DE [20], ICAT (isotope-coded affinity tags) [4], and DIGE (differential gel electrophoresis) [5, 21], iTRAQ offers improved quantitative reproducibility, higher sensitivity [32], and has broad applications in proteomics research [1, 2, 8, 13, 29, 33].
Using four or eight isobaric tags, iTRAQ can simultaneously analyze up to eight biological samples [3, 23]. The four reagents used in the 4-plex version of iTRAQ are named 114, 115, 116 and 117. The eight reagents include these four and four additional reagents named 113, 118, 119 and 121. Each reagent is composed of a peptide reactive group and an isobaric tag that consists of a reporter group and a balance group. The peptide-reactive group specifically reacts with primary amine groups of peptides. The reporter group gives strong signature ions in tandem mass spectrometry (MS/MS) and is used to determine the relative abundance of a peptide. The balance group keeps the overall mass of the isobaric tag constant. With this property, identical peptides labelled with different isobaric tags will not be distinguishable in mass spectrometry.
In the experimental workflow for iTRAQ, unlabelled protein samples are first trypsin-digested and labelled with different isobaric tags independently. These labelled peptides from different samples are then mixed together and separated by liquid chromatography. Identical peptides from different samples labelled with different isotopes are chromatographically indistinguishable and appear as a single precursor. The isolated peptides are finally run through MS/MS for further fragmentation and generate a collection of mass spectra. The property of isobaric tags allows otherwise identical peptides from different samples to be detected as a single peak by mass spectrometry and to produce a single set of sequencing ions in MS/MS. The ion signals produced from the reporter regions together with the normal fragment ions provide information on peptide identification and quantitation for different samples. Using software such as MASCOT (Matrix Science Inc., Boston, MA, USA), a protein database search can be performed on the fragmentation data to identify the labelled peptides and hence the corresponding proteins. The relative abundance of low molecular mass reporter ions generated from the isobaric tags can then be used to quantify the relative abundance of peptides and proteins across the samples studied.
The observed peptide intensities are approximated by the peak areas of the ions originating from the isobaric tags used to label different samples. Several factors can affect the observed peptide intensities, such as the expression level of the protein that generates the peptide, some peptide specific features relating to different efficiency in ionization and fragmentation, different amounts of samples loaded into different channels, differences in sensitivity to instrument detection, sample preparation and experimental variations. Hill et al. [7] described in detail these biological and experimental factors and incorporated them into an ANOVA model to evaluate differential protein expression from iTRAQ data that are generated by a single experiment or multiple experiments.
One commonly encountered issue in iTRAQ data analysis is data missingness. Due to the nature of the technology, the overlap in identified proteins and peptides between replicate experiments is less than ideal, and many peptides are only observed for some samples in some spectra, leading to a large amount of missing data. For example, in a controlled study with 9 technical replicates described in [16], only 35.4% of the total 1,751 proteins were found in every experiment. Wang et al. [31] found that the total number of features identified in an experiment decreased over time by 49–73%. In a study of the effect of Caveolin-1 in three pairs of wild-type mice and knock-out Cav-1 mice, only about 1/3 of the proteins were identified in all three experiments, and only 1/4 peptides originating from these proteins were identified in all experiments [17]. These studies found that missingness does not occur at random. Instead, the probability that a protein/peptide is missing is related to its abundance. Less abundant peptides are harder to detect due to the data-dependent acquisition of the analysis process, hence more likely to be missing. This presents a nonignorable missing data problem. Ignoring the nonrandom missing pattern in statistical analysis may lead to significant bias in statistical inference and scientific conclusions.
To identify differentially expressed proteins across samples, one common approach is to calculate the ratio of the observed peptide intensities between two samples and to compare the calculated ratios against pre-specified upper and lower bounds. However, the criterion for threshold selection is subjective. For example, Seshi [27] considered iTRAQ ratios >5/4 or <4/5 as significant, whereas Salim et al. [26] used thresholds 1.20 and 0.83. These thresholds fail to consider the variability in data and are not statistically based. In this paper we review emerging new statistical approaches to quantitative proteomics that address the variations and missingness in iTRAQ data.
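To make the subjectivity of fixed cutoffs concrete, the following minimal sketch applies the two threshold pairs quoted above to the same set of ratios. The protein names and ratio values are hypothetical and serve only as an illustration; the function `classify_ratio` is not taken from any of the cited studies.

```python
def classify_ratio(ratio, upper, lower):
    """Label a treated/control intensity ratio using fixed cutoffs."""
    if ratio > upper:
        return "up"
    if ratio < lower:
        return "down"
    return "unchanged"

# Hypothetical iTRAQ ratios for four proteins (illustration only).
ratios = {"P1": 1.30, "P2": 1.22, "P3": 0.81, "P4": 0.75}

# Cutoffs 5/4 and 4/5, as in Seshi [27].
calls_a = {p: classify_ratio(r, 1.25, 0.80) for p, r in ratios.items()}
# Cutoffs 1.20 and 0.83, as in Salim et al. [26].
calls_b = {p: classify_ratio(r, 1.20, 0.83) for p, r in ratios.items()}

# Proteins whose call depends on which cutoffs were chosen.
disagreements = sorted(p for p in ratios if calls_a[p] != calls_b[p])
```

In this toy input, two of the four proteins change status when the cutoffs change, although the data are identical. Neither rule uses the variability of the measurements.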
2. ANOVA ANALYSIS
Hill et al. [7] carefully studied the sources of variations in iTRAQ and applied ANOVA models to incorporate these variations in inferring differentially expressed proteins. They performed the normalization and quantification of differential protein expression with a single model fit to the observed peptide intensities obtained from the reporter ion peak areas from all observed tandem mass spectra. Their model relates differences in treatment to relative differences in protein expression, relates protein expression to peptide expression, and relates peptide expression to observed reporter ion peak areas. These relationships are captured using simple multiplicative expressions in the original scale, which is equivalent to a simple additive model in the logarithmic scale. The computational issues involved in fitting the ANOVA model to medium or large global proteomics data sets were studied by Oberg et al. [19].
2.1 Model
Suppose that there are K iTRAQ experiments and the proteome contains I proteins. Let j(i) indicate the j-th peptide derived from the i-th protein, s index the biological sample obtained under a particular treated or control condition, and l index the isobaric tag labeling the sample.
We use yijksln to denote the log transformed value of the observed intensity for the j-th peptide derived from the i-th protein in the s-th biological sample, the k-th experiment, the l-th labeling reagent and the n-th MS/MS spectrum. Then the observed value is decomposed as
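$$y_{ijksln} = \mu + b_k + v_{k,l} + p_i + f_{j(i)} + r_s + r_{i,s} + g_{j(i),s} + h_{i,j(i),k,s,l,n} \tag{1}$$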
where μ represents the grand mean, bk describes the effect due to a given iTRAQ experiment, vk,l describes the experimental effects of loading, mixing, and other sample handling effects, pi represents the protein effect, fj(i) corresponds to the peptide effect, rs denotes the sample effect, ri,s denotes the proteins differentially expressed between samples, and gj(i),s denotes the peptides differentially expressed between samples obtained under different conditions. The term hi,j(i),k,s,l,n represents the residual error for each observation that is not captured by the model. To ensure identifiability, one level of each predictor is referred to as the variable’s “reference level”. So the parameters in (1) (except μ) represent the relative effect of the corresponding predictor, and the value of each parameter corresponding to the “reference level” is zero. For example, if the sample from the control condition is referred to as the “reference sample”, then rs is the relative amount of total protein comparing the s-th sample to the reference sample, and ri,s denotes the relative amount of protein i comparing the s-th sample to the reference sample (the primary parameter of interest). When s indicates the reference sample, ri,s = rs = 0.
The terms in (1) are arranged into three groups describing the experimental effects, the protein and peptide effects, and the differences between samples (or the treatment effects). The first group (μ + bk + vk,l) describing the experimental effects includes variations in the amount of samples loaded into iTRAQ channels, the labeling efficiency, the mixing of labelled samples, and so on. These effects would not exist in an ideal world of perfectly reproducible instruments, experiment procedures, and subjects. The second group (pi + fj(i)) describes the differential effects of protein i and the j-th peptide derived from this protein. It has been observed that if a single purified protein is trypsinized and the results subjected to mass spectrometry, the reported peptide abundances may vary by two to three orders of magnitude. The term fj(i) captures the variation of the expected amount of the j-th peptide relative to the expected amount of the i-th protein for subjects in the reference condition. The third group of effects (rs + ri,s + gj(i),s) captures the quantities of primary research interest, from which we infer the differentially expressed proteins and/or peptides between samples obtained under different treatment conditions. The term gj(i),s captures the effect of conditions at the peptide level. There are certainly biological conditions where a change to the levels of one or more peptides, but not the protein as a whole, will occur; for example a post-translational modification that involved a peptide substitution.
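The additive structure of (1) and the reference-level convention can be illustrated with a small noise-free sketch. All effect values below are hypothetical, the dictionary names are chosen here for illustration only, and the peptide-level interaction is set to zero.

```python
import math

# Hypothetical effects on the natural-log scale; every reference level is 0.
MU = 10.0                                    # grand mean
B = {1: 0.0, 2: 0.3}                         # experiment effect b_k
V = {(1, 114): 0.0, (1, 115): 0.1,           # loading/mixing effect v_{k,l}
     (2, 114): 0.0, (2, 115): -0.2}
P = {"A": 0.0, "B": 1.5}                     # protein effect p_i
F = {("A", 1): 0.0, ("A", 2): -0.7, ("B", 1): 0.0}  # peptide effect f_{j(i)}
R_S = {"ctrl": 0.0, "trt": 0.05}             # sample effect r_s
R_IS = {("A", "trt"): 0.9}                   # protein-by-sample effect r_{i,s}
G = {}                                       # peptide-by-sample effect, all zero

def expected_log_intensity(i, j, k, s, l):
    """Sum of the terms of model (1) without the residual error."""
    return (MU + B[k] + V[(k, l)] + P[i] + F[(i, j)]
            + R_S[s] + R_IS.get((i, s), 0.0) + G.get((i, j, s), 0.0))

# Same peptide and experiment; control on tag 114, treated on tag 115.
ref = expected_log_intensity("A", 1, 1, "ctrl", 114)
trt = expected_log_intensity("A", 1, 1, "trt", 115)
log_ratio = trt - ref          # equals v_{1,115} + r_trt + r_{A,trt}
fold_change = math.exp(log_ratio)
```

The raw log ratio between the two channels mixes the loading effect and the overall sample effect with the protein-specific term ri,s. This is why normalization has to precede or accompany the estimation of differential expression.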
2.2 Model fitting
Parameters in models like (1) generally can be estimated using the standard method of least squares. However, the large size of global proteomics data sets may result in hundreds and thousands of parameters involved in the model (1), making it hard to estimate all of the parameters simultaneously using current software and computing facilities. Oberg et al. [19] described the following methods to partition the modeling process into a normalization portion (bias removal) and a differential expression portion.
2.2.1 Subsetting
This method partitions the global proteomics dataset into subsets by proteins and estimates the parameters separately for each identified protein. This will lead to biased estimates of parameters in the ANOVA model because model (1) involves the “experimental effects” (bk, vk,l), which affect all proteins in an experiment. For example, a larger (or smaller) total amount of protein mixture loaded in an iTRAQ experiment will cause all of the proteins in that experiment to have higher (or lower) intensities. Fitting model (1) separately for each protein will lead to different estimates of the global experimental effects for different proteins, which is unreasonable. So estimating the experimental effects for each protein individually rather than globally leads to incorrect normalization.
2.2.2 Stagewise regression
Denote the three groups of terms in the model (1) as groups I, II, and III, where group I corresponds to the experimental effects, group II corresponds to the protein and/or peptide effects, and group III corresponds to the differential expression portions of the model. The stagewise regression strategy fits the model to the entire data set in a stagewise fashion, that is, first group I, followed by group II, and then group III. Each individual fit is then computationally simple.
However, for the stagewise approach to give correct answers, it is necessary that the parameter estimates from the multiple stages are uncorrelated. In other words, to get unbiased estimates of parameters in the ANOVA model (1), it is necessary that the portions of the linear model design matrix corresponding to the multiple stages are orthogonal, which is not satisfied by MS data. For iTRAQ data, missingness is very common. Each global proteomics experiment detects different sets of proteins, resulting in an unbalanced data set for which the experimental and the protein/peptide parameters are correlated. Due to the imbalance in the proteomics data, groups I and II are not orthogonal. It has been found that the estimation bias in the stagewise estimation of group I can be extreme due to missing data [31]. Wang et al. [31] proposed to compute the experimental effects only on the balanced subset of peptides that appear in all experiments as one approach to avoid this. To more efficiently use the data, [19] proposed to use all the data in an ANOVA model. Considering the imbalance in the data across multiple iTRAQ experiments, [19] proposed to estimate the group II effects together with the group I effects for correct estimation of group I terms. When the fraction of differentially expressed proteins is small, group III is nearly orthogonal to the group I and II model parts. Thus, estimating the differential expression terms in group III separately from the terms in groups I and II is likely to be reliable for most research studies. However, estimation of groups I and II simultaneously is still too large for current computational resources.
2.2.3 Iterative regression
Iterative regression is an alternative approach proposed in [19] to address the estimation of groups I and II simultaneously. The Gauss-Seidel algorithm [6], for instance, also known as backfitting, is one iterative technique that iterates over the stages, so that each stage is repeatedly re-fit given the solution to the previous stages. Specifically, the iterative regression for model fitting of (1) works as below. First, backfitting is used to iteratively solve for parameters in groups I and II (the experimental and protein/peptide terms). Second, the final result of the iterative fit is used to normalize the data by subtracting out the systematic bias factors from the fits of groups I and II. The residuals are the normalized data. Third, these normalized values are used as inputs for estimating the differential expression effects in group III. In this analysis, the term gj(i),s in group III is removed assuming that there will be differential expression of certain proteins between the samples of interest but that any increase in protein expression will affect all of the peptides for that protein equally. With the peptide effects included in the normalization stages of the model fitting, the group III parameters are separable and can be estimated one protein at a time. Thus, the normalized data are used as inputs for the differential expression model, and the latter was fit separately for each of the identified proteins. In summary, the normalization terms (bk, vk,l, pi, fj(i)) are estimated globally, whereas the group III differential protein effects (ris, rs) are not. Fitting group III parameters on a protein-by-protein basis assumes that each protein has a different variance parameter, rather than a global variance parameter.
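The contrast between a one-pass stagewise fit and backfitting can be sketched on a toy unbalanced data set. The example below is an illustration under simplifying assumptions and is not the implementation of [19]: it keeps only an experiment effect and a peptide effect, uses noise-free hypothetical values, and fixes the effect of experiment 1 at zero as the reference level.

```python
from collections import defaultdict

# (experiment, peptide, log intensity). True effects: b_1 = 0, b_2 = 1,
# f_a = 10, f_b = 12, f_c = 8. Peptide "c" is missing in experiment 2.
data = [(1, "a", 10.0), (1, "b", 12.0), (1, "c", 8.0),
        (2, "a", 11.0), (2, "b", 13.0)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def stagewise_experiment_effects(data, ref=1):
    """Stage-one estimate: difference of raw experiment means."""
    by_exp = defaultdict(list)
    for k, _, y in data:
        by_exp[k].append(y)
    ref_mean = mean(by_exp[ref])
    return {k: mean(ys) - ref_mean for k, ys in by_exp.items()}

def backfit(data, ref=1, n_iter=100):
    """Gauss-Seidel style updates of peptide and experiment effects."""
    b = defaultdict(float)   # experiment effects, reference stays at 0
    f = defaultdict(float)   # peptide effects
    experiments = sorted({k for k, _, _ in data})
    peptides = sorted({p for _, p, _ in data})
    for _ in range(n_iter):
        for pep in peptides:
            f[pep] = mean(y - b[k] for k, p, y in data if p == pep)
        for k in experiments:
            if k != ref:
                b[k] = mean(y - f[p] for kk, p, y in data if kk == k)
    return dict(b), dict(f)

b_stage = stagewise_experiment_effects(data)
b_iter, f_iter = backfit(data)
```

With these inputs the stagewise estimate of the experiment 2 effect is 2.0, twice the true value of 1.0, because the low-intensity peptide is absent from experiment 2. The backfitting iterations converge to the true value, since each update of the experiment effect is made after adjusting for the current peptide effects.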
2.2.4 Mixed effects models
Treating some effects, such as fj(i), in the model (1) as random is equivalent to assuming a prior distribution for the corresponding parameters. This introduces additional global parameters, the hyperparameters in the prior distributions, to the mixed effects model. Similar computational issues are involved in this mixed effects model. It is computationally challenging to fit the entire model to all data simultaneously for large datasets. Fitting separate models for each protein is invalid with respect to the global parameters. Because of data imbalance, the portions of the linear model design matrix corresponding to the multiple stages are not orthogonal, so the requirement of a stagewise approach is not satisfied. So parameters from groups I and II must be estimated together to correctly estimate the group effects. But the standard iterative regression methods available for fixed effects models are not applicable to mixed effects models, and a solution remains an open problem.
2.3 Differential protein expression
With the fitted model for (1), the log difference of expression levels for protein i between the s-th sample and the reference sample (without loss of generality, let s = 1 for the reference sample), denoted by θi,s, is estimated by
(2) |
where Ji is the number of peptides derived from protein i. The 95% confidence interval for θi,s is constructed under the assumption of the normality of θ̂i,s as given by
$$\hat{\theta}_{i,s} \pm 1.96\, \widehat{\mathrm{SE}}(\hat{\theta}_{i,s})$$
Hill et al. [7] and Oberg et al. [19] studied the factors that could lead to variations in the observed peptide intensities and inferred differential protein expression via ANOVA analysis. The model (1) includes the experiment-to-experiment variation which increases with the introduction of additional experiments. Not all model elements are identifiable from one application to the next, and model (1) does not include all sources of error, either. For example, Keshamouni et al. [12] proposed an alternative ANOVA model for the analysis of data from a single iTRAQ experiment comparing a control and treated sample. Neither ANOVA model considers the missingness in iTRAQ data, potentially biasing their results.
3. NONRANDOM MISSINGNESS
Luo et al. [17] overcome the limitations of ANOVA models through a Bayesian framework that incorporates the nonrandom missingness in iTRAQ data sets. Their model assumes that the measured peptide intensities are affected by both protein expression levels and peptide specific effects. The values of these two effects across multiple experiments are modeled as random effects. When a sample is labelled with multiple tags in a single experiment, the variations across different isobaric tags are also modelled as random effects. The nonrandom missingness of peptide data is modeled with a logistic regression which relates the missingness probability for a peptide with the expression level of the protein that produces this peptide. A Markov chain Monte Carlo method tailored for this model was developed for the inference of relative expression levels across different samples.
3.1 Model
We focus on describing the model for iTRAQ data from multiple experiments and the estimation of the relative expression levels of proteins. When the iTRAQ data is obtained from multiple experiments, [17] utilizes a Bayesian hierarchical model in the sense that the model has an observation component that models the observed peptide intensities as random effects whose conditional distribution depends on the expected protein expression levels and peptide effects, and a second (hierarchical) component that defines the distributions of these expected values.
In Luo et al. [17], the labelling effects are assumed to be removed by normalization methods such as quantile normalization. Assume that there are S (≥2) biological samples studied in K (≥2) experiments. Since multiple isobaric tags may label the same sample in one experiment, let Ls ≥ 1 denote the number of tags labelling the s-th sample. Then ∑s Ls = M is the number of isobaric tags used in one experiment, which is 4 when we use 4-plex isobaric reagents and 8 in the 8-plex version. Assume that there are I proteins in the sample and Ji peptides for the ith protein. For the lth label of the sth sample in the kth experiment, let ykijsln denote the log transformed value of measured observed intensity for the jth peptide of the ith protein from the nth spectrum. Note that j should be more appropriately denoted as j(i) to explicitly indicate that peptides are nested within proteins, and l should be denoted as l(s) to indicate the lth labelled tag of the sth sample. For notational simplification, we omit the parentheses. The measured intensity of a peptide depends on the protein expression level and the peptide effect. Let xkisl denote the log transformed expression level of the ith protein of the sth sample with the lth labelling tag in the kth experiment. Let zkij denote the log transformed peptide effect for the jth peptide of the ith protein in the kth experiment. Luo et al. [17] considered an additive model for ykijsln (k = 1, …, K; i = 1, …, I; j = 1, …, Ji; s = 1, …, S; l = 1, …, Ls; n = 1, …, Nkijsl):
$$y_{kijsln} = x_{kisl} + z_{kij} + \varepsilon_{kijsln} \tag{3}$$
which corresponds to a multiplicative model in the original scale. In (3), εkijsln is assumed to be independently normally distributed with mean 0 and variance σ².
3.1.1 Missing data mechanism
The statistical model for peptide missingness in [17] was motivated by the study on the dataset obtained from the study of the roles of Caveolae for postnatal cardiovascular function. In this research, three experiments were conducted where the protein profiles from two wild-type mice and two knock-out Cav-1 mice were analyzed by iTRAQ with four isobaric tags in each experiment. Luo et al. [17] studied the proportion of peptides observed in one experiment but missing in another experiment, and found that there was a negative correlation between the missing probability and peptide intensity. In other words, less abundant peptides are more likely to be missing since they are harder to detect due to the data-dependent acquisition of the analysis process. Observing that there was an approximate linear relationship between the peptide missing probability and the observed intensity at the logit scale, Luo et al. [17] modeled the missing probability through a simple logistic regression model:
$$\mathrm{logit}\, P(I_{kijsln} = 1 \mid y_{kijsln}) = a + b\, y_{kijsln} \tag{4}$$
where Ikijsln = 1 indicates that the jth peptide of the ith protein is measured in the kth experiment, the lth replicate of the sth sample and the nth spectrum. Formula (4) implies that the logit of the probability of peptide missingness is linearly dependent on its intensity. It is expected that b > 0 because peptides with lower intensities are more likely to be missing.
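The consequence of an intensity-dependent observation probability can be seen in a short simulation. This is a sketch under assumed values: the intercept a = -10, the slope b = 1, and the normal distribution of the log intensities are hypothetical choices, and the helper `prob_observed` is introduced here only for illustration.

```python
import math
import random

def prob_observed(y, a, b):
    """Inverse logit of a + b*y: chance that an intensity y is recorded."""
    return 1.0 / (1.0 + math.exp(-(a + b * y)))

random.seed(0)
a, b = -10.0, 1.0   # hypothetical coefficients with b > 0

# Hypothetical true log intensities of 20,000 peptide measurements.
y_true = [random.gauss(10.0, 1.5) for _ in range(20000)]

# Keep each value with a probability that increases with its intensity.
y_obs = [y for y in y_true if random.random() < prob_observed(y, a, b)]

mean_true = sum(y_true) / len(y_true)
mean_obs = sum(y_obs) / len(y_obs)
bias = mean_obs - mean_true   # positive: observed values overstate abundance
```

Because low intensities are dropped more often, the mean of the observed values lies above the mean of all values. An analysis that treats the missing entries as missing at random inherits this upward bias.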
3.1.2 Priors
The Bayesian hierarchical framework in [17] takes into account the variabilities across experiments and samples, and assumes that xkisl and zkij are independently normally distributed across different experiments, i.e.:
$$x_{kisl} \sim N(x_{isl}, \sigma_x^2) \tag{5}$$
$$z_{kij} \sim N(z_{ij}, \sigma_z^2) \tag{6}$$
where xisl and zij denote the protein and peptide effects averaged over multiple experiments, respectively. The protein expression levels in different replicates (labelled with different tags) of the same sample are also assumed to be normally distributed:
$$x_{isl} \sim N(x_{is}, \sigma_l^2) \tag{7}$$
where xis denotes the expression level of the ith protein in the sth sample. Assumptions (5)–(7) lead to an equivalent form of (3):
$$y_{kijsln} = x_{is} + z_{ij} + (x_{kisl} - x_{isl}) + (z_{kij} - z_{ij}) + (x_{isl} - x_{is}) + \varepsilon_{kijsln} \tag{8}$$
where xkisl − xisl and zkij − zij denote the random effects across experiments, and xisl − xis denotes the variation among multiple replicates of the same sample. When a sample is labelled with a unique isobaric tag in an experiment, there is no replicate variation component within a sample. Formula (8) is a mixed-effects model. To ensure the identifiability of the model, the restriction xi1 = 0 is added. Then xis denotes the expression level of the ith protein in the sth sample relative to the first sample.
The second level of priors are normal distributions for xis and zij:
(9) |
(10) |
The hierarchical model is finished by assuming inverse gamma distributions as priors for the hyperparameters of variance: and , where γ1 and γ2 denote the shape and scale parameters of a gamma distribution, respectively, and assuming a ~ N(0, ν2) and b ~ N(0, ν2). The posterior distributions of relevant parameters are simulated by MCMC simulations and differentially expressed proteins are identified by analyzing the posterior distribution of xis.
3.2 Comparison to ANOVA analysis
The most important difference between this Bayesian model in [17] and the ANOVA model proposed by Hill et al. [7] and Oberg et al. [19] is that [17] clearly modeled the nonignorable missingness in iTRAQ data. Oberg et al. [19] remarked at the end of their paper that using a censoring mechanism to fit the model would be a natural next step. Instead of censoring the data at an unknown threshold value, [17] modeled a higher probability of peptide missingness for lower peptide intensities. These two methods also differ in terms of variations included in the model. The experimental effect and the replicative effect (when multiple tags label a sample) are considered constants for all proteins in the ANOVA model. In contrast, [17] modeled them as random effects that were specific to peptides and (or) proteins. Furthermore, the ANOVA analysis involves additional effects such as the labelling effect and the interaction between labelling and experimental effect gj(i),s, which are not modeled in [17]. Inclusion of the labelling effect is determined by the experiment design. When identical tags are used to label the same samples in multiple experiments, the labelling effect is not identifiable since it is confounded with the sampling effect. It is meaningful to include the labelling effect only when different tags are used to label the same samples in multiple experiments. For the interaction between labelling and experimental effect gj(i),s, although it is theoretically appropriate to have it in the model, there exists large uncertainty in the estimate of gj(i),s due to the small number of replicates (or no replicates) for each sample.
The common assumption in both the Bayesian method and the ANOVA analysis is that all of the peptide-based observations accurately reflect the intact proteins. We ignore the possibility of homologous genes resulting in two or more proteins that share identical and nonidentical peptides as well as the possibility of post-transcriptional modifications. Although (1) includes the interaction between peptide effects and treatment (gj(i),s), it is removed in the analysis of [19]. This term is not included in [17] either. So both [17] and [19] assume that certain proteins will have differential expressions across samples under different treatments, but that any change in protein expression will affect all of the peptides for that protein equally.
3.3 Nonrandom missingness in mass spectrometry data
The model described in this subsection, proposed by Wang et al. [31], targets mass spectrometry data in general and is not tailored for iTRAQ data. But since iTRAQ data are obtained by running the isolated peptides through MS/MS, this probability model provides an alternative way of studying the missingness in iTRAQ. Wang et al. [31] proposed to first remove sources of systematic variation between MS profiles via global normalization, and then to investigate the intensity-dependent missingness and to impute the missed peptide intensities.
3.3.1 Global normalization
In their global normalization, [31] assumed that the sample intensities are all related by a constant factor which is to be chosen. In order to avoid the possible bias due to the nonrandom missingness in mass spectrometry data, Wang et al. proposed to use the top L ordered statistics (e.g., medians) of peptide intensities in each sample for rescaling, where L is a user-specified parameter. Let K (K > 2) be the number of MS profiles. Denote the observed intensities of the k-th profile as , where nk is the number of peptides identified in the k-th profile. For a given number L , the population median is defined as
and the scaling coefficient for normalization of the k-th profile is
(11) |
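The idea of rescaling on the top L order statistics can be sketched as follows. The exact definitions used in [31] are not reproduced here, so this is one plausible reading and an assumption: each profile is summarized by the median of its L largest intensities, the reference value is the median of these summaries across profiles, and the scaling coefficient is their ratio. The function names and toy intensities are hypothetical.

```python
import statistics

def top_l_median(intensities, L):
    """Median of the L largest intensities in one profile."""
    top = sorted(intensities, reverse=True)[:L]
    return statistics.median(top)

def scaling_coefficients(profiles, L):
    """Ratio of each profile's top-L median to the median across profiles."""
    medians = [top_l_median(p, L) for p in profiles]
    reference = statistics.median(medians)
    return [m / reference for m in medians]

# Toy profiles: the second is 2x and the third is 0.5x the first, and the
# lowest-intensity peptides are undetected in the second and third.
profiles = [
    [100.0, 80.0, 60.0, 40.0, 20.0, 10.0],
    [200.0, 160.0, 120.0, 80.0],
    [50.0, 40.0, 30.0, 20.0, 10.0],
]

lam = scaling_coefficients(profiles, L=3)
normalized = [[x / l for x in p] for p, l in zip(profiles, lam)]

# For comparison: scaling by the median of all observed peptides.
naive = [statistics.median(p) / statistics.median(profiles[0]) for p in profiles]
```

In this toy input the top-L coefficients recover the true factors 1, 2 and 0.5, whereas the all-peptide medians give 2.8 for the second profile because its weakest peptides were never observed. The choice of L remains a user decision, as noted in the text.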
3.3.2 Nonrandom missingness and imputation
To account for the nonrandom missingness, Wang et al. [31] proposed to impute the missed peptide intensity in one sample with the ratio of the observed intensity in another sample divided by a scale coefficient estimated from the intensities of other peptides observed in both samples. Suppose the minimum detectable level of the instrument is d. Let be the true abundance of the j-th peptide in the k-th profile corresponding to the observed value . A peptide may or may not exist in a profile. Let be a latent variable indicating the presence of the j-th peptide in the k-th profile, with if the j-th peptide exists in the k-th profile, and otherwise. Then if . Let be the density function of when , we have
(12) |
where I0(·) indicates a point-mass at zero. With (12), Wang et al. [31] assumed that the true abundance of a peptide has a mixture distribution. With probability , the peptide does not exist in the k-th profile, and the abundance is zero. With probability , the peptide exists, and the distribution of the abundance is described by .
The missed value of the intensity level of the j-th peptide present in the k-th profile is imputed by the expected value , which is calculated as
(13) |
where the first equality is due to the fact that , and the second equality is due to the fact that when the j-th peptide exists in the k-th profile , no signal detection is equivalent to low intensity . The term in (13) can be determined when and d are specified, and , the probability that the j-th peptide exists in the k-th profile when no signal is detected, can be calculated as
(14) |
where the third equality holds because and . The term in (14) can be obtained from the density function when the latter is specified, and
(15) |
So when the conditional density and d are specified, the missed peptide intensity can be imputed with (13)–(15).
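The three steps of the imputation can be sketched numerically once a density is chosen. The example assumes a normal conditional density with mean `mu` and standard deviation `sigma` on the intensity scale, a prior presence probability `p_exist`, and a detection level `d`. These parameter names and values are hypothetical, and the code is an illustration of the calculation rather than the implementation of [31].

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def impute_undetected(mu, sigma, d, p_exist):
    """Expected true abundance of a peptide given that no signal was seen."""
    alpha = (d - mu) / sigma
    # Probability that a present peptide falls below the detection level.
    p_below = norm_cdf(alpha)
    # Bayes rule: probability that the peptide is present given no signal.
    p_present = p_exist * p_below / (p_exist * p_below + (1.0 - p_exist))
    # Mean of a normal variable truncated above at d.
    mean_below = mu - sigma * norm_pdf(alpha) / p_below
    # An absent peptide contributes abundance zero.
    return p_present * mean_below, p_present

imputed, p_present = impute_undetected(mu=5.0, sigma=1.0, d=5.0, p_exist=0.5)
```

With the detection level equal to the mean, a present peptide is missed half of the time, so the probability of presence given no signal is 1/3 and the imputed value is one third of the truncated mean. The sketch does not guard against `p_below` underflowing to zero when `d` lies far below `mu`.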
The minimum instrument detectable level parameter d is estimated by the background noise level in all MS raw profiles from the same instrument, denoted as d̂. Then the detectable level of the k-th profile is d̃(k) = d̂/λ(k), where λ(k) is the normalization scale coefficient in (11). Wang et al. [31] assume that
independently for k = 1, 2, …, K. This is equivalent to the assumption that the density function of when , is N(λ(k)μj, (λ(k)σj)2). In the special case that σj ≪ |d̃(k) − μj| and biological replications are available, Wang et al. [31] provided estimators for the missing probability and the imputed value as below:
where
and
The imputed data is used for further analysis such as estimation, clustering of proteins and differential protein identification.
The model proposed by Wang et al. [31] differs from the Bayesian model proposed by Luo et al. [17] in the following three ways. First, in [31], intensities lower than a certain level are censored and the censoring parameter is estimated based on the background noise levels; in [17], a logistic regression model is built to relate the missing probability with the potential true intensity. With the observation that less abundant peptides are more likely to be missing, the model based missing mechanism in [17] which links the probability of missing with peptide intensity is more reasonable than the censoring mechanism in [31]. Second, [31] conducts single value imputation and imputes the missed intensities with the expected values, while [17] conducts multiple imputation and simulates the posterior distributions of missed values. Third, [31] is not tailored for iTRAQ analysis and sources of variations should be removed when applying the idea in [31] to iTRAQ data. The strength of [31] lies in the smaller computation burden. When the density is specified, the missed peptide intensity can be easily imputed with the expected value obtained from formula (13).
4. CONCLUSION AND FUTURE DIRECTIONS
Protein and peptide identification from MS/MS data has been addressed by many researchers [9, 11, 15, 18, 22, 24, 25, 28]. In this article, we have focused on the quantitation of protein and peptide expression levels from data generated by iTRAQ, a shotgun technique that uses isobaric tags to label peptides from different samples and analyzes the labelled peptides with tandem mass spectrometry. We have reviewed studies on the sources of variation, the computational problems involved, and the nonrandom missingness in iTRAQ data. These studies begin after the database search for protein and peptide identification has been carried out on the collection of spectra, and after the peak areas of the ions originating from the isobaric tags have been normalized to estimate peptide intensities. The uncertainties in protein and peptide identification and in peak area evaluation are not considered. Furthermore, these studies assume that all of the peptide-based observations accurately reflect the intact proteins. It is possible, however, that homologous genes give rise to two or more proteins that share some peptides and differ in others. The possibility of post-transcriptional modifications is also ignored. Protein quantitation would therefore benefit from improvements in protein identification and in peak area evaluation from mass spectra.
As discussed above, because of the complex nature of iTRAQ data, it is very important to use sound experimental design and analysis strategies when using iTRAQ technology to detect and quantify relative protein expression levels across samples, especially when multiple experiments are involved. Poor experimental design and analysis may confound signals with noise and leave protein and peptide effects indistinguishable from systematic variation. To achieve the best power in sample comparisons, it is important to balance the treatment groups across experiments and to randomize the assignment of isobaric tags to samples, as much as possible, when applying iTRAQ in comparative proteomic research.
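The design advice above (balance treatment groups across experiments, randomize the tags) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name assign_tags, the dictionary layout, and the requirement that the number of tags be a multiple of the number of groups are hypothetical choices, not a procedure taken from the studies reviewed here. The 4-plex tag names are those given in the Introduction.

```python
import random

TAGS_4PLEX = ["114", "115", "116", "117"]


def assign_tags(samples_by_group, n_experiments, tags=TAGS_4PLEX, seed=None):
    """Return a list of rows (experiment, tag, sample, group) in which
    every experiment holds the same number of samples from each group
    and tags are assigned to samples at random within each experiment."""
    rng = random.Random(seed)
    groups = sorted(samples_by_group)
    if len(tags) % len(groups) != 0:
        raise ValueError("number of tags must be a multiple of the number of groups")
    per_group = len(tags) // len(groups)

    # Shuffle the samples within each group so that the allocation of
    # samples to experiments is random as well.
    pools = {}
    for g in groups:
        pool = list(samples_by_group[g])
        if len(pool) != per_group * n_experiments:
            raise ValueError("group %r needs %d samples" % (g, per_group * n_experiments))
        rng.shuffle(pool)
        pools[g] = pool

    design = []
    for e in range(n_experiments):
        chosen = []
        for g in groups:
            start = e * per_group
            chosen += [(s, g) for s in pools[g][start:start + per_group]]
        # Randomize which tag labels which sample in this experiment.
        shuffled_tags = list(tags)
        rng.shuffle(shuffled_tags)
        for tag, (sample, group) in zip(shuffled_tags, chosen):
            design.append({"experiment": e + 1, "tag": tag,
                           "sample": sample, "group": group})
    return design


if __name__ == "__main__":
    # Hypothetical sample identifiers.
    layout = {"control": ["c1", "c2", "c3", "c4"],
              "treatment": ["t1", "t2", "t3", "t4"]}
    for row in assign_tags(layout, n_experiments=2, seed=0):
        print(row)
```

With two groups and the 4-plex tags, each experiment receives two control and two treatment samples, so group effects are not confounded with experiment effects, and the shuffled tags guard against a systematic tag effect aligning with a group.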
The nonrandom missingness in iTRAQ is modeled with a simple logistic regression in [17]. It is natural to consider more complex missingness models that include polynomial or local polynomial terms in the logistic regression, if the latter better describe the relationship between the missing probability and the peptide intensity. These missingness models can also be built in the Bayesian hierarchical structure as in [17] to infer the relative expression levels of proteins across samples.
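As a minimal illustration of such a missingness model, the Python sketch below evaluates a logistic missing probability whose linear predictor is a polynomial in the peptide intensity. The function name and the coefficient values are hypothetical; in a Bayesian hierarchical model such as [17] the coefficients would be inferred from the data rather than fixed in advance.

```python
import math


def logistic_missing_probability(intensity, coefs):
    """P(missing | intensity) under a logistic model whose linear
    predictor is coefs[0] + coefs[1]*x + coefs[2]*x**2 + ...
    A two-element coefs list gives the simple logistic regression;
    longer lists add polynomial terms."""
    eta = sum(c * intensity ** p for p, c in enumerate(coefs))
    # Evaluate the logistic function in a numerically stable way.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


if __name__ == "__main__":
    # Hypothetical coefficients: missingness decreases with intensity.
    linear = [2.0, -1.0]
    quadratic = [2.0, -1.0, -0.1]
    for x in (0.0, 2.0, 4.0):
        print(x,
              logistic_missing_probability(x, linear),
              logistic_missing_probability(x, quadratic))
```

A negative slope encodes the observation that less abundant peptides are more likely to be missing; the extra quadratic coefficient lets that relationship bend, which is the kind of flexibility the polynomial extension is meant to provide.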
ACKNOWLEDGEMENTS
This work was supported in part by NIH grants HV28286, DA018343, GM59507 and NSF grant DMS 0714817. It was also supported in part by the Yale University Biomedical High Performance Computing Center and NIH grant RR19895, which funded the instrumentation.
Contributor Information
Ruiyan Luo, Email: matrxl@langate.gsu.edu, Department of Mathematics and Statistics, Georgia State University, 30 Pryor Street, Atlanta, GA 30303, USA.
Hongyu Zhao, Email: hongyu.zhao@yale.edu, Department of Epidemiology and Public Health, Yale University, 300 George Street, Suite 503, New Haven, CT 06511, USA.
REFERENCES
- 1. Boylan KL, Andersen JD, Anderson LB, Higgins L, Skubitz AP. Quantitative proteomic analysis by iTRAQ for the identification of candidate biomarkers in ovarian cancer serum. Proteome Science. 2010 doi: 10.1186/1477-5956-8-31. http://www.proteomesci.com/content/8/1/31.
- 2. Casado-Vela J, Martínez-Esteso MJ, Rodriguez E, Borrás E, Elortza F, Bru-Martínez R. iTRAQ-based quantitative analysis of protein mixtures with large fold change and dynamic range. Proteomics. 2010:343–347. doi: 10.1002/pmic.200900509.
- 3. Choe L, D’Ascenzo M, Relkin NR, Pappin D, Ross P, Williamson B, Guertin S, Pribil P, Lee KH. 8-plex quantitation of changes in cerebrospinal fluid protein expression in subjects undergoing intravenous immunoglobulin treatment for Alzheimer’s disease. Proteomics. 2007;7:3651–3660. doi: 10.1002/pmic.200700316.
- 4. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nature Biotechnology. 1999;17:994–999. doi: 10.1038/13690.
- 5. Hamdan M, Righetti PG. Modern strategies for protein quantification in proteome analysis: Advantages and limitations. Mass Spectrometry Reviews. 2002;21:287–302. doi: 10.1002/mas.10032.
- 6. Hastie TJ, Tibshirani RJ. Generalized Additive Models. New York: Chapman and Hall; 1990. MR1082147.
- 7. Hill EG, Schwacke JH, Comte-Walters S, Slate EH, Oberg AL, Eckel-Passow JE, Therneau TM, Schey KL. A statistical model for iTRAQ data analysis. Journal of Proteome Research. 2008;7:3091–3101. doi: 10.1021/pr070520u.
- 8. Hu H-D, Ye F, Zhang D-Z, Hu P, Ren H, Li S-L. iTRAQ quantitative analysis of multidrug resistance mechanisms in human gastric cancer cells. Journal of Biomedicine and Biotechnology. 2010 doi: 10.1155/2010/571343.
- 9. Kall L, Canterbury J, Weston J, Noble WS, MacCoss MJ. A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods. 2007;4:923–925. doi: 10.1038/nmeth1113.
- 10. Karp NA, Huber W, Sadowski PG, Charles PD, Hester SV, Lilley KS. Addressing accuracy and precision issues in iTRAQ quantitation. Molecular & Cellular Proteomics. 2010;9:1885–1897. doi: 10.1074/mcp.M900628-MCP200.
- 11. Keller A, Nesvizhskii A, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002;74:5383–5392. doi: 10.1021/ac025747h.
- 12. Keshamouni VG, Michailidis G, Grasso CS, Anthwal S, Strahler JR, Walker A, Arenberg DA, Reddy RC, Akulapalli S, Thannickal VJ, Standiford TJ, Andrews PC, Omenn GS. Differential protein expression profiling by iTRAQ-2DLC-MS/MS of lung cancer cells undergoing epithelial-mesenchymal transition reveals a migratory/invasive phenotype. Journal of Proteome Research. 2006:1143–1154. doi: 10.1021/pr050455t.
- 13. Kilner J, Zhu L, Ow SY, Evans C, Corfe BM. Assessing the loss of information through application of the ‘two-hit rule’ in iTRAQ datasets. Journal of Integrated Omics. 2011;1:124–134.
- 14. Lau E, Lam MPY, Siu SO, Kong RPW, Chan WL, Zhou Z, Huang J, Lo C, Chu IK. Combinatorial use of offline SCX and online RP–RP liquid chromatography for iTRAQ-based quantitative proteomics application. Molecular BioSystems. 2011 doi: 10.1039/c1mb05010a.
- 15. Li Q, MacCoss MJ, Stephens M. A nested mixture model for protein identification using mass spectrometry. Ann. Appl. Stat. 2010;4:962–987. MR2758429.
- 16. Liu H, Sadygov RG, Yates JR. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Analytical Chemistry. 2004;76:4193–4201. doi: 10.1021/ac0498563.
- 17. Luo R, Colangelo CM, Sessa WC, Zhao H. Bayesian analysis of iTRAQ data with nonrandom missingness: Identification of differentially expressed proteins. Statistics in Bioscience. 2009 doi: 10.1007/s12561-009-9013-2.
- 18. Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003;75:4646–4653. doi: 10.1021/ac0341261.
- 19. Oberg A, Mahoney D, Eckel-Passow J, Malone C, Wolfinger R, Hill E, Cooper L, Onuma O, Spiro C, Therneau T, Bergen H. Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. Journal of Proteome Research. 2008;7:225–233. doi: 10.1021/pr700734f.
- 20. O’Farrell PH. High resolution two-dimensional electrophoresis of proteins. Journal of Biological Chemistry. 1975;250:4007–4012.
- 21. Patton WF. Detection technologies in proteome analysis. Journal of Chromatography. B, Analytical Technologies in the Biomedical and Life Sciences. 2002;771:3–31. doi: 10.1016/s1570-0232(02)00043-0.
- 22. Price TS, Lucitt MB, Wu W, Austin DJ, Pizarro A, Yocum AK, Blair IA, FitzGerald GA, Grosser T. Ebp, a program for protein identification using multiple tandem mass spectrometry data sets. Mol. Cell. Proteomics. 2007;6:527–536. doi: 10.1074/mcp.T600049-MCP200.
- 23. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Molecular & Cellular Proteomics. 2004;3:1154–1169. doi: 10.1074/mcp.M400129-MCP200.
- 24. Sadygov R, Liu H, Yates J. Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal. Chem. 2004;76:1664–1671. doi: 10.1021/ac035112y.
- 25. Sadygov R, Yates J. A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 2003;75:3792–3798. doi: 10.1021/ac034157w.
- 26. Salim K, Kehoe L, Minkoff MS, Bilsland JG, Munoz-Sanjuan I, Guest PC. Identification of differentiating neural progenitor cell markers using shotgun isobaric tagging mass spectrometry. Stem Cells and Development. 2006;15:461–470. doi: 10.1089/scd.2006.15.461.
- 27. Seshi B. An integrated approach to mapping the proteome of the human bone marrow stromal cell. Proteomics. 2006;6:5169–5182. doi: 10.1002/pmic.200600209.
- 28. Shen C, Wang Z, Shankar G, Zhang X, Li L. A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics. 2008;24:202–208. doi: 10.1093/bioinformatics/btm555.
- 29. Skorobogatko YV, Deuso J, Adolf-Bergfoyle J, Nowak MG, Gong Y, Lippa CF, Vosseller K. Human Alzheimer’s disease synaptic O-GlcNAc site mapping and iTRAQ expression proteomics with ion trap mass spectrometry. Amino Acids. 2010;40:765–779. doi: 10.1007/s00726-010-0645-9.
- 30. Unwin RD, Griffiths JR, Whetton AD. Simultaneous analysis of relative protein expression levels across multiple samples using iTRAQ isobaric tags with 2D nano LC-MS/MS. Nature Protocols. 2010;5:1574–1582. doi: 10.1038/nprot.2010.123.
- 31. Wang P, Tang H, Zhang H, Whiteaker J, Paulovich AG, Mcintosh M. Normalization regarding nonrandom missing values in high-throughput mass spectrometry data. Pacific Symposium on Biocomputing. 2006;11:315–326.
- 32. Wu WW, Wang G, Baek SJ, Shen R-F. Comparative study of three proteomic quantitative methods, DIGE, cICAT, and iTRAQ, using 2D Gel- or LC-MALDI TOF/TOF. Journal of Proteome Research. 2006;5:651–658. doi: 10.1021/pr050405o.
- 33. Ye H, Hill J, Kauffman J, Han X. Qualitative and quantitative comparison of brand name and generic protein pharmaceuticals using isotope tags for relative and absolute quantification and matrix-assisted laser desorption/ionization tandem time-of-flight mass spectrometry. Analytical Biochemistry. 2010;400:46–55. doi: 10.1016/j.ab.2010.01.012.