Author manuscript; available in PMC: 2018 Apr 5.
Published in final edited form as: Adv Biol Regul. 2017 Nov 20;67:128–133. doi: 10.1016/j.jbior.2017.11.005

Normalization of mass spectrometry data (NOMAD)

Carl Murie a, Brian Sandri b, Ann-Sofi Sandberg a, Timothy J Griffin c,d, Janne Lehtiö a, Christine Wendt b, Ola Larsson a
PMCID: PMC5885284  NIHMSID: NIHMS951780  PMID: 29174395

Abstract

iTRAQ and TMT reagent-based mass spectrometry (MS) are commonly used technologies for quantitative proteomics in biological samples. Such studies are often performed over multiple MS runs, which can introduce MS run bias that affects downstream analysis. Such MS data have therefore commonly been normalized using a reference sample that is included in each MS run. We show, however, that reference normalization does not effectively remove systematic MS run bias. A linear model approach was previously proposed to improve on the reference normalization approach but does not scale computationally to larger data sets. Here we describe the NOMAD (normalization of mass spectrometry data) R package, which implements a computationally efficient ANOVA normalization approach with protein assembly functionality. NOMAD provides the same advantages as the linear regression solution but is more computationally efficient, which allows superior scaling to larger sample sizes. Moreover, NOMAD effectively removes bias, which enables valid comparisons across MS runs.

1. Introduction

In addition to understanding the roles of metabolites and signaling molecules (Anderson et al., 2016; Sakane et al., 2017), proteome-wide identification and quantification of proteins is considered pivotal for elucidating mechanisms underlying biological systems and pathological states (Bantscheff et al., 2012; Hassanein et al., 2011). For this purpose, tandem mass spectrometry (MS/MS)-based quantification using isobaric labels such as isobaric tags for relative and absolute quantification (iTRAQ) (Ross et al., 2004) and tandem mass tags (TMT) (Thompson et al., 2003) was introduced (we will refer to these as isobaric tags below). In this technology, peptides are chemically labelled with isobaric tags which, when fragmented in MS/MS, produce sample-specific reporter ions, allowing simultaneous relative quantitative comparison of peptide abundance across multiple samples within a single MS run (Rauniyar and Yates, 2014). Current commercially available isobaric tags are limited to a maximum of ten samples per run, so larger sample sets need to be divided across several distinct MS runs. As a result, the technology poses challenges during normalization, not only because of potential bias from sample preparation but in particular because of bias introduced across multiple MS runs, which can be related to the particular samples in an isobaric set, the MS instrument or peptide fractionation performance.

At present, isobaric tag data produced over multiple MS runs are commonly normalized using a reference approach. To allow for comparison across multiple iTRAQ or TMT runs, one isobaric tag in each iTRAQ or TMT set is used to label a reference sample, and the abundance in each query sample is expressed relative to that reference. Such relative measures (commonly referred to as iTRAQ or TMT ratios) are assumed to control for bias between MS runs. However, it is common to observe moderate to strong MS run bias after normalization using the reference approach, which will hamper and potentially disqualify downstream analyses (MS run bias from two data sets generated in two different laboratories is shown in Fig. 1). A possible explanation for this failure to remove MS run bias is that each sample in an MS run uses the same reference for normalization, so that MS run bias may persist and potentially even be incorporated during normalization to such a reference. The use of reference samples also imposes additional limitations, including a reduction in the number of query samples per MS run and a doubling of the across MS run variance due to the calculation of abundance ratios (Kerr et al., 2000).
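The variance doubling attributed to abundance ratios can be made explicit with a short derivation. This is a sketch under two assumptions that are not stated in the text: the log-scale measurement errors of a query sample and of the reference sample are independent, and both have the same variance. The symbols Y (query abundance), R (reference abundance) and σ² are introduced here for illustration only.

```latex
\begin{aligned}
\operatorname{Var}\!\left(\log_2 Y - \log_2 R\right)
  &= \operatorname{Var}(\log_2 Y) + \operatorname{Var}(\log_2 R)
     - 2\operatorname{Cov}(\log_2 Y, \log_2 R) \\
  &= \sigma^2 + \sigma^2 - 0 \\
  &= 2\sigma^2
\end{aligned}
```

Under these assumptions, a log ratio to the reference carries twice the error variance of a single log abundance.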

Fig. 1. Normalization to reference samples does not eliminate run bias.


Distribution of p-values from an ANOVA model evaluating the contribution of MS run on raw or reference normalized protein abundance scores from studies by Roark et al. and Sandberg & Lehtiö. The raw data (first row) show a strong MS run bias (i.e. an elevated frequency of low p-values for a difference in protein levels depending on MS run) which reduces the efficacy of protein abundance comparisons across MS runs. The reference normalization method reduces the MS run bias but does not eliminate such bias.

As a result of these shortcomings, a linear regression approach has been proposed for normalization of isobaric tag data that is applicable to data sets generated over multiple MS runs (Hill et al., 2008; Oberg et al., 2008). Several sources of bias, including MS run, isobaric tag and peptide, can be used as regression factors in the model. The residuals of the model are the peptide abundances after removing the bias from the regression factors, and these can be used to calculate protein abundances. A significant issue, even for relatively modestly sized studies performed across a few MS runs, is the computational complexity of solving the linear model, which can make the method impossible to apply (see results below). Some solutions to this problem have been proposed, such as iterative or stagewise regression (Oberg et al., 2008), but to date there is no implementation of this approach. Here we provide the NOMAD (NOrmalization for MAss spectrometry Data) R package, which implements an ANOVA normalization method designed for isobaric tag mass spectrometry data in a computationally efficient manner.

2. Methods

2.1. NOMAD normalization of MS data

The structure of the factorial design ANOVA model allows for a simple algebraic solution identical to the more computationally demanding matrix solution of a linear model. We use the following equation for a two-factor ANOVA design including an interaction to illustrate this (Draghici, 2012):

$$X_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \qquad (1)$$

where μ is the overall mean, α and β are the factors used in normalization, i and j index the i-th and j-th levels of their respective factors, and k indexes the k-th data point in the ij-th cell. The residuals are defined as:

$$\varepsilon_{ijk} = X_{ijk} - \mu_{ij} \qquad (2)$$

In equation (2), each data point (Xijk) belonging to a particular factor level has the mean of that factor level (μij) subtracted. The residual of each data point (εijk) is what remains after the means of all factor levels that the data point belongs to have been subtracted. This is done sequentially: the residuals obtained after subtracting the level means for one factor are used as the data for calculating and subtracting the level means of the next factor. This process can be extended to any number of normalization factors and their interactions, with the remaining residuals used as peptide abundances.
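The sequential subtraction of level means can be sketched in a few lines of Python. This is an illustration of the idea, not the NOMAD implementation: the function names, the toy data and the encoding of an interaction as a pair of levels are all assumptions made here. The toy design is balanced, which is the setting in which sequential mean subtraction is expected to agree with the least-squares residuals.

```python
from collections import defaultdict


def subtract_level_means(values, levels):
    """Subtract from each value the mean of all values sharing its level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, level in zip(values, levels):
        sums[level] += value
        counts[level] += 1
    return [v - sums[lev] / counts[lev] for v, lev in zip(values, levels)]


def sequential_residuals(values, factors):
    """Apply subtract_level_means once per factor, feeding residuals forward."""
    residuals = list(values)
    for levels in factors:
        residuals = subtract_level_means(residuals, levels)
    return residuals


# Hypothetical balanced toy design: 2 runs x 2 tags, two data points per cell.
run = ["r1", "r1", "r1", "r1", "r2", "r2", "r2", "r2"]
tag = ["t1", "t1", "t2", "t2", "t1", "t1", "t2", "t2"]
x = [10.0, 10.4, 11.0, 11.2, 12.0, 12.2, 13.4, 13.8]

# An interaction is encoded as the pair of levels that defines each cell.
run_by_tag = list(zip(run, tag))

# Main effects first, then the interaction, as in the two-factor model (1).
residuals = sequential_residuals(x, [run, tag, run_by_tag])
```

After the final interaction step, each residual equals the data point minus the mean of its run-by-tag cell, which is the form of equation (2).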

2.2. Implementation of the NOMAD R-package

The NOMAD R-package provides two main functions. The first function (nomadNormalization) applies an ANOVA model to remove the bias of multiple factors and produces normalized peptide abundances. The second function (nomadProteinAssembly) combines the normalized peptide abundances into summary protein abundances; the user is given multiple options as to how the proteins are assembled. Moreover, the nomadNormalization function allows for correction for imperfect synthesis of isobaric tags. A necessary preprocessing step is to reformat the peptide level output of the MS quantification software (e.g. Protein Pilot) so that it can be used as input to NOMAD. Because of the multitude of tailored software for such quantification, NOMAD does not contain functions for this pre-processing, although we provide an example for a commonly used software in the NOMAD R vignette.
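To illustrate what a protein assembly step does conceptually, the sketch below summarizes normalized peptide abundances into one value per protein and sample. It is not the nomadProteinAssembly function: the function name, the input layout and the choice of the median as the summary are assumptions made here, and the text does not state which summaries the package offers.

```python
from collections import defaultdict
from statistics import median


def assemble_proteins(rows):
    """Summarize peptide values per (protein, sample) using the median.

    rows: iterable of (protein, sample, normalized_peptide_abundance).
    The median is an illustrative choice, not a documented NOMAD option.
    """
    grouped = defaultdict(list)
    for protein, sample, value in rows:
        grouped[(protein, sample)].append(value)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical normalized peptide abundances (residuals on a log scale).
peptide_rows = [
    ("P1", "s1", 0.2), ("P1", "s1", 0.4), ("P1", "s1", -0.1),
    ("P1", "s2", -0.3), ("P1", "s2", -0.5),
    ("P2", "s1", 1.0),
]
protein_abundance = assemble_proteins(peptide_rows)
```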

2.3. Data sets

Herein we used two data sets. The first data set was previously published (Roark et al., 2015) and is an 8-plex iTRAQ data set over 3 runs. Each run contained two reference samples, resulting in 18 samples. Peptide level data were extracted as described (Roark et al., 2015). The second data set originates from plasma samples and is an 8-plex iTRAQ experiment performed over 4 runs (Sandberg and Lehtiö, unpublished). In brief, proteins extracted from plasma depleted of highly abundant proteins were subjected to trypsinization, the resulting peptides were iTRAQ labelled, and the samples were pooled and pre-fractionated by high-resolution isoelectric focusing prior to LC-MS/MS (Branca et al., 2014). Each LC-MS/MS run contained one or two reference samples, resulting in a total of 27 samples. In addition to NOMAD normalized expression data, we also generated non-normalized (raw) data by running the nomadProteinAssembly function on non-normalized peptide data. For reference normalized data, this was followed by subtraction of reference sample log2 protein abundances per MS run.
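The reference normalization described here, subtracting reference log2 protein abundances within each MS run, can be sketched as follows for a single protein. The function name, the sample names and the abundances are hypothetical. Averaging the log2 values when a run contains two reference samples is an assumption of this sketch, since the text does not say how multiple references were combined.

```python
from math import log2
from statistics import mean


def reference_normalize(abundance, run_of, reference_samples):
    """Return log2 abundances of query samples relative to their run's reference.

    abundance: {sample: raw protein abundance}
    run_of: {sample: MS run label}
    reference_samples: set of sample names that are references
    """
    reference_logs = {}
    for sample in reference_samples:
        reference_logs.setdefault(run_of[sample], []).append(log2(abundance[sample]))
    # Assumption: several references in one run are averaged on the log2 scale.
    reference_level = {run: mean(logs) for run, logs in reference_logs.items()}
    return {
        sample: log2(value) - reference_level[run_of[sample]]
        for sample, value in abundance.items()
        if sample not in reference_samples
    }


# Hypothetical abundances for one protein measured in two runs.
abundance = {"ref1": 64.0, "a": 128.0, "b": 32.0, "ref2": 256.0, "c": 512.0}
run_of = {"ref1": "run1", "a": "run1", "b": "run1", "ref2": "run2", "c": "run2"}
normalized = reference_normalize(abundance, run_of, {"ref1", "ref2"})
```

In this toy example, samples a and c both end up one log2 unit above their respective references even though their raw abundances differ fourfold between runs.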

2.4. Calculation of statistics for bias and changes in gene expression

We calculated statistics for changes in protein levels (using the lm function in R) depending on time to bronchiolitis obliterans syndrome (BOS) for the data set from Roark et al. (a continuous variable; the R model used was protein ∼ patient + BOS) or on membership in one of 3 sample classes (Sandberg and Lehtiö, unpublished; the R model used was protein ∼ class). Similarly, statistics for run-dependent bias were calculated by using MS run as the variable (i.e. the R model protein ∼ run).
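For a single categorical variable such as MS run, the model protein ∼ run corresponds to a one-way ANOVA. The sketch below computes the F statistic and its degrees of freedom for one protein. It is not the authors' R code: the function name and the toy values are assumptions, and it does not cover the two-term model protein ∼ patient + BOS.

```python
def one_way_f(values, groups):
    """Return (F, df_between, df_within) for a one-way ANOVA of values by group."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)

    n_total = len(values)
    n_groups = len(by_group)
    grand_mean = sum(values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)

    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical abundances of one protein in nine samples from three MS runs.
protein = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ms_run = ["run1"] * 3 + ["run2"] * 3 + ["run3"] * 3
f_stat, df_between, df_within = one_way_f(protein, ms_run)
```

The p-value is the upper tail probability of the F distribution with these degrees of freedom. It is left out here to keep the sketch free of external dependencies; a statistics library would normally supply it.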

2.5. Software

The NOMAD R-package is available from: https://github.com/carlmurie/NOMAD.

3. Results

3.1. Identification of factors for effective NOMAD normalization

We observed an enrichment of low p-values from ANOVA models assessing the relationship between protein levels and MS run in two independent data sets from different laboratories after normalization to reference samples (Fig. 1). This indicates that reference-based normalization does not fully address MS run batch effects. For effective normalization with the NOMAD approach, the factors that underlie the observed bias need to be identified. Possible factors for normalization of MS data include peptides, proteins, MS run (e.g. isobaric labelled sets of samples) and isobaric tags. We applied these factors within NOMAD, both singly and with interactions, to assess their effectiveness in addressing run bias (Fig. 2). This showed that the interactions of MS run with the peptide and isobaric tag factors must be included in the ANOVA model to consistently obtain a uniform distribution of p-values for MS run bias (which is expected in the absence of MS run bias; Fig. 2). NOMAD therefore uses this model as the default for normalization. Notably, however, single or interaction factors can be added or removed (in the nomadNormalization function), thereby allowing custom normalization of, in principle, any type of data. The impact of normalization factor selection can be evaluated using the nomadCheckBias function, which produces plots showing the extent of MS run bias (i.e. like those shown in Fig. 2). Thus, in contrast to normalization to the reference (Fig. 1), NOMAD normalization completely removes MS run bias, thereby allowing analysis independent of MS run (Fig. 2).
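The diagnostic used in this section, checking whether p-values for the MS run effect are uniformly distributed, can be approximated with a simple histogram. The sketch below is not the nomadCheckBias function: the function names, the number of bins and the example p-values are assumptions made for illustration.

```python
def pvalue_histogram(pvalues, bins=10):
    """Count p-values in equal-width bins covering [0, 1]."""
    counts = [0] * bins
    for p in pvalues:
        index = min(int(p * bins), bins - 1)  # p == 1.0 goes into the last bin
        counts[index] += 1
    return counts


def low_pvalue_enrichment(pvalues, bins=10):
    """Ratio of observed to expected counts in the lowest bin.

    Under a uniform distribution (no run bias) the ratio is expected to be near 1.
    """
    counts = pvalue_histogram(pvalues, bins)
    expected_per_bin = len(pvalues) / bins
    return counts[0] / expected_per_bin


# Hypothetical p-values for the MS run effect across ten proteins.
run_pvalues = [0.001, 0.004, 0.01, 0.02, 0.03, 0.04, 0.25, 0.55, 0.75, 0.95]
histogram = pvalue_histogram(run_pvalues)
enrichment = low_pvalue_enrichment(run_pvalues)
```

An enrichment well above 1 in the lowest bin corresponds to the elevated frequency of low p-values that the figures show for raw and reference normalized data.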

Fig. 2. Identification of factors necessary for efficient normalization using NOMAD.


Distribution of p-values from an ANOVA model evaluating the contribution of MS run on NOMAD normalized protein abundance scores (data set from Roark et al. in the left column and the unpublished data set by Sandberg and Lehtiö in the right column). NOMAD normalization was performed using different models as indicated in the panels. A uniform frequency of p-values for MS run bias indicates that the model removes the bias. In the indicated models, a “+” shows which factors were used and a “*” indicates that the interaction between two factors was used (i.e. identical to the syntax in R).

3.2. Data sets normalized using NOMAD retain changes in gene expression depending on biological variables

NOMAD normalization can only be applied to data sets with balanced designs (i.e. all [or almost all] conditions are represented in each isobaric labelled set of samples, and hence in each MS run) or randomized designs (i.e. samples are randomized between isobaric sets and thereby MS runs). This is required because if, for example, all control samples are assayed in one MS run whereas the treatment samples are processed in a second run, one cannot separate MS run bias from the biological differences between the two conditions. Thus, if the design is substantially unbalanced, NOMAD could remove biological effects by treating them as MS run bias. The first study (by Roark et al.) used a balanced design whereby 3 separate time points from each patient were assayed in the same MS run and each MS run contained 2 patients. The second study (Sandberg and Lehtiö, unpublished) used a randomized design but maintained approximately the same number of samples from the three sample classes in each MS run. To assess the effects of NOMAD normalization on biological interpretation, we calculated statistics for the biological effects using the data set normalized to the reference or with the default NOMAD normalization. The default NOMAD normalization, which was identified above as providing the best normalization (bottom row in Fig. 2), resulted in a slight reduction in the number of proteins with low p-values for the analysis of the relationship between protein levels and time to bronchiolitis obliterans syndrome (Fig. 3) (Roark et al., 2015). In the study of plasma proteins (Sandberg and Lehtiö, unpublished), reference normalization led to fewer low p-values than expected by chance when examining whether protein levels changed between the three experimental groups (i.e. the distribution of p-values contains more large than small p-values, thereby deviating from the null distribution of p-values). This is consistent with a substantial unaccounted source of variance. The same analysis using NOMAD normalized data as input (default NOMAD normalization) led to an essentially flat distribution of p-values when examining the relationship between protein expression and the three experimental groups. Thus, across the two data sets, NOMAD normalization either only slightly reduced the levels of significance (presumably by removing MS run bias) or even led to a dramatic increase in the frequency of proteins showing low p-values for the biological effect (likely by eliminating MS run bias from the residual errors in the applied ANOVA models). NOMAD therefore efficiently removes MS run bias (Fig. 2) while retaining, or even enhancing, analysis of the biological effects. Importantly, such analysis can now be performed without the unwanted variance caused by a batch (MS run) effect.
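Whether a design is balanced enough for this kind of normalization can be checked before any data are analysed by tabulating conditions against MS runs. The sketch below is a hypothetical helper, not part of NOMAD: the function names and sample annotations are assumptions. It flags runs in which one or more conditions are absent, which is the confounded situation described above.

```python
from collections import Counter, defaultdict


def design_table(run_of, condition_of):
    """Count samples per condition within each MS run."""
    table = defaultdict(Counter)
    for sample, run in run_of.items():
        table[run][condition_of[sample]] += 1
    return {run: dict(counts) for run, counts in table.items()}


def runs_missing_conditions(run_of, condition_of):
    """Return {run: [conditions absent from that run]} for incomplete runs."""
    all_conditions = set(condition_of.values())
    missing = {}
    for run, counts in design_table(run_of, condition_of).items():
        absent = sorted(all_conditions - set(counts))
        if absent:
            missing[run] = absent
    return missing


# Hypothetical confounded design: each condition sits in its own run.
confounded_runs = {"s1": "run1", "s2": "run1", "s3": "run2", "s4": "run2"}
confounded_conditions = {"s1": "control", "s2": "control",
                         "s3": "treated", "s4": "treated"}
confounded = runs_missing_conditions(confounded_runs, confounded_conditions)

# Hypothetical balanced design: both conditions appear in every run.
balanced_conditions = {"s1": "control", "s2": "treated",
                       "s3": "control", "s4": "treated"}
balanced = runs_missing_conditions(confounded_runs, balanced_conditions)
```

An empty result does not guarantee a good design, since counts per condition can still be very uneven, so the full table from design_table is worth inspecting as well.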

Fig. 3. Default NOMAD normalization does not hamper identification of biological effects.


Shown are densities of p-values from ANOVA models assessing the biological effects (i.e. time to BOS [Roark et al.] or class [Sandberg & Lehtiö]) when using reference normalized data or default NOMAD normalized data.

3.3. NOMAD efficiently scales to large data sets

A key feature of NOMAD is the ability to scale for larger data sets and we therefore compared the performance of the regression approach to NOMAD using data sets of different sizes (Table 1). Indeed, normalization of even modestly sized data sets can only be performed with NOMAD. Thus, NOMAD allows for ANOVA based normalization of data sets for which there are currently no available methods.

Table 1.

Comparison of computation times between the regression approach and NOMAD. The number (N) of proteins/peptides/data points for each data set tested together with the associated computation times (t) using the linear regression approach (lm function in R) or NOMAD are shown. The data sets were randomly sampled from the unpublished data set by Sandberg & Lehtiö. Server: 2.00 GHz, 128G RAM, R version: 3.1.2

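| N proteins | N peptides | N data points | NOMAD (t) | lm (t) |
|---|---|---|---|---|
| 10 | 435 | 6552 | 0.3 s | 44 s |
| 100 | 1980 | 32072 | 1.6 s | 1 h 31 min |
| 200 | 3360 | 53600 | 3 s | 9 h 32 min |
| 300 | 5046 | 93920 | 7.5 s | 63 h 58 min |
| 3857 | 72599 | 1144536 | 2 min 32 s | cannot complete |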

4. Discussion

Here we present a solution for computationally efficient ANOVA normalization of iTRAQ/TMT MS data that scales well even for the largest studies. Moreover, NOMAD is better able to address bias from multiple MS runs than the commonly used reference approach. Conveniently, reference samples are not required for NOMAD, thus freeing all channels to be used for samples of biological interest. NOMAD thus allows for direct comparisons of protein abundances across MS runs, e.g. for the purpose of exploratory analysis, which may otherwise be obscured by strong correlations between samples processed in the same MS run. Therefore, experimental designs can now include more factors of biological interest and increased sample sizes while being normalized efficiently, as long as balanced or randomized designs are applied.

Acknowledgments

We would like to acknowledge Sue Van Riper (University of Minnesota Center for Mass Spectrometry and Proteomics, CMSP) for critical review of this manuscript and Pratik Jagtap (University of Minnesota, CMSP) for technical advice. This research was supported by grants from the Swedish Research Council, the Swedish Cancer foundation and the Wallenberg Academy Fellows program (O.L.); the Swedish Foundation for Strategic Research (SSF; to J.L.) and NIH R01 HL107612 (C.W.). T.J.G. was supported in part by grant 1147079 from the U.S. National Science Foundation.

References

  1. Anderson KE, Juvin V, Clark J, Stephens LR, Hawkins PT. Investigating the effect of arachidonate supplementation on the phosphoinositide content of MCF10a breast epithelial cells. Adv Biol Regul. 2016;62:18–24. doi: 10.1016/j.jbior.2015.11.002.
  2. Bantscheff M, Lemeer S, Savitski MM, Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem. 2012;404(4):939–965. doi: 10.1007/s00216-012-6203-4.
  3. Branca RM, Orre LM, Johansson HJ, Granholm V, Huss M, Perez-Bercoff A, Forshed J, Kall L, Lehtio J. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat Methods. 2014;11(1):59–62. doi: 10.1038/nmeth.2732.
  4. Draghici S. Analysis of Variance - ANOVA. In: Statistics and Data Analysis for Microarrays Using R and Bioconductor. 2nd ed. Chapman and Hall/CRC; Florida, USA: 2012.
  5. Hassanein M, Rahman JS, Chaurand P, Massion PP. Advances in proteomic strategies toward the early detection of lung cancer. Proc Am Thorac Soc. 2011;8(2):183–188. doi: 10.1513/pats.201012-069MS.
  6. Hill EG, Schwacke JH, Comte-Walters S, Slate EH, Oberg AL, Eckel-Passow JE, Therneau TM, Schey KL. A statistical model for iTRAQ data analysis. J Proteome Res. 2008;7(8):3091–3101. doi: 10.1021/pr070520u.
  7. Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol. 2000;7(6):819–837. doi: 10.1089/10665270050514954.
  8. Oberg AL, Mahoney DW, Eckel-Passow JE, Malone CJ, Wolfinger RD, Hill EG, Cooper LT, Onuma OK, Spiro C, Therneau TM, Bergen HR 3rd. Statistical analysis of relative labeled mass spectrometry data from complex samples using ANOVA. J Proteome Res. 2008;7(1):225–233. doi: 10.1021/pr700734f.
  9. Rauniyar N, Yates JR 3rd. Isobaric labeling-based relative quantification in shotgun proteomics. J Proteome Res. 2014;13(12):5293–5309. doi: 10.1021/pr500880b.
  10. Roark S, Sandri B, Dey S, Steinbach M, Becker T, Wendt CH. Longitudinal protein expression patterns in bronchiolitis obliterans syndrome. Pulmonol Respir Res. 2015;3(5):1–5.
  11. Ross PL, Huang YN, Marchese JN, Williamson B, Parker K, Hattan S, Khainovski N, Pillai S, Dey S, Daniels S, Purkayastha S, Juhasz P, Martin S, Bartlet-Jones M, He F, Jacobson A, Pappin DJ. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteom. 2004;3(12):1154–1169. doi: 10.1074/mcp.M400129-MCP200.
  12. Sakane F, Mizuno S, Takahashi D, Sakai H. Where do substrates of diacylglycerol kinases come from? Diacylglycerol kinases utilize diacylglycerol species supplied from phosphatidylinositol turnover-independent pathways. Adv Biol Regul. 2017;S2212–4926(17):30152–5. doi: 10.1016/j.jbior.2017.09.003.
  13. Thompson A, Schafer J, Kuhn K, Kienle S, Schwarz J, Schmidt G, Neumann T, Johnstone R, Mohammed AK, Hamon C. Tandem mass tags: a novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Anal Chem. 2003;75(8):1895–1904. doi: 10.1021/ac0262560.
