Diagnostics and correction of batch effects in large‐scale proteomic studies: a tutorial

Jelena Čuklina; Chloe H Lee; Evan G Williams; Tatjana Sajic; Ben C Collins; María Rodríguez Martínez; Varun S Sharma; Fabian Wendt; Sandra Goetze; Gregory R Keele; Bernd Wollscheid; Ruedi Aebersold; Patrick G A Pedrioli

doi:10.15252/msb.202110240

2021 Aug 25;17(8):e10240. doi: 10.15252/msb.202110240

PMCID: PMC8447595 PMID: 34432947

Abstract

Advancements in mass spectrometry‐based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much‐needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step‐by‐step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets each encompassing hundreds of samples and consisting of multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.

Keywords: batch effects, data analysis, large‐scale proteomics, normalization, quantitative proteomics

Subject Categories: Proteomics

In mass spectrometry‐based proteomics, handling large sample sets introduces technical variability known as batch effects. This tutorial provides guidelines and tools for the assessment, normalization, and batch correction of proteomics data.

Introduction

Recent advances in mass spectrometry (MS)‐based proteomic approaches have significantly increased sample throughput and quantitative reproducibility. As a consequence, large‐scale studies consisting of hundreds of samples are becoming increasingly common (Zhang et al, 2014, 2016; Liu et al, 2015; Mertins et al, 2016; Okada et al, 2016; Williams et al, 2016; Collins et al, 2017; Sajic et al, 2018). These technological and methodological advances, combined with proteins being the main regulators of the majority of biological processes, make MS‐based proteomics a key methodology for studying physiological processes and diseases (Schubert et al, 2017). MS‐derived quantitative measurements on thousands of proteins can, however, be affected by differences in sample preparation and data acquisition conditions such as different technicians, reagent batches, or changes in instrumentation. This phenomenon, known as “batch effects”, introduces noise that reduces the statistical power to detect the true biological signal. In the most severe cases, the biological signal ends up correlating with technical variables, leading to concerns about the validity of the biological conclusions (Petricoin et al, 2002; Hu et al, 2005; Akey et al, 2007; Leek et al, 2010).

Batch effects have been extensively discussed, both in the genomic community that made major contributions to the problem about a decade ago (Leek et al, 2010; Luo et al, 2010; Chen et al, 2011; Dillies et al, 2013; Lazar et al, 2013; Chawade et al, 2014) and in the proteomic community which has faced the issue quite recently (Gregori et al, 2012; Karpievitch et al, 2012; Chawade et al, 2014; Välikangas et al, 2018). Nevertheless, finding solutions to the problem of batch effects is still a topic of active research. Although extensive reviews have been written on the topic (Leek et al, 2010; Lazar et al, 2013), researchers still get confused about the terminology. For example, the distinction between normalization, batch effect correction, and batch effect adjustments is not always clear and these terms are often used interchangeably. To clarify how we use these terms in this Review, we compiled a glossary, found in Table 1. Some definitions are adapted from Leek et al, 2010.

Table 1.

Terminology.

Term	Definition
Batch effects	Systematic differences between the measurements due to technical factors, such as sample or reagent batches.
Normalization	Sample‐wide adjustment of the data with the intention to bring the distribution of measured quantities into alignment. Most prominently, sample means and medians are aligned after normalization.
Batch effect correction	Data transformation procedure that corrects quantities of specific features (genes, peptides, metabolites) across samples, to reduce differences that are associated with technical factors, recorded in the experimental protocol (i.e., sample preparation or measurement batches). Usually samples are assumed to be normalized prior to batch effect correction. This step is often called "batch effect removal" or "batch effect adjustment" in the literature. Note the difference in the definition used here.
Batch effect adjustment	Data transformation procedure that adjusts for differences between samples due to technical factors that altered the data (sample‐wise and/or feature‐wise). The fundamental objective of the batch effect adjustment is to make all samples comparable for a meaningful biological analysis. In our definition, batch effect adjustment is a two‐step transformation: first normalization, then batch effect correction. Performing normalization first helps feature‐level batch effect correction by first alleviating sample level discrepancies.

There is also considerable debate on which batch correction method performs best, and multiple articles have compared various methods (Luo et al, 2010; Chen et al, 2011; Chawade et al, 2014). Other publications advise checking the assumptions about the data before selecting the bias adjustment method (Goh et al, 2017; Evans et al, 2018).

The issue of batch correction is further complicated by the fact that each technology faces different issues. Specifically, RNA‐seq batch effect adjustment requires approaches that address sequencing‐specific problems (Dillies et al, 2013). Similarly, MS methods in proteomics (e.g., data‐dependent acquisition—DDA, data‐independent acquisition—DIA, and tandem mass tag—TMT) also present several field‐specific challenges. First, there is the problem of peptide to protein inference (Clough et al, 2012; Choi et al, 2014; Rosenberger et al, 2014; Teo et al, 2015; Muntel et al, 2019). As protein quantities are inferred from the quantities of measured peptides or even fragment ions, one needs to decide at which level to correct the data. Second, it is known that missing values can be associated with technical factors (Karpievitch et al, 2012; Matafora et al, 2017). Finally, when dealing with experiments with large sample numbers, typically in the order of hundreds, one needs to account for MS signal drift.

Here, we discuss the application of established approaches for batch effect adjustment. We also look at the methods that address MS‐specific challenges. We start by providing an overview of the workflow and a definition of key terms for each step. In addition to considering batch effect assessment and adjustment, we summarize the best practices for assessing the improvements in data quality post‐correction. We also devote a section to the implications of missing values in relation to batch effects and potential pitfalls related to their imputation. We finish with a discussion and a future perspective of the presented approaches.

To facilitate the application to practical use cases, we illustrate all the relevant steps using three large‐scale DIA and two DDA studies. For these "case studies", we primarily rely on the largest of the five datasets (i.e., Aging mouse study; preprint: Williams et al, 2021) and refer to the others where appropriate. The data analyses we show are only for illustration purposes and are not intended for deriving new biological insights.

Workflow overview

The purpose of this article is to guide researchers working with large‐scale proteomic datasets toward minimizing bias and maximizing the robustness and reproducibility of results generated from such data. The workflow starts from a matrix of quantified features (e.g., transitions, peptides, or proteins) across multiple samples, here referred to as “raw data matrix” and finishes with "batch‐adjusted" data, which are ready for downstream analyses (e.g., differential expression or network inference). We split the workflow into five steps, shown in Fig 1, and describe each of the steps below.

1. Initial assessment evaluates whether batch effects are present in raw data.
2. Normalization brings all samples from the dataset to a common scale.
3. Diagnostics of batch effects in normalized data. This step determines whether further correction is required.
4. Batch effect correction addresses feature‐specific biases.
5. Quality control tests whether bias has been reduced while retaining meaningful signals.

In the context of this article, we will use the term “adjust for batch effects” when referring to the whole workflow and “correct for batch effects” when referring to the correction of normalized data (see Table 1).

We provide a checklist that summarizes the most important points of the protocol in Table 2. It is also important to stress that batch factors should be already considered in the experimental design phase, to ensure that the data are not biased beyond repair, something that can happen when biological groups are completely confounded with sample preparation batches (Hu et al, 2005; Gilad & Mizrahi‐Man, 2015). For an extensive discussion on experimental design, we refer the reader to previously published materials on the topic (Oberg & Vitek, 2009; Čuklina et al, 2020). Here, we assume that the experiment has been designed with appropriate randomization and blocking, ensuring the correctability of bias caused by batch effects.

Table 2.

Batch effect processing checklist.

Step	Substeps
Experimental design^a	Randomize samples in a balanced manner to prevent confounding of biological factors with batches (technical factors).
	Consider adding replicates if possible, for example: (a) add replication for each technical factor; (b) regularly inject a sample mix every few (e.g., 10–15, but the exact number will need to be adjusted depending on experimental conditions) samples for control; (c) incorporate a sample mix per batch.
	Record all technical factors, both plannable and occurring unexpectedly.
Initial assessment	Check whether the sample intensity distributions are consistent.
	Check the correlation of all sample pairs.
	If intensities or sample correlations differ, check whether the intensities show batch‐specific biases.
Normalization	Choose a normalization procedure, appropriate for biological background and data properties.
Diagnostics	Using diagnostic tools, determine whether batch effects persist in the data.
	Use quality control already at this step and skip the correction if it is not necessary.
	Tip: If the goal is to determine differentially expressed proteins, and the batch effects are discrete or linear, multi‐factor ANOVA on normalized data is a sound statistical approach. This will adjust for batch effects while simultaneously identifying differentially expressed proteins. Note that "hits", or differentially expressed proteins, identified with this approach are valid even if diagnostic tools indicate the presence of batch effects. For more details on ANOVA methods, refer to (Rice, 2006).
Batch effect correction	Choose batch effect correction procedure, appropriate for the biological background and data properties, especially those detected at the previous step.
	Repeat the diagnostic step.
	Assess the ultimate benefit with quality control.
Quality control	Compare correlation of samples within and between the batches. Pay special attention to replicate correlation, if these are available.
Quality control	Compare correlation of peptides within and between the proteins.

^a For details on experimental design, see (Čuklina et al, 2020).

In the accompanying “proBatch” package, we implemented several methods with proven utility in batch effect analysis and adjustment. We also provide tips for integrating other tools that might be useful in this context, and for making them compatible. proBatch is made available as a Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/proBatch.html) and a pre‐built Docker container (https://hub.docker.com/r/digitalproteomes/probatch), as well as a GitHub repository (https://github.com/symbioticMe/batch_effects_workflow_code) of the workflow with all code and data required to reproduce the case study analyses.

Extensive comparison of various methods has been published previously (Luo et al, 2010; Chawade et al, 2014), and here, we summarize the best practices from these papers, as well as reviews (Leek et al, 2010; Lazar et al, 2013) and application papers (Collins et al, 2017; Sajic et al, 2018), and turn them into principles that can guide the reader in choosing an appropriate methodology.

Raw data matrix: choosing between protein/peptide/fragment level

This workflow starts with a raw data matrix, for which initial steps such as peptide‐spectrum matching, quantification, and FDR control have been completed. Data are assumed to be log‐transformed unless the variance stabilizing transformation (Durbin et al, 2002) is used. In the latter case, the data transformation is included in the normalization procedure.

We suggest performing batch effect adjustment on the peptide or fragment ion level, as this procedure alters feature abundances that are critical for protein quantity inference (Clough et al, 2012; Teo et al, 2015).

We also suggest that all detected peptides, including non‐proteotypic peptides and peptides with missed cleavages, be retained during batch effect adjustment. Keeping all measurements allows a better evaluation of the intensity distribution within each sample, which is critical for the subsequent normalization and correction steps.

Initial assessment

The goals of the initial assessment phase are to determine bias magnitude and sources and to select a normalization method. In most cases, the intensity distributions differ among samples. Comparing global quantitative properties such as sample medians or standard deviations helps with the choice of normalization methods and the identification of technical factors requiring further control.

Three approaches are particularly useful for initial assessment: (i) plotting the sample intensity average or median in order of MS measurement or technical batch, which allows estimation of MS drift or discrete bias in each batch; (ii) boxplots, which allow assessment of sample variance and outliers; and (iii) comparison of inter‐ vs. intrabatch sample correlation. A higher correlation of samples from the same batch compared with unrelated batches is a clear sign of bias. Optionally, a few proteins or peptides can be checked for signs of bias.
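As an illustration of approaches (i) and (iii), the following minimal sketch computes per-sample medians in run order and compares the mean sample correlation within and between batches. It is written in plain Python rather than with the proBatch package; the function names, the column-wise data layout (one list of log-intensities per sample, `None` for missing values), and the toy numbers are assumptions made for this example only.

```python
from itertools import combinations
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation using only positions measured in both samples."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return None
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def sample_medians(samples):
    """Median log-intensity per sample, in the order given (e.g., run order)."""
    return [median(v for v in col if v is not None) for col in samples]


def batch_correlation_summary(samples, batches):
    """Mean sample-to-sample correlation within and between batches."""
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        r = pearson(samples[i], samples[j])
        if r is None:
            continue
        (within if batches[i] == batches[j] else between).append(r)
    return mean(within), mean(between)


# Hypothetical toy data: four samples (columns) in run order, six peptides each
samples = [
    [20.6, 21.5, 18.4, 24.5, 21.6, 18.5],
    [20.5, 21.6, 18.5, 24.4, 21.5, 18.6],
    [19.4, 22.5, 17.6, 25.5, 20.5, 19.6],
    [19.5, 22.6, 17.5, 25.4, 20.4, 19.5],
]
batches = ["A", "A", "B", "B"]

print("medians in run order:", sample_medians(samples))
within, between = batch_correlation_summary(samples, batches)
print("mean correlation within batches:", within)
print("mean correlation between batches:", between)
```

A within-batch mean that clearly exceeds the between-batch mean points to batch-associated bias; plotting the medians against run order would show drift or step changes.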

Normalization

The goal of normalization is to bring all samples to the same scale to make them comparable. Commonly used methods of normalization are quantile normalization, median normalization, and z‐transformation. Two main considerations drive the choice of normalization method:

Heterogeneity of the data: If samples are fairly similar, the bulk of the proteome does not change, and thus, techniques such as quantile normalization (Bolstad et al, 2003) can be used. In datasets in which the samples are substantially different (i.e., when a large fraction of the variables are either positively or negatively affected by the treatment), different methods, such as HMM‐assisted normalization, can be used (Landfors et al, 2011). Additionally, if some samples are expected to have informative outliers (e.g., muscle tissue, in which a handful of proteins are several orders of magnitude more abundant than the rest of the proteome), methods that keep the relationship of outliers to the bulk proteome need to be used (Wang et al, 2021).
Distribution of sample intensities: The initial assessment step, especially boxplots, indicates which level of correction is required: In most cases, shifting the means or medians is enough, but when variances differ substantially, these need to be brought to the same scale as well.

It should be noted that after normalization, no further data correction might be required. This can be determined with the diagnostic plots and quality control methods described below. If the results are satisfactory, keeping data manipulation minimal is advisable.
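To make the difference between the two most common options concrete, the sketch below implements median normalization (a shift only) and quantile normalization (identical distributions) in plain Python. The function names and data layout (one list of log-intensities per sample, `None` for missing values) are assumptions for this example; the quantile variant shown here additionally assumes complete data, which real proteomic matrices rarely provide.

```python
from statistics import mean, median


def median_normalize(samples):
    """Shift each sample so that its median equals the median of all sample medians."""
    medians = [median(v for v in col if v is not None) for col in samples]
    target = median(medians)
    return [
        [None if v is None else v - m + target for v in col]
        for col, m in zip(samples, medians)
    ]


def quantile_normalize(samples):
    """Give every sample the same distribution (mean of the sorted columns).

    Assumes complete columns of equal length; ties are broken by position.
    """
    n = len(samples[0])
    sorted_cols = [sorted(col) for col in samples]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]
    normalized = []
    for col in samples:
        order = sorted(range(n), key=lambda k: col[k])  # indices from lowest to highest
        new_col = [0.0] * n
        for rank, k in enumerate(order):
            new_col[k] = reference[rank]
        normalized.append(new_col)
    return normalized


# Hypothetical toy data: three samples, four peptides, one missing value
with_missing = [
    [20.0, 22.0, None, 25.0],
    [21.0, 23.5, 19.0, 26.0],
    [19.0, 21.0, 17.5, 24.5],
]
print(median_normalize(with_missing))

complete = [
    [20.0, 22.0, 25.0],
    [21.0, 23.5, 26.0],
    [24.5, 19.0, 21.0],
]
print(quantile_normalize(complete))
```

Median normalization leaves the shape of each distribution untouched, which is the more conservative choice when samples are heterogeneous or contain informative outliers.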

Diagnostics of normalized data

While normalization makes the samples more comparable, it only aligns their global patterns. Therefore, batch effects affecting specific proteins or protein groups might still represent a major source of variance even after normalization. Thus, the diagnosis of batch effects is most informative when performed on normalized data.

The diagnostic approaches can be divided into proteome‐wide and peptide‐level approaches. The main approaches for proteome‐wide diagnostics are as follows:

Hierarchical clustering is an algorithm that groups similar samples into a tree‐like structure called a dendrogram. Similar samples cluster together, and the driving cause of this similarity can be visualized by coloring the dendrogram by technical and biological factors. Hierarchical clustering is often combined with a heatmap, mapping quantitative values in the data matrix to colors which facilitates the assessment of patterns in the dataset.
Principal Component Analysis (PCA) is a technique that identifies the leading directions of variation, known as principal components. The projection of data on two‐component axes visualizes sample proximity. Additional coloring of the samples by technical/biological factors, or by highlighting replicates, facilitates the interpretation of what drives sample proximity. This technique is particularly convenient to assess clustering by biological and technical factors or to check for replicate similarity. In our experience, samples can be visualized without overlapping points or labels in datasets of up to about 50–100 samples.

One should be careful in interpreting proteome‐wide diagnostics because these methods were designed for data matrices without missing values. Proteomic datasets often contain missing values for technical or biological reasons. For more details, we refer the reader to Box 1.
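The following sketch shows one cautious way to run a proteome-wide diagnostic in the presence of missing values: the first principal component is computed only from features quantified in every sample (a complete-case analysis), so the result cannot be driven by imputed values. It uses power iteration in plain Python to stay dependency-free; in practice a numerical library would be used. The function name, the feature-by-sample layout, and the toy data are assumptions for this example.

```python
from statistics import mean


def first_principal_component(matrix, n_iter=200):
    """Return (PC1 scores per sample, fraction of variance explained by PC1).

    matrix: rows are features, columns are samples. Rows containing a
    missing value (None) are dropped. Assumes the remaining data are not constant.
    """
    rows = [r for r in matrix if all(v is not None for v in r)]
    centered = []
    for r in rows:
        m = mean(r)
        centered.append([v - m for v in r])  # center each feature across samples
    n = len(centered[0])
    # Sample-by-sample Gram matrix; its leading eigenvector gives the PC1 scores
    gram = [[sum(r[i] * r[j] for r in centered) for j in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(n_iter):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        vec[i] * sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    explained = eigenvalue / sum(gram[i][i] for i in range(n))
    scores = [v * eigenvalue ** 0.5 for v in vec]
    return scores, explained


# Hypothetical toy data: four peptides, six samples (first three from batch A)
matrix = [
    [10.0, 10.2, 9.9, 12.0, 12.1, 11.9],
    [15.0, 15.1, 14.8, 13.0, 13.2, 12.9],
    [8.0, None, 8.1, 8.0, 8.2, 7.9],  # dropped: not quantified in every sample
    [20.0, 20.3, 20.1, 21.0, 21.2, 20.9],
]
batches = ["A", "A", "A", "B", "B", "B"]

scores, explained = first_principal_component(matrix)
for b, s in zip(batches, scores):
    print(b, round(s, 2))
print("variance explained by PC1:", round(explained, 3))
```

If the scores separate by batch label, as they do in this constructed example, the leading source of variance is technical rather than biological.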

Box 1. Missing values.

Proteomic experiments now routinely profile hundreds or thousands of proteins across hundreds of samples. However, detecting all proteins without missing values across the whole dataset is not yet feasible. The patterns of "missingness" are known to be batch‐specific (Karpievitch et al, 2012), and some workflows are susceptible to a rapid inflation of missing values as the number of batches increases (Brenes et al, 2019). This is also true for the largest datasets of this manuscript: aging mouse DIA and TMT datasets (see Box 1 Figure, Figs EV5 and EV6 for details).

It should be noted that even though "missingness" is more common for low‐abundant peptides (i.e., an issue related to the dynamic range and sensitivity of the mass spectrometer), this problem can also arise due to fundamental peptide interference regardless of their abundance or the acquisition parameters.

Missing values can also affect batch effect correction methodologies. For instance, the current implementation of ComBat (Johnson et al, 2007) does not work if a peptide is missing in one batch. One possible solution is to remove all peptides with missing values before the batch correction (Lee et al, 2019). However, this may lead to loss of valuable quantitative information. Thus, methods which are more robust to missing data, such as median centering, can sometimes be better suited for proteomic data.
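Before choosing a correction method, it can help to quantify how missing values are distributed across batches and how many peptides would be lost under a method that needs every feature in every batch. The sketch below does this in plain Python; the function names and the feature-by-sample layout with `None` for missing values are assumptions for this example.

```python
def missing_by_batch(matrix, batches):
    """For each feature (row), the fraction of missing values (None) per batch."""
    labels = sorted(set(batches))
    result = []
    for row in matrix:
        fractions = {}
        for b in labels:
            values = [v for v, lab in zip(row, batches) if lab == b]
            fractions[b] = sum(v is None for v in values) / len(values)
        result.append(fractions)
    return result


def absent_in_some_batch(matrix, batches):
    """Indices of features with no measurement at all in at least one batch."""
    return [
        i
        for i, fractions in enumerate(missing_by_batch(matrix, batches))
        if any(f == 1.0 for f in fractions.values())
    ]


# Hypothetical toy data: three peptides, four samples in two batches
matrix = [
    [10.0, 10.1, 11.0, 11.2],  # complete
    [1.0, None, 2.0, 2.1],     # partly missing in batch A
    [3.0, 3.1, None, None],    # never quantified in batch B
]
batches = ["A", "A", "B", "B"]

print(missing_by_batch(matrix, batches))
print("features lost if every batch must be represented:",
      absent_in_some_batch(matrix, batches))
```

A large number of flagged features is an argument for a method that tolerates missing data, such as median centering.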

Missing values are often imputed, by filling them with zeros, random small values (Tyanova et al, 2016) or re‐quantification of elution traces (Röst et al, 2016). Such imputation, however, can introduce bias that is batch‐ or peptide‐specific, as seen in Figs EV5A and EV6. In turn, this skews batch effect diagnostic methods, such as hierarchical clustering, PCA, or PVCA. In these cases, batch effect assessment will be biased, as the clustering pattern will be driven by missing values (Fig EV5A). One can estimate this effect by varying the fraction of missing values and assessing to what extent the batch effects are driven by consistently quantified peptides vs. missing values containing ones (Fig EV5B).

More importantly, imputed values bias the analysis past the batch effect adjustment stage. As shown in Box 1 Figure B and C, if re‐quantified values inferred from MS elution traces are used ("with requants"), the correlation within batches seems higher than the correlation of replicates, while this problem is not observed when imputation is not used ("no requants"). Protein inference is also affected by imputation at lower levels.

Finally, provided that there are enough confidently quantified values, many downstream analysis techniques, such as differential expression or protein correlation analyses, can handle missing values. We therefore advise avoiding imputation or, at the least, performing it after batch correction whenever possible.

Box Figure 1. The problem of missing values in batch effect diagnosis and correction: Aging mouse study. (A) Hierarchical clustering and heatmap of normalized data; missing values shown in black. The missing values are non‐randomly associated with the batch; (B) heatmap of selected sample correlation: Stronger correlation of samples within Batch 2 (blue) and Batch 3 (brown) is visible in the data with "requants", and replicate correlation is much more prominent in the data without "requants"; (C) distribution of selected sample correlation: same effect, as in (B) showing the distribution of sample correlation.

In proteomics, peptide‐level diagnostics are as useful as proteome‐wide diagnostics. As in other high‐throughput measurements, individual features, in this case, peptides, are visualized to check for batch‐related bias. In proteomic datasets, spike‐in proteins or peptides can be added as controls. In most DIA datasets, iRT peptides (Escher et al, 2012), if added in precise concentrations, are well suited for individual feature diagnostics. It should be noted that individual peptides have a variety of different responses to various batch effects, so checking a handful of peptides is necessary, whether endogenous or spiked‐in.

Another reason to check individual peptides in proteomics is to examine the trends associated with sample running order. These trends might occur as MS signal deteriorates and require special correction approaches.

Note that in proteomics, individual features are sometimes not peptides, but transitions or peptide groups. Thus, the methods referred to here as peptide‐level diagnostics are applicable to any feature‐level diagnostics.

Batch effect correction

Diagnostics help to determine whether batch effect corrections are needed. While global sample patterns are corrected during normalization, batch effects affect specific features and feature groups, and that is the level on which they need to be corrected.

In proteomic datasets, two types of batch effects are frequently encountered: continuous and discrete. If batch effects are continuous, e.g., manifest as MS signal drift progressing from run to run during the sample measurement process, an order‐specific curve needs to be fitted, for example with LOESS or any other continuous fitting algorithm. Signal drifts are likely to occur in studies profiling hundreds of samples. This problem is more prominent in mass spectrometry than in next‐generation sequencing and is thus still relatively new to the research community.
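A minimal sketch of continuous drift correction for one feature within one MS batch is given below. It fits a simplified LOESS curve (local linear regression with tricube weights, no robustness iterations) to intensity against run order, subtracts the fitted trend, and adds back the feature median so that the intensity level is preserved. The function names, the default span, and the choice to re-center on the median are assumptions for this example and are not taken from the proBatch implementation.

```python
from statistics import median


def loess_fit(x, y, span=0.75):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def correct_drift(order, intensities, span=0.75):
    """Remove a run-order trend from one feature within one batch.

    Missing values (None) are excluded from the fit and stay missing.
    Run-order values are assumed to be unique.
    """
    observed = [(o, v) for o, v in zip(order, intensities) if v is not None]
    x = [o for o, _ in observed]
    y = [v for _, v in observed]
    trend = dict(zip(x, loess_fit(x, y, span)))
    center = median(y)  # keep the feature at its original intensity level
    return [
        None if v is None else v - trend[o] + center
        for o, v in zip(order, intensities)
    ]


# Hypothetical toy data: signal decaying over ten injections, one missing value
order = list(range(1, 11))
intensities = [20.0, 19.8, 19.9, 19.6, None, 19.5, 19.2, 19.3, 19.0, 18.9]
print(correct_drift(order, intensities))
```

The span controls how closely the curve follows the data: a small span risks fitting and removing biological signal, so the fit is best inspected on replicate or spike-in features before it is applied to all peptides.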

Discrete batch effects manifest as feature‐specific shifts of each batch as a whole. Here, methods such as mean and median centering work very well. An advanced modification of the mean shift is provided by ComBat (Johnson et al, 2007) that uses a Bayesian framework which can be applied to proteomic data (Lee et al, 2019). However, ComBat requires that all features are represented in each of the batches. Therefore, especially in large‐scale proteomic datasets, applying ComBat might require the removal of a substantial number of peptides that happen to be missing in at least one batch, regardless of how small this batch is (see Box 1 for details). Thus, one should be very careful when choosing the method for batch effect correction.
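The simplest discrete correction mentioned above, per-feature median centering, can be sketched as follows. Each batch is shifted so that its median for the given feature matches the overall feature median; missing values are ignored, which is why the approach still works when a peptide is absent from some batches. The function name and data layout are assumptions for this example.

```python
from statistics import median


def center_batches(values, batches):
    """Median-center one feature per batch, then restore its overall median.

    Missing values (None) stay missing; a batch without any measurement
    for this feature is simply left out of the calculation.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    overall = median(observed)
    batch_median = {}
    for b in set(batches):
        in_batch = [v for v, lab in zip(values, batches) if lab == b and v is not None]
        if in_batch:
            batch_median[b] = median(in_batch)
    return [
        None if v is None else v - batch_median[b] + overall
        for v, b in zip(values, batches)
    ]


# Hypothetical toy data: one peptide, six samples, batch B shifted upwards
values = [10.0, 10.2, None, 12.0, 12.4, 12.2]
batches = ["A", "A", "A", "B", "B", "B"]
print(center_batches(values, batches))
```

Applied to every row of a normalized matrix, this removes batch-wise shifts feature by feature, on the assumption that biological groups are balanced across batches.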

Quality control

The purpose of the quality control step is to determine whether the adjustment procedures—normalization and/or batch effect correction—have improved the data. At this step, the data after adjustment are compared with the raw data matrix. There are two types of criteria to evaluate the data quality: (i) removal of the bias (negative control) and (ii) improvement of the data (positive control).

Typically, bias is considered removed if the similarity between samples is no longer driven by technical factors. This means that neither hierarchical clustering nor PCA shows clustering by batch, and the correlation of samples from the same batch is no longer stronger than the correlation of unrelated samples. Also, individual features should not show batch‐related biases. Thus, comparison of diagnostic plots for raw and adjusted data serves as the negative control.

Proving improvement achieved by batch correction is much harder. It is common to take “improved clustering by biological condition” or “higher number of differentially expressed proteins” as a positive control and, more generally, as a sign of data quality enhancement. However, both criteria are subjective: It is impossible to know beforehand whether biological groups are separable in the proteomic space, especially if only a subset of proteins changes while the bulk of the proteome does not. Similarly, it is not possible to predict whether higher sensitivity for differential expression comes at the expense of added false‐positive hits. Therefore, we do not recommend using these criteria to assess normalization or batch effect correction. As described above, the choice of the method should rather be based on the properties of the samples.

In general, since batch adjustment removes a certain portion of variance, the coefficient of variation for peptides and proteins in replicated samples should decrease. This is especially true for spike‐in peptides or proteins that are added to samples in controlled quantities. A stronger positive control is the assessment of reproducibility, such as comparison of lists of differentially expressed proteins or regression/classification models derived from two or more sample sets belonging to different batches (Lazar et al, 2013). It is expected that in adjusted datasets, the resulting lists of differentially expressed proteins, or proteins providing optimal class separation, will be highly overlapping (Shabalin et al, 2008). If two sample sets are independently used for predictive modeling, the predictive performance of such models is also expected to be comparable in adjusted datasets (Luo et al, 2010). Note, however, that while this method generalizes well to studies with data acquired by different technologies (e.g., microarrays vs. RNA‐seq for transcriptomics or DIA vs. DDA for proteomics), it is restricted to fairly large datasets (several dozens, preferably, hundreds of samples), as predictions from small‐scale experiments tend to be unstable (underpowered).
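The expected decrease in the coefficient of variation (CV) of replicates can be checked with a few lines of code. The sketch below assumes log2-transformed intensities and back-transforms them before computing the CV, because a CV computed directly on log values is not comparable across intensity levels. The function names, the log base, and the toy numbers are assumptions for this example.

```python
from statistics import mean, median, stdev


def replicate_cv(log2_values):
    """CV (%) of one feature across replicate injections, on the linear scale."""
    linear = [2 ** v for v in log2_values if v is not None]
    if len(linear) < 2:
        return None
    return 100 * stdev(linear) / mean(linear)


def median_cv(matrix, replicate_columns):
    """Median CV over all features, using only the given replicate columns."""
    cvs = []
    for row in matrix:
        cv = replicate_cv([row[c] for c in replicate_columns])
        if cv is not None:
            cvs.append(cv)
    return median(cvs)


# Hypothetical toy data: two peptides measured in four replicate injections
raw = [
    [20.0, 20.5, 21.0, 21.5],
    [18.0, 18.4, None, 19.1],
]
adjusted = [
    [20.7, 20.8, 20.75, 20.8],
    [18.5, 18.6, None, 18.55],
]
replicates = [0, 1, 2, 3]
print("median CV before adjustment:", median_cv(raw, replicates))
print("median CV after adjustment:", median_cv(adjusted, replicates))
```

A lower median CV after adjustment is consistent with reduced technical variance, although it does not by itself show that biological signal was retained.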

Here, we also propose two positive control methods that do not rely on large sample size and are applicable to most proteomic experiments. The first is based on sample correlation. It is expected that the correlation between technical or biological replicate samples is higher than the correlation of unrelated samples. Particularly, the distribution of replicate correlations should be clearly shifted upwards, even though replicates might occasionally correlate less than some unrelated sample pairs, and this distinction should be strengthened by batch adjustment procedures. Following similar logic, other distance metrics can be used to assess sample proximity, which can also be visualized as improved clustering of replicated samples, seen on hierarchical clustering or PCA component plots. The latter method visualizes every sample in the experiment and thus is most suited to assess studies with up to 150 samples, while bigger sample sizes are harder to visualize. The second assessment method is specific for bottom‐up proteomics and makes use of peptide correlation. Correlation of unrelated peptides is expected to be close to zero, while peptides originating from the same protein are likely to be positively correlated. Since tens of thousands of peptides are routinely detected in modern high‐throughput proteomic experiments, improvements in this metric are a reliable readout of data quality following batch adjustment.
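The second positive control, peptide correlation, can be sketched as a comparison of correlations between peptides of the same protein and peptides of different proteins. The code below is a plain-Python illustration; the function names, the dictionary-based layout, and the peptide and protein labels are hypothetical. The same pattern, with replicate labels in place of protein labels and samples in place of peptides, applies to the replicate-correlation control.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation using only samples where both peptides were measured."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return None
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def peptide_correlation_summary(peptides, protein_of):
    """Mean correlation of peptide pairs within the same protein and between proteins.

    peptides: dict mapping peptide id to its intensities across samples.
    protein_of: dict mapping peptide id to its protein id.
    """
    within, between = [], []
    for p, q in combinations(sorted(peptides), 2):
        r = pearson(peptides[p], peptides[q])
        if r is None:
            continue
        (within if protein_of[p] == protein_of[q] else between).append(r)
    return mean(within), mean(between)


# Hypothetical toy data: four peptides from two proteins, five samples
peptides = {
    "pep_a": [20.0, 21.0, 22.0, 21.0, 20.0],
    "pep_b": [18.1, 19.0, 20.2, 19.1, 18.0],
    "pep_c": [25.0, 24.0, 25.0, 24.0, 25.0],
    "pep_d": [22.1, 21.0, 22.0, 21.1, 22.0],
}
protein_of = {"pep_a": "P1", "pep_b": "P1", "pep_c": "P2", "pep_d": "P2"}

within, between = peptide_correlation_summary(peptides, protein_of)
print("mean correlation within proteins:", within)
print("mean correlation between proteins:", between)
```

Running the summary on the raw and on the adjusted matrix shows whether adjustment widens the gap between the two means. For datasets with tens of thousands of peptides, the between-protein pairs would normally be subsampled, since all-pairs comparison grows quadratically.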

Description of datasets used to illustrate the workflow

To illustrate the application of the workflow described above, we use five proteomic datasets (three acquired in DIA and two in DDA mode), described in Table 3.

Table 3.

Dataset description.

Study	Organism	Sample source	Sample‐to‐sample heterogeneity	Technical factors	Biological factors	Protein (peak groups/precursors) number	Number of samples	Dataset accession
InterLab study	Human	Cell culture	Very low: samples come from the same tissue cultures and differ only by few spike‐in peptides	Data acquisition sites; Profiling days	None	4,077 (31,886)	229	PRIDE PXD004886
PanCancer study	Human	Blood	High: samples come from cancer patients and matched controls with different cancer localization	Protein digestion batch	Case/control; Cancer localization	205 (1,360)	162	PRIDE PXD004998
Aging mouse study	Mouse	Liver tissue	Medium: samples come from population of inbred mice originating from two parental strains	Protein digestion batch; MS batch; MS drift	Strain; Diet; Age	3,940 (32,449)^a	413	PRIDE PXD009160
TMT mouse study	Mouse	Liver	Medium: samples from a population of inbred mice originating from eight parental strains	Sample preparation batch; MS batch; MS drift	Strain; Age	6,813 (66,418)^a	120	PRIDE PXD018886
Bariatric surgery study	Rat	Lymph	Medium: samples come from inbred rat population	Liquid handling robot	Gastric bypass vs. placebo surgery	302 (1,987)	68	MassIVE MSV000087519

^a Number of proteins and peptides before filtering for peptides with too many missing values.

The first study, called here "InterLab study", assessed the robustness of SWATH‐MS in a multi‐lab setting (Collins et al, 2017; Data ref: Collins et al, 2017). A set of 30 stable isotope labeled (SIL) peptides (Ebhardt et al, 2012), partitioned in five groups, was serially diluted in HEK293 cell lysate. The SIL peptides in the resulting samples spanned a concentration range from 12 amol to 10 pmol. These five sample sets were distributed to 11 laboratories worldwide for measurement by SWATH‐MS according to a predetermined schedule. Each of the samples was run on 3 separate days, with the exception of the 4th sample that was run three times on each day. In total, 229 samples were profiled. Thus, the technical covariates whose effect needed to be assessed were the data acquisition site and day. Note that due to the technical nature of the study, no biological signal needed to be identified. As only a small number of SIL peptides is different across these samples, all changes can be attributed to technical covariates, and therefore, the samples in this study can be treated as replicates. Within this manuscript, we analyze only the influence of the acquisition site as a batch factor.

The second study, named here "PanCancer study" (Sajic et al, 2018; Data ref: Sajic et al, 2018), profiled the blood plasma glycoproteome of a cohort of patients with five solid carcinomas and matched controls. In total, 155 blood plasma samples were collected. Protein digestion and glycopeptide enrichment were performed in 4 batches, several weeks apart. To account for sample preparation reproducibility, 7 biospecimens were replicated and allocated to a different batch. To control for intra‐sample variation caused by the sample preparation protocol, bovine fetuin‐B was spiked in equal amounts into each plasma sample. Including the 7 replicates, 162 samples were measured (the validation cohort from the original manuscript is omitted from this analysis).

The third dataset is called here the "Aging mouse study" (preprint: Williams et al, 2021; Data ref: Williams et al, 2021). In this study, 413 liver proteomes were measured from 341 individual mice from the BXD reference mouse population (Peirce et al, 2004) to identify changes associated with age. As in prior metabolic profiling experiments on BXD mice (Williams et al, 2016), genetically identical cohorts of animals were also subjected to either chow or high‐fat diet. The samples were randomized with respect to biological covariates (age, diet, sex), and samples from two mice (EarTags "ET1506" and "ET1524") were each injected 10 times at various intervals throughout the run to control for signal consistency. Additionally, a mix of samples was injected 3 times as a control. In this experiment, two technical factors are known to affect the measurement. First, the samples were digested in five batches. Second, to compensate for signal deterioration, MS data acquisition was interrupted for machine cleaning and tuning, resulting in 7 mass spectrometry batches. The MS batch boundaries are shown as vertical lines in Fig 2. Except for replicates, samples were run in the order of their digestion batches (see Fig EV1A), so these two factors are mostly confounded (i.e., digestion and MS batch mostly overlap). Therefore, in our analysis we only correct for MS batch, unless noted otherwise. Given the particularly severe MS signal deterioration at the end of MS batch 2, the last 13 samples of this batch were profiled again as the first samples of MS batch 3. In total, 375 proteome acquisitions were considered in this manuscript, while 38 acquisitions were discarded prior to analysis due to major acquisition failures. See also "Appendix" in supporting information for a summary of the original experimental setup of this study.
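The overlap between two technical factors, such as digestion batch and MS batch, can be inspected with a simple cross‐tabulation of the sample annotation. The following is a minimal sketch in plain Python and is not part of proBatch; the function names `cross_tabulate` and `overlap_fraction`, the batch labels, and the 12‐sample annotation are all hypothetical and do not reproduce the study data.

```python
from collections import Counter, defaultdict


def cross_tabulate(factor_a, factor_b):
    """Count samples for every (level of A, level of B) combination."""
    if len(factor_a) != len(factor_b):
        raise ValueError("both factors need exactly one label per sample")
    table = defaultdict(Counter)
    for a, b in zip(factor_a, factor_b):
        table[a][b] += 1
    return table


def overlap_fraction(table):
    """Fraction of samples that fall into the most common B level of their A level.

    A value of 1.0 means that B is fully determined by A (complete overlap);
    lower values mean that the B levels are spread across the A levels.
    """
    total = sum(sum(row.values()) for row in table.values())
    dominant = sum(max(row.values()) for row in table.values())
    return dominant / total


# Hypothetical annotation for 12 samples (made-up labels, not the study data).
digestion_batch = ["D1"] * 4 + ["D2"] * 4 + ["D3"] * 4
ms_batch = ["M1", "M1", "M1", "M1",
            "M2", "M2", "M2", "M1",
            "M3", "M3", "M3", "M3"]

table = cross_tabulate(digestion_batch, ms_batch)
for level in sorted(table):
    print(level, dict(table[level]))
print("overlap fraction:", round(overlap_fraction(table), 3))
```

When nearly all samples of each digestion batch fall into a single MS batch, as in this toy annotation, the two factors cannot be separated statistically, and correcting for one of them largely accounts for the other.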

(A) Mean intensity in the raw peptide matrix vs. sample running order, with repeatedly replicated samples shown in color. Vertical dotted lines indicate MS batch boundaries; (B) distribution of unadjusted sample intensity correlations between batches, within batches, and in replicated samples; (C) bias in protein quantification: the quantity of the representative protein ACADS follows the drift of the average sample intensity, preventing allele separation and QTL detection; (D) boxplots of sample intensities in the raw, unnormalized peptide matrix; (E) boxplots of sample intensities after quantile normalization. All plots represent peptide‐level data.

Figure EV1. (A) Correlation of sample intensities indicates a closer relationship between samples from the same batch; (B) in principal component space, samples cluster by digestion batch but not by diet; (C) normalization removes a large fraction of variation, making samples more comparable, which is also seen at the level of individual peptides; (D) when fitting a LOESS curve, the span has to be chosen carefully: a span that is too small leads to overfitting and overcorrection.

The fourth study, named here “TMT mouse study”, used data acquired from the livers of 120 individuals from the Collaborative Cross reference mouse population, taken at 8 weeks of age (preprint: Keele et al, 2020; Data ref: Keele et al, 2020). Data were acquired in 12 TMT batches of 10 samples each, with a six‐month gap between batches 10 and 11. Within each of the two groups (batches 1–10 and batches 11–12), batches were prepared and run in direct succession. The peptide measurement table from that paper was used as input for proBatch, and cis‐pepQTLs were calculated before and after proBatch correction.

The fifth study, named here “Bariatric surgery study” (Kaufman et al, 2019; Data ref: Kaufman et al, 2019), profiled N‐linked glycoproteomes from rat lymph. The cohort consisted of 68 lymph samples from rats that underwent either gastric bypass surgery (RYGB) or placebo surgery (SHAM). Samples were collected before the operation and 5, 10, and 21 days after it. The samples for this study were processed using a Versette automated liquid handling system (ThermoFisher Scientific) in a 96‐well plate format. Differences in performance among the Versette’s channels introduced a “robot batch” into these data. The samples were measured in label‐free DDA mode, and glycopeptides were quantified using Progenesis (Nonlinear Dynamics).

All in all, these datasets are representative of various applications of large‐scale proteomic studies. They make use of different sample sources (i.e., cell cultures, patients, model organisms). They ask technical and biological questions about proteomes of varying complexity and present different degrees of sample‐to‐sample heterogeneity. In this respect, the InterLab study is very homogeneous, to the point that all the samples are essentially technical replicates. The PanCancer study is highly heterogeneous and comprises 205 proteins identified in samples originating from different hospitals and different tissues. Finally, the model‐organism‐based studies represent an intermediate case: their subjects are genetically related, but sampling still introduces a certain amount of sample heterogeneity.