PLoS One. 2020 Apr 9;15(4):e0231446. doi: 10.1371/journal.pone.0231446

Blind estimation and correction of microarray batch effect

Sudhir Varma 1,*
Editor: Volkhard Helms
PMCID: PMC7145015  PMID: 32271844

Abstract

Microarray batch effect (BE) has been the primary bottleneck for large-scale integration of data from multiple experiments. Current BE correction methods either need known batch identities (ComBat) or have the potential to overcorrect by removing true but unknown biological differences (Surrogate Variable Analysis, SVA). It is well known that experimental conditions such as array or reagent batches, PCR amplification or ozone levels can affect the measured expression levels; often the direction of perturbation of the measured expression is the same in different datasets. However, no BE correction algorithm attempts to estimate the individual effects of technical differences and use them to correct expression data. In this manuscript, we show that a set of signatures, each of which is a vector whose length equals the number of probes, calculated on a reference set of microarray samples can predict much of the batch effect in other validation sets. We present a rationale for selecting a reference set of samples designed to estimate technical differences without removing biological differences. Putting both together, we introduce the Batch Effect Signature Correction (BESC) algorithm that uses the Batch Effect Signatures (BES) calculated on the reference set to efficiently predict and remove BE. Using two independent validation sets, we show that BESC is capable of removing batch effect without removing unknown but true biological differences. Much of the variation due to batch effect is shared between different microarray datasets. That shared information can be used to predict signatures (i.e. directions of perturbation) due to batch effect in new datasets. The correction can be precomputed without using the samples to be corrected (blind), applied to each sample individually (single sample), and removes only known technical effects without removing known or unknown biological differences (conservative). Those three characteristics make it ideal for high-throughput correction of samples for a microarray data repository. We also compare the performance of BESC to three other batch correction methods: SVA, Removing Unwanted Variation (RUV) and Hidden Covariates with Prior (HCP). An R package besc implementing the algorithm is available from http://explainbio.com.

Introduction

Batch effect (BE) has been the primary bottleneck for the large-scale integration of data from multiple experiments. BE, defined as the systematic biases between microarray data generated by different labs at different times or under different experimental conditions [1, 2], can act as a confounding variable in statistical tests and usually has a stronger effect on the measured expression than the biological phenotype under study [3].

Unknown or unrecorded experimental or biological differences can add a systematic difference between putative replicates within or between two batches. Thus, we use the term batch effect as a general term for any heterogeneity due to experimental factors between samples that are putative experimental replicates. The heterogeneity can extend to different samples within the same sample collection, i.e. considering only the average difference between two collections will likely underestimate the BE.

In practice, it has proven difficult to separate heterogeneity due to technical differences from that due to unknown biological differences. The usual approach in batch correction [4–6] is to protect the known covariates and remove all remaining heterogeneity. Biological differences such as sex and genotype can be clinically important, but they will be removed if they are not part of the protected covariates [7, 8]. Conversely, if the study design is unbalanced, the statistical significance of the association of gene expression with the protected covariates can be inflated beyond what one would expect from just a reduction in noise [9].

Additionally, current batch correction methods are intended to be used each time a new composite dataset is created. It is known that specific differences in sample condition, experimental technique [1, 10] and environmental conditions [2] can affect the measured gene expression in predictable directions irrespective of the sample type. However, there has been little systematic effort to estimate how many of those common effects are shared between datasets or to compute dataset-independent batch correction parameters that can be used for “blind” prediction of BE.

In this article, we take another approach: instead of estimating and removing all differences between two batches, we remove only those differences that are known to be associated with technical variations. We show that a large proportion of the BE in Affymetrix U133 Plus2 array data can be captured by a relatively small set of signatures, defined as the directions in which the measured expression has been perturbed by batch effects. We estimate batch effect signatures (BES) in the form of orthogonal components from a large reference dataset of samples. We develop an algorithm for computing the BES using the reference dataset such that the BES are unlikely to model known or unknown biological differences. We introduce a novel batch-correction method called Batch Effect Signature Correction (BESC) that uses the batch effect signatures for blind prediction and correction of BE in new samples, and compare its performance to SVA, RUV and HCP.

Materials and methods

Batch effect and correction methods

The measured expression for a set of samples can depend on one or more biological factors (such as cell line name, tissue of origin or disease status) and unknown or unmodeled experimental factors (such as microarray batch, FFPE vs. fresh samples, and experimental technique). Following [4] we can model the expression of a sample as a linear combination of known biological covariates, unknown batch effect and noise

$$x_{ij} = \mu_i + f_i(y_j) + \sum_{l=1}^{L} \gamma_{il}\, g_{lj} + e_{ij} \qquad (1)$$

where x_ij is the measured expression of gene i (out of m genes) for sample j (out of n samples), μ_i is the overall mean expression of gene i, and f_i(y_j) is a (possibly non-linear) function that models the dependence of the expression of the ith gene on the known biological factors y_j of the jth sample. The batch effect is modelled as a linear combination of L experimental factors, where g_lj is the value of the lth experimental factor for the jth sample and γ_il is the effect of that factor on the expression of the ith gene. The last term e_ij is uncorrelated noise.

Eq 1 indicates that the overall space of measured gene expressions can be separated into a space spanned by the biological variation f_i(y_j) and a space spanned by the systematic batch effects (the weighted sum of the g_lj). The purpose of BE correction is to estimate and remove the latter (third term on the RHS of Eq 1) while retaining the biological differences. Data-specific batch correction methods (such as ComBat and SVA) assume that the BE space (third term) is unique to each composite dataset and has to be recalculated each time a new composite dataset is created. However, we show that an orthogonal basis derived from the BE space of a reference dataset can be used to estimate and remove the variation in the BE space of other test datasets (i.e. blind batch correction).
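
To make the decomposition in Eq 1 concrete, the following R sketch simulates expression data from the model. The sizes and effect magnitudes are arbitrary choices for illustration and are not taken from the manuscript.

# Minimal R sketch of the model in Eq 1 (toy dimensions and effect sizes are assumptions).
set.seed(1)
m <- 1000; n <- 60; L <- 2                    # genes, samples, hidden technical factors
mu    <- rnorm(m, mean = 7)                   # overall mean expression mu_i
y     <- rep(c("A", "B"), each = n / 2)       # known biological covariate (two groups)
f     <- outer(rnorm(m, sd = 0.5), as.numeric(y == "B"))  # biological term f_i(y_j)
gamma <- matrix(rnorm(m * L, sd = 0.3), m, L) # gene-wise batch loadings gamma_il
g     <- matrix(rnorm(L * n), L, n)           # hidden factor values g_lj per sample
e     <- matrix(rnorm(m * n, sd = 0.2), m, n) # uncorrelated noise e_ij
x     <- mu + f + gamma %*% g + e             # m x n matrix of "measured" expression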

Surrogate Variable Analysis (SVA)

SVA [4] has been widely used for estimating hidden covariates (including technical and biological). Fitting the functions f_i and global means μ_i using linear or non-linear regression, we can express Eq 1 in terms of the residuals of the fit

$$r_{ij} = x_{ij} - \mu_i - f_i(y_j) = \sum_{l=1}^{L} \gamma_{il}\, g_{lj} \qquad (2)$$

The aim of SVA is to find a set of K ≤ L orthogonal vectors (surrogate variables) that span the same linear space as the g_l

$$\sum_{k=1}^{K} \lambda_{ik}\, h_{kj} \approx \sum_{l=1}^{L} \gamma_{il}\, g_{lj} \qquad (3)$$

Each surrogate variable h_k = [h_k1, h_k2, …, h_kn]^T is a vector of length equal to the number of samples that models one hidden covariate not present in the known covariates, and λ_ik is the influence of h_k on the measured expression of the ith gene. Together they can be used to model heterogeneity from unknown sources in any future statistical analysis on that dataset. SVA finds these surrogate variables by performing a Singular Value Decomposition (SVD) of the residuals r_ij.

The advantage of SVA is that we do not need to know the actual batches. The disadvantages of SVA are that, firstly, it has to be recomputed on each new dataset and, secondly, it can remove unknown but important biological differences between samples.
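
The base-R sketch below shows only the decomposition step described above (residuals of a covariate fit followed by SVD); the Bioconductor sva package performs additional iterative gene weighting that is omitted here. It continues the simulated x and y from the earlier sketch.

# Sketch of the SVD step underlying SVA (illustrative only; the full sva() routine does more).
residualize <- function(x, covariates) {
  # x: m x n expression matrix; covariates: data.frame with one row per sample
  design <- model.matrix(~ ., data = covariates)
  hat <- design %*% solve(crossprod(design), t(design))  # n x n projection onto covariate space
  x - x %*% hat                                          # residuals r_ij as in Eq 2
}
r  <- residualize(x, data.frame(y = factor(y)))
sv <- svd(r)
K  <- 5
surrogates <- sv$v[, 1:K]   # n x K: columns approximate the surrogate variables h_k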

Removing Unwanted Variation (RUV)

RUV [11] is a batch correction algorithm that uses factor analysis on control genes (i.e. genes that are not known to be differentially expressed for any of the known covariates) to estimate and remove batch effect. Apart from selection of the control genes, there is a parameter ν that has to be adapted to the dataset being corrected. RUV can be used without knowing the actual batches, but the selection of the control genes and ν remain challenging. Furthermore, these two selections are only valid for a particular dataset; they have to be re-evaluated for each new dataset.

Hidden Covariates with Prior (HCP)

HCP [12] is an approach to estimating batch effects modelled as a linear combination of known covariates. HCP aims to be a generalization of several factorization-based batch correction methods by modelling both the known and unknown biological or technical covariates. The weights given to the known and unknown variation result in three parameters λ, σ1, σ2 that have to be optimized for each dataset. HCP has the disadvantage that these parameters are specific to a particular dataset and have to be re-computed for each new dataset. Furthermore, without some ground truth to compare against (e.g. gene co-expression compared to known co-expressions, detection of known eQTLs), it is not possible to find the values of the three parameters that minimize the batch effect.

Batch Effect Signature Correction (BESC)

We introduce a novel approach, BESC, that aims to learn the variation due to unknown technical differences using a reference dataset and to apply it to correct batch effect in other datasets. To motivate our approach, note that λ_ik is the difference in the expression of the ith gene for each unit change in the kth surrogate variable. Thus, the vector λ_k = [λ_1k, λ_2k, …, λ_mk], over the genes 1…m for the kth surrogate variable, is a signature that quantifies the dependence of the expression of each gene on the kth surrogate variable. For example, if the surrogate variable captures the effect of a particular type of sample preparation, the signature is the differential expression between samples that use that preparation method vs. samples that do not. We call the set of K vectors λ_k the Batch Effect Signatures (BES). Each vector is a zero-mean unit vector (i.e. the sum of squares of its components equals one) that is as long as the number of probes on the array. We can consider batch effect to be a perturbation in a certain direction for all samples in the same batch. These perturbations can be estimated by looking at the expressions of the same cell line in different batches. Each perturbation is a sum of contributions from multiple sources of technical variation. BES decomposes the individual contribution of each source of technical variation by looking for a set of vectors that spans the batch-effect perturbation space.

Our claim is that the expression difference (i.e. the signature λ_k) between samples that differ on a certain experimental factor (e.g. RNA amplified vs. un-amplified) is likely to remain the same, up to a multiplicative coefficient, even if the surrogate variables do not remain the same. Given a dataset of reference samples that is large and diverse enough, we can compute the signatures of the various known or unknown factors that contribute to BE. Those signatures can then be used to estimate and remove BE from new samples. We call this approach Batch Effect Signature Correction (BESC).

Difference between SVA and BESC

There are some differences between the SVA and BESC formulations. Writing Eq (2) in matrix form

$$R = DU \qquad (4)$$

where R is the m×n matrix of residuals (m genes, n samples), D is the m×K matrix of batch effect signatures (λ_ik is the element in the ith row and kth column) and U is the K×n matrix of surrogate variables (h_kj is the element in the kth row and jth column). SVA constrains the rows of U to be orthogonal and estimates U using a singular value decomposition (SVD) of the residual matrix R (which also constrains the columns of D to be orthogonal). In contrast, we constrain the columns of D to be orthogonal and estimate D by taking the Principal Components (PCA) of the transpose of the residual matrix, R^T. Note that there is no substantial difference between using SVD or PCA for decomposing R (except computational stability).
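
A minimal sketch of this decomposition, continuing the residual matrix r from the sketch above: the principal components are taken over the transposed residuals so that the orthogonality constraint falls on the gene-space signatures rather than on the sample-space variables.

# Sketch: estimate D (the Batch Effect Signatures) from the transposed residual matrix.
pc  <- prcomp(t(r), center = TRUE, scale. = FALSE)  # rows of t(r) are samples, columns are genes
K   <- 5
bes <- pc$rotation[, 1:K]   # m x K matrix; each column is one unit-norm signature lambda_k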

Strictly speaking, SVA is not a batch correction algorithm. The surrogate variables also capture heterogeneity due to unknown biological differences since they are calculated without reference to the batch. Other batch correction methods such as ComBat [5] require known batch assignments, i.e. which samples belong to which batch. A modification to SVA, permuted SVA (pSVA) [8], has been proposed to prevent the algorithm from removing unknown biological differences. However, pSVA has the same limitation as ComBat, i.e. it is applicable only when the technical covariates that contribute to batch effect are known. As SVA does not require that information, it is the closest comparable algorithm to BESC and was selected for comparison to BESC in this article.

Selecting samples for the reference set

The selection of the reference set is crucial to ensure that the BES computed from it does not capture unknown biological covariates of the samples (e.g. sex, genotype). We selected and annotated a reference set of 2020 cell line samples (242 unique cell lines) on the Affymetrix U133 Plus 2.0 platform, selected from 348 collections from the Gene Expression Omnibus (S1 Table). Fitting the linear model in Eq (1) using the cell line name as the known covariate captures all known and unknown biological differences between the samples. The same cell line, under untreated conditions, should give similar expression profiles between replicates.

Although it is possible for growth conditions (e.g. passage number, growth medium) to affect the expression for a cell line, it can be argued that cell lines that have been grown from a standardized population of cells are one of the most replicable biological samples. Any differences in expression measured in two batches can be treated as mostly arising from BEs and experimental noise.

Correcting BE using BES

Given a set of BES, the batch effect in a new set of samples is computed by fitting a linear model of the BES to the expression of each sample, i.e. we find weights a_k that minimize the squared residual

$$\left\| x_{new} - \sum_{k=1}^{K} a_k \lambda_k \right\|^2 \qquad (5)$$

where x_new is the vector of expressions for a new sample to be corrected and λ_k is the kth BES. The weighted sum of the λ_k is the estimate of the BE, and the residual x_corr = x_new - Σ_k a_k λ_k is the corrected expression for the sample. Since the BES λ_k are zero-mean orthonormal vectors, it is simple to show that a_k = λ_k^T x_new, where the superscript T indicates the transpose, i.e. the weight for each BES is the scalar product of the BES and the expression vector.

BES thus provides a "blind" estimate of the BE; we do not recalculate BE parameters on the set of new samples. That enables single-sample correction, i.e. new samples can be corrected without having to re-compute the correction for all samples whenever new samples are added.
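
A minimal sketch of the correction step, assuming bes is an m x K matrix of orthonormal signatures computed on a reference set; the function and variable names here are illustrative, not the API of the besc package.

# Blind, single-sample correction per Eq 5.
besc_correct <- function(x_new, bes) {
  # x_new: numeric vector of length m (one sample); bes: m x K matrix of signatures
  a <- crossprod(bes, x_new)     # a_k = lambda_k^T x_new, because the BES are orthonormal
  as.vector(x_new - bes %*% a)   # x_corr = x_new - sum_k a_k lambda_k
}
# corrected <- besc_correct(new_sample, bes)   # hypothetical usage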

Quantifying batch effect

Several methods have been proposed in the literature to visualize or quantify the batch effect in a dataset [13]. Visualization methods include clustering (dendrograms) and principal component analysis (PCA). However, they are not suited to the very large numbers of samples in the datasets used in this paper. Principal Variance Component Analysis (PVCA) [1] has been proposed as a quantitative measure. PVCA fits a linear mixed model to the principal components of the data using the known biological covariates and batch identities as random effects. The variance due to the different covariates and batch is summed across the principal components and offers a way to compare the primary sources of variance. For the two validation sets, we used PVCA to compute the contribution of variance due to sample type as a measure of the batch effect: the higher the contribution, the lower the batch effect.

However, PVCA has the disadvantage that the variances due to batch effect in different test datasets cannot be averaged together (as is required for cross-validation). We therefore developed another batch effect measure, the Distance Ratio Score (DRS), which is intuitive, provides one value for each sample and can be averaged across test sets. Consider a set of samples from one or more sample types hybridized in multiple batches. For a sample of a certain sample type, we take the log of the ratio of the distance to the closest sample of a different sample type to the distance to the closest sample belonging to a different batch but the same sample type.

$$DRS = \frac{1}{n} \sum_{i=1}^{n} \log_2\!\left( \frac{d(x_i,\, x_{i,dt})}{d(x_i,\, x_{i,db/st})} \right) \qquad (6)$$

where d(·,·) is any measure of dissimilarity between two samples, x_i is the ith sample, x_{i,dt} is the closest sample of a different sample type and x_{i,db/st} is the closest sample from a different batch with the same sample type as x_i. Intuitively, the DRS is high if samples of the same type cluster together irrespective of batch, since the denominator will be small compared to the numerator. Conversely, if most of the samples cluster according to batch rather than sample type, the DRS will be small. For our analysis, we used the Euclidean distance as the dissimilarity measure.
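
A sketch of the DRS calculation in Eq 6, assuming Euclidean distance and per-sample type and batch labels; the function and argument names are illustrative.

# Distance Ratio Score: mean log2 ratio of nearest different-type distance to
# nearest same-type, different-batch distance.
drs <- function(expr, type, batch) {
  d <- as.matrix(dist(t(expr)))    # n x n Euclidean distances between samples (columns of expr)
  scores <- sapply(seq_len(ncol(expr)), function(i) {
    other_type <- type != type[i]                                # different sample type
    same_type_other_batch <- type == type[i] & batch != batch[i] # same type, different batch
    if (!any(other_type) || !any(same_type_other_batch)) return(NA)
    log2(min(d[i, other_type]) / min(d[i, same_type_other_batch]))
  })
  mean(scores, na.rm = TRUE)
}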

Cross-validating batch effect signatures

The question remains whether BES calculated on one dataset are general enough to be predictive of the BE in another dataset. To investigate that, we performed 5-fold cross-validation, i.e. we created 5 random splits of the samples in the reference set into training and testing sets (approximately 80% of samples for training and 20% for testing), ensuring that all samples from an individual collection are in either the testing set or the training set (S1 Table).

For each train/test split, we used the training set to compute the residuals after fitting a model to the expression with the known sample type as the covariate. Then we computed the principal components of the transpose of the residual matrix. The eigenvectors of the covariance matrix (i.e. the principal components) were used as putative BES. We used varying numbers of eigenvectors with the highest eigenvalues to correct the samples in the testing set and computed a Distance Ratio Score (DRS) of the test set samples for each number of BES.
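
A sketch of a split that keeps whole collections together; the per-sample vector of GEO collection IDs (here called collection) is an assumption, and the fold assignment is random rather than the exact split in S1 Table.

# Assign each GEO collection, and hence all of its samples, to one of 5 folds.
set.seed(1)
collections <- unique(collection)                            # 'collection': one ID per sample
fold_of_collection <- sample(rep(1:5, length.out = length(collections)))
fold <- fold_of_collection[match(collection, collections)]   # fold label per sample
# For fold k: compute BES on samples with fold != k (residuals against sample type, then
# prcomp of the transposed residuals), correct samples with fold == k, and evaluate the DRS.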

Consistency of BESC corrections for different reference sets

Since we claim that the BESC algorithm picks up batch effects due to common technical differences, it should pick the same corrections when trained on different reference sets. To investigate that, we used the 5 cross-validation splits of the reference set to create two smaller reference sets: one reference set using splits 1 and 2 together and another reference set using splits 3 and 4 together. From each reference set, we computed a set of BES: BES1+2 from splits 1 and 2 and BES3+4 from splits 3 and 4. Using these two sets of BES, we did three analyses. First, both sets of BES were used to correct the 5th split with different numbers of BES. We visualized the two sets of corrected data for split 5 using principal component analysis (PCA). Second, we corrected random vectors with the two BES and computed the correlations between the corrections for different numbers of BES. Third, we corrected the colon cancer dataset (validation set 2) with both sets of BES and compared the list of genes differentially expressed between MSI and MSS samples. Results for these analyses are in S1 File.

Testing on validation sets

We selected two validation sets to test the effectiveness of the BES calculated on the reference set for removing BE. All of these samples were collected from public GEO datasets. Annotations for the samples were compiled from annotations provided by the original data submitter.

Different numbers of the top BES were used to estimate and remove the BE in the validation sets and the DRS BE score was calculated. Note that the BES were calculated using only the reference set samples. The sample type (or any other covariate) of the validation set samples was not used during the BE correction.

Validation set 1 (Primary normal samples): 878 samples of primary healthy tissue (532 blood, 228 colon and 118 lung samples from 41 collections) on the U133 Plus 2.0 platform from GEO (S2 Table). We predicted the sex of each sample using 5 chromosome-Y genes (S1 File). Samples were selected to create a balanced dataset with respect to collection, organ and sex.

Validation set 2 (Colon cancer and normal): 3041 samples of primary colon cancer and normal tissue (476 normal colon and 2565 colon cancer samples from 60 GEO collections) on the U133 Plus 2.0 platform from GEO (S3 Table). We compiled the reported Micro-Satellite Stability/Instability (MSS/MSI) status for 513 of the 2565 colon cancer samples (154 MSI, 359 MSS). Out of those 513, we selected a balanced set of 503 samples with known MSI/MSS status to test for differential expression between MSI and MSS.

Permutation p-value for DRS

To test whether the improvement in the DRS is statistically significant, we performed a permutation test of the DRS obtained with various numbers of BES. We permuted the data for each gene in the reference set to destroy any batch effect signatures and then computed the BES using that null dataset. Varying numbers of those null BES were used to correct the samples in the validation set, and the resulting DRS was compared to the true DRS at each number of BES. The procedure was repeated 100 times and the null DRSs were used to compute a p-value for the true DRS using a Student's t-test at each number of BES.
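
A sketch of the permutation step, assuming ref is the m x n reference expression matrix; compute_bes and correct_all are hypothetical helpers wrapping the PCA and correction sketches above.

# Permute each gene independently across samples to destroy shared batch structure.
permute_genes <- function(ref) {
  t(apply(ref, 1, sample))   # apply() returns an n x m matrix, so transpose back to m x n
}
# null_bes <- compute_bes(permute_genes(ref))                     # hypothetical helper
# null_drs <- drs(correct_all(validation, null_bes), type, batch) # hypothetical helpers
# Repeating this 100 times yields a null distribution of DRS values at each number of BES.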

Comparison to SVA, RUV and HCP

We compared the DRS for the validation sets for increasing numbers of BES to that for increasing levels of correction by SVA, RUV and HCP. The comparison was done for varying numbers of 1) surrogate variables (SVA), 2) factors (RUV) and 3) estimated hidden covariates (HCP). Note that it is not a direct comparison, since the other three methods use the validation set samples to compute the BE parameters. The training set/validation set approach we have taken with BES cannot be applied to SVA, RUV or HCP, since they are expected to be run on the same dataset that is being corrected.

For RUV, we selected the set of housekeeping genes as well as the value of ν that gave the best performance in terms of DRS and PVCA (S1 File). For HCP, we did a grid search over the three parameters λ,σ1,σ2 and selected the values that gave the highest DRS.

One disadvantage of SVA is that it will remove unknown biological differences (e.g. sex) from the cleaned data [7, 8]. To show that, we predicted the sex of each sample in validation set 1 using the expression of chromosome Y genes (RPS4Y1, KDM5D, USP9Y, DDX3Y, EIF1AY; see S1 File) and looked at the number of genes significantly different between male and female samples at various levels of correction for BESC and SVA. For validation set 2, we had the reported MSI status for 513 colon cancer samples and looked at the number of genes significantly different between MSI and MSS samples. The computation of statistical significance used organ source, sex and batch as covariates for validation set 1, and batch and MSI/MSS status as covariates for validation set 2.
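
The manuscript does not name the differential-expression test; the sketch below uses limma as one standard choice for a design with status and batch covariates, so the exact call is an assumption rather than the author's exact pipeline.

# Differential expression between MSI and MSS with batch as a covariate (illustrative).
library(limma)
# expr: m x n corrected expression matrix; msi: factor with levels MSI/MSS; batch: factor of collections
design <- model.matrix(~ msi + batch)
fit <- eBayes(lmFit(expr, design))
tab <- topTable(fit, coef = 2, number = Inf)   # coefficient 2 is the MSI/MSS term in this design
sum(tab$adj.P.Val < 0.05)                      # number of genes significant at FDR < 0.05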

Results

BES computed on training data is predictive of batch effect in test data

Fig 1 shows the 5-split cross-validated batch effect DRS on the reference set. The plot shows the average DRS on the test set corrected using BES calculated on the training set. The DRS increases as the number of BES increases, indicating reduction of BE with increasing correction. It reaches a maximum with 30 BES used in the correction, indicating that the first 30 Batch Effect Signatures estimated on the training set capture variation due to BE in the test set. Further correction reduces the DRS, indicating that the BES above 30 do not capture any information about the BE.

Fig 1. Cross-validated performance on reference set.


The cross-validated Distance Ratio Score (DRS) for the reference set vs. the number of Batch Effect Signatures (BES) used for the correction. A higher DRS indicates a lower level of batch effect. The DRS reaches a maximum at 30 BES.

BES computed on different reference sets are consistent

S5 Fig shows the results of applying BESC using BES computed from different reference sets. As long as the number of BES is smaller than 10, the corrected samples from the two sets of BES stay close together (while moving away from the uncorrected samples). After 10 BES, the two sets of corrected samples start separating. S6 Fig shows the mean correlation between the corrections performed by the two sets of BES on a random vector. The correlation increases with increasing number of BES, reaching a maximum at 10 BES and then goes down. Furthermore, S7 Fig shows the overlap percentage between genes differentially expressed between MSI and MSS for validation set 2 corrected by the two sets of BES. The overlap remains high (>80%) for up to 10 BES.

Together, the three results show that the two reference sets (splits 1+2 and splits 3+4) pick up similar batch effect signatures (up to ~10 BES), indicating that the algorithm extracts consistent correction factors from different reference sets.

BES computed on reference set corrects batch effect in validation sets

Fig 2A shows the DRS on validation set 1 using various numbers of BES calculated on the reference set. The figure compares the performance of the BESC method to the SVA, RUV and HCP methods using the same number of correction factors as the number of BES. The BESC method shows improvement in the DRS when up to 15 BES are used for correction. Using more than 15 signatures begins to decrease the effectiveness. SVA shows a steady improvement in batch effect at each level, while RUV shows improvement up to two correction factors and then saturates. HCP shows improvement for the first correction factor, but its performance decreases once 20 or more correction factors are included. Fig 2B shows similar results using the variance contributed by the sample type (in this case, the organ of origin of the samples). The variance due to sample type increases (with a corresponding decrease in variance due to batch effect; not shown) as the number of BES or number of surrogate variables increases. As with the DRS (Fig 2A), SVA shows a steady improvement in batch effect at each level. The performance of BESC reaches a maximum when 15 BES are used. RUV and HCP do not show any improvement with correction (with the performance of HCP decreasing below that of the uncorrected data).

Fig 2. Performance on validation set 1.


a) DRS for validation set 1 using BESC, SVA, RUV and HCP and the permuted null BES. b) Contribution of variance due to organ type, calculated using PVCA. c) Number of genes differentially expressed between male and female samples at various levels of correction.

Fig 3A shows the DRS on validation set 2 for varying numbers of BES used in the correction. For that dataset too, the DRS reaches a maximum plateau at 15 BES. Using more than 15 BES does not significantly change the DRS. Fig 3B shows similar results using the PVCA-computed variance contribution due to sample type (i.e. disease status: colon cancer vs. normal in this case). In both cases, the SVA-corrected data shows superior performance in removing batch effect.

Fig 3. Performance on validation set 2.


a) DRS for validation set 2 using BESC, SVA, RUV and HCP and the permuted null BES. b) Contribution of variance due to disease status, calculated using PVCA. c) Number of genes differentially expressed between MSI and MSS samples at various levels of correction by BESC, SVA, RUV and HCP.

Comparison to SVA

SVA shows superior performance (Fig 2A) over most of the range of the x-axis (number of BES/number of surrogate variables used for correction).

However, Figs 2C and 3C illustrate the primary disadvantage of SVA, which is the "normalizing away" of unknown but true biological differences. When sample sex is not included as one of the "protected" covariates in the SVA algorithm, the difference between male and female samples (i.e. the number of statistically significant genes) decreases with increasing SVA correction. On the other hand, the difference is maintained (and slightly increased) with increasing BES correction.

Fig 3C shows the number of genes that are differentially expressed between MSI and MSS samples at a statistically significant level (FDR < 0.05). There is a steep drop-off in the number of genes for the SVA-corrected data. SVA correction eventually removes almost all of the differences between the MSI and MSS samples. On the other hand, the BES-corrected data retains most of the differential expression.

Figs 2C and 3C emphasize the conservative nature of BESC; only differences that are known to be due to BE are removed. Any unknown/unmodeled difference between the samples that is due to true biology is maintained. In contrast, SVA captures those differences unless they are protected, and correction with the surrogate variables removes those differences from the samples.

Comparison to RUV

For validation set 1, RUV shows some promise as a batch correction method for small numbers of correction factors (2 or fewer). However, it fails to perform any correction for validation set 2. Also note that the results for RUV are those for the value of ν and the set of housekeeping genes with the highest performance. In practice, it is computationally cumbersome to select the best parameters.

Comparison to HCP

HCP does some significant correction for validation set 2 (Fig 3) but shows disappointing performance on validation set 1 (Fig 2). In addition, HCP had the disadvantage that its performance had to be tuned by optimizing over a grid of three parameters (we show the results for the best performing combination).

BES correction is statistically significant

The orange line (Figs 2A and 3A) shows the average DRS for the validation sets corrected using the BES calculated on the permuted reference set data. As expected, there is no significant correction of the data (i.e. the DRS does not improve from baseline). The z-score p-value of the DRS using the true reference set (black line) is <1e-16 over the entire range of numbers of BES.

Discussion

We show that the batch effects between different datasets occupy a space that can be characterized by a small set of orthogonal basis vectors. That enables us to compute Batch Effect Signature (BES) vectors that capture the directions of perturbation using a reference dataset and apply them to predict and remove the batch effect in independent validation datasets. As far as we know, no other "blind" methods of batch correction have been published. All methods (including SVA [4], RUV [11] and HCP [12]) require the correction factors to be computed on the entire sample set, needing re-calculation each time new samples are added.

Crucial to the correct operation of our algorithm is the selection of a reference set composed of cell lines since specifying the cell line name completely fixes all known and unknown biological covariates. We argue that cell lines that have been grown from a standardized population of cells are one of the most replicable biological experiments, and any difference between the same cell line sample from different experiments is quite likely due only to technical variation. However, note that we can use any sample that has been analysed in multiple experiments by different labs. For example, samples from The Cancer Genome Atlas (TCGA) or any other sample repository have the property that specifying the sample id completely specifies all of the known and unknown biological covariates. All that is required for inclusion in the reference set is that the sample must be uniquely identified across multiple experiments.

We compared the BESC algorithm to Surrogate Variable Analysis (SVA), Removing Unwanted Variation (RUV) and Hidden Covariates with Prior (HCP). At larger numbers of surrogate variables, SVA is more effective at detecting and removing residual structure from the dataset; however, it is not possible to know how much of that residual structure is due to batch effect and how much to unknown but true biological differences. We show that SVA is very likely to remove unknown but important biological information in our validation sets (e.g. sex in the case of normal samples and MSI vs. MSS differences in the case of colon cancer). We show that BESC is much more conservative about retaining unknown biological differences while removing technical differences. RUV and HCP each show some performance on one validation set but do not perform any significant correction on the other. Furthermore, both algorithms need to be tuned by testing out various values of their tuning parameter(s), a computationally expensive process.

The characteristics of the BESC algorithm make it ideally suited for large-scale batch correction in microarray data repositories. BESC is conservative, i.e. only BE known to be likely due to technical differences is removed. We have shown that biological variation, even if unknown to the BESC algorithm, is preserved. BESC can be applied to individual samples and does not need to be recomputed as more samples are added to the repository.

One primary disadvantage of the BESC algorithm is that it will remove all differences between samples that are parallel to the BES vectors. It is possible that there is important biological information along those directions. However, that biological difference is confounded with the technical differences and cannot be separated out in any analysis.

Another disadvantage is that the current BES vectors are only applicable to the Affymetrix U133 Plus2 array. We used samples from the U133 Plus2 platform to calculate the BES, mainly because it is the most commonly used platform with a large number of cell line samples. Before BES can be used on another platform, we will have to compile a reference set of cell line or other identified samples on that platform and re-compute the vectors on that set. Other BE correction methods are platform agnostic.

Conclusions

This paper describes a novel finding that batch effect (BE) perturbs measured gene expression in predictable directions which we call Batch Effect Signatures (BES). That characteristic can be used to compute possible directions of the perturbations in a reference dataset which can be used to predict the BE in an independent validation set. Selecting the reference set to contain only known cell-lines ensures that all (known or unknown) biological differences are fixed by specifying the cell-line name. That ensures that the BES capture differences due only to technical differences between the batches.

We show that the BES calculated on the reference set efficiently removes batch effect in two validation sets, as measured by PVCA and the Distance Ratio Score (DRS), a novel measure of batch effect. Compared to SVA, our algorithm does not remove all possible differences between samples of the same type in different batches, but we show that SVA also over-corrects by removing unknown but true biological differences. Compared to RUV and HCP, our algorithm shows superior performance over a wider range of datasets.

An R package besc implementing the algorithm is available from http://www.explainbio.com. All data used are public data available from GEO (Gene Expression Omnibus, https://www.ncbi.nlm.nih.gov/geo/). GEO accession numbers for all samples used in the analyses are in S1–S3 Tables.

Supporting information

S1 Fig. Normalization of arrays.

a) Bias in measured expression that depends on intensity b) Bias in measured expression that depends on array y-axis.

(TIF)

S2 Fig. Prediction of sex.

Scatterplots of sex related genes with known sex and ellipse of 95% density for fitted Gaussian mixture model.

(TIF)

S3 Fig. RUV parameter selection for validation set 1.

Plot of DRS and PVCA for various levels of the tuning parameter for RUV as well as two different sets of housekeeping genes.

(TIFF)

S4 Fig. RUV parameter selection for validation set 2.

Plot of DRS and PVCA for various levels of the tuning parameter for RUV as well as two different sets of housekeeping genes.

(TIFF)

S5 Fig. Consistency of calculated BES.

Overlap between significant genes selected for MSS/MSI differences in validation set 2 for BES calculated on different subsets of the reference set.

(TIFF)

S6 Fig. PCA of corrected samples for two sets of BES.

Principal Component Analysis (PCA) plots of uncorrected samples and samples corrected using two different sets of BES calculated on different subsets of the reference set.

(TIFF)

S7 Fig. Correlation between corrected samples.

Correlation between corrected samples at various levels of correction for BES calculated on different subsets of the reference set.

(TIFF)

S1 Table. List of samples in the reference set (Cell lines) with cross-validation split IDs.

(XLS)

S2 Table. List of samples in validation set 1 (Primary normal samples) with predicted sex.

(XLS)

S3 Table. List of samples in validation set 2 (Colon cancer and normal) with known MSI/MSS status.

(XLS)

S1 File. Supplementary methods.

(DOCX)

Acknowledgments

The author gratefully acknowledges the valuable feedback on this paper given by Vinodh Rajpakse and Augustin Luna. The author is also grateful for the constructive suggestions from the two anonymous reviewers, which have resolved several errors and made the manuscript stronger overall.

Data Availability

All data are taken from publicly available Gene Expression Omnibus projects (List of projects available in Supporting Information Tables).

Funding Statement

The author (SV) does contract statistical analysis under the business name of "HiThru Analytics LLC". Currently he is working as a contractor, part time with the National Institutes of Health (Bethesda, MD) and part time with Tridiuum Inc. (Philadelphia, PA). There are no other owners or employees of HiThru Analytics and it is not a subsidiary of any other company. The funder provided support in the form of salaries for the author (SV), but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the 'author contributions' section.

References

  • 1. Scherer A. Batch Effects and Noise in Microarray Experiments: Sources and Solutions. http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470741384.html. Accessed 29 Nov 2016.
  • 2. Fare TL, Coffey EM, Dai H, He YD, Kessler DA, Kilian KA, et al. Effects of Atmospheric Ozone on Microarray Data Quality. Anal Chem. 2003;75:4672–5. doi: 10.1021/ac034241b
  • 3. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9. doi: 10.1038/nrg2825
  • 4. Leek JT, Storey JD. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLoS Genet. 2007;3:e161.
  • 5. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–27. doi: 10.1093/biostatistics/kxj037
  • 6. Marron JS, Todd MJ, Ahn J. Distance-Weighted Discrimination. J Am Stat Assoc. 2007;102:1267–71.
  • 7. Jaffe AE, Hyde T, Kleinman J, Weinberger DR, Chenoweth JG, McKay RD, et al. Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics. 2015;16:372. doi: 10.1186/s12859-015-0808-5
  • 8. Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, et al. Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics. 2014;30:2757–63. doi: 10.1093/bioinformatics/btu375
  • 9. Nygaard V, Rødland EA, Hovig E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics. 2015;kxv027.
  • 10. Li S, Łabaj PP, Zumbo P, Sykacek P, Shi W, Shi L, et al. Detecting and correcting systematic variation in large-scale RNA sequencing data. Nat Biotechnol. 2014;32:888–95. doi: 10.1038/nbt.3000
  • 11. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–52. doi: 10.1093/biostatistics/kxr034
  • 12. Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, Montgomery SB, et al. Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge. PLOS ONE. 2013;8:e68141. doi: 10.1371/journal.pone.0068141
  • 13. Lazar C, Meganck S, Taminau J, Steenhoff D, Coletta A, Molter C, et al. Batch effect removal methods for microarray gene expression data integration: a survey. Brief Bioinform. 2012.

Decision Letter 0

Iratxe Puebla

2 Sep 2019

PONE-D-19-17279

Blind estimation and correction of microarray batch effect

PLOS ONE

Dear Dr. Varma,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript has been assessed by two reviewers; their comments are available below.

The reviewers have raised some major concerns that need attention in a revision. The reviewers note that further information is needed on where the method can be publicly accessed, and they also raise the need to complete further tests as well as comparison to other approaches in order to validate the proposed methodology.

Could you please revise the manuscript to carefully address the concerns raised by the reviewers?

We would appreciate receiving your revised manuscript by Oct 15 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Iratxe Puebla

Senior Managing Editor, PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Thank you for including your competing interests statement; "The author has declared that no competing interests exist."

We note that one or more of the authors are employed by a commercial company: HiThru Analytics LLC

  1. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript presents a novel computational method (BESC) to correct for batch effects in microarray experiments. The novel idea of the author is that he argues that scanning a large representative set of data should be informative enough for parametrizing the "general" batch effects that are inherent to a particular microarray chip once and for all.

If this works, the author points out that one could run BESC once and apply it to new data sets in the future without having to rerun BESC.

I find this a novel, appealing idea. I am not aware of a comparable approach.

The manuscript is well written, clearly understandable and structured.

Major points

(1) The manuscript states that the method is available from http://explainbio.com

But I did not find the software there when I accessed that site.

Hence, I/we could not test the tool ourselves, which is standard when announcing a new bioinformatics tool.

(2) The presented results (figures + suppl figures) appear convincing.

To illustrate the generality and limitations of BESC I suggest to add the following 2 tests:

(a) In addition to the 5-fold CV already done, I suggest that the same 5 data splits should be processed as follows:

-> First, BESC should be parametrized on sets 1 + 2 and then applied to set 5.

-> Alternatively, BESC should be parametrized on sets 3 + 4 and then applied to set 5.

It this manner, two halves of the training data would be used to predict batch-effect corrected versions of set 5. I am curious how different are the results of this exercise.

If one does this for the TCGA data (Fig. 3 C), how large is the overlap coefficient of the differentially expressed genes between the two runs?

(b) The author argued he used data for a large panel of representative cell lines so that the parametrized BESC should be of general nature.

I wonder what the limitations of this approach are?

Does this also work for tissues and conditions that are not represented in the training data set?

This should be demonstrated on a suitable example.

(3) Another research group told us about their experience with TCGA data from the Illumina 27k CpG methylation chip (breast cancer). They found a strong batch effect between separate charges of the chip produced at different times.

Such a case would not be covered by BESC. If you parametrized on MA data collected up to a certain moment in time and apply this to process "all" future data processed with this chip type, one would have to assume that the chip is homogeneously manufactured over many years.

But this may not be the case, and may remain undetected if users simply rely on BESC.

Discussion of this and related effects should be added in the discussion section.

Minor points

(4) throughtout the text many nouns start with capital letters, but need not to do so.

(5) line 317: is to much more -> is much more

(6) line 337: which can used -> which can be used

Reviewer #2: The paper presents a novel method using signatures of batch effects generated from an external reference to eliminate batch effects from microarray data without harming the biological signal.

The idea and method is very interesting and well presented.

1. My main criticism is regarding the validation of the method. the method was compared to SVA which removes batch effects as well as the biological signal. The current validation might just show that SVA removes also biology with the batch effects which is worse than the present method or even the raw uncorrected data. Another comparison should be used for validation. For example, comparison with Combat (RUV etc) or other method that deals with removing batch effects without the biological signal, should be added. It might be also compared to the raw data, showing that it outperform the raw data. In case the new method doesn't correct the data, still it will outperform SVA which overfits the data and eliminated the DEGs.

2. Im missing a presentation of other recent methods that deal with removing batch effects without removing the biological signal. For example, HCP by Mustafavi, BCeF by Somekh etc.

Minor issues:

p.1. line 9 "Even though the effects..." - unclear sentence

p.1. line 22 - very long sentence with too many parentheses.

p.2. l 34 - "or biological" - should be just experimental here?

p.4 - some mistake for gij on the jth gene? not enough explanation on gamma and the factors.

p 5 - need to add more explanation on the SVA formulas. so (3) is currently not fully clear. I had to read the SVA paper to understand.

p 5 l.90 - sometimes ki and sometimes ik are used.

p 5 l.92 - should be "each gene" instead of "all genes"?

l 96 - length one is unclear

l 98 - refine the English sentence

l 102 - please add: to remain the same, multiplied by some coefficient.

p 12 line 246 - Figure not Fig. Figure 2A - please add explanation on what can be seen for SVA.

p. 15 line 294 - unclear sentence

line 299 - for SVA, please add reference

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 9;15(4):e0231446. doi: 10.1371/journal.pone.0231446.r002

Author response to Decision Letter 0


6 Dec 2019

Response to reviewers

Firstly, I would like to thank the reviewers for the thorough review and very constructive suggestions. I have taken their suggestions to heart and performed extensive work to do the comparisons to other batch effect methods as well as validation methods. I believe this makes the manuscript stronger and improves the presentation considerably.

I have added my responses below in light orange font.

Reviewer #1:

This manuscript presents a novel computational method (BESC) to correct for batch effects in microarray experiments. The author's novel idea is that scanning a large representative set of data should be informative enough for parametrizing, once and for all, the "general" batch effects that are inherent to a particular microarray chip.

If this works, the author points out that one could run BESC once and apply it to new data sets in the future without having to rerun BESC.

I find this a novel, appealing idea. I am not aware of a comparable approach.

The manuscript is well written, clearly understandable and structured.

Major points

(1) The manuscript states that the method is available from http://explainbio.com

But I did not find the software there when I accessed that site.

Hence, I/we could not test the tool ourselves, which is standard when announcing a new bioinformatics tool.

That was due to a misconfiguration of the web server. The package should be available to download now, on clicking the “BESC R package” link on the top right-hand side. Please make sure to clear your browser cache before reloading http://explainbio.com
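
For readers who want to try the tool, a minimal sketch of installing the downloaded source package in R is shown below; the tarball file name is a placeholder, not necessarily the version linked on the site, and the package's exported functions should be looked up in its own documentation rather than assumed here.

    # Install the besc source package downloaded from http://explainbio.com.
    # "besc_1.0.tar.gz" is a placeholder file name for whatever version is linked.
    install.packages("besc_1.0.tar.gz", repos = NULL, type = "source")
    library(besc)              # load the package
    help(package = "besc")     # list the exported correction functions and their help pages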

(2) The presented results (figures + suppl figures) appear convincing.

To illustrate the generality and limitations of BESC I suggest to add the following 2 tests:

(a) In addition to the 5-fold CV already done, I suggest that the same 5 data splits should be processed as follows:

-> First, BESC should be parametrized on sets 1 + 2 and then applied to set 5.

-> Alternatively, BESC should be parametrized on sets 3 + 4 and then applied to set 5.

In this manner, two halves of the training data would be used to predict batch-effect corrected versions of set 5. I am curious how different the results of this exercise are.

I’ve included a plot that explores this (Supplementary Figure 6). I’ve plotted the uncorrected samples from split 5 along with the same samples corrected by batch effect signatures (BES) calculated on splits 1+2 and splits 3+4 (i.e. two sets of BES). As long as the number of BES is smaller than 10, the corrected samples from the two sets of BES stay close together (while moving away from the uncorrected samples). After 10 BES, the two sets of corrected samples start separating.
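
As an illustration only, here is a minimal R sketch of this kind of PCA comparison; raw, cor12 and cor34 are hypothetical genes-by-samples matrices holding the uncorrected split-5 samples and the same samples corrected with the BES from splits 1+2 and splits 3+4 (they are not objects produced by the besc package).

    # raw, cor12, cor34: hypothetical genes x samples matrices for split 5
    combined <- cbind(raw, cor12, cor34)                     # pool all three versions
    grp <- factor(rep(c("uncorrected", "BES 1+2", "BES 3+4"),
                      times = c(ncol(raw), ncol(cor12), ncol(cor34))))
    pca <- prcomp(t(combined))                               # samples as rows
    plot(pca$x[, 1], pca$x[, 2], col = as.integer(grp), pch = 19,
         xlab = "PC1", ylab = "PC2")
    legend("topright", legend = levels(grp),
           col = seq_along(levels(grp)), pch = 19)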

Furthermore, Supplementary Figure 7 shows the mean correlation between the corrections performed by the sets of BES computed on splits 1+2 vs 3+4. The correlation increases with increasing number of BES, reaching a maximum at 10 BES and then decreasing.

Together, the two results show that the two sets of reference data splits (splits 1+2 and splits 3+4) pick up similar batch effect signatures (up to ~10 BES), indicating that the algorithm is detecting a robust signal.

It seems reasonable that we can find 10 robust BES from a bit less than half the reference data (splits 1+2 or splits 3+4), compared to ~20 robust BES from the full reference set. It is possible that additional data in the reference set would give us a larger number of BES covering more of the batch effect.
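
A corresponding sketch, reusing the hypothetical raw, cor12 and cor34 matrices from the PCA example above, shows one way the agreement between the two corrections could be quantified; the exact quantity reported in Supplementary Figure 7 may be computed differently.

    # Correction applied by each set of BES (difference from the uncorrected data)
    delta12 <- cor12 - raw
    delta34 <- cor34 - raw
    # Per-sample correlation between the two corrections, then the mean
    per_sample <- sapply(seq_len(ncol(raw)),
                         function(j) cor(delta12[, j], delta34[, j]))
    mean(per_sample)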

If one does this for the TCGA data (Fig. 3 C), how large is the overlap coefficient of the differentially expressed genes between the two runs?

I’ve included a plot of the fraction of common genes between the colon cancer/normal dataset (validation set 2) when corrected by sets of BES calculated on the 1st + 2nd split combined and the 3rd + 4th split combined (Supplementary Figure 5). There is a high degree of overlap (>80% intersection) till the number of BES reaches 10. The BES start to diverge from 20 onwards and the overlap decreases.
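
For illustration, a small R sketch of how such an overlap could be computed is given below; genes12 and genes34 are hypothetical character vectors of significant genes obtained after correction with the two BES sets, and the ">80% intersection" in the manuscript may use a different denominator than the ones shown here.

    # genes12, genes34: hypothetical vectors of significant gene IDs
    common <- intersect(genes12, genes34)
    length(common) / length(union(genes12, genes34))          # Jaccard-style overlap
    length(common) / min(length(genes12), length(genes34))    # overlap coefficient,
                                                              # as asked by the reviewer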

(b) The author argued that he used data for a large panel of representative cell lines so that the parametrized BESC should be of a general nature.

I wonder what the limitations of this approach are?

Does this also work for tissues and conditions that are not represented in the training data set?

This should be demonstrated on a suitable example.

The two validation sets are actually different tissues and conditions compared to the reference set that was used to compute the BES. The reference set contains only cell line samples while the validation sets contain primary tissue samples (blood, lung and colon normal samples for validation set 1 and primary colon normal and cancer for validation set 2).

(3) Another research group told us about their experience with TCGA data from the Illumina 27k CpG methylation chip (breast cancer). They found a strong batch effect between separate lots of the chip produced at different times.

Such a case would not be covered by BESC. If one parametrizes BESC on microarray data collected up to a certain moment in time and applies it to process "all" future data generated with this chip type, one has to assume that the chip is homogeneously manufactured over many years.

But this may not be the case, and may remain undetected if users simply rely on BESC.

Discussion of this and related effects should be added in the discussion section.

I do detect a year-dependent effect of one of the BES. The coefficient calculated for samples with different scan years varies smoothly from year to year. It is true that any abrupt large change in the array manufacturing would not be detected by BESC until we have a number of reference samples on the new array. I’ve added a remark in the discussion on this issue.
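
A minimal sketch of how such a year dependence can be inspected is shown below; bes_coef and scan_year are hypothetical vectors holding, for each sample, the coefficient of the BES in question and the scan year taken from the CEL file header.

    # Mean coefficient of one BES per scan year (hypothetical inputs)
    year_means <- tapply(bes_coef, scan_year, mean)
    plot(as.numeric(names(year_means)), year_means, type = "b",
         xlab = "Scan year", ylab = "Mean BES coefficient")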

Minor points

(4) Throughout the text many nouns start with capital letters, but need not do so.

(5) line 317: is to much more -> is much more. Response: Corrected (line 372).

(6) line 337: which can used -> which can be used. Response: Corrected (line 395).

Reviewer #2:

The paper presents a novel method using signatures of batch effects generated from an external reference to eliminate batch effects from microarray data without harming the biological signal.

The idea and method is very interesting and well presented.

Major issues:

1. My main criticism is regarding the validation of the method. The method was compared to SVA, which removes batch effects as well as the biological signal. The current validation might just show that SVA also removes biology along with the batch effects, which is worse than the present method or even the raw uncorrected data. Another comparison should be used for validation. For example, a comparison with ComBat (RUV, etc.) or another method that removes batch effects without removing the biological signal should be added. It might also be compared to the raw data, showing that it outperforms the raw data. Even if the new method does not correct the data, it will still outperform SVA, which overfits the data and eliminates the DEGs.

Firstly, I would like to clarify that the comparison to the raw data is present in each figure (the very first point on the left with BES=0, indicating uncorrected data). I’ve added a dotted horizontal line in each figure to show the value for the uncorrected data.

I've added comparisons to the results using RUV [1] and HCP [2]. I find that RUV does outperform BESC in validation set 2 for correction with a small number of factors (Figure 3). However, this best performance is selected after choosing between two different sets of control genes and varying the value of ν (a tunable parameter for RUV). In contrast, BESC (and SVA) do not need to be tuned.

The results with HCP were disappointing for both validation sets (Figures 2 and 3). In both cases HCP does no better than the raw data. Note that these are the best results, selected from a range of values for the three HCP parameters λ, σ_1, σ_2.

The tuning of the parameters for RUV and HCP is described in the Supplementary Methods.
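
Schematically, this tuning amounts to a grid search. In the sketch below, run_hcp() and evaluate_correction() are hypothetical wrappers standing in for the HCP implementation and the DRS/PVCA evaluation, expr is the hypothetical expression matrix being corrected, and the parameter values are illustrative, not the grid actually used.

    # Hypothetical grid search over the three HCP parameters
    grid <- expand.grid(lambda = c(1, 5, 10, 20),
                        sigma1 = c(0.1, 1, 10),
                        sigma2 = c(0.1, 1, 10))
    scores <- apply(grid, 1, function(p) {
      corrected <- run_hcp(expr, lambda = p["lambda"],
                           sigma1 = p["sigma1"], sigma2 = p["sigma2"])
      evaluate_correction(corrected)   # e.g. DRS or PVCA; assume higher = better
    })
    best <- grid[which.max(scores), ]  # the best-performing setting is the one reported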

2. I'm missing a presentation of other recent methods that deal with removing batch effects without removing the biological signal, for example HCP by Mostafavi, BCeF by Somekh, etc.

I’ve added comparisons to RUV [1] and HCP [2] in the manuscript and discussion of the results from those algorithms. BCeF [3] appears to be a framework to evaluate and compare different batch correction methods based on how closely they recapitulate a known gene co-expression matrix. They report that none of the batch correction methods is able to fully recapitulate all the unmodelled confounders. As I understand it, they do not propose any new batch correction methods.

1. Gagnon-Bartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13: 539–552. doi:10.1093/biostatistics/kxr034

2. Mostafavi S, Battle A, Zhu X, Urban AE, Levinson D, Montgomery SB, et al. Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge. PLOS ONE. 2013;8: e68141. doi:10.1371/journal.pone.0068141

3. Somekh J, Shen-Orr SS, Kohane IS. Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset. BMC Bioinformatics. 2019;20: 268. doi:10.1186/s12859-019-2855-9

Minor issues:

p.1. line 9 "Even though the effects..." - unclear sentence. Response: I've rewritten the sentence (p.1. line 9).

p.1. line 22 - very long sentence with too many parentheses. Response: I've rewritten the sentence (p.1. line 23).

p.2. l 34 - "or biological" - should be just experimental here? Response: I have tried to differentiate between batch effect caused by unknown technical effects and that caused by biological effects (p.2. line 37). Most batch correction methods (e.g. ComBat) remove both, but BESC attempts to remove only the technical differences.

p.4 - some mistake for gij on the jth gene? not enough explanation on gamma and the factors. Response: I've explained gamma and tried to make the batch effect term clearer. g_lj (note it is "l", a lower case "L") is the effect of batch l on sample j. The dependence on gene i is modeled by γ_il (p.4. line 77).

p 5 - need to add more explanation on the SVA formulas. So (3) is currently not fully clear. I had to read the SVA paper to understand. Response: I've expanded the description (page 5).

p 5 l.90 - sometimes ki and sometimes ik are used. Response: Corrected (p.6. line 122).

p 5 l.92 - should be "each gene" instead of "all genes"? Response: Corrected (p.6. line 124).

l 96 - length one is unclear. Response: I've made it clearer (p.6. line 128).

l 98 - refine the English sentence. Response: I've rewritten the sentence to make it clearer (p.7. lines 129-134).

l 102 - please add: to remain the same, multiplied by some coefficient. Response: I've added that phrase (p.7. line 137).

p 12 line 246 - Figure not Fig. Figure 2A - please add explanation on what can be seen for SVA. Response: I've updated the figure with results for RUV and HCP, with a description of the results for each algorithm (p.14. line 290).

p. 15 line 294 - unclear sentence. Response: I've rewritten the sentence (p.16. line 349).

line 299 - for SVA, please add reference. Response: Added for SVA, RUV and HCP (p.17. line 353).

Attachment

Submitted filename: Response-to-reviewers.docx

Decision Letter 1

Volkhard Helms

16 Jan 2020

PONE-D-19-17279R1

Blind estimation and correction of microarray batch effect

PLOS ONE

Dear Dr. Varma,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that your manuscript may become acceptable after fixing a few remaining minor issues. Therefore, we invite you to submit a revised version of the manuscript that addresses the following 4 points:

(1) The supplemental figures are not referenced in the main text. It is up to readers to detect that there are some supplemental figures and to try to figure out what they show. So please add some sentences in the main text and explain what the supplemental figures show, like you did in your reply to the reviewers.

(2) Also, the legends of supplemental figures S5 to S7 should be improved.

For example, the current legend of Fig. S5

"S5 Fig: Consistency of calculated BES – Overlap between significant genes selected for

MSS/MSI differences in validation set 2 for BES calculated on different subsets of

the reference set."

does not reflect your reply to the comment of reviewer #1:

"I’ve included a plot of the fraction of common genes between the colon cancer/normal

dataset (validation set 2) when corrected by sets of BES calculated on the 1st + 2nd

split combined and the 3rd + 4th split combined (Supplementary Figure 5).

There is a high degree of overlap (>80% intersection) till the number of BES reaches 10.

The BES start to diverge from 20 onwards and the overlap decreases."

Similarly, the legends of Figs. S6 and S7 could be improved.

For example, the legend of Fig. S6 does not explain what the black, red and blue symbols are.

The following small points refer to the line numbering of the revised manuscript with tracked changes

(3) line 108: the disadvantages ... is -> the disadvantages ... are

(4) line 387: correct plural/singular in

"We show that the batch effect between different datasets occupy a space"

We would appreciate receiving your revised manuscript by Mar 01 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Volkhard Helms

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

The author has done a commendable job in addressing the points raised by the two reviewers.

He has performed all requested additional analysis (reviewer #1) and comparisons to other tools (reviewer #2).

The new analysis is convincing. The reply letter properly addresses all raised points.

However, when looking at the revised version of the manuscript I felt that the presentation of the additional analysis (reviewer #1) in the manuscript could be improved; see the points listed above.


PLoS One. 2020 Apr 9;15(4):e0231446. doi: 10.1371/journal.pone.0231446.r004

Author response to Decision Letter 1


16 Mar 2020

1) The supplemental figures are not referenced in the main text. It is up to readers to detect that there are some supplemental figures and to try to figure out what they show. So please add some sentences in the main text and explain what the supplemental figures show, like you did in your reply to the reviewers.

Response: I have added text to the manuscript describing those analyses and their results (lines 223-235 and 297-309).

2) Also, the legends of supplemental figures S5 to S7 should be improved. For example, the current legend of Fig. S5

"S5 Fig: Consistency of calculated BES – Overlap between significant genes selected for MSS/MSI differences in validation set 2 for BES calculated on different subsets of the reference set."

does not reflect your reply to the comment of reviewer #1:

"I've included a plot of the fraction of common genes between the colon cancer/normal dataset (validation set 2) when corrected by sets of BES calculated on the 1st + 2nd split combined and the 3rd + 4th split combined (Supplementary Figure 5). There is a high degree of overlap (>80% intersection) till the number of BES reaches 10. The BES start to diverge from 20 onwards and the overlap decreases."

Similarly, the legends of Figs. S6 and S7 could be improved.

For example, the legend of Fig. S6 does not explain what the black, red and blue symbols are.

Response: I've updated the legends and descriptions.

3) The following small points refer to the line numbering of the revised manuscript with tracked changes

a. line 108: the disadvantages ... is -> the disadvantages ... are

b. line 387: correct plural/singular in "We show that the batch effect between different datasets occupy a space"

Response: I have corrected these errors

Attachment

Submitted filename: Response.docx

Decision Letter 2

Volkhard Helms

25 Mar 2020

Blind estimation and correction of microarray batch effect

PONE-D-19-17279R2

Dear Dr. Varma,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Volkhard Helms

Guest Editor

PLOS ONE

Additional Editor Comments (optional):

The author has now properly addressed my remaining minor points.


Acceptance letter

Volkhard Helms

27 Mar 2020

PONE-D-19-17279R2

Blind estimation and correction of microarray batch effect

Dear Dr. Varma:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Volkhard Helms

Guest Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Normalization of arrays.

    a) Bias in measured expression that depends on intensity. b) Bias in measured expression that depends on array y-axis.

    (TIF)

    S2 Fig. Prediction of sex.

    Scatterplots of sex related genes with known sex and ellipse of 95% density for fitted Gaussian mixture model.

    (TIF)

    S3 Fig. RUV parameter selection for validation set 1.

    Plot of DRS and PVCA for various levels of the tuning parameter for RUV as well as two different sets of housekeeping genes.

    (TIFF)

    S4 Fig. RUV parameter selection for validation set 2.

    Plot of DRS and PVCA for various levels of the tuning parameter for RUV as well as two different sets of housekeeping genes.

    (TIFF)

    S5 Fig. Consistency of calculated BES.

    Overlap between significant genes selected for MSS/MSI differences in validation set 2 for BES calculated on different subsets of the reference set.

    (TIFF)

    S6 Fig. PCA of corrected samples for two sets of BES.

    Principal Component Analysis (PCA) plots of uncorrected samples and samples corrected using two different sets of BES calculated on different subsets of the reference set.

    (TIFF)

    S7 Fig. Correlation between corrected samples.

    Correlation between corrected samples at various levels of correction for BES calculated on different subsets of the reference set.

    (TIFF)

    S1 Table. List of samples in the reference set (Cell lines) with cross-validation split IDs.

    (XLS)

    S2 Table. List of samples in validation set 1 (Primary normal samples) with predicted sex.

    (XLS)

    S3 Table. List of samples in validation set 2 (Colon cancer and normal) with known MSI/MSS status.

    (XLS)

    S1 File. Supplementary methods.

    (DOCX)

    Attachment

    Submitted filename: Response-to-reviewers.docx

    Attachment

    Submitted filename: Response.docx

    Data Availability Statement

    All data are taken from publicly available Gene Expression Omnibus projects (List of projects available in Supporting Information Tables).

