Author manuscript; available in PMC: 2024 Apr 4.
Published in final edited form as: Nat Comput Sci. 2023 Aug 23;3(8):709–719. doi: 10.1038/s43588-023-00500-8

Batch effect correction with sample remeasurement in highly confounded case-control studies

Hanxuan Ye 1, Xianyang Zhang 1,*, Chen Wang 2, Ellen L Goode 2, Jun Chen 2,*
PMCID: PMC10993308  NIHMSID: NIHMS1974348  PMID: 38177326

Abstract

Batch effects are pervasive in biomedical studies. One approach to addressing them is to repeatedly measure a subset of samples in each batch; these remeasured samples are then used to estimate and correct the batch effects. However, rigorous statistical methods for batch effect correction with remeasured samples are severely underdeveloped. In this study, we developed a framework for batch effect correction using remeasured samples in highly confounded case-control studies. We provided theoretical analyses of the proposed procedure, evaluated its power characteristics, and provided a power calculation tool to aid in the study design. We found that the number of samples that need to be remeasured depends strongly on the between-batch correlation: when the correlation is high, remeasuring a small subset of samples can rescue most of the power.

1. Introduction

One major issue facing biological studies is that biological measurement is highly susceptible to non-biological experimental variation, or "batch effects". Batch effects are pervasive in modern high-throughput omics technologies using microarrays or next-generation sequencing [1, 2]. Different experimental conditions, measurement modalities, personnel executing the experiments, and batches of reagents all contribute to batch effects. Such unwanted variation has severe statistical consequences: it can reduce statistical power by introducing extra variation or, more seriously, lead to false findings if the batch effects are confounded with the effects of interest. Although performing the biological measurement in a single batch is the most effective way to reduce batch effects, such practice may not always be possible due to various constraints such as resource availability and measuring capacity. Even if the experimental measurement is executed in a single batch, some unexpected batch effects could still arise. For example, different measuring chips, locations on the chips, DNA extraction plates, and sequencing lanes have all been found to produce batch effects in omics studies [3-5]. Therefore, addressing the batch effects in the study design and data analysis is critical to improving the statistical power, increasing the robustness of the findings, and reducing the development cost.

Over the past two decades, a number of batch effect correction methods have been developed and applied in practical data analysis. The two mainstream approaches for batch effect correction are location-scale (LS) matching and matrix factorization (MF). The LS methods assume the sources of the batch effects are known so that the location (e.g., mean), scale (e.g., standard deviation), or even the entire distribution can be matched across batches. Methods in this category include batch mean-centering (BMC) [6], gene-wise standardization (SD) [7], ComBat [8, 9], cross-platform normalization (XPN) [10], and distance-weighted discrimination (DWD) [11]. Among these, ComBat, an empirical Bayes-based LS method, is the most widely used due to its robustness to small batch sizes compared with earlier methods [8, 9]. In contrast, the MF-based methods do not require the sources of batch effects to be known in advance. Instead, they search for directions of maximal variance associated with the batch effects and use the resulting latent factors to correct for batch effects. Methods in this category include singular value decomposition (SVD)/principal component analysis (PCA) [12, 13], surrogate variable analysis (SVA) [14], RUV [15-17], and LEAPP [18]. SVA, RUV, and LEAPP have also been studied and extended within the unified CATE rotation framework [19], which adjusts for the confounders in hypothesis testing.

Previous efforts for batch effects correction have been focused on estimating and correcting batch effects based on independent samples [8, 9, 14, 18, 19]. In practice, however, one intuitive approach used by investigators to address batch effects is through remeasuring a subset of samples in each batch in the hope that these remeasured samples could be used to estimate and correct the batch effects [20, 21]. Unfortunately, other than some simple approaches, statistical methods for batch correction using the remeasured samples remain severely underdeveloped. Biostatisticians are often faced with the inability to efficiently utilize these remeasured samples in the analysis to correct for batch effects, hindering the successful completion of the proposed studies. To fill the methodological gap, this study investigates the feasibility and methodology for batch effect correction using remeasured samples in a highly confounded case-control study [22]. We specifically consider a challenging scenario, where an investigator has collected all the case samples, and she wants to compare these case samples to the control samples that have already been measured previously and a subset of which are still available for remeasurement. This scenario is quite common in clinical settings since clinical investigators usually obtain case samples more easily than control samples. For example, an investigator wants to compare her case samples to the control samples from the institutional biobank [23]. Oftentimes, the biobank samples have already been characterized in a standalone study or have been used as controls in other disease studies, resulting in a large amount of pre-existing control data that can be potentially used together with the new control data generated from remeasuring a subset of the biobank samples. Another example is the subsequent analysis in case-cohort studies [24]. 
One strength of case-cohort studies is that the subcohort can be used as a reference group for a variety of different case groups. The subcohort in a case-cohort study implemented early for a common disease can be used as a reference group for a series of rarer or long-latency diseases. The data for the reference group already exist in study databases. For subsequent disease studies, the existing subcohort data may be reused after a subset of the subcohort samples have been remeasured.

Obviously, if none of the control samples are remeasured, the biological effects will be completely confounded with the batch effects, and distinguishing between the biological and batch effects will be very difficult. Ideally, all the control samples would be remeasured together with the case samples to maximize the discovery power. However, due to resource constraints and sample availability, such practice may not always be possible. Therefore, it would be of tremendous help if remeasuring only a small subset of control samples sufficed to correct the batch effects. Despite being a subject of critical importance, to our surprise, no dedicated statistical methods are available. No theoretical investigation has been performed to study the operating characteristics of batch effect correction with remeasured samples. It is unknown how many control samples need to be remeasured to recover most of the power, whether a handful of controls is sufficient to correct for batch effects, and what factors matter most in deciding the number of remeasured samples. A rigorous statistical testing method coupled with a power calculation tool is critically needed for this particular scenario. A successful tool could potentially rescue a completely confounded study and have a tremendous economic impact on the field.

In this study, we proposed a computationally efficient statistical method for batch effect correction with remeasured samples for a highly confounded case-control study. The method is based on the maximum likelihood framework, and hence the derived procedure is optimal in using the information available. We studied the theoretical properties of the procedure and proved the consistency and asymptotic normality of the resulting estimators. We investigated the power characteristics of the approach based on simulations and theoretical analysis, and identified statistical properties affecting its power. Finally, we proposed a power calculation tool to aid in the study design. A real dataset with known batch effects and a large number of remeasured samples was used to demonstrate the feasibility and efficiency of the proposed procedure.

2. Results

2.1. Problem Setup and Model

Consider that the control and case samples are measured on two different batches. We assume the linear model

$$y_i = x_i a_0 + x_i a_1 + z_i^\top b + \epsilon_i, \qquad \epsilon_i \mid x_i \sim \mathcal{N}\big(0,\,(1-x_i)\sigma_1^2 + x_i\sigma_2^2\big), \tag{1}$$

for $i = 1, 2, \ldots, n$, where $y_i$ is the outcome, $x_i \in \{0, 1\}$ is the control/case group membership (0: control, 1: case), $z_i$ contains measurements of other covariates, including the intercept and possible covariate-batch (group) interactions, $a_0$ is the coefficient for the true biological effect, and $a_1$ is the coefficient for the nuisance batch effect. Since the batch effect and the biological effect are indistinguishable in this model, remeasurement of a subset of samples is necessary. Suppose the control and case samples are collected in the first and second batch, respectively, and a subset of control samples of size $\tilde{n}_1$ is remeasured in the second batch. Suppose the control and case groups contain $n_1$ and $n_2$ samples, respectively. Without loss of generality, we assume that the first $\tilde{n}_1$ control samples are remeasured, where $\tilde{n}_1 \le n_1$ (Figure 1). Then we have

Control (batch 1): $y_i = z_i^\top b + \epsilon_i^{(1)}$, $i = 1, 2, \ldots, n_1$
Case (batch 2): $y_i = a_0 + a_1 + z_i^\top b + \epsilon_i^{(2)}$, $i = n_1 + 1, \ldots, n_1 + n_2 = n$
Control (batch 2): $y_i = a_1 + z_i^\top b + \epsilon_i^{(2)}$, $i = n + 1, \ldots, n + \tilde{n}_1$

where $\epsilon_i^{(1)} \sim \mathcal{N}(0, \sigma_1^2)$ for $1 \le i \le n_1$, $\epsilon_i^{(2)} \sim \mathcal{N}(0, \sigma_2^2)$ for $n_1 + 1 \le i \le n + \tilde{n}_1$, and

$$\mathrm{cov}\big(\epsilon_i^{(1)}, \epsilon_{n+i}^{(2)}\big) = \rho \sigma_1 \sigma_2$$

for $1 \le i \le \tilde{n}_1$. The goal here is to develop an efficient procedure to test the null hypothesis

$$H_0 : a_0 = 0,$$

i.e., there is no true biological effect.
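To make the design concrete, the three groups above can be simulated directly. The following is a minimal sketch under a single-covariate setting; the function name and interface are our own, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n1, n2, m, a0, a1, b, rho, s1, s2):
    """Draw one dataset: n1 batch-1 controls, n2 batch-2 cases, and the
    first m controls remeasured in batch 2 (m plays the role of n1-tilde)."""
    z = rng.standard_normal(n1 + n2)               # a single covariate
    e1 = s1 * rng.standard_normal(n1)              # batch-1 control errors
    e2 = s2 * rng.standard_normal(n2)              # batch-2 case errors
    # remeasurement errors: correlation rho with the paired batch-1 errors
    e2r = s2 * (rho * e1[:m] / s1 + np.sqrt(1 - rho**2) * rng.standard_normal(m))
    y_ctrl1 = z[:n1] * b + e1                      # controls, batch 1
    y_case2 = a0 + a1 + z[n1:] * b + e2            # cases, batch 2
    y_ctrl2 = a1 + z[:m] * b + e2r                 # remeasured controls, batch 2
    return y_ctrl1, y_case2, y_ctrl2, z
```

The remeasured controls reuse the batch-1 covariates, and their errors are constructed so that cov(e1, e2r) = ρσ1σ2, matching the covariance condition above.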

Fig. 1: Illustration of the study design.


Here, C1 denotes the set of n1 control samples in Batch 1, and T2 denotes the set of case samples in Batch 2. A subset S1 consisting of ñ1 samples from the C1 set is remeasured. The set of these remeasured samples in Batch 2 is denoted C2. C1\S1 represents the control samples that are not remeasured.

We introduce some notation before describing the estimation and inference procedures. Denote by $C_1$ the set of control samples in the first batch and by $T_2$ the set of case samples in the second batch. Let $S_1 = \{1, \ldots, \tilde{n}_1\}$ and $C_2 = \{n+1, \ldots, n+\tilde{n}_1\}$ be the subsets of remeasured control samples in batch 1 and batch 2, respectively, where $|S_1| = |C_2| = \tilde{n}_1$. See Figure 1 for an illustration. Note that the covariates associated with the samples in $S_1$ and $C_2$ are the same. Let $\theta = (a_0, a_1, b, \rho, \sigma_1, \sigma_2)$ be the parameter vector to be estimated, and $N = (n_1, n_2, \tilde{n}_1)$ the vector of sample sizes. We define $\mu_{1i} = z_i^\top b$ for $i \in C_1$, $\mu_{2i} = a_0 + a_1 + z_i^\top b$ for $i \in T_2$, and $\mu_{3i} = a_1 + z_i^\top b$ for $i \in C_2$.

2.2. Simulation Studies

We conduct a set of simulation studies to investigate the finite sample performance of the proposed procedure in terms of estimation accuracy, type I error rate, and statistical power. Moreover, we compare our method (“ReMeasure”) with three alternative procedures:

  1. The location-scale matching approach (“LS”, details in Supplementary Section 3).

  2. Estimation and inference using only the second batch data (“Batch2”).

  3. Estimation and inference using the whole data set while ignoring the batch effects (“Ignore”).

We study the effects of location and scale differences between the two batches, the between-batch correlation, and the number of remeasured samples. We generate the data according to the model in (1). Specifically, we set n1 = n2 = 50 and consider a univariate covariate zi randomly drawn from the standard normal distribution. We let b = −0.5 and set σ2² = 1 so that a0 can be interpreted as Cohen's d [25], an effect size measure for a two-sample t-test. Let ρ be the between-batch correlation for the remeasured control samples. We investigate the batch scale parameter σ1² ∈ {0.5², 1², 2²}, the between-batch correlation ρ ∈ {0.3, 0.6, 0.9}, and the remeasured sample size ñ1 ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}. We set the true biological effect a0 ∈ {0, 0.5, 0.8}, representing no effect, moderate effect, and strong effect, according to Cohen's criterion. We found empirically (Supplementary Figure 1a) that the behavior of the proposed estimator of a0 did not depend on the value of the batch location effect a1, so we set a1 = 0.5 throughout the simulations.

In Figure 2 and Supplementary Table 1, we report the mean square error (MSE) of different procedures for estimating the biological effect a0 when a0 = 0.5. The MSE of the a0 estimate for other values shows the same pattern (data not shown). The method that ignores the batch effect and the remeasured samples ("Ignore") performs the worst in almost all settings. In contrast, the MSE for the other methods decreases with the number of remeasured samples but increases with σ1². When the between-batch correlation ρ is small, the MSE of the method based on the second batch ("Batch2") is similar to that of the proposed method ("ReMeasure"), suggesting that the control samples in the first batch provide limited information when ρ is small. In this case, using the first batch of samples may only marginally improve the estimation efficiency. As ρ becomes larger, the "Batch2" method begins to be less efficient. When the between-batch correlation is very high (ρ = 0.9), the control samples in the first batch improve the estimation accuracy tremendously, and "ReMeasure" achieves a considerably smaller MSE even when the number of remeasured samples is small. The location-scale matching method ("LS"), on the other hand, has a much higher MSE than "ReMeasure", especially when the batch scale parameter for the first batch is large (σ1 = 2). As the number of remeasured samples increases, the discrepancy decreases, indicating that a large number of remeasured samples may be needed for "LS" to work properly.

Fig. 2: The mean square error (MSE) of the a0 estimate for different procedures when n1 = n2 = 50.


We vary the degree of between-batch correlation (ρ values of 0.3, 0.6, and 0.9 for panels a, b, and c, respectively) and the noise level (σ1, left to right). Both the biological effect parameter a0 and the batch location parameter a1 are set to 0.5. For clarity, the y-axis is presented on the log10 scale. Results are based on 1000 replications. Data are presented as mean values +/− SEM.

Next, we study the type I error rate and the statistical power (Figure 3). As expected, "Ignore" has the largest type I error inflation, while "Batch2" controls the type I error across all settings. "LS" has severely inflated type I error when the number of remeasured samples is small, reflecting the large MSE observed. In contrast, the proposed method "ReMeasure" has much better type I error control than "LS", and it generally controls the type I error at the target level when ñ1 ≥ 10. However, when the number of remeasured samples is very small (ñ1 = 5), "ReMeasure" has some type I error inflation; a larger between-batch correlation (ρ) reduces this inflation. The inflation is due to the use of the plug-in estimates of the variance components (σ1², σ2², ρ) in deriving the asymptotic distribution. When the number of remeasured samples is small, the estimation of ρ is subject to large variability, and the asymptotic null distribution could deviate from the true null distribution. Indeed, if we plug the true ρ into the test statistic instead of the estimated version (the "Oracle" procedure), the type I error under ñ1 = 5 is brought down close to the target level (Supplementary Figure 2a). In terms of statistical power, "ReMeasure" is similar to or slightly better than "Ignore" when ρ is small but is substantially more powerful when ρ is large. The high power of "LS" and "Ignore" is not very meaningful since they have severe type I error inflation. We also compared the performance under different σ1 values; the patterns were almost identical (Supplementary Figure 1b).

Fig. 3: Evaluation of the empirical type I error and power of different procedures in testing the null hypothesis H0: a0 = 0 with n1 = n2 = 50.


The true biological effects a0 we explored include 0 (type I error, panel a), 0.5 (power, panel b), and 0.8 (power, panel c). For each panel, from left to right, we increase the between-batch correlation (ρ) from 0.3 to 0.9. The batch location parameter a1 is set to 0.5 and the batch scale parameter σ1 is set to 2. The dashed line indicates the 5% nominal type I error rate used.

To improve the type I error control with a small number of remeasured samples (ñ1 < 10), we propose using the bootstrap method to derive a more accurate null distribution. Supplementary Figure 3 shows that the bootstrap method controls the type I error at small ñ1 across settings. However, the better type I error control comes at the expense of some power, and the bootstrap is slightly less powerful than the asymptotic approach. When ρ is small, it may not have any advantage over the "Batch2" method. Therefore, the bootstrap method is only recommended for small ñ1 when ρ is not small.

To demonstrate the robustness of the proposed method, we performed additional simulations under large sample sizes, different batch location parameters, and different error distributions. We also compared against two additional approaches: the naive approach, which fits a linear model to all the samples, adjusting for the batch variable and ignoring the repeated measurements, and the "LSind" approach, a variant of the "LS" method that uses all the control samples to estimate the location and scale parameters. The results are summarized in Supplementary Section 4 ("Additional simulations").

2.3. Theoretical Power Analysis

In practice, one frequent question asked by an investigator is how many control samples need to be remeasured to achieve sufficient statistical power. Although the simulation-based approach can be used for power calculation, it is computationally intensive and is not amenable to large sample sizes. It also does not allow flexible exploration of different parameter settings. Therefore, an analytical power calculation tool is needed to aid in the study design. To this end, we propose an approximate power calculator based on the asymptotic distribution. Specifically, the type I error and power can be calculated theoretically through the asymptotic normality of $\hat{a}_0$: $(\hat{a}_0 - a_0)/\mathrm{sd}(\hat{a}_0) \Rightarrow \mathcal{N}(0, 1)$. The power $\Pr\big(|\hat{a}_0|/\mathrm{sd}(\hat{a}_0) > z_{1-\alpha/2}\big)$ at significance level $\alpha$ can be calculated as

$$\Pr\left(\left|\frac{\hat{a}_0 - a_0}{\mathrm{sd}(\hat{a}_0)} + \frac{a_0}{\mathrm{sd}(\hat{a}_0)}\right| > z_{1-\alpha/2}\right) \approx \Pr\left(\left|Z + \frac{a_0}{\mathrm{sd}(\hat{a}_0)}\right| > z_{1-\alpha/2}\right), \tag{2}$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution and $Z \sim \mathcal{N}(0, 1)$. In the theoretical power calculation, the oracle estimator for $\mathrm{sd}(\hat{a}_0)$ is used, where we assume that $\rho$, $\sigma_1$, and $\sigma_2$ are all known.
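Equation (2) is straightforward to evaluate numerically. A minimal, dependency-free sketch (the helper names are ours, and sd(â0) is taken as an input rather than derived from the oracle variance formula):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_quantile(p):
    """Inverse of norm_cdf by bisection (avoids external dependencies)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def approx_power(a0, sd_a0, alpha=0.05):
    """Approximate power from Eq. (2): Pr(|Z + a0/sd(a0-hat)| > z_{1-alpha/2})."""
    z = norm_quantile(1.0 - alpha / 2.0)
    shift = a0 / sd_a0
    return (1.0 - norm_cdf(z - shift)) + norm_cdf(-z - shift)
```

For a0 = 0 the formula returns the nominal level α, as it should, and the power increases toward 1 as a0/sd(â0) grows.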

Supplementary Figure 2b provides a comparison between the theoretical power (“Theory”) and the empirical power based on the asymptotic method (“ReMeasure”). The theoretical power does not deviate much from “ReMeasure” at different effect sizes. The approximation is more accurate when the number of remeasured samples is larger, and the between-batch correlation is higher. Thus, the theoretical power provides a reasonable approximation to the actual power when the proposed procedure is applied.

With the theoretical power calculator, we can conduct power analysis under different parameter settings. In addition to the usual parameters in a power calculation for a two-sample t-test, namely the sample sizes of the control group (n1) and the case group (n2), the effect size a0 (Cohen's d, the mean difference standardized by the within-group standard deviation), the significance level, and the desired power, power analysis for the proposed procedure depends on two additional parameters: the number of remeasured control samples ñ1 and the between-batch correlation ρ. The batch location and scale parameters, on the other hand, have little effect on power. Besides traditional power analyses such as power vs. sample size and power vs. effect size, in our context, investigators are frequently interested in the following two types of power analysis:

  • Given fixed sample sizes for the control and case group, how much power do we have at different numbers of remeasured samples?

  • Given fixed sample sizes for the control and case group, how many control samples do we need to remeasure to recover, for example, 80% of the optimal power? The optimal power is defined as the power we can achieve by remeasuring all the control samples.
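Given a way to evaluate sd(â0) as a function of the number of remeasured samples (for instance, from the oracle variance discussed above), the second question reduces to scanning the power curve. A self-contained sketch, where sd_fn is our own, hypothetical interface supplied by the user:

```python
from math import erf, sqrt

def power_at(a0, sd_a0, z=1.959964):
    """Pr(|Z + a0/sd| > z) for the default two-sided 5% level."""
    cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return (1.0 - cdf(z - a0 / sd_a0)) + cdf(-z - a0 / sd_a0)

def min_remeasured(a0, sd_fn, n1, frac=0.80):
    """Smallest number of remeasured controls whose power reaches frac of
    the optimal power (all n1 controls remeasured), where sd_fn(m)
    returns sd(a0-hat) when m controls are remeasured."""
    target = frac * power_at(a0, sd_fn(n1))
    for m in range(1, n1 + 1):
        if power_at(a0, sd_fn(m)) >= target:
            return m
    return n1
```

The scan mirrors what the Shiny app reports: the point where the power curve first crosses the chosen fraction of the optimal power.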

These questions can be easily answered with the theoretical power formula. To aid in study design, we provide an R Shiny app (https://hanxuan.shinyapps.io/PowerCalculation), which takes the user-supplied parameter values (sample size, effect size, between-batch correlation, significance level) as input and outputs the power at different numbers of remeasured samples. We provide both the absolute and relative power, where the absolute power is the statistical power in the traditional sense, i.e., the probability of rejecting the null hypothesis when it is false, and the relative power is the ratio of the absolute power to the optimal power defined above. Supplementary Figure 4 shows an example of power calculation for a confounded case-control study with sample remeasurement. In this example, both the case and control sample sizes are fixed at 50, the expected between-batch correlation is 0.6, the effect size to detect (Cohen's d) is 0.6, and the significance level is 0.05. The Shiny app outputs a power curve over different numbers of remeasured samples, from which we can see that 35 control samples need to be remeasured to achieve 80% absolute power (Supplementary Figure 4a) and 19 control samples need to be remeasured to achieve 80% of the optimal power (Supplementary Figure 4b).

Finally, we perform additional power analysis to gain more insights into the proposed procedure. Figure 4 shows the proportion of control samples that need to be remeasured to achieve 80%, 90%, 95% relative power at different sample sizes, effect sizes, and between-batch correlations. We can see that the larger the between-batch correlation, the smaller the number of samples that need to be remeasured to achieve desired relative power. The proportion of samples that need to be remeasured drops rapidly when the correlation is greater than 0.6.

Fig. 4: Proportion of control samples that need to be remeasured to achieve 80%, 90%, and 95% relative power vs. the between-batch correlation ρ when n1 = n2 = 50, 100, and 200.


We fix a1=0.5 and consider settings where the effect size (Cohen’s d) takes values 0.5 (panel a) and 0.8 (panel b), representing moderate and strong effects, respectively, according to Cohen’s criterion. Results are derived from 500 replications.

2.4. Real Data Application

We next use a real dataset to illustrate the proposed method. The dataset came from two transcriptomics studies of ovarian cancer using different measurement platforms [26-28]. In the first study, the gene expression was profiled using Agilent microarrays [26, 27]. In the second study, the gene expression was profiled using RNA-Seq [28]. It is well known that different measurement platforms create strong batch effects in omics studies [1]. A subset of the samples was profiled in both studies, which provides us the opportunity to evaluate the proposed method. In this analysis, we focused on high-grade serous ovarian cancer, the most common type of ovarian cancer, with well-defined cancer subtypes [27, 29] (Agilent dataset n = 306, RNA-Seq dataset n = 97). There are 47 samples measured in both datasets. After intersecting the genes from the two platforms, we included 11,861 genes in the analysis. Based on these remeasured samples, we calculated the correlation of the gene expression between the two platforms. Supplementary Figure 5a shows that the distribution of the correlation coefficients has a wide range (−0.47, 0.87) with a median correlation of 0.48. About 24% of the genes have a correlation larger than 0.6. The overall correlation is considered to be medium. To demonstrate the proposed method, we analyzed the cancer subtype variable (four subtypes: C1-MES, C2-IMM, C4-DIF, and C5-PRO) to identify subtype-specific gene signatures by comparing the expression profile of a specific subtype to that of the other subtypes. The Agilent dataset consists of 76, 77, 71, and 82 samples for the C1-MES, C2-IMM, C4-DIF, and C5-PRO subtypes, respectively, while the RNA-Seq dataset consists of 25, 20, 28, and 24 samples for these subtypes, respectively.
We artificially created two sample groups with complete confounding by letting one subtype be measured on one platform and the other three on the other platform, mimicking a completely confounded case-control study.

We first compare the performance of "ReMeasure", "Batch2", "Ignore", and "LS" after fitting gene-wise models. We start by evaluating the type I error control of the proposed method. This is achieved by comparing the same subtypes between the Agilent and the RNA-Seq platforms. To ensure sufficient statistical power, we pooled samples from all four subtypes and made the subtype composition similar between the Agilent and RNA-Seq datasets. Specifically, we compare 276 Agilent samples, consisting of 69 samples in each subtype, to 68 RNA-Seq samples, consisting of 17 samples in each subtype, using 40 remeasured Agilent samples to correct for batch effects. Since both batches have a similar subtype composition and the patient characteristics are also similar between the two batches (they are from the same Midwest population), we expect to see very few substantial differences. Indeed, based on Figure 5, we observe that "Batch2" detects about 5% "significant" genes across different numbers of remeasured samples, as expected at the 5% significance cutoff. "ReMeasure" detects close to 5% "significant" genes when the number of remeasured samples is larger than or equal to 10, consistent with the simulation results. In contrast, "Ignore" and "LS" make substantially more "false" discoveries, indicating that they have poor type I error control.

Fig. 5: Comparison of “ReMeasure”, “Batch2”, “Ignore” and “LS” on the real dataset.


(a) The number of discoveries vs. the number of remeasured samples by comparing the same subtypes between the two platforms. Specifically, “C1+C2+C4+C5 RNAseq vs. C1+C2+C4+C5 Agilent” denotes comparing subsets of samples, each consisting of an identical number of samples from each subtype (C1-MES, C2-IMM, C4-DIF, C5-PRO), between the RNAseq and Agilent platforms. A two-sided z-test with a 5% significance cut-off is applied for all methods. (b) The average rank vs. the number of remeasured samples for those subtype signature genes by comparing different subtypes on the two platforms. Likewise, “C1+C2+C5 RNAseq vs. C4 Agilent” refers to comparing combined “C1-MES,” “C2-IMM,” and “C5-PRO” subtypes from the RNAseq platform to the “C4-DIF” subtype from the Agilent platform. The same explanation applies to other titles.

Next, we conduct a power study by comparing one subtype from the Agilent platform to the other three subtypes from the RNA-Seq platform, treating the RNA-Seq samples as controls and the Agilent samples as cases. RNA-Seq samples remeasured on the Agilent platform are used to correct batch effects. To objectively evaluate power, we need to know the ground truth. However, the ground truth is unknown in this case, so instead we create a list of genes that are more likely to be subtype signatures by comparing one subtype vs. the others within the same Agilent dataset. Based on two-sample t-tests and 5% FDR (Benjamini-Hochberg procedure), we identified 3793, 4212, 6168, and 4439 signature genes for the four subtypes, respectively. In the following, we conduct four types of comparisons: 1) C1+C2+C5 RNA-Seq vs. C4 Agilent, 2) C1+C4+C5 RNA-Seq vs. C2 Agilent, 3) C1+C2+C4 RNA-Seq vs. C5 Agilent, and 4) C2+C4+C5 RNA-Seq vs. C1 Agilent. We evaluate the ability of the proposed method to retrieve those signature genes, in comparison with "Batch2", "LS", and "Ignore". If a method works, we expect the signature genes to rank high (lower p-values) in the respective results. The average ranks of "ReMeasure" and "Batch2" are much higher than those of "Ignore" and "LS" (Figure 5). "ReMeasure" achieves a slightly higher rank than "Batch2", especially when the number of remeasured samples is at the lower end. Moreover, "ReMeasure" recovers substantially more genes than "Batch2" for the four subtypes at 5% FDR (Supplementary Figure 5b), indicating that "ReMeasure" is more powerful than "Batch2" while its false positive control is similar.

Finally, we compare the number of discoveries for “Batch2” and “ReMeasure” on the genes with the lowest between-batch correlation (bottom quartile) and the highest between-batch correlation (top quartile). Supplementary Figures 5c and 5d reveal that our approach is more similar to “Batch2” under weak correlation and more powerful than “Batch2” under a strong correlation, consistent with our simulation study.

We further compare our "ReMeasure" method to "ComBat" [8], "SVA" [14, 30], and "RUV" [15-17], three of the most popular batch effect correction methods, on the real dataset. "ComBat" directly removes the known batch effects by performing an empirical Bayes adjustment, while "SVA" identifies and estimates surrogate variables for unwanted variation, including batch effects and other unmeasured biological variation, without requiring knowledge of the batch a sample belongs to. "RUV" assumes a factor model that utilizes negative control genes (i.e., genes unrelated to the factor of interest) to estimate the latent factors for unwanted variation. Although in our case the batch information is known, we still run "SVA" and "RUV" to see whether they can capture the known batch effects. We used the ComBat and sva functions in the R Bioconductor sva package, and the naiveReplicateRUV function in the R Bioconductor RUVnormalize package, to run the three procedures. The remeasured samples in the second batch were included in the analysis, but their corresponding samples in the first batch were excluded to satisfy the independence assumption of these methods. For "SVA", we used the permutation method described in [31] to estimate the optimal number of surrogate variables. The resulting surrogate variables were then included in the regression model as covariates. The p-values were calculated based on the F-test comparing the models with and without the group variable. For "ComBat", we fit the gene-wise linear regression model on the batch-corrected data. For "RUV", 364 housekeeping genes were used as negative controls, and the remeasured samples were used as the replicates, following the original paper [17].

Under the null, where we compare the gene expression of the same subtypes (C1+C2+C4+C5) between the two measurement platforms (RNA-Seq vs. Agilent), "SVA" finds a substantially higher number of significant genes than would be predicted under the null, even with a large number of estimated surrogate variables (>24 surrogate variables in most cases; Figure 6a), indicating that the estimated surrogate variables are still not adequate to capture the full batch effects. Since "SVA" could not control the type I error properly, its high power under the alternative hypothesis is not meaningful (Figure 6b). On the other hand, "ComBat" is very conservative and finds very few significant genes under the null (Figure 6a). Its type I error control comes at the expense of power: when we compare one subtype vs. the others (Figure 6b), the power of "ComBat" is extremely low, indicating that most of the true biological signals may be removed during batch correction due to the high confounding of biological and batch effects. "RUV" also has substantially inflated type I error, though less severe than "SVA"; the number of detected significant genes decreases with the number of remeasured samples.

Fig. 6: Comparison to “ComBat”, “SVA” and “RUV” on the real dataset.


(a) and (b) show the number of discoveries vs. the number of remeasured samples when comparing subtypes on the two platforms. The p-values of “ComBat” and “SVA” are calculated based on the F-test, while those of “RUV” and “ReMeasure” are obtained from two-sided t-tests and z-tests, respectively. (a) Comparing the same subtypes at the nominal type I error level of 0.05. “C1+C2+C4+C5 RNAseq vs. C1+C2+C4+C5 Agilent” denotes comparing subsets of samples, each consisting of an identical number of samples from each subtype (C1-MES, C2-IMM, C4-DIF, C5-PRO), between the RNAseq and Agilent platforms. A 5% significance cut-off is applied for all methods. (b) Comparing different subtypes at 5% FDR. “C1+C2+C5 RNAseq vs. C4 Agilent” refers to comparing the combined “C1-MES,” “C2-IMM,” and “C5-PRO” subtypes from the RNAseq platform to the “C4-DIF” subtype from the Agilent platform. The same explanation applies to the other titles. (c) and (d) show the unadjusted raw p-value distributions. (c) Comparing the same subtypes with all 40 remeasured samples included. (d) Comparing different subtypes (C2+C4+C5 RNA-Seq vs. C1 Agilent) with all 35 remeasured samples included.

It is also interesting to compare the p-value distributions of the four methods. Under the null (C1+C2+C4+C5 RNA-Seq vs. C1+C2+C4+C5 Agilent), the p-value distribution of “ReMeasure” is close to the uniform distribution, while the p-value distributions of “ComBat”, “SVA”, and “RUV” deviate substantially from the uniform distribution (Figure 6c). When comparing C2+C4+C5 RNA-Seq to C1 Agilent (Figure 6d), the p-value distribution of “ReMeasure” has the expected form for a multiple testing experiment with signals, with a spike of small p-values and a long tail of larger p-values close to the uniform distribution. In contrast, the p-value distribution of “ComBat” has a spike on the right side of the histogram due to over-adjustment, and the p-values of “SVA” concentrate on the left side of the histogram due to under-adjustment. The p-value distribution of “RUV” behaves well in this case.
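The qualitative checks above (closeness to uniform under the null; a left spike vs. a right spike under over- or under-adjustment) can be quantified with standard tools. A minimal sketch, with function and dictionary-key names of our own choosing:

```python
import numpy as np
from scipy import stats

def pvalue_diagnostics(pvals, alpha=0.05):
    """Summarize a vector of per-gene p-values.

    Reports the Kolmogorov-Smirnov distance from Uniform(0, 1), the
    fraction below alpha (a type I error proxy under a global null),
    and the fraction near 1 (a right-side spike suggests over-adjustment,
    a large fraction below alpha suggests under-adjustment).
    """
    pvals = np.asarray(pvals, dtype=float)
    ks_stat, ks_p = stats.kstest(pvals, "uniform")
    return {
        "ks_stat": float(ks_stat),
        "ks_pvalue": float(ks_p),
        "frac_below_alpha": float(np.mean(pvals < alpha)),
        "frac_above_95": float(np.mean(pvals > 0.95)),
    }
```

For a well-calibrated method under the null, `frac_below_alpha` should be close to `alpha` and `ks_stat` close to zero.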

We thus conclude that the existing batch adjustment methods do not work well in the severely confounded scenario, and our method can effectively leverage the remeasured samples to correct batch effects.

3. Discussion

Due to the complex technical processes involved in biological measurement, even slight variation in sample preparation and processing can cause batch effects [1]. In many cases, batch effects are not known until the data are analyzed. Batch effects are most disastrous when they are highly confounded with the variable of interest, for example, when the case and control samples are measured separately. In such scenarios, it is extremely challenging to separate the true biological effects from the batch effects. Although such confounded studies could result from a poor study design or a lack of awareness of batch effects, they could also arise from logistical issues. For example, a clinical investigator may have collected patient samples and want to compare them to existing controls, but due to sample availability or financial constraints, the investigator may not be able to remeasure all the control samples together with the case samples. It is thus of tremendous help to the investigator if she only needs to remeasure a small subset of control samples while retaining most of the power.

Traditional batch effect correction methods such as “ComBat” [8], “SVA” [14, 30], and “RUV” [15–17] were mainly developed for independent samples and have limited ability to correct batch effects in highly confounded scenarios. They either remove the batch and biological effects altogether (reduced power) or retain the batch effects to a large extent (increased type I error).

Our method has several limitations. In some cases, the control samples may not be available for remeasurement, making our method inapplicable. Even when they are available, there can still be subtle batch effects associated with differences in collection, storage, and freeze-thaw cycles [24]. Although reprocessing the samples can reduce batch effects, batch effects associated with upstream technical variation can still persist, and our method cannot correct these residual batch effects. Furthermore, although we show our method is robust to some deviation from the Gaussian distribution, it can still perform poorly when the data are highly skewed or zero-inflated. As genomics studies move into the era of single-cell genomics, the data have become even more complex, with severe zero inflation [32]. Simple data transformation may not be sufficient to make the data Gaussian-like, so extending our method to analyze such complex genomics datasets will require new methodological development. One potential direction is to extend our method to the generalized linear model setting, where the measurement can be modeled by more general distributions, such as a zero-inflated negative binomial model for zero-inflated count data [33, 34].

Our procedure is based on the maximum likelihood estimation framework, and we proved its consistency and asymptotic normality. However, when the number of remeasured samples is small (n<10), the procedure could have inflated type I error. This is a disadvantage of the proposed method, since the number of samples that need to be remeasured may be small when the between-batch correlation is high. To improve the small-sample performance, we proposed a bootstrap method based on residual resampling and showed that it has a well-controlled type I error. However, when the inter-batch correlation is not high (<0.8), the bootstrap method could be less powerful than the “Batch2” method, in which case “Batch2” is recommended. As the type I error inflation of the asymptotic procedure is mainly driven by inaccurate estimation of the between-batch correlation, when a large number of features are analyzed, as in omics-wide testing, it may be possible to improve the estimation efficiency by pooling information across features using an empirical Bayes method [8]. We leave this as a future research direction.

4. Methods

4.1. Parameter Estimation

Under the Gaussian assumption on the errors, the log joint likelihood of the data is given by

$$
\begin{aligned}
L_N(\theta) ={}& -\frac{n_1}{2}\log(\sigma_1^2) - \frac{n_1}{2}\log(\sigma_2^2) - \frac{n_1}{2}\log(1-\rho^2) \\
& - \frac{1}{2(1-\rho^2)} \sum_{i\in S_1} \left[ \left(\frac{y_i-\mu_{1i}}{\sigma_1}\right)^2 - 2\rho\left(\frac{y_i-\mu_{1i}}{\sigma_1}\right)\left(\frac{y_{n+i}-\mu_{3i}}{\sigma_2}\right) + \left(\frac{y_{n+i}-\mu_{3i}}{\sigma_2}\right)^2 \right] \\
& - \frac{\tilde{n}_1-n_1}{2}\log(\sigma_1^2) - \frac{1}{2}\sum_{i\in C_1\setminus S_1} \left(\frac{y_i-\mu_{1i}}{\sigma_1}\right)^2 - \frac{n_2}{2}\log(\sigma_2^2) - \frac{1}{2}\sum_{i\in T_2} \left(\frac{y_i-\mu_{2i}}{\sigma_2}\right)^2, \quad (3)
\end{aligned}
$$

where $\tilde{n}_1$ denotes the total number of control samples in the first batch (the set $C_1$), of which the $n_1$ samples in $S_1$ are remeasured in the second batch.

The maximum likelihood estimator (MLE) of θ can be obtained as

$$\hat{\theta} = \left(\hat{a}_0, \hat{a}_1, \hat{b}, \hat{\rho}, \hat{\sigma}_1, \hat{\sigma}_2\right) = \operatorname*{arg\,max}_{\theta\in\Theta} L_N(\theta). \quad (4)$$

The optimization problem (4) does not have a closed-form solution due to the correlation between the remeasured samples from the two batches. One way to solve it is to use a generic numerical optimization algorithm such as the Newton–Raphson method or its variants, which update the parameters via first- or second-order steps until convergence. Here we provide a more efficient algorithm (see Supplementary Section 2) that exploits the specific structure of the first-order conditions associated with the objective function. We derive the first-order conditions by setting the partial derivative of the objective function with respect to each parameter to zero, and then update the parameters by iteratively solving these equations. The algorithm is an order of magnitude faster than the generic optimization algorithm (Supplementary Figure 6).
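For readers who prefer the generic route, the likelihood in (3) can be maximized directly with an off-the-shelf optimizer: the non-remeasured batch-1 controls and the batch-2 cases contribute independent Gaussian terms, while each remeasured pair contributes a bivariate Gaussian term with correlation $\rho$. The sketch below is our own illustration of this approach, not the paper's specialized algorithm; all variable names are hypothetical, and $\rho$, $\sigma_1$, $\sigma_2$ are reparameterized to keep them in range:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(theta, d):
    """Negative joint log-likelihood of the remeasurement model.

    d holds: y1_rest/Z1_rest (batch-1 controls, not remeasured),
    y1_pair/y2_pair/Z_pair (remeasured pairs in batches 1 and 2),
    y2_case/Z_case (batch-2 cases).
    theta = (a0, a1, b, atanh(rho), log s1, log s2).
    """
    p = d["Z_pair"].shape[1]
    a0, a1 = theta[0], theta[1]
    b = theta[2:2 + p]
    rho = np.tanh(theta[2 + p])                      # keep rho in (-1, 1)
    s1, s2 = np.exp(theta[3 + p]), np.exp(theta[4 + p])  # keep sds positive

    # Independent Gaussian pieces.
    ll = norm.logpdf(d["y1_rest"], d["Z1_rest"] @ b, s1).sum()
    ll += norm.logpdf(d["y2_case"], a0 + a1 + d["Z_case"] @ b, s2).sum()

    # Bivariate Gaussian piece for the remeasured pairs.
    r1 = (d["y1_pair"] - d["Z_pair"] @ b) / s1
    r2 = (d["y2_pair"] - a1 - d["Z_pair"] @ b) / s2
    m = len(r1)
    ll += (-m * np.log(2 * np.pi) - m * (np.log(s1) + np.log(s2))
           - 0.5 * m * np.log(1 - rho ** 2)
           - ((r1 ** 2 - 2 * rho * r1 * r2 + r2 ** 2)
              / (2 * (1 - rho ** 2))).sum())
    return -ll

def fit_mle(d):
    """Generic MLE via bounded quasi-Newton optimization."""
    p = d["Z_pair"].shape[1]
    bounds = [(-10, 10)] * (2 + p) + [(-3, 3), (-5, 5), (-5, 5)]
    res = minimize(neg_loglik, np.zeros(5 + p), args=(d,),
                   method="L-BFGS-B", bounds=bounds)
    t = res.x
    return {"a0": float(t[0]), "a1": float(t[1]), "b": t[2:2 + p],
            "rho": float(np.tanh(t[2 + p])),
            "sigma1": float(np.exp(t[3 + p])),
            "sigma2": float(np.exp(t[4 + p]))}
```

This generic approach recovers the same MLE but, as noted above, is considerably slower than the specialized alternating scheme when applied feature by feature across an omics data set.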

4.2. Statistical Inference

We are mostly interested in estimating and conducting inference on the biological effect $a_0$. The uncertainty assessment, or variance, of $\hat{a}_0$ is key to hypothesis testing and power analysis. The alternating updating algorithm in Supplementary Section 2 allows us to obtain the variance estimate straightforwardly. It can be shown that

$$\hat{a}_0 = \frac{1}{n_2}\sum_{i\in T_2}\left(y_i - z_i^\top\hat{b}\right) - \frac{1}{n_1}\sum_{i\in C_2}\left(y_i - z_i^\top\hat{b}\right) + \hat{\rho}\,\frac{\hat{\sigma}_2}{\hat{\sigma}_1}\cdot\frac{1}{n_1}\sum_{i\in S_1}\left(y_i - z_i^\top\hat{b}\right), \quad (5)$$

indicating that the MLE of $a_0$ can be expressed as a linear combination of the response variables from different batches. The first two terms in the formula estimate the biological effect without using the first batch of samples, and the third term uses the remeasured samples in the first batch to adjust the estimate. The degree of adjustment depends on the between-batch correlation for the remeasured samples: when the correlation is low, the estimate is similar to that obtained without the first batch, whereas when the correlation is high, the adjustment can be substantial. Based on formula (5), we can calculate the variance of $\hat{a}_0$ accordingly; the details of the variance formula can be found in Supplementary Section 2.
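Given the fitted nuisance parameters, formula (5) and the associated two-sided z-test reduce to a few lines. A sketch with hypothetical argument names (each `res_*` array holds the residuals $y_i - z_i^\top\hat{b}$ over the indicated index set, and `sd_hat` is the standard deviation estimate from the supplementary variance formula):

```python
import numpy as np
from scipy.stats import norm

def a0_estimate(res_case, res_ctrl2, res_ctrl1_pair, rho_hat, s1_hat, s2_hat):
    """Formula (5): batch-2 case-vs-control contrast, plus a
    correlation-weighted correction from the batch-1 remeasured residuals.

    res_case: residuals over T2 (batch-2 cases);
    res_ctrl2: over C2 (remeasured controls, batch 2);
    res_ctrl1_pair: over S1 (same subjects, batch-1 measurements).
    """
    return (np.mean(res_case) - np.mean(res_ctrl2)
            + rho_hat * (s2_hat / s1_hat) * np.mean(res_ctrl1_pair))

def z_pvalue(a0_hat, sd_hat):
    """Two-sided asymptotic p-value for H0: a0 = 0."""
    return 2 * norm.sf(abs(a0_hat) / sd_hat)
```

Setting `rho_hat = 0` recovers the estimate that ignores the first batch entirely, which makes the role of the between-batch correlation explicit.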

Based on the large-sample theory in Supplementary Section 1, the p-value for testing $a_0 = 0$ can be computed as $2\Phi(-|\hat{a}_0|/\widehat{\mathrm{sd}}(\hat{a}_0))$, where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. The remeasured sample size $n_1$ needs to be large for the large-sample theory to work. In practice, however, the remeasured sample size may be small, in which case the estimation of the correlation parameter $\rho$ is subject to large variability, since it depends only on the $n_1$ pairs of observations. For small $n_1$, the large-sample theory does not provide an accurate approximation to the sampling distribution of $\hat{a}_0/\widehat{\mathrm{sd}}(\hat{a}_0)$. To overcome this issue, we propose to use the bootstrap, for example the residual bootstrap, to improve the approximation to the finite-sample distribution. The set of residuals is obtained as

Control (batch 1): $\hat{\epsilon}_i^{(1)} = y_i - z_i^\top\hat{b}$, $i = 1, \dots, n_1$,
Case (batch 2): $\hat{\epsilon}_i^{(2)} = y_i - \hat{a}_0 - \hat{a}_1 - z_i^\top\hat{b}$, $i = n_1+1, \dots, n_1+n_2 = n$,
Control (batch 2): $\hat{\epsilon}_i^{(2)} = y_i - \hat{a}_1 - z_i^\top\hat{b}$, $i = n+1, \dots, n+n_1$.

We resample the residuals with replacement within each group and then generate a new bootstrap sample with the fixed $z_i$ but new $y_i$, using the fitted parameters and the resampled residuals.

Given $B$ bootstrap samples, we can calculate $\hat{a}_0^{(b)}$ and $\widehat{\mathrm{Var}}(\hat{a}_0^{(b)})$ for $1 \le b \le B$ based on each resample using Algorithm 1 and Formula (17) in Supplementary Section 2. We thereby obtain the bootstrap statistics $Z_b := (\hat{a}_0^{(b)} - \hat{a}_0)/\sqrt{\widehat{\mathrm{Var}}(\hat{a}_0^{(b)})}$ for $b = 1, \dots, B$. Given $Z = \hat{a}_0/\sqrt{\widehat{\mathrm{Var}}(\hat{a}_0)}$, the bootstrapped p-value can be computed as $B^{-1}\sum_{b=1}^{B}\mathbf{1}\{|Z_b| > |Z|\}$.
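The bootstrap loop itself is generic once refitting and resampling routines are available. A schematic sketch in which all names are ours: `refit_fn` stands in for Algorithm 1 plus the supplementary variance formula, and `resample_fn` for the residual-resampling step described above:

```python
import numpy as np

def residual_bootstrap_pvalue(z_obs, a0_hat, refit_fn, resample_fn,
                              B=500, seed=0):
    """Two-sided residual-bootstrap p-value for H0: a0 = 0.

    z_obs: observed studentized statistic a0_hat / sd_hat;
    refit_fn(data) -> (a0_b, var_b) refits the model on a bootstrap
    data set; resample_fn(rng) -> data builds one bootstrap data set
    from the fitted means and resampled residuals.
    """
    rng = np.random.default_rng(seed)
    z_boot = np.empty(B)
    for b in range(B):
        a0_b, var_b = refit_fn(resample_fn(rng))
        # Center at a0_hat so the bootstrap statistics mimic the null.
        z_boot[b] = (a0_b - a0_hat) / np.sqrt(var_b)
    return float(np.mean(np.abs(z_boot) > np.abs(z_obs)))
```

Because the bootstrap statistics are centered at $\hat{a}_0$, they approximate the null distribution of $Z$ even when the observed data carry a signal.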

Supplementary Material

Appendix B
Appendix A
Appendix C & D

Footnotes

Code Availability. All the code to reproduce the results in this paper is available at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode. The developed R package BatchReMeasure is available at https://github.com/yehanxuan/BatchReMeasure. The specific version used to produce the results in this manuscript is also available on Code Ocean [35].

Data Availability.

Source data for Figures 2–6 are available with this manuscript. They can also be found at https://github.com/yehanxuan/BatchReMeasure-manuscript-sourcecode.

References

  • [1].Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA: Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11(10), 733–739 (2010) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Goh WWB, Wang W, Wong L: Why batch effects matter in omics data, and how to avoid them. Trends in Biotechnology 35(6), 498–507 (2017) [DOI] [PubMed] [Google Scholar]
  • [3].Scherer A: Batch Effects and Noise in Microarray Experiments: Sources and Solutions. John Wiley & Sons, New Jersey: (2009) [Google Scholar]
  • [4].Tom JA, Reeder J, Forrest WF, Graham RR, Hunkapiller J, Behrens TW, Bhangale TR: Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinformatics 18(1), 1–12 (2017) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Price EM, Robinson WP: Adjusting for batch effects in DNA methylation microarray data, a lesson learned. Frontiers in Genetics 9, 83 (2018) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Sims AH, Smethurst GJ, Hey Y, Okoniewski MJ, Pepper SD, Howell A, Miller CJ, Clarke RB: The removal of multiplicative, systematic bias allows integration of breast cancer gene expression datasets–improving meta-analysis and prediction of prognosis. BMC Medical Genomics 1(1), 1–14 (2008) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences 98(1), 31–36 (2001) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8(1), 118–127 (2007) [DOI] [PubMed] [Google Scholar]
  • [9].Zhang Y, Parmigiani G, Johnson WE: ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genomics and Bioinformatics 2(3) (2020) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Shabalin AA, Tjelmeland H, Fan C, Perou CM, Nobel AB: Merging two gene-expression studies via cross-platform normalization. Bioinformatics 24(9), 1154–1160 (2008) [DOI] [PubMed] [Google Scholar]
  • [11].Benito M, Parker J, Du Q, Wu J, Xiang D, Perou CM, Marron JS: Adjustment of systematic microarray data biases. Bioinformatics 20(1), 105–114 (2004) [DOI] [PubMed] [Google Scholar]
  • [12].Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences 97(18), 10101–10106 (2000) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Jolliffe IT: Principal Component Analysis. Springer, New York, NY: (2013) [Google Scholar]
  • [14].Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9), 161 (2007) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Gagnon-Bartsch JA, Speed TP: Using control genes to correct for unwanted variation in microarray data. Biostatistics 13(3), 539–552 (2012) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Gagnon-Bartsch JA, Jacob L, Speed TP: Removing unwanted variation from high dimensional data with negative controls. Technical Report, Department of Statistics, University of California, Berkeley, 1–112 (2013) [Google Scholar]
  • [17].Jacob L, Gagnon-Bartsch JA, Speed TP: Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17(1), 16–28 (2016) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Sun Y, Zhang NR, Owen AB: Multiple hypothesis testing adjusted for latent variables, with an application to the agemap gene expression data. The Annals of Applied Statistics 6(4), 1664–1688 (2012) [Google Scholar]
  • [19].Wang J, Zhao Q, Hastie T, Owen AB: Confounder adjustment in multiple hypothesis testing. Annals of Statistics 45(5), 1863 (2017) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Tasaki S, Suzuki K, Kassai Y, Takeshita M, Murota A, Kondo Y, Ando T, Nakayama Y, Okuzono Y, Takiguchi M, et al. : Multi-omics monitoring of drug response in rheumatoid arthritis in pursuit of molecular remission. Nature Communications 9(1), 1–12 (2018) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Xia Q, Thompson JA, Koestler DC: Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE). Statistical Applications in Genetics and Molecular Biology 20(4–6), 101–119 (2021) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [22].Zhou L, Sue AC-H, Goh WWB: Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects? Journal of Genetics and Genomics 46(9), 433–443 (2019) [DOI] [PubMed] [Google Scholar]
  • [23].Olson JE, Ryu E, Hathcock MA, Gupta R, Bublitz JT, Takahashi PY, Bielinski SJ, St Sauver JL, Meagher K, Sharp RR, et al. : Characteristics and utilisation of the Mayo Clinic Biobank, a clinic-based prospective collection in the USA: cohort profile. BMJ Open 9(11), e032707 (2019) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [24].Rundle AG, Vineis P, Ahsan H: Design options for molecular epidemiology research within cohort studies. Cancer Epidemiology Biomarkers & Prevention 14(8), 1899–1907 (2005) [DOI] [PubMed] [Google Scholar]
  • [25].Cohen J: Statistical Power Analysis for the Behavioral Sciences. Routledge, Oxfordshire, United Kingdom: (2013) [Google Scholar]
  • [26].Wang C, Winterhoff BJ, Kalli KR, Block MS, Armasu SM, Larson MC, Chen H-W, Keeney GL, Hartmann LC, Shridhar V, et al. : Expression signature distinguishing two tumour transcriptome classes associated with progression-free survival among rare histological types of epithelial ovarian cancer. British Journal of Cancer 114(12), 1412–1420 (2016) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Konecny GE, Wang C, Hamidi H, Winterhoff B, Kalli KR, Dering J, Ginther C, Chen H-W, Dowdy S, Cliby W, et al. : Prognostic and therapeutic relevance of molecular subtypes in high-grade serous ovarian cancer. Journal of the National Cancer Institute 106(10) (2014) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Fridley BL, Dai J, Raghavan R, Li Q, Winham SJ, Hou X, Weroha SJ, Wang C, Kalli KR, Cunningham JM, et al. : Transcriptomic characterization of endometrioid, clear cell, and high-grade serous epithelial ovarian carcinoma. Cancer epidemiology, biomarkers & prevention 27(9), 1101–1109 (2018) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Chen GM, Kannan L, Geistlinger L, Kofia V, Safikhani Z, Gendoo DM, Parmigiani G, Birrer M, Haibe-Kains B, Waldron L: Consensus on molecular subtypes of high-grade serous ovarian carcinoma. Clinical Cancer Research 24(20), 5037–5047 (2018) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [30].Leek JT, Storey JD: A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105(48), 18718–18723 (2008) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [31].Buja A, Eyuboglu N: Remarks on parallel analysis. Multivariate Behavioral Research 27(4), 509–540 (1992) [DOI] [PubMed] [Google Scholar]
  • [32].Stegle O, Teichmann SA, Marioni JC: Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics 16(3), 133–145 (2015) [DOI] [PubMed] [Google Scholar]
  • [33].Chen J, King E, Deek R, Wei Z, Yu Y, Grill D, Ballman K: An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 34(4), 643–651 (2018) [DOI] [PubMed] [Google Scholar]
  • [34].Risso D, Perraudeau F, Gribkova S, Dudoit S, Vert J-P: A general and flexible method for signal extraction from single-cell RNA-seq data. Nature Communications 9(1), 284 (2018) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Ye H, Zhang X, Chen J: BatchReMeasure: Batch effects correction with sample remeasurement. Code Ocean 10.24433/CO.4806327.v1 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Takeshi A: Advanced Econometrics. Harvard University Press, Cambridge, Massachusetts: (1985) [Google Scholar]
  • [37].van der Vaart AW: Asymptotic Statistics. Cambridge University Press, Cambridge, United Kingdom (1998) [Google Scholar]
  • [38].Carmon Y, Duchi JC, Hinder O, Sidford A: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018) [Google Scholar]
  • [39].Nesterov Y, Polyak BT: Cubic regularization of newton method and its global performance. Mathematical Programming 108(1), 177–205 (2006) [Google Scholar]
