PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data

Yiwen Wang; Kim-Anh Lê Cao

doi:10.1093/bib/bbac622

2023 Jan 18;24(2):bbac622. doi: 10.1093/bib/bbac622

PMCID: PMC10025448 PMID: 36653900

Abstract

Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to and obscure any factors of interest. Existing batch effect correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce new multivariate and non-parametric batch effect correction methods based on Partial Least Squares Discriminant Analysis (PLSDA). PLSDA-batch first estimates treatment and batch variation with latent components, then subtracts batch-associated components from the data. The resulting batch-effect-corrected data can then be input in any downstream statistical analysis. Two variants are proposed to handle unbalanced batch × treatment designs and to avoid overfitting when estimating the components via variable selection. We compare our approaches with popular methods managing batch effects, namely, removeBatchEffect, ComBat and Surrogate Variable Analysis, in simulated data and three case studies using various visual and numerical assessments. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, especially for unbalanced batch × treatment designs. Our downstream analyses show selections of biologically relevant taxa. This work demonstrates that batch effect correction methods can improve microbiome research outputs. Reproducible code and vignettes are available on GitHub.

Keywords: microbiome data, multivariate, non-parametric, dimension reduction, batch effect correction

Introduction

Investigating the link between microbial composition and phenotypes, including human diseases, is the main goal of microbiome research. The disruption of gut microbial communities has been linked to a variety of diseases and sub-health conditions, ranging from inflammatory bowel disease [1] and diabetes [2] to obesity [3] and malnutrition [4].

However, microbiome research faces the challenges of data reproducibility and replicability that invalidate statistical results. Because microbial communities are highly dynamic [5], microbiome data are highly susceptible to batch effects, that is, any unwanted sources of variation that are unrelated to and obscure the biological factors of interest [6]. Microbiome studies affected by batch effects are increasingly abundant in the literature: unwanted variation can be introduced by changes in technical procedures including sample collection, shipping and processing [7–9] or from independent studies [10]. Other confounding factors including geography, age, sex, stress and diet also introduce batch effects to the composition of the host microbiota [11–14]. These batch effects often mask the biological effects of interest. Batch effect management is therefore critical to improve the validity of microbiome studies’ results.

Two types of approaches exist to handle batch effects [6]: methods that correct for batch effects remove batch variation from the data, whereas methods that account for batch effects include them as covariates in the statistical model. Evaluating the effectiveness of the former through numerical and graphical analyses is easier than for the latter [6].

Methods that account for batch effects are often restricted to differential abundance analysis with models that hold strong assumptions about data distribution. They include the zero-inflated Gaussian model [15] and Bayesian Dirichlet multinomial regression [16].

Methods that correct for batch effects are the most flexible and any type of downstream analysis can be applied to the resulting batch-effect-corrected data, including dimension reduction, visualization and clustering. However, for microbiome studies, these methods are challenged by small sample sizes, which increase the uncertainty of batch effect estimation [17]. In addition, batch effect correction methods assume that batch and treatment effects are independent, requiring a balanced batch × treatment design [6]. However, microbiome experiments often result in unbalanced designs where batch and treatment effects are partly confounded, leading to the loss of treatment variation during the batch effect correction process.

The multivariate method Remove Unwanted Variation (RUV) has recently been adapted for microbiome data [18, 19], but requires negative control variables and technical sample replicates that capture batch variation, which are not often available in microbiome studies. Two methods, percentile-normalization [20] and NetMoss [21], were developed to remove batch effects in microbial studies, but are only valid for case-control studies, which narrows the scope of their application.

Several batch effect correction methods have been developed for gene expression data [22, 23]. However, they are challenged by the inherent characteristics of microbiome data including zero inflation, uneven library sizes and compositional structure (even if data are transformed beforehand, for example, with centred log ratio transformation). Univariate methods disregard the inter-dependent relationships between microorganisms [24]. They also assume that batch effects are systematic and thus have a homogeneous influence on all microbial variables, which was found to be unlikely [6]. When non-systematic batch effects are mistakenly treated as systematic, biological variation of interest might be removed from the data, or the batch variation may remain during the batch effect correction process.

Promising methods have been proposed in other fields of application, such as single-cell RNA-sequencing. Seurat V3 [25], mnnCorrect [26], scmerge [27] and zinbwave [28] assume a zero-inflated distribution but are only effective for very large sample sizes.

We propose novel approaches to correct for batch effects in microbiome data based on Partial Least Squares Discriminant Analysis (PLSDA [29]). PLSDA-batch is highly suitable for microbiome data as it is non-parametric, multivariate and allows for ordination and data visualization. Latent components related to treatment and batch effects are estimated to remove batch variation in the data while preserving biological variation of interest. Two other variants are proposed for unbalanced batch × treatment designs and to select discriminative microbial variables among treatment groups. We assess the performance of PLSDA-batch in extensive simulation studies and three case studies that investigate microbial communities in sponge tissues, anaerobic digestion conditions and diet types in mice. We compare the efficiency of our approaches in removing batch effects and uncovering treatment effects with popular linear methods that have been previously applied in microbial studies [30–32], such as ComBat and removeBatchEffect. As our approach shares some similarities with Surrogate Variable Analysis (SVA), besides the fact that it accounts, rather than corrects for batch effects, we include some comparisons in the simulation studies.

Methods

Our three approaches are derived from PLSDA [29] to correct batch effects. We first give a brief description of the core method Partial Least Squares (PLS [33]), and its PLSDA extension for classification problems. We will use the following notations: $X$ denotes an $(n \times p)$ explanatory data matrix with $p$ microbial variables and $Y$ an $(n \times q)$ data matrix with $q$ response variables. Both datasets match on the same $n$ samples. We denote the matrix transpose by $^\top$. The $\ell_1$ norm of a random vector $v$ ($v \in \mathbb{R}^p$) is defined as $\|v\|_1 = \sum_{j=1}^{p} |v_j|$ and the $\ell_2$ norm is $\|v\|_2 = \big(\sum_{j=1}^{p} v_j^2\big)^{1/2}$.

PLS and sparse PLSDA

PLS, also known as Projection to Latent Structures, is an orthogonal component-based regression method commonly used to model the covariance structure between explanatory ($X$) and response ($Y$) matrices in large datasets. The optimization problem to solve is

$$\max_{\|a\|_2 = 1,\ \|b\|_2 = 1} \operatorname{cov}(Xa, Yb) \tag{1}$$

where $a$ and $b$ represent the loading vectors of $X$ and $Y$, respectively. The aim of PLS is to find the linear transformations ($a$ and $b$) of $X$ and $Y$ that maximize the covariance between their latent components denoted as $t$ and $u$, respectively, with $t = Xa$ and $u = Yb$. After the first pair of latent components ($t$, $u$) is obtained, the residual matrix $\tilde{X}$ is calculated via matrix deflation as

$$\tilde{X} = X - t d^\top \tag{2}$$

where $d = X^\top t / (t^\top t)$. $d$ represents the regression coefficient vector for each variable in $X$ on $t$. Similarly, we can calculate the residual matrix $\tilde{Y}$ by deflating the matrix $Y$ with $t$. The deflated matrices $\tilde{X}$ and $\tilde{Y}$ are then used as updated $X$ and $Y$ for the next PLS dimension. The deflation steps ensure that the latent components associated with each PLS dimension are orthogonal.
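The first PLS dimension and the deflation step can be illustrated with a minimal sketch. It assumes a single centred response column (in which case the loading that maximizes the covariance is proportional to the cross-product of the data matrix and the response) and column-centred toy data; the function names `first_pls_component` and `deflate` are hypothetical and do not refer to any package.

```python
import math

def first_pls_component(X, y):
    """Return the unit-norm loading a and the latent component t = X a."""
    n, p = len(X), len(X[0])
    # With one centred response, cov(Xa, y) is maximised under ||a||_2 = 1
    # by a loading proportional to X^T y.
    a = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in a))
    a = [v / norm for v in a]
    t = [sum(X[i][j] * a[j] for j in range(p)) for i in range(n)]
    return a, t

def deflate(X, t):
    """Subtract from each column of X its least-squares projection on t."""
    n, p = len(X), len(X[0])
    tt = sum(v * v for v in t)
    # Regression coefficient of each variable on the component t.
    d = [sum(t[i] * X[i][j] for i in range(n)) / tt for j in range(p)]
    return [[X[i][j] - t[i] * d[j] for j in range(p)] for i in range(n)]

# Toy column-centred data: 4 samples, 2 variables, one centred response.
X = [[1.0, 2.0], [-1.0, 0.0], [2.0, -1.0], [-2.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
a, t = first_pls_component(X, y)
X_res = deflate(X, t)
```

After deflation, every column of `X_res` is orthogonal to `t`, which is why the component extracted on the next dimension is orthogonal to the first one.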

PLSDA is an adaptation of PLS for classification and discrimination, where the response matrix $Y$ is a dummy matrix transformed from a categorical outcome variable. Each column in $Y$ indicates the group membership of each sample: if sample $i$ belongs to group $g$, then $y_{ig}$ equals 1, otherwise 0. For each dimension $h$, the latent components $t_h$ and $u_h$ are calculated as shown earlier in Eq. (1). $t_h$ summarizes the variation from $X$ that is associated with $Y$, whereas $u_h$ is a linear combination of the dummy outcomes in $Y$. Thus, the component $t_h$ is mostly relevant to explain the discrimination between sample groups.
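As a small illustration of the dummy coding described above, the following sketch converts a categorical outcome into an indicator matrix. The helper name `dummy_matrix` and the convention of ordering columns alphabetically are assumptions made for this example.

```python
def dummy_matrix(labels):
    """One column per group (sorted order); 1 if the sample belongs to the group, else 0."""
    groups = sorted(set(labels))
    return groups, [[1 if lab == g else 0 for g in groups] for lab in labels]

groups, Y = dummy_matrix(["trt1", "trt2", "trt1", "trt2"])
```

Each row of `Y` contains exactly one 1, so the columns sum to the group sizes.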

In PLSDA, we need to specify the optimal number of components $H$. It can be chosen using repeated cross-validation to estimate the classification error rate on each component $h$. As PLSDA is an iterative process based on deflated matrices, the components that yield the lowest error rate correspond to the overall performance of the PLSDA model [34].

Sparse PLSDA (sPLSDA) uses $\ell_1$ penalization on the loading vectors in PLSDA to select variables [35]. During the regression step, for each component $h$, the $\ell_1$ penalty is solved with soft-thresholding in Eq. (1):

$$\max_{\|a\|_2 = 1,\ \|b\|_2 = 1} \operatorname{cov}(Xa, Yb) - \lambda \|a\|_1 \tag{3}$$

where $\lambda$ is a non-negative parameter that controls the amount of shrinkage on the loading vector $a$ and thus the number of non-zero loadings. The latent component $t$ is therefore calculated based on a subset of variables that are deemed most discriminative to classify the sample groups.
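The effect of soft-thresholding on a loading vector can be sketched as follows. This is a generic illustration of the operator rather than the implementation used by any particular package, which may instead be parametrized by the number of variables to keep; the function names and the toy loading values are hypothetical.

```python
import math

def soft_threshold(a, lam):
    """Shrink each loading towards zero by lam; loadings with |a_j| <= lam become 0."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in a]

def sparse_loading(a, lam):
    """Soft-threshold, then rescale to unit l2 norm (assumes one loading survives)."""
    s = soft_threshold(a, lam)
    norm = math.sqrt(sum(v * v for v in s))
    return [v / norm for v in s]

a = [0.8, -0.5, 0.1, -0.05]
a_sparse = sparse_loading(a, lam=0.2)
```

With a shrinkage of 0.2, the two smallest loadings are set to zero, so the corresponding variables no longer contribute to the latent component. A larger shrinkage value selects fewer variables.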

Two types of parameters need to be specified in sPLSDA: the number of components $H$ and the number of variables to select on each component, which corresponds to the shrinkage coefficient $\lambda$. Both parameters can be chosen simultaneously using repeated cross-validation by evaluating the classification error rate on a grid of number of variables to select on each component [34].

PLSDA-batch

PLSDA-batch aims to estimate and remove batch variation while preserving treatment variation. We use additional notations as we include in the model two different types of sample information, treatment and batch, denoted $Y^{(trt)}$ and $Y^{(batch)}$, respectively. The matrices $A^{(trt)}$ and $B^{(trt)}$ include the loading vectors associated with $X$ and $Y^{(trt)}$, respectively, where $H$ is the number of components associated with the treatment variation. The corresponding latent components are denoted $T^{(trt)}$ and $U^{(trt)}$. Similar notations are used for the loading vectors and latent components associated with the batch effect across $K$ components. We will use simplified notations without superscript, such as $A$, $B$, $T$ and $U$, that are related to either treatment or batch variation when there is no ambiguity. $X^{(nobatch)}$ is the matrix from which the batch effect is removed, and similarly $X^{(notrt)}$ for the treatment effect.

Overview

The general concept of PLSDA-batch is shown in the first column of Figure 1. Assuming $X$ includes both treatment and batch effects, the samples projected onto a Principal Component Analysis (PCA) plot would be segregated according to both treatment and batch information. In a first step, PLSDA-batch estimates the treatment variation with the components $T^{(trt)}$, which are extracted from $X$ to obtain $X^{(notrt)}$, so that only batch variation remains. The second step estimates the batch-associated components $T^{(batch)}$ from $X^{(notrt)}$. The original dataset $X$ is then deflated with $T^{(batch)}$ to obtain the final matrix $X^{(nobatch)}$, corrected for batch effects while preserving the treatment variation.

Figure 1. PLSDA-batch framework. From left to right columns: visualization with Principal Component Analysis sample plots; workflow describing each step of Algorithm 1; and geometrical representation of the approach via projections and deflation. For illustrative purposes, we only represent one component associated with either treatment or batch effects.

Algorithmic and geometrical points of view

The remaining columns in Figure 1 further describe the approach. For illustrative purposes, we only depict the case where only one component is associated with either treatment or batch effects rather than several components. The data matrix $X$ with both treatment and batch effects can be decomposed into three major sources of variation: treatment, batch and residuals. All these sources are assumed to be independent but in practice, treatment and batch sources are likely to be correlated to some extent. This motivated our approach to first estimate the treatment variation to avoid over-estimating the batch variation and losing substantial treatment variation.

In the first step, we apply PLSDA to $X$ and $Y^{(trt)}$ to identify the dimension $a^{(trt)}$ of treatment effects from $X$ (see Algorithm 1 ‘Estimation of latent dimensions’). $t^{(trt)}$ is then calculated using a scalar projection of $X$ onto $a^{(trt)}$. Therefore, the treatment variation of all variables in $X$ is summarized in the component $t^{(trt)}$. We then calculate the matrix without treatment effects $X^{(notrt)}$ by deflating $X$ with $t^{(trt)}$. In the second step, we identify the batch-associated dimension $a^{(batch)}$ from $X^{(notrt)}$, then calculate $t^{(batch)}$ by projecting $X^{(notrt)}$ onto $a^{(batch)}$. The batch variation is then removed from $X$ via matrix deflation while ensuring the treatment effects are fully preserved. Since the components $t^{(trt)}$ and $t^{(batch)}$ are orthogonal, we could also deflate $X^{(notrt)}$ with respect to $t^{(batch)}$, but such an alternative would require adding the treatment variation back.
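The two steps can be sketched for the simplest case of one treatment component and one batch component. The sketch assumes two treatment groups and two batches (so that each loading is proportional to the cross-product of the data with a centred indicator), computes the batch component on the treatment-deflated matrix, and uses a hypothetical toy dataset; it is an illustration of the idea, not the authors' implementation.

```python
import math

def centre(X):
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def loading(X, y):
    """Unit-norm direction proportional to X^T y (two-group case, y centred)."""
    a = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(len(X[0]))]
    norm = math.sqrt(sum(v * v for v in a))
    return [v / norm for v in a]

def project(X, a):
    return [sum(x * w for x, w in zip(row, a)) for row in X]

def deflate(X, t):
    n, p = len(X), len(X[0])
    tt = sum(v * v for v in t)
    d = [sum(t[i] * X[i][j] for i in range(n)) / tt for j in range(p)]
    return [[X[i][j] - t[i] * d[j] for j in range(p)] for i in range(n)]

def plsda_batch_one_component(X, y_trt, y_batch):
    X = centre(X)
    # Step 1: estimate the treatment component and remove it from X.
    t_trt = project(X, loading(X, y_trt))
    X_notrt = deflate(X, t_trt)
    # Step 2: estimate the batch component on the treatment-free matrix,
    # then deflate the original (centred) matrix with it.
    t_batch = project(X_notrt, loading(X_notrt, y_batch))
    X_nobatch = deflate(X, t_batch)
    return t_trt, t_batch, X_nobatch

# Balanced toy design: variable 0 carries a treatment effect,
# variable 1 a batch effect, variable 2 only noise.
y_trt = [-1, -1, -1, -1, 1, 1, 1, 1]
y_batch = [-1, -1, 1, 1, -1, -1, 1, 1]
X = [[0.1, 0.2, 0.3], [-0.1, -0.2, -0.3], [0.2, 3.1, -0.1], [-0.2, 2.9, 0.1],
     [2.1, 0.1, 0.2], [1.9, -0.1, -0.2], [2.2, 3.2, 0.1], [1.8, 2.8, -0.1]]
t_trt, t_batch, X_nobatch = plsda_batch_one_component(X, y_trt, y_batch)
```

In this toy example the batch shift on variable 1 disappears from `X_nobatch`, while the difference between treatment groups on variable 0 is unchanged, because the batch component is orthogonal to the treatment component.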

[Algorithm 1, provided as an image (bbac622fx1.jpg)]

Weighted PLSDA-batch

A balanced batch × treatment design is an experimental design where samples within each treatment group are evenly distributed across batches [6]. Because of experimental constraints, a batch × treatment design may be unbalanced, resulting in treatment and batch effects that are correlated and not separable. In PLSDA-batch, latent components associated with either treatment or batch effects are assumed to be orthogonal, thus ignoring the correlation between these two effects. The consequences might be over-estimation of the treatment variation as well as insufficient removal of the batch variation. Weighted PLSDA-batch (wPLSDA-batch) is inspired by weighted PCA to account for unbalanced designs [36], with weights defined specifically for PLSDA-batch. Further details on defining the weights are described in the Supplemental Section S2. Each sample $i$ is assigned a weight $w_i$ to take into account the number of samples within each batch and treatment:

(4)

where $y^{(batch)}_{ij}$ represents the indicator value (0 or 1) of sample $i$ and batch $j$ in the dummy matrix $Y^{(batch)}$, and similarly for $y^{(trt)}_{ig}$ in $Y^{(trt)}$. $n_{jg}$ represents the sample size in batch $j$ and treatment group $g$. $W$ is a diagonal matrix that includes $w_i$, $i = 1, \dots, n$. We obtain the weighted explanatory and response matrices $X_w$ and $Y_w$ by multiplying $X$ and $Y$ with $W$, respectively. The batch-effect-corrected data resulting from the calculation on the weighted matrices using PLSDA-batch are then multiplied by $W^{-1}$ to remove the influence of weights.

Sparse PLSDA-batch

In PLSDA-batch, the latent components are calculated based on all variables, thus assuming that all microorganisms are affected by the treatment (e.g. antibiotics). In most microbial studies, we can instead make the assumption that only a small number of microorganisms are affected by the treatment [37]. In that case, the batch-effect-corrected matrix $X^{(nobatch)}$ may not be accurate as it depends on the calculation of the treatment components $T^{(trt)}$. These components are most likely to be affected by batch-related variables, especially when batch effect variability is high among samples.

To avoid overfitting when we estimate the treatment components, we apply an $\ell_1$ penalty to each treatment-associated loading vector (see Eq. (3)) to select variables. Thus, variables with no treatment effect are assigned a zero loading value and are not included in the calculation of a component. Batch effects are assumed to be less microorganism specific than treatment effects. Thus, to ensure that the batch variation is fully captured, no variable selection is performed on the batch components.

Parameter tuning

In PLSDA-batch, we need to specify the optimal number of components associated with either treatment or batch effects ($H$ or $K$). To choose this parameter, we estimate the variance explained in the outcome matrix $Y^{(trt)}$ on each treatment component $t^{(trt)}_h$, and similarly for the batch-associated outcome matrix $Y^{(batch)}$ and components. We choose the number of components that explain 100% of the variance in either $Y^{(trt)}$ or $Y^{(batch)}$. The remaining components should only explain some (unknown) noise.

In sPLSDA-batch, in addition to the parameter above, we also need to specify the optimal number of variables to select on each treatment component. For this purpose, we calculate the balanced classification error rate $\mathrm{BER} = \frac{1}{G} \sum_{g=1}^{G} \frac{n^{\mathrm{false}}_g}{n^{\mathrm{false}}_g + n^{\mathrm{true}}_g}$, where $n^{\mathrm{false}}_g$ and $n^{\mathrm{true}}_g$ represent the number of falsely and truly classified samples in the treatment group $g$, and $G$ represents the total number of treatment groups [38]. The BER is evaluated through repeated cross-validation using the ‘maximum’ prediction distance as described in [34] on a proposed grid of number of variables to select on each treatment component. The number of variables yielding the lowest BER is the optimal parameter.
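The balanced error rate can be computed as the average of the per-group misclassification rates, as in the following sketch. The function name and the toy labels are hypothetical; the predictions would normally come from cross-validation.

```python
def balanced_error_rate(true_labels, predicted_labels):
    """Average, over groups, of the fraction of each group's samples that are misclassified."""
    groups = sorted(set(true_labels))
    rates = []
    for g in groups:
        preds = [p for t, p in zip(true_labels, predicted_labels) if t == g]
        wrong = sum(1 for p in preds if p != g)
        rates.append(wrong / len(preds))
    return sum(rates) / len(groups)

# Unequal group sizes: 8 samples in one group, 2 in the other.
truth = ["HFHS"] * 8 + ["normal"] * 2
pred = ["HFHS"] * 8 + ["HFHS", "normal"]
ber = balanced_error_rate(truth, pred)
```

Here one of the ten samples is misclassified, so the overall error rate is 0.1, but the BER is 0.25 because half of the smaller group is misclassified. This is why the BER is preferred when group sizes differ.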

Simulation and case studies

Simulation study

Microbiome data are multivariate with inherent correlation structure between microbial variables. The data are over-dispersed with a distribution close to a negative binomial distribution [39, 40]. Inspired by [41], we simulated data from a multivariate negative binomial distribution achieved with quantile–quantile transformation between multivariate normal and negative binomial distributions. To add treatment and batch effects, we used matrix factorization to simulate the mean for modelling the negative binomial distribution as a matrix $\mu$ of size $n \times p$ for $n$ samples and $p$ microbial variables as follows:

(5)

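where $x^{(trt)}$ and $x^{(batch)}$ represent the design vectors of treatment and batch effects, respectively, for each sample. $\beta^{(trt)}$ and $\beta^{(batch)}$ represent the regression coefficients of treatment and batch effects for each microbial variable, drawn with means $\mu^{(trt)}$ and $\mu^{(batch)}$ and standard deviations $\sigma^{(trt)}$ and $\sigma^{(batch)}$, respectively. $\epsilon$ contains the random noise, which is independent and identically distributed (i.i.d) across the $i = 1, \dots, n$ samples and $j = 1, \dots, p$ variables.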

The probability matrix $P$ of size $n \times p$ for modelling the negative binomial distribution is calculated as

$$p_{ij} = \frac{r}{r + \mu_{ij}} \tag{6}$$

where $p_{ij}$ and $\mu_{ij}$ represent the probability of success in each trial and the mean for the negative binomial distribution of sample $i$ and microbial variable $j$, and $r$ is the dispersion parameter representing the number of successes.

We then simulated a data matrix $Z$ based on a multivariate normal distribution with mean $0$ and correlation matrix $\Sigma$:

$$Z \sim \mathcal{N}_p(0, \Sigma) \tag{7}$$

where the correlation matrix $\Sigma$ was simulated with the strategy adapted from [42] as follows: we first generated a lower triangular matrix $L$ whose diagonal and off-diagonal elements were drawn from separate distributions, then randomly set the off-diagonal elements of $L$ to zero with a fixed probability. A precision matrix, which is the inverse of the covariance matrix, was created as $\Omega = L L^\top$. The correlation matrix $\Sigma$ corresponding to $\Omega^{-1}$ was then obtained. The distributions and the probability were set according to [42].

Thereafter we used the Cumulative Distribution Function (CDF) to achieve the quantile–quantile transformation as

$$\Phi(z_{ij}) = F_{\mathrm{NB}}(x_{ij};\, r, p_{ij}) \tag{8}$$

where $\Phi(z_{ij})$ represents the cumulative probability of $z_{ij}$ for sample $i$ and variable $j$ that belongs to matrix $Z$ from the multivariate normal distribution as in Eq. (7). $F_{\mathrm{NB}}(x_{ij})$ represents the cumulative probability of each $x_{ij}$ in matrix $X$ from the negative binomial distribution as in Eq. (9).

Based on the cumulative probability from Eq. (8), we can simulate a data matrix $X$ with a multivariate negative binomial distribution:

$$X \sim \mathrm{MNB}(r, P, \Sigma) \tag{9}$$

where $r$ represents the dispersion parameter, $P$ represents the probability matrix and $\Sigma$ the correlation matrix explaining the dependence structure between microbial variables.
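The quantile–quantile transformation can be illustrated with a minimal sketch for two correlated variables. It assumes fixed means for the two variables (no treatment or batch design), a success probability obtained from the mean as r / (r + mean), and a single correlation `rho` in place of a full correlation matrix; all function and parameter names are hypothetical.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def nb_quantile(u, r, p):
    """Smallest count whose negative binomial CDF reaches u (r successes, success probability p)."""
    pmf = p ** r          # P(X = 0)
    cdf, x = pmf, 0
    while cdf < u:
        # Recursion: P(X = x + 1) = P(X = x) * (x + r) / (x + 1) * (1 - p)
        pmf *= (x + r) / (x + 1) * (1.0 - p)
        x += 1
        cdf += pmf
    return x

def simulate_nb_pair(mu1, mu2, r, rho, rng):
    """Draw two correlated negative binomial counts through correlated normals."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)  # corr(z1, z2) = rho
    counts = []
    for z, mu in ((z1, mu1), (z2, mu2)):
        p = r / (r + mu)                       # success probability from the mean
        u = min(normal_cdf(z), 1.0 - 1e-12)    # cumulative probability, kept below 1
        counts.append(nb_quantile(u, r, p))
    return counts

rng = random.Random(1)
draws = [simulate_nb_pair(mu1=5.0, mu2=20.0, r=2.0, rho=0.6, rng=rng) for _ in range(2000)]
```

Each marginal follows the requested negative binomial distribution, since the normal cumulative probability is mapped through the inverse negative binomial CDF, while the dependence between the two counts is inherited from the correlated normal draws.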

We simulated datasets with different parameters including amount of batch and treatment effects ($\mu^{(batch)}$, $\mu^{(trt)}$) and variability among variables ($\sigma^{(batch)}$, $\sigma^{(trt)}$), number of variables with batch and/or treatment effects ($M^{(batch)}$, $M^{(trt)}$ and $M^{(trt \& batch)}$), balanced and unbalanced batch × treatment designs, as summarized in Table 1. The microbial variables with treatment or batch effects were randomly indexed in the data with non-zero $\beta^{(trt)}$ or $\beta^{(batch)}$. The background noise was randomly sampled to reflect real microbiome datasets.

Table 1.

Summary of simulation scenarios (two batch groups). For a given choice of parameters reported in this table, each simulation was repeated 50 times. $M^{(trt)}$, $M^{(batch)}$ and $M^{(trt \& batch)}$ represent the number of variables with treatment, batch or both effects, respectively. Simulation 6 includes parameters likely to represent real data according to our experience in analysing microbiome datasets.

Parameters	$\mu^{(trt)}$	$\sigma^{(trt)}$	$\mu^{(batch)}$	$\sigma^{(batch)}$	$M^{(trt)}$	$M^{(batch)}$	$M^{(trt \& batch)}$
Simulation 1	3	1	7	{1,4,8}	60	150	0
Simulation 2	{3,5,7}	1	7	8	60	150	0
Simulation 3	3	{1,2,4}	7	8	60	150	0
Simulation 4	3	2	7	8	{30,60,100,150}	150	0
Simulation 5	3	2	7	8	60	{30,60,100,150}	0
Simulation 6	3	2	7	8	60	150	{0,18,30,42,60}

We also simulated datasets with different number of batch groups:

(1) Two batch groups: Each dataset included 300 variables and 40 samples grouped according to two treatments (trt1 and trt2) and two batches (batch1 and batch2). The balanced batch × treatment experimental design included 10 samples from each of the two batches in each treatment group. The unbalanced design included 4 and 16 samples from batch1 and batch2, respectively, in trt1, and 16 and 4 samples from batch1 and batch2 in trt2 (see Table 2).
(2) Three batch groups: Each dataset included 300 variables and 36 samples grouped according to two treatments (trt1 and trt2) and three batches (batch1, batch2 and batch3). The balanced batch × treatment experimental design included six samples from each of the three batches in each treatment group. The unbalanced design included 2, 10 and 2 samples from batch1, batch2 and batch3, respectively, in trt1, and 10, 2 and 10 samples from batch1, batch2 and batch3 in trt2 (see Table 3).

Table 2.

Unbalanced batch × treatment design in the simulation study for two batch groups

	Trt1	Trt2
Batch1	4	16
Batch2	16	4

Table 3.

Unbalanced batch × treatment design in the simulation study for three batch groups

	Trt1	Trt2
Batch1	2	10
Batch2	10	2
Batch3	2	10

In addition, we simulated a ground-truth dataset that only included treatment effects and background noise without batch effects to evaluate batch effect correction methods.

Our simulations generate over-dispersed count data with batch and treatment effects as well as correlation structure among variables, but without any compositional structure. We therefore only applied natural log transformation to the simulated data prior to analysis.

In these simulation scenarios, for PLSDA-batch we set $H = G^{(trt)} - 1$ (or $K = G^{(batch)} - 1$) components associated with treatment (or batch) effects (where $G^{(trt)}$ and $G^{(batch)}$ represent the total number of treatment and batch groups, respectively), as ($G - 1$) components are likely to explain 100% of the variance in $Y$. The number of variables with a true treatment effect ($M^{(trt)}$) is set as the optimal number to select on each treatment component in sPLSDA-batch.

Case studies

We analysed three 16S rRNA amplicon datasets at the operational taxonomic unit (OTU) level. The count data were filtered to alleviate sparsity and transformed with Centered Log Ratio (CLR) transformation [43]. CLR is a pragmatic way to handle both uneven library sizes and compositional structure in real data [37]. It also helps reduce skewness in the data.
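For reference, the CLR transformation of one sample can be sketched as follows. The offset of 1 added to the counts to handle zeros is an assumption of this example, and the function name is hypothetical.

```python
import math

def clr(counts, offset=1.0):
    """Centred log-ratio of one sample: log(x_j + offset) minus the mean of the logs."""
    logs = [math.log(c + offset) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

sample = [0, 3, 12, 150]
transformed = clr(sample)
```

The transformed values of each sample sum to zero, which removes the dependence on the library size of that sample.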

Sponge A. aerophoba. This study investigated the relationship between metabolite concentration and microbial abundance on specific sponge tissues [44]. The dataset includes the relative abundance of 24 OTUs and 32 samples collected from two tissue types (Ectosome versus Choanosome) and processed on two separate denaturing gradient gels in electrophoresis. The tissue variation is the effect of interest, while the gel variation is the batch effect. This study includes a batch effect with similar variation to the treatment effect, and a completely balanced batch × treatment design. The sponge study enables us to assess the efficacy of batch effect correction methods in such circumstances.

Anaerobic digestion. This study explored the microbial indicators that could improve the efficacy of anaerobic digestion (AD) bioprocess and prevent its failure [45]. The dataset includes 231 OTUs and 75 samples treated with two different ranges of phenol concentration (effects of interest). These samples were processed at five different dates corresponding to batch effects. This study includes a strong batch effect compared with the treatment effect, with an approximately balanced batch × treatment design. The AD dataset enables us to assess whether batch effect correction methods are able to remove sufficient batch variation in this case.

High fat high sugar diet. This study aimed to investigate the effect of high fat high sugar (HFHS) diet on the mouse microbiome [37]. This dataset includes 515 OTUs and 149 samples collected at day 1, 4 and 7 from the mice treated with two types of diets (HFHS versus normal). The diet variation is the treatment effect, while the day variation constitutes a potential batch effect, which is actually weak. The HFHS study enables us to assess whether batch effect correction methods are able to preserve treatment variation when batch effects are small.

For the PLSDA-batch analyses, we chose the number of components that explained 100% of the variance in $Y$ associated with either treatment or batch effects (Sponge data: one treatment component, one batch component; AD data: one treatment component, four batch components; and HFHS data: one treatment component, two batch components). For sPLSDA-batch, we chose the number of variables to select on each treatment component that yielded the lowest BER from repeated cross-validation with four folds and 50 repeats (Sponge data: one variable; AD data: 100 variables; and HFHS data: two variables).

Benchmarking and assessment of batch effect removal

We compared our approaches with removeBatchEffect, ComBat and SVA. These methods are univariate and were originally developed for gene expression data from microarray or RNA-sequencing. They have been used extensively in microbiome studies [30–32, 46, 47] even though they would require further developments to be adapted to the inherent characteristics of microbiome data. These methods’ limitations include the inability to deal with non-Gaussian distribution, small sample sizes and dependence between microbial variables. Similar to the aim of our proposed methods, removeBatchEffect and ComBat correct for batch effects to generate batch-effect-free data for downstream analysis, while SVA accounts for batch effects. Both our approaches and SVA attempt to preserve treatment variation prior to batch effect management to avoid information loss, but the algorithms used to achieve this purpose differ. In addition, SVA estimates and accounts for unknown batch effects, which may result in overfitting the data, compared with our approaches. Further details on these methods are described in the Supplemental Section S1. We used a wide range of performance measures to evaluate whether these methods are effective in managing batch effects while preserving treatment effects. These include classical accuracy measures used in simulation studies where we know the ground-truth, that is, we know which variables include batch and/or treatment effects [16], as well as multivariate and univariate approaches to measure the proportion of variance explained by batch and treatment effects after batch effect removal.

Accuracy measures (simulation study only)

We identified variables with a true treatment effect after correcting or accounting for batch effects using two approaches:

(1) Univariate one-way analysis of variance (ANOVA) [48] to identify differentially abundant taxa between treatment groups (based on Benjamini–Hochberg adjusted P-values) followed by accuracy measures described below,
(2) Multivariate sparse PLSDA to identify taxa that discriminate treatment groups followed by Area Under the Curve of Receiver Operating Characteristics (AUC-ROC).

We measured the accuracy of the selected variables from one-way ANOVA using Precision ($\frac{TP}{TP + FP}$), Recall ($\frac{TP}{TP + FN}$) and $F_1$ score ($\frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$), where $TP$ is the number of true positives (the variables assigned with treatment effects in the simulation and correctly identified), $FP$ the number of false positives (the variables without treatment effects but wrongly identified) and $FN$ the number of false negatives (the variables with treatment effects that were not identified). A high precision indicates an accurate model with a low number of false positives, while a high recall indicates a sensitive model with a low number of false negatives. The $F_1$ score balances both precision and recall, with a high score indicating a model with good accuracy and sensitivity.
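These three measures can be computed from the indices of the selected variables and of the variables with a true treatment effect, as in the following sketch. The function name and the toy selection are hypothetical.

```python
def accuracy_measures(selected, truth):
    """Precision, recall and F1 score of a variable selection against the ground truth."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true positives
    fp = len(selected - truth)   # false positives
    fn = len(truth - selected)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 60 variables carry a true treatment effect; 50 variables are selected,
# 45 of which are correct.
truth = range(60)
selected = list(range(45)) + list(range(100, 105))
precision, recall, f1 = accuracy_measures(selected, truth)
```

In this example the precision is 0.9 (45 of 50 selected variables are correct) and the recall is 0.75 (45 of 60 true variables are recovered).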

We measured the accuracy of the selected variables from sPLSDA using AUC-ROC. As SVA does not generate batch-effect-corrected data, we only considered the Precision, Recall and $F_1$ score for this approach.

Proportion of explained variance across all variables

We used the multivariate method partial redundancy analysis (pRDA) in the batch-effect-corrected data to calculate the proportion of variance explained by treatment, batch effects and, most importantly, their intersection [6, 49]. The intersectional variance quantifies the unbalance in the batch × treatment design. A null value indicates a completely balanced design.

Proportion of explained variance for each variable

We used the $R^2$ value estimated with one-way ANOVA to calculate the proportion of variance explained by treatment or batch effects for each variable. The $R^2$ values with either treatment or batch effects were then visualized with boxplots. We also considered the sum of all the $R^2$ values to compare the methods globally.
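For a single variable, the proportion of variance explained by a grouping factor in a one-way ANOVA is the ratio of the between-group sum of squares to the total sum of squares. The following sketch computes it for hypothetical toy values; the function name is an assumption of this example.

```python
def r_squared(values, groups):
    """Between-group sum of squares divided by the total sum of squares."""
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    ss_between = 0.0
    for g in set(groups):
        members = [v for v, lab in zip(values, groups) if lab == g]
        ss_between += len(members) * (sum(members) / len(members) - grand) ** 2
    return ss_between / ss_total

# One variable dominated by batch variation.
abundance = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
batch = ["b1", "b1", "b1", "b2", "b2", "b2"]
treatment = ["t1", "t2", "t1", "t2", "t1", "t2"]
r2_batch = r_squared(abundance, batch)
r2_trt = r_squared(abundance, treatment)
```

For this variable most of the variance is explained by batch and little by treatment. After an effective batch effect correction, the batch value would be expected to drop towards zero.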

Principal Component Analysis (case studies only)

We investigated the variance structure of the data before and after batch effect correction using PCA. If batch effects account for the largest proportion of variance in the data, we expect a separation of the samples from different batches on the first component [6].

Alignment scores (case studies only)

We used the alignment score proposed for the integration of single-cell RNA-seq datasets [50]. We extended the approach, which was originally developed based on canonical correlation analysis, to PCA. This score complements the qualitative results from PCA to evaluate the degree of mixing of samples from different batches in the batch-effect-corrected data. The alignment score ranges from 0 to 1 (poor to excellent mixing of samples among the different batches). We first perform a PCA on a given batch-effect-corrected matrix to calculate a sample dissimilarity matrix based on the principal components that explained at least 95% of the total variance. Based on this dissimilarity matrix, the alignment score is defined as

$$\text{Alignment score} = 1 - \frac{\bar{x} - \frac{k}{N}}{k - \frac{k}{N}} \qquad (10)$$

where $k$ represents the number of nearest neighbours and $N$ represents the sample size. $x_i$ is the number of each sample’s nearest neighbours that belong to the same batch and $\bar{x}$ represents the average of all $x_i$. In our case studies, we chose a value of $k$ deemed reasonable for the sample size of our data.

Note that this score relies on the PCA projection to identify the nearest neighbours. Comparing scores is therefore only meaningful across PCA dissimilarity matrices (resulting from the matrices corrected with different methods) in which the samples have a similar distribution in their PCA projection.
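
The calculation can be sketched as follows. This is an illustrative sketch and not the code used in our analyses: it assumes the score takes the form 1 − (x̄ − k/N)/(k − k/N) with N the sample size, and it assumes a Euclidean dissimilarity on the retained components. The function name, the value of `k` and the simulated data are chosen for the example only.

```python
import numpy as np


def alignment_score(X, batch, k, var_threshold=0.95):
    """Alignment score from k nearest neighbours in PCA space (assumed form)."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    n = X.shape[0]

    # PCA via SVD; keep the components reaching the cumulative variance threshold.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    cum_var = np.cumsum(s ** 2) / (s ** 2).sum()
    n_comp = int(np.searchsorted(cum_var, var_threshold) + 1)
    scores = U[:, :n_comp] * s[:n_comp]

    # Euclidean dissimilarity between samples (assumption), excluding self-matches.
    diff = scores[:, None, :] - scores[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)

    # x_i: number of each sample's k nearest neighbours from the same batch.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    same_batch = (batch[neighbours] == batch[:, None]).sum(axis=1)
    x_bar = same_batch.mean()
    return float(1 - (x_bar - k / n) / (k - k / n))


# Simulated data: 40 samples in two batches, without and with a batch shift.
rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 20)
X_mixed = rng.normal(size=(40, 5))
X_separated = X_mixed.copy()
X_separated[batch == 1] += 10.0

score_mixed = alignment_score(X_mixed, batch, k=4)
score_separated = alignment_score(X_separated, batch, k=4)
print(score_mixed, score_separated)
```

When the batches are fully separated, every nearest neighbour belongs to the same batch, so the average count equals k and the score is 0; data without a batch shift yield a higher score.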

Results

We benchmarked our three PLSDA-batch methods with removeBatchEffect, ComBat and SVA on the simulated datasets, then against the former two on the three case studies.

Simulation studies

We first describe the results from a single simulation scenario with two batch groups where parameters were representative of real data, namely, Inline graphic . The results for the other scenarios are summarized in Supplemental Figures S1–S6.

pRDA assessment

Efficient batch effect correction methods should generate data in which the proportion of variance explained by batch effects is null and the proportion of variance explained by treatment is larger than in the original data, as illustrated by the original and ground-truth data in Figure 2A.

Figure 2. Simulation studies (two batch groups): comparison of explained variance before and after batch effect correction for **(A)** balanced and **(B)** unbalanced batch × treatment designs. pRDA estimated the proportion of variance explained by (from top to bottom) residuals, batch effects, the intersection of batch and treatment effects, and treatment effects. All methods performed equally well in removing batch variance for a balanced design except ComBat, while in an unbalanced design, our weighted variants wPLSDA-batch and swPLSDA-batch performed better than their unweighted counterparts.

For a balanced batch × treatment design (Figure 2A), we observed no intersection shared between treatment and batch variance, as expected. All methods successfully removed batch variance and preserved treatment variance (or slightly increased it, in the case of sPLSDA-batch), with the exception of ComBat, where a very small amount of batch variance remained.

For a strongly unbalanced batch × treatment design (Figure 2B), we observed the presence of intersectional variance explained by both batch and treatment effects, as expected. This source of variance is also present in the ground-truth data but should be smaller than in the uncorrected data. Both unweighted PLSDA-batch and sPLSDA-batch performed poorly for such a design: for PLSDA-batch the intersectional variance increased, while for sPLSDA-batch the batch variance was not entirely removed. The other methods were successful in removing batch variance. With removeBatchEffect and ComBat, the proportion of variance explained by treatment was similar to that of the ground-truth data, while with wPLSDA-batch and swPLSDA-batch slightly less variance was explained by treatment.

$R^2$ assessment

We estimated the proportion of variance explained by treatment and batch effects for each variable using the $R^2$ value.

In the balanced batch × treatment design (Figure 3A), removeBatchEffect and PLSDA-batch had the best performance, with results very similar to the ground-truth data. ComBat retained more batch variance in variables with batch effects only and in variables with both batch and treatment effects, indicating an incomplete removal of batch effects. This result is in agreement with the overall pRDA evaluation described earlier. For sPLSDA-batch, variables with no treatment effect (batch effects only) included a slight amount of (spurious) treatment variance. This was also observed in the pRDA evaluation. However, sPLSDA-batch performed as well as PLSDA-batch when the simulated data did not include variables with both batch and treatment effects.

Figure 3. Simulation studies (two batch groups): $R^2$ values for each microbial variable before and after batch effect correction for **(A)** balanced and **(B)** unbalanced batch × treatment designs. Each box represents a summary of $R^2$ values for variables simulated with the associated effects (batch and/or treatment effects). Each $R^2$ value was estimated for each variable from a one-way ANOVA with a treatment effect or batch effect as covariate (x-axis). The colours indicate the effects assigned to each variable. In both designs, ComBat did not remove enough batch variation. For the balanced design, sPLSDA-batch generated a small amount of spurious treatment variation for the variables with batch effects only. For the unbalanced design, wPLSDA-batch and swPLSDA-batch generated data with less treatment variation for the variables with both treatment and batch effects compared with the ground-truth data.

We observed similar performance for removeBatchEffect and ComBat for the unbalanced design (Figure 3B). With wPLSDA-batch and swPLSDA-batch, variables with both treatment and batch effects explained less treatment variance after correction, compared with the ground-truth data. However, for the other variables, wPLSDA-batch and its sparse version gave results similar to the ground-truth data.

The sum of all the $R^2$ values showed similar results (Supplemental Figure S7).

Accuracy measures

The results from the accuracy measures combined with variable selection highlight the importance of removing batch effects as both F1 score and AUC largely improved compared with the original data (Table 4).

Table 4.

Simulation studies (two batch groups): summary of accuracy measurements before and after batch effect correction. The proportion of correctly identified microbial variables with a true treatment effect was assessed with Precision, Recall and F1 score (using one-way ANOVA as variable selection procedure) and with AUC (using sPLSDA as variable selection procedure). Each value is the mean (standard deviation) over 50 repeats.

		Before correction	ground-truth data	SVA	removeBatchEffect	ComBat	PLSDA-batch	sPLSDA-batch
Balanced	Precision	0.984 (0.04)	0.952 (0.08)	0.957 (0.06)	0.950 (0.09)	0.952 (0.08)	0.952 (0.08)	0.807 (0.11)
	Recall	0.674 (0.03)	0.900 (0.03)	0.934 (0.03)	0.910 (0.03)	0.911 (0.03)	0.910 (0.03)	0.910 (0.03)
	F1	0.799 (0.02)	0.923 (0.05)	0.944 (0.04)	0.927 (0.05)	0.929 (0.05)	0.929 (0.05)	0.851 (0.06)
	AUC	0.944 (0.02)	0.964 (0.02)	/	0.968 (0.02)	0.968 (0.02)	0.969 (0.01)	0.954 (0.02)
		Before correction	ground-truth data	SVA	removeBatchEffect	ComBat	wPLSDA-batch	swPLSDA-batch
Unbalanced	Precision	0.385 (0.01)	0.973 (0.05)	0.401 (0.02)	0.901 (0.09)	0.834 (0.08)	0.943 (0.05)	0.943 (0.05)
	Recall	0.825 (0.03)	0.895 (0.03)	0.918 (0.03)	0.910 (0.03)	0.919 (0.03)	0.888 (0.03)	0.862 (0.03)
	F1	0.525 (0.01)	0.932 (0.03)	0.558 (0.02)	0.903 (0.05)	0.873 (0.05)	0.914 (0.03)	0.900 (0.03)
	AUC	0.704 (0.06)	0.967 (0.02)	/	0.963 (0.02)	0.962 (0.01)	0.965 (0.01)	0.954 (0.02)


In the balanced design, variables selected from the original data had a higher precision but a lower recall and a lower AUC than those selected from the ground-truth data, indicating that fewer variables with an actual treatment effect were selected. Combined with univariate one-way ANOVA, SVA performed best, with the highest accuracy measurements, which were sometimes greater than those of the ground-truth data, as we discuss below. The other methods led to similar performance, with the exception of sPLSDA-batch, which selected more false positives than the other methods. PLSDA-batch led to a slightly better AUC than the other methods.

In the unbalanced design, the precision of SVA was low and very similar to that of the original data, indicating that the performance of SVA depends heavily on the experimental design and that the method is likely to overfit. This may explain the somewhat inflated results of SVA in the balanced design. wPLSDA-batch performed best, with results close to those from the ground-truth data.

We observed similar results but with higher resolution of these accuracy measures for the other simulation scenarios presented in Supplemental Figures S1–S6 and discussed in the Supplemental Section S3.1. For simulations with three batch groups (parameters Inline graphic ), we also observed similar results as the two batch group cases (Supplemental Figures S8, S9 and S10 and Table S1).

Summary of the simulation results

Our extensive simulation studies showed that weighted PLSDA-batch was essential for an unbalanced batch × treatment design, compared with its unweighted counterpart. Our PLSDA-batch method preserved a similar or slightly smaller proportion of treatment variance compared with the other batch effect correction methods, but achieved a higher F1 score and AUC, especially in an unbalanced design. When there were no variables with both treatment and batch effects in the data, the data corrected with sPLSDA-batch and PLSDA-batch were close to the ground-truth data. However, when some variables included both these effects, sPLSDA-batch performed slightly worse than PLSDA-batch. Our results also suggested that SVA had a tendency to overfit the data, while ComBat was not able to completely remove batch variation. removeBatchEffect was not able to preserve enough treatment effects for accurate variable identification.