Rank normalization empowers a t-test for microbiome differential abundance analysis while controlling for false discoveries

Matthew L Davis; Yuan Huang; Kai Wang

2021 Apr 5;22(5):bbab059. doi: 10.1093/bib/bbab059

PMCID: PMC9630402 PMID: 33822893

Abstract

A major task in the analysis of microbiome data is to identify microbes associated with differing biological conditions. Before conducting analysis, raw data must first be adjusted so that counts from different samples are comparable. A typical approach is to estimate normalization factors by which all counts in a sample are multiplied or divided. However, the inherent variation associated with estimation of normalization factors is often not accounted for in subsequent analysis, leading to a loss of precision. Rank normalization is a nonparametric alternative to the estimation of normalization factors in which each count for a microbial feature is replaced by its intrasample rank. Although rank normalization has been successfully applied to microarray analysis in the past, it has yet to be explored for microbiome data, which is characterized by high frequencies of 0s, strongly correlated features and compositionality. We propose to use rank normalization as an alternative to the estimation of normalization factors and examine its performance when paired with a two-sample t-test. On a rigorous 3rd-party benchmarking simulation, it is shown to offer strong control over the false discovery rate, and at sample sizes greater than 50 per treatment group, to offer an improvement in performance over commonly used normalization factors paired with t-tests, Wilcoxon rank-sum tests and methodologies implemented by R packages. On two real datasets, it yielded valid and reproducible results that were strongly in agreement with the original findings and the existing literature, further demonstrating its robustness and future potential. Availability: The data underlying this article are available online along with R code and supplementary materials at https://github.com/matthewlouisdavisBioStat/Rank-Normalization-Empowers-a-T-Test.

Keywords: rank normalization, microbiome, differential abundance analysis, false discovery rate, t-test

Introduction

The consideration of the human microbiome as an inseparable component of human physiology has proven to be nothing short of a paradigm shift in biomedical science. Already, novel understandings of the human-microbiome symbiosis have inspired medical innovations such as fecal transplants for intestinal bowel diseases [1], live cultures as a treatment for viral infections [2], and the use of microbial biomarkers as diagnostic tools [3].

Rapid innovations in high-throughput sequencing technologies have made it possible to classify microbial gene signatures at species-level resolution [4]. These advancements have inspired a type of exploratory statistical analysis referred to as differential abundance analysis, which in its simplest form attempts to answer the question, ‘Given two groups of subjects, which microbes are more abundant in one group versus the other?’ [5].

Microbiome data obtained from 16S rRNA amplicon or shotgun metagenomic sequencing is presented in the form of an ‘OTU table’, with rows corresponding to operational taxonomic units (OTUs) and columns (samples) corresponding to the subjects who were sequenced [6]. Each sample corresponds to the sequenced biological sample of each subject, and observed quantities in each cell (referred to here as counts) correspond to the number of distinct taxonomic gene signatures that could be identified by the sequencing procedure. The sum total of all observed counts in a sample is referred to as its ‘library size’, and is limited by a technological constraint on its ‘sequencing depth’, reflecting the total number of counts that a machine is able to sequence [5]. Relative to other high-throughput sequencing data, microbiome data is especially noisy; the extraction, transport and storage of raw samples introduce potentially confounding sources of variation [4], counts between OTUs are highly correlated [7], and most often, over 90% of counts in OTU tables are 0s [8]. Hence, raw counts can only be interpreted as sample-specific relative abundance, and the crucial first step of any analysis is to normalize the data to allow for proper comparisons between samples.

One of the most widely used approaches for normalizing microbiome data is through estimation of normalization factors (NFs). NFs, sometimes referred to as ‘scaling’ factors [6], are scalar values by which all counts in a sample are divided or multiplied. Total Sum Scaling (TSS) is one of the most straightforward: each count is divided by the total sum of counts in its sample, thereby converting counts to proportions [9]. Oftentimes, the counts of the most highly abundant OTUs vary by orders of magnitude between samples and make up the majority of the total sum of counts in each sample. In response, Cumulative Sum Scaling (CSS) modifies TSS by setting NFs equal to the total sum of counts up to an optimized quantile, thereby excluding the overt influence of highly abundant OTUs [10]. Trimmed Mean M-values (TMM), borrowed from RNA-Seq analysis, computes the trimmed log2 ratio of proportions in each sample to an arbitrarily chosen reference sample, by which counts are then multiplied [11]. These three methodologies are amongst the most commonly examined [8, 12] and are detailed more closely in Table 1.

Table 1.

Normalization methodologies to be paired with differential abundance tests

Normalization methodology	Description
Trimmed Mean M-value (TMM)	i) Compute an M-Value for each OTU as its log2-ratio of proportional abundances in each sample to an arbitrarily chosen reference sample. ii) Trim M-values based on relative and absolute abundance within each sample iii) Define weights as the inverse of the approximate asymptotic variances of the M-values iv) Calculate normalization factors as the weighted mean of trimmed M-values for an OTU multiplied by observed library sizes v) Divide counts by NFs
Cumulative Sum Scaling (CSS)	i) Define a reference distribution with quantiles equal to the median of each quantile across samples ii) Calculate the median absolute deviation over all samples of observed quantiles vs. reference iii) Determine the minimum quantile at which this median deviance is greater than 0.1 iv) Calculate normalization factors as the sum of all counts up to this quantile v) Divide counts by NFs
Total Sum Scaling (TSS)	Divide each count in a sample by the total sum of counts in the sample
Rank normalization	Replace each count in a sample with its intrasample rank
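
As a concrete illustration of the simplest NF in Table 1, the following minimal sketch applies TSS to a small OTU table. The function name `total_sum_scaling` and the toy counts are assumptions made for illustration; they are not taken from the article's R code.

```python
def total_sum_scaling(otu_table):
    """Convert counts to within-sample proportions (TSS).

    otu_table: list of rows (one per OTU), each a list of counts per sample.
    """
    n_samples = len(otu_table[0])
    # Library size = total count observed in each sample (column).
    library_sizes = [sum(row[j] for row in otu_table) for j in range(n_samples)]
    return [
        [row[j] / library_sizes[j] if library_sizes[j] > 0 else 0.0
         for j in range(n_samples)]
        for row in otu_table
    ]


# Hypothetical table: 4 OTUs (rows) by 2 samples (columns).
counts = [
    [10, 200],
    [0, 50],
    [30, 0],
    [60, 250],
]
proportions = total_sum_scaling(counts)
for row in proportions:
    print(row)
```

Note that counts of 0 remain 0 after scaling, which is the behaviour that rank normalization changes.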

NFs are random variables because they are computed from the observed data, and are thus associated with variation. However, this variation is often ignored in subsequent hypothesis tests, and is instead misattributed to variability in normalized counts. This leads to false discoveries if the variation of NFs is associated with the biological conditions of interest, thereby introducing a confounding source of bias into normalized counts, and to a loss of power otherwise. This was demonstrated explicitly for TSS (i.e. proportions) [6] and extends to other NFs as well [8].

Following normalization, a differential abundance test is performed. Well-known tools include the R package metagenomeSeq [10], developed for metagenome analysis, as well as edgeR [13], Limma-Voom [14] and DESeq2 [15], borrowed from RNA-Seq analysis. Recent benchmarking studies have shown that these methodologies often fail to control for false discoveries and that performance is mostly invariant to choice of NF [16, 17]. ALDEx2 was designed from the compositional perspective with application to microbiome data [18]; however, its recent performance in these same benchmarking studies shows it to be highly conservative. These methodologies are briefly summarized in Table 2.

Table 2.

Differential abundance testing methodologies to be evaluated on simulations and real datasets

Differential abundance test	Description
edgeR (with TMM)	i) Fit a quasi-likelihood negative binomial generalized log-linear model to normalized counts ii) Perform empirical Bayes quasi-likelihood F-test for the coefficient of interest
metagenomeSeq (with CSS)	i) Fit a zero-inflated gaussian model to normalized counts ii) Perform univariate t-test for the coefficient of interest
ALDEx2	i) Generate Monte Carlo instances from a Dirichlet-multinomial distribution for each sample ii) Center log-transform data iii) Perform Welch’s two-sample t-test
DESeq2	i) Normalize counts using default median-ratio method ii) Fit a negative binomial distribution generalized linear model to normalized counts iii) Perform Wald test for the coefficient of interest
Limma-Voom (with TMM)	i) Transform normalized counts to log2-counts per million scale ii) Estimate mean-variance relationship of counts and compute precision weights for each observation iii) Fit a linear model to the transformed counts using weighted least-squares iv) Perform univariate t-test for the coefficient of interest
Wilcoxon rank-sum test	Perform Wilcoxon rank-sum test on normalized counts
t-test	Perform Welch’s two-sample t-test on normalized counts

Despite the diversity of highly sophisticated methodologies implemented in R packages, a classic t-test or Wilcoxon rank-sum (WRS) test is often used in practice [17, 19]. Existing NFs are usually evaluated for use with other R packages [8, 12, 16], and their performance with respect to classical tests is either ancillary or ignored altogether. There does not appear to be a recent focus on developing simple yet effective normalization strategies to be paired with a classical test, despite the fact that such a methodology has the potential to be as widely used as any R package.

Rank normalization is a nonparametric alternative to the estimation of NFs in which counts are replaced with their intrasample rank, and thus bypasses the need to estimate any external parameters from the data. Its usefulness has been explored for compositional data in the presence of large numbers of 0s [20], and it appropriately allows for t-tests to be performed on otherwise noisy array-like data [21]. In the past, it has been extensively utilized in the context of microarray analysis [22, 23], but to the best of our knowledge, it has yet to be explored in the microbiome setting.

Rank normalization

Differential abundance analysis performed on rank-normalized data with a t-test involves three steps:

Exclude OTUs from the data that only have counts of 0 in all samples.
In each sample, rank all counts in ascending order. Ties are dealt with using the average-ties method.
For each OTU, perform Welch’s two-sample t-test directly on the ranks over the two treatment conditions.
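
The three steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' R implementation; the helper names (`average_ranks`, `rank_normalize`, `welch_t`) and the toy counts are assumptions. The sketch returns Welch's t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value requires a t-distribution CDF, which is left out here.

```python
import math
from statistics import mean, variance


def average_ranks(values):
    """Ascending ranks; tied values share the average of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # sorted positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_normalize(otu_table):
    """Step 1: drop all-zero OTUs. Step 2: rank counts within each sample (column)."""
    kept = [row for row in otu_table if any(c > 0 for c in row)]
    n_samples = len(kept[0])
    columns = [average_ranks([row[j] for row in kept]) for j in range(n_samples)]
    return [[columns[j][i] for j in range(n_samples)] for i in range(len(kept))]


def welch_t(x, y):
    """Step 3: Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    if vx + vy == 0:  # constant ranks in both groups: the statistic is undefined
        return float("nan"), float("nan")
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical table: 5 OTUs (rows), samples 1-3 are controls, 4-6 are treated.
counts = [
    [0, 0, 0, 0, 0, 0],      # all zeros, removed in step 1
    [5, 3, 14, 20, 30, 9],
    [0, 1, 0, 0, 2, 0],
    [10, 12, 9, 11, 8, 10],
    [0, 0, 2, 0, 0, 1],
]
ranks = rank_normalize(counts)
t_stat, df = welch_t(ranks[0][:3], ranks[0][3:])
print(ranks[0], t_stat, df)
```

Ranking is done column by column, so no quantity is estimated across samples before the test is applied.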

An important property of rank normalization using the average-ties method is that all counts of 0 in a sample are assigned a rank equal to one half the number of 0s in their sample, plus one half. This is visualized in Figure 1.

Figure 1.

Demonstration of rank normalization. This shows how one would rank samples in microbiome data. Data are contrived for simplicity, with only seven OTUs (rows) and two samples. Raw data correspond to the columns on the left, and these samples are ranked in ascending order on the right. Note how the ranks of 0s are assigned a value equal to one half the number of 0s in the sample plus one half.
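
The rank assigned to 0s follows from averaging the tied positions. As a short worked derivation, write $n_0$ for the number of 0s in a sample and $\bar{r}_0$ for their shared rank (both symbols are introduced here for illustration). Because counts are nonnegative, the 0s occupy the lowest positions $1, \dots, n_0$, and each receives the mean of those positions:

```latex
\bar{r}_0 \;=\; \frac{1}{n_0}\sum_{k=1}^{n_0} k
        \;=\; \frac{1}{n_0}\cdot\frac{n_0\,(n_0+1)}{2}
        \;=\; \frac{n_0}{2} + \frac{1}{2}
```

For instance, a sample with four counts of 0 assigns each of them a rank of 2.5, whereas a sample with six counts of 0 assigns each a rank of 3.5.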

An immediate consequence of this is that the ranks assigned to counts of 0 follow a distribution exactly proportional to the frequencies of 0s appearing in each sample; OTUs with counts of 0 have lower ranked abundance in samples with fewer 0s, and higher ranked abundance in samples with more 0s.

This is meaningful, as OTUs in samples sequenced with lower sequencing depth are less likely to meet the detectable threshold of abundance, and thus, exhibit greater frequencies of 0s than in samples with greater sequencing depth [8]. Since there is no way to determine for sure if a 0 is due to a failure to meet a detectable threshold of abundance or if an OTU is truly absent, ranking counts of 0s is a simple way to quantify this uncertainty without involving any parametric assumptions, transformations or imputation. Contrast this to NFs, which leave counts of 0 unchanged following normalization.

Distributions of observed ranks for three real datasets can be visualized in Figure 2. Notice the two distinct distributions; the denser distribution on the left describes the distribution of ranks corresponding to counts of 0, and is directly proportional to the frequency of 0s appearing in each sample. The shallower, smaller distribution on the right corresponds to the ranks of counts greater than 0.

Figure 2.

Histograms of observed ranks for three real datasets. Ranks for all OTUs in all samples are shown. For each sample, OTUs are ranked in ascending order using the average-ties method, thus ranks assigned to counts of 0 are equal to one half the number of 0s in their sample plus one half. The larger, denser distribution on the left corresponds to the ranks assigned to counts of 0, and the shallower distribution on the right describes the ranks assigned to non-0 counts. Zeller: Zeller metagenomic data (used for real data analysis); RISK: RISK 16S rRNA amplicon data (used for real data analysis); HMP: HMP Mid Vagina 16S rRNA amplicon data (used as the template for simulations).

Although here we ranked in ascending order, it is important to mention that ranking in descending order will not affect results; a t-test performed on ascending ranks will yield the same results as a t-test performed on descending ranks, albeit with reversed signs of the t statistics. This is because reversing the ranks in any sample simply amounts to subtracting each rank from the constant ‘N+1’, the number of OTUs in the table plus 1, which leaves the variance unchanged. Also worth noting is that performing a t-test on the fractional ranks (the ranks divided by the number of OTUs being tested) will yield identical inferences to the unscaled ranks.
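
Both invariance properties can be checked numerically. The sketch below uses hypothetical ranks for a single OTU in a table of seven OTUs; the values and the helper `welch_t` are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's two-sample t statistic."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / math.sqrt(se2)


n_otus = 7                          # hypothetical number of OTUs in the table (N)
control = [1.5, 2.0, 4.0, 3.0]      # ascending ranks of one OTU, control samples
treated = [5.0, 6.0, 4.0, 7.0]      # ascending ranks of the same OTU, treated samples

t_ascending = welch_t(control, treated)

# Descending ranks: each rank r becomes (N + 1) - r.
t_descending = welch_t([n_otus + 1 - r for r in control],
                       [n_otus + 1 - r for r in treated])

# Fractional ranks: each rank is divided by N.
t_fractional = welch_t([r / n_otus for r in control],
                       [r / n_otus for r in treated])

print(t_ascending, t_descending, t_fractional)
```

Reversing the ranks flips the sign of the statistic, and dividing by a positive constant rescales the numerator and the standard error by the same factor, so the statistic is unchanged.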

Materials

For all computational tasks, a Dell XPS 13 8 core laptop was used, and all programming tasks were implemented in R (versions 3.5.1 and 3.6.3) [24]. External R packages and versions used here include ALDEx2 (version 1.16.0) [18], BiocManager (version 1.30.4) [25], DESeq2 (version 1.24.0) [15], doSNOW (version 1.0.18) [26], edgeR (version 3.26.8) [13], limma (version 3.40.6) [27], metagenomeSeq (version 1.26.3) [10], MicrobeDS (version 0.1.0) [28], phyloseq (version 1.28.0) [9], and SpiecEasi (version 1.0.6) [7]. All R code for simulations was adapted from materials made freely available by the simulation authors [17]. Zeller metagenomic sequencing data can be found in the Supplementary Materials made available in the original publication [29]. The RISK 16S rRNA data [30] is curated by the MicrobeDS package in R [28]. The HMP data [31] can be found at https://www.hmpdacc.org/HMQCP/.

Simulations

Description

Methods to generate simulated microbiome datasets were adapted from a recent large-scale benchmarking study assessing the performance of differential abundance detection methodologies [17]. Although a multitude of simulation schemes were explored in depth by the authors, the settings chosen best reflect the scope and goals of this methodological study, with justifications given below.

Unlike the other main simulation schemes considered by the simulation authors, a parametric approach referred to by the simulation authors as ‘H1’ was the only one that allowed for precise control over simulation parameters. Data were generated under what is referred to by the simulation authors as ‘without compensation’ to preserve compositionality, an inherent feature of microbiome data [32]. OTU tables were generated with an imposed, realistic correlation structure between OTUs as estimated by SpiecEasi [7] and for the negative binomial distribution, outliers were introduced to mimic the noisy characteristics inherent to microbiome datasets (this feature was not implemented for the beta-binomial distribution by the simulation authors). Data sequenced from samples of the Mid Vagina (version V13 region of 16S rRNA gene) as a part of the Human Microbiome Project [31] (HMP) serve as the real data template because it exhibited both more correlations between OTUs and stronger correlations than any of the other sample templates examined [17], providing a more rigorous challenge to differential abundance detection methodologies.

Both a negative binomial distribution and beta-binomial distribution were chosen for modelling the marginal distributions of simulated OTUs. Counts following a negative binomial distribution were considered by the simulation authors to more accurately resemble the true distribution of counts in microbiome datasets, while the beta-binomial distribution exhibited greater frequencies of 0s and higher variability [17]. Although there is not a consensus on which is more appropriate, these are two of the most common parametric distributions used to model microbiome data in current research [33].

To simulate data, these parametric models were fit to 1000 of the most abundant OTUs from the sample template. For the negative binomial distribution, models were fit independently to each OTU, and parameters were estimated by maximum-likelihood. The beta-binomial distribution describes the marginal distribution of OTUs in samples jointly distributed according to a Dirichlet-Multinomial distribution, and so they were jointly modelled by this distribution with parameters estimated by method-of-moments. The models for each OTU were parameterized in such a way that the expected relative abundance of OTUs remained consistent across samples, but the expected magnitude abundance varied according to library sizes unique to each sample. Thus, each simulated OTU in each simulated sample was assigned a unique probability density function (and hence a unique, corresponding quantile function).

To impose a correlation structure, rather than directly generate counts from the marginal distributions described above, correlated multivariate-normal data were first randomly generated based on the correlation structure of taxa estimated by SpiecEasi, and then transformed to the copula space. Following this, counts were mapped from the copula space back to the marginal distributions of choice through the quantile functions unique to each OTU in each sample as described above. This is referred to as the ‘normal to anything’ approach [7], and in this way, counts between OTUs are jointly correlated, but marginally follow the parametric distributions of choice.
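
The ‘normal to anything’ idea can be sketched for just two OTUs. This is a simplified illustration under stated assumptions, not the simulation authors' code: the correlation `rho`, the negative binomial means and the dispersion parameter `size` are hypothetical, whereas the actual simulations used a SpiecEasi-estimated correlation structure and parameters fitted to the HMP template.

```python
import math
import random
from statistics import NormalDist


def nb_pmf(k, size, mu):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0."""
    p = size / (size + mu)
    log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def nb_quantile(u, size, mu, max_count=10**6):
    """Quantile function: smallest count whose cumulative probability reaches u."""
    k, cdf = 0, nb_pmf(0, size, mu)
    while cdf < u and k < max_count:
        k += 1
        cdf += nb_pmf(k, size, mu)
    return k


def norta_pair(rho, mus, size, rng):
    """Draw one pair of correlated negative binomial counts."""
    # 1) Correlated standard normals (2x2 Cholesky factor written out by hand).
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    # 2) Map to the copula space: uniforms via the standard normal CDF,
    #    clamped away from 0 and 1 so the quantile search always terminates quickly.
    us = [min(max(NormalDist().cdf(z), 1e-9), 1 - 1e-9) for z in (z1, z2)]
    # 3) Map each uniform through that OTU's own quantile function.
    return [nb_quantile(u, size, mu) for u, mu in zip(us, mus)]


rng = random.Random(1)
draws = [norta_pair(rho=0.8, mus=(5.0, 20.0), size=2.0, rng=rng) for _ in range(5)]
print(draws)
```

Each margin keeps its own negative binomial distribution, while the dependence between the two OTUs is inherited from the underlying normal pair.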

Each simulated dataset was composed of 1000 OTUs (rows), and sample sizes (columns) varied between 5, 15, 25, 50, 100 and 150 per treatment group. For convenience, sample sizes of 5, 15 and 25 will be referred to as ‘smaller’ sample sizes, and sample sizes of 50, 100 and 150 as ‘larger’ sample sizes. Consistent with the original simulation settings, increasing differential abundance was introduced under fold-changes (FC) of 1.5, 3 and 5, representing the increased multiplicative change in proportional abundance for differentially abundant OTUs. For all simulated datasets, 100 of the total 1000 OTUs (10%) were assigned to be increasingly differentially abundant, as specified in the original simulation design. Because differential abundance was introduced on a proportional scale, this induces compositional bias, in which the proportional abundances of nondifferentially abundant OTUs decrease. The procedure for introducing differential abundance is visualized in Figure 3.

Figure 3.

Illustration of differential abundance and compositional bias. In this figure, steps towards introducing differential abundance in simulated microbiome datasets are shown. First, parameters for the simulation are determined by modelling the observed counts of OTUs in the dataset template (1). The mean relative abundance of an OTU, or ‘proportional abundance’, represents the expected proportion of counts in a sample belonging to each OTU. The observed proportional abundances of OTUs in the chosen dataset template serve as the proportional abundances assigned to OTUs in the simulation control group (2). By definition, proportions are constrained to sum to 1. When multiplicative differential abundance is first introduced (3), the sum of all proportional abundances initially exceeds 1, so they are re-scaled back down to meet this constraint (4). This step induces compositional bias, in which the proportional abundances of nondifferentially abundant OTUs consequently decrease. To determine the expected abundance of OTUs on the absolute scale, library sizes (i.e. the sums of counts in observed samples from the dataset template) are randomly sampled with replacement and multiplied by the proportional abundances of each OTU in each sample (5). Counts are then generated through the procedure described in Methods (6), yielding a simulated OTU table, which can be analyzed for differential abundance (7). For this contrived example, 10 OTUs were considered with 2 samples. OTU 5 was assigned differential abundance with a multiplicative fold-change of 5, as shown in step 3. The randomly sampled library size from the dataset template for the control sample is 1000, and the randomly sampled library size for the treatment group is 2000, as shown in step 5.
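
Steps 3 to 5 can be reproduced with a few lines of arithmetic. The sketch below mirrors the contrived setting of 10 OTUs with OTU 5 given a fold-change of 5 and library sizes of 1000 and 2000; the uniform control proportions and the function name `apply_fold_change` are assumptions made for illustration.

```python
def apply_fold_change(proportions, da_indices, fold_change):
    """Multiply the chosen OTUs by the fold-change, then re-scale to sum to 1."""
    inflated = [p * fold_change if i in da_indices else p
                for i, p in enumerate(proportions)]
    total = sum(inflated)          # exceeds 1 once the fold-change is applied
    return [p / total for p in inflated]


# Assumed control proportions: 10 OTUs, each with proportional abundance 0.1.
control = [0.1] * 10
# OTU 5 (index 4) is differentially abundant with a fold-change of 5.
treatment = apply_fold_change(control, da_indices={4}, fold_change=5)

# Expected counts on the absolute scale: proportion times library size.
expected_control = [p * 1000 for p in control]
expected_treatment = [p * 2000 for p in treatment]

print(treatment[4], treatment[0])
```

After re-scaling, the nine unchanged OTUs each fall from a proportion of 0.1 to 0.1/1.4, which is the compositional bias described in the caption.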

To ensure consistency with the original simulation settings, for each simulated dataset, only simulated OTUs that had counts greater than 0 in at least 5% of the samples were tested for differential abundance (i.e. each OTU had to be present in at least 5% of the samples). With the exception of rank normalization and ALDEx2 (as recommended by its authors [18]), OTUs were removed prior to normalization.
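
The prevalence criterion can be expressed as a simple filter. This is a minimal sketch with a hypothetical function name and toy data; it returns the indices of the OTUs that would be tested.

```python
def prevalence_filter(otu_table, min_prevalence=0.05):
    """Return indices of OTUs with non-0 counts in at least min_prevalence of samples."""
    n_samples = len(otu_table[0])
    kept = []
    for i, row in enumerate(otu_table):
        n_present = sum(1 for c in row if c > 0)
        if n_present / n_samples >= min_prevalence:
            kept.append(i)
    return kept


# Hypothetical table with 40 samples: at least 2 non-0 counts are needed to reach 5%.
n = 40
table = [
    [0] * n,                   # never observed: excluded
    [7] + [0] * (n - 1),       # present in 1 of 40 samples (2.5%): excluded
    [3, 1, 4] + [0] * (n - 3), # present in 3 of 40 samples (7.5%): tested
    [2] * n,                   # present everywhere: tested
]
tested = prevalence_filter(table)
print(tested)
```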

The nominal Type-I error rate was set to be 5%, and Benjamini-Hochberg false discovery correction [35] was used to adjust p-values. Performance was evaluated by sensitivity and false discovery rate (FDR), which are described in Equations 1 and 2. Sensitivity refers to the number of OTUs correctly classified as differentially abundant (true discoveries) divided by the total number of differentially abundant OTUs. FDR refers to the number of OTUs incorrectly classified as differentially abundant (false discoveries) divided by the number of all OTUs classified as differentially abundant, whether correctly or incorrectly.

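$$\text{Sensitivity} = \frac{\text{number of true discoveries}}{\text{total number of differentially abundant OTUs}} \tag{1}$$

$$\text{FDR} = \frac{\text{number of false discoveries}}{\text{number of true discoveries} + \text{number of false discoveries}} \tag{2}$$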
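
The evaluation described above can be sketched as follows. This is a hedged illustration: the p-values and the set of truly differentially abundant OTUs are hypothetical, the function names are not from the article, and reporting an FDR of 0 when nothing is discovered is a convention assumed here.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def sensitivity_and_fdr(discoveries, truly_da):
    """Sensitivity and FDR as in Equations 1 and 2, for sets of OTU indices."""
    true_disc = len(discoveries & truly_da)
    false_disc = len(discoveries - truly_da)
    sensitivity = true_disc / len(truly_da) if truly_da else 0.0
    # Assumed convention: FDR is 0 when there are no discoveries at all.
    n_disc = true_disc + false_disc
    fdr = false_disc / n_disc if n_disc else 0.0
    return sensitivity, fdr


# Hypothetical p-values for 5 OTUs; OTUs 0, 2 and 3 are truly differentially abundant.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
discoveries = {i for i, q in enumerate(adjusted) if q <= 0.05}
sensitivity, fdr = sensitivity_and_fdr(discoveries, truly_da={0, 2, 3})
print(adjusted, discoveries, sensitivity, fdr)
```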

For more detailed descriptions of the data generation process, readers are referred to the manuscript and Supplementary Materials provided by the simulation authors [17].

Results

Larger sample sizes

Figures 4 and 5 and MainSimulationResults.csv (included in Supplementary Materials) show the performance of a t-test with rank normalization relative to competing differential abundance detection methodologies for the larger sample sizes of 50, 100, and 150 with FCs of 3 and 5. Under the beta-binomial distribution with FC of 3, the sensitivity of a t-test when paired with rank normalization ranged from less than 0.6% (s.e. 1.0%) at sample size of 50 to 5.1% (s.e. 2.5%) at sample size of 100 to 15.4% (s.e. 4.0%) at sample size of 150. Simultaneously, the mean FDR was virtually completely controlled, ranging from no false discoveries at sample sizes of 50 and 100 to only 0.4% (s.e. 1.5%) at sample size of 150. Under the beta-binomial distribution with FC of 5, the sensitivity of a t-test when paired with rank normalization ranged from 8.2% (s.e. 3.8%) at sample size of 50 to 30.0% (s.e. 5.0%) at sample size of 100 to 52.6% (s.e. 3.6%) at sample size of 150. Simultaneously, the mean FDR remained below 5%, ranging from 0.4% (s.e. 2.0%) at sample size of 50 to 2.5% (s.e. 2.8%) at sample size of 100 to 4.8% (s.e. 3.4%) at sample size of 150. Overall, the mean FDR for rank normalization with a t-test never surpassed the nominal 5% level under the beta-binomial distribution.

Figure 4.

Scatterplots of differential abundance detection methodological performance on simulated microbiome datasets, with counts generated from a beta-binomial distribution. Rows correspond to sample sizes (M) of 50, 100 and 150 per treatment group, and columns correspond to induced fold-changes (FC) of 3 and 5, respectively. X-axis refers to 1 minus the false discovery rate (1-FDR) and the y-axis refers to sensitivity. Numbered coordinate pairs indicate the mean performance of each indexed methodology. Bubbles extend from mean performance with horizontal and vertical radii equal to the standard error of FDR and sensitivity, respectively. The naming convention for the legend lists the differential abundance test used, followed by the normalization used, separated by an underscore. For example, ‘wilcox_tss’ refers to the Wilcoxon rank-sum test performed on data normalized using TSS. The t-test paired with rank normalization is indexed by the number 8 and the color gold, and dark gold dashed lines extend through its mean 1-FDR and sensitivity. Coordinates closer to the top right corner of each plot indicate better performance. The bright green dashed line extends upwards from the nominal 5% FDR. Data tables corresponding to these plots are included in Supplementary Materials as ‘MainSimulationResults.csv’.

Under the negative binomial distribution with FC of 3, the mean sensitivity of the t-test when paired with rank normalization ranged from 17.6% (s.e. 5.3%) at sample size of 50 to 23.3% (s.e. 3.9%) at sample size of 100 and 24.4% (s.e. 3.1%) at sample size of 150. Simultaneously, the mean FDR was held below the 5% level, with no false discoveries made at sample size of 50, a mean FDR of 1.1% (s.e. 3.6%) at sample size of 100 and a mean FDR of 3.5% (s.e. 7.5%) at sample size of 150. Under the negative binomial distribution with FC of 5, sensitivity ranged from 23.5% (s.e. 4.3%) at sample size of 50 to 27.3% (s.e. 4.9%) at sample size of 100 to 35.4% (s.e. 6.1%) at sample size of 150. Although mean FDR was held to 1.8% (s.e. 3.0%) at sample size of 50, at sample sizes of 100 and 150 it inflated to 9.6% (s.e. 14.8%) and 21.7% (s.e. 18.3%), respectively, surpassing the nominal 5% level and indicating a loss of control over the FDR.

Inflated FDRs under these conditions are likely due to compositional bias, an inherent characteristic of microbiome data [32] in which the relative abundance of nondifferentially abundant OTUs decreases as the relative abundance of differentially abundant OTUs increases. This decrease is proportional to FC and is more likely to be misidentified as differential abundance at larger sample sizes. This is evidenced by the fact that false discoveries did indeed increase with sample size and FC, and that all false discoveries were estimated to have decreased differential abundance. In other words, had we conducted a one-sided t-test for increasing differential abundance with a nominal FDR level of 2.5%, no false discoveries would have been made. Furthermore, all true discoveries were correctly identified as increasingly differentially abundant.

It is worth noting that all methodologies demonstrated increased FDRs under these conditions. Even the otherwise-conservative ALDEx2 t-test, explicitly designed to address compositional bias, surpassed 8% FDR on average at sample size of 150. Furthermore, FDRs for rank normalization with a t-test were only half those of the most specific WRS test, namely the WRS test paired with CSS.

Overall, a t-test when paired with rank normalization performed very well relative to competing methodologies for sample sizes greater than or equal to 50. Its performance proved to be especially strong under the beta-binomial distribution, but was prone to more false discoveries under the negative binomial distribution. For all conditions, rank normalization substantially increased the power of a t-test relative to CSS, TSS and TMM, which were the least powerful of all the methodologies considered. Rank normalization with a t-test tended to perform similarly to the top-performing WRS tests under conditions with a FC of 3 and sample sizes of 50; but under the strong compositional effects observed at sample sizes of 100 and 150 with FC of 5, a t-test with rank normalization outperformed all WRS tests, exhibiting overall greater sensitivity, lower FDR, and less variable performance under both distributions.

Rank normalization when paired with a t-test was consistently more powerful than the ALDEx2 t-test, which appeared to be highly conservative, especially under the beta-binomial distribution. However, ALDEx2 did exhibit fewer false discoveries overall, and at sample size of 100 with FC of 5 under the negative binomial distribution, it uniquely controlled the mean FDR to under the 5% level while rank normalization with a t-test did not.

Limma-Voom demonstrated by far the most variation in performance for all conditions. Although on average Limma-Voom performed similarly to rank normalization with a t-test under the negative binomial distribution with FC of 3, it demonstrated less conservative performance with FC of 5, and under the beta-binomial distribution, it exhibited FDRs far exceeding the nominal 5% rate for all conditions.

Although edgeR and metagenomeSeq exhibited sensitivities significantly greater than all other methodologies for all conditions, their FDRs ranged between 50% and 80%, far exceeding the nominal FDR and indicating that the majority of the OTUs they identified as differentially abundant were false discoveries. DESeq2 performed somewhat similarly to edgeR and metagenomeSeq under the negative binomial distribution, albeit more conservatively. Under the beta-binomial distribution, it performed similarly to Limma-Voom with FDRs greatly surpassing the 5% level for all conditions. The performance of the above methodologies is consistent with what was observed in the original simulations using the HMP Mid Vagina template under H1 without compensation.

Smaller sample sizes

Figures 6 and 7 and MainSimulationResults.csv show the performance of a t-test with rank normalization relative to competing differential abundance detection methodologies for the smaller sample sizes of 5, 15 and 25 with FCs of 3 and 5.

At these smaller sample sizes, rank normalization paired with a t-test was greatly under-powered but exhibited no false discoveries. Under the negative binomial distribution with FC of 3, its mean sensitivity was 0 at sample size of 5, 0.4% (s.e. 0.8%) at sample size of 15, and 1.6% (s.e. 2.3%) at sample size of 25. With FC of 5, its mean sensitivity was just above 0 at sample size of 5, 4.0% (s.e. 3.5%) at sample size of 15, and 14.9% (s.e. 5.5%) at sample size of 25.

Relative to t-tests and WRS tests paired with NFs, its performance was virtually indistinguishable under the beta-binomial distribution; mean sensitivity and FDR remained near or equal to 0 for all classical tests, regardless of normalization. But under the negative binomial distribution at sample sizes of 15 and 25 with FC of 3, and at sample size of 15 with FC of 5, WRS tests and t-tests paired with TMM and CSS proved to be slightly more sensitive. Only at sample size of 25 with FC of 5 did rank normalization with a t-test achieve greater sensitivity than other t-tests paired with NFs.

At sample size of 5 under the beta-binomial distribution, DESeq2 proved to be the most sensitive, exhibiting mean sensitivity less than 5% accompanied by mean FDRs exceeding 40%. At sample size of 5 under the negative binomial distribution, metagenomeSeq exhibited the greatest mean sensitivity, extending past 15% accompanied by mean FDRs surpassing 80%. All other methodologies at sample size of 5 demonstrated virtually no sensitivity.

At sample sizes of 15 and 25, edgeR and metagenomeSeq proved to be the most powerful but also the least specific, results consistent with larger sample sizes. DESeq2 was the next most powerful, but it too demonstrated FDRs greatly surpassing the nominal level under all conditions. Limma-Voom performed inconsistently; under the negative binomial distribution at sample sizes of 15 and 25, it was able to achieve slightly greater sensitivity than WRS tests and t-tests while still controlling the FDR, and offered similar performance under the beta-binomial distribution at sample size of 25 with FC of 5. However, under the beta-binomial distribution at other sample sizes and FCs, it exhibited FDRs greatly surpassing the nominal level while also failing to exhibit sensitivities above 5%. As at larger sample sizes, the variability in performance for Limma-Voom was greater than for any other methodology.

The overall lack of balance between sensitivity and specificity observed at smaller sample sizes is likely due to the high concentration of 0s, which proved to be a much greater obstacle at smaller sample sizes than at larger ones. For OTU tables generated under the negative binomial distribution, approximately 69% of the simulated counts were 0s, and under the beta-binomial distribution, 95%. At smaller sample sizes, differentially abundant OTUs often exhibited only 1 or 2 non-0 counts per treatment group, and sometimes none at all. Indeed, for the OTU tables generated under the negative binomial distribution at sample size of 5, approximately 5% of the OTUs designated as differentially abundant demonstrated no non-0 counts for either treatment group. Under the beta-binomial distribution, this encompassed nearly 22% of the differentially abundant OTUs at sample size of 5, 12% at sample size of 15 and 5% at sample size of 25. Realistically, the task of properly distinguishing differential abundance at smaller sample sizes is nearly impossible for OTUs exhibiting solely counts of 0, and exceptionally difficult for OTUs exhibiting few non-0 counts.
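The sparsity summaries in the preceding paragraph (overall share of 0s, non-0 counts per treatment group, and OTUs with no non-0 counts at all) can be computed directly from an OTU table. The sketch below is a hedged illustration in Python rather than the authors' R code; the function name `zero_summary`, the table layout and the toy counts are all hypothetical.

```python
def zero_summary(counts, groups):
    """Summarize sparsity of an OTU table.

    counts : dict mapping OTU id -> list of counts, one per sample
    groups : list of group labels, in the same sample order as the counts
    Returns (fraction of zero counts, non-zero counts per OTU and group,
             list of OTUs with no non-zero count in any sample).
    """
    total = sum(len(v) for v in counts.values())
    zeros = sum(c == 0 for v in counts.values() for c in v)
    labels = sorted(set(groups))
    nonzero = {
        otu: {
            g: sum(1 for c, lab in zip(v, groups) if lab == g and c > 0)
            for g in labels
        }
        for otu, v in counts.items()
    }
    all_zero = [otu for otu, by_group in nonzero.items()
                if sum(by_group.values()) == 0]
    return zeros / total, nonzero, all_zero


# Toy table (made-up counts): three samples per group, three OTUs.
groups = ["ctrl", "ctrl", "ctrl", "case", "case", "case"]
counts = {
    "otu1": [0, 0, 3, 0, 7, 0],
    "otu2": [0, 0, 0, 0, 0, 0],
    "otu3": [1, 0, 0, 0, 0, 2],
}
zero_fraction, nonzero_by_group, all_zero = zero_summary(counts, groups)
```

In this toy table `otu2` carries no information about group differences, and `otu1` and `otu3` each have a single non-0 count per group, which mirrors the situation the paragraph describes for small sample sizes.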

Overall, at sample sizes of 5, 15 and 25, the methodologies considered for this study were either overly conservative or exhibited highly inflated FDRs, results consistent with the original simulations [17]. Although a t-test when paired with rank normalization demonstrated clear performance advantages over existing methodologies at larger sample sizes, the task of developing a methodology for smaller sample sizes, one that can draw precise inference with little more than 1 or 2 non-0 counts per OTU, poses a problem that it was under-powered to solve. Of the existing methodologies that controlled the FDR, it appeared that WRS tests achieved the greatest sensitivity, but this sensitivity was modest.

Fold-change of 1.5

Supplementary Figures S1 and S2 and MainSimulationResults.csv show the performance of a t-test paired with rank normalization, compared with competing differential abundance detection methodologies, for sample sizes of 5, 15, 25, 50, 100 and 150, with differential abundance induced under a FC of 1.5.

At all sample sizes, performance of rank normalization was highly conservative. Under the beta-binomial distribution, virtually no discoveries were made at any sample size. Under the negative binomial distribution, sensitivity was greatest at sample size of 150, extending to only 0.9% (s.e. 2.5%), with no false discoveries made.

ALDEx2 and t-tests paired with NFs performed similarly to rank normalization, exhibiting nearly no sensitivity at any sample size under either distribution. Limma-Voom and WRS tests performed similarly as well, with the exception of sample sizes of 100 and 150 under the negative binomial distribution, in which modest sensitivities ranging from 5% to 10% were achieved. The performance of edgeR, metagenomeSeq and DESeq2 mostly reflected the patterns observed with greater effect sizes; they demonstrated the greatest sensitivities on average, ranging between 5% and 45%, as well as the greatest FDRs, ranging from 65% to just below 90%. Thus, methodologies were either overly conservative or exhibited highly inflated FDRs when testing for weak differential abundance, even at larger sample sizes.

Real data analysis

Description

The performance of rank normalization with a t-test was also evaluated on two real datasets. Performance was evaluated by comparison to the original findings of the publications and confirmed with the scientific literature. Agreement with original findings was determined based on the proportion of OTUs identified as differentially abundant that belonged to taxa explicitly recognized as differentially abundant in the original study. To confirm these results, findings were also compared to separate studies. Because only two-group comparisons were being conducted, only a subset of samples from each dataset was chosen, to exclude the effect of potentially confounding variables.

Microbiome data from a study referred to here as Zeller data include a metagenomic dataset for analyzing the fecal microbiota with respect to colorectal cancer (CRC) [29]. This OTU table was filtered to exclude subjects with small adenoma, so only those with normal colons and those with CRC were compared. This yielded 53 subjects with CRC and 61 controls. After filtering for OTUs present in at least one sample, this OTU table was composed of 758 OTUs.

Microbiome data from a study referred to here as RISK data provide 16S rRNA amplicon sequencing data on a large pediatric cohort for analyzing the association between gut microbiota and Crohn’s disease (CD) [30]. For analysis here, females with ileocolitis CD were compared to female controls without CD with respect to differential abundance in their terminal ileum microbiota. Samples were excluded for subjects with recorded antibiotic usage, healthy subjects that showed symptoms of other intestinal bowel diseases (inflammation, gastric involvement, ileal involvement), and subjects who had multiple samples sequenced and present in the dataset. This yielded 56 subjects with CD and 47 controls. After filtering for OTUs present in at least one sample, this OTU table was composed of 4247 OTUs.

Results

As shown in Table 3, on Zeller data, two unclassified Fusobacterium subspecies, two F. nucleatum subspecies, and P. asaccharolytica were found to be significantly increased in abundance in subjects with CRC. These findings reflect the main results from the original study [29], which found Fusobacterium, P. asaccharolytica and Peptostreptococcus stomatis (not identified here) to be the three most influential taxonomic biomarkers indicating presence of CRC for their predictive machine learning models. Fusobacterium has extensively documented associations with CRC in the literature, as does P. asaccharolytica to a lesser extent [36]; in fact, determining the biological mechanism by which F. nucleatum influences development of CRC is an active area of research in microbiome science [37].

Table 3.

Statistically significant taxa with increased differential abundance in Colorectal Cancer (CRC) patients as identified by a t-test performed on rank normalized data at 5% level. Analysis was conducted on the Zeller metagenomic dataset. Mean ranks for control and CRC groups, mean difference in ranks, and the upper and lower bound of a 95% confidence interval for this difference are given, as well as raw and Benjamini-Hochberg adjusted P-values. No other taxa were found to be significantly associated with CRC for this data.

Taxon	Rank control	Rank CRC	Difference	LB (2.5%)	UB (97.5%)	P-value	Adj. P-value
Unclassified Fusobacterium [1482]	330.67	494.01	163.34	109.51	217.16	5.00E-08	3.79E-05
unclassified Fusobacterium [1481]	357.11	515.63	158.53	97.31	219.74	1.56E-06	5.90E-04
Fusobacterium nucleatum [1479]	309.89	428.18	118.29	69.79	166.78	7.86E-06	1.99E-03
Fusobacterium nucleatum [1480]	326.70	439.94	113.25	58.96	167.54	8.58E-05	1.63E-02
Porphyromonas asaccharolytica [1056]	322.74	428.19	105.45	51.26	159.65	2.26E-04	3.42E-02
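The adjusted values in Table 3 can be cross-checked against the Benjamini-Hochberg step-up rule [35], in which the i-th smallest raw value is multiplied by m/i and a running minimum is then taken from the largest rank downward. The Python sketch below assumes that m = 758 (every OTU in the filtered Zeller table was tested) and that the five listed taxa hold the five smallest raw values; both are assumptions, although they are consistent with the caption. The function name is hypothetical, and the authors' own R code may differ.

```python
def bh_adjust_smallest(p_sorted, m):
    """Benjamini-Hochberg adjusted values for the k smallest of m raw p-values.

    p_sorted : the k smallest raw p-values, in ascending order
    m        : total number of tests
    The running minimum is taken over the supplied values only. That is exact
    whenever every omitted test has a larger scaled value, which holds here
    because no sixth taxon was significant at the 5% level.
    """
    scaled = [min(1.0, p * m / rank) for rank, p in enumerate(p_sorted, start=1)]
    adjusted = scaled[:]
    # Step-up rule: each value is the minimum of itself and all later ones.
    for i in range(len(adjusted) - 2, -1, -1):
        adjusted[i] = min(adjusted[i], adjusted[i + 1])
    return adjusted


raw_p = [5.00e-08, 1.56e-06, 7.86e-06, 8.58e-05, 2.26e-04]     # Table 3, raw
reported = [3.79e-05, 5.90e-04, 1.99e-03, 1.63e-02, 3.42e-02]  # Table 3, adjusted
n_tests = 758  # assumed: OTUs remaining in the filtered Zeller table

adjusted = bh_adjust_smallest(raw_p, n_tests)
max_rel_diff = max(abs(a - r) / r for a, r in zip(adjusted, reported))
```

Under these assumptions the recomputed values agree with the reported column to within the rounding of the raw values.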


As shown in Table 4, on RISK data, four distinct OTUs from the family Enterobacteriaceae were identified as increased in abundance in females with CD, alongside Haemophilus parainfluenzae and one species from the family Veillonellaceae. These OTUs all belong to taxa explicitly mentioned in the original findings [30], and are supported by other studies as well [38, 39]. Enterobacteriaceae was identified by the study’s authors as the family most increased in abundance in subjects with CD. The publication notes that whereas most taxa belonging to Pasteurellaceae and Clostridiales tended to be decreased in abundance in CD, the members Haemophilus parainfluenzae and Veillonellaceae specifically demonstrated the opposite trend.

Table 4.

Statistically significant taxa with increased differential abundance in Crohn’s Disease (CD) patients as identified by a t-test performed on rank normalized data at 5% level. Analysis was conducted on the RISK 16S rRNA amplicon dataset. Mean ranks for control and CD groups, mean difference in ranks, and the upper and lower bound of a 95% confidence interval for this difference are given, as well as raw and Benjamini-Hochberg adjusted P-values. 44 other taxa were found to be significantly decreasingly abundant in CD patients, and are listed in Supplementary Materials.

Taxon	Rank control	Rank CD	Difference	LB (2.5%)	UB (97.5%)	P-value	Adj. P-value
Enterobacteriaceae Unknown Sp.	2718.1	3466.49	748.4	348.26	1148.53	3.43E-04	3.69E-02
Haemophilus parainfluenzae	2487.47	3224.47	737.01	344.16	1129.85	3.26E-04	3.69E-02
Enterobacteriaceae Unknown Sp.	2019.68	2555.89	536.21	251.04	821.38	3.40E-04	3.69E-02
Enterobacteriaceae Unknown Sp.	1985.49	2502.5	517.01	251.37	782.65	2.16E-04	3.32E-02
Enterobacteriaceae Unknown Sp.	1892.16	2261.03	368.87	166.69	571.05	5.61E-04	4.78E-02
Veillonellaceae Megasphaera Unknown Sp.	1892.16	2249	356.84	161.62	552.06	5.48E-04	4.78E-02


Forty-four other taxa (included in Supplementary Materials as RISK_Rejects.csv) were identified by rank normalization with a t-test as significantly decreased in abundance in subjects with CD, including two OTUs from Erysipelotrichaceae, two OTUs belonging to Bacteroides, and 39 members of the order Clostridiales, 24 of which belonged to the family Lachnospiraceae. These results were in agreement with original findings as well [30]. The one OTU identified that was not in agreement with original findings was a taxon belonging to the genus Turicibacter; however, a separate analysis repeated on the same RISK data did find this OTU to be significantly decreased in abundance [39], and it has been recognized as decreased in abundance in separate studies as well [38].
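The agreement measure described under Real data analysis (the proportion of identified OTUs belonging to taxa recognized in the original study) can be worked through for RISK data from the counts reported above. This is a small illustrative calculation in Python; the function name is hypothetical, and the resulting proportion is derived here rather than quoted from the article.

```python
def agreement_proportion(n_identified, n_unsupported):
    """Share of identified OTUs whose taxa were recognized in the original study."""
    if n_identified == 0:
        raise ValueError("no OTUs were identified")
    return (n_identified - n_unsupported) / n_identified


n_increased = 6    # OTUs listed in Table 4
n_decreased = 44   # taxa listed in RISK_Rejects.csv
n_unsupported = 1  # the Turicibacter OTU, absent from the original findings

risk_agreement = agreement_proportion(n_increased + n_decreased, n_unsupported)
# 49 of the 50 identified OTUs agree with the original study.
```

Counting the Turicibacter OTU as supported, on the strength of the separate analyses [38, 39], would raise the proportion to 50 of 50.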

The competing methodologies considered for simulation were also applied here, and results are included in Supplementary Materials as ‘RoughRes_RISK.csv’ and ‘RoughRes_Zeller.csv’. Most notably, t-tests when paired with NFs made no discoveries on either dataset, and thus, did not find any OTUs to be significantly differentially abundant. WRS tests paired with NFs identified more OTUs as differentially abundant than rank normalization on both datasets, and whereas nearly all findings were in agreement with the literature for the RISK dataset, Inline graphic 20% of the taxa rejected on the Zeller dataset were not findings supported by the original publication. And the ALDEx2 t-test only identified five OTUs on the RISK dataset and three on Zeller, all of which were also recognized by rank normalization with a t-test.

Discussion

Summary of results

Overall, a t-test paired with rank normalization performed robustly on simulations and real data with sample sizes equal to or greater than 50, and with FCs greater than 1.5. Under these conditions, a t-test proved to be substantially more powerful when paired with rank normalization than with NFs, which was apparent under virtually all conditions and even more so on real data, in which t-tests paired with NFs were unable to find any taxa to be differentially abundant. WRS tests paired with NFs often performed similarly to rank normalization under simulations in which FC was 3 and sample size was 50. However, as sample size and FC increased, rank normalization achieved on average slightly greater sensitivity, lower FDR, and less variability in performance. Thus, rank normalization with a t-test proved to be highly competitive with other classical tests at larger sample sizes, and arguably outperformed them under many conditions.

It is important to note that rank normalization with a t-test was susceptible to strong compositional bias under the negative binomial distribution with FC of 5, and failed to keep the FDR under the nominal level at sample sizes of 100 and 150. However, this is not unique; WRS tests, Limma-Voom, DESeq2, edgeR and metagenomeSeq all demonstrated much greater FDRs under these conditions, suggesting that rank normalization with a t-test was more robust to compositional bias than similarly or more sensitive methodologies. This phenomenon of increasing FDRs with sample size and effect size in the presence of compositionality is consistent with the original simulation results [17] as well as other benchmarking studies [6, 8]. Although ALDEx2 appeared to be the most robust to compositional bias, it too surpassed the 5% FDR at sample size of 150 with FC of 5, and under nearly all other conditions, proved to be one of the most conservative methodologies examined. Thus, we found that no methodology was completely immune to strong compositional effects at large sample sizes.

Interestingly, control over the FDR was not an issue for a t-test paired with rank normalization under the negative binomial distribution with sample sizes less than 100 or FC of 3, nor under any conditions with the beta-binomial distribution; under these conditions, its mean FDR did not surpass the nominal 5% level. The same cannot be said for DESeq2, edgeR, metagenomeSeq, Limma-Voom or WRS tests. Therefore, in comparison to similarly or more powerful methodologies, rank normalization with a t-test effectively controlled the FDR under a greater variety of conditions. However, this comes at the cost of power; DESeq2, edgeR and metagenomeSeq demonstrated much greater sensitivity under nearly all simulation conditions, despite the fact that the majority of OTUs they identified as differentially abundant were false discoveries.

Simulation results presented here suggest that rank normalization paired with a t-test performs optimally at sample sizes between 50 and 150 with moderate to large FCs, and arguably outperforms other methodologies under these conditions. But repeating simulations at sample sizes of 5, 15 and 25 shows it to be highly conservative, and under some circumstances, slightly more so than t-tests paired with TMM and CSS. Thus, whereas sample sizes equal to or less than 25 are common in microbiome studies, our results show that a t-test paired with rank normalization may be greatly under-powered under these conditions. However, it must be noted that no methodologies performed particularly well under these conditions. Thus, developing a powerful methodology for analyzing differential abundance at smaller sample sizes that can also reliably control the FDR remains a topic for future research.

Similarly, simulations conducted with a FC of 1.5 suggest that rank normalization paired with a t-test, ALDEx2, Limma-Voom and all classical tests regardless of normalization are greatly under-powered for detecting weak differential abundance. Methodologies like DESeq2, edgeR, and metagenomeSeq were more sensitive under these conditions, but as previously observed at higher effect sizes, their increased sensitivity comes at a cost of poor specificity. Thus, developing a powerful methodology to detect weak differential abundance that can reliably control the FDR also remains a topic for future research.

In summary, our simulation results suggest a t-test when paired with rank normalization offers clear performance advantages over existing methodologies for detecting differential abundance at sample sizes equal to or greater than 50 with moderate to large FCs. Real data analyses appear to support these findings. However, simulations also demonstrated it to be very conservative at smaller sample sizes, as well as for detecting weak differential abundance.

Additional considerations

In addition to the empirical performance above, some of the most important advantages of rank normalization are pragmatic. No distributional assumptions need to be placed upon counts in a sample, the computational demands are negligible, and 0s are straightforwardly and meaningfully dealt with. In base R, rank normalization can be programmed in one line of code using the apply() function [24], and this simplicity helps ensure widely reproducible research. Rank normalization was shown to empower a t-test, which is one of the most (if not the most) widely used statistical tools, and is familiar to multidisciplinary researchers of diverse statistical backgrounds. This is crucial as researchers, not statisticians, are often the ones actually performing differential abundance analysis in practice.
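To make the procedure concrete, the sketch below replaces each count by its intrasample rank and then computes a two-sample t statistic per OTU. It is written in Python using only the standard library, whereas the authors worked in R, so it should be read as an illustration rather than their implementation. Two choices are assumptions: tied counts (including 0s) share their average rank, as R's `rank()` does by default, and the Welch (unequal-variance) form of the statistic is used, as R's `t.test()` does by default. All counts are made up.

```python
from math import sqrt
from statistics import mean, variance


def average_ranks(values):
    """Ranks 1..n within one sample; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 0-based positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_normalize(table):
    """table: list of samples, each a list of counts over the same OTUs."""
    return [average_ranks(sample) for sample in table]


def welch_t(x, y):
    """Welch two-sample t statistic for mean(x) - mean(y)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Toy data (made-up counts): rows are samples, columns are four OTUs.
control = [[0, 5, 2, 0], [0, 9, 1, 3], [1, 4, 0, 0]]
case = [[7, 3, 0, 0], [12, 6, 0, 1], [4, 2, 1, 0]]

control_ranks = rank_normalize(control)
case_ranks = rank_normalize(case)

# t statistic for the first OTU (column 0), case minus control.
t_otu0 = welch_t([s[0] for s in case_ranks], [s[0] for s in control_ranks])
```

Obtaining P-values additionally requires the t distribution; in Python, `scipy.stats.ttest_ind(x, y, equal_var=False)` is one option. Note that a group whose ranks are all identical has zero variance, and if both groups are constant the statistic is undefined, a case that real analyses would need to handle explicitly.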

With consideration to the limited scope of this study, there’s no immediately apparent reason why generalizing a t-test to the regression setting would be invalid. In reality, microbiome studies often provide detailed demographic and biological information for subjects in addition to their sequencing data, and multiple samples are sometimes collected from each subject over time [28, 34]. A logical next step for future research would be to explore the efficacy of more advanced statistical tools such as linear regression, ANOVA, or longitudinal analysis to be paired with rank normalization. Unlike the real data analyses conducted for this study, by using more sophisticated statistical procedures, confounding effects can be controlled for. However, this research must first be conducted before being applied in practice.

Limitations

Simulated datasets are inherently limited tools for evaluating performance, and especially so in the context of microbiome analysis; as the original simulation authors point out, data generated under parametric simulations are unrealistically ‘clean’ [17]. Nonetheless, through the introduction of outliers, imposed correlation structure, induced compositionality, and by modelling counts with both the negative binomial and beta-binomial distributions, the simulation scheme presented here serves as one of the most rigorous of those developed, and provides evidence in favor of the robustness of rank normalization.

Similarly, although results from analyses conducted on real data were encouraging, measuring performance is limited in the absence of gold-standard microbiome datasets, which to the best of our knowledge, do not exist. Thus, it is impossible to know for sure which microbes are truly differentially abundant in these datasets, and only comparisons to original findings and existing literature can be relied upon for evaluating results. Nonetheless, for both a metagenomic and a 16S rRNA amplicon dataset, rank normalization paired with a t-test produced results in nearly complete agreement with original findings, which, taken together with its performance on simulations, lends further legitimacy to its usefulness. Benchmarking studies are needed in order to confirm and replicate these findings.

The major practical disadvantages of using rank normalization for differential abundance analysis are that inferences can no longer be made about magnitude changes in abundance, and it cannot be coupled effectively with R package methodologies. Additionally, simulation results suggest that false discoveries may increase in the presence of strong compositional effects at very large sample sizes, and that performance is greatly under-powered for small sample sizes, or for detecting weak differential abundance. This leaves room for more robust statistical methodologies other than a t-test to be developed for rank normalized data, or for the rank normalization procedure to be modified for these sub-optimal conditions.

Conclusion

On both simulations and real data, rank normalization when paired with a t-test was shown to be a simple yet robust methodology for exploratory differential abundance analysis. It offered strong control over the FDR, and at sample sizes equal to or greater than 50, often outperformed some of the most widely used methodologies in current research. Future research could focus on pairing rank normalization with more powerful statistical tools to improve performance at smaller sample sizes, or extending a t-test to the regression setting. But at the moment, for researchers who are considering using a classical two-sample test for differential abundance analysis, pairing a t-test with rank normalization may be considered as a powerful and robust alternative to the estimation of normalization factors.

Key Points

We explored a methodology for microbiome differential abundance analysis, commonly used for microarray differential expression analysis, in which a t-test is conducted across the ranks of microbiome OTU tables.
On a rigorous 3rd-party benchmarking simulation, a t-test when paired with rank normalization outperformed commonly used differential abundance detection methodologies at larger sample sizes and findings from real datasets were in agreement with existing research.
Nonetheless, simulation results suggest that a t-test when paired with rank normalization was susceptible to more false discoveries under strong compositional effects, and was under-powered at small sample and effect sizes, indicating that future work could be done to improve the robustness of this methodology.

Supplementary Material

supplementary_files_bbab059


Acknowledgments

It was important to evaluate the performance of our proposed methodology on a 3rd-party simulation in order to ensure reproducible and unbiased results. The original simulation authors [17] freely provided all R code needed to re-create their simulations, and additionally upon request, provided us previously-computed SpiecEasi correlation networks.

Matthew Davis. Matthew is a doctoral student and research assistant at the University of Iowa. His research interests include microbiome science, Bayesian inference, and machine-learning approaches to biomedical data.

Dr. Yuan Huang. Yuan has been an assistant professor of Biostatistics at the Yale School of Public Health since 2019, after previously serving as an assistant professor at the University of Iowa. Her research areas of interest include high-dimensional data analysis and integrative analysis, with application to cancer genomics and research on neurodegenerative diseases.

Dr. Kai Wang. Kai has been a professor of Biostatistics at the University of Iowa College of Public Health since 2013. His research areas of interest include developing methodologies for mediation analysis as well as high-throughput sequencing data and microbiome analysis.

Contributor Information

Matthew L Davis, Department of Biostatistics, University of Iowa College of Public Health, 145 N Riverside Dr, 52242, IA, USA.

Yuan Huang, Department of Biostatistics, Yale School of Public Health, 60 College St, 06510, CT, USA.

Kai Wang, Department of Biostatistics, University of Iowa College of Public Health, 145 N Riverside Dr, 52242, IA, USA.

Author contributions statement

K.W. and Y.H. conceived of the methodology. M.D. conducted the study and analyzed results. M.D., K.W. and Y.H. wrote and reviewed the manuscript.

Funding

The authors received no specific funding for this work.

References

1. Wang J-W, Kuo C-H, Kuo F-C, et al. Fecal microbiota transplantation: review and update. J Formos Med Assoc 2019;118:S23–31.
2. Clancy R. Immunobiotics and the probiotic evolution. FEMS Immunol Med Microbiol 2003;38:9–12.
3. Malla MA, Dubey A, Kumar A, et al. Exploring the human microbiome: the potential future role of next-generation sequencing in disease diagnosis and treatment. Front Immunol 2019;9:2868.
4. Allaband C, McDonald D, Vázquez-Baeza Y, et al. Microbiome 101: studying, analyzing, and interpreting gut microbiome data for clinicians. Clin Gastroenterol Hepatol 2019;17:218–30.
5. Calle ML. Statistical analysis of metagenomics data. Genomics Inform 2019;17:e6.
6. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 2014;10:e1003531.
7. Kurtz ZD, Müller CL, Miraldi ER, et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015;11:e1004226.
8. Weiss S, Xu ZZ, Peddada S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 2017;5:27.
9. McMurdie PJ, Holmes S. Phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 2013;8:e61217.
10. Paulson J, Stine O, Bravo H, et al. Differential abundance analysis for microbial marker-gene surveys. Nat Methods 2013;10:1200–2. doi:10.1038/nmeth.2658.
11. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11:R25.
12. McKnight DT, Huerlimann R, Bower DS, et al. Methods for normalizing microbiome data: an ecological perspective. Meth Ecol Evolut 2019;10:389–400.
13. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26:139–40.
14. Law CW, Chen Y, Shi W, et al. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 2014;15:R29.
15. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550.
16. Thorsen J, Brejnrod A, Mortensen M, et al. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome 2016;4:62.
17. Hawinkel S, Mattiello F, Bijnens L, et al. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinf 2019;20:210–21.
18. Fernandes AD, Reid JN, Macklaim JM, et al. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2014;2:15.
19. Xia Y, Sun J. Hypothesis testing and statistical analysis of microbiome. Genes Dis 2017;4:138–48.
20. Bacon-Shone J. Ranking methods for compositional data. J R Stat Soc Ser C Appl Stat 1992;41:533–7.
21. Conover WJ, Iman RL. Analysis of covariance using the rank transformation. Biometrics 1982;38:715–24.
22. Breitling R, Herzyk P. Rank-based methods as a non-parametric alternative of the t-statistic for the analysis of biological microarray data. J Bioinform Comput Biol 2005;3:1171–89.
23. Qiu X, Wu H, Hu R. The impact of quantile and rank normalization procedures on the testing power of gene differential expression analysis. BMC Bioinf 2013;14:124.
24. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, 2020.
25. Morgan M. BiocManager: Access the Bioconductor Project Package Repository. R package version 1.30.4, 2018. https://CRAN.R-project.org/package=BiocManager.
26. Microsoft Corporation, Weston S. doSNOW: Foreach Parallel Adaptor for the ‘snow’ Package. R package version 1.0.18, 2019. https://CRAN.R-project.org/package=doSNOW.
27. Ritchie ME, Phipson B, Wu D, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43:e47.
28. Battaglia T. MicrobeDS: Microbiome Datasets. R package version 0.1.0, 2020.
29. Zeller G, Tap J, Voigt AY, et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 2014;10:766.
30. Gevers D, Kugathasan S, Denson LA, et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 2014;15:382–92.
31. Huttenhower C, Gevers D, Knight R, et al. Structure, function and diversity of the healthy human microbiome. Nature 2012;486:207.
32. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, et al. Microbiome datasets are compositional: and this is not optional. Front Microbiol 2017;8:2224.
33. Metwally A, Aldirawi H, Yang J. A review on probabilistic models used in microbiome studies. Commun Inform Syst 2018;18:173–91.
34. Pasolli E, Schiffer L, Manghi P, et al. Accessible, curated metagenomic data through ExperimentHub. Nat Methods 2017;14:1023–4.
35. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol 1995;57:289–300.
36. Chen Y, Yang Y, Gu J. Clinical implications of the associations between intestinal microbiome and colorectal cancer progression. Cancer Manag Res 2020;12:4117–28.
37. Shang FM, Liu HL. Fusobacterium nucleatum and colorectal cancer: a review. World J Gastrointest Oncol 2018;10:71–81.
38. El Mouzan MI, Winter HS, Assiri AA, et al. Microbiota profile in new-onset pediatric Crohn’s disease: data from a non-western population. Gut Pathogens 2018;10:49.
39. Wang F, Kaplan JL, Gold BD, et al. Detecting microbial dysbiosis associated with pediatric Crohn disease despite the high variability of the gut microbiota. Cell Rep 2016;14:945–55.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplementary_files_bbab059

Click here for additional data file.^{(3.4MB, zip)}
