Shrinkage improves estimation of microbial associations under different normalization methods

Michelle Badri; Zachary D Kurtz; Richard Bonneau; Christian L Müller

doi:10.1093/nargab/lqaa100

2020 Dec 17;2(4):lqaa100.


PMCID: PMC7745771 PMID: 33575644

Abstract

Estimation of statistical associations in microbial genomic survey count data is fundamental to microbiome research. Experimental limitations, including count compositionality, low sample sizes and technical variability, obstruct standard application of association measures and require data normalization prior to statistical estimation. Here, we investigate the interplay between data normalization, microbial association estimation and available sample size by leveraging the large-scale American Gut Project (AGP) survey data. We analyze the statistical properties of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard statistical regularization technique, can universally improve the quality of taxon–taxon association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as downstream data analysis examples, we show that variance-stabilizing and log-ratio approaches enable the most taxonomically and structurally coherent estimates. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies in multiple stages of analysis, particularly when only small sample sizes are available.

INTRODUCTION

Recent advances in microbial amplicon and metagenomic sequencing as well as large-scale data collection efforts provide samples across different microbial habitats that are amenable to quantitative analysis. Following the organization of sequence data into operational taxonomic units (OTUs) or amplicon sequence variants (ASVs), via pipelines such as qiime (1), mothur (2) or dada2 (3), the resulting count data are then available in tabular format for statistical analysis. Downstream analysis tasks include assessing community diversity (4), differential abundance analysis, associating bacterial compositions to system-specific ecological and biomedical covariates, and learning microbe–microbe associations.

However, technical artifacts inherent in microbial abundance data preclude the application of such analysis tasks directly on the measured counts. The data typically comprise a high proportion of zeros and carry only relative information about species abundance. The total number of read counts for any given observation is limited by the total amount of sequencing, quality of DNA preparations and other technical factors and does not represent the community abundance or total species abundance in the sample or ecosystem. For example, unequal amplicon library sizes can bias sequencing reads to OTUs from the larger sample, regardless of true abundance profiles. Although some recent studies have used controlled communities, spike-in controls and other innovations to obtain total community size (5–8), in the majority of experimental designs, the community size is unknown, and, thus, our data are best thought of as containing relative or compositional information (each OTU fraction of total counts, total community size unknown) (9,10). Additionally, technical variation due to sequencing such as differences in amplification biases and batch effects due to multiple sequencing runs can hamper proper quantification of microbial compositions (11).

To ameliorate these biases, general data normalization methods have been proposed to correct for sampling bias, library size and technical variability, including workflows from RNA-seq pre-processing and compositional data analysis (12–15). Dedicated normalization and modeling strategies are also available for specific analysis tasks, most prominently, for differential abundance testing (16–19).

Here, we examine data normalization schemes in connection with a fundamental multivariate statistical estimation task: inferring pairwise linear associations from microbial count data. Two common strategies that have been adopted for microbial relative abundance data are correlation after data normalization (20,21) and proportionality (22,23) as an association measure for compositional data. Under the assumption that ecological association can be captured by empirical Pearson’s correlation or proportionality, accurate correlation or proportionality estimation is of paramount importance for a host of downstream analysis tasks, including state-of-the-art diversity estimation that takes the connectivity of the community into account (4), direct microbial association network inference (14), discriminant analysis and microbial community clustering.

While previous work (20,24) has assessed the precision of correlation detection strategies on synthetic microbial sequencing count data, we took a different approach and investigated the behavior of linear association estimation on the largest-to-date citizen-science sample collection, the American Gut Project (AGP) (25). The large available sample size n > 9000 allows us, for the first time, to critically measure the empirical consistency of combinations of data normalization and association estimation techniques. More specifically, given the lack of ‘gold standard’ microbial associations in gut microbial communities, we asked the question whether and how association patterns inferred from small but realistic sample sizes of tens to a few hundreds of samples resemble those inferred using the entire AGP dataset. This type of ‘sample size consistency’ evaluation was at the heart of the present study.

Using a comprehensive set of evaluation criteria and summary statistics, we first show that, independent of any specific data normalization scheme, standard linear association measures are unreliable in the small sample regime. We propose the application of shrinkage estimation (26) as an effective strategy for sample size consistent association estimation. We show that a popular correlation shrinkage approach from functional genomics (27) enjoys excellent performance in the microbiome context. For proportionality, we introduce a novel shrinkage estimator, ρ_* (rhoshrink), and assess its statistical behavior. In particular, we quantify the effects of sample size on data normalization and association estimates on several downstream analysis tasks, including microbial association (or relevance) network inference and clustering. Figure 1 shows the proposed analysis framework used in this study.

Figure 1. — Framework for examining the effects of normalization methods on linear association estimation with increasing sample size. Comparative summary statistics of the resulting association matrices include distribution-based analysis, distance-based matrix comparison, hierarchical clustering and association network analysis.

Our analysis revealed that all normalization-dependent association estimates in the AGP data can be broadly grouped into five categories and that variance-stabilizing and log-ratio approaches provide the most consistent estimation in terms of taxonomic and community structure coherence. Our findings, available in a fully reproducible statistical analysis R workflow at Synapse ID: syn21654780, have important implications for microbiome studies in multiple stages of analysis, most prominently in the presence of small sample sizes. In particular, we believe that our developed shrinkage estimation framework will improve the reproducibility of future microbiome data analysis studies at almost no additional computational cost.

MATERIALS AND METHODS

To examine the interplay of data normalization and association estimation methods, we first describe the four essential ingredients of our analysis: the processed AGP 16S rRNA dataset, the comprehensive list of data normalization methods, statistical estimation of linear associations and downstream statistical evaluation and analysis tools.

AGP sample collection

The AGP is a large public repository of human microbiome samples aiming to survey the diversity of microbiota.

Information including diet, disease status and lifestyle variables was measured for public health analysis, but the sheer size of the dataset makes it also a powerful profile of the microbial communities in the human gut. For each batch of samples, the V4 region of the 16S rRNA gene was amplified and sequenced using Illumina next-generation sequencing. The untrimmed data were then processed using sequence variant calling. We obtained OTU count tables and mapping files for unrarefied AGP samples (25) from the project website ftp://ftp.microbio.me/AmericanGut/ag-2017-12-04/. The original OTU table contained P = 35 511 OTU observations for n = 15 148 samples. We filtered the dataset to contain only fecal samples whose sequencing depths fall above the 10th percentile and removed OTUs that were present in <30% of all samples. This resulted in a data matrix comprised of P = 531 OTUs and n = 9631 samples.

To investigate the sample size dependence of data normalization and association estimation on this dataset, we generated collections of random subsamples of varying sample sizes, ranging from n = 25 to n = 9000, with five random replicates per sample size. While sample sizes of n ≤ 25 are not uncommon in clinical studies of the microbiome, n = 25 was the lower bound for study inclusion in a recent meta-analysis (28). The large sample limit of n = 9000, which also serves as our large sample reference for association estimates, was chosen to ensure that the random subsamples still comprise relatively distinct sets of samples. To simulate reference data under null correlation or proportionality, we randomly shuffled OTU count data across samples prior to normalization.
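The subsampling and shuffling design can be sketched as follows. This is a minimal illustration rather than the workflow's actual code: the function names are hypothetical, subsamples are assumed to be drawn without replacement, and "shuffling across samples" is interpreted here as permuting each OTU's counts independently over the samples.

```python
import random

def draw_subsamples(n_total, sizes, replicates=5, seed=0):
    """Return {size: [index lists]} of random subsamples drawn without replacement."""
    rng = random.Random(seed)
    return {
        n: [sorted(rng.sample(range(n_total), n)) for _ in range(replicates)]
        for n in sizes
    }

def shuffle_within_otus(counts, seed=0):
    """Permute each OTU column independently across samples (assumed null model)."""
    rng = random.Random(seed)
    n, p = len(counts), len(counts[0])
    columns = []
    for i in range(p):
        col = [counts[j][i] for j in range(n)]
        rng.shuffle(col)  # breaks the pairing between OTUs within a sample
        columns.append(col)
    # Re-assemble into a samples-by-OTUs table.
    return [[columns[i][j] for i in range(p)] for j in range(n)]

# Five replicates for each of three illustrative sample sizes.
subsamples = draw_subsamples(n_total=9631, sizes=[25, 50, 100], replicates=5)

# Toy samples-by-OTUs count table (made-up numbers).
toy = [[5, 0, 2], [1, 3, 0], [0, 7, 4], [2, 2, 9]]
null_toy = shuffle_within_otus(toy, seed=1)
```

Shuffling within OTUs preserves each OTU's marginal count distribution while removing any association between OTUs.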

Normalization methods

All normalization methods require as input OTU counts, collected over n samples and stored in a matrix $W \in \mathbb{N}_0^{n \times p}$. Each row is a p-dimensional vector $w^{(j)} = [w_1^{(j)}, w_2^{(j)}, \ldots, w_p^{(j)}]$, where j = 1, …, n is the sample index, $w_i^{(j)}$ is the read count of OTU i in sample j and $\mathbb{N}_0$ is the set of natural numbers {0, 1, 2, …}. Let the total OTU count for sample j be $m^{(j)} = \sum_{i=1}^{p} w_i^{(j)}$. Several methods require the application of a log transformation and thus strictly positive input data. We add a pseudocount of 1 to all OTU input data if zero counts are not explicitly handled by the respective normalization scheme. We consider the following data normalization or transformation schemes.

Total sum scaling

A standard approach for normalizing count data is to divide individual counts by the total OTU counts in a sample, thus scaling the count vector such that the total sum is fixed to 1. This normalization is known as total sum scaling (tss) or total sum normalization. It reads

$$\mathrm{tss}(w^{(j)}) = \left[\frac{w_1^{(j)}}{m^{(j)}}, \frac{w_2^{(j)}}{m^{(j)}}, \ldots, \frac{w_p^{(j)}}{m^{(j)}}\right].$$

The resulting sample space of the data is thus the (p − 1)-dimensional simplex.
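As a minimal sketch of tss on a toy count table (the function name and the numbers are illustrative only):

```python
def tss(sample):
    """Total sum scaling: divide each count by the sample's total read count."""
    total = sum(sample)
    if total == 0:
        raise ValueError("sample has no reads")
    return [w / total for w in sample]

# Toy samples-by-OTUs table with two samples and three OTUs.
counts = [[10, 0, 30], [2, 2, 4]]
normalized = [tss(row) for row in counts]
```

Each normalized row sums to 1, so every sample is mapped to a point on the simplex regardless of its sequencing depth.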

Cumulative sum scaling

The tss approach may place unwanted influence on OTUs that are highly sampled due to sequencing biases by over-representing them in the scaling factor m^(j) (11). To reduce the influence of these highly abundant OTUs for sparse data, cumulative sum scaling (css) has been proposed in (15) and implemented in the metagenomeSeq R package. Rather than normalizing by the total sum, css selects a scaling factor based on a fixed quantile of OTU counts. Formally,

$$\mathrm{css}(w^{(j)}) = \left[\frac{w_1^{(j)}}{s_l^{(j)}} N, \frac{w_2^{(j)}}{s_l^{(j)}} N, \ldots, \frac{w_p^{(j)}}{s_l^{(j)}} N\right],$$

where the scaling factor for sample j is $s_l^{(j)} = \sum_{i:\, w_i^{(j)} \le q_l^{(j)}} w_i^{(j)}$. The quantity $s_l^{(j)}$ is the sum of read counts up to and including the lth quantile $q_l^{(j)}$ of sample j. N is a prespecified constant, e.g. N ≜ 1000, chosen such that the resulting data vector resembles the units of the original counts. The sample space of css-transformed data is that of non-negative real numbers $\mathbb{R}_{\ge 0}^{p}$.

Let $q_l^{(j)}$ be the lth quantile of the counts in sample j, $\bar{q}_l = \mathrm{med}_j\{q_l^{(j)}\}$ the median lth quantile across all samples and $d_l = \mathrm{med}_j\,|q_l^{(j)} - \bar{q}_l|$ the median absolute deviation of the sample quantiles around $\bar{q}_l$. css requires the median absolute deviation of sample quantiles to be empirically stable, which is assessed via the quantity $d_{l+1} - d_l$. A common choice is to select the smallest l for which $d_{l+1} - d_l \ge 0.1\, d_l$ (15). The scaling factor is then defined by summing all the counts up to the smallest value of l that is stable, on average, across all samples and that is greater than or equal to the median. By default, l is set to the median.

Common sum scaling

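Common sum scaling (com), as introduced in (11), is an alternative to rarefying OTU counts. Counts are scaled to the minimum sequencing depth across all samples via

$$\mathrm{com}(w^{(j)}) = \left[\left\lfloor \frac{m_{\min}}{m^{(j)}}\, w_1^{(j)} \right\rfloor, \ldots, \left\lfloor \frac{m_{\min}}{m^{(j)}}\, w_p^{(j)} \right\rfloor\right],$$

where $m_{\min} = \min_j m^{(j)}$. The operator ⌊ · ⌋ (floor) converts a real number to the greatest integer that is less than or equal to the input.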
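A small sketch of common sum scaling, assuming the scaling is applied as the floor of each count multiplied by the ratio of the minimum depth to the sample's own depth (function name and data are illustrative):

```python
def common_sum_scaling(counts):
    """Scale every sample down to the smallest sequencing depth, then floor."""
    depths = [sum(row) for row in counts]
    m_min = min(depths)
    # Integer floor division keeps the arithmetic exact for count data.
    return [[(w * m_min) // m for w in row] for row, m in zip(counts, depths)]

counts = [[10, 0, 30], [2, 2, 4]]   # depths 40 and 8
scaled = common_sum_scaling(counts)
```

Unlike rarefying, this rescaling is deterministic; because of the floor operation, the scaled depth of a sample can fall slightly below the minimum depth.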

Relative log expression

The relative log expression (rle) was introduced for gene expression data and is available in the DESeq and edgeR packages (13). The rle method is defined as follows. Let $g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m}$ be the geometric mean of an m-dimensional vector x, and let $w_i = [w_i^{(1)}, \ldots, w_i^{(n)}]$ be the vector of counts of OTU i over n samples (a transposed column vector of count matrix W). The numeric scaling factor for sample j is

$$s^{(j)} = \mathrm{med}\left[\frac{w_1^{(j)}}{g(w_1)}, \ldots, \frac{w_p^{(j)}}{g(w_p)}\right],$$

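where med[x] denotes the median of vector x and s^(·) = [s⁽¹⁾, …, s⁽ⁿ⁾] is the collection of the sample scaling factors. Let the global scaling factor $\bar{s} = \frac{1}{n}\sum_{j=1}^{n} s^{(j)}$ be the arithmetic mean of all sample scaling factors. The rle is then defined as

$$\mathrm{rle}(w^{(j)}) = \left[\frac{\bar{s}}{s^{(j)}}\, w_1^{(j)}, \ldots, \frac{\bar{s}}{s^{(j)}}\, w_p^{(j)}\right].$$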

In summary, the rle estimates a median library from the geometric mean over all samples. The median ratio of each sample to the median library is then taken as the scale factor.
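The median-of-ratios idea summarized above can be sketched as follows. The sketch covers only the per-sample scale factors, not the packages' full implementations, and it assumes strictly positive counts (e.g. after adding the pseudocount); the function name is hypothetical.

```python
import math
import statistics

def rle_scale_factors(counts):
    """Median ratio of each sample to the per-OTU geometric-mean 'median library'."""
    n, p = len(counts), len(counts[0])
    # Geometric mean of each OTU over all samples (requires positive counts).
    reference = [
        math.exp(sum(math.log(counts[j][i]) for j in range(n)) / n)
        for i in range(p)
    ]
    return [
        statistics.median(counts[j][i] / reference[i] for i in range(p))
        for j in range(n)
    ]

# The second toy sample is the first one sequenced twice as deeply.
counts = [[1, 2, 4], [2, 4, 8]]
factors = rle_scale_factors(counts)
```

For this toy table, the second scale factor is twice the first, reflecting the twofold difference in depth.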

Inverse hyperbolic sine

A standard variance-reducing transformation, often applied to flow and mass cytometry data (29,30), is the inverse hyperbolic sine function, defined as

$$\mathrm{asinh}(x) = \ln\left(x + \sqrt{x^2 + 1}\right),$$

applied element-wise over the sample vector. The resulting data matrix is then mean centered prior to association estimation.

Wrench

The Wrench procedure, introduced in (17), estimates compositional correction factors in the presence of zero inflation. Wrench is defined as

The quantity Inline graphic represents a compositional scale factor where . Here, y_ji is the proportion of each feature i in sample j and is the average proportion of each feature i across all samples. The weight is estimated using the ‘W₂’ scheme, the default choice in the Wrench R package [see also (17) for further details]. While Wrench is capable of incorporating information about sample grouping, e.g. for differential abundance testing, we consider all samples to be in a single group.

Variance-stabilizing transform

The goal of variance-stabilizing transformations (vst) is to factor out the dependence of the variance on the mean (overdispersion) (13). Consider the mean-dispersion relation Inline graphic . Here, the ‘size factors’ are and is the average count to size factor ratio of sample j. The vst is then the integral quantity defined as

The function v(μ) is approximated by a spline function and evaluated for each count value in the column. The vst normalization is available in the DESeq package where the numerical fitting is achieved using local regression on the graph Inline graphic . A smooth function v(μ) is estimated using an estimate of raw variance: = , where . The local regression is parameterized such that large counts are scaled to be asymptotically equal to the logarithm base 2 of normalized counts. When we examined the per-OTU standard deviation [taken across all (P = 531) OTUs] plotted against the rank of the average OTU count, it can be seen that vst produces similar counts to both clr and a logarithm base 2 transform (Supplementary Figure S1).

Centered log-ratio transformation

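Log-ratio transformations, introduced in (9), transform positive compositional data from the simplex to Euclidean space (9,14). The centered log-ratio (clr) transform is defined as

$$\mathrm{clr}(w^{(j)}) = \left[\log\frac{w_1^{(j)}}{g(w^{(j)})}, \ldots, \log\frac{w_p^{(j)}}{g(w^{(j)})}\right],$$

where the ratio is taken with respect to the geometric mean of the composition, $g(w^{(j)}) = \left(\prod_{i=1}^{p} w_i^{(j)}\right)^{1/p}$. The resulting data lie in a (p − 1)-dimensional hyperplane of p-dimensional Euclidean space.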
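A minimal sketch of the clr transform of one sample, assuming a pseudocount of 1 is added to handle zeros (function name and data are illustrative):

```python
import math

def clr(sample, pseudocount=1):
    """Centered log-ratio: log counts minus the mean log count of the sample."""
    logs = [math.log(w + pseudocount) for w in sample]
    # Subtracting the mean log equals dividing by the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

transformed = clr([0, 1, 3])   # pseudocounted composition is [1, 2, 4]
```

The transformed values sum to zero, which is the hyperplane constraint mentioned above.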

Estimation of linear associations

Following a transformation of count data under some function $f: \mathbb{N}_0^{n \times p} \to \mathbb{R}^{n \times p}$, we consider several estimation methods for linear associations among the p OTUs.

Covariance and correlation estimation

The standard way of estimating linear associations is via the empirical (sample) covariances of the transformed data, which form the basis for many downstream multivariate data analysis techniques, including principal component analysis (PCA), discriminant analysis, metric learning and network inference.

Formally, column centering the transformed data results in an n × p data matrix Inline graphic , where I_p is the p-dimensional identity matrix and is unit (all-ones) matrix. In matrix notation, the sample covariance matrix (cov) is , where ○ indicates element-wise multiplication of two equal size matrices.

The estimate $S$ is a symmetric p × p matrix with the sample variances along the diagonal and can be normalized to obtain a matrix containing Pearson correlation coefficients. Let $D$ be a diagonal matrix with the p post-transformation OTU variances on the diagonal and zero elsewhere. The Pearson correlation matrix is then $R = D^{-1/2} S D^{-1/2}$. The matrix $R$ is a symmetric p × p matrix where each entry $r_{ik}$ corresponds to the Pearson correlation between OTUs i and k under the data transformation.
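The two steps, sample covariance followed by rescaling to Pearson correlations, can be sketched as below. The unbiased n − 1 denominator is an assumption, and the function names and toy data are illustrative.

```python
import math

def sample_covariance(X):
    """Sample covariance of an n-by-p table (rows are samples, columns are OTUs)."""
    n, p = len(X), len(X[0])
    means = [sum(row[i] for row in X) / n for i in range(p)]
    centered = [[row[i] - means[i] for i in range(p)] for row in X]
    return [
        [sum(c[i] * c[k] for c in centered) / (n - 1) for k in range(p)]
        for i in range(p)
    ]

def correlation_from_covariance(S):
    """Rescale covariances by the standard deviations of both variables."""
    sd = [math.sqrt(S[i][i]) for i in range(len(S))]
    return [[S[i][k] / (sd[i] * sd[k]) for k in range(len(S))] for i in range(len(S))]

# Toy transformed data: the second column is exactly twice the first.
X = [[1.0, 2.0, 0.0], [2.0, 4.0, 1.0], [3.0, 6.0, 0.0], [4.0, 8.0, 1.0]]
S = sample_covariance(X)
R = correlation_from_covariance(S)
```

The first two columns are perfectly linearly related, so their correlation entry equals 1 up to rounding.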

The magnitude and sign of the values in $R$ are often interpreted as the association strength and direction, respectively. The sample correlation/covariance matrices are, however, inadmissible in the p ≫ n setting, i.e. when fewer samples than OTUs are available. For example, type I errors may be grossly inflated, since the parameters under estimation are underdetermined. Standard operations that rely on these matrices, such as solving systems of linear equations or PCA, are then ill-posed.

Proportionality estimation

Covariance and correlation estimation on compositional data has long been criticized due to the necessary presence of negative bias, scale dependence and subcompositional incoherence in the estimates (9,31). Association measures based on the concept of proportionality have thus been put forward as an alternative to correlation (22). Here, we consider the symmetric proportionality ρ_p (23,32) that, by default, operates on clr-transformed data X^clr ≜ clr(W). The measure is defined as

$$\rho_p(x_i, x_k) = 1 - \frac{\mathrm{var}(x_i - x_k)}{\mathrm{var}(x_i) + \mathrm{var}(x_k)}, \tag{1}$$

where $x_i$ and $x_k$ are the columns of the matrix $X^{\mathrm{clr}}$ corresponding to OTUs i and k, respectively. The quantity ρ_p is a proportionality measure because differences of clr-transformed components are equivalent to log ratios of compositions. When p ≫ n, the sample estimator for ρ_p faces similar challenges as sample correlation/covariance estimation.
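A short sketch of the proportionality measure in its variance-of-differences form. The inputs stand in for two columns of clr-transformed data; the numbers are made up and the function names are illustrative.

```python
def variance(x):
    """Sample variance with the n - 1 denominator."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / (len(x) - 1)

def rho_p(x_i, x_k):
    """Proportionality: 1 - var(x_i - x_k) / (var(x_i) + var(x_k))."""
    diff = [a - b for a, b in zip(x_i, x_k)]
    return 1.0 - variance(diff) / (variance(x_i) + variance(x_k))

# Stand-ins for two clr-transformed OTU columns over four samples.
x = [0.5, -1.0, 0.25, 0.25]
mirrored = [-v for v in x]

same = rho_p(x, x)             # identical columns
opposite = rho_p(x, mirrored)  # perfectly reciprocal columns
```

Identical columns give ρ_p = 1 and mirrored columns give ρ_p = −1, so the measure spans the same range as a correlation coefficient.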

Shrinkage estimation of linear associations

One way to improve sample estimation of linear associations in the high-dimensional (p > n) setting is via regularization. For instance, covariance/correlation estimators with stronger statistical properties can be derived in the p > n setting when imposing structural assumptions about the underlying population covariance. One ubiquitous structural assumption is sparsity where only a few strong pairwise correlations are assumed to be present in the data. An effective data-driven approach to realizing structural sparsity is shrinkage estimation. We next revisit a popular covariance/correlation shrinkage approach and introduce a novel shrinkage estimator for proportionality.

Shrinkage covariance estimation

While several regularized covariance estimators are available in the literature (33–35), we focus here on Schäfer–Strimmer shrinkage estimation (27). The principal idea of shrinkage estimation is to shrink small sample correlations toward entries of a prescribed target matrix. The standard target matrix is the p × p-dimensional identity matrix. Shrinkage intensities are simultaneously estimated from data (33). In Schäfer–Strimmer shrinkage, as implemented in the R package corpcor, individual entries $s^*_{ik}$ of the shrinkage covariance S* and entries $r^*_{ik}$ in the shrinkage correlation R* are estimated as follows. For all off-diagonal elements in S*, we compute

$$s^*_{ik} = r^*_{ik} \sqrt{s^*_{ii}\, s^*_{kk}},$$

where the shrunk correlation estimates are $r^*_{ik} = (1 - \hat{\lambda}_1)\, r_{ik}$, with $r_{ik}$ the sample correlations. The variance (var) estimates are shrunk in a separate procedure toward the median via $s^*_{ii} = \hat{\lambda}_2\, s_{\mathrm{median}} + (1 - \hat{\lambda}_2)\, s_{ii}$, where $s_{ii}$ are the sample variances and $s_{\mathrm{median}}$ is their median.

The shrinkage intensities $\hat{\lambda}_1$ and $\hat{\lambda}_2$ are determined empirically by estimating the variance within the sample covariance matrix (see Supplementary Methods).

Shrinkage proportionality estimation

To derive a shrinkage estimator for proportionality, we first consider an equivalent formulation of (1) in terms of covariances and variances (23,32). The reformulation reads

$$\rho_p(x_i, x_k) = \frac{2\,\mathrm{cov}(x_i, x_k)}{\mathrm{var}(x_i) + \mathrm{var}(x_k)},$$

and the corresponding sample estimator is thus

$$\hat{\rho}_p(x_i, x_k) = \frac{2\, s_{ik}}{s_{ii} + s_{kk}},$$

where $s_{ik}$ are elements of the covariance estimates on clr-transformed data. This formulation clarifies the link between the standard correlation matrix on clr-transformed data and ρ_p: the former uses the geometric mean of $s_{ii}$ and $s_{kk}$ in the denominator, whereas the latter uses the arithmetic mean. Since $\hat{\rho}_p$ is completely determined by sample covariances and variances, we expect the measure to have the same drawbacks in the small sample setting as the sample covariance estimators. We thus propose the following shrinkage proportionality estimator ρ_* (rhoshrink):

$$\rho_*(x_i, x_k) = \frac{2\, s^*_{ik}}{s^*_{ii} + s^*_{kk}}, \tag{2}$$

where $s^*_{ik}$ are the elements of the Schäfer–Strimmer shrinkage covariance S*. The estimator for ρ_* is thus completely determined by the Schäfer–Strimmer covariance shrinkage estimates, outlined in the previous section.
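The construction of rhoshrink from a shrunk covariance matrix can be sketched as follows. This is an illustration under stated assumptions, not the corpcor implementation: the shrinkage intensities are passed in as fixed numbers (in practice they are estimated from the data), correlations are shrunk toward zero and variances toward their median, and all names are hypothetical.

```python
import math
import statistics

def shrink_covariance(S, lambda_cor, lambda_var):
    """Shrink correlations toward 0 and variances toward their median."""
    p = len(S)
    variances = [S[i][i] for i in range(p)]
    target = statistics.median(variances)
    v_star = [lambda_var * target + (1 - lambda_var) * v for v in variances]
    S_star = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for k in range(p):
            if i == k:
                S_star[i][k] = v_star[i]
            else:
                r = S[i][k] / math.sqrt(variances[i] * variances[k])
                r_star = (1 - lambda_cor) * r
                S_star[i][k] = r_star * math.sqrt(v_star[i] * v_star[k])
    return S_star

def rho_from_covariance(S):
    """Proportionality from covariances: 2 s_ik / (s_ii + s_kk)."""
    p = len(S)
    return [[2 * S[i][k] / (S[i][i] + S[k][k]) for k in range(p)] for i in range(p)]

# Made-up covariance matrix standing in for an estimate on clr-transformed data.
S = [[2.0, 0.6, -0.4], [0.6, 1.0, 0.2], [-0.4, 0.2, 0.5]]
S_star = shrink_covariance(S, lambda_cor=0.3, lambda_var=0.2)
rho_star = rho_from_covariance(S_star)
```

With both intensities set to 0 the sample covariance is recovered, and with a correlation intensity of 1 all off-diagonal proportionality estimates vanish.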

Comparing association patterns

The distribution of association patterns was visualized using density plots. For each method, we examine the distribution of values after the appropriate association metric is applied to a single normalized subsample. In each density distribution, we also calculate statistical moments: mean, variance, skewness and kurtosis. To quantify similarities of the estimated association patterns across different data normalization methods, association measures and sample sizes, we considered three different distance measures. These distances are then used for comparative low-dimensional embeddings of the different estimates as well as for measuring convergence of the estimators with sample size.

Frobenius distance

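Given a pair of p × p-dimensional association matrices $A$ and $B$ with entries $a_{ik}$ and $b_{ik}$, the Frobenius distance is the square root of the sum of squared differences between the corresponding entries and is defined as

$$d_f(A, B) = \|A - B\|_F = \sqrt{\sum_{i=1}^{p}\sum_{k=1}^{p} (a_{ik} - b_{ik})^2}.$$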
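A minimal sketch of the Frobenius distance between two association matrices, assuming the usual Frobenius norm of the difference (square root of the summed squared entry-wise differences); the function name and matrices are illustrative.

```python
import math

def frobenius_distance(A, B):
    """Frobenius norm of the entry-wise difference between two matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    )

# Two toy 2-by-2 correlation matrices with opposite off-diagonal signs.
A = [[1.0, 0.5], [0.5, 1.0]]
B = [[1.0, -0.5], [-0.5, 1.0]]
d = frobenius_distance(A, B)
```

Each off-diagonal entry differs by 1, so the distance for this pair is the square root of 2.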

Spectral distance

Given a square, symmetric matrix A, let A = UΣU^T be its singular value decomposition, where Σ is a diagonal matrix with singular values along the diagonal entries, i.e. Σ_ii = σ_i. Let σ_max(A) be the largest singular value of A. The spectral distance is

$$d_s(A, B) = \sigma_{\max}(A - B).$$

Due to its sole dependence on the ‘spectrum’ (the singular values) of the association matrix, the spectral distance is invariant to unitary transformations (e.g. rotations) of the matrices.

Correlation matrix distance

The correlation matrix distance (CMD) (36) measures the orthogonality of two correlation matrices and is defined as

$$d_{\mathrm{CMD}}(A, B) = 1 - \frac{\mathrm{tr}(AB)}{\|A\|_F\, \|B\|_F},$$

where the trace operator $\mathrm{tr}(A) = \sum_i A_{ii}$ is the sum of the diagonal entries of a square matrix.
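The CMD can be sketched in a few lines, assuming the standard form of one minus the trace of the matrix product divided by the product of the Frobenius norms (function names and matrices are illustrative):

```python
import math

def frobenius_norm(A):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def correlation_matrix_distance(A, B):
    """1 - tr(AB) / (||A||_F ||B||_F) for square matrices of equal size."""
    p = len(A)
    trace_ab = sum(A[i][k] * B[k][i] for i in range(p) for k in range(p))
    return 1.0 - trace_ab / (frobenius_norm(A) * frobenius_norm(B))

A = [[1.0, 0.5], [0.5, 1.0]]
scaled_A = [[2.0, 1.0], [1.0, 2.0]]

identical = correlation_matrix_distance(A, A)
rescaled = correlation_matrix_distance(A, scaled_A)
```

The distance is zero for matrices that are equal up to a scaling factor and reaches one when the trace of their product vanishes.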

Downstream analysis

We considered two downstream exploratory data analysis tasks that require the estimation of microbial associations as input: (i) OTU clustering and (ii) microbial association network construction and community analysis.

Clustering

Unsupervised clustering of OTUs can help identify microbial subcommunities that may jointly affect host phenotype or reveal experimental and batch effects (25). We considered two popular clustering techniques: spectral and hierarchical clustering.

Spectral clustering requires the construction of an ‘affinity matrix’ from estimated associations, i.e. a matrix expressing how similar pairwise entries are to each other. Here, to construct the affinity matrix, we transformed associations into dissimilarity scores and constructed a k-nearest neighbor graph (k = 2) to obtain a sparse and symmetric affinity matrix A. Identification of OTU clusters is based on k-means clustering of the first m components of the eigendecomposition of the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, where D is the diagonal degree matrix with entries containing the row or column sums of A (37). We chose the number of target clusters to be the number of connected components of the associated affinity graph (37). To assess the taxonomic content of a particular clustering, we evaluated the homogeneity of each cluster with respect to the taxonomic families of the underlying OTUs. As a quantitative measure, we computed the ratio of the effective family number (exponential of the Shannon entropy of family counts) to the total number of families detected per cluster.
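The homogeneity measure described above can be sketched as follows, assuming the Shannon entropy is computed with the natural logarithm on family proportions within a cluster. The function name and family labels are illustrative.

```python
import math
from collections import Counter

def effective_family_ratio(families):
    """exp(Shannon entropy of family proportions) / number of families in a cluster."""
    counts = Counter(families)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy) / len(counts)

# Family labels of the OTUs in two hypothetical clusters.
even_cluster = ["Ruminococcaceae", "Lachnospiraceae"]
uneven_cluster = ["Ruminococcaceae"] * 3 + ["Lachnospiraceae"]

even_ratio = effective_family_ratio(even_cluster)
uneven_ratio = effective_family_ratio(uneven_cluster)
```

The ratio equals 1 when all detected families are equally represented and drops below 1 when a few families dominate the cluster.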

For hierarchical clustering, we converted association matrices to dissimilarity measures. Clustering was then performed using Ward’s method via the hclust function in R. Circular dendrograms were cut using the cutree function, where k = 10 was chosen to represent the number of class annotations.

Relevance networks and community analysis

Relevance networks (38) are a popular way of visualizing and analyzing the overall structure of the microbial ecosystem. Relevance network construction ranks all pairwise correlation or proportionality values between OTUs by absolute value, selects a certain percentage of highest ranked pairs and visualizes the resulting set of pairs as edges between OTUs in an association network. The ranking of pairwise interactions allows us to compare strong associations regardless of differences in scale.
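The edge selection step can be sketched as below. The fraction of retained pairs is a free parameter here, and the function name and association values are made up for illustration.

```python
def relevance_edges(assoc, top_fraction):
    """Keep the top fraction of OTU pairs ranked by absolute association."""
    p = len(assoc)
    pairs = [(i, k, assoc[i][k]) for i in range(p) for k in range(i + 1, p)]
    # Rank by magnitude so that strong negative associations are retained too.
    pairs.sort(key=lambda edge: abs(edge[2]), reverse=True)
    n_keep = max(1, int(top_fraction * len(pairs)))
    return pairs[:n_keep]

# Toy symmetric association matrix for four OTUs.
assoc = [
    [1.0, 0.9, -0.1, 0.2],
    [0.9, 1.0, -0.8, 0.05],
    [-0.1, -0.8, 1.0, 0.3],
    [0.2, 0.05, 0.3, 1.0],
]
edges = relevance_edges(assoc, top_fraction=0.5)
```

Because only ranks are used, the same fraction of edges is selected for every association measure, which makes networks comparable across normalization schemes with different value ranges.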

Multiple studies have found a higher prevalence of positive associations between taxonomically related OTUs in human gut datasets (14,39–42). We thus use taxonomic coherence, measured by assortativity (43), as independent summary statistic for relevance networks. When categorical variables are available for each node, the assortativity coefficient of a network takes values in [−1, 1] and measures the tendency of adjacent nodes to belong to the same category. In the context of microbial networks, the nodes are OTUs and the associated categories are their inferred taxonomic rank at the genus level.

We also examined the presence of community structure in the relevance networks using the concept of modularity (44). Similar to clustering analysis, modularity analysis of a network enables the partitioning of nodes into tightly connected subcommunities. Modularity was computed using the fast-greedy algorithm, described in (45) and implemented in the igraph package in R. Network layout was generated using the force-directed Fruchterman–Reingold algorithm (46).

RESULTS

Our comprehensive computation and analysis workflow produced several key results that are summarized below. We highlighted statistical properties of association estimation, followed by a comparison of downstream analysis results. For ease of presentation, we focused on tss and clr as representative data normalization/transformations as well as standard and shrinkage-based proportionality estimation in the main text. The complete analysis is available in the Supplementary Data.

Shrinkage universally improves consistency of association estimation

We first analyzed the influence of shrinkage on the estimation of associations under different data normalization and sample sizes. We show the convergence properties of association estimation, as measured by Frobenius distance d_f with respect to the large sample limit, with increasing sample size in Figure 2A. Shrinkage universally improves estimates in the low sample regime compared to their sample estimation counterparts. Even when the sample size n exceeds the number of OTUs p, most shrinkage estimates remain more similar to their respective large sample limit. This behavior is also reflected in the distribution of association estimates at low (n = 50) and large (n = 9000) sample sizes, as highlighted in Figure 3 for the proportionality measures ρ_p (rhoprop) and ρ_* (rhoshrink). In the small sample limit, rhoprop produces extreme proportionality estimates compared to the large sample limit (third row in Figure 3). The shapes of the distributions of rhoshrink estimates, however, were more similar in the small and large sample limits, and the distribution covered a similar range of [−0.1, 0.1]. These phenomena were observed for all combinations of data normalization and association estimation (Supplementary Figures S4A and S5A). As expected, the influence of shrinkage vanished in the large sample limit, as reflected in decreasing shrinkage intensities with increasing sample size (Supplementary Figure S2).

Figure 2. — Frobenius distance between estimates of association. (A) Average Frobenius distance between subsamples of the same sample size. Dashed lines represent the mean distance between normalized matrices after Pearson correlation. The solid lines represent the mean distance between normalized matrices where correlation/proportionality estimation with shrinkage was performed. The dot-dashed line represents rho, a proportionality metric. The long-dashed line represents rhoshrink, proportionality with shrinkage included. Vertical lines represent standard deviation from the mean for each corresponding method. (B) Multidimensional scaling (MDS) representation of Frobenius distance between correlation structures of varying sizes estimated from different normalization methods. The most opaque points represent the mean of five subsamples of the same size [color scheme as in (A)]. Points are labeled based on subsample size.

Figure 3. — Density of association values under different transformations and shrinkage. To represent clr and tss, data are normalized and correlation is calculated with shrinkage. Proportionality without shrinkage and proportionality with shrinkage are represented by rhoprop and rhoshrink, respectively. Each plot is a single random subsample of four representative methods at (A) 50 samples, (B) 50 samples with shuffled data and (C) 9000 samples. Mean, variance, skewness and kurtosis are shown for each distribution. Additional methods are provided in Supplementary Figure S3.

Normalization methods induce distinct association patterns

We next analyzed the similarity among the different association estimates with increasing sample size using MDS. Figure 2B shows a 2D MDS embedding of all shrinkage association estimates using the Frobenius distance. We identified five distinct classes. Association estimates following variance-reducing/stabilizing transformations (clr, vst, asinh, rhoprop, rhoshrink) form a distinct linear trace in the embedding, ordered along sample size (V). Correlation estimates on raw count data form another distinct group (I). Correlation estimates after css (II) and wren (IV) normalization form two distinct traces in the embedding. Finally, correlation estimates following the com, rle and tss normalization form the fifth class of association patterns (III). For small sample sizes, association patterns are similar independent of the normalization methods. As each of the five classes forms a distinct linear trace in the embedding, we used the distances between estimates of different sample sizes to evaluate the rate at which normalization methods arrived at stable patterns of association. In agreement with Figure 2A, we observed that wren, vst and com arrived at consistent association estimates with the fewest samples, followed closely by rle and tss normalization methods (Figure 2B and Supplementary Figure S3). The observed grouping pattern and convergence behavior are largely invariant to the distance measure used (see Supplementary Figures S4B and S5B for spectral distance and CMD, respectively).

Association estimates are positively skewed

We next analyzed the shapes of empirical distributions of shrinkage association estimates for all normalization schemes in three different sample scenarios: small sample regime (n = 50), randomly shuffled data in the small sample regime (n = 50) and large sample regime (n = 9000). Figure 3 shows clr, tss, rhoprop and rhoshrink distributions across these scenarios (see Supplementary Figure S3 for all others). All correlation distributions are positively skewed. Estimates without shrinkage are considerably wider in the low sample regime (as exemplified for standard proportionality rhoprop versus rhoshrink in Figure 3). Overall, variance-reducing/stabilizing transformations (clr, vst, asinh, rhoprop, rhoshrink) induce wider, more symmetric association distributions. All other normalization schemes induce distributions with considerable positive skewness, resembling correlation distributions on raw count data (Supplementary Figure S3). Positive skewness also persists for association estimates on shuffled data. Although the shapes of shrinkage association distributions are visually similar in the small and large sample limits, we universally observed an increase in skewness and kurtosis with larger sample sizes independent of the normalization scheme.
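As an illustration of the summary statistics reported for each distribution, the following sketch computes mean, variance, skewness and kurtosis of the off-diagonal entries of an association matrix. The moment conventions used here (population moments, excess kurtosis) are assumptions; the original analysis may use different conventions.

```python
def off_diagonal(matrix):
    """Upper-triangular off-diagonal entries of a symmetric association matrix."""
    p = len(matrix)
    return [matrix[i][j] for i in range(p) for j in range(i + 1, p)]

def moments(values):
    """Mean, population variance, skewness and excess kurtosis of a sample.

    Assumes the values are not all identical (variance > 0).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (0 for a Gaussian)
    }

# Toy association matrix with one strong positive association (hypothetical values).
assoc = [[1.0, 0.8, 0.1],
         [0.8, 1.0, 0.0],
         [0.1, 0.0, 1.0]]
stats = moments(off_diagonal(assoc))
```

A positive `skewness` value corresponds to the right-skewed shape described above, with a long tail of large positive associations.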

Clustering methods are sensitive to normalization and shrinkage estimation

We next focused on analyzing the influence of normalization and association estimation on downstream data analysis tasks. We first considered clustering of OTUs using a large sample limit of n = 9000 samples from the AGP dataset. For spectral clustering, we asked whether and how normalization and shrinkage influence (i) the standard selection of the number of clusters and (ii) the taxonomic composition of the resulting clusters. One common strategy for model selection in spectral clustering is the ‘spectral gap’ criterion. The number of selected clusters is considerably larger (k ≥ 11) for the variance-reducing/stabilizing transformations (clr, vst, asinh, rhoprop, rhoshrink) than for other normalization methods (k ≤ 8) (Supplementary Figure S6). Despite the large sample size, the spectral gap of rhoprop- and rhoshrink-based spectral clustering is still different, resulting in k = 11 and k = 12 clusters, respectively. The different number of clusters also contributed to marked differences in terms of the homogeneity of OTU compositions, as shown in Figure 4. Variance-reducing/stabilizing transformations produced taxonomically more homogeneous groups at the family level. rhoshrink-based clustering produced the highest mean cluster purity, indicating strong agreement between estimated OTU associations and taxonomic identity (as shown for the family level in Figure 4 and Supplementary Figure S5). rhoprop- and rhoshrink-based clustering formed very similar but not identical clusters in terms of composition and cluster purity. A larger number of OTUs of family Ruminococcaceae and class Bacteroidia cluster together in clr-based clustering compared to tss-based clustering. OTU clusters derived from css, rle and wren normalization resulted in no distinct taxonomic grouping (see Supplementary Figure S5).
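Cluster purity can be sketched as follows, assuming the common definition in which the purity of a cluster is the fraction of its members carrying the most frequent taxonomic label, and the mean purity is the unweighted average over clusters. The exact definition used in the study is given in its methods and may differ; the function names and family labels below are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_purities(cluster_ids, labels):
    """Per-cluster purity: share of members with the cluster's majority label."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)
    return {cluster: Counter(labs).most_common(1)[0][1] / len(labs)
            for cluster, labs in members.items()}

def mean_purity(cluster_ids, labels):
    """Unweighted mean of the per-cluster purities."""
    purities = cluster_purities(cluster_ids, labels)
    return sum(purities.values()) / len(purities)

# Five OTUs assigned to two clusters, with toy family annotations.
cluster_ids = [0, 0, 0, 1, 1]
families = ["Ruminococcaceae", "Ruminococcaceae", "Lachnospiraceae",
            "Bacteroidaceae", "Bacteroidaceae"]
purities = cluster_purities(cluster_ids, families)
overall = mean_purity(cluster_ids, families)
```

Here the second cluster is taxonomically homogeneous (purity 1), whereas the first mixes two families, which lowers the mean.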

Figure 4. — OTU clusters from spectral clustering. (A–D) Each horizontal bar represents the composition of OTUs in a cluster at the family level. Clusters are in order of increasing percentage of the most abundant family: Ruminococcaceae. In each cluster, the colors represent the OTU families in each cluster. Numbers to the left of each bar represent the number of OTUs in each cluster. Values next to each method name represent cluster purity. Additional methods are provided in Supplementary Figure S7.

Hierarchical clustering largely confirmed the previous observations. For ease of comparison, we set the number of clusters to k = 10 for inference workflows. Figure 5 shows the dendrograms for clr-, tss-, rhoprop- and rhoshrink-based clustering. While some distinct and homogeneous clusters can be found in the tss case, the majority of OTUs have been grouped into a single cluster comprising many families and classes of taxonomically unrelated bacteria. However, taxonomic grouping is well represented by hierarchical clustering of rhoprop- and rhoshrink-based estimates (Figure 5). Similarly, vst and asinh have recovered large groups of the most prevalent family annotation: Ruminococcaceae, Lachnospiraceae and Bacteroidaceae (see Supplementary Figure S8).

Figure 5. — Circular dendrograms showing hierarchical clustering patterns among OTUs. Each point surrounding the circular dendrogram represents one of the 531 OTUs in our dataset. The color represents family annotation. Each dendrogram (A–D) has been cut hierarchically into 10 trees (representing the 10 orders to which these taxonomic families map). The gray and black shading is used to highlight different clusters that are numbered. Hierarchical clustering of clr-transformed OTUs is better at delineating taxonomic relationships than clustering of those using tss; rhoprop and rhoshrink produce similar clustering patterns. Additional methods are provided in Supplementary Figure S8.

Normalization induces relevance networks with different community structures

We next considered the downstream statistical task of learning microbial relevance networks from AGP data. We estimated associations in the large sample limit n = 9000 and selected the top 2000 associations for network construction in every data normalization/association estimation workflow. Figure 6 shows network visualizations for clr-, tss-, rhoprop- and rhoshrink-based relevance networks (see Supplementary Figure S9 for the other instances). We identified subcommunities of highly connected OTUs using modularity maximization. The number of identified modules ranged between 20 (using Wrench) and 38 (using vst normalization). Relevance networks derived from variance-reducing/stabilizing transformations (clr, vst, asinh, rhoprop and rhoshrink) were partitioned into 35–38 modules and achieved a maximum modularity score of ≈0.8 (compared to modularity scores of <0.6 for all other networks). Visual inspection of these networks revealed that members of the Bacteroidetes phylum (represented by square nodes in Figure 6) formed tightly connected modules with few edges connecting to other phyla. Firmicutes (represented by circular nodes) in networks were divided into a higher number of modules comprising distinct families, including Lachnospiraceae (represented by orange circles) and Ruminococcaceae (teal circles, Figure 6 and Supplementary Figure S9). This striking modularity is less pronounced in the tss-based relevance networks (Figure 6B).
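The two ingredients of this step, edge selection and modularity scoring, can be sketched under stated assumptions. Ranking pairs by absolute association value is an assumption made here for illustration, and the code only scores a given partition with Newman's modularity; it does not perform the modularity maximization itself. All names and toy values are hypothetical.

```python
def top_k_edges(assoc, k):
    """Return the k OTU pairs (i, j), i < j, with the largest |association|."""
    p = len(assoc)
    pairs = [(abs(assoc[i][j]), i, j) for i in range(p) for j in range(i + 1, p)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:k]]

def modularity(edges, membership):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over modules c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside module c
    and d_c the summed degree of the nodes in module c.
    """
    m = len(edges)
    internal, degree = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Edge selection on a toy 3x3 association matrix.
assoc = [[1.0, 0.9, -0.8],
         [0.9, 1.0, 0.1],
         [-0.8, 0.1, 1.0]]
selected = top_k_edges(assoc, 2)

# Two triangles joined by a single bridge edge, split into two modules.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q = modularity(edges, [0, 0, 0, 1, 1, 1])
```

For the toy graph, the two-module partition yields Q = 5/14, whereas placing all nodes in one module yields Q = 0.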

Figure 6. — Community structure of relevance networks. (A–D) The left network of each panel shows module membership. Each numbered node represents the module annotation of an OTU in the graph. The networks on the right represent the corresponding taxonomic annotation of the OTU at the family (color) and phylum (shape) levels. Values stated next to method name represent the number of modules in the network. Node layout is conserved for both networks in each panel. Additional methods are provided in Supplementary Figure S9.

Similar to the clustering analysis, we next evaluated the taxonomic coherence of the different networks. Using assortativity on the genus level as a quantitative measure, we found that relevance networks derived from variance-reducing/stabilizing transformations showed the highest overall assortativity in the large sample limit (≈0.35).
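For reference, the assortativity coefficient for a categorical node attribute (43) can be sketched as below, with genus labels replaced by toy categories. This is a plain re-implementation for illustration, not the code used in the study; the function name is hypothetical.

```python
def assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical node attribute.

    r = (sum_i e_ii - sum_i a_i^2) / (1 - sum_i a_i^2) for an undirected graph,
    where e_ii is the fraction of edge ends on edges joining two nodes of
    category i and a_i is the fraction of all edge ends attached to category i.
    """
    m = len(edges)
    same = 0.0
    ends = {}
    for u, v in edges:
        cu, cv = category[u], category[v]
        if cu == cv:
            same += 1.0 / m  # both ends of this edge lie in the same category
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
    expected = sum((count / (2 * m)) ** 2 for count in ends.values())
    if expected == 1.0:
        raise ValueError("assortativity is undefined when all edge ends share one category")
    return (same - expected) / (1.0 - expected)

# Two triangles joined by one bridge edge; each triangle carries one toy genus label.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
genus = ["A", "A", "A", "B", "B", "B"]
r = assortativity(edges, genus)
```

Values near 1 indicate that edges predominantly connect OTUs with the same label; values near 0 indicate mixing close to what random wiring would produce.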

We next asked whether high-level network properties such as assortativity and modularity were consistent independent of the sample size used to estimate the association networks. We thus repeated the previous analysis for different sample sizes, ranging from n = 25 to n = 9000. Figure 7 shows the estimated network assortativity and maximum modularity score estimates versus sample size. We found that for relevance networks derived from variance-reducing/stabilizing transformations, both assortativity and modularity monotonically increase with sample size. Both estimates stabilize around sample sizes n ≈ p. For the remaining relevance networks, assortativity estimates monotonically increase with sample size, while modularity tends to decrease with sample size. In summary, this analysis implies that estimates of high-level network summary statistics such as assortativity and modularity are inconsistent in the small sample regime compared to their large sample limit.

Figure 7. — Community analysis of relevance network structure with increasing sample size. (A) Assortativity coefficient across sample size of genus annotation. (B) Maximum modularity score across sample size at 2000 edges. For all plots, lines represent mean and gray ribbons represent standard deviation from the mean.

We next examined the edge overlap from correlation-based relevance networks (clr and tss transformations) and proportionality metrics (rhoprop and rhoshrink). We found a common core of 1086 edges between 346 OTUs that were present in all relevance networks. This consensus network also contained several tightly connected network modules with highly assortative inter-family associations (Figure 8A). Overall, we found that clr-, rhoprop- and rhoshrink-based networks shared the majority of common edges, with the rhoprop- and rhoshrink-based edge sets differing only by a single edge. The tss-based relevance network comprised 779 unique edges not shared by any of the other networks (Figure 8B).
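The edge overlap analysis amounts to set operations on undirected edge lists. A minimal sketch, with hypothetical toy edges standing in for the four relevance networks:

```python
def normalize(edges):
    """Represent undirected edges as a set of sorted node pairs."""
    return {tuple(sorted(edge)) for edge in edges}

def consensus_edges(networks):
    """Edges present in every network (the consensus core)."""
    return set.intersection(*(normalize(e) for e in networks.values()))

def unique_edges(networks, name):
    """Edges found in the named network and in none of the others."""
    others = set().union(*(normalize(e) for n, e in networks.items() if n != name))
    return normalize(networks[name]) - others

# Toy edge lists keyed by workflow name; node identifiers are OTU indices.
networks = {
    "clr": [(0, 1), (1, 2), (2, 3)],
    "tss": [(1, 0), (4, 5), (5, 6)],
    "rhoprop": [(0, 1), (1, 2), (2, 3)],
    "rhoshrink": [(0, 1), (2, 1), (3, 4)],
}
core = consensus_edges(networks)
core_nodes = {node for edge in core for node in edge}
tss_only = unique_edges(networks, "tss")
```

The number of OTUs in the consensus network corresponds to the size of `core_nodes`, and the per-network unique counts correspond to the outer regions of a Venn diagram.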

Figure 8. — Shared interactions between relevance networks. (A) Consensus network of edges in common between four representative methods. Network contains 1086 edges between 346 OTUs. Node color represents family annotation and node shape represents phylum. (B) Venn diagram showing unique and shared interactions predicted from representative normalization methods.

Additionally, we found that clr-, vst- and asinh-based correlation networks also shared a large common consensus core (Supplementary Figure S10A). Similarly, com-, rle- and tss-based networks showed a large edge set overlap (Supplementary Figure S10B). These observations confirmed the distinct groupings observed in the MDS embedding of Frobenius distances (Figure 2B).

DISCUSSION

Data normalization and inference of taxon–taxon associations from microbial genomic survey count data are two of the most basic statistical analysis tasks in modern microbiome research. To help the practitioner of microbial data analysis make informed choices about the different available normalization and association inference schemes, we have taken a closer look at the impact of data normalization on association estimation and several downstream exploratory data analysis tasks. Rather than asking what is the best method available for different analysis steps, we have leveraged the large available sample size in the AGP dataset and assessed the consistency of two ubiquitous linear association estimators for microbiome data, correlation and proportionality, under a wide range of realistic sample size scenarios, data normalization schemes and downstream data analysis tasks.

Our analysis revealed several important observations that have direct implications for best practice in microbiome data analysis workflows. First, we have confirmed that correlation and proportionality estimates are inconsistent in the low sample regime n < p when compared to large sample counterparts, in terms of both general large-scale association patterns (Figure 2) and downstream network summary statistics, including assortativity and modularity (Figure 7). While this phenomenon has been long appreciated in the statistical literature, we have established that shrinkage estimation, a popular statistical regularization scheme used in finance (33) and genomics research (27), can also improve association estimation for microbiome data, independent of the employed normalization method. Leveraging the close mathematical relationship between variance–covariance estimation and the concept of proportionality, we have also introduced a novel shrinkage proportionality estimator, rhoshrink, that is easy to compute and may prove useful in other scientific areas where compositional data are available.
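The relationship between covariance estimation and proportionality can be illustrated with a simplified sketch. It uses the proportionality measure rho(i, j) = 2 cov(i, j) / (var(i) + var(j)) computed on log-ratio-transformed data (32), and it stands in for shrinkage with a linear shrinkage of the sample covariance toward its own diagonal at a fixed, user-chosen intensity `lam`. Both the diagonal target and the fixed intensity are assumptions made for illustration; shrinkage estimators such as those in (27,33) determine the intensity from the data, and this sketch is not the rhoshrink implementation.

```python
def covariance(data):
    """Sample covariance matrix; rows are samples, columns are transformed taxa."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

def shrink_to_diagonal(cov, lam):
    """Linear shrinkage (1 - lam) * S + lam * diag(S), with 0 <= lam <= 1."""
    p = len(cov)
    return [[cov[i][j] if i == j else (1.0 - lam) * cov[i][j] for j in range(p)]
            for i in range(p)]

def rho(cov):
    """Proportionality rho_ij = 2 * S_ij / (S_ii + S_jj) from a covariance matrix."""
    p = len(cov)
    return [[2.0 * cov[i][j] / (cov[i][i] + cov[j][j]) for j in range(p)]
            for i in range(p)]

# Toy values standing in for log-ratio-transformed abundances of three taxa.
# Taxon 1 equals taxon 0 plus a constant, i.e. the two are exactly proportional.
data = [[1.0, 1.5, 0.5],
        [2.0, 2.5, 0.1],
        [3.0, 3.5, 0.9],
        [4.0, 4.5, 0.3]]
cov = covariance(data)
rho_raw = rho(cov)
rho_shrunk = rho(shrink_to_diagonal(cov, lam=0.2))
```

With this particular target the variances are untouched, so every off-diagonal rho value is simply scaled by (1 - lam) toward zero; data-driven targets and intensities lead to less trivial behavior.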

On the AGP data, we have been able to categorize 10 data normalization/association estimation workflows into five coherent groups that show strong agreement across all sample size scenarios (Figure 2B). Most prominently, we have found that variance-reducing/stabilizing transformations lead to a high agreement of correlation or proportionality estimates. This was also confirmed in the downstream microbial relevance network comparison, where clr-based correlation networks and proportionality association networks showed high agreement among the inferred edge sets (Figure 8). This implies that, in the presence of large sample sizes and a large number of OTUs, differences between correlation and proportionality estimates are less pronounced than previously expected. An important observation on the AGP dataset was that the empirical distributions of association estimates were universally right-skewed even in the randomly shuffled data scenario. This implies that irrespective of the data normalization/association inference workflow, one will observe a higher prevalence of positive associations. This phenomenon has been previously described in the context of microbial association inference across many different microbial habitats (14,47). While it is tempting to interpret these results as ecological features of the underlying microbial community in terms of higher prevalence of commensal rather than competitive microbial interactions, the positive skewness may also be due to technical limitations in the data generation process and shortcomings in current statistical estimation. For instance, truncation-to-zero effects for low sequencing read counts likely obstruct unbiased estimation of negative correlations and, in turn, proportionality. A possible remedy for this data-induced artifact is the application of more advanced semi-parametric correlation estimators that infer latent correlations under data truncation assumptions (21,48). A detailed investigation of semi-parametric and other estimators may provide a promising avenue for future research.

In many studies, microbial counting strategy has transitioned away from the use of OTUs and toward ASVs (3). We suspect that shrinkage may also improve association learning in these contexts where variance-stabilizing transformations are often used in analysis (49).

Despite the universal presence of positive skewness in association estimates for the AGP data, we have observed that variance-reducing/stabilizing transformations could reduce positive skewness in shrunk association estimates (Figure 3). Moreover, our results on microbial association network construction and clustering as typical downstream exploratory data analysis examples also revealed that variance-reducing/stabilizing approaches provided the most consistent estimation in terms of taxonomic and structural coherence, as measured by taxonomic cluster purity in spectral and hierarchical clustering (Figures 4 and 5) and network assortativity (Figures 6 and 7). Taken together, we can recommend any variance-reducing/stabilizing transformations followed by shrinkage estimation for association inference. However, transformations such as asinh and clr may be preferred since they are faster to compute than vst, while providing similar statistical properties. The resulting shrinkage correlation estimates can then also serve as input for more involved direct microbial network inference workflows that account for transitive correlations, adjust for additional covariates or model latent effects (14,39,50,51).
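The two computationally cheap transformations mentioned above can be sketched per sample as follows. The pseudocount added before taking logarithms in `clr` is an assumption made here to handle zero counts; the zero-handling strategy of the actual workflow is described in its methods and may differ.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's count vector.

    Each log count is centered by the mean log count of the sample, which
    equals the log of the geometric mean. The pseudocount is an assumption.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]

def asinh_transform(counts):
    """Inverse hyperbolic sine transform, asinh(x) = log(x + sqrt(x^2 + 1))."""
    return [math.asinh(c) for c in counts]

# One toy sample with five OTU counts, including a zero.
sample = [0, 3, 10, 120, 7]
sample_clr = clr(sample)
sample_asinh = asinh_transform(sample)
```

Both transformations compress large counts roughly logarithmically; clr additionally removes the per-sample scale, so the transformed values of each sample sum to zero.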

For relevance network estimation, consensus network construction, as put forward here for the AGP data (Figure 8), is a straightforward strategy to relax the influence of data normalization. For our AGP consensus network, we found that more than half of the top 2000 edges in the tss-, clr-, rhoprop- and rhoshrink-based relevance networks were in full agreement, connecting a subset of 346 OTUs. The inferred AGP consensus network comprised a majority of positive edges and showed high assortativity at the genus level (0.39) and a maximum modularity of 0.8.

Assortativity increased in the consensus network compared to individual relevance networks. Notably, many taxa in the consensus network were frequently identified as key targets for microbiome therapeutics, such as prebiotic treatment and fecal microbiota transplants, including Akkermansia muciniphila, Prevotella copri, Ruminococcus bromii and Faecalibacterium prausnitzii (52,53).

Our computational data analysis workflow, available on GitHub and as a Synapse project (see the ‘Data Availability’ section), is fully reproducible, provides all novel shrinkage estimators introduced here and allows easy extension and comparison to additional data normalization, estimation and downstream analysis tasks. For instance, future work could include the integration of more advanced zero-replacement strategies (54,55), application of popular data normalization schemes from single-cell data analysis (56) or the application of other correlation (21,48) or proportionality estimators, including those available in the propr package (23). Here, rather than using universal thresholding for sparsifying associations, more advanced selection strategies that control false discovery rates [as available in the propr package (23)] may improve the sample size consistency of the microbial association inference workflows.

Going forward, we believe that large-scale reproducible computational analysis workflows that focus on sample size-dependent consistency of statistical estimates are of paramount importance for deriving stable testable hypotheses about the complex interplay between host phenotype and the microbiome from large-scale microbial genomic survey data.

DATA AVAILABILITY

The code and data used are available as a GitHub repository at https://github.com/MichelleBadri/NormCorr-manuscript and Synapse project syn21654780. Data used for this study were accessed from ftp://ftp.microbio.me/AmericanGut/ag-2017-12-04/. The latest complete AGP dataset can be accessed on Qiita using study ID 10317 (25).

ACKNOWLEDGEMENTS

We would like to thank the Simons Foundation.

Contributor Information

Michelle Badri, Department of Biology, New York University, New York, NY 10012, USA.

Zachary D Kurtz, Lodo Therapeutics, New York, NY 10016, USA.

Richard Bonneau, Department of Biology, New York University, New York, NY 10012, USA; Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY 10010, USA; Computer Science Department, Courant Institute, New York, NY 10012, USA.

Christian L Müller, Center for Computational Mathematics, Flatiron Institute, Simons Foundation, New York, NY 10010, USA; Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg 85764, Germany; Department of Statistics, Ludwig-Maximilians-Universität München, Munich 80539, Germany.

SUPPLEMENTARY DATA

Supplementary Data are available at NARGAB Online.

FUNDING

Flatiron Institute, a division of the Simons Foundation.

Conflict of interest statement. None declared.

REFERENCES

1. Caporaso J.G., Kuczynski J., Stombaugh J., Bittinger K., Bushman F.D., Costello E.K., Fierer N., Peña A.G., Goodrich J.K., Gordon J.I. et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods. 2010; 7:335–336.
2. Schloss P.D., Westcott S.L., Ryabin T., Hall J.R., Hartmann M., Hollister E.B., Lesniewski R.A., Oakley B.B., Parks D.H., Robinson C.J. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 2009; 75:7537–7541.
3. Callahan B.J., McMurdie P.J., Rosen M.J., Han A.W., Johnson A.J.A., Holmes S.P. DADA2: high-resolution sample inference from Illumina amplicon data. Nat. Methods. 2016; 13:581–583.
4. Willis A.D., Martin B.D. Estimating diversity in networked ecological communities. Biostatistics. 2020; doi:10.1093/biostatistics/kxaa015.
5. Bucci V., Tzen B., Li N., Simmons M., Tanoue T., Bogart E., Deng L., Yeliseyev V., Delaney M.L., Liu Q. et al. MDSINE: Microbial Dynamical Systems INference Engine for microbiome time-series analyses. Genome Biol. 2016; 17:121.
6. Zwiener I., Frisch B., Binder H. Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS One. 2014; 9:e85150.
7. Mark Welch J.L., Hasegawa Y., McNulty N.P., Gordon J.I., Borisy G.G. Spatial organization of a model 15-member human gut microbiota established in gnotobiotic mice. Proc. Natl Acad. Sci. U.S.A. 2017; 114:E9105–E9114.
8. Stömmler F., Gläsner J., Hiergeist A., Holler E., Weber D., Oefner P.J., Gessner A., Spang R. Adjusting microbiome profiles for differences in microbial load by spike-in bacteria. Microbiome. 2016; 4:28.
9. Aitchison J. The Statistical Analysis of Compositional Data. 1986; London: Chapman & Hall.
10. Gloor G.B., Macklaim J.M., Pawlowsky-Glahn V., Egozcue J.J. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 2017; 8:2224.
11. McMurdie P.J., Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 2014; 10:e1003531.
12. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550.
13. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
14. Kurtz Z.D., Muller C.L., Miraldi E.R., Littman D.R., Blaser M.J., Bonneau R.A. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol. 2015; 11:e1004226.
15. Paulson J.N., Stine O.C., Bravo H.C., Pop M. Robust methods for differential abundance analysis in marker gene surveys. Nat. Methods. 2013; 10:1200–1202.
16. Weiss S., Xu Z.Z., Peddada S., Amir A., Bittinger K., Gonzalez A., Lozupone C., Zaneveld J.R., Vázquez-Baeza Y., Birmingham A. et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017; 5:27.
17. Kumar M.S., Slud E.V., Okrah K., Hicks S.C., Hannenhalli S., Corrada Bravo H. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics. 2018; 19:799.
18. Martin B.D., Witten D., Willis A.D. Modeling microbial abundances and dysbiosis with beta-binomial regression. Ann. Appl. Stat. 2020; 14:94–115.
19. Brill B., Amir A., Heller R. Testing for differential abundance in compositional counts data, with application to microbiome studies. 2019; arXiv doi: https://arxiv.org/abs/1904.08937, 30 March 2020, preprint: not peer reviewed.
20. Weiss S., Treuren W.V., Lozupone C., Faust K., Friedman J., Deng Y., Xia L.C., Xu Z.Z., Ursell L., Alm E.J. et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J. 2016; 10:1669–1681.
21. Yoon G., Gaynanova I., Müller C.L. Microbial networks in SPRING: semi-parametric rank-based correlation and partial correlation estimation for quantitative microbiome data. Front. Genet. 2019; 10:516.
22. Lovell D., Pawlowsky-Glahn V., Egozcue J.J., Marguerat S., Bähler J. Proportionality: a valid alternative to correlation for relative data. PLoS Comput. Biol. 2015; 11:e1004075.
23. Quinn T.P., Richardson M.F., Lovell D., Crowley T.M. propr: an R-package for identifying proportionally abundant features using compositional data analysis. Sci. Rep. 2017; 7:16252.
24. Jackson D.A. Compositional data in community ecology: the paradigm or peril of proportions. Ecology. 1997; 78:929–940.
25. McDonald D., Hyde E., Debelius J.W., Morton J.T., Gonzalez A., Ackermann G., Aksenov A.A., Behsaz B., Brennan C., Chen Y. et al. American Gut: an open platform for citizen science microbiome research. mSystems. 2018; 3:e00031-18.
26. Stein C.M. Estimation of the mean of a multivariate normal distribution. Ann. Stat. 1981; 9:1135–1151.
27. Schäfer J., Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. 2005; 4:32.
28. Duvallet C., Gibbons S.M., Gurry T., Irizarry R.A., Alm E.J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 2017; 8:1784.
29. Holmes S., Huber W. Modern Statistics for Modern Biology. 2019; Cambridge: Cambridge University Press.
30. Burbidge J.B., Magee L., Robb A.L. Alternative transformations to handle extreme values of the dependent variable. J. Am. Stat. Assoc. 1988; 83:123–127.
31. Pearson K. Mathematical contributions to the theory of evolution—on a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc. R. Soc. Lond. 1897; 60:489–498.
32. Erb I., Notredame C. How should we measure proportionality on relative gene expression data? Theory Biosci. 2016; 135:21–36.
33. Ledoit O., Wolf M. Honey, I shrunk the sample covariance matrix. J. Portfolio Manage. 2004; 30:110–119.
34. Rothman A.J., Bickel P.J., Levina E., Zhu J. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008; 2:494–515.
35. Bien J., Tibshirani R.J. Sparse estimation of a covariance matrix. Biometrika. 2011; 98:807–820.
36. Herdin M., Czink N., Özcelik H., Bonek E. Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels. Proceedings of the IEEE Vehicular Technology Conference. 2005; Stockholm.
37. Ng A.Y., Jordan M.I., Weiss Y. On spectral clustering: analysis and an algorithm. Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS’01). 2001; Cambridge, MA: MIT Press; 849–856.
38. Butte A.J., Tamayo P., Slonim D., Golub T.R., Kohane I.S. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. U.S.A. 2000; 97:12182–12186.
39. Claussen J.C., Skiecevičienė J., Wang J., Rausch P., Karlsen T.H., Lieb W., Baines J.F., Franke A., Hütt M.-T. Boolean analysis reveals systematic interactions among low-abundance species in the human gut microbiome. PLoS Comput. Biol. 2017; 13:e1005361.
40. Friedman J., Alm E.J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 2012; 8:e1002687.
41. Faust K., Sathirapongsasuti J.F., Izard J., Segata N., Gevers D., Raes J., Huttenhower C. Microbial co-occurrence relationships in the human microbiome. PLoS Comput. Biol. 2012; 8:e1002606.
42. Lozupone C., Faust K., Raes J., Faith J.J., Frank D.N., Zaneveld J., Gordon J.I., Knight R. Identifying genomic and metabolic features that can underlie early successional and opportunistic lifestyles of human gut symbionts. Genome Res. 2012; 22:1974–1984.
43. Newman M.E.J. Mixing patterns in networks. Phys. Rev. E. 2003; 67:026126.
44. Newman M.E. Modularity and community structure in networks. Proc. Natl Acad. Sci. U.S.A. 2006; 103:8577–8582.
45. Clauset A., Newman M.E.J., Moore C. Finding community structure in very large networks. Phys. Rev. E. 2004; 70:066111.
46. Csardi G., Nepusz T. The igraph software package for complex network research. Inter J. 2006; 1695:1–9.
47. Faust K., Lima-Mendez G., Lerat J.-S., Sathirapongsasuti J.F., Knight R., Huttenhower C., Lenaerts T., Raes J. Cross-biome comparison of microbial association networks. Front. Microbiol. 2015; 6:1200.
48. Yoon G., Carroll R.J., Gaynanova I. Sparse semiparametric canonical correlation analysis for data of mixed types. Biometrika. 2020; 107:609–625.
49. Sprockett D., Fischer N., Boneh R.S., Turner D., Kierkus J., Sladek M., Escher J.C., Wine E., Yerushalmi B., Dias J.A. et al. Treatment-specific composition of the gut microbiota is associated with disease remission in a pediatric Crohn’s disease cohort. Inflamm. Bowel Dis. 2019; 25:1927–1938.
50. Biswas S., Mcdonald M., Lundberg D.S., Dangl J.L., Jojic V. Learning microbial interaction networks from metagenomic count data. J. Comput. Biol. 2016; 23:526–535.
51. Kurtz Z.D., Bonneau R., Müller C.L. Disentangling microbial associations from hidden environmental and technical factors via latent graphical models. 2019; bioRxiv doi: 10.1101/2019.12.21.885889, 23 December 2019, preprint: not peer reviewed.
52. Scher J.U., Sczesnak A., Longman R.S., Segata N., Ubeda C., Bielski C., Rostron T., Cerundolo V., Pamer E.G., Abramson S.B. et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. eLife. 2013; 2:e01202.
53. Scott K.P., Jean-Michel A., Midtvedt T., Hemert S.V. Manipulating the gut microbiota to maintain health and treat disease. Microb. Ecol. Health Dis. 2015; 26:25877.
54. Palarea-Albaladejo J., Martin-Fernandez J.A. zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometr. Intell. Lab. 2015; 143:85–96.
55. Cao Y., Zhang A., Li H. Multisample estimation of bacterial composition matrices in metagenomics data. Biometrika. 2020; 107:75–92.
56. Vallejos C.A., Risso D., Scialdone A., Dudoit S., Marioni J.C. Normalizing single-cell RNA sequencing data: challenges and opportunities. Nat. Methods. 2017; 14:565–571.

Associated Data

Supplementary Materials

lqaa100_Supplemental_File (PDF, 10.4 MB)
