PLOS One. 2024 Jun 28;19(6):e0305032. doi: 10.1371/journal.pone.0305032

Compositionally aware estimation of cross-correlations for microbiome data

Ib Thorsgaard Jensen 1,2,*, Luc Janss 3, Simona Radutoiu 1, Rasmus Waagepetersen 2,*
Editor: Enrique Hernandez-Lemus4
PMCID: PMC11213360  PMID: 38941272

Abstract

In the field of microbiome studies, it is of interest to infer correlations between abundances of different microbes (here referred to as operational taxonomic units, OTUs). Several methods taking the compositional nature of the sequencing data into account exist. However, these methods cannot infer correlations between OTU abundances and other variables. In this paper we introduce the novel methods SparCEV (Sparse Correlations with External Variables) and SparXCC (Sparse Cross-Correlations between Compositional data) for quantifying correlations between OTU abundances and either continuous phenotypic variables or components of other compositional datasets, such as transcriptomic data. SparCEV and SparXCC both assume that the average correlation in the dataset is zero. Iterative versions of SparCEV and SparXCC are proposed to alleviate bias resulting from deviations from this assumption. We compare these new methods to empirical Pearson cross-correlations after applying naive transformations of the data (log and log-TSS). Additionally, we test the centered log ratio transformation (CLR) and the variance stabilising transformation (VST). We find that CLR and VST outperform naive transformations, except when the correlation matrix is dense. SparCEV and SparXCC outperform CLR and VST when the number of OTUs is small and perform similarly to CLR and VST for large numbers of OTUs. Adding the iterative procedure increases accuracy for SparCEV and SparXCC for all cases, except when the average correlation in the dataset is close to zero or the correlation matrix is dense. These results are consistent with our theoretical considerations.

Introduction

Sequencing data are ubiquitous in modern biology [1]. For example, RNA-seq data have been used to identify genes associated with clinical outcomes of cancer patients [2], for human disease profiling [3], and to identify genes with possible links to Rett Syndrome [4]. Microbiome data have drawn much attention in recent years, particularly regarding the human gut microbiome. Composition of the human gut microbiome has been shown to be associated with several aspects of human health, such as obesity [5] and metabolic disorders [6]. More recently, the integration of microbiome data with other omics data has received increasing interest [7–11].

Data from sequencing technologies pose many specific challenges. They produce count data with technical noise, which makes results from rare features difficult to interpret. Additionally, they are compositional, meaning that the observed variables are components of an arbitrary total. As a result, apparent correlations involving sequencing data, such as microbiome data, may be due to the technical constraints and not biologically meaningful. We refer to this phenomenon as compositional effects. For datasets with few variables, direct examination of correlations might seem appealing. However, compositional effects become more pronounced in case of few variables and hence direct examination of the observed data does not suffice. Instead, it is pertinent to develop statistical methods that correct for compositional effects.

Within the field of microbiome studies, several methods have been proposed to infer interactions between microbial abundances. These include Local Similarity Analysis [12], which finds non-linear relationships through time using time series data; Sparse Compositional Correlations (SparCC), which infers correlations based on compositional data [13]; and Sparse Inverse Covariance Estimation for Ecological Association Inference (SPIEC-EASI), which infers relations through graphical models [14]. In this paper, we focus on the estimation of correlation coefficients.

When considering correlations in the context of compositional data, there are essentially three cases of interest: A) correlations between features of the same compositional dataset, B) cross-correlations between features of a compositional dataset and non-compositional variables, and C) cross-correlations between two compositional datasets. Correlations between bacterial abundances in a microbiome are an example of case A. An example of case B is cross-correlations between gut microbes and clinical features of patients [15], and an example of case C is cross-correlations between microbial abundances and gene expression levels from RNA-seq data [7]. For an overview of these cases, see Table 1.

Table 1. Cases A, B, and C along with explanations and an overview of applicable methods.

| Case | Correlations | Methods |
|---|---|---|
| A | Within a composition | SparCC, SPIEC-EASI, Local Similarity Analysis, Pearson |
| B | Between a composition and an external variable | SparCEV, Pearson |
| C | Between two compositions | SparXCC, SPIEC-EASI, mmVec, Pearson |

The methods mentioned above all operate in case A. Case B appears not to have received much attention so far, while case C has recently gained more interest, with new methods being developed. For example, SPIEC-EASI has been extended to infer interactions between variables from two compositional datasets [16]. Like the original SPIEC-EASI, pairs of variables that are conditionally independent are identified by estimating the precision matrix using a penalized estimation scheme to enforce sparsity. The method mmVec [17] is designed for identifying interactions between OTU abundances and metabolite concentrations. This method employs a neural network architecture to estimate the probability of observing a metabolite, given that a specific OTU is observed. It was shown to perform with accuracy similar to that of the extended SPIEC-EASI and to outperform correlation-based procedures. However, the correlations were estimated using a flawed methodology, where the centered log-ratio (CLR) transformation was applied to both datasets simultaneously, rather than separately, thus biasing the results. Quinn and Erb [18] showed that when the CLR-transformation was applied appropriately, correlation-based methods outperform both mmVec and SPIEC-EASI in the setup examined by Morton et al. [17]. In response, Morton et al. showed that it was possible to construct scenarios where mmVec outcompeted all alternatives [19]. In conclusion, there is scope for further developing and investigating methods for cases B and C.

In this paper, we focus on inferring cross-correlations in cases B and C. We introduce two novel compositionally aware methods, SparCEV (Sparse Correlations with External Variables) and SparXCC (Sparse Cross-Correlations between Compositional data). Using simulation studies, we compare these methods to Pearson cross-correlations applied to various transformations of the data. Theoretical comparisons of transformation-based methods and derivations of new methods are given in the supplementary material.

Materials and methods

Modelling sequencing data

Let $a_i$ denote the absolute abundance of OTU $i$, $i = 1, \dots, p$, $A = \sum_{j=1}^{p} a_j$, and $r_i = a_i/A$ the relative abundance. The aim of this paper is to estimate the correlation between $\log a_i$ and other log transformed variables. However, we only have access to observed read counts, denoted $x_i$ for OTU $i$. To theoretically compare the different strategies and to develop new methods, we adopt a simplified modelling framework, where

$$x_i = r_i N, \tag{1}$$

where $N = \sum_{j=1}^{p} x_j$ denotes the library size. We compare the methods considered using simulations from models that are more complex and realistic than (1), see the section “Simulation models”. In these models, $x_i$ given $(r_i, N)$ is not fixed. We use the term technical variance for the variance of $x_i$ given $(r_i, N)$ and the term biological variance for the variance of $r_i$. The more realistic models are, however, intractable for theoretical analysis.

All tested methods but one require log-transformation of the $x_i$, which is problematic if $x_i = 0$ is observed. As a remedy, we add 1 to all read counts prior to log-transformation of the data (the pseudo-count method).

Existing strategies for cross-correlation estimation

Naive transformation

We use the term naive transformation to refer to any transformation that does not take the compositional nature of the data into account. Naive transformations considered in this paper are log and log total sum scaling (TSS). Theoretically, naive transformations do not adequately account for the compositional structure of the data (see S1 Text). Nonetheless, it remains a common practice to apply these transformations [20] or no transformations at all [10, 15, 21–24]. As a result, any method that outperforms naive transformations would constitute an improvement relative to common practice.

Adapted transformations

We use the term adapted transformations to refer to transformations that are adapted to the particular structure of the data beyond differing library sizes. In this paper, we consider the centered log-ratio (CLR) and the variance-stabilising transformation (VST). See Table 2 for definitions of all transformations (naive and adapted) considered in this paper. Some common transformations, such as the trimmed mean of M-values [25], DESeq’s median-based transformation [26], and upper-quartile transformation [27], are not included, since they do not correct for within-replicate biases and are thus not applicable for correlation estimation.

Table 2. An overview of the transformations used to assess cross-correlations.

In VST, k is a replicate index.

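| Transformation | Expression | Interpretation |
|---|---|---|
| log | $\log x_i$ | Log-transformed observed read counts |
| log-TSS | $\operatorname{logTSS}(x_i) = \log\left(\frac{x_i}{\sum_{j=1}^{p} x_j}\right)$ | The log of estimated relative abundances |
| CLR | $\operatorname{CLR}(x_i) = \log x_i - \frac{1}{p}\sum_{j=1}^{p} \log x_j$ | Abundances relative to the average abundance |
| VST | $\operatorname{VST}(x_{ik}) = \int_0^{x_{ik}/\hat{s}_k} f(\mu)\,d\mu$, where $f(\mu) = \mu^2 \hat{a}_0 + \mu(1 + \hat{a}_1)$ | Removal of mean-variance relationship |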

VST makes use of the DESeq modelling framework [26]. Specifically, it assumes that xik ∼ NB(μik, ϕi), where μik = skλi and ϕi = a0 + a1i. The estimates $\hat{a}_1$, $\hat{a}_0$, $\hat{\lambda}_i$ and $\hat{s}_k$ are obtained using the DESeq2 estimation procedures [26, 28]. Here, k denotes the index of the biological replicate. VST requires at least one feature without any zero counts. This may be violated for data with small p, which is frequently the case for our simulated microbiome data. We therefore only consider the VST transformation in case C for simulated gene expression data while applying CLR to the microbiome data.

For convenience, we use the term CLR for the method where empirical Pearson cross-correlations are applied to CLR-transformed data, and likewise for log, log-TSS, and VST.
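As a minimal sketch of the log, log-TSS, and CLR transformations in Table 2, the following Python functions operate on the read counts of a single replicate. The function names and the example counts are hypothetical, and applying the pseudo-count before total sum scaling is an assumption made here; the text only states that 1 is added prior to log-transformation.

```python
import math

def log_pseudo(counts):
    """Log-transform read counts after adding a pseudo-count of 1."""
    return [math.log(x + 1) for x in counts]

def log_tss(counts):
    """Log of estimated relative abundances (total sum scaling).

    Assumption: the pseudo-count is added before scaling by the total.
    """
    shifted = [x + 1 for x in counts]
    total = sum(shifted)
    return [math.log(x / total) for x in shifted]

def clr(counts):
    """Centered log-ratio: log counts minus their within-replicate mean."""
    logs = log_pseudo(counts)
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical read counts for one replicate with p = 5 OTUs.
replicate = [120, 0, 35, 7, 838]
log_values = log_pseudo(replicate)
tss_values = log_tss(replicate)
clr_values = clr(replicate)
```

By construction, the CLR values of a replicate sum to zero, and differences between two CLR values equal the corresponding log-ratios of the (pseudo-counted) read counts.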

Theoretical assessment of transformations

We examine theoretically whether empirical Pearson cross-correlations combined with the transformations presented in Table 2 are likely to yield good approximations of cross-correlations. We focus on case B, since it is simpler and the results in case C are analogous. We seek an approximation of $\operatorname{Corr}[\log a_i, b]$, where $b$ is a non-compositional variable, here referred to as a phenotypic variable. By (1) and Table 2 we have

$$x_i = \frac{a_i}{A} N, \qquad \operatorname{TSS}(x_i) = \frac{a_i}{A}, \qquad \operatorname{CLR}(x_i) = \log a_i - \frac{1}{p}\sum_{j=1}^{p} \log a_j.$$

By definition,

$$\operatorname{Corr}[\log a_i, b] = \frac{\operatorname{Cov}[\log a_i, b]}{\sqrt{\operatorname{Var}[\log a_i]\,\operatorname{Var}[b]}}, \tag{2}$$

and we require good approximations of the numerator and the denominator. Since $b$ is not compositional, $\operatorname{Var}[b]$ can easily be estimated without the need for approximation. According to our derivations in S1 Text, log, log-TSS, and CLR all lead to reasonable approximations of the covariance $\operatorname{Cov}[\log a_i, b]$ under the model in (1) given appropriate assumptions. Furthermore, we show that

$$\operatorname{Var}[\operatorname{CLR}(x_i)] \approx \operatorname{Var}[\log a_i]$$

when $p$ is large. However, analogous results do not hold for log and log-TSS, demonstrating that naive transformations are not sufficient. Summing up, CLR can yield good approximations under the model in (1) with the following assumptions:

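  • (Bi) $\frac{1}{p}\sum_{j \neq i} \operatorname{Cov}[\log a_i, \log a_j] \approx 0$ for all $i$

  • (Bii) $\frac{1}{p}\sum_{i=1}^{p} \operatorname{Cov}[\log a_i, b] \approx 0$ for all $i$

  • (Biii) The number of OTUs, $p$, is large.

In case C, we need (Bi), (Biii), and the additional assumptions

  • (Ci) $\frac{1}{q}\sum_{l \neq k} \operatorname{Cov}[\log b_k, \log b_l] \approx 0$ for all $k$

  • (Cii) $\frac{1}{q}\sum_{l=1}^{q} \operatorname{Cov}[\log a_i, b_l] \approx 0$ for all $i$

  • (Ciii) $\frac{1}{p}\sum_{j=1}^{p} \operatorname{Cov}[\log a_j, b_k] \approx 0$ for all $k$

  • (Civ) The number of genes, $q$, is large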

Conditions (Bi), (Bii) and (Ci)-(Ciii) hold if the correlation matrix is sparse; thus, following the language of Friedman and Alm [13], we refer to these as sparsity assumptions. This is a slight abuse of terminology, since these conditions may also hold if all rows of the correlation matrix contain entries whose distributions are symmetric around zero, even though such a matrix is not sparse.

Compositionally aware methods

Inspired by SparCC, we introduce compositionally aware methods for cases B and C. In case B, we assume the same sparsity condition as for CLR in the previous section. We then show in S1 Text that

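$$\operatorname{Corr}[\log a_i, b] \approx \frac{1}{\sigma_b \alpha_i} \frac{1}{p-1} \sum_{j \neq i} \operatorname{Cov}\left[\log\frac{x_i}{x_j}, b\right],$$

where $\alpha_i^2 = \operatorname{Var}[\log a_i]$ can be estimated by SparCC and $\sigma_b^2 = \operatorname{Var}[b]$ can be estimated in a standard fashion. In contrast to CLR, this method only requires conditions (Bi) and (Bii), but not (Biii). Therefore, it is likely preferable when $p$ is small. We name this method Sparse Correlations with External Variables (SparCEV).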
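The SparCEV base estimator can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CompoCor implementation: the helper `sparcc_style_variances` is a hypothetical single-pass stand-in for the SparCC variance estimate (it solves the sparsity approximation in closed form and omits SparCC's iterations), the data layout (one list per OTU) is chosen for convenience, and the toy data follow the noise-free model (1) with only OTU 0 correlated with $b$.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Unbiased empirical covariance of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def sparcc_style_variances(log_x):
    """Single-pass approximation of Var[log a_i] (an assumption, not full SparCC).

    With t_i = sum_{j != i} Var[log(x_i / x_j)] and negligible correlations,
    t_i ~ (p - 2) * w_i + sum_j w_j, which is solved in closed form for w_i.
    """
    p = len(log_x)
    t = []
    for i in range(p):
        t_i = 0.0
        for j in range(p):
            if j != i:
                diff = [a - b for a, b in zip(log_x[i], log_x[j])]
                t_i += cov(diff, diff)
        t.append(t_i)
    total_w = sum(t) / (2 * p - 2)
    # Clamp at a tiny positive value so the square root below is defined.
    return [max((t_i - total_w) / (p - 2), 1e-12) for t_i in t]

def sparcev_base(log_x, b):
    """log_x: p lists of n log counts (one list per OTU); b: n phenotype values."""
    p = len(log_x)
    c = [cov(row, b) for row in log_x]              # Cov[log x_i, b]
    sigma_b = math.sqrt(cov(b, b))
    alpha = [math.sqrt(w) for w in sparcc_style_variances(log_x)]
    total_c = sum(c)
    # (1/(p-1)) * sum_{j != i} Cov[log(x_i/x_j), b] = (p*c_i - sum_j c_j) / (p-1)
    return [(p * c[i] - total_c) / ((p - 1) * sigma_b * alpha[i]) for i in range(p)]

# Toy data from the noise-free model (1); only OTU 0 is correlated with b.
rng = random.Random(1)
n, p, rho = 500, 12, 0.7
b, log_x = [], [[] for _ in range(p)]
for _ in range(n):
    log_a = [rng.gauss(0, 1) for _ in range(p)]
    b.append(rho * log_a[0] + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
    log_total = math.log(sum(math.exp(v) for v in log_a))
    log_lib = math.log(rng.uniform(5e3, 5e4))       # arbitrary library size N
    for i in range(p):
        log_x[i].append(log_a[i] - log_total + log_lib)
estimates = sparcev_base(log_x, b)
```

Because the empirical covariance is linear, the sum over $j \neq i$ never has to be formed pair by pair: it collapses to $p\,c_i - \sum_j c_j$ with $c_i$ the empirical covariance between $\log x_i$ and $b$.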

In case C, we let $b_k$, $k = 1, \dots, q$, denote the gene expression level of the $k$th gene, $B = \sum_{k=1}^{q} b_k$, and $M$ the library size. Similar to (1), we assume the model

$$y_k = \frac{b_k}{B} M$$

for the observed gene expression level. In S1 Text, we obtain the relation

$$\operatorname{Corr}[\log a_i, \log b_k] \approx \frac{t_{ik}}{(p-1)(q-1)\alpha_i \beta_k}, \tag{3}$$

where

$$t_{ik} = \sum_{j \neq i} \sum_{l \neq k} \operatorname{Cov}\left[\log\frac{x_i}{x_j}, \log\frac{y_k}{y_l}\right].$$

The parameters $\alpha_i^2 = \operatorname{Var}[\log a_i]$ and $\beta_k^2 = \operatorname{Var}[\log b_k]$ can be approximated by applying SparCC to the microbiome and gene expression datasets individually, and the covariances in $t_{ik}$ can be estimated in a standard fashion. Details regarding efficient computation of the $t_{ik}$ are given in the S1 Text. As with CLR, we need the assumptions (Bi) and (Ci)-(Ciii), but unlike CLR, we do not need (Biii) and (Civ). We refer to this method as Sparse Cross-Correlations between Compositional data (SparXCC, where “X” represents “cross”).
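One way to compute all $t_{ik}$ without the double sum is sketched below. It relies on the algebraic identity $t_{ik} = pq\,\operatorname{Cov}[\operatorname{CLR}(x_i), \operatorname{CLR}(y_k)]$ for empirical covariances, which follows from bilinearity; whether S1 Text uses this particular shortcut is not confirmed here. The function names are hypothetical, and `alpha` and `beta` are assumed to be supplied by a separate SparCC run on each dataset.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Unbiased empirical covariance of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def clr_rows(log_counts):
    """Center each replicate's log counts (one list per feature)."""
    n_feat, n_rep = len(log_counts), len(log_counts[0])
    rep_means = [sum(row[s] for row in log_counts) / n_feat for s in range(n_rep)]
    return [[row[s] - rep_means[s] for s in range(n_rep)] for row in log_counts]

def t_brute(log_x, log_y, i, k):
    """t_ik by the direct double sum over j != i and l != k."""
    total = 0.0
    for j in range(len(log_x)):
        if j == i:
            continue
        dx = [a - b for a, b in zip(log_x[i], log_x[j])]
        for l in range(len(log_y)):
            if l == k:
                continue
            dy = [a - b for a, b in zip(log_y[k], log_y[l])]
            total += cov(dx, dy)
    return total

def t_fast(log_x, log_y):
    """All t_ik at once via t_ik = p * q * Cov[CLR(x_i), CLR(y_k)]."""
    cx, cy = clr_rows(log_x), clr_rows(log_y)
    p, q = len(cx), len(cy)
    return [[p * q * cov(cx[i], cy[k]) for k in range(q)] for i in range(p)]

def sparxcc_base(log_x, log_y, alpha, beta):
    """Relation (3); alpha and beta are assumed to come from SparCC."""
    p, q = len(log_x), len(log_y)
    t = t_fast(log_x, log_y)
    return [[t[i][k] / ((p - 1) * (q - 1) * alpha[i] * beta[k])
             for k in range(q)] for i in range(p)]

# Arbitrary toy log counts, used only to check the identity numerically.
rng = random.Random(0)
n, p, q = 30, 4, 5
log_x = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
log_y = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(q)]
fast = t_fast(log_x, log_y)
rho_hat = sparxcc_base(log_x, log_y, alpha=[1.0] * p, beta=[1.0] * q)
```

The brute-force version costs on the order of $pq$ covariance evaluations per pair $(i, k)$, whereas the CLR-based version needs one covariance per pair after centering each replicate once.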

Iterative procedures

Unless all OTUs are uncorrelated with the non-compositional variables (case B) or the compositional variables of the other dataset (case C), the estimates above are biased. Specifically, in case B, we get

$$\frac{1}{p-1} \sum_{j \neq i} \operatorname{Cov}\left[\log\frac{x_i}{x_j}, b\right] = \operatorname{Cov}[\log a_i, b] - \frac{1}{p-1} \sum_{j \neq i} \operatorname{Cov}[\log a_j, b].$$

Estimates based on the left-hand side are useful under the assumption that the second term on the right-hand side is small. However, in practice this assumption may be violated, leading to estimation bias.

Now suppose we could identify the set $R = \{j : \operatorname{Corr}[\log a_j, b] = 0\}$. Then,

$$\frac{1}{|R|} \sum_{j \in R} \operatorname{Cov}\left[\log\frac{x_i}{x_j}, b\right] = \operatorname{Cov}[\log a_i, b].$$

This observation motivates an iterative procedure, similar to the one employed by SparCC in case A. For iteration $n$, we estimate

$$\hat{\rho}_i^{(n)} = \frac{1}{\hat{\sigma}_b \hat{\alpha}_i} \frac{1}{|R_n|} \sum_{j \in R_n} \hat{C}_{ij},$$

where $\hat{C}_{ij}$ and $\hat{\sigma}_b^2$ are the standard empirical estimates of $\operatorname{Cov}\left[\log\frac{x_i}{x_j}, b\right]$ and $\operatorname{Var}[b]$ respectively, $\hat{\alpha}_i^2$ is the SparCC estimate of $\operatorname{Var}[\log a_i]$, and

$$R_n = \{i : |\hat{\rho}_i^{(n-1)}| < t\}, \tag{4}$$

where $t$ is some user-specified threshold. To initialize, we set $R_0 = \{1, \dots, p\}$, and iteration concludes once $R_n = R_{n-1}$. When a distinction is necessary, we refer to SparCEV without and with the iterative procedure as SparCEV base and SparCEV iterative respectively. Unless otherwise stated, SparCEV refers to SparCEV iterative.
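The iteration in case B can be sketched as follows. The function `sparcev_iterative` and its arguments are hypothetical: it takes the empirical covariances $c_i$ between $\log x_i$ and $b$ as input and uses $\hat{C}_{ij} = c_i - c_j$, which holds by linearity of the empirical covariance. The iteration cap and the handling of an empty reference set are assumptions added here for safety. In the toy input, two OTUs have true covariance 0.6 with $b$, the rest have none, and every observed covariance is shifted by a common compositional term of 0.3.

```python
def sparcev_iterative(c, sigma_b, alpha, t, max_iter=100):
    """Iterate the case B estimate until the reference set stops changing.

    c[i] is the empirical Cov[log x_i, b]; alpha[i] estimates sd(log a_i).
    Returns the final estimates and the reference set used to compute them.
    """
    p = len(c)
    ref = list(range(p))                       # R_0 = {1, ..., p}
    rho = None
    for _ in range(max_iter):
        ref_mean = sum(c[j] for j in ref) / len(ref)
        # (1/|R_n|) * sum_{j in R_n} C_ij, with C_ij = c_i - c_j
        rho = [(c[i] - ref_mean) / (sigma_b * alpha[i]) for i in range(p)]
        new_ref = [i for i in range(p) if abs(rho[i]) < t]
        # Stop on convergence; also stop if no OTU qualifies (assumed guard).
        if not new_ref or new_ref == ref:
            break
        ref = new_ref
    return rho, ref

# Idealized covariances: truth is 0.6 for OTUs 0-1 and 0 for the other eight,
# each shifted by a common compositional term of -0.3.
c_obs = [0.3, 0.3] + [-0.3] * 8
alpha = [1.0] * 10
rho_first, _ = sparcev_iterative(c_obs, 1.0, alpha, t=0.2, max_iter=1)
rho_hat, ref_set = sparcev_iterative(c_obs, 1.0, alpha, t=0.2)
```

In this idealized input the first pass gives 0.48 for the two correlated OTUs and -0.12 for the others, and the second pass, which averages only over the eight uncorrelated OTUs, recovers 0.6 and 0.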

In case C, we have

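$$\frac{t_{ik}}{(p-1)(q-1)} = \operatorname{Cov}[\log a_i, \log b_k] - \frac{1}{p-1} \sum_{j \neq i} \operatorname{Cov}[\log a_j, \log b_k] - \frac{1}{q-1} \sum_{l \neq k} \operatorname{Cov}[\log a_i, \log b_l] + \frac{1}{(p-1)(q-1)} \sum_{j \neq i} \sum_{l \neq k} \operatorname{Cov}[\log a_j, \log b_l].$$

In order to eliminate the last three terms, we need the sets $S = \{i : \rho_{ik} = 0 \text{ for all } k\}$ and $T = \{k : \rho_{ik} = 0 \text{ for all } i\}$. Then,

$$\frac{1}{(p-1)(q-1)\alpha_i \beta_k} \sum_{j \in S} \sum_{l \in T} \operatorname{Cov}\left[\log\frac{x_i}{x_j}, \log\frac{y_k}{y_l}\right] = \operatorname{Corr}[\log a_i, \log b_k].$$

In iteration $n$, we estimate

$$\hat{\rho}_{ik}^{(n)} = \frac{1}{(p-1)(q-1)\hat{\alpha}_i \hat{\beta}_k} \sum_{j \in S_n} \sum_{l \in T_n} \hat{C}_{ijkl},$$

where $\hat{C}_{ijkl}$ is the standard empirical estimate of $\operatorname{Cov}\left[\log\frac{x_i}{x_j}, \log\frac{y_k}{y_l}\right]$, $\hat{\alpha}_i^2$ and $\hat{\beta}_k^2$ are the SparCC estimates of $\operatorname{Var}[\log a_i]$ and $\operatorname{Var}[\log b_k]$ respectively, and

$$S_n = \left\{i : \frac{1}{q} \sum_{l=1}^{q} |\hat{\rho}_{il}^{(n-1)}| < t_1\right\}, \qquad T_n = \left\{k : \frac{1}{p} \sum_{j=1}^{p} |\hat{\rho}_{jk}^{(n-1)}| < t_2\right\},$$

where $t_1$ and $t_2$ are user-specified thresholds. Similar to case B, we set $S_0 = \{1, \dots, p\}$ and $T_0 = \{1, \dots, q\}$, and iterate until $T_n = T_{n-1}$ and $S_n = S_{n-1}$. We use the terminology SparXCC base and SparXCC iterative analogously to SparCEV base and SparCEV iterative.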

For some datasets, there may be no bias to correct and in such cases SparXCC base or SparCEV base are appropriate. To assess this in practice, one may estimate the correlation coefficients with both the base and iterative versions and then plot them against each other. The bias terms for each correlation coefficient are approximately identical, at least when p (and q in case C) are large. Thus, the discrepancies between the base estimates and the iterative estimates should be similar for all i (and k in case C). Consequently, if the pairs of estimates are far from a straight line with slope 1, this indicates that the iterative procedure does not produce useful estimates. This may happen, for example, when the threshold is too low, whereby too few OTUs (and/or genes) are included. On the other hand, if the pairs of estimates are close to a straight line with slope 1 and intercept different from zero, this indicates that the iterative procedure succeeds in correcting for the bias and should thus be used.

Choice of threshold

The thresholds t, t1, and t2 are important when carrying out the iterative versions of SparCEV and SparXCC. If they are set too low, few variables may qualify and the estimates in the subsequent iterations may become unreliable. If they are set too high, we risk including highly correlated OTUs (and genes in case C) which also renders the estimates less reliable. To address this, we employ a bootstrap procedure to select the thresholds. In case B, we permute the non-compositional variable, thus breaking the correlation with all OTUs. Then, we use SparCEV base on this permuted dataset to obtain the set of estimates

$$M^{\mathrm{Perm}} = \{|\hat{\rho}_1^{\mathrm{Perm}}|, \dots, |\hat{\rho}_p^{\mathrm{Perm}}|\}.$$

We can ensure that the vast majority of uncorrelated OTUs are in $R_1$ (cf. (4)) by choosing $t = \max M^{\mathrm{Perm}}$, but we may still risk including correlated pairs. Alternatively, one may use some percentile of $M^{\mathrm{Perm}}$. We use the 80th percentile by default, but the results should always be examined relative to the base version as described above. If the results are far from a straight line with slope 1, the user may wish to experiment with different choices of $t$, such as different percentiles of the bootstrap set.
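The threshold selection in case B can be sketched as below. The names `permutation_threshold`, `percentile`, and `pearson_stand_in` are hypothetical; the estimator is passed in as a function so that a SparCEV base implementation can be substituted for the plain Pearson stand-in used here, and linear interpolation is an assumed percentile convention.

```python
import math
import random

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def permutation_threshold(log_x, b, estimator, q=80, seed=0):
    """Permute b, re-estimate, and return the q-th percentile of |estimates|."""
    rng = random.Random(seed)
    b_perm = list(b)
    rng.shuffle(b_perm)                        # breaks correlation with all OTUs
    m_perm = [abs(r) for r in estimator(log_x, b_perm)]
    return percentile(m_perm, q)

def pearson_stand_in(log_x, b):
    """Placeholder estimator (plain Pearson); SparCEV base would go here."""
    n = len(b)
    mb = sum(b) / n
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    out = []
    for row in log_x:
        mr = sum(row) / n
        sr = math.sqrt(sum((v - mr) ** 2 for v in row))
        out.append(sum((u - mr) * (v - mb) for u, v in zip(row, b)) / (sr * sb))
    return out

# Arbitrary toy data: 25 OTUs, 40 replicates.
rng = random.Random(3)
log_x = [[rng.gauss(0, 1) for _ in range(40)] for _ in range(25)]
b = [rng.gauss(0, 1) for _ in range(40)]
t_default = permutation_threshold(log_x, b, pearson_stand_in, q=80)
t_max = permutation_threshold(log_x, b, pearson_stand_in, q=100)
```

Setting `q=100` corresponds to the choice $t = \max M^{\mathrm{Perm}}$, while the default of 80 mirrors the 80th percentile mentioned in the text.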

Simulation models

We adopt the parametric model employed by SparseDOSSA2 [29], adapting the methodology slightly to handle cases B and C. A simulated dataset contains $n \geq 1$ replicates, where $n$ is the number of microbiome samples sequenced. The individual simulated variables (e.g. abundances or gene expression levels) are characterized by the mean, $\mu_i$, variance, $\sigma_i^2$, and zero-probability, $\pi_i$. The parameter $\pi_i$ reflects the probability that OTU $i$ is absent from a given replicate. We refer to this as a biological zero. The correlation between variables is characterized by the correlation matrix $\Psi$. We simulate $p$ OTU abundances, $a_1, \dots, a_p$, and $q$ other variables, $b_1, \dots, b_q$. In case B, the latter $q$ variables are non-compositional, typically $q = 1$, and we take $\pi_{p+k} = 0$, $k = 1, \dots, q$, so that biological zeros do not occur for the $b_k$. In case C, $q > 1$ and the $b_k$ are compositional. The library size $N_a$ is simulated from a log normal distribution with parameters $\mu_a$ and $\sigma_a^2$.

The simulation algorithm is given in the following steps. For ease of presentation, we present the case where $n = 1$, but when $n > 1$ the steps would simply be repeated $n$ times.

  1. Simulate the $(p + q)$-dimensional variable $g \sim N(0, \Psi)$.

  2. Define the variables $Z_i$ for $i = 1, \dots, p + q$ such that $Z_i = 0$ if $g_i < \Phi^{-1}(\pi_i)$ and $Z_i = F_i^{-1}(\Phi(g_i))$ otherwise, where $\Phi$ is the standard normal cumulative distribution function (cdf) and $F_i(t) = \pi_i + (1 - \pi_i)\Phi((\log t - \mu_i)/\sigma_i)$ is the cdf of a zero-inflated log-Gaussian distribution with parameters $(\pi_i, \mu_i, \sigma_i^2)$. We now have
    $\log Z_i \mid Z_i \neq 0 \sim N(\mu_i, \sigma_i^2)$ and $P(Z_i = 0) = \pi_i$.
  3. Set $a_i = Z_i$ for $i = 1, \dots, p$ as the absolute OTU abundances. In case C, $b_j = Z_{j+p}$ for the absolute gene expression levels. In case B, we let $b_j = \log Z_{j+p}$ for the non-compositional phenotypic variables.

  4. Set $r_i^a = a_i / \sum_{k=1}^{p} a_k$ as the relative abundances of the OTUs, and in case C, set $r_j^b = b_j / \sum_{k=1}^{q} b_k$ as the relative expression levels.

  5. Simulate $N_a \sim \log N(\mu_a, \sigma_a^2)$ and let $\lceil N_a \rceil$ be the library size.

  6. Simulate the vector, $x = (x_1, \dots, x_p)$, of observed read counts of OTU $1, \dots, p$ as $x \sim \operatorname{Multinom}(\lceil N_a \rceil, r_1^a, \dots, r_p^a)$.

  7. In case C, simulate $N_b \sim \log N(\mu_b, \sigma_b^2)$ and let $\lceil N_b \rceil$ be the library size.

  8. In case C, simulate the vector, $y = (y_1, \dots, y_q)$, of observed read counts of gene $j = 1, \dots, q$ as $y \sim \operatorname{Multinom}(\lceil N_b \rceil, r_1^b, \dots, r_q^b)$.

In case B, steps 7–8 are skipped. Excluding the simulation of the $b_j$, steps 1–6 in the above procedure are identical to the procedure employed by SparseDOSSA2 [29]. The above simulation scheme differs from the model (1) by the multinomial noise generated in steps 6 and 8, where the multinomial model is a simplistic representation of the randomness generated in the sequencing procedure. The correlation $\operatorname{Corr}[\log a_i, \log a_j]$ agrees with $\Psi_{ij}$ when $\pi_i = \pi_j = 0$. This does not hold in the presence of biological zeros, $\pi_i > 0$ or $\pi_j > 0$, in which case $\log a_i$ or $\log a_j$ may not even be well defined. We nevertheless use $\Psi$ as a ground truth for comparison with our estimates, including the case of biological zeros. In that way, the presence of biological zeros is considered a source of noise relative to our method in addition to the multinomial noise.
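Steps 1–6 of the simulation algorithm, restricted to the $p$ OTUs, can be sketched as below. This is an illustration rather than the SparseDOSSA2 code: the function names and all parameter values are hypothetical, the external variables $b_j$ are left out, and the multinomial draw is done with repeated categorical sampling from the standard library.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def cholesky(m):
    """Lower-triangular L with L L^T = m (m symmetric positive definite)."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(m[i][i] - s) if i == j else (m[i][j] - s) / L[j][j]
    return L

def simulate_replicate(psi, mu, sigma2, pi, mu_a, sigma2_a, rng):
    """One replicate of steps 1-6 for the OTUs only (b variables omitted)."""
    p = len(mu)
    L = cholesky(psi)
    z = [rng.gauss(0, 1) for _ in range(p)]
    g = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(p)]  # step 1
    a = []
    for i in range(p):                                                 # steps 2-3
        u = STD_NORMAL.cdf(g[i])
        if u <= pi[i]:
            a.append(0.0)                      # biological zero
        else:
            # Inverse of F_i: rescale u to (0, 1), then map to the log-normal.
            v = min((u - pi[i]) / (1 - pi[i]), 1 - 1e-12)
            a.append(math.exp(mu[i] + math.sqrt(sigma2[i]) * STD_NORMAL.inv_cdf(v)))
    total = sum(a)
    r = [ai / total for ai in a]                                       # step 4
    lib = math.ceil(rng.lognormvariate(mu_a, math.sqrt(sigma2_a)))     # step 5
    x = [0] * p
    for d in rng.choices(range(p), weights=r, k=lib):                  # step 6
        x[d] += 1
    return x, lib

# Hypothetical parameters for p = 3 OTUs; OTU 0 is never a biological zero.
rng = random.Random(42)
psi = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
counts, library_size = simulate_replicate(
    psi, mu=[2.0, 1.0, 0.5], sigma2=[1.0, 0.5, 0.8], pi=[0.0, 0.2, 0.5],
    mu_a=8.0, sigma2_a=0.25, rng=rng)
```

Keeping at least one OTU with zero-probability 0 guarantees that the relative abundances in step 4 are well defined in this sketch.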

The correlation matrix Ψ is constructed using two methods described in S1 Text. The first is called the cluster method, and it works by assigning a portion of the OTUs to a “cluster”. All OTUs in the cluster are correlated to each other with the same correlation coefficient and uncorrelated to every other OTU. All OTUs outside the cluster are also uncorrelated with each other. In case C, a similar portion of the genes are also assigned to the cluster, and in case B, all non-compositional variables are also assigned to the cluster. This gives us a high degree of control over the degree of sparsity and the strength of the correlations. The second method is called the loadings method, and it results in a correlation matrix without exact zero entries but where most variables are only weakly correlated, with a relatively small proportion of highly correlated variable pairs. The loadings method most likely results in more realistic systems than the cluster method, but it suffers from the limitation that it tends to produce matrices whose entries are symmetric around zero. This need not be true in a natural system.

Throughout the simulations in this paper, we simulate n = 50 replicates. In many practical settings, n is considerably lower than that. However, for the purposes of the present simulation study, it is important that we can detect biases in the estimators. If the bias is small relative to the variance of the estimator, it may be difficult to detect in a simulation study. Since the variance of an estimator increases as n decreases, it is counter-productive to perform simulation studies with small n. In other words, we construct a situation where the main bottleneck to producing accurate results is the chosen method, not the size of the dataset.

Selecting parameter values

In the simulation studies carried out in this paper, the log-scale parameters, $\mu_i$ and $\sigma_i^2$, and zero-probabilities, $\pi_i$, are chosen using a real dataset as a template. We estimate the mean, $\mu_{ri}$, and variance, $\sigma_{ri}^2$, of the observed read counts for $i = 1, \dots, p$. We then choose $\mu_i$ and $\sigma_i^2$ such that the simulated variables have mean $\mu_{ri}$ and variance $\sigma_{ri}^2$ on the linear scale. By the properties of the log-normal distribution, the means and variances are related by

$$\sigma_i^2 = \log\left(1 + \frac{\sigma_{ri}^2}{\mu_{ri}^2}\right), \qquad \mu_i = \log \mu_{ri} - \frac{\sigma_i^2}{2}. \tag{5}$$

The parameters πi are set to half the proportion of zeros for the ith variable, with the assumption that half of the zeros are biological and the other half are technical. In case B, the parameters of the phenotype variable are somewhat arbitrarily chosen so that it has a mean of 30 and a variance of 1. For the microbiome data, we use a dataset by Tao et al. [30] as a template, and for the gene expression data we use a currently unpublished dataset. All parameters used in the simulations are available at https://github.com/IbTJensen/Microbiome-Cross-correlations/.
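The moment matching in (5) is short enough to verify numerically. In the sketch below, the function name and the template mean and variance are hypothetical; the check uses the standard log-normal moments, $\mathrm{E}[Z] = e^{\mu + \sigma^2/2}$ and $\operatorname{Var}[Z] = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$.

```python
import math

def lognormal_params(mean_linear, var_linear):
    """Log-scale (mu, sigma^2) matching a linear-scale mean and variance, as in (5)."""
    sigma2 = math.log(1 + var_linear / mean_linear ** 2)
    mu = math.log(mean_linear) - sigma2 / 2
    return mu, sigma2

# Hypothetical template values for one OTU: mean 250 reads, variance 40000.
mu_i, sigma2_i = lognormal_params(mean_linear=250.0, var_linear=4.0e4)

# Moments implied by the log-normal distribution with these parameters.
implied_mean = math.exp(mu_i + sigma2_i / 2)
implied_var = (math.exp(sigma2_i) - 1) * math.exp(2 * mu_i + sigma2_i)
```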

We also examine the impact of diversity on the accuracy of the correlation estimation methods. We measure diversity using the effective number of OTUs, $p_{\mathrm{eff}}$. We have $p_{\mathrm{eff}} = e^{H}$, where $H = -\sum_{i=1}^{p} r_i \log r_i$ is the entropy or Shannon index. The quantity $p_{\mathrm{eff}}$ can be interpreted as the minimal number of OTUs such that a replicate has entropy $H$. This occurs when all OTUs are equally abundant. Here, we choose $\sigma_i^2 = 1$ and $\pi_i = 0$ for $i = 1, \dots, p$ and $\mu_i$ in such a way that we get a specific value of $p_{\mathrm{eff}}$ in expectation. This is accomplished by selecting the linear-scale mean relative abundances $\nu_i = \frac{1 - \nu_1}{p - 1}$ for $i \geq 2$ and obtaining $\nu_1$ by solving the equation

$$-\log p_{\mathrm{eff}} = \nu_1 \log \nu_1 + \sum_{i=2}^{p} \frac{1 - \nu_1}{p - 1} \log \frac{1 - \nu_1}{p - 1} \tag{6}$$

for $\nu_1$ given a choice of $p_{\mathrm{eff}}$. We then choose an arbitrary value for the microbial load, say 1000, and set $\mu_{ri} = 1000\nu_i$. Finally, the $\mu_i$ are obtained from the right part of (5).
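Equation (6) has no closed-form solution for $\nu_1$, but it can be solved numerically. The sketch below uses bisection and assumes that OTU 1 is the dominant one, that is, it searches $\nu_1 \in [1/p, 1)$, where the entropy decreases monotonically from $\log p$ to 0. The text does not state which root is intended, so this branch choice, like the function names, is an assumption.

```python
import math

def entropy_two_level(nu1, p):
    """Shannon entropy when OTU 1 has share nu1 and the rest share equally."""
    rest = (1 - nu1) / (p - 1)
    return -nu1 * math.log(nu1) - (p - 1) * rest * math.log(rest)

def solve_nu1(p_eff, p, tol=1e-12):
    """Bisection on [1/p, 1); requires 1 < p_eff <= p."""
    target = math.log(p_eff)
    lo, hi = 1 / p, 1 - 1e-15
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if entropy_two_level(mid, p) > target:
            lo = mid                           # entropy still too high: raise nu1
        else:
            hi = mid
    return (lo + hi) / 2

p, p_eff = 100, 10
nu1 = solve_nu1(p_eff, p)
nu = [nu1] + [(1 - nu1) / (p - 1)] * (p - 1)
mu_r = [1000 * v for v in nu]                  # microbial load of 1000, as in the text
achieved_p_eff = math.exp(entropy_two_level(nu1, p))
```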

Methods assessment

We might assess the accuracy of correlation estimates $\hat{\rho}_{ij}$ by comparing the true correlations to the estimated correlations by computing, for example, the mean absolute error (MAE). However, suppose we use the estimate $\hat{\rho}_{ij} = 0$ for all $i, j$ and pick, for example, $c = 0.05$ (the proportion of OTUs in the cluster) and $\rho = 0.75$ (the within-cluster correlation) for the cluster method. Then, even though non-zero correlations are not well estimated, the MAE, $c\rho = 0.0375$, is quite low. Thus, we separately consider the MAE of the pairs whose true correlation is zero and the MAE of the pairs whose true correlation is non-zero. In case of the loadings method, no correlations are exactly zero, so we instead assess the MAE of pairs whose absolute true correlation coefficient is at least $u$, for the thresholds $u = 0, 0.1, \dots, 0.8$.

In summary, for the cluster method, we use the criteria

$$\frac{1}{|S|} \sum_{(i,j) \in S} |\Psi_{ij} - \hat{\rho}_{ij}|, \qquad \frac{1}{pq - |S|} \sum_{(i,j) \in A_{pq} \setminus S} |\Psi_{ij} - \hat{\rho}_{ij}|,$$

where $S = \{(i, j) \in A_{pq} : \Psi_{ij} \neq 0\}$ and $A_{pq} = \{(i, j) \in \mathbb{N}^2 \mid 0 < i \leq p,\ p < j \leq p + q\}$. For the loadings method, we use the criteria

$$\frac{1}{|S_u|} \sum_{(i,j) \in S_u} |\Psi_{ij} - \hat{\rho}_{ij}|, \quad \text{where } S_u = \{(i, j) \in A_{pq} : |\Psi_{ij}| \geq u\}, \quad \text{for } u = 0, 0.1, \dots, 0.8, \tag{7}$$

where $u = 0$ corresponds to the overall MAE.
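The criteria can be sketched directly, here with the true and estimated cross-correlations given as flat lists over the pairs in $A_{pq}$. The function names are hypothetical, and returning `None` for an empty set of pairs is an assumed convention. The example reproduces the all-zero estimate discussed above with $p = 100$, $q = 1$, $c = 0.05$, and $\rho = 0.75$.

```python
def mae(pairs):
    """Mean absolute error over (truth, estimate) pairs; None if there are none."""
    if not pairs:
        return None
    return sum(abs(truth - est) for truth, est in pairs) / len(pairs)

def cluster_criteria(psi, rho_hat):
    """MAE over truly non-zero pairs and over truly zero pairs."""
    pairs = list(zip(psi, rho_hat))
    nonzero = [(t, e) for t, e in pairs if t != 0]
    zero = [(t, e) for t, e in pairs if t == 0]
    return mae(nonzero), mae(zero)

def loadings_criteria(psi, rho_hat, thresholds=tuple(u / 10 for u in range(9))):
    """MAE over pairs with |truth| >= u for u = 0, 0.1, ..., 0.8, as in (7)."""
    pairs = list(zip(psi, rho_hat))
    return {u: mae([(t, e) for t, e in pairs if abs(t) >= u]) for u in thresholds}

# All-zero estimate with 5 of 100 OTUs truly correlated at 0.75.
truth = [0.75] * 5 + [0.0] * 95
estimate = [0.0] * 100
mae_nonzero, mae_zero = cluster_criteria(truth, estimate)
by_threshold = loadings_criteria(truth, estimate)
overall_mae = by_threshold[0.0]
```

The overall MAE of 0.0375 looks favourable, while the MAE over the non-zero pairs is 0.75, which is why the two groups are reported separately.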

Discriminating between correlated and uncorrelated pairs

Since sparsity is only an approximate assumption, any test-statistic used to derive p-values is likely to be biased. This is exacerbated by the technical noise, which has particularly high impact for low-abundance OTUs. We shall not attempt to remedy these challenges here. Instead of using p-values, we choose a dynamic threshold based on the data. Pairs whose estimated absolute correlation exceeds this threshold are considered the most likely candidates for genuinely correlated pairs. The threshold is derived in the following way. Let $X$ and $Y$ be the two datasets under study (in case C, $Y$ is compositional, in case B it is not). Permute each dataset separately. This breaks all cross-correlation, but not the correlations within each dataset. Let $S^{\mathrm{Perm}}$ be the set of cross-correlation estimates obtained from the permuted data and let

$$m = \max\{|\hat{\rho}_1^{\mathrm{Perm}}|, \dots, |\hat{\rho}_p^{\mathrm{Perm}}|\},$$

where $\hat{\rho}_i^{\mathrm{Perm}}$ is the correlation with OTU $i$ estimated by applying SparCEV (or SparXCC in case C) on the permuted data (in case C, replace indices $i$ with $ik$ where appropriate). We consider OTU $i$ to be correlated with the variable of interest if $|\hat{\rho}_i| > m$.

Implementation and code availability

Illustrations are produced using ggplot2 version 3.4.1 [31], ggpubr version 0.6.0 [32], and GGally version 2.1.2 [33]. The VST transformation was performed using DESeq2 version 1.34.0 [28]. Correction for differences between experimental groups was carried out with limma version 3.58.1 [34]. Running time was measured using microbenchmark version 1.4.10 [35]. Hypothesis testing on cross-correlations was carried out using psych version 2.2.9 [36]. The SPIEC-EASI networks were estimated using the package SpiecEasi version 1.1.2 [37]. Implementations of SparCEV and SparXCC (both base and iterative) are available in the R-package CompoCor, which can be found at https://github.com/IbTJensen/CompoCor. The scripts used for the simulations and data analysis can be found at https://github.com/IbTJensen/Microbiome-Cross-correlations/.

Results

In this section, we compare the different estimation methods on simulated datasets with the correlation matrices constructed using the cluster and the loadings methods.

Case B

Fig 1 shows the performance of the different correlation estimation methods, with correlation matrices generated by both the cluster and the loadings method. All MAEs are computed as means over 1000 simulated datasets, with n = 50 replicates. For the cluster method ρ = 0.75 and for the loadings method k = 5 (see S1 Text). With both correlation generation methods, poor results are obtained when only the log-transformation is applied, and all other methods yield better results. For the cluster method, CLR, SparCEV base, and SparCEV iterative outperform log-TSS when c = 0.1, SparCEV base outperforms CLR when p is small, and SparCEV iterative outperforms all other methods when p > 20. When c = 0.4, log-TSS performs the same as or better than CLR and SparCEV base for p ≥ 100, but SparCEV iterative substantially outperforms all other methods. This suggests that the iterative procedure successfully alleviates the bias incurred from the compositional structure relative to the other methods. For c = 0.7, 70% of pairs are correlated, and thus the sparsity assumption is severely violated. As expected, this is a substantial obstacle to accurate estimation, especially for CLR, SparCEV base, and SparCEV iterative. In fact, log-TSS performs similarly or better, except when p = 10, where SparCEV base still has a slight edge. SparCEV iterative performs similarly to log-TSS across all p ≥ 20 for c = 0.7.

For the loadings method, SparCEV base outperforms all alternatives when p = 10. For p = 100, the difference between SparCEV base and CLR is negligible, but both outperform log-TSS. When p = 1000, SparCEV base and CLR perform practically identically, and they only outperform log-TSS at higher thresholds (cf. (7)) and only by a small margin. For correlations generated by the loadings method, SparCEV iterative offers no advantage over SparCEV base, and in fact performs markedly worse when p is small, although the difference shrinks as p increases. This happens because the loadings method tends to produce correlations that are roughly symmetric around zero. In other words, the bias that the iterative procedure is supposed to alleviate is close to zero by construction. Consequently, the iterative procedure essentially uses less data (by excluding OTUs strongly correlated with the non-compositional variable) for no advantage. This is also why the difference shrinks as p increases, as excluding some variables is less impactful when data is abundant.

Fig 1. Results on simulated data in case B.

MAE of different cross-correlation methods for correlation matrices generated by the cluster method (left column) and the loadings method (right column). For the cluster method, different p (number of OTUs) and c (the proportion of OTUs in a cluster) are used. For the loadings method, threshold values u = 0, 0.1, …, 0.8 (cf. (7)) and different p are used. The lines show the mean accuracy, and the edges of the envelopes show ±1.96 standard errors (SE). The results are based on 1000 simulated datasets where each simulated dataset has 50 replicates.

The general pattern observed in Fig 1 is that log yields the worst results, log-TSS is an improvement, CLR and SparCEV base outperform log-TSS (except when sparsity is severely violated), and SparCEV base outperforms CLR at low p. This behavior is consistent with the theory presented in S1 Text. The performance of SparCEV iterative generally depends on the magnitude of the bias incurred by violation of the sparsity assumption. Situations where p is small may be encountered in practice, for example, when abundances at high taxonomic levels are considered or when synthetic communities are employed, as is sometimes done in the plant field [24]. SparCEV iterative consistently outperforms SparCEV base for the cluster method, while the reverse is the case for the loadings method. For a real dataset, we do not know which of the scenarios the correlation structure most closely resembles. However, the advantage of SparCEV base for the loadings method is quite small for p ≥ 100 and SparCEV base and SparCEV iterative perform almost identically for the cluster method for p = 10 (except when sparsity is severely violated). Our practical recommendation is thus to employ SparCEV base when p is small, and SparCEV iterative otherwise. We carried out similar simulations without biological zeros. The results were practically identical and can be found in S1 Fig.

The effect of diversity

Friedman and Alm [13] showed that the accuracy of correlation estimates in case A depends on the diversity of the microbiome. They show that the accuracy of empirical Pearson correlation estimates decreases as peff increases, whereas SparCC is unaffected by peff. Fig 2 shows the impact of diversity in case B with p = 100 and different average peff. The simulation settings are identical to those in Fig 1 for p = 100, except that we choose (μi, σi²) differently (see Parameter Selection under Material and methods). Additionally, we set πi = 0 for all i to avoid zero inflation. This is because a more zero-inflated dataset will tend to have lower entropy (and thus lower peff) than a less zero-inflated dataset. This introduces a chaotic element to the simulation process that may muddle the patterns we seek to investigate.
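The link between entropy, zero inflation and peff can be made concrete with a small sketch. It assumes that peff is the exponential of the Shannon entropy of the relative abundances, which is the definition we understand Friedman and Alm [13] to use; the function name is hypothetical.

```python
import math


def effective_number(abundances):
    """exp(Shannon entropy) of one replicate's relative abundances."""
    total = sum(abundances)
    # Zero abundances contribute nothing to the entropy (0 * log 0 := 0).
    props = [a / total for a in abundances if a > 0]
    entropy = -sum(q * math.log(q) for q in props)
    return math.exp(entropy)


even = effective_number([5, 5, 5, 5])         # perfectly even community
zero_inflated = effective_number([5, 5, 0, 0])  # half the OTUs absent
dominated = effective_number([96, 1, 1, 1, 1])  # one dominant OTU
```

A perfectly even community of four OTUs has an effective number of four, and setting half of them to zero halves it, which is why zero inflation is switched off in this simulation.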

Fig 2. Results for simulated data with differing diversity in case B.

MAE of different cross-correlation methods for correlation matrices generated by the cluster method (left column) and the loadings method (right column). For the cluster method, different peff (effective number of OTUs) and c (the proportion of OTUs in a cluster) are used. For the loadings method, threshold values u = 0, 0.1, …, 0.8 (cf. (7)) and different peff are used. The lines show the mean accuracy, and the edges of the envelopes show ±1.96 SE. The results are based on 1000 simulated datasets where each simulated dataset has 50 replicates.

In Fig 1, we show two sets of lines for the cluster method, one for correlated pairs and one for uncorrelated pairs. In Fig 2, these lines would have fallen on top of each other, so for ease of presentation, the lines for the uncorrelated pairs have been omitted. The results for uncorrelated pairs are instead shown in S2 Fig, where the overall pattern is similar to Fig 2. However, for uncorrelated pairs, both log and log-TSS perform better than CLR and SparCEV when peff is high and sparsity is violated, and SparCEV iterative outperforms all other methods under all settings.

In Fig 2, we see that SparCEV iterative, SparCEV base, and CLR are only mildly affected by the effective number of OTUs for correlation matrices generated by both the cluster method and the loadings method, regardless of threshold (cf. (7)) or density of the correlation matrix. SparCEV base still consistently outperforms CLR, although the difference is negligible (for all peff, the difference is similar to the difference we saw in Fig 1 at p = 100). As expected, SparCEV iterative outperforms SparCEV base for the cluster method with higher levels of sparsity. The accuracy of the results obtained from log and log-TSS depend heavily on peff. The accuracy of log-TSS is similar to that of CLR and SparCEV only for dense correlation matrices with uniformly distributed abundances, which are unlikely to occur in nature. In general, the benefit of using CLR or SparCEV is greater for less diverse microbiota. This is consistent with established knowledge in case A [13].

Application on atopic dermatitis data

In this section, we analyze the correlations found in an atopic dermatitis dataset from Byrd et al. [38, 39]. The severity of the symptoms was quantified by the widely used measure objective SCORing of Atopic Dermatitis (objective SCORAD) [40]. We estimate the correlations between objective SCORAD and bacterial abundances at the family level using SparCEV. The dataset contained 407 families and 27 replicates. In Fig 3D, the correlation coefficients estimated with SparCEV base and SparCEV iterative are plotted against each other. They lie approximately on a straight line with a slope of 1 and an intercept less than zero. This is what we would expect if SparCEV base is negatively biased and SparCEV iterative successfully alleviates this bias. We obtained a correlation threshold of m = 0.59 using the threshold selection approach described in Materials and Methods. Additionally, we obtained bootstrap simulations by randomly permuting the SCORAD score within each replicate. These were used to calculate empirical bootstrap confidence intervals (CI) with the BCa-method by Efron [41]. The families with absolute correlation with SCORAD exceeding 0.59 are shown in Fig 3A.

Fig 3. Correlations between microbial abundances and the severity of atopic dermatitis.

Results from a correlation analysis on atopic dermatitis data from Byrd et al. [38]. A: All correlations exceeding the permutation threshold m = 0.59 with color according to the sign of the correlation and with error bars given by the empirical bootstrap 95%-confidence interval. B: Scatter plot between the effective number of families and the objective SCORAD. The blue line is derived from a smooth line fitted to the data with 95% confidence intervals derived from the standard deviation. C: Scatter plot between the estimated correlations using log-TSS and SparCEV. The straight line has slope 1 and intercept 0. D: Scatter plot between the estimated correlations using SparCEV base and SparCEV iterative. The straight line has slope 1 and intercept 0.

It is well known that colonization by Staphylococcus aureus can exacerbate the severity of atopic dermatitis [38]. Indeed, we find that Staphylococcaceae is positively correlated with the objective SCORAD (estimate: 0.76, 95%-CI: [0.60, 1.00]), see Fig 3A. Some members of the fungal family Malasseziaceae are believed to play a pathogenic role in atopic dermatitis [42, 43], and indeed we find this family to be positively correlated with the objective SCORAD (estimate: 0.62, 95%-CI: [0.41, 1.00]). Other studies found that the relative abundance of the genus Propionibacterium was depleted in patients with atopic dermatitis [44] and that the genus Cutibacterium may inhibit the growth of Staphylococcus aureus [45]. Both these genera are members of the family Propionibacteriaceae, but we did not find a correlation between the objective SCORAD and the abundance of this family (estimate: -0.04, 95%-CI: [-0.28, 0.64]). The strongest negative correlation detected was with the family Hyphomicrobiaceae (estimate: -0.63, 95%-CI: [-0.77, -0.24]), which to our knowledge does not have a previously established role in atopic dermatitis.

Fig 3B shows that the diversity is negatively correlated with the objective SCORAD score. Diversity is quantified as the effective number of families which is defined similarly as the previously considered effective number of OTUs. This is consistent with prior knowledge that the diversity of the skin microbiome is substantially reduced in atopic dermatitis patients [44, 45]. The effective number of families is only 18 (out of 407 observed families) even in the most diverse replicate (the effective number of families in the least diverse replicate is 1.2, with over 96% of the relative abundance occupied by Staphylococcaceae). Thus, the diversity in all replicates is low, and by Fig 2 we expect substantially more accurate correlation estimates from SparCEV or CLR compared with log-TSS.

According to Fig 3C, the estimates using log-TSS are consistently smaller than those of SparCEV. The theory in S1 Text provides a plausible explanation for this behaviour. In S1 Text, we show that

Cov[log TSS(x_i), b] = Cov[log a_i, b] − Cov[log A, b],

where A denotes the microbial load. It has previously been established that S. aureus colonizes skin lesions during an atopic dermatitis flare [46]. Other studies have suggested that this is due to an increase in the absolute abundance of S. aureus rather than displacement of other microbes [47–49]. In other words, it appears that when the abundance of Staphylococcaceae increases during a flare, the microbial load increases along with it. As a result, the covariances (and thus also the correlations) estimated with log-TSS are negatively biased. Since SparCEV is unaffected by correlations with the microbial load, it follows that the results obtained with SparCEV are more accurate on this dataset. Our correlation estimate of the family Malasseziaceae is consistent with this conclusion. As noted earlier we expect a positive correlation based on previously established associations, which is what we see from SparCEV (estimate: 0.62, 95%-CI: [0.35, 1.00]), but log-TSS returns a slightly negative correlation (estimate: -0.15, 95%-CI: [-0.50, 0.25]). After applying a t-test to the log-TSS estimated correlations and correcting for multiple testing with Benjamini-Hochberg, 140 statistically significant correlations (at significance level 0.05) are found. Of these, 138 are negatively correlated families and with the considerations above in mind, many of these may be false positives. Comparing with SparCEV, only five families were detected by both methods (absolute correlation above m in SparCEV, and p < 0.05 after correction for multiple testing for log-TSS). These include the two families with a positive correlation coefficient by log-TSS, Staphylococcaceae and Kosmotogaceae. The other three families were Hyphomicrobiaceae, Jonesiaceae, and Trueperaceae, which were the only three families with a negative correlation above m for SparCEV.
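The covariance decomposition behind this argument can be checked numerically. The sketch below is an illustration under simplifying assumptions, not a reproduction of the analysis: it works with noise-free absolute abundances (so that the TSS of an OTU is exactly its abundance divided by the load A), and all simulation parameters and variable names are made up. One OTU blooms with the external variable b, which raises the load, while a second OTU is generated independently of b.

```python
import math
import random


def cov(u, v):
    """Unbiased sample covariance."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (c - mv) for a, c in zip(u, v)) / (n - 1)


random.seed(1)
n, p = 200, 5
b, log_a1, log_a2, log_load, log_tss1, log_tss2 = [], [], [], [], [], []
for _ in range(n):
    severity = random.gauss(0.0, 1.0)
    # OTU 1 grows with severity; the remaining OTUs do not depend on it.
    abund = [math.exp(severity + random.gauss(0.0, 0.3))]
    abund += [math.exp(random.gauss(0.0, 0.3)) for _ in range(p - 1)]
    load = sum(abund)  # microbial load A
    b.append(severity)
    log_a1.append(math.log(abund[0]))
    log_a2.append(math.log(abund[1]))
    log_load.append(math.log(load))
    log_tss1.append(math.log(abund[0] / load))
    log_tss2.append(math.log(abund[1] / load))

# Left- and right-hand side of the identity for OTU 1.
lhs = cov(log_tss1, b)
rhs = cov(log_a1, b) - cov(log_load, b)

# OTU 2 is unrelated to b, yet its log-TSS covariance is pushed below zero
# by the positive covariance between the load and b.
load_cov = cov(log_load, b)
spurious = cov(log_tss2, b)
```

Because the sample covariance is bilinear, the identity holds exactly (up to rounding) in any sample, and the simulated second OTU shows how a load that rises with b induces a negative log-TSS covariance for a taxon that is unrelated to b.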

We also applied a t-test to correlation coefficients estimated by CLR and compared the statistically significant correlations to those above the permutation threshold m of SparCEV. The p-values were corrected for multiple testing using Benjamini-Hochberg. In total, 76 families were statistically significant when applying CLR, while 36 were above the permutation threshold when using SparCEV. This indicates that SparCEV with permutation thresholding is more conservative. Of the 36 families above the permutation threshold, eight were not statistically significant when using CLR. These include families that have previously been linked to atopic dermatitis, specifically Malasseziaceae [42, 43] and Streptococcaceae [50, 51] (not to be confused with Staphylococcaceae, which was found by both CLR and SparCEV). Of the families found correlated with the objective SCORAD by CLR but not SparCEV (48 in total), all but one were found to be negatively correlated with the objective SCORAD. The only exception was Casjensviridae, which was just barely below m for SparCEV (estimate: 0.57, m = 0.59), but just barely significant for CLR (p = 0.047 after correction for multiple testing). The SparCEV estimates of the remaining 47 families range between absolute values barely below m (e.g. Zoogloeaceae, estimate: -0.57) and quite far from m (e.g. Nitrobacteraceae, estimate: -0.36). We generally find that the estimated correlation coefficients found by SparCEV are larger than those found by CLR (see S3 Fig), although the discrepancy is not as pronounced as the one seen in Fig 3C. Details on the results for all families can be found in S1 Table.
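For readers unfamiliar with the Benjamini-Hochberg correction applied to the t-test p-values above, a minimal sketch of the standard step-up adjustment is given below. The function name and the example p-values are illustrative only and do not come from the dataset.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
adj = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adj]
```

A family would then be called significant at level 0.05 when its adjusted p-value is below 0.05.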

Case C

We repeated the numerical studies from Fig 1 in case C. The left column in Fig 4 shows results with the correlation matrices obtained using the cluster method with q = 1000. We also carried out the simulations without biological zeros, the results of which can be found in S4 Fig. For results with the other values of q (q = 10, 100), see S5 Fig. In Fig 4, all tested methods perform similarly on non-correlated pairs. On correlated pairs, CLR and CLR+VST yield almost identical results and outperform log-TSS, while SparXCC base is superior when p or q is small, in agreement with the theory. In S5 Fig, we see that for c = 0.4, the performance lead for SparXCC is reduced and for c = 0.7 it is almost nonexistent. However, SparXCC iterative outperforms all other methods for p ≥ 100 regardless of c. Just as in case B, a non-sparse correlation matrix is a considerable obstacle for SparXCC base, CLR, and CLR+VST. They are all outcompeted by log-TSS in this setting. However, the iterative procedure clearly alleviates the error caused by violations of sparsity, although it still performs considerably worse for c = 0.7 than for c = 0.1, 0.4.

Fig 4. Results for simulated data in case C.

MAE of different cross-correlation methods for correlation matrices generated by the cluster method (left column) and the loadings method (right column) in case C. For the cluster method, different p (number of OTUs), q (number of genes) and c (the proportion of OTUs in a cluster) are used. For the loadings method, threshold values u = 0, 0.1, …, 0.8 (cf. (7)) and different p and q are used. The lines show the mean accuracy, and the edges of the envelopes show ±1.96 standard errors (SE). The results are based on 200 simulated datasets where each simulated dataset has 50 replicates.

With the loadings method, the pattern is broadly similar to the one seen in case B. When p = q = 10, SparXCC base outcompetes all alternatives, especially for higher thresholds (cf. (7)), except SparXCC iterative, which returned identical results. This is because SparXCC iterative typically found no OTUs or genes with average absolute correlations under the thresholds used for the iterative procedure. As a result it simply returns estimates without carrying out the iterative procedure. In other words, with this particular setting SparXCC base and SparXCC iterative are identical in many cases. For p = q = 100 the difference between SparXCC base, SparXCC iterative, CLR, and CLR+VST is reduced and for p = q = 1000, SparXCC base, CLR, and CLR+VST all perform identically and all outperform log-TSS, while SparXCC iterative performs markedly worse at low thresholds (cf. (7)). S6 Fig shows results for all tested combinations of p and q. Situations where p, q or both are small may arise, for example, when examining correlations between 16S data (bacterial OTUs) and ITS data (fungal OTUs), specifically when synthetic communities are employed or when correlations at a high taxonomic level are of interest. Here SparXCC base outperforms all tested transformation-based methods when either p or q is sufficiently small and SparXCC iterative is identical to SparXCC base in these cases. Collectively, these results show that SparXCC outperforms the alternatives when p and q are small. They also show that the bias incurred by SparXCC base can be alleviated by leveraging the iterative procedure, and thus more accurate estimates can be achieved. However, in settings where this bias is small, such as for correlation matrices produced by the loadings method, SparXCC iterative may produce less accurate results than SparXCC base. On a real dataset, this may be assessed by plotting the estimates of SparXCC base and SparXCC iterative against each other. If they lie on a line with slope 1 and intercept different from 0, it indicates that SparXCC iterative may be alleviating the bias incurred by SparXCC base. If the line has intercept 0, it indicates that SparXCC iterative does not alleviate any bias (perhaps because none is present) and then SparXCC base is preferred.
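The graphical check described above can be complemented by fitting a straight line to the two sets of estimates. The following is a rough sketch only: the paper describes a visual assessment, whereas the least-squares fit, the function names and the tolerance `tol` are assumptions introduced here for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def suggest_variant(base, iterative, tol=0.05):
    """Suggest 'iterative' when the estimates are shifted by a constant.

    tol is an arbitrary tolerance chosen for this sketch; it is not a
    value recommended in the paper.
    """
    slope, intercept = fit_line(base, iterative)
    if abs(slope - 1.0) <= tol and abs(intercept) > tol:
        return "iterative"
    return "base"


# Made-up estimates: the iterative estimates are the base ones shifted by 0.1.
base_est = [-0.5, -0.2, 0.0, 0.3, 0.6]
shifted_est = [v + 0.1 for v in base_est]

choice_shifted = suggest_variant(base_est, shifted_est)
choice_same = suggest_variant(base_est, base_est)
```

Such a numerical summary does not replace inspecting the scatter plot, since a few outlying pairs can distort a least-squares fit.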

Application on plant microbiome data

In this section, we analyze the correlations found in the root microbiome of Lotus japonicus in a dataset by Thiergart et al. [20]. We have two sequencing datasets (each compositional), one from 16S ribosomal RNA and one from internal transcribed spacers (ITS). The 16S data contains bacterial OTUs, and the ITS data contains fungal OTUs. The data are from plants of multiple genotypes, the wild type (Gifu) and the mutants ccamk, symrk, ram1, and nfr5. The data contains 15–22 replicates for each genotype. We estimate the correlations using SparXCC. The replicates within each genotype originate from three different experiments. This potentially has a confounding effect on the results. For the purposes of this example, we employ the function removeBatchEffect from the R package limma [34] to correct for differing means between experiments.

The next step is to assess whether to use SparXCC base or SparXCC iterative. To do this, we plotted the estimates from both methods against each other. For several of the genotypes (ram1, symrk, and nfr5), the results of SparXCC iterative were highly sensitive to the choice of t1 and t2 (see S7 Fig). Additionally, even after a suitable threshold was found, the results were on a straight line with intercept zero (see S7 Fig), indicating that the results from SparXCC base are not biased (or that SparXCC iterative fails to alleviate the biases). For these reasons, we use SparXCC base on this dataset. The results can be seen on Fig 5. More details on correlated OTUs can be found in S2 Table. A similar analysis was carried out on data collected from the rhizosphere of the plant. The results of this can be found in S8 Fig and S3 Table.

Fig 5. Correlation network between bacterial and fungal abundances in the root of Lotus japonicus.

Results from applying SparXCC to 16S and ITS sequencing data from the root microbiome of Lotus japonicus, from Thiergart et al. [20]. Each circular vertex represents a bacterial OTU from the 16S data and a square vertex represents a fungal OTU from the ITS data. Vertices are colored based on the phylum of the OTU it represents. Two vertices are connected by an edge if their estimated correlation is above the permutation threshold. The analysis is carried out separately for the genotypes Gifu, ram1, nfr5, ccamk, and symrk. Only cross-correlations are shown.

Thiergart et al. estimate cross-correlations between bacterial and fungal OTUs only on the Gifu data using Spearman correlations on TSS-transformed data (Spearman-TSS). They consider a pair correlated when the p-value is less than 0.001 and thus obtain 595 pairs with significant correlations. Using permutation thresholding on SparXCC, we find only 6 correlated pairs in Gifu when correcting for confounding effects. In order to make a direct comparison to the results in the original paper, we also carried out the correlation estimation without correcting for confounding effects. We then obtain 953 correlations above the permutation threshold (m). A substantial proportion of correlations identified by Thiergart et al. were also found by SparXCC (57%). In S9 Fig and in S4 Table, we see that in most cases, the methods find similar estimates, but in some cases they may differ considerably. In fact, in some cases, the two methods disagree on the sign of the correlation. The reason for the differences between the methods may be that SparXCC approximates Pearson correlations, which measure linear relationships, while Spearman correlations measure monotonic relationships. To examine this possibility, we also estimated the Pearson correlations of the log-TSS transformed data (Pearson-log-TSS) and found 1395 pairs with significant correlations. Interestingly, we found that Pearson-log-TSS showed a greater degree of overlap with Spearman-TSS than with SparXCC (87% vs 79%).

Of the pairs where SparXCC and Spearman-TSS disagreed on the sign, 14 were above the permutation threshold but not detected as significant by the t-test. All of these pairs involved two specific fungal OTUs, both members of the phylum Ascomycota. Both had many reads (ranging from 95 to 2576), so these results are not an artifact of low read counts. Additionally, all of these pairs showed the same pattern when comparing SparXCC to Pearson-log-TSS; the estimated correlations had different signs, but were not detected as significant by the t-test. We do not know the ground truth in this data example and the associations between bacterial and fungal microbes in legumes are poorly understood. As a result, it is not possible to rely on previously established knowledge to assess which of the methods produces the most accurate results. However, our simulation study and our theoretical considerations suggest that SparXCC produces superior results. With that in mind, this data example indicates that SparXCC may be able to capture some pattern of association that is lost with the transformation-based methods.

Running time

We compared the running time of the different methods. In Tables 3 and 4, we see that while the running time of CLR differs substantially between cases B and C, this is not the case for SparCEV and SparXCC. This is because the most time-consuming part of these methods is the variance estimation procedure adopted from SparCC. This explains why SparXCC with p = q = 1000 has roughly twice the running time of SparCEV with p = 1000 (since the variance estimation procedure has to be run twice in case C). When one dataset is much larger than the other in case C, the running time may be completely dominated by the variance estimation of that dataset. This explains why the running time of SparXCC with p = 1000, q = 10000 is practically identical to that of SparCEV with p = 10000. Timing was carried out with the R package microbenchmark [35] on a Lenovo X1 Carbon laptop equipped with a 13th Gen Intel® Core i7-1365U processor.

Table 3. Average running time for cross-correlation estimation methods for case B in seconds.

p        CLR       SparCEV (base)
100      0.0030    0.0047
1000     0.0079    0.210
10000    0.0462    163

Table 4. Average running time for cross-correlation estimation methods for case C in seconds.

p       q        CLR       CLR+VST   SparXCC base
1000    100      0.0144    0.353     0.190
1000    1000     0.0602    0.495     0.471
1000    10000    0.548     1.89      164

Discussion

For the theoretical considerations in this paper, we, like Friedman and Alm [13], assume that the data follow the model in (1). According to this model, the true relative abundances ri are observed, which would only be the case with infinite sequencing depth. We nevertheless assess the different correlation estimation methods using data simulated under a more realistic setting where the xi are noisy observations of the ri. Specifically, SparseDOSSA2 assumes that the xi are multinomial, given the library size N and the ri. Friedman and Alm [13] suggest mitigating the impact of the technical variance of xi given (ri, N) by using a Monte Carlo sampling procedure based on a uniform Dirichlet prior. However, in our simulation setup, we find that this reduces accuracy compared to using a pseudo-count (see S10 Fig). It is a topic of further research to investigate the nature of the technical variance (e.g. if it is truly generated by a multinomial model, as postulated by SparseDOSSA2) and how to account for it in the cross-correlation estimations.

We did not consider testing null hypotheses of zero correlation. Due to various sources of bias, including the aforementioned technical variance, it is difficult to base hypothesis testing on theoretical results. Friedman and Alm [13] use a bootstrapping procedure when applying SparCC in case A. This is computationally demanding, however, not least since corrections for multiple testing are needed when carrying out hypothesis testing for a large number of correlations. Furthermore, in cases B and C, it is challenging to construct a bootstrap simulation scheme that respects the null hypothesis for a particular correlation while maintaining the remaining correlation structure. Due to these difficulties, we believe that it may be more appropriate to rely on the correlation estimates themselves, as we have done with the permutation threshold selection.

A fundamental assumption in this paper is that the interactions between microbes and other variables can be adequately described by a correlation matrix. To our knowledge, no alternatives have been unambiguously shown to universally better describe interactions between compositional datasets such as microbiome and RNA-seq data. Which metric is more sensible may depend on the underlying biology of the specific data under study. We compared the performance of SPIEC-EASI and correlation-based approaches in case C. SPIEC-EASI uses a penalized regression scheme to estimate the precision matrix, Ψ−1, and does not aim to estimate the correlation matrix, Ψ, directly. Instead the primary aim is to discriminate between pairs that are conditionally independent and pairs that are not. With a correlation matrix constructed using the cluster method, a pair is uncorrelated if and only if it is conditionally independent (i.e. Ψij = 0 ⇔ (Ψ−1)ij = 0). Thus we can directly compare the power and false discovery rate (FDR) of SPIEC-EASI with those found using pair-wise correlations. To do this, we subjected the correlation estimates of CLR to a t-test and the estimates of SparXCC to the permutation thresholding scheme described in the Material and methods section, and used SPIEC-EASI to identify conditionally dependent pairs. The results of this comparison can be seen in S11 Fig. Compared to the thresholding method, we found that SPIEC-EASI has higher power for n ≤ 50 but a much higher FDR when n < 1000. The t-test had similar power to SparXCC with permutation thresholding, but FDR increased as n increased. However, it was able to adequately control the FDR at n ≤ 50. Permutation thresholding saw relatively high FDR at n = 20, but was otherwise able to better control the FDR than the other methods. See S12 Fig for similar simulations in case B, comparing a t-test and permutation thresholding.

The interactions present in a real biological system are likely to be more complicated than a correlation matrix generated by the cluster method can account for. In such cases, the methods may diverge (in general, (Ψ−1)ij = 0 need not imply that Ψij = 0 or vice versa), and it may not be clear which is more appropriate.

Conclusion

When estimating correlations between compositional variables and non-compositional variables (case B), the results in Figs 1 and 2, and S1 Fig suggest that SparCEV iterative should be the method of choice, except when p is low, in which case SparCEV base may be preferred. When estimating cross-correlations between two compositional datasets (case C), the results in Fig 4 and S4–S6 Figs suggest that the method of choice should be SparXCC base for datasets where the average cross-correlations are close to zero, and SparXCC iterative when this is not the case. In practice, this can be assessed by plotting estimates from SparXCC base and SparXCC iterative against each other. If they lie on a straight line with slope 1 and intercept different from 0, then SparXCC iterative is most likely preferable, whereas SparXCC base is preferable otherwise.

Supporting information

S1 Fig. Case B without biological zeros.

Accuracy of the different cross-correlation methods in case B, in the absence of biological zero by enforcing πj = 0 for j = 1, …, p. Otherwise, the same simulation settings as Fig 1 are used.

(PDF)

pone.0305032.s001.pdf (11KB, pdf)
S2 Fig. Case B diversity and zero correlations.

Accuracy of the different cross-correlation methods in case B on uncorrelated pairs at different levels of diversity.

(PDF)

pone.0305032.s002.pdf (6.9KB, pdf)
S3 Fig. Atopic dermatitis dataset, SparCEV vs CLR.

Correlation coefficients estimated by SparCEV and CLR plotted against each other. The straight line has slope 1 and intercept 0.

(PDF)

pone.0305032.s003.pdf (28.3KB, pdf)
S4 Fig. Case C without biological zeros.

Accuracy of the different cross-correlation methods in case C, in the absence of biological zero by enforcing πj = 0 for j = 1, …, p + q. Otherwise, the same simulation settings as Fig 4 are used.

(PDF)

pone.0305032.s004.pdf (11.9KB, pdf)
S5 Fig. Cluster method in case C for small q.

Accuracy of the different cross-correlation methods on correlation matrices generated by the cluster method in case C for q = 10, 100. Otherwise, the same simulation settings as Fig 4 are used.

(PDF)

S6 Fig. Loadings method in case C for all combinations of p and q.

Accuracy of the different cross-correlation methods on correlation matrices generated by the loadings method in case C for all combinations of p = 10, 100, 1000 and q = 10, 100, 1000. Otherwise, the same simulation settings as Fig 4 are used.

(PDF)

pone.0305032.s006.pdf (12.8KB, pdf)
S7 Fig. SparXCC iterative vs SparXCC base for different thresholds.

The correlation coefficients estimated by SparXCC base and SparXCC iterative plotted against each other for both the default choice of threshold (the 80th percentile) and a threshold chosen after manually evaluating percentiles of the permutations.

(PDF)

pone.0305032.s007.pdf (4.9MB, pdf)
S8 Fig. Cross-correlation network constructed on rhizosphere data.

Graph with edges between nodes when the cross-correlation is above a permutation threshold, estimated by SparXCC on rhizosphere data.

(PDF)

pone.0305032.s008.pdf (30.9KB, pdf)
S9 Fig. Spearman correlations of relative abundances vs SparXCC.

The estimated correlation coefficients as estimated by Spearman correlations of relative abundances plotted against correlations approximated by SparXCC. For Spearman, a pair is considered correlated when a t-test returns a p-value less than 0.001. For SparXCC, a pair is considered correlated when it is above the permutation threshold.

(PDF)

pone.0305032.s009.pdf (517.6KB, pdf)
S10 Fig. Pseudo-count versus Dirichlet Monte Carlo sampling.

Accuracy of using a pseudo-count versus Dirichlet Monte Carlo for SparCEV.

(PDF)

pone.0305032.s010.pdf (6.3KB, pdf)
S11 Fig. Separating correlated and uncorrelated pairs in Case C.

Power and FDR of CLR with a t-test (p- values corrected for multiple testing with Benjamini-Hochberg), SparXCC with permutation thresholding, and SPIEC-EASI.

(PDF)

pone.0305032.s011.pdf (6.6KB, pdf)
S12 Fig. Separating correlated and uncorrelated pairs in Case B.

Power and FDR of CLR with a t-test (p- values corrected for multiple testing with Benjamini-Hochberg) and SparCEV with permutation thresholding.

(PDF)

pone.0305032.s012.pdf (6.6KB, pdf)
S1 Table. Correlations between families and objective SCORAD score.

(CSV)

pone.0305032.s013.csv (43.2KB, csv)
S2 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the root of Lotus japonicus.

Confounding experiment effects were removed and SparXCC was applied. Only pairs whose estimated correlation coefficient exceeded the permutation threshold are included.

(CSV)

S3 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the rhizosphere of Lotus japonicus.

Confounding experiment effects were removed and SparXCC was applied. Only pairs whose estimated correlation coefficient exceeded the permutation threshold are included.

(CSV)

pone.0305032.s015.csv (612B, csv)
S4 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the root of Lotus japonicus.

The data was not corrected for confounding effects prior to correlation estimation. The included pairs either had a correlation coefficient estimated by SparXCC exceeding the permutation threshold, or had p < 0.001 from a t-test applied to the empirical Spearman correlation of log-TSS transformed data.

(CSV)

pone.0305032.s016.csv (75KB, csv)
S1 Text. Theoretical analysis of transformation-based correlations, derivation of compositionally aware methods, and construction of correlation matrices.

(PDF)

pone.0305032.s017.pdf (169.5KB, pdf)

Acknowledgments

We thank Adrián Gómez Repollés for assistance with the dermatitis data. We thank Thorsten Thiergart and Ruben Garrido-Oter for assistance with the plant microbiome data. We thank B Kirtley Amos and Max Gordon for critical reading. We thank Sha Zhang for supplying the data used to construct the templates for gene expression data in the simulation studies. We thank Taylor Grace FitzGerald for copy-editing.

Data Availability

All data used in this paper can be found at https://github.com/IbTJensen/Microbiome-Cross-correlations/. The raw sequencing data from Byrd et al. can be found in NCBI Bioproject 46333, and the OTU table was originally obtained from Morton et al. at https://github.com/knightlab-analyses/reference-frames. The raw sequencing data from Thiergart et al. can be found at the European Nucleotide Archive (ENA). The 16S dataset has project accession no. PRJEB34100, and the ITS dataset has project accession no. PRJEB34099. The OTU tables were originally obtained at https://github.com/ththi/Lotus-Symbiosis.

Funding Statement

This work was supported by the Bill and Melinda Gates Foundation and the Foreign, Commonwealth & Development Office through Engineering the Nitrogen Symbiosis for Africa (ENSA; OPP11772165). Ib Thorsgaard Jensen and Rasmus Waagepetersen were supported by research grant VIL57389 from Villum Fonden. The funders played no role in the content of this paper.

References

  • 1. McCombie WR, McPherson JD, Mardis ER. Next-Generation Sequencing Technologies. Cold Spring Harbor Perspectives in Medicine. 2019;9(11):a036798. doi: 10.1101/cshperspect.a036798 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357(6352):eaan2507. doi: 10.1126/science.aan2507 [DOI] [PubMed] [Google Scholar]
  • 3. Casamassimi A, Federico A, Rienzo M, Esposito S, Ciccodicola A. Transcriptome Profiling in Human Diseases: New Advances and Perspectives. International Journal of Molecular Sciences. 2017;18(8):1652. doi: 10.3390/ijms18081652 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Ehrhart F, Coort SL, Eijssen L, Cirillo E, Smeets EE, Bahram Sangani N, et al. Integrated analysis of human transcriptome data for Rett syndrome finds a network of involved genes. The World Journal of Biological Psychiatry. 2020;21(10):712–725. doi: 10.1080/15622975.2019.1593501 [DOI] [PubMed] [Google Scholar]
  • 5. Cani PD. Human gut microbiome: hopes, threats and promises. Gut. 2018;67(9):1716–1725. doi: 10.1136/gutjnl-2018-316723 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Wilson AS, Koller KR, Ramaboli MC, Nesengani LT, Ocvirk S, Chen C, et al. Diet and the Human Gut Microbiome: An International Review. Digestive Diseases and Sciences. 2020;65(3):723–740. doi: 10.1007/s10620-020-06112-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Huang H, Ren Z, Gao X, Hu X, Zhou Y, Jiang J, et al. Integrated analysis of microbiome and host transcriptome reveals correlations between gut microbiota and clinical outcomes in HBV-related hepatocellular carcinoma. Genome Medicine. 2020;12(1):102. doi: 10.1186/s13073-020-00796-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Zancarini A, Westerhuis JA, Smilde AK, Bouwmeester HJ. Integration of omics data to unravel root microbiome recruitment. Current Opinion in Biotechnology. 2021;70:255–261. doi: 10.1016/j.copbio.2021.06.016 [DOI] [PubMed] [Google Scholar]
  • 9. Shaffer M, Armstrong AJS, Phelan VV, Reisdorph N, Lozupone CA. Microbiome and metabolome data integration provides insight into health and disease. Translational Research. 2017;189:51–64. doi: 10.1016/j.trsl.2017.07.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Monteleone AM, Troisi J, Fasano A, Dalle Grave R, Marciello F, Serena G, et al. Multi-omics data integration in anorexia nervosa patients before and after weight regain: A microbiome-metabolomics investigation. Clinical Nutrition. 2021;40(3):1137–1146. doi: 10.1016/j.clnu.2020.07.021 [DOI] [PubMed] [Google Scholar]
  • 11. Korenblum E, Dong Y, Szymanski J, Panda S, Jozwiak A, Massalha H, et al. Rhizosphere microbiome mediates systemic root metabolite exudation by root-to-root signaling. Proceedings of the National Academy of Sciences. 2020;117(7):3874–3883. doi: 10.1073/pnas.1912130117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Xia LC, Ai D, Cram J, Fuhrman JA, Sun F. Efficient statistical significance approximation for local similarity analysis of high-throughput time series data. Bioinformatics. 2013;29(2):230–237. doi: 10.1093/bioinformatics/bts668 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Friedman J, Alm EJ. Inferring Correlation Networks from Genomic Survey Data. PLoS Computational Biology. 2012;8(9):e1002687. doi: 10.1371/journal.pcbi.1002687 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Computational Biology. 2015;11(5):e1004226. doi: 10.1371/journal.pcbi.1004226 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Xu X, Zhang W, Guo M, Xiao C, Fu Z, Yu S, et al. Integrated analysis of gut microbiome and host immune responses in COVID-19. Frontiers of Medicine. 2022;16(2):263–275. doi: 10.1007/s11684-022-0921-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Tipton L, Müller CL, Kurtz ZD, Huang L, Kleerup E, Morris A, et al. Fungi stabilize connectivity in the lung and skin microbial ecosystems. Microbiome. 2018;6(1):12. doi: 10.1186/s40168-017-0393-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Morton JT, Aksenov AA, Nothias LF, Foulds JR, Quinn RA, Badri MH, et al. Learning representations of microbe–metabolite interactions. Nature Methods. 2019;16(12):1306–1314. doi: 10.1038/s41592-019-0616-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Quinn TP, Erb I. Examining microbe–metabolite correlations by linear methods. Nature Methods. 2021;18(1):37–39. doi: 10.1038/s41592-020-01006-1 [DOI] [PubMed] [Google Scholar]
  • 19. Morton JT, McDonald D, Aksenov AA, Nothias LF, Foulds JR, Quinn RA, et al. Reply to: Examining microbe–metabolite correlations by linear methods. Nature Methods. 2021;18(1):40–41. doi: 10.1038/s41592-020-01007-0 [DOI] [PubMed] [Google Scholar]
  • 20. Thiergart T, Zgadzaj R, Bozsóki Z, Garrido-Oter R, Radutoiu S, Schulze-Lefert P. Lotus japonicus Symbiosis Genes Impact Microbial Interactions between Symbionts and Multikingdom Commensal Communities. mBio. 2019;10(5):e01833–19. doi: 10.1128/mBio.01833-19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Yang Y, Misra BB, Liang L, Bi D, Weng W, Wu W, et al. Integrated microbiome and metabolome analysis reveals a novel interplay between commensal bacteria and metabolites in colorectal cancer. Theranostics. 2019;9(14):4101–4114. doi: 10.7150/thno.35186 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Stagaman K, Cepon-Robins TJ, Liebert MA, Gildner TE, Urlacher SS, Madimenos FC, et al. Market Integration Predicts Human Gut Microbiome Attributes across a Gradient of Economic Development. mSystems. 2018;3(1):e00122–17. doi: 10.1128/mSystems.00122-17 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Huang W, Sun D, Chen L, An Y. Integrative analysis of the microbiome and metabolome in understanding the causes of sugarcane bitterness. Scientific Reports. 2021;11(1):6024. doi: 10.1038/s41598-021-85433-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Vorholt JA, Vogel C, Carlström CI, Müller DB. Establishing Causality: Opportunities of Synthetic Communities for Plant Microbiome Research. Cell Host & Microbe. 2017;22(2):142–155. doi: 10.1016/j.chom.2017.07.004 [DOI] [PubMed] [Google Scholar]
  • 25. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11(10):R106. doi: 10.1186/gb-2010-11-10-r106 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11(1):94. doi: 10.1186/1471-2105-11-94 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Ma S, Ren B, Mallick H, Moon YS, Schwager E, Maharjan S, et al. A statistical model for describing and simulating microbial community profiles. PLOS Computational Biology. 2021;17(9):e1008913. doi: 10.1371/journal.pcbi.1008913 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Tao K, Jensen IT, Zhang S, Villa-Rodríguez E, Blahovska Z, Salomonsen CL, et al. Nitrogen source and Nod factor signaling map out the assemblies of Lotus japonicus root bacterial communities. Plant Biology; 2023. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.05.27.542319. [Google Scholar]
  • 31. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag; New York; 2016. Available from: https://ggplot2.tidyverse.org. [Google Scholar]
  • 32. Kassambara A. ggpubr: ‘ggplot2’ Based Publication Ready Plots; 2022. Available from: https://CRAN.R-project.org/package=ggpubr.
  • 33. Schloerke B, Cook D, Larmarange J, Briatte F, Marbach M, Thoen E, et al. GGally: Extension to ‘ggplot2’; 2021. Available from: https://CRAN.R-project.org/package=GGally.
  • 34. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47. doi: 10.1093/nar/gkv007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Mersmann O. microbenchmark: Accurate Timing Functions; 2023. Available from: https://CRAN.R-project.org/package=microbenchmark.
  • 36. Revelle W. psych: Procedures for Psychological, Psychometric, and Personality Research; 2022. Available from: https://CRAN.R-project.org/package=psych.
  • 37. Kurtz Z, Mueller C, Miraldi E, Bonneau R. SpiecEasi: Sparse Inverse Covariance for Ecological Statistical Inference; 2022. [Google Scholar]
  • 38. Byrd AL, Deming C, Cassidy SKB, Harrison OJ, Ng WI, Conlan S, et al. Staphylococcus aureus and Staphylococcus epidermidis strain diversity underlying pediatric atopic dermatitis. Science Translational Medicine. 2017;9(397):eaal4651. doi: 10.1126/scitranslmed.aal4651 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Morton JT, Marotz C, Washburne A, Silverman J, Zaramela LS, Edlund A, et al. Establishing microbial composition measurement standards with reference frames. Nature Communications. 2019;10(1):2719. doi: 10.1038/s41467-019-10656-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Wolkerstorfer A, De Waard Van Der Spek FB, Glazenburg EJ, Mulder AP, Oranje AP. Scoring the Severity of Atopic Dermatitis: Three Item Severity Score as a Rough System for Daily Practice and as a Pre-screening Tool for Studies. Acta Dermato-Venereologica. 1999;79(5):356–359. doi: 10.1080/000155599750010256 [DOI] [PubMed] [Google Scholar]
  • 41. Efron B. Better Bootstrap Confidence Intervals. Journal of the American Statistical Association. 1987;82(397):171–185. doi: 10.1080/01621459.1987.10478410 [DOI] [Google Scholar]
  • 42. Darabi K, Hostetler SG, Bechtel MA, Zirwas M. The role of Malassezia in atopic dermatitis affecting the head and neck of adults. Journal of the American Academy of Dermatology. 2009;60(1):125–136. doi: 10.1016/j.jaad.2008.07.058 [DOI] [PubMed] [Google Scholar]
  • 43. Glatz M, Bosshard P, Hoetzenecker W, Schmid-Grendelmeier P. The Role of Malassezia spp. in Atopic Dermatitis. Journal of Clinical Medicine. 2015;4(6):1217–1228. doi: 10.3390/jcm4061217 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Bjerre RD, Bandier J, Skov L, Engstrand L, Johansen JD. The role of the skin microbiome in atopic dermatitis: a systematic review. British Journal of Dermatology. 2017;177(5):1272–1278. doi: 10.1111/bjd.15390 [DOI] [PubMed] [Google Scholar]
  • 45. Koh LF, Ong RY, Common JE. Skin microbiome of atopic dermatitis. Allergology International. 2022;71(1):31–39. doi: 10.1016/j.alit.2021.11.001 [DOI] [PubMed] [Google Scholar]
  • 46. Edslev SM, Olesen CM, Nørreslet LB, Ingham AC, Iversen S, Lilje B, et al. Staphylococcal Communities on Skin Are Associated with Atopic Dermatitis and Disease Severity. Microorganisms. 2021;9(2):432. doi: 10.3390/microorganisms9020432 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Tauber M, Balica S, Hsu CY, Jean-Decoster C, Lauze C, Redoules D, et al. Staphylococcus aureus density on lesional and nonlesional skin is strongly associated with disease severity in atopic dermatitis. Journal of Allergy and Clinical Immunology. 2016;137(4):1272–1274.e3. doi: 10.1016/j.jaci.2015.07.052 [DOI] [PubMed] [Google Scholar]
  • 48. Gonzalez ME, Schaffer JV, Orlow SJ, Gao Z, Li H, Alekseyenko AV, et al. Cutaneous microbiome effects of fluticasone propionate cream and adjunctive bleach baths in childhood atopic dermatitis. Journal of the American Academy of Dermatology. 2016;75(3):481–493.e8. doi: 10.1016/j.jaad.2016.04.066 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Edslev S M, Agner T, Andersen P S. Skin Microbiome in Atopic Dermatitis. Acta Dermato-Venereologica. 2020;100(12):358–366. doi: 10.2340/00015555-3514 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Byrd AL, Belkaid Y, Segre JA. The human skin microbiome. Nature Reviews Microbiology. 2018;16(1). doi: 10.1038/nrmicro.2017.157 [DOI] [PubMed] [Google Scholar]
  • 51. Chng KR, Tay ASL, Li C, Ng AHQ, Wang J, Suri BK, et al. Whole metagenome profiling reveals skin microbiome-dependent susceptibility to atopic dermatitis flare. Nature Microbiology. 2016;1(7). doi: 10.1038/nmicrobiol.2016.106 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Enrique Hernandez-Lemus

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

2 Feb 2024

PONE-D-23-34186

Compositionally aware estimation of cross-correlations for microbiome data

PLOS ONE

Dear Dr. Jensen,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. When resubmitting your manuscript, please take into account the following methodological suggestions from the reviewer (Points 1 to 4 under Comments):

1. On Real Data Analysis of Atopic Dermatitis

2. Evaluation of the Dynamic Threshold Selection Method

3. Analysis of the Impact of Data Scale on Computational Time

4. Specific Clarification of Multiple Comparisons Correction Method

Please submit your revised manuscript by Mar 18 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Note from Emily Chenette, Editor in Chief of PLOS ONE, and Iain Hrynaszkiewicz, Director of Open Research Solutions at PLOS: Did you know that depositing data in a repository is associated with up to a 25% citation advantage (https://doi.org/10.1371/journal.pone.0230416)? If you’ve not already done so, consider depositing your raw data in a repository to ensure your work is read, appreciated and cited by the largest possible audience. You’ll also earn an Accessible Data icon on your published paper if you deposit your data in any participating repository (https://plos.org/open-science/open-data/#accessible-data).

4. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

[Funding: This work was supported by the Bill and Melinda Gates Foundation and from Foreign, Commonwealth & Development Office through Engineering the Nitrogen Symbiosis for Africa (ENSA; OPP11772165). We thank Adrián Gómez Repollés for assistance with the dermatitis data. We thank Thorsten Thiergart and Ruben Garrido-Oter for assistance with the plant microbiome data. We thank B Kirtley Amos and Max Gordon for critical reading. We thank Sha Zhang for supplying the data used to construct the templates for gene expression data in the simulation studies. We thank Taylor Grace FitzGerald for copy-editing]

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

[This work was supported by the Bill and Melinda Gates Foundation and from Foreign, Commonwealth & Development Office through Engineering the Nitrogen Symbiosis for Africa (ENSA; OPP11772165). The funders played no role in the content of this paper.]

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Summary

In this research, the authors explore the complex domain of microbiome studies with an emphasis on deducing correlations between microbial abundances and various other variables. Addressing a notable gap in current methodologies, which primarily focus on compositional data, this paper introduces two innovative methods: SparCEV (Sparse Correlations with External Variables) and SparXCC (Sparse Cross-Correlations between Compositional data). These methods are uniquely designed to quantify correlations between OTU abundances and phenotypic variables or other compositional datasets, expanding the analytical capabilities in microbiome research. The authors have utilized a combination of real-world data analysis and comprehensive simulation studies to validate their methods.

Comments

1. On Real Data Analysis of Atopic Dermatitis:

"In the section analyzing real-world data on atopic dermatitis, it would be highly beneficial if the authors could present the highly correlated microbial species identified by other methods. A detailed comparison, particularly focusing on overlaps and distinctions among these methodologies, would greatly enhance our understanding of the uniqueness and effectiveness of your proposed approach."

2. Evaluation of the Dynamic Threshold Selection Method:

"The dynamic threshold selection method introduced in the article seems to rely significantly on subjective judgment, such as the user-defined parameter 't'. This reliance might impact the reproducibility and objectivity of the results. I would recommend that the authors explore this method in more depth, providing a more stable criterion for dynamic threshold selection or more objective guidelines to augment the universality and reliability of the method."

3. Analysis of the Impact of Data Scale on Computational Time:

"Given that the manuscript indicates similar outcomes for SparXCC and CLR transformations in Case C with large p and q values, albeit being time-consuming, a comparative analysis of the computational time across different methods as a function of data scale would be instructive. Such analysis would aid in assessing the efficiency and applicability of these methods in practical scenarios."

4. Specific Clarification of Multiple Comparisons Correction Method:

"In the process of distinguishing between correlated and uncorrelated pairs, a t-test has been applied to CLR. I would urge the authors to clearly specify the exact correction method used for addressing multiple comparisons. For example, was a Bonferroni correction or a Benjamini-Hochberg procedure employed? Clarity in this aspect is crucial for assessing the statistical rigor of the study."

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jun 28;19(6):e0305032. doi: 10.1371/journal.pone.0305032.r002

Author response to Decision Letter 0


24 Apr 2024

Response to editor:

1. The manuscript has been edited to comply with the PLOS One style guidelines. Specifically, headings have been changed to sentence case and their font sizes have been adjusted, the main text has been changed to double-spaced paragraph format, and the "Supporting information" section has been moved after the references. Additionally, addresses have been removed from author affiliations, and changes in author affiliations since the original submission have been incorporated.

2. Implementations of the methods SparCEV and SparXCC are now available as an R package at https://github.com/IbTJensen/CompoCor. Everything else necessary to replicate the results of the paper is available at https://github.com/IbTJensen/Microbiome-Cross-correlations.

3. See reply to Editor Point 2.

4. We have removed the funding information from the Acknowledgments section in the manuscript. Please add the following to the funding statement: "Ib Thorsgaard Jensen and Rasmus Waagepetersen were supported by research grant VIL57389 from Villum Fonden."

Response to reviewer:

1. We have now added a more thorough discussion of the differences between the results found by SparCEV, CLR and log-TSS. Additionally, a supplementary table provides the results of all three methods for all families. This table was also included in the previous version, but it was not mentioned in the text; this has now been rectified.

2. Inspired by this comment, we refined the estimation procedures. Specifically, we implemented an iterative procedure similar to the one used by SparCC, on which both SparCEV and SparXCC are based. We had initially dismissed this approach after seeing poor results, but after a more thorough investigation we found that it can help alleviate the bias caused by the sparsity assumption. This obviates the need for the parameter t. We have rerun all the simulation studies and included this new iterative approach. In some cases we see a substantial gain in accuracy, while in others (where the sparsity assumption holds almost exactly) we see a decrease in accuracy. We also provide practical guidance for assessing whether the iterative procedure is appropriate for a given dataset.

The iterative procedure makes use of user-specified thresholds t, t1, and t2 to select "weakly correlated OTUs/genes" (these have no connection to the t from the previous version of the manuscript). However, we believe these thresholds are of a different nature than the earlier t, for the following reasons. Firstly, SparCC, which is already widely used in the microbiome literature, uses a similar procedure with a similar user-specified parameter. Secondly, we suggest a bootstrap approach to select them in a data-driven way. Thirdly, we suggest a diagnostic plot to assess whether SparCEV/SparXCC with the iterative procedure provides an improvement over SparCEV/SparXCC without it on a given dataset. In contrast, the t from the previous version of the manuscript could neither be selected nor evaluated for a specific dataset.

3. An analysis of the running time is now included in the manuscript. Additionally, we explored a different approach for the mathematical derivation of SparXCC. A different way to express the covariances was formulated, and it was easily shown to be equivalent to the formulation from the previous manuscript. Using this formulation substantially speeds up SparXCC. Additionally, with the iterative procedure SparXCC can provide substantially better results than CLR in some cases, even when p and q are large, which we believe justifies the greater running time.

4. Throughout the manuscript, all p-values are corrected for multiple testing with Benjamini-Hochberg (except in the plant microbiome data example, where we follow the method employed in the original paper for the purposes of comparison). This has now been clearly indicated in every instance.
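
For readers unfamiliar with the correction named in point 4, the Benjamini-Hochberg adjustment can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `benjamini_hochberg` and the example p-values are hypothetical, and the code is not taken from the paper or from the authors' R package.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, scaling each by
    # m / rank and taking a running minimum so the adjusted values stay
    # monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, e.g. from t-tests on pairwise correlations.
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)

# Pairs declared significant at a false discovery rate of 5%.
significant = [qi <= 0.05 for qi in q]
```

With these made-up inputs, only the first two tests remain significant at the 5% level after adjustment, although four of the five raw p-values are below 0.05.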

Attachment

Submitted filename: Response to reviwers.pdf

pone.0305032.s018.pdf (149.5KB, pdf)

Decision Letter 1

Enrique Hernandez-Lemus

23 May 2024

Compositionally aware estimation of cross-correlations for microbiome data

PONE-D-23-34186R1

Dear Dr. Jensen,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Enrique Hernandez-Lemus, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have adeptly incorporated my feedback. I am pleased with the revisions made to the manuscript. Consequently, I wholeheartedly endorse its publication.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Enrique Hernandez-Lemus

19 Jun 2024

PONE-D-23-34186R1

PLOS ONE

Dear Dr. Jensen,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Enrique Hernandez-Lemus

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 Fig. Case B without biological zeros.

    Accuracy of the different cross-correlation methods in case B in the absence of biological zeros, obtained by enforcing πj = 0 for j = 1, …, p. Otherwise, the same simulation settings as in Fig 1 are used.

    (PDF)

    pone.0305032.s001.pdf (11KB, pdf)
    S2 Fig. Case B diversity and zero correlations.

    Accuracy of the different cross-correlation methods in case B on uncorrelated pairs at different levels of diversity.

    (PDF)

    pone.0305032.s002.pdf (6.9KB, pdf)
    S3 Fig. Atopic dermatitis dataset, SparCEV vs CLR.

    Correlation coefficients estimated by SparCEV and CLR plotted against each other. The straight line has slope 1 and intercept 0.

    (PDF)

    pone.0305032.s003.pdf (28.3KB, pdf)
    S4 Fig. Case C without biological zeros.

    Accuracy of the different cross-correlation methods in case C in the absence of biological zeros, obtained by enforcing πj = 0 for j = 1, …, p + q. Otherwise, the same simulation settings as in Fig 4 are used.

    (PDF)

    pone.0305032.s004.pdf (11.9KB, pdf)
    S5 Fig. Cluster method in case C for small q.

    Accuracy of the different cross-correlation methods on correlation matrices generated by the cluster method in case C for q = 10, 100. Otherwise, the same simulation settings as Fig 4 are used.

    (PDF)

    S6 Fig. Loadings method in case C for all combinations of p and q.

    Accuracy of the different cross-correlation methods on correlation matrices generated by the loadings method in case C for all combinations of p = 10, 100, 1000 and q = 10, 100, 1000. Otherwise, the same simulation settings as Fig 4 are used.

    (PDF)

    pone.0305032.s006.pdf (12.8KB, pdf)
    S7 Fig. SparXCC iterative vs SparXCC base for different thresholds.

    The correlation coefficients estimated by SparXCC base and SparXCC iterative plotted against each other for both the default choice of threshold (the 80th percentile) and a threshold chosen after manually evaluating percentiles of the permutations.

    (PDF)

    pone.0305032.s007.pdf (4.9MB, pdf)
    S8 Fig. Cross-correlation network constructed on rhizosphere data.

    Graph with edges between nodes when the cross-correlation is above a permutation threshold, estimated by SparXCC on rhizosphere data.

    (PDF)

    pone.0305032.s008.pdf (30.9KB, pdf)
    S9 Fig. Spearman correlations of relative abundances vs SparXCC.

    Correlation coefficients estimated by Spearman correlations of relative abundances plotted against correlations approximated by SparXCC. For Spearman, a pair is considered correlated when a t-test returns a p-value less than 0.001. For SparXCC, a pair is considered correlated when its estimated correlation is above the permutation threshold.

    (PDF)

    pone.0305032.s009.pdf (517.6KB, pdf)
    S10 Fig. Pseudo-count versus Dirichlet Monte Carlo sampling.

    Accuracy of using a pseudo-count versus Dirichlet Monte Carlo for SparCEV.

    (PDF)

    pone.0305032.s010.pdf (6.3KB, pdf)
    S11 Fig. Separating correlated and uncorrelated pairs in Case C.

    Power and FDR of CLR with a t-test (p-values corrected for multiple testing with Benjamini-Hochberg), SparXCC with permutation thresholding, and SPIEC-EASI.

    (PDF)

    pone.0305032.s011.pdf (6.6KB, pdf)
    S12 Fig. Separating correlated and uncorrelated pairs in Case B.

    Power and FDR of CLR with a t-test (p-values corrected for multiple testing with Benjamini-Hochberg) and SparCEV with permutation thresholding.

    (PDF)

    pone.0305032.s012.pdf (6.6KB, pdf)
    S1 Table. Correlations between families and objective SCORAD score.

    (CSV)

    pone.0305032.s013.csv (43.2KB, csv)
    S2 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the root of Lotus japonicus.

    Confounding experiment effects were removed and SparXCC was applied. Only pairs whose estimated correlation coefficient exceeded the permutation threshold are included.

    (CSV)

    S3 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the rhizosphere of Lotus japonicus.

    Confounding experiment effects were removed and SparXCC was applied. Only pairs whose estimated correlation coefficient exceeded the permutation threshold are included.

    (CSV)

    pone.0305032.s015.csv (612B, csv)
    S4 Table. Correlations between bacterial OTUs from 16S data and fungal OTUs from ITS data from the root of Lotus japonicus.

    The data were not corrected for confounding effects prior to correlation estimation. The included pairs either had a correlation coefficient estimated by SparXCC exceeding the permutation threshold, or had p < 0.001 from a t-test applied to the empirical Spearman correlation of log-TSS transformed data.

    (CSV)

    pone.0305032.s016.csv (75KB, csv)
    S1 Text. Theoretical analysis of transformation-based correlations, derivation of compositionally aware methods, and construction of correlation matrices.

    (PDF)

    pone.0305032.s017.pdf (169.5KB, pdf)
    Attachment

    Submitted filename: Response to reviwers.pdf

    pone.0305032.s018.pdf (149.5KB, pdf)

    Data Availability Statement

    All data used in this paper can be found at https://github.com/IbTJensen/Microbiome-Cross-correlations/. The raw sequencing data from Byrd et al. can be found in NCBI Bioproject 46333, and the OTU table was originally obtained from Morton et al. at https://github.com/knightlab-analyses/reference-frames. The raw sequencing data from Thiergart et al. can be found at the European Nucleotide Archive (ENA). The 16S dataset has project accession no. PRJEB34100, and the ITS dataset has project accession no. PRJEB34099. The OTU tables were originally obtained from https://github.com/ththi/Lotus-Symbiosis.


    Articles from PLOS ONE are provided here courtesy of PLOS
