A zero inflated log-normal model for inference of sparse microbial association networks

Vincent Prost; Stéphane Gazut; Thomas Brüls

doi:10.1371/journal.pcbi.1009089

2021 Jun 18;17(6):e1009089. doi: 10.1371/journal.pcbi.1009089

Vincent Prost ^1,^2,^*, Stéphane Gazut ², Thomas Brüls ^1,^*

Editor: Niranjan Nagarajan³

PMCID: PMC8244920 PMID: 34143768

Abstract

The advent of high-throughput metagenomic sequencing has prompted the development of efficient taxonomic profiling methods that measure the presence, abundance and phylogeny of organisms in a wide range of environmental samples. Multivariate sequence-derived abundance data further has the potential to enable inference of ecological associations between microbial populations, but several technical issues need to be accounted for, such as the compositional nature of the data, its extreme sparsity and overdispersion, as well as the frequent need to operate in under-determined regimes.

The ecological network reconstruction problem is frequently cast into the paradigm of Gaussian Graphical Models (GGMs), for which efficient structure inference algorithms are available, such as the graphical lasso and neighborhood selection. Unfortunately, GGMs or variants thereof cannot properly account for the extremely sparse patterns occurring in real-world metagenomic taxonomic profiles. In particular, structural zeros (as opposed to sampling zeros) corresponding to true absences of biological signals fail to be properly handled by most statistical methods.

We present here a zero-inflated log-normal graphical model (available at https://github.com/vincentprost/Zi-LN) specifically aimed at handling such “biological” zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of microbial association networks, with most notable gains obtained when analyzing taxonomic profiles displaying sparsity levels on par with real-world metagenomic datasets.

Author summary

The importance of associations in the structuring and dynamics of community members is widely acknowledged, but we are currently unable to co-culture most of the micro-organisms sampled from the environment. Computational methods to predict microbial associations can therefore be of practical interest, in particular given the large amounts of multivariate microbial abundance data generated by metagenomics. This data can in theory be leveraged to infer association networks, but with limited success so far, as several of its attributes lead to technical difficulties, including its extreme sparsity, compositionality and overdispersion among others. In particular, structural zeros (as opposed to sampling and technical zeros) corresponding to true absences of biological signals frequently fail to be properly handled, and such non-random absences can lead to high levels of false positives. Given their prevalence, zero values should be properly handled by the modeling process by accounting for the zero generating process in the first place. We describe here a truncated log-normal graphical model that specifically addresses zeros originating from biological absences, and discuss consistent methods for estimating sparse and high-dimensional association networks. We also show that this model generates sparse multivariate counts that are closer to those derived from real-world microbiomes.

This is a PLOS Computational Biology Methods paper.

1 Introduction

Metagenomics has increased our awareness that microbes most often live in communities structured by both environmental factors and ecological associations between community members. Understanding the structure and dynamics of microbial communities thus requires the ability to detect such associations, and is of both fundamental [1] and practical importance as it may ultimately enable the design and engineering of consortia of interacting organisms in order to fulfill various needs (e.g. improving the efficiency of wastewater treatment plants [2, 3] and designing more robust fecal transplants [4]).

Interactions within these microbial systems can have a positive, negative or null impact on the involved organisms, leading to a typology of pairwise interactions based on combinations of these win or loss outcomes, e.g. Lidicker [5] distinguishes the mutualism (+/+), competition (-/-), predation or parasitism ((+/-) or (-/+)), amensalism ((0/-) or (-/0)) and commensalism ((+/0) or (0/+)) interaction types. However, because of the current inability to co-culture most of the microbes sampled from the environment [6, 7], computational methods play a key role in predicting microbial associations. The advent of high-throughput sequencing and metagenomics resulted in the production of large amounts of multivariate microbial abundance data that could in theory be leveraged to reconstruct microbial association networks [8]. In practice, however, only limited success has been met in terms of robust structure inference of real-world microbial association networks, and different methods frequently yield quite different results [9].

In principle, methods that consider the conditional dependency structure of microbial networks (like probabilistic graphical models) should perform better than methods using only univariate associations to predict pairwise associations, because the former can distinguish direct from indirect associations (e.g. associations between two species mediated through a third species). For example, ref [10] showed that “univariate networks” (i.e. networks based on pairwise associations predicted from univariate statistical associations) include high proportions of false-positive predictions when there is a substantial level of dependence between the samples, while the number of false positives was highly reduced when conditional networks were computed.

In the framework of Gaussian Graphical Models (GGMs), the structure of the inverse covariance matrix Σ⁻¹ (also known as the precision matrix) encodes an undirected graph whose edges represent conditional dependencies between variables [11]. Powerful graph inference algorithms for the latter include the graphical lasso (glasso) [12] and neighborhood selection [13]. Importantly, however, the Gaussian assumption does not accommodate the excess of zeros typical of metagenomic datasets.

Indeed, a key difficulty for the statistical methods is the extreme sparsity and overdispersion of real-world sequence-based abundance data, including non random absences (structural zeros) that can lead to high levels of false positive results. Early population-level metagenomic studies highlighted extreme sparsity patterns in the human microbiome, with a large proportion of taxa being rare and absent in the majority of subjects, resulting in far more zero counts for each taxon than expected on the basis of Poisson, negative binomial, or Dirichlet-multinomial distributions [14].

Given their prevalence, zero values need to be properly handled by the modeling approaches, which requires accounting for the zero generating process in the first place. Zero values are usually attributed to either the stochastic nature of sampling (sampling zeros), failed measurements arising from technical bias (technical zeros) or a true absence of signal (structural or “biological” zeros) [15]. Previous work addressing technical zeros (absence of data) includes [16], while [17] is an example of recent work dealing with sampling zeros.

Most of the existing work focuses on Gaussian graphical models, which also include Poisson log-normal models, where counts follow Poisson distributions with parameters sampled from a latent multivariate Gaussian variable (network inference proceeds in the latent space), as illustrated in [17–20]. Methods based on log-normal models in the GGM framework include gCoda [21], Banocc [22], and Gaussian copulas with zero-inflated marginals [23, 24]. Outside the realm of GGMs, previous work on Poisson graphical models includes [16] and [25]. However, there is currently no consistent multivariate Poisson distribution to model dependencies between count variables, as Poisson graphical models fail to have a proper joint distribution [25, 26] or to have both marginal and conditional Poisson distributions [27].

While there is an abundant literature about models developed for the reconstruction of ecological networks from microbial occurrence or abundance data, these typically make strong assumptions about the zero generating process (e.g. that all zeros are explained by a common probability model) and don’t investigate the consequences of deviating from these assumptions. Yet, the classical distinction between sampling zeros and structural zeros is not refined enough to describe the assumptions underlying commonly used models, like zero-inflated ones, which can lead to substantial biases under both simulated and real-world data settings [15].

We present here a truncated log-normal graphical model for inferring microbial association networks that both accounts for the compositionality of microbial abundance data and specifically addresses zeros originating from biological absence (structural zeros), and we discuss consistent methods for estimating sparse and high-dimensional inverse covariance matrices.

Regarding the generative aspect of the model, which is important for producing realistic abundance data (e.g. for benchmarking purposes), an attractive approach (implemented in the popular Spiec-Easi toolkit [23]) relies on Gaussian copulas coupled with count distribution marginals in order to generate multivariate count distributions. Dependencies between variables can be simulated with a latent multivariate Gaussian distribution, which is then composed with count distributions, typically zero-inflated negative binomials. This protocol yields multivariate zero-inflated negative binomial distributions, but it does not produce compositional data, nor does it generate realistic count distributions when the proportion of zeros and the variances are large. We propose a zero-inflated log normal model for generating sparse multivariate counts, and demonstrate that it produces more realistic counts (Fig 1).

Fig 1 — (A) Global feature by feature difference between the real (LifeLines-Deep cohort) and simulated counts measured by Kolmogorov-Smirnov statistics. (B) Top: histograms showing real and simulated count data. Real counts obtained from shotgun sequencing the microbiomes of the LifeLines-Deep cohort are shown in red; green, purple and blue represent synthetic counts generated respectively by metaSPARSim [28], spiec-easi [23] and our model (Poisson Zero Inflated Log Normal). The different panels show results for different randomly selected taxa. Down: quantile-quantile (QQ)-plots showing the logarithm of real counts vs the logarithm of synthetic counts.

2 Motivation of the proposed model

The use of a multivariate zero-inflated model, in which a latent multivariate normal variable accounts for the dependency structure, is partly inspired by the work of Lee et al. in the context of variable selection [14], with the following specific motivations:

to have a better representation of real counts (e.g. improving over commonly used zero inflated negative binomial models) (Fig 1)
to have a truly compositional data generator
to take structural (biological) zeros into account at the network inference level
to leverage the vast literature and experience available on log-normal models (e.g. [17, 18, 21], among others)

3 Materials and methods

We describe next the zero inflated log-normal model and a corresponding network inference method. Observed counts are denoted y_ij, where i = 1, …, n indexes the samples and j = 1, …, p the variables (taxa). We denote by y_i the vector of size p of observed counts for sample i and, more generally, write vectors of size p in bold letters. The count vectors are gathered into an n × p matrix Y.

3.1 The model

Our model is derived from the multivariate Poisson log-normal model [29], to which a zero-inflated component is added. We consider a latent multivariate normal variable z_i and a variable a_i representing the real (unknown) abundances. We define the multivariate distribution of a_i as follows:

$$
\begin{aligned}
z_{i} &\sim N(μ, Σ) \\
a_{ij} &= 1_{z_{ij} > δ_{j}} \, e^{z_{ij}}
\end{aligned}
\tag{1}
$$

where $N (μ, Σ)$ represents the multivariate normal distribution with mean μ and covariance Σ and $1$ is the indicator function (with δ determining the zero probability).

In Gaussian Graphical Models, the structure of Σ⁻¹ encodes an undirected graph with edges representing conditional dependencies between variables [11]. In this model, zeros are interpreted as biological zeros and are related to a latent Gaussian variable.

In metagenomics, the observed counts y_i reflect the abundance proportions π_i of microbes present in the sample: $π_{ij} = \frac{a_{ij}}{\sum_{k} a_{ik}}$. The number of reads observed for each sample is generally variable, and we denote it N_i. The counts can then be modeled with a Poisson or multinomial distribution whose parameter is proportional to π_i:

$$
y_{i} \sim M(N_{i}, π_{i})
\tag{2}
$$
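As an illustration, the generative process of Eqs 1 and 2 can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `simulate_ziln_counts`, its arguments and the handling of samples in which every taxon is absent are assumptions made for this example, and the multinomial variant of Eq 2 is used.

```python
import numpy as np

def simulate_ziln_counts(mu, Sigma, delta, depths, seed=0):
    """Draw counts from the ZiLN model of Eqs 1 and 2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    depths = np.asarray(depths, dtype=np.int64)
    n, p = len(depths), len(mu)
    # Eq 1: latent Gaussian layer, then thresholding at delta_j and exponentiation.
    z = rng.multivariate_normal(mu, Sigma, size=n)
    a = np.where(z > delta, np.exp(z), 0.0)
    # Eq 2: multinomial sampling of N_i reads with proportions pi_i = a_i / sum(a_i).
    y = np.zeros((n, p), dtype=np.int64)
    for i in range(n):
        total = a[i].sum()
        if total > 0:  # assumption: a sample with no taxon present keeps all-zero counts
            y[i] = rng.multinomial(depths[i], a[i] / total)
    return y, a
```

Taxa that are structurally absent in a sample (a_ij = 0) receive zero reads by construction, while present taxa can still yield sampling zeros when their proportion is small relative to the sequencing depth.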

For example, [28] describes a method for simulating 16S rRNA gene count data based on gamma distributions, but it only moderately fits real data when sparsity and overdispersion are high, as shown in Fig 1. The same figure shows that the zero-inflated negative binomial distribution is not a good fit either when its parameters are adjusted to real-world metagenomic data using the data generation protocol of [23].

3.2 Network inference method

We next describe a network structure inference method suitable for metagenomic data consistent with the generative model presented above.

We will first present a centered log-ratio (clr)-like data transformation preserving scale invariance in the presence of zeros, and then propose an inference method for learning the sparse structure of Σ⁻¹ that is inspired by the work of [18].

3.2.1 Data transformation

Because of variations in sequencing depths across samples, the total number of reads in a sample only reflects the relative proportion of microbes it contains [30]. This compositional nature of sequence data prevents naive correlation-based analyses without appropriate correction [29], of which log-ratio based methods are the most commonly used. In the latter, the handling of zeros by adding a fixed pseudocount (i.e., irrespective of the sequencing depth of the various samples) has been shown to introduce important biases when the datasets include rare taxa and/or are inhomogeneous in coverage [10]. Downsampling or ignoring samples to level off sequence depth, or aggressive thresholding to remove rare taxa, can somewhat alleviate this bias, although this has been criticised as statistically unsound [31] and is performed at the cost of discarding large amounts of possibly biologically relevant data.

One way to address this difficulty, as originally shown by Aitchison [29] and implemented in popular toolkits (e.g. [23]), is to apply a centered-log-ratio (clr) transformation to the counts:

$$
y_{ij}^{clr} = \log(y_{ij}) - \frac{1}{p} \sum_{k=1}^{p} \log(y_{ik})
\tag{3}
$$

An important property of this transformation is scale-invariance, which guarantees that two samples with similar read proportions will have similar transformed count data. Indeed, if counts represent absolute abundances with a scaling factor a_i ≈ λy_i, then we would have $a_{i}^{clr} \approx y_{i}^{clr}$ .
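The scale invariance of Eq 3 can be checked directly. As a short worked derivation (with λ > 0 an arbitrary scaling factor and all counts assumed strictly positive), the log λ terms cancel:

$$
\begin{aligned}
(λ y_{i})_{j}^{clr} &= \log(λ y_{ij}) - \frac{1}{p} \sum_{k=1}^{p} \log(λ y_{ik}) \\
&= \log λ + \log(y_{ij}) - \log λ - \frac{1}{p} \sum_{k=1}^{p} \log(y_{ik}) \\
&= y_{ij}^{clr}
\end{aligned}
$$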

As clr is not defined at zero, a typical workaround consists in adding a unit pseudocount to the original data in order to avoid numerical problems. However, doing so breaks the scale invariance property, as shown in S2 Text. We therefore propose a slightly different version of clr, preserving scale invariance in the presence of zeros and defined as follows:

$$
\tilde{y}_{ij} =
\begin{cases}
\log(y_{ij}) - \dfrac{\sum_{k, y_{ik} \neq 0} \log(y_{ik})}{|k, y_{ik} \neq 0|} & \text{if } y_{ij} \neq 0 \\
0 & \text{if } y_{ij} = 0
\end{cases}
\tag{4}
$$

where |k, y_ik ≠ 0| denotes the number of non zero entries of y_i. A proof that this transformation preserves scale invariance in the presence of zeros can be found in S2 Text.
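The transformation of Eq 4 is straightforward to implement. The sketch below is illustrative only: the function names are hypothetical, and returning the non-zero mask alongside the transformed matrix is a design choice of this example, not something prescribed by the text. A pseudocount-based clr is included only to contrast its behaviour under rescaling.

```python
import numpy as np

def zero_preserving_clr(Y):
    """Eq 4: center the logs of the non-zero counts of each sample, keep zeros at zero."""
    Y = np.asarray(Y, dtype=float)
    nonzero = Y > 0
    log_y = np.zeros_like(Y)
    log_y[nonzero] = np.log(Y[nonzero])
    n_nonzero = nonzero.sum(axis=1, keepdims=True)
    # Mean of log counts over the non-zero entries of each row (guarding empty rows).
    row_mean = log_y.sum(axis=1, keepdims=True) / np.maximum(n_nonzero, 1)
    return np.where(nonzero, log_y - row_mean, 0.0), nonzero

def pseudocount_clr(Y):
    """Standard clr (Eq 3) applied after adding a unit pseudocount."""
    log_y = np.log(np.asarray(Y, dtype=float) + 1.0)
    return log_y - log_y.mean(axis=1, keepdims=True)
```

Multiplying a count matrix by a constant leaves the output of `zero_preserving_clr` unchanged, whereas `pseudocount_clr` changes. The mask is returned because a non-zero count that happens to equal the geometric mean of the non-zero counts of its sample is also mapped to 0, so the transformed value alone does not identify absences.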

3.2.2 Effect of clr transformation on the estimation of Σ⁻¹

In “canonical” (non zero-inflated) log-normal models, when $a_{i} \sim LN(μ, Σ)$, we have $a_{i}^{clr} \sim N(Fμ, FΣF)$, where $F = I - \frac{1}{p} J$ and J is a p × p matrix filled with ones. Even though there is no guarantee that FΣF is close to Σ (in the sense of the Frobenius norm) when $p \gg 0$, as shown in S1 Text, the approximation appears reasonably good in practice (see e.g. [23]).
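For a given covariance matrix, the gap between FΣF and Σ can be inspected numerically. The helper names below are hypothetical and the relative Frobenius distance is only one possible way of quantifying the approximation; this is a sketch, not part of the published method.

```python
import numpy as np

def clr_covariance(Sigma):
    """Covariance F Sigma F of clr-transformed log-normal abundances."""
    p = Sigma.shape[0]
    F = np.eye(p) - np.ones((p, p)) / p  # centering matrix F = I - J / p
    return F @ Sigma @ F

def relative_frobenius_gap(Sigma):
    """Relative Frobenius distance between F Sigma F and Sigma."""
    return np.linalg.norm(clr_covariance(Sigma) - Sigma) / np.linalg.norm(Sigma)
```

For Σ equal to the identity, the gap equals 1/√p, so it shrinks as the number of taxa grows; for other covariance structures the gap depends on Σ and should be checked case by case.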

In the zero inflated log-normal model, we will assume similarly that if a_i follows a distribution as defined in Eq 1, then $\tilde{a_{i}}$ follows a zero inflated normal distribution defined as:

$$
\begin{aligned}
z_{i} &\sim N(\tilde{μ}, \tilde{Σ}) \\
\tilde{a}_{ij} &= 1_{z_{ij} > \tilde{δ}_{j}} \, z_{ij}
\end{aligned}
\tag{5}
$$

and assume that $\tilde{Σ}$ is a good approximation of Σ.

3.2.3 Sparse network inference

We will assume that $\tilde{y_{i}} \approx \tilde{a_{i}}$ follows the zero inflated normal distribution defined in Eq 5. For network structure inference, we follow a procedure similar in spirit to [18] and [24], with distinct data transformation steps to estimate the latent layer, whose key steps can be summarized as follows:

obtain initial estimates of parameters ${\hat{μ}}_{j}$, $\hat{Σ} = diag({\hat{Σ}}_{11}, {\hat{Σ}}_{22}, \dots, {\hat{Σ}}_{pp})$ and ${\hat{δ}}_{j}$
transform data to posterior mean of $Z | \tilde{Y}$ to obtain $\tilde{Z}$
infer the structure of Σ⁻¹ with either the graphical lasso (glasso) [12] using the empirical correlation matrix of $\tilde{Z}$ , or neighborhood search (Meinshausen and Bühlmann, MB) [13] on $\tilde{Z}$ .

The glasso [12] solves a penalized likelihood maximization problem for the multivariate normal distribution, and Ambroise and Chiquet have shown (personal communication and S4 Text) that the MB algorithm solves a penalized pseudo-likelihood maximization problem. As shown in [18], this last step can therefore be viewed as the maximization step of an EM algorithm, with the first two steps yielding the expectation step.
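To make the last step concrete, the following sketch implements a plain neighborhood selection on a matrix of latent values: each variable is regressed on all others with an ℓ1 penalty, and the supports are symmetrized. It is not the implementation used in the paper; the coordinate-descent lasso, the function names, the fixed iteration count and the "or"/"and" symmetrization options are assumptions of this example, and columns are assumed to be non-constant.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1 / 2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # current residual y - X b
    for _ in range(n_iter):
        for k in range(p):
            if col_sq[k] == 0:
                continue
            r += X[:, k] * b[k]          # remove the contribution of coordinate k
            rho = X[:, k] @ r / n
            b[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[k]
            r -= X[:, k] * b[k]          # add the updated contribution back
    return b

def neighborhood_selection(Z, lam, rule="or"):
    """Regress each column of Z on the others and symmetrize the supports."""
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # assumes no constant column
    p = Z.shape[1]
    B = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        B[j, others] = lasso_cd(Z[:, others], Z[:, j], lam)
    support = B != 0
    return (support | support.T) if rule == "or" else (support & support.T)
```

Running the function over a decreasing grid of `lam` values yields a sequence of graphs of increasing density, that is, a solution path.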

3.2.4 Initial parameter estimates

We describe here how to obtain initial estimates for the model parameters $\hat{δ}, \hat{μ}, \hat{Σ}$. As shown in [18], a diagonal estimation of Σ can be made in order to reduce the computational burden of the method (note that [32] provides an algorithm for maximizing the penalized likelihood without the diagonality assumption). Under the constraint that $\hat{Σ}$ is diagonal, all variables are independent and the marginal parameters ${\hat{μ}}_{j}$ and ${\hat{Σ}}_{jj}$ can be estimated separately for each variable. We have the following likelihood from the marginal distributions, which have both a continuous and a discrete part:

$$
L(μ_{j}, Σ_{jj}, δ_{j} \mid \tilde{y}_{i}) = \prod_{i} \left( 1_{\tilde{y}_{ij} = 0} \, Φ_{μ_{j}, Σ_{jj}}(δ_{j}) + 1_{\tilde{y}_{ij} \neq 0} \, f_{μ_{j}, Σ_{jj}}(\tilde{y}_{ij}) \right)
\tag{6}
$$

where $f_{μ_{j}, Σ_{jj}}$ and $Φ_{μ_{j}, Σ_{jj}}$ are respectively the probability density function and the cumulative distribution function of the normal distribution with mean μ_j and variance Σ_jj.

The solution for ${\hat{δ}}_{j}$ is straightforward and independent of other parameters:

$$
\hat{δ}_{j} = \min_{i, y_{ij} \neq 0} \tilde{y}_{ij}
\tag{7}
$$

The other parameters can be obtained by maximizing the log-likelihood:

$$
\hat{μ}_{j}, \hat{Σ}_{jj} = \operatorname*{argmax}_{μ, σ^{2}} \left[ \sum_{i, \tilde{y}_{ij} = 0} \log\left(Φ_{μ, σ^{2}}(\hat{δ}_{j})\right) + \sum_{i, \tilde{y}_{ij} \neq 0} \log\left(f_{μ, σ^{2}}(\log(y_{ij}))\right) \right]
\tag{8}
$$

We denote $\hat{Σ} = diag({\hat{Σ}}_{11}, {\hat{Σ}}_{22}, \dots, {\hat{Σ}}_{pp})$.
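The marginal estimation can be sketched for a single taxon as a left-censored normal fit. The code below is a hedged illustration and not the authors' code: the function name, the use of SciPy's Nelder-Mead optimizer, the log-parametrization of the standard deviation and the explicit presence mask are all assumptions. In particular, the density term is evaluated on the transformed values as in Eq 6, which is an interpretation made for this example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_censored_normal(y_tilde_j, present_j):
    """Estimate (delta_j, mu_j, Sigma_jj) for one taxon from transformed values.

    y_tilde_j : transformed values of taxon j across samples
    present_j : boolean mask, True where the original count is non-zero
    """
    observed = y_tilde_j[present_j]
    n_zero = int((~present_j).sum())
    delta = observed.min()  # Eq 7: smallest transformed non-zero value

    def neg_log_lik(theta):
        mu, log_sd = theta
        sd = np.exp(log_sd)  # optimizing log(sd) keeps the scale positive
        censored = n_zero * norm.logcdf(delta, mu, sd)   # zero part
        continuous = norm.logpdf(observed, mu, sd).sum() # non-zero part
        return -(censored + continuous)

    start = np.array([observed.mean(), np.log(observed.std() + 1e-6)])
    result = minimize(neg_log_lik, start, method="Nelder-Mead")
    mu_hat, sd_hat = result.x[0], np.exp(result.x[1])
    return delta, mu_hat, sd_hat ** 2
```

Applying this function column by column yields the diagonal estimate of Σ together with the vectors of estimated means and thresholds.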

3.2.5 Posterior mean transformation

Following [18], we run the structure inference algorithm on the inferred latent Gaussian variable $\tilde{z}_{i}$. $\tilde{Z}$ is obtained by posterior mean transformation:

$$
\tilde{Z} := E_{\hat{δ}, \hat{μ}, \hat{Σ}}[Z \mid \tilde{Y}]
\tag{9}
$$

As $\hat{Σ}$ is diagonal, each value of $\tilde{Z}$ can be computed separately (see S3 Text),

$$
\tilde{z}_{ij} = E_{\hat{δ}_{j}, \hat{μ}_{j}, \hat{Σ}_{jj}}[z_{ij} \mid \tilde{y}_{ij}]
\tag{10}
$$

for which the computation is easy:

$$
\tilde{z}_{ij} =
\begin{cases}
\dfrac{\int_{-\infty}^{\hat{δ}_{j}} y \, f_{\hat{μ}_{j}, \hat{Σ}_{jj}}(y) \, dy}{Φ_{\hat{μ}_{j}, \hat{Σ}_{jj}}(\hat{δ}_{j})} & \text{if } \tilde{y}_{ij} = 0 \\
\tilde{y}_{ij} & \text{if } \tilde{y}_{ij} \neq 0
\end{cases}
\tag{11}
$$
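The first case of Eq 11 is the mean of a normal distribution truncated above at the threshold, for which the standard closed form E[z | z ≤ δ] = μ − σ φ(α) / Φ(α), with α = (δ − μ) / σ and φ, Φ the standard normal density and cdf, avoids numerical integration. The sketch below uses this closed form; the function name and the explicit presence mask are assumptions of this example.

```python
import numpy as np
from scipy.stats import norm

def posterior_mean_transform(y_tilde, present, delta, mu, var):
    """Eq 11: replace zeros by the mean of the normal truncated above at delta_j.

    y_tilde, present : (n, p) arrays (transformed data and non-zero mask)
    delta, mu, var   : length-p arrays of per-taxon estimates
    """
    sd = np.sqrt(var)
    alpha = (delta - mu) / sd
    # phi(alpha) / Phi(alpha), computed in log space for numerical stability
    ratio = np.exp(norm.logpdf(alpha) - norm.logcdf(alpha))
    truncated_mean = mu - sd * ratio
    return np.where(present, y_tilde, truncated_mean)
```

The per-taxon vectors broadcast against the n × p data matrix, so every zero entry of a given taxon receives the same imputed latent value.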

Interestingly, we noticed that fixing a small inaccuracy in [18] during data transformation at the level of the diagonal terms of $\tilde{S}$ in Equation (8) of S3 Text did not increase the support of inferred networks. Our correction actually leads to increased variance of individual observations with less covariance information being shifted across variables, therefore possibly leading to poorer support for the network structure.

3.3 Synthetic datasets

In order to compare the performance of various methods without knowing the true underlying associations among taxa in real datasets, we designed synthetic datasets with a controlled ground truth.

For our simulations, we tested two generative models. The first is the model described in section 3.1 with Eqs (1) and (2) (ZiLN). The second is the model implemented in [23] using Gaussian copulas and zero-inflated negative binomial marginals (combined in the NorTA protocol). The covariance matrix Σ associated to a given graph topology is generated using the Spiec-Easi framework [23], with condition number κ = 100 and the number of edges e equal to the number of variables p (the sparsity assumption of the association graph is translated into the scaling of the number of taxon-taxon associations with the number of taxa).

3.3.1 Synthetic networks for Simulation 1

In the first simulation, we used the ZiLN generative model in conjunction with different network topologies (band, Erdos-Renyi and scale-free), and varying sample numbers n = 50, 100, 300, 1000, 2000 and taxa numbers p = 100, 200, 500. The library sizes N_i were drawn from negative binomial distributions $N_{i} \sim NB(mean = 1.5 \times 10^{6}, size = 5)$. The sparsity levels of taxa s_j (i.e. the expected proportions of zeros) were drawn from a uniform distribution s_j ∼ unif(0, 0.9), bringing the overall proportion of zeros to 45% of count values. The correspondence to the parameter δ is given by $δ_{j} = Φ_{μ_{j}, σ_{j}^{2}}^{-1}(s_{j})$, and the mean of the Gaussian distribution was also chosen randomly from a uniform distribution: μ_j ∼ unif(0, 3).
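The parameter draws of Simulation 1 can be sketched as follows, given a covariance matrix Σ generated elsewhere. The function name is hypothetical, and the conversion from the (mean, size) parametrization of the negative binomial to NumPy's (n, p) parametrization is spelled out in a comment.

```python
import numpy as np
from scipy.stats import norm

def draw_simulation1_parameters(n, Sigma, seed=0):
    """Draw library sizes, sparsity levels, means and thresholds (illustrative)."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    sd = np.sqrt(np.diag(Sigma))
    mean_depth, size = 1.5e6, 5
    # NumPy uses (n, p); setting p = size / (size + mean) gives the requested mean.
    depths = rng.negative_binomial(size, size / (size + mean_depth), size=n)
    s = rng.uniform(0.0, 0.9, size=p)       # expected proportion of zeros per taxon
    mu = rng.uniform(0.0, 3.0, size=p)      # latent Gaussian means
    delta = norm.ppf(s, loc=mu, scale=sd)   # delta_j = inverse normal cdf at s_j
    return depths, s, mu, delta
```

By construction, the probability that the latent variable of taxon j falls below δ_j equals s_j, which is how the target sparsity translates into the threshold of Eq 1.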

3.3.2 Synthetic networks for Simulation 2

In the second simulation, the sparsity level of the variables is varied while p and n are fixed to p = 300 and n = 100 respectively. We used a deterministic formulation for s: $s_{j} = \left(\frac{j}{d}\right)^{t}$, where t is chosen so that the mean of s_j is equal to the target global proportion of zeros, which takes the values 0, 10, 50, 70 and 90%. The parameters μ and N_i are chosen as in Simulation 1.

For this second simulation, we also used the NorTA framework from [23] for comparative purposes, with a latent normal distribution $z_{i} \sim N(0, Σ)$ and count values generated by composition with marginal inverse cumulative distribution functions (cdfs) $y_{i} = (F_{1}^{-1}(Φ_{0, Σ_{11}}(z_{i1})), F_{2}^{-1}(Φ_{0, Σ_{22}}(z_{i2})), \dots, F_{p}^{-1}(Φ_{0, Σ_{pp}}(z_{ip})))$, where the F_j are the cdfs of zero-inflated negative binomial distributions. As parameters for those distributions, we took means ${mean}_{j} = e^{μ_{j}}$, a size ν fixed to ν = 10, and a probability of zero equal to s_j.
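The NorTA construction above can be sketched with SciPy's negative binomial quantile function. This is an illustration under stated assumptions rather than the Spiec-Easi implementation: the function names are hypothetical, and s_j is interpreted here as the zero-inflation probability of the marginal (the text only states "a probability of zero").

```python
import numpy as np
from scipy.stats import norm, nbinom

def zinb_quantile(u, mean, size, pi0):
    """Inverse cdf of a zero-inflated negative binomial with inflation pi0."""
    # The ZINB cdf is pi0 + (1 - pi0) * F_NB(k); invert it for u > pi0.
    q = (u - pi0) / np.maximum(1.0 - pi0, 1e-12)
    q = np.clip(q, 0.0, 1.0 - 1e-12)
    k = nbinom.ppf(q, size, size / (size + mean))
    return np.where(u <= pi0, 0.0, k).astype(np.int64)

def simulate_norta_zinb(n, Sigma, mu, s, size=10, seed=0):
    """Latent Gaussian draw, probability integral transform, ZINB quantiles."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    u = norm.cdf(z / np.sqrt(np.diag(Sigma)))  # uniform marginals
    return zinb_quantile(u, np.exp(mu), size, s)
```

Because each marginal is transformed independently, the dependency structure of the counts comes entirely from the latent Gaussian layer, and no compositional constraint is imposed on the resulting counts.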

4 Results

4.1 Accounting for sparsity and overdispersion

In order to assess the ability of our model to deal with the high levels of sparsity and overdispersion that are inherent to natural microbial abundance datasets, we compared the extent to which synthetic count data generated under our model and two other model-based approaches are able to reproduce abundances of taxonomic profiles generated from human gut microbiomes of the population-based LifeLines-Deep cohort [33]. This cohort includes 1,135 individuals (474 men and 661 women) from the general Dutch population, whose gut microbiomes were shotgun sequenced using the Illumina short-read technology, generating an average of 32 million reads per sample (EBI dataset EGAD00001001991). We phylogenetically profiled the sequences from the LifeLines-Deep cohort using the metapalette software [34], and several samples by taxa matrices corresponding to different taxonomic ranks were constructed from these results, e.g. the raw species-level matrix contains 3957 taxa (variables) and about 90% of zero entries.

The performance of our model was compared to two other methods generating synthetic count data by integrating information from experimental count distributions in their model-based generative process: spiec-easi [23] and metasparsim [28]. Spiec-easi implements a Normal to Anything (NorTA) approach [35] to generate correlated count data by coupling Gaussian copulas with count distribution marginals (typically zero-inflated negative binomials) fitted on real-world count data [23], while metasparsim is a tool designed to generate 16S rRNA gene counts based on a two-step gamma multivariate hypergeometric model [28].

Fig 1 provides different views of the performance of the three methods, with the top panel of Fig 1B showing count histograms for different randomly selected taxa, while the bottom panel presents the same results in the form of quantile-quantile plots (QQ-plots). Both representations make apparent that the zero inflated log-normal model provides a better fit to the very sparse and overdispersed data extracted from the real-world microbiomes. In order to provide a more global assessment of the differences between real and simulated counts, we measured the feature by feature differences between these distributions with the Kolmogorov-Smirnov statistic (i.e. the largest absolute difference between the observed cdf of the real and simulated taxa abundances), thus providing a distance between them. Fig 1A shows that the proposed method achieves a smaller per-feature difference with the real distribution.
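A per-feature comparison of this kind can be sketched with SciPy's two-sample Kolmogorov-Smirnov test. The function name and the assumption that both matrices share the same column (taxon) order are specific to this example.

```python
import numpy as np
from scipy.stats import ks_2samp

def per_feature_ks(real, simulated):
    """Two-sample KS statistic between real and simulated counts, taxon by taxon."""
    real = np.asarray(real)
    simulated = np.asarray(simulated)
    return np.array([
        ks_2samp(real[:, j], simulated[:, j]).statistic
        for j in range(real.shape[1])
    ])
```

The resulting vector contains one value per taxon, between 0 (identical empirical distributions) and 1 (fully separated ones), and can be summarized across taxa, for example with a boxplot per simulator.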

This observation also hints that state of the art methods (Spiec-Easi [23] and metaSPARSim [28]) used to simulate multivariate count data from rRNA amplicons fail to properly account for counts derived from shotgun metagenomic data, probably because of higher overdispersion in the latter.

4.2 Accuracy of microbial network inference

We compared our method against four state of the art methods for inferring microbial association networks. Magma [24] is a method for detecting microbial associations specifically aiming to deal with an excess of zero counts, while also taking compositionality and overdispersion into account; it is based on Gaussian copula models with marginals modeled with zero-inflated negative binomial generalized linear models, and the core algorithm for network structure inference is the graphical lasso, with its inference procedure relying on the estimation of the latent data by a specific medium imputation procedure [24]. Spiec-easi [23] is a popular and widely used toolkit for inferring microbial ecological networks; it is designed to deal with both compositionality and sparsity, and relies on algorithms for sparse neighborhood and inverse covariance selection (i.e. the graphical lasso) for network structure inference. SparCC is a relatively older but widely used method dealing with compositional data and measuring the linear relationship between log transformed abundances [36]. Flashweave [37] is a recently developed tool based on the HITON [38] constraint-based conditional independence solver and leverages the so-called local-to-global learning framework to infer directly associated neighborhoods of variables in large systems [37].

We performed a comparison of these three model-based methods in terms of their ability to accurately recover the structure of controlled networks of different dimensions, topologies and sampling regimes (described in subsection 3.3.1 of the Materials and Methods). The performance of the methods was quantified by the area under precision-recall curves (AUPR, which should be preferred over ROC-AUC on imbalanced datasets with many true negatives like microbiome taxonomic data), and is displayed in Fig 2 for synthetic datasets encompassing different topologies (band, Erdos-Renyi, scale-free), dimensions (p = 100, 300, 500) and sample numbers (n = 50, 100, 300, 1000, 2000). We also initially measured these performances using AUROC (shown in S3 Fig), and assessed the precision of the methods on the 50 most confident edge predictions (shown in S2 Fig), both of which are largely consistent with the AUPR results shown in Fig 2.

Fig 2 — Bars represent the median over 10 runs, and error bars + 25% and −25% quantiles. Each method (Spiec-Easi (blue) [23], MAGMA (gray) [24] and ours (ZiLN, orange), as well as no transformation at all (green)), was tested with two structure inference algorithms (glasso and neighborhood selection). Two other orthogonal (i.e. based on distinct rationales) network inference methods, sparcc (dark-brown) [36] and Flashweave (dark-red) [37], were included to broaden the comparisons.

In practice, this comparison process does not rely on the inference of a single graph but of a series of graphical models of varying sparsity (including the empty and complete network as extremes), which together constitute the solution path. Each of the decreasing values of the sparsity-controlling parameter in the solution path leads to a graph solution, including its set of edges. If we denote the latter by E_i for graph solution i, we therefore have E₁ ⊂ E₂ ⊂ … ⊂ E_i ⊂ … ⊂ E_n, which provides a natural ordering of the edges in these sets: edges associated with elevated sparsity-controlling parameters are more reliable than edges appearing only under low sparsity settings. This provides a rationale for ordering the edges and computing AUPR values.
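The edge ordering described above can be turned into a score and a precision-recall summary as in the following sketch. It is an illustration only: the function names are hypothetical, the score (number of path solutions containing an edge) assumes nested edge sets, ties are broken arbitrarily by a stable sort, and average precision is used as one common estimate of the area under the precision-recall curve.

```python
import numpy as np

def edge_scores_from_path(path):
    """Score each edge by the number of solutions of the path that contain it."""
    return np.sum([np.asarray(adj, dtype=bool) for adj in path], axis=0)

def average_precision(scores, truth):
    """Average precision of ranked edges against a ground-truth adjacency matrix."""
    truth = np.asarray(truth)
    iu = np.triu_indices_from(truth, k=1)  # each undirected edge counted once
    s = np.asarray(scores)[iu]
    labels = truth[iu].astype(bool)
    order = np.argsort(-s, kind="stable")  # most reliable edges first
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float((precision * hits).sum() / max(hits.sum(), 1))
```

Edges that enter the path early, under strong penalization, appear in more solutions and therefore receive higher scores, which reproduces the ordering E₁ ⊂ E₂ ⊂ … described in the text.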

Besides the expected dependence of the results on network topology, dimension and sample number, this figure makes apparent significant performance gains of the proposed method, in particular for the recovery of networks endowed with scale-free topologies that were shown to be the most difficult to reconstruct accurately in previous studies [23]. It should be noted however that all methods show only limited accuracy for this network topology when operating in the under-determined regime, where the number of variables (taxa, p) exceeds the number of samples (n). It should also be noted that the MB algorithm appears to perform better than the graphical lasso. A similar figure, presenting network reconstruction accuracy metrics for a distinct dataset generated using the NorTA protocol with zero inflated negative binomials, is shown in the Supplemental Information (see S1 Fig).

On the other hand, the experiments designed to probe the effects of increasing levels of sparsity (using the datasets described in subsection 3.3.2 of the Materials and Methods) make clear that all five methods are strongly affected by increasing proportions of zeros, as shown in Fig 3, with no method performing accurately under very high levels of sparsity (i.e., 90% of zeros).

Fig 3 — Top panel: network reconstruction accuracy for various methods using synthetic data generated with the NorTA protocol using zero inflated negative binomial marginals (see Methods). Bottom panel: network reconstruction accuracy using data generated under the zero inflated log-normal (ZiLN) model. Notations are as in Fig 2, and results are shown for the Erdos-Renyi topology.

4.3 Inference of real-world microbial association networks

We finally applied our method to infer microbial association networks from real-world taxonomic profiles generated from healthy gut microbiomes of the LifeLines-Deep [33] population cohort.

Even though confirmed ecological associations within natural microbial communities are only scarcely known, several global network properties have been reproducibly documented, including the salient observation of preferential associations among phylogenetically close organisms [39, 40]. This property, known as assortativity, is manifest at the level of highly connected components in our analysis, and is displayed for the species taxonomic level in Fig 4.

Fig 4 — Nodes represent species and are colored according to their taxonomic order.

To provide more quantitative support for the observed assortativity phenomenon, we computed assortativity coefficients (homophily of the graph) for the different methods at various phylogenetic levels using the igraph R package [41] (shown in S1 Table).
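For readers unfamiliar with the quantity, the assortativity coefficient of a categorical node attribute (here, a taxonomic rank) can be sketched in a few lines. The analysis in the paper relied on the igraph R package; the pure-Python function below is only a stand-in written for illustration, following Newman's definition r = (Σ_i e_ii − Σ_i a_i²) / (1 − Σ_i a_i²), where e_ij is the fraction of edge ends joining categories i and j and a_i = Σ_j e_ij. The function name and the input format are assumptions of this example.

```python
from collections import defaultdict


def nominal_assortativity(edges, labels):
    """Newman's assortativity coefficient for a categorical node attribute.

    edges  : iterable of (u, v) pairs describing an undirected graph.
    labels : dict mapping each node to its category (e.g. taxonomic order).
    """
    mixing = defaultdict(float)  # mixing[(a, b)]: edge ends joining a and b
    total = 0
    for u, v in edges:
        a, b = labels[u], labels[v]
        mixing[(a, b)] += 1  # count each undirected edge in both directions
        mixing[(b, a)] += 1
        total += 2
    if total == 0:
        raise ValueError("the graph has no edges")

    # Fraction of edge ends connecting nodes of the same category
    trace = sum(w for (a, b), w in mixing.items() if a == b) / total

    # Marginal fraction of edge ends attached to each category
    marginal = defaultdict(float)
    for (a, _), w in mixing.items():
        marginal[a] += w / total
    expected = sum(x * x for x in marginal.values())

    if expected == 1.0:
        raise ValueError("undefined when all edge ends share one category")
    return (trace - expected) / (1.0 - expected)


# Toy example: two orders, "A" and "B", on four species.
toy_labels = {0: "A", 1: "A", 2: "B", 3: "B"}
toy_r = nominal_assortativity([(0, 1), (2, 3), (1, 2)], toy_labels)
```

A value of 1 means that edges only connect nodes of the same category, 0 corresponds to random mixing, and negative values indicate that edges preferentially connect different categories.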

To assess the extent to which the taxa association predictions are shared among the five methods, we counted the edges that the different methods predict in common. To obtain comparable edge sets, we retained the 1200 top-ranked edges for each method (for glasso, for instance, this entails choosing the value of the sparsity-controlling parameter that yields 1200 detected edges). Fig 5 illustrates the overlap between the different edge sets, and makes apparent the substantial and still problematic differences that exist among the various predictions. On the one hand, this is partly expected and consistent with previous observations [9]. On the other hand, the figure shows that our method has the highest ratio (239/100) of edges predicted unanimously by all methods to edges predicted by that method alone.
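The overlap computation can be sketched as below. This is a hypothetical illustration: the function names, the use of a per-edge confidence score to select the top-k edges, and the toy method names are assumptions of this example and are not taken from the paper. The second function returns the edges found by every method together with, for each method, the edges that no other method predicts, which are the two quantities entering the ratio discussed above.

```python
def top_k_edges(scores, k):
    """Keep the k highest-scoring edges.

    scores : dict mapping an edge (frozenset of two nodes) to a confidence
             score, where a higher score means a more reliable edge.
    Ties are broken by node labels to keep the result deterministic.
    """
    ranked = sorted(scores, key=lambda e: (-scores[e], sorted(e)))
    return set(ranked[:k])


def consensus_and_unique(edge_sets):
    """Split predictions into consensus edges and method-specific edges.

    edge_sets : dict mapping a method name to its set of predicted edges.
    Returns (common, unique): `common` holds the edges predicted by all
    methods, `unique[name]` the edges predicted by `name` only.
    """
    common = set.intersection(*edge_sets.values())
    unique = {}
    for name, edges in edge_sets.items():
        others = set().union(
            *(s for other, s in edge_sets.items() if other != name)
        )
        unique[name] = edges - others
    return common, unique


def edge(i, j):
    """Undirected edge between nodes i and j (hypothetical helper)."""
    return frozenset((i, j))


# Toy example with three made-up methods.
predictions = {
    "method_a": {edge(0, 1), edge(1, 2), edge(2, 3)},
    "method_b": {edge(0, 1), edge(1, 2), edge(3, 4)},
    "method_c": {edge(0, 1), edge(1, 2), edge(4, 5)},
}
common, unique = consensus_and_unique(predictions)
ratio_a = len(common) / len(unique["method_a"])  # consensus vs idiosyncratic
```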

Fig 5 — To simplify the plot, a threshold was applied here to remove taxa absent in more than 20% of the samples, leading to a total of 565 species. It can be seen that our method has the highest ratio (239/100) of edges predicted by all methods versus idiosyncratic ones, but that overall substantial differences remain among the predictions of the various methods.

5 Discussion

Accurate inference of microbial association networks is necessary in order to gain a deeper understanding of the functioning of microbial consortia and to take advantage of the increasingly large body of data that is generated worldwide through environmental genomics initiatives. The formalism of Gaussian graphical models naturally lends itself to the problem of network reconstruction, but despite attractive mathematical attributes and the availability of powerful inference algorithms, application of this framework to the identification of robust microbial associations has only met with moderate success. It appears unlikely though that the problem lies in the GGM formalism itself or in inherently flawed inference algorithms; instead, it is most likely rooted in the difficulty of properly accounting for the idiosyncrasies of biological data. Beyond compositionality and overdispersion, high levels of sparsity are common attributes of real-world biological datasets, and are challenging to account for in a statistically sound manner. In particular, the important distinction between sampling zeros and structural zeros is often neglected, and nearly all current analyses of multidimensional metagenomic count data perform some form of aggressive thresholding that discards rare but possibly biologically relevant data.

We presented here a simple statistical model specifically aimed at accounting for large amounts of structural (biological) zeros in the network inference process, and provided evidence of its usefulness by showing performance gains with respect to several state-of-the-art methods for microbial network inference. Unoptimized code as well as scripts to replicate the figures can be accessed at https://github.com/vincentprost/Zi-LN.

Supporting information

S1 Text. Problem with compositionality under the Gaussian assumption.

Discussion of the issue with compositionality under the log-normal model in a high dimensional setting.

(PDF)


S2 Text. Problem with the clr transformation when there is an excess of zeros.

Discussion of the effect of the clr transformation in the presence of an excess of biological zeros.

(PDF)


S3 Text. One step EM.

Description of the one step EM procedure.

(PDF)


S4 Text. Neighborhood selection as a penalized maximization problem.

Formalization of neighborhood selection as a penalized maximization problem, justifying its inclusion in the one step EM procedure.

(PDF)


S1 Table. Assortativity coefficients.

Assortativity coefficients (p < 10⁻⁴) of graphs for the different methods at various phylogenetic levels.

(PDF)


S1 Fig. Effect of the number of samples (n), number of taxa (p) and network topology on the performance of the various methods, measured with the area under precision-recall curve (AUPR), on synthetic datasets generated with the NorTA approach using zero inflated negative binomial marginals (see Methods).

Bars represent the median over 10 runs, and error bars +25% and −25% quantiles. Each method (Spiec-Easi (blue) [23], MAGMA (gray) [24] and ours (ZiLN, orange), as well as no transformation at all (green)), was tested with two structure inference algorithms (glasso and neighborhood selection). SparCC (dark-brown) [36] and Flashweave (dark-red) [37] are two unrelated inference methods based on a distinct (orthogonal) rationale, and were included for broadening the comparisons.

(PDF)


S2 Fig. Precision of the different methods on the top 50 edges only (could be compared to Fig 2, see main text).

(PDF)


S3 Fig. This figure is analogous to Fig 2 in the main text, but with AUROC computed instead of AUPR.

(PDF)


Acknowledgments

We would like to thank Christophe Ambroise and Julien Chiquet for helpful discussions and advice, and Sawsan Kanj for generating the taxonomic profiles of the LifeLines-Deep cohort.

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

V.P. was supported by a Ph.D. grant from CEA’s High Commissioner office (“Thèse Phare”). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–1039. doi: 10.1126/science.1153213
2. Wagner M, Loy A. Bacterial community composition and function in sewage treatment systems. Current opinion in biotechnology. 2002;13(3):218–227. doi: 10.1016/S0958-1669(02)00315-4
3. Schmidt I, Sliekers O, Schmid M, Bock E, Fuerst J, Kuenen JG, et al. New concepts of microbial treatment processes for the nitrogen removal in wastewater. FEMS microbiology reviews. 2003;27(4):481–492. doi: 10.1016/S0168-6445(03)00039-1
4. Ianiro G, Rossi E, Thomas AM, Schinzari G, Masucci L, Quaranta G, et al. Faecal microbiota transplantation for the treatment of diarrhoea induced by tyrosine-kinase inhibitors in patients with metastatic renal cell carcinoma. Nature communications. 2020;11(1):1–6. doi: 10.1038/s41467-020-18127-y
5. Lidicker WZ Jr. A clarification of interactions in ecological systems. Bioscience. 1979;29(8):475–477. doi: 10.2307/1307540
6. Solden L, Lloyd K, Wrighton K. The bright side of microbial dark matter: lessons learned from the uncultivated majority. Current opinion in microbiology. 2016;31:217–226. doi: 10.1016/j.mib.2016.04.020
7. Goers L, Freemont P, Polizzi KM. Co-culture systems and technologies: taking synthetic biology to the next level. Journal of The Royal Society Interface. 2014;11(96):20140065. doi: 10.1098/rsif.2014.0065
8. Faust K, Raes J. Microbial interactions: from networks to models. Nature Reviews Microbiology. 2012;10(8):538–550. doi: 10.1038/nrmicro2832
9. Weiss S, Van Treuren W, Lozupone C, Faust K, Friedman J, Deng Y, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME journal. 2016;10(7):1669–1681. doi: 10.1038/ismej.2015.235
10. Tackmann J, Rodrigues JFM, von Mering C. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell systems. 2019;9(3):286–296. doi: 10.1016/j.cels.2019.08.002
11. Lauritzen SL. Graphical Models. Oxford University Press; 1996.
12. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical LASSO. Biostatistics (Oxford, England). 2008;9:432–41. doi: 10.1093/biostatistics/kxm045
13. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34(3):1436–1462. doi: 10.1214/009053606000000281
14. Lee KH, Coull BA, Moscicki AB, Paster BJ, Starr JR. Bayesian variable selection for multivariate zero-inflated models: Application to microbiome count data. Biostatistics. 2018.
15. Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. bioRxiv. 2018.
16. Choi H, Gim J, Won S, Kim Y, Kwon S, Park C. Network analysis for count data with excess zeros. BMC Genetics. 2017;18. doi: 10.1186/s12863-017-0561-z
17. Chiquet J, Robin S, Mariadassou M. Variational Inference for sparse network reconstruction from count data. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. vol. 97 of Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR; 2019. p. 1162–1171. Available from: http://proceedings.mlr.press/v97/chiquet19a.html.
18. Sinclair D, Hooker G. Sparse inverse covariance estimation for high-throughput microRNA sequencing data in the Poisson log-normal graphical model. Journal of Statistical Computation and Simulation. 2019;89(16):3105–3117. doi: 10.1080/00949655.2019.1657116
19. Biswas S, Mcdonald M, Lundberg D, Dangl J, Jojic V. Learning Microbial Interaction Networks from Metagenomic Count Data. Journal of Computational Biology. 2016;23:526–535. doi: 10.1089/cmb.2016.0061
20. Wu H, Deng X, Ramakrishnan N. Sparse Estimation of Multivariate Poisson Log-Normal Models from Count Data; 2016.
21. Fang H, Huang C, Zhao H, Deng M. gCoda: Conditional Dependence Network Inference for Compositional Data. Journal of Computational Biology. 2017;24. doi: 10.1089/cmb.2017.0054
22. Schwager E, Mallick H, Ventz S, Huttenhower C. A Bayesian method for detecting pairwise associations in compositional data. PLOS Computational Biology. 2017;13:e1005852. doi: 10.1371/journal.pcbi.1005852
23. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Computational Biology. 2015;11(5):1–25. doi: 10.1371/journal.pcbi.1004226
24. Cougoul A, Bailly X, Wit EC. MAGMA: inference of sparse microbial association networks. bioRxiv. 2019.
25. Gallopin M, Rau A, Jaffrézic F. A Hierarchical Poisson Log-Normal Model for Network Inference from RNA Sequencing Data. PloS one. 2013;8:e77503. doi: 10.1371/journal.pone.0077503
26. Allen GI, Liu Z. A log-linear graphical model for inferring genetic networks from high-throughput sequencing data. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine. IEEE; 2012. p. 1–6.
27. Yang E, Allen G, Liu Z, Ravikumar PK. Graphical models via generalized linear models. In: Advances in Neural Information Processing Systems; 2012. p. 1358–1366.
28. Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. metaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20. doi: 10.1186/s12859-019-2882-6
29. Aitchison J, Ho C. The multivariate Poisson-log normal distribution. Biometrika. 1989;76(4):643–653. doi: 10.1093/biomet/76.4.643
30. Weiss S, Xu Z, Peddada S, Amir A, Bittinger K, González A, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5. doi: 10.1186/s40168-017-0237-y
31. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014;10(4):e1003531. doi: 10.1371/journal.pcbi.1003531
32. Gégout-Petit A, Muller-Gueudin A, Karmann C. Graph estimation for Gaussian data zero-inflated by double truncation; 2019. Available from: https://hal.archives-ouvertes.fr/hal-02367344.
33. Zhernakova A, Kurilshikov A, Bonder MJ, Tigchelaar EF, Schirmer M, Vatanen T, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352(6285):565–569. doi: 10.1126/science.aad3369
34. Koslicki D, Falush D. MetaPalette: A K-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation. MSystems. 2016;1(3):e00020–16. doi: 10.1128/mSystems.00020-16
35. Nelsen RB. An introduction to copulas. Springer Science & Business Media; 2007.
36. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 2012;8(9):e1002687. doi: 10.1371/journal.pcbi.1002687
37. Tackmann J, Rodrigues JFM, von Mering C. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell systems. 2019;9(3):286–296. doi: 10.1016/j.cels.2019.08.002
38. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. Journal of Machine Learning Research. 2010;11(1).
39. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS computational biology. 2012;8(7). doi: 10.1371/journal.pcbi.1002606
40. Hall CV, Lord A, Betzel R, Zakrzewski M, Simms LA, Zalesky A, et al. Co-existence of network architectures supporting the human gut microbiome. iScience. 2019;22:380–391. doi: 10.1016/j.isci.2019.11.032
41. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009089.r001

Decision Letter 0

Stefano Allesina, Niranjan Nagarajan

22 Mar 2021

Dear Dr. Prost,

Thank you very much for submitting your manuscript "A zero inflated log-normal model for inference of sparse microbial association networks" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We apologise for the delay in review as one of the reviewers did not provide a review despite repeated reminders.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Niranjan Nagarajan

Associate Editor

PLOS Computational Biology

Stefano Allesina

Deputy Editor

PLOS Computational Biology

***********************

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this paper titled “A zero inflated log-normal model for inference of sparse microbial association networks”, Prost et al. introduce a new algorithm for inferring associations from microbiome data, taking into account the sparseness of the data by using a modified version of the centered-log-ratio transformation. While this algorithm would likely contribute to the association inference toolbox, I do have some concerns about the methods and results.

Major concerns:

1. The authors tended to use “association” and “ecological interaction” interchangeably. As it is noted in recent works (Carr et al, 2019; Hirano & Takemoto, 2019), associations between microbial abundances do not necessarily mean interactions.

2. Introduction Line 76: How to tell a zero is a structural zero or a sampling zero? In practice, sampling zeros are essentially treated the same as structural zeros by this algorithm? Then how would the sequencing depth affect the performance of the algorithm?

3. Figure 1: as this figure only shows some examples, the authors might consider calculating a distance between simulated and true distributions, and showing a comparison of the distances across all taxa.

4. In synthetic dataset 1 and 2, what was the number of edges in each simulation? Will this (i.e. the network density) affect the performance of the inference?

5. While AUPR is good for measuring the performance in some perspectives, it would be great if the authors could consider other metrics. For instance, the precision and recall (and F1 maybe) of the top 50 predictions is probably more important for practical usage.

6. Fig 4: It would be helpful to run some network clustering algorithm or community detection algorithm to demonstrate that phylogenetically related species were more often in the same cluster/community than random.

Minor concerns:

1. What is the \delta_j in equation 1?

2. Introduction Line 9: I guess the authors mean “mutualism” (positive-positive interaction) instead of “commensalism” (neutral-positive interaction). Why are these four types of interactions “most important”?

3. Section 3.2.1: The first paragraph seems largely repetitive with the text in the introduction

4. Methods: How was the EM algorithm stopped?

Reviewer #2: This paper presents a truncated log-normal model to evaluate networks for microbiome data that accounts for compositionality and the zero-inflated counts. This model can be used to make inferences about microbial interaction networks and to simulate microbiome data. The authors compare simulated data results and application to a motivating dataset across different approaches. The paper is very well written and the model description is clear. I offer a few suggestions or points that might be helpful to clarify below.

1) How are the covariances between taxa affected by the 0 counts? Would rare taxa tend to have higher covariances simply because they tend to be absent from the majority of samples rather than due to an actual interaction? In other words, for two rare taxa that do not co-occur in any samples, is their covariance greater than 0?

2) The authors present the clr transformation to account for the compositionality of the abundance data, but there are other transformations (ilr) that are equally common. I was wondering if the authors could comment on whether their approach can similarly be applied to the other transformations or if some of the attributes are specific to using the clr?

3) I don’t quite understand how the real counts in Fig 1 are aligned with the simulated counts, more specifically how are the real counts from taxa 540 matched up with the simulated counts from each approach. Did the authors use some information about the taxa from the real data to simulate counts for the same taxa? Or is there actually no pairing of information within each panel and the figure is meant to represent a range of distributions?

4) In Section 3.3.1, it wasn’t clear to me why the size for the negative binomial random number generator equaled 5. Same question for the parameters used to select the mean of the Gaussian distribution. Were these based on estimates from the real dataset?

5) In the results, the authors refer to methods being compared using the reference numbers (line 265), it would be helpful to also include the name of the method as it appears in the referenced figure so the reader doesn’t have to refer to earlier text.

6) I don’t have as much experience with the network approaches as other parts of the model, so my apologies if this is a well-known concept in the network literature. I don’t understand what information is being used to estimate the AUPR, the authors mention it is a measure of accuracy of network structure. Does it focus more on recovering the edges assuming all the nodes are the same or is that true only in this context since the nodes don’t change. A little bit more explanation or a reference would be helpful.

7) It would also be helpful to have some additional information for the real-world example related to the sparsity of the counts, a description of the average number of sequences per sample and the total number of taxa. This would help the reader put this data in context with the simulated data presented earlier in the paper.

8) Do the authors have an idea of how the results presented in Figure 5 might change using different values to define edges? It wasn’t clear how the presence of an edge was defined and whether this was consistently used for all approaches.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2021 Jun 18;17(6):e1009089. doi: 10.1371/journal.pcbi.1009089.r002

Author response to Decision Letter 0

29 Apr 2021

Attachment

Submitted filename: prost_answers2reviewers.pdf

Click here for additional data file.^{(204.3KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009089.r003

Decision Letter 1

Stefano Allesina, Niranjan Nagarajan

17 May 2021

Dear Dr. Prost,

We are pleased to inform you that your manuscript 'A zero inflated log-normal model for inference of sparse microbial association networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Niranjan Nagarajan

Associate Editor

PLOS Computational Biology

Stefano Allesina

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all my concerns.

Reviewer #2: The authors have adequately addressed my previous comments. I appreciate the detailed responses. After reflecting more on my first comment regarding the covariance of two rare taxa, I believe that the authors' modified approach to calculating the CLR completely alleviates my concern.

The authors also note from their simulations that data with 90% zeros were not described well by any method, this might be included as a consideration for the section related to the motivating example since that dataset also contained 90% zeros.

Neither of these comments are meant to suggest changes to the current version of the paper.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009089.r004

Acceptance letter

Stefano Allesina, Niranjan Nagarajan

14 Jun 2021

PCOMPBIOL-D-20-02287R1

A zero inflated log-normal model for inference of sparse microbial association networks

Dear Dr Prost,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Katalin Szabo

Associated Data

Supplementary Materials

S1 Text. Problem with compositionality under the Gaussian assumption.

Discussion of the issue with compositionality under the log-normal model in a high dimensional setting.

(PDF, 155.1KB)

S2 Text. Problem with the clr transformation when there is an excess of zeros.

Discussion of the effect of the clr transformation in the presence of an excess of biological zeros.

(PDF, 129KB)

S3 Text. One step EM.

Description of the one step EM procedure.

(PDF, 197.5KB)

S4 Text. Neighborhood selection as a penalized maximization problem.

Formalization of neighborhood selection as a penalized maximization problem, justifying its inclusion in the one step EM procedure.

(PDF, 172.4KB)

S1 Table. Assortativity coefficients.

Assortativity coefficients (p < 10⁻⁴) of graphs for the different methods at various phylogenetic levels.

(PDF, 50.1KB)

(PDF, 18.9KB)

S2 Fig. Precision of the different methods on the top 50 edges only (could be compared to Fig 2, see main text).

(PDF, 18.5KB)

S3 Fig. This figure is analogous to Fig 2 in the main text, but with AUROC computed instead of AUPR.

(PDF, 18.7KB)

Attachment

Submitted filename: prost_answers2reviewers.pdf

(PDF, 204.3KB)

Data Availability Statement

All relevant data are within the manuscript and its Supporting information files.

1. Falkowski PG, Fenchel T, Delong EF. The microbial engines that drive Earth’s biogeochemical cycles. Science. 2008;320(5879):1034–1039. doi: 10.1126/science.1153213

2. Wagner M, Loy A. Bacterial community composition and function in sewage treatment systems. Current opinion in biotechnology. 2002;13(3):218–227. doi: 10.1016/S0958-1669(02)00315-4

3. Schmidt I, Sliekers O, Schmid M, Bock E, Fuerst J, Kuenen JG, et al. New concepts of microbial treatment processes for the nitrogen removal in wastewater. FEMS microbiology reviews. 2003;27(4):481–492. doi: 10.1016/S0168-6445(03)00039-1

4. Ianiro G, Rossi E, Thomas AM, Schinzari G, Masucci L, Quaranta G, et al. Faecal microbiota transplantation for the treatment of diarrhoea induced by tyrosine-kinase inhibitors in patients with metastatic renal cell carcinoma. Nature communications. 2020;11(1):1–6. doi: 10.1038/s41467-020-18127-y

5. Lidicker WZ Jr. A clarification of interactions in ecological systems. Bioscience. 1979;29(8):475–477. doi: 10.2307/1307540

6. Solden L, Lloyd K, Wrighton K. The bright side of microbial dark matter: lessons learned from the uncultivated majority. Current opinion in microbiology. 2016;31:217–226. doi: 10.1016/j.mib.2016.04.020

7. Goers L, Freemont P, Polizzi KM. Co-culture systems and technologies: taking synthetic biology to the next level. Journal of The Royal Society Interface. 2014;11(96):20140065. doi: 10.1098/rsif.2014.0065

8. Faust K, Raes J. Microbial interactions: from networks to models. Nature Reviews Microbiology. 2012;10(8):538–550. doi: 10.1038/nrmicro2832

9. Weiss S, Van Treuren W, Lozupone C, Faust K, Friedman J, Deng Y, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME journal. 2016;10(7):1669–1681. doi: 10.1038/ismej.2015.235

10. Tackmann J, Rodrigues JFM, von Mering C. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell systems. 2019;9(3):286–296. doi: 10.1016/j.cels.2019.08.002

11. Lauritzen SL. Graphical Models. Oxford University Press; 1996.

12. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical LASSO. Biostatistics (Oxford, England). 2008;9:432–41. doi: 10.1093/biostatistics/kxm045

13. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann Statist. 2006;34(3):1436–1462. doi: 10.1214/009053606000000281

14. Lee KH, Coull BA, Moscicki AB, Paster BJ, Starr JR. Bayesian variable selection for multivariate zero-inflated models: Application to microbiome count data. Biostatistics. 2018.

15. Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. bioRxiv. 2018.

16. Choi H, Gim J, Won S, Kim Y, Kwon S, Park C. Network analysis for count data with excess zeros. BMC Genetics. 2017;18. doi: 10.1186/s12863-017-0561-z

17. Chiquet J, Robin S, Mariadassou M. Variational Inference for sparse network reconstruction from count data. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. vol. 97 of Proceedings of Machine Learning Research. Long Beach, California, USA: PMLR; 2019. p. 1162–1171. Available from: http://proceedings.mlr.press/v97/chiquet19a.html.

18. Sinclair D, Hooker G. Sparse inverse covariance estimation for high-throughput microRNA sequencing data in the Poisson log-normal graphical model. Journal of Statistical Computation and Simulation. 2019;89(16):3105–3117. doi: 10.1080/00949655.2019.1657116

19. Biswas S, Mcdonald M, Lundberg D, Dangl J, Jojic V. Learning Microbial Interaction Networks from Metagenomic Count Data. Journal of Computational Biology. 2016;23:526–535. doi: 10.1089/cmb.2016.0061

20. Wu H, Deng X, Ramakrishnan N. Sparse Estimation of Multivariate Poisson Log-Normal Models from Count Data; 2016.

21. Fang H, Huang C, Zhao H, Deng M. gCoda: Conditional Dependence Network Inference for Compositional Data. Journal of Computational Biology. 2017;24. doi: 10.1089/cmb.2017.0054

22. Schwager E, Mallick H, Ventz S, Huttenhower C. A Bayesian method for detecting pairwise associations in compositional data. PLOS Computational Biology. 2017;13:e1005852. doi: 10.1371/journal.pcbi.1005852

23. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Computational Biology. 2015;11(5):1–25. doi: 10.1371/journal.pcbi.1004226

24. Cougoul A, Bailly X, Wit EC. MAGMA: inference of sparse microbial association networks. bioRxiv. 2019.

25. Gallopin M, Rau A, Jaffrézic F. A Hierarchical Poisson Log-Normal Model for Network Inference from RNA Sequencing Data. PloS one. 2013;8:e77503. doi: 10.1371/journal.pone.0077503

26. Allen GI, Liu Z. A log-linear graphical model for inferring genetic networks from high-throughput sequencing data. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine. IEEE; 2012. p. 1–6.

27. Yang E, Allen G, Liu Z, Ravikumar PK. Graphical models via generalized linear models. In: Advances in Neural Information Processing Systems; 2012. p. 1358–1366.

28. Patuzzi I, Baruzzo G, Losasso C, Ricci A, Di Camillo B. metaSPARSim: a 16S rRNA gene sequencing count data simulator. BMC Bioinformatics. 2019;20. doi: 10.1186/s12859-019-2882-6

29. Aitchison J, Ho C. The multivariate Poisson-log normal distribution. Biometrika. 1989;76(4):643–653. doi: 10.1093/biomet/76.4.643

30. Weiss S, Xu Z, Peddada S, Amir A, Bittinger K, González A, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5. doi: 10.1186/s40168-017-0237-y

31. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014;10(4):e1003531. doi: 10.1371/journal.pcbi.1003531

32. Gégout-Petit A, Muller-Gueudin A, Karmann C. Graph estimation for Gaussian data zero-inflated by double truncation; 2019. Available from: https://hal.archives-ouvertes.fr/hal-02367344.

33. Zhernakova A, Kurilshikov A, Bonder MJ, Tigchelaar EF, Schirmer M, Vatanen T, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352(6285):565–569. doi: 10.1126/science.aad3369

34. Koslicki D, Falush D. MetaPalette: A K-mer painting approach for metagenomic taxonomic profiling and quantification of novel strain variation. MSystems. 2016;1(3):e00020–16. doi: 10.1128/mSystems.00020-16

35. Nelsen RB. An introduction to copulas. Springer Science & Business Media; 2007.

36. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 2012;8(9):e1002687. doi: 10.1371/journal.pcbi.1002687

37. Tackmann J, Rodrigues JFM, von Mering C. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell systems. 2019;9(3):286–296. doi: 10.1016/j.cels.2019.08.002

38. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research. 2010;11(1).

39. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS computational biology. 2012;8(7). doi: 10.1371/journal.pcbi.1002606

40. Hall CV, Lord A, Betzel R, Zakrzewski M, Simms LA, Zalesky A, et al. Co-existence of network architectures supporting the human gut microbiome. iScience. 2019;22:380–391. doi: 10.1016/j.isci.2019.11.032

41. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006;Complex Systems:1695.


