PLOS Computational Biology. 2020 Sep 8;16(9):e1008108. doi: 10.1371/journal.pcbi.1008108

Generalized estimating equation modeling on correlated microbiome sequencing data with longitudinal measures

Bo Chen 1, Wei Xu 1,2,*
Editor: Benjamin Althouse 3
PMCID: PMC7500673  PMID: 32898133

Abstract

Existing models for assessing microbiome sequencing data, such as counts of operational taxonomic units (OTUs), can only test predictors’ effects on OTUs. There is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationships into models that evaluate longitudinal OTU measures. We propose a novel approach that estimates OTU correlations based on their taxonomic structure and applies this correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors’ effects and OTU correlations. We develop a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. In addition, longitudinal and other types of repeated OTU measures are integrated into the MTLC model. Extensive simulations were conducted to evaluate the performance of the MTLC method. Compared with existing methods, the MTLC method shows robust and consistent estimation and improved statistical power for testing predictors’ effects. Lastly, we demonstrate the proposed method by applying it to a real human microbiome study of obesity in twins.

Author summary

Human microbiome sequencing data analysis has been a fast growing area of genomic research in recent years. Although there have been several works for detecting predictors on a single operational taxonomic unit (OTU) or multiple OTUs simultaneously, there is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationship into models to evaluate longitudinal OTU measures. Here we propose a novel approach to estimate OTU correlations based on their taxonomic structure after integrating longitudinal and other types of repeated OTU measures, and apply such correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors’ effects and OTU correlations. The method is theoretically sound and practically easy to implement, and we provide corroborating evidence from simulation and a real human microbiome study.


This is a PLOS Computational Biology Methods paper.

Introduction

Human microbiome sequencing data analysis has been a fast-growing area of genomic research in recent years. Several studies have shown that microbial composition is associated with environmental and host factors [1–3]. Microbiome data are usually characterized by 16S ribosomal ribonucleic acid (rRNA) gene sequencing or shotgun metagenomics sequencing [4, 5]. Both sequencing technologies provide reads of bacteria counts clustered into operational taxonomic units (OTUs), where each OTU is typically mapped to a taxon at the species, genus, family, order, class, phylum, kingdom or domain level of a taxonomic structure.

For each sample, OTU counts can be converted to relative abundances (RAs). Whether the OTU data are in the form of counts or RAs, several analytical challenges prevent the application of standard regression methods to association studies between microbial composition and environmental or genetic factors. First, OTU data usually contain excessive zeros, which prevents modelling them with standard distributions. Next, for each individual, there may exist repeated measures of OTUs, such as microbiome samples collected from different locations of the human body, or multiple observations at different time points in a longitudinal setting. Furthermore, the sequencing method usually detects hundreds or thousands of OTUs, which are potentially correlated with each other [6]. Identifying correlations between taxa is a common goal in genomic surveys [7]. An accurately estimated correlation can be used to determine drivers in environmental ecology or contributions to habitat niches or disease; it is also a powerful tool to help researchers with hypothesis generation, such as determining which interactions might be biologically relevant in their system and should be studied further [8]. So instead of considering each OTU as independent, it is desirable to incorporate the taxonomic information, which reflects the correlation structure between the OTUs, into the analysis.

Several solutions have been proposed to answer each of these challenges. Zero-inflated microbiome data can be fitted by either zero-inflated models or two-part models [9, 10]. Repeated measures can be characterized by random effects in mixed effects models [11–15]. Modelling multiple OTUs together remains a challenging problem, although several attempts have been made. La Rosa et al. [16] and Chen et al. [17] proposed an approach which assumes that multiple OTUs follow a Dirichlet multinomial (DM) distribution. However, the DM assumption imposes a negative correlation among OTUs, whereas the true correlation can be either positive or negative. In addition, it has a fixed covariance structure which cannot flexibly handle various dispersion patterns. Tang et al. [18] proposed a zero-inflated generalized Dirichlet multinomial distribution which allows for a more general covariance structure and excessive zeros in OTU counts. To further eliminate the negative correlation assumption, they also proposed distribution-free non-parametric tests [19, 20], which are robust to any correlation structure within a cluster of taxa. However, parameter estimates of covariate effects and correlation coefficients are not available because of the non-parametric nature of these tests. Alternatively, Shi et al. [21] proposed a model for paired-multinomial data which works for a pair of repeated measures or a pair of correlated OTUs. Zhang et al. [22] considered estimating pairwise correlations between OTUs. Xu et al. [23] used latent variables to account for the correlation of multiple OTUs. Zhan et al. and Koh et al. [24, 25] adopted a correlated sequence kernel association test assuming a random effect for each OTU, and Grantham et al. [26] used Bayesian factor analysis to cluster correlated OTUs into different factors. However, none of these approaches can model the taxonomic relationship between OTUs and provide estimates for a complex correlation structure.

In order to estimate and test the association between the predictors and OTUs while simultaneously estimating the correlation parameters between OTUs, we propose a generalized estimating equation (GEE) [27] approach which can handle multiple correlated OTUs with repeated measures. Applying GEE models to either microbiome data [28, 29] or repeated measures such as longitudinal zero-inflated data [30–32] is not new. The novel part of our method is the construction of correlation structures which can represent the taxonomic correlations and time dependency of longitudinal OTU measures. First, we develop a correlation structure of multiple OTUs that depends solely on their taxonomic structure, so that the correlation structure can provide meaningful estimates of OTU correlations. Unlike the multinomial models, which assume negative correlations, the correlation of OTUs in the proposed model can be either positive or negative. In addition, we combine the taxonomic structure with correlations due to repeated measures, and all correlations of repeated measures can be explicitly estimated.

We organize this paper as follows. In the Methodology section, the detailed methodological framework is introduced, including the zero-inflated GEE models, the construction of the correlation structure on multiple OTUs with repeated measures, and parameter estimation and hypothesis testing under the Microbiome Taxonomic Longitudinal Correlation (MTLC) model. Extensive simulation studies comparing the performance of the proposed approach with other models are presented in the Simulation section. In the Application section, the proposed model is applied to a real microbiome sequencing study. The conclusions and further improvements of our method are discussed in the Discussion section.

Methodology

Taxonomic structure of OTUs

Numerical representation of taxonomic structure

For a known taxonomic structure of N OTUs, we consider its numeric representation, i.e., representing the structure by a list of numerical vectors. Throughout this paper, we order the taxonomic levels from species (lowest) to domain (highest). First, we find the taxonomic level at which all observed N OTUs belong to the same taxon but not at one level lower, and define this level as level 1. For example, if all OTUs belong to the same class but not the same order, then the class level would be level 1. Similarly, we identify the taxonomic level at which each OTU represents a different taxon but not at one level higher, and define this level as level I. For example, if each OTU belongs to a different genus but not a different family, then the genus level would be level I. Fig 1 illustrates an example with I = 4 (class, order, family, and genus), where class is level 1 and genus is level 4.

Fig 1. Example illustrating the taxonomic structure of 6 hypothetical OTUs.

Fig 1

For $i = 1, \ldots, I$, let $M_i$ be the number of taxa at taxonomic level $i$. By definition, $M_1 = 1$ and $M_I = N$. For $m_i = 1, \ldots, M_i$, $t^{i}_{m_i}$ denotes each taxon at level $i$, and $n^{i}_{m_i}$ is the number of OTUs belonging to taxon $t^{i}_{m_i}$. The $n^{i}_{m_i}$ are then computed by the following algorithm:

  1. When $i = I$, $n^{I}_{m_I} = 1$.

  2. For $i = I - 1, \ldots, 1$,
    $$n^{i}_{m_i} = \sum_{t^{i+1}_{m_{i+1}} \in\, t^{i}_{m_i}} n^{i+1}_{m_{i+1}}.$$

It is easy to check that for $i = 1, \ldots, I$,

$$\sum_{m_i = 1}^{M_i} n^{i}_{m_i} = N.$$

Let $n^{i} = (n^{i}_{1}, \ldots, n^{i}_{M_i})$. Then the taxonomic structure can be numerically represented by $(n^{1}, \ldots, n^{I})$.

In the illustrative taxonomic structure example from Fig 1, we observe 6 correlated OTUs with $I = 4$. Then $M_1 = 1$, $M_2 = 2$, $M_3 = 3$, $M_4 = 6$, and the numerical representation of Fig 1 is $n^{1} = 6$, $n^{2} = (3, 3)$, $n^{3} = (2, 1, 3)$, $n^{4} = (1, 1, 1, 1, 1, 1)$.
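As a minimal sketch of the numerical representation, the snippet below counts the OTUs under each taxon at every level. It counts directly from per-OTU lineages rather than running the bottom-up recursion, which gives the same totals. The lineage names (`c1`, `o1`, ...) and the function name `numeric_representation` are hypothetical; they are chosen only so that the result reproduces the Fig 1 example.

```python
from collections import OrderedDict

# Hypothetical lineages for the 6 OTUs of Fig 1 (taxon names are invented):
# one tuple per OTU, ordered from level 1 (class) down to level I (genus).
LINEAGES = [
    ("c1", "o1", "f1", "g1"),
    ("c1", "o1", "f1", "g2"),
    ("c1", "o1", "f2", "g3"),
    ("c1", "o2", "f3", "g4"),
    ("c1", "o2", "f3", "g5"),
    ("c1", "o2", "f3", "g6"),
]


def numeric_representation(lineages):
    """Return the list of per-level OTU counts, one tuple per level."""
    n_levels = len(lineages[0])
    representation = []
    for level in range(n_levels):
        counts = OrderedDict()
        for lineage in lineages:
            # A taxon is identified by its full path from level 1, so equal
            # names under different parents are kept apart.
            key = lineage[: level + 1]
            counts[key] = counts.get(key, 0) + 1
        representation.append(tuple(counts.values()))
    return representation


REPRESENTATION = numeric_representation(LINEAGES)
```

Each tuple in `REPRESENTATION` sums to N, which is the check stated in the text.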

Correlation matrix of taxonomic structure

Following the taxonomic structure, it is natural to assume that OTUs belonging to the same taxa at higher levels may be correlated. Because all OTUs belong to the same taxon at the highest taxonomic level (e.g., the Bacteria domain), they are all correlated in principle. For N OTUs, there are up to $\binom{N}{2}$ pairwise correlations. When N is large, it would be infeasible to model $\binom{N}{2}$ correlation parameters, so we reduce the number of parameters by making reasonable assumptions under which many of the correlations are equal, according to the known taxonomic structure. Our basic assumption is that for a cluster of OTUs, if each OTU represents a different taxon at level i + 1 but they all belong to the same taxon at level i, then all pairwise correlations of OTUs within this cluster are equal. Under this assumption, there is only one correlation parameter in the simple case when I = 2. When I > 2, there are more than two levels in the OTU taxonomic structure, in which case the pairwise correlation coefficients for different pairs of OTUs may be equal or unequal, depending on the taxa to which the OTUs belong at each level. For a pair of OTUs, if they belong to different taxa at level i + 1 but the same taxon at level i, we call the taxon at level i their first common taxon. A natural extension of our basic assumption is that any two pairs of OTUs are assumed to have the same correlation if and only if the first common taxa of both pairs are identical. Formally, let P* and P be two pairs of OTUs with correlations ρ* and ρ, let $t^{i^*}_{m_{i^*}}$ be the first common taxon of P*, and let $t^{i}_{m_i}$ be the first common taxon of P. Then we assume

$$\rho^{*} = \rho \iff t^{i^*}_{m_{i^*}} = t^{i}_{m_i}.$$

For all N OTUs, we define a taxonomic structure matrix to indicate which correlations are equal and which are not. The taxonomic structure matrix is an N × N symmetric matrix, where all diagonal entries are denoted by D, and off-diagonal entries are indexed by uppercase Roman numbers, i.e., I,II,III (see Fig 1). Each different index value represents a different correlation, and equal index value indicates the corresponding correlations are estimated by the same coefficient. We use Roman numbers to avoid any confusion with other Arabic numerals used elsewhere throughout our work, because these indices are categorical numbers which do not indicate any quantity. The values of off-diagonal entries are determined by the following steps:

  1. For $i = 1, \ldots, I - 1$, let $\Gamma_i$ be an $N \times N$ block diagonal matrix,
    $$\Gamma_i = \begin{pmatrix} B^{i}_{1} & & \\ & \ddots & \\ & & B^{i}_{M_i} \end{pmatrix}.$$
    For $m_i = 1, \ldots, M_i$, each block $B^{i}_{m_i}$ is an $n^{i}_{m_i} \times n^{i}_{m_i}$ matrix whose diagonal entries are D and whose off-diagonal entries are $\sum_{h=0}^{i-1} M_h + m_i$, where $M_0$ has default value 0.

  2. When $i = 1$, let $\Gamma^{(1)} = \Gamma_1$ be the interim correlation matrix.

  3. For $i = 2, \ldots, I - 1$, replace the block diagonal entries of $\Gamma^{(i-1)}$ by $B^{i}_{m_i}$ and keep all other entries the same. The interim correlation matrix after the replacement at level $i$ is defined as $\Gamma^{(i)}$.

  4. Rank the distinct off-diagonal values in $\Gamma^{(I-1)}$ by size, so that the smallest value has the smallest order (order 1). Replace all off-diagonal entries by their corresponding orders in uppercase Roman numbers and define the new matrix as $\Gamma$. $\Gamma$ is the taxonomic structure matrix which is numerically represented by $(n^{1}, \ldots, n^{I})$.

In the above example of 6 hypothetical OTUs in Fig 1,

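$$\Gamma_1 = \begin{pmatrix} D&1&1&1&1&1\\ 1&D&1&1&1&1\\ 1&1&D&1&1&1\\ 1&1&1&D&1&1\\ 1&1&1&1&D&1\\ 1&1&1&1&1&D \end{pmatrix},\quad \Gamma_2 = \begin{pmatrix} D&2&2& & & \\ 2&D&2& & & \\ 2&2&D& & & \\ & & &D&3&3\\ & & &3&D&3\\ & & &3&3&D \end{pmatrix},\quad \Gamma_3 = \begin{pmatrix} D&4& & & & \\ 4&D& & & & \\ & &D& & & \\ & & &D&6&6\\ & & &6&D&6\\ & & &6&6&D \end{pmatrix}.$$

Applying steps 2 and 3 gives

$$\Gamma^{(3)} = \begin{pmatrix} D&4&2&1&1&1\\ 4&D&2&1&1&1\\ 2&2&D&1&1&1\\ 1&1&1&D&6&6\\ 1&1&1&6&D&6\\ 1&1&1&6&6&D \end{pmatrix}.$$

Applying step 4, the final taxonomic structure matrix $\Gamma$ is

$$\begin{array}{c|cccccc} & \text{OTU1} & \text{OTU2} & \text{OTU3} & \text{OTU4} & \text{OTU5} & \text{OTU6}\\ \hline \text{OTU1} & D & \mathrm{III} & \mathrm{II} & \mathrm{I} & \mathrm{I} & \mathrm{I}\\ \text{OTU2} & \mathrm{III} & D & \mathrm{II} & \mathrm{I} & \mathrm{I} & \mathrm{I}\\ \text{OTU3} & \mathrm{II} & \mathrm{II} & D & \mathrm{I} & \mathrm{I} & \mathrm{I}\\ \text{OTU4} & \mathrm{I} & \mathrm{I} & \mathrm{I} & D & \mathrm{IV} & \mathrm{IV}\\ \text{OTU5} & \mathrm{I} & \mathrm{I} & \mathrm{I} & \mathrm{IV} & D & \mathrm{IV}\\ \text{OTU6} & \mathrm{I} & \mathrm{I} & \mathrm{I} & \mathrm{IV} & \mathrm{IV} & D \end{array}$$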

In taxonomic structure matrix Γ, the index values are illustrated in Fig 1: index I indicates correlation of OTUs belonging to the same class but different orders; index II indicates correlation of OTUs belonging to the same order but different families; index III and IV indicate correlations of OTUs belonging to the same family.
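The four construction steps can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the function names are hypothetical, and it assumes the OTUs are ordered so that the members of each taxon are contiguous, as in Fig 1. The input is the numerical representation of the Fig 1 example.

```python
def int_to_roman(k):
    """Uppercase Roman numeral for a small positive integer."""
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, symbol in numerals:
        while k >= value:
            out += symbol
            k -= value
    return out


def taxonomic_structure_matrix(rep):
    """Build the index matrix Gamma from the per-level OTU counts."""
    n_otus = sum(rep[0])
    gamma = [["D"] * n_otus for _ in range(n_otus)]
    offset = 0  # running sum M_0 + ... + M_{i-1}, with M_0 = 0
    for level_counts in rep[:-1]:  # levels 1, ..., I-1 (steps 1 to 3)
        start = 0
        for m, size in enumerate(level_counts, start=1):
            for a in range(start, start + size):
                for b in range(start, start + size):
                    if a != b:
                        # Lower levels overwrite the blocks of higher levels.
                        gamma[a][b] = offset + m
            start += size
        offset += len(level_counts)
    # Step 4: rank the distinct values, smallest value gets order 1.
    values = sorted({v for row in gamma for v in row if v != "D"})
    order = {v: int_to_roman(r) for r, v in enumerate(values, start=1)}
    return [[order.get(v, "D") for v in row] for row in gamma]


FIG1_REP = [(6,), (3, 3), (2, 1, 3), (1, 1, 1, 1, 1, 1)]
GAMMA = taxonomic_structure_matrix(FIG1_REP)
```

For the Fig 1 input the interim values are 1, 2, 4 and 6 (the value 3 is fully overwritten at level 3), which step 4 maps to the indices I to IV.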

Modelling correlations from repeated measures

Correlations of longitudinal data

Repeated measures of single OTU from the same individual may be another source of correlation, e.g., OTU observation at multiple time points within the same person. Fig 2 shows repeated measures of multiple OTUs at l time points.

Fig 2. Longitudinal OTU observations at l time points.

Fig 2

There are several different ways to characterize the correlations between each pair of time points, such as exchangeable, Toeplitz and unstructured. The exchangeable structure assumes all correlations are equal to each other. The Toeplitz structure assumes time points with equal temporal distance have equal correlation. The unstructured model assumes each pair has a different correlation, and it is the most complicated structure in terms of correlation parameter estimation. Besides these, other correlation structures such as autoregressive and moving average structures are also used for longitudinal data analysis [33, 34]. In this paper, we assume the correlation structure within the same individual is pre-specified. The correlation structure matrix within the same individual following a given correlation structure is denoted by $\Omega_T$. The diagonal entries are again denoted by D, and off-diagonal entries are indexed by lowercase Roman numbers, i.e., i, ii, iii, etc. For example, if the longitudinal OTU observations consist of 3 time points, then $\Omega_T$ assuming the exchangeable structure is

$$\begin{array}{c|ccc} & T_1 & T_2 & T_3\\ \hline T_1 & D & \mathrm{i} & \mathrm{i}\\ T_2 & \mathrm{i} & D & \mathrm{i}\\ T_3 & \mathrm{i} & \mathrm{i} & D \end{array}$$

Alternatively, $\Omega_T$ assuming the Toeplitz structure is

$$\begin{array}{c|ccc} & T_1 & T_2 & T_3\\ \hline T_1 & D & \mathrm{i} & \mathrm{ii}\\ T_2 & \mathrm{i} & D & \mathrm{i}\\ T_3 & \mathrm{ii} & \mathrm{i} & D \end{array}$$
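The two index matrices above can be generated for any number of time points. The sketch below is illustrative only and the function names are hypothetical; it encodes "all off-diagonal entries share one index" for the exchangeable structure and "the index depends only on the temporal distance" for the Toeplitz structure.

```python
def to_lower_roman(k):
    """Lowercase Roman numeral for a small positive integer."""
    numerals = [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in numerals:
        while k >= value:
            out += symbol
            k -= value
    return out


def exchangeable_structure(n_times):
    """All pairs of distinct time points share the single index 'i'."""
    return [["D" if a == b else "i" for b in range(n_times)]
            for a in range(n_times)]


def toeplitz_structure(n_times):
    """Pairs with the same temporal distance |a - b| share an index."""
    return [["D" if a == b else to_lower_roman(abs(a - b))
             for b in range(n_times)]
            for a in range(n_times)]


OMEGA_EXCH = exchangeable_structure(3)
OMEGA_TOEP = toeplitz_structure(3)
```

With 3 time points the exchangeable structure has one correlation parameter and the Toeplitz structure has two.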

Sample correlation

In addition to time correlation, there may exist other types of sample correlations, such as two or more individuals from the same pedigree, or simply any repeated measures from the same individual. Without loss of generality we assume there are two repeated samples S1 and S2. Then sampling correlation is represented by correlation structure matrix ΩS:

$$\begin{array}{c|cc} & S_1 & S_2\\ \hline S_1 & D & \mathrm{i}\\ S_2 & \mathrm{i} & D \end{array}$$

Combining longitudinal and sample correlation

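Let Ω be the correlation structure combining both longitudinal and sample correlation. Ω = $\Omega_T$ or $\Omega_S$ when only time point correlation or only sample correlation exists. When both correlations exist, we consider all combinations of time points and repeated samples in one big correlation structure Ω. For example, if there are two repeated samples at each of the 3 time points, then for each OTU there are 6 observations per individual in total, and Ω becomes

$$\begin{array}{c|cccccc} & (T_1,S_1) & (T_2,S_1) & (T_3,S_1) & (T_1,S_2) & (T_2,S_2) & (T_3,S_2)\\ \hline (T_1,S_1) & D & \mathrm{i} & \mathrm{i} & \mathrm{ii} & \mathrm{iii} & \mathrm{iii}\\ (T_2,S_1) & \mathrm{i} & D & \mathrm{i} & \mathrm{iii} & \mathrm{ii} & \mathrm{iii}\\ (T_3,S_1) & \mathrm{i} & \mathrm{i} & D & \mathrm{iii} & \mathrm{iii} & \mathrm{ii}\\ (T_1,S_2) & \mathrm{ii} & \mathrm{iii} & \mathrm{iii} & D & \mathrm{i} & \mathrm{i}\\ (T_2,S_2) & \mathrm{iii} & \mathrm{ii} & \mathrm{iii} & \mathrm{i} & D & \mathrm{i}\\ (T_3,S_2) & \mathrm{iii} & \mathrm{iii} & \mathrm{ii} & \mathrm{i} & \mathrm{i} & D \end{array}$$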

Incorporating taxonomic structure with repeated measures

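Suppose Ω has dimension L. For a = 1, …, N and b = 1, …, N, $\Omega(\Gamma_{ab})$ is an L × L correlation matrix as a function of $\Gamma_{ab}$, such that

$$\Omega(\Gamma_{ab}) = \begin{pmatrix} \rho_{(\Gamma_{ab},\Omega_{11})} & \cdots & \rho_{(\Gamma_{ab},\Omega_{1L})}\\ \vdots & \ddots & \vdots\\ \rho_{(\Gamma_{ab},\Omega_{L1})} & \cdots & \rho_{(\Gamma_{ab},\Omega_{LL})} \end{pmatrix}.$$

$\Gamma_{\cdot\cdot}$ and $\Omega_{\cdot\cdot}$ are entries of Γ and Ω from the corresponding rows and columns. We denote $\Omega(\Gamma_{ab})$ as $\Omega_{ab}$ for notational simplicity.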

To integrate repeated measures correlation structure Ω with taxonomic structure Γ, we introduce the integrative correlation matrix

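$$R = \begin{pmatrix} \Omega_{11} & \cdots & \Omega_{1N}\\ \vdots & \ddots & \vdots\\ \Omega_{N1} & \cdots & \Omega_{NN} \end{pmatrix}$$

where $\Omega_{ab}$ is defined above. R is a J × J matrix where J = N × L, and each of its entries has the form $\rho_{(\Gamma_{\cdot\cdot},\Omega_{\cdot\cdot})}$. The first subscript, $\Gamma_{\cdot\cdot}$, is either D or an uppercase Roman number indexing a taxonomic structure correlation; the second subscript, $\Omega_{\cdot\cdot}$, is either D or a lowercase Roman number indexing a correlation from repeated measures of a single OTU. In the above example, $\Gamma_{11} = \Omega_{11} = D$, $\Gamma_{21} = \mathrm{III}$ and $\Omega_{21} = \mathrm{i}$. The diagonal entries of R, $\rho_{(D,D)}$, always equal 1, and the off-diagonal entries are estimated in the next section.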
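To make the block layout of R concrete, the sketch below pairs every entry of a taxonomic index matrix with every entry of a repeated-measures index matrix. The function name `integrative_structure` is hypothetical, and the inputs are a deliberately small case (two OTUs, two time points) rather than the Fig 1 example.

```python
def integrative_structure(gamma, omega):
    """Return the J x J matrix of index pairs (Gamma_ab, Omega_cd), J = N * L.

    Block (a, b) of the result is Omega(Gamma_ab): every entry of omega
    paired with the single taxonomic index gamma[a][b].
    """
    n, l = len(gamma), len(omega)
    size = n * l
    r = [[None] * size for _ in range(size)]
    for a in range(n):
        for b in range(n):
            for c in range(l):
                for d in range(l):
                    r[a * l + c][b * l + d] = (gamma[a][b], omega[c][d])
    return r


# Two OTUs sharing one taxonomic correlation, observed at two time points.
GAMMA_2 = [["D", "I"], ["I", "D"]]
OMEGA_2 = [["D", "i"], ["i", "D"]]
R_INDEX = integrative_structure(GAMMA_2, OMEGA_2)
```

Each distinct pair other than `("D", "D")` corresponds to one correlation coefficient to be estimated, so counting the distinct pairs gives the number of correlation parameters.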

Microbiome Taxonomic Longitudinal Correlation (MTLC) model

After specifying the correlation matrix within one cluster of OTUs with repeated measures, in this section, we introduce how to model the association between multiple OTUs and their predictors of interest. We propose a Microbiome Taxonomic Longitudinal Correlation (MTLC) model to estimate predictor effects, correlation coefficients between OTUs, longitudinal measures and other repeated measures. We also perform a hypothesis testing of the predictor effects based on MTLC model. The estimates and tests are achieved by Generalized Estimating Equations (GEE) framework.

Generalized estimating equation framework

Let the $y_k$ be independent clusters for $k = 1, \ldots, K$, where each cluster $y_k = (y_{k1}, \ldots, y_{kJ_k})^\top$ has length $J_k$. For $j = 1, \ldots, J_k$, let $x_{kj}$ denote the vector of covariates with length p, and let $\mu_k = (\mu_{k1}, \ldots, \mu_{kJ_k})^\top$ be the mean of $y_k$. Then for each observation $y_{kj}$,

$$g(\mu_{kj}) = x_{kj}^\top \beta \qquad (1)$$

where g is a known link function and β are the regression parameters of the p covariates $x_{kj}$. The conditional variance of $y_{kj}$ is defined as $\mathrm{Var}(y_{kj} \mid x_{kj}) = \nu(\mu_{kj})\phi$, where ν is the variance function depending on the distribution of $y_{kj}$, and ϕ is the dispersion parameter, equal to $\sigma^2$ for normally distributed $y_{kj}$ and 1 for other distributions belonging to the exponential family. For estimating β, the following generalized estimating equation is solved:

$$U(\beta) = \sum_{k=1}^{K} D_k^\top V_k^{-1} (y_k - \mu_k) = 0 \qquad (2)$$

where $D_k = d\mu_k / d\beta$ and $V_k = A_k^{1/2} R_k(\rho) A_k^{1/2}$. Here $A_k = \mathrm{diag}(\nu(\mu_{k1})\phi, \ldots, \nu(\mu_{kJ_k})\phi)$ is the diagonal matrix of the conditional variances defined above, and $R_k(\rho)$ is the working correlation matrix following the correlation structure R constructed in section “Incorporating taxonomic structure with repeated measures”, where ρ is the collection of all correlation coefficients in $R_k$. Clearly $\hat\beta$ depends on ρ and ϕ, which also need to be estimated. If we define the Pearson residual $e_{kj} = (y_{kj} - \mu_{kj}) / \sqrt{\nu(\mu_{kj})}$, then $\hat\phi = \frac{1}{\left(\sum_{k=1}^{K} J_k\right) - p} \sum_{k=1}^{K} \sum_{j=1}^{J_k} e_{kj}^2$. Next, $\hat\rho$ is estimated as a function of ϕ and $e_{kj}$. The exact formula of $\hat\rho$ depends on the correlation structure R, and a few examples of $\hat\rho$ under different structures are given in Liang et al. [27] and Wang [33]. Because the Pearson residuals $e_{kj}$ also depend on $\hat\beta$, this yields an iterative scheme which alternates between estimating β for fixed values of $\hat\phi$ and $\hat\rho$ and estimating ϕ and ρ for a fixed value of $\hat\beta$. Under GEE theory [27], this scheme yields a consistent estimate of β. Moreover, $\hat\beta$ is asymptotically normally distributed with mean β and variance

$$V_\beta = \left(\sum_{k=1}^{K} D_k^\top V_k^{-1} D_k\right)^{-1} \left\{\sum_{k=1}^{K} D_k^\top V_k^{-1}\, \mathrm{Cov}(y_k)\, V_k^{-1} D_k\right\} \left(\sum_{k=1}^{K} D_k^\top V_k^{-1} D_k\right)^{-1} \qquad (3)$$

where $\mathrm{Cov}(y_k)$ is the true underlying covariance matrix of $y_k$. The consistent estimator of $V_\beta$, $\hat V_\beta$, is obtained by substituting $\hat\beta$, $\hat\rho$, $\hat\phi$ and $\{y_k - \mu_k(\hat\beta)\}\{y_k - \mu_k(\hat\beta)\}^\top$ for β, ρ, ϕ and $\mathrm{Cov}(y_k)$.

GEE method yields consistent estimator of β, even if the structure of working correlation matrix is not correctly specified. The misspecified Rk(ρ) only affects the efficiency of β^. The consistent estimation of correlation matrix Rk(ρ^), however, relies on correct specification of the correlation structure.

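For testing a hypothesis of the form $H_0: C\beta = c$, a Wald test statistic can be used with the form

$$W = (C\hat\beta - c)^\top (C \hat V_\beta C^\top)^{-1} (C\hat\beta - c) \qquad (4)$$

and $W \xrightarrow{d} \chi^2_{(q)}$, where q is the rank of matrix C.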
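For intuition, Eq 4 can be evaluated by hand in the special case where C is a single row (q = 1), because the matrix inverse then reduces to a scalar division and the chi-square tail with one degree of freedom has a closed form through the complementary error function. The sketch below covers only that special case; the function name and the numerical inputs are hypothetical and are not results from the paper. A general q would need a matrix inverse and a chi-square tail function from a numerical library.

```python
import math


def wald_test_single_contrast(beta_hat, v_beta, contrast, c0=0.0):
    """Wald statistic and p-value for H0: contrast' beta = c0 (rank q = 1)."""
    p = len(beta_hat)
    estimate = sum(contrast[j] * beta_hat[j] for j in range(p)) - c0
    # Scalar version of C V C' for a single-row C.
    variance = sum(contrast[i] * v_beta[i][j] * contrast[j]
                   for i in range(p) for j in range(p))
    w = estimate ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value


# Illustrative numbers only: intercept and one predictor, robust variance.
BETA_HAT = [0.5, 0.196]
V_BETA = [[0.04, 0.0], [0.0, 0.01]]
W_STAT, P_VALUE = wald_test_single_contrast(BETA_HAT, V_BETA, [0.0, 1.0])
```

Here the tested coefficient is 1.96 standard errors from zero, so the p-value is close to 0.05.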

Estimating predictors effects on OTUs

Based on the GEE framework, we develop the MTLC model to assess the association between OTUs and the predictors of interest, accounting for the correlation of repeated OTU measures. To deal with the excess zeros of OTUs in the MTLC model, we first convert quantitative OTU observations to binary outcomes (0 and 1) indicating the prevalence of the OTU in each observation. Next, we focus on the OTU relative abundance (RA) of each non-zero observation and assume that the RAs follow a normal distribution after log transformation. We use two separate GEE models, one for assessing the predictor effects on OTU prevalence and the other for assessing the predictor effects on positive RA. The predictors’ overall effects are finally tested by combining the test statistics from these two GEE models.

Formally, for $k = 1, \ldots, K$ and $j = 1, \ldots, J_k$, we assume each OTU observation $y_{kj}$ follows a mixture of Bernoulli and log-normal distributions: suppose $y^{(0)}_{kj}$ follows a Bernoulli distribution with $P(y^{(0)}_{kj} = 1) = \mu^{(0)}_{kj}$, and $y^{(+)}_{kj}$ follows a normal distribution such that $y^{(+)}_{kj} \sim N(\mu^{(+)}_{kj}, \sigma^2)$. Then the distribution function of $y_{kj}$ is

$$F(y) = \begin{cases} 1 - \mu^{(0)}_{kj} & y = 0\\ 1 - \mu^{(0)}_{kj} + \mu^{(0)}_{kj}\, \Phi(\log_{10} y) & y > 0 \end{cases}$$

where Φ is the distribution function of $y^{(+)}_{kj}$. By definition, $y^{(0)}_{kj}$ represents OTU prevalence observations because

$$y^{(0)}_{kj} = \begin{cases} 0 & y_{kj} = 0\\ 1 & y_{kj} > 0 \end{cases}$$

and $y^{(+)}_{kj}$ represents the positive RAs because $\log_{10} y_{kj} = y^{(+)}_{kj}$ for all $y_{kj} > 0$. We use $y^{(0)}_k$ to denote the vector of all $y^{(0)}_{kj}$, and $y^{(+)}_k$ to denote the subset of $y^{(+)}_{kj}$ where $y_{kj} > 0$.
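The decomposition of one cluster into its prevalence part and its positive part can be sketched as follows. The function name `split_two_part` and the example values are hypothetical; the input is assumed to be a list of relative abundances for a single cluster.

```python
import math


def split_two_part(ra_values):
    """Split one cluster of RAs into prevalence indicators and log10 RAs."""
    # Prevalence part: one 0/1 indicator per observation.
    prevalence = [1 if y > 0 else 0 for y in ra_values]
    # Positive part: log10 RA, kept only where the RA is non-zero.
    log_positive = [math.log10(y) for y in ra_values if y > 0]
    return prevalence, log_positive


CLUSTER = [0.0, 0.1, 0.0, 0.01]
Y_ZERO, Y_PLUS = split_two_part(CLUSTER)
```

The prevalence vector always has the full cluster length, while the positive vector is shorter whenever the cluster contains zeros, which is why the two GEE parts can have different cluster sizes.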

Rather than running a generalized linear model directly on $y_k$, we apply the GEE method separately to $y^{(0)}_k$ and $y^{(+)}_k$. For these two GEE models, the predictors’ design matrices $X_k$ do not have to be the same in principle, although they could be the same in many practical situations. Without loss of generality, we simply assume in this paper that the predictors are the same in each part of the GEE model. We choose the logit link function for binary outcomes and the identity link function for log transformed non-zero outcomes, and the two parts of the GEE model are

$$\log\left(\frac{\mu^{(0)}_{kj}}{1 - \mu^{(0)}_{kj}}\right) = x_{kj}^\top \beta^{(0)} \qquad (5)$$

and

$$\mu^{(+)}_{kj} = x_{kj}^\top \beta^{(+)} \qquad (6)$$

Using the iterative scheme discussed in section “Generalized estimating equation framework” on $y^{(0)}_k$ and $y^{(+)}_k$, we obtain the corresponding parameter estimates $\hat\beta^{(0)}$ and $\hat\beta^{(+)}$.

Hypothesis testing

To test whether the predictors have effects on either the prevalence of OTUs or the quantitative amount of RA, the null hypothesis is

$$H_0: C^{(0)}\beta^{(0)} = c^{(0)} \ \text{ and } \ C^{(+)}\beta^{(+)} = c^{(+)}.$$

Assuming the same $X_k$ for the $y^{(0)}_k$ part and the $y^{(+)}_k$ part of the GEE model, $\beta^{(0)}$ and $\beta^{(+)}$ have the same dimension p. Moreover, $C^{(0)} = C^{(+)}$ and $c^{(0)} = c^{(+)}$ in many practical situations. For example, if we want to test the first q predictors in $X_k$ and the remaining p − q covariates are not of interest, then

$$C^{(0)} = C^{(+)} = \begin{pmatrix} I_{q \times q} & 0_{q \times (p-q)}\\ 0_{(p-q) \times q} & 0_{(p-q) \times (p-q)} \end{pmatrix}, \qquad c^{(0)} = c^{(+)} = 0.$$

For each part of $H_0$, the corresponding test statistics $W^{(0)}$ and $W^{(+)}$ are computed following Eq 4.

It follows from section “Generalized estimating equation framework” that $W^{(0)} \xrightarrow{d} \chi^2_{(q^{(0)})}$ and $W^{(+)} \xrightarrow{d} \chi^2_{(q^{(+)})}$. To test the two null hypotheses jointly by a combined test on $W^{(0)}$ and $W^{(+)}$, we adopt the Cauchy combination test [35], which does not require independence between $W^{(0)}$ and $W^{(+)}$. Let $p^{(0)}$ and $p^{(+)}$ be the corresponding p-values; then the Cauchy combination test statistic is

$$W_{MTLC} = 0.5 \tan[(0.5 - p^{(0)})\pi] + 0.5 \tan[(0.5 - p^{(+)})\pi] \xrightarrow{d} \mathrm{Cauchy}(0, 1) \qquad (7)$$
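Eq 7 is simple enough to compute directly. The sketch below is an illustrative helper, not the authors' implementation: the function name `cauchy_combination` is hypothetical, and the combined p-value is taken from the upper tail of the standard Cauchy distribution, $0.5 - \arctan(W)/\pi$. Input p-values are assumed to lie strictly between 0 and 1, since the tangent diverges at the endpoints.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Cauchy combination statistic and its combined p-value."""
    if weights is None:
        # Equal weights; with two p-values this is the 0.5 / 0.5 of Eq 7.
        weights = [1.0 / len(p_values)] * len(p_values)
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, p_values))
    # Upper tail probability of a standard Cauchy(0, 1) variable.
    p_combined = 0.5 - math.atan(stat) / math.pi
    return stat, p_combined


STAT_NULL, P_NULL = cauchy_combination([0.5, 0.5])
STAT_EQ, P_EQ = cauchy_combination([0.05, 0.05])
STAT_MIX, P_MIX = cauchy_combination([0.001, 0.6])
```

When the two parts give the same p-value, the combined p-value equals it; when one part is strongly significant and the other is not, the combined p-value is driven mainly by the smaller one.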

Estimating correlation coefficients

In our proposed MTLC model, the correlation structure is based on the OTU taxonomic structure and characterizes correlations between repeated measures. Here we assume the two GEE models corresponding to the OTU prevalence part and the positive RA part have the same correlation structure R. However, the estimated values of the correlation coefficients, $\hat\rho^{(0)}$ and $\hat\rho^{(+)}$, may be different for each part of the GEE model. For $y^{(0)}_k$ and $y^{(+)}_k$, $\hat\rho^{(0)}$ and $\hat\rho^{(+)}$ are estimated separately following the iterative scheme discussed in section “Generalized estimating equation framework”.

It should be noted that GEE models do not require all clusters to have the same size; unequal sizes can occur, for example, in unbalanced study designs and/or when some observations are missing. Even if $y^{(0)}_k$ has equal size for all k, $y^{(+)}_k$ may have different sizes because it is a collection of only positive RAs. This implies that the dimension of R may be greater than the length of $y^{(0)}_k$ and $y^{(+)}_k$ for some k. In such cases, the rows and columns in R corresponding to empty values of OTU observations need to be removed, and we denote the modified correlation structure matrices by $R^{(0)}_k(\rho)$ and $R^{(+)}_k(\rho)$, respectively, for each k. When applying the estimating equations in our MTLC model, we essentially use $R^{(0)}_k(\rho)$ and $R^{(+)}_k(\rho)$ as the working correlation matrices.
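Removing the rows and columns of R that correspond to empty observations is a plain sub-matrix selection. The sketch below illustrates it for one hypothetical cluster; the function name `reduce_correlation`, the numeric matrix and the observation pattern are all assumptions made for the example.

```python
def reduce_correlation(matrix, observed):
    """Keep only the rows and columns whose observation is present."""
    keep = [j for j, flag in enumerate(observed) if flag]
    return [[matrix[a][b] for b in keep] for a in keep]


# Full 4 x 4 working correlation (two OTUs by two time points), values assumed.
R_FULL = [
    [1.0, 0.3, 0.3, 0.0],
    [0.3, 1.0, 0.0, 0.3],
    [0.3, 0.0, 1.0, 0.3],
    [0.0, 0.3, 0.3, 1.0],
]
# Second observation of this cluster is zero, so it drops out of the
# positive-RA part.
OBSERVED = [True, False, True, True]
R_REDUCED = reduce_correlation(R_FULL, OBSERVED)
```

The reduced matrix keeps the same correlation coefficients as R, only with fewer rows and columns, so the parameters estimated across clusters remain comparable.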

Simulation

Simulation settings

Simulation studies are designed to simulate a zero-inflated multivariate normal distribution that reflects the correlation of −log10 transformed OTUs. To achieve this, we simulate both multivariate Bernoulli samples $Y^{(0)}$ and truncated multivariate normal samples $Z$ of size K and length J. The multivariate normal distributions are truncated to generate positive samples because all −log10 transformed RAs should be positive. We further assume a single binary predictor X, where X also has dimension K × J, and the means of $Y^{(0)}$ and $Z$ depend on X. Specifically, we simulate $Y^{(0)} \sim \mathrm{Bernoulli}_J\left(\frac{\exp(X\beta^{(0)})}{1 + \exp(X\beta^{(0)})}\right)$ and $Z \sim N_J(X\beta^{(+)}, R)$ truncated at 0. The zero-inflated multivariate normal samples are computed as the element-wise product $Y = Y^{(0)} \circ Z$. Y is indirectly associated with X via $Y^{(0)}$ and $Z$.

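For illustration purposes, we assume the simplest correlation structure, i.e., two correlated OTUs under the taxonomic structure and two repeated measures at different time points. The correlation matrix R is then derived following section “Incorporating taxonomic structure with repeated measures”:

$$R = \begin{pmatrix} \rho_{(D,D)} & \rho_{(D,\mathrm{i})} & \rho_{(\mathrm{I},D)} & \rho_{(\mathrm{I},\mathrm{i})}\\ \rho_{(D,\mathrm{i})} & \rho_{(D,D)} & \rho_{(\mathrm{I},\mathrm{i})} & \rho_{(\mathrm{I},D)}\\ \rho_{(\mathrm{I},D)} & \rho_{(\mathrm{I},\mathrm{i})} & \rho_{(D,D)} & \rho_{(D,\mathrm{i})}\\ \rho_{(\mathrm{I},\mathrm{i})} & \rho_{(\mathrm{I},D)} & \rho_{(D,\mathrm{i})} & \rho_{(D,D)} \end{pmatrix}.$$

Here $\rho_{(D,D)} = 1$, and $\rho_{(D,\mathrm{i})}$ and $\rho_{(\mathrm{I},D)}$ denote the correlation between two time points and between two OTUs, respectively. $\rho_{(\mathrm{I},\mathrm{i})}$ represents the correlation of observations from different OTUs and different time points, which is not of primary interest. We assume the simulated multivariate Bernoulli and multivariate normal distributions follow the same correlation structure R, but the correlation coefficients $\hat\rho^{(0)}$ and $\hat\rho^{(+)}$ can be different.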
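When choosing values for the three coefficients, it helps to confirm that the resulting R is a valid (positive definite) correlation matrix before simulating from it. The sketch below fills the 4 × 4 structure with numbers and checks positive definiteness with a hand-written Cholesky attempt. The function names are hypothetical, and this check is an illustration rather than a step described by the authors.

```python
import math


def simulation_correlation(rho_time, rho_otu, rho_cross):
    """Numeric 4 x 4 R for two OTUs by two time points."""
    return [
        [1.0, rho_time, rho_otu, rho_cross],
        [rho_time, 1.0, rho_cross, rho_otu],
        [rho_otu, rho_cross, 1.0, rho_time],
        [rho_cross, rho_otu, rho_time, 1.0],
    ]


def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; fail on a non-positive pivot."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


R_SIM = simulation_correlation(0.3, 0.3, 0.0)
R_BAD = simulation_correlation(0.9, 0.9, 0.0)
```

A time correlation and an OTU correlation of 0.3 with zero cross correlation give a valid matrix, whereas 0.9 and 0.9 with zero cross correlation do not, so the three coefficients cannot be chosen independently of each other.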

After obtaining the zero-inflated multivariate normal samples Y, we run a GEE logistic model following Eq 5 to estimate the effects of X on OTU prevalence $Y^{(0)}$, and a GEE linear model following Eq 6 to estimate the effects of X on the non-zero RAs $Y^{(+)}$, where $Y^{(+)}$ is the subset of Z such that $y^{(+)}_{kj} = z_{kj} \mid (y^{(0)}_{kj} = 1)$. Under GEE theory, both $Y^{(0)}$ and $Y^{(+)}$ yield consistent estimates of β and ρ. However, we simulate Z rather than $Y^{(+)}$, and Z and $Y^{(+)}$ may not yield the same estimates in general. To solve this issue, we simulate Z and $Y^{(0)}$ independently, which implies that $y^{(+)}_{kj}$ has the same distribution as $z_{kj}$. Therefore, Z also yields consistent estimates of β and ρ.

Unlike some of the literature, in which Y is simulated directly, we conduct our simulation on $Y^{(0)}$ and Z separately. This is because, following the mixture distribution framework, we fit two separate GEE models on $Y^{(0)}$ and $Y^{(+)}$ rather than one model directly on Y. In this way, we can clearly specify the true values of the predictor’s main effects and the OTU correlations in the simulation settings, and explicitly evaluate whether the estimates of these values are unbiased. As a sensitivity analysis to evaluate the robustness of our model performance, we also simulate $Y^{(0)}$ and Z from a (generalized) linear mixed model. Results are presented in S1 Appendix.

Inferences for predictor’s main effects

First, we evaluate the performance of our proposed MTLC model for estimating and testing the main effects of the predictor X. Let $\beta^{(0)}$ denote the effects on OTU prevalence and $\beta^{(+)}$ denote the effects on the log10 transformed non-zero RA. We evaluate the unbiasedness of the estimated $\hat\beta^{(0)}$ and $\hat\beta^{(+)}$, the Type I error for testing $\beta^{(0)} = \beta^{(+)} = 0$, and the test power when $\beta^{(0)}$ and/or $\beta^{(+)} \neq 0$. OTU observations are simulated under the simulation settings discussed in section “Simulation settings” with sample size K = 1000 and various combinations of $\beta^{(0)}$ and $\beta^{(+)}$ values. We assume $\rho_{(D,\mathrm{i})} = \rho_{(\mathrm{I},D)} = 0.3$ and $\rho_{(\mathrm{I},\mathrm{i})} = 0$ for both the multivariate normal and the multivariate Bernoulli distributions. β, Type I errors and powers are estimated based on 1000 replications. The computation time is about 4 hours to complete all 1000 replications on a desktop computer with a quad-core processor and 8GB of RAM.

Next we compare our MTLC model to other models. All models are described in Table 1.

Table 1. Description of each model compared by simulation study.

| Name | Formula | Description |
|---|---|---|
| GEE(0) | $Y^{(0)} \sim_{GEE} X$ | The logistic regression part of GEE for OTU prevalence |
| GEE(+) | $Y^{(+)} \sim_{GEE} X$ | The linear regression part of GEE for non-zero RAs |
| MTLC | $Y^{(0)} \sim_{GEE} X$; $Y^{(+)} \sim_{GEE} X$ | two-part GEE: our proposed microbiome taxonomic longitudinal correlation model |
| 2P_ind | $Y^{(0)} \sim X$; $Y^{(+)} \sim X$ | two-part independence: assuming no correlation, logistic model for OTU prevalence, linear model for non-zero RAs |
| 1P_GEE | $Y \sim_{GEE} X$ | one-part GEE: assuming same correlation structure, but only one GEE linear model for all 0 and non-zero RAs |
| 1P_ind | $Y \sim X$ | one-part independence: assuming no correlation and only one simple linear model for all 0 and non-zero RAs |
| 1P_RE | $Y \sim X + \gamma_1 + \gamma_2$ | one-part linear mixed model with random intercepts: $\gamma_1$, $\gamma_2$ represent random intercepts of time points and OTUs |

For each model, the estimated $\hat\beta^{(0)}$, $\hat\beta^{(+)}$, Type I error and power are summarized in Table 2. We find that all estimates of $\beta^{(0)}$ and $\beta^{(+)}$ are unbiased under the MTLC model. For the one-part models, because there is no true value of β as a mixture of $\beta^{(0)}$ and $\beta^{(+)}$, the unbiasedness of the estimated β cannot be evaluated. Regarding the variation of the estimated $\hat\beta$, the 2.5 and 97.5 percentiles of the empirical distributions of $\hat\beta$ are shown in S1 Appendix.

Table 2. Estimated β^, Type I error and power, from 1000 replications.

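| $(\beta^{(0)}, \beta^{(+)})$ | Estimates | GEE(0) | GEE(+) | MTLC | 2P_ind | 1P_GEE | 1P_ind | 1P_RE |
|---|---|---|---|---|---|---|---|---|
| (0, 0) | $\hat\beta$ | NA | NA | NA | NA | 0.000 | 0.000 | 0.000 |
| | $\hat\beta^{(0)}$ | 0.001 | NA | 0.001 | 0.001 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | 0.000 | 0.000 | 0.000 | NA | NA | NA |
| | T1E | 0.056 | 0.038 | 0.039 | 0.120 | 0.050 | 0.116 | 0.047 |
| (0, 0.05) | $\hat\beta$ | NA | NA | NA | NA | 0.027 | 0.027 | 0.027 |
| | $\hat\beta^{(0)}$ | 0.002 | NA | 0.002 | 0.002 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | 0.052 | 0.052 | 0.052 | NA | NA | NA |
| | Power | 0.045 | 0.512 | 0.421 | 0.583 | 0.201 | 0.332 | 0.199 |
| (0, -0.05) | $\hat\beta$ | NA | NA | NA | NA | -0.026 | -0.026 | -0.026 |
| | $\hat\beta^{(0)}$ | -0.001 | NA | -0.001 | -0.001 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | -0.050 | -0.050 | -0.050 | NA | NA | NA |
| | Power | 0.048 | 0.487 | 0.394 | 0.552 | 0.187 | 0.312 | 0.188 |
| (0.1, 0) | $\hat\beta$ | NA | NA | NA | NA | 0.051 | 0.051 | 0.051 |
| | $\hat\beta^{(0)}$ | 0.101 | NA | 0.101 | 0.101 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | 0.001 | 0.001 | 0.001 | NA | NA | NA |
| | Power | 0.693 | 0.050 | 0.609 | 0.772 | 0.571 | 0.712 | 0.570 |
| (0.1, 0.05) | $\hat\beta$ | NA | NA | NA | NA | 0.075 | 0.075 | 0.075 |
| | $\hat\beta^{(0)}$ | 0.100 | NA | 0.100 | 0.100 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | 0.049 | 0.049 | 0.049 | NA | NA | NA |
| | Power | 0.705 | 0.487 | 0.771 | 0.887 | 0.862 | 0.934 | 0.866 |
| (0.1, -0.05) | $\hat\beta$ | NA | NA | NA | NA | 0.025 | 0.025 | 0.025 |
| | $\hat\beta^{(0)}$ | 0.099 | NA | 0.099 | 0.099 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | -0.050 | -0.050 | -0.049 | NA | NA | NA |
| | Power | 0.696 | 0.481 | 0.800 | 0.896 | 0.171 | 0.287 | 0.171 |
| (-0.1, 0) | $\hat\beta$ | NA | NA | NA | NA | -0.051 | -0.051 | -0.051 |
| | $\hat\beta^{(0)}$ | -0.101 | NA | -0.101 | -0.101 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | -0.001 | -0.001 | -0.001 | NA | NA | NA |
| | Power | 0.700 | 0.054 | 0.612 | 0.781 | 0.575 | 0.698 | 0.571 |
| (-0.1, 0.05) | $\hat\beta$ | NA | NA | NA | NA | -0.026 | -0.026 | -0.026 |
| | $\hat\beta^{(0)}$ | -0.102 | NA | -0.102 | -0.102 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | 0.050 | 0.050 | 0.050 | NA | NA | NA |
| | Power | 0.719 | 0.483 | 0.803 | 0.905 | 0.188 | 0.304 | 0.183 |
| (-0.1, -0.05) | $\hat\beta$ | NA | NA | NA | NA | -0.075 | -0.075 | -0.075 |
| | $\hat\beta^{(0)}$ | -0.099 | NA | -0.099 | -0.099 | NA | NA | NA |
| | $\hat\beta^{(+)}$ | NA | -0.050 | -0.050 | -0.050 | NA | NA | NA |
| | Power | 0.694 | 0.471 | 0.786 | 0.906 | 0.887 | 0.949 | 0.887 |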

At the nominal level of 0.05, the 2P_ind and 1P_ind models have inflated Type I error, while the Type I errors of all other models are close to the nominal level. Note that when only one of β(0) and β(+) equals 0, the Type I error for the null part is still accurate. For example, when (β(0), β(+)) = (0, 0.05), the GEE(0) model for testing β(0) = 0 has a rejection rate of 0.045 (Table 2), which is not affected by the non-zero value of β(+). This further confirms the independence of the linear and logistic regression parts in the two-part model.
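As a rough check on what "accurate" means with 1000 replications, the Monte Carlo error of an empirical rejection rate can be computed directly. The sketch below is not part of the authors' analysis: it assumes the replications are independent, uses a normal approximation to the binomial, and the function name `monte_carlo_band` is illustrative. The rates are copied from the (0,0) row of Table 2.

```python
import math

def monte_carlo_band(alpha=0.05, n_rep=1000, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1 - alpha) / n_rep)  # binomial standard error
    return alpha - z * se, alpha + z * se

# Empirical Type I errors from the (0,0) row of Table 2.
type1 = {"GEE(0)": 0.056, "GEE(+)": 0.038, "MTLC": 0.039, "2P_ind": 0.120,
         "1P_GEE": 0.050, "1P_ind": 0.116, "1P_RE": 0.047}

low, high = monte_carlo_band()
outside = [name for name, rate in type1.items() if not low <= rate <= high]
print(f"band: ({low:.3f}, {high:.3f}); outside the band: {outside}")
```

Under these assumptions the band is roughly 0.036 to 0.064, and only the two independence models fall outside it.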

We also evaluate the power of the different models. The power of the 2P_ind and 1P_ind models is inflated because of their Type I error inflation. Among the models with valid Type I error, our proposed MTLC model is the most powerful in general. When one of β(0) and β(+) is 0, the MTLC model is slightly less powerful than whichever of the GEE(0) and GEE(+) models tests only the part with β ≠ 0. However, when both β(0) and β(+) are non-zero, the MTLC model is more powerful than both the GEE(0) and GEE(+) models. The 1P_GEE and 1P_RE models have similar power. Note that the 1P_RE model cannot accommodate negative correlations because of the nature of random effects. This is why we choose ρ01 and ρ10 to be positive in the simulation settings. When the true correlations are negative, the 1P_RE model simply reduces to the 1P_ind model. Compared with the MTLC model, the power of the one-part models drops dramatically when β(0) and β(+) have opposite signs. This is because the positive and negative effects cancel each other in one-part models, whereas both effects are captured in two-part models. When β(0) and β(+) have the same direction, we do observe some cases in which the power of the one-part models is larger. This is related to how the excess zeros are handled in the one-part models. A detailed discussion of this issue is provided in section “Two-part vs. one-part models”.
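The cancellation in the one-part models can be seen in the estimates of Table 2 themselves. The short check below is an observation about the reported numbers rather than a result stated in the text: it compares each one-part estimate (identical to three decimals for 1P_GEE, 1P_ind and 1P_RE) with the simple average of the two true effects. The helper name `average_effect` is illustrative.

```python
# True effects (beta0, beta_plus) mapped to the one-part estimate reported in Table 2.
table2_one_part = {
    (0.0, 0.0): 0.000, (0.0, 0.05): 0.027, (0.0, -0.05): -0.026,
    (0.1, 0.0): 0.051, (0.1, 0.05): 0.075, (0.1, -0.05): 0.025,
    (-0.1, 0.0): -0.051, (-0.1, 0.05): -0.026, (-0.1, -0.05): -0.075,
}

def average_effect(beta0, beta_plus):
    """Simple average of the two part-specific effects."""
    return (beta0 + beta_plus) / 2

# Largest gap between a reported one-part estimate and the average of the true effects.
max_gap = max(abs(est - average_effect(b0, bp))
              for (b0, bp), est in table2_one_part.items())
print(f"largest gap: {max_gap:.3f}")
```

In the dissonant setting (0.1, −0.05), for example, the one-part estimate is 0.025, smaller in magnitude than either true effect, which is in line with the loss of power described above.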

Estimations for the correlation coefficients

The MTLC model also provides estimates of the correlation coefficients. We first evaluate the unbiasedness of the correlation estimates. Let ρ(0) and ρ(+) be the correlation coefficients in the GEE(0) and GEE(+) models. In the simulation settings, we choose ρ(D,i)(0) = ρ(I,D)(0) = 0.5 and ρ(D,i)(+) = ρ(I,D)(+) = −0.3, with β(0) = −0.1 and β(+) = 0.05. The specified β values do not affect the estimation of ρ. The sample size is K = 1000 and the number of replications remains 1000.

The correlation structure of OTUs is based on the taxonomic structure, which is usually known in practice. However, the correlation structure of repeated measures within each OTU may not be known and usually requires subjective assumptions. One merit of the GEE model is that even if the assumed correlation structure is incorrect, the estimation of the main effect β is not affected. The estimates β^ are consistent under different assumptions about the correlation structure, as illustrated by Yan [36] and confirmed by our simulation study (results not shown). In addition, we evaluate the consistency of the correlation estimates under a misspecified correlation structure.

In contrast to the correct correlation structure R, we first construct a model with a correlation matrix that assumes OTUs are independent while time points are still correlated. We then construct another model with a correlation matrix that assumes time points are independent while OTUs are still correlated. When OTUs are assumed to be independent, the GEE model can only estimate ρ(D,i); when time points are assumed to be independent, the GEE model can only estimate ρ(I,D). The correlation estimates are summarized in Table 3.

Table 3. Estimated GEE correlations under correct correlation structure, OTU independence structure and time points independence structure, compared to Pearson correlations.

Cor True Pearson True structure OTU ind Time points ind
ρ(D,i)(0) 0.5 0.497 0.495 0.495 NA
ρ(I,D)(0) 0.5 0.498 0.496 NA 0.496
ρ(I,i)(0) 0 0.000 -0.002 NA NA
ρ(D,i)(+) -0.3 -0.295 -0.299 -0.300 NA
ρ(I,D)(+) -0.3 -0.296 -0.299 NA -0.299
ρ(I,i)(+) 0 -0.001 -0.001 NA NA

From Table 3, the correlation estimates under the true correlation structure are all unbiased. When the correlation structure is misspecified, the model cannot estimate every correlation coefficient of the correct structure. More interestingly, the coefficients that can still be estimated under the misspecified structure remain unbiased. This implies that if we are not interested in estimating all correlations in the correct correlation structure, we can simplify the structure. For example, because the estimation of ρ(I,i) is not of interest, we can set it to 0 without affecting the estimation of ρ(D,i) and ρ(I,D).
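The three working structures compared in Table 3 can be written out explicitly for the two-OTU, two-time-point setting. The following is a minimal sketch, not the authors' code: it assumes the rows of R are ordered (OTU 1, time 1), (OTU 1, time 2), (OTU 2, time 1), (OTU 2, time 2), and the function names `block` and `build_r` are illustrative.

```python
def block(diag, off):
    """2 x 2 block over the two time points."""
    return [[diag, off], [off, diag]]

def build_r(rho_d_i, rho_i_d, rho_i_i=0.0):
    """4 x 4 working correlation matrix for two OTUs observed at two time points.

    rho_d_i: same OTU, different time points
    rho_i_d: different OTUs, same time point
    rho_i_i: different OTUs, different time points
    """
    same_otu = block(1.0, rho_d_i)       # diagonal block, Omega(D)
    diff_otu = block(rho_i_d, rho_i_i)   # off-diagonal block, Omega(I)
    top = [same_otu[r] + diff_otu[r] for r in range(2)]
    bottom = [diff_otu[r] + same_otu[r] for r in range(2)]
    return top + bottom

true_r = build_r(0.5, 0.5)     # simulation setting for GEE(0)
otu_ind = build_r(0.5, 0.0)    # OTUs assumed independent: only rho(D,i) remains
time_ind = build_r(0.0, 0.5)   # time points assumed independent: only rho(I,D) remains

for row in true_r:
    print(row)
```

Setting an argument to 0 removes that parameter from the working structure, which mirrors the simplification suggested above for ρ(I,i).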

The correlation structure contains only two OTUs and two time points, so the GEE correlation estimates are essentially pairwise correlations and can be compared with the corresponding Pearson correlation coefficients. The two sets of results are consistent, as expected. The merit of our MTLC model is that when the correlation structure is more complicated and the pairwise Pearson correlation is not available, it can still provide unbiased estimation of the correlation matrix.

Two-part vs. one-part models

For one-part models, if we apply the −log10 transformation to both the non-zero RAs and the zeros, every 0 becomes ∞. A common workaround is to replace each 0 by some small value close to 0, such as 10^−5. However, we find that the power of the one-part model tests is sensitive to this arbitrary small value. In Table 4, we replace −log10 0 by 6, 5, 4 and 3 and compare the corresponding test powers with those of the MTLC model. We present only the 1P_GEE model, because Table 2 shows that the 1P_RE model has similar power to 1P_GEE.

Table 4. Test powers of the 1P_GEE model compared with the MTLC model when −log10 0 is replaced by 6, 5, 4 and 3.

(β(0), β(+)) MTLC −log10 0 = 6 −log10 0 = 5 −log10 0 = 4 −log10 0 = 3
(0,0) 0.039 0.038 0.052 0.040 0.044
(0,0.05) 0.421 0.138 0.156 0.304 0.478
(0,-0.05) 0.394 0.122 0.176 0.284 0.468
(0.1,0) 0.609 0.650 0.528 0.308 0.040
(0.1,0.05) 0.771 0.890 0.888 0.864 0.456
(0.1,-0.05) 0.764 0.346 0.218 0.050 0.484
(-0.1,0) 0.612 0.660 0.576 0.340 0.050
(-0.1,0.05) 0.803 0.306 0.166 0.052 0.486
(-0.1,-0.05) 0.786 0.846 0.854 0.844 0.472

Table 4 indicates that there is no optimal choice of the value for replacing 0 RAs. For each value selected, depending on (β(0), β(+)), there may be situations in which the one-part model has comparable or even slightly better power than the corresponding two-part model (e.g., 0.650 vs. 0.609 when (β(0), β(+)) = (0.1, 0) and 0 is replaced by 10^−6), but the power loss is much larger for some other values of β (e.g., 0.138 vs. 0.421 when (β(0), β(+)) = (0, 0.05) and 0 is replaced by 10^−6). We conclude that our MTLC model has superior and robust power performance compared with the one-part models, and we suggest that readers avoid one-part models in practice when there are excessive numbers of 0s in OTU data.
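The dependence on the replacement value is easy to see on a toy vector of relative abundances. This is a minimal sketch with made-up RAs, not the simulation used for Table 4; the function names are illustrative, and coding the prevalence indicator as 1 for a non-zero RA is an assumption here.

```python
import math

def one_part_outcome(ra, zero_value):
    """-log10(RA), with each zero RA replaced by `zero_value` (6 stands for RA = 10^-6)."""
    return [zero_value if r == 0 else -math.log10(r) for r in ra]

def two_part_outcome(ra):
    """Split RAs into a presence indicator and -log10 of the positive RAs only."""
    present = [int(r > 0) for r in ra]
    positive = [-math.log10(r) for r in ra if r > 0]
    return present, positive

ra = [0.0, 0.01, 0.0, 0.001]   # hypothetical relative abundances

# The mean of the one-part outcome moves with the arbitrary replacement value.
means = {c: sum(one_part_outcome(ra, c)) / len(ra) for c in (6, 5, 4, 3)}
present, positive = two_part_outcome(ra)

print(means)
print(present, positive)
```

The two-part outcomes never need a replacement value: the zeros enter only through the indicator, and the positive part is left untouched.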

Application

We apply our proposed MTLC model to a twin study described in Turnbaugh et al. [37]. The full dataset is provided in the supporting information S1 Data. The data consist of 54 families, each with a pair of twins. Each individual has at most two observations, taken at two time points. The primary research question is to assess the association between obesity status (lean, overweight or obese) and OTUs, and to estimate the correlations between the two time points, between the twins in each pair and between OTUs. For illustration purposes, we analyze only OTUs within the order Clostridiales, which consists of 9 OTUs at the genus level. The taxonomic structure of these 9 OTUs is shown in Fig 3.

Fig 3. Taxonomic structure of 9 OTUs.


From Fig 3, all 9 OTUs belong to the same taxon (Clostridiales) at the order level, and each of the 9 OTUs belongs to a different taxon at the genus level. We define the order level as level 1, the family level as level 2 and the genus level as level 3; thus I = 3. Accordingly, the numerical representation of the taxonomic structure is n1 = 9, n2 = (4, 1, 4), n3 = (1, 1, 1, 1, 1, 1, 1, 1, 1).

Next, following the 4 steps described in section “Taxonomic structure of OTUs”, the taxonomic structure matrix is

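\[
\Gamma = \begin{pmatrix}
D & I & I & I & III & III & III & III & III \\
I & D & I & I & III & III & III & III & III \\
I & I & D & I & III & III & III & III & III \\
I & I & I & D & III & III & III & III & III \\
III & III & III & III & D & III & III & III & III \\
III & III & III & III & III & D & II & II & II \\
III & III & III & III & III & II & D & II & II \\
III & III & III & III & III & II & II & D & II \\
III & III & III & III & III & II & II & II & D
\end{pmatrix}.
\]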
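The label matrix Γ above can be generated from the family sizes n2 = (4, 1, 4). The sketch below is a simplified illustration and not the four-step procedure of the earlier section: it handles only one grouping level (families within a single order), and the labelling convention (I and II for pairs inside the two four-genus families, III for pairs from different families) is an assumption read off the matrix above. The function name `taxonomy_labels` is illustrative.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI"]

def taxonomy_labels(family_sizes):
    """Label matrix for OTUs grouped into families within one order.

    Diagonal entries are "D"; pairs within the k-th multi-OTU family share one
    label; all pairs from different families share the next unused label.
    """
    family_of = [f for f, size in enumerate(family_sizes) for _ in range(size)]
    multi = [f for f, size in enumerate(family_sizes) if size > 1]
    within = {f: ROMAN[k] for k, f in enumerate(multi)}
    across = ROMAN[len(multi)]
    n = len(family_of)
    gamma = [[None] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                gamma[a][b] = "D"
            elif family_of[a] == family_of[b]:
                gamma[a][b] = within[family_of[a]]
            else:
                gamma[a][b] = across
    return gamma

gamma = taxonomy_labels((4, 1, 4))
for row in gamma:
    print(" ".join(f"{label:>3}" for label in row))
```

A family with a single OTU, such as the middle one here, contributes no within-family label because it has no off-diagonal pair.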

Because each OTU is observed at two time points for a pair of twins, the repeated measure correlation structure following section “Modelling correlations from repeated measures” is

\[
\Omega = \begin{pmatrix}
D & i & ii & iii \\
i & D & iii & ii \\
ii & iii & D & i \\
iii & ii & i & D
\end{pmatrix}.
\]

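The dimensions of Γ and Ω are N = 9 and L = 4, so, as described in section “Incorporating taxonomic structure with repeated measures”, the integrative correlation matrix R has dimension J = N × L = 36. For a = 1, …, 9 and b = 1, …, 9, if Γab = D, then

\[
\Omega_{ab} = \Omega^{(D)} = \begin{pmatrix}
\rho_{(D,D)} & \rho_{(D,i)} & \rho_{(D,ii)} & \rho_{(D,iii)} \\
\rho_{(D,i)} & \rho_{(D,D)} & \rho_{(D,iii)} & \rho_{(D,ii)} \\
\rho_{(D,ii)} & \rho_{(D,iii)} & \rho_{(D,D)} & \rho_{(D,i)} \\
\rho_{(D,iii)} & \rho_{(D,ii)} & \rho_{(D,i)} & \rho_{(D,D)}
\end{pmatrix};
\]

if Γab = I, then

\[
\Omega_{ab} = \Omega^{(I)} = \begin{pmatrix}
\rho_{(I,D)} & \rho_{(I,i)} & \rho_{(I,ii)} & \rho_{(I,iii)} \\
\rho_{(I,i)} & \rho_{(I,D)} & \rho_{(I,iii)} & \rho_{(I,ii)} \\
\rho_{(I,ii)} & \rho_{(I,iii)} & \rho_{(I,D)} & \rho_{(I,i)} \\
\rho_{(I,iii)} & \rho_{(I,ii)} & \rho_{(I,i)} & \rho_{(I,D)}
\end{pmatrix};
\]

if Γab = II, then

\[
\Omega_{ab} = \Omega^{(II)} = \begin{pmatrix}
\rho_{(II,D)} & \rho_{(II,i)} & \rho_{(II,ii)} & \rho_{(II,iii)} \\
\rho_{(II,i)} & \rho_{(II,D)} & \rho_{(II,iii)} & \rho_{(II,ii)} \\
\rho_{(II,ii)} & \rho_{(II,iii)} & \rho_{(II,D)} & \rho_{(II,i)} \\
\rho_{(II,iii)} & \rho_{(II,ii)} & \rho_{(II,i)} & \rho_{(II,D)}
\end{pmatrix};
\]

if Γab = III, then

\[
\Omega_{ab} = \Omega^{(III)} = \begin{pmatrix}
\rho_{(III,D)} & \rho_{(III,i)} & \rho_{(III,ii)} & \rho_{(III,iii)} \\
\rho_{(III,i)} & \rho_{(III,D)} & \rho_{(III,iii)} & \rho_{(III,ii)} \\
\rho_{(III,ii)} & \rho_{(III,iii)} & \rho_{(III,D)} & \rho_{(III,i)} \\
\rho_{(III,iii)} & \rho_{(III,ii)} & \rho_{(III,i)} & \rho_{(III,D)}
\end{pmatrix}.
\]

The integrative correlation matrix is then

\[
R = \begin{pmatrix}
\Omega_{11} & \cdots & \Omega_{19} \\
\vdots & \ddots & \vdots \\
\Omega_{91} & \cdots & \Omega_{99}
\end{pmatrix}.
\]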
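The block construction of R can be sketched by pairing the OTU-level label with the repeated-measure label for every pair of the 36 outcomes. This is an illustrative sketch rather than the authors' implementation: the OTU-major ordering of rows, the helper name `expand`, and the OTU-level labels used for Γ (I and II within the two four-genus families, III across families) are assumptions.

```python
from collections import Counter

# Repeated-measure labels for 2 twins x 2 time points (matrix Omega).
omega = [["D", "i", "ii", "iii"],
         ["i", "D", "iii", "ii"],
         ["ii", "iii", "D", "i"],
         ["iii", "ii", "i", "D"]]

# OTU-level labels for family sizes (4, 1, 4); the labelling is an assumption.
families = [0] * 4 + [1] + [2] * 4
within = {0: "I", 2: "II"}
gamma = [["D" if a == b else (within[fa] if fa == fb else "III")
          for b, fb in enumerate(families)]
         for a, fa in enumerate(families)]

def expand(gamma, omega):
    """Label of each entry of R: (OTU-level label, repeated-measure label)."""
    n_meas = len(omega)
    size = len(gamma) * n_meas
    return [[(gamma[j // n_meas][k // n_meas], omega[j % n_meas][k % n_meas])
             for k in range(size)]
            for j in range(size)]

r_labels = expand(gamma, omega)
counts = Counter(label for row in r_labels for label in row)

print(len(r_labels), "x", len(r_labels[0]))
print(len(counts) - 1, "distinct correlation parameters besides the diagonal (D, D)")
```

Each distinct label pair corresponds to one coefficient ρ, so a 36 × 36 matrix is described by a small number of parameters regardless of how many entries it has.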

To apply the proposed MTLC model, all OTU observations are summarized as Y. X is the single binary predictor denoting obesity status (lean vs. obese/overweight). Both Y and X have dimension K × J, where K = 54 and J = 36. Some pedigrees consist of only one individual instead of a pair of twins, and for some individuals OTUs are observed at one time point instead of two; hence missing values exist in the matrix Y. Next, Y is separated into Y(0) and Y(+), representing OTU prevalences and positive RAs. We assume that each ykj(0) follows a Bernoulli distribution with mean μkj(0) and each ykj(+) follows a log-normal distribution with mean μkj(+). Then, under the MTLC model, Y and X have the following relationship:

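\[
\log\left(\frac{\mu_{kj}^{(0)}}{1-\mu_{kj}^{(0)}}\right) = \alpha^{(0)} + x_{kj}^{(0)}\beta^{(0)} \tag{8}
\]
\[
\mu_{kj}^{(+)} = \alpha^{(+)} + x_{kj}^{(+)}\beta^{(+)} \tag{9}
\]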

α(0) and α(+) are intercept parameters which are not our primary interest. Our goal is to estimate the effects of obesity status β(0) and β(+), and test H0: β(0) = β(+) = 0. β(0) and β(+) are estimated separately under Eq 2, and H0 is tested by the combined test statistic WMTLC following Eq 7.
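To make the two mean models concrete, the sketch below evaluates Eq 8 and Eq 9 for a binary predictor. All parameter values are hypothetical and chosen only for illustration; the function names are not from the paper.

```python
import math

def prevalence_mean(x, alpha0, beta0):
    """Eq 8 solved for the mean: inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-(alpha0 + x * beta0)))

def positive_mean(x, alpha_plus, beta_plus):
    """Eq 9: identity link for the positive part."""
    return alpha_plus + x * beta_plus

def odds(p):
    return p / (1.0 - p)

# Hypothetical intercepts and effects, for illustration only.
alpha0, beta0 = 0.2, -0.5
alpha_plus, beta_plus = 3.0, -0.02

p_x0 = prevalence_mean(0, alpha0, beta0)
p_x1 = prevalence_mean(1, alpha0, beta0)
odds_ratio = odds(p_x1) / odds(p_x0)   # equals exp(beta0) under Eq 8

print(round(p_x0, 3), round(p_x1, 3), round(odds_ratio, 3))
print(positive_mean(0, alpha_plus, beta_plus), positive_mean(1, alpha_plus, beta_plus))
```

Under Eq 8 the effect β(0) acts on the log-odds scale, so exp(β(0)) is the odds ratio between the two levels of the predictor; under Eq 9 the effect β(+) is a plain difference in means.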

We summarize the estimates of the obesity effects on OTUs and the corresponding p-values for testing H0 in Table 5. We compare the MTLC model with the other models listed in Table 1. Under our MTLC model, obesity shows a significant overall association with these OTUs. Specifically, it shows a significant association with the prevalence of OTUs, but no significant association with the non-zero RAs. None of the other models detects the overall significance. The computation time is less than 30 seconds for the twin study dataset.

Table 5. Estimated effects of obesity status to OTUs and p-value.

GEE(0) GEE(+) MTLC 2P_ind 1P_GEE 1P_ind 1P_RE
β^ NA NA NA NA -0.041 -0.024 -0.028
β^(0) -0.511 NA -0.511 -0.496 NA NA NA
β^(+) NA -0.017 -0.017 0.014 NA NA NA
p-value 0.017 0.518 0.034 0.093 0.215 0.450 0.475
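Eq 7 lies outside this section, so the exact form of the combined statistic is not restated here. The reference list includes the Cauchy combination test [35]; as an assumption, the sketch below combines the two part-specific p-values from Table 5 with that test and equal weights. The function name and the equal weighting are illustrative choices, not details confirmed by the text.

```python
import math

def cauchy_combination(p_values, weights=None):
    """Cauchy combination of p-values (Liu and Xie); returns the combined p-value."""
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    # Each p-value is mapped to a standard Cauchy variate and the variates are averaged.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Upper tail probability of the standard Cauchy distribution at `stat`.
    return 0.5 - math.atan(stat) / math.pi

# p-values of GEE(0) and GEE(+) from Table 5.
p_combined = cauchy_combination([0.017, 0.518])
print(round(p_combined, 3))
```

Under these assumptions the combined p-value is about 0.034, which agrees with the MTLC column of Table 5.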

Correlation estimates are presented in Table 6. ρ(D,i) is the correlation between the two time points and ρ(D,ii) is the correlation between the two twins. ρ(I,D), ρ(II,D) and ρ(III,D) are OTU correlations: the correlation between OTUs from different families within the same order Clostridiales, and the correlations within the same family (Lachnospiraceae or Ruminococcaceae).

Table 6. Estimated correlation coefficients between time points, twins and OTUs.

Models GEE Pearson
GEE(0) ρ(D,i) 0.098 0.106
ρ(D,ii) 0.130 0.110
ρ(I,D) 0.229 NA
ρ(II,D) 0.217 NA
ρ(III,D) 0.347 NA
GEE(+) ρ(D,i) 0.696 0.751
ρ(D,ii) 0.550 0.561
ρ(I,D) -0.018 NA
ρ(II,D) -0.035 NA
ρ(III,D) -0.175 NA
1P_GEE ρ(D,i) 0.661 0.657
ρ(D,ii) 0.495 0.498
ρ(I,D) 0.051 NA
ρ(II,D) 0.082 NA
ρ(III,D) 0.015 NA

When Pearson correlations are available (ρ(D,i) and ρ(D,ii)), they are quite consistent with the correlation estimates under GEE models. However, Pearson correlation is not available for OTU correlations due to the complicated taxonomic structure, and only our proposed MTLC model can estimate these correlations.

Discussion

In this paper, we develop and implement a novel approach to model the correlations of OTUs based on the biological taxonomic structure. The proposed MTLC model can incorporate the taxonomic structure with repeated measures from longitudinal data. It has accurate Type I error, unbiased estimation of model parameters and robust power performance under a variety of situations. Compared to existing methods, our method is more powerful and can provide unbiased estimation of the correlation coefficients between multiple OTUs and repeated measures.

The MTLC model allows sufficient flexibility in constructing the correlation matrix. It not only allows different correlation matrices for the logistic regression part and the linear regression part, but also puts no constraint on the range of each correlation coefficient, i.e., any positive or negative value from −1 to 1 is allowed. In contrast, the random effect in a mixed effects model naturally leads to a positive correlation, because the same random effect is added to several correlated samples. When the true correlations are negative, the mixed effects model (e.g., Chen et al. [13]) simply reduces to ordinary linear and logistic regression models with an independence assumption, which results in incorrect Type I errors, as we have shown in section “Inferences for predictor’s main effects”. In summary, the MTLC model provides a reliable analytical framework for longitudinal microbiome data analysis.

Our methodology for constructing the correlation matrix of the taxonomic structure imposes no constraint on the number of OTUs, denoted by N. Based on the computation time in our simulation and application studies, the MTLC model runs fast overall. However, when N is large (e.g., N > 1000), the correlation matrix has a high dimension, which may cause computational issues and make the MTLC model time consuming to fit. In such cases, we suggest a dimension reduction by selecting a subgroup of OTUs. For example, if the OTUs are from the same phylum but different classes, the MTLC model can be fitted on each class separately, or only on the classes of interest, instead of on the whole phylum.
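A back-of-the-envelope calculation shows how quickly the correlation matrix grows with N. The sketch assumes a dense J × J matrix stored in 8-byte floating point, with J = N × L; it says nothing about run time, and the split into ten equal classes is a hypothetical example.

```python
def correlation_matrix_cost(n_otus, n_measures, bytes_per_entry=8):
    """Dimension J = N x L and storage of a dense J x J correlation matrix in bytes."""
    j = n_otus * n_measures
    return j, j * j * bytes_per_entry

j_twin, bytes_twin = correlation_matrix_cost(9, 4)        # twin study: N = 9, L = 4
j_large, bytes_large = correlation_matrix_cost(1000, 4)   # N = 1000 with the same L

# Hypothetical split of the 1000 OTUs into 10 classes of 100 OTUs each.
j_class, bytes_class = correlation_matrix_cost(100, 4)
bytes_split = 10 * bytes_class

print(j_twin, bytes_twin)
print(j_large, bytes_large / 1e6, "MB")
print(j_class, bytes_split / 1e6, "MB in total after splitting")
```

Because storage grows with the square of J, fitting ten classes separately needs about a tenth of the memory of one joint fit in this example.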

We have shown that the correlation estimation is consistent under the MTLC model, but its accuracy is not yet clear. Yan [36] proposed standard error estimates for the correlation coefficients under the GEE approach. When the corresponding Pearson correlations are also available, we have found that the standard error under the GEE approach may depart from the standard error of the Pearson correlations. Because the underlying distribution of the correlation estimates is unknown, the standard error estimates lack theoretical justification. Further studies are required to obtain accurate standard errors of the correlation coefficients under our MTLC model.

The MTLC model assumes that the −log10 transformed positive RAs follow a normal distribution. Clearly this is not the only approach to modelling the RA data, and there is no universal answer for choosing the “best” approach. Liu et al. [38] gave an overview of modelling zero-inflated non-negative continuous data in general and proposed a few alternative distributions for the positive part of RAs. For example, the zero-inflated beta distribution is another commonly used approach [13, 39], because the beta distribution ranges from 0 to 1, exactly matching the range of RAs.

When β(0) and β(+) have opposite signs, the predictor’s effects are described as “dissonant”. In this scenario, the higher power of the two-part models in our simulation studies agrees with the existing literature [9, 40]. In the microbiome context, an example of this scenario is an antibiotic treatment that is effective in reducing the risk of carrying some specific bacteria, but that may result in the growth of these bacteria once they survive because of antibiotic resistance [41, 42].

For the proposed method, the dimension p of the predictors’ design matrix Xk is assumed to be less than the number of clusters K. For a high dimensional predictor space, e.g., gene expressions in a genome-wide association study, it is possible to encounter the situation of p > K. In such cases regression models cannot be applied directly, and dimension reduction techniques need to be used. Traditional approaches such as principal component analysis and penalized regression, including ridge regression and LASSO, as well as some machine learning based feature selection methods, could be incorporated into the proposed method to deal with high dimensional predictors. We are planning to extend the proposed method to such high dimensional settings.

We have treated repeated longitudinal measures as a few discrete time points in our MTLC model. When there are more time points for each sample and the exact observation time for each sample is continuous, it is a natural extension of our current work to consider time as a continuous variable and OTU observations as a function of time. Further investigation of functional data analysis techniques can be explored and integrated with the OTU correlation structure developed in this paper.

Supporting information

S1 Data. Data for the real microbiome sequencing study in Application section.

(XLS)

S1 Appendix. Additional simulation results.

(PDF)

Acknowledgments

The authors would like to thank Dr. Lillian L. Siu, Dr. Bryan Coburn, Dr. Pierre Schneeberger, Dr. Osvaldo Espin-Garcia and Dr. Jeffrey Rosenthal for helpful discussions and suggestions at different stages of our study.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

WX was funded by Natural Sciences and Engineering Research Council of Canada (NSERC Grant RGPIN-2017-06672), Princess Margaret Cancer Foundation Award. BC is a post-doctoral fellowship trainee and supported by Princess Margaret Cancer Foundation for AI and Microbiome Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Kinross JM, von Roon AC, Holmes E, Darzi A, Nicholson JK. The human gut microbiome: implications for future health care. Current Gastroenterology Reports. 2008;10:396–403. doi:10.1007/s11894-008-0075-y
  • 2. Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nature Reviews Genetics. 2012;13:260–270. doi:10.1038/nrg3182
  • 3. Gerber GK. The dynamic microbiome. FEBS Letters. 2014;588(22):4131–4139. doi:10.1016/j.febslet.2014.02.037
  • 4. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65. doi:10.1038/nature08821
  • 5. Kuczynski J, Lauber CL, Walters WA, Wegener L, Clemente PJC, et al. Experimental and analytical tools for studying the human microbiome. Nature Reviews Genetics. 2012;13:47–58. doi:10.1038/nrg3129
  • 6. Mandal S, Van Treuren W, White RA, Eggesbo M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease. 2015;26(1).
  • 7. Friedman J, Alm EJ. Inferring Correlation Networks from Genomic Survey Data. PLoS Computational Biology. 2012;8(9):e1002687. doi:10.1371/journal.pcbi.1002687
  • 8. Weiss S, Treuren WV, Lozupone C, Faust K, Friedman J, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. The ISME Journal. 2016;10:1669–1681. doi:10.1038/ismej.2015.235
  • 9. Xu L, Turpin W, Paterson AD, Xu W. Assessment and Selection of Competing Models for Zero-Inflated Microbiome Data. PLoS ONE. 2015;10(7):e0129606. doi:10.1371/journal.pone.0129606
  • 10. Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of Microbiome Data in the Presence of Excess Zeros. Frontiers in Microbiology. 2017;8:2014. doi:10.3389/fmicb.2017.02114
  • 11. Su L, Tom BDM, Long DL, Yiu S, Farewell VT. Two-Part and Related Regression Models for Longitudinal Data. Annual Review of Statistics and Its Application. 2017;4(1):283–315. doi:10.1146/annurev-statistics-060116-054131
  • 12. Anthea M. Random Effects Modeling and the Zero-Inflated Poisson Distribution. Communications in Statistics—Theory and Methods. 2014;43(4):664–680. doi:10.1080/03610926.2013.814782
  • 13. Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32(17):2611–2617.
  • 14. Zhang X, Mallick H, Tang Z, Zhang L, Cui X, et al. Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics. 2017;18(4):1–10.
  • 15. Zhang X, Pei YF, Zhang L, Guo B, Pendegraft AH, et al. Negative Binomial Mixed Models for Analyzing Longitudinal Microbiome Data. Frontiers in Microbiology. 2018;9:1683. doi:10.3389/fmicb.2018.01683
  • 16. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, et al. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE. 2012;7(12):e52078. doi:10.1371/journal.pone.0052078
  • 17. Chen J, Li H. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics. 2013;7(1):418–442. doi:10.1214/12-AOAS592
  • 18. Tang ZZ, Chen G. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics. 2018;20(4):698–713. doi:10.1093/biostatistics/kxy025
  • 19. Tang ZZ, Chen G, Alekseyenko AV, Li H. A general framework for association analysis of microbial communities on a taxonomic tree. Bioinformatics. 2017;33(9):1278–1285.
  • 20. Tang ZZ, Chen G. Robust and Powerful Differential Composition Tests for Clustered Microbiome Data. Statistics in Biosciences. 2019. doi:10.1007/s12561-019-09251-5
  • 21. Shi P, Li H. A Model for Paired-Multinomial Data and Its Application to Analysis of Data on a Taxonomic Tree. Biometrics. 2017;73(4):1266–1278. doi:10.1111/biom.12681
  • 22. Zhang Y, Han SW, Cox LM, Li H. A multivariate distance–based analytic framework for microbial interdependence association test in longitudinal study. Genetic Epidemiology. 2017;41(8):769–778. doi:10.1002/gepi.22065
  • 23. Xu L, Peterson AD, Xu W. Bayesian latent variable models for hierarchical clustered count outcomes with repeated measures in microbiome studies. Genetic Epidemiology. 2017;41(3):221–232. doi:10.1002/gepi.22031
  • 24. Zhan X, Xue L, Zheng H, Plantinga A, Wu MC, et al. A small–sample kernel association test for correlated data with application to microbiome association studies. Genetic Epidemiology. 2018;42(8):772–782. doi:10.1002/gepi.22160
  • 25. Koh H, Li Y, Zhan X, Chen J, Zhao N. A Distance–Based Kernel Association Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies. Frontiers in Microbiology. 2018;10:458.
  • 26. Grantham NS, Guan Y, Reich BJ, Borer ET, Gross K. MIMIX: a Bayesian Mixed–Effects Model for Microbiome Data from Designed Experiments. Journal of the American Statistical Association: Application and Case Studies. 2019;0(0):1–11.
  • 27. Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73(1):13–22. doi:10.1093/biomet/73.1.13
  • 28. Kelly BJ, Imai I, Bittinger K, Laughlin A, Fuchs BD, et al. Composition and dynamics of the respiratory tract microbiome in intubated patients. BMC Microbiome. 2016;4(7).
  • 29. Seekatz AM, Rao K, Santhosh K, Young VB. Dynamics of the fecal microbiome in patients with recurrent and nonrecurrent Clostridium difficile infection. BMC Genome Medicine. 2016;8(47).
  • 30. Ballinger GA. Using Generalized Estimating Equations for Longitudinal Data Analysis. Organizational Research Methods. 2004;7(2):127–150. doi:10.1177/1094428104263672
  • 31. Shults J, Ratcliffe SJ. Analysis of multi-level correlated data in the framework of generalized estimating equations via xtmultcorr procedures in Stata and qls functions in Matlab. Statistics and Its Interface. 2009;2(2):187–196.
  • 32. Lee AH, Xiang L, Hirayama F. Modeling Physical Activity Outcomes: A Two-part Generalized-estimating-equations Approach. Epidemiology. 2010;21(5):626–630. doi:10.1097/EDE.0b013e3181e9428b
  • 33. Wang M. Generalized Estimating Equations in Longitudinal Data Analysis: A Review and Recent Developments. Advances in Statistics. 2014;2014(303728):1–11.
  • 34. Zadlo T. On longitudinal moving average model for prediction of subpopulation total. Statistical Papers. 2015;56(3):749–771. doi:10.1007/s00362-014-0607-5
  • 35. Liu Y, Xie J. Cauchy Combination Test: A Powerful Test With Analytic p-Value Calculation Under Arbitrary Dependency Structures. Journal of the American Statistical Association. 2020;115(529):393–402. doi:10.1080/01621459.2018.1554485
  • 36. Yan J. The R Package geepack for Generalized Estimating Equations. Journal of Statistical Software. 2006;15(2).
  • 37. Turnbaugh PJ, Hamady M, Yatsunenko T, Canterel BL, Duncan A, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–484. doi:10.1038/nature07540
  • 38. Liu L, Shih YCT, Strawderman RL, Zhang D, Johnson BA, et al. Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review. Statistical Science. 2019;34(2):253–279. doi:10.1214/18-STS681
  • 39. Chai H, Jiang H, Lin L, Liu L. A marginalized two-part Beta regression model for microbiome compositional data. PLoS Computational Biology. 2018;14(7):e1006329. doi:10.1371/journal.pcbi.1006329
  • 40. Lachenbruch PA. Comparisons of two-part models with competitors. Statistics in Medicine. 2001;20:1215–1234. doi:10.1002/sim.790
  • 41. Costelloe C, Metcalfe C, Lovering A, Mant D, Hay AD. Effect of antibiotic prescribing in primary care on antimicrobial resistance in individual patients: systematic review and meta-analysis. British Medical Journal. 2010;340(7756):1120.
  • 42. Munita JM, Arias CA. Mechanisms of Antibiotic Resistance. Microbiology Spectrum. 2016;4(2). doi:10.1128/microbiolspec.VMBF-0016-2015
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008108.r001

Decision Letter 0

Jason A Papin, Benjamin Althouse

2 May 2020

Dear Dr. Xu,

Thank you very much for submitting your manuscript "Generalized Estimating Equation Modeling on Correlated Microbiome Sequencing Data with Longitudinal Measures" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Benjamin Althouse

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this article, the authors present an innovative method to analyze microbiome data for disease prediction and correlation analysis. It is very important to build a statistical and computational model which fully accounts for the correlation relationships among the OTUs.

The theoretical investigations are rigorous and the numeric studies provide very comprehensive assessment of the empirical performance of the proposed methods. The paper is well organized and all the theoretical and numerical results are presented in a very concise and rigorous way.

The authors proposed to use a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. Longitudinal and other types of repeated OTU measures are integrated in the MTLC model. Variance estimators of the proposed regression estimates are fully developed. Compared with the existing methods including traditional GEE and mixed models, the MTLC method is shown to be more powerful and more accurate in numerical studies. Authors have also investigated the performance of the method on a real microbiome data.

Compared to existing predictive methods, the newly proposed method is advantageous for microbiome data analysis because it models the correlation structure among the repeated measurements and among the correlated OTUs. The authors have provided innovative and significant contributions in this paper. I strongly recommend the publication of this paper.

A minor comment concerns the dimension of the predictor. It would be helpful to readers if the authors could discuss how the model can be extended to accommodate higher-dimensional predictors, which often arise in practical applications.

Reviewer #2: In this paper, the authors proposed a Microbiome Taxonomic Longitudinal Correlation (MTLC) model to test the association between operational taxonomic units (OTUs) and the predictors of interest. The model consists of two parts: a logistic regression of the OTU prevalence and a linear regression of OTU relative abundance (RA). An omnibus test of these two regressions was developed to assess overall statistical significance. The model parameters and their variances are estimated by Generalized Estimating Equation (GEE) approach to account for the correlation between the OTUs and repeated measures. The simulation study shows that the proposed method can control type I error at the nominal level 0.05 and the power is the most robust against different configurations of the effect sizes. The advantage of the proposed method is further evidenced by a real data analysis in which the association between obesity and OTUs was detected by the proposed method but not by other comparative methods.
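
The two-part structure summarized above can be illustrated with a minimal sketch. It only shows how a vector of relative abundances splits into a presence indicator (y0) and the positive values (y+); it is not the authors' implementation, and the function name `two_part_split` and the example values are hypothetical.

```python
def two_part_split(rel_abundance):
    """Split relative abundances into the two outcomes of a two-part model.

    y0     : 1 if the OTU is present (abundance > 0), else 0
    y_plus : the abundances restricted to the non-zero entries
    """
    y0 = [1 if y > 0 else 0 for y in rel_abundance]
    y_plus = [y for y in rel_abundance if y > 0]
    return y0, y_plus


# Hypothetical relative abundances for one OTU across five samples.
y = [0.0, 0.12, 0.0, 0.03, 0.40]
y0, y_plus = two_part_split(y)

# The number of positive observations fixes the length of y_plus.
n_positive = sum(y0)
```

Here `n_positive` equals `len(y_plus)`, which is the link between the two parts that the reviewer raises in the comments on the statistical model.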

The idea of applying GEE to model longitudinal/correlated data is not new, but a two-part GEE with an estimated correlation matrix seems to be novel in microbiome studies. However, some major concerns about the proposed statistical model, the simulation settings and the power comparison results need to be addressed.

General:

1. The authors mention in the Abstract and Introduction that the proposed method is able to estimate the correlations between OTUs. However, the benefits of obtaining those correlation coefficients are not clearly stated. What additional information can an accurately estimated correlation bring to us?

2. The presented bibliography is rich. However, previous applications of GEE in microbiome data were not mentioned, e.g.

Kelly, B.J., Imai, I., Bittinger, K. et al. Microbiome 4, 7 (2016). https://doi.org/10.1186/s40168-016-0151-8

Seekatz, A.M., Rao, K., Santhosh, K. et al. Genome Med 8, 47 (2016). https://doi.org/10.1186/s13073-016-0298-8.

Statistical Model:

1. The statistical model is confusing. The authors seem to assume that, within each independent block, y and x are linked by a GLM (1). The y0, i.e. the dichotomized y, and y+, i.e. the truncated y>0, relate to x by other GLMs (5) and (6), respectively. If so, the distribution of y can be and should be stated explicitly. If not, what is the relationship between y and (y0, y+)? This also concerns the data generating distribution in the simulation. See the comments regarding simulations.

2. The authors claim that the test statistics W0 and W+ are independent (P.9 line 211). I am not convinced this is true for arbitrary y as it seems to be presented in the paper. In fact, given a vector y, the number of non-zero elements, i.e. E(y0), determines the length of y+ which seems to contradict the independence claim. The authors should clarify under what kind of distribution of y this independence property holds.

3. An omnibus test W_MTLC (7) is proposed to combine W0 and W+ by summation. Other approaches can be used to combine the p-values of these two tests, e.g. the minimum p-value (minP) approach and the Cauchy combination test (CCT); see Yaowu Liu & Jun Xie (2020) JASA, 115:529, 393-402, DOI: 10.1080/01621459.2018.1554485. Given that the power of W0 and W+ can be drastically different in the simulations, I would expect the power of either minP or CCT to be higher than that of the summation of statistics as proposed.
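
The three combination strategies mentioned in the last comment can be compared with a short sketch. Several assumptions are made here that are not stated in the review: each of W0 and W+ is treated as a 1-degree-of-freedom chi-square statistic, the two tests are treated as independent (so the sum is referred to a chi-square with 2 degrees of freedom and minP uses a Sidak-type correction), and the Cauchy combination uses equal weights by default. All function names are hypothetical.

```python
import math


def chi2_sf_1df(w):
    """Upper-tail probability of a chi-square(1) variable at w."""
    return math.erfc(math.sqrt(w / 2.0))


def chi2_sf_2df(w):
    """Upper-tail probability of a chi-square(2) variable at w."""
    return math.exp(-w / 2.0)


def sum_test(w0, w_plus):
    """Sum of two independent 1-df statistics, referred to chi-square(2)."""
    return chi2_sf_2df(w0 + w_plus)


def min_p(pvals):
    """Minimum p-value with a Sidak-type correction (assumes independence)."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


def cauchy_combination(pvals, weights=None):
    """Cauchy combination of p-values; weights are normalized to sum to 1."""
    if weights is None:
        weights = [1.0] * len(pvals)
    total = sum(weights)
    t = sum((w / total) * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvals))
    return 0.5 - math.atan(t) / math.pi


# Hypothetical statistics: a strong signal in one part, none in the other.
w0, w_plus = 9.0, 0.1
p0, p_plus = chi2_sf_1df(w0), chi2_sf_1df(w_plus)

p_sum = sum_test(w0, w_plus)
p_min = min_p([p0, p_plus])
p_cct = cauchy_combination([p0, p_plus])
```

Which combination is more powerful depends on how the signal is distributed across the two parts, so a sketch like this is only a starting point for the comparison the reviewer requests.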

Simulation:

1. The data generating distribution/process (P.9 lines 239-246 and P.9 lines 252-256) should be clearly written in statistical symbols/equations to avoid confusion. Most importantly, how are Y0, Y+ and Y related to the predictor X?

2. It seems that this simulation setting is fundamentally different from the simulations in the literature, e.g. [11], [16], in which the distribution of Y is clearly defined, while in this paper, Y is constructed from Y0 and Y+. It would be very helpful if the authors could

2.1. explain why they chose such a way to simulate Y0, Y+ and Y.

2.2. explain how to interpret the effect sizes beta0 and beta+. Are we particularly interested in detecting a pair of (beta0, beta+) in different directions (as shown in Table 2, this is where the proposed method has the largest power gain) in real data?

2.3. simulate Y from a simpler model, e.g. a linear mixed model, as a sensitivity analysis to see if the proposed method still works fine.

3. Table 2. Estimated beta, Type I error and power:

3.1. The authors state that when beta0 and beta+ are in the same direction, the two-part model is still more powerful in “general” (P.13 lines 307-308). This is simply not true. In Table 2, there are only two rows (rows 6 and 10) where the effects are in the same direction. In both cases, 1P_GEE and 1P_RE have higher power than MTLC.

3.2. The authors should make clear whether the estimated beta, beta0 and beta+ are from one simulation or are averages over the 1,000 replications. A more sensible approach is to report the empirical distribution of the estimated beta, e.g. the standard deviation and the 2.5th and 97.5th percentiles of the empirical distribution.
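
The summary requested in the last comment can be sketched as follows. The replicate estimates below are synthetic placeholders drawn from a normal distribution, not results from the paper, and the helper names `percentile` and `summarize_estimates` are hypothetical. Percentiles use linear interpolation between order statistics.

```python
import random
import statistics


def percentile(sorted_vals, q):
    """Percentile at fraction q in [0, 1], by linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)                      # index of the lower order statistic
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def summarize_estimates(estimates):
    """Mean, standard deviation and 2.5th/97.5th percentiles of estimates."""
    vals = sorted(estimates)
    return {
        "mean": statistics.mean(vals),
        "sd": statistics.stdev(vals),
        "p2.5": percentile(vals, 0.025),
        "p97.5": percentile(vals, 0.975),
    }


# Synthetic placeholder: 1,000 replicate estimates of a single coefficient.
rng = random.Random(0)
beta_hats = [rng.gauss(0.5, 0.1) for _ in range(1000)]
summary = summarize_estimates(beta_hats)
```

Reporting these four numbers per coefficient makes it explicit that the table entries summarize all replications rather than a single run.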

Minor (typos, etc.):

1. Author Summary and P.1 line 2: “…fast-growing…”.

2. P.2 lines 5-6: “…shotgun metagenomics sequencing…”.

3. P.5 line 106: “…other entries…”.

4. P.8 line 185, P11 line 282: “Next, we…”.

5. P.9 lines 196-197: mu0 (5) and mu+ (6) are not defined.

6. P.10 line 231: “…may be greater than…”.

7. P.12 Table 2: (beta_B, beta_N) should be (beta0, beta+).

8. P.14 line 363: “…one-part models…”.

9. P.15 line 375: “… at level genus.”

10. P.18 line 441: “…distribution…is unknown…”

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

Attachment

Submitted filename: PLOSreview.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008108.r003

Decision Letter 1

Jason A Papin, Benjamin Althouse

30 Jun 2020

Dear Dr. Xu,

We are pleased to inform you that your manuscript 'Generalized Estimating Equation Modeling on Correlated Microbiome Sequencing Data with Longitudinal Measures' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Benjamin Althouse

Associate Editor

PLOS Computational Biology

Jason Papin

Editor-in-Chief

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed the questions raised in my reports. I recommend the acceptance of the paper.

Reviewer #2: The authors have addressed all the concerns. I don't have further comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008108.r004

Acceptance letter

Jason A Papin, Benjamin Althouse

19 Aug 2020

PCOMPBIOL-D-20-00510R1

Generalized Estimating Equation Modeling on Correlated Microbiome Sequencing Data with Longitudinal Measures

Dear Dr Xu,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Sarah Hammond

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Data. Data for the real microbiome sequencing study in Application section.

    (XLS)

    S1 Appendix. Additional simulation results.

    (PDF)

    Attachment

    Submitted filename: PLOSreview.pdf

    Attachment

    Submitted filename: Responses.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
