Abstract
In this paper we present a flexible model for microbiome count data. We consider a quasi-likelihood framework, in which we do not make any assumptions on the distribution of the microbiome count except that its variance is an unknown but smooth function of the mean. By comparing our model to the negative binomial generalized linear model (GLM) and Poisson GLM in simulation studies, we show that our flexible quasi-likelihood method yields valid inferential results. Using a real microbiome study, we demonstrate the utility of our method by examining the relationship between adenomas and microbiota. We also provide an R package ‘fql’ for the application of our method.
Keywords: spline, skewness, heteroscedasticity, zero-inflation
1. Introduction
Human microbiota are the communities of all microorganisms within a specific body site, while microbiome refers to the collection of the genomes of all the microbes. The microbial profile has been known to be associated with various diseases, including obesity, diabetes, Crohn’s disease, bacterial vaginosis, and cancer, among others.1–5 Due to its therapeutic potential, the microbiome is also a key component of precision medicine.6 With advances in the next-generation sequencing (NGS) technologies,7 research on microbiomes has been growing rapidly in recent years. These studies generated a great amount of microbiome data and called for novel statistical modeling and methods.
To identify differentially abundant microbiome taxa, traditional nonparametric tests like Mann–Whitney, Wilcoxon rank-sum, and Kruskal–Wallis tests are commonly used after normalization.8 However, these tests do not consider the compositional structure of the microbiome data and cannot adjust for covariates. Alternatively, parametric models originally developed for transcriptomics data, such as DESeq and edgeR, have been proposed.9,10 These models use a negative binomial distribution to model observed abundances after normalizing the data. Another method, ANCOM, is an additive log-ratio (alr)-based approach that considers the compositional structure of microbiome data.11 One of the most challenging features in analyzing microbiome data is that the abundance count for a taxon is often right skewed and heteroscedastic (overdispersed). Standard statistical methods that depend on the normality assumption are usually inadequate and often yield invalid inferences.
In this paper we utilize a quasi-likelihood framework to address the right skewness and overdispersion commonly observed in microbiome data. Wedderburn (1974)12 introduced the concept of quasi-likelihood, providing a framework to relax the strict assumptions of maximum likelihood estimation. Under this framework, we only need to specify the mean structure and the relation between variance and mean, without any distributional forms of the outcome. Nelder (1987)13 extended the quasi-likelihood approach by incorporating the estimation of variance functions in a parametric setting, which allows for greater flexibility in modeling the relationship between the mean and variance of the response variable. McCullagh (1989)14 discussed quasi-likelihood as an alternative to maximum likelihood estimation in generalized linear models. Basu and Rathouz (2005)15 applied quasi-likelihood to health outcomes research, showcasing the versatility of quasi-likelihood in capturing complex relationships and heteroscedasticity. They proposed an extended parametric form of link and variance function models within the quasi-likelihood framework, and derived the marginal effects. Chen et al. (2013)16 considered a nonparametric function of the nonlinear associations between mean and variance structure modeled by penalized splines. Chiou and Muller (1999)17 proposed a nonparametric quasi-likelihood method, with a nonparametric link function and a nonparametric variance-mean relationship. In this paper, along the lines of Chen et al. (2013)16 and Chiou and Muller (1999)17, we adapt the flexible quasi-likelihood approach to the microbiome data, and compare its performance with other available approaches. We also facilitate the application of our method with an R package “fql”.
Our motivating example is a fecal microbiome study that investigates the relationship between adenomas and the microbiome. Hale et al. (2017)18 collected fecal microbiota from 800 patients with adenomas (n=266) and without adenomas (n=534) who underwent standard screening colonoscopy between 2001 and 2005. We are interested in the effect of adenomas on microbial abundance.
The remainder of the paper is organized as follows. In Section 2 we present the model and its estimation and inference methods. In Section 3 we compare our model to other available methods in a simulation study. In Section 4 we apply our method to examine the relationship between adenomas and microbiota in the fecal microbiome study. Concluding remarks are given in Section 5.
2. Methods
2.1. Models
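For a taxon, denote the observed number of reads in sample $i$ by $Y_i$, $i = 1, \ldots, n$. Let $\mu_i$ be the mean of $Y_i$. We use $Y = (Y_1, \ldots, Y_n)^T$ and $\mu = (\mu_1, \ldots, \mu_n)^T$ as the vector forms of $Y_i$ and $\mu_i$. We define a semi-parametric model with an unknown variance function as below:
| $\log(\mu_i) = X_i^T \beta$ | (1) |
| $\mathrm{Var}(Y_i) = V(\mu_i)$ | (2) |
where $X_i$ denotes a vector of covariates with the corresponding vector of regression coefficients $\beta$, and $V(\cdot)$ is a smooth function of the mean $\mu_i$.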
We consider a log link function in Model (1) to address the right skewness of the abundance count. For different outcome types, alternative link functions such as identity, logit, or Box-Cox could be applied. In Model (2) the variance is modeled as an unknown but smooth function of the mean, which is more flexible than the original extended quasi-likelihood function (Nelder and Pregibon 1987)13. Of note, the variance function $V(\mu) = \mu + \phi\mu^2$ ($\phi$ is the dispersion parameter) for the negative binomial distribution can also be considered as a special case of Model (2). This formulation has three advantages. First, we do not make additional assumptions about the distribution of the response variable (e.g., $Y_i$ could be either continuous or discrete), rendering our model more flexible and widely applicable; we can thus avoid using more complicated parametric distributions, e.g., generalized gamma or log skew normal.19,20 Second, since we model the expected value of the abundance count in Model (1), our model can accommodate observed values of zero. Third, we can avoid the retransformation issue for log transformed outcomes, facilitating the interpretation of parameter estimates in our model.21
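As a minimal sketch of the two model components described above, the snippet below evaluates a log-link mean and two candidate variance functions. The function names and the numeric inputs are illustrative assumptions, not part of the fql package; the negative binomial form $\mu + \phi\mu^2$ is the standard NB variance function.

```python
import math

def mean_log_link(x, beta):
    """Mean under a log link: mu = exp(x' beta)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def var_poisson(mu):
    """Poisson variance function: V(mu) = mu."""
    return mu

def var_negbin(mu, phi):
    """Negative binomial variance function: V(mu) = mu + phi * mu^2."""
    return mu + phi * mu ** 2

# Hypothetical inputs: intercept plus one covariate value of 0.5.
mu = mean_log_link([1.0, 0.5], [1.0, 2.0])
print(mu, var_poisson(mu), var_negbin(mu, phi=0.5))
```

Both variance functions are specific parametric choices of $V(\cdot)$; the flexible approach instead leaves $V(\cdot)$ unspecified and estimates it from the data.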
2.2. Estimation and Inference
Our model assumes the variance of the response variable is an unknown function of the mean. We follow the nonparametric quasi-likelihood theory of Chiou and Muller (1999)17 and define the flexible quasi-likelihood (FQL) function as below:
| $Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu_i} \frac{y_i - t}{V(t)} \, dt$ | (3) |
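The estimate of the coefficient vector $\beta$ is a solution of the nonparametric quasi-score equation below:
| $U(\beta) = \sum_{i=1}^{n} \frac{\partial \mu_i}{\partial \beta} \frac{y_i - \mu_i}{V(\mu_i)} = 0$ | (4) |
where the mean function is $\mu_i = \exp(X_i^T \beta)$, and $\partial \mu_i / \partial \beta$ is a $p \times 1$ vector.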
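The quasi-score can be evaluated directly once a variance function is supplied. The sketch below assumes the log link, under which the derivative of the mean with respect to the coefficients is the mean times the covariate vector; the function name and the toy data are hypothetical.

```python
import math

def quasi_score(beta, X, y, V):
    """Quasi-score under a log link: sum_i x_i * mu_i * (y_i - mu_i) / V(mu_i)."""
    score = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        mu = math.exp(sum(a * b for a, b in zip(xi, beta)))
        weight = mu * (yi - mu) / V(mu)   # d mu / d beta = mu * x under the log link
        for j, xij in enumerate(xi):
            score[j] += xij * weight
    return score

# Toy check: intercept-only model with V(mu) = mu (Poisson-type variance).
X = [[1.0], [1.0], [1.0]]
y = [1.0, 2.0, 6.0]
score = quasi_score([math.log(3.0)], X, y, lambda m: m)
print(score)
```

In this toy case the score is zero (up to rounding) at the log of the sample mean, which is the familiar intercept-only solution when the variance is proportional to the mean.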
Along the lines of Chen et al. (2013)16, the estimation procedure is completed as follows:
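1. Initialize $\hat{\beta}$ by fitting a model assuming a constant variance function $V(\cdot)$ for all subjects. Set $\hat{\mu}_i = \exp(X_i^T \hat{\beta})$.
2. Estimate the unknown variance function $V(\cdot)$ by minimizing the penalized least squares function:
| $\sum_{i=1}^{n} \{ (y_i - \hat{\mu}_i)^2 - V(\hat{\mu}_i) \}^2 + J_\lambda(V)$ | (5) |
where $J_\lambda(V)$ is a penalty function with smoothing parameter $\lambda$. We use a P-spline with a quadratic penalty to estimate $V(\cdot)$. Here we have $J_\lambda(V) = \lambda \theta^T S \theta$, where $\theta$ is the vector of parameters in the P-spline model of $V(\cdot)$, and $S$ is a positive semi-definite matrix with the detailed form given in Wood (2017)22. This can be carried out by the function gam() from the R package “mgcv” by Wood (2000)23, where the generalized cross validation method can be used to select the smoothing parameter $\lambda$ automatically.
3. Estimate $\beta$ by solving the quasi-score equation (4) above, with $V(\cdot)$ replaced by its current estimate $\hat{V}(\cdot)$. We employ the Newton-Raphson method with Fisher scoring and iterate between Steps 2 and 3 until convergence to the estimates $\hat{\beta}$ and $\hat{V}(\cdot)$. The update of $\beta$ is:
| $\beta^{(k+1)} = \beta^{(k)} + (D^T W^{-1} D)^{-1} D^T W^{-1} (y - \mu)$ | (6) |
where $D = \partial \mu / \partial \beta^T$ is the $n \times p$ derivative matrix and $W = \mathrm{diag}\{\hat{V}(\mu_1), \ldots, \hat{V}(\mu_n)\}$, both evaluated at $\beta^{(k)}$.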
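The alternation between the variance step and the scoring step can be sketched in a few lines. This is an illustration under stated assumptions, not the fql implementation: it handles only an intercept and one covariate, and it swaps the P-spline fit for a Gaussian kernel smoother of squared residuals so that it runs with the standard library alone. All function names and the bandwidth rule are hypothetical.

```python
import math

def fisher_scoring(beta, x, y, V, n_iter=50, tol=1e-10):
    """Solve the quasi-score equation (log link, intercept + one covariate)."""
    b0, b1 = beta
    for _ in range(n_iter):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            v = V(mu)
            u = mu * (yi - mu) / v      # score contribution
            w = mu * mu / v             # Fisher information weight
            s0 += u
            s1 += u * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det    # information^{-1} times score
        d1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def kernel_variance(mu_hat, sq_resid, floor=1e-6):
    """Stand-in for the P-spline: kernel-smooth squared residuals on fitted means."""
    spread = max(mu_hat) - min(mu_hat)
    h = spread / 4 if spread > 0 else 1.0   # ad hoc bandwidth (assumption)
    def V(m):
        num = den = 0.0
        for mi, ri in zip(mu_hat, sq_resid):
            k = math.exp(-0.5 * ((m - mi) / h) ** 2)
            num += k * ri
            den += k
        return max(num / den, floor) if den > 0 else floor
    return V

def fit_fql_sketch(x, y, n_outer=10, tol=1e-6):
    """Alternate the variance step and the scoring step until beta stabilizes."""
    start = (math.log(sum(y) / len(y)), 0.0)
    V = lambda m: 1.0                         # Step 1: constant variance
    beta = fisher_scoring(start, x, y, V)
    for _ in range(n_outer):
        mu_hat = [math.exp(beta[0] + beta[1] * xi) for xi in x]
        sq_resid = [(yi - mi) ** 2 for yi, mi in zip(y, mu_hat)]
        V = kernel_variance(mu_hat, sq_resid)        # Step 2
        new_beta = fisher_scoring(beta, x, y, V)     # Step 3
        done = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if done:
            break
    return beta, V
```

A call such as `fit_fql_sketch(x, y)` returns the coefficient pair and the fitted variance function. A practical analysis would use a penalized spline smoother with data-driven smoothing, as in Step 2, rather than the fixed-bandwidth kernel used here.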
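Following McCullagh and Nelder (1989)14, the covariance matrix of $\hat{\beta}$ for this GLM is given as below:
| $\widehat{\mathrm{Cov}}(\hat{\beta}) = (\hat{D}^T \hat{W}^{-1} \hat{D})^{-1}$ | (7) |
where $\hat{D} = \partial \mu / \partial \beta^T$ and $\hat{W} = \mathrm{diag}\{\hat{V}(\hat{\mu}_1), \ldots, \hat{V}(\hat{\mu}_n)\}$ are evaluated at $\hat{\beta}$.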
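For intuition, the model-based covariance of a quasi-likelihood fit is the inverse of the weighted information matrix, in which each observation contributes its squared mean derivative divided by its variance. The sketch below assumes a log link with an intercept and one covariate; the function name and inputs are hypothetical.

```python
import math

def fql_covariance(beta, x, V):
    """Inverse information for (intercept, slope) under a log link."""
    i00 = i01 = i11 = 0.0
    for xi in x:
        mu = math.exp(beta[0] + beta[1] * xi)
        w = mu * mu / V(mu)          # (d mu / d eta)^2 / V(mu)
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    det = i00 * i11 - i01 * i01
    return [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]

# Toy comparison: doubling the variance function doubles the covariance.
cov_pois = fql_covariance((0.0, 0.0), [0.0, 1.0], lambda m: m)
cov_over = fql_covariance((0.0, 0.0), [0.0, 1.0], lambda m: 2.0 * m)
std_errors = [math.sqrt(cov_pois[j][j]) for j in range(2)]
print(cov_pois, cov_over, std_errors)
```

The comparison shows why a misspecified variance function distorts standard errors: assuming $V(\mu) = \mu$ when the true variance is larger understates the uncertainty, which is the pattern of under-coverage reported for the Poisson GLM in the simulations.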
3. Simulation
In the simulation study, we generate data by the following distributions: negative binomial, Poisson, Gamma, and Pareto distribution. Since Gamma and Pareto are continuous distributions, we consider them as mis-specified distributions, and take the rounded values as the count outcome. We generate 600 datasets for each distribution, and within each dataset the sample size . We include a single covariate uniformly distributed over [0, 1]. In all four distributions we use the same mean structure, i.e., . We evaluate the type I error with the setting and and the power with setting and , as the proportion of the simulations that reject the null hypothesis.
Example 1: negative binomial distribution
The negative binomial distribution with number of failures and success probability has a density function for . We have the mean and the variance , therefore, . Here we set , i.e., . We generate from the negative binomial distribution with mean , i.e., with ; and and , respectively.
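One way to generate negative binomial counts with a covariate-dependent mean is the gamma-Poisson mixture, sketched below with the standard library only. The parameter values (`beta0`, `beta1`, `phi`, the sample size) are placeholders chosen for illustration and are not the settings used in this study.

```python
import math
import random

def rpois(lam, rng):
    """Knuth's multiplication method; adequate for small to moderate means."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnegbin(mu, phi, rng):
    """Gamma-Poisson mixture: mean mu, variance mu + phi * mu^2."""
    lam = rng.gammavariate(1.0 / phi, mu * phi)   # gamma with mean mu
    return rpois(lam, rng)

def simulate_dataset(n, beta0, beta1, phi, seed):
    """Covariate uniform on [0, 1]; log link for the mean."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [rnegbin(math.exp(beta0 + beta1 * xi), phi, rng) for xi in x]
    return x, y

x, y = simulate_dataset(n=20, beta0=1.0, beta1=0.5, phi=0.5, seed=1)
print(list(zip(x, y))[:3])
```

Setting `phi` close to zero makes the gamma mixing distribution concentrate at its mean, so the counts approach the Poisson case.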
Example 2: Poisson distribution
The Poisson distribution is a distribution with . Here we set , with ; and and , respectively.
Example 3: Mis-specified (gamma) distribution
We generate a mis-specified model using gamma distribution with two parameters: shape and scale . We have and . We set and , so and . Here we let , with ; and and , respectively.
Example 4: Pareto distribution
The Pareto distribution is a power-law distribution originally designed to describe the distribution of wealth in a society, fitting the phenomena of “80–20” rule that 80% of wealth is held by a small fraction of the population24,25. This is very similar to the characteristics of the distribution of microbiome count data. Here we consider a Pareto (Type I) distribution, with shape parameter and scale parameter . The mean of the Pareto distribution is for ; and for we have the variance function as . As discussed above we also let , with ; and and , respectively.
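The variance-mean relationship implied by a Pareto (Type I) distribution can be written out explicitly. The symbols $\alpha$ (shape) and $x_m$ (scale) below are labels chosen for this derivation and may differ from the notation used elsewhere in the paper.

```latex
\begin{aligned}
\mu = E(Y) &= \frac{\alpha x_m}{\alpha - 1}, \qquad \alpha > 1, \\
\mathrm{Var}(Y) &= \frac{\alpha x_m^2}{(\alpha - 1)^2 (\alpha - 2)}, \qquad \alpha > 2, \\
x_m = \frac{(\alpha - 1)\mu}{\alpha}
\;\Longrightarrow\;
V(\mu) &= \frac{\mu^2}{\alpha(\alpha - 2)}.
\end{aligned}
```

For a fixed shape $\alpha > 2$ the variance is therefore proportional to the squared mean, a relationship that neither the Poisson nor the negative binomial variance function reproduces exactly.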
We fit all datasets from the above four examples with three models: our FQL model, the negative binomial GLM, and the Poisson GLM. The results for the coefficient estimates are shown in Tables 1–4. We note that all three models are asymptotically unbiased as they have the correct model for the mean. As a result, the mean square errors (MSE) from the three models are very close. Hereafter we focus only on the standard error estimates and the 95% coverage probabilities for inference. We also present the type I error (false positive) and power for each setting in Figure 1.
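The summary columns reported in the tables can be computed from the per-replicate estimates and standard errors. The sketch below assumes Wald-type 95% intervals and a Wald test of a zero coefficient; the function name and the toy inputs are hypothetical, and the bias is reported with its sign.

```python
import statistics

def summarize_replicates(estimates, std_errors, truth, z=1.959964):
    """Bias, MSE, empirical SD, mean SE, coverage and rejection rates."""
    n = len(estimates)
    covered = sum(abs(e - truth) <= z * s for e, s in zip(estimates, std_errors))
    rejected = sum(abs(e) > z * s for e, s in zip(estimates, std_errors))
    return {
        "bias": sum(estimates) / n - truth,
        "mse": sum((e - truth) ** 2 for e in estimates) / n,
        "sd": statistics.stdev(estimates),        # sampling SD across replicates
        "se": sum(std_errors) / n,                # average model-based SE
        "coverage_pct": 100.0 * covered / n,
        "reject_pct": 100.0 * rejected / n,       # type I error or power
    }

out = summarize_replicates([0.9, 1.1, 1.0, 1.4], [0.1] * 4, truth=1.0)
print(out)
```

Comparing the `sd` and `se` entries is the key diagnostic: when the average model-based SE falls below the empirical SD, the intervals are too narrow and coverage drops below the nominal level. The rejection rate is a type I error when the true coefficient is zero and a power otherwise.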
Table 1A.
Comparison of different methods for data simulated from NB distribution with , and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00484 | 0.00718 | 0.085 | 0.083 | 94.8 | 0.00296 | 0.02049 | 0.143 | 0.141 | 95.5 |
| Poisson GLM | 0.00501 | 0.00722 | 0.085 | 0.059 | 81.3 | 0.00329 | 0.02061 | 0.144 | 0.100 | 82.0 |
| Our FQL | 0.00848 | 0.00841 | 0.091 | 0.085 | 93.7 | 0.00530 | 0.02301 | 0.152 | 0.143 | 94.5 |

Table 1B.
Comparison of different methods for data simulated from NB distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00391 | 0.00750 | 0.087 | 0.086 | 94.5 | 0.00173 | 0.02163 | 0.147 | 0.149 | 95.3 |
| Poisson GLM | 0.00391 | 0.00751 | 0.087 | 0.061 | 83.7 | 0.00173 | 0.02165 | 0.147 | 0.105 | 85.0 |
| Our FQL | 0.00613 | 0.00784 | 0.088 | 0.086 | 93.5 | 0.00132 | 0.02248 | 0.149 | 0.149 | 95.5 |
Table 4A.
Comparison of different methods for data simulated from the mis-specified Pareto distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.01815 | 0.00653 | 0.079 | 0.064 | 91.5 | 0.04257 | 0.02100 | 0.139 | 0.109 | 89.0 |
| Poisson GLM | 0.01821 | 0.00671 | 0.080 | 0.059 | 87.7 | 0.04271 | 0.02161 | 0.140 | 0.100 | 85.3 |
| Our FQL | 0.00009 | 0.00551 | 0.074 | 0.070 | 92.5 | 0.00867 | 0.01742 | 0.132 | 0.119 | 92.3 |

Table 4B.
Comparison of different methods for data simulated from the mis-specified Pareto distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.02290 | 0.00661 | 0.078 | 0.064 | 91.5 | 0.00091 | 0.01850 | 0.136 | 0.111 | 90.8 |
| Poisson GLM | 0.02303 | 0.00681 | 0.079 | 0.060 | 88.7 | 0.00094 | 0.01916 | 0.138 | 0.104 | 89.7 |
| Our FQL | 9.481e-06 | 0.00542 | 0.074 | 0.070 | 92.2 | 0.00814 | 0.01726 | 0.131 | 0.119 | 92.3 |
Figure 1.

Powers and Type I errors for simulation study
As shown in Table 1A, when data are generated from the negative binomial distribution, the negative binomial GLM and our FQL model have small biases in standard error estimation, both resulting in coverage probabilities close to the nominal level of 95%. In comparison, the Poisson GLM yields estimated standard errors (SE) smaller than the sampling standard deviation (SD), leading to under-coverage (81.3% for $\beta_0$ and 82.0% for $\beta_1$). In the second negative binomial setting (Table 1B), the Poisson GLM again leads to under-coverage (85% for $\beta_1$). From Figure 1, we can see that the Poisson GLM fails to control the type I error (0.15), while our FQL model and the NB GLM control the type I error reasonably well (0.047 and 0.045).
When the data are generated from the Poisson distribution, we can see that the Poisson GLM, negative binomial GLM, and our FQL model all perform well in terms of coverage probabilities (Table 2A and 2B). Also, as shown in Figure 1, all methods have well controlled type I error, and all models have a very similar power: 0.508 (our FQL model) vs. 0.500 (NB) and 0.490 (Poisson).
Table 2A.
Comparison of different methods for data simulated from Poisson distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00488 | 0.00355 | 0.060 | 0.060 | 94.0 | 0.00164 | 0.00984 | 0.099 | 0.101 | 94.7 |
| Poisson GLM | 0.00478 | 0.00355 | 0.060 | 0.059 | 94.0 | 0.00166 | 0.00983 | 0.099 | 0.100 | 94.7 |
| Our FQL | 0.00497 | 0.00378 | 0.062 | 0.061 | 93.2 | 0.00291 | 0.01044 | 0.102 | 0.103 | 94.3 |

Table 2B.
Comparison of different methods for data simulated from Poisson distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00055 | 0.00372 | 0.061 | 0.062 | 95.2 | 0.00118 | 0.01075 | 0.104 | 0.107 | 95.7 |
| Poisson GLM | 0.00055 | 0.00372 | 0.061 | 0.061 | 95.2 | 0.00117 | 0.01074 | 0.104 | 0.107 | 95.2 |
| Our FQL | 0.00092 | 0.00393 | 0.063 | 0.061 | 94.2 | 0.00152 | 0.01137 | 0.107 | 0.114 | 94.8 |
When the data are generated from the gamma distribution (Tables 3A and 3B), both the negative binomial and Poisson GLMs overestimate the standard deviation, resulting in over-coverage (in both settings the CPs are close to 99%). In contrast, our model performs the best among the three models, with reasonable coverage probabilities for both $\beta_0$ and $\beta_1$. From Figure 1, we can see that our model has better type I error control (0.048), while the NB and Poisson models are more conservative (0.013 each). Our model also has higher power: 0.715 vs. 0.562 (from both NB and Poisson models).
Table 3A.
Comparison of different methods for data simulated from the mis-specified Gamma distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00162 | 0.00234 | 0.048 | 0.059 | 99.0 | 0.00085 | 0.00656 | 0.081 | 0.100 | 98.0 |
| Poisson GLM | 0.00162 | 0.00234 | 0.048 | 0.059 | 99.0 | 0.00085 | 0.00656 | 0.081 | 0.100 | 98.0 |
| Our FQL | 0.00436 | 0.00283 | 0.053 | 0.050 | 93.3 | 0.00397 | 0.00734 | 0.085 | 0.082 | 93.3 |

Table 3B.
Comparison of different methods for data simulated from the mis-specified Gamma distribution with : and
| Fitting Model | bias ($\beta_0$) | MSE ($\beta_0$) | SD ($\beta_0$) | SE ($\beta_0$) | CP% ($\beta_0$) | bias ($\beta_1$) | MSE ($\beta_1$) | SD ($\beta_1$) | SE ($\beta_1$) | CP% ($\beta_1$) |
|---|---|---|---|---|---|---|---|---|---|---|
| NB GLM | 0.00043 | 0.00227 | 0.048 | 0.061 | 98.8 | 0.00091 | 0.00680 | 0.083 | 0.105 | 98.7 |
| Poisson GLM | 0.00043 | 0.00227 | 0.048 | 0.061 | 98.8 | 0.00091 | 0.00680 | 0.083 | 0.105 | 98.7 |
| Our FQL | 0.00063 | 0.00235 | 0.049 | 0.048 | 95.2 | 0.00164 | 0.00680 | 0.085 | 0.084 | 95.2 |
When the data are generated from the Pareto distribution, our proposed FQL model always obtains the most reliable variance estimation in both settings compared to the other two models (Table 4A and 4B). Table 4A and 4B show that both negative binomial and Poisson GLMs underestimate the standard deviation, and as a result their CPs are smaller than that of our proposed FQL model. Furthermore, as shown in Figure 1, our FQL model has higher power and lower type I error than the other two models.
In summary, we can see that our FQL model performs similarly compared to the true parametric models, while the other two models may lead to erroneous inference when the underlying model is mis-specified. Therefore, our model should be preferred in practical data analysis.
Per suggestions from a reviewer, we add a simulation study to assess the performance of our method for differential abundance analysis of microbiome data with compositionality and zero inflation. In this regard, we draw on the work of Yang and Chen (2022)26, who developed a semi-parametric real data based simulation framework for data generation and a linear model based permutation test model ZicoSeq for differential abundance analysis. Yang and Chen (2022)26 conducted comprehensive simulation studies to assess the performance of ZicoSeq alongside major existing differential abundance analysis methods. The results of their investigations demonstrated the competitive advantage of the ZicoSeq model over these established approaches. Motivated by these findings, we aim to conduct a comparative evaluation of our proposed model with ZicoSeq to gauge our model’s performance and provide valuable insights into the field of differential abundance analysis. The simulation framework captures the complexity of microbiome data by generating random samples from a large reference dataset (nonparametric part) and using these reference samples as templates to generate new samples (parametric part). We use the real data of Adenomas (Hale et al. 2017)18 as the reference dataset. In this study, we will evaluate the performance of several methods, including parametric models such as the negative binomial model, the Poisson model, our proposed semi-parametric model FQL, and the permutation test model ZicoSeq. To do so, for each single simulation run we generate 400 samples with 100 OTUs, of which 20 are differentially expressed. For the differential OTU, the abundance is
where the covariate follows a uniform distribution with , and the random abundance is from the reference real data. To evaluate the performance of the methods, we assess the false discovery rate (FDR) to compare the false discovery control and the true positive rate (TPR) for power comparison over 100 simulation datasets. To comprehensively assess the efficacy of the proposed methodologies across varying abundance levels, we conduct further analysis by stratifying the operational taxonomic units (OTUs) into distinct groups based on their abundance. We classify OTUs residing within the top half of the abundance range as common OTUs, while those residing within the bottom half are designated as rare OTUs. For the preprocessing of our simulation reference dataset, OTUs with prevalence less than 25% are excluded, resulting in the classification of OTUs with prevalence from 100% to 62.5% as common and OTUs with prevalence ranging from 62.5% to 25% as rare. Consequently, in our simulated datasets based on real data, the ratio of common to rare OTUs approximated 3:1 on average. By employing this approach, we aim to capture the potential performance variations of the methods under investigation across different abundance levels.
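The two evaluation metrics can be computed per simulated dataset and then averaged. The sketch below assumes each method returns a list of OTU identifiers it declares differential; the function names and identifiers are hypothetical.

```python
def fdp_tpr(discovered, truly_differential):
    """False discovery proportion and true positive rate for one dataset."""
    discovered = set(discovered)
    truth = set(truly_differential)
    true_pos = len(discovered & truth)
    false_pos = len(discovered - truth)
    fdp = false_pos / len(discovered) if discovered else 0.0
    tpr = true_pos / len(truth) if truth else 0.0
    return fdp, tpr

def average_metrics(per_dataset):
    """Average (FDP, TPR) pairs over simulated datasets to estimate FDR and TPR."""
    n = len(per_dataset)
    return (sum(f for f, _ in per_dataset) / n, sum(t for _, t in per_dataset) / n)

truth = ["otu1", "otu2", "otu3", "otu4", "otu5"]
runs = [
    fdp_tpr(["otu1", "otu2", "otu3", "otu9"], truth),
    fdp_tpr([], truth),
]
fdr, tpr = average_metrics(runs)
print(runs, fdr, tpr)
```

A dataset with no discoveries contributes a false discovery proportion of zero, which is the usual convention when averaging to obtain the FDR. The same functions can be applied separately to the common and rare OTU subsets for the stratified comparison.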
Results are shown in Table 5. Our FQL model achieves the best balance between FDR and TPR: its FDR is comparable to that of ZicoSeq and much smaller than those of the Poisson and negative binomial GLMs, while its TPR is close to that of the negative binomial GLM and much higher than that of ZicoSeq. In the analysis stratified by common and rare OTUs, our proposed model consistently has a lower FDR than both the negative binomial and Poisson models, and a higher TPR than ZicoSeq, in both groups. Notably, ZicoSeq controls the false discovery rate well within the rare OTU group, while our model excels in the common OTU group. These findings highlight the respective strengths of ZicoSeq and our proposed model for specific groups of taxa. Of note, in the Discussion section, we present several approaches to improve our method in handling rare OTUs with substantial zero-inflation.
Table 5.
Comparison of different methods with semi-parametric real data-based simulation.
| Half | Metric | Abundance groups | NB GLM | Poisson GLM | ZicoSeq | FQL |
|---|---|---|---|---|---|---|
| Top | TPR | Overall | 0.4440 | 0.9885 | 0.2135 | 0.3840 |
| Top | TPR | Common | 0.5365 | 0.9914 | 0.2741 | 0.4641 |
| Top | TPR | Rare | 0.1524 | 0.9782 | 0.0152 | 0.1288 |
| Top | FDR | Overall | 0.2886 | 0.77807 | 0.0593 | 0.1204 |
| Top | FDR | Common | 0.2103 | 0.7784 | 0.0567 | 0.0775 |
| Top | FDR | Rare | 0.6438 | 0.7878 | 0.0300 | 0.3150 |
| Bottom | TPR | Overall | 0.3090 | 0.9820 | 0.1475 | 0.2415 |
| Bottom | TPR | Common | 0.3731 | 0.9878 | 0.1923 | 0.2936 |
| Bottom | TPR | Rare | 0.1094 | 0.9663 | 0.0000 | 0.0603 |
| Bottom | FDR | Overall | 0.2565 | 0.7736 | 0.0138 | 0.0680 |
| Bottom | FDR | Common | 0.1308 | 0.7697 | 0.0138 | 0.0242 |
| Bottom | FDR | Rare | 0.6633 | 0.7842 | 0.0000 | 0.2083 |
Per suggestion from a reviewer, we also compare their performance under , and the results are consistent, as shown in the bottom half of Table 5.
Finally, in the supplemental materials we present the results of the parametric data simulation with sample size n=200, as well as the semi-parametric real data based simulation with 10% of the OTUs differentially expressed (instead of the 20 of 100 above). Our model performs well in these settings too.
4. Application
Adenomatous polyps, or adenomas, are an understudied abnormal growth in the colon that appears similar to surrounding tissues. Adenomas have been recognized as a critical precursor to colorectal cancer, affecting over 1.8 million people in the US. Hale et al. (2017)18 conducted a study on the early events of carcinogenesis by investigating shifts in the gut microbiota of patients with adenomas. In the updated dataset of their experiment, the fecal microbiota information of 800 patients, including patients with adenomas (n=266) and without (n=534), was collected from standard screening colonoscopies performed between 2001 and 2005 at multiple medical centers, including the Mayo Clinic, Rochester, MN; Kaiser Permanente in Sacramento and Oakland, CA; Oregon Health & Science University, Portland, OR; University of Colorado Health Sciences Center, Denver, CO; Roswell Park Cancer Institute, Buffalo, NY; Indiana University Medical Center, Indianapolis, IN; and other North Central Cancer Treatment Group institutions. The 16S rRNA sequencing library was constructed at the University of Minnesota Genomics Center, and sequencing was performed at the Mayo Clinic Medical Genomics Facility. Microbiome count data for different taxa were obtained by sequencing on a MiSeq using a MiSeq Reagent Kit v3 (2 × 300, 600 cycles, Illumina Inc., San Diego, CA, USA) and then processing the reads via the IM-TORNADO bioinformatics pipeline, using a 97% identity threshold to assign operational taxonomic units (OTUs). We thus have the abundance count of each taxon in the microbiome.
In total, 178 OTUs (genus level) are included in the dataset. In order to evaluate the performance of different models under different zero inflation status, we filter the taxa by removing those with prevalence less than 15% (76 OTUs left), and 25% (63 OTUs left) in our analysis. We are interested in the effect of adenomas on the abundance of these OTUs.
We apply our proposed FQL model, the negative binomial GLM, the Poisson GLM, and the ZicoSeq model in the analysis. Similar to Wang et al. (2021)27, we add the log transformed total count of all the taxa for each subject as an offset in every model. The covariates included in our analysis are gender, ever smoking, having polyps or not, and sequencing batch. Gender and ever smoking are binary variables, while two dummy variables are added for the polytomous variable sequencing batch. Among the 800 samples, 450 were male; 457 had ever smoked; 524 had polyps; and 343 were from sequencing batch 1, 222 from batch 2, and the remaining 235 from batch 3.
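The role of the offset can be made explicit with a short rearrangement. Here $N_i$ is a label chosen for the total count of all taxa in subject $i$, and $X_i$, $\beta$ denote the covariate vector and its coefficients; the notation is an assumption for illustration.

```latex
\log(\mu_i) = \log(N_i) + X_i^T \beta
\quad\Longleftrightarrow\quad
\log\!\left(\frac{\mu_i}{N_i}\right) = X_i^T \beta .
```

With the offset included, each coefficient describes a change in the expected abundance relative to the subject's total count, so $\exp(\beta_j)$ is the multiplicative change in that ratio per unit increase in the $j$-th covariate.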
Figure 2 shows Venn diagrams of the sets of significant OTUs from the four models under different OTU prevalence cutoffs. Hypothesis testing uses FDR control for multiple testing, and we consider two significance levels: 0.05 and 0.01. Under all scenarios, ZicoSeq fails to detect any differential taxa. For the analysis with prevalence of 15% (76 OTUs) and significance level of 0.05, the Poisson GLM yields the largest number of significant OTUs (58 OTUs), which is consistent with the simulation study, where Poisson regression fails to control the type I error (Figure 1). Our flexible quasi-likelihood model identifies 8 OTUs with significant coefficients of adenoma effects (shown in Table 6), while the negative binomial regression identifies 6 OTUs. Similar to simulation settings 3 and 4, our proposed FQL model is more powerful and tends to identify more OTUs than the negative binomial GLM when the underlying distribution is mis-specified. There is an overlap of 3 OTUs among the three models. When the significance level is changed to 0.01, the conclusion is consistent: 9 for the FQL model, 5 for the negative binomial model, and an overlap of 2. For the analysis with prevalence of 25% (63 OTUs), at both significance levels 0.05 and 0.01, we observe similar findings, suggesting that our analysis is robust under different zero-inflation levels and significance levels.
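The text does not state which FDR procedure is applied. As one common choice, the Benjamini-Hochberg adjustment can be sketched as follows; the function name and the p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(pvals)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj, significant)
```

Each OTU whose adjusted p-value falls at or below the chosen level (0.05 or 0.01) is declared significant, and the resulting sets for each model are what a Venn diagram would compare.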
Figure 2.

The Venn diagram for different prevalence cutoffs (15% vs 25%) and different significance (0.05 vs 0.01).
Table 6.
The 8 significant OTUs from our FQL model for data with prevalence of 25% and significance level of 0.05 after FDR correction.
| Phylum | Class | Order | Family | Genus | p-value |
|---|---|---|---|---|---|
| Chrysiogenetes | Chrysiogenetes | Chrysiogenales | Chrysiogenaceae | Desulfurispirillum | 1.465587E-03 |
| Firmicutes | Clostridia | Clostridiales | Veillonellaceae | Acidaminococcus | 1.610342E-04 |
| Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | | <1E-8 |
| Firmicutes | Clostridia | Clostridiales | Mogibacteriaceae | | 6.144594E-03 |
| Firmicutes | Clostridia | Clostridiales | Christensenellaceae | Christensenella | 1.120178E-03 |
| Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Pseudobutyrivibrio | 7.634441E-06 |
| Firmicutes | Erysipelotrichi | Erysipelotrichales | Erysipelotrichaceae | cc_115 | <1E-8 |
| Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Erwinia | 1.345151E-04 |
5. Discussion
In this paper we present a flexible quasi-likelihood (FQL) method for heteroscedastic outcomes. By assuming that the variance is an unknown function of the mean, we can accommodate heteroscedasticity by splines. Further, because FQL does not require the specification of the distribution function, it is more robust to model mis-specification. In both the simulation studies and a real microbiome study, we demonstrate that FQL has better performance than the competing models.
Another common feature in microbiome abundance data is zero inflation.28 Some parametric distributions, e.g., Poisson or NB distributions, can accommodate zero values. We model the mean structure in Eqn. (1), which can also handle zero values. However, our method does not specifically address zero inflation, which leads to less satisfactory performance for rare OTUs in the simulation study. When the percentage of zeros is very high, one-part models should be avoided since they may yield biased and unreliable results.28,29 Such data could be handled by three frameworks. The first framework is imputation (zero replacement). For example, Martín-Fernández et al. (2015)30 proposed a geometric Bayesian-multiplicative (GBM) zero replacement method, which uses the Dirichlet prior distribution as the conjugate distribution of the multinomial distribution and applies a multiplicative modification to the non-zero values, thereby preserving the ratios between the non-zero values. The second framework includes zero-inflated Poisson or negative binomial (NB) models.31–33 In these models, zero values can come from two latent classes: either a “true” (structural) zero in Part I or a “random” (sampling) zero in Part II from the realization of the Poisson or NB distribution. Thus, there is no clear-cut distinction ($Y = 0$ vs. $Y > 0$) between Parts I and II of the model. Furthermore, these models have two issues: (i) interpretation is more complicated as we have to distinguish true zeros from random zeros; (ii) computation is more intricate, which could lead to non-convergence or convergence to a local maximum, a common problem for latent class models.33 The third alternative is the hurdle model framework, which clearly separates zero and positive counts.
Traditionally, to model positive counts, we have to use truncated distributions without 0, e.g., zero-truncated negative binomial distribution, which often has a sophisticated density function.34,35 Future research should explore the incorporation of our FQL model into the hurdle model framework. Specifically, we will simply assume that the positive counts in Part II have a mean structure as in Model (1) with some variance structure in Model (2). As a result, we will not need to specify the complicated parametric distribution (e.g., zero-truncated NB) as in traditional hurdle models. Therefore, our model will enjoy the simplicity in the interpretation of hurdle models but avoid the complexity in specifying the truncated distribution.
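A possible form of such a hurdle extension is sketched below. The logistic model for Part I, the covariates $Z_i$ with coefficients $\gamma$, and the notation $\mu_i^{+}$ for the mean of the positive counts are assumptions made here for illustration; only the Part II mean and variance structure follows the description above.

```latex
\begin{aligned}
\text{Part I:}\quad & P(Y_i > 0) = \pi_i, \qquad \mathrm{logit}(\pi_i) = Z_i^T \gamma, \\
\text{Part II:}\quad & \log(\mu_i^{+}) = X_i^T \beta, \qquad \mu_i^{+} = E(Y_i \mid Y_i > 0), \\
& \mathrm{Var}(Y_i \mid Y_i > 0) = V(\mu_i^{+}), \\
\text{Overall mean:}\quad & E(Y_i) = \pi_i \, \mu_i^{+}.
\end{aligned}
```

Under this formulation Part II needs only a mean and a variance function for the positive counts, so no zero-truncated density has to be specified.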
Our model can be further extended in several directions. First, we can use other link functions, e.g., logit link for the relative abundance of microbiome.36 Second, for longitudinal or clustered data, we can add random effects in our model to capture the correlation among repeated measures. Finally, it may be of interest to incorporate the phylogenetic or taxonomic tree structure among different taxa to increase efficiency.37
Supplementary Material
Acknowledgement
Research reported in this publication was supported by NIH UL1 TR002345. We thank Dr. Jinsong Chen for helpful comments and Lu Yang for assistance with the Zicoseq package.
Footnotes
Software
An R package ‘fql’ is available at https://github.com/yimshi/fql.
Declaration of conflicting interests
Dr. Lei Liu is a consultant to Adial Pharmaceuticals.
Data Availability Statement
Restrictions apply to the availability of these data, which were used under license for this study.
References
- 1. Everard A, Cani PD. Diabetes, obesity and gut microbiota. Best Practice & Research Clinical Gastroenterology. 2013; 27:73–83.
- 2. Musso G, Gambino R, Cassader M. Obesity, diabetes, and gut microbiota: the hygiene hypothesis expanded? Diabetes Care. 2010; 33(10):2277–2284.
- 3. Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, et al. Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host & Microbe. 2015; 18(4):489–500.
- 4. Srinivasan S, Hoffman NG, Morgan MT, Matsen FA, Fiedler TL, Hall RW, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS ONE. 2012; 7(6):e37818.
- 5. Garrett WS. Cancer and the microbiota. Science. 2015; 348:80–86.
- 6. Petrosino JF. The microbiome in precision medicine: the way forward. Genome Med. 2018; 10(1):12.
- 7. Gilbert JA, Meyer F, Bailey MJ. The future of microbial metagenomics (or is ignorance bliss?). ISME Journal. 2011; 5:777–779.
- 8. Lin H, Peddada SD. Analysis of microbial compositions: a review of normalization and differential abundance analysis. NPJ Biofilms Microbiomes. 2020; 6:60.
- 9. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014; 1–21.
- 10. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
- 11. Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease. 2015; 26:27663.
- 12. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974; 61:439–447.
- 13. McCullagh P, Nelder JA. Generalized Linear Models. Chapman & Hall: New York, 1989.
- 14. Nelder JA, Pregibon D. An extended quasi-likelihood function. Biometrika. 1987; 74:221–232.
- 15. Basu A, Rathouz P. Estimating marginal and incremental effects on health outcomes using flexible link and variance function models. Biostatistics. 2005; 6:93–109.
- 16. Chen J, Liu L, Zhang D, Shih T. A flexible model for the mean and variance functions, with application to medical cost data. Statistics in Medicine. 2013; 32:4306–4318.
- 17. Chiou JM, Muller HG. Nonparametric quasi-likelihood. The Annals of Statistics. 1999; 27:36–64.
- 18. Hale VL, Chen J, Johnson S, et al. Shifts in the fecal microbiota associated with adenomatous polyps. Cancer Epidemiology, Biomarkers & Prevention. 2017; 26(1):85–94.
- 19. Liu L, Strawderman RL, Johnson BA, O’Quigley JM. Analyzing repeated measures semi-continuous data, with application to an alcohol dependence study. Statistical Methods in Medical Research. 2016; 25:133–152.
- 20. Liu L, Shih YCT, Strawderman RL, Zhang DW, Johnson B, Chai HT. Statistical analysis of zero-inflated nonnegative continuous data: a review. Statistical Science. 2019; 34:253–279.
- 21. Duan N. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association. 1983; 78:605–610.
- 22. Wood SN. Generalized Additive Models: An Introduction with R (2nd Edition). Chapman & Hall/CRC: Boca Raton, 2017.
- 23. Wood SN. Modeling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society, Series B. 2000; 62:413–428.
- 24. Smith RL. Estimating tails of probability distributions. The Annals of Statistics. 1987; 15:1174–1207.
- 25. Gabaix X. Power laws in economics and finance. Annual Review of Economics. 2009; 1:255–293.
- 26. Yang L, Chen J. A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions. Microbiome. 2022; 10(1):130.
- 27. Wang J, Reyes-Gibby CC, Shete S. An approach to analyze longitudinal zero-inflated microbiome count data using two-stage mixed effects models. Statistics in Biosciences. 2021; 13:267–290.
- 28. Xu L, Paterson AD, Turpin W, Xu W. Assessment and selection of competing models for zero-inflated microbiome data. PLoS ONE. 2015; 10:e0129606.
- 29. Smith VA, Neelon B, Maciejewski ML, Preisser JS. Two parts are better than one: modeling marginal means of semicontinuous data. Health Services Outcomes Research Methodology. 2017; 17:198–218.
- 30. Martín-Fernández J-A, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J. Bayesian-multiplicative treatment of count zeros in compositional data sets. Statistical Modelling. 2015; 15:134–158.
- 31. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992; 34:1–14.
- 32. Greene WH. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. NYU Working Paper No. EC-94-10. 1994.
- 33. Bartholomew DJ, Knott M, Moustaki I. Latent Variable Models and Factor Analysis: A Unified Approach. John Wiley & Sons, 2011.
- 34. Long JS. Regression Models for Categorical and Limited Dependent Variables. Sage, 1997.
- 35. Lee AH, Wang K, Yau KK, Somerford P. Truncated negative binomial mixed regression modelling of ischaemic stroke hospitalizations. Statistics in Medicine. 2003; 22:1129–1139.
- 36. Chai HT, Jiang HM, Lin L, Liu L. A marginalized two-part beta regression model for microbiome compositional data. PLoS Computational Biology. 2018; 14:e1006329.
- 37. Washburne AD, et al. Methods for phylogenetic analysis of microbiome data. Nature Microbiology. 2018; 3:652–661.
