Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values

Seungjun Ahn; Somnath Datta

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Mar 23:arXiv:2303.13702v1. [Version 1]

Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values

Seungjun Ahn ¹, Somnath Datta ^1,^*

PMCID: PMC10055480 PMID: 36994149

Abstract

A recent breakthrough in differential network (DN) analysis of microbiome data has been realized with the advent of next-generation sequencing technologies. The DN analysis disentangles the microbial co-abundance among taxa by comparing the network properties between two or more graphs under different biological conditions. However, the existing methods to the DN analysis for microbiome data do not adjust for other clinical differences between subjects. We propose a Statistical Approach via Pseudo-value Information and Estimation for Differential Network Analysis (SOHPIE-DNA) that incorporates additional covariates such as continuous age and categorical BMI. SOHPIE-DNA is a regression technique adopting jackknife pseudo-values that can be implemented readily for the analysis. We demonstrate through simulations that SOHPIE-DNA consistently reaches higher recall and F1-score, while maintaining similar precision and accuracy to existing methods (NetCoMi and MDiNE). Lastly, we apply SOHPIE-DNA on two real datasets from the American Gut Project and the Diet Exchange Study to showcase the utility. The analysis of the Diet Exchange Study is to showcase that SOHPIE-DNA can also be used to incorporate the temporal change of connectivity of taxa with the inclusion of additional covariates. As a result, our method has found taxa that are related to the prevention of intestinal inflammation and severity of fatigue in advanced metastatic cancer patients.

Keywords: differential network analysis, regression modeling, microbial co-abundance, jackknife pseudo-values

1. Introduction

The human microbiome is the collective genomes of microbes or microorganisms localized to the various sites of human body [1]. Recent clinical studies have shown that the microbiome has a regulatory role in a wide array of illnesses in humans, such as cancer [2], human immunodeficiency virus [3], and inflammatory bowel disease (IBD) [4]. Moreover, the human microbiome is linked to emotional well-being [5] and mental health including depression [6], autism spectrum disorders [7], and human brain diseases [8].

Following the advent of next-generation sequencing technologies, the taxonomic composition of microbial communities is better characterized by the amplification of small fragments (or amplicon) of the 16S ribosomal RNA (or 16S rRNA) gene. More recently, shotgun metagenomic sequencing has become an alternative for microbial community profiling [9]. Either sequencing platform typically employs similarity-based clustering algorithms to group 16S rRNA sequences into Operational Taxonomic Units (OTU) [10, 11] that are compositional.

The applications of network theory have been successfully utilized to better appraise the complex symbiotic (or dysbiotic) relationship between microbiome and disease states – microbial co-abundances [12]. The abundance matrix or the observed OTU table is used to infer microbial co-abundances among taxa through either correlation-based approaches or probabilistic graphical models.

The differential network (DN) analysis compares the network properties between two or more graphs under different biological conditions, such as degree centrality. Based on the recent review article [13], there are two methods that are newly available to the DN analysis for microbiome data: Microbiome Differential Network Estimation (MDiNE) [14] and Network Construction and comparison for Microbiome data (NetCoMi) [15]. These methods, however, do not incorporate additional covariates associated with the host or the composition of the microbiome.

It has been recognized that the composition of the gut microbiome is central to the pathogenesis of IBD [4, 16]. In addition, the gut microbiome composition in patients with IBD is largely influenced by various factors including the use of antibiotics, diet, and cigarette smoking [4]. In an analogous fashion, it is not unreasonable to speculate that the structure of the microbial networks can also vary depending on these factors. Thereby, there is a need for statistical methods for DN analysis that can include additional one predictor variables.

One way to accomplish this goal is to use a regression technique based on pseudo-values, a component to calculate the bias-corrected estimator of leaveone-out jackknife resampling procedure [17]. The pseudo-value technique was first postulated by Andersen and his colleagues [18, 19] in the context of multi-state survival models with right-censored data. Since then, it has been well studied in various disciplines of statistics including the interval-censored data [20, 21], clustered data [22, 23], and machine learning methods [24, 25].

The ultimate benefit of this technique is its straightforward inclusion of additional covariates in the generalized linear model [26]. An asymptotic linearity and consistency of pseudo-values given covariates are shown with the second-order von Mises expansion [27, 28]. The pseudo-values can then be used as the response variable in a regression model with the covariates [29]. Several studies reported that the type I error is well controlled at a nominal level of 0.05 while maintaining a high statistical power under the quasi-likelihood generalized linear mixed model [30] and generalized estimating equations framework [23, 31] for pseudo-value regression approach.

Hence, we propose a regression modeling method for DN analysis that regresses the jackknife pseudo-values calculated from a degree centrality of taxa in a microbial network to directly estimate the effects of predictors. In this approach, the grouping variable itself could also be included in the regression model along with additional clinical covariates while regressing the pseudo-values. We loosely refer to this as a “multivariable setting”, whereas in “univariable settings” only the grouping variable is utilized in a DN analysis.

In the present study, we introduce Statistical ApprOacH via Pseudo-value Information and Estimation for Differential Network Analysis (SOHPIE-DNA) that can include covariate information in analyzing microbiome data. We firstly demonstrate the plausibility of the proposed method by comparing the model performances with MDiNE and NetCoMi through simulations under multivariable and univariable settings. Furthermore, the SOHPIE-DNA is applied to illustrate its clinical utility by examining real data from the American Gut Project [32] and the Diet Exchange Study [33] to identify DC taxa with presence of covariates. All statistical analyses are performed in R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).

2. Methods

2.1. Compositional Correlation-Based Methods for Network Estimation

The correlation is a useful proxy measure for identifying co-abundances or dependencies among taxa (or OTUs) in a microbial network. The Sparse Correlations for Compositional Data (SparCC) [34] estimates the pairwise correlations of the log-ratio transformed OTU abundances. Of note, a recent method, namely a Pseudo-value Regression Approach for Network Analysis (PRANA) [35], operates on gene expression data only, which therefore does not use a correlation measure that preserves the compositional profiling.

The co-abundance among taxa is described by a covariance matrix $T \in ℝ^{p \times p}$ where the non-diagonal elements t_jk are expressed by

t_{j k} \equiv Var (log \frac{u_{j}}{u_{k}}) = Var (log u_{j}) + Var (log u_{k}) - 2 cov (log u_{j}, log u_{k}) = σ_{j}^{2} + σ_{k}^{2} - 2 ρ_{j k} σ_{j} σ_{k},

(1)

where u_j and u_k are the fraction of OTU abundances, $σ_{j}^{2}$ and $σ_{k}^{2}$ are the variances of the log-transformed abundances, and ρ_jk is the correlation of taxa j and k, respectively. Moreover, the variance t_jj is approximated by

t_{j j} ≅ (p - 1) σ_{j}^{2} + \sum_{k \neq j} σ_{k}^{2},

(2)

where j, k ∈ {1, …p}. Then the correlation can be estimated by solving equations 1 and 2:

{\hat{ρ}}_{j k} = \frac{{\hat{σ}}_{j}^{2} + {\hat{σ}}_{k}^{2} - {\hat{t}}_{j k}}{2 {\hat{σ}}_{j} {\hat{σ}}_{k}},

(3)

where ${\hat{σ}}_{j}$ , ${\hat{σ}}_{k}$ , and ${\hat{t}}_{j k}$ are the sample estimates of σ_j, σ_k, and t_jk, respectively.

Furthermore, SparCC takes an iterative approach under the assumption (“sparsity of correlations” as in the original paper) that a small number of strong correlations exists in a true network, which hinders the detection of spurious correlations among taxa.

Besides SparCC, we have attempted to use other compositional correlation measures for our differential network analysis. See the Discussion section for further details.

2.2. Pseudo-value Approach

Consider undirected network estimated from n individuals. It can then be represented by the p×p association matrix that encodes the pairwise correlations ${\hat{ρ}}_{j k}$ between a pair of taxa j, k ∈ {1, …, p}. The association matrix is symmetric $({\hat{ρ}}_{j k} = {\hat{ρ}}_{k j})$ where the non-diagonal entries are either non-zero (i.e., some association between two taxa) or zero (i.e., no association between two taxa). The diagonal entries are all equal to one, because the network is assumed that there is no self-loop (i.e., a node cannot redirect to itself).

The network centrality has been studied to measure the extent of biological or topological importance that a node has in a network [36, 37]. For each taxa k, the network centrality is calculated as the marginal sum of the association matrix. where k = 1, …, p.

{\hat{θ}}_{k} = \sum_{j = 1}^{p} {\hat{ρ}}_{j k},

The jackknife pseudo-values [17] for the i^th individual and k^th taxon are defined by:

{\tilde{θ}}_{i k} = n {\hat{θ}}_{k} - (n - 1) {\hat{θ}}_{k (i)},

(4)

where ${\hat{θ}}_{k (i)}$ is the marginal sum of a taxon calculated based on the re-estimated association matrix using the microbiome data eliminating the i^th subject.

The computational cost of the re-estimation process is dependent on the sample size, as for each taxa k requires n such calculations with the data size of n − 1. A solution to speed up the processing time is the use of parallel computing such as mclapply function in parallel R package.

Let Z ∈ {1, 2} be a binary group indicator and denote 𝒢₁ = {i: Z_i = 1} and 𝒢₂ = {i: Z_i = 2}. Each group has the same set of p taxa, but group-specific sample size n_z = |𝒢_z| for the two groups z = 1, 2. Total sample size is $n = \sum_{z} n_{z}$ . The equation 4 is used to calculate the group-specific jackknife pseudo-values. That is, for taxonP k and group z, we define ${\hat{θ}}_{k}^{z}$ and ${\hat{θ}}_{k (i)}^{z}$ , where i = 1, …, n_z. Then for each i ∈ 𝒢_z, the k^th taxon jackknife pseudo-values are calculated from ${\tilde{θ}}_{i k} = n_{z} {\hat{θ}}_{k}^{z} - (n_{z} - 1) {\hat{θ}}_{k (i)}^{z}$ .

Let X = (X₁, …, X_q) denote q vector of covariates, such as age at diagnosis, current smoking status, and etc. The pseudo-value regression model for the i^th individual and k^th taxon is

μ_{i} = E [{\tilde{θ}}_{i k} ∣ Z_{i}, X_{i}] = α_{k} + β_{k} Z_{i} + \sum_{m = 1}^{q} γ_{k m} X_{i m},

(5)

where μ_i is the k-dimensional mean vector of pseudo-value ${\tilde{θ}}_{i k}$ for the ith individual, α_k is the intercept, β_k is the regressioncoefficient for Z, and γ_k1, …, γ_kq is the set of regression coefficients to be estimated for X. In our setting, the main parameter of interest is given by β_k, the change in network centrality measure of the k^th taxon between two groups.

The least trimmed squares (LTS), also known as least trimmed sum of squares [38], is then implemented to carry out a robust regression. The main advantages of the LTS estimator over other robust estimators including the M-estimator and least median of squares estimator are its computational efficiency and robustness to outliers in both the response and predictor variables [39, 40]. The LTS estimator is defined by

min_{α_{k}, β_{k}, γ_{k 1}, \dots, γ_{k q}} \sum_{i = 1}^{h} r_{(i)} {(α_{k}, β_{k}, γ_{k 1}, \dots, γ_{k q})}^{2},

where r_(i) is the set of ordered absolute values of the residuals sorted in increasing order of absolute value and h may depend on a pre-determined trimming proportion c ∈ [0.5, 1] [41]. For example, one can take h = [n(1 − c)] + 1.

2.3. Hypothesis Testing

We construct the null hypothesis of H_0: β_k = 0 against the research hypothesis H_1: β_k ≠ 0 to test if there is a true difference between groups in the network centrality measure of the k^th taxon. The t-statistic is defined by $U_{k} = {\hat{β}}_{k} / S E ({\hat{β}}_{k})$ for k = 1, …, p, where ${\hat{β}}_{k}$ is the least-squares estimator from the robust regression described in the above equations 5 and $S E ({\hat{β}}_{k})$ is the standard error of ${\hat{β}}_{k}$ . As far as the decision-making process, the asymptotically α-level test rejects H₀ if |U_k| > t_α/2. P-values are calculated using a t-distribution as in robustbase R package [42, 43].

Multiple hypothesis testing is a common feature in the DN analysis, and therefore it is crucial to appropriately control the false discovery rate (FDR). The FDR measures the proportion of false discoveries incurred among a set of DC taxa from the test. Most classically, the concept of FDR was pioneered by Benjamini and Hochberg [44], shown to achieve the FDR control, whilst maintaining the adequate statistical power [45]. However, the q-value [46] offers a less conservative FDR estimation over the conventional Benjamini-Hochberg procedure [47]. The q-value is estimated from the empirical distribution of the observed p-values, and keeps the balance between true positives and false positives [48]. Accordingly, the q-value is applied to adjust for the multiplicity control in the present paper using fdrtool R package.

2.4. Algorithm

The SOHPIE-DNA algorithm is described below in Algorithm 1.

2.5. Performance Evaluations

2.5.1. Construction of Adjacency Matrices

Generate the scale-free random network (or Barabási-Albert network) [49] with p nodes using the igraph R package [50]. A network is scale-free if its degree distribution follows a power-law distribution. In other words, a small portion of “hub” nodes has the highest degree centrality, while most nodes have lower degree centrality.

The two identical p×p adjacency matrices, where the diagonal entries are 0 and non-diagonal entries are either {0, 1}, are obtained from this random network. At the end of the data generation phase using SparseDOSSA2 in Simulated Data section 2.6.1, we are able to identify which taxa are spike-in associated with the covariate for each z = 1, 2. In order to distinguish networks representative of z = 1 (e.g., healthy control) from that of z = 2 (e.g., disease group), we keep track of the indices of these covariate-dependent taxa. We perturb the random networks by removing all the connected edges around nodes of the adjacency matrix using the indices recorded previously for each group. The network plots are provided to visually demonstrate the perturbed adjacency matrices (see Figure 1 and 2).

Fig. 1 — Network plots visualizing the microbial network (p = 20) with a covariate dependence structure that depends on continuous age and binary group information (δ₁ = 0.05 (left), δ₂ = 0.2 (right)). This represents the multivariable setting.

Fig. 2 — Network plots visualizing the microbial network (p = 20) without a covariate dependence structure that depends on binary group only (δ₁ = 0 (left), δ₂ = 0.2 (right)). This represents the univariable setting.

2.5.2. Performance Measures

Four performance metrics are adopted to evaluate our proposed method: precision, recall, F1-score, and accuracy. Let $Ω^{z} \in ℝ^{p \times p}$ be the group-specific adjacency matrix, where

Ω_{j k}^{z} = {\begin{array}{l} 1 & if the two nodes j and k are connected \\ 0 & otherwise, \end{array}

for z = 1, 2. Next, a node-specific true connection is calculated

η_{k} = I (\sum_{j = 1}^{p} | Ω_{j k}^{1} - Ω_{j k}^{2} | > 0),

indicating that taxa k has differential connectivity (DC).

In terms of notation, we use q_ks to denote a q-value [46] of taxa k at the simulation replicate s. An error rate control of α = 0.05 is used throughout the simulation. In the following, we present the details of each performance metric.

Precision is the fraction of taxa which are declared to be significantly DC from the test that are confirmed as true:

Precision = \frac{\sum_{k = 1}^{p} η_{k} I (q_{k s} < α)}{\sum_{k = 1}^{p} I (q_{k s} < α)} .

Recall is the fraction of truly DC taxa which are correctly declared to be significant between two comparing groups from the test:

Recall = \frac{\sum_{k = 1}^{p} η_{k} I (q_{k s} < α)}{\sum_{k = 1}^{p} η_{k}} .

The F1 score is the harmonic mean of precision and recall values. A higher F1 score indicates a better overall performance with lower false negative and false positive predictions:

F 1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} .

Accuracy is defined as the fraction of total number of taxa that are correctly predicted to be DC. The accuracy ranges from 0 (no correct predictions) to 1 (perfect predictions):

Accuracy = \frac{\sum_{k = 1}^{p} I (η_{k} = 1) I (q_{k s} < α) + \sum_{k = 1}^{p} I (η_{k} = 0) I (q_{k s} \geq α)}{\sum_{k = 1}^{p} I (η_{k} = 1) + \sum_{k = 1}^{p} I (η_{k} = 0)} .

2.6. Materials

2.6.1. Simulated Data

The synthetic microbiome dataset are structured with p taxa and n sample size. In the simulation, binary group indicators 1 and 2 are generated from a Bernoulli distribution with equal probabilities and a single continuous covariate X ~ N(55, 10) (e.g., age at diagnosis). We test our proposed method on datasets under two different simulation scenarios: taxa are impacted by the effect of (1) Z and X or (2) Z only, which each corresponds to “multivariable” and “univariable” settings, respectively.

The actual microbial data generation (e.g., OTU counts) given the covariates is described next. In this context, it is perhaps worth mentioning that this part is completely different from generating gene expression data as in PRANA [35]. For each simulation scenario, we generate an OTU table that resembles the dependence structure of covariates Z and/or X on the microbial community (or the network) using the SparseDOSSA2 (Sparse Data Observations for the Simulation of Synthetic Abundances) R package [51]. SparseDOSSA2 adopts a Bayesian Gaussian copula model with zero-inflated, truncated lognormal distributions to capture the marginal distributions of each microbial taxa and to account for the correlation between taxa.

The package has a feature to indicate a user-specified percentage of taxa to be “spiked-in” association with the clinical information (or metadata). This is referred to as the “effect size” of differential abundance δ. To evaluate the effect size of Z under the univariable setting, we generate the data that half of the samples have taxa with no spike-in association, whereas the other half of the samples have spike-in association on 5%, 10%, or 20% of taxa. The distributions of age in the two groups are different. Therefore, under the multivariable setting, 5%, 10%, or 20% of taxa have spike-in association with X for each group z = 1, 2. In both scenarios, n_z ×p matrices for each group z = 1, 2 will be available for use.

2.6.2. Application Study

The American Gut Project Data

A pre-processed OTU table of the human stool microbiome samples from the American Gut Project [32] is available in the SpiecEasi R package, along with the corresponding metadata information. The gut microbiome is involved with the bidirectional relationship between the gastrointestinal system and central nervous system (i.e. gut-brain axis) that impacts on the migraine inflammation [52].

In the analysis, the main variable of interest is a binary variable indicating the migraine headache (yes or no). Age [53], sex [53], exercise frequency (≥ 3 days per week or otherwise) [54], and categorical alcohol consumption (heavy, moderate, or non-drinking) [55] are covariates that are included in the multivariable model. Additionally, migraine has been associated with the periodontal inflammation [56] and pet ownership [57], and therefore the oral hygiene behavior such as dental floss frequency (≥ 3 times per week or otherwise) and living with a dog (yes or no) were included in the model.

The initial OTU table consists of 138 taxa with 296 subjects. No taxa were removed, however, 28 subjects were excluded due to unidentified sampling body site and missing age or sex information. Hence, 138 taxa and 268 subjects were used for the analysis.

The Diet Exchange Study Data

A pre-processed data of the geographical epidemiology study [33] is available in microbiome [58] R package. The aim of the study was to assess the effect of fat and fiber intake of the diet on the composition of the colonic microbiota by switching the diet in study populations with high (African-Americans from Pittsburgh area of Pennsylvania; AA) and low (rural South Africans from KwaZulu region; RA) colon cancer risk for two weeks.

An initial OTU table contains 130 taxa with 38 subjects. After the exclusion of a subject with missing post-dietary intervention data and 18 rare taxa that appear in fewer than 10% of the samples, 112 taxa with 37 subjects (20 AA and 17 RA) are used for the analysis.

The main predictor variable is binary geographic location (AA or RA). Additional covariates considered in a multivariable model were sex and BMI groups (obese, overweight, or lean).

For each groups separately, we take the difference of the estimated association matrices (as well as the re-estimated association matrices) between two time points, that is, the endoscopy before and after two weeks of dietary change. The differences are then used to calculate the jackknife pseudo-values as in the previous sections. This additional step is intended to incorporate the temporal change of connectivity of each taxa after dietary interventions.

3. Results

3.1. Simulation Study

The sample size n = 20, 50, 200, 500 are considered for each microbial network with p = 20, 40 taxa over 1,000 Monte Carlo replicates. Simulations are repeated to assess the effects of covariates on taxa by changing the effect size, δ = 0.05, 0.1, 0.2, which is described in Simulated Data section 2.6.1. A new network is generated at each simulation replicate to account for biological variability of the network structure.

The performance metrics provided in section 2.5.2 are computed by comparing the test results with the true network. In the true network setting, a taxa is truly DC between groups if it is connected to at least one neighbor taxa. Tables 1 and 2 summarize simulation results under the multivariable setting. That is, a continuous covariate is included with the binary group variable in the regression model. To illustrate the utility of the proposed method on covariate-dependent network, we compared the pseudo-value regression approach with the recent methods available (NetCoMi and MDiNE) that cannot incorporate the additional covariate. Results show that the SOHPIE-DNA consistently maintains high recall values in all specifications of taxa, sample sizes, and effect sizes, and outperforms NetCoMi and MDiNE in almost all cases. A higher F1 score of SOHPIE-DNA indicates that the proposed method can achieve a better overall model performance in the presence of additional covariates, compared with the two competing methods. In general, all metrics improve as n increases and/or when the larger effect size is provided (δ = 0.2), as expected. It is worth noting that the MDiNE poses a practical challenge associated with substantially large computational time and costs. For instance, it requires more than 9 days to complete each simulation for p = 40 and n = 200 from the University of Florida Research Computing Linux server, HiPerGator 3.0 with 32CPU cores and 4GB of RAM per node, while it takes up to 18 hours to execute the same simulation tasks for both the SOHPIE-DNA and NetCoMi with 4CPU cores and 6GB of RAM per node. See Table S1 in Additional File 1 for more details.

Table 1.

The simulation results for the case when the network structure depends on age covariate. The binary group variable in the multivariable regression model (continuous age and binary group) using pseudo-value approach is compared with NetCoMi and MDiNE with 1,000 replicates. A random network with network size p=20 is generated at each simulation replicate. The best results are highlighted in boldface.

				Precision			Recall			FI			Accuracy
p	n	δ ₁	δ ₂	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE
20	20	0.05	0.05	0.26	0.28	0.36	0.69	0.25	0.01	0.39	0.35	0.31	0.42	0.06	0.75
		0.05	0.10	0.36	0.37	0.49	0.69	0.26	0.01	0.45	0.37	0.30	0.45	0.09	0.65
		0.05	0.20	0.52	0.51	0.39	0.69	0.24	0.01	0.57	0.36	0.24	0.51	0.12	0.49
		0.10	0.05	0.36	0.36	0.44	0.70	0.23	0.01	0.46	0.35	0.32	0.45	0.08	0.65
		0.10	0.10	0.42	0.42	0.42	0.69	0.25	0.01	0.51	0.37	0.25	0.47	0.11	0.58
		0.10	0.20	0.54	0.54	0.43	0.70	0.24	0.01	0.59	0.38	0.28	0.52	0.13	0.46
		0.20	0.05	0.50	0.51	0.48	0.69	0.25	0.01	0.56	0.38	0.25	0.51	0.12	0.50
		0.20	0.10	0.53	0.54	0.45	0.70	0.26	0.01	0.58	0.38	0.20	0.52	0.14	0.47
		0.20	0.20	0.60	0.58	0.42	0.70	0.24	0.01	0.62	0.38	0.18	0.54	0.14	0.41
	50	0.05	0.05	0.25	0.27	0.35	0.82	0.53	0.12	0.39	0.37	0.34	0.34	0.13	0.72
		0.05	0.10	0.35	0.37	0.42	0.84	0.58	0.10	0.49	0.44	0.30	0.41	0.20	0.63
		0.05	0.20	0.50	0.52	0.50	0.84	0.63	0.10	0.61	0.54	0.26	0.50	0.31	0.50
		0.10	0.05	0.35	0.37	0.42	0.83	0.58	0.11	0.49	0.44	0.30	0.40	0.20	0.63
		0.10	0.10	0.41	0.43	0.44	0.83	0.62	0.11	0.54	0.48	0.29	0.44	0.25	0.57
		0.10	0.20	0.53	0.55	0.57	0.84	0.64	0.12	0.64	0.56	0.28	0.52	0.34	0.47
		0.20	0.05	0.51	0.52	0.49	0.84	0.61	0.10	0.62	0.54	0.27	0.51	0.31	0.49
		0.20	0.10	0.54	0.55	0.54	0.83	0.64	0.11	0.64	0.57	0.27	0.52	0.35	0.47
		0.20	0.20	0.59	0.59	0.58	0.84	0.69	0.12	0.68	0.61	0.27	0.56	0.40	0.43
	200	0.05	0.05	0.26	0.27	0.26	0.93	0.63	0.45	0.41	0.38	0.35	0.29	0.16	0.52
		0.05	0.10	0.35	0.37	0.35	0.94	0.68	0.45	0.50	0.46	0.38	0.37	0.24	0.51
		0.05	0.20	0.51	0.52	0.50	0.94	0.74	0.48	0.65	0.58	0.47	0.51	0.37	0.49
		0.10	0.05	0.35	0.36	0.35	0.94	0.67	0.46	0.50	0.45	0.40	0.37	0.23	0.51
		0.10	0.10	0.41	0.43	0.40	0.94	0.74	0.48	0.56	0.52	0.43	0.42	0.30	0.50
		0.10	0.20	0.54	0.55	0.53	0.93	0.78	0.50	0.67	0.62	0.49	0.53	0.42	0.50
		0.20	0.05	0.50	0.52	0.48	0.94	0.72	0.48	0.64	0.58	0.46	0.50	0.36	0.49
		0.20	0.10	0.53	0.54	0.54	0.93	0.76	0.51	0.67	0.61	0.51	0.53	0.41	0.51
		0.20	0.20	0.58	0.59	0.56	0.93	0.81	0.52	0.71	0.66	0.52	0.57	0.48	0.49
	500	0.05	0.05	0.25	0.27	0.24	0.96	0.73	0.69	0.41	0.40	0.36	0.27	0.18	0.38
		0.05	0.10	0.35	0.37	0.35	0.97	0.78	0.75	0.51	0.48	0.47	0.36	0.28	0.42
		0.05	0.20	0.51	0.52	0.48	0.96	0.82	0.76	0.65	0.61	0.57	0.51	0.42	0.48
		0.10	0.05	0.34	0.35	0.35	0.96	0.76	0.71	0.50	0.46	0.45	0.35	0.26	0.43
		0.10	0.10	0.42	0.43	0.42	0.97	0.80	0.74	0.57	0.53	0.52	0.42	0.33	0.46
		0.10	0.20	0.53	0.54	0.51	0.96	0.85	0.79	0.67	0.64	0.60	0.53	0.45	0.50
		0.20	0.05	0.50	0.51	0.49	0.96	0.82	0.78	0.65	0.61	0.59	0.50	0.41	0.50
		0.20	0.10	0.53	0.54	0.52	0.97	0.86	0.76	0.67	0.64	0.60	0.53	0.46	0.51
		0.20	0.20	0.59	0.59	0.58	0.96	0.88	0.83	0.72	0.68	0.66	0.58	0.51	0.56

Open in a new tab

Table 2.

The simulation results for the case when the network structure depends on age covariate. The binary group variable in the multivariable regression model (continuous age and binary group) using pseudo-value approach is compared with NetCoMi and MDiNE with 1,000 replicates. A random network with network size p=40 is generated at each simulation replicate. The best results are highlighted in boldface.

				Precision			Recall			FI			Accuracy
p	n	δ ₁	δ ₂	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE
40	20	0.05	0.05	0.26	0.25	0.23	0.64	0.27	0.68	0.68	0.26	0.31	0.44	0.07	0.37
		0.05	0.10	0.34	0.33	0.27	0.64	0.28	0.68	0.43	0.30	0.40	0.46	0.09	0.41
		0.05	0.20	0.50	0.47	0.46	0.66	0.24	0.64	0.55	0.31	0.60	0.50	0.12	0.49
		0.10	0.05	0.35	0.34	0.35	0.65	0.28	0.59	0.44	0.30	0.48	0.46	0.09	0.48
		0.10	0.10	0.42	0.40	0.35	0.65	0.27	0.38	0.49	0.32	0.35	0.48	0.11	0.49
		0.10	0.20	0.53	0.50	0.54	0.67	0.24	0.50	0.58	0.32	0.51	0.51	0.13	0.52
		0.20	0.05	0.50	0.47	0.47	0.65	0.23	0.58	0.55	0.31	0.58	0.50	0.12	0.51
		0.20	0.10	0.53	0.51	0.57	0.66	0.24	0.38	0.57	0.32	0.60	0.51	0.13	0.49
		0.20	0.20	0.58	0.57	0.59	0.69	0.22	0.84	0.62	0.32	0.69	0.53	0.13	0.58
	50	0.05	0.05	0.25	0.26	0.26	0.83	0.44	0.57	0.38	0.31	0.35	0.34	0.11	0.48
		0.05	0.10	0.34	0.34	0.33	0.84	0.45	0.67	0.48	0.37	0.42	0.40	0.15	0.44
		0.05	0.20	0.50	0.48	0.49	0.84	0.40	0.65	0.62	0.41	0.53	0.50	0.20	0.50
		0.10	0.05	0.34	0.34	0.36	0.83	0.44	0.64	0.47	0.36	0.42	0.39	0.15	0.45
		0.10	0.10	0.41	0.41	0.44	0.84	0.48	0.63	0.54	0.41	0.48	0.44	0.19	0.49
		0.10	0.20	0.53	0.52	0.53	0.84	0.43	0.69	0.64	0.44	0.55	0.52	0.23	0.51
		0.20	0.05	0.49	0.48	0.49	0.84	0.40	0.65	0.61	0.41	0.50	0.50	0.20	0.50
		0.20	0.10	0.53	0.51	0.55	0.83	0.41	0.52	0.64	0.43	0.45	0.52	0.22	0.50
		0.20	0.20	0.58	0.57	0.51	0.84	0.40	0.55	0.68	0.44	0.48	0.56	0.23	0.49
	200	0.05	0.05	0.25	0.26	0.25	0.95	0.65	0.92	0.39	0.36	0.39	0.28	0.16	0.29
		0.05	0.10	0.34	0.35	0.33	0.95	0.70	0.92	0.50	0.45	0.48	0.36	0.24	0.35
		0.05	0.20	0.50	0.50	0.49	0.95	0.64	0.93	0.65	0.53	0.63	0.50	0.32	0.49
		0.10	0.05	0.34	0.35	0.34	0.95	0.69	0.93	0.50	0.44	0.49	0.36	0.23	0.36
		0.10	0.10	0.41	0.42	0.42	0.95	0.72	0.95	0.57	0.51	0.58	0.42	0.30	0.43
		0.10	0.20	0.53	0.53	0.55	0.95	0.68	0.95	0.68	0.57	0.69	0.53	0.36	0.54
		0.20	0.05	0.49	0.49	0.49	0.95	0.63	0.95	0.64	0.52	0.64	0.49	0.31	0.49
		0.20	0.10	0.52	0.53	0.53	0.95	0.67	0.94	0.67	0.57	0.67	0.52	0.35	0.52
		0.20	0.20	0.58	0.58	0.57	0.95	0.63	0.93	0.71	0.58	0.70	0.57	0.37	0.57
	500	0.05	0.05	0.25	0.25	0.24	0.98	0.78	0.99	0.39	0.37	0.38	0.26	0.20	0.24
		0.05	0.10	0.34	0.35	0.35	0.97	0.82	0.99	0.50	0.47	0.51	0.35	0.28	0.35
		0.05	0.20	0.49	0.50	0.50	0.98	0.80	0.98	0.65	0.59	0.65	0.49	0.39	0.50
		0.10	0.05	0.34	0.35	0.33	0.98	0.82	0.99	0.50	0.48	0.49	0.35	0.28	0.34
		0.10	0.10	0.41	0.41	0.40	0.98	0.85	0.98	0.57	0.54	0.56	0.41	0.35	0.40
		0.10	0.20	0.52	0.53	0.53	0.98	0.83	1.00	0.68	0.63	0.68	0.52	0.43	0.53
		0.20	0.05	0.50	0.50	0.52	0.98	0.79	1.00	0.65	0.59	0.68	0.50	0.39	0.52
		0.20	0.10	0.52	0.52	0.51	0.98	0.84	0.99	0.67	0.63	0.67	0.52	0.44	0.51
		0.20	0.20	0.58	0.57	0.58	0.98	0.81	1.00	0.72	0.65	0.73	0.57	0.47	0.58

Open in a new tab

Table 3 presents results of the univariable setting, where only the binary group variable is included in the model. In other words, only the effect of group was considered when generating random networks. On the whole, a similar pattern is shown in the univariable setting that the SOHPIE-DNA reaches a higher level of recall, compared with NetCoMi and MDiNE. Overall, our method resulted in a higher F1 score when the smaller network is considered. All of the methods suffer from a low precision with a small effect size (δ = 0.05), but eventually improves with a larger effect size (δ = 0.2).

Table 3.

The simulation results for the case when the network structure does not depend on age covariate. The binary group variable in the univariable regression model (binary group only) using pseudo-value approach is compared with NetCoMi and MDiNE with 1,000 replicates. A random network is generated at each simulation replicate. The best results are highlighted in boldface.

			Precision			Recall			FI			Accuracy
p	n	δ	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE	SOHPIE	NetCoMi	MDiNE
20	20	0.05	0.15	0.14	0.00	0.67	0.15	0.00	0.26	0.33	0.00	0.39	0.02	0.85
		0.10	0.27	0.29	0.42	0.68	0.16	0.00	0.38	0.32	0.28	0.43	0.04	0.73
		0.20	0.47	0.46	0.25	0.67	0.16	0.01	0.53	0.31	0.24	0.49	0.07	0.53
	50	0.05	0.14	0.14	0.11	0.82	0.32	0.03	0.24	0.28	0.42	0.27	0.04	0.83
		0.10	0.27	0.27	0.26	0.82	0.33	0.04	0.39	0.33	0.31	0.35	0.09	0.72
		0.20	0.46	0.46	0.42	0.81	0.33	0.04	0.58	0.40	0.23	0.47	0.15	0.53
	200	0.05	0.14	0.14	0.13	0.94	0.38	0.36	0.24	0.28	0.27	0.19	0.05	0.58
		0.10	0.27	0.27	0.25	0.93	0.39	0.36	0.41	0.35	0.33	0.30	0.11	0.54
		0.20	0.48	0.48	0.44	0.93	0.40	0.36	0.62	0.43	0.40	0.48	0.19	0.49
	500	0.05	0.14	0.14	0.14	0.97	0.52	0.67	0.24	0.26	0.25	0.17	0.07	0.38
		0.10	0.27	0.28	0.26	0.97	0.54	0.65	0.41	0.37	0.37	0.28	0.14	0.42
		0.20	0.47	0.48	0.45	0.96	0.54	0.64	0.62	0.49	0.51	0.47	0.25	0.47
40	20	0.05	0.14	0.13	0.20	0.61	0.18	0.66	0.23	0.23	0.29	0.44	0.02	0.48
		0.10	0.26	0.26	0.26	0.60	0.19	0.74	0.35	0.24	0.40	0.46	0.05	0.38
		0.20	0.46	0.45	0.44	0.60	0.18	0.79	0.51	0.26	0.60	0.49	0.08	0.46
	50	0.05	0.14	0.14	0.13	0.84	0.25	0.73	0.24	0.22	0.23	0.27	0.04	0.33
		0.10	0.26	0.25	0.24	0.82	0.24	0.70	0.39	0.25	0.36	0.34	0.06	0.39
		0.20	0.46	0.46	0.42	0.83	0.25	0.66	0.59	0.32	0.50	0.48	0.11	0.47
	200	0.05	0.14	0.14	0.15	0.94	0.32	0.94	0.24	0.22	0.25	0.18	0.05	0.19
		0.10	0.26	0.27	0.27	0.95	0.33	0.91	0.41	0.30	0.41	0.29	0.09	0.31
		0.20	0.47	0.47	0.47	0.95	0.32	0.91	0.62	0.37	0.61	0.47	0.15	0.47
	500	0.05	0.14	0.14	0.15	0.97	0.41	1.00	0.24	0.22	0.26	0.16	0.06	0.16
		0.10	0.27	0.26	0.26	0.97	0.42	0.99	0.41	0.32	0.41	0.28	0.11	0.26
		0.20	0.47	0.47	0.46	0.98	0.42	0.99	0.63	0.43	0.62	0.47	0.20	0.46

Open in a new tab

3.2. Analysis of the American Gut Project Data

Six out of 138 taxa are found significantly DC between migraineurs vs. non-migraineurs while adjusting for age, sex, exercise frequency, categorical alcohol consumption, oral hygiene behavior, and dog ownership. At the family-level, the DC taxa are members of Ruminococcaceae, Lachnospiraceae, Enterobacteriaceae, Erysipelotri-chaceae, and Bacteroidaceae. Of these families, the absence of Lachnospiraceae has been linked to the active or severe Clostridium difficile infection [59]. Erysipelotrichaceae has been associated with dyslipidemic phenotypes and systemic inflammation [60]. Moreover, a recent study [61] reported that the species enriched among migraineurs include Ruminococcus gnavus and Lachnospiraceae bacterium. The computational time for our analysis was about 12 hours on the high-performance Linux cluster, HiPerGator 3.0 with 16CPU cores and 4GB of RAM per node.

3.3. Analysis of the Diet Exchange Study Data

Out of 112 taxa, 16 are predicted to be significantly DC between AA and RA after the two-week dietary exchange intervention while accounting for their age and BMI group. A complete list of DC taxa represent Bacillus, Bacteroides uniformis et rel., Bacteroides vulgatus et rel., Clostridium ramosum et rel., Coprococcus eutactus et rel., Eggerthella lenta et rel., Escherichia coli et rel., Eubacterium hallii et rel., Eubacterium siraeum et rel., Faecalibacterium prausnitzii et rel., Prevotella oralis et rel., Roseburia intestinalis et rel., Ruminococcus gnavus et rel., Staphylococcus, Uncultured Bacteroidetes, and Xanthomonadaceae. Notably, Roseburia intestinalis contributes to the prevention and management of intestinal inflammation and atherosclerosis [62]. Eubacterium hallii has been negatively associated with the fatigue severity scores of patients with advanced metastatic cancer [63]. The analysis took about an hour and 11 minutes on the HiPerGator 3.0 with 16CPU cores and 4GB of RAM per node.

4. Discussion

In this manuscript, we introduce the SOHPIE-DNA, a pseudo-value regression approach that determines whether a microbial taxa is significantly DC between groups after adjusting for additional covariates. This study is the first of its kind in the literature to develop a regression modeling for the DN analysis in microbiome data, which includes more than one predictor (e.g., group) in the model and predicts features of connectivity of a network. A simulation study shows that, at least for the scenarios considered, the SOHPIE-DNA generally maintains higher recall and F1-score while maintaining similar precision and accuracy, when compared with the most recent state-of-the-art methods: NetCoMi and MDiNE.

In this study, the group-specific jackknife pseudo-values are calculated. Another way of calculating jackknife pseudo-values is to use the entire sample and use the group-level indicator as a covariate into the model. However, in our preliminary simulations, we found that doing it that way led to worse performance.

We analyzed the data from two published studies to showcase the utility of the SOHPIE-DNA. Firstly, 6 taxa are found to be significantly DC between migraineurs and non-migraineurs while accounting for covariates using the data from the American Gut Project. A slight modification to the proposed method is grafted for analyzing the Diet Exchange Study data, where the group-specific difference of the estimated association matrices between two time points are used for the pseudo-value calculation. As a result, 16 significantly DC taxa are identified between AA and RA after the two-week diet exchange intervention with the inclusion of covariates.

The latter application demonstrates the capability of assessing the temporal variation in connectivity measures. However, the SOHPIE-DNA currently has no feature to address the within-subject correlation for repeated measurements at different time points. This opens up an avenue for future investigation of longitudinal microbiome studies. One way of handling this is to use a generalized estimating equations (GEE) type approach for the pseudo-values and utilizing a jackknife estimate of the variance-covariance matrix of the pseudo-values at different time points.

Another line of future research direction to extend our work is to consider the idea of variable selection. This will help finding the best prediction model with a subset of phenotypic variables that are more biologically relevant across more heterogeneous study samples.

Additionally, we made an attempt of fitting a model under the generalized linear model for binary outcomes: logistic regression with or without the Firth’s correction, in case of small sample size. It was challenging to appropriately dichotomize the matrices with jackknife pseudo-values. Further studies will be needed to devise an adaptive algorithm to find a threshold value that better classify the jackknife pseudo-values.

As a last remark, it should be emphasized that methods other than SparCC were also considered for network estimation, which includes the CCLasso [64] and SPIEC-EASI [65] with graphical lasso or neighborhood selection algorithms. However, these were not favorable in terms of runtime or due to not being able to run under certain simulation scenarios. For instance, the computational time to complete the re-estimation step for the SPIEC-EASI took more than 200 minutes for p = 20 with n = 200 for a single simulation replicate. The CCLasso could not estimate the association matrix with small sample size for a smaller network (p = 20 for n = 20, 40, 60).

5. Conclusion

There has been limited research to date that discusses how to adjust for additional covariate information in DN analysis for microbiome data. Herewith, we propose SOHPIE-DNA, a novel pseudo-value regression approach for the DN analysis, which can include additional clinical covariate in the model.

Supplementary Material

NIHPP2303.13702V1-supplement-1.pdf^{(67.1KB, pdf)}

Acknowledgments

The authors are exceptionally thankful for the investigators involved with the American Gut Project and the Diet Exchange Study for sharing their data publicly. S.A. has been supported by Award Number [NIH T32AA025877] from the National Institute on Alcohol Abuse and Alcoholism of the National Institutes of Health. S.A. dedicates this work to remember all the memories that he had with his furry friend, Sofie.

Funding

S.A. is funded by the National Institute on Alcohol Abuse and Alcoholism at the National Institutes of Health under Award Number [NIH T32AA025877].

Footnotes

Competing interests

The authors have no competing interests to disclose.

Code availability

SOHPIE-DNA is available at https://github.com/sjahnn/SOHPIE-DNA.

Data availability

A pre-processed OTU table and metadata from the American Gut Project and from the Diet Exchange Study are available in the SpiecEasi R package and microbiome R package, respectively. Please reach out to the author (Somnath Datta, somnath.datta@ufl.edu) if you have any further inquiries.

References

[1].Weinstock G. M. Genomic approaches to studying the human microbiota. Nature 489, 250–256 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
[2].Bhatt A. P., Redinbo M. R. & Bultman S. J. The role of the microbiome in cancer development and therapy. CA Cancer J Clin 67, 326–344 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[3].Vujkovic-Cvijin I. et al. HIV-associated gut dysbiosis is independent of sexual practice and correlates with noncommunicable diseases. Nat Commun. 11, 2448 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[4].Glassner K. L., Abraham B. P. & Quigley E. The microbiome and inflammatory bowel disease. J Allergy Clin Immunol. 145, 16–27 (2020). [DOI] [PubMed] [Google Scholar]
[5].Lee S. H. et al. Emotional well-being and gut microbiome profiles by enterotype. Sci Rep. 10, 20736 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Valles-Colomer M. et al. The neuroactive potential of the human gut microbiota in quality of life and depression. Nat Microbiol. 4, 623–632 (2019). [DOI] [PubMed] [Google Scholar]
[7].Krajmalnik-Brown R., Lozupone C., Kang D. W. & Adams J. B. Gut bacteria in children with autism spectrum disorders: challenges and promise of studying how a complex community influences a complex disease. Microb Ecol Health Dis. 26, 26914 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Mayer E. A., Knight R., Mazmanian S. K., Cryan J. F. & Tillisch K. Gut microbes and the brain: paradigm shift in neuroscience. J Neurosci. 34 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[9].Durazzi F. et al. Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota. Sci Rep. 11, 3030 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Johnson J. S. et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun. 10, 5029 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
[11].Reuter J. A., Spacek D. V. & Snyder M. P. High-throughput sequencing technologies. Mol Cell 58, 586–597 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[12].Layeghifard M., Hwang D. M. & Guttman D. S. Disentangling interactions in the microbiome: A network perspective. Trends Microbiol. 25, 217–228 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Matchado M. S. et al. Network analysis methods for studying microbial communities: A mini review. Comput Struct Biotechnol J. 9, 2687–2698 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[14].McGregor K., Labbe A. & Greenwood C. MDiNE: a model to estimate differential co-occurrence networks in microbiome studies. Bioinformatics 36, 1840–1847 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[15].Peschel S., Muller C., von Mutius E., Boulesteix A. & Depner M. NetCoMi: network construction and comparison for microbiome data in R. Brief Bioinform. 22, bbaa290 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[16].Lee M. & Chang E. B. Inflammatory bowel diseases (IBD) and the microbiome-searching the crime scene for clues. Gastroenterology 160, 524–537 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[17].Efron B. & Tibshirani R. J. An Introduction to the Bootstrap (Chapman & Hall/CRC, Philadelphia, 1993). [Google Scholar]
[18].Andersen P. K., Klein J. P. & Rosthøj S. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 90, 15–27 (2003). [Google Scholar]
[19].Andersen P. K. & Klein J. P. Regression analysis for multistate models based on a pseudo-value approach, with applications to bone marrow transplantation studies. Scand J Statist. 34, 3–16 (2007). [Google Scholar]
[20].Sabathé C. et al. Regression analysis in an illness-death model with interval-censored data: A pseudo-value approach. Stat Methods Med Res. 29, 752–764 (2020). [DOI] [PubMed] [Google Scholar]
[21].Johansen M. N., Lundbye-Christensen S., Larsen J. M. & Parner E. T. Regression models for interval censored data using parametric pseudo-observations. BMC Med Res Methodol. 21, 36 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[22].Logan B. R., Zhang M. J. & Klein J. P. Marginal models for clustered time-to-event data with competing risks using pseudovalues. Biometrics 67, 1–7 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
[23].Ahn K. W. & Logan B. R. Pseudo-value approach for conditional quantile residual lifetime analysis for clustered survival and competing risks data with applications to bone marrow transplant data. Ann Appl Stat. 10, 618–637 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
[24].Zhao L. & Feng D. Deep neural networks for survival analysis using pseudo values. IEEE J Biomed Health Inform. 24, 3308–3314 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[25].Ginestet P. G., Gabriel E. E. & Sachs M. C. Survival stacking with multiple data types using pseudo-observation-based-AUC loss. J Biopharm Stat. (2022). https://doi.org/doi: 10.1080/10543406.2022.2041655. [DOI] [PubMed] [Google Scholar]
[26].Logan B. R., Klein J. P. & Zhang M. J. Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics 64, 733–740 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
[27].Graw F., Gerds T. A. & Schumacher M. On pseudo-values for regression analysis in competing risks models. Lifetime Data Anal. 15, 241–255 (2009). [DOI] [PubMed] [Google Scholar]
[28].Overgaard M., Parner E. T. & Pedersen J. Asymptotic theory of generalized estimating equations based on jack-knife pseudo-observations. Ann Stat. 45, 1988–2015 (2017). [Google Scholar]
[29].Klein J. P., Gerster M., Andersen P. K., Tarima S. & Perme M. P. SAS and R functions to compute pseudo-values for censored data regression. Comput Methods Programs Biomed. 89, 289–300 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
[30].Wang Y. & Logan B. R. Testing for center effects on survival and competing risks outcomes using pseudo-value regression. Lifetime Data Anal. 25, 206–228 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
[31].Ahn K. W. & Mendolia F. Pseudo-value approach for comparing survival medians for dependent data. Stat Med. 33, 1531–1538 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[32].McDonald D. et al. American gut: an open platform for citizen science microbiome research. mSystems 3, e00031–18 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[33].O’Keefe S. J. et al. Fat, fibre and cancer risk in african americans and rural africans. Nat Commun. 6, 6342 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[34].Friedman J. & Alm E. J. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 8, e1002687 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
[35].Ahn S., Grimes T. & Datta S. A pseudo-value regression approach for differential network analysis of co-expression data. BMC Bioinformatics 24, 8 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
[36].Ashtiani M. et al. A systematic survey of centrality measures for proteinprotein interaction networks. BMC Syst Biol. 12, 80 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[37].Ozgür A., Vu T., Erkan G. & Radev D. R. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24, i277–i285 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
[38].Rousseeuw P. J. Least median of squares regression. J Am Stat Assoc. 79, 871–880 (1984). [Google Scholar]
[39].Ahdesmäki M., Lähdesmäki H., Gracey A., Shmulevich L. & Yli-Harja O. Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 8, 233 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
[40].Alfons A., Croux C. & Gelper S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat. 7, 226–248 (2013). [Google Scholar]
[41].Pison G., Van Aelst S. & Willems G. Small sample corrections for LTS and MCD. Metrika 55, 111–123 (2002). [Google Scholar]
[42].Todorov V. & Filzmoser P. An object-oriented framework for robust multivariate analysis. J Stat Soft. 32, 1–47 (2009). [Google Scholar]
[43].Maechler M. et al. robustbase: Basic Robust Statistics (2022). URL http://robustbase.r-forge.r-project.org/. R package version 0.95–0.
[44].Benjamini Y. & Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B. 57, 289–300 (1995). [Google Scholar]
[45].Benjamini Y. Discovering the false discovery rate. J. R. Statist. Soc. B 72, 405–416 (2010). [Google Scholar]
[46].Storey J. D. A direct approach to false discovery rates. J. R. Statist. Soc. B 64, 479–498 (2002). [Google Scholar]
[47].Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics 9, 303 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
[48].Storey J. D. & Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 100, 9440–9445 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
[49].Barabási A. L. & Albert R. Emergence of scaling in random networks. Science 286, 509–512 (1999). [DOI] [PubMed] [Google Scholar]
[50].Csárdi G. & Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems 1695, 16–27 (2006). [Google Scholar]
[51].Ma S. et al. A statistical model for describing and simulating microbial community profiles. PLoS Comput Biol. 17, e1008913 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[52].Arzani M. et al. Gut-brain axis and migraine headache: a comprehensive review. J Headache Pain. 21, 1 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[53].Stewart W. F., Linet M. S., Celentano D. D., Van Natta M. & Ziegler D. Age- and sex-specific incidence rates of migraine with and without visual aura. Am J Epidemiol. 134, 1111–1120 (1991). [DOI] [PubMed] [Google Scholar]
[54].Amin F. M. et al. The association between migraine and physical exercise. J Headache Pain. 19, 83 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
[55].Mostofsky E. et al. Prospective cohort study of daily alcoholic beverage intake as a potential trigger of headaches among adults with episodic migraine. Ann Med. 52, 386–392 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[56].Leira Y. et al. Periodontal inflammation is related to increased serum calcitonin gene-related peptide levels in patients with chronic migraine. J Periodontol. 90, 1088–1095 (2019). [DOI] [PubMed] [Google Scholar]
[57].Koivusilta L. K. & Ojanlatva A. To have or not to have a pet for better health? PLoS One 1, e109 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
[58].Lahti L. & Shetty S. microbiome R package (2017). URL 10.18129/B9.bioc.microbiome. Bioconductor. [DOI]
[59].Taur Y. & Pamer E. G. Harnessing microbiota to kill a pathogen: Fixing the microbiota to treat clostridium difficile infections. Nat Med. 20, 246–247 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
[60].Nolan-Kenney R. et al. The association between smoking and gut microbiome in bangladesh. Nicotine Tob Res. 22, 1339–1346 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[61].Chen J., Wang Q., Wang A. & Lin Z. Structural and functional characterization of the gut microbiota in elderly women with migraine. Front Cell Infect Microbiol. 9, 470 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
[62].Nie K. et al. Roseburia intestinalis: A beneficial gut organism from the discoveries in genus and species. Front Cell Infect Microbiol. 11, 757718 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[63].Hajjar J. et al. Associations between the gut microbiome and fatigue in cancer patients. Sci Rep. 11, 5847 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
[64].Fang H., Huang C., Zhao H. & Deng M. CCLasso: correlation inference for compositional data through lasso. Bioinformatics 31, 3172–3180 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
[65].Kurtz Z. D. et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 11, e1004226 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHPP2303.13702V1-supplement-1.pdf^{(67.1KB, pdf)}

Data Availability Statement

[R1] [1].Weinstock G. M. Genomic approaches to studying the human microbiota. Nature 489, 250–256 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] [2].Bhatt A. P., Redinbo M. R. & Bultman S. J. The role of the microbiome in cancer development and therapy. CA Cancer J Clin 67, 326–344 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Vujkovic-Cvijin I. et al. HIV-associated gut dysbiosis is independent of sexual practice and correlates with noncommunicable diseases. Nat Commun. 11, 2448 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Glassner K. L., Abraham B. P. & Quigley E. The microbiome and inflammatory bowel disease. J Allergy Clin Immunol. 145, 16–27 (2020). [DOI] [PubMed] [Google Scholar]

[R5] [5].Lee S. H. et al. Emotional well-being and gut microbiome profiles by enterotype. Sci Rep. 10, 20736 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Valles-Colomer M. et al. The neuroactive potential of the human gut microbiota in quality of life and depression. Nat Microbiol. 4, 623–632 (2019). [DOI] [PubMed] [Google Scholar]

[R7] [7].Krajmalnik-Brown R., Lozupone C., Kang D. W. & Adams J. B. Gut bacteria in children with autism spectrum disorders: challenges and promise of studying how a complex community influences a complex disease. Microb Ecol Health Dis. 26, 26914 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Mayer E. A., Knight R., Mazmanian S. K., Cryan J. F. & Tillisch K. Gut microbes and the brain: paradigm shift in neuroscience. J Neurosci. 34 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] [9].Durazzi F. et al. Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota. Sci Rep. 11, 3030 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] [10].Johnson J. S. et al. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun. 10, 5029 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] [11].Reuter J. A., Spacek D. V. & Snyder M. P. High-throughput sequencing technologies. Mol Cell 58, 586–597 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] [12].Layeghifard M., Hwang D. M. & Guttman D. S. Disentangling interactions in the microbiome: A network perspective. Trends Microbiol. 25, 217–228 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Matchado M. S. et al. Network analysis methods for studying microbial communities: A mini review. Comput Struct Biotechnol J. 9, 2687–2698 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] [14].McGregor K., Labbe A. & Greenwood C. MDiNE: a model to estimate differential co-occurrence networks in microbiome studies. Bioinformatics 36, 1840–1847 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] [15].Peschel S., Muller C., von Mutius E., Boulesteix A. & Depner M. NetCoMi: network construction and comparison for microbiome data in R. Brief Bioinform. 22, bbaa290 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Lee M. & Chang E. B. Inflammatory bowel diseases (IBD) and the microbiome-searching the crime scene for clues. Gastroenterology 160, 524–537 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] [17].Efron B. & Tibshirani R. J. An Introduction to the Bootstrap (Chapman & Hall/CRC, Philadelphia, 1993). [Google Scholar]

[R18] [18].Andersen P. K., Klein J. P. & Rosthøj S. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 90, 15–27 (2003). [Google Scholar]

[R19] [19].Andersen P. K. & Klein J. P. Regression analysis for multistate models based on a pseudo-value approach, with applications to bone marrow transplantation studies. Scand J Statist. 34, 3–16 (2007). [Google Scholar]

[R20] [20].Sabathé C. et al. Regression analysis in an illness-death model with interval-censored data: A pseudo-value approach. Stat Methods Med Res. 29, 752–764 (2020). [DOI] [PubMed] [Google Scholar]

[R21] [21].Johansen M. N., Lundbye-Christensen S., Larsen J. M. & Parner E. T. Regression models for interval censored data using parametric pseudo-observations. BMC Med Res Methodol. 21, 36 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Logan B. R., Zhang M. J. & Klein J. P. Marginal models for clustered time-to-event data with competing risks using pseudovalues. Biometrics 67, 1–7 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] [23].Ahn K. W. & Logan B. R. Pseudo-value approach for conditional quantile residual lifetime analysis for clustered survival and competing risks data with applications to bone marrow transplant data. Ann Appl Stat. 10, 618–637 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] [24].Zhao L. & Feng D. Deep neural networks for survival analysis using pseudo values. IEEE J Biomed Health Inform. 24, 3308–3314 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] [25].Ginestet P. G., Gabriel E. E. & Sachs M. C. Survival stacking with multiple data types using pseudo-observation-based-AUC loss. J Biopharm Stat. (2022). https://doi.org/doi: 10.1080/10543406.2022.2041655. [DOI] [PubMed] [Google Scholar]

[R26] [26].Logan B. R., Klein J. P. & Zhang M. J. Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics 64, 733–740 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] [27].Graw F., Gerds T. A. & Schumacher M. On pseudo-values for regression analysis in competing risks models. Lifetime Data Anal. 15, 241–255 (2009). [DOI] [PubMed] [Google Scholar]

[R28] [28].Overgaard M., Parner E. T. & Pedersen J. Asymptotic theory of generalized estimating equations based on jack-knife pseudo-observations. Ann Stat. 45, 1988–2015 (2017). [Google Scholar]

[R29] [29].Klein J. P., Gerster M., Andersen P. K., Tarima S. & Perme M. P. SAS and R functions to compute pseudo-values for censored data regression. Comput Methods Programs Biomed. 89, 289–300 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] [30].Wang Y. & Logan B. R. Testing for center effects on survival and competing risks outcomes using pseudo-value regression. Lifetime Data Anal. 25, 206–228 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] [31].Ahn K. W. & Mendolia F. Pseudo-value approach for comparing survival medians for dependent data. Stat Med. 33, 1531–1538 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] [32].McDonald D. et al. American gut: an open platform for citizen science microbiome research. mSystems 3, e00031–18 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] [33].O’Keefe S. J. et al. Fat, fibre and cancer risk in african americans and rural africans. Nat Commun. 6, 6342 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] [34].Friedman J. & Alm E. J. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 8, e1002687 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] [35].Ahn S., Grimes T. & Datta S. A pseudo-value regression approach for differential network analysis of co-expression data. BMC Bioinformatics 24, 8 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] [36].Ashtiani M. et al. A systematic survey of centrality measures for proteinprotein interaction networks. BMC Syst Biol. 12, 80 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] [37].Ozgür A., Vu T., Erkan G. & Radev D. R. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24, i277–i285 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] [38].Rousseeuw P. J. Least median of squares regression. J Am Stat Assoc. 79, 871–880 (1984). [Google Scholar]

[R39] [39].Ahdesmäki M., Lähdesmäki H., Gracey A., Shmulevich L. & Yli-Harja O. Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 8, 233 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] [40].Alfons A., Croux C. & Gelper S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat. 7, 226–248 (2013). [Google Scholar]

[R41] [41].Pison G., Van Aelst S. & Willems G. Small sample corrections for LTS and MCD. Metrika 55, 111–123 (2002). [Google Scholar]

[R42] [42].Todorov V. & Filzmoser P. An object-oriented framework for robust multivariate analysis. J Stat Soft. 32, 1–47 (2009). [Google Scholar]

[R43] [43].Maechler M. et al. robustbase: Basic Robust Statistics (2022). URL http://robustbase.r-forge.r-project.org/. R package version 0.95–0.

[R44] [44].Benjamini Y. & Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B. 57, 289–300 (1995). [Google Scholar]

[R45] [45].Benjamini Y. Discovering the false discovery rate. J. R. Statist. Soc. B 72, 405–416 (2010). [Google Scholar]

[R46] [46].Storey J. D. A direct approach to false discovery rates. J. R. Statist. Soc. B 64, 479–498 (2002). [Google Scholar]

[R47] [47].Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics 9, 303 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] [48].Storey J. D. & Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 100, 9440–9445 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] [49].Barabási A. L. & Albert R. Emergence of scaling in random networks. Science 286, 509–512 (1999). [DOI] [PubMed] [Google Scholar]

[R50] [50].Csárdi G. & Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems 1695, 16–27 (2006). [Google Scholar]

[R51] [51].Ma S. et al. A statistical model for describing and simulating microbial community profiles. PLoS Comput Biol. 17, e1008913 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] [52].Arzani M. et al. Gut-brain axis and migraine headache: a comprehensive review. J Headache Pain. 21, 1 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R53] [53].Stewart W. F., Linet M. S., Celentano D. D., Van Natta M. & Ziegler D. Age- and sex-specific incidence rates of migraine with and without visual aura. Am J Epidemiol. 134, 1111–1120 (1991). [DOI] [PubMed] [Google Scholar]

[R54] [54].Amin F. M. et al. The association between migraine and physical exercise. J Headache Pain. 19, 83 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R55] [55].Mostofsky E. et al. Prospective cohort study of daily alcoholic beverage intake as a potential trigger of headaches among adults with episodic migraine. Ann Med. 52, 386–392 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R56] [56].Leira Y. et al. Periodontal inflammation is related to increased serum calcitonin gene-related peptide levels in patients with chronic migraine. J Periodontol. 90, 1088–1095 (2019). [DOI] [PubMed] [Google Scholar]

[R57] [57].Koivusilta L. K. & Ojanlatva A. To have or not to have a pet for better health? PLoS One 1, e109 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R58] [58].Lahti L. & Shetty S. microbiome R package (2017). URL 10.18129/B9.bioc.microbiome. Bioconductor. [DOI]

[R59] [59].Taur Y. & Pamer E. G. Harnessing microbiota to kill a pathogen: Fixing the microbiota to treat clostridium difficile infections. Nat Med. 20, 246–247 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R60] [60].Nolan-Kenney R. et al. The association between smoking and gut microbiome in bangladesh. Nicotine Tob Res. 22, 1339–1346 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R61] [61].Chen J., Wang Q., Wang A. & Lin Z. Structural and functional characterization of the gut microbiota in elderly women with migraine. Front Cell Infect Microbiol. 9, 470 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R62] [62].Nie K. et al. Roseburia intestinalis: A beneficial gut organism from the discoveries in genus and species. Front Cell Infect Microbiol. 11, 757718 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R63] [63].Hajjar J. et al. Associations between the gut microbiome and fatigue in cancer patients. Sci Rep. 11, 5847 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R64] [64].Fang H., Huang C., Zhao H. & Deng M. CCLasso: correlation inference for compositional data through lasso. Bioinformatics 31, 3172–3180 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R65] [65].Kurtz Z. D. et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 11, e1004226 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

This is a preprint.

Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values

Seungjun Ahn

Somnath Datta

Abstract

1. Introduction

2. Methods

2.1. Compositional Correlation-Based Methods for Network Estimation

2.2. Pseudo-value Approach

2.3. Hypothesis Testing

2.4. Algorithm

2.5. Performance Evaluations

2.5.1. Construction of Adjacency Matrices

Fig. 1.

Fig. 2.

2.5.2. Performance Measures

2.6. Materials

2.6.1. Simulated Data

2.6.2. Application Study

The American Gut Project Data

The Diet Exchange Study Data

3. Results

3.1. Simulation Study

Table 1.

Table 2.

Table 3.

3.2. Analysis of the American Gut Project Data

3.3. Analysis of the Diet Exchange Study Data

4. Discussion

5. Conclusion

Supplementary Material

Acknowledgments

Funding

Footnotes

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases