Bioinformatics. 2023 Dec 22;40(1):btad766. doi: 10.1093/bioinformatics/btad766

SOHPIE: statistical approach via pseudo-value information and estimation for differential network analysis of microbiome data

Seungjun Ahn, Somnath Datta
Editor: Lenore Cowen
PMCID: PMC10807904  PMID: 38134422

Abstract

Summary

The SOHPIE R package implements a novel functionality for “multivariable” differential co-abundance network (DN, hereafter) analyses of microbiome data. It incorporates a regression approach that adjusts for additional covariates in DN analyses. This distinguishes it from prominent earlier approaches to DN analysis, such as MDiNE and NetCoMi, which do not offer covariate adjustment when identifying taxa that are differentially connected (DC, hereafter) between individuals with different clinical and phenotypic characteristics.

Availability and implementation

SOHPIE, together with a vignette, is available from the CRAN repository at https://CRAN.R-project.org/package=SOHPIE and is published under the General Public License (GPL) version 3.

1 Introduction

An emerging body of evidence indicates that the microbiome plays a critical and causal role in human health and disease (Mohajeri et al. 2018, Ogunrinola et al. 2020). Recent advancements in next-generation sequencing platforms have enabled comprehensive analysis of microbial diversity (Reuter et al. 2015, Durazzi et al. 2021), most commonly estimated by operational taxonomic units. For instance, in the last decade, large-scale studies such as the Human Microbiome Project (Turnbaugh et al. 2007) and Metagenomics of the Human Intestinal Tract (Qin et al. 2010) have been conducted to uncover the link between the human microbiome and health.

Nonetheless, many statistical challenges remain in the analysis of microbiome data, mainly because the data are compositional (Aitchison 1982) and sparse (Bharti and Grimm 2021). One of the most recent challenges is to unravel the microbial co-abundances among taxa through differential co-abundance network (DN) analysis (Matchado et al. 2021). DN analysis is a method derived from network theory that compares topological properties (e.g. centrality or connectivity) between two or more networks (or graphs) under different biological conditions of individuals (e.g. high-risk versus low-risk groups).

To date, two statistical methods have been developed for DN analysis, namely Microbiome Differential Network Estimation (MDiNE; McGregor et al. 2020) and Network Construction and comparison for Microbiome data (NetCoMi; Peschel et al. 2021). However, neither method offers covariate adjustment. This limits researchers' ability to explore additional clinical or demographic information that may be associated with the microbial composition and the human host.

To fill this gap, we recently proposed SOHPIEDNA, a novel pseudo-value regression approach (Ahn and Datta 2023) for analyzing microbiome data, based on direct marginal modeling with the jackknife resampling technique (Efron and Tibshirani 1993, Andersen et al. 2003). SOHPIEDNA allows covariates to be included directly as independent variables in a regression model and has been shown to improve performance metrics over the two existing methods (MDiNE and NetCoMi) in simulation studies.

In this article, we present an R package named ‘Statistical ApprOacH via Pseudo-value Information and Estimation’ (SOHPIE; pronounced Sofie) for DN analysis of microbiome data. SOHPIE is a software implementation of the methodological framework (SOHPIEDNA), intended to widen reproducibility and accessibility. We describe the algorithm in Section 2, followed by a brief tutorial in Section 3 using built-in test data from the American Gut Project (McDonald et al. 2018) and the Diet Exchange Study (O’Keefe et al. 2015).

2 Software implementation

SOHPIE depends on packages available from the Comprehensive R Archive Network (CRAN), including robustbase, dplyr, fdrtool, and gtools. In addition, it uses base R packages such as parallel and stats.

The package provides a complete framework for DN analysis of microbiome data. It comprises five modules: (i) estimation of the co-abundance network (or association matrix, which is a symmetric matrix of pairwise measures of association between taxa) and calculation of network centrality for the whole data; (ii) repetition of module (i) for each leave-one-out sample; (iii) calculation of jackknife pseudo-values; (iv) robust regression with the jackknife pseudo-values as the response variable and one or more covariates; and (v) extraction of regression analysis results such as q-values (Storey 2002) for each predictor variable. Of note, SparCC (Friedman and Alm 2012) is used to estimate the co-abundance network. The wrapper function for SparCC was obtained from CCLasso (Fang et al. 2015), available on GitHub (https://github.com/huayingfang/CCLasso). See Figure 1 for an illustration of the workflow.
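Modules (i) to (iii) can be illustrated with a minimal, language-agnostic sketch, written here in plain Python rather than with the package's own R functions. Two assumptions are made for illustration only: Pearson correlation stands in for SparCC, and the connectivity of a taxon is taken to be the sum of its absolute associations with all other taxa. The function names are hypothetical and are not part of SOHPIE.

```python
import math
import random

def pearson_matrix(data):
    """Symmetric association matrix between the columns (taxa) of `data`.

    Rows are samples. Pearson correlation is a stand-in (assumption);
    the package itself estimates associations with SparCC.
    """
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    norms = [math.sqrt(sum(r[j] ** 2 for r in centered)) for j in range(p)]
    mat = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            denom = norms[j] * norms[k]
            cross = sum(r[j] * r[k] for r in centered)
            mat[j][k] = mat[k][j] = cross / denom if denom > 0 else 0.0
    return mat

def connectivity(assoc):
    """Assumed connectivity score: sum of absolute off-diagonal associations."""
    p = len(assoc)
    return [sum(abs(assoc[k][j]) for j in range(p) if j != k) for k in range(p)]

def jackknife_pseudo_values(data):
    """Return an n x p table of pseudo-values n*theta - (n-1)*theta_without_i."""
    n = len(data)
    theta_full = connectivity(pearson_matrix(data))        # module (i)
    pseudo = []
    for i in range(n):
        loo = data[:i] + data[i + 1:]                      # module (ii): drop sample i
        theta_loo = connectivity(pearson_matrix(loo))
        pseudo.append([n * tf - (n - 1) * tl               # module (iii)
                       for tf, tl in zip(theta_full, theta_loo)])
    return pseudo

# Toy data: 12 samples, 4 taxa (random numbers, not real abundances).
random.seed(1)
toy = [[random.random() for _ in range(4)] for _ in range(12)]
pv = jackknife_pseudo_values(toy)
```

Each row of `pv` corresponds to one sample and each column to one taxon, which is the shape needed to regress the pseudo-values of a taxon on sample-level covariates in module (iv).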

Figure 1.

A flow diagram describing the algorithmic framework for performing a DN analysis with a pseudo-value regression approach in the SOHPIE R package. Blue text indicates the names of the R functions specific to each module in SOHPIE. Although a user can execute each function separately for a pseudo-value regression analysis, using the single function SOHPIE_DNA() is highly recommended for convenience.

3 Application

We demonstrate the functionality of SOHPIE using two test datasets included in the package: combinedamgut and combineddietswap. These are subsets of the pre-processed datasets available in the SpiecEasi (Kurtz et al. 2015) and microbiome (Lahti and Shetty 2017) R packages, respectively.

3.1 Example I: American Gut Project Data

SOHPIE includes a vignette to assist users in performing the DN analysis with a pseudo-value regression approach. It gives a step-by-step description of the process using the combinedamgut data. The vignette, provided as Supplementary Data, can be viewed on the CRAN page or from the R console by typing browseVignettes("SOHPIE").

After the data are loaded, the user should obtain the indices for each category of the main binary variable separately (e.g. living with versus without a dog). The main function, SOHPIE_DNA(), is then used for the DN analysis. It requires the name of the previously loaded dataset and the indices for each category obtained earlier. The user must also specify a trimming proportion c ∈ [0.5, 1] for the least trimmed squares (LTS) estimator of the robust regression. We have used c = 0.5 throughout this article, a default value suggested in the robustbase package. In the Supplementary Data, a small sensitivity analysis (suggested by a referee) investigates how sensitive the fit is to the choice of the trimming proportion. Overall, the performance appears robust to the choice of c in this example. Further information about the choice of trimming proportion and the LTS estimator can be found elsewhere (Rousseeuw 1984, Pison et al. 2002). SOHPIE_DNA() outputs a list of data.frame objects containing the coefficient estimates, P-values, and q-values of each predictor variable from the pseudo-value regression fitted for each taxon.
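To give some intuition for what the trimming proportion c controls, the following is a rough sketch of the LTS idea for a single predictor, in plain Python. It is not the robustbase algorithm used by the package: candidate lines are restricted to those passing through pairs of observations, and the subset size is assumed to be h = ceil(c * n), whereas robustbase uses its own formula. All function names are hypothetical.

```python
import math
from itertools import combinations

def trimmed_loss(x, y, intercept, slope, h):
    """LTS objective: sum of the h smallest squared residuals."""
    sq = sorted((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return sum(sq[:h])

def lts_line_sketch(x, y, c=0.5):
    """Approximate LTS fit by scanning lines through all pairs of points."""
    n = len(x)
    h = max(2, math.ceil(c * n))  # assumed subset size, for illustration only
    best = None
    for i, j in combinations(range(n), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as y = a + b*x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = trimmed_loss(x, y, intercept, slope, h)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]

# Eight points on the line y = 1 + 2x, plus two gross outliers.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2 * xi + 1 for xi in x]
y[8], y[9] = 100.0, -50.0
intercept, slope = lts_line_sketch(x, y, c=0.5)
```

With c = 0.5, only the best-fitting half of the observations contributes to the loss, so the two outliers are ignored and the sketch recovers the intercept 1 and slope 2. Values of c closer to 1 keep more observations in the loss, trading robustness for efficiency.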

For user convenience, there are functions to quickly retrieve the names of differentially connected (DC) taxa (DCtaxa_tab()), as well as the P-values, q-values, coefficient estimates, and standard errors of the coefficient estimates, either for all variables (pval(), qval(), coeff(), and stderrs()) or for a specific variable of interest (pval_specific_var(), qval_specific_var(), coeff_specific_var(), and stderrs_specific_var()) considered in the analysis. Detailed information on usage, with a demonstration, can be found in the vignette.

3.2 Example II: Diet Exchange Study Data

In this example, we take a slight detour to account for temporal changes in the connectivity of taxa in the analysis of the combineddietswap dataset. Further details are provided in the vignette (see Supplementary Data).

In the original study (O’Keefe et al. 2015), the dietary intervention was designed to assess the level of fat and fiber intake among study participants with high versus low colon cancer risk from two geographically distinct regions: African-Americans from Pittsburgh, Pennsylvania (AAM) versus rural South Africans (AFR), respectively. The study participants underwent an endoscopy at baseline and at 29 days after the dietary intervention.

As a preparation step, the indices are located for each setting (i.e. AAM at baseline, AFR at baseline, AAM at 29 days, and AFR at 29 days). The analysis begins with the estimation and re-estimation of the association matrices for each setting using asso_mat(). For each geographic group (AAM and AFR) separately, the differences between the estimated (and re-estimated) association matrices at the two time points are computed. These matrix differences are then used to calculate network connectivity (thetahats()) and jackknife pseudo-values (thetatildefun()). The last step of this example is to fit the pseudo-value regression with covariates using pseudoreg(). Finally, pseudoreg.summary() is used to produce a list of data.frame objects containing the coefficient estimates, P-values, and q-values of each predictor.
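The matrix-difference step can be sketched as follows, in plain Python rather than with the package functions. The association values are made-up toy numbers, the direction of the difference (day 29 minus baseline) is an assumption, and connectivity is again assumed to be the sum of absolute off-diagonal entries. The helper names are hypothetical.

```python
def matrix_difference(later, earlier):
    """Element-wise difference between two association matrices."""
    return [[a - b for a, b in zip(row_l, row_e)]
            for row_l, row_e in zip(later, earlier)]

def connectivity(assoc):
    """Assumed connectivity score: sum of absolute off-diagonal entries per taxon."""
    p = len(assoc)
    return [sum(abs(assoc[k][j]) for j in range(p) if j != k) for k in range(p)]

# Toy association matrices for three taxa within one geographic group.
baseline = [[1.0, 0.2, -0.1],
            [0.2, 1.0, 0.5],
            [-0.1, 0.5, 1.0]]
day29 = [[1.0, 0.6, -0.3],
         [0.6, 1.0, 0.1],
         [-0.3, 0.1, 1.0]]

change = matrix_difference(day29, baseline)
scores = connectivity(change)  # approximately [0.6, 0.8, 0.6]
```

Applying the same two steps to each leave-one-out re-estimate gives the quantities from which the jackknife pseudo-values of the temporal change are formed.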

4 Conclusion

SOHPIE implements a suite of functions that facilitate differential network analysis for finding DC taxa between two heterogeneous groups. Its key features are the ability to test appropriately for differential connectivity of a co-abundance network and to adjust for covariates by introducing a pseudo-value regression framework. The jackknife-generated pseudo response values for the regression reflect the influence of the i-th sample on the centrality (i.e. connectivity score) of each taxon. The regression model describes the “effect” of the main factor (binary group variable) Z and covariates X on the quantified influences. Thus, DC between two groups is described and quantified by the regression coefficient on Z, in terms of how much the grouping affects the influences on the centrality, adjusting for other covariates. SOHPIE is a user-friendly and open-source software tool.
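In symbols, the framework summarized above can be written roughly as follows. Only Z and X appear in the text. The remaining notation is assumed here for illustration: n is the number of samples, \hat{\theta}_k is the connectivity of taxon k from the whole data, \hat{\theta}_{k(-i)} is the same quantity with sample i left out, and the linear form of the mean model is likewise an assumption.

```latex
% Jackknife pseudo-value of sample i for taxon k (standard jackknife form)
\tilde{\theta}_{ik} = n\,\hat{\theta}_k - (n - 1)\,\hat{\theta}_{k(-i)},
\qquad i = 1, \dots, n

% Pseudo-value regression for taxon k (assumed linear mean model)
E\left[\tilde{\theta}_{ik} \mid Z_i, X_i\right]
  = \alpha_k + \beta_k Z_i + \gamma_k^{\top} X_i
```

Under this notation, taxon k would be declared DC when the test of \beta_k = 0 is rejected after multiplicity adjustment via q-values.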

Supplementary Material

btad766_Supplementary_Data

Acknowledgements

The authors are grateful to the investigators involved with the American Gut Project and the Diet Exchange Study for sharing their data publicly. We are thankful to the two anonymous referees for their helpful comments. S.A. dedicates this work to the memory of his furry friend, Sofie.

Contributor Information

Seungjun Ahn, Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States.

Somnath Datta, Department of Biostatistics, University of Florida, Gainesville, FL 32610, United States.

Conflicts of interest

None declared.

Funding

Research reported in this publication was supported in part by the National Cancer Institute Cancer Center Support Grant [NIH P30CA196521-01] awarded to the Tisch Cancer Institute of the Icahn School of Medicine at Mount Sinai and used the Biostatistics Shared Resource Facility. The content is solely the responsibility of S.A. and does not necessarily represent the official views of the National Institutes of Health.

Data availability

The SOHPIE R package is freely available on CRAN: https://CRAN.R-project.org/package=SOHPIE. The two sample datasets (combinedamgut and combineddietswap) are included in SOHPIE. More information on the original studies and data sources is given in the main text above.

Author contributions

Seungjun Ahn: (Conceptualization [equal], Formal analysis [lead], Investigation [lead], Methodology [equal], Project administration [lead], Software [lead], Validation [lead], Visualization [lead], Writing – original draft [lead], Writing – review & editing [lead]) and Somnath Datta: (Conceptualization [equal], Investigation [supporting], Methodology [equal], Project administration [supporting], Resources [supporting], Validation [supporting], Writing – review & editing [supporting]).

References

  1. Ahn S, Datta S. Differential co-abundance network analyses for microbiome data adjusted for clinical covariates using jackknife pseudo-values. arXiv, arXiv:2303.13792v1, 2023, preprint: not peer reviewed.
  2. Aitchison J. The statistical analysis of compositional data. J R Statist Soc 1982;44:139–60.
  3. Andersen P, Klein J, Rosthøj S. Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 2003;90:15–27.
  4. Bharti R, Grimm D. Current challenges and best-practice protocols for microbiome analysis. Brief Bioinform 2021;22:178–93.
  5. Durazzi F, Sala C, Castellani G et al. Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota. Sci Rep 2021;11:3030.
  6. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Philadelphia: Chapman & Hall/CRC, 1993.
  7. Fang H, Huang C, Zhao H et al. CCLasso: correlation inference for compositional data through Lasso. Bioinformatics 2015;31:3172–80.
  8. Friedman J, Alm E. Inferring correlation networks from genomic survey data. PLoS Comput Biol 2012;8:e1002687.
  9. Kurtz ZD, Muller CL, Miraldi ER et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015;11:e1004226.
  10. Lahti L, Shetty S. microbiome R package. Bioconductor, 2017. https://bioconductor.org/packages/release/bioc/html/microbiome.html.
  11. Matchado MS, Lauber M, Reitmeier S et al. Network analysis methods for studying microbial communities: a mini review. Comput Struct Biotechnol J 2021;19:2687–98.
  12. McDonald D, Hyde E, Debelius JW et al.; American Gut Consortium. American gut: an open platform for citizen science microbiome research. mSystems 2018;3:e00031-18.
  13. McGregor K, Labbe A, Greenwood CMT et al. MDiNE: a model to estimate differential co-occurrence networks in microbiome studies. Bioinformatics 2020;36:1840–7.
  14. Mohajeri MH, Brummer RJM, Rastall RA et al. The role of the microbiome for human health: from basic science to clinical applications. Eur J Nutr 2018;57:1–14.
  15. Ogunrinola G, Oyewale J, Oshamika O et al. The human microbiome and its impacts on health. Int J Microbiol 2020;2020:8045646.
  16. O’Keefe SJD, Li JV, Lahti L et al. Fat, fibre and cancer risk in African Americans and rural Africans. Nat Commun 2015;6:6342.
  17. Peschel S, Müller CL, von Mutius E et al. NetCoMi: network construction and comparison for microbiome data in R. Brief Bioinform 2021;22:bbaa290.
  18. Pison G, Van Aelst S, Willems G. Small sample corrections for LTS and MCD. Metrika 2002;55:111–23.
  19. Qin J, Li R, Raes J et al.; MetaHIT Consortium. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010;464:59–65.
  20. Reuter J, Spacek D, Snyder M. High-throughput sequencing technologies. Mol Cell 2015;58:3587–97.
  21. Rousseeuw P. Least median of squares regression. J Am Stat Assoc 1984;79:871–80.
  22. Storey J. A direct approach to false discovery rates. J R Statist Soc B 2002;64:479–98.
  23. Turnbaugh P, Ley R, Hamady M et al. The human microbiome project. Nature 2007;449:804–10.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
