Abstract
Summary
Microbiome research is now moving beyond the compositional analysis of microbial taxa in a sample. Increasing evidence from large human microbiome studies suggests that the functional consequences of changes in the intestinal microbiome may provide more power for studying their impact on inflammation and immune responses. Although 16S rRNA analysis is one of the most popular and cost-effective methods for profiling microbial composition, marker-gene sequencing cannot provide direct information about the functional genes present in the genomes of community members. Bioinformatic tools have been developed to predict microbiome function from 16S rRNA gene data. Among them, PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) has become one of the most popular functional profile prediction tools and generates community-wide pathway abundances. However, no state-of-the-art inference tools are available to test for differences in pathway abundances between comparison groups. We have developed ggpicrust2, an R package for analyzing functional profiles derived from 16S rRNA sequencing. It enables researchers to conduct extensive differential abundance analyses and to generate visualizations that effectively highlight functional signals. With ggpicrust2, users can obtain publishable results and gain deeper insights into the functional composition of their microbial communities.
Availability and implementation
The package is open source under the MIT license (MIT + file LICENSE) and is available from CRAN and at https://github.com/cafferychen777/ggpicrust2. A Shiny web application is available at https://a95dps-caffery-chen.shinyapps.io/ggpicrust2_shiny/.
1 Introduction
One limitation of microbial community marker-gene sequencing is that it does not provide information about the functional composition of sample communities (Douglas et al. 2020). In recent years, several methods have been developed to predict functions from 16S rRNA sequence data, including PICRUSt2 (Douglas et al. 2020), Tax4Fun2 (Wemheuer et al. 2020), MicFunPred (Mongad et al. 2021), and PICRUSt (Langille et al. 2013), among others. However, the accuracy and applicability of these methods depend on the specific research question and the characteristics of the microbial community being studied. Overall, these methods have greatly enhanced our ability to understand the functional roles of microbial communities in various environments, from the human gut to soil and water ecosystems. Among the various tools available, PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) has emerged as a highly favored instrument for predicting functional profiles, as it facilitates the generation of comprehensive pathway abundances within microbial communities. By doing so, PICRUSt2 provides researchers with valuable insights into the functional roles of microbial communities.
Nonetheless, the community has not yet reached a consensus on the best way to perform inference on, and visualize, the functional abundance output generated by PICRUSt2. Because identifying statistically significant differences in functions and pathways between groups with differential abundance (DA) methods is a critical step in the analysis, the choice of DA approach is important. The official PICRUSt2 wiki initially recommended STatistical Analysis of taxonoMic and functional Profiles (STAMP) (Parks et al. 2014) as the preferred software for analysis and visualization. However, STAMP has not been updated since 2015, so it cannot incorporate the most recent advances in DA analysis, which are crucial for drawing statistical inferences from PICRUSt2 output data. Furthermore, STAMP is difficult to install on macOS, which makes it less user-friendly and may hinder its adoption by researchers in the field. The five DA methods supported by STAMP, namely ANOVA (Fisher 1992), the Kruskal–Wallis H-test (Kruskal and Wallis 1952), the t-test with equal variance (Student 1908), Welch’s t-test (Welch 1938), and White’s non-parametric test (White 1980), have been shown to be relatively inferior in a recent comparison of 20 DA methods across 38 datasets (Nearing et al. 2022). That comparison concluded that ALDEx2 (Fernandes et al. 2013) and ANCOM-II (Kaul et al. 2017) produce the most consistent results across studies and agree best with the intersect of results from different approaches, but it still recommended a consensus approach based on multiple DA methods to help ensure robust biological interpretations (Nearing et al. 2022).
Although several platforms and packages support multiple advanced DA methods, such as MicrobiomeAnalyst (Chong et al. 2020), MicrobiomeExplorer (Reeder et al. 2021), and microbiomeMarker (Cao et al. 2022), they are not specifically designed for PICRUSt2 functional output data. Because PICRUSt2 output differs from 16S rRNA gene data in format and characteristics, these tools, which are intended for 16S rRNA gene data, often encounter difficulties when importing PICRUSt2 data. Although almost all DA methods can be used in R, each method imposes its own burden of data import and parameter configuration, which increases the effort and time required. Additionally, these R packages often lack the ability to visualize DA results and generate publication-quality figures. Thus, a user-friendly R package for analyzing PICRUSt2 functional output data with various DA methods and visualizations is urgently needed to fill these gaps.
2 ggpicrust2 R package
The general workflow of the package is shown in Fig. 1. ggpicrust2 not only supports recently developed advanced DA methods and visualization of their results but can also convert PICRUSt2 output Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) abundance tables into KEGG pathway abundance tables, which cannot be done with PICRUSt2 alone. It also annotates KO, Enzyme Commission (EC), MetaCyc pathway, and KEGG pathway identifiers and enables classification of KEGG pathways. In the future, we plan to support a broader array of functional prediction tools, including but not limited to Tax4Fun2 (Wemheuer et al. 2020), to expand the package's capabilities and utility. We also plan to integrate other methods that have demonstrated strong performance in simulation comparisons, so that the package keeps pace with advances in the field.
Figure 1.
Example of analysis and visualization workflow using the ggpicrust2 R package, divided into two subcomponents. (A) Analysis: This component involves several functions, including compare_metagenome_results(), ko2kegg_abundance(), pathway_daa(), compare_daa_results(), and pathway_annotation(). The input for this component is the PICRUSt2 Output: KO/EC/MetaCyc pathway Abundance Table. (B) Visualization: This component includes functions such as pathway_pca(), pathway_heatmap(), and pathway_errorbar(). The input for this component is the results of the analysis component as well as the original Abundance Table or KEGG Abundance Table. Functions in both components are connected with red lines to illustrate the required steps in the analysis, and with blue lines to depict optional steps in the workflow. The relationship among these functions is flexible, allowing for adaptable use depending on the specific needs of the analysis.
2.1 Data input
ggpicrust2 recommends using the original PICRUSt2 output file, pred_metagenome_unstrat.tsv, without reformatting; csv and txt files are also accepted. It also supports files that have been converted to mimic the PICRUSt2 output format, which keeps it compatible with other data sources.
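To make the expected input layout concrete, the following is a minimal Python sketch of parsing a feature-by-sample abundance table. It is not the package's own reader (ggpicrust2 is an R package). It assumes a layout in which the first column holds feature identifiers and the remaining columns hold one sample each; the function name read_abundance_table, the sample names, and the values are hypothetical.

```python
import csv
import io


def read_abundance_table(handle, delimiter="\t"):
    """Parse a feature-by-sample abundance table.

    Assumes the first column holds feature IDs and every other column is a
    sample. Returns (samples, table), where table maps each feature ID to a
    list of floats ordered like `samples`.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        table[row[0]] = [float(value) for value in row[1:]]
    return samples, table


# Tiny made-up example in the assumed layout (not real PICRUSt2 output).
example = io.StringIO(
    "function\tsampleA\tsampleB\n"
    "K00001\t10.0\t0.0\n"
    "K00002\t2.5\t7.5\n"
)
samples, table = read_abundance_table(example)
print(samples)
print(table["K00002"])
```

Passing delimiter="," would handle a csv file with the same layout.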
2.2 Conversion to KEGG pathway abundance
KO (KEGG Orthology) is a classification system developed by the KEGG database (Kanehisa et al. 2022) that groups genes into functional orthologs, which are in turn assigned to KEGG pathways. To better understand the role of pathways in different groups and to classify the pathways, the KO abundance table can be converted to a KEGG pathway abundance table. PICRUSt offered this conversion, but it was removed in PICRUSt2; ko2kegg_abundance() performs it instead.
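The idea behind such a conversion can be sketched in a few lines of Python. This is an illustration only, not the implementation of ko2kegg_abundance(): it assumes that pathway abundance is the sum of the abundances of the KOs mapped to that pathway, and the function name, mapping, and counts are hypothetical.

```python
from collections import defaultdict


def ko_to_pathway_abundance(ko_table, ko_to_pathways):
    """Sum KO abundances into pathway abundances, sample by sample.

    ko_table:       dict KO ID -> list of abundances (one per sample)
    ko_to_pathways: dict KO ID -> list of pathway IDs containing that KO
    A KO that belongs to several pathways contributes fully to each of them;
    KOs without a mapping are ignored.
    """
    n_samples = len(next(iter(ko_table.values())))
    pathway_table = defaultdict(lambda: [0.0] * n_samples)
    for ko, abundances in ko_table.items():
        for pathway in ko_to_pathways.get(ko, []):
            totals = pathway_table[pathway]
            for i, value in enumerate(abundances):
                totals[i] += value
    return dict(pathway_table)


# Hypothetical mapping and counts, for illustration only.
ko_table = {"K00001": [10.0, 0.0], "K00002": [2.5, 7.5], "K99999": [1.0, 1.0]}
ko_to_pathways = {"K00001": ["ko00010"], "K00002": ["ko00010", "ko00020"]}
pathways = ko_to_pathway_abundance(ko_table, ko_to_pathways)
print(pathways)
```

Because one KO can sit in several pathways, the pathway totals in this sketch do not sum to the KO totals, which is worth keeping in mind when normalizing the converted table.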
2.3 Advanced DA methods
DA analysis plays a major role in the downstream analysis of PICRUSt2 output. pathway_daa() integrates almost all DA methods applicable to predicted functional profiles, with the exception of ANCOM (Mandal et al. 2015) and ANCOM-BC (Lin and Peddada 2020). It includes ALDEx2 (Fernandes et al. 2013), DESeq2 (Love et al. 2014), MaAsLin2 (Mallick et al. 2021), LinDA (Zhou et al. 2022), edgeR (Robinson et al. 2010), limma voom (Ritchie et al. 2015), metagenomeSeq (Paulson et al. 2013), and lefser (Segata et al. 2011), which have demonstrated varying degrees of success in distinct benchmarking assessments (Calgaro et al. 2020, Nearing et al. 2022, Yang and Chen 2022). pathway_daa() provides a convenient way to run these methods and compare the results. compare_daa_results() can be used to assess how consistent the statistically significant features are across methods. It creates a report showing the number of common and different features identified by each method, along with the features themselves. This helps researchers choose the method most suitable for their dataset and research question.
Supplementary Table S1 provides a brief comparison and description of the DA methods to help researchers select the one best suited to the characteristics of their dataset and their research question.
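The kind of cross-method comparison described above can be illustrated with a short Python sketch. It is not the implementation of compare_daa_results(); the function name, the method labels, the adjusted P-values, and the 0.05 threshold are all hypothetical.

```python
from collections import Counter


def compare_significant_features(results, alpha=0.05):
    """Compare significant features across DA methods.

    results: dict method name -> dict feature ID -> adjusted P-value
    Returns (significant, shared, unique):
      significant: method -> set of features with adjusted P < alpha
      shared:      features significant under every method
      unique:      method -> features significant under that method only
    """
    significant = {
        method: {feature for feature, p in pvals.items() if p < alpha}
        for method, pvals in results.items()
    }
    votes = Counter(f for features in significant.values() for f in features)
    shared = {f for f, n in votes.items() if n == len(results)}
    unique = {
        method: {f for f in features if votes[f] == 1}
        for method, features in significant.items()
    }
    return significant, shared, unique


# Made-up adjusted P-values from three hypothetical methods.
results = {
    "method_A": {"ko00010": 0.01, "ko00020": 0.20, "ko00030": 0.03},
    "method_B": {"ko00010": 0.02, "ko00020": 0.04, "ko00030": 0.30},
    "method_C": {"ko00010": 0.001, "ko00020": 0.50, "ko00030": 0.04},
}
significant, shared, unique = compare_significant_features(results)
print("shared by all methods:", sorted(shared))
print("found by one method only:", {m: sorted(f) for m, f in unique.items()})
```

Features in the shared set are the natural starting point for a consensus interpretation of the kind recommended by Nearing et al. (2022).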
2.4 Annotation of KO, EC, and pathway
pathway_annotation() adds descriptions to KO, EC, and MetaCyc pathway identifiers from an annotation table. The KO database is a database of molecular functions represented in terms of functional orthologs. The EC number is a numerical classification scheme for enzymes, based on the chemical reactions they catalyze. MetaCyc pathways describe the biochemical transformations of small molecules within an organism. The function can also query the online KEGG database to annotate KEGG pathways with pathway_name, pathway_description, pathway_class, and pathway_map. It can be used to annotate either the output file of PICRUSt2 or the output table of pathway_daa().
2.5 Visualization
The mainstream visualizations for PICRUSt2 output are bar plots, error bar plots, PCA plots, and heatmaps. pathway_errorbar() shows the relative abundance difference between groups together with the log2 fold change and P-values derived from DA results. pathway_pca() shows the differences between groups after dimensionality reduction via principal component analysis. pathway_heatmap() visualizes patterns in PICRUSt2 output data, which can be useful for identifying trends or highlighting areas of interest.
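The two quantities summarized in an error bar plot, relative abundance and log2 fold change, can be sketched in Python as follows. This is an illustration under stated assumptions, not the computation used by pathway_errorbar(): relative abundance is taken as each feature's share of its sample total, the fold change is a ratio of group means, and the pseudocount, function names, group labels, and values are hypothetical.

```python
import math


def relative_abundance(table):
    """Scale each sample (column) so that its features sum to 1."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(values[i] for values in table.values()) for i in range(n_samples)]
    return {
        feature: [v / t if t else 0.0 for v, t in zip(values, totals)]
        for feature, values in table.items()
    }


def log2_fold_change(values, groups, numerator, denominator, pseudocount=1e-6):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    def mean_of(label):
        picked = [v for v, g in zip(values, groups) if g == label]
        return sum(picked) / len(picked)

    return math.log2(
        (mean_of(numerator) + pseudocount) / (mean_of(denominator) + pseudocount)
    )


# Made-up abundances for two pathways across four samples.
table = {"p1": [1.0, 1.0, 3.0, 3.0], "p2": [3.0, 3.0, 1.0, 1.0]}
groups = ["control", "control", "treated", "treated"]

rel = relative_abundance(table)
lfc = log2_fold_change(rel["p1"], groups, "treated", "control")
print(rel["p1"])
print(round(lfc, 3))
```

A positive value means the feature is more abundant in the numerator group; the sign convention therefore depends on which group is chosen as the reference.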
2.6 Integration
ggpicrust2() is the integration function that combines ko2kegg_abundance(), pathway_daa(), pathway_annotation(), and pathway_errorbar(). It is designed to carry users who are new to the field through the entire analysis, while professional analysts can call the individual functions in a modular fashion for greater customization and control. To further support users, we provide a detailed user manual as Supplementary Materials. It includes step-by-step installation instructions, explanations of the main features, and guidance on how to use the ggpicrust2 package effectively in academic research. Our aim is to ensure that both novice and experienced researchers can easily access and benefit from the package’s advanced functionalities.
2.7 Application
After using PICRUSt2 to predict functional profiles from microbiome data from C9orf72 loss-of-function mice, including those that underwent fecal transplantation and those that did not (Burberry et al. 2020), we analyzed the data with ggpicrust2 using LinDA. This approach identified KEGG pathways that differed significantly between the prosurvival and proinflammatory environments across both groups of mice. Of particular interest were the pathways ko05016, which is primarily involved in the pathogenesis of Huntington’s disease, and ko05012, known for its association with Parkinson’s disease. Both pathways are linked to human neurodegenerative disorders. The DA results were annotated and visualized for subsequent analysis. The resulting figure, which provides insights into the involvement of these pathways in the studied conditions, is shown in Fig. 1.
2.8 Comparison of metagenome results
compare_metagenome_results() analyzes and compares the functional predictions from different methods and sequencing metagenomes. It accepts a list of metagenome count matrices, their corresponding sample names, a method for DA, a method for P-value adjustment, and a reference group level for DA. The function concatenates all metagenome count matrices, creates new sample metadata, performs DA, and calculates Spearman correlation coefficients and corresponding P-values between each pair of metagenomes.
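The correlation step can be illustrated with a self-contained Python sketch of the Spearman coefficient, computed as the Pearson correlation of average ranks. It is not the implementation of compare_metagenome_results(), it omits the P-value, and the function names and abundance values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up abundances of the same four features in two metagenomes.
metagenome_a = [10.0, 20.0, 30.0, 40.0]
metagenome_b = [1.0, 4.0, 9.0, 16.0]
rho = spearman(metagenome_a, metagenome_b)
print(rho)
```

Because the coefficient depends only on ranks, two profiles on very different scales can still correlate perfectly, which suits comparisons between predicted and sequenced metagenomes.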
3 Conclusion
ggpicrust2, available from CRAN and at https://github.com/cafferychen777/ggpicrust2, is an R package developed specifically for advanced DA analysis of PICRUSt2-predicted functional profiles and visualization of the DA results. The package addresses the limitations of existing tools in terms of methods and visualization, and its combination of an integrated function and modular components serves both beginners and professionals. By streamlining the analysis and visualization of DA results, ggpicrust2 has the potential to enhance the quality and efficiency of research involving functional profile predictions. ggpicrust2 has already been incorporated into the PICRUSt2 wiki documentation, reflecting its growing recognition and adoption within the research community.
Acknowledgements
We want to acknowledge Sonja Schaufelberger at the University of Gothenburg for the feedback and suggestions regarding the ggpicrust2 package. Her insights have significantly contributed to the improvement and development of our tool, ensuring that it remains both versatile and useful for researchers in the scientific community.
Contributor Information
Chen Yang, Department of Biostatistics, Southern Medical University, Guangzhou 510515, China.
Jiahao Mai, Department of Biostatistics, Southern Medical University, Guangzhou 510515, China.
Xuan Cao, Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH 45221, United States.
Aaron Burberry, Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
Fabio Cominelli, Department of Pathology, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States; Case Digestive Health Research Institute, Case Western Reserve University, Cleveland, OH 44106, United States.
Liangliang Zhang, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH 44106, United States; Case Comprehensive Cancer Center, Cleveland, OH 44106, United States.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work has been supported by the National Institutes of Health grants 5P30AG072959-02 and 3R01DK042191-30S1. Liangliang Zhang's startup funding (BGT630267) covered the publication fee.
Data availability
The example dataset can be accessed via the following link: https://github.com/cafferychen777/ggpicrust2_paper/tree/main/Dataset. The example dataset provided here is derived from a published study in Nature by Burberry A. et al. (2020). The study focuses on the identification of bacterial communities associated with pro-inflammatory or pro-survival outcomes in a model of Amyotrophic lateral sclerosis (ALS) and Frontotemporal dementia (FTD) with autoimmunity and systemic and neural inflammation features. The dataset comprises 16S rRNA sequencing profiles of bacterial compositions obtained from fecal samples of C9orf72 loss of function mice. As described in the Nature paper, the raw sequences can be accessed through the GEO repository via the following link: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147325.
References
- Burberry A, Wells MF, Limone F et al. C9orf72 suppresses systemic and neural inflammation induced by gut bacteria. Nature 2020;582:89–94.
- Calgaro M, Romualdi C, Waldron L et al. Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data. Genome Biol 2020;21:191.
- Cao Y, Dong Q, Wang D et al. microbiomeMarker: an R/Bioconductor package for microbiome marker identification and visualization. Bioinformatics 2022;38:4027–9.
- Chong J, Liu P, Zhou G et al. Using MicrobiomeAnalyst for comprehensive statistical, functional, and meta-analysis of microbiome data. Nat Protoc 2020;15:799–821.
- Douglas GM, Maffei VJ, Zaneveld JR et al. PICRUSt2 for prediction of metagenome functions. Nat Biotechnol 2020;38:685–8.
- Fernandes AD, Macklaim JM, Linn TG et al. ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq. PLoS One 2013;8:e67019.
- Fisher RA. Statistical methods for research workers. In: Kotz S, Johnson NL (eds), Breakthroughs in Statistics. Springer Series in Statistics. New York: Springer, 1992, 66–70.
- Kanehisa M, Furumichi M, Sato Y et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 2023;51(D1):D587–92.
- Kaul A, Mandal S, Davidov O et al. Analysis of microbiome data in the presence of excess zeros. Front Microbiol 2017;8:2114.
- Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Amer Stat Assoc 1952;47:583–621.
- Langille MGI, Zaneveld J, Caporaso JG et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol 2013;31:814–21.
- Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nat Commun 2020;11:3514.
- Love MI, Huber W, Anders S et al. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550.
- Mallick H, Rahnavard A, McIver LJ et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol 2021;17:e1009442.
- Mandal S, Van Treuren W, White RA et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb Ecol Health Dis 2015;26(1):27663.
- Mongad DS, Chavan NS, Narwade NP et al. MicFunPred: a conserved approach to predict functional profiles from 16S rRNA gene sequence data. Genomics 2021;113:3635–43.
- Nearing JT, Douglas GM, Hayes MG et al. Microbiome differential abundance methods produce different results across 38 datasets. Nat Commun 2022;13:342.
- Parks DH, Tyson GW, Hugenholtz P et al. STAMP: statistical analysis of taxonomic and functional profiles. Bioinformatics 2014;30:3123–4.
- Paulson JN, Stine OC, Bravo HC et al. Differential abundance analysis for microbial marker-gene surveys. Nat Methods 2013;10:1200–2.
- Reeder J, Huang M, Kaminker JS et al. MicrobiomeExplorer: an R package for the analysis and visualization of microbial communities. Bioinformatics 2021;37:1317–8.
- Ritchie ME, Phipson B, Wu D et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43:e47.
- Robinson MD, McCarthy DJ, Smyth GK et al. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26:139–40.
- Segata N, Izard J, Waldron L et al. Metagenomic biomarker discovery and explanation. Genome Biol 2011;12:R60.
- Student. The probable error of a mean. Biometrika 1908;6:1.
- Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika 1938;29:350–62.
- Wemheuer F, Taylor JA, Daniel R et al. Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences. Environ Microbiome 2020;15:11.
- White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 1980;48:817.
- Yang L, Chen J. A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions. Microbiome 2022;10:130.
- Zhou H, He K, Chen J et al. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol 2022;23:95.