Abstract
The rise in the availability of bacterial genomes defines a need for synthesis: abstracting from individual taxa, to see larger patterns of bacterial lifestyles across systems. A key concept for such synthesis in ecology is the niche, the set of capabilities that enables a population’s persistence and defines its impact on the environment. The set of possible niches forms the niche space, a conceptual space delineating ways in which persistence in a system is possible. Here we use manifold learning to map the space of metabolic networks representing thousands of bacterial genera. The results suggest a metabolic niche space comprising a collection of discrete clusters and branching manifolds, which constitute strategies spanning life in different habitats and hosts. We further demonstrate that communities from similar ecosystem types map to characteristic regions of this functional coordinate system, permitting coarse-graining of microbiomes in terms of ecological niches that may be filled.
Subject terms: Microbial ecology, Microbial communities, Community ecology, Ecological modelling
The ecological niche of a given microbe is difficult to define, but can be approximated from the range of biochemical reactions encoded by its genome. Here the authors analyze these genomic data using manifold learning, generating a diffusion map of the metabolic niche space of over 2500 bacterial genera.
Introduction
It has been pointed out that a key to understanding the rules of life in ecological communities is to understand the structure of the niche space, the sets of ecological strategies that enable populations to grow and reproduce in an ecosystem1–6. Conceptual theories envision the niche space as an n-dimensional geometrical shape1,7 where each dimension is spanned by variables representing often nonlinear combinations of salient traits or environmental features8–11. Empirical characterizations of the niche space have so far focused on individual groups of macrobiotic species, where different data analysis methods have been used to organize sets of functional traits that associate with major ecological roles in a system11,12; included are lizards5, beetles13,14, neotropical fish6, and terrestrial vascular plants10,15.
Bacteria are an attractive target for examining niche-based theories in ecology16–20 as many of the relevant traits, such as the ability to metabolize certain substrates or synthesize molecules that mediate ecological interactions, are biochemical in nature21,22. Hence they can be inferred from genomes, providing plentiful data to map the niche space on a grander scale. To operationalize the bacterial niche space we say that the sets of biochemical reactions encoded by genomes represent feasible metabolic strategies of extant microorganisms5,23,24. Together the strategies span a metabolic niche space1: the space of metabolic capabilities that populations may deploy to survive.
Ecological niches are thought to comprise complex nonlinear functions of multiple traits5,10,11,25. A central challenge in modeling the niche is thus to identify composite traits that map to interpretable ecological roles, or the ‘soft properties’26 that summarize organisms’ functional capabilities. A powerful analysis method for meeting this challenge is offered by diffusion maps27,28. This mathematically simple manifold learning method exploits the relationship between diffusion processes and geometric structures29–31 to define a new coordinate system for a dataset, where the axes, or variables, are nonlinear composites of its major features. The mathematical procedure does not provide an interpretation of these variables; however, our analyses show that they correspond to meaningful metabolic strategies. This offers a potential bridge between ecological niche theories and data that are readily accessible from bacterial genomes.
Here we use the diffusion map to construct and analyze a functional coordinate system that spans the bacterial metabolic niche space. As a compact prediction of metabolism, we generate genome-scale metabolic networks22,32 for representative species from all unique bacterial genera in the NCBI RefSeq33 release 92 database (N = 2621 genera). We map each representative network to a point in a 7769-dimensional discrete space, where axes indicate the presence or absence of predicted metabolic ‘traits’ given by unique chemical substrate–product pairs (i.e., directed edges) in the collection of networks. Although a complete picture of bacterial metabolism from genomic data is not yet possible, this array captures the major biochemical capabilities34 for a large fraction of known bacterial genera, and serves as input to the diffusion map algorithm. Our results indicate that manifold learning methods can delineate the salient geometric features27,28,35 of an ecological niche space, and that these structures mark potential strategies for survival under particular abiotic or biotic conditions. Subsequently, we demonstrate that bacterial communities from similar ecosystems occupy characteristic regions of the diffusion map, and that this provides a quantitative framework for defining potentially occupied ecological niches across complex microbial systems.
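The mapping from networks to a discrete presence-absence space can be illustrated with a minimal sketch. It assumes each network is available as a set of directed (substrate, product) pairs; the genus labels, metabolite identifiers, and the function name `build_trait_matrix` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy networks: each genus maps to a set of directed
# (substrate, product) edges. Names are illustrative, not from the study.
networks = {
    "GenusA": {("glc", "g6p"), ("g6p", "f6p")},
    "GenusB": {("glc", "g6p"), ("rubp", "3pg")},
    "GenusC": {("g6p", "f6p")},
}

def build_trait_matrix(networks):
    """Return (genera, traits, matrix) where matrix[i][j] is 1 if
    genus i contains directed edge j, and 0 otherwise."""
    genera = sorted(networks)
    # One axis per unique substrate-product pair across all networks.
    traits = sorted(set().union(*networks.values()))
    column = {edge: j for j, edge in enumerate(traits)}
    matrix = [[0] * len(traits) for _ in genera]
    for i, genus in enumerate(genera):
        for edge in networks[genus]:
            matrix[i][column[edge]] = 1
    return genera, traits, matrix

genera, traits, matrix = build_trait_matrix(networks)
```

In the study this array would have 2621 rows and 7769 columns; the toy version has three of each.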
Results
The diffusion map finds new variables that reflect nonlinear combinations of metabolic capabilities and returns them in the order of their importance (see Methods)27,28,35. Each variable assigns coordinate entries to the genomes that can then be used to order genera, from the most negative to the most positive entries, along curves that span the niche space. Dimensions in diffusion space can then be interpreted by analyzing the strategies of taxa near the extrema of the orderings26,36, corresponding to large positive or negative (i.e., far from zero) variable entries.
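As a rough illustration of this ordering step, the sketch below ranks genera by their entries in one diffusion variable and returns the taxa at both extrema. The function name `extremal_genera`, the toy labels, and the toy entries are assumptions made for the example; the default of 10 taxa per extremum follows the Methods.

```python
def extremal_genera(genera, entries, n=10):
    """Order genera by their entries in one diffusion variable and
    return (most_negative, most_positive), each with up to n genera.
    The most extreme genus comes first in both lists."""
    if n < 1:
        raise ValueError("n must be at least 1")
    order = sorted(range(len(genera)), key=lambda i: entries[i])
    ranked = [genera[i] for i in order]
    return ranked[:n], ranked[-n:][::-1]

# Hypothetical entries for five genera along a single variable.
toy_genera = ["a", "b", "c", "d", "e"]
toy_entries = [-0.9, 0.01, 0.5, -0.02, 0.0]
low, high = extremal_genera(toy_genera, toy_entries, n=2)
```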
Sharp differences delineate some metabolic strategies
The most important variable identified by the diffusion map, variable 1, separates the metabolic strategies of photosynthetic Cyanobacteria from those of all other taxa: the 108 cyanobacterial genomes in the dataset are assigned low values (i.e., negative numbers with large magnitudes) in variable 1, while all others have values that are close to zero (Fig. 1; Supplementary Fig. 1A). To confirm that this variable detects cyanobacterial photosynthetic activity, we identified metabolites that were over-represented in the metabolic networks of genera receiving far-from-zero entries in variable 1 (see Methods). This revealed an enrichment of 2-phosphoglycolate, which is involved in essential photorespiratory pathways in photosynthetic organisms37; ribulose-1,5-bisphosphate (RuBP), used for carbon fixation by RuBisCO during photosynthesis; cyanophycin, a unique nitrogen reserve polymer38; and sucrose 6-phosphate, the intermediate that is dephosphorylated in the final step of sucrose biosynthesis in Cyanobacteria39, confirming that the variable indicates the extent to which Cyanobacteria fix carbon through photosynthesis (Fig. 1; Supplementary Table 1).
The sharp differences in variable 1 show that this photosynthetic lifestyle is a discrete yes-or-no metabolic strategy where little middle ground exists. The diffusion map defines further variables that indicate such discrete clusters of unique capabilities (Fig. 1), so-called ‘localized’ variables40, including capabilities associated with acetic acid production41 (variable 18), carnitine use for stress tolerance among anaerobic animal associates42 (variable 21), and chemolithoautotrophic or sulfur-oxidation strategies deployed by Epsilonproteobacteria near marine sediments and sea vents (variable 22).
Contrasting the major strategies of host associates to life in soils and oceans
Some variables identified by the diffusion map analysis span a continuous spectrum of strategies, which align with major taxonomic classes. The most important of these are variables 2, 3, and 4, which contrast different putative metabolic strategies encoded by relatively large proportions of the analyzed genomes (Fig. 2; Supplementary Fig. 1B). For instance, variable 2 identifies major differences in predicted strategies among host-associated Gammaproteobacteria and soilborne Actinobacteria. Close relatives of pathogenic Enterobacter, Franconibacter, and Buttiauxella species43 score the lowest (i.e., most negative) values (Fig. 2a, b). Metabolic capabilities associated with these taxa include the synthesis of membrane phospholipid precursors common in Gammaproteobacteria like CDP-diacylglycerol44 and phosphatidylethanolamine, which may be involved in bacterial adhesion to host cells45; and the ability to metabolize uncommon sugars like L-lyxose46 (Supplementary Table 2). At the opposite end, we find primarily Gram-positive soil organisms from the Microbacteriaceae, Beutenbergiaceae, and Micrococcaceae47 (Fig. 2a, b). Among the most correlated capabilities for species near this extremum are the synthesis of decaprenyl diphosphate, a key component of cell wall biosynthesis in some taxa48; and compounds related to the synthesis of thiol and bimane derivatives, which can function as defenses against alkylating agents, oxygen stress, and antibiotics49 (Supplementary Table 3).
The Gammaproteobacteria genera that received the lowest entries in variable 2 also constituted the negative extremum of variable 3, and the positive extremum of variable 4 (compare Fig. 2a–d), suggesting that the bacterial metabolic niche space features a collection of low-dimensional manifolds that cross each other at branching points36. This branching point in particular illustrates a multiway contrast between a subset of the Gammaproteobacteria and at least 3 other taxonomic classes. At the positive end of variable 3, we find taxa representing mammal- and bird-associated Clostridia, Tissierellia, Erysipelotrichia, and Bacilli47. Characteristic metabolites of these genera include components of the Wood–Ljungdahl pathway50, enabling the use of hydrogen as an electron donor; and indole, a signaling molecule that has been shown to modulate host inflammation and interspecific competition in human gastrointestinal tracts51 (Supplementary Table 4). Our interpretation is that variable 3 identifies different potential strategies for colonizing and weathering stress or interspecific competition in animal hosts.
The species that score the lowest (i.e., most negative) values in variable 4 are epipelagic and marine Rhodobacterales, Rhizobiales, and Rhodospirillales that are capable of utilizing a broad spectrum of carbon sources52. Here the most significant metabolic reactions are all involved in the L-2-aminoadipate pathway of lysine synthesis53 and the production of L-pipecolic acid (Supplementary Table 5), pointing to a strategy for growth under high-salt conditions54. Our interpretation is that this variable traces a range of strategies spanning a generalist lifestyle in oceans to associations with terrestrial hosts.
Host-microbe interactions also feature in variables 8 and 10, which highlight endosymbionts and endoparasites with the smallest genomes in the dataset. The lowest values of variable 8 coincided with animal- and plant-associated Tenericutes47, as well as candidate genera like Tremblaya and Sulcia that associate with insect bacteriocytes55,56. The top 10 markers of taxa scoring low values in variable 8 include the predicted uptake22 of key amino acids such as L-histidine, L-arginine, L-isoleucine, L-valine, L-lysine, and L-leucine (Supplementary Table 6). The negative extremum of variable 10 features obligate endoparasites and close relatives of opportunistic pathogens, including putative animal- and arthropod-associates of the Pasteurellaceae, Erwiniaceae, Morganellaceae, and Rickettsiaceae47. Similarly to variable 8, metabolic network features that distinguished this group include the predicted uptake of L-histidine, L-arginine, L-threonine, L-isoleucine, L-glutamine, and L-lysine (Supplementary Table 7). Together, these variables indicate that one widespread strategy for life in close association with animal or plant cells is the use of essential and non-essential host-derived amino acids57.
Phylogenetic relatedness is a rough indicator of ecological similarity
The first several diffusion variables identify characteristic capabilities that discriminate between major taxonomic classes with many representative genera. To assess the overall relationship between metabolic similarity and phylogenetic relatedness we computed the correlation between pairwise inter-genome metabolic distances in diffusion space, and cophenetic distances on the phylogenetic tree (see ref. 30 for a detailed discussion of diffusion distances). Here a positive correlation suggests that closely related taxa deploy similar metabolic strategies on average.
The Pearson correlation between distance matrices was positive but exhibited a small coefficient (Fig. 3a; Mantel test, r = 0.273, P < 0.001), marking a weak association between predicted metabolic capabilities and phylogenetic relatedness. While it is not surprising that phylogenies contain information on the ecological roles of microorganisms58–60, a visualization of this relationship highlights a caveat: a large range of diffusion distances is observed for most cophenetic distances between genome pairs (Fig. 3a). This high degree of variance can be explained by the presence of diffusion variables that deviate from basic contrasts among major taxonomic groups (e.g., Fig. 2), including some that differentiate closely related taxa (Fig. 3b, Supplementary Fig. 1C), and those that show similar strategies among distantly related taxa (Fig. 3b), potentially reflecting metabolic niche convergence6 or horizontal gene transfer.
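For readers unfamiliar with the Mantel test, the following is a minimal sketch of a one-sided permutation version using the Pearson correlation between the upper triangles of two distance matrices. The function names, the number of permutations, the fixed seed, and the toy distance matrices are assumptions for illustration and are not taken from the study's code.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def upper_triangle(d, order):
    """Upper-triangle entries of d after relabeling rows/columns by order."""
    n = len(order)
    return [d[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(d1, d2, permutations=999, seed=0):
    """One-sided Mantel test: permute the labels of d2 and count how often
    the permuted correlation is at least as large as the observed one."""
    identity = list(range(len(d1)))
    x = upper_triangle(d1, identity)
    observed = pearson(x, upper_triangle(d2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(x, upper_triangle(d2, perm)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Toy example: the second matrix is a rescaled copy of the first.
positions = [0.0, 1.0, 3.0, 6.0, 10.0]
d_metabolic = [[abs(a - b) for b in positions] for a in positions]
d_phylo = [[2.0 * v for v in row] for row in d_metabolic]
r, p = mantel(d_metabolic, d_phylo)
```

Permuting row and column labels jointly, rather than shuffling individual entries, preserves the dependence among distances that share a genome.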
These examples demonstrate that diffusion variables provide dozens or possibly hundreds of meaningful coordinates that trace the space of bacterial metabolic strategies. Using a procedure proposed by Moon et al.36 we combined diffusion variables in a low-dimensional visualization of the strategy space (Fig. 3c; Supplementary Fig. 1). This embedding recapitulates the result that phylogenetic relatedness offers only a coarse marker of predicted functional similarity, corresponding to the appearance of representatives from multiple classes in close proximity to one another in the niche space.
It is important to interpret lower-dimensional embeddings of high-dimensional data with caution61. However, multiple observations point to some consistent geometric structures in the bacterial metabolic niche space. Included are the results of a 2-dimensional embedding of diffusion variables (Fig. 3c; Supplementary Fig. 1), the presence of localized variables (e.g., Fig. 1) and crossing points (e.g., Fig. 2) in the diffusion map, and the results of enrichment analyses (Supplementary Tables 1–7). Namely, they point to a metabolic niche space consisting of multiple quasi one-dimensional branches rising from a common core, punctuated by discrete clusters of taxa with unique capabilities. This geometry may represent a conceptual hybrid between Hutchinson’s original idea of the niche space as a continuous hypervolume1, and modern ideas which postulate that sets of functional traits separate into discrete ecological clusters5,6,12. We conjecture that the putative filamentous structure has implications for our understanding of bacterial evolution and ecological functioning. For instance, the underlying branching geometry naturally leads to a large amount of unoccupied metabolic niche space (Fig. 3c). Similar gaps in niche space have been observed in macrobiotic communities12, and could correspond to bacteria that have yet to be sampled, isolated, or sequenced. Alternatively, they could be a result of ‘forbidden’ metabolisms, i.e., combinations of capabilities that may be suboptimal or even pointless for life in Earth’s ecosystems.
Microbiomes map to characteristic regions of the metabolic niche space
Understanding the mapping from genomes to larger scale ecological strategies may prove useful for a variety of analyses16–18, such as quantifying the roles of organisms or designing substrates for culturing. Perhaps more importantly it provides an ecological frame of reference for coarse-graining complex bacterial communities. For a small scale demonstration of this point we created a simple mapping between a subset of community censuses from the Earth Microbiome Project (EMP)62 and our diffusion space.
First, for each selected bacterial community census in the EMP we matched all taxa (16S rRNA gene amplicon sequence variants) to the most closely related genome considered by our diffusion map analysis, and retained matches that exhibited at least a 97% sequence similarity (see Methods). We then determined whether EMP communities contained at least one taxon that mapped to any of the 10 genomes at either extremum (most positive or most negative entries) of any of the first 50 diffusion variables. As a result, each microbiome sample was characterized by the presence or absence of each of 100 extremal metabolic strategies (two extrema for each of the first 50 variables). These presence-absence data represent ecological characterizations for individual EMP communities. To summarize further we computed the proportions of communities from different ecosystem types that displayed the different extremal strategies, resulting in a bacterial metabolic fingerprint for each ecosystem type (Fig. 4). These fingerprints can be used to study systematic differences in the functional capabilities of typical community members across habitats. For instance, a simple hierarchical clustering analysis of metabolic fingerprints groups different ecosystem types meaningfully together based on the metabolic strategies of their constituents (Fig. 4). Visible are clear strategy sets that differentiate functional diversity in freshwater, soil, marine, and host-associated systems.
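The fingerprint calculation described above amounts to a per-ecosystem proportion of samples in which each strategy is present. A minimal sketch follows, assuming each sample has already been reduced to an ecosystem label and a set of occupied strategies; the function name, ecosystem labels, and strategy labels are hypothetical.

```python
from collections import defaultdict

def metabolic_fingerprints(samples):
    """samples: list of (ecosystem_type, set_of_occupied_strategies).
    Returns {ecosystem: {strategy: proportion of samples displaying it}}."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for ecosystem, occupied in samples:
        totals[ecosystem] += 1
        for strategy in occupied:
            counts[ecosystem][strategy] += 1
    # Use a common, sorted strategy list so fingerprints are comparable.
    strategies = sorted({s for _, occupied in samples for s in occupied})
    return {
        eco: {s: counts[eco][s] / totals[eco] for s in strategies}
        for eco in totals
    }

# Hypothetical samples; "v2+" denotes the positive extremum of variable 2.
toy_samples = [
    ("soil", {"v2+"}),
    ("soil", {"v2+", "v1-"}),
    ("marine", {"v4-"}),
    ("marine", {"v4-", "v1-"}),
]
fingerprints = metabolic_fingerprints(toy_samples)
```

Each fingerprint is a vector of proportions over the same strategies, so ecosystem types can be compared directly or passed to a clustering routine.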
Discussion
Here we showed that the shape of a trait space can be systematized through manifold learning27. The diffusion map of bacterial capabilities reveals a wealth of ecologically salient variables that span a functional coordinate system. Some show evidence of discrete capabilities such as photosynthesis (Fig. 1). Other strategies span a continuous space representing degrees of specialization or reliance on hosts (Fig. 2). Yet others highlight strategies for energy production or stress response, some of which differentiate closely related species (Fig. 3b, Supplementary Fig. 1C) or emerged, potentially through convergent evolution or gene transfer, in different branches of the tree of life (Fig. 3b).
The diffusion variables provide a physical method for organizing the genomic information that continues to emerge, in a way that reveals both larger scale geometries and finer details compared to alternative embedding methods27,36 (Supplementary Discussion; Supplementary Figs. 1–6). From the perspective of microbial systems, diffusion distances in trait space (e.g., Fig. 3a) provide a powerful proxy for ecological similarities that can complement insights from current phylogenetic methods60,63. Traits used to calculate diffusion distances need not be derived from metabolic reconstructions of whole genomes as in the present analysis, but could comprise functional information identified, for instance, through species-level profiling64 of metagenomic or metatranscriptomic shotgun sequencing data. From an ecological point of view the present analysis constitutes the most extensive mapping of a niche space geometry so far, and facilitates the application of quantitative ecological theories to data describing bacterial communities.
Our analysis focused largely on the bacteria’s capabilities to catalyze steps in primary metabolism. Even within the realm of primary metabolism the genes reveal only the set of theoretical capabilities encoded by genomes, conceptually analogous to the fundamental niche concept1 in ecology. Hence our analysis ignores uncharacterized parts of secondary metabolism, behavior, regulation, and trophic interactions. For any other group of organisms such a limited analysis would be mostly meaningless; however, due to the diversity of metabolic capabilities in bacteria it reveals a rich and complex functional coordinate system (Fig. 3c). As our understanding of genomic data advances, deeper insights into secondary metabolism are bound to become available, providing an even more detailed picture of the metabolic niche space. Moreover, we envision that with future transcriptomic data, manifold learning methods could also map the realized niche1 (the metabolic strategies that are deployed under a given set of conditions) bringing our understanding of ecology in complex microbial communities closer to the biochemical level.
Methods
Metabolic networks
Genomes were obtained from the National Center for Biotechnology Information (NCBI) RefSeq33 release 92 database (accessed on 2019 March 20). We first obtained the ‘representative’, ‘reference’, ‘complete’, ‘contig’, and ‘scaffold’ sets and reduced these to a set of genus-level representatives using the following sampling procedure. We first selected a random representative genome for each unique genus in the combined ‘representative’ and ‘reference’ sets. Novel genera in the remaining RefSeq categories that were not already represented in the ‘reference’ and ‘representative’ sets were then appended to the set in the same way, for a total of N = 2621 genomes. Metabolic models were constructed for the selected genome assemblies using the CarveMe reconstruction algorithm32, which starts with a universal bacterial metabolic model comprising known biochemical reactions in the BiGG Models65 database and generates genome-specific reaction sets by removing reactions that lack genomic support. Finally, metabolic models’ cytoplasmic compartments were retained and summarized as metabolic networks: directed graphs in which nodes are chemical metabolite compounds and directed edges link substrates to products22.
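The two-stage sampling of genus-level representatives can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the function name, the (genus, accession) record layout, the fixed seed, and the placeholder accession strings are all hypothetical.

```python
import random

def pick_representatives(preferred, remaining, seed=0):
    """preferred: (genus, accession) records from the combined
    'representative' and 'reference' sets; remaining: records from the
    other RefSeq categories. Returns {genus: one randomly chosen accession}."""
    rng = random.Random(seed)

    def one_per_genus(records, skip):
        by_genus = {}
        for genus, accession in records:
            if genus not in skip:
                by_genus.setdefault(genus, []).append(accession)
        # Sort before sampling so the result depends only on the seed.
        return {g: rng.choice(sorted(a)) for g, a in sorted(by_genus.items())}

    chosen = one_per_genus(preferred, skip=set())
    # Second pass: only genera that are not yet represented are appended.
    chosen.update(one_per_genus(remaining, skip=set(chosen)))
    return chosen

# Placeholder accessions, for illustration only.
toy_preferred = [("Escherichia", "acc1"), ("Escherichia", "acc2"),
                 ("Bacillus", "acc3")]
toy_remaining = [("Bacillus", "acc4"), ("Nostoc", "acc5")]
representatives = pick_representatives(toy_preferred, toy_remaining)
```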
Phylogenetic tree generation
Phylogenetic trees were used to visualize metabolic differences between taxa, and were constructed using the GToTree pipeline66 with the “universal” protein set defined by Hug et al.67. GToTree identifies target genes with HMMER368, aligns them with MUSCLE69, and trims alignments with trimAl70. Trees were generated from the aligned and concatenated gene sets using FastTree71, and visualized using iToL72.
Diffusion map procedure
Diffusion mapping27,28 was performed using the algorithm described by Barter & Gross26. Briefly, the method involves (i) calculating a matrix describing Euclidean similarities among the k-nearest neighbors for samples in a dataset, (ii) interpreting this as a weighted adjacency matrix, and (iii) computing the corresponding row-normalized Laplacian matrix. The eigenvectors of the Laplacian represent new diffusion variables describing important variation in the dataset26. The importance of each eigenvector is indicated by the corresponding Laplacian eigenvalue27,30, which captures the characteristic time scale of diffusive modes over the data in that dimension35. The first (i.e., most important) variable is given by the eigenvector corresponding to the smallest non-zero eigenvalue, the second by that of the second smallest non-zero eigenvalue, and so on. This variant is nearly parameter-free, with only a single choice for the value of k. Here, we consider k = 10, although the results presented above were insensitive to the choice of k.
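Steps (i) to (iii) can be sketched with NumPy as below. Several choices here are assumptions rather than details confirmed by the text: the similarity is taken as 1/(1 + Euclidean distance), the k-nearest-neighbor graph is symmetrized by keeping an edge if either endpoint selects it, and the eigenvectors of the row-normalized Laplacian are obtained through the equivalent symmetric normalized Laplacian for numerical convenience. The function name `diffusion_map` and the toy points are hypothetical.

```python
import numpy as np

def diffusion_map(X, k=10, n_variables=5):
    """Return (eigenvalues, variables) for the first n_variables
    non-trivial eigenvectors of the row-normalized graph Laplacian."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # (i) Pairwise Euclidean distances, converted to similarities.
    sq = np.sum(X ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    sim = 1.0 / (1.0 + dist)          # assumed similarity function
    np.fill_diagonal(sim, 0.0)
    # Keep only the k largest similarities in each row.
    nearest = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), nearest.shape[1])
    A = np.zeros_like(sim)
    A[rows, nearest.ravel()] = sim[rows, nearest.ravel()]
    # (ii) Symmetric weighted adjacency matrix (assumed symmetrization).
    A = np.maximum(A, A.T)
    # (iii) L = I - D^-1 A shares eigenvalues with the symmetric form
    # I - D^-1/2 A D^-1/2; its eigenvectors are D^-1/2 times those of
    # the symmetric form.
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L_sym = np.eye(n) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    eigenvalues, U = np.linalg.eigh(L_sym)   # ascending eigenvalues
    V = inv_sqrt[:, None] * U
    # Skip index 0, the trivial constant mode (assumes a connected graph).
    keep = slice(1, 1 + n_variables)
    return eigenvalues[keep], V[:, keep]

# Toy data: two well-separated groups of three points.
points = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
eigenvalues, variables = diffusion_map(points, k=3, n_variables=2)
```

In this toy case the first variable assigns entries of opposite sign to the two groups, which is the behavior exploited when interpreting variable extrema.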
Identifying associated metabolites
We sought to identify metabolites that were over-represented in the metabolic networks of taxa that were themselves assigned extreme entries along diffusion map variables. This was accomplished using a permutational variant of gene set enrichment analysis (GSEA)73. Genome rankings were defined by the orderings specified by each diffusion variable. Enrichment analyses were performed on the ranked sets using the fgsea library in R74, with a Benjamini–Hochberg–adjusted75 P value < 0.05 used as the threshold for retaining metabolites associated with taxa that map to variable extrema.
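The multiple-testing step can be illustrated independently of the enrichment software. The sketch below implements the standard Benjamini–Hochberg adjustment and applies the 0.05 threshold; it does not reproduce fgsea, and the metabolite labels and raw P values are invented placeholders.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Placeholder metabolites and raw P values, for illustration only.
raw = {"met_a": 0.01, "met_b": 0.04, "met_c": 0.03, "met_d": 0.20}
names = list(raw)
adjusted = benjamini_hochberg([raw[name] for name in names])
retained = [name for name, q in zip(names, adjusted) if q < 0.05]
```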
Mapping environmental samples to diffusion space
We obtained the ‘emp_deblur_150bp.subset_2k.rare_5000’ dataset, describing a subset of the environmental 16S rRNA gene sequences from the Earth Microbiome Project62, EMP, accessed via ftp://ftp.microbio.me/emp/. Communities from the EMP were mapped to diffusion space using the following procedure: First, we generated a BLAST76 reference database of predicted 16S rRNA gene sequences for our set of RefSeq genomes using barrnap (https://github.com/tseemann/barrnap) to identify and retain the first instance of this ribosomal gene. The DECIPHER library77 in R was used to align sequences. We then conducted a BLAST sequence similarity search to match denoised sequence variants present in each EMP sample to the custom BLAST database and retained the top hits. Niches—operationally defined as the strategies describing the 10 taxa with the highest (positive) and lowest (negative) entries along each diffusion variable—were said to be occupied by taxa in an EMP community census if at least one detected sequence variant exhibited a 97% or greater rRNA gene sequence similarity to any of the extremal genomes. The results of this procedure were summarized as plots of the proportion of samples within each EMP ‘env_feature’ category satisfying this criterion. Hierarchical clustering of similar ecosystem types was accomplished using the Ward78 linkage method.
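The occupancy criterion can be expressed compactly. The sketch below assumes the similarity search has already been reduced to one top hit per sequence variant, given as a (genome identifier, percent identity) pair; the function name, niche labels, and genome identifiers are hypothetical.

```python
def occupied_niches(top_hits, extremal_genomes, threshold=97.0):
    """top_hits: (genome_id, percent_identity) for each sequence variant
    in one sample. extremal_genomes: {niche_label: set of genome ids}.
    A niche is occupied if any variant matches one of its extremal
    genomes at or above the similarity threshold."""
    matched = {g for g, identity in top_hits if identity >= threshold}
    return {
        niche: bool(matched & genomes)
        for niche, genomes in extremal_genomes.items()
    }

# Hypothetical niches and hits for a single community sample.
toy_niches = {"v1-": {"g1", "g2"}, "v2+": {"g3"}, "v2-": {"g4"}}
toy_hits = [("g2", 99.3), ("g3", 95.0), ("g9", 100.0)]
occupancy = occupied_niches(toy_hits, toy_niches)
```

Here only "v1-" is occupied: the hit to g3 falls below 97%, and g9 is not an extremal genome.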
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
We thank Jonathan A. Eisen and James P. O’Dwyer for comments and discussions. A.K.F. was supported by a Research Associateship Program fellowship from the National Research Council.
Author contributions
A.K.F. and T.G. conceptualized the study, wrote the manuscript, and contributed analyses. A.K.F. contributed computer code.
Data availability
Genome accession numbers are available at 10.6084/m9.figshare.12864011.v4.
Code availability
R scripts and sample data are available at 10.6084/m9.figshare.12864011.v4.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary information is available for this paper at 10.1038/s41467-020-18695-z.
References
- 1.Hutchinson GE. Cold Spring Harbor symposium on quantitative biology. Concluding Remarks. 1957;22:415–427. [Google Scholar]
- 2.MacArthur, R. H. In Challenging Biological Problems: Directions Toward Their Solution (ed. Behnke, J. A.) pp. 253–259 (Oxford University Press, 1972).
- 3.Chase, J. M. & Leibold, M. A. Ecological Niches: Linking Classical and Contemporary Approaches (University of Chicago Press, 2003).
- 4.Holt RD. Bringing the Hutchinsonian niche into the 21st century: ecological and evolutionary perspectives. Proc. Natl Acad. Sci. USA. 2009;106:19659–19665. doi: 10.1073/pnas.0905137106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Winemiller KO, Fitzgerald DB, Bower LM, Pianka ER. Functional traits, convergent evolution, and periodic tables of niches. Ecol. Lett. 2015;18:737–751. doi: 10.1111/ele.12462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Pianka ER, Vitt LJ, Pelegrin N, Fitzgerald DB, Winemiller KO. Toward a periodic table of niches, or exploring the lizard niche hypervolume. Am. Naturalist. 2017;190:601–616. doi: 10.1086/693781. [DOI] [PubMed] [Google Scholar]
- 7.Blonder B, Lamanna C, Violle C, Enquist BJ. The n-dimensional hypervolume. Glob. Ecol. Biogeogr. 2014;23:595–609. doi: 10.1111/geb.12146. [DOI] [Google Scholar]
- 8.Hoogenboom MO, Connolly SR. Defining fundamental niche dimensions of corals: synergistic effects of colony size, light, and flow. Ecology. 2009;90:767–780. doi: 10.1890/07-2010.1. [DOI] [PubMed] [Google Scholar]
- 9.Porter WP, Kearney M. Size, shape, and the thermal niche of endotherms. Proc. Natl Acad. Sci. USA. 2009;106:19666–19672. doi: 10.1073/pnas.0907321106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kraft NJB, Godoy O, Levine JM. Plant functional traits and the multidimensional nature of species coexistence. Proc. Natl Acad. Sci. USA. 2015;112:797–802. doi: 10.1073/pnas.1413650112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Benjamin B. Hypervolume concepts in niche-and trait-based ecology. Ecography. 2018;41:1441–1455. doi: 10.1111/ecog.03187. [DOI] [Google Scholar]
- 12.González AL, Dézerald O, Marquet PA, Romero GQ, Srivastava DS. The multidimensional stoichiometric niche. Front. Ecol. Evol. 2017;5:110. doi: 10.3389/fevo.2017.00110. [DOI] [Google Scholar]
- 13.Stevenson, B. G. The Hutchinsonian niche: multivariate statistical analysis of dung beetle niches. Coleopter. Bull. 36, 246–249 (1982).
- 14.Inward DJG, Davies RG, Pergande C, Denham AJ, Vogler AP. Local and regional ecological morphology of dung beetle assemblages across four biogeographic regions. J. Biogeogr. 2011;38:1668–1682. doi: 10.1111/j.1365-2699.2011.02509.x. [DOI] [Google Scholar]
- 15.Díaz S, et al. The global spectrum of plant form and function. Nature. 2016;529:167–171. doi: 10.1038/nature16489. [DOI] [PubMed] [Google Scholar]
- 16.Green JL, Bohannan BJM, Whitaker RJ. Microbial biogeography: from taxonomy to traits. science. 2008;320:1039–1043. doi: 10.1126/science.1153475. [DOI] [PubMed] [Google Scholar]
- 17.Noah F, Bradford MA, Jackson RB. Toward an ecological classification of soil bacteria. Ecology. 2007;88:1354–1364. doi: 10.1890/05-1839. [DOI] [PubMed] [Google Scholar]
- 18.Claire Horner-Devine M, Bohannan BJM. Phylogenetic clustering and overdispersion in bacterial communities. Ecology. 2006;87:S100–S108. doi: 10.1890/0012-9658(2006)87[100:PCAOIB]2.0.CO;2. [DOI] [PubMed] [Google Scholar]
- 19.Lennon JT, Aanderud ZT, Lehmkuhl BK, Schoolmaster Jr DR. Mapping the niche space of soil microorganisms using taxonomy and traits. Ecology. 2012;93:1867–1879. doi: 10.1890/11-1745.1. [DOI] [PubMed] [Google Scholar]
- 20.Fisher, C. K., Thierry, M. & Walczak, A. M. Variable habitat conditions drive species covariation in the human microbiota. PLoS Comput. Biol. 13, e1005435 (2017). [DOI] [PMC free article] [PubMed]
- 21.Prosser JI, et al. The role of ecological theory in microbial ecology. Nat. Rev. Microbiol. 2007;5:384–392. doi: 10.1038/nrmicro1643. [DOI] [PubMed] [Google Scholar]
- 22.Elhanan B, Martin K, Feldman MW, Ruppin E. Large-scale reconstruction and phylogenetic analysis of metabolic environments. Proc. Natl Acad. Sci. USA. 2008;105:14482–14487. doi: 10.1073/pnas.0806162105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Humphries MM, McCann KS. Metabolic ecology. J. Anim. Ecol. 2014;83:7–19. doi: 10.1111/1365-2656.12124. [DOI] [PubMed] [Google Scholar]
- 24.Chase, J. M. In The theory of ecology (eds Scheiner, S. M. and Willig, M. R.) pp. 93–107 (2011).
- 25.D’Andrea R, Ostling A. Challenges in linking trait patterns to niche differentiation. Oikos. 2016;125:1369–1385. doi: 10.1111/oik.02979.
- 26.Barter E, Gross T. Manifold cities: Social variables of urban areas in the UK. Proc. R. Soc. A. 2019;475:20180615. doi: 10.1098/rspa.2018.0615.
- 27.Coifman RR, et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proc. Natl Acad. Sci. USA. 2005;102:7426–7431. doi: 10.1073/pnas.0500334102.
- 28.Coifman RR, Lafon S. Diffusion maps. Appl. Comput. Harmonic Anal. 2006;21:5–30. doi: 10.1016/j.acha.2006.04.006.
- 29.Kac M. Can one hear the shape of a drum? Am. Math. Monthly. 1966;73:1–23. doi: 10.1080/00029890.1966.11970915.
- 30.Nadler, B., Lafon, S., Kevrekidis, I. & Coifman, R. R. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information Processing Systems 955–962 (2006).
- 31.Jones PW, Maggioni M, Schul R. Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proc. Natl Acad. Sci. USA. 2008;105:1803–1808. doi: 10.1073/pnas.0710175104.
- 32.Machado D, Andrejev S, Tramontano M, Patil KR. Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res. 2018;46:7542–7553. doi: 10.1093/nar/gky157.
- 33.Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2006;35:D61–D65. doi: 10.1093/nar/gkl842.
- 34.Mendes-Soares H, Mundy M, Soares LM, Chia N. MMinte: an application for predicting metabolic interactions among the microbial species in a community. BMC Bioinforma. 2016;17:343. doi: 10.1186/s12859-016-1230-3.
- 35.Nadler, B., Lafon, S., Coifman, R. & Kevrekidis, I. G. In Principal Manifolds For Data Visualization and Dimension Reduction pp. 238–260 (Springer, 2008).
- 36.Moon KR, et al. Visualizing structure and transitions in high-dimensional biological data. Nat. Biotechnol. 2019;37:1482–1492. doi: 10.1038/s41587-019-0336-3.
- 37.Eisenhut M, et al. The photorespiratory glycolate metabolism is essential for cyanobacteria and might have been conveyed endosymbiontically to plants. Proc. Natl Acad. Sci. USA. 2008;105:17199–17204. doi: 10.1073/pnas.0807043105.
- 38.Watzer B, Forchhammer K. Cyanophycin synthesis optimizes nitrogen utilization in the unicellular cyanobacterium Synechocystis sp. strain PCC 6803. Appl. Environ. Microbiol. 2018;84:e01298–18. doi: 10.1128/AEM.01298-18.
- 39.Fieulaine S, Lunn JE, Borel F, Ferrer J-L. The structure of a cyanobacterial sucrose-phosphatase reveals the sugar tongs that release free sucrose in the cell. Plant Cell. 2005;17:2049–2058. doi: 10.1105/tpc.105.031229.
- 40.Nyberg A, Gross T, Bassler KE. Mesoscopic structures and the Laplacian spectra of random geometric graphs. J. Complex Netw. 2015;3:543–551. doi: 10.1093/comnet/cnv004.
- 41.Komagata, K., Iino, T., Yamada, Y. The Family Acetobacteraceae. In The Prokaryotes (eds Rosenberg, E., DeLong, E. F., Lory, S., Stackebrandt, E., Thompson, F.) pp. 3–78 (Springer, Berlin, Heidelberg, 2014).
- 42.Meadows JA, Wargo MJ. Carnitine in bacterial physiology and metabolism. Microbiology. 2015;161:1161. doi: 10.1099/mic.0.000080.
- 43.Kämpfer P, Meurer S, Müller HE. Characterization of Buttiauxella and Kluyvera species by analysis of whole cell fatty acid patterns. Syst. Appl. Microbiol. 1997;20:566–571. doi: 10.1016/S0723-2020(97)80028-8.
- 44.Parsons JB, Rock CO. Bacterial lipids: metabolism and membrane homeostasis. Prog. Lipid Res. 2013;52:249–276. doi: 10.1016/j.plipres.2013.02.002.
- 45.Foster DB, et al. Phosphatidylethanolamine recognition promotes enteropathogenic E. coli and enterohemorrhagic E. coli host cell attachment. Microb. Pathogenesis. 1999;27:289–301. doi: 10.1006/mpat.1999.0305.
- 46.Mayer, C. & Boos, W. Hexose/pentose and hexitol/pentitol metabolism. EcoSal Plus 1 (2005).
- 47.Reimer LC, et al. BacDive in 2019: bacterial phenotypic data for high-throughput biodiversity analysis. Nucleic Acids Res. 2019;47:D631–D636. doi: 10.1093/nar/gky879.
- 48.Kaur D, Brennan PJ, Crick DC. Decaprenyl diphosphate synthesis in Mycobacterium tuberculosis. J. Bacteriol. 2004;186:7564–7570. doi: 10.1128/JB.186.22.7564-7570.2004.
- 49.Newton GL, Buchmeier N, Fahey RC. Biosynthesis and functions of mycothiol, the unique protective thiol of Actinobacteria. Microbiol. Mol. Biol. Rev. 2008;72:471–494. doi: 10.1128/MMBR.00008-08.
- 50.Yaozhu W, Xiaofei Z, Sixue Z, Tan X. Structural and functional insights into corrinoid iron-sulfur protein from human pathogen Clostridium difficile. J. Inorg. Biochem. 2017;170:26–33. doi: 10.1016/j.jinorgbio.2017.02.005.
- 51.Darkoh C, Plants-Paris K, Bishoff D, DuPont HL. Clostridium difficile modulates the gut microbiota by inducing the production of indole, an interkingdom signaling and antimicrobial molecule. mSystems. 2019;4:e00346–18. doi: 10.1128/mSystems.00346-18.
- 52.Luo H, Moran MA. How do divergent ecological strategies emerge among marine bacterioplankton lineages? Trends Microbiol. 2015;23:577–584. doi: 10.1016/j.tim.2015.05.004.
- 53.Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
- 54.Neshich IAP, Kiyota E, Arruda P. Genome-wide analysis of lysine catabolism in bacteria reveals new connections with osmotic stress resistance. ISME J. 2013;7:2400–2410. doi: 10.1038/ismej.2013.123.
- 55.Chang H-H, et al. Complete genome sequence of “Candidatus Sulcia muelleri” ML, an obligate nutritional symbiont of maize leafhopper (Dalbulus maidis). Genome Announc. 2015;3:e01483–14. doi: 10.1128/genomeA.01483-14.
- 56.López-Madrigal S, Latorre A, Moya A, Gil R. The link between independent acquisition of intracellular gamma-endosymbionts and concerted evolution in Tremblaya princeps. Front. Microbiol. 2015;6:642. doi: 10.3389/fmicb.2015.00642.
- 57.Dale C, Moran NA. Molecular interactions between bacterial symbionts and their hosts. Cell. 2006;126:453–465. doi: 10.1016/j.cell.2006.07.014.
- 58.Langille MGI, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol. 2013;31:814. doi: 10.1038/nbt.2676.
- 59.Louca S, et al. Function and functional redundancy in microbial systems. Nat. Ecol. Evol. 2018;2:936. doi: 10.1038/s41559-018-0519-1.
- 60.Douglas, G. M. et al. PICRUSt2: an improved and extensible approach for metagenome inference. bioRxiv https://www.biorxiv.org/content/10.1101/672295v2 (2019).
- 61.Cooley, S. M., Hamilton, T., Deeds, E. J. & Ray, J. C. J. A novel metric reveals previously unrecognized distortion in dimensionality reduction of scRNA-seq data. bioRxiv https://www.biorxiv.org/content/10.1101/689851v3 (2019).
- 62.Thompson LR, et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature. 2017;551:457. doi: 10.1038/nature24621.
- 63.Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 2005;71:8228–8235. doi: 10.1128/AEM.71.12.8228-8235.2005.
- 64.Franzosa EA, et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods. 2018;15:962–968. doi: 10.1038/s41592-018-0176-y.
- 65.King ZA, et al. BiGG Models: a platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Res. 2015;44:D515–D522. doi: 10.1093/nar/gkv1049.
- 66.Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019;1:3. doi: 10.1093/bioinformatics/btz188.
- 67.Hug LA, et al. A new view of the tree of life. Nat. Microbiol. 2016;1:16048. doi: 10.1038/nmicrobiol.2016.48.
- 68.Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195.
- 69.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
- 70.Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348.
- 71.Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5:e9490. doi: 10.1371/journal.pone.0009490.
- 72.Letunic, I. & Bork, P. Interactive tree of life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, 256–259 (2019).
- 73.Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
- 74.R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2019).
- 75.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
- 76.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 77.Wright, E. S. Using DECIPHER v2.0 to analyze big biological sequence data in R. R J. 8, 352–359 (2016).
- 78.Ward Jr JH. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963;58:236–244. doi: 10.1080/01621459.1963.10500845.
Data Availability Statement
Genome accession numbers are available at 10.6084/m9.figshare.12864011.v4.
R scripts and sample data are available at 10.6084/m9.figshare.12864011.v4.