Nat Biotechnol. 2020 Nov 9;39(4):499–509. doi: 10.1038/s41587-020-0718-6

A genomic catalog of Earth’s microbiomes

Stephen Nayfach 1, Simon Roux 1, Rekha Seshadri 1, Daniel Udwary 1, Neha Varghese 1, Frederik Schulz 1, Dongying Wu 1, David Paez-Espino 1, I-Min Chen 1, Marcel Huntemann 1, Krishna Palaniappan 1, Joshua Ladau 1, Supratim Mukherjee 1, T B K Reddy 1, Torben Nielsen 1, Edward Kirton 1, José P Faria 2, Janaka N Edirisinghe 2, Christopher S Henry 2, Sean P Jungbluth 1,4, Dylan Chivian 3, Paramvir Dehal 3, Elisha M Wood-Charlson 3, Adam P Arkin 3, Susannah G Tringe 1, Axel Visel 1; IMG/M Data Consortium, Tanja Woyke 1, Nigel J Mouncey 1, Natalia N Ivanova 1, Nikos C Kyrpides 1, Emiley A Eloe-Fadrosh 1
PMCID: PMC8041624  PMID: 33169036

Abstract

The reconstruction of bacterial and archaeal genomes from shotgun metagenomes has enabled insights into the ecology and evolution of environmental and host-associated microbiomes. Here we applied this approach to >10,000 metagenomes collected from diverse habitats covering all of Earth’s continents and oceans, including metagenomes from human and animal hosts, engineered environments, and natural and agricultural soils, to capture extant microbial, metabolic and functional potential. This comprehensive catalog includes 52,515 metagenome-assembled genomes representing 12,556 novel candidate species-level operational taxonomic units spanning 135 phyla. The catalog expands the known phylogenetic diversity of bacteria and archaea by 44% and is broadly available for streamlined comparative analyses, interactive exploration, metabolic modeling and bulk download. We demonstrate the utility of this collection for understanding secondary-metabolite biosynthetic potential and for resolving thousands of new host linkages to uncultivated viruses. This resource underscores the value of genome-centric approaches for revealing genomic properties of uncultivated microorganisms that affect ecosystem processes.

Subject terms: Computational biology and bioinformatics, Microbiology


Cataloging microbial genomes from Earth’s environments expands the known phylogenetic diversity of bacteria and archaea.

Main

A vast number of diverse microorganisms have thus far eluded cultivation and remain accessible only through cultivation-independent molecular approaches. Genome-resolved metagenomics is an approach that enables the reconstruction of composite genomes from microbial populations and was first applied to a low-complexity acid mine drainage community1. With advances in computational methods and sequencing technologies, this approach has now been applied at much larger scales and to numerous other environments, including the global ocean2, cow rumen3, human microbiome4–6, deep subsurface7 and aquifers8. These studies have led to substantial insights into evolutionary relationships and metabolic properties of uncultivated bacteria and archaea8–10.

Beyond expanding and populating the microbial tree of life11,12, a comprehensive genomic catalog of uncultivated bacteria and archaea would afford an opportunity for large-scale comparative genomics, mining for genes and functions of interest (for example, CRISPR–Cas9 variants13) and constructing genome-scale metabolic models to enable systems biology approaches8,14,15. Further, recent genome reconstructions of uncultivated bacteria and archaea have yielded unique insights into the evolutionary trajectories of eukaryotes and ancestral microbial traits16–18.

Here we applied large-scale genome-resolved metagenomics to recover 52,515 medium- and high-quality metagenome-assembled genomes (MAGs), which form the Genomes from Earth’s Microbiomes (GEM) catalog. The GEM catalog was constructed from 10,450 metagenomes sampled from diverse microbial habitats and geographic locations (Fig. 1). These genomes represent 12,556 novel candidate species-level operational taxonomic units (OTUs), representing a resource that captures a broad phylogenetic and functional diversity of uncultivated bacteria and archaea. To demonstrate the value of this resource, we used the GEM catalog to perform metagenomic read recruitment across Earth’s biomes, identify novel biosynthetic capacity, perform metabolic modeling and predict host–virus linkages.

Fig. 1. Environmental and geographic distribution of metagenome-assembled genomes.


a, A total of 52,515 MAGs were recovered from geographically and environmentally diverse metagenomes in IMG/M. The majority (6,380 of 10,450; 61%) of metagenomes were reassembled for this work using the latest state-of-the-art assembly pipeline (Supplementary Table 1). These genomes form the GEM catalog. All MAGs were ≥50% complete, were ≤5% contaminated and had a quality score (completeness − 5 × contamination) of ≥50. b, Distribution of quality metrics across the MAGs. Approximately 200 randomly selected data points are overlaid on each boxplot, showing the minimum value, first quartile, median, third quartile and maximum value. See Supplementary Table 2 for quality statistics for all MAGs. c, Distribution of MAGs across biomes and sub-biomes, based on environmental metadata in the Genomes OnLine Database (GOLD; https://gold.jgi-psf.org). The number of MAGs associated with each sub-biome is indicated next to the plot. d, Geographic distribution of MAGs within each biome.

Results

Over 52,000 metagenome-assembled genomes recovered from environmentally diverse metagenomes

We performed metagenomic assembly and binning on 10,450 globally distributed metagenomes from diverse habitats, including ocean and other aquatic environments (3,345), human and animal host-associated environments (3,536), as well as soils and other terrestrial environments (1,919), to recover 52,515 MAGs (Fig. 1a–c and Supplementary Tables 1 and 2). These metagenomes include thousands of unpublished datasets contributed by the Integrated Microbial Genomes and Microbiomes (IMG/M) Data Consortium, in addition to publicly available metagenomes (Methods and Supplementary Tables 1 and 2). This global catalog of MAGs contains representatives from all of Earth’s continents and oceans with particularly strong representation of samples from North America, Europe and the Pacific Ocean (Fig. 1d and Supplementary Fig. 1). The GEM catalog is available for bulk download along with environmental metadata (Data availability and Supplementary Table 1) and can be interactively explored via the IMG/M (https://img.jgi.doe.gov) or the Department of Energy (DOE) Systems Biology Knowledgebase (Kbase; https://kbase.us) web portals for streamlined comparative analyses and metabolic modeling.

MAGs from the GEM catalog all meet or exceed the medium-quality level of the MIMAG standard19 (mean completeness = 83%; mean contamination = 1.3%) and include 9,143 (17.4%) assigned as high quality based on the presence of a near-full complement of rRNAs, tRNAs and single-copy protein-coding genes (Fig. 1a,b and Supplementary Table 2). Genome sizes of high-quality GEMs ranged from 0.63 to 11.28 Mb, with most small MAGs belonging to lineages expected to have reduced genomes, such as the Nanoarchaeota or Mycoplasmatales, and most large MAGs belonging to Myxococcota and Planctomycetota. Genome size and GC content were lowest in host-associated microbiomes (median: 2.61 Mb; 46.9%) and highest in terrestrial microbiomes (median: 3.77 Mb; 57.1%), which is consistent with pangenome expansion in soil environments20. MAG sizes were consistent with isolate genomes of the same species, indicating no major loss of gene content in individual genomes (Supplementary Fig. 2). One exception was Sinorhizobium medicae, in which MAGs assembled from root nodules were nearly two times larger than isolate genomes (11–12 Mb compared to 6–7 Mb for isolate references; 99% average nucleotide identity (ANI) and 65% alignment fraction (AF) to S. medicae USDA1004). Although tetranucleotide frequency composition of binned scaffolds showed good consistency overall, numerous SNPs were detected, suggesting a composite arising from two strains of the same population. We additionally compared our MAGs with those independently assembled by Parks et al.10 for a subset of GEM samples, which further reinforced the reproducibility of our composite genome bins (Supplementary Table 3 and Supplementary Note).

Taxonomically defined reference genomes are commonly used to infer the abundance of microorganisms from metagenomes but fail to recruit the majority of sequencing reads outside the human microbiome21. To explore whether the MAGs from the GEM catalog could address this issue, we aligned high-quality reads from 3,170 metagenomes with available read data to the 52,515 GEMs and to all isolate genomes from NCBI RefSeq. This revealed that an average of 30.5% (interquartile range (IQR) = 5.9–49.3%) and 14.6% (IQR = 0.9–15.8%) of metagenomic reads per sample were assigned to one or more GEMs or isolate genomes, respectively (Supplementary Fig. 3 and Supplementary Table 4). Across all samples, GEMs resulted in a median 3.6-fold increase in the number of mapped reads, which was particularly pronounced for certain environments like bioreactors or invertebrate hosts (Supplementary Fig. 3). Despite this improvement, nearly 70% of reads remained unmapped to any MAG or isolate genome. This was particularly noticeable for soil communities (for example, >95% of reads were unmapped to any genome in 55% of samples), which are highly complex and challenging to assemble22,23. Consistent with this result, metagenomes with the highest k-mer diversity24 tended to have the lowest mapping rates (Spearman’s r = −0.68; P value = 0). These communities likely contain closely related organisms, which pose a major problem for metagenomic assembly and binning25. Low mapping rates may also reflect the presence of viruses, plasmids and microbial eukaryotes, which were not recovered by the pipeline used in this study.

The GEM catalog expands genomic diversity across the tree of life

To uncover new species-level diversity, we clustered GEMs on the basis of 95% whole-genome ANI, revealing 18,028 species-level OTUs (Fig. 2a,b, Supplementary Fig. 4 and Supplementary Table 5). Although the species concept for prokaryotes is controversial26, this operational definition is commonly used and is considered to be a gold standard27,28. Based on taxonomic annotations from the Genome Taxonomy Database (GTDB)29,30, we found that the GEMs cover 137 known phyla, 305 known classes and 787 known orders. The vast majority of non-singleton OTUs contained GEMs from only a single environment or multiple closely related environments (for example, bioreactors and wastewater; Supplementary Fig. 5), suggesting that few species have a broad habitat range, whereas nearly 40% were found in multiple sampling locations (Fig. 2c). Accumulation curves of MAGs revealed no plateau for species-level OTUs (Supplementary Fig. 6), indicating that additional species remain to be discovered across biomes, a conclusion also supported by the low percentage of mapped reads.
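The threshold-based grouping described above can be illustrated with a minimal sketch. It assumes that pairwise ANI and AF values have already been computed by some external tool, uses the 95% ANI cutoff from the text together with the 30% AF cutoff given in the Fig. 2 legend, and applies single-linkage clustering via union-find purely for brevity. The linkage rule and the function name cluster_species are assumptions made for illustration and may differ from the procedure actually used in this study.

```python
from collections import defaultdict


def cluster_species(genomes, pairs, min_ani=95.0, min_af=30.0):
    """Group genomes into OTUs by single-linkage over ANI/AF thresholds.

    genomes: iterable of genome identifiers.
    pairs:   iterable of (genome_a, genome_b, ani_percent, af_percent).
    Single linkage is an illustrative assumption, not the paper's stated method.
    """
    parent = {g: g for g in genomes}

    def find(x):
        # Walk to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, ani, af in pairs:
        if ani >= min_ani and af >= min_af:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for g in parent:
        clusters[find(g)].append(g)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


example_pairs = [
    ("a", "b", 97.0, 80.0),  # same species: passes both thresholds
    ("b", "c", 95.0, 30.0),  # passes exactly at the thresholds
    ("c", "d", 99.0, 10.0),  # high ANI but too little of the genome aligned
]
otus = cluster_species(["a", "b", "c", "d"], example_pairs)
```

In this toy input, genome d stays a singleton because its alignment fraction is too low, even though the aligned region is nearly identical.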

Fig. 2. Species-level clustering of the GEM catalog with >500,000 reference genomes.


a, MAGs from the current study were compared to 524,046 publicly available reference genomes found in IMG/M and NCBI. All reference genomes met the same minimum quality standards as applied to the GEM catalog. All MAGs and reference genomes were clustered into 45,599 species-level OTUs on the basis of 95% ANI and 30% AF. b, Overlap of OTUs between genome sets. MAGs from the current study revealed genomes for 12,556 species for the first time. c, The vast majority of OTUs with >1 genome from the GEM catalog were restricted to individual biomes and sub-biomes, although over a third were found in multiple geographic locations. d, A large proportion of the 12,556 newly identified species were represented by only a single genome. e,f, Comparison of the current dataset with the 16 largest previously published genome studies, selected on the basis of species-level diversity. Study identifiers were derived from either NCBI BioProject or GOLD. Studies by Wu et al.35, HMP (2010)36 and Mukherjee et al.34 contain additional genomes generated after publication. All MAGs from other studies were filtered using the same quality criteria as the GEM dataset (Fig. 1a and Methods). Genomes from the current study represent over three times more diversity compared to any previously published study.

Next, we compared the 18,028 OTUs against an extensive database of 524,046 reference genomes including >300,000 MAGs from previous studies, >200,000 genomes of organisms isolated in pure culture (including all of RefSeq) and >2,000 single-amplified genomes (SAGs; Fig. 2a). These included large MAG studies conducted in the human microbiome4–6, global ocean2, aquifer systems7,8,31, permafrost thaw gradient14, cow rumen3, hypersaline lake sediments32 and hydrothermal sediments33, as well as several large isolate genome sequencing studies such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) project34,35 and the Human Microbiome Project (HMP)36, although several studies were published during the course of the current study and were not included37,38. All reference genomes were subjected to the same quality criteria as we applied to the GEM dataset (≥50% completeness, ≤5% contamination and a quality score of ≥50).

Notably, 12,556 OTUs from the GEM catalog (representing 23,095 MAGs) were distinct from reference genomes at 95% ANI and thus represent new candidate species. At the same time, 70% of all reference genomes were recruited to the GEM catalog at >95% ANI, indicating it has good coverage of existing genomes. New OTUs were found in 326 studies, with an average of 40 for each study. The Microbial Dark Matter (MDM) Phase II study, an extension of the GEBA-MDM project12, contributed the most novelty with 790 new OTUs derived from 1,124 MAGs found in 80 metagenomes.

Supporting their novelty, the vast majority of the 12,556 new OTUs were distantly related to reference genomes or barely aligned at all (93.7% of OTUs with <90% ANI or <10% AF compared to references), and >99% were unannotated at the species level by the GTDB. However, MAGs from new OTUs tended to be slightly less complete (averages: 81.0% versus 84.6%), displayed slightly higher contamination (averages: 1.5% versus 1.1%) and were often found as singletons (Fig. 2d, Supplementary Table 6 and Supplementary Note). These observations are likely explained by a number of factors, including genome reduction for uncultivated lineages6, problems assembling the 16S rRNA locus39 and challenges recovering members of the rare biosphere40.

We clustered the unrecruited reference genomes into an additional 27,571 OTUs, resulting in a combined dataset of 45,599 species-level OTUs (Fig. 2a,b). This revealed that while the GEM catalog contained fewer genomes, it represented 3.8 times more diversity compared to any previously published study (Fig. 2e). For example, Parks et al. performed large-scale assembly and binning of all environmental metagenomes available in the NCBI Sequence Read Archive in an unprecedented effort to expand genomic representation of uncultivated lineages10,30. Based on the clustering and quality control performed in the current study, these 10,728 MAGs represent 5,200 OTUs, covering only 12% of OTUs from the GEM catalog (Supplementary Table 7).

Next, we constructed a phylogeny of the 45,599 OTUs based on 30 concatenated marker genes (Fig. 3a, Supplementary Table 8 and Methods). Phylogenetic analysis of this tree supported that the GEM catalog is the most diverse dataset published to date (Fig. 2f). Overall, the GEM catalog resulted in a 44% gain in phylogenetic diversity across the entire tree of bacteria and archaea and currently represents 31% of all known diversity based on cumulative branch length. Gains in phylogenetic diversity were relatively consistent across taxonomic groups, but especially high for certain large clades that included Planctomycetota (79% gain), Verrucomicrobiota (68% gain) and Patescibacteria (also referred to as the ‘Candidate Phyla Radiation’) (60% gain) (Fig. 3b and Supplementary Table 9). The GEM catalog resulted in more variable gains across environments (Supplementary Table 10), though almost no new diversity was uncovered in human-associated samples (Fig. 3b), which were previously analyzed in recent MAG studies4–6. Notably, these analyses also revealed that 75% of cataloged microbial phylogenetic diversity is exclusively represented by uncultured genomes (that is, MAGs or SAGs).
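To make the notion of a gain in phylogenetic diversity concrete, the following sketch computes phylogenetic diversity as the cumulative branch length spanned by a set of tips and expresses the gain as the relative increase when new tips are added to a reference set. The tree representation (a child-to-parent map plus a branch length per node) and the function names are hypothetical choices for illustration, and the toy tree is invented. Under this definition, a 44% gain corresponds to 0.44/1.44, or roughly 31% of the combined branch length, which is consistent with the two figures reported above.

```python
def phylogenetic_diversity(parent, length, tips):
    """Sum the branch lengths on the union of root-to-tip paths.

    parent: dict mapping each non-root node to its parent node.
    length: dict mapping each non-root node to the length of the branch above it.
    tips:   iterable of tip names to include.
    """
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Climb toward the root, stopping at branches already counted.
        while node in parent and node not in seen:
            seen.add(node)
            total += length[node]
            node = parent[node]
    return total


def pd_gain(parent, length, reference_tips, new_tips):
    """Relative increase in phylogenetic diversity from adding new_tips."""
    before = phylogenetic_diversity(parent, length, reference_tips)
    if before == 0:
        raise ValueError("reference set spans no branch length")
    combined = set(reference_tips) | set(new_tips)
    after = phylogenetic_diversity(parent, length, combined)
    return (after - before) / before


# Invented toy tree: two reference tips share an internal node; one new tip
# hangs directly off the root.
toy_parent = {"n1": "root", "r1": "n1", "r2": "n1", "g1": "root"}
toy_length = {"n1": 1.0, "r1": 1.0, "r2": 1.0, "g1": 1.5}
gain = pd_gain(toy_parent, toy_length, ["r1", "r2"], ["g1"])
```

Here the reference tips span 3.0 units of branch length and the new tip adds 1.5, so the gain is 0.5, that is, a 50% increase.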

Fig. 3. The GEM catalog fills gaps in the tree of life.


a, A phylogenetic tree was built for 43,979 of the 45,599 OTUs based on a concatenated alignment of 30 universally distributed single-copy genes. The full alignment contained 4,689 amino acid positions, with each OTU containing data for at least 30% of positions. Species-level OTUs were further clustered based on phylogenetic distance into 1,928 approximately order-level clades. Green branches indicate new lineages represented only by the GEM catalog. The inner strip chart indicates whether an order is newly identified (green; represented only by GEMs) or was previously known (light gray; represented by a reference genome). The next strip chart indicates whether an order is uncultured (blue; represented only by MAGs/SAGs) or cultured (gray; represented by at least one isolate genome). The next four strip charts indicate the environmental distribution of the orders; the last plot indicates the number of MAGs from the GEM catalog recovered from each order. The GEM catalog’s composite genomes are broadly distributed across the tree of life, including many new order-level clades, though most new lineages are interspersed between existing ones. Vast regions of the tree are represented only by uncultivated genomes. b, Phylogenetic diversity was computed for subtrees represented by the GEM catalog/reference genomes (green scale) or cultivated/uncultivated genomes (blue scale). Gray bars indicate percentage of total phylogenetic diversity represented by each taxonomic group (left) or environment (right). The GEM catalog consistently expands phylogenetic diversity across different phyla within bacteria and archaea and for different environments. One exception is the human microbiome, where the GEM catalog contributes little new diversity. 
Combining the GEM catalog with other uncultivated genomes, it becomes apparent that uncultivated genomes dominate the diversity within most phyla and environments, particularly for groups like the Patescibacteria (Candidate Phyla Radiation) and Nanoarchaeota.

To determine whether the GEM catalog contained new lineages at higher taxonomic ranks, we used relative evolutionary divergence (RED)30 to cluster all 45,599 OTUs into monophyletic groups, including singletons, representing 16,062 genera, 5,165 families, 1,928 orders, 368 classes and 129 phyla (Supplementary Tables 11–13, Supplementary Fig. 7 and Methods). At the phylum level, we identified 16 clades exclusively represented by GEMs (11 clades in bacteria and 5 in archaea), which may indicate new phyla. However, these clades were supported by only 29 GEMs, which were largely assigned to known phyla by the tool GTDB-Tk (28/29). At lower taxonomic ranks, considerably more novel groups were identified, including 456 new orders, 1,525 new families and 5,463 new genera. We conclude that, in contrast to earlier metagenome binning studies that uncovered vast new lineages of life, the majority of deep-branching lineages are represented by current genome sequences.

Encoded functional potential in the GEMs

To provide a systems-level snapshot of metabolic potential, we built genome-scale metabolic models for the nonredundant, high-quality GEMs with >40 representatives for each environment (n = 3,255) in KBase41 (Supplementary Figs. 8 and 9, Supplementary Table 14 and Supplementary Note). Beyond known metabolic pathways, we hypothesized that MAGs from the GEM catalog contained a reservoir of functional novelty. To address this question, we compiled a catalog of 5,794,145 protein clusters (PCs) representing 111,428,992 full-length genes, with 51.7% of PCs containing at least two sequences. The vast majority of PCs were not functionally annotated compared to the TIGRFAM or KEGG Orthology databases, and most lacked even a single Pfam domain (95.2%, 88.9% and 74.5% unannotated for TIGRFAM, KEGG and Pfam, respectively). Comparatively, for a catalog of 270 million genes from 76,000 reference bacterial and archaeal genomes available through IMG/M42, these percentages are approximately 70%, 50% and 20%, respectively. Nearly 70% of all PCs were not functionally annotated by any of the three databases, and 47% had no significant similarity to UniRef (https://www.uniprot.org), a large and regularly updated protein resource. While the largest PCs tended to be previously known, several large PCs lacked any annotation, including 356 clusters with at least 1,000 members and 28,869 clusters with at least 100 members.

While it is outside the scope of this study to systematically interpret the functional capacities of all GEMs, here we present a few illustrative vignettes. First, we found that GEMs recapitulated recent observations of an expanded purview of methanogenesis (Supplementary Fig. 10) due to membership of new archaeal phyla like the Halobacterota, Hadesarchaea (including Archaeoglobi and Syntrophoarchaeia) and lineages within the Crenarchaeota (for example, Thermoprotei, Korarchaeia and Bathyarchaeia)43–46. At a lower taxonomic rank, we identified GEMs for a novel species of the genus Coxiella, which includes the class B bioterrorism agent Coxiella burnetii associated with substantial health and economic burden47, providing an opportunity to gain new insights into the evolution of host–pathogen interactions within this genus. Several virulence factors were found in the GEMs, including the Dot/Icm type IV secretion system (Supplementary Fig. 7) used to deliver effector proteins into the cytoplasm of the host cell48; however, the characterized C. burnetii T4SS effectors were absent. Thus, GEMs offer potential for new discovery at the highest and lowest taxonomic ranks.

Broad and diverse secondary-metabolite biosynthetic potential

Most secondary metabolites have been isolated from cultivated bacteria affiliated with only a handful of bacterial groups, including Streptomycetes, Pseudomonas, Bacillus and Streptococcus49. More recently, mining of metagenomic data from soil has expanded representation to members of the phyla Acidobacteria, Verrucomicrobia, Gemmatimonadetes and the candidate phylum Rokubacteria50. The GEM catalog affords a unique opportunity to explore the repertoire of secondary-metabolite biosynthetic gene clusters (BGCs) encoded within this taxonomically and biogeographically diverse genome collection. We identified 104,211 putative BGC regions from the 52,515 GEMs using AntiSMASH (v5.1)51 (Supplementary Table 15). For comparison, this represents an increase of BGCs in IMG/ABC (Atlas of BGCs)52 by 31% and is 54 times the size of the manually curated MIBiG dataset49. Approximately 66% of GEM BGCs intersected with one or more contig boundaries, indicating that a majority may be incomplete (Supplementary Fig. 12), which is consistent with previous observations based on fragmented recovery50,53. We assigned the class of secondary metabolites synthesized by each BGC across the GEM catalog (Fig. 4a). A total of 44,835 gene clusters or cluster fragments containing nonribosomal peptide synthetases (NRPSs) and/or polyketide synthases (PKSs) were identified from 104 phyla, 23,738 terpene clusters from 79 phyla and 12,360 ribosomally processed peptide (RiPP) clusters from 76 phyla. While fragmentation likely skewed cluster content counts in unpredictable ways, we observed trends that may be reflective of nature. For example, Firmicutes had unusually high numbers of RiPPs (more than half of their BGCs were RiPP clusters), while Thermoplasmatota and Verrucomicrobiota contained relatively high numbers of terpene clusters (68% and 50% of their BGCs, respectively).
Analyses of environmental trends for BGCs were less clear, with no environmental source group showing a clear skew in relative BGC family content (Fig. 4a). If accurate, this implies that specific chemistry is not limited or amplified by environment, and that most classes of secondary metabolites can be found nearly anywhere.

Fig. 4. Biosynthetic gene clusters recovered from the GEMs dataset.


a, Relative frequency of BGC types across dominant phyla (left) and habitats (right). BGC types are highly variable across phyla but relatively stable across habitats. AAmodifier, amino acid modifying system. b, The single largest BGC region, found in a soil-derived bacterium from the Acidobacteria phylum and UBA5704 genus. The BGC encodes 62 PKS or NRPS modules with three colinear module chains.

To evaluate BGC novelty, we queried each BGC sequence against the NCBI nucleotide sequence collection. Using a threshold of 75% identity over 80% of the query length, we identified 87,187 (83%) as putatively novel BGCs that encoded new chemistry (Supplementary Table 16). Although many modular clusters are fragmented, we identified over 3,000 BGC regions >50 kb in length and more than 17,000 >30 kb. Together, the GEM catalog holds potential as a rich source of novel predicted BGCs and provides ample opportunity to explore biosynthetic potential outside known clades. As noted elsewhere54, Myxococcus showed promising biosynthetic potential, with 1,751 regions across 232 MAGs and a broad diversity of antiSMASH-defined BGC families. The single largest BGC region was found in a soil-derived bacterium putatively of the phylum Acidobacteria and genus UBA5704, encoding a remarkable number of 62 PKS or NRPS modules with three clear colinear module chains (Fig. 4b). Although several Acidobacteria are known to contain PKS and NRPS clusters, this MAG contains an additional 66 BGC regions, indicating a level of biosynthetic potential that may have been underestimated within this phylum.
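The novelty criterion described above can be sketched as a simple filter over alignment hits. The sketch assumes that each BGC query has already been searched against a nucleotide collection and that each hit is summarized by its percent identity and the number of query bases it covers. The Hit record and the function is_putatively_novel are hypothetical names, and treating a BGC as known when any single hit meets both cutoffs is an assumption about how the thresholds were combined.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identity: float       # percent identity of the alignment
    aligned_length: int   # number of query bases covered by the alignment


def is_putatively_novel(query_length, hits, min_identity=75.0, min_coverage=0.80):
    """Return True when no hit reaches both the identity and coverage cutoffs."""
    for hit in hits:
        coverage = hit.aligned_length / query_length
        if hit.identity >= min_identity and coverage >= min_coverage:
            return False
    return True


# A 10 kb BGC region with one strong but short hit is still called novel,
# because the hit covers only half of the query.
short_hit_only = is_putatively_novel(10_000, [Hit(identity=90.0, aligned_length=5_000)])
```

A long but divergent hit (for example, 70% identity across the full query) would likewise leave the region classified as putatively novel under these cutoffs.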

GEMs reveal thousands of new virus–host connections

In addition to the assembly of microbial genomes, recent studies have highlighted how metagenomes can be mined for novel viral genomes55. However, most uncultivated viruses cannot be associated with a microbial host, which is crucial for understanding their roles and impacts in nature. We reasoned that MAGs from the GEM catalog could be used to improve host prediction for viral genomes. To address this, we identified connections between the 52,515 GEMs and 760,453 viruses in IMG/VR56 using a combination of CRISPR-spacer matches (≤1 SNP) and genome sequence matches (>90% identity over >500 bp), which showed good agreement (Supplementary Note). IMG/VR viruses were connected to consistent host taxa (95% of linkages per virus to the same host family), and >96% of connected viruses and GEMs were derived from a similar environment based on the top level of the GOLD57 environmental ontology.
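As an illustration of the CRISPR-spacer criterion, the sketch below tests whether a spacer occurs in a viral sequence on either strand with at most one mismatch. It is a brute-force scan intended only to show the matching rule; the function names are hypothetical, the sequences are invented, and a real analysis at this scale would rely on a dedicated aligner rather than this loop.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def within_mismatches(a, b, max_mismatches):
    """True if equal-length strings a and b differ at no more than max_mismatches sites."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def spacer_matches(spacer, viral_seq, max_mismatches=1):
    """Scan both strands of viral_seq for the spacer, allowing few mismatches."""
    k = len(spacer)
    for query in (spacer, reverse_complement(spacer)):
        for start in range(len(viral_seq) - k + 1):
            if within_mismatches(query, viral_seq[start:start + k], max_mismatches):
                return True
    return False


spacer = "AAAACCCCGG"                                  # invented 10-nt spacer
exact = spacer_matches(spacer, "TTAAAACCCCGGTT")       # exact match
one_snp = spacer_matches(spacer, "TTAAAACCTCGGTT")     # one mismatch
two_snps = spacer_matches(spacer, "TTAAATCCTCGGTT")    # two mismatches
minus_strand = spacer_matches(spacer, "GGCCGGGGTTTTGG")  # reverse-complement match
```

The first, second and fourth calls report a match, whereas the third does not because it would require two mismatches.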

Using a combination of the two approaches, we predicted connections between 81,449 IMG/VR viruses and 23,082 GEMs (Fig. 5a and Supplementary Table 17), increasing the total number of IMG/VR viruses with a predicted host by >2.5-fold (from 36,976 to 92,872). However, these expanded virus–host connections still covered only 10.7% of the 760,453 viral genomes from IMG/VR and 44.0% of MAGs from the GEM catalog. This is exemplified for certain phyla like Thermoplasmatota, where a virus was linked to only 1.6% of the 624 assembled MAGs.

Fig. 5. MAGs resolve host–virus connectivity.


a, Bacterial and archaeal phyla from the GEM catalog were linked to viruses. The bar plot displays the percentage of MAGs linked to viruses from each phylum containing 100 or more MAGs. Phylum names were derived from the GTDB, and the numbers to the right represent MAGs from each phylum. Bar colors indicate the method of linking viruses to hosts; white indicates the percentage of MAGs not associated with any virus. b, Phylogeny of DJR viruses with associated host information. For each clade of three or more DJR sequences associated with the same host group, host information is indicated next to the clade along with the number of sequences linking this DJR clade to this host group, first from reference sequences, then from the GEM catalog. Reference sequences were obtained from Kauffman et al.59. Clades are colored according to the origin of the host information, and new host groups identified exclusively from the GEM catalog are highlighted in bold. All nodes with >50% support are displayed as multifurcation, and nodes with >80% support are highlighted with a black dot.

To address this limitation, we performed de novo prediction of integrated prophages in GEMs using VirSorter58 after carefully removing viral contamination (Methods). This approach provided an additional 10,410 viruses linked to 7,805 GEMs. These novel MAG-derived virus–host linkages included several groups of understudied clades, including the double jelly roll (DJR) lineage, which is a commonly overlooked group of non-tailed double-stranded DNA viruses59,60. Recent studies of DJR virus diversity have revealed that members of this group infect hosts across the three domains of life, yet they have also highlighted subgroups without a known host59. Here, we identified 73 DJR sequences in the GEM catalog, which provided host information for four additional DJR clades (Fig. 5b). In addition, two of these clades were linked through the GEMs to uncultivated bacterial and archaeal groups that had not yet been identified as putative DJR hosts (namely Omnitrophica and Nanoarchaeota). Beyond the DJR group, we identified putative hosts for two single-stranded DNA virus families, including four clades of Microviridae and 28 clades of Inoviridae (Supplementary Fig. 12 and Supplementary Table 18). Taken together, these different examples demonstrate how MAGs can resolve novel virus–host linkages.

Discussion

This resource of 52,515 medium- and high-quality MAGs represents the largest effort to date to capture the breadth of bacterial and archaeal genomic diversity across Earth’s biomes. The GEM catalog considerably expands the known phylogenetic diversity of bacteria and archaea, increases recruitment of metagenomic sequencing reads, contains a wealth of biosynthetic potential and improves host assignments for uncultivated viruses. Despite an overall 44% increase in phylogenetic diversity of bacteria and archaea, we found little evidence of new deep-branching lineages representing new phyla, consistent with recent studies of microbial diversity30,61. Likewise, despite a 3.6-fold increase in recruitment of metagenomic reads, over two-thirds of metagenome reads still lack a mappable reference genome. Thus, continued efforts to capture the genomes of new species- and strain-level representatives will further improve metagenomic resolution.

Large-scale genomic inventories provide critical resources to the broader research community34–36. With that said, MAGs from the GEM catalog, like other MAGs generated to date, have several limitations for users to be aware of, including undetected contamination, low contiguity and incompleteness. Although these MAGs are important placeholders for many new candidate species, we expect many will be replaced in the future by higher quality MAGs or ultimately by genome sequences from clonal isolates. As we have illustrated with the large repertoire of new secondary metabolite BGCs and putative virus–host connections, we anticipate that the GEM catalog will become a valuable resource for future metabolic and genome-centric data mining and experimental validation.

Methods

Metagenomic samples and assembly

For genome binning, we used 10,450 metagenomic assemblies from the IMG/M database42 that correspond to 527 studies and 10,331 samples from a myriad of microbial environments (Supplementary Table 1). The majority (6,380 of 10,450; 61%) of metagenomes were reassembled for this work using the latest state-of-the-art assembly pipeline: read filtering with BFC, followed by assembly with metaSPAdes with the option ‘--meta’. Assembled metagenomes from IMG/M were generated using a variety of quality-control and assembly methods, as described by Huntemann et al.62. Where unassembled metagenomes were available, reads were mapped back to assembled contigs using BWA-MEM63 with default parameters, and contig coverage information was generated using SAMtools64.

Metagenome binning and quality control

MAGs were recovered from the individual metagenomic assemblies using MetaBAT65 (v0.32.4 and v0.32.5) with the option ‘--superspecific’, on the basis of tetranucleotide frequencies (Supplementary Table 2). Depth information was used when available, and contigs shorter than 3,000 bp were discarded. The resulting MAGs were refined in two stages. First, RefineM (v0.0.20)10 was used to remove contigs with aberrant read depth, GC content and/or tetranucleotide frequencies. Second, contigs with conflicting phylum-level taxonomy were removed. Taxonomic annotations of contigs were obtained from protein-level alignments against the IMG/M database (downloaded 07 December 2017) using the Last aligner (v876)66 and taking the lowest common ancestor of taxonomically classified genes.

The completeness and contamination of all MAGs were estimated using CheckM (v1.0.11)67 via the lineage-specific workflow. Based on these results, we selected 52,515 MAGs that were estimated to be at least 50% complete, had less than 5% contamination and had a quality score of >50 (defined as the estimated completeness of a genome minus five times its estimated contamination). As additional indicators of completeness, we identified tRNA genes using tRNAscan-SE (v2.0)68 and rRNA genes using Infernal (v1.1.2)69 with models from the Rfam database70. Based on these results, we found that 9,143 of the 52,515 MAGs were classified as high quality based on the MIMAG standard (≥90% completeness, ≤5% contamination, ≥18/20 tRNA genes and presence of 5S, 16S and 23S rRNA genes), with the remainder classified as medium quality. These 52,515 MAGs form the GEM dataset.
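The selection and tiering rules above can be written compactly. The following is a minimal Python sketch, not the pipeline used in this work; the function names and signatures are illustrative assumptions, and the inputs are assumed to be CheckM completeness and contamination estimates expressed in percent.

```python
def quality_score(completeness: float, contamination: float) -> float:
    """Estimated completeness minus five times estimated contamination."""
    return completeness - 5.0 * contamination


def passes_gem_filter(completeness: float, contamination: float) -> bool:
    """Thresholds from the text: >=50% complete, <5% contaminated, score >50."""
    return (
        completeness >= 50.0
        and contamination < 5.0
        and quality_score(completeness, contamination) > 50.0
    )


def mimag_tier(completeness, contamination, n_trna, has_5s, has_16s, has_23s):
    """Return 'high', 'medium', or None if the MAG fails the dataset filter."""
    if not passes_gem_filter(completeness, contamination):
        return None
    high = (
        completeness >= 90.0
        and contamination <= 5.0
        and n_trna >= 18          # at least 18 of the 20 tRNA types
        and has_5s and has_16s and has_23s
    )
    return "high" if high else "medium"
```

For example, a MAG at 60% completeness and 3% contamination has a quality score of 45 and is excluded, even though it meets the individual completeness and contamination cutoffs.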

Metagenomic read recruitment to MAGs and reference genomes

We selected 3,170 metagenomic samples with available sequencing reads from the Joint Genome Institute and Sequence Read Archive databases to quantify mappability (Supplementary Table 4). Up to 500,000 reads from each metagenome were aligned to a database containing 52,515 GEMs and another database containing 151,730 genomes from NCBI RefSeq (release 93)71. We used only 500,000 reads per metagenome, representing a median of 0.84% of reads across datasets (IQR = 0.40–1.78%), to avoid the high computational cost of aligning all reads; this subsampling is in line with previous analyses4. Read alignment was performed using Bowtie 2 (v2.3.2) in ‘end-to-end’ mode with the option ‘--very-sensitive’, and up to 20 alignments per read were retained72. After alignment, we discarded low-quality reads with an average base quality score of <30, read length of <70 bp or any ambiguous base calls. Additionally, we discarded poor alignments where the edit distance exceeded 5 per 100 bp of read length (that is, <95% identity).
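The read-level and alignment-level filters described above can be sketched as two small predicates. This is an illustrative Python sketch under stated assumptions rather than the code used in this work: the function names are hypothetical, base qualities are assumed to be already decoded to integer Phred scores, and any base other than A, C, G or T is treated as ambiguous.

```python
from typing import Sequence

MIN_MEAN_QUALITY = 30      # average base quality cutoff
MIN_READ_LENGTH = 70       # read length cutoff in bp
MAX_EDITS_PER_100_BP = 5   # equivalent to >=95% identity


def keep_read(seq: str, quals: Sequence[int]) -> bool:
    """Keep reads with length >=70 bp, mean quality >=30 and no ambiguous bases."""
    if len(seq) < MIN_READ_LENGTH or len(quals) == 0:
        return False
    if sum(quals) / len(quals) < MIN_MEAN_QUALITY:
        return False
    return all(base in "ACGT" for base in seq.upper())


def keep_alignment(edit_distance: int, read_length: int) -> bool:
    """Keep alignments whose edit distance does not exceed 5 per 100 bp."""
    # Integer arithmetic avoids floating-point issues at the boundary.
    return edit_distance * 100 <= MAX_EDITS_PER_100_BP * read_length
```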

Clustering MAGs into species-level OTUs

The 52,515 MAGs from the GEM dataset were clustered into 18,028 species-level OTUs on the basis of 95% genome-wide ANI (Supplementary Tables 2 and 5). ANI was estimated using MUMmer (v4.0.0)73 with default parameters, which computes the average DNA identity across one-to-one alignment blocks between genomes. Alignments covering <30% of either genome were discarded. We used a 30% AF threshold, as opposed to a previous study that recommends using 60% AF (ref. 74), to avoid the formation of spurious OTUs that can result from incomplete genomes6. Centroid-based clustering was performed, where the MAG with the highest CheckM quality score was designated as the centroid, and all MAGs within 95% ANI to the centroid were assigned to the same cluster. As validation, we quantified the similarity of the species-level OTUs to the GTDB taxonomy for 23,009 MAGs assigned to a known species. Both datasets represented a similar number of species (3,537 OTUs versus 3,481 from the GTDB), and MAGs tended to be assigned to the same species in both databases (adjusted Rand Index = 0.99).
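One way to read the centroid-based procedure above is as a greedy loop over MAGs sorted by quality. The sketch below is a hypothetical Python illustration of that reading, not the implementation used here: the function name, the input dictionaries and the choice to store the smaller of the two alignment fractions per pair are all assumptions, and pairwise ANI values are assumed to have been computed beforehand.

```python
def cluster_mags(quality, ani_af, min_ani=95.0, min_af=30.0):
    """Greedy centroid clustering.

    quality: dict mapping MAG id -> quality score.
    ani_af:  dict mapping frozenset({a, b}) -> (ANI %, smaller alignment fraction %).
    Returns a dict mapping each MAG id -> id of its cluster centroid.
    """
    assignment = {}
    # Visit MAGs from highest to lowest quality so that each new centroid is
    # the best genome not yet assigned (ties broken by id for determinism).
    for mag in sorted(quality, key=lambda m: (-quality[m], m)):
        if mag in assignment:
            continue
        assignment[mag] = mag  # this MAG becomes a centroid
        for other in quality:
            if other in assignment:
                continue
            ani, af = ani_af.get(frozenset((mag, other)), (0.0, 0.0))
            # Pairs aligned over <30% of either genome are ignored.
            if ani >= min_ani and af >= min_af:
                assignment[other] = mag
    return assignment
```

In this sketch a pair with 96% ANI but only 20% alignment fraction does not join the centroid's cluster, which mirrors the rule that poorly covered alignments are discarded before clustering.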

Comparing MAGs to >500,000 genomes in public databases

We compared representative genomes from the 18,028 OTUs to a large number of publicly available reference genomes. A total of 564,467 reference genomes were obtained from a variety of sources, including IMG/M (59,047 isolates, 8,412 MAGs and 7,066 SAGs), NCBI RefSeq (release 93; 151,730 isolates), GenBank (29,127 MAGs and 1,555 SAGs) and human-associated MAGs from three recent studies (307,530)4–6. CheckM was applied to all references and we selected those meeting the same minimum quality criteria applied to the GEM dataset (>50% completeness, <5% contamination and a quality score of >50). This resulted in a final set of 524,046 references from IMG/M (56,884 isolates, 6,146 MAGs and 1,475 SAGs), NCBI RefSeq (release 93; 150,245 isolates), GenBank (23,162 MAGs and 717 SAGs) and human-associated MAGs from three recent studies (285,417). We first used Mash (v2.0)75 with a sketch size of 10,000 to find the most similar reference genome to each of the 18,028 OTUs, and then used MUMmer (v4.0.0) with default parameters to estimate ANI between genome pairs. Based on this analysis, we found that 12,556 OTUs (69.6% of total) failed to match any reference genome at >95% ANI over >30% of the genome. Next, we identified OTUs represented only by reference genomes. First, we assigned 364,602 reference genomes to one of the 5,472 OTUs from the GEM dataset that matched a reference genome, based on >95% ANI over >30% of the genome. The remaining 159,444 reference genomes were clustered into 27,571 additional OTUs based on 95% ANI using MUMmer. This resulted in a final dataset of 45,599 OTUs representing all GEMs and reference genomes.
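The counts in this paragraph are linked by simple bookkeeping, which the following short Python sketch makes explicit. It only restates numbers given in the text; the variable names are illustrative.

```python
# Counts as reported in the text above.
gem_otus = 18_028                 # species-level OTUs from the GEM dataset
novel_gem_otus = 12_556           # GEM OTUs without a reference match
references_total = 524_046        # quality-filtered reference genomes
references_in_gem_otus = 364_602  # references assigned to existing GEM OTUs
reference_only_otus = 27_571      # OTUs formed from the remaining references

# Quality-filtered references broken down by source.
filtered_by_source = {
    "IMG/M": 56_884 + 6_146 + 1_475,
    "NCBI RefSeq": 150_245,
    "GenBank": 23_162 + 717,
    "human-associated MAGs": 285_417,
}

gem_otus_with_reference = gem_otus - novel_gem_otus
references_unassigned = references_total - references_in_gem_otus
all_otus = gem_otus + reference_only_otus

print(sum(filtered_by_source.values()))  # 524046
print(gem_otus_with_reference)           # 5472
print(references_unassigned)             # 159444
print(all_otus)                          # 45599
```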

Constructing a phylogeny of nonredundant MAGs and reference genomes

We constructed a multimarker gene tree of the 45,599 OTUs based on a subset of 30 genes from the PhyEco database76 that were present as a single copy in >99% of genomes searched (Supplementary Table 8). HMMER (v3.1b2)77 was used to identify homologs of the marker genes in the genomes of each OTU using marker-gene-specific bit-score thresholds. To mitigate missing data in incomplete genomes, we pooled homologs across genomes from the same OTU (using a maximum of ten genomes, selected on the basis of CheckM quality) for each of the 30 marker genes. We then picked the centroid gene for each marker gene in each OTU, which represents the gene with the highest similarity to other members of the same OTU. Multiple sequence alignments of the centroids were created for each marker gene using FAMSA (v1.2.5) with default parameters78. Columns with >10% gaps were trimmed with trimAl (v1.4; option --gt 0.90)79, individual marker-gene alignments were concatenated, and sequences with >70% gaps were removed. The concatenated multiple sequence alignment contained 4,689 columns and 43,979 sequences. FastTree (v2.1.10)80 was used to build an approximate maximum likelihood tree using the WAG + GAMMA models.
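The two gap filters described above (columns with >10% gaps, then sequences with >70% gaps) can be illustrated on a toy alignment. The following Python sketch is a stand-in for what trimAl and the subsequent sequence filter do, written under assumptions: the function names are hypothetical, the alignment is a dictionary of equal-length strings, and '-' is the only gap character.

```python
def trim_columns(alignment, max_gap_fraction=0.10):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    if not alignment:
        return {}
    names = list(alignment)
    n_seqs = len(names)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[n][i] == "-" for n in names) / n_seqs <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def drop_gappy_sequences(alignment, max_gap_fraction=0.70):
    """Drop sequences whose gap fraction exceeds max_gap_fraction."""
    return {
        name: seq for name, seq in alignment.items()
        if seq and seq.count("-") / len(seq) <= max_gap_fraction
    }
```

In the workflow described in the text, the column filter is applied to each marker-gene alignment separately and the sequence filter is applied once, after concatenation.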

The phylogenetic tree was used to further cluster the 45,599 OTUs into monophyletic groups at the genus, family, order, class and phylum levels using a recently described method30. Briefly, the tree was rooted between the bacteria and archaea, and a subclade was extracted for each domain. OTUs were clustered into monophyletic groups with bootstrap support values of >0.7 on the basis of their RED. Rank-specific RED cutoffs were identified to maximize similarity to the GTDB taxonomy for OTUs from known clades, where similarity was measured using the adjusted mutual information statistic calculated by the ‘scikit-learn’ package in Python (v0.21.3)81 (Supplementary Fig. 7 and Supplementary Tables 10–12). Monophyletic clades containing only GEMs were considered newly identified lineages, including those represented by a single GEM.

Secondary metabolism

Secondary-metabolite BGCs and regions were identified using AntiSMASH (v5.1)51 with default settings, ignoring contigs with lengths shorter than 5 kb. BGCs were compared to those in the NCBI nucleotide database (downloaded 07 Oct 2019) using the command ‘blastn’ within the NCBI BLAST+ package (v2.9)82 with an E-value cutoff of 1 × 10⁻¹. Results were parsed to evaluate top hits, and we considered a BGC redundant (that is, seen in previous sequencing efforts) if a database hit matched 80% or more of the BGC query length at an average sequence identity of 75% or more. For the purpose of counting BGC biochemistry, the 46 AntiSMASH-generated specific BGC families were categorized into one of six broader groups: ‘PKS’, ‘NRPS’, ‘terpene’, ‘RiPP’, ‘AAmodifier’ and ‘other’, based on categories suggested by the BiG-SCAPE software package83.
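The redundancy criterion (a hit covering at least 80% of the query BGC at an average identity of at least 75%) can be sketched as follows. This Python example is illustrative only: the text does not state how individual alignment segments were aggregated, so the union-of-coverage and length-weighted identity used here are assumptions, as are the function name and the input format.

```python
def is_redundant_bgc(query_length, hsps, min_coverage=0.80, min_identity=75.0):
    """Decide whether a BGC is matched by one database hit.

    hsps: list of (query_start, query_end, percent_identity) tuples for the
          alignment segments of a single hit, 1-based inclusive coordinates.
    """
    covered = set()          # query positions covered by at least one segment
    aligned = 0              # total aligned length, used to weight identity
    weighted_identity = 0.0
    for start, end, identity in hsps:
        span = end - start + 1
        covered.update(range(start, end + 1))
        aligned += span
        weighted_identity += identity * span
    if aligned == 0 or query_length <= 0:
        return False
    coverage = len(covered) / query_length
    mean_identity = weighted_identity / aligned
    return coverage >= min_coverage and mean_identity >= min_identity
```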

Connecting MAGs to viruses identified from IMG/VR and VirSorter

MAGs were used to predict hosts for 81,449 viral genomes from IMG/VR56 using a combination of CRISPR-spacer matches and sequence similarity between viruses and MAGs. CRISPR arrays were identified on contigs longer than 10 kb in MAGs using a combination of CRT81 and PILER-CR84. To minimize spurious predictions, we dropped arrays with fewer than three spacers, those with nonconserved repeats (<97% average identity to consensus repeat) or those in MAGs containing fewer than four CRISPR-associated proteins. This resulted in identification of 567,316 CRISPR spacers longer than 25 bp in 23,851 arrays in 13,540 MAGs. Protospacers were identified by aligning spacers to 760,453 IMG/VR genomes with blastn and identifying near-perfect matches (up to one mismatch covering at least 95% of the spacer length). Additionally, MAG contigs were aligned to IMG/VR genomes with blastn to identify integrated phage sequences. An IMG/VR genome was determined to be integrated in a MAG if it aligned by >90% identity over >500 bp on a contig that was >1.5 times the length of the IMG/VR genome. Contigs that were <1.5 times the length of the IMG/VR genome were considered a ‘full viral sequence’ and were discarded due to a lack of host information and the potential for inaccurate binning (that is, binning based on the virus genome characteristics rather than the host).
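The filtering rules in this paragraph reduce to a few threshold checks, sketched below in Python. This is a hypothetical illustration, not the code used in this work: the function names and argument names are assumptions, and because the text does not say how a contig of exactly 1.5 times the viral genome length is handled, the sketch groups that boundary case with the discarded ‘full viral sequence’ class.

```python
def keep_crispr_array(n_spacers, mean_repeat_identity, n_cas_proteins_in_mag):
    """Keep arrays with >=3 spacers, conserved repeats and >=4 Cas proteins in the MAG."""
    return (
        n_spacers >= 3
        and mean_repeat_identity >= 97.0
        and n_cas_proteins_in_mag >= 4
    )


def is_protospacer_match(spacer_length, aligned_length, mismatches):
    """Near-perfect match: spacer >25 bp, <=1 mismatch, >=95% of spacer aligned."""
    return (
        spacer_length > 25
        and mismatches <= 1
        and aligned_length >= 0.95 * spacer_length
    )


def classify_viral_hit(percent_identity, aligned_length, contig_length, virus_length):
    """Return 'integrated', 'full viral sequence', or None for a contig-virus alignment."""
    if not (percent_identity > 90.0 and aligned_length > 500):
        return None
    if contig_length > 1.5 * virus_length:
        return "integrated"            # host flanking sequence is present
    return "full viral sequence"       # discarded: no reliable host information
```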

To maximize the number of prophages identified in MAGs, we used VirSorter (v1.0.3)58 to perform de novo prediction, retaining all predictions of categories 4 and 5. To exclude possible decayed prophages, that is, integrated virus genomes that are now inactive and are being progressively removed from the host genome, we excluded all predictions in which 30% or more of the genes displayed a best hit to Pfam (thresholds: hmmsearch score ≥ 50 and E ≤ 0.001). The remaining predictions were further reduced by filtering out any contig that displayed >90% DNA identity over >500 bp to any of the 81,449 previously detected viral genomes from IMG/VR.

Detailed investigation of selected virus groups

Groups of temperate or chronic viruses for which MAG-based linkages were further investigated included the DJR capsid viruses (double-stranded DNA temperate bacteriophages and archaeoviruses), inoviruses (single-stranded DNA viruses with a chronic infection cycle) and Microviridae (single-stranded DNA viruses, lytic or lysogenic cycle). DJR sequences were specifically identified by searching the predicted proteins from metagenome contigs for a Hidden Markov Model built from known DJR major capsid proteins, based on the sequences from Kauffman et al.59. The search was computed with hmmsearch from the HMMER (v3.1b2) suite, selecting hits with a hmmsearch score ≥ 50 and an E ≤ 0.001. An additional 81 DJR sequences were collected which had initially been predicted by VirSorter with lower confidence (category 6). Additionally, inoviruses were identified in MAGs based on a custom approach recently developed to identify inovirus-like sequences in the same metagenome assemblies before genome binning85.

For DJR and Microviridae, phylogenies were built as follows: a multiple alignment was computed with MAFFT (v7.407)86 using the ‘einsi’ mode; the alignment was automatically trimmed with trimAl (v1.4.rev15) using the ‘gappyout’ option79; and the tree was built with IQ-TREE (v1.5.5)87 with 1,000 ultrafast bootstraps and automatic selection of the evolutionary model. Major capsid protein sequences were used for the DJR alignment, with references obtained from Kauffman et al.59. Similarly, major capsid protein sequences were used for the Microviridae alignment, with references obtained from Microviridae genomes available in the NCBI RefSeq and GenBank databases (as of October 2019). In addition, the 20 best blast hits from NCBI RefSeq bacterial genomes for each GEM Microviridae sequence were included to incorporate additional putative prophages in the tree. For inoviruses, the gene-content-based classification previously outlined was used by mapping GEM inovirus sequences to the recently described inovirus genome catalog85 using the MUMmer4 function73 with cutoffs of 95% ANI and 70% AF.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41587-020-0718-6.

Supplementary information

Supplementary Information (8MB, pdf)

Supplementary Text, Figs. 1–13 and References

Reporting Summary (2.6MB, pdf)
Supplementary Tables (32MB, xlsx)

Supplementary Tables 1–13 and 15–18.

Supplementary Table 14 (8.1MB, xlsx)

Genome-scale metabolic models in KBase.

Acknowledgements

This work was conducted by the US DOE Joint Genome Institute, a DOE Office of Science User Facility (contract no. DE-AC02–05CH11231), and used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US DOE (contract no. DE-AC02–05CH11231). This work was also supported as part of the Genomic Sciences Program DOE Systems Biology KBase (award nos. DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886).

Author contributions

N.C.K. and E.A.E.-F. conceived the study. S.N., S.R., R.S., D.U., N.V., F.S., D.W., D.P.-E., J.L., N.N.I. and E.A.E.-F. analyzed and interpreted the data. I.-M.C., M.H., K.P., S.M. and T.B.K.R. provided support for data through IMG/M and GOLD. T.N., E.K. and S.P.J. performed metagenomic assembly and binning. J.P.F., J.N.E., C.S.H., S.P.J., D.C., P.D., E.M.W.-C. and A.P.A. performed metabolic modeling through KBase. S.N. and E.A.E-F. designed and wrote the manuscript with feedback from S.T., A.V., T.W., N.J.M. and N.C.K. The IMG/M Data Consortium contributed metagenomic data. All authors reviewed and corrected the manuscript.

Data availability

All available metagenomic data, bins and annotations are available through the IMG/M portal (https://img.jgi.doe.gov/). Bulk download for the 52,515 MAGs is available at https://genome.jgi.doe.gov/GEMs and https://portal.nersc.gov/GEM. Genome-scale metabolic models for the nonredundant, high-quality GEMs are summarized at 10.25982/53247.64/1670777 and available in KBase (https://narrative.kbase.us/#org/jgimags). IMG/M identifiers of all metagenomes binned, including detailed information for each metagenome, are available in Supplementary Table 1.

Code availability

The pipeline used to generate the metagenome bins is available at https://bitbucket.org/berkeleylab/metabat/src/master/.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A list of authors and their affiliations appears at the end of the paper.

Change history

4/1/2021

A Correction to this paper has been published: 10.1038/s41587-021-00898-4

11/18/2020

A Correction to this paper has been published: 10.1038/s41587-020-00769-4

Contributor Information

Emiley A. Eloe-Fadrosh, Email: eaeloefadrosh@lbl.gov

IMG/M Data Consortium:

Helena Abreu, Silvia G. Acinas, Eric Allen, Michelle A. Allen, Lauren V. Alteio, Gary Andersen, Alexandre M. Anesio, Graeme Attwood, Viridiana Avila-Magaña, Yacine Badis, Jake Bailey, Brett Baker, Petr Baldrian, Hazel A. Barton, David A. C. Beck, Eric D. Becraft, Harry R. Beller, J. Michael Beman, Rizlan Bernier-Latmani, Timothy D. Berry, Anthony Bertagnolli, Stefan Bertilsson, Jennifer M. Bhatnagar, Jordan T. Bird, Jeffrey L. Blanchard, Sara E. Blumer-Schuette, Brendan Bohannan, Mikayla A. Borton, Allyson Brady, Susan H. Brawley, Juliet Brodie, Steven Brown, Jennifer R. Brum, Andreas Brune, Donald A. Bryant, Alison Buchan, Daniel H. Buckley, Joy Buongiorno, Hinsby Cadillo-Quiroz, Sean M. Caffrey, Ashley N. Campbell, Barbara Campbell, Stephanie Carr, JoLynn Carroll, S. Craig Cary, Anna M. Cates, Rose Ann Cattolico, Ricardo Cavicchioli, Ludmila Chistoserdova, Maureen L. Coleman, Philippe Constant, Jonathan M. Conway, Walter P. Mac Cormack, Sean Crowe, Byron Crump, Cameron Currie, Rebecca Daly, Kristen M. DeAngelis, Vincent Denef, Stuart E. Denman, Adey Desta, Hebe Dionisi, Jeremy Dodsworth, Nina Dombrowski, Timothy Donohue, Mark Dopson, Timothy Driscoll, Peter Dunfield, Christopher L. Dupont, Katherine A. Dynarski, Virginia Edgcomb, Elizabeth A. Edwards, Mostafa S. Elshahed, Israel Figueroa, Beverly Flood, Nathaniel Fortney, Caroline S. Fortunato, Christopher Francis, Claire M. M. Gachon, Sarahi L. Garcia, Maria C. Gazitua, Terry Gentry, Lena Gerwick, Javad Gharechahi, Peter Girguis, John Gladden, Mary Gradoville, Stephen E. Grasby, Kelly Gravuer, Christen L. Grettenberger, Robert J. Gruninger, Jiarong Guo, Mussie Y. Habteselassie, Steven J. Hallam, Roland Hatzenpichler, Bela Hausmann, Terry C. Hazen, Brian Hedlund, Cynthia Henny, Lydie Herfort, Maria Hernandez, Olivia S. Hershey, Matthias Hess, Emily B. Hollister, Laura A. Hug, Dana Hunt, Janet Jansson, Jessica Jarett, Vitaly V. Kadnikov, Charlene Kelly, Robert Kelly, William Kelly, Cheryl A. 
Kerfeld, Jeff Kimbrel, Jonathan L. Klassen, Konstantinos T. Konstantinidis, Laura L. Lee, Wen-Jun Li, Andrew J. Loder, Alexander Loy, Mariana Lozada, Barbara MacGregor, Cara Magnabosco, Aline Maria da Silva, R. Michael McKay, Katherine McMahon, Chris S. McSweeney, Mónica Medina, Laura Meredith, Jessica Mizzi, Thomas Mock, Lily Momper, Mary Ann Moran, Connor Morgan-Lang, Duane Moser, Gerard Muyzer, David Myrold, Maisie Nash, Camilla L. Nesbø, Anthony P. Neumann, Rebecca B. Neumann, Daniel Noguera, Trent Northen, Jeanette Norton, Brent Nowinski, Klaus Nüsslein, Michelle A. O’Malley, Rafael S. Oliveira, Valeria Maia de Oliveira, Tullis Onstott, Jay Osvatic, Yang Ouyang, Maria Pachiadaki, Jacob Parnell, Laila P. Partida-Martinez, Kabir G. Peay, Dale Pelletier, Xuefeng Peng, Michael Pester, Jennifer Pett-Ridge, Sari Peura, Petra Pjevac, Alvaro M. Plominsky, Anja Poehlein, Phillip B. Pope, Nikolai Ravin, Molly C. Redmond, Rebecca Reiss, Virginia Rich, Christian Rinke, Jorge L. Mazza Rodrigues, William Rodriguez-Reillo, Karen Rossmassler, Joshua Sackett, Ghasem Hosseini Salekdeh, Scott Saleska, Matthew Scarborough, Daniel Schachtman, Christopher W. Schadt, Matthew Schrenk, Alexander Sczyrba, Aditi Sengupta, Joao C. Setubal, Ashley Shade, Christine Sharp, David H. Sherman, Olga V. Shubenkova, Isabel Natalia Sierra-Garcia, Rachel Simister, Holly Simon, Sara Sjöling, Joan Slonczewski, Rafael Soares Correa de Souza, John R. Spear, James C. Stegen, Ramunas Stepanauskas, Frank Stewart, Garret Suen, Matthew Sullivan, Dawn Sumner, Brandon K. Swan, Wesley Swingley, Jonathan Tarn, Gordon T. Taylor, Hanno Teeling, Memory Tekere, Andreas Teske, Torsten Thomas, Cameron Thrash, James Tiedje, Claire S. Ting, Benjamin Tully, Gene Tyson, Osvlado Ulloa, David L. Valentine, Marc W. Van Goethem, Jean VanderGheynst, Tobin J. Verbeke, John Vollmers, Aurèle Vuillemin, Nicholas B. Waldo, David A. Walsh, Bart C. Weimer, Thea Whitman, Paul van der Wielen, Michael Wilkins, Timothy J. 
Williams, Ben Woodcroft, Jamie Woolet, Kelly Wrighton, Jun Ye, Erica B. Young, Noha H. Youssef, Feiqiao Brian Yu, Tamara I. Zemskaya, and Ryan Ziels

Supplementary information is available for this paper at 10.1038/s41587-020-0718-6.

References

  • 1.Tyson GW, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340. [DOI] [PubMed] [Google Scholar]
  • 2.Tully BJ, Graham ED, Heidelberg JF. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data. 2018;5:170203. doi: 10.1038/sdata.2017.203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Stewart RD, et al. Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat. Commun. 2018;9:870. doi: 10.1038/s41467-018-03317-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Pasolli E, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography and lifestyle. Cell. 2019;176:649–662. doi: 10.1016/j.cell.2019.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Almeida, A. et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019). [DOI] [PMC free article] [PubMed]
  • 6.Nayfach S, et al. New insights from uncultivated genomes of the global human gut microbiome. Nature. 2019;568:505–510. doi: 10.1038/s41586-019-1058-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Castelle CJ, et al. Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nat. Commun. 2013;4:2120. doi: 10.1038/ncomms3120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Anantharaman K, et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 2016;7:13219. doi: 10.1038/ncomms13219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Brown CT, et al. Unusual biology across a group comprising more than 15% of domain bacteria. Nature. 2015;523:208–211. doi: 10.1038/nature14486. [DOI] [PubMed] [Google Scholar]
  • 10.Parks DH, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2017;2:1533–1542. doi: 10.1038/s41564-017-0012-7. [DOI] [PubMed] [Google Scholar]
  • 11.Zhu Q, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains bacteria and archaea. Nat. Commun. 2019;10:5477. doi: 10.1038/s41467-019-13443-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Rinke C, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499:431–437. doi: 10.1038/nature12352. [DOI] [PubMed] [Google Scholar]
  • 13.Harrington LB, et al. A thermostable Cas9 with increased lifetime in human plasma. Nat. Commun. 2017;8:1424. doi: 10.1038/s41467-017-01408-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Woodcroft BJ, et al. Genome-centric view of carbon processing in thawing permafrost. Nature. 2018;560:49–54. doi: 10.1038/s41586-018-0338-1. [DOI] [PubMed] [Google Scholar]
  • 15.Ji M, et al. Atmospheric trace gases support primary production in Antarctic desert surface soil. Nature. 2017;552:400–403. doi: 10.1038/nature25014. [DOI] [PubMed] [Google Scholar]
  • 16.Soo RM, et al. On the origins of oxygenic photosynthesis and aerobic respiration in Cyanobacteria. Science. 2017;355:1436–1440. doi: 10.1126/science.aal3794. [DOI] [PubMed] [Google Scholar]
  • 17.Martijn J, et al. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature. 2018;557:101–105. doi: 10.1038/s41586-018-0059-5. [DOI] [PubMed] [Google Scholar]
  • 18.Spang, A., Caceres, E. F. & Ettema, T. J. G. Genomic exploration of the diversity, ecology and evolution of the archaeal domain of life. Science 357, eaaf3883 (2017). [DOI] [PubMed]
  • 19.Bowers RM, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 2017;35:725–731. doi: 10.1038/nbt.3893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Maistrenko OM, et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 2020;14:1247–1259. doi: 10.1038/s41396-020-0600-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Nayfach S, et al. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 2016;26:1612–1625. doi: 10.1101/gr.201863.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Howe AC, et al. Tackling soil diversity with the assembly of large, complex metagenomes. Proc. Natl Acad. Sci. USA. 2014;111:4904–4909. doi: 10.1073/pnas.1402564111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.van der Walt AJ, et al. Assembling metagenomes, one community at a time. BMC Genomics. 2017;18:521. doi: 10.1186/s12864-017-3918-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Rodriguez, R. L., et al. Nonpareil 3: fast estimation of metagenomic coverage and sequence diversity. mSystems 3, e00039-18 (2018). [DOI] [PMC free article] [PubMed]
  • 25.Sczyrba A, et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods. 2017;14:1063–1071. doi: 10.1038/nmeth.4458. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Rossello-Mora R, Amann R. The species concept for prokaryotes. FEMS Microbiol. Rev. 2001;25:39–67. doi: 10.1016/S0168-6445(00)00040-1. [DOI] [PubMed] [Google Scholar]
  • 27.Konstantinidis KT, Tiedje JM. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 2005;187:6258–6264. doi: 10.1128/JB.187.18.6258-6264.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Richter M, Rossello-Mora R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA. 2009;106:19126–19131. doi: 10.1073/pnas.0906412106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Chaumeil, P. A., et al. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics btz848 (2019). [DOI] [PMC free article] [PubMed]
  • 30.Parks, D. H., et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018). [DOI] [PubMed]
  • 31.Probst AJ, et al. Differential depth distribution of microbial function and putative symbionts through sediment-hosted aquifers in the deep terrestrial subsurface. Nat. Microbiol. 2018;3:328–336. doi: 10.1038/s41564-017-0098-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Vavourakis CD, et al. A metagenomics roadmap to the uncultured genome diversity in hypersaline soda lake sediments. Microbiome. 2018;6:168. doi: 10.1186/s40168-018-0548-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Dombrowski N, Teske AP, Baker BJ. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat. Commun. 2018;9:4999. doi: 10.1038/s41467-018-07418-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Mukherjee S, et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 2017;35:676–683. doi: 10.1038/nbt.3886. [DOI] [PubMed] [Google Scholar]
  • 35.Wu D, et al. A phylogeny-driven genomic encyclopaedia of bacteria and archaea. Nature. 2009;462:1056–1060. doi: 10.1038/nature08656. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Human Microbiome Jumpstart Reference Strains Consortium A catalog of reference genomes from the human microbiome. Science. 2010;328:994–999. doi: 10.1126/science.1183605. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Poyet M, et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nat. Med. 2019;25:1442–1452. doi: 10.1038/s41591-019-0559-3. [DOI] [PubMed] [Google Scholar]
  • 38.Pachiadaki MG, et al. Charting the complexity of the marine microbiome through single-cell genomics. Cell. 2019;179:1623–1635. doi: 10.1016/j.cell.2019.11.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Yuan C, et al. Reconstructing 16S rRNA genes in metagenomic data. Bioinformatics. 2015;31:i35–i43. doi: 10.1093/bioinformatics/btv231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Lynch MD, Neufeld JD. Ecology and exploration of the rare biosphere. Nat. Rev. Microbiol. 2015;13:217–229. doi: 10.1038/nrmicro3400. [DOI] [PubMed] [Google Scholar]
  • 41.Arkin AP, et al. KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat. Biotechnol. 2018;36:566–569. doi: 10.1038/nbt.4163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Chen IA, et al. IMG/M v5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Res. 2019;47:D666–D677. doi: 10.1093/nar/gky901. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Borrel G, et al. Wide diversity of methane and short-chain alkane metabolisms in uncultured archaea. Nat. Microbiol. 2019;4:603–613. doi: 10.1038/s41564-019-0363-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Hua ZS, et al. Insights into the ecological roles and evolution of methyl-coenzyme M reductase-containing hot spring archaea. Nat. Commun. 2019;10:4574. doi: 10.1038/s41467-019-12574-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Evans PN, et al. Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science. 2015;350:434–438. doi: 10.1126/science.aac7745. [DOI] [PubMed] [Google Scholar]
  • 46.Wang Y, et al. Expanding anaerobic alkane metabolism in the domain of archaea. Nat. Microbiol. 2019;4:595–602. doi: 10.1038/s41564-019-0364-2. [DOI] [PubMed] [Google Scholar]
  • 47.Mori M, Roest HJ. Farming, Q fever and public health: agricultural practices and beyond. Arch. Public Health. 2018;76:2. doi: 10.1186/s13690-017-0248-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Weber MM, et al. Identification of Coxiella burnetii type IV secretion substrates required for intracellular replication and Coxiella-containing vacuole formation. J. Bacteriol. 2013;195:3914–3924. doi: 10.1128/JB.00071-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Kautsar, S. A. et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 48, D454–D458 (2020). [DOI] [PMC free article] [PubMed]
  • 50.Crits-Christoph A, et al. Novel soil bacteria possess diverse genes for secondary-metabolite biosynthesis. Nature. 2018;558:440–444. doi: 10.1038/s41586-018-0207-y. [DOI] [PubMed] [Google Scholar]
  • 51.Blin K, et al. antiSMASH 5.0: updates to the secondary-metabolite genome mining pipeline. Nucleic Acids Res. 2019;47:W81–W87. doi: 10.1093/nar/gkz310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Palaniappan, K. et al. IMG-ABC v5.0: an update to the IMG/Atlas of Biosynthetic Gene Clusters Knowledgebase. Nucleic Acids Res. 48, D422–D430 (2019). [DOI] [PMC free article] [PubMed]
  • 53.Meleshko D, et al. BiosyntheticSPAdes: reconstructing biosynthetic gene clusters from assembly graphs. Genome Res. 2019;29:1352–1362. doi: 10.1101/gr.243477.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Herrmann J, Fayad AA, Muller R. Natural products from myxobacteria: novel metabolites and bioactivities. Nat. Prod. Rep. 2017;34:135–160. doi: 10.1039/C6NP00106H. [DOI] [PubMed] [Google Scholar]
  • 55.Trubl, G. et al. Soil viruses are underexplored players in ecosystem carbon processing. mSystems, 3, e00076-18 (2018). [DOI] [PMC free article] [PubMed]
  • 56.Paez-Espino D, et al. IMG/VR v2.0: an integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 2019;47:D678–D686. doi: 10.1093/nar/gky1127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Mukherjee S, et al. Genomes OnLine database (GOLD) v7: updates and new features. Nucleic Acids Res. 2019;47:D649–D659. doi: 10.1093/nar/gky977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Roux S, et al. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3:e985. doi: 10.7717/peerj.985. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Kauffman KM, et al. A major lineage of non-tailed dsDNA viruses as unrecognized killers of marine bacteria. Nature. 2018;554:118–122. doi: 10.1038/nature25474. [DOI] [PubMed] [Google Scholar]
  • 60.Krupovic M, Koonin EV. Multiple origins of viral capsid proteins from cellular ancestors. Proc. Natl Acad. Sci. USA. 2017;114:E2401–E2410. doi: 10.1073/pnas.1621061114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Schloss, P. D. et al. Status of the archaeal and bacterial census: an update. mBio17, e002001-16 (2016). [DOI] [PMC free article] [PubMed]
  • 62.Huntemann M, et al. The standard operating procedure of the DOE-JGI metagenome annotation pipeline (MAP v4) Stand. Genomic Sci. 2016;11:17. doi: 10.1186/s40793-016-0138-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Li H, Durbin R. Fast and accurate short-read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Li H, et al. The sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Kang DD, et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3:e1165. doi: 10.7717/peerj.1165. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Kielbasa SM, et al. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21:487–493. doi: 10.1101/gr.113985.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells and metagenomes. Genome Res. 2015;25:1043–1055. doi: 10.1101/gr.186072.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. doi: 10.1093/nar/25.5.955. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25:1335–1337. doi: 10.1093/bioinformatics/btp157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Kalvari I, et al. Rfam 13.0: shifting to a genome-centric resource for noncoding RNA families. Nucleic Acids Res. 2018;46:D335–D342. doi: 10.1093/nar/gkx1038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.O’Leary NA, et al. Reference sequence database at NCBI: current status, taxonomic expansion and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Marcais G, et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 2018;14:e1005944. doi: 10.1371/journal.pcbi.1005944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Varghese NJ, et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 2015;43:6761–6771. doi: 10.1093/nar/gkv657. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Ondov BD, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Wu D, Jospin G, Eisen JA. Systematic identification of gene families for use as ‘markers’ for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. PLoS ONE. 2013;8:e77033. doi: 10.1371/journal.pone.0077033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Deorowicz S, Debudaj-Grabysz A, Gudys A. FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci. Rep. 2016;6:33964. doi: 10.1038/srep33964. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE, 5, e9490 (2010). [DOI] [PMC free article] [PubMed]
  • 81.Bland C, et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics. 2007;8:209. doi: 10.1186/1471-2105-8-209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Camacho C, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, et al. A computational framework to explore large-scale biosynthetic diversity. Nat. Chem. Biol. 2020;16:60–68. doi: 10.1038/s41589-019-0400-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Edgar RC. PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics. 2007;8:18. doi: 10.1186/1471-2105-8-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Roux S, et al. Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes. Nat. Microbiol. 2019;4:1895–1906. doi: 10.1038/s41564-019-0510-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Nguyen LT, et al. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplementary Information (8MB, pdf): Supplementary Text, Figs. 1–13 and References.

Reporting Summary (2.6MB, pdf)

Supplementary Tables (32MB, xlsx): Supplementary Tables 1–13 and 15–18.

Supplementary Table 14 (8.1MB, xlsx): Genome-scale metabolic models in KBase.

Data Availability Statement

All metagenomic data, bins and annotations are available through the IMG/M portal (https://img.jgi.doe.gov/). Bulk download for the 52,515 MAGs is available at https://genome.jgi.doe.gov/GEMs and https://portal.nersc.gov/GEM. Genome-scale metabolic models for the nonredundant, high-quality GEMs are summarized at doi: 10.25982/53247.64/1670777 and are available in KBase (https://narrative.kbase.us/#org/jgimags). IMG/M identifiers of all binned metagenomes, along with detailed information for each metagenome, are available in Supplementary Table 1.

The pipeline used to generate the metagenome bins is available at https://bitbucket.org/berkeleylab/metabat/src/master/.


Articles from Nature Biotechnology are provided here courtesy of Nature Publishing Group
