Abstract
Soil harbors a vast expanse of unidentified microbes, termed microbial dark matter, which represents an untapped reservoir of microbial biodiversity and genetic resources that has yet to be fully explored. In this study, we conduct a large-scale excavation of soil microbial dark matter by reconstructing 40,039 metagenome-assembled genome bins (the SMAG catalogue) from 3304 soil metagenomes. We identify 16,530 of 21,077 species-level genome bins (SGBs) as unknown SGBs (uSGBs), which expand archaeal and bacterial diversity across the tree of life. We also illustrate the pivotal role of uSGBs in augmenting the soil microbiome’s functional landscape and intra-species genome diversity, providing large proportions of the 43,169 biosynthetic gene clusters and 8545 CRISPR-Cas genes. Additionally, we determine that uSGBs contributed 84.6% of previously unexplored viral-host associations from the SMAG catalogue. The SMAG catalogue provides a useful genomic resource for further studies investigating soil microbial biodiversity and genetic resources.
Subject terms: Metagenomics, Soil microbiology
Soil conceals a vast realm of unexplored microbes, often referred to as the “microbial dark matter.” This hidden universe boasts a rich tapestry of microbial and genetic biodiversity. Here, the authors introduce the SMAG catalogue, comprising 40,039 metagenome-assembled genomes from 3304 soil metagenomes, and uncover 21,077 species-level genome bins.
Introduction
The soil microbiome is not only valuable as the primary regulator of soil ecosystem services but also as a source of genetic resources for human healthcare and biotechnological applications1. The majority of antibiotics currently used in human medicine were discovered from soil-living bacteria or fungi between the 1940s and 1970s2, but the golden age declined after the 1970s owing to the difficulty of cultivating unidentified bacterial species3. However, cultivation-independent approaches, e.g., rRNA gene-based survey, have confirmed that up to 99% of soil microorganisms have not been cultivated under laboratory conditions to date4. Those countless undiscovered microbes in the soil, referred to as soil microbial dark matter5, comprise enormous untapped diversity and genetic resources6. For instance, Ling et al. recently discovered teixobactin, a new antibiotic without detectable resistance, by growing uncultured bacteria from the soil with iChip7.
Genome-resolved metagenomics can yield metagenome-assembled genomes (MAGs) from contigs assembled with shotgun-sequenced short reads8, providing previously unexplored genomes of Bacteria9, Archaea10, and viruses11 for understanding the functional characteristics of uncultivated microbes. MAGs have substantially expanded the genomic catalogue for manifold environments, including a relatively limited number of soil environments12, the human gut13, animals14, the global ocean15, and other environments16,17. In addition, MAGs have increased the diversity and refined the topological structure of the tree of life, providing insights into uncultivated microbial taxa and virus-host associations, as well as promoting the discovery of genetic resources, such as biosynthetic gene clusters (BGCs)18, CRISPR19, and antiphage defense systems20.
A single gram of surface soil can contain billions of bacterial and archaeal cells and trillions of viruses21, indicating that soil microbial diversity is substantially higher than in other environments due to its high complexity and spatial heterogeneity4. However, few prior studies have focused on reconstructing MAGs from soils, owing to the challenges associated with complex soil metagenomes, which are enriched in genomes of uncultivated and undescribed microorganisms1. Most existing studies on the soil microbiome suffer from the limitations and biases of reference databases and cannot characterize microbes with high taxonomic resolution22. Several studies have recovered genomes from soil metagenomes at a small scale and across multiple systems to explore their functions and genetic resources15,23, but the myriad soil metagenomes available in public databases have not yet been mined at a global scale.
In this work, to construct an informative public resource database and explore soil microbial dark matter from metagenomes, we first reconstructed MAGs using global-scale genome-resolved metagenomics to expand the genomic catalogue of soil microbiomes and shed light on microbial dark matter in soils. We then clustered the MAGs into 21,077 SGBs and identified 16,530 uSGBs by aligning SGBs with ~500,000 reference genomes from the RefSeq database and MAGs from other studies. Intraspecific pangenome and single-nucleotide variant (SNV) profiles reveal the functional contribution of uSGBs in the soil microbiomes. Moreover, we explored BGCs and CRISPR-Cas genetic resources, confirming the considerable potential of soil microbiomes in mining genetic resources. Furthermore, we uncovered previously unexplored viral-host associations concealed in the MAGs. The SMAG catalogue contains abundant information, providing important opportunities for future broad studies focused on unraveling the ecological roles of soil microbiomes and identifying genetic resources.
Results
40,039 MAGs reconstructed from large-scale genome-resolved metagenomics
To reconstruct previously unexplored bacterial and archaeal genomes, we performed a large-scale single-sample metagenomic assembly on 3304 soil metagenomes across the globe (Fig. 1a), including 363 metagenomes from the in-house dataset and 2941 publicly available metagenomes. The soil samples were mainly collected from grassland, cultivated land, and forest (Fig. 1b). The number of reconstructed MAGs per metagenome was positively correlated with metagenome read depth (Supplementary Fig. 1a) and followed a power-law distribution (Supplementary Fig. 1b). The number of reconstructed MAGs substantially increased when the number of clean reads exceeded 10^8 (Supplementary Fig. 1a), suggesting that sequencing depth greater than this threshold would result in worthwhile gains in MAG reconstruction. We reconstructed a total of 40,039 genomes that meet or exceed the medium-quality level of the minimum information about a metagenome-assembled genome (MIMAG) standard24 (completeness ≥50% and contamination <10%), which we refer to as the SMAG catalogue (Supplementary Data 2). A total of 3641 (9.1%) of these MAGs were identified as high-quality genomes with completeness >90%, contamination <5%, and presence of the 23S, 16S, and 5S rRNA genes and at least 18 tRNAs according to recent guidelines (Fig. 1a, Supplementary Fig. 1c–e). Moreover, 5184 (13%) of the MAGs had completeness ≥90% and contamination <5% but lacked the full set of rRNA genes or carried fewer than 18 tRNAs24, largely because near-full-length rRNA gene sequences are challenging to assemble from metagenomes25, even for near-complete MAGs26. To evaluate the quality of the MAGs in the SMAG catalogue, we inferred the level of strain heterogeneity within each MAG. The median strain heterogeneity (proportion of polymorphic positions) of the high-quality SMAG catalogue was 7.14% (Supplementary Fig. 1f).
The SMAG catalogue is distinct in its exclusive focus on soil microbiomes on a global scale, which allowed us to undertake an in-depth analysis of this particular niche and expand the knowledge base on soil microbial diversity. In addition, the geographic distribution of the soil metagenomes in our study was substantially extended compared with MAG resources from other environments (Fig. 1d), surpassing both MAGs derived from the Tara Ocean project27,28 and environmentally derived MAGs from the Genomes of Earth’s Microbiomes (GEM) catalogue29, but lagging behind human-associated MAGs13,30. This highlights the challenges posed by the complex and heterogeneous soil environment4 and illustrates the necessity of constructing high-quality soil metagenomic genome reference datasets for more accurate predictions about the ecological functions of soil microbiomes.
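The quality tiers described in this section can be expressed as a small decision rule. The following is a minimal sketch, assuming per-MAG completeness and contamination estimates plus rRNA and tRNA counts are already available; the `MagStats` record and `mimag_tier` function are hypothetical names introduced for illustration and are not part of the SMAG pipeline.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    """Hypothetical per-MAG summary; field names are illustrative."""
    completeness: float   # percent
    contamination: float  # percent
    has_5s: bool
    has_16s: bool
    has_23s: bool
    n_trnas: int


def mimag_tier(m: MagStats) -> str:
    """Assign a quality tier using the thresholds quoted in the text."""
    near_complete = m.completeness > 90 and m.contamination < 5
    rna_complete = m.has_5s and m.has_16s and m.has_23s and m.n_trnas >= 18
    if near_complete and rna_complete:
        return "high"
    if m.completeness >= 50 and m.contamination < 10:
        return "medium"
    return "low"


example = MagStats(95.0, 2.0, True, True, True, 20)
print(mimag_tier(example))
```

Under this rule, a near-complete MAG that is missing an rRNA gene falls back to the medium tier, which mirrors the 5184 near-complete MAGs discussed above.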
MAG analyses retrieve 16,530 previously uncharacterized bacterial and archaeal clades
To explore taxonomic components in the SMAG catalogue, we clustered the 40,039 MAGs into 21,077 SGBs (Supplementary Data 2) based on 95% whole-genome average nucleotide identity (ANI). We annotated the taxonomy with Genome Taxonomy Database (GTDB), which is commonly used13 and considered a gold standard for defining prokaryotic species31 (Fig. 2a). These SGBs were assigned to 88 bacterial phyla and 11 archaeal phyla. The number of MAGs in the SGBs follows a power-law distribution (Supplementary Fig. 2a), suggesting that most of the SGBs comprised a few MAGs.
To identify previously unexplored soil bacterial and archaeal clades in the SMAG catalogue, we compared the MAGs from the SMAG against nearly 500,000 reference genomes, including 282,219 genomes from the RefSeq database (November 2021), 207,953 MAGs from previous studies, and 123,580 MAGs and 1710 single-cell amplified genomes (SAGs) from GenBank (November 2021). We identified 16,530 uSGBs (78.4% of SGBs) and 4567 known SGBs (22.6% of SGBs) (Fig. 2a) based on the threshold of 95% ANI and 30% alignment fraction (AF). Consistent with current knowledge of soil microbial dark matter32, we also found that most MAGs in the SMAG catalogue are uSGBs. The genome size of SGBs and the reference genome size showed a positive linear relationship (Supplementary Fig. 2b). Moreover, we found that most of the SGBs (70.8%) and uSGBs (71.4%) were singleton MAGs (Fig. 2a). The proportion of singleton MAGs in uSGBs (71.2%) was substantially higher than in known SGBs (kSGBs) (50.0%) (Fig. 2a), indicating the critical contribution of the SMAG catalogue in recovering rare species of soil microbiomes. The vast majority of SGBs were unannotated at the species level by the GTDB (18,988, 90.1%), and were barely aligned to reference genomes (14,060, 88.6% of uSGBs with <90% ANI or <10% AF compared to reference genomes).
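The known/unknown split described above reduces to a threshold test on the best reference hits. The sketch below assumes each SGB comes with a list of (ANI, AF) pairs against reference genomes, both in percent, and that the thresholds are inclusive; the function name and input layout are hypothetical and do not reflect the authors' actual tooling.

```python
from typing import Iterable, Tuple


def is_known_sgb(
    hits: Iterable[Tuple[float, float]],
    min_ani: float = 95.0,
    min_af: float = 30.0,
) -> bool:
    """Return True if any reference hit passes both ANI and AF thresholds.

    hits: (ani_percent, alignment_fraction_percent) pairs for one SGB.
    Inclusive thresholds are an assumption of this sketch.
    """
    return any(ani >= min_ani and af >= min_af for ani, af in hits)


# A hit with high ANI but tiny aligned fraction does not make an SGB "known".
print(is_known_sgb([(98.2, 12.0), (91.0, 80.0)]))  # False
print(is_known_sgb([(96.5, 45.0)]))                # True
```

An SGB with no hits at all is treated as unknown, since `any` over an empty list is false.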
To examine whether the previously unidentified uSGBs in the SMAG catalogue improve mappability for soil metagenomes, we mapped 494 metagenomes, randomly selected from the dataset used to reconstruct the SMAG catalogue, to all 40,039 MAGs. The total mapping rates (the ratio of mapped reads to total reads) ranged from 2.6 to 89% (median mapping rate = 12.5%) (Fig. 2b). Consistent with the previous study33, the contribution of uSGBs to read mapping was fivefold that of kSGBs, illustrating that uSGBs are important genomic resources for understanding soil microbial dark matter. Moreover, the mapping capacity of the SMAG catalogue was further validated by aligning 70 other soil metagenomes that were not used for reconstructing the SMAG catalogue (Fig. 2b). The genome size of all recovered MAGs ranged from 0.53 to 12.3 Mb (Supplementary Fig. 2c). Most phyla’s genome sizes were consistent between kSGBs and uSGBs, except for Armatimonadota, Bdellovibrionota, and the unclassified bacterial phylum UBA10199, for which MAG sizes of uSGBs were larger than those of kSGBs (Supplementary Fig. 2c). MAG sizes of kSGBs were consistent with the genome sizes of isolated reference genomes of the same genera (Supplementary Fig. 2b), which validates metagenome-driven strategies for mining uSGBs from the complex soil environment. The phyla with the smallest genome sizes are Patescibacteria (median = 0.78 Mb) and Thermoproteota (median = 1.60 Mb), with Patescibacteria in particular having streamlined genomes34, while Myxococcota (median = 5.10 Mb), Cyanobacteria (median = 5.07 Mb), and Planctomycetota (median = 4.79 Mb) have the largest genome sizes. The smallest high-quality MAGs (~0.53 Mb) were assigned to a previously unidentified species of the genus Buchnera, which is undergoing a reductive process towards the minimum genome needed for symbiotic life with aphids35.
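The mapping rate used in this section is the ratio of mapped reads to total reads, summarized across samples by its range and median. A minimal sketch of that calculation follows; the read counts are invented for illustration and the helper names are hypothetical.

```python
from statistics import median
from typing import Iterable, Tuple


def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Percentage of reads in one metagenome that map to the MAG set."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def summarize(samples: Iterable[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Return (min, median, max) mapping rate over (mapped, total) pairs."""
    rates = [mapping_rate(mapped, total) for mapped, total in samples]
    return min(rates), median(rates), max(rates)


# Illustrative counts only, not data from the study.
toy_samples = [(26, 1000), (125, 1000), (890, 1000)]
print(summarize(toy_samples))
```

The median is used rather than the mean because mapping rates across soil samples are strongly skewed, as the 2.6 to 89% range suggests.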
Next, we built the phylogenetic tree of 21,077 SGBs (Fig. 2d), showing that the bacterial and archaeal diversity across the tree of life was expanded by uSGB genomes from the SMAG catalogue. The proportions of uSGBs in the eight most dominant bacterial phyla (>75%) were greater than those in most of the rare phyla except Planctomycetota, Armatimonadota, and Eremiobacterota (>80%) (Fig. 2e), demonstrating the challenges in assembling the rare biosphere36. Although the MAGs were assembled from their corresponding soil samples as dominant taxa, they were rare in most of the other soil samples. These MAGs may therefore provide valuable reference genomic resources for deciphering the potential functions of the rare biosphere in soil microbiomes (Fig. 2f).
Based on values of relative evolutionary divergence (RED)37 in the GTDB (release 202) annotation (Supplementary Fig. 2d), we further identified previously unidentified lineages at higher taxonomic ranks. In total, we determined 6392 unannotated genus-level genome bins (uGGBs), 1166 unannotated family-level genome bins (uFGBs), 258 unannotated order-level genome bins (uOGBs), and 31 unannotated class-level genome bins (uCGBs) with GTDB-Tk. Two bacterial SGBs were potentially unannotated phylum-level genome bins (uPGBs), with completeness and contamination of (90.65%, 2.44%) and (90.96%, 1.10%), respectively, which illustrates the underestimated diversity of soil microbial dark matter and highlights the pressing need for continued exploration of the soil microbiome. This classification rests on the concatenated protein phylogeny used as the basis for bacterial taxonomy37, which improved the classification of uncultured microorganisms in the SMAG catalogue. However, the rarefaction curves reveal obvious unsaturation at the species rank in the SMAG catalogue (Supplementary Fig. 2), indicating that additional previously uncharacterized lineages are yet to be discovered at the species rank.
Functional landscape and intraspecies genomic diversity
To better understand the functional landscape of soil microbiota, we predicted full-length putative protein sequences from the 5184 high-quality MAGs (>90% completeness, <5% contamination) of the SMAG catalogue. We then performed an in-depth functional annotation of those gene clusters with the eggNOG database38 (v5.0). We identified 41 KEGG pathways, most of which were enriched in uSGBs (Supplementary Fig. 3a), including pathways related to polyketide synthesis and disease-associated pathways. Based on the KEGG enrichment analysis, many phyla were annotated only by functional enrichment from uSGBs in the SMAG catalogue, especially Asgardarchaeota, Krumholzibacteriota, and Tectomicrobia (Fig. 3a). In addition, for the COG functional categories, we found that the “Function unknown” category was over-represented in the SMAG catalogue, and in uSGBs in particular (Supplementary Fig. 3b), providing evidence that uSGBs substantially expanded the functional landscape.
Core genes are shared by all strains and are involved in basic biological processes, such as gene expression, energy production, and amino acid metabolism. Accessory genes are genes specific to certain genomes. To explore the intraspecific genomic diversity of the SMAG, we generated 107 pangenomes for 2200 SGBs with >10 high-quality MAGs by clustering protein sequences from all conspecific genomes at 90% amino acid identity, which was used to define a “core” genome13. Open pangenomes grow in size as more individual genomes are added39. To assess the openness of the SMAG pangenomes, we examined their sizes: the largest pangenome reached 20,926,893 bp with 14 conspecific genomes, the average pangenome length was 6,699,815 bp, and almost 40% of the pangenomes were larger than the average (Supplementary Fig. 3c, Supplementary Data 3). The proportion of core genes decreased with the number of conspecific genomes and with genome size (Supplementary Fig. 3d, e), which is consistent with previous studies on a limited number of strains and species and is attributed to the addition of duplicated genes40.
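The core-gene proportion discussed above can be illustrated with a gene-cluster presence table. This is a minimal sketch, assuming protein clusters (for example at 90% amino acid identity) have already been computed and that a cluster counts as core when it occurs in every conspecific genome; the `min_fraction` parameter and all names are hypothetical, and the study may have used a different prevalence cutoff.

```python
from typing import Dict, Set


def core_fraction(
    cluster_to_genomes: Dict[str, Set[str]],
    n_genomes: int,
    min_fraction: float = 1.0,
) -> float:
    """Share of gene clusters present in at least min_fraction of genomes."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    if not cluster_to_genomes:
        return 0.0
    n_core = sum(
        1
        for genomes in cluster_to_genomes.values()
        if len(genomes) / n_genomes >= min_fraction
    )
    return n_core / len(cluster_to_genomes)


# Toy pangenome with three genomes and four gene clusters.
toy = {
    "cluster_1": {"mag_a", "mag_b", "mag_c"},  # core
    "cluster_2": {"mag_a"},                    # accessory
    "cluster_3": {"mag_a", "mag_b", "mag_c"},  # core
    "cluster_4": {"mag_b", "mag_c"},           # accessory
}
print(core_fraction(toy, n_genomes=3))  # 0.5
```

Adding a fourth genome that lacks `cluster_3` would lower the core fraction, which is the mechanism behind the decline in core-gene proportion as more conspecific genomes are included.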
The proportion of core genes varied across different phyla (Fig. 3b). Species from Verrucomicrobiota and Nitrospirota showed the highest and the lowest proportion of core genes, respectively (Fig. 3b), which is mainly due to their species ubiquity41. Verrucomicrobiota is generally among the most abundant taxa in soil, and its high proportion of core genes suggests a closed pangenome and implies a critical role in fundamental functions in soils. Conversely, Nitrospirota, which is frequently observed in wastewater habitats42, had the lowest proportion of core genes, suggesting a highly open pangenome that enables high environmental adaptability43.
To investigate the functional divergence between core and accessory genes, we compared the proportion of genes annotated with eggNOG in core and accessory genes. The core genes were better annotated than accessory genes based on all five databases (Wilcoxon test, P < 0.001), and the proportions of core genes of uSGBs annotated with eggNOG (Wilcoxon test, P = 0.054), KEGG (Wilcoxon test, P = 0.0005), and GO (Wilcoxon test, P = 0.041) were lower than those of kSGBs (Fig. 3c). Thereafter, we investigated functional enrichment in the core and accessory genes based on the eggNOG functional annotations. Significance was calculated with a two-tailed Wilcoxon rank-sum test and further adjusted for multiple comparisons using the Benjamini–Hochberg correction. A positive effect size (Cohen’s d) indicates that the function is predominantly represented in core genes. The core genes were significantly enriched (adjusted P < 0.001) in genetic information processing and key metabolic functions such as Carotenoid biosynthesis and Phosphotransferase system (PTS). In contrast, a great number of accessory genes were overrepresented in various secondary metabolite pathways, such as Biosynthesis of enediyne antibiotics, Aurachin biosynthesis, and Novobiocin biosynthesis, with large effect sizes (Cohen’s d > 2), indicating the important role of accessory genes in defense activities (Fig. 3d). A similar tendency was found in the COG analysis. Core genes were predominantly represented in basic cellular processes such as Amino acid transport and metabolism. In contrast, more accessory genes were related to environmental adaptation and inter-strain differences, being predominantly represented in Secondary metabolites biosynthesis, transport and catabolism, and Defense mechanisms. Moreover, a much greater proportion of accessory genes fell into poorly characterized COGs without a known function (Fig. 3e).
These results provide a functional landscape difference between core and accessory genes identified from the SMAG catalogue by pangenome analysis.
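The enrichment analysis above relies on two standard statistics: Benjamini–Hochberg adjusted P values and Cohen’s d. The sketch below implements both in plain Python for illustration. It assumes the pooled-standard-deviation form of Cohen’s d, which the text does not specify, and the input values are invented.

```python
from statistics import mean, variance
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d with a pooled standard deviation (an assumption here)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Illustrative numbers only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(cohens_d([2, 4, 6], [1, 3, 5]))  # positive: first group is higher
```

With core-gene values passed as the first argument, a positive d corresponds to a function that is more strongly represented in core genes, matching the sign convention stated in the text.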
To profile the intra-species variation of the SMAG catalogue, we investigated intraspecies single-nucleotide variants (SNVs) within SGBs with ≥3 MAGs. We detected 582,519,530 SNVs from 2448 SGBs with at least three conspecific MAGs (Fig. 3e, Supplementary Data 3). Of these SNVs, 326,163,258 (56%) remained after filtering (excluding synonymous mutations); among the filtered SNVs, 174,868,789 (53.6%) were found exclusively in uSGBs and 151,294,469 (46.4%) were detected exclusively in kSGBs (Fig. 3e), indicating a large number of previously undiscovered SNVs in the SMAG catalogue. We also assigned the detected SNVs to the kSGBs and uSGBs across different phyla. Notably, we observed a divergence in the density of SNVs between kSGBs and uSGBs across most dominant phyla (Fig. 3g, Supplementary Fig. 3f). In addition, a majority of the phyla exhibited relatively low pN/pS ratios (pN/pS < 1) (Fig. 3h and Supplementary Data 3). This suggests that the evolution of soil microorganisms might be more influenced by long-term purifying selection and drift than by rapid adaptation to specific environments44. Species from Patescibacteria, which possess the smallest genome sizes, displayed the lowest SNV density coupled with the highest pN/pS ratios, possibly owing to their reduced redundant and non-essential functions, which enable them to maintain community stability34. These findings suggest that the SMAG catalogue encompasses a substantial number of intraspecific SNVs. The observed variations in SNV density and pN/pS ratios across different phyla underscore the diverse niche widths of these species and their varying capacities to acquire and allocate soil resources45.
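The pN/pS ratio referred to above compares the rate of nonsynonymous polymorphism per nonsynonymous site with the rate of synonymous polymorphism per synonymous site. The following sketch shows the arithmetic under the assumption that SNV counts and site counts per genome are already known; the function name and the toy numbers are hypothetical, and the study's own tooling may normalize differently.

```python
from typing import Optional


def pn_ps(
    nonsyn_snvs: int,
    syn_snvs: int,
    nonsyn_sites: float,
    syn_sites: float,
) -> Optional[float]:
    """Return pN/pS, or None when it is undefined (no synonymous SNVs)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    p_n = nonsyn_snvs / nonsyn_sites  # nonsynonymous SNVs per nonsynonymous site
    p_s = syn_snvs / syn_sites        # synonymous SNVs per synonymous site
    if p_s == 0:
        return None
    return p_n / p_s


# Toy genome: ratio below 1 is the pattern expected under purifying selection.
ratio = pn_ps(nonsyn_snvs=30, syn_snvs=40, nonsyn_sites=3000, syn_sites=1000)
print(ratio)
```

Normalizing by site counts matters because most random mutations in coding sequence are nonsynonymous, so raw SNV counts alone would overstate the nonsynonymous share.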
Broad secondary metabolite biosynthetic potential
Microbial genomes encode biosynthetic gene clusters (BGCs) that produce natural secondary metabolites, offering vast potential for discovering ecologically and biotechnologically relevant enzymes and biochemical compounds. In addition to exploring BGCs from cultivated microorganisms46, many studies have employed metagenomic data mining to survey BGCs for drug discovery47 and microbiome ecology studies9. Given the tremendous microbial diversity in soil ecosystems, the SMAG catalogue offers an important resource for mining BGCs for natural product development and drug synthesis. We identified 70,081 putative BGCs, of which 69,990 were annotated with one or more BGC types. The number of BGCs identified from the 21,077 representative MAGs of the SMAG catalogue is 36 times that in the manually curated Minimum Information about a Biosynthetic Gene cluster (MIBiG) dataset46. After filtering for contigs ≥5 kb, 43,169 BGCs were categorized into eight groups (Supplementary Data 4), most of which were identified from uSGBs (Fig. 4a). Non-ribosomal peptide synthetase (NRPS) clusters, which encode the multienzyme machinery needed to assemble numerous antibacterial peptides (such as penicillin)48, were the most numerous, with a total of 10,277 (23.8%) BGCs encoded by 49 phyla. We also identified 9632 (22.3%) BGCs synthesizing ribosomally synthesized and post-translationally modified peptides (RiPPs) from 69 phyla, 7671 (17.8%) terpene gene clusters from 45 phyla, 1790 (4.1%) type I polyketide synthase (PKSI) clusters from 28 phyla, and 1664 (3.9%) PKS–NRPS hybrid gene clusters from 23 phyla (Fig. 4c).
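As a quick consistency check, the percentages quoted above follow directly from the per-class counts and the 43,169 filtered BGCs. The short sketch below reproduces that arithmetic; only numbers stated in the text are used, and the dictionary keys are informal labels.

```python
# Counts of filtered BGCs per class, as reported in the text.
bgc_counts = {
    "NRPS": 10277,
    "RiPPs": 9632,
    "terpene": 7671,
    "PKSI": 1790,
    "PKS-NRPS hybrid": 1664,
}
total_filtered_bgcs = 43169

# Share of each class among all filtered BGCs, rounded to one decimal place.
bgc_shares = {
    name: round(100 * count / total_filtered_bgcs, 1)
    for name, count in bgc_counts.items()
}

for name, share in bgc_shares.items():
    print(f"{name}: {share}%")

# The five named classes cover only part of the eight groups.
remaining = total_filtered_bgcs - sum(bgc_counts.values())
print(f"other groups: {remaining} BGCs")
```

The five named classes account for 31,034 BGCs, so the remaining 12,135 fall into the other groups of the eight-group classification.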
We then assessed the biosynthetic potential of the dominant phyla (Fig. 4b). Consistent with the GEM catalogue29 and glacier catalogue9, Proteobacteria possess the greatest biosynthetic potential, with 1439 NRPS, 2153 RiPPs, 2052 terpene, and 3216 other BGCs encoded by 6774 proteobacterial MAGs, followed by Actinobacteriota with 5376 MAGs encoding 9575 BGCs. Furthermore, we identified a total of 9119 BGCs encoded by 4781 Acidobacteriota MAGs, with one MAG from an unannotated genus of family UBA5704 encoding 111 NRPS or PKS modules with clear colinear module chains (Fig. 4d). In addition, we identified high biosynthetic potential for Gemmatimonadota (2633 regions across 1300 MAGs), indicating that the biosynthetic potential of these lineages may have been underestimated. Although most identified BGCs were fragmented (Supplementary Fig. 4a), we identified 742 regions with a length >50 kb and 4772 regions >30 kb. Five NRPS clusters with a length >100 kb (Supplementary Fig. 4b–f) were all identified from uSGBs. The largest BGC in the SMAG catalogue (270,820 bp) was identified from genus UBA5704 of Acidobacteriota (Fig. 4e), while the largest BGC from GEM (275,339 bp) was identified from the same genus but with a core biosynthetic sequence identity range of 0–52.48% (Supplementary Data 4). We found that both BGCs were mainly involved in amino acid metabolism, but SMAG_BGC exhibited an additional involvement in carbohydrate metabolism and environmental information processing based on the KO assignment results (Fig. 4f). Taken together, these results suggest that the SMAG catalogue can serve as a valuable resource for the discovery of new drugs and therapeutics.
CRISPR and Cas protein genetic resources
Microbes rely on diverse defense mechanisms that allow them to withstand viral predation and exposure to foreign DNA. Many Bacteria and Archaea possess clustered regularly interspaced short palindromic repeats (CRISPR) together with CRISPR-associated (Cas) genes, called CRISPR–Cas systems, to prevent viral infection49. Spacers are the regions at the leader end of the CRISPR array, 24–48 nucleotides in length50, that are transcribed and processed into CRISPR RNAs (crRNAs) for the microbes’ “immune” system51. The SMAG catalogue affords a significant opportunity to explore the diversity of Cas protein resources for engineering and synthesizing efficient gene editing tools.
To profile the diversity of genetic resources associated with CRISPR-Cas systems in the SMAG catalogue, we characterized spacers and Cas genes by predicting open reading frames (ORFs) and aligning them to Cas proteins in the National Center for Biotechnology Information (NCBI) database. In total, we identified 1142 spacers from 662 MAGs (Supplementary Fig. 5a, Supplementary Data 5), on average 0.40 ± 0.35 (mean ± SD) spacer sequences per Mb of genomic length. The number of spacers per MAG followed a scale-free distribution (Fig. 5a): the majority (454) of MAGs possessed only one spacer sequence, and a few MAGs possessed more than 10 spacer sequences, indicating their potential ability to defend against viral infection. The number of spacers did not increase with genome size (Fig. 5b). MAGs with a genome size of ~5.5 Mb, from either kSGBs or uSGBs, possessed the highest number of spacers, with no difference between kSGBs and uSGBs (Supplementary Fig. 5b). Spacer loads differed significantly across phyla (Fig. 5c), with the highest density of spacer loads for Cyanobacteria and Proteobacteria, on account of the largest number of genomes, and the lowest density for Firmicutes_E and Myxococcota. The largest number of spacers was found in a cyanobacterial MAG reconstructed from grassland that possessed 20 spacer sequences. This could be explained by the fact that Cyanobacteria are the only bacteria with a rich number of transposable elements and transposase genes involved in the complex differentiation process52.
We further quantified 8545 Cas-associated genes from 563 MAGs, with an average of 281.2 ± 219.6 Cas-associated genes per MAG (Supplementary Fig. 5c, Supplementary Data 5). Cas1 and Cas2 are highly conserved and are generally regarded as universal markers for CRISPR-Cas systems53; they were the most widely known proteins identified from the SMAG catalogue. Approximately 200 Cas2 genes were identified; these encode small putative nucleases (80–120 aa) and are considered a second marker for CRISPR-Cas systems54. Notably, we identified 42 Cas9 genes, which could potentially be engineered into powerful genome editing tools55. In total, 245 MAGs (43.5%) possessed fewer than 10 Cas-associated genes (Fig. 5d), and only 1611 Cas-associated genes (18.8%) could be assigned to specific Cas gene types (Fig. 5g, Supplementary Fig. 5d). The collection of Cas protein family profiles is a resource for the identification of CRISPR–Cas systems3, which also illustrates the necessity and importance of mining the soil microbiome.
The MAG with the largest number of spacers (20) also possessed a large number of Cas-associated genes (238) (Supplementary Fig. 5a, c). Consistent with the spacer results, MAGs with a genome size of ~5.5 Mb, from either kSGBs or uSGBs, had the highest numbers of Cas-associated genes (Fig. 5e). The density of Cas-associated genes in genomes varied across phyla (Fig. 5f), with the highest density in Patescibacteria and Firmicutes_C, and the lowest density in Myxococcota and Desulfobacterota_F. uSGBs expanded the profiles of Cas protein resources in many phyla, such as Verrucomicrobiota, Armatimonadota, Gemmatimonadota, Fusobacteriota, and Desulfobacterota_B (Fig. 5h), indicating that uSGBs offer important information about Cas proteins from the soil microbiome. This also demonstrates the utility of metagenomic mining for the development of gene editing tools.
Connecting MAGs to virus-host associations
Previously uncharacterized MAGs help to improve the prediction of virus-host associations, which are crucial for understanding the roles and impacts of viruses in natural ecosystems. In this study, 21,510 virus-host associations were identified by predicting prophages (Supplementary Data 6). Those prophages can be clustered into 257 clusters at the family level. The predicted virus-host associations were mainly contributed by Actinobacteria (8116), followed by Proteobacteria (3468), Acidobacteria (3162), and Thermoproteota (2310) (Fig. 6a). uSGBs contributed 84.6% of the virus–host associations (Fig. 6b), suggesting that uSGBs from the SMAG catalogue considerably expand our understanding of virus-host associations.
To explore the host phylogenetic ranges of viruses, we analyzed the host taxa of 76 generalist viruses with >25 predicted hosts (the term “generalist” refers to the potential host range of a virus). Many studies indicate that viruses can alter host metabolic processes and participate in element cycling in the soil through a variety of auxiliary metabolic genes56. Most of those generalist viruses were predicted mainly from uSGBs, indicating that uSGBs from the SMAG catalogue reveal a great deal of previously unexplored virus–host associations involved in the geochemical cycling of soil elements globally. However, the proportion of kSGB hosts was >88% for acidobacteriotal virus GSV_39462, proteobacterial virus GSV_66726, and patescibacterial virus GSV_270. Almost all of those generalist viruses had predicted hosts from a single phylum, except GSV_42450 (Fig. 6c), whose predicted host MAGs spanned nine phyla, mainly Acidobacteria (33 SGBs), Gemmatimonadota (seven SGBs), and Actinobacteriota (six SGBs).
Discussion
Here we established the SMAG catalogue by reconstructing 40,039 bacterial and archaeal genomes, representing 21,077 species-level genome bins, from large-scale metagenomic assembly. As a result of our work, the majority (16,530 uSGBs) of reconstructed genomes are currently unidentified from species to class level, and uSGBs made a great contribution to increasing the mapping rate of soil metagenomes.
We found that the uSGBs immensely expanded the functional landscape of soil microbiota. The pangenome (107) and SNV (582,519,530) analyses show a large number of unknown core genes that need further investigation. Based on the proportion of core genes, we identified a divergence of pangenome openness across different phyla, suggesting their divergent roles in ecological functions in soils43. These results indicate that pangenome evolution analysis within defined phylogenetic groups should consider environmental effects41. A large number of SNVs were detected from uSGBs in the SMAG catalogue, revealing many previously unknown intraspecies variations. The divergent pN/pS ratios indicated that the soil microbiome has experienced strong purifying selection, which may highlight the environmental adaptability of species within the community, emphasizing a balance where deleterious genetic variations are minimized45.
The SMAG catalogue showed a rich discovery potential for BGC diversity, which is a vital resource for the synthesis of natural products57. We found that most identified BGCs were from genomes of uSGBs and that the biosynthetic potential of microorganisms diverges across BGC types, indicating the great potential for mining previously unexplored BGCs from uncultivated and unknown soil microorganisms. However, most BGCs identified from the SMAG catalogue were fragmented, indicating that short-read sequencing restrained the recovery of full-length BGC sequences from uncultivated bacteria15, and that tools based on Hidden Markov Model (HMM) algorithms limited the accuracy and generalizability of BGC identification58. Long-read sequencing and deep learning adopted for metagenomic assembly may enable more complete genomes59 and high-resolution analysis of resistance determinants and mobile elements60. Thus, future research can incorporate long-read sequencing to construct more complete BGCs15, and machine learning can be introduced into the identification of BGC regions61 and the mining of microbial dark matter60.
The SMAG catalogue also holds great potential for the development of anti-viral defense systems. We detected 8545 natural CRISPR-Cas genes, revealing the considerable potential of the SMAG catalogue for mining gene editing tools. We found that uSGBs offered plenty of previously unexplored Cas protein resources. Different phyla showed varied potential for developing “immune” systems based on their spacer and Cas gene numbers, which can guide researchers in mining targeted “immune” systems53. Furthermore, the identification of spacers helps to understand how spacer sequences are inserted into the host CRISPR locus to generate immunological memory62. Overall, the analysis provides the first large-scale and comprehensive portrayal of the Cas protein resource in the soil microbiome, which is an important resource for exploring the molecular “immune” system of microbes.
The uSGBs contributed most of the virus–host associations in the SMAG catalogue. Most virus–host associations were predicted from prophages, a prevalent infection pathway of viruses in soil microbiota21. We found divergent host phylogenetic ranges of viruses across different phyla. Interestingly, the finding of a generalist virus offers new insights for experimental work on phage cultivation63. Together, these results demonstrate previously unexplored putative virus–host connections, expanding our understanding of soil microbial dark matter.
In summary, we have established a soil MAG catalogue that sheds light on soil microbial dark matter and provides valuable insights into the diversity and function of soil microbiomes. Moreover, given the large uncultured and unknown diversity remaining in soil microbiomes, a high-quality genome catalogue substantially enhances the resolution and accuracy of metagenome-based studies and serves as a major information resource for new biological insights. In addition, the MAGs in the SMAG catalogue are resources for building genome-scale metabolic models (GEMs), which would be crucial for designing and engineering microbiomes. Overall, the knowledge gained from this work is valuable as a genetic resource for future studies based on genome-centric mining and for prioritizing targets for further experimental validation.
Methods
Sampling, sequencing, and collection of soil metagenomes
We downloaded 2941 publicly available soil metagenomes with file sizes exceeding 2 GB from the NCBI Sequence Read Archive (SRA)64; these originate from different countries and cover 9 soil ecosystems. A further 363 in-house metagenomes were sampled for this study (see Supplementary Data 1 for details).
We collected ecosystem classifications manually, and for projects with insufficient information, we defined the ecosystem type by latitude and longitude using GlobeLand30 and Google Maps. In-house samples were collected by our team across China (348) and Europe (15) in 2018–2020 using a standard sampling protocol65. A five-point sampling method (non-probability sampling) was used for the in-house samples. All soil samples were kept cool on dry ice until visible roots and stones were removed, and the cleaned soils were then stored at −80 °C until DNA extraction. In all cases, DNA was extracted from 400 mg of soil per sample using the MP FastDNA SPIN Kit for soil (MP Biomedicals, Solon, OH, USA) according to the manufacturer’s instructions, and DNA was purified and concentrated using Qubit fluorometric quantitation (Thermo Fisher Scientific, Waltham, MA, USA). Purified DNA was stored at −20 °C until sequencing. Metagenomic sequencing of each soil sample was conducted on an Illumina HiSeq 4000 or Illumina NovaSeq (PE150) platform (Illumina, San Diego, CA, USA), generating 150 bp paired-end reads. Sequence data have been deposited in NCBI under BioProject accession number PRJNA983538.
Metagenome quality control and assembly
All downloaded SRA files were split into paired-end raw reads using fastq-dump (v2.9.6) from sratoolkit (v2.9.6) with the option ‘--split-3’. All raw reads were then quality-controlled separately using Trimmomatic66 (v2.39) to trim adaptors and primers and to filter short (<50 bp) and low-quality (<20 bases) reads. Each sample was then assembled separately with MEGAHIT67 (v1.2.9) using a minimum contig length of 500 bp and the options ‘--k-step 10 --k-min 27’.
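As a minimal sketch of how the read-splitting and assembly steps above could be scripted, the following Python helpers only build the command lines; they do not run the tools. The accession, file names, and output directories are hypothetical placeholders, and the Trimmomatic step is omitted because its exact settings are not given in the text.

```python
# Sketch only: builds argument lists for the tools named in the text.
# Accession, paths, and helper names are hypothetical, not from the study.

def fastq_dump_cmd(accession: str, outdir: str) -> list:
    """Split an SRA run into paired-end FASTQ files with --split-3."""
    return ["fastq-dump", "--split-3", "--outdir", outdir, accession]


def megahit_cmd(reads_1: str, reads_2: str, outdir: str) -> list:
    """Assemble one sample with the k-mer options and contig cutoff stated above."""
    return [
        "megahit",
        "-1", reads_1,
        "-2", reads_2,
        "--min-contig-len", "500",
        "--k-min", "27",
        "--k-step", "10",
        "-o", outdir,
    ]


if __name__ == "__main__":
    # Hypothetical sample; pass each list to subprocess.run(cmd, check=True) to execute.
    print(" ".join(fastq_dump_cmd("SRR0000000", "raw_reads")))
    print(" ".join(megahit_cmd("clean_1.fastq", "clean_2.fastq", "assembly")))
```

Keeping each command as an argument list avoids shell quoting issues when the same step is applied to thousands of samples.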
Metagenome binning and refinement
Soil MAGs were recovered from individual metagenomic assemblies using metaWRAP68 on the basis of tetranucleotide frequencies (TNF) and coverage information; contigs shorter than 1000 bp were discarded. The resulting MAGs were refined using the ‘bin_refinement’ module of metaWRAP68 (v1.2.1) to combine and improve the results generated by the three binners. During refinement, the completeness and contamination of all MAGs were estimated using CheckM69 (v1.0.11) via the lineage-specific workflow, and the options ‘-c 50 -x 10’ were used to retain MAGs that were at least 50% complete with <10% contamination. Ribosomal RNAs (rRNAs) were identified with Barrnap (v.0.9), which uses the nhmmer function of HMMER 3, with the options ‘-reject 0.01 –e-value 1e-3’ and ‘-kingdom bac/arc’ for bacteria and archaea, respectively. Transfer RNAs (tRNAs) were annotated with tRNAscan-SE70 (v.2.0.9) with the option ‘-A’ for archaeal species and ‘-B’ for bacterial lineages. Based on these results, we classified MAGs as high quality according to the MIMAG standard24 (>90% completeness, ≤5% contamination, ≥18/20 tRNA genes, and the presence of 5S, 16S, and 23S rRNA genes), with the remaining MAGs classified as medium quality.
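The quality tiers described above can be expressed as a small decision function. This is an illustrative sketch rather than the pipeline's code; the function name and the "rejected" label for MAGs failing the 50% completeness or 10% contamination filter are assumptions introduced here.

```python
# Illustrative sketch of the quality tiers stated in the text.
# Function name and the "rejected" label are hypothetical.

def classify_mag(completeness: float, contamination: float,
                 n_trna_types: int, has_5s: bool, has_16s: bool,
                 has_23s: bool) -> str:
    """Return 'high', 'medium', or 'rejected' for one MAG (values in percent)."""
    # Refinement filter: at least 50% complete with <10% contamination.
    if completeness < 50 or contamination >= 10:
        return "rejected"
    all_rrnas = has_5s and has_16s and has_23s
    # High quality: >90% complete, <=5% contamination, >=18 of 20 tRNAs,
    # and 5S, 16S, and 23S rRNA genes all present.
    if completeness > 90 and contamination <= 5 and n_trna_types >= 18 and all_rrnas:
        return "high"
    return "medium"


if __name__ == "__main__":
    print(classify_mag(95.2, 1.3, 19, True, True, True))   # high
    print(classify_mag(95.2, 1.3, 19, True, False, True))  # medium
```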
Dereplication and species-level genome bins clustering of SMAG
The 40,349 MAGs from the SMAG dataset were further quality-filtered with the option ‘--checkM_method (lineage_wf)’ to avoid low-quality genomes. The 40,039 filtered MAGs were then dereplicated and clustered into 21,077 SGBs based on 95% ANI using dRep71 (v2.2.4) with the following options: ‘-pa 0.9 -sa 0.95 -nc 0.10 -cm larger’. To reduce the computational burden of clustering all genomes, we used multi-round clustering by setting the dRep parameter ‘--multiround_primary_clustering’, which is helpful when clustering more than 5000 genomes and uses single-linkage clustering to reduce the final computational load; this approach was previously used to cluster >200,000 human gut MAGs13.
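To illustrate the idea of grouping genomes into species-level bins at a 95% ANI threshold with single linkage, a minimal union-find sketch is given below. It is not dRep's implementation; the genome names, the input format (a dictionary of pairwise ANI values as fractions), and the function name are assumptions made for the example.

```python
# Conceptual sketch of single-linkage clustering at an ANI threshold.
# Not dRep's algorithm; names and input format are hypothetical.

def single_linkage_clusters(genomes, ani_pairs, threshold=0.95):
    """Group genomes connected by any chain of pairs with ANI >= threshold."""
    parent = {g: g for g in genomes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ani in ani_pairs.items():
        if ani >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    pairs = {("A", "B"): 0.97, ("B", "C"): 0.96, ("A", "C"): 0.93, ("C", "D"): 0.80}
    print(single_linkage_clusters(["A", "B", "C", "D"], pairs))
```

With single linkage, A and C fall into the same cluster through B even though their direct ANI is below the threshold, which is the main behavioural difference from average- or complete-linkage clustering.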
Phylogenetic and taxonomic annotation of SMAG
A total of 21,077 representative SGBs were classified with GTDB-Tk72 (v.1.6.0) using the ‘classify_wf’ function and default parameters according to the Genome Taxonomy Database (GTDB) (release 202)37. In short, GTDB-Tk classifies each genome based on ANI to a curated collection of reference genomes, placement in the bacterial or archaeal reference genome tree, and relative evolutionary distance (RED). The phylogenetic analyses of the 21,077 SGBs were performed with PhyloPhlAn73 (v3.0.60). The phylogeny in Fig. 2 was built using the 400 universal PhyloPhlAn markers with the following options: ‘--diversity high --accurate --min_num_markers 100’. For the internal steps, the following tools and parameters were used: Diamond74 (v0.9.14.115) with parameters ‘blastp --quiet --threads 1 --outfmt 6 --more-sensitive --id 50 --max-hsps 35 -k 0’; mafft75 (v7.310) with the ‘--anysymbol’ option; trimal76 (v1.4rev15) with the ‘-gappyout’ option; FastTree77 (v2.1.10) with the options ‘-mlacc 2 -slownni -spr 4 -fastest -mlnni 4 -no2nd -gtr -nt’; and RAxML78 (v8.1.12) with parameters ‘-m PROTCATLG -p 1989 <phylogenetic tree computed by FastTree >.’ The best tree refined by RAxML was visualized with ggtree79 (v3.2.1).
To estimate the relative abundance of each MAG in each soil sample, clean reads of each sample were aligned with BWA80 (v0.7.17) to the SMAG catalogue after dereplicating all MAGs at 95% identity with dRep71 (v2.2.4), which avoids arbitrary mapping between representatives of highly similar genomes. The outputs were converted to BAM format with Samtools81 (v1.10). The BAM files were then filtered with CoverM v0.2.0 (https://github.com/wwood/CoverM) using the options “--min-read-percent-identity 0.95 --min-read-aligned-percent 0.90”. The coverage of each contig was calculated with CoverM in ‘trimmed_mean’ mode, that is, as the mean number of reads aligned to each position after excluding the 10% of positions with the lowest coverage and the positions above the 90% maximum fraction. The coverage of each MAG was calculated as the average of its contig coverages, weighting each contig by its length in base pairs. The relative abundance of each lineage in each sample was calculated as its coverage divided by the total coverage of all genomes in the dereplicated set. MAGs with a relative abundance <0.01% in a sample were considered part of the rare biosphere; otherwise, they were considered part of the abundant biosphere36.
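The coverage and relative-abundance arithmetic described above can be sketched as follows. This is a simplified illustration, not CoverM's implementation: the trimming is approximated by sorting per-position depths and dropping the lowest 10% and everything above the 90% fraction, and all function names and inputs are hypothetical.

```python
# Simplified sketch of the abundance calculations; not CoverM's code.
# Function names and example values are hypothetical.

def trimmed_mean(depths, trim_min=0.10, trim_max=0.90):
    """Mean per-position depth after dropping the lowest and highest tails."""
    ordered = sorted(depths)
    lo = int(len(ordered) * trim_min)
    hi = int(len(ordered) * trim_max)
    kept = ordered[lo:hi]
    return sum(kept) / len(kept) if kept else 0.0


def mag_coverage(contigs):
    """Length-weighted mean coverage; contigs is a list of (length_bp, coverage)."""
    total_length = sum(length for length, _ in contigs)
    return sum(length * cov for length, cov in contigs) / total_length


def relative_abundance(mag_coverages):
    """Each MAG's coverage divided by the total coverage of all MAGs."""
    total = sum(mag_coverages.values())
    return {mag: cov / total for mag, cov in mag_coverages.items()}


def biosphere(rel_abundance, threshold=1e-4):
    """'rare' below 0.01% relative abundance, otherwise 'abundant'."""
    return "rare" if rel_abundance < threshold else "abundant"


if __name__ == "__main__":
    print(trimmed_mean([5] * 8 + [0, 100]))
    print(mag_coverage([(1000, 10.0), (3000, 2.0)]))
    print(relative_abundance({"mag_a": 4.0, "mag_b": 12.0}))
```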
Comparing MAGs to >500,000 genomes in public databases
We compared representative genomes from the 21,077 SGBs to a large number of publicly available reference genomes. Approximately 500,000 reference genomes were obtained from a variety of sources, including NCBI RefSeq (n = 282,219), GenBank (123,580 MAGs and 1710 SAGs) as of November 2021, and multiple system-associated MAGs from several recent studies (207,593)29,30,82. We first used the ‘dist’ function of Mash83 (v2.3) to find the most similar reference genome to each of the 21,077 SGBs, and then used the ‘dnadiff’ function of MUMmer84 (v4.0.0) with default parameters to estimate ANI between genome pairs. Based on these results, a species was considered to have been cultured if it matched an isolate RefSeq genome with at least 95% ANI over at least 30% of the genome length, and a species was considered unknown if it was represented only by SMAG genomes.
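The decision rule above can be sketched as a small classifier. The `Hit` record, the function name, and the intermediate "known_uncultured" label are assumptions introduced for illustration; in particular, applying the same ANI and alignment thresholds to non-isolate references (MAGs and SAGs) is an assumption, since the text states them only for isolate RefSeq genomes.

```python
# Illustrative sketch; record type, labels, and reuse of thresholds for
# non-isolate references are assumptions, not taken from the study.
from dataclasses import dataclass


@dataclass
class Hit:
    ani: float               # average nucleotide identity, in percent
    aligned_fraction: float  # fraction of the genome length aligned (0 to 1)
    is_isolate: bool         # True if the reference is an isolate genome


def classify_sgb(hits, min_ani=95.0, min_aligned=0.30):
    """Label an SGB as 'cultured', 'known_uncultured', or 'unknown'."""
    passing = [h for h in hits
               if h.ani >= min_ani and h.aligned_fraction >= min_aligned]
    if not passing:
        return "unknown"           # represented only by the catalogue itself
    if any(h.is_isolate for h in passing):
        return "cultured"          # matches an isolate reference genome
    return "known_uncultured"      # matches only reference MAGs or SAGs


if __name__ == "__main__":
    print(classify_sgb([Hit(96.4, 0.55, True)]))
    print(classify_sgb([Hit(96.4, 0.55, False)]))
    print(classify_sgb([Hit(96.4, 0.20, True)]))
```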
Functional analysis of SMAG
Putative protein-coding sequences (CDSs) of SMAG were predicted using Prodigal85 (v2.6.3) with the ‘-p single’ parameter. The predicted CDSs were then clustered with MMseqs286 using the options ‘--min-seq-id 0.95 -c 0.9 --cluster-mode 2 --cov-mode 1’, and the representative CDSs were annotated with eggNOG-mapper87 (v2.1.6) against the eggNOG database (v5.0)38. KEGG (Kyoto Encyclopedia of Genes and Genomes) and Clusters of Orthologous Groups of proteins (COG) functional annotations were derived from the eggNOG-mapper results.
The secondary-metabolite biosynthetic potential of SMAG
Secondary-metabolite BGCs of SMAG were identified using antiSMASH58 (v6.1) with default settings and the corresponding database (v5.0)88. The BGCs were subsequently filtered, retaining only those encoded on scaffolds ≥5 kb to reduce the risk of fragmentation, as done previously29,89, which resulted in a total of 43,169 BGCs (Supplementary Data 5). These BGCs were categorized into eight groups, ‘PKSI’, ‘PKS-NRP_Hybrids’, ‘PKSothers’, ‘NRPS’, ‘RiPPs’, ‘Terpene’, ‘Saccharides’ and ‘Others’, based on the categories suggested by BiG-SCAPE90. We selected sequences encoding core biosynthetic genes from the two BGCs for sequence identity comparison with Clustal (v2.1)91, and the KOs of the largest BGCs from SMAG and GEM were assigned using BlastKOALA92 (v.2.21).
SNV and pangenome analysis of SMAG
A total of 2448 species with at least three conspecific genomes (completeness >= 50%, contamination <= 5%) were used to generate a catalogue of SNVs (Supplementary Data 3). We mapped all conspecific genomes to the representative genome for each species using the ‘nucmer’ program from MUMmer84 (v4.0.0) and filtered alignments using the ‘delta-filter’ program with options ‘-q -r’ to exclude chance- and repeat-induced alignments. Thereafter, we identified SNVs using the ‘show-snps’ program. Single-base insertions and deletions were not counted as SNVs. Each SNV locus was included in the catalogue only when the alternate allele was detected in at least two conspecific genomes. To filter the synonymous SNVs, we calculated the synonymous ratio with the in-house script snv-filter.py, and we estimated the ratio of non-synonymous to synonymous polymorphism rates44 (pN/pS) to evaluate genetic diversity. Pangenome analyses were carried out on 2200 SGBs with >10 high-quality MAGs (completeness >= 80%, contamination <= 5%) using Roary93 (v3.12.0), with a minimum amino acid identity of 90% (‘-i 90’) and core genes defined at 90% presence (‘-cd 90’).
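The SNV inclusion rule and the core-gene definition above can be illustrated with a short sketch. It is not the in-house snv-filter.py script; the tuple layout of the input, the use of "." to mark an indel, and the function names are assumptions made for this example.

```python
# Illustrative sketch; not the in-house snv-filter.py script.
# Input layout, indel marker, and function names are hypothetical.
from collections import defaultdict

BASES = {"A", "C", "G", "T"}


def build_snv_catalogue(observations, min_genomes=2):
    """Keep substitutions whose alternate allele occurs in >= min_genomes genomes.

    observations: iterable of (genome_id, contig, position, ref_base, alt_base)
    relative to the species representative genome.
    """
    support = defaultdict(set)
    for genome, contig, pos, ref, alt in observations:
        # Skip single-base insertions and deletions (non-base symbols such as ".").
        if ref not in BASES or alt not in BASES:
            continue
        support[(contig, pos, ref, alt)].add(genome)
    return {snv: len(genomes) for snv, genomes in support.items()
            if len(genomes) >= min_genomes}


def is_core_gene(n_genomes_with_gene, n_genomes, min_presence=0.90):
    """A gene is core when present in at least 90% of the genomes of a species."""
    return n_genomes_with_gene / n_genomes >= min_presence


if __name__ == "__main__":
    obs = [
        ("g1", "c1", 100, "A", "G"), ("g2", "c1", 100, "A", "G"),
        ("g1", "c1", 200, "C", "T"),
        ("g1", "c1", 300, ".", "A"), ("g2", "c1", 300, ".", "A"),
    ]
    print(build_snv_catalogue(obs))
    print(is_core_gene(9, 10), is_core_gene(8, 10))
```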
CRISPR and Cas protein
CRISPR arrays were identified on contigs longer than 3 kb in MAGs using a combination of PILER-CR94 (v0.4.2). MAGs containing fewer than four CRISPR-associated proteins were removed. Proteins were predicted with Prodigal (v2.6.3) and de-duplicated to construct a database, and proteins with lengths between 200 and 1000 aa were retained. The NR database was used to remove proteins of known function, and Cas proteins in NCBI were used for further characterization of the candidate Cas proteins.
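The MAG-level and protein-level filters described above can be sketched as follows. This is an illustration under stated assumptions: the input dictionaries, the function name, and the treatment of the 200 and 1000 aa bounds as inclusive are all hypothetical, and the subsequent NR and NCBI Cas comparisons are not modelled.

```python
# Illustrative sketch of the candidate filtering; input format, function name,
# and inclusive length bounds are assumptions, not taken from the study.

def filter_cas_candidates(mag_proteins, cas_counts,
                          min_cas=4, min_len=200, max_len=1000):
    """Return de-duplicated candidate protein sequences.

    mag_proteins: dict mapping MAG id -> list of protein sequences
    cas_counts:   dict mapping MAG id -> number of CRISPR-associated proteins
    """
    seen = set()
    candidates = []
    for mag, proteins in mag_proteins.items():
        # Remove MAGs with fewer than four CRISPR-associated proteins.
        if cas_counts.get(mag, 0) < min_cas:
            continue
        for seq in proteins:
            # Keep each distinct sequence once, within the length window.
            if min_len <= len(seq) <= max_len and seq not in seen:
                seen.add(seq)
                candidates.append(seq)
    return candidates


if __name__ == "__main__":
    proteins = {"mag1": ["M" * 250, "M" * 250, "M" * 100, "M" * 1200],
                "mag2": ["A" * 300]}
    counts = {"mag1": 5, "mag2": 3}
    print(len(filter_cas_candidates(proteins, counts)))
```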
Connecting MAGs to viruses identified from VirSorter2
To maximize the number of prophages identified in MAGs, we used VirSorter295 (v2.0.alpha) to perform de novo prediction. Only predictions classified as prophages by CheckV96 (version 1.0) were retained. To exclude possible decayed prophages, that is, integrated virus genomes that are now inactive and are being progressively removed from the host genome, all predictions for which 30% or more of the genes displayed a best hit to Pfam (35.0)97 were excluded (thresholds: hmmsearch score ≥ 50 and E ≤ 0.001).
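The retention rule for prophage predictions can be written as a single predicate. The sketch below assumes that the CheckV classification and the per-gene Pfam best-hit counts (already thresholded at score ≥ 50 and E ≤ 0.001) have been computed upstream; the function and parameter names are hypothetical.

```python
# Illustrative sketch; function and parameter names are hypothetical, and the
# Pfam best-hit counts are assumed to be pre-computed with the stated thresholds.

def keep_prophage(is_checkv_prophage, n_genes, n_genes_best_hit_pfam,
                  max_pfam_fraction=0.30):
    """Retain a prediction only if CheckV calls it a prophage and fewer than
    30% of its genes have their best hit in Pfam (putative decayed prophage)."""
    if not is_checkv_prophage or n_genes == 0:
        return False
    return n_genes_best_hit_pfam / n_genes < max_pfam_fraction


if __name__ == "__main__":
    print(keep_prophage(True, 10, 2))   # retained
    print(keep_prophage(True, 10, 3))   # excluded: 30% of genes hit Pfam
    print(keep_prophage(False, 10, 0))  # excluded: not a CheckV prophage
```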
Statistics and reproducibility
No data were excluded from the analyses. The investigators were not blinded to allocation during experiments and outcome assessment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
We thank C. Kelly, C. Averill, D. Buckley, D. Goodheart, D. Duncan, D. Myrold, E. Eloe-Fadrosh, E. Brodie, E. Högfors-Rönnholm, H. Cadillo-Quiroz, J. Tiedje, J. Jansson, J. Norton, J. Blanchard, J. Schweitzer, J. Banfield, J. Gladden, J. Raff, K. Peay, K. Gravuer, K. M. DeAngelis, L. Meredith, M. Kalyuzhnaya, M. Waldrop, N. Fierer, P. Dijkstra, P. Baldrian, S. Theroux, S. Tringe, T. Woyke, T. Whitman, W. Mohn & San Diego State University for their permission to use their metagenome data. We also thank Jianyu Jiao from Sun Yat-Sen University for useful discussion and Amazon Web Services for providing computing resources. This work was supported by the National Natural Science Foundation of China (grants 42090060, 42277283 to B.M. and 41991334 to J.X.), the Key R&D Program of Zhejiang Province (2023C02004 to B.M. and J.X., 2023C02015 to B.M.), and the Fundamental Research Funds for the Central Universities (226-2022-00139 to B.M.).
Author contributions
B.M., J.X., J.Z., and Y.Z. conceived and co-supervised the study. C.L., Y.W., J.Y., K.Z., R.X., and H.R. designed the methods of the research. B.M., C.L., Y.W., J.Y., X.L., R.X., R.P., J.Z., Y.Z., and J.X. performed bioinformatic and statistical analyses. B.M. and C.L. drafted the manuscript; B.M., C.L., X.L., R.P., J.Z., Y.Z., and J.X. reviewed and edited the manuscript. B.M. and J.X. performed funding acquisition.
Peer review
Peer review information
Nature Communications thanks Jialiang Kuang, and the other, anonymous, reviewer for their contribution to the peer review of this work. A peer review file is available.
Code availability
The workflows used for genome reconstruction, taxonomic analysis, and functional annotation, together with the BGC, pangenome, SNV, and virus prediction analyses and the scripts used to generate the figures, are described in the GitHub repository at https://github.com/Caiyulu-818/SMAG/releases/tag/v1.0 (ref. 98). All statistical analyses for generating figures were performed in the R environment (v4.1.2)99.
Data availability
The raw sequence data of the in-house samples reported in this paper and the 16,530 uSGBs of the SMAG catalogue have been deposited in NCBI SRA and GenBank under BioProject accession number PRJNA983538. For bulk download, all MAGs, SNV catalogues, and predicted viruses of the SMAG catalogue have been deposited in the Zenodo repository (10.5281/zenodo.7341719; ref. 100) and are also available through the freely accessible web interface of the SMAG catalogue (https://smag.microbmalab.cn). The source data underlying Figs. 1–6 and Supplementary Figs. 1–6 are provided as Source Data files and have been deposited in the Figshare database (10.6084/m9.figshare.23298791). The databases used in this study include the GEM catalog (https://genome.jgi.doe.gov/portal/GEMs/GEMs.home.html), the UHGG (https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/), and GTDB Release 202 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Bin Ma, Caiyu Lu, Yiling Wang.
Change history
12/6/2023
A Correction to this paper has been published: 10.1038/s41467-023-44072-7
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-023-43000-z.
References
- 1. Banerjee, S. & van der Heijden, M. G. A. Soil microbiomes and one health. Nat. Rev. Microbiol. doi: 10.1038/s41579-022-00779-w (2022).
- 2. Shlaes, D. M. The Perfect Storm. in Antibiotics: The Perfect Storm (ed. Shlaes, D. M.) 1–7 (Springer Netherlands, 2010).
- 3. New FN, Brito IL. What is metagenomics teaching us, and what is missed? Annu. Rev. Microbiol. 2020;74:117–135. doi: 10.1146/annurev-micro-012520-072314.
- 4. Fierer N. Embracing the unknown: disentangling the complexities of the soil microbiome. Nat. Rev. Microbiol. 2017;15:579–590. doi: 10.1038/nrmicro.2017.87.
- 5. Rinke C, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499:431–437. doi: 10.1038/nature12352.
- 6. Lewis WH, Tahon G, Geesink P, Sousa DZ, Ettema TJG. Innovations to culturing the uncultured microbial majority. Nat. Rev. Microbiol. 2021;19:225–240. doi: 10.1038/s41579-020-00458-8.
- 7. Ling LL, et al. A new antibiotic kills pathogens without detectable resistance. Nature. 2015;517:455–459. doi: 10.1038/nature14098.
- 8. Berg G, et al. Microbiome definition re-visited: old concepts and new challenges. Microbiome. 2020;8:103. doi: 10.1186/s40168-020-00875-0.
- 9. Liu, Y. et al. A genome and gene catalogue of glacier microbiomes. Nat. Biotechnol. doi: 10.1038/s41587-022-01367-2 (2022).
- 10. Hua Z-S, et al. Insights into the ecological roles and evolution of methyl-coenzyme M reductase-containing hot spring Archaea. Nat. Commun. 2019;10:4574. doi: 10.1038/s41467-019-12574-y.
- 11. Zayed AA, et al. Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome. Science. 2022;376:156–162. doi: 10.1126/science.abm5847.
- 12. Dove NC, Taş N, Hart SC. Ecological and genomic responses of soil microbiomes to high-severity wildfire: linking community assembly to functional potential. ISME J. 2022;16:1853–1863. doi: 10.1038/s41396-022-01232-9.
- 13. Almeida, A. et al. A unified catalogue of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. doi: 10.1038/s41587-020-0603-3 (2020).
- 14. Chen C, et al. Expanded catalogue of microbial genes and metagenome-assembled genomes from the pig gut microbiome. Nat. Commun. 2021;12:1106. doi: 10.1038/s41467-021-21295-0.
- 15. Waschulin V, et al. Biosynthetic potential of uncultured Antarctic soil bacteria revealed through long-read metagenomic sequencing. ISME J. 2022;16:101–111. doi: 10.1038/s41396-021-01052-3.
- 16. Andrei A-Ş, et al. Niche-directed evolution modulates genome architecture in freshwater Planctomycetes. ISME J. 2019;13:1056–1071. doi: 10.1038/s41396-018-0332-5.
- 17. Liu L, et al. Charting the complexity of the activated sludge microbiome through a hybrid sequencing strategy. Microbiome. 2021;9:205. doi: 10.1186/s40168-021-01155-1.
- 18. Crits-Christoph A, Diamond S, Butterfield CN, Thomas BC, Banfield JF. Novel soil bacteria possess diverse genes for secondary metabolite biosynthesis. Nature. 2018;558:440–444. doi: 10.1038/s41586-018-0207-y.
- 19. Makarova KS, et al. Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 2020;18:67–83. doi: 10.1038/s41579-019-0299-x.
- 20. Gao L, et al. Diverse enzymatic activities mediate antiviral immunity in prokaryotes. Science. 2020;369:1077–1084. doi: 10.1126/science.aba0372.
- 21. Williamson KE, Fuhrmann JJ, Wommack KE, Radosevich M. Viruses in soil ecosystems: an unknown quantity within an unexplored territory. Annu. Rev. Virol. 2017;4:201–219. doi: 10.1146/annurev-virology-101416-041639.
- 22. Geisen S, et al. A methodological framework to embrace soil biodiversity. Soil Biol. Biochem. 2019;136:107536. doi: 10.1016/j.soilbio.2019.107536.
- 23. Bay SK, et al. Trace gas oxidizers are widespread and active members of soil microbial communities. Nat. Microbiol. 2021;6:246–256. doi: 10.1038/s41564-020-00811-w.
- 24. Bowers RM, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 2017;35:9. doi: 10.1038/nbt.3893.
- 25. Yuan C, Lei J, Cole J, Sun Y. Reconstructing 16S rRNA genes in metagenomic data. Bioinformatics. 2015;31:i35–i43. doi: 10.1093/bioinformatics/btv231.
- 26. Parks DH, et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2017;2:1533–1542. doi: 10.1038/s41564-017-0012-7.
- 27. Royo-Llonch, et al. Compendium of 530 metagenome-assembled bacterial and archaeal genomes from the polar Arctic Ocean. Nat. Microbiol. 2021;6:1561–1574. doi: 10.1038/s41564-021-00979-9.
- 28. Nishimura Y, Yoshizawa S. The OceanDNA MAG catalogue contains over 50,000 prokaryotic genomes originated from various marine environments. Sci. Data. 2022;9:305. doi: 10.1038/s41597-022-01392-5.
- 29. Nayfach S, et al. A genomic catalogue of Earth’s microbiomes. Nat. Biotechnol. 2021;39:499–509. doi: 10.1038/s41587-020-0718-6.
- 30. Nayfach S, Shi ZJ, Seshadri R, Pollard KS, Kyrpides NC. New insights from uncultivated genomes of the global human gut microbiome. Nature. 2019;568:505–510. doi: 10.1038/s41586-019-1058-x.
- 31. Richter M, Rosselló-Móra R. Shifting the genomic gold standard for the prokaryotic species definition. Proc. Natl Acad. Sci. USA. 2009;106:19126–19131. doi: 10.1073/pnas.0906412106.
- 32. Jiao J-Y, et al. Insight into the function and evolution of the Wood–Ljungdahl pathway in Actinobacteria. ISME J. 2021;15:3005–3018. doi: 10.1038/s41396-021-00935-9.
- 33. Pasolli E, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell. 2019;176:649–662.e20. doi: 10.1016/j.cell.2019.01.001.
- 34. Tian R, et al. Small and mighty: adaptation of superphylum Patescibacteria to groundwater environment drives their genome simplicity. Microbiome. 2020;8:51. doi: 10.1186/s40168-020-00825-w.
- 35. Douglas AE. Nutritional interactions in insect-microbial symbioses: aphids and their symbiotic bacteria Buchnera. Annu. Rev. Entomol. 1998;43:17–37. doi: 10.1146/annurev.ento.43.1.17.
- 36. Lynch MDJ, Neufeld JD. Ecology and exploration of the rare biosphere. Nat. Rev. Microbiol. 2015;13:217–229. doi: 10.1038/nrmicro3400.
- 37. Parks DH, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 2018;36:996–1004. doi: 10.1038/nbt.4229.
- 38. Huerta-Cepas J, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47:D309–D314. doi: 10.1093/nar/gky1085.
- 39. Golicz AA, Bayer PE, Bhalla PL, Batley J, Edwards D. Pangenomics comes of age: from bacteria to plant and animal applications. Trends Genet. 2020;36:132–145. doi: 10.1016/j.tig.2019.11.006.
- 40. Kim Y, Koh I, Young Lim M, Chung W-H, Rho M. Pan-genome analysis of Bacillus for microbiome profiling. Sci. Rep. 2017;7:10984. doi: 10.1038/s41598-017-11385-9.
- 41. Maistrenko OM, et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 2020;14:1247–1259. doi: 10.1038/s41396-020-0600-z.
- 42. Spasov E, et al. High functional diversity among Nitrospira populations that dominate rotating biological contactor microbial communities in a municipal wastewater treatment plant. ISME J. 2020;14:1857–1872. doi: 10.1038/s41396-020-0650-2.
- 43. Brockhurst MA, et al. The ecology and evolution of pangenomes. Curr. Biol. 2019;29:R1094–R1103. doi: 10.1016/j.cub.2019.08.012.
- 44. Schloissnig S, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493:45–50. doi: 10.1038/nature11711.
- 45. Des Roches S, et al. The ecological importance of intraspecific variation. Nat. Ecol. Evol. 2018;2:57–64. doi: 10.1038/s41559-017-0402-5.
- 46. Kautsar SA, et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 2020;48:D454–D458. doi: 10.1093/nar/gkz882.
- 47. Wang Z, et al. A naturally inspired antibiotic to target multidrug-resistant pathogens. Nature. 2022;601:606–611. doi: 10.1038/s41586-021-04264-x.
- 48. Süssmuth RD, Mainz A. Nonribosomal peptide synthesis—principles and prospects. Angew. Chem. Int. Ed. 2017;56:3770–3821. doi: 10.1002/anie.201609079.
- 49. van der Oost J, Jore MM, Westra ER, Lundgren M, Brouns SJJ. CRISPR-based adaptive and heritable immunity in prokaryotes. Trends Biochem. Sci. 2009;34:401–407. doi: 10.1016/j.tibs.2009.05.002.
- 50. Kunin V, Sorek R, Hugenholtz P. Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biol. 2007;8:R61. doi: 10.1186/gb-2007-8-4-r61.
- 51. Horvath P, Barrangou R. CRISPR/Cas, the Immune System of Bacteria and Archaea. Science. 2010;327:167–170. doi: 10.1126/science.1179555.
- 52. Hou S, et al. CRISPR-Cas systems in multicellular cyanobacteria. RNA Biol. 2019;16:518–529. doi: 10.1080/15476286.2018.1493330.
- 53. Münch PC, Franzosa EA, Stecher B, McHardy AC, Huttenhower C. Identification of natural CRISPR systems and targets in the human microbiome. Cell Host Microbe. 2021;29:94–106.e4. doi: 10.1016/j.chom.2020.10.010.
- 54. Deveau H, Garneau JE, Moineau S. CRISPR/Cas system and its role in phage-bacteria interactions. Annu. Rev. Microbiol. 2010;64:475–493. doi: 10.1146/annurev.micro.112408.134123.
- 55. Schuler G, Hu C, Ke A. Structural basis for RNA-guided DNA cleavage by IscB-ωRNA and mechanistic comparison with Cas9. Science. 2022;376:1476–1481. doi: 10.1126/science.abq7220.
- 56. Pratama AA, van Elsas JD. The ‘neglected’ soil virome – potential role and impact. Trends Microbiol. 2018;26:649–662. doi: 10.1016/j.tim.2017.12.004.
- 57. van Bergeijk DA, Terlouw BR, Medema MH, van Wezel GP. Ecology and genomics of Actinobacteria: new concepts for natural product discovery. Nat. Rev. Microbiol. 2020;18:546–558. doi: 10.1038/s41579-020-0379-y.
- 58. Medema MH, et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 2011;39:W339–W346. doi: 10.1093/nar/gkr466.
- 59. Che Y, et al. Mobile antibiotic resistome in wastewater treatment plants revealed by Nanopore metagenomic sequencing. Microbiome. 2019;7:44. doi: 10.1186/s40168-019-0663-0.
- 60. Hoarfrost A, Aptekmann A, Farfañuk G, Bromberg Y. Deep learning of a bacterial and archaeal universal language of life enables transfer learning and illuminates microbial dark matter. Nat. Commun. 2022;13:2606. doi: 10.1038/s41467-022-30070-8.
- 61.Hannigan GD, et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 2019;47:e110. doi: 10.1093/nar/gkz654. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Nuñez JK, Lee ASY, Engelman A, Doudna JA. Integrase-mediated spacer acquisition during CRISPR–Cas adaptive immunity. Nature. 2015;519:193–198. doi: 10.1038/nature14237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Shen J, et al. Large-scale phage cultivation for commensal human gut bacteria. Cell Host Microbe. 2023;31:665–677.e7. doi: 10.1016/j.chom.2023.03.013. [DOI] [PubMed] [Google Scholar]
- 64.Arita M, Karsch-Mizrachi I, Cochrane G. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2021;49:D121–D124. doi: 10.1093/nar/gkaa967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Soil Sampling and Methods of Analysis. (Canadian Society of Soil Science; CRC Press, 2008).
- 66.Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 67.Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31:1674–1676. doi: 10.1093/bioinformatics/btv033.
- 68.Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6:158. doi: 10.1186/s40168-018-0541-1.
- 69.Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–1055. doi: 10.1101/gr.186072.114.
- 70.Chan PP, Lin BY, Mak AJ, Lowe TM. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 2021;49:9077–9096. doi: 10.1093/nar/gkab688.
- 71.Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11:2864–2868. doi: 10.1038/ismej.2017.126.
- 72.Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics. 2019. doi: 10.1093/bioinformatics/btz848.
- 73.Asnicar F, et al. Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0. Nat. Commun. 2020;11:2500. doi: 10.1038/s41467-020-16366-7.
- 74.Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
- 75.Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010.
- 76.Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348.
- 77.Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5:e9490. doi: 10.1371/journal.pone.0009490.
- 78.Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033.
- 79.Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evol. 2017;8:28–36. doi: 10.1111/2041-210X.12628.
- 80.Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 81.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 82.Almeida A, et al. A new genomic blueprint of the human gut microbiota. Nature. 2019;568:499–504. doi: 10.1038/s41586-019-0965-1.
- 83.Ondov BD, et al. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biol. 2019;20:232. doi: 10.1186/s13059-019-1841-x.
- 84.Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.
- 85.Hyatt D, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 2010;11:119. doi: 10.1186/1471-2105-11-119.
- 86.Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat. Commun. 2018;9:2542. doi: 10.1038/s41467-018-04964-5.
- 87.Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 2021;38:5825–5829. doi: 10.1093/molbev/msab293.
- 88.Blin K, et al. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res. 2019;47:W81–W87. doi: 10.1093/nar/gkz310.
- 89.Paoli L, et al. Biosynthetic potential of the global ocean microbiome. Nature. 2022. doi: 10.1038/s41586-022-04862-3.
- 90.Navarro-Muñoz JC, et al. A computational framework to explore large-scale biosynthetic diversity. Nat. Chem. Biol. 2020;16:60–68. doi: 10.1038/s41589-019-0400-9.
- 91.Madeira F, et al. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res. 2022;50:W276–W279. doi: 10.1093/nar/gkac240.
- 92.Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 2016;428:726–731. doi: 10.1016/j.jmb.2015.11.006.
- 93.Page AJ, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31:3691–3693. doi: 10.1093/bioinformatics/btv421.
- 94.Bland C, et al. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinforma. 2007;8:1–8. doi: 10.1186/1471-2105-8-209.
- 95.Guo J, et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome. 2021;9:37. doi: 10.1186/s40168-020-00990-y.
- 96.Nayfach S, et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 2021;39:578–585. doi: 10.1038/s41587-020-00774-7.
- 97.Mistry J, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412–D419. doi: 10.1093/nar/gkaa913.
- 98.Lu CY. A Genomic Catalogue of Soil Microbiomes Boosts Mining of Biodiversity and Genetic Resources, SMAG v1.0. 2023. doi: 10.5281/zenodo.8429870.
- 99.R Core Team. R: A language and environment for statistical computing. MSOR Connections. 2014.
- 100.European Organization For Nuclear Research, OpenAIRE. Zenodo. 2013. doi: 10.25495/7GXK-RD71.
Data Availability Statement
The workflows used for genome reconstruction, taxonomic analysis, and functional annotation, together with the BGC, pan-genome, SNV annotation, and virus prediction analyses and the scripts used to generate the figures, are described in the GitHub repository at https://github.com/Caiyulu-818/SMAG/releases/tag/v1.0 (ref. 98). All statistical analyses for generating figures were performed in the R environment v4.1.2 (ref. 99).
The raw sequence data of the in-house samples reported in this paper and the 16,530 uSGBs of the SMAG catalogue have been deposited in the NCBI SRA and GenBank under BioProject accession number PRJNA983538. For bulk download, all MAGs, SNV catalogues, and viruses predicted from the SMAG catalogue have been deposited in the Zenodo repository (10.5281/zenodo.7341719; ref. 100) and are also available through the freely accessible web interface of the SMAG catalogue (https://smag.microbmalab.cn). The source data underlying Figs. 1–6 and Supplementary Figs. 1–6 are provided as Source Data files and have been deposited in the Figshare database (10.6084/m9.figshare.23298791). The databases used in this study include the GEM catalog (https://genome.jgi.doe.gov/portal/GEMs/GEMs.home.html), the UHGG (https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/), and the GTDB database Release 202 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release202/).
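For readers who want to script access to these deposits, the identifiers above can be turned into resolvable links. The following is a minimal sketch, not part of the SMAG workflow: the function and variable names are hypothetical, and the URL patterns (the doi.org resolver prefix and the NCBI BioProject page path) are assumptions about how these services are commonly addressed rather than anything stated in this article. It makes no network requests.

```python
# Identifiers are copied from the Data Availability Statement.
# The URL prefixes below are assumptions, not taken from the article.
DOI_RESOLVER = "https://doi.org/"
BIOPROJECT_BASE = "https://www.ncbi.nlm.nih.gov/bioproject/"

# Hypothetical labels for the two DOI-identified deposits.
SMAG_DOIS = {
    "zenodo_bulk_download": "10.5281/zenodo.7341719",
    "figshare_source_data": "10.6084/m9.figshare.23298791",
}
BIOPROJECT_ACCESSION = "PRJNA983538"


def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a bare DOI string."""
    return DOI_RESOLVER + doi.strip()


def bioproject_url(accession: str) -> str:
    """Return the assumed NCBI BioProject page URL for an accession."""
    return BIOPROJECT_BASE + accession.strip()


def smag_resource_urls() -> dict:
    """Collect one URL per deposit named in the statement."""
    urls = {name: doi_to_url(doi) for name, doi in SMAG_DOIS.items()}
    urls["ncbi_bioproject"] = bioproject_url(BIOPROJECT_ACCESSION)
    return urls


if __name__ == "__main__":
    for name, url in smag_resource_urls().items():
        print(f"{name}: {url}")
```

Keeping the identifiers in one mapping means an updated deposit only requires changing a single entry.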