Abstract
Eukaryotic genomes are known to have garnered innovations from both archaeal and bacterial domains but the sequence of events that led to the complex gene repertoire of eukaryotes is largely unresolved. Here, through the enrichment of hydrothermal vent microorganisms, we recovered two circularized genomes of Heimdallarchaeum species that belong to an Asgard archaea clade phylogenetically closest to eukaryotes. These genomes reveal diverse mobile elements, including an integrative viral genome that bidirectionally replicates in a circular form and aloposons, transposons that encode the 5,000 amino acid-sized proteins Otus and Ephialtes. Heimdallarchaeal mobile elements have garnered various genes from bacteria and bacteriophages, likely playing a role in shuffling functions across domains. The numbers of archaea- and bacteria-related genes follow strikingly different scaling laws in Asgard archaea, exhibiting a genome size-dependent ratio and a functional division resembling the bacteria- and archaea-derived gene repertoire across eukaryotes. Bacterial gene import has thus likely been a continuous process unaltered by eukaryogenesis and scaled up through genome expansion. Our data further highlight the importance of viewing eukaryogenesis in a pan-Asgard context, which led to the proposal of a conceptual framework, that is, the Heimdall nucleation–decentralized innovation–hierarchical import model that accounts for the emergence of eukaryotic complexity.
Subject terms: Evolution, Ecology
The recovery of two circularized genomes of the Heimdallarchaeum species from hydrothermal vent enrichment cultures reveals that these Asgard archaea carry diverse mobile genetic elements, such as an integrative viral genome and aloposons. These mobile genetic elements contain several bacteria- and phage-derived genes, modulating the shuffling of information between bacteria and archaea, and potentially influencing eukaryogenesis.
Main
To chronicle the emergence of evolutionary innovation is a long-standing pursuit in biology. Owing to the scant record of reliable microscale fossils, resolving evolutionary history at the cellular scale relies primarily on molecular comparisons across present-day life, provided that phylogenetic relatives can be well delineated. Culture-independent metagenomics has substantially expanded our access to the Earth’s diverse biomes1, including lineages carrying genetic imprints of critical evolutionary events through deep time. The Heimdallarchaeota, previously referred to as the ancient archaea group (AAG)2, are one such group and the closest known relative of eukaryotes as suggested by phylogenomics3–5. Heimdallarchaeotes and their related lineages, collectively called the Asgard archaea, contain a sizeable repertoire of eukaryotic signature proteins (ESPs)3,6,7. However, the genetic make-up of Heimdallarchaeotes has so far only been inferred from a few metagenome-assembled genomes (MAGs), which are fragmented and suffer from uncertainty in their completeness and accuracy3,7–12. Mobile (genetic) elements, including transposons, viruses and plasmids, which are known to play dominant roles in evolution13, are frequently misassembled, omitted or misassigned during MAG assembly and binning14. These drawbacks propagate into uncertainties in the resolution of archaeal lineages related to eukaryotes and can obscure the drivers of evolutionary crosstalk and divergence between eukaryotes and their prokaryotic relatives.
Results
Circular Heimdallarchaeota genomes
Recovering contiguous genomes from environmental samples is notoriously challenging due to their enormous biodiversity and strain-level heterogeneity, while most known lineages have been hard to isolate due to their unresolved metabolism and/or poorly understood partner-dependent growth. We overcame these limitations by combining cultivation methods with molecular community profiling to progressively dissect environmental microbial enrichment cultures where a clonal expansion of our species of interest was accompanied by a reduction in diversity (Extended Data Fig. 1 and Methods). Using anaerobic cultivation methods, we enriched a member of the Heimdallarchaeota AAG clade from a barite-rich rock retrieved in 2017 from the Auka hydrothermal vent field (23° 57′ N, 108° 51′ W) located in the southern Pescadero Basin near the southern tip of the Gulf of California at a water depth of 3,674 m (ref. 15). While initially below detection, this rock-associated AAG phylotype emerged at 1–4% of the 16S ribosomal RNA gene relative abundance in 3 lactate-supplemented, anaerobic enrichment cultures incubated at 40 °C after 7 months (Extended Data Fig. 1, Supplementary Tables 1–3 and Supplementary Note 1). In an independent set of enrichments inoculated with sediments collected from the Auka site in 2018 (23° 53′ N, 108° 48′ W), alkane-supplemented anaerobic incubations at 37 °C additionally yielded a second AAG phylotype that increased in 16S rRNA gene relative abundance from 0.03 to 4–7% after 9 months (Supplementary Tables 4 and 5 and Supplementary Note 1).
De novo assembly16–18 of Nanopore long-read and Illumina paired-end sequencing of genomic DNA recovered from these enrichments (Supplementary Table 6) resulted in complete circularized genomes of the two AAG species from the barite and sediment enrichment cultures, with genome sizes of 3.32 and 3.08 million base pairs (Mbp), respectively. The two circular AAG genomes showed 82% alignment fraction, 88% average nucleotide identity (ANI), 90% amino acid identity (AAI) and 97.9% 16S rRNA identity (Supplementary Table 7), which demarcate a clear species boundary19 within the same genus20. Thus, we propose the species names Candidatus Heimdallarchaeum endolithica PR6 (endo- (Greek), within; lithos (Greek), rock) and Candidatus Heimdallarchaeum aukensis PM71 (Auka, the local vent field) denoting their environmental origins (Fig. 1a).
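The way such identity values translate into a taxonomic call can be sketched in a few lines. The cut-offs below are commonly cited rules of thumb (about 95% ANI for species and about 94.5% 16S rRNA identity for genus); they are assumptions made for illustration and are not necessarily the exact criteria of refs. 19 and 20. The function name `relate` is hypothetical.

```python
# Illustrative thresholds (assumptions, not values taken from this study).
ANI_SPECIES = 95.0   # ANI at or above this is commonly read as "same species"
S16_GENUS = 94.5     # 16S rRNA identity at or above this is commonly read as "same genus"


def relate(ani: float, rrna16s: float) -> str:
    """Classify a genome pair from ANI (%) and 16S rRNA identity (%)."""
    if ani >= ANI_SPECIES:
        return "same species"
    if rrna16s >= S16_GENUS:
        return "different species, same genus"
    return "different genera"


# Values reported for the two circular AAG genomes: 88% ANI, 97.9% 16S identity.
print(relate(ani=88.0, rrna16s=97.9))  # prints "different species, same genus"
```

Under the same illustrative cut-offs, a pair with an ANI of 97.5% would be called the same species.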
Taxonomy and metabolism
The taxonomy of Asgard archaea is yet to reach consensus. The initial Heimdallarchaeota3, despite remaining monophyletic in all phylogenomic analyses, has been proposed either to be split into four phyla (Heimdall-, Gerd-, Kari-, Hodarchaeota)7 or, alternatively, to be grouped under a single order named the Heimdallarchaeia21. In this study, we collectively refer to them as ‘the Heimdall group’. Phylogenomic analyses based on 76 concatenated ribosomal proteins show that the Heimdallarchaeum spp. constitute a deeper-branching clade related to the previously described MAG AB_125 (ref. 3), well placed under ‘Heimdall’ in all proposed classification strategies (Fig. 1b and Extended Data Fig. 2). We also identified a fragmented MAG, B53_G16 (ref. 22) (299 contigs, 1.67 Mbp, approximately 50% complete), from the Guaymas Basin, formerly assigned under the Pacearchaeota, which we now designate as a strain of Ca. H. endolithica, with an average ANI of 97.5% compared with our PR6 strain.
Ca. Heimdallarchaeum spp. are predicted to garner energy by anaerobically oxidizing organic substrates via processes involving a partial tricarboxylic acid (TCA) cycle and, given the absence of discernible terminal electron accepting pathways, dissipating electrons via H2 production (Extended Data Fig. 3a). They each encode one membrane-bound hydrogenase (MBH) complex and two cytosolic sulfhydrogenase complexes (SHYI and SHYII) (Fig. 1c). Hydrogen has been hypothesized to act as a syntrophic intermediate bridging archaea and bacteria before the engulfment of the mitochondrial ancestor by an (Asgard) archaeal ancestor of eukaryotes4,23–25. Indeed, in the recent description of Ca. Prometheoarchaeum syntrophicum, MBH complexes associated with unusual membrane extensions were hypothesized to facilitate cell–cell contact and hydrogen exchange with syntrophic partner bacteria23. Following from this concept, we postulate that cytosolic hydrogen generation by SHY, as found in the Ca. Heimdallarchaeum spp., could confer a selective advantage for a hydrogen-dependent endosymbiotic strategy (Fig. 1c).
Eukaryotic signatures
One of the many challenges of resolving the relationship between archaea and eukaryotes is the curation of representative, high-quality genomes across lineages at their interface. To this end, we verified the complete marker gene coverage of the Ca. Heimdallarchaeum spp. as well as six other highly contiguous Asgard archaea genomes (Extended Data Fig. 4a, Methods and Supplementary Note 2). They include three previously described3,23,26 and three assembled in this study from our enrichment cultures—a Lokiarchaeote that we have named Ca. Harpocratesius repetitus FW102, a Thorarchaeote FW25 and a Heimdall group Gerdarchaeote AC18 (Fig. 1d). Notably, the dual-contig assembly Ca. H. repetitus FW102, which relates to Ca. P. syntrophicum MK_D1 at the family level, contains two complete sets of 16S/23S rRNA genes, potentially relevant to their growth strategies in the environment27.
These complete genomes confirmed that many of the previously described ESPs3,6 are distributed universally across known Asgard phyla (Fig. 1d), specifically genes involved in (1) membrane remodelling (endosomal sorting complexes required for transport components VPS4/VPS22/VPS25), (2) cytoskeleton organization (actin, profilin and gelsolin (except in Odin LCB_4)), (3) protein N-linked glycosylation (OST3/STT3/ribophorin) and (4) intracellular trafficking (roadblock/LC7/dynein family and a large repertoire of small GTPases). On the other hand, enzymes involved in the synthesis of ester-linked phospholipids, which are critical for closing the ‘lipid divide’ between the Archaea and Eukaryota domains23,26, show a mosaic distribution across the Asgard archaea lineages (Fig. 1d). For example, both Ca. Heimdallarchaeum spp. in our study lack the 1-acyl-sn-glycerol-3-phosphate acyltransferase involved in the attachment of the second fatty acid chain to the glycerol backbone28.
Maximum-likelihood analysis using a previously described approach based on the SR4 model3,29 and a concatenation of a complete set of 56 single-copy markers indicates a close relationship between the Heimdall group archaea, which include the Heimdallarchaeum spp., and eukaryotes (Fig. 1d). This supports a parsimonious topology reported in multiple studies3,5,7. We additionally produced a set of customized Asgard-specific Hidden Markov Models (HMMs) (Supplementary Data 1) that complement existing Archaea-specific HMMs along with a set of filtering parameters (Methods and Supplementary Tables 8 and 9) as resources. Maximum-likelihood analyses of a greater diversity of Asgard archaea7,11,12,16 that were selected through the framework described above (19 of 282 evaluated MAGs shown in Extended Data Fig. 2) further verified the phylogenetic topology, placing the Heimdall group closest to eukaryotes (Extended Data Fig. 4b). We note that statistical model selection, taxonomic evenness and assumptions with rooting represent ongoing debates for deep phylogeny5,7. The circularized genomes and resources described in this study may assist with future analyses of the Asgard archaea using a broader range of statistical parameters and emerging high-quality genomes.
Abundant repetitive features
Our approach retained a substantial number of non-tandem repeats (3% of genome lengths) and tandem CRISPR or intragenic repeats (212 and 262 counts) within the circular Ca. Heimdallarchaeum spp. genomes (Fig. 2a,b). These repeats are notably more abundant than in the recently constructed circular genomes of Ca. P. syntrophicum23, where no tandem repeats and only 1% of non-tandem repeats were observed.
Non-tandem repeats in the Ca. Heimdallarchaeum spp. overlap prominently with one of the most pervasive mechanisms of gene transfer within and between genomes, that is, a total of 11 families of transposases/integrases, 7 of which have multiplied and transposed to result in up to 27 copies within an individual genome (Fig. 2a). These and other transposases/integrases found in Asgard archaea primarily cluster with various small families within the 96,367 transposase/integrase sequences recovered from the prokaryotic Genome Taxonomy Database (GTDB)30 (Fig. 2c). Despite the under-representation of archaeal sequences in public databases and in the transposase/integrase dataset in this study, they have representatives in almost all clusters. The intermingled evolutionary relationship between archaeal and bacterial transposases/integrases documented in this study can potentially be both the result of, and contributor to, the gene flow observed between these two domains31–33.
The circular genomes of Ca. Heimdallarchaeum spp. contain seven CRISPR–Cas systems (Fig. 2b), including five complete operons (labelled C1–3, 5, 6), one array-free operon (C7) and one orphan array (C4) (see Extended Data Fig. 5 for the complete gene organizations). In contrast to the overall gene conservation between the two genomes, these CRISPR–Cas systems exhibit strong variability and site-specific integration (Fig. 2d). For example, C5 and C6 exhibited a complete local operon swap, while C3 and C4 were integrated immediately next to transfer RNA genes, a feature often exploited by bacteriophages34 and other Heimdallarchaeal mobile elements (see examples in Fig. 3 below).
CRISPR–Cas-guided discovery of mobile elements
We recruited a total of 1,565 Heimdall-associated CRISPR spacers in our Pescadero metagenomes constructed in this study and previously published Guaymas metagenomes (Methods). They revealed eight protospacers within four distinct mobile elements, which are hosted by Ca. Heimdallarchaeum spp. and are unrelated to any previously reported mobile elements (outlined in Fig. 3a). We named them Heimdallarchaeal mobile elements HeimM1 and HeimM2 and Heimdallarchaeal viruses HeimV1 and HeimV2, respectively.
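The logic of spacer-guided discovery, in which a CRISPR spacer is matched to the protospacer it was acquired from, can be illustrated with a minimal sketch. It uses exact string matching on both strands with toy sequences. The matching procedure actually used in this study is described in the Methods and may tolerate mismatches; the function names and sequences below are hypothetical.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_protospacers(spacers: dict, contigs: dict) -> list:
    """Return (spacer_id, contig_id, 0-based position, strand) for exact matches."""
    hits = []
    for sp_id, sp in spacers.items():
        for c_id, contig in contigs.items():
            for strand, query in (("+", sp), ("-", revcomp(sp))):
                pos = contig.find(query)
                while pos != -1:
                    hits.append((sp_id, c_id, pos, strand))
                    pos = contig.find(query, pos + 1)
    return hits


# Toy data: one spacer, matched on the forward strand of contig_a
# and on the reverse strand of contig_b.
spacers = {"sp1": "GATTACAGGC"}
contigs = {
    "contig_a": "TTT" + "GATTACAGGC" + "AAA",
    "contig_b": "CC" + "GCCTGTAATC" + "TT",
}
hits = find_protospacers(spacers, contigs)
print(hits)
```

Contigs that carry protospacers but are not part of the host chromosome are candidates for mobile elements targeted by that host's CRISPR–Cas system.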
HeimM1, detected within the sediment-hosted Ca. H. aukensis, is a C2-associated small defence island encoding an efflux pump CcmA and contains a protospacer that matches a spacer at the same genomic locus in the rock-hosted Ca. H. endolithica PR6 C1 (Fig. 3b). Such a territorial dispute within the genome, as well as the site-specific integrations of CRISPR–Cas outlined above, exemplify the emerging view that defence systems are mobile elements themselves35 and contribute to gene flow between habitats.
HeimM2 (8 kbp) encodes an internalin-like, leucine-rich repeat peptide and an enzyme homologous to rRNA self-splicing homing endonucleases (Fig. 3c). The latter are typically found as group I introns embedded within rRNA genes and are considered selfish elements. In this study, this gene was part of a mobile element inserted exactly between the only copy of the 16S rRNA gene and the tRNA gene ArgTCT, suggesting that it has likely been co-opted by HeimM2 for site-specific integration at this site.
The putative integrated viruses HeimV1 and HeimV2 are both found in Ca. H. endolithica. Each encodes proteins with homologues preferentially found in the viral database IMG/VR v.336 compared to the microbial genome database GTDB v.202, and viral structural proteins predicted by machine learning-based annotations (PhANNs37) (Fig. 3d,e and Extended Data Fig. 6).
HeimV2 (44 kbp), integrated at the same site as HeimM2, may be a hybrid between a virus and a previously undescribed class of transposons, which we tentatively call aloposons, in reference to the twin giants Aloadae in Greek mythology. They share the following features (Fig. 3d). First, they all contain tandem genes encoding proteins 3,000–6,000 amino acids in size, which we refer to as Otus and Ephialtes, the Aloadae twins. Second, they all integrate at different tRNA sites downstream of the giant genes. Aloposon2 in Ca. H. endolithica and Aloposon3 in Ca. H. aukensis represent a highly conserved element that has transposed from one tRNA site to the other during its coevolution with its host. Third, they all encode four consecutive genes upstream of the giant genes, including a gene encoding a bacterial MinD/ParA-like AAA family ATPase. Additionally, we found tandem giant genes in two Thorarchaeota MAGs showing distant homology to the Heimdallarchaeum giant proteins, as well as many unrelated giant genes across the Asgard archaea, some of which may also be part of Asgard mobile elements (Extended Data Fig. 7).
Putative virus HeimV1 (30 kbp) is a circular element with a highly polycistronic gene arrangement and an enrichment in nucleic acid-processing enzymes, viral structural proteins and viral gene homologues (Fig. 3e). As shown in Fig. 3f, HeimV1 exists in two states. Besides the genome-integrated lysogenic state found in one of the incubations, where its sequencing read abundance was at the same level as its genomic neighbourhood, in another enrichment incubation, HeimV1 showed an anomalously high read abundance relative to the host Ca. H. endolithica, suggestive of active replication. PCR and Sanger sequencing further confirmed the circularized state of HeimV1 as well as its integration between the host transposase and tRNA genes. Furthermore, the detailed sequencing read abundance profile across HeimV1 shows the characteristic V shape of an unsynchronized, bidirectionally self-replicating population of circular DNA elements (Fig. 3f). Such a well-defined profile can only emerge if the replications in each HeimV1 circular element initiate at a defined origin of replication.
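The reasoning behind the coverage argument can be made concrete. In an unsynchronized population of bidirectionally replicating circles, sequence near the origin is present in more copies than sequence near the terminus, so binned read depth peaks at the origin and falls towards the opposite side of the circle. The sketch below is illustrative only and is not the analysis used in this study: it smooths a circular depth profile and reports the bins with the highest and lowest depth. The bin count, smoothing window and simulated profile are assumptions.

```python
def circular_smooth(cov, half_window):
    """Moving average over a circular profile (indices wrap around)."""
    n = len(cov)
    width = 2 * half_window + 1
    return [
        sum(cov[(i + k) % n] for k in range(-half_window, half_window + 1)) / width
        for i in range(n)
    ]


def origin_and_terminus(cov, half_window=2):
    """Return (origin bin, terminus bin, peak-to-trough ratio) of smoothed depth."""
    sm = circular_smooth(cov, half_window)
    ori = max(range(len(sm)), key=sm.__getitem__)
    ter = min(range(len(sm)), key=sm.__getitem__)
    return ori, ter, sm[ori] / sm[ter]


# Simulated profile: 60 bins around a circle, origin placed at bin 45.
# Depth halves between the origin and the point opposite to it.
n_bins, true_origin = 60, 45
coverage = []
for i in range(n_bins):
    d = min(abs(i - true_origin), n_bins - abs(i - true_origin))  # circular distance
    coverage.append(100.0 * 2 ** (1 - d / (n_bins / 2)))

ori, ter, ratio = origin_and_terminus(coverage)
print(ori, ter)  # prints "45 15"
```

A peak-to-trough ratio close to 1 would instead indicate a non-replicating element whose depth follows that of its genomic neighbourhood.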
The mobile elements described above also influence ecosystems beyond the southern Pescadero Basin vent system. CRISPR spacers targeting HeimV1 and HeimV2 were detected in metagenomes from the Guaymas Basin22, a hydrothermal vent site 400 km northwest of the southern Pescadero Basin. The Pescadero-derived mobile element HeimM1 in Ca. H. aukensis also exists in the Ca. H. endolithica B53_G16 MAG assembled from the Guaymas Basin. Furthermore, HeimV1-related proviruses encoding tail fibre protein homologues are also found in the Heimdall group MAGs from the Gulf of Mexico in the Atlantic (Gerdarchaeota clade E44_bin34 (ref. 9)) and from the South China Sea (Hodarchaeota clade B3_Heim10) on the other side of the Pacific (Fig. 3e). Notably, the contig in the E44_bin34 MAG maintains the same gene synteny around the tail fibre gene as in HeimV1, albeit with only approximately 30% sequence identity. These observations indicate the expansive distribution of these mobile elements in diverse lineages of Heimdall group archaea across a large geographical range in deep sea ecosystems.
Diverse evolutionary origins of Heimdallarchaeal viruses
Phylogenetic analyses of viral genes indicate that HeimV1 and HeimV2 share their evolutionary origins with bacteriophages. As shown in Fig. 4a, the viral integrase of HeimV1 is phylogenetically most closely related to integrases found in environmental bacteriophages identified to be hosted by the phylum Bacteroidetes, along with integrases found in seven families of Bacteroidetes and other viruses with microbial hosts that are unidentified. Similarly, independent phylogenetic analyses of the HeimV2-encoded homologues of IbrA and IbrB, proteins affiliated with prophage transcriptional regulators, both found their closest relatives in bacteriophages or unidentified elements targeting diverse members of the phylum Firmicutes (Fig. 4b and Extended Data Fig. 8).
While most viruses encoding genes related to HeimV1 and HeimV2 are unclassified, several belong to the order Caudovirales, including members of the family Siphoviridae. Well-studied members of Caudovirales are known to be tailed bacteriophages packaging double-stranded DNA, in line with the machine learning-based predictions of tail fibres in both HeimV1 and HeimV2 (>90% confidence; Fig. 3d,e).
Heimdallarchaeal viruses and other mobile elements associated with the Heimdall group archaea are predicted to have origins in both bacteria and archaea. For example, HeimV1 encodes a protein with two unknown domains flanking a full-length CTPase homologous to Noc/ParB/Spo0J-like proteins that bind DNA and regulate bacterial cell division (Fig. 4c). On the other hand, the HeimV1 methylase gene appears to have evolved from the Asgard archaea and is potentially involved in evading host detection (Fig. 4d). Phylogenetic analysis suggests that divergence of this viral methylase from its host was an ancient event that occurred before the divergence between the Heimdall and Loki group archaea, estimated to have taken place around two billion years ago38.
A survey of Heimdallarchaeum-associated protospacers within the entire Pescadero/Guaymas metagenomic dataset yielded 56 total contigs belonging to the putative Heimdall group mobile elements (Supplementary Data 2). Most coding sequences (76.9%) have no apparent homology with known microorganisms and viruses, while another 13.1% have homologues in diverse bacteria (Fig. 4e), which is higher than the 8.9% archaeal fraction. This further suggests that mobile elements and viruses may play a prominent role in shaping the evolution of Heimdallarchaeota by introducing functional innovations of bacterial origin.
Asgard–eukaryote parallelism in bacterial gene import
To understand the consequence of cross-domain gene flow in the evolution of Asgard archaea, we performed protein orthology-based functional and taxonomic profiling39 of the proteomes encoded by the complete genomes in this study. Functional analyses of the Asgard archaeal proteome based on clusters of orthologous groups (COGs)39,40 revealed distinct categories of genes that are associated with different taxonomic groups (Fig. 5a). The Archaea-related proteins in Asgard archaea were predominantly represented by information processing functions, including translation (J), transcription (K) and replication and repair (L), which is similar to the key archaeal modules inherited by eukaryotes41. By contrast, the annotated bacteria-related proteins were preferentially enriched in metabolic functions, including energy production and conversion (C) and the metabolism and transport of amino acids (E), carbohydrates (G) and inorganic ions (P). Different from both the above groups, nearly half of eukaryote-related proteins within the Asgard genomes were dedicated to intracellular trafficking and secretion (U), and cytoskeleton (Z) and protein modification (O) functions.
The import of bacterial genes into archaea and eukaryotes has been independently explored31,32,41,42. In this study, we show that the inheritance of information processing from the Archaea and metabolic functions from the Bacteria domain in the Asgard archaea is very similar to the signature of the eukaryotic genome profile. Strikingly, the archaeal:bacterial gene ratio forms an inverse relation with the genome size in Asgard archaea that is quantitatively comparable with previous characterizations across eukaryotes41 (Fig. 5b). Such a quantitative agreement on their genome size dependence suggests that the bacterial import of genomic material into eukaryotes may not necessitate an independent mechanism (such as endosymbiosis42) or a dramatically different selective force from their closest archaeal relatives. Instead, genome size control alone may be sufficient to account for the over-representation of bacterial genes in some eukaryotes43.
Domain-specific scaling of gene flow
Different scaling laws appear to govern the fluidity of genes with different taxonomic origins within the Asgard archaea. The total number of genes with closest orthologues in Archaea was remarkably invariant at approximately 900 genes across all Asgard archaeal representatives that span a threefold difference in genome size, from 1.5 Mbp in Odin LCB_4 to 4.4 Mbp in Lokiarchaeotes (Fig. 5c). While the archaeal reference database is currently significantly smaller than the bacterial one, which likely caused an underestimation of the exact number of archaea-related genes, the trend cannot be explained by such a database bias. On the other hand, we found that genome completeness and accuracy are key to capturing this feature since it is otherwise entirely obscured in Asgard genomes of variable completeness and contamination levels (Extended Data Fig. 9). By contrast, the bacterial, eukaryotic and taxonomically unassigned fractions of the genome increased linearly with the remaining portion of the genome. These scaling properties suggest a fundamental difference in the evolutionary plasticity between conserved archaeal ‘core’ genes and other fractions of the gene content with different evolutionary origins among the Asgard archaea.
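These two scaling behaviours are sufficient to produce an inverse dependence of the archaeal:bacterial gene ratio on genome size, as the short derivation below illustrates. It is a sketch under stated assumptions rather than a fitted model: N denotes the total gene count of a genome (taken to grow with genome size), A_0 is the approximately constant archaea-related gene count (about 900) and β is a hypothetical proportionality constant for the bacteria-related fraction, which is not estimated here.

```latex
% A(N): archaea-related genes, approximately constant
% B(N): bacteria-related genes, linear in the non-archaeal remainder
\begin{align}
  A(N) &\approx A_0 \approx 900, \\
  B(N) &\approx \beta \,(N - A_0), \qquad 0 < \beta < 1, \\
  \frac{A(N)}{B(N)} &\approx \frac{A_0}{\beta \,(N - A_0)} .
\end{align}
```

Under these assumptions the ratio falls monotonically as N increases, and for N much larger than A_0 it approaches A_0/(βN), that is, an inverse relation with genome size.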
Decentralized eukaryotic innovation
Eukaryote-related proteins (ERPs) capture present-day Asgard–Eukaryota protein orthologues that are estimated to be most closely related to each other. They include, but are not restricted to, previously investigated ESPs3,6,7—loosely defined as eukaryotic proteins with no archaeal or bacterial homologues in the predicted last eukaryotic common ancestor (LECA)44. Our analyses show that the scaling property of ERPs is similar to bacteria-related but not archaea-related proteins (Fig. 5c), prompting us to explore their evolutionary fluidity across Asgard archaea lineages.
Beyond the ESPs described above, which are shared by all Asgard archaea (Fig. 1d), we found diverse families of ERPs existing in only one or two of the Asgard clades examined in this study (Fig. 6a). Comparison of the circular genomes of Ca. Heimdallarchaeum spp. and the Lokiarchaeote Ca. P. syntrophicum revealed fewer than half of their ERP families being shared, notably with members of the Heimdallarchaeum harbouring fewer ERPs overall, despite their closer phylogenetic relationship with eukaryotes (Fig. 6b). Furthermore, even species related at the genus (Ca. Heimdallarchaeum spp.) or family levels (within Thorarchaeota/Lokiarchaeota) have apparent differences in their ERP pools (Fig. 6b). Such a high mobility of ERPs in the recent evolutionary history of Asgard archaea suggests that many of these genes are involved in the auxiliary but not core cellular functions. They are likely, or could have been during their evolutionary history, shuffled as part of their mobilomes. Hence, the evolutionary entanglement between the Asgard archaea and the Eukaryota must be understood in the pan-Asgard space and in the context of genome size expansion.
Thus, our analyses collectively suggest a plausible scenario where an ancestral Heimdall group archaeon with a small genome engaged in endosymbiosis with a bacterium and established the archaeal basis of information processing in the first eukaryotic common ancestor (FECA). The remaining defining features of eukaryotes are a result of decentralized innovations across the tree of life that became hierarchically imported, most frequently and often indirectly, through Asgard archaea lineages closest to FECA, to ultimately orchestrate LECA (Fig. 6c). As such, it is possible that the acquired non-essential genes were later co-opted to serve essential functions as the archaeon–bacterium symbiont expanded its regulatory complexity. We refer to this conceptual framework as the Heimdall nucleation–decentralized innovation–hierarchical import (HDH) model for future implementation and debate.
Discussion
The contiguous and complete genomes of Asgard archaea constructed in this study allowed us to resolve the composite origins of their genetic repertoires and identify diverse, unique mobile elements as their drivers. One important facet to be considered is timescale. While the pivotal role of horizontal transfer in the diversification of Asgard archaea is evidenced by the high number of bacteria-related genes found in this study, a considerable fraction of these genes is likely now stable in their respective lineages and only a certain fraction is a part of their present-day mobilomes—the entire set of mobile elements in a genome. However, the uncharted features, such as the extraordinarily large proteins in aloposons and Asgard-specific host range of mobile elements found in this study, suggest that the Asgard archaea mobilome may still hold ancient signatures inherited around the time of eukaryogenesis. Expanding the repertoire of complete genomes in a broader Asgard archaea taxonomic range, pan-genomic analyses of the same or closely related species and molecular clock approaches will together help chronicle the horizontal transfer events across their evolutionary history. Given that the presence of bacterial genes is prevalent in both branches of the Asgard–eukaryote sisterhood, it will be particularly exciting to explore the extent to which bacterial genes have been transferred into their shared ancestors before eukaryogenesis.
Genome size variability in both eukaryotes and prokaryotes has been attributed to rapid expansion driven by mobile elements followed by gradual erosion under natural selection (such as nutrient availability)45,46. It is thus reasonable to assume that such expansion–erosion cycles would have occurred around the time of eukaryogenesis. While the mechanism of genome expansion around eukaryogenesis is genetic, which will be further elucidated by future discoveries of more Asgard archaea mobile elements, the selection pressure for these traits is ecophysiological. In this study, we showed that the influx of genes into the Asgard archaea is highly constrained by genome size in a similar fashion as in eukaryotes. Hence, resolving the ecophysiological drivers of genome size stratification across Asgard archaea lineages may help us unlock the origin of eukaryotic genome complexity.
Etymology
Ca. H. endolithica PR6
Heimdall, watchman of the gods in Norse mythology; archaios (Greek), ancient, primitive; endo- (Greek), within; lithos (Greek), rock. Proposed classification: class Ca. Heimdallarchaeia, order Ca. Heimdallarchaeales, family Ca. Heimdallarchaeaceae, genus Ca. Heimdallarchaeum.
Ca. H. aukensis PM71
Heimdall, watchman of the gods in Norse mythology; archaios (Greek), ancient, primitive; Auka, the local hydrothermal vent field in the southern Pescadero Basin where the species originated; -sis (Greek), process or condition. Proposed classification same as above.
Ca. H. repetitus FW102
Harpocrates, Greek god of silence; archaios (Greek), ancient, primitive; repetita (Latin), repetitive (referring to the high fraction of repetitive sequences that constitute 4% of the genome). Proposed classification: class Ca. Lokiarchaeia, order Ca. Lokiarchaeales, family Ca. Prometheoarchaeaceae, genus Ca. Harpocratesius.
Methods
Hydrothermal vent rock and sediment sample collection
Rock no. NA091-R045 (source of Ca. H. endolithica PR6, Ca. H. repetitus FW102 and Thorarchaeote FW25) and rock no. NA091-R008 (source of Heimdall group Gerdarchaeote AC18) were retrieved from the Auka hydrothermal vent site situated on the margin of the southern Pescadero Basin of the Gulf of California using remotely operated vehicle Hercules during research expedition NA091 on E/V Nautilus on 2 November 2017. Local venting fluids have a measured temperature approaching 300 °C, contain hydrocarbons and hydrogen and are precipitating minerals, such as calcite and barite15. R045 was collected during dive H1658 at coordinates 23.956987786° N, 108.86227922° W at a water depth of 3,674 m, near shimmering water, a sign of locally focused hydrothermal fluid discharge. R008 was collected during dive H1657 at coordinates 23° 57′ N, 108° 52′ W at a water depth of 3,651 m. After shipboard recovery, rock samples were placed in Mylar bags prefilled with 0.2 µm filtered bottom seawater collected during the same dive, flushed with N2 gas for 10 min, sealed and stored at 4 °C until preparation for incubations in the laboratory.
Sediment sample no. FK181031-S0193-PC3 (source of Ca. H. aukensis) was collected during the research expedition FK181031 on R/V Falkor to the southern Pescadero Basin on 14 November 2018. The sample was collected during dive S193 at the Auka hydrothermal vent site (23.954822° N, 108.863009° W, water depth of 3,657 m), near the site where rocks nos. NA091-R045 and NA091-R008 were collected in 2017. The sediment push core was extruded upwards and sectioned into discrete 3 cm depth horizons on board immediately after recovery, transferred into sterile Whirl-Pak bags and sealed in a larger Mylar bag, flushed with argon gas, heat-sealed and stored at 4 °C until use in the laboratory.
Sample collection permits for the expedition were granted by the Dirección General de Ordenamiento Pesquero y Acuícola, Comisión Nacional de Acuacultura y Pesca (Permiso de Pesca de Fomento no. PPFE/DGOPA-200/18) and the Dirección General de Geografía y Medio Ambiente, Instituto Nacional de Estadística y Geografía (authorization no. EG0122018), with the associated diplomatic note no. 18-2083 (CTC/07345/18) from the Secretaría de Relaciones Exteriores-Agencia Mexicana de Cooperación Internacional para el Desarrollo/Dirección General de Cooperación Técnica y Científica.
Artificial seawater medium recipe
Artificial seawater was prepared as described in Scheller et al.47 with minor modifications. Briefly, 1 l of artificial seawater (ASW) medium contained 46.6 mM MgCl2, 9.2 mM CaCl2, 485 mM NaCl, 7 mM KCl, 20 mM Na2SO4, 1 mM K2HPO4, 2 mM NH4Cl, 1 ml of 1,000× trace element solution, 1 ml of 1,000× vitamin solution and 0.5 mg of resazurin and was buffered by 25 mM HEPES buffer adjusted to pH 7.5. One litre of 1,000× trace element solution contained 50 mM nitrilotriacetic acid, 5 mM FeCl3, 2.5 mM MnCl2, 1.3 mM CoCl2, 1.5 mM ZnCl2, 0.32 mM H3BO3, 0.38 mM NiCl2, 0.03 mM Na2SeO3, 0.01 mM CuCl2, 0.21 mM Na2MoO4 and 0.02 mM Na2WO4. One litre of 1,000× vitamin solution contained 82 μM d-biotin, 45 μM folic acid, 490 μM pyridoxine, 150 μM thiamine, 410 μM nicotinic acid, 210 μM pantothenic acid, 310 μM para-aminobenzoic acid, 240 μM lipoic acid, 14 μM choline chloride and 7.4 μM vitamin B12.
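The working concentrations of the trace elements follow directly from the 1,000× dilution stated above (1 ml of stock in 1 l of medium). The short sketch below only restates that arithmetic; the function name and the dictionary layout are illustrative and not part of the original protocol.

```python
# Stock concentrations (mM) of the 1,000x trace element solution, as listed in the recipe.
TRACE_STOCK_MM = {
    "nitrilotriacetic acid": 50.0,
    "FeCl3": 5.0,
    "MnCl2": 2.5,
    "CoCl2": 1.3,
    "ZnCl2": 1.5,
    "H3BO3": 0.32,
    "NiCl2": 0.38,
    "Na2SeO3": 0.03,
    "CuCl2": 0.01,
    "Na2MoO4": 0.21,
    "Na2WO4": 0.02,
}


def final_concentration_um(stock_mm, dilution_factor=1000.0):
    """Convert a stock concentration (mM) to the working concentration (uM)."""
    return stock_mm * 1000.0 / dilution_factor


# Working concentrations in the finished medium, in micromolar.
TRACE_FINAL_UM = {name: final_concentration_um(mm) for name, mm in TRACE_STOCK_MM.items()}
```

With a 1,000× stock the numerical value is unchanged and only the unit shifts from mM to µM, so 5 mM FeCl3 in the stock corresponds to 5 µM in the medium.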
Enrichment cultivation
Rock no. NA091-R045 was anaerobically fragmented; then, approximately 5 g wet weight was crushed using a sterile agate mortar and pestle on 8 November 2018 and immediately immersed in anaerobic ASW medium in 25–125 ml butyl rubber-stoppered serum bottles supplemented with different carbon/energy sources, including lactate, H2/CO2, hexane and decane, and incubated in the dark at 40 °C (Extended Data Fig. 1a). The headspace for all cultures was flushed and overpressurized with N2 gas (2 atm). For the H2-containing cultures, the N2 gas headspace was replaced with H2/CO2 at an 80:20 mixture by flushing for 1 min and subsequent equilibration at 2 atm. After 33 d of incubation, the lactate-fed first-generation culture produced 5 mM sulphide, indicating active sulphate reduction. This enrichment was mixed by gentle shaking and diluted 1:100 vol/vol into fresh anaerobic ASW medium containing the same suite of carbon/energy sources as described above (Extended Data Fig. 1b). A transfer using only the liquid fraction of the primary lactate enrichment, lacking rock particles, was also included to enrich for members of the planktonic community alone with lactate as the carbon and energy source. This enrichment was later found to be devoid of the AAG (Heimdall) phylotype. Third- and fourth-generation cultures were set up in the following months through 1:100 dilution (Extended Data Fig. 1b). Further details of microbial community development in these enrichments are provided in Supplementary Note 1 and Supplementary Tables 1–3.
R008 was prepared as above except using 2 atm of methane in the headspace as the sole carbon source and electron donor. The culture was passaged twice using a 1:100 dilution under the same culturing conditions; the cell fraction was collected by centrifugation after a total of 22 months for metagenomic sequencing (described below).
For sediment enrichment cultivation, the top 3 cm section of the sediment core was mixed with anaerobic ASW at a 1:4 vol/vol ratio; a total of 60 ml volume each was dispensed into seven 125 ml glass serum bottles sealed with butyl rubber stoppers. The headspace was replaced by ethane (2 atm) in 2 bottles (Supplementary Table 5), while the headspace in 1 bottle was replaced by 100% N2 gas (2 atm). The cultures were incubated at 37 °C in the dark. Further details on microbial community development are provided in Supplementary Note 1 and Supplementary Table 4.
Mineralogical analyses
The mineralogical composition of rocks NA091-R045 and R008 was characterized on a PANalytical X’Pert Pro X-Ray diffractometer. A dried rock aliquot was finely powdered using a clean agate mortar and pestle and scanned from 3 to 75° (2θ angle) at a 0.0167° step size. Mineral identification was performed with the X’Pert HighScore software v4.1 using the search and match algorithm.
DNA extraction
Combined cells with rock or sediment substrate were pelleted through centrifugation at 13,000 r.p.m. for 3 min. For amplicon sequencing, unless specified in Supplementary Table 6, DNA was extracted using the Qiagen DNeasy PowerSoil kit (catalogue no. 47014) according to the manufacturer’s instructions as described previously48 with a minor modification, where mechanical shearing was carried out using the MP Biomedicals FastPrep-24 system (catalogue no. 116004500) at level 5.5 for 45 s. For genomic sequencing, incubated rock and sediment cultures were extracted using multiple approaches, including the Qiagen DNeasy PowerSoil kit, ZymoBIOMICS 96 MagBead DNA Kit (catalogue no. D4302; Zymo Research Corporation), Quick-DNA 96 Kit (catalogue no. D3010; Zymo Research Corporation), ZymoBIOMICS DNA Microprep Kit (catalogue no. D4301; Zymo Research Corporation) and a standard phenol/chloroform-based protocol. The list of samples and their extraction methods are provided in Supplementary Table 6.
16S rRNA gene amplicon sequencing
For amplicon (iTAG) sequencing of 16S rRNA genes, extracted DNA was amplified using primer pair 515f/806r GTGCCAGCMGCCGCGGTAA/ GGACTACHVGGGTWTCTAAT, barcoded and sequenced at Laragen using the Illumina MiSeq platform and analysed using Qiime v.1.8.0 (ref. 49) as described previously48. Taxonomic assignment was based on the SILVA 138 database (https://www.arb-silva.de)50.
Full-length 16S archaeal rRNA gene sequences were amplified using the archaeal primer pair SSU1Arf/SSU1492Rngs TCCGGTTGATCCYGCBRG/ CGGNTACCTTGTKACGAC as described by Bahram et al.51, multiplexed as instructed by PacBio and sequenced using the PacBio Sequel II at the Brigham Young University DNA Sequencing Center and then analysed using the DADA2 package v1.9.1 in R v3.6.0 as described in Callahan et al.52 using the SILVA 138 database for taxonomic classification. Note that in the SILVA 138 database, all Asgard archaea clades are classified under Asgardarchaeota.
Metagenomic sequencing
A total of 11 metagenomic sequencing runs were performed using the Illumina and Oxford Nanopore platforms, with details listed in Supplementary Table 6. For Illumina short-read sequencing, libraries were constructed using the NEBNext Ultra and Nextera Flex Library kits as specified in Supplementary Table 6. Sequencing was carried out using a HiSeq 2500 system (single-end, 100 bp) at the Caltech Genetics and Genomics Laboratory and a HiSeq 4000 system at Novogene (paired-end, 150 bp). Only paired-end data were used for assembly, while all data were used for error correction. Due to the low DNA quantity obtained from the sediment incubation that yielded Ca. H. aukensis, we used multiple displacement amplification with the QIAGEN REPLI-g Midi Kit before library preparation for Nanopore sequencing. Oxford Nanopore sequencing libraries were constructed using the PCR Barcoding Kit (catalogue no. SQK-PBK004) and were sequenced on MinION flow cells FLO-MIN106. Base calling was performed with the ONT Guppy software v.3.4.5.
Genome assembly, error correction and read coverage mapping
Two different approaches were used to assemble contiguous genomes from metagenomes. For species of interest, if Nanopore sequencing yielded high read coverage and read lengths N50 > 2 kb, we obtained more contiguous genomes through de novo assembly purely based on Nanopore reads. If Nanopore sequencing did not yield a high number of reads or exhibited low read lengths, we obtained more contiguous genomes through de novo assembly first based on Illumina reads and then joined using Nanopore reads.
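The choice between the two assembly routes can be sketched as a simple decision rule. The read-length N50 threshold of 2 kb comes from the text; the coverage threshold is not stated numerically in the source, so `min_coverage` below is a hypothetical placeholder, as are the function names and the returned labels.

```python
def read_length_n50(lengths):
    """Return the read-length N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def choose_assembly_route(lengths, coverage, min_coverage=30.0, min_n50=2000):
    """Pick an assembly route. min_coverage is a hypothetical stand-in for
    'high read coverage'; min_n50 (2 kb) is the threshold given in the text."""
    if coverage >= min_coverage and read_length_n50(lengths) > min_n50:
        return "nanopore_de_novo"
    return "illumina_de_novo_then_nanopore_joining"
```

For example, `choose_assembly_route([1000, 2000, 3000, 4000], coverage=50)` selects the Nanopore-only route because the N50 of these toy reads is 3,000 bp.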
For Ca. H. endolithica, Nanopore sequencing data were assembled de novo using Canu17 v.2.1, which yielded a 30 Mbp assembly, including a 3.4 Mbp contig. The approximate 40 kilobase (kb) regions at two ends of an approximate 3.4 Mbp contig were repetitive. This repeated region was deleted at one end and the two ends were joined to result in a circular genome. The resulting genome was mapped using BamM (http://ecogenomics.github.io/BamM/, based on Burrows–Wheeler Aligner53 mapping) with 150 bp Illumina paired-end reads (88× coverage on average) and 100 bp single-end reads (20× coverage). Mapped reads were then used for error correction through pilon54 v.1.22. To account for the reduced mapping at the edges (approximate 50 bp region), the two ends of the genomic sequence were joined, read-mapped and error-corrected again using the same methods. After the genome was annotated, it was rotated such that the genomic sequence ended with tRNA (GlyCCC), which was the integration site of the putative provirus HeimV1. All sequencing reads derived from incubations of the same rock were mapped onto the final genome using BamM, which was then used for coverage calculation through bedtools (https://bedtools.readthedocs.io/en/latest/).
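Rotating a circular genome so that the linear record ends with a chosen feature is a pure coordinate shift. The sketch below illustrates the idea on a toy string; it assumes the feature occurs as an exact substring on the forward strand, and the function name is illustrative rather than taken from the authors' scripts.

```python
def rotate_to_end_with(seq, feature):
    """Rotate a circular sequence so that its linear form ends with `feature`.

    The search runs on the doubled sequence so that a feature spanning the
    current start/end junction is still found.
    """
    if not seq or not feature or len(feature) > len(seq):
        raise ValueError("feature must be non-empty and no longer than the sequence")
    start = (seq + seq).find(feature)
    if start == -1:
        raise ValueError("feature not found on the forward strand")
    end = (start + len(feature)) % len(seq)
    # Everything after the feature moves to the front; the feature closes the record.
    return seq[end:] + seq[:end]
```

The rotation changes only where the circle is cut: the length and the circular order of the bases are preserved.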
For Ca. H. aukensis, Illumina PE150 bp sequencing data were assembled using SPAdes18 v.3.14.1 with the ‘--meta’ option and k-mers 21,33,55,77,99. The assembly was then scaffolded using Nanopore reads through two iterations of LRScaf55 v.1.1.10. The Ca. H. aukensis genome was joined after trimming the identical sequences at the two ends. The end-joining region was verified through PCR amplification and Sanger sequencing using the primer pair CGCTTTCTTCAAACAATATTTCTGGTG/CTTACTTTCTCTCGGTCCATTTTTCAC. Finally, a 1 kbp stretch of unresolved genomic sequence at an approximate 2.9 Mbp position was resequenced through PCR amplification and Sanger sequencing using the primers GAGTTTTTTCAATCTTATAATGCCAAACTAAAAAATAG (forward), CAGTCAGATTTGACACAATTTTGGTC (reverse) and GCTGGACTCAACCTATAACTAATAGT (reverse). The final assembly was read-mapped and error-corrected through pilon v.1.24 using 346× coverage. It was rotated as described above to place the tRNA gene GlyCCC at the end.
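Trimming identical sequence from the two ends of a linear contig before joining it into a circle can be illustrated as follows. This is a toy sketch that assumes an exact, error-free terminal overlap and uses a simple quadratic scan; it is not the procedure used in the study, where real overlaps would normally be located by alignment.

```python
def trim_terminal_overlap(seq, min_overlap=1):
    """Find the longest exact overlap between the start and end of `seq`
    and remove the copy at the 3' end.

    Returns (trimmed_sequence, overlap_length). If no overlap of at least
    `min_overlap` bases exists, the sequence is returned unchanged with 0.
    """
    for k in range(len(seq) // 2, min_overlap - 1, -1):
        if k > 0 and seq[:k] == seq[-k:]:
            return seq[:-k], k
    return seq, 0
```

After trimming, the last base of the returned sequence is followed directly by its first base in the circular molecule, so the duplicated stretch is represented only once.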
The metagenome containing the Lokiarchaeote Ca. H. repetitus FW102 was assembled using Canu v.2.1, as described for the Ca. H. endolithica genome, and then binned using metabat2 v.2.15 (ref. 56) with default parameters. The bin was then used to recruit long reads using minimap2 v.2.17 and reassembled and binned again. We then used LRScaf to scaffold the contigs and used ten iterations of pilon v.1.24 to achieve error correction and resolve ambiguous bases.
The Thorarchaeote FW25 MAG was assembled using the hybrid assembly of Illumina reads and Nanopore reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99, and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package (http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_intro). These reads were then used for hybrid assembly with Nanopore long reads via SPAdes v.3.14.1 with k-mers 21,33,55,77,99. The assembly was then binned again using metabat2 v.2.15 with default parameters to yield the final Thorarchaeote FW25 MAG.
The metagenome containing Gerdarchaeote AC18 was assembled from Illumina reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99 and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package and then reassembled and binned using SPAdes and metabat2 to yield the final Gerdarchaeote AC18 bin.
Alignment fraction, ANI and AAI
ANI and alignment fraction values, independently calculated for rRNA, tRNA and coding gene sequences were obtained using ANIcalculator57 2014-127, v.1.0 (https://ani.jgi.doe.gov/html/download.php?). Note that Lokiarchaeote FW102 contains 2 copies of 16S rRNA genes at 99% identity with each other, and Thorarchaeote BC has a partial 16S rRNA gene. The alignment of 16S rRNA was carried out using SINA58 v.1.2.11. The AAI values of translated proteomes were obtained with the enveomics package v1.8.059. The final output is shown in Supplementary Table 7.
Genome and mobilome annotations
Gene calling was done using a combination of Prodigal v.2.6.3 and Glimmer v.3.0.2 using translation code 11 within the RASTtk60 pipeline, now under the PATRIC package v1.03261. Translated coding sequences were annotated and domain-assigned using eggNOG mapper39 v.2. The tRNA, 16S rRNA and 23S rRNA genes were identified using RNAmmer62 v.1.2 embedded in RASTtk. Thus far, 5S rRNA gene sequences could not be predicted through the existing HMM using various approaches. Long, non-tandem repeats were identified using RASTtk with the default cut-off of 95% identity and 100 bp. Tandem repeat sequences were identified using RASTtk, Prokka v1.14.6 and CRISPRCasTyper 1.1.463. Prokka and CRISPRCasTyper both employ MinCED (https://github.com/ctSkennerton/minced) to identify repeats and detect intragenic tandem repeats, which were manually removed from the CRISPR–Cas analyses. The Cas genes were annotated using CRISPRCasTyper.
All identified Heimdallarchaeum mobilomes were further analysed using PSI-BLAST 1.10.064, CDD search v3.1965 and PhANNs webserver (version March 2021)37.
Genome evaluation and HMM construction
Marker coverage was carried out using a two-step process. First, we used the automated marker analyses via CheckM66 v.1.1.3 with the lineage_wf option and the default HMM E value cut-off, which included the 149 standard archaeal single-copy marker set. Next, each of the missing markers was examined with hmmer67 v.3.3.2 using the hmmsearch option with manual inspection of alignment regions and bitscores. This rescued markers unidentified through the default cut-offs by CheckM as well as divergent variants that most likely functionally replace the genuinely missing marker. The detailed description of markers missed by CheckM can be found in Supplementary Note 2 and the final evaluation of marker presence is displayed in Extended Data Fig. 4a and Supplementary Table 15. Next, we constructed an updated HMM set to replace the CheckM set by (1) updating all HMM to the most recent versions, (2) removing the six commonly missing or duplicated markers shown in Extended Data Fig. 4a from the list and (3) overcoming the pitfall of existing HMMs constructed using only a few sequences acquired from Euryarchaeota and Crenarchaeota. We manually constructed Asgard-specific versions based on the 282 Asgard archaea genomes. The HMMs constructed in this study are PF00832.ASG, PF00861.ASG, PF01194.ASG, PF01287.ASG, PF01667.ASG, PF03874.ASG, PF03876.ASG, PF13656.ASG, TIGR00270.ASG, TIGR00336.ASG, TIGR00442.ASG, TIGR02338.ASG and TIGR03677.ASG. The updated HMM file has been provided as a supplementary data file. The updated HMM was used to evaluate the 282 genomes reported in this study and in the literature3,6–12,16,23,26,68–77 through (1) CheckM, which uses Prodigal for gene calling, and (2) the more up to date HMMER3.2.2 on our gene calls described above. The latter generally produced slightly higher completeness and redundancy values (Supplementary Tables 8 and 9). For the expanded set of Asgard archaea genomes used for the phylogenomic analyses shown in Extended Data Fig. 4b, we applied the following filtering criteria: ≤100 contigs, >96% marker completeness and <8% marker redundancy. We also took the evenness of taxonomic sampling into account. The set is also shown in the Asgard archaea tree in Extended Data Fig. 2. The importance of genome quality evaluation is highlighted in Extended Data Fig. 9.
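The three numeric filtering criteria can be expressed as a single predicate. The sketch below is illustrative only: the record type, field names and the example values are hypothetical, and the additional, manual consideration of taxonomic evenness is not captured.

```python
from dataclasses import dataclass


@dataclass
class GenomeQC:
    """Hypothetical per-genome quality record (percentages on a 0-100 scale)."""
    name: str
    contigs: int
    completeness: float
    redundancy: float


def passes_quality_filter(genome):
    """Apply the stated thresholds: <=100 contigs, >96% marker completeness
    and <8% marker redundancy."""
    return (
        genome.contigs <= 100
        and genome.completeness > 96.0
        and genome.redundancy < 8.0
    )


# Made-up example records, not values from the study.
examples = [
    GenomeQC("closed_genome", contigs=1, completeness=99.0, redundancy=1.5),
    GenomeQC("fragmented_mag", contigs=350, completeness=97.0, redundancy=2.0),
    GenomeQC("redundant_mag", contigs=40, completeness=98.0, redundancy=9.0),
]
kept = [g.name for g in examples if passes_quality_filter(g)]
```

Note that the completeness and redundancy bounds are strict inequalities, whereas the contig bound is inclusive.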
Phylogenomics
A phylogenomic tree of Asgard archaea was constructed with IQ-TREE v.2.1.2 (ref. 78) using a partitioned analysis79 with model selection using ModelFinder80 and 1,000 ultrafast bootstrap replicates using UFBoot281 on a concatenated alignment generated from MUSCLE82 v.3.8.1551 alignments of 76 archaeal marker genes identified in the genomes using HMMs included with anvi’o v.6.2 (ref. 83). The phylogenomic tree was visualized using iTOL84 and rooted with the TACK superphylum.
The Archaea–Eukaryota phylogenomic tree, including the Asgard genomes discussed in this study, was constructed based on the 56 Archaea–Eukaryota ribosomal proteins used by Zaremba-Niedzwiedzka et al.3 using reference sequences from the corresponding Dryad repository. In addition to the Asgard archaea identified in this study, additional sequences of the most complete genomes representing different lineages of the TACK superphylum were added to the dataset. Sequences of 56 archaeal COGs obtained from the Dryad repository were used as reference databases to retrieve homologous sequences from target genomes using BLAST85 v.2.10.1. Each set of archaeal COG sequences were aligned using MUSCLE v.3.8.1551 and inspected and trimmed manually. Manually trimmed alignments were then further trimmed using BMGE86, recoded to four-state SR4 using a custom script (https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny) and finally concatenated and converted to PHYLIP format using catfasta2phyml v1.1.0 (https://github.com/nylander/catfasta2phyml). The final concatenated, recoded alignment was used to calculate phylogenies using IQ-TREE v.2.1.2 (ref. 78) using a C60 model adapted for SR4 recoded data by Zaremba-Niedzwiedzka et al.3 and 1,000 ultrafast bootstrap replicates using UFBoot. The phylogenomic tree was visualized using iTOL84 and rooted with Euryarchaeota as the outgroup. The genomes and conserved genes used for the phylogenomic analyses are listed in Supplementary Tables 16 and 17.
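Four-state SR4 recoding collapses the 20 amino acids into four groups before tree inference. The sketch below uses the commonly cited SR4 groupings (AGNPST, CHWY, DEKQR, FILMV); the output letters A, C, G and T and the treatment of gaps or ambiguous residues are assumptions made for illustration, and the authors' own script at the URL above may differ in these details.

```python
# SR4 amino acid groups mapped to four arbitrary state symbols (assumed here: A, C, G, T).
SR4_GROUPS = {
    "AGNPST": "A",
    "CHWY": "C",
    "DEKQR": "G",
    "FILMV": "T",
}
SR4_MAP = {aa: state for group, state in SR4_GROUPS.items() for aa in group}


def recode_sr4(aligned_protein):
    """Recode one aligned protein sequence to four states.

    Gaps and any residue outside the 20 standard amino acids (for example X)
    are written as '-' so that alignment columns stay in register.
    """
    return "".join(SR4_MAP.get(ch, "-") for ch in aligned_protein.upper())
```

Because every input character yields exactly one output character, the recoded alignment keeps the same length as the input alignment.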
Discovery of Heimdallarchaeum-targeting mobile elements through CRISPR spacer targeting
Repeat sequences from the Heimdallarchaeum CRISPR arrays were used to blast against the CRISPR repeats we recruited, using CRISPRCasTyper, from multiple databases with a 95% alignment and 95% identity cut-off. The databases include GTDB v.95, our in-house assemblies from the Pescadero Basin (this study, F.W. et al. manuscript in preparation and Speth et al.87; Supplementary Table 10, 22 sets) and published assemblies from the Guaymas Basin22 (Supplementary Table 11, 16 sets).
While no homologous CRISPR repeats were found in the entire GTDB database, we found several CRISPR arrays from the Guaymas and Pescadero assemblies with identical repeats to the Heimdallarchaeum CRISPR repeats found in this study, demonstrating the specificity of the CRISPR discovery approach. Since both the Guaymas and Pescadero CRISPR sets comprise assembled sequences that were not de-replicated, the entire CRISPR spacer collection from the recruited CRISPR arrays was de-replicated using a 100% identity cut-off. Notably, no spacer overlap was found between the Guaymas and Pescadero CRISPR sets. In total, the final de-replicated, putative Heimdallarchaeota spacerome in this study consisted of 455 spacers from the 2 original Heimdallarchaeum genomes, 578 from the Pescadero Basin assemblies and 532 from the Guaymas Basin assemblies. We note that the above set likely only represents a fraction of the true Heimdallarchaeum spacerome given that the original CRISPR repeats came from only two species.
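De-replication at a 100% identity cut-off amounts to keeping one copy of each exact spacer sequence, and the overlap check between two sets is a set intersection. The sketch below assumes exact, same-strand, full-length matches; whether reverse-complement duplicates were also collapsed is not stated in the text, so they are left distinct here. Function names are illustrative.

```python
def dereplicate_spacers(spacers):
    """Keep the first occurrence of each exact spacer sequence (case-insensitive),
    preserving input order."""
    seen = set()
    unique = []
    for spacer in spacers:
        key = spacer.upper()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def shared_spacers(set_a, set_b):
    """Return the spacers present in both collections after de-replication."""
    return set(dereplicate_spacers(set_a)) & set(dereplicate_spacers(set_b))
```

An empty result from `shared_spacers` corresponds to the situation described above, in which two spacer collections have no spacer in common.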
Next, to identify potential mobile genetic elements (MGEs) targeted by the Heimdallarchaeum spacerome, we used BLAST to search for spacer matches in the above three assembly datasets, the two Ca. Heimdallarchaeum genomes and various published virus databases/datasets, which are the RefSeq virus database r9888, IMG/VR v.3 (ref. 36) and the huge phage89, giant virus90 and Loki’s castle virus datasets91. To avoid self-matches, the CRISPR arrays containing the spacers were replaced by Ns in their respective assemblies. For the homology cut-off, we used 95% alignment and 95% identity as described previously92. Strikingly, no spacer matches were found from any of the viral datasets or GTDB genome database. The spacer matches to the Guaymas and Pescadero Basins metagenomes are listed in Supplementary Tables 12 and 13.
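Replacing CRISPR arrays with Ns before the spacer search keeps contig coordinates unchanged while preventing spacers from matching their own arrays. A minimal sketch is given below; it assumes 0-based, end-exclusive coordinates for the arrays, which is a convention chosen for this example rather than one stated in the text.

```python
def mask_regions(seq, regions):
    """Replace each (start, end) interval of `seq` with 'N' characters.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch);
    intervals are clipped to the sequence bounds and may overlap.
    """
    chars = list(seq)
    for start, end in regions:
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = "N"
    return "".join(chars)
```

Because masking substitutes characters instead of deleting them, hit positions reported against the masked assembly remain valid in the original assembly.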
We then de-replicated the putative MGEs/viruses identified above using BLAST, removed contigs smaller than 2.8 kb and manually examined the target gene neighbourhoods and potential self-match due to CRISPR arrays that evaded detection and blocking. These contigs, together with the ones described in Fig. 2, ultimately constitute the 56 putative Heimdallarchaeota MGEs listed in Supplementary Table 14.
Resolution of the genomic insertion and circularization of HeimV1
To capture the two different states during the life cycles of HeimV1 (Fig. 3f), we used three primer sets to amplify the sequences around the two insertion sites of HeimV1 and confirmed them using gel electrophoresis and Sanger sequencing. Set 1 amplified the region between upstream tRNA GlyCCC in the Ca. H. endolithica genome and the first coding gene of the HeimV1 (GTGAATCAATAGCTTTCACTTATAATGAG/GTGATTGTATTAAGTCTGCAACATATTC). Set 2 amplified the regions containing the transposase in the Ca. H. endolithica genome and the integrase in HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/CTTCTTTCCTCTTTTTGTCTCTGCTTC). Set 3 amplified the two ends of the circular HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/GTGATTGTATTAAGTCTGCAACATATTC). Each primer set amplified approximately 2 kb of target regions with set 1 and set 2 indicating the presence of the integrated state of HeimV1 and set 3 indicating the circular state.
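The logic of the three-primer-set assay can be summarized in a few lines. The primer sequences below are copied from the text in the order listed (treating the first of each pair as forward and the second as reverse is an assumption), and requiring both junction products to call the integrated state is also an assumption of this sketch.

```python
# Primer pairs as listed in the text: (first primer, second primer).
PRIMER_SETS = {
    "set1": ("GTGAATCAATAGCTTTCACTTATAATGAG", "GTGATTGTATTAAGTCTGCAACATATTC"),
    "set2": ("CTTAGATATGTACGTGATAGGATCATATG", "CTTCTTTCCTCTTTTTGTCTCTGCTTC"),
    "set3": ("CTTAGATATGTACGTGATAGGATCATATG", "GTGATTGTATTAAGTCTGCAACATATTC"),
}


def infer_heimv1_states(set1_product, set2_product, set3_product):
    """Map presence/absence of the three PCR products to the HeimV1 states
    they indicate in a sample."""
    states = set()
    if set1_product and set2_product:  # both host-virus junctions amplified
        states.add("integrated")
    if set3_product:  # the two viral ends are adjacent, as in a circle
        states.add("circular")
    return states
```

Set 3 reuses the first primer of set 2 and the second primer of set 1, both of which lie within HeimV1, which is why a set 3 product is only expected when the two viral ends are joined.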
Protein clustering of integrases and transposases
Protein sequences showing integrase and transposase domains, identified using eggNOG mapper from the 8 Asgard archaea MAGs, were pooled and clustered at 90% sequence identity using cd-hit93 v.4.8.1. The resulting representative sequences were used for two sequential rounds of homology searches using DIAMOND94 v.2.0.6 against the protein sequences obtained from the GTDB v.95 genome database. A cut-off of >20% sequence identity, >85% sequence alignment and <15% length difference was used for the first round; a cut-off of >30% sequence identity, >90% sequence alignment and <10% length difference was used for the second round. The resulting protein sequences were combined with the Asgard archaea integrases/transposases originally pooled and were clustered together using 95% sequence identity with cd-hit. The resulting 96,367 representative sequences were clustered using ASM-Clust95 with a sequence subset size of 5,000 to generate the alignment score matrix, using default values for the other settings.
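The two rounds of homology filtering differ only in their thresholds, which makes them easy to express as data. In the sketch below, the threshold values are those given in the text, while the exact definitions of "sequence alignment" (aligned length relative to the longer sequence) and "length difference" (relative to the longer sequence) are assumptions, as are all names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoff:
    min_identity: float   # percent identity, exclusive lower bound
    min_aligned: float    # percent of sequence aligned, exclusive lower bound
    max_len_diff: float   # percent length difference, exclusive upper bound


ROUND1 = Cutoff(min_identity=20.0, min_aligned=85.0, max_len_diff=15.0)
ROUND2 = Cutoff(min_identity=30.0, min_aligned=90.0, max_len_diff=10.0)


def passes_cutoff(identity, aln_len, query_len, subject_len, cutoff):
    """Check one query-subject hit against a cut-off.

    Aligned fraction and length difference are both computed relative to the
    longer of the two sequences (an assumption of this sketch).
    """
    longer = max(query_len, subject_len)
    aligned = 100.0 * aln_len / longer
    len_diff = 100.0 * abs(query_len - subject_len) / longer
    return (
        identity > cutoff.min_identity
        and aligned > cutoff.min_aligned
        and len_diff < cutoff.max_len_diff
    )
```

A hit at 25% identity that passes `ROUND1` would be rejected by `ROUND2`, reflecting the stricter second round.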
Taxonomic profiling through protein orthologues
The taxonomic clustering and COG analyses were carried out using eggNOG mapper39 with the eggNOG orthologue database v.5.0. The protein counts belonging to each taxonomic group (Archaea/Bacteria/Eukaryota/Unassigned) were extracted from the output and fitted linearly with MATLAB R2018a using the polyfit function and yielding Fig. 5c.
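A first-degree polynomial fit of the kind returned by MATLAB's `polyfit(x, y, 1)` is an ordinary least-squares line. The sketch below reproduces that calculation in plain Python as an illustration; it is not the authors' code, and the toy numbers are invented solely to show the call.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept), the same coefficient order as a degree-1
    polynomial fit.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy data lying exactly on y = 2x + 1 (not values from the study).
slope, intercept = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

Fitting each taxonomic group separately yields one slope per group, so that the scaling of the different groups can be compared.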
Since different proteins evolved at different rates, we combined the use of a single cut-off-based protein clustering approach with functional domain-based manual refinement to capture and compare ERPs across the lineages selected in this study. First, we used BLAST v.2.2.26 to evaluate the sequence homologies within the entire proteome of the eight MAGs in this study. We then used an 80% alignment length (relative to the length of the shorter protein sequence) and 0.24 alignment × identity cut-off to yield candidate protein clusters, which we then cross-referenced with the Eukaryota group in the eggNOG classification to generate 227 candidate ERP clusters. Finally, we manually examined the relatedness within and between each ERP cluster through batch searches using the conserved domain database65. This led to the recombination of the candidate ERP clusters into the functionally distinct 135 ERP families. To align with previous work3,6, all small GTPases were classified as one single ERP family, constituting 291 proteins from the 8 representative Asgard archaea MAGs.
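The pairwise cut-off used to form candidate clusters can be written as a small predicate. The interpretation below, in which the "alignment × identity" score is the product of the aligned fraction (relative to the shorter protein) and the fractional identity, is an assumption made for illustration; the source states the thresholds but not the exact formula.

```python
def is_cluster_candidate(aln_len, len_a, len_b, identity_fraction):
    """Decide whether two proteins are linked into the same candidate cluster.

    aln_len           -- number of aligned positions
    len_a, len_b      -- lengths of the two proteins
    identity_fraction -- identity over the alignment, on a 0-1 scale

    Requires the alignment to cover at least 80% of the shorter protein and
    the product of coverage and identity to reach 0.24 (assumed definition).
    """
    shorter = min(len_a, len_b)
    coverage = aln_len / shorter
    return coverage >= 0.80 and coverage * identity_fraction >= 0.24
```

Under this reading, a pair aligned over 90% of the shorter protein needs roughly 27% identity to be linked, whereas a pair aligned over its full length needs 24%.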
Maximum-likelihood analyses of proteins encoded by HeimV1 and HeimV2
Homology search for all peptide sequences of HeimV1 through DIAMOND94 v.2.0.6 was carried out against the GTDB v.95, Pescadero Basin and Guaymas assemblies, RefSeq virus database88, IMG/VR36 and huge phage89, giant virus90 and Loki’s castle virus datasets91. The search outputs were pre-clustered with a 70% identity cut-off using cd-hit v.4.8.1 (ref. 93). The representative sequences were aligned using the MAFFT v.7.475 (ref. 96) option linsi and trimmed with trimAl v.1.4.1 (ref. 97), option gappyout. Maximum-likelihood analyses were carried out with IQ-TREE v.2.1.12 (ref. 78) using the LG4X model and ultrafast bootstrap with 2,000 replicates. The phylogenetic tree was visualized and prepared using iTOL84.
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
We thank W. Fischer for critical comments on the manuscript, L. Kelly for advice on viral sequence analysis, A. Roger for discussions on phylogenetic methods and K. Makarova and E. Koonin for discussions on CRISPR–Cas systems. We thank the pilots, crew and participants on the cruises to the southern Pescadero Basin, FK181031 on R/V Falkor operated by the Schmidt Ocean Institute and NA091 on E/V Nautilus operated by the Ocean Exploration Trust, with NA091 supported by the Dalio Foundation and Woods Hole Oceanographic Institute. This research used samples provided by the Ocean Exploration Trust’s Nautilus Exploration Program, cruise NA091. We thank chief scientists S. Wankel and A. Michel for the opportunity to sail on NA091, Co-Chief Scientists D. Caress and R. Zierenberg on FK181031, and S. Wankel, A. Foulk and L. Marsh, R. Zierenberg and D. Cardace for assistance with shipboard processing of rock samples and J. Magyar and S. Goffredi for shipboard processing of sediment samples. Illumina library construction and Nanopore sequencing were performed at the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech. F.W. was supported by the Netherlands Organisation for Scientific Research Rubicon Award no. 019.162LW.037 and a Human Frontiers Science Program Long-term Fellowship no. LT000468/2017. D.R.S. was supported by the Netherlands Organisation for Scientific Research Rubicon Award no. 019.153LW.039 and the Caltech GPS Division Texaco Postdoctoral Fellowship. J.P.A. is funded by the National Science Foundation (NSF) no. OCE-1431598. V.J.O. is a Canadian Institute for Advanced Science fellow in the Earth 4D program. This research was supported by a Caltech Center for Evolutionary Science Pilot Grant (F.W. and V.J.O.), the NOMIS Foundation (V.J.O.), the Simons Foundation Principles of Microbial Ecosystems project (V.J.O.) and the NSF Center for Dark Energy Biosphere Investigations (no. OCE-0939564, V.J.O. and J.P.A.). 
The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author contributions
F.W., D.R.S., A.C. and V.J.O. conceived the project. D.R.S. and V.J.O. collected the hydrothermal vent samples. F.W., D.R.S., A.C. and S.A.C. carried out the microbial incubations, periodic sampling, DNA extraction, sulphide analyses and amplicon sequencing. I.A.A. and A.N. prepared the Illumina sequencing libraries. I.A.A. performed the Oxford Nanopore sequencing. F.W., I.A.A. and A.P. assembled the Asgard archaea genomes. D.R.S. performed the phylogenomic analyses, protein clustering and overall bioinformatics platform support. R.A.B. performed the ANI/AAI analyses and taxonomic evaluation. F.W., D.R.S. and A.P. annotated the genomes. F.W. performed the PacBio HiFi 16S sequencing, protein phylogenetic analyses, marker HMM construction, comparative genomics, CRISPR/mobilome discovery, statistical analyses and wrote the paper. V.J.O. revised the paper. D.R.S., A.P., A.C., R.A.B. and J.P.A. provided critical comments on the paper. All authors read and approved the manuscript. V.J.O. and J.P.A. supervised the work.
Peer review
Peer review information
Nature Microbiology thanks Brett Baker and the other, anonymous, reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
The assembled genomes and raw metagenomic sequencing reads can be found on the National Center for Biotechnology Information database under BioProject no. PRJNA721962. Source data are provided with this paper.
Code availability
The custom script for recoding of amino acid sequences to four-state SR4 can be found at https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny. Other custom scripts can be found at https://github.com/wufabai/genomics.
Competing interests
The authors declare no competing interests.
Contributor Information
Fabai Wu, Email: wu.fa.bai@gmail.com.
Victoria J. Orphan, Email: vorphan@gps.caltech.edu
Extended data is available for this paper at 10.1038/s41564-021-01039-y.
Supplementary information
The online version contains supplementary material available at 10.1038/s41564-021-01039-y.
References
- 1.Hug LA, et al. A new view of the tree of life. Nat. Microbiol. 2016;1:16048. doi: 10.1038/nmicrobiol.2016.48. [DOI] [PubMed] [Google Scholar]
- 2.Takai K, Horikoshi K. Genetic diversity of archaea in deep-sea hydrothermal vent environments. Genetics. 1999;152:1285–1297. doi: 10.1093/genetics/152.4.1285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Zaremba-Niedzwiedzka K, et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature. 2017;541:353–358. doi: 10.1038/nature21031. [DOI] [PubMed] [Google Scholar]
- 4.Spang A, et al. Proposal of the reverse flow model for the origin of the eukaryotic cell based on comparative analyses of Asgard archaeal metabolism. Nat. Microbiol. 2019;4:1138–1148. doi: 10.1038/s41564-019-0406-9. [DOI] [PubMed] [Google Scholar]
- 5.Williams TA, Cox CJ, Foster PG, Szöllősi GJ, Embley TM. Phylogenomics provides robust support for a two-domains tree of life. Nat. Ecol. Evol. 2020;4:138–147. doi: 10.1038/s41559-019-1040-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Spang A, et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature. 2015;521:173–179. doi: 10.1038/nature14447. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Liu Y, et al. Expanded diversity of Asgard archaea and their relationships with eukaryotes. Nature. 2021;593:553–557. doi: 10.1038/s41586-021-03494-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Bulzu P-A, et al. Casting light on Asgardarchaeota metabolism in a sunlit microoxic niche. Nat. Microbiol. 2019;4:1129–1137. doi: 10.1038/s41564-019-0404-y. [DOI] [PubMed] [Google Scholar]
- 9.Dong X, et al. Metabolic potential of uncultured bacteria and archaea associated with petroleum seepage in deep-sea sediments. Nat. Commun. 2019;10:1816. doi: 10.1038/s41467-019-09747-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Huang J-M, Baker BJ, Li J-T, Wang Y. New microbial lineages capable of carbon fixation and nutrient cycling in deep-sea sediments of the northern South China Sea. Appl. Environ. Microbiol. 2019;85:e00523-–19. doi: 10.1128/AEM.00523-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Cai M, et al. Diverse Asgard archaea including the novel phylum Gerdarchaeota participate in organic matter degradation. Sci. China Life Sci. 2020;63:886–897. doi: 10.1007/s11427-020-1679-1.
- 12. Sun J, et al. Recoding of stop codons expands the metabolic potential of two novel Asgardarchaeota lineages. ISME Commun. 2021;1:30. doi: 10.1038/s43705-021-00032-0.
- 13. Frost LS, Leplae R, Summers AO, Toussaint A. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 2005;3:722–732. doi: 10.1038/nrmicro1235.
- 14. Nelson WC, Tully BJ, Mobberley JM. Biases in genome reconstruction from metagenomic data. PeerJ. 2020;8:e10119. doi: 10.7717/peerj.10119.
- 15. Paduan JB, et al. Discovery of hydrothermal vent fields on Alarcón Rise and in Southern Pescadero Basin, Gulf of California. Geochem. Geophys. Geosyst. 2018;19:4788–4819.
- 16. Caceres EF, et al. Near-complete Lokiarchaeota genomes from complex environmental samples using long and short read metagenomic analyses. Preprint at bioRxiv. 2019. doi: 10.1101/2019.12.17.879148.
- 17. Koren S, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116.
- 18. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–834. doi: 10.1101/gr.213959.116.
- 19. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 2018;9:5114. doi: 10.1038/s41467-018-07641-9.
- 20. Barco RA, et al. A genus definition for bacteria and archaea based on a standard genome relatedness index. mBio. 2020;11:e02475-19. doi: 10.1128/mBio.02475-19.
- 21. Rinke C, et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat. Microbiol. 2021;6:946–959. doi: 10.1038/s41564-021-00918-8.
- 22. Dombrowski N, Teske AP, Baker BJ. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat. Commun. 2018;9:4999. doi: 10.1038/s41467-018-07418-0.
- 23. Imachi H, et al. Isolation of an archaeon at the prokaryote–eukaryote interface. Nature. 2020;577:519–525. doi: 10.1038/s41586-019-1916-6.
- 24. López-García P, Moreira D. The Syntrophy hypothesis for the origin of eukaryotes revisited. Nat. Microbiol. 2020;5:655–667. doi: 10.1038/s41564-020-0710-4.
- 25. Sousa FL, Neukirchen S, Allen JF, Lane N, Martin WF. Lokiarchaeon is hydrogen dependent. Nat. Microbiol. 2016;1:16034. doi: 10.1038/nmicrobiol.2016.34.
- 26. Manoharan L, et al. Metagenomes from coastal marine sediments give insights into the ecological role and cellular features of Loki- and Thorarchaeota. mBio. 2019;10:e02039-19. doi: 10.1128/mBio.02039-19.
- 27. Roller BRK, Stoddard SF, Schmidt TM. Exploiting rRNA operon copy number to investigate bacterial reproductive strategies. Nat. Microbiol. 2016;1:16160. doi: 10.1038/nmicrobiol.2016.160.
- 28. Yao J, Rock CO. Phosphatidic acid synthesis in bacteria. Biochim. Biophys. Acta. 2013;1831:495–502. doi: 10.1016/j.bbalip.2012.08.018.
- 29. Susko E, Roger AJ. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 2007;24:2139–2150. doi: 10.1093/molbev/msm144.
- 30. Parks DH, et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 2020;38:1079–1086. doi: 10.1038/s41587-020-0501-8.
- 31. López-García P, Zivanovic Y, Deschamps P, Moreira D. Bacterial gene import and mesophilic adaptation in archaea. Nat. Rev. Microbiol. 2015;13:447–456. doi: 10.1038/nrmicro3485.
- 32. Nelson-Sathi S, et al. Origins of major archaeal clades correspond to gene acquisitions from bacteria. Nature. 2015;517:77–80. doi: 10.1038/nature13805.
- 33. Groussin M, et al. Gene acquisitions from bacteria at the origins of major archaeal clades are vastly overestimated. Mol. Biol. Evol. 2016;33:305–310. doi: 10.1093/molbev/msv249.
- 34. Williams KP. Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res. 2002;30:866–875. doi: 10.1093/nar/30.4.866.
- 35. Koonin EV, Makarova KS, Wolf YI, Krupovic M. Evolutionary entanglement of mobile genetic elements and host defence systems: guns for hire. Nat. Rev. Genet. 2020;21:119–131. doi: 10.1038/s41576-019-0172-9.
- 36. Roux S, et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 2021;49:D764–D775. doi: 10.1093/nar/gkaa946.
- 37. Cantu VA, et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput. Biol. 2020;16:e1007845. doi: 10.1371/journal.pcbi.1007845.
- 38. Betts HC, et al. Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Nat. Ecol. Evol. 2018;2:1556–1562. doi: 10.1038/s41559-018-0644-x.
- 39. Huerta-Cepas J, et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019;47:D309–D314. doi: 10.1093/nar/gky1085.
- 40. Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinform. 2003;4:41. doi: 10.1186/1471-2105-4-41.
- 41. Alvarez-Ponce D, Lopez P, Bapteste E, McInerney JO. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl Acad. Sci. USA. 2013;110:E1594–E1603. doi: 10.1073/pnas.1211371110.
- 42. Ku C, et al. Endosymbiotic gene transfer from prokaryotic pangenomes: inherited chimerism in eukaryotes. Proc. Natl Acad. Sci. USA. 2015;112:10139–10146. doi: 10.1073/pnas.1421385112.
- 43. Brueckner J, Martin WF. Bacterial genes outnumber archaeal genes in eukaryotic genomes. Genome Biol. Evol. 2020;12:282–292. doi: 10.1093/gbe/evaa047.
- 44. Kurland CG, Collins LJ, Penny D. Genomics and the irreducible nature of eukaryote cells. Science. 2006;312:1011–1014. doi: 10.1126/science.1121674.
- 45. Giovannoni SJ, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309:1242–1245. doi: 10.1126/science.1114057.
- 46. Kapusta A, Suh A, Feschotte C. Dynamics of genome size evolution in birds and mammals. Proc. Natl Acad. Sci. USA. 2017;114:E1460–E1469. doi: 10.1073/pnas.1616702114.
- 47. Scheller S, Yu H, Chadwick GL, McGlynn SE, Orphan VJ. Artificial electron acceptors decouple archaeal methane oxidation from sulfate reduction. Science. 2016;351:703–707. doi: 10.1126/science.aad7154.
- 48. Mason OU, et al. Comparison of archaeal and bacterial diversity in methane seep carbonate nodules and host sediments, Eel River Basin and Hydrate Ridge, USA. Microb. Ecol. 2015;70:766–784. doi: 10.1007/s00248-015-0615-6.
- 49. Caporaso JG, et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods. 2010;7:335–336. doi: 10.1038/nmeth.f.303.
- 50. Quast C, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013;41:D590–D596. doi: 10.1093/nar/gks1219.
- 51. Bahram M, Anslan S, Hildebrand F, Bork P, Tedersoo L. Newly designed 16S rRNA metabarcoding primers amplify diverse and novel archaeal taxa from the environment. Environ. Microbiol. Rep. 2019;11:487–494. doi: 10.1111/1758-2229.12684.
- 52. Callahan BJ, et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res. 2019;47:e103. doi: 10.1093/nar/gkz569.
- 53. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 54. Walker BJ, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.
- 55. Qin M, et al. LRScaf: improving draft genomes using long noisy reads. BMC Genom. 2019;20:955. doi: 10.1186/s12864-019-6337-2.
- 56. Kang DD, et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7:e7359. doi: 10.7717/peerj.7359.
- 57. Varghese NJ, et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 2015;43:6761–6771. doi: 10.1093/nar/gkv657.
- 58. Pruesse E, Peplies J, Glöckner FO. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics. 2012;28:1823–1829. doi: 10.1093/bioinformatics/bts252.
- 59. Rodriguez-R LM, Konstantinidis KT. The enveomics collection: a toolbox for specialized analyses of microbial genomes and metagenomes. PeerJ Prepr. 2016;4:e1900v1.
- 60. Brettin T, et al. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci. Rep. 2015;5:8365. doi: 10.1038/srep08365.
- 61. Davis JJ, et al. The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities. Nucleic Acids Res. 2020;48:D606–D612. doi: 10.1093/nar/gkz943.
- 62. Lagesen K, et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–3108. doi: 10.1093/nar/gkm160.
- 63. Russel J, Pinilla-Redondo R, Mayo-Muñoz D, Shah SA, Sørensen SJ. CRISPRCasTyper: automated identification, annotation, and classification of CRISPR–Cas loci. CRISPR J. 2020;3:462–469. doi: 10.1089/crispr.2020.0059.
- 64. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 65. Lu S, et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2020;48:D265–D268. doi: 10.1093/nar/gkz991.
- 66. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–1055. doi: 10.1101/gr.186072.114.
- 67. Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195.
- 68. Angle JC, et al. Methanogenesis in oxygenated soils is a substantial fraction of wetland methane emissions. Nat. Commun. 2017;8:1567. doi: 10.1038/s41467-017-01753-4.
- 69. Nayfach S, et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 2021;39:499–509. doi: 10.1038/s41587-020-0718-6.
- 70. Rasigraf O, et al. Microbial community composition and functional potential in Bothnian Sea sediments is linked to Fe and S dynamics and the quality of organic matter. Limnol. Oceanogr. 2020;65:S113–S133.
- 71. Seitz KW, et al. Asgard archaea capable of anaerobic hydrocarbon cycling. Nat. Commun. 2019;10:1822. doi: 10.1038/s41467-019-09364-x.
- 72. Seitz KW, Lazar CS, Hinrichs K-U, Teske AP, Baker BJ. Genomic reconstruction of a novel, deeply branched sediment archaeal phylum with pathways for acetogenesis and sulfur reduction. ISME J. 2016;10:1696–1705. doi: 10.1038/ismej.2015.233.
- 73. Tully BJ, Graham ED, Heidelberg JF. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data. 2018;5:170203. doi: 10.1038/sdata.2017.203.
- 74. Vavourakis CD, et al. Metagenomes and metatranscriptomes shed new light on the microbial-mediated sulfur cycle in a Siberian soda lake. BMC Biol. 2019;17:69. doi: 10.1186/s12915-019-0688-7.
- 75. Wong HL, et al. Disentangling the drivers of functional complexity at the metagenomic level in Shark Bay microbial mat microbiomes. ISME J. 2018;12:2619–2639. doi: 10.1038/s41396-018-0208-8.
- 76. Penev PI, et al. Supersized ribosomal RNA expansion segments in Asgard archaea. Genome Biol. Evol. 2020;12:1694–1710. doi: 10.1093/gbe/evaa170.
- 77. Farag IF, Zhao R, Biddle JF. ‘Sifarchaeota,’ a novel Asgard phylum from Costa Rican sediment capable of polysaccharide degradation and anaerobic methylotrophy. Appl. Environ. Microbiol. 2021;87:e02584-20. doi: 10.1128/AEM.02584-20.
- 78. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300.
- 79. Chernomor O, von Haeseler A, Minh BQ. Terrace aware data structure for phylogenomic inference from supermatrices. Syst. Biol. 2016;65:997–1008. doi: 10.1093/sysbio/syw037.
- 80. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods. 2017;14:587–589. doi: 10.1038/nmeth.4285.
- 81. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 2018;35:518–522. doi: 10.1093/molbev/msx281.
- 82. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 2004;5:113. doi: 10.1186/1471-2105-5-113.
- 83. Eren AM, et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ. 2015;3:e1319. doi: 10.7717/peerj.1319.
- 84. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 2019;47:W256–W259. doi: 10.1093/nar/gkz239.
- 85. Camacho C, et al. BLAST+: architecture and applications. BMC Bioinform. 2009;10:421. doi: 10.1186/1471-2105-10-421.
- 86. Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 2010;10:210. doi: 10.1186/1471-2148-10-210.
- 87. Speth DR, et al. Microbial community of recently discovered Auka vent field sheds light on vent biogeography and evolutionary history of thermophily. Preprint at bioRxiv. 2021. doi: 10.1101/2021.08.02.454472.
- 88. Brister JR, Ako-Adjei D, Bao Y, Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015;43:D571–D577. doi: 10.1093/nar/gku1207.
- 89. Al-Shayeb B, et al. Clades of huge phages from across Earth’s ecosystems. Nature. 2020;578:425–431. doi: 10.1038/s41586-020-2007-4.
- 90. Schulz F, et al. Giant virus diversity and host interactions through global metagenomics. Nature. 2020;578:432–436. doi: 10.1038/s41586-020-1957-x.
- 91. Bäckström D, et al. Virus genomes from deep sea sediments expand the ocean megavirome and support independent origins of viral gigantism. mBio. 2019;10:e02497-18. doi: 10.1128/mBio.02497-18.
- 92. Shmakov SA, et al. The CRISPR spacer space is dominated by sequences from species-specific mobilomes. mBio. 2017;8:e01397-17. doi: 10.1128/mBio.01397-17.
- 93. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. doi: 10.1093/bioinformatics/btl158.
- 94. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
- 95. Speth DR, Orphan VJ. ASM-Clust: classifying functionally diverse protein families using alignment score matrices. Preprint at bioRxiv. 2019. doi: 10.1101/792739.
- 96. Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34:2490–2492. doi: 10.1093/bioinformatics/bty121.
- 97. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348.
Data Availability Statement
The assembled genomes and raw metagenomic sequencing reads can be found on the National Center for Biotechnology Information database under BioProject no. PRJNA721962. Source data are provided with this paper.
The custom script for recoding of amino acid sequences to four-state SR4 can be found at https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny. Other custom scripts can be found at https://github.com/wufabai/genomics.
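For orientation, four-state SR4 recoding can be sketched in a few lines of Python. This is not the authors' script from the repository above; it is a minimal illustration that assumes the amino acid grouping of Susko and Roger (AGNPST, CHWY, DEKQR, FILMV). The output labels A, C, G and T, the function names `recode_sr4` and `recode_fasta`, and the choice to map gaps and ambiguous residues to `-` are assumptions made here for illustration.

```python
# Minimal sketch of SR4 recoding; not the authors' implementation.
# Grouping follows Susko & Roger (2007); the state labels are arbitrary.
SR4_GROUPS = {
    "A": "AGNPST",
    "C": "CHWY",
    "G": "DEKQR",
    "T": "FILMV",
}

# Invert the grouping into a per-residue lookup table.
SR4_TABLE = {aa: state for state, members in SR4_GROUPS.items() for aa in members}


def recode_sr4(seq: str) -> str:
    """Recode one aligned protein sequence into four states.

    Gaps and any symbol outside the 20 standard amino acids become '-'
    (an assumption of this sketch), so alignment length is preserved.
    """
    return "".join(SR4_TABLE.get(ch, "-") for ch in seq.upper())


def recode_fasta(lines):
    """Yield FASTA lines with headers unchanged and sequence lines recoded."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else recode_sr4(line)


if __name__ == "__main__":
    example = [">seq1", "MKV-LX"]
    print("\n".join(recode_fasta(example)))
```

Because the recoded alignment keeps its original length and header lines, it can be passed to a phylogenetic program as a four-state alignment; how missing data should be encoded depends on that program and should be checked against its documentation.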