Microbiome. 2025 Jul 1;13:155. doi: 10.1186/s40168-025-02140-8

Extensive data mining uncovers novel diversity among members of the rare biosphere within the Thermoplasmatota

Mara D Maeke 1, Xiuran Yin 1,2, Lea C Wunder 1, Chiara Vanni 3, Tim Richter-Heitmann 1, Samuel Miravet-Verde 4, Hans-Joachim Ruscheweyh 4, Shinichi Sunagawa 4, Jenny Fabian 5, Judith Piontek 5, Michael W Friedrich 1,3, Christiane Hassenrück 5
PMCID: PMC12220078  PMID: 40598319

Abstract

Background

Rare species, especially of the marine sedimentary biosphere, have long been overlooked owing to the complexity of sediment microbial communities, their sporadic temporal and patchy spatial abundance, and challenges in cultivating environmental microorganisms. In this study, we combined enrichments, targeted metagenomic sequencing, and extensive data mining to uncover uncultivated members of the archaeal rare biosphere in marine sediments.

Results

In protein-amended enrichments, we detected the ecologically and metabolically uncharacterized class Candidatus Penumbrarchaeia within the phylum Thermoplasmatota. By screening more than 8000 metagenomic runs and 11,479 published genome assemblies, we expanded the phylogeny of Ca. Penumbrarchaeia by 3 novel orders. All six identified families of this class show low abundance in environmental samples characteristic of rare biosphere members. Members of the class Ca. Penumbrarchaeia were predicted to be involved in organic matter degradation in anoxic, carbon-rich habitats. All Ca. Penumbrarchaeia families contain high numbers of taxon-specific orthologous genes, highlighting their environmental adaptations and habitat specificity. Besides, members of this group exhibit the highest proportion of unknown genes within the entire phylum Thermoplasmatota, suggesting a high degree of functional novelty in this class.

Conclusions

In this study, we emphasize the necessity of targeted, data-integrative approaches to deepen our understanding of the rare biosphere and uncover the functions and metabolic potential hidden within these understudied taxa.

Supplementary Information

The online version contains supplementary material available at 10.1186/s40168-025-02140-8.

Introduction

In the environment, the vast majority of microbial species is represented by low abundance microorganisms, known as the “rare biosphere” [1]. While many studies define rare taxa as those being less than 0.01–0.1% abundant in a sample at a specific time point [2, 3], rarity is not only confined to population sizes but can also be measured by geographic range and habitat specificity [4, 5]. Rare taxa are hypothesized to play an important role in ecosystems by carrying a gene pool, which can be accessed under changing environmental conditions, functioning as a seed bank, or by supporting the community with key functions [6, 7]. These functions can be nutrient cycling [8–10], degradation of pollutants [11], or the promotion of community resilience [12, 13]. Due to high intra- or interspecific competition [6], as well as the assumed limited environmental distribution most rare taxa occupy, these taxa mostly show only temporally and spatially constrained abundance [7], making the study of the rare biosphere challenging.

Most studies so far conducted on the rare marine biosphere have focused on diversity assessments of bacterial and archaeal plankton using high-throughput 16S rRNA gene surveys [2, 3, 14–20]. More recently, first metagenomic approaches have been applied to investigate the rare biosphere in marine bacterioplankton [21]. However, the rare biosphere in marine sediments remains largely unexplored, owing to the complexity of sediment communities [22, 23], the sequencing effort needed to resolve such complex communities, computational resources, and associated costs.

To reduce the complexity of sediment communities and thereby make the rare biosphere more accessible for the analysis of metabolic functions, enrichment experiments with substrate amendments were designed to selectively promote the growth of rare taxa [24–27]. While isolation would be preferable to further characterize the metabolic capabilities of rare taxa, most microorganisms remain uncultured despite recent advances in cultivation techniques [28, 29]. Therefore, enrichment techniques coupled to metagenome analyses simplify the description of the full metabolic potential of rare uncultivated microorganisms in the absence of isolates [2, 29–31]. Yet, even the combination of these methods only provides a snapshot of the enriched taxa at a specific time point in a specific setting and cannot offer further information regarding their global diversity or habitat selection.

Recent advances in high-throughput sequencing and computational techniques have enabled deep sequencing of microbial communities, including the rare biosphere [2, 32–34], and led to rapid data accumulation in public databases. Data from next-generation sequencing runs deposited in the Sequence Read Archive (SRA) under the umbrella of the International Nucleotide Sequence Database Collaboration (INSDC) reached a data volume of 57.9 PB in 2024 [35], while the number of assembled genomes exceeds 2.3 million [36]. Through the screening and recovery of novel metagenome-assembled genomes (MAGs) and single-cell amplified genomes (SAGs) from public archives by data-driven projects, such as the Genome Taxonomy Database (GTDB) [37], the Genomes from Earth’s Microbiomes (GEM) [38], and the Ocean Microbiomics Database [39], the catalog of microbial diversity is steadily increasing. Genomic analyses conducted on unidentified microorganisms disclosed novel functions and illustrated the extent of their untapped metabolic potential [38–42]. These studies attest to the wealth of information available in existing data on public archives. Despite the evidence outlined above for the importance of the rare biosphere, rare taxa remain regularly overlooked in environmental metagenomic studies, since the most abundant taxa remain the focus of the research on key players in microbial communities [6]. Still, the generated sequencing data may hold valuable information about the phylogenetic and metabolic diversity of rare taxa, their habitats, and their role in the Earth’s environments, which can be accessed by group-targeted data mining of the SRA.

Archaea, in particular, have been regarded as members of the rare biosphere, complicating the study of their diversity and role in biogeochemical cycles [34, 43]. While it has been estimated that half of the archaeal diversity remains unidentified, multiple archaeal groups were shown to play an important role in organic matter degradation [34, 44, 45]. To investigate the capabilities of uncultivated archaea involved in the degradation of organic matter, specifically proteins, we established enrichments amended with pure egg-white protein (further referred to as protein). After detecting a so far ecologically and metabolically uncharacterized class of the phylum Thermoplasmatota within these protein-amended enrichments, we used the data available on the databases of the INSDC to conduct extensive group-targeted data mining to describe this new class. By screening 11,479 publicly available genome assemblies, data from 8287 metagenomic sequencing runs, and the genomes of the Ocean Microbiomics Database (OMDB), we generated an integrative dataset enabling the study and formal description of this new class. We investigated the phylogenetic diversity, biogeography, and metabolic capabilities of the novel class and presented three previously unknown orders within this group. Notably, by analyzing the metabolic potential within the novel class, we observed percentages of unknown genes higher than those found in any other class within the Thermoplasmatota.

Results

High abundance of an ecologically and metabolically uncharacterized class of the phylum Thermoplasmatota in protein enrichments

To investigate the potential of novel microorganisms for protein degradation, we established a series of slurry enrichments using sediment from the Helgoland mud area. In order to reduce and eventually eliminate the sediment component, incubations were transferred into anoxic Widdel medium after 372 days for further enrichment (“Further enrichment of target microorganisms” section). Samples were amended with protein, sulfate, and an antibiotic mix, supplied at the beginning of the incubation experiment, to specifically target archaeal groups by inhibiting general bacterial groups in marine sediments. Subsamples of the second-generation enrichment were taken on days 98 (470 days total) and 157 (529 days total) to perform archaeal 16S rRNA gene amplicon sequencing and qPCR. Within one replicate incubation, amended with protein, sulfate, and antibiotics, we observed a high relative abundance of EX4484-6 (90.4%), a so far undescribed class within the phylum Thermoplasmatota, which had increased from day 98 to day 157 (Fig. 1a, Table S1). The taxon was represented by four ASVs, of which one ASV showed a relative abundance of 90.2% (Fig. S1b, Table S1). Hereafter, the class EX4484-6 will be referred to as Candidatus Penumbrarchaeia (proposal of type material and higher ranks in Supplementary material). In contrast, a second replicate showed a high relative abundance of ‘Candidatus Prometheoarchaeum syntrophicum’ strain MK-D1 on day 157 (86.56%), while the relative abundance of the Ca. Penumbrarchaeia class was lower (12%) (Fig. S1a and c, Table S1, Results SI). The control samples without protein amendment showed high relative abundances in Bathyarchaeia and Lokiarchaeia, reaching up to 52% and 31%, respectively. The class Ca. Penumbrarchaeia was not enriched in any of the control samples.

Fig. 1.

Abundance of archaeal 16S rRNA genes in second-generation enrichments on days 98 and 157. a Relative 16S rRNA gene abundance within protein samples (amended with protein, sulfate, and antibiotics) on day 98 and day 157 and control samples (amended with sulfate and antibiotics). b 16S rRNA gene copies per ml slurry of the classes Ca. Penumbrarchaeia and Lokiarchaeia subgroup Loki-2b in protein-amended samples and control samples

Using a newly designed qPCR primer set, specific for the archaeal class Ca. Penumbrarchaeia, we followed the abundance of its 16S rRNA gene copies within the same enrichments. In the unenriched environmental samples and in control samples, gene copies of the class Ca. Penumbrarchaeia were below the detection limit (Fig. 1b), providing the first indication that in the environment this group is a member of the rare biosphere. However, in the protein-amended samples, the 16S rRNA gene copies of the class Ca. Penumbrarchaeia increased 100-fold from day 98 to day 157, reaching a number of 1.2 × 10⁷ gene copies per ml slurry (Fig. 1b, day 157). After 372 days (744 days total), gene copies decreased again (8.7 × 10⁵ gene copies per ml slurry; Fig. S2). Based on these initial findings, the uncultivated class Ca. Penumbrarchaeia might be involved in protein metabolism. Metagenomic sequencing was performed to further analyze the metabolic potential of this class.

Data mining uncovers 35 MAGs representing the class Ca. Penumbrarchaeia

Through metagenomic sequencing of the protein and antibiotics amended enrichment from day 157, we retrieved one MAG classified as a member of the class Ca. Penumbrarchaeia. To analyze the phylogeny between the 16S rRNA gene obtained from the Ca. Penumbrarchaeia MAG and the highly abundant Ca. Penumbrarchaeia ASV obtained through 16S rRNA gene amplicon sequencing, we calculated a 16S rRNA gene phylogenetic tree (Fig. S3a). Both the single copy of the full-length 16S rRNA gene from the MAG and the 16S rRNA gene sequence of the ASV fell into the same branch with a sequence identity of 100%. A nucleotide blast search against the rRNA/ITS databases of NCBI (date accessed 19.03.2024) [46] revealed that the closest cultured representative of both sequences was Methanomassiliicoccus luminyensis, with a sequence identity of 83–84% and 100% of the 16S rRNA gene sequence of the ASV and 97% of the 16S rRNA gene from the MAG being covered (Table S2).

To further characterize the novel class, we performed group-targeted data mining, investigating the phylogenetic breadth, biogeography, abundance, and rarity in other habitats, as well as the metabolic capabilities. We increased the number of available genomes for the class Ca. Penumbrarchaeia by searching 11,479 genome assemblies of the phylum Thermoplasmatota and additional unclassified genome assemblies (Fig. 2a) published in GenBank. Still, we could only find an additional five published Ca. Penumbrarchaeia MAGs, prompting us to screen 57.8 TB of publicly available metagenomic short-read data. Using the nonredundant marine Ca. Penumbrarchaeia MAGs retrieved through the first screening, we searched 8287 publicly available metagenomic runs (of 7684 metagenomic samples) of aquatic origin for the presence of Ca. Penumbrarchaeia MAGs (Fig. 2b, c). Further including data retrieved from the OMDB (v2), sampling bias was reduced by screening samples of in total 54 different categories, covering coastal and open ocean environments, along with extreme habitats known to host Thermoplasmatota [47–51] (Fig. 2b). In 30 metagenomic studies, the class Ca. Penumbrarchaeia was found with a cumulative coverage > 2 (Table S3). Medium- and high-quality Ca. Penumbrarchaeia MAGs as per the minimum information about a metagenome-assembled genome (MIMAG) [52] could be reconstructed from the studies PRJNA368391, PRJNA531756, PRJNA541421, PRJNA704804, PRJNA721298, and PRJNA889212 and were added to our dataset. Further, previously unpublished Ca. Penumbrarchaeia MAGs from the Baltic Sea, Benguela Upwelling System (Namibia), Cariaco Basin (Venezuela), Arabian Sea, Qinghai Lake (China), and the Scotian Slope (Atlantic Ocean) were added to our dataset (Table S4), yielding a total of 35 medium- to high-quality Ca. Penumbrarchaeia MAGs with a completeness of > 80% (80.3 to 97.9%) and contamination of < 5% (0.05 to 4.35%). Genome sizes of the single MAGs were adjusted based on their completeness and varied from the smallest genome size of 1.14 Mbp to the largest genome size of 3.94 Mbp (Table S5).
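
The completeness adjustment of genome sizes can be sketched as follows. This is a minimal illustration, assuming the adjustment simply divides the assembled length by the estimated completeness fraction; the text does not give the exact formula, and the function name and example values are hypothetical.

```python
def estimated_genome_size(assembly_bp: float, completeness_pct: float) -> float:
    """Scale an assembled MAG length up to an estimated full genome size.

    Assumption: only completeness is corrected for (no contamination term).
    """
    if not 0 < completeness_pct <= 100:
        raise ValueError("completeness must be in (0, 100]")
    return assembly_bp / (completeness_pct / 100.0)


# Hypothetical MAG: 2.0 Mbp assembled at 80% estimated completeness.
size_bp = estimated_genome_size(2_000_000, 80.0)
print(f"{size_bp / 1e6:.2f} Mbp")
```

Under this assumption, a less complete MAG is scaled up more strongly, so the estimate becomes less reliable as completeness approaches the 80% cutoff.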

Fig. 2.

Screened data for retrieving novel Ca. Penumbrarchaeia MAGs. Number of datasets of a screened published MAGs from GenBank and b screened metagenomic samples from the SRA, the OMDB (v2), and those from which Ca. Penumbrarchaeia MAGs were reconstructed. Numbers above bars indicate the number of samples in which Ca. Penumbrarchaeia was detected and ultimately reconstructed. The category “other” includes samples of 35 additional environmental categories, for which < 100 metagenomic samples were available and in none of which Ca. Penumbrarchaeia was detected (e.g., algae, alkali sediment, viral, hypolithon, or coral reef metagenomes). c World map of all screened metagenomic samples derived through short-read data mining and the OMDB (v2) with an indication of those locations at which target MAGs were detected

Ca. Penumbrarchaeia forms a novel class with four orders

Next, we performed a phylogenomic analysis on 370 Thermoplasmatota species found within the screened 11,479 genome assemblies and all 35 Ca. Penumbrarchaeia MAGs. The tree based on 53 archaeal marker genes showed one distinct cluster with high bootstrap support, consisting of all 35 Ca. Penumbrarchaeia MAGs (Fig. 3a). Taxonomic ranks were assigned to the Ca. Penumbrarchaeia MAGs using the relative evolutionary divergence (RED) rank normalization according to GTDB. Based on the calculated RED [37], the Ca. Penumbrarchaeia cluster can be grouped into four orders (orders 1–4) (Fig. S4), which consist of a total of six families (order 1: family 1A, 1B; order 2: family 2; order 3: family 3A, 3B; order 4: family 4) (for further information, see Supplementary results II). From these, only families 1A and 1B were represented in the GTDB database versions 207 and 214. Additional computed average nucleotide identity (ANI) and amino acid identity (AAI) supported the classification according to reported RED values (Figs. S5 and S6). The phylogeny was later also confirmed by a species tree based on the core genome (“Annotation of EX4484-6 MAGs” section) of the class Ca. Penumbrarchaeia (Fig. S3b).

Fig. 3.

Phylogenomic tree of the Thermoplasmatota based on 53 archaeal marker genes. a Maximum-likelihood tree (RAxML, 100 bootstraps) of 370 Thermoplasmatota MAGs and 35 Ca. Penumbrarchaeia MAGs obtained through data mining of genome assemblies and metagenomic short-read datasets. MAGs in bold define the nonredundant Ca. Penumbrarchaeia MAGs, derived through MAG dereplication (“Retrieval of EX4484-6 MAGs from public archives” section). Node labels indicate RED values, which were used to define the phylogeny of the class Ca. Penumbrarchaeia into four orders, consisting of six families (1A (Ca. Penumbrarchaeia), 1B, 2, 3A, 3B, 4). The environments from which single MAGs are derived are indicated by a colored strip. b Relative abundance of nonredundant Ca. Penumbrarchaeia representative MAGs in the environment estimated from metagenomic short-reads mapped against a competitive reference and quantified using coverm. A more detailed description is provided in the main “Relative MAG abundance in the environment” section. The environments individual MAGs were found in are indicated by color

Ca. Penumbrarchaeia is globally distributed in coastal and organic-rich areas

We aimed to further gain insights into the biogeography of the Ca. Penumbrarchaeia MAGs and differences between the habitats of individual families within this class. For this, 8573 metagenomic runs were mapped against a competitive mapping index that included 20 nonredundant Ca. Penumbrarchaeia representatives, and relative abundances of MAGs in these samples were calculated. From all searched metagenomic sequencing runs, Ca. Penumbrarchaeia MAGs could only be detected in 128 runs (1.49% of all searched runs; Tables S6 and S7). Three of the nonredundant Ca. Penumbrarchaeia MAGs (MSM105, EMB267, GCA_016928095.1) were not found in any of the searched runs. As we used stringent detection methods with a minimum breadth of coverage of 50% and a minimum percentage identity of 95% for computation of relative abundances to avoid false positives, it is possible that these three MAGs were not detected because of false negatives or because their specific environment of preference was not sufficiently well represented in our selection of metagenomic runs.
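
The detection criterion and the reported detection rate can be illustrated with a short sketch. It assumes, for simplicity, that breadth-of-coverage values were computed only from reads that already passed the 95% identity filter; the data structure, run and MAG names, and breadth values are hypothetical, and only the two thresholds and the counts 128 and 8573 come from the text.

```python
MIN_BREADTH = 0.50  # minimum breadth of coverage stated in the text


def detected_runs(breadth_by_run: dict[str, dict[str, float]],
                  min_breadth: float = MIN_BREADTH) -> set[str]:
    """Return runs in which at least one MAG passes the breadth threshold.

    Assumption: breadth was computed from reads already filtered at
    >= 95% identity, so identity is not re-checked here.
    """
    return {
        run
        for run, breadths in breadth_by_run.items()
        if any(b >= min_breadth for b in breadths.values())
    }


def percent_detected(n_detected: int, n_searched: int) -> float:
    """Share of searched runs with a detection, in percent."""
    return 100.0 * n_detected / n_searched


# Toy input: run -> {MAG -> breadth of coverage}; all values invented.
toy = {
    "run_A": {"MAG_1": 0.72, "MAG_2": 0.10},
    "run_B": {"MAG_1": 0.49},
    "run_C": {},
}
print(sorted(detected_runs(toy)))
print(round(percent_detected(128, 8573), 2))
```

A MAG covered over 49% of its length, as in the toy `run_B`, counts as absent here, which shows how a strict cutoff can produce the false negatives discussed above.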

Relative abundances of Ca. Penumbrarchaeia MAGs in the environment ranged between 0% and a maximum of 0.32%, with an average of 0.031% (Figs. 3b and 4, Fig. S7). While at the class level Ca. Penumbrarchaeia could be detected in continental shelf environments worldwide (Fig. S7), single families preferred very distinctive habitats (Figs. 3b and 4). MAGs of the families 1A and 2 were mostly detected in marine surface sediments (Table S7). Family 1A was additionally present in a variety of environments, such as lake and marine sediments, hydrothermal vents, and cold seeps, and was therefore more globally distributed (Figs. 3b and 4). The MAGs of family 2 were present in marine sediments from the Baltic Sea (Fig. 3b). MAGs of family 1B were only found in hydrothermal sediments from the Guaymas and Pescadero basins (Fig. 3b). All MAGs from the families 3A, 3B, and 4 were found in the marine water column, specifically in oxygen-minimum and oxygen-deficient zones (Figs. 3b and 4). MAGs of families 3B and 4 were confined to the Cariaco Basin (Venezuela), while family 3A was additionally observed in the Arabian Sea and off the coasts of Mexico, Peru, and Chile (Fig. 3b). Two of the MAGs originating from the water column (SAMEA2620113_1 and SAMN10231904_1) were found predominantly in the free-living fraction of the seawater community (retained on filters with a pore size 0–5 µm), suggesting a habitat selection of these microorganisms (Fig. 4a). However, most other MAGs found within the water column were derived from bulk seawater samples (retained on filters with 0.2-µm pore size); thus, it cannot be resolved whether they were part of the free-living or particle-associated community (Fig. 4a). Compared to other classes within the phylum Thermoplasmatota, the class Ca. Penumbrarchaeia showed the lowest average relative abundance in the environment (Fig. S8b). The highest relative abundances among the Thermoplasmatota were found in the classes Thermoplasmata (up to 43%) and Ca. Poseidonia (up to 6%).

Fig. 4.

Relative abundances of Ca. Penumbrarchaeia MAGs in screened samples for a individual nonredundant MAGs. For MAGs found in marine water samples, the filter size each sample was filtered through is indicated by shape. Colors indicate the environment of the samples in which the MAGs were found. b Relative abundance of Ca. Penumbrarchaeia MAGs summarized by family and for the class as a whole. Family 1A corresponds to Ca. Penumbrarchaeia

Housekeeping genes and shared metabolic potential encoded by the core genome of the class Ca. Penumbrarchaeia

To assess genomic differences among families, orthogroups shared between MAGs of the class Ca. Penumbrarchaeia were studied. An orthogroup is defined as a set of genes descended from a single gene in the last common ancestor [53], and its members are therefore considered homologous [54]. For this, we performed a non-metric multidimensional scaling (NMDS) analysis based on orthogroups found within all Ca. Penumbrarchaeia MAGs (Fig. S9a). The NMDS showed a clustering of MAGs, which resembled the structure of the phylogenomic tree (Fig. 3a), with MAGs derived from sediment samples and those derived from water samples forming two well-defined clusters. Generally, the number of orthogroups found per family reflected differences in genome sizes within the families (Fig. S9b, Table S5). Three of the families contained high numbers of family-specific orthogroups. In family 1A, with the highest number of orthogroups (n = 2961), 1179 family-specific orthogroups (40% of the orthogroups in this family) could be identified (Fig. S9b). Families 3A and 4 from the water column contained 848 (32%) and 800 (40%) family-specific orthogroups, respectively. We further observed that more closely related families within the class Ca. Penumbrarchaeia shared more orthogroups (Fig. S10). In total, we found 477 orthogroups, which were present in at least 90% of all 35 Ca. Penumbrarchaeia MAGs, and as such were defined as the core genome of the class Ca. Penumbrarchaeia (Table S8). Overall, the core genome encoded expected housekeeping genes involved in gene expression, such as translation, transcription, replication, DNA repair, tRNA biogenesis, and ribosomal proteins. Further, genes affiliated with metabolic processes, such as transporters, the gluconeogenesis pathway, reverse TCA (rTCA) cycle, pyruvate metabolism, fatty acid, and amino acid degradation, were found within the core genome (Tables S8, S9, S10, S11). Moreover, genes needed for biosynthetic processes, such as the nucleotide and amino sugar metabolism as well as the biosynthesis of amino acids, fatty acids, lipids, glycans, vitamins, and cofactors, were encoded (Tables S8, S9, S10).
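
The core-genome criterion (presence in at least 90% of MAGs) can be sketched as a simple presence filter. With 35 MAGs, this threshold corresponds to at least 32 genomes. The input structure and all identifiers below are hypothetical; the sketch only illustrates the stated 90% rule, not the orthogroup inference itself.

```python
def core_orthogroups(presence: dict[str, set[str]],
                     n_genomes: int,
                     min_fraction: float = 0.90) -> set[str]:
    """Return orthogroups found in at least `min_fraction` of all genomes.

    presence maps an orthogroup ID to the set of MAGs that carry it.
    """
    return {
        og
        for og, mags in presence.items()
        if len(mags) / n_genomes >= min_fraction
    }


# Toy example with 10 hypothetical MAGs (m0..m9).
mags = [f"m{i}" for i in range(10)]
toy_presence = {
    "OG_all": set(mags),         # 10/10 genomes -> core
    "OG_nine": set(mags[:9]),    # 9/10 genomes  -> core (exactly 90%)
    "OG_eight": set(mags[:8]),   # 8/10 genomes  -> not core
}
print(sorted(core_orthogroups(toy_presence, n_genomes=len(mags))))
```

Allowing up to 10% absence tolerates genes missing from incomplete MAGs, which matters here because completeness ranged from 80.3 to 97.9%.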

Heterotrophy and mixotrophy as main nutritional strategies of the class Ca. Penumbrarchaeia

As we found our initial Ca. Penumbrarchaeia MAG within protein-amended enrichments, we analyzed all MAGs for the potential of protein degradation (Table S11). Metabolic capabilities included amino acid degradation for all MAGs of this novel class (for details, see Fig. 5). MAGs of the families 1A, 1B, and 2 additionally encoded genes for extracellular peptidases of the families C11A, M14B, and S08A (Fig. 5, Fig. S11a), which are required for the first step of protein polymer hydrolysis into smaller peptides, whereas families 3A, 3B, and 4 lacked genes for these enzymes. However, MAGs of all families encoded genes for oligopeptide transporters, different aminopeptidases (pepF, pepT, pepP, pepS, map), and aminotransferases (Table S12, Fig. S11b). Furthermore, all MAGs encoded aspartate aminotransferase (aspB) and alanine aminotransferase (alaA) for deamination. Single MAGs encoded an alanine-glyoxylate transaminase (AGXT2), branched-chain amino acid aminotransferase (ilvE), and aromatic amino acid aminotransferase. After deamination, the resulting 2-oxoacids can be further converted to acetyl-CoA via pyruvate ferredoxin oxidoreductase (por) or to acyl-CoA via indolepyruvate ferredoxin oxidoreductase (ior), 2-oxoacid:ferredoxin oxidoreductase (kor), or 2-oxoisovalerate ferredoxin oxidoreductase (vor). Energy-rich acyl-CoA can support ATP formation by an acetyl coenzyme A synthetase (acdAB; ADP-forming), which was found in all orders. In four families (1B, 2, 3A, and 4), genes for succinyl-CoA synthetase (sucCD) were also found. Amino acid degradation yields acetate and other organic acids as the main products.

Fig. 5.

Metabolic reconstruction of the main metabolic features of the novel class Ca. Penumbrarchaeia. The presence of genes is indicated by full or half circles for each family or with red stars if present in > 75% of the MAGs of all families. Amino acid degradation is as follows: gdhA glutamate dehydrogenase, kor 2-oxoacid:ferredoxin oxidoreductase, vor 2-oxoisovalerate ferredoxin oxidoreductase, ior indolepyruvate ferredoxin oxidoreductase, por pyruvate ferredoxin oxidoreductase, and acd acetyl coenzyme A synthetase; beta-oxidation: ACADS butyryl-CoA dehydrogenase, ACADM acyl-CoA dehydrogenase, crt enoyl-CoA hydratase, fadB 3-hydroxyacyl-CoA dehydrogenase, and fadA acetyl-CoA acyltransferase; rTCA: acl/ACLY ATP-citrate lyase, ACO aconitate hydratase, idh isocitrate dehydrogenase, korABCD 2-oxoacid:ferredoxin oxidoreductase, sucCD succinyl-CoA synthetase, sdhAB succinate dehydrogenase/fumarate reductase, fum fumarate hydratase, and mae malate dehydrogenase (oxaloacetate-decarboxylating); pyruvate metabolism: acd acetyl coenzyme A synthetase, acs acetyl-CoA synthetase, por pyruvate ferredoxin oxidoreductase, and ldh lactate dehydrogenase; and hydrogenases: hydADGB sulfhydrogenase, mvh F420-non-reducing hydrogenase, hdr heterodisulfide reductase, hypABCDEF hydrogenase expression/formation protein, atpABCDEFGHJK V/A-type H+-transporting ATPase, lctP lactate permease, hppA K(+)-stimulated pyrophosphate-energized sodium pump, and TC.NSS neurotransmitter:Na+ symporter family

The families 2, 3A, and 4 additionally encoded all genes of the beta-oxidation pathway, including a butyryl-CoA dehydrogenase (ACADS), acyl-CoA dehydrogenase (ACADM), enoyl-CoA hydratase (crt), 3-hydroxyacyl-CoA dehydrogenase (fadB), and acetyl-CoA acyltransferase (fadA) to further degrade short- and medium-chain acyl-CoAs. Family 1B additionally contained a lactate dehydrogenase (ldhA), which forms lactate from pyruvate. The findings of protein, amino acid, and fatty acid degradation among the families of the Ca. Penumbrarchaeia suggest a heterotrophic lifestyle for this class. Additionally, all families contained genes encoding the partial rTCA cycle. Moreover, the presence of genes of the Wood-Ljungdahl pathway in family 3A could indicate a mixotrophic lifestyle for this family. A more detailed annotation of metabolic and assimilatory pathways can be found in supplementary results for all orders (Results SIII, Figs. S12, S13, S14, S15, Tables S10, S11, S12).

Ca. Penumbrarchaeia families are adapted to environmental stress

MAGs of all families encoded genes for the prevention of oxidative stress, including thioredoxin reductase (trxR), thioredoxin (trxA), and desulfoferredoxin (dfx), acting as superoxide reductase [55]. Moreover, in MAGs of all families, except family 4, genes for the conversion of hydrogen peroxide to water, catalyzed by peroxiredoxin (prxQ), were found [56]. MAGs of families 1A and 2 further contained genes for the reduction of arsenate via an arsenate reductase (arsC), which reduces arsenate As(V) to arsenite As(III). For the removal of arsenite from the cell, a gene encoding an arsenite transporter (acr3/arsB) was found in MAGs of families 1A, 2, and 3A. An arsenite methyltransferase (AS3MT) was additionally found in the MAGs of families 1A, 3A, 3B, and 4. Lastly, a gene for the defense against antimicrobial drugs was encoded in all MAGs of families 1A, 2, and 4, annotated as a MATE family drug/sodium antiporter, which is driven by a sodium gradient [57].

Rare MAGs in the phylum Thermoplasmatota hold higher numbers of genes encoding functionally unknown proteins

Defining protein-coding genes without a known functional annotation as either hypothetical (at least one hypothetical classification through KEGG or the nonredundant RefSeq database NR) or unknown (no database match), we observed high numbers of genes encoding proteins with hypothetical and unknown functions in all MAGs of Ca. Penumbrarchaeia (Table S14, Fig. S16). Considering the low relative abundance of Ca. Penumbrarchaeia MAGs in the screened metagenomic runs, this raises the question of whether the unexpectedly high number of genes encoding hypothetical and unknown proteins in this novel class may be related to its rarity. The percentage of genes encoding unknown proteins among the Ca. Penumbrarchaeia ranged between 1.6 and 63% in single MAGs with an average of 26%, and high percentages of genes encoding unknown proteins were found in MAGs of the families 2, 3A, 3B, and 4 (43–63%), most of which derived from the water column (Fig. S16). Additionally, AGNOSTOS was run to investigate novelty at protein domain level [42]. Based on the AGNOSTOS classification, 14–29% of the encoded proteins in the Ca. Penumbrarchaeia MAGs were classified as genomic unknowns (genes with unknown function, derived from sequenced or draft genomes), and between 0.24 and 17% of genes were characterized as environmental unknowns (genes with unknown function, found only in environmental metagenomes or MAGs) (Fig. S17) with no further functional assignment. While classification through AGNOSTOS showed a higher percentage of genes encoding proteins with known protein domains, KEGG and NR annotations could not give functional assignments for these genes, which were therefore regarded as genes with novel function. Novel genes in this study were defined as those which are orphaned in function, despite possibly being found in other microorganisms.

To test for a correlation between the percentage of genes encoding unknown proteins and the occurrence of the MAGs within the phylum of Thermoplasmatota, we defined a rarity index as the median relative abundance of the MAG in the environment, weighted by the fraction of datasets in which the MAG occurred across all screened datasets (fraction of occurrence). Based on the rarity index, we differentiated between rare (rarity < median rarity) and common (rarity > median rarity) MAGs. The percentages of genes encoding unknown proteins in the rare group were higher than in the common group (Fig. 6b). Notably, we found that most of our Ca. Penumbrarchaeia MAGs (73%) were defined as rare by our definition (Fig. 6a, Table S16). Further, we observed that the three novel Ca. Penumbrarchaeia orders (orders 2, 3, and 4) hold a higher percentage of genes encoding unknown proteins compared to any other Thermoplasmatota order (Fig. 6c). MAGs in the category “not detected” were not found in any of the screened metagenomic runs derived from aquatic samples (Fig. 6) as these MAGs originated from soil, biodigesters, or human- and animal-associated habitats (Table S15). Thus, low or missing relative abundances for these nonaquatic Thermoplasmatota do not necessarily classify these taxa as overall rare but as rare in the screened environments.
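
The rarity index and the rare/common split can be sketched as below. This is a minimal illustration under several assumptions that the text does not spell out: the median relative abundance is taken over the datasets in which a MAG was detected, "weighted by" means multiplication by the fraction of occurrence, the cutoff is the median index across detected MAGs, and a MAG exactly at the cutoff is labeled common. All MAG names and values are hypothetical.

```python
from statistics import median
from typing import Optional


def rarity_index(abundances: list[float], n_screened: int) -> Optional[float]:
    """Median relative abundance times the fraction of occurrence.

    abundances: relative abundances in the datasets where the MAG was detected.
    Returns None for MAGs that were never detected.
    """
    if not abundances:
        return None
    return median(abundances) * (len(abundances) / n_screened)


def classify(indices: dict[str, Optional[float]]) -> dict[str, str]:
    """Label each MAG as rare, common, or not detected."""
    detected = [v for v in indices.values() if v is not None]
    cutoff = median(detected) if detected else None
    labels = {}
    for mag, value in indices.items():
        if value is None:
            labels[mag] = "not detected"
        elif value < cutoff:
            labels[mag] = "rare"
        else:
            labels[mag] = "common"  # ties land here (assumption)
    return labels


# Toy data: 10 screened datasets, five hypothetical MAGs.
toy_abundances = {
    "A": [0.01, 0.03],
    "B": [0.2, 0.4, 0.6],
    "C": [],
    "D": [0.05],
    "E": [0.3, 0.3],
}
toy_indices = {m: rarity_index(a, 10) for m, a in toy_abundances.items()}
print(classify(toy_indices))
```

Because both factors shrink for sparse, low-abundance taxa, a MAG that is abundant in a single dataset can still end up in the rare group, as the toy MAG `D` does.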

Fig. 6

Relationship between protein novelty and MAG occurrence. a Percentage of genes encoding proteins of unknown function vs. rarity for each of the redundant MAGs within the phylum Thermoplasmatota. The x-axis was square-root (sqrt)-transformed. The rarity index was defined as median relative abundance weighted by the fraction of occurrence. MAGs below the median rarity were defined as rare, and MAGs above were defined as common. b Percentage of genes encoding proteins of unknown function in genomes of the three defined rarity groups: rare, common, and not detected (nd), the latter comprising genomes to which none of the screened metagenomic short-read data mapped. Differences between groups were tested by Wilcoxon rank-sum tests, the Bonferroni-adjusted significance threshold is 0.0167, and p-values are indicated by asterisks (****p ≤ 0.0001, ***p ≤ 0.001, **p ≤ 0.01). Number of observations N indicates the number of genomes sorted into each of the defined rarity groups. c Percentage of genes encoding proteins of unknown function for each order found within the phylum Thermoplasmatota. Number of observations N represents the number of Thermoplasmatota genomes per order

Following up on the hypothesis of the rare biosphere acting as a gene pool for the community, we investigated if orthogroups found in the Ca. Penumbrarchaeia MAGs were also present in other Thermoplasmatota. While we observed significant differences between orthogroups shared among Ca. Penumbrarchaeia MAGs and those shared with other Thermoplasmatota for orthogroups in all three categories (annotated, hypothetical, and unknown; Wilcoxon rank-sum test, Bonferroni adjusted p-value: 0.0167), the absolute effect size was higher for hypothetical and unknown orthogroups (Impact Effect size test, annotated: 0.3496, hypothetical: 1.1955, unknown: 2.2668). Thus, hypothetical and unknown orthogroups found in the Ca. Penumbrarchaeia MAGs were mostly confined to the class Ca. Penumbrarchaeia (Fig. S18), while annotated orthogroups were more likely to be shared between Ca. Penumbrarchaeia MAGs and other Thermoplasmatota. Ancestral gene family reconstruction further predicted that the majority of these hypothetical and unknown orthogroups were gained by the four orders within the class Ca. Penumbrarchaeia, rather than being present in the last common ancestor and subsequently lost by other members of the Thermoplasmatota (Fig. S19).

Discussion

The importance of the rare biosphere in the environment and its role in promoting community stability has been recognized only recently [2, 6, 13, 58, 59]. Studies on global phylogenetic diversity identified not only abundant taxa but also members of the rare biosphere [37–39, 60]. Yet, studies on the rare biosphere in marine environments are very limited, and the roles rare taxa play in these ecosystems remain elusive. We detected the novel and rare class Ca. Penumbrarchaeia within the phylum Thermoplasmatota in protein-amended enrichments. Using these first findings, we combined information gathered from the enrichment experiments and coupled these to extensive group-targeted data mining to investigate organic matter degradation by members of the Ca. Penumbrarchaeia. This class has been so far overlooked; it remains unexplored and scarcely represented in INSDC databases.

In this study, we identified the class Ca. Penumbrarchaeia as an organic matter degrader in sediments from the Helgoland mud area, likely able to survive on protein compounds as a sole carbon source, enabled by the presence of genes encoding pathways for the degradation of amino acids in all 35 Ca. Penumbrarchaeia MAGs (Fig. 5). The presence of a partial rTCA could be indicating an alternative pathway to convert the intermediate glutamate into acetyl-CoA as has been described previously by Yin et al. [61]. These results are consistent with findings from several other members of the phylum Thermoplasmatota, which live heterotrophically or mixotrophically by degrading organic matter, i.e., utilizing fatty acids, carbohydrates, proteins, peptides, and amino acids as their substrates in aquatic environments [44, 45, 61, 62].

Along with genes for the degradation of amino acids, MAGs of sediment-inhabiting Ca. Penumbrarchaeia families encoded genes for extracellular peptidases involved in protein polymer hydrolysis. A previous study identified low-abundant archaeal groups involved in catabolic protein degradation, utilizing proteins as both energy and carbon sources [61]. Generally, many microbial groups are capable of protein degradation, but in an anabolic fashion, incorporating extracellular peptides or amino acids as the carbon and nitrogen sources. In contrast, catabolic protein degraders appear to be rare in environments, as revealed by several studies using stable isotope probing approaches [61, 63, 64]. For example, some unknown Thermoplasmatota and Lokiarchaeia have been identified as dissimilatory protein degraders, yet they remain nearly undetectable in situ based on 16S rRNA gene sequencing [61]. An explanation is that in addition to the competition with widespread anabolic protein degraders, the bioavailability of free proteins or amino acids in environments is extremely limited (typically below 10 µM for amino acids) [65]. Such scarcity of accessible proteins as both energy and carbon sources will constrain the survival and prevalence of the catabolic degraders, which require a higher protein supply than the anabolic counterparts.

In MAGs derived from the water column, genes for protein degradation were not found, contrasting with other abundant planktonic Thermoplasmatota that were shown to degrade proteins [62]. The pelagic Ca. Penumbrarchaeia MAGs in our study might have lost the trait for protein degradation or thrive in close proximity to those microbes that host extracellular peptidases. Although missing genes in some families could also result from the incompleteness of the analyzed MAGs, our observation suggests different lifestyle preferences among the Ca. Penumbrarchaeia families.

Although the potential for fatty acid degradation was observed in some of the MAGs, the oxidation reaction is thermodynamically unfavorable, and mechanisms for the consumption of reducing equivalents are required [66]. As we did not observe an alternative electron acceptor, electrons might undergo bifurcation via the detected heterodisulfide reductase [67], which was present in some Ca. Penumbrarchaeia members with the potential for fatty acid oxidation. Alternatively, the fermentation of fatty acids could also become favorable by a low hydrogen partial pressure in the environment [66]. Therefore, despite showing the potential for fatty-acid degradation, the feasibility of this metabolism remains so far unresolved for members of Ca. Penumbrarchaeia.

Overall, all datasets containing reads of Ca. Penumbrarchaeia MAGs were confined to continental shelf environments. These areas are regarded as organic carbon storage hotspots where large amounts of carbon are supplied through river discharge and land runoff and large phytoplankton blooms in upwelling areas [68, 69]. Due to high organic matter input and high sedimentation rates, oxygen in such sediments can be depleted within the first millimeters, generating an anoxic environment [70]. Corroborating Ca. Penumbrarchaeia presence in anoxic environments, all MAGs from the water column were predominantly found in oxygen-minimum and oxygen-deficient zones, in which high bioavailability of marine organic matter is prevalent due to high primary productivity in the overlying water column and high respiration causes local anoxia [71–74]. The presence of all MAGs only in such organic, carbon-rich, and anoxic environments with high input of marine organic matter agrees with our findings that the class Ca. Penumbrarchaeia is involved in amino acid and partially protein degradation, besides being protected against possible oxidative stress.

In addition to the core metabolism shared among the majority of Ca. Penumbrarchaeia MAGs, we identified environment-specific adaptations. MAGs retrieved in this study were found in samples from marine sediments, lake sediments, hydrothermal sediments and vents, cold seeps, and the marine water column. While most Ca. Penumbrarchaeia families were found at distinct sites only and may therefore occupy very distinctive habitats, two families (1A and 3A) exhibited a more widespread distribution. We detected a high functional repertoire of the most heterogeneous family 1A, reflected by a high number of family-specific orthogroups. Finding the highest number of MAGs in this family reflects its broad distribution. Specific environmental adaptations could be detected in most of the families. For example, families 1A, 2, and 4 were found in locations of the Baltic Sea and the Cariaco Basin in Venezuela, which are in proximity to riverine input. As such, these locations might be experiencing input of waste water, which could explain the defensive MATE family drug/sodium antiporter excreting antimicrobial drugs or naturally occurring antibiotics in these MAGs. The presence of such genes might indicate antimicrobial resistance in these human-influenced environments [75–77]. Further, MAGs found in the Baltic Sea carried genes against arsenic toxicity, which is in agreement with high arsenic concentrations reported in this environment [78].

Not only did the class Ca. Penumbrarchaeia exhibit a habitat selection of specific continental shelf environments, but its relative abundance in the environment also did not exceed 0.32%, with an average relative abundance of < 0.05%. In contrast to our findings, other classes within the Thermoplasmatota, such as Ca. Poseidonia and Thermoplasmata, were found to have much higher relative abundances in the screened aquatic environments. Only recently, novel taxa, namely Ca. Sysuiplasmatales, Ca. Lunaplasmatales, Ca. Yaplasmales, and Ca. Gimiplasmatales, were described and are characterized as having a limited environmental distribution and low abundances [51, 79–81]. The continuous detection of novel rare taxa indicates the untapped diversity, which lies within the Thermoplasmatota and likely lies within other prokaryotic phyla as well. With continuous advances in sequencing technologies, deeper sequencing will shed more light on unknown microorganisms. Our findings highlight the importance of studying the rare biosphere to understand not only their metabolic functions, including so far undescribed functional diversity, but also evolutionary processes [34], such as the transition from an aerobic to anaerobic lifestyle among members of the Thermoplasmatota [51].

Besides being classified as a rare biosphere member, all orders of the class Ca. Penumbrarchaeia contained high percentages of genes encoding unknown and hypothetical proteins. It is a common feature of uncultivated and rare taxa to contain high numbers of unknown protein families [2, 40] with up to 70% of genes that cannot be annotated by databases such as KEGG and NR [38, 82], as these databases use complete genomes and functionally characterized genes as reference [83, 84]. These high numbers of unknown protein families may be indicative of the class functioning as a genetic seed bank under changing environmental conditions [2, 6, 7]. Especially, rare organisms are expected to harbor genes that could become important once environmental pressures arise. By activating these genes, community resilience is promoted.

To further investigate the distribution and origin of the genes encoding hypothetical and unknown proteins found within the Ca. Penumbrarchaeia MAGs in our study, we analyzed the presence of the respective genes in members of all Thermoplasmatota. Most of these genes were unique to the class Ca. Penumbrarchaeia, not being present in the last common ancestor to other Thermoplasmatota lineages and therefore most likely not inherited vertically. Some genes encoding unknown and hypothetical proteins might be acquired laterally, as they were predicted to have been gained independently by Thermoplasmatota orders outside the class Ca. Penumbrarchaeia. Still, the origin of most genes encoding unknown and hypothetical proteins found in the Ca. Penumbrarchaeia remains unknown. As our exploration of the potential for lateral gene transfer is limited to the phylum Thermoplasmatota, additional insights on the gene origin may be gained by extending the analysis to all prokaryotic lineages.

The low probability of lateral gene transfer for these uncharacterized genes raises the question of which evolutionary pressures and vectors of lateral gene transfer might be inactive. As members of the rare biosphere, the class Ca. Penumbrarchaeia could undergo less phage predation, and as such, phage-mediated gene transfer might be restricted; yet, also genetic distance to other organisms could be indicative of a lack of lateral gene transfer by transduction [85]. Moreover, cell density in the marine subsurface or marine water column could be too low to allow efficient gene transfer [86, 87]. The lack of gene transfer of these genes encoding uncharacterized proteins furthermore indicates that these genes rather act as accessory than core metabolic functions [88] and, as such, promote the potential of members of the class Ca. Penumbrarchaeia as genetic seed bank.

Conclusions

With our study, we demonstrated how to combine targeted enrichments with metagenomic sequencing and group-targeted data mining to investigate the metabolic potential of the rare biosphere, specifically focusing on the uncharacterized class Ca. Penumbrarchaeia within the phylum Thermoplasmatota. We identified the class Ca. Penumbrarchaeia as utilizing proteins and amino acids in organic matter-rich and aquatic habitats with limited or no oxygen. Our study revealed habitat specificity for all families in this class with low abundances in their environments. We used the class Ca. Penumbrarchaeia to better understand features of rare microorganisms in the environment and could identify high percentages of genes encoding hypothetical and unknown proteins, compared to other classes of the phylum Thermoplasmatota. The limited lateral transfer of these uncharacterized genes offers an intriguing incentive to further explore the functional and genetic diversity among members of the rare biosphere. Temporally variable abundances and niche preferences of rare biosphere members can be further investigated by sequencing so far underrepresented environments and by sampling over time series. Our findings not only shed light on the metabolic potential and habitat specificity of the class Ca. Penumbrarchaeia but also underscore the broader importance of exploring the rare biosphere. Building on this foundation, the data available in public archives present a valuable opportunity to target the generation of new data in areas of underrepresentation, such as the rare biosphere, by first thoroughly reusing what is available. With this study, we highlight the need for targeted and data-integrative approaches to gain further insights into the rare biosphere and unravel functions and metabolic potential that lie within these understudied taxa.

Methods

1. Sample collection and enrichments

Sediment was collected from the Helgoland mud area (54°05′15.5″N, 7°58′05.5″E) by gravity coring in 2017 during the RV Heincke cruise HE483 [89]. Sediments from a depth of 45–70 cm were selected from gravity core HE483/2–2 for initial slurry incubations. We selected sediments from the Helgoland mud area for these incubations as it is renowned for its high organic content [90, 91]. Anoxic slurry incubations were set up using sediment and sterilized artificial seawater (ASW; composition 26.4-g NaCl, 11.2-g MgCl2 · 6H2O, 1.5-g CaCl2 · 2H2O, 0.7-g KCl, and 4.26-g Na2SO4 per liter) at a ratio of 1:4 (w/v). A total volume of 50-ml mixed slurry was dispensed in 120-ml serum bottles sealed with butyl rubber stoppers. To remove residual oxygen, the headspace of the serum bottles was exchanged with N2. The slurry was preincubated for 2 days. Four replicates were set up for each initial treatment, containing either 1.86 g/l egg white protein or no additional substrate as control. Two of the four treatments were additionally amended with an antibiotic mix (D-cycloserine, kanamycin, vancomycin, ampicillin, and streptomycin; 40 mg/l for each). Samples were incubated at 20 °C. After 25 days, one replicate of the protein enrichment with antibiotics and one replicate without antibiotics were additionally amended with sodium 2-bromoethanesulfonate (BES; 4 mM) for the inhibition of methanogenesis.

2. Further enrichment of target microorganisms

To retrieve highly enriched Thermoplasmatota from the first generation of enrichments, 5-ml slurry were anoxically transferred into Widdel medium after 372 days of the initial incubations. A total volume of 25-ml medium was anaerobically dispensed into 50-ml serum bottles, and the headspace was flushed with N2:CO2 (80:20, v/v) (BIOGON C20 E941/E290; Linde, Germany). As basal medium, anaerobic Widdel medium [92] was prepared using salt water medium (20.0-g NaCl, 3.0-g MgCl2 · 6H2O, 0.2-g KH2PO4, 0.25-g NH4Cl, 0.5-g KCl, 0.15-g CaCl2 · 2H2O per liter). After sterilization, the basal medium was cooled down under an N2:CO2 (80:20, v/v) atmosphere. Sterilized 1-M NaHCO3 (30 ml), trace element solution SL 10 (1 ml) [93], selenite-tungstate solution (1 ml) [92], seven vitamins solution (0.5 ml) [94], and sodium sulfide (Na2S · 9H2O; final concentration 0.7 M) were added to 1-l basal medium. As a redox indicator, sterilized resazurin (0.2 ml; 0.2% w/v) was added. The pH was adjusted to 7.2–7.4.

All samples were amended once at the beginning of the enrichment experiment with 30-mM sulfate (Na2SO4) and an antibiotic mix (D-cycloserine, kanamycin, vancomycin, ampicillin, and streptomycin; 50 mg/l for each) that has been used previously to inhibit general bacterial activity in marine sediments [61, 95, 96]. Control samples were incubated without an additional carbon source. Protein samples were amended with 2.33 g/l egg white protein as the sole carbon source. Samples were incubated at 20 °C.

3. Nucleic acid extraction

Two milliliters of slurry was sampled anaerobically from each of the treatments at days 98, 157, and 372 of the second-generation enrichment (“Further enrichment of target microorganisms” section) for nucleic acid extraction according to Aromokeye et al. [97]. DNA pellets were washed twice with 1-ml cold 70% ethanol and eluted in 100-µl diethylpyrocarbonate (DEPC)-treated water (Carl Roth, Germany). The quality of nucleic acids was checked with a NanoDrop 1000 spectrophotometer (PEQLAB Biotechnologie, Erlangen, Germany).

4. 16S rRNA gene amplicon sequencing

Illumina amplicon sequencing libraries were prepared as described in Aromokeye et al. [97] using 30 PCR cycles. Primers targeting the V4 region of the archaeal 16S rRNA gene were Arc519F (5′-CAGCMGCCGCGGTAA-3′) [98] and Arc806R (5′-GGACTACVSGGGTATCTAAT-3′) [99]. Thermal cycling conditions were as follows: initial denaturation at 95 °C for 3 min, 30 cycles of denaturation at 95 °C for 20 s, annealing at 60 °C for 15 s, elongation at 72 °C for 15 s, and final elongation at 72 °C for 1 min. Amplicons were sequenced at Novogene (Cambridge, UK) on the NovaSeq 6000 platform (2 × 250 bp, Illumina) in mixed orientation by ligation, therefore resulting in forward and reverse amplicon orientation in both forward (R1) and reverse read files (R2). Reads were demultiplexed and primer clipped using cutadapt v2.1 [100] and further processed using the package dada2 v1.16.0 [101] in R v4.0.2 [102]. R1 and R2 reads were trimmed to 150 and 160 bp with a maximum expected error of 2. Subsequently, error rates were learned and samples dereplicated and denoised independently for each library orientation, by pooling the data from all samples, using a modified loess function adapted for libraries with binned quality scores [103]. Error-corrected R1 and R2 reads were merged into amplicon sequence variants (ASVs), and sequence tables for forward and reverse orientations were combined by reorientation of the reverse ASVs. Chimeras, ASVs of unexpected lengths (< 249 bp and > 255 bp), and singletons were removed. A bootstrap cutoff of 80 was used to perform taxonomic classification with the assignTaxonomy function of dada2 with a newly formatted GTDB r214 reference database containing the full 16S rRNA gene set. For further processing of archaeal ASVs, all non-archaeal taxa were removed.
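The combination of the two library orientations and the subsequent length and singleton filtering can be sketched in Python as follows. This is an illustration only and not the dada2-based R workflow used here: the function names are hypothetical, counts are assumed to be ASV totals summed over all samples, a singleton is assumed to be an ASV with a total count of 1, and chimera removal is not covered.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (non-ACGT symbols are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]

def combine_orientations(forward_counts, reverse_counts):
    """Merge two ASV count tables after reorienting the reverse-orientation ASVs."""
    combined = Counter(forward_counts)
    for seq, count in reverse_counts.items():
        combined[reverse_complement(seq)] += count
    return combined

def filter_asvs(counts, min_len=249, max_len=255):
    """Drop ASVs outside the expected length window and singletons (count of 1)."""
    return {
        seq: n
        for seq, n in counts.items()
        if min_len <= len(seq) <= max_len and n > 1
    }

# Hypothetical toy input: one 250-bp ASV seen in both orientations,
# one short ASV, and one singleton
asv = "ACGT" * 62 + "AC"
singleton = "GGCA" * 63
forward = {asv: 5, singleton: 1}
reverse = {reverse_complement(asv): 3, "ACGT" * 10: 10}

combined = combine_orientations(forward, reverse)
kept = filter_asvs(combined)
```

After reorientation, the same biological sequence observed in both orientations collapses into a single ASV whose counts are summed.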

5. Quantitative PCR (qPCR)

We quantified 16S rRNA gene copy numbers of Archaea, Bacteria, and uncultured Thermoplasmatota in the second-generation enrichments. Reaction mixtures contained 1 × Takyon MasterMix (Eurogentec, Seraing, Belgium), 4-µg bovine serum albumin (Roche, Mannheim, Germany), 300-nM primers, 1 ng of DNA template, or 2 µl of standard in a total reaction volume of 20 µl. The primer pair used for archaea was 806F/912R (5′-GACTACHVGGGTATCTAATCC-3′/5′-GTGCTCCCCCGCCAATTCCTTTA-3′, annealing temperature 58 °C) [104], for bacteria 8Fmod/338Rmod (5′-AGAGTTTGATYMTGGCTCAG-3′/5′-GCWGCCWCCCGTAGGWGT-3′, annealing temperature 58 °C) [105, 106], for uncultured Thermoplasmatota (EX4484-6, now Ca. Penumbrarchaeia) the newly designed pair 472F/633R (5′-CGGTAAATCTCTGGGTAAATCG-3′/5′-ACCCGTTCTGGTCGGACGCYTT-3′, annealing temperature 64 °C), and for Lokiarchaeia subgroup 2b a newly designed pair Loki2b_34F/Loki2b_278R (5′-TCCGACTGCTATCCGGGTAA-3′/5′-TCACGGCCCTTATCGATCAT-3′, annealing temperature 60 °C). Thermal cycling conditions were as follows: initial denaturation at 95 °C for 5 min, 40 cycles of denaturation at 95 °C for 30 s, annealing at given temperatures for 30 s, and elongation at 72 °C for 40 s.

6. Metagenomic analysis

After analyzing the amplicon sequencing data, the protein and antibiotics amended replicate of day 157 (second generation), which showed a high proportion of Thermoplasmatota, was chosen for metagenomic sequencing. An amount of ~ 1-µg extracted DNA was used for metagenomic sequencing on the Illumina HiSeq 4000 platform (2 × 150 bp) at Novogene (Cambridge, UK). Metagenomic reads were adapter and quality trimmed using bbduk from bbmap version 38.86 [107]. De novo assemblies were generated with SPAdes v.3.15.5 using the flag meta [108] and megahit v1.2.9 using the preset meta-sensitive [109]. For read mapping, fasta headers of each of the assemblies were simplified with anvio-7.1 [110]; contigs < 1000 bp were removed, and, subsequently, the quality-trimmed reads were mapped back to the assemblies using bowtie2 v2.3.5.1 [111]. For each assembly, binning was performed with metabat2 v2.12.1 [112] and CONCOCT v1.0.0 [113], followed by bin refinement using the bin refinement module of metaWRAP v1.3.1 [114]. The refined bins of both assemblies were then dereplicated by dRep v3.0.0 using the ANImf algorithm with a primary ANI of 0.9 (mash v2.3 [115]) and a secondary ANI of 0.95 (fastANI v1.32 [116]) with a minimum aligned fraction of 0.5 [117]. Dereplicated bins were reassembled using the bin reassembly module of metaWRAP.

Quality of obtained MAGs was calculated with checkM2 v0.1.3 [118], and taxonomic classification was assigned through gtdbtk v2.1.0 [119], based on the GTDB database v207 [37].

7. Retrieval of EX4484-6 MAGs from public archives

To increase the number of EX4484-6 MAGs for further analysis, a list of in total 19,931 genome assemblies, consisting of all Thermoplasmatota as well as ecological and unclassified MAGs according to their NCBI taxon ID, was retrieved through the ENA advanced search interface (date accessed: 22.08.2022; Fig. S20). Genome assemblies were filtered based on their scientific names, removing those from environments affiliated with anthropogenic activities. The remaining 11,479 MAGs were downloaded from RefSeq [84] or GenBank [120], quality assessed using checkM2 v0.1.3 [118], and filtered based on completeness (> 80%) and contamination (< 5%). Quality-filtered MAGs (1602) were taxonomically classified using gtdbtk v2.1.0 [119]. Only 714 MAGs classified as Thermoplasmatota were kept for further analyses. Additionally, this MAG collection was supplemented by 132 quality-filtered Thermoplasmatota MAGs from GTDB [37], which were not yet included in the ENA search output. Furthermore, 34 additional Thermoplasmatota MAGs from previous studies of Ca. Lutacidiplasmatales [50], Ca. Gimiplasmatales [80], Ca. Sysuiplasmatales [51], and uncultured Thermoplasmatota [61] were included in the dataset. The total remaining set of 844 MAGs was dereplicated using dRep v3.0.0 as described above (“Metagenomic analysis” section), resulting in a total number of 388 MAGs. Within this data set, five new nonredundant MAG clusters of EX4484-6 were found, of which four clusters were represented by one single MAG, while one cluster was represented by four MAGs, all of which derived from Guaymas Basin hydrothermal sediments [48, 121, 122]. The cluster representatives of four clusters were of marine origin and further used as input for the following mining of metagenomic short-read data for EX4484-6.
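The completeness and contamination filter applied to the downloaded assemblies can be sketched in Python as shown below. The tab-separated layout with the columns Name, Completeness, and Contamination is an assumption about the quality report, and the genome names and values are hypothetical.

```python
import csv
import io

def passes_quality(completeness, contamination,
                   min_completeness=80.0, max_contamination=5.0):
    """Apply the thresholds stated in the text: completeness > 80%, contamination < 5%."""
    return completeness > min_completeness and contamination < max_contamination

def filter_quality_report(handle):
    """Return the names of genomes in a tab-separated report that pass both thresholds."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if passes_quality(float(row["Completeness"]), float(row["Contamination"]))
    ]

# Hypothetical report content (column names are assumed)
report = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "genome_1\t95.2\t1.3\n"
    "genome_2\t79.9\t0.5\n"
    "genome_3\t88.0\t6.1\n"
    "genome_4\t80.0\t4.9\n"
)
kept_genomes = filter_quality_report(report)
```

Both comparisons are strict, so a genome with exactly 80% completeness is excluded in this sketch.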

8. Mining of metagenomic short-read data from public archives

Data mining of environmental metagenomes was conducted to increase the number of MAGs for the group EX4484-6 by reanalysis of metagenomic short-read data (Supplementary Fig. 19). For this, a list containing raw reads of 44,968 ecological metagenomes was downloaded from the ENA advanced search (date accessed: 10.11.2022) using the parameters tax_tree(410657), library strategy “WGS,” instrument platform “ILLUMINA,” library source “METAGENOMIC,” and library selection “RANDOM.” Samples without location information, unspecified instrument model, missing ftp links, and a base count below 3 Gbp were removed from the selection. Through manual filtering, nonaquatic samples were further removed, leaving a data set of 8287 metagenomes.

In a first screening step, all 8287 metagenomes were downloaded, quality trimmed using bbduk from bbmap version 38.86 [107], and mapped against the previously identified EX4484-6 clusters (MAG from enrichments, “Metagenomic analysis” section; dereplicated marine EX4484-6 clusters, “Retrieval of EX4484-6 MAGs from public archives” section) using bwa v0.7.17 [123]. The per-genome coverage was estimated from the number of mapped reads using coverm v0.6.1 [124] with a minimum read percent identity of 50%, a minimum aligned percentage of 50%, and a minimum read overlap of 30 bp between read and reference. Cumulative coverage was calculated for all datasets within a given study accession. Studies with a cumulative coverage for EX4484-6 MAGs > 2 were used for de novo co-assembly and binning, unless MAGs associated with the study had already been published. In these cases, published MAGs were screened for EX4484-6 MAGs along with the catalog of Earth’s microbiome [38], the OceanDNA MAG catalog [125], and the Ocean Microbiomics Database OMDB (v2) [39, 126]. MAGs from the OMDB (v2) were contributed by the co-authors and produced as previously published in Paoli et al. [39]. A more detailed pipeline description is included in supplementary methods. Further, high-quality EX4484-6 MAGs from unpublished work were contributed to this study by co-authors. Raw reads from studies without published MAGs were downloaded and adapter and quality trimmed using bbduk v38.86 [107]. For each study, de novo co-assemblies were computed using megahit v1.2.9 [109] with either the preset meta-large, starting at the lowest possible kmer size (27 or 37) for large sample sizes or a modified preset meta-sensitive starting at kmer size 23. Then, quality-trimmed reads were mapped onto the co-assembly using bowtie2 v2.3.5.1 [111] and binned, refined, and subsequently classified using gtdbtk v2.1.0 [119] with GTDB database v207 as previously described (“Metagenomic analysis” section). If bins of the class EX4484-6 were present, all Thermoplasmatota MAGs within the bin set were reassembled using the metaWRAP bin reassembly module followed again by taxonomic classification. All newly found EX4484-6 MAGs were additionally manually refined using anvio v7.1 [110].
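The study-level selection step can be illustrated with a short Python sketch. It assumes that cumulative coverage is the sum of the per-run EX4484-6 coverage values over all runs of a study accession, which is one possible reading of the description above. The accession names and coverage values are hypothetical.

```python
from collections import defaultdict

def cumulative_coverage(run_records):
    """Sum per-run coverage values for each study accession.

    run_records: iterable of (study_accession, coverage) pairs.
    """
    totals = defaultdict(float)
    for study, coverage in run_records:
        totals[study] += coverage
    return dict(totals)

def select_studies(run_records, threshold=2.0, already_published=frozenset()):
    """Studies above the coverage threshold that still need co-assembly and binning."""
    totals = cumulative_coverage(run_records)
    return sorted(
        study
        for study, total in totals.items()
        if total > threshold and study not in already_published
    )

# Hypothetical per-run coverage of EX4484-6 reference genomes
runs = [
    ("STUDY_A", 0.8),
    ("STUDY_A", 1.5),
    ("STUDY_B", 0.3),
    ("STUDY_C", 4.0),
]
to_assemble = select_studies(runs, already_published={"STUDY_C"})
```

Here STUDY_A passes only because its runs are considered jointly, which mirrors the rationale for co-assembling all runs of a study when the target group is rare in each individual run.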

9. Annotation of EX4484-6 MAGs

Genes of EX4484-6 MAGs were predicted using prodigal v2.6.3 [127]. Predicted genes were annotated with the nonredundant RefSeq database (NR) (accessed 23 June 2023) [128] and KEGG release 104 [83, 129] using diamond blastp v2.0.15 [130]. Additionally, we retrieved a full annotation of all MAGs using InterProScan v5.65–97.0 [131, 132]. Peptidases were identified by diamond blastp against the MEROPS database v12.4 [133]. From these, extracellular peptidases (signal peptides) were determined using signalp v6.0 [134]. Signal peptides were only counted as such if they were annotated as “SP” and contained an additional MEROPS annotation. CAZymes were annotated using dbCAN v3.0.7 [135]. To reduce false-positive CAZymes in the annotation, all predicted CAZymes were manually validated with the KEGG and NR annotations. Only those CAZymes with more than one annotation according to dbCAN that were also represented in KEGG or NR were counted as positive hits. Lastly, transporters were annotated with the Transporter Classification Database (accessed 12/2023) [136] using diamond blastp v2.0.15 [130] to improve information on transporters found through NR and KEGG. For all blastp searches, a blast score ratio [137] threshold of 0.4 was applied. If metabolic pathways were incomplete, annotations below the blast score ratio threshold were also analyzed.
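The blast score ratio filter can be sketched as follows. The ratio is taken here as the bit score of a query-to-reference hit divided by the bit score of the query aligned against itself. Treating the threshold as inclusive is an assumption, and the gene identifiers, reference identifiers, and scores are hypothetical.

```python
def blast_score_ratio(hit_bitscore, self_bitscore):
    """Bit score of the hit normalized by the query's self-hit bit score."""
    if self_bitscore <= 0:
        raise ValueError("self-hit bit score must be positive")
    return hit_bitscore / self_bitscore

def filter_hits(hits, self_scores, threshold=0.4):
    """Keep hits whose blast score ratio reaches the threshold.

    hits: iterable of (query, subject, bitscore) tuples.
    self_scores: mapping of query -> bit score of the query against itself.
    """
    kept = []
    for query, subject, bitscore in hits:
        bsr = blast_score_ratio(bitscore, self_scores[query])
        if bsr >= threshold:  # inclusive threshold: assumption
            kept.append((query, subject, round(bsr, 3)))
    return kept

# Hypothetical hits and self-hit scores
hits = [("gene_1", "ref_A", 120.0), ("gene_2", "ref_B", 30.0)]
self_scores = {"gene_1": 200.0, "gene_2": 100.0}
kept_hits = filter_hits(hits, self_scores)
```

Normalizing by the self-hit score makes the cutoff independent of protein length, which a raw bit score cutoff is not.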

From all predicted genes for each of the redundant EX4484-6 MAGs, genes were sorted into the categories annotated, hypothetical and unknown based on their annotation status. Annotated genes were defined as genes which had a functional annotation from KEGG or NR, hypothetical genes were defined from the remaining genes without functional annotation if they had at least one hypothetical classification, and genes without any annotation were defined as functionally unknown genes. To further characterize the unknown genes in the EX4484-6 MAGs, we applied AGNOSTOS v1.1 [42] with default parameters.
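The three-way sorting of genes can be illustrated with a minimal Python sketch. It assumes that a hypothetical classification is recognizable by the word "hypothetical" in the hit description and that a gene without any KEGG or NR hit is represented by None. The descriptions are toy examples.

```python
from collections import Counter

def categorize_gene(kegg_hit, nr_hit):
    """Sort a gene into annotated, hypothetical, or unknown.

    kegg_hit / nr_hit: description of the best hit, or None if there was no match.
    """
    hits = [h for h in (kegg_hit, nr_hit) if h]
    if not hits:
        return "unknown"  # no database match at all
    if all("hypothetical" in h.lower() for h in hits):
        return "hypothetical"  # matched, but never with a functional description
    return "annotated"  # at least one functional annotation from KEGG or NR

def category_percentages(genes):
    """Percentage of genes per category for one MAG (expects a non-empty list)."""
    counts = Counter(categorize_gene(kegg, nr) for kegg, nr in genes)
    total = sum(counts.values())
    return {
        category: 100.0 * counts[category] / total
        for category in ("annotated", "hypothetical", "unknown")
    }

# Hypothetical (KEGG, NR) best-hit descriptions for four genes of one MAG
toy_genes = [
    ("alcohol dehydrogenase", "hypothetical protein"),
    (None, "hypothetical protein"),
    (None, None),
    (None, None),
]
percentages = category_percentages(toy_genes)
```

In this sketch a single functional annotation takes precedence over a hypothetical hit in the other database, following the order in which the categories are defined above.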

To analyze clusters of orthologous genes within the class of EX4484-6, all predicted genes were clustered into orthogroups using OrthoFinder v2.5.5 with multiple sequence alignment for tree inference [53]. From all orthogroups, a nonmetric multidimensional scaling (NMDS) was computed using the R function metaMDS from the package vegan v2.6.4 [138] with Jaccard dissimilarities based on shared orthogroups between MAGs. For visualization of family-specific orthogroups, an upset plot was created using the package UpSetR v1.4.0 [139]. Orthogroups, which were present in at least 90% of all redundant EX4484-6 MAGs, were defined as the core genome, and annotations of genes therein were analyzed.
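Two of the quantities used above, the core genome at a 90% prevalence cutoff and the Jaccard dissimilarity between MAGs, can be sketched in Python. This is an illustration on presence/absence sets and not the OrthoFinder or vegan implementation. The MAG and orthogroup names are hypothetical.

```python
from collections import Counter

def core_orthogroups(orthogroups_by_mag, min_fraction=0.9):
    """Orthogroups present in at least min_fraction of all MAGs."""
    n_mags = len(orthogroups_by_mag)
    prevalence = Counter(
        og for ogs in orthogroups_by_mag.values() for og in set(ogs)
    )
    return {og for og, n in prevalence.items() if n / n_mags >= min_fraction}

def jaccard_dissimilarity(ogs_a, ogs_b):
    """1 - |intersection| / |union| of two orthogroup sets."""
    a, b = set(ogs_a), set(ogs_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical presence/absence data: OG1 in 10/10, OG2 in 9/10, OG3 in 8/10 MAGs
mags = {f"MAG{i}": {"OG1"} for i in range(10)}
for i in range(9):
    mags[f"MAG{i}"].add("OG2")
for i in range(8):
    mags[f"MAG{i}"].add("OG3")

core = core_orthogroups(mags)
d = jaccard_dissimilarity(mags["MAG0"], mags["MAG9"])
```

A pairwise matrix of such dissimilarities is the kind of input that an ordination method such as NMDS operates on.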

For contextualization of the proportions of unknown genes in the EX4484-6 MAGs, we similarly annotated all Thermoplasmatota genomes (“Retrieval of EX4484-6 MAGs from public archives” section) using NR and KEGG and sorted predicted genes into the three previously mentioned categories: annotated, hypothetical, and unknown. Additionally, orthogroups across all Thermoplasmatota were computed using the rooted phylogenomic tree (“Marker gene tree and phylogenomic analyses” section) as the input tree and sorted based on their annotation status as described previously. The proportion of EX4484-6 orthogroups shared within EX4484-6 MAGs and between Thermoplasmatota and EX4484-6 MAGs was calculated (“Retrieval of EX4484-6 MAGs from public archives”). Wilcoxon rank-sum tests were then applied to test for differences in that proportion between EX4484-6 and other Thermoplasmatota for each annotation status, and the effect size for each difference was computed using the package ImpactEffectsize in R [140]. The p-value was corrected for executing three pairwise comparisons using the Bonferroni correction.
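The Bonferroni correction for three pairwise comparisons amounts to simple arithmetic, shown here as a short Python sketch. The family-wise significance level of 0.05 is an assumption inferred from the reported adjusted threshold of 0.0167, and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Equivalent view: multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Three comparisons (annotated, hypothetical, unknown); alpha = 0.05 is assumed
threshold = bonferroni_threshold(0.05, 3)

# Hypothetical raw p-values for the three comparisons
raw_p = [0.004, 0.02, 0.5]
adjusted_p = bonferroni_adjust(raw_p)
significant = [p < threshold for p in raw_p]
```

Comparing raw p-values against the adjusted threshold and comparing adjusted p-values against the original significance level lead to the same decisions.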

10. Marker gene tree and phylogenomic analyses

All dereplicated Thermoplasmatota MAGs (“Retrieval of EX4484-6 MAGs from public archives” section) together with all redundant EX4484-6 MAGs (“Mining of metagenomic short-read data from public archives” section) were used as input for a phylogenomic marker gene tree. In total, the tree contained 419 genomes, including an outgroup consisting of 14 randomly selected Halobacteriota genomes of medium to good quality (completeness > 80%, contamination < 5%). A multiple sequence alignment (MSA) with 53 marker genes was generated using the de novo workflow of gtdbtk v2.1.0 [119]. With the resulting MSA, a suitable model (LG+I+R10+F) was identified using ModelFinder [141] as implemented within iqtree2 v2.2.2.7 [142]. The marker gene tree was calculated using raxml-ng v1.1.0 [143] with 20 starting trees. Bootstrap convergence with a bootstrap cutoff at 0.02 was reached after 100 trees. The tree was rooted at the outgroup and manually collapsed based on the GTDB v207 taxonomy [37] using iTOL v6.8.1 [144]. Further, relative evolutionary divergence (RED) was calculated using the function get_reds from the R package castor [145]. Additionally, average nucleotide identity (ANI) was calculated using fastANI v1.32 [116], and amino acid identity (AAI) was computed using fastAAI v0.1.18 [146]. Based on this information, the class EX4484-6 was described as Ca. Penumbrarchaeia (Supplementary results and discussion II). In addition to the gene tree based on 53 archaeal marker genes, a maximum-likelihood species tree was computed from gene trees of all orthogroups that were part of the core genome (“Annotation of EX4484-6 MAGs” section) using SpeciesRax implemented in GeneRax v2.1.3 [147, 148].
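The outgroup selection step (quality filtering followed by a random draw of 14 genomes) could be sketched in Python as shown below. The record fields, the function name, and the fixed seed are assumptions for illustration; the text does not state how the random selection was implemented.

```python
import random


def select_outgroup(genomes, n=14, min_completeness=80.0,
                    max_contamination=5.0, seed=0):
    """Randomly pick n genomes that pass the quality thresholds.

    genomes: list of dicts with 'name', 'completeness', 'contamination'
    (percentages; hypothetical field names).
    seed: fixed only so that this sketch is reproducible.
    """
    eligible = [
        g for g in genomes
        if g["completeness"] > min_completeness
        and g["contamination"] < max_contamination
    ]
    if len(eligible) < n:
        raise ValueError(
            f"only {len(eligible)} genomes pass the quality filter, need {n}"
        )
    rng = random.Random(seed)
    # Sampling without replacement, so no genome is drawn twice.
    return rng.sample(eligible, n)
```

Using a dedicated `random.Random` instance keeps the draw independent of any other random state in a larger workflow.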

For gene family reconstruction of uncharacterized genes found in the class Ca. Penumbrarchaeia, Wagner parsimony was computed with a gain penalty of 1 using COUNT [149] for the phylum Thermoplasmatota, using those orthogroups of the class Ca. Penumbrarchaeia that contained genes encoding hypothetical and unknown proteins.
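To make the role of the gain penalty concrete, the following Python sketch computes a minimum parsimony cost for one orthogroup on a small rooted tree, charging the gain penalty per gained gene copy and 1 per lost copy along each branch. This is an illustration of the general idea under our own assumptions (nested-tuple tree, unconstrained root state, small copy numbers), not the implementation used in COUNT.

```python
import math


def wagner_cost(tree, counts, gain_penalty=1.0):
    """Minimum gain/loss cost for one orthogroup on a rooted tree.

    tree: nested tuples of leaf names, e.g. (("A", "B"), "C").
    counts: dict mapping each leaf name to its observed copy number.
    """
    states = range(max(counts.values()) + 1)

    def edge_cost(parent, child):
        # Gains cost gain_penalty per added copy, losses cost 1 per copy.
        if child > parent:
            return gain_penalty * (child - parent)
        return float(parent - child)

    def subtree_costs(node):
        # Best cost of the subtree below `node` for each state at `node`.
        if isinstance(node, str):
            return [0.0 if s == counts[node] else math.inf for s in states]
        totals = [0.0] * len(states)
        for child in node:
            child_costs = subtree_costs(child)
            for s in states:
                totals[s] += min(
                    edge_cost(s, t) + child_costs[t] for t in states
                )
        return totals

    # The root state is left free and the cheapest one is taken.
    return min(subtree_costs(tree))


toy_tree = (("A", "B"), "C")
cost_shared = wagner_cost(toy_tree, {"A": 1, "B": 1, "C": 0})
cost_expanded = wagner_cost(toy_tree, {"A": 2, "B": 0, "C": 0})
```

With a gain penalty of 1, gains and losses are weighted equally, so the reconstruction does not favor either explanation for a patchy orthogroup distribution.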

11. Relative MAG abundance in the environment

All redundant Ca. Penumbrarchaeia MAGs were dereplicated using dRep v3.0.0 as described above (“Metagenomic analysis” section), resulting in 20 nonredundant MAGs. Among these clusters, three of the initial four marine cluster representatives (“Retrieval of EX4484-6 MAGs from public archives” section) were found. The fourth cluster was represented by a MAG from a different study, which showed higher completeness and lower contamination than the previously obtained cluster representative while still deriving from the same sediment (Guaymas Basin). To quantify representatives of Ca. Penumbrarchaeia in the environment, a competitive mapping index was constructed from the dereplicated Ca. Penumbrarchaeia MAGs and other representatives of the Thermoplasmatota phylum (“Retrieval of EX4484-6 MAGs from public archives” section) using bowtie2 v2.3.5 [111]. To obtain relative abundances of Ca. Penumbrarchaeia MAG representatives in the environment, the initial 8287 metagenomic runs (“Mining of metagenomic short-read data from public archives” section) were downloaded along with 286 additional Tara Oceans runs, which we had initially excluded from our metagenomic run screening as unlikely to contain sufficient EX4484-6-affiliated reads for MAG recovery (“Mining of metagenomic short-read data from public archives” section). PhiX sequences and adapters were removed, followed by quality trimming with bbduk from bbmap version 38.86 [107], using a minimum read length equal to 2/3 of the initially generated read length (or 100 bp for sequencing runs with more than 160 cycles), a trimming quality of 20, and a minimum average quality of 10. Quality-trimmed reads were aligned to the competitive mapping index using bowtie2. Mean coverage, breadth of coverage, and relative abundances of single MAGs were computed using coverm genome v0.6.1 [124] with a minimum breadth of coverage of 50% and a percentage identity of at least 95%.
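The read-length rule and the breadth-of-coverage threshold described above can be written out as a short Python sketch. It only illustrates the decision rules and does not call bbduk or coverm; the function names are hypothetical, and rounding the 2/3 length down to an integer is our assumption.

```python
def min_read_length(read_length, cycles):
    """Minimum read length kept after quality trimming.

    Runs with more than 160 cycles use a fixed minimum of 100 bp;
    all other runs use 2/3 of the generated read length
    (rounded down here, which is an assumption).
    """
    if cycles > 160:
        return 100
    return (2 * read_length) // 3


def detected_mags(breadth_by_mag, min_breadth=0.5):
    """MAGs whose breadth of coverage reaches the detection threshold.

    breadth_by_mag: dict mapping MAG name to the covered genome fraction
    (0 to 1) in one metagenomic run.
    """
    return {mag for mag, b in breadth_by_mag.items() if b >= min_breadth}


# Invented example values for one run.
short_run_min = min_read_length(read_length=150, cycles=150)
long_run_min = min_read_length(read_length=250, cycles=250)
present = detected_mags({"MAG_1": 0.82, "MAG_2": 0.31, "MAG_3": 0.50})
```

Requiring a minimum breadth of coverage guards against counting a MAG as present when reads pile up on only a few conserved regions of its genome.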

12. Evaluation of rarity in the phylum Thermoplasmatota

Rarity for each nonredundant Ca. Penumbrarchaeia MAG was defined as the median of its relative abundance in the metagenomic runs in which it was detected (“Relative MAG abundance in the environment” section), weighted by the fraction of all screened runs in which it occurred. The median rarity across all Thermoplasmatota MAGs was calculated and then defined as the cutoff between rare (rarity < median rarity) and common MAGs (rarity > median rarity) in our dataset. Additionally, a third group comprised those MAGs that were not detected in any of the screened metagenomic runs. The percentages of unknown genes (“Annotation of EX4484-6 MAGs” section) were sorted by rarity groups (common, rare, not detected), and Wilcoxon rank-sum tests were applied to test for differences between these three groups. The p-values were corrected for the three pairwise comparisons using the Bonferroni correction.
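A minimal Python sketch of the rarity measure and the grouping is given below. It assumes that "weighted by" means multiplication of the median relative abundance by the fraction of occurrence, that the cutoff is the median over detected MAGs only, and that a MAG exactly at the cutoff is grouped as common; none of these details is stated explicitly above, and all names and numbers are invented.

```python
from statistics import median


def rarity(abundances, n_screened_runs):
    """Median relative abundance times fraction of occurrence.

    abundances: relative abundances in the runs where the MAG was detected.
    Returns None for MAGs that were never detected.
    """
    if not abundances:
        return None
    fraction_of_occurrence = len(abundances) / n_screened_runs
    return median(abundances) * fraction_of_occurrence


def rarity_groups(rarity_by_mag):
    """Assign each MAG to 'rare', 'common', or 'not detected'."""
    detected = {m: r for m, r in rarity_by_mag.items() if r is not None}
    cutoff = median(detected.values())
    groups = {}
    for mag, value in rarity_by_mag.items():
        if value is None:
            groups[mag] = "not detected"
        elif value < cutoff:
            groups[mag] = "rare"
        else:
            # Ties with the cutoff are placed here (an assumption).
            groups[mag] = "common"
    return groups


# Invented relative abundances (percent) across 10 screened runs.
toy_abundances = {
    "MAG_1": [0.2, 0.4, 0.6],
    "MAG_2": [0.01],
    "MAG_3": [],
    "MAG_4": [0.05, 0.15],
    "MAG_5": [0.3, 0.3],
}
toy_rarity = {m: rarity(a, 10) for m, a in toy_abundances.items()}
toy_groups = rarity_groups(toy_rarity)
```

Because the measure combines abundance and occurrence, a MAG that is abundant in a single run can still score as rare if it is absent from nearly all other runs.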

Supplementary Information

40168_2025_2140_MOESM1_ESM.docx (9.6MB, docx)

Supplementary Material 1: Figures

S1. Microbial community composition of second-generation enrichments. Relative abundances of (a.) archaeal 16S rRNA genes, (b.) Ca. Penumbrarchaeia ASVs and (c.) Lokiarchaeia ASVs at day 98 and 157 in two replicates each of control samples and protein amended samples, both amended with 30 mM sulfate (Na2SO4) and an antibiotics mix (D-cycloserine, kanamycin, vancomycin, ampicillin and streptomycin; 50 mg/l for each). Protein samples were additionally amended with 2.33 g/l egg white protein.

S2. Microbial community composition of second-generation enrichments. 16S rRNA gene copies per ml slurry of Bacteria, Archaea, the class Ca. Penumbrarchaeia and Lokiarchaeia subgroup Loki-2b. Gene copies are shown for both replicate enrichments (Replicate 1, Replicate 2*) of control samples and protein amended samples, both amended with 30 mM sulfate (Na2SO4) and an antibiotics mix (D-cycloserine, kanamycin, vancomycin, ampicillin and streptomycin; 50 mg/l for each). Protein samples were additionally amended with 2.33 g/l egg white protein. Error bars represent standard deviations of the three technical qPCR replicates per sample.

S3. Phylogenetic reconstruction of the class Ca. Penumbrarchaeia. (a.) Maximum-likelihood tree (RAxML, convergence reached after 1300 bootstraps) of 354 full length 16S rRNA gene sequences of Thermoplasmatota, including all retrieved full length Ca. Penumbrarchaeia 16S rRNA genes from single MAGs. Shorter 16S rRNA gene sequences of ASVs and those found in Ca. Penumbrarchaeia MAGs were added to the existing tree. (b.) Maximum-likelihood tree based on orthogroups that were part of the core genome and detected in > 90% of all Ca. Penumbrarchaeia MAGs. The species tree was inferred from gene trees of all orthogroups using SpeciesRax [98].

S4. Partial phylogenomic tree of the Thermoplasmatota including RED. Maximum-likelihood tree (RAxML, 100 bootstraps) of 370 Thermoplasmatota MAGs and 35 Ca. Penumbrarchaeia MAGs obtained through data mining of genome assemblies and metagenomic short read data sets. Node labels display taxonomic affiliation on class level. The class Ca. Penumbrarchaeia clusters with Thermoplasmata_A/DTKX_01 according to the marker gene tree and is indicated by a red box (Fig. 3a). Node values indicate relative evolutionary divergence (RED). The complete tree is provided as separate PDF.

S5. Average nucleotide identity (ANI). Heatmap of ANI between all 35 Ca. Penumbrarchaeia MAGs obtained during our data mining. Numbers in colored tiles indicate the ANI percentage between the compared MAGs (threshold ANI > 75%). MAGs were ordered according to their taxonomy in the marker gene tree (Fig. 3).

S6. Amino acid identity (AAI). Heatmap of AAI between all 35 Ca. Penumbrarchaeia MAGs obtained during our data mining. Numbers in colored tiles indicate the AAI percentage between the compared MAGs. MAGs were ordered according to their taxonomy in the marker gene tree (Fig. 3).

S7. Distribution of the class Ca. Penumbrarchaeia. World map showing the distribution, relative abundance and environment of all observed Ca. Penumbrarchaeia detected in 128 samples. Relative abundances were calculated by aligning quality trimmed reads of 8573 metagenomic sequencing runs to a competitive mapping index containing 20 non-redundant Ca. Penumbrarchaeia MAGs. Point colors indicate the environment, point sizes indicate relative abundance.

S8. Abundance of non-redundant Thermoplasmatota in the environment. (a.) As fraction of occurrence for each order found within the non-redundant Thermoplasmatota data set. The fraction of occurrence was defined as fraction of data sets MAGs occurred in across all screened data sets. Number of observations N represents the number of Thermoplasmatota genomes per order. (b.) As relative abundance of each order in samples they occurred in. Number of observations N represents the number of metagenomic runs in which the genome was detected. Order 1 to Order 4 correspond to the Ca. Penumbrarchaeia orders, with Order 1 being Ca. Penumbrarchaeales.

S9. Clustering of Ca. Penumbrarchaeia MAGs based on orthogroups. (a.) NMDS based on all orthogroups found in all 35 Ca. Penumbrarchaeia. The NMDS was computed using the function metaMDS from the package vegan v2.6.4 with Jaccard dissimilarities and two dimensions. Single families are indicated by different colors. Family 1A corresponds to Ca. Penumbrarchaeaceae. (b.) Upset plot showing intersections of orthogroups for all six families. Intersections shown reflect 95% of all orthogroups within the data set. Numbers of orthogroups per intersection are indicated above the bars. Orthogroup intersections that are only present in sediment or water are indicated by color. Total number of different orthogroups per family are indicated by horizontal bars next to the upset plot.

S10. Shared orthogroups between Ca. Penumbrarchaeia MAGs. Heatmap of shared orthogroups between all 35 Ca. Penumbrarchaeia MAGs. Numbers in colored tiles in the heatmap indicate the percentage of shared orthogroups between the compared MAGs. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1A corresponds to Ca. Penumbrarchaeaceae.

S11. Peptidases in Ca. Penumbrarchaeia MAGs. (a.) Extracellular peptidase homologs within MAGs of the class Ca. Penumbrarchaeia. (b.) Peptidase homologs within MAGs of the class Ca. Penumbrarchaeia. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1A corresponds to Ca. Penumbrarchaeaceae.

S12. Metabolic reconstruction of Ca. Penumbrarchaeales (Order 1) (Ca. Penumbrarchaeaceae (Family 1A), Family 1B). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles for each family. Red stars indicate the presence of genes in all families.

S13. Metabolic reconstruction of Order 2 (Family 2). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles.

S14. Metabolic reconstruction of Order 3 (Family 3A, Family 3B). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles for each family. Red stars indicate the presence of genes in all families.

S15. Metabolic reconstruction of Order 4 (Family 4). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles.

S16. Annotation status of predicted genes in Ca. Penumbrarchaeia MAGs. Genes classified as annotated, hypothetical and unknown based on NR and KEGG annotations shown as (a.) number of genes and (b.) percentage of genes. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1A corresponds to Ca. Penumbrarchaeaceae.

S17. AGNOSTOS annotation category of genes in Ca. Penumbrarchaeia MAGs. Gene classification based on AGNOSTOS. Gene classifications are shown as (a.) number of genes and (b.) percentage of genes. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1A corresponds to Ca. Penumbrarchaeaceae.

S18. Shared orthogroups among Thermoplasmatota. Boxplot of Thermoplasmatota genomes sharing orthogroups (OGs) present within Ca. Penumbrarchaeia MAGs separated by annotation category: annotated, hypothetical and unknown, according to their NR and KEGG annotation. Differences between groups were tested by Wilcoxon Signed Rank test, p-values are indicated by asterisks (**** p ≤ 0.0001). The measure of impact describes the effect size. Number of observations N indicates the number of orthogroups in each category. Number of genomes per group is indicated by number n.

S19. Ancestral gene family reconstruction of orthogroups containing genes with hypothetical or unknown function. Evolutionary events are indicated for each order within the phylum of Thermoplasmatota representing orthogroup gains, expansions and losses. Evolutionary events were inferred using the Wagner parsimony in COUNT (gain penalty = 1) [99].

S20. Flowchart of methods applied during the data mining conducted in this study. Retrieval of Ca. Penumbrarchaeia (EX4484-6) MAGs from public archives (methods Sect. 7) is displayed in green, the subsequent mining of metagenomic short read data from public archives is displayed in blue (methods Sect. 8). Annotation and classification (methods Sect. 9) was conducted on both the MAGs retrieved from public archives and the MAGs reconstructed from metagenomic short read data (yellow). Programs and settings used are indicated next to or below arrows.

40168_2025_2140_MOESM2_ESM.xlsx (4.5MB, xlsx)

Supplementary Material 2: Tables

S1. ASV counts and relative abundances of archaeal 16S rRNA gene amplicon sequencing for control and protein amended samples.

S2. Top hits of a megablast search of the Ca. Penumbrarchaeia ASV sq2 and 16S rRNA gene obtained from the Ca. Penumbrarchaeia MAG E3_1_d157 against the rRNA/ITS database of the NCBI RefSeq Targeted Loci Project.

S3. Overview list of metagenomic studies in which Ca. Penumbrarchaeia MAGs had a coverage > 2. These studies were used to reconstruct Ca. Penumbrarchaeia MAGs.

S4. Overview list of MAGs obtained in this study.

S5. Quality statistics computed with checkM and checkM2 and taxonomy information computed using gtdb_tk with the GTDB database v207.

S6. Table containing relative abundances of single MAGs in mapped metagenomic runs.

S7. List of all unique metagenomic runs in which Ca. Penumbrarchaeia MAGs were found, with metadata on the sampling location and environment details derived from ENA.

S8. List of orthogroups that were selected as core genome (orthogroup present in at least 90% of MAGs). The table shows gene counts for each MAG within single orthogroups.

S9. Gene counts for single pathways within each of the Ca. Penumbrarchaeia MAGs. Genes were extracted from the core genome and annotated using NR and KEGG. Based on the annotations found, genes were grouped into KEGG pathway descriptions.

S10. Full annotation of genes that were found in all orthogroups of the core genome. Annotation was performed using KEGG, NR, OM_RGC, dbCAN and MEROPS.

S11. Full annotation of selected pathways for all 35 Ca. Penumbrarchaeia MAGs. Annotation was performed using KEGG (release 104) and NR (release 13/05/2023), dbCAN (v3.0.7) and MEROPS (v12.4).

S12. Annotation of peptidases found in all 35 Ca. Penumbrarchaeia MAGs. The annotation was performed using MEROPS (v12.4).

S13. Annotation of CAZymes found in all 35 Ca. Penumbrarchaeia MAGs. The annotation was performed using dbCAN (v3.0.7). Results were validated with NR and KEGG annotations.

S14. Gene counts for the categories annotated, hypothetical and unknown, along with percentages for each category.

S15. Counts of rarity groups per habitat.

S16. Overview table of all Thermoplasmatota MAGs with their taxonomic classification, fraction of occurrence, percentage of unknown genes (perentage_unknown), rarity and habitat.

S17. Studies selected from marine metagenomics literature and matched with BioProject identifiers from the European Nucleotide Archive for MAG reconstruction in OMDv2.

40168_2025_2140_MOESM3_ESM.pdf (20.1KB, pdf)

Supplementary Material 3. Proposal of type material and higher ranks. Methods: I. Clone library construction. II. Standard preparation for qPCR. III. qPCR primer design. IV. 16S rRNA gene phylogenetic tree. V. Data collection, processing and MAG reconstruction in OMDB (v2). Results and discussion: I. Additional qPCR results. II. Delineation of the class Ca. Penumbrarchaeia. III. Annotation of the class Ca. Penumbrarchaeia. Carbon metabolism. Carbon assimilation. Hydrogenases and energy conservation. Transporters. Stress response.

Acknowledgements

We acknowledge funding from the start-up research fund (project-ID XJ2300006031) of Hainan University and the International Science and Technology Cooperation project of Hainan province to XY. We acknowledge funding from ETH Zurich and the Swiss National Science Foundation [205320_215395] to SS, and SMV acknowledges funding from the Human Frontier Science Program through a postdoctoral fellowship [LT0050/2023-L]. For the data contributed by JP and JF, we acknowledge the projects MGF-Ostsee and EVAR funded by the German Federal Ministry of Education and Research (BMBF), grant numbers 03F0848A and 03F0814, respectively.

Authors’ contributions

XY, MDM and MWF conceived the initial idea for enrichment experiments; MDM and XY carried out enrichment, cloning and characterization experiments; MDM, LCW and TRH performed analyses of amplicon sequencing data; CH conceived the initial idea for data-mining; MDM, CH, SMV, HJR, SS and CV performed bioinformatic analyses for EX4484-6 MAG generation; JF and JP supplied the study with additional genomes; MDM created all illustrations; MWF, CH and XY provided constructive feedback and guided the execution of the project; MDM wrote the original draft; all authors revised the manuscript.

Funding

Open Access funding enabled and organized by Projekt DEAL. This study was supported by the DFG under Germany’s excellence strategy no. EXC-2077–390741603.

Data availability

All code used to perform analyses in this study is available on Zenodo (https://doi.org/10.5281/zenodo.15464567). Amplicon and metagenomic raw reads for this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB80318 (https://www.ebi.ac.uk/ena/data/view/PRJEB80318), using the data brokerage service of the German Federation for Biological Data (GFBio [150]), in compliance with the Minimal Information about any (X) Sequence (MIxS) standard [151]. Ca. Penumbrarchaeia MAGs obtained and generated in this study and all data tables used for figure generation are available on Zenodo (https://zenodo.org/records/10813815). The type species genome has been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number GCA_965234725.1. Clone sequences were deposited at GenBank under the accession numbers PQ255994-PQ256084.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Xiuran Yin, Email: 996383@hainanu.edu.cn.

Michael W. Friedrich, Email: michael.friedrich@uni-bremen.de.

Christiane Hassenrück, Email: christiane.hassenrueck@io-warnemuende.de.

References

  • 1.Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc Natl Acad Sci U S A. 2006;103(32):12115–20. 10.1073/pnas.0605127103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Pascoal F, Costa R, Magalhães C. The microbial rare biosphere: current concepts, methods and ecological principles. FEMS Microbiol Ecol. 2020;97(1): fiaa227. 10.1093/femsec/fiaa227. [DOI] [PubMed] [Google Scholar]
  • 3.Galand PE, Casamayor EO, Kirchman DL, Lovejoy C. Ecology of the rare microbial biosphere of the Arctic Ocean. Proc Natl Acad Sci U S A. 2009;106(52):22427–32. 10.1073/pnas.0908284106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Rabinowitz D, Rapp JK, Dixon PM. Competitive abilities of sparse grass species: means of persistence or cause of abundance. Ecology. 1984;65(4):1144–54. 10.2307/1938322. [Google Scholar]
  • 5.Rabinowitz D. Seven forms of rarity. In: Synge H, editor. The biological aspects of rare plant conservation. New York: John Wiley and Sons; 1981. p. 205–17.
  • 6.Jousset A, Bienhold C, Chatzinotas A, Gallien L, Gobet A, Kurm V, et al. Where less may be more: how the rare biosphere pulls ecosystems strings. ISME J. 2017;11(4):853–62. 10.1038/ismej.2016.174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lynch MDJ, Neufeld JD. Ecology and exploration of the rare biosphere. Nat Rev Microbiol. 2015;13(4):217–29. 10.1038/nrmicro3400. [DOI] [PubMed] [Google Scholar]
  • 8.Musat N, Halm H, Winterholler B, Hoppe P, Peduzzi S, Hillion F, et al. A single-cell view on the ecophysiology of anaerobic phototrophic bacteria. Proc Natl Acad Sci U S A. 2008;105(46):17861–6. 10.1073/pnas.0809329105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Pester M, Bittner N, Deevong P, Wagner M, Loy A. A ‘rare biosphere’ microorganism contributes to sulfate reduction in a peatland. ISME J. 2010;4(12):1591–602. 10.1038/ismej.2010.75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Hausmann B, Pelikan C, Rattei T, Loy A, Pester M. Long-term transcriptional activity at zero growth of a cosmopolitan rare biosphere member. mBio. 2019;10(1):e02189-18. 10.1128/mBio.02189-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Dell’Anno A, Beolchini F, Rocchetti L, Luna GM, Danovaro R. High bacterial biodiversity increases degradation performance of hydrocarbons during bioremediation of contaminated harbor marine sediments. Environ Pollut. 2012;167:85–92. 10.1016/j.envpol.2012.03.043. [DOI] [PubMed] [Google Scholar]
  • 12.Griffiths BS, Kuan HL, Ritz K, Glover LA, McCaig AE, Fenwick C. The relationship between microbial community structure and functional stability, tested experimentally in an upland pasture soil. Microb Ecol. 2004;47(1):104–13. 10.1007/s00248-002-2043-7. [DOI] [PubMed] [Google Scholar]
  • 13.van Elsas JD, Chiurazzi M, Mallon CA, Elhottovā D, Krištůfek V, Salles JF. Microbial diversity determines the invasion of soil by a bacterial pathogen. Proc Natl Acad Sci U S A. 2012;109(4):1159–64. 10.1073/pnas.1109326109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Campbell BJ, Yu L, Heidelberg JF, Kirchman DL. Activity of abundant and rare bacteria in a coastal ocean. Proc Natl Acad Sci U S A. 2011;108(31):12776–81. 10.1073/pnas.1101405108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Gilbert JA, Field D, Swift P, Newbold L, Oliver A, Smyth T, et al. The seasonal structure of microbial communities in the Western English Channel. Environ Microbiol. 2009;11(12):3132–9. 10.1111/j.1462-2920.2009.02017.x. [DOI] [PubMed] [Google Scholar]
  • 16.Hugoni M, Taib N, Debroas D, Domaizon I, Jouan Dufournel I, Bronner G, et al. Structure of the rare archaeal biosphere and seasonal dynamics of active ecotypes in surface coastal waters. Proc Natl Acad Sci U S A. 2013;110(15):6004–9. 10.1073/pnas.1216863110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Vergin KL, Done B, Carlson CA, Giovannoni SJ. Spatiotemporal distributions of rare bacterioplankton populations indicate adaptive strategies in the oligotrophic ocean. Aquat Microb Ecol. 2013;71(1):1–13. [Google Scholar]
  • 18.Caporaso JG, Paszkiewicz K, Field D, Knight R, Gilbert JA. The Western English Channel contains a persistent microbial seed bank. ISME J. 2012;6(6):1089–93. 10.1038/ismej.2011.162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Hamasaki K, Taniguchi A, Tada Y, Kaneko R, Miki T. Active populations of rare microbes in oceanic environments as revealed by bromodeoxyuridine incorporation and 454 tag sequencing. Gene. 2016;576(2):650–6. 10.1016/j.gene.2015.10.016. [DOI] [PubMed] [Google Scholar]
  • 20.Mo Y, Zhang W, Yang J, Lin Y, Yu Z, Lin S. Biogeographic patterns of abundant and rare bacterioplankton in three subtropical bays resulting from selective and neutral processes. ISME J. 2018;12(9):2198–210. 10.1038/s41396-018-0153-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Royo-Llonch M, Ferrera I, Cornejo-Castillo FM, Sánchez P, Salazar G, Stepanauskas R, et al. Exploring microdiversity in novel Kordia sp. (Bacteroidetes) with proteorhodopsin from the tropical Indian Ocean via single amplified genomes. Front Microbiol. 2017;8:8. 10.3389/fmicb.2017.01317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Vavourakis CD, Andrei A-S, Mehrshad M, Ghai R, Sorokin DY, Muyzer G. A metagenomics roadmap to the uncultured genome diversity in hypersaline soda lake sediments. Microbiome. 2018;6(1):168. 10.1186/s40168-018-0548-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Zinger L, Amaral-Zettler LA, Fuhrman JA, Horner-Devine MC, Huse SM, Welch DBM, et al. Global patterns of bacterial beta-diversity in seafloor and seawater ecosystems. PLoS ONE. 2011;6(9): e24570. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Crespo BG, Wallhead PJ, Logares R, Pedrós-Alió C. Probing the rare biosphere of the North-West Mediterranean Sea: an experiment with high sequencing effort. PLoS ONE. 2016;11(7): e0159195. 10.1371/journal.pone.0159195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Donachie SP, Foster JS, Brown MV. Culture clash: challenging the dogma of microbial diversity. ISME J. 2007;1(2):97–9. 10.1038/ismej.2007.22. [DOI] [PubMed] [Google Scholar]
  • 26.Hu B, Xu B, Yun J, Wang J, Xie B, Li C, et al. High-throughput single-cell cultivation reveals the underexplored rare biosphere in deep-sea sediments along the Southwest Indian Ridge. Lab Chip. 2020;20(2):363–72. 10.1039/c9lc00761j. [DOI] [PubMed] [Google Scholar]
  • 27.Rego A, Raio F, Martins TP, Ribeiro H, Sousa AGG, Séneca J, et al. Actinobacteria and Cyanobacteria diversity in terrestrial Antarctic microenvironments evaluated by culture-dependent and independent methods. Front Microbiol. 2019;10: 1018. 10.3389/fmicb.2019.01018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Lewis WH, Tahon G, Geesink P, Sousa DZ, Ettema TJG. Innovations to culturing the uncultured microbial majority. Nat Rev Microbiol. 2021;19(4):225–40. 10.1038/s41579-020-00458-8. [DOI] [PubMed] [Google Scholar]
  • 29.Saw JHW. Characterizing the uncultivated microbial minority: towards understanding the roles of the rare biosphere in microbial communities. mSystems. 2021;6(4). 10.1128/msystems.00773-21. [DOI] [PMC free article] [PubMed]
  • 30.Lagier JC, Armougom F, Million M, Hugon P, Pagnier I, Robert C, et al. Microbial culturomics: paradigm shift in the human gut microbiome study. Clin Microbiol Infect. 2012;18(12):1185–93. 10.1111/1469-0691.12023. [DOI] [PubMed] [Google Scholar]
  • 31.Pascoal F, Magalhães C, Costa R. The link between the ecology of the prokaryotic rare biosphere and its biotechnological potential. Front Microbiol. 2020;11: 231. 10.3389/fmicb.2020.00231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Loman NJ, Constantinidou C, Chan JZM, Halachev M, Sergeant M, Penn CW, et al. High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol. 2012;10(9):599–606. 10.1038/nrmicro2850. [DOI] [PubMed] [Google Scholar]
  • 33.Pedrós-Alió C. The rare bacterial biosphere. Ann Rev Mar Sci. 2012;4(4):449–66. 10.1146/annurev-marine-120710-100948. [DOI] [PubMed] [Google Scholar]
  • 34.Spang A, Caceres EF, Ettema TJG. Genomic exploration of the diversity, ecology, and evolution of the archaeal domain of life. Science. 2017;357(6351):eaaf3883. 10.1126/science.aaf3883. [DOI] [PubMed] [Google Scholar]
  • 35.ENA - European Nucleotide Archive: Statistics. 2024. https://www.ebi.ac.uk/ena/browser/about/statistics. Accessed 13 May 2024.
  • 36.Sequence Read Archive (SRA) summary. 2024. https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt. Accessed 13 May 2024.
  • 37.Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36(10):996–1004. 10.1038/nbt.4229. [DOI] [PubMed] [Google Scholar]
  • 38.Nayfach S, Roux S, Seshadri R, Udwary D, Varghese N, Schulz F, et al. A genomic catalog of Earth’s microbiomes. Nat Biotechnol. 2021;39(4):499–509. 10.1038/s41587-020-0718-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Paoli L, Ruscheweyh H-J, Forneris CC, Hubrich F, Kautsar S, Bhushan A, et al. Biosynthetic potential of the global ocean microbiome. Nature. 2022;607(7917):111–8. 10.1038/s41586-022-04862-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Rodríguez del Río Á, Giner-Lamia J, Cantalapiedra CP, Botas J, Deng Z, Hernández-Plaza A, et al. Functional and evolutionary significance of unknown genes from uncultivated taxa. Nature. 2024;626(7998):377–84. 10.1038/s41586-023-06955-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Pavlopoulos GA, Baltoumas FA, Liu S, Selvitopi O, Camargo AP, Nayfach S, et al. Unraveling the functional dark matter through global metagenomics. Nature. 2023;622(7983):594–602. 10.1038/s41586-023-06583-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Vanni C, Schechter MS, Acinas SG, Barberán A, Buttigieg PL, Casamayor EO, et al. Unifying the known and unknown microbial coding sequence space. eLife. 2022;11:e67667. 10.7554/eLife.67667. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Medina-Chávez NO, Travisano M. Archaeal communities: the microbial phylogenomic frontier. Front Genet. 2022;12:12. 10.3389/fgene.2021.693193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Lloyd KG, Schreiber L, Petersen DG, Kjeldsen KU, Lever MA, Steen AD, et al. Predominant archaea in marine sediments degrade detrital proteins. Nature. 2013;496(7444):215–8. 10.1038/nature12033.
  • 45. Lazar CS, Baker BJ, Seitz KW, Teske AP. Genomic reconstruction of multiple lineages of uncultured benthic archaea suggests distinct biogeochemical roles and ecological niches. ISME J. 2017;11(4):1058. 10.1038/ismej.2017.8.
  • 46. NCBI blastn suite. 2024. https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&BLAST_SPEC=GeoBlast&PAGE_TYPE=BlastSearch . Accessed 19 Mar 2024.
  • 47. Baker BJ, Banfield JF. Microbial communities in acid mine drainage. FEMS Microbiol Ecol. 2003;44(2):139–52. 10.1016/s0168-6496(03)00028-x.
  • 48. Dombrowski N, Teske AP, Baker BJ. Expansive microbial metabolic versatility and biodiversity in dynamic Guaymas Basin hydrothermal sediments. Nat Commun. 2018;9(1):4999. 10.1038/s41467-018-07418-0.
  • 49. Reysenbach AL, Liu Y, Banta AB, Beveridge TJ, Kirshtein JD, Schouten S, et al. A ubiquitous thermoacidophilic archaeon from deep-sea hydrothermal vents. Nature. 2006;442(7101):444–7. 10.1038/nature04921.
  • 50. Sheridan PO, Meng Y, Williams TA, Gubry-Rangin C. Recovery of Lutacidiplasmatales archaeal order genomes suggests convergent evolution in Thermoplasmatota. Nat Commun. 2022;13(1):4110. 10.1038/s41467-022-31847-7.
  • 51. Yuan Y, Liu J, Yang TT, Gao SM, Liao B, Huang LN. Genomic insights into the ecological role and evolution of a novel Thermoplasmata order, “Candidatus Sysuiplasmatales”. Appl Environ Microbiol. 2021;87(22):e0106521. 10.1128/aem.01065-21.
  • 52. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35(8):725–31. 10.1038/nbt.3893.
  • 53. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20(1):238. 10.1186/s13059-019-1832-y.
  • 54. Fitch WM. Distinguishing homologous from analogous proteins. Syst Biol. 1970;19(2):99–113. 10.2307/2412448.
  • 55. Riebe O, Fischer RJ, Bahl H. Desulfoferrodoxin of Clostridium acetobutylicum functions as a superoxide reductase. FEBS Lett. 2007;581(29):5605–10. 10.1016/j.febslet.2007.11.008.
  • 56. Perkins A, Nelson KJ, Parsonage D, Poole LB, Karplus PA. Peroxiredoxins: guardians against oxidative stress and modulators of peroxide signaling. Trends Biochem Sci. 2015;40(8):435–45. 10.1016/j.tibs.2015.05.001.
  • 57. Krah A, Huber RG, Zachariae U, Bond PJ. On the ion coupling mechanism of the MATE transporter ClbM. Biochim Biophys Acta Biomembr. 2020;1862(2):183137. 10.1016/j.bbamem.2019.183137.
  • 58. Gobet A, Böer SI, Huse SM, van Beusekom JEE, Quince C, Sogin ML, et al. Diversity and dynamics of rare and of resident bacterial populations in coastal sands. ISME J. 2011;6(3):542–53. 10.1038/ismej.2011.132.
  • 59. Wang Y, Hatt JK, Tsementzi D, Rodriguez-R LM, Ruiz-Pérez CA, Weigand MR, et al. Quantifying the importance of the rare biosphere for microbial community response to organic pollutants in a freshwater ecosystem. Appl Environ Microbiol. 2017;83(8):e03321-16. 10.1128/AEM.03321-16.
  • 60. Hug LA, Baker BJ, Anantharaman K, Brown CT, Probst AJ, Castelle CJ, et al. A new view of the tree of life. Nat Microbiol. 2016;1(5):16048. 10.1038/nmicrobiol.2016.48.
  • 61. Yin X, Zhou G, Cai M, Zhu Q-Z, Richter-Heitmann T, Aromokeye DA, et al. Catabolic protein degradation in marine sediments confined to distinct archaea. ISME J. 2022. 10.1038/s41396-022-01210-1.
  • 62. Orsi WD, Smith JM, Liu S, Liu Z, Sakamoto CM, Wilken S, et al. Diverse, uncultivated bacteria and archaea underlying the cycling of dissolved protein in the ocean. ISME J. 2016;10(9):2158–73. 10.1038/ismej.2016.20.
  • 63. Pelikan C, Wasmund K, Glombitza C, Hausmann B, Herbold CW, Flieder M, et al. Anaerobic bacterial degradation of protein and lipid macromolecules in subarctic marine sediment. ISME J. 2021;15(3):833–47. 10.1038/s41396-020-00817-6.
  • 64. Zhu QZ, Yin X, Taubner H, Wendt J, Friedrich MW, Elvert M, et al. Secondary production and priming reshape the organic matter composition in marine sediments. Sci Adv. 2024;10(20):eadm8096. 10.1126/sciadv.adm8096.
  • 65. Jørgensen NOG, Lindroth P, Mopper K. Extraction and distribution of free amino acids and ammonium in sediment interstitial waters from the Limfjord, Denmark. Oceanol Acta. 1981;4:465–74.
  • 66. Monetti MA, Scranton MI. Fatty acid oxidation in anoxic marine sediments: the importance of hydrogen sensitive reactions. Biogeochemistry. 1992;17(1):23–47. 10.1007/BF00002758.
  • 67. Buckel W, Thauer RK. Flavin-based electron bifurcation, a new mechanism of biological energy coupling. Chem Rev. 2018;118(7):3862–86. 10.1021/acs.chemrev.7b00707.
  • 68. Atwood TB, Witt A, Mayorga J, Hammill E, Sala E. Global patterns in marine sediment carbon stocks. Front Mar Sci. 2020;7:165. 10.3389/fmars.2020.00165.
  • 69. Burdige DJ. Preservation of organic matter in marine sediments: controls, mechanisms, and an imbalance in sediment organic carbon budgets? Chem Rev. 2007;107(2):467–85. 10.1021/cr050347q.
  • 70. Glud RN. Oxygen dynamics of marine sediments. Mar Biol Res. 2008;4(4):243–89. 10.1080/17451000801888726.
  • 71. Scranton MI, Astor Y, Bohrer R, Ho TY, Muller-Karger F. Controls on temporal variability of the geochemistry of the deep Cariaco Basin. Deep-Sea Res Part I Oceanogr Res Pap. 2001;48(7):1605–25. 10.1016/S0967-0637(00)00087-X.
  • 72. Rabalais NN, Turner RE, Wiseman WJ Jr. Gulf of Mexico hypoxia, a.k.a. “The Dead Zone”. Annu Rev Ecol Syst. 2002;33(1):235–63. 10.1146/annurev.ecolsys.33.010802.150513.
  • 73. Helly JJ, Levin LA. Global distribution of naturally occurring marine hypoxia on continental margins. Deep-Sea Res Part I Oceanogr Res Pap. 2004;51(9):1159–68. 10.1016/j.dsr.2004.03.009.
  • 74. Wyrtki K. The oxygen minima in relation to ocean circulation. Deep-Sea Res Oceanogr Abstr. 1962;9(1):11–23. 10.1016/0011-7471(62)90243-7.
  • 75. Bouki C, Venieri D, Diamadopoulos E. Detection and fate of antibiotic resistant bacteria in wastewater treatment plants: a review. Ecotoxicol Environ Saf. 2013;91:1–9. 10.1016/j.ecoenv.2013.01.016.
  • 76. Schijven JF, Blaak H, Schets FM, de Roda Husman AM. Fate of extended-spectrum β-lactamase-producing Escherichia coli from faecal sources in surface water and probability of human exposure through swimming. Environ Sci Technol. 2015;49(19):11825–33. 10.1021/acs.est.5b01888.
  • 77. Munk P, Brinch C, Møller FD, Petersen TN, Hendriksen RS, Seyfarth AM, et al. Genomic analysis of sewage from 101 countries reveals global landscape of antimicrobial resistance. Nat Commun. 2022;13(1):7251. 10.1038/s41467-022-34312-7.
  • 78. Szubska M, Bełdowski J. Spatial distribution of arsenic in surface sediments of the southern Baltic Sea. Oceanol. 2023;65(2):423–33. 10.1016/j.oceano.2022.12.002.
  • 79. Zheng P-F, Wei Z, Zhou Y, Li Q, Qi Z, Diao X, et al. Genomic evidence for the recycling of complex organic carbon by novel Thermoplasmatota clades in deep-sea sediments. mSystems. 2022;7(3):e00077-22. 10.1128/msystems.00077-22.
  • 80. Hu W, Pan J, Wang B, Guo J, Li M, Xu M. Metagenomic insights into the metabolism and evolution of a new Thermoplasmata order (Candidatus Gimiplasmatales). Environ Microbiol. 2021;23(7):3695–709. 10.1111/1462-2920.15349.
  • 81. Zinke LA, Evans PN, Santos-Medellín C, Schroeder AL, Parks DH, Varner RK, et al. Evidence for non-methanogenic metabolisms in globally distributed archaeal clades basal to the Methanomassiliicoccales. Environ Microbiol. 2021;23(1):340–57. 10.1111/1462-2920.15316.
  • 82. Salazar G, Paoli L, Alberti A, Huerta-Cepas J, Ruscheweyh HJ, Cuenca M, et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell. 2019;179(5):1068-83.e21. 10.1016/j.cell.2019.10.014.
  • 83. Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023;51(D1):D587–92. 10.1093/nar/gkac963.
  • 84. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–45. 10.1093/nar/gkv1189.
  • 85. Popa O, Landan G, Dagan T. Phylogenomic networks reveal limited phylogenetic range of lateral gene transfer by transduction. ISME J. 2017;11(2):543–54. 10.1038/ismej.2016.116.
  • 86. Abe K, Nomura N, Suzuki S. Biofilms: hot spots of horizontal gene transfer (HGT) in aquatic environments, with a focus on a new HGT mechanism. FEMS Microbiol Ecol. 2020;96(5):fiaa031. 10.1093/femsec/fiaa031.
  • 87. Angles ML, Marshall KC, Goodman AE. Plasmid transfer between marine bacteria in the aqueous phase and biofilms in reactor microcosms. Appl Environ Microbiol. 1993;59(3):843–50. 10.1128/aem.59.3.843-850.1993.
  • 88. Dmitrijeva M, Tackmann J, Matias Rodrigues JF, Huerta-Cepas J, Coelho LP, von Mering C. A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer. Nat Ecol Evol. 2024;8(5):986–98. 10.1038/s41559-024-02357-0.
  • 89. Bijma J. Station list and links to master tracks in different resolutions of HEINCKE cruise HE483, Bremerhaven - Bremerhaven, 2017-04-19 - 2017-04-26. PANGAEA 2017. 10.1594/PANGAEA.876341.
  • 90. Hebbeln D, Scheurle C, Lamy F. Depositional history of the Helgoland mud area, German Bight, North Sea. Geo-Mar Lett. 2003;23(2):81–90. 10.1007/s00367-003-0127-0.
  • 91. Oni OE, Schmidt F, Miyatake T, Kasten S, Witt M, Hinrichs K-U, et al. Microbial communities and organic matter composition in surface and subsurface sediments of the Helgoland mud area, North Sea. Front Microbiol. 2015;6:1290. 10.3389/fmicb.2015.01290.
  • 92. Widdel F. Anaerober Abbau von Fettsäuren und Benzoesäure durch neu isolierte Arten Sulfat-reduzierender Bakterien. Georg-August-Universität zu Göttingen. Göttingen; 1980.
  • 93. Widdel F, Kohring G-W, Mayer F. Studies on dissimilatory sulfate-reducing bacteria that decompose fatty acids. Arch Microbiol. 1983;134(4):286–94. 10.1007/BF00407804.
  • 94. Widdel F, Pfennig N. Studies on dissimilatory sulfate-reducing bacteria that decompose fatty acids. Arch Microbiol. 1981;129(5):395–400. 10.1007/BF00406470.
  • 95. Imachi H, Nobu MK, Nakahara N, Morono Y, Ogawara M, Takaki Y, et al. Isolation of an archaeon at the prokaryote-eukaryote interface. Nature. 2020;577(7791):519–25. 10.1038/s41586-019-1916-6.
  • 96. Yu T, Hu H, Zeng X, Wang Y, Pan D, Deng L, et al. Widespread Bathyarchaeia encode a novel methyltransferase utilizing lignin-derived aromatics. mLife. 2023;2(3):272–82. 10.1002/mlf2.12082.
  • 97. Aromokeye DA, Richter-Heitmann T, Oni OE, Kulkarni A, Yin X, Kasten S, et al. Temperature controls crystalline iron oxide utilization by microbial communities in methanic ferruginous marine sediment incubations. Front Microbiol. 2018;9:2574. 10.3389/fmicb.2018.02574.
  • 98. Ovreås L, Forney L, Daae FL, Torsvik V. Distribution of bacterioplankton in meromictic Lake Saelenvannet, as determined by denaturing gradient gel electrophoresis of PCR-amplified gene fragments coding for 16S rRNA. Appl Environ Microbiol. 1997;63(9):3367–73. 10.1128/aem.63.9.3367-3373.1997.
  • 99. Takai K, Horikoshi K. Rapid detection and quantification of members of the archaeal community by quantitative PCR using fluorogenic probes. Appl Environ Microbiol. 2000;66(11):5066–72. 10.1128/aem.66.11.5066-5072.2000.
  • 100. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. 2011;17(1):3. 10.14806/ej.17.1.200.
  • 101. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581–3. 10.1038/nmeth.3869.
  • 102. R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2020. https://www.R-project.org/.
  • 103. Salazar G. 2020. https://github.com/benjjneb/dada2/issues/938#issuecomment-589657164. Accessed 15 Jan 2024.
  • 104. Yu Y, Lee C, Kim J, Hwang S. Group-specific primer and probe sets to detect methanogenic communities using quantitative real-time polymerase chain reaction. Biotechnol Bioeng. 2005;89(6):670–9. 10.1002/bit.20347.
  • 105. Satokari RM, Vaughan EE, Akkermans AD, Saarela M, de Vos WM. Bifidobacterial diversity in human feces detected by genus-specific PCR and denaturing gradient gel electrophoresis. Appl Environ Microbiol. 2001;67(2):504–13. 10.1128/aem.67.2.504-513.2001.
  • 106. Daims H, Brühl A, Amann R, Schleifer KH, Wagner M. The domain-specific probe EUB338 is insufficient for the detection of all bacteria: development and evaluation of a more comprehensive probe set. Syst Appl Microbiol. 1999;22(3):434–44. 10.1016/s0723-2020(99)80053-8.
  • 107. Bushnell B. BBMap: a fast, accurate, splice-aware aligner. 2014. https://sourceforge.net/projects/bbmap/.
  • 108. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes de novo assembler. Curr Protoc Bioinformatics. 2020;70(1):e102. 10.1002/cpbi.102.
  • 109. Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6. 10.1093/bioinformatics/btv033.
  • 110. Eren AM, Kiefl E, Shaiber A, Veseli I, Miller SE, Schechter MS, et al. Community-led, integrated, reproducible multi-omics with anvi’o. Nat Microbiol. 2021;6(1):3–6. 10.1038/s41564-020-00834-3.
  • 111. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. 10.1038/nmeth.1923.
  • 112. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7:e7359. 10.7717/peerj.7359.
  • 113. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11(11):1144–6. 10.1038/nmeth.3103.
  • 114. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6(1):158. 10.1186/s40168-018-0541-1.
  • 115. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. 10.1186/s13059-016-0997-x.
  • 116. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9(1):5114. 10.1038/s41467-018-07641-9.
  • 117. Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11(12):2864–8. 10.1038/ismej.2017.126.
  • 118. Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. bioRxiv. 2022. 10.1101/2022.07.11.499243.
  • 119. Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. bioRxiv. 2022. 10.1101/2022.07.11.499641.
  • 120. Sayers EW, Cavanaugh M, Clark K, Ostell J, Pruitt KD, Karsch-Mizrachi I. GenBank. Nucleic Acids Res. 2019;48(D1):D84–6. 10.1093/nar/gkz956.
  • 121. Speth DR, Yu FB, Connon SA, Lim S, Magyar JS, Peña-Salinas ME, et al. Microbial communities of Auka hydrothermal sediments shed light on vent biogeography and the evolutionary history of thermophily. ISME J. 2022;16(7):1750–64. 10.1038/s41396-022-01222-x.
  • 122. Zhou Z, Liu Y, Xu W, Pan J, Luo ZH, Li M. Genome- and community-level interaction insights into carbon utilization and element cycling functions of Hydrothermarchaeota in hydrothermal sediment. mSystems. 2020;5(1). 10.1128/msystems.00795-19.
  • 123. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. 10.1093/bioinformatics/btp324.
  • 124. Woodcroft BJ. CoverM. 2007. https://github.com/wwood/CoverM. Accessed 30 Nov 2023.
  • 125. Nishimura Y, Yoshizawa S. The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Sci Data. 2022;9(1):305. 10.1038/s41597-022-01392-5.
  • 126. Ocean Microbiomics Database. 2024. https://microbiomics.io/ocean2 . Accessed 05 Sept 2024.
  • 127. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 2010;11:119. 10.1186/1471-2105-11-119.
  • 128. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50(D1):D20–6. 10.1093/nar/gkab1112.
  • 129. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. 10.1093/nar/28.1.27.
  • 130. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8. 10.1038/s41592-021-01101-x.
  • 131. Blum M, Chang H-Y, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2020;49(D1):D344–54. 10.1093/nar/gkaa977.
  • 132. Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40. 10.1093/bioinformatics/btu031.
  • 133. Rawlings ND, Waller M, Barrett AJ, Bateman A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 2014;42(Database issue):D503-9. 10.1093/nar/gkt953.
  • 134. Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol. 2022;40(7):1023–5. 10.1038/s41587-021-01156-3.
  • 135. Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res. 2023;51(W1):W115–21. 10.1093/nar/gkad328.
  • 136. Saier MH, Reddy VS, Moreno-Hagelsieb G, Hendargo KJ, Zhang Y, Iddamsetty V, et al. The Transporter Classification Database (TCDB): 2021 update. Nucleic Acids Res. 2021;49(D1):D461–7. 10.1093/nar/gkaa1004.
  • 137. Rasko DA, Myers GSA, Ravel J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinform. 2005;6(1):2. 10.1186/1471-2105-6-2.
  • 138. Oksanen J, Simpson G, Blanchet F, Kindt R, Legendre P, Minchin P, et al. vegan: community ecology package. 2022.
  • 139. Conway JR, Lex A, Gehlenborg N. UpSetR: an R package for the visualization of intersecting sets and their properties. Bioinformatics. 2017;33(18):2938–40. 10.1093/bioinformatics/btx364.
  • 140. Lötsch J, Ultsch A. A non-parametric effect-size measure capturing changes in central tendency and data distribution shape. PLoS ONE. 2020;15(9):e0239623. 10.1371/journal.pone.0239623.
  • 141. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017;14(6):587–9. 10.1038/nmeth.4285.
  • 142. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4. 10.1093/molbev/msaa015.
  • 143. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35(21):4453–5. 10.1093/bioinformatics/btz305.
  • 144. Letunic I, Bork P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics. 2007;23(1):127–8. 10.1093/bioinformatics/btl529.
  • 145. Louca S, Doebeli M. Efficient comparative phylogenetics on large trees. Bioinformatics. 2017;34(6):1053–5. 10.1093/bioinformatics/btx701.
  • 146. Konstantinidis K, Ruiz Pérez C, Gerhardt K, Rodríguez-R L, Jain C, Tiedje J, et al. FastAAI: efficient estimation of genome average amino acid identity and phylum-level relationships using tetramers of universal proteins. Preprint Res Square. 2022. 10.21203/rs.3.rs-1459378/v1.
  • 147. Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: a tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss. Mol Biol Evol. 2022;39(2):msab365. 10.1093/molbev/msab365.
  • 148. Morel B, Kozlov AM, Stamatakis A, Szöllősi GJ. GeneRax: a tool for species-tree-aware maximum likelihood-based gene family tree inference under gene duplication, transfer, and loss. Mol Biol Evol. 2020;37(9):2763–74. 10.1093/molbev/msaa141.
  • 149. Csűrös M. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics. 2010;26(15):1910–2. 10.1093/bioinformatics/btq315.
  • 150. Diepenbroek M, Glöckner F, Grobe P, Güntsch A, Huber R, König-Ries B, et al. Towards an integrated biodiversity and ecological research data management and archiving platform: the German federation for the curation of biological data (GFBio). In Informatik 2014 (pp. 1711-1721). Gesellschaft für Informatik eV. 2014.
  • 151. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2011;29(5):415–20. 10.1038/nbt.1823.

Associated Data

Supplementary Materials

40168_2025_2140_MOESM1_ESM.docx (9.6MB, docx)

Supplementary Material 1: Figures S1. Microbial community composition of second-generation enrichments. Relative abundances of (a.) archaeal 16S rRNA genes, (b.) Ca. Penumbrarchaeia ASVs and (c.) Lokiarchaeia ASVs at day 98 and 157 in two replicates each of control samples and protein amended samples, both amended with 30 mM sulfate (Na2SO4) and an antibiotics mix (D-cycloserin, kanamycin, vancomycin, ampicillin and streptomycin; 50 mg/l for each). Protein samples were additionally amended with 2.33 g/l egg white protein. S2. Microbial community composition of second-generation enrichments. 16S rRNA gene copies per ml slurry of Bacteria, Archaea, the class Ca. Penumbrarchaeia and Lokiarchaeia subgroup Loki-2b. Gene copies are shown for both replicate enrichments (Replicate 1, Replicate 2*) of control samples and protein amended samples, both amended with 30 mM sulfate (Na2SO4) and an antibiotics mix (D-cycloserin, kanamycin, vancomycin, ampicillin and streptomycin; 50 mg/l for each). Protein samples were additionally amended with 2.33 g/l egg white protein. Error bars represent standard deviations of the three technical qPCR replicates per sample. S3. Phylogenetic reconstruction of the class Ca. Penumbrarchaeia. (a.) Maximum-likelihood tree (RAxML, convergence reached after 1300 bootstraps) of 354 full length 16S rRNA gene sequences of Thermoplasmatota, including all retrieved full length Ca. Penumbrarchaeia 16S rRNA genes from single MAGs. Shorter 16S rRNA gene sequences of ASVs and those found in Ca. Penumbrarchaeia MAGs were added to the existing tree. (b.) Maximum-likelihood tree based on orthogroups that were part of the core genome and detected in > 90% of all Ca. Penumbrarchaeia MAGs. The species tree was inferred from gene trees of all orthogroups using SpeciesRax [98]. S4. Partial phylogenomic tree of the Thermoplasmatota including RED. Maximum-likelihood tree (RAxML, 100 bootstraps) of 370 Thermoplasmatota MAGs and 35 Ca. 
Penumbrarchaeia MAGs obtained through data mining of genome assemblies and metagenomic short read data sets. Node labels display taxonomic affiliation on class level. The class Ca. Penumbrarchaeia clusters with Thermoplasmata_A/DTKX_01 according to the marker gene tree and is indicated by a red box (Fig. 3a). Node values indicate relative evolutionary divergence (RED). The complete tree is provided as separate PDF. S5. Average nucleotide identity (ANI). Heatmap of ANI between all 35 Ca. Penumbrarchaeia MAGs obtained during our data mining. Numbers in colored tiles indicate the ANI percentage between the compared MAGs (threshold ANI > 75%). MAGs were ordered according to their taxonomy in the marker gene tree (Fig. 3. S6. Amino acid identity (AAI). Heatmap of AAI between all 35 Ca. Penumbrarchaeia MAGs obtained during our data mining. Numbers in colored tiles indicate the AAI percentage between the compared MAGs. MAGs were ordered according to their taxonomy in the marker gene tree (Fig. 3). S7. Distribution of the class Ca. Penumbrarchaeia. World map showing the distribution, relative abundance and environment of all observed Ca. Penumbrarchaeia detected in 128 samples. Relative abundances were calculated by aligning quality trimmed reads of 8573 metagenomic sequencing runs to a competitive mapping index containing 20 non-redundant Ca. Penumbrarchaeia MAGs. Point colors indicate the environment, point sizes indicate relative abundance. S8. Abundance of non-redundant Thermoplasmatota in the environment. (a.) As fraction of occurrence for each order found within the non-redundant Thermoplasmatota data set. The fraction of occurrence was defined as fraction of data sets MAGs occurred in across all screened data sets. Number of observations N represents the number of Thermoplasmatota genomes per order. (b.) As relative abundance of each order in samples they occurred in. 
Number of observations N represents the number of metagenomic runs, in which the genome was detected. Order 1—Order 4 correspond to the Ca. Penumbrarchaeia orders, with Order 1 being Ca. Penumbrarchaeales. S9. Clustering of Ca. Penumbrarchaeia MAGs based on orthogroups. a NMDS based on all orthogroups found in all 35 Ca. Penumbrarchaeia. The NMDS was computed using the function metaMDS from the package vegan v2.6.4 with Jaccard dissimilarities and two dimensions. Single families are indicated by different colors. Family 1 A corresponds to Ca. Penumbrarchaeaceae. b Upset plot showing intersections of orthogroups for all six families. Intersections shown reflect 95% of all orthogroups within the data set. Numbers of orthogroups per intersection are indicated above the bars. Orthogroup intersections, which are only present in sediment or water are indicated by color. Total number of different orthogroups per family are indicated by horizontal bars next to the upset plot. S10. Shared orthogroups between Ca. Penumbrarchaeia MAGs. Heatmap of shared orthogroups between all 35 Ca. Penumbrarchaeia MAGs. Numbers in colored tiles in the heatmap indicate the percentage of shared orthogroups between the compared MAGs. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1 A corresponds to Ca. Penumbrarchaeaceae. S11. Peptidases in Ca. Penumbrarchaeia MAGs. a Extracellular peptidase homologs within MAGs of the class Ca. Penumbrarchaeia. b Peptidase homologs within MAGs of the class Ca. Penumbrarchaeia. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1 A corresponds to Ca. Penumbrarchaeaceae. S12. Metabolic reconstruction of Ca. Penumbrarchaeales (Order 1) (Ca. Penumbrarchaeaceae (Family 1 A), Family 1B). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. 
The presence of genes is indicated by full or half circles for each family. Red stars indicate the presence of genes in all families. S13. Metabolic reconstruction of Order 2 (Family 2). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles. S14. Metabolic reconstruction of Order 3 (Family 3 A, Family 3B). Pathways in grey represent pathways with missing genes that are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles for each family. Red stars indicate the presence of genes in all families. S15. Metabolic reconstruction of Order 4 (Family 4). Pathways in grey represent pathways with missing genes and are therefore not functional. Gene abbreviations can be found in Supplementary Table 9. The presence of genes is indicated by full or half circles. S16. Annotation status of predicted genes in Ca. Penumbrarchaeia MAGs. Genes classified as annotated, hypothetical and unknown based on NR and KEGG annotations shown as (a.) number of genes and (b.) percentage of genes. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1 A corresponds to Ca. Penumbrarchaeaceae. S17. AGNOSTOS annotation category of genes in Ca. Penumbrarchaeia MAGs. Gene classification based on AGNOSTOS. Gene classifications are shown as (a.) number of genes and (b.) percentage of genes. MAGs were ordered according to their family taxonomy in the marker gene tree (Fig. 3). Family 1 A corresponds to Ca. Penumbrarchaeaceae. S18. Shared orthogroups among Thermoplasmatota. Boxplot of Thermoplasmatota genomes sharing orthogroups (OGs) present within Ca. Penumbrarchaeia MAGs separated by annotation category: annotated, hypothetical and unknown, according to their NR and KEGG annotation. 
Differences between groups were tested with the Wilcoxon signed-rank test; p-values are indicated by asterisks (**** p ≤ 0.0001). The measure of impact describes the effect size. Number of observations N indicates the number of orthogroups in each category. Number of genomes per group is indicated by number n. S19. Ancestral gene family reconstruction of orthogroups containing genes with hypothetical or unknown function. Evolutionary events are indicated for each order within the phylum Thermoplasmatota, representing orthogroup gains, expansions, and losses. Evolutionary events were inferred using Wagner parsimony in COUNT (gain penalty = 1) [99]. S20. Flowchart of methods applied during the data mining conducted in this study. Retrieval of Ca. Penumbrarchaeia (EX4484-6) MAGs from public archives (methods Sect. 7) is displayed in green; the subsequent mining of metagenomic short read data from public archives is displayed in blue (methods Sect. 8). Annotation and classification (methods Sect. 9) were conducted on both the MAGs retrieved from public archives and the MAGs reconstructed from metagenomic short read data (yellow). Programs and settings used are indicated next to or below arrows.
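
The Jaccard dissimilarity named in the caption of Fig. S9a can be illustrated with a minimal sketch. It assumes that each MAG is represented by the set of orthogroups it contains (presence/absence only); the function name and the orthogroup identifiers are hypothetical, and the published analysis used metaMDS from the R package vegan rather than this code.

```python
from typing import Set


def jaccard_dissimilarity(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of orthogroup IDs."""
    union = a | b
    if not union:
        # Two empty sets are treated as identical (assumption for this sketch).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical orthogroup content of two MAGs.
mag_a = {"OG0001", "OG0002", "OG0003"}
mag_b = {"OG0002", "OG0003", "OG0004"}

# Two shared orthogroups out of four in total.
dissimilarity = jaccard_dissimilarity(mag_a, mag_b)
```

A pairwise matrix of such values between all MAGs would be the kind of input an ordination such as NMDS works on.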

40168_2025_2140_MOESM2_ESM.xlsx (4.5MB, xlsx)

Supplementary Material 2: Tables S1. ASV counts and relative abundances of archaeal 16S rRNA gene amplicon sequencing for control and protein-amended samples. S2. Top hits of a megablast search of the Ca. Penumbrarchaeia ASV sq2 and 16S rRNA gene obtained from the Ca. Penumbrarchaeia MAG E3_1_d157 against the rRNA/ITS database of the NCBI RefSeq Targeted Loci Project. S3. Overview list of metagenomic studies in which Ca. Penumbrarchaeia MAGs had a coverage > 2. These studies were used to reconstruct Ca. Penumbrarchaeia MAGs. S4. Overview list of MAGs obtained in this study. S5. Quality statistics computed with checkM and checkM2 and taxonomy information computed using gtdb_tk with the GTDB database v207. S6. Table containing relative abundances of single MAGs in mapped metagenomic runs. S7. List of all unique metagenomic runs in which Ca. Penumbrarchaeia MAGs were found, with ENA-derived metadata on sampling location and environment. S8. List of orthogroups selected as the core genome (orthogroups present in at least 90% of MAGs). The table shows gene counts for each MAG within single orthogroups. S9. Gene counts for single pathways within each of the Ca. Penumbrarchaeia MAGs. Genes were extracted from the core genome and annotated using NR and KEGG. Based on these annotations, genes were grouped by KEGG pathway description. S10. Full annotation of genes found in all orthogroups of the core genome. Annotation was performed using KEGG, NR, OM_RGC, dbCAN and MEROPS. S11. Full annotation of selected pathways for all 35 Ca. Penumbrarchaeia MAGs. Annotation was performed using KEGG (release 104), NR (release 13/05/2023), dbCAN (v3.0.7) and MEROPS (v12.4). S12. Annotation of peptidases found in all 35 Ca. Penumbrarchaeia MAGs. 
The annotation was performed using MEROPS (v12.4). S13. Annotation of CAZymes found in all 35 Ca. Penumbrarchaeia MAGs. The annotation was performed using dbCAN (v3.0.7). Results were validated with NR and KEGG annotations. S14. Gene counts for the categories annotated, hypothetical and unknown, along with percentages for each category. S15. Counts of rarity groups per habitat. S16. Overview table of all Thermoplasmatota MAGs with their taxonomic classification, fraction of occurrence, percentage of unknown genes (perentage_unknown), rarity and habitat. S17. Studies selected from the marine metagenomics literature and matched with BioProject identifiers from the European Nucleotide Archive for MAG reconstruction in OMDv2.
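
The core genome criterion stated for Table S8 (an orthogroup present in at least 90% of MAGs) can be sketched as a simple filter over a gene count table. This is an illustrative sketch only: the function name, the dictionary layout, and the example identifiers are assumptions and do not come from the study's code.

```python
from typing import Dict, List


def core_orthogroups(
    counts: Dict[str, Dict[str, int]],
    n_mags: int,
    min_fraction: float = 0.9,
) -> List[str]:
    """Return orthogroups present in at least min_fraction of all MAGs.

    counts maps orthogroup ID -> {MAG ID -> gene count}; MAGs missing from
    the inner dictionary are treated as having zero genes in that orthogroup.
    """
    core = []
    for orthogroup, per_mag in counts.items():
        n_present = sum(1 for c in per_mag.values() if c > 0)
        if n_present / n_mags >= min_fraction:
            core.append(orthogroup)
    return sorted(core)


# Hypothetical table for 10 MAGs: OG_A occurs in 9 MAGs, OG_B in 8.
mags = [f"MAG{i}" for i in range(10)]
example = {
    "OG_A": {m: 1 for m in mags[:9]},
    "OG_B": {m: 2 for m in mags[:8]},
}
core = core_orthogroups(example, n_mags=len(mags))
```

With 35 MAGs, as in the class described here, the same threshold would require an orthogroup to occur in at least 32 MAGs.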

40168_2025_2140_MOESM3_ESM.pdf (20.1KB, pdf)

Supplementary Material 3. Proposal of type material and higher ranks. Methods: I. Clone library construction. II. Standard preparation for qPCR. III. qPCR primer design. IV. 16S rRNA gene phylogenetic tree. V. Data collection, processing and MAG reconstruction in OMDB (v2). Results and discussion: I. Additional qPCR results. II. Delineation of the class Ca. Penumbrarchaeia. III. Annotation of the class Ca. Penumbrarchaeia. Carbon metabolism. Carbon assimilation. Hydrogenases and energy conservation. Transporters. Stress response.

Data Availability Statement

All code used to perform analyses in this study is available on Zenodo (https://doi.org/10.5281/zenodo.15464567). Amplicon and metagenomic raw reads for this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB80318 (https://www.ebi.ac.uk/ena/data/view/PRJEB80318), using the data brokerage service of the German Federation for Biological Data (GFBio [150]), in compliance with the Minimal Information about any (X) Sequence (MIxS) standard [151]. Ca. Penumbrarchaeia MAGs obtained and generated in this study and all data tables used for figure generation are available on Zenodo (https://zenodo.org/records/10813815). The type species genome has been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number GCA_965234725.1. Clone sequences were deposited at GenBank under the accession numbers PQ255994-PQ256084.


Articles from Microbiome are provided here courtesy of BMC
