Scientific Reports. 2025 Oct 3;15:34482. doi: 10.1038/s41598-025-22383-7

Widely-distributed freshwater microorganisms with streamlined genomes co-occur in cohorts with high abundance

Alejandro Rodríguez-Gijón 1, Armando Pacheco-Valenciana 1, Felix Milke 2, Jennah E Dharamshi 1, Justyna J Hampel 1, Julian Damashek 3, Gerrit Wienhausen 2, Luis Miguel Rodriguez-R 4,6, Sarahi L Garcia 1,2,5
PMCID: PMC12495000  PMID: 41044404

Abstract

Genome size is known to reflect the eco-evolutionary history of prokaryotic species, including their lifestyle, environmental preferences, and habitat breadth. However, it remains uncertain how strongly genome size is linked to prokaryotic prevalence, relative abundance and co-occurrence. To address this gap, we present a systematic and global-scale evaluation of the relationship between genome size, relative abundance and prevalence in freshwater ecosystems. Our study includes 80,561 medium-to-high quality genomes, from which we identified 9,028 species (ANI > 95%) present in a manually curated dataset of 636 freshwater metagenomes. Our results show that prokaryotes with reduced genomes exhibited higher prevalence and relative abundance, suggesting that genome streamlining may promote cosmopolitanism. Furthermore, network analyses revealed that the most prevalent prokaryotes have streamlined genomes that are found in co-occurrent cohorts potentially sustained by metabolic dependencies. Overall, species in these groups possess a diminished capacity for synthesizing different essential metabolites such as vitamins, amino acids and nucleotides, potentially fostering metabolic complementarities within the community. Moreover, we found the presence of the essential biosynthetic functions to be usage-dependent: nucleotide and amino acid biosynthesis are the most complete, whereas vitamin biosynthesis is the most incomplete. Our results underscore genome streamlining as a central eco-evolutionary strategy that both shapes and is shaped by community dynamics, ultimately fostering interdependences among prokaryotes.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-22383-7.

Keywords: Freshwater, Genome size, Prevalence, Cohorts, Bacteria, Archaea, Comparative genomics

Subject terms: Computational biology and bioinformatics, Ecology, Evolution, Genetics, Microbiology

Introduction

Genome size reflects both the evolutionary history and the ecological dynamics of aquatic prokaryotes1,2. Decades ago, research on genome size primarily focused on cultivated prokaryotic isolates, thereby overlooking the full spectrum of naturally occurring genome size variation across prokaryotes3. Traditional cultivation techniques are inherently biased, as they recover only a limited fraction of the microbial biodiversity and tend to favor organisms with larger genomes4,5. In contrast, recent advances in dilution-high-throughput cultivation, single-cell genomics and metagenomics have broadened our perspective on microbial diversity and genome size, revealing that streamlined genomes are common among free-living microorganisms6–10. Streamlined prokaryotes are often highly abundant in oligotrophic environments such as oceans11 and freshwater ecosystems6,12. For instance, metagenomic studies highlighted the prominence of members of the phylum Actinomycetota with reduced genomes in the surface layers, where they account for up to 29% of the microbial communities across geographically distant freshwater bodies13–16. Similarly, dominant aquatic taxa such as SAR11 (Ca. Pelagibacterales)17–19, OM43 (ref. 20) and acI clades21,22 are characterized by compact genomes below 1.6 Mbp, and are widely distributed within their respective habitats. Intriguingly, despite their high relative abundances, these microorganisms with small genomes often exhibit complex and unusual nutritional requirements7. Further investigation into the relationship between genome size, metabolic dependencies, relative abundance and prevalence is essential to better understand the ecological advantages conferred by genome reduction across ecosystems.

Many microorganisms with reduced genomes lack the ability to biosynthesize many essential metabolites, a condition known as auxotrophy, and must acquire these nutrients from external sources to thrive23–25. The ‘Black Queen Hypothesis’ posits that gene loss can drive metabolic dependencies when critical metabolites are provided by other co-occurring community members26. Under this scenario, microbial species with reduced genome sizes are expected to coexist with those that retain the necessary biosynthetic capabilities that they have lost27,28. Field studies highlight the importance of producing costly essential metabolites; for example, diazotrophs play a critical role by providing fixed nitrogen as a public good29. Intriguingly, diazotrophs often have larger genome sizes than non-nitrogen-fixing lineages30, but represent only a small fraction of marine microbial populations, thereby highlighting the tradeoff inherent in maintaining such a costly function. Although these observations have been made for specific functions and highly studied taxonomic groups, the broader applicability of the ‘Black Queen Hypothesis’ across diverse metabolic processes in aquatic microbial communities remains to be systematically tested.

Here, we present a systematic and global-scale evaluation of the ‘Black Queen Hypothesis’ based on freshwater metagenomic datasets to study the relative abundance, prevalence and co-occurrence of microorganisms with different genome sizes in microbial communities. We selected freshwater ecosystems as they provide an ideal model for this study: since lakes experience limited gene flow from immigrating bacteria due to physical barriers and spatial distance, they promote the isolation of microbial populations, which can then evolve independently31. More specifically, we aim to: i) examine the relationship between genome size, relative abundance and prevalence (defined in this study as the percentage of metagenomic samples in which a given taxonomic group is detected), ii) investigate co-occurrence patterns of microorganisms with varying genome sizes, and iii) infer patterns of metabolic interdependencies that potentially occur between co-occurrent freshwater prokaryotes.

Results and discussion

The FRESH-MAP dataset

Our study leverages 80,561 medium-to-high-quality genomes (completeness > 50% and contamination < 5%) collected from various environments (i.e., aquatic, terrestrial and host-associated), with an emphasis on freshwater bodies (Table S1). These genomes were grouped into 24,050 species-clusters after genome dereplication using an ANI (Average Nucleotide Identity) threshold of > 95%, and for each of the species-clusters, the genome with the highest estimated completeness and lowest estimated contamination was selected as the representative genome (Table S2). The 24,050 representative genomes were used for competitive mapping against a manually curated global dataset of 636 freshwater metagenomes (Figure S1 and Table S3) to determine their prevalence and relative abundance. Notably, mapped reads accounted for an average of 41.82% of the total reads in the metagenomic dataset (n = 636; Table S4), approximately 25% more than reported in a recent marine study32. In total, we detected the presence of 9,028 species in at least one freshwater metagenome (Figure S2 and Table S5), and we refer to this novel catalogue of prokaryotic species detected across global freshwater bodies as the ‘FRESH-MAP’ dataset33. Prior to other analyses, we examined whether incomplete genomes in the FRESH-MAP dataset might bias our estimates of genome size across taxonomic groups, as well as their relationship to prevalence and relative abundance. Although we detected some bias consistent with a previous report5, correlations were weak in our study (Figure S3), indicating that the completeness of the FRESH-MAP representative genomes gives a good estimate of genome size, prevalence and average relative abundance. Consequently, we chose to retain all medium-to-high quality genomes (mean completeness across FRESH-MAP genomes = 85.4%) in our analysis to maximize insights gained.
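The representative-selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys (`genome`, `cluster`, `completeness`, `contamination`) are hypothetical, and the tie-break order (highest completeness first, then lowest contamination) is an assumption, since the text does not say how the two criteria are combined.

```python
from collections import defaultdict


def pick_representatives(genomes):
    """Return {cluster_id: genome_id}, one representative per species-cluster.

    `genomes` is an iterable of dicts with the hypothetical keys 'genome',
    'cluster', 'completeness' and 'contamination' (both in %).
    Assumed ordering: highest completeness first, then lowest contamination.
    """
    by_cluster = defaultdict(list)
    for record in genomes:
        by_cluster[record["cluster"]].append(record)

    representatives = {}
    for cluster, members in by_cluster.items():
        best = min(members, key=lambda g: (-g["completeness"], g["contamination"]))
        representatives[cluster] = best["genome"]
    return representatives


# Toy records, not taken from Table S1 or S2.
example = [
    {"genome": "g1", "cluster": "sp1", "completeness": 92.0, "contamination": 1.5},
    {"genome": "g2", "cluster": "sp1", "completeness": 92.0, "contamination": 0.4},
    {"genome": "g3", "cluster": "sp1", "completeness": 71.0, "contamination": 0.1},
    {"genome": "g4", "cluster": "sp2", "completeness": 55.0, "contamination": 4.0},
]
representatives = pick_representatives(example)
print(representatives)
```

In the toy input, `g1` and `g2` tie on completeness, so the lower contamination of `g2` decides the representative for `sp1`.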

In general, we detected 374.4 species per metagenome on average, with a maximum of 1,566 species in metagenome SRX3726699 (Table S5). While the number of detected species per metagenome is positively correlated with the total number of reads per metagenome (Figure S4), this relationship explains less than 18% of the variability. Approximately 97% of the species-clusters (n = 8,758) were derived from culture-independent techniques (i.e., SAGs and MAGs), while only 3% of species-clusters included at least one representative genome derived from a cultured isolate (n = 270; Figure S2). Similarly, 81.32% of the species-clusters were originally derived from strictly freshwater environments, while 16.85% of the detected species-clusters were derived from strictly non-freshwater environments (Figure S2). In the FRESH-MAP dataset, 320 species-clusters were classified as Archaea (spanning over 12 different phyla; Figure S5), and 8,708 species-clusters as Bacteria (spanning over 83 different phyla; Figure S6), surpassing the identified prokaryotic diversity reported in previous surveys6,34,35.
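As a minimal sketch of the summary statistics used throughout this section, the snippet below computes prevalence (the percentage of metagenomes in which a species is detected), average relative abundance, and the number of detected species per metagenome from a small abundance table. The table layout, the function names, the use of `> 0` as the detection criterion, and averaging over all metagenomes (zeros included) are assumptions made for illustration; the detection threshold applied in the actual mapping workflow is not stated in this section.

```python
def summarize_species(abundance):
    """Per-species prevalence (%) and mean relative abundance (%).

    `abundance` maps a species name to its relative abundance (%) in each
    metagenome, with 0.0 meaning "not detected" (an assumed convention).
    """
    summary = {}
    for species, values in abundance.items():
        n_samples = len(values)
        n_detected = sum(1 for v in values if v > 0)
        summary[species] = {
            "prevalence_pct": 100.0 * n_detected / n_samples,
            # Assumption: the mean is taken over all metagenomes, zeros included.
            "mean_rel_abundance_pct": sum(values) / n_samples,
        }
    return summary


def species_per_metagenome(abundance):
    """Number of detected species in each metagenome (column-wise count)."""
    return [sum(1 for v in column if v > 0) for column in zip(*abundance.values())]


# Toy table: three species across four metagenomes (values are invented).
toy = {
    "spA": [0.5, 0.0, 1.5, 0.0],
    "spB": [0.0, 0.0, 0.2, 0.0],
    "spC": [0.1, 0.1, 0.1, 0.1],
}
summary = summarize_species(toy)
richness = species_per_metagenome(toy)
print(summary)
print(richness)
```

Averaging `richness` over all metagenomes gives the "species per metagenome" figure quoted above for the real dataset.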

Prokaryotes with smaller genomes have a higher prevalence and average relative abundance

To explore the link between estimated genome size and both prevalence and relative abundance across the FRESH-MAP genomes, we performed a competitive mapping of all dereplicated genomes against the collection of metagenomes. We observed that the relationship between the estimated genome size and the prevalence markedly followed a smooth pattern of constrained variation, with species with small genomes (below 2 Mbp) present in a range of up to approximately 50% of metagenomes, and those with larger genomes (over 6 Mbp) in a range of up to 18% of metagenomes (Fig. 1B). Similarly, genomes reduced in size and with high prevalence also show lower GC content and higher coding density than those with larger genomes and lower prevalence (Figure S7). While this level of ubiquity has been described for different marine taxa such as SAR86 and the family Pelagibacteraceae (< 1.7 Mbp and GC < 33%)36, and different freshwater taxa such as Actinomycetota16,21 and Pseudomonadota35,37,38, our systematic overview confirms these observations for 9,028 species spanning over 95 prokaryotic phyla. Notably, species classified as Ca. Patescibacteria in our study appear to have a lower prevalence in relation to their estimated genome size given the general trend (Fig. 1B), a discrepancy that could be explained by two factors. First, symbiotic lifestyles are hypothesized to be common across Ca. Patescibacteria, which likely limits the dispersive capabilities of clade members39,40. Second, these organisms are typically abundant only below the oxycline in lakes41,42, a trend strongly reflected in our dataset, where Ca. Patescibacteria is one of the most prevalent taxa in hypolimnion metagenomes (Figure S8). In summary, prokaryotic species with high relative abundance and prevalence often have reduced genomes in freshwater environments.

Fig. 1.


Overview of the relationship between estimated genome size (Mbp), prevalence (%, over 636 freshwater metagenomes), and average relative abundance (%) across the 9,028 species-clusters (ANI > 95%) representative genomes of the FRESH-MAP database. A shows the relationship between the estimated genome size of major phyla. Numbers next to boxes indicate the number of species-clusters per phylum. B shows the relationship between estimated genome size and prevalence. C compares prevalence between phyla. D shows the relationship between estimated genome size and average relative abundance. E compares average relative abundances between phyla. Different letters in A, C and E indicate statistical differences (p < 0.05; Kruskal–Wallis non-parametric test corrected with Benjamini-Hochberg) between phyla. Different colors in A-E indicate different phyla according to the legend at the top-right of the figure.

While we observed a large variability in the average relative abundance across species (Fig. 1D), the average relative abundance of each phylum is remarkably low, ranging from 0.11% to 0.52% (Fig. 1E). Notably, all median values fall below 0.1% (Fig. 1E), highlighting that over 50% of the species irrespective of their origin occur at very low abundances. A similar pattern emerges when we consider the environment of origin of the genome (Figure S9), reflecting the large number of low-abundance prokaryotic taxa that exist in freshwater ecosystems43,44, where only a smaller subset of freshwater taxa is sufficiently abundant to be detected by shallow sequencing. Consequently, given the positive correlation between the number of detected species and sequencing depth (Figure S4), and the long-tail distribution of low-abundance prokaryotic species (Figs. 1D and 1E), our results underscore the critical need for deep metagenomic sequencing approaches to fully capture microbial community complexity.

Estimated genome size variability is linked to taxonomy, genome type, and ecology

Our results show that members of the Ca. Patescibacteria (averaging 0.91 Mbp; n = 529) and Actinomycetota (averaging 2.13 Mbp; n = 1024; Fig. 1A) have the most reduced estimated genome sizes in the ‘FRESH-MAP’ dataset, mirroring previous findings22,45,46. In contrast, Verrucomicrobiota members have the largest estimated genome sizes in our dataset (averaging 4.15 Mbp; n = 753; Fig. 1A). This group has been observed to have a large variability in genome size across freshwater bodies6, suggesting a wide ecological diversity within the phylum. However, these differences might stem from divergent evolutionary histories, since different studies show how genome size complexity is tightly linked to evolutionary history2,47. We also observed that species-clusters uniquely retrieved via culture-independent techniques have significantly smaller estimated genome sizes and have a higher prevalence and relative abundance (Figure S10). Since genome completeness biases our view on genome size by only 2% (Figure S3B), this suggests that the bias in metagenome assembly and binning would not account for the genome size difference observed between all isolate representatives and ecosystem MAGs, nor for the differences among ecosystem MAGs5.

Moreover, to investigate genome size variability, we selected the representative genomes from all genera comprising at least five species-clusters in the ‘FRESH-MAP’ dataset, yielding 368 bacterial and 7 archaeal genera (Tables S1 and S2). We found that genera with larger average genome sizes tended to exhibit greater variance in genome size (Figure S11). For Bacteria, this positive correlation persisted even after normalizing using a coefficient of variation, whereas this correlation was not as evident for Archaea (Figure S11), likely due to the limited number of archaeal genera present in the analysis. The tendency for higher variance among genera with larger genomes aligns with previous findings from cultivated prokaryotic genera from diverse environments2. More specifically, the genus-level clades SCTL01 and ER46 (both Verrucomicrobiota) exhibit the largest variance (both ~ 6.61) in our dataset, with average estimated genome sizes of 5.76 and 5.96 Mbp respectively (Figure S10). Notably, the genus-level clade ER46 has been observed in a wide variety of environments, including plant-associated48, freshwaters49, anaerobic bioreactors50, and groundwaters51. In contrast, we observed a low variance in genome size across clades with reduced genomes, including several genus-level clades in Ca. Patescibacteria, the Ca. Allofontibacter (Pseudomonadota)52,53, and the genus-level clade UBA970 (Bacillota; GTDB r220)54. Interestingly, these genus-level clades with low variability in genome size were exclusively recovered from freshwater environments (Table S2), suggesting that genera with greater variability in genome size also exhibit a broader functional diversity and habitat versatility55. Our findings indicate that larger genome sizes might enhance a microbe’s capacity to survive and thrive across a larger diversity of environments. Further research examining the relationship between genome size variability and prokaryotic niche breadth could yield valuable insights into prokaryotic adaptability.

Prokaryotic species with reduced genomes co-occur in cohorts with high prevalence

Since previous hypotheses propose that prokaryotes thrive in interconnected communities25, we predicted a co-occurrence network using the ‘FRESH-MAP’ dataset. While co-occurrence does not necessarily imply direct interaction or active exchange of metabolites56, it can still provide valuable insights into microorganisms that tend to co-occur with the same local neighbors57. In total, 1,202 species showed significant co-occurrences and were included in the network analysis (Table S6). The network was significantly more modular than expected by random chance (p-value = 0; 500 permutations), and clustered into nine groups of co-occurring prokaryotes that we define as cohorts (Fig. 2A and Table S6). Of those, four cohorts (i.e., 1, 2, 3 and 6) had a large number of members (between 209 and 295 species-clusters), while the other five cohorts (i.e., 4, 5, 7, 8 and 9) had a relatively low number of members (between 7 and 77 species-clusters) (Tables S6 and S7). Given that the bimodal distribution of cohort sizes might stem from insufficient coverage of the smaller cohorts, we focused on the four bigger cohorts for further analysis.
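To make the modularity test more concrete, the following sketch computes Newman's modularity Q for a given partition and compares it against a permutation null. It is a hedged illustration only: the null model used here (shuffling cohort labels across nodes) is an assumption, as are the function names and the toy graph; the study's own null model and clustering algorithm are not described in this section. Note that an empirical p-value of 0 from 500 permutations simply means that no permuted partition reached the observed modularity, that is, p < 1/500.

```python
import random


def modularity(edges, membership):
    """Newman modularity Q of an undirected, unweighted graph for a partition."""
    m = len(edges)
    internal = {}    # number of edges with both ends in the same group
    degree_sum = {}  # summed node degree per group
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


def permutation_p_value(edges, membership, n_perm=500, seed=0):
    """Empirical p-value: share of label shuffles with Q >= the observed Q.

    Assumed null model: group labels are shuffled across nodes while the
    network itself is kept fixed.
    """
    rng = random.Random(seed)
    observed = modularity(edges, membership)
    nodes = sorted(membership)
    labels = [membership[n] for n in nodes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if modularity(edges, dict(zip(nodes, labels))) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm


# Toy graph: two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
toy_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
q_observed, p_empirical = permutation_p_value(toy_edges, toy_groups)
print(q_observed, p_empirical)
```

With only six nodes the toy p-value is not small, because many shuffles reproduce the original split; with 1,202 nodes and nine cohorts, matching the observed partition by chance becomes far less likely.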

Fig. 2.


Overview of the co-occurrence network and analyses. A shows the 1,202 species-clusters representative genomes (nodes) included in the co-occurrence network and the connections (edges, grey) between them (rSparCC > 0.4, p-value < 0.05). Different colors denote different co-occurrence cohorts as it can be inferred from B. B shows the preferred environmental conditions for each cohort, where red indicates estimations above the baseline and blue below the baseline. The preferred environmental condition is calculated as the weighted average of relative abundances of each cohort in each sample for each environmental parameter (absolute latitude, temperature and oxygen). C shows the relation between the prevalence (%, over 636 freshwater metagenomes) and average relative abundance (%). Datapoints in black correspond to species-clusters not included in the co-occurrence network, and datapoints with different colors refer to different cohorts as indicated in the subplot. The subplot also compares the residuals of each cohort and those species-clusters out of the co-occurrence network, and includes the number of species-clusters per cohort. D shows the correlation between prevalence and the degree of connectedness (number of edges) within major cohorts (i.e., with more than 200 species-clusters) for each species-cluster. The subplot in D compares residuals of the linear regression for each cohort. E and F compare the average estimated genome size (Mbp) and the coding density (%) between major cohorts, respectively. Different letters in C-D (subplots) and E–F indicate statistical differences (p < 0.05; Kruskal–Wallis non-parametric test corrected with Benjamini-Hochberg) between cohorts.

While members of cohorts 1, 2 and 6 are connected and share a similar preference for higher oxygen concentrations, we found that microbial species within cohort 3 had no connection to the members of other cohorts, likely as a result of their preference for low oxygen concentrations (Fig. 2B). Moreover, cohort 3 represents the largest fraction of the microbial communities in oxygen-depleted zones in ten out of thirteen lakes from which we have depth profiles in our metagenomic dataset (Figure S12). The taxonomy of the members of cohort 3 also reflects this environmental preference for low oxygen concentrations, since it hosts the majority of the species classified as Ca. Patescibacteria in our co-occurrence network (Table S7). Additionally, 10 different phyla appear to be uniquely associated with cohort 3 (Table S7), including Desulfobacterota (12 species-clusters), Halobacteriota (3 species-clusters), and Omnitrophota (7 species-clusters), taxa that have been previously associated with freshwaters with low oxygen concentrations13,58,59.
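The preferred environmental condition shown in Fig. 2B is described in the caption as an abundance-weighted average of each environmental parameter. One plausible reading, sketched below, weights the parameter value of each sample by the cohort's relative abundance in that sample. The function name, the handling of missing measurements, and the use of the unweighted sample mean as the "baseline" are assumptions for illustration, not the authors' stated procedure.

```python
def preferred_condition(cohort_abundance, env_values):
    """Abundance-weighted mean of an environmental parameter for one cohort.

    `cohort_abundance[i]` is the cohort's summed relative abundance in sample i
    and `env_values[i]` the parameter measured there (None if missing).
    Returns None when the cohort has no abundance in any measured sample.
    """
    pairs = [(a, e) for a, e in zip(cohort_abundance, env_values) if e is not None]
    total = sum(a for a, _ in pairs)
    if total == 0:
        return None
    return sum(a * e for a, e in pairs) / total


# Invented values: dissolved oxygen (mg/L) in four samples.
oxygen = [8.0, 6.0, 0.5, 0.0]
oxic_cohort = [10.0, 8.0, 1.0, 1.0]     # abundant where oxygen is high
anoxic_cohort = [1.0, 1.0, 10.0, 8.0]   # abundant where oxygen is low

baseline = sum(oxygen) / len(oxygen)     # assumed baseline: plain sample mean
oxic_pref = preferred_condition(oxic_cohort, oxygen)
anoxic_pref = preferred_condition(anoxic_cohort, oxygen)
print(baseline, oxic_pref, anoxic_pref)
```

In this toy case the first cohort sits above the baseline and the second below it, which is the kind of contrast the red and blue shading in Fig. 2B encodes.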

Correlation residuals between the prevalence and the average relative abundance show that species in the co-occurrence network exhibited a higher prevalence than expected given their average relative abundance (Fig. 2C). Their widespread persistence likely results from a broad niche breadth, efficient dispersal, and competitive advantages that enable them to thrive locally60,61, with potential beneficial interactions with other community members further reinforcing their central role in community functioning27. Hence, we explored the degree of connectedness (measured as the number of edges per node) to quantify the co-occurrence of each species within its cohort. Our results indicate that, while prevalence is positively correlated with the degree of connectedness (Fig. 2D), the estimated genome size is negatively correlated with this measure (Figure S13), indicating that microorganisms with reduced genomes often have a larger network connectivity. This phenomenon was already observed in previous work on an 8-year time-series of Lake Erken, where streamlined freshwater bacteria (e.g., the order Nanopelagicales and the Ca. genus Planktophila) were found to be central members of functional cohorts62. Taken together, while variability in environmental chemistry and physics shapes the selection of different microbial species across freshwater ecosystems, microorganisms with similar ecological affinities tend to co-occur. Eventually, co-occurring microbial species may co-evolve metabolic dependencies27, potentially promoting the formation of microbial networks supported by metabolic exchange25,26. In this context, microorganisms with streamlined genomes would rely on other co-occurring microorganisms for the acquisition of essential metabolites.
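The degree of connectedness can be illustrated with a short sketch that counts, for each species, the number of edges passing the thresholds given in the Fig. 2 caption (correlation > 0.4 and p-value < 0.05). The tuple layout of the input, the function name and the optional restriction to within-cohort edges are assumptions for illustration; the correlation values below are invented rather than SparCC output.

```python
def degree_of_connectedness(correlations, cohort=None, r_min=0.4, p_max=0.05):
    """Count edges per species among pairwise correlations that pass the cutoffs.

    `correlations` is an iterable of (species_a, species_b, r, p_value).
    If `cohort` (species -> cohort id) is given, only edges between members
    of the same cohort are counted. Species without any edge are omitted.
    """
    degree = {}
    for a, b, r, p in correlations:
        if not (r > r_min and p < p_max):
            continue
        if cohort is not None and cohort.get(a) != cohort.get(b):
            continue
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


# Invented pairwise results for five species.
pairs = [
    ("sp1", "sp2", 0.62, 0.001),
    ("sp1", "sp3", 0.45, 0.030),
    ("sp2", "sp3", 0.38, 0.010),  # correlation too weak
    ("sp3", "sp4", 0.70, 0.200),  # not significant
    ("sp1", "sp5", 0.55, 0.010),
]
cohort_of = {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 3, "sp5": 3}

degree_all = degree_of_connectedness(pairs)
degree_within = degree_of_connectedness(pairs, cohort=cohort_of)
print(degree_all)
print(degree_within)
```

The resulting per-species degree is the quantity that would then be regressed against prevalence or estimated genome size.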

However, not all cohorts are structured the same way. Correlation residuals between connectedness, prevalence, and estimated genome size (Fig. 2D and S13) indicate that members of cohort 1 have more connections than expected given their estimated genome size and prevalence. Moreover, cohort 1 members have statistically the lowest average estimated genome sizes and the highest average coding density of all cohorts analyzed (Figs. 2E and 2F). Such characteristics might be conducive to evolving dependencies through adaptive gene loss26.

Low anabolic independence is widespread and cohorts show metabolic complementarities

To robustly explore the relationship between genome size and anabolic independence, we focused only on the 4,725 high-quality (completeness > 90% and contamination < 5%) representative genomes in the ‘FRESH-MAP’ dataset (Tables S2, S8 and S9). Our analysis revealed a positive correlation between estimated genome size and the average completeness of amino acid, nucleotide and vitamin biosynthetic pathways (Fig. 3). Notably, the smallest genomes in the dataset exhibited a highly reduced biosynthetic capacity for amino acids and nucleotides, a striking observation given that these compounds are essential building blocks of life. In the context of the ‘Black Queen Hypothesis’, the external acquisition of these essential metabolites among co-occurring prokaryotes could promote community stability26,27,63, and hence, favor the assembly of interdependent populations of streamlined microorganisms driven by a crossed acquisition of essential metabolites they cannot biosynthesize.

Fig. 3.


Exploration of the relationship between biosynthetic potential to produce essential metabolites and the estimated genome size (Mbp). A-C show the relationship between estimated genome size and average pathway completeness (%) for different KEGG modules across all 4,725 high-quality representative genomes (completeness > 90% and contamination < 5%) from the FRESH-MAP database. KEGG modules include biosynthesis of amino acids (A), nucleotides (B) and vitamins (C). In A-C, ‘n’ indicates the number of modules per category, and the different colors indicate different phyla according to the legend at the bottom of the figure.

However, metabolite acquisition in aquatic microbial communities may occur both actively and passively64. For instance, lysis induced by phages and protist grazing is responsible for approximately 50% of bacterial mortality65, and it releases valuable cellular content that can be re-utilized by other microorganisms. Moreover, recent studies indicate that bacteriophage-mediated lysis supports the growth of amino acid auxotrophs more effectively than mechanical lysis or active secretion66, and prophage induction may facilitate the release of vitamin B12 from de novo synthesizers67. Collectively, these findings underscore the critical role of both active and passive mechanisms in redistributing essential metabolites and shaping aquatic microbial communities.

Regardless of the mode of metabolite release, the positive correlation between estimated genome size and the biosynthetic potential for essential metabolites in high-quality genomes is consistent with metabolic interdependencies. Moreover, when we focus on abundant and prevalent species that consistently co-occur with the same neighbors, a pattern of functional complementarity emerges. For instance, of the four analyzed cohorts, only a limited number of members have complete biosynthetic pathways for vitamins B2, B5, B12 and K2, while many members within the same cohort lack these capabilities (see cohort 3 in Fig. 4, and cohorts 1, 2 and 6 in Figures S14-S16). These patterns of metabolic complementarity and low anabolic independence for key metabolic functions suggest that different microorganisms in natural communities could specialize in distinct biosynthetic roles, relying on their co-occurrent neighbors to supply those essential metabolites they cannot synthesize on their own, as previously hypothesized26. Similar patterns have been observed in marine environments, where metagenomics has unveiled that the exchange of B1 and B12 vitamins might be key for the co-occurrence of partial synthesizers68. In summary, the maintenance of the biosynthetic pathways appears to be usage-dependent, with nucleotide biosynthesis being more often complete, while vitamin biosynthesis pathways are often less complete, probably because vitamins can be reused rather than needing to be incorporated into the cell biomass (Figs. 3 and 4).

Fig. 4.


Overview of module completeness (%; rows in the heatmap) for biosynthesis of amino acids, nucleotides, and vitamins across the species-clusters (columns) in cohort 3. Module completeness is colored in yellow between 0 and 30%, green between 30 and 70%, light blue between 70 and 100%, and dark blue for 100%. We include information on average relative abundance (%), prevalence (%), estimated genome size (Mbp), and genome completeness (%), according to the legend to the right of the figure. Overviews for cohorts 1, 2 and 6 can be found in Figures S14-S16.

Notably among all vitamins, vitamin B12 de novo biosynthesis shows the lowest average completeness per species, with an average completeness of 23.20% for the anaerobic pathway [M00122 + M00924], and an average completeness of 21.68% for the aerobic pathway [M00122 + M00925] (Table S8). In our dataset, species potentially capable of de novo B12 biosynthesis represent less than 6% of those in the co-occurrence network, have a larger average estimated genome size (3.43 Mbp) than non-biosynthesizers (2.64 Mbp), and span over seven different phyla, including Pseudomonadota, Chloroflexota, Desulfobacterota and Cyanobacteriota (Figure S17). While vitamin B12 is essential for a variety of microorganisms, facultative species that do not depend on it for growth can still utilize it when available69. In summary, only a minority of cohort members can biosynthesize vitamin B12 de novo, reflecting potential interdependencies driven by low anabolic independence of abundant and prevalent microorganisms in these communities. Future work focusing on the direct evidence of metabolic “cross-feeding” via experimental co-cultures with or without phages, or metatranscriptomic validation will provide a more rounded view on the importance of metabolic exchange in freshwater community assembly.

Finally, we examined functions related to regulation (e.g., sigma factors and two-component systems), structure, and secondary metabolism across the 4,725 high-quality representative genomes in the ‘FRESH-MAP’ dataset (Table S9). We observed a positive correlation between estimated genome size and number of KEGG KOs70 per Mbp associated with regulatory functions. This trend holds whether or not genomes with zero KOs per Mbp for each specific function were included (Figs. 5A and 5B). These findings indicate that larger genomes are enriched with regulatory genes, supporting a recent survey of 44 European lakes that linked larger estimated genome sizes with a higher number of pathways involved in regulation and environmental interaction71. In contrast, the relationship differed for flagella and secondary metabolism pathways. While a positive correlation between estimated genome size and KOs per Mbp was observed when all genomes were considered, the correlation was negative when species-clusters with zero KOs per Mbp were excluded for each specific function (Fig. 5C–5E). These results suggest that prokaryotes with reduced genomes might adopt one or both strategies: either highly compacting their genomes to achieve a higher functional density, and/or completely losing the genes involved in secondary metabolism and motility functions. Conversely, we found a negative correlation between estimated genome size and carbon fixation (Fig. 5F), mirroring previous findings from a metagenomic study of the brackish Baltic Sea72. Together, these results shed light on the divergent selective pressures behind genome streamlining to adapt to varying ecological roles and metabolic demands in freshwater ecosystems.
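The contrast between regressions fitted with and without zero-KO genomes can be reproduced on synthetic data. The sketch below normalizes KO counts by estimated genome size (KOs per Mbp) and fits an ordinary least-squares line twice, once on all genomes and once excluding genomes with zero KOs for the function. The helper names and all numbers are invented for illustration, and ordinary least squares is assumed as the "linear regression" mentioned in the Fig. 5 caption.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def ko_density_slopes(genome_size_mbp, ko_counts):
    """Return (slope using all genomes, slope excluding zero-KO genomes)."""
    density = [k / s for k, s in zip(ko_counts, genome_size_mbp)]
    slope_all = ols_slope(genome_size_mbp, density)
    kept = [(s, d) for s, d in zip(genome_size_mbp, density) if d > 0]
    slope_nonzero = ols_slope([s for s, _ in kept], [d for _, d in kept])
    return slope_all, slope_nonzero


# Synthetic example: most small genomes lack the function entirely, while the
# one small genome that keeps it carries it at a high density.
sizes = [1.0, 1.2, 1.5, 1.1, 3.0, 5.0, 6.0]   # estimated genome size (Mbp)
kos = [0, 0, 0, 7, 12, 15, 15]                 # KOs assigned to the function

slope_all, slope_nonzero = ko_density_slopes(sizes, kos)
print(slope_all, slope_nonzero)
```

Here the slope is positive when the zero-KO genomes are included and negative once they are dropped, which is the pattern the text interprets as two routes: complete loss of a function, or retention at a higher functional density.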

Fig. 5.


Overview of the relationship between estimated genome size and the genetic potential for catabolic and structural functions, expressed as the number of KEGG orthologs (KOs) per Mbp, across all 4,725 high-quality representative genomes (completeness > 90% and contamination < 5%) from the FRESH-MAP database. Analyzed functions include sigma factors (A), two-component systems (B), flagella (C), nitrogen cycle (D), sulfur cycle (E), and carbon fixation (F). The number at the top-right of each panel indicates the total number of KOs per category. Solid regression lines are fitted to all datapoints (i.e., all genomes), and dashed regression lines exclude those datapoints where 0 KOs per Mbp for that function were detected. Different colors in A-F refer to different phyla according to the legend at the bottom of the figure.

Conclusions 

In our study, we show that freshwater prokaryotes with smaller genomes are often more prevalent and more abundant across metagenomic samples. First, we note that the correlation between genome size and prevalence is strongly linked with the co-occurrence of prokaryotic species in a given community. Although co-occurrence networks favor the inclusion of organisms with larger relative abundances (and often with smaller genomes), we observe a strong positive correlation between the degree of connectedness of the species-clusters in the network and their prevalence. Second, we observe that prokaryotes with smaller genomes have lower pathway completeness for the biosynthesis of essential metabolites, potentially indicating metabolic interdependencies. Third, we observe that prokaryotes with reduced genomes may follow one or both of two strategies for optimizing genome size and secondary metabolism functions: genome compaction and/or complete gene loss. Overall, our results provide novel insights into the effect of streamlining and biotic interactions on the ecology of freshwater prokaryotes.

Material and methods

Metagenome sequencing, assembly and binning of MAGs from an anthropogenic pond

Two samples were collected on July 23rd 2021 from an anthropogenic pond in Stadsträdgården, Uppsala (Sweden). Sampling, DNA extractions, library preparation, sequencing, assembly and binning of reads followed the published workflow by Rodríguez-Gijón and Hampel (2023) [73]. In brief, we extracted DNA with two different methods from duplicate filters from the same location: the DNeasy PowerWater kit (Qiagen) extraction protocol for Sample_112_S83, and the FastDNA® SPIN Kit for soil (MP Biomedicals) extraction protocol for Sample_101_S1. Sequence libraries were prepared using SMARTer Thruplex library preparation (350 bp average fragment size) at the National Genomics Infrastructure at the Science for Life Laboratory (SciLifeLab) in Stockholm (Sweden). Sequencing was performed using the Illumina NovaSeq 6000 platform on an S4 v1.5 flowcell in 300 cycle mode (2 × 150 bp). Metadata and accession numbers of these samples can be found in Table S10.

Raw sequence read processing was performed using the metaWRAP pipeline v1.3.2 [74], in which assembly was performed with MEGAHIT [75] and binning with CONCOCT v1.0 [76], MetaBAT2 v2.12.1 [77], and MaxBin2 v2.2.6 [78]. Metagenomic bins generated from these tools were consolidated and refined using the “metaWRAP_bin_refinement” script, and the quality of the resulting metagenome-assembled genomes (MAGs) was assessed using CheckM v1.1.3 [79]. In total, we obtained 52 medium-to-high quality MAGs (completeness > 50% and contamination < 5%), with an average completeness of 76.26% and average contamination of 2.15% (Figure S18).

Re-binning of StratFreshDB metagenomes

We retrieved the metagenomic sequence reads and assemblies of 267 samples from stratified freshwater bodies from the StratFreshDB (BioProject accession PRJEB38681) [80]. We excluded sediment samples and performed metagenome-resolved genomics with the remaining 258 metagenomes (Table S11). Poor quality sequences were removed by trimming the paired-end reads using Trimmomatic v0.36 with the options: ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 [81].

Multiple metagenomes were sequenced from the same lake or pond (28/40 sampling sites; Table S11) at different timepoints, water column depths, or sampling sites [80]. Thus, we expect the same species-clusters to be present across different samples, allowing the use of differential coverage binning to improve MAG retrieval and quality from the metagenome assemblies. To accomplish this, each of the 258 sets of sequence reads was mapped to each individual metagenomic assembly using Minimap2 v2.24 [82]. The resulting SAM files were then sorted and converted to BAM format using SAMtools v1.14 [83]. Depth coverage profiles were generated for each combination of metagenome assembly contigs and sequence reads using the MetaBAT2 v2.12.1 utility script “jgi_summarize_bam_contig_depths” with the option “--outputDepth” [77]. A custom script was then used to combine individual outputs into a set of three different depth profiles for each metagenome assembly: a “single” depth profile with coverage in the respective sample, a “site” depth profile with coverage across all samples from the same sampling site (Table S11), and an “all” depth profile with coverage across all 258 samples (“make_depth_summaries.py” script in https://github.com/jennahd/meta-utils).
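To illustrate the idea behind the three depth profiles, the following Python sketch selects per-contig coverage columns for the “single”, “site”, and “all” cases. It is a simplified, hypothetical reconstruction for illustration only and is not the authors' “make_depth_summaries.py” script; the function name, the input dictionaries, and the sample and site labels are all assumptions.

```python
def build_depth_profiles(depths, sample_site, assembly_sample):
    """Group per-contig coverage into 'single', 'site' and 'all' profiles.

    depths          -- {contig: {sample: mean depth}} for one assembly
    sample_site     -- {sample: sampling site} for every mapped sample
    assembly_sample -- the sample the assembly itself was built from
    """
    site = sample_site[assembly_sample]
    site_samples = sorted(s for s, st in sample_site.items() if st == site)
    all_samples = sorted(sample_site)

    profiles = {"single": {}, "site": {}, "all": {}}
    for contig, per_sample in depths.items():
        # Coverage in the sample the assembly came from.
        profiles["single"][contig] = [per_sample.get(assembly_sample, 0.0)]
        # Coverage across all samples from the same sampling site.
        profiles["site"][contig] = [per_sample.get(s, 0.0) for s in site_samples]
        # Coverage across every sample in the dataset.
        profiles["all"][contig] = [per_sample.get(s, 0.0) for s in all_samples]
    return profiles


# Toy example with three samples from two hypothetical sites.
depths = {"contig_1": {"A1": 10.0, "A2": 4.0, "B1": 0.5}}
sample_site = {"A1": "lake_A", "A2": "lake_A", "B1": "lake_B"}
profiles = build_depth_profiles(depths, sample_site, "A1")
print(profiles)
```

Each profile would then be passed to the binner separately, so that every assembly is binned three times with increasing amounts of differential coverage information.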

Contig binning was performed with each of the three depth profiles for each assembly using MetaBAT2 v2.12.1 with the options “--maxP 93 --minS 50 -s 50000 -m 1500” [77]. No bins were retrieved for samples E4, F3, and UppL2, which had small assemblies and few sequence reads. Bin sets generated with the three different coverage profiles were then consolidated using the metaWRAP v1.3.2 “bin_refinement” module, where the corresponding highest quality hybridized or original bin was kept from the combined sets. MAGs with completeness above 40% and contamination below 5%, based on CheckM v1.0.12 (which is included in the metaWRAP “bin_refinement” pipeline), were retained [79]. The quality of the resulting MAGs was then compared to the original StratFreshDB MAGs [80]. Only original StratFreshDB MAGs from the same set of metagenomes, and with completeness above 40% and contamination below 5% based on CheckM v1.0.12 as implemented in the metaWRAP v1.3.2 “bin_refinement” module, were considered. Genome statistics for all re-binned MAGs can be found in Table S12.

In total, we obtained 11,146 re-binned MAGs with an average completeness of 74.7% and an average contamination of 1.84%, while the 7,838 MAGs from the original publication had an average completeness of 76.9% and an average contamination of 2.10% (Table S12). While the average completeness and contamination between the original and re-binned MAGs are comparable, the number of MAGs obtained that meet the quality thresholds increased by 42.2% using our differential coverage binning method, and the number of high-quality MAGs with completeness ≥ 90% increased by 17% (Table S12). Across all metagenomes, the number of MAGs retrieved from re-binning was significantly higher than the number of original MAGs (Figure S19). Thus, re-binning improved the retrieval of MAGs across metagenomes.

Collection of publicly available genomes

We downloaded 70,954 publicly available genomes, including MAGs, single-amplified genomes (SAGs), and genomes from isolates, from approximately 590 different publications and/or BioProjects (Table S1). These genomes were downloaded from the NCBI database by using their assembly accessions with the Datasets CLI tools v14.7.0 (https://github.com/ncbi/datasets). Although a large proportion of the MAGs were retrieved from metagenomic surveys or isolated cultures from freshwater environments (Table S1), we also added non-freshwater MAGs from different projects, such as the GEMs catalog [84]. Together with the newly binned and re-binned MAGs in our study, we leveraged 80,561 genomes of medium-to-high quality (completeness > 50% and contamination < 5%) (Table S1). To estimate the quality of all genomes, we first classified them taxonomically using GTDB-Tk v2.1.1 [85] according to the GTDB classification r207 [54]. Genome quality was then estimated using CheckM v1.1.3 [79] following the typical workflow (“lineage_wf”), except for genomes classified as phyla Actinomycetota and Ca. Patescibacteria, as previous work showed that genome quality estimates for these two groups improved when using custom marker genes [72]. Custom marker gene sets for both phyla were provided by CheckM [79,86]. We then estimated the genome size of all 80,561 genomes by dividing the assembly size by its completeness (expressed as a fraction between 0 and 1) as provided by CheckM [79]. All medium-to-high quality genomes were then de-replicated using fastANI (ANI > 95%) and mOTUpan v0.3.2 (“mOTUlize.py” pipeline) [87,88]. In total, we obtained 24,050 species-clusters, each with one species representative of the highest quality (Table S2).
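The genome size estimate and the quality thresholds described above can be sketched in a few lines of Python. The function names and the example values are hypothetical and only illustrate the arithmetic; completeness and contamination are assumed to be given as percentages, as reported by CheckM.

```python
def estimated_genome_size(assembly_size_bp, completeness_pct):
    """Estimate genome size as assembly size divided by fractional completeness."""
    if not 0 < completeness_pct <= 100:
        raise ValueError("completeness must be in the interval (0, 100]")
    return assembly_size_bp / (completeness_pct / 100.0)


def is_medium_to_high_quality(completeness_pct, contamination_pct):
    """Apply the thresholds stated in the text: completeness > 50%, contamination < 5%."""
    return completeness_pct > 50 and contamination_pct < 5


# Toy example (assumed values): a 1.5 Mbp assembly that is 75% complete.
size = estimated_genome_size(1_500_000, 75.0)
print(size)  # 2000000.0
print(is_medium_to_high_quality(75.0, 2.1))  # True
```

Note that the estimate grows as completeness drops, so a less complete assembly contributes a more uncertain genome size.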

Competitive mapping and relative abundance estimations

We compiled a dataset of 636 short-read metagenomes from globally distributed freshwater environments, of which 72 metagenomes belong to the hypolimnion of 13 freshwater lakes. Metadata, accession numbers, and BioProjects can be found in Table S3. The FastQ files of the metagenomes were downloaded using the “SRA.download.bash” script from the Enveomics collection [89], and the raw metagenomic reads of all metagenomes were trimmed using the Microbial Genomes Atlas (MiGA) v1.3.8.2 [90]. We first created a MiGA environment (“miga new”), into which the FastQ files were copied (“miga add”). Then, all FastQ files were trimmed (“miga run -r trimmed_reads”), and the statistics were calculated using the function “miga summary”.

To estimate the relative abundance of all 24,050 species-clusters across the trimmed metagenomic reads, we used Strobealign v0.11.0 [91]. All representative genomes were concatenated into the same fna file using the “FastA.tag.rb” script from the Enveomics collection [89], and mapping indexes were created. As our metagenome dataset is composed of FastQ files produced using different sequencing read lengths, we computed seven different indexes with different read lengths (“strobealign --create-index -r 50/100/125/150/250/300/400”). These indexes and the concatenated fna file were used to compute the mapping (“strobealign --use-index”). The resulting SAM files were then converted into sorted BAM files using SAMtools v1.17 [83]. To remove outlier mapping results, we calculated the Truncated Average Depth 80% (TAD80), which eliminates the 10% highest and the 10% lowest mapping depth values per metagenome sample, using the “BedGraph.tad.rb” script from the Enveomics collection [89].
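The truncation step behind TAD80 can be sketched as follows, assuming a list of per-position sequencing depths for one genome in one metagenome. This is a simplified illustration in Python and not the Enveomics “BedGraph.tad.rb” implementation; the function name, the integer percentage parameter, and the depth values are assumptions.

```python
def truncated_average_depth(depths, keep_pct=80):
    """Average depth after discarding the highest and lowest tails.

    With keep_pct=80, the 10% lowest and the 10% highest values are dropped
    and the mean of the central 80% is returned.
    """
    if not depths:
        return 0.0
    ordered = sorted(depths)
    n = len(ordered)
    # Number of values to drop from each end (integer arithmetic avoids
    # floating-point rounding surprises).
    trim = n * (100 - keep_pct) // 200
    kept = ordered[trim:n - trim]
    return sum(kept) / len(kept)


# Toy depths (assumed values) with one uncovered position and one spike.
depths = [0, 1, 5, 5, 5, 5, 5, 5, 9, 100]
tad80 = truncated_average_depth(depths)
plain_mean = sum(depths) / len(depths)
print(tad80, plain_mean)  # 5.0 14.0
```

The spike at depth 100, for instance from a conserved or repeated region, inflates the plain mean but has no effect on the truncated average.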

We also estimated the genome equivalents (defined as the total number of sequenced bp in the trimmed FastQ file divided by the average genome size in the metagenome) for each trimmed metagenome using MicrobeCensus v1.1.0 [92]. MicrobeCensus aligns the trimmed reads against a set of universal single-copy genes, and estimates the average genome length of the microbial community as inversely proportional to the number of hits for these genes. Lastly, the relative abundance of each species-cluster was estimated by dividing the TAD80 score by the number of genome equivalents. Mapping statistics can be found in Table S4, while the relative abundance of all 24,050 species-clusters across the 636 metagenomes can be found in Table S5. The scripts used for the estimation of relative abundance can be found in https://github.com/alejandrorgijon/competitive_mapping_scripts.
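As a worked example of this normalization, the Python sketch below computes genome equivalents and the resulting relative abundance. The function names and all numbers are hypothetical; in the study, the average genome size comes from MicrobeCensus and the TAD80 from the mapping step.

```python
import math


def genome_equivalents(total_bp, average_genome_size_bp):
    """Total sequenced bp divided by the average genome size of the community."""
    return total_bp / average_genome_size_bp


def relative_abundance(tad80, n_genome_equivalents):
    """TAD80 of a species-cluster normalized by the genome equivalents of the sample."""
    return tad80 / n_genome_equivalents


# Toy example (assumed values): 3 Gbp of trimmed reads, 3 Mbp average genome size.
ge = genome_equivalents(3_000_000_000, 3_000_000)
abundance = relative_abundance(5.0, ge)
print(ge, abundance)
```

Here the sample holds 1,000 genome equivalents, so a species-cluster with a TAD80 of 5 accounts for a relative abundance of 0.005.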

Co-occurrence network prediction

To predict co-occurrence between species-clusters based on the relative abundances obtained after mapping (Table S6), we used the FastSpar implementation [93] of the SparCC algorithm [94]. SparCC infers co-occurrences based on correlations within compositional data, which include co-occurrences due to both shared environmental preferences and potential biotic interactions. We selected only species-clusters present in at least 3 metagenomes with an overall relative abundance higher than 10⁻⁴. We calculated p-values for co-occurrences via bootstrapping, by running SparCC with 50 iteration rounds on the shuffled abundance matrix 500 times. The p-value was defined as the proportion of bootstrapped correlation values that were at least as high as the value computed for the unshuffled data [94]. For further analyses, we only included significant positive co-occurrences (p-value < 0.05, corSparCC > 0.4). We computed network modularity based on hierarchical agglomeration clustering [95] and inferred its significance by comparing it to the modularity calculated in 500 randomly rewired networks (preserving degree distribution and using 1000 rewiring iterations). To remove spurious clusters, only observed network clusters with at least six members were kept. We call these clusters “cohorts”: groups of organisms that co-occur and vary together in space and time, and that show more correlations with each other than with organisms from other cohorts. The degree of connectedness was inferred as the number of edges of each node within each cohort. For that, we subset the network to contain only nodes from a single cohort and calculated the degree for each node. The co-occurrence network was visualized in R using the package ggnetwork [96]. Preferred environmental conditions for each cohort were calculated as the average of each environmental parameter (i.e., absolute latitude, temperature and oxygen concentration) across samples, weighted by the relative abundance of the cohort in each sample.
The standardized environmental preferences were visualized in a heatmap using the R package pheatmap [97]. Parameter values were standardized by z-scoring to allow comparisons between parameters:

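$$Z\text{-score} = \frac{x - \mu}{\sigma}$$

where x is the value of a parameter, and μ and σ are the mean and standard deviation of that parameter.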
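The abundance-weighted environmental preference and the z-score standardization can be sketched in Python as follows. The function names and values are hypothetical, and the use of the sample standard deviation (n − 1 denominator, as in R's default) is an assumption rather than something stated in the text.

```python
import math
from statistics import mean, stdev


def weighted_preference(abundances, values):
    """Average of an environmental parameter weighted by cohort relative abundance."""
    total = sum(abundances)
    return sum(a * v for a, v in zip(abundances, values)) / total


def z_scores(values):
    """Standardize values to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


# Toy example (assumed values): one cohort in three samples with known temperature.
cohort_abundance = [0.1, 0.3, 0.6]
temperature = [4.0, 10.0, 20.0]
preferred_temperature = weighted_preference(cohort_abundance, temperature)

# Standardize the preferred temperatures of three hypothetical cohorts.
standardized = z_scores([10.0, 15.0, 20.0])
print(preferred_temperature, standardized)
```

Because each parameter is standardized separately, latitude, temperature and oxygen can share a single color scale in a heatmap despite having different units.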

Metabolic annotation

To estimate metabolic potential, we selected all 4,725 high-quality representative genomes (completeness > 90% and contamination < 5%) from the ‘FRESH-MAP’ dataset (Tables S2 and S5). We used Anvi’o v7.1 [98] to reformat the FastA files (“anvi-script-reformat-fasta”) and create contig databases (“anvi-gen-contigs-database”). We then identified KEGG pathways and KEGG orthologs (KOs) [70] present in our genomes, and subset the metabolic modules for amino acids, nucleotides and vitamins (“anvi-run-kegg-kofams” and “anvi-estimate-metabolism”) [99]. To study the biosynthetic potential for these metabolites, we selected only those modules for which 1) at least one genome in our dataset had the complete pathway (i.e., 100% completeness), and 2) at least 20% of the genomes had a completeness for that given module > 0%. Completeness of biosynthetic modules can be found in Table S8, and presence of KOs can be found in Table S9.
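The two module selection criteria can be expressed as a short filter. The Python sketch below is a minimal illustration under stated assumptions: the input is a dictionary of per-genome completeness percentages per module, and the module names and values are invented placeholders rather than real KEGG module identifiers or results.

```python
def select_modules(completeness, min_fraction_nonzero=0.20):
    """Keep modules with at least one complete genome and enough non-zero genomes.

    completeness -- {module: [completeness in % for each genome]}
    """
    selected = []
    for module, values in completeness.items():
        # Criterion 1: at least one genome encodes the complete pathway.
        has_complete = any(v >= 100 for v in values)
        # Criterion 2: at least 20% of genomes have completeness > 0%.
        fraction_nonzero = sum(v > 0 for v in values) / len(values)
        if has_complete and fraction_nonzero >= min_fraction_nonzero:
            selected.append(module)
    return selected


# Toy completeness table (assumed values) for five genomes.
completeness = {
    "module_A": [100, 80, 0, 50, 0],  # complete in one genome, 60% non-zero
    "module_B": [90, 80, 70, 60, 50],  # never complete
    "module_C": [100, 0, 0, 0, 0, 0],  # complete once, but only ~17% non-zero
}
selected = select_modules(completeness)
print(selected)  # ['module_A']
```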

Statistical analysis

Figures were created in R v4.3.2 [100] using the package ggplot2 v3.4.4 [101]. Linear regression statistics were calculated to test the fit of our data to linear regressions in scatterplots (Figs. 2, 3, 5, S3, S4, S11, S13 and S17) using the functions “stat_regline_equation” and “stat_cor” (Pearson’s correlation coefficient) from the R package ggpubr v0.6.0 [102]. Statistical differences between groups in boxplots (Figs. 1, 2, S3, S8-S10, S13 and S19) were tested using the function “stat_compare_means” implemented in ggpubr v0.6.0 [102].

To investigate genome size variability in Figure S11, we calculated the mean estimated genome size per genus and the corresponding standard deviation using the function “mean” from the R package base v4.3.2 and the function “sd” from the R package stats v4.3.2 [100]. The standard deviation was also used to calculate the variance (sd²) and the coefficient of variation (CV), as indicated below:

$$CV = \frac{sd}{\text{mean}}$$
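A minimal Python sketch of these per-genus statistics is given below. It assumes the CV is the plain ratio of standard deviation to mean (not multiplied by 100) and uses the sample standard deviation, matching the default behavior of R's “sd”; the function name, genus labels and genome sizes are hypothetical.

```python
import math
from statistics import mean, stdev


def genus_size_stats(sizes_by_genus):
    """Mean, standard deviation, variance and CV of genome size per genus."""
    stats = {}
    for genus, sizes in sizes_by_genus.items():
        if len(sizes) < 2:
            continue  # the sample standard deviation needs at least two genomes
        m = mean(sizes)
        sd = stdev(sizes)
        stats[genus] = {"mean": m, "sd": sd, "variance": sd ** 2, "cv": sd / m}
    return stats


# Toy estimated genome sizes in Mbp (assumed values).
sizes_by_genus = {
    "genus_X": [2.0, 3.0, 4.0],
    "genus_Y": [1.5],  # skipped: only one genome
}
stats = genus_size_stats(sizes_by_genus)
print(stats)
```

Dividing by the mean makes the variability comparable between genera with small and large genomes.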

Supplementary Information

Acknowledgements

This work was supported by SciLifeLab. The authors acknowledge support from SNIC/Uppsala Multidisciplinary Center for Advanced Computational Science for access to the UPPMAX computational infrastructure. Computational work and data handling were enabled by resources in the projects SNIC 2022/5-392, 2023/5-126 and 2023/5-379 provided by the Swedish National Infrastructure for Computing (SNIC), partially funded by the Swedish Research Council through grant agreement no. 2018-05973. The computational results presented here have been achieved partially using the LEO HPC infrastructure of the University of Innsbruck. Additional computational resources were supported by the US National Science Foundation (NSF) through the ACCESS program with allocation MCB190042.

Author contributions

AR-G and SLG conceptualized and designed the research project. AR-G, LMR-R, and SLG refined the project idea. AR-G, JJH, JD, SLG, APV, and LMR-R compiled and curated the data. JJH performed the DNA extractions. AR-G, FM, and LMR-R performed the bioinformatic analyses, and JED re-binned the StratFreshDB pelagic metagenomes. AR-G and SLG performed the data analysis. AR-G drafted the first manuscript. All authors conducted literature searches, and edited and reviewed the manuscript.

Funding

Open access funding provided by Stockholm University. National Science Foundation, MCB190042; Vetenskapsrådet, 2018-05973; SNIC/Uppsala Multidisciplinary Center for Advanced Computational Science, SNIC 2022/5-392, 2023/5-126 and 2023/5-379.

Data availability

The paired sequences of both metagenomic samples from the pond in Stadsträdgården, Uppsala (Sweden) and the 52 medium-to-high-quality MAGs have been deposited under the NCBI BioProject PRJNA1045862 [103]. The 11,146 re-binned genomes from the raw metagenomic reads of the StratFreshDB are available through the Figshare data repository https://figshare.com/s/9af0a87d5fa6b80017f8. The 9,028 representative genomes of the FRESH-MAP dataset are available through the Figshare data repository 10.17044/scilifelab.28327964.v1 [33]. Information about the original publication of all genomes and metagenomes obtained from public repositories can be found in Tables S1 and S3.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Alejandro Rodríguez-Gijón, Email: alejandro.rgijon@gmail.com.

Luis Miguel Rodriguez-R., Email: lmrodriguezr@gmail.com.

Sarahi L. Garcia, Email: sarahi.garcia@uol.de

References

  • 1. Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 14, 1247–1259 (2020).
  • 2. Martinez-Gutierrez, C. A. & Aylward, F. O. Genome size distributions in bacteria and archaea are strongly linked to evolutionary history at broad phylogenetic scales. PLoS Genet. 18, e1010220 (2022).
  • 3. Konstantinidis, K. T. & Tiedje, J. M. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc. Natl. Acad. Sci. 101, 3160–3165 (2004).
  • 4. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
  • 5. Rodríguez-Gijón, A. et al. A genomic perspective across Earth’s microbiomes reveals that genome size in Archaea and Bacteria is linked to ecosystem type and trophic strategy. Front. Microbiol. 12, 761869 (2022).
  • 6. Chiriac, M., Haber, M. & Salcher, M. M. Adaptive genetic traits in pelagic freshwater microbes. Environ. Microbiol. 25, 606–641 (2023).
  • 7. Giovannoni, S. J., Cameron Thrash, J. & Temperton, B. Implications of streamlining theory for microbial ecology. ISME J. 8, 1553–1565 (2014).
  • 8. Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
  • 9. Kettler, G. C. et al. Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 3, e231 (2007).
  • 10. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
  • 11. Swan, B. K. et al. Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proc. Natl. Acad. Sci. 110, 11463–11468 (2013).
  • 12. Garcia, S. L. et al. Contrasting patterns of genome-level diversity across distinct co-occurring bacterial populations. ISME J. 12, 742–755 (2018).
  • 13. Cabello-Yeves, P. J., Picazo, A., Roda-Garcia, J. J., Rodriguez-Valera, F. & Camacho, A. Vertical niche occupation and potential metabolic interplay of microbial consortia in a deeply stratified meromictic model lake. Limnol. Oceanogr. 68, 2492–2511 (2023).
  • 14. Cabello-Yeves, P. J. et al. Microbiome of the deep Lake Baikal, a unique oxic bathypelagic habitat. Limnol. Oceanogr. 65, 1471–1488 (2020).
  • 15. Okazaki, Y., Nakano, S., Toyoda, A. & Tamaki, H. Long-Read-Resolved, Ecosystem-Wide Exploration of Nucleotide and Structural Microdiversity of Lake Bacterioplankton Genomes. MSystems 7, e00433-e522 (2022).
  • 16. Rohwer, R. R. et al. Two decades of bacterial ecology and evolution in a freshwater lake. Nat. Microbiol. 10, 246–257 (2025).
  • 17. Freel, K. C. et al. New isolate genomes and global marine metagenomes resolve ecologically relevant units of SAR11. BioRxiv 10.1101/2024.12.24.630191 (2024).
  • 18. Giovannoni, S. J. Genome streamlining in a cosmopolitan oceanic bacterium. Science 309, 1242–1245 (2005).
  • 19. Grote, J. et al. Streamlining and core genome conservation among highly divergent members of the SAR11 clade. MBio 3, e00252-e312 (2012).
  • 20. Giovannoni, S. J. et al. The small genome of an abundant coastal ocean methylotroph. Environ. Microbiol. 10, 1771–1782 (2008).
  • 21. Kim, S., Kang, I., Seo, J.-H. & Cho, J.-C. Culturing the ubiquitous freshwater actinobacterial acI lineage by supplying a biochemical ‘helper’ catalase. ISME J. 13, 2252–2263 (2019).
  • 22. Neuenschwander, S. M., Ghai, R., Pernthaler, J. & Salcher, M. M. Microdiversification in genome-streamlined ubiquitous freshwater Actinobacteria. ISME J. 12, 185–198 (2018).
  • 23. Mee, M. T., Collins, J. J., Church, G. M. & Wang, H. H. Syntrophic exchange in synthetic microbial communities. Proc. Natl. Acad. Sci. U.S.A. 111, E2149–E2156 (2014).
  • 24. Ramoneda, J., Jensen, T. B. N., Price, M. N., Casamayor, E. O. & Fierer, N. Taxonomic and environmental distribution of bacterial amino acid auxotrophies. Nat. Commun. 14, 7608 (2023).
  • 25. Zengler, K. & Zaramela, L. S. The social network of microorganisms — how auxotrophies shape complex communities. Nat. Rev. Microbiol. 16, 383–390 (2018).
  • 26. Morris, J. J., Lenski, R. E. & Zinser, E. R. The black queen hypothesis: Evolution of dependencies through adaptive gene loss. MBio 3, e00036-e112 (2012).
  • 27. Kost, C., Patil, K. R., Friedman, J., Garcia, S. L. & Ralser, M. Metabolic exchanges are ubiquitous in natural microbial communities. Nat. Microbiol. 8, 2244–2252 (2023).
  • 28. Zelezniak, A. et al. Metabolic dependencies drive species co-occurrence in diverse microbial communities. Proc. Natl. Acad. Sci. U.S.A. 112, 6449–6454 (2015).
  • 29. Church, M., Jenkins, B., Karl, D. & Zehr, J. Vertical distributions of nitrogen-fixing phylotypes at Stn Aloha in the oligotrophic North Pacific Ocean. Aquat. Microb. Ecol. 38, 3–14 (2005).
  • 30. Bergman, B., Sandh, G., Lin, S., Larsson, J. & Carpenter, E. J. Trichodesmium – a widespread marine cyanobacterium with unusual nitrogen fixation properties. FEMS Microbiol. Rev. 37, 286–302 (2013).
  • 31. Hoetzinger, M. et al. Geographic population structure and distinct intra-population dynamics of globally abundant freshwater bacteria. ISME J. 18, wrae113 (2024).
  • 32. Giordano, N. et al. Genome-scale community modelling reveals conserved metabolic cross-feedings in epipelagic bacterioplankton communities. Nat. Commun. 15, 2721 (2024).
  • 33. Garcia, S. & Rodríguez-Gijón, A. FRESH-MAP dataset: study on the ecological success of streamlined aquatic microorganisms. 7881929408 Bytes Stockholm University 10.17044/SCILIFELAB.28327964.V1 (2025).
  • 34. Garner, R. E. et al. A genome catalogue of lake bacterial diversity and its drivers at continental scale. Nat. Microbiol. 8, 1920–1934 (2023).
  • 35. Newton, R. J., Jones, S. E., Eiler, A., McMahon, K. D. & Bertilsson, S. A guide to the natural history of freshwater lake bacteria. Microbiol. Mol. Biol. Rev. 75, 14–49 (2011).
  • 36. Dupont, C. L. et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J. 6, 1186–1199 (2012).
  • 37. Rodriguez-R, L. M., Tsementzi, D., Luo, C. & Konstantinidis, K. T. Iterative subtractive binning of freshwater chronoseries metagenomes identifies over 400 novel species and their ecologic preferences. Environ. Microbiol. 22, 3394–3412 (2020).
  • 38. Nuy, J. K., Hoetzinger, M., Hahn, M. W., Beisser, D. & Boenigk, J. Ecological differentiation in two major freshwater bacterial taxa along environmental gradients. Front. Microbiol. 11, 154 (2020).
  • 39. Lemos, L. N. et al. Genomic signatures and co-occurrence patterns of the ultra-small Saccharimonadia (phylum CPR/Patescibacteria) suggest a symbiotic lifestyle. Mol. Ecol. 28, 4259–4271 (2019).
  • 40. Nelson, W. C. & Stegen, J. C. The reduced genomes of Parcubacteria (OD1) contain signatures of a symbiotic lifestyle. Front. Microbiol. 6, 713 (2015).
  • 41. Peura, S. et al. Novel autotrophic organisms contribute significantly to the internal carbon cycling potential of a boreal lake. MBio 9, e00916-e918 (2018).
  • 42. Peura, S. et al. Distinct and diverse anaerobic bacterial communities in boreal lakes dominated by candidate division OD1. ISME J. 6, 1640–1652 (2012).
  • 43. Nemergut, D. R. et al. Patterns and Processes of microbial community assembly. Microbiol. Mol. Biol. Rev. 77, 342–356 (2013).
  • 44. Lynch, M. D. J. & Neufeld, J. D. Ecology and exploration of the rare biosphere. Nat. Rev. Microbiol. 13, 217–229 (2015).
  • 45. Chiriac, M.-C. et al. Ecogenomics sheds light on diverse lifestyle strategies in freshwater CPR. Microbiome 10, 84 (2022).
  • 46. Ghylin, T. W. et al. Comparative single-cell genomics reveals potential ecological niches for the freshwater acI Actinobacteria lineage. ISME J. 8, 2503–2516 (2014).
  • 47. Whitney, K. D. & Garland, T. Did genetic drift drive increases in genome complexity? PLoS Genet. 6, e1001080 (2010).
  • 48. Bünger, W., Jiang, X., Müller, J., Hurek, T. & Reinhold-Hurek, B. Novel cultivated endophytic Verrucomicrobia reveal deep-rooting traits of bacteria to associate with plants. Sci. Rep. 10, 8692 (2020).
  • 49. Cabello-Yeves, P. J. et al. Reconstruction of diverse verrucomicrobial genomes from metagenome datasets of freshwater reservoirs. Front. Microbiol. 8, 2131 (2017).
  • 50. Zhuang, J., Zhou, Y., Liu, Y. & Li, W. Flocs are the main source of nitrous oxide in a high-rate anammox granular sludge reactor: Insights from metagenomics and fed-batch experiments. Water Res. 186, 116321 (2020).
  • 51. He, C. et al. Genome-resolved metagenomics reveals site-specific diversity of episymbiotic CPR bacteria and DPANN archaea in groundwater ecosystems. Nat. Microbiol. 6, 354–365 (2021).
  • 52. Henson, M. W., Lanclos, V. C., Faircloth, B. C. & Thrash, J. C. Cultivation and genomics of the first freshwater SAR11 (LD12) isolate. ISME J. 12, 1846–1860 (2018).
  • 53. Tsementzi, D. et al. Ecogenomic characterization of widespread, closely-related SAR11 clades of the freshwater genus “Candidatus Fonsibacter” and proposal of Ca. Fonsibacter lacus sp. nov. Syst. Appl. Microbiol. 42, 495–505 (2019).
  • 54. Parks, D. H. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotechnol. 38, 1079–1086 (2020).
  • 55. Bentkowski, P., Van Oosterhout, C. & Mock, T. A model of genome size evolution for prokaryotes in stable and fluctuating environments. Genome Biol. Evol. 7, 2344–2351 (2015).
  • 56. Blanchet, F. G., Cazelles, K. & Gravel, D. Co-occurrence is not evidence of ecological interactions. Ecol. Lett. 23, 1050–1063 (2020).
  • 57. Von Meijenfeldt, F. A. B., Hogeweg, P. & Dutilh, B. E. A social niche breadth score reveals niche range strategies of generalists and specialists. Nat. Ecol. Evol. 7, 768–781 (2023).
  • 58. Bomberg, M., Montonen, L., Jurgens, G. & Münster, U. Diversity and function of archaea in freshwater habitats. Curr. Trends Microbiol. 4, 61–89 (2008).
  • 59. Murphy, C. L. et al. Genomic characterization of three novel Desulfobacterota classes expand the metabolic and phylogenetic diversity of the phylum. Environ. Microbiol. 23, 4326–4343 (2021).
  • 60. Rillig, M. C. & Mansour, I. Microbial ecology: Community coalescence stirs things up. Curr. Biol. 27, R1280–R1282 (2017).
  • 61.Shade, A. et al. Fundamentals of Microbial Community Resistance and Resilience. Front. Microbio.3, 417 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Mondav, R. et al. Streamlined and abundant bacterioplankton thrive in functional cohorts. MSystems5, e00316-e320 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Oña, L. & Kost, C. Cooperation increases robustness to ecological disturbance in microbial cross-feeding networks. Ecol. Lett.25, 1410–1420 (2022). [DOI] [PubMed] [Google Scholar]
  • 64.Sultana, S., Bruns, S., Wilkes, H., Simon, M. & Wienhausen, G. Vitamin B12 is not shared by all marine prototrophic bacteria with their environment. ISME J.17, 836–845 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Breitbart, M., Bonnain, C., Malki, K. & Sawaya, N. A. Phage puppet masters of the marine microbial realm. Nat. Microbiol.3, 754–766 (2018). [DOI] [PubMed] [Google Scholar]
  • 66.Pherribo, G. J. & Taga, M. E. Bacteriophage-mediated lysis supports robust growth of amino acid auxotrophs. ISME J.17, 1785–1788 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Wienhausen, G. et al. Ligand cross-feeding resolves bacterial vitamin B12 auxotrophies. Nature629, 886–892 (2024). [DOI] [PubMed] [Google Scholar]
  • 68.Arandia-Gorostidi, N. et al. Metagenomic-based network analysis reveals the importance of vitamin cross-feeding in marine microbial assemblages. BioRxiv10.1101/2025.08.08.668683 (2025). [Google Scholar]
  • 69.Groon, L.-A., Bruns, S., Dlugosch, L., Wilkes, H. & Wienhausen, G. Effects of vitamin B12 supply on cellular processes of the facultative vitamin B12 consumer Vibrio campbellii. Appl. Environ. Microbiol.91, e01422-e1424 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: Biological systems database as a model of the real world. Nucleic Acids Res.53, D672–D677 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Shah, M. et al. Genome-resolved metagenomics reveals the effect of nutrient availability on bacterial genomic properties across 44 European freshwater lakes. Environ. Microbiol.26, e16634 (2024). [DOI] [PubMed] [Google Scholar]
  • 72.Rodríguez-Gijón, A. et al. Linking prokaryotic genome size variation to metabolic potential and environment. ISME Commun.3, 25 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Rodríguez-Gijón, A., Hampel, J. J., Dharamshi, J., Bertilsson, S. & Garcia, S. L. Shotgun metagenomes from productive lakes in an urban region of Sweden. Sci. Data10, 810 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Uritskiy, G. V., DiRuggiero, J. & Taylor, J. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome6, 158 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Li, D. et al. MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods 102, 3–11 (2016).
  • 76. Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
  • 77. Kang, D. D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
  • 78. Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: An automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
  • 79. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
  • 80. Buck, M. et al. Comprehensive dataset of shotgun metagenomes from oxygen stratified freshwater lakes and ponds. Sci. Data 8, 131 (2021).
  • 81. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
  • 82. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  • 83. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).
  • 84. Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
  • 85. Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: A toolkit to classify genomes with the genome taxonomy database. Bioinformatics 10.1093/bioinformatics/btz848 (2020).
  • 86. Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
  • 87. Buck, M., Mehrshad, M. & Bertilsson, S. mOTUpan: A robust Bayesian approach to leverage metagenome-assembled genomes for core-genome estimation. NAR Genom. Bioinform. 4, lqac060 (2022).
  • 88. Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).
  • 89. Rodriguez-R, L. M. & Konstantinidis, K. T. The Enveomics Collection: A Toolbox for Specialized Analyses of Microbial Genomes and Metagenomes. https://peerj.com/preprints/1900v1 (2016) 10.7287/peerj.preprints.1900v1.
  • 90. Rodriguez-R, L. M. et al. The microbial genomes atlas (MiGA) webserver: Taxonomic and gene diversity analysis of Archaea and Bacteria at the whole genome level. Nucleic Acids Res. 46, W282–W288 (2018).
  • 91. Sahlin, K. Strobealign: Flexible seed size enables ultra-fast and accurate read alignment. Genome Biol. 23, 260 (2022).
  • 92. Nayfach, S. & Pollard, K. S. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biol. 16, 51 (2015).
  • 93. Watts, S. C., Ritchie, S. C., Inouye, M. & Holt, K. E. FastSpar: Rapid and scalable correlation estimation for compositional data. Bioinformatics 35, 1064–1066 (2019).
  • 94. Friedman, J. & Alm, E. J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8, e1002687 (2012).
  • 95. Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004).
  • 96. Tyner, S., Briatte, F. & Hofmann, H. Network Visualization with ggplot2. R J. 9(1), 27–59 (2017).
  • 97. Kolde, R. pheatmap: Pretty Heatmaps (2019).
  • 98. Eren, A. M. et al. Community-led, integrated, reproducible multi-omics with anvi’o. Nat. Microbiol. 6, 3–6 (2020).
  • 99. Veseli, I. et al. Microbes with Higher Metabolic Independence Are Enriched in Human Gut Microbiomes under Stress. https://elifesciences.org/reviewed-preprints/89862 (2023) 10.7554/eLife.89862.
  • 100. R Core Team. R: A language and environment for statistical computing (2020).
  • 101. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (2016).
  • 102. Kassambara, A. ggpubr: ‘ggplot2’ Based Publication Ready Plots (2020).
  • 103. NCBI BioProject. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1045862/ (2023).

Data Availability Statement

The paired sequences of both metagenomic samples from the pond in Stadsträdgården, Uppsala (Sweden) and the 52 medium-to-high-quality MAGs have been deposited under the NCBI BioProject PRJNA1045862 (ref. 103). The 11,146 re-binned genomes from the raw metagenomic reads of the StratfreshDB are available through the Figshare data repository https://figshare.com/s/9af0a87d5fa6b80017f8. The 9,028 representative genomes of the FRESH-MAP dataset are available through the Figshare data repository 10.17044/scilifelab.28327964.v1 (ref. 33). Information about the original publication of all genomes and metagenomes obtained from public repositories can be found in Tables S1 and S3.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group