High-resolution metagenome assembly for modern long reads with myloasm

Jim Shaw; Maximillian G Marin; Heng Li

doi:10.1038/s41587-026-03053-z

Author manuscript; available in PMC: 2026 Mar 31.

Published before final editing as: Nat Biotechnol. 2026 Mar 27:10.1038/s41587-026-03053-z. doi: 10.1038/s41587-026-03053-z

PMCID: PMC13035328 NIHMSID: NIHMS2152955 PMID: 41896477

Abstract

Long-read metagenome assembly promises complete genomic recovery from microbiomes. However, the complexity of metagenomes poses challenges. We present myloasm, a metagenome assembler for modern long reads such as PacBio HiFi and Oxford Nanopore Technologies (ONT) R10.4 long reads. Myloasm uses polymorphic k-mers to construct a high-resolution string graph and then leverages differential abundance for graph simplification. On real-world ONT metagenomes, myloasm assembled three times more complete circular contigs than the next-best assembler. Myloasm can make ONT and HiFi assemblies comparable: for a jointly sequenced gut metagenome, myloasm with ONT assembled more complete circular genomes than any assembler with HiFi. Myloasm also recovers previously inaccessible within-species diversity: we recovered six complete Prevotella copri single-contig genomes from a gut metagenome and eight complete TM7 (Saccharibacteria) contigs with > 93% similarity from an oral metagenome. Myloasm outperforms existing long-read metagenome assemblers across a range of environments and modern sequencing technologies.

1. Introduction

Metagenomics studies the collective genomic content within a community of organisms through untargeted DNA sequencing [1]. After sequencing, computational de novo assembly algorithms are used to reconstruct genomes. The recovery of genomes from metagenomic sequencing has led to a drastic increase in our understanding of the microbial world [2,3]. Metagenomics has enabled fundamental insights into evolution and ecology [4,5,6,7] and has emerged as a crucial tool for linking microbial communities to human health [8,9,10,11,12] or environmental health [13,14,15,16]. Many microbes have not been cultured [17], so a metagenomics approach is necessary for the discovery of new genomes. In the past decade, the emergence of long-read sequencing technologies has led to drastic improvements in the quality of isolate and metagenome recovery [18,19,20,21,22]. Nevertheless, the scale, diversity, and complexity of metagenomes remain a challenge for complete genome recovery [23].

Microbial communities can contain many organisms with distinct but similar genomes. This can be due to ongoing recombination [24,25] within a population or distinct populations of the same species [26,27]. Furthermore, horizontal gene transfer can lead to shared chromosomal DNA between species [28]. This makes metagenome assembly challenging due to the presence of highly-similar and thus repetitive sequences within a metagenome. Short-read sequencing has fundamental and practical limitations for resolving these intergenomic repeats [29,30,31], but ONT and PacBio HiFi long reads represent promising solutions.

HiFi assemblers [21,20] use highly accurate reads with > 99.95% median base accuracy (after excluding homopolymer errors [32]). HiFi sequencing combined with algorithms designed specifically for highly accurate reads has historically resulted in better assemblies compared to ONT methods [33,34]. However, the rapid improvement of ONT accuracies driven by its latest R10.4 chemistry has the potential to close this gap [35,36]. ONT offers several basecalling options, for example, super accuracy (sup) and high accuracy (hac) basecalling. Sup-basecalled reads have ≈ 99% median base accuracy whereas hac-reads have ≈ 98% [37], but sup requires more compute for basecalling. Nevertheless, these ONT R10.4 reads have error rates that are orders of magnitude higher than HiFi, and new algorithmic techniques are required to harness their potential [36].

Modern long-read assemblers can be broadly stratified into two paradigms: string graphs [38] and de Bruijn graphs [39]. In the string graph, nodes are reads and edges are overlaps between reads. In the (node-centric) de Bruijn graph, nodes are k-mers and edges are error-free overlaps of length k – 1. String graphs are more powerful at resolving genomic repeats because overlaps are not restricted to length k – 1. However, string graphs are less computationally efficient because they require all overlaps to be explicitly computed, whereas de Bruijn graphs do not. On the other hand, sequencing errors pose a problem for exact k-mer overlaps—even a modest k-mer overlap length of 500 bp is likely to have an error for the newest ONT reads. New minimizer de Bruijn graph approaches tackle these limitations by using longer sequence contexts and error correction [40,20,41]. Nevertheless, these graphs are still losing long-range information compared to string graphs.
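
The claim that a 500 bp exact match is unlikely to be error-free for ONT reads can be checked with a back-of-the-envelope calculation. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification of real error profiles, and uses the approximate median accuracies quoted earlier (about 99% for sup, about 98% for hac, and 99.95% for HiFi). The function name is illustrative only.

```python
def error_free_probability(length_bp: int, per_base_accuracy: float) -> float:
    """Probability that `length_bp` consecutive bases of one read are all correct.

    Assumes independent, uniformly distributed errors (a simplification).
    """
    return per_base_accuracy ** length_bp


# Approximate median accuracies taken from the text above.
for label, accuracy in [
    ("ONT sup (~99%)", 0.99),
    ("ONT hac (~98%)", 0.98),
    ("HiFi (~99.95%)", 0.9995),
]:
    p = error_free_probability(500, accuracy)
    print(f"{label}: P(500 bp error-free) = {p:.4f}")
```

Under this model a single sup read has well under a 1% chance of carrying 500 error-free bases in a row, and an exact overlap needs both reads to be error-free, which lowers the probability further.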

1.1. Our contribution

We present myloasm (metagenomic noisy long-read assembler), a new metagenome assembler for modern long reads with greater than approximately 97% accuracy, such as PacBio HiFi and ONT R10.4 long reads. In myloasm, we use a string graph approach. Unlike existing HiFi string graph approaches [33,32], we eschew error correction of reads, which may fail for low-coverage or high-diversity populations due to the lack of reads sequenced from identical genomes. We instead use polymorphic k-mers within the sample to resolve similar sequences in metagenomes (e.g., co-existing strains or conserved genomic regions). In addition, we use a graph cleaning algorithm inspired by annealing approaches from statistical physics [42] to integrate coverage and overlap information. We show that myloasm can resolve simple populations of closely related strains better than de Bruijn graph approaches, yet handle complex metagenomes better than previous string graph algorithms. This leads to improved recovery of metagenome-assembled genomes (MAGs) over previous approaches.

Metagenome assembly is extremely complex, and assembly errors are inevitable [43]. In light of this, we built a software suite for quality control and visualization of myloasm’s outputs called mylotools. A key advantage of myloasm’s string graph approach over de Bruijn graphs is that contigs are just sequences of reads and overlaps, rather than k-mers. Mylotools leverages this property of string graphs to output interactive visualizations that integrate information from reads, overlaps, and genomic signatures to assist practitioners in detecting misassemblies (Supplementary Fig. 1).

2. Results

2.1. Resolving overlap assembly graphs with polymorphic k-mers and a random path model

Myloasm first performs a form of reference-free variant calling via SNPmers (Fig. 1A). SNPmers are pairs of k-mers that differ by a substitution in the middle base. SNPmers have been used and defined previously in other contexts [44,45,46] under the framework of split k-mers. SNPmers capture single nucleotide polymorphisms (SNPs) through a k-mer context instead of a reference genome. Myloasm calls SNPmers by finding suitable pairs of k-mers and excluding erroneous, low-frequency k-mers. Myloasm then uses a combination of SNPmers and open syncmers [47] to index the reads.
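
The SNPmer definition above can be illustrated with a toy caller. This is a minimal sketch, not myloasm's implementation: the tiny `k`, the `min_count` threshold, and the function name are assumptions chosen for readability, and reverse complements and syncmer indexing are ignored.

```python
from collections import Counter, defaultdict


def call_snpmers(reads, k=5, min_count=2):
    """Return sorted pairs of k-mers that differ only at the middle base.

    k must be odd. K-mers seen fewer than `min_count` times are treated as
    likely sequencing errors and excluded (hypothetical threshold).
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that a middle base exists")
    mid = k // 2
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    # Group frequent k-mers by their split key: both flanks, middle base masked.
    groups = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            groups[(kmer[:mid], kmer[mid + 1:])].add(kmer)
    pairs = []
    for kmers in groups.values():
        ordered = sorted(kmers)
        pairs.extend(
            (a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:]
        )
    return sorted(pairs)


reads = ["ACGTACG", "ACGTACG", "ACGAACG", "ACGAACG", "ACGCACG"]
print(call_snpmers(reads, k=5, min_count=2))
```

In this example the 5-mers `CGTAC` and `CGAAC` share both flanks and differ only in the middle base, so they form a SNPmer pair; the variant seen in a single read falls below `min_count` and is discarded.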

In the overlapping stage (Fig. 1B), myloasm finds exact open syncmer matches and then performs chaining [48,49] to find overlaps. Then, myloasm matches SNPmers while ignoring the middle base. That is, myloasm allows matching of polymorphic SNPmers. Myloasm then performs chaining again, but this time on the SNPmers. We mathematically prove that, under mild assumptions, this double chaining procedure can be used to estimate a true sequence identity (i.e., free of sequencing error) for the mapping through a statistical model of mutation and sequence error. Myloasm uses this estimate of sequence identity to build a string graph. This step is permissive for missing syncmer or SNPmer hits between reads; error correction is not required. However, it is strict for mismatched SNPs (i.e., mismatched middle bases) across SNPmers (Fig. 1B, bottom). After transitive reduction [38], non-branching paths are collapsed into unitigs.
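
The asymmetry described above (permissive for missing hits, strict for mismatched middle bases) can be sketched as follows. This toy comparison is position-free and skips chaining and the statistical identity model entirely; the function names and the optional `snpmer_keys` filter are assumptions for illustration.

```python
def split_index(read, k):
    """Map each split k-mer key (left flank, right flank) to its middle bases."""
    mid = k // 2
    index = {}
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        index.setdefault((kmer[:mid], kmer[mid + 1:]), set()).add(kmer[mid])
    return index


def compare_snpmers(read_a, read_b, k=5, snpmer_keys=None):
    """Return (matched, mismatched) counts over split k-mers shared by two reads.

    Keys present in only one read are ignored (missing hits are tolerated).
    A shared key with no common middle base counts as a mismatch.
    If `snpmer_keys` is given, only those keys are considered.
    """
    a, b = split_index(read_a, k), split_index(read_b, k)
    shared = a.keys() & b.keys()
    if snpmer_keys is not None:
        shared &= set(snpmer_keys)
    matched = sum(1 for key in shared if a[key] & b[key])
    return matched, len(shared) - matched


print(compare_snpmers("ACGTACG", "ACGTACG"))  # same sequence
print(compare_snpmers("ACGTACG", "ACGAACG"))  # one polymorphic middle base
```

A real overlapper would restrict the comparison to called SNPmers and to hits that chain colinearly along the candidate overlap.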

Myloasm uses differential abundance within a metagenome to resolve the unitig graph (Fig. 1C). Myloasm first performs read-to-read mapping and calculates depth of coverage. Importantly, myloasm calculates coverages at different identity cutoffs for mapping. We found this to be necessary under extensive strain variation, where few perfect mappings may exist. Myloasm then uses mapping coverage, and thus differential abundance, to simplify the graph. We describe this below.

A key challenge is to integrate coverage and overlap information for graph simplification. To do so, myloasm weighs edges (i.e., read overlaps) by their likelihood to appear under a random local path model, and then cuts edges with low weight. Intuitively, local paths represent a small assembly within a neighborhood of the nodes. Myloasm assigns an energy function to each local path, and then defines a probability distribution via the Boltzmann distribution. Here, a path has low energy (and thus, high probability) if the coverage is consistent, the overlaps are relatively large, and the overlaps have few mismatched SNPmers. Thus, an edge has high weight if, roughly speaking, it appears in many possible “good” assemblies with consistent coverage and long overlaps.

We parameterize this probability model by a temperature parameter that controls the strictness of the energy term. For a given temperature, myloasm then cuts edges with small expected value of appearing in a random path. Myloasm then removes tips and bubbles [50], and collapses paths into unitigs. However, we do not use a fixed temperature value. Instead, we iterate this cleaning procedure from high to low temperature. The intuition is that the initial stages (high temperature) cut only confident edges, leading to longer unitigs. These longer unitigs have more robustly estimated edge weights. Then, myloasm can prune edges more aggressively in later steps (lower temperatures). After cleaning, the final unitigs are polished and output to the user as contigs.
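
The temperature-dependent edge weighting can be illustrated on a toy graph. The sketch below assumes the local paths and their energies are already given; the energy values, the cut `threshold`, and the function names are hypothetical, and the actual energy function and path sampling are described in Methods rather than here.

```python
import math


def edge_weights(paths, energies, temperature):
    """Probability that each edge appears in a path drawn from a Boltzmann
    distribution, p(path) proportional to exp(-energy / temperature)."""
    boltzmann = [math.exp(-e / temperature) for e in energies]
    z = sum(boltzmann)  # partition function over the enumerated local paths
    weights = {}
    for path, b in zip(paths, boltzmann):
        for edge in zip(path, path[1:]):
            weights[edge] = weights.get(edge, 0.0) + b / z
    return weights


def edges_to_cut(paths, energies, temperature, threshold=0.05):
    """Edges whose expected appearance in a random path is below `threshold`."""
    weights = edge_weights(paths, energies, temperature)
    return sorted(edge for edge, w in weights.items() if w < threshold)


# Two competing local paths out of node "b"; the second has higher energy
# (e.g., inconsistent coverage or mismatched SNPmers).
paths = [("a", "b", "c"), ("a", "b", "d")]
energies = [0.0, 2.0]
for temperature in (4.0, 0.5):  # anneal from high to low temperature
    print(temperature, edges_to_cut(paths, energies, temperature))
```

At the high temperature both branches keep appreciable weight and nothing is cut; at the low temperature the higher-energy branch `("b", "d")` falls below the threshold, mirroring the progressively more aggressive pruning described above.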

2.2. Synthetic benchmarking of multi-strain communities with ONT R10.4 and HiFi parameters

A primary challenge of metagenome assembly is disentangling highly similar genomic sequences within a metagenome, e.g., genomic repeats or co-existing strains of the same species. This is especially a challenge for ONT long reads, where the error rate may surpass the nucleotide divergence of two strains. We sought to test if assemblers can disentangle highly similar strains with modern long-read data using synthetically generated metagenomes (Fig. 2 and Supplementary Table 1).

We first constructed 27 metagenomes, each with five species and two strains per species. We fixed one strain at 35x depth and varied (1) the other strain’s coverage, (2) technology-specific simulation parameters, and (3) strain-to-strain average nucleotide identity (ANI) from 97.5% to 99.5% (Fig. 2A). We analyze the Qscore, defined as $-10 \log_{10}\left(\frac{\text{indels} + \text{substitutions}}{\text{genome length}}\right)$, and the genome fraction, defined as the percentage of genome content assembled in the reference genomes.

Myloasm strongly outperformed the other methods for the ONT parameter sets (Fig. 2B). For all 12 ONT datasets with < 99.5% ANI, myloasm recovered ≥ 95% genome fraction with median Qscore between 47 and 55. In contrast, metaFlye recovered only one out of two strains per species (≤ 52% genome fraction) for all ONT metagenomes. Even when metaFlye assembled one of two genomes, its assemblies were of lower quality than myloasm: on the Nano Long + 35x minor coverage dataset, myloasm had a $10^{(55 - 22)/10} \approx 2000$-fold increase in assembly fidelity. MetaMDBG (hereafter assumed to be in nanoMDBG mode [41] for ONT parameters) performed well at high coverages but struggled at moderately low coverages (50–78% genome fraction at 15x minor coverage). Myloasm also had the best performance on median NGA50 and misassemblies for 24/27 of the datasets (Supplementary Fig. 2).
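
The Qscore definition and the fold-change arithmetic quoted above can be reproduced with a few lines. This is a minimal sketch; the function names are illustrative and are not part of myloasm or of any evaluation tool used in the benchmark.

```python
import math


def qscore(indels: int, substitutions: int, genome_length: int) -> float:
    """Phred-scaled assembly accuracy: -10 * log10(errors / genome length)."""
    errors = indels + substitutions
    if errors == 0:
        return math.inf  # no observed errors
    return -10 * math.log10(errors / genome_length)


def fold_difference(q_high: float, q_low: float) -> float:
    """Ratio of per-base error rates implied by two Qscores."""
    return 10 ** ((q_high - q_low) / 10)


print(qscore(indels=5, substitutions=5, genome_length=1_000_000))
print(fold_difference(55, 22))
```

Ten errors in a 1 Mbp genome correspond to a Qscore of about 50, and a 33-point Qscore gap corresponds to roughly a 2000-fold difference in error rate.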

For the PacBio HiFi parameter sets, all methods performed well when ANI was < 99.5%; all methods had > 90% genome fraction, although myloasm’s median Qscore was 10 points higher than metaMDBG and metaFlye on average. All methods struggled at 99.5% ANI, recovering < 73% genome fraction. Interestingly, at 99.5% ANI myloasm with the Nano Long parameters (average 84.6% genome fraction and 49 median Qscore) outperformed the PacBio parameters (average 72.3% genome fraction and 42 median Qscore). This shows that the length of the reads may be more important than accuracy for highly similar strains.

We constructed another benchmark with the Nano Long preset, this time using only one species, but one to ten strains of S. enterica (Fig. 2C and Supplementary Fig. 3). All strains had 98 to 99% pairwise ANI and depth of coverage randomly chosen between 10-30x. On this dataset, myloasm had > 89% genome fraction at ten strains, whereas metaFlye had only 10% and metaMDBG had 30%. On datasets with more than four strains, the median Qscores of myloasm were > 40 (less than 1/10000 bp with an error) whereas metaMDBG and metaFlye had < 23 median Qscore (more than 1/200 bp with an error). In conclusion, across a wide range of parameters and strain complexity—up to 10 strains and 99.5% ANI—myloasm outputs high quality contigs and recovers genomic content that previous methods could not.

2.3. High-resolution assemblies on a concatenated mock ONT R10.4 community

We constructed a mock long-read metagenome with 48 genomes sequenced to 21 Gbp with ONT R10.4 technology and basecalled with the sup (super accuracy) or hac (high accuracy) option (Fig. 3A). This was done by concatenating two real mock metagenomes and 14 isolate sequencing runs [37]. We focus on the results for the sup-basecalled reads in Fig. 3; hac results are shown in Supplementary Fig. 4 and comprehensive statistics are available in Supplementary Table 2.

Fig. 3: Results on a concatenated ONT R10.4 mock metagenome with 48 genomes consisting of 14 isolates and 2 mock metagenomes. A. The benchmarking setup. B. Qscore and genome fraction for genomes with a > 90% ANI strain within the metagenome. Depth of coverage is shown to the right of the genome name. “Nearest ANI” is the highest ANI between the genome and another genome in the community. C. Qscore restricted to only the 14 genomes with a > 90% ANI partner strain after downsampling the reads to a fraction of the original metagenome (x-axis). D. Qscores for all 48 genomes at various downsampling fractions. E. Circular and complete genomes and fraction of genome recovered as a function of coverage over all downsampling fractions. Logistic regression was run over the dataset with confidence intervals found by bootstrapping over 1000 iterations. F. Plasmid results. Left: dark bars show the number of circular + complete plasmids and light bars show non-circular but complete plasmids. Right: number of assembled plasmid contigs that are > 10% longer than the reference plasmid genome, indicating possible multimeric assembly. Red outlines on bars indicate a metric where a higher value is worse. Box plots show the median (middle line), lower and upper quartiles (lower and upper box limits) and the nearest datapoint to 1.5 times the interquartile range (whiskers).

To better understand assembly behavior for closely related strains, we first focused our analysis on five groups of genomes with > 90% ANI to each other (Fig. 3B) and sufficient sequencing depth for assembly. Myloasm had a median Qscore of 49.3 compared to 35.1 of metaMDBG and 28.6 of metaFlye (Fig. 3C) on these five groups of genomes. For the two L. monocytogenes and two Klebsiella genomes at a modest ≈ 95% similarity, metaFlye had > 90% genome fraction for only one of the four strains. For this genome, metaFlye’s Qscore was < 25, indicating serious polishing errors. On the other hand, the two 98.89% similar S. aureus strains were fully recovered by myloasm with > 50 Qscore, whereas the low-coverage S. aureus strain posed a problem for metaMDBG (< 50% genome fraction) and metaFlye (no contigs of length > 100 kb).

The six E. coli genomes were particularly challenging: five of the six had > 98.3% pairwise ANI, with B1109 and JM109 having 99.53% ANI. Myloasm recovered > 75% of each E. coli genome. MetaMDBG only recovered > 75% of 2/6 genomes and metaFlye could not recover > 75% of any genome. Myloasm had the highest Qscore on 5/6 E. coli genomes; we found that a long contig was assembled for B2207, but it had strain switching errors, leading to low Qscore. In particular, myloasm excelled for the lower coverage ATCC strain, having a Qscore of > 50 (< 1 error per 100 kb). In summary, real ONT sequencing corroborates our synthetic results (Fig. 2): myloasm can recover high-quality strain-level genomes, even under varying coverages and high within-metagenome similarity.

On all genomes (not just closely related strains), myloasm had the highest median Qscores on four of the six datasets after downsampling (Fig. 3D). Myloasm had the most circular single-contig prokaryotic genomes (Supplementary Fig. 5) across 5/6 sampling rates. At low coverage, myloasm is more likely to assemble a complete circular genome and recover more of the genome than other assemblers (Fig. 3E). Myloasm circularized 92% of genomes with coverage > 50, whereas metaFlye and metaMDBG circularized 60% and 61% respectively. Lastly, we found that the number of misassemblies decreased as the coverage increased, but we found that no method was consistently the best or worst (Supplementary Fig. 6).

For plasmids, myloasm recovered the most circular and complete plasmids across all sampling rates (Fig. 3F). We found several duplicated plasmids in the output assembly for all methods. This has been corroborated by other long-read assembly benchmarks [51]. We found that some reads spanned a single plasmid genome multiple times: one read spanned an S. aureus plasmid exactly six times, and all methods failed to correctly recover this plasmid. This multimer phenomenon has been described before for nanopore sequencing [52]. Nevertheless, myloasm and metaFlye were the most precise across all datasets, with 15 and 18 duplicated plasmids in total respectively, whereas metaMDBG had 45 duplicated plasmids.

2.4. Enhanced metagenome-assembled genome (MAG) recovery on diverse metagenomes

We next benchmarked myloasm, metaMDBG, and metaFlye on six real R10.4 ONT metagenomes [34,53,54] and six real PacBio HiFi metagenomes [55,56,57,58,20,59] (see Supplementary Table 3 for accessions). We also benchmarked hifiasm-meta but only for the HiFi datasets. We show MAG recovery results in Fig. 4A for single-sample assembly and binning with CheckM2 [60] evaluation.

Fig. 4: MAG and contig recovery results for real metagenomes. A. MAG recovery after assembly and binning. Dark bars represent circular contigs with > 90% completeness and < 5% contamination. Medium bars represent non-circular MAGs with > 90% completeness and < 5% contamination. Light bars represent MAGs with > 90% completeness and < 5% contamination. NanoMDBG is used for ONT reads whereas metaMDBG is used for HiFi reads. B. Top: number of high-quality (HQ) contigs with > 90% completeness and < 5% contamination on two different gut samples from Minich et al. sequenced with HiFi and ONT R10.4 data. Bottom: the same datasets but counting only circular HQ contigs. Backgrounds are colored by gut sample.

Myloasm had particularly strong performance for recovering circular, complete contigs (> 90% completeness and < 5% contamination) on ONT data. Across all ONT datasets, myloasm recovered > 3× more circular, complete contigs on average than the next best method. This increase was consistent across diverse ONT metagenomes, from oral (> 3× more circular, complete contigs) to gut (> 2 – 5× more), and even the > 100 Gbp “Microflora” soil dataset [54] (61 circular, complete contigs for myloasm versus 23 for metaMDBG and 0 for metaFlye). Myloasm also recovered more high-quality (HQ) MAGs (> 90% complete and < 5% contaminated but not necessarily circular) on 6/6 ONT datasets. Interestingly, a single circular contig made up > 80% of myloasm’s HQ MAGs on 4/6 ONT datasets; no other assembler had even 50%. This hints that, with long enough reads, a single circular contig is starting to become the expectation for HQ MAGs.

For HiFi reads, myloasm recovered the most circular, complete contigs for 5/6 datasets but had slightly fewer than hifiasm-meta for the chicken gut (24 for hifiasm-meta vs 23 for myloasm). Myloasm also recovered the most HQ MAGs on 4/6 datasets. The results diverged particularly on complex seawater metagenomes. For example, metaFlye did not recover a single HQ MAG from the Seawater 1 dataset [57] whereas myloasm had 16. Hifiasm-meta struggled on these complex metagenomes as well: hifiasm-meta recovered < 4 times fewer HQ MAGs on the Seawater 2 dataset [56] compared to myloasm and metaMDBG. Overall, myloasm can recover more complete and contiguous MAGs on a variety of sequencing technologies and environments.

2.5. Myloasm makes ONT sequencing competitive with HiFi for the human gut

We compared all assemblers across HiFi and ONT R10.4 using two gut datasets of Minich et al. [34] (Fig. 4B) with both types of sequencing. Minich et al. found that HiFi still outperformed ONT for metaMDBG and metaFlye, producing 2.34 times more circular, complete genomes than ONT. With myloasm, this is no longer the case: myloasm + ONT generated an average of 105% (sample h6) and 96% (sample d5) of the best HiFi assembler’s circular, complete contigs across all subsamples. For the same metrics, metaMDBG achieved only 13% (h6) and 35% (d5); metaFlye achieved only 22% (h6) and 19% (d5).

Assembly critically depends on the read length distributions, which can differ due to differences in library preparation between ONT and HiFi for even the same biological samples. For the HiFi d5 sample, the N50 and N10 read lengths were N50 = 11 kbp and N10 = 19 kbp. The ONT d5 sample had N50 = 10 kbp and N10 = 26 kbp. Thus, ONT had more very long reads compared to HiFi. Similarly, the HiFi h6 sample had N50 = 12 kbp and N10 = 17 kbp versus the ONT sample with N50 = 6 kbp and N10 = 33 kbp. The ONT h6 sample had a particularly heterogeneous length distribution that was approximately 2× smaller for N50 than HiFi, yet myloasm recovered more circular, complete contigs on average with ONT than HiFi. This shows that myloasm’s algorithmic approach can effectively leverage the presence of very long reads. Ultimately, myloasm closes the gap between HiFi and ONT for high-quality prokaryote assembly in human gut metagenomes.
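
For readers less familiar with the N50 and N10 read length statistics used above, the sketch below computes them under the standard definition: the Nx value is the length L such that reads of length at least L contain at least x% of all sequenced bases. The function name and the toy read lengths are illustrative.

```python
def nx(lengths, x):
    """Return the Nx read length: the largest L such that reads of length >= L
    together contain at least x percent of all bases."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return min(lengths)  # only reached for x > 100


read_lengths = [2_000, 3_000, 4_000, 5_000, 6_000]
print("N50:", nx(read_lengths, 50))
print("N10:", nx(read_lengths, 10))
```

Because N10 is driven by the longest reads, a sample can have a lower N50 but a higher N10 than another, which is the pattern reported for the ONT h6 sample relative to HiFi.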

2.6. Contig and MAG-level contamination assessment

For the same 12 datasets as previously described, we ran CheckM2 on all contigs of length > 500 kbp and show the results in Fig. 5A. We show the analogous results for MAGs in Supplementary Fig. 7. On the six ONT datasets, myloasm produced more > 50, 75, and 90% complete contigs with low contamination (< 5%) than every other assembler. Across all ONT datasets, myloasm output 3.1× more contigs with > 90% completeness than metaFlye and 2.2× more than metaMDBG on average. For HiFi, metaMDBG was more competitive; myloasm produced 1.2× more > 90% complete contigs on average, which was a relatively smaller increase than the ONT case.

As expected with this greater sensitivity, myloasm also produced more contigs with > 5% contamination (denoted as high contamination). Of the > 90% complete contigs across all datasets, myloasm produced 781 with low contamination and 49 with high contamination, giving a low/high contamination ratio of 15.9. MetaFlye had a low/high contamination ratio of 236/11 = 21.4 and metaMDBG had 459/29 = 15.8. Thus, myloasm had competitive relative contamination for high-completeness contigs.

For > 50% complete contigs, the low/high contamination ratio was worse for myloasm (12.9) than metaMDBG (21.5) and metaFlye (27.5). However, for > 50% complete bins, not contigs, myloasm actually had a better low/high contamination ratio: the ratios were 3.3 (myloasm), 2.4 (metaMDBG), 1.9 (metaFlye). Furthermore, myloasm recovered 1.28× and 1.74× more > 50% complete and low-contamination bins than metaMDBG and metaFlye across all datasets. For recovery at the bin level, myloasm is both more sensitive and precise.

2.7. Reference-free assessment of assembly quality

We examined read mapping patterns against contigs to assess quality using the same analysis as in a recent study [43] (Fig. 5B, C). We measured the number of regions in the assembly with (1) positions with only soft-clipped alignments yet > 10× depth of coverage (read-clipped regions; RCRs) and (2) no read support in a > 500 bp window (zero-coverage regions; ZCRs). These regions indicate likely local errors in the assembly.
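
The zero-coverage metric can be sketched from a per-base depth vector. The sketch interprets a ZCR as a maximal run of zero depth longer than 500 bp, which is an assumption about the windowing, and the function names are hypothetical; it is not the evaluation code used in the study.

```python
def zero_coverage_regions(depth, min_length=500):
    """Return half-open (start, end) intervals of zero depth longer than
    `min_length` bases, given one depth value per contig position."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d == 0:
            if start is None:
                start = pos  # a zero-depth run begins here
        elif start is not None:
            if pos - start > min_length:
                regions.append((start, pos))
            start = None
    # Handle a run that extends to the end of the contig.
    if start is not None and len(depth) - start > min_length:
        regions.append((start, len(depth)))
    return regions


def regions_per_mbp(n_regions, assembly_bp):
    """Normalize a region count by assembly size in megabases."""
    return n_regions / assembly_bp * 1_000_000


depth = [5] * 100 + [0] * 600 + [3] * 50 + [0] * 200
print(zero_coverage_regions(depth))
```

Only the 600 bp gap is reported here; the 200 bp gap at the contig end is below the length cutoff.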

Another potential error is calling non-circular contigs as circular. We attempted to quantify the number of erroneously or prematurely circularized contigs but ran into edge cases that produced false positives (see Real benchmarking procedure in Methods). Therefore, we defer the assessment of poor circularization to Recovery of plasmids and viruses.

We found fewer read-clipped regions (RCRs) per 1 Mbp in myloasm on average (0.013 for ONT; 0.008 for HiFi) compared to metaFlye (0.126 for ONT; 0.068 for HiFi) and metaMDBG (0.224 for ONT; 0.026 for HiFi). However, hifiasm-meta was the most accurate for HiFi data (0.001 RCRs per 1 Mbp). Similarly, for zero-coverage regions (ZCRs) per 1 Mbp, myloasm (0.050 for ONT; 0.068 for HiFi) was better than metaFlye (0.067 for ONT; 0.103 for HiFi) and metaMDBG (0.697 for ONT; 0.320 for HiFi) on average. Hifiasm-meta was also the best for ZCRs (0.034). Based on read mapping evaluation, myloasm achieves fewer local assembly errors on real datasets than previous ONT assemblers.

2.8. Timing and memory benchmarks on real data

Myloasm achieved the fastest assembly times on 9/12 datasets (Fig. 5D and Supplementary Table 4). Myloasm was faster than metaFlye on every dataset and faster than metaMDBG on 10/12 datasets. Myloasm also used less peak memory than metaFlye on 11/12 datasets (0.78× as much peak RAM on average) and hifiasm-meta on 4/6 datasets (0.83× peak RAM on average). However, myloasm consistently required more memory than metaMDBG (average 6.3× higher peak RAM usage), reflecting the efficiency of metaMDBG’s sparse minimizer approach. The microflora soil dataset was the only dataset where myloasm’s RAM was the highest (299 GB for myloasm versus 211 GB for metaFlye and 47.3 GB for metaMDBG). Overall, myloasm provides substantial runtime improvements over existing methods with competitive memory usage.

2.9. Recovery of plasmids and viruses

We evaluated the recovery of plasmids and viruses (Supplementary Fig. 8) on real metagenomes by predicting plasmid/viral contigs with geNomad [61] and then marking a contig as duplicated if the k-mer multiplicity is > 1.25. We focus on duplications as a measure of structural assembly error because (1) duplications are common for long-read plasmid misassemblies [51] and (2) they can be estimated without a reference genome. Plasmid/viral prediction software can have false positives, so we restricted analysis to contigs with a plasmid conjugation gene (including mobilization genes) detected by geNomad or circular plasmids (Supplementary Fig. 8A). For viruses, we used CheckV on predicted contigs to measure the sensitivity of high-quality (HQ) virus recovery (Supplementary Fig. 8B).
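
The duplication criterion can be illustrated with a simple k-mer count. The exact definition of k-mer multiplicity is not given in this section, so the sketch assumes it is the ratio of total to distinct k-mers in a contig; the value `k = 21`, the function names, and the simulated sequence are likewise assumptions for illustration.

```python
import random


def kmer_multiplicity(seq, k=21):
    """Ratio of total to distinct k-mers in a sequence (assumed definition)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    return len(kmers) / len(set(kmers))


def is_duplicated(seq, k=21, threshold=1.25):
    """Flag a contig as duplicated using the > 1.25 cutoff from the text."""
    return kmer_multiplicity(seq, k) > threshold


rng = random.Random(0)
unit = "".join(rng.choice("ACGT") for _ in range(500))  # toy plasmid sequence
print(is_duplicated(unit))      # single copy
print(is_duplicated(unit * 2))  # head-to-tail dimer
```

A head-to-tail dimer of a non-repetitive sequence has a multiplicity close to 2, so it exceeds the cutoff, whereas a single copy stays near 1.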

For plasmids on ONT data, myloasm recovered more circular, non-duplicated plasmids (median 30.5) than metaFlye (25.5) and metaMDBG (8.5). For HiFi, myloasm also recovered more circular plasmids (median 21) than the other methods (7.0 for metaFlye, 5.0 for hifiasm-meta, and 3.0 for metaMDBG). MetaMDBG performed poorly for circular plasmids: a median of 71.4% of putatively circular plasmids were duplicated for ONT and 79.0% for HiFi. Myloasm had the lowest plasmid duplication rate for ONT (median 1.1%) and second best for HiFi (median 4.8% vs 1.9% for hifiasm-meta).

For viruses on ONT data, myloasm recovered fewer HQ viruses (median 50.5 versus 63 for metaMDBG and 71 for metaFlye) but comparable circular HQ viruses (median 31 versus 33 for metaMDBG and 35 for metaFlye). For viruses on HiFi data, myloasm excelled, recovering more HQ viruses in general (median of 120 non-circular; 72.0 circular) than metaMDBG (88 non-circular; 52.5 circular), metaFlye (65 non-circular; 13.5 circular) and hifiasm-meta (39.5 non-circular; 21 circular). The discrepancy was notable on the Seawater 2 dataset, where myloasm recovered 243 circular, HQ, and non-duplicated viruses, whereas metaFlye recovered 42 (an approximately 5.8-fold difference). MetaMDBG’s rate of circular duplication was still higher for both ONT (12% median vs 5.6% for metaFlye and 2.4% for myloasm) and HiFi (17.4% vs 0.4% for metaFlye, 0.4% for hifiasm-meta, and 1.3% for myloasm). Overall, myloasm improves recovery of putative circular plasmid and viral contigs across a range of metagenomes and technologies without erroneously duplicating small genomes too often.

2.10. Myloasm reveals within-species heterogeneity and strain-specific dynamics in ONT metagenomes

We investigated myloasm’s ability to recover similar coexisting genomes within real communities. For myloasm’s assemblies on the six ONT datasets, we counted the high-quality (HQ) contigs (i.e., > 90% complete and < 5% contaminated) with > 70% aligned fraction (AF) and > 93% ANI to another contig within the assembly—we denote this as a species neighbor. Here 93% ANI was used as a relaxed species-level similarity threshold [62] that suggests either a within-species relationship or two very similar species. For the three gut metagenomes, 26 out of 174 high-quality contigs had a species neighbor (14.9%). The two oral metagenomes had more heterogeneity around the species level, with 157 out of 275 (57.1%) having a species neighbor. For the microflora soil dataset, 6 out of 65 (9.2%) had a species neighbor.
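
The species-neighbor count can be sketched as a filter over pairwise contig comparisons. The input format (one tuple per contig pair with a single ANI and AF value) and the function name are assumptions; in practice AF can differ by direction and the values would come from an ANI tool run on the HQ contigs.

```python
def contigs_with_species_neighbor(pairs, min_ani=93.0, min_af=70.0):
    """Return the set of contigs with at least one other contig above both
    thresholds. `pairs` holds (contig_a, contig_b, ani_percent, af_percent)."""
    neighbors = set()
    for contig_a, contig_b, ani, af in pairs:
        if contig_a != contig_b and ani > min_ani and af > min_af:
            neighbors.update((contig_a, contig_b))
    return neighbors


# Hypothetical comparisons between HQ contigs.
pairs = [
    ("c1", "c2", 97.8, 75.0),  # passes both thresholds
    ("c1", "c3", 92.0, 80.0),  # ANI too low
    ("c4", "c5", 95.0, 60.0),  # aligned fraction too low
]
with_neighbor = contigs_with_species_neighbor(pairs)
n_hq_contigs = 5
print(sorted(with_neighbor), f"{100 * len(with_neighbor) / n_hq_contigs:.1f}%")
```

The reported percentages follow the same arithmetic, for example 26 of 174 gut contigs is 14.9%.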

We first highlight a gut metagenome (Gut 2) with interesting patterns of within-species diversity (Fig. 6A–C). Phylogenetic analysis of single-contig, HQ genomes (Fig. 6A) showed more comprehensive recovery with myloasm (72 genomes and 10 phyla) than nanoMDBG (18 genomes and 8 phyla) and metaFlye (18 genomes and 8 phyla). Myloasm assembled six Prevotella copri [63,64] (also known as Segatella copri) single-contig HQ genomes in this sample (Fig. 6A, B), with four of them circular. All other species in the Bacteroidota phylum only had one genome recovered. The P. copri genomes had length 4.0–4.3 Mbp with mean completeness 99.6% and contamination 1.2%. All genomes had 73–78% AF and 97.76–98.19% ANI to each other. Interestingly, we also assembled five circular elements between 127 and 155 kbp that had sequence homology to known extrachromosomal elements of P. copri [63]. Four out of six of these circular elements had mean depth of coverage within one of a P. copri genome, as well as large shared genomic regions, allowing us to infer host linkage (Supplementary Fig. 9). In contrast, metaFlye and nanoMDBG could not recover any P. copri contigs of length > 700 kbp (less than 17% of the genome) (Fig. 6B).

High-contiguity assemblies enable investigations into mobile genetic elements and antibiotic resistance genes (ARGs) that short reads have difficulty assembling [65,31,66]. We highlight a particular example of ermF, an ARG that confers resistance to erythromycin through rRNA methylation [67]. ermF was prevalent within 9 of myloasm’s 23 Bacteroidota single-contig HQ genomes (Fig. 6C). Interestingly, only one of the six P. copri genomes had the ermF gene present, highlighting intraspecific patterns of ARG dissemination. The nine ermF genes could be separated into two distinct ermF sequences with 98% nucleotide identity to each other. Seven of the nine genomes shared an identical (100% nucleotide similarity) ermF sequence, confirming recent transfer across even Muribaculaceae and Bacteroidaceae families. Two of the nine genomes shared the other ermF sequence identically. These two distinct ermF genes could also be delineated by their gene contexts; the second ermF sequence was clearly present in an integrative and conjugative element found in both Prevotella genomes. Thus, myloasm’s increased assembly resolution can resolve strain-specific patterns of genome dynamics and distinct evolutionary trajectories of antibiotic resistance genes.

Lastly, we investigated the Oral 1 ONT metagenome (Fig. 6D–F). Oral microbiomes are taxonomically distinct from gut microbiomes [68], and our analysis found many genomes that are highly similar within this metagenome (Supplementary Fig. 10). For all contigs of length ≥ 500 kbp, we analyzed how many contigs had > 90% ANI and > 70% AF to another contig within the assembly (Fig. 6D). Myloasm recovered 82 such contigs versus only 6 for nanoMDBG and metaFlye. An interesting example was a cluster of eight Saccharibacteria (also known as TM7) genomes of the genus Nanosynbacter with size ≈ 700 kbp (Fig. 6E, F). Six of the eight genomes were recovered as circular contigs. Seven of eight had > 95% completeness and < 0.6% contamination; all had > 93.9% ANI. We visualized the six circular Nanosynbacter genomes (Fig. 6F), finding perfectly conserved synteny even though several pairs of genomes had ANI < 95%, confirming previous findings on synteny conservation in TM7 [69]. We also found large unique regions within the Nanosynbacter genomes that had phage tail and capsid proteins, indicating the presence of strain-specific prophages. Thus, myloasm’s assemblies follow previous biological findings while highlighting the diversity across similar species and strains.

3. Discussion

We have created myloasm, a new metagenome long-read assembler for modern long reads with greater than ≈ 97% accuracy, such as hac and sup base-called ONT R10.4 data and PacBio HiFi. Myloasm obtains substantially more MAGs and circular, high-quality contigs than previous ONT methods on synthetic and real data (Fig. 2–4). Myloasm also often improves results for HiFi data compared to previous methods and performs well for diverse metagenomes. Interestingly, myloasm demonstrates that ONT and HiFi reads are becoming more comparable, with myloasm even assembling more circular and complete genomes with ONT than HiFi (Fig. 4B) for a jointly-sequenced gut metagenome.

In general, our philosophy was to design myloasm as a metagenome assembler rather than adapt an isolate assembler for metagenomes. This led to our technical innovation of using SNPmers and SNPmer-associated overlap statistics. SNPmers are essentially a way of aggregating sequencing redundancy to find polymorphic markers, similar in spirit to error correction of reads. The ability to estimate true sequence identity through SNPmers gives a continuous measure of overlap similarity. This was crucial for separating similar strains and calculating coverage, which is key for metagenome assembly; the graph cleaning step makes substantial use of our coverage estimation process.

Myloasm substantially improves upon previous ONT assemblers when similar genomes are present (Fig. 1 and Fig. 6), although this is not the only reason for myloasm’s improved assemblies. We found that myloasm obtained higher contiguity even if no within-species heterogeneity was assembled (see Fig. 6A). The term “strain-resolved” assembly is sometimes used to imply a method that can assemble multiple genomes of the same species [70,71,72]. There are fundamental issues with strain-resolved assembly, both theoretically and practically. One issue is that the notion of a clonal strain does not make sense in communities with high rates of recombination and gene flow. Theoretically, it is not even clear what a genome assembly should represent in this case, and we are not aware of existing mathematical theories that model genome assembly for a continuous “cloud” of genotypes. A more practical issue is that even if we could assemble a collection of genomes that differ by a few substitutions, it may be preferable to output a single genome instead for interpretability. As sequencing technologies enable even higher resolution assemblies, these fundamental issues deserve more thought.

Finally, we note that errors are present in all assemblies [43]. Automated quality control software, such as CheckM2 and CheckV, is still necessary for contigs (Fig. 5). We have found errors to arise due to a range of issues, from sequencing artifacts (e.g., chimeric reads) to erroneous graph resolution for regions with high similarity. Our mylotools software enables inspection of contigs via integrated plots of GC content, cumulative GC skew [73], read coverage, and read overlap lengths to assist users (Supplementary Fig. 1). Nevertheless, more work is needed to understand the algorithmic causes of assembly errors in complex metagenome assemblies [74]. We conclude by noting that long-read metagenome assembly is far from a solved problem, and we believe many opportunities still exist for optimization from both theoretical and practical points of view.

4. Methods

Myloasm broadly follows the string graph paradigm of Myers [38]. Briefly, we first remove reads contained in other reads, compute overlaps of the non-contained reads, and generate an assembly graph. In this assembly graph, nodes are reads and edges are overlaps. Non-branching paths are then merged into unitigs. This unitig graph is subject to additional processing and cleaning, for example, tip removal (removing some nodes with no out or in-edges) and bubble popping [75]. The final contigs are spelled out by walks along the cleaned string graph.

There are two key challenges in the string graph approach. First, we must determine what constitutes a valid overlap. This is non-trivial because two reads may come from similar yet distinct genomic regions, but sequencing error may be higher than genomic divergence. To tackle this challenge, we formulate a new statistical model that uses SNPmers to discern true polymorphism from sequencing errors. Second, the assembly graph will inevitably have spurious edges (i.e., overlaps) due to perfect repetitive regions. This graph must be simplified using overlap information and sequencing coverage. We develop a novel probabilistic path model to inform graph simplification, and summarize the graph cleaning procedure in Algorithm 1.

4.1. Capturing within-sample polymorphism with SNPmers

We use SNPmers in analogy to how heterozygous alleles are used to phase reads for diploid or polyploid genomes [76]. This is done by focusing on polymorphic sites between inter- or intra-genomic repeats. Ignoring non-polymorphic bases increases the signal-to-noise ratio under sequencing error. However, since we do not have a reference genome, we use k-mers as a reference-free context [77,44] to call polymorphisms. Then, we compare these polymorphic k-mers to disambiguate strains. This technique was also called split k-mers by Derelle et al. [44].

Definition 1. Assume k, the k-mer length, is odd. A pair of k-mers is called a SNPmer pair if they have a different middle base but exact matches for the two flanks of length $(k - 1) / 2$ . For example, AACAA and AAGAA are a SNPmer pair for k = 5. We call any of the k-mers within a SNPmer pair a SNPmer. We only consider biallelic SNPmer pairs.

We use k = 21 as the default parameter in myloasm. This choice is inspired by previous k-mer comparison methods that show k = 21 leads to a good balance between sensitivity and k-mer uniqueness in bacterial genomes [78].
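As a minimal illustration of Definition 1, the following Python sketch checks whether two k-mers form a SNPmer pair. The function name `is_snpmer_pair` is hypothetical and is not taken from the myloasm code base.

```python
def is_snpmer_pair(a: str, b: str) -> bool:
    """Return True if a and b are k-mers (odd k) that differ only at the middle base."""
    k = len(a)
    if k != len(b) or k % 2 == 0:
        return False
    mid = (k - 1) // 2
    # Both flanks of length (k - 1) / 2 must match exactly; the middle bases must differ.
    return a[:mid] == b[:mid] and a[mid + 1:] == b[mid + 1:] and a[mid] != b[mid]


print(is_snpmer_pair("AACAA", "AAGAA"))  # True
print(is_snpmer_pair("AACAA", "ATGAA"))  # False
```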

4.2. Counting k-mers and filtering for SNPmers

To obtain a set of SNPmer pairs (and thus, SNPmers) for a sample, myloasm first counts all k-mers within the reads. We represent each k-mer in its canonical form: the lexicographically smaller of itself and its reverse complement. We keep track of whether the k-mer was in forward or reverse orientation relative to its canonical form. We ignore k-mers where the first $(k - 1) / 2$ bases and the last $(k - 1) / 2$ bases are reverse complements of each other. Such k-mers are strand-ambiguous if the middle base is ignored. After obtaining all forward and reverse counts of each canonical k-mer, we retain only k-mers that have been seen at least once on both strands; we use two bloom filters, one for each strand, to filter out rare k-mers. This reduces erroneous SNPmers because sequencing errors often have strand bias [79].

To efficiently find SNPmer pairs, we lexicographically sort the canonical k-mers while masking the middle base. This groups all k-mers with the same flanking k-mers in the sorted list. If there are more than two k-mers with the same flanking $(k - 1) / 2$ -mers, we take the top two most frequent k-mers as a potential SNPmer pair.
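The canonicalization and grouping steps can be sketched in Python as follows. This is an illustrative sketch only: all function names are hypothetical, a dictionary keyed by the two flanks stands in for the masked lexicographic sort described above, and the bloom filters are omitted.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]


def canonical(kmer: str):
    """Return (canonical k-mer, True if the input was already in canonical orientation)."""
    rc = revcomp(kmer)
    return (kmer, True) if kmer <= rc else (rc, False)


def strand_ambiguous(kmer: str) -> bool:
    """True if the left flank is the reverse complement of the right flank."""
    mid = (len(kmer) - 1) // 2
    return kmer[:mid] == revcomp(kmer[mid + 1:])


def candidate_snpmer_pairs(counts: dict):
    """Group canonical k-mers by their two flanks; keep the two most frequent per group."""
    groups = defaultdict(list)
    for kmer, count in counts.items():
        mid = (len(kmer) - 1) // 2
        groups[(kmer[:mid], kmer[mid + 1:])].append((count, kmer))
    pairs = []
    for members in groups.values():
        if len(members) >= 2:
            top = sorted(members, reverse=True)[:2]
            pairs.append((top[0][1], top[1][1]))
    return pairs
```

When the two flanks are not reverse complements of each other, the canonical orientation is decided within the flanks alone, so k-mers that differ only at the middle base are canonicalized consistently and end up in the same group.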

Finally, we retain these SNPmer pairs if they pass two additional tests. First, we test for strand bias by using a two-sided Fisher’s exact test on the 2x2 contingency table formed by the forward and reverse counts of the k-mer pair. We reject this pair if the odds ratio is > 1.5 and the p-value is < 0.005. We found Fisher’s exact test is too strict for high coverage counts, so we used an odds ratio condition rather than just a p-value.

Second, we use a binomial test based on the total count for each k-mer and a null error probability of 0.025: letting A be the total count for the higher frequency k-mer, and B the smaller frequency k-mer, we reject the SNPmer pair if Pr(Binom(A, 0.025) > B) > 0.05. If both tests pass, we retain the SNPmer pair. Despite ONT error rates being < 2.5%, ONT errors depend on the k-mer context, and some specific k-mers may have elevated error rates [80]. Thus, we chose a more conservative null error probability.
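The two filters can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than myloasm's implementation: the function names are hypothetical, Fisher's exact test is hand-rolled (summing all tables no more likely than the observed one), and the odds ratio is folded to be at least 1 so that bias in either direction is caught, which is our reading of the odds ratio condition.

```python
from math import comb, exp, lgamma, log


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def strand_bias_rejects(fwd_a, rev_a, fwd_b, rev_b, max_odds=1.5, alpha=0.005):
    """Reject the pair only if the bias is both large (odds ratio) and significant."""
    num, den = fwd_a * rev_b, rev_a * fwd_b
    if num == 0 or den == 0:
        odds = float("inf")
    else:
        odds = max(num / den, den / num)  # assumption: fold the odds ratio to >= 1
    return odds > max_odds and fisher_two_sided(fwd_a, rev_a, fwd_b, rev_b) < alpha


def binomial_rejects(A: int, B: int, err: float = 0.025, alpha: float = 0.05) -> bool:
    """Reject if Pr(Binom(A, err) > B) > alpha, i.e. B is explainable by errors alone."""
    log_p, log_q = log(err), log(1 - err)
    cdf = 0.0
    for i in range(min(A, B) + 1):
        log_pmf = (lgamma(A + 1) - lgamma(i + 1) - lgamma(A - i + 1)
                   + i * log_p + (A - i) * log_q)
        cdf += exp(log_pmf)
    return 1.0 - cdf > alpha
```

For example, with counts of 1000/1100 (forward/reverse) for one k-mer and 1000/900 for the other, the folded odds ratio is about 1.22, so the pair is kept regardless of how small the p-value is at that depth.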

4.3. Open syncmer and SNPmer read indexing

Myloasm indexes the reads with SNPmers from the previous step. However, SNPmers themselves are not reliable enough to map reads: genomes with minimal variation within the metagenome should have minimal SNPmers. Thus, we also index reads with open syncmers [47].

Definition 2. Given s < k, a k-mer is an open syncmer if the following holds: the smallest s-mer within the k-mer, subject to some fixed s-mer ordering, is the s-mer in position (k – s + 1) from the left (i.e., the middle s-mer). Define c = (k – s + 1).

Under appropriate stochasticity assumptions, a fraction $1 / (k - s + 1) = 1 / c$ of k-mers is expected to be sampled as open syncmers [47]. This leads to a speedup and memory improvement of c times. We order the s-mers by their numerical value after applying an invertible hash for 64-bit integers [50] and let k = 21 and s = 11 by default. We use open syncmers as opposed to the more common minimizer paradigm [81]. Open syncmers do not have the context-dependency property of minimizers [81,82]; minimizers lead to biased sequence identity estimates [83] that would impact our method later on. Furthermore, open syncmers have good empirical sensitivity [82,84] and theoretical guarantees for sequence alignment [85].
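The following Python sketch illustrates open syncmer selection and its context independence. Several choices here are assumptions made for illustration: the SplitMix64 finalizer stands in for the invertible hash of [50], the selected offset defaults to the 0-based start of the middle s-mer, and ties are broken toward the leftmost s-mer.

```python
import random

MASK64 = (1 << 64) - 1
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def mix64(x: int) -> int:
    """SplitMix64 finalizer: an invertible 64-bit mixing function (stand-in hash)."""
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def open_syncmer_positions(seq: str, k: int = 21, s: int = 11, offset=None):
    """Return start positions of k-mers whose smallest s-mer sits at the given offset."""
    c = k - s + 1
    if offset is None:
        offset = (k - s) // 2  # assumption: 0-based start of the middle s-mer
    # Hash every s-mer of the sequence once (2 bits per base).
    hashes = []
    for i in range(len(seq) - s + 1):
        value = 0
        for base in seq[i:i + s]:
            value = (value << 2) | ENCODE[base]
        hashes.append(mix64(value))
    positions = []
    for start in range(len(seq) - k + 1):
        window = hashes[start:start + c]  # the c s-mers inside this k-mer
        if min(range(c), key=window.__getitem__) == offset:
            positions.append(start)
    return positions


rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(20000))
hits = open_syncmer_positions(seq)
density = len(hits) / (len(seq) - 21 + 1)  # expected to be close to 1 / c = 1 / 11
```

Whether a k-mer is selected depends only on the k-mer itself, not on its neighbours: extracting any selected k-mer and running the function on it alone selects it again.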

4.4. Read mapping and overlapping with double k-mer chaining

Define an anchor to be a k-mer “match” (either SNPmer or open syncmer) between two sequences. These anchors are found in different ways: we let open syncmer anchors be exact k-mer matches. In contrast, we let SNPmer anchors be matches on only the two flanking $(k - 1) / 2$ -mers, thus ignoring the middle base. After finding all anchors, we perform two rounds of chaining: first on the open syncmers, then on the SNPmers.

Definition 3. We represent an anchor as a pair (x, y) of k-mer positions in the first and second sequence respectively. A co-linear chain (hereafter shortened to just chain) is a sequence of anchors (x₁, y₁), (x₂, y₂), … such that $x_{i+1} > x_{i}$ and $y_{i+1} > y_{i}$ . Chaining is the computational problem of finding an optimal chain subject to some objective function.

Chains capture the backbone of a sequence alignment and can be “extended” to form a full alignment. Note that to handle reverse complements, we can change the second condition to $y_{i} > y_{i+1}$ , but for exposition we will assume the forward case. For all stages except the last polishing stage, myloasm does not do any base level alignment, only chaining.

Algorithmically, we follow the chaining procedure in minimap2 [49] with modifications. Given a lexicographically sorted list of anchors, define the maximum chaining score for the ith anchor (x_i, y_i) as

f(i) = \max \left\{ \max_{j < i} \{ f(j) + α(i, j) - β(i, j) \}, w \right\} .

The goal of chaining is to find the maximal f(i) over all anchors. By keeping track of the optimal predecessor j for each i, we can recover the chain that corresponds to the maximal f(i). w reflects the goodness of k-mer matching. $α(i, j) = \min \{x_{i} - x_{j}, y_{i} - y_{j}, w\}$ reflects the score of appending an additional anchor after penalizing overlapping anchors. $β(i, j) = |(x_{i} - x_{j}) - (y_{i} - y_{j})|$ is the gap cost. Additionally, we let $β = \infty$ if $|x_{i} - x_{j}| > G$ or $|(x_{i} - x_{j}) - (y_{i} - y_{j})| > G^{'}$ to break chaining for long gaps (an infinite gap cost makes the candidate score $- \infty$ ). We let $w = c$ for open syncmer chaining ( $c = k - s + 1$ ; $k = 21$ and $s = 11$ by default) and $w = 50$ for SNPmer chaining. $G^{'} = 200$ and $G = 10000$ by default. These values were chosen to be in the same order of magnitude as in the minimap2 algorithm and tuned heuristically.

Given N anchors, the optimal chain can be found in O(N²) time by computing f(i) with two for-loops and then backtracking. However, this quadratic time complexity is prohibitive. We therefore use minimap2’s heuristic of breaking the inner for-loop if the chaining score is not improved after h = 10 iterations.
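The recurrence and the early-termination heuristic can be sketched as follows. This is a simplified illustration, not the minimap2 or myloasm code: the function name `chain` is hypothetical, only the forward-strand case is handled, and the heuristic is approximated by stopping after more than h consecutive non-improving predecessors.

```python
def chain(anchors, w, G=10000, G_prime=200, h=10):
    """Return (best score, best chain) for anchors given as (x, y) tuples."""
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    f = [float(w)] * n   # starting a new chain at anchor i scores w
    pred = [-1] * n      # optimal predecessor of each anchor, -1 if none
    for i in range(n):
        xi, yi = anchors[i]
        misses = 0
        for j in range(i - 1, -1, -1):
            xj, yj = anchors[j]
            dx, dy = xi - xj, yi - yj
            if dx > G:
                break  # anchors are sorted by x, so earlier anchors are even farther
            if dx <= 0 or dy <= 0 or abs(dx - dy) > G_prime:
                continue  # not co-linear, or the gap cost is infinite
            score = f[j] + min(dx, dy, w) - abs(dx - dy)
            if score > f[i]:
                f[i], pred[i], misses = score, j, 0
            else:
                misses += 1
                if misses > h:
                    break  # heuristic: give up after h non-improving candidates
    best = max(range(n), key=f.__getitem__)
    best_score = f[best]
    path = []
    while best != -1:
        path.append(anchors[best])
        best = pred[best]
    return best_score, path[::-1]
```

For the anchors (0, 0), (10, 10), (20, 20), (500, 30) with w = 11, the first three anchors chain with score 11 + 10 + 10 = 31, while the last anchor is excluded because its gap difference of 470 exceeds G′ = 200.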

4.5. Estimating true sequence alignment identity with double chains

After double chaining, myloasm checks the middle bases in the SNPmer chain. The number of mismatched SNPmer anchors (i.e., mismatched middle bases) is due to both true differences and sequencing errors. Long-read sequencing errors are mostly indels [80]. Indels usually change the flanking $(k - 1) / 2$ -mers, so SNPmer anchors would rarely occur under indels. On the other hand, true mutations for prokaryotes are usually substitutions [86]. Thus, SNPmer anchors should form for true mutations. Of course, substitution sequencing errors could lead to SNPmer anchors, but low-frequency substitution errors should be filtered out by the SNPmer calling step. In addition, we enforce a heuristic Q23 base quality threshold (99.5% accuracy) on the middle base of a SNPmer match.

To summarize, we use the following intuition to estimate a true overlap sequence identity. (1) Sequencing errors will lead to fewer anchors (both SNPmer and open syncmer anchors), whereas (2) true mutations lead to SNPmers with mismatched middle bases. We use this intuition to estimate true sequence divergence through a probability model as follows.

Definition 4. Let X be a uniformly random string. Let Y be a version of X where each base is mutated to another base independently with probability θ. After mutation, mark each base as “erroneous” with probability ϵ.

We examine the set of k-mer matches between X and Y. We exclude k-mers with an “erroneous” base under this idealized error model as follows. A k-mer on X is a mismatched SNPmer on Y with probability ${(1 - θ)}^{k - 1} θ \cdot {(1 - ϵ)}^{k}$ : the middle base mutates (probability of θ), the other bases do not (probability of ${(1 - θ)}^{k - 1}$ ), and there are no sequencing errors (probability of ${(1 - ϵ)}^{k}$ ). By linearity of expectation,

E [number of SNPmers with mismatched middle base] = | X | \cdot {(1 - θ)}^{k - 1} θ \cdot {(1 - ϵ)}^{k}

(1)

where |X| is the length of X (we assume that $| X | \approx | X | - k + 1$ , the number of k-mers).

The left-hand side of Equation 1 can be estimated from SNPmer chaining. However, we cannot estimate the true sequence identity θ from this equation alone: there is latent error ϵ. This is where we use the open syncmer chain. The probability of an open syncmer anchor forming is $\frac{1}{c} (1 - θ)^{k} \cdot (1 - ϵ)^{k}$ : a fraction $\frac{1}{c}$ of k-mers are open syncmers, and for the open syncmer to match, there must be no errors and no mutations. Note that this would not hold for minimizers [81] due to the minimizer context dependency issue [47,82,83]. Thus,

E [number of open syncmer matches] = | X | \frac{1}{c} (1 - θ)^{k} \cdot (1 - ϵ)^{k} .

(2)

To estimate θ, we divide the two expressions.

R = \frac{E [number of SNPmers with mismatched middle base]}{E [number of open syncmer matches] \cdot c} = \frac{θ}{(1 - θ)} .

(3)

\Rightarrow θ = \frac{R}{(1 + R)}

(4)

In summary, we can “divide out” the sequencing error by using the double chain. It turns out that the above idea leads to a provably consistent estimator for θ as the overlap length goes to ∞, under some mild assumptions.

Theorem 1. Let M_n be the number of mismatched SNPmers for a random string X and Y of length n under our random model. Let O_n be the analogous random variable for the number of matching open syncmers. Set

R_{n} = \frac{M_{n}}{O_{n} \cdot c} .

Under an idealized model of X where every k-mer and s-mer is unique, as n → ∞,

\frac{R_{n}}{1 + R_{n}} \overset{p}{\to} θ .

That is, the left hand side converges to the right hand side in probability as the overlap gets large.

We prove this theorem in the Supplementary Materials. We use techniques from Spouge et al. [87] showing that M_n and O_n have central limit theorems [88]. This leads to our estimator of sequence identity, θ, as follows.

Definition 5. For a double chain between two sequences, let M be the number of mismatched SNPmers and O be the number of open syncmer anchors. Define

r = \frac{M}{O \cdot c} .

Then, myloasm’s estimator of sequence identity, $\hat{θ}$ , is defined as

\hat{θ} = \frac{r}{(1 + r)} .

(5)

We show the effectiveness of the $\hat{θ}$ estimator in Supplementary Fig. 11 relative to using a raw sequence alignment identity. Thus, $\hat{θ}$ is an “error-corrected” divergence signal that allows us to distinguish error from true sequence divergence.
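To make the cancellation of the error term concrete, the sketch below plugs the expectations of Equations 1 and 2 into the estimator of Definition 5. The function names are hypothetical, and the check uses expected counts rather than simulated reads.

```python
def theta_hat(mismatched_snpmers: float, syncmer_anchors: float, c: int) -> float:
    """Estimator of Definition 5: r = M / (O * c), theta_hat = r / (1 + r)."""
    if syncmer_anchors <= 0:
        raise ValueError("need at least one open syncmer anchor")
    r = mismatched_snpmers / (syncmer_anchors * c)
    return r / (1 + r)


def expected_counts(n: int, theta: float, eps: float, k: int = 21, s: int = 11):
    """Expected M (Equation 1) and O (Equation 2) for a string of length n."""
    c = k - s + 1
    shared = (1 - theta) ** (k - 1) * (1 - eps) ** k
    expected_m = n * shared * theta
    expected_o = n / c * shared * (1 - theta)
    return expected_m, expected_o, c


for eps in (0.0, 0.02, 0.05):
    m, o, c = expected_counts(n=10000, theta=0.01, eps=eps)
    # The recovered divergence does not depend on the error rate eps.
    print(eps, round(theta_hat(m, o, c), 6))
```

The error rate changes both expected counts by the same factor $(1 - ϵ)^{k}$ , so their ratio, and hence the estimate, is unaffected.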

4.6. Removing contained reads

To build a string graph [38], we must remove reads that are contained (in another longer read). We will call a non-contained read an outer read. Removal of contained reads can be done by all-to-all read mapping. However, this O(N²) computation for N reads is computationally intensive and often the main bottleneck in assembly workflows [89]. We instead devise a method that, for each queried read, quickly obtains a set of possible outer reads. As soon as the queried read is contained in a single read, we can move to the next query. This obviates the need for further alignments involving this query read.

Myloasm first builds a sparse hash table index with k-mers. For each read, myloasm filters out open syncmers if their hash value is > (2⁶⁴ – 1)/C (C = 4 by default). This downsamples open syncmers by C = 4 times. Myloasm then inserts the filtered open syncmers as a key and appends its read ID to the value.

Next, for each query read, we query the hash table with all open syncmers. We track reads in the index that are “hit” by the query read’s open syncmers. We also track the query read’s leftmost and rightmost open syncmers for each read that is hit. After querying all open syncmers, we retain a hit read only if the leftmost and rightmost open syncmers of the query read that hit this read span at least 90% of the query read’s length. We then sort these candidate outer reads in descending order by the length of this span multiplied by the total number of open syncmers hit.
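The sparse index and the candidate ranking can be sketched as follows. This is an illustration under simplifying assumptions: the function names and the input layout (a dictionary from read ID to a list of (position, hash) open syncmers) are hypothetical, and the span is measured between syncmer start positions.

```python
from collections import defaultdict

MAX_HASH = 2**64 - 1


def build_sparse_index(read_syncmers, C=4):
    """Map each retained syncmer hash to the IDs of reads that contain it."""
    index = defaultdict(set)
    for read_id, syncmers in read_syncmers.items():
        for _pos, h in syncmers:
            if h <= MAX_HASH // C:  # keeps roughly 1 / C of all open syncmers
                index[h].add(read_id)
    return index


def candidate_outer_reads(query_id, syncmers, query_len, index, min_frac=0.9):
    """Rank reads whose hits span at least min_frac of the query read."""
    hits = {}  # read ID -> [leftmost query pos, rightmost query pos, number of hits]
    for pos, h in syncmers:
        for read_id in index.get(h, ()):
            if read_id == query_id:
                continue
            if read_id not in hits:
                hits[read_id] = [pos, pos, 1]
            else:
                entry = hits[read_id]
                entry[0] = min(entry[0], pos)
                entry[1] = max(entry[1], pos)
                entry[2] += 1
    ranked = []
    for read_id, (left, right, count) in hits.items():
        span = right - left
        if span >= min_frac * query_len:
            ranked.append((span * count, read_id))
    ranked.sort(reverse=True)  # largest span * hit count first
    return [read_id for _score, read_id in ranked]
```

Candidates returned by such a ranking would still need to be verified by chaining before a read is declared contained.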

Finally, we do double chaining against each candidate outer read in the sorted list of hit reads. If (1) the number of mismatched SNPmers is 0 (i.e., $\hat{θ}$ in Equation 5 is 0) and (2) the read open syncmer chain plus 30 · c spans 95% of the query read, then we remove the query read as a contained read.

4.7. Mapping to outer reads and calculating coverage

After removing contained reads, we map all reads against outer (i.e., non-contained) reads. The set of outer reads is usually much smaller than the set of all reads, so myloasm builds a new open syncmer index but only on the outer reads. Myloasm then performs all-to-outer double chaining and retains mappings with $\hat{θ} \leq 0.01$ .

Our goal is to calculate coverage for each read based on the read mappings. We differentiate two types of mappings: maximal mappings and local mappings.

Definition 6. Suppose the leftmost and rightmost positions for an open syncmer chain are (A₁, B₁) on read R₁ and (A₂, B₂) on read R₂. Let |R| be the length of a read R. A mapping is maximal if, for a threshold F (default = 300), the expression (A₁ < F OR A₂ < F) AND (B₁ + F > |R₁| OR B₂ + F > |R₂|) is true. Otherwise, it is a local mapping.

Intuitively, a maximal mapping indicates that the mapping is as long as possible; it can not be extended much further without going past the end of a read. If the mapping from R₁ to R₂ is not maximal, there are regions without similar bases. Thus, the two reads are unlikely to originate from the same region of the same genome. Only maximal mappings should be used for calculating coverage.
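Definition 6 translates directly into a small predicate. The sketch below uses a hypothetical function name and assumes forward-strand coordinates on both reads.

```python
def is_maximal(a1, b1, len1, a2, b2, len2, F=300):
    """Maximal mapping test of Definition 6 for chain ends (a1, b1) and (a2, b2)."""
    starts_at_an_end = a1 < F or a2 < F
    stops_at_an_end = b1 + F > len1 or b2 + F > len2
    return starts_at_an_end and stops_at_an_end


# Suffix of a 10 kbp read overlapping the prefix of an 8 kbp read: maximal.
print(is_maximal(6000, 9900, 10000, 0, 3900, 8000))     # True
# Match confined to the interior of both reads: local.
print(is_maximal(3000, 5000, 10000, 2000, 4000, 8000))  # False
```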

We define the depth of coverage for each read as the minimum depth along each read, similar to Nurk et al. [90]. This is done to avoid overestimating coverage for repetitive regions. We define an identity-thresholded coverage as follows.

Definition 7. Let cov_R(ω) be the minimum depth along R after removing maximal mappings with $\hat{θ} > ω$ .

We will calculate multiple coverages. We found that in high-diversity strain populations, there may be no perfect overlaps—that is, $\hat{θ} > 0$ for every mapping. In this case, it is still useful to get a rough sense of coverage at lower resolutions.
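A minimal sketch of the identity-thresholded coverage of Definition 7 is given below. The function name and the input layout are hypothetical; mappings are assumed to be half-open intervals on the read and to already be restricted to maximal mappings, and any special treatment of read ends is ignored.

```python
from collections import defaultdict


def min_depth(read_len, mappings, omega):
    """cov_R(omega): minimum depth over [0, read_len) using mappings with theta_hat <= omega.

    mappings: iterable of (start, end, theta_hat) tuples on the read.
    """
    delta = defaultdict(int)
    for start, end, theta_hat in mappings:
        start, end = max(0, start), min(read_len, end)
        if theta_hat <= omega and start < end:
            delta[start] += 1
            delta[end] -= 1
    depth, prev, lowest = 0, 0, None
    for pos in sorted(delta):
        if pos > prev:  # the segment [prev, pos) has the current depth
            lowest = depth if lowest is None else min(lowest, depth)
        depth += delta[pos]
        prev = pos
    if prev < read_len:  # trailing segment after the last breakpoint
        lowest = depth if lowest is None else min(lowest, depth)
    return 0 if lowest is None else lowest


mappings = [(0, 60, 0.0), (40, 100, 0.0), (0, 100, 0.005)]
print(min_depth(100, mappings, 0.0))   # 1
print(min_depth(100, mappings, 0.01))  # 2
```

Evaluating the same read at several thresholds, as in the two calls above, gives the family of coverages used in later steps.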

4.8. Cleaning chimeric reads and large sequencing errors

Long reads can have large structural sequencing errors. For example, chimeric reads stitch together segments from two or more distinct genomes. Nanopore reads can also have large errors in low-complexity regions [91]. These errors can complicate the string graph. Long reads—especially ONT—can have highly heterogeneous lengths. Thus, long erroneous reads can eliminate contained reads and break contiguity [92,93]; see Supplementary Fig. 12. We found this to be the primary cause of errors in ONT assemblies.

We clean erroneous outer reads by using the all-against-outer local read mappings. If a region along the outer read has depth d ≤ 2 but one of the flanking regions 200 bp to the left or right has depth > 5 · d, then we break the read into two reads at this junction. For nanopore reads, if a region has depth d = 3 then we also break it if one flanking region has depth > 75.
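The first breaking rule can be sketched on a per-base depth profile. This sketch rests on our own simplifying assumptions: the function name is hypothetical, the flanking depth is read at exactly 200 bp to either side, the additional nanopore rule is omitted, and flagged positions are returned individually rather than merged into a single junction.

```python
def chimeric_positions(depth, flank=200, max_low=2, ratio=5):
    """Return positions with depth d <= max_low whose flank has depth > ratio * d."""
    flagged = []
    n = len(depth)
    for i, d in enumerate(depth):
        if d > max_low:
            continue
        left = depth[i - flank] if i - flank >= 0 else None
        right = depth[i + flank] if i + flank < n else None
        if any(x is not None and x > ratio * d for x in (left, right)):
            flagged.append(i)
    return flagged


# A 50 bp low-coverage junction inside an otherwise well-covered read.
profile = [20] * 500 + [1] * 50 + [20] * 500
junction = chimeric_positions(profile)
```

A uniformly low-coverage read is not flagged by this rule, because its flanks are no deeper than the region itself.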

After breaking chimeric outer reads, we repeat the containment and mapping procedure again, this time with the broken reads. This gives a second set of outer reads. Any remaining chimeric reads in the second round’s outer reads are broken one last time, and these outer reads are used for graph generation.

4.9. Read overlap graph generation with adaptive edge thresholds

Myloasm performs all-to-all read overlapping with double chaining for the final outer reads. We add an edge between two nodes if an overlap exists of length > 500 by default and $\hat{θ} = 0$ ; we discard reads of length < 1000. However, the requirement that $\hat{θ} = 0$ is sometimes too strict. It is possible that no reads come from the exact same strain or a sequencing error causes a SNPmer mismatch.

Thus, we use a coverage-informed threshold for $\hat{θ}$ as follows. Recall the definition of cov_R(ω) (Definition 7). If (1) cov_R(0) < 5, (2) cov_R(1/100) > 3 · cov_R(0), and (3) cov_R(1/100) > 5 then we define a new coverage threshold ω_R. We use an iterative procedure to find ω_R: first set ω_R = 0 and then iteratively calculate cov_R(ω_R + 0.05/100). Myloasm stops when cov_R(ω_R + 0.05/100) and $\frac{\mathrm{cov}_{R} (ω_{R} + 0.05 / 100)}{\mathrm{cov}_{R} (ω_{R})} < 1.5$ . Finally, for an overlap between R₁ and R₂, we form an edge if $\hat{θ}$ is less than $\min (ω_{R_{1}}, ω_{R_{2}})$ .

We perform two more simplification procedures on this overlap graph.

Definition 8. Let the anchor sparsity of an overlap be the average bases between open syncmer anchors.

We prune overlaps with anchor sparsity > 8 · c. Under a Bernoulli mutation model [94], this prunes overlaps of $< 100 \cdot (1 / 8)^{1 / k} \approx 90.6\%$ sequence identity (k = 21). Given that modern nanopore reads have > 99% median accuracy [37], this did not cause many false negative overlaps. This heuristic was needed for the following reason: for very divergent overlaps, open syncmer chains could still form but no SNPmers were found, thus $\hat{θ} = 0$ . Mathematically, for higher sequence divergence θ, the expected ratio of mismatched SNPmers to syncmers may increase (Equation 3). But, the number of SNPmer mismatches still decreases (Equation 1) and is often 0 when θ is high.
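The 90.6% figure can be checked with a short calculation. Under the Bernoulli model, a k-mer survives with probability $p^{k}$ at per-base identity p, so anchors become $1 / p^{k}$ times sparser than the baseline spacing c; a sparsity above 8 · c therefore corresponds to $p^{k} < 1 / 8$ . The function name below is hypothetical.

```python
def identity_cutoff(sparsity_factor: float = 8.0, k: int = 21) -> float:
    """Per-base identity p at which p**k equals 1 / sparsity_factor."""
    return (1.0 / sparsity_factor) ** (1.0 / k)


print(round(100 * identity_cutoff(), 1))  # 90.6
```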

We then prune overlaps with > 5 times higher anchor sparsity relative to the least sparse overlap. We then rescue edges using two heuristics. We first rescue overlaps that are 1.5 times longer than the longest adjacent $\hat{θ} = 0$ overlap and have $\hat{θ} < 0.05 / 100$ . Second, we rescue overlaps with $\hat{θ} < 1 / 100$ that connect tips to another node; in other words, we “rescue tips” if possible.

4.10. Bidirected string graph representation

Note that the initial read overlap graph is bidirected [38]. In a bidirected graph, an edge can be an in-edge or out-edge for either incident node. This represents the relative orientation of the overlapping reads. We define paths in a bidirected graph as follows.

Definition 9. For an edge e and an incident node x, let $\mathrm{dir}(e, x) \in \{\mathrm{Out}, \mathrm{In}\}$ be the incidence of e with respect to x. A path is a sequence of edges (e₁, e₂, …, e_n) and nodes (x₁, x₂, …, x_{n+1}) such that $\mathrm{dir}(e_{i - 1}, x_{i}) \neq \mathrm{dir}(e_{i}, x_{i})$ . The length of a path is the length of the string it spells out.

This definition naturally generalizes the notion of a path in a directed graph, where an edge x → y labelled e must have $\mathrm{dir}(e, x) = \mathrm{Out}$ and $\mathrm{dir}(e, y) = \mathrm{In}$ . However, this definition still works for a bidirected path: for example, $(e, e^{'})$ for nodes $(x, y, z)$ is a valid path if $\mathrm{dir}(e, x) = \mathrm{In}$ , $\mathrm{dir}(e, y) = \mathrm{In}$ , $\mathrm{dir}(e^{'}, y) = \mathrm{Out}$ , and $\mathrm{dir}(e^{'}, z) = \mathrm{Out}$ . Paths of overlapping reads (and thus unitigs) spell out strings after appropriate reverse complementation.
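The path condition of Definition 9 can be checked mechanically. In the sketch below, which uses a hypothetical representation, each edge is a dictionary from its two incident nodes to "In" or "Out"; self-loops are not handled.

```python
def is_valid_path(nodes, edges):
    """edges[i] joins nodes[i] and nodes[i + 1]; each edge maps node -> 'In' or 'Out'."""
    if len(nodes) != len(edges) + 1:
        return False
    for i in range(1, len(edges)):
        # At every interior node, the incoming and outgoing incidences must differ.
        if edges[i - 1][nodes[i]] == edges[i][nodes[i]]:
            return False
    return True


# The bidirected example from the text: dir(e, y) = In and dir(e', y) = Out.
e = {"x": "In", "y": "In"}
e_prime = {"y": "Out", "z": "Out"}
print(is_valid_path(["x", "y", "z"], [e, e_prime]))  # True
```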

4.11. Conservative unitig graph cleaning

We remove transitive edges in the cleaned overlap graph using the algorithm of Myers [38] and collapse non-branching paths into unitigs. This representation is the string graph, but we will refer to it as the unitig graph to make it clear that in subsequent cleaning steps, non-branching paths are always collapsed into unitigs.

We perform a set of conservative graph cleaning operations on the unitig graph. These operations require a notion of safety for edge cutting. Intuitively, an edge is safe to cut if it does not break contiguity too badly. We illustrate this definition in Supplementary Fig. 13.

Definition 10. Let U₁ and U₂ be unitigs adjacent to an edge e. Let ℓ and L be two safety parameters. The edge e is safe if the following are all true. (A) There exists a path (U_a, x, …) with length > L where $a \in \{1, 2\}$ and $x \notin \{U_{1}, U_{2}\}$ . (B) For some sequence of nodes $P^{'} = (U_{a}, U_{b}, \dots, x_{i - 1}, x_{i}, x_{i + 1}, \dots)$ (where $a, b \in \{1, 2\}$ ) of length $< ℓ$ , there exists a path ( $x_{i}, x_{i - 1}^{'}, x_{i - 2}^{'}, \dots$ ) that:

has length > L and
is edge-disjoint from $P^{'}$ and
both $x_{i - 1}$ and $x_{i - 1}^{'}$ are simultaneously in or out-nodes relative to x_i.

Note that $P^{'}$ is not a valid path since $x_{i - 1}$ , $x_{i - 1}^{'}$ are the same incidence to x_i. The first condition implies that cutting edge e does not create a “dead end”; the presence of an alternate path of length > L implies that cutting e does not break contiguity. The second condition implies that no dead ends will be created in a neighborhood of radius ℓ around e.

Safety can be computed as follows. Path (A) in the definition can be found using a depth-first search (DFS). Path (B) can be found by doing a “forward” search of radius ℓ and then a “backward” DFS each time a new node is found for the forward search.

We cut safe edges using ℓ = 10kb and L = 50kb in two ways. These parameters will change later on. Firstly, we safely cut all edges that are dominated by another edge.

Definition 11. For a node x, an out-edge (resp. in-edge) e₁ is dominated by another out-edge (resp. in-edge) e₂ if e₂ has strictly larger values than e₁ for all of the following:

overlap length
number of open syncmer anchors
$- \hat{θ}$
number of matching SNPmer anchors
$-$ number of mismatched SNPmer anchors.

This heuristic removes edges that might have been erroneously rescued in the previous steps. Note that edges with $\hat{θ} = 0$ can not be dominated due to the strictness of the inequality.
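Domination amounts to a strict comparison on five quantities. The sketch below uses a hypothetical `Overlap` record and omits the safety check of Definition 10.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    length: int
    syncmer_anchors: int
    theta_hat: float
    snpmer_matches: int
    snpmer_mismatches: int


def quality_key(e: Overlap):
    # Larger is better in every coordinate, hence the two negations.
    return (e.length, e.syncmer_anchors, -e.theta_hat,
            e.snpmer_matches, -e.snpmer_mismatches)


def is_dominated(e1: Overlap, e2: Overlap) -> bool:
    """True if e2 is strictly better than e1 on all five criteria of Definition 11."""
    return all(b > a for a, b in zip(quality_key(e1), quality_key(e2)))


weak = Overlap(1000, 20, 0.002, 5, 2)
strong = Overlap(5000, 120, 0.0005, 30, 1)
perfect = Overlap(1000, 20, 0.0, 5, 0)
print(is_dominated(weak, strong))     # True
print(is_dominated(perfect, strong))  # False
```

The second call shows why an edge with $\hat{θ} = 0$ is never dominated: no other edge can have a strictly smaller $\hat{θ}$ .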

We also cut safe edges with small relative overlap length as follows. Let ol(e) be the overlap length and α < 1. Let e₁, e₂, … be out or in-edges of some node. We cut e_i if $\mathrm{ol}(e_{i}) < α \cdot \mathrm{ol}(e_{j})$ for some $e_{j}$ . We call this operation a drop cut. Finally, we incorporate tip removal and bubble popping [50] with these graph cleaning procedures in an iterative fashion.

We summarize this procedure in Lines 1-8 of Algorithm 1. First, we remove dominated edges. We then let α, the drop cut ratio, increase from 0.5/3 to 0.5/2 and then to 0.5. In each iteration, we remove unitig tips that have length ≤ 20 kbp and ≤ 3 reads. Then, we pop bubbles of maximum length < 50 kb. Finally, we perform drop cuts with ratio α. After each step, we collapse new non-branching paths into unitigs.
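The drop cut itself reduces to comparing each overlap length against the longest overlap on the same side of a node. The sketch below uses hypothetical names and shows only which edges are candidates at each value of α; the safety check, tip removal, and bubble popping are omitted.

```python
ALPHA_SCHEDULE = (0.5 / 3, 0.5 / 2, 0.5)


def drop_cut_candidates(overlap_lengths, alpha):
    """Indices of edges with ol(e_i) < alpha * ol(e_j) for some e_j on the same side."""
    longest = max(overlap_lengths, default=0)
    return [i for i, ol in enumerate(overlap_lengths) if ol < alpha * longest]


lengths = [6000, 2000, 800]
for alpha in ALPHA_SCHEDULE:
    print(round(alpha, 3), drop_cut_candidates(lengths, alpha))
```

With these lengths, only the 800 bp overlap is a candidate in the first two rounds, and the 2000 bp overlap joins it once α reaches 0.5.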

4.12. Path probability model

The previous cleaning procedure conservatively reduces the complexity of the graph. In the subsequent sections, we incorporate sequencing coverage information to more aggressively clean the graph.

Recall that we can attach to each read a set of coverage values. Intuitively true paths within the unitig graph should have consistent coverage because the sequencing coverage along a genome is roughly constant. We proceed by defining a probability model on paths in the graph. We subsequently use this model to weight edges. First, we define the coverage divergence between two unitigs.

Definition 12. Let U be a sequence of reads R₁, R₂, …. For a threshold ω, the coverage of U is defined as an ordered set $\mathrm{cov}_{U} (ω) = (\mathrm{cov}_{R_{1}} (ω), \mathrm{cov}_{R_{2}} (ω), \dots)$ ; see Definition 7. Let Q_x be the x-th percentile quantile function (e.g., x = 50 is the median). For U₁ and U₂, define

L(ω, x) = \begin{cases} \log \left[ \frac{Q_{x} (\mathrm{cov}_{U_{1}} (ω) + 3)}{Q_{x} (\mathrm{cov}_{U_{2}} (ω) + 3)} \right], & \text{if } (\mathrm{cov}_{U_{1}} (ω) \geq 5 \text{ and } \mathrm{cov}_{U_{2}} (ω) \geq 5) \text{ or } ω \geq 1 / 100 \\ \infty, & \text{otherwise.} \end{cases}

(6)

Let $Ω = \{1 / 100, 0.25 / 100, 0\}$ and define the coverage divergence between two sequences of reads to be

D(U_{1}, U_{2}) = | Ω | \cdot \min_{ω \in Ω} \{ | L(ω, 50) | + | L(ω, 75) - L(ω, 25) | \} .

The first term of the coverage divergence is a log ratio of the median coverages. The second term captures differences in distributional shape. Myloasm stabilizes the L(ω, x) statistic by adding a pseudocount of 3 and only allowing a threshold ω if the coverage is high enough. This is crucial for strain populations where we often see $\mathrm{cov}_{R} (0) \approx 1$ but $\mathrm{cov}_{R} (1 / 100) \gg 1$ . We found that adjacent unitigs can have highly variable coverages at specific values of ω due to, e.g., inexact repeats, so the minimum smooths out large variation. We now define the energy of a path.
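A possible reading of the coverage divergence is sketched below. Several details are our assumptions rather than statements from the text: the function names are hypothetical, quantiles use linear interpolation, and the condition that a unitig's coverage is at least 5 is applied to its median read coverage.

```python
import math

OMEGAS = (1 / 100, 0.25 / 100, 0.0)


def quantile(values, x):
    """x-th percentile with linear interpolation (one of several common conventions)."""
    s = sorted(values)
    pos = (len(s) - 1) * x / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def log_ratio(cov1, cov2, x, pseudo=3):
    """L(omega, x) when the threshold is usable; the pseudocount stabilizes low coverage."""
    return math.log((quantile(cov1, x) + pseudo) / (quantile(cov2, x) + pseudo))


def coverage_divergence(cov1, cov2, min_cov=5):
    """cov1 and cov2 map each threshold in OMEGAS to a list of per-read coverages."""
    best = math.inf
    for omega in OMEGAS:
        a, b = cov1[omega], cov2[omega]
        usable = omega >= 1 / 100 or (
            quantile(a, 50) >= min_cov and quantile(b, 50) >= min_cov
        )
        if not usable:
            continue  # L is infinite for this threshold
        term = abs(log_ratio(a, b, 50)) + abs(log_ratio(a, b, 75) - log_ratio(a, b, 25))
        best = min(best, term)
    return len(OMEGAS) * best


u1 = {w: [10] * 4 for w in OMEGAS}
u2 = {w: [23] * 4 for w in OMEGAS}
d = coverage_divergence(u1, u2)  # 3 * log(26 / 13) = 3 * log(2)
```

Because the threshold 1/100 is always usable, the minimum is finite whenever both unitigs contain at least one read.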

Definition 13. Suppose we have a unitig path (U₁, …, U_n+1) with edges (e₁, …, e_n). Let Inc(U_i, e_i) be the in-edges (resp. out-edges) of U_i if e_i is an in-edge (resp. out-edge) of U_i. Let ${\hat{θ}}_{i}$ be the $\hat{θ}$ for the ith edge and

Θ(e_{i}) = 200 \cdot \left( \hat{θ}_{i} - \min_{e_{j} \in \mathrm{Inc}(U_{i}, e_{i})} \hat{θ}_{j} \right) .

Furthermore, let

\mathrm{OL}(e_{i}) = \left| 0.5 \cdot \log_{2} \left( \frac{\mathrm{ol}(e_{i})}{\max_{e \in \mathrm{Inc}(U_{i}, e_{i})} \mathrm{ol}(e)} \right) \right| .

Then the energy of a path is

E(\mathrm{path}) = \sum_{i = 1}^{n} \left[ D \left( ⨁_{j = 1}^{i} U_{j}, U_{i + 1} \right) + \mathrm{OL}(e_{i}) + Θ(e_{i}) \right] .

(7)

Where $⨁_{j = 1}^{i} U_{j}$ is the concatenation of reads for all unitigs up to i.
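The energy in Equation 7 can be sketched as a simple accumulation over the edges of a path. The sketch below is illustrative only: the dictionary keys (`theta`, `ol`, `inc_thetas`, `inc_ols`) are a hypothetical layout, the incident sets are assumed to include the edge itself, and the coverage divergence D is passed in as a callable rather than computed.

```python
import math


def theta_penalty(theta, incident_thetas):
    """Θ(e_i) = 200 · (θ̂_i − min θ̂ over Inc(U_i, e_i))."""
    return 200.0 * (theta - min(incident_thetas))


def overlap_penalty(ol, incident_ols):
    """OL(e_i) = |0.5 · log2(ol(e_i) / max ol over Inc(U_i, e_i))|."""
    return abs(0.5 * math.log2(ol / max(incident_ols)))


def path_energy(unitigs, edges, divergence):
    """E(path) for n+1 unitigs (lists of reads) joined by n edges.

    divergence(prefix_reads, next_unitig_reads) stands in for D.
    """
    energy = 0.0
    prefix = []
    for i, edge in enumerate(edges):
        prefix = prefix + unitigs[i]  # concatenation of U_1 .. U_i
        energy += divergence(prefix, unitigs[i + 1])
        energy += overlap_penalty(edge["ol"], edge["inc_ols"])
        energy += theta_penalty(edge["theta"], edge["inc_thetas"])
    return energy
```

For example, an edge whose overlap is one quarter of the longest incident overlap contributes an OL penalty of 1, and an edge whose $\hat{θ}$ exceeds the best incident $\hat{θ}$ by 0.001 contributes a Θ penalty of 0.2.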

The definition of energy is arbitrary, but it encodes the intuition that we should penalize small overlaps (the $OL(e_i)$ term), low sequence identity (the $Θ(e_i)$ term), and inconsistent coverage (the D term). We use the energy to define a distribution over a class of paths as follows.

Definition 14. Assume some probability distribution over the vertices Pr(U). We choose

$$\Pr(U) = \frac{\sqrt{|U|}}{\sum_{U_i \in \text{Nodes}} \sqrt{|U_i|}}$$

where $| U |$ is the number of reads in the starting unitig U in our implementation.

For a starting unitig node U, let $𝒫_{o}^{K} (U)$ be the set of out-paths from U that either have exactly K nodes or are not able to be extended further. Define $𝒫_{i}^{K} (U)$ similarly for in-paths. Then for $P \in 𝒫_{o}^{K} (U)$ , and a “temperature” T > 0 let

$$\Pr(P) = \frac{\Pr(U)}{2} \cdot \frac{e^{-E(P)/T}}{\sum_{P_i \in 𝒫_{o}^{K}(U)} e^{-E(P_i)/T}} \propto \frac{\sqrt{|U|} \cdot e^{-E(P)/T}}{\sum_{P_i \in 𝒫_{o}^{K}(U)} e^{-E(P_i)/T}}. \tag{8}$$

If $P \in 𝒫_{i}^{K}(U)$, define Pr(P) analogously. Note that a path traversed in the opposite direction is technically a different path (and thus may have a different probability).

$𝒫_{o}^{K}(U)$ represents the “maximal” out-paths from U of length at most K (see Supplementary Fig. 13). Intuitively, we think of generating these maximal paths of length ≤ K as follows. We first sample a node with probability Pr(U), then pick a direction with probability 1/2, and finally sample a path from all maximal paths of length ≤ K starting at the node according to the Boltzmann distribution (equivalent to a softmax function) of the path energies.
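A minimal sketch of Definition 14, assuming the path energies for one starting unitig and one direction have already been computed. The function names are hypothetical, and subtracting the minimum energy before exponentiating is a standard numerical-stability step added here, not something stated in the text.

```python
import math


def node_prior(read_counts):
    """Pr(U) proportional to sqrt(|U|), where |U| is the number of reads in U."""
    roots = [math.sqrt(n) for n in read_counts]
    total = sum(roots)
    return [r / total for r in roots]


def path_probabilities(prior_u, energies, temperature):
    """Pr(P) = Pr(U)/2 · softmax(−E/T) over the maximal paths from U in one direction."""
    lowest = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    z = sum(weights)
    return [0.5 * prior_u * w / z for w in weights]
```

With two candidate paths whose energies differ by ln 3 at T = 1, the lower-energy path receives three times the probability of the other. Raising T moves the two probabilities closer together.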

4.13. Path-infused edge weights

We use our path probability model to infuse edge weights with “global” information around a neighbourhood of the edge as follows.

Definition 15. Define the collection of paths $\{x \to y\}$ as the set of paths of the form $(\dots, x, y, \dots)$. Define $\{y \to x\}$ similarly. Note that in general $\Pr(\{x \to y\}) \neq \Pr(\{y \to x\})$; the energy of a path depends on the direction in which it is traversed. Define the weight w(e) of an edge e between x and y as

$$w(e) = \min\left( \Pr(\{x \to y\}), \Pr(\{y \to x\}) \right).$$

Intuitively, w(e) measures the likelihood of a local path flowing through the edge from both directions. We chose the minimum of the two directions because we often see nodes with skewed out-degree and in-degree in the graph (e.g., due to a conserved repeat at the end of one unitig).

Algorithm for computing edge weights

An important consideration is how to choose K, the number of nodes, in Definition 14. Depending on the genome, the string lengths spelled out by paths of K nodes can vary widely. Thus, we choose a $K_o(U)$ that depends on the starting node U and the direction (o or i). We set $K_o(U)$ to be the smallest number of nodes in an out-path of length ≥ 1 Mbp (analogously for in-paths). If this is > 10, we set it to 10 instead. We discuss choosing the temperature T in the next section.

To actually compute the path probabilities in Definition 14, we would have to enumerate all paths. This is sometimes computationally prohibitive, so we use a beam search (i.e., top-N greedy breadth-first search) to prune out paths with low probability, as low-probability paths should not influence w(e) much. Precisely, we calculate the probability of each path by iterating over all nodes and then running a beam search in both directions while keeping the N = 20 lowest-energy paths at each iteration. Then, $\Pr(\{x \to y\})$ is the sum of the probabilities of all paths passing from x to y.
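The beam search and the accumulation of $\Pr(\{x \to y\})$ can be sketched as below. This is a simplified illustration under stated assumptions: the graph is a plain dictionary of directed adjacency lists, each step carries a fixed additive energy (the real energy depends on the whole path prefix through the D term), and the function names are hypothetical.

```python
import heapq


def beam_search_paths(graph, start, k, beam_width=20):
    """Return up to beam_width low-energy maximal paths with at most k nodes.

    graph: dict mapping node -> list of (neighbor, step_energy).
    A path is kept as-is once it cannot be extended further.
    """
    beam = [(0.0, (start,))]
    for _ in range(k - 1):
        candidates = []
        for energy, path in beam:
            steps = [(n, e) for n, e in graph.get(path[-1], []) if n not in path]
            if not steps:
                candidates.append((energy, path))  # maximal: nothing to extend
            for nxt, step_energy in steps:
                candidates.append((energy + step_energy, path + (nxt,)))
        beam = heapq.nsmallest(beam_width, candidates)
    return beam


def edge_flow(paths_with_prob):
    """Sum path probabilities over every consecutive pair (x, y) on a path."""
    flow = {}
    for prob, path in paths_with_prob:
        for x, y in zip(path, path[1:]):
            flow[(x, y)] = flow.get((x, y), 0.0) + prob
    return flow


def edge_weight(flow, x, y):
    """w(e) = min(Pr({x -> y}), Pr({y -> x}))."""
    return min(flow.get((x, y), 0.0), flow.get((y, x), 0.0))
```

In this sketch the surviving paths would still need to be converted to probabilities with the Boltzmann distribution of Definition 14 before being passed to `edge_flow`.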

4.14. Annealing-inspired graph cleaning

Once we have computed edge weights, we cut safe edges (see Definition 10) with low edge weight relative to incident edges. That is, with Inc(U, e) defined as in Definition 13 and some $α' < 1$, we cut edges with

$$\frac{w(e)}{\max_{e' \in \mathrm{Inc}(U, e)} w(e')} < α'. \tag{9}$$

The crucial parameter for defining edge weights is the temperature T. T controls the strictness of the energy penalties in Definition 14. We circumvent a fixed choice of T by iterating through a range of T values in analogy to simulated annealing algorithms [42]. At high temperature, ratios of w(e) are closer to 1, so only very poor edges are cut. After cutting, re-unitigging yields larger unitigs with more reads. Subsequent calculations of energies in Definition 13 have larger “sample sizes” and thus higher reliability. Then, we repeat this procedure at lower T.
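The relative cut of Equation 9 and the effect of temperature can be illustrated with a short sketch. The names and data layout are hypothetical, the safety check of Definition 10 is omitted, and `boltzmann_ratio` only models two competing paths that differ by a fixed energy gap.

```python
import math


def edges_to_cut(weights, incident, alpha):
    """Edges whose weight relative to the best incident edge is below alpha.

    weights: dict edge -> w(e); incident: dict edge -> edges in Inc(U, e),
    assumed to include e itself.
    """
    cut = []
    for edge, w in weights.items():
        best = max(weights[f] for f in incident[edge])
        if best > 0 and w / best < alpha:
            cut.append(edge)
    return cut


def boltzmann_ratio(energy_gap, temperature):
    """Probability ratio of two paths whose energies differ by energy_gap."""
    return math.exp(-energy_gap / temperature)
```

For an energy gap of 1, the ratio is about 0.61 at T = 2 but about 0.14 at T = 0.5, so the same edge can survive the early high-temperature rounds and still be cut later.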

We summarize our entire graph cleaning procedure in Algorithm 1 along with our heuristically chosen parameters. Lines 9-20 correspond to the annealing-inspired cleaning step with additional tip removal and bubble popping. Notably, we also iterate on $α'$ and a parameter m. Recall that the definition of edge safety (Definition 10) depends on parameters ℓ and L; the larger ℓ, the less stringent the safety conditions. We make edge cuts more aggressive in Line 13 by multiplying the safety constant $ℓ' = 20000$ by m. We also allow progressively larger bubbles of length m · 50 kb to be popped.

4.15. Circular contig extraction

We apply the progressive coverage filter of metaMDBG [20] for retrieving circular contigs as follows. We let the cumulative coverage of a unitig be $\mathrm{cov}_{U}(0) + \mathrm{cov}_{U}(0.25/100) + \mathrm{cov}_{U}(1/100)$ (the same thresholds as Ω in Definition 12). We iteratively remove contigs with cumulative coverage lower than β = 1, …, max(cumulative coverage) and re-unitig after each step. Afterwards, we initialize an empty set $C = \emptyset$ and iterate over $β' = \max(\text{cumulative coverage}), \dots, 1$. Suppose an isolated circular contig (i.e., in-degree = out-degree = 1) is found at an iteration. If the cumulative coverage of the circular contig is $> 2 \cdot β'$ and none of its reads are in a contig in C, then we insert this contig into C. Finally, we remove all nodes in the original graph with a read in a contig of C and re-unitig. Then, we add all circular contigs in C back into the graph as isolated nodes.

Even after the above filter, very small circular plasmids were often not circularized properly. Thus, we use a heuristic to extract small circular contigs as follows. If, for a connected component in the final graph, every unitig has < 100 kb length and the entire component has < 100 reads, we mark it as a possible small circular genome. We recompute all pairwise overlaps and extract pairs of reads that form a putative circular unitig. Then, we sort pairs based on their estimated circular genome length. We iterate through the sorted list until we find a genome that is 1.1 times longer than the previous genome length, taking the last pair of reads to be a circular unitig. The reason for this procedure is that we often found artifactual reads, e.g., reads from multimeric plasmids [52], that span the genome more than once. Sometimes pairs of these reads represent a repeated small genome, so we take the largest yet hopefully not repetitive pair of reads.

Algorithm 1.

Graph cleaning pseudocode (G: initial unitig graph)

1:	$ℓ \leftarrow 10000$	▷ Safety parameters (Definition 10)
2:	$L \leftarrow 50000$	▷ Safety parameters (Definition 10)
3:	RemoveDominatedEdges(G, ℓ, L)	▷ See Section Conservative unitig graph cleaning
4:	for $α = 0.5/3, 0.5/2, 0.5$ do
5:	    RemoveTips(G, 20000, 3)	▷ < 20 kb length and 3 reads
6:	    PopBubbles(G, 50000)	▷ < 50 kb length
7:	    DropCut(G, α, ℓ, L)
8:	end for
9:	$ℓ' \leftarrow 20000$	▷ New safety parameters
10:	$L' \leftarrow 300000$	▷ New safety parameters
11:	for T = 2, 1.5, 1, 0.5 do	▷ Annealing-inspired cleaning with path probabilities
12:	    for $α' = 0.125, 0.25, 0.5$ do
13:	        for m = 10, 15, 30 do
14:	            $w(E) \leftarrow$ ComputeEdgeWeights(G, T)	▷ Definitions 14 and 15
15:	            RemoveTips(G, 100000, 15)
16:	            PopBubbles(G, min(50000 · m, 1000000))
17:	            CutRelative(G, w(E), $α'$, $L'$, $ℓ' \cdot m$)	▷ Equation 9 with safety parameters $L'$, $ℓ' \cdot m$
18:	        end for
19:	    end for
20:	end for


4.16. Polishing and dereplication

Spurious unitigs are often present after cleaning due to, e.g., sequencing errors. To dereplicate highly similar unitigs, we perform all-to-all pairwise mappings of unitigs and remove unitigs that (1) are > 98% contained in a larger unitig, (2) have $\hat{θ} = 0$, and (3) have < 8 · c anchor sparsity (that is, the average number of bases between anchors in the open syncmer chain; see Section 4.9 for justification).

We then align all reads to the remaining unitigs through double chaining. This time we extend the open syncmer chain to a full sequence alignment by performing global alignment between anchors using Block Aligner [95]. If a maximal mapping (see Definition 6) has < 8 · c anchor sparsity and $\hat{θ} = 0$, then we keep it. If a read contains no mapping that satisfies $\hat{θ} = 0$, then we let $\hat{θ}_{best}$ be the smallest $\hat{θ}$ among its mappings and iteratively retain the best mappings sorted by $\hat{θ}$, stopping when $\hat{θ} > \frac{4}{3} \hat{θ}_{best}$. This heuristic allows for secondary alignments when $\hat{θ}_{best}$ is large and there is inherent ambiguity in the mapping.
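The mapping-retention rule can be sketched as follows. The tuple layout `(theta_hat, anchor_sparsity)` and the function name are hypothetical, and the constant c is passed in as a parameter. The text does not say what happens when a read has a $\hat{θ} = 0$ mapping that fails the sparsity check; this sketch assumes such a read falls through to no retained mapping rather than to the 4/3 rule.

```python
def retain_mappings(mappings, c, slack=4 / 3):
    """Select which mappings of one read to keep for polishing.

    mappings: list of (theta_hat, anchor_sparsity) tuples for a single read.
    """
    exact = [m for m in mappings if m[0] == 0]
    if exact:
        # Keep exact mappings that are also dense enough in anchors.
        return [m for m in exact if m[1] < 8 * c]
    if not mappings:
        return []
    best = min(theta for theta, _ in mappings)
    # No exact mapping: keep everything within 4/3 of the best divergence.
    return [m for m in sorted(mappings) if m[0] <= slack * best]
```

For example, with a best $\hat{θ}$ of 0.003 the cutoff is 0.004, so a second mapping at 0.0035 is retained as a secondary alignment while one at 0.005 is dropped.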

After mapping, we polish using the SPOA (SIMD Partial Order Alignment) library [96,97] along 400 bp windows to get the final contigs. We perform one last dereplication step by computing all-to-all average nucleotide identity (ANI) with skani [98] and the parameters -c 15 -m 15. We label contigs with > 99.9% ANI and > 99% aligned fraction to a larger contig as duplicates and remove them. We label contigs with > X% ANI (default: X = 99%) and > 90% aligned fraction as alternates in a separate output.
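The final labelling thresholds amount to a small decision rule, sketched below. The function name and the label "primary" for contigs that are neither duplicates nor alternates are hypothetical; ANI and aligned fraction are taken as percentages of the smaller contig against a larger one.

```python
def label_contig(ani, aligned_fraction, alternate_ani=99.0):
    """Label a contig relative to a larger contig using the stated thresholds."""
    if ani > 99.9 and aligned_fraction > 99.0:
        return "duplicate"  # removed from the assembly
    if ani > alternate_ani and aligned_fraction > 90.0:
        return "alternate"  # written to a separate output
    return "primary"
```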

4.17. Mylotools implementation

Mylotools uses the plotly [99] library to generate interactive HTML visualizations of myloasm’s assemblies (Supplementary Fig. 1). Mylotools visualizes the read coverage, overlaps, GC content in windows, and cumulative GC skew [73] across a contig. For each read, three depth-of-coverage values are shown: myloasm calculates depth after removing mappings with θ (the estimated SNPmer divergence) ≥ 0, 0.25/100, and 1/100. For the overlaps, the length and number of mismatched SNPmers are also shown.

For a high-quality contig, (1) read coverage and GC content should be relatively stable, (2) overlaps should be long with minimal mismatched SNPmers, and (3) cumulative GC skew often has a peak and a trough (although not always [73]). The presence of abnormal shifts can be used to scan for chimeric contig joins.

4.18. Benchmarking setup

All assemblers were run with default parameters except for technology-specific parameters. For metaFlye (v2.9.5-b1801), we used --nano-hq for nanopore and --pacbio-hifi for HiFi reads. For metaMDBG (v1.1), we used --in-hifi and --in-ont for HiFi and nanopore reads, respectively. For benchmarking results, we used myloasm v0.3.0 with --hifi for HiFi reads and default parameters otherwise. We used myloasm v0.1.0 for all results in Fig. 6 (and in section Analysis of within-species diversity and horizontal gene transfer). We improved the polishing procedure of myloasm from v0.1.0 to v0.3.0 but left the graph resolution steps unchanged, leading to highly similar contigs in terms of structure. Hifiasm-meta (0.3-r074) was run with default parameters. All assemblers were run with 60 threads on a 64-core Intel(R) Xeon(R) Gold 6130 CPU machine with 384 GB of RAM except on the “Microflora” dataset [54], for which we used an AMD EPYC 7301 machine with 1 TB of RAM. Timing and memory benchmarking was done through snakemake [100].

4.19. Benchmarking of in silico long-read metagenomes containing multiple strains

We generated two sets of synthetic long-read metagenomes: (1) five species and two strains per species (Fig. 2A–B) and (2) metagenomes with varying number of S. enterica strains (Fig. 2C).

To generate the first dataset, we downloaded assemblies designated as “Complete” using NCBI Datasets [101] for the five species shown in Fig. 2A. We chose these species because they captured a range of diversity (each belonging to a different order), complete reference genomes were available, and each species had genomes with ANI values between 97.5% and 99.5%. We then used skani to find pairs of genomes with 97.5%, 98.5%, and 99.5% ANI to one another, resulting in our final collection of reference strains.

For each pair of strains, we arbitrarily designated a high-abundance and a low-abundance strain. The high-abundance strain was set to 35x depth of coverage and the low-abundance strain varied between 15x and 35x. We then used badread [102] to simulate long reads using five parameters:

(Mean length, Length standard deviation, Mean identity, Identity standard deviation, Error model).

The three parameter sets were as follows:

PacBio: (10 kbp, 4 kbp, 99.9%, 0.18%, pacbio2021)
Nano Short: (2.5 kbp, 4 kbp, 98%, 2%, nanopore2023)
Nano Long: (9 kbp, 13 kbp, 98%, 2%, nanopore2023)

We chose these parameters by selecting a real HiFi run and two ONT runs from Minich et al. [34] and calculating the length and identity distributions from their fastq files (with identity inferred using base qualities). We used badread’s pre-trained error models. The resulting N50 read lengths are shown in Fig. 2A.

For the 1-10 strain S. enterica metagenomes, we chose ten arbitrary references and ensured all ten had pairwise ANI between 98% and 99%. We sampled a coverage depth value from 10 to 30 for each using a uniform distribution and simulated reads using Nano Long parameters.

All assembly statistics were calculated by QUAST [103] as follows. We took all contigs of length > 100 kb for each assembler and matched each to the closest reference genome with skani. We ran QUAST on the set of matched contigs for each reference genome. We used this method as opposed to MetaQUAST [104] to have more control over strain-level assignments of contigs. Qscore is defined as $-10 \log_{10} \left( \frac{\text{substitutions} + \text{indels}}{\text{genome size}} \right)$, where substitutions and indels were calculated from QUAST’s outputs.
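As a worked example of the Qscore definition, the following sketch computes it from error counts. The function name is hypothetical, and returning infinity for an error-free assembly is a convention assumed here, since the formula is undefined at zero errors.

```python
import math


def qscore(substitutions, indels, genome_size):
    """Qscore = -10 · log10((substitutions + indels) / genome size)."""
    errors = substitutions + indels
    if errors == 0:
        return math.inf  # assumed convention for a perfect assembly
    return -10 * math.log10(errors / genome_size)
```

For instance, 30 substitutions and 20 indels on a 5 Mbp genome give an error rate of 10⁻⁵ and therefore a Qscore of 50.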

4.20. Concatenated mock R10.4 community benchmarking

We concatenated four separate datasets of ONT R10.4 reads with both hac and sup basecalled data available to form a mock metagenome. The four datasets were

Zymo Gut Microbiome Standard D6331 (dorado v0.8.2; v5.0.0 basecalling models)
Zymo Oral Microbiome Standard D6332 (dorado v0.8.2; v5.0.0 basecalling models)
Zymo HMW DNA Standard D6322 (dorado v0.7.3; v5.0.0 basecalling models)
14 isolates from Hall et al. [37] (dorado v0.5.0; v4.3.0 basecalling models)

The Zymo datasets were taken from https://github.com/Kirk3gaard/MicroBench. The fourteen isolates were downsampled to 10% of the original dataset unless the reads had < 1 Gbp, in which case we did not downsample. All datasets were concatenated and downsampled to 25% of the reads. This final dataset had 21 Gbp of data for 48 genomes after dereplicating reference genomes that had > 99.9% ANI according to skani. Supplementary Table 2 shows the exact genome accessions and datasets used. We used skani and QUAST on > 100 kbp contigs similarly to the synthetic benchmarking experiments.

We designated a contig as circular if the assembler designated it as circular and its length was > 1 Mbp. For plasmids, we did not use QUAST. Instead, we denoted a circular contig as a circular + perfect plasmid if skani (with parameter --slow) assigned it to a plasmid with > 99% ANI and both the contig and reference had > 95% aligned fraction. If it satisfied the ANI and aligned fraction requirements but was not circular, we denoted it as a partial match. If the contig had < 90% aligned fraction to the plasmid but the plasmid had > 95% aligned fraction to the contig, then we designated it as a multimer and thus a misassembled plasmid.
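The plasmid categories can be written as a small decision function. This is a sketch under assumptions: the function name and the fallback label "unassigned" are hypothetical, and requiring > 99% ANI for the multimer category is an assumption, since the text does not state an ANI threshold for that case.

```python
def classify_plasmid_contig(is_circular, ani, contig_af, ref_af):
    """Classify a contig matched to a reference plasmid.

    ani, contig_af, and ref_af are percentages; the aligned fractions are
    for the contig and the reference plasmid, respectively.
    """
    if ani <= 99.0:
        return "unassigned"
    if contig_af > 95.0 and ref_af > 95.0:
        return "circular + perfect" if is_circular else "partial"
    if contig_af < 90.0 and ref_af > 95.0:
        return "multimer"  # the contig spans the plasmid more than once
    return "unassigned"
```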

4.21. Real metagenome benchmarking procedure

We used six R10.4 nanopore datasets [53,34,54], six PacBio HiFi datasets [20,55,58,56,57,59], and two samples sequenced with both HiFi and nanopore from Minich et al. [34]. Accessions are shown in Supplementary Table 3.

After assembly, we mapped reads back to the assemblies with minimap2 [49] and ran SemiBin2 [105] with default parameters for binning. We used CheckM2 [60] to calculate contamination and completeness of all resulting bins as well as contigs with length > 500 kb. We calculated zero-coverage windows within assemblies and positions with > 10 read clippings and no other alignments using a faster Rust reimplementation of the anvi-script-find-misassemblies script from Anvi’o [106,43], available at https://github.com/bluenote-1577/rust-anvio-mis.

We used geNomad [61] (v1.11.0) to find candidate viral and plasmid contigs. CheckV (v1.0.3) was used on all predicted viral contigs after excluding proviruses to assess quality. Conjugation and mobilization gene annotations were taken from geNomad. To assess duplication, we obtained k-mer frequencies by counting all 21-mers in a contig, downsampling to ≈ 1/10 of the k-mers with FracMinHash [107], and then finding the mean k-mer multiplicity after trimming the 10% most frequent and 10% least frequent k-mers. If this k-mer multiplicity was > 1.25, we marked the virus/plasmid contig as a duplicate.
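The duplication check can be sketched with the standard library alone. The details below are assumptions for illustration: the function names are hypothetical, a BLAKE2b hash stands in for whichever hash the real FracMinHash implementation uses, k-mers are not canonicalized, and the trim drops the floor of 10% of the distinct sampled k-mers from each end.

```python
import hashlib
from collections import Counter


def sampled_kmer_counts(seq, k=21, scale=10):
    """Count k-mers whose hash falls in the lowest 1/scale of the hash space."""
    threshold = 2**64 // scale
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") < threshold:
            counts[kmer] += 1
    return counts


def trimmed_mean_multiplicity(counts, trim=0.10):
    """Mean multiplicity after dropping the most and least frequent k-mers."""
    values = sorted(counts.values())
    if not values:
        return 0.0
    drop = int(len(values) * trim)
    kept = values[drop:len(values) - drop]
    return sum(kept) / len(kept)


def is_duplicated(counts, cutoff=1.25):
    """Flag a contig whose trimmed mean k-mer multiplicity exceeds the cutoff."""
    return trimmed_mean_multiplicity(counts) > cutoff
```

A contig that contains the same element twice has most k-mers at multiplicity 2, so its trimmed mean is close to 2 and it is flagged, whereas a few repeated k-mers in an otherwise unique contig are trimmed away.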

We attempted to quantify the number of falsely circularized contigs, but ran into several issues. Firstly, simply quantifying the number of < 90% complete circular contigs is insufficient due to lineage-specific underestimation of genomic completeness in CheckM2. Furthermore, we found circularized contigs that were likely secondary chromosomes with low CheckM2 completeness. We also investigated < 500 kbp circular contigs with rRNA gene content but found several microeukaryotic organelle genomes in the environmental datasets (examples shown in Supplementary Fig. 14) and other types of small, circular extrachromosomal elements with rRNA gene content [63].

4.22. Analysis of within-species diversity and horizontal gene transfer

All analyses of ANI and AF were done through skani with parameters --medium --detailed --faster-small-m 400. To construct the phylogenetic trees, we took all contigs of length > 500 kbp and completeness > 90% with contamination < 5%. We ran the classify-wf workflow of GTDB-Tk [108] 2.3.2 with the GTDB-R214 taxonomy and then FastTree [109] with GTDB-Tk’s multiple sequence alignments of the bac120 prokaryotic marker genes. Antibiotic resistance genes were annotated with the RGI pipeline using the CARD [110] database (May 2025 release). De novo gene annotation was performed on contigs individually using Bakta [111]. All circular contigs were reoriented to start at dnaA with the Dnaapler [112] pipeline prior to Bakta annotation. Clinker [113] was used to visualize gene-level sequence relationships in genomic contexts surrounding predicted ermF locations in the Gut1-ONT dataset. The pgv-mummer workflow of pyGenomeViz (v1.6.1) was used to run MUMmer [114] iteratively between the six circular contigs of the Nanosynbacter (CPR bacteria) cluster identified in the Oral 1 dataset.

Supplementary Material

Supplementary Materials: NIHMS2152955-supplement-Supplementary_Materials.pdf (2.5 MB, pdf)

Supplementary Table 4: NIHMS2152955-supplement-Supplementary_Table_4.xlsx (18.8 KB, xlsx)

Supplementary Table 1: NIHMS2152955-supplement-Supplementary_Table_1.xlsx (156.7 KB, xlsx)

Supplementary Table 2: NIHMS2152955-supplement-Supplementary_Table_2.xlsx (504.2 KB, xlsx)

8. Acknowledgements

This work is supported by US National Institutes of Health grant R01HG010040 to H.L. J.S. is supported by an NSERC Postdoctoral Fellowship (PDF) award. We thank Haoyu Cheng, Xiaowen Feng, and members of the Li lab for helpful discussions. We thank Mantas Sereika and Rasmus Kirkegaard for providing valuable software feedback.

Footnotes

Competing interest statement

No competing interests declared.

5. Data availability

The mock nanopore community was generated from the MicroBench [115] suite by Rasmus Kirkegaard (https://github.com/Kirk3gaard/MicroBench and accession PRJEB85558) along with 14 isolates taken from Hall et al. [37]. Accessions for the real datasets are available in Supplementary Table 3.

6. Code availability

Myloasm is open source and available at https://github.com/bluenote-1577/myloasm. Documentation for myloasm is available at https://myloasm-docs.github.io/. The mylotools software suite is open source and available at https://github.com/bluenote-1577/mylotools.

References

1. Quince C, Walker AW, Simpson JT, Loman NJ & Segata N Shotgun metagenomics, from sampling to analysis. Nature Biotechnology 35, 833–844 (2017).
2. Hug LA et al. A new view of the tree of life. Nature Microbiology 1, 1–6 (2016).
3. Parks DH et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology 2, 1533–1542 (2017).
4. Spang A et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
5. Zhao S et al. Adaptive Evolution within Gut Microbiomes of Healthy People. Cell Host & Microbe 25, 656–667.e8 (2019).
6. Pérez-Cobas AE, Gomez-Valero L & Buchrieser C Metagenomic approaches in microbial ecology: An update on whole-genome and marker gene sequencing analyses. Microbial Genomics 6, mgen000409 (2020).
7. Kiefl E et al. Structure-informed microbial population genetics elucidate selective pressures that shape protein evolution. Science Advances 9, eabq4632 (2023).
8. Wallen ZD et al. Metagenomics of Parkinson’s disease implicates the gut microbiome in multiple disease mechanisms. Nature Communications 13, 6958 (2022).
9. Tisza MJ & Buck CB A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proceedings of the National Academy of Sciences 118, e2023202118 (2021).
10. Franzosa EA et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature Microbiology 4, 293–305 (2019).
11. Schmidt TSB et al. Drivers and determinants of strain dynamics following fecal microbiota transplantation. Nature Medicine 28, 1902–1912 (2022).
12. Bedarf JR et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Medicine 9, 39 (2017).
13. Woodcroft BJ et al. Genome-centric view of carbon processing in thawing permafrost. Nature 560, 49–54 (2018).
14. Ustick LJ et al. Metagenomic analysis reveals global-scale patterns of ocean nutrient limitation. Science 372, 287–291 (2021).
15. Liang J-L et al. Novel phosphate-solubilizing bacteria enhance soil phosphorus cycling following ecological restoration of land degraded by mining. The ISME Journal 14, 1600–1613 (2020).
16. Cavicchioli R et al. Scientists’ warning to humanity: Microorganisms and climate change. Nature Reviews Microbiology 17, 569–586 (2019).
17. Steen AD et al. High proportions of bacteria and archaea across most biomes remain uncultured. The ISME Journal 13, 3126–3130 (2019).
18. Bertrand D et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nature Biotechnology 37, 937–944 (2019).
19. Kolmogorov M et al. metaFlye: Scalable long-read metagenome assembly using repeat graphs. Nature Methods 17, 1103–1110 (2020).
20. Benoit G et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology (2024).
21. Feng X, Cheng H, Portik D & Li H Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods 19, 671–674 (2022).
22. Agustinho DP et al. Unveiling microbial diversity: Harnessing long-read sequencing technology. Nature Methods 21, 954–966 (2024).
23. Feng X & Li H Evaluating and improving the representation of bacterial contents in long-read metagenome assemblies. Genome Biology 25, 92 (2024).
24. Crits-Christoph A, Olm MR, Diamond S, Bouma-Gregson K & Banfield JF Soil bacterial populations are shaped by recombination and gene-specific selection across a grassland meadow. The ISME Journal 14, 1834–1846 (2020).
25. Liu Z & Good BH Dynamics of bacterial recombination in the human gut microbiome. PLOS Biology 22, e3002472 (2024).
26. Chen-Liaw A et al. Gut microbiota strain richness is species specific and affects engraftment. Nature 637, 422–429 (2025).
27. Goyal A, Bittleston LS, Leventhal GE, Lu L & Cordero OX Interactions between strains govern the eco-evolutionary dynamics of microbial communities. eLife 11, e74987 (2022).
28. Brito IL Examining horizontal gene transfer in microbial communities. Nature Reviews Microbiology 19, 442–453 (2021).
29. Nagarajan N & Pop M Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing. Journal of Computational Biology 16, 897–908 (2009).
30. Bresler G, Bresler M & Tse D Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics 14, S18 (2013).
31. Kerkvliet JJ et al. Metagenomic assembly is the main bottleneck in the identification of mobile genetic elements. PeerJ 12, e16695 (2024).
32. Nurk S et al. HiCanu: Accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research 30, 1291–1305 (2020).
33. Cheng H, Concepcion GT, Feng X, Zhang H & Li H Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
34. Minich JJ et al. Culture-independent meta-pangenomics enabled by long-read metagenomics reveals associations with pediatric undernutrition. Cell 0 (2025).
35. Sereika M et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nature Methods 19, 823–826 (2022).
36. Cheng H et al. Efficient near telomere-to-telomere assembly of Nanopore Simplex reads. bioRxiv 2025.04.14.648685 (2025).
37. Hall MB et al. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. eLife 13, RP98300 (2024).
38. Myers EW The fragment assembly string graph. Bioinformatics (Oxford, England) 21 Suppl 2, ii79–85 (2005).
39. Compeau PEC, Pevzner PA & Tesler G Why are de Bruijn graphs useful for genome assembly? Nature Biotechnology 29, 987–991 (2011).
40. Ekim B, Berger B & Chikhi R Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems 12, 958–968.e6 (2021).
41. Benoit G et al. High-quality metagenome assembly from nanopore reads with nanoMDBG. bioRxiv 2025.04.22.649928 (2025).
42. Kirkpatrick S, Gelatt CD & Vecchi MP Optimization by Simulated Annealing. Science 220, 671–680 (1983).
43. Trigodet F, Sachdeva R, Banfield JF & Eren AM Assemblies of long-read metagenomes suffer from diverse errors. bioRxiv 2025.04.22.649783 (2025).
44. Derelle R et al. Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA) (2024).
45. Gardner SN & Hall BG When Whole-Genome Alignments Just Won’t Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes. PLOS ONE 8, e81760 (2013).
46. Harris SR SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 10.1101/453142 (2018).
47. Edgar R Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 9, e10805 (2021).
48. Myers G & Miller W Chaining multiple-alignment fragments in sub-quadratic time. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’95, 38–47 (Society for Industrial and Applied Mathematics, USA, 1995).
49. Li H Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
50. Li H Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016).
51. Bouras G et al. Hybracter: Enabling scalable, automated, complete and accurate bacterial genome assemblies. Microbial Genomics 10, 001244 (2024).
52. Vaisbourd E, Bren A, Alon U & Glass DS Preventing Multimer Formation in Commonly Used Synthetic Biology Plasmids. ACS Synthetic Biology 14, 1309–1315 (2025).
53. Kiguchi Y et al. Giant extrachromosomal element “Inocle” potentially expands the adaptive capacity of the human oral microbiome. Nature Communications 16, 7397 (2025).
54. Sereika M et al. Recovery of highly contiguous genomes from complex terrestrial habitats reveals over 15,000 novel prokaryotic species and expands characterization of soil and sediment microbial communities (2024).
55. Gehrig JL et al. Finding the right fit: Evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microbial Genomics 8, 000794 (2022).
56. Sidhu C et al. Dissolved storage glycans shaped the community composition of abundant bacterioplankton clades during a North Sea spring phytoplankton bloom. Microbiome 11, 77 (2023).
57. Priest T, Orellana LH, Huettel B, Fuchs BM & Amann R Microbial metagenome-assembled genomes of the Fram Strait from short and long read sequencing platforms. PeerJ 9, e11721 (2021).
58. Kato S, Masuda S, Shibata A, Shirasu K & Ohkuma M Insights into ecological roles of uncultivated bacteria in Katase hot spring sediment from long-read metagenomics. Frontiers in Microbiology 13 (2022).
59. Zhang Y et al. Improved microbial genomes and gene catalog of the chicken gut from metagenomic sequencing of high-fidelity long reads. GigaScience 11, giac116 (2022).
60. Chklovski A, Parks DH, Woodcroft BJ & Tyson GW CheckM2: A rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods 20, 1203–1212 (2023).
61. Camargo AP et al. Identification of mobile genetic elements with geNomad. Nature Biotechnology 42, 1303–1312 (2024).
62. Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT & Aluru S High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications 9, 5114 (2018).
63. Blanco-Míguez A et al. Extension of the Segatella copri complex to 13 species with distinct large extrachromosomal elements and associations with host conditions. Cell Host & Microbe 31, 1804–1819.e9 (2023).
64. Chang H-W et al. Prevotella copri and microbiota members mediate the beneficial effects of a therapeutic food for malnutrition. Nature Microbiology 9, 922–937 (2024).
65. Maguire F et al. Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic Islands. Microbial Genomics 6, mgen000436 (2020).
66. Abramova A, Karkman A & Bengtsson-Palme J Metagenomic assemblies tend to break around antibiotic resistance genes. BMC Genomics 25, 959 (2024).
67. Xing L et al. ErmF and ereD Are Responsible for Erythromycin Resistance in Riemerella anatipestifer. PLoS ONE 10, e0131078 (2015).
68. Huttenhower C et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
69. He X et al. Cultivation of a human-associated TM7 phylotype reveals a reduced genome and epibiotic parasitic lifestyle. Proceedings of the National Academy of Sciences of the United States of America 112, 244–249 (2015).
70.Kazantseva E, Donmez A, Frolova M, Pop M & Kolmogorov M Strainy: Phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nature Methods 1–10 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
71.Shaw J, Gounot J-S, Chen H, Nagarajan N & Yu YW Floria: Fast and accurate strain haplotyping in metagenomes. Bioinformatics 40, i30–i38 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
72.Jochheim A et al. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. Microbiome 12, 187 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Grigoriev A Analyzing genomes with cumulative skew diagrams. Nucleic Acids Research 26, 2286–2290 (1998). [DOI] [PMC free article] [PubMed] [Google Scholar]
74.Schmidt S, Toivonen S, Medvedev P & Tomescu AI Applying the Safe-And-Complete Framework to Practical Genome Assembly. LIPIcs : Leibniz international proceedings in informatics 312, 8 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Dabbaghie F, Ebler J & Marschall T BubbleGun: Enumerating bubbles and superbubbles in genome graphs. Bioinformatics 38, 4217–4219 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Lancia G, Bafna V, Istrail S, Lippert R & Schwartz R SNPs Problems, Complexity, and Algorithms. In auf der Heide FM (ed.) Algorithms — ESA 2001, Lecture Notes in Computer Science, 182–193 (Springer, Berlin, Heidelberg, 2001). [Google Scholar]
77.Chaung K et al. SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery. Cell 186, 5440–5456.e26 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Ondov BD et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
79.Liu X et al. Nanopore strand-specific mismatch enables de novo detection of bacterial DNA modifications. Genome Research 34, 2025–2038 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
80.Delahaye C & Nicolas J Sequencing DNA with nanopores: Troubles and biases. PLOS ONE 16, e0257521 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
81.Roberts M, Hayes W, Hunt BR, Mount SM & Yorke JA Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004). [DOI] [PubMed] [Google Scholar]
82.Shaw J & Yu YW Theory of local k-mer selection with applications to long-read alignment. Bioinformatics 38, 4659–4669 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
83.Belbasi M, Blanca A, Harris RS, Koslicki D & Medvedev P The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics (Oxford, England) 38, i169–i176 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
84.Frith MC, Shaw J & Spouge JL How to optimally sample a sequence for rapid analysis. Bioinformatics 39, btad057 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
85.Shaw J & Yu YW Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research gr.277637.122 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
86.Chen J-Q et al. Variation in the Ratio of Nucleotide Substitution and Indel Rates across Genomes in Mammals and Bacteria. Molecular Biology and Evolution 26, 1523–1531 (2009). [DOI] [PubMed] [Google Scholar]
87.Spouge JL, Das P, Chen Y & Frith M The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches. Journal of Computational Biology 31, 1195–1210 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
88.Hoeffding W & Robbins H The Central Limit Theorem for Dependent Random Variables. In Fisher NI & Sen PK (eds.) The Collected Works of Wassily Hoeffding, 205–213 (Springer New York, New York, NY, 1994). [Google Scholar]
89.Stanojević D, Lin D, de Sessions PF & Šikić M Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. bioRxiv 2024.05.18.594796 (2024). [Google Scholar]
90.Nurk S et al. The complete sequence of a human genome. Science 376, 44–53 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
91.Tan K-T, Slevin MK, Meyerson M & Li H Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres. Genome Biology 23, 180 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
92.Jain C Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics 39, btad124 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
93.Li H & Durbin R Genome assembly in the telomere-to-telomere era. Nature Reviews Genetics 25, 658–670 (2024). [DOI] [PubMed] [Google Scholar]
94.Blanca A, Harris RS, Koslicki D & Medvedev P The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. Journal of Computational Biology: A Journal of Computational Molecular Cell Biology 29, 155–168 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
95.Liu D & Steinegger M Block Aligner: An adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices. Bioinformatics 39, btad487 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
96.Vaser R, Sović I, Nagarajan N & Šikić M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
97.Lee C, Grasso C & Sharlow MF Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002). [DOI] [PubMed] [Google Scholar]
98.Shaw J & Yu YW Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods 1–5 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
99.Kruchten N, Seier A & Parmer C An interactive, open-source, and browser-based graphing library for Python. DOI: 10.5281/zenodo.14503524 (2025). [DOI] [Google Scholar]
100.Köster J & Rahmann S Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012). [DOI] [PubMed] [Google Scholar]
101.O’Leary NA et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Scientific Data 11, 732 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
102.Wick RR Badread: Simulation of error-prone long reads. Journal of Open Source Software 4, 1316 (2019). [Google Scholar]
103.Gurevich A, Saveliev V, Vyahhi N & Tesler G QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
104.Mikheenko A, Saveliev V & Gurevich A MetaQUAST: Evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2016). [DOI] [PubMed] [Google Scholar]
105.Pan S, Zhao X-M & Coelho LP SemiBin2: Self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 39, i21–i29 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
106.Eren AM et al. Anvi’o: An advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
107.Rahman Hera M, Pierce-Ward NT & Koslicki D Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Research 33, 1061–1068 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
108.Chaumeil P-A, Mussig AJ, Hugenholtz P & Parks DH GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
109.Price MN, Dehal PS & Arkin AP FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5, e9490 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
110.Alcock BP et al. CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research 51, D690–D699 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
111.Schwengers O et al. Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics 7, 000685 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
112.Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V & Roach MJ Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software 9, 5968 (2024). [Google Scholar]
113.Gilchrist CLM & Chooi Y-H Clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics 37, 2473–2475 (2021). [DOI] [PubMed] [Google Scholar]
114.Marçais G et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
115.Kirkegaard R MicroBench (https://github.com/Kirk3gaard/MicroBench) (2025). [Google Scholar]

Associated Data

Supplementary Materials

NIHMS2152955-supplement-Supplementary_Materials.pdf (2.5 MB, pdf)

Supplementary Table 4

NIHMS2152955-supplement-Supplementary_Table_4.xlsx (18.8 KB, xlsx)

Supplementary Table 1

NIHMS2152955-supplement-Supplementary_Table_1.xlsx (156.7 KB, xlsx)

Supplementary Table 2

NIHMS2152955-supplement-Supplementary_Table_2.xlsx (504.2 KB, xlsx)

Data Availability Statement

[R1] 1.Quince C, Walker AW, Simpson JT, Loman NJ & Segata N Shotgun metagenomics, from sampling to analysis. Nature Biotechnology 35, 833–844 (2017). [DOI] [PubMed] [Google Scholar]

[R2] 2.Hug LA et al. A new view of the tree of life. Nature Microbiology 1, 1–6 (2016). [DOI] [PubMed] [Google Scholar]

[R3] 3.Parks DH et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology 2, 1533–1542 (2017). [DOI] [PubMed] [Google Scholar]

[R4] 4.Spang A et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Zhao S et al. Adaptive Evolution within Gut Microbiomes of Healthy People. Cell Host & Microbe 25, 656–667.e8 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Pérez-Cobas AE, Gomez-Valero L & Buchrieser C Metagenomic approaches in microbial ecology: An update on whole-genome and marker gene sequencing analyses. Microbial Genomics 6, mgen000409 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Kiefl E et al. Structure-informed microbial population genetics elucidate selective pressures that shape protein evolution. Science Advances 9, eabq4632 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Wallen ZD et al. Metagenomics of Parkinson’s disease implicates the gut microbiome in multiple disease mechanisms. Nature Communications 13, 6958 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Tisza MJ & Buck CB A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proceedings of the National Academy of Sciences 118, e2023202118 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Franzosa EA et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature Microbiology 4, 293–305 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Schmidt TSB et al. Drivers and determinants of strain dynamics following fecal microbiota transplantation. Nature Medicine 28, 1902–1912 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Bedarf JR et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Medicine 9, 39 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Woodcroft BJ et al. Genome-centric view of carbon processing in thawing permafrost. Nature 560, 49–54 (2018). [DOI] [PubMed] [Google Scholar]

[R14] 14.Ustick LJ et al. Metagenomic analysis reveals global-scale patterns of ocean nutrient limitation. Science 372, 287–291 (2021). [DOI] [PubMed] [Google Scholar]

[R15] 15.Liang J-L et al. Novel phosphate-solubilizing bacteria enhance soil phosphorus cycling following ecological restoration of land degraded by mining. The ISME Journal 14, 1600–1613 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Cavicchioli R et al. Scientists’ warning to humanity: Microorganisms and climate change. Nature Reviews Microbiology 17, 569–586 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Steen AD et al. High proportions of bacteria and archaea across most biomes remain uncultured. The ISME Journal 13, 3126–3130 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Bertrand D et al. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nature Biotechnology 37, 937–944 (2019). [DOI] [PubMed] [Google Scholar]

[R19] 19.Kolmogorov M et al. metaFlye: Scalable long-read metagenome assembly using repeat graphs. Nature Methods 17, 1103–1110 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Benoit G et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Feng X, Cheng H, Portik D & Li H Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods 19, 671–674 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Agustinho DP et al. Unveiling microbial diversity: Harnessing long-read sequencing technology. Nature methods 21, 954–966 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Feng X & Li H Evaluating and improving the representation of bacterial contents in long-read metagenome assemblies. Genome Biology 25, 92 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Crits-Christoph A, Olm MR, Diamond S, Bouma-Gregson K & Banfield JF Soil bacterial populations are shaped by recombination and gene-specific selection across a grassland meadow. The ISME Journal 14, 1834–1846 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Liu Z & Good BH Dynamics of bacterial recombination in the human gut microbiome. PLOS Biology 22, e3002472 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Chen-Liaw A et al. Gut microbiota strain richness is species specific and affects engraftment. Nature 637, 422–429 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Goyal A, Bittleston LS, Leventhal GE, Lu L & Cordero OX Interactions between strains govern the eco-evolutionary dynamics of microbial communities. eLife 11, e74987 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Brito IL Examining horizontal gene transfer in microbial communities. Nature Reviews Microbiology 19, 442–453 (2021). [DOI] [PubMed] [Google Scholar]

[R29] 29.Nagarajan N & Pop M Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing. Journal of Computational Biology 16, 897–908 (2009). [DOI] [PubMed] [Google Scholar]

[R30] 30.Bresler G, Bresler M & Tse D Optimal assembly for high throughput shotgun sequencing. BMC Bioinformatics 14, S18 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.Kerkvliet JJ et al. Metagenomic assembly is the main bottleneck in the identification of mobile genetic elements. PeerJ 12, e16695 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Nurk S et al. HiCanu: Accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research 30, 1291–1305 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] 33.Cheng H, Concepcion GT, Feng X, Zhang H & Li H Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Minich JJ et al. Culture-independent meta-pangenomics enabled by long-read metagenomics reveals associations with pediatric undernutrition. Cell 0 (2025). [DOI] [PubMed] [Google Scholar]

[R35] 35.Sereika M et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nature Methods 19, 823–826 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] 36.Cheng H et al. Efficient near telomere-to-telomere assembly of Nanopore Simplex reads. bioRxiv 2025.04.14.648685 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Hall MB et al. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. eLife 13, RP98300 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Myers EW The fragment assembly string graph. Bioinformatics (Oxford, England) 21 Suppl 2, ii79–85 (2005). [DOI] [PubMed] [Google Scholar]

[R39] 39.Compeau PEC, Pevzner PA & Tesler G Why are de Bruijn graphs useful for genome assembly? Nature biotechnology 29, 987–991 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Ekim B, Berger B & Chikhi R Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems 12, 958–968.e6 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] 41.Benoit G et al. High-quality metagenome assembly from nanopore reads with nanoMDBG. bioRxiv 2025.04.22.649928 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] 42.Kirkpatrick S, Gelatt CD & Vecchi MP Optimization by Simulated Annealing. Science 220, 671–680 (1983). [DOI] [PubMed] [Google Scholar]

[R43] 43.Trigodet F, Sachdeva R, Banfield JF & Eren AM Assemblies of long-read metagenomes suffer from diverse errors. bioRxiv 2025.04.22.649783 (2025). [Google Scholar]

[R44] 44.Derelle R et al. Seamless, rapid and accurate analyses of outbreak genomic data using Split K-mer Analysis (SKA) (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] 45.Gardner SN & Hall BG When Whole-Genome Alignments Just Won’t Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes. PLOS ONE 8, e81760 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] 46.Harris SR SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 10.1101/453142 (2018). [DOI] [Google Scholar]

[R47] 47.Edgar R Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 9, e10805 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] 48.Myers G & Miller W Chaining multiple-alignment fragments in sub-quadratic time. In Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’95, 38–47 (Society for Industrial and Applied Mathematics, USA, 1995). [Google Scholar]

[R49] 49.Li H Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] 50.Li H Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103–2110 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R51] 51.Bouras G et al. Hybracter: Enabling scalable, automated, complete and accurate bacterial genome assemblies. Microbial Genomics 10, 001244 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] 52.Vaisbourd E, Bren A, Alon U & Glass DS Preventing Multimer Formation in Commonly Used Synthetic Biology Plasmids. ACS Synthetic Biology 14, 1309–1315 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R53] 53.Kiguchi Y et al. Giant extrachromosomal element “Inocle” potentially expands the adaptive capacity of the human oral microbiome. Nature Communications 16, 7397 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R54] 54.Sereika M et al. Recovery of highly contiguous genomes from complex terrestrial habitats reveals over 15,000 novel prokaryotic species and expands characterization of soil and sediment microbial communities (2024). [Google Scholar]

[R55] 55.Gehrig JL et al. Finding the right fit: Evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microbial Genomics 8, 000794 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R56] 56.Sidhu C et al. Dissolved storage glycans shaped the community composition of abundant bacterioplankton clades during a North Sea spring phytoplankton bloom. Microbiome 11, 77 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R57] 57.Priest T, Orellana LH, Huettel B, Fuchs BM & Amann R Microbial metagenome-assembled genomes of the Fram Strait from short and long read sequencing platforms. PeerJ 9, e11721 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R58] 58.Kato S, Masuda S, Shibata A, Shirasu K & Ohkuma M Insights into ecological roles of uncultivated bacteria in Katase hot spring sediment from long-read metagenomics. Frontiers in Microbiology 13 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R59] 59.Zhang Y et al. Improved microbial genomes and gene catalog of the chicken gut from metagenomic sequencing of high-fidelity long reads. GigaScience 11, giac116 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R60] 60.Chklovski A, Parks DH, Woodcroft BJ & Tyson GW CheckM2: A rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods 20, 1203–1212 (2023). [DOI] [PubMed] [Google Scholar]

[R61] 61.Camargo AP et al. Identification of mobile genetic elements with geNomad. Nature Biotechnology 42, 1303–1312 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R62] 62.Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT & Aluru S High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications 9, 5114 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R63] 63.Blanco-Míguez A et al. Extension of the Segatella copri complex to 13 species with distinct large extrachromosomal elements and associations with host conditions. Cell Host & Microbe 31, 1804–1819.e9 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R64] 64.Chang H-W et al. Prevotella copri and microbiota members mediate the beneficial effects of a therapeutic food for malnutrition. Nature Microbiology 9, 922–937 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R65] 65.Maguire F et al. Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic Islands. Microbial Genomics 6, mgen000436 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R66] 66.Abramova A, Karkman A & Bengtsson-Palme J Metagenomic assemblies tend to break around antibiotic resistance genes. BMC Genomics 25, 959 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R67] 67.Xing L et al. ErmF and ereD Are Responsible for Erythromycin Resistance in Riemerella anatipestifer. PLoS ONE 10, e0131078 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R68] 68.Huttenhower C et al. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R69] 69.He X et al. Cultivation of a human-associated TM7 phylotype reveals a reduced genome and epibiotic parasitic lifestyle. Proceedings of the National Academy of Sciences of the United States of America 112, 244–249 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R70] 70.Kazantseva E, Donmez A, Frolova M, Pop M & Kolmogorov M Strainy: Phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nature Methods 1–10 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R71] 71.Shaw J, Gounot J-S, Chen H, Nagarajan N & Yu YW Floria: Fast and accurate strain haplotyping in metagenomes. Bioinformatics 40, i30–i38 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R72] 72.Jochheim A et al. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. Microbiome 12, 187 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R73] 73.Grigoriev A Analyzing genomes with cumulative skew diagrams. Nucleic Acids Research 26, 2286–2290 (1998). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R74] 74.Schmidt S, Toivonen S, Medvedev P & Tomescu AI Applying the Safe-And-Complete Framework to Practical Genome Assembly. LIPIcs : Leibniz international proceedings in informatics 312, 8 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R75] 75.Dabbaghie F, Ebler J & Marschall T BubbleGun: Enumerating bubbles and superbubbles in genome graphs. Bioinformatics 38, 4217–4219 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R76] 76.Lancia G, Bafna V, Istrail S, Lippert R & Schwartz R SNPs Problems, Complexity, and Algorithms. In auf der Heide FM (ed.) Algorithms — ESA 2001, Lecture Notes in Computer Science, 182–193 (Springer, Berlin, Heidelberg, 2001). [Google Scholar]

[R77] 77.Chaung K et al. SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery. Cell 186, 5440–5456.e26 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R78] 78.Ondov BD et al. Mash: Fast genome and metagenome distance estimation using MinHash. Genome Biology 17, 132 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R79] 79.Liu X et al. Nanopore strand-specific mismatch enables de novo detection of bacterial DNA modifications. Genome Research 34, 2025–2038 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R80] 80.Delahaye C & Nicolas J Sequencing DNA with nanopores: Troubles and biases. PLOS ONE 16, e0257521 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R81] 81.Roberts M, Hayes W, Hunt BR, Mount SM & Yorke JA Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004). [DOI] [PubMed] [Google Scholar]

[R82] 82.Shaw J & Yu YW Theory of local k-mer selection with applications to long-read alignment. Bioinformatics 38, 4659–4669 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R83] 83.Belbasi M, Blanca A, Harris RS, Koslicki D & Medvedev P The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics (Oxford, England) 38, i169–i176 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R84] 84.Frith MC, Shaw J & Spouge JL How to optimally sample a sequence for rapid analysis. Bioinformatics 39, btad057 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R85] 85.Shaw J & Yu YW Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research gr.277637.122 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R86] 86.Chen J-Q et al. Variation in the Ratio of Nucleotide Substitution and Indel Rates across Genomes in Mammals and Bacteria. Molecular Biology and Evolution 26, 1523–1531 (2009). [DOI] [PubMed] [Google Scholar]

[R87] 87.Spouge JL, Das P, Chen Y & Frith M The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches. Journal of Computational Biology 31, 1195–1210 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R88] 88.Hoeffding W & Robbins H The Central Limit Theorem for Dependent Random Variables. In Fisher NI & Sen PK (eds.) The Collected Works of Wassily Hoeffding, 205–213 (Springer New York, New York, NY, 1994). [Google Scholar]

[R89] 89.Stanojević D, Lin D, de Sessions PF & Šikić M Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads. bioRxiv 2024.05.18.594796 (2024). [Google Scholar]

[R90] 90.Nurk S et al. The complete sequence of a human genome. Science 376, 44–53 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R91] 91.Tan K-T, Slevin MK, Meyerson M & Li H Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres. Genome Biology 23, 180 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R92] 92.Jain C Coverage-preserving sparsification of overlap graphs for long-read assembly. Bioinformatics 39, btad124 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R93] 93.Li H & Durbin R Genome assembly in the telomere-to-telomere era. Nature Reviews Genetics 25, 658–670 (2024). [DOI] [PubMed] [Google Scholar]

[R94] 94.Blanca A, Harris RS, Koslicki D & Medvedev P The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. Journal of Computational Biology: A Journal of Computational Molecular Cell Biology 29, 155–168 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R95] 95.Liu D & Steinegger M Block Aligner: An adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices. Bioinformatics 39, btad487 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R96] 96.Vaser R, Sović I, Nagarajan N & Šikić M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R97] 97.Lee C, Grasso C & Sharlow MF Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002). [DOI] [PubMed] [Google Scholar]

[R98] 98.Shaw J & Yu YW Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods 1–5 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R99] 99.Kruchten N, Seier A & Parmer C An interactive, open-source, and browser-based graphing library for Python. DOI: 10.5281/zenodo.14503524 (2025). [DOI] [Google Scholar]

[R100] 100. Köster J & Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012).

[R101] 101. O’Leary NA et al. Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets. Scientific Data 11, 732 (2024).

[R102] 102. Wick RR. Badread: Simulation of error-prone long reads. Journal of Open Source Software 4, 1316 (2019).

[R103] 103. Gurevich A, Saveliev V, Vyahhi N & Tesler G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

[R104] 104. Mikheenko A, Saveliev V & Gurevich A. MetaQUAST: Evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2016).

[R105] 105. Pan S, Zhao X-M & Coelho LP. SemiBin2: Self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics 39, i21–i29 (2023).

[R106] 106. Eren AM et al. Anvi’o: An advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319 (2015).

[R107] 107. Rahman Hera M, Pierce-Ward NT & Koslicki D. Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Research 33, 1061–1068 (2023).

[R108] 108. Chaumeil P-A, Mussig AJ, Hugenholtz P & Parks DH. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).

[R109] 109. Price MN, Dehal PS & Arkin AP. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5, e9490 (2010).

[R110] 110. Alcock BP et al. CARD 2023: Expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Research 51, D690–D699 (2023).

[R111] 111. Schwengers O et al. Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics 7, 000685 (2021).

[R112] 112. Bouras G, Grigson SR, Papudeshi B, Mallawaarachchi V & Roach MJ. Dnaapler: A tool to reorient circular microbial genomes. Journal of Open Source Software 9, 5968 (2024).

[R113] 113. Gilchrist CLM & Chooi Y-H. Clinker & clustermap.js: Automatic generation of gene cluster comparison figures. Bioinformatics 37, 2473–2475 (2021).

[R114] 114. Marçais G et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944 (2018).

[R115] 115. Kirkegaard R. MicroBench (https://github.com/Kirk3gaard/MicroBench) (2025).
