Alignment-free unique molecular identifier clustering suppresses sequencing errors for accurate detection of low-frequency DNA variants

Fei Yu; Haojie Xiao; Dongyang Song; Xiao Yang; Shiyue Huang; Yu Wang; Mingze Bai; Xiaoming Yao; Kunxian Shu; Dan Pu

doi:10.1093/bib/bbaf483

2025 Sep 22;26(5):bbaf483. doi: 10.1093/bib/bbaf483


PMCID: PMC12452285 PMID: 40982469

Abstract

Accurate detection of low-frequency DNA variants (below 1%) is essential in diverse biological and clinical contexts, yet remains fundamentally constrained by the high intrinsic error rates of next-generation sequencing technologies. Although unique molecular identifiers (UMIs) have significantly mitigated these errors by uniquely indexing original template molecules, their efficacy is compromised by UMI collisions and by artifacts introduced during polymerase chain reaction (PCR) amplification and sequencing, which collectively engender false-positive variant calls. Here, we present AFUMIC, an alignment-free UMI clustering framework that systematically addresses these limitations through collision-resilient UMI grouping and a consensus quality score (CQS)–guided strategy for high-fidelity consensus sequence generation. AFUMIC reduces singleton families, enhances clustering precision, and maximizes data retention, yielding 7.27-fold and 3.84-fold increases in single-strand consensus sequence and duplex consensus sequence output, respectively, compared to Du Novo. It further decreases the per-base error rate by 14.33-fold and raises the proportion of error-free positions from 45.27% to 99.85%, enabling confident detection of variants at ultra-low variant allele frequencies. Notably, AFUMIC exhibits superior computational efficiency, rendering it well-suited for high-throughput analysis of UMI-tagged libraries in large-scale genomic studies. Collectively, AFUMIC establishes a broadly applicable and computationally efficient framework for ultrasensitive, error-corrected variant detection that can be readily deployed in both clinical diagnostics and large-scale genomic research.

Keywords: UMI clustering, low-frequency variant, duplex sequencing, error suppression

Introduction

Next-generation sequencing (NGS) has facilitated the parallel characterization of genetic variants across multiple samples and has been routinely applied to determine variants with a variant allele fraction (VAF) above 1% [1, 2]. Although NGS theoretically enables the detection of variants at low frequencies (below 1%), practical implementation is constrained by its intrinsic high error rates (0.1–1%) originating from library preparation and sequencing [2]. Reliable detection of low-frequency variants is pivotal for applications, including cancer diagnosis, therapy guidance, and monitoring [3–5], drug resistance [6], organ transplant rejection [7, 8], infectious disease [9], prenatal diagnosis [10, 11], and forensic genetics [12]. Consequently, it is particularly urgent, yet remains incredibly challenging, to detect low-frequency DNA variants employing NGS.

Bioinformatic analysis typically employs probabilistic modeling to distinguish true variants from sequencing artifacts [13–15], but it alone is insufficient for the detection of low-frequency variants with high confidence, therefore promoting the adoption of unique molecular identifiers (UMIs) to unambiguously tag original template molecules [16, 17]. UMI-based error correction strategies assign a unique UMI to each DNA template molecule prior to PCR amplification, therefore uniquely tagging each molecule and ensuring that all subsequent PCR duplicates share an identical UMI. In the subsequent data analysis, an initial step of UMI clustering is necessitated to group reads with identical UMIs into a single “read family”, which is subsequently applied to generate single-strand consensus sequences (SSCSs) for both forward and reverse strands. Complementary SSCSs are subsequently merged to yield duplex consensus sequences (DCSs). Variants detected in only one SSCS but absent in the complementary SSCS are likely sequencing errors or artifacts, whereas true variants should be present in DCSs [16, 17]. Numerous UMI-based methodologies have been developed and applied across diverse research and clinical applications [18–24].

Nonetheless, UMI-based strategies remain susceptible to several limitations. UMI collisions, in which distinct template molecules are tagged with identical UMIs, lead to erroneous read grouping and consequent loss or misestimation of variant calls [25]. Furthermore, UMIs are vulnerable to both PCR- and sequencing-induced errors, giving rise to two forms of clustering inaccuracy (Fig. 1). Errors introduced during early PCR cycles within UMIs can split reads from a single template into multiple clusters, producing artifactual variant calls; in contrast, sequencing errors or late-cycle PCR errors typically generate singleton clusters that are excluded from downstream analysis, substantially reducing data retention [26].

Figure 1. Propagation of PCR- and sequencing-induced errors in Duplex sequencing. (A) Schematic of error accumulation during DNA library construction. Three template molecules are uniquely tagged with UMIs and undergo three PCR cycles. Because of polymerase inefficiency, not all molecules are replicated in each cycle, and amplification errors can arise at any stage. Errors introduced during early cycles are disproportionately propagated in the final amplicon pool. (B) Errors introduced during paired-end sequencing. In addition to PCR-derived errors, base-calling errors during paired-end sequencing can alter both the original template sequence and the associated UMI tags.

To circumvent these limitations, various UMI clustering algorithms have been developed, categorized as alignment-based and alignment-free strategies. Alignment-based strategies (e.g. UMI-tools [26] and DAUMI [27]) rely on aligning reads back to a reference genome and clustering UMIs according to alignment coordinates and UMI sequences. Although effective in reducing sequencing errors and mitigating PCR amplification bias, these algorithms are biased toward the reference, which limits their applicability for de novo assembly or detecting indels or other alleles that significantly deviate from the reference and cause alignment difficulties. Alignment-free clustering strategies bypass the alignment step and instead cluster reads based on UMI sequence similarity (e.g. Du Novo [28] and AmpliCI [25]) or the entire DNA sequence, including the UMI sequences (e.g. CD-HIT [29], Rainbow [30], and Calib [31]). By avoiding the alignment step, these approaches offer enhanced efficiency for larger datasets. However, approaches that rely solely on UMI sequences struggle to discriminate UMI collisions. In contrast, methodologies that cluster reads based on entire sequences can alleviate this issue, but they risk either over-merging similar reads or under-merging multi-error reads derived from abundant singletons [27, 32, 33]. Importantly, these tools require substantial amounts of memory and computational resources.

Here, we introduce AFUMIC, an alignment-free UMI clustering framework designed to mitigate PCR- and sequencing-induced artifacts, while improving clustering fidelity and optimizing read utilization. Through rigorous evaluation on simulated and benchmark datasets, we demonstrate its superior performance in UMI clustering, SSCS and DCS generation efficiency, and background error suppression. Additionally, by effective consideration of core read-family sequences on the basis of sequence similarity, AFUMIC substantially enhances the computational efficiency. The validation of clinical sample datasets further underscores its utility in clinical settings.

Materials and methods

Dataset

To comprehensively evaluate AFUMIC, three distinct datasets were employed (Additional file 1: Supplementary Table S1). The first, a synthetic UMI-based targeted sequencing dataset generated by utilizing UMI-Gen [34], was constructed from three binary alignment map (BAM) files, a browser extensible data (BED) file defining genomic targets, and a comma-separated values file specifying known variant allele frequencies (VAFs). Within this dataset, three control files, together with a curated subset of BED-defined target regions (Additional file 1: Supplementary Table S2) from the original publication, were employed to estimate background error rates and define the target sequencing panel. Fifteen known variants spanning a broad range of VAFs were randomly introduced into the simulated reads. The resulting simulated dataset comprised 368 746 reads, each carrying an 8-bp UMI barcode and exhibiting a mean fragment length of 150 bp. The second dataset, HD701, generated by Orabi et al. [31], constituted a widely adopted benchmark for UMI-based targeted sequencing and encompassed 254 validated variants with VAFs ranging from 1% to 24%, including 10 confirmed via ddPCR by Horizon Discovery, and 244 originating from the parental HD701 cell lines (Additional file 1: Supplementary Table S3). The third dataset, FGFR3 Y C, prepared by Salazar et al. [21], comprised DNA from a cell line harboring the FGFR3 variants (c.742C>T, c.746C>G, c.749C>G, c.1620C>A) serially diluted into sperm DNA from a young donor to prepare four Duplex sequencing libraries (FGFR3 Y C 1:10, FGFR3 Y C 1:100, FGFR3 Y C 1:1000, FGFR3 Y C 1:10 000) at dilution ratios of 1:10, 1:100, 1:1000, and 1:10 000, corresponding to anticipated VAFs of 10%, 1%, 0.1%, and 0.01%, respectively.

Data pre-processing

The AFUMIC workflow is illustrated in Additional file 2: Supplementary Fig. S1. Raw sequencing data was pre-processed to maximize data quality and minimize technical artifacts. Initially, sequencing quality metrics, including read length distribution, base quality scores, and GC content, were assessed employing FastQC (v0.12.1). Adapter sequences and low-quality bases were subsequently trimmed employing Trim Galore (v0.4.4), with a Phred score threshold of 20. Following quality filtering, UMIs were extracted using UMI-tools (v1.1.4) by concatenating barcode sequences from read pairs in both orientations (UMI1 + UMI2 or UMI2 + UMI1). The resulting concatenated UMIs were exported in the FAST-All (FASTA) format for downstream processing. A reference genome index was subsequently constructed with BWA (v0.7.17), against which concatenated UMI sequences were aligned using Bowtie (v1.3.1) to assess sequence integrity and remove barcodes with structural inconsistencies or ambiguous alignments.
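The UMI concatenation step can be illustrated with a minimal Python sketch. It is not the UMI-tools invocation used in the study: the function names are hypothetical, and it assumes that each read starts with its barcode and that the 8-bp length reported for the simulated dataset applies to each read of a pair.

```python
def extract_duplex_umi(read1: str, read2: str, umi_len: int = 8):
    """Split a read pair into a concatenated UMI tag and the two inserts.

    Assumption: the barcode occupies the first `umi_len` bases of each read.
    """
    if len(read1) < umi_len or len(read2) < umi_len:
        raise ValueError("read shorter than the UMI length")
    umi1, umi2 = read1[:umi_len], read2[:umi_len]
    # Tag in the UMI1 + UMI2 orientation, followed by the barcode-free inserts.
    return umi1 + umi2, read1[umi_len:], read2[umi_len:]


def swap_halves(tag: str) -> str:
    """Return the tag expected for the opposite orientation (UMI2 + UMI1)."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


tag, insert1, insert2 = extract_duplex_umi("AAAACCCCGATTACA", "GGGGTTTTTGTAATC")
print(tag)               # AAAACCCCGGGGTTTT
print(swap_halves(tag))  # GGGGTTTTAAAACCCC
```

Swapping the two halves of a tag gives the key under which the complementary strand of the same duplex would be expected, which is what later allows the two strand families to be paired.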

UMI clustering

A schematic overview of AFUMIC is illustrated in Fig. 2A. Briefly, reads sharing identical or highly similar UMIs were initially clustered utilizing a graph-based strategy implemented in NetworkX (v2.8), after which UMI errors were corrected. The corrected UMIs were subsequently reassigned to their corresponding read groups for consensus construction. Within each group, multiple sequence alignment was performed using MAFFT (v7.505), and the consensus base at each position was determined based on the highest consensus quality score (CQS). These resulting consensus bases were assembled into SSCSs. Ultimately, complementary SSCSs originating from the same DNA duplex were merged to generate DCSs for accurate variant detection.

Figure 2. Overview of AFUMIC and UMI clustering workflow. (A) Schematic representation of AFUMIC workflow. Adapter sequences are trimmed from raw sequencing reads, after which reads sharing identical or near-identical UMIs (within an HD threshold) are grouped into molecular families. A graph-based clustering algorithm is applied to correct UMI errors and resolve ambiguous groupings, with corrected UMI assignments propagated to the original dataset. Within each molecular family, multiple sequence alignment is performed to extract the core sequences, and per-base CQSs are computed to generate SSCSs. Complementary SSCS pairs, derived from opposite strands of the same DNA duplex, are subsequently merged to produce duplex consensus sequences (DCSs). (B) Graphical depiction of the UMI clustering process. Reads whose UMIs differ by a single base are clustered using a graph-based approach, enabling error-tolerant grouping and robust family reconstruction.

Figure 2B illustrates the UMI clustering strategy implemented in AFUMIC. In brief, AFUMIC implements a graph-based clustering algorithm, developed using the NetworkX Python package (v2.8), to identify and correct sequencing errors within UMI tags. Each read family, representing two complementary strands derived from the same DNA molecule, was tagged as (UMI1 + UMI2) or (UMI2 + UMI1), depending on strand orientation. Each double-stranded family was represented as an undirected graph $G = (V, E)$, where nodes ($V$) corresponded to UMI sequences and edges ($E$) represented sequence similarities. All nodes were enumerated, and the node with the highest degree was designated as the root node $r$; the remaining nodes were assigned as leaves. The Hamming distance (HD) between a leaf $l$ and the root $r$, denoted as $d_H(l, r)$, was defined as follows:

$$d_H(l, r) = \sum_{i=1}^{n} \mathbb{1}(l_i \neq r_i) \qquad (1)$$

where $l_i$ and $r_i$ denoted the $i$th nucleotides of $l$ and $r$, respectively, $n$ was the UMI length, and $\mathbb{1}(\cdot)$ equaled 1 when its condition held and 0 otherwise. If $d_H(l, r)$ was less than or equal to a predefined threshold, an edge was added to the set $E$, and the sequence of the leaf node was corrected to match that of the root node. To balance error correction with data retention, the HD threshold was empirically set to 2, based on performance across simulated datasets. Nodes exceeding this threshold were retained without correction to avoid over-merging within the graph structure. Corrected read sequences were recorded in a families.corrected.tsv file, which contained UMI sequences and their corresponding read identifiers for downstream analysis. For visualization, final UMI clustering graphs generated at a threshold of 2 were exported in Graph Exchange XML Format and visualized using Gephi (v0.10.1) for clustering topology inspection.
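The root-and-leaf correction described above can be sketched in plain Python. This is an illustrative approximation, not the AFUMIC implementation: it uses no NetworkX, and it assumes that roots are selected greedily by degree in the similarity graph, with read count and then sequence order as tie-breakers. The text specifies only the highest-degree rule, so the tie-breakers and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umis, threshold=2):
    """Map every observed UMI to a corrected (root) UMI."""
    counts = Counter(umis)
    nodes = sorted(counts)
    neighbours = {u: set() for u in nodes}
    # Add an edge between every pair of UMIs within the HD threshold.
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    corrected = {}
    # Assumed tie-breaking: highest degree first, then highest read count.
    order = sorted(nodes, key=lambda u: (-len(neighbours[u]), -counts[u], u))
    for root in order:
        if root in corrected:
            continue  # already absorbed by an earlier root
        corrected[root] = root
        for leaf in neighbours[root]:
            corrected.setdefault(leaf, root)  # leaf adopts the root sequence
    return corrected


umis = ["ACGTACGT"] * 5 + ["ACGTACGA", "ACGTACCA"] + ["TTTTGGGG"] * 3 + ["TTTTGGGC"]
mapping = cluster_umis(umis, threshold=2)
print(mapping["ACGTACCA"])         # ACGTACGT
print(len(set(mapping.values())))  # 2
```

The all-pairs comparison is quadratic in the number of distinct UMIs, so a sketch like this only suits small inputs. UMIs with no neighbour within the threshold simply map to themselves, mirroring the statement that such nodes are retained without correction.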

Multiple sequence alignment and consensus sequence generation

Next, multiple sequence alignments were performed on the entire read sequences, including UMI sequences, to facilitate the generation of consensus sequences for variant calling. Reads listed in the families.corrected.tsv file were subsequently aligned using MAFFT (v7.505) under the global alignment mode, thereby ensuring that sequences derived from the same original DNA molecule were precisely co-localized within the same alignment block, which in turn enabled accurate positional comparison and robust extraction of core regions. Within each group, multiple sequence alignment was performed, and alignment scores were calculated utilizing the Smith–Waterman algorithm with a scoring scheme of +2 for matches, –0.5 for mismatches, and –1.0 for gaps. To identify candidate core regions for consensus construction, a 30 bp window was centered on each alignment position, with four additional 30-mers extracted at offsets of +2, +1, –1, and –2 bases. Among these candidates, the most frequently observed sequence was selected as the representative core. The default core region window was set here to 30 bp, with a center shift tolerance of ±5 bp, yielding a maximum evaluation span of 40 bp. For each aligned position, per-base quality was calculated using the CQS, defined as:

(2)

Here, the CQS depended on the frequency of the most abundant nucleotide at a given position within a read family. A nucleotide was designated as the consensus base if it achieved the highest CQS and was supported by at least three reads; otherwise, the position was assigned as “N”. SSCSs were generated by concatenating consensus bases across all positions within each strand-specific family. DCSs were subsequently constructed by pairing SSCSs derived from complementary strands of the same original DNA molecule. A base was retained in the final DCS only if both SSCSs exhibited concordant calls at the corresponding position; discordant sites were assigned as “N”. DCSs were aligned to the GRCh38 reference genome for downstream variant calling.
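The consensus rules in this subsection (a minimum of three supporting reads per base, and "N" at discordant duplex positions) can be sketched as follows. Because the CQS formula is not restated here, the sketch substitutes a simple majority count for the CQS, which is an assumption. It also assumes that the reads of a family are already aligned to equal length and that both SSCSs are expressed in the same orientation. The function names are hypothetical.

```python
from collections import Counter


def sscs(aligned_reads, min_support=3):
    """Column-wise consensus over the equal-length aligned reads of one family."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must share one length")
    consensus = []
    for column in zip(*aligned_reads):
        # Gaps and ambiguous calls do not count as support for any base.
        counts = Counter(base for base in column if base in "ACGT")
        if not counts:
            consensus.append("N")
            continue
        base, support = counts.most_common(1)[0]
        # Stand-in for "highest CQS": the most frequent base, if well supported.
        consensus.append(base if support >= min_support else "N")
    return "".join(consensus)


def dcs(sscs_a: str, sscs_b: str) -> str:
    """Keep a base only where both strand consensuses agree; otherwise N."""
    if len(sscs_a) != len(sscs_b):
        raise ValueError("SSCS pair must share one length")
    return "".join(a if a == b else "N" for a, b in zip(sscs_a, sscs_b))


forward = sscs(["ACGTA", "ACGTA", "ACCTA", "ACGTA"])
reverse = sscs(["ACGTA", "ACGTT", "ACGTT", "ACGTT"])
print(forward, reverse, dcs(forward, reverse))  # ACGTA ACGTT ACGTN
```

In the example, the isolated C in the third forward read is outvoted, while the last position is called differently on the two strands and therefore becomes "N" in the duplex consensus.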

Variant calling and annotation

Variant calling was performed employing BCFtools (v1.17), which generated variant call format files containing the genomic coordinates, allelic depth, and associated quality metrics for each candidate variant. To maximize the accuracy of variant detection, stringent filtering criteria were applied during the variant calling process. Subsequently, all retained variants were subjected to functional annotation using ANNOVAR (v2023Mar15) [35], a widely implemented tool for annotating genetic variants from high-throughput sequencing.

Background error estimation

To quantify base substitution (error) rates at each genomic locus, the Integrated Digital Error Suppression tool (https://cappseq.stanford.edu/ides/download.php#bgReport) was employed [19]. Briefly, DCS-aligned BAM files were converted into per-base frequency tables by ides-bam2freq.pl. Subsequently, background error rates were calculated using ides-bgreport.pl, considering non-reference bases that met the allele-frequency cutoff and were supported by at least one read. The error rates were defined as the ratio of non-reference base counts to the total number of sequenced bases within the targeted panel.
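The error-rate definition above reduces to a simple ratio, illustrated by the sketch below. It does not reproduce the iDES scripts: the input layout (one base-count dictionary per panel position) and the function name are hypothetical, and no allele-frequency filter is applied because the cutoff is not restated here.

```python
def background_summary(position_counts, reference):
    """Summarise background errors over a targeted panel.

    position_counts: one dict per position mapping base -> observed count.
    reference:       reference base at each of those positions.
    """
    if len(position_counts) != len(reference):
        raise ValueError("one count dictionary is needed per reference base")
    total_bases = nonref_bases = error_free_positions = 0
    for counts, ref_base in zip(position_counts, reference):
        depth = sum(counts.values())
        nonref = depth - counts.get(ref_base, 0)
        total_bases += depth
        nonref_bases += nonref
        error_free_positions += nonref == 0
    n_positions = len(reference)
    return {
        # Non-reference base counts divided by all sequenced bases in the panel.
        "error_rate": nonref_bases / total_bases if total_bases else 0.0,
        # Share of panel positions at which no non-reference base was seen.
        "error_free_fraction": error_free_positions / n_positions if n_positions else 0.0,
    }


summary = background_summary(
    [{"A": 100}, {"C": 98, "T": 2}, {"G": 50}],
    "ACG",
)
print(summary["error_rate"])  # 0.008
```

The second value returned, the fraction of positions with no non-reference base, corresponds to the "error-free positions" metric reported in the Results.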

Performance evaluation

The HD701 benchmark dataset was employed to systematically evaluate the performance of AFUMIC in comparison with six established methodologies: Duplex sequencing [17], Calib [31], Du Novo [28], UMI-tools [26], Starcode [36], and Rainbow [30]. Standard classification metrics, including true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), were employed to calculate sensitivity, precision, and F1 score, as defined by the following equations:

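$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \qquad (5)$$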

Additionally, the computational efficiency of AFUMIC was benchmarked by applying it to the HD701 dataset and recording the execution times of each major pipeline stage—including UMI clustering, sequence alignment, read family grouping, and the overall workflow—and comparing these with those of state-of-the-art UMI clustering tools, including UMI-tools [26], UMICollapse [37], and Calib [31]. All computational analyses were performed on a high-performance computing server hosted by Chongqing University of Posts and Telecommunications, equipped with an Intel Xeon Platinum 8375C processor, 256 GB of physical memory, and running Ubuntu 22.04.

Results

AFUMIC enhances UMI clustering efficiency

Accurate UMI clustering is essential for consensus sequencing, ensuring that PCR duplicates derived from the same template molecule are correctly consolidated into read families. In practice, however, sequencing and amplification errors within UMIs can generate artificial UMI variants. Consequently, reads that should be assigned to the same family are instead partitioned into singleton read families—those containing only a single read pair [28]. Analysis of a simulated dataset initially comprising 10 406 family groups revealed that singletons constitute 71.4% of read groups (Fig. 3A and B, Additional file 1: Supplementary Table S4), highlighting the substantial data attrition intrinsic to conventional clustering strategies, which typically mandated a minimum of three duplicates per family.

Figure 3. Efficiency of UMI clustering and data utilization in AFUMIC. (A) Total number of read families in the simulated dataset prior to UMI clustering at HDs of 1, 2, and 3. (B) Proportion of read family sizes in the simulated dataset prior to UMI clustering. (C) Comparison of read family counts across HD thresholds between AFUMIC and Du Novo. (D) Number of corrected UMIs produced by AFUMIC and Du Novo. (E) Comparison of UMIs failing to meet thresholds between AFUMIC and Du Novo. (F) Required clustering iterations for AFUMIC and Du Novo at varying HD thresholds. (G and H) Proportion of read families in the HD701 dataset at an HD threshold of 1 before (G) and after UMI clustering (H). (I) Distribution of read family sizes in the simulated dataset at an HD threshold of 1. (J) Proportion of read family sizes in the simulated dataset at an HD threshold of 1. (K and L) Distribution of read family sizes in the simulated dataset at HD thresholds of 2 (K), and 3 (L), respectively. (M) Number of singletons at an HD threshold of 1 across different approaches. (N) Comparison of SSCS and DCS yields generated by different methodologies. (O) Efficiency of SSCS and DCS generation utilizing different approaches.

To address this limitation, AFUMIC implemented a graph-based algorithm to correct barcode discrepancies by utilizing an adjustable HD threshold. Reads with UMIs falling within the specified threshold were iteratively clustered into consensus families. In contrast, those with UMIs that exceeded the threshold were stringently excluded to prevent erroneous clustering. Performance evaluation across HDs of 1, 2, and 3 demonstrated that AFUMIC reconstructed 792, 701, and 560 read families, respectively, and corrected 9615, 9705, and 9847 erroneous UMIs. Correspondingly, 130 583, 10 329, and 1465 UMIs were excluded, indicating a marked decrease in data loss as clustering tolerance increased (Fig. 3C–E, Additional file 1: Supplementary Table S4). To further visualize the structural topology of AFUMIC's clustering, we rendered the graph topology generated by AFUMIC at an HD of 2 by employing Gephi (v0.10.1). As shown in Additional file 2: Supplementary Fig. S2A, the network exhibited well-separated UMI clusters with dense internal connectivity and minimal cross-cluster links, indicating effective error correction and high clustering fidelity.

To benchmark AFUMIC's UMI clustering performance, we evaluated Du Novo, a previously reported graph-based tool [28]. Under identical HD thresholds, Du Novo reconstructed 792, 382, and 381 read family groups and corrected 8888, 9011, and 9331 UMIs, respectively (Fig. 3C and D, Additional file 1: Supplementary Table S4). Unlike AFUMIC, Du Novo retained all UMIs without exclusion, which coincided with a sharp rise in computational iterations at thresholds 2 and 3 (Fig. 3E and F, Additional file 1: Supplementary Table S4), suggesting a propensity for over-merging and a consequent reduction in clustering specificity. These findings demonstrate that AFUMIC achieves superior UMI clustering fidelity by balancing error correction with the controlled exclusion of ambiguous UMIs, thereby reducing singleton inflation and enhancing data retention.

AFUMIC optimizes sequencing data retention

Despite the high accuracy of consensus sequencing strategies, their inherently redundant design frequently compromises data efficiency and increases sequencing costs. Conventional consensus sequencing strategies require large read family sizes to achieve robust error suppression, which may lead to substantial data loss. To evaluate this trade-off, we assessed the relationship between family size and data retention utilizing both simulated and HD701 datasets. A threshold of 20 reads per family was applied, consistent with prior reports indicating it was the typical requirement for generating reliable consensus sequences [19, 38]. Analysis of simulated and HD701 datasets revealed that 91.8% and 92.6% of read families, respectively, contained fewer than 20 reads (Additional file 2: Supplementary Fig. S2B and C). The application of a minimum family size of 20 would therefore result in substantial data attrition. Further analysis revealed that families with fewer than three reads accounted for the majority of the datasets, prompting most approaches to set a minimum family size threshold of three [2, 17]. However, the implementation of a minimum family size of 3 in AFUMIC resulted in the exclusion of 85.7% and 76.6% of reads across the two datasets, respectively (Fig. 3B and G). These observations underscore the inefficiency of prevailing consensus sequencing strategies under current error suppression thresholds.
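The trade-off between the minimum family size and data retention can be made concrete with a short sketch. The input (one UMI tag per read) and the function name are hypothetical, and the toy numbers are purely illustrative. They are not taken from either dataset.

```python
from collections import Counter


def retention_summary(umi_per_read, min_family_size=3):
    """Summarise how a minimum family size affects data retention."""
    family_sizes = Counter(umi_per_read)  # UMI tag -> number of reads
    n_families = len(family_sizes)
    n_reads = sum(family_sizes.values())
    singletons = sum(1 for size in family_sizes.values() if size == 1)
    discarded = sum(size for size in family_sizes.values() if size < min_family_size)
    return {
        "families": n_families,
        "singleton_family_fraction": singletons / n_families if n_families else 0.0,
        "discarded_read_fraction": discarded / n_reads if n_reads else 0.0,
    }


# Toy input: one family of five reads, one of two reads, three singletons.
tags = ["A"] * 5 + ["B"] * 2 + ["C", "D", "E"]
print(retention_summary(tags))
# {'families': 5, 'singleton_family_fraction': 0.6, 'discarded_read_fraction': 0.5}
```

Even in this toy case, a cutoff of three reads discards half of the reads although only one family is large enough to keep, which is the kind of attrition that UMI correction is meant to reduce.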

To address this limitation, we assessed the impact of UMI clustering on the distribution of read families by applying AFUMIC across HD thresholds of 1, 2, and 3 (Fig. 3H–L). Prior to clustering, family sizes were skewed toward high-count singletons (Fig. 3G). At an HD threshold of 1, 92.4% of HD701 reads were organized into families exceeding three members, while 7.6% remained singleton clusters (Fig. 3H–J). Increasing the threshold to 2 and 3 progressively eliminated low-size clusters and expanded the distribution of larger family sizes (Fig. 3K and L). Accordingly, an HD threshold of 2 was implemented to optimize the balance between cluster specificity and data retention in subsequent analyses.

Singletons frequently arise from errors introduced in UMI sequences during sequencing or late PCR cycles [39]. Errors incurred during the early cycles of PCR in UMI barcodes may cause reads from the same molecule to be erroneously segregated into distinct clusters, thereby leading to misestimation of sampled variants. AFUMIC mitigated these artifacts by applying a graph-based correction to UMI sequences prior to family assignment. At an HD threshold of 2, AFUMIC reduced singleton counts from 2 545 549 to 315 825 in the HD701 dataset, effectively recovering 2 229 724 reads for consensus construction (Fig. 3M). By contrast, Du Novo [28] collapsed 2 019 756 reads into only 11 clusters, yielding no singleton families and suggesting substantial UMI overcorrection, whereas Duplex sequencing [17], which omitted UMI correction, retained 3 806 613 unclustered singletons (Fig. 3M).

Next, we evaluated the impact of singleton rescue on downstream consensus sequence generation by comparing SSCS and DCS efficiencies (defined as SSCS/total reads and DCS/total reads, respectively) across AFUMIC, Du Novo, and Duplex sequencing. AFUMIC outperformed Duplex sequencing in both SSCS (18.75% versus 6.71%) and DCS (2.19% versus 0.89%) yields (Fig. 3N and O). Moreover, compared to Du Novo, AFUMIC achieved a 7.27-fold improvement in SSCS efficiency (from 2.58% to 18.75%) and a 3.84-fold increase in DCS efficiency (from 0.57% to 2.19%) (Fig. 3N and O). Collectively, these findings indicate that AFUMIC effectively rescues reads misclassified as singletons via UMI error correction, thereby substantially increasing the yield of consensus sequences.
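The fold changes follow directly from the reported efficiencies. As a worked check, with the ratios against Duplex sequencing derived here rather than quoted from the text:

```latex
\text{efficiency} = \frac{\text{number of consensus sequences}}{\text{total reads}}

\text{AFUMIC vs. Du Novo:}\quad
\frac{18.75\%}{2.58\%} \approx 7.27 \ (\text{SSCS}), \qquad
\frac{2.19\%}{0.57\%} \approx 3.84 \ (\text{DCS})

\text{AFUMIC vs. Duplex sequencing:}\quad
\frac{18.75\%}{6.71\%} \approx 2.79 \ (\text{SSCS}), \qquad
\frac{2.19\%}{0.89\%} \approx 2.46 \ (\text{DCS})
```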

AFUMIC enables high-fidelity detection of low-frequency variants

To evaluate the accuracy of AFUMIC in the identification of low-frequency variants, we benchmarked it against the HD701 reference dataset, which comprised 22 082 242 read pairs and 254 validated SNVs. Following UMI-based clustering and error correction, variant calling was performed employing BCFtools (v1.17), and the resulting calls were compared with those from six established methodologies: Duplex sequencing, Calib, Du Novo, UMI-tools, Starcode, and Rainbow. Among all evaluated methodologies, AFUMIC recovered the highest number of TPs, with Calib and Starcode ranking second and third, respectively (Fig. 4A). Given the high FP rates across approaches, a focused comparison was conducted between AFUMIC and Duplex sequencing, the latter being widely regarded as the gold standard of high-accuracy sequencing [40]. AFUMIC identified 281 variants, comprising 162 TPs and 119 FPs, whereas Duplex sequencing reported 408 variants, of which only 119 were TPs and 289 were FPs (Fig. 4B). These results indicate that AFUMIC recovers more TPs while reporting fewer FPs than Duplex sequencing. F1 score analysis further corroborated these findings, suggesting that AFUMIC achieved a better balance between sensitivity and precision compared to Duplex sequencing (Fig. 4C). Functional annotation employing ANNOVAR identified 56 synonymous variants (some potentially affecting codon usage or splicing) and three stop-gain mutations introducing premature termination codons (Additional file 1: Supplementary Tables S5 and S6).
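The TP and FP counts above can be converted into sensitivity, precision, and F1 score with a few lines of Python. The sketch assumes that the 254 validated SNVs form the complete set of positives, so that FN = 254 − TP. The resulting values are derived here for illustration and are not quoted from the study's figures. The function name is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int):
    """Return (sensitivity, precision, F1) from raw counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, precision, f1


VALIDATED = 254  # validated SNVs in HD701; assumed to be all true positives

for name, tp, fp in [("AFUMIC", 162, 119), ("Duplex sequencing", 119, 289)]:
    sens, prec, f1 = classification_metrics(tp, fp, VALIDATED - tp)
    print(f"{name}: sensitivity={sens:.3f} precision={prec:.3f} F1={f1:.3f}")
# AFUMIC: sensitivity=0.638 precision=0.577 F1=0.606
# Duplex sequencing: sensitivity=0.469 precision=0.292 F1=0.360
```

Under this assumption, AFUMIC roughly doubles the precision of Duplex sequencing while also raising sensitivity, which agrees with the direction of the comparison reported in the text.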

Figure 4. Performance evaluation of AFUMIC through variant calling and background error suppression. (A) Number of TP variants detected in the HD701 dataset across AFUMIC and six established UMI-based methodologies. (B) Venn diagram illustrating variants identified by different approaches in the HD701 dataset. (C) Comparison of F1 scores between AFUMIC and Duplex sequencing. (D) Proportion of genomic positions without errors. (E) Percentage of sequencing errors. (F) Distribution of background error counts across different variant types in the HD701 dataset.

AFUMIC effectively suppresses background errors

Background errors in high-throughput sequencing predominantly originate from mismatches in UMIs during PCR, DNA polymerase damage, ultrasonic damage, and other factors [19]. These technical artifacts pose a significant challenge to the reliable detection of low- and ultra-low-frequency variants. To quantitatively evaluate AFUMIC’s capacity for background error suppression, we compared its performance with Du Novo and Duplex sequencing. AFUMIC demonstrated a substantial improvement in error suppression, elevating the proportion of error-free positions from 45.27% to 99.85% (Fig. 4D, Additional file 1: Supplementary Table S7), and reducing the per-base error rate by 14.33-fold (Fig. 4E, Additional file 1: Supplementary Table S7). In contrast, Du Novo achieved a comparable increase in error-free sites (from 45.27% to 99.79%), but yielded only a 6.40-fold reduction in error rate (Fig. 4D and E, Additional file 1: Supplementary Table S7). These results underscore AFUMIC’s superior capability in reducing sequencing errors compared to Du Novo.

We next characterized base substitution error profiles across AFUMIC, Du Novo, and Duplex sequencing (Fig. 4F, Additional file 2: Supplementary Fig. S3A–F). AFUMIC effectively suppressed G>C and T>C substitutions but showed a relatively limited reduction of C>A and G>T transversions. These transversions were consistent with previously reported patterns arising from oxidative stress and spontaneous deamination of 5-methylcytosine during sample processing [41–43], suggesting that these errors primarily originated from DNA damage accumulated during sample preparation.

AFUMIC demonstrates superior computational efficiency

To comprehensively evaluate the computational efficiency of AFUMIC, we conducted a comparative analysis against three prevalent UMI clustering tools: UMI-tools, UMICollapse, and Calib, utilizing the HD701 dataset. UMI-tools and UMICollapse were alignment-based approaches, whereas Calib and AFUMIC operated in an alignment-free manner. Consistent with expectations, alignment-free tools ran substantially faster than their alignment-dependent counterparts (Fig. 5A, Additional file 1: Supplementary Table S8). AFUMIC completed the entire analysis in 11 min, compared to 29.4 min for Calib (2.67-fold faster). In contrast, UMI-tools and UMICollapse required 71 and 99.30 min, respectively. Despite both being alignment-free, AFUMIC and Calib displayed pronounced disparities in computational time allocation across the major pipeline stages. Specifically, AFUMIC required only 5.85 min for UMI clustering and approximately 4 min for consensus generation, whereas Calib consumed 6.80 and 22.60 min, respectively (Fig. 5A, Additional file 1: Supplementary Table S8). These results underscore AFUMIC's superior runtime efficiency, driven by an optimized clustering workflow and consensus construction.

Figure 5. Performance evaluation of AFUMIC with respect to computational efficiency and low-frequency variant detection in clinical sequencing datasets. (A) Comparison of computational runtime between AFUMIC, UMI-tools, UMICollapse, and Calib. (B and C) Distributions of read family sizes before (B) and after (C) AFUMIC-based clustering in clinical datasets (SRR13290174–SRR13290177) at dilution ratios ranging from 1:10 000 to 1:10. (D and E) Efficiency of SSCS (D) and DCS (E) generation by different methodologies utilizing the same clinical datasets. (F) Proportion of error-free base positions. (G) Per-base error rates observed in raw reads, as well as in SSCS and DCS generated utilizing AFUMIC in clinical datasets.

Clinical applications of AFUMIC

We next analyzed the FGFR3 Y C dataset to evaluate AFUMIC’s clinical utility. This dataset comprised genomic DNA harboring point mutations (c.742C>T, c.746C>G, c.749C>G, c.1620C>A) from Coriell-derived cell lines serially diluted into wild-type donor DNA at ratios ranging from 1:10 to 1:10 000. Prior to UMI clustering, a disproportionately high fraction of reads were present as singletons, thereby constraining data utilization efficiency (Fig. 5B, Additional file 2: Supplementary Fig. S4A). Following AFUMIC clustering, the proportion of read families with three or more reads increased substantially (Fig. 5C, Additional file 2: Supplementary Fig. S4B), indicating that AFUMIC effectively consolidated reads into appropriate groups and thereby enhanced overall data utilization.

Enhanced data utilization translated directly into improved consensus sequence generation. Across all dilutions, AFUMIC achieved consistently higher SSCS and DCS efficiencies than Duplex sequencing and Du Novo (Fig. 5D and E). Furthermore, AFUMIC substantially suppressed background errors in the sample datasets, increasing the average proportion of error-free bases from 16.15% to 98.06% (a 6.07-fold improvement) and reducing the proportion of erroneous bases approximately 22.40-fold (Fig. 5F and G). The base substitution error profiles across the clinical dilution series further supported the efficacy of AFUMIC in mitigating background errors (Additional file 2: Supplementary Fig. S4C–E). Notably, error types C→A, C→T, and G→A were consistently suppressed at the DCS level across all dilutions, highlighting AFUMIC’s robustness in low-allele-fraction settings. Conversely, AFUMIC exhibited diminished suppression efficiency for A→C and T→C errors. Finally, comparative analysis of the FGFR3 Y C dataset demonstrated that AFUMIC detected 7 of 16 variants at a VAF of Inline graphic, thereby outperforming both Duplex sequencing and Du Novo. These results underscore the markedly enhanced sensitivity of AFUMIC for ultra-low-frequency variant detection (Additional file 1: Supplementary Table S9).
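The two metrics discussed here, substitution-type counts and the proportion of error-free positions, can be illustrated with a minimal sketch. It assumes that consensus sequences are already oriented and trimmed to the same coordinates as the reference, and that a position counts as error-free when no consensus sequence differs from the reference there; both are simplifying assumptions, and the function name and toy sequences are hypothetical rather than taken from AFUMIC.

```python
from collections import Counter


def substitution_profile(reference: str, consensus_seqs: list[str]):
    """Tally substitution classes (e.g. 'C>A') and the fraction of error-free positions."""
    counts = Counter()
    positions_with_error = set()
    for seq in consensus_seqs:
        for pos, (ref_base, obs_base) in enumerate(zip(reference, seq)):
            # 'N' marks a masked base and is not counted as a substitution.
            if obs_base != ref_base and obs_base != "N":
                counts[f"{ref_base}>{obs_base}"] += 1
                positions_with_error.add(pos)
    error_free_fraction = 1 - len(positions_with_error) / len(reference)
    return counts, error_free_fraction


# Toy data: the second consensus carries a single C>A mismatch at position 5 (0-based).
reference = "ACGTACGTAC"
consensus = ["ACGTACGTAC", "ACGTAAGTAC", "ACGTACGTAC"]
counts, error_free = substitution_profile(reference, consensus)
print(dict(counts), f"{error_free:.2%}")
```

Applied to the reported proportions, the same arithmetic gives 98.06 / 16.15 ≈ 6.07 for the fold improvement in error-free bases.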

Discussion

The capacity to detect low-frequency DNA variants is increasingly vital in both medical research and clinical settings. However, it remains challenging due to the high intrinsic error rates of standard NGS approaches. Although NGS technologies have advanced in recent years, these limitations persist, thereby necessitating robust error suppression strategies to enhance detection accuracy [44]. UMI-based technologies have emerged as powerful solutions to address this challenge and have proven to be the most effective methodologies for the detection of low-frequency DNA variants, with the theoretical potential to reduce error rates to below Inline graphic [2]. Nonetheless, UMIs can collide, where identical UMIs label multiple sampled molecules [25], resulting in undercounting, variant loss, or misestimation. Moreover, UMIs themselves are susceptible to both PCR- and sequencing-induced errors, causing false-positive variants and inflated counts of sampled molecules [31]. Existing UMI-based approaches frequently struggle with these issues, particularly in the context of UMI collisions or PCR- and sequencing-induced artifacts in UMI sequences, thus limiting their reliability in the high-fidelity detection of low-frequency variants.
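To give a sense of how readily UMI collisions arise, the sketch below estimates the probability that a given molecule shares its UMI with at least one other molecule at the same locus. It assumes that UMIs are drawn uniformly and independently from all 4^L possible sequences, which real UMI pools only approximate; the function name and the example UMI lengths and molecule counts are hypothetical.

```python
def collision_probability(umi_length: int, n_molecules: int) -> float:
    """Probability that one molecule shares its UMI with at least one of the others.

    Assumes UMIs are sampled uniformly and independently from 4**umi_length sequences.
    """
    n_umis = 4 ** umi_length
    return 1.0 - (1.0 - 1.0 / n_umis) ** (n_molecules - 1)


# Illustrative values only: 10 000 molecules covering one locus.
for length in (8, 12, 16):
    p = collision_probability(length, 10_000)
    print(f"UMI length {length}: collision probability per molecule = {p:.3e}")
```

Under this simplified model, shorter UMIs and deeper sampling of the same locus both raise the collision probability, which is why collision handling matters most in deep targeted sequencing.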

In this context, we developed AFUMIC, an alignment-free UMI clustering framework designed to address both collision-induced ambiguities and technical artifacts. By integrating CQS, AFUMIC substantially mitigated sequencing-derived errors. AFUMIC was comprehensively evaluated across three datasets—a simulated dataset, the HD701 reference standard, and a clinically relevant FGFR3 Y C dilution series—and demonstrated high sensitivity in detecting low-frequency variants, while maximizing data retention and minimizing background noise. Notably, by bypassing conventional alignment and focusing on core sequence regions rather than full-length reads during consensus construction, AFUMIC exhibited superior computational efficiency over existing methodologies.

AFUMIC significantly enhanced UMI clustering by correcting sequencing errors and effectively resolving UMI collisions, thereby increasing data retention and utilization (Fig. 3). Comparative benchmarking demonstrated that AFUMIC excluded 130 583, 10 329, and 1465 UMIs at HD thresholds of 1, 2, and 3, respectively. In contrast, Du Novo retained all UMIs at the same thresholds, indicating a propensity toward overcorrection. Further investigation revealed that Du Novo occasionally merged dissimilar UMI sequences beyond the permitted edit distance through iterative correction. For instance, the UMI “AGCAAAAAGCTACCAA” underwent five sequential corrections, ultimately yielding “AAAAAAGCTACCAA” despite exceeding the defined threshold. AFUMIC circumvented such issues by enforcing stringent filtering criteria to remove UMIs exceeding the specified HD, thereby enhancing error-correction fidelity. Importantly, raising the threshold from 1 to 2 and 3 only marginally reduced the number of reconstructed families, whereas erroneous UMI loss decreased precipitously, from 130 583 to 10 329 and 1465, respectively, thereby substantially optimizing data retention and mitigating the overcorrection observed with Du Novo.
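The overcorrection described above can be reproduced with a toy example: when corrections are chained, every individual step may respect the HD threshold while the start and end of the chain do not. The sketch below contrasts this with a strict check against the original UMI. The function names and the 4-nt UMIs are hypothetical and are not taken from the AFUMIC or Du Novo source code.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def strict_assign(umi: str, representative: str, max_hd: int):
    """Return the representative only if it lies within max_hd of the ORIGINAL UMI."""
    return representative if hamming(umi, representative) <= max_hd else None


# Hypothetical correction chain in which each step changes exactly one base.
chain = ["AGCA", "AACA", "AAAA"]
step_distances = [hamming(a, b) for a, b in zip(chain, chain[1:])]
end_to_end = hamming(chain[0], chain[-1])

print(step_distances)  # every step is within HD 1
print(end_to_end)      # but the endpoints differ at two positions
print(strict_assign(chain[0], chain[-1], max_hd=1))  # rejected by the strict check
```

Checking the distance between the original UMI and its final representative, rather than between consecutive intermediates, is one simple way to keep a correction within the stated threshold.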

A major limitation of consensus sequencing lies in its dependence on extremely high sequencing depth to offset inherent error rates, frequently at the expense of data utilization efficiency. The application of stringent family size thresholds—commonly three or more reads—results in the elimination of a substantial fraction of potentially informative data [45]. Across all three datasets, we observed that read families with fewer than three members predominated (Fig. 3B and G; Fig. 5B), with singletons constituting more than 60% of reads prior to clustering. This predominance of small families was primarily attributed to UMI synthesis errors and sequencing artifacts. Previous attempts to recover singletons [45, 46] involved either pairing with complementary strand consensus sequences or clustering with similar singleton UMIs. However, these approaches frequently failed to adequately address PCR- or sequencing-induced errors embedded within the UMI itself. AFUMIC addressed this challenge through an alignment-free UMI clustering strategy that corrected both synthesis- and sequencing-induced errors prior to consensus generation. This approach substantially reduced singleton counts and enhanced the recovery efficiency of both SSCS and DCS (Fig. 3M–O). In contrast, Duplex sequencing required a minimum family size of three for SSCS construction, thereby causing considerable data attrition. Although Du Novo enabled partial recovery via UMI correction, it failed to correct molecular sequence errors, ultimately compromising clustering accuracy and diminishing downstream consensus yields.
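The effect of a minimum family size on data retention can be made concrete with a small sketch that groups reads by UMI and reports how many reads survive the threshold. The reads below are toy data, not values from any dataset in this study, and the function name is hypothetical.

```python
from collections import Counter


def family_size_summary(read_umis: list[str], min_family_size: int = 3) -> dict:
    """Summarize read families (reads sharing a UMI) and the reads kept by a size threshold."""
    family_sizes = Counter(read_umis)  # UMI -> number of reads in that family
    total_reads = sum(family_sizes.values())
    # A singleton family contributes exactly one read.
    singleton_reads = sum(1 for size in family_sizes.values() if size == 1)
    retained_reads = sum(size for size in family_sizes.values() if size >= min_family_size)
    return {
        "families": len(family_sizes),
        "singleton_read_fraction": singleton_reads / total_reads,
        "retained_read_fraction": retained_reads / total_reads,
    }


# Toy input: one family of four reads, one of two reads, and four singletons.
reads = ["AAAA"] * 4 + ["CCCC"] * 2 + ["GGGG", "TTTT", "ACGT", "TGCA"]
summary = family_size_summary(reads, min_family_size=3)
print(summary)
```

In this toy case only the four reads of the largest family pass a threshold of three, so any UMI correction that merges singletons into their true families directly increases the retained fraction.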

Sequencing errors remain a major impediment to the accurate identification of genetic variants in low-frequency alleles or rare subclones, which are pivotal for cancer molecular diagnosis, therapeutic decision-making, and longitudinal surveillance through deep sequencing. Substitution-specific error profiling across diverse datasets revealed that AFUMIC exhibited context-dependent suppression efficiencies (Fig. 4F; Additional file 2: Supplementary Fig. S3 and S4C–E), thereby underscoring the necessity of substitution-aware benchmarking within ultra-sensitive variant detection workflows. The distinct error profiles observed in UMIs and molecular sequences were attributable to cumulative artifacts introduced at various stages of the NGS workflow, including sample handling, library preparation, PCR enrichment, and sequencing [47]. AFUMIC implemented an integrated error suppression strategy for the robust mitigation of these sequencing-induced artifacts. Its graph-based strategy corrected UMI tagging errors and distinguished between true UMIs and those altered by PCR- or sequencing-induced errors, thereby enabling accurate read family reconstruction.
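As a generic illustration of graph-based UMI grouping, the sketch below implements a count-aware ("directional") network of the kind popularized by UMI-tools [26]: a low-count UMI is absorbed by a neighbor within the HD threshold only if that neighbor is at least roughly twice as abundant. This is not AFUMIC's algorithm, whose details are not given in this section; the function names and the example counts are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def directional_clusters(umi_counts: dict[str, int], max_hd: int = 1) -> list[list[str]]:
    """Group UMIs so that rare UMIs join a nearby, much more abundant UMI."""
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    ordered = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    clusters = []
    for seed in ordered:
        if seed in assigned:
            continue
        cluster = [seed]
        assigned.add(seed)
        frontier = [seed]
        while frontier:
            parent = frontier.pop()
            for child in ordered:
                if child in assigned:
                    continue
                # Edge parent -> child: close in sequence AND parent clearly more abundant.
                if (hamming(parent, child) <= max_hd
                        and umi_counts[parent] >= 2 * umi_counts[child] - 1):
                    assigned.add(child)
                    cluster.append(child)
                    frontier.append(child)
        clusters.append(cluster)
    return clusters


# "ACGA" looks like an error copy of "ACGT"; "TTTT" and "TTTA" are both abundant.
counts = {"ACGT": 100, "ACGA": 3, "TTTT": 50, "TTTA": 40}
clusters = directional_clusters(counts)
print(clusters)
```

The abundance condition is what separates a likely error copy (merged) from two similar but independently abundant UMIs (kept apart), which is the distinction a graph-based strategy needs in order to avoid collapsing distinct molecules.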

Additionally, AFUMIC filtered late-cycle PCR errors and stochastic sequencing artifacts, which frequently manifested as singleton reads. Errors introduced during the earliest PCR cycles, particularly the first round of amplification, presented a greater challenge, as they were symmetrically propagated to daughter molecules and thus could be misinterpreted as true mutations. Given that complementary errors are extremely unlikely to occur by chance at the same genomic position on both DNA strands [2], AFUMIC mitigated early-cycle artifacts by independently constructing strand-specific consensus sequences and scoring mutations only when concordant evidence was observed on both strands. Despite these advantages, AFUMIC, akin to other UMI-based tools evaluated in this study, encountered limitations in correcting errors that arose prior to UMI tagging, such as during cDNA synthesis. These pre-barcoding artifacts preceded the molecular tagging step, thereby compromising the efficacy of post-tagging error-correction strategies. These pre-UMI errors may be more effectively mitigated through downstream analyses employing UMI-based variant callers [26, 48, 49].
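The strand-concordance principle can be sketched in a few lines: a base enters the duplex consensus only where the two single-strand consensuses agree, and a variant is scored only from unmasked duplex bases. The sketch assumes that the bottom-strand SSCS has already been reverse-complemented into top-strand orientation and that both sequences span the same coordinates; the function names and sequences are hypothetical and do not reproduce AFUMIC's CQS-guided procedure.

```python
def duplex_consensus(sscs_top: str, sscs_bottom: str) -> str:
    """Keep a base only where both strand consensuses agree; mask disagreements with 'N'."""
    if len(sscs_top) != len(sscs_bottom):
        raise ValueError("Strand consensuses must cover the same positions")
    return "".join(
        top if top == bottom and top != "N" else "N"
        for top, bottom in zip(sscs_top, sscs_bottom)
    )


def call_variants(reference: str, dcs: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, ref base, alt base) for unmasked mismatches."""
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, dcs))
        if alt not in (ref, "N")
    ]


reference = "ACGTACGT"
sscs_top = "AGGTTCGT"      # position 1: strand-specific artifact; position 4: true variant
sscs_bottom = "ACGTTCGT"   # position 4: the same variant, supported by the other strand
dcs = duplex_consensus(sscs_top, sscs_bottom)
variants = call_variants(reference, dcs)
print(dcs, variants)
```

The artifact seen on one strand only is masked, whereas the substitution supported by both strands is retained, mirroring the rationale for scoring mutations only on concordant evidence.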

In conclusion, this study introduced AFUMIC, an alignment-free UMI clustering approach designed to address major challenges in the detection of low-frequency DNA variants. AFUMIC demonstrated superior data retention, greater computational efficiency, and increased variant calling accuracy, thereby representing a substantial advancement in the field of genomic sequencing with promising applications in both research and clinical diagnostics.

Key Points

A novel alignment-free UMI clustering approach, AFUMIC, is introduced, which effectively addresses UMI collisions, mitigates sequencing artifacts, and leverages CQS to enhance consensus sequence generation.
AFUMIC markedly diminishes the prevalence of singleton families, enhances clustering fidelity, optimizes data retention, and effectively suppresses sequencing artifacts.
AFUMIC demonstrates superior computational efficiency, thereby facilitating scalable, high-throughput processing of UMI-tagged sequencing data and offering broad applicability to large-scale genomic studies and clinical investigations.

Supplementary Material

Supplemental_materials_2_bbaf483: supplemental_materials_2_bbaf483.docx (1.3 MB)

Supplemental_materials_1_bbaf483: supplemental_materials_1_bbaf483.xlsx (53.6 KB)

Acknowledgments

Not applicable.

Contributor Information

Fei Yu, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Haojie Xiao, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Dongyang Song, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Xiao Yang, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Shiyue Huang, Chongqing Yangjiaping Middle School, Chongqing 400050, China.

Yu Wang, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Mingze Bai, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Xiaoming Yao, Caronos Inc., Anji, Zhejiang 313300, China.

Kunxian Shu, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Dan Pu, Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, No. 2 Chongwen Road, Nan'an District, Chongqing 400065, China.

Author contributions

Fei Yu (Conceptualization, Data curation, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing), Haojie Xiao (Methodology, Software), Dongyang Song (Data analysis), Xiao Yang (Data analysis), Shiyue Huang (Investigation, Feedback), Yu Wang (Investigation, Feedback), Mingze Bai (Validation, Writing – review & editing), Xiaoming Yao (Review), Kunxian Shu (Validation, Review, Supervision), and Dan Pu (Conceptualization, Supervision, Writing – original draft, Writing – review & editing).

Conflict of interest: The authors declare no competing interests.

Funding

This study was supported by the Scientific and Technological Research Program of Chongqing Education Committee [KJQN202300627], and the Natural Science Foundation of Chongqing [CSTB2024NSCQ-KJFZMSX0036, CSTC2021JCYJ-MSXMX0848].

Data availability

The data analyzed in this study are publicly available in the Sequence Read Archive (SRA) at https://www.ncbi.nlm.nih.gov/sra, with accession numbers SRR7903524 for HD701, SRR13290174 for FGFR3 Y C 1:10 000, SRR13290175 for FGFR3 Y C 1:1000, SRR13290176 for FGFR3 Y C 1:100, and SRR13290177 for FGFR3 Y C 1:10. The AFUMIC source code is available at https://github.com/DanPu-Lab/AFUMIC.

References

1. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17:333–51. doi:10.1038/nrg.2016.49
2. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018;19:269–85. doi:10.1038/nrg.2017.117
3. Deveson IW, Gong B, Lai K. et al. Evaluating the analytical validity of circulating tumor DNA sequencing assays for precision oncology. Nat Biotechnol 2021;39:1115–28. doi:10.1038/s41587-021-00857-z
4. Nagasaka M, Uddin MH, Al-Hallak MN. et al. Liquid biopsy for therapy monitoring in early-stage non-small cell lung cancer. Mol Cancer 2021;20:82–98. doi:10.1186/s12943-021-01371-1
5. Filipska M, Rosell R. Mutated circulating tumor DNA as a liquid biopsy in lung cancer detection and treatment. Mol Oncol 2021;15:1667–82. doi:10.1002/1878-0261.12983
6. Streck NT, Espy MJ, Ferber MJ. et al. Use of next-generation sequencing to detect mutations associated with antiviral drug resistance in cytomegalovirus. J Clin Microbiol 2023;61:e0042923. doi:10.1128/jcm.00429-23
7. Oellerich M, Sherwood K, Keown P. et al. Liquid biopsies: donor-derived cell-free DNA for the detection of kidney allograft injury. Nat Rev Nephrol 2021;17:591–603. doi:10.1038/s41581-021-00428-0
8. Khush KK. Clinical utility of donor-derived cell-free DNA testing in cardiac transplantation. J Heart Lung Transplant 2021;40:397–404. doi:10.1016/j.healun.2021.01.1564
9. Li M, Song C, Hu J. et al. Impact of pretreatment low-abundance HIV-1 drug resistance on virological failure after 1 year of antiretroviral therapy in China. J Antimicrob Chemother 2023;78:2743–51. doi:10.1093/jac/dkad297
10. Moufarrej MN, Bianchi DW, Shaw GM. et al. Noninvasive prenatal testing using circulating DNA and RNA: advances, challenges, and possibilities. Annu Rev Biomed Data Sci 2023;6:397–418. doi:10.1146/annurev-biodatasci-020722-094144
11. Faldynova L, Walczyskova S, Cerna D. et al. Non-invasive prenatal testing (NIPT): combination of copy number variant and gene analyses using an “in-house” target enrichment next generation sequencing-solution for non-centralized NIPT laboratory? Prenat Diagn 2023;43:1320–32. doi:10.1002/pd.6421
12. Kayser M, Branicki W, Parson W. et al. Recent advances in forensic DNA phenotyping of appearance, ancestry and age. Forensic Sci Int Genet 2023;65:102870. doi:10.1016/j.fsigen.2023.102870
13. Dou Y, Kwon M, Rodin RE. et al. Accurate detection of mosaic variants in sequencing data without matched controls. Nat Biotechnol 2020;38:314–9. doi:10.1038/s41587-019-0368-8
14. Razavi P, Li BT, Brown DN. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 2019;25:1928–37. doi:10.1038/s41591-019-0652-7
15. Christensen MH, Drue SO, Rasmussen MH. et al. DREAMS: deep read-level error model for sequencing data applied to low-frequency variant calling and circulating tumor DNA detection. Genome Biol 2023;24:99. doi:10.1186/s13059-023-02920-1
16. Kinde I, Wu J, Papadopoulos N. et al. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci U S A 2011;108:9530–5. doi:10.1073/pnas.1105422108
17. Schmitt MW, Kennedy SR, Salk JJ. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 2012;109:14508–13. doi:10.1073/pnas.1208715109
18. Ren Y, Zhang Y, Wang D. et al. SinoDuplex: an improved duplex sequencing approach to detect low-frequency variants in plasma cfDNA samples. Genomics Proteomics Bioinf 2020;18:81–90. doi:10.1016/j.gpb.2020.02.003
19. Newman AM, Lovejoy AF, Klass DM. et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat Biotechnol 2016;34:547–55. doi:10.1038/nbt.3520
20. Springer S, Masica DL, Dal Molin M. et al. A multimodality test to guide the management of patients with a pancreatic cyst. Sci Transl Med 2019;11:11. doi:10.1126/scitranslmed.aav4772
21. Salazar R, Arbeithuber B, Ivankovic M. et al. Discovery of an unusually high number of de novo mutations in sperm of older men using duplex sequencing. Genome Res 2022;32:499–511. doi:10.1101/gr.275695.121
22. Axelsson J, LeBlanc D, Shojaeisaadi H. et al. Frequency and spectrum of mutations in human sperm measured using duplex sequencing correlate with trio-based de novo mutation analyses. Sci Rep 2024;14:23134. doi:10.1038/s41598-024-73587-2
23. Pilheden M, Ahlgren L, Hyrenius-Wittsten A. et al. Duplex sequencing uncovers recurrent low-frequency cancer-associated mutations in infant and childhood KMT2A-rearranged acute leukemia. Hemasphere 2022;6:e785. doi:10.1097/HS9.0000000000000785
24. Dodge AE, LeBlanc DPM, Zhou G. et al. Duplex sequencing provides detailed characterization of mutation frequencies and spectra in the bone marrow of MutaMouse males exposed to procarbazine hydrochloride. Arch Toxicol 2023;97:2245–59. doi:10.1007/s00204-023-03527-y
25. Clement K, Farouni R, Bauer DE. et al. AmpUMI: design and analysis of unique molecular identifiers for deep amplicon sequencing. Bioinformatics 2018;34:i202–10. doi:10.1093/bioinformatics/bty264
26. Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res 2017;27:491–9. doi:10.1101/gr.209601.116
27. Peng X, Dorman KS. Accurate estimation of molecular counts from amplicon sequence data with unique molecular identifiers. Bioinformatics 2023;39:btad002. doi:10.1093/bioinformatics/btad002
28. Stoler N, Arbeithuber B, Guiblet W. et al. Streamlined analysis of duplex sequencing data with Du novo. Genome Biol 2016;17:180–90. doi:10.1186/s13059-016-1039-4
29. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. doi:10.1093/bioinformatics/bts565
30. Chong Z, Ruan J, Wu CI. Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 2012;28:2732–7. doi:10.1093/bioinformatics/bts482
31. Orabi B, Erhan E, McConeghy B. et al. Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics 2019;35:1829–36. doi:10.1093/bioinformatics/bty888
32. Peng X, Dorman KS. AmpliCI: a high-resolution model-based approach for denoising Illumina amplicon data. Bioinformatics 2021;36:5151–8. doi:10.1093/bioinformatics/btaa648
33. Callahan BJ, McMurdie PJ, Rosen MJ. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3. doi:10.1038/nmeth.3869
34. Sater V, Viailly PJ, Lecroq T. et al. UMI-gen: a UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries. Comput Struct Biotechnol J 2020;18:2270–80. doi:10.1016/j.csbj.2020.08.011
35. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164. doi:10.1093/nar/gkq603
36. Zorita E, Cusco P, Filion GJ. Starcode: sequence clustering based on all-pairs search. Bioinformatics 2015;31:1913–9. doi:10.1093/bioinformatics/btv053
37. Liu D. Algorithms for efficiently collapsing reads with unique molecular identifiers. PeerJ 2019;7:e8275. doi:10.7717/peerj.8275
38. Pel J, Choi WWY, Leung A. et al. Duplex proximity sequencing (pro-seq): a method to improve DNA sequencing accuracy without the cost of molecular barcoding redundancy. PLoS One 2018;13:e0204265. doi:10.1371/journal.pone.0204265
39. Stoler N, Arbeithuber B, Povysil G. et al. Family Reunion via error correction: an efficient analysis of duplex sequencing data. BMC Bioinf 2020;21:1–10. doi:10.1186/s12859-020-3419-8
40. Bae JH, Liu R, Roberts E. et al. Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nat Genet 2023;55:871–9. doi:10.1038/s41588-023-01376-0
41. Chen L, Liu P, Evans TC Jr. et al. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 2017;355:752–6. doi:10.1126/science.aai8690
42. Costello M, Pugh TJ, Fennell TJ. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res 2013;41:e67. doi:10.1093/nar/gks1443
43. Chen G, Mosier S, Gocke CD. et al. Cytosine deamination is a major cause of baseline noise in next-generation sequencing. Mol Diagn Ther 2014;18:587–93. doi:10.1007/s40291-014-0115-2
44. Cohen JD, Douville C, Dudley JC. et al. Detection of low-frequency DNA variants by targeted sequencing of the Watson and Crick strands. Nat Biotechnol 2021;39:1220–7. doi:10.1038/s41587-021-00900-z
45. Wang TT, Abelson S, Zou J. et al. High efficiency error suppression for accurate detection of low-frequency variants. Nucleic Acids Res 2019;47:e87. doi:10.1093/nar/gkz474
46. Chen H, Yu F, Lu D. et al. Enhanced error suppression for accurate detection of low-frequency variants. Electrophoresis 2025;46:65–75. doi:10.1002/elps.202400202
47. Ma X, Shao Y, Tian L. et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol 2019;20:1–15. doi:10.1186/s13059-019-1659-6
48. Shugay M, Zaretsky AR, Shagin DA. et al. MAGERI: computational pipeline for molecular-barcoded targeted resequencing. PLoS Comput Biol 2017;13:e1005480. doi:10.1371/journal.pcbi.1005480
49. Xu C, Gu X, Padmanabhan R. et al. smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers. Bioinformatics 2019;35:1299–309. doi:10.1093/bioinformatics/bty790

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental_materials_2_bbaf483

supplemental_materials_2_bbaf483.docx^{(1.3MB, docx)}

Supplemental_materials_1_bbaf483

supplemental_materials_1_bbaf483.xlsx^{(53.6KB, xlsx)}

Data Availability Statement

[ref1] 1. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17:333–51. 10.1038/nrg.2016.49 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] 2. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet 2018;19:269–85. 10.1038/nrg.2017.117 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref3] 3. Deveson IW, Gong B, Lai K. et al. Evaluating the analytical validity of circulating tumor DNA sequencing assays for precision oncology. Nat Biotechnol 2021;39:1115–28. 10.1038/s41587-021-00857-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4. Nagasaka M, Uddin MH, Al-Hallak MN. et al. Liquid biopsy for therapy monitoring in early-stage non-small cell lung cancer. Mol Cancer 2021;20:82–98. 10.1186/s12943-021-01371-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. Filipska M, Rosell R. Mutated circulating tumor DNA as a liquid biopsy in lung cancer detection and treatment. Mol Oncol 2021;15:1667–82. 10.1002/1878-0261.12983 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] 6. Streck NT, Espy MJ, Ferber MJ. et al. Use of next-generation sequencing to detect mutations associated with antiviral drug resistance in cytomegalovirus. J Clin Microbiol 2023;61:e0042923. 10.1128/jcm.00429-23 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7. Oellerich M, Sherwood K, Keown P. et al. Liquid biopsies: donor-derived cell-free DNA for the detection of kidney allograft injury. Nat Rev Nephrol 2021;17:591–603. 10.1038/s41581-021-00428-0 [DOI] [PubMed] [Google Scholar]

[ref8] 8. Khush KK. Clinical utility of donor-derived cell-free DNA testing in cardiac transplantation. J Heart Lung Transplant 2021;40:397–404. 10.1016/j.healun.2021.01.1564 [DOI] [PubMed] [Google Scholar]

[ref9] 9. Li M, Song C, Hu J. et al. Impact of pretreatment low-abundance HIV-1 drug resistance on virological failure after 1 year of antiretroviral therapy in China. J Antimicrob Chemother 2023;78:2743–51. 10.1093/jac/dkad297 [DOI] [PubMed] [Google Scholar]

[ref10] 10. Moufarrej MN, Bianchi DW, Shaw GM. et al. Noninvasive prenatal testing using circulating DNA and RNA: advances, challenges, and possibilities. Annu Rev Biomed Data Sci 2023;6:397–418. 10.1146/annurev-biodatasci-020722-094144 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11. Faldynova L, Walczyskova S, Cerna D. et al. Non-invasive prenatal testing (NIPT): combination of copy number variant and gene analyses using an “in-house” target enrichment next generation sequencing-solution for non-centralized NIPT laboratory? Prenat Diagn 2023;43:1320–32. 10.1002/pd.6421 [DOI] [PubMed] [Google Scholar]

[ref12] 12. Kayser M, Branicki W, Parson W. et al. Recent advances in forensic DNA phenotyping of appearance, ancestry and age. Forensic Sci Int Genet 2023;65:102870. 10.1016/j.fsigen.2023.102870 [DOI] [PubMed] [Google Scholar]

[ref13] 13. Dou Y, Kwon M, Rodin RE. et al. Accurate detection of mosaic variants in sequencing data without matched controls. Nat Biotechnol 2020;38:314–9. 10.1038/s41587-019-0368-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Razavi P, Li BT, Brown DN. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 2019;25:1928–37. 10.1038/s41591-019-0652-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] 15. Christensen MH, Drue SO, Rasmussen MH. et al. DREAMS: deep read-level error model for sequencing data applied to low-frequency variant calling and circulating tumor DNA detection. Genome Biol 2023;24:99. 10.1186/s13059-023-02920-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref16] 16. Kinde I, Wu J, Papadopoulos N. et al. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci U S A 2011;108:9530–5. 10.1073/pnas.1105422108 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. Schmitt MW, Kennedy SR, Salk JJ. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 2012;109:14508–13. 10.1073/pnas.1208715109 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref18] 18. Ren Y, Zhang Y, Wang D. et al. SinoDuplex: an improved duplex sequencing approach to detect low-frequency variants in plasma cfDNA samples. Genomics Proteomics Bioinf 2020;18:81–90. 10.1016/j.gpb.2020.02.003 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] 19. Newman AM, Lovejoy AF, Klass DM. et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat Biotechnol 2016;34:547–55. 10.1038/nbt.3520 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20. Springer S, Masica DL, Dal Molin M. et al. A multimodality test to guide the management of patients with a pancreatic cyst. Sci Transl Med 2019;11:11. 10.1126/scitranslmed.aav4772 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] 21. Salazar R, Arbeithuber B, Ivankovic M. et al. Discovery of an unusually high number of de novo mutations in sperm of older men using duplex sequencing. Genome Res 2022;32:499–511. 10.1101/gr.275695.121 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] 22. Axelsson J, LeBlanc D, Shojaeisaadi H. et al. Frequency and spectrum of mutations in human sperm measured using duplex sequencing correlate with trio-based de novo mutation analyses. Sci Rep 2024;14:23134. 10.1038/s41598-024-73587-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref23] 23. Pilheden M, Ahlgren L, Hyrenius-Wittsten A. et al. Duplex sequencing uncovers recurrent low-frequency cancer-associated mutations in infant and childhood KMT2A-rearranged acute leukemia. Hemasphere. 2022;6:e785. 10.1097/HS9.0000000000000785 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref24] 24. Dodge AE, LeBlanc DPM, Zhou G. et al. Duplex sequencing provides detailed characterization of mutation frequencies and spectra in the bone marrow of MutaMouse males exposed to procarbazine hydrochloride. Arch Toxicol 2023;97:2245–59. 10.1007/s00204-023-03527-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] 25. Clement K, Farouni R, Bauer DE. et al. AmpUMI: design and analysis of unique molecular identifiers for deep amplicon sequencing. Bioinformatics. 2018;34:i202–10. 10.1093/bioinformatics/bty264 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref26] 26. Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res 2017;27:491–9. 10.1101/gr.209601.116

[ref27] 27. Peng X, Dorman KS. Accurate estimation of molecular counts from amplicon sequence data with unique molecular identifiers. Bioinformatics 2023;39:btad002. 10.1093/bioinformatics/btad002

[ref28] 28. Stoler N, Arbeithuber B, Guiblet W. et al. Streamlined analysis of duplex sequencing data with Du novo. Genome Biol 2016;17:180–90. 10.1186/s13059-016-1039-4

[ref29] 29. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. 10.1093/bioinformatics/bts565

[ref30] 30. Chong Z, Ruan J, Wu CI. Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads. Bioinformatics 2012;28:2732–7. 10.1093/bioinformatics/bts482

[ref31] 31. Orabi B, Erhan E, McConeghy B. et al. Alignment-free clustering of UMI tagged DNA molecules. Bioinformatics 2019;35:1829–36. 10.1093/bioinformatics/bty888

[ref32] 32. Peng X, Dorman KS. AmpliCI: a high-resolution model-based approach for denoising Illumina amplicon data. Bioinformatics 2021;36:5151–8. 10.1093/bioinformatics/btaa648

[ref33] 33. Callahan BJ, McMurdie PJ, Rosen MJ. et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581–3. 10.1038/nmeth.3869

[ref34] 34. Sater V, Viailly PJ, Lecroq T. et al. UMI-gen: a UMI-based read simulator for variant calling evaluation in paired-end sequencing NGS libraries. Comput Struct Biotechnol J 2020;18:2270–80. 10.1016/j.csbj.2020.08.011

[ref35] 35. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38:e164. 10.1093/nar/gkq603

[ref36] 36. Zorita E, Cusco P, Filion GJ. Starcode: sequence clustering based on all-pairs search. Bioinformatics 2015;31:1913–9. 10.1093/bioinformatics/btv053

[ref37] 37. Liu D. Algorithms for efficiently collapsing reads with unique molecular identifiers. PeerJ 2019;7:e8275. 10.7717/peerj.8275

[ref38] 38. Pel J, Choi WWY, Leung A. et al. Duplex proximity sequencing (pro-seq): a method to improve DNA sequencing accuracy without the cost of molecular barcoding redundancy. PLoS One 2018;13:e0204265. 10.1371/journal.pone.0204265

[ref39] 39. Stoler N, Arbeithuber B, Povysil G. et al. Family Reunion via error correction: an efficient analysis of duplex sequencing data. BMC Bioinf 2020;21:1–10. 10.1186/s12859-020-3419-8

[ref40] 40. Bae JH, Liu R, Roberts E. et al. Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nat Genet 2023;55:871–9. 10.1038/s41588-023-01376-0

[ref41] 41. Chen L, Liu P, Evans TC Jr. et al. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 2017;355:752–6. 10.1126/science.aai8690

[ref42] 42. Costello M, Pugh TJ, Fennell TJ. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res 2013;41:e67. 10.1093/nar/gks1443

[ref43] 43. Chen G, Mosier S, Gocke CD. et al. Cytosine deamination is a major cause of baseline noise in next-generation sequencing. Mol Diagn Ther 2014;18:587–93. 10.1007/s40291-014-0115-2

[ref44] 44. Cohen JD, Douville C, Dudley JC. et al. Detection of low-frequency DNA variants by targeted sequencing of the Watson and Crick strands. Nat Biotechnol 2021;39:1220–7. 10.1038/s41587-021-00900-z

[ref45] 45. Wang TT, Abelson S, Zou J. et al. High efficiency error suppression for accurate detection of low-frequency variants. Nucleic Acids Res 2019;47:e87. 10.1093/nar/gkz474

[ref46] 46. Chen H, Yu F, Lu D. et al. Enhanced error suppression for accurate detection of low-frequency variants. Electrophoresis 2025;46:65–75. 10.1002/elps.202400202

[ref47] 47. Ma X, Shao Y, Tian L. et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol 2019;20:1–15. 10.1186/s13059-019-1659-6

[ref48] 48. Shugay M, Zaretsky AR, Shagin DA. et al. MAGERI: computational pipeline for molecular-barcoded targeted resequencing. PLoS Comput Biol 2017;13:e1005480. 10.1371/journal.pcbi.1005480

[ref49] 49. Xu C, Gu X, Padmanabhan R. et al. smCounter2: an accurate low-frequency variant caller for targeted sequencing data with unique molecular identifiers. Bioinformatics 2019;35:1299–309. 10.1093/bioinformatics/bty790
