This is a preprint.

It has not yet been peer reviewed by a journal.

bioRxiv [Preprint]. 2025 Sep 16:2025.09.14.676103. [Version 1] doi: 10.1101/2025.09.14.676103

A Universal Duplex Sequencing Approach for Accurate Detection of Somatic Mutations

Shuvro P Nandi 1,2,3,^, Yuhe Cheng 1,2,3,^, Shams Al-Azzam 1,2,3,4, Safa Saeed 1,2,3, Audrey Kristin 1,2,3, Nadia Sunico 1,2,3, Isabella R Stuewe 1,2,3, Zichen Jiang 1,2,3, Luka Culibrk 5,6, Maria Zhivagui 7, Xiaoxu Yang 8,9,10, Rachel M Wise 11, Foster C Jacobs 12, Bérénice Chavanel 13, Michael Korenjak 13, Mia Petljak 5,6, Silvia Balbo 12, Laurie G Hudson 11, Ke Jian Liu 14,15, Jiri Zavadil 13, Joseph G Gleeson 8,9, Ludmil B Alexandrov 1,2,3,16,*
PMCID: PMC12458366  PMID: 41000705

Abstract

Ultra-accurate detection of rare somatic mutations is critical for understanding mutational processes in human disease, aging, and environmental exposures, yet current methods are limited by error rates, restricted genome coverage, and high DNA input. We present UDSeq, a duplex sequencing protocol combining random fragmentation, efficient UMI ligation, and quantitative input control to achieve near-complete genome/exome representation from as little as 100 pg DNA. Benchmarking in human sperm estimates a UDSeq error rate of ~2.5×10−9 per base pair. UDSeq captures mutational signatures from heterogeneous populations without clonal expansion, reproduces exposure-specific patterns in cell lines and rodent models, and enables cross-species profiling. Compared with prior duplex methods, UDSeq yields up to fourfold more usable duplex molecules, improves library conversion, and remains cost-effective. We include a step-by-step protocol with quality-control checkpoints for fragment size, ligation yield, library conversion, and duplication rate. UDSeq provides a scalable, low-input platform for accurate profiling of somatic mutagenesis.

INTRODUCTION

Somatic mutations, which can be caused by both endogenous and exogenous mutagenic processes, are present in all human cells1. These mutations accumulate gradually over time and often go unnoticed, as most have minimal or no effect on cellular function13. However, certain mutations can disrupt key biological processes4, lead to cell death5, or confer a selective growth advantage, resulting in clonal expansion6,7. Cancer is the most well-known example of a disease driven by somatic mutations, where specific alterations initiate tumorigenesis8, promote progression9, and can confer treatment resistance10. Beyond cancer, somatic mutations are increasingly recognized as contributors to other diseases11, including neurodegenerative disorders12 and cardiovascular conditions13. However, their roles in these areas remain much less studied, as detecting low-frequency mutations in non-cancer tissues presents significant technical challenges14,15.

From a technical standpoint, the study of somatic mutations in cancer has been greatly facilitated by the fact that most tumors originate from a single mutated cell, whose proliferation produces thousands of descendants carrying the same mutations as the progenitor8. Although additional mutations may emerge during clonal expansion, the original mutations remain uniformly present across the tumor tissue9. This clonal nature enables reliable detection of tumor mutations with conventional sequencing, despite its inherent error rates ranging from one error per 1,000 base pairs (bp; i.e., 10−3 per bp) to six errors per 1,000 bp (i.e., 6 × 10−3 per bp)16. Because these shared mutations appear in every cancer cell, repeated resequencing generates a strong, reproducible signal that stands out from the random noise of sequencing errors. In contrast, detecting somatic mutations in most non-cancer tissues is far more challenging, as each cell typically harbors a unique set of mutations17,18. As such, accurately studying somatic mutations in non-cancerous healthy or diseased tissues requires methods with error rates below 1 × 10−8 errors per bp (less than one error per 100 million sequenced bp)18.

The need for sequencing protocols with low error rates is further evident in efforts to evaluate the mutagenic potential of known carcinogens in experimental systems. For example, in vitro studies have required intricate experimental setups, where cells are exposed to a potential mutagenic carcinogen, followed by isolation and clonal expansion of single cells from the exposed population before sequencing19,20. The clonal amplification step, though labor-intensive, is crucial for accurately detecting mutations, as it ensures that mutations present in the progeny of a single cell can be distinguished from background noise19,20. Without this step, bulk sequencing cannot distinguish low-frequency mutations from background noise, since each cell harbors a distinct set of somatic mutations.

To overcome these limitations, several approaches have been developed to detect rare somatic mutations. Single-cell DNA sequencing21–25 and single-cell clonal expansion26–28 can, in principle, resolve mutations at the level of individual cells26,28. However, these methods are labor-intensive, costly, and require sequencing large numbers of cells to obtain a representative view of tissue-wide mutagenesis. Additionally, amplification artifacts29 and allelic dropout30 can compromise accuracy for detecting rare mutations. As a scalable alternative, duplex sequencing has emerged as a powerful method for profiling somatic mutations at extremely low frequencies31. By exploiting DNA’s double-stranded nature, duplex sequencing independently evaluates each strand32,33 and confirms mutations only when they are independently detected on both strands, greatly reducing error rates and enabling confident identification of rare somatic mutations.

Advances in duplex sequencing have improved the detection of rare somatic mutations, but no current method offers universal, single-molecule resolution in a cost-effective format that supports both whole-genome profiling and targeted capture from limited input DNA across species. An ideal approach would achieve an error rate below 10−8, offer full genome compatibility for profiling human tissues, human and non-human model systems, and non-model organisms, while requiring minimal DNA input, supporting targeted enrichment, and remaining cost-effective. To address this need, we developed Universal Duplex Sequencing (UDSeq), a novel single-molecule duplex sequencing protocol for rapid, accurate detection of rare somatic mutations across diverse biological systems. We put UDSeq in the context of existing error-corrected sequencing methods, demonstrating superior performance and broad applicability. To showcase its versatility, we applied UDSeq to samples from humans, mice, rats, chickens, and sheep, successfully detecting mutations in whole genomes and targeted regions derived from cell lines and multiple tissue types.

RESULTS

Overview of Existing Error-Corrected Sequencing Methods

Over the past decade, duplex sequencing has revolutionized the detection of rare somatic mutations18. In this approach, a ‘duplex consensus’ is generated by independently sequencing both the Watson and Crick strands of the same DNA molecule and typically confirming mutations only if present on both strands. Most protocols use unique molecular identifiers (UMIs) and exploit DNA strand complementarity to perform this process33. To our knowledge, the first method of this kind was introduced in 2012, enabling sequencing of small genomic panels (generally under 1 megabase) with error rates of approximately 10−7 errors per bp32. However, this original DupSeq method had low efficiency in generating duplex consensuses32, which limited its practical application by necessitating large amounts of input DNA and extensive sequencing. Subsequently, Hoang et al. developed BotSeqS to address this challenge, introducing a dilution step immediately before library amplification34. This dilution step creates a bottleneck, enabling efficient random sampling of double-stranded template molecules and substantially reducing the required amount of sequencing. Notably, BotSeqS could be applied to input DNA amounts as low as 50 nanograms (ng). Despite this improvement, the error rate of BotSeqS was similar to that of DupSeq, with independent analysis estimating it at ~2 × 10−7 errors per bp35.
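The duplex consensus idea described above can be illustrated with a minimal Python sketch. This is not the UDSeq pipeline or any published caller: the function names, the `min_reads` and `min_fraction` thresholds, and the assumption that reads sharing a UMI are already grouped by strand, aligned, equal in length, and reported in reference orientation are all hypothetical simplifications.

```python
from collections import Counter

def strand_consensus(reads, min_reads=2, min_fraction=0.9):
    """Collapse PCR copies of ONE strand of one molecule into a consensus.

    Returns None when the strand has too few reads; positions without a
    clear majority base are masked as 'N'. Thresholds are illustrative.
    """
    if len(reads) < min_reads:
        return None
    consensus = []
    for column in zip(*reads):  # iterate position by position
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)

def duplex_consensus(top_reads, bottom_reads, **thresholds):
    """Keep a base only if both strand consensuses agree on it."""
    top = strand_consensus(top_reads, **thresholds)
    bottom = strand_consensus(bottom_reads, **thresholds)
    if top is None or bottom is None:
        return None  # no duplex support for this molecule
    return "".join(t if t == b and t != "N" else "N" for t, b in zip(top, bottom))

# A change seen on only one strand (e.g. a damage or PCR artifact) is masked,
# whereas a change present on both strands is retained.
single_strand_artifact = duplex_consensus(["ACGTA", "ACGTA"], ["ACTTA", "ACTTA"])
true_mutation = duplex_consensus(["ACTTA", "ACTTA"], ["ACTTA", "ACTTA"])
```

In this toy case the single-strand change at the third base is masked as `N`, while the change supported by both strands survives; real pipelines additionally handle base qualities, alignment, and read trimming.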

To further enhance BotSeqS and reduce error rates, NanoSeqV1 replaced sonication and end repair during library preparation with restriction enzyme-based fragmentation using HpyCH4V35. This innovation reduced the error rate to approximately 5 × 10−9 errors per base pair and allowed NanoSeqV1 to be applied to input DNA amounts as low as 50 ng. Nonetheless, it limited coverage to only 30% of the human genome and restricted its applicability to other genomes due to the specificity of the restriction enzyme35. More recently, the CODEC (Concatenating Original Duplex for Error Correction) method employed specially designed quadruplex adaptors to physically link the Watson and Crick strands into a single-duplex molecule, enabling sequencing on a standard Illumina short-read platform36. The original CODEC method achieved error rates of approximately 10−7 errors per bp, comparable to those of DupSeq and BotSeqS, while offering greater cost-effectiveness compared to DupSeq36. The original CODEC method could also be applied to input DNA amounts as low as 2.5 ng. A modified version of the CODEC protocol, incorporating fragmentation steps similar to those used by NanoSeqV1, reduced the error rate to approximately 10−8 errors per bp. However, this modified version inherited the same limitations as NanoSeqV1, including coverage restricted to only 30% of the human genome36. Finally, to overcome the partial genome coverage of the original NanoSeq, a second version, NanoSeqV2, was developed using an alternative genome fragmentation strategy. Nonetheless, its low library conversion efficiency necessitates a substantially larger amount of input DNA, which may restrict its broader applicability37.

The previously described methods relied entirely on short-read duplex sequencing. In contrast, a recently developed long-read sequencing technique, HiDEF-seq (Hairpin Duplex Enhanced Fidelity sequencing), leveraged the PacBio platform to achieve single-molecule fidelity38. HiDEF-seq utilized the inherent single-molecule nature of PacBio’s technology39 to achieve high accuracy, performing 5 to 20 sequencing passes per strand with estimated error rates below 10−9 errors per bp38. Notably, HiDEF-seq can resolve some single-strand mismatches, a capability not achievable with other duplex sequencing methods. However, it required a high input of DNA and incurred higher costs due to the expense of PacBio long-read sequencing compared to short-read technologies. Specifically, it needed at least 500 ng of high-quality DNA or 1,500 ng of degraded DNA to achieve 40% genome coverage, with even larger amounts required for complete genome sequencing38.

Each of the previously discussed approaches—DupSeq, BotSeqS, NanoSeq, CODEC, and HiDEF-seq—represents a significant advancement in the detection of rare somatic mutations, introducing innovative methodologies, enhanced efficiency, and reduced error rates (Table 1). However, each method also comes with its own set of limitations, including challenges related to efficiency, error rates, genome coverage, DNA input requirements, or cost. To overcome many of these limitations, we present UDSeq, a single-molecule duplex sequencing protocol optimized for rapid, accurate detection of rare somatic mutations (Table 1), with the complete protocol provided in Supplementary Note 1.

Table 1: Comparative overview of DupSeq, BotSeqS, NanoSeq, HiDEF-seq, CODEC, and UDSeq.

| Sequencing protocol | Primary innovation | Genome coverage | Minimum input DNA | Approx. error rate (per bp) | Key limitations |
| --- | --- | --- | --- | --- | --- |
| DupSeq 32 | Independent tagging and sequencing of both DNA strands to enable duplex consensus mutation calling | Targeted panels | ~1 μg | 2 × 10−7 | Requires high DNA input; generally limited to defined panels; labor-intensive workflow; high error rate when compared to other protocols |
| BotSeqS 34 | Bottleneck dilution to enrich randomly sampled duplex DNA molecules for consensus mutation calling | Genome-wide or targeted | 50 ng | 2 × 10−7 | High error rate when compared to other protocols; low efficiency |
| NanoSeqV1 35 | Removes end-repair–associated errors via restriction-enzyme fragmentation | ~30% of genome | 50 ng | 5 × 10−9 | Restriction sites limit coverage and species portability; needs redesign for new targets |
| NanoSeqV2 37 | Alternative genome fragmentation methods that provide full genome coverage whilst retaining the original error rates | Genome-wide or targeted | 30 ng | 5 × 10−9 | Low efficiency; still unpublished but available as preprint |
| HiDEF-seq 38 | High-fidelity duplex consensus from PacBio long-read sequencing with detection of single-strand mismatches | ~40% of genome | 500 ng – 1.5 μg | 4 × 10−9 | Complex workflow; elevated sequencing depth required; partial genome coverage; high cost |
| CODEC 36 | Custom quadruplex adapters physically linking Watson & Crick strands | Genome-wide or targeted | 2.5 ng | 1 × 10−7 | Requires specialized reagents, sequencing customization, and intensive library preparation; higher error rate than other protocols |
| CODEC-HpyCH4V 36 | Restriction enzyme–based CODEC enabling targeted duplex capture | ~30% of genome | 50 ng | 3 × 10−8 | Requires specialized reagents, sequencing customization, and intensive library preparation; higher error rate than other protocols |
| UDSeq | Near-complete genome coverage with ultra-low input via random fragmentation and efficient duplex consensus | Genome-wide or targeted | 100 pg | 2.5 × 10−9 | Similar to all other duplex methods, does not capture large structural variants or copy-number changes |

Innovation Over Prior Protocols

To develop the UDSeq protocol, we built upon the advances of NanoSeqV1 over BotSeqS and introduced targeted innovations that overcome key limitations in existing duplex sequencing methods. Each improvement in the protocol was designed not only to enhance performance but also to expand the scope, versatility, and practicality of single-molecule mutation detection (Figure 1a; Supplementary Figure 1a).

Figure 1: Overview and validation of UDSeq for accurate detection of somatic mutations.

(a) High-level workflow illustrating the versatility of UDSeq across whole-genome and targeted sequencing approaches. (b) Comparison of error rates among UDSeq, other duplex sequencing methods, and germline de novo mutation (DNM) studies. UDSeq and other duplex approaches were applied to human sperm samples, whereas the DNM studies analyzed germline data from trios. Scatter plot shows the relationship between paternal age and the number of single base substitutions (SBSs) per haploid sperm genome across different sequencing approaches. Data points represent individual samples analyzed by UDSeq (n=8), HiDEF-Seq (n=5), and NanoSeqV1 (n=7), as well as parental DNM estimates from Jónsson (n=1,548) and Halldorsson (n=2,963). Regression lines with corresponding equations and coefficients of determination (R2) are shown for each dataset: error-corrected sequencing (UDSeq, HiDEF-Seq, NanoSeqV1) and parental DNM studies. (c) Left panel shows estimated slopes for the number of SBS accumulated per haploid sperm genome per year, and the right panel shows estimated y-intercepts representing the predicted number of SBS present at birth. Both values are derived from the regression analyses in panel b. Error bars indicate standard error of the estimate. (d) Top panel shows the SBS-96 mutational profile from sperm samples analyzed by UDSeq (n=8), and the bottom panel shows the SBS-96 profile from parental DNMs. The SBS-96 profile encompasses all single-base substitutions (C>A, C>G, C>T, T>A, T>C, and T>G) and their immediate trinucleotide sequence context. The two profiles have a cosine similarity of 0.92. Relative contributions of the aging-associated signatures SBS1 and SBS5 are shown on the right for each profile.

First, we replaced sonication- and HpyCH4V-based fragmentation with random fragmentation using either NEBNext dsDNA Fragmentase (M0348L) or UltraShear (M7634L). Both methods enable unbiased fragmentation across the genome, allowing UDSeq to achieve near-complete coverage (≥95%; comparable to bulk sequencing) of the genome and exome—an advance over NanoSeqV1, which was limited to ~30% of the genome (Supplementary Figure 1b), and comparable to NanoSeqV2, which also provides near-complete coverage37. dsDNA Fragmentase is more cost-effective and widely accessible, though it produces short overhangs that require additional trimming during bioinformatics processing. In contrast, UltraShear generates highly uniform fragment sizes without overhangs, reducing computational preprocessing steps—but at higher reagent cost. This flexibility allows users to balance performance, cost, and bioinformatics complexity based on experimental needs.

Second, we adopted the xGen cfDNA & FFPE DNA Kit (IDT) for ligation of UMIs. This kit provides high ligation efficiency even with minimal DNA input, enabling accurate duplex sequencing from as little as 0.1 ng (100 picograms) of starting material. Compared to NanoSeqV2, it delivers a substantial improvement in library conversion efficiency, yielding up to four times more femtomoles of usable library from the same input DNA (p=0.00022; Supplementary Figure 1c). This enhancement reduces input requirements and broadens applicability to samples with limited DNA, such as clinical biopsies or environmental isolates.

Third, we incorporated accurate quantification of UMI-ligated molecules using the NEBNext Library Quant Kit for Illumina and iTaq Universal SYBR Green Supermix on a Bio-Rad real-time PCR system. This ensures precise input into PCR amplification, reducing over-amplification artifacts and preserving single-molecule fidelity. PCR was then performed with UDI primers (IDT) and NEBNext® Ultra II Q5® Master Mix (M0544L), which maintains high fidelity during amplification.

Finally, by integrating random fragmentation with low-input, high-efficiency UMI ligation and precise quantification, UDSeq uniquely enables ultra-accurate, single-molecule somatic mutation detection across species, with support for whole-genome coverage or targeted panels, even from limited input material. Despite offering substantial advantages in sensitivity, flexibility, and scalability, the protocol remains cost-efficient, with a cost comparable to or lower than that of other duplex sequencing methods (Supplementary Figure 1d). In the sections that follow, we systematically evaluate the protocol’s error rate and demonstrate its applicability across a range of experimental settings. Together, these advances position UDSeq as a cost-effective, scalable platform for widespread genomic applications (Figure 1a; Supplementary Figure 1a).

Assessing and Comparing the Error Rate of UDSeq

To assess the error rate of UDSeq, we sequenced DNA extracted from sperm samples provided by eight males ranging in age from 19 to 70 years. We compared our results with two large-scale Icelandic population studies of trios (mother, father, and child), which estimated sperm mutation rates in fathers at different ages based on phasing de novo mutations (DNM) observed in the offspring40,41. These DNM trio studies reported sperm mutation rates of 1.54 and 1.40 single base substitutions (SBS) per year, respectively, with the number of SBS at birth (i.e., age zero) estimated at 4.96 and 6.58, respectively (Figure 1b,c). Consistent with the estimates from the DNM trio studies, the UDSeq data revealed that sperm accumulate 1.58 SBSs per year, with an estimated 12.60 mutations at age zero. From the difference between the age-zero estimates of UDSeq and the DNM trio studies, we estimated that the error rate of UDSeq is between 6 and 7.6 artifactual SBS per sequenced haploid sperm genome of approximately three billion base pairs. This corresponds to an error rate of approximately 2.5 × 10−9 errors per bp (Supplementary Figure 1d). Applying the same approach to previously sequenced sperm samples, we also estimated the error rates of NanoSeqV1 (n=7 sperm samples) and HiDEF-seq (n=5), which yielded error rates of about 4.8 × 10−9 and 4.3 × 10−9 errors per bp, respectively (Figure 1c). Although derived using a different approach, our estimated error rates closely match values reported in previous studies: for example, 4.8 × 10−9 in our analysis compared to 5 × 10−9 for NanoSeqV1 in its original publication35. Overall, given the limited number of sperm samples, the error rates of UDSeq, NanoSeq, and HiDEF-seq were effectively similar, with fewer than 5 errors per billion sequenced base pairs (i.e., <5 × 10−9 errors per bp; Figure 1c).
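The intercept-based estimate above reduces to simple arithmetic, sketched here in Python. The intercepts and the three-billion-bp haploid genome size are taken from the text; the function and variable names are illustrative, and the calculation assumes that all excess mutations at age zero are artifacts.

```python
HAPLOID_GENOME_BP = 3e9  # approximate haploid genome size quoted in the text

UDSEQ_INTERCEPT = 12.60         # SBS per haploid sperm genome at age zero (UDSeq)
TRIO_INTERCEPTS = (4.96, 6.58)  # age-zero estimates from the two DNM trio studies

def error_rate_per_bp(method_intercept, reference_intercept,
                      genome_bp=HAPLOID_GENOME_BP):
    """Treat the excess age-zero SBS count as artifacts and express it per bp."""
    excess_sbs = method_intercept - reference_intercept
    return excess_sbs / genome_bp

rates = [error_rate_per_bp(UDSEQ_INTERCEPT, trio) for trio in TRIO_INTERCEPTS]
for trio, rate in zip(TRIO_INTERCEPTS, rates):
    print(f"trio intercept {trio}: {rate:.2e} errors per bp")
```

The two reference intercepts give excesses of about 7.6 and 6.0 SBS, that is, roughly 2.5 × 10−9 and 2.0 × 10−9 errors per bp, which brackets the approximate figure quoted above.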
Lastly, as expected42,43, the mutational patterns observed in the eight sperm samples profiled by UDSeq exhibited the clock-like signatures SBS1 and SBS5 and closely resembled those of paternal de novo mutations41 (cosine similarity = 0.92; Figure 1d).
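The SBS-96 profiles referred to here classify each substitution by its pyrimidine-centered change (C>A, C>G, C>T, T>A, T>C, T>G) and its immediate 5′ and 3′ neighbors. A minimal Python sketch of that classification follows; the function names and the `A[C>T]G` label format are illustrative conventions, not part of the UDSeq protocol.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def sbs96_channel(trinucleotide, alt):
    """Label a substitution at the middle base of a trinucleotide, e.g. 'A[C>T]G'.

    Purine-reference events are reported on the opposite strand so that the
    mutated base is always C or T.
    """
    trinucleotide, alt = trinucleotide.upper(), alt.upper()
    if len(trinucleotide) != 3 or len(alt) != 1 or set(trinucleotide + alt) - set("ACGT"):
        raise ValueError("expected a 3-base context and a single alternate base")
    ref = trinucleotide[1]
    if ref == alt:
        raise ValueError("alternate base equals reference base")
    if ref in "AG":  # fold onto the pyrimidine strand
        trinucleotide = reverse_complement(trinucleotide)
        alt = alt.translate(COMPLEMENT)
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"

# 6 substitution types x 4 possible 5' bases x 4 possible 3' bases = 96 channels
ALL_CHANNELS = [
    f"{five}[{ref}>{alt}]{three}"
    for ref, alts in (("C", "AGT"), ("T", "ACG"))
    for alt in alts
    for five in "ACGT"
    for three in "ACGT"
]
```

For example, a G>A change in the context CGT is the same event as a C>T change in the context ACG on the opposite strand, so both map to the single channel `A[C>T]G`.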

In vitro Assessment of Mutagenesis

Traditional sequencing protocols for evaluating environmental carcinogen exposure in vitro generally require months of precise exposure, clonal expansion, and sequencing19,44,45 (Figure 2a), whereas UDSeq enables direct mutational detection in heterogeneous cell populations, significantly reducing the timeline (Figure 2b). To showcase the utility of the UDSeq protocol, we applied it to three human cell lines exposed to four environmental carcinogens with well-documented mutagenic properties. These in vitro experiments included: (i) HepG2 human liver cancer cell line, derived from a well-differentiated hepatocellular carcinoma, exposed to 4-Nitroquinoline 1-oxide (4NQO; 0.5 μM for 4 hours) and aristolochic acid-I (AA-I; 80 μM for 24 hours); (ii) immortalized normal oral keratinocytes (NOK) exposed to the tobacco-specific nitrosamine 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK)46; and (iii) N/TERT-1 keratinocytes exposed to solar-simulated ultraviolet-light radiation (ssUVR; 3 kJ/m2)47. A heterogeneous population of cells was cultured for a single passage following exposure, after which DNA was extracted and subjected to UDSeq at whole-genome resolution (Figure 2b). The resulting mutational profiles from this duplex sequencing approach were then compared to those obtained from clonally expanded cells that were exposed to the same carcinogens (Figure 2b).

Figure 2: UDSeq enables rapid, ultra-low-input, and versatile assessment of in vitro mutational profiles.

(a) Schematic workflow for in vitro mutagenesis assessment using single-cell clonal expansion followed by bulk sequencing, a process requiring 60 days or more. (b) Alternative workflow using UDSeq without clonal expansion, enabling mutagenesis assessment in as little as 15 days. (c) SBS-96 mutational profiles of environmental mutagens. Each SBS-96 profile represents all single-base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) within their trinucleotide sequence context. Left: UDSeq-derived SBS-96 mutational profiles from human cell lines exposed to mutagens. Right: SBS-96 derived mutational profiles from human cell lines exposed to mutagens using single-cell clonal expansion followed by bulk sequencing. Cosine similarities (cs) between UDSeq-derived and bulk sequencing–derived profiles are shown between the panels. Mutagen names are indicated in the top left corner, and the corresponding cell line with replicate number is shown in the top right corner of each panel. (d) SBS-96 mutational profiles generated from as little as 100 picograms of DNA using UDSeq. SBS-96 mutational profiles from environmental mutagen exposures are shown with cell line names and replicate numbers indicated as in (c). (e) Demonstration of UDSeq’s versatility across exome and targeted sequencing. SBS-96 mutational profiles from environmental mutagen exposures are shown for both exome and targeted panels, with cell line names and replicate numbers indicated as in (c). Abbreviations: 4NQO, 4-Nitroquinoline 1-oxide; AA-I, aristolochic acid I; BEAS-2B, immortalized human bronchial epithelial cell line; HFF, human foreskin fibroblast; HepG2, human liver cancer cell line derived from hepatocellular carcinoma; IC50, half-maximal inhibitory concentration; iPSC, induced pluripotent stem cells; NNK, 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone; NOK, oral and epidermal keratinocytes; UVR, ultraviolet radiation.

The mutational profile induced by AA-I closely matched the COSMIC signature SBS22a, consistent with SBS22a’s AA-I etiology as established from human cancer samples48 and in vitro models49. The profile also showed strong concordance with bulk sequencing data derived from clonally expanded AA-I-exposed cells (cosine similarity = 0.98; Figure 2c). Similarly, the pattern of ssUVR matched the known COSMIC experimental mutational signature for solar-simulated radiation47, while the pattern of 4NQO was identical to that observed in clonally expanded human cells exposed to 4NQO (Figure 2c). Lastly, for NNK acetate, the mutational profile was also nearly identical to that found in clonally expanded human lung cells exposed to the same compound50. Overall, these results confirm that UDSeq can replicate mutational patterns observed in clonally expanded cells, without the time or resource burden of extended culture.
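Concordance between two mutational profiles is summarized throughout as a cosine similarity. A minimal Python sketch of that metric follows, assuming each profile is a vector of 96 counts or proportions in the same channel order; the function name and the toy three-channel vectors are illustrative only.

```python
import math

def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two mutational profiles.

    Both profiles must list the same channels in the same order. The result
    is insensitive to total mutation count, so raw counts and proportions
    give the same value.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(a * b for a, b in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(a * a for a in profile_a))
    norm_b = math.sqrt(sum(b * b for b in profile_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a profile with no mutations has no defined direction")
    return dot / (norm_a * norm_b)

# Toy three-channel vectors (real SBS-96 profiles have 96 entries).
same_shape = cosine_similarity([10, 0, 20], [1, 0, 2])  # same pattern, different depth
no_overlap = cosine_similarity([5, 0, 0], [0, 3, 0])    # mutations in disjoint channels
```

Because the metric depends only on the shape of the profile, a low-depth duplex profile can be compared directly with a high-count profile from clonally expanded cells; for non-negative profiles, values near 1 indicate near-identical patterns and 0 indicates no shared channels.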

The in vitro results described above were generated from 100 ng of input DNA (Figure 2c). To evaluate UDSeq’s performance with low-input DNA, we applied it to 100 picograms (pg) of DNA from 4NQO- and AA-I-exposed HepG2 cells. Across all conditions, UDSeq reliably detected the expected mutational patterns, yielding results consistent with those obtained from higher DNA input (Figure 2d).

To demonstrate UDSeq’s versatility for custom pull-down sequencing, we also performed whole-exome sequencing (using xGen Exome Hybridization Panel) and targeted gene panel sequencing encompassing 127 known cancer-associated genes (xGen Pan-Cancer Hybridization Panel). The resulting mutational profiles closely matched expectations, with AA-I exposure aligning with SBS22a (cosine similarity = 0.98) and 4NQO exposure mirroring patterns observed in HepG2 cells (cosine similarity = 0.94; Figure 2e).

In vivo Assessment of Mutagenesis

To showcase the utility of the UDSeq protocol for assessing in vivo mutagenesis, we exposed SKH-1 hairless mice to ssUVR (14 kJ/m2; ~0.5 minimal erythema dose) three times per week for 30 weeks (Figure 3a) and F344 rats to 5 parts per million NNK in drinking water for 15 weeks (Figure 3b), and compared their mutational profiles to those of unexposed controls. As expected, in SKH-1 hairless mice, mutational burden analysis revealed 105-fold and 6.5-fold higher mutational loads in the dorsal and ventral skin of ssUVR-exposed mice, respectively (p<0.05), compared to controls (Figure 3a). ssUVR-associated mutational signatures were present in the dorsal and ventral skin of exposed mice but absent in the skin of unexposed controls (Figure 3a). Similarly, F344 rats exposed to NNK exhibited a 4.7-fold higher mutational burden compared to controls (p=0.013; Figure 3b), with an NNK-specific mutational pattern closely resembling that observed in cell line experiments (Figure 2b; cosine similarity = 0.90).

Figure 3: UDSeq-based in vivo mutagenesis in mouse and rat models.

(a) Left: Schematic of the in vivo mutagenesis workflow in SKH-1 hairless mice, with cohorts either unexposed (controls) or subjected to solar-simulated UVR (14 kJ/m2; ~0.5 minimal erythema dose) three times per week for 30 weeks. Middle: Box plots showing mutation burden per base pair in ventral skin from control mice (Control; n=4), ventral skin from ssUVR-exposed mice (ssUVR_VS, n=4), and dorsal skin from ssUVR-exposed mice (ssUVR_DS, n=4). The y-axis represents mutation burden per base pair on a log scale. Horizontal lines within boxes indicate medians; boxes represent interquartile ranges (IQR), and whiskers extend to 1.5× IQR. P-values were calculated using a two-sided t-test: control vs. ssUVR_VS, p=0.014; ssUVR_VS vs. ssUVR_DS, p=0.010. Right: SBS-96 mutational profiles of control, ssUVR_VS, and ssUVR_DS skin samples, with contributing COSMIC reference mutational signatures shown adjacent to each profile. Each SBS-96 profile represents all single-base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) within their trinucleotide sequence context. (b) Left: Schematic of the in vivo mutagenesis workflow in F344 rats: control and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK)–exposed groups (NNK administered in drinking water for 15 weeks), with lung tissue collected for analysis. Middle: Box plots showing mutation burden per base pair in lung tissue for control (n=3) and NNK-exposed (n=4) lungs. The y-axis represents mutation burden per base pair on a log scale, and the format of the box plots is identical to that in (a). P-values were calculated using a two-sided t-test: control vs. NNK, p=0.010. Right: SBS-96 mutational profiles of control and NNK-exposed lung tissues, with contributing COSMIC reference signatures shown alongside each profile. The in vitro–derived NNK experimental signature from Figure 2c was included in the assignment and was detected exclusively in the NNK-exposed samples.

Additionally, we evaluated the capability of UDSeq to profile tissues from non-model organisms by whole-genome sequencing breast, pancreas, and skin tissues from healthy chickens, as well as different layers of kidney tissue samples from healthy sheep. As anticipated51, the tissues from both chickens and sheep displayed distinct patterns of clock-like mutational signatures SBS1 and SBS5, along with SBS18 in skin tissue of chickens (Supplementary Figure 3a-c). Furthermore, we observed that the cortex has a 1.49-fold higher mutational burden than the medulla in the same kidney samples (Supplementary Figure 3c)52.

Examining Mutational Processes in Healthy Human Tissues

To demonstrate UDSeq’s ability to study mutational processes in normal somatic tissues of healthy individuals, we applied the protocol at whole-genome resolution to five tissue samples from a single 70-year-old individual: left cortex, right cortex, left kidney, right kidney, and liver (Figure 4a). The data revealed that the brain had the lowest mutational burden, followed by the kidneys and liver (Figure 4a). Interestingly, the left kidney exhibited more mutations than the right kidney. However, since DNA was extracted from bulk tissue, this difference may stem from variations in capturing the kidney cortex and medulla, as observed in prior reports52 and our data from sheep kidney (Supplementary Figure 3c). To investigate the mutational processes active in these organs, we analyzed the mutational signatures present53 across tissues and identified patterns consistent with known biology. As expected, clock-like mutational signatures SBS1 and SBS5, which are associated with cell proliferation and aging, were detected in all samples (Figure 4b). SBS40, a signature commonly observed in renal tissues and cancers despite its unknown etiology54, was present in both kidney samples. SBS4, which is associated with tobacco smoking across multiple cancer types and has also been observed in liver cancer55, was uniquely detected in the liver, although the smoking status of the donor was unavailable. Together, these findings confirm that UDSeq can detect the expected tissue-specific mutational processes (Figure 4b).

Figure 4: UDSeq-based mutational burden and mutational profiles across healthy human tissues.


(a) Schematic of the human body highlighting sampled organs and their mutation burden estimates from a single 70-year-old individual. Mutation burden is expressed as single base substitutions (SBS) per base pair, and the total number of SBS per diploid genome is shown for each organ (denoted by m). (b) SBS-96 mutational profiles are shown for each tissue. Each SBS-96 profile represents all single-base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) within their trinucleotide sequence context. Contributing COSMIC mutational signatures are displayed adjacent to each profile. Y-axis scales are adjusted individually to optimally display the percentage of mutations within each tissue.

DISCUSSION

In this study, we introduce UDSeq, a novel and cost-efficient single-molecule duplex sequencing protocol designed to overcome the key limitations of existing error-corrected sequencing technologies. By enabling both whole-genome and targeted sequencing from as little as 100 picograms of input DNA, UDSeq combines ultra-low error rates (~2.5 × 10⁻⁹ errors per base pair; Table 1) with high sensitivity and versatility, making it well-suited for detecting rare somatic mutations across a wide range of biological contexts. Benchmarking against human sperm DNA validated its accuracy, yielding mutation rates that align with parent-offspring trio-based de novo mutation studies and faithfully recapitulating clock-like mutational signatures SBS1 and SBS5.

Unlike prior duplex sequencing protocols—such as NanoSeqV1, CODEC-HpyCH4V, and HiDEF-seq—that rely on enzyme-based fragmentation and are limited to partial genome coverage, UDSeq leverages random fragmentation to achieve near-complete genome and exome representation. This enzyme-independent approach expands applicability across species and simplifies targeted sequencing without requiring protocol modifications. Additionally, our optimized library preparation pipeline enhances library conversion efficiency, generating up to four times more duplex molecules than NanoSeqV2 from the same DNA input. Combined with its compatibility with low-input samples and streamlined workflow, UDSeq offers both technical performance and cost-effectiveness, making it particularly valuable for studies involving scarce clinical material or environmental specimens.

In this study, we demonstrated the power of UDSeq across diverse applications. In vitro, it captured carcinogen-induced mutational signatures in heterogeneous cell populations—without the need for laborious clonal expansion. In vivo, it revealed exposure-specific mutation patterns in rodent models and enabled genome-wide mutation detection in non-model organisms including chickens and sheep. In human tissue biopsies, UDSeq recovered expected organ-specific mutational signatures and identified differences in mutational burden, further validating its utility for studying tissue-specific mutagenesis. While some of these applications could, in principle, be addressed with other duplex sequencing protocols, to the best of our knowledge UDSeq is the only method optimized to perform all of them, with experiments conducted across different laboratories confirming its versatility and with the protocol streamlined to facilitate adoption and use by others.

While UDSeq has demonstrated versatility across a wide range of applications and offers clear advantages over existing duplex sequencing protocols, it still shares certain limitations inherent to all short-read duplex sequencing. Specifically, the protocol is not well-suited for detecting large structural variants, complex rearrangements, or copy number alterations, which require long-range genomic context. Future integration with long-read sequencing technologies or complementary genomic platforms could overcome these challenges, further extending UDSeq’s utility to capture both small-scale mutations and large-scale genomic alterations.

In summary, UDSeq is a robust, scalable, and cost-efficient duplex sequencing protocol that enables accurate detection of rare somatic mutations at single-molecule resolution. Its flexibility across species, compatibility with limited input material, and high technical fidelity position UDSeq as a powerful tool for advancing studies of mutagenesis, somatic mosaicism, aging, cancer biology, and environmental exposures. Importantly, we provide a clear and streamlined protocol (Supplementary Note 1) that is easy to use and has been extensively validated across multiple independent laboratories, ensuring broad reproducibility and accessibility.

METHODS

Human biospecimens

All human biospecimens were collected with informed consent from all human research participants or their families. The tissue samples used in this study were collected post-mortem from deceased human participants by LIBD, not from living individuals. The collection was conducted in accordance with applicable national and state Institutional Review Board (IRB) regulations (study number: 1126332; IRB tracking number: 20111080). Sperm samples were collected from healthy ethnically diverse males enrolled according to approved human subjects’ protocols from the Institutional Review Board (IRB) of the University of California for blood, saliva, and semen sampling (140028, 161115). Genomic DNA was extracted using the DNeasy Blood and Tissue kit (QIAGEN, Cat# 69506, Valencia, CA) following the manufacturer’s recommendations.

Cytotoxicity assessment

Cytotoxicity assessment was performed for all in vitro experiments. Specifically, cell viability was determined using the CellTiter-Glo® Luminescent Cell Viability Assay (Promega, Cat# G7572, Madison, WI), which quantifies ATP as an indicator of metabolically active cells. The reagent was added to each well of a 96-well plate at a 1:10 ratio. After a 10-minute incubation at room temperature, luminescence was recorded using a Cytation 5 Cell Imaging Multi-Mode Reader (BioTek, Winooski, VT). Relative cell viability was calculated as the percentage of luminescent signal from treated cells compared to untreated controls.
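The relative viability calculation described above can be sketched as follows. This is a minimal illustration that assumes replicate wells are averaged before taking the ratio; the function name and the luminescence readings are hypothetical and are not data from this study.

```python
def relative_viability(treated, untreated):
    """Percent viability: mean treated luminescence relative to mean untreated control."""
    if not treated or not untreated:
        raise ValueError("need at least one treated and one control reading")
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(untreated) / len(untreated)
    if mean_control <= 0:
        raise ValueError("control signal must be positive")
    return 100.0 * mean_treated / mean_control


# Hypothetical luminescence readings in arbitrary units (assumed for illustration).
treated_wells = [5200, 4800, 5000]
control_wells = [10100, 9900, 10000]
print(relative_viability(treated_wells, control_wells))  # 50.0
```

Under this sketch, a dose giving a value near 50% would correspond to the IC50 used for the exposures.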

In vitro experiments

The HepG2 human liver cancer cell line, derived from a well-differentiated hepatocellular carcinoma, was purchased from ATCC (HB-8065). An hTERT-immortalized non-cancerous human keratinocyte cell line (i.e., N/TERT-1) was purchased from Cellosaurus (RRID: CVCL_CW92). Normal oral keratinocyte (NOK) cells were a kind gift from Dr. Paul Lambert (University of Wisconsin-Madison, United States of America). These cells were generated by retroviral insertion of the human hTERT gene in oral epithelial cells derived from gingival tissue and were propagated in keratinocyte growth medium 2 (PromoCell GmbH, Heidelberg, Germany) with 1% penicillin/streptomycin. All other cells were cultured following the cell maintenance process recommended by the manufacturer, using T25 (Thermo Fisher, 169900) or T75 (Thermo Fisher, 156800) flasks. Following cytotoxicity assessment, cells were exposed to each environmental carcinogen at its half-maximal inhibitory concentration (IC50) for a defined duration. For in vitro experiments profiled with UDSeq, no single-cell clonal bottlenecking or clonal passaging was performed after exposure. For each experiment, cells were passaged only once after exposure, followed by DNA extraction. Following treatment, genomic DNA extraction was performed using the DNeasy Blood & Tissue Kit (QIAGEN, Cat# 69506, Valencia, CA), including RNase A treatment to eliminate RNA contamination. DNA concentrations were measured using the Qubit dsDNA Broad Range Assay Kit (Thermo Fisher Scientific, Cat# Q32850).

For the in vitro experiments involving bottlenecking, clonal expansion, and profiling with bulk sequencing, primary human cells derived from human foreskin fibroblasts (HFFs) were passaged and clonally expanded following the methods in Zhivagui et al.56. Cells were washed weekly until clones reached confluency and were transferred progressively to T-75 flasks. 4NQO exposure (0.5 μM for 4 hours) was performed following cytotoxicity assessment to determine the IC50 concentration. Following exposure, cells underwent an additional clonal passage for ~35 rounds of cell division, after which DNA was extracted and subjected to bulk whole-genome sequencing using the NEBNext® Ultra II DNA Library Prep Kit for Illumina® (E7645S). Clonal expansion results for other cells were based on previously generated sequencing data as reported in the original publications.

In vivo experiments

Male SKH-1 mice (21–25 days old) were purchased from Charles River Laboratories (Wilmington, MA). These studies were performed under an approved Institutional Animal Care and Use Committee (IACUC) protocol 25–201636-HSC at the University of New Mexico. Mice were either controls (i.e., unexposed to ssUVR) or exposed to ssUVR (14 kJ/m²; ~0.5 minimal erythema dose) 3 times per week for 30 weeks. Animals were sacrificed 4 weeks after the last ssUVR treatment. Animals were euthanized using CO2 followed by cervical dislocation and tissues were collected. Skin tissue was preserved in 10% neutral buffered formalin or RNAlater, or was snap-frozen, and epidermal scrapings were obtained from both ventral and dorsal skin. We have complied with all relevant ethical regulations for animal use. Genomic DNA extraction was performed using the DNeasy Blood & Tissue Kit (QIAGEN, Cat# 69506, Valencia, CA), including RNase A treatment to eliminate RNA contamination. DNA concentrations were measured using the Qubit dsDNA Broad Range Assay Kit (Thermo Fisher Scientific, Cat# Q32850).

F344 rats (21–25 days old) were purchased from Charles River Laboratories (Wilmington, MA). These studies were performed under an approved Institutional Animal Care and Use Committee (IACUC) protocol (#1802–35549A) at the University of Minnesota. Following one week of acclimation, rats were treated with NNK (5 parts per million in drinking water) and were euthanized after 15 weeks. Control rats were provided with normal drinking water. Animals were euthanized using CO2 followed by cervical dislocation and tissues were collected. Lung tissues were collected from both control and NNK-exposed rats and flash-frozen. Genomic DNA extraction was performed using the DNeasy Blood & Tissue Kit (QIAGEN, Cat# 69506, Valencia, CA), including RNase A treatment to eliminate RNA contamination. DNA concentrations were measured using the Qubit dsDNA Broad Range Assay Kit (Thermo Fisher Scientific, Cat# Q32850).

Chicken and sheep organs were obtained from a butcher shop in San Diego. Genomic DNA extraction was performed using the DNeasy Blood & Tissue Kit (QIAGEN, Cat# 69506, Valencia, CA), including RNase A treatment to eliminate RNA contamination. DNA concentrations were measured using the Qubit dsDNA Broad Range Assay Kit (Thermo Fisher Scientific, Cat# Q32850).

UDSeq Library Preparation

The complete step-by-step UDSeq protocol is provided in Supplementary Note 1. Briefly, to minimize DNA damage during fragmentation, intact genomic DNA was enzymatically fragmented using NEBNext dsDNA Fragmentase (M0348S) or UltraShear (M7634L) to achieve an average fragment size of ~350 bp. Fragmentation conditions were carefully optimized for each species and sample type. For human samples, both sperm and cell lines were fragmented for 15 minutes, while human tissue required 20 minutes. Mouse cell lines were also fragmented for 20 minutes, but mouse tissue needed a longer duration of 25 minutes. Similarly, rat tissues were fragmented for 25 minutes to achieve optimal results. Fragmented DNA was then used for UMI adapter ligation with the xGen cfDNA & FFPE DNA Library Preparation Kit. All steps were carried out on magnetic beads to reduce DNA loss during purification, thereby improving library conversion efficiency (Supplementary Figure 1b). In the final step, an appropriate femtomole input amount was used for PCR amplification to incorporate sample index sequences compatible with Illumina® sequencing platforms.
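For convenience, the fragmentation times stated above can be collected in a small lookup table. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, and combinations not reported in the text (for example chicken or sheep tissue) deliberately raise an error rather than guess a value.

```python
# Fragmentation times in minutes, as stated in the text above.
# The (species, sample_type) key naming is an assumption made for this sketch.
FRAGMENTATION_MINUTES = {
    ("human", "sperm"): 15,
    ("human", "cell_line"): 15,
    ("human", "tissue"): 20,
    ("mouse", "cell_line"): 20,
    ("mouse", "tissue"): 25,
    ("rat", "tissue"): 25,
}


def fragmentation_minutes(species, sample_type):
    """Return the reported fragmentation time, or raise if none was reported."""
    key = (species.lower(), sample_type.lower())
    if key not in FRAGMENTATION_MINUTES:
        raise KeyError(f"no fragmentation time reported for {key}; optimize empirically")
    return FRAGMENTATION_MINUTES[key]


print(fragmentation_minutes("human", "tissue"))  # 20
```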

DNA Quantification, Dilution, and PCR Amplification

A key strength of UDSeq lies in the accurate quantification of adapter-ligated DNA using qPCR. To avoid the variability introduced by mixed primer sets during quantification, we utilized NEBNext® Library Quant DNA Standards, which reliably quantify UMI-ligated molecules. For size correction, we used 330 bp for the standards. For adapter-ligated DNA, we estimated fragment size by adding 82 bp (accounting for UMI-containing adapters) to the average fragment length determined by TapeStation. For example, a sample with an average fragment size of 370 bp was quantified using 452 bp as the effective fragment length. Additional details are provided in Supplementary Note 1.
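The size correction above can be sketched in a few lines. The 82 bp adapter length and the 330 bp standard size come from the text; the scaling of the qPCR-derived concentration by the ratio of standard length to library length is the conventional qPCR library quantification adjustment and is assumed here rather than taken from Supplementary Note 1. Function names are hypothetical.

```python
ADAPTER_BP = 82     # UMI-containing adapters added to each fragment (from the text)
STANDARD_BP = 330   # size used for the qPCR standards (from the text)


def effective_length(mean_fragment_bp):
    """Effective length of an adapter-ligated molecule, in bp."""
    return mean_fragment_bp + ADAPTER_BP


def size_corrected_concentration(qpcr_conc, mean_fragment_bp):
    """Scale a qPCR-derived molar concentration by standard/library length.

    Assumed convention: longer library molecules generate more signal per
    molecule than the standards, so the raw estimate is scaled down.
    """
    if qpcr_conc < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    return qpcr_conc * STANDARD_BP / effective_length(mean_fragment_bp)


print(effective_length(370))  # 452, matching the worked example in the text
```

The corrected concentration is returned in the same molar units as the input, so the caller is responsible for keeping units consistent.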

For library amplification, we used 0.2 fmol of input DNA and 15 PCR cycles to achieve ~90× whole-genome coverage in human samples. For mouse samples, we used 0.15 fmol with the same number of cycles. For other species, input amounts were adjusted as appropriate to target ~80% duplicated and ~20% unique reads (Supplementary Note 1).
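To make the femtomole inputs concrete, the sketch below converts an input amount into a number of adapter-ligated molecules and into a pipetting volume. It relies only on Avogadro's constant and on the identity 1 nM = 1 fmol/µL; the function names and the example stock concentration are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def fmol_to_molecules(fmol):
    """Number of molecules in a given amount expressed in femtomoles."""
    return fmol * 1e-15 * AVOGADRO


def volume_ul_for_input(target_fmol, stock_nM):
    """Volume (µL) of a stock at stock_nM needed to deliver target_fmol.

    Uses 1 nM = 1 fmol/µL.
    """
    if stock_nM <= 0:
        raise ValueError("stock concentration must be positive")
    return target_fmol / stock_nM


human_input = fmol_to_molecules(0.2)   # roughly 1.2e8 molecules
mouse_input = fmol_to_molecules(0.15)  # roughly 9.0e7 molecules
# Hypothetical 0.1 nM working dilution: 2 µL delivers 0.2 fmol.
print(human_input, mouse_input, volume_ul_for_input(0.2, 0.1))
```

Fixing the number of input molecules in this way is what ties sequencing depth to the intended balance of duplicated and unique reads.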

Targeted hybrid capture was performed using 6–8 multiplexed samples per reaction, with 500 ng of adapter-ligated DNA per capture. The complete targeted capture protocol, including exome and panel-based enrichment, is described in Supplementary Note 1. The prepared UDSeq libraries were sequenced on Illumina NovaSeq 6000 and NovaSeq X platforms using 150 bp paired-end chemistry to the targeted data volume.

Trimming, Alignment, and Mutation Identification

All bioinformatics analyses were performed within the Triton Shared Compute Cluster (San Diego Supercomputer Center (2022): Triton Shared Computing Cluster. University of California, San Diego. Service. https://doi.org/10.57873/T34W2R). Somatic mutations and mutational burden were determined from UDSeq data with matched normal samples using DupCaller57 (version 1.0.1). Briefly, paired-end FASTQ files with equal-length barcodes at the start of each read were preprocessed to remove the barcodes, and the trimmed reads were aligned to the reference genome using BWA. PCR and optical duplicates were marked using GATK. DupCaller constructs sample-specific error profiles by analyzing single-strand mismatches and single-read discrepancies. These profiles are stratified by trinucleotide context for substitutions and by homopolymer length for indels. A strand-aware probabilistic model calculates genotype likelihoods and assigns confidence scores to candidate mutations. Mutations exceeding a confidence threshold are retained. Post-calling filters were used to exclude low-quality reads, common germline variants, and noisy loci.
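The barcode removal step can be illustrated with a minimal sketch. This is not DupCaller's implementation; it only shows the general idea of splitting a fixed-length barcode from the start of a read while keeping the base and quality strings in register. The function name, the example read, and the barcode length of 3 are all assumptions made for illustration.

```python
def split_barcode(seq, qual, barcode_len):
    """Split a fixed-length barcode off the 5' end of a read.

    Returns (barcode, trimmed_sequence, trimmed_quality).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    if barcode_len < 0 or len(seq) <= barcode_len:
        raise ValueError("read must be longer than the barcode")
    return seq[:barcode_len], seq[barcode_len:], qual[barcode_len:]


# Hypothetical read and barcode length, chosen only for illustration.
barcode, trimmed_seq, trimmed_qual = split_barcode("ACGTTTGGA", "IIIIIIIII", 3)
print(barcode, trimmed_seq, trimmed_qual)  # ACG TTTGGA IIIIII
```

In practice the extracted barcode would be recorded alongside the read (for example in the read name or a tag) so that reads from the same original molecule can be grouped after alignment.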

Mutational profile and signature analysis

The variant call format (VCF) files from DupCaller were used for mutational profile generation and signature assignment. Analysis of mutational profiles was performed using our previously established methodology with the SigProfiler suite of tools. Briefly, mutational matrices for SBS, DBS, and indels were generated with SigProfilerMatrixGenerator58 (version 1.2.16). Each mutational profile was plotted with SigProfilerPlotting (version 1.3.13). Mutational signatures were assigned to samples with SigProfilerAssignment59. Mutational profiles of sperm samples from NanoSeqV1 (ref. 35) and HiDEF-seq (ref. 38) were obtained from their corresponding publications. Parental de novo mutations were obtained from Halldorsson et al.41, and the patterns of these mutations were plotted with SigProfilerMatrixGenerator. For the regression plots, the mutation rate was calculated as previously described (ref. 38). The corrected mutation burdens output by DupCaller were plotted using the R statistical language60.
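To clarify what an SBS-96 matrix counts, the sketch below assigns a single-base substitution to one of the 96 channels by reporting it on the pyrimidine strand together with its flanking bases. It is an independent illustration of the standard classification, not code from SigProfilerMatrixGenerator, and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbs96_channel(trinucleotide, alt):
    """Return the SBS-96 label (e.g. 'A[C>T]G') for a reference trinucleotide
    centered on the mutated base and the alternate base."""
    tri, alt = trinucleotide.upper(), alt.upper()
    if len(tri) != 3 or len(alt) != 1 or any(b not in COMPLEMENT for b in tri + alt):
        raise ValueError("expected a 3-base context and a 1-base alternate allele")
    if tri[1] == alt:
        raise ValueError("alternate base equals reference base")
    if tri[1] in "AG":
        # Purine reference: report the event on the complementary strand.
        tri, alt = reverse_complement(tri), COMPLEMENT[alt]
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


print(sbs96_channel("ACG", "T"))  # A[C>T]G
print(sbs96_channel("CGT", "A"))  # A[C>T]G (same event seen from the other strand)
```

Counting these labels across all called substitutions in a sample yields the 96-element vector that is plotted as an SBS-96 profile and decomposed into COSMIC signatures.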

Supplementary Material

Supplement 1

Supplementary Figure 1: UDSeq protocol overview and comparative performance. (a) Detailed schematic of the UDSeq protocol, comprising four major steps: (i) DNA extraction and fragmentation, (ii) end repair and adapter ligation for library preparation, (iii) library quantification and sequencing, and (iv) data analysis. (b) Comparative overview of genomic coverage achieved for human cortex (UDSeq), human cortex (NanoSeqV1), and human blood (bulk WGS) samples across different target regions (exome and genome) at varying coverage thresholds. (c) Quantitative polymerase chain reaction (quantitative PCR) amplification curves and quantification of unique molecular identifier–ligated molecules (in femtomoles) demonstrating library conversion efficiency of Universal Duplex Sequencing (UDSeq) versus NanoSeqV2 using 10 nanograms of fragmented DNA input. UDSeq achieved approximately four-fold higher conversion efficiency. Horizontal lines indicate medians; boxes represent interquartile ranges (IQR), and whiskers extend to 1.5× IQR. Statistical significance was assessed using a two-sided t-test (p=0.00022). (d) Cost-effectiveness comparison of UDSeq, NanoSeq, CODEC, and HiDEF-seq. Projected error rates and estimated cost per megabase of duplex coverage for CODEC (using HpyCH4V enzymatic fragmentation), CODEC (sonication), NanoSeqV1, HiDEF-seq, and UDSeq. Error rates are shown on the y-axis.

media-1.pdf (1.6MB, pdf)
Supplement 2

Supplemental Figure 2: Mutational profiles and genomic analyses across species and tissues. (a) SBS-96 mutational profiles from distinct anatomical layers of the kidney in three individual sheep, with the relative contributions of the clock-like COSMIC signatures SBS1 and SBS5 shown alongside each profile. Each SBS-96 profile represents all single-base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) within their trinucleotide sequence context. Y-axis scales are adjusted independently to optimize visualization of mutation percentages within each trinucleotide context. (b) SBS-96 mutational profiles from skin, breast, and pancreas tissues in chicken, with the contributions of the clock-like COSMIC signatures SBS1 and SBS5, as well as the reactive oxygen species–associated signature SBS18, displayed adjacent to each profile. Profiles are displayed in a format consistent with (a). (c) Box plots showing mutation burden per base pair (log scale) for kidney cortex and two replicates of kidney medulla from sheep, as well as for skin, breast, and two replicates of pancreas from chicken. Horizontal lines indicate medians; boxes represent interquartile ranges (IQR), and whiskers extend to 1.5× IQR. (d) Bar plots showing the percentage of whole-genome duplex coverage achieved for chicken, sheep, rat, and mouse using the UDSeq approach. Coverage is expressed as the proportion of the genome successfully sequenced at the desired depth.

media-2.pdf (1.3MB, pdf)
Supplement 3
media-3.pdf (1.1MB, pdf)

ACKNOWLEDGMENTS

The authors would like to thank Cécilia Sirand for her technical support in performing some of the cell line experiments and Dr Fekadu Kassie for assistance with the NNK rat study. This work was supported by the US National Institutes of Health grants R01ES032547, R01ES036931, R01CA269919, R01CA296974, P01CA281819, and U01CA290479 to L.B.A. and R01CA220376 to S.B. as well as by L.B.A.’s Packard Fellowship for Science and Engineering and the UC San Diego Sanford Stem Cell Institute. The work presented here is also supported by a network grant from The Larry L. Hillblom Foundation to L.B.A. and J.G.G. as well as by UK Grand Challenge 2016 Award “Mutographs of Cancer” C98/A24032 to L.B.A. and J.Z. This work was supported in part by NIH award R00HD111686 to X.Y. The computational analyses reported in this manuscript utilized the Triton Shared Computing Cluster at the San Diego Supercomputer Center of UC San Diego. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Footnotes

DISCLAIMER

Where members are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy or views of the International Agency for Research on Cancer/World Health Organization.

COMPETING INTERESTS

L.B.A. is a co-founder, CSO, scientific advisory member, and consultant for Acurion (formerly io9), has equity and receives income. The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. L.B.A. is also a compensated member of the scientific advisory board of Inocras. L.B.A.’s spouse is an employee of Hologic, Inc. L.B.A. declares U.S. provisional applications filed with UCSD with serial numbers: 63/269,033; 63/289,601; 63/483,237; 63/412,835; 63/492,348; and 63/366,392 as well as a European patent application with application number EP25305077.7. L.B.A. and S.P.N. also declare provisional patent application PCT/US2023/010679. L.B.A. is also an inventor of a US Patent 10,776,718 for source identification by non-negative matrix factorization. All other authors declare that they have no competing interests.

Data availability

All whole-genome sequencing data have been or will be deposited in the Sequence Read Archive (SRA) or the database of Genotypes and Phenotypes (dbGaP), as appropriate. Duplex sequencing data from N/TERT-1 and HepG2 cell lines, as well as SKH-1 mouse, F344 rat, sheep, and chicken tissues, are available under accession number PRJNA1262723. Duplex sequencing data for NOK cells are accessible via PRJNA1196807. Bulk clonal expansion sequencing data for iPSC, BEAS-2B, and N/TERT-1 were obtained from the respective publications cited in the manuscript. Data for human foreskin fibroblasts (HFF), generated as part of this study, are also deposited under PRJNA1262723. Whole-genome sequencing data from human subjects will be made available in dbGaP upon acceptance of the manuscript. Patient ID 7614 data can be accessed via PRJNA799597. Sequencing data for sperm samples are available under accession numbers PRJNA660493, PRJNA753973, and PRJNA588332. All other data are available from the corresponding authors or other sources upon reasonable request.

REFERENCES

1. Martincorena I. & Campbell P. J. Somatic mutation in cancer and normal cells. Science 349, 1483–1489 (2015). doi: 10.1126/science.aab4082
2. Ren P., Zhang J. & Vijg J. Somatic mutations in aging and disease. Geroscience 46, 5171–5189 (2024). doi: 10.1007/s11357-024-01113-3
3. Alexandrov L. B. et al. Clock-like mutational processes in human somatic cells. Nat Genet 47, 1402–1407 (2015). doi: 10.1038/ng.3441
4. Reva B., Antipin Y. & Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research 39, e118 (2011).
5. Evan G. & Littlewood T. A matter of life and cell death. Science 281, 1317–1322 (1998).
6. Maeda H. & Kakiuchi N. Clonal expansion in normal tissues. Cancer Sci 115, 2117–2124 (2024). doi: 10.1111/cas.16183
7. Martincorena I. Somatic mutation and clonal expansions in human tissues. Genome Medicine 11, 35 (2019).
8. Stratton M. R., Campbell P. J. & Futreal P. A. The cancer genome. Nature 458, 719–724 (2009). doi: 10.1038/nature07943
9. Greaves M. & Maley C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).
10. Blair B. G., Bardelli A. & Park B. H. Somatic alterations as the basis for resistance to targeted therapies. J Pathol 232, 244–254 (2014). doi: 10.1002/path.4278
11. Shendure J. & Akey J. M. The origins, determinants, and consequences of human mutations. Science 349, 1478–1483 (2015).
12. Proukakis C. Somatic mutations in neurodegeneration: An update. Neurobiology of Disease 144, 105021 (2020).
13. Jaiswal S. & Ebert B. L. Clonal hematopoiesis in human aging and disease. Science 366, eaan4673 (2019).
14. Pfeifer G. P. & Jin S.-G. Methods and applications of genome-wide profiling of DNA damage and rare mutations. Nature Reviews Genetics, 1–18 (2024).
15. Fowler J. C. & Jones P. H. Somatic mutation: what shapes the mutational landscape of normal epithelia? Cancer Discovery 12, 1642–1655 (2022).
16. Stoler N. & Nekrutenko A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform 3, lqab019 (2021). doi: 10.1093/nargab/lqab019
17. Dou Y., Gold H. D., Luquette L. J. & Park P. J. Detecting somatic mutations in normal cells. Trends in Genetics 34, 545–557 (2018).
18. Menon V. & Brash D. E. Next-Generation Sequencing Methodologies to Detect Low-Frequency Mutations: “Catch Me If You Can”. Mutation Research/Reviews in Mutation Research, 108471 (2023).
19. Kucab J. E. et al. A compendium of mutational signatures of environmental agents. Cell 177, 821–836 (2019).
20. Koh G., Zou X. & Nik-Zainal S. Mutational signatures: experimental design and analytical framework. Genome Biol 21, 37 (2020). doi: 10.1186/s13059-020-1951-5
21. Lodato M. A. et al. Somatic mutation in single human neurons tracks developmental and transcriptional history. Science 350, 94–98 (2015). doi: 10.1126/science.aab1785
22. Xing D., Tan L., Chang C. H., Li H. & Xie X. S. Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands. Proc Natl Acad Sci U S A 118 (2021). doi: 10.1073/pnas.2013106118
23. Gonzalez-Pena V. et al. Accurate genomic variant detection in single cells with primary template-directed amplification. Proc Natl Acad Sci U S A 118 (2021). doi: 10.1073/pnas.2024176118
24. Dong X. et al. Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat Methods 14, 491–493 (2017). doi: 10.1038/nmeth.4227
25. Petljak M. et al. Characterizing Mutational Signatures in Human Cancer Cell Lines Reveals Episodic APOBEC Mutagenesis. Cell 176, 1282–1294 e1220 (2019). doi: 10.1016/j.cell.2019.02.012
26. Blokzijl F. et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature 538, 260–264 (2016). doi: 10.1038/nature19768
27. Mitchell E. et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature 606, 343–350 (2022). doi: 10.1038/s41586-022-04786-y
28. Yoshida K. et al. Tobacco smoking and somatic mutations in human bronchial epithelium. Nature 578, 266–272 (2020). doi: 10.1038/s41586-020-1961-1
29. Luquette L. J. et al. Single-cell genome sequencing of human neurons identifies somatic point mutation and indel enrichment in regulatory elements. Nature Genetics 54, 1564–1571 (2022).
30. Hou Y. et al. Comparison of variations detection between whole-genome amplification methods used in single-cell resequencing. Gigascience 4, s13742-13015-10068-13743 (2015).
31. Menon V. & Brash D. E. Next-generation sequencing methodologies to detect low-frequency mutations: “Catch me if you can”. Mutat Res Rev Mutat Res 792, 108471 (2023). doi: 10.1016/j.mrrev.2023.108471
32. Schmitt M. W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A 109, 14508–14513 (2012). doi: 10.1073/pnas.1208715109
33. Kennedy S. R. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat Protoc 9, 2586–2606 (2014). doi: 10.1038/nprot.2014.170
34. Hoang M. L. et al. Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proceedings of the National Academy of Sciences 113, 9846–9851 (2016). doi: 10.1073/pnas.1607794113
35. Abascal F. et al. Somatic mutation landscapes at single-molecule resolution. Nature 593, 405–410 (2021).
36. Bae J. H. et al. Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nature Genetics 55, 871–879 (2023).
37. Lawson A. R. et al. Somatic mutation and selection at epidemiological scale. medRxiv, 2024.10.30.24316422 (2024).
38. Liu M. H. et al. DNA mismatch and damage patterns revealed by single-molecule sequencing. Nature, 1–10 (2024).
39. Wenger A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology 37, 1155–1162 (2019).
40. Jónsson H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
41. Halldorsson B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
42. Axelsson J. et al. Frequency and spectrum of mutations in human sperm measured using duplex sequencing correlate with trio-based de novo mutation analyses. Scientific Reports 14, 23134 (2024).
43. Rahbari R. et al. Timing, rates and spectra of human germline mutation. Nat Genet 48, 126–133 (2016). doi: 10.1038/ng.3469
44. Zou X. et al. Validating the concept of mutational signatures with isogenic cell models. Nature Communications 9, 1744 (2018).
45. Zavadil J. & Rozen S. G. Experimental Delineation of Mutational Signatures Is an Essential Tool in Cancer Epidemiology and Prevention. Chem Res Toxicol 32, 2153–2155 (2019). doi: 10.1021/acs.chemrestox.9b00339
46. Korenjak M. et al. Human cancer genomes harbor the mutational signature of tobacco-specific nitrosamines NNN and NNK. bioRxiv, 2024.06.28.600253 (2024).
47. Speer R. M. et al. Arsenic is a potent co-mutagen of ultraviolet light. Communications Biology 6, 1273 (2023).
48. Senkin S. et al. Geographic variation of mutagenic exposures in kidney cancer genomes. Nature 629, 910–918 (2024). doi: 10.1038/s41586-024-07368-2
49. Nik-Zainal S. et al. The genome as a record of environmental exposure. Mutagenesis 30, 763–770 (2015).
50. Mingard C. et al. Dissection of cancer mutational signatures with individual components of cigarette smoking. Chemical Research in Toxicology 36, 714–723 (2023).
51. Ivanov D., Hwang T., Sitko L. K., Lee S. & Gartner A. Experimental systems for the analysis of mutational signatures: no ‘one-size-fits-all’ solution. Biochemical Society Transactions 51, 1307–1317 (2023).
52. Martin G. M. et al. Somatic mutations are frequent and increase with age in human kidney epithelial cells. Human Molecular Genetics 5, 215–221 (1996).
53. Díaz-Gay M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39, btad756 (2023).
54. Alexandrov L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
55. Alexandrov L. B. et al. Mutational signatures associated with tobacco smoking in human cancer. Science 354, 618–622 (2016). doi: 10.1126/science.aag0299
56. Zhivagui M. et al. DNA damage and somatic mutations in mammalian cells after irradiation with a nail polish dryer. Nature Communications 14, 276 (2023).
57. Cheng Y. et al. Improved Mutation Detection in Duplex Sequencing Data with Sample-Specific Error Profiles. bioRxiv, 2025.07.13.664565 (2025).
58. Bergstrom E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 1–12 (2019).
59. Díaz-Gay M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. bioRxiv, 2023.07.10.548264 (2023).
60. R Core Team. R: A language and environment for statistical computing. (2013).

Associated Data

Supplementary Materials

Supplement 1

Supplementary Figure 1: UDSeq protocol overview and comparative performance. (a) Detailed schematic of the UDSeq protocol, comprising four major steps: (i) DNA extraction and fragmentation, (ii) end repair and adapter ligation for library preparation, (iii) library quantification and sequencing, and (iv) data analysis. (b) Comparative overview of genomic coverage achieved for human cortex (UDSeq), human cortex (NanoSeqV1), and human blood (bulk WGS) samples across different target regions (exome and genome) at varying coverage thresholds. (c) Quantitative PCR (qPCR) amplification curves and quantification of UMI-ligated molecules (in femtomoles), comparing the library conversion efficiency of UDSeq and NanoSeqV2 from 10 ng of fragmented DNA input. UDSeq achieved approximately fourfold higher conversion efficiency. Horizontal lines indicate medians; boxes represent interquartile ranges (IQR), and whiskers extend to 1.5× IQR. Statistical significance was assessed using a two-sided t-test (p = 0.00022). (d) Cost-effectiveness comparison of UDSeq, NanoSeq, CODEC, and HiDEF-seq, showing projected error rates and estimated cost per megabase of duplex coverage for CODEC (using HpyCH4V enzymatic fragmentation), CODEC (sonication), NanoSeqV1, HiDEF-seq, and UDSeq. Error rates are shown on the y-axis.
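
The box-plot convention stated in panel (c) can be made concrete with a short sketch. The Python below is illustrative only and is not the authors' plotting code. It assumes the common Tukey convention, in which each whisker stops at the most extreme observation lying within 1.5× IQR of the box; the function name `tukey_box_stats`, the choice of the inclusive quartile method, and the sample values are all assumptions introduced here.

```python
import statistics


def tukey_box_stats(values):
    """Summarise values the way a Tukey-style box plot draws them.

    Hypothetical helper for illustration; quartile definitions differ
    between plotting libraries, so exact numbers may vary slightly.
    """
    # Quartiles via linear interpolation over the observed data range.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Fences sit 1.5 x IQR beyond the box edges.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Whiskers end at the most extreme observations inside the fences;
    # anything beyond the fences is reported as an outlier.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)

    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Made-up example values, not data from the study.
stats = tukey_box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

Under this convention a single extreme value, such as the 100 in the made-up sample, is drawn as an outlier and does not stretch the whisker.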

media-1.pdf (1.6MB, pdf)
Supplement 2

Supplementary Figure 2: Mutational profiles and genomic analyses across species and tissues. (a) SBS-96 mutational profiles from distinct anatomical layers of the kidney in three individual sheep, with the relative contributions of the clock-like COSMIC signatures SBS1 and SBS5 shown alongside each profile. Each SBS-96 profile represents all single-base substitutions (C>A, C>G, C>T, T>A, T>C, T>G) within their trinucleotide sequence context. Y-axis scales are adjusted independently to optimize visualization of mutation percentages within each trinucleotide context. (b) SBS-96 mutational profiles from skin, breast, and pancreas tissues in chicken, with the contributions of the clock-like COSMIC signatures SBS1 and SBS5, as well as the reactive oxygen species–associated signature SBS18, displayed adjacent to each profile. Profiles are displayed in a format consistent with (a). (c) Box plots showing mutation burden per base pair (log scale) for kidney cortex and two replicates of kidney medulla from sheep, as well as for skin, breast, and two replicates of pancreas from chicken. Horizontal lines indicate medians; boxes represent interquartile ranges (IQR), and whiskers extend to 1.5× IQR. (d) Bar plots showing the percentage of whole-genome duplex coverage achieved for chicken, sheep, rat, and mouse using the UDSeq approach. Coverage is expressed as the proportion of the genome successfully sequenced at the desired depth.
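
As a brief illustration of the SBS-96 classification described in panel (a), the sketch below enumerates the 96 channels: six pyrimidine-referenced substitution classes, each in 16 possible trinucleotide contexts. It is a minimal Python example and not the SigProfilerMatrixGenerator implementation; the function names `sbs96_channels` and `to_channel` and the `5'[REF>ALT]3'` label format are assumptions chosen for readability.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written with the pyrimidine (C or T) as reference.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sbs96_channels():
    """Return the 96 channel labels, e.g. 'A[C>T]G' (hypothetical label format)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


def to_channel(trinucleotide, alt):
    """Map a substitution at the middle base of a trinucleotide to its channel.

    A purine reference (A or G) is folded onto the opposite strand by
    reverse-complementing the context and complementing the alternate base.
    """
    ref = trinucleotide[1]
    if ref in "AG":
        trinucleotide = trinucleotide.translate(COMPLEMENT)[::-1]
        alt = alt.translate(COMPLEMENT)
        ref = trinucleotide[1]
    return f"{trinucleotide[0]}[{ref}>{alt}]{trinucleotide[2]}"


channels = sbs96_channels()
print(len(channels))           # 96
print(to_channel("TGA", "A"))  # T[C>T]A
```

The strand folding in `to_channel` is why only six substitution classes appear in the caption: a G>A change read on one strand is the same event as a C>T change read on the other.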

media-2.pdf (1.3MB, pdf)
Supplement 3
media-3.pdf (1.1MB, pdf)

Data Availability Statement

All whole-genome sequencing data have been or will be deposited in the Sequence Read Archive (SRA) or the database of Genotypes and Phenotypes (dbGaP), as appropriate. Duplex sequencing data from N/TERT-1 and HepG2 cell lines, as well as SKH-1 mouse, F344 rat, sheep, and chicken tissues, are available under accession number PRJNA1262723. Duplex sequencing data for NOK cells are accessible via PRJNA1196807. Bulk clonal expansion sequencing data for iPSC, BEAS-2B, and N/TERT-1 were obtained from the respective publications cited in the manuscript. Data for human foreskin fibroblasts (HFF), generated as part of this study, are also deposited under PRJNA1262723. Whole-genome sequencing data from human subjects will be made available in dbGaP upon acceptance of the manuscript. Patient ID 7614 data can be accessed via PRJNA799597. Sequencing data for sperm samples are available under accession numbers PRJNA660493, PRJNA753973, and PRJNA588332. All other data are available from the corresponding authors or other sources upon reasonable request.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
