Abstract
Error-corrected next-generation sequencing (ecNGS) is an emerging technology for accurately measuring somatic mutations. Here, we report paired-end and complementary consensus sequencing (PECC-Seq), a high-accuracy ecNGS approach for genome-wide somatic mutation detection. We characterize a novel 2-aminoimidazolone lesion, in addition to 7,8-dihydro-8-oxoguanine, and the resulting end-repair artifacts that originate from NGS library preparation and compromise the accuracy of NGS. We modify the library preparation protocol to enzymatically remove end-repair artifacts and improve the accuracy of our previously developed duplex consensus sequencing method. Optimized PECC-Seq shows an error rate of <5 × 10−8 with consensus bases compressed from approximately 25 Gb of raw sequencing data, enabling the accurate detection of low-abundance somatic mutations. We apply PECC-Seq to the quantification of in vivo mutagenesis. Compared with the classic gpt gene mutation assay using gpt delta transgenic mice, PECC-Seq exhibits high sensitivity in quantitatively measuring dose-dependent mutagenesis induced by Aristolochic acid I (AAI). Moreover, PECC-Seq specifically characterizes the distinct genome-wide mutational signatures of AAI, Benzo[a]pyrene, N-Nitroso-N-ethylurea and N-nitrosodiethylamine and reveals the mutational signature of Quinoline in common mouse models. Overall, our findings demonstrate that high-accuracy PECC-Seq is a promising tool for genome-wide somatic mutagenesis quantification and for in vivo mutagenicity testing.
Graphical Abstract
Introduction
Mutagenesis plays a crucial role in various human diseases, especially cancer. Accurate detection and quantification of low-abundance somatic mutations that are present only in a small fraction of cells, or even in single DNA molecules, is critical in both basic research and clinical applications, such as aging, oncology, genetic toxicology and diagnostics (1–5). Next-generation sequencing (NGS) technologies have been powerful tools for the exploration of genetic variations. However, because of their high error rates (approximately 10−3–10−2 errors per base pair), detecting subclonal mutations with abundance <1% using NGS remains a great challenge (4,6). With the emergence of new technologies, such as single-cell sequencing or laser microdissection of single-cell colonies, it has become possible to detect mutations in small subcolonies with NGS (7,8). Until now, however, quantification of somatic mutagenesis at the level of single DNA-duplex molecules has still relied mainly on indirect methods using phenotypically selectable reporter genes within transgenic rodent (TGR) models (9,10).
Error-corrected next-generation sequencing (ecNGS) technology building on the duplex consensus sequencing strategy (DCSS) provides highly accurate mutation detection (4,10). DCSS utilizes the molecular barcodes to uniquely tag each original DNA fragment, and redundant copies that arise from both strands of the DNA duplex are sequenced, tracked and grouped by the identical tags to create consensus sequences bioinformatically (4,10–12). Discordant base calls among the homogeneous copied reads are eliminated as sequencing errors, and only variants present among all the copied reads are considered as true mutations. In DCSS, random sequencing errors, PCR errors and variants from mutagenic single-stranded lesions can be efficiently removed (4,10–12). DCSS approaches, including duplex sequencing (DS), bottleneck sequencing system (BotSeqS), Hawk-Seq™, Jade-Seq™ and nanorate sequencing (NanoSeq), can dramatically reduce the standard NGS error rates to 10−9–10−6 per base pair (2,11–15). Previously, we developed a whole-genome DCSS approach, named paired-end and complementary consensus sequencing (PECC-Seq), with error rates of <5 × 10−7 per base pair (Figure 1A) (16). Different from other approaches, PECC-Seq employed a PCR-free library preparation protocol and utilized the copies from overlaps in the shortened complementary DNA strand-derived paired-end reads rather than amplicon copies of DNA templates to correct sequencing errors, ensuring the genome-wide coverage for mutation detection (16).
Figure 1.
Characterization of sequencing artifacts in C:G base pairs. (A) Consensus sequencing-based error-correction with PECC-Seq (top) and the derived SSCS analysis (bottom). Shortened PCR-free DNA libraries (∼150 bp) generated overlapped paired-end reads under 2 × 150 bp paired-end sequencing. In PECC-Seq, with the paired-end overlaps from complementary DNA strands, four copied sequences from the same starting double-stranded templates were captured for duplex consensus sequencing-based error-correction. To profile the error patterns on the single strands of PECC-Seq library, paired-end overlaps and the single-strand-derived ExAmp duplicates generated from the Illumina patterned flow cell of NovaSeq 6000 platform were used for single strand consensus sequencing. (B) Error patterns and error rates of variants detected in PECC-Seq (n = 2 technical replicates). (C) Mutational spectra of variants detected with PECC-Seq. In PECC-Seq, consensus bases located in the terminal 10 bp of inserted fragments were trimmed to reduce end-repair artifacts. The background variants displayed a similar spectrum to that of end-repair artifacts in the terminal 10 bp (data were merged from two technical replicates; 156 and 212 variants were identified, respectively). (D) Artifacts introduced by the end-repair process during library preparation. In PECC-Seq library preparation, ultrasonically sheared DNA was end-repaired, and subsequently subjected to size selection and adapter ligation to construct shortened PCR-free libraries. The end-repair step converted the 3′- and 5′-overhangs of DNA fragments into blunt ends as well as fixed the single-stranded mutagenic lesions into double-stranded errors during DNA polymerase filling. (E) Profiles of sequencing errors in paired-end consensus bases (n = 2 technical replicates). (F) Error patterns of artifacts on the single-stranded templates (e.g. mutagenic DNA lesions and misincorporations of end-repair process) (n = 2 technical replicates). 
(G–H) Increased Iz (G) and 8-oxoG lesions (H) in ultrasonically sheared DNA templates (n = 4 technical replicates). (I) Single-strand specific S1 nuclease treatment that digested the single-stranded sites in templates reduced the increased Iz lesions induced by ultrasonic shearing (n = 4 technical replicates).
ecNGS possesses great potential for the direct quantification of somatic mutagenesis in any tissue from any species. Each consensus sequence in DCSS represents a single duplex template from the haploid genome. Theoretically, DCSS can reveal low-abundance mutations even at single-DNA-molecule resolution (2,17). In early attempts, duplex sequencing was used to measure low-abundance somatic mutations induced by aflatoxin, urethane, benzo[a]pyrene (BaP) and N-nitroso-N-ethylurea (ENU) and revealed distinct mutational signatures related to these mutagens in mouse models (3,18,19). Limited by the recovery of sequenced copied reads as consensus sequences, duplex sequencing is usually applied to mutation detection in restricted genome regions and is suitable for revealing low-abundance somatic mutations in a few target gene loci (3,4,11). BotSeqS, Hawk-Seq™ and PECC-Seq provided genome-wide somatic mutation detection protocols for human and mouse genomes (13,14,16). However, similar residual low-frequency artifacts (approximately 10−7–10−6 per base pair) in these approaches hampered the precise identification of true mutations, especially ultra-low-frequency spontaneous mutations (2,15,16). NanoSeq improved the accuracy of BotSeqS to less than 5 × 10−9 per base pair but was limited to ∼30% coverage of the human genome by its restriction enzyme-based fragmentation protocol (2). Therefore, there is still room for developing highly accurate whole-genome ecNGS methods. Spontaneous mutation frequencies have been estimated to be as low as approximately 2 × 10−8 in human sperm and cord blood, and 7.13 × 10−8–1.63 × 10−7 in several mouse tissues (2,3). For the precise detection of ultra-low-abundance somatic mutations, the error rates of the applied techniques should be reduced to the level of spontaneous mutation frequencies. Residual technical artifacts observed in the DCSS approaches therefore need to be characterized and avoided.
Here, we first explored the origins of residual sequencing artifacts in DCSS approaches by integrating PECC-Seq, sequencing error profiling using single strand consensus sequencing analysis (SSCS), and HPLC–MS/MS analysis. We demonstrated that the guanine lesions and the resulting end-repair artifacts originating from NGS library preparation contributed to the sequencing artifacts in DCSS. We then reduced the error rates of PECC-Seq to <5 × 10−8 via the enzymatic removal of sequencing artifacts and improved its application in the direct quantification of genome-wide somatic mutagenesis. Using transgenic and common mouse models, we demonstrated the high sensitivity and specificity of PECC-Seq in detecting the genome-wide somatic mutations induced by mutagens. Our findings illustrate the use of PECC-Seq as a reliable new tool for somatic mutagenesis quantification and for in vivo mutagenicity testing.
Materials and methods
Animal administration
Male gpt delta transgenic mice (C57BL/6J background, 8-week-old) were randomized and administered graded doses (0, 0.125, 0.25, 0.5, 1, 2 and 4 mg/kg/day) of Aristolochic acid I (AAI, CAS No. 313-67-7, Nature Standard, China) via gavage for 28 consecutive days. All animals were sacrificed on day 31 and the kidney tissues were collected for the PECC-Seq analysis.
Male C57BL/6J mice and DBA/2 mice (8-week-old) were randomized and administered BaP (CAS No. 50-32-8, Sigma-Aldrich, MO, USA), ENU (CAS No. 759-73-9, Sigma-Aldrich, MO, USA), N-nitrosodiethylamine (DEN, CAS No. 55-18-5, TCI Chemicals, Japan), or Quinoline (CAS No. 91-22-5, J&K Scientific, China) following the scheme listed in Table 1. All animals were sacrificed on day 18 or 31 and the liver tissues were collected for the PECC-Seq analysis.
Table 1.
Administration scheme of the animal experiments
Chemicals | Animals | Doses (mg/kg/day) | Administration routes (vehicle) | Duration of dosing (day) | Sampling days |
---|---|---|---|---|---|
AAI | gpt delta mice | 0, 0.125, 0.25, 0.5, 1, 2, 4 | p.o. (saline) | 28 | 31 |
Quinoline | DBA/2 mice | 0, 50 | i.p. (olive oil) | 4 | 18 |
BaP | C57BL/6J mice | 0, 50 | p.o. (olive oil) | 28 | 31 |
ENU | C57BL/6J mice | 20 | p.o. (saline) | 15a | 31 |
DEN | C57BL/6J mice | 20 (10 from day 9) | p.o. (saline) | 13a | 31 |
a Administration terminated due to animal welfare requirements.
All animal experiments were conducted at the National Institute of Health Sciences (NIHS, Japan), Shanghai Jiao Tong University School of Medicine (SJTUSM, China) and Shanghai Model Organisms Center (SMOC, China). All animal experiments were approved by the Institutional Animal Care and Use Committee of these facilities (Approval No. 697 in NIHS, A-2019-013 in SJTUSM and 2022-0027 in SMOC).
Library preparation and sequencing of PECC-seq analysis
Genomic DNA from the liver or kidney tissues was extracted using the QIAGEN Blood & Cell Culture DNA Mini Kit or the QIAGEN DNeasy Blood & Tissue Kit (QIAGEN, Germany, Cat. No. 13323 and 69504). DNA libraries for PECC-Seq were prepared using the Illumina TruSeq DNA PCR-Free library preparation kit (Illumina, CA, USA, Cat. No. 20015962) following the manufacturer's manual with several modifications in the steps of DNA shearing, end-repair and library size selection. First, extracted genomic DNA (<5 μg) was ultrasonically sheared in Tris-EDTA buffer using a Covaris S220 system with the following settings: peak incident power 175 W, duty factor 10%, cycles per burst 200, temperature 7°C and treatment time 140 s. Prolonged ultrasonic shearing duration was employed to generate more shortened fragments for library size selection. Next, the sheared DNA fragments were end-repaired to create blunt ends. An additional S1 nuclease treatment was included in the PECC-Seq protocol to cleave single-stranded overhangs as well as nicks or gaps present in the middle of the sheared DNA templates. Sheared DNA was processed with S1 nuclease (30 units/400 ng DNA, Thermo Fisher Scientific, MA, USA, Cat. No. EN0321) at 30°C for 30 min and purified with 2× sample purification beads (SPB). The purified DNA templates were subsequently subjected to the standard end-repair procedures in the Illumina TruSeq protocol. After end-repair, adjusted ratios of SPB were used to select target DNA fragments. End-repaired DNA fragments were size-selected with 0.9× SPB to remove large DNA fragments and 1.25× SPB to capture target DNA fragments of ∼150 bp in length. Then, selected DNA fragments were subjected to standard dA-tailing and adapter ligation following the manufacturer's instructions. Since mapping coordinates served as endogenous barcodes in PECC-Seq, universal adapters with no unique molecular identifiers (Illumina TruSeq single indexes, Cat. No. 20015960) were used in the PECC-Seq libraries.
The input for library preparation of all samples was ∼400 ng of sheared DNA. Paired-end sequencing of the prepared shortened PCR-free DNA libraries was performed on the Illumina NovaSeq 6000 platform with read length of 2 × 150 bp using 8–40 Gb raw data per sample.
Sequencing data processing and mutation analysis
The sequencing data processing and mutation analysis in PECC-Seq were carried out following the detailed workflows described in our previous work (16). Briefly, raw reads were trimmed with the Trimmomatic software (v0.39) using the palindrome mode for adapter removal and then aligned to the mm10 reference genome with the Burrows-Wheeler aligner (BWA, v0.7.17). After alignment, reads were filtered with SAMtools (v1.9) and only properly mapped paired-end reads with a mapping quality score of 60 and no 5′-end soft clipping were retained. The mapping coordinates and mapping orientations of the paired-end reads served as endogenous tags to label the copies derived from a single DNA duplex. Paired-end reads with the same 5′ mapping coordinates and opposite mapping orientations were then grouped as complementary strand-derived paired-end reads. Overlapped sequences among the complementary paired-end reads were extracted to form consensus sequences for sequencing error correction using the R software (v1.2.1335). Any site with discordant bases was discarded and only sites with 4 identical bases were retained to create consensus bases. Furthermore, consensus bases located in the terminal 10 bp of inserted fragments were eliminated to reduce the impact of potential residual end-repair artifacts. A total of 5 × 10⁷–4 × 10⁸ error-corrected consensus bases were obtained per sample. Detailed code for sequencing data preprocessing, consensus sequence extraction and consensus base creation is available in our previous study (16).
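The grouping and consensus rules described above can be illustrated with a short Python sketch. This is not the authors' R implementation (which is available in reference 16); the ReadPair record, the duplex_consensus function and the input layout are hypothetical. The sketch also assumes, for simplicity, that the paired-end overlap spans the whole inserted fragment and that both overlap sequences are already expressed in reference orientation.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record for one filtered read pair."""
    chrom: str
    start: int   # leftmost mapped coordinate of the fragment (0-based)
    end: int     # rightmost mapped coordinate (exclusive)
    strand: str  # "+" or "-": mapping orientation, i.e. strand of origin
    read1: str   # overlap bases from read 1, in reference orientation
    read2: str   # overlap bases from read 2, in reference orientation


def duplex_consensus(pairs, trim=10):
    """Return {(chrom, pos): base} for sites where all four copies agree."""
    groups = defaultdict(dict)
    for p in pairs:
        # Mapping coordinates act as the endogenous tag; keep one pair per strand.
        groups[(p.chrom, p.start, p.end)].setdefault(p.strand, p)

    consensus = {}
    for (chrom, start, end), by_strand in groups.items():
        if set(by_strand) != {"+", "-"}:
            continue  # both complementary strands are required
        copies = [s for p in by_strand.values() for s in (p.read1, p.read2)]
        length = end - start
        if any(len(s) != length for s in copies):
            continue  # simplifying assumption: overlap covers the whole insert
        # Skip the terminal `trim` bases at both ends of the insert.
        for i in range(trim, length - trim):
            bases = {c[i] for c in copies}
            if len(bases) == 1 and "N" not in bases:
                consensus[(chrom, start + i)] = copies[0][i]
    return consensus
```

In this sketch, a site where any one of the four copies disagrees is simply dropped rather than resolved by majority vote, which mirrors the rule that only sites with 4 identical bases are retained.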
The resulting consensus bases were subjected to mutation analysis and the detected mutations were then confirmed in the IGV browser. Variants that were observed only in individual DNA templates were picked up as newly generated mutations. Candidate mutations were identified and validated in the IGV browser using the following criteria: (i) the variant occurred alone and was not observed in other reads covering the variant site; (ii) the variant was not observed in other samples (i.e. sequencing data from the same mouse and from other mice used in the experiment); (iii) when ≥2 variants were observed in the same consensus read, they were considered false positives and eliminated; (iv) low-quality consensus bases (i.e. those in which ≥2 raw bases had a Qscore <25) were discarded. The mutation frequencies were calculated by dividing the number of detected mutations by the total number of created consensus bases. The 96-trinucleotide profiles and transcriptional strand bias of detected mutations were plotted using the MutationalPatterns (v3.2.0) R package.
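As a minimal sketch of criteria (ii) and (iii) and of the mutation frequency calculation, the following Python fragment may be helpful. The function names and the tuple layout of a variant are assumptions made for illustration; criteria (i) and (iv) need read-level data and manual inspection in IGV and are not modelled here.

```python
from collections import Counter


def filter_candidates(variants, seen_elsewhere):
    """Apply criteria (ii) and (iii) to candidate variants.

    variants: list of (consensus_read_id, chrom, pos, ref, alt) tuples.
    seen_elsewhere: set of (chrom, pos, alt) observed in other samples.
    """
    per_read = Counter(v[0] for v in variants)
    kept = []
    for read_id, chrom, pos, ref, alt in variants:
        if per_read[read_id] >= 2:
            continue  # (iii) >=2 variants in one consensus read: false positive
        if (chrom, pos, alt) in seen_elsewhere:
            continue  # (ii) variant also observed in another sample
        kept.append((read_id, chrom, pos, ref, alt))
    return kept


def mutation_frequency(n_mutations, n_consensus_bases):
    """Detected mutations divided by the total number of consensus bases."""
    if n_consensus_bases <= 0:
        raise ValueError("number of consensus bases must be positive")
    return n_mutations / n_consensus_bases
```

For example, 7 validated mutations among 2 × 10⁸ consensus bases would correspond to a frequency of 3.5 × 10−8 per base pair; these numbers are purely illustrative and are not taken from the study.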
Error profiling with single strand consensus sequencing analysis
To profile the error patterns on the single-stranded templates of the PECC-Seq library, SSCS analysis was employed. Copied reads from the single strands of the duplex were used to create error-corrected consensus sequences. From the PECC-Seq data, the paired-end overlaps in the shortened DNA libraries and the single-stranded template-derived Exclusion Amplification duplicates (ExAmp duplicates) generated in the Illumina patterned flow cell (20, Hadfield, J. (2016) http://enseqlopedia.com/2016/05/increased-read-duplication-on-patterned-flowcells-understanding-the-impact-of-exclusion-amplification/, Steven, W. (2017) https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences/) of the NovaSeq 6000 platform were extracted for single-stranded sequencing error-correction (Supplementary Figures S1 and S2). In contrast to the PECC-Seq analysis, read pairs with the same 5′ mapping coordinates and the same mapping orientations were extracted from the filtered PECC-Seq sequencing data as ExAmp duplicate-derived paired-end reads, and the overlapped sequences were then grouped and used for consensus-making and further error analysis following the same protocol as PECC-Seq.
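The difference between the SSCS grouping and the duplex grouping lies in the tag: here, read pairs must share both their mapping coordinates and their mapping orientation. A minimal Python sketch under that reading is given below. The function name, the tuple layout and the min_pairs parameter are hypothetical, and masking discordant positions with "N" is an illustrative choice rather than the authors' documented procedure.

```python
from collections import defaultdict


def single_strand_consensus(pairs, min_pairs=2):
    """Build one consensus string per single-strand family.

    pairs: iterable of (chrom, start, end, strand, read1_overlap, read2_overlap),
    with overlap sequences given in reference orientation.
    """
    families = defaultdict(list)
    for chrom, start, end, strand, r1, r2 in pairs:
        # Same coordinates AND same orientation: copies of one single strand.
        families[(chrom, start, end, strand)].append((r1, r2))

    result = {}
    for key, members in families.items():
        if len(members) < min_pairs:
            continue  # no duplicate available for this strand
        copies = [seq for pair in members for seq in pair]
        # zip stops at the shortest copy; discordant positions are masked as "N".
        result[key] = "".join(
            col[0] if len(set(col)) == 1 else "N" for col in zip(*copies)
        )
    return result
```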
Guanine lesions identification with HPLC–MS/MS analysis
Oxidative guanine lesions 7,8-dihydro-8-oxoguanine (8-oxoG) and 2-aminoimidazolone (Iz) in the DNA templates were quantified with HPLC–MS/MS analysis. To reduce the hydrolysis of Iz, DNA was digested with 1 U DNase I, 2 U calf intestinal phosphatase, and 0.005 U snake venom phosphodiesterase I (New England Biolabs, MA, USA) at 5°C for 24 h, then subjected to HPLC/triple quadrupole mass spectrometry (HPLC-MS/MS) system. The HPLC separation was conducted with the reversed-phase Hypersil GOLD column (100 × 2.1 mm, 1.9 μm) on the Thermo TSQ Quantum Access Max system equipped with Accela U-HPLC (Thermo Fisher Scientific, MA, USA). The 1–15% gradient of acetonitrile in water (with 0.1% formic acid) was used as the mobile phase at a flow rate of 0.2 mL/min. Eluate from the HPLC column was directly introduced into the ESI-triple quadrupole mass spectrometer. Guanine lesions were detected using the positive ion mode and selective reaction monitoring (SRM) analysis with the settings of collision energy 15 eV, fragmentor voltage 90 V and capillary voltage 3500 V. The injection volume was 15.0–25.0 μl. The 8-oxoG and Iz were detected in the form of mononucleotides 8-oxodG and dIz by monitoring the transitions of m/z 284 → 168 and 229 → 113, respectively.
Gpt gene mutation assay
Kidney tissues from the AAI-administered gpt delta mice were subjected to the gpt gene mutation assay as reported in our previous study (21). High-molecular-weight genomic DNA of the kidney tissues was extracted using the RecoverEase DNA Isolation Kit (Agilent Technologies, CA, USA, Cat. No. 720202). Integrated lambda EG10 DNA in the genome was rescued as phages by in vitro packaging reaction using Transpack Packaging Extract (Agilent Technologies, CA, USA, Cat. No. 200220). The Escherichia coli strain YG6020 was then infected with the rescued phages carrying the gpt gene of E. coli and the gpt gene was converted into plasmids with the Cre recombinase expressed by YG6020. The bacteria harboring the plasmids carrying a mutated gpt gene can be positively selected with 6-thioguanine (6-TG, CAS No. 154-42-7, TCI Chemicals, Japan) and form bacterial colonies on plates containing 6-TG after being incubated at 37°C for 4 days. The gpt gene mutant frequencies were calculated by dividing the number of gpt mutants by the number of total rescued transgenes.
Statistical analysis
Statistical analysis was conducted using the R software (v4.1.1). A two-sided Student's t-test was used to compare two groups, and one-way ANOVA followed by a two-sided Dunnett's multiple comparison test was employed to compare multiple groups with the control group. For the dose-effect data of mutation frequencies generated from AAI-administered gpt delta mice, the lowest observed adverse effect level (LOAEL) and the quantitative Benchmark dose (BMD) approach were employed to estimate the mutagenic potency of AAI measured using PECC-Seq. The mutation frequencies were log10 transformed and a two-sided Dunnett's multiple comparison test was used to determine the LOAEL value (i.e. the lowest dose that induced significantly increased mutation frequencies). BMD analysis was conducted using the PROAST software (v70.0, https://r4eu.efsa.europa.eu/app/bmd) with a critical effect size (CES) of 60% to calculate the BMD confidence intervals (CIs) for the induced mutations (22). The cosine similarity values between the 96-trinucleotide profiles and the Catalogue of Somatic Mutations in Cancer (COSMIC) signatures (v3.3) were calculated using the MutationalPatterns R package. A two-sided Poisson test was used for the statistical analysis of the transcriptional strand bias. A value of P < 0.05 was considered statistically significant. Data are expressed as mean ± SEM. The numbers of biological or technical replicates are shown in the figure legends. P-values are indicated in the figures as * P < 0.05, ** P < 0.01 and *** P < 0.001.
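For readers unfamiliar with the cosine similarity metric used to compare mutational profiles, a minimal Python sketch is given below. The study itself used the MutationalPatterns R package for this calculation; the function here is a generic, hypothetical re-implementation that treats each 96-trinucleotide profile as a vector of non-negative counts or proportions.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two equally long mutation profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(x * y for x, y in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(x * x for x in profile_a))
    norm_b = math.sqrt(sum(y * y for y in profile_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("profiles must contain at least one non-zero value")
    return dot / (norm_a * norm_b)
```

Because the metric depends only on the direction of the two vectors, a profile of raw counts and the same profile rescaled to proportions give the same similarity to a reference signature. For non-negative profiles, the value ranges from 0 (no shared channels) to 1 (identical relative spectra).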
Results
Characterization of sequencing artifacts derived from NGS library preparation
Building on DCSS, our previously developed PECC-Seq utilized shortened PCR-free DNA libraries and capitalized on paired-end overlaps derived from complementary strands of DNA duplex to correct sequencing errors (Figure 1A, top), resulting in 10−7–10−6 errors per base pair (Figure 1B) (16). Compared with the expected spontaneous mutation frequencies in mammalian cells (i.e. <1 × 10−7), the error rates were higher, suggesting the existence of sequencing artifacts at low levels, especially in C:G base pairs (Figure 1B). Here, to eliminate these artifacts and improve the accuracy of PECC-Seq for somatic mutation detection, the origins of residual sequencing artifacts were clarified.
Residual artifacts in PECC-Seq mainly featured CG > GC and CG > AT variants, showing high similarity to the end-repair artifacts trimmed from the ends of the library (Figure 1C, cosine similarity of 0.79). We reasoned that these residual artifacts might have originated from the end-repair process following ultrasonic DNA shearing during library preparation (16). In the error-prone end-repair process of sequencing library preparation, misincorporation of dNTPs opposite the mutagenic single-stranded DNA lesions can fix the lesions as double-stranded errors that escape correction within the DCSS (Figure 1D). Therefore, to verify our speculation regarding the end-repair origin of artifacts, clues about mutagenic DNA lesions and misincorporation of dNTPs were sought in single-stranded templates. We utilized copies from the single strands of the PECC-Seq library to create single strand consensus sequences (i.e. SSCS analysis) for profiling error patterns in single strands. The SSCS analysis used paired-end overlaps from the PECC-Seq library and single-strand-derived ExAmp duplicates generated from the Illumina patterned flow cells (20, Hadfield, J. (2016) http://enseqlopedia.com/2016/05/increased-read-duplication-on-patterned-flowcells-understanding-the-impact-of-exclusion-amplification/, Steven, W. (2017) https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences/) to provide error-corrected variant calling of single strands (Figure 1A, bottom; see Supplementary Figures S1 and S2 for additional details on the characteristics of ExAmp duplicates and principle of the SSCS analysis).
As the ExAmp duplicates were from the same starting single-stranded templates, the paired-end consensus bases between ExAmp duplicates should be consistent. Mismatches in the paired-end consensus bases between ExAmp duplicates, which mainly reflected the single-stranded errors from cluster generation and subsequent sequencing, featured abundant A > G and T > C errors (Figure 1E and Supplementary Figure S3). This suggested that the remaining artifacts in C:G base pairs did not originate from these random sequencing errors. In contrast, distinct error patterns were observed in single-stranded templates using SSCS analysis. G > T variants, which indicated the oxidative 8-oxoG damage generated by ultrasonic DNA shearing (23,24), accounted for most single-stranded lesions, as expected (Figure 1F). An imbalance between the G > C variants and their counterpart C > G variants was observed across the single-stranded templates. G > C variants occurred more frequently in the 5′-end of the templates, while C > G variants aggregated in the 3′-end (Figure 1F and Supplementary Figure S4). The 3′-overhangs are always trimmed with 3′ to 5′ exonuclease while the 5′-overhangs and gaps are filled with DNA polymerase during the end-repair process (Figure 1D). Hence, misincorporation occurs more frequently in the 3′-end of complementary single strands opposite the 5′-overhangs. In other words, elevated levels of C > G variants in the 3′-end indicated patterns of base misincorporation during the end-repair process, whereas the counterpart G > C errors may reflect the error preference for some type of guanine lesions. The unexpected CG > GC artifacts in PECC-Seq were suggested to be fixed through misincorporation of guanine opposite the guanine lesions located at single-stranded sites during the end-repair step.
Contrary to 8-oxoG and the resultant G > T errors, little is known regarding the guanine lesions that result in G > C variants in NGS. Formamidopyrimidine[fapy]-DNA glycosylase (Fpg) is an 8-oxoG DNA glycosylase that can release 8-oxoG lesions and reduce G > T errors in NGS (23,24). With Fpg treatment, approximately 80% of the CG > AT variants in PECC-Seq were eliminated, while the remaining CG > GC variants were resistant to Fpg digestion (Supplementary Figure S5C). This indicated that the observed G > C variants were products of guanine lesions other than 8-oxoG. A less common oxidative product of guanine, Iz, can lead to mismatches mainly with Iz:G, thus inducing CG > GC transversion. To confirm the presence of Iz in ultrasonically sheared samples, HPLC-MS/MS analysis was performed (Supplementary Figure S6). Ultrasonically sheared DNA templates contained significantly elevated levels of Iz and 8-oxoG, and guanine lesions increased with an extended ultrasonic shearing duration (Figure 1G and H). Application of the single-strand specific S1 nuclease during the end-repair step could cleave the single-stranded sites usually filled with DNA polymerase, thus efficiently reducing CG > GC artifacts (Supplementary Figure S5E). Consistent with this, after S1 nuclease treatment, the elevated levels of Iz in the ultrasonically sheared templates were significantly reduced to the background levels (Figure 1I). Overall, the ultrasonic shearing process caused Iz formation in addition to 8-oxoG, particularly at the single-stranded sites of the templates.
These results indicated that the ultrasonic shearing process could introduce abundant guanine lesions to the single-stranded sites of DNA templates, resulting in fixed end-repair artifacts in C:G base pairs. Oxidative 8-oxoG lesions mainly contributed to CG > AT errors, whereas other types of guanine lesions, such as Iz, contributed to CG > GC errors.
Optimized PECC-seq approach for genome-wide somatic mutation detection
To mitigate the adverse effects of guanine lesion-derived sequencing artifacts in low-abundance mutation detection, the PECC-Seq method was modified to enhance its accuracy in base calling. Based on the formation of end-repair artifacts, library preparation protocols that reduced ultrasonic DNA lesions (e.g. reducing shearing duration, using antioxidative Tris–EDTA buffer for DNA shearing, and applying Fpg to digest 8-oxoG lesions) and inhibited the misincorporation during the end-repair step (e.g. using high-fidelity T4 DNA polymerase for end-repair and utilizing S1 nuclease to digest end-repaired single-stranded sites) could decrease sequencing artifacts in PECC-Seq (Supplementary Figure S5). Fpg treatment could reduce approximately 80% of the CG > AT errors but had no effects on the CG > GC errors (Supplementary Figure S5C). Using high-fidelity T4 DNA polymerase for end-repair eliminated half of the artifacts in C:G base pairs while the sequencing yields were reduced (Supplementary Figure S5D). Among the test conditions, S1 nuclease treatment exhibited the best performance to efficiently remove the residual artifacts in PECC-Seq (Supplementary Figure S5E).
As shown in Figure 2A, to enhance the accuracy, we employed milder ultrasonic fragmentation conditions (i.e. DNA shearing in Tris–EDTA buffer with restricted duration) and S1 nuclease treatment before the standard end-repair process in the PECC-Seq library preparation protocol to reduce the formation of single-stranded DNA damage and remove single-stranded overhangs or potential gaps in the templates, respectively. Combined with the adjusted bead-based fragment size selection procedure, this protocol produced ∼150 bp shortened PCR-free libraries with modest artifacts. The shortened sequencing libraries ensured the generation of overlaps between the 150 bp read pairs after 2 × 150 bp paired-end sequencing, that is, two copied sequences with identical information from the single-stranded templates. Since the libraries were prepared using the PCR-free protocol, the original duplex DNA templates were denatured and both single strands of the duplex DNA were then directly loaded into the flow cells for sequencing. The single-stranded library molecules would randomly occupy the nanowells, generate sequencing clusters and be sequenced. For a fraction of the duplex templates, both derived single strands would be sequenced in the same flow cell. Complementary strands from the duplex DNA templates shared the same fragmentation points. Hence, pairs of paired-end reads derived from both strands of the DNA templates were captured using the mapping coordinates and mapping orientations as endogenous tags. By utilizing the overlapped sequences from complementary DNA templates, four naturally generated copies of the original DNA templates could be obtained for sequencing error-correction (Figure 2A, red frame). Any site with discordant bases among the four copied sequences was considered erroneous and eliminated. The identical bases in the copied sequences were retained to create error-corrected consensus bases.
Furthermore, to reduce the effects of the remaining end-repair artifacts, consensus bases from the terminal 10 bp of the templates were discarded (Supplementary Figure S7).
Figure 2.
Principle and general performance of the optimized PECC-Seq analysis. (A) Optimized PECC-Seq protocol. Shortened PCR-free libraries (∼150 bp) were constructed with optimized ultrasonic shearing and adjusted size selection procedures. S1 nuclease treatment was applied to cleave the single-stranded regions to reduce end-repair artifact formation in the subsequent standard end-repair step. Sequencing was performed on Illumina 2 × 150 bp paired-end sequencing platforms. With the mapping coordinates and mapping orientations as endogenous tags, overlapped sequences from the shortened duplex DNA template-derived paired-end reads were identified and grouped for sequencing error-correction. (B) Error rate estimates of PECC-Seq (n = 12, 2 and 10 technical replicates, respectively). Before the method optimization, consensus reads in PECC-Seq exhibited error rates of 10−7–10−6 but remained abundant in end-repair artifacts. The modified PECC-Seq protocol further reduced the variants on the damaged templates and lowered the error rates to 10−8–10−7. (C) Mutational spectra detected with PECC-Seq before and after removing the end-repair artifacts. Relative contributions of the six base substitution types were plotted. With the optimized PECC-Seq protocol, most of the end-repair artifacts in C:G base pairs, especially CG > GC transversions, were efficiently removed. Variants detected from 2 and 10 technical replicates were merged for plotting the spectra, respectively.
PECC-Seq was applied to genome-wide detection of spontaneous somatic mutations in mouse models. By integrating the experimental modifications of the library preparation procedure into DCSS, PECC-Seq reduced the error rates of conventional NGS in variant calling from 10−3–10−2 to <3.46 (± 0.98) × 10−8, as shown in the whole-genome sequencing data of mouse liver and kidney tissues (Figure 2B). The error rates of the modified PECC-Seq were approximately 10-fold lower than those before modification (10−7–10−6, Figure 2B) and comparable with the spontaneous mutation frequencies previously reported in normal mouse tissues (3). Excess sequencing artifacts in C:G base pairs were efficiently removed by S1 nuclease treatment before the end-repair process (Figure 2C). In addition, because PECC-Seq is based on whole-genome sequencing of PCR-free libraries, it can theoretically provide genome-wide coverage. More than 80% of genomic sites were covered, as shown by the 3× consensus bases merged from mouse kidney tissues. Overall, PECC-Seq exhibited high accuracy in somatic mutation detection and is suited to applications requiring direct quantification of genome-wide low-abundance somatic mutations.
Sensitivity of PECC-Seq in the quantification of in vivo mutagenesis
We applied PECC-Seq to measure chemically-induced somatic mutations. Classical TGR gene mutation assays exhibit high accuracy and sensitivity in estimating the somatic mutagenesis induced by xenobiotics. In TGR assays, exogenous reporter genes integrated in the genomic DNA of transgenic animals are recovered as phages by in vitro phage packaging reactions and transfected into E. coli. Mutations in the reporter genes are then detected via phenotypic selection within the bacterial host. Here, to assess the sensitivity of PECC-Seq in somatic mutagenesis quantification, we compared the frequencies of chemically-induced mutations measured by PECC-Seq and a standard TGR mutation assay (i.e. the gpt gene mutation assay) in gpt delta transgenic mice.
Male gpt delta transgenic mice were administered graded doses of the mutagen AAI, following the Organisation for Economic Co-operation and Development (OECD) Guidelines for the Testing of Chemicals TG488, to induce graded levels of mutations in the kidney (OECD iLibrary, https://doi.org/10.1787/9789264203907-en). Genome-wide mutation frequencies per base pair in kidney tissues were measured using PECC-Seq. The background mutation frequency in the control group was 3.77 (± 0.85) × 10−8 per base pair. AAI administration induced dose-dependent increases in mutation frequencies, as shown by PECC-Seq (Figure 3A). A significant increase in the mutation frequencies of AAI-treated kidney tissues was observed even in the lowest dose group (i.e. 0.125 mg/kg/day).
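As a minimal sketch of the quantities involved, the mutation frequency per base pair can be taken as the number of detected mutations divided by the number of error-corrected consensus bases, and induction can be expressed as a fold-change over the control group. The function names are hypothetical, and whether the reported "±" values are sample standard deviations is an assumption made here for illustration only.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mutation_frequency(n_mutations: int, n_consensus_bases: int) -> float:
    """Mutations per error-corrected consensus base pair."""
    if n_consensus_bases <= 0:
        raise ValueError("n_consensus_bases must be positive")
    return n_mutations / n_consensus_bases


def group_summary(frequencies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across biological replicates
    (assumed summary statistic; requires at least two replicates)."""
    return mean(frequencies), stdev(frequencies)


def fold_change(treated_mean: float, control_mean: float) -> float:
    """Induced mutation frequency relative to the control group."""
    if control_mean <= 0:
        raise ValueError("control_mean must be positive")
    return treated_mean / control_mean
```

For example, four mutations among 10⁸ consensus bases correspond to a frequency of 4 × 10−8 per base pair.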
Figure 3.
AAI-induced mutations in the kidneys of gpt delta transgenic mice measured using PECC-Seq. (A) Mutation frequencies induced by gradient doses of AAI (n = 3 biological replicates in each group except n = 2 in the group of 0.5 mg/kg/day; mutations from two technical replicates of each independent biological sample were merged). The dots indicate the mutation frequencies of each biological replicate. The lines indicate the mean values of mutation frequencies. (B) Mutational spectra induced by high doses of AAI. A total of 44, 138, 494 and 929 mutations merged from the control, 1 mg/kg/day AAI, 2 mg/kg/day AAI and 4 mg/kg/day AAI groups (n = 3, 2, 3 and 3 biological replicates) were used for the profiling of the mutational signatures, respectively. The labels indicate the cosine similarities to the known AA-related mutational signature in COSMIC (Signature 22). The right bar plots indicate the transcriptional strand bias on the T > A transversions.
The gpt assay utilizes the integrated bacterial gpt gene as the reporter gene to indicate genome-wide mutagenesis of target tissues in gpt delta transgenic mice. Mutation frequencies determined using PECC-Seq displayed trends similar to those determined using the gpt assay in our previous study (21). A strong correlation in the fold-changes of induced mutation frequencies between the two assays was observed (R² = 0.94), although the fold-changes measured using PECC-Seq were higher than those of the gpt assay in the low-dose groups (i.e. ≤0.5 mg/kg/day, Figure 4A). The LOAEL determined with the gpt assay (0.25 mg/kg/day) was higher than that determined with PECC-Seq (0.125 mg/kg/day). Furthermore, the BMD approach was applied to quantitatively estimate the mutagenic potency of AAI determined by the two assays. As shown in the BMD model, the BMD CIs determined using PECC-Seq were comparable with those determined using the gpt assay (Figure 4B, C, Supplementary Figure S8 and Supplementary Table S3). With the CES set at 60% (i.e. a 60% increase in the mutation frequencies compared to the control group) (22), the BMD60 values were 0.053–0.165 mg/kg/day for PECC-Seq and 0.089–0.291 mg/kg/day for the gpt assay. AAI mainly induces TA > AT transversions; thus, the BMD values for the induced TA > AT mutations were also calculated using the PECC-Seq data. Compared with the BMD values from the total mutation frequencies, the BMD60 of TA > AT mutations was much lower, estimated at 0.0044–0.0365 mg/kg/day (Figure 4C). Taken together, PECC-Seq yielded a sensitivity comparable to that of the conventional in vivo gpt gene mutation assay.
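The meaning of a BMD at a given CES can be illustrated with a deliberately simplified sketch. It assumes a two-parameter exponential dose-response model, y = a·exp(b·d), fitted by least squares on the log-transformed response; under this model the dose producing a fractional increase CES over background is BMD = ln(1 + CES)/b. This is not the PROAST analysis used in the study, which fits a richer family of models and derives confidence limits, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(doses: Sequence[float], responses: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ln(y) = ln(a) + b * d; returns (a, b)."""
    if len(doses) != len(responses) or len(doses) < 2:
        raise ValueError("need at least two paired observations")
    logs = [math.log(r) for r in responses]        # responses must be positive
    n = len(doses)
    mean_d = sum(doses) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in doses)
    if sxx == 0:
        raise ValueError("doses must not all be identical")
    b = sum((d - mean_d) * (l - mean_l) for d, l in zip(doses, logs)) / sxx
    a = math.exp(mean_l - b * mean_d)
    return a, b


def benchmark_dose(b: float, ces: float = 0.6) -> float:
    """Dose at which the fitted response is (1 + ces) times the background a."""
    if b <= 0:
        raise ValueError("slope must be positive for an increasing response")
    return math.log(1.0 + ces) / b
```

Under this simplified model the background level a cancels out, so the BMD depends only on the fitted slope and the chosen CES.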
Figure 4.
Mutagenic potency of AAI determined using PECC-Seq analysis and the gpt gene mutation assay. (A) Fold-changes of the increased mutation frequencies measured using PECC-Seq (green dots, n = 3 biological replicates except for n = 2 in the group of 0.5 mg/kg/day) and the gpt assay in the AAI-administered kidney tissues. Results of gpt gene mutation frequencies were obtained from our previous study (red dots; n = 5 biological replicates in the control group; n = 6 in the AAI-exposed groups) (21). (B) Dose-effect curve fitting of the PECC-Seq results with the BMD approach using PROAST software. (C) BMD values and two-sided 90% CIs (i.e. lower and upper confidence limits on the BMD) of the mutations induced by AAI as determined with PECC-Seq and the gpt assay. A CES of 60% was used to calculate the BMD values.
Mutagen-specific base substitution signatures characterized by PECC-Seq
To test its specificity in characterizing base substitution signatures induced by environmental agents, PECC-Seq was used to measure the diverse mutation spectra of several known mutagens including AAI, BaP, Quinoline, ENU and DEN.
AAI induces TA > AT transversions almost exclusively through the formation of AA-adenine adducts. As shown in Figure 3A, most of the induced mutations detected using PECC-Seq were TA > AT, especially in the high-dose groups. Signature 22 from the COSMIC catalog of somatic mutations in human cancers is associated with aristolochic acid exposure (25). The 96-trinucleotide profiles obtained from the mutations detected with PECC-Seq exhibited strong similarities to COSMIC Signature 22 (cosine similarities of 0.81, 0.95 and 0.96 in the 1, 2 and 4 mg/kg/day groups, respectively; Figure 3B). AA-adenine adducts on transcribed strands can be repaired more frequently by transcription-coupled nucleotide excision repair (TC-NER). Consistently, more T > A mutations were observed in the transcribed strands (i.e. more A > T mutations in the non-transcribed strands) (Figure 3B). BaP exposure predominantly induced CG > AT transversions, owing to the bulky BaP adducts formed on guanine residues (Figure 5A). Transcriptional strand bias in CG > AT transversions indicated the involvement of TC-NER in the repair of BaP adducts. The mutational spectrum of BaP from PECC-Seq analysis was most similar to that of COSMIC Signature 4 (cosine similarity of 0.92), which is associated with tobacco smoking (Figure 5B) (25). As alkylating agents, ENU and DEN displayed similar mutational signatures (cosine similarity of 0.97), predominantly TA > AT and TA > CG, reflecting the formation of O²-alkylthymine and O⁴-alkylthymine, respectively (Figure 5A). The 96-trinucleotide profile of ENU was similar to the signatures determined in iPSC and TGR models (cosine similarities of 0.88 and 0.91, respectively) (3,26). Quinoline is a hepato-mutagen that mainly induces CG > GC transversions, as reported in the TGR model (27). Consistent with the TGR mutation assays, PECC-Seq detected Quinoline-induced mutations with dominant CG > GC transversions (43.2 versus 12.5% in the control group; Figure 5A). The mutation data determined using PECC-Seq also exhibited transcriptional strand bias in CG > GC transversions. Thus, PECC-Seq revealed mutational spectra that were specific to each mutagen and consistent with previous reports.
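The cosine similarities quoted above compare two 96-channel trinucleotide profiles treated as vectors. A minimal sketch of the calculation is given below; the function name is hypothetical, and the two profiles are assumed to list the 96 channels in the same order.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between two mutational profiles.

    Profiles may be raw counts or relative contributions; the measure is
    insensitive to overall scale, so no normalization is needed beforehand.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of channels")
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("profiles must contain at least one non-zero channel")
    return dot / (norm_p * norm_q)
```

For non-negative profiles the value ranges from 0 (no shared channels) to 1 (identical shape), which is why values such as 0.95 are read as strong agreement with a reference signature.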
Figure 5.
Mutagen-specific mutational signatures revealed with PECC-Seq. (A) Mutational signatures, including 96-trinucleotide profiles, transcriptional strand bias and six base substitution subtypes of mutations induced by mutagens BaP, DEN, ENU and Quinoline (n = 3 biological replicates in mutagen-exposed groups, n = 6 in the control group of C57BL/6J mice and n = 3 in the control group of DBA/2 mice; a total of 164, 288, 3063, 1250 and 162 mutations were used for the profiling of the mutational signatures, respectively). The mutation frequencies were 5.30 × 10−7 (8.43-fold versus control), 7.49 × 10−6 (119.02-fold), 3.10 × 10−6 (49.29-fold) and 1.98 × 10−7 (4.07-fold) in the livers of BaP-, DEN-, ENU- and Quinoline-administered animals, respectively. (B) Heatmap of the cosine similarities between the mutagen-specific mutational signatures and the COSMIC signatures.
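The 96-trinucleotide profiles referred to in this caption follow the common convention of reporting each substitution from the pyrimidine of the mutated base pair, together with its 5′ and 3′ neighbours. A minimal sketch of that classification is shown below; the function names are hypothetical and the input is assumed to be the reference trinucleotide on the forward strand with the mutated base in the middle. In this notation the base-pair change written as CG > GC in the text corresponds to the channel type C>G.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trinucleotide_channel(context: str, alt: str) -> str:
    """Map a substitution to one of the 96 pyrimidine-centred channels."""
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1:
        raise ValueError("expected a 3-base context and a 1-base alternate allele")
    if set(context + alt) - set("ACGT") or alt == context[1]:
        raise ValueError("bases must be A/C/G/T and alt must differ from the reference")
    if context[1] in "AG":
        # Purine reference: report the change from the complementary strand.
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"
```

For instance, a G to C change in the context TGA and a C to G change in the context TCA both fall into the channel T[C>G]A, so the 192 possible strand-specific changes collapse into 96 channels.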
Discussion
In this study, we characterized end-repair artifacts in DCSS and presented an optimized PECC-Seq protocol with error rates of less than five errors per hundred million base pairs for low-abundance somatic mutation detection. As a high-accuracy ecNGS approach, PECC-Seq can precisely detect genome-wide spontaneous and chemically-induced somatic mutations. Here, we demonstrate that PECC-Seq is a promising approach for measuring somatic mutations and for in vivo mutagenicity assessment.
PECC-Seq enables highly accurate detection of low-abundance somatic mutations. The measured background mutation frequencies in normal mouse tissues suggested an error rate of <5 × 10−8 in PECC-Seq. The TGR assays can accurately measure somatic mutations at the single-gene level and serve as the ‘gold-standard’ in quantifying chemically-induced somatic mutagenesis. In this study, we employed the gpt gene mutation assay as a standard for comparison. Compared with the gpt assay, PECC-Seq displayed a consistent trend and similar sensitivity in revealing the dose-dependent AAI-induced mutations. In other words, the mutation frequencies of the control group measured with PECC-Seq were close to the real background mutation frequencies. Duplex sequencing is the most accurate ecNGS approach, with a theoretical error rate of <10−9. Valentine et al. have reported background mutation frequencies in mouse tissues of 1.63 × 10−7 (marrow), 1.06 × 10−7 (peripheral blood), 9.63 × 10−8 (liver), 7.13 × 10−8 (lung) and 7.45 × 10−8 (spleen) with duplex sequencing (3). The mean mutation frequencies in mouse tissues measured using PECC-Seq were 2.78 × 10−8 (liver of gpt delta mice), 3.75 × 10−8 (kidney of gpt delta mice), 6.19 × 10−8 (liver of C57BL/6J mice) and 5.48 × 10−8 (liver of DBA/2 mice). Background mutation frequencies from PECC-Seq were comparable to, but slightly lower than, those measured with duplex sequencing. This may be because the duplex sequencing data were measured from several target genes, whereas the PECC-Seq results were measured from whole-genome sequencing data. Taken together, PECC-Seq can accurately identify spontaneous somatic mutations with low error rates. However, whether any sequencing artifacts remain among the identified mutations, and what the precise error rates of PECC-Seq are, requires further study.
In contrast to other ecNGS methods, PECC-Seq is performed using a PCR-free library preparation strategy that ensures whole-genome coverage for mutation detection. Compared with NanoSeq (∼30% coverage of the genome with the restriction enzyme-based fragmentation protocol) (2) and duplex sequencing (usually for targeted genome regions) (3,18,19), more than 80% of the genome sites can be covered with PECC-Seq. Hence, PECC-Seq may be more suitable for characterizing mutations at the whole-genome scale, whereas PCR-based ecNGS methods (e.g. duplex sequencing) can better reveal mutational signatures in target gene loci of interest.
Emerging ecNGS approaches have great potential to revolutionize the paradigm of somatic mutagenesis assessment. Recently, an expert working group has been convened by the Genetic Toxicology Technical Committee of the Health and Environmental Sciences Institute to investigate the application of ecNGS approaches to mutagenicity and carcinogenicity testing (1). Conventional gene mutation assays are, essentially, a class of indirect mutation detection methods that use the phenotypic changes on reporter genes as the surrogate of genome-wide mutagenesis (9,10, OECD iLibrary, https://doi.org/10.1787/9789264203907-en). The inherent properties of the TGR mutation assays suggest that these assays are restricted to specific genomic loci and a modest number of animal models, resulting in limitations that impede the progress in the field of genetic toxicology. Compared with TGR mutation assays, high-accuracy ecNGS approaches, such as PECC-Seq, enable direct mutation detection in any tissue from any species and reveal unbiased genome-wide mutational landscapes rather than mutational signatures in single reporter genes. Characterization of mutational signatures using ecNGS approaches can provide additional mechanistic insights into the biological relevance. Furthermore, these ecNGS approaches can be easily integrated into other genotoxic assays using common testing models.
In quantifying the dose-dependent effect of AAI-induced mutations, PECC-Seq showed a sensitivity similar to that of the standard gpt gene mutation assay but revealed more obvious changes in the AAI-specific TA > AT transversions in the low-dose groups (≤0.5 mg/kg/day). Our results from the BaP- and Quinoline-exposed groups also displayed comparable fold-increases in mutation frequencies to the data reported with duplex sequencing and TGR mutation assays. BaP exposure at 50 mg/kg/day for 28 days caused 8.43-fold, 8.58-fold and 10.81-fold increases in mutation frequencies as measured using PECC-Seq, duplex sequencing and the TGR mutation assay with the cII gene, respectively (3). Quinoline exposure at 50 mg/kg/day for 4 days caused a 4.07-fold increase in mutation frequencies determined by PECC-Seq and induced 3.76-fold and 7.39-fold increases in the mutation frequencies measured using the LacZ and cII genes in MutaMouse, respectively (27,28).
Although TGR mutation assays are highly accurate in measuring mutagenesis, several inherent limitations may hamper their precision. In TGR mutation assays, the reporter bacterial transgenes (e.g. lacZ, lacI, cII or gpt) are usually highly methylated, which increases spontaneous C > T mutations due to the deamination of 5-methylcytosine (29). This may explain the results observed in the low-dose AAI-exposed groups, in which PECC-Seq exhibited higher fold-changes than the gpt assay. Since gene mutation assays identify mutants by means of phenotypic selection, mutations without phenotypic changes (e.g. synonymous mutations) go undetected, and mutation frequencies are therefore underestimated. As shown in this study, approximately one-fourth of the detected exonic mutations were synonymous (Supplementary Table S4). In addition, unrepaired DNA lesions or DNA adducts can be fixed as mutations during the expression of reporter genes in bacterial hosts, which may lead to an overestimation of mutation frequencies (3). In contrast to TGR mutation assays, ecNGS approaches directly measure fixed mutations in the genome and can thus overcome the overestimation or underestimation of mutations in reporter gene assays. Taken together, our results suggest that PECC-Seq can provide comparable but more precise estimates of in vivo mutagenesis than the standard TGR mutation assays.
The ability to reveal the mutational signatures of xenobiotics at the whole-genome scale is a further advantage of PECC-Seq. TGR mutation assays can measure mutational spectra by sequencing the recovered mutants (9). However, because TGR mutation assays are limited to reporter genes, the narrow trinucleotide repertoires of these genes make it difficult to obtain unbiased mutational spectra, let alone features such as genomic location and transcriptional strand bias. In this study, we characterized the typical mutational spectra associated with several known mutagens and revealed the genome-wide mutational signature of Quinoline. The trinucleotide profiles were highly similar to signatures obtained from previous computational or experimental approaches (cosine similarities of 0.96, 0.92 and 0.91 in AAI-, BaP- and ENU-signatures, respectively). In addition, gene annotations revealed significant transcriptional strand biases of exonic mutations in the AAI-, BaP-, ENU- and DEN-exposed groups, which reflected the involvement of TC-NER in the repair of DNA adducts or alkylated bases. Hence, PECC-Seq not only revealed more general genome-wide mutational signatures of xenobiotics compared with conventional TGR mutation assays, but also provided details for investigating the underlying mechanisms of mutagenesis.
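One way to examine a transcriptional strand bias of the kind discussed above is an exact two-sided binomial test of the mutation counts on the two strands against an expected 1:1 ratio. The text does not state which statistical test was used, so the choice of test, the function name and the equal-probability null hypothesis are all assumptions of this sketch.

```python
from math import comb


def strand_bias_p_value(n_transcribed: int, n_untranscribed: int) -> float:
    """Exact two-sided binomial p-value for H0: both strands equally likely.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Intended for modest counts (up to a few hundred mutations),
    because the exact terms are computed in floating point.
    """
    if n_transcribed < 0 or n_untranscribed < 0:
        raise ValueError("counts must be non-negative")
    n = n_transcribed + n_untranscribed
    if n == 0:
        raise ValueError("at least one mutation is required")
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[n_transcribed]
    # Small relative tolerance so that symmetric outcomes count as equally likely.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)
```

With ten mutations all on one strand, the sketch gives p = 2/1024, whereas an even 5:5 split gives p = 1.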
Cost is another consideration when applying DCSS approaches to mutagenicity testing. Since redundant copied reads are utilized for sequencing error-correction, the sequencing cost is usually much higher for ecNGS than for conventional NGS (4). By capitalizing on the paired-end overlaps rather than the PCR copies generally used in other DCSS approaches, PECC-Seq minimizes the copies utilized and thus improves consensus-making efficiency. As shown by the current data from the NovaSeq 6000 platform, approximately 120 raw bases were required to create an error-corrected consensus base. Considering the spontaneous mutation frequencies detected in mammalian cells (10−8–10−7), at least 15 Gb of sequencing data from the NovaSeq 6000 platform could provide sufficient error-corrected bases (approximately 10⁸ consensus bases) for quantifying the in vivo mutagenesis induced by weak mutagens. The recovery of complementary strands from duplex templates differs among Illumina sequencing platforms (16), so the minimum number of raw bases required for consensus-making may differ several-fold. With the recent reduction in the cost of NGS, the cost of PECC-Seq has decreased and is now comparable with that of TGR mutation assays.
In addition to its application to mutagenesis assessment, identifying low-abundance somatic mutations and mutational signatures using ecNGS is of great significance in cancer research. The cancer genome is a faithful record of the mutagenic processes one has experienced throughout the lifecycle (30). ecNGS can unravel the mutational signatures caused by specific agents or mutagenic processes in controlled experimental studies, thus providing clues for understanding the origins of cancer. Somatic mutations accumulate with age, carcinogenic exposure and DNA repair deficiency. In normal tissues, an increased mutation burden and distinctive mutational signatures associated with carcinogenic processes can act as early indicators for cancer risk assessment. For clinical diagnostic testing, the single-DNA-molecule resolution of ecNGS enables highly sensitive detection of oncogenic driver mutations, even those present in a very small fraction of cells. Duplex sequencing can detect variants with allele frequencies of <10−4 (3). Recently, several panel-based ecNGS approaches have been applied to detect driver mutations in acute myeloid leukemia at allele frequencies as low as 0.05–0.1% (31–33). To date, the application of ecNGS in clinical diagnostics remains limited. Because ecNGS is a new strategy, further studies are required to assess its clinical applicability.
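The sequencing budget described in this paragraph can be reproduced with simple arithmetic. The sketch below assumes that "15 Gb" denotes 15 × 10⁹ raw bases and uses the reported figure of approximately 120 raw bases per consensus base; the function names are hypothetical.

```python
def consensus_yield(raw_bases: float, raw_per_consensus: float = 120.0) -> float:
    """Expected number of error-corrected consensus bases from raw output."""
    if raw_per_consensus <= 0:
        raise ValueError("raw_per_consensus must be positive")
    return raw_bases / raw_per_consensus


def expected_mutations(consensus_bases: float, mutation_frequency: float) -> float:
    """Expected number of mutations observed at a given per-base frequency."""
    return consensus_bases * mutation_frequency


# 15 Gb of raw data at ~120 raw bases per consensus base.
budget = consensus_yield(15e9)
low = expected_mutations(budget, 1e-8)    # lower end of spontaneous frequencies
high = expected_mutations(budget, 1e-7)   # upper end of spontaneous frequencies
```

Under these assumptions, 15 Gb yields about 1.25 × 10⁸ consensus bases, corresponding to roughly 1 to 13 expected spontaneous mutations per sample across the 10−8–10−7 range; an induced increase over such a background is what the assay needs to resolve.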
Our findings contribute to a better understanding of the molecular basis of end-repair artifact formation in NGS. Oxidative DNA lesions caused by mechanical DNA shearing during library preparation are a major source of sequencing artifacts and may introduce bias into mutation calls. Several studies have revealed causality between oxidative 8-oxoG lesions and subsequent CG > AT artifacts (23,24). We have previously observed, after error correction with DCSS, that CG > GC artifacts were more dominant than CG > AT errors in PECC-Seq as residual sequencing artifacts specific to the ultrasonically sheared DNA libraries (16). Similar excess CG > GC errors were also observed in several other duplex consensus sequencing-based ecNGS methods, including duplex sequencing, BotSeqS and Hawk-seq (2,13–15,18). However, the molecular basis of these CG > GC artifacts in NGS remains to be investigated. In this study, we found that guanine lesions other than 8-oxoG contributed to CG > GC artifacts. Although G > T errors from 8-oxoG lesions were much greater than G > C errors in single-stranded templates, the error rates of fixed CG > GC artifacts were higher than those of CG > AT errors as shown by PECC-Seq and SSCS analysis. These results suggest that guanine lesions that induce G > C errors occur more frequently in single-stranded templates that require further error-prone end-repair with DNA polymerases (i.e. gaps and 5′-overhangs). A few oxidative guanine lesions (e.g. Iz, guanidinohydantoin and spiroiminodihydantoin) have been reported to cause G > C transversion (34–36). We observed increased levels of Iz in the ultrasonically sheared templates, especially in the single-stranded sites of double-stranded DNA. Hence, Iz can at least partially explain ultrasonic shearing-induced CG > GC artifacts observed in ecNGS. 
Our results showed that adding the antioxidant agent ethylenediaminetetraacetic acid (EDTA) before ultrasonication reduced the CG > AT artifacts by approximately half and the CG > GC artifacts by 25% (Supplementary Figure S5). A previous study reported that the metal chelator deferoxamine mesylate could reduce the formation of CG > AT artifacts but lowered the yields of library preparation (23). Reactive oxygen species play an important role in the formation of oxidative guanine lesions (37). Whether free radical scavengers or metal chelators other than EDTA can be used to reduce the ultrasonication-induced guanine lesions remains to be determined. Compared with G > T errors, G > C errors are relatively rare and thus have little effect on conventional NGS. However, at the 3′-end of the templates, the incidence of C > G errors was of the same order of magnitude as that of G > T errors (1.18 × 10−5 C > G errors at the 3′-end versus 1.87 × 10−5 G > T errors at the 3′-end and 5.46 × 10−6 G > T errors in the middle region). Thus, in mutation calling with conventional NGS, attention should be paid to C > G errors at the 3′-end of templates, as these may be residual end-repair artifacts.
In conclusion, we characterized a novel 2-aminoimidazolone lesion, in addition to 7,8-dihydro-8-oxoguanine, and the resulting end-repair artifacts in NGS library preparation, which hamper the sequencing accuracy of NGS. To enhance the accuracy of NGS, we reduced these sequencing artifacts enzymatically and developed PECC-Seq, an error-corrected NGS approach, for high-accuracy genome-wide somatic mutagenesis quantification. Our study provides new insights into the application of ecNGS approaches in mutagenicity testing as well as other fields in basic research and clinical applications.
Acknowledgements
We would like to thank Daru Lu, Tianbao Zhang and Jian Fei for their helpful discussions. We also thank the TriApex Laboratories Co., Ltd. for its support.
Author contributions: Y.L.: Funding acquisition, Project administration, Supervision and Writing – review & editing. X.Y.Y.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization and Writing – original draft. T.S.: Methodology, Investigation. Y.Y.C., J.S., B.Z.Z., K.M., J.X., W.Y.L. and X.Y.Z.: Investigation.
Contributor Information
Xinyue You, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Yiyi Cao, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Takayoshi Suzuki, Division of Genetics and Mutagenesis, National Institute of Health Sciences, Kawasaki 210-9501, Japan.
Jie Shao, State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, The Chinese Academy of Sciences, Beijing 100085, China; The University of Chinese Academy of Sciences, Beijing 100049, China.
Benzhan Zhu, State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, The Chinese Academy of Sciences, Beijing 100085, China; The University of Chinese Academy of Sciences, Beijing 100049, China.
Kenichi Masumura, Division of Risk Assessment, National Institute of Health Sciences, Kawasaki 210-9501, Japan.
Jing Xi, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Weiying Liu, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Xinyu Zhang, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Yang Luan, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
Data availability
The raw whole-genome sequencing data of all samples have been deposited in the Sequence Read Archive (SRA) under accession numbers PRJNA932860 and PRJNA632709. Detailed information on the sequencing data of all samples is available in Supplementary Table S1. The trinucleotide substitution profiles of all samples are available in Supplementary Table S2.
Supplementary data
Supplementary Data are available at NAR Online.
Funding
Major Program of National Natural Science Foundation of China [41991314]; China Postdoctoral Science Foundation [2021M702159]; Foundation of Science and Technology Commission of Shanghai Municipality [20142202700]; National Natural Science Foundation of China [21976200]. Funding for open access charge: National Natural Science Foundation of China [41991314].
Conflict of interest statement. None declared.
References
- 1. Marchetti F., Cardoso R., Chen C.L., Douglas G.R., Elloway J., Escobar P.A., Harper T. Jr, Heflich R.H., Kidd D., Lynch A.M et al. Error-corrected next-generation sequencing to advance nonclinical genotoxicity and carcinogenicity testing. Nat. Rev. Drug. Discov. 2023; 22:165–166. [DOI] [PubMed] [Google Scholar]
- 2. Abascal F., Harvey L.M.R., Mitchell E., Lawson A.R.J., Lensing S.V., Ellis P., Russell A.J.C., Alcantara R.E., Baez-Ortega A., Wang Y et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021; 593:405–410. [DOI] [PubMed] [Google Scholar]
- 3. Valentine C.C. 3rd, Young R.R., Fielden M.R., Kulkarni R., Williams L.N., Li T., Minocherhomji S., Salk J.J. Direct quantification of in vivo mutagenesis and carcinogenesis using duplex sequencing. Proc. Natl. Acad. Sci. U.S.A. 2020; 117:33414–33425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Salk J.J., Schmitt M.W., Loeb L.A Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat. Rev. Genet. 2018; 19:269–285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Vijg J., Schumacher B., Abakir A., Antonov M., Bradley C., Cagan A., Church G., Gladyshev V.N., Gorbunova V., Maslov A.Y et al. Mitigating age-related somatic mutation burden. Trends Mol. Med. 2023; 29:530–540. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Fox E.J., Reid-Bayliss K.S., Emond M.J., Loeb L.A Accuracy of Next Generation Sequencing Platforms. Next Gener. Seq. Appl. 2014; 1:1000106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Dong X., Zhang L., Milholland B., Lee M., Maslov A.Y., Wang T., Vijg J Accurate identification of single-nucleotide variants in whole-genome-amplified single cells. Nat. Methods. 2017; 14:491–493. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Moore L., Cagan A., Coorens T.H.H., Neville M.D.C., Sanghvi R., Sanders M.A., Oliver T.R.W., Leongamornlert D., Ellis P., Noorani A et al. The mutational landscape of human somatic and germline cells. Nature. 2021; 597:381–386. [DOI] [PubMed] [Google Scholar]
- 9. Lambert I.B., Singer T.M., Boucher S.E., Douglas G.R Detailed review of transgenic rodent mutation assays. Mutat. Res. 2005; 590:1–280. [DOI] [PubMed] [Google Scholar]
- 10. Salk J.J., Kennedy S.R Next-generation genotoxicology: using modern sequencing Technologies to assess somatic mutagenesis and cancer risk. Environ. Mol. Mutagen. 2020; 61:135–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Schmitt M.W., Kennedy S.R., Salk J.J., Fox E.J., Hiatt J.B., Loeb L.A Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl. Acad. Sci. U.S.A. 2012; 109:14508–14513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Kennedy S.R., Schmitt M.W., Fox E.J., Kohrn B.F., Salk J.J., Ahn E.H., Prindle M.J., Kuong K.J., Shen J.C., Risques R.A et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 2014; 9:2586–2606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Hoang M.L., Kinde I., Tomasetti C., McMahon K.W., Rosenquist T.A., Grollman A.P., Kinzler K.W., Vogelstein B., Papadopoulos N Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proc. Natl. Acad. Sci. U.S.A. 2016; 113:9846–9851. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Matsumura S., Sato H., Otsubo Y., Tasaki J., Ikeda N., Morita O Genome-wide somatic mutation analysis via Hawk-Seq reveals mutation profiles associated with chemical mutagens. Arch. Toxicol. 2019; 93:2689–2701. [DOI] [PubMed] [Google Scholar]
- 15. Otsubo Y., Matsumura S., Ikeda N., Yamane M Single-strand specific nuclease enhances accuracy of error-corrected sequencing and improves rare mutation-detection sensitivity. Arch. Toxicol. 2022; 96:377–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. You X., Thiruppathi S., Liu W., Cao Y., Naito M., Furihata C., Honma M., Luan Y., Suzuki T Detection of genome-wide low-frequency mutations with Paired-End and Complementary Consensus Sequencing (PECC-Seq) revealed end-repair-derived artifacts as residual errors. Arch. Toxicol. 2020; 94:3475–3485. [DOI] [PubMed] [Google Scholar]
- 17. Bae J.H., Liu R., Roberts E., Nguyen E., Tabrizi S., Rhoades J., Blewett T., Xiong K., Gydush G., Shea D et al. Single duplex DNA sequencing with CODEC detects mutations with high sensitivity. Nat. Genet. 2023; 55:871–879. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Chawanthayatham S., Valentine C.C. 3rd, Fedeles B.I., Fox E.J., Loeb L.A., Levine S.S., Slocum S.L., Wogan G.N., Croy R.G., Essigmann J.M. Mutational spectra of aflatoxin B1 in vivo establish biomarkers of exposure for human hepatocellular carcinoma. Proc. Natl. Acad. Sci. U.S.A. 2017; 114:E3101–E3109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. LeBlanc D.P.M., Meier M., Lo F.Y., Schmidt E., Valentine C. 3rd, Williams A., Salk J.J., Yauk C.L., Marchetti F. Duplex sequencing identifies genomic features that determine susceptibility to benzo(a)pyrene-induced in vivo mutations. Bmc Genomics [Electronic Resource]. 2022; 23:542. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Zhou L., Ng H.K., Drautz-Moses D.I., Schuster S.C., Beck S., Kim C., Chambers J.C., Loh M. Systematic evaluation of library preparation methods and sequencing platforms for high-throughput whole genome bisulfite sequencing. Sci. Rep. 2019; 9:10383.
- 21. Chen R., You X., Cao Y., Masumura K., Ando T., Hamada S., Horibata K., Wan J., Xi J., Zhang X. et al. Benchmark dose analysis of multiple genotoxicity endpoints in gpt delta mice exposed to aristolochic acid I. Mutagenesis. 2021; 36:87–94.
- 22. White P.A., Long A.S., Johnson G.E. Quantitative interpretation of genetic toxicity dose-response data for risk assessment and regulatory decision-making: current status and emerging priorities. Environ. Mol. Mutagen. 2020; 61:66–83.
- 23. Costello M., Pugh T.J., Fennell T.J., Stewart C., Lichtenstein L., Meldrim J.C., Fostel J.L., Friedrich D.C., Perrin D., Dionne D. et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res. 2013; 41:e67.
- 24. Chen L., Liu P., Evans T.C. Jr, Ettwiller L.M. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science. 2017; 355:752–756.
- 25. Alexandrov L.B., Kim J., Haradhvala N.J., Huang M.N., Tian Ng A.W., Wu Y., Boot A., Covington K.R., Gordenin D.A., Bergstrom E.N. et al. The repertoire of mutational signatures in human cancer. Nature. 2020; 578:94–101.
- 26. Kucab J.E., Zou X., Morganella S., Joel M., Nanda A.S., Nagy E., Gomez C., Degasperi A., Harris R., Jackson S.P. et al. A compendium of mutational signatures of environmental agents. Cell. 2019; 177:821–836.
- 27. Suzuki T., Wang X., Miyata Y., Saeki K., Kohara A., Kawazoe Y., Hayashi M., Sofuni T. Hepatocarcinogen quinoline induces G:C to C:G transversions in the cII gene in the liver of lambda/lacZ transgenic mice (MutaMouse). Mutat. Res. 2000; 456:73–81.
- 28. Suzuki T., Miyata Y., Saeki K., Kawazoe Y., Hayashi M., Sofuni T. In vivo mutagenesis by the hepatocarcinogen quinoline in the lacZ transgenic mouse: evidence for its in vivo genotoxicity. Mutat. Res. 1998; 412:161–166.
- 29. Nohmi T., Masumura K. Gpt delta transgenic mouse: a novel approach for molecular dissection of deletion mutations in vivo. Adv. Biophys. 2004; 38:97–121.
- 30. Stratton M.R., Campbell P.J., Futreal P.A. The cancer genome. Nature. 2009; 458:719–724.
- 31. Patkar N., Kakirde C., Shaikh A.F., Salve R., Bhanshe P., Chatterjee G., Rajpal S., Joshi S., Chaudhary S., Kodgule R. et al. Clinical impact of panel-based error-corrected next generation sequencing versus flow cytometry to detect measurable residual disease (MRD) in acute myeloid leukemia (AML). Leukemia. 2021; 35:1392–1404.
- 32. Hourigan C.S., Dillon L.W., Gui G., Logan B.R., Fei M., Ghannam J., Li Y., Licon A., Alyea E.P., Bashey A. et al. Impact of conditioning intensity of allogeneic transplantation for acute myeloid leukemia with genomic evidence of residual disease. J. Clin. Oncol. 2020; 38:1273–1283.
- 33. Balagopal V., Hantel A., Kadri S., Steinhardt G., Zhen C.J., Kang W., Wanjari P., Ritterhouse L.L., Stock W., Segal J.P. Measurable residual disease monitoring for patients with acute myeloid leukemia following hematopoietic cell transplantation using error corrected hybrid capture next generation sequencing. PLoS One. 2019; 14:e0224097.
- 34. Neeley W.L., Delaney J.C., Henderson P.T., Essigmann J.M. In vivo bypass efficiencies and mutational signatures of the guanine oxidation products 2-aminoimidazolone and 5-guanidino-4-nitroimidazole. J. Biol. Chem. 2004; 279:43568–43573.
- 35. Kino K., Sugiyama H. Possible cause of G-C→C-G transversion mutation by guanine oxidation product, imidazolone. Chem. Biol. 2001; 8:369–378.
- 36. Kino K., Sugiyama H. UVR-induced G-C to C-G transversions from oxidative DNA damage. Mutat. Res. 2005; 571:33–42.
- 37. AbdulSalam S.F., Thowfeik F.S., Merino E.J. Excessive reactive oxygen species and exotic DNA lesions as an exploitable liability. Biochemistry. 2016; 55:5341–5352.
Data Availability Statement
The raw whole-genome sequencing data of all samples have been deposited in the Sequence Read Archive (SRA) under accession numbers PRJNA932860 and PRJNA632709. Detailed information on the sequencing data of all samples is available in Supplementary Table S1. The trinucleotide substitution profiles of all samples are available in Supplementary Table S2.