Abstract
Error-corrected sequencing of genomic targets enriched by probe-based capture has become a standard approach for detecting single-nucleotide variants (SNVs) and small insertion/deletions (indels) present at very low variant allele frequencies. Less attention has been given to comparable strategies for rare structural variant (SV) junctions, where different error mechanisms must be addressed. Working from samples with known SV properties, we demonstrate that duplex sequencing (DuplexSeq), which demands confirmation of variants on both strands of a source DNA molecule, eliminates false SV junctions arising from chimeric PCR. DuplexSeq could not address frequent intermolecular ligation artifacts that arise during Y-adapter addition prior to strand denaturation without requiring multiple source molecules. In contrast, tagmentation libraries coupled with data filtering based on strand family size greatly reduced both artifact classes and enabled efficient and specific detection of single-molecule SV junctions. The throughput of SV capture sequencing (svCapture) and base-level accuracy of DuplexSeq provided detailed views of the microhomology profile and limited occurrence of de novo SNVs near the junctions of hundreds of newly created SVs, suggesting end joining as a possible formation mechanism. The open source svCapture pipeline enables rare SV detection as a routine addition to SNVs/indels in properly prepared capture sequencing libraries.
INTRODUCTION
Detecting sequence variants in next-generation sequencing (NGS) libraries demands that the signal from true variant DNA molecules rises above the background signal from pre-analytical and computational process errors. Historically, confidence was achieved by sequencing samples with moderate to high variant allele frequencies (VAFs) and demanding independent detection of variants in multiple source DNA molecules (1). As researchers have increasingly sought to apply NGS to samples with much lower VAFs, focus has shifted toward reducing the sequencing error baseline to improve signal-to-noise ratios. Applications where it is essential to detect sequence changes with very low VAFs include characterization of heterogeneity in tumor samples (2), studies of the nature and occurrence of mosaicism and somatic mutations in tissues (3–5), detection of genetic alterations induced by genotoxicants (6,7) and assessment of the off-target effects of gene editing methods (8,9).
Error minimization in NGS requires a detailed understanding of the mechanisms that create those errors (Figure 1A and Table 1). Error minimization strategies can be divided into two categories. We define ‘error suppression’ as preventing pre-analytical, i.e. wet lab, error mechanisms from impacting sequenced libraries so that false variants are never detected. PCR-free libraries and treatments to remove base damage are example error suppression strategies that eliminate or reduce PCR and base-copying artifacts, respectively (10). ‘Error correction’ refers to bioinformatics strategies, sometimes supported by input library designs, that allow errors present in sequenced libraries to be recognized and discarded.
Figure 1.
Design principles guiding structural variant (SV) capture sequencing (svCapture) and this study. (A) The most important error mechanisms giving rise to false calls for single-nucleotide variants (SNVs)/small insertions and/or deletions (indels) and SVs. (B) The three targeted common fragile site (CFS) genes. Arrows indicate the gene span and genome orientation. The central 250 kb shaded areas were targeted for capture (T regions). Vertical dashed lines mark the boundaries of the 250 kb adjacent (A) regions in triangle plots. Blue horizontal bars show the spans of overlapping SVs known to be present in sequenced clones based on prior microarray analysis.
Table 1.
Properties of different classes of false and bona fide captured SV junctions
| Parameter | Ligation artifact | Chimeric PCR | True SV |
|---|---|---|---|
| Endpoint locations | Second endpoints randomly sampled from genome | SV endpoints paired within and between capture targets (but not adjacent regions) | SV endpoints paired within (but not between) capture targets and adjacent regions |
| Duplex molecules possible | Yes, artifact occurs before strand melting | No, artifact occurs after strand melting | Yes |
| Strand family sizes | Low to high | Low, arise late in PCR | Low to high |
| Endpoints shared with proper molecules | No | Yes, artifact templates pre-exist in library | No |
| Number of source DNA molecules | One | One | One to many, as VAF increases |
| Junction properties | Blunt ends or A/T insertions reflecting post-shearing A-tailing | Large microhomology to support high-temperature annealing | Variable, predominantly 2–4 bp microhomology or small insertions |
Progress toward practical, error-minimized sequencing was advanced by the invention of duplex sequencing (DuplexSeq) (11,12). DuplexSeq is an error correction strategy in which unique molecular identifiers (UMIs) facilitate the recognition of the ends of individual source DNA molecules and the strands of those molecules that gave rise to sequenced read pairs. Computationally enforcing the detection of called variants on both DNA strands of a source molecule counteracts the most important error mechanisms giving rise to SNVs and indels—pre-existing base damage and polymerase copying errors (10)—which are inherently strand-specific (Figure 1A). DuplexSeq has been extended in derivative approaches dependent on the same core logic, such as NanoSeq and SaferSeqS (13,14). Together, duplex error correction has enabled single-molecule SNV detection and is having a substantial impact on the above applications (15–18).
A limitation of most error correction strategies and studies is that they have not addressed the detection of rare SVs. Like SNVs and indels, SVs are a critical form of mutagenesis associated with specific disease states and clastogenic genotoxicants (19–21). SVs alter many more bases than SNVs and indels but their junctions are less frequent (22), increasing the need for stringency when assessing SV rates. Critically, the mechanisms that give rise to SV artifacts in sequencing libraries are very different from those that give rise to SNVs and indels, necessitating different error minimization strategies (Figure 1A and Table 1). SV errors are created by three main mechanisms, listed here in their order of occurrence. Ligation artifacts arise from the joining of two source molecules by DNA ligase during adapter addition in libraries fragmented through sonication or non-tagmentation-based enzymatic fragmentation. Chimeric PCR occurs when a primer begins extension from one source DNA molecule only to switch to a different template in a subsequent cycle. Alignment errors occur when the aligner software reports an improper genomic location for a read or read segment.
Our efforts to address the challenges of error-minimized SV detection were motivated by our goal of defining the mechanisms of SV formation at CFSs (23–27). CFSs are genomic loci that are especially sensitive to replication stress and SV formation, representing genomic instability ‘hotspots’ (28–30). Effective study of the molecular events occurring at CFSs demands that de novo SVs be detected as they form, when only a single source molecule will carry a given junction. Similar needs for ultra-rare SV detection exist in genotoxicant monitoring, characterization of low-level mosaicism in tissues and other applications. Here, we carefully analyze the frequency and nature of SV artifact classes in target capture libraries made from DNAs bearing SVs characterized a priori or increased population burdens of rare induced SVs. DuplexSeq effectively addressed PCR artifact mechanisms, whereas tagmentation-based libraries coupled with bioinformatics filters addressed both PCR and ligation-mediated artifacts to yield signal-to-noise ratios suitable for monitoring rates of rare SV junction formation. The structures of hundreds of high-confidence and accurately sequenced SV junctions suggest that they are likely to arise by end joining.
MATERIALS AND METHODS
Human fibroblast cell line, clone mixtures and stressed populations
Samples analyzed in this study are derived from HF1, a TERT-immortalized primary human foreskin fibroblast cell line used in our prior SV studies (27,30). HF1 was propagated in Dulbecco’s modified Eagle medium supplemented with 13% fetal bovine serum, 4 mM l-glutamine and 1× penicillin–streptomycin. Initial experiments used mixtures of previously described HF1 cell clones bearing known SVs (27,30). We recovered the clones from our frozen stocks, expanded them and extracted DNA using the Qiagen Blood and Cell Culture Mini Kit (#13323). DNA was quantified using Qubit (Thermo Fisher), diluted to 30 ng/μl and then mixed with DNAs from other clones or parental HF1 cells at known ratios. Further experiments used HF1 cell populations treated to induce de novo SVs. An exponentially dividing culture of HF1 cells was split and left untreated or exposed to either 0.2 or 0.6 μM aphidicolin (APH) in dimethyl sulfoxide for 72 h. Cells were allowed to recover without APH to complete SV junction formation for a further period of 24 h. Cells were then split, seeded at 2 × 105 cells and expanded to confluence, after which DNA was extracted for sequencing.
Target capture probe design and nomenclature
Capture probes were targeted to large genes according to the design parameters in Figure 1B and Supplementary Table S1. Final probes were designed and synthesized by Twist Biosciences using their proprietary algorithms to ensure specificity and quality and used as provided by the vendor. When denoting the correspondence of SV endpoints to target regions, ‘T’ refers to bases within a capture target, ‘A’ refers to bases in the 250 kb regions adjacent to each capture target and ‘-’ refers to all other genome bases. Uppercase letter pairs, e.g. ‘TA’, identify SVs with both ends in the same capture target, whereas lowercase letters, e.g. ‘tt’, identify all other unexpected, often artifactual, SVs.
DuplexSeq libraries
DuplexSeq libraries were prepared by TwinStrand Biosciences Inc. (Seattle, WA) according to published procedures (12,17,18). Briefly, genomic DNA was fragmented by ultrasonication to a peak size of 300 bp, followed by end repair, A-tailing and DuplexSeq™ adapter ligation. After indexing PCR, libraries were subjected to one round of overnight hybrid capture with biotinylated probes. Targets were purified with streptavidin magnetic beads, followed by washes and a final round of PCR prior to pooling and sequencing.
Tagmentation sequencing libraries
Bead-based tagmentation libraries were prepared using the Illumina DNA Prep with enrichment, formerly called the Nextera Flex, kit (Illumina #20025523) by the University of Michigan Advanced Genomics Core. Tagmented libraries were prepared using an input of 300 ng of genomic DNA, IDT for Illumina unique dual barcodes (Illumina #20027213) and library PCR amplification of nine cycles. Libraries were quantified using Qubit and quality was checked using an Agilent TapeStation. Capture was then performed by pooling 500 ng of each library and hybridizing with 4 μl Twist Biosciences probes and 6 μl PCR grade water. Enrichment was completed with 12 cycles of PCR amplification.
DNA sequencing
All sequencing reads were obtained in the 2 × 151 format using Illumina NovaSeq 6000 by the University of Michigan Advanced Genomics Core. Barcoded samples were pooled with each other and other users’ samples and subjected to a sequencing depth calculated to yield a projected coverage of ∼2000-fold in the capture target regions based on prior experience. Library read counts, source DNA molecule yields and enrichment values can be found in Supplementary Table S2.
svCapture data analysis tools and versions
Data analysis and visualization, including plots in figures, were accomplished using the svCapture pipeline and app in v1.0.0 of the svx-mdi-tools suite (repository: https://github.com/wilsontelab/svx-mdi-tools; documentation: https://wilsontelab.github.io/svx-mdi-tools), supported by genomic modules in the main branch of the genomex-mdi-tools suite (https://github.com/wilsontelab/genomex-mdi-tools), as implemented using the frameworks provided by the Michigan Data Interface (https://github.com/MiDataInt). Here, ‘pipeline’ refers to high-performance computing actions executed using various programs and custom shell, Perl and R scripts on the University of Michigan Great Lakes cluster, whereas ‘app’ refers to R Shiny graphical tools executed using R 4.0.3 on a Windows 10 desktop computer. The job configuration scripts used to launch the pipeline and associated log files that list all options and program versions are available at https://github.com/wilsontelab/publications/tree/main/svCapture-2022. A simpler demo is available at https://wilsontelab.github.io/svx-mdi-tools/docs/svCapture/toy-example.html.
svCapture pipeline
The svCapture pipeline ‘align’ action converts putative DuplexSeq UMI sequences within read pairs to a UMI index from 1 to 96, when applicable, allowing up to one mismatch between the observed and expected UMIs. Reads with non-matching UMIs are discarded. The fastp program (31) merges overlapping read pairs and enforces read quality filtering. The bwa mem program (32,33) aligns read pairs to the GRCh38/hg38 or other appropriate reference genome obtained from Illumina iGenomes with results stored in name-sorted CRAM files.
The ‘collate’ action performs source DNA molecule analysis. Two outermost molecule positions are selected from aligned segments passing a map quality filter (MAPQ > 20) to act as keys for comparing read pairs to each other. Read pairs that share the same UMI indices, when applicable, and the same endpoints and outer clip lengths within a 1 bp allowance are grouped and referred to as a source DNA molecule. For libraries using ligated Y adapters, such as DuplexSeq, read pairs with opposite strand orientations are grouped as part of the same source molecule. In contrast, different read pair orientations sharing the same molecular endpoints must arise from independent Tn5 cleavage events and are considered different source molecules in tagmentation libraries. Read pair strand family sizes are counted, and a two-step consensus is constructed, first on each strand separately and then, when available, between the two strands. We used a fractional threshold of 0.667 shared bases in at most 11 downsampled read pairs for consensus calling; sequence positions failing this criterion are masked as N bases. svCapture includes steps for alignment-guided merging of smaller read pair overlaps not handled by fastp. Consensus sequences are finally realigned to the reference genome.
The ‘extract’ action examines final read alignments for patterns that predict an underlying SV junction (Supplementary Figure S1). Source DNA molecules are described as a series of nodes at the endpoints of all alignment segments, each with a chromosome, position and direction (left or right) that the molecule proceeds from that point with respect to the reference genome. Source DNA molecule replicates are defined by their shared outer nodes. Putative SV junction molecules are handled one at a time and characterized by their inner nodes. Node pairs defining novel junctions are sorted to reflect the top strand of the reference genome in non-inverted segments in the order found in the source molecules. Associated alignment sequences and CIGAR strings are carried forward after reverse complementing the alignments within (but not flanking) inversion segments as defined relative to the conjoined reference genome, such that all sequences now represent contiguous, reconstructed SV junctions. Nodes from all source molecules are compared to identify SV molecules that share outer endpoints with properly paired molecules, which provides evidence that those SV molecules are PCR artifacts derived by recombining other molecules.
The ‘find’ action searches extracted node tables for sets of molecules with distinct outer node signatures but the same inner node signatures, i.e. independent source molecules crossing the same SV junction (Supplementary Figure S1). Data are sorted by left nodes to allow molecules to be broken into sets where nodes are separated by less than the maximum proper insert size in the library as declared by bwa mem. Molecules within each set are re-sorted by the right node and the process is repeated to yield an evidence set supporting each putative SV. Additional steps include further purging of likely duplicate source molecules with outer endpoints closer than 5 bp in total, the recovery of source DNA molecules that were clipped at the junction on their outer edges by at least 5 bases and the reconstruction of junction sequences lacking split read evidence by joining the inner clipped nodes of molecules where junctions fell in read gaps. A single source DNA molecule with the most central junction is chosen as a reference for parsing final junction descriptions, including SV type, correspondence to capture targets, and the nature of microhomologies and de novo insertions. All steps can be performed on a single sample or on a set of input samples analyzed together to ensure sensitive matching of even single source molecules between samples.
Finally, the pipeline ‘genotype’ action compares the base content of SV calls to the non-SV molecules of the source individual. Proper molecules lacking SV junctions from CRAM files are analyzed by bcftools mpileup, call and norm (34) to establish a VCF file with a biallelic genotype at every sufficiently covered genome position within and adjacent to each capture target region. The resulting unphased haplotypes are compared to consensus SV junction sequences obtained from all aligned segments over all source molecules in each SV set. SV consensus bases absent from either the reference genome or haplotype variant lists are marked as presumptive SNVs or indels created during SV junction formation.
svCapture app and related downstream analysis
The svCapture app supports interactive visualization of SVs and enforces further adjustable SV filters, such as demanding duplex molecules and setting minimum strand family sizes and minimum and maximum numbers of supporting molecules or co-analyzed samples that share the same SV. For users wishing to examine SVs in other tools, a VCF file is also created where filtered SVs are encoded as breakends (SVTYPE = BND).
Matching of svCapture SV calls to prior data was achieved by first computationally comparing SV endpoints within a 25 kb allowance, given the imprecision of microarray calls. All potentially matching SVs were then manually examined for best matches, most of which were unambiguous as discussed below. VAFs were calculated for each sequenced SV as the count of total supporting molecules, including outer clips, divided by the average target region coverage. Expected VAFs were one half the clone frequency because all SVs were known to be heterozygous.
To assess the non-randomness of the positions of de novo SNVs and indels, the number of variant-containing SV consensuses crossing informative positions at each distance from the junction is counted to establish the expected distance distribution of randomly selected bases from those SVs, given the distribution of sample fragment sizes. The observed number of variants is randomly selected from this weighted distribution over 10 000 iterations. The P-value is the fraction of iterations whose median distance is less than the median of the observed variant distances. A significant result indicates that SNVs and indels are more likely to occur near the SV junction (Mann–Whitney and other tests are also available).
RESULTS
Design of cell mixtures and capture targets for svCapture validation
To help optimize capture sequencing for rare SV detection, we drew on established human HF1 fibroblast cell clones bearing known replication stress-induced SVs as established by genomic microarrays (27,30). We constructed a first set of five sample mixtures that were prepared, sequenced and analyzed together that bore varying levels of different HF1 SV clones from 1% to 30% (Supplementary Tables S3 and S4). The composition of most mixtures was known to the data analyst, whereas the last sample was initially blinded and included unknown clone percentages and a subset of clones absent from other samples. A second set of five samples contained populations of HF1 cells treated with low doses of APH, a commonly used agent of replication stress that is highly effective at inducing de novo SVs (30,35). To further promote predictable SV patterns, we selected three genes from published HF1 SV data (27,30) as known, high-value hotspots of replication stress-induced SV junction formation: NEGR1 (0.89 Mb, chr1), MAGI2 (1.44 Mb, chr7) and PRKG1 (1.31 Mb, chr10). Capture probes targeted a territory of 250 kb around the center of each hotspot gene (Figure 1B and Supplementary Table S1), which bears the highest burden of induced SVs (30). This represented a substantial increase in target span as compared to typical prior DuplexSeq libraries (15–18). Most SVs induced in these large genes are known to be deletions (30), which provided a further specificity test for method validation.
DNAs from the above sample sets were subjected to target capture and, initially, sequencing by DuplexSeq. A subset of the same DNAs was later sequenced again using tagmentation. We wrote a data analysis pipeline in the Michigan Data Interface because prior pipelines were optimized to address duplex consensus making or SV detection but not both (12,17,36). Like prior pipelines, svCapture groups read pairs by source molecule and strand, creates single-strand and, when possible, duplex consensus sequences, and realigns those consensuses to the reference genome. Additional features include read pair merging to maximize alignment specificity, persistence of all source molecules and coordination of strand tracking with SV junction calling.
DuplexSeq error correction eliminates chimeric PCR artifacts
Our entry rationale was that DuplexSeq would correct for chimeric PCR artifacts in capture libraries, which our early efforts suggested was the major confounder to SV detection. PCR is the last step in most capture protocols to create sufficient DNA mass for sequencing. It is problematic because it can promote chimeric PCR that is highly likely to fuse two molecules enriched by capture, mimicking the SV junctions we wish to detect. Chimeric PCR occurs after source DNA molecules are melted (Figure 1A) and should be amenable to duplex correction as it is highly unlikely that the same chimeric PCR event would occur twice independently, once in each template strand orientation. Because we know the properties of our source samples and genomic loci well (27,30), we can assert that apparent translocations between capture target regions are nearly always artifacts. True SVs will be mainly deletions with one captured endpoint within a target (T) region and the second endpoint within or adjacent (A) to the same region. Second endpoints at large distances from the target region are again mostly artifacts.
We used plot devices described in Supplementary Figure S2 to reveal the frequency, location and properties of SV junctions detected by svCapture. Triangle plots of the locations of high-MAPQ, single-sample DuplexSeq SVs demonstrate a high burden of artifacts, most of which were confined within the boundaries of the capture target regions (Figure 2A and Supplementary Figure S3A). Most of those SV artifacts, including intertarget translocations, had large stretches of junctional microhomology with a peak at 8–9 bp (Figure 3A and Supplementary Figure S3E). This combination of features matches the expectations of post-capture chimeric PCR (Table 1). Consistently, requiring that at least one duplex molecule supported an SV call eliminated this error class (Figure 2B and Supplementary Figure S3B). Notably, duplex filtering was not required to correct for false chimeric PCR SV calls. Examining SV strand family sizes, i.e. the number of replicate read pairs from each source molecule strand (Supplementary Figure S1A), showed that intertarget translocation SVs of the ‘tt’ class rarely showed more than a single supporting read pair (Supplementary Figure S4). Moreover, presumptive chimeric PCR events often shared one or both outer endpoints with proper molecules from the same library, the likely templates for late cycle chimeric PCR (Supplementary Figure S5A). Enforcing strand family size (Figure 2C and Supplementary Figure S3C) and/or shared endpoint (Supplementary Figure S5B) filters proved effective for correcting chimeric PCR artifacts without sacrificing >50% of source molecules that failed duplex detection (Supplementary Table S2).
Figure 2.
SV junction locations with and without error correction filters. Parallel triangle plots of untreated DuplexSeq and tagmentation libraries with similar target region coverage to minimize visualization bias. All plots are filtered for high-quality SV calls (MAPQ > 50 in each flanking alignment) detected in a single sample, the expected pattern for SV artifacts. Panels (A)–(D) plot a single DuplexSeq library, whereas panels (E)–(H) plot two tagmentation libraries. Panels (A) and (E) plot all detected SV junctions without further filtering. Panels (B) and (F) further require at least one duplex molecule to plot the SV (tagmentation libraries do not support duplex assignments). Panels (C) and (G) instead require that at least three read pairs (not source molecules) matched the SV. Panels (D) and (H) instead require that at least two source DNA molecules matched the SV. See Supplementary Figure S1 for definitions of read pairs and source molecules, Supplementary Figure S2 for a guide to interpreting the plots and Supplementary Figure S3 for plots with all DuplexSeq data. Dup, duplication; Inv, inversion; Del, deletion; Trans, translocation.
Figure 3.
SV junction properties with and without error correction filters. Parallel junction property plots with the same samples and filters as in Figure 2. See Figure 2 legend for details, Supplementary Figure S2 for a guide to interpreting the plots and Supplementary Figure S3 for plots with all DuplexSeq data.
DuplexSeq error correction alone cannot eliminate intermolecular ligation artifacts
Examining Figure 2C and Supplementary Figure S5B revealed a residual class of presumptive SV artifact junctions that did not respect the capture target boundaries, e.g. translocation junctions extending throughout the target (T) and adjacent (A) regions. Many of those remaining artifact junctions fell in the unsequenced gaps between read pairs (hence, they could not be plotted in Figure 3C), consistent with them being especially large molecules with central junctions, the expected pattern for intermolecular ligation artifacts. Consistently, this class of artifact junctions extended at the same baseline level throughout the untargeted genome (Figure 4A) and had a distinct junction profile characterized by blunt joints or single-base insertions or deletions (Figure 4B), as predicted for intermolecular ligation artifacts (1041 of 1293, 81%, of the one-base insertions were A or T, suggestive of A-tailing). Because such artifacts arise prior to strand melting (Figure 1A), duplex filters alone are not expected to correct them (Table 1) and indeed they did not (Figures 2B and 4C, and Supplementary Figure S3B). Even with duplex filtering, tens of thousands of target-to-genome SV calls with predominantly blunt end junctions persisted that would confound many applications. Both ligation and chimeric PCR artifacts could be corrected by demanding multiple independent source molecules (Figure 2D), but that restriction prevents the detection of many rare, true SVs (see below).
Figure 4.
Intermolecular ligation is not corrected by DuplexSeq but is suppressed by tagmentation. All data represent single-sample, single-molecule, SVs >2 kb with a minimum MAPQ value of 50 for both SV alignments. (A) SVs from APH-treated DuplexSeq samples with at least three read pairs (to remove chimeric PCR events) in a 10 Mb window surrounding the capture target regions. For all plotted SVs, one end was within a capture target. The frequencies of the other, more target-distant, junction positions are plotted per 100 kb bin. The central peak reflects APH-induced SVs in target and adjacent regions, whereas the baseline throughout the chromosomes reflects ligation artifacts. (B) Distribution of microhomology and de novo insertion lengths of three types of filtered junctions from DuplexSeq libraries. Presumed ligation artifacts are from panel (A), presumed chimeric PCR events are the tt-class translocation junctions filtered away from Figure 2A in Figure 2C and presumed true SVs are APH-induced deletion SVs of the TT or TA class with at least three read pairs. (C) Like panel (A), now also requiring at least one duplex molecule to plot the SV. The small apparent baseline reduction results from the loss of non-duplex molecules, not an increased SV call specificity. (D) Like panel (A), for all combined tagmentation libraries. Note the much lower rate of presumed ligation artifacts connecting the target regions to the rest of the genome. (E) Like panel (B), for the tagmentation libraries. The erratic chimeric PCR distribution (red) results from the small number of such SVs.
Tagmentation capture libraries largely, but incompletely, suppress ligation artifacts
Because there was no practical method for correcting ligation artifacts, we turned to error suppression to achieve this important goal. Because tagmentation libraries use the Tn5 transposase to connect adapters to genomic DNA molecules, not DNA ligase, ligation artifacts are not expected (Table 1). Comparing tagmentation to DuplexSeq libraries in the absence of any filters first revealed a new class of frequent, small inversions with large microhomologies that appeared uniquely in the tagmentation libraries (compare Figure 3A and E). Others have described that such errors arise by intramolecular foldback synthesis that is especially prevalent in tagmentation libraries, most likely during the extension of tagmented ends (37,38). Somewhat surprisingly, chimeric PCR artifact rates were lower in tagmentation libraries (compare Figure 2A and E), perhaps secondary to a greater capture target enrichment (Supplementary Table S2). Family size filters were again highly effective at removing intertarget chimeric PCR artifacts as seen for DuplexSeq (Figure 2G and G). Most importantly, the rates of intermolecular ligation evident as ‘t-’ class molecules were markedly reduced in tagmentation libraries (Figure 4D). The few such molecules that did occur shared a junction microhomology profile characterized by predominantly blunt ends (Figure 4E). This unexpected result suggests a low degree of ligation activity in tagmentation libraries (see the ‘Discussion’ section).
svCapture faithfully reports SV junction locations
To ensure that error correction and suppression were not eliminating true SV molecules, we examined our samples with mixed SV-containing clones, requiring three read pairs to call an SV. Figure 5A and B shows a clear match to the SVs known to exist from prior microarray analysis down to a 1% allele frequency (compare detected SVs to the black circles denoting expected SVs). One expected 25 kb duplication in the NEGR1 gene did not have an explicitly matching call, but we observed a nearby 4.5 kb duplication (red circle in Figure 5B) with unambiguous molecular evidence (Figure 5C and Supplementary Figure S6A). We interpret this as an error in microarray assessment of this SV, a method with much less precision in endpoint assignments, especially for duplications. We similarly observed apparently undetected SVs when we examined our initially blinded sample (Figure 5D). While we cannot rule out that some SVs were missed by sequencing, we again noted the presence of other strongly evidenced SV junctions leading us to question the precision of the microarray calls. For example, HF1 clone Scr1-14A had two ∼75 kb deletion calls in MAGI2 by microarray (Supplementary Table S4), whereas sequencing established the existence of a single, overlapping, unambiguous 149 kb deletion in the pool (red circle in Figure 5D; Supplementary Figure S6B). Other unexpected but strongly evidenced SVs were also noted at low VAFs, such as the deletion highlighted by a red circle in Figure 5A and depicted in Supplementary Figure S6C with six supporting molecules.
Figure 5.
Sensitive and specific detection of known SVs down to 1% VAF. Panels (A) and (B) are similar to Figure 2, showing duplications and deletions detected by at least three read pairs for the indicated HF1 mixed-clone samples. Black circles denote the positions of subclonal SVs expected based on prior microarray analysis. All but one matched both samples unambiguously. (C) Genome coordinates of the two outer positions of molecules supporting the SV highlighted by a red circle in panel (B), which likely accounts for the one call discrepancy in panels (A) and (B). The legend identifies the molecule types, which support a random distribution of many matching source molecules. (D) Like panels (A) and (B), for the blinded sample. See text for discussion of discrepancies and Supplementary Figure S6B for a depiction of the SV marked with a red circle. (E) Correlation plot of the observed VAF for the same SVs from the 10% and 1% mixed-clone samples. Horizontal and vertical lines denote the expected values. Symbols indicate the number of SV ends that were inside a capture target region. (F) Correlation plot of the observed versus expected VAF for the two indicated variable mixtures. Symbols are as in panel (E).
svCapture reports SV allele frequencies with modest accuracy
To explore the quantitative efficiency and accuracy of svCapture, we plotted the observed versus expected SV VAFs. While many SVs in the 10% mixture were recovered at approximately the expected VAF of 5%, those of the TA class showed lower VAFs (Figure 5E). SVs with only one end in a capture target region could thus be detected but at a lower efficiency, presumably due to reduced binding to capture probes. The 10% and 1% clone mixtures showed only a modest correlation (R = 0.74), in a pattern that initially suggested a reduced detection efficiency at lower VAF. However, plotting the observed versus expected VAF for mixtures with variable content showed the opposite trend, with underrepresentation of the SVs with higher expected VAF (Figure 5F). SV recovery efficiency is thus variable and a function of several factors. Importantly, in this study one of those factors is the uncertain accuracy with which the clone mixtures were prepared.
svCapture permits SV assessments at very low allele frequencies
While reliable detection of SV junctions at 1% VAF is valuable, it falls short of the needs of many applications. Therefore, we examined our APH-treated cell populations expected to carry a high burden of individually rare SVs across different cells. Figure 6A and B shows the ready detection of increased SV junctions by either sonication-based or tagmentation-based svCapture following APH treatment of HF1 fibroblasts. In aggregate, these induced SVs were very often single-molecule detections (305 of 415, 73%), consistent with the expectations of de novo SV formation in a large cell population without clonal restriction. Induced SVs were strongly biased toward deletions (359 of 415, 87%), as we have previously observed at large gene loci (30), which, together with their induction by APH, provides excellent support for their existence in the source DNA. We further observed an apparent dose responsiveness of SV induction by APH and that tagmentation appeared more efficient than DuplexSeq at rare SV detection (Figure 6C), although more data are needed to draw firm conclusions on these points. The basal rate of SV formation in untreated cells is expected to include some persistent artifacts noted above (e.g. Figure 4D), but the fact that they are also mostly deletions in the middle of large transcribed genes suggests that many correspond to the basal copy number variant rate we have documented with microarrays (30).
Figure 6.
APH induces rare, detectable, often single-molecule deletion SVs in CFS genes. (A, B) Junction property plots comparing an untreated and an APH-treated sample for each of the DuplexSeq and tagmentation-based svCapture approaches, showing consistent SV induction by APH. (C) SV formation rates, i.e. the number of observed junctions divided by the average on-target coverage, for all sequenced APH-treated sample libraries. Throughout, filters enforced a minimum MAPQ of 50 per alignment for SVs with at least three read pairs detected in a single co-analyzed sample.
DuplexSeq reveals a low rate of de novo insertions, SNVs and indels near SV junctions
We finally examined properties of SV junctions that are useful for inferring the associated DNA repair mechanisms. The hundreds of SVs induced by APH showed a junction profile distinct from either artifact class characterized by short, mainly 2–4 bp, microhomologies and a tail of junctions with de novo sequence insertions of up to 25 bp (Figures 4B and E and 6). These patterns provide further validation that these are valid SV calls distinct from the SV artifact classes. The high accuracy of DuplexSeq, coupled with abundant read depth for creating unphased haplotypes of the source individual, further allowed us to examine the rate of de novo SNVs and indels in the alignments flanking the SV junctions (see Figure 7A and Supplementary Figure S7 for an example). We measured a net rate of <1 in 1000 variant base positions in duplex SV calls that could not be accounted for by constitutional SNPs and indels in the HF1 genome (Figure 7B). The positions of the few variants observed in subclonal SVs from mixed-clone samples appeared random, whereas there was a small but statistically significant bias (P ≤ 0.038) for the more numerous APH-induced SVs for being located closer to the junction than expected by random chance (Figure 7C).
Figure 7.
Low burden of de novo SNVs and indels near SV junctions. (A) An example SV sequence on the left side of the junction aligned to the reference genome (hg38) and the two identical haplotypes of the HF1 cell line (hap1 and hap2), showing a base match to a homozygous HF1 SNP (blue) and a de novo SNV (red) in the SV located 11 bases internal to the one-base microhomology (underlined). (B) Summary of the number and frequency of variant SVs and base positions for subclonal SVs present in two or more source molecules of the mixed-clone DuplexSeq libraries, or single-sample SVs with three read pairs in the APH-treated DuplexSeq libraries. (C, D) Cumulative frequency plots of the number of sequenced bases (grey) and number of variant bases (blue) as a function of distance from the SV junction for the subclonal and APH-induced SVs, respectively.
DISCUSSION
We describe svCapture, which extends error-minimized sequencing to SVs to enable a variety of downstream applications in molecular genetics, cancer biology, genetic toxicology and other fields. A first conclusion, consistent with prior work with whole genome sequencing, is that pre-analytical process errors, including intermolecular ligation and chimeric PCR, can be corrected by demanding that multiple independent source molecules match an SV junction (1,20). Demanding even two source molecules provides strong correction since all relevant wet lab process errors occur after fragmentation such that distinct molecule outer endpoints establish that an SV junction was present multiple times in the source DNA. The processes for determining source molecule independence and matching SV junctions must be meticulous but are well described (20,36), and SV junctions provide abundant information for accurately grouping source molecules.
The challenge we addressed was extending SV detection to capture NGS where sequencing multiple independent source molecules is impractical or impossible due to low VAFs in input samples. Relevant applications include real-time monitoring of SV junction formation prior to replication in mechanistic studies, where only one source molecule could possibly carry the junction, or SVs arising just prior to the last mitotic division in neuronal or germ cell development (39–41) or during short-duration genetic toxicology studies (6,7,42). Even when SV junctions are subclonal, having been replicated during prior cell divisions, they are effectively unique at very low VAFs with respect to randomly sampled NGS libraries unless prohibitively deep sequencing is performed. At an average on-target coverage of 2400 in bulk samples, we readily detected known SVs at 1% VAF with multiple molecules (Figure 5 and Supplementary Figure S6) but SVs would often yield single-molecule detection at 0.1% VAF. Moreover, we observed only a modest ability of svCapture to accurately score VAFs. Multiple factors likely contribute to variable detection efficiency, including the number of SV ends captured, GC content, probe coverage, competition during capture and PCR, and others.
Of the three SV error classes (Figure 1A), we were the least concerned about alignment errors. Modern aligners, such as the bwa mem utility used in svCapture (32,33), provide reliable information regarding alignment confidence in the form of MAPQ scores. Enforcing MAPQ filters reduces the detection of true SV junctions in repetitive DNA, but that lack of sensitivity is inherent to short-read libraries. Long-read technologies remain essential for characterizing SVs in repetitive regions (43–45). The greatest residual problem we have found regarding read alignment is for the subset of junctions containing short tandem repeats, where sequence changes arising during library preparation can cause aligners to give falsely high MAPQ scores.
In contrast, the remaining error classes—intermolecular ligation and chimeric PCR—readily yield false SV junctions with high MAPQ scores. Other methods are required to minimize their impact. Our goal was to improve SV signal-to-noise ratios by reducing artifact background, since true signals are obligatorily low in our target applications. Several computational filters could effectively correct for chimeric PCR errors, including duplex strand detection in DuplexSeq and high read pair and low shared endpoint counts in all library types (Figures 2 and 3, and Supplementary Figures S3 and S5).
Thus, ligation artifacts proved of greatest concern. DuplexSeq logic could not correct for intermolecular ligation because the artifact occurs prior to strand melting, and no other computational filter proved capable of doing so other than source molecule counts. However, libraries that use tagmentation to covalently bond adapter sequences to genomic DNA were highly effective at suppressing ligation artifacts from ever appearing in sequenced DNAs (Figure 4). Tagmented DNAs cannot typically support duplex error correction because they lack dual strand identification, but, as noted above, chimeric PCR can be corrected in other ways such that highly sensitive and specific detection of even single-molecule SVs was achieved by tagmentation-based svCapture (Figures 2, 3, 5 and 6). Like others, we noted a class of small inversion artifacts in tagmentation libraries (37,38), but this does not impair the accurate calling of inversions larger than ∼1 kb or deletions, duplications or translocations.
Surprisingly, some apparent intermolecular ligation was observed in tagmentation libraries, even if greatly reduced relative to DuplexSeq. This inference is drawn because the artifacts in question had second SV ends spread evenly throughout the genome and the same predominantly blunt end junction structure as seen with DuplexSeq (Figure 4). We do not know whether the relevant phosphodiester bond formation is catalyzed by Tn5 itself or by some amount of ligase in the proprietary Illumina DNA Prep kits.
Importantly, any wet lab approach that prevents intermolecular ligation would suppress that artifact mechanism. The ideal would be a library prep that is both PCR-free and ligation-free. PCR-free tagmentation libraries are now offered by vendors but with unspecified details as to how sequencer-ready libraries are created following Tn5 action. Quispe-Tintaya et al. reported structural variant search (SVS) (42,46) for use on the Ion Torrent platform. Although SVS uses ligation to add second adapter strands, it does so after homopolymer tailing that reduces intermolecular ligation. Xing et al. reported META-CS in which pools of barcodes on Tn5 mosaic ends support duplex logic in tagmentation libraries (47). These innovations in SV error suppression were applied to whole genome sequencing, which requires a commitment to higher sequencing costs as compared to target capture. However, META-CS could be adapted to capture sequencing where it would be expected to provide best sensitivity and accuracy for rare SV junction detection at the expense of a more complex library preparation, as compared to the commercial kits shown here to be suitable for detecting rare induced SV junctions above a baseline.
Data reported here represent the largest set of de novo SV junctions yet acquired in a controlled, prospective, experimental paradigm. They established in high-accuracy calls two features of induced SV junctions, made possible in part by the application of DuplexSeq logic. First, the APH-induced junction sequence profile is characterized by short microhomologies and small de novo insertions (Figures 4 and 6) that are a good match to DNA polymerase theta-mediated end joining (TMEJ) (48,49). We also observed a low but measurable rate of de novo SNV acquisition near SV junctions that showed a biased localization in the first 50 bases on either side of the junction (Figure 7), consistent with a process confined to end sequences. This combination of features suggests that TMEJ or another end joining mechanism might create many SV junctions, with TMEJ favored since we previously showed that a canonical non-homologous end joining factor, Xrcc4, is not required for SV junction formation (29). More data are needed, however, since both microhomologous junctions and error-prone copying have also been invoked for replication-based SV formation models such as microhomology-mediated break-induced replication (50,51).
In summary, sensitive and specific detection of very rare SV junctions can be achieved in short-read capture libraries with sufficient attention to suppressing intermolecular ligation events by appropriate choices of library preparation protocols and correcting chimeric PCR artifacts by filters in data analysis pipelines. The svCapture procedure has a high specificity and efficiency but modest quantitative accuracy for assessing VAFs. There are unavoidable limitations associated with short reads that include the inability to detect SV junctions in repetitive DNA or those mediated by recombination between longer regions of homology. Moreover, even in our optimized tagmentation approach the background of target-to-genome SV artifacts is not zero. Nevertheless, many important classes of non-homologous SVs can be readily detected above very low background signals enabling intersample comparisons of the formation rates of even single-molecule SVs. The method is well suited to studies of SV formation mechanisms, somatic SVs and SV-inducing genotoxicants.
DATA AVAILABILITY
All codes used in data analysis and figure generation are available for download as part of the svx-mdi-tools repository at https://doi.org/10.5281/zenodo.7871677 and https://github.com/wilsontelab/svx-mdi-tools, via its svCapture pipeline and app. HF1 svCapture raw read data, in FASTQ format, and all called SVs, in VCF format, are available from the Database of Genotypes and Phenotypes (dbGaP) under accession phs003121.v1.p1.
Supplementary Material
ACKNOWLEDGEMENTS
We thank Olivia Koues, Melissa Coon and additional staff members at the University of Michigan Advanced Genomics Core for their expert assistance in library preparation and sequencing, Pam Bennett-Baker for support with library preparation and sequencing, Clint Valentine for helpful discussions regarding pipeline development, Anthony Nguyen for performing exploratory studies of SNV formation in SV sequences, and Michael Hipp, Gabriel Pratt, Lindsey Williams and Sharie Austin for analytical and project management support.
Contributor Information
Thomas E Wilson, Department of Pathology, University of Michigan, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA.
Samreen Ahmed, Department of Pathology, University of Michigan, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA.
Jake Higgins, TwinStrand Biosciences Inc., Seattle, WA 98121, USA.
Jesse J Salk, TwinStrand Biosciences Inc., Seattle, WA 98121, USA.
Thomas W Glover, Department of Pathology, University of Michigan, Ann Arbor, MI 48109, USA; Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA.
SUPPLEMENTARY DATA
Supplementary Data are available at NARGAB Online.
FUNDING
National Institutes of Health [CA200731 to T.E.W. and T.W.G., ES017885 to T.E.W.]. Funding for open access charge: National Institutes of Health.
Conflict of interest statement. The authors have no conflicts of interest regarding the contents of this manuscript.
REFERENCES
- 1. Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics. 2013; 43:11.10.1–11.10.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Dentro S.C., Leshchiner I., Haase K., Tarabichi M., Wintersinger J., Deshwar A.G., Yu K., Rubanova Y., Macintyre G., Demeulemeester J. et al. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell. 2021; 184:2239–2254. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Coorens T.H.H., Moore L., Robinson P.S., Sanghvi R., Christopher J., Hewinson J., Przybilla M.J., Lawson A.R.J., Spencer Chapman M., Cagan A. et al. Extensive phylogenies of human development inferred from somatic mutations. Nature. 2021; 597:387–392. [DOI] [PubMed] [Google Scholar]
- 4. Park S., Mali N.M., Kim R., Choi J.W., Lee J., Lim J., Park J.M., Park J.W., Kim D., Kim T. et al. Clonal dynamics in early human embryogenesis inferred from somatic mutation. Nature. 2021; 597:393–397. [DOI] [PubMed] [Google Scholar]
- 5. Spencer Chapman M., Ranzoni A.M., Myers B., Williams N., Coorens T.H.H., Mitchell E., Butler T., Dawson K.J., Hooks Y., Moore L. et al. Lineage tracing of human development through somatic mutations. Nature. 2021; 595:85–90. [DOI] [PubMed] [Google Scholar]
- 6. Heflich R.H., Johnson G.E., Zeller A., Marchetti F., Douglas G.R., Witt K.L., Gollapudi B.B., White P.A. Mutation as a toxicological endpoint for regulatory decision-making. Environ. Mol. Mutagen. 2020; 61:34–41. [DOI] [PubMed] [Google Scholar]
- 7. Minocherhomji S., Liu Y., He Y.D., Fielden M.R. Biomarkers of genome instability in normal mammalian genomes following drug-induced replication stress. Environ. Mol. Mutagen. 2020; 61:770–785. [DOI] [PubMed] [Google Scholar]
- 8. Naeem M., Majeed S., Hoque M.Z., Ahmad I. Latest developed strategies to minimize the off-target effects in CRISPR–Cas-mediated genome editing. Cells. 2020; 9:1608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Zhang X.H., Tee L.Y., Wang X.G., Huang Q.S., Yang S.H. Off-target effects in CRISPR/Cas9-mediated genome engineering. Mol. Ther. Nucleic Acids. 2015; 4:e264. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Chen L., Liu P., Evans T.C. Jr, Ettwiller L.M DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science. 2017; 355:752–756. [DOI] [PubMed] [Google Scholar]
- 11. Schmitt M.W., Kennedy S.R., Salk J.J., Fox E.J., Hiatt J.B., Loeb L.A. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl Acad. Sci. U.S.A. 2012; 109:14508–14513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Kennedy S.R., Schmitt M.W., Fox E.J., Kohrn B.F., Salk J.J., Ahn E.H., Prindle M.J., Kuong K.J., Shen J.C., Risques R.A. et al. Detecting ultralow-frequency mutations by duplex sequencing. Nat. Protoc. 2014; 9:2586–2606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Abascal F., Harvey L.M.R., Mitchell E., Lawson A.R.J., Lensing S.V., Ellis P., Russell A.J.C., Alcantara R.E., Baez-Ortega A., Wang Y. et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021; 593:405–410. [DOI] [PubMed] [Google Scholar]
- 14. Cohen J.D., Douville C., Dudley J.C., Mog B.J., Popoli M., Ptak J., Dobbyn L., Silliman N., Schaefer J., Tie J. et al. Detection of low-frequency DNA variants by targeted sequencing of the Watson and Crick strands. Nat. Biotechnol. 2021; 39:1220–1227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Salk J.J., Loubet-Senear K., Maritschnegg E., Valentine C.C., Williams L.N., Higgins J.E., Horvat R., Vanderstichele A., Nachmanson D., Baker K.T. et al. Ultra-sensitive TP53 sequencing for cancer detection reveals progressive clonal selection in normal tissue over a century of human lifespan. Cell Rep. 2019; 28:132–144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Short N.J., Kantarjian H., Kanagal-Shamanna R., Sasaki K., Ravandi F., Cortes J., Konopleva M., Issa G.C., Kornblau S.M., Garcia-Manero G. et al. Ultra-accurate duplex sequencing for the assessment of pretreatment ABL1 kinase domain mutations in Ph+ ALL. Blood Cancer J. 2020; 10:61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Valentine C.C. 3rd, Young R.R., Fielden M.R., Kulkarni R., Williams L.N., Li T., Minocherhomji S., Salk J.J. Direct quantification of in vivo mutagenesis and carcinogenesis using duplex sequencing. Proc. Natl Acad. Sci. U.S.A. 2020; 117:33414–33425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Wang Y., Mittelstaedt R.A., Wynne R., Chen Y., Cao X., Muskhelishvili L., Davis K., Robison T.W., Sun W., Schmidt E.K. et al. Genetic toxicity testing using human in vitro organotypic airway cultures: assessing DNA damage with the CometChip and mutagenesis by duplex sequencing. Environ. Mol. Mutagen. 2021; 62:306–318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Harel T., Lupski J.R. Genomic disorders 20 years on—mechanisms for clinical manifestations. Clin. Genet. 2018; 93:439–449. [DOI] [PubMed] [Google Scholar]
- 20. Ho S.S., Urban A.E., Mills R.E. Structural variation in the sequencing era. Nat. Rev. Genet. 2020; 21:171–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Cosenza M.R., Rodriguez-Martin B., Korbel J.O. Structural variation in cancer: role, prevalence, and mechanisms. Annu. Rev. Genomics Hum. Genet. 2022; 23:123–152. [DOI] [PubMed] [Google Scholar]
- 22. 1000 Genomes Project Consortium Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A. et al. A global reference for human genetic variation. Nature. 2015; 526:68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Glover T.W., Berger C., Coyle J., Echo B. DNA polymerase alpha inhibition by aphidicolin induces gaps and breaks at common fragile sites in human chromosomes. Hum. Genet. 1984; 67:136–142. [DOI] [PubMed] [Google Scholar]
- 24. Glover T.W., Arlt M.F., Casper A.M., Durkin S.G. Mechanisms of common fragile site instability. Hum. Mol. Genet. 2005; 14:R197–R205. [DOI] [PubMed] [Google Scholar]
- 25. Glover T.W., Wilson T.E., Arlt M.F. Fragile sites in cancer: more than meets the eye. Nat. Rev. Cancer. 2017; 17:489–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Glover T.W., Wilson T.E. Twin peaks: finding fragile sites with MiDAS-seq. Cell Res. 2020; 30:944–945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Park S.H., Bennett-Baker P., Ahmed S., Arlt M.F., Ljungman M., Glover T.W., Wilson T.E. Locus-specific transcription silencing at the FHIT gene suppresses replication stress-induced copy number variant formation and associated replication delay. Nucleic Acids Res. 2021; 49:7507–7524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Arlt M.F., Ozdemir A.C., Birkeland S.R., Wilson T.E., Glover T.W. Hydroxyurea induces de novo copy number variants in human cells. Proc. Natl Acad. Sci. U.S.A. 2011; 108:17360–17365. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Arlt M.F., Rajendran S., Birkeland S.R., Wilson T.E., Glover T.W. De novo CNV formation in mouse embryonic stem cells occurs in the absence of Xrcc4-dependent nonhomologous end joining. PLoS Genet. 2012; 8:e1002981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Wilson T.E., Arlt M.F., Park S.H., Rajendran S., Paulsen M., Ljungman M., Glover T.W. Large transcription units unify copy number variants and common fragile sites arising under replication stress. Genome Res. 2015; 25:189–200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Chen S., Zhou Y., Chen Y., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018; 34:i884–i890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; arXiv doi:26 May 2013, preprint: not peer reviewedhttps://arxiv.org/abs/1303.3997.
- 34. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27:2987–2993. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Arlt M.F., Mulle J.G., Schaibley V.M., Ragland R.L., Durkin S.G., Warren S.T., Glover T.W. Replication stress induces genome-wide copy number changes in human cells that resemble polymorphic and pathogenic variants. Am. J. Hum. Genet. 2009; 84:339–350. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Cameron D.L., Di Stefano L., Papenfuss A.T. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat. Commun. 2019; 10:3240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Gregory T., Ngankeu A., Orwick S., Kautto E.A., Woyach J.A., Byrd J.C., Blachly J.S. Characterization and mitigation of fragmentation enzyme-induced dual stranded artifacts. NAR Genom. Bioinformatics. 2020; 2:lqaa070. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Peccoud J., Lequime S., Moltini-Conclois I., Giraud I., Lambrechts L., Gilbert C. A survey of virus recombination uncovers canonical features of artificial chimeras generated during deep sequencing library preparation. G3 (Bethesda). 2018; 8:1129–1138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. McConnell M.J., Lindberg M.R., Brennand K.J., Piper J.C., Voet T., Cowing-Zitron C., Shumilina S., Lasken R.S., Vermeesch J.R., Hall I.M. et al. Mosaic copy number variation in human neurons. Science. 2013; 342:632–637. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Breuss M.W., Yang X., Schlachetzki J.C.M., Antaki D., Lana A.J., Xu X., Chung C., Chai G., Stanley V., Song Q. et al. Somatic mosaicism reveals clonal distributions of neocortical development. Nature. 2022; 604:689–696. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Breuss M.W., Yang X., Gleeson J.G. Sperm mosaicism: implications for genomic diversity and disease. Trends Genet. 2021; 37:890–902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Quispe-Tintaya W., Lee M., Dong X., Weiser D.A., Vijg J., Maslov A.Y. Bleomycin-induced genome structural variations in normal, non-tumor cells. Sci. Rep. 2018; 8:16523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Zhao X., Collins R.L., Lee W.P., Weber A.M., Jun Y., Zhu Q., Weisburd B., Huang Y., Audano P.A., Wang H. et al. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. Am. J. Hum. Genet. 2021; 108:919–928. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Ebert P., Audano P.A., Zhu Q., Rodriguez-Martin B., Porubsky D., Bonder M.J., Sulovari A., Ebler J., Zhou W., Serra Mari R. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021; 372:eabf7117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Aganezov S., Yan S.M., Soto D.C., Kirsche M., Zarate S., Avdeyev P., Taylor D.J., Shafin K., Shumate A., Xiao C. et al. A complete reference genome improves analysis of human genetic variation. Science. 2022; 376:eabl3533. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Quispe-Tintaya W., Gorbacheva T., Lee M., Makhortov S., Popov V.N., Vijg J., Maslov A.Y. Quantitative detection of low-abundance somatic structural variants in normal cells by high-throughput sequencing. Nat. Methods. 2016; 13:584–586. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Xing D., Tan L., Chang C.H., Li H., Xie X.S. Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands. Proc. Natl Acad. Sci. USA. 2021; 118:e2013106118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Ramsden D.A., Carvajal-Garcia J., Gupta G.P. Mechanism, cellular functions and cancer roles of polymerase-theta-mediated DNA end joining. Nat. Rev. Mol. Cell Biol. 2022; 23:125–140. [DOI] [PubMed] [Google Scholar]
- 49. Schimmel J., Kool H., van Schendel R., Tijsterman M. Mutational signatures of non-homologous and polymerase theta-mediated end-joining in embryonic stem cells. EMBO J. 2017; 36:3634–3649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Hastings P.J., Ira G., Lupski J.R. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet. 2009; 5:e1000327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Bahrambeigi V., Song X., Sperle K., Beck C.R., Hijazi H., Grochowski C.M., Gu S., Seeman P., Woodward K.J., Carvalho C.M.B. et al. Distinct patterns of complex rearrangements and a mutational signature of microhomeology are frequently observed in PLP1 copy number gain structural variants. Genome Med. 2019; 11:80. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All codes used in data analysis and figure generation are available for download as part of the svx-mdi-tools repository at https://doi.org/10.5281/zenodo.7871677 and https://github.com/wilsontelab/svx-mdi-tools, via its svCapture pipeline and app. HF1 svCapture raw read data, in FASTQ format, and all called SVs, in VCF format, are available from the Database of Genotypes and Phenotypes (dbGaP) under accession phs003121.v1.p1.







