Abstract
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections, and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.
Keywords: Next-generation sequencing, Complex structural variant, Pattern growth, Graph mining, Formation mechanism
Introduction
Computational methods based on next-generation sequencing (NGS) have provided an increasingly comprehensive discovery and catalog of simple structural variants (SVs) that usually have two breakpoints, such as deletions (Dels) and inversions (Invs) [1], [2], [3], [4], [5], [6], [7]. In general, these approaches follow a model-match strategy, where a specific SV model and its corresponding mutational signal model are proposed. Afterward, the mutational signal model is used to match observed signals for detection (Figure 1A). This model-match strategy has proven effective for detecting simple SVs, providing us with prominent opportunities to study and understand genome evolution and disease progression [8], [9], [10], [11]. However, recent research has revealed that some rearrangements have multiple, compounded mutational signals and usually cannot fit into the simple SV models [8], [12], [13], [14], [15], [16] (Figure 1B). For example, in 2015, Sudmant et al. [8] systematically categorized 5 types of complex structural variants (CSVs) and found that a remarkable 80% of 229 Inv sites were complex events. Collins et al. [17] used long-insert size whole-genome sequencing (liWGS) on autism spectrum disorder (ASD) and successfully resolved 16 classes of 9666 CSVs from 686 patients. In 2019, Lee et al. [16] revealed that 74% of known fusion oncogenes of lung adenocarcinomas were caused by complex genomic rearrangements, including EML4-ALK and CD74-ROS1. Though less frequently reported than simple SVs, these multiple-breakpoint rearrangements were considered as punctuated events, leading to severe genome alterations at once [10], [18], [19], [20], [21]. This dramatic change of the genome provided distinctive evidence to study formation mechanisms of rearrangement and to understand cancer genome evolution [13], [14], [17], [18], [19], [21], [22], [23], [24].
However, due to the lack of effective CSV detection algorithms, most CSV-related studies screen these events from the “sea” of simple SVs through computationally expensive contig assembly and realignment, clustering of incomplete breakpoints, or even targeted manual inspection [8], [12], [16]. In fact, many CSVs have already been neglected or misclassified in this “sea” because of the incompatibility between complicated mutational signals and existing SV models. Although the importance and challenge of CSV detection have been recognized, only a few dedicated algorithms have been proposed for CSV discovery, and they follow two major approaches guided by the model-match strategy. TARDIS and SVelter utilize the top-down approach, where they attempt to model all the mutational signals of a CSV event instead of modeling specific parts of signals. In particular, TARDIS [25] proposes sophisticated abnormal alignment models to depict the mutational signals reflected by dispersed duplication (Disdup) and inverted duplication (Invdup). The pre-defined models are then used to fit observed signals from alignments for the detection of the two specific CSV types. Such modeling is complicated and greatly limited by the diversity of CSV types. To solve this, SVelter [26] replaces the modeling process for specific CSVs with a randomly created virtual rearrangement, and CSVs are detected by minimizing the difference between the virtual rearrangement and the observed signals. In contrast, GRIDSS [27] represents the assembly-based approach, which detects CSVs through extra breakpoints discovered from contig assembly and realignment. Although the assembly-based approach is sensitive for breakpoint detection, it lacks rules to constrain or classify these breakpoints and leaves them as independent events. As a result, these model-match-guided approaches would substantially break up or misinterpret the CSVs because of partially matched signals (Figure 1B).
Moreover, the graph is another approach that has been widely used for simple [2], [28] and complex [19], [29] SV detection. Notably, ARC-SV [29] uses clustered discordant read-pairs to construct an adjacency graph and adopts a maximum likelihood model to detect CSVs, showing the great potential of using the graph to detect CSVs. Accordingly, there is an urgent demand for a new strategy, enabling CSV detection without pre-defined models as well as maintaining the completeness of a CSV event.
In this study, we proposed a bottom-up guided model-free strategy, implemented as Mako, to effectively discover CSVs all at once based on short-read sequencing. Specifically, Mako uses a graph to build connections of mutational signals derived from abnormal alignment, providing the potential breakpoint connections of CSVs. Meanwhile, Mako replaces model fitting with the detection of maximal subgraphs through a pattern growth approach. Pattern growth is a bottom-up approach, which captures the natural features of data without sophisticated model generation, allowing CSV detection without pre-defined models. We benchmarked Mako against five widely used tools on a series of simulated and real data. The results show that Mako is an effective and efficient algorithm for CSV discovery, which will provide more opportunities to study genome evolution and disease progression from large cohorts. Remarkably, the analysis of subgraphs detected by Mako highlights the unique strength of Mako, where Mako is able to effectively characterize the CSV breakpoint connections, confirming the completeness of a CSV event. Moreover, we systematically analyzed the CSVs detected by Mako on three healthy samples, revealing a novel role of sequence homology in CSV formation.
Method
Overview of Mako
Given that a CSV is a single event with multiple breakpoint connections, the breakpoints in the current CSV shall not connect with false-positive breakpoints or those from unrelated events. Thus, we formulate the discovery of CSVs as maximal subgraph pattern detection in a signal graph. Accordingly, Mako detects CSVs with NGS data in two major steps, i.e., signal graph creation and subgraph detection (Figure 2). Firstly, Mako collects and clusters abnormally aligned reads as signal nodes and defines two types of edges to build the signal graph $G = (V, E)$, with node set $V$ and edge set $E = E_{pe} \cup E_{ae}$. Each signal node $v \in V$ is represented as $v = (type, pos, weight)$, where type, pos, and weight denote the abnormal alignment type, node position, and the number of supporting abnormal reads, respectively. For the edge set, each edge in $E_{pe}$ and $E_{ae}$ is represented as $e_{pe} = (v_i, v_j, sr)$ and $e_{ae} = (v_i, v_j, dist)$, respectively, where $v_i, v_j \in V$. Specifically, $E_{pe}$ represents paired edges from a certain number of supporting paired-reads or split-reads (sr). $E_{ae}$ indicates the adjacent edges induced from the reference genome, connecting two adjacent signal nodes of distance (dist). Secondly, Mako applies a pattern growth approach to detect the maximal subgraphs as potential CSVs at the whole-genome scale. Meanwhile, the attributes of the subgraph are used to measure the complexity, and CSV types are determined by the edge connection types of the corresponding subgraphs (Figure 2).
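The signal graph described above can be pictured with a small data structure. The following is a minimal Python sketch, not Mako's Java implementation: the class names `SignalNode` and `SignalGraph`, the method names, and the node labels A, B, and C are hypothetical, and deriving adjacent edges by sorting nodes on pos is an assumption about how "adjacent on the reference" might be realized.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SignalNode:
    type: str    # abnormal alignment type (labels here are placeholders)
    pos: int     # node position on the reference
    weight: int  # number of supporting abnormal reads


@dataclass
class SignalGraph:
    nodes: List[SignalNode] = field(default_factory=list)
    # Paired edges: (node index, node index) -> supporting read count.
    paired_edges: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def add_node(self, node: SignalNode) -> int:
        """Store a node and return its index."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_paired_edge(self, i: int, j: int, support: int) -> None:
        """Add (or reinforce) a paired edge supported by `support` reads."""
        key = (min(i, j), max(i, j))
        self.paired_edges[key] = self.paired_edges.get(key, 0) + support

    def adjacent_edges(self) -> List[Tuple[int, int, int]]:
        """Adjacent edges (i, j, dist) between nodes that are consecutive by pos."""
        order = sorted(range(len(self.nodes)), key=lambda k: self.nodes[k].pos)
        return [(a, b, self.nodes[b].pos - self.nodes[a].pos)
                for a, b in zip(order, order[1:])]


graph = SignalGraph()
a = graph.add_node(SignalNode("A", 100, 6))
b = graph.add_node(SignalNode("B", 300, 4))
c = graph.add_node(SignalNode("C", 250, 5))
graph.add_paired_edge(a, b, support=4)
```

In this toy graph, the paired edge links A and B directly, whereas the adjacent edges follow the reference order A, C, B and carry the distance between neighbours.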
Building signal graph
To create the signal graph, Mako collects abnormally aligned reads that satisfy one of the following criteria from the alignment file: 1) clipped reads whose clipped portion is at least 10% of the overall read length; 2) split reads with high mapping quality; 3) discordant read-pairs. One group of signal nodes is created by clustering clipped-reads or split-reads at the same position on the genome, and these nodes are filtered by weight and by the ratio between weight and the coverage at pos. Another group of signal nodes is derived from clusters of discordant read-pairs, where the clustering distance is the estimated average insert size minus two-fold read length. It should be noted that a discordant alignment produces two nodes, and Mako separately clusters discordant alignments of different abnormal alignment types, such as abnormal insert size and incorrect mapping orientation. We adopt the procedure introduced by Chen [4] to avoid using randomly occurring discordant alignments (File S1). Additionally, edges are created along with the signal nodes, where multiple types of edges might co-exist between two nodes.
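The read-selection and clustering rules in this subsection can be illustrated with a few helper functions. This is a hedged Python sketch rather than Mako's code: all function names are hypothetical, the thresholds `min_weight` and `min_ratio` are placeholders for values the text does not state, and the greedy left-to-right grouping is only one plausible way to apply the clustering distance.

```python
from typing import List


def is_clipped_signal(clip_len: int, read_len: int, min_frac: float = 0.10) -> bool:
    """Criterion 1: the clipped portion is at least 10% of the read length."""
    return clip_len / read_len >= min_frac


def passes_node_filter(weight: int, coverage: float,
                       min_weight: int, min_ratio: float) -> bool:
    """Filter a clipped/split-read node by weight and by weight/coverage at pos.

    min_weight and min_ratio are hypothetical thresholds.
    """
    return weight >= min_weight and coverage > 0 and weight / coverage >= min_ratio


def discordant_cluster_distance(mean_insert: int, read_len: int) -> int:
    """Clustering distance: average insert size minus two-fold read length."""
    return mean_insert - 2 * read_len


def cluster_positions(positions: List[int], max_dist: int) -> List[List[int]]:
    """Greedy grouping (an assumption): start a new cluster whenever the gap
    to the previous sorted position exceeds max_dist."""
    clusters: List[List[int]] = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= max_dist:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


max_dist = discordant_cluster_distance(mean_insert=500, read_len=150)
clusters = cluster_positions([400, 100, 900, 120, 410], max_dist)
```

With a 500-bp average insert size and 150-bp reads, the clustering distance is 200 bp, so the five toy positions fall into three clusters, each of which would become one signal node.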
Detecting CSVs with pattern growth
Pattern growth has been widely used in many areas [30], [31], [32], [33], [34], [35], such as insertion/deletion (Indel) detection in DNA sequences [1], [23]. For CSV detection, the subgraph pattern starts at a single node and grows by adding one node at a time until no suitable node can be found (Algorithm I). Specifically, the subgraph is allowed to grow in increasing order of the pos value of each node, and backtracking is only allowed for nodes involved in the current subgraph. Of note, pattern growth via adjacent edges is conditional on the distance constraint (minDist) because these edges are derived from the reference genome rather than from alternative alleles. For example, Mako detects the maximal subgraph ACBD by visiting nodes A, C, B, and D, while growth through the edge between D and E is blocked because their distance exceeds the constraint (Figure 2).
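A simplified view of this growth procedure is sketched below in Python. It is an illustration under stated assumptions, not Algorithm I itself: the function name `grow_subgraph`, the toy coordinates, and the rule of trying paired edges before adjacent edges are all hypothetical choices, the last one made so that the toy example reproduces the A, C, B, D visiting order mentioned in the text. The parameter `max_adj_dist` plays the role of the distance constraint that the text calls minDist.

```python
from typing import Dict, List, Set, Tuple


def grow_subgraph(start: str, pos: Dict[str, int],
                  paired: List[Tuple[str, str]], max_adj_dist: int) -> List[str]:
    """Grow a subgraph from `start`, adding one node per iteration."""
    # Adjacent edges join consecutive nodes on the reference and are usable
    # only when the two nodes are within max_adj_dist of each other.
    order = sorted(pos, key=pos.get)
    adjacent: Dict[str, Set[str]] = {}
    for a, b in zip(order, order[1:]):
        if pos[b] - pos[a] <= max_adj_dist:
            adjacent.setdefault(a, set()).add(b)
            adjacent.setdefault(b, set()).add(a)

    # Paired edges come from read evidence and carry no distance limit.
    paired_nb: Dict[str, Set[str]] = {}
    for a, b in paired:
        paired_nb.setdefault(a, set()).add(b)
        paired_nb.setdefault(b, set()).add(a)

    visited = [start]
    while True:
        in_subgraph = set(visited)
        # Assumption: try paired edges first, then adjacent edges.
        for neighbours in (paired_nb, adjacent):
            candidates = {n for v in visited
                          for n in neighbours.get(v, ())} - in_subgraph
            if candidates:
                # Among candidates, take the one with the smallest pos.
                visited.append(min(candidates, key=pos.get))
                break
        else:
            # No edge type offers a new node: the subgraph is maximal.
            return visited


toy_pos = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 5000}
toy_paired = [("A", "C"), ("B", "D")]
visit_order = grow_subgraph("A", toy_pos, toy_paired, max_adj_dist=1000)
```

Node E is never reached because its only link, the adjacent edge from D, spans 4600 bp and exceeds the 1000-bp limit used in this toy run.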
Given that the signal graph contains millions of nodes at the whole-genome scale, we adopt the “seed-and-extension” [36], [37] strategy to accelerate subgraph detection. Moreover, the discovered subgraphs differ not only in edge connections but also in the node types of the subgraph. Therefore, we propose an algorithm that starts at multiple signal nodes of the same type at the whole-genome scale, but extends locally for subgraph detection (Algorithm II). The parameter minFreq is used to measure the frequency of detected subgraphs, and Mako uses minFreq = 1 to avoid missing subgraphs of rare CSVs or incomplete ones. The detected CSV subgraph provides the connections between multiple breakpoints of a CSV, and the attributes of the subgraph are used to measure the complexity of CSVs. Accordingly, Mako defines the boundary of CSVs using the leftmost and rightmost pos values of the nodes and utilizes the number of distinct node types multiplied by the number of edges as a complexity measurement score (CXS). For example, the discovered CSV subgraph ACBD has a CXS of 8 due to 4 different node types, i.e., A, C, B, and D, and two paired edges (Figure 2). A toy example of executing the algorithm is shown in Figure S1.
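The boundary and CXS definitions reduce to simple arithmetic, shown here as a short Python sketch. The function names and the toy coordinates are hypothetical; the score follows the worked example in the text, where 4 different node types and two paired edges give 4 × 2 = 8.

```python
from typing import Dict, List, Tuple


def csv_boundary(pos: Dict[str, int], nodes: List[str]) -> Tuple[int, int]:
    """Boundary of a CSV: leftmost and rightmost pos among its nodes."""
    values = [pos[n] for n in nodes]
    return min(values), max(values)


def complexity_score(node_types: List[str], n_edges: int) -> int:
    """CXS: number of different node types multiplied by the number of edges."""
    return len(set(node_types)) * n_edges


# Subgraph ACBD from the example: four node types and two paired edges.
subgraph_nodes = ["A", "C", "B", "D"]
cxs = complexity_score(subgraph_nodes, n_edges=2)
boundary = csv_boundary({"A": 100, "B": 200, "C": 300, "D": 400}, subgraph_nodes)
```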
Algorithm I: Detect maximal subgraphs |
---|
Input: Signal graph , parameters |
Output: A set of CSV subgraphs , with |
1: procedure |
2: Initialize equals to frequency of node in ; |
3: Build index-projection of ; |
4: forindo: |
5: Build index-projection ; |
6: ; |
7: ifthen |
8: ; |
9: end if |
10: end for |
11: end procedure |
Algorithm II: Multi-location subgraph growth |
---|
1: procedure |
2: Initialize with adjacent node direct after through ; |
3: forindo: |
4: ifthen |
5: ; |
6: ; |
7: ; |
8: end if |
9: end for |
10: end procedure |
11: procedure |
12: Set the nodes in with respect increasing order of value: ; |
13: Set ; |
14: ifthen |
15: ifthen |
16: return True |
17: else: |
18: forto 0 do |
19: if between and then |
20: return True |
21: end if |
22: end if |
23: return False |
24: end procedure |
Performance evaluation
Since CSVs contain multiple breakpoints, we propose two tiers of stringency for their evaluation, i.e., unique-interval match and all-breakpoint match. For a unique-interval match, the correctly predicted breakpoints shall be within 500 bp of the leftmost and rightmost breakpoints of a benchmark CSV. For the all-breakpoint match initially proposed by Sniffles [38], a benchmark CSV is divided into separate subcomponents, and each of them should be correctly detected. For example, a CSV with an Inv flanked by two Dels contains three components, and the correct prediction of all breakpoints for the three components is considered an all-breakpoint match. Meanwhile, if only one prediction is close to the leftmost and rightmost breakpoints of the CSV, this prediction is considered a unique-interval match. For simulated CSVs, true positives (TPs) are defined as predictions satisfying either match criterion, while predictions not in the benchmark are false positives (FPs). False negatives (FNs) are events in the benchmark set that are not matched by predictions. Because it is usually challenging to measure the FPs for real data due to the lack of a curated CSV set, we only consider the number of correct discoveries for real data (File S1).
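The two matching criteria can be expressed compactly in Python. This is a sketch under assumptions: the function names and coordinates are hypothetical, intervals are (start, end) tuples on one chromosome, and applying the same 500-bp tolerance to each subcomponent in the all-breakpoint match is an assumption, since the text states that tolerance only for the unique-interval match.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def unique_interval_match(pred: Interval, bench: Interval, tol: int = 500) -> bool:
    """Both ends of the prediction lie within tol bp of the benchmark's
    leftmost and rightmost breakpoints."""
    return abs(pred[0] - bench[0]) <= tol and abs(pred[1] - bench[1]) <= tol


def all_breakpoint_match(preds: List[Interval], components: List[Interval],
                         tol: int = 500) -> bool:
    """Every subcomponent of the benchmark CSV is matched by some prediction."""
    return all(any(unique_interval_match(p, c, tol) for p in preds)
               for c in components)


def recall(tp: int, fn: int) -> float:
    """Fraction of benchmark events that were recovered."""
    return tp / (tp + fn)


# Toy benchmark CSV: an Inv flanked by two Dels (three components).
components = [(1000, 1500), (1500, 2500), (2500, 3000)]
whole_csv = (components[0][0], components[-1][1])

one_call = [(1100, 2900)]                   # a single call spanning the event
two_calls = [(1000, 1500), (1500, 2500)]    # the second Del is missing
three_calls = two_calls + [(2450, 3010)]    # all three components recovered
```

Here the single spanning call satisfies the unique-interval criterion but not the all-breakpoint criterion, which illustrates why the two tiers measure different aspects of a caller.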
Preparing CSV benchmarks for performance evaluation
In this study, we use both simulated and real CSVs to benchmark the performance of different callers. We follow the workflow introduced by Sniffles [38] to create simulated CSVs (Figure S2). Firstly, VISOR [39] is used to create Del, Inv, Invdup, tandem duplication (Tandup), and Disdup. These events, termed basic operations, are implanted and marked on the reference genome GRCh38 to generate an alternative genome. Secondly, CSVs are created by randomly adding basic operations to those marked operations, leading to a new genome harboring CSVs (CSV genome). Meanwhile, the purity parameter of VISOR is used to produce homozygous and heterozygous CSVs. Afterward, VISOR generates simulated paired-end reads based on the CSV genome with wgsim (https://github.com/lh3/wgsim) and aligns them to the reference genome with BWA-MEM [37]. Following this general simulation procedure, we create reported CSV types published by previous studies [8], [17] and randomized CSV types (File S1).
In terms of the real data, we are not aware of any public CSV benchmarks due to the breakpoint complexity and underdeveloped methods [8], [12], [26], [40], [41]. Fortunately, Pacific Biosciences (PacBio) reads could span multiple breakpoints of CSVs, providing direct evidence to validate CSVs through sequence Dotplot [42]. Thus, we curate the CSV benchmark from a simple SV callset by breakpoint clustering and manual inspection. For SV clustering, each SV is considered as an interval, and hierarchical clustering with the average method is used to find interval clusters (Figures S3 and S4). We then use the threshold that could produce the most clusters for merging clusters, which could potentially reduce the number of missed CSVs (Figures S5 and S6; Table S1). Given these simple SV clusters, we apply Gepard to create Dotplots based on PacBio high-fidelity (HiFi) reads and manually investigate each Dotplot. Since CSVs are rare and might appear at the minor allele, we create a Dotplot for each long read that spans the corresponding region.
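The interval clustering step can be sketched with a small pure-Python average-linkage routine. Several details are assumptions because the text does not specify them: the distance between two SV intervals is taken to be the gap between them (zero if they overlap), the threshold value is a placeholder rather than the one selected in the study, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Interval = Tuple[int, int]


def interval_gap(a: Interval, b: Interval) -> int:
    """Gap in bp between two intervals; 0 when they overlap (assumed metric)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def average_linkage(intervals: List[Interval], threshold: float) -> List[List[int]]:
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise gap until that mean exceeds the threshold."""
    clusters = [[i] for i in range(len(intervals))]

    def dist(c1: List[int], c2: List[int]) -> float:
        total = sum(interval_gap(intervals[i], intervals[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (x, y), d = min(
            (((x, y), dist(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        # x < y always holds, so deleting index y leaves index x valid.
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)


sv_intervals = [(100, 200), (250, 300), (5000, 5100), (5150, 5300)]
sv_clusters = average_linkage(sv_intervals, threshold=500)
```

Each resulting cluster of simple SV calls would then be a candidate region for Dotplot inspection.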
Orthogonal validation of Mako-detected CSVs
To fully characterize Mako’s performance on real data, we use experimental and computational validations as well as manual inspections of CSVs from HG00733. The raw CSV calls from HG00733 are obtained by selecting events with more than one link type observed in the subgraph. For the experimental validation, Primer3 (https://github.com/primer3-org/primer3) is used to design PCR primers, where primers are selected within the extended distance but 200 bp outside of the boundaries of the breakpoints defined by Mako (Figure S7). BLAT (https://users.soe.ucsc.edu/~kent/) search is performed at the same time to ensure all primer candidates have only one hit in the human genome. Afterward, we select amplification products with the expected product size and bright electrophoretic bands for Sanger sequencing (Figure S8). The obtained Sanger sequences are aligned against the reference allele of the CSV site and visualized with Gepard for breakpoint inspection (File S1).
As for the computational validation, two orthogonal datasets obtained from the Human Genome Structural Variant Consortium (HGSVC) are used, i.e., Oxford Nanopore Technologies (ONT) sequencing reads and HiFi contigs. We first apply VaPoR [43] on the ONT reads to validate CSVs, referred to as ONT validation. Additionally, we apply a K-mer-based breakpoint examination based on haplotype-aware HiFi contigs, from which we calculate the difference between the K-mer breakpoints and predicted breakpoints (Figure S9; File S1).
Furthermore, we manually curate detected CSVs via Dotplots created by Gepard (Figure S10), which is similar to the procedure of creating the benchmark CSVs for real data (File S1). For CSVs at highly repetitive regions, we further validate them according to specific patterns (Figures S11–S13).
Results
Mako effectively characterizes multiple breakpoints of CSVs
The most important feature for a CSV is the presence of multiple breakpoints in a single event. Thus, we first examined the performances of Mako, Lumpy, Manta, SVelter, TARDIS, and GRIDSS for detecting multiple breakpoints. The results were evaluated according to the all-breakpoint match criterion on both reported and randomized CSV-type simulations. Overall, for the heterozygous (Figure 3A) and homozygous (Figure 3B) simulations, Mako was comparable to GRIDSS, and these two methods outperformed other algorithms. For example, GRIDSS, Mako, and Lumpy detected 50%, 51%, and 46% of reported heterozygous CSV breakpoints, while they reported 53%, 54%, and 44% of randomized ones. Because the graph encoded both multiple breakpoints and their substantial connections for each CSV, Mako achieved better performance on randomized events, which included more subcomponents than the reported ones. Indeed, by comparing reported and randomized simulations, the breakpoint detection sensitivity (Figure 3A and B) of Mako for randomized simulation increased, while that of other algorithms dropped except for GRIDSS. Although the assembly-based method, GRIDSS, is as effective as Mako for breakpoint detection, it lacks a proper procedure to resolve the connections among breakpoints.
Mako precisely discovers CSV unique-interval
A CSV is considered as a single event consisting of connected breakpoints, and we have demonstrated that Mako is able to detect CSV breakpoints effectively. However, the breakpoint detection evaluation only assesses the discovery of basic components of a CSV and lacks an examination of CSV completeness. We then investigated whether Mako could precisely capture the entire CSV interval even with missing breakpoints. According to the unique-interval match criterion, Mako consistently outperformed other algorithms for both reported and randomly created CSVs, while SVelter and GRIDSS ranked second and third, respectively. For the reported CSVs at 30× coverage (Figure 3C and D), the recalls of Mako were 92% and 94% for reported heterozygous and homozygous CSVs, respectively, which were significantly higher than those of SVelter (57% for reported heterozygous CSVs and 49% for reported homozygous CSVs). Due to the randomized top-down approach, SVelter was able to discover some complete CSV events, but it may not explore all possibilities. Remarkably, we noted that Mako’s sensitivity was even better for randomized simulation (Figure 3E and F), which was consistent with our previous observation (Figure 3A and B). In particular, at 30× coverage, Mako detected 203% more heterozygous CSVs than SVelter (Figure 3E), probably due to the complementary graph edges for accurate CSV site discovery.
Performance on real data
We further compared Mako with SVelter, GRIDSS, and TARDIS on the whole-genome sequencing data of NA19240 and SKBR3. Firstly, we compared the callsets of different callers (Figures S14 and S15), and found that Mako shared most calls with GRIDSS (Figure 4A and B), which was consistent with our observation in simulated data (Figure 3). Furthermore, we examined the discovery completeness of 59 (NA19240) and 21 (SKBR3) benchmark CSVs (Table 1, Table S2; File S2). Because Manta and Lumpy contributed to the CSV benchmark sets, they were excluded from the comparison. The results showed that Mako performed the best for the two benchmark sets with different CXS thresholds, while TARDIS ranked second (Figure 4C). Given that Invdup and Disdup dominated the two benchmark sets (Table 1) and that TARDIS has designed specific models for these two types, TARDIS detected more events of these two duplication types than SVelter and GRIDSS. SVelter only detected three benchmark CSVs for SKBR3 because the randomized approach may not explore all combinations of CSVs. Based on the aforementioned observation, we concluded that the graph-based model-free strategy of Mako performed better than either randomized model (SVelter) or specific model (TARDIS) with few computational resources (Figure S16).
Table 1.
Type | Benchmark summary (NA19240) | Benchmark summary (SKBR3) | Description |
---|---|---|---|
Disdup | 15 | 12 | Dispersed duplication |
Invdup | 18 | – | Inverted duplication |
DelInv | 7 | 5 | Deletion associated with inversion |
DelDisdup | 5 | 1 | Deletion associated with dispersed duplication |
DelInvdup | 1 | – | Deletion associated with inverted duplication |
DisdupInvdup | 2 | 2 | Dispersed duplication with inverted duplication |
InsInv | 1 | – | Insertion associated with inversion |
Tantrans | 1 | – | Adjacent segment swap |
DelSpaDel | 8 | 1 | Two deletions with inverted or non-inverted spacer |
TanDisdup | 1 | – | Tandem dispersed duplication |
CSV subgraph illustrates breakpoint connections
Having demonstrated the performance of Mako on simulated and real data, we surveyed the landscape of CSVs from three individual genomes. Specifically, CSVs from autosomes were selected from Mako’s callset with more than one edge connection type observed in the subgraph, leading to 403, 609, and 556 events for HG00514, HG00733, and NA19240, respectively (Figure S17; Table S3). We systematically evaluated all CSV events in HG00733 via experimental and computational validations as well as manual inspections (File S3). For experimental validation, we successfully designed primers for 107 CSVs (Table S4), where 15 out of 21 (71%) CSVs were successfully amplified and validated by Sanger sequencing (Table 2, Tables S5 and S6; File S4). The computational validation showed up to 87% accuracy (Figure S4; Table 3, Tables S5, S7 and S8), indicating that a combination of methods and external data is necessary for comprehensive CSV validation. Further analysis showed that the medians of experimental and computational breakpoint shift were 13 bp and 26 bp, respectively (Figure S18). We observed that approximately 54% of CSVs were found in either short tandem repeat (STR) or variable number tandem repeat (VNTR) regions, contributing to 75% of all events inside the repetitive regions (Figure S17). For the connection types, more than half of the events contain Dup and Ins edges in the graph (Figure S17), indicating duplication-involved sequence insertion. Moreover, around 40% of the events contain Del edges (Figure S17), showing connections of two distant segments derived from either Dup or Inv events. We further examined whether the CSV subgraph depicts the connections for each CSV via discordant read-pairs. Interestingly, we observed two representative events with four breakpoints at chr6:128,961,308–128,962,212 and chr5:151,511,018–151,516,780 from NA19240 and SKBR3, respectively (Figure 5).
Both events were correctly detected by Mako, but missed by SVelter and reported more than once by GRIDSS and TARDIS (Table S9). In particular, the CSV at chr6:128,961,308–128,962,212 that consists of two deletions and an inverted spacer (DelSpaDel) was reported twice and five times by GRIDSS and TARDIS, respectively. The event at chr5:151,511,018–151,516,780 that consists of Del and Disdup was reported four and three times by GRIDSS and TARDIS, respectively. Such redundant predictions complicate and mislead downstream functional annotations. On the contrary, Mako was able to completely detect the aforementioned two CSV events and was also capable of revealing the breakpoint connections of CSVs encoded in the subgraphs. These observations suggest that Mako’s subgraph representation is interpretable, from which we can characterize the breakpoint connections for a given CSV event.
Table 2.
Chromosome | Start | End | Mako type |
---|---|---|---|
Chr1 | 81,194,398 | 81,195,874 | Del, Inv |
Chr2 | 119,659,504 | 119,661,322 | Del, Dup |
Chr3 | 146,667,093 | 146,667,284 | Del, Dup |
Chr5 | 141,480,327 | 141,483,116 | Del, Dup |
Chr7 | 1,940,931 | 1,941,009 | Dup, Ins |
Chr9 | 29,591,409 | 29,593,057 | Del, Inv |
Chr10 | 14,568,488 | 14,568,677 | Dup, Ins |
Chr12 | 71,315,482 | 71,316,928 | Del, Inv |
Chr12 | 77,989,900 | 77,994,324 | Del, Inv |
Chr13 | 74,340,759 | 74,342,810 | Del, Dup |
Chr16 | 78,004,459 | 78,007,456 | Del, Dup |
Chr17 | 34,854,438 | 34,855,851 | Del, Inv |
Chr17 | 48,538,270 | 48,540,171 | Del, Dup |
Chr18 | 72,044,575 | 72,045,937 | Del, Dup |
Chr21 | 26,001,844 | 26,001,844 | Del, Inv |
Note: Del, deletion; Ins, insertion; Dup, duplication; Inv, inversion.
Table 3.
Validation strategy | Data | Total | Valid | Invalid | Inconclusive |
---|---|---|---|---|---|
Experimental (PCR succeeded) | | 21 | 15 (71%) | 6 (29%) | – |
Computational | ONT reads | 609 | 256 (42%) | – | 353 (58%) |
Computational | HiFi contigs | 609 | 414 (68%) | 195 (32%) | – |
Computational | ONT reads or HiFi contigs | 609 | 533 (87%) | 76 (13%) | – |
Manual | HiFi reads | 609 | 440 (72%) | 169 (28%) | – |
Note: ONT, Oxford Nanopore Technologies; HiFi, Pacific Biosciences high-fidelity.
Contribution of homology sequence in CSV formation
Given 1568 detected CSVs from three genomes (HG00514, HG00733, and NA19240), we further investigated the formation mechanisms of these CSVs. Ongoing studies have revealed that inaccurate DNA repair and the 2–33 bp long microhomology sequence at breakpoint junctions play an important role in CSV formation [18], [44], [45], [46], [47]. To further characterize CSVs’ internal structure and examine the impact of homology sequence on CSV formation, we manually reconstructed 1052 high-confidence CSV calls given by Mako (252/403 from HG00514, 440/609 from HG00733, and 360/556 from NA19240) via Dotplots created from PacBio HiFi reads (Figure 6A, Figure S19; Table S10; File S3). The percentage of successfully reconstructed events was similar to the orthogonal validation rate, showing that CSVs detected by Mako were accurate, and the validation method was effective. The high-confidence CSV callset contains 816 insertion associated with duplication (InsDup) events with both Ins and Dup edge connections. Further investigation revealed that these events contain irregular repeat sequence expansion, making them different from simple Ins or Dup events (Figure S20). Besides, we found two novel types, named adjacent segment swap (Tantrans) and tandem dispersed duplication (TanDisdup) (Figure 6B, Figures S21 and S22). We inferred that homology sequence-mediated inaccurate replication was the major cause of these two types. Furthermore, we observed that 134 CSVs contain either Invdup or Disdup events (Table S10). These Invdup/Disdup-involved CSVs were mainly caused by microhomology-mediated break-induced replication (MMBIR) according to previous studies [18], [45], [48]. It was known that different homology patterns caused distinct CSV types (Figure 6C and D). Surprisingly, one particular homology pattern yielded multiple CSV types (Figure 6E). In the particular situations shown for the three homology patterns, a DNA double-strand break (DSB) occurred after replication of fragment c.
According to the MMBIR mechanism and template switch (TS) [22], [45], [46], [47], pattern I (Figure 6C) and pattern II (Figure 6D) each yielded one outcome, whereas pattern III (Figure 6E) produced three different outcomes. These results provide additional evidence for understanding the impact of sequence contents on DNA DSB repair, leading to a better understanding of the diverse variants produced by CRISPR [49], [50].
Discussion
Currently, short-read sequencing is significantly reduced in cost and has been applied to clinical diagnostics and large cohort studies [16], [51], [52]. However, CSVs from short-read data are not fully explored due to the methodology limitations. Although long-read sequencing technologies bring us promising opportunities to characterize CSVs [13], [14], [38], their application is currently limited to small-scale projects, and the methods for CSV discovery are also underdeveloped. As far as we know, NGMLR combined with Sniffles is the only pipeline that utilizes the model-match strategy to discover two specific forms of CSVs, namely DelInv and Invdup. Therefore, there is a strong demand in the genomic community to develop effective and efficient algorithms to detect CSVs using short-read data. It should be noted that CSV breakpoints might come from either single haplotype or different haplotypes, where two simple SVs from different haplotypes lead to false positives (Figure S23). This may increase the false discovery rate due to a lack of haplotype information. Therefore, the combination of short-read and long-read sequencing might improve CSV discovery and characterization.
To sum up, we developed Mako, which utilizes the graph-based pattern growth approach, for CSV discovery with 70% accuracy and 20 bp median breakpoint shift. To the best of our knowledge, Mako is the first algorithm that utilizes the bottom-up guided model-free strategy for SV discovery, avoiding the complicated model and match procedures. Given the fact that CSVs are largely unexplored, Mako presents opportunities to broaden our knowledge of genome evolution and disease progression.
Code availability
Mako is implemented in Java 1.8, and it is available at https://github.com/xjtu-omics/Mako. It is free for non-commercial use by academic, government, and non-profit/not-for-profit institutions. A commercial version of the software is available and licensed through Xi’an Jiaotong University. All scripts used in this study are also included in the GitHub repository, and a detailed description of using these scripts and other tools is provided.
CRediT author statement
Jiadong Lin: Methodology, Software, Formal analysis, Data curation, Visualization, Writing - original draft, Writing - review & editing. Xiaofei Yang: Methodology, Writing - original draft, Writing - review & editing. Walter Kosters: Methodology, Writing - original draft. Tun Xu: Data curation. Yanyan Jia: Validation. Songbo Wang: Validation, Formal analysis. Qihui Zhu: Validation. Mallory Ryan: Validation. Li Guo: Writing - original draft. Chengsheng Zhang: Validation, Writing - original draft. HGSVC: Resources. Charles Lee: Resources, Writing - original draft. Scott E. Devine: Resources. Evan E. Eichler: Resources. Kai Ye: Conceptualization, Resources, Supervision, Project administration, Funding acquisition. All authors have read and approved the final manuscript.
Competing Interests
The authors have declared no competing interests.
Acknowledgments
This study was supported by the National Key R&D Program of China (Grant Nos. 2018YFC0910400 and 2017YFC0907500), the National Science Foundation of China (Grant Nos. 31671372, 61702406, and 31701739), the Fundamental Research Funds for the Central Universities, the World-Class Universities (Disciplines) and the Characteristic Development Guidance Funds for the Central Universities, and the Shanghai Municipal Science and Technology Major Project (Grant No. 2017SHZDZX01).
Authors from HGSVC
Mark B. Gerstein1, Ashley D. Sanders2, Michael C. Zody3, Michael E. Talkowski4, Ryan E. Mills5, Jan O. Korbel2, Tobias Marschall6, Peter Ebert6, Peter A. Audano7, Bernardo Rodriguez-Martin2, David Porubsky7, Marc Jan Bonder2,8, Arvis Sulovari7, Jana Ebler6, Weichen Zhou5, Rebecca Serra Mari6, Feyza Yilmaz9, Xuefang Zhao4, PingHsun Hsieh7, Joyce Lee10, Sushant Kumar1, Tobias Rausch2, Yu Chen11, Zechen Chong11, Katherine M. Munson7, Mark J.P. Chaisson12, Junjie Chen13, Xinghua Shi13, Aaron M. Wenger14, William T. Harvey7, Patrick Hasenfeld2, Allison Regier15, Ira M. Hall15, Paul Flicek16, Alex R. Hastie10, Susan Fairley16
1Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
2European Molecular Biology Laboratory (EMBL), Genome Biology Unit, D-69117 Heidelberg, Germany
3New York Genome Center, New York, NY 10013, USA
4Center for Genomic Medicine, Massachusetts General Hospital, Department of Neurology, Harvard Medical School, Boston, MA 02114, USA
5Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
6Heinrich Heine University, Medical Faculty, Institute for Medical Biometry and Bioinformatics, D-40225 Düsseldorf, Germany
7Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA
8Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), D-69120 Heidelberg, Germany
9The Jackson Laboratory for Genomic Medicine, Farmington, CT 06030, USA
10Bionano Genomics, San Diego, CA 92121, USA
11Department of Genetics and Informatics Institute, School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA
12Molecular and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
13Department of Computer & Information Sciences, Temple University, Philadelphia, PA 19122, USA
14Pacific Biosciences of California, Inc., Menlo Park, CA 94025, USA
15Washington University, St. Louis, MO 63108, USA
16European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
Handled by Fangqing Zhao
Footnotes
Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2021.03.007.
Contributor Information
Kai Ye, Email: kaiye@xjtu.edu.cn.
References
- 1. Ye K., Schulz M.H., Long Q., Apweiler R., Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25:2865–2871. doi: 10.1093/bioinformatics/btp394.
- 2. Rausch T., Zichner T., Schlattl A., Stutz A.M., Benes V., Korbel J.O. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–i339. doi: 10.1093/bioinformatics/bts378.
- 3. Layer R.M., Chiang C., Quinlan A.R., Hall I.M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15:R84. doi: 10.1186/gb-2014-15-6-r84.
- 4. Chen K., Wallis J.W., McLellan M.D., Larson D.E., Kalicki J.M., Pohl C.S., et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6:677–681. doi: 10.1038/nmeth.1363.
- 5. Cameron D.L., Di Stefano L., Papenfuss A.T. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun. 2019;10:3240. doi: 10.1038/s41467-019-11146-4.
- 6. Kosugi S., Momozawa Y., Liu X., Terao C., Kubo M., Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 2019;20:117. doi: 10.1186/s13059-019-1720-5.
- 7. Chen X., Schulz-Trieglaff O., Shaw R., Barnes B., Schlesinger F., Källberg M., et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32:1220–1222. doi: 10.1093/bioinformatics/btv710.
- 8. Sudmant P.H., Rausch T., Gardner E.J., Handsaker R.E., Abyzov A., Huddleston J., et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526:75–81. doi: 10.1038/nature15394.
- 9. Chaisson M.J.P., Sanders A.D., Zhao X., Malhotra A., Porubsky D., Rausch T., et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 2019;10:1784. doi: 10.1038/s41467-018-08148-z.
- 10. Gao R., Davis A., McDonald T.O., Sei E., Shi X., Wang Y., et al. Punctuated copy number evolution and clonal stasis in triple-negative breast cancer. Nat Genet. 2016;48:1119–1130. doi: 10.1038/ng.3641.
- 11. Yates L.R., Knappskog S., Wedge D., Farmery J.H.R., Gonzalez S., Martincorena I., et al. Genomic evolution of breast cancer metastasis and relapse. Cancer Cell. 2017;32:169–184.e7. doi: 10.1016/j.ccell.2017.07.005.
- 12. Quinlan A.R., Hall I.M. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 2012;28:43–53. doi: 10.1016/j.tig.2011.10.002.
- 13. Nattestad M., Goodwin S., Ng K., Baslan T., Sedlazeck F.J., Rescheneder P., et al. Complex rearrangements and oncogene amplifications revealed by long-read DNA and RNA sequencing of a breast cancer cell line. Genome Res. 2018;28:1126–1135. doi: 10.1101/gr.231100.117.
- 14. Sanchis-Juan A., Stephens J., French C.E., Gleadall N., Mégy K., Penkett C., et al. Complex structural variants in Mendelian disorders: identification and breakpoint resolution using short- and long-read genome sequencing. Genome Med. 2018;10:95. doi: 10.1186/s13073-018-0606-6.
- 15. Greer S.U., Nadauld L.D., Lau B.T., Chen J., Wood-Bouwens C., Ford J.M., et al. Linked read sequencing resolves complex genomic rearrangements in gastric cancer metastases. Genome Med. 2017;9:57. doi: 10.1186/s13073-017-0447-8.
- 16. Lee J.K., Park S., Park H., Kim S., Lee J., Lee J., et al. Tracing oncogene rearrangements in the mutational history of lung adenocarcinoma. Cell. 2019;177:1842–1857.e21. doi: 10.1016/j.cell.2019.05.013.
- 17. Collins R.L., Brand H., Redin C.E., Hanscom C., Antolik C., Stone M.R., et al. Defining the diverse spectrum of inversions, complex structural variation, and chromothripsis in the morbid human genome. Genome Biol. 2017;18:36. doi: 10.1186/s13059-017-1158-6.
- 18. Carvalho C.M.B., Lupski J.R. Mechanisms underlying structural variant formation in genomic disorders. Nat Rev Genet. 2016;17:224–238. doi: 10.1038/nrg.2015.25.
- 19. Baca S.C., Prandi D., Lawrence M.S., Mosquera J.M., Romanel A., Drier Y., et al. Punctuated evolution of prostate cancer genomes. Cell. 2013;153:666–677. doi: 10.1016/j.cell.2013.03.021.
- 20. Korbel J.O., Campbell P.J. Criteria for inference of chromothripsis in cancer genomes. Cell. 2013;152:1226–1236. doi: 10.1016/j.cell.2013.02.023.
- 21. Sanders A.D., Meiers S., Ghareghani M., Porubsky D., Jeong H., van Vliet M.A.C.C., et al. Single-cell analysis of structural variations and complex rearrangements with tri-channel processing. Nat Biotechnol. 2020;38:343–354. doi: 10.1038/s41587-019-0366-x.
- 22. Malhotra A., Lindberg M., Faust G.G., Leibowitz M.L., Clark R.A., Layer R.M., et al. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Res. 2013;23:762–776. doi: 10.1101/gr.143677.112.
- 23. Ye K., Wang J., Jayasinghe R., Lameijer E.W., McMichael J.F., Ning J., et al. Systematic discovery of complex insertions and deletions in human cancers. Nat Med. 2016;22:97–104. doi: 10.1038/nm.4002.
- 24. Zhang C.Z., Leibowitz M.L., Pellman D. Chromothripsis and beyond: rapid genome evolution from complex chromosomal rearrangements. Genes Dev. 2013;27:2513–2530. doi: 10.1101/gad.229559.113.
- 25. Soylev A., Le T.M., Amini H., Alkan C., Hormozdiari F. Discovery of tandem and interspersed segmental duplications using high-throughput sequencing. Bioinformatics. 2019;35:3923–3930. doi: 10.1093/bioinformatics/btz237.
- 26. Zhao X., Emery S.B., Myers B., Kidd J.M., Mills R.E. Resolving complex structural genomic rearrangements using a randomized approach. Genome Biol. 2016;17:126. doi: 10.1186/s13059-016-0993-1.
- 27. Cameron D.L., Schroder J., Penington J.S., Do H., Molania R., Dobrovic A., et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017;27:2050–2060. doi: 10.1101/gr.222109.117.
- 28. Marschall T., Costa I.G., Canzar S., Bauer M., Klau G.W., Schliep A., et al. CLEVER: clique-enumerating variant finder. Bioinformatics. 2012;28:2875–2882. doi: 10.1093/bioinformatics/bts566.
- 29. Arthur J.G., Chen X., Zhou B., Urban A.E., Wong W.H. Detection of complex structural variation from paired-end sequencing data. bioRxiv. 2017;200170.
- 30. Liao V.C.C., Chen M.S. DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst. 2014;38:623–639.
- 31. Tsai H.P., Yang D.N., Chen M.S. Mining group movement patterns for tracking moving objects efficiently. IEEE T Knowl Data En. 2011;23:266–281.
- 32. Huang Y., Zhang L.Q., Zhang P.S. A framework for mining sequential patterns from spatio-temporal event data sets. IEEE T Knowl Data En. 2008;20:433–448.
- 33. Ye K., Kosters W.A., IJzerman A.P. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences. Bioinformatics. 2007;23:687–693. doi: 10.1093/bioinformatics/btl665.
- 34. Pei J., Han J., Wang W. Constraint-based sequential pattern mining: the pattern-growth methods. J Intell Inf Syst. 2007;28:133–160.
- 35. Pei J., Han J.W., Mortazavi-Asl B., Wang J.Y., Pinto H., Chen Q.M., et al. Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE T Knowl Data En. 2004;16:1424–1440.
- 36. Li H., Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11:473–483. doi: 10.1093/bib/bbq015.
- 37. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 38. Sedlazeck F.J., Rescheneder P., Smolka M., Fang H., Nattestad M., von Haeseler A., et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7.
- 39. Bolognini D., Sanders A., Korbel J.O., Magi A., Benes V., Rausch T. VISOR: a versatile haplotype-aware structural variant simulator for short and long read sequencing. Bioinformatics. 2020;36:1267–1269. doi: 10.1093/bioinformatics/btz719.
- 40. McPherson A., Wu C., Wyatt A.W., Shah S., Collins C., Sahinalp S.C. nFuse: discovery of complex genomic rearrangements in cancer using high-throughput sequencing. Genome Res. 2012;22:2250–2261. doi: 10.1101/gr.136572.111.
- 41. Dzamba M., Ramani A.K., Buczkowicz P., Jiang Y., Yu M., Hawkins C., et al. Identification of complex genomic rearrangements in cancers using CouGaR. Genome Res. 2017;27:107–117. doi: 10.1101/gr.211201.116.
- 42. Delcher A.L., Phillippy A., Carlton J., Salzberg S.L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30:2478–2483. doi: 10.1093/nar/30.11.2478.
- 43. Zhao X., Weber A.M., Mills R.E. A recurrence-based approach for validating structural variation using long-read sequencing technology. GigaScience. 2017;6:1–9. doi: 10.1093/gigascience/gix061.
- 44. Ottaviani D., LeCain M., Sheer D. The role of microhomology in genomic structural variation. Trends Genet. 2014;30:85–94. doi: 10.1016/j.tig.2014.01.001.
- 45. Kramara J., Osia B., Malkova A. Break-induced replication: the where, the why, and the how. Trends Genet. 2018;34:518–531. doi: 10.1016/j.tig.2018.04.002.
- 46. Hartlerode A.J., Willis N.A., Rajendran A., Manis J.P., Scully R. Complex breakpoints and template switching associated with non-canonical termination of homologous recombination in mammalian cells. PLoS Genet. 2016;12:e1006410. doi: 10.1371/journal.pgen.1006410.
- 47. Zhou W., Zhang F., Chen X., Shen Y., Lupski J.R., Jin L. Increased genome instability in human DNA segments with self-chains: homology-induced structural variations via replicative mechanisms. Hum Mol Genet. 2013;22:2642–2651. doi: 10.1093/hmg/ddt113.
- 48. Yang L., Luquette L.J., Gehlenborg N., Xi R., Haseley P.S., Hsieh C.H., et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell. 2013;153:919–929. doi: 10.1016/j.cell.2013.04.010.
- 49. Chen W., McKenna A., Schreiber J., Haeussler M., Yin Y., Agarwal V., et al. Massively parallel profiling and predictive modeling of the outcomes of CRISPR/Cas9-mediated double-strand break repair. Nucleic Acids Res. 2019;47:7989–8003. doi: 10.1093/nar/gkz487.
- 50. Allen F., Crepaldi L., Alsinet C., Strong A.J., Kleshchevnikov V., De Angeli P., et al. Predicting the mutations generated by repair of Cas9-induced double-strand breaks. Nat Biotechnol. 2019;37:64–72. doi: 10.1038/nbt.4317.
- 51. Quigley D.A., Dang H.X., Zhao S.G., Lloyd P., Aggarwal R., Alumkal J.J., et al. Genomic hallmarks and structural variation in metastatic prostate cancer. Cell. 2018;175:889. doi: 10.1016/j.cell.2018.10.019.
- 52. Fraser M., Sabelnykova V.Y., Yamaguchi T.N., Heisler L.E., Livingstone J., Huang V., et al. Genomic hallmarks of localized, non-indolent prostate cancer. Nature. 2017;541:359–364. doi: 10.1038/nature20788.