Abstract
Genomic rearrangements cause congenital disorders, cancer, and complex diseases in humans. Yet, they are still understudied in rare diseases because their detection is challenging, despite the advent of whole genome sequencing (WGS) technologies. Short-read (srWGS) and long-read WGS approaches are regularly compared, and the latter is commonly recommended in studies focusing on genomic rearrangements. However, srWGS is currently the most economical, accurate, and widely supported technology. In Caenorhabditis elegans (C. elegans), such variants, induced by various mutagenesis processes, have been used for decades to balance large genomic regions by preventing chromosomal crossover events and allowing the maintenance of lethal mutations. Interestingly, those chromosomal rearrangements have rarely been characterized at the molecular level. To evaluate the ability of srWGS to detect various types of complex genomic rearrangements, we sequenced three balancer strains using short-read Illumina technology. By experimentally validating the breakpoints uncovered by srWGS, we showed that, by combining several types of analyses, srWGS enables the detection of a reciprocal translocation (eT1), a free duplication (sDp3), a large deletion (sC4), and chromoanagenesis events. Thus, applying srWGS to decipher real complex genomic rearrangements in model organisms may help design efficient bioinformatics pipelines for the systematic detection of complex rearrangements in human genomes.
Subject terms: Computational biology and bioinformatics, Genetics
Introduction
Structural variations (SVs) are genomic rearrangements such as copy number alterations, inversions, and translocations. More complex events, known as chromoanagenesis, combine a cascade of chromosomal rearrangements1. Over the past few years, structural variants and complex genomic rearrangements have been implicated in various phenotypes: cancer2,3, rare disorders4–9 and common diseases10 in humans, reproduction traits in pigs11, virulence traits in plant pathogenic fungi12, local adaptation in maize13, and behavior in Caenorhabditis elegans (C. elegans)14. However, the technologies and methods used to identify SVs and complex rearrangements are still multifaceted and no approach has yet been recognized as standard. Short-read and long-read whole genome sequencing (WGS) technologies, as well as their respective tools and pipelines, are often assessed and compared in their ability to detect structural variants and complex rearrangements15–21. The read length of short-read technologies is often reported as a limitation for detecting larger and more complex events22. Meanwhile, long-read sequencing and linked-reads approaches are gaining popularity23–25, especially when the analysis of short-read sequencing data fails to uncover SVs and complex rearrangements of interest26,27. Here, we focused on short-read WGS of C. elegans strains known to harbor SVs and show that short-read WGS provides enough data to decipher SVs of various types and complex genomic rearrangements in these genomes when tailored workflows are used.
In C. elegans, SVs and complex rearrangements have been used for decades to balance large parts of the genome by suppressing crossover events and maintaining heterozygosity. This facilitates the investigation of lethal mutations, the construction of new strains, and the screening of mutations28. While some balancers are spontaneous, like the reciprocal translocation nT1(IV;V)28, most were created via random mutagenesis processes, such as X-ray mutagenesis, chemical mutagens (acetaldehyde, ENU, EMS), gamma irradiation, and more recently by using CRISPR-Cas9 methods29,30. For most of the mutagen-induced balancers, the implicated chromosomal rearrangements are uncharacterized at the molecular level (i.e., the precise genomic position and nature of the rearrangement are unknown). Thus, C. elegans balancers constitute an interesting source of various genomes and complex genomic rearrangements to assess the ability of short-read PCR-free WGS Illumina technologies and tailored bioinformatics workflows to detect and characterize complex structural variants. Here, we sequenced the genomes of three C. elegans balancers, ranging from a well-characterized SV [eT1(III;V), a reciprocal translocation] to a molecularly uncharacterized balancer [sC4 (BC4586)]. Beyond the successful proof-of-concept detection of eT1(III;V), we deciphered the structure and genomic positions of sDp3 and sC4, as well as additional rearrangements not previously known to exist in the balancer strains selected for this study (BC4586, BC986, and VC109).
In our study, we found that short-read WGS datasets can be used to detect, identify, and characterize SVs and complex genomic rearrangements in C. elegans genomes. The knowledge gained from the analytical methods used on C. elegans balancers may help optimize detection and characterization of complex variants in humans using short-read WGS.
Results
Short-read WGS can be used to detect homozygous and heterozygous reciprocal translocations
The strains BC986 and VC109 carry the reciprocal translocation eT1(III;V). In C. elegans, the reciprocal translocation eT1(III;V) balancer has been well studied and it is described as balancing LGV, from the left chromosome end through unc-23, and LGIII, from the right end to unc-36 [31]. Its genomic breakpoints were more recently localized in the second intron of unc-36 on LGIII and between rol-3 and unc-42 on LGV [32]. Therefore, we first focused our efforts on retrieving eT1(III;V) breakpoints, to assess the ability of short-read WGS to decipher reciprocal translocations as a proof of concept for our approach.
Reads were aligned to the C. elegans reference genome (WS265) and candidate breakpoints were predicted using an ensemble of tools (see “Methods”). Two sets of breakpoints related to a translocation between LGIII and LGV were correctly identified by several tools in these eT1 strains, but not in controls. The breakpoints we identified agreed with the locations previously described by Zhao and colleagues32: III:8,200,762–V:8,930,675 and III:8,200,764–V:8,930,675 (Fig. 1A and Supplemental Fig. S1). As a first validation step, we used the Integrative Genomics Viewer (IGV) to review the visual signature of reads aligned around those locations (Fig. 1B). In homozygous genomes, we observed that no read overlapped the position of the breakpoint (i.e., the reads mostly aligned either on the left or the right of the breakpoint, with little or no read sequence aligning across the breakpoint position). In heterozygous genomes, about half of the reads displayed this signature (Fig. 1B). Then, we amplified the genomic loci around those breakpoints by PCR and submitted the PCR products for Sanger sequencing (Fig. 1C–E). By analyzing the Sanger sequences, we confirmed that the breakpoint on LGIII was in the second intron of unc-36 at position 8,200,764 and that the breakpoint on LGV was intergenic, localized at position 8,930,675. Additionally, we characterized microhomologies at the breakpoint on LGIII, composing a 43-bp sequence inserted at the junction containing several sequences flanking the breakpoints. The main part of the inserted sequence (27 bp) has been duplicated from the LGV flanking region. Two additional sequences, respectively 5 bp and 1 bp long, are duplicated from the LGIII flanking region (Fig. 1D).
One of the strains (VC109) was viable in both heterozygous and homozygous states33. We prepared genomic DNA from both eT1 heterozygous (wild-type looking worms) and eT1 homozygous (phenotypically unc-36 worms) and sequenced them. We were able to identify (Fig. 1B) and confirm the eT1 breakpoints in both cases (Fig. 1C), demonstrating that the short-read WGS approach is effective at deciphering position and structure of the breakpoints for reciprocal translocations regardless of the zygosity status.
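The read signature described above (no read crossing the breakpoint in homozygous genomes, about half of the reads crossing it in heterozygous genomes) can be turned into a simple rule of thumb. The following is only a minimal sketch, not the procedure used in this study: the function name, the thresholds, and the minimum read support are all hypothetical values chosen for illustration, and the read counts are assumed to have been tallied beforehand (for instance by visual review in IGV).

```python
def breakpoint_zygosity(split_reads, spanning_reads,
                        hom_max_spanning=0.1,
                        het_range=(0.25, 0.75),
                        min_support=4):
    """Classify a breakpoint from read counts observed at its position.

    split_reads    -- reads whose alignment stops at the breakpoint
    spanning_reads -- reads aligning continuously across the breakpoint
    All thresholds are illustrative assumptions, not values from the study.
    """
    total = split_reads + spanning_reads
    if total == 0 or split_reads < min_support:
        return "no_call"  # too little evidence for a breakpoint
    spanning_fraction = spanning_reads / total
    if spanning_fraction <= hom_max_spanning:
        return "homozygous"  # (almost) no read crosses the breakpoint
    if het_range[0] <= spanning_fraction <= het_range[1]:
        return "heterozygous"  # roughly half of the reads cross it
    return "ambiguous"


if __name__ == "__main__":
    print(breakpoint_zygosity(30, 0))   # homozygous-like signature
    print(breakpoint_zygosity(15, 16))  # heterozygous-like signature
```

In practice, such a rule would only flag candidates; the zygosity calls reported here rest on read review and experimental validation.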
Short-read WGS contains enough information to identify short and large copy number variations
By combining calls from various tools, coverage analysis, and read inspection, we detected an assorted set of copy number variations. We confirmed their nature and positions by PCR and Sanger sequencing. Overall, we observed five deletions specific to BC986, spanning from 69 bp to 8,779 bp (Supplemental Figs. S2, S3). In VC109 genomes, we also detected four additional deletions ranging from 86 to 255 bp in size. Some were heterozygous, others were homozygous (Supplemental Figs. S4, S5, S6). We also identified two direct tandem copy number gain events in VC109. The first one, localized on LGI, was a homozygous direct tandem duplication in both heterozygous (phenotypically wild-type) and homozygous (phenotypically unc-36) worms. The second direct tandem duplication mapped on LGV and was both heterozygous and homozygous in heterozygous and homozygous worms, respectively (Supplemental Fig. S6). More information regarding these reported CNVs is available in Supplemental Table S1.
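One way to picture the combination of calls from several tools is a consensus step that keeps only breakpoints reported by at least two callers within a few base pairs of each other. The sketch below is an assumption-laden illustration rather than the workflow used here: breakpoints are reduced to single positions, the 10-bp tolerance and the two-tool minimum are hypothetical, and the tool labels attached to each toy call are invented for the example.

```python
from collections import defaultdict


def _flush(chrom, cluster, min_tools, out):
    """Keep a cluster of nearby calls if enough distinct tools support it."""
    tools = sorted({tool for _, tool in cluster})
    if len(tools) >= min_tools:
        positions = [pos for pos, _ in cluster]
        # Report the middle (upper median) position of the sorted cluster.
        out.append((chrom, positions[len(positions) // 2], tools))


def consensus_breakpoints(calls, tolerance=10, min_tools=2):
    """calls: iterable of (tool, chrom, pos) tuples.

    Calls on the same chromosome are chained into one cluster while the gap
    to the previous call is at most `tolerance` bp (single-linkage).
    """
    by_chrom = defaultdict(list)
    for tool, chrom, pos in calls:
        by_chrom[chrom].append((pos, tool))
    consensus = []
    for chrom in sorted(by_chrom):
        cluster = []
        for pos, tool in sorted(by_chrom[chrom]):
            if cluster and pos - cluster[-1][0] > tolerance:
                _flush(chrom, cluster, min_tools, consensus)
                cluster = []
            cluster.append((pos, tool))
        _flush(chrom, cluster, min_tools, consensus)
    return consensus


if __name__ == "__main__":
    # Toy calls: positions echo the eT1 breakpoints, tool labels are made up.
    toy_calls = [
        ("GRIDSS", "III", 8200762),
        ("RUFUS", "III", 8200764),
        ("Manta", "V", 8930675),
        ("GRIDSS", "V", 8930675),
        ("DELLY", "I", 500),  # single-tool call, dropped
    ]
    for hit in consensus_breakpoints(toy_calls):
        print(hit)
```

A consensus of this kind narrows the candidate list; coverage analysis and read inspection are still needed to interpret each event.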
Short-read WGS can uncover a free duplication
The sDp3 balancer, also present in BC986 along with eT1(III;V), has been described as a free duplication of LGIII that effectively balances the left portion of LGIII from around unc-86 through to at least dpy-1, but does not extend to unc-45 [28]. So far, 22 genes have been described to be overlapped by sDp3 and, by analysis of the coverage, we confirmed that their sequences were duplicated (Fig. 2A). None of the tools we applied (see “Methods”) reported breakpoints or structural variants that could fit the sDp3 description. However, we observed heterozygous SNVs from the left end of LGIII until at least the eT1 breakpoint (III:8,200,675), corroborating the presence of an event balancing this part of LGIII, and maintaining heterozygosity (Supplemental Fig. S2 and Supplemental Table S3). An unbiased analysis of the sequencing read depth on LGIII helped us map the duplication to two different loci: III:1.4–2.4 Mb and III:3.6–8.6 Mb (Fig. 2B). To confirm this structure, we inspected the reads aligned around III:2.4 Mb and III:3.6 Mb. We identified read pairs for which the forward read was aligned to the first segment of the duplication and the mate aligned along the second segment, thus corroborating our hypothesis. To experimentally validate it, we identified the breakpoint linking the two parts of the duplication (III:2,452,252 and III:3,693,056) and confirmed it by PCR and Sanger sequencing (Fig. 2C and D).
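The read-depth mapping of sDp3 described above amounts to finding runs of windows whose depth is elevated relative to the rest of the chromosome. A minimal sketch of that idea follows; it assumes depth has already been summarized as one ratio per consecutive window (window depth divided by a baseline), and the 1.3 cutoff, the minimum run length, and the toy ratios are hypothetical choices made only to mimic the two segments reported here.

```python
def gained_segments(ratios, window_size, min_ratio=1.3, min_windows=2):
    """Return (start, end) intervals, in bp, of runs of elevated depth.

    ratios      -- depth ratios for consecutive windows, first window at 0
    window_size -- window length in bp
    min_ratio   -- illustrative cutoff for calling a window 'gained'
    min_windows -- shortest run, in windows, that is reported
    """
    segments = []
    run_start = None
    # A trailing 0.0 sentinel closes a run that reaches the last window.
    for i, ratio in enumerate(list(ratios) + [0.0]):
        if ratio >= min_ratio:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_windows:
                segments.append((run_start * window_size, i * window_size))
            run_start = None
    return segments


if __name__ == "__main__":
    # Toy profile in 200-kb windows: baseline 1.0, gained windows at 1.5.
    toy = [1.0] * 7 + [1.5] * 5 + [1.0] * 6 + [1.5] * 25 + [1.0] * 5
    for start, end in gained_segments(toy, window_size=200_000):
        print(f"III:{start / 1e6:.1f}-{end / 1e6:.1f} Mb")
```

Window-level intervals like these only approximate the segment boundaries; the exact junction was obtained from the read pairs and Sanger sequencing.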
Short-read WGS can reveal and characterize unexpected complex rearrangements
By comparing variants and breakpoints in the three eT1 strains and controls (strains without eT1(III;V), including N2 and BC4586), we built an “eT1 haplotype” composed of variants specific to the eT1 strains. Interestingly, along with eight SNVs (list available in Supplemental Table S2), we also characterized two unexpected and undescribed complex rearrangements.
The first one could have been interpreted at first sight as a classic large copy number gain in tandem (direct) spanning from V:2,144,217 to V:2,156,311 (Supplemental Fig. S7A). It overlapped seven intact genes: srbc-20, C45H4.t1, C45H4.21, C45H4.13, C45H4.19, srbc-24, and srbc-23, as well as partially spanning srbc-52 (exon 1 only) and srbc-21 (up to intron 4). PCR and Sanger sequencing confirmed the duplication breakpoints and structure in direct tandem (Supplemental Fig. S7B). Both BC986 and unc-36 VC109 worms [eT1(III;V) homozygous] were homozygous for the direct tandem duplication (ratio of coverage = 2) while wild-type looking VC109 [eT1(III;V) heterozygous] was heterozygous (ratio of coverage = 1.5) (Supplemental Fig. S7C). Coverage analysis, however, showed a discontinuity (ratio of coverage dropping back to 1) between V:2,148,200 and V:2,148,630, which corresponds to the three last exons of srbc-20 and a part of the last exon of C45H4.21 (Supplemental Fig. S7C). An inspection of the reads revealed that this variant is complex, with an inversion overlapping the copy number gain. We confirmed the inversion V:2,148,056–2,148,630 by PCR and Sanger sequencing (Supplemental Fig. S7B and D).
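The coverage ratios quoted above follow from simple copy counting in a diploid genome: each allele carrying the tandem duplication contributes two copies of the segment instead of one, so one duplicated allele gives 3/2 = 1.5 and two give 4/2 = 2. The sketch below encodes this arithmetic; the function names and the 0.2 tolerance are hypothetical and serve only to illustrate the reasoning.

```python
def expected_ratio(alleles_with_duplication, ploidy=2):
    """Expected coverage ratio when some alleles carry one extra copy."""
    return (ploidy + alleles_with_duplication) / ploidy


def duplication_genotype(observed_ratio, tolerance=0.2):
    """Match an observed coverage ratio to the closest diploid genotype.

    The tolerance is an illustrative assumption, not a value from the study.
    """
    labels = {0: "no duplication", 1: "heterozygous", 2: "homozygous"}
    best = min(labels, key=lambda n: abs(observed_ratio - expected_ratio(n)))
    if abs(observed_ratio - expected_ratio(best)) > tolerance:
        return "unclear"
    return labels[best]


if __name__ == "__main__":
    for ratio in (1.0, 1.5, 2.0):
        print(ratio, duplication_genotype(ratio))
```

The same counting explains why a ratio dropping back to 1 inside the duplicated interval hints that the segment is not a plain tandem gain.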
The second eT1-specific complex rearrangement was localized on LGV around 1.1 Mb. It overlapped the gene Y50D4B.1, a non-essential gene in C. elegans. Between V:1.118 Mb and V:1.130 Mb, we identified 15 different breakpoints (Table 1). By inspecting the reads, we identified three short deletions (homozygous in BC986 and VC109 unc-36 worms, heterozygous in VC109 wild-type worms), one inversion, one large deletion and three inverted tandem duplications (Fig. 3). We experimentally confirmed all breakpoints by PCR and Sanger sequencing.
Table 1.
Variant | Breakpoint 1 | Breakpoint 2 | Gene |
---|---|---|---|
Inverted tandem duplication | V:1,118,539 | V:1,118,853 | Y50D4B.1 |
Deletion | V:1,118,855 | V:1,128,457 | Y50D4B.1 |
Deletion | V:1,126,582 | V:1,126,983 | Y50D4B.1 |
Deletion | V:1,127,007 | V:1,127,020 | Y50D4B.1 |
Inverted tandem duplication | V:1,127,471 | V:1,129,752 | Y50D4B.1 |
Inverted tandem duplication | V:1,128,827 | V:1,129,264 | Y50D4B.1 |
Deletion | V:1,129,264 | V:1,129,753 | Y50D4B.1 |
Inversion | V:1,126,322 | V:1,128,612 | Y50D4B.1 |
In the strain VC109 only, we detected eight breakpoints on LGIII around 10 Mb (Table 2 and Fig. 4). Based on coverage analysis and PCR, we further characterized this complex rearrangement as being composed of one direct tandem duplication, one inverted tandem duplication, and two deletions. Because of the presence of copy gains in the rearrangement and microhomologies at breakpoint junctions, this complex rearrangement could be characterized as chromoanasynthesis. It overlapped the non-essential gene tbc-8, so it is not expected to have an important effect on the fitness of the worms.
Table 2.
Variant | Length | Breakpoint 1 | Breakpoint 2 | Gene |
---|---|---|---|---|
Inverted tandem duplication | 8315 bp | III:10,362,573 | III:10,370,888 | tbc-8 |
Deletion | 4387 bp | III:10,366,492 | III:10,370,879 | tbc-8 |
Direct tandem duplication | 9409 bp | III:10,366,037 | III:10,375,446 | tbc-8 |
Deletion | 2500 bp | III:10,368,666 | III:10,371,166 | tbc-8 |
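As a consistency check on Table 2, each reported length can be recomputed as the distance between the two breakpoint coordinates. The helper names below are hypothetical, and the length convention (plain difference of coordinates, without adding 1) is an assumption inferred from the values in the table.

```python
def parse_position(text):
    """Turn 'III:10,362,573' into ('III', 10362573)."""
    chrom, pos = text.split(":")
    return chrom, int(pos.replace(",", ""))


def variant_length(breakpoint_1, breakpoint_2):
    """Distance in bp between two breakpoints on the same chromosome."""
    chrom_1, pos_1 = parse_position(breakpoint_1)
    chrom_2, pos_2 = parse_position(breakpoint_2)
    if chrom_1 != chrom_2:
        raise ValueError("breakpoints lie on different chromosomes")
    return abs(pos_2 - pos_1)


# Rows copied from Table 2: (variant, reported length, breakpoint 1, breakpoint 2)
TABLE_2 = [
    ("Inverted tandem duplication", 8315, "III:10,362,573", "III:10,370,888"),
    ("Deletion", 4387, "III:10,366,492", "III:10,370,879"),
    ("Direct tandem duplication", 9409, "III:10,366,037", "III:10,375,446"),
    ("Deletion", 2500, "III:10,368,666", "III:10,371,166"),
]

if __name__ == "__main__":
    for name, reported, bp1, bp2 in TABLE_2:
        computed = variant_length(bp1, bp2)
        status = "matches" if computed == reported else "differs"
        print(f"{name}: computed {computed} bp, reported {reported} bp ({status})")
```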
Short-read WGS to characterize BC4586, an uncharacterized genetic balancer strain
The strain BC4586 contains the sC4 rearrangement that has been used to balance the right end of LGV, from rol-9 to unc-76. It was also reported that it reduces the genetic distance between the genes unc-76 and rol-9 to 1.8%, suggesting the presence of a deletion28. To the best of our knowledge, the rearrangement sC4 remains molecularly uncharacterized. We used short-read WGS to determine the nature of the sC4 rearrangement and to report additional genomic variants in BC4586.
We first performed “sC4 haplotype” analysis (Supplemental Table S3) and observed stretches of heterozygous SNVs only on LGV, from ~12 to ~16 Mb and from ~19 Mb to its right end. This suggested that sC4 might be able to balance further than unc-76. We detected a deletion on the right portion of LGV between 16 and 19 Mb (Fig. 5A) that explains the reduced genetic distance previously reported between unc-76 and rol-9. We confirmed the deletion by PCR (Fig. 5B). We also detected a non-reciprocal translocation of the right arm of LGV to the right arm of LGIV (Table 3). We hypothesized that this has led to a fusion of the two chromosomes by their right ends. The breakpoint was supported by several reads. However, the region surrounding the breakpoint on LGV is highly repetitive, and despite our best efforts, we could not design a unique set of primers to validate this hypothesis by Sanger sequencing. Therefore, we assessed the karyotypes of the diakinetic oocytes using DAPI staining. Wild-type oocytes typically have six pairs of DAPI-stained bivalent diakinetic chromosomes (Fig. 5C), whereas in BC4586 we frequently observed five pairs (Fig. 5C), supporting the sC4 chromosome fusion.
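The search for stretches of heterozygous SNVs mentioned at the start of this analysis can be sketched as a windowed count followed by merging of adjacent windows. This is a simplified illustration under stated assumptions: heterozygous SNV positions for one chromosome are taken as already extracted from the genotype calls, and the 1-Mb window, the minimum of five SNVs per window, and the synthetic positions are hypothetical.

```python
from collections import Counter


def heterozygous_windows(het_positions, window_size=1_000_000, min_snvs=5):
    """Start coordinates of windows with at least min_snvs heterozygous SNVs."""
    counts = Counter((pos // window_size) * window_size for pos in het_positions)
    return sorted(start for start, n in counts.items() if n >= min_snvs)


def merge_windows(starts, window_size=1_000_000):
    """Merge adjacent windows (sorted starts) into (start, end) stretches."""
    stretches = []
    for start in starts:
        if stretches and stretches[-1][1] == start:
            stretches[-1] = (stretches[-1][0], start + window_size)
        else:
            stretches.append((start, start + window_size))
    return stretches


if __name__ == "__main__":
    # Synthetic positions: dense heterozygous SNVs at 12-16 Mb and 19-20 Mb,
    # plus one isolated SNV that should be ignored.
    synthetic = (list(range(12_000_000, 16_000_000, 100_000))
                 + list(range(19_000_000, 20_000_000, 100_000))
                 + [5_000_000])
    print(merge_windows(heterozygous_windows(synthetic)))
```

Stretches obtained this way point to balanced regions; they do not by themselves reveal the nature of the rearrangement.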
Table 3.
Variant | Breakpoint 1 | Breakpoint 2 | Overlapped genes |
---|---|---|---|
Deletion | V:16,060,619 | V:19,331,432 | 1279 genes |
Translocation | IV:17,114,723 | V:19,835,910 | cyn-13 |
Deletion | IV:9,853,074 | IV:9,853,123 | ssq-1; K07F5.12; npp-1 |
Inversion | IV:9,853,123 | IV:9,853,675 | |
Deletion | IV:9,853,675 | IV:9,857,585 | |
Direct tandem duplication | IV:9,857,585 | IV:9,862,397 | |
On LGIV, we also characterized a complex homozygous rearrangement combining two deletions, one inversion and one direct tandem duplication localized around 9.8 Mb (Fig. 5D). We confirmed the breakpoints for both complex rearrangements by PCR (Fig. 5D). We also reported and validated a deletion on LGV and a direct tandem duplication on LGIII (Fig. 5D, Supplemental Table S1, Supplemental Fig. S8). The Circos plot in Supplemental Fig. S9 summarizes our findings.
Discussion
Short-read whole genome sequencing (WGS) has often been used to retrieve structural variants and more complex rearrangements among other variations in humans17,34,35, D. melanogaster36, as well as C. elegans37–40. Here, by reporting the precise breakpoints of complex rearrangements in C. elegans [eT1(III;V), sC4, sDp3], we describe the molecular structure of widely used balancers, most of them for the first time to the best of our knowledge. We also show that short-read WGS enables identification and characterization of large SVs and complex rearrangements, by deep analysis of short-read WGS datasets.
Every breakpoint that we uncovered with deep analysis of the short reads was validated experimentally. However, the interpretation of the structure of the rearrangements could necessitate further exploration. Still, despite the limited ability of short-read WGS to span large genomic rearrangements fully or to explore repetitive regions such as telomeres, we characterized the balancer sC4 as a large deletion and a chromosome fusion (IV;V). This rearrangement could reflect a telomere crisis41 occurring as an end-to-end chromosome fusion associated with telomere shortening. This type of event has been previously studied in C. elegans38,42,43. We also uncovered a free duplication composed of two genomic segments (sDp3), along with chromosomal rearrangements combining several events that present features of chromoanagenesis. Our analyses of eT1 strains confirmed the eT1 breakpoints as identified by Zhao and colleagues32. Interestingly, we placed the LGIII breakpoint 3 bp anterior to the one previously reported. Although we retrieved the junction sequence described as a 35-bp duplication, our approach with short-read WGS showed a more complex scenario with microhomologies of several flanking sequences, suggesting the involvement of a replication-based DNA repair mechanism such as fork stalling and template switching, or microhomology-mediated break-induced replication.
In C. elegans, short-read WGS has been employed in only a few studies to describe SVs. For instance, Meier and colleagues38 and Volkova and colleagues39 reported mutational signatures (SNVs and SVs) created by carcinogen exposure on strains with DNA repair deficiency. Itani and colleagues37 characterized a complex rearrangement created by ENU-based mutagenesis. In 2017, Cook and colleagues44 published the database CeNDR (C. elegans Natural Diversity Resource), which gathers genomic variations uncovered by genome sequencing in wild C. elegans strains. Other than insertions of transposable elements45, SVs and complex rearrangements are not reported for the natural isolates in CeNDR. Our study shows that short-read sequencing is a viable option for future studies to explore the natural variation of C. elegans beyond SNVs, especially by re-analyzing datasets already available in CeNDR, for which those complex variants might have been overlooked.
It is quite common in human studies to assess analysis pipelines of short-read WGS using either generic genomes (Genome in a Bottle) or simulated data46,47, especially because real-life cases emerge anecdotally. In our study, however, we assessed several tools and approaches on real biological data from model organism genomes. This approach presents two main advantages. First, in model organisms like C. elegans, balancers are widely used and well known to carry SVs and complex rearrangements largely comparable to those found in humans. Thus, they constitute good surrogates of real-life cases, without the limitation related to the low frequency of those events. Second, as shown here, real biological data allow us to uncover unexpected events of various natures and complexity, whereas simulations might not cover the wide diversity of chromosomal rearrangements or reproduce the complexity of read signatures. We therefore reasoned that tool assessment would be more accurate if it were performed on real data, human or not, combined with experimental validation.
Conclusions
In our study, we showed that short-read data provide enough information to detect a spectrum of complex variants with tailored bioinformatics approaches. Thus, to improve the detection and characterization of SVs and complex rearrangements, it is important to also optimize pipelines and analyses to get the best out of short-read datasets. Indeed, short-read sequencing is the most widely used approach and the most cost-effective technology available. Also, because more tools and pipelines are available to analyze short-read data than long-read or linked-read data, pipelines can be tailored more easily by combining different tools and approaches. Additionally, short reads permit the detection of both single nucleotide variants and larger ones, whereas long-read approaches are error prone and thus limited in their ability to accurately detect SNVs. This constitutes quite an advantage for the short-read approach, as it avoids the need to resort to another assay for small variants. Moreover, public databases on human variation such as TopMed48 and gnomAD49 have been built upon calls from short-read datasets. Therefore, in the context of unsolved human rare disease cases, where population databases are a major asset to distinguish rare and pathogenic variants from common and benign ones, short-read sequencing remains the main approach. Thus, improving short-read sequencing pipelines to maximize the detection of variants is of utmost importance.
Methods
Worm maintenance and strains
Strains Bristol N2 wild type, Hawaiian strain wild type CB4856, BC986 (sDp3(III;f); + /eT1 (III;V)), VC109 (apc-11(gk37)/eT1 III; + /eT1 V) and BC4586 (unc-76(e911) rol-9(sc148)/sC4(s2172) [dpy-21(e428)] V) were used in this study. Strains were obtained from the CGC (Caenorhabditis Genetics Center). N2 was used as the wildtype strain. All strains were maintained at 16 °C and kept on standard NGM plates streaked with OP50.
DNA extraction
Genomic DNA was collected from approximately 100 mg of worm tissue using the Qiagen Blood and Tissue kit (Cat #: 13323) following the manufacturer's recommendations. DNA was eluted with 10 mM Tris–HCl (pH 8.0). Samples were quality-checked to ensure a minimum quantity of 1500 ng and a 260/280 ratio of 1.8 before submitting for sequencing.
Library preparation, sequencing and data pre-processing
Paired-end short-read WGS data were obtained for all strains with a PCR-free library preparation protocol and NovaSeq6000 Illumina sequencing technology. The reads were 151 bp long. We checked the quality of the fastq files using FastQC [50]. We trimmed the reads and removed the adapters using TRIMMOMATIC v0.36 [51]. For each sample, we aligned between 16 and 34 million reads to the C. elegans reference genome WS265 using the BWA-MEM v0.7.17 [52] algorithm. This resulted in a 30X read coverage per strain on average (Supplemental Table S4). We then sorted the reads according to their coordinates with ‘samtools sort’ (samtools v1.5) [53].
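The relationship between read count and coverage reported above can be approximated as coverage = (number of reads × read length) / genome size. The short sketch below applies it with two stated assumptions: a C. elegans genome size of roughly 100 Mb, and read counts taken as individual 151-bp reads. It gives an upper bound only, since trimming and unaligned reads lower the effective depth.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Approximate mean depth: total sequenced bases over genome size."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    GENOME_SIZE = 100_000_000  # assumption: C. elegans genome of ~100 Mb
    for n_reads in (16_000_000, 20_000_000, 34_000_000):
        depth = expected_coverage(n_reads, 151, GENOME_SIZE)
        print(f"{n_reads:,} reads -> about {depth:.1f}X")
```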
SV and complex rearrangement detection
We called and characterized SNVs, indels, SVs, and complex rearrangements for each strain in this study using a collection of published tools and downstream in-house designed analysis methods. Strain N2 was used as a control. The SNV and indel genotypes of each strain were established using RUFUS [35]. The analysis of SNV heterozygosity along the genome of each strain was used to highlight balanced genomic regions. For SVs and complex variants, we initially ran nine different tools with default parameters: BreakDancer v.BreakDancerMax-1.1r112 [54] (https://github.com/genome/breakdancer), CNVnator v0.4.1 [55] (https://github.com/abyzovlab/CNVnator), DELLY v0.7.8 [56] (https://github.com/dellytools/delly), GRIDSS v2.8.0 [57] (https://github.com/PapenfussLab/GRIDSS), Manta v1.6.0 [58] (https://github.com/Illumina/manta), SeekSV v1.2.3 [59] (https://github.com/qiukunlong/seeksv), Tardis v1.0.7 [60] (https://github.com/BilkentCompGen/tardis), TIDDIT v2.12.0 [61] (https://github.com/SciLifeLab/TIDDIT) and RUFUS [35] (https://github.com/jandrewrfarrell/RUFUS). For complex variants, breakpoints were defined by combining RUFUS and GRIDSS calls with custom methods (visual assessment, coverage analysis, read inspection):
- Visual assessment consists of reviewing the visual signature of reads aligned around each breakpoint with IGV. A breakpoint is represented by an accumulation of split reads, with little or no read sequence aligning across the breakpoint position. The visual signature gives information to characterize the type of rearrangement29.
- Read inspection consists of a “manual” re-alignment of reads aligned at each breakpoint junction. Reads are extracted from bam files with “samtools view” and re-aligned using Blast (UCSC – Feb. 2013; WBcel235/ce11). This analysis aims to identify split reads supporting the breakpoint junctions (as described by Iwata et al.29). Such reads are fundamental to the design of PCR primers for further validation.
- To characterize copy number variations (stand-alone CNVs or as part of complex variants), we estimated the average genome coverage and read depth by intervals of 1–10 kb (depending on the length of the CNV) using the ‘samtools depth’ function.
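The coverage step can be illustrated with a small parser for the three-column, tab-separated text produced by ‘samtools depth’ for a single BAM file (chromosome, 1-based position, depth). This is a minimal sketch and not the in-house script used in this study: the function name and the default window size are hypothetical, positions missing from the input are counted as zero depth, and a final partial window on a chromosome is averaged over the full window length.

```python
from collections import defaultdict


def windowed_mean_depth(depth_lines, window_size=1000):
    """Average per-base depth in fixed windows.

    depth_lines -- iterable of 'chrom<TAB>pos<TAB>depth' text lines
    Returns {(chrom, window_start): mean_depth}, window_start being 0-based.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        window_start = (int(pos) - 1) // window_size * window_size
        totals[(chrom, window_start)] += int(depth)
    # Dividing by window_size treats positions absent from the input as 0.
    return {key: total / window_size for key, total in totals.items()}


if __name__ == "__main__":
    toy_lines = ["V\t1\t30", "V\t2\t30", "V\t3\t60", "V\t4\t60"]
    for (chrom, start), depth in sorted(windowed_mean_depth(toy_lines, 2).items()):
        print(chrom, start, depth)
```

Dividing each window mean by the genome-wide average then yields the coverage ratios used to interpret copy number gains and losses.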
The circular visualizations were produced using Circos62. The line charts were prepared in Excel. The data relating to the genomic variations are available in Supplementary information for BC4586 (Supplemental Table S5) and VC109 (Supplemental Tables S6 and S7) genomes. The complete list of additional SVs identified and confirmed by PCR, but not discussed in this paper, is available in Supplemental Table S1. Circos plots, PCR gels, and IGV screenshots are available in Supplemental Figures S1–S9.
Experimental validation
We confirmed breakpoints of SVs and complex rearrangements by PCR and Sanger sequencing. All primers and sequences are available in Supplementary Information (Supplemental Tables S8–S11). For the cytological assessment of bivalent diakinetic oocyte karyotypes, 1-day-old adult hermaphrodite worms were washed once in M9 medium, fixed in cold methanol, rehydrated in PBS (0.01% Tween) and mounted using SlowFade Gold antifade reagent with DAPI (Invitrogen S36938). Images were acquired using a Zeiss Imager M2. Raw counts can be found in Supplemental Table S12. The p-value was calculated using a two-tailed Z-test.
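For readers unfamiliar with the test, one common form of a two-tailed Z-test compares two proportions, for example the fraction of oocytes showing five bivalents in two strains. The text does not specify which form was used, so the pooled two-proportion version below is an assumption, and the counts in the example are hypothetical placeholders rather than the values in Supplemental Table S12.

```python
import math


def two_proportion_z_test(k1, n1, k2, n2):
    """Two-tailed pooled Z-test for the difference between two proportions.

    k1, k2 -- number of 'successes' in each group
    n1, n2 -- group sizes
    Returns (z, p_value).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        raise ValueError("pooled proportion is 0 or 1; test is undefined")
    z = (p1 - p2) / standard_error
    # Two-tailed p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: oocytes with five bivalents out of oocytes scored.
    z, p = two_proportion_z_test(2, 30, 25, 30)
    print(f"z = {z:.2f}, p = {p:.2e}")
```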
Acknowledgements
This research was enabled by the use of Compute Canada (www.computecanada.ca) computing resources. This work was supported by funding from Alberta Children’s Hospital Research Institute Foundation, Canadian Institutes of Health Research, CIHR-Project grant number PJT-156068, Eyes High Postdoctoral Fellowship, Genome Canada (275SIL)/Genome BC/CIHR (GP1-155868) LSARP Genomics and Precision Health Silent Genomes Project.
Author contributions
T.M. and M.T.G. conceptualized and designed the study. T.M., M.O. and M.T.G. analyzed and interpreted the WGS data. X.L. performed the experimental validation. S.S. and F.J. prepared the strains and DNA. T.M. drafted the manuscript and M.T.G. helped finalize it. Every author read, commented, and validated the manuscript.
Data availability
The sequencing data generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA728090.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Tatiana Maroilley and Xiao Li.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-021-97764-9.
References
- 1. Pellestor F, Gaillard J, Schneider A, Puechberty J, Gatinois V. Chromoanagenesis, the mechanisms of a genomic chaos. Semin. Cell Dev. Biol. 2021 doi: 10.1016/j.semcdb.2021.01.004.
- 2. Cortés-Ciriano I, et al. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing. Nat. Genet. 2020;52:331–341. doi: 10.1038/s41588-019-0576-7.
- 3. Goldrich DY, et al. Identification of somatic structural variants in solid tumors by optical genome mapping. J. Pers. Med. 2021;11:142. doi: 10.3390/jpm11020142.
- 4. Tommerup N. Mendelian cytogenetics. Chromosome rearrangements associated with mendelian disorders. J. Med. Genet. 1993;30:713–727. doi: 10.1136/jmg.30.9.713.
- 5. Kloosterman WP, et al. Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Hum. Mol. Genet. 2011;20:1916–1924. doi: 10.1093/hmg/ddr073.
- 6. Maroilley T, Tarailo-Graovac M. Uncovering missing heritability in rare diseases. Genes. 2019;10:275. doi: 10.3390/genes10040275.
- 7. Zepeda-Mendoza CJ, Morton CC. The iceberg under water: Unexplored complexity of chromoanagenesis in congenital disorders. Am. J. Hum. Genet. 2019;104:565–577. doi: 10.1016/j.ajhg.2019.02.024.
- 8. Anzick S, et al. Chromoanasynthesis as a cause of Jacobsen syndrome. Am. J. Med. Genet. A. 2020;182:2533–2539. doi: 10.1002/ajmg.a.61824.
- 9. Arya P, Hodge JC, Matlock PA, Vance GH, Breman AM. Two patients with complex rearrangements suggestive of germline chromoanagenesis. Cytogenet. Genome Res. 2021 doi: 10.1159/000512898.
- 10. Belyeu JR, et al. De novo structural mutation rates and gamete-of-origin biases revealed through genome sequencing of 2396 families. Am. J. Hum. Genet. 2021 doi: 10.1016/j.ajhg.2021.02.012.
- 11. Du H, et al. Analysis of structural variants reveal novel selective regions in the genome of Meishan pigs by whole genome sequencing. Front. Genet. 2021;12:550676. doi: 10.3389/fgene.2021.550676.
- 12. Langner T, et al. Genomic rearrangements generate hypervariable mini-chromosomes in host-specific isolates of the blast fungus. PLoS Genet. 2021;17:e1009386. doi: 10.1371/journal.pgen.1009386.
- 13. Crow T, et al. Gene regulatory effects of a large chromosomal inversion in highland maize. PLoS Genet. 2020;16:e1009213. doi: 10.1371/journal.pgen.1009213.
- 14. Zhao Y, et al. A spontaneous complex structural variant in rcan-1 increases exploratory behavior and laboratory fitness of Caenorhabditis elegans. PLoS Genet. 2020;16:e1008606. doi: 10.1371/journal.pgen.1008606.
- 15. Begum G, et al. Long-read sequencing improves the detection of structural variations impacting complex non-coding elements of the genome. Int. J. Mol. Sci. 2021;22:2060. doi: 10.3390/ijms22042060.
- 16. Liu Y, et al. Comparison of multiple algorithms to reliably detect structural variants in pears. BMC Genom. 2020;21:61. doi: 10.1186/s12864-020-6455-x.
- 17. Neerman N, et al. A clinically validated whole genome pipeline for structural variant detection and analysis. BMC Genom. 2019;20:545. doi: 10.1186/s12864-019-5866-z.
- 18. Cameron DL, Di Stefano L, Papenfuss AT. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat. Commun. 2019;10:3240. doi: 10.1038/s41467-019-11146-4.
- 19. Kosugi S, et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 2019;20:117. doi: 10.1186/s13059-019-1720-5.
- 20. Uguen K, et al. Genome sequencing in cytogenetics: Comparison of short-read and linked-read approaches for germline structural variant detection and characterization. Mol. Genet. Genomic Med. 2020;8:e1114. doi: 10.1002/mgg3.1114.
- 21. Onishi-Seebacher M, Korbel JO. Challenges in studying genomic structural variant formation mechanisms: The short-read dilemma and beyond. BioEssays News Rev. Mol. Cell. Dev. Biol. 2011;33:840–850. doi: 10.1002/bies.201100075.
- 22. Yang L. A practical guide for structural variation detection in the human genome. Curr. Protoc. Hum. Genet. 2020;107:e103. doi: 10.1002/cphg.103.
- 23. Ebert P, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021 doi: 10.1126/science.abf7117.
- 24. Mizuguchi T, et al. A 12-kb structural variation in progressive myoclonic epilepsy was newly identified by long-read whole-genome sequencing. J. Hum. Genet. 2019;64:359–368. doi: 10.1038/s10038-019-0569-5.
- 25.Thibodeau ML, et al. Improved structural variant interpretation for hereditary cancer susceptibility using long-read sequencing. Genet. Med. Off. J. Am. Coll. Med. Genet. 2020;22:1892–1897. doi: 10.1038/s41436-020-0880-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Lei M, et al. Long-read DNA sequencing fully characterized chromothripsis in a patient with Langer-Giedion syndrome and Cornelia de Lange syndrome-4. J. Hum. Genet. 2020;65:667–674. doi: 10.1038/s10038-020-0754-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Merker JD, et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet. Med. Off. J. Am. Coll. Med. Genet. 2018;20:159–163. doi: 10.1038/gim.2017.86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Edgley ML, Baillie DL, Riddle DL, Rose AM. Genetic balancers. WormBook Online Rev. C Elegans Biol. 2006 doi: 10.1895/wormbook.1.89.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Iwata S, Yoshina S, Suehiro Y, Hori S, Mitani S. Engineering new balancer chromosomes in C. elegans via CRISPR/Cas9. Sci. Rep. 2016;6:33840. doi: 10.1038/srep33840. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Dejima K, et al. An aneuploidy-free and structurally defined balancer chromosome toolkit for Caenorhabditis elegans. Cell Rep. 2018;22:232–241. doi: 10.1016/j.celrep.2017.12.024. [DOI] [PubMed] [Google Scholar]
- 31.Rosenbluth RE, Baillie DL. The genetic analysis of a reciprocal translocation, eT1(III; V), in Caenorhabditis elegans. Genetics. 1981;99:415–428. doi: 10.1093/genetics/99.3-4.415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Zhao Y, et al. A mutational analysis of Caenorhabditis elegans in space. Mutat. Res. 2006;601:19–29. doi: 10.1016/j.mrfmmm.2006.05.001. [DOI] [PubMed] [Google Scholar]
- 33.C. elegans Deletion Mutant Consortium. Large-scale screening for targeted knockouts in the Caenorhabditis elegans genome. G3 Bethesda Md. 2012;2:1415–1425.
- 34.Campbell PJ, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet. 2008;40:722–729. doi: 10.1038/ng.128.
- 35.Ostrander BEP, et al. Whole-genome analysis for effective clinical diagnosis and gene discovery in early infantile epileptic encephalopathy. Npj Genomic Med. 2018;3:1–10. doi: 10.1038/s41525-018-0061-8.
- 36.Miller DE, et al. Whole-genome analysis of individual meiotic events in Drosophila melanogaster reveals that noncrossover gene conversions are insensitive to interference and the centromere effect. Genetics. 2016;203:159–171. doi: 10.1534/genetics.115.186486.
- 37.Itani OA, Flibotte S, Dumas KJ, Moerman DG, Hu PJ. Chromoanasynthetic genomic rearrangement identified in a N-ethyl-N-nitrosourea (ENU) mutagenesis screen in Caenorhabditis elegans. G3 Bethesda Md. 2015;6:351–356. doi: 10.1534/g3.115.024257.
- 38.Meier B, et al. C. elegans whole-genome sequencing reveals mutational signatures related to carcinogens and DNA repair deficiency. Genome Res. 2014;24:1624–1636. doi: 10.1101/gr.175547.114.
- 39.Volkova NV, et al. Mutational signatures are jointly shaped by DNA damage and repair. Nat. Commun. 2020;11:2169. doi: 10.1038/s41467-020-15912-7.
- 40.Hillier LW, et al. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods. 2008;5:183–188. doi: 10.1038/nmeth.1179.
- 41.McClintock B. The stability of broken ends of chromosomes in Zea mays. Genetics. 1941;26:234–282. doi: 10.1093/genetics/26.2.234.
- 42.Meier B, Volkova NV, Gerstung M, Gartner A. Analysis of mutational signatures in C. elegans: Implications for cancer genome analysis. DNA Repair. 2020;95:102957. doi: 10.1016/j.dnarep.2020.102957.
- 43.Hillers KJ, Villeneuve AM. Chromosome-wide control of meiotic crossing over in C. elegans. Curr. Biol. CB. 2003;13:1641–1647. doi: 10.1016/j.cub.2003.08.026.
- 44.Cook DE, Zdraljevic S, Roberts JP, Andersen EC. CeNDR, the Caenorhabditis elegans natural diversity resource. Nucl. Acids Res. 2017;45:D650–D657. doi: 10.1093/nar/gkw893.
- 45.Laricchia KM, Zdraljevic S, Cook DE, Andersen EC. Natural variation in the distribution and abundance of transposable elements across the Caenorhabditis elegans species. Mol. Biol. Evol. 2017;34:2187–2202. doi: 10.1093/molbev/msx155.
- 46.Li Z, et al. VarBen: Generating in silico reference data sets for clinical next-generation sequencing bioinformatics pipeline evaluation. J. Mol. Diagn. JMD. 2020 doi: 10.1016/j.jmoldx.2020.11.010.
- 47.Richmond PA, et al. GeneBreaker: variant simulation to improve the diagnosis of Mendelian rare genetic diseases. Hum. Mutat. 2020 doi: 10.1002/humu.24163.
- 48.Burgess DJ. The TOPMed genomic resource for human health. Nat. Rev. Genet. 2021;22:200–200. doi: 10.1038/s41576-021-00343-x.
- 49.Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7.
- 50.Andrews S. FastQC: A Quality Control Tool for High Throughput Sequence Data. 2010.
- 51.Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinforma. Oxf. Engl. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170.
- 52.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (q-bio). 2013. http://arxiv.org/abs/1303.3997.
- 53.Li H, et al. The sequence alignment/map format and SAMtools. Bioinforma. Oxf. Engl. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 54.Fan X, Abbott TE, Larson D, Chen K. BreakDancer: Identification of genomic structural variation from paired-end read mapping. Curr. Protoc. Bioinforma. 2014;45:15.6.1–11.
- 55.Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–984. doi: 10.1101/gr.114876.110.
- 56.Rausch T, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinforma. Oxf. Engl. 2012;28:i333–i339. doi: 10.1093/bioinformatics/bts378.
- 57.Cameron DL, et al. GRIDSS: sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly. Genome Res. 2017;27:2050–2060. doi: 10.1101/gr.222109.117.
- 58.Chen X, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinforma. Oxf. Engl. 2016;32:1220–1222. doi: 10.1093/bioinformatics/btv710.
- 59.Liang Y, et al. Seeksv: an accurate tool for somatic structural variation and virus integration detection. Bioinforma. Oxf. Engl. 2017;33:184–191. doi: 10.1093/bioinformatics/btw591.
- 60.Soylev A, Kockan C, Hormozdiari F, Alkan C. Toolkit for automated and rapid discovery of structural variants. Methods San Diego Calif. 2017;129:3–7. doi: 10.1016/j.ymeth.2017.05.030.
- 61.Eisfeldt J, Vezzi F, Olason P, Nilsson D, Lindstrand A. TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000Research. 2017;6:664. doi: 10.12688/f1000research.11168.1.
- 62.Krzywinski M, et al. Circos: An information aesthetic for comparative genomics. Genome Res. 2009;19:1639–1645. doi: 10.1101/gr.092759.109.
Data Availability Statement
The sequencing data generated in this study have been submitted to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA728090.