Summary paragraph
A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments ranging in size from kilobases to whole chromosomes1–7. Here, through the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types8, we developed methods to group, classify and describe somatic structural variants. Sixteen signatures of structural variation emerged. Deletions have a multimodal size distribution; assort unevenly across tumour types and patients; are enriched in late-replicating regions; and correlate with inversions. Tandem duplications also have a multimodal size distribution, but are enriched in early-replicating regions, as are unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy number gains and frequent inverted rearrangements. One prominent structure consists of 2-7 templates copied from distinct regions of the genome strung together within one locus. Such ‘cycles of templated insertions’ correlate with tandem duplications, frequently activating the telomerase gene, TERT, in liver cancer. A wide variety of rearrangement processes are active in cancer, generating complex configurations of the genome upon which selection can act.
Mutations arising in somatic cells are the driving force of cancer development. An especially important class of somatic mutation is structural variation, in which genomic rearrangement acts to amplify, delete or reorder chromosomal material at scales ranging from single genes to entire chromosomes. Analysis of both cancer and germline genomes has enabled several distinctive patterns of SVs to be described1–7 – hypotheses about the underlying basis of several of these patterns have been proposed based on their clustering, orientation and associated copy number changes. Hypothesis-driven in vitro studies are now beginning to reveal some of the mechanistic processes generating these structures9–13, and generate further predictions that can be assessed in the genomics data. The landscape of structural variation in human cancer remains incompletely mapped, however, and there are many complex structures that still elude formal description.
The Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types generated by the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) projects. These sequencing data were aligned to the human genome (reference build hs37d5) and analysed with standardised, high-accuracy pipelines to call somatic and germline variants of all classes8. Here, we analyse the patterns and signatures of SVs across PCAWG. We propose a working classification scheme encompassing known and novel classes of SVs. We develop methods for annotating the observed SVs in a given cancer genome, identifying a class of replication-based rearrangement processes generating clusters of several SVs. We explore the size, activity and genome-wide distribution of classifiable SV types across the cohort, using signature analysis to define how they correlate within patients. Other papers within PCAWG address complementary aspects of SVs, including inference of positive selection acting on recurrently rearranged regions of the genome14; how SVs affect the transcriptome15 and chromosome topology16; patterns of somatic retrotransposition17; and distribution of chromothripsis across cancer types18.
A working classification of SVs
The glossary in Extended Table 1 defines how we have used particular terms. A structural variant (SV) manifests as a junction between two breakpoints in the genome. Generally, there will be a change in copy number across a given breakpoint if only one side of the break is rescued by an SV; if both sides of a dsDNA break are rescued, a reciprocal or balanced SV will result, without substantial copy number change. Sometimes we observe clusters of SVs where several breakpoints occur close together, either in time or in genomic space (usually both). Such spatial and/or temporal proximity generally, but not always, implies that the SVs within a cluster are mechanistically linked. Clusters can be phased, when all SVs in the cluster resolve to a single derivative chromosome, or unphased, when carried on different derivative chromosomes. An example of the latter is a reciprocal translocation, resulting in two derivative chromosomes, each with a single inter-chromosomal breakpoint junction (Figure 1).
Figure 1. Classification of structural variants in cancer genomes.
Schematics of major SV classes, grouped according to whether they are simple or complex, and whether they arise through ‘cut-and-paste’ or ‘copy-and-paste’ processes. Each schematic comprises three parts. The top shows dotted arcs for each rearrangement junction joining two chromosomal segments together. The middle shows the copy number of involved genomic segments. The bottom shows the configuration of the final derivative chromosome that results from the SV, with the colour of the segments corresponding to the colour of that segment in the copy number schematic. The ‘+’ sign for some classes indicates the different derivative chromosomes created: the SVs are not phased to a single derivative.
Distinct classes of SV can be recognised from the orientation of the two segments at the junction and associated copy number changes (Figure 1; Supplementary Figure 1). Some classes of SV are difficult to detect with short-read sequencing data, such as isochromosomes and rearrangements between extended, highly homologous sequences – these are not considered further here. We propose to categorise SV classes across two facets: the number of breakpoints involved (simple or complex) and whether the patterns are likely to arise from ‘cut-and-paste’ or ‘copy-and-paste’ rearrangement processes. A ‘cut-and-paste’ process generates a cluster of SVs consistent with reshuffling or loss of extant genomic segments; a ‘copy-and-paste’ process is one in which copies of genomic templates are newly replicated or synthesised and inserted during the rearrangement process. Deletions, reciprocal inversions, unbalanced translocations and reciprocal translocations are examples of simple ‘cut-and-paste’ SVs, since they can be reconstructed from incorrect re-ligation of chromosomal breaks. Tandem duplications are simple ‘copy-and-paste’ SVs, since they arise through local insertion of a newly generated extra copy of a genomic template.
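As an illustration of how junction orientation alone suggests a simple SV class, the following minimal sketch labels a single breakpoint junction. It assumes a BEDPE-style strand convention in which, for an intra-chromosomal junction ordered by position, +/- is deletion-type, -/+ is tandem-duplication-type and +/+ or -/- is inversion-type; the `Junction` type and the `orientation_class` function are hypothetical names introduced here. The label is only a first pass: as the text stresses, copy number and the wider cluster context are needed before a junction can be called a true deletion or tandem duplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    """One breakpoint junction (hypothetical record; strand convention is an assumption)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def orientation_class(j: Junction) -> str:
    """Label a junction from orientation only; copy number is not consulted."""
    if j.strand1 not in "+-" or j.strand2 not in "+-":
        raise ValueError("strands must be '+' or '-'")
    if j.chrom1 != j.chrom2:
        return "translocation-type"
    # Order the two breakpoints by position so the strand pair is comparable.
    (_, s_lo), (_, s_hi) = sorted([(j.pos1, j.strand1), (j.pos2, j.strand2)])
    if (s_lo, s_hi) == ("+", "-"):
        return "deletion-type"
    if (s_lo, s_hi) == ("-", "+"):
        return "tandem-duplication-type"
    return "inversion-type"
```

For example, `orientation_class(Junction("1", 100, "+", "1", 500, "-"))` gives the deletion-type label regardless of the order in which the two breakpoints are listed.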
More complex cut-and-paste SV processes also occur in cancer. Breakage-fusion-bridge events result from cycles of DNA breakage, end-to-end sister chromatid fusions, mitotic bridges and further DNA breakage. These manifest as one or a few proximate, inverted breakpoint junctions with associated copy number change, which we call fold-back inversions1,2,19 (Figure 1). Chromoplexy5,20, particularly frequent in prostate cancers, results from several simultaneous dsDNA breaks in several chromosomes that are rejoined incorrectly, leading to balanced chains of rearrangements. Chromothripsis3, in which chromosome shattering and rearrangement occur in a single catastrophic event9,21, leads to a pattern of oscillating copy number changes and localised clustering of tens to hundreds of breakpoints22.
In the germline, more complex copy-and-paste classes of SV have been described, involving small duplications and triplications, thought to arise from stalling of the replication fork leading to template switching4,23,24. Here we describe a wide range of complex copy-and-paste types of somatic SV occurring in human cancers, typically characterised by copy number gains and frequent inverted rearrangements.
Annotation of SV classes
We analysed the 2,559 whole cancer genomes across 38 tumour types (alongside their matched germline DNA) that passed the most stringent PCAWG quality control criteria: 2,429 tumours had one or more somatic SVs detected8. As described in the resource paper for the consortium8, SVs were identified using aberrantly mapping and/or split reads in paired-end sequencing data25. Four somatic SV callers were used20,25–27, with the final SV dataset comprising events returned by ≥2 callers, merged by a graph-based consensus method8. We only consider somatically acquired SVs in this analysis, and exclude somatic retrotransposition events. Validation of SV calls was undertaken using both manual inspection and pull-down with resequencing of breakpoints. With these approaches, we estimate the sensitivity of the consensus SV call set to be 90% for true calls generated by any one of the four callers; specificity was estimated as 97.5%8. A mean of 3.22 algorithms of the 4 used called each SV in the consensus set genome-wide, and this differed little across repetitive elements (mean for SINE elements, 3.22; for LINE elements, 3.21).
Since SVs from a given cancer are often highly clustered, we grouped rearrangements into clusters based on proximity of breakpoints, the overall number of events in that genome and size distribution of those events (Supplementary Methods). Essentially, a particular cluster contains SVs that are significantly closer together than expected by chance given the overall number and orientation of SVs in that patient. Alongside the clustering, we computed an in silico library of all possible genomic configurations that result from sequential simple SVs (deletions, tandem duplications, inversions, translocations, chromosome duplications or losses), to a depth of five rearrangements. For each observed SV cluster, we could then compare its genomic configuration against the library to determine how it might have arisen.
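The grouping step can be pictured with a deliberately simplified sketch. The method described above uses a statistical criterion that depends on the number, orientation and size distribution of SVs in each genome (Supplementary Methods); the sketch below replaces that criterion with a fixed distance threshold, `max_gap`, which is an assumption made purely for illustration. SVs are linked whenever any of their breakpoints lie within `max_gap` on the same chromosome, and linked SVs are merged by single linkage using a union-find structure.

```python
from collections import defaultdict


def cluster_svs(svs, max_gap=1_000_000):
    """Group SVs by breakpoint proximity (fixed-threshold stand-in, not the paper's test).

    svs: list of ((chrom, pos), (chrom, pos)) breakpoint pairs.
    Returns a list of sets of SV indices, ordered by smallest index.
    """
    parent = list(range(len(svs)))

    def find(i):
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Collect every breakpoint per chromosome, remembering which SV it belongs to.
    by_chrom = defaultdict(list)
    for idx, (bp1, bp2) in enumerate(svs):
        for chrom, pos in (bp1, bp2):
            by_chrom[chrom].append((pos, idx))

    # Neighbouring breakpoints closer than max_gap link their SVs together.
    for points in by_chrom.values():
        points.sort()
        for (pos_a, idx_a), (pos_b, idx_b) in zip(points, points[1:]):
            if pos_b - pos_a <= max_gap:
                union(idx_a, idx_b)

    groups = defaultdict(set)
    for idx in range(len(svs)):
        groups[find(idx)].add(idx)
    return sorted(groups.values(), key=min)
```

Checking only neighbouring breakpoints after sorting is sufficient for single linkage, because any two breakpoints within `max_gap` of each other are connected through the breakpoints lying between them.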
This methodology has the advantage that breakpoint junctions are classified according to the wider genomic context in which they occur. True deletions, say, will therefore be identifiably different from breakpoint junctions that happen to have a deletion-type orientation but arise within, for instance, a chromothripsis event of markedly different mechanism and properties. Over half the breakpoint junctions observed here arise within clusters of several or many SVs (Figure 2A) – removing these from catalogues of true deletions, tandem duplications and inversions enables more precise description of simple SV properties.
Figure 2. Frequency of SV classes across tumour types.
(A) Violin plots of density of classified SV categories across patients within each histology group. Tumour type panels are sorted in descending order of the average number of SV breakpoints per sample. Within each tumour type, the frequency distribution (y axis) of different SV categories (x axis) across patients is shown as a density, with the regions of highest density having the greatest width of shaded area. The number of patients (pts) is indicated at the top right of each panel.
(B) Per-sample counts of complex (lower) and classified (upper) SV breakpoint junctions for oesophageal adenocarcinoma.
(C) Per-sample counts of complex (lower) and classified (upper) SV breakpoint junctions for ovarian adenocarcinoma.
Among simple SV classes, deletion was the most common, followed by tandem duplication and unbalanced translocation. Reciprocal translocations and reciprocal inversions were uncommon events (Figure 2A). There was considerable variability in the overall numbers and distribution of SV classes across tumour types and across patients within a given tumour type (Extended Figure 1). For example, oesophageal adenocarcinomas were characterised by many deletions and a large number of complex clustered rearrangements (Figure 2B), whereas ovarian cancers often carried high numbers of tandem duplications and/or deletions with moderate numbers of unbalanced translocations (Figure 2C).
Cycles of templated insertions
We next examined clusters containing 2-10 SVs. One novel configuration consisted of several segments of copy number gains, typically on different reference chromosomes, linked together through SVs (Figure 3; Extended Figure 2). A sequential path through consecutive segments can be formed by following the breakpoint junctions, suggesting each cluster represents a string of duplicated templates inserted into a single derivative chromosome, likely acquired concurrently. While it is theoretically possible that the SVs in such clusters are not phased on the same derivative chromosome or do not occur concurrently, we think this is unlikely for several reasons. First, we found examples of RNA transcripts that spliced together exons separated by two junctions in the SV cluster (Supplementary Figure 2), suggesting they are indeed phased on the same derivative chromosome. Second, long-read sequencing data reported in the resource paper for the consortium supported the phasing of SVs linking templated insertions8. Third, we found that the clonal fraction of tumour cells tended to be more similar for SVs within these clusters than for randomly chosen SVs in each patient (Supplementary Figure 3), suggesting they co-occur in evolutionary time. Fourth, the level of copy number gain for individual segments in the cluster tended to be identical (Figure 3; Extended Figure 2).
Figure 3. Chains, cycles and bridges of templated insertions.
(A-C) Examples of a typical (A) cycle, (B) chain, and (C) bridge of templated insertions. The estimated copy number profile is shown as in Figure 1, with SVs shown as dotted arcs linking two CN segments. Underneath each is the derivative chromosome that could explain the copy number and SV profile.
(D,E) Cycles of templated insertions affecting the TERT gene in two hepatocellular carcinomas.
We define three basic categories based on whether or not the string of inserted segments returns to the original chromosome: those that do not return we term chains of templated insertions; and those that do return are either bridges (leaving a gap on the host chromosome) or cycles (re-replicating a segment on the host chromosome). Overall in the PCAWG dataset, we observed 1,467 cycles and 1,275 bridges of templated insertions (Figure 3A-B; Extended Figure 2). In chains of templated insertions, the string of genomic segments does not return to the chromosome of departure (Figure 3C, Extended Figure 2), but is similarly associated with copy number gains at each templated segment. There were 285 instances of such chains in the dataset, commonly manifesting as unbalanced translocations joined through one or more intermediary templated insertions.
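The three categories can be expressed as a small decision rule. The sketch below is an interpretation rather than the annotation method used in the analysis: it assumes the derivative chromosome is read left to right along the host, leaving it at `depart_pos` and, if it returns, re-entering at `return_pos`. Under that assumption, a return downstream of the departure point leaves a gap (bridge), while a return at or upstream of it re-replicates host sequence (cycle). The function name and arguments are hypothetical, and orientation and copy number are ignored.

```python
from typing import Optional


def classify_templated_insertion(
    host_chrom: str,
    depart_pos: int,
    return_chrom: Optional[str] = None,
    return_pos: Optional[int] = None,
) -> str:
    """Label a string of templated insertions as 'chain', 'bridge' or 'cycle'."""
    # No return to the chromosome of departure: a chain.
    if return_chrom is None or return_pos is None or return_chrom != host_chrom:
        return "chain"
    # Returning downstream skips host sequence, leaving a gap: a bridge.
    if return_pos > depart_pos:
        return "bridge"
    # Returning at or upstream of the departure point copies host sequence again: a cycle.
    return "cycle"
```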
Most templated insertion events involve just two breakpoint junctions, but this can extend to three, four or more linked rearrangements (Extended Figure 3A). The longest such event, from a cervical squamous cell cancer, had seven templated insertions strung together on an eighth host chromosome (Figure 3C; other long examples in Extended Figure 3).
Templated insertions affecting TERT
SVs drive tumour development through their effects on cancer genes, whether by altering gene copy number, disrupting tumour suppressor genes, creating fusion genes or juxtaposing the coding sequence of one gene with the regulatory apparatus of another. We found that many liver cancers had cycles of templated insertions affecting the TERT gene (Figure 3D-E, Extended Figure 4). TERT promoter point mutations are present in 54% of liver cancers and an additional 5-10% have structural variants activating the gene28. Of 30 liver cancer patients with SVs affecting TERT, we find that 10 were templated insertion events, mostly cycles. All these events duplicated the entire TERT gene and linked it to duplications of whole genes, fragments of genes or regulatory elements from elsewhere in the genome, and led to increased expression of TERT (Extended Figure 4E). Thus, this particular rearrangement process is distinctive for the exquisite precision with which a cancer can copy-and-paste normally disparate functional elements of its genome together without wholesale instability.
Tumour suppressor genes were also inactivated by templated insertions (Extended Figure 5). For example, amongst many straightforward deletions, RB1 was hit by cycles of templated insertions, a templated insertion with deletion and one instance of the linked, inverted duplications described in the following section. These events typically generated duplications of internal exons in RB1 and/or insertions of exons from other genes, all of which presumably rendered the transcript non-functional.
Local n-jumps and local-distant clusters
Many clusters of 2-10 SVs in the dataset were confined to a single genomic region. Of those comprising two local rearrangements, some had straightforward explanations such as nested or adjacent tandem duplications. Many, however, did not have a trivial explanation (Figure 4A). These included a duplication–inverted-triplication–duplication structure that has been observed in germline SVs24 (349 instances); a structure of two duplications linked by inverted rearrangements (531 instances); and structures of copy number loss plus nearby duplication linked by inverted rearrangements (472 instances). These patterns all had solutions in which breakpoints were phased to a single derivative chromosome (Figure 4A), although non-phased solutions are theoretically possible, if unlikely. Beyond clusters of two rearrangements (2-jumps), we also found examples involving three, four or more rearrangements confined to one genomic locale (Figure 4B). All these SV cluster configurations can be phased to a single derivative chromosome, with breakpoints tightly grouped.
Figure 4. Examples of clusters of 2-5 rearrangements seen in human cancers.
(A) Structures created by two local rearrangements that cannot easily be explained by simple SV classes (which we call ‘local 2-jumps’). The estimated copy number profile is shown as in Figure 1, with SVs shown as dotted arcs linking two CN segments. Underneath each are the possible configurations of the derivative chromosome, noting that multiple solutions are possible for each example.
(B) Structures created by 3-4 local rearrangements that cannot easily be explained by simple SV categories.
(C) Structures created by one local rearrangement and one rearrangement reaching elsewhere in the genome (‘local-distant’ clusters).
Beyond clusters confined to a single genomic region, we found clusters of 2-10 SVs that combined local jumps with rearrangements reaching into one or more distant regions of the genome (Figure 4C). Simple examples of these events include unbalanced translocation or large deletion with a locally-derived fragment inserted at the breakpoint, but there was also an extensive range of more complex patterns. In some cases, the source of the inserted fragment was distal to the major break, and the SV could feasibly result from several concurrent DNA breaks in close spatial proximity with capture of a short DNA fragment during repair (cut-and-paste). In other cases, the origin of the inserted fragment was proximal to the major break and associated with a gain in copy number. This pattern is difficult to explain by a cut-and-paste mechanism because the copy number gain implies the inserted segment was a duplicate of the original template, rather than a separated fragment redistributed from its original locus. Instead, a copy-and-paste mechanism may be the more parsimonious explanation for these events.
Curiously, a comparison of local footprints linked together through distant rearrangements revealed strong connectivity of footprints with the same or similar structure, often enriched 10-fold or more than expected by chance (‘Footprint connectivity analysis’, Supplementary Results). The reasons for this are unclear, but may reflect innate structural symmetry introduced either through the generation or resolution of rearrangements, or through the repeated action of a mechanism imparting consistent structural motifs.
‘Copy-and-paste’ patterns of SV clusters
The diverse patterns of 2-10 clustered SVs described in the preceding sections (Figures 3-4) share important morphological features: (1) genomic configurations that can be phased to a single derivative chromosome; (2) low-level gains in copy number, especially duplications and triplications; (3) a high frequency of inverted rearrangements in addition to non-inverted rearrangements; (4) occurrence on a chromosome background with similar average copy number to the tumour overall; and (5) tight proximity of breakpoints within the local footprint, typically <1Mb.
Using our in silico library of genomic configurations, we could define all possible routes by which sequential SVs could generate these structures through the classically defined repertoire of deletion, tandem duplication, inversion, and translocation (Supplementary Figure 4). For reasons explained in Supplementary Results, these routes typically require implausible machinations of chromosomes. In particular, the high prevalence of inverted breakpoint junctions and local copy number gains is difficult to recreate using sequential simple rearrangements – simple inversion events are uncommon in cancers (Figure 2A) and they tend not to generate copy number gains (except through breakage-fusion-bridge cycles, which also cause terminal deletions2, not seen in the events under discussion here).
If these events cannot be satisfactorily explained by sequential simple rearrangements, another possible explanation could be a complex ‘cut-and-paste’ mechanism, such as chromothripsis, chromoplexy or breakage-fusion-bridge cycles. However, the patterns described above do not fit with these processes either (Supplementary Results). Although chromothripsis with copy number gain has been described3,11,19,22, the resulting copy number and rearrangement patterns have different properties to those described here. Chromoplexy, in which chromosome breaks lead to balanced interchange at multiple breakpoint junctions5,20, typically generates unphased solutions. Repeated breakage-fusion-bridge cycles tend to cause high-level copy number gains associated with inverted, fold-back rearrangements1,2, unlike the structures reported here.
Instead, we believe many of these locally complex SV clusters with low-level copy number gains are generated in a single event by a ‘copy-and-paste’ process. That is, copying of genomic templates is an intrinsic aspect of the structural variation process in these events, with the extra copies being inserted in the resulting derivative chromosome. If the genomic templates all originate locally, we would observe local n-jumps (such as in Figure 4A-B), with tight clustering of breakpoints, phased solutions, frequent copy number gains and a mix of inverted and non-inverted breakpoint junctions. If the original templates for the copied segments derive from across the genome, chains, cycles and bridges of templated insertions would arise (Figure 3).
Genomic properties of SVs
The size of tandem duplications and deletions followed complex, often multimodal, distributions across tumour types (Figure 5A, Extended Figure 6A). However, as reported previously6,29, individual patients tended to have a simpler, usually unimodal, distribution of deletions or tandem duplications (Extended Figure 6B), implying that the complexity seen in a given tumour type results from combining samples with different profiles. The sizes of individual fragments in templated insertion events were also distinctly multimodal, with varying peak heights across tumour types (Figure 5B). Curiously, when correlating template sizes within a given event, two patterns emerged – one in which template sizes were closely correlated with one another, and one in which a small (<1kb) template was linked with one of any size (Extended Figure 7A-B). Likewise, the size of segments within a given local 2-jump event showed moderately strong correlations (Extended Figure 7C).
Figure 5. Size distribution and genomic properties of classified SVs.
(A) Size distribution of deletions per histology group, with tumour types ordered according to total number of events seen. Vertical dashed lines represent the two prominent modes.
(B) Size distribution of segments of templated insertion per histology group. For each tumour type, the three distributions for cycles, bridges and chains of templated insertions are superimposed.
(C) Associations between a subset of the genomic properties (rows) and classes of SV (columns). Each density curve represents the quantile distribution of the genomic property values at observed breakpoints compared to random genome positions. Stars indicate significant departure from uniform quantiles after multiple hypothesis correction on a one-sided Kolmogorov-Smirnov test based on a sample size of 2,559 genomes containing SVs: False Discovery Rate <0.01 *, <0.001 **, and <10⁻⁶ ***. Cells with significant property associations are shaded by the magnitude of the shift of the median observed quantile above (blue) or below (red) 0.5. The interpretation of each property from left to right is indicated by the axes to the right of the property label.
(D) Rearrangement counts as a function of bases of junction microhomology, fit to three different linear functions consistent with different formation mechanisms. NHEJ, non-homologous end-joining; MMEJ, microhomology-mediated end-joining; SSA, single-strand annealing.
(E) Enrichment or depletion of breakpoint junctions between regions of the genome with particular annotations, compared with a permuted background that preserves breakpoint positions but swaps breakpoint partners. Centre points are the mean fold-change over the permuted background; error bars represent three standard deviations. Analysis is based on a sample size of 2,559 genomes containing SVs. Complex uncl., complex clusters unclassified; LTR, long terminal repeat; SINE, short interspersed nuclear element; LINE, long interspersed nuclear element; TAD, topologically associated domain; heterochrom, heterochromatin.
A number of genomic properties, such as replication timing, transcriptional activity and chromatin state, influence the density of point mutations30,31 and copy number alterations32, but how this relates to individual SV classes is unclear. From the literature, we compiled a library of the genome-wide distribution of 38 features, including replication timing, GC content, repeat density, gene density and distance to G-quadruplex motifs among others. Replication timing had the strongest association with SV occurrence, with deletions enriched in late-replicating regions, and tandem duplications and unbalanced translocations preferentially occurring in early-replicating regions (Figure 5C, Extended Figure 8). For individual patients with high numbers of deletions or tandem duplications, we observed striking heterogeneity in their distribution according to replication timing – some had events occurring in predominantly late-replicating regions; others exclusively in early-replicating regions; and others distributed more evenly (Supplementary Figure 5). Regions of active chromatin and increased gene density correlated positively with the rate of rearrangement.
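The quantile comparison underlying such associations can be sketched as follows. Each observed breakpoint's property value (for example, a replication timing score) is converted to its quantile within the values at random genome positions; if breakpoints were indifferent to the property, these quantiles would be uniform on [0, 1]. The sketch computes the two one-sided Kolmogorov-Smirnov statistics against the uniform distribution and the median quantile. Function names are hypothetical, and p-values and the multiple-testing correction are left out.

```python
from bisect import bisect_right
from statistics import median


def quantiles_vs_background(observed, background):
    """Quantile of each observed value within the background (random-position) values."""
    bg = sorted(background)
    n = len(bg)
    return [bisect_right(bg, x) / n for x in observed]


def ks_one_sided(quantiles):
    """One-sided KS statistics of the quantiles against Uniform(0, 1).

    d_plus is large when quantiles are shifted towards 0 (low property values);
    d_minus is large when quantiles are shifted towards 1 (high property values).
    """
    u = sorted(quantiles)
    n = len(u)
    d_plus = max((i + 1) / n - x for i, x in enumerate(u))
    d_minus = max(x - i / n for i, x in enumerate(u))
    return d_plus, d_minus


def median_shift(quantiles):
    """Shift of the median observed quantile above (+) or below (-) 0.5."""
    return median(quantiles) - 0.5
```

A positive `median_shift` together with a large `d_minus` would indicate breakpoints concentrated at high values of the property, and the reverse for low values.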
A structural variant requires DNA repair pathways to join two sequences together, and several repair mechanisms are available to somatic cells. Some require sequence homology between the two ends, while others can operate to join non-homologous sequences. As previously reported2,25,33, we find across PCAWG that many SVs do not have sequence homology at the breakpoint junction (Figure 5D), and therefore arise through non-homologous end-joining. Nonetheless, a sizable fraction of SVs have more microhomology than expected by chance with an apparently bimodal distribution of microhomology lengths. One set of SVs has 2-7bp of microhomology, likely generated by microhomology-mediated end-joining; a second set of SVs has 10-30bp microhomology, likely generated through single-strand annealing or other forms of homologous recombination, including microhomology-mediated break-induced replication. Repetitive sequences in the genome, such as SINE and LINE elements, are the likely substrate of such SVs, and indeed we find enrichment for SVs joining such elements (Figure 5E; Supplementary Figure 6).
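To make the notion of junction microhomology concrete, the sketch below counts it for the simplest case, a deletion-type junction on a single reference sequence, where the derivative is `ref[:left_end] + ref[right_start:]`. Microhomology is taken as the number of bases by which the junction could slide in either direction without changing the derivative sequence. The function names are hypothetical; other junction orientations would need reverse-complement handling that is not shown. The length ranges in `suggested_repair_class` are those quoted in the text, and lengths outside them are left unassigned because the text describes them only at the level of the overall distribution.

```python
def microhomology_length(ref: str, left_end: int, right_start: int) -> int:
    """Bases of microhomology at a deletion-type junction ref[:left_end] + ref[right_start:]."""
    if not 0 <= left_end < right_start <= len(ref):
        raise ValueError("expected 0 <= left_end < right_start <= len(ref)")
    # Slide the junction rightwards while the deleted start matches the retained right flank.
    fwd = 0
    while right_start + fwd < len(ref) and ref[left_end + fwd] == ref[right_start + fwd]:
        fwd += 1
    # Slide the junction leftwards while the retained left flank matches the deleted end.
    back = 0
    while left_end - back - 1 >= 0 and ref[left_end - back - 1] == ref[right_start - back - 1]:
        back += 1
    return fwd + back


def suggested_repair_class(mh: int) -> str:
    """Map a microhomology length to the repair class suggested in the text."""
    if mh == 0:
        return "NHEJ"
    if 2 <= mh <= 7:
        return "MMEJ"
    if 10 <= mh <= 30:
        return "SSA or other homologous recombination"
    return "unassigned"
```

Because the count sums both directions, it does not depend on where within the homologous stretch a caller happens to place the breakpoint.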
Signatures of structural variation
The heterogeneous spectrum of point mutations across cancers can be reconstructed from the differential action of a relatively limited repertoire of mutational processes, each with a characteristic ‘signature’34. The differences across patients in size distribution of tandem duplication and deletion, together with the widely varying frequency and patterns of SV across tumour types and genome topology, suggest that we could similarly learn such correlations across individual SV classes.
We divided each patient’s set of SVs into mutually exclusive categories. We split the most frequent simple SV classes, deletions and tandem duplications, into 11 categories according to size, replication timing and occurrence at fragile sites. Other configurations of SVs and copy number changes seen >50 times in the cohort were each included as further categories, including the new patterns of SV described above: cycles, chains and bridges of templated insertions (also split by size); local n-jumps; and local-distant clusters.
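The input to signature discovery is, in effect, a patients-by-categories count matrix in which every SV falls into exactly one category. The sketch below shows how such a matrix could be assembled. The category scheme is illustrative only: the size boundaries (50kb and 500kb) are borrowed from the deletion signatures reported in the text, and the dictionary keys (`sv_class`, `length_bp`, `timing`, `fragile_site`) are hypothetical; the actual 11 size, timing and fragile-site categories are not reproduced here.

```python
from collections import Counter


def size_bin(length_bp: int) -> str:
    """Illustrative size bins; not the categories used in the analysis."""
    if length_bp < 50_000:
        return "<50kb"
    if length_bp <= 500_000:
        return "50-500kb"
    return ">500kb"


def category(sv: dict) -> str:
    """Assign one SV to exactly one mutually exclusive category."""
    if sv["fragile_site"]:
        return f"{sv['sv_class']}|fragile_site"
    return f"{sv['sv_class']}|{size_bin(sv['length_bp'])}|{sv['timing']}"


def count_matrix(svs_by_patient: dict):
    """Return (sorted category names, {patient: counts in that column order})."""
    counts = {
        patient: Counter(category(sv) for sv in svs)
        for patient, svs in svs_by_patient.items()
    }
    columns = sorted({c for counter in counts.values() for c in counter})
    rows = {p: [counter[c] for c in columns] for p, counter in counts.items()}
    return columns, rows
```

Each row of the resulting matrix sums to that patient's number of SVs, which is the property a signature extraction method relies on when apportioning events between signatures.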
We applied two methods for signature discovery, with comparable results, identifying 16 SV signatures (of which the 12 most prevalent are shown in Figure 6A). Signature extraction on random splits of the cohort into two halves identified 10 highly correlated signatures (Supplementary Figure 7), closely matching signatures called in the full cohort despite the lower power. Three signatures of deletions emerged, split by size – interestingly, the signature of small (<50kb) deletions included small reciprocal inversions; the signature of large (>500kb) deletions included large reciprocal inversions. This implies that the frequencies of deletions and reciprocal inversions are correlated across the cohort, both following similar size distributions within an individual patient.
Figure 6. Structural variant signatures in human cancers.
(A) The 12 most distinctive structural variation signatures extracted by the Bayesian hierarchical Dirichlet process algorithm, run on a sample size of 2,559 genomes containing SVs. Here the lengths of the bars represent the estimated proportion of each event class assigned to each signature (rows sum to one), with the black line segments representing the 95% posterior interval for bar length from the Markov chain.
(B) Association of pathogenic mutations (germline and somatic combined) in key DNA repair genes with SV signatures. Sample size of patients with pathogenic variants in the specific genes assessed are shown in brackets after each gene label (y axis). Hypothesis tests and effect sizes for each gene are derived from linear models for signature intensity after correction for histology. Significant associations from two-sided tests with correction for multiple hypothesis testing are shown. The colour and size of the points represent the estimated effect sizes.
We identified five signatures of tandem duplications, split by size and replication timing. Cycles, bridges and chains of templated insertions were particularly prominent in signatures of early-replicating tandem duplications, whereas local 2-jump structures were more associated with late-replicating tandem duplications. All these patterns exemplify the ‘copy-and-paste’ concept outlined above, in which extra copies of genomic templates are produced and inserted as an integral feature of the SV process.
Another signature was characterised by deletions and tandem duplications at chromosomal fragile sites35. Interestingly, tandem duplications were more prominent at the edges of the fragile site, whereas deletions concentrated in the centre (Extended Figure 9A-B). The size range of fragile site deletions peaked around 100kb, similar to the larger deletion signature, while the rarer fragile site tandem duplications showed no strong size peak (Extended Figure 9C). Sites of fragility varied extensively across tumour types (Extended Figure 9D).
Unbalanced translocations comprised their own signature, suggesting they derive from a distinct rearrangement process in cancer genomes. A further signature comprised the fold-back inversions that are a hallmark of breakage-fusion-bridge cycles, together with similar structures such as translocations adjacent to fold-back inversions. Finally, there was a signature of balanced rearrangements, including reciprocal translocations and chromoplexy clusters5. This signature probably arises from several dsDNA breaks, potentially occurring in interphase, in which both sides of each break are incorrectly repaired through ligation to other, simultaneously broken regions of the genome.
SVs by DNA repair genes and tumour type
We grouped annotations of pathogenic germline variants and somatic driver mutations in DNA repair genes across the cohort8, correlating their presence with activity of the SV signatures (Figure 6B). As previously described in breast and ovarian cancers6,29, BRCA1 mutations were significantly associated with the small tandem duplication signatures, the mechanistic basis of which is increasingly well understood10. Also, as previously described6,36, CDK12 variants predicted signatures of mid-sized to large tandem duplications. BRCA2 variants correlated with small deletions, as expected29, but also with the reciprocal SV signature that includes chromoplexy. Interestingly, PALB2 variants showed the same correlations with signatures of small deletions and reciprocal SVs as BRCA2. PALB2 colocalises with, stabilises and assists BRCA2 during homologous recombination37, so inactivation of either gene would be expected to produce similar SV signatures. These associations between driver mutations and SV signatures were consistently evident across many tumour types (Extended Figure 10).
The SV signatures showed considerable heterogeneity in their activity across tumour types and among patients within a given tumour type (Supplementary Figure 8). Tumours of the gastrointestinal tract, including colorectal and oesophageal adenocarcinomas, showed high rates of the fragile site signature. Prostate cancer was striking for the prevalence of the chromoplexy signature, as reported previously5,20. Squamous cell carcinomas of the lung were characterised by the fold-back inversion signature.
We assessed how different classes of SV altered known cancer genes (Supplementary Table 1). Some cancer genes only acquire oncogenic potential with specific structural events, such as fusion genes or enhancer hijacking. Not surprisingly, these genes typically showed little variability in which classes of SV could generate such events (Extended Figure 11A-C), although there were exceptions. The TMPRSS2-ERG fusion gene of prostate cancer, for example, was generated by a range of processes, including simple deletions, chromoplexy and chromothripsis, all prevalent signatures in this tumour type (Extended Figure 11D-F).
Tumour suppressor genes and recurrently amplified genes showed more variability in which types of SV were observed, and these were shaped by signatures active in the relevant tumour types. For example, the tumour suppressor genes, PTEN and RAD51B, commonly inactivated in breast and ovarian cancers, were often targeted by tandem duplications generating out-of-frame exon duplications (Extended Figure 12A-B). In contrast, deletions were the predominant events inactivating SMAD4 and CDKN2A, in keeping with their prevalence in cancers of the gastrointestinal tract (Extended Figure 12C-D). MYC, one of the most commonly amplified genes pan-cancer, showed significant diversity in mechanisms of rearrangement: nested tandem duplications in breast cancer; translocations or chromoplexy with IGH in lymphoma; as well as chromothripsis, cycles of templated insertions, local n-jumps and local-distant clusters in other tumour types (Extended Figure 13).
Discussion
We have described the patterns and signatures of structural variation in a large cohort of uniformly analysed cancer genomes. A major grouping of SV patterns that emerges from our study is one in which extra copies of genomic templates are inserted during the rearrangement process. This includes simple events such as tandem duplications, but we also find a range of more complex events with duplications and triplications, rearranged locally as well as inserted distantly. The signatures analysis grouped a large proportion of these more complex events together with tandem duplications, suggesting they represent a continuum of processes sharing underlying properties. A replication-based mechanism has been proposed to explain local 2-jumps4,23,24, in which stalled replication forks or other DNA lesions cause the DNA polymerase to switch templates and continue replication in a new location. Studies in experimental models are now revealing that a wide range of mechanisms and DNA lesions can result in templated insertions – these include tandem duplications in BRCA1 deficiency10, translocations with templated insertions caused by dysregulated strand invasion38, and distant templated insertions in the absence of replication helicases39.
Genomic instability in cancer, then, is not a single phenomenon. Instead, many different mutational processes can act to restructure the genome and, in doing so, generate a remarkably flexible array of possible structures. Any given tumour draws on a subset of the available processes, shaped by the cell of origin, germline predisposition and other unknown factors. Selection does the rest, promoting the clone that has chanced upon the particular structure that increases its potential for self-determination.
Methods
A detailed description of the methods used in this paper and many additional results are described in Supplementary Information. Here, we summarise the key aspects of the analysis:
Generation of the SV call-set
The final set of SVs used in this manuscript was generated by the Technical Working Group of the PCAWG consortium and described in the main PCAWG paper8. Briefly, four variant callers were used to identify somatically acquired SVs from matched tumour and germline whole genome sequencing data: SvABA (Broad pipeline), DELLY (DKFZ pipeline), BRASS (Sanger pipeline) and dRanger (Broad pipeline). These were merged into a final call-set using a graph-based algorithm to identify overlapping breakpoint junctions across algorithms. Detailed visual inspection of SV calls suggested that a simple approach of accepting all SV calls made by 2 or more of the 4 algorithms gave the best trade-off between sensitivity and specificity.
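The consensus rule described above can be illustrated with a minimal Python sketch. It is not the PCAWG merging code: the record layout, the 300 bp matching tolerance (`TOLERANCE`) and the function names are assumptions made for illustration, and each call is assumed to list its two ends in a consistent order. Calls are treated as nodes of a graph, overlapping junctions are linked, and a connected group is kept only if at least two distinct callers contribute to it.

```python
from collections import defaultdict
from itertools import combinations

TOLERANCE = 300  # hypothetical matching window in bp, not taken from the paper


def same_junction(a, b, tol=TOLERANCE):
    """True if two calls plausibly describe the same breakpoint junction."""
    return (a["chr1"] == b["chr1"] and a["chr2"] == b["chr2"]
            and a["strand1"] == b["strand1"] and a["strand2"] == b["strand2"]
            and abs(a["pos1"] - b["pos1"]) <= tol
            and abs(a["pos2"] - b["pos2"]) <= tol)


def merge_calls(calls, min_callers=2, tol=TOLERANCE):
    """Group overlapping calls (connected components of the overlap graph)
    and keep groups supported by at least `min_callers` distinct callers."""
    parent = list(range(len(calls)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(calls)), 2):
        if same_junction(calls[i], calls[j], tol):
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, call in enumerate(calls):
        groups[find(i)].append(call)

    return [g for g in groups.values()
            if len({c["caller"] for c in g}) >= min_callers]


# Toy input: two callers agree on one junction, a third call is unsupported.
calls = [
    {"caller": "SvABA", "chr1": "1", "pos1": 1000, "strand1": "+",
     "chr2": "1", "pos2": 5000, "strand2": "-"},
    {"caller": "DELLY", "chr1": "1", "pos1": 1010, "strand1": "+",
     "chr2": "1", "pos2": 5005, "strand2": "-"},
    {"caller": "BRASS", "chr1": "2", "pos1": 700, "strand1": "+",
     "chr2": "9", "pos2": 300, "strand2": "+"},
]
consensus = merge_calls(calls)
```

In this toy input only the junction seen by both SvABA and DELLY survives; lowering `min_callers` to 1 would also keep the single-caller call.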
SV clustering and annotation
To identify clusters of SVs, we developed a method for grouping SVs into clusters and footprints so that structural and mechanistic inferences could be made systematically. In parallel, we processed the somatic copy number (CN) data and merged it with SV junctions to derive rearrangement patterns from the resulting SV clusters and footprints. We produced normalised representations of SV cluster patterns, which allowed us to tabulate the different cluster and footprint patterns and analyse their features. Finally, we performed manual and simulation-assisted interpretation of the recurrently observed cluster and footprint patterns. The individual steps of the SV classification pipeline are outlined below and detailed in the subsequent subsections.
Computing exact breakpoint coordinates from clipped reads.
Removing redundant “segment-bypassing” SVs.
Merging rearrangement breakpoints with copy number data to yield SV breakpoint-demarcated normalized absolute copy number data.
Clustering individual SVs into SV clusters and footprints.
Heuristically refining SV clusters and footprints.
Filtering artefactual fold-back-type SVs with insufficient support.
Determining balanced overlapping breakpoints. This step is to distinguish very short templated insertions from mutually overlapping balanced breakpoints.
Computing rearrangement patterns and categories.
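As a rough illustration of the final categorisation step for the simplest case, the sketch below assigns a single, unclustered breakpoint junction to a class from its chromosomes and strand orientations. It is not the ClusterSV implementation. The orientation convention (a "+" strand on the lower-coordinate end joined to a "-" strand on the higher-coordinate end denotes a deletion-type junction) is an assumption that differs between callers, the function name is hypothetical, and the 20 kb threshold follows the fold-back inversion entry in the glossary. A true fold-back call would also require the associated copy number change, which is not checked here.

```python
FOLDBACK_MAX_BP = 20_000  # glossary: fold-back breakpoints typically <20kb apart


def classify_junction(chr1, pos1, strand1, chr2, pos2, strand2):
    """Classify one isolated breakpoint junction (assumed strand convention)."""
    if chr1 != chr2:
        return "translocation"
    # Order the two ends by coordinate so the strand pair is unambiguous.
    if pos1 > pos2:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    if strand1 == "+" and strand2 == "-":
        return "deletion"
    if strand1 == "-" and strand2 == "+":
        return "tandem duplication"
    # Both ends on the same strand: an inverted junction.
    if pos2 - pos1 < FOLDBACK_MAX_BP:
        return "fold-back inversion candidate"
    return "inverted junction"


examples = [
    classify_junction("1", 100, "+", "1", 5000, "-"),
    classify_junction("1", 5000, "+", "1", 100, "-"),
    classify_junction("3", 100, "+", "3", 900, "+"),
    classify_junction("3", 100, "-", "3", 500_000, "-"),
    classify_junction("2", 100, "+", "9", 100, "-"),
]
```

Clustered events such as chromoplexy, local n-jumps or templated insertion cycles cannot be recognised junction by junction; they depend on the cluster and footprint structure and the copy number data described in the steps above.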
Distribution of SVs across the genome
We divided the hg19 human reference genome (autosomes and chromosome X) into 3,036,315 pixels of 1kb, and calculated a suite of metrics per-pixel to summarise a variety of genome properties with potential relevance to the distribution of rearrangements, as listed in Supplementary Information. Properties were matched as closely as possible to the tissue of origin for PCAWG cancer samples. All other genome properties were held fixed across all tissues. To test for association between SV event classes and the library of genome properties, the genome property metrics were compared between real SV positions (randomly choosing one side of each breakpoint junction to reduce dependence between observations) and 1 million uniform random positions from the callable genome space. To compare the tissue-specific properties, each random position was assigned a random tissue type, drawing from the observed tissue type distribution in the SV call set. For each genome property and each event class, the real observations were pooled amongst the random ones, then rank transformed and normalised on a scale from 0 to 1. Under the null hypothesis of no event-vs-property association, the ranks of the real observations would follow a uniform distribution. We tested this in each case with a Kolmogorov-Smirnov test then applied a Benjamini-Yekutieli correction for false discovery rate across the entire suite of tests and set the threshold for significance reporting at 0.01.
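The testing procedure above (pooling, rank transformation, a Kolmogorov-Smirnov test against the uniform distribution, then Benjamini-Yekutieli correction) can be sketched in plain Python as follows. This is an illustrative reconstruction, not the authors' code: the function names are hypothetical, ties are not treated specially, the two-sided statistic is used for simplicity (the legend of Extended Figure 8 describes a one-sided test), and the p-value uses a standard asymptotic approximation to the Kolmogorov distribution.

```python
import math


def pooled_quantiles(real, background):
    """Rank the real observations within the pooled real + random values
    and scale the ranks to (0, 1]."""
    pooled = sorted([(v, 1) for v in real] + [(v, 0) for v in background])
    n = len(pooled)
    return [(rank + 1) / n
            for rank, (_, is_real) in enumerate(pooled) if is_real]


def ks_uniform(q):
    """Two-sided one-sample KS statistic against Uniform(0, 1), with an
    asymptotic p-value."""
    q = sorted(q)
    n = len(q)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(q))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:  # series converges slowly here; the p-value is ~1
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted p-values (FDR under arbitrary dependence)."""
    m = len(pvals)
    c_m = sum(1 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # step down from the largest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: "real" property values are shifted towards the high end.
background = [i / 1000 for i in range(1000)]
real = [0.9 + i / 1000 for i in range(100)]
d, p = ks_uniform(pooled_quantiles(real, background))
adjusted = benjamini_yekutieli([p, 0.02, 0.03])
significant = [a < 0.01 for a in adjusted]  # reporting threshold from the text
```

In the toy data the shifted property is flagged at the 0.01 threshold, while the two placeholder p-values of 0.02 and 0.03 are not.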
SV signatures analysis
We deployed two algorithms for extracting SV signatures. Both used the same input files, comprising a matrix of counts per patient (across all patients) of SV clusters falling into a number of mutually exclusive categories. These categories included the major classes of SVs, with the commoner events (deletions, tandem duplications and inversions) split by size and/or replication timing. The two algorithms deployed for extracting the signatures were (1) a hierarchical Dirichlet process and (2) non-negative matrix factorisation. Further details on the implementation of these algorithms are available in Supplementary Information.
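For the second of the two algorithms, the idea of non-negative matrix factorisation can be shown with a small pure-Python sketch using the generic Lee-Seung multiplicative updates for the squared-error objective. This is not the implementation used in the study, which is described in Supplementary Information; the function names, the toy count matrix and the choice of two signatures are assumptions made for illustration. The patient-by-category count matrix V is approximated as W x H, where each row of H is a signature over SV categories and each row of W gives the activity of the signatures in one patient.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def nmf(v, k, n_iter=500, seed=0, eps=1e-9):
    """Factorise non-negative v (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual v - w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Toy counts: 4 patients x 4 SV categories, built from two underlying patterns.
counts = [[8, 2, 0, 0],
          [0, 0, 3, 7],
          [8, 2, 3, 7],
          [16, 4, 3, 7]]
w1, h1 = nmf(counts, k=2, n_iter=1)
w, h = nmf(counts, k=2, n_iter=500)
err_first, err_final = frobenius_error(counts, w1, h1), frobenius_error(counts, w, h)
```

The multiplicative form keeps every entry of W and H non-negative, which is what allows the rows of H to be read as additive signatures. Choosing the number of signatures and assessing their stability are separate questions not addressed by this sketch.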
Extended Data
Extended Figure 1. Per-sample counts of SV breakpoint junctions by histology group.
Counts of simple, classified SVs are shown above the x axis and counts of complex breakpoint junctions below the x axis. Patients within each tumour type are ranked by frequency of simple SVs.
Extended Figure 2. Further examples of templated insertion chains, cycles and bridges.
Schematics follow the same structure as Figure 3.
Extended Figure 3. Number of breakpoint junctions in cycles, bridges and chains of templated insertions.
(A) Histogram of numbers of breakpoint junctions in templated insertion cycles, chains and bridges across all samples in all tumour types in the cohort.
(B,C) Two examples of particularly long cycles of templated insertions in the cohort. Examples are depicted similarly to those in Figure 3.
Extended Figure 4. Templated insertion events activating TERT in hepatocellular carcinoma.
(A) The positions of all SV breakpoints in the TERT regions in the PCAWG cohort (including 50kb flanks either side), coloured by classification and vertically spaced by the distance to the next breakpoint in the cohort. If the two sides of a breakpoint junction are contained within the plotting window, they are joined by a curved line. The number of samples with a breakpoint in the plotting window is annotated top left.
(B,C,D) Examples of two cycles and a chain of templated insertions affecting TERT in hepatocellular carcinomas.
(E) Expression levels of TERT in patients with hepatocellular carcinoma (n=187 patients), separated by whether TERT was wild-type, had an activating promoter point mutation, had SVs in a templated insertion, or had SVs of another class. Individual patient data are shown as points. The box shows the median expression level as a thick black line, with the box’s range denoting the interquartile range. The whiskers show the range of data or 1.5x the interquartile range, whichever is less.
Extended Figure 5. Templated insertion events inactivating RB1 in breast and ovarian carcinomas.
(A) The positions of all SV breakpoints in the RB1 regions in the PCAWG cohort (including 50kb flanks either side), coloured by classification and vertically spaced by the distance to the next breakpoint in the cohort. If the two sides of a breakpoint junction are contained within the plotting window, they are joined by a curved line. The number of samples with a breakpoint in the plotting window is annotated top left.
(B,C,D,E) Examples of three cycles and a bridge of templated insertions affecting RB1 in breast and ovarian carcinomas.
Extended Figure 6. Size distribution of tandem duplications.
(A) Size distribution of tandem duplications per histology group.
(B) Samples with more than 20 tandem duplications were grouped using hierarchical clustering according to the within-patient distribution of tandem duplication size. Seven clusters emerged, with the size distribution of up to 8 randomly chosen samples per cluster illustrated. The numbers in the top right of each panel denote the number of tandem duplications in that sample.
Extended Figure 7. Size properties of clustered SV classes.
(A) Comparison of the minimum and maximum templated insert size for multi-insert cycles, chains and bridges of templated insertions.
(B) All events with three or more templated inserts, grouped by combination of insert sizes.
(C) Correlations (Pearson’s correlation coefficient) and raw sizes of individual genomic segments for reciprocal inversions and local 2-jumps. Each individual event is shown as a line linking the size of the individual segments in that event. The sample sizes for each event class are shown in the labels for each panel.
Extended Figure 8. Relationship of extended panel of genomic properties with SV categories.
Associations between a subset of the genomic properties (rows) and classes of SV (columns). Each density curve represents the quantile distribution of the genomic property values at observed breakpoints compared to random genome positions. Stars indicate significant departure from uniform quantiles after multiple hypothesis correction by the Benjamini-Yekutieli method on a one-sided Kolmogorov-Smirnov test based on a sample size of 2,559 genomes containing SVs: False Discovery Rate <0.01 *, <0.001 **, and <10⁻⁶ ***. Cells with significant property associations are shaded by the magnitude of the shift of the median observed quantile above (blue) or below (red) 0.5. The interpretation of each property from left to right is indicated by the axes to the right of the property label.
Extended Figure 9. Properties of SVs at chromosomal fragile sites.
(A) SV breakpoints in the most affected fragile sites: FHIT, MACROD2 and WWOX. These are coloured by classification and vertically spaced by the distance to the next breakpoint in the cohort. If the two sides of a breakpoint junction are contained within the plotting window, they are joined by a curved line. The number of samples with a breakpoint in the plotting window is annotated top left.
(B) Number of deletions and tandem duplications (upper) and number of affected samples (lower) for the 18 fragile sites considered in this analysis.
(C) Size distribution of deletions and tandem duplications in fragile sites (FS) compared to the rest of the genome.
(D) Fragile site preference for 20 cancer histology groups as indicated by the proportion of samples harbouring a deletion in each of the 18 fragile sites considered here. The number of samples is indicated in parentheses.
Extended Figure 10. Consistency of associations between signatures and mutations in DNA repair genes.
(A) Box-and-whisker plots showing number of SVs attributed to the small deletion signature in different tumour types, split by BRCA2 status (BRCA2 wild-type in orange; BRCA2 mutant in cyan). The box denotes the interquartile range, with the median marked as a horizontal line. The whiskers extend as far as the range or 1.5x the interquartile range, whichever is less. Outlier patients are shown as points. Note the increase in events attributed to the small deletion signature when BRCA2 is mutant, across multiple tumour types (breast, pancreatic, ovarian, prostate, lung squamous etc).
(B) Box-and-whisker plots as for (A) showing number of SVs attributed to the small deletion signature in different tumour types, split by PALB2 status.
(C) Box-and-whisker plots as for (A) showing number of SVs attributed to the early replicating, small tandem duplication signature in different tumour types, split by BRCA1 status.
(D) Box-and-whisker plots as for (A) showing number of SVs attributed to the large tandem duplication signature in different tumour types, split by CDK12 status.
Extended Figure 11. Patterns of SVs causing fusion genes and enhancer hijacking.
(A) Rainfall plot of SV breakpoints in the genes KIAA1549 and BRAF, commonly fused together through a tandem duplication in pilocytic astrocytomas. SVs are coloured by classification and arranged vertically by the distance to the next breakpoint in the cohort. If the two sides of a breakpoint junction are contained within the plotting window, they are joined by a curved line. The number of samples with a breakpoint in the plotting window is annotated at the top.
(B) Rainfall plot of SV breakpoints affecting RET, commonly fused to CCDC6 by inversion in papillary thyroid cancer.
(C) Rainfall plot of SV breakpoints affecting BCL2, commonly hijacked to the IGH immunoglobulin locus by translocations in B-cell lymphomas.
(D) Rainfall plot of SV breakpoints affecting ERG, commonly fused with TMPRSS2 by deletion or more complex events in prostate adenocarcinoma.
(E) Example of a TMPRSS2-ERG fusion gene in a prostate adenocarcinoma created by a chromoplexy cycle. The estimated copy number profile is shown as black horizontal segments, with SVs shown as dotted arcs linking the edges of two CN segments.
(F) Example of a TMPRSS2-ERG fusion gene in a prostate adenocarcinoma created by chromothripsis.
Extended Figure 12. Patterns of SVs affecting selected tumour suppressor genes.
(A) Rainfall plot of SV breakpoints in the gene PTEN, commonly inactivated in breast and ovarian adenocarcinomas, where tandem duplication signatures are frequent. SVs are coloured by classification and arranged vertically by the distance to the next breakpoint in the cohort. If the two sides of a breakpoint junction are contained within the plotting window, they are joined by a curved line. The number of samples with a breakpoint in the plotting window is annotated at the top.
(B) Rainfall plot of SV breakpoints affecting RAD51B, commonly inactivated in breast and ovarian adenocarcinomas.
(C) Rainfall plot of SV breakpoints affecting CDKN2A, commonly inactivated in tumours of the gastrointestinal tract, where deletion signatures are common.
(D) Rainfall plot of SV breakpoints affecting SMAD4, commonly inactivated in tumours of the gastrointestinal tract.
Extended Figure 13. Examples of SVs increasing the copy number of MYC.
The estimated copy number profile is shown as black horizontal segments, with SVs shown as dotted arcs linking the edges of two CN segments.
Extended Table 1. Glossary of key terms as we use them in this paper.
Term | Description |
---|---|
Structural variant (SV) | Juxtaposition of non-contiguous chromosomal segments through a process of genomic rearrangement. |
Breakpoint | A position in the genome at which a chromosome is broken and joined to a non-contiguous segment by a rearrangement. |
Copy number alteration (CNA) | Change in the number of copies of a given chromosomal segment from that expected. |
Reciprocal or balanced SV | A pair of SVs in which both sides of a single dsDNA break are rescued in the rearrangement. Typically used to describe some inversions and some translocations. |
Unbalanced SV | An SV (usually inversion or translocation) in which only one side of the dsDNA break is rescued, thereby generating a copy number alteration across the breakpoint. |
Cluster of SVs | A set of SVs that are closer together in genomic space than expected by chance. Typically, such clustering implies a shared mechanistic basis for the SV generation. |
Derivative chromosome | A chromosome that carries one or more SVs. |
Phased SVs | Set of SVs and copy number alterations in a cluster carried on a single derivative chromosome. |
Chromosomal segment | A contiguous stretch of DNA that is of constant copy number, used to denote the regions of chromosome between SVs. |
Template | A region of chromosomal DNA that is copied and inserted elsewhere in the genome. |
SV class | A type of structural variant, such as deletion, tandem duplication or translocation. |
Deletion | Loss of a segment of chromosome from the genome spanned by a junction between the two breakpoints either side. |
Tandem duplication | Extra copy of a segment of chromosome in which the duplicated region is inserted immediately adjacent to the original template in the same orientation. |
Reciprocal inversion | A segment of chromosomal DNA inserted into its original position, but in the opposite orientation. |
Fold-back inversion | An inverted rearrangement between two breakpoints typically <20kb apart on the chromosome, with associated copy number change. Often a sign of breakage-fusion-bridge cycles. |
Translocation | Breakpoint junction between two different chromosomes, either reciprocal or unbalanced. |
Breakage-fusion-bridge cycle | SV mechanism in which a naked DNA end (breakage) is copied to its sister chromatid during S phase, with the two ends undergoing fusion (by fold-back inversion). At anaphase, the resulting dicentric chromosome is stretched between the two daughter cells (bridge), leading to further DNA breakage and potentially further cycles. |
Chromoplexy | A set of >2 reciprocal SVs in which the chromosomal ends either side of each breakpoint are shuffled such that every end is rescued in a rearrangement junction. |
Chromothripsis | A cluster of many SVs (10s to 100s) in one or a few chromosomes, occurring in a single catastrophic event, with oscillating copy number profile and rearrangement junctions of all four possible orientations. |
Local n-jump | A cluster of n SVs in a single genomic region, typically phased to a single derivative chromosome, exhibiting some copy number gains and junctions with inverted and non-inverted orientation. |
Cycle, chain or bridge of templated insertions | Copies of one or more genomic templates drawn from across the genome, strung together in a contiguous string and inserted into a single derivative chromosome. A chain of templated insertions does not return to the original chromosome, leading to an unbalanced translocation. A cycle has a duplication on the host chromosome, while a bridge inserts the template copies into a deletion on the host chromosome. |
Local-distant cluster | A cluster of SVs that has both local rearrangements and rearrangements to other parts of the genome. |
Acknowledgements
This work was supported by the Wellcome Trust. P.J.C. is a Wellcome Trust Senior Clinical Fellow (WT088340MA). We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium, and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment and harmonised variant calling of the cancer genomes used in this study. We thank the patients and their families for their participation in the individual ICGC and TCGA projects.
Footnotes
Author Contributions
Y.L., N.D.R., J.A.W. and O.S. contributed equally to this manuscript, undertaking evaluation and curation of structural variant calls, merging SV call-sets from 4 separate algorithms into a final dataset. Y.L. performed the clustering and classification of SVs, and identified novel patterns of rearrangement, with assistance from N.D.R. and M.I. N.D.R. performed the analysis of SV signatures with assistance from Y.L. N.D.R., J.A.W. and O.S. analysed the distribution of SVs across the genome, with input from J.E.H., E.K., K.K. and S.E.S. S.W. and J.K. contributed to the analysis of how germline variants influenced signatures of SVs. J.W., R.B. and P.J.C. jointly oversaw the project, assisted with data interpretation and wrote the paper, with input from all authors.
Competing Interests
The following authors declare that they have competing interests: Rameen Beroukhim (owns equity in Ampressa Therapeutics); Matthew Meyerson (scientific advisory board chair of, and consultant for, OrigiMed; research funding from Bayer and Ono Pharma; patent royalties from LabCorp.); Jeremiah Wala (consultant for Nference Inc.); Cheng-Zhong Zhang (co-founder and equity holder of Pillar Biosciences, a for-profit company specializing in the development of targeted sequencing assays.)
Data Availability
Somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium are described in the PCAWG marker paper8 and are available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identifying information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorisation.
Code Availability
The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution. These are described in detail in the PCAWG marker paper8. The code for grouping SVs into SV clusters and footprints is available at https://github.com/cancerit/ClusterSV/ (version 1.0). The code for simulating rearrangements can be found at https://github.com/cancerit/SimSvGenomes (version 1.0). Code for sampling from the hierarchical Dirichlet process for identification of mutational signatures is implemented as an R package at https://github.com/nicolaroberts/hdp (version 0.1.1).
Participants In PCAWG Structural Variation Working Group
In alphabetical order
Kadir C Akdemir12, Eva G Alvarez13-15, Adrian Baez-Ortega16, Rameen Beroukhim3-5, Paul C Boutros17-20, David D L Bowtell21,22, Benedikt Brors23-25, Kathleen H Burns26, Peter J Campbell1,11, Kin Chan27, Ken Chen12, Isidro Cortés-Ciriano28-30, Ana Dueso-Barroso31, Andrew J Dunford3, Paul A Edwards32,33, Xavier Estivill34, Dariush Etemadmoghadam21, Lars Feuerbach24, J Lynn Fink31,36, Milana Frenkel-Morgenstern37, Dale W Garsed21, Mark Gerstein38-41, Dmitry A Gordenin42, David Haan43, James E Haber8, Julian M Hess3,44, Barbara Hutter23,25,45, Marcin Imielinski6,9, David TW Jones46,47, Young Seok Ju1,48, Marat D Kazanov49-51, Leszek J Klimczak52, Youngil Koh53,54, Jan O Korbel7, Kiran Kumar3, Eunjung Alice Lee55, Jake June-Koo Lee29,56, Yilong Li1,2, Andy G Lynch32,33,57, Geoff Macintyre32,33, Florian Markowetz32,33, Iñigo Martincorena1, Alexander Martinez-Fundichely58-60, Matthew Meyerson3,4,61, Satoru Miyano62, Hidewaki Nakagawa63, Fabio CP Navarro40, Stephan Ossowski64-66, Peter J Park29,30, John V Pearson67,68, Montserrat Puiggròs31, Karsten Rippe69, Nicola D Roberts1, Steven A Roberts70, Bernardo Rodriguez-Martin13-15, Steven E Schumacher3-5, Ralph Scully71, Mark Shackleton21,22, Nikos Sidiropoulos10, Lina Sieverling24,72, Chip Stewart3, David Torrents31,73, Jose MC Tubio13-15, Izar Villasante31, Nicola Waddell67,68, Jeremiah A Wala3-5, Joachim Weischenfeldt10, Lixing Yang74, Xiaotong Yao9,75, Sung-Soo Yoon54, Jorge Zamora1,13-15 and Cheng-Zhong Zhang3,4,61
1. Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK.
2. Totient Inc, Cambridge, MA 02142, USA.
3. The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA.
4. Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA 02138, USA.
5. Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA.
6. Weill Cornell Medical College, New York, NY 10065, USA
7. European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
8. Department of Biology and Rosenstiel Basic Medical Sciences Research Center, Brandeis University, Waltham, MA 02454, USA.
9. New York Genome Center, New York, NY 10013, USA.
10. Biotech Research & Innovation Centre (BRIC); The Finsen Laboratory, Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
11. Stem Cell Institute and Department of Haematology, University of Cambridge, Cambridge CB2 2XY, UK.
12. University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
13. Department of Zoology, Genetics and Physical Anthropology, Universidade de Santiago de Compostela, Santiago de Compostela 15706, Spain.
14. Centre for Research in Molecular Medicine and Chronic Diseases (CIMUS), Universidade de Santiago de Compostela, Santiago de Compostela 15706, Spain.
15. The Biomedical Research Centre (CINBIO), Universidade de Vigo, Vigo 36310, Spain.
16. Transmissible Cancer Group, Department of Veterinary Medicine, University of Cambridge, Cambridge CB3 0ES, UK.
17. Computational Biology Program, Ontario Institute for Cancer Research, Toronto, ON M5G 0A3, Canada.
18. Department of Medical Biophysics, University of Toronto, Toronto, ON M5S 1A8, Canada.
19. Department of Pharmacology, University of Toronto, Toronto, ON M5S 1A8, Canada.
20. University of California Los Angeles, Los Angeles, CA 90095, USA.
21. Peter MacCallum Cancer Centre, Melbourne, VIC 3000, Australia.
22. Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, VIC 3052, Australia.
23. National Center for Tumor Diseases (NCT) Heidelberg, Heidelberg 69120, Germany.
24. Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
25. German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
26. Johns Hopkins School of Medicine, Baltimore, MD 21205, USA.
27. University of Ottawa Faculty of Medicine, Department of Biochemistry, Microbiology and Immunology, Ottawa, ON K1H 8M5, Canada.
28. Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK.
29. Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA.
30. Ludwig Center, Harvard Medical School, Boston, MA 02115, USA.
31. Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain.
32. Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge CB2 0RE, UK.
33. University of Cambridge, Cambridge CB2 1TN, UK.
34. Sidra Medicine, Doha 26999, Qatar.
35. Barcelona Supercomputing Center, Barcelona 08034, Spain.
36. Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, The University of Queensland, St Lucia, QLD 4072, Australia.
37. The Azrieli Faculty of Medicine, Bar-Ilan University, Safed 13195, Israel.
38. Department of Computer Science, Princeton University, Princeton, NJ 08540, USA.
39. Department of Computer Science, Yale University, New Haven, CT 06520, USA.
40. Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
41. Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.
42. Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences (NIEHS), Durham, NC 27709, USA.
43. Biomolecular Engineering Department, University of California, Santa Cruz, Santa Cruz, CA 95064, USA.
44. Massachusetts General Hospital Center for Cancer Research, Charlestown, MA 02129, USA.
45. Heidelberg Center for Personalized Oncology (DKFZ-HIPO), German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
46. Hopp Children's Cancer Center (KiTZ), Heidelberg 69120, Germany.
47. Pediatric Glioma Research Group, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
48. Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea.
49. Skolkovo Institute of Science and Technology, Moscow 121205, Russia.
50. A. A. Kharkevich Institute of Information Transmission Problems, Moscow 127051, Russia.
51. Dmitry Rogachev National Research Center of Pediatric Hematology, Oncology and Immunology, Moscow 117997, Russia.
52. Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences (NIEHS), Durham, NC 27709, USA.
53. Center For Medical Innovation, Seoul National University Hospital, Seoul 03080, South Korea.
54. Department of Internal Medicine, Seoul National University Hospital, Seoul 03080, South Korea.
55. Division of Genetics and Genomics, Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA.
56. Ludwig Center at Harvard, Boston, MA 02115, USA.
57. School of Medicine/School of Mathematics and Statistics, University of St Andrews, St Andrews, Fife KY16 9SS, UK.
58. Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA.
59. Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10021, USA.
60. Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10065, USA.
61. Dana-Farber Cancer Institute, Boston, MA 02215, USA.
62. The Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan.
63. RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.
64. Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain.
65. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain.
66. Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen 72074, Germany.
67. Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane 4006, Australia.
68. Institute for Molecular Bioscience, University of Queensland, St Lucia, Brisbane, QLD 4072, Australia.
69. German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
70. School of Molecular Biosciences and Center for Reproductive Biology, Washington State University, Pullman, WA 99164, USA.
71. Cancer Research Institute, Beth Israel Deaconess Medical Center, Boston, MA 02215, USA.
72. Faculty of Biosciences, Heidelberg University, Heidelberg 69120, Germany.
73. Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona 08010, Spain.
74. Ben May Department for Cancer Research, Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA.
75. Tri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10065, USA.
References
- 1. Bignell GR, et al. Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res. 2007;17:1296–1303. doi: 10.1101/gr.6522707.
- 2. Campbell PJ, et al. The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature. 2010;467:1109–1113. doi: 10.1038/nature09460.
- 3. Stephens PJ, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011;144:27–40. doi: 10.1016/j.cell.2010.11.055.
- 4. Lee JA, Carvalho CM, Lupski JR. A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell. 2007;131:1235–1247. doi: 10.1016/j.cell.2007.11.037.
- 5. Baca SC, et al. Punctuated evolution of prostate cancer genomes. Cell. 2013;153:666–677. doi: 10.1016/j.cell.2013.03.021.
- 6. Menghi F, et al. The Tandem Duplicator Phenotype Is a Prevalent Genome-Wide Cancer Configuration Driven by Distinct Gene Mutations. Cancer Cell. 2018;34:197–210.e5. doi: 10.1016/j.ccell.2018.06.008.
- 7. Liu P, et al. An Organismal CNV Mutator Phenotype Restricted to Early Human Development. Cell. 2017;168:830–842.e7. doi: 10.1016/j.cell.2017.01.037.
- 8. Pan-cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature. 2019 XXX, XXX.
- 9. Zhang C-Z, et al. Chromothripsis from DNA damage in micronuclei. Nature. 2015;522:179–84. doi: 10.1038/nature14493.
- 10. Willis NA, et al. Mechanism of tandem duplication formation in BRCA1-mutant cells. Nature. 2017;551:590–595. doi: 10.1038/nature24477.
- 11. Maciejowski J, Li Y, Bosco N, Campbell PJ, de Lange T. Chromothripsis and Kataegis Induced by Telomere Crisis. Cell. 2015;163:1641–1654. doi: 10.1016/j.cell.2015.11.054.
- 12. Ly P, et al. Chromosome segregation errors generate a diverse spectrum of simple and complex genomic rearrangements. Nat Genet. 2019;1 doi: 10.1038/s41588-019-0360-8.
- 13. Ghezraoui H, et al. Chromosomal Translocations in Human Cells Are Generated by Canonical Nonhomologous End-Joining. Mol Cell. 2014;55:829–842. doi: 10.1016/j.molcel.2014.08.002.
- 14. Rheinbay E, et al. On the discovery of somatic driver events in >2,500 whole cancer genomes. Nature. 2019 XXX, XXX.
- 15. PCAWG Transcriptome Core Group et al. Genomic basis for RNA alterations revealed by whole-genome analyses of 27 cancer types. Nature. 2019 XXX, XXX.
- 16. Akdemir KC, et al. Chromatin Folding Domains Disruptions by Somatic Genomic Rearrangements in Human Cancers. Nat Genet. 2019 doi: 10.1038/s41588-019-0564-y. XXX, XXX.
- 17. Rodriguez-Martin B, et al. Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat Genet. 2019 doi: 10.1038/s41588-019-0562-0. XXX, XXX.
- 18. Cortés-Ciriano I, et al. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing. Nat Genet. 2018 doi: 10.1038/s41588-019-0576-7. XXX, XXX.
- 19. Li Y, et al. Constitutional and somatic rearrangement of chromosome 21 in acute lymphoblastic leukaemia. Nature. 2014;508:98–102. doi: 10.1038/nature13115.
- 20. Berger MF, et al. The genomic complexity of primary human prostate cancer. Nature. 2011;470:214–220. doi: 10.1038/nature09744.
- 21. Crasta K, et al. DNA breaks and chromosome pulverization from errors in mitosis. Nature. 2012;482:53–58. doi: 10.1038/nature10802.
- 22. Rausch T, et al. Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell. 2012;148:59–71. doi: 10.1016/j.cell.2011.12.013.
- 23. Hastings PJ, Ira G, Lupski JR. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet. 2009;5 doi: 10.1371/journal.pgen.1000327.
- 24. Carvalho CMB, et al. Inverted genomic segments and complex triplication rearrangements are mediated by inverted repeats in the human genome. Nat Genet. 2011;43:1074–1081. doi: 10.1038/ng.944.
- 25. Campbell PJ, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008;40:722–9. doi: 10.1038/ng.128.
- 26. Rausch T, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–i339. doi: 10.1093/bioinformatics/bts378.
- 27. Wala JA, et al. SvABA: genome-wide detection of structural variants and indels by local assembly. Genome Res. 2018;28:581–591. doi: 10.1101/gr.221028.117.
- 28. Totoki Y, et al. Trans-ancestry mutational landscape of hepatocellular carcinoma genomes. Nat Genet. 2014;46:1267–73. doi: 10.1038/ng.3126.
- 29. Nik-Zainal S, et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature. 2016;534:47–54. doi: 10.1038/nature17676.
- 30. Supek F, Lehner B. Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature. 2015;521:81–84. doi: 10.1038/nature14173.
- 31. Schuster-Böckler B, Lehner B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature. 2012;488:504–507. doi: 10.1038/nature11273.
- 32. De S, Michor F. DNA replication timing and long-range DNA interactions predict mutational landscapes of cancer genomes. Nat Biotechnol. 2011;29:1103–1108. doi: 10.1038/nbt.2030.
- 33. Yang L, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell. 2013;153:919–929. doi: 10.1016/j.cell.2013.04.010.
- 34. Alexandrov L, et al. The Repertoire of Mutational Signatures in Human Cancer. Nature. 2019 doi: 10.1038/s41586-020-1943-3. XXX, XXX.
- 35. Lukusa T, Fryns JP. Human chromosome fragility. Biochim Biophys Acta. 2008;1779:3–16. doi: 10.1016/j.bbagrm.2007.10.005.
- 36. Popova T, et al. Ovarian cancers harboring inactivating mutations in CDK12 display a distinct genomic instability pattern characterized by large tandem duplications. Cancer Res. 2016;76:1882–1891. doi: 10.1158/0008-5472.CAN-15-2128.
- 37. Xia B, et al. Control of BRCA2 Cellular and Clinical Functions by a Nuclear Partner, PALB2. Mol Cell. 2006;22:719–729. doi: 10.1016/j.molcel.2006.05.022.
- 38. Piazza A, Wright WD, Heyer WD. Multi-invasions Are Recombination Byproducts that Induce Chromosomal Rearrangements. Cell. 2017;170:760–773.e15. doi: 10.1016/j.cell.2017.06.052.
- 39. Yu Y, et al. Dna2 nuclease deficiency results in large and complex DNA insertions at chromosomal breaks. Nature. 2018;564:287–290. doi: 10.1038/s41586-018-0769-8.