Author manuscript; available in PMC: 2013 Jul 1.
Published in final edited form as: Nat Commun. 2013;4:1502. doi: 10.1038/ncomms2502

DNA replication timing and higher-order nuclear organization determine single nucleotide substitution patterns in cancer genomes

Lin Liu 1, Subhajyoti De 2, Franziska Michor 1,*
PMCID: PMC3633418  NIHMSID: NIHMS436750  PMID: 23422670

Abstract

Single nucleotide substitutions (SNS) are a defining characteristic of cancer genomes. Many SNS in cancer genomes arise due to errors in DNA replication, which is spatio-temporally stratified. Here we propose that DNA replication patterns help shape the mutational landscapes of normal and cancer genomes. Using data on five fully sequenced cancer types and two personal genomes, we determined that the frequency of intergenic SNS is significantly higher in late DNA replication timing regions, even after controlling for a number of genomic features. Furthermore, some substitution signatures are more frequent in certain DNA replication timing zones. Finally, integrating data on higher-order nuclear organization, we found that genomic regions in close spatial proximity to late replicating domains display similar mutation spectra as the late replicating regions themselves. These data suggest that DNA replication timing together with higher-order genomic organization contribute to the patterns of SNS in normal and cancer genomes.

Introduction

Human cancer genomes exhibit complex mutational landscapes, often characterized by a large number of single nucleotide substitutions (SNS) found throughout the genome1–3. The patterns of SNS have been shown to depend on the type of cancer, the number of cell divisions leading to the initiation and progression of the tumor, as well as tissue-specific patterns of driver events in cancer4–7. Mutation rates also vary according to different genomic features such as GC content, recombination rate, CpG islands and others8–10.

Recent advances in genomic profiling methods have enabled the characterization of the spatial arrangement of genomic material within interphase nuclei11, 12. The use of such databases has enabled an unprecedented mapping of genomic regions not only relative to each other, but also with regard to different higher-order structures within individual cell types11, 13, 14. Furthermore, the temporal order of DNA replication in human cells displays marked variability across genomic regions, in that some areas are replicated early while others are replicated late during S phase15–17. To date, such data have been used to investigate evolutionary divergence between species and human nucleotide diversity, showing that late replicating regions display larger point mutation rates than early replicating regions18. It was also recently elucidated that genomic regions of similar replication timing are clustered spatially in the nucleus, that the two boundaries of somatic copy number alterations (SCNAs) in cancer genomes tend to be found in regions with the same replication timing, and that regions replicated early and late display distinct patterns of frequencies of SCNA boundaries, SCNA size and a preference for deletions over insertions19. For example, deletions are generally more frequent than amplifications in late as compared to early replication timing zones.

Recently available genome-wide sequencing data have enabled us to investigate the patterns of SNS in different temporal phases and spatial compartments during the DNA replication process. Several studies have illustrated associations between mutation frequencies and other genetic and epigenetic factors20–22. Woo et al.20 utilized information on selection and DNA replication timing to study the local variation of mutation frequencies, whereas Schuster-Böckler et al.20 and the TCGA lung cancer consortium21, 22 proposed a multivariate analysis approach to investigate epigenetic markers using data from different cell types.

Here we investigated the patterns of SNS across the genome by using replication timing data conserved across several cell lines based on data from Hansen et al23, 24 and regions not under strong selection pressure. We then comprehensively catalogued individual mutation signatures in these constant late and early replication timing zones. Finally, we utilized information on higher-order chromatin interactions between genomic material to demonstrate the coordinative effects between replication timing and nuclear architecture on the mutational landscape of cancer genomes.

RESULTS

Description of analyzed data sets

We integrated SNS data from completely sequenced genomes of five cancer types (melanoma25, 26, prostate cancer27, small cell lung cancer28, chronic lymphocytic leukemia29, and colorectal cancer30; the number of samples analyzed is shown in Table 1), two completely sequenced personal genomes31, 32, genome-wide DNA replication timing data23, 24, and data on single nucleotide differences between the human (hg18) and chimpanzee (panTro2) genomes33. All data was mapped to the human genome version hg18. Genome-wide replication timing data, obtained using a technique based on massively parallel sequencing (Repli-Seq) across different human cell types23, was used to classify genomic regions as ‘constant early’, ‘constant mid’, ‘constant late’ and ‘variable’, according to the extent of consistency of replication timing regions across the different cell types.

Table 1. Cancer types and numbers of samples used in this study.

The table displays the number of samples and citation for each cancer type analyzed.

Cancer Type | Source | Ref. | Number of Samples
Melanoma Study 1 | Pleasance et al, Nature 2010 | 25 | 1
Melanoma Study 2 | Berger et al, Nature 2012 | 26 | 25
Prostate Cancer | Berger et al, Nature 2011 | 27 | 7
Small Cell Lung Cancer | Pleasance et al, Nature 2010 | 28 | 1
Chronic Lymphocytic Leukemia | Puente et al, Nature 2011 | 29 | 4
Colorectal Cancer | Bass et al, Nat Genet. 2011 | 30 | 9

Since cancer development encompasses two intertwined processes, the acquisition of mutations and natural selection affecting the frequency of the resultant phenotypes3, we first excluded regions such as the centromere and telomere, Y chromosome, genes and promoters (±2 kb), repeat elements and ultra-conserved regions33 from the data. The remaining sequences were expected to evolve nearly neutrally and were termed Filtered Intergenic Regions (FIRs). Using FIRs only, we were also able to avoid some challenging issues of variant calling outside of these regions34. The frequency of mutations detected in these regions was referred to as Adjusted Intergenic Mutation Frequency (AIMF). We mapped SNS data for each cancer genome onto the FIRs and calculated the AIMF for both the whole genome and each chromosome individually. Our analysis revealed that the AIMF varies substantially across the five cancer types (Table 2 and Supplementary Table S1, based on ANOVA adjusted by multiple comparisons); such variation could be explained by biological differences in the cancer types and/or differences in the experimental design, sequencing technologies and variant calls. Nevertheless, similar trends were observed in the two completely sequenced personal genomes31, 32, pointing towards meaningful differences (Table 2). We then also repeated these analyses using genome-wide data instead of FIRs and obtained consistent results (Table 3).
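The AIMF definition above (number of SNS falling in FIRs divided by the total FIR length) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes the FIRs of one chromosome are available as sorted, non-overlapping, half-open intervals, and the function name `aimf` and the toy coordinates are hypothetical.

```python
from bisect import bisect_right

def aimf(fir_intervals, sns_positions):
    """Return SNS per base pair within filtered intergenic regions (FIRs).

    fir_intervals: sorted, non-overlapping half-open (start, end) tuples
                   on a single chromosome (assumed input format).
    sns_positions: iterable of 0-based SNS coordinates on that chromosome.
    """
    starts = [start for start, _ in fir_intervals]
    total_bp = sum(end - start for start, end in fir_intervals)
    if total_bp == 0:
        raise ValueError("no FIR bases to normalize by")
    hits = 0
    for pos in sns_positions:
        i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
        if i >= 0 and pos < fir_intervals[i][1]:
            hits += 1
    return hits / total_bp

# Toy data (hypothetical coordinates, not taken from the study)
toy_firs = [(100, 200), (500, 900)]
toy_sns = [150, 250, 500, 899, 900]
toy_aimf = aimf(toy_firs, toy_sns)  # 3 SNS inside 500 bp of FIR
```

Restricting the FIR list to intervals that overlap constant early or constant late zones before calling the function would give the per-category frequencies of the kind reported in Table 2.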

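Table 2. The genome-wide Adjusted Intergenic Mutation Frequency (AIMF).

Stratum | Replication timing | Melanoma Study 1 | Melanoma Study 2 | Prostate Cancer | SCLC | CLL | Colorectal Cancer | Watson | HuRef
Overall | Constant Early | 0.100 | 0.203 | 0.00874 | 0.0439 | 0.00230 | 0.0309 | 7.99 | 10.7
Overall | Constant Late | 0.184 | 0.356 | 0.0149 | 0.135 | 0.00653 | 0.0801 | 9.07 | 11.5
Core | Constant Early | 0.096 | 0.199 | 0.00886 | 0.0405 | 0.00223 | 0.0297 | 7.92 | 10.6
Core | Constant Late | 0.181 | 0.352 | 0.0135 | 0.129 | 0.00640 | 0.0801 | 8.78 | 11.3
Periphery | Constant Early | 0.124 | 0.226 | 0.00806 | 0.0630 | 0.00270 | 0.0372 | 8.41 | 11.0
Periphery | Constant Late | 0.185 | 0.357 | 0.0153 | 0.138 | 0.00658 | 0.0802 | 9.18 | 11.5
Low CpGI | Constant Early | 0.105 | 0.212 | 0.00763 | 0.0605 | 0.00352 | 0.0353 | 8.36 | 10.4
Low CpGI | Constant Late | 0.183 | 0.362 | 0.0150 | 0.140 | 0.00833 | 0.0794 | 9.02 | 11.4
High CpGI | Constant Early | 0.0998 | 0.202 | 0.00885 | 0.0423 | 0.00305 | 0.0307 | 7.96 | 10.7
High CpGI | Constant Late | 0.189 | 0.319 | 0.0143 | 0.108 | 0.00605 | 0.0586 | 9.36 | 11.8
Low RR | Constant Early | 0.105 | 0.200 | 0.00926 | 0.0447 | 0.00313 | 0.0314 | 7.15 | 9.35
Low RR | Constant Late | 0.192 | 0.377 | 0.0150 | 0.141 | 0.00836 | 0.0793 | 8.63 | 10.8
High RR | Constant Early | 0.0972 | 0.205 | 0.00839 | 0.0432 | 0.00307 | 0.0310 | 8.53 | 11.5
High RR | Constant Late | 0.171 | 0.323 | 0.0146 | 0.119 | 0.00747 | 0.0717 | 9.76 | 12.5
Low GC Percent | Constant Early | 0.0973 | 0.206 | 0.00989 | 0.0464 | 0.00348 | 0.0365 | 7.25 | 9.21
Low GC Percent | Constant Late | 0.179 | 0.362 | 0.0149 | 0.141 | 0.00830 | 0.0836 | 8.91 | 11.1
High GC Percent | Constant Early | 0.101 | 0.202 | 0.00842 | 0.0432 | 0.00298 | 0.0295 | 8.20 | 11.1
High GC Percent | Constant Late | 0.198 | 0.338 | 0.0149 | 0.119 | 0.00721 | 0.0723 | 9.51 | 12.4
Gene Poor | Constant Early | 0.0933 | 0.197 | 0.00833 | 0.0554 | 0.00357 | 0.0368 | 8.38 | 10.8
Gene Poor | Constant Late | 0.187 | 0.341 | 0.0149 | 0.139 | 0.00821 | 0.0780 | 9.15 | 11.6
Gene Rich | Constant Early | 0.102 | 0.192 | 0.00883 | 0.0415 | 0.00299 | 0.0300 | 7.91 | 10.6
Gene Rich | Constant Late | 0.166 | 0.284 | 0.0148 | 0.111 | 0.00659 | 0.0628 | 8.48 | 10.5
GPos | Constant Early | 0.0978 | 0.206 | 0.00809 | 0.0459 | 0.00320 | 0.0320 | 7.95 | 10.4
GPos | Constant Late | 0.185 | 0.359 | 0.0151 | 0.141 | 0.00819 | 0.0790 | 9.05 | 11.4
GNeg | Constant Early | 0.102 | 0.202 | 0.00902 | 0.0429 | 0.00303 | 0.0301 | 8.02 | 10.8
GNeg | Constant Late | 0.179 | 0.343 | 0.0141 | 0.119 | 0.00745 | 0.0720 | 9.17 | 11.8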

The AIMF (×10⁻⁴) is shown for five cancer types (1 sample of melanoma of study 125, 25 samples of melanoma of study 226, 7 samples of prostate cancer27, 1 sample of small cell lung cancer28, 4 samples of chronic lymphocytic leukemia29, 9 samples of colorectal cancer30) and two completely sequenced personal genomes (Watson31 and HuRef genomes32) for ‘constant early’ (purple) and ‘constant late’ replicating regions (orange); and after stratifying by six different genomic features (nuclear lamina-associated domains11, CpG islands, recombination rate, GC percentage, gene density, and chromatin status). CpG island data generated by Wu et al53 were obtained from the UCSC genome browser. GC percentage was calculated for each 1 Mb window using gc5base54. Chromatin status was derived from Giemsa-staining-based g-banding patterns50. We also used RefSeq genes in the control analyses. The recombination rate based on either the deCODE55, Marshfield56, or Genethon genetic maps57 was downloaded from UCSC genome browser.

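Table 3. The genome-wide Mutation Frequency (MF).

Stratum | Replication timing | Melanoma Study 1 | Melanoma Study 2 | Prostate Cancer | SCLC | CLL | Colorectal Cancer | Watson | HuRef
Overall | Constant Early | 0.068 | 0.197 | 0.0119 | 0.047 | 0.00187 | 0.0300 | 6.63 | 9.84
Overall | Constant Late | 0.176 | 0.352 | 0.0172 | 0.139 | 0.00571 | 0.0746 | 8.27 | 12.1
Core | Constant Early | 0.064 | 0.194 | 0.0119 | 0.045 | 0.00185 | 0.0295 | 6.56 | 9.72
Core | Constant Late | 0.173 | 0.343 | 0.0167 | 0.124 | 0.00616 | 0.0752 | 8.05 | 11.8
Periphery | Constant Early | 0.104 | 0.226 | 0.0121 | 0.065 | 0.00208 | 0.0351 | 7.25 | 10.9
Periphery | Constant Late | 0.177 | 0.356 | 0.0173 | 0.143 | 0.00555 | 0.0744 | 8.34 | 12.3
Low CpGI | Constant Early | 0.079 | 0.216 | 0.0112 | 0.054 | 0.00228 | 0.0352 | 7.03 | 10.1
Low CpGI | Constant Late | 0.178 | 0.362 | 0.0172 | 0.145 | 0.00596 | 0.0772 | 8.23 | 12.0
High CpGI | Constant Early | 0.068 | 0.204 | 0.0120 | 0.047 | 0.00183 | 0.0296 | 6.59 | 9.80
High CpGI | Constant Late | 0.164 | 0.318 | 0.0169 | 0.105 | 0.00429 | 0.0603 | 8.45 | 12.1
Low RR | Constant Early | 0.067 | 0.216 | 0.0122 | 0.047 | 0.00171 | 0.0303 | 5.91 | 8.65
Low RR | Constant Late | 0.184 | 0.358 | 0.0173 | 0.149 | 0.00630 | 0.0775 | 7.86 | 11.5
High RR | Constant Early | 0.069 | 0.195 | 0.0117 | 0.048 | 0.00199 | 0.0298 | 7.17 | 10.7
High RR | Constant Late | 0.161 | 0.323 | 0.0165 | 0.123 | 0.00489 | 0.0685 | 8.92 | 13.0
Low GC Percent | Constant Early | 0.063 | 0.208 | 0.0112 | 0.051 | 0.00167 | 0.0313 | 5.77 | 8.20
Low GC Percent | Constant Late | 0.178 | 0.360 | 0.0175 | 0.147 | 0.00601 | 0.0778 | 8.16 | 11.9
High GC Percent | Constant Early | 0.069 | 0.196 | 0.0120 | 0.047 | 0.00189 | 0.0299 | 6.71 | 10.0
High GC Percent | Constant Late | 0.165 | 0.316 | 0.0154 | 0.114 | 0.00428 | 0.0592 | 8.78 | 12.8
Gene Poor | Constant Early | 0.073 | 0.204 | 0.0115 | 0.051 | 0.00220 | 0.0328 | 7.27 | 9.74
Gene Poor | Constant Late | 0.179 | 0.357 | 0.0172 | 0.142 | 0.00593 | 0.0766 | 8.37 | 11.1
Gene Rich | Constant Early | 0.068 | 0.196 | 0.0120 | 0.047 | 0.00182 | 0.0296 | 6.52 | 10.4
Gene Rich | Constant Late | 0.154 | 0.322 | 0.0168 | 0.111 | 0.00427 | 0.0617 | 7.59 | 12.3
GPos | Constant Early | 0.066 | 0.201 | 0.0117 | 0.049 | 0.00201 | 0.0308 | 6.52 | 9.64
GPos | Constant Late | 0.177 | 0.354 | 0.0171 | 0.143 | 0.00596 | 0.0757 | 8.25 | 12.2
GNeg | Constant Early | 0.070 | 0.195 | 0.0120 | 0.046 | 0.00181 | 0.0296 | 6.28 | 9.94
GNeg | Constant Late | 0.164 | 0.337 | 0.0163 | 0.126 | 0.00504 | 0.0676 | 8.21 | 12.0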

The MF (×10⁻⁴) is shown for five cancer types (1 sample of melanoma of study 125, 25 samples of melanoma of study 226, 7 samples of prostate cancer27, 1 sample of small cell lung cancer28, 4 samples of chronic lymphocytic leukemia29, 9 samples of colorectal cancer30) and two completely sequenced personal genomes (Watson31 and HuRef genomes32) for ‘constant early’ (purple) and ‘constant late’ replicating regions (orange); and after stratifying by six different genomic features (nuclear lamina-associated domains11, CpG islands, recombination rate, GC percentage, gene density, and chromatin status). CpG island data generated by Wu et al53 were obtained from the UCSC genome browser. GC percentage was calculated for each 1 Mb window using gc5base54. Chromatin status was derived from Giemsa-staining-based g-banding patterns50. We also used RefSeq genes in the control analyses. The recombination rate based on either the deCODE55, Marshfield56, or Genethon genetic maps57 was downloaded from UCSC genome browser. RR – recombination rate. CpGI – CpG island. GPos – Giemsa positive. GNeg – Giemsa negative.

Mutation frequencies depend on replication timing

We first sought to investigate the effects of DNA replication timing on the patterns of SNS frequency in cancer genomes. We utilized only constant late and constant early replication timing zones23 in order to exclude tissue specificity as a confounding factor. The constant mid category represented a much smaller part of the human genome and was thus discarded. We first analyzed the melanoma genomes25. We observed that the mutation frequency in the FIRs was closely linked to DNA replication timing: FIRs with constant late replication timing displayed a significantly higher AIMF compared to those with constant early replication timing (Mann-Whitney U-Test p-value = 2.075 × 10⁻⁷). This effect was consistent across all 23 chromosomes (chr1-22 and chrX). We did not identify a significant trend when investigating the 23 chromosomes individually (Figure 1 and Supplementary Figures S1–S13). We then repeated our analysis for the other four cancer types (prostate cancer, small cell lung cancer, chronic lymphocytic leukemia, and colorectal cancer) and two personal genomes (Watson31 and HuRef32 genomes, analyzed separately) and obtained similar results (Figure 1). Using a permutation test based on randomly permuting the number of mutations in the adjusted intergenic regions (Supplementary Figures S14–S15), we recalculated the permuted AIMFs and compared them to the observed patterns, obtaining a permutation p-value < 0.001 for all cancer types. To investigate the confounding effects of different genomic features, we then adjusted for a variety of potential confounders such as gene density, GC percentage, recombination rate, CpG islands, chromatin states9 and nuclear lamina-associated regions11 (Supplementary Figures S1–S13 and Table 2). The observed patterns of SNS with regard to replication timing were consistent in different groups categorized by these genomic features for all analyzed genomes. This observation suggests that our findings are unlikely to be biased by these genomic features and the internal biological variation among cancer types.
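The logic of comparing an observed early-versus-late difference against a permutation null can be illustrated with a short sketch. Note that this is a simplified label-permutation test on per-chromosome AIMF values, not the procedure used in the study (which permuted mutation counts within the adjusted intergenic regions); the function name, the default number of permutations and the toy values are all assumptions made for illustration.

```python
import random

def permutation_pvalue(early, late, n_perm=2000, seed=0):
    """One-sided label-permutation test for mean(late) > mean(early).

    early, late: per-chromosome AIMF values (hypothetical inputs).
    Returns an estimated p-value with an add-one correction.
    """
    rng = random.Random(seed)
    observed = sum(late) / len(late) - sum(early) / len(early)
    pooled = list(early) + list(late)
    n_late = len(late)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign early/late labels at random
        perm_late = pooled[:n_late]
        perm_early = pooled[n_late:]
        diff = sum(perm_late) / n_late - sum(perm_early) / len(perm_early)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive
    return (at_least_as_extreme + 1) / (n_perm + 1)

# Toy per-chromosome AIMF values (hypothetical, not from the study)
toy_early = [0.09, 0.10, 0.11, 0.10, 0.12, 0.08, 0.10, 0.11]
toy_late = [0.17, 0.19, 0.18, 0.20, 0.16, 0.18, 0.19, 0.17]
toy_p = permutation_pvalue(toy_early, toy_late)
```

With completely separated groups, as in the toy data, very few random relabelings reproduce the observed difference, so the estimated p-value is small; with 2000 permutations the smallest attainable value is 1/2001.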

Figure 1. Effects of DNA replication timing on mutation rates.


The figure shows the Adjusted Intergenic Mutation Frequency (AIMF) for regions residing within constant early (purple) and constant late DNA replication timing zones (orange) for completely sequenced genomes of five cancer types and two personal genomes: (A) melanoma of study 1 (1 sample)25, (B) melanoma of study 2 (25 samples in total)26, (C) prostate cancer (7 samples in total)27, (D) small cell lung cancer (1 sample)28, (E) chronic lymphocytic leukemia (4 samples in total)29, (F) colorectal cancer (9 samples in total)30, (G) Watson31 and (H) HuRef genomes32. The AIMF represents the number of single nucleotide substitutions observed per base pair in the Filtered Intergenic Regions (FIR), which overlap with constant early and constant late DNA replication timing zones, respectively. The horizontal axes display the results for chr1 – chr22 and chrX.

We then repeated our analyses using genome-wide mutation frequencies in constant late and constant early replication timing regions (Table 3). In general, the results were robust. Surprisingly, in prostate cancer and small cell lung cancer, the genome-wide mutation frequencies were higher than the AIMF (Table 2 and Table 3); these findings might arise due to an excess of mutations in repeat elements in these two cancer types, which could be due to mapping issues, different criteria used for variant calls, or diverse biological mechanisms of tumorigenesis. After adjusting for several genomic features, we again obtained results consistent with previous studies showing that genomic regions that (i) have a high gene density, (ii) reside in euchromatin regions, or (iii) have a high CpG content display lower mutation rates. When analyzing adjusted intergenic regions instead of the whole genome, however, some of these associations were not observed: for instance, we observed a relatively higher AIMF in melanoma samples as well as the Watson and HuRef genomes in regions with higher CpG density compared to lower CpG density. One possible reason for this observation is that SNS in FIRs might not be strongly affected by the active elements around the regions. Alternatively, this trend might also be due to sequencing or mapping issues in repeat elements. We also calculated the SNS frequencies in genes only: the SNS frequency in genes was much lower than the AIMF (chi-squared p-value < 0.0001) and constant late replication timing regions had larger SNS frequencies in genes (Supplementary Table S2). To account for the potential inconsistencies of replication timing across cell lines, we used six alternative replication timing datasets24, 35, 36 from the Replication Domain database to confirm our findings (Supplementary Figure S16).

Recent evidence suggests that DNA replication timing may be coordinated across megabase-scale domains in metazoan genomes, and that early and late replication initiation occurs in spatio-temporally separate nuclear compartments13, 14,19. Thus, it is possible that DNA replication timing domains within a larger genomic region (e.g. 1 Mb) might affect the SNS frequency. For instance, overall, constant late regions could reside in regions that are either predominantly replicated late or not, and vice versa. To address this issue, we segmented the human genome into 1 Mb non-overlapping windows and dichotomized these windows into those with a large versus small proportion of late-replicating domains based on the prevalence of late replicating base pairs within them. Using different cutoffs to categorize these 1 Mb windows, we found that in the stratum with a large proportion of late RT material, the SNS frequencies are higher than in the stratum with a small proportion of late RT material (Chi-squared p-value < 0.001 in all cases), but the differences of mutation frequencies between specific early and late replication timing regions hold in both strata (Supplementary Figure S17). This observation was also consistent across the five cancer types. Therefore, the prevalence of late replication timing zones on a larger scale is unlikely to affect our observations. Interestingly, although it has been reported that the transition regions between late and early replication timing zones are less stable than other parts of the genome37, we did not observe significant differences in terms of mutation rates between regions at the center versus at the boundary of individual replication timing zones based on the constant late and early replication timing data (Supplementary Figure S18).
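The windowing step described above can be sketched as follows: the chromosome is cut into non-overlapping 1 Mb windows, the fraction of late-replicating base pairs is computed per window, and windows are split by a cutoff. This is an illustrative sketch only; the cutoff of 0.5, the stratum labels and the function names are assumptions (the study reports using several different cutoffs), and the intervals are assumed to be non-overlapping, half-open and within the chromosome.

```python
WINDOW = 1_000_000  # 1 Mb, as in the text

def late_fraction_per_window(late_intervals, chrom_length, window=WINDOW):
    """Fraction of each non-overlapping window covered by late-replicating DNA.

    late_intervals: non-overlapping half-open (start, end) tuples lying
                    within [0, chrom_length) (assumed input format).
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in late_intervals:
        first = start // window
        last = (end - 1) // window
        for w in range(first, last + 1):
            w_start = w * window
            w_end = min(w_start + window, chrom_length)
            covered[w] += max(0, min(end, w_end) - max(start, w_start))
    # The final window may be shorter than the nominal window size
    sizes = [min((w + 1) * window, chrom_length) - w * window
             for w in range(n_windows)]
    return [c / s for c, s in zip(covered, sizes)]

def dichotomize(fractions, cutoff=0.5):
    """Label each window by whether its late fraction reaches the cutoff."""
    return ["late-rich" if f >= cutoff else "late-poor" for f in fractions]

# Toy chromosome of 3 Mb with two late-replicating intervals (hypothetical)
toy_fractions = late_fraction_per_window(
    [(0, 250_000), (900_000, 1_800_000)], 3_000_000)
toy_strata = dichotomize(toy_fractions)
```

Early-versus-late mutation frequencies can then be compared separately within the two strata, which is the comparison summarized in the paragraph above.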

Different temporal phases of DNA replication have been reported to associate with the existence of DNA secondary structures38, common fragile sites39 and sometimes cis-regulatory elements40. To examine whether these factors could confound the different mutation frequencies in early and late replication timing zones, stratification analyses were performed based on these factors (Supplementary Figure S19). The preference of SNS for constant late over constant early DNA replication timing was not masked by these factors, demonstrating the robustness of our observation in addition to the other control analyses. In addition, we focused on intergenic mutations, whose function is difficult to infer computationally or verify experimentally41. However, some portion of the intergenic regions can potentially be transcribed42; for instance, noncoding RNAs, especially large intergenic noncoding RNAs (lincRNAs), may be one missing piece in unraveling the complexity of the cancer genome41. A recent study has catalogued all known lincRNAs with the most thorough annotation to date43. Since the adjusted intergenic mutations included in our study are far away from protein coding genes (median distance to the closest transcription start site: 400 kb), it is possible that these mutations act on those lincRNAs. We observed that the SNS did not display any global preference towards residing within FIR regions overlapping with lincRNAs (Supplementary Table S3). However, since we cannot rule out that mutations altering lincRNAs are more frequent in cancer genomes and the effects of variation in lincRNAs may be subtle compared with variation in protein-coding genes, more work is required to delineate these effects.

Mutation signatures depend on replication timing

When investigating the different types of SNS in cancer FIRs, we observed that the patterns depended on whether FIRs were located in constant early versus constant late replication timing zones. We considered six types of SNS signatures for each nucleotide in the genome: A→C: T→G, A→G: T→C, A→T: T→A, C→A: G→T, C→T: G→A, and C→G: G→C. The proportions of these six types of substitutions were calculated for the constant late and constant early replication timing FIRs (Figure 2). The overall patterns were significantly different between constant early and constant late replication timing (Chi-squared test, p-values < 0.01 in all cases, Figure 2). Similar differences of substitution patterns between early and late replication timing zones were obtained after controlling for the effects of gene density, GC percentage, chromatin state, CpG islands, recombination rate, and nuclear lamina-associated regions (Supplementary Figures S20–S31). Interestingly, we also obtained a similar trend using the single nucleotide polymorphism data from the two completely sequenced personal genomes (Figure 2 and Supplementary Figures S20–S31). The mutation signatures within genes and promoters were also investigated (Supplementary Figure S32) to allow a comparison between genic and intergenic regions. We found similar patterns in genes and FIRs in terms of mutation signatures (Supplementary Figure S32).
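The six substitution classes arise from collapsing each substitution with its reverse complement, and the proportions are computed separately for A/T-reference and C/G-reference substitutions (the normalization stated in the Figure 2 legend). The sketch below illustrates this bookkeeping; the function names, the ASCII class labels such as `A>T:T>A` and the toy calls are illustrative assumptions, not part of the study's code.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_class(ref, alt):
    """Collapse a substitution and its reverse complement into one of six classes."""
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError(f"not a single nucleotide substitution: {ref}>{alt}")
    if ref in ("G", "T"):
        # Express the change on the strand whose reference base is A or C
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}:{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"

def class_proportions(substitutions):
    """Proportion of each class among substitutions with the same reference group.

    substitutions: iterable of (ref, alt) pairs relative to the reference genome.
    Classes starting with 'A' sum to 1, as do classes starting with 'C'.
    """
    counts = Counter(substitution_class(r, a) for r, a in substitutions)
    totals = Counter()
    for cls, n in counts.items():
        totals[cls[0]] += n  # cls[0] is "A" or "C" after collapsing
    return {cls: n / totals[cls[0]] for cls, n in counts.items()}

# Toy calls (hypothetical): 4 A/T-reference and 4 C/G-reference substitutions
toy_calls = [("A", "T"), ("T", "A"), ("A", "G"), ("T", "G"),
             ("C", "T"), ("G", "A"), ("G", "A"), ("C", "A")]
toy_props = class_proportions(toy_calls)
```

Computing these proportions once for constant early and once for constant late FIRs yields the two spectra that are compared with the Chi-squared test.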

Figure 2. Relationship between DNA replication timing and substitution patterns.


The figure shows the proportions of different types of single nucleotide substitutions in the constant early (purple) and constant late (orange) DNA replication timing zones for completely sequenced genomes of five cancer types and two personal genomes: (A) melanoma of study 125, (B) melanoma of study 226, (C) prostate cancer27, (D) small cell lung cancer28, (E) chronic lymphocytic leukemia29, (F) colorectal cancer30, (G) Watson31 and (H) HuRef genomes32. The proportions were calculated based on the hg18 reference allele so that Prob(A→C: T→G) + Prob(A→G: T→C) + Prob(A→T: T→A) = 100%, and Prob(C→A: G→T) + Prob(C→T: G→A) + Prob(C→G: G→C) = 100% for each of the constant late and constant early categories. Note that A→T: T→A is a signature commonly higher in late replication timing in all cancer types. Using the Chi-squared test and correcting for multiple hypothesis testing by false discovery rate, (B), (C), (G) and (H) are significantly different with adjusted p-values less than 0.01.

Comparing the data across cancer types, we observed some common patterns: some signatures were more prevalent in the constant late regions, whereas others were preferentially located in constant early regions. For instance, A→T: T→A transversions occurred most often in the constant late replication timing regions in all five cancer types. Out of the five cancer types and the two personal genomes studied, the differences of the proportion of A→T: T→A in early and late replication timing regions were significant in prostate cancer samples, melanoma samples from study 2, and Watson and HuRef genomes (adjusted p-values < 0.01 after multiple testing correction). Overall, the higher proportion of A→T: T→A in late replication timing zones was observed in 38 out of all the 47 samples analyzed in our study (Supplementary Figures S33–S40). In contrast, the frequencies of mutations and the relative proportions of the six types of substitution signatures differed among the five cancer genomes and two personal genomes; for example, the most frequent type of substitution in melanoma was the C→T transition25. In general, the consistency in the relative proportions of substitution signatures in constant early versus constant late replication timing regions might indicate common mutagenic mechanisms in different temporal phases of DNA replication.

Mutation frequencies and higher-order nuclear organization

The spatio-temporal segregation of DNA replication timing leads to the formation of DNA replication factories in which DNA synthesis takes place on multiple strands simultaneously13, 14. We therefore aimed to test the hypothesis that those regions brought in close spatial proximity by the proposed fractal organization of the genome12 display similar mutation frequencies. To address this question, we divided the whole genome into 100 Kb non-overlapping windows and obtained Hi-C-based long-range interaction data from the GM06990 and K562 cell lines from Lieberman-Aiden et al12 to measure the spatial proximity between two individual windows. We excluded any two loci that were closer than 20 Kb from each other on linear DNA. We then stratified all pairs of windows according to the number of Hi-C reads between them and investigated those windows close to but outside of the constant late DNA replication timing zones. Those regions that overlapped with FIRs were referred to as ‘transition to late’ regions; these are the regions that do not reside in constant late replication timing zones but are linked to a constant late region by at least one Hi-C read. Compared with the AIMF in constant late and constant early DNA replication timing zones, we found that the AIMF in the ‘transition to late’ regions was much closer to, yet still smaller than, that in constant late DNA replication timing zones. Interestingly, the AIMF was positively associated with the interaction counts (linear regression p-value < 0.01 for each cancer type, Figure 3). Furthermore, in most cases, the AIMF in these regions was higher than the genome-wide AIMF (Figure 3). These observations were consistent across the Hi-C data from the GM06990 and K562 cell lines and the Hi-C data for the GM06990 cell line generated using different restriction enzymes (HindIII and NcoI) (Supplementary Figures S41–S43).
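The stratification and trend analysis described above can be sketched in two steps: pool ‘transition to late’ windows by their number of Hi-C reads to constant late regions and compute the AIMF per stratum, then fit a simple least-squares line of AIMF against the read count. This is a minimal sketch under stated assumptions: the input tuple layout, the function names and the toy numbers are hypothetical, and the significance test on the slope reported in the study is not reproduced here.

```python
from collections import defaultdict

def aimf_by_hic_count(windows):
    """AIMF pooled over windows that share the same Hi-C read count.

    windows: iterable of (hic_reads_to_late, n_sns, n_fir_bp) for windows
             outside constant late zones (assumed input layout).
    Windows with no Hi-C read to a constant late region are skipped.
    """
    sns = defaultdict(int)
    bp = defaultdict(int)
    for reads, n_sns, n_bp in windows:
        if reads >= 1:
            sns[reads] += n_sns
            bp[reads] += n_bp
    return {k: sns[k] / bp[k] for k in sorted(sns) if bp[k] > 0}

def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = slope * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy windows (hypothetical counts, not from the study)
toy_windows = [(1, 2, 1000), (1, 0, 1000), (2, 3, 1000),
               (3, 4, 1000), (0, 9, 1000)]
toy_strata_aimf = aimf_by_hic_count(toy_windows)
toy_slope, toy_intercept = least_squares(
    list(toy_strata_aimf), list(toy_strata_aimf.values()))
```

A positive slope corresponds to the positive association between AIMF and interaction counts shown by the fitted line in Figure 3.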

Figure 3. Higher-order nuclear architecture is associated with mutation frequencies.


The figure shows the Adjusted Intergenic Mutation Frequency (AIMF) in the ‘transition-to-late’ regions defined by different numbers of Hi-C interaction counts from the GM06990 cell line between regions inside and outside the constant late DNA replication timing zones for (A) the melanoma sample of study 125, (B) melanoma samples of study 226, (C) prostate cancer samples27, (D) the small cell lung cancer sample28, (E) chronic lymphocytic leukemia samples29, and (F) colorectal cancer samples30. Statistical significance was evaluated using simple linear regression, and p-values were obtained. All p-values were less than 0.01. The green bar shows the genome-wide AIMF, the orange bar the AIMF in constant late DNA replication timing FIR, and the purple bar the AIMF in constant early DNA replication timing FIR. The blue dashed line, i.e. the fitted linear model, shows the positive association between the AIMF and the Hi-C counts that was used to stratify the regions. Due to the small mutation number in the chronic lymphocytic leukemia genome, we only used 2–8 Hi-C counts in panel E. The x-axes display the groups of regions stratified by the number of Hi-C interactions with constant late replication timing regions.

We also examined whether the different proportions of DNA replication timing (including constant early, constant mid, constant late, and variable) in the transition zones confounded our results. To address this issue, we performed the following analysis: the FIRs were divided into four groups – (i) constant late regions linked with Hi-C reads to constant late regions, (ii) constant late regions linked with Hi-C reads to constant early regions, (iii) constant early regions linked with Hi-C reads to constant late regions, and (iv) constant early regions linked with Hi-C reads to constant early regions. We found that group (i) had the highest mutation frequency while group (iv) had the lowest. Moreover, the mutation frequency of group (ii) was closer to, but still lower than that of group (i), and a similar trend was observed between groups (iii) and (iv) (Figure 4). Interestingly, all pairwise comparisons were significantly different (Mann-Whitney U-test, FDR-adjusted p-value < 0.03 in all cases). Taken together, we found that those regions close to late DNA replication timing zones had similar, though lower, mutation frequencies, suggesting a potential role for higher-order chromatin organization on the mutagenic mechanisms during DNA replication.

Figure 4. Effects of transition regions on mutation frequencies.


The figure shows the Adjusted Intergenic Mutation Frequency (AIMF) for (A) the melanoma sample of study 125, (B) melanoma samples of study 226, (C) prostate cancer samples27, (D) the small cell lung cancer sample28, (E) chronic lymphocytic leukemia samples29, and (F) colorectal cancer samples30, in four groups of adjusted intergenic regions: constant late replication timing regions linked with constant late replication timing regions by Hi-C interactions (purple); constant late replication timing regions linked with constant early replication timing regions by Hi-C interactions (green); constant early replication timing regions linked with constant late replication timing regions by Hi-C interactions (gold); and constant early replication timing regions linked with constant early replication timing regions by Hi-C interactions (red). The x-axes display the groups of paired regions stratified by the number of Hi-C reads (2 – 10). All pairwise comparisons were significantly different from each other (Mann-Whitney U-test, false discovery rate-adjusted p-values < 0.03 in all cases).

Evolutionary and cancer mutations share genomic locations

We then sought to compare the regions prone to accumulating adjusted intergenic SNS in cancer genomes versus mutations arising on evolutionary time scales. To this end, we obtained data on differences between the human hg18 and chimpanzee panTro2 genomes from the UCSC genome browser33, using a similar approach as in Stamatoyannopoulos et al18, 44, and compared the number of such changes with the number of SNS in each cancer type in 1 Mb non-overlapping windows. The five cancer types had very different regions that overlapped with those regions harboring human-chimpanzee SNS (Supplementary Figure S44). After collapsing the windows with SNS in each of the five cancer types together, we identified 1,039 such windows with at least one SNS in any of the five cancer types in early DNA replication timing zones. We then fixed the number of windows with cancer mutations, and selected the same number of windows with the highest number of human-chimpanzee SNS. Out of these 1,039 windows, 775 were also present among the human-chimpanzee SNS windows. We then performed similar analyses in late DNA replication timing zones, and found that, out of 1,240 windows, 1,208 overlapped in cancer and human-chimpanzee SNS (Supplementary Figure S45). Although the overlap between regions with cancer SNS and the regions with the top human-chimpanzee single nucleotide substitutions varied across different cancer types, after pooling them together, the overlap became larger. Therefore, we concluded that at the scale of 1 Mb, most regions harboring human-chimpanzee SNS were also regions harboring SNS in any one of the five cancer types. This finding suggests some common mechanisms between human-chimpanzee evolutionary divergence and cancer mutagenesis, with no obvious differences in early versus late DNA replication timing zones.
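The window-overlap comparison above amounts to taking the k windows that carry at least one cancer SNS, selecting the k windows with the most human-chimpanzee differences, and counting how many windows appear in both sets. The following is a minimal sketch of that comparison; the function name, the dictionary input format, the tie-breaking by window identifier and the toy counts are assumptions for illustration.

```python
def top_window_overlap(cancer_sns, divergence_sns):
    """Overlap between cancer-mutated windows and the top divergent windows.

    cancer_sns, divergence_sns: dicts mapping a window identifier to a count
    (assumed input format). Returns (number of shared windows, k), where k
    is the number of windows with at least one cancer SNS.
    """
    cancer_windows = {w for w, n in cancer_sns.items() if n >= 1}
    k = len(cancer_windows)
    # Rank windows by divergence count, breaking ties by identifier
    ranked = sorted(divergence_sns, key=lambda w: (-divergence_sns[w], w))
    top_divergent = set(ranked[:k])
    return len(cancer_windows & top_divergent), k

# Toy counts per 1 Mb window (hypothetical, not from the study)
toy_cancer = {"w1": 3, "w2": 0, "w3": 1, "w4": 2}
toy_divergence = {"w1": 50, "w2": 40, "w3": 10, "w4": 45, "w5": 5}
toy_shared, toy_k = top_window_overlap(toy_cancer, toy_divergence)

# Overlap fractions implied by the counts reported in the text
early_fraction = 775 / 1039   # roughly 0.75
late_fraction = 1208 / 1240   # roughly 0.97
```

Applied to the reported counts, the shared fraction is about 75% in early and about 97% in late replication timing zones.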

DISCUSSION

In this paper, we have demonstrated that mutational landscapes of cancer genomes differ between early and late DNA replication timing zones, with higher mutation frequencies in late replication timing regions. We identified different patterns of mutation signatures across these zones; for example, AT|TA mutation signatures commonly appeared in most cancer samples investigated. This finding implies that some mutagenic and repair mechanisms might depend on the DNA replication timing of genomic material. The differences in mutation frequencies and signatures between early and late replication timing also hold after controlling for several genomic features such as GC percentage, CpG density, recombination rate, chromatin accessibility, gene density, and lamina-associated domains. Also, the transition to late regions defined based on Hi-C interactions, although not located in constant late replication timing regions, have higher mutation frequencies than the overall AIMFs. Taken together, we conclude that (i) DNA replication timing is a robust genomic feature affecting SNS frequencies in both cancer and personal genomes, after controlling for many variables such as GC percentage, gene density, recombination rate, higher-order DNA replication timing domains, CTCF binding sites, secondary structures and lincRNAs; (ii) SNS display specific patterns in early versus late DNA replication timing regions; and (iii) higher-order nuclear organization, together with DNA replication timing, affects the mutation frequencies. Furthermore, we found that in general, genome-wide mutation frequencies were lower than AIMFs. The exceptions in prostate cancer and small cell lung cancer could be due to an excess of mutations in repeat elements observed in our analysis, since the majority of the regions excluded from the genome to determine the AIMF were genes, promoters and repeat elements. The overall higher genome-wide mutation frequency in late replication timing regions also holds after controlling for several genomic features.

The higher SNS frequencies in late DNA replication timing zones in cancer genomes could partly arise from the accumulation of single-stranded DNA, given similar observations in our analyses and others18 and given that a certain fraction of regions harboring mutations overlapped between cancer and personal genomes (Supplementary Figure S45). DNA repair processes can often correct the errors arising during replication45, and it has been suggested that both DNA replication timing and the efficiency of DNA repair are related to higher-order chromatin structure45, 46. Our findings suggest that some portions of the genome have mutation frequencies similar to those of their counterparts residing closely within the 3D structure of the nucleus. Chromatin organization and replication timing are intertwined, and could be a driving force of carcinogenesis by disrupting specific processes such as replication initiation and replication fork progression46. However, since most mutations analyzed reside in non-coding parts of the genome, these patterns might only have indirect applicability to an understanding of the origins of cancer. Our study represents a novel approach to studying replication-related SNS in cancer genomes together with the higher-order nuclear organization. This approach can lead to a better understanding of the mutational landscape of cancer genomes from the perspective of replication, epigenetics and chromatin structure.

METHODS

Datasets and analyses

Cancer types and sample numbers analyzed are listed in Table 1. All analyses were performed using human genome version hg18 as the reference genome. To obtain the Filtered Intergenic Regions (FIRs), we employed an approach similar to that used in two other studies18, 47. We removed all Refseq genes and promoters (up to 2 kb upstream of a gene), ultra-conserved elements with a conservation score greater than 300, and also intronic sequences, which are related to transcription-coupled DNA repair. We also excluded repeat elements, centromeres and telomeres to minimize variant calling complexity in these regions48, as well as the Y chromosome. All of these data were downloaded from the UCSC genome browser for the NCBI36/hg18 human genome49. The remaining genomic regions were termed Filtered Intergenic Regions (FIRs). The total length of FIRs was approximately 780 Mb. We then overlaid the DNA replication timing data obtained from Hansen et al23 onto the FIRs and found that 79.23 Mb and 169.50 Mb of the FIRs resided within replicating regions that were consistently early or late, respectively, across multiple cell types. The human GC percentage, CpG island and recombination rate data were also obtained from the UCSC genome browser. Since highly compact heterochromatin stains with Giemsa, whereas euchromatin is often unstained, we were able to characterize euchromatin and heterochromatin states globally across different cell types using Giemsa staining data50. Data on nuclear lamina-associated domains from Guelen et al11 were obtained from the NCBI GEO database, accession code GSE8854. Genomic regions harboring nuclear lamina-associated domains are referred to as the nuclear periphery, whereas the remaining regions are referred to as the nuclear core. While analyzing the effects of lamina-associated domains on the mutation patterns, we used a bootstrap sampling approach (Supplementary Figure S13) to take into account the variability of nuclear topology across different cell types.
The Hi-C data for the GM06990 and K562 cell lines were obtained from Lieberman-Aiden et al12 through the GEO database. Moreover, data on genome-wide common fragile sites were obtained from Durkin and Glover39. The G-quadruplex and CTCF-binding site locations were obtained from Quadruplex.org51 and CTCFBSDB52, respectively. The large intergenic noncoding RNA catalog can be obtained from http://www.broadinstitute.org/genome_bio/human_lincrnas (ref. 43). All statistical calculations were performed using the open source R software. When necessary, “liftOver” software was used to map data from other human genome versions to hg18.
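The filtering and overlay steps described in this section amount to interval arithmetic: excluded annotations are subtracted from the genome, and the remainder is intersected with replication timing domains. The following is a minimal sketch under simplifying assumptions, not the authors' pipeline (which was implemented in R): it handles a single chromosome, uses half-open (start, end) coordinates, and all function names and toy coordinates are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(region, excluded):
    """Return the parts of `region` not covered by any excluded interval."""
    start, end = region
    kept = []
    cursor = start
    for ex_start, ex_end in merge(excluded):
        if ex_end <= cursor or ex_start >= end:
            continue  # no overlap with what remains of the region
        if ex_start > cursor:
            kept.append((cursor, ex_start))
        cursor = max(cursor, ex_end)
    if cursor < end:
        kept.append((cursor, end))
    return kept


def overlap_length(a, b):
    """Total number of bases shared by two interval lists."""
    total = 0
    for a_start, a_end in merge(a):
        for b_start, b_end in merge(b):
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Toy chromosome of 1,000 bases (hypothetical coordinates).
region = (0, 1000)
# Stand-ins for genes, promoters, repeats and other excluded annotations.
excluded = [(100, 300), (250, 400), (900, 1200)]
firs = subtract(region, excluded)
# Stand-in for constant late replication timing domains.
late = [(50, 500), (800, 950)]
print(firs, overlap_length(firs, late))  # prints "[(0, 100), (400, 900)] 250"
```

Summing the overlap across chromosomes for the constant early and constant late domains would give totals analogous to the 79.23 Mb and 169.50 Mb reported above.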

Supplementary Material

1
2

Acknowledgments

The authors gratefully acknowledge support from the Dana-Farber Cancer Institute Physical Sciences-Oncology Center (NCI U54CA143798) as well as feedback and advice from the Michor lab (michorlab.dfci.harvard.edu).

Footnotes

Author contributions

L.L., S.D., and F.M. conceived the experiments and wrote the paper. L.L. performed the analyses.

Competing interests

The authors declare no competing interests.

References

  • 1. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–674. doi: 10.1016/j.cell.2011.02.013.
  • 2. Bignell GR, et al. Signatures of mutation and selection in the cancer genome. Nature. 2010;463:893–898. doi: 10.1038/nature08768.
  • 3. Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009;458:719–724. doi: 10.1038/nature07943.
  • 4. Podlaha O, Riester M, De S, Michor F. Evolution of the cancer genome. Trends Genet. 2012;28:155–163. doi: 10.1016/j.tig.2012.01.003.
  • 5. Lengauer C, Kinzler KW, Vogelstein B. Genetic instabilities in human cancers. Nature. 1998;396:643–649. doi: 10.1038/25292.
  • 6. Heng HH, et al. Evolutionary mechanisms and diversity in cancer. Adv Cancer Res. 2011;112:217–253. doi: 10.1016/B978-0-12-387688-1.00008-9.
  • 7. Greenman C, et al. Patterns of somatic mutation in human cancer genomes. Nature. 2007;446:153–158. doi: 10.1038/nature05610.
  • 8. Surralles J, Ramirez MJ, Marcos R, Natarajan AT, Mullenders LH. Clusters of transcription-coupled repair in the human genome. Proc Natl Acad Sci U S A. 2002;99:10571–10574. doi: 10.1073/pnas.162278199.
  • 9. Holmquist GP. Chromosome bands, their chromatin flavors, and their functional features. Am J Hum Genet. 1992;51:17–37.
  • 10. Hodgkinson A, Chen Y, Eyre-Walker A. The large-scale distribution of somatic mutations in cancer genomes. Hum Mutat. 2012;33:136–143. doi: 10.1002/humu.21616.
  • 11. Guelen L, et al. Domain organization of human chromosomes revealed by mapping of nuclear lamina interactions. Nature. 2008;453:948–951. doi: 10.1038/nature06947.
  • 12. Lieberman-Aiden E, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369.
  • 13. Yaffe E, et al. Comparative analysis of DNA replication timing reveals conserved large-scale chromosomal architecture. PLoS Genet. 2010;6:e1001011. doi: 10.1371/journal.pgen.1001011.
  • 14. Ryba T, et al. Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res. 2010;20:761–770. doi: 10.1101/gr.099655.109.
  • 15. Jeon Y, et al. Temporal profile of replication of human chromosomes. Proc Natl Acad Sci U S A. 2005;102:6419–6424.
  • 16. Woodfine K, et al. Replication timing of the human genome. Hum Mol Genet. 2004;13:191–202. doi: 10.1093/hmg/ddh016.
  • 17. Gilbert DM. Replication timing and transcriptional control: beyond cause and effect. Curr Opin Cell Biol. 2002;14:377–383. doi: 10.1016/s0955-0674(02)00326-5.
  • 18. Stamatoyannopoulos JA, et al. Human mutation rate associated with DNA replication timing. Nat Genet. 2009;41:393–395. doi: 10.1038/ng.363.
  • 19. De S, Michor F. DNA replication timing and long-range DNA interactions predict mutational landscapes of cancer genomes. Nat Biotechnol. 2011;29:1103–1108. doi: 10.1038/nbt.2030.
  • 20. Woo YH, Li WH. DNA replication timing and selection shape the landscape of nucleotide variation in cancer genomes. Nat Commun. 2012;3:1004. doi: 10.1038/ncomms1982.
  • 21. Schuster-Bockler B, Lehner B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature. 2012;488:504–507. doi: 10.1038/nature11273.
  • 22. Hammerman PS, et al. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012;489:519–525. doi: 10.1038/nature11404.
  • 23. Hansen RS, et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc Natl Acad Sci U S A. 2010;107:139–144. doi: 10.1073/pnas.0912402107.
  • 24. Weddington N, et al. ReplicationDomain: a visualization tool and comparative database for genome-wide replication timing data. BMC Bioinformatics. 2008;9:530. doi: 10.1186/1471-2105-9-530.
  • 25. Pleasance ED, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463:191–196. doi: 10.1038/nature08658.
  • 26. Berger MF, et al. Melanoma genome sequencing reveals frequent PREX2 mutations. Nature. 2012;485:502–506. doi: 10.1038/nature11071.
  • 27. Berger MF, et al. The genomic complexity of primary human prostate cancer. Nature. 2011;470:214–220. doi: 10.1038/nature09744.
  • 28. Pleasance ED, et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010;463:184–190. doi: 10.1038/nature08629.
  • 29. Puente XS, et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature. 2011;475:101–105. doi: 10.1038/nature10113.
  • 30. Bass AJ, et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet. 2011;43:964–968. doi: 10.1038/ng.936.
  • 31. Wheeler DA, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452:872–876. doi: 10.1038/nature06884.
  • 32. Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
  • 33. Fujita PA, et al. The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 2011;39:D876–882. doi: 10.1093/nar/gkq963.
  • 34. DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–498. doi: 10.1038/ng.806.
  • 35. Ryba T, et al. Abnormal developmental control of replication-timing domains in pediatric acute lymphoblastic leukemia. Genome Res. 2012;22:1833–1844. doi: 10.1101/gr.138511.112.
  • 36. Pope BD, et al. DNA replication timing is maintained genome-wide in primary human myoblasts independent of D4Z4 contraction in FSH muscular dystrophy. PLoS One. 2011;6:e27413. doi: 10.1371/journal.pone.0027413.
  • 37. Watanabe Y, et al. Chromosome-wide assessment of replication timing for human chromosomes 11q and 21q: disease-related genes in timing-switch regions. Hum Mol Genet. 2002;11:13–21. doi: 10.1093/hmg/11.1.13.
  • 38. McMurray CT. DNA secondary structure: a common and causative factor for expansion in human disease. Proc Natl Acad Sci U S A. 1999;96:1823–1825. doi: 10.1073/pnas.96.5.1823.
  • 39. Durkin SG, Glover TW. Chromosome fragile sites. Annu Rev Genet. 2007;41:169–192. doi: 10.1146/annurev.genet.41.042007.165900.
  • 40. Gondor A, Ohlsson R. Replication timing and epigenetic reprogramming of gene expression: a two-way relationship? Nat Rev Genet. 2009;10:269–276. doi: 10.1038/nrg2555.
  • 41. Tsai MC, Spitale RC, Chang HY. Long intergenic noncoding RNAs: new links in cancer progression. Cancer Res. 2011;71:3–7. doi: 10.1158/0008-5472.CAN-10-2483.
  • 42. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J Clin Invest. 2008;118:1590–1605. doi: 10.1172/JCI34772.
  • 43. Cabili MN, et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 2011;25:1915–1927. doi: 10.1101/gad.17446611.
  • 44. Karolchik D, et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008;36:D773–779. doi: 10.1093/nar/gkm966.
  • 45. Misteli T, Soutoglou E. The emerging role of nuclear architecture in DNA repair and genome maintenance. Nat Rev Mol Cell Biol. 2009;10:243–254. doi: 10.1038/nrm2651.
  • 46. Alabert C, Groth A. Chromatin replication and epigenome maintenance. Nat Rev Mol Cell Biol. 2012;13:153–167. doi: 10.1038/nrm3288.
  • 47. Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA. Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution. Nat Genet. 2007;39:1140–1144. doi: 10.1038/ng2104.
  • 48. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
  • 49. Miller W, et al. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 2007;17:1797–1808. doi: 10.1101/gr.6761107.
  • 50. Furey TS, Haussler D. Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet. 2003;12:1037–1044. doi: 10.1093/hmg/ddg113.
  • 51. Huppert JL, Balasubramanian S. Prevalence of quadruplexes in the human genome. Nucleic Acids Res. 2005;33:2908–2916. doi: 10.1093/nar/gki609.
  • 52. Bao L, Zhou M, Cui Y. CTCFBSDB: a CTCF-binding site database for characterization of vertebrate genomic insulators. Nucleic Acids Res. 2008;36:D83–87. doi: 10.1093/nar/gkm875.
  • 53. Wu H, Caffo B, Jaffee HA, Irizarry RA, Feinberg AP. Redefining CpG islands using hidden Markov models. Biostatistics. 2010;11:499–514. doi: 10.1093/biostatistics/kxq005.
  • 54. Kent WJ, et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
  • 55. Kong A, et al. A high-resolution recombination map of the human genome. Nat Genet. 2002;31:241–247. doi: 10.1038/ng917.
  • 56. Dib C, et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996;380:152–154. doi: 10.1038/380152a0.
  • 57. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet. 1998;63:861–869. doi: 10.1086/302011.
