Abstract
Uncovering general principles of genome evolution that are time-invariant and that operate in germ and somatic cells has implications for genome-wide association studies (GWAS), gene therapy, and disease genomics. Here we investigate the relationship between structural alterations (e.g., insertions and deletions) and single-nucleotide substitutions by comparing the following genomes that diverged at different times across germ- and somatic-cell lineages: (i) the reference human and chimpanzee genomes (in millions of years), (ii) the reference human and personal genomes (in tens of thousands of years), and (iii) structurally altered regions in cancer and genetically engineered cells (in days). At the species level, genes with structural alteration in nearby regions show increased single-nucleotide changes and tend to evolve faster. In personal genomes, the single-nucleotide substitution rate is higher near sites of structural alteration and decreases with increasing distance. In human cancer cell populations and in cells genetically engineered using zinc-finger nucleases, single-nucleotide changes occur frequently near sites of structural alterations. We present evidence that structural alteration induces single-nucleotide changes in nearby regions and discuss possible molecular mechanisms that contribute to this phenomenon. We propose that the low fidelity of nonreplicative error-prone repair polymerases, which are used during insertion or deletion, results in break-repair-induced single-nucleotide mutations in the vicinity of structural alteration. Thus, in the mutational landscape, structural alterations are linked to single-nucleotide changes across different time scales in both somatic- and germ-cell lineages. We discuss implications for genome evolution, GWAS, disease genomics, and gene therapy and emphasize the need to investigate both types of mutations within a single framework.
Keywords: single-nucleotide substitution, structural alteration, mutation, DNA repair
Understanding how mutations contribute to genetic variation in a population and drive the evolution of new species is a fundamental problem in the postgenomic era. In recent years, intermediate-scale mutations (e.g., insertion, deletion, translocation, and inversion) that result in structural variation have gained considerable attention (1). Although the genome-wide impact of mutations of different sizes (e.g., single-nucleotide changes and structural alterations such as insertion or deletion of genetic material) has been investigated separately, it has long been speculated (2, 3), and increasing evidence suggests, that mutations of different sizes are correlated in prokaryotes and eukaryotes (4–13). Although these studies have provided evidence for such a relationship, how this phenomenon affects functional elements in the genome has not been systematically investigated in humans. Specifically, the implications for sequence (protein and nucleic acid) evolution and the nature of such dependence in genomes that “diverged” at different time scales, i.e., at the time scales of speciation, population divergence, and the lifetime of an individual, have not been addressed within a single framework (Fig. 1). In addition, it is unclear if such a phenomenon operates in both germ and somatic cells. We therefore investigated this problem by systematically analyzing the reference genomes of humans and chimpanzees, completely sequenced genomes of several human individuals from different populations, and structurally altered genomic regions in cancer and genetically engineered cell populations. We demonstrate that structural alteration results in increased single-nucleotide changes in its neighborhood. This relationship holds well across different time scales and thus appears to be a time-invariant principle of genome evolution.
Fig. 1.
Are structural alterations and single-nucleotide changes linked across different time scales? To answer this question, this study analyzed genomes at three different time scales: speciation, population, and cell division time scales within a single framework. For each time scale, the figure summarizes the specific questions addressed and the corresponding datasets used. The shaded downward arrow in the left-most vertical panel shows the different time scales and the schematic figure next to the arrow shows the different genomes (which diverged at different time scales) that were compared. The third panel shows schematically the transmission of genetic material through the germ-cell and somatic-cell lineage. A cell containing the genetic material is shown as a solid circle (purple, zygote; red, germ cell; blue, somatic cell). Arrows represent a cell division event, and a semicircle represents a gamete, containing half the genetic material. The fourth and fifth panels describe the specific questions investigated and the datasets used to address them.
Results and Discussion
Analysis at the Species Level.
At the speciation time scale (in millions of years), it was recently shown that the single-nucleotide mutation rate is elevated in regions (spanning several hundred bases) surrounding the sites of insertions and deletions (InDels) of genetic material and decreases with increasing distance from the InDel site (5). Given that InDels can occur in intergenic and intronic regions, these observations raise an important question: How do InDels affect functional elements such as protein-coding genes in the vicinity of such structural alterations? To understand the evolutionary implications of such dependence, we investigated whether genes with alteration in their vicinity (i.e., genomic neighborhood) due to InDels show a high local nucleotide sequence divergence rate (Fig. 1). We first identified one-to-one orthologous genes by comparing the reference genome sequences of humans and chimpanzees (diverged ∼5 Mya) and investigated the extent of alteration (due to insertion, deletion, translocation, or inversion) in their genomic neighborhood as measured by the conservation of genomic neighborhood (CGN) score (14) (SI Appendix SM-1). The extreme values, CGN scores of 0 or 1 for a gene, represent an extensively altered or an absolutely conserved genomic neighborhood, respectively. Genomic alterations within syntenic regions are brought about not only by InDels involving a few bases (generally less than 10 bp) due to strand slippage but also by the activity of transposable elements and errors during DNA repair, which result in inversion, duplication, deletion, and insertion of genetic material greater than 10 bp (15, 16). We then obtained the synonymous (dS) and the nonsynonymous (dN) divergence rate and the divergence rate of all interspersed repeats in the intergenic/intronic regions within 250 kb of the midpoint of each gene (KI) from Ensembl v48 and Khaitovich et al. (17), respectively. We next compared the distribution of values for groups of orthologous genes using appropriate statistical tests (Materials and Methods and Fig. 1).
Mutations leading to synonymous substitutions and single-nucleotide changes in the interspersed elements in the intergenic/intronic regions are likely to be nearly neutral in the absence of functional constraints (18). An investigation of the genes showed that there was a consistent increase in dS value with a decrease in CGN scores (Fig. 2A). This suggests that genes with extensive alteration in their genomic neighborhood due to structural changes show an increase in the synonymous substitution rate. The set of genes with a CGN score of 0 had significantly higher dS values (P = 7.44 × 10⁻³; Mann–Whitney test; n = 57 genes) relative to that of all other genes. In addition, we found that the KI values for genes with an altered genomic neighborhood were significantly and consistently higher than those for genes with a conserved neighborhood (Fig. 2B). Taken together, these observations suggest that the local background mutation rate in genes with an altered neighborhood is significantly higher compared with those with a conserved neighborhood. Because local GC content and the recombination rate of a region may affect the extent of errors introduced by the DNA repair process, genes in such regions have been shown to evolve rapidly (19–21). An analysis of the GC content (SI Appendix SM-2) and recombination rate (estimated for humans as available from Ensembl v48 and the UCSC Genome Browser, SI Appendix SM-3) suggested that our findings are unlikely to be biased due to these factors. This suggests that at the speciation time scale (i) intermediate-scale mutations in genomic regions are linked to increased local single-nucleotide divergence of nearby genes; and (ii) in addition to biased gene conversion and local recombination rate (19, 20), alterations in the genomic neighborhood of a gene could also contribute to the observed higher local mutation rate of the gene (see below for possible mechanisms).
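The group comparison described above (dS values of genes with CGN = 0 versus all other genes) can be illustrated with a minimal sketch of a two-sided Mann–Whitney test. This is not the authors' pipeline: the function names and the toy dS values are hypothetical, and the P value uses the normal approximation without tie or continuity correction, which is only rough for small groups.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, z score, two-sided P) via the normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # assumes no ties
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, z, p_two_sided

# Hypothetical dS values: genes with CGN = 0 versus all other genes.
ds_altered = [0.020, 0.025, 0.030]
ds_other = [0.005, 0.008, 0.010, 0.012]
u, z, p = mann_whitney_u(ds_altered, ds_other)
print(u, round(z, 3), round(p, 4))
```

For real data, an established statistics package with exact or tie-corrected P values would be preferable to this approximation.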
Fig. 2.
Box plot of the distribution of (A) dS, (B) KI, (C) dN, (D) percentage protein sequence identity, and (E) dN/dS for orthologous genes between the reference human and chimpanzee genome sequences grouped according to their CGN scores. The genome-wide median values are shown as horizontal lines. Each box plot identifies the middle 50% of the data, the median, and the extreme points. The entire set of data points is divided into quartiles, and the interquartile range (IQR) is calculated as the difference between x_0.75 and x_0.25. The range of the 25% of the data points above (x_0.75) and below (x_0.25) the median (x_0.50) is displayed as a solid box. The horizontal line and the notch represent the median and confidence intervals, respectively. Data points lying more than 1.5·IQR beyond the box are treated as outliers and are excluded only to improve visualization of the graphs. The horizontal line that is connected by dashed lines above and below the solid box (whiskers) represents the largest and the smallest nonoutlier data points, respectively.
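The box-plot quantities described in this legend can be computed as in the following minimal sketch. It assumes the common convention that outliers lie more than 1.5·IQR beyond the quartiles; the function name is hypothetical, and the quartile interpolation method (here `method="inclusive"`) is an assumption, because plotting packages differ slightly in how they interpolate.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends, and outliers under the 1.5*IQR convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inliers = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "x_0.25": q1,
        "x_0.50": q2,
        "x_0.75": q3,
        "iqr": iqr,
        "whisker_low": min(inliers),   # smallest nonoutlier point
        "whisker_high": max(inliers),  # largest nonoutlier point
        "outliers": outliers,
    }

# Toy data with one extreme value.
print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```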
Nonsynonymous sites are usually under negative selection because amino acid changes in the gene products could have deleterious structural and functional consequences at the protein level. We found that there was an increase in both dN value and the dN/dS ratio (a signature of positive selection or reduced negative selection) with a decrease in CGN score (Fig. 2 C–E). The group of genes with CGN = 0 had significantly higher dN values (P = 1.06 × 10⁻³; Mann–Whitney test; n = 57 genes) and dN/dS ratio (P = 1.85 × 10⁻²; Mann–Whitney test; n = 55 genes) relative to that of all other orthologous genes. Consistent with this observation, the percentage protein sequence identity decreased with a decreasing CGN score, and the group of genes with CGN = 0 had significantly lower percentage protein sequence identity values relative to that of all other genes. These findings suggest that genes with extensive structural alteration in their genomic neighborhood show an increase in the nonsynonymous substitution rate in protein-coding regions, leading to high protein sequence divergence and elevated dN/dS ratio, which are signatures of reduced negative selection or positive selection.
To ensure that our observation is not an artifact of rapidly evolving genes, we investigated how such genes evolve in other reference mammalian genomes and within the human HapMap population (SI Appendix SM-4). We found that genes with an altered neighborhood in one set of species comparisons (e.g., humans and chimps) do not diverge rapidly in rat–mouse or dog–cow species comparisons. This suggests that alteration in a genomic neighborhood causes an increase in the single-nucleotide divergence rate and that our observation is unlikely to be biased due to enrichment of genes that evolve rapidly during mammalian evolution. An investigation of the HapMap data revealed that the distribution of nonsynonymous SNPs (minor allele frequency >0.05) and the proportion of species-level amino acid mutations (obtained by comparing the human and chimpanzee reference genomes) that are also polymorphic in the human population are comparable for genes with an altered and conserved neighborhood. This suggests (i) that not all of the genes with an altered neighborhood that diverged rapidly between humans and chimpanzees also show high variability (nonsynonymous SNPs) within the human population and (ii) that a considerable proportion of mutations accumulated at the species level are not likely to be polymorphic in human populations. Such a pattern may arise due to (a) strong natural selection around the locus of the structural alteration or (b) an elevated local mutation rate but only for a short period (i.e., a few generations or cell divisions) immediately after the genomic alteration event (see below for possible mechanisms), with some mutations having been fixed afterward due to a population bottleneck.
Analysis at the Population Level.
At the population divergence time scale (in tens of thousands of years), we investigated whether structural alteration (InDels) is linked to increased single-nucleotide polymorphisms in the sequenced genomes of individuals from different populations (Fig. 1). We first collected the structural variation and polymorphism data for the genome sequence of Craig Venter, a Caucasian individual (HuRef) (22), which contained ∼4.1 million DNA variants including SNPs, InDels, etc., as compared with the reference human genome (Homo sapiens: NCBI36). An investigation of the SNP density using a window of variable size (w = 1 kb, 2 kb, … up to 1 Mb), after centering on the site of variation, revealed that the SNP density is high in a relatively narrow window immediately close to the structural variation and decays with increasing distance away from the site of the structural alteration (Fig. 3A). For InDels >30 bp, the observed SNP density changes from a mean value of 2.8 × 10⁻³ substitutions per base within 1 kb to 1.8 × 10⁻³ substitutions per base in a region that is ∼50 kb away from the site of the structural variation. We also observed a similar trend for the Korean Reference Genome (KorRef) (23) and the Yoruban genome (NA18507) (24) (Fig. 3 B and C). The observed trend also appears to be independent of the size of the structural variation analyzed (SI Appendix SM-5). We found that the patterns of substitution (transition and transversion) were comparable at different distances from the site of structural alteration, suggesting no substitution bias in the vicinity of the site of structural alteration (SI Appendix SM-5). We also found that InDel density and SNP density were strongly correlated (Spearman's rank correlation coefficient: 0.69; P < 2.2 × 10⁻¹⁶) in the HuRef genome. A similarly strong correlation was observed in the KorRef and the Yoruban genome (SI Appendix SM-5), which is consistent with what was recently reported by Kim et al. (6) for a second distinct Korean (AK1) genome. These findings suggest that our observations are unlikely to be an artifact of genome assembly or the sequencing strategy used.
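The window-based SNP density analysis described above can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the authors' code: all positions are taken to lie on a single chromosome, windows that run past chromosome ends are not trimmed, and the function name and toy coordinates are hypothetical.

```python
from bisect import bisect_left

def snp_density_around_sites(snp_positions, sv_sites, window_sizes):
    """Mean SNP density (SNPs per base) in windows of each size centered on SV sites."""
    snps = sorted(snp_positions)
    densities = {}
    for w in window_sizes:
        half = w // 2
        per_site = []
        for site in sv_sites:
            # Count SNPs in the half-open interval [site - half, site + half).
            lo = bisect_left(snps, site - half)
            hi = bisect_left(snps, site + half)
            per_site.append((hi - lo) / w)
        densities[w] = sum(per_site) / len(per_site)
    return densities

# Toy example: SNPs cluster near a structural alteration at position 1000.
snps = [990, 1000, 1010, 1400, 3000]
print(snp_density_around_sites(snps, sv_sites=[1000], window_sizes=[100, 1000]))
```

In this toy input the density falls from 0.03 SNPs per base in the 100-bp window to 0.004 in the 1-kb window, the same qualitative decay with window size that is reported for the personal genomes.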
Fig. 3.
Distribution of the density of single-nucleotide change as a function of distance from the site of structural alteration (i.e., InDels greater than 30 bp) in the genome of (A) Venter (HuRef), (B) Yoruban (NA18507), and (C) Korean (KorRef) individuals. (E) SNP density for 100-kb genomic blocks in the Yoruban genome that have at least one structural alteration (small red rectangles) is significantly different from that of the genomic blocks that have no structural alteration, given that both sets of blocks have no InDels in the KorRef genome (P < 2.2 × 10⁻¹⁶). (D) SNP density for 100-kb genomic blocks in the KorRef genome that have at least one structural alteration is significantly different from that of the genomic blocks that have no structural alteration, given that both sets of blocks have no structural alteration in the Yoruban genome (P < 2.2 × 10⁻¹⁶). (F) Conditional probability values. Nonoverlapping 100-kb genomic blocks with high SNP density (i.e., greater than the median SNP density) are defined as those blocks with an above-the-median value in each of the personal genomes.
We further investigated if this relationship is causal by comparing equivalent regions from the personal genomes. Because the experimental approach to sequencing individuals is different, and the absolute number of mutations identified is not directly comparable, we analyzed only the Yoruban and KorRef genomes that were similar in terms of the number of structural alterations and SNP density. We first divided the genomes into equivalent genomic blocks of 100 kb. We then investigated if genomic blocks with at least one structural alteration in one genome but no structural alteration in the other genome had a statistically significant increase in their SNP density when compared with genomic blocks that had no structural alteration in both genomes (Fig. 3 D and E). We found a statistically significant difference in the SNP densities between the two types of genomic blocks, suggesting that the presence of a structural alteration is likely to contribute to increased single-nucleotide substitution density even after controlling for local mutation rates (Fig. 3 D and E and SI Appendix SM-6). We then investigated whether the regions of high single-nucleotide substitution density are also prone to have structural alterations, which may suggest overall genome instability. On the contrary, we found that the conditional probability of finding a structural alteration in genomic blocks of increased SNP density was significantly lower than that of finding increased SNP density in genomic blocks harboring a structural variation (Fig. 3F). This indicates that sites of increased SNP density do not necessarily harbor an excess of structural variations. This also suggests that the rate of single-nucleotide substitution may depend on a number of factors (25) and that the presence of structural alteration in the vicinity is only one of these factors. Taken together, these findings suggest that at the time scales of population divergence, the presence of structural alteration is likely to cause an increase in the local single-nucleotide substitution rate in the vicinity. Although we present evidence that InDels induce SNPs in their vicinity, we want to stress that we cannot rule out the effect of SNPs on InDels in certain genomic locations.
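The two conditional probabilities compared above can be made concrete with a small sketch over per-block summaries. The inputs are hypothetical: one SNP count and one structural-alteration flag per nonoverlapping 100-kb block. "High SNP density" is taken to mean above the median across blocks, as in the Fig. 3 legend; the function name and toy values are assumptions for illustration only.

```python
from statistics import median

def conditional_probabilities(block_snp_counts, block_has_sv):
    """Return (P(SV | high SNP density), P(high SNP density | SV)) over blocks."""
    cutoff = median(block_snp_counts)
    high = [count > cutoff for count in block_snp_counts]
    n_high = sum(high)
    n_sv = sum(1 for flag in block_has_sv if flag)
    if n_high == 0 or n_sv == 0:
        raise ValueError("need at least one high-density block and one SV block")
    n_both = sum(1 for h, s in zip(high, block_has_sv) if h and s)
    return n_both / n_high, n_both / n_sv

# Toy data: eight blocks, two of which carry a structural alteration.
snp_counts = [10, 12, 50, 60, 55, 11, 70, 9]
has_sv = [False, False, True, False, False, False, True, False]
p_sv_given_high, p_high_given_sv = conditional_probabilities(snp_counts, has_sv)
print(p_sv_given_high, p_high_given_sv)
```

In this toy input every block with a structural alteration has high SNP density, but only half of the high-density blocks carry a structural alteration, which mirrors the direction of the asymmetry reported in Fig. 3F.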
Analysis at the Cellular Level.
Is the association between structural alterations and single-nucleotide changes observed in both somatic and germ cell lineages? Although the above-reported pattern of mutations needs to occur in germ cells to be transmitted to the next generation, mechanisms (see below) that may contribute to such a trend could very well operate in somatic cells and may therefore affect the genome sequence of daughter cells when parent cells divide (Fig. 1). To this end, we analyzed the mutational landscape of cancer genomes. To examine patterns of genome evolution in somatic cells during cancer, we collected structural variation data and resequencing-based single-nucleotide mutation data, which were generated by comparing a malignant melanoma and a lymphoblastoid cell line from the same individual (26). We analyzed the density of single-nucleotide changes around the sites of structural variations (w = 1 kb to 1 Mb) in a manner similar to the previous analysis and found that the SNP density is high in a relatively narrow window immediately close to the structural variation and decays with increasing distance away from the site of the structural alteration (Fig. 4 A and B). For structural variations >30 bp, the observed SNP density changes from a mean value of 1.8 × 10⁻⁵ substitutions per base within 1 kb to 1.1 × 10⁻⁵ substitutions per base in a region that is ∼1 Mb away from the site of the structural variation, and the pattern is independent of the size of the structural variation (SI Appendix SM-7).
Fig. 4.
Distribution of SNP density as a function of the distance from the site of structural alteration of size greater than 20 bp (A) and greater than 30 bp (B) in the cancer (melanoma) genome. (C) Conditional probability values. Nonoverlapping 100-kb genomic blocks with high SNP density (i.e., greater than the median SNP density) are defined as those blocks with above-the-median value in each of the personal genomes.
We also found that the conditional probability of finding a structural alteration in 100-kb genomic blocks of increased SNP density was significantly lower than that of finding increased SNP density in genomic blocks harboring a structural variation (Fig. 4C). This indicates that sites of increased SNP density do not necessarily harbor an excess of structural variations. It also suggests that at the cancer genome evolution time scale, the presence of a structural alteration is likely to cause an increase in the local single-nucleotide substitution rate in the vicinity. In addition, we performed an extensive literature search to identify instances of genomic alteration events that were resequenced from cancer cell populations. Inspection of sequencing reads from the high-throughput sequencing approaches that map to translocation or fusion events in a VCaP prostate cancer cell line and a MeWo melanoma cell line (27–29) clearly revealed instances of single-nucleotide changes in certain sequenced reads in the vicinity of the alteration site (SI Appendix SM-7). We believe that the single-nucleotide mutations observed in such reads are unlikely to be sequencing or mapping artifacts because (i) we observe them in multiple independent studies and in different genomic loci and (ii) the locus with the structural alteration (fusion or translocation) mapped by the read will be unique and will be found only in the genome of the cancer cell investigated.
Furthermore, we collected and analyzed sequences of clones of normal individual cells from humans (30), Zea mays (31), and zebrafish (32) that were obtained after insertion or deletion of the target sequences through genetic engineering with zinc-finger nucleases. Although the genetic engineering procedure correctly inserted or deleted the template sequence as expected in the right genomic location, we noted a number of single-nucleotide changes in the vicinity of the alteration site in certain clones from individual cells (SI Appendix SM-7). This may again indicate the underlying heterogeneity in cell populations that might have been introduced due to the engineered InDel event as it is known that this process is imprecise (33). Because we did not find any large-scale dataset pertaining to the engineered cells, we were unable to assess the statistical significance of our observation. Hence we caution that further work needs to be done to distinguish genetic heterogeneity from sequencing errors and to provide a more resolved understanding of this phenomenon in engineered cells. However, because we observe a similar trend in three independent studies in three different organisms, we believe that our findings are unlikely to be an artifact.
Taken together, our observations suggest that genomic alterations involving insertion or deletion and single-nucleotide changes may even be linked in normal and diseased somatic cells. They also suggest that the association between structural variation and single-nucleotide change is seen in both the somatic- and the germ-cell lineages. We hope that further experiments that involve insertion and deletion of genetic material in cultured cell lines or germ cells in whole organisms, followed by monitoring of sequence divergence over successive divisions or generations, can provide a more resolved understanding of this phenomenon.
Finally, we performed analyses of pair-wise comparisons of the genomes (SI Appendix SM-8) from the three different time scales (i.e., species, population, and cancer evolution). We found that (i) genes with altered neighborhood display increased structural alteration (insertion, deletion, segmental duplication, and inversion) in the human population; (ii) genes that are rearranged in cancer are significantly more likely to have an altered neighborhood between humans and chimpanzees; and (iii) rearranged cancer genes are significantly more likely to show more copy number variation (CNVs) in the human population than other genes, indicating that the rearranged genes are present in unstable genomic regions. Although several genes are rearranged in the cancer genomes, only a subset of the rearranged genes is implicated as a driver of cancer. When we focus only on the subset of those genes that are thought to cause cancer (as obtained from the Cancer Gene Census database), we do not find the above associations (SI Appendix SM-8). Taken together, these results suggest that there is indeed a correlation in the location of genomic changes across different time scales.
Possible Molecular Mechanisms.
In summary, we show that structural alteration and single-nucleotide mutation rates are linked by comparing genomes that diverged at different time points, from a few million years (between species) to a few days (between individual cells). These observations raise an important question: How does insertion or deletion of genetic material in the vicinity of a gene, or translocation of genetic material to a new genomic location, increase the local nucleotide-level mutation rate? Recent studies on the mechanisms of insertion, deletion, DNA repair, and replication (16, 34–39) provide possible explanations for our observations. The generation of structural alteration (e.g., insertion or deletion) requires single-stranded and double-stranded breaks. Depending on the nature of the alteration, it may involve repair mechanisms such as break-induced replication, nonhomologous end joining, homologous recombination, microhomology-mediated end-joining, and/or single-strand annealing. Some of these mechanisms require template-based homologous recombination, crossing-over of broken DNA, formation of DNA structures such as D-loops, and resection (chewing up) of a single strand of the DNA by nucleases, thereby requiring DNA synthesis of the resected region in a template-dependent manner. These steps require the action of different nucleases and primase, the synthesis and degradation of RNA primers, and the involvement of different nonreplicative, low-fidelity repair polymerases with very different error rates of incorporating a wrong base (40, 41). For example, replicative polymerases (POLE and POLD1) have a very low error rate of 1 in 10⁵–10⁶ bases, whereas some of the repair polymerases such as POLK, POLI, REV1, REV3, POLL, and POLB, which are typically used during the process of insertion or deletion of genetic material, have relatively high error rates in the range of 1 in 10¹–10⁵ bases [see table 3 in Rattray and Strathern (41)]. Because the region closest to the InDel site is always resected and then filled in using the low-fidelity nonreplicative repair polymerase, we propose that the repair burden, and hence the likelihood of introducing a mutation, is higher closer to the site of structural alteration. Indeed, it has been shown in yeast that near the sites of homologous recombination, which involve a site-specific double-stranded break (DSB), there is a 100- to 1,000-fold increase in mutation rate of genes adjacent to the DSB (42, 43). The majority of such mutations, which are referred to as break-repair-induced mutations, have been identified as single-nucleotide changes, and a role for the REV3 repair polymerase has also been implicated (42).
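To give a feel for the magnitudes involved, the following back-of-the-envelope sketch converts per-base error rates into expected misincorporations over a repaired tract. It is only an illustration: the 1,000-bp tract length is a hypothetical value, the two rates are single points chosen from within the ranges quoted above, errors are treated as independent per base, and any subsequent proofreading or mismatch correction is ignored.

```python
def expected_misincorporations(tract_length_bp, error_rate_per_base):
    """Expected number of wrong bases when filling in a tract of the given length."""
    return tract_length_bp * error_rate_per_base

def prob_at_least_one_error(tract_length_bp, error_rate_per_base):
    """Probability of one or more errors, assuming independent per-base errors."""
    return 1 - (1 - error_rate_per_base) ** tract_length_bp

TRACT_BP = 1000  # hypothetical resected-and-refilled tract length

# Illustrative per-base error rates picked from the ranges quoted in the text.
for label, rate in [("replicative polymerase", 1e-6), ("repair polymerase", 1e-3)]:
    mean_errors = expected_misincorporations(TRACT_BP, rate)
    p_any = prob_at_least_one_error(TRACT_BP, rate)
    print(f"{label}: expected errors = {mean_errors:.3g}, P(at least one) = {p_any:.3g}")
```

Under these assumptions a 1,000-fold difference in error rate translates almost directly into a 1,000-fold difference in expected errors while the rates remain small, consistent in spirit with the fold increases reported near double-stranded breaks in yeast.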
In addition to the above possibilities, mechanisms such as transcription-coupled DNA repair, a specialized repair pathway that counteracts the toxic effects of DNA damage in transcriptionally active regions, have been shown to be prominent in specific chromosomal domains (44–46). Accordingly, alterations in the genomic neighborhood in such chromosomal domains could affect the mutation-correction balance, thereby resulting in an altered nucleotide divergence rate in these regions. Although we have discussed only a few possibilities, other mechanisms may very well contribute to the observed trends.
Conclusions and Implications.
Although we investigated only one-to-one orthologous genes in our species-level analysis, the reported phenomena are likely to be prevalent, or stronger, in one-to-many and many-to-many orthologs. The duplicate gene copy (i.e., paralog), arising due to insertion of genetic material elsewhere on the genome and thereby involving DNA break, repair, and localized template-dependent DNA synthesis, may experience an altered sequence-level mutation rate. This will be a combination of the local background genomic mutation rate and the increased local mutation rate due to the genomic alteration event. This may result in increased divergence of nucleotide and protein sequence of the duplicated gene, which may be tolerated due to relaxed functional constraints. Therefore, the reported phenomena could be one possible mechanistic explanation for the observed asymmetric sequence evolution of gene duplicates (47).
At the level of speciation, the mutational landscape introduced due to increased local nucleotide divergence rate as a result of alteration in the genomic neighborhood is likely to appear similar to that of lineage-specific accelerated evolution, even when functional advantage need not be present. At the level of individuals in a population of species, the elevated local mutation rate due to recent genomic alteration may introduce diversity, which may have advantageous or deleterious consequences. For example, when the alterations in genomic neighborhood are not fixed and the hemizygotic condition offers selective advantage, variation at both the structural and the sequence level is promoted. Furthermore, such mutations may affect cellular homeostasis by altering protein-coding regions, leading to misexpression of key pleiotropic genes, and may result in disease phenotypes. Given that mobile elements, which facilitate InDel events, (i) play an important role in generating interindividual structural variation (48, 49) and (ii) may even result in somatic mosaicism in multicellular organisms (50, 51), it is likely that InDels introduced by this mechanism will also result in increased single-nucleotide changes in the vicinity and may have a role in certain diseases such as cancer. In this context, the availability of data from the 1000 Genomes Project should help uncover the interplay between InDels and single-nucleotide changes in the genomes of individuals within a population.
Finally, our observation that mutations of different sizes are linked raises the possibility that, in certain instances, it might not be easy to pinpoint contributing mutations (e.g., CNVs, InDels, or the linked SNPs in the vicinity of the structural alteration) and that epistatic effects of mutations of different sizes will need to be considered in genome-wide association studies and disease genomics (e.g., cancer genomics). Therefore, it will be important to investigate mutations of different sizes in the vicinity of the marker polymorphism, and caution should be taken in interpreting results from genome-wide studies. In addition, such interdependence of mutations needs to be considered in the design of SNP arrays and in the mapping and assembly of short read sequences obtained from next-generation sequencing approaches in genome-scale studies. These findings also have implications for recently emerging gene therapy and genetic-engineering applications, which involve viral vector-based integration of genetic material or a zinc-finger–coupled nuclease-dependent double-stranded break, followed by insertion of new genetic material via homologous recombination. For example, one has to carefully investigate the vicinity of genetically engineered regions when using viral vectors or zinc-finger nucleases to insert or delete genetic material to ensure that individual cells selected from the engineered cell population do not carry other unexpected or undesirable single-nucleotide changes in the vicinity. In conclusion, our findings suggest that, in order to understand how mutations drive genome evolution, it is important to study mutations of different sizes, their influence on each other, and how they affect genome organization within a single conceptual framework.
Materials and Methods
We performed species-level analysis using human and chimpanzee transcriptome sequence data (17). Population-level analysis was performed using the Venter (22), Korean (23), and Yoruban (24) genomes. Cell-level analysis was carried out using data from a completely sequenced human melanoma genome (26). Please see SI Appendix SM-9 for details of Materials and Methods.
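As an illustration only, the kind of population-level question addressed here (whether single-nucleotide changes are enriched near sites of structural alteration) can be sketched as a distance-binned count. The sketch below is not the procedure described in SI Appendix; the function names, the bin size, the maximum distance, and the assumption that all coordinates lie on a single chromosome are hypothetical choices made for the example.

```python
from bisect import bisect_left
from collections import Counter


def distance_to_nearest(pos, sorted_sites):
    """Absolute distance from pos to the closest entry of a sorted, non-empty list."""
    i = bisect_left(sorted_sites, pos)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - pos)      # nearest site at or after pos
    if i > 0:
        candidates.append(pos - sorted_sites[i - 1])  # nearest site before pos
    return min(candidates)


def snp_counts_by_distance(snp_positions, indel_positions,
                           bin_size=100, max_distance=1000):
    """Count SNPs in consecutive distance bins from their nearest InDel site.

    Assumes (hypothetically) that all positions are on one chromosome and
    that each InDel is represented by a single coordinate.
    """
    n_bins = max_distance // bin_size
    sites = sorted(indel_positions)
    if not sites:
        return [0] * n_bins
    counts = Counter()
    for pos in snp_positions:
        d = distance_to_nearest(pos, sites)
        if d < max_distance:
            counts[d // bin_size] += 1
    return [counts.get(b, 0) for b in range(n_bins)]


if __name__ == "__main__":
    # Toy coordinates, invented for the example (not data from this study).
    indels = [1000, 5000]
    snps = [990, 1010, 1050, 1250, 4400, 5001, 9000]
    print(snp_counts_by_distance(snps, indels))
    # → [4, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

These are raw counts rather than rates. Turning them into a substitution rate per distance bin would require dividing by the number of sequenced bases that fall into each bin, and an appropriate background expectation would be needed before drawing any conclusion.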
Acknowledgments
We thank A. Pombo, A. Travers, A. Wuster, B. Lang, B. Lehner, C. Chothia, C. Ravarani, D. Green, D. Rubinsztein, G. Chalancon, J. Grimmett, J. Sale, J. Su, K. van Steen, K. Weber, M. Buljan, M. Lamers, P. Dear, S. Balaji, S. Teichmann, and V. Pisupati for providing helpful comments. We thank both referees for their constructive criticism of our work. We apologize for not being able to cite several previous works due to space constraints. S.D. and M.M.B. acknowledge the Medical Research Council for support. M.M.B. thanks Darwin College and Schlumberger Ltd. for generous support. M.M.B. is an EMBO Young Investigator. S.D. thanks King's College for support.
Footnotes
The authors declare no conflict of interest.
*This Direct Submission article had a prearranged editor.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.0914454107/-/DCSupplemental.
References
- 1. Hurles ME, Dermitzakis ET, Tyler-Smith C. The functional impact of structural variation in humans. Trends Genet. 2008;24:238–245. doi: 10.1016/j.tig.2008.03.001.
- 2. Fisher RA. The Genetical Theory of Natural Selection. Clarendon Press, Oxford; 1930.
- 3. Nei M. Modification of linkage intensity by natural selection. Genetics. 1967;57:625–641. doi: 10.1093/genetics/57.3.625.
- 4. Navarro A, Barton NH. Chromosomal speciation and molecular divergence: Accelerated evolution in rearranged chromosomes. Science. 2003;300:321–324. doi: 10.1126/science.1080600.
- 5. Tian D, et al. Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes. Nature. 2008;455:105–108. doi: 10.1038/nature07175.
- 6. Kim JI, et al. A highly annotated whole-genome sequence of a Korean individual. Nature. 2009;460:1011–1015. doi: 10.1038/nature08211.
- 7. Hardison RC, et al. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 2003;13:13–26. doi: 10.1101/gr.844103.
- 8. Longman-Jacobsen N, Williamson JF, Dawkins RL, Gaudieri S. In polymorphic genomic regions indels cluster with nucleotide polymorphism: Quantum genomics. Gene. 2003;312:257–261. doi: 10.1016/s0378-1119(03)00621-8.
- 9. Yang S, et al. Patterns of insertions and their covariation with substitutions in the rat, mouse, and human genomes. Genome Res. 2004;14:517–527. doi: 10.1101/gr.1984404.
- 10. Marques-Bonet T, et al. On the association between chromosomal rearrangements and genic evolution in humans and chimpanzees. Genome Biol. 2007;8:R230. doi: 10.1186/gb-2007-8-10-r230.
- 11. Chen JQ, et al. Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol Biol Evol. 2009;26:1523–1531. doi: 10.1093/molbev/msp063.
- 12. Hollister JD, Ross-Ibarra J, Gaut BS. Indel-associated mutation rate varies with mating system in flowering plants. Mol Biol Evol. 2010;27:409–416. doi: 10.1093/molbev/msp249.
- 13. Zhu L, Wang Q, Tang P, Araki H, Tian D. Genomewide association between insertions/deletions and the nucleotide diversity in bacteria. Mol Biol Evol. 2009;26:2353–2361. doi: 10.1093/molbev/msp144.
- 14. De S, Teichmann SA, Babu MM. The impact of genomic neighborhood on the evolution of human and chimpanzee transcriptome. Genome Res. 2009;19:785–794. doi: 10.1101/gr.086165.108.
- 15. Volfovsky N, et al. Genome and gene alterations by insertions and deletions in the evolution of human and chimpanzee chromosome 22. BMC Genomics. 2009;10:51. doi: 10.1186/1471-2164-10-51.
- 16. Zhang F, Carvalho CM, Lupski JR. Complex human chromosomal and genomic rearrangements. Trends Genet. 2009;25:298–307. doi: 10.1016/j.tig.2009.05.005.
- 17. Khaitovich P, et al. Parallel patterns of evolution in the genomes and transcriptomes of humans and chimpanzees. Science. 2005;309:1850–1854. doi: 10.1126/science.1108296.
- 18. Hurst LD. Fundamental concepts in genetics: Genetics and the understanding of selection. Nat Rev Genet. 2009;10:83–93. doi: 10.1038/nrg2506.
- 19. Berglund J, Pollard KS, Webster MT. Hotspots of biased nucleotide substitutions in human genes. PLoS Biol. 2009;7:e26. doi: 10.1371/journal.pbio.1000026.
- 20. Galtier N, Duret L, Glémin S, Ranwez V. GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends Genet. 2009;25:1–5. doi: 10.1016/j.tig.2008.10.011.
- 21. Lercher MJ, Hurst LD. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 2002;18:337–340. doi: 10.1016/s0168-9525(02)02669-0.
- 22. Levy S, et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
- 23. Ahn SM, et al. The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group. Genome Res. 2009;19:1622–1629. doi: 10.1101/gr.092197.109.
- 24. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. doi: 10.1038/nature07517.
- 25. Costantini M, Bernardi G. Mapping insertions, deletions and SNPs on Venter's chromosomes. PLoS ONE. 2009;4:e5972. doi: 10.1371/journal.pone.0005972.
- 26. Pleasance ED, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463:191–196. doi: 10.1038/nature08658.
- 27. Maher CA, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009;458:97–101. doi: 10.1038/nature07638.
- 28. Maher CA, et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci USA. 2009;106:12353–12358. doi: 10.1073/pnas.0904720106.
- 29. Berger MF, et al. Integrative analysis of the melanoma transcriptome. Genome Res. 2010;20:413–427. doi: 10.1101/gr.103697.109.
- 30. Lee HJ, Kim E, Kim JS. Targeted chromosomal deletions in human cells using zinc finger nucleases. Genome Res. 2010;20:81–89. doi: 10.1101/gr.099747.109.
- 31. Shukla VK, et al. Precise genome modification in the crop species Zea mays using zinc-finger nucleases. Nature. 2009;459:437–441. doi: 10.1038/nature07992.
- 32. Doyon Y, et al. Heritable targeted gene disruption in zebrafish using designed zinc-finger nucleases. Nat Biotechnol. 2008;26:702–708. doi: 10.1038/nbt1409.
- 33. Olsen PA, Solhaug A, Booth JA, Gelazauskaite M, Krauss S. Cellular responses to targeted genomic sequence modification using single-stranded oligonucleotides and zinc-finger nucleases. DNA Repair (Amst). 2009;8:298–308. doi: 10.1016/j.dnarep.2008.11.011.
- 34. Pardo B, Gómez-González B, Aguilera A. DNA repair in mammalian cells: DNA double-strand break repair: How to fix a broken relationship. Cell Mol Life Sci. 2009;66:1039–1056. doi: 10.1007/s00018-009-8740-3.
- 35. Llorente B, Smith CE, Symington LS. Break-induced replication: What is it and what is it for? Cell Cycle. 2008;7:859–864. doi: 10.4161/cc.7.7.5613.
- 36. Flores-Rozas H, Kolodner RD. Links between replication, recombination and genome instability in eukaryotes. Trends Biochem Sci. 2000;25:196–200. doi: 10.1016/s0968-0004(00)01568-1.
- 37. Lindahl T, Wood RD. Quality control by DNA repair. Science. 1999;286:1897–1905. doi: 10.1126/science.286.5446.1897.
- 38. Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet. 2009;10:551–564. doi: 10.1038/nrg2593.
- 39. Kvikstad EM, Chiaromonte F, Makova KD. Ride the wavelet: A multiscale analysis of genomic contexts flanking small insertions and deletions. Genome Res. 2009;19:1153–1164. doi: 10.1101/gr.088922.108.
- 40. Pavlov YI, Shcherbakova PV, Rogozin IB. Roles of DNA polymerases in replication, repair, and recombination in eukaryotes. Int Rev Cytol. 2006;255:41–132. doi: 10.1016/S0074-7696(06)55002-8.
- 41. Rattray AJ, Strathern JN. Error-prone DNA polymerases: When making a mistake is the only way to get ahead. Annu Rev Genet. 2003;37:31–66. doi: 10.1146/annurev.genet.37.042203.132748.
- 42. Rattray AJ, Shafer BK, McGill CB, Strathern JN. The roles of REV3 and RAD57 in double-strand-break-repair-induced mutagenesis of Saccharomyces cerevisiae. Genetics. 2002;162:1063–1077. doi: 10.1093/genetics/162.3.1063.
- 43. Strathern JN, Shafer BK, McGill CB. DNA synthesis errors associated with double-strand-break repair. Genetics. 1995;140:965–972. doi: 10.1093/genetics/140.3.965.
- 44. Ju BG, et al. A topoisomerase IIbeta-mediated dsDNA break required for regulated transcription. Science. 2006;312:1798–1802. doi: 10.1126/science.1127196.
- 45. Perillo B, et al. DNA oxidation as triggered by H3K9me2 demethylation drives estrogen-induced gene expression. Science. 2008;319:202–206. doi: 10.1126/science.1147674.
- 46. Misteli T, Soutoglou E. The emerging role of nuclear architecture in DNA repair and genome maintenance. Nat Rev Mol Cell Biol. 2009;10:243–254. doi: 10.1038/nrm2651.
- 47. Conant GC, Wagner A. Asymmetric sequence divergence of duplicate genes. Genome Res. 2003;13:2052–2058. doi: 10.1101/gr.1252603.
- 48. Xing J, et al. Mobile elements create structural variation: Analysis of a complete human genome. Genome Res. 2009;19:1516–1526. doi: 10.1101/gr.091827.109.
- 49. Seleme MC, et al. Extensive individual variation in L1 retrotransposition capability contributes to human genetic diversity. Proc Natl Acad Sci USA. 2006;103:6611–6616. doi: 10.1073/pnas.0601324103.
- 50. Muotri AR, et al. Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature. 2005;435:903–910. doi: 10.1038/nature03663.
- 51. Dear PH. Copy-number variation: The end of the human genome? Trends Biotechnol. 2009;27:448–454. doi: 10.1016/j.tibtech.2009.05.003.