eLife. 2017 Apr 25;6:e24284. doi: 10.7554/eLife.24284

Rapid evolution of the human mutation spectrum

Kelley Harris 1,*, Jonathan K Pritchard 1,2,3,*
Editor: Gilean McVean4
PMCID: PMC5435464  PMID: 28440220

Abstract

DNA is a remarkably precise medium for copying and storing biological information. This high fidelity results from the action of hundreds of genes involved in replication, proofreading, and damage repair. Evolutionary theory suggests that in such a system, selection has limited ability to remove genetic variants that change mutation rates by small amounts or in specific sequence contexts. Consistent with this, using SNV variation as a proxy for mutational input, we report here that mutational spectra differ substantially among species, human continental groups and even some closely related populations. Close examination of one signal, an increased TCC→TTC mutation rate in Europeans, indicates a burst of mutations from about 15,000 to 2000 years ago, perhaps due to the appearance, drift, and ultimate elimination of a genetic modifier of mutation rate. Our results suggest that mutation rates can evolve markedly over short evolutionary timescales and suggest the possibility of mapping mutational modifiers.

DOI: http://dx.doi.org/10.7554/eLife.24284.001

Research Organism: Human

eLife digest

DNA is a molecule that contains the information needed to build an organism. This information is stored as a code made up of four chemicals: adenine (A), guanine (G), cytosine (C), and thymine (T). Every time a cell divides and copies its DNA, it accidentally introduces ‘typos’ into the code, known as mutations. Most mutations are harmless, but some can cause damage. All cells have ways to proofread DNA, and the more resources are devoted to proofreading, the fewer mutations occur. Simple organisms such as bacteria use less energy to reduce mutations, because their genomes may tolerate more damage. More complex organisms, from yeast to humans, instead need to proofread their genomes more thoroughly.

Recent research has shown that humans have a lower mutation rate than chimpanzees and gorillas, their closest living relatives. Humans and other apes copy and proofread their DNA with basically the same biological machinery as yeast, which is about a billion years old. Yet, humans and apes have only existed for a small fraction of this time, a few million years. Why then do humans need to replicate and proofread their DNA differently from apes, and could it be that the way mutations arise is still evolving?

Previous research revealed that European people experience more mutations within certain DNA motifs (specifically, the DNA sequences ‘TCC’, ‘TCT’, ‘CCC’ and ‘ACC’) than Africans or East Asians do.

Now, Harris (who conducted the previous research) and Pritchard have compared how various human ethnic groups accumulate mutations and how these processes differ in different groups.

Statistical analysis of the genomes of thousands of people from all over the world did indeed show that the mutation rates of many different three-letter DNA motifs have changed during the past 20,000 years of human evolution. Harris and Pritchard report that when groups of humans left Africa and settled in isolated populations across different continents, each population quickly became better at avoiding mutations in some genomic contexts, but worse in others. This suggests that the risk of passing on harmful mutations to future generations is changing and evolving at an even faster rate than was originally suspected.

The results suggest that every human ethnic group carries specific variants of the genes which ensure that DNA replication and repair are accurate. These differences appear to influence which types of mutations are frequently passed down to future generations. An important next step will be to identify the genetic variants that could be controlling mutational patterns and how they affect human health.

DOI: http://dx.doi.org/10.7554/eLife.24284.002

Introduction

Germline mutations not only provide the raw material for evolution but also generate genetic load and inherited disease. Indeed, the vast majority of mutations that affect fitness are deleterious, and hence biological systems have evolved elaborate mechanisms for accurate DNA replication and repair of diverse types of spontaneous damage. Due to the combined action of hundreds of genes, mutation rates are extremely low: in humans, about one point mutation per 100 Mb, or about 60 genome-wide per generation (Kong et al., 2012; Ségurel et al., 2014).
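
The two figures quoted above (one point mutation per 100 megabases, about 60 per genome per generation) can be reconciled with a short calculation. The sketch below assumes a haploid genome of roughly 3 billion base pairs, a round number that is not stated in the text.

```python
# Back-of-the-envelope check of the per-generation mutation count.
# ASSUMPTION (not from the text): a haploid genome of ~3 billion bp,
# so a diploid genome offers ~6 billion sites at which a mutation can arise.
HAPLOID_GENOME_BP = 3_000_000_000
DIPLOID_SITES = 2 * HAPLOID_GENOME_BP

# From the text: about one point mutation per 100 megabases per generation.
BP_PER_MUTATION = 100_000_000

per_site_rate = 1 / BP_PER_MUTATION
expected_new_mutations = DIPLOID_SITES / BP_PER_MUTATION

print(f"per-site rate per generation: {per_site_rate:.1e}")
print(f"expected new mutations per generation: {expected_new_mutations:.0f}")
```

Under that assumed genome size, the per-site rate is on the order of 1e-8 and the two quoted figures agree.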

While the precise roles of most of the relevant genes have not been fully elucidated, research on somatic mutations in cancer has shown that defects in particular genes can lead to increased mutation rates within very specific sequence contexts (Alexandrov et al., 2013; Helleday et al., 2014). For example, mutations in the proofreading exonuclease domain of DNA polymerase ϵ cause TCT→TAT and TCG→TTG mutations on the leading DNA strand (Shinbrot et al., 2014). Mutational shifts of this kind have been referred to as ‘mutational signatures’. Specific signatures may also be caused by nongenetic factors such as chemical mutagens, UV damage, or guanine oxidation (Ohno et al., 2014).

Together, these observations imply a high degree of specialization of individual genes involved in DNA proofreading and repair. While the repair system has evolved to be extremely accurate overall, theory suggests that in such a system, natural selection may have limited ability to fine-tune the efficacy of individual genes (Lynch, 2011; Sung et al., 2012). If a variant in a repair gene increases or decreases the overall mutation rate by a small amount–for example, only in a very specific sequence context–then the net effect on fitness may fall below the threshold at which natural selection is effective. (Drift tends to dominate selection when the change in fitness is less than the inverse of effective population size). The limits of selection on mutation rate modifiers are especially acute in recombining organisms such as humans because a variant that increases the mutation rate can recombine away from deleterious mutations it generates elsewhere in the genome.
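
The rule of thumb in the parenthetical above can be written as a one-line check. This is only an illustration: the effective population size and the selection coefficients below are hypothetical values chosen to show both sides of the threshold, not estimates from this study.

```python
def drift_dominates(selection_coefficient: float, effective_size: float) -> bool:
    """Return True when |s| < 1/Ne, the rule of thumb quoted in the text."""
    return abs(selection_coefficient) < 1.0 / effective_size


# ASSUMPTION: an effective population size of 10,000 is used purely
# for illustration; it is not a value taken from this paper.
ILLUSTRATIVE_NE = 10_000

for s in (1e-6, 1e-5, 1e-3):
    regime = "drift" if drift_dominates(s, ILLUSTRATIVE_NE) else "selection"
    print(f"s = {s:.0e}: {regime} dominates (threshold 1/Ne = {1 / ILLUSTRATIVE_NE:.0e})")
```

A modifier that changes the mutation rate only in one narrow sequence context would be expected to have a very small fitness effect, which is why such variants may fall on the drift side of this threshold.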

Given these theoretical predictions, we hypothesized that there may be substantial scope for modifiers of mutation rates to segregate within human populations, or between closely related species. Most triplet sequence contexts have mutation rates that vary across the evolutionary tree of mammals (Hwang and Green, 2004), but evolution of the mutation spectrum over short time scales has been less well described. Weak natural mutators have recently been observed in yeast (Bui et al., 2017) and inferred from human haplotype data (Seoighe and Scally, 2017); if such mutators affect specific pathways of proofreading or repair, then we may expect shifts in the abundance of mutations within particular sequence contexts. Indeed, one of us has recently identified a candidate signal of this type, namely an increase in TCC→TTC transitions in Europeans, relative to other populations (Harris, 2015); this was recently replicated (Mathieson and Reich, 2016). Here, we show that mutation spectrum change is much more widespread than these initial studies suggested: although the TCC→TTC rate increase in Europeans was unusually dramatic, smaller scale changes are so commonplace that almost every great ape species and human continental group has its own distinctive mutational spectrum.

Results

To investigate the mutational processes in different human populations, we classified each single nucleotide variant (SNV) in the 1000 Genomes Phase 3 data (Auton et al., 2015) in terms of its ancestral allele, derived allele, and 5’ and 3’ flanking nucleotides. We collapsed strand complements together to obtain 96 SNV categories. Since the detection of singletons may vary across samples, and because some singletons may result from cell-line or somatic mutations, we only considered variants seen in more than one copy. We further excluded variants in annotated repeats (since read mapping error rates may be higher in such regions) and in PhyloP conserved regions (to avoid selectively constrained regions) (Pollard et al., 2010). From the remaining sites, we calculated the distribution of derived SNVs carried by each Phase 3 individual. We used this as a proxy for the mutational input spectrum in the ancestors of each individual.
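
The classification step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, the label format uses an ASCII `>` in place of an arrow, and the choice to report each variant on the strand where the ancestral base is A or C is an assumed convention for collapsing strand complements.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify_snv(five_prime: str, ancestral: str, derived: str, three_prime: str) -> str:
    """Return a label such as 'TCC>TTC' with strand complements collapsed."""
    if ancestral == derived:
        raise ValueError("ancestral and derived alleles must differ")
    anc_triplet = five_prime + ancestral + three_prime
    der_triplet = five_prime + derived + three_prime
    # Assumed convention: report the strand on which the ancestral base is A or C.
    if ancestral in "GT":
        anc_triplet = reverse_complement(anc_triplet)
        der_triplet = reverse_complement(der_triplet)
    return f"{anc_triplet}>{der_triplet}"


def all_categories() -> list:
    """Enumerate every distinct label produced by classify_snv."""
    labels = set()
    for five in "ACGT":
        for anc in "ACGT":
            for der in "ACGT":
                if anc == der:
                    continue
                for three in "ACGT":
                    labels.add(classify_snv(five, anc, der, three))
    return sorted(labels)


print(len(all_categories()))
print(classify_snv("G", "G", "A", "A"))
```

Six substitution classes on the collapsed strand, each with 16 possible flanking pairs, give the 96 categories described above.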

To explore global patterns of the mutation spectrum, we performed principal component analysis (PCA) in which each individual was characterized simply by the fraction of their derived alleles in each of the 96 SNV categories (Figure 1A). PCA is commonly applied to individual-level genotypes, in which case the PCs are usually highly correlated with geography (Novembre et al., 2008). Although the triplet mutation spectrum is an extremely compressed summary statistic compared to typical genotype arrays, we found that it contains sufficient information to reliably classify individuals by continent of origin. The first principal component separated Africans from non-Africans, and the second separated Europeans from East Asians, with South Asians and admixed native Americans (Figure 1—figure supplement 2) appearing intermediate between the two.
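
To make the PCA input concrete, the sketch below builds a small matrix in which each row is one individual's mutation-type fractions and extracts the first principal component by power iteration. Everything in it is illustrative: the data are simulated from two hypothetical four-category spectra rather than the 96 real categories, and power iteration stands in for whatever PCA routine was actually used.

```python
import random


def mean_center(rows):
    """Subtract each column's mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=100):
    """Leading eigenvector of the sample covariance, found by power iteration."""
    centered = mean_center(rows)
    dim = len(centered[0])
    vec = [1.0 / (i + 1) for i in range(dim)]  # arbitrary nonzero start
    for _ in range(iterations):
        # Apply X^T X to vec without forming the covariance matrix.
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(dim)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
    return vec, scores


def simulate_individual(spectrum, n_snvs, rng):
    """Draw n_snvs mutation types and return the observed fractions."""
    counts = [0] * len(spectrum)
    for idx in rng.choices(range(len(spectrum)), weights=spectrum, k=n_snvs):
        counts[idx] += 1
    return [c / n_snvs for c in counts]


# HYPOTHETICAL spectra for two groups, differing slightly in two categories.
rng = random.Random(0)
spectrum_a = [0.25, 0.25, 0.25, 0.25]
spectrum_b = [0.30, 0.20, 0.25, 0.25]
individuals = [simulate_individual(spectrum_a, 5000, rng) for _ in range(10)]
individuals += [simulate_individual(spectrum_b, 5000, rng) for _ in range(10)]

loadings, pc1 = first_principal_component(individuals)
print("PC1 scores, group A:", [round(s, 3) for s in pc1[:10]])
print("PC1 scores, group B:", [round(s, 3) for s in pc1[10:]])
```

Even a modest shift in the underlying spectrum separates the two simulated groups along the first component, because each individual's fractions are averaged over thousands of SNVs.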

Figure 1. Global patterns of variation in SNV spectra.

(A) Principal component analysis of individuals according to the fraction of derived alleles that each individual carries in each of 96 mutational types. (B) Heatmaps showing, for pairs of continental groups, the ratio of the proportions of SNVs in each of the 96 mutational types. Each block corresponds to one mutation type; within blocks, rows indicate the 5’ nucleotide, and columns indicate the 3’ nucleotide. Red colors indicate a greater fraction of a given mutation type in the first-listed group relative to the second. Points indicate significant contrasts at p < 10⁻⁵. See Figure 1—figure supplements 1, 2 and 3 for heatmap comparisons between additional population pairs as well as a description of PCA loadings and the p-values of all mutation class enrichments. Figure 1—figure supplement 4 demonstrates that these patterns are unlikely to be driven by biased gene conversion. In Figure 1—figure supplement 5, we see that this mutation spectrum structure replicates on both strands of the transcribed genome as well as the non-transcribed portion of the genome. Figure 1—figure supplements 6, 7 and 8 show that most of this structure replicates across multiple chromatin states and varies little with replication timing.

DOI: http://dx.doi.org/10.7554/eLife.24284.003

Figure 1—source data 1. This text file shows the number of SNPs in each of the 96 mutational categories that passed all filters in each 1000 Genomes continental group.
DOI: 10.7554/eLife.24284.004

Figure 1—figure supplement 1. Pairwise mutation spectrum comparisons among continental groups.

Each of these plots compares the mutation spectra of two populations P_1 and P_2. Letting f_i denote the fraction of SNVs in population P_i that have a given triplet context, ancestral allele, and derived allele, the corresponding heat map square visualizes the enrichment ratio f_1/f_2. Black dots mark mutation types for which the difference between populations has a χ² p-value less than 10⁻⁵.
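
The enrichment ratio and its significance test can be sketched for a single mutation type as follows. The caption specifies only that a chi-square test was used; the 2x2 Pearson layout (this type versus all other types, population 1 versus population 2) is an assumption, and the counts are hypothetical.

```python
import math


def enrichment_and_pvalue(count1, total1, count2, total2):
    """Return (f1/f2, p) for one mutation type in two populations.

    p is from a Pearson chi-square test with 1 degree of freedom on the
    2x2 table (this type vs. all other types) x (population 1 vs. 2).
    """
    table = [[count1, total1 - count1], [count2, total2 - count2]]
    row_sums = [total1, total2]
    col_sums = [count1 + count2, (total1 - count1) + (total2 - count2)]
    grand = total1 + total2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Chi-square survival function for 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    ratio = (count1 / total1) / (count2 / total2)
    return ratio, p_value


# HYPOTHETICAL counts: 1300 of 100,000 SNVs in P1 vs. 1000 of 100,000 in P2.
ratio, p = enrichment_and_pvalue(1300, 100_000, 1000, 100_000)
print(f"enrichment ratio f1/f2 = {ratio:.2f}, p = {p:.2e}")
```

With counts of this size, a 30% enrichment already clears the 10⁻⁵ threshold, which illustrates why small shifts become detectable when hundreds of thousands of SNVs are pooled.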

Figure 1—figure supplement 2. PCA of all 1000 Genomes continental groups.

All admixed North and South American individuals were omitted from Figure 1 in the main text to clarify the separation of other populations along an African vs non-African axis and an East vs West Eurasian axis. Here, admixed Americans are added in black. As expected, some African-Americans group with the Africans, while other admixed Americans fall within the variation of other East and West Eurasians. The accompanying heat maps show the mutation type loadings of the first two principal components, the second of which is heavily weighted toward the European TCC→TTC signature.

Figure 1—figure supplement 3. Mutation spectrum comparison p-values.

Each left-hand plot shows all chi-squared p-values corresponding to the ratios from Figure 1B. In the absence of recent mutation spectrum evolution, only one out of 96 SNP categories is expected to have a p-value below 0.01 (lower dotted line). In contrast, the majority of p-values meet the more stringent threshold p < 1e-5. The corresponding right-hand panel shows a closeup of the distribution of p-values greater than 1e-5.

Figure 1—figure supplement 4. The effects of biased gene conversion on mutation spectra.

When using segregating variation to study the mutation spectrum, one potential source of bias is that strong-to-weak mutations, where the ancestral allele is G or C and the derived allele is A or T, have a lower fixation probability than weak-to-strong mutations due to biased gene conversion (BGC). If this effect were sufficiently strong, it would inflate the apparent mutation fractions of weak-to-strong mutations, especially in populations with large effective sizes where natural selection is particularly efficient. Within humans, Africans have the largest long-term effective population size, while East Asians and Native Americans have the lowest. Therefore, if BGC has created differences in mutation spectra between populations, the fraction of weak-to-strong SNVs should be highest in Africans, intermediate in Europeans and South Asians, and lowest in East Asians and Native Americans. This violin plot reveals no such pattern, suggesting that BGC is not a strong driver of mutation spectrum differences between human populations. We do not observe either a direct correlation between strong-to-weak mutation fraction and distance from Africa or an inverse correlation between weak-to-strong mutation fraction and distance from Africa.
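
The quantity compared across populations in this check, the weak-to-strong fraction of an individual's derived SNVs, can be computed as in the sketch below. The function names, the ASCII labels, and the example variants are hypothetical.

```python
WEAK = {"A", "T"}
STRONG = {"G", "C"}


def bgc_class(ancestral: str, derived: str) -> str:
    """Classify a substitution by its expected response to biased gene conversion."""
    if ancestral in WEAK and derived in STRONG:
        return "W>S"  # favored by BGC
    if ancestral in STRONG and derived in WEAK:
        return "S>W"  # disfavored by BGC
    return "neutral"  # W>W or S>S, unaffected by the W/S bias


def weak_to_strong_fraction(snvs) -> float:
    """Fraction of (ancestral, derived) pairs that are weak-to-strong."""
    classes = [bgc_class(anc, der) for anc, der in snvs]
    return classes.count("W>S") / len(classes)


# HYPOTHETICAL derived alleles carried by one individual.
example = [("A", "G"), ("C", "T"), ("A", "T"), ("T", "C")]
print(weak_to_strong_fraction(example))
```

Computing this fraction per individual and grouping by population gives the kind of distribution that the violin plot summarizes.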

Figure 1—figure supplement 5. Mutation spectra of transcribed vs non-transcribed DNA.

Using the UCSC Genome Browser annotations of the human reference hg19, we determined whether each SNP occurs in a transcribed or non-transcribed region. We further divided SNPs occurring in transcribed regions according to whether the ancestral A or C allele occurs on the (+)-strand or the (-)-strand. Panels A, B, and C all show the same population-specific mutation type enrichments that are observed in Figure 1B. Panel D plots the residuals between panels A and B, highlighting mutation types that show a modest difference in strand bias between populations.

Figure 1—figure supplement 6. Mutation spectra of ChromHMM chromatin states (Part I of II).

To investigate whether any mutation spectrum shifts might be confined to particular chromatin states, we used chromHMM annotations of the human embryonic stem cell line HESC-H1 (Hoffman et al., 2013). Each heat map plots mutation spectrum comparisons for SNPs that are annotated as being part of the same chromatin state, and dots mark mutation types that show a significant enrichment in one population at the level p<0.01. Every chromatin state shows enrichment of the TCC→TTC signature in Europe and South Asia. Some heat maps are noisy due to the small sample size of SNPs contained within these regions, but all showcase the same general patterns as Figure 1B.

Figure 1—figure supplement 7. Mutation spectra of ChromHMM chromatin states (Part II of II).

Figure 1—figure supplement 8. Variation of the mutation spectrum with DNA replication timing.

We partitioned the genome into 10 equal replication timing quantiles using data from Woodfine et al. (2004), then computed mutation spectrum differences within each quantile. Although most patterns from Figure 1B replicate within each replication timing bin, there are a few exceptions. CpG transitions, which occur most often in early-replicating regions, vary in population bias depending on replication timing. In addition, the deficit of ACA→AAA and AAA→ATA mutations in Africa compared to Europe and Asia is observed mainly in early-replicating regions.

Remarkably, we found that the mutation spectrum differences among continental groups are composed of small shifts in the abundance of many different mutation types (Figure 1B). For example, comparing Africans and Europeans, 43 of the 96 mutation types are significant at a p < 10⁻⁵ threshold using a forward variable selection procedure. The previously described TCC→TTC signature partially drives the difference between Europeans and the other groups, but most other shifts are smaller in magnitude and appear to be spread over more diffuse sets of related mutation types. East Asians have excess A→T transversions in most sequence contexts, as well as about 10% more *AC→*CC mutations than any other group. Compared to Africans, all Eurasians have proportionally fewer C→* mutations relative to A→* mutations.

Replication of mutation spectrum shifts

One possible concern is that batch effects or other sequencing artifacts might contribute to differences in mutational spectra. Therefore we replicated our analysis using 201 genomes from the Simons Genome Diversity Project (Mallick et al., 2016). The SGDP genomes were sequenced at high coverage, independently from 1000 Genomes, using an almost non-overlapping panel of samples. We found extremely strong agreement between the mutational shifts in the two data sets (Figure 2). For example, all of the 43 mutation types with a significant difference between Africa and Europe (at p < 10⁻⁵) in 1000 Genomes also show a frequency difference in the same direction in SGDP (comparing Africa and West Eurasia). In both 1000 Genomes and SGDP, the enrichment of *AC→*CC mutations in East Asia is larger in magnitude than any other signal aside from the previously described TCC→TTC imbalance.

Figure 2. Concordance of mutational shifts in 1000 Genomes versus SGDP.

Each panel shows natural-log mutation spectrum ratios between a pair of continental groups, based on 1000 Genomes (x-axis) and SGDP (y-axis) data. Data points encoded by (+) symbols denote mutation types that are not significantly enriched in either population in the Figure 1 1000 Genomes analysis (p < 10⁻⁵). These heatmaps use the same labeling and color scale as in Figure 1. All 1000 Genomes ratios in this figure were estimated after projecting the 1000 Genomes site frequency spectrum down to the sample size of SGDP. See Figure 2—figure supplements 1 and 2 for a complete set of SGDP heatmaps and regressions versus 1000 Genomes.
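
Projecting a site frequency spectrum down to a smaller sample size is commonly done with hypergeometric sampling probabilities. The caption does not spell out the procedure, so the sketch below is one standard way to do it, offered as an assumption; the function name and the toy spectrum are hypothetical.

```python
from math import comb


def project_sfs(sfs, n_small):
    """Project a site frequency spectrum down to n_small haplotypes.

    sfs[k] is the number of sites whose derived allele is seen in k of
    n_large = len(sfs) - 1 sampled haplotypes. Each site is spread over
    the derived-allele counts j it could show in a random subsample of
    n_small haplotypes, weighted by hypergeometric probabilities.
    """
    n_large = len(sfs) - 1
    if n_small > n_large:
        raise ValueError("can only project down to a smaller sample")
    projected = [0.0] * (n_small + 1)
    denominator = comb(n_large, n_small)
    for k, n_sites in enumerate(sfs):
        lowest = max(0, n_small - (n_large - k))
        highest = min(k, n_small)
        for j in range(lowest, highest + 1):
            prob = comb(k, j) * comb(n_large - k, n_small - j) / denominator
            projected[j] += n_sites * prob
    return projected


# HYPOTHETICAL toy spectrum: 10 singletons observed in 4 haplotypes.
print(project_sfs([0, 10, 0, 0, 0], 2))
```

In a subsample of 2 of the 4 haplotypes, each singleton is retained with probability one half, so half of the sites become invariant in the projected spectrum. Projection matters here because alleles at a given frequency have different age distributions in samples of different sizes.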

DOI: http://dx.doi.org/10.7554/eLife.24284.013

Figure 2—figure supplement 1. Heatmap comparisons between continental groups in 1000 Genomes and the SGDP.

Here, each 1000 Genomes population is projected down to the sample size of the corresponding SGDP population in order to sample alleles with a similar distribution of ages and frequencies.
Figure 2—figure supplement 2. Regression of the SGDP heatmap coefficients versus the corresponding 1000 Genomes heatmap coefficients.

The greatest discrepancies between 1000 Genomes and SGDP involve transversions at CpG sites, which are among the rarest mutational classes. These discrepancies might result from data processing differences or random sampling variation, but might also reflect differences in the fine-scale ethnic composition of the two panels.

Evidence for a pulse of TCC→TTC mutations in Europe and South Asia

To investigate the timescale over which the mutation spectrum change occurred, we analyzed the allele frequency distribution of TCC→TTC mutations, which are highly enriched in Europeans (Figure 3A; p < 1×10⁻³⁰⁰ for Europe vs. Africa) and to a lesser extent in South Asians. We calculated allele frequencies both in 1000 Genomes and in the larger UK10K genome panel (Walter et al., 2015). As expected for a signal that is primarily European, we found particular enrichment of these mutations at low frequencies. But surprisingly, the enrichment peaks around 0.6% frequency in UK10K, and there is practically no enrichment among the very lowest frequency variants (Figure 3B and Figure 3—figure supplement 1). C→T mutations on other backgrounds, namely within TCT, CCC and ACC contexts, are also enriched in Europe and South Asia and show a similar enrichment around 0.6% frequency that declines among rarer variants (Figure 3C). This suggests that these four mutation types comprise the signature of a single mutational pulse that is no longer active. No other mutation types show such a pulse-like distribution in UK10K, although several types show evidence of monotonic rate change over time (Figure 3—figure supplements 3, 4 and 5).

Figure 3. Geographic distribution and age of the TCC mutation pulse.

(A) Observed frequencies of TCC→TTC variants in 1000 Genomes populations. (B) Fraction of TCC→TTC variants as a function of allele frequency in different samples indicates that these peak around 1%. See Figure 3—figure supplement 1 for distributions of TCC→TTC allele frequency within all 1000 Genomes populations, and see Figure 3—figure supplement 2 for the replication of this result in the Exome Aggregation Consortium Data. In the UK10K data, which has the largest sample size, the peak occurs at 0.6% allele frequency. (C) Other enriched C→T mutations with similar context also peak at 0.6% frequency in UK10K. See Figure 3—figure supplements 3, 4 and 5 for labeled allele frequency distributions of all 96 mutation types (most represented here as unlabeled grey lines). See Figure 3—figure supplement 6 for heatmap comparisons of the 1000 Genomes populations partitioned by allele frequency, which provide a different view of these patterns. (D) A population genetic model supports a pulse of TCC→TTC mutations from 15,000 to 2000 years ago. Inset shows the observed and predicted frequency distributions of this mutation under the inferred model.

DOI: http://dx.doi.org/10.7554/eLife.24284.016

Figure 3—figure supplement 1. TCC→TTC mutation fraction as a function of allele frequency in all 1000 Genomes populations.

To enable better comparison with the 1000 Genomes data, the UK10K SNPs have been downsampled to 200 individuals. The age distribution of alleles of a given frequency varies as a function of the number of lineages being sampled; this is why the UK10K pulse peaks around 0.6% frequency when measured in a dataset of thousands of lineages, but peaks around 2% in a subsample of only 400 lineages. Some African and East Asian population names have been omitted for clarity since the TCC→TTC mutation fraction is so uniform within these continental groups. Red = European populations; Blue = South Asian; Orange = Americas; Purple = Africa; Green = East Asia.
Figure 3—figure supplement 2. Fraction of TCC→TTC mutations as a function of allele frequency in ExAC.

Lek et al. compiled data from 60,706 exomes to create the Exome Aggregation Consortium dataset, which enables the analysis of ultra-rare human variation (Lek et al., 2016). The overall fraction of TCC→TTC mutations is slightly higher in exome data than in whole genome data because exons contain a skewed distribution of triplet contexts, but the pulse pattern from Figure 3B reproduces unmistakably.
Figure 3—figure supplement 3. Mutation type enrichment as a function of allele frequency in UK10K (Part I of III).

The eleven panels in Figure 3—figure supplements 3, 4 and 5 show the full dependence of mutation spectrum on allele frequency in the UK10K data. If we let F(f,m) denote the fraction of SNVs of frequency f that are of type m and let F(m) denote the fraction of all mutations that are of type m, the enrichment of mutation type m as a function of frequency is F(f,m)/F(m). This function is expected to fluctuate around y=1 unless the rate of m has recently increased or decreased. All 96 mutation types are visualized in every panel, but most corresponding lines are greyed out to enhance readability. Some lines deviate from y=1 due to the effects of biased gene conversion (BGC); this occurs when one of the ancestral or derived alleles is a weak base (A or T, abbreviated W) and the other allele is a strong base (G or C, abbreviated S). W→S mutations are more abundant at high allele frequencies, while S→W mutations are more abundant at low frequencies. These effects are visible but modest in panels D, G, H, and I, but much more pronounced in panels B, C, and F, which focus on mutations in the CpG context. Transitions of the type CpA→CpG, which create CpG motifs, are extremely enriched at high frequencies, and this pattern may be an artifact of ancestral misidentification (Hernandez et al., 2007). CpG motifs have such high mutation rates that CpG→CpT transitions often happen at the same site in humans and chimps, and these low-frequency double mutations are misclassified as high-frequency CpT→CpG mutations. Although it is not surprising to see a peak of CpT→CpG transitions at high frequencies in panel F, it is somewhat surprising to see CpG→GpG transversions peak in abundance at high frequencies in panel C. This might be a signature of recent declines in the rates of these mutations, since neither ancestral misidentification nor biased gene conversion is thought to produce such a pattern. In addition, neither of these processes can explain the strong enrichment of certain A→T mutations at high frequencies that is observed in panel K.
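
The enrichment F(f,m)/F(m) defined in this caption can be computed from a list of SNVs tagged with a frequency bin and a mutation type. The sketch below is illustrative only: the bin labels, the ASCII mutation-type labels, and the toy data are hypothetical.

```python
from collections import Counter


def enrichment_by_frequency(snvs):
    """Return {(frequency_bin, mutation_type): F(f, m) / F(m)}.

    snvs is an iterable of (frequency_bin, mutation_type) pairs, one per SNV.
    """
    snvs = list(snvs)
    total = len(snvs)
    joint_counts = Counter(snvs)
    bin_counts = Counter(f for f, _ in snvs)
    type_counts = Counter(m for _, m in snvs)
    enrichment = {}
    for (f, m), count in joint_counts.items():
        fraction_in_bin = count / bin_counts[f]   # F(f, m)
        fraction_overall = type_counts[m] / total  # F(m)
        enrichment[(f, m)] = fraction_in_bin / fraction_overall
    return enrichment


# HYPOTHETICAL toy data: one type over-represented among rare variants.
toy = (
    [("rare", "TCC>TTC")] * 3 + [("rare", "other")] * 1
    + [("common", "TCC>TTC")] * 1 + [("common", "other")] * 3
)
result = enrichment_by_frequency(toy)
print(result[("rare", "TCC>TTC")], result[("common", "TCC>TTC")])
```

Values above 1 in a frequency bin indicate that the mutation type is over-represented among variants of that frequency relative to its overall share.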
Figure 3—figure supplement 4. Mutation type enrichment as a function of allele frequency in UK10K (Part II of III).

The eleven panels in this three-part figure show the full dependence of mutation spectrum on allele frequency in the UK10K data.
Figure 3—figure supplement 5. Mutation type enrichment as a function of allele frequency in UK10K (Part III of III).

The 11 panels in this three-part figure show the full dependence of mutation spectrum on allele frequency in the UK10K data.
Figure 3—figure supplement 6. Mutation spectrum comparisons partitioned by allele frequency.

Each of these heatmaps shows a subset of the data used to construct Figure 1B, partitioned by allele frequency to show how rare variants are the most highly differentiated between populations. Black dots highlight mutation types that are significantly different in abundance between two populations in a particular frequency class at the p < 10⁻⁵ level according to a chi-square test.

We used the enrichment of TCC→TTC mutations as a function of allele frequency to estimate when this mutation pulse was active. Assuming a simple piecewise-constant model, we infer that the rate of TCC→TTC mutations increased dramatically 15,000 years ago and decreased again 2000 years ago. This time-range is consistent with results showing this signal in a pair of prehistoric European samples from 7000 and 8000 years ago, respectively (Mathieson and Reich, 2016). We hypothesize that this mutation pulse may have been caused by a mutator allele that drifted up in frequency starting 15,000 years ago, but that is now rare or absent from present day populations.
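
A piecewise-constant pulse can be illustrated with a toy calculation of how a rate multiplier changes the share of new mutations that belong to the pulsed type. This is not the inference procedure used here, which also requires a demographic model to link mutation age to allele frequency. The baseline fraction and the multiplier are hypothetical; only the 15,000 and 2000 year boundaries come from the text.

```python
def rate_multiplier(years_ago, pulse_multiplier, start=15_000, end=2_000):
    """Piecewise-constant relative rate: elevated between `start` and `end`."""
    return pulse_multiplier if end <= years_ago < start else 1.0


def pulsed_fraction(years_ago, baseline_fraction, pulse_multiplier):
    """Share of mutations arising `years_ago` that are of the pulsed type.

    If the pulsed type's rate is multiplied by k while all other rates
    stay fixed, its share moves from f0 to k * f0 / (k * f0 + 1 - f0).
    """
    k = rate_multiplier(years_ago, pulse_multiplier)
    numerator = k * baseline_fraction
    return numerator / (numerator + 1.0 - baseline_fraction)


# HYPOTHETICAL parameters: 2% baseline share, rate doubled during the pulse.
for age in (1_000, 8_000, 20_000):
    print(age, round(pulsed_fraction(age, 0.02, 2.0), 4))
```

Mutations dated inside the window carry the elevated share, while both younger and older mutations return to baseline, which is the qualitative shape expected of a pulse that has ended.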

Although low frequency allele calls often contain a higher proportion of base calling errors than higher frequency allele calls do, it is not plausible that base-calling errors could be responsible for the pulse we have described. In the UK10K data, a minor allele present at 0.6% frequency corresponds to a derived allele that is present in 23 out of 3854 sampled haplotypes and supported by 80 short reads on average (assuming 7x coverage per individual). When independently generated datasets of different sizes are projected down to the same sample size, the TCC→TTC pulse spans the same range of allele frequencies in both datasets (Figure 3—figure supplements 1 and 2), which would not be the case if the shape of the curve were a function of low-frequency errors.
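
The read-support figure quoted above can be reproduced with a short calculation. The sketch assumes that each individual's 7x coverage is split evenly between its two haplotypes, which is not stated explicitly in the text.

```python
# Values taken from the text.
SAMPLED_HAPLOTYPES = 3854
ALLELE_FREQUENCY = 0.006
COVERAGE_PER_INDIVIDUAL = 7

# Number of haplotypes carrying the derived allele at 0.6% frequency.
derived_copies = round(ALLELE_FREQUENCY * SAMPLED_HAPLOTYPES)

# ASSUMPTION: reads from a diploid individual are split evenly across
# its two haplotypes, so each derived copy is covered by about 3.5 reads.
reads_per_haplotype = COVERAGE_PER_INDIVIDUAL / 2
supporting_reads = derived_copies * reads_per_haplotype

print(f"derived copies: {derived_copies}")
print(f"expected supporting reads: {supporting_reads}")
```

Under that assumption the result is close to the roughly 80 reads cited, far more support than a sporadic base-calling error would receive.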

Fine-scale mutation spectrum variation in other populations

Encouraged by these results, we sought to find other signatures of recent mutation pulses. We generated heatmaps and PCA plots of mutation spectrum variation within each continental group, looking for fine-scale differences between closely related populations (Figure 4 and Figure 4—figure supplements 1 through 6). In some cases, mutational spectra differ even between very closely related populations. For example, the *AC→*CC mutations with elevated rates in East Asia appear to be distributed heterogeneously within that group, with most of the load carried by a subset of the Japanese individuals. These individuals also have elevated rates of ACA→AAA and TAT→TTT mutations (Figure 4A and Figure 4—figure supplement 4). This signature appears to be present in only a handful of Chinese individuals, and no Kinh or Dai individuals. As seen for the European TCC mutation, the enrichment of these mutation types peaks at low frequencies, that is, around 1%. Given the availability of only 200 Japanese individuals in 1000 Genomes, it is hard to say whether the true peak is at a frequency much lower than 1%.

Figure 4. Mutational variation among east Asian populations.

(A) PCA of east Asian samples from 1000 Genomes, based on the relative proportions of each of the 96 mutational types. See Figure 4—figure supplements 2 through 6 for other finescale population PCAs. (B) Heatmaps showing, for pairs of east Asian samples, the ratio of the proportions of SNVs in each of the 96 mutational types. Points indicate significant contrasts at p < 10⁻⁵. See Figure 4—figure supplement 1 for additional finescale heatmaps. (C) and (D) Relative enrichment of each mutational type in Japanese and Dai, respectively, as a function of allele frequency. Six mutation types that are enriched in JPT are indicated. Populations: CDX=Dai, CHB=Han (Beijing); CHS=Han (south China); KHV=Kinh; JPT=Japanese.

DOI: http://dx.doi.org/10.7554/eLife.24284.023

Figure 4—source data 1. This text file shows the number of SNPs in each of the 96 mutational categories that passed all filters in each finescale 1000 Genomes population.
DOI: 10.7554/eLife.24284.024

Figure 4—figure supplement 1. Mutation spectrum differences within Africa, Europe, East Asia, and South Asia.

Figure 4B of the main text shows heat map comparisons between East Asian populations, which display fine-scale differences that are exceptionally well defined. For completeness, this figure shows finescale heatmap comparisons within all 1000 Genomes continental groups. We can see that CAC→CCC and TAT→TTT are heterogeneously distributed within multiple continents, but to the greatest extent in East Asia. In addition, the TCC→TTC signature is somewhat heterogeneously distributed within Europe and South Asia, being depleted in Finns and enriched in the Punjabi and Gujarati. Each continental group in the 1000 Genomes data is divided into five sub-populations. These heat maps compare the mutation spectra of these fine-scale populations to each other. African populations are: MSL = Mende in Sierra Leone; LWK = Luhya in Webuye, Kenya; YRI = Yoruba in Ibadan, Nigeria; GWD = Gambian in Western Divisions; ESN = Esan in Nigeria. European populations are: IBS = Iberian Population in Spain; TSI = Toscani in Italia; GBR = British in England and Scotland; CEU = Utah Residents (CEPH) with Northern and Western Ancestry; FIN = Finnish in Finland. East Asian populations are: CDX = Chinese Dai in Xishuangbanna, China; JPT = Japanese in Tokyo, Japan; CHB = Han Chinese in Beijing, China; CHS = Southern Han Chinese; KHV = Kinh in Ho Chi Minh City, Vietnam. South Asian populations are: ITU = Indian Telugu from the UK; GIH = Gujarati Indian from Houston, Texas; PJL = Punjabi from Lahore, Pakistan; BEB = Bengali from Bangladesh; STU = Sri Lankan Tamil from the UK.
Figure 4—figure supplement 2. PCA of American populations.

Population abbreviations are: CLM = Colombians from Medellin, Colombia; MXL = Mexican Ancestry from Los Angeles, USA; PUR = Puerto Ricans from Puerto Rico; PEL = Peruvians from Lima, Peru; ACB = African Caribbeans in Barbados; ASW = Americans of African Ancestry in SW USA. Admixed populations from the Americas show structure that mirrors the continental groups, with PC1 essentially measuring the ratio between African and non-African ancestry and PC2 measuring the ratio between European and Native American ancestry. The accompanying heat maps show the loadings of the first two principal components.
Figure 4—figure supplement 3. PCA of African populations.

Population abbreviations are: MSL = Mende in Sierra Leone; LWK = Luhya in Webuye, Kenya; YRI = Yoruba in Ibadan, Nigeria; GWD = Gambian in Western Divisions; ESN = Esan in Nigeria.
Figure 4—figure supplement 4. PCA of East Asian populations.

Population abbreviations are: CDX = Chinese Dai in Xishuangbanna, China; JPT = Japanese in Tokyo, Japan; CHB = Han Chinese in Beijing, China; CHS = Southern Han Chinese; KHV = Kinh in Ho Chi Minh City, Vietnam.
Figure 4—figure supplement 5. PCA of South Asian populations.

Population abbreviations are: ITU = Indian Telugu from the UK; GIH = Gujarati Indian from Houston, Texas; PJL = Punjabi from Lahore, Pakistan; BEB = Bengali from Bangladesh; STU = Sri Lankan Tamil from the UK.
Figure 4—figure supplement 6. PCA of European populations.

Population abbreviations are: IBS = Iberian Population in Spain; TSI = Toscani in Italia; GBR = British in England and Scotland; CEU = Utah Residents (CEPH) with Northern and Western Ancestry; FIN = Finnish in Finland.

PCA reveals relatively little fine-scale structure within the mutational spectra of Europeans or South Asians (Figure 4—figure supplements 5 and 6). However, Africans exhibit some substructure (Figure 4—figure supplement 3), with the Luhya exhibiting the most distinctive mutational spectrum. Unexpectedly, a closer examination of PC loadings reveals that the Luhya outliers are enriched for the same mutational signature identified in the Japanese. Even in Europeans and South Asians, the first PC is heavily weighted toward *AC→*CC, ACA→AAA, and TAT→TTT, although this signature explains less of the mutation spectrum variance within these more homogeneous populations. The sharing of this signature may suggest either parallel increases of a shared mutation modifier, or a shared aspect of environment or life history that affects the mutation spectrum.

Mutation spectrum variation among the great apes

Finally, given our finding of extensive fine-scale variation in mutational spectra between human populations, we hypothesized that mutational variation between species is likely to be even greater. To compare the mutation spectra of the great apes in more detail, we obtained SNV data from the Great Ape Diversity Panel, which includes 78 whole genome sequences from six great ape species including human (Prado-Martinez et al., 2013). Overall, we find dramatic variation in mutational spectra among the great ape species (Figure 5 and Figure 5—figure supplement 1).

Figure 5. Mutational differences among the great apes.

(A) Relative abundance of SNV types in 5 ape species compared to Bornean Orangutan; data from (Prado-Martinez et al., 2013). Boxes indicate labels in (B). For additional comparisons see Figure 5—figure supplement 1. (B) Schematic phylogeny of the great apes highlighting notable changes in SNV abundance.

DOI: http://dx.doi.org/10.7554/eLife.24284.031

Figure 5—figure supplement 1. Mutation spectra of great apes.

These heatmap comparisons demonstrate that closely related great apes such as Chimpanzees and Bonobos have more similar mutation spectra than more distantly related apes do.

As noted previously (Moorjani et al., 2016a), one major trend is a higher proportion of CpG mutations among the species closest to human, possibly reflecting lengthening generation time along the human lineage, consistent with previous indications that species closely related to humans have lower mutation rates than more distant species (Goodman, 1961; Li and Tanimura, 1987; Scally and Durbin, 2012). However, most other differences are not obviously related to known processes such as biased gene conversion and generation time change. The A→T mutation rate appears to have sped up in the common ancestor of humans, chimpanzees, and bonobos, a change that appears consistent with a mutator variant that was fixed instead of lost. It is unclear whether this ancient A→T speedup is related to the A→T speedup in East Asians. Other mutational signatures appear on only a single branch of the great ape tree, such as a slowdown of A→C mutations in gorillas.

Discussion

The widespread differences captured in Figures 1 and 2 may be footprints of allele frequency shifts affecting different mutator alleles. But in principle, other genetic and non-genetic processes may also impact the observed mutational spectrum. First, biased gene conversion (BGC) tends to favor C/G alleles over A/T, and BGC is potentially more efficient in populations of large effective size compared to populations of smaller effective size (Galtier et al., 2001). However, despite the bottlenecks that are known to have affected Eurasian diversity, there is no clear trend of an increased fraction of C/G→A/T relative to A/T→C/G in non-Africans vs Africans, or with distance from Africa (Figure 1—figure supplement 7), and previous studies have also found little evidence for a strong genome-wide effect of BGC on the mutational spectrum in humans and great apes (Do et al., 2015; Moorjani et al., 2016a). For these reasons, we think that evolution of the mutational process is a better explanation than BGC or selection for differences that have been observed between the spectra of ultra-rare singleton variants and older human genetic variation (Carlson et al., 2017).

It is also known that shifts in generation time or other life-history traits may affect mutational spectra, particularly for CpG transitions (Martin and Palumbi, 1993; Amster and Sella, 2016). Most CpG transitions result from spontaneous methyl-cytosine deamination as opposed to errors in DNA replication. Hence the rate of CpG transitions is less affected by generation time than that of other mutations (Hwang and Green, 2004; Moorjani et al., 2016b; Gao et al., 2016). We observe that Europeans have a lower fraction of CpG variants compared to Africans, East Asians and South Asians (Figure 1B), consistent with a recent report of a lower rate of de novo CCG→CTG mutations in European individuals compared to Pakistanis (Narasimhan et al., 2016). Such a pattern may be consistent with a shorter average generation time in Europeans (Moorjani et al., 2016b), although it is unclear whether a plausible shift in generation time could produce such a large effect. Apart from this, the other patterns evident in Figure 1 do not seem explainable by known processes.

In summary, we report here that mutational spectra differ significantly among closely related human populations, and that they differ greatly among the great ape species. Our work shows that subtle, concerted shifts in the frequencies of many different mutation types are more widespread than dramatic jumps in the rate of single mutation types, although the existence of the European TCC→TTC pulse shows that both modes of evolution do occur (Harris, 2015; Moorjani et al., 2016b; Mathieson and Reich, 2016).

At this time, we cannot exclude a role for nongenetic factors such as changes in life history or mutagen exposure in driving these signals. However, given the sheer diversity of the effects reported here, it seems parsimonious to us to propose that most of this variation is driven by the appearance and drift of genetic modifiers of mutation rate. This situation is perhaps reminiscent of the earlier observation that genome-wide recombination patterns are variable among individuals (Coop et al., 2008), and the ultimate discovery of PRDM9 (Baudat et al., 2010), although in this case it is unlikely that a single gene is responsible for all signals seen here. As large datasets of de novo mutations become available, it should be possible to map mutator loci genome-wide. Overall, our results suggest that mutational modifiers are likely to be an important part of the landscape of human genetic variation.

Materials and methods

Data availability

All datasets analyzed here are publicly available at the following websites:

Human mutation spectrum processing

Mutation spectra were computed using 1000 Genomes Phase 3 SNPs (Auton et al., 2015) that are biallelic, pass all 1000 Genomes quality filters, and are not adjacent to any N’s in the hg19 reference sequence. Ancestral states were assigned using the UCSC Genome Browser alignment of hg19 to the PanTro2 chimpanzee reference genome; SNPs were discarded if neither the reference nor alternate allele matched the chimpanzee reference. To minimize the potential impact of ancestral misidentification errors, SNPs with derived allele frequency higher than 0.98 were discarded. We also filtered out regions annotated as ‘conserved’ based on the 100-way PhyloP conservation score (Pollard et al., 2010), downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/, as well as regions annotated as repeats by RepeatMasker (Smit et al., 2013), downloaded from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/nestedRepeats.txt.gz. To be counted as part of the mutation spectrum of population P (which can be either a continental group or a finer-scale population from one city), a SNP should not be a singleton within population P; that is, at least two copies of the ancestral and derived alleles must be present within that population.
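
The polarization and frequency filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the paper: the `Snp` record, its field names, and the choice to apply the 0.98 threshold within a single population are assumptions made for the example, and the conserved-region and repeat filters are omitted.

```python
from dataclasses import dataclass

MAX_DERIVED_FREQ = 0.98  # threshold stated in the text


@dataclass
class Snp:
    """Hypothetical record for one biallelic SNP in one population."""
    ref: str           # reference allele
    alt: str           # alternate allele
    chimp: str         # aligned chimpanzee base
    alt_count: int     # copies of the alternate allele in the population
    n_haplotypes: int  # number of haplotypes sampled from the population


def polarize(snp):
    """Return (ancestral, derived, derived_count), or None if unpolarizable."""
    if snp.chimp == snp.ref:
        return snp.ref, snp.alt, snp.alt_count
    if snp.chimp == snp.alt:
        return snp.alt, snp.ref, snp.n_haplotypes - snp.alt_count
    return None  # neither allele matches the chimpanzee base


def passes_filters(snp):
    """Apply the ancestral-state, high-frequency, and non-singleton filters."""
    polarized = polarize(snp)
    if polarized is None:
        return False
    derived_count = polarized[2]
    if derived_count / snp.n_haplotypes > MAX_DERIVED_FREQ:
        return False
    ancestral_count = snp.n_haplotypes - derived_count
    # At least two copies of both alleles must be present in the population.
    return derived_count >= 2 and ancestral_count >= 2
```

For instance, under these assumptions `passes_filters(Snp("A", "G", "A", 5, 100))` is true, whereas a SNP whose chimpanzee base matches neither allele is rejected.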

An identical approach was used to extract the mutation spectrum of the UK10K ALSPAC panel (Walter et al., 2015), which is not subdivided into smaller populations. The data were filtered as described in Field et al. (2016). The filtering procedure performed by Field et al. (2016) reduces the ALSPAC sample size to 1927 individuals.

We also computed mutation spectra of the Simons Genome Diversity Panel (SGDP) populations (Mallick et al., 2016). Four of the SGDP populations, West Eurasia, East Asia, South Asia, and Africa, were compared to their direct counterparts in the 1000 Genomes data. Three additional SGDP populations, Central Asia and Siberia, Oceania, and America, had no close 1000 Genomes counterparts and were not analyzed here (although each project contained a panel of people from the Americas, the composition of the American panels was extremely different, with the 1000 Genomes populations being much more admixed with Europeans and Africans). SGDP sites with more than 20% missing data were not utilized. All other data processing was done in the same way as described for the 1000 Genomes data.

The following table gives the sample size of each population panel, as well as the total number of SNPs segregating in the panel that are used to compute mutation type ratios:

Dataset        Population     Number of individuals   Number of SNPs
1000 Genomes   Africa         504                     16,870,400
1000 Genomes   Europe         503                     8,508,040
1000 Genomes   East Asia      504                     7,895,925
1000 Genomes   South Asia     489                     9,552,781
SGDP           Africa         45                      6,569,658
SGDP           West Eurasia   69                      4,201,571
SGDP           East Asia      49                      3,312,645
SGDP           South Asia     38                      3,449,624

Great ape diversity panel data processing

Biallelic great ape SNPs were extracted from the Great Ape Diversity Panel VCF (Prado-Martinez et al., 2013), which is aligned to the hg18 human reference sequence. Ancestral states were assigned using the Great Ape Genetic Diversity project annotation, which used the Felsenstein pruning algorithm to assign allelic states to internal nodes in the great ape tree. In the Great Ape Diversity Panel, the most recent common ancestor (MRCA) of the human species is labeled as node 18; the MRCAs of chimpanzees, bonobos, gorillas, and orangutans, respectively, are labeled as node 16, node 17, node 19, and node 15. We extracted the state of each MRCA at each SNP in the alignment and used it to polarize the ancestral and derived allele at that site; a SNP was discarded whenever the ancestral node was assigned an uncertain or polymorphic ancestral state. As with the human data, SNPs with derived allele frequency higher than 0.98 were not used, and both repeats and PhyloP-annotated conserved regions were filtered away.

Visual representation of mutation spectra

The mutation type of an SNP is defined in terms of its ancestral allele, its derived allele, and its two immediate 5’ and 3’ neighbors. Two mutation types are considered equivalent if they are strand-complementary to each other (e.g. ACG→ATG is equivalent to CGT→CAT). This scheme classifies SNPs into 96 different mutation types, each of which can be represented with an A or C ancestral allele.
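
As an illustration of this classification, the sketch below folds each triplet mutation onto the strand where the ancestral allele is A or C. The function names and the `ANC>DER` string encoding are choices made for the example and do not come from the original analysis.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mutation_type(five_prime, ancestral, derived, three_prime):
    """Return the strand-collapsed triplet mutation type, e.g. 'ACG>ATG'."""
    anc = five_prime + ancestral + three_prime
    der = five_prime + derived + three_prime
    if ancestral in "GT":
        # Represent the mutation on the opposite strand so that the
        # ancestral allele is always A or C.
        anc, der = reverse_complement(anc), reverse_complement(der)
    return anc + ">" + der


# Enumerating every context and substitution recovers the 96 classes.
ALL_TYPES = sorted({
    mutation_type(f, a, d, t)
    for f, a, d, t in product("ACGT", repeat=4)
    if a != d
})
```

With this encoding, `mutation_type("C", "G", "A", "T")` and `mutation_type("A", "C", "T", "G")` both return `"ACG>ATG"`, matching the equivalence given in the text.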

To compute the frequency $f_P(m)$ of mutation type $m$ in population $P$, we count up all SNPs of type $m$ where the derived allele is present in at least one representative of population $P$ (which can be either a specific population such as YRI or a broader continental group such as AFR). After obtaining this count $C_P(m)$, we define $f_P(m)$ to be the ratio $C_P(m) / \sum_{m'} C_P(m')$, where the sum in the denominator ranges over all 96 mutation types $m'$. The enrichment of mutation type $m$ in population $P_1$ relative to population $P_2$ is defined to be $f_{P_1}(m) / f_{P_2}(m)$; these enrichments are visualized as heat maps in Figures 1B, 3B and 4A.
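
A small sketch of these two definitions, assuming the per-type SNP counts are already available as dictionaries keyed by mutation type (the dictionary layout and function names are illustrative only):

```python
def spectrum_fractions(counts):
    """Convert per-type SNP counts C_P(m) into fractions f_P(m)."""
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def enrichment(counts_p1, counts_p2):
    """Return f_P1(m) / f_P2(m) for every type with a nonzero P2 fraction."""
    f1 = spectrum_fractions(counts_p1)
    f2 = spectrum_fractions(counts_p2)
    return {m: f1[m] / f2[m] for m in f1 if f2.get(m, 0) > 0}
```

For example, with toy counts in which a type makes up 30% of SNPs in one population and 20% in another, the enrichment for that type is 1.5.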

To track changes in the mutational spectrum over time, we compute $f_P(m)$ in bins of restricted allele frequency. This involves counting the number of SNPs of type $m$ that are present at frequency $\phi$ in population $P$ to obtain counts $C_P(m,\phi)$ and frequencies $f_P(m,\phi) = C_P(m,\phi) / \sum_{m'} C_P(m',\phi)$. Deviation of the ratio $f_P(m,\phi)/f_P(m)$ from one indicates that the rate of $m$ has fluctuated recently in the history of population $P$. To make the sampling noise approximately uniform across alleles of different frequencies, alleles of derived count greater than five were grouped into approximately log-spaced bins that each contained similar numbers of UK10K SNPs. More precisely, we defined a set of bin endpoints $b_1, b_2, \ldots$ such that the total number of SNPs ranging in derived allele count between $b_i$ and $b_{i+1}-1$ is greater than or equal to the number of 5-ton SNPs, while the total number of SNPs ranging in derived allele count from $b_i$ to $b_{i+1}-2$ is less than the number of 5-ton SNPs.
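
One way to construct such endpoints is a greedy scan over derived allele counts, closing a bin as soon as it holds at least as many SNPs as the 5-ton class. The sketch below assumes a dictionary `snps_by_count` mapping derived allele count to number of SNPs and assumes the first bin starts at count six; any trailing counts that do not fill a complete bin are left unbinned.

```python
def bin_endpoints(snps_by_count, first_binned_count=6, reference_count=5):
    """Return endpoints b_1, b_2, ... of bins [b_i, b_{i+1} - 1].

    Each complete bin holds at least as many SNPs as the reference class
    (here the 5-tons), and would hold fewer if its last count were dropped.
    """
    target = snps_by_count[reference_count]
    max_count = max(snps_by_count)
    endpoints = [first_binned_count]
    running = 0
    for count in range(first_binned_count, max_count + 1):
        running += snps_by_count.get(count, 0)
        if running >= target:
            endpoints.append(count + 1)  # next bin starts at the next count
            running = 0
    return endpoints
```

Because the scan stops at the first count where the running total reaches the target, the total over $b_i$ to $b_{i+1}-2$ is necessarily below the target, as the definition requires.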

In some cases, for example Figure 2, Figure 2—figure supplement 1B and Figure 3—figure supplement 1, site frequency spectra were projected down to a smaller sample size before counting SNPs in order to more accurately compare datasets of different sample sizes. A binomial sampling approach was used to project a sample of $N$ haplotypes down to a smaller sample size $n$. Letting $C_P^{(N)}(m,\phi)$ denote the SNP counts in the large sample of $N$ haplotypes, effective SNP counts $C_P^{(n)}(m,\phi)$ in a sample of $n$ haplotypes are computed as follows:

$$C_P^{(n)}(m, k/n) = \binom{n}{k} \sum_{\ell=1}^{N-1} (\ell/N)^{k} (1-\ell/N)^{n-k} \, C_P^{(N)}(m, \ell/N)$$
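
The projection formula can be written directly as a double loop. In this sketch, `counts_large[l]` is assumed to hold the number of SNPs of one mutation type with derived allele count `l` among `N` haplotypes (indices 0 through N); the list layout and function name are illustrative.

```python
from math import comb


def project_counts(counts_large, n):
    """Binomially project SNP counts from N haplotypes down to n haplotypes.

    Returns a list of length n + 1 whose k-th entry is the effective number
    of SNPs with derived allele count k in the smaller sample.
    """
    big_n = len(counts_large) - 1
    projected = [0.0] * (n + 1)
    for ell in range(1, big_n):  # segregating classes 1 .. N-1
        count = counts_large[ell]
        if count == 0:
            continue
        p = ell / big_n
        for k in range(n + 1):
            projected[k] += comb(n, k) * p**k * (1 - p)**(n - k) * count
    return projected
```

Entries 0 and n of the result collect SNPs that would appear monomorphic in the smaller sample, so only entries 1 through n-1 correspond to segregating sites.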

Significance testing

One central goal of this paper is to test whether many mutation types differ in rate between human populations or whether mutation spectrum shifts have been rare events affecting only a small proportion of mutation types. A simple statistical method for answering this question would be to perform 96 separate chi-square tests, one for each triplet-context-dependent mutation type, as follows:

Let Si denote the total number of SNPs segregating in population Pi, and let Si(m) denote the number of SNPs of mutation type m. If mutation type m is more prevalent in population P1 than in population P2, a chi-square test provides a natural way of assessing the significance of this difference. As described in Harris (2015), this test is performed on the following two-by-two contingency table:

$$\begin{array}{cc} S_1(m) & S_1 - S_1(m) \\ S_2(m) & S_2 - S_2(m) \end{array}$$
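
For concreteness, a two-by-two chi-square test of this kind can be sketched with the standard library alone. The example takes the second column of the table to be the SNPs of all other mutation types, $S_i - S_i(m)$, and uses the Pearson statistic without continuity correction; both are assumptions of this sketch rather than details confirmed by the text.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test on the table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom, the
    chi-square survival function equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    statistic = n * (a * d - b * c) ** 2 / margins
    return statistic, erfc(sqrt(statistic / 2))


def mutation_type_test(s1_m, s1_total, s2_m, s2_total):
    """Compare the fraction of type-m SNPs between two populations."""
    return chi_square_2x2(s1_m, s1_total - s1_m, s2_m, s2_total - s2_m)
```

A call such as `mutation_type_test(300, 1000, 200, 1000)` returns the statistic and p value for a type that makes up 30% of SNPs in one population and 20% in the other.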

It would be appealing to conclude that every mutation type ‘passing’ this chi-square test is a mutation type that has changed in rate during recent human history. However, if we were to perform the full set of 96 tests, they would not be independent. A sufficiently large increase in the rate of one mutation type $m_1$ in population $P_1$ after divergence from $P_2$ could cause another mutation type $m_2$, whose rate has remained constant, to comprise significantly different fractions of the SNPs from $P_1$ and $P_2$. To minimize this effect, we formulate the following iterative procedure of conditionally independent tests: first, compute a chi-square significance value $p_{\mathrm{unordered}}(m)$ for each mutation type $m$ using the two-by-two chi-square table above. We then use these values to order the mutation types from lowest p value to highest and compute a set of ordered p values $p_{\mathrm{ordered}}(m)$. For the mutation type $m_1$ with the lowest unordered p value, $p_{\mathrm{unordered}}(m_1) = p_{\mathrm{ordered}}(m_1)$. For mutation type $m_i$, which has the $i$th lowest unordered p value and $i < 96$, $p_{\mathrm{ordered}}(m_i)$ is computed from the following contingency table:

$$\begin{array}{cc} S_1(m_i) & \sum_{j=i+1}^{96} S_1(m_j) \\ S_2(m_i) & \sum_{j=i+1}^{96} S_2(m_j) \end{array}$$

For mutation type m96, which has the highest unordered p value, the ordered p value is computed from the contingency table

$$\begin{array}{cc} S_1(m_{96}) & S_1(m_{95}) \\ S_2(m_{96}) & S_2(m_{95}) \end{array}$$

Compared to separate chi-square tests, this procedure is guaranteed to identify fewer mutation types as differing significantly in rate between populations.
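
The ordered testing procedure can be sketched as follows, assuming per-type SNP counts for the two populations are stored in dictionaries with identical keys. The helper `chi2_p` is an uncorrected Pearson test written for this example, and the unordered table is taken to compare each type against all remaining SNPs; neither detail is confirmed by the text.

```python
from math import erfc, sqrt


def chi2_p(a, b, c, d):
    """P value of a Pearson chi-square test on [[a, b], [c, d]] (1 d.f.)."""
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0
    statistic = (a + b + c + d) * (a * d - b * c) ** 2 / margins
    return erfc(sqrt(statistic / 2))


def ordered_p_values(s1, s2):
    """Return (unordered, ordered) p values keyed by mutation type."""
    total1, total2 = sum(s1.values()), sum(s2.values())
    unordered = {
        m: chi2_p(s1[m], total1 - s1[m], s2[m], total2 - s2[m]) for m in s1
    }
    ranked = sorted(s1, key=unordered.get)  # lowest p value first
    ordered = {}
    for i, m in enumerate(ranked):
        if i < len(ranked) - 1:
            # Compare type i only against the types ranked after it.
            rest = ranked[i + 1:]
            rest1 = sum(s1[x] for x in rest)
            rest2 = sum(s2[x] for x in rest)
        else:
            # The last type is compared against the one ranked just before.
            rest1, rest2 = s1[ranked[i - 1]], s2[ranked[i - 1]]
        ordered[m] = chi2_p(s1[m], rest1, s2[m], rest2)
    return unordered, ordered
```

By construction, the top-ranked type has identical ordered and unordered p values, since the types ranked after it are exactly all other SNPs.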

Principal component analysis

The Python class matplotlib.mlab.PCA was used to perform PCA on the complete set of 1000 Genomes diploid genomes. First, the triplet mutational spectrum of each haplotype $h$ was computed as a 96-element vector encoding the mutation frequencies $(f_h(m))_m$ of the non-singleton derived alleles present on that haplotype. The mutational spectrum of each diploid genome was then computed by averaging together the spectra of its two constituent haplotypes. In the same way, a separate PCA was performed on each of the five continental groups to reveal finescale components of mutation spectrum variation.
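
The sketch below illustrates the two steps with the standard library only: averaging haplotype spectra into a diploid spectrum, then extracting the leading principal component by power iteration. It is a stand-in for a library PCA, not a reproduction of matplotlib.mlab.PCA; in particular, it only centers the data, whereas a library routine may also rescale each column.

```python
from math import sqrt


def diploid_spectrum(hap_a, hap_b):
    """Average the mutation spectra of a genome's two haplotypes."""
    return [(a + b) / 2 for a, b in zip(hap_a, hap_b)]


def first_principal_component(rows, n_iter=200):
    """Return (loadings, scores) of the leading PC via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary deterministic start
    for _ in range(n_iter):
        # Apply X^T X to v without forming the covariance matrix.
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n))
             for j in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in w]
    scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
    return v, scores
```

The sign of the returned loading vector is arbitrary, as with any PCA, so only relative positions of samples along the component are meaningful.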

Dating of the TCC→T mutation pulse

We estimated the duration and intensity of TCC→T rate acceleration in Europe by fitting a simple piecewise-constant rate model to the UK10K frequency data. To specify the parameters of the model, we divide time into discrete log-spaced intervals bounded by time points $t_1, \ldots, t_d$, assigning each interval a TCC→T mutation rate $r_0, \ldots, r_d$. In units of generations before the present, the time discretization points were chosen to be: 20, 40, 200, 400, 800, 1200, 1600, 2000, 2400, 2800, 3200, 3600, 4000, 8000, 12,000, 16,000, 20,000, 24,000, 28,000, 32,000, 36,000, 40,000. We assume that the total rate $r$ of mutations other than TCC→T stays constant over time (a first-order approximation).

In terms of these rate variables, we can calculate the expected shape of the TCC→T pulse shown in Figure 2B of the main text. The shape of this curve depends on both the mutation rate parameters $r_i$ and the demographic history of the European population, which determines the joint distribution of allele frequency and allele age. To account for the effects of demography, we use Hudson’s ms program to simulate 10,000 random coalescent trees under a realistic European demographic history inferred from allele frequency data (Tennessen et al., 2012) and condition our inference upon this collection of trees as follows:

Let $A(m,t)$ be the function for which $\int_{t_i}^{t_{i+1}} A(m,t)\,dt$ equals the coalescent tree branch length, averaged over the sample of simulated trees, that is ancestral to exactly $m$ lineages and falls between time $t_i$ and $t_{i+1}$. Given this function, which can be empirically estimated from a sample of simulated trees, the expected frequency spectrum entry $k/n$ is

$$E(k/n) = \frac{\sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(k,t)\,dt}{\sum_{j=1}^{n} \sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(j,t)\,dt}$$

and the expected fraction of TCC→T mutations in allele frequency bin $k/n$ is

$$E(f_{\mathrm{TCC \to T}}(k/n)) = \frac{\sum_{i=1}^{d} r_i \int_{t_{i-1}}^{t_i} A(k,t)\,dt}{r \sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(k,t)\,dt}.$$

The expected value of the TCC→T enrichment ratio being plotted in Figure 2B is

$$E(r_{\mathrm{TCC \to T}}(k/n)) = \frac{\sum_{i=1}^{d} r_i \int_{t_{i-1}}^{t_i} A(k,t)\,dt \; \sum_{j=1}^{n} \sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(j,t)\,dt}{\sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(k,t)\,dt \; \sum_{j=1}^{n} \sum_{i=1}^{d} r_i \int_{t_{i-1}}^{t_i} A(j,t)\,dt}$$

In Figure 2B, enrichment ratios are not computed for every allele frequency in isolation, but for allele frequency bins that each contain similar numbers of SNPs. Given integers $1 \le k_m < k_{m+1} \le n$, the expected TCC→T enrichment ratio averaged over all SNPs with allele frequency between $k_m/n$ and $k_{m+1}/n$ is:

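$$E(r_{\mathrm{TCC \to T}}(k_m/n)) = \frac{\sum_{i=1}^{d} r_i \int_{t_{i-1}}^{t_i} \sum_{k=k_m}^{k_{m+1}} A(k,t)\,dt \; \sum_{j=1}^{n} \sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} A(j,t)\,dt}{\sum_{i=1}^{d} \int_{t_{i-1}}^{t_i} \sum_{k=k_m}^{k_{m+1}} A(k,t)\,dt \; \sum_{j=1}^{n} \sum_{i=1}^{d} r_i \int_{t_{i-1}}^{t_i} A(j,t)\,dt}$$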
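
Once the interval integrals of $A$ have been estimated from simulated trees, the binned enrichment ratio reduces to a few weighted sums. The sketch below assumes a precomputed table `branch` in which `branch[k - 1][i]` holds the average branch length ancestral to `k` lineages within the `i`-th time interval, and a list `rates` of per-interval rates; both names are hypothetical, and the bin is taken to include both endpoints as in the sum above.

```python
def expected_enrichment(branch, rates, k_lo, k_hi):
    """Expected enrichment ratio for allele counts k_lo..k_hi (inclusive).

    branch[k - 1][i]: mean branch length ancestral to k lineages in time
                      interval i (an empirical integral of A(k, t)).
    rates[i]:         rate of the focal mutation type in interval i.
    """
    in_bin = branch[k_lo - 1:k_hi]
    weighted_bin = sum(r * b for row in in_bin for r, b in zip(rates, row))
    plain_bin = sum(b for row in in_bin for b in row)
    weighted_all = sum(r * b for row in branch for r, b in zip(rates, row))
    plain_all = sum(b for row in branch for b in row)
    return (weighted_bin * plain_all) / (plain_bin * weighted_all)
```

When all interval rates are equal, the ratio is one for every bin, which is a useful sanity check on the inputs.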

We optimize the mutation rates $r_1, \ldots, r_d$ using a log-spaced quantization of allele frequencies $k_1/n, \ldots, k_m/n$ defined such that all bins contain similar numbers of SNPs. The chosen allele count endpoints $k_1, \ldots, k_m$ are: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000. Given this quantization of allele frequencies, we optimize $r_1, \ldots, r_d$ by using the BFGS algorithm to minimize the least squares distance $D(r_0, \ldots, r_d)$ between $E(r_{\mathrm{TCC \to T}}(k_m/n))$ and the empirical ratio $r_{\mathrm{TCC \to T}}(k_m/n)$ computed from the UK10K data. This optimization is subject to a regularization penalty that minimizes the jumps between adjacent mutation rates $r_i$ and $r_{i+1}$:

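$$D(r_0, \ldots, r_d) = \sum_{m=1}^{d} \left( E(r_{\mathrm{TCC \to T}}(k_m/n)) - r_{\mathrm{TCC \to T}}(k_m/n) \right)^2 + 0.25 \sum_{i=1}^{d} (r_{i-1} - r_i)^2$$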
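
The objective itself is simple to evaluate once the expected ratios are in hand. The following sketch assumes the expected and observed enrichment ratios are supplied as equal-length lists, one entry per allele frequency bin; the function and argument names are illustrative.

```python
def objective(expected, observed, rates, penalty=0.25):
    """Least-squares misfit plus a smoothness penalty on adjacent rates.

    expected, observed: enrichment ratios per allele frequency bin.
    rates:              per-interval mutation rates r_0, ..., r_d.
    penalty:            weight on squared jumps between adjacent rates
                        (0.25 in the text).
    """
    misfit = sum((e - o) ** 2 for e, o in zip(expected, observed))
    roughness = sum(
        (rates[i - 1] - rates[i]) ** 2 for i in range(1, len(rates))
    )
    return misfit + penalty * roughness
```

In practice the expected ratios would be recomputed from the candidate rates at every step, and a wrapper of this kind could be handed to a general-purpose BFGS routine such as `scipy.optimize.minimize` with `method="BFGS"`.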

Although the underlying model of mutation rate change assumed here is very simple, it still represents an advance over the method used in Harris (2015) to estimate the timing of the TCC→TTC mutation rate increase. That method relied upon explicit estimates of allele age from a dataset of fewer than 100 individuals, which are much noisier than integration of a joint distribution of allele age and frequency across a sample of thousands of haplotypes.

Acknowledgements

This work was funded by NIH grants GM116381 and HG008140, and by the Howard Hughes Medical Institute. We thank Jeffrey Spence and Yun S Song for technical assistance. We also thank Ziyue Gao, Arbel Harpak, Molly Przeworski, Joshua Schraiber, and Aylwyn Scally for comments and discussion, as well as two anonymous reviewers.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

  • National Institutes of Health NRSA-F32 Grant GM116381 to Kelley Harris.

  • Howard Hughes Medical Institute Investigator Grant to Jonathan K Pritchard.

  • National Institutes of Health R01 Grant HG008140 to Jonathan K Pritchard.

Additional information

Competing interests

The authors declare that no competing interests exist.

Author contributions

KH, Conceptualization, Data curation, Software, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

JKP, Conceptualization, Funding acquisition, Visualization, Writing—original draft, Writing—review and editing.

Additional files

Major datasets

The following previously published datasets were used:

1000 Genomes Project Consortium,2015,1000 Genomes Phase 3,http://www.internationalgenome.org/category/phase-3/,Publicly available at internationalgenome.org

Swapan Mallick,David Reich,et al,2016,Simons Genome Diversity Project,https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/,Publicly available from the Simons Foundation. Directions for downloading available here: http://simonsfoundation.s3.amazonaws.com/share/SCDA/datasets/2014_11_12/StepstodownloadtheSGDPdataset_v4.docx

Prado-Martinez,Tomas Marques-Bonet,et al,2013,Whole genome sequences for a set of 79 great ape individuals. Genome sequencing,https://www.ncbi.nlm.nih.gov/bioproject/PRJNA189439/,Publicly available at NCBI BioProject, submitted as part of the Great Ape Genome Diversity Project (accession no: PRJNA189439)

Monkol Lek,Daniel MacArthur,et al,2016,Exome Aggregation Consortium,http://exac.broadinstitute.org,Summary data publicly available for download at http://exac.broadinstitute.org/downloads

Prado-Martinez,Tomas Marques-Bonet,et al,2013,Great Ape Genome Diversity Project,https://www.ncbi.nlm.nih.gov/sra?term=SRP018689,Publicly available at NCBI Sequence Read Archive (accession no: SRP018689)

References

  1. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale AL, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Imielinsk M, Jäger N, Jones DT, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt AN, Valdés-Mas R, van Buuren MM, van 't Veer L, Vincent-Salomon A, Waddell N, Yates LR, Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR, Australian Pancreatic Cancer Genome Initiative. ICGC Breast Cancer Consortium. ICGC MMML-Seq Consortium. ICGC PedBrain Signatures of mutational processes in human cancer. Nature. 2013;500:415–421. doi: 10.1038/nature12477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Amster G, Sella G. Life history effects on the molecular clock of autosomes and sex chromosomes. PNAS. 2016;113:1588–1593. doi: 10.1073/pnas.1515798113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR, 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Baudat F, Buard J, Grey C, Fledel-Alon A, Ober C, Przeworski M, Coop G, de Massy B. PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science. 2010;327:836–840. doi: 10.1126/science.1183439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bui DT, Friedrich A, Al-Sweel N, Liti G, Schacherer J, Aquadro CF, Alani E. Mismatch repair incompatibilities in diverse yeast populations. Genetics. 2017;205:1459–1471. doi: 10.1534/genetics.116.199513. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Carlson J, Scott L, Locke A, Flickinger M, Levy S. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. bioRXiv. 2017 doi: 10.1101/108290. [DOI] [PMC free article] [PubMed]
  7. Coop G, Wen X, Ober C, Pritchard JK, Przeworski M. High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science. 2008;319:1395–1398. doi: 10.1126/science.1151851. [DOI] [PubMed] [Google Scholar]
  8. Do R, Balick D, Li H, Adzhubei I, Sunyaev S, Reich D. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nature Genetics. 2015;47:126–131. doi: 10.1038/ng.3186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Field Y, Boyle EA, Telis N, Gao Z, Gaulton KJ, Golan D, Yengo L, Rocheleau G, Froguel P, McCarthy MI, Pritchard JK. Detection of human adaptation during the past 2000 years. Science. 2016;354:760–764. doi: 10.1126/science.aag0776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Galtier N, Piganeau G, Mouchiroud D, Duret L. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics. 2001;159:907–911. doi: 10.1093/genetics/159.2.907. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gao Z, Wyman MJ, Sella G, Przeworski M. Interpreting the dependence of mutation rates on age and time. PLoS Biology. 2016;14:e1002355. doi: 10.1371/journal.pbio.1002355. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Goodman M. The role of immunochemical differences in the phyletic development of human behavior. Human Biology. 1961;33:131–162. [PubMed] [Google Scholar]
  13. Harris K. Evidence for recent, population-specific evolution of the human mutation rate. PNAS. 2015;112:3439–3444. doi: 10.1073/pnas.1418652112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Helleday T, Eshtad S, Nik-Zainal S. Mechanisms underlying mutational signatures in human cancers. Nature Reviews Genetics. 2014;15:585–598. doi: 10.1038/nrg3729. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Hernandez RD, Williamson SH, Bustamante CD. Context dependence, ancestral misidentification, and spurious signatures of natural selection. Molecular Biology and Evolution. 2007;24:1792–1800. doi: 10.1093/molbev/msm108. [DOI] [PubMed] [Google Scholar]
  16. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, Giardine B, Ellenbogen PM, Bilmes JA, Birney E, Hardison RC, Dunham I, Kellis M, Noble WS. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research. 2013;41:827–841. doi: 10.1093/nar/gks1284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. PNAS. 2004;101:13994–14001. doi: 10.1073/pnas.0404142101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, Wong WS, Sigurdsson G, Walters GB, Steinberg S, Helgason H, Thorleifsson G, Gudbjartsson DF, Helgason A, Magnusson OT, Thorsteinsdottir U, Stefansson K. Rate of de novo mutations and the importance of father's age to disease risk. Nature. 2012;488:471–475. doi: 10.1038/nature11396. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O'Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–291. doi: 10.1038/nature19057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Li WH, Tanimura M. The molecular clock runs more slowly in man than in apes and monkeys. Nature. 1987;326:93–96. doi: 10.1038/326093a0. [DOI] [PubMed] [Google Scholar]
  21. Lynch M. The lower bound to the evolution of mutation rates. Genome Biology and Evolution. 2011;3:1107–1118. doi: 10.1093/gbe/evr066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, Skoglund P, Lazaridis I, Sankararaman S, Fu Q, Rohland N, Renaud G, Erlich Y, Willems T, Gallo C, Spence JP, Song YS, Poletti G, Balloux F, van Driem G, de Knijff P, Romero IG, Jha AR, Behar DM, Bravi CM, Capelli C, Hervig T, Moreno-Estrada A, Posukh OL, Balanovska E, Balanovsky O, Karachanak-Yankova S, Sahakyan H, Toncheva D, Yepiskoposyan L, Tyler-Smith C, Xue Y, Abdullah MS, Ruiz-Linares A, Beall CM, Di Rienzo A, Jeong C, Starikovskaya EB, Metspalu E, Parik J, Villems R, Henn BM, Hodoglugil U, Mahley R, Sajantila A, Stamatoyannopoulos G, Wee JT, Khusainova R, Khusnutdinova E, Litvinov S, Ayodo G, Comas D, Hammer MF, Kivisild T, Klitz W, Winkler CA, Labuda D, Bamshad M, Jorde LB, Tishkoff SA, Watkins WS, Metspalu M, Dryomov S, Sukernik R, Singh L, Thangaraj K, Pääbo S, Kelso J, Patterson N, Reich D. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016;538:201–206. doi: 10.1038/nature18964. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Martin AP, Palumbi SR. Body size, metabolic rate, generation time, and the molecular clock. PNAS. 1993;90:4087–4091. doi: 10.1073/pnas.90.9.4087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Mathieson I, Reich DE. Variation in mutation rates among human populations. bioRxiv. 2016 doi: 10.1101/063578. [DOI]
  25. Moorjani P, Amorim CE, Arndt PF, Przeworski M. Variation in the molecular clock of primates. PNAS. 2016a;113:10607–10612. doi: 10.1073/pnas.1600374113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Moorjani P, Gao Z, Przeworski M. Human germline mutation and the erratic molecular clock. bioRxiv. 2016b doi: 10.1101/058024. [DOI] [PMC free article] [PubMed]
  27. Narasimhan V, Rahbari R, Scally A, Wuster A, Mason D, Xue Y, Wright J, Trembath R, Maher E, van Heel D, Auton A, Hurles M, Tyler-Smith C, Durbin R. A direct multi-generational estimate of the human mutation rate from autozygous segments seen in thousands of parentally related individuals. bioRxiv. 2016 doi: 10.1101/059436. [DOI]
  28. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD. Genes mirror geography within Europe. Nature. 2008;456:98–101. doi: 10.1038/nature07331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Ohno M, Sakumi K, Fukumura R, Furuichi M, Iwasaki Y, Hokama M, Ikemura T, Tsuzuki T, Gondo Y, Nakabeppu Y. 8-oxoguanine causes spontaneous de novo germline mutations in mice. Scientific Reports. 2014;4:4689. doi: 10.1038/srep04689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research. 2010;20:110–121. doi: 10.1101/gr.097857.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O'Connor TD, Santpere G, Cagan A, Theunert C, Casals F, Laayouni H, Munch K, Hobolth A, Halager AE, Malig M, Hernandez-Rodriguez J, Hernando-Herraez I, Prüfer K, Pybus M, Johnstone L, Lachmann M, Alkan C, Twigg D, Petit N, Baker C, Hormozdiari F, Fernandez-Callejo M, Dabad M, Wilson ML, Stevison L, Camprubí C, Carvalho T, Ruiz-Herrera A, Vives L, Mele M, Abello T, Kondova I, Bontrop RE, Pusey A, Lankester F, Kiyang JA, Bergl RA, Lonsdorf E, Myers S, Ventura M, Gagneux P, Comas D, Siegismund H, Blanc J, Agueda-Calpena L, Gut M, Fulton L, Tishkoff SA, Mullikin JC, Wilson RK, Gut IG, Gonder MK, Ryder OA, Hahn BH, Navarro A, Akey JM, Bertranpetit J, Reich D, Mailund T, Schierup MH, Hvilsom C, Andrés AM, Wall JD, Bustamante CD, Hammer MF, Eichler EE, Marques-Bonet T. Great ape genetic diversity and population history. Nature. 2013;499:471–475. doi: 10.1038/nature12228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Scally A, Durbin R. Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics. 2012;13:745–753. doi: 10.1038/nrg3295. [DOI] [PubMed] [Google Scholar]
  33. Seoighe C, Scally A. Inference of candidate germline mutator loci in humans from genome-wide haplotype data. PLoS Genetics. 2017;13:e1006549. doi: 10.1371/journal.pgen.1006549. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Shinbrot E, Henninger EE, Weinhold N, Covington KR, Göksenin AY, Schultz N, Chao H, Doddapaneni H, Muzny DM, Gibbs RA, Sander C, Pursell ZF, Wheeler DA. Exonuclease mutations in DNA polymerase epsilon reveal replication strand specific mutation patterns and human origins of replication. Genome Research. 2014;24:1740–1750. doi: 10.1101/gr.174789.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Smit A, Hubley R, Green P. RepeatMasker Open-4.0. 2013. http://www.repeatmasker.org
  36. Sung W, Ackerman MS, Miller SF, Doak TG, Lynch M. Drift-barrier hypothesis and mutation-rate evolution. PNAS. 2012;109:18488–18492. doi: 10.1073/pnas.1216223109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Ségurel L, Wyman MJ, Przeworski M. Determinants of mutation rate variation in the human germline. Annual Review of Genomics and Human Genetics. 2014;15:47–70. doi: 10.1146/annurev-genom-031714-125740. [DOI] [PubMed] [Google Scholar]
  38. Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM, Broad GO. Seattle GO. NHLBI Exome Sequencing Project Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012;337:64–69. doi: 10.1126/science.1219240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Walter K, Min JL, Huang J, Crooks L, Memari Y, McCarthy S, Perry JR, Xu C, Futema M, Lawson D, Iotchkova V, Schiffels S, Hendricks AE, Danecek P, Li R, Floyd J, Wain LV, Barroso I, Humphries SE, Hurles ME, Zeggini E, Barrett JC, Plagnol V, Richards JB, Greenwood CM, Timpson NJ, Durbin R, Soranzo N, UK10K Consortium The UK10K project identifies rare variants in health and disease. Nature. 2015;526:82–90. doi: 10.1038/nature14962. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Woodfine K, Fiegler H, Beare DM, Collins JE, McCann OT, Young BD, Debernardi S, Mott R, Dunham I, Carter NP. Replication timing of the human genome. Human Molecular Genetics. 2004;13:191–202. doi: 10.1093/hmg/ddh016. [DOI] [PubMed] [Google Scholar]
eLife. 2017 Apr 25;6:e24284. doi: 10.7554/eLife.24284.043

Decision letter

Editor: Gilean McVean

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Rapid evolution of the human mutation spectrum" for consideration by eLife. Your article has been favorably evaluated by Detlef Weigel (Senior Editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors. The following individual involved in review of your submission has agreed to reveal his identity: Aylwyn Scally (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

There are often discrepancies in mutation rates and spectra inferred from short term observations and long term comparisons. One potential explanation is that these rates and spectra are not fixed, but vary in the course of evolution. This paper makes an important contribution, by documenting changes in the mutation spectrum during human evolution. It presents an analysis of population differences in the spectrum of context-dependent single nucleotide polymorphism, focused on humans and great apes. A previous observation about differences in mutation spectrum between human populations is replicated here and a hypothesis about a historical burst of mutation is presented. Many additional, weaker differences are also seen but replicated across data sets, which argues for biological, rather than experimental explanations. Very substantial differences among great apes are described.

Essential revisions:

The original reviews are below. In terms of revision, the only major issue that we wish to see addressed concerns the analysis of the burst hypothesis. Specifically, there is a potential concern about whether certain artefacts could explain the apparent lack of evidence for the population-specific bias among the most recent mutations and questions about differences among populations in the relationship between allele frequency and age. There are also suggestions about how to look at patterns along the genome to hunt for clues as to possible causes.

We have left the full reviews in as we think there are other ideas here that you may wish to pick up on, but pursuing them is not essential for the revision. We look forward to seeing the revision.

Reviewer #1:

This paper presents an analysis of population differences in the spectrum of context-dependent single nucleotide polymorphism, primarily focused on humans, but also analysing data from great apes. One of the authors previously reported a substantial difference in one particular type. This is replicated here and augmented with a large number of other, much weaker findings. These are replicated across data sets, which argues for biological, rather than experimental explanations. Very substantial differences among great apes are described. One specific hypothesis, about an APOBEC mutation, is assessed and some – moderately weak – evidence for association is seen.

The analyses presented are basically well done and make a reasonably compelling case that there are repeatable differences in mutational spectra. The obvious – and I think rather important – criticism is that this work moves us no further along in terms of identifying causal factors. The authors argue for a contribution of transient mutator phenotypes. However, it is not clear how plausible this model is – most models of mutators suggest that they tend not to persist in sexually reproducing populations. This could potentially be explored by simulation. The authors argue that non-genetic factors – e.g. environmental exposure – are unlikely to explain the phenomena, though no hard evidence is given; I do agree that the great ape differences are compatible with a substantial genetic component.

The only other substantial comment I have is around the analysis that fits the burst of TCC-to-TTC mutations. My concern is that sequencing data sets will have higher error rates at low frequencies and likely differential discovery based on sequence context due to systematic fluctuation in sequencing depth (true for both UK10K and 1000G). Hence, I have a concern that the most recent TCC->TTC mutations could be lost or swamped in a way that leads to an apparent burst when the process is still active.

Reviewer #2:

This is a nice paper on a topic of current interest and relevance in human genetics and genomics. It points to important evidence for recent variability over time in at least some aspects of the human mutation rate, something which until now we have only been able to speculate about, and which has potentially broad implications for human evolutionary genetics.

I have listed below a few thoughts and comments, including some things I think the authors should address. However, I found the paper well written and well presented, and have no major issues to raise.

It would be good to get a sense of the raw numbers involved. What do the relative differences between populations mean in terms of actual numbers of variants? For example, what actual density of additional derived T alleles are there in Europe compared to Africa for the TCC->TTC signal?

Is there an issue with ascertainment bias due to demography, in that the allele frequency spectrum varies between populations due to their differing demographic histories, and this might differentially affect the ascertainment of variants in different spectral classes, depending on their relative abundance?

It would be interesting to know how the structure presented in Figure 1 varies with the age of the variants used to construct it (or, as a proxy, their allele frequency). Presumably one would expect the differences between populations to disappear as one excludes more recent variants, as older ones are more likely to be shared. Is this the case?

I think the procedure used to estimate statistical significance, referred to as 'a forward variable selection procedure' needs a better description and motivation. It's not wholly clear to me how the procedure adopted achieves its goal of minimising the interdependence of the tests for each mutation type. It looks to me like some form of partitioned chi-squared test for comparing multiple proportions, but I don't know this statistical literature well – can you cite a useful reference or else explain how you arrived at it? I'd be happy with just testing for a significant overall difference in spectrum between two populations – is testing for significance of individual components of the spectrum really necessary?

For the differences within the great ape tree, did you also compare with an outgroup such as macaque or gibbon? It wasn't clear how you polarise the C-T rate difference as an increase on the Pongine branch, and the CpG rate difference as an increase in Hominines.

Reviewer #3:

This article extends previous knowledge on heterogeneity of mutational spectra between populations. It details differences in frequencies of specific mutation types between diverse population groups. Moreover, the authors date the TCC->TTC mutational pulse for the European population. The arguments supporting the main conclusion of the study seem convincing, although a discussion of the possible importance of evolutionary forces other than mutation would be helpful (e.g. interaction between BGC and demographic history). There is a noticeable overlap with the earlier work, and the manuscript would strongly benefit from additional analyses of the observed mutational patterns.

For example, are the relative increases of mutation types uniformly distributed along the genome or enriched in specific genomic locations? Are they associated with epigenomic features or display asymmetry with respect to transcription or replication? Are they dependent on local recombination rate? Any analysis suggestive of a mechanistic hypothesis underlying the observation would strengthen the paper.

In contrast to the main result of the manuscript, I am skeptical about the conjecture related to the APOBEC-induced mutagenesis. It is not statistically sound and is based on arbitrary thresholds. Is it possible to compare replication asymmetry of APOBEC-like mutations (TC[A/T]-> [C/G]) between populations with different frequencies of variants associated with breast or bladder cancer?

eLife. 2017 Apr 25;6:e24284. doi: 10.7554/eLife.24284.044

Author response


Essential revisions:

The original reviews are below. In terms of revision, the only major issue that we wish to see addressed concerns the analysis of the burst hypothesis. Specifically, there is a potential concern about whether certain artefacts could explain the apparent lack of evidence for the population-specific bias among the most recent mutations and questions about differences among populations in the relationship between allele frequency and age. There are also suggestions about how to look at patterns along the genome to hunt for clues as to possible causes.

We thank the editor and reviewers for these thoughtful comments, which have helped us to improve the substance and presentation of the paper. To address this essential revision point, we welcome the chance to present additional evidence that the burst of TCC→TTC mutations at intermediate frequencies is not well explained by bioinformatic artifacts. We have added a new analysis of the Exome Aggregation Consortium data (Figure 3—figure supplement 2), a dataset even larger than UK10K, which supports the mutation burst hypothesis just as the datasets we previously analyzed do. This evidence is summarized in the following new paragraph of the paper:

“Although low frequency allele calls often contain a higher proportion of base calling errors than higher frequency allele calls do, it is not plausible that base-calling errors could be responsible for the pulse we have described. […] When independently generated datasets of different sizes are projected down to the same sample size, the TCC→TTC pulse spans the same range of allele frequencies in both datasets (Figure 3—figure supplements 1 and 2).”

Figure 3B and Figure 3—figure supplement 2 illustrate how well supported the pulse pattern is in all cases.
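
The "projection" of datasets down to a common sample size, mentioned in the quoted paragraph, is commonly done by hypergeometric downsampling of allele counts. The following is a minimal sketch of that general technique, not the pipeline used in the paper; the function names `project_count` and `project_spectrum` and the list-based spectrum layout are illustrative assumptions.

```python
from math import comb

def project_count(derived, n_from, n_to):
    """Distribution of the derived-allele count in a random subsample of
    n_to chromosomes, given `derived` copies among n_from chromosomes.
    Returns a list p where p[k] is the probability of seeing k copies."""
    if not 0 <= derived <= n_from or not 0 < n_to <= n_from:
        raise ValueError("need 0 <= derived <= n_from and 0 < n_to <= n_from")
    total = comb(n_from, n_to)
    # Hypergeometric probabilities; comb(n, k) is 0 whenever k > n.
    return [comb(derived, k) * comb(n_from - derived, n_to - k) / total
            for k in range(n_to + 1)]

def project_spectrum(sites_by_count, n_from, n_to):
    """Project a spectrum (sites_by_count[i] = number of sites with i derived
    copies out of n_from) down to expected site counts at sample size n_to."""
    projected = [0.0] * (n_to + 1)
    for derived, n_sites in enumerate(sites_by_count):
        if n_sites == 0:
            continue
        for k, p in enumerate(project_count(derived, n_from, n_to)):
            projected[k] += n_sites * p
    return projected
```

Under these assumptions, applying `project_spectrum` separately to the site counts of each mutation type in two datasets would put their allele-frequency bins on the same scale before the fractions are compared.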

The point about differences between populations in the relationship between allele frequency and age is addressed in direct response to reviewer 2.

We have left the full reviews in as we think there are other ideas here that you may wish to pick up on, but pursuing them is not essential for the revision. We look forward to seeing the revision.

We appreciate this encouraging response, and have incorporated many ideas from these reviews into additional supplementary analyses. As described in more detail below, we have added new supplementary heat map figures that describe the variation of the human mutation spectrum with allele frequency, replication timing, chromatin state, and transcription. We have also attempted to clarify the description of our significance-testing procedure in the Methods section. Finally, we decided to remove the APOBEC analysis given that it is less conclusive than the other sections of the paper and appears to have detracted, in the eyes of the reviewers, from the impact of the paper’s main conclusions.

Reviewer #1:

This paper presents an analysis of population differences in the spectrum of context-dependent single nucleotide polymorphism, primarily focused on humans, but also analysing data from great apes. One of the authors previously reported a substantial difference in one particular type. This is replicated here and augmented with a large number of other, much weaker findings. These are replicated across data sets, which argues for biological, rather than experimental explanations. Very substantial differences among great apes are described. One specific hypothesis, about an APOBEC mutation, is assessed and some – moderately weak – evidence for association is seen.

The analyses presented are basically well done and make a reasonably compelling case that there are repeatable differences in mutational spectra. The obvious – and I think rather important – criticism is that this work moves us no further along in terms of identifying causal factors. The authors argue for a contribution of transient mutator phenotypes. However, it is not clear how plausible this model is – most models of mutators suggest that they tend not to persist in sexually reproducing populations. This could potentially be explored by simulation. The authors argue that non-genetic factors – e.g. environmental exposure – are unlikely to explain the phenomena, though no hard evidence is given; I do agree that the great ape differences are compatible with a substantial genetic component.

It is a fair point that we have not yet nailed down concrete mechanisms that are causing mutation spectrum evolution. However, it is our hope that describing the temporal structure of mutation spectrum change will bring us and others closer to achieving this goal in the future. Thanks to the work done in this paper, we now know that the TCC→TTC pulse appears to be presently inactive, meaning that it is probably not caused by currently segregating genetic variation that could be mapped via genome-wide association approaches. This implies that the TCC→TTC pulse might not be the best signal to chase in search of causal factors, which is an important thing to keep in mind for anyone who wants to tackle the challenging problem of mapping mutators in the future. This paper also shows that there are other mutation spectrum differences between populations that are weaker in magnitude than the TCC→TTC pulse, but are nevertheless reproducible and might be better leads to go after in search of the molecular underpinnings of mutation rate variation.

In terms of the plausibility of mutator phenotypes existing in sexual populations, a natural mutator phenotype was recently identified in yeast (see “Mismatch repair incompatibilities in diverse yeast populations” by Bui et al., Genetics 2017).

The only other substantial comment I have is around the analysis that fits the burst of TCC-to-TTC mutations. My concern is that sequencing data sets will have higher error rates at low frequencies and likely differential discovery based on sequence context due to systematic fluctuation in sequencing depth (true for both UK10K and 1000G). Hence, I have a concern that the most recent TCC->TTC mutations could be lost or swamped in a way that leads to an apparent burst when the process is still active.

We believe that the new analyses presented in Figure 3—figure supplements 2 and 6 provide good evidence that sequencing errors are not artificially deflating estimates of the TCC→TTC mutation fraction at low frequencies.

Reviewer #2:

This is a nice paper on a topic of current interest and relevance in human genetics and genomics. It points to important evidence for recent variability over time in at least some aspects of the human mutation rate, something which until now we have only been able to speculate about, and which has potentially broad implications for human evolutionary genetics.

I have listed below a few thoughts and comments, including some things I think the authors should address. However, I found the paper well written and well presented, and have no major issues to raise.

It would be good to get a sense of the raw numbers involved. What do the relative differences between populations mean in terms of actual numbers of variants? For example, what actual density of additional derived T alleles are there in Europe compared to Africa for the TCC->TTC signal?

The raw numbers of TCC→TTC mutations (as well as mutations in other contexts) are available in the supplementary file total_continent_mut_counts.txt. In particular, there are 270,538 TCC→TTC variants in Africa compared to 187,174 in Europe (where there are many fewer total SNPs).

Is there an issue with ascertainment bias due to demography, in that the allele frequency spectrum varies between populations due to their differing demographic histories, and this might differentially affect the ascertainment of variants in different spectral classes, depending on their relative abundance?

With regard to the relationship between allele age and allele frequency, it is a fair point that different populations have different relationships between allele frequency and allele age due to contrasting demographic histories. However, differences in the relationship between allele frequency and allele age cannot produce differences between the mutation spectra of two populations in the scenario where each population has experienced an identical rate and spectrum of mutations throughout recent history. The exception to this would be if selective forces like biased gene conversion caused different populations to retain certain classes of mutations at different rates, which is plausible in principle since selective forces act more weakly in populations of smaller effective size. The observed patterns are not consistent with classical biased gene conversion though – if biased gene conversion were the only force creating differences between population mutation spectra, all C/G⇌A/T mutations would be affected in much the same way regardless of sequence context and neither C⇌G nor A⇌T mutations should have differences in abundance between populations.
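
The reasoning about biased gene conversion can be made concrete by classifying each substitution according to how GC-biased gene conversion would be expected to act on it. This is a minimal sketch for illustration only; the function names and the labels "favored", "disfavored" and "unaffected" are assumptions, not terminology from the paper.

```python
STRONG = frozenset("CG")  # "strong" bases
WEAK = frozenset("AT")    # "weak" bases

def bgc_class(ancestral, derived):
    """Classify a single-base substitution by the expected effect of
    GC-biased gene conversion on the derived allele."""
    if ancestral == derived or not {ancestral, derived} <= (STRONG | WEAK):
        raise ValueError("expected two different bases from A, C, G, T")
    if ancestral in WEAK and derived in STRONG:
        return "favored"      # weak-to-strong (A/T -> C/G)
    if ancestral in STRONG and derived in WEAK:
        return "disfavored"   # strong-to-weak (C/G -> A/T)
    return "unaffected"       # C<->G or A<->T

def bgc_class_triplet(ancestral_triplet, derived_triplet):
    """Same classification for a triplet-context mutation such as TCC -> TTC.
    Only the central base is inspected, so flanking bases make no difference."""
    return bgc_class(ancestral_triplet[1], derived_triplet[1])
```

Because `bgc_class_triplet` ignores the flanking bases, TCC→TTC and, say, ACG→ATG fall into the same class. A signal confined to particular triplet contexts, or one involving C⇌G or A⇌T changes, is therefore not what this simple classification would predict.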

It would be interesting to know how the structure presented in Figure 1 varies with the age of the variants used to construct it (or, as a proxy, their allele frequency). Presumably one would expect the differences between populations to disappear as one excludes more recent variants, as older ones are more likely to be shared. Is this the case?

This is a nice suggestion for bridging the visualizations presented in Figure 1 and Figure 2. We now include this analysis as Figure 3—figure supplement 6, and it indeed shows that differences disappear as we restrict to higher frequency variants.

I think the procedure used to estimate statistical significance, referred to as 'a forward variable selection procedure' needs a better description and motivation. It's not wholly clear to me how the procedure adopted achieves its goal of minimising the interdependence of the tests for each mutation type. It looks to me like some form of partitioned chi-squared test for comparing multiple proportions, but I don't know this statistical literature well – can you cite a useful reference or else explain how you arrived at it? I'd be happy with just testing for a significant overall difference in spectrum between two populations – is testing for significance of individual components of the spectrum really necessary?

Since this procedure is not described in the literature to our knowledge, we have expanded its description in the supporting information to include better motivation for what is done (see Methods section). These p-values are used to annotate Figure 1A with the dots that denote significance. We think that conservatively estimating the number of mutation types that vary in rate between populations is important to this paper because Harris 2015 focused on the rate distribution of a single mutation type and we want to emphasize here that mutation spectrum evolution is much more pervasive than that.
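
For readers who want an intuition for what a forward selection over chi-square tests might look like, here is a minimal sketch of one plausible reading: at each step the mutation type with the largest 2×2 chi-square statistic (that type versus all types not yet selected, in population 1 versus population 2) is recorded and then removed from the background. This is an assumption made for illustration, not the exact procedure described in the Methods; the function names and the dictionary input format are hypothetical.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 d.f., the chi-square survival function is erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))

def forward_selection(counts1, counts2):
    """counts1, counts2: dicts mapping mutation type -> SNP count in each
    population (same keys). Returns [(type, p_value), ...] in selection order."""
    remaining = set(counts1)
    results = []
    while len(remaining) > 1:
        tot1 = sum(counts1[m] for m in remaining)
        tot2 = sum(counts2[m] for m in remaining)

        def test(m):
            # Type m versus all other not-yet-selected types.
            return chi2_2x2(counts1[m], tot1 - counts1[m],
                            counts2[m], tot2 - counts2[m])

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda m: test(m)[0])
        results.append((best, test(best)[1]))
        remaining.remove(best)  # later tests no longer include this type
    return results
```

Removing each selected type from the background is what keeps one strongly divergent type from making every other type look depleted by comparison; the resulting p-values would still need a multiple-testing correction.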

For the differences within the great ape tree, did you also compare with an outgroup such as macaque or gibbon? It wasn't clear how you polarise the C-T rate difference as an increase on the Pongine branch, and the CpG rate difference as an increase in Hominines.

We have rephrased the statement about CpGs to say, “one major trend is a higher proportion of CpG mutations among the species closest to human” to be more agnostic about the question of whether the rate decreased in one part of the tree versus increased in another part of the tree. For statements about A→T and A→C mutations in the great apes, we have left in language that hypothesizes increases and decreases in rate because these are supported by parsimony (i.e., the rate differential is more parsimoniously explained by a rate increase along one branch of the tree than by rate decreases along two separate branches).

Reviewer #3:

This article extends previous knowledge on heterogeneity of mutational spectra between populations. It details differences in frequencies of specific mutation types between diverse population groups. Moreover, the authors date the TCC->TTC mutational pulse for the European population. The arguments supporting the main conclusion of the study seem convincing, although a discussion of the possible importance of evolutionary forces other than mutation would be helpful (e.g. interaction between BGC and demographic history). There is a noticeable overlap with the earlier work, and the manuscript would strongly benefit from additional analyses of the observed mutational patterns.

For example, are the relative increases of mutation types uniformly distributed along the genome or enriched in specific genomic locations? Are they associated with epigenomic features or display asymmetry with respect to transcription or replication? Are they dependent on local recombination rate? Any analysis suggestive of a mechanistic hypothesis underlying the observation would strengthen the paper.

We agree, and we have included new supplementary figures showing how these patterns vary with replication timing, chromatin state, and transcriptional state. Our main inference from these figures is that the discernible features of Figure 1B appear to vary only modestly across the genome.

In contrast to the main result of the manuscript, I am skeptical about the conjecture related to the APOBEC-induced mutagenesis. It is not statistically sound and is based on arbitrary thresholds. Is it possible to compare replication asymmetry of APOBEC-like mutations (TC[A/T]-> [C/G]) between populations with different frequencies of variants associated with breast or bladder cancer?

We agree that these conclusions about APOBEC-induced mutagenesis are not as decisive as the main conclusions in the paper. We intended this analysis to work as a useful illustration of how linkage disequilibrium can be used to test hypotheses about mutator activity, but in revising the paper, we have decided to remove this section to avoid detracting from the impact of the key points we are making.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Figure 1—source data 1. This text file shows the number of SNPs in each of the 96 mutational categories that passed all filters in each 1000 Genomes continental group.

    DOI: http://dx.doi.org/10.7554/eLife.24284.004

    Figure 4—source data 1. This text file shows the number of SNPs in each of the 96 mutational categories that passed all filters in each finescale 1000 Genomes population.

    DOI: http://dx.doi.org/10.7554/eLife.24284.024


    Data Availability Statement

    All datasets analyzed here are publicly available at the following websites:


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
