Abstract
Mechanistic processes underlying human germline mutations remain largely unknown. Variation in mutation rate and spectra along the genome is informative about the biological mechanisms. We statistically decompose this variation into separate processes using a blind source separation technique. The analysis of a large-scale whole genome sequencing dataset (TOPMed) reveals nine processes that explain the variation in mutation properties between loci. Seven of these processes lend themselves to a biological interpretation. One process is driven by bulky DNA lesions that resolve asymmetrically with respect to transcription and replication. Two processes independently track direction of replication fork and replication timing. We identify a mutagenic effect of active demethylation primarily acting in regulatory regions. We also demonstrate that a recently discovered mutagenic process specific to oocytes can be localized solely from population sequencing data. This process is spread across all chromosomes and is highly asymmetric with respect to the direction of transcription, suggesting a major role of DNA damage.
The superb accuracy of transmission of genetic information between generations is one of the most fascinating properties of life. Infrequent errors in this transmission lead to mutations that are the source of genetic variation which fuels evolution and causes genetic disease. The key importance of mutagenesis motivated decades of experimental research that revealed various modes of errors made by complex machineries of DNA replication and DNA repair (1-3). In spite of this effort, biochemical mechanisms primarily responsible for human germline mutation remain uncharacterized. Statistical analysis of massive whole genome sequencing datasets in light of the knowledge accumulated by experimental genetics and biochemistry offers a promising avenue of inquiry.
Studies of the origin of cancer somatic mutations have been propelled by the statistical analysis of “mutation signatures” in cancer genomic datasets and by mapping these signatures to known exposures to endogenous and exogenous mutagens (4-6). This analysis exploits the trinucleotide context-dependency of mutation rate. Differential exposure of tumors to mutagens serves as the main statistical instrument for the analysis. This approach is not directly transferable to studies of human germline mutation because there is no analog of the differential mutagen exposure, although some success was achieved by comparing human populations (7-10).
Here, we use variation along the genomic coordinate as the statistical instrument to decompose human germline mutagenesis into independent biochemical processes. Human mutation rate exhibits a modest but highly significant variation along the genome (11-13). Our model assumes that several mechanistic processes generate human germline mutations. These processes are characterized by types and context-dependency of nucleotide changes and vary in their relative intensities along the genome (Fig. 1A). Mutational signatures and the relative intensity of each process at each locus can be derived from the analysis of DNA sequencing data alone. Slightly more formally, each process is characterized by a relative preference for each of the 192 types of all possible single nucleotide mutations in trinucleotide contexts oriented to the reference strand. Each process is assumed to vary along the genome and the observed heterogeneity of mutational spectra between loci is driven by different relative contributions of the processes (Fig. 1A). Inference of mutational processes then represents a classical blind source separation problem that separates a set of source signals from observed signal mixtures. For that, we devised a computational approach that performs dimensionality reduction using Principal Component Analysis (PCA) followed by Independent Component Analysis (ICA) of mutational spectra in reduced space, so that processes have independent spectra and each process may have either positive (enrichment) or negative (depletion) preferences for context-specific mutation types (Fig. S1A, see Methods). Although there could be different mathematical formulations of a source separation problem, we argue that PCA-based dimensionality reduction followed by ICA-based spatial inference is both statistically powerful and biologically reasonable for the population datasets considered here compared to the other state-of-the-art approaches (Fig. S1H, see Methods).
Simulations show that, accounting for the size and properties of the TOPMed dataset, our approach recovers processes that have a genome-wide contribution of at least 0.1% of the overall mutation rate and spatial scale of at least 10kb (Fig. 1E).
As with any statistical procedure, the key question is whether a particular inferred process reflects the biological reality or is a spurious signal. A powerful way to assess the biological relevance of the inferred processes is provided by the symmetry between antiparallel strands of DNA. Although DNA is a symmetric molecule, directional processes such as transcription and replication break this symmetry. Mutational mechanisms coupled with these processes are strand-dependent. For example, within genes A>G mutations are depleted on the transcribed strand and enriched on the complementary non-transcribed strand. This observation is attributed to the action of transcription-coupled repair (TCR) (3, 14). All mutational mechanisms can be broadly classified into strand-dependent and strand-independent.
Our statistical procedure assigns the direction of mutations with respect to the human genome reference, irrespective of the direction of transcription, replication or double strand break repair. For some genes the reference strand happens to be transcribed, while for other genes it happens to be non-transcribed. As a consequence, in some genic regions we will detect depletion of A>G mutations and in others we will detect depletion of the complementary mutation T>C.
For a strand-dependent mutation process, our statistical procedure would infer two independent components (Fig. 1B). Remarkably, these components can be easily identified as corresponding to the same underlying process because they would be exactly complementary to each other. Following the example of mutation processes associated with transcription, the intensity of A>G mutations in one of the components would be identical to the intensity of T>C in the other. In contrast, a mutation process that is not strand-dependent would generate a component that would be self-complementary (for example, the intensity of A>G would be identical to the intensity of T>C). As a result, all biologically relevant components would either be self-complementary or arise in mutually complementary pairs (Fig. 1B-D).
We rely on this observation to test the biological validity of the inferred processes. Motivated by the visual representation in Figure 1B,F, we called this test a “reflection test”.
We applied our method to a dataset of very rare single nucleotide variants (SNVs) from the TOPMed freeze 5 (15) serving as a proxy to mutations (16). Overall, the dataset included over 293 million SNVs with allele frequency below 10⁻⁴. To capture the regional variation, we binned the genome into 264,291 non-overlapping windows of 10 kb, which is the optimal scale for the number of inferred components (Fig. S1G).
ICA identifies 14 independent components that successfully pass the “reflection test”, corresponding to 9 mutational processes, 5 of which are strand-dependent and the remaining 4 are strand-independent (Fig. 1F, Fig. S1C-E, Fig. S2). Almost all of these components have the average bootstrap support at the level of 70-99% (Fig. S1D).
These 14 components are robust with respect to window size and are reproduced in the independent gnomAD dataset (Fig. S1F). Finally, we validated these components using de novo mutations identified by parent-child trio sequencing (17, 18). The spectra of de novo mutations in loci dominated by a specific component show a high concordance with the component spectrum inferred from the TOPMed dataset (Fig. S1K-M).
Eight of nine processes show notable and highly distinct correlations with genomic features known to impact mutation rate, including gene bodies, replication timing, direction of replication, and chromatin accessibility (Fig. 2A, Table S1). This strong association is remarkable given that the statistical inference was totally agnostic with respect to features other than mutation density.
Broadly, mutations can be introduced either as replication errors or as a consequence of DNA damage. The hallmark of mutations induced by bulky DNA damage is strand asymmetry with respect to direction of transcription (3, 19) and, as we recently argued, direction of replication (20). Bulky DNA damage is resolved in a strand-specific manner within gene bodies due to the action of TCR (3, 21) and due to the preferential error-prone damage bypass on the lagging strand during replication (2). Components 1 and 2 have mutually complementary spectra and together correspond to a single strand-dependent process (Fig. 1D, Fig. 2A, Fig. S1). The strand asymmetry of this process, measured as the difference between intensities of components 1 and 2, strongly correlates with directions of both transcription (r=0.32) and replication (r=−0.15). The sum of the two components' intensities reflects the overall regional activity of process 1/2 and correlates with replication timing (r=0.34). Components 1 and 2 correlate in a strand-specific manner with the experimentally measured activity of the transcription-coupled repair system (21, 22) (Fig. S3). Collectively, these observations strongly suggest that process 1/2 is driven by the asymmetric resolution of bulky DNA damage.
In contrast, strand-dependent process 3/4 likely captures replication errors. The asymmetry of this process strongly correlates with the direction of replication (r=0.31) but is not meaningfully associated with any other epigenomic feature including direction of transcription. Therefore, in contrast to process 1/2, this process is unlikely to be mediated by bulky DNA damage. We hypothesize that process 3/4 reflects either a differential fidelity between replicative polymerases or a differential efficiency of mismatch repair (MMR) between leading and lagging strands (1, 23, 24). Although replication infidelity is frequently assumed to be a major (or even leading) factor in germline mutagenesis (25, 26), process 3/4 offers the first probable genomic footprint of replicative errors. Interestingly, process 3/4 (sum of intensities of components 3 and 4) does not appreciably correlate with replication timing, even though many other processes do.
Process 5 most closely tracks replication timing (r=0.54), showing greater intensity in late-replicating regions. The association of germline mutation rate with replication timing was noted a decade ago, but it was shown to be quantitatively weak (13, 27). A recent study reported that the association is much stronger for C>A mutations (28). C>A mutations are indeed enriched in process 5, although this enrichment is limited to TpCpN sequence contexts. Unlike other processes, process 5 affects all mutation types in the same direction (all types have positive values in the spectrum). This process is responsible for the largest fraction of mutation rate variation along the genome (Fig. 2A). In spite of these observations, the interpretation of this process is not straightforward because replication timing itself is correlated with many epigenomic features. Interestingly, most other processes are associated with replication timing, not only to a weaker degree, but also in the opposite direction (Fig. 2A). This counteracting effect explains the weakness of the association between overall mutation rate and replication timing.
Strand asymmetric process 6/7 is dominated by C>G transversions and is characterized by strong local spikes totaling 265 Mb throughout the genome (Figure 3A-C). Analysis of de novo mutations within these regions reveals that they are dramatically enriched in mutations of maternal origin (Table S2). Several genomic regions with high prevalence of maternal mutations, many of them occurring in clusters, have been reported by the original trio sequencing studies (29, 30). Spikes of process 6/7 include all these regions and many previously unreported regions, also strongly enriched in individual and clustered mutations of maternal origin (Table S2 and Table S3). Overall, the rate of clustered maternal de novo mutations in regions of high intensity of process 6/7 is 18-fold higher than in the rest of the genome. These regions constitute 10% of the genome but harbor 67% of clustered maternal mutations (Fig. 3D, Table S3). Mutations in high intensity regions of process 6/7 have stronger dependence on maternal age and are responsible for 35% of mutations caused by oocyte aging. Mutations within these regions show a 2.6-fold excess in children of older mothers compared to younger mothers (Fig. 3H). In the remaining 90% of the genome this excess is just 1.4-fold. In contrast to earlier reports, this difference is not limited to C>G mutations (30).
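As a rough consistency check of the figures quoted above, the fold enrichment can be recomputed from the two fractions alone. This is a minimal sketch that assumes the rate in a region set is simply its share of mutations divided by its share of the genome; the function name is illustrative and not part of the original analysis.

```python
def fold_enrichment(genome_fraction: float, mutation_fraction: float) -> float:
    """Ratio of mutation density inside a region set to the density outside it."""
    inside = mutation_fraction / genome_fraction
    outside = (1.0 - mutation_fraction) / (1.0 - genome_fraction)
    return inside / outside

# Fractions quoted in the text: 10% of the genome, 67% of clustered maternal mutations.
enrichment = fold_enrichment(0.10, 0.67)
print(round(enrichment, 1))
```

Under this assumption the ratio comes out at roughly 18, in line with the 18-fold excess reported for clustered maternal de novo mutations.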
Five prominent spikes of process 6/7 overlap long fragile genes (WWOX, RBFOX1, CSMD1, FHIT, SDK1) (31). In these and other genes, process 6/7 displays a strong strand asymmetry with respect to transcription (Fig. 3, Fig S4, Fig S5; r=0.26). Within the gene bodies as compared to flanking regions, the rate of C>G mutations is decreased on the transcribed strand and is increased on the non-transcribed strand by as much as 50-200% (Fig. 3, Fig. S5).
Maternal mutations accumulate in oocytes that are arrested in the second phase of meiosis from the early stages of embryogenesis. Thus, the age-related increase of maternal mutations is unlikely to be explained by replication errors. Alternative mutation mechanisms should involve either DNA damage or resolution of double strand breaks outside of S-phase. The latter is favored by the current literature (29, 30). This is an appealing explanation in light of mutation clusters and the striking maternal age dependency resembling the impact of age on structural variants (32). This is consistent with our observation that process 6/7 overlaps genes with common fragile sites. At the same time, the directly established spectrum of mutations induced by recombination has no sign of enrichment in C>G and is very different from process 6/7 (18). Furthermore, the signature of homology repair deficiency in cancer genomes also has a very different spectrum (4).
The strand asymmetry of process 6/7 cannot be easily explained by the double strand break model. The reduction of mutations on the transcribed strand suggests the role of bulky DNA damage repaired by TCR. In addition, the relationship with direction of replication (r=0.14, Fig. 2A, Fig. S4A) probably indicates that the unrepaired lesions on the leading and lagging strands are asymmetrically converted into mutations at the very first division of the zygote. The most surprising observation is the increase of mutation rate on the non-transcribed strand. Transcription-associated mutagenesis (TAM) has been previously reported in lower organisms and in some cancer types (19, 33). Our analysis identifies TAM in human oocytes and shows that it is primarily localized to bursts of process 6/7 (Fig. 3G, Fig. S5). TAM is a strand-dependent process associated with transcription and is unlikely to be explained by double strand break repair. Collectively, these observations point to a localized susceptibility to DNA damage or a failure of DNA repair.
Processes 8 and 9 are dominated by mutations in the CpG context. Process 8 is characterized by CpG transitions and describes a well-known mechanism of spontaneous deamination of methylated cytosines which converts them into thymines. As expected, the intensity of process 8 is positively correlated with methylation levels and is low in CpG islands marking actively demethylated regulatory elements. Process 9 is characterized by CpG transversions. The intensity of this process spikes at CpG islands and is negatively correlated with methylation level (Fig. 4). CpG transversions were previously shown to be positively associated with the level of cytosine hydroxymethylation (34). Based on high intensity in CpG islands, the negative correlation with methylation level and the positive correlation with hydroxymethylation level, we hypothesize that process 9 is caused by active demethylation of regulatory regions. Enzymatic demethylation is initiated by oxidation of a methylcytosine resulting in a hydroxymethylcytosine (35). The hydroxymethylcytosine base, following cycles of subsequent oxidation, is removed by the Base Excision Repair system (BER), creating an abasic site. Unfinished repair of abasic sites is known to result in C>G mutations (36).
Process 9 explains a small portion of the mutation rate variability. However, it disproportionately contributes to regulatory regions of the human genome. In undermethylated regions, the rate of CpG transversions is elevated under ChIP-seq peaks for transcription factors (Figure 4). The mutagenic effect of repair of hydroxymethylated cytosines has been shown previously (34). We identify this process in an unsupervised manner and attribute it to an unintended side effect of functionally significant demethylation. In line with our model, cadmium, which suppresses cytosine demethylation, leads to depletion of C>G mutations in daphnia (Supplementary Manuscript).
The only remarkable association between intensity of process 10/11 and genomic features is a weak spike at the transcription end site on the transcribed strand of the gene (Figure 2E). Potentially this process is associated with transcription termination, but this localized effect is diluted at the 10 kb scale. The remaining processes 12/13 and 14 explain small proportions of the mutation rate variation. Statistical analysis of these processes does not unequivocally suggest specific biological mechanisms (see Supplementary Note for discussion of these processes).
Our analysis was enabled by the massive scale of the TOPMed dataset. Subsampling of the dataset shows that many components would not be detectable in smaller datasets. Even at the TOPMed scale, there are no statistical signs of saturation for the number of detectable processes (Fig. S1I-J) and a notable range of mutational processes remains undetectable in current settings (Fig. 1E). We hypothesize that larger population sequencing datasets are needed to paint a more detailed picture of human germline mutagenesis.
In sum, our unsupervised statistical analysis of the genomic variation in mutation rate evident in population sequencing data implicates a compendium of biological processes responsible for human mutation. Our approach identifies a highly localized strand-dependent process dominated by mutations of maternal origin. This process tracks direction of transcription, suggesting a dominant role of transcriptionally-mediated damage in oocytes. We also characterize a mutation signature of replication errors, which has been historically suspected to be a major source of germline mutation. We attribute mutagenic patterns of repair of hydroxymethylated cytosines (34) to active demethylation of regulatory regions. We envision that a spatial mutational model applied to new datasets will uncover new links between DNA biochemistry and localized mutational patterns.
Material and Methods.
Preparation of mutational matrix
As a proxy for germline mutations, we used SNVs with allele frequency below 10⁻⁴ from TOPMed freeze 5 (1) or gnomAD (2). The genome was binned into non-overlapping windows of 2, 5, 10, 30, 100 or 1000 kilobases in size, and the mutation rate within each window was estimated as the ratio between the number of mutations and the number of available sites. To explore uniformity of the calling/sequencing quality, we obtained the distribution of the number of mutations within 1 kb windows across the genome. This distribution was bimodal with the first mode equal to 0 SNVs per region (Fig. S1A). This mode clearly corresponded to regions of low quality. Therefore, we excluded 1 kb loci with abnormally low mutation counts (fewer than 50 mutations) from all subsequent analyses. Overall, our results were stable with respect to different filtering thresholds (data not shown).
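The filtering and rate estimation described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used for the paper: the function name, the array layout (1 kb loci by mutation types), and the choice to apply the 50-mutation threshold to the total count per locus are all hypothetical.

```python
import numpy as np

def rates_per_window(counts_1kb, sites_1kb, loci_per_window=10, min_count=50):
    """Aggregate 1 kb loci into larger windows and return mutations per available site.

    counts_1kb, sites_1kb: arrays of shape (loci, mutation types).
    Loci with fewer than min_count mutations in total are treated as low quality
    and contribute neither mutations nor sites (an assumption of this sketch).
    """
    counts_1kb = np.asarray(counts_1kb, dtype=float)
    sites_1kb = np.asarray(sites_1kb, dtype=float)
    keep = counts_1kb.sum(axis=1) >= min_count
    counts = np.where(keep[:, None], counts_1kb, 0.0)
    sites = np.where(keep[:, None], sites_1kb, 0.0)

    n_windows = counts.shape[0] // loci_per_window
    used = n_windows * loci_per_window          # drop an incomplete trailing window
    shape = (n_windows, loci_per_window, counts.shape[1])
    c = counts[:used].reshape(shape).sum(axis=1)
    s = sites[:used].reshape(shape).sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(s > 0, c / s, np.nan)   # NaN where no sites remain

# Toy input: 20 loci, 2 mutation types, one low-quality locus.
counts = np.full((20, 2), 40.0)
counts[0] = [1.0, 2.0]
sites = np.full((20, 2), 1000.0)
rates = rates_per_window(counts, sites)
print(rates)
```

In the toy input the low-quality locus is dropped from both numerator and denominator, so the first window is estimated from nine loci instead of ten.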
Inference of mutational components
Mutation rates for each mutation type were Z-score transformed across all windows to zero mean and unit standard deviation. Using a predefined number of components n, the matrix R of transformed mutation rates in w = 264,291 windows of t = 192 mutation types was then factorized by singular value decomposition using the R package svd:
$$R_{w \times t} = U_{w \times n} \cdot \Lambda_{n \times n} \cdot V_{n \times t} \qquad (1)$$
The matrix $V_{n \times t}$ of loadings of mutation types onto the first n principal components was then centered to zero mean of columns, $\hat{V}_{n,t} = V_{n,t} - \langle V \rangle_n$, and rotated to infer statistically independent residual spectra components $\hat{M}_{n \times t}$ using the independent component analysis (ICA) R package icafast:
$$\hat{V}_{n,t} = S_{n,n} \cdot \hat{M}_{n,t}$$
Component spectra were then defined as:
$$M_{n,t} = \hat{M}_{n,t} + S^{-1} \cdot \langle V \rangle_n$$
Since ICA defines components up to a sign and a scalar, the signs of the rows of M were oriented to give a positive third moment and the rows were normalized to unit Euclidean norm. The oriented matrix $M_{n,t}$ was considered as a matrix of normalized loadings of mutation types on components, while the matrix of intensities of mutational components in windows was estimated as $I = U \cdot \Lambda \cdot S$. Altogether, the matrix $R_{w,t}$ of transformed mutation rates was factorized into a product of intensities $I_{w,n}$ and independent spectra $M_{n,t}$ of n mutational processes: $R = I \cdot M$.
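The algebra of this factorization can be checked numerically. The sketch below uses synthetic rates and NumPy only; the ICA step is replaced by a random orthogonal rotation, which is an assumption made purely to keep the example dependency-free (the analysis itself used the R package icafast). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w, t, n = 500, 192, 14                      # toy sizes; the paper uses 264,291 windows
rates = rng.gamma(shape=5.0, scale=1.0, size=(w, t))   # synthetic mutation rates

# Z-score each mutation type across windows.
R = (rates - rates.mean(axis=0)) / rates.std(axis=0)

# SVD truncated to the first n components.
U, sing, Vt = np.linalg.svd(R, full_matrices=False)
U, Lam, V = U[:, :n], np.diag(sing[:n]), Vt[:n, :]

# Placeholder for the ICA mixing matrix S (assumption: random orthogonal rotation).
S, _ = np.linalg.qr(rng.normal(size=(n, n)))
M = np.linalg.solve(S, V)                   # spectra satisfy V = S M

# Orient each spectrum to a positive third moment and scale it to unit norm,
# compensating in the intensities so that the product I M is unchanged.
centered = M - M.mean(axis=1, keepdims=True)
sign = np.sign((centered ** 3).sum(axis=1))
d = sign * np.linalg.norm(M, axis=1)
M = M / d[:, None]
I = (U @ Lam @ S) * d[None, :]

rank_n = U @ Lam @ V
print(np.allclose(I @ M, rank_n))
```

Whatever invertible matrix plays the role of S, the product of intensities and spectra reproduces the same rank-n approximation of R; ICA only selects the rotation that makes the spectra statistically independent.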
The Spearman correlation coefficient was estimated between the spectrum of each component and the reverse complementary spectrum of every component, including itself (we call it reflection correlation). Inferred components were then classified into strand-independent components, strand-dependent pairs or noise using the reflection test. Components having a reflection correlation of more than 0.75 with at least one component were considered to have reflection; all others were considered to be noise. Among components with reflection, those having a reflection correlation of more than 0.75 with themselves were considered strand-independent. Pairs of components having a reflection correlation of more than 0.75 with each other were considered as two components of a strand-dependent process. Empirical observations show that a cutoff of 0.75 falls in a wide interval of values that deliver the same classification of components.
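The reflection test can be sketched in a few lines. The example below enumerates the 192 context-specific mutation types, builds the reverse-complement permutation, and classifies toy spectra with the 0.75 cutoff. Function names, the tuple encoding of mutation types, and the simplified Spearman correlation (ties are not averaged) are assumptions of this sketch.

```python
from itertools import product
import numpy as np

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# 192 mutation types as (5' base, reference base, 3' base, alternative base).
TYPES = [ty for ty in product(BASES, repeat=4) if ty[1] != ty[3]]
INDEX = {ty: i for i, ty in enumerate(TYPES)}

def reverse_complement(ty):
    left, ref, right, alt = ty
    return (COMPLEMENT[right], COMPLEMENT[ref], COMPLEMENT[left], COMPLEMENT[alt])

# RC[k] is the index of type k as seen on the opposite strand.
RC = np.array([INDEX[reverse_complement(ty)] for ty in TYPES])

def spearman(x, y):
    """Spearman correlation via ranks; ties are not averaged in this sketch."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def reflection_test(spectra, cutoff=0.75):
    """Label each component as strand-independent, strand-dependent or noise."""
    n = spectra.shape[0]
    refl = np.array([[spearman(spectra[i], spectra[j][RC]) for j in range(n)]
                     for i in range(n)])
    labels = []
    for i in range(n):
        partners = [j for j in range(n) if j != i and refl[i, j] > cutoff]
        if refl[i, i] > cutoff:
            labels.append(("strand-independent", None))
        elif partners:
            labels.append(("strand-dependent", partners[0]))
        else:
            labels.append(("noise", None))
    return labels

# Toy spectra: a complementary pair, a self-complementary spectrum, and noise.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, len(TYPES)))
toy = np.vstack([a, a[RC], b + b[RC], c])
labels = reflection_test(toy)
print(labels)
```

For instance, an A>G mutation in a CAT context on one strand maps to a T>C mutation in an ATG context on the other, which is the pairing the permutation `RC` encodes.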
Since the reflection property of a component likely indicates its biological relevance, we used the number of components having reflection as a natural criterion to choose a predefined number of inferred components n: the number of components with reflection was estimated for the range of values of n from 2 to 50, and the value of n corresponding to the highest number of components with reflection was selected. The procedure identified that the maximum number of mutational components with reflection is 14 for 10kb genomic windows.
As a first step, the matrix R (w × t) was factorized into 14 components (n=14) by SVD. Then, to select the optimal window size, the algorithm was applied for a range of window sizes from 2 kb to 1 Mb (2, 5, 10, 30, 100, 1000 kilobases) using these 14 input components. The number of components with reflection (Spearman correlation > 0.75) was estimated for each window size. Only the 10 kb window size had all 14 components with reflection.
Power analyses of the datasets
The dataset was subsampled down to 1, 5, 20, 40, 60, 80, 90 and 95% of the original dataset. For each subsampled dataset, the method estimated mutational components using the SVD-decomposed matrix as input. Recovery quality of a component of the original dataset in each subsampled dataset was estimated as the maximum absolute Pearson correlation across all inferred components of the subsampled dataset. To account for uncertainty in the subsampling outcome, quality of recovery was averaged across 10 independent sampling runs at each subsampling depth. Finally, at each subsampling depth we estimated the average number of components 1) having reflection and 2) having a highly correlated (>0.75) counterpart in the original dataset.
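The recovery-quality score used here reduces to a single correlation matrix. The sketch below assumes spectra are stored as rows of NumPy arrays; the function name and the toy data are illustrative only.

```python
import numpy as np

def recovery_quality(original, inferred):
    """Maximum absolute Pearson correlation of each original spectrum
    against all spectra inferred from a subsampled dataset."""
    n = original.shape[0]
    corr = np.corrcoef(original, inferred)[:n, n:]   # original-by-inferred block
    return np.abs(corr).max(axis=1)

rng = np.random.default_rng(0)
original = rng.normal(size=(3, 192))
inferred = np.vstack([
    -original[1],                                     # recovered with flipped sign
    2.0 * original[0] + 0.01 * rng.normal(size=192),  # recovered up to scale and noise
    rng.normal(size=192),                             # unrelated component
])
quality = recovery_quality(original, inferred)
recovered = int((quality > 0.75).sum())
print(quality.round(2), recovered)
```

Taking the absolute value makes the score insensitive to the sign ambiguity of ICA; averaging `quality` over repeated subsampling runs would then give the per-depth estimate described above.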
Comparison of different inference methods.
We compared methods that use decorrelation, independence, and non-negativity as constraints on matrix factorization problem. PCA was used as a baseline method that decorrelates mutational components using matrix R of mutation type rates, Z-score transformed across genomic windows. In case of PCA, mutational components were interpreted as rows of orthogonal matrix V (see equation 1). PCA was also used as a dimensionality reduction approach before ICA. Mutational components that maximize independence of spectra were obtained as described above, while independence of intensities was achieved using ICA of orthogonal matrix U (see equation 1).
On the other hand, the non-negative matrix factorization (NMF) approach was applied to the matrix of mutation type rates with each mutation type rate normalized by its genome-wide average level. Using the NMF R package we ran the standard NMF algorithm (option ‘brunet’), an NMF variant that tends to produce sparser components (option ‘ns-NMF’, default parameters) and an NMF variant that tends to diversify expression of component patterns (option ‘pe-NMF’, parameters: alpha=0.01, beta=1). Since we noticed that for this TOPMed dataset NMF tends to converge to different local optima, each NMF algorithm was run using 10 starting points, including ‘nnsvd’, ‘ica’ and 8 ‘random’ options. To make the analysis of NMF components compatible with that of PCA and ICA, NMF-inferred components were centered by subtracting 1 (normalized genome-wide average rates). All of the methods were run with an input dimensionality of 14 components. Components whose spectra were dominated by a single outlier mutation type, with a loading exceeding that of any other mutation type by a factor of 10, were removed. The reflection test with a cutoff of 0.75 on reflection correlation was used to estimate the number of potential biological components for each method.
Statistical properties of mutational components
The scale of mutational components was defined using a linear autoregressive model. The spatial intensity of each mutational component was modeled as:
$$I_p = \sum_{k=1}^{M} a_k I_{p-k} + \xi_p$$
where $I_p$ is the intensity at position p, $a_k$ are autoregressive coefficients and $\xi_p$ is the residual noise. Order M of the model was chosen using the Akaike Information Criterion. The R function ar was used to fit the autoregressive model. The scale of each process was defined as the half-life of the autoregressive model.
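To illustrate how a spatial scale follows from an autoregressive fit, the sketch below handles only the first-order case, where the half-life has the closed form ln(0.5)/ln(a₁). Restricting to order 1 and fitting by least squares are simplifying assumptions; the analysis in the paper selected the order by AIC in R. The function name and the 10 kb window size argument are illustrative.

```python
import numpy as np

def ar1_half_life(intensity, window_kb=10.0):
    """Fit I_p = a1 * I_(p-1) + noise by least squares and return (a1, half-life in kb).

    The half-life is only defined for 0 < a1 < 1.
    """
    x = np.asarray(intensity, dtype=float)
    x = x - x.mean()
    a1 = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    half_life_windows = np.log(0.5) / np.log(a1)
    return a1, half_life_windows * window_kb

# Simulate an AR(1) intensity track with a known coefficient.
rng = np.random.default_rng(1)
true_a = 0.9
series = np.zeros(20000)
for p in range(1, series.size):
    series[p] = true_a * series[p - 1] + rng.normal()

a_hat, scale_kb = ar1_half_life(series)
print(round(a_hat, 3), round(scale_kb, 1))
```

With a coefficient near 0.9 the autocorrelation halves after about 6.6 windows, that is, roughly 66 kb at a 10 kb window size.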
The contribution of each component was defined as the squared sum of intensities. Contributions of all components were then scaled to the unit sum.
Assessment of components robustness
Robustness of each component spectrum was assessed using a bootstrap of genomic windows. 500 sets of 14 components were inferred using a bootstrap of windows. Maximum Spearman correlations between an original component and the components in each bootstrapped set were calculated to provide estimates of the similarity of potentially identical components. For all mutational components, average Spearman correlations of the spectra with bootstrapped components were above 0.68, indicating the robustness of spectra estimates.
Inference of components was repeated for a window size of 5 kb and 30 kb to explore robustness with respect to window size. For each window choice, the procedure of inference was repeated independently, including selection of the optimal number of components. The spectra of all original components were recapitulated, with a correlation of more than 0.64 in at least one of two runs. Finally, component spectra were compared between TOPMed and gnomAD datasets. For that, the procedure of components inference, including selection of the optimal number of components, was repeated for the gnomAD dataset using a window size of 100 kb. The spectra of most components were recapitulated with a correlation of more than 0.6, while three components (3, 10, 11) showed moderate correlation (0.46, 0.43, 0.55). Overall, this indicates that components are robust with respect to the choice of window and dataset.
Comparison with de novo data
To assess if the spatial distribution of de novo mutations is consistent with individual mutational processes, we pooled 421,106 de novo point mutations from two datasets (3, 4) and estimated the log ratio of de novo frequencies of mutation types in 25% of genomic windows of high component intensities relative to frequencies of mutation types in the whole genome. Consistency of de novo data with the mutational component is quantified as the Spearman correlation between this log ratio for de novo mutations and the spectrum of the corresponding mutational component. Spearman correlations were positive for each component. To estimate the uncertainty of these correlations, we repeated the estimation of Spearman correlations multiple times using bootstrapped sets of genomic windows. The significance of each association was assessed as the p-value of zero correlation relative to the distribution of bootstrapped Spearman correlations. The results show that for all components the correlation is significantly consistent (p < 0.05; Supplementary Figure 1). To assess parent-specific effects, the Spearman correlation between the log ratio of de novo mutation frequencies and the spectrum of mutational process was estimated separately for phased maternal and paternal de novo mutations. Before this procedure, 63,387 paternal de novo mutations were downsampled to match the size of 17,406 maternal de novo mutations. The distribution of differences between maternal and paternal Spearman correlations was constructed using a bootstrap of genomic windows to assess statistical significance, estimated as the p-value of zero correlation relative to the bootstrapped distribution. Similarly, to assess age effects of mutational processes, the dataset was partitioned by the average age of parents in two equal parts of young and old parents and the procedure identical to that applied for parent-specific effects was repeated.
Simulations to assess the limitations of the approach
The ability to infer spatially-varying mutational processes depends on their statistical properties, such as spatial scale, degree of variability along the genome and degeneracy of mutational spectrum. Limitations of inference with respect to these statistical properties were analyzed through simulations of mutational processes underlying spatially variable mutation rates. Briefly, we simulated spectra and intensities of 14 mutational components corresponding to 4 strand-independent and 5 strand-dependent processes (4·1 and 5·2 components respectively), linearly combined them to obtain variable mutation type rates along the genome and sampled mutation counts using a Poisson process. The spatial inference procedure was then applied to the matrix of simulated mutation counts, and spectra of inferred components were compared to the simulated ones to estimate quality of recovery. Recovery quality of a simulated component was calculated as the maximum absolute Pearson correlation to inferred components. Simulations were repeated 8000 times to assess processes in a wide range of scales, loadings and spectra degeneracies.
In more detail, intensities of components were simulated by continuous Ornstein-Uhlenbeck (O-U) processes. Scale of component was estimated as half-life (hl) of O-U process. The latter was sampled from 100 bp to 200 kb uniformly at log scale. Stationary O-U mean m (see equation 2) was assigned to 3 and stationary variance α was sampled from 0.005 to 5 uniformly at log scale. Stationary variance controls the degree of spatial variability of components. Overall, intensities I were modeled using O-U diffusion:
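$$dI_p = \lambda\,(m - I_p)\,dp + \sqrt{2\lambda\alpha}\,dW_p \qquad (2)$$
where p is genomic position, $W_p$ is a standard Wiener process, λ is the rate of reversion, $\lambda = \ln(2)/hl$, $hl = \exp(\mathrm{unif}(\log(100), \log(2 \cdot 10^{5})))$, $\alpha = \exp(\mathrm{unif}(\log(5 \cdot 10^{-3}), \log(5)))$, m = 3.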
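As an illustration of how such intensities could be generated, the sketch below simulates an O-U process on a regular grid of windows using its exact Gaussian transition. It assumes the standard parameterization in which the reversion rate is ln 2 divided by the half-life and α is the stationary variance. The window size (`step_bp`) and all function names are hypothetical, and the sketch does not address how negative intensities should be handled.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value uniformly on a log scale between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def simulate_ou(n_windows, step_bp, half_life_bp, mean, stat_var, rng):
    """Sample an O-U path at n_windows positions spaced step_bp apart."""
    lam = math.log(2) / half_life_bp          # reversion rate from half-life
    decay = math.exp(-lam * step_bp)          # correlation between neighbours
    noise_sd = math.sqrt(stat_var * (1 - decay ** 2))
    x = rng.gauss(mean, math.sqrt(stat_var))  # start in the stationary state
    path = [x]
    for _ in range(n_windows - 1):
        x = mean + (x - mean) * decay + noise_sd * rng.gauss(0, 1)
        path.append(x)
    return path


rng = random.Random(1)
half_life = sample_log_uniform(rng, 100, 200_000)   # 100 bp to 200 kb
variance = sample_log_uniform(rng, 5e-3, 5)         # stationary variance
intensities = simulate_ou(1000, 1000, half_life, 3.0, variance, rng)
```

Using the exact transition rather than an Euler step keeps the stationary mean and variance correct for any ratio of window size to half-life.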
The rate vector of the 192 mutation types was sampled from a Dirichlet distribution with a concentration parameter (also denoted α, but distinct from the stationary variance above) sampled uniformly from 0.01 to 10 on a log scale. The concentration parameter controls the degeneracy of spectra and is shown in Supplementary Fig. 1a as a “spectra degeneracy” score. Mutation type rates of the component spectra were then re-normalized to match the average observed genome-wide mutation frequencies in TOPMed. The rates $S_{i,j}$ of each mutation type j in spectrum i were scaled by a factor $c_j$: $S_{i,j} \leftarrow S_{i,j} \cdot c_j$, where $c_j = \mu_j / \sum_i \langle I_i \rangle S_{i,j}$, with $\mu_j$ being the average genome-wide mutation rate of type j and $\langle I_i \rangle$ the average intensity of process i. Finally, the expected mutation rate of each type j in each window w is a linear combination of components, $v_{w,j} = \sum_i I_{w,i} \cdot S_{i,j}$. Mutation counts of each type in each window were sampled from a Poisson process with rate $m_{w,j} = v_{w,j} \cdot c_j$, proportional to the mutation rate $v_{w,j}$ and to the average number $c_j$ of available context triplets per window in the human genome (here $c_j$ denotes this context count, not the scaling factor above). The component inference procedure was then applied to the matrix $m_{w,j}$ of simulated mutation counts.
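The spectrum sampling, rescaling and count sampling steps can be sketched as follows. This is an illustrative Python version, not the original implementation: the function names are hypothetical, the scaling factor is taken to be the target rate divided by the intensity-weighted sum of unscaled rates (the choice that makes the mixture match the target genome-wide rate), and the Poisson sampler is Knuth's simple method, which is only suitable for modest rates.

```python
import math
import random


def dirichlet(rng, concentration, n):
    """Symmetric Dirichlet draw obtained by normalizing gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def rescale_spectra(spectra, mean_intensity, target_rate):
    """Scale column j of spectra[i][j] so that the intensity-weighted sum
    over components equals target_rate[j]."""
    scaled = [row[:] for row in spectra]
    for j in range(len(target_rate)):
        expected = sum(mean_intensity[i] * spectra[i][j] for i in range(len(spectra)))
        factor = target_rate[j] / expected
        for i in range(len(spectra)):
            scaled[i][j] *= factor
    return scaled


def expected_rates(intensity, spectra):
    """v[w][j] = sum over components i of intensity[w][i] * spectra[i][j]."""
    n_types = len(spectra[0])
    return [[sum(row[i] * spectra[i][j] for i in range(len(spectra)))
             for j in range(n_types)] for row in intensity]


def poisson(rng, rate):
    """Knuth's multiplication method; adequate for modest rates only."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_counts(rates, contexts_per_window, rng):
    """Poisson counts with mean rate[w][j] * number of available contexts j."""
    return [[poisson(rng, v * n) for v, n in zip(row, contexts_per_window)]
            for row in rates]
```

In the simulations described above there would be 192 mutation types and 14 components; the functions accept any dimensions, so small toy inputs are enough to check the bookkeeping.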
Associations with epigenetic tracks and DNA features
We relied on the analysis of correlations between mutational processes and epigenomic tracks to gain insight into biological mechanisms.
Replication timing was obtained from (5). In the absence of data from the relevant germline tissue, we used the track for MCF-7 cells. The results were insensitive to the choice of cell type. Replication fork direction was determined as in (6).
Gene coordinates were obtained from the ‘knownGenes’ track downloaded from the UCSC genome browser. We measured gene bias within each window as the number of nucleotides transcribed on the reference strand minus the number of nucleotides transcribed on the strand complementary to the reference. Correlations with process 6/7 asymmetry, estimated as the difference in intensities of components 6 and 7, were calculated only in regions of high intensity of process 6/7 (component 6 + component 7 intensity >1.4).
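The gene bias measure can be sketched for a single window as below. This is a hypothetical Python illustration: gene coordinates are assumed to be half-open intervals, the '+' strand is assumed to correspond to the reference strand, and overlapping genes on the same strand are merged so that each nucleotide is counted once per strand.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def gene_bias(genes, window_start, window_end):
    """Nucleotides transcribed on '+' minus nucleotides transcribed on '-'
    inside [window_start, window_end). genes: (start, end, strand) tuples."""
    covered = {}
    for strand in "+-":
        clipped = [(max(s, window_start), min(e, window_end))
                   for s, e, st in genes
                   if st == strand and s < window_end and e > window_start]
        covered[strand] = sum(e - s for s, e in merge_intervals(clipped))
    return covered["+"] - covered["-"]
```

A positive value indicates that more of the window is transcribed from the reference strand, and a negative value the opposite.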
Methylation levels for each CpG dinucleotide were obtained from (7), and the methylation level of a window was calculated as the mean across all CpG sites within it.
Hydroxymethylation data were obtained from (8). Because this track is very sparse, we followed a previous study (9) and considered any CpG site with a fraction of hydroxymethylated reads exceeding 0.1 as hydroxymethylated. The hydroxymethylation level of a window was calculated as the fraction of hydroxymethylated CpG dinucleotides among all CpG dinucleotides.
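The two window-level summaries above can be expressed compactly. The Python sketch below is illustrative: the input format (lists of position and value pairs), the fixed window size and the function names are assumptions, and the hydroxymethylation input is assumed to list every CpG site in the window, including sites with no hydroxymethylated reads.

```python
from collections import defaultdict


def mean_methylation_by_window(sites, window_size):
    """sites: (position, methylation level) per CpG. Returns window -> mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for pos, level in sites:
        window = pos // window_size
        sums[window] += level
        counts[window] += 1
    return {w: sums[w] / counts[w] for w in sums}


def hydroxymethylation_by_window(sites, window_size, threshold=0.1):
    """sites: (position, fraction of hydroxymethylated reads) per CpG.
    A CpG is called hydroxymethylated if its fraction exceeds the threshold;
    the window level is the share of such CpGs among all CpGs in the window."""
    hits, counts = defaultdict(int), defaultdict(int)
    for pos, fraction in sites:
        window = pos // window_size
        counts[window] += 1
        if fraction > threshold:
            hits[window] += 1
    return {w: hits[w] / counts[w] for w in counts}
```

Windows without any CpG site are simply absent from the returned dictionaries.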
Histone modification tracks for H3K4me3, H3K27ac and H3K4me1 were downloaded from the UCSC genome browser. These tracks were obtained for human embryonic stem cells as a potentially relevant cell type.
Sex-specific recombination rates were obtained from (3).
CpG island coordinates were downloaded from the UCSC genome browser.
Correlations between all tracks and mutational processes are listed in Suppl. Table 1.
Associations with the activity of nucleotide excision repair
Nucleotide excision repair (NER) effectively removes bulky lesions, and its activity is partly governed by chromatin structure (10). The kinetics of repair of cyclobutane pyrimidine dimers (CPDs) and (6-4) photoproducts (6-4PPs) by NER were measured in (11). Repair of 6-4PPs is completed in less than an hour and is thus unlikely to be relevant for mutagenesis in the germline, because divisions of spermatogonia take many days and the dictyate phase of oogenesis lasts for many years. Therefore, we focused on the repair of CPDs, a much slower process (11). The majority of UV-induced lesions occur in TT dinucleotides due to the properties of UV radiation. To account for this bias, we normalized NER activity to TT dinucleotide content. We then correlated NER efficiency with the intensity of each mutational process.
Conversely, local NER activity should be inversely related to the amount of damage that remains in DNA 48 hours after UV irradiation. We therefore also correlated mutational processes with the amount of unrepaired CPD damage (12), normalized to TT dinucleotide content. Correlations between NER activity and mutational processes are shown in Fig. S4.
Clustered de novo mutations
In line with previous studies, we defined clustered de novo mutations as pairs of mutations observed in the same individual less than 20,000 nucleotides apart (13, 14). De novo mutations were obtained from (3), and an entire cluster was assigned maternal or paternal origin if it contained at least one phased mutation of that origin. Clusters with mutations on both the paternal and maternal haplotypes were excluded.
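The cluster definition and origin assignment can be sketched as follows. This Python example is illustrative only: the record fields (`individual`, `chrom`, `pos`, `parent`) are hypothetical, and chaining consecutive mutations whose spacing is below the threshold into one cluster is an assumption that extends the pairwise definition to runs of three or more mutations.

```python
from collections import defaultdict


def find_clusters(mutations, max_dist=20000):
    """Group mutations of the same individual and chromosome whenever
    consecutive positions are less than max_dist nucleotides apart."""
    groups = defaultdict(list)
    for m in mutations:
        groups[(m["individual"], m["chrom"])].append(m)
    clusters = []
    for muts in groups.values():
        muts.sort(key=lambda m: m["pos"])
        current = [muts[0]]
        for m in muts[1:]:
            if m["pos"] - current[-1]["pos"] < max_dist:
                current.append(m)
            else:
                if len(current) > 1:
                    clusters.append(current)
                current = [m]
        if len(current) > 1:
            clusters.append(current)
    return clusters


def cluster_origin(cluster):
    """'maternal' or 'paternal' if all phased mutations agree, 'excluded' if
    both origins occur, None if no mutation in the cluster is phased."""
    origins = {m["parent"] for m in cluster if m["parent"] is not None}
    if not origins:
        return None
    if len(origins) > 1:
        return "excluded"
    return origins.pop()
```

A single phased mutation is enough to assign the whole cluster, which mirrors the rule stated above.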
Alteration of mutation rate in gene bodies
To directly estimate the effect of transcription on mutation rate, we compared the mutation rate for each of the 12 mutation types on the non-transcribed strand of a gene with the mutation rate 100 kb upstream and downstream of the gene (Fig. 3G and Suppl. Fig. S5). To reliably estimate the intensity of the process and the mutation rate within genes, only genes longer than 100 kb were considered. Differences in mutation rate between the gene and its flanking regions were normalized to the genome-average mutation rate for the corresponding mutation type.
Maternal age effect in regions susceptible to process 6/7
“Maternal regions” were defined by a high combined intensity of component 6 + component 7. To choose the threshold for this sum, we compared the quantiles of the distribution of component 6 + component 7 with those of the normal distribution; the deviating right tail of 8601 windows was used to define “maternal regions” (Fig. S5).
To calculate the effect of maternal age, we used maximum likelihood estimation with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization algorithm (R package bbmle). Uncertainty contributed by non-phased mutations was handled as in (15).
Effect of transcription factor binding sites on mutation rate
Aggregate ChIP-seq peaks were obtained from ReMap2018 (16). CpG islands were excluded from the following analysis.
The mutation rate for the set of transcription factor binding sites was calculated in overlapping 100-nucleotide sliding windows for each trinucleotide context. This rate was then normalized to the genome-average mutation rate and combined into the stated categories using a weighted average.
Acknowledgements
Whole genome sequencing (WGS) for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). Specific funding sources for each study and genomic center are given in Supplementary Note 3 and Table S4.
Centralized read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Phenotype harmonization, data management, sample-identity QC, and general study coordination were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. R.A.S and P.V.K. were supported by NHLBI (R01HL131768).
The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.
References
- 1. Kunkel TA, Erie DA, Eukaryotic Mismatch Repair in Relation to DNA Replication. Annu. Rev. Genet 49, 291–313 (2015).
- 2. Yeeles JTP, Poli J, Marians KJ, Pasero P, Rescuing stalled or damaged replication forks. Cold Spring Harb Perspect Biol. 5, a012815 (2013).
- 3. Marteijn JA, Lans H, Vermeulen W, Hoeijmakers JHJ, Understanding nucleotide excision repair and its roles in cancer and ageing. Nat. Rev. Mol. Cell Biol 15, 465–481 (2014).
- 4. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale A-L, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Imielinsk M, Jäger N, Jones DTW, Jones D, Knappskog S, Kool M, Lakhani SR, López-Otín C, Martin S, Munshi NC, Nakamura H, Northcott PA, Pajic M, Papaemmanuil E, Paradiso A, Pearson JV, Puente XS, Raine K, Ramakrishna M, Richardson AL, Richter J, Rosenstiel P, Schlesner M, Schumacher TN, Span PN, Teague JW, Totoki Y, Tutt ANJ, Valdés-Mas R, van Buuren MM, van’t Veer L, Vincent-Salomon A, Waddell N, Yates LR, Australian Pancreatic Cancer Genome Initiative, ICGC Breast Cancer Consortium, ICGC MMML-Seq Consortium, ICGC PedBrain, Zucman-Rossi J, Futreal PA, McDermott U, Lichter P, Meyerson M, Grimmond SM, Siebert R, Campo E, Shibata T, Pfister SM, Campbell PJ, Stratton MR, Signatures of mutational processes in human cancer. Nature. 500, 415–421 (2013).
- 5. Helleday T, Eshtad S, Nik-Zainal S, Mechanisms underlying mutational signatures in human cancers. Nat. Rev. Genet 15, 585–598 (2014).
- 6. Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Ng AW, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN, Islam SMA, Lopez-Bigas N, Klimczak LJ, McPherson JR, Morganella S, Sabarinathan R, Wheeler DA, Mustonen V, the P. M. S. W. Group, Getz G, Rozen SG, Stratton MR, The Repertoire of Mutational Signatures in Human Cancer. bioRxiv, 322859 (2019).
- 7. Harris K, Pritchard JK, Rapid evolution of the human mutation spectrum. eLife. 6, e24284 (2017).
- 8. Narasimhan VM, Rahbari R, Scally A, Wuster A, Mason D, Xue Y, Wright J, Trembath RC, Maher ER, van Heel DA, Auton A, Hurles ME, Tyler-Smith C, Durbin R, Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nature Communications. 8, 303 (2017).
- 9. Harris K, Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl. Acad. Sci. U.S.A 112, 3439–3444 (2015).
- 10. Mathieson I, Reich D, Differences in the rare variant spectrum among human populations. PLOS Genetics. 13, e1006581 (2017).
- 11. Hodgkinson A, Eyre-Walker A, Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet 12, 756–766 (2011).
- 12. Terekhanova NV, Seplyarskiy VB, Soldatov RA, Bazykin GA, Evolution of Local Mutation Rate and Its Determinants. Mol. Biol. Evol 34, 1100–1109 (2017).
- 13. Smith TCA, Arndt PF, Eyre-Walker A, Large scale variation in the rate of germ-line de novo mutation, base composition, divergence and diversity in humans. PLOS Genetics. 14, e1007254 (2018).
- 14. Green P, Ewing B, Miller W, Thomas PJ, Green ED, Transcription-associated mutational asymmetry in mammalian evolution. Nature Genetics. 33, 514 (2003).
- 15.Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, Pitsillides AN, LeFaive J, Lee S, Tian X, Browning BL, Das S, Emde A-K, Clarke WE, Loesch DP, Shetty AC, Blackwell TW, Wong Q, Aguet F, Albert C, Alonso A, Ardlie KG, Aslibekyan S, Auer PL, Barnard J, Barr RG, Becker LC, Beer RL, Benjamin EJ, Bielak LF, Blangero J, Boehnke M, Bowden DW, Brody JA, Burchard EG, Cade BE, Casella JF, Chalazan B, Chen Y-DI, Cho MH, Choi SH, Chung MK, Clish CB, Correa A, Curran JE, Custer B, Darbar D, Daya M, de Andrade M, DeMeo DL, Dutcher SK, Ellinor PT, Emery LS, Fatkin D, Forer L, Fornage M, Franceschini N, Fuchsberger C, Fullerton SM, Germer S, Gladwin MT, Gottlieb DJ, Guo X, Hall ME, He J, Heard-Costa NL, Heckbert SR, Irvin MR, Johnsen JM, Johnson AD, Kardia SLR, Kelly T, Kelly S, Kenny EE, Kiel DP, Klemmer R, Konkle BA, Kooperberg C, Köttgen A, Lange LA, Lasky-Su J, Levy D, Lin X, Lin K-H, Liu C, Loos RJF, Garman L, Gerszten R, Lubitz SA, Lunetta KL, Mak ACY, Manichaikul A, Manning AK, Mathias RA, McManus DD, McGarvey ST, Meigs JB, Meyers DA, Mikulla JL, Minear MA, Mitchell B, Mohanty S, Montasser ME, Montgomery C, Morrison AC, Murabito JM, Natale A, Natarajan P, Nelson SC, North KE, O’Connell JR, Palmer ND, Pankratz N, Peloso GM, Peyser PA, Post WS, Psaty BM, Rao DC, Redline S, Reiner AP, Roden D, Rotter JI, Ruczinski I, Sarnowski C, Schoenherr S, Seo J-S, Seshadri S, Sheehan VA, Shoemaker MB, Smith AV, Smith NL, Smith JA, Sotoodehnia N, Stilp AM, Tang W, Taylor KD, Telen M, Thornton TA, Tracy RP, Berg DJVD, Vasan RS, Viaud-Martinez KA, Vrieze S, Weeks DE, Weir BS, Weiss ST, Weng L-C, Willer CJ, Zhang Y, Zhao X, Arnett DK, Ashley-Koch AE, Barnes KC, Boerwinkle E, Gabriel S, Gibbs R, Rice KM, Rich SS, Silverman E, Qasba P, Gan W, Topm. P. G. W. G. 
Trans-Omics for Precision Medicine (TOPMed) Program, Papanicolaou GJ, Nickerson DA, Browning SR, Zody MC, Zöllner S, Wilson JG, Cupples LA, Laurie CC, Jaquish CE, Hernandez RD, O’Connor TD, Abecasis GR, Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv, 563866 (2019).
- 16. Carlson J, Locke AE, Flickinger M, Zawistowski M, Levy S, Myers RM, Boehnke M, Kang HM, Scott LJ, Li JZ, Zöllner S, Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nature Communications. 9, 3753 (2018).
- 17. An J-Y, Lin K, Zhu L, Werling DM, Dong S, Brand H, Wang HZ, Zhao X, Schwartz GB, Collins RL, Currall BB, Dastmalchi C, Dea J, Duhn C, Gilson MC, Klei L, Liang L, Markenscoff-Papadimitriou E, Pochareddy S, Ahituv N, Buxbaum JD, Coon H, Daly MJ, Kim YS, Marth GT, Neale BM, Quinlan AR, Rubenstein JL, Sestan N, State MW, Willsey AJ, Talkowski ME, Devlin B, Roeder K, Sanders SJ, Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science. 362, eaat6576 (2018).
- 18. Halldorsson BV, Palsson G, Stefansson OA, Jonsson H, Hardarson MT, Eggertsson HP, Gunnarsson B, Oddsson A, Halldorsson GH, Zink F, Gudjonsson SA, Frigge ML, Thorleifsson G, Sigurdsson A, Stacey SN, Sulem P, Masson G, Helgason A, Gudbjartsson DF, Thorsteinsdottir U, Stefansson K, Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science. 363, eaau1043 (2019).
- 19. Haradhvala NJ, Polak P, Stojanov P, Covington KR, Shinbrot E, Hess JM, Rheinbay E, Kim J, Maruvka YE, Braunstein LZ, Kamburov A, Hanawalt PC, Wheeler DA, Koren A, Lawrence MS, Getz G, Mutational Strand Asymmetries in Cancer Genomes Reveal Mechanisms of DNA Damage and Repair. Cell. 164, 538–549 (2016).
- 20. Seplyarskiy VB, Akkuratov EE, Akkuratova N, Andrianova MA, Nikolaev SI, Bazykin GA, Adameyko I, Sunyaev SR, Error-prone bypass of DNA lesions during lagging-strand replication is a common source of germline and cancer mutations. Nature Genetics. 51, 36 (2019).
- 21. Adar S, Hu J, Lieb JD, Sancar A, Genome-wide kinetics of DNA excision repair in relation to chromatin state and mutagenesis. Proc. Natl. Acad. Sci. U.S.A 113, E2124–2133 (2016).
- 22. Hu J, Adebali O, Adar S, Sancar A, Dynamic maps of UV damage formation and repair for the human genome. Proc. Natl. Acad. Sci. U.S.A (2017), doi: 10.1073/pnas.1706522114.
- 23. Andrianova MA, Bazykin GA, Nikolaev SI, Seplyarskiy VB, Human mismatch repair system balances mutation rates between strands by removing more mismatches from the lagging strand. Genome Res. (2017), doi: 10.1101/gr.219915.116.
- 24. Haradhvala NJ, Kim J, Maruvka YE, Polak P, Rosebrock D, Livitz D, Hess JM, Leshchiner I, Kamburov A, Mouw KW, Lawrence MS, Getz G, Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. Nat Commun. 9, 1746 (2018).
- 25. Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A, Wong WSW, Sigurdsson G, Walters GB, Steinberg S, Helgason H, Thorleifsson G, Gudbjartsson DF, Helgason A, Magnusson OT, Thorsteinsdottir U, Stefansson K, Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 488, 471–475 (2012).
- 26. Tomasetti C, Li L, Vogelstein B, Stem cell divisions, somatic mutations, cancer etiology, and cancer prevention. Science. 355, 1330–1334 (2017).
- 27. Stamatoyannopoulos JA, Adzhubei I, Thurman RE, Kryukov GV, Mirkin SM, Sunyaev SR, Human mutation rate associated with DNA replication timing. Nature Genetics. 41, 393–395 (2009).
- 28. Agarwal I, Przeworski M, Signatures of replication, recombination and sex in the spectrum of rare variants on the human X chromosome and autosomes. bioRxiv, 519421 (2019).
- 29. Goldmann JM, Seplyarskiy VB, Wong WSW, Vilboux T, Neerincx PB, Bodian DL, Solomon BD, Veltman JA, Deeken JF, Gilissen C, Niederhuber JE, Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet 50, 487–492 (2018).
- 30. Jónsson H, Sulem P, Kehr B, Kristmundsdottir S, Zink F, Hjartarson E, Hardarson MT, Hjorleifsson KE, Eggertsson HP, Gudjonsson SA, Ward LD, Arnadottir GA, Helgason EA, Helgason H, Gylfason A, Jonasdottir A, Jonasdottir A, Rafnar T, Frigge M, Stacey SN, Magnusson OT, Thorsteinsdottir U, Masson G, Kong A, Halldorsson BV, Helgason A, Gudbjartsson DF, Stefansson K, Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature. 549, 519–522 (2017).
- 31. Wei P-C, Lee C-S, Du Z, Schwer B, Zhang Y, Kao J, Zurita J, Alt FW, Three classes of recurrent DNA break clusters in brain progenitors identified by 3D proximity-based break joining assay. PNAS. 115, 1919–1924 (2018).
- 32. Ottolini CS, Newnham L, Capalbo A, Natesan SA, Joshi HA, Cimadomo D, Griffin DK, Sage K, Summers MC, Thornhill AR, Housworth E, Herbert AD, Rienzi L, Ubaldi FM, Handyside AH, Hoffmann ER, Genome-wide recombination and chromosome segregation in human oocytes and embryos reveal selection for maternal recombination rates. Nat Genet. 47, 727–735 (2015).
- 33. Jinks-Robertson S, Bhagwat AS, Transcription-associated mutagenesis. Annu. Rev. Genet 48, 341–359 (2014).
- 34. Supek F, Lehner B, Hajkova P, Warnecke T, Hydroxymethylated Cytosines Are Associated with Elevated C to G Transversion Rates. PLOS Genetics. 10, e1004585 (2014).
- 35. Wu X, Zhang Y, TET-mediated active DNA demethylation: mechanism, function and beyond. Nature Reviews Genetics. 18, 517–534 (2017).
- 36. Chan K, Resnick MA, Gordenin DA, The choice of nucleotide inserted opposite abasic sites formed within chromosomal DNA reveals the polymerase activities participating in translesion DNA synthesis. DNA Repair (Amst.). 12, 878–889 (2013).
Material and methods references
- 1.Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, Taliun SAG, Corvelo A, Gogarten SM, Kang HM, Pitsillides AN, LeFaive J, Lee S, Tian X, Browning BL, Das S, Emde A-K, Clarke WE, Loesch DP, Shetty AC, Blackwell TW, Wong Q, Aguet F, Albert C, Alonso A, Ardlie KG, Aslibekyan S, Auer PL, Barnard J, Barr RG, Becker LC, Beer RL, Benjamin EJ, Bielak LF, Blangero J, Boehnke M, Bowden DW, Brody JA, Burchard EG, Cade BE, Casella JF, Chalazan B, Chen Y-DI, Cho MH, Choi SH, Chung MK, Clish CB, Correa A, Curran JE, Custer B, Darbar D, Daya M, de Andrade M, DeMeo DL, Dutcher SK, Ellinor PT, Emery LS, Fatkin D, Forer L, Fornage M, Franceschini N, Fuchsberger C, Fullerton SM, Germer S, Gladwin MT, Gottlieb DJ, Guo X, Hall ME, He J, Heard-Costa NL, Heckbert SR, Irvin MR, Johnsen JM, Johnson AD, Kardia SLR, Kelly T, Kelly S, Kenny EE, Kiel DP, Klemmer R, Konkle BA, Kooperberg C, Köttgen A, Lange LA, Lasky-Su J, Levy D, Lin X, Lin K-H, Liu C, Loos RJF, Garman L, Gerszten R, Lubitz SA, Lunetta KL, Mak ACY, Manichaikul A, Manning AK, Mathias RA, McManus DD, McGarvey ST, Meigs JB, Meyers DA, Mikulla JL, Minear MA, Mitchell B, Mohanty S, Montasser ME, Montgomery C, Morrison AC, Murabito JM, Natale A, Natarajan P, Nelson SC, North KE, O’Connell JR, Palmer ND, Pankratz N, Peloso GM, Peyser PA, Post WS, Psaty BM, Rao DC, Redline S, Reiner AP, Roden D, Rotter JI, Ruczinski I, Sarnowski C, Schoenherr S, Seo J-S, Seshadri S, Sheehan VA, Shoemaker MB, Smith AV, Smith NL, Smith JA, Sotoodehnia N, Stilp AM, Tang W, Taylor KD, Telen M, Thornton TA, Tracy RP, Berg DJVD, Vasan RS, Viaud-Martinez KA, Vrieze S, Weeks DE, Weir BS, Weiss ST, Weng L-C, Willer CJ, Zhang Y, Zhao X, Arnett DK, Ashley-Koch AE, Barnes KC, Boerwinkle E, Gabriel S, Gibbs R, Rice KM, Rich SS, Silverman E, Qasba P, Gan W, Topm. P. G. W. G. 
Trans-Omics for Precision Medicine (TOPMed) Program, Papanicolaou GJ, Nickerson DA, Browning SR, Zody MC, Zöllner S, Wilson JG, Cupples LA, Laurie CC, Jaquish CE, Hernandez RD, O’Connor TD, Abecasis GR, Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. bioRxiv, 563866 (2019).
- 2. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won H-H, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, Exome Aggregation Consortium, Analysis of protein-coding genetic variation in 60,706 humans. Nature. 536, 285–291 (2016).
- 3. Halldorsson BV, Palsson G, Stefansson OA, Jonsson H, Hardarson MT, Eggertsson HP, Gunnarsson B, Oddsson A, Halldorsson GH, Zink F, Gudjonsson SA, Frigge ML, Thorleifsson G, Sigurdsson A, Stacey SN, Sulem P, Masson G, Helgason A, Gudbjartsson DF, Thorsteinsdottir U, Stefansson K, Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science. 363, eaau1043 (2019).
- 4. An J-Y, Lin K, Zhu L, Werling DM, Dong S, Brand H, Wang HZ, Zhao X, Schwartz GB, Collins RL, Currall BB, Dastmalchi C, Dea J, Duhn C, Gilson MC, Klei L, Liang L, Markenscoff-Papadimitriou E, Pochareddy S, Ahituv N, Buxbaum JD, Coon H, Daly MJ, Kim YS, Marth GT, Neale BM, Quinlan AR, Rubenstein JL, Sestan N, State MW, Willsey AJ, Talkowski ME, Devlin B, Roeder K, Sanders SJ, Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science. 362, eaat6576 (2018).
- 5. The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome. Nature. 489, 57–74 (2012).
- 6. Seplyarskiy VB, Andrianova MA, Bazykin GA, APOBEC3A/B-induced mutagenesis is responsible for 20% of heritable mutations in the TpCpW context. Genome Res. 27, 175–184 (2017).
- 7. Zink F, Magnusdottir DN, Magnusson OT, Walker NJ, Morris TJ, Sigurdsson A, Halldorsson GH, Gudjonsson SA, Melsted P, Ingimundardottir H, Kristmundsdottir S, Alexandersson KF, Helgadottir A, Gudmundsson J, Rafnar T, Jonsdottir I, Holm H, Eyjolfsson GI, Sigurdardottir O, Olafsson I, Masson G, Gudbjartsson DF, Thorsteinsdottir U, Halldorsson BV, Stacey SN, Stefansson K, Insights into imprinting from parent-of-origin phased methylomes and transcriptomes. Nature Genetics. 50, 1542 (2018).
- 8. Yu M, Hon GC, Szulwach KE, Song C-X, Zhang L, Kim A, Li X, Dai Q, Shen Y, Park B, Min J-H, Jin P, Ren B, He C, Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome. Cell. 149, 1368–1380 (2012).
- 9. Supek F, Lehner B, Hajkova P, Warnecke T, Hydroxymethylated Cytosines Are Associated with Elevated C to G Transversion Rates. PLOS Genetics. 10, e1004585 (2014).
- 10. Yu S, Evans K, van Eijk P, Bennett M, Webster RM, Leadbitter M, Teng Y, Waters R, Jackson SP, Reed SH, Global genome nucleotide excision repair is organized into domains that promote efficient DNA repair in chromatin. Genome Res. 26, 1376–1387 (2016).
- 11. Adar S, Hu J, Lieb JD, Sancar A, Genome-wide kinetics of DNA excision repair in relation to chromatin state and mutagenesis. Proc. Natl. Acad. Sci. U.S.A 113, E2124–2133 (2016).
- 12. Hu J, Adebali O, Adar S, Sancar A, Dynamic maps of UV damage formation and repair for the human genome. Proc. Natl. Acad. Sci. U.S.A (2017), doi: 10.1073/pnas.1706522114.
- 13. Francioli LC, Polak PP, Koren A, Menelaou A, Chun S, Renkens I, Genome of the Netherlands Consortium, van Duijn CM, Swertz M, Wijmenga C, van Ommen G, Slagboom PE, Boomsma DI, Ye K, Guryev V, Arndt PF, Kloosterman WP, de Bakker PIW, Sunyaev SR, Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet 47, 822–826 (2015).
- 14. Goldmann JM, Seplyarskiy VB, Wong WSW, Vilboux T, Neerincx PB, Bodian DL, Solomon BD, Veltman JA, Deeken JF, Gilissen C, Niederhuber JE, Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet 50, 487–492 (2018).
- 15. Gao Z, Moorjani P, Sasani T, Pedersen B, Quinlan A, Jorde L, Amster G, Przeworski M, Overlooked roles of DNA damage and maternal age in generating human germline mutations. bioRxiv, 327098 (2018).
- 16. Chèneby J, Gheorghe M, Artufel M, Mathelier A, Ballester B, ReMap 2018: an updated atlas of regulatory regions from an integrative analysis of DNA-binding ChIP-seq experiments. Nucleic Acids Res. 46, D267–D275 (2018).