Abstract
The rate and spectrum of somatic mutations can diverge from those of germline mutations because somatic tissues experience different mutagenic processes than germline tissues. Here, we use nanorate sequencing (NanoSeq) to identify somatic mutations in Arabidopsis shoots with high sensitivity. We report a somatic mutation rate of 3.6x10−8 mutations/bp, ~4-5x the germline mutation rate. Somatic mutations displayed elevated signatures consistent with oxidative damage, UV damage, and transcription-coupled nucleotide excision repair. Both somatic and germline mutations were enriched in transposable elements and depleted in genes, but this depletion was greater in germline mutations. Somatic mutation rate correlated with proximity to the centromere, DNA methylation, chromatin accessibility, and gene/TE content, correlations that also largely held for germline mutations. We note that DNA methylation and chromatin accessibility have different predicted effects on mutation rate in genic and non-genic regions: DNA methylation is associated with a greater increase in mutation rate in non-genic regions, and accessible chromatin is associated with a lower mutation rate in non-genic regions but a higher mutation rate in genic regions. Together, these results characterize key differences and similarities in the genomic distribution of somatic and germline mutations.
Introduction
The rate of spontaneous mutations can vary between tissue type [1-6], genomic location [7-14], and environmental conditions [15-17]. A better understanding of how often and where new mutations occur would inform many open questions in plant biology and agriculture. For instance, in animals, the somatic mutation rate can be 4-25 times higher than the germline rate [1, 4, 18]; to what degree is this also true in plants? Do plants propagated vegetatively or through tissue culture have a different genomic distribution of mutations compared to plants grown from seed? How might long-lived plants, such as trees, cope with ongoing mutations throughout development?
Answering such questions requires methods to accurately measure the rate and distribution of somatic mutations. This is challenging because current sequencing error rates are ~10−3 [19], orders of magnitude higher than the rate of spontaneous mutation [1]. One way to address this problem is through deep whole-genome sequencing [2, 3, 12, 13, 20-25]: by focusing on mutations observed multiple times in independent reads, it is possible to distinguish true mutations from sequencing errors. However, the need to repeatedly observe a mutation within a sample limits the analysis to mutations that are relatively abundant. This can bias against detection of mutations acquired later in development, which will be rare within a plant. An alternative method without these restrictions is Duplex Sequencing, which can confidently identify mutations from a single DNA molecule [26].
Duplex Sequencing and its successor, nanorate sequencing (NanoSeq), are designed to detect mutations in individual DNA molecules by repeatedly sequencing the top and bottom strand of DNA [1, 27, 28]. True mutations will be present in all PCR duplicates derived from the top strand and the bottom strand. In contrast, PCR errors, sequencing errors, and single-stranded DNA damage are unlikely to be present in duplicates of both strands and can be filtered out. This results in an estimated error rate of 2x10−7 errors/bp [1]. NanoSeq improved Duplex Sequencing further by minimizing errors introduced during DNA end repair, which were rare but contributed significantly to error rates [1]. NanoSeq has been used to estimate mutation burden in animal tissues reporting values consistent with single-cell derived colonies, placing its error rate around 5x10−9 [1]. One key advantage of NanoSeq and Duplex Sequencing lies in their ability to detect mutations on a molecule-by-molecule basis, allowing for identification of mutations independent of their frequency in the tissue [1, 26]. Duplex Sequencing has shown promise in identifying somatic mutations in Arabidopsis [29], but its initial use has returned few mutations in wild type plants (~1 mutation/plant), limiting the statistical power to ask many important questions. NanoSeq may offer a solution to this recall problem.
Here, we performed NanoSeq on wild-type Arabidopsis grown in standard conditions and under UV treatment, identifying 4,155 somatic mutations. We found that 93% of these mutations were unique to a single DNA molecule, supporting an allele frequency below ~1/200. We compared the untreated plant somatic mutations to germline mutations from mutation accumulation (MA) lines [10] and identified a somatic mutation rate 4-5x that of the germline rate. These somatic mutations also displayed greater signatures of oxidative damage, transcription-coupled nucleotide excision repair, and a UV-like process relative to germline mutations. We then investigated the genome-wide distribution of somatic mutations and found mutations occurred more frequently in transposable elements (TEs), methylated cytosines, and the pericentromere, consistent with observations in MA lines [9, 10]. In addition, we noted accessible chromatin correlated with lower mutation rates outside of genes but higher rates within and upstream of genes. Intriguingly, these patterns were absent in plants treated with UVC, which instead displayed a largely uniform distribution of somatic mutations.
Results
Identification of somatic mutations with NanoSeq
We created NanoSeq libraries from the shoots of 8 Col-0 Arabidopsis plants (Fig 1A&B). Somatic mutations were identified using a custom filtering pipeline (Fig S1-3 and Methods). These filters removed single-stranded DNA damage, as well as sequencing, PCR, and alignment errors. We then removed any mutations found in more than one sibling or present in >35% of the reads, since these are likely inherited germline mutations.
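The final germline filter described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the custom pipeline itself: the record layout (`plant`, `site`, `alt_reads`, `total_reads`) and the function name `filter_germline` are hypothetical, and only the two criteria (shared between siblings, or present in >35% of reads) come from the text.

```python
from collections import defaultdict

def filter_germline(calls, max_fraction=0.35):
    """Drop calls seen in more than one sibling or in >35% of reads.

    calls: list of dicts with hypothetical keys
      plant, site (chrom, pos, ref, alt), alt_reads, total_reads
    """
    # Record which plants carry each candidate variant.
    plants_by_site = defaultdict(set)
    for call in calls:
        plants_by_site[call["site"]].add(call["plant"])

    kept = []
    for call in calls:
        shared = len(plants_by_site[call["site"]]) > 1
        frequent = call["alt_reads"] / call["total_reads"] > max_fraction
        if not (shared or frequent):
            kept.append(call)
    return kept

# Made-up example calls, for illustration only.
calls = [
    {"plant": "p1", "site": ("Chr1", 100, "C", "T"), "alt_reads": 1, "total_reads": 200},
    {"plant": "p1", "site": ("Chr2", 500, "G", "A"), "alt_reads": 90, "total_reads": 200},
    {"plant": "p1", "site": ("Chr3", 42, "A", "G"), "alt_reads": 2, "total_reads": 180},
    {"plant": "p2", "site": ("Chr3", 42, "A", "G"), "alt_reads": 3, "total_reads": 210},
]
somatic = filter_germline(calls)
```

In this toy input only the Chr1 call survives: the Chr2 call exceeds the read-fraction threshold and the Chr3 variant is shared by two siblings.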
Figure 1: Identification of somatic mutations with NanoSeq.
A, Schematic of the core difference between NanoSeq and standard whole genome sequencing (WGS) libraries. After fragmentation and adapter ligation of genomic DNA, NanoSeq libraries are diluted to a target number of DNA molecules. PCR and sequencing of the NanoSeq library yield multiple PCR duplicates from each strand of most DNA molecules. Somatic mutations are identified by their presence in all PCR duplicates of a molecule. In contrast, standard WGS has no dilution, so most molecules have only a single read pair, making low abundance mutations indistinguishable from errors. B, NanoSeq libraries constructed for this study. Eight untreated and six UV-treated Col-0 sibling Arabidopsis were used for NanoSeq libraries and somatic mutations were identified in each. Six in silico “swapped” samples were made by combining the top and bottom strand PCR duplicates of molecules from two different untreated NanoSeq libraries. C, Frequency of mutations identified by NanoSeq. Mutations found in only a single DNA molecule are colored blue, whereas mutations found in more than one are colored pink. Mutations with a frequency >0.35 were excluded from further analysis and are colored light pink. D, Somatic mutation rates of NanoSeq libraries. Somatic mutation rate was calculated as the mutation count divided by the number of bases in DNA molecules which pass all filters (i.e. callable bases). Dots represent replicate plants. The horizontal line is the Arabidopsis germline mutation rate [10]. Error bars represent ± one SEM.
We identified 1,425 somatic mutations, including 1,323 single nucleotide variants (SNVs) and 102 small insertions and deletions (indels). Ninety-three percent of mutations were found in only one DNA molecule (Fig 1C) but could still be confidently identified as mutations based on duplicate reads from each strand of the DNA duplex. The sites of these single molecule mutations were covered by ~200 reads, giving a rough upper bound for their frequency of ~1/200 molecules, or ~1/100 diploid cells within the plant.
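The frequency bound above is simple arithmetic, sketched here in Python. The sketch assumes that a new mutation is heterozygous, that each diploid cell contributes two copies of the site, and that the ~200-fold coverage approximates the number of genome copies sampled at the site; the function name is hypothetical.

```python
from fractions import Fraction

def frequency_upper_bounds(mutant_molecules, coverage, ploidy=2):
    """Return rough upper bounds on (allele frequency, fraction of cells)."""
    allele_freq = Fraction(mutant_molecules, coverage)
    # A heterozygous mutation occupies one of `ploidy` copies per mutant cell,
    # so the fraction of mutant cells is `ploidy` times the allele frequency.
    cell_freq = min(Fraction(1), allele_freq * ploidy)
    return allele_freq, cell_freq

allele_freq, cell_freq = frequency_upper_bounds(1, 200)
```

With one mutant molecule at ~200x coverage this gives 1/200 molecules and 1/100 diploid cells, matching the values quoted in the text.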
With NanoSeq, each base of a molecule which passes all filters (i.e. a callable base) represents an opportunity to detect a mutation. Thus, a sample’s somatic mutation rate was calculated as mutations divided by callable bases (Methods). This normalization method reported consistent mutation rates after downsampling the data (Fig S4), showing that comparisons between samples of different sequencing depths are possible.
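As a minimal sketch of this normalization, the per-sample rate and a mean ± SEM across replicate plants could be computed as below. The counts are made up for illustration, and averaging per-plant rates (rather than pooling all mutations and callable bases) is an assumption; the text does not state which was used.

```python
from math import sqrt
from statistics import mean, stdev

def mutation_rate(mutations, callable_bases):
    """Mutations per callable base for one NanoSeq library."""
    return mutations / callable_bases

def mean_and_sem(rates):
    """Mean and standard error of the mean across replicate plants."""
    return mean(rates), stdev(rates) / sqrt(len(rates))

# Hypothetical (mutation count, callable bases) pairs, not data from the study.
samples = [(150, 4.0e9), (200, 5.0e9), (180, 6.0e9)]
rates = [mutation_rate(m, c) for m, c in samples]
avg, sem = mean_and_sem(rates)
```

Because the denominator counts only bases that pass all filters, libraries sequenced to different depths remain comparable.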
To estimate an error rate for NanoSeq, we generated in silico “swapped” libraries, where each molecule consists of a mix of reads from two NanoSeq libraries. This eliminates true mutations, so any identified mutations must be false positives. This method estimated an error rate of 1.2x10−9, consistent with the error rate reported for NanoSeq in mammals [1] and 30-fold less than the somatic mutation rate measured in Arabidopsis shoots (Fig 1D). To further confirm that we can measure the somatic mutation rate in Arabidopsis beyond background errors, we tested whether growing plants under UV would increase the measured mutation rate. Plants were grown with supplemental UVB (N=2 plants) or UVC light (N=4 plants) and processed for NanoSeq identically to the untreated plants. We identified 2,730 additional somatic mutations in the UV treated plants, with an estimated 18- and 25-fold increase in the measured mutation rate for UVB and UVC, respectively.
The measured somatic mutation rate of the untreated plants was 3.6x10−8 mutations/bp. This number represents the average fraction of mutated bases across all DNA in the plants. Multiplying it by the diploid genome size tells us the average diploid somatic cell contains 8.6 mutations which were not present in the zygote. In Arabidopsis, the germline mutation rate (SNVs and indels) has been estimated as 7.4-8.25x10−9 mutations/bp, or 1.8-1.96 mutations per generation [9, 10]. Thus, the average somatic mutation rate in whole Arabidopsis shoots is ~4-5x greater than the germline mutation rate.
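The arithmetic in this paragraph can be checked with a short Python sketch. The haploid genome size of roughly 119 Mb is an assumption introduced here (the text does not state the value used), and the function name is hypothetical.

```python
# Assumption: approximate haploid Arabidopsis nuclear genome size in bp.
HAPLOID_GENOME_BP = 119_000_000

def mutations_per_diploid_genome(rate_per_bp, haploid_bp=HAPLOID_GENOME_BP):
    """Expected number of mutations in one diploid genome copy set."""
    return rate_per_bp * 2 * haploid_bp

somatic_rate = 3.6e-8                        # mutations/bp, from the text
germline_low, germline_high = 7.4e-9, 8.25e-9  # mutations/bp, from the text

somatic_burden = mutations_per_diploid_genome(somatic_rate)
germline_burden = (
    mutations_per_diploid_genome(germline_low),
    mutations_per_diploid_genome(germline_high),
)
# Ratio of somatic to germline rate, using each end of the germline range.
fold_range = (somatic_rate / germline_high, somatic_rate / germline_low)
```

Under this assumed genome size the sketch gives about 8.6 somatic mutations per diploid cell, about 1.8 to 1.96 germline mutations per generation, and a somatic:germline ratio between roughly 4.4 and 4.9.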
SNV spectra of somatic and germline mutations
We next looked for signatures of unique mutagenic processes in somatic mutations compared to germline mutations identified in Arabidopsis MA lines [10]. To do this, we visualized the somatic and germline mutation spectra as the rate of insertions, deletions, and each type of SNV (e.g. A>G rate = A>G mutations / callable A:T base pairs). Rates were normalized by the genome-wide average mutation rate to make spectra comparable between samples (Fig 2A, Methods). We then further broke down the SNVs by their 3bp sequence context, normalizing by the frequency of each context in the genome to make rates comparable between contexts (Fig 2B).
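One way to express this normalization is sketched below in Python. The function name and the toy counts are hypothetical, and treating every callable base as an opportunity for an insertion or deletion is an assumption. Note that dividing the fraction of mutations by the fraction of callable bases is algebraically the same as dividing the per-type rate by the genome-wide average rate.

```python
def relative_rates(counts, callable_cg, callable_at):
    """Relative rate per mutation type.

    counts: dict such as {"C>T": 60, "A>G": 40, "insertion": 3}
    callable_cg / callable_at: callable C:G and A:T base pairs
    """
    total_mutations = sum(counts.values())
    total_bases = callable_cg + callable_at
    rates = {}
    for kind, n in counts.items():
        if kind in ("insertion", "deletion"):
            opportunities = total_bases      # assumption: any base can host an indel
        elif kind[0] in "CG":
            opportunities = callable_cg      # SNVs at C:G base pairs
        else:
            opportunities = callable_at      # SNVs at A:T base pairs
        rates[kind] = (n / total_mutations) / (opportunities / total_bases)
    return rates

# Made-up counts for illustration only.
spectrum = relative_rates({"C>T": 60, "A>G": 40}, callable_cg=360, callable_at=640)
```

A value of 1 means the mutation type occurs at the genome-wide average rate; values above 1 indicate enrichment.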
Figure 2: SNV spectra of somatic and germline mutations.
A, Mutation rate of SNVs and indels by type. The rate for each SNV/indel is the fraction of mutations falling into each category divided by the fraction of callable bases where such a mutation could occur. Significance labels indicate whether the value is significantly different from the same SNV/indel in the untreated samples. B, SNV mutation rate by 3bp context. Rates are calculated the same as in A. C, Comparison of C>T mutation rates in YC vs RC sequence contexts (Y=C/T, R=A/G). D, Number of insertions and deletions detected by length. E, Comparison of C>T mutation rates in CG vs CH sequence contexts (H=A/C/T). A-E, Error bars represent ± one SEM. Dots represent replicate plants/lines. All significance values are from Holm adjusted two-tailed t-tests. *=p<0.05, **=p<0.01, ***=p<0.001, ****=p<0.0001. Untreated=somatic mutations in untreated plants, UVB=somatic mutations in UVB treated plants, UVC=somatic mutations in UVC treated plants, MA lines=germline mutations from published mutation accumulation lines [10].
In the UV-treated samples, C>T SNVs occurred at the highest rate, with YC sites having a far higher rate than RC sites (Y=C/T, R=A/G) (Fig 2C). This is consistent with known mechanisms of UV mutagenesis, where the cytosine in UV-induced pyrimidine dimers is spontaneously deaminated to uracil [30]. Prior studies of UV-induced mutations in Arabidopsis, yeast, and human cancer display a similar pattern [31-33]. The UV-treated samples also possessed a higher rate of T>A mutations in TTA and ATA contexts. This is likely due to thymidylyl-(3′-5′)-deoxyadenosine, a minor UV-induced photoproduct, and is a signature observed in Arabidopsis and yeast, but not humans [32, 34, 35].
When comparing somatic mutations from the untreated plants to the germline mutations, we noted a difference in the rate of C>A mutations. Of the germline SNVs, 5.6% were C>A, but this increased to 15.3% in the somatic mutations (p<0.001, Holm adjusted two-tailed t-test). In addition, the C>T mutation rate in the untreated samples was 2.4x higher in YC contexts than RC contexts, significantly greater than the 1.3x difference in the germline mutations (Fig 2C, p<0.01, Holm adjusted one-sample two-tailed t-test). Given that the plants were grown indoors under fluorescent lights, we expect UV levels to be very low, so the YC>YT somatic mutations may represent a mutational process other than UV. While fewer large indels (>10bp) were observed in the somatic mutations (Fig 2D), our mapping pipeline for NanoSeq is not optimized for large indels and likely underreports these. In most other regards, the somatic and germline mutation spectra were comparable, being dominated by C>T mutations and displaying higher rates of C>T mutation at CG sites (Fig 2E).
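The YC vs RC comparison can be sketched as follows. This is an illustrative Python outline with hypothetical names and made-up counts: each C>T mutation is represented by its 3bp context with the mutated cytosine in the middle, and the rate in each class is the mutation count divided by the number of callable sites in that class.

```python
def yc_rc_rate_ratio(ct_contexts, callable_contexts):
    """Ratio of C>T rate at YC sites to C>T rate at RC sites.

    ct_contexts: iterable of 3-base contexts, mutated C in the middle
    callable_contexts: dict of 3-base context -> number of callable sites
    """
    def site_class(context):
        # Y = C or T, R = A or G, judged on the base 5' of the cytosine.
        return "YC" if context[0] in "CT" else "RC"

    mutations = {"YC": 0, "RC": 0}
    sites = {"YC": 0, "RC": 0}
    for context in ct_contexts:
        mutations[site_class(context)] += 1
    for context, n in callable_contexts.items():
        if context[1] == "C":            # only cytosine-centered contexts count
            sites[site_class(context)] += n
    return (mutations["YC"] / sites["YC"]) / (mutations["RC"] / sites["RC"])

# Made-up inputs for illustration only.
observed = ["TCA"] * 6 + ["CCG"] * 6 + ["ACA"] * 5
callable_sites = {"TCA": 100, "CCG": 100, "ACA": 100, "GCT": 100, "ATA": 500}
ratio = yc_rc_rate_ratio(observed, callable_sites)
```

A ratio near 1 indicates no preference for a pyrimidine 5' of the mutated cytosine; larger values indicate a UV-like bias.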
Global distribution of somatic and germline mutations
We next aimed to characterize the distribution of somatic mutations within the genome compared to germline mutations. In addition to the MA line dataset, we included two polymorphism datasets. The first was 13,792,559 polymorphisms from wild accessions of the 1001 Genomes Project [36]. The second was 624,111 fixed polymorphisms in Ler-0, an accession within the 1001 Genomes dataset, which we identified using NanoSeq. These polymorphism datasets provide much greater power than the MA lines but represent historical mutations observed in a population rather than selfing lines. For simplicity, we will refer to the somatic mutations, MA line mutations, and polymorphisms collectively as “mutations”.
Mutation rate was calculated in sliding 2 megabase (Mb) windows across the genome (Fig 3A). Here, mutation rate refers to the number of mutations within the window divided by the callable bases within the window (or the number of callable sites for the MA line, Ler-0, and 1001 Genomes datasets, Methods). Thus, the mutation rate factors in the mappability and sequencing coverage of the window. For most datasets, mutation rate was highest near the centromere and lower on the chromosome arms. To quantify this relationship, we plotted the mutation rate of non-overlapping 1Mb bins against their distance to the centromere.
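A minimal Python sketch of the windowed calculation is shown below. It assumes mutations and callable bases have already been tallied in fixed-size bins, so that a 2Mb window is a run of consecutive bins; the bin size, the window step, and the function names are hypothetical, since the text does not specify the step.

```python
def sliding_window_rates(mutations_per_bin, callable_per_bin, bins_per_window):
    """Mutation rate (mutations / callable bases) for each sliding window."""
    rates = []
    for start in range(len(mutations_per_bin) - bins_per_window + 1):
        end = start + bins_per_window
        mutations = sum(mutations_per_bin[start:end])
        bases = sum(callable_per_bin[start:end])
        # Windows with no callable bases have an undefined rate.
        rates.append(mutations / bases if bases else None)
    return rates

def relative_to_genome_average(rates, total_mutations, total_callable):
    """Scale window rates so that 1 equals the genome-wide average rate."""
    genome_rate = total_mutations / total_callable
    return [None if r is None else r / genome_rate for r in rates]

# Made-up counts in four consecutive bins, for illustration only.
muts = [4, 2, 1, 1]
callable_bp = [100, 100, 100, 100]
windows = sliding_window_rates(muts, callable_bp, bins_per_window=2)
relative = relative_to_genome_average(windows, sum(muts), sum(callable_bp))
```

Dividing by callable bases rather than window length is what lets the rate account for mappability and coverage.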
Figure 3: Global distribution of somatic and germline mutations.
A, Mutation rate relative to the genome average across each chromosome. Mutation rate is calculated for sliding 2Mb windows as mutations divided by callable base pairs (untreated, UVB, and UVC) or callable sites (MA lines, Ler-0, and 1001). A value of 1 indicates the genome-wide average mutation rate for the sample. Vertical black lines indicate the centromere positions [39]. Shaded area around the untreated line represents a 95% confidence interval from bootstrapping samples. B, Correlation between distance to centromere and mutation rate. Each dot represents one non-overlapping 1Mb window. Linear regression line (ordinary least squares) and correlation coefficients are displayed. C, Ratio of nontemplate:template strand C>T mutations. Significance labels indicate whether the mean is significantly greater than one (Holm adjusted one-sample one-tailed t-test) D, Log2 fold relative mutation rate by genomic region. A value of 0 indicates the region has the same mutation rate as the genome-wide average. Pie chart indicates the proportion of the Arabidopsis genome classified as each region. E, Mutation rate in nonsynonymous sites relative to synonymous sites in each dataset. C-E, Error bars represent ± one SEM. For panels C & E, SEM of the MA lines was calculated by bootstrapping lines. Dots represent replicate plants. *=p<0.05, **=p<0.01. Untreated=somatic mutations in untreated Col-0, UVB=somatic mutations in UVB treated Col-0, UVC=somatic mutations in UVC treated Col-0, MA lines= germline mutations from published Col-0 mutation accumulation lines [10], Ler-0=polymorphisms in Ler-0 identified with NanoSeq, 1001= polymorphisms from published dataset of 1,135 wild accessions [36].
For most datasets, we observed a negative correlation between mutation rate and distance to centromere (Fig 3B); the one exception was the UVC-treated samples, where the mutation rate was more uniform across the chromosomes. We hypothesized that the uniform mutation rate in UVC, but not UVB, samples could be due to differential activity of transcription-coupled nucleotide excision repair (TC-NER). This pathway is known to repair UV damage only on the template strands of transcribed genes and so should be more active in the chromosome arms [37]. To test for signatures of TC-NER activity, we compared rates of C>T mutations when the cytosine was on the template vs nontemplate strand (Fig 3C). Contrary to our hypothesis, both UVB and UVC samples showed a strong signature of TC-NER activity, with C>T mutations roughly twice as common on the nontemplate strand. We also noted signatures of TC-NER activity in the untreated somatic mutations but not in the other datasets. Thus, TC-NER activity appears to be influencing the somatic mutation rate but not the germline mutation rate, and the uniform mutation rate in UVC samples is not a product of diminished TC-NER activity.
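The strand assignment behind this comparison can be sketched in Python. The sketch assumes mutations are reported on the reference (plus) strand and that each genic mutation is paired with the strand of its gene; the names are hypothetical. It returns a ratio of raw counts, whereas a rate comparison would also need to account for the number of callable cytosines on each strand.

```python
def cytosine_strand(ref, alt, gene_strand):
    """Classify a C>T mutation by the strand carrying the cytosine.

    ref, alt: alleles on the reference (plus) strand
    gene_strand: "+" or "-", the coding (nontemplate) strand of the gene
    """
    if (ref, alt) == ("C", "T"):
        c_strand = "+"       # cytosine sits on the plus strand
    elif (ref, alt) == ("G", "A"):
        c_strand = "-"       # complement: cytosine sits on the minus strand
    else:
        return None          # not a C>T mutation
    return "nontemplate" if c_strand == gene_strand else "template"

def nontemplate_to_template_ratio(mutations):
    """mutations: iterable of (ref, alt, gene_strand) tuples."""
    labels = [cytosine_strand(*m) for m in mutations]
    return labels.count("nontemplate") / labels.count("template")

# Made-up mutations for illustration only.
example = [("C", "T", "+"), ("G", "A", "-"), ("G", "A", "+"), ("A", "G", "+")]
ratio = nontemplate_to_template_ratio(example)
```

Because TC-NER removes lesions from the template strand, a ratio above one is the expected signature of its activity.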
The pericentromere has more TEs and fewer genes than the chromosome arms. Thus, a higher mutation rate in the pericentromere could reflect a higher mutation rate in TEs. We categorized every site in the genome as one of five regions: exon, intron, TE, transcription start site (TSS) proximal, or intergenic. TE regions were defined using a published TE annotation [38], TSS proximal as 1 to 500bp upstream of all transcription start sites, and intergenic as anything not falling into the other four regions. For each region, the mutation rate was calculated relative to the genome-wide average rate (Fig 3D). In most datasets, exons and introns were depleted for mutations, whereas TEs were enriched, with TEs possessing 3.2x the somatic mutation rate of genes. However, genes were more depleted for MA line mutations than for somatic mutations, as seen by a lower relative mutation rate (p<0.05, Holm adjusted two-tailed Welch’s t-test). Only the UVC treated samples did not have a significantly lower mutation rate in genes (p=0.49, one-sample t-test), instead displaying a largely uniform mutation rate across the five regions (Fig S7). In addition, TSS proximal regions had a lower mutation rate than the intergenic regions in the MA lines (p<0.01, Holm adjusted two-tailed t-test) and polymorphism datasets, but this difference was not significant in the somatic mutations (p=0.53, two-tailed t-test). Lastly, the Ler-0 and 1001 datasets displayed a substantially higher mutation rate in introns compared to exons, which may be due to the influence of selection on these datasets.
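The relative rate per region can be sketched in Python as below, using the log2 scale of Fig 3D. The function name and the counts are hypothetical, and the sketch assumes each site has already been assigned to exactly one region; how overlapping annotations were resolved is not covered here.

```python
from math import log2

def log2_relative_rates(mutations, callable_bases):
    """Log2 of each region's mutation rate over the genome-wide average.

    mutations / callable_bases: dicts keyed by region name.
    Regions with zero mutations are not handled (log2 of 0 is undefined).
    """
    genome_rate = sum(mutations.values()) / sum(callable_bases.values())
    return {
        region: log2((mutations[region] / callable_bases[region]) / genome_rate)
        for region in mutations
    }

# Made-up counts for illustration only.
region_mutations = {"TE": 60, "exon": 15, "intergenic": 45}
region_callable = {"TE": 100, "exon": 100, "intergenic": 200}
relative = log2_relative_rates(region_mutations, region_callable)
```

A value of 0 means the region mutates at the genome-wide average rate; here the toy TE and exon values are +1 and -1, a twofold enrichment and depletion.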
To look for signatures of purifying selection, we calculated a global dN/dS value for each dataset as the ratio of mutation rate in nonsynonymous to synonymous (4-fold degenerate) sites. As expected, the Ler-0 and 1001 genomes datasets had a dN/dS significantly lower than one, but the somatic mutations and MA line mutations did not (Fig 3E). This suggests purifying selection is too weak to detect in both the somatic and MA line mutations and is unlikely to be driving their observed distribution.
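The global dN/dS described here reduces to a ratio of two rates, sketched below in Python with hypothetical names and made-up counts. Counting the nonsynonymous and 4-fold degenerate sites themselves is assumed to have been done upstream.

```python
def dn_ds(nonsyn_mutations, nonsyn_sites, syn_mutations, syn_sites):
    """Mutation rate at nonsynonymous sites over the rate at synonymous sites."""
    dn = nonsyn_mutations / nonsyn_sites
    ds = syn_mutations / syn_sites
    return dn / ds

# Made-up counts for illustration only.
under_selection = dn_ds(30, 3000, 20, 1000)
neutral = dn_ds(30, 3000, 10, 1000)
```

A value below one, as in the first toy case, is the pattern expected when purifying selection has removed nonsynonymous changes; a value near one is consistent with no detectable selection.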
Somatic mutation rate correlates with centromere proximity, DNA methylation, and chromatin accessibility
Previous work in Arabidopsis has found that the germline mutation rate correlates with proximity to the centromere and DNA methylation [10]. However, it is not known whether these factors influence somatic and germline mutation rates to the same degree nor whether they correlate with somatic mutation rate at all.
To test whether somatic mutation rate correlates with proximity to the centromere independent of TE content, we compared mutation rates of pericentromeric genes and TEs (<5Mb from the centromere) to distal genes and TEs (>5Mb from the centromere) (Fig 4A). In all four datasets, pericentromeric genes had a higher mutation rate than their distal counterparts, but this difference was not significant for either the somatic mutations or the MA lines (p=0.08 and p=0.36, respectively, Holm adjusted two-tailed t-test). Pericentromeric TEs had 1.9x the somatic mutation rate of distal TEs (p<0.01, Holm adjusted two-tailed t-test), and this was significantly greater than the 1.1x difference in MA lines (p<0.05, Holm adjusted one-sample two-tailed t-test). Thus, proximity to the centromere was associated with a greater increase in TE mutation rates in somatic contexts than in germline contexts.
Figure 4: Somatic mutation rate correlates with centromere proximity, DNA methylation, and chromatin accessibility.
A, Log2 fold relative mutation rate of genes and TEs in pericentromeric (<5Mb from the centromere) vs distal (>5Mb) regions. B, Log2 fold relative mutation rate of genic CG sites and non-genic CG, CHG, and CHH sites by methylation status in Col-0. C, Log2 fold relative mutation rate of ACR overlapping vs non-overlapping regions. Two regions are considered, genic and non-genic. Each bar represents the mutation rate in the intersection of the specified region and ACR or non-ACR space. A-C, Error bars represent ± one SEM. Dots represent replicate plants. Significance labels are for Holm adjusted two-tailed t-tests. *=p<0.05, **=p<0.01, ***=p<0.001, ****=p<0.0001. D, Linear/logistic regression predictions of mutation rate in ACRs vs non-ACRs. Points represent the predicted mutation rate of the model under fixed C:G content, DNA methylation, distance to centromere, and nonsynonymous site content. Error bars represent ± one SEM calculated using the delta method. Untreated=somatic mutations in untreated Col-0, MA lines= germline mutations from published Col-0 mutation accumulation lines [10], Ler-0=polymorphisms in Ler-0 identified with NanoSeq, 1001= polymorphisms from published dataset of 1,135 wild accessions [36].
In Arabidopsis, DNA methylation is primarily found in TEs and genes. TEs have high levels of cytosine methylation in all sequence contexts: CG, CHG, and CHH (where H=A/C/T) [40, 41]. In contrast, most genes are unmethylated (~80%) or methylated only in CG contexts (~17%), though a small fraction are methylated in all sequence contexts (~3%) [42]. To investigate the correlation of DNA methylation with mutation rate, we used published bisulfite sequencing data to identify methylated cytosines in Col-0 [43]. We compared methylated and unmethylated genic CG sites and found the methylated sites had a significantly higher mutation rate in all four datasets (Fig 4B). The same was true for non-genic CG, CHG, and CHH sites. However, in all four datasets, methylation in non-genic regions was associated with a greater increase in mutation rate than in genic regions. We saw little difference between somatic and MA line mutations, suggesting DNA methylation has a similar mutagenic effect in somatic and germline contexts.
In human cancers, chromatin accessibility has been shown to anticorrelate with mutation rate [44]. To test whether this is also the case in Arabidopsis, we made use of published ATAC-seq data to identify accessible chromatin regions (ACRs) in Col-0 [45]. Looking only within non-genic regions, we calculated the mutation rate of ACRs vs non-ACRs (Fig 4C). In all four datasets, non-genic regions had a lower mutation rate if they overlapped an ACR, consistent with the observation in human cancers [44]. We then made the same comparison but looked only within genic regions. Intriguingly, we now observed the opposite pattern, where genic regions overlapping an ACR had a higher mutation rate than other genic regions. This difference was only significant for the somatic mutations and not the MA line mutations (p<0.01 and p=0.60, respectively, Holm adjusted two-tailed t-test). Thus, accessible chromatin is associated with a lower somatic mutation rate outside of genes, but a higher rate within genes.
ACRs are generally unmethylated and less common near the centromere [45-48], so we wondered whether these factors were driving the correlation of non-genic ACRs with a lower mutation rate. To test this, we made linear/logistic regression models which identify independent associations of various factors with mutation rate. For each dataset, we fit a model of mutation rate per genomic site which considered whether the site was a C:G pair, methylated, accessible, in the pericentromere, in a gene, and—for the Ler-0 and 1001 datasets only—whether it was nonsynonymous (Methods). Lastly, we considered a potential interaction term between chromatin accessibility and genic regions. All four models predicted a higher mutation rate for pericentromeric and methylated sites, consistent with previous analyses (Table S1). In addition, they predict ACRs to have a lower mutation rate outside of genes and a higher mutation rate within genes, in agreement with Figure 4C (Fig 4D, Table S1).
Discussion
The Arabidopsis somatic mutation rate is 4-5x the germline mutation rate
We used NanoSeq to identify somatic mutations in untreated and UV-treated Arabidopsis independent of their frequency within the plant. An average of 178 mutations/plant were identified in the untreated plants, representing a major improvement in recall over a previous Duplex Sequencing study [29]. Ninety-three percent of these somatic mutations were observed in only one DNA molecule, placing their frequency below ~1/100 cells. We report a somatic mutation rate 4-5x higher than the germline mutation rate; in other words, the average Arabidopsis shoot cell contains 4-5x as many new mutations as a zygote of the next generation. This difference is less than what is observed for most human somatic tissues, which accumulate mutations 4-25x as rapidly as the paternal germline [1, 4].
The higher somatic mutation rate could be explained if the average shoot cell has undergone more cell divisions since the zygote than the sperm or egg. DNA replication is known to be mutagenic, as it introduces polymerase errors and propagates DNA damage into double-stranded mutations [49]. However, measurements of telomere shortening in Arabidopsis telomerase mutants estimate that rosette leaves undergo fewer cell divisions than the progeny, not more [50]. Another possibility would be that the cell lineage leading to the next generation has a lower average mutation rate per mitotic division. In humans and mice, per cell division mutation rate estimates are an order of magnitude higher in fibroblasts than the germline [51]. Even nondividing neurons and smooth muscle have higher mutation rates than the mitotically active male germline [1]. Thus, the higher somatic mutation rate in Arabidopsis may not be the result of more cell divisions but instead an increase in mutagenic processes and/or less efficient DNA repair.
Somatic mutations possess signatures of oxidative damage and a UV-like process
We found that C>A and YC>YT mutations were more common in somatic than germline contexts. As the somatic mutations likely include those accumulated outside the meristem, these signatures may represent mutagenic processes more active in terminal tissues. One potential source of the C>A mutations is reactive oxygen species, which react with DNA to form 8-oxo-7,8-dihydroguanine (8-oxoG) and other lesions [52]. 8-oxoG then mispairs with adenine, producing C>A mutations [52]. Reactive oxygen species are produced by chloroplast metabolism, resulting in much higher levels of ROS in leaves than roots [53-55]. Thus, the increased rate of C>A somatic mutations may be a consequence of greater photosynthetic activity in the leaves compared to the meristem. The origin of the YC>YT somatic mutations is much less clear. This mutation signature is often associated with UV-induced pyrimidine dimers, but the untreated plants were not exposed to sunlight [30]. This signature could represent some other mutagenic process or unexpected UV emission from the fluorescent grow lights.
We also noted a strong signal of TC-NER in the C>T somatic mutations, whereas this signature was absent in the germline mutations. The simple explanation for this is that TC-NER is more active in terminal tissues than the meristem. An analogous situation occurs in C. elegans, where TC-NER is more active in somatic tissues, and global genome NER (GG-NER) is more active in the germline [56-58]. Perhaps Arabidopsis has adopted a similar strategy, where GG-NER maintains the integrity of the whole genome in the meristem, while terminal tissues only require maintenance of transcribed genes and so use TC-NER. An alternative explanation for the absence of a TC-NER signature in the germline mutations is that the germline C>T mutations are caused by a different type of DNA damage which TC-NER cannot repair.
Global distribution of somatic mutations
Somatic and germline mutations were enriched near the centromere and in TEs, while genes were depleted. This pattern has previously been noted in Arabidopsis as dependent on mismatch repair activity, suggesting mismatch repair is more efficient within genes [12, 59]. We note that the depletion of mutations in genes was slightly greater in germline than somatic contexts. This may be caused by differential mismatch repair activity in the meristem and terminal tissues. Another explanation is purifying selection; MA lines are theorized to experience selection against strongly deleterious mutations, which may have no consequences in heterozygous somatic contexts [60]. However, we lacked the statistical power to detect any purifying selection in the germline and somatic mutations.
Intriguingly, the UVC treated samples were the only plants to display neither an enrichment for mutations near the centromere nor a depletion in genes. The UVB and UVC treatments differed in wavelength (311nm vs 254nm) and duration (8h/day vs ~3s/day), with UVC treatments being far more concentrated. We suspect the UVC treatments overwhelmed repair pathways which had time to preferentially repair damage in genes during UVB treatment. UV damage can be repaired by TC-NER, GG-NER, and photoreactivation [37, 61]. Strong signatures of TC-NER activity were observed in the UVC samples, indicating this pathway was not overwhelmed. Studies in yeast have revealed that photoreactivation occurs significantly faster in nucleosome free DNA [62, 63]. Thus, we hypothesize photoreactivation preferentially repaired genes and euchromatin in the UVB samples but was overwhelmed by the UVC treatment. It is also worth noting that for the UVC mutation rate to be uniform while TC-NER is active, UVC must have induced more damage in genes than non-genic regions. This is the case in yeast, where UVC induces more pyrimidine dimers on the template strand of genes than flanking regions [64].
Proximity to the centromere, DNA methylation, and chromatin accessibility correlate with somatic mutation rate
Methylated cytosines had higher somatic and germline mutation rates, consistent with known properties of ~2-5x faster deamination rates and ~2.6x higher rates of DNA polymerase errors at methylated cytosines [9, 10, 65-68]. Arabidopsis and most flowering plants have CG methylation in the body of many genes, but this methylation has no known function [69-72]. Thus, it is puzzling why plants would maintain this mutagenic modification in genes [73, 74]. Our data suggest DNA methylation within genes is less mutagenic than methylation in non-genic regions, reducing the fitness cost of maintaining gene body methylation.
We found that pericentromeric TEs had a higher somatic mutation rate than distal TEs, which may be explained by replication timing. In both Arabidopsis and animals, heterochromatic domains—like the pericentromere—replicate late in S phase [75, 76]. These late replicating regions have higher mutation rates in human cancers, dependent on functional mismatch repair activity [77, 78]. Arabidopsis mismatch repair may also be less functional late in S phase, when the pericentromere is being replicated.
Within non-genic regions, chromatin accessibility correlated with a lower somatic and germline mutation rate, consistent with what is seen in human cancers [44]. In yeast, repair of alkylation and UV damage is faster at sites where the DNA minor groove faces away from the nucleosome [64, 79]. Thus, we suspect non-genic ACRs in Arabidopsis have lower mutation rates because they are more accessible to repair factors. In contrast, genic regions had higher somatic mutation rate when overlapping an ACR. As most of these ACRs are present near the site of transcription initiation, their elevated mutation rate may represent transcription-associated mutagenesis, where unwinding of the DNA makes it more susceptible to damage [80].
Conclusion
We used NanoSeq to generate an unbiased measurement of somatic mutations in Arabidopsis. Somatic mutations accumulated at a faster rate than germline mutations and possessed signatures of oxidative damage, TC-NER activity, and a UV-like mutational process. Somatic mutation rate correlated with gene/TE content, proximity to the centromere, DNA methylation, and chromatin accessibility, similar to germline mutations. Deducing the causal effects of these factors on mutation rate will require further studies under genetic perturbation.
Methods
Plant material
All plants were grown in Sungro soil containing Osmocote fertilizer at 21°C under 16 hours of light. All Col-0 plants were siblings. Untreated Col-0 and Ler-0 were grown under a mix of Philips F96T8/TL841 PLUS and Sylvania Octron Eco fluorescent lamps. UVC treated Col-0 were grown under the same conditions, but treated with UVC from a Spectroline Select Series UV Crosslinker at multiple points throughout the course of growth. The number and intensity of UVC treatments varied between the four plants (10x0.02J/cm2, 10x0.05J/cm2, 4x0.1J/cm2, 2x0.2J/cm2). UVB treated Col-0 were grown under Sylvania Octron Eco and a Philips TL 20W/01Narrowband UVB lamp. The UVB lamp was ~20cm from the plants and on for every other hour of the 16-hour day.
All plants were harvested at the opening of the first flower. For the Col-0 plants, flowers and flower buds were removed and the remaining shoot was flash frozen. For the Ler-0 plant, a single leaf was flash frozen.
NanoSeq
Each frozen tissue sample was ground with a mortar and pestle. DNA was isolated with a DNeasy Plant Mini Kit (Qiagen, 69106) following the manufacturer’s instructions. DNA was fragmented to 150bp using a Covaris E220 Evolution Focused-ultrasonicator by the Georgia Genomic and Bioinformatics Core. 200ng of fragmented DNA was size selected using in-house AMPure XP magnetic beads (0.79 beads:sample ratio for lower and 2.15 for upper selection). End blunting was performed by incubating in a reaction of 1x S1 nuclease buffer and 10units S1 nuclease (Thermo, EN0321) for 30 min at RT. The reaction was stopped by addition of 3μL EDTA, cleaned up with magnetic beads (1.8 ratio), and eluted in 32μL 10mM TrisHCl (pH 8). A-tailing and nick blocking were performed by first adding 5μL 10x E. coli DNA ligase buffer (NEB, M0205S), 4.5μL 10mM ATP, and 1μL 10units/μL T4 Polynucleotide Kinase (NEB, M0201S) and incubating at 37°C for 30 min. Samples were then placed on ice, 1μL 10units/μL E. coli DNA ligase (NEB, M0205S) was added, and samples were incubated at 16°C for 30 min. A mix of 1.75μL NF H2O, 0.25μL 10mM dATP, 0.5μL ddCTP, 0.5μL ddGTP, 0.5μL ddTTP (Cytiva, 27204501), and 3μL 5units/μL Klenow fragment (3’→5’ exo-) (NEB, M0212S) was added and incubated at 37°C for 30 min. Samples were cleaned up with magnetic beads (1.8 ratio) and 2μL 8μM iTruSeq adapter stubs (sequence below) were added. Adapter ligation was performed in a 25μL reaction of 1x T4 DNA ligase buffer and 20units/μL T4 DNA ligase (NEB, M0202S). Samples were incubated at 16°C for 16 hours and cleaned up with two rounds of magnetic beads (1.4 ratio).
DNA concentration was measured by Qubit and a portion of each library was diluted to 0.1ng/μL. Three dilutions were used to make a 0.0125ng/μL dilution. These dilutions and three 0.1ng/μL dilutions of a previously sequenced library were run on qPCR in triplicate 10μL reactions of 1x Luna qPCR universal master mix (NEB, M3003S), 1.5μM Illumina i5 indexed primer, and 1.5μM Illumina i7 indexed primer (sequences below). The qPCR program was 95°C (60s), 35x (95°C (15s), 60°C (30s)). Cq values of new and previously sequenced libraries were compared to determine the optimal volume of each sample to use per million read pairs sequenced, such that the number of callable fragments is maximized. An optimal volume of sample for the amount of sequencing planned was used for further steps. For untreated Col-0 samples, samples were split in three to reduce the chance of DNA molecules with the same alignment start and end position being present in a single library.
Samples were amplified in a 50μL reaction of 0.33μM Illumina i5 indexed primer, 0.33μM Illumina i7 indexed primer, 200μM dNTPs, 1x Phusion HF Buffer, and 0.04units/μL Phusion HF DNA Polymerase (NEB, M0530S). The PCR program was 98°C (60s), 2x (98°C (15s), 50°C (120s), 72°C (15s)), 11x (98°C (15s), 60°C (30s), 72°C (15s)). Samples were cleaned up with magnetic beads (1.4 ratio) and sequenced as 150bp paired-end reads on a NovaSeq X series platform 25B flow cell.
iTruSeq stub sequences: ACACTCTTTCCCTACACGACGCTCTTCCGATCT and /5phos/GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Illumina i5 and i7 indexed primers: AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTAC and CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAG where Ns are the sample index
Filtering and alignment of reads
Sequencing reads were trimmed and filtered with fastp using default parameters [82]. Reads were aligned to the TAIR10 reference genome using Bowtie2 with -X 800 [81, 83]. Optical duplicates were then marked using SAMtools fixmate -m, Sambamba sort, and SAMtools markdup -d 2500 [84, 85]. Optical duplicates, non-concordantly mapped reads, and ambiguously mapped reads were filtered using sambamba view --filter mapping_quality >= 1 and proper_pair and ([dt] == null or [dt] != 'SQ'). Replicate libraries generated from the same plant were then merged using SAMtools merge -r to produce the final filtered BAM files. These commands were run as a Snakemake pipeline [86].
In silico generation of a “swapped” control
The filtered BAM files of the untreated plants were used to generate “swapped” libraries using a custom script. For each set of PCR duplicates present in more than one plant (same fragment alignment start and end site), one set of top strand PCR duplicates (read1 on the forward strand) and one set of bottom strand PCR duplicates (read1 on the reverse strand) were selected at random from two different plants. These selected PCR duplicates were output as a single swapped library. The process was then repeated five more times to generate six swapped BAM files.
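The swapping procedure can be illustrated with a minimal Python sketch. The custom script itself is not shown here, so everything below is an assumption made for illustration: each set of PCR duplicates is represented as a dictionary with hypothetical keys (plant, chrom, start, end, strand), and the function name make_swapped_library is invented. The sketch only captures the pairing logic (top strand from one plant, bottom strand from a different plant, same fragment coordinates), not BAM handling.

```python
import random
from collections import defaultdict


def make_swapped_library(duplicate_sets, rng):
    """Pair top- and bottom-strand duplicate sets drawn from different plants.

    duplicate_sets: list of dicts with hypothetical keys 'plant', 'chrom',
    'start', 'end', and 'strand' ('top' or 'bottom'). Each dict stands for
    one strand's PCR duplicates of one fragment position in one plant.
    """
    # Group duplicate sets by fragment coordinates, then by plant and strand.
    by_position = defaultdict(dict)
    for d in duplicate_sets:
        key = (d["chrom"], d["start"], d["end"])
        by_position[key].setdefault(d["plant"], {})[d["strand"]] = d

    swapped = []
    for key in sorted(by_position):
        plants = by_position[key]
        if len(plants) < 2:
            continue  # position not shared between plants, nothing to swap
        # All (top plant, bottom plant) combinations from two different plants.
        pairs = [
            (t, b)
            for t in sorted(plants)
            for b in sorted(plants)
            if t != b and "top" in plants[t] and "bottom" in plants[b]
        ]
        if not pairs:
            continue
        top_plant, bottom_plant = rng.choice(pairs)
        swapped.append((plants[top_plant]["top"], plants[bottom_plant]["bottom"]))
    return swapped
```

Calling the function six times with independently seeded random.Random objects would mirror the six swapped libraries described above.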
Filtering somatic mutations
A set of unfiltered variants were identified using a custom script. These variants were then filtered by passing the following requirements: ≥24% of reads in the top strand duplicates and ≥24% of reads in the bottom strand duplicates support the variant; ≥6 duplicate reads cover the variant position; ≥2 top strand duplicates and ≥2 bottom strand duplicates cover the variant; ≥4 duplicate reads support the variant with a BQ >30 at the variant base (if SNV); ≥10 average MQ of supporting reads; ≤0.21 mismatches/bp between the variant and each fragment end (this removes all variants <5bp from the fragment end); variant is ≥6bp from fragment ends (applied to indels only); variant position is ≥0.7 percentile in total read coverage across all untreated BAMs (i.e. removes genomic positions with low read coverage); variant position is not the start/end of a poly-A nor poly-T repeat of length ≥8; variant position is not the start/end of a dinucleotide nor trinucleotide repeat of ≥5 repeating units; variant is supported by the majority of reads in ≤3 sets of PCR duplicates across the eight untreated libraries; variant is in a fragment of length ≤300bp; there are ≤4bp between any 2 variants in the same set of PCR duplicates passing all previous filters (this filter was not applied to the Ler-0 sample); ≥76% of reads in the top strand duplicates and ≥76% of reads in the bottom strand duplicates support the variant (Fig S1). Variants present in a sibling plant were filtered as likely germline mutations. We determined the cutoffs for these filters by varying each one and assessing how it affected the number of variants identified in untreated samples and swapped samples as well as the fraction of variants which were C>T (Fig S3).
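A few of the read-support filters listed above can be expressed as a simple predicate. This is a partial sketch, not the custom script: the Candidate class and its field names are hypothetical, only the strand-support, read-count, mapping-quality, and fragment-length thresholds are included, and the repeat, coverage-percentile, fragment-end, and cross-library filters are omitted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one variant within one set of PCR duplicates."""
    top_reads: int        # top-strand duplicate reads covering the site
    bottom_reads: int     # bottom-strand duplicate reads covering the site
    top_support: int      # top-strand reads supporting the variant
    bottom_support: int   # bottom-strand reads supporting the variant
    mean_mq: float        # average mapping quality of supporting reads
    fragment_length: int  # fragment length in bp


def passes_core_filters(c, min_fraction=0.76):
    """Apply a subset of the thresholds; min_fraction is 0.24 or 0.76 in the text."""
    if c.top_reads < 2 or c.bottom_reads < 2:
        return False  # need >=2 reads covering the site on each strand
    if c.top_reads + c.bottom_reads < 6:
        return False  # need >=6 duplicate reads in total
    if c.top_support / c.top_reads < min_fraction:
        return False  # insufficient top-strand support
    if c.bottom_support / c.bottom_reads < min_fraction:
        return False  # insufficient bottom-strand support
    if c.mean_mq < 10:
        return False  # average MQ of supporting reads must be >=10
    if c.fragment_length > 300:
        return False  # fragments longer than 300bp are excluded
    return True
```

Requiring support on both strands is what distinguishes a true duplex-supported variant from single-strand damage or a PCR error, which is why the strand fractions are checked separately rather than pooled.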
The frequency of each mutation in the sample was calculated as the fraction of duplicate sets covering the variant site which had >76% of reads supporting the variant. For the Ler-0 sample, only variants with a frequency >0.35 were retained. For all other samples, only variants with a frequency <0.35 were retained.
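As a minimal sketch of this frequency calculation, assuming each duplicate set covering the site has already been summarized as a single fraction of reads supporting the variant (the function names and this per-set summary are illustrative assumptions, not the original code):

```python
def variant_frequency(support_fractions, threshold=0.76):
    """Fraction of duplicate sets covering a site that support the variant.

    support_fractions: one value per duplicate set covering the site, giving
    the fraction of that set's reads supporting the variant.
    """
    if not support_fractions:
        raise ValueError("site is not covered by any duplicate set")
    supporting = sum(1 for f in support_fractions if f > threshold)
    return supporting / len(support_fractions)


def keep_variant(frequency, is_ler_sample):
    """Retain high-frequency variants for Ler-0, low-frequency ones otherwise."""
    if is_ler_sample:
        return frequency > 0.35
    return frequency < 0.35
```

For example, a variant supported by one of four duplicate sets has a frequency of 0.25, so it would be retained in a Col-0 sample and discarded in the Ler-0 sample.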
Calculation of callable coverage and somatic mutation rate
For the untreated and UV treated samples, callable coverage was calculated per site as the number of opportunities a somatic mutation could have been observed and passed all filters. Each set of PCR duplicates was considered “callable” if it had ≥1 bottom strand, ≥1 top strand, and ≥3 total read pairs, as well as average MQ ≥10 and fragment length ≤300bp. Then, each base within a callable fragment contributed to callable coverage if it overlapped ≥2 reads from the bottom strand, ≥2 reads from the top strand, and ≥6 total reads, was ≥5bp from a fragment end, and was not in any blacklisted regions of the genome. Blacklisted regions were those with <0.7 percentile total read coverage across all untreated BAMs, the starts/ends of ≥8bp poly-A and poly-T repeats, and the starts/ends of ≥5 length di/trinucleotide repeats. For the Ler-0 sample, callable coverage was limited to a max of one, so every site had callable coverage of either 0 (uncallable) or 1 (callable).
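The two-level callability rules can be sketched as follows. The function names and argument layout are hypothetical; the thresholds are taken from the description above, and the blacklist lookup is reduced to a boolean that the caller is assumed to supply.

```python
def fragment_is_callable(top_pairs, bottom_pairs, mean_mq, length):
    """A set of PCR duplicates is callable if both strands are represented."""
    return (
        top_pairs >= 1
        and bottom_pairs >= 1
        and top_pairs + bottom_pairs >= 3
        and mean_mq >= 10
        and length <= 300
    )


def base_is_callable(top_reads, bottom_reads, distance_to_end, blacklisted):
    """A base in a callable fragment counts if it is well covered on both strands."""
    return (
        top_reads >= 2
        and bottom_reads >= 2
        and top_reads + bottom_reads >= 6
        and distance_to_end >= 5
        and not blacklisted
    )


def callable_coverage(callable_flags, cap=None):
    """Sum callable observations at one site; cap=1 mimics the Ler-0 treatment."""
    total = sum(1 for flag in callable_flags if flag)
    return total if cap is None else min(total, cap)
```

Capping the coverage at one turns the per-site count into a callable/uncallable indicator, which suits a sample where only high-frequency variants are retained.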
To calculate the overall somatic mutation rate of a sample, the callable coverage of every site in the nuclear genome was summed and corrected for a predicted rate of fragment conflicts. The conflict rate represents the chance that each fragment in the library has another fragment with the same alignment start and end site, which may preclude mutation identification in that fragment. Conflict rate was estimated by randomly selecting half of the fragments from two replicate libraries of the same plant and calculating a “half” conflict rate as the fraction of selected fragments which conflict with another selected fragment. The full conflict rate was then calculated as 1 – (1 – half conflict rate)². For samples with no replicate libraries, the conflict rate was imputed using the other samples assuming a linear relationship between conflict rate and the number of fragments in the library (Fig S8). Thus, the overall somatic mutation rate was calculated as the number of mutations / (callable coverage * (1 – conflict rate)). Conflict rate was not considered when calculating somatic mutation rate for specific genomic regions nor when calculating mutation spectra (Fig 2-4).
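The conflict-rate correction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the original code: fragments are represented as (chrom, start, end) tuples pooled from the two replicate libraries, the function names are invented, and the full conflict rate is taken as one minus the squared complement of the half conflict rate.

```python
import random
from collections import Counter


def half_conflict_rate(fragment_positions, rng):
    """Fraction of a random half of fragments that share coordinates with another.

    fragment_positions: list of (chrom, start, end) tuples, one per fragment,
    assumed here to be pooled from two replicate libraries of one plant.
    """
    n_selected = len(fragment_positions) // 2
    if n_selected == 0:
        raise ValueError("need at least two fragments")
    selected = rng.sample(fragment_positions, n_selected)
    counts = Counter(selected)
    conflicted = sum(c for c in counts.values() if c > 1)
    return conflicted / n_selected


def full_conflict_rate(half_rate):
    """Scale the half-library conflict rate up to the full library."""
    return 1 - (1 - half_rate) ** 2


def somatic_mutation_rate(n_mutations, callable_coverage, conflict_rate):
    """Mutations per callable base pair, corrected for fragment conflicts."""
    return n_mutations / (callable_coverage * (1 - conflict_rate))
```

As a worked example of the arithmetic, a half conflict rate of 0.1 gives a full conflict rate of 1 – 0.9² = 0.19, so the effective callable coverage is 81% of the summed value.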
Processing of published mutation/polymorphism datasets
For the MA line dataset, we downloaded mutations from Weng et al. [10]. These were homozygous mutations present in one of 107 Col-0 lines propagated for 25 generations of single seed descent. For the 1001 genomes dataset, SNVs and short indels in 1,135 wild Arabidopsis accessions were retrieved from 1001genomes.org [36]. We discarded mutations in both datasets if they were present at a position with no callable coverage in the untreated first sibling NanoSeq library. As an estimate of callable coverage in these datasets, we took the callable coverage of the untreated first sibling NanoSeq library and limited the coverage per site to one, so every site had callable coverage of either 0 (uncallable) or 1 (callable).
Calculation of mutation spectra
A mutation rate was calculated for each SNV type, 3bp context, insertions, and deletions as the count of that mutation type divided by the callable coverage of sites which could harbor that mutation (e.g. coverage of all C:G pairs for C>T mutations, coverage of all ACA sites for ACA>AAA mutations, and coverage of all sites for indels). This mutation rate was then divided by the genome-wide average mutation rate for that sample to get the relative mutation rate.
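A minimal sketch of the relative-rate calculation follows. The function name and dictionary layout are assumptions, and the genome-wide average mutation rate is taken here to be total mutations divided by total callable coverage, which is one plausible reading of the description.

```python
def relative_mutation_rates(counts, coverage, total_mutations, total_coverage):
    """Per-type mutation rate divided by the genome-wide average rate.

    counts: mutation type -> number of observed mutations of that type.
    coverage: mutation type -> callable coverage of sites that could harbor it.
    """
    genome_wide_rate = total_mutations / total_coverage
    return {
        mutation_type: (counts[mutation_type] / coverage[mutation_type])
        / genome_wide_rate
        for mutation_type in counts
    }
```

With 30 C>T mutations over 400 units of C:G coverage and a genome-wide rate of 40 mutations over 1,000 units of coverage, the relative C>T rate is (30/400) / (40/1000) = 1.875, meaning C>T mutations occur at nearly twice the average rate per eligible site.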
DNA methylation and chromatin accessibility
Bisulfite sequencing data of Col-0 were downloaded from SRR2922654 (Bewick et al. [43]). Sequencing reads were trimmed and filtered with fastp using default parameters [82]. Methylation files were generated using methylpy --trim-reads False --merge-by-max-mapq True --min-mapq 1 --min-qual-score 1 [87]. Methylated sites were called using METHimpute [88]. Cytosines with an “Intermediate” methylation status were considered neither methylated nor unmethylated for generating Figure 4B.
Two replicates of Col-0 ATAC-seq data were downloaded from PRJNA527732 [45]. Sequencing reads were trimmed and filtered with fastp using default parameters [82]. Reads were aligned to the TAIR10 reference genome using Bowtie2 with -X 800 [83]. Peaks were called for each replicate using MACS2 callpeak -g 1.1e8 -q 0.7 --nomodel --extsize 200 --shift -100 [89]. The union of peaks in the two replicates was used as the final ACR set.
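Taking the union of two peak sets amounts to merging overlapping intervals. The sketch below is one way to do this in plain Python; it assumes peaks are (chrom, start, end) tuples with half-open coordinates and that touching intervals are merged, neither of which is specified above, and the function name is hypothetical.

```python
def union_peaks(replicate_1, replicate_2):
    """Merge overlapping or touching peaks from two replicates into one set."""
    merged = []
    for chrom, start, end in sorted(replicate_1 + replicate_2):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous interval: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]
```

A union keeps any region called accessible in either replicate, which favors sensitivity over specificity; an intersection would be the stricter alternative.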
Linear/logistic regression models of mutation rate
For each of the four datasets, we constructed a linear/logistic regression model using the Statsmodels package (Table S1) [90]. Each genomic site had the following predictor variables with values of 0 (false) or 1 (true): C:G pair, methylated, within an ACR, within 5Mb of the centromere, within a gene, nonsynonymous coding site, and within both an ACR and a gene (ACR * gene). The nonsynonymous variable was only considered in the models for the Ler-0 and 1001 genomes datasets. The response variable was modeled differently for each dataset. For the untreated dataset, a generalized linear model was used, and the response variable was modeled as a binomial distribution where trials=callable coverage and successes=mutations at the site. For the MA line and Ler-0 datasets, a logistic regression model was used, and the response variable was whether a mutation was present at the site. For the 1001 genomes dataset, a generalized linear model was used, and the response variable was modeled as a Poisson distribution where the count=mutations at the site (this dataset could have >1 mutation per site). Each model was fit using the genomic sites with >0 callable coverage in the dataset. To calculate mutation rate in Figure 4D, we predicted the mutation rate for a site with the ACR, gene, and ACR*gene variables set to 111, 010, 100, or 000 and all other predictors set to their average value across all sites.
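The prediction step used for Figure 4D can be illustrated without fitting a model. The sketch below assumes a logit link (as in a logistic or binomial model) and takes coefficients as inputs; the function names, predictor keys, and any coefficient values passed in are placeholders for illustration and are not the fitted values from Table S1.

```python
import math


def predict_rate(coefficients, intercept, site):
    """Inverse-logit prediction for one site given its predictor values."""
    linear_predictor = intercept + sum(
        coefficients[name] * site[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def predict_acr_gene_grid(coefficients, intercept, predictor_means):
    """Predict rates for the four ACR/gene combinations (111, 010, 100, 000).

    predictor_means: average value of every other predictor across all sites.
    Keys 'acr', 'gene', and 'acr_x_gene' are hypothetical predictor names.
    """
    grid = {}
    for acr, gene in [(1, 1), (0, 1), (1, 0), (0, 0)]:
        site = dict(predictor_means)
        site.update({"acr": acr, "gene": gene, "acr_x_gene": acr * gene})
        grid[(acr, gene)] = predict_rate(coefficients, intercept, site)
    return grid
```

Holding the other predictors at their means isolates the ACR, gene, and interaction terms, so the four predictions differ only through those three coefficients. A positive interaction coefficient is what allows accessibility to lower the predicted rate outside genes while raising it inside genes.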
Supplementary Material
Acknowledgements
We would like to thank Mark Minow for suggestions on data processing and visualization, Bruce Martin for advice on regression methods, and Yangyang Xu for performing DNA extractions. The Georgia Advanced Computing Resource Center provided the computational resources required for data analysis. This research was supported by the National Science Foundation (MCB-2242696) and the University of Georgia Office of Research to R.J.S. as well as the National Institute of General Medical Sciences of the National Institutes of Health to C.A.M. (1T32GM142623) and B.N. (R35GM151237).
Data Availability
NanoSeq library sequencing data can be found in the NCBI SRA database under the BioProject accession PRJNA1247547.
References cited
- 1. Abascal F., et al., Somatic mutation landscapes at single-molecule resolution. Nature, 2021. 593(7859): p. 405–410.
- 2. Goel M., et al., The vast majority of somatic mutations in plants are layer-specific. Genome Biology, 2024. 25(1): p. 194.
- 3. Wang L., et al., The architecture of intra-organism mutation rate variation in plants. PLOS Biology, 2019. 17(4): p. e3000191.
- 4. Moore L., et al., The mutational landscape of human somatic and germline cells. Nature, 2021. 597(7876): p. 381–386.
- 5. Sun H., et al., The identification and analysis of meristematic mutations within the apple tree that developed the RubyMac sport mutation. BMC Plant Biology, 2024. 24(1): p. 912.
- 6. Amundson K.R., et al., Differential mutation accumulation in plant meristematic layers. bioRxiv, 2023: p. 2023.09.25.559363.
- 7. Gonzalez-Perez A., Sabarinathan R., and Lopez-Bigas N., Local Determinants of the Mutational Landscape of the Human Genome. Cell, 2019. 177(1): p. 101–114.
- 8. Supek F. and Lehner B., Scales and mechanisms of somatic mutation rate variation across the human genome. DNA Repair, 2019. 81: p. 102647.
- 9. Ossowski S., et al., The Rate and Molecular Spectrum of Spontaneous Mutations in Arabidopsis thaliana. Science, 2010. 327: p. 92–94.
- 10. Weng M.L., et al., Fine-Grained Analysis of Spontaneous Mutation Spectrum and Frequency in Arabidopsis thaliana. Genetics, 2019. 211(2): p. 703–714.
- 11. Quiroz D., et al., Causes of Mutation Rate Variability in Plant Genomes. Annu Rev Plant Biol, 2023. 74: p. 751–775.
- 12. Quiroz D., et al., H3K4me1 recruits DNA repair proteins in plants. The Plant Cell, 2024. 36(6): p. 2410–2426.
- 13. Monroe J.G., et al., Mutation bias reflects natural selection in Arabidopsis thaliana. Nature, 2022. 602(7895): p. 101–105.
- 14. Staunton P.M., Peters A.J., and Seoighe C., Somatic mutations inferred from RNA-seq data highlight the contribution of replication timing to mutation rate variation in a model plant. Genetics, 2023. 225(2): p. iyad128.
- 15. Belfield E.J., et al., Thermal stress accelerates Arabidopsis thaliana mutation rate. Genome Res, 2021. 31(1): p. 40–50.
- 16. Jiang C., et al., Environmentally responsive genome-wide accumulation of de novo Arabidopsis thaliana mutations and epimutations. Genome Res, 2014. 24(11): p. 1821–9.
- 17. Lucht J.M., et al., Pathogen stress increases somatic recombination frequency in Arabidopsis. Nature Genetics, 2002. 30(3): p. 311–314.
- 18. Jónsson H., et al., Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 2017. 549(7673): p. 519–522.
- 19. Stoler N. and Nekrutenko A., Sequencing error profiles of Illumina sequencing instruments. NAR Genom Bioinform, 2021. 3(1): p. lqab019.
- 20. Duan Y., et al., Limited accumulation of high-frequency somatic mutations in a 1700-year-old Osmanthus fragrans tree. Tree Physiology, 2022. 42(10): p. 2040–2049.
- 21. Schmid-Siegert E., et al., Low number of fixed somatic mutations in a long-lived oak tree. Nature Plants, 2017. 3(12): p. 926–929.
- 22. Schmitt S., et al., Low-frequency somatic mutations are heritable in tropical trees Dicorynia guianensis and Sextonia rubra. Proceedings of the National Academy of Sciences, 2024. 121(10): p. e2313312121.
- 23. Xian W., et al., Minimizing detection bias of somatic mutations in a highly heterozygous oak genome. bioRxiv, 2025: p. 2025.02.13.638107.
- 24. Orr A.J., et al., A phylogenomic approach reveals a low somatic mutation rate in a long-lived plant. Proceedings of the Royal Society B: Biological Sciences, 2020. 287(1922): p. 20192364.
- 25. Ren Y., et al., Somatic Mutation Analysis in Salix suchowensis Reveals Early-Segregated Cell Lineages. Molecular Biology and Evolution, 2021. 38(12): p. 5292–5308.
- 26. Menon V. and Brash D.E., Next-generation sequencing methodologies to detect low-frequency mutations: “Catch me if you can”. Mutation Research - Reviews in Mutation Research, 2023. 792: p. 108471.
- 27. Hoang M.L., et al., Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proc Natl Acad Sci U S A, 2016. 113(35): p. 9846–51.
- 28. Schmitt M.W., et al., Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A, 2012. 109(36): p. 14508–13.
- 29. Waneka G., et al., Exploring the Relationship Between Gene Expression and Low-Frequency Somatic Mutations in Arabidopsis with Duplex Sequencing. Genome Biology and Evolution, 2024. 16(10).
- 30. Ikehata H. and Ono T., The Mechanisms of UV Mutagenesis. Journal of Radiation Research, 2011. 52(2): p. 115–125.
- 31. Armstrong J.D. and Kunz B.A., Site and strand specificity of UVB mutagenesis in the SUP4-o gene of yeast. Proceedings of the National Academy of Sciences, 1990. 87(22): p. 9005–9009.
- 32. Nakamura M., Nunoshiba T., and Hiratsu K., Detection and analysis of UV-induced mutations in the chromosomal DNA of Arabidopsis. Biochemical and Biophysical Research Communications, 2021. 554: p. 89–93.
- 33. Yurchenko A.A., et al., Genomic mutation landscape of skin cancers from DNA repair-deficient xeroderma pigmentosum patients. Nature Communications, 2023. 14(1): p. 2561.
- 34. Laughery M.F., et al., The Surprising Diversity of UV-Induced Mutations. Adv Genet (Hoboken), 2024. 5(2): p. 2300205.
- 35. Zhao X. and Taylor J.-S., Mutation Spectra of TA*, the Major Photoproduct of Thymidylyl-(3'–5')-Deoxyadenosine, in Escherichia Coli under SOS Conditions. Nucleic Acids Research, 1996. 24(8): p. 1561–1565.
- 36. Alonso-Blanco C., et al., 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell, 2016. 166(2): p. 481–491.
- 37. Nieto Moreno N., Olthof A.M., and Svejstrup J.Q., Transcription-Coupled Nucleotide Excision Repair and the Transcriptional Response to UV-Induced DNA Damage. Annu Rev Biochem, 2023. 92: p. 81–113.
- 38. Panda K. and Slotkin R.K., Long-Read cDNA Sequencing Enables a "Gene-Like" Transcript Annotation of Transposable Elements. Plant Cell, 2020. 32(9): p. 2687–2698.
- 39. Naish M., et al., The genetic and epigenetic landscape of the Arabidopsis centromeres. Science, 2021. 374(6569): p. eabi7489.
- 40. Cokus S.J., et al., Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 2008. 452(7184): p. 215–219.
- 41. Lister R., et al., Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis. Cell, 2008. 133(3): p. 523–536.
- 42. Zhang Y., et al., Natural variation in DNA methylation homeostasis and the emergence of epialleles. Proceedings of the National Academy of Sciences, 2020. 117(9): p. 4874–4884.
- 43. Bewick A.J., et al., On the origin and evolutionary consequences of gene body DNA methylation. Proceedings of the National Academy of Sciences, 2016. 113(32): p. 9111–9116.
- 44. Polak P., et al., Reduced local mutation density in regulatory DNA of cancer genomes is linked to DNA repair. Nature Biotechnology, 2014. 32(1): p. 71–75.
- 45. Lu Z., et al., The prevalence, evolution and chromatin signatures of plant regulatory elements. Nature Plants, 2019. 5(12): p. 1250–1259.
- 46. Ricci W.A., et al., Widespread long-range cis-regulatory elements in the maize genome. Nat Plants, 2019. 5(12): p. 1237–1249.
- 47. Oka R., et al., Genome-wide mapping of transcriptional enhancer candidates using DNA and chromatin features in maize. Genome Biology, 2017. 18(1): p. 137.
- 48. Crisp P.A., et al., Stable unmethylated DNA demarcates expressed genes and their cis-regulatory space in plant genomes. Proceedings of the National Academy of Sciences, 2020. 117(38): p. 23991–24000.
- 49. Chatterjee N. and Walker G.C., Mechanisms of DNA damage, repair, and mutagenesis. Environ Mol Mutagen, 2017. 58(5): p. 235–263.
- 50. Watson J.M., et al., Germline replications and somatic mutation accumulation are independent of vegetative life span in Arabidopsis. Proc Natl Acad Sci U S A, 2016. 113(43): p. 12226–12231.
- 51. Milholland B., et al., Differences between germline and somatic mutation rates in humans and mice. Nature Communications, 2017. 8(1): p. 15183.
- 52. Cooke M.S., et al., Oxidative DNA damage: mechanisms, mutation, and disease. The FASEB Journal, 2003. 17(10): p. 1195–1214.
- 53. Foyer C.H., Oxygen processing in photosynthesis. Biochemical Society Transactions, 1996. 24(2): p. 427–433.
- 54. Lodeyro A.F., et al., Suppression of Reactive Oxygen Species Accumulation in Chloroplasts Prevents Leaf Damage but Not Growth Arrest in Salt-Stressed Tobacco Plants. PLOS ONE, 2016. 11(7): p. e0159588.
- 55. Noctor G., et al., Drought and Oxidative Load in the Leaves of C3 Plants: a Predominant Role for Photorespiration? Annals of Botany, 2002. 89(7): p. 841–850.
- 56. Lans H., et al., The DNA damage response to transcription stress. Nat Rev Mol Cell Biol, 2019. 20(12): p. 766–784.
- 57. Rieckher M., et al., Distinct DNA repair mechanisms prevent formaldehyde toxicity during development, reproduction and aging. Nucleic Acids Research, 2024. 52(14): p. 8271–8285.
- 58. Sabatella M., et al., Tissue-Specific DNA Repair Activity of ERCC-1/XPF-1. Cell Reports, 2021. 34(2): p. 108608.
- 59. Belfield E.J., et al., DNA mismatch repair preferentially protects genes from mutation. Genome Res, 2018. 28(1): p. 66–74.
- 60. Halligan D.L. and Keightley P.D., Spontaneous Mutation Accumulation Studies in Evolutionary Genetics. Annual Review of Ecology, Evolution, and Systematics, 2009. 40(1): p. 151–172.
- 61. Banaś A.K., et al., All You Need Is Light. Photorepair of UV-Induced Pyrimidine Dimers. Genes, 2020. 11(11): p. 1304.
- 62. Gaillard H., et al., Chromatin Remodeling Activities Act on UV-damaged Nucleosomes and Modulate DNA Damage Accessibility to Photolyase. Journal of Biological Chemistry, 2003. 278(20): p. 17655–17663.
- 63. Suter B., Livingstone-Zatchej M., and Thoma F., Chromatin structure modulates DNA repair by photolyase in vivo. The EMBO Journal, 1997. 16(8): p. 2150–2160.
- 64. Mao P., et al., Chromosomal landscape of UV damage formation and repair at single-nucleotide resolution. Proceedings of the National Academy of Sciences, 2016. 113(32): p. 9057–9062.
- 65. Ehrlich M., et al., DNA cytosine methylation and heat-induced deamination. Bioscience Reports, 1986. 6(4): p. 387–393.
- 66. Holliday R. and Grigg G.W., DNA methylation and mutation. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 1993. 285(1): p. 61–67.
- 67. Shen J.-C., Rideout W.M. III, and Jones P.A., The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA. Nucleic Acids Research, 1994. 22(6): p. 972–976.
- 68. Tomkova M., et al., Human DNA polymerase epsilon is a source of C>T mutations at CpG dinucleotides. Nat Genet, 2024. 56(11): p. 2506–2516.
- 69. Bewick A.J. and Schmitz R.J., Gene body DNA methylation in plants. Curr Opin Plant Biol, 2017. 36: p. 103–110.
- 70. Niederhuth C.E., et al., Widespread natural variation of DNA methylation within angiosperms. Genome Biol, 2016. 17(1): p. 194.
- 71. Tran R.K., et al., DNA Methylation Profiling Identifies CG Methylation Clusters in Arabidopsis Genes. Current Biology, 2005. 15(2): p. 154–159.
- 72. Takuno S. and Gaut B.S., Body-methylated genes in Arabidopsis thaliana are functionally important and evolve slowly. Mol Biol Evol, 2012. 29(1): p. 219–27.
- 73. Vidalis A., et al., Methylome evolution in plants. Genome Biology, 2016. 17(1): p. 264.
- 74. Zilberman D., An evolutionary case for functional gene body methylation in plants and animals. Genome Biol, 2017. 18(1): p. 87.
- 75. Concia L., et al., Genome-Wide Analysis of the Arabidopsis Replication Timing Program. Plant Physiology, 2018. 176(3): p. 2166–2185.
- 76. Dimitrova D.S. and Gilbert D.M., The Spatial Position and Replication Timing of Chromosomal Domains Are Both Established in Early G1 Phase. Molecular Cell, 1999. 4(6): p. 983–993.
- 77. Supek F. and Lehner B., Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature, 2015. 521(7550): p. 81–84.
- 78. Woo Y.H. and Li W.-H., DNA replication timing and selection shape the landscape of nucleotide variation in cancer genomes. Nature Communications, 2012. 3(1): p. 1004.
- 79. Mao P., et al., Genome-wide maps of alkylation damage, repair, and mutagenesis in yeast reveal mechanisms of mutational heterogeneity. Genome Res, 2017. 27(10): p. 1674–1684.
- 80. Jinks-Robertson S. and Bhagwat A.S., Transcription-associated mutagenesis. Annu Rev Genet, 2014. 48: p. 341–59.
- 81. Lamesch P., et al., The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Research, 2011. 40(D1): p. D1202–D1210.
- 82. Chen S., Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta, 2023. 2(2): p. e107.
- 83. Langmead B. and Salzberg S.L., Fast gapped-read alignment with Bowtie 2. Nature Methods, 2012. 9(4): p. 357–359.
- 84. Danecek P., et al., Twelve years of SAMtools and BCFtools. Gigascience, 2021. 10(2).
- 85. Tarasov A., et al., Sambamba: fast processing of NGS alignment formats. Bioinformatics, 2015. 31(12): p. 2032–2034.
- 86. Mölder F., et al., Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]. F1000Research, 2021. 10(33).
- 87. Schultz M.D., et al., Human body epigenome maps reveal noncanonical DNA methylation variation. Nature, 2015. 523(7559): p. 212–216.
- 88. Taudt A., et al., METHimpute: imputation-guided construction of complete methylomes from WGBS data. BMC Genomics, 2018. 19(1): p. 444.
- 89. Zhang Y., et al., Model-based Analysis of ChIP-Seq (MACS). Genome Biology, 2008. 9(9): p. R137.
- 90. Seabold S. and Perktold J., Statsmodels: econometric and statistical modeling with python. SciPy, 2010. 7(1): p. 92–96.
Data Availability Statement
NanoSeq library sequencing data are available in the NCBI SRA database under BioProject accession PRJNA1247547.




