Proceedings of the National Academy of Sciences of the United States of America. 2014 Apr 23;111(18):6684–6689. doi: 10.1073/pnas.1321854111

Natural insertions in rice commonly form tandem duplications indicative of patch-mediated double-strand break induction and repair

Justin N Vaughn 1, Jeffrey L Bennetzen 1,1
PMCID: PMC4020087  PMID: 24760826

Significance

Very short insertions are usually attributable to replication slippage. Another class of longer insertions (>10 bp) creates tandem duplications even in the absence of preexisting repeats. This work analyzes the properties and mechanistic implications of such insertion polymorphisms segregating in rice. To our knowledge, this work is the first comprehensive analysis of this major class of natural mutations in any plant. Inspired by the prior experiments of Stéphane Vispé and Masahiko Satoh, we propose a model for how a substantial number of double-strand breaks are created and how they might result in tandem duplications. The model, based on patch-mediated nick repair, is indirectly supported by recently published experiments using a modified CRISPR-associated 9 nicking enzyme.

Keywords: double-strand break repair, structural DNA variation

Abstract

The insertion of DNA into a genome can result in the duplication and dispersal of functional sequences through the genome. In addition, a deeper understanding of insertion mechanisms will inform methods of genetic engineering and plant transformation. Exploiting structural variations in numerous rice accessions, we have inferred and analyzed intermediate length (10–1,000 bp) insertions in plants. Insertions in this size class were found to be approximately equal in frequency to deletions, and compound insertion–deletions comprised only 0.1% of all events. Our findings indicate that, as observed in humans, tandem or partially tandem duplications are the dominant form of insertion (48%), although short duplications from ectopic donors account for a sizable fraction of insertions in rice (38%). Many nontandem duplications contain insertions from nearby DNA (within 200 bp) and can contain multiple donor sources—some distant—in single events. Although replication slippage is a plausible explanation for tandem duplications, the end homology required in such a model is most often absent and rarely is >5 bp. However, end homology is commonly longer than expected by chance. Such findings lead us to favor a model of patch-mediated double-strand-break creation followed by nonhomologous end-joining. Additionally, a striking bias toward 31-bp partially tandem duplications suggests that errors in nucleotide excision repair may be resolved via a similar, but distinct, pathway. In summary, the analysis of recent insertions in rice suggests multiple underappreciated causes of structural variation in eukaryotes.


Genomic DNA insertion causes genome expansion and, potentially, the rearrangement and diffusion of protein domains and regulatory elements throughout the genome (1, 2). Additionally, genetic engineers generally aim to integrate specific DNA into the nuclear genome, so the natural mechanisms by which this integration occurs may serve as a starting point to elaborate and improve genome modification (3, 4). Common causes of gene-sized insertions are unequal recombination (5), transposable element replication (1), and ectopic recombination stimulated by double-strand breaks (DSBs) in the genome (2, 6). Shorter events are less well characterized, but it appears that these can be created by similar processes (7). Still, high-throughput sequencing of DSB repair events in humans (8) and plants (9) suggests that insertions related to induced breaks are very rare and very short.

Although the processes described above can produce duplications at distant genetic loci, the most common form of non-microsatellite-associated insertions in humans is tandem duplications (10). Once created, tandem duplications can be dramatically expanded by unequal recombination or replication slippage. Such duplications may be deleterious, or they may be promoted by selection for a novel or expanded function (11, 12).

Although tandem repeats are ubiquitous in eukaryotic genomes, the mechanisms for their origin are still in question. Early analysis of human indel mutations indicated that replication slippage was the most effective model to explain the origin of assorted repeats (13). In other studies, longer, de novo tandem duplications were also hypothesized to be caused by slippage because, out of 85 insertions producing such duplications, 50 were associated with flanking repeats >2 bp (14). Replication slippage would presumably require a preexisting short repeat because priming must occur between the end of the loop that will become the duplication and the position to where replication slips. Authors of more recent work investigating insertions across the human genome suggest alternatives to replication slippage on the grounds that homology is often either nonexistent or very short, whereas the length of homology and the length of insertion are not correlated (10). These researchers favor a model based on DSBs being repaired by nonhomologous end-joining (NHEJ). However, conventional models of DSB repair are strained to predict tandem duplications >10 bp, much less >100 bp. Such models require extensive single-stranded, complementary ends to be preserved during the break. Moreover, DSBs produced by Tal-effector nucleases in humans do not yield insertions that form tandem repeats, despite the fact that the breaks generate a 5′ overhang (15). Thus, this common class of mutations currently lacks a firm molecular explanation.

Similar to tandem duplications, short duplications are commonly found within 100 bp of one another, but with unique intervening DNA (16). By comparing human polymorphisms with chimp sequence, Thomas et al. (16) inferred that the repeats were recent insertions. As discussed by the authors and herein, a mechanism for such duplications is even less forthcoming than for tandem duplications.

In this study, we used extensive population-scale rice resequencing data to confirm that tandem duplications are also abundant natural polymorphisms in the plant kingdom. Additionally, we found that many insertions in rice, although not perfectly tandem, are from a ∼50-bp window around the insertion site. We rarely found the end homology in tandem repeats that is expected for replication slippage, although we did note a bias toward short microhomology between insertion ends and insertion site. These data led us to elaborate on the DSB model of tandem duplication, proposing that long patch base excision repair (BER) on complementary strands commonly leads to such patterns (17). Additionally, we characterized common forms of nontandem, but local, duplication.

Results

Inferring Insertions.

Recent mutational events can be inferred by comparing orthologous sequences between two lineages with an orthologous sequence in a known outgroup lineage (1). The extant state in the sister lineages matching the outgroup state is inferred to be the ancestral state. Although such inferences may be false due to segregating polymorphisms in the ancestral population, they are generally valid for Oryza sativa (rice) comparisons using Oryza glaberrima as an outgroup (1). Insertion variations segregating in an extant population are more likely to be recent mutations and, hence, are less likely to complicate interpretation resulting from multiple events and sequence divergence. For these reasons, we chose to use recently published data regarding genetic variation in a sample of 50 rice accessions (18). In that study, indels >9 bp were ascertained by pooling sequence data for subpopulations in the sample and de novo assembling reads. Once assembled, these contigs were positioned on the rice reference genome of the cultivar Nipponbare to localize the relative mutation. The authors also resequenced the Nipponbare genome and assessed structural variations relative to a different rice reference genome, indica cultivar 93-11. Using known structural variations between these two Sanger-sequenced genomes—93-11 and Nipponbare—they determined that 96% of structural variants identified in their short-read pipeline were true positives.

To infer whether an indel was an insertion or deletion, we used the recently released genome sequence of O. glaberrima (Table S1). To establish large tracts of colinearity, each O. glaberrima chromosome was aligned with the homologous chromosomes from cultivars indica 93-11 and Nipponbare. The region surrounding each of the polymorphic indels was extracted and realigned by using the outgroup, the Nipponbare reference, and the variant sequence. Variant sequences were synthetically derived from the Nipponbare reference by using associated indel information. A variant was called an insertion if a gap of the exact size was also found in the outgroup and 15 bp flanking the gaps were gap-free and shared >90% identity with the outgroup.
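The calling rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names (`identity`, `flanks_ok`, `call_event`), the use of `-` as the gap character, and the premise that the three sequences are already aligned rows of equal length are all hypothetical choices made for the example.

```python
GAP = "-"

def identity(a, b):
    """Fraction of aligned columns at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def flanks_ok(row, outgroup, start, end, flank=15, min_identity=0.90):
    """True if both flanks of the gap [start, end) are gap-free and
    share more than min_identity with the outgroup."""
    if start < flank or end + flank > len(row):
        return False
    for lo, hi in ((start - flank, start), (end, end + flank)):
        a, b = row[lo:hi], outgroup[lo:hi]
        if GAP in a or GAP in b or identity(a, b) <= min_identity:
            return False
    return True

def call_event(short_allele, long_allele, outgroup, flank=15):
    """Return 'insertion', 'deletion', or None for one aligned indel.

    short_allele is the row that carries the gap; long_allele carries the
    extra bases. Only the long allele's flanks are checked because, in this
    sketch, the two alleles are assumed identical outside the indel.
    """
    gap_cols = [i for i, c in enumerate(short_allele) if c == GAP]
    if not gap_cols:
        return None
    start, end = gap_cols[0], gap_cols[-1] + 1
    if end - start != len(gap_cols):
        return None  # the gap is not one contiguous block
    if not flanks_ok(long_allele, outgroup, start, end, flank):
        return None
    out_gap = outgroup[start:end]
    if all(c == GAP for c in out_gap):
        return "insertion"  # outgroup matches the short allele
    if GAP not in out_gap:
        return "deletion"   # outgroup matches the long allele
    return None             # outgroup gap of a different size: not inferable

# Toy alignment with 15-bp flanks on each side of a 5-bp indel.
left, right = "ACGTACGTACGTACG", "TTGCATTGCATTGCA"
long_row = left + "GGGCC" + right
short_row = left + "-----" + right
print(call_event(short_row, long_row, outgroup=short_row))
print(call_event(short_row, long_row, outgroup=long_row))
```

With 15-bp flanks, the >90% identity threshold tolerates at most one mismatch per flank (14/15 ≈ 93%).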

For clarity of exposition and to ease manual curation of the data, we chose to describe data from only 4 of 12 rice chromosomes: the longest (no. 1), the shortest (no. 10), the most heavily curated (no. 3), and no. 8, which falls between nos. 3 and 10 in length. On these four chromosomes, only 92 of 65,391 structural variations (0.1%) relative to the Nipponbare reference are insertions with a perfectly adjacent deletion (Table 1). Thus, complex structural variants derived from mixed insertion–deletion events are rare within this size class of 10–1,000 bp. Also, the inferred insertion/deletion profile deviates dramatically from induced DSB repair outcomes, which commonly produce deletions (6, 9), whereas we inferred approximately equivalent frequencies of insertions relative to deletions. As expected, the number of structural variations correlated with the size of a chromosome. Chromosome 3 had the highest percentage of inferable events, likely because of its high-quality assembly.

Table 1.

Summary of structural variations and inferred events for chromosomes analyzed

Chr.  | Chr. size,* Mb | Structural variations | Adjacent events | Deletions | Insertions | Inferable events, %
1     | 43.2           | 22,278                | 53              | 4,912     | 4,390      | 42
3     | 36.4           | 14,990                | 18              | 4,285     | 3,544      | 52
8     | 28.4           | 14,757                | 9               | 3,047     | 2,711      | 39
10    | 23.1           | 13,366                | 12              | 2,907     | 2,400      | 40
Total | 131.1          | 65,391                | 92              | 15,151    | 13,045     | 43

Chr., chromosome.

Adjacent events are defined as insertion polymorphisms directly adjacent to deletion polymorphisms.
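As a quick arithmetic check on Table 1, the sketch below recomputes the summary percentages from the tabulated counts. It assumes that "inferable events, %" is the sum of inferred deletions and insertions divided by the number of structural variations; that definition is our reading of the table, not something stated explicitly, and the dictionary layout is hypothetical.

```python
# Counts transcribed from Table 1 (chromosomes 1, 3, 8, and 10).
TABLE1 = {
    "1":  {"sv": 22278, "adjacent": 53, "deletions": 4912, "insertions": 4390},
    "3":  {"sv": 14990, "adjacent": 18, "deletions": 4285, "insertions": 3544},
    "8":  {"sv": 14757, "adjacent": 9,  "deletions": 3047, "insertions": 2711},
    "10": {"sv": 13366, "adjacent": 12, "deletions": 2907, "insertions": 2400},
}

def inferable_pct(row):
    # Assumption: inferable events = inferred deletions + inferred insertions.
    return 100 * (row["deletions"] + row["insertions"]) / row["sv"]

def totals(table):
    keys = ("sv", "adjacent", "deletions", "insertions")
    return {k: sum(row[k] for row in table.values()) for k in keys}

total = totals(TABLE1)
adjacent_pct = 100 * total["adjacent"] / total["sv"]     # share of adjacent events
ins_to_del = total["insertions"] / total["deletions"]    # insertion:deletion ratio

for chrom, row in TABLE1.items():
    print(chrom, round(inferable_pct(row)))
print("Total", round(inferable_pct(total)),
      f"adjacent: {adjacent_pct:.1f}%", f"ins/del: {ins_to_del:.2f}")
```

Under that assumption the recomputed values agree with the rounded percentages in the table, and the insertion-to-deletion ratio is somewhat below 1.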

In the human genome, short insertions (8–100 bp) commonly create tandem duplications that are in the same orientation and have no unique spacer sequence between the resultant repeats (10). Such insertions are impossible to position exactly (Fig. 1 A–C), and so we used the trace extension metric, d, to first characterize the inferred insertions. As illustrated in Fig. 1 A–C and Messer and Arndt (10), d can characterize whether an insertion is a tandem duplication (d = l in Fig. 1B) or comes from a more distant site (d = 0 in Fig. 1A). Also, d allows one to determine whether an insertion and its donor have similarity that extends beyond the boundaries of the insertion, even when the insertion creates a tandem duplication (d > l in Fig. 1C).
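The relationship between d and the insertion length l can be expressed as a small classifier. The sketch below is illustrative only: the category labels and the function name `classify_insertion` are hypothetical, and the `near_zero` cutoff of 4 is borrowed from the d < 4 criterion in the Fig. 3 legend rather than from a rule stated for this purpose.

```python
def classify_insertion(d, l, near_zero=4):
    """Classify an insertion of length l from its trace extension d.

    d == l          : perfect tandem duplication (Fig. 1B)
    d > l           : tandem duplication whose similarity extends past
                      the insertion, i.e. end homology (Fig. 1C)
    d < near_zero   : no meaningful similarity at the junction (Fig. 1A)
    otherwise       : only part of the insertion duplicates adjacent DNA
    """
    if l <= 0 or d < 0:
        raise ValueError("l must be positive and d non-negative")
    if d == l:
        return "tandem"
    if d > l:
        return "tandem_with_end_homology"
    if d < near_zero:
        return "non_tandem"
    return "partially_tandem"

for d, l in [(0, 20), (20, 20), (23, 20), (31, 60)]:
    print(d, l, classify_insertion(d, l))
```

The last example, d = 31 with l = 60, corresponds to the 31-bp partially tandem class that stands out in Fig. 1D.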

Fig. 1.

Trace extension, d, spectra for rice insertions. (A–C) Simplified dot plots between ancestral sequence without an insertion and derived sequence with an insertion. (A) When d = 0, a gap can be placed exactly, and there is no similarity between the insertion and regions adjacent to the insertion site. (B) When d = l, a gap cannot be placed exactly, and the insertion is a duplication of DNA adjacent to the site of insertion. (C) When d > l, not only do the conditions of d = l apply, but the similarity between donor loci and the site of insertion extends beyond the inserted sequence, suggesting a homology-dependent mechanism. (D) A heatmap plotting the total counts of all insertions having a particular combination of length (x axis) and d (y axis). Gray-scale range is from 1 to >90 insertions.

Tandem Duplications Account for More Than Half of >9-bp Insertions and Rarely Exhibit Extensive End Homology.

The sharp diagonal in Fig. 1D demonstrates that, as in humans, insertions >9 bp commonly create tandem duplication: d is often equal to l. Across all analyzed chromosomes, tandem duplications account for approximately half of all insertions. The remaining insertions either come from more distant loci (d ∼ 0) or are partially tandem duplications (Fig. 1D). Of additional interest, there is a clear bias toward partially tandem duplications of 31 bp, and, within that group, there is very little variation around this value. In other words, it is common for insertions to occur that duplicate 31 bp around the site of insertion but include intervening sequence between the duplications. We discuss the implications of such a bias below.

Replication slippage is a common explanation for tandem duplications (14), although this explanation is now being challenged, particularly with regard to whole genome analysis (10). One assumed prerequisite for replication slippage is a repeat that allows for the stabilization of the slippage intermediate and resumption of synthesis at the mispaired site (Fig. 2A). The minimum primer length for extension for most polymerases in vitro appears to be 6 bp (19). Although we cannot be certain that this rule applies at all times in vivo, length restrictions on microsatellite expansion support this result (20).

Fig. 2.

End homology between donors and insertions and the implication for mechanistic models. (A) Replication slippage is dependent on priming, and thus tandem duplications resulting from such a model should exhibit similarity between the end of the duplicated region and the beginning of the donor site, described here as end homology. When the replicating strand slips back to another annealing site, the DNA exhibits a transient bubble. Replication from the slipped site duplicates the intervening DNA (light gray) as well as the priming site (dark gray). (B) A null distribution of end homology was generated by randomly sampling positions along the rice genome (Methods). The tandem repeat homology was assessed by subtracting the length of the insertion from the trace extension, d. Values >5 bp suggest a synthesis-dependent mechanism of duplication. (C) A patch-mediated model of tandem duplication formation in which synchronous BER events create a DSB that is repaired via NHEJ.

Data above the d = l diagonal are indicative of the degree of homology extending beyond the insertion and its donor sequence. A model of replication slippage would predict d − l to be ∼6 or greater. In fact, if replication slippage were common for these insertions, the observed 1:1 diagonal should be shifted at least 6 units higher. Additionally, for longer insertions, we expect that greater homology would be required to stabilize the intermediate loop. This expectation was not observed (Fig. 1D). More often than not, d is equal to l; therefore, if replication slippage is occurring, it is occurring most often in the absence of any priming base.

Bias Toward Short End Homology Suggests That NHEJ, Not Replication Slippage, Occurs During Tandem Duplication.

As discussed above, tandem repeats are generally not associated with end homology: d − l = 0 (Fig. 1 B and D). Still, we tested for statistical bias in our (d − l) values by randomly selecting one position in the Nipponbare genome and another position 100 bp downstream and generating a null distribution based on the number of downstream bases that the two points shared. In fact, end homologies for tandem repeats are typically longer than expected by chance, although they are often 0 and are nearly always <6 bp (Fig. 2B). This end homology profile exhibited by tandem repeats is strikingly similar to that of transfected DNA repaired by NHEJ in mammalian cells (21).
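The null distribution described above can be approximated with a short sampling routine. The following is a sketch under explicit assumptions: it runs on a uniformly random toy sequence rather than the Nipponbare genome, and the function names, the sample size, and the `max_len` cap on how far matching is extended are hypothetical parameters chosen for illustration.

```python
import random
from collections import Counter

def shared_downstream(seq, i, j, max_len=50):
    """Number of consecutive identical bases starting at positions i and j (i < j)."""
    n = 0
    while n < max_len and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n

def null_end_homology(seq, n_samples=10000, offset=100, max_len=50, seed=0):
    """Sample random position pairs separated by `offset` bp and return the
    frequency of each shared-prefix length (the chance-level end homology)."""
    rng = random.Random(seed)
    last_start = len(seq) - offset - max_len
    if last_start <= 0:
        raise ValueError("sequence too short for the chosen offset and max_len")
    counts = Counter()
    for _ in range(n_samples):
        i = rng.randrange(last_start)
        counts[shared_downstream(seq, i, i + offset, max_len)] += 1
    return {k: v / n_samples for k, v in sorted(counts.items())}

# Toy stand-in for a genome: 20 kb of uniformly random bases (an assumption;
# a real genome has skewed composition and repeats, which lengthen the tail).
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(20000))
null = null_end_homology(toy_genome)
for length, freq in null.items():
    print(length, f"{freq:.4f}")
```

Observed d − l values for tandem duplications would then be compared against these frequencies; an excess at lengths of 1 to 5 bp with few values of 6 bp or more is the pattern reported in Fig. 2B.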

As elaborated on in Discussion, these data lead us to a model for the origin of tandem duplications that depends on long-patch BER. Originally proposed as a major (but underappreciated) cause of DSBs (17), the “DNA repair patch-mediated pathway” also explains the high frequency of tandem duplications in plants and animals. Briefly, we propose that DNA lesions (e.g., single-strand nicks) close to one another on complementary strands trigger simultaneous BER. Because complementary strands for both events are concurrently displaced by the other repair event, the repair results in a DSB (Fig. 2C). Regardless of the degree of end degradation, repair of this DSB via NHEJ will result in a tandem duplication.

Nontandem Duplications Are Often Derived from Local Sequences Within 200 bp and Do Not Exhibit Signs of Canonical Conversion Mechanisms.

Although Fig. 1 reveals tandem duplications, it cannot show the homologies that exist >3–4 bp away from the insertion. Many of the insertions with d ∼ 0 could in fact come from very near the insertion. Based on manual curation and prior work (7, 16), we predicted that local homologies were more likely to be the source of insertions than more distant sites. Therefore, we first searched each insertion that was not a tandem duplication against the ancestral sequence upstream and downstream for 100 + 2.5l bp, where l is the length of the insertion. For each match within this region, we determined the coverage of the match relative to the insertion and the distance of the match relative to the site of insertion (Methods).
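The local search described above can be sketched as follows. This is a simplified stand-in, not the procedure used in the study: it reports only the single longest exact match found with Python's `difflib.SequenceMatcher`, whereas the actual analysis tolerated imperfect matches and multiple donors. The function names and the returned dictionary layout are hypothetical.

```python
import random
from difflib import SequenceMatcher

def search_radius(l):
    """Search distance on each side of the insertion site: 100 + 2.5l bp."""
    return int(100 + 2.5 * l)

def best_local_donor(ancestral, site, insertion):
    """Find the longest exact match to `insertion` near `site` in `ancestral`.

    `site` is the 0-based position in the ancestral sequence at which the
    insertion occurred. Returns None when nothing matches.
    """
    l = len(insertion)
    w = search_radius(l)
    lo, hi = max(0, site - w), min(len(ancestral), site + w)
    window = ancestral[lo:hi]
    matcher = SequenceMatcher(None, insertion, window, autojunk=False)
    m = matcher.find_longest_match(0, l, 0, len(window))
    if m.size == 0:
        return None
    start, end = lo + m.b, lo + m.b + m.size
    if end <= site:
        side, distance = "5'", site - end      # donor lies upstream
    elif start >= site:
        side, distance = "3'", start - site    # donor lies downstream
    else:
        side, distance = "spanning", 0         # donor overlaps the site
    return {"start": start, "end": end, "side": side,
            "distance": distance, "coverage": m.size / l}

# Toy example: a 20-bp insertion copied from 30 bp upstream of the site.
rng = random.Random(7)
ancestral = "".join(rng.choice("ACGT") for _ in range(400))
print(best_local_donor(ancestral, site=200, insertion=ancestral[150:170]))
```

Coverage here is the fraction of the insertion accounted for by the donor, and distance is measured from the insertion site to the nearest edge of the donor.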

The distance between insertion site and donor locus is typically within 50 bp, although more distant conversions are clearly possible (Fig. 3A). In some cases, only a fragment of the insertion appears to be derived from a local locus. There are a substantial number of complex events in Fig. 3A, in which 15% or more of an insertion could be accounted for by a local donor sequence, but the rest of the sequence either comes from another local site or a region outside of our search (gray bars in Fig. 3A). When an insertion is made up of two distinct local donor sequences, these donor loci can be quite distant from one another and can be derived from two upstream sites, two downstream sites, or both upstream and downstream sites. Note that these are not overlapping matches but account for distinct regions of the insertion (Methods).

Fig. 3.

Characterization of insertions donated from the sequence around an insertion site. (A) The length and position of the donor site relative to the insertion site are plotted (x axis) for each insertion with d < 4 and < 3 extensive local matches. Insertions are sorted top to bottom based on shortest-to-longest length. Darkness of a donor bar indicates the percentage of the insertion covered by the donor, with black being 100%. Only donors with >15% coverage and an overlap of >7 bp are plotted. For brevity, only insertions from chromosome 10 are presented; other chromosomes exhibit a similar profile (Figs. S1–S3). (B) Summary of all insertion types for chromosomes 1, 3, 8, and 10. Examples of each category are given as schematics and representative dot plots in Figs. S4–S8. (C) Similarly to Fig. 2B, end homology of local duplications is plotted against a null distribution. Gap homology is the frequency of particular d scores <12 for insertions >20 bp. Excluding partially tandem duplications, the gap homology distribution should resemble the null distribution. (Upper) The schematic illustrates what is meant by 5′ and 3′ homology. The extent of the insertion is colored light gray, and extendable homologous regions are dark gray. Only duplications with 95% coverage of the insertion and positioned >9 bp from the insertion were considered. Categories are broken into the position of the donor relative to the insertion site: for 5′ donors, n = 264; for 3′ donors, n = 301.

Unlike tandem duplications, local duplications with intervening unique sequence are difficult to explain by invoking replication slippage or patch-mediated repair. Another model has been postulated to explain the types of insertions found at repaired DSBs that were induced by homing endonucleases. The synthesis-dependent strand-annealing (SDSA) model assumes that after DSB and 5′ to 3′ resectioning, the 3′ overhang is able to synthesize from a local or ectopic site. Once synthesis has occurred, the temporary duplex is denatured, and then the newly synthesized single strand can begin again its search for an appropriate ligation partner (3). This process appears to be a major mechanism by which sequences are duplicated from distant loci (2, 7) and may account for short, local duplications as well (22).

The most plausible scenario for local duplication by SDSA would be that the 3′ overhang of one side of the DSB uses the 3′ overhang of the other side as a template. Alternatively, a 3′ overhang could form a loop and copy sequence on its own strand in the reverse orientation. We only observed ∼0.5% of local duplications in the reverse orientation. Thus, we consider this possibility rare. The paucity of reversed insertions also indicates that the 3′ overhang rarely invades the duplex of the other half of the DSB because it could then arbitrarily copy DNA in either direction. It follows, then, that duplications in the donor orientation are also unlikely to be the result of the 3′ overhang invading the duplex on its own side of the DSB.

If we consider the 3′ overhang as only being able to copy from the 3′ overhang of the other side of the DSB, then insertions with 5′ donors would have to be extended by the 3′ overhang of the 3′ side of the DSB. The opposite would be true for insertions with 3′ donors. To test this model, we determined the 5′ and 3′ homology between insertions and their donors, using only local duplications with >95% coverage and a spacer distance of >9 bp.

As with the end homology seen in tandem duplications, the homology patterns are slightly skewed toward longer tracts of homology relative to a null model. However, if SDSA were a common cause of these insertions, the 5′ donor and 3′ homology and the 3′ donor and 5′ homology should be highly enriched in >6-bp end homologies (Fig. 3C). Although we observe that ∼5% of these high-confidence local duplications follow this expectation, the great majority favor the NHEJ-like pattern observed for tandem duplications.

Insertions Associated with Partially Tandem Duplications Comprise Mostly Local Duplications but with a Substantial Fraction of More Distant Donors.

Fig. 1D exhibits a striking feature characterized by insertions with various lengths >31 bp and d = 31. Such insertions are tandem duplications with a substantial block of intervening DNA. If such events were the result of replication slippage, DNA polymerase would need to stall, copy DNA from an ectopic or local position, commonly slip back 31 bp from its original template position, and resume replication. Alternatively, under the patch-mediated model, the presence of a tandem duplication with intervening DNA copied from another locus indicates a series of events: patch-mediated DSB induction, followed by additional synthesis from elsewhere in the genome. Interestingly, insertions that are 31 bp in length with d = 31 are no more abundant than other tandem duplications (Fig. 1D). This finding indicates that these partially tandem insertions nearly always result in the insertion of at least some DNA from outside the tandemly duplicated sequence.

We manually inspected all 227 insertions >40 bp in length with a d value between 29 and 31 (Dataset S1). We found that 72% of the nontandemly duplicated parts of these sequences came from the local sequence context. Indeed, 14% of the insertions overlap the tandem duplication, thus resulting in a triplication of some of the donor sequence, which would be expected if additional replication was occurring after patch-mediated tandem duplication. Unlike the analysis shown in Fig. 3B, a much greater percentage of the insertions involve local duplication. This observation may in part be a result of an excessive number of inferred insertions caused by paralog-related misassembly, whereas these partially tandem insertions are more legitimate examples of DSB repair. Still, 16% of the insertions appear to have been copied from a distant locus, and the remainder, 12%, are a combination of local and unidentified donors.

Discussion

Short insertions of DNA into the genome are most commonly triggered by microsatellite repeats. In contrast, gene-sized insertions are often mediated by transposable elements and SDSA. In humans, there is a third and distinct class of common insertions of intermediate length that has no clear explanation (10, 20). We have identified abundant recent insertions >9 bp in rice using population polymorphisms and a reproductively isolated outgroup. Of the insertions analyzed, the majority are tandem duplications, and these tandem duplications lack a signature of replication slippage (Figs. 2 and 3B). We feel that the weight of evidence supports a model in which patch-mediated displacement of complementary strands creates tandemly duplicated ends that are rejoined by NHEJ (Fig. 2C). This model precludes the need for the end homology required by type B DNA polymerases and naturally resolves the difficult problem of how lengthy duplicates are established, even in the absence of end homology. Additionally, when modified to account for cross-reactivity with nucleotide excision repair (NER) errors, the model explains the peculiar observation of a bias toward 31-bp partially tandem duplications.

Long-patch BER has been shown to create DSBs in vivo when single-strand lesions, such as uracils, are present on complementary strands within 30 bp of each other (Fig. 2C and ref. 17). The efficiency of DSB is reduced with longer patches, but patches as long as 80 bp of BER have been observed in vivo (23), where DNA synthesis may be more rapid. Moreover, even longer distances between paired nicks induced by modified CRISPR-associated 9 enzymes can stimulate DSBs in human cells (24). In support of our model, insertions resulting from these DSBs formed tandem duplications (24). NER is distinct from BER in that a consistently sized segment of DNA is removed to repair large adducts such as UV-induced thymine dimers. In NER, RPA 70 protein is used to measure ∼30 bp from an initial nicking site to a secondary nicking site on the same strand (25). We hypothesize that the 31-bp bias seen in partially tandem duplications is a result of nicking errors on the complementary strand (Fig. 1D). Such errors would induce a DSB with two complementary 31-bp 5′ overhangs. Unlike the patch-mediated model (Fig. 2C), where in-filling is concurrent with DSB creation, overhangs produced by NER error would likely undergo less direct NHEJ. As noted above, we do not see an enrichment in 31-bp tandem duplications as would be expected if these ends were simply filled in and ligated.

The presence of local duplications with short stretches of intervening DNA has been observed in the human genome (16). Many of these are likely to be partially tandem duplications (Fig. 3), but the authors also report on a class of mutations similar to the local duplications seen in this study. As observed in humans and Arabidopsis (16), there is a sharp limitation to the distance between these duplications (Fig. 3A). In addition, we find that such insertions can be the composite of many local donor sequences. Unlike tandem repeats, such patterns are difficult to explain by the patch-mediated model, but, like tandem repeats, they appear to lack a signature of primed synthesis at an ectopic site (Fig. 3C). Some members of the X family of DNA polymerases in humans engage in primer-free but template-directed synthesis during NHEJ. Polymerase μ can even tolerate steric conflict between template and the terminal priming base (26). Rice only has one X polymerase, and it is more similar to the less-promiscuous human polymerase λ (27). Given the divergence, its function may have expanded, and/or other polymerases from the Y family that are tolerant of terminal priming mismatch may be active during NHEJ (28, 29). Thus, the molecular machinery to explain local duplications is present in humans and, likely, in plants. The constrained distance from the insertion site of these duplications suggests that the polymerase is using the single-stranded end of the resectioned DNA as a template because such polymerases cannot invade double-stranded DNA. Still, given that single-strand annealing (SSA) is commonly used to resolve DSBs (3), it is difficult to imagine why the nascent duplex synthesized by an X or Y polymerase would ever denature. Such denaturing in SDSA appears to be mediated by specific helicases (30, 31), and given that we observe long, local duplications (Fig. 3A), the proposed pathway would also be dependent on these helicases outcompeting the SSA machinery.

If X/Y-type polymerases are mediating replication slippage, arguments against the slippage model based on priming constraints are not particularly valid. Still, replication slippage is an unlikely explanation given the following additional observations. (i) The partially tandem duplications that are commonly observed would require the polymerase to copy from an ectopic site and then slip 5′ to where it initially stalled, despite the added insertion and the affinity between preexisting strands. Moreover, the 31-bp bias in this duplication category (Fig. 1D) would require that the polymerase commonly slip back 31 bp from where it initially stalled. (ii) As observed previously (10), we find no correlation between length of insertion and end homology (Fig. 1D). Longer homology would almost certainly be required for stabilizing long, unhybridized loops. (iii) If long strands are so easily displaced, it is unclear why, during a slippage event, the leading strand would prefer its template strand over the lagging strand template, thus creating a duplication in the opposite orientation. As described above, such duplications are rarely observed.

It is plausible that, if a terminal priming base is not required, both local and tandem duplications could be explained by a polymerase X or Y repair model discussed above. In the case of tandem duplication, the polymerase would start at the end of the break and copy the resectioned strand; the extension would be denatured, and NHEJ or SSA would attach the ends. However, data on various forms of resolved DSBs show that insertions, when they do occur, rarely, if ever, produce tandem duplications (9, 22). Although they do produce local duplications, these occur adjacent to deletions more commonly than as insertions alone (9). We observe very few such adjacent events: ∼3 in 1,000 insertions (Table 1). The fraction of local duplications is clearly >0.3% (Fig. 3B); therefore, a large number of local duplications still lack a strong mechanistic explanation. Indeed, one of the strengths of the patch-mediated model is that tandem repeats can mask the deletions observed in DSB repair experiments, simply resulting in a shorter duplicated region.

Because of the ubiquity of tandem duplications and the generality of the patch-mediated model, it is tempting to include local duplications as a possible outcome, but such a model would require an internal deletion of the duplicated region that is positionally coincident on the start of the duplication. A reasonable biochemical scenario that would result in such an outcome has yet to be conceived. Indeed, an agreed-upon mechanism to explain local duplications is still lacking (11). The continued analysis of mutants involved in DNA repair in plants, particularly the repair polymerases, should inspire better models. When coupled with next-generation sequencing, mutant analysis can be particularly informative and robust (9). However, current methods for inducing DSBs may poorly simulate the majority of naturally occurring DSBs. In the future, poststress resequencing of wild-type and mutant lines may be the most accurate method to achieve a thorough understanding of these events (32).

Methods

Whole-Genome Alignments and Insertion Inference.

To infer insertions and deletions, we first generated whole-genome alignments of O. sativa var. indica, O. sativa var. japonica, and O. glaberrima. The source and version of each genome are given in Table S1. The Mauve progressive aligner was run on each chromosome separately with default parameters (33). These alignments were not used directly for indel inference, but as a guide for downstream alignments used to infer the ancestral state of indel variations found in a large sample of cultivated rice accessions. The list of structural variations, all of which are >9 bp, was downloaded from ref. 18 on September 28, 2012. Only structural variations for indica, tropical japonica, and aromatic rice accessions were used. For each polymorphic site in the Nipponbare reference sequence, the 100 + 2.5l bases upstream and downstream of its position, where l is the indel length, were extracted. Similarly, the outgroup sequence associated with the same region of the alignment was extracted. The variant sequence was reconstructed from the Nipponbare sequence based on the position of the indel and its sequence. These three sequences (reference, variant, and outgroup) were realigned by using MAFFT in default mode (FFT-NS-2) (34), and these alignments were used to infer events. For a gap to be inferred as an insertion or deletion, the 15 bases flanking both sides of the gap had to match the outgroup with a pairwise identity >90%. Additionally, the flanking sequences could not contain gaps.
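The window size and flank filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Perl code: the function names are hypothetical, the window length is truncated to an integer, and each 15-base flank is assumed to be tested separately against the >90% identity threshold (the text does not say whether the two flanks were pooled).

```python
def flank_length(indel_len, base=100, scale=2.5):
    """Bases extracted on each side of a site: 100 + 2.5l (truncated to int)."""
    return int(base + scale * indel_len)


def flank_matches(row_a, row_b, min_identity=0.90):
    """True if two aligned flank segments are gap-free and > min_identity identical."""
    if not row_a or len(row_a) != len(row_b):
        return False
    if "-" in row_a or "-" in row_b:
        return False  # flanking sequences may not contain gaps
    matches = sum(a == b for a, b in zip(row_a, row_b))
    return matches / len(row_a) > min_identity


def gap_is_supported(candidate_row, outgroup_row, gap_start, gap_end, flank=15):
    """Check the `flank` alignment columns on each side of a gap.

    The gap occupies columns [gap_start, gap_end) of two equal-length
    alignment rows. Assumption: each side is evaluated on its own.
    """
    if gap_start < flank or gap_end + flank > len(candidate_row):
        return False  # not enough aligned sequence to evaluate
    left = flank_matches(candidate_row[gap_start - flank:gap_start],
                         outgroup_row[gap_start - flank:gap_start])
    right = flank_matches(candidate_row[gap_end:gap_end + flank],
                          outgroup_row[gap_end:gap_end + flank])
    return left and right
```

Under this reading, a 15-base flank tolerates one mismatch (14/15 ≈ 93%) but not two (13/15 ≈ 87%).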

Dot plots for assorted derived-ancestral pairs were generated by using dottup, which is available through the EMBOSS program suite (35), with additional PDF conversion. A word size of 3 bp was used in all cases.
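For readers unfamiliar with word-based dot plots, the idea can be sketched as follows. This is an illustrative Python function, not a reimplementation of dottup: it simply reports every position pair at which the two sequences share an exact word of the given size (3 bp here, matching the setting above). The function name and the output format are assumptions made for the example.

```python
def dot_matches(x_seq, y_seq, word=3):
    """Return the set of (x, y) start positions sharing an exact `word`-mer."""
    index = {}
    for i in range(len(x_seq) - word + 1):
        index.setdefault(x_seq[i:i + word], []).append(i)
    hits = set()
    for j in range(len(y_seq) - word + 1):
        for i in index.get(y_seq[j:j + word], ()):
            hits.add((i, j))
    return hits
```

Plotting these pairs with the ancestral sequence on the x axis and the derived sequence on the y axis gives diagonals wherever the two sequences agree over consecutive words.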

Assessing the Trace Extension Metric, d.

An insertion will produce a gap when aligned with a reference. The gap can only be placed exactly if the sequence within the insertion is unique relative to the flanking sequences into which it has inserted. This ambiguity can be seen in a dot plot, where the position chosen to start a gap can clearly be varied within a defined range (Fig. 1 B and C). Previous authors proposed a “trace extension” value, d, to account for this property (10). The trace extension is best illustrated in a dot plot; with the ancestral state on the x axis and the derived state on the y axis, d is the horizontal distance between the end of the first major diagonal and the beginning of the second. Based on the relationship between d and the length of the insertion, one can quickly determine whether the insertion resulted in a tandem duplication and whether the donor sequence possessed short, terminal repeats before insertion. We reimplemented the trace extension approach using blast2seq, which in effect reports all significantly long diagonals in a dot plot between two sequences. We aligned each derived state with the ancestral state using blast2seq with default parameters, except that the DUST filter was turned off (-F F) and gap extension and gap opening penalties were set to 4 and 2, respectively (-G 2 -E 4). These were optimized to disallow the spanning of gaps >10 bp. Only hits with an e-value of <0.001 were kept. Major diagonals, which represent alignment between the sequences to the left and right of an insertion (Fig. 1A), had to be >2.5l, where l is the length of the insertion.
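The geometry of d can be illustrated with a simplified Python sketch. It is not the BLAST-based procedure used here: it assumes exact matching only, takes the first major diagonal to be the longest common prefix of the two sequences and the second to be their longest common suffix, and uses hypothetical function names. Under those assumptions, d is the horizontal overlap of the two diagonals on the ancestral axis.

```python
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def trace_extension(ancestral, derived):
    """Return (d, l) for a single insertion, assuming exact flank matches.

    l is the insertion length. The first diagonal ends at ancestral
    position p (shared prefix); the second begins at len(ancestral) - s
    (shared suffix), so d = p + s - len(ancestral).
    """
    l = len(derived) - len(ancestral)
    if l <= 0:
        raise ValueError("derived sequence must be longer than ancestral")
    p = common_prefix_len(ancestral, derived)
    s = common_prefix_len(ancestral[::-1], derived[::-1])
    return p + s - len(ancestral), l
```

For an insertion of unique sequence the two diagonals abut and d = 0, whereas a perfect tandem duplication of length l gives d = l.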

Characterizing Insertions and Identifying Donor Loci.

For counting purposes (Fig. 3B), we considered any derived insertion as a tandem duplication if the alignment with the ancestral sequence resulted in d − l > −3. End homology of a tandem duplication (Fig. 2) was calculated by subtracting d from l (Fig. 1) for d − l > −1. The blast2seq alignments used above could also be used to characterize local duplications. For entirely nontandem insertions (d < 4), local donor sites were found by identifying sequences upstream/downstream of the insertion that overlapped the insertion by >7 bp and had identity of >95%. The percent coverage of a match was calculated as the length of the match divided by the length of the insertion. Multiple donor sites were allowed if the donor sites did not overlap within the insertion by >3 bp. The 3′ end homology of local duplications (Fig. 3C) was only assessed for insertions with d = 0 and was calculated by subtracting the start of the gap in the ancestral sequence (relative to the derived sequence) from the beginning of the insertion match position in the ancestral sequence and vice versa for 5′ end homology.
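The counting rules in this paragraph can be expressed compactly. The Python sketch below is illustrative only and rests on several stated assumptions: the tandem criterion is read as d − l > −3; insertions that are neither tandem nor entirely nontandem are given the placeholder label "partial"; and when candidate donor matches overlap by more than 3 bp within the insertion, the longer match is kept first (the text does not say how such conflicts were resolved). All function names are hypothetical.

```python
def classify_insertion(d, l):
    """Classify an insertion from its trace extension d and length l."""
    if d - l > -3:
        return "tandem"
    if d < 4:
        return "nontandem"
    return "partial"  # placeholder label for the intermediate cases


def percent_coverage(match_len, insertion_len):
    """Length of a donor match as a percentage of the insertion length."""
    return 100.0 * match_len / insertion_len


def select_donors(matches, max_overlap=3):
    """Keep donor matches that overlap each other by <= max_overlap bp.

    `matches` holds half-open (start, end) intervals in insertion
    coordinates. Longest-first greedy selection is an assumption.
    """
    kept = []
    for start, end in sorted(matches, key=lambda m: m[1] - m[0], reverse=True):
        if all(min(end, k_end) - max(start, k_start) <= max_overlap
               for k_start, k_end in kept):
            kept.append((start, end))
    return kept
```

Because the structural variations analyzed here are all >9 bp, the tandem and nontandem conditions cannot both hold for the same insertion.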

Software.

Unless otherwise stated, Perl scripts were written to perform these analyses; the following external libraries were also used and are available from CPAN (www.cpan.org): Bioperl (Version 1.6.1) (36) and Set::IntSpan (Version 1.16).

Supplementary Material

Supporting Information

Acknowledgments

This work was supported by National Science Foundation Plant Genome Program Awards 0607123 and 043707-01.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1321854111/-/DCSupplemental.

References

  • 1. Ma J, Bennetzen JL. Rapid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA. 2004;101(34):12404–12410. doi: 10.1073/pnas.0403715101.
  • 2. Wicker T, Buchmann JP, Keller B. Patching gaps in plant genomes results in gene movement and erosion of colinearity. Genome Res. 2010;20(9):1229–1237. doi: 10.1101/gr.107284.110.
  • 3. Puchta H. The repair of double-strand breaks in plants: Mechanisms and consequences for genome evolution. J Exp Bot. 2005;56(409):1–14. doi: 10.1093/jxb/eri025.
  • 4. Fauser F, et al. In planta gene targeting. Proc Natl Acad Sci USA. 2012;109(19):7535–7540. doi: 10.1073/pnas.1202191109.
  • 5. Woodhouse MR, Pedersen B, Freeling M. Transposed genes in Arabidopsis are often associated with flanking repeats. PLoS Genet. 2010;6(5):e1000949. doi: 10.1371/journal.pgen.1000949.
  • 6. Salomon S, Puchta H. Capture of genomic and T-DNA sequences during double-strand break repair in somatic plant cells. EMBO J. 1998;17(20):6086–6095. doi: 10.1093/emboj/17.20.6086.
  • 7. Pace JK, 2nd, Sen SK, Batzer MA, Feschotte C. Repair-mediated duplication by capture of proximal chromosomal DNA has shaped vertebrate genome evolution. PLoS Genet. 2009;5(5):e1000469. doi: 10.1371/journal.pgen.1000469.
  • 8. Mali P, et al. RNA-guided human genome engineering via Cas9. Science. 2013;339(6121):823–826. doi: 10.1126/science.1232033.
  • 9. Huefner ND, Mizuno Y, Weil CF, Korf I, Britt AB. Breadth by depth: Expanding our understanding of the repair of transposon-induced DNA double strand breaks via deep-sequencing. DNA Repair (Amst). 2011;10(10):1023–1033. doi: 10.1016/j.dnarep.2011.07.011.
  • 10. Messer PW, Arndt PF. The majority of recent short DNA insertions in the human genome are tandem duplications. Mol Biol Evol. 2007;24(5):1190–1197. doi: 10.1093/molbev/msm035.
  • 11. Thomas EE. Short, local duplications in eukaryotic genomes. Curr Opin Genet Dev. 2005;15(6):640–644. doi: 10.1016/j.gde.2005.09.008.
  • 12. Nourmohammad A, Lässig M. Formation of regulatory modules by local sequence duplication. PLOS Comput Biol. 2011;7(10):e1002167. doi: 10.1371/journal.pcbi.1002167.
  • 13. Levinson G, Gutman GA. Slipped-strand mispairing: A major mechanism for DNA sequence evolution. Mol Biol Evol. 1987;4(3):203–221. doi: 10.1093/oxfordjournals.molbev.a040442.
  • 14. Chen J-M, Chuzhanova N, Stenson PD, Férec C, Cooper DN. Meta-analysis of gross insertions causing human genetic disease: Novel mutational mechanisms and the role of replication slippage. Hum Mutat. 2005;25(2):207–221. doi: 10.1002/humu.20133.
  • 15. Reyon D, et al. FLASH assembly of TALENs for high-throughput genome editing. Nat Biotechnol. 2012;30(5):460–465. doi: 10.1038/nbt.2170.
  • 16. Thomas EE, et al. Distribution of short paired duplications in mammalian genomes. Proc Natl Acad Sci USA. 2004;101(28):10349–10354. doi: 10.1073/pnas.0403727101.
  • 17. Vispé S, Satoh MS. DNA repair patch-mediated double strand DNA break formation in human cells. J Biol Chem. 2000;275(35):27386–27392. doi: 10.1074/jbc.M003126200.
  • 18. Xu X, et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol. 2012;30(1):105–111. doi: 10.1038/nbt.2050.
  • 19. Zhao G, Guan Y. Polymerization behavior of Klenow fragment and Taq DNA polymerase in short primer extension reactions. Acta Biochim Biophys Sin (Shanghai). 2010;42(10):722–728. doi: 10.1093/abbs/gmq082.
  • 20. Montgomery SB, et al.; 1000 Genomes Project Consortium. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res. 2013;23(5):749–761. doi: 10.1101/gr.148718.112.
  • 21. Roth DB, Porter TN, Wilson JH. Mechanisms of nonhomologous recombination in mammalian cells. Mol Cell Biol. 1985;5(10):2599–2607. doi: 10.1128/mcb.5.10.2599.
  • 22. Lloyd AH, Wang D, Timmis JN. Single molecule PCR reveals similar patterns of non-homologous DSB repair in tobacco and Arabidopsis. PLoS ONE. 2012;7(2):e32255. doi: 10.1371/journal.pone.0032255.
  • 23. Ward JF. DNA damage produced by ionizing radiation in mammalian cells: Identities, mechanisms of formation, and reparability. Prog Nucleic Acid Res Mol Biol. 1988;35:95–125. doi: 10.1016/s0079-6603(08)60611-x.
  • 24. Mali P, et al. CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering. Nat Biotechnol. 2013;31(9):833–838. doi: 10.1038/nbt.2675.
  • 25. Costa RMA, Chiganças V, Galhardo RdaS, Carvalho H, Menck CF. The eukaryotic nucleotide excision repair pathway. Biochimie. 2003;85(11):1083–1099. doi: 10.1016/j.biochi.2003.10.017.
  • 26. Nick McElhinny SA, et al. A gradient of template dependence defines distinct biological roles for family X polymerases in nonhomologous end joining. Mol Cell. 2005;19(3):357–366. doi: 10.1016/j.molcel.2005.06.012.
  • 27. Amoroso A, et al. Oxidative DNA damage bypass in Arabidopsis thaliana requires DNA polymerase λ and proliferating cell nuclear antigen 2. Plant Cell. 2011;23(2):806–822. doi: 10.1105/tpc.110.081455.
  • 28. García-Ortiz MV, Ariza RR, Hoffman PD, Hays JB, Roldán-Arjona T. Arabidopsis thaliana AtPOLK encodes a DinB-like DNA polymerase that extends mispaired primer termini and is highly expressed in a variety of tissues. Plant J. 2004;39(1):84–97. doi: 10.1111/j.1365-313X.2004.02112.x.
  • 29. Garcia-Diaz M, Bebenek K. Multiple functions of DNA polymerases. CRC Crit Rev Plant Sci. 2007;26(2):105–122. doi: 10.1080/07352680701252817.
  • 30. Sebesta M, Burkovics P, Haracska L, Krejci L. Reconstitution of DNA repair synthesis in vitro and the role of polymerase and helicase activities. DNA Repair (Amst). 2011;10(6):567–576. doi: 10.1016/j.dnarep.2011.03.003.
  • 31. Roth N, et al. The requirement for recombination factors differs considerably between different pathways of homologous double-strand break repair in somatic plant cells. Plant J. 2012;72(5):781–790. doi: 10.1111/j.1365-313X.2012.05119.x.
  • 32. St Charles J, et al. High-resolution genome-wide analysis of irradiated (UV and γ-rays) diploid yeast cells reveals a high frequency of genomic loss of heterozygosity (LOH) events. Genetics. 2012;190(4):1267–1284. doi: 10.1534/genetics.111.137927.
  • 33. Rissman AI, et al. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics. 2009;25(16):2071–2073. doi: 10.1093/bioinformatics/btp356.
  • 34. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780. doi: 10.1093/molbev/mst010.
  • 35. Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276–277. doi: 10.1016/s0168-9525(00)02024-2.
  • 36. Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12(10):1611–1618. doi: 10.1101/gr.361602.


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
