Nucleic Acids Research. 2025 Jul 2;53(12):gkaf619. doi: 10.1093/nar/gkaf619

The origin of mirror repeats in the human genome

Ryan McGinty, Alisa Lyskova, Sergei M Mirkin
PMCID: PMC12214015  PMID: 40598895

Abstract

Mirror DNA repeats were found in genomic DNA several decades ago, but their role and the mechanisms leading to their abundance have remained a mystery. The only firmly established functional property was that the subset of long homopurine–homopyrimidine mirror repeats (H-motifs) can form a triple-helical DNA secondary structure (H-DNA). Here, we analyzed the sequence content of mirror repeats in the telomere-to-telomere human genome sequence. Our findings suggest that long mirror repeats in genomic DNA originate exclusively from the expansion of simple tandem repeats (STRs). Strikingly, long H-motifs are highly overrepresented compared to all other mirror repeats and STRs. We hypothesize that long H-motif STRs could be particularly expansion-prone owing to H-DNA-mediated genome instability, pointing to the length at which this structure becomes a significant hindrance.

Graphical Abstract


Introduction

DNA is at once a series of letters that can be dissected linguistically and a dynamic three-dimensional molecule that frequently deviates from the classical double helix. The first linguistic analyses of DNA texts revealed the presence of three types of DNA repeats: direct, inverted, and mirror [1, 2]. Earlier studies noted the symmetrical nature of protein recognition sites and debated the role of DNA secondary structure [3, 4]. Separately, the observation that significant portions of eukaryotic genomes rapidly “fold back” after denaturation was attributed to intrastrand base pairing of longer inverted and direct repeats [5, 6], prompting further interest in repetitive DNA functionality, such as transposon hopping and RNA stability [7–10]. In this light, the presence of direct and inverted repeats in genomic DNA was perhaps not as surprising as that of mirror repeats, which lacked any known consequence.

A short time later, however, several new structures were discovered (reviewed in [11]), including an intramolecular triple helix, H-DNA, which requires a sequence of all purines on one strand (and all pyrimidines on the other strand) arranged in mirror symmetry [12]. This subset of mirror repeats (H-motifs) forms a triplex when a strand from one half of the repeat (whether purine or pyrimidine) folds back to bond with the duplex half of the repeat, while its complement remains single-stranded (Fig. 1A; reviewed in [13]). To this day, formation of H-DNA is the only known functional property of mirror repeats.
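The definitions above can be made concrete with a minimal sketch. It is an illustration only, not the cataloging procedure used in this study: the function names are hypothetical, and spacers and interruptions are ignored. A sequence is treated as a perfect mirror repeat when it reads the same in both directions on one strand, and as an H-motif when it is additionally all purine or all pyrimidine.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_mirror_repeat(seq: str) -> bool:
    """True if the sequence equals its own reverse on the same strand.

    Simplification: no spacer and no interruptions are allowed.
    """
    seq = seq.upper()
    return len(seq) > 1 and seq == seq[::-1]


def is_h_motif(seq: str) -> bool:
    """True if the sequence is a mirror repeat made only of purines
    or only of pyrimidines (homopurine-homopyrimidine)."""
    bases = set(seq.upper())
    return is_mirror_repeat(seq) and (bases <= PURINES or bases <= PYRIMIDINES)
```

For example, AAGGAA passes both checks, whereas ATTA is a mirror repeat but not an H-motif because it mixes a purine with a pyrimidine on one strand.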

Notably, H-DNA forms in vivo [14] and restricts major genetic transactions by stalling DNA replication and transcription; it is also a target for several DNA repair enzymes (reviewed in [11, 15]). H-DNA stability increases exponentially with the length of the repeat motif [16, 17]. Consequently, stable H-DNA structures dramatically elevate expansion of H-motif repeats and cause genome instability (reviewed in [11, 13]). Several human disorders have been linked to this phenomenon, owing to the disruption of gene function by sufficiently long H-motifs (reviewed in [11, 18]). In these known examples, carriers of a relatively long H-motif transmit an even longer allele to their offspring, and this allele continues to expand somatically throughout life, driving disease progression. It is therefore crucial to understand how H-motifs mutate to grow longer.

Other repeat motifs, such as inverted and direct repeats, also have mutagenic and pathogenic potential that may increase with motif length (reviewed in [11, 19]). However, with these motifs, the lengthening process is somewhat better understood. Proposed mechanisms can involve problems arising during DNA replication and/or repair, leading to re-replication of short stretches and/or nonallelic homologous recombination [20–22]. Length instability of simple tandem repeats (STRs), in which a short 1–9-nt motif is tandemly repeated many times, is understood to arise from several molecular mechanisms, including replication slippage (reviewed in [11, 13]). Generation of mirror repeats is more difficult to imagine because DNA replication always proceeds in the 5′ to 3′ direction; the templated reaction would require that the polymerase moves in reverse along the DNA strand, which would only be possible by repeated disengagement and copying of a single position. Nonetheless, mirror repeats of substantial lengths do exist in the human genome, necessitating some mechanism for their generation.

Early bioinformatics analyses, even though based on limited genomic sequences, showed a very strong overrepresentation of various repeat motifs, including long H-motifs [23, 24]. This was unexpected in the context of their negative impact on genome stability and the subsequent discovery of trinucleotide repeat disorders, which a priori could indicate the existence of a positive selection for those repeats. Our recent analysis of the complete telomere-to-telomere human genome sequence verified the accuracy of the early claims regarding the overrepresentation of STRs. We concluded, however, that repeat length instability could be the sole mutagenic force leading to the abundance of STRs as the equilibrium state of the genome, and no positive selection is needed to account for this effect [25]. Notably, the repeat length ranges described here are mostly below the length threshold for the late-onset repeat expansion disorders; thus, they do not cause reproductive problems and are not likely to be under strong negative selection.

Our previous work [26] also highlighted that STRs can have multiple axes of symmetry and can therefore comprise mirror, inverted, and direct repeats (Fig. 1B), as well as Z-DNA and G4 motifs. Indeed, all known examples of disease-causing H-motifs are also STRs [18]. We therefore asked to what extent the expansion of STRs explains the lengthening of various larger motifs. Our findings suggest that the overwhelming majority of long mirror repeats found in genomic DNA are at the same time STRs, and their lengthening results exclusively from STR expansions. In contrast, other motifs seem to originate from both STR-dependent and -independent mechanisms. Strikingly, long H-motifs are found in much greater abundance compared to the rest of mirror repeats or STRs. We noted that the overrepresentation of H-motifs starts roughly from the length at which H-DNA becomes sufficiently stable to hinder various genetic processes [11, 15]. We speculate, therefore, that long H-motif STRs could be particularly expansion-prone owing to H-DNA-mediated genome instability.

Figure 1.


Illustration of H-DNA secondary structure and repeat symmetry. (A) Example of H-DNA secondary structure, which requires a mirror repeat (red and blue regions). Triple-helical DNA is stabilized through non-Watson–Crick base pairing (gray bars) in addition to typical Watson–Crick base pairing (black bars). (B) Various DNA sequences containing axes of symmetry, as indicated. MR, mirror repeat; DR, direct repeat; IR, inverted repeat; STR, simple tandem repeat. Arrows indicate direction of symmetry. Red color indicates reverse complementarity. First three examples are randomly generated sequences illustrating central axes of symmetry only. Subsequent examples illustrate that various STR motifs can contain multiple axes of symmetry. (GA)n and (GAA)n motifs are homopurine mirror repeats known as H-motifs.

Materials and methods

Mirror repeat count expectation

A simplistic expectation for the counts of mirror repeats of various lengths can be calculated by multiplying the length of the genome by the probability of encountering a particular nucleotide on both sides of the motif, raised to the power of the length of the motif. Because a mirror repeat, in principle, can be made up of any combination of nucleotides, the probability of encountering a particular nucleotide should take into account the composition of the genome. If all nucleotides were found in equal quantities, this probability would be 1/4. However, this probability must be adjusted to account for the human genome composition of ∼60% A/T. Assuming random distribution of the four nucleotides throughout the genome, we calculate this skew as the sum of the probabilities of encountering each nucleotide twice (i.e. P_A^2 + P_T^2 + P_G^2 + P_C^2 ≈ 1/3.868). Thus, the final equation for the expected counts at given motif stem length L can be written as

\[ N_{\mathrm{expected}}(L) = G \left( P_A^2 + P_T^2 + P_G^2 + P_C^2 \right)^{L} \approx G \left( \frac{1}{3.868} \right)^{L} \qquad (1) \]

Here G denotes the length of the genome and P_X the genomic frequency of nucleotide X.
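The expectation described above (Equation 1) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the base frequencies below (30% each for A and T, 20% each for G and C) and the genome length are round illustrative values, so the per-position match probability comes out near 0.26 rather than exactly the 1/3.868 quoted in the text.

```python
def expected_mirror_counts(genome_length, base_freqs, stem_length):
    """Expected number of mirror repeats of a given stem length in a
    randomized genome, under the independence assumption in the text."""
    # Probability that two mirrored positions carry the same nucleotide:
    # the sum over nucleotides of (frequency squared).
    p_match = sum(p * p for p in base_freqs.values())
    return genome_length * p_match ** stem_length


# Illustrative (assumed) inputs, not values taken from the study.
ILLUSTRATIVE_FREQS = {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}
ILLUSTRATIVE_GENOME_LENGTH = 3.1e9

if __name__ == "__main__":
    for stem in (10, 15, 20, 25):
        n = expected_mirror_counts(ILLUSTRATIVE_GENOME_LENGTH, ILLUSTRATIVE_FREQS, stem)
        print(stem, n)
```

Because the stem length enters as an exponent, each additional nucleotide of stem reduces the expected count by roughly a factor of four, which is the rapid drop-off that the empirical counts are compared against.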

Repeat databases

We previously generated catalogs of various repeat motifs in the human reference genome GRCh38 [26]. Following the same procedure, we generated new catalogs of mirror, inverted, and direct repeats in the human telomere-to-telomere reference sequence T2T-CHM13 (version 2.0; http://hgdownload.soe.ucsc.edu/downloads.html#human). Motif catalogs were not prefiltered for overlap with STRs or other motifs. For MRs, IRs, and DRs, spacers up to 100 nt were allowed. Within each category, fully overlapping entries (i.e. sharing start and/or end coordinates) were filtered, keeping the entry with the longest stem length. Motifs with stem lengths below 10 nt were not mapped because the number of motifs increases exponentially with decreasing length. Up to two interruptions (i.e. single nucleotides that do not fit the given motif) were allowed in each motif. Note that this requirement filters out many longer repeated units, such as transposons, which may not share complete sequence identity after accumulating separate mutations over time. For the analysis excluding multi-copy repeats (such as minisatellites or tandem Alu repeats, etc.), database entries were excluded if the coordinates of the left stem matched the right stem coordinates in a separate entry, and vice versa.

G4 tracts were originally mapped through a physical detection method [27]; thus, no additional information could be gained by the liftover of coordinates to T2T-CHM13, and so G4s remain mapped to GRCh38. STR catalogs were not updated here; counting of STR motifs in T2T-CHM13 was performed previously [25].

To generate a database of control loci for a given category of repeats, we took the total number of motifs per chromosome and generated a list of the same length consisting of random integers ranging from 1 to the length of that chromosome. We then added the length of each motif (or stem length, where applicable) to the corresponding random integer, producing a list of length-matched start and end coordinates on each chromosome.
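The control-locus procedure can be sketched as follows for a single chromosome. This is a hedged illustration rather than the authors' implementation: the function name and the optional seed are assumptions, and the text does not state how loci running past the chromosome end were handled, so the sketch leaves them unclipped.

```python
import random


def make_control_loci(chrom_length, motif_lengths, seed=None):
    """Return one (start, end) pair per motif on a single chromosome.

    Each start is a random integer from 1 to chrom_length (inclusive);
    each end is that start plus the motif (or stem) length.
    """
    rng = random.Random(seed)  # a seed is optional; used only for reproducibility
    loci = []
    for length in motif_lengths:
        start = rng.randint(1, chrom_length)
        # Assumption: ends past the chromosome boundary are not clipped here.
        loci.append((start, start + length))
    return loci
```

Calling `make_control_loci(chrom_length, lengths)` once per chromosome, with `lengths` taken from the motifs cataloged on that chromosome, yields a control set that is matched in both chromosome and length.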

rDNA regions were defined according to the methods in [28].

STR detection

For a given sequence, STR-like subsequences were uncovered using the Python re.findall function with the following patterns: r'([ATGC]{1}?)\1+', r'([ATGC]{2}?)\1+', …, r'([ATGC]{9,}?)\1+'. Note that the integers 1–8 in braces represent separate searches for tandem repeats of that unit length (i.e. mononucleotides, dinucleotides, etc.), with {9,} representing a search for tandem repeats with units of 9 nt or longer. Next, for each recurring element, the number of recurrences was counted (without explicitly searching for contiguous repetitions) and multiplied by the length of the element, producing the total number of nucleotides occupied by each motif. Of these, we take the maximum and report the total length of the subsequence divided by the total length of the original sequence. Probability densities were calculated using the scipy.stats.gaussian_kde class (https://docs.scipy.org/doc/scipy-1.15.0/reference/generated/scipy.stats.gaussian_kde.html).
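The steps above can be sketched with the standard-library `re` module. This is a reading of the description, not the authors' released code: the function name is hypothetical, occurrences are counted with `str.count` (non-overlapping), and the fallback to the most frequent single base when no tandem repeat is found is an assumption based on the Figure 3 legend.

```python
import re

# One pattern per unit length 1-8, plus a lazy "9 or longer" pattern,
# following the patterns listed in the text.
PATTERNS = [re.compile(r"([ATGC]{%d}?)\1+" % k) for k in range(1, 9)]
PATTERNS.append(re.compile(r"([ATGC]{9,}?)\1+"))


def str_fraction(seq):
    """Fraction of the sequence occupied by its most-represented repeat unit."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # With a single capture group, findall returns the repeated unit of each match.
    units = set()
    for pattern in PATTERNS:
        units.update(pattern.findall(seq))
    if not units:
        # Assumed fallback: with no tandem repeat, score the most frequent base.
        units = set(seq)
    # Count every occurrence of each unit, contiguous or not, in nucleotides.
    best = max(seq.count(unit) * len(unit) for unit in units)
    return best / len(seq)
```

Under these assumptions, a pure (GAA)3 tract scores 1.0, and the interrupted sequence GAATGAAGAA scores 0.9 because all three GAA units are counted even though only two are contiguous.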

Results

We previously cataloged mirror, direct, inverted, G4, and STR motifs in the GRCh38 human reference genome [26], which we update here using the complete genome sequence, T2T-CHM13. (We did not catalog longer tandem repeats, such as minisatellites or transposon arrays.) Recently, we addressed the genome-wide overrepresentation of the number of STRs compared to a randomized genome by utilizing a complex model of repeat length instability [25]. The probability of encountering a mononucleotide repeat of length n in a randomized genome can be computed by the genomic frequency of that nucleotide, raised to the power of n (i.e. the sequence AAAAA is expected to occur with probability = (0.3)^5, given that adenine makes up ∼30% of the human genome). Likewise, a simplistic expectation for the probability of encountering a mirror repeat of stem length n also scales with n as the exponent [see Equation 1 for details], producing a rapid drop-off in counts as length increases (Fig. 2). Empirical counts of mirror repeats in the genome are vastly inflated by comparison (Fig. 2). The shape of the distribution, a feature that is informative of underlying mutational mechanisms [25], resembles that of STRs (Fig. 2). Counts of inverted, direct, and G4 motifs also appear inflated to varying degrees, and each with a unique shape (Fig. 2), suggesting distinct biological origins.

Figure 2.


Distribution of repeat lengths in the human genome. Counts of various repeat motifs in the telomere-to-telomere human genome reference sequence (or GRCh38 for G4 motif tracts). X-axis displays total repeat tract length. STR motifs are grouped by unit length (e.g. mononucleotides, dinucleotides, etc.). Length of perfect motifs is shown (i.e. lacking any interruptions; not applicable for G4 tracts). Tract length excludes spacer length for MR, IR, and DR motifs. For comparison, we calculated the expected number of mirror repeats in a randomized genome (see the “Materials and methods” section).

We next investigated the STR content of each motif in the genome-wide catalog. We wrote an algorithm that detects repeated subsequences (i.e. short motifs of unit length 1–9 nt) within a given sequence and returns the motif explaining the largest portion of that sequence by total length, allowing for noncontiguity of the subsequences. (See the “Materials and methods” section for detailed explanation of algorithm.) For random control loci, repeat content is highly variable within very short sequences, but approaches ∼30% in longer sequences (Fig. 3), reflecting the measurement of A/T nucleotide content (i.e. in the absence of any highly repeated subsequence, the algorithm detects the most frequent nucleotide). Above this floor, the expectation is that the algorithm reports a repeated motif that may be considered an STR.

Figure 3.


STR content within larger repeat motifs varies by length. Within each (A) MR, (B) IR, and (C) DR stem sequence and each (D) G4 motif tract, the most-repeated STR subsequence was determined. All repetitions of this motif are then counted (including interruptions), and the total length of the subsequence is divided by the total stem or tract length of the larger motif. In the absence of any STR, the most-repeated base is counted (i.e. random loci typically contain ∼30% A/T content in the human genome). Left panels display the mean STR portion (y-axis) versus motif length (x-axis) for MR, IR, DR, and G4 motifs, compared to chromosome-matched and length-matched random genomic loci. Transparency shows range covering 95% of values per length bin; note that very small counts result in narrower 95% ranges within long length bins, but greater noise between adjacent bins. Right panel displays the same data as a probability density function, grouped by stem or tract length as indicated.

For mirror, inverted, and direct repeats, as well as G4 motif tracts (i.e. a contiguous region beginning and ending with a G_3N_{1–7} motif, which was identified via physical detection of G4 structure formation), average STR content exceeds that for corresponding length-matched random control loci (Fig. 3). The shortest G4 tracts are highly enriched in STRs (Fig. 3D). For G4 motif tracts above 100 nt, STR content is low, but increases with length of the tract (Fig. 3D), implying that G-content increases relative to the total length of spacer sequences as a function of overall tract length. Both the length of spacers in the G_3N_{1–7} motif and the presence of longer runs of Gs are known to affect G-quadruplex structure stability [29, 30]. For direct repeats, average STR content is ∼60% for stem lengths up to ∼200 nt, decreasing to ∼45% and below for the longest repeats. The probability density function shows a wide and multi-modal distribution of STR content, with the longest repeats more closely resembling random loci, confirming that direct repeats arise from both STR-dependent and -independent mechanisms (Fig. 3C). Inverted repeats display an inflation of STR content for stem lengths of ∼20–80 nt, with a bimodal distribution representing distinct populations of inverted repeats with either high or low STR content (Fig. 3B). Longer inverted repeats appear unimodal (Fig. 3B), suggesting that they originate primarily from an STR-independent mechanism.

Strikingly, for mirror repeats alone, STR content approaches a unimodal genome-wide average of 100% as stem lengths increase (Fig. 3A). This demonstrates that the longest mirror repeats are typically composed of a single repeated STR motif and suggests that STR length instability is the sole biological mechanism accounting for mirror repeat lengthening. For shorter mirror repeats, a single STR accounts for an average of ∼80% or more of the total length. This may reflect the chance placement of an STR surrounding or at the center of a randomly occurring mirror repeat (see Fig. 2 for expected distribution of lengths). Mirror repeats may also consist of multiple STR motifs arranged in mirror symmetry (with our algorithm reporting only the longest one), but it appears that mirror symmetry predominantly grows through expansion of only one STR motif per mirror repeat.

As a precaution, we conducted the same analysis after excluding minisatellites and other multiply repeated elements. The outcome of this analysis (Supplementary Fig. S1) was virtually indistinguishable from that in Fig. 3A–C.

As described above, H-DNA stability has an additional requirement that one strand of the motif contains all purines, while the other strand contains all pyrimidines (referred to as homopurine/homopyrimidine or hR/hY), which is necessary due to the chemistry of the non-Watson–Crick bonds [16]. Comparing counts of hR/hY mirror repeats (H-motifs) to hW mirror repeats and hK/hM mirror repeats (all A or T on one strand and all A or C on one strand, respectively), or mirror repeats not restricted to two nucleotides per strand, long hR/hY mirror repeats are the most abundant (Fig. 4A). Accordingly, hR/hY STRs (di-, tri-, and tetranucleotides, abbreviated as A_{1–3}G_{1–3}) are more prevalent than A_{1–3}T_{1–3}, A_{1–3}C_{1–3}, and the sum of all other di-, tri-, and tetranucleotide STR motifs (Fig. 4B). The distributions line up closely until mirror stem length of ∼25 nt or STR total length of ∼50 nt (Fig. 4A and B), which is roughly consistent with the length at which H-DNA starts hindering DNA replication, making H-motifs particularly unstable [11, 15]. (Note that an STR with a total length of 50 nt would have an axis of mirror symmetry in the center, corresponding to a stem length of ∼25 nt.) Regardless of nucleotide content, STR content approaches 90%–100% as mirror repeats lengthen (Fig. 4C). Thus, while H-motif and non-H-motif mirror repeats appear to lengthen via the STR-related path, H-motifs demonstrate much higher rates of expansion.
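The grouping by nucleotide content can be illustrated with a short sketch based on the >95% or <5% thresholds given in the Figure 4 legend. The function name, the order in which the classes are tested (which decides how a mononucleotide run is labeled), and the handling of values exactly at the threshold are assumptions, not details confirmed by the text.

```python
def classify_stem(stem, threshold=0.95):
    """Assign a mirror stem to hR/hY, hW, hK/hM, or 'other' by base content."""
    stem = stem.upper()
    n = len(stem)
    if n == 0:
        raise ValueError("empty stem")

    def fraction(bases):
        # Share of the stem made up of the given pair of bases.
        return sum(stem.count(b) for b in bases) / n

    # Assumed test order: A/G content first, then A/T, then A/C.
    for label, bases in (("hR/hY", "AG"), ("hW", "AT"), ("hK/hM", "AC")):
        f = fraction(bases)
        # Above the threshold: this strand is almost entirely the pair;
        # below (1 - threshold): it is almost entirely the complementary pair.
        if f > threshold or f < 1 - threshold:
            return label
    return "other"
```

A stem such as GAAGAAGAAG and its complement-type counterpart CTTCTTCTTC both fall in the hR/hY class, since the second has essentially no A/G content on the strand examined.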

Figure 4.


hR/hY nucleotide content of mirror repeats. (A) Counts of mirror repeat motifs according to nucleotide content of the motif. Mirror stem sequences were restricted to >95% or <5% A_nG_n (hR/hY), A_nT_n (hW), A_nC_n (hK/hM), or all other mirror motifs. X-axis displays stem length for MR motifs. (B) STR motifs are grouped according to nucleotide content as shown. Motifs represent the sum of counts of di-, tri-, and tetranucleotides, excluding mononucleotides (e.g. hR/hY STRs consist of uninterrupted AG_{1–3}, A_2G_2, or A_{1–3}G motifs, abbreviated as A_{1–3}G_{1–3}). In combining STRs of different unit length, missing counts (i.e. dinucleotides of length not divisible by 2, etc.) were interpolated with a linear spline method. X-axis displays total length of STR tract. (Note that STR tract length is approximately equivalent to 2× stem length when placing an axis of mirror symmetry within the STR.) Left panels display the mean STR portion along the axis of repeat stem length for (C) hR/hY mirror repeats and (D) mirror repeats with greater nucleotide diversity, compared to chromosome-matched and length-matched random genomic loci. Transparency shows range covering 95% of values per length bin. Right panel displays the same data as a probability density function, grouped by motif length as indicated.

A recent report [28] indicated that H-DNA-forming sequences are highly enriched within ribosomal DNA arrays (rDNA). We therefore counted hR/hY mirror repeats within rDNA genomic regions. We found that rDNA regions contribute few hR/hY mirror repeats to the genome overall, and they do not contain any of the longest hR/hY mirror repeats (Supplementary Fig. S2A and C). Counts per length bin appear roughly proportional to the total length of rDNA regions (Supplementary Fig. S2B and D), suggesting no relative enrichment.

Discussion

Our analysis highlights a fundamental difference between the three linguistic repeat motifs in DNA sequences: inverted repeats and direct repeats (broadly defined) appear to lengthen by multiple mechanisms, some of which are STR-dependent, while others allow repeats to emerge de novo from complex DNA sequences (Fig. 3B and C). In contrast, long mirror repeats present in genomic DNA are, in fact, just a subset of STRs and accumulate exclusively via the same general mechanisms responsible for STR expansions (Fig. 3A). In other words, STR-independent mechanisms capable of creating or extending mirror symmetry per se do not seem to exist. It has previously been reported that mirror repeats are more abundant in heavily repetitive regions of the genome [31]. This is in line with our findings, both as a direct consequence of STR expansion and because mirror repeats are more likely to be found by chance when the nucleotide frequency of a region is heavily skewed (see Equation 1).

At the same time, we demonstrate the clear biological importance of a subset of mirror repeats: H-motifs. The abundance of H-motifs compared to other mirror repeats (and equivalently, the abundance of homopurine/homopyrimidine STRs compared to other STRs) is intriguing. The distributions of repeat lengths for H-motifs and non-H-motifs diverge at a stem length of ∼25 nt (STR total length of ∼50 nt) (Fig. 4A and B). This length is in line with biophysical data demonstrating the sharp length dependence of the B-DNA to H-DNA transition owing to the high nucleation energy for H-DNA formation [17]. It is also in line with the length at which H-DNA starts hindering DNA replication and repair, making H-motifs prone to both expansions and contractions [15]. We therefore propose that increasingly stable H-DNA structures lead to more frequent disruption of DNA replication, subsequently increasing the rate of STR expansions. While this conclusion remains speculative, we favor it over an alternative hypothesis that some biological mechanism acts in the opposite direction by actively depleting all non-H-motif STRs.

It may seem counterintuitive that DNA repeats that are more unstable are more abundant than their stable counterparts. In our separate study [25], we developed a mathematical model addressing this question. In brief, when an unstable repeat becomes longer, it becomes more prone to further expansions and contractions, which ultimately generates a distribution with a tail of progressively longer repeats. In addition to a higher overall rate of instability, H-DNA formation may lead to an increase in the number of repeats added per mutational event. Finally, when a repeat gets sufficiently long, the outcome of mutational events may preferentially shift from contractions to expansions. The known mechanisms responsible for DNA repeat instability include strand slippage and/or hairpin formation during replication or repair [32–34], replication fork stalling that differs by strand orientation [35–39], defects in Okazaki fragment maturation [40], homologous recombination [41, 42], error-prone DNA repair [43–48], difficulty in processing of single-stranded DNA nicks or gaps within H-DNA structures [49, 50], and others [13].

The advent of long-read sequencing and other improved repeat detection tools has led to a recent surge in the uncovering of repeat loci as the genetic cause of human disorders [18]. The first known H-motif in disease, (GAA)n repeats in Friedreich’s ataxia, has been joined by the recent discoveries of (CCCTCT)n repeat expansions in X-linked dystonia parkinsonism [51, 52], (AAGGG)n repeat expansions in CANVAS disease [53, 54], and (GAA)n repeats in FGF14-related ataxia [55, 56]. Expansions of H-motif STRs were also observed in several human cancers [57]. Strikingly, (GAAA)n repeats in the first intron of the UGT2B7 gene were found to expand in a third of all renal cell carcinoma samples [58]. Thus, all known examples in human health exemplify the notion that long H-motifs originate from STRs. Our results suggest that, as a part of efforts to diagnose novel repeat-related disorders, it will not be fruitful to search specifically for the de novo addition of long mirror repeats per se. Instead, efforts to characterize expanded STRs should simultaneously uncover any and all disease-causing H-motifs.

In the broader sense, repeat-related disorders that tend to result in progressively debilitating symptoms have overwhelmingly been linked to specific structure-prone motifs: STRs capable of forming hairpins/cruciforms, H-DNA, and/or G-quadruplex structures (reviewed in [15]). The connection between structure and progression relates to the high rate of somatic expansions that continue throughout life [59]. We believe that H-motifs will thus be among the most frequently identified culprits of repeat-related disorders.


Acknowledgements

We thank Alexey Kondrashov, Shamil Sunyaev, and Daniel Balick for helpful discussions.

Author contributions: Ryan McGinty (Conceptualization, Data curation, Investigation, Methodology, Writing – original draft, Writing – review & editing), Alisa Lyskova (Data curation [supporting], Investigation [supporting], Methodology [supporting]), Sergei M. Mirkin (Conceptualization, Funding acquisition, Investigation, Project administration, Supervision, Writing – original draft, Writing – review & editing).

Contributor Information

Ryan McGinty, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States.

Alisa Lyskova, Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119234, Russia.

Sergei M Mirkin, Department of Biology, Tufts University, Medford, MA 02155, United States.

Supplementary data

Supplementary data is available at NAR online.

Conflict of interest

None declared.

Funding

Work by R.M. in the Sunyaev lab was supported by grants from NIGMS (R35GM127131), NIMH (R01MH101244), and NHGRI (U01HG012009). S.M.M. was supported by grants from NIGMS (R35GM130322) and NSF 2153071. Funding to pay the Open Access publication charges for this article was provided by NIH R35GM130032.

Data availability

The datasets analyzed during the current study were described previously [26]; databases of repeats in T2T-CHM13 are available at DOI: 10.6084/m9.figshare.28551065. The T2T-CHM13 and GRCh38 genomes are freely available from the NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/). The code to perform the analysis in the current study is available at DOI: 10.6084/m9.figshare.28551065.

References

  • 1. Brendel  V, Beckmann  JS, Trifonov  EN  Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J Biomol Struct Dyn. 1986; 4:11–21. 10.1080/07391102.1986.10507643. [DOI] [PubMed] [Google Scholar]
  • 2. Beckmann  JS, Brendel  V, Trifonov  EN  Intervening sequences exhibit distinct vocabulary. J Biomol Struct Dyn. 1986; 4:391–400. 10.1080/07391102.1986.10506357. [DOI] [PubMed] [Google Scholar]
  • 3. Lewin  B  Interaction of regulator proteins with recognition sequences of DNA. Cell. 1974; 2:1–7. 10.1016/0092-8674(74)90002-6. [DOI] [PubMed] [Google Scholar]
  • 4. Jovin  TM  Recognition mechanisms of DNA-specific enzymes. Annu Rev Biochem. 1976; 45:889–920. 10.1146/annurev.bi.45.070176.004325. [DOI] [PubMed] [Google Scholar]
  • 5. Wilson  DA, Thomas  CA  Jr  Palindromes in chromosomes. J Mol Biol. 1974; 84:115–38. 10.1016/0022-2836(74)90216-2. [DOI] [PubMed] [Google Scholar]
  • 6. Schmid  CW, Deininger  PL  Sequence organization of the human genome. Cell. 1975; 6:345–58. 10.1016/0092-8674(75)90184-1. [DOI] [PubMed] [Google Scholar]
  • 7. Gierer  A  Model for DNA and protein interactions and the function of the operator. Nature. 1966; 212:1480–1. 10.1038/2121480a0. [DOI] [PubMed] [Google Scholar]
  • 8. Lilley  DM  Hairpin-loop formation by inverted repeats in supercoiled DNA is a local and transmissible property. Nucleic Acids Res. 1981; 9:1271–89. 10.1093/nar/9.6.1271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Groenen  MA, Timmers  E, van de Putte  P  DNA sequences at the ends of the genome of bacteriophage Mu essential for transposition. Proc Natl Acad Sci USA. 1985; 82:2087–91. 10.1073/pnas.82.7.2087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Gilson  E, Perrin  D, Clement  JM  et al.  Palindromic units from E. coli as binding sites for a chromoid-associated protein. FEBS Lett. 1986; 206:323–8. 10.1016/0014-5793(86)81005-5. [DOI] [PubMed] [Google Scholar]
  • 11. Wang  G, Vasquez  KM  Dynamic alternative DNA structures in biology and disease. Nat Rev Genet. 2023; 24:211–34. 10.1038/s41576-022-00539-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Mirkin  SM, Lyamichev  VI, Drushlyak  KN  et al.  DNA H form requires a homopurine–homopyrimidine mirror repeat. Nature. 1987; 330:495–7. 10.1038/330495a0. [DOI] [PubMed] [Google Scholar]
  • 13. Hisey  JA, Masnovo  C, Mirkin  SM  Triplex H-DNA structure: the long and winding road from the discovery to its role in human disease. NAR Mol Med. 2024; 1:ugae024. 10.1093/narmme/ugae024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Matos-Rodrigues  G, van Wietmarschen  N, Wu  W  et al.  S1-END-seq reveals DNA secondary structures in human cells. Mol Cell. 2022; 82:3538–52. 10.1016/j.molcel.2022.08.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Khristich  AN, Mirkin  SM  On the wrong DNA track: molecular mechanisms of repeat-mediated genome instability. J Biol Chem. 2020; 295:4134–70. 10.1074/jbc.REV119.007678. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Lyamichev  VI, Mirkin  SM, Kumarev  VP  et al.  Energetics of the B–H transition in supercoiled DNA carrying d(CT)x•d(AG)x and d(C)n•d(G)n inserts. Nucleic Acids Res. 1989; 17:9417–23. 10.1093/nar/17.22.9417. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Mirkin  SM, Frank-Kamenetskii  MD  H-DNA and related structures. Annu Rev Biophys Biomol Struct. 1994; 23:541–76. 10.1146/annurev.bb.23.060194.002545. [DOI] [PubMed] [Google Scholar]
  • 18. Depienne  C, Mandel  JL  30 years of repeat expansion disorders: what have we learned and what are the remaining challenges?. Am J Hum Genet. 2021; 108:764–85. 10.1016/j.ajhg.2021.03.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Burssed  B, Zamariolli  M, Bellucco  FT  et al.  Mechanisms of structural chromosomal rearrangement formation. Mol Cytogenet. 2022; 15:23. 10.1186/s13039-022-00600-6.
  • 20. Lee  JA, Carvalho  CM, Lupski  JR  A DNA replication mechanism for generating nonrecurrent rearrangements associated with genomic disorders. Cell. 2007; 131:1235–47. 10.1016/j.cell.2007.11.037.
  • 21. Schimmel  J, Muñoz-Subirana  N, Kool  H  et al.  Small tandem DNA duplications result from CST-guided Pol α-primase action at DNA break termini. Nat Commun. 2021; 12:4843. 10.1038/s41467-021-25154-w.
  • 22. Al-Zain  AM, Nester  MR, Ahmed  I  et al.  Double-strand breaks induce inverted duplication chromosome rearrangements by a DNA polymerase δ-dependent mechanism. Nat Commun. 2023; 14:7020. 10.1038/s41467-023-42640-5.
  • 23. Cox  R, Mirkin  SM  Characteristic enrichment of DNA repeats in different genomes. Proc Natl Acad Sci USA. 1997; 94:5237–42. 10.1073/pnas.94.10.5237.
  • 24. Schroth  GP, Ho  PS  Occurrence of potential cruciform and H-DNA forming sequences in genomic DNA. Nucleic Acids Res. 1995; 23:1977–83. 10.1093/nar/23.11.1977.
  • 25. McGinty  RJ, Balick  DJ, Mirkin  S  et al.  Inherent instability of simple DNA repeats shapes an evolutionarily stable distribution of repeat lengths. bioRxiv. 10 January 2025, preprint: not peer reviewed. 10.1101/2025.01.09.631797.
  • 26. McGinty  RJ, Sunyaev  SR  Revisiting mutagenesis at non-B DNA motifs in the human genome. Nat Struct Mol Biol. 2023; 30:417–24. 10.1038/s41594-023-00936-6.
  • 27. Marsico  G, Chambers  VS, Sahakyan  AB  et al.  Whole genome experimental maps of DNA G-quadruplexes in multiple species. Nucleic Acids Res. 2019; 47:3862–74. 10.1093/nar/gkz179.
  • 28. Chantzi  N, Chan  CSY, Patsakis  M  et al.  Ribosomal DNA arrays are the most H-DNA rich element in the human genome. NAR Genom Bioinform. 2025; 7:lqaf012. 10.1093/nargab/lqaf012.
  • 29. Tippana  R, Xiao  W, Myong  S  G-quadruplex conformation and dynamics are determined by loop length and sequence. Nucleic Acids Res. 2014; 42:8106–14. 10.1093/nar/gku464.
  • 30. Štefan  U, Brázda  V, Plavec  J  et al.  The influence of G-tract and loop length on the topological variability of putative five and six G-quartet DNA structures in the human genome. Int J Biol Macromol. 2024; 280:136008. 10.1016/j.ijbiomac.2024.136008.
  • 31. Chen  ZX, Oliver  B, Zhang  YE  et al.  Expressed structurally stable inverted duplicates in mammalian genomes as functional noncoding elements. Genome Biol Evol. 2017; 9:981–92. 10.1093/gbe/evx054.
  • 32. Wells  RD, Parniewski  P, Pluciennik  A  et al.  Small slipped register genetic instabilities in Escherichia coli in triplet repeat sequences associated with hereditary neurological diseases. J Biol Chem. 1998; 273:19532–41. 10.1074/jbc.273.31.19532.
  • 33. Heidenfelder  BL, Makhov  AM, Topal  MD  Hairpin formation in Friedreich’s ataxia triplet repeat expansion. J Biol Chem. 2003; 278:2425–31. 10.1074/jbc.M210643200.
  • 34. Ruggiero  BL, Topal  MD  Triplet repeat expansion generated by DNA slippage is suppressed by human flap endonuclease 1. J Biol Chem. 2004; 279:23088–97. 10.1074/jbc.M313170200.
  • 35. Krasilnikova  MM, Mirkin  SM  Replication stalling at Friedreich’s ataxia (GAA)n repeats in vivo. Mol Cell Biol. 2004; 24:2286–95. 10.1128/MCB.24.6.2286-2295.2004.
  • 36. Chandok  GS, Patel  MP, Mirkin  SM  et al.  Effects of Friedreich’s ataxia GAA repeats on DNA replication in mammalian cells. Nucleic Acids Res. 2012; 40:3964–74. 10.1093/nar/gks021.
  • 37. Follonier  C, Oehler  J, Herrador  R  et al.  Friedreich’s ataxia-associated GAA repeats induce replication-fork reversal and unusual molecular junctions. Nat Struct Mol Biol. 2013; 20:486–94. 10.1038/nsmb.2520.
  • 38. Gerhardt  J, Bhalla  AD, Butler  JS  et al.  Stalled DNA replication forks at the endogenous GAA repeats drive repeat expansion in Friedreich’s ataxia cells. Cell Rep. 2016; 16:1218–27. 10.1016/j.celrep.2016.06.075.
  • 39. Rastokina  A, Cebrián  J, Mozafari  N  et al.  Large-scale expansions of Friedreich’s ataxia GAA•TTC repeats in an experimental human system: role of DNA replication and prevention by LNA–DNA oligonucleotides and PNA oligomers. Nucleic Acids Res. 2023; 51:8532–49. 10.1093/nar/gkad441.
  • 40. Tsutakawa  SE, Thompson  MJ, Arvai  AS  et al.  Phosphate steering by Flap Endonuclease 1 promotes 5′-flap specificity and incision to prevent genome instability. Nat Commun. 2017; 8:15855. 10.1038/ncomms15855.
  • 41. Napierala  M, Dere  R, Vetcher  A  et al.  Structure-dependent recombination hot spot activity of GAA•TTC sequences from intron 1 of the Friedreich’s ataxia gene. J Biol Chem. 2004; 279:6444–54. 10.1074/jbc.M309596200.
  • 42. Tang  W, Dominska  M, Greenwell  PW  et al.  Friedreich’s ataxia (GAA)n•(TTC)n repeats strongly stimulate mitotic crossovers in Saccharomyces cerevisae. PLoS Genet. 2011; 7:e1001270. 10.1371/journal.pgen.1001270.
  • 43. Kim  HM, Narayanan  V, Mieczkowski  PA  et al.  Chromosome fragility at GAA tracts in yeast depends on repeat orientation and requires mismatch repair. EMBO J. 2008; 27:2896–906. 10.1038/emboj.2008.205.
  • 44. Ezzatizadeh  V, Pinto  RM, Sandi  C  et al.  The mismatch repair system protects against intergenerational GAA repeat instability in a Friedreich ataxia mouse model. Neurobiol Dis. 2012; 46:165–71. 10.1016/j.nbd.2012.01.002.
  • 45. Halabi  A, Ditch  S, Wang  J  et al.  DNA mismatch repair complex MutSβ promotes GAA·TTC repeat expansion in human cells. J Biol Chem. 2012; 287:29958–67. 10.1074/jbc.M112.356758.
  • 46. Du  J, Campau  E, Soragni  E  et al.  Role of mismatch repair enzymes in GAA·TTC triplet-repeat expansion in Friedreich ataxia induced pluripotent stem cells. J Biol Chem. 2012; 287:29861–72. 10.1074/jbc.M112.391961.
  • 47. Bourn  RL, De Biase  I, Pinto  RM  et al.  Pms2 suppresses large expansions of the (GAA·TTC)n sequence in neuronal tissues. PLoS One. 2012; 7:e47085. 10.1371/journal.pone.0047085.
  • 48. Ezzatizadeh  V, Sandi  C, Sandi  M  et al.  MutLα heterodimers modify the molecular phenotype of Friedreich ataxia. PLoS One. 2014; 9:e100523. 10.1371/journal.pone.0100523.
  • 49. Li  L, Scott  WS, Khristich  AN  et al.  Recurrent DNA nicks drive massive expansions of (GAA)n repeats. Proc Natl Acad Sci USA. 2024; 121:e2413298121. 10.1073/pnas.2413298121.
  • 50. Masnovo  C, Paleiov  Z, Dovrat  D  et al.  Stabilization of expandable DNA repeats by the replication factor Mcm10 promotes cell viability. Nat Commun. 2024; 15:10532. 10.1038/s41467-024-54977-6.
  • 51. Bragg  DC, Mangkalaphiban  K, Vaine  CA  et al.  Disease onset in X-linked dystonia-parkinsonism correlates with expansion of a hexameric repeat within an SVA retrotransposon in TAF1. Proc Natl Acad Sci USA. 2017; 114:E11020–8. 10.1073/pnas.1712526114.
  • 52. Westenberger  A, Reyes  CJ, Saranza  G  et al.  A hexanucleotide repeat modifies expressivity of X-linked dystonia parkinsonism. Ann Neurol. 2019; 85:812–22. 10.1002/ana.25488.
  • 53. Cortese  A, Simone  R, Sullivan  R  et al.  Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019; 51:649–58. 10.1038/s41588-019-0372-4.
  • 54. Rafehi  H, Szmulewicz  DJ, Bennett  MF  et al.  Bioinformatics-based identification of expanded repeats: a non-reference intronic pentamer expansion in RFC1 causes CANVAS. Am J Hum Genet. 2019; 105:151–65. 10.1016/j.ajhg.2019.05.016.
  • 55. Rafehi  H, Read  J, Szmulewicz  DJ  et al.  An intronic GAA repeat expansion in FGF14 causes the autosomal-dominant adult-onset ataxia SCA50/ATX-FGF14. Am J Hum Genet. 2023; 110:105–19. 10.1016/j.ajhg.2022.11.015.
  • 56. Pellerin  D, Danzi  MC, Wilke  C  et al.  Deep intronic FGF14 GAA repeat expansion in late-onset cerebellar ataxia. N Engl J Med. 2023; 388:128–41. 10.1056/NEJMoa2207406.
  • 57. Bacolla  A, Tainer  JA, Vasquez  KM  et al.  Translocation and deletion breakpoints in cancer genomes are associated with potential non-B DNA-forming sequences. Nucleic Acids Res. 2016; 44:5673–88. 10.1093/nar/gkw261.
  • 58. Erwin  GS, Gürsoy  G, Al-Abri  R  et al.  Recurrent repeat expansions in human cancer genomes. Nature. 2023; 613:96–102. 10.1038/s41586-022-05515-1.
  • 59. Handsaker  RE, Kashin  S, Reed  NM  et al.  Long somatic DNA-repeat expansion drives neurodegeneration in Huntington’s disease. Cell. 2025; 188:623–39.e19. 10.1016/j.cell.2024.11.038.

Associated Data

Supplementary Materials

gkaf619_Supplemental_File

Data Availability Statement

The datasets analyzed during the current study were described previously [26]; databases of repeats in T2T-CHM13 are available at DOI: 10.6084/m9.figshare.28551065. The T2T-CHM13 and GRCh38 genomes are freely available from the NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/). The code to perform the analysis in the current study is available at DOI: 10.6084/m9.figshare.28551065.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
