Summary
X-linked dystonia parkinsonism (XDP) is a progressive adult-onset neurodegenerative disorder caused by the insertion of a SINE-VNTR-Alu (SVA) retrotransposon in TAF1. One element of the SVA is a tandem polymorphic CCCTCT repeat tract whose length inversely correlates with the age of disease onset. Previous observations that the repeat exhibits length-dependent somatic expansion and that XDP onset is modified by variation in DNA repair gene MSH3 indicated that somatic repeat expansion is an important disease driver. Here, we sought to uncover genetic modifiers of CCCTCT instability in XDP individuals and to provide a mechanistic link between somatic instability and disease. We determined quantitative metrics of both repeat expansion and repeat contraction in blood. Using genetic association analyses of exome sequencing data and directed sequencing of a variant MSH3 repeat, we found that MSH3 modifies repeat expansion and contraction in blood as well as age at onset. MSH3 alleles associated with earlier disease onset were associated with more expansion and less contraction. Conversely, alleles associated with later disease onset were associated with less expansion and more contraction. Notably, MSH3 repeat alleles were also similarly associated with expansion and contraction in brain tissues. Our findings provide key evidence that the role of MSH3 in CCCTCT repeat dynamics underlies its impact on clinical disease and indicate that therapeutic strategies to lower or inhibit MSH3 are predicted to both slow CCCTCT expansion and promote CCCTCT contraction, impacting the disease course prior to clinical onset.
Keywords: XDP, MSH3, somatic instability, repeat, expansion, contraction, genetic modifiers, retrotransposon, TAF1
Graphical abstract

MSH3 modifies both age of onset and the somatic instability of the X-linked dystonia parkinsonism hexanucleotide repeat. This finding points to a critical role of somatic instability in driving disease, identifies MSH3 as a human-validated therapeutic target, and underscores somatic instability as a common mechanism among repeat expansion disorders.
Introduction
X-linked dystonia-parkinsonism (XDP) (DYT3 [MIM: 314250]) is a fatal neurodegenerative disease endemic to the island of Panay, Philippines.1,2 Clinically, XDP is most commonly described as a focal dystonia that presents in mid-adulthood and spreads to multiple body regions. The dystonia is accompanied by, or replaced with, parkinsonism ∼10–15 years after initial onset.3 Studies of postmortem XDP brains have revealed changes in the neostriatum that include the depletion of striatal medium spiny neurons (MSNs).4,5,6 MRI studies have shown changes in the cortex, substantia nigra, and cerebellum.4,7,8,9 Despite active research into XDP, there are no treatments to slow or prevent the disease.
The genetic mutation underlying XDP is the insertion of a short interspersed nuclear element (SINE)-variable number of tandem repeats (VNTR)-Alu (SVA) retrotransposon in the TATA-box binding protein associated factor 1 (TAF1) (MIM: 313650) gene.10,11 The SVA insertion is associated with transcriptional dysregulation and altered splicing of TAF1,10,12,13,14,15,16,17 although the causative roles of the various altered TAF1 RNA species remain unclear. The SVA contains a polymorphic hexameric CCCTCT repeat, varying in length from 33 to 56 repeats,11,18,19 placing XDP in a rapidly growing class of disorders associated with expanded microsatellite repeats, such as Huntington disease (HD [MIM: 143100]).20 The length of the CCCTCT repeat is inversely correlated with XDP age at onset (AAO),11,18,19 as observed in other repeat expansion diseases, indicating a critical role of CCCTCT length in driving XDP pathogenesis. Also in common with other repeat diseases, the CCCTCT repeat is unstable in the germline and somatic tissues.11,18,19 We recently demonstrated that somatic instability of the CCCTCT repeat is expansion biased, tissue specific, and repeat length dependent, suggesting that the repeat-length-dependent property driving XDP onset is its propensity to expand in somatic cells.19
Importantly, some individuals present AAO either earlier or later than predicted by their inherited CCCTCT length, suggesting that other genetic modifiers may also play a role in determining the rate of XDP onset. Indeed, a recent genome-wide association study (GWAS) showed that variants near or within MutS homolog 3 (MSH3 [MIM: 600887]) and PMS1 homolog 2 (PMS2 [MIM: 600259]) modify XDP AAO,21 together accounting for ∼25% of AAO residual variance (AAO adjusted for CCCTCT repeat length). MSH3 and PMS2 encode proteins in the mismatch repair (MMR) pathway and were previously uncovered in a GWAS, among other DNA repair genes, as modifiers of the age of HD motor onset and other clinical phenotypes.22,23,24 Significantly, these genes modify somatic expansion of the HD CAG repeat,23,25,26,27 indicating that they modify HD by altering the rate of somatic repeat expansion in the brain. It seems likely, therefore, that MSH3 and PMS2 modify the somatic instability of the XDP CCCTCT repeat. However, genetic modifiers of somatic instability in XDP remain to be identified. Here, we aim to uncover genetic modifiers of CCCTCT instability in XDP individuals and to provide a mechanistic link between somatic instability and disease. Combining whole-exome sequencing, GWAS data, and targeted genotyping, we provide direct evidence that MSH3 modifies somatic CCCTCT expansion and contraction in XDP individuals and show that the same MSH3 variants alter expansion, contraction, and AAO, directly supporting somatic repeat dynamics as being a major driver of XDP clinical onset.
Subjects, material, and methods
Sample collection
Individuals recruited for this study included individuals with XDP evaluated at Massachusetts General Hospital (MGH) (Boston, MA, USA), Jose R. Reyes Memorial Medical Center (JRRMMC) (Manila, Philippines), and the Sunshine Care Foundation Clinic in Roxas City (Panay, Philippines). All participants provided written informed consent, and the study was approved by local Institutional Review Boards at both MGH and JRRMMC.
Blood samples were collected from male individuals (n = 382, of whom 360 were affected and 22 were non-manifesting carriers—the latter used for modeling instability only). Genomic DNA (gDNA) was extracted from blood using the Flexigene reagent kit (Qiagen) according to the manufacturer’s instructions, and XDP genetic status was confirmed by PCR amplification of a 48 bp-deletion haplotype marker as previously described.11,19 This study included samples used in our previous study.19 XDP clinical diagnosis was determined by trained clinicians, as described in Acuna et al.3
Postmortem brain tissues from male individuals with XDP (n = 60) were obtained from the Collaborative Center for XDP (CCXDP) Brain Bank at MGH (Boston, MA, USA). gDNA was extracted from postmortem brain tissues using the DNeasy Blood & Tissue Kit (Qiagen), according to the manufacturer’s instructions and including the addition of 3 μL of RNase. The postmortem brain tissues used in this study comprised up to 17 regions from up to 60 males: frontal cortex Brodmann area 9 (BA9, n = 58), caudate (n = 33), cerebellum (n = 60), cingulate gyrus (n = 20), deep cerebellar nuclei (n = 21), hippocampus (n = 19), insular cortex (n = 19), inferior olivary nucleus (n = 9), lateral thalamus (n = 20), medial thalamus (n = 34), occipital cortex (n = 60), parietal cortex (n = 20), putamen (n = 19), red nucleus (n = 11), substantia nigra (n = 19), subthalamic nucleus (n = 7) and temporal pole (n = 57). A subset of these brain tissues was used for analyses of somatic expansion in our previous study.19
Determination of XDP repeat length
The length of the XDP CCCTCT repeat tract in blood and brain samples was determined using a fluorescent PCR-based assay.11,19 PCR reactions consisted of 125 ng of gDNA per reaction added in a 50 μL reaction volume with 14 μL of buffer, 2 μL of 2.5 mM dNTPs, 0.5 U of polymerase provided with the PrimeSTAR GLX polymerase kit (TaKaRa), and 10 mM of each primer. The PCR conditions were 94°C × 2 min followed by 30 × (98°C × 10 s, 64°C × 35 s). The primers used were
forward, 5′-[6FAM]-AGCAGTACAGTCCAGCTTTGGC-3′ and
reverse, 5′-CTCAAGCCTTATTACAATGCCAGT-3′.
PCR products were run on the Applied Biosystems 3730xl DNA Analyzer using GeneScan 500 LIZ as an internal standard and output data analyzed using GeneMapper V5 (Applied Biosystems).11,19 The PCR products comprised a series of peaks separated by 6 bp (one CCCTCT repeat). The repeat length of the modal peak, assumed to be the inherited repeat length, was assigned based on DNA standards of known repeat lengths (37, 44, and 50).
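The assignment of repeat length from fragment size can be illustrated with a minimal sketch. It assumes a linear calibration of fragment size against the three DNA standards (37, 44, and 50 repeats); the fragment sizes in bp, the sample trace, and the function names `size_to_repeats` and `modal_repeat_length` are hypothetical and are not taken from the study.

```python
import numpy as np

# Hypothetical calibration data: fragment sizes (bp) measured for the three
# DNA standards of known repeat length. The bp values are invented.
standard_repeats = np.array([37, 44, 50])
standard_sizes_bp = np.array([400.2, 442.1, 478.3])

# Fit size = slope * repeats + intercept; the slope should be close to 6 bp,
# i.e., one CCCTCT unit per peak.
slope, intercept = np.polyfit(standard_repeats, standard_sizes_bp, 1)


def size_to_repeats(size_bp):
    """Convert a fragment size in bp to the nearest whole repeat number."""
    return int(round((size_bp - intercept) / slope))


def modal_repeat_length(peaks):
    """Return the repeat length of the tallest peak.

    peaks: dict mapping fragment size (bp) -> peak height.
    """
    modal_size = max(peaks, key=peaks.get)
    return size_to_repeats(modal_size)


# Hypothetical sample trace with peaks spaced ~6 bp apart.
sample_peaks = {430.1: 310.0, 436.1: 2200.0, 442.2: 640.0, 448.1: 150.0}
print(modal_repeat_length(sample_peaks))
```

In this sketch the modal peak is simply the tallest one, matching the assumption in the text that the modal peak reflects the inherited repeat length.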
Defining expansion and contraction phenotypes in blood and relationships with AAO
Expansion and contraction indices were obtained from fragment sizing analyses, as described previously, using a 5% relative peak height threshold cutoff.28 We first developed regression models to explain somatic expansion or contraction indices as a function of inherited repeat length and age at sample collection and subsequently derived residual expansion and contraction index values from these models to explain the contribution of somatic expansion or contraction to AAO. We found that logarithmic transformation of the somatic expansion or contraction indices improved the model fitting. In both the expansion and contraction modeling, we initially tested the interaction term between repeat length and age at sample collection (see Figure S1 for plots that indicate a weak interaction). However, the coefficients obtained in models with the interaction term did not contribute substantially to the model and did not alter the outcome of downstream exome-wide association study (ExWAS) analyses. We therefore did not include an interaction term in order to simplify the models and avoid centering the data, allowing for an easier interpretation of the residual values. For contractions, we found that inclusion of only the first contraction peak relative to the modal allele (N − 1 peak) provided a better model fit (R² = 0.4101) than models based on inclusion of either all the contraction peaks (R² = 0.3861) or the first plus second contraction peaks (R² = 0.3934). Thus, we used the N − 1 peak inclusion model to minimize outlier effects in the downstream genetic analyses but note that all three contraction models resulted in the identification of a signal at the MSH3 locus in the exome association analysis.
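For orientation, the sketch below shows one common way to compute such indices from a peak profile. The exact definition used here is given in the cited reference, not in this text, so the formula is an assumption: peaks below 5% of the modal peak height are discarded, the remaining heights are normalized to their sum, and each peak is weighted by its distance in repeats from the modal allele. The function name, the `max_contraction_offset` parameter, and the example peak heights are hypothetical.

```python
def instability_indices(peaks, threshold=0.05, max_contraction_offset=None):
    """Return (expansion_index, contraction_index) for one sample.

    peaks: dict mapping repeat length -> peak height.
    threshold: peaks below this fraction of the modal peak height are dropped.
    max_contraction_offset: if 1, only the N - 1 peak contributes to the
        contraction index; None uses all contraction peaks.
    """
    modal = max(peaks, key=peaks.get)
    cutoff = threshold * peaks[modal]
    kept = {r: h for r, h in peaks.items() if h >= cutoff}
    total = sum(kept.values())

    # Normalized peak height times distance from the modal allele, summed
    # separately over peaks longer and shorter than the modal allele.
    expansion = sum(h / total * (r - modal) for r, h in kept.items() if r > modal)
    contraction = sum(
        h / total * (modal - r)
        for r, h in kept.items()
        if r < modal
        and (max_contraction_offset is None or modal - r <= max_contraction_offset)
    )
    return expansion, contraction


# Hypothetical peak heights for a sample with a modal allele of 43 repeats.
example_peaks = {41: 100, 42: 300, 43: 2000, 44: 600, 45: 150, 46: 50}
exp_all, con_all = instability_indices(example_peaks)
_, con_n1 = instability_indices(example_peaks, max_contraction_offset=1)
print(exp_all, con_all, con_n1)
```

The contraction index is returned as a positive magnitude in this sketch so that it can be log-transformed; whether the N − 1 variant is normalized to all retained peaks, as done here, is also an assumption.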
The final models used were as follows.
Model 1: ln[Expansion index] ∼ Repeat length + Age at collection (n = 304 blood samples that included age at collection).
Model 2: ln[Contraction index (N − 1)] ∼ Repeat length + Age at collection (n = 303 blood samples that included age at collection).
Expansion and contraction residuals from these models were used as covariates in the following regression models of AAO.
Model 3: AAO ∼ Repeat length + Expansion residual (n = 283 blood samples with expansion residual and AAO).
Model 4: AAO ∼ Repeat length + Contraction residual (n = 282 blood samples with contraction residual and AAO).
Expansion and contraction residuals were also used as phenotypes in subsequent genetic modifier analyses.
For genetic analyses of AAO, we derived a residual AAO accounting only for inherited repeat length.
Model 5: AAO ∼ Repeat length (n = 377; these included 360 blood samples with AAO and 17 brain samples with AAO).
Note that the modal repeat length in brain does not deviate from that in blood.19
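The two-stage residual approach in models 1, 3, and 5 can be sketched as follows, assuming ordinary least squares with an intercept. The data are synthetic and generated only to make the example runnable; the helper `fit_ols` and all variable names are hypothetical.

```python
import numpy as np


def fit_ols(y, *covariates):
    """Ordinary least squares with an intercept; returns (coefficients, residuals)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack(
        [np.ones(len(y))] + [np.asarray(c, dtype=float) for c in covariates]
    )
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


# Synthetic data for illustration only (not study data).
rng = np.random.default_rng(0)
n = 300
repeat_length = rng.integers(35, 53, size=n).astype(float)
age_at_collection = rng.uniform(30, 65, size=n)
expansion_index = np.exp(
    -4 + 0.05 * repeat_length + 0.01 * age_at_collection + rng.normal(0, 0.2, n)
)
aao = 90 - 1.1 * repeat_length + rng.normal(0, 3, n)

# Model 1: ln[Expansion index] ~ Repeat length + Age at collection
coef1, expansion_residual = fit_ols(
    np.log(expansion_index), repeat_length, age_at_collection
)

# Model 3: AAO ~ Repeat length + Expansion residual
coef3, _ = fit_ols(aao, repeat_length, expansion_residual)

# Model 5: AAO ~ Repeat length; the residual is the AAO phenotype
# used in the genetic analyses.
coef5, aao_residual = fit_ols(aao, repeat_length)
```

Because model 1 includes an intercept, its residuals have zero mean and are uncorrelated with repeat length and age, which is what allows them to serve as an individual-specific measure of instability in the downstream models.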
Measures of somatic instability in brain tissues
Expansion and contraction indices were obtained from fragment sizing analyses, as described previously, using a 5% relative peak height threshold cutoff.28 For genetic modifier analyses, residual expansion and contraction indices were derived in each tissue accounting for repeat length (modal allele) and age at collection, i.e., age at death (Expansion or Contraction index ∼ Repeat length + Age at death). Here, the full range of contraction peaks was used.
Whole-exome sequencing
DNA from 256 blood samples from XDP males was submitted for whole-exome sequencing using the Agilent SureSelect Human All Exon V.6 Kit (S07604514, Agilent, USA). Exome data underwent assessment of target exome coverage using Picard’s CollectHsMetrics function (https://broadinstitute.github.io/picard/). Only exomes with ≥70% of the exome covered at 10× or greater were included for downstream analyses. Whole-exome data were analyzed following a multi-step quality control (QC) pipeline (Figure S2), with the standard Genome Analysis Toolkit (GATK) best practices recommendations.29,30 After variant discovery with the GATK, variant annotation was performed with Variant Effect Predictor (VEP, v.110) using the GRCh37 human genome reference. We subsetted the single-nucleotide variants (SNVs) based on minor allele frequency (MAF) for single-variant and gene-based analyses using Plink v.2.0.0 (Figure S2). Only individuals clustering in the East Asian ancestry were considered. 250 samples remained for downstream analyses following the QC steps.
Single-variant analysis
Plink v.2.0.0 was used for the standard QC of SNVs (Figure S2): per-sample missing call rate (--mind) ≤ 0.1, genotype missing call rate (--geno) ≤ 0.05, Hardy-Weinberg equilibrium (HWE) p value threshold = 1e−06, and MAF ≥ 0.01. We removed individuals that deviated ±3 standard deviations (SDs) from the sample heterozygosity rate mean. 248,146 SNVs passed QC. We used carefully curated phenotypes (n = 245 individuals for repeat expansion, n = 242 individuals for repeat contraction, and n = 235 individuals for AAO) for exome-wide association analyses. We estimated the association of SNVs with an XDP phenotype by fitting an exome-wide efficient mixed model using GEMMA (v.0.983).31 GEMMA accounts for relatedness by generating a relatedness matrix, and we used this and population stratification as covariates. Table S1 shows the genetic relationships in our XDP cohort. Although we only analyzed exonic variants, we applied a conservative genome-wide significance p value of <5.0e−08 for these analyses.
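The heterozygosity filter can be illustrated with a short sketch. It assumes that a per-sample heterozygosity rate has already been computed elsewhere (for example, from Plink `--het` output); the function name and the example rates are hypothetical.

```python
import numpy as np


def heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return indices of samples more than n_sd SDs from the mean rate."""
    het_rates = np.asarray(het_rates, dtype=float)
    mean = het_rates.mean()
    sd = het_rates.std(ddof=1)  # sample standard deviation
    return np.flatnonzero(np.abs(het_rates - mean) > n_sd * sd)


# Hypothetical heterozygosity rates: 20 typical samples and one outlier.
rates = [0.29, 0.30, 0.31, 0.30] * 5 + [0.60]
flagged = heterozygosity_outliers(rates)
print(flagged)
```

Samples returned by this function would be the ones removed before association testing.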
Gene-based association analysis
Plink v.2.0.0 was used for standard QC of rare SNVs (Figure S2): per-sample missing call rate (--mind) ≤ 0.1, genotype missing call rate (--geno) ≤ 0.05, and MAF < 0.01. 208,707 SNVs passed QC. Non-synonymous variants (missense, start loss, stop gain, and stop loss, n = 47,089) and a subgroup of the most deleterious SNVs (combined annotation-dependent depletion [CADD20] score ≥20, indicating the top 1% of predicted damaging variants in the genome, n = 11,130) underwent whole-exome gene-based and candidate gene-based association analyses using burden (CMC-Wald) and SKAT-O tests, using a relatedness matrix and population stratification as covariates. p < 3.2e−07 was considered exome-wide significant, based on the number of genes captured (n = 18,128) and average number of variants in each gene (n = 8.43). In our initial test of 15 candidate genes, we incorporated a Bonferroni correction of the nominal p value obtained, adjusting for 15 genes.
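The arithmetic behind the exome-wide threshold can be checked with a brief sketch. It assumes that the threshold is a Bonferroni-style cutoff of 0.05 divided by the product of the number of genes and the average number of variants per gene; the text does not state the nominal alpha, so that value and the helper `bonferroni_adjust` are assumptions.

```python
# Values reported in the text.
n_genes = 18_128
mean_variants_per_gene = 8.43
alpha = 0.05  # assumed family-wise error rate; not stated explicitly in the text

n_effective_tests = n_genes * mean_variants_per_gene
exome_wide_threshold = alpha / n_effective_tests
print(f"{exome_wide_threshold:.2e}")


def bonferroni_adjust(p_value, n_tests=15):
    """Bonferroni-adjusted p value for the candidate-gene analysis (15 genes)."""
    return min(1.0, p_value * n_tests)
```

Under these assumptions the cutoff comes out at approximately 3.3e−07, in line with the reported value of 3.2e−07.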
Genome-wide association analysis
A GWAS of blood expansion was performed in 163 samples for which genome-wide SNV genotyping data had been previously generated.21 SNV QC was performed using Plink v.1.9.0: genotype missing call rate (--geno) ≤ 0.02, per-sample missing call rate (--mind) ≤ 0.02, deviation from mean heterozygosity ±3 SDs, and MAF ≥ 0.01. 443,059 SNVs passed QC and underwent GWAS analysis using GEMMA v.0.983,31 where population stratification and a relatedness matrix were included in the linear mixed model. SNVs with a p value < 5.0e−08 were considered genome-wide significant.
Genotyping MSH3 tandem repeats
DNA sequencing of the polymorphic coding repeat region in MSH3 exon 1 in gDNA samples from blood (n = 302) or brain (n = 60) was accomplished using the Illumina MiSeq Adapter Metagenomics 16S Targeted Protocol (Illumina). The MiSeq assay consists of two PCRs. PCR 1 uses primers that combine the Illumina adaptor sequences (the first 33 nt of the forward primer and the first 34 nt of the reverse primer) with the forward and reverse MSH3 exon 1 target sequences:
forward, TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGTGAGCCGATTCTTCCAGTC and
reverse, GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGCCCAGTCCCAGACAGAACCT.
PCR 1 reaction conditions, in a total reaction of 50 μL, contained 40 ng of DNA, 1× Qiagen buffer, 1× Q-Solution, 0.2 mM dNTPs, 0.5 U of Qiagen Taq, and 0.25 μM each forward and reverse primer. PCR cycling conditions were as follows: initial denaturation at 93°C for 3 min; 33 cycles of 93°C for 30 s, 60°C for 30 s, and 72°C for 90 s; and a final extension at 72°C for 10 min. PCR 1 products were cleaned up with Agencourt AMPure XP beads with a 1× PCR product-to-beads ratio. PCR 2 was performed to attach the second part of the adapter and the dual barcodes using the Nextera XT Index Kit v.2 Set A and D. PCR 2 reaction contained 5 μL of the cleaned-up PCR 1 product, 1× Qiagen buffer, 1× Q-Solution, 0.2 mM dNTPs, 1.25 U of Qiagen Taq, and 0.25 μM each forward and reverse primer (barcodes). PCR cycling conditions were as follows: initial denaturation at 93°C for 3 min; and eight cycles of 93°C for 30 s, 60°C for 30 s, and 68°C for 90 s. PCR 2 products were cleaned up with Agencourt AMPure XP beads with a 1× PCR product-to-beads ratio. The final libraries were run on an Agilent TapeStation to check for concentration. Pools were made by combining 4 μL of each set of the 96 libraries and run on the Illumina MiSeq using the MiSeq Reagent Kit v.3 (600 cycles) to produce 2 × 300-bp reads. Libraries were demultiplexed by Illumina MiSeq Control software.
The repeat structure was identified by using the 15 bp flanking sequence immediately before and after the repeat sequence. If one or both flanking sequences were not found, the read was not analyzed. When both flanking sequences were found, the sequences of Read1 and the reverse complement of Read2 were read in 9 bp segments, starting at the 5′ end of the left flank, and each repeat was compared to the previous repeat segment to determine the number of that specific repeat in succession. When the right flank was reached, the repeat structure analysis stopped and the repeat structure was defined for Read1 and Read2 separately. Two SNVs—rs141080879 (GenBank: NM_002439.4) (c.129A>G [p.Ala43=]); (g.79950675A>G [GRCh37]) and rs1650697 (GenBank: NM_002439.4) (c.235A>G [p.Ile79Val]) (g.79950781A>G [GRCh37])—captured in the sequencing were recorded in phase with each repeat allele.
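The flank-anchored parsing described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: parsing is assumed to begin immediately after the left flank, and the flank sequences, repeat units, and function name are hypothetical placeholders.

```python
def parse_repeat_structure(read, left_flank, right_flank, unit=9):
    """Run-length encode the tract between two flanks in 9 bp segments.

    Returns a list of (segment, count) tuples, or None if either flank
    is missing from the read.
    """
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None

    tract = read[start:end]
    structure = []
    for i in range(0, len(tract), unit):
        segment = tract[i:i + unit]
        # Compare each segment to the previous one to count successive repeats.
        if structure and structure[-1][0] == segment:
            structure[-1][1] += 1
        else:
            structure.append([segment, 1])
    return [(seg, count) for seg, count in structure]


# Hypothetical 15 bp flanks and 9 bp repeat units (not the MSH3 sequences).
LEFT = "ACGTACGTACGTACG"
RIGHT = "TTGACCATGGTCAAC"
read1 = "GG" + LEFT + "GCAGCGCCC" * 3 + "GCTGCAGCG" * 2 + RIGHT + "AA"
print(parse_repeat_structure(read1, LEFT, RIGHT))
```

For paired reads, the same function would be applied to Read1 and to the reverse complement of Read2 separately, as described in the text.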
To confirm MSH3 repeat alleles, we also performed fragment sizing analysis. PCR was performed using the Qiagen Taq polymerase kit. The PCR reaction contained 40 ng of DNA, 1× Qiagen buffer, 1× Q-Solution, 0.2 mM dNTPs, 0.5 U of Qiagen Taq, and 0.25 μM each forward and reverse primer. PCR cycling conditions were as follows: initial denaturation at 93°C for 3 min; 33 cycles of 93°C for 30 s, 60°C for 30 s, and 72°C for 90 s; and a final extension at 72°C for 10 min. Primers were as follows:
forward, 5′-[6FAM]-TGAGCCGATTCTTCCAGTC-3′ and
reverse, 5′-CCCAGTCCCAGACAGAACCT-3′.
The PCR products were run on the Applied Biosystems 3730xl DNA Analyzer with Genescan 500 LIZ as the internal size standard. The results were analyzed with GeneMapper v5 (Applied Biosystems).
Haplotype analysis
We used MSH3 repeat genotypes identified using MiSeq as above, and SNV genotypes from the exome sequencing to construct MSH3 haplotypes. We used individuals homozygous for the three most common repeat alleles (3a/3a, 6a/6a, and 7a/7a) to identify seven repeat + SNV haplotypes (3a, 6a-1, 6a-2, 6a-3, 6a-4, 7a-1, and 7a-2). Note that the rs1650697 variant, captured in phase with MSH3 repeats, confirmed the genotype obtained in the exome sequencing and was the major variant that distinguished haplotypes. rs141080879, also captured in phase with MSH3 repeat, was not informative in distinguishing haplotypes (all individuals had A allele). We included in our analyses these homozygous individuals as well as heterozygous individuals whose genotypes could be resolved as pairwise combinations of these seven identified haplotypes.
PacBio PCR amplicon sequencing
The CCCTCT repeat was PCR amplified using reaction conditions described above (“determination of XDP repeat length”) from blood DNAs of 37 XDP males and an XDP individual repeat-containing bacterial artificial chromosome (BAC).10 PCR products from at least five independent PCRs were pooled and purified using the MinElute PCR Purification Kit (Qiagen) to obtain at least 100 ng of DNA in the band containing the repeat tract (∼350–550 bp), as estimated using the Agilent DNA Bioanalyzer 2100. Samples were submitted to the University of Maryland Institute for Genome Science for DNA size selection using BluePippin, followed by Single Molecule, Real-Time (SMRT) Pacific Biosciences (PacBio) sequencing. The number of highly accurate long (HiFi) circular consensus sequence (CCS) reads obtained ranged from ∼42,000 to 93,270. Reads from individuals were aligned to a custom BAC reference and displayed as waterfall plots (Figure S3). We developed two methods to count the number of CCCTCT (or reverse complement AGAGGG) repeats and to accurately identify the structure of the repeat tract in each HiFi CCS read. These two analysis methods are summarized in Figure S4. Method 1, based on a pipeline similar to that used for analysis of HTT CAG MiSeq sequence,24 is unbiased to the underlying sequence and identifies all reads that begin with at least ten CCCTCT repeats and end with a 3′ flanking “stop” sequence. Method 2 identifies 5′ and 3′ flanking sequences and three specific repeat structures, (CCCTCT)n, (CCCTCT)nCCT(CCCTCT)2, or (CCCTCT)nCCCT(CCCTCT)2, within the flanks (Figure S4). The scripts for methods 1 and 2 can be found at https://github.com/alanmejiamaza/Counting-Repeats. Both methods yielded similar proportions of these three sequence structures (Table S2).
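A flank-based classification in the spirit of method 2 can be sketched with regular expressions. This is not the published script (available at the repository above); the flank sequences, function names, and the convention that "n" counts only the repeats preceding the interruption are assumptions made for illustration.

```python
import re

UNIT = "CCCTCT"
# The three repeat structures named in the text; group 1 captures (CCCTCT)n.
PATTERNS = {
    "(CCCTCT)n": re.compile(rf"^((?:{UNIT})+)$"),
    "(CCCTCT)nCCT(CCCTCT)2": re.compile(rf"^((?:{UNIT})+)CCT(?:{UNIT}){{2}}$"),
    "(CCCTCT)nCCCT(CCCTCT)2": re.compile(rf"^((?:{UNIT})+)CCCT(?:{UNIT}){{2}}$"),
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def classify_tract(tract):
    """Return (structure_name, n) for a repeat tract, or None if unmatched."""
    for name, pattern in PATTERNS.items():
        match = pattern.match(tract)
        if match:
            return name, len(match.group(1)) // len(UNIT)
    return None


def classify_read(read, flank5, flank3):
    """Locate the tract between the flanks in either orientation and classify it."""
    for seq in (read, reverse_complement(read)):
        start = seq.find(flank5)
        end = seq.rfind(flank3)
        if start != -1 and end != -1 and start + len(flank5) <= end:
            return classify_tract(seq[start + len(flank5):end])
    return None


# Hypothetical flanks; the real flanking sequences are not given in the text.
FLANK5 = "GATTACAGGT"
FLANK3 = "AAGCTTGGCA"
read_fwd = FLANK5 + UNIT * 41 + "CCT" + UNIT * 2 + FLANK3
print(classify_read(read_fwd, FLANK5, FLANK3))
print(classify_read(reverse_complement(read_fwd), FLANK5, FLANK3))
```

The three patterns are mutually exclusive because their tract lengths differ modulo 6, so each read is assigned to at most one structure.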
PacBio whole-genome sequencing
30 μg of blood DNA from two male XDP individuals were submitted for whole-genome PacBio sequencing at the University of Maryland Institute for Genome Science. Sequencing was carried out in a 30-h movie time in eight SMRT cells (four cells each). Overall coverage in these two XDP samples was ∼30× (individual 1) and ∼34× (individual 2). CCS reads were assembled and mapped to the GRCh37 human genome reference. As SVAs are frequently observed in the human genome, only reads spanning intron 32 of TAF1 were selected for further analysis. We counted five (individual 1) and nine (individual 2) CCS reads spanning TAF1 intron 32. These CCS reads were manually evaluated for the number of CCCTCT repeats and sequence structure (Table S3).
RNA-sequencing analysis
We determined the effects of MSH3 repeat allele on MSH3 expression levels using postmortem brain samples from 45 XDP individuals. We profiled four brain regions using RNA sequencing (RNA-seq) and performed targeted analysis of MSH3 gene expression (caudate tissue, cerebellar cortex, BA9 frontal cortex, and occipital cortex). RNA was extracted from brains after bead-based homogenization and TRIzol extraction (Invitrogen). Paired-end 150 bp poly(A)-enriched RNA-seq libraries were generated (Illumina TruSeq) and sequenced on an Illumina platform (up to 50 million reads per sample). After performing QC, reads were aligned to the human genome reference (GRCh38) using STAR v.2.7.10. Counts were normalized using DESeq2’s median-of-ratios method, then technical covariates were regressed using the Surrogate Variable Analysis package prior to generating expression values. Scripts used for analyses can be found at https://github.com/alanmejiamaza/RNA-seq-analyses. The raw sequencing data have been deposited at the Database of Genotypes and Phenotypes (dbGaP, accession number phs001525).
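The median-of-ratios normalization step can be illustrated outside of R with a small NumPy sketch. It reimplements the size-factor calculation for illustration only and is not the DESeq2 code used in the analysis; the function name and the toy count matrix are hypothetical.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Size factors from a genes x samples matrix of raw counts.

    Each sample's factor is the median, over genes with nonzero counts in
    every sample, of the ratio of its count to the gene's geometric mean.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    log_geo_means = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_means, axis=0))


# Toy matrix: sample 2 was sequenced at twice the depth of sample 1.
raw = np.array([[10, 20], [100, 200], [50, 100], [0, 5]])
size_factors = median_of_ratios_size_factors(raw)
normalized = raw / size_factors
print(size_factors)
```

After division by the size factors, genes that differ only by sequencing depth have equal normalized counts across samples, which is the property that makes expression values comparable between brains.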
Results
Expansion, but not contraction, of the CCCTCT repeat in blood contributes to XDP onset
We previously demonstrated repeat length-dependent and tissue-specific expansion of the XDP CCCTCT repeat that was greater in brain tissues than in blood.19 Despite the relatively modest instability in blood DNA, it can be readily measured from fragment sizing data obtained from bulk PCR-based analyses.19 As the number of postmortem XDP brains is limited, here we performed our genetic discovery analyses using blood DNA from affected males to maximize sample size. Using small pool-PCR (SP-PCR) in brain tissues, we also previously showed the occurrence of repeat contractions as well as repeat expansions.19 Notably, SP-PCR sizing of single molecules revealed a single peak with marginal or absent PCR slippage products (Figure S5), confirming that the distribution of both contraction and expansion peaks observed in bulk PCR represents somatic contraction and expansion events and is not an artifact of the PCR. We therefore used bulk PCR analyses of blood DNA to derive quantitative metrics of both expansion and contraction (see subjects, material, and methods).19,28
We first set out to test whether CCCTCT expansion and/or contraction, measured in blood DNA, contribute to XDP AAO in affected males. Expansion and contraction indices were dependent on inherited repeat length (assumed to be the modal repeat detected in blood) and age at collection (Figures 1A, 1B, and 1D, regression models 1 and 2; Figure S1). This larger dataset confirms our previous observation of repeat-length-dependent expansion in blood,19 extends this relationship to repeat contraction, and shows that repeat expansion and contraction accumulate with age in blood. Importantly, the age dependence of both expansions and contractions reinforces data from single-molecule SP-PCR (Figure S5) that both expansion and contraction peaks in PCR products from bulk DNA represent somatic mutation events and are not simply PCR artifacts. From regression models 1 and 2 (Figure 1D), we derived expansion and contraction residuals (subjects, material, and methods), representing scores of individual-specific measures of blood repeat expansion or repeat contraction independent of inherited repeat length and age at blood collection. We then tested the contribution of expansion residual or contraction residual to AAO in males from whom we had both blood instability and AAO data (Figure 1D, regression models 3 and 4). We found that both the inherited repeat length and expansion residual significantly contributed to an earlier AAO (Figures 1C and 1D, regression model 3), together explaining 46% of the variation of XDP onset. In contrast, contraction residual in blood did not contribute significantly to AAO (Figure 1D, regression model 4). Thus, these data support the contribution of repeat expansion, which can be captured in blood DNA, to XDP AAO, and suggest the existence of shared genetic modifiers of AAO and repeat expansion in XDP. 
The data do not rule out a contribution of repeat contraction to AAO but suggest that any potential contribution is less well captured by a readout in blood.
Figure 1.
Models of CCCTCT somatic instability and age at onset in XDP blood samples
(A–C) Correlations of somatic expansion (A), somatic contraction (B), and age at onset (AAO) (C) with inherited CCCTCT repeat length.
(D) Linear regression models of somatic instability (models 1 and 2) and AAO (models 3 and 4). CCCTCT instability and CCCTCT repeat length were measured by ABI fragment sizing. n, number of individuals. Each dot represents an XDP individual.
Analyses of sequence variation in the CCCTCT repeat
Non-canonical repeat-containing sequences such as those containing repeat interruptions have been shown to modify disease phenotypes, e.g., in HD and in myotonic dystrophy type I (DM1 [MIM: 160900]),23,26,32 and a recent study reported the presence of mosaic divergent repeat interruptions of the XDP repeat tract.33 We therefore first examined the presence of variant repeat structures in our XDP cohort by performing PacBio long-read sequencing of repeat-containing PCR amplicons from the blood of 37 males. These included 17 individuals and 14 individuals who fell into the 10th percentile and 90th percentile “extremes” of AAO and/or repeat expansion, respectively. We also performed PacBio amplicon sequencing of a previously sequenced XDP individual BAC,10 as well as PCR-free whole-genome PacBio sequencing of blood DNA from two XDP males. Waterfall plots showing the repeats and allele structure are shown in Figure S3. Using two independent alignment-free methods (subjects, material, and methods; Figure S4), we extracted sequences with either (CCCTCT)n, (CCCTCT)nCCT(CCCTCT)2, or (CCCTCT)nCCCT(CCCTCT)2 structures, previously described.33 We found that the vast majority (mean ∼98% by each method) of these reads from the amplicon sequencing of all 37 blood samples contained the sequence structure (CCCTCT)nCCT(CCCTCT)2 (Table S2). The same structure was found in ∼97% of the reads from the BAC (Table S2 and Figure S3), confirming the previously determined deep-sequencing assembly of BAC clones of the region spanning the SVA.10 The modal repeat length (“n” within the (CCCTCT)nCCT(CCCTCT)2 sequence) was highly correlated with that obtained by fragment sizing (Figure S6). We also detected reads at low frequencies, both in the individual samples and in the BAC, with either (CCCTCT)n or (CCCTCT)nCCCT(CCCTCT)2 structures (Table S2). 
PCR-free whole-genome sequencing, while of low depth, confirmed the predominance of (CCCTCT)nCCT(CCCTCT)2-containing reads and concordance of modal repeat length with that obtained via amplicon sequencing (Table S3). Therefore, in the samples examined thus far that included phenotypic extremes, we have not identified any variation between individuals in the repeat-containing structure, nor do we find evidence for within-individual variant repeat mosaicism at a level reported previously.33 The latter discrepancy may be explained by different sequencing technologies used (PacBio vs. nanopore) and/or the type of sequencing analyses performed including base-calling stringency and alignment stringency. For example, the methods we have used are alignment free, while Trinh et al.33 used alignment-based Noise Canceling Repeat Finder. It is also possible, although less likely, that our XDP cohort has different allele structures compared to those reported by Trinh et al.33 Further sequencing of XDP individuals using a variety of methods would be needed to assess the presence of rarer inter-individual variation in repeat structures and to further assess intra-individual variant repeat mosaicism.
Shared and differential impacts of MSH3 alleles on AAO and repeat instability in blood
Given the above observations suggesting that in our cohort, variation between individuals in the CCCTCT repeat structure may be relatively rare, we subsequently focused on the role of trans-acting modifiers. A previous GWAS of 353 XDP males showed significant associations of AAO with SNVs at or close to MSH3 and PMS2.21 Complementing this study, and with a goal of capturing additional genetic variation that might be present in the unique XDP population of Filipino origin, we performed whole-exome sequencing in a cohort of 256 males (Figure S2). In the 250 samples passing QC (Figure S2), we performed genetic association analyses of three phenotypes: (1) AAO residual, derived from regression model 5 (subjects, material, and methods), representing the AAO unaccounted for by inherited repeat length; (2) blood expansion residual, derived from regression model 1 (Figure 1D), representing the level of repeat expansion in blood unaccounted for by inherited repeat length and age at sampling; and (3) blood contraction residual, derived from regression model 2 (Figure 1D), representing the level of repeat contraction in blood unaccounted for by inherited repeat length and age at sampling. We first performed ExWASs with variants of ≥1% MAF (present in our dataset). In 235 individuals with exome and AAO data, we identified a signal (signal 1) tagged by SNV rs1650697 (GenBank: NM_002439.4) (c.235A>G [p.Ile79Val]) (g.79950781A>G [GRCh37]) (effect size β = −0.33 ± 0.54, p = 5.68e−09), with genome-wide significance, and which also included rs245016 (GenBank: NC_000005.9) (g.80021193C>T [GRCh37]) (Tables 1 and S4; Figures 2, S7, and S8). rs1650697 is located within exon 1 of MSH3 and the 5′ UTR of dihydrofolate reductase (DHFR [MIM: 126060]) gene, with the minor (A) allele associated with earlier onset (Table 1). After conditioning on rs1650697, no other variant at the MSH3/DHFR locus remained close to genome-wide significance (Figure S9 and Table S5). 
Thus, despite the relatively small cohort size, we were able to identify a modifier signal at MSH3/DHFR, recapitulating findings in the GWAS.21 We did not detect a significant signal in PMS2, likely because the intronic modifier SNV detected in the GWAS21 is not in linkage disequilibrium (LD) with any SNVs captured in the exome sequencing. Interestingly, rs8087221 (GenBank: NC_000018.9) (g.60252217T>C [GRCh37]) on chr18 emerged as genome-wide significant after the conditional analysis (Table S5). However, given the low minor allele frequency of this SNV (1.1%), this effect has a high likelihood of being spurious and would need further confirmation.
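The residual phenotypes and single-variant tests described above can be sketched as two regression steps: first regress the trait on inherited repeat length and keep the residuals, then regress the residuals on allele dosage. This is a minimal illustration, not the study's pipeline. It assumes a simple linear model with a single predictor (regression model 5 and the ExWAS models may include further covariates), a normal approximation for the p value, and entirely hypothetical toy data; the function names `simple_ols`, `residualize`, and `wald_p` are invented for this sketch.

```python
import math
from statistics import mean

def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, standard error of b)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, se_b

def residualize(x, y):
    """Part of y not explained by a linear fit on x."""
    a, b, _ = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def wald_p(beta, se):
    """Two-sided p value from a normal approximation to beta / se."""
    return math.erfc(abs(beta / se) / math.sqrt(2))

# Hypothetical toy data (not from the study).
repeat_len = [35, 38, 40, 42, 44, 46, 48, 50]       # inherited repeat length
dosage = [0, 1, 2, 0, 1, 2, 0, 1]                   # copies of the effect allele
noise = [0.5, -0.5, 0.2, -0.2, 0.1, -0.1, 0.3, -0.3]
aao = [70 - 0.6 * L - 3.0 * d + e
       for L, d, e in zip(repeat_len, dosage, noise)]

# Step 1: AAO residual = AAO unaccounted for by repeat length.
aao_residual = residualize(repeat_len, aao)
# Step 2: additive test of the residual against allele dosage.
_, beta, se = simple_ols(dosage, aao_residual)
p_value = wald_p(beta, se)
```

A negative `beta` in this sketch corresponds to an effect allele associated with earlier onset than predicted from repeat length alone.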
Table 1.
Top chr5 modifier signals in exome-wide association tests
| rs ID | chr:location (GRCh37) | chr5 signal | Effect allele | AF | Nominal test: beta ± SE | Nominal test: p value | Conditional test: beta ± SE | Conditional test: p value |
|---|---|---|---|---|---|---|---|---|
| AAO | | | | | | | | |
| rs1650697 | 5:79950781 | 1 | A | 0.372 | −0.33 ± 0.54 | 5.68e−09 | N/A | N/A |
| Expansion | | | | | | | | |
| rs1650697 | 5:79950781 | 1 | A | 0.359 | 0.22 ± 0.03 | 1.31e−12 | N/A | N/A |
| rs1643639 | 5:79952390 | 2 | C | 0.112 | −0.32 ± 0.04 | 3.27e−11 | −0.24 ± 0.04 | 4.05e−08 |
| Contraction | | | | | | | | |
| rs1643639 | 5:79952390 | 2 | C | 0.114 | 1.06 ± 0.02 | 3.48e−07 | N/A | N/A |
Top SNVs in ExWAS for modifiers of AAO, expansion, and contraction (nominal test). In the conditional test, the association test was repeated, adjusting for the genotype of top SNV rs1650697 (AAO and expansion) or rs1643639 (contraction), and chr5 SNVs reaching genome-wide significance after conditional tests are shown. chr5 signals 1 and 2 are distinguished by conditional analyses (Figure S9). AF, effect allele frequency; beta, beta estimate (effect size); SE, standard error for beta estimate; N/A, not applicable.
Figure 2.
Exome-wide association analyses for age at onset and CCCTCT somatic instability in blood
ExWAS for residual AAO, residual somatic CCCTCT expansion, and residual somatic CCCTCT contraction in blood, performed on SNVs with minor allele frequency ≥0.01, showing the region of the MSH3/DHFR locus in which significant associations were detected. Genomic coordinates are based on GRCh37. Signal 1 and signal 2 SNVs are indicated in pink and blue, respectively, and are identified as separate signals based on conditional analyses (Figure S9; Tables S5, S7, and S10). The arrows indicate the rsIDs of the top signal 1 and signal 2 SNVs. Downward-pointing triangles show SNVs associated with a lower residual value (earlier AAO, less expansion, less contraction), and upward-pointing triangles show SNVs associated with a higher residual value (later AAO, more expansion, more contraction). The horizontal dotted lines indicate a genome-wide significance threshold of p = 5e−08.
The contribution of blood somatic expansion to AAO (Figure 1D, regression model 3) predicted that modifiers of AAO would also modify expansion in blood. We therefore performed ExWAS using the expansion residual (derived from regression model 1, Figure 1D) as the phenotype in 245 individuals with exome and expansion data. We identified a genome-wide significant signal (signal 1) with rs1650697 (β = 0.22 ± 0.03, p = 1.31e−12) as the top SNV (Figures 2, S7, and S8; Tables 1 and S6), encompassing rs245016 as in the AAO ExWAS. The minor (A) allele of rs1650697 was associated with increased somatic expansion, mirroring the direction of effect on onset and consistent with higher repeat expansion rates hastening XDP onset. Conditioning on rs1650697 revealed a distinct second genome-wide significant signal (signal 2), with the top SNV, rs1643639 (GenBank: NC_000005.9) (g.79952390T>C [GRCh37]) (β = −0.32 ± 0.04, p = 3.27e−11), within intron 2 of MSH3 and intron 1 of DHFR (Tables 1 and S7; Figures S8 and S9). In contrast to signal 1, signal 2 minor alleles were associated with decreased expansion in blood (Figures 2, S8, and S9; Table 1). Consistent with two distinct modifier signals, the top SNVs from signal 1 and signal 2 were not in LD in our XDP cohort (r2 = 0.07), nor in individuals with East Asian (EAS) background in the 1000 Genomes Project (1KG). Conditioning on both rs1650697 and rs1643639 showed no other genome-wide significant signal. We confirmed these two independent signals in a GWAS of residual expansion performed on a subset of 163 XDP males that overlapped with the previous GWAS21 and our exome dataset and for which genome-wide SNV genotyping data were available (Table S8).
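To make the logic of the conditional analysis and the LD check concrete, the sketch below computes a genotype-based r2 between two SNVs and the effect of a second SNV after adjusting for the lead SNV. It is an illustration under stated assumptions: r2 is taken as the squared Pearson correlation of unphased allele dosages (an approximation to haplotype-based LD), the adjustment uses residualization on the lead SNV (equivalent to including it as a covariate in a linear model), and the data and effect sizes are hypothetical. The function names are invented here.

```python
from statistics import mean

def pearson_r2(a, b):
    """Squared Pearson correlation between two dosage vectors."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab * sab / (saa * sbb)

def residualize(y, x):
    """Residuals of y after least-squares regression on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]

def conditional_beta(pheno, snp, lead):
    """Effect of snp on pheno after adjusting both for the lead SNV."""
    ry = residualize(pheno, lead)
    rx = residualize(snp, lead)
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)

# Hypothetical dosages for a lead SNV and a second SNV in ten individuals.
lead = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
second = [0, 0, 0, 1, 1, 0, 2, 0, 1, 0]
# Noise-free toy phenotype with opposite effects of the two variants.
pheno = [0.2 * a - 0.3 * b for a, b in zip(lead, second)]

ld_r2 = pearson_r2(lead, second)
beta_second_given_lead = conditional_beta(pheno, second, lead)
```

In this toy setting the conditional estimate recovers the second variant's effect even though the lead variant acts in the opposite direction, which is the pattern that distinguishes signal 2 from signal 1.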
To gain further insight into somatic instability in XDP blood, we performed an ExWAS on the repeat contraction residual phenotype (derived from regression model 2, Figure 1D) in 242 individuals with exome and contraction data. rs1643639 (top SNV for signal 2 in the repeat expansion ExWAS) gave the strongest association signal and was associated with increased contraction, although the p value of 3.48e−07 (β = 1.06 ± 0.02) did not reach genome-wide significance (Figures 2, S8, and S9; Tables 1 and S9). Conditioning on rs1643639 revealed no other genome-wide significant signal (Figure S9 and Table S10). Signal 1 did not emerge as a significant modifier of repeat contraction (p = 2.37e−05), consistent with this signal being the major modifier of AAO (Figure 2) and the lack of contribution of blood contraction to AAO (Figure 1D, regression model 4). Linear regression modeling showed that the lead SNVs from signals 1 and 2 accounted for 29% of the residual somatic expansion and 16% and 0.08% of the residual AAO and somatic contraction, respectively (Table S11). Note that although the MSH3 variants did not meet the genome-wide significance threshold in the contraction ExWAS, these regression analyses showed that rs1650697 significantly reduced contraction (β = −0.06, p = 0.016), while rs1643639 significantly increased contraction (β = 0.106, p = 0.00065) (Table S11). When contraction was included as a covariate in the repeat expansion model, rs1650697 and rs1643639 remained significantly associated with expansion (p = 3.62e−09 and p = 4.77e−08, respectively), although the p values were reduced by a factor of ∼10 (Table S11). When expansion was included as a covariate in the repeat contraction model, p values for rs1650697 and rs1643639 were also reduced, with rs1643639 still significantly associated with contraction (p = 0.0069), while the effect of rs1650697 did not meet statistical significance but still trended in the same direction (p = 0.089) (Table S11). 
These observations indicate that the effects of variants rs1650697 and rs1643639 on contraction are in part independent of their effects on expansion and vice versa. Consistent with the directions of effect of signal 1 and signal 2 variants on the XDP AAO and instability phenotypes, GTEx cis-expression quantitative trait loci (eQTL) data34 show that rs1650697 effect allele “A” is associated with higher expression of MSH3 in blood, cortex, and basal ganglia while, in contrast, rs1643639 effect allele “C” is associated with lower expression of MSH3 (Figure S10).
In summary, these data provide direct support for MSH3 modifying both repeat expansion and contraction and indicate that MSH3 modifier alleles differ in their relative impacts on AAO and blood instability phenotypes.
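The proportion of a residual phenotype accounted for by the two lead SNVs corresponds to the R² of a linear model containing both dosages. A minimal sketch follows, assuming an ordinary least-squares model with an intercept and no further covariates (the models behind Table S11 may differ) and using hypothetical data; `r_squared_two_snps` is a name invented for this example.

```python
from statistics import mean

def center(v):
    m = mean(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def r_squared_two_snps(y, snp1, snp2):
    """R^2 of y ~ 1 + snp1 + snp2, via sequential orthogonalization."""
    yc, x1, x2 = center(y), center(snp1), center(snp2)
    x1x1 = dot(x1, x1)
    # Remove from snp2 and from y the part explained by snp1.
    k2 = dot(x2, x1) / x1x1
    ky = dot(yc, x1) / x1x1
    x2_perp = [b - k2 * a for a, b in zip(x1, x2)]
    y_perp = [b - ky * a for a, b in zip(x1, yc)]
    # Remove from y the part explained by the orthogonalized snp2.
    beta2 = dot(y_perp, x2_perp) / dot(x2_perp, x2_perp)
    resid = [a - beta2 * b for a, b in zip(y_perp, x2_perp)]
    return 1.0 - dot(resid, resid) / dot(yc, yc)

# Hypothetical dosages and residual expansion values.
snp1 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
snp2 = [0, 0, 0, 1, 1, 0, 2, 0, 1, 0]
noise = [0.3, -0.3, 0.1, -0.1, 0.2, -0.2, 0.05, -0.05, 0.15, -0.15]
exact = [0.2 * a - 0.3 * b for a, b in zip(snp1, snp2)]
noisy = [v + e for v, e in zip(exact, noise)]

variance_explained = r_squared_two_snps(noisy, snp1, snp2)
```

Adding a further covariate (for example, contraction in the expansion model) would follow the same pattern with one more orthogonalization step.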
Rare MSH3 coding variants are associated with less somatic expansion
We next assessed the association of rare protein-coding variants (MAF < 1% in East Asians and our XDP cohort) with the XDP phenotypes. As our sample size is small, we initially took a candidate gene approach, examining genes (MSH2 [MIM: 609309], MSH6 [MIM: 600678], MSH3 [MIM: 600887], MLH1 [MIM: 120436], PMS1 [MIM: 600258], PMS2, MLH3 [MIM: 604395], FAN1 [MIM: 613534], LIG1 [MIM: 126391], POLD1 [MIM: 174761], ATAD5 [MIM: 609534], RRM2B [MIM: 604712], TCERG1 [MIM: 605409], MED15 [MIM: 607372], and CCDC82 [MIM: 619870]) associated with HD clinical phenotypes and/or somatic HTT (MIM: 613004) CAG expansion in blood or mouse models.23,25 We identified 49 non-synonymous (missense, start loss, stop gain, and stop loss)35,36 coding variants across these genes (Table S12) and performed gene-based tests of association (sequence kernel association test [SKAT] optimized [SKAT-O] and the CMC-Wald burden test) with the AAO, expansion, and contraction phenotypes. Of the 15 genes tested, only MSH3 showed a significant association in both tests between non-synonymous variants and repeat expansion after multiple test correction (Bonferroni correction for 15 genes) (SKAT-O adjusted p = 2.3e−02, CMC-Wald adjusted p = 8.0e−03, β = −0.517; Table S13). Although there were some nominally significant associations, none of these genes showed significant associations with contraction or AAO after Bonferroni correction (Tables S14 and S15). The same gene-based tests using a subset of the most deleterious SNVs (CADD score ≥20) did not increase the significance of these associations. 
All five unrelated individuals carrying rare MSH3 variants were heterozygous for one of five variants in the coding region of MSH3: rs201149584 (GenBank: NM_002439.5) (c.173C>T [p.Ala58Val]) (g.79950719C>T [GRCh37]); rs771054581 (GenBank: NM_002439.5) (c.1720C>T [p.Arg574Trp]) (g.80040391C>T [GRCh37]); rs373251342 (GenBank: NM_002439.4) (c.2288G>T [p.Cys763Phe]) (g.80071547G>T [GRCh37]); rs753525389 (GenBank: NM_002439.5) (c.2731T>G [p.Leu911Val]) (g.80109478T>G [GRCh37]); and rs139205893 (GenBank: NM_002439.5) (c.3000T>G [p.Asp1000Glu]) (g.80150135T>G [GRCh37]) (Figure 3A and Table S12). The p.Ala58Val variant was located in the intrinsically disordered N-terminal domain (NTD), and the remaining four were scattered across different protein domains (Figures 3A and 3B). Only four individuals had a clinical AAO (one was a non-manifesting carrier), and no obvious clustering of these SNVs was observed with AAO (Figure 3C). Interestingly, all variants except p.Leu911Val were present in individuals at or below the 10th percentile of expansion residuals (lowest 10% extremes of expansion), suggesting that these SNVs are associated with the dysfunction of MSH3 (Figure 3C). Two of these variants (p.Ala58Val and p.Arg574Trp) were found in two individuals at or above the 90th percentile of contraction residuals (highest 10% extremes of contraction) (Figure 3C). We also performed exome-wide SKAT-O and CMC-Wald burden tests (Table S16). These did not reveal any genes meeting exome-wide significance (p < 3.2e−07), although we note that MSH3 is among the most significant genes for expansion (Table S16). Overall, while no genes meet criteria of exome-wide significance in these rare SNV analyses, prior evidence that MSH3 is a modifier lends support to the burden of rare non-synonymous MSH3 SNVs contributing to lower somatic expansion. Exome or whole-genome sequencing in larger cohorts will be needed to confirm these observations.
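The idea behind a collapsing burden test with Bonferroni correction can be illustrated in a few lines. The sketch below is a simplification and an assumption, not the CMC-Wald or SKAT-O implementation used in the study: it collapses each individual's rare variants in a gene into a carrier indicator, estimates the carrier effect as a difference in means with a pooled-variance standard error, and multiplies a nominal p value by the number of genes tested. All data and function names are hypothetical.

```python
import math
from statistics import mean

def bonferroni(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)

def collapse_carriers(genotypes):
    """1 if an individual carries any rare allele in the gene, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def burden_effect(pheno, carrier):
    """Carrier minus non-carrier mean phenotype and its standard error."""
    y1 = [y for y, c in zip(pheno, carrier) if c == 1]
    y0 = [y for y, c in zip(pheno, carrier) if c == 0]
    m1, m0 = mean(y1), mean(y0)
    rss = sum((y - m1) ** 2 for y in y1) + sum((y - m0) ** 2 for y in y0)
    sigma2 = rss / (len(pheno) - 2)
    se = math.sqrt(sigma2 * (1 / len(y1) + 1 / len(y0)))
    return m1 - m0, se

# Hypothetical data: rows are individuals, columns are rare variants in one gene.
genotypes = [[0, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
             [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
expansion_residual = [0.1, -0.6, 0.2, -0.1, 0.0, -0.5, 0.3, 0.1]

carrier = collapse_carriers(genotypes)
beta, se = burden_effect(expansion_residual, carrier)
adjusted = bonferroni(0.001, 15)  # hypothetical nominal p, 15 candidate genes
```

A negative `beta` here corresponds to carriers of rare variants showing less expansion than non-carriers, the direction reported for MSH3.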
Figure 3.
MSH3 rare coding variants and relationships with AAO, blood expansion, and blood contraction
(A) Location of the five rare (MAF < 0.01) MSH3 variants detected: p.Ala58Val (rs201149584; g.79950719C>T); p.Arg574Trp (rs771054581; g.80040391C>T); p.Cys763Phe (rs373251342; g.80071547G>T); p.Leu911Val (rs753525389; g.80109478T>G); p.Asp1000Glu (rs139205893; g.80150135T>G); and the common variant p.Ile79Val, specified by signal 1 top SNV rs1650697 (g.79950781A>G), mapped onto MSH3 motifs.37 Locations are based on GRCh37.
(B) Three-dimensional structure of MSH3 (light blue) complexed with MSH2 (dark blue) (= MutSβ dimer), modeled with PyMOL software, showing the spatial locations of the rare variants. p.Ala58Val, located in the unstructured N-terminal domain, is not shown.
(C) MSH3 rare variants plotted to show their CADD score (x axis) and AAO residual, expansion residual, and contraction residual on the y axes. Gray circles in the background represent all data from those individuals in which non-synonymous coding changes were identified in 15 candidate genes (Table S12). NTD, N-terminal domain; MMBD, mismatch-binding domain; CTD, C-terminal domain. Upper and lower horizontal dotted lines show the 90th and 10th percentile values, respectively, for each phenotype. CADD scores were obtained using GRCh37-v.1.6.
MSH3 tandem repeat alleles modify both somatic instability and AAO
Exon 1 of MSH3 encodes a variable alanine- and proline-encoding 9 bp tandem repeat. The length of this repeat is associated with repeat expansion and clinical disease measures in HD and DM1 and with AAO in XDP.21,38 To evaluate the disease-modifying effect of MSH3 tandem repeat alleles in our XDP cohort, we performed single-molecule sequencing (MiSeq) of the polymorphic repeats in 302 blood samples (Figure 4). The most common repeat alleles found were the 6a allele (56%), followed by the 7a (30%) and 3a (12.6%) alleles (Figures 4A and 4B). In addition, we detected previously unreported MSH3 tandem repeats (3c, 5c, 6d, and 8c) at low frequencies (Figures 4A and 4B). Note that the 6d allele is a variant of the 6a allele that contains the p.Ala58Val substitution identified above. We then tested the association between the most common MSH3 repeat genotypes (3a/3a, 3a/6a, 3a/7a, 6a/6a, 6a/7a, and 7a/7a) and XDP phenotypes (Figure 4C and Table S17). We observed that 3a-containing genotypes were significantly associated with later onset, confirming previous observations.21 We further found that 3a-containing genotypes were associated with significantly lower expansion and greater contraction (the latter not meeting statistical significance). Conversely, 6a/7a and 7a/7a genotypes were significantly associated with earlier onset, greater expansion, and less contraction. Effect sizes and p values in these regression models are shown in Table S17. Investigating the phenotypes as a function of the number of 3a, 6a, or 7a alleles (Figure S11 and Table S18), we found that 3a alleles significantly delayed AAO and reduced expansion, with a moderate impact on promoting contractions, while 7a alleles significantly accelerated AAO, increased expansion, and suppressed contraction. The presence of two 3a alleles or two 7a alleles had a stronger modifying effect than the presence of a single allele. 
The number of 6a alleles had a mild impact on delaying AAO and suppressing expansion. Effect sizes and p values in these regression models are shown in Table S18.
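Analyzing phenotypes as a function of the number of 3a, 6a, or 7a alleles amounts to recoding each diplotype as allele counts. A minimal sketch is shown below; the genotype strings follow the nomenclature used in the text, whereas the residual values are hypothetical and the helper names `allele_counts` and `mean_by_dosage` are invented for this illustration.

```python
from statistics import mean

def allele_counts(genotype, alleles=("3a", "6a", "7a")):
    """Count copies of each repeat allele in a genotype such as '3a/7a'."""
    parts = genotype.split("/")
    return {a: parts.count(a) for a in alleles}

def mean_by_dosage(genotypes, values, allele):
    """Mean phenotype value for 0, 1, or 2 copies of the given allele."""
    groups = {}
    for g, v in zip(genotypes, values):
        groups.setdefault(allele_counts(g)[allele], []).append(v)
    return {k: mean(vs) for k, vs in sorted(groups.items())}

# One hypothetical individual per common genotype (values are invented).
genotypes = ["3a/3a", "3a/6a", "3a/7a", "6a/6a", "6a/7a", "7a/7a"]
expansion_residual = [-0.6, -0.3, 0.0, -0.1, 0.2, 0.5]

by_7a = mean_by_dosage(genotypes, expansion_residual, "7a")
by_3a = mean_by_dosage(genotypes, expansion_residual, "3a")
```

The allele counts can then serve as the dosage predictor in a linear regression, which is how a stronger effect of two copies than of one copy would appear as a dose-dependent trend.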
Figure 4.
The association of MSH3 tandem repeat variants with age at onset, blood expansion, and blood contraction
(A) Schematic of MSH3 tandem repeats detected using MiSeq analyses in 302 XDP blood samples. The repeats are upstream of the DHFR coding region; DHFR shares a promoter with MSH3 and is transcribed in the opposite direction. Each repeat variant is color coded, and the corresponding amino acids are represented. Nomenclature is according to Flower et al.38 6d is a variant of 6a resulting from a p.Ala58Val substitution.
(B) Frequency of MSH3 tandem repeats detected.
(C) Residuals of AAO, CCCTCT expansion, and CCCTCT contraction for individuals harboring 3a/3a, 3a/6a, 3a/7a, 6a/6a, 6a/7a, and 7a/7a MSH3 repeat genotypes. See Table S17 for statistical analyses. Tukey box-whisker plots are shown, with each dot representing an XDP individual.
We identified seven repeat haplotypes by combining SNPs from exome analysis and MSH3 repeat alleles (subjects, material, and methods) and tested their effect on XDP phenotypes (Table S19). The only 3a haplotype identified (3a-1) included the top signal 2 SNV rs1643639 effect allele (C) and did not allow the effects (later AAO, less expansion, and more contraction) of the 3a repeat to be distinguished from those of the rs1643639 modifier variant. Similarly, both 7a-1 and 7a-2 haplotypes included the top signal 1 SNV rs1650697 effect allele (A), tracking with earlier AAO, more expansion, and less contraction, precluding the ability to distinguish effects due to the repeat or the SNV. Interestingly, the two most frequent 6a haplotypes, 6a-1 and 6a-2, contained the rs1650697 effect allele A and major allele G, respectively. While 6a-1 was associated with significantly earlier AAO, 6a-2 was associated with significantly later AAO (Table S19). Expansion and contraction phenotypes did not distinguish the effects of 6a-1 and 6a-2. These data therefore indicate, at least for AAO, an effect of the rs1650697 A variant that is independent of the MSH3 variant repeat.
MSH3 modifies somatic instability in brain tissues
Taken together, the results presented above support a role for MSH3 in determining the timing of onset of XDP by modifying the instability of the XDP CCCTCT repeat and predict that MSH3 modifies CCCTCT repeat instability in the brain. To test this, we derived quantitative measures of expansion and contraction in different brain tissues. We previously showed, by quantifying expansion indices from fragment sizing data of CCCTCT-containing PCR amplicons, that the extent of CCCTCT expansion in brain is region dependent and repeat-length dependent.19 With the knowledge from SP-PCR that contractions occur in the brain,19 we first expanded on these analyses by also quantifying contraction indices from bulk PCR fragment sizing data in a wide range of brain tissues. Interestingly, except for cerebellum, which exhibited less CCCTCT contraction than any other brain tissue analyzed, all other brain regions had relatively similar contraction values (Figure S12). This contrasts with expansions that are more variable between brain tissues and greatest in cortical regions (Figure S12).19 Further, cerebellum exhibited less contraction than blood but more expansion than blood (Figure S12).19 Contractions also exhibited repeat-length dependence (Figure S13). Regression modeling in a subset of brain tissues (caudate, cerebellum, temporal pole, occipital cortex, BA9 cortex, and medial thalamus) for which we had the largest expansion and contraction datasets (31–57 individuals, depending on the region) revealed significant positive associations of repeat length and/or age at death (used as age at sampling) with expansion and/or contraction in many brain regions (Table S20). Power was limited in brain regions with fewer samples, and generally, repeat length and age at death appeared to be better predictors of contraction than expansion. This could potentially reflect, in part, the loss of cells with highly expanded repeats.
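As an illustration of how expansion and contraction indices can be derived from bulk fragment sizing data, the sketch below uses one formulation that is common in the repeat instability literature: each peak's height is normalized to the summed height of all peaks and weighted by its distance, in repeat units, from the modal peak, summing separately over peaks longer and shorter than the mode. This is an assumption for illustration only. The exact definition used in this study (including any peak-height threshold) is given in the methods and may differ, and the peak values here are hypothetical.

```python
def instability_indices(peaks):
    """Return (modal repeat, expansion index, contraction index).

    peaks maps repeat number -> peak height from a fragment sizing trace.
    """
    modal = max(peaks, key=peaks.get)
    total = sum(peaks.values())
    expansion = sum(h / total * (n - modal)
                    for n, h in peaks.items() if n > modal)
    contraction = sum(h / total * (modal - n)
                      for n, h in peaks.items() if n < modal)
    return modal, expansion, contraction

# Hypothetical trace: modal allele of 41 repeats with flanking peaks.
peaks = {40: 100, 41: 400, 42: 300, 43: 200}
modal, expansion_index, contraction_index = instability_indices(peaks)
```

Indices of this kind can then be regressed on repeat length and age at death to obtain the residuals analyzed in the brain tissues.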
We then derived residuals of expansion or contraction indices based on these regression models (Table S20). Given the significant associations of MSH3 polymorphic tandem repeat with blood expansion, contraction, and AAO, we then evaluated the association of residual expansion or contraction values in brain tissues with MSH3 tandem repeat genotypes (3a/3a, 3a/6a, 3a/7a, 6a/6a, 6a/7a, and 7a/7a) determined using MiSeq (Figure 5 and Table S21). Expansion and contraction residuals from all brain tissues increased and decreased, respectively, with an increasing number of MSH3 repeats (smallest -3a to largest -7a), paralleling results in blood. Regression analyses (excluding the single 3a/3a individual) showed that 6a/7a and 7a/7a genotypes significantly increased expansions in all brain regions tested relative to 3a/6a (Table S21). 6a/7a and 7a/7a genotypes significantly decreased contractions in BA9 cortex relative to 3a/6a (Table S21). Assessing the impact of individual repeat alleles (Table S22), the 3a repeat had the strongest and statistically significant impact on decreasing expansion in the caudate; 3a alleles accounted for 44.6% of the residual expansion (Table S22). The 7a allele had the strongest and statistically significant impact on increasing expansion in the cerebellum; 7a alleles accounted for 24.1% of the residual expansion (Table S22). In a subset of XDP individuals and brain regions (caudate, cerebellum, occipital cortex, and BA9 cortex), we also analyzed MSH3 expression levels obtained from RNA-seq data. We observed that individuals with 3a/6a and 6a/6a genotypes tended to have the lowest levels of MSH3 mRNA in these brain regions, while those with 6a/7a or 7a/7a genotypes had higher MSH3 mRNA levels (Figure 6 and Table S23). The effect of the MSH3 repeat alleles on MSH3 expression was particularly pronounced in caudate, where one or two 7a alleles significantly increased expression 1.1-fold or 1.3-fold, respectively (Table S24). 
These data are consistent with the association of the 3a allele with reduced MSH3 expression in blood38 and demonstrate correlations between MSH3 expression, more expansion, and less contraction in XDP brains. Overall, we provide evidence that MSH3 modifies the onset of XDP by altering the instability of the CCCTCT repeat in the brain.
Figure 5.
MSH3 tandem repeat variants in XDP postmortem brain tissues
Residuals of CCCTCT expansion (above the dotted lines) and CCCTCT contraction (below the dotted lines) in caudate, cerebellum, medial thalamus, temporal pole, cortex BA9, and occipital cortex for individuals harboring 3a/3a, 3a/6a, 3a/7a, 6a/6a, 6a/7a, and 7a/7a MSH3 repeat genotypes. See Table S21 for statistical analyses. Tukey box-whisker plots are shown, with each dot representing an XDP individual.
Figure 6.
MSH3 tandem repeat variants alter MSH3 expression levels in the brain
MSH3 mRNA expression levels in caudate, cerebellum, cortex BA9, and occipital cortex for individuals harboring 3a/6a, 6a/6a, 6a/7a, and 7a/7a MSH3 repeat genotypes. See Table S23 for statistical analyses. Tukey box-whisker plots are shown, with each dot representing an XDP individual.
Discussion
Previous observations that (1) the length of the XDP SVA CCCTCT repeat tract is inversely correlated with AAO, (2) the CCCTCT repeat exhibits length-dependent somatic expansion in the brain, and (3) variants in MMR genes modify AAO support a model in which somatic CCCTCT repeat expansion drives the rate of XDP onset. Here, we demonstrate that MSH3 variants modify somatic CCCTCT repeat instability both in blood and in the brain. We provide empirical evidence for an association of MSH3 with repeat instability in the brain, and our findings provide key evidence that the role(s) of MSH3 in CCCTCT repeat dynamics underlie its impact on clinical disease.
We previously showed that while somatic instability of the XDP CCCTCT repeat was expansion biased, both expansions and contractions could be detected. With this knowledge, and the observation from analysis of single molecules that this hexanucleotide repeat exhibits little to no (typically contraction-biased) confounding PCR slippage artifact, we quantified both repeat expansions and contractions for integration with genetic data. In blood and brain tissues, both expansions and contractions exhibited dependence on repeat length and on age at sampling (blood) or age at death (brain), indicating somatic events that increase over time as XDP individuals age. In the brain, contractions were distinguished by an apparent lack of the brain-region-specific differences that are observed for expansions, with the exception of the cerebellum, which exhibits low levels of both expansion and contraction relative to other brain regions. These findings suggest that the factors (as yet unknown) contributing to cell-type-specific expansion propensities and, hence, the levels of expansion that are reflected in tissues do not influence contractions in the same way. It is also possible that there are in fact subtle differences in contractions between brain regions that cannot be readily resolved from the bulk PCR amplicon data, as suggested by our previous single-molecule analyses.19
The results of our MSH3 common SNV and repeat variant association analyses are summarized in Figure 7. In the ExWAS analysis, we detected two distinguishable major modification signals, captured to different degrees by AAO and repeat expansion or contraction in blood. Signal 1, tagged by rs1650697, was associated with an earlier AAO and increased expansion in blood. Signal 2, tagged by rs1643639, was not associated with AAO but was associated with decreased expansion and increased contraction in blood. Although the impact on contraction did not meet the criterion of genome-wide significance (potentially conservative in the context of only the exome variants analyzed in this study), the p value of 3.48e−07 was highly suggestive of a real effect. The different relative impacts of these two variants on AAO, blood expansion, and blood contraction are unlikely to be due to differences in the power to detect modifier effects given the similar sample sizes for each analysis (AAO n = 235, expansion n = 245, contraction n = 242). Rather, these data suggest that there may be cell-type-dependent modifier effects of these MSH3 alleles. As examples (refer to Figure 2), (1) relevant to AAO, signal 1 might be predicted to exert a relatively strong impact on repeat expansion compared to signal 2 in the cell type(s) in the brain that are relevant to disease onset; and (2) relevant to blood contraction, signal 2 may have a relatively strong impact compared to signal 1 in the hematopoietic cells driving repeat contraction observed in blood. 
As both signal-1- and signal-2-tagging SNVs are eQTLs influencing MSH3 expression levels, these differences may reflect cell-type-specific eQTL effects, as suggested based on similar observations in HD GWASs.23 To ascertain the relationship between our modifier signals and those in previous GWASs, we determined the LD between the top signal 1 and signal 2 SNVs with those MSH3 SNVs identified in GWASs as modifiers of XDP AAO21 and those that modified either blood expansion or clinical phenotypes in HD23 (Table S25). This indicates that our signal 1 partially captures the Laabs et al. top MSH3 SNV rs245013 (GenBank: NC_000005.9) (g.80047076A>C [GRCh37]) (LD between signal 1 top SNV rs1650697 and rs245013, r2 = 0.399 in Europeans and 0.659 in East Asians). For HD, our signal 1 captures the same signal as the “5AM1” disease-hastening and “5ABEM1” blood expansion-promotion effects with LD between signal 1 top SNV rs1650697 and 5AM1 top SNV rs245100 (GenBank: NC_000005.9) (g.79933093A>G [GRCh37]) or 5ABEM1 top SNV rs245105 (GenBank: NC_000005.9) (g.79928518T>C [GRCh37]) r2 = 1 in Europeans and 0.956 in East Asians. Our signal 2 partially captures the Laabs et al. rs33003 (GenBank: NC_000005.9) (g.80171134A>G [GRCh37]) SNV that modified XDP AAO independently of rs245013 (LD between signal 2 top SNV rs1643639 and rs33003, r2 = 0.522 in Europeans and 0.144 in East Asians). For HD, our signal 2 partially captures the same signal as the disease-delaying “5AM3” with LD between signal 2 top SNV rs1643639 and 5AM3 top SNV rs6151716 (GenBank: NC_000005.9) (g.80016305_80016307del [GRCh37]) r2 = 0.544 in Europeans, unavailable in East Asians, and “5ABEM3” blood expansion-suppressing effects with LD between signal 2 top SNV rs1643639 and 5ABEM3 top SNV rs1650689 (GenBank: NC_000005.9) (g.79955999T>G [GRCh37]) r2 = 0.295 in Europeans, 0.737 in East Asians. These analyses may be limited by lack of specific haplotype information in the Filipino population. 
Nevertheless, overall, the data indicate that MSH3 variants tagging effects on clinical phenotypes and repeat instability have shared effects across these two diseases. In candidate SNV-based analyses, MSH3 SNVs were associated with clinical phenotype and/or somatic expansion in other repeat expansion diseases. SNV rs10168 (GenBank: NC_000005.9) (g.79950403C>T [GRCh37], LD with our signal 2 top SNV rs1643639, r2 = 1.0 in Europeans and 1.0 in East Asians) and rs1677658 (GenBank: NC_000005.9) (g.79950859G>T [GRCh37], LD with our signal 2 top SNV rs1643639, r2 = 0.54 in Europeans and unavailable in East Asians) were associated with less somatic expansion of the DM1 CTG repeat and a later AAO.38,39 SNV rs701383 (GenBank: NC_000005.9) (g.79913275G>A [GRCh37], LD with our signal 1 top SNV rs1650697, r2 = 0.522 in Europeans and East Asians) was associated with somatic expansion of the fragile X messenger ribonucleoprotein 1 (FMR1 [MIM: 309550]) gene-associated CGG repeat.40 The rs701383 major (G) allele was associated with less somatic CGG repeat expansion, consistent with the direction of effect in our study, in which the rs1650697 minor (A) allele is associated with more somatic expansion of the XDP repeat.
Figure 7.
Summary of MSH3 associations
Upward arrows indicate associations with greater residual values (more expansion, more contraction, later onset), and downward arrows indicate associations with lower residual values (less expansion, less contraction, earlier onset). Red is consistent with a detrimental effect, and green is consistent with a beneficial effect. SNV summaries are based on meeting genome-wide significance in ExWAS data with the exception of blood contraction signal 2, where the p value for association was 3.48e−07 (Tables S4–S7, S9, and S10). MSH3 repeat allele summaries are based on nominal significance in regression analyses (Tables S17, S18, S21, and S22). Gray boxes represent cells with no data available.
Analyses of the variant alanine- and proline-coding repeat in exon 1 of MSH3 revealed an association of the 7a repeat with an earlier AAO and with increased repeat expansion and decreased repeat contraction in both blood and brain. In contrast, the 3a repeat was associated with a later AAO, decreased repeat expansion, and increased repeat contraction in both blood and brain. The effects of these alleles on expansion and contraction are consistent with their impact on clinical onset, i.e., more expansion and less contraction are associated with hastened onset, while less expansion and more contraction are associated with delayed onset. They are also consistent with the association of the variants with MSH3 expression in brain. The effects of 3a, 6a, and 7a alleles in our study are concordant with effects previously observed on XDP AAO21 and on HD AAO and blood somatic expansion.38 Interestingly, in contrast to an association of the 7a allele with more somatic expansion and earlier disease onsets in XDP and HD, the 7a allele was mildly associated with less somatic expansion of the DM1 CTG repeat and a later AAO.38 The 3a repeat was not found to modify AAO or somatic expansion in Machado-Joseph disease (MJD [MIM: 109150]) and Friedreich ataxia (FRDA [MIM: 229300]),41 although this likely in part reflects small sample sizes.42 Here and in previous studies,21,38 the effects of the 7a and 3a alleles could not be distinguished from effects of surrounding SNVs that are in strong LD with the repeat alleles. However, in our study, and as also seen in Laabs et al.,21 two 6a-containing haplotypes (6a-1 and 6a-2), distinguished by signal 1 SNV rs1650697, had opposite effects on onset, consistent with the onset-hastening effect of the A allele. Distinguishing the potentially functional effects of the variant MSH3 repeats and other variant(s) both in XDP and in other diseases will require a more detailed understanding of haplotype structures in larger sample sizes across the different disease populations.
As our sample size was relatively small, we were underpowered to detect any exome-wide significant effect of rare variants in gene-burden-based analyses. However, MSH3 did emerge as one of the more significant modifiers of repeat expansion in these analyses, with five non-synonymous variants found in individuals exhibiting approximately the lowest 10% of expansion values in blood. Further understanding of the potential role of these variants will require their phasing to the common MSH3 SNV and repeat modifier variants and studies of their impacts on MSH3 function or stability. We note that all five variants are predicted to lower MSH3 stability (http://mupro.proteomics.ics.uci.edu/).
Taken together, our data provide strong support for somatic expansion being a major driver of XDP onset. What is the role of somatic repeat contraction in determining the onset of XDP? Although blood contraction did not predict AAO (Figure 1), consistent with the lack of genome-wide significant SNVs contributing to both AAO and contraction in blood (Figure 2), the effect of the AAO-associated MSH3 variant repeats on contraction both in blood and in brain (Figures 4 and 5) supports an involvement of contraction, i.e., more contraction delays onset and vice versa. Based on computational modeling of repeat instability in DM1 and HD, it has been suggested that expansions, which predominate in somatic cells and increase over time, are the net result of frequent expansion and contraction events, with expansion rates slightly exceeding contraction rates.43,44 Thus, the two events may be intimately coupled, and the balance between them may be influenced by different cis-acting (e.g., repeat sequence and flanking sequence) and trans-acting factors. In HD and in other repeat expansion diseases, expansions are thought to be driven by the binding of MutSβ (MSH2-MSH3) complex to repeat loop-outs.38,45,46 The processing of such a loop-out to incorporate an expansion vs. a contraction is likely to depend in part on downstream events that involve other MMR components, e.g., MLH3 and PMS2, which have different endonucleolytic cleavage propensities that could favor expansions over contractions.46,47 Though not investigated here, further studies will be needed to understand the role of XDP onset modifier PMS2 in CCCTCT repeat instability and to dissect the various other factors that contribute to repeat expansion and contraction of the CCCTCT repeat. 
Our data suggest that CCCTCT loop-outs are also targets for MSH3 binding and that higher levels of MSH3 might tip the resolution of such structures in favor of expansions over contractions, while lower MSH3 levels might favor contraction events. Interestingly, in a DM1 mouse model48 and, to a lesser extent, in an HD mouse model,49 knockout of Msh3 both suppressed expansions and promoted contractions in the germline. Knockout of MSH3 also appeared to promote contractions in a human RPE1-based model of HTT CAG instability.50 It is also possible that expansions and contractions might predominate in different cell types. Ultimately, therefore, it will be necessary to resolve differential genetic modifier effects on expansions and contractions in individual cell types in the brain and in blood. Insight into the functional impacts of rare coding MSH3 variants may also provide future opportunities to distinguish mechanisms of expansion and contraction.
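The idea that net somatic expansion reflects a small excess of expansion over contraction events can be illustrated with a toy random walk. The sketch below is not the modeling used in the cited DM1 and HD studies, nor an analysis performed here; the function names, the one-unit step size, and all probabilities are hypothetical values chosen only to show how a shift in the expansion/contraction balance changes the direction of net length change.

```python
import random


def simulate_repeat(start_len, p_expand, p_contract, n_events, seed=0, min_len=1):
    """Toy random walk on repeat length (in repeat units).

    Each event expands the tract by one unit with probability p_expand,
    contracts it by one unit with probability p_contract, and otherwise
    leaves it unchanged. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    length = start_len
    for _ in range(n_events):
        u = rng.random()
        if u < p_expand:
            length += 1
        elif u < p_expand + p_contract and length > min_len:
            length -= 1
    return length


def expected_net_change(p_expand, p_contract, n_events):
    """Expected net change in repeat units, ignoring the lower length bound."""
    return (p_expand - p_contract) * n_events


if __name__ == "__main__":
    # Hypothetical "expansion-biased" setting: expansions slightly exceed contractions.
    biased = [simulate_repeat(45, 0.26, 0.24, 400, seed=s) for s in range(300)]
    # Hypothetical "contraction-biased" setting, e.g. as might follow lowered MSH3 activity.
    shifted = [simulate_repeat(45, 0.24, 0.26, 400, seed=s) for s in range(300)]
    print("mean length, expansion-biased:", sum(biased) / len(biased))
    print("mean length, contraction-biased:", sum(shifted) / len(shifted))
```

In this sketch, many individual events occur in both directions, yet the mean length drifts only by the difference between the two probabilities multiplied by the number of events, which is why a modest change in either rate can reverse the net outcome.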
Building on a framework in HD24,25,51 and as predicted to be applicable across many repeat expansion disorders,52 we propose that in XDP the CCCTCT repeat must reach a threshold length(s) in target cell type(s) in order to elicit neuronal demise and ensuing clinical disease. The CCCTCT repeat can undergo somatic expansion and somatic contraction, and the rates of both events may contribute to the timing of disease onset. Our data demonstrate a critical role for onset modifier MSH3 in both driving expansions and protecting against contractions. As described above, observations across multiple repeat expansion diseases strongly indicate that MSH3 is a fundamental player in repeat instability, and strategies to target MSH3 are therefore likely to have wide applicability across repeat expansion diseases. The details of the repeat instability mechanisms, e.g., functional impacts on MSH3 that might be revealed from population-specific variation and the relative roles of other factors, may differ between diseases and remain to be dissected. Importantly, given the genetic validation in individuals and strong impact on disease onset, MSH3 is a compelling therapeutic target in XDP, with strategies to lower or inhibit MSH3 predicted to both slow CCCTCT expansion and promote CCCTCT contraction, impacting the disease course prior to clinical onset.
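The threshold framework proposed above can be made concrete with a simple deterministic calculation. This is a minimal sketch under strong assumptions that are not derived from the data in this study: a single threshold length, constant expansion and contraction rates (in repeat units per year), and no length dependence. The function name and every numerical value are hypothetical.

```python
import math


def years_to_threshold(inherited_len, threshold_len, expansion_rate, contraction_rate):
    """Time for a repeat to grow from its inherited length to a threshold length.

    Rates are in repeat units per year and assumed constant (a simplification).
    Returns math.inf when contraction balances or exceeds expansion.
    """
    if inherited_len >= threshold_len:
        return 0.0
    net_rate = expansion_rate - contraction_rate
    if net_rate <= 0:
        return math.inf
    return (threshold_len - inherited_len) / net_rate


if __name__ == "__main__":
    # Hypothetical baseline: net gain of 1 unit per year.
    print(years_to_threshold(45, 100, expansion_rate=1.5, contraction_rate=0.5))
    # Hypothetical intervention: slower expansion and faster contraction.
    print(years_to_threshold(45, 100, expansion_rate=1.0, contraction_rate=0.75))
```

Because the time to threshold depends on the difference between the two rates, slowing expansion and promoting contraction act in the same direction in this sketch, and their combined effect on the delay is larger than that of either change alone.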
Data and code availability
Scripts used to analyze PacBio amplicon sequencing data can be found at https://github.com/alanmejiamaza/Counting-Repeats, and scripts used for analyses of RNA-seq data can be found at https://github.com/alanmejiamaza/RNA-seq-analyses. PacBio amplicon data are deposited at the NCBI Sequence Read Archive (accession number: SUB15770210, https://submit.ncbi.nlm.nih.gov/subs/sra/SUB15770210/files), genetic association summary statistics based on exome sequencing data are deposited at DRYAD (accession number: 408382, https://datadryad.org/submission/408382), and the raw RNA-seq data are deposited at the database of Genotypes and Phenotypes (dbGaP) (accession number: phs001525.v2.p1, https://dbgap.ncbi.nlm.nih.gov/beta/study/phs001525.v2.p1/#study). ABI sequencer scan data can be shared upon reasonable request. Requests for tissue specimens may be directed to xdp@partners.org.
Acknowledgments
We thank Dr. Konrad Karczewski for his advice on exome analyses and Drs. Marc Ciosi and Darren Monckton for their input on modeling repeat instability. This work was supported by the CCXDP (V.C.W., L.J.O., D.C.B., M.E.T., and N.S.); NIH grants R01-NS102423 (D.C.B. and M.E.T.), R01-NS049206 (V.C.W.), and R01-NS091161 (M.E.M.); the Massachusetts General Hospital Executive Committee on Research (MGH ECOR) (A.M.M.); and The Mannion Family MGH Research Scholars Award (V.C.W.). Figures and schematics were created using BioRender. Plots were done in R/RStudio v.1.3 (https://cran.r-project.org/mirrors.html).
Author contributions
Conceptualization, A.M.M., V.C.W., and L.J.O.; methodology, A.M.M. and A.D.; investigation, A.M.M., M.H., K.C., T.G., A.N., A.D., R.Y., and P.D.V.M.; data analysis, A.M.M., A.D., M.H., K.C., T.G., and V.C.W.; visualization, A.M.M., K.C., and A.D.; resources, E.B.P., M.G.M., J.S.H., E.P.N., C.F.-C., G.P.L., M.S., E.M., M.C.A., and C.C.E.D.; supervision, V.C.W., L.J.O., M.E.M., D.C.B., and M.E.T.; writing – original draft, A.M.M., V.C.W., and L.J.O.
Declaration of interests
V.C.W. was a founding scientific advisory board member with a financial interest in Triplet Therapeutics Inc. Her financial interests were reviewed and are managed by MGH and Mass General Brigham in accordance with their conflict-of-interest policies. V.C.W. is a scientific advisory board member of LoQus23 Therapeutics Ltd. and has provided paid consulting services to Acadia Pharmaceuticals Inc., Alnylam Inc., Biogen Inc., Passage Bio, Rgenta Therapeutics, and Ascidian Therapeutics.
Published: December 23, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2025.12.002.
Contributor Information
Laurie J. Ozelius, Email: laurie.ozelius@mgh.harvard.edu.
Vanessa Chantal Wheeler, Email: vwheeler@mgh.harvard.edu.
Web resources
Counting-Repeats repository, https://github.com/alanmejiamaza/Counting-Repeats
OMIM, https://www.omim.org/
RNA-seq scripts, https://github.com/alanmejiamaza/RNA-seq-analyses
References
- 1. Lee L.V., Rivera C., Teleg R.A., Dantes M.B., Pasco P.M.D., Jamora R.D.G., Arancillo J., Villareal-Jordan R.F., Rosales R.L., Demaisip C., et al. The Unique Phenomenology of Sex-Linked Dystonia Parkinsonism (XDP, DYT3, “Lubag”). Int. J. Neurosci. 2011;121:3–11. doi: 10.3109/00207454.2010.526728.
- 2. Lee L.V., Maranon E., Demaisip C., Peralta O., Borres-Icasiano R., Arancillo J., Rivera C., Munoz E., Tan K., Reyes M.T. The natural history of sex-linked recessive dystonia parkinsonism of Panay, Philippines (XDP). Parkinsonism Relat. Disord. 2002;9:29–38. doi: 10.1016/S1353-8020(02)00042-1.
- 3. Acuna P., Supnet-Wells M.L., Spencer N.A., De Guzman J.K., Russo M., Hunt A., Stephen C., Go C., Carr S., Ganza N.G., et al. Establishing a natural history of X-linked dystonia parkinsonism. Brain Commun. 2023;5:fcad106. doi: 10.1093/braincomms/fcad106.
- 4. Goto S., Lee L.V., Munoz E.L., Tooyama I., Tamiya G., Makino S., Ando S., Dantes M.B., Yamada K., Matsumoto S., et al. Functional anatomy of the basal ganglia in X-linked recessive dystonia-parkinsonism. Ann. Neurol. 2005;58:7–17. doi: 10.1002/ana.20513.
- 5. Waters C.H., Faust P.L., Powers J., Vinters H., Moskowitz C., Nygaard T., Hunt A.L., Fahn S. Neuropathology of lubag (x-linked dystonia parkinsonism). Mov. Disord. 1993;8:387–390. doi: 10.1002/mds.870080328.
- 6. Goto S., Kawarai T., Morigaki R., Okita S., Koizumi H., Nagahiro S., Munoz E.L., Lee L.V., Kaji R. Defects in the striatal neuropeptide Y system in X-linked dystonia-parkinsonism. Brain. 2013;136:1555–1567. doi: 10.1093/brain/awt084.
- 7. Brüggemann N., Heldmann M., Klein C., Domingo A., Rasche D., Tronnier V., Rosales R.L., Jamora R.D.G., Lee L.V., Münte T.F. Neuroanatomical changes extend beyond striatal atrophy in X-linked dystonia parkinsonism. Parkinsonism Relat. Disord. 2016;31:91–97. doi: 10.1016/j.parkreldis.2016.07.012.
- 8. Hanssen H., Prasuhn J., Heldmann M., Diesta C.C., Domingo A., Göttlich M., Blood A.J., Rosales R.L., Jamora R.D.G., Münte T.F., et al. Imaging gradual neurodegeneration in a basal ganglia model disease. Ann. Neurol. 2019;86:517–526. doi: 10.1002/ana.25566.
- 9. Arasaratnam C.J., Singh-Bains M.K., Waldvogel H.J., Faull R.L.M. Neuroimaging and neuropathology studies of X-linked dystonia parkinsonism. Neurobiol. Dis. 2021;148. doi: 10.1016/j.nbd.2020.105186.
- 10. Aneichyk T., Hendriks W.T., Yadav R., Shin D., Gao D., Vaine C.A., Collins R.L., Domingo A., Currall B., Stortchevoi A., et al. Dissecting the Causal Mechanism of X-Linked Dystonia-Parkinsonism by Integrating Genome and Transcriptome Assembly. Cell. 2018;172:897–909.e21. doi: 10.1016/j.cell.2018.02.011.
- 11. Bragg D.C., Mangkalaphiban K., Vaine C.A., Kulkarni N.J., Shin D., Yadav R., Dhakal J., Ton M.-L., Cheng A., Russo C.T., et al. Disease onset in X-linked dystonia-parkinsonism correlates with expansion of a hexameric repeat within an SVA retrotransposon in TAF1. Proc. Natl. Acad. Sci. USA. 2017;114:E11020–E11028. doi: 10.1073/pnas.1712526114.
- 12. Makino S., Kaji R., Ando S., Tomizawa M., Yasuno K., Goto S., Matsumoto S., Tabuena M.D., Maranon E., Dantes M., et al. Reduced Neuron-Specific Expression of the TAF1 Gene Is Associated with X-Linked Dystonia-Parkinsonism. Am. J. Hum. Genet. 2007;80:393–406. doi: 10.1086/512129.
- 13. Ito N., Hendriks W.T., Dhakal J., Vaine C.A., Liu C., Shin D., Shin K., Wakabayashi-Ito N., Dy M., Multhaupt-Buell T., et al. Decreased N-TAF1 expression in X-Linked Dystonia-Parkinsonism patient-specific neural stem cells. Dis. Model. Mech. 2016;9:451–462. doi: 10.1242/dmm.022590.
- 14. Rakovic A., Domingo A., Grütz K., Kulikovskaja L., Capetian P., Cowley S.A., Lenz I., Brüggemann N., Rosales R., Jamora D., et al. Genome editing in induced pluripotent stem cells rescues TAF1 levels in X-linked dystonia-parkinsonism. Mov. Disord. 2018;33:1108–1118. doi: 10.1002/mds.27441.
- 15. Al Ali J., Vaine C.A., Shah S., Campion L., Hakoum A., Supnet M.L., Acuña P., Aldykiewicz G., Multhaupt-Buell T., Ganza N.G.M., et al. TAF1 Transcripts and Neurofilament Light Chain as Biomarkers for X-linked Dystonia-Parkinsonism. Mov. Disord. 2021;36:206–215. doi: 10.1002/mds.28305.
- 16. Capponi S., Stöffler N., Penney E.B., Grütz K., Nizamuddin S., Vermunt M.W., Castelijns B., Fernandez-Cerado C., Legarda G.P., Velasco-Andrada M.S., et al. Dissection of TAF1 neuronal splicing and implications for neurodegeneration in X-linked dystonia-parkinsonism. Brain Commun. 2021;3. doi: 10.1093/braincomms/fcab253.
- 17. Domingo A., Amar D., Grütz K., Lee L.V., Rosales R., Brüggemann N., Jamora R.D., Cutiongco-dela Paz E., Rolfs A., Dressler D., et al. Evidence of TAF1 dysfunction in peripheral models of X-linked dystonia-parkinsonism. Cell. Mol. Life Sci. 2016;73:3205–3215. doi: 10.1007/s00018-016-2159-4.
- 18. Westenberger A., Reyes C.J., Saranza G., Dobricic V., Hanssen H., Domingo A., Laabs B.H., Schaake S., Pozojevic J., Rakovic A., et al. A hexanucleotide repeat modifies expressivity of X-linked dystonia parkinsonism. Ann. Neurol. 2019;85:812–822. doi: 10.1002/ana.25488.
- 19. Campion L.N., Mejia Maza A., Yadav R., Penney E.B., Murcar M.G., Correia K., Gillis T., Fernandez-Cerado C., Velasco-Andrada M.S., Legarda G.P., et al. Tissue-specific and repeat length-dependent somatic instability of the X-linked dystonia parkinsonism-associated CCCTCT repeat. Acta Neuropathol. Commun. 2022;10:49. doi: 10.1186/s40478-022-01349-0.
- 20. Depienne C., Mandel J.-L. 30 years of repeat expansion disorders: What have we learned and what are the remaining challenges? Am. J. Hum. Genet. 2021;108:764–785. doi: 10.1016/j.ajhg.2021.03.011.
- 21. Laabs B.-H., Klein C., Pozojevic J., Domingo A., Brüggemann N., Grütz K., Rosales R.L., Jamora R.D., Saranza G., Diesta C.C.E., et al. Identifying genetic modifiers of age-associated penetrance in X-linked dystonia-parkinsonism. Nat. Commun. 2021;12:3216. doi: 10.1038/s41467-021-23491-4.
- 22. Lee J.-M., Huang Y., Orth M., Gillis T., Siciliano J., Hong E., Mysore J.S., Lucente D., Wheeler V.C., Seong I.S., et al. Genetic modifiers of Huntington disease differentially influence motor and cognitive domains. Am. J. Hum. Genet. 2022;109:885–899. doi: 10.1016/j.ajhg.2022.03.004.
- 23. Genetic Modifiers of Huntington’s Disease (GeM-HD) Consortium. Genetic modifiers of somatic expansion and clinical phenotypes in Huntington’s disease highlight shared and tissue-specific effects. Nat. Genet. 2025;57:1426–1436. doi: 10.1038/s41588-025-02191-5.
- 24. Genetic Modifiers of Huntington’s Disease (GeM-HD) Consortium. CAG Repeat Not Polyglutamine Length Determines Timing of Huntington’s Disease Onset. Cell. 2019;178:887–900.e14. doi: 10.1016/j.cell.2019.06.036.
- 25. Mouro Pinto R., Murtha R., Azevedo A., Douglas C., Kovalenko M., Ulloa J., Crescenti S., Burch Z., Oliver E., Kesavan M., et al. In vivo CRISPR-Cas9 genome editing in mice identifies genetic modifiers of somatic CAG repeat instability in Huntington’s disease. Nat. Genet. 2025;57:314–322. doi: 10.1038/s41588-024-02054-5.
- 26. Ciosi M., Maxwell A., Cumming S.A., Hensman Moss D.J., Alshammari A.M., Flower M.D., Durr A., Leavitt B.R., Roos R.A.C., et al. TRACK-HD team. A genetic association study of glutamine-encoding DNA sequence structures, somatic CAG expansion, and DNA repair gene variants, with Huntington disease clinical outcomes. EBioMedicine. 2019;48:568–580. doi: 10.1016/j.ebiom.2019.09.020.
- 27. Wheeler V.C., Dion V. Modifiers of CAG/CTG Repeat Instability: Insights from Mammalian Models. J. Huntingtons Dis. 2021;10:123–148. doi: 10.3233/JHD-200426.
- 28. Lee J.-M., Zhang J., Su A.I., Walker J.R., Wiltshire T., Kang K., Dragileva E., Gillis T., Lopez E.T., Boily M.-J., et al. A novel approach to investigate tissue-specific trinucleotide repeat instability. BMC Syst. Biol. 2010;4:29. doi: 10.1186/1752-0509-4-29.
- 29. Poplin R., Ruano-Rubio V., DePristo M.A., Fennell T.J., Carneiro M.O., Van Der Auwera G.A., Kling D.E., Gauthier L.D., Levy-Moonshine A., Roazen D., et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at bioRxiv. 2017. doi: 10.1101/201178.
- 30. Van Der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., Del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J., et al. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. CP in Bioinformatics. 2013;43:11.10.1–11.10.33. doi: 10.1002/0471250953.bi1110s43.
- 31. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824. doi: 10.1038/ng.2310.
- 32. Wenninger S., Cumming S.A., Gutschmidt K., Okkersen K., Jimenez-Moreno A.C., Daidj F., Lochmüller H., Hogarth F., Knoop H., Bassez G., et al. Associations Between Variant Repeat Interruptions and Clinical Outcomes in Myotonic Dystrophy Type 1. Neurol. Genet. 2021;7. doi: 10.1212/NXG.0000000000000572.
- 33. Trinh J., Lüth T., Schaake S., Laabs B.-H., Schlüter K., Laβ J., Pozojevic J., Tse R., König I., Jamora R.D., et al. Mosaic divergent repeat interruptions in XDP influence repeat stability and disease onset. Brain. 2023;146:1075–1082. doi: 10.1093/brain/awac160.
- 34. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776.
- 35. Lee S., Emond M.J., Bamshad M.J., Barnes K.C., Rieder M.J., Nickerson D.A., NHLBI GO Exome Sequencing Project—ESP Lung Project Team. Christiani D.C., Wurfel M.M., Lin X. Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies. Am. J. Hum. Genet. 2012;91:224–237. doi: 10.1016/j.ajhg.2012.06.007.
- 36. Lee S., Abecasis G.R., Boehnke M., Lin X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. Am. J. Hum. Genet. 2014;95:5–23. doi: 10.1016/j.ajhg.2014.06.009.
- 37. Lee J.-H., Thomsen M., Daub H., Thieulin-Pardo G., Steinbacher S., Sztyler A., Dahiya V., Neudegger T., Dominguez C., Iyer R.R., et al. Elucidation of multiple high-resolution states of human MutSβ by cryo-EM reveals interplay between ATP/ADP binding and heteroduplex DNA recognition. Nucleic Acids Res. 2025;53. doi: 10.1093/nar/gkaf604.
- 38. Flower M., Lomeikaite V., Ciosi M., Cumming S., Morales F., Lo K., Hensman Moss D., Jones L., Holmans P., et al. TRACK-HD Investigators. MSH3 modifies somatic instability and disease severity in Huntington’s and myotonic dystrophy type 1. Brain. 2019;142:1876–1886. doi: 10.1093/brain/awz115.
- 39. Morales F., Vásquez M., Santamaría C., Cuenca P., Corrales E., Monckton D.G. A polymorphism in the MSH3 mismatch repair gene is associated with the levels of somatic instability of the expanded CTG repeat in the blood DNA of myotonic dystrophy type 1 patients. DNA Repair. 2016;40:57–66. doi: 10.1016/j.dnarep.2016.01.001.
- 40. Hwang Y.H., Hayward B.E., Zafarullah M., Kumar J., Durbin Johnson B., Holmans P., Usdin K., Tassone F. Both cis and trans-acting genetic factors drive somatic instability in female carriers of the FMR1 premutation. Sci. Rep. 2022;12. doi: 10.1038/s41598-022-14183-0.
- 41. Yau W.Y., Raposo M., Bettencourt C., Labrum R., Vasconcelos J., Parkinson M.H., Giunti P., Wood N.W., Lima M., Houlden H. The repeat variant in MSH3 is not a genetic modifier for spinocerebellar ataxia type 3 and Friedreich’s ataxia. Brain. 2020;143:e25. doi: 10.1093/brain/awaa043.
- 42. Flower M., Lomeikaite V., Holmans P., Jones L., Tabrizi S.J., Monckton D.G. Reply: The repeat variant in MSH3 is not a genetic modifier for spinocerebellar ataxia type 3 and Friedreich’s ataxia. Brain. 2020;143:e26. doi: 10.1093/brain/awaa044.
- 43. Higham C.F., Morales F., Cobbold C.A., Haydon D.T., Monckton D.G. High levels of somatic DNA diversity at the myotonic dystrophy type 1 locus are driven by ultra-frequent expansion and contraction mutations. Hum. Mol. Genet. 2012;21:2450–2463. doi: 10.1093/hmg/dds059.
- 44. Higham C.F., Monckton D.G. Modelling and inference reveal nonlinear length-dependent suppression of somatic instability for small disease associated alleles in myotonic dystrophy type 1 and Huntington disease. J. R. Soc. Interface. 2013;10. doi: 10.1098/rsif.2013.0605.
- 45. Panigrahi G.B., Slean M.M., Simard J.P., Gileadi O., Pearson C.E. Isolated short CTG/CAG DNA slip-outs are repaired efficiently by hMutSβ, but clustered slip-outs are poorly repaired. Proc. Natl. Acad. Sci. USA. 2010;107:12593–12598. doi: 10.1073/pnas.0909087107.
- 46. Pluciennik A., Burdett V., Baitinger C., Iyer R.R., Shi K., Modrich P. Extrahelical (CAG)/(CTG) triplet repeat elements support proliferating cell nuclear antigen loading and MutLα endonuclease activation. Proc. Natl. Acad. Sci. USA. 2013;110:12277–12282. doi: 10.1073/pnas.1311325110.
- 47. Kadyrova L.Y., Gujar V., Burdett V., Modrich P.L., Kadyrov F.A. Human MutLγ, the MLH1–MLH3 heterodimer, is an endonuclease that promotes DNA expansion. Proc. Natl. Acad. Sci. USA. 2020;117:3535–3542. doi: 10.1073/pnas.1914718117.
- 48. Foiry L., Dong L., Savouret C., Hubert L., te Riele H., Junien C., Gourdon G. Msh3 is a limiting factor in the formation of intergenerational CTG expansions in DM1 transgenic mice. Hum. Genet. 2006;119:520–526. doi: 10.1007/s00439-006-0164-7.
- 49. Dragileva E., Hendricks A., Teed A., Gillis T., Lopez E.T., Friedberg E.C., Kucherlapati R., Edelmann W., Lunetta K.L., MacDonald M.E., Wheeler V.C. Intergenerational and striatal CAG repeat instability in Huntington’s disease knock-in mice involve different DNA repair genes. Neurobiol. Dis. 2009;33:37–47. doi: 10.1016/j.nbd.2008.09.014.
- 50. McLean Z.L., Gao D., Correia K., Roy J.C.L., Shibata S., Farnum I.N., Valdepenas-Mellor Z., Kovalenko M., Rapuru M., Morini E., et al. Splice modulators target PMS1 to reduce somatic expansion of the Huntington’s disease-associated CAG repeat. Nat. Commun. 2024;15:3182. doi: 10.1038/s41467-024-47485-0.
- 51. Handsaker R.E., Kashin S., Reed N.M., Tan S., Lee W.-S., McDonald T.M., Morris K., Kamitaki N., Mullally C.D., Morakabati N.R., et al. Long somatic DNA-repeat expansion drives neurodegeneration in Huntington’s disease. Cell. 2025;188:623–639.e19. doi: 10.1016/j.cell.2024.11.038.
- 52. Kaplan S., Itzkovitz S., Shapiro E. A Universal Mechanism Ties Genotype to Phenotype in Trinucleotide Diseases. PLoS Comput. Biol. 2007;3. doi: 10.1371/journal.pcbi.0030235.