Genome Research. 2025 Apr;35(4):810–823. doi: 10.1101/gr.279491.124

Optical genome mapping enables accurate testing of large repeat expansions

Bart van der Sanden 1, Kornelia Neveling 1, Syukri Shukor 2, Michael D Gallagher 2, Joyce Lee 2, Stephanie L Burke 2, Maartje Pennings 1, Ronald van Beek 1, Michiel Oorsprong 1, Ellen Kater-Baats 1, Eveline Kamping 1, Alide A Tieleman 3, Nicol C Voermans 3, Ingrid E Scheffer 4,5, Jozef Gecz 6,7,8, Mark A Corbett 8, Lisenka ELM Vissers 1, Andy Wing Chun Pang 2, Alex Hastie 2, Erik-Jan Kamsteeg 1,10,✉,#, Alexander Hoischen 1,9,10,✉,#
PMCID: PMC12047237  PMID: 40113266

Abstract

Short tandem repeats (STRs) are common variations in human genomes that frequently expand or contract, causing genetic disorders, mainly when expanded. Traditional diagnostic methods for identifying these expansions, such as repeat-primed PCR and Southern blotting, are often labor-intensive and locus-specific, and they cannot precisely size long repeat expansions. Sequencing-based methods, although capable of genome-wide detection, are limited by inaccuracy (short-read technologies) and high associated costs (long-read technologies). This study evaluated optical genome mapping (OGM) as an efficient, accurate approach for measuring STR lengths and assessing somatic stability in 85 samples with known pathogenic repeat expansions in DMPK, CNBP, and RFC1, causing myotonic dystrophy types 1 and 2 and cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS), respectively. Three workflows were applied: manual de novo assembly, local guided assembly (local-GA), and a molecule distance script. The latter two were developed as part of this study to assess the repeat sizes and somatic repeat stability. OGM successfully identified 84/85 (98.8%) of the pathogenic expansions, distinguishing between wild-type and expanded alleles or between two expanded alleles in recessive cases, with greater accuracy than standard of care (SOC) for long repeats and no apparent upper size limit. Notably, OGM detected somatic instability in a subset of DMPK, CNBP, and RFC1 samples. These findings suggest OGM could advance diagnostic accuracy for large repeat expansions, providing a more comprehensive genome-wide assay for repeat expansion disorders by measuring exact repeat lengths and somatic instability across multiple loci simultaneously.


Short tandem repeats (STRs) are common repeats of a particular k-mer of 1–6 bp in length (Tankard et al. 2018). More than a million cataloged STR loci make up ∼3% of the human genome and are scattered throughout (International Human Genome Sequencing Consortium 2001; Gymrek 2017). Expansions or contractions of at least 60 of these STRs have been associated with human genetic disorders, concerning predominantly neurogenetic diseases (Depienne and Mandel 2021; Tanudisastro et al. 2024). These disorders include, but are not limited to, myotonic dystrophies, Huntington's disease, fragile X syndrome, and different forms of spinocerebellar ataxias (van der Sanden et al. 2021; Rudaks et al. 2024). STR disorders present with overlapping clinical phenotypes, strong heterogeneity of symptoms, and variation in age of onset, which makes identification of the molecular diagnosis challenging (Tankard et al. 2018).

All individuals have a certain repeat length at each disease-associated STR locus; however, an individual may develop a disorder only once the size of a disease-associated repeat exceeds a certain threshold. For several STR disorders, a strong correlation between the size of the expansion and both the severity and the age of onset of the disorder has been reported (Paulson 2018; Depienne and Mandel 2021). An important characteristic of dominant STR expansion disorders is anticipation, a phenomenon where new generations are affected at an earlier age of onset and with more severe symptoms than the preceding generations. In addition to anticipation, repeat expansions can present with somatic instability, a dynamic process in which the repeat size can increase over time, which may be tissue dependent (Monckton et al. 1995; Wong et al. 1995; Gomes-Pereira et al. 2004). For some repeat expansion disorders, the disease severity increases when the repeat expansion is somatically unstable (Gomes-Pereira et al. 2004; Swami et al. 2009; Goold et al. 2021; Ruiz de Sabando et al. 2024). Finally, repeat expansions can contain interruptions (for example, a CCG interruption in a CTG repeat expansion in DMPK), and these may cause a repeat expansion to be more stable than uninterrupted repeat expansions, thereby reducing somatic instability and leading to milder symptoms (Cumming et al. 2018; Nolin et al. 2019; Depienne and Mandel 2021). However, repeat expansions are largely heterogeneous, and not all repeat expansion loci are equally affected by repeat interruptions or somatic instability.

The current standard of care (SOC) for patients with a suspected repeat expansion disorder can be time consuming and costly. The clinician must request the appropriate repeat expansion test based on the patient's disorder. The SOC then consists of targeted PCR and repeat-primed PCR (RP-PCR) and/or Southern blot assays. These assays must be refined for each different repeat expansion locus, which means that the same sample may have to undergo multiple rounds of diagnostic testing. This can be due to phenotypic overlap between expansions of different STRs, heterogeneity of symptoms, and variation in penetrance and age of onset (Tankard et al. 2018). Over the last decade, exome sequencing (ES) has become increasingly important for diagnosing patients (Srivastava et al. 2019), and in addition to the targeted repeat expansion assays, it is now also possible to detect specific STR expansions using ES and genome sequencing (GS) (Gymrek et al. 2012; Tang et al. 2017; Willems et al. 2017; Dashnow et al. 2018; Tankard et al. 2018; Dolzhenko et al. 2019; Mousavi et al. 2019; van der Sanden et al. 2021). However, dedicated short-read sequencing STR detection tools are limited by the 100–150 bp read length and/or total fragment length of, e.g., Illumina's sequencing by synthesis method (Halman and Oshlack 2020; Tanudisastro et al. 2024). Altogether, every genetic diagnostic test that is currently performed for patients with a suspected repeat expansion disorder has its own limitations and no generic one-test-fits-all approach is currently available.

The introduction of long-read technologies has allowed the detection of large repeat expansions and the determination of their exact repeat size because long reads can entirely span (very long) repeat loci, which improves mapping quality and reduces mapping bias (Mantere et al. 2019; Tanudisastro et al. 2024). Recently, long-read sequencing technologies, such as HiFi (PacBio) and nanopore (ONT) sequencing, have proven the benefit of long reads for STR detection (Giesselmann et al. 2019; Mitsuhashi et al. 2019; Sone et al. 2019; Chiu et al. 2021; Dolzhenko et al. 2024). However, the current high cost of long-read GS limits the widespread use of the technology for STR expansion detection (Tang et al. 2017). Therefore, targeted long-read sequencing approaches are emerging (Loose et al. 2016; Höijer et al. 2018; Miyatake et al. 2022; Stevanovski et al. 2022). Optical genome mapping (OGM) is another long-read technology, which generates images of ultra-long high molecular weight (UHMW) DNA molecules with an average N50 > 250 kb (Neveling et al. 2021). OGM has proven to be a cost-effective and easy-to-use alternative for structural variant (SV) detection and is also capable of detecting STRs (Mantere et al. 2021; Neveling et al. 2021; Facchini et al. 2023; Guruju et al. 2023). In addition, OGM is independent of sequence context; combined with the ultra-long molecules and genome-wide coverage, this enables the analysis of even the most complicated regions of the genome, in contrast to DNA sequencing approaches (Neveling et al. 2021). Therefore, OGM has great potential for determining the exact repeat sizes of even the longest repeats.

In this study, we tested whether OGM can efficiently and accurately identify the repeat length across multiple STR loci simultaneously, thereby detecting large STR expansions and determining their absolute repeat sizes as well as potential somatic instability.

Results

To assess the technical validity of OGM to size large repeat expansions and determine somatic instability, we performed OGM for 85 samples with known clinically relevant repeat expansions in DMPK, CNBP, and RFC1 causing myotonic dystrophy types 1 and 2, and cerebellar ataxia, neuropathy, and vestibular areflexia syndrome (CANVAS), respectively. Next, the OGM data were sequentially used in three different workflows: first, the generally available standard analysis workflow, referred to here as “manual de novo assembly”; second, a local guided assembly (local-GA); and third, a molecule distance script. The latter two were developed and applied as part of this study. The first two workflows were used to determine the repeat size of both alleles, while the third workflow was mainly used to identify potential somatic instability. This approach allowed for a direct comparison of the repeat sizes estimated by OGM and the repeat sizes reported after the SOC, providing an evaluation of OGM as a repeat expansion detection technology.

Standard of care results

For all 85 individuals, SOC genetic testing previously identified at least a monoallelic repeat expansion in CNBP, DMPK, or RFC1 that was larger than the pathogenic threshold (Table 1). All individuals with a monoallelic repeat expansion in DMPK or CNBP were diagnosed with myotonic dystrophy type 1 or 2, respectively. Of the 30 samples with a repeat expansion in DMPK, 21 had a repeat expansion >150 units (450 bp) reported after SOC and, based on this result, we expected these repeat expansions to be larger than the formal SV detection limit of OGM, which is currently ∼500 bp. The nine remaining DMPK repeat expansions were determined to be smaller than 500 bp in size (range 61–159 units or 183–477 bp) and thereby below the formal OGM resolution cutoff. Of the 30 individuals with an RFC1 repeat expansion, 19 had a biallelic pathogenic AAGGG repeat expansion, resulting in a diagnosis of CANVAS. One other patient had a biallelic AAAAG repeat expansion that is considered to be benign. In addition, five other individuals were carriers of a pathogenic AAGGG repeat expansion on one allele, but carried a benign AAAAG or AAAGG repeat expansion on the other allele. The five remaining individuals were carriers of a monoallelic AAGGG RFC1 repeat expansion without an indication of a repeat expansion or other genetic variant on the other allele. The SOC had a detection threshold of >75 repeat units for CNBP and >150 repeat units for DMPK. For RFC1, the SOC only predicted a mono- or biallelic repeat expansion, without providing any predictions of the expanded repeat size (Table 1).
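The unit-to-base-pair conversions in this paragraph follow directly from the motif length. The following is a minimal Python sketch of that arithmetic, not part of the study's workflows. It assumes motif lengths of 3 bp for DMPK (CTG), 4 bp for CNBP (CCTG), and 5 bp for RFC1 (AAGGG), and it treats the approximate 500 bp OGM limit as a hard cutoff purely for illustration. The function names are hypothetical.

```python
# Motif lengths in bp. The CNBP value (CCTG, 4 bp) is an assumption that is
# consistent with the "74 repeat units, i.e., 296 bp" conversion in the text.
MOTIF_BP = {"DMPK": 3, "CNBP": 4, "RFC1": 5}

# Formal SV detection limit of OGM (approximate, used here as a hard cutoff).
OGM_SV_LIMIT_BP = 500


def units_to_bp(gene: str, units: int) -> int:
    """Convert a repeat count into base pairs for the given gene."""
    return units * MOTIF_BP[gene]


def above_ogm_limit(gene: str, units: int) -> bool:
    """True if the repeat tract is at least as long as the formal OGM limit."""
    return units_to_bp(gene, units) >= OGM_SV_LIMIT_BP


# DMPK examples from the paragraph above: 61, 150, and 159 repeat units.
for n_units in (61, 150, 159):
    print(n_units, units_to_bp("DMPK", n_units), above_ogm_limit("DMPK", n_units))
```

Under these assumptions, every DMPK allele in the 61–159 unit range stays below the formal cutoff, which is why those nine samples were treated as a separate group.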

Table 1.

Sample and analysis overview

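Column groups (left to right, after Sample ID and Sample material): SOC (Allele 1, Allele 2, Conclusion); Manual de novo assembly (Allele 1, Allele 2, Conclusion); Local guided assembly (Allele 1, Allele 2, Conclusion); Molecule distance script (Somatic instability)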
Sample ID Sample material Allele 1 Allele 2 Conclusion Allele 1 Allele 2 Conclusion Allele 1 Allele 2 Conclusion Somatic instability
CNBP_01 EDTA blood >75 18 Detected 3681 20 Detected 61 0
CNBP_02 EDTA blood >75 15 Detected 6331 39 Detected 5000 10 Detected A + B
CNBP_03 EDTA blood >75 13 Detected 6517 53 Detected 3659 23 Detected A + B
CNBP_04 EDTA blood >75 15 Detected 7155 −27 Detected 4401 4 Detected A + B
CNBP_05 EDTA blood >75 15 Detected 8042 0 Detected 3521 33 Detected A + B
CNBP_06 EDTA blood >75 15 Detected −14 −15 3687 21 Detected A + B
CNBP_07 EDTA blood >75 16 Detected 5212 −12 Detected 5000 10 Detected A + B
CNBP_08 EDTA blood >75 17 Detected 2502 2 Detected 2661 12 Detected A + B
CNBP_10 EDTA blood >75 13 Detected 2874 −20 Detected 2963 11 Detected A + B
CNBP_11 EDTA blood >75 16 Detected 375 11 Detected 254 34 Detected A + B
CNBP_12 EDTA blood >75 Normal Detected 3471 −16 Detected 3346 14 Detected A
CNBP_13 EDTA blood >75 16 Detected 4634 −14 Detected 4330 3 Detected A + B
CNBP_14 EDTA blood >75 16 Detected 5244 −2 Detected 4186 49 Detected A + B
CNBP_15 EDTA blood >75 Normal Detected 2183 −1 Detected 2092 18 Detected A + B
CNBP_16 EDTA blood >75 Normal Detected 3221 320 Detected 3201 0 Detected A + B
CNBP_17 EDTA blood >75 9 Detected 6275 −13 Detected 5000 2 Detected A + B
CNBP_18 EDTA blood >75 15 Detected 1915 −29 Detected 1656 15 Detected A + B
CNBP_19 EDTA blood >75 Normal Detected 1460 −2 Detected 1574 0 Detected A + B
CNBP_20 EDTA blood >75 Normal Detected 3977 −7 Detected 3577 11 Detected A + B
CNBP_21 EDTA blood >75 18 Detected 288 30 Detected 244 8 Detected A + B
CNBP_22 EDTA blood >75 17 Detected 1683 −24 Detected 1725 70 Detected A + B
CNBP_23 EDTA blood >75 12 Detected 2131 −25 Detected 2515 8 Detected A + B
CNBP_24 EDTA blood 134 Normal Detected −13 −14 3618 10 Detected A + B
CNBP_25 EDTA blood >75 Normal Detected 1476 −19 Detected 2626 9 Detected A + B
CNBP_26 EDTA blood >135 Normal Detected 3737 14 Detected 3241 45 Detected A + B
DMPK_01 EDTA blood >150 11 Detected 269 55 Detected 247 35 Detected
DMPK_02 EDTA blood >150 11 Detected 456 47 Detected 473 30 Detected
DMPK_03 EDTA blood >150 5 Detected 252 57 Detected 116 Detected
DMPK_04 EDTA blood 61 11 Detected 66 66 Detected 60 27 Detected B
DMPK_05 EDTA blood >150 5 Detected 457 54 Detected 485 30 Detected A + B
DMPK_06 EDTA blood 127 5 Detected 64 64 Detected 82 81 Detected B
DMPK_07 EDTA blood >150 12 Detected 378 28 Detected 358 20 Detected
DMPK_08 EDTA blood 91 5 Detected 67 67 Detected 49 B
DMPK_09 EDTA blood 96–130 5 Detected 71 71 Detected 68 Detected
DMPK_10 Cell pellet >150 12 Detected 2829 37 Detected 2825 12 Detected A + B
DMPK_11 Cell pellet >150 5 Detected 233 58 Detected 231 12 Detected A + B
DMPK_12 Cell pellet >150 13 Detected 213 24 Detected 219 34 Detected B
DMPK_13 Cell pellet >150 13 Detected 163 21 Detected 167 33 Detected A + B
DMPK_14 Cell pellet >150 13 Detected 202 10 Detected 188 7 Detected B
DMPK_15 Cell pellet >150 6 Detected 1839 28 Detected 1768 53 Detected A + B
DMPK_16 EDTA blood >150 12 Detected 85 85 Detected 52 Detected B
DMPK_17 EDTA blood >150 5 Detected 491 41 Detected 510 15 Detected A + B
DMPK_18 EDTA blood >150 5 Detected 71 71 Detected 69 Detected B
DMPK_19 EDTA blood 73 12 Detected 54 54 Detected 61 0 Detected B
DMPK_20 EDTA blood >150 5 Detected 55 55 Detected 41 B
DMPK_21 EDTA blood >150 12 Detected 1366 43 Detected 1347 23 Detected B
DMPK_22 EDTA blood 74 13 Detected 31 31 61 6 Detected A + B
DMPK_23 EDTA blood >150 7 Detected 372 17 Detected 369 25 Detected A + B
DMPK_24 EDTA blood 130 5 Detected 82 82 Detected 93 78 Detected A + B
DMPK_25 EDTA blood 159 5 Detected 79 79 Detected 109 63 Detected B
DMPK_26 EDTA blood 88 13 Detected 45 45 29 0 A
DMPK_27 EDTA blood >150 5 Detected 1648 21 Detected 20 12 B
DMPK_28 EDTA blood >150 14 Detected 440 41 Detected 393 3 Detected B
DMPK_29 EDTA blood >150 12 Detected 320 20 Detected 310 12 Detected B
DMPK_30 EDTA blood >150 29 Detected 290 63 Detected 131 Detected B
RFC1_01 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1487 1174 Biallelic 1497 1160 Biallelic A
RFC1_02 EDTA blood ≫AAGGG ≫AAAGG Biallelic 738 452 Biallelic 751 458 Biallelic
RFC1_03 EDTA blood ≫AAGGG ≫AAAAG Biallelic 883 111 Biallelic 897 126 Biallelic A
RFC1_04 EDTA blood ≫AAGGG ≫AAAAG Biallelic 750 97 Biallelic 760 99 Biallelic
RFC1_05 EDTA blood ≫AAGGG 11 AAAAG Monoallelic 1167 −5 Monoallelic 1175 4 Monoallelic A + B
RFC1_06 EDTA blood ≫AAAAG ≫AAAAG Homozygous 1565 1278 Biallelic 1579 1283 Biallelic A
RFC1_07 EDTA blood ≫AAGGG 11 AAAAG Monoallelic 625 −3 Monoallelic 643 10 Monoallelic
RFC1_08 EDTA blood ≫AAGGG ≫AAGGG Homozygous 840 840 Homozygous 856 856 Homozygous
RFC1_09 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1506 770 Biallelic 1499 778 Biallelic A
RFC1_10 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1289 1175 Biallelic 1307 1131 Biallelic A
RFC1_11 EDTA blood ≫AAGGG ≫AAGGG Homozygous 812 812 Homozygous 818 818 Homozygous
RFC1_12 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1106 895 Biallelic 1121 888 Biallelic A
RFC1_13 EDTA blood ≫AAGGG ≫AAGGG Homozygous 927 927 Homozygous 933 933 Homozygous
RFC1_14 EDTA blood ≫AAGGG ≫AAGGG Homozygous 725 725 Homozygous 740 737 Biallelic
RFC1_15 EDTA blood ≫AAGGG ≫AAGGG Homozygous 873 811 Biallelic 887 783 Biallelic A
RFC1_16 EDTA blood ≫AAGGG 11 AAAAG Monoallelic 1104 −11 Monoallelic 1097 4 Monoallelic A + B
RFC1_17 EDTA blood ≫AAGGG 9 AAAAG Monoallelic 711 −1 Monoallelic 723 8 Monoallelic B
RFC1_18 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1444 1134 Biallelic 1474 1127 Biallelic A + B
RFC1_19 EDTA blood ≫AAGGG ≫AAGGG Homozygous 223 40 Biallelic 235 51 Biallelic A
RFC1_20 EDTA blood ≫AAGGG ≫AAAAG Biallelic 494 100 Biallelic 502 106 Biallelic
RFC1_21 EDTA blood ≫AAGGG ≫AAGGG Homozygous 703 600 Biallelic 715 600 Biallelic A + B
RFC1_22 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1161 905 Biallelic 1180 911 Biallelic A
RFC1_23 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1028 701 Biallelic 1054 733 Biallelic
RFC1_24 EDTA blood ≫AAGGG 11 AAAAG Monoallelic 602 2 Monoallelic 615 11 Monoallelic A
RFC1_25 EDTA blood ≫AAGGG ≫AAGGG Homozygous 855 854 Biallelic 868 868 Homozygous
RFC1_26 EDTA blood ≫AAGGG ≫AAGGG Homozygous 714 573 Biallelic 734 585 Biallelic
RFC1_27 EDTA blood ≫AAGGG ≫AAGGG Homozygous 973 973 Homozygous 987 403 Biallelic A
RFC1_28 EDTA blood ≫AAGGG ≫AAGGG Homozygous 875 875 Homozygous 893 893 Homozygous
RFC1_29 EDTA blood ≫AAGGG ≫AAGGG Homozygous 1071 890 Biallelic 1073 905 Biallelic
RFC1_30 EDTA blood ≫AAGGG ≫AAAAG Biallelic 743 74 Biallelic 754 85 Biallelic A

For each sample, this table presents the SOC result, as well as the repeat size estimates from the two OGM sizing workflows in repeat units (manual de novo assembly and local guided assembly) and the somatic instability assessment from the molecule distance script workflow. For CNBP and DMPK, the table indicates whether the dominant repeat allele was detected (Detected). For RFC1, we also checked whether OGM identified a monoallelic, biallelic, or homozygous repeat expansion. For the molecule distance script, “A” denotes multiple consensus maps, and “B” denotes a gradient in the molecule distances. We considered the evidence suggestive of somatic instability only when both criteria were met (“A + B”).

Detecting repeat expansions using optical genome mapping

The OGM approach consisted of the generally available de novo assembly pipeline as well as two workflows that were developed as part of this study, i.e., local-GA and molecule distance script. In this study, we used these three different and complementary analytical workflows based on the OGM BNX molecule files to either estimate the size of both alleles at the respective locus of interest (manual de novo assembly and local-GA) or to assess the somatic stability of the detected repeat expansion(s) (molecule distance script) (Fig. 1). The manual de novo assembly workflow identified a repeat expansion beyond the gene-specific repeat size threshold in 81/85 (95.3%) samples, while the local-GA workflow identified a repeat expansion beyond the gene-specific repeat size threshold in 80/85 (94.1%) samples (Table 1; Supplemental Fig. S1). Jointly, we were able to identify a repeat expansion for 84 of the 85 samples by combining the results of the two different sizing workflows, even when considering the expected expansions smaller than the 500 bp formal cutoff for SV calling with OGM. The one remaining sample (DMPK_26) had a repeat size of 88 repeat units based on SOC, but only a premutation was suggested by the OGM findings with 45 repeat units called by the manual de novo assembly. Of the 84 detected repeat expansions, 77 were called by both workflows and the remaining seven were called as repeat expansions by one of the two workflows (Table 1; Supplemental Fig. S1). Notably, this even included eight samples with DMPK repeat expansion lengths <500 bp, the formal detection limit of OGM. Of the latter, six were called by both sizing OGM workflows, while the other two were only called by one of the two workflows.
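The combined detection count follows from inclusion-exclusion over the two sizing workflows. The sketch below is only an arithmetic check of the counts reported in this paragraph. The function name and the returned keys are hypothetical.

```python
def combined_detection(n_a: int, n_b: int, n_both: int, total: int) -> dict:
    """Combine the calls of two workflows using inclusion-exclusion."""
    n_either = n_a + n_b - n_both
    return {
        "either": n_either,              # called by at least one workflow
        "only_one": n_either - n_both,   # called by exactly one workflow
        "rate_percent": round(100 * n_either / total, 1),
    }


# Counts from the text: 81 (manual de novo), 80 (local-GA), 77 (both), 85 samples.
summary = combined_detection(n_a=81, n_b=80, n_both=77, total=85)
print(summary)
```

With the reported counts, this reproduces the 84 jointly detected expansions, the seven called by only one workflow, and the 98.8% detection rate.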

Figure 1.

Total overview of the data analysis workflow. For each sample, a de novo assembly was generated and the local-GA pipeline and molecule distance script were run. After each workflow, the maps and/or molecules to calculate workflow-specific repeat lengths were manually assessed. Green boxes denote the data analysis parts and gray boxes denote the data interpretation parts. (*) Workflows 1 and 2 were used to determine repeat lengths, while workflow 3 was used to identify potential somatic instability.

Concordance between OGM and SOC

Myotonic dystrophy types 1 and 2 are both autosomal dominant disorders, which is why we only expected heterozygous repeat expansions in the DMPK and CNBP samples. For all these samples except one, OGM identified the heterozygous repeat expansion. In contrast, CANVAS is an autosomal recessive disorder caused by compound heterozygous or homozygous repeat expansions in RFC1, which makes it necessary to assess the repeat length of both alleles. Therefore, we checked whether both OGM workflows resulted in the same type of repeat expansion as reported after SOC, i.e., a monoallelic, biallelic, or homozygous repeat expansion, and for all 30 RFC1 samples, OGM confirmed the SOC results (Table 1).

In addition, the actual repeat lengths of the two OGM workflows (manual de novo assembly and local-GA) were compared to the repeat lengths reported after SOC. For all 25 CNBP and 30 RFC1 samples, the repeat lengths identified by OGM were at least as long as those reported after SOC (Table 1), and these results were considered concordant. In the case of DMPK, for 20 of the 30 samples, the repeat expansion lengths were also concordant with SOC, while for the other 10 samples, OGM presented different calls for the absolute repeat length compared to the SOC (Table 2). For seven of these 10 samples, the SOC identified a repeat expansion length <500 bp, the formal resolution limit of OGM. The remaining three samples had an expected repeat length >500 bp (based on SOC). The results for DMPK also indicated that the manual de novo assembly overestimated the repeat size of the expected wild-type allele (based on SOC) to be ≥50 repeat units (range 54–85 repeat units or 162–255 bp) for 15 samples. All but three of these wild-type alleles were called <50 repeat units by the local-GA workflow, suggesting that the local-GA may be more accurate in distinguishing wild-type and small repeat expansions.
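The wild-type overestimation counts can be re-derived from the allele 2 columns of Table 1. The sketch below is a hedged illustration with hypothetical variable names. It lists only the 15 DMPK samples whose manual de novo allele 2 call is at least 50 repeat units. `None` marks samples where Table 1 shows a single local-GA value, which is assumed here to mean that no separate size was reported for the second allele (Table 2 lists some of these as wt).

```python
DMPK_THRESHOLD_UNITS = 50  # threshold named in the text for DMPK

# sample: (manual de novo allele 2, local-GA allele 2), in repeat units.
# None = no separate local-GA size for allele 2 in Table 1 (assumption).
WILD_TYPE_CALLS = {
    "DMPK_01": (55, 35), "DMPK_03": (57, None), "DMPK_04": (66, 27),
    "DMPK_05": (54, 30), "DMPK_06": (64, 81), "DMPK_08": (67, None),
    "DMPK_09": (71, None), "DMPK_11": (58, 12), "DMPK_16": (85, None),
    "DMPK_18": (71, None), "DMPK_19": (54, 0), "DMPK_20": (55, None),
    "DMPK_24": (82, 78), "DMPK_25": (79, 63), "DMPK_30": (63, None),
}


def at_or_above_threshold(units) -> bool:
    """True if a size was reported and it reaches the DMPK threshold."""
    return units is not None and units >= DMPK_THRESHOLD_UNITS


de_novo_over = sorted(
    s for s, (dn, _) in WILD_TYPE_CALLS.items() if at_or_above_threshold(dn)
)
local_ga_over = sorted(
    s for s, (_, ga) in WILD_TYPE_CALLS.items() if at_or_above_threshold(ga)
)
de_novo_sizes = [dn for dn, _ in WILD_TYPE_CALLS.values()]

print(len(de_novo_over), min(de_novo_sizes), max(de_novo_sizes))
print(local_ga_over)
```

Under this reading of the table, the three alleles that remain at or above 50 repeat units in the local-GA workflow belong to samples that also appear in Table 2.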

Table 2.

Overview of repeat expansions with different calls for absolute repeat size

Sample ID   SOC                   Manual de novo assembly   Local guided assembly
            Allele 1   Allele 2   Allele 1   Allele 2       Allele 1   Allele 2
DMPK_06     127        5          64         64             82         81
DMPK_08     91         5          67         67             49         wt
DMPK_09     96–130     5          71         71             68         wt
DMPK_16     >150       12         85         85             52         wt
DMPK_18     >150       5          71         71             69         wt
DMPK_19     73         12         54         54             61         0
DMPK_20     >150       5          55         55             41         wt
DMPK_22     74         13         31         31             61         6
DMPK_24     130        5          82         82             93         78
DMPK_25     159        5          79         79             109        63

Repeat sizes represent repeat units.

Distinguishing between the two repeat alleles in biallelic repeats

OGM also made it possible to distinguish between the two RFC1 repeat expansion alleles of similar size for 19/25 RFC1 repeat expansion samples for which SOC identified a biallelic or homozygous expansion. For the remaining six RFC1 repeat expansion samples, OGM detected a homozygous repeat expansion, which confirmed the SOC results (Table 1).

Comparing the exact repeat sizes across the two OGM sizing pathways

One of the advantages of OGM over the SOC is that it also provided estimates of the actual size of large repeats starting from ∼500 bp in size. This allowed us to compare the repeat size estimates of each sample across the two OGM repeat sizing workflows. The ranges of the repeat expansions detected by both sizing workflows were [288–8042] and [244–6544] for CNBP, [54–2829] and [52–2825] for DMPK, and [223–1565] and [235–1579] for RFC1 for the manual de novo assembly workflow and local-GA workflow, respectively (Table 1). There was a strong, significant correlation between the manual de novo assembly and the local-GA workflows (R = 0.97, P < 0.001) (Fig. 2). The intercept for the comparison was 147 and the slope was 0.90, indicating a small deviation between the results of the two repeat sizing workflows. The average deviation across all three genes was 10.4%, while the gene-specific deviations were 20.0% for CNBP, 12.7% for DMPK, and 1.6% for RFC1 (Supplemental Table S1).
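The comparison above rests on a per-sample deviation and a least-squares trendline. The sketch below shows one way such values could be computed, under explicit assumptions: the deviation is taken as the absolute difference relative to the mean of the two estimates, and local-GA is regressed on manual de novo assembly. Neither choice is stated in this section, so both are illustrative only, and the function names are hypothetical. The example pairs are the allele 1 sizes of RFC1_01 to RFC1_05 from Table 1.

```python
from math import sqrt


def percent_deviation(a: float, b: float) -> float:
    """Absolute difference relative to the mean of both estimates (assumed definition)."""
    return abs(a - b) / ((a + b) / 2) * 100


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)


# Allele 1 sizes (repeat units) for RFC1_01 to RFC1_05, taken from Table 1.
de_novo = [1487, 738, 883, 750, 1167]
local_ga = [1497, 751, 897, 760, 1175]

slope, intercept, r = linear_fit(de_novo, local_ga)
deviations = [percent_deviation(a, b) for a, b in zip(de_novo, local_ga)]
print(slope, intercept, r, sum(deviations) / len(deviations))
```

A five-sample subset will not reproduce the reported slope, intercept, or gene-specific deviations, which were computed on the 77 samples detected by both workflows.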

Figure 2.

Correlation between the manual de novo assembly repeat lengths and the local-GA repeat lengths. For this correlation assessment, we only used the 77/85 (90.6%) samples for which both the manual de novo assembly workflow and the local-GA workflow detected a repeat expansion. The black line represents the trendline showing the correlation between manual de novo assembly and local-GA. The dashed gray line represents the optimal correlation line.

Detecting somatic instability

Based on the number of consensus maps and corresponding molecules resulting from the local-GA workflow and the visual inspection of the bar plot and histogram resulting from the molecule distance script workflow (Fig. 3), we detected suggestive evidence of somatic instability in 36/85 samples. Of these, 23 were CNBP samples, nine were DMPK samples, and four were RFC1 samples. Notably, of the 25 samples with the largest repeat alleles (>1500 repeat units), only two had no suggestive evidence of instability. The molecule distance script workflow may be best suited to detect instability: 16 samples showed a suggestive pattern for instability with this tool alone. In samples with suspected somatic instability, the estimated repeat sizes may vary more than in samples without such a suspicion. This result suggests the benefit of sequentially using the different OGM repeat expansion workflows, especially in the case of samples that are suspected to present with somatic instability.
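The decision rule described in the Table 1 legend (criterion A plus criterion B) and the per-gene tallies can be expressed compactly. This is a minimal sketch with hypothetical names. The flag strings follow the notation of the last column of Table 1, and the counts are those stated in the paragraph above.

```python
def suggests_instability(flags: str) -> bool:
    """True only when both criteria are present, e.g. 'A + B'.

    A = multiple consensus maps (local-GA),
    B = gradient in the molecule distances (molecule distance script).
    """
    present = {part.strip() for part in flags.split("+") if part.strip()}
    return {"A", "B"} <= present


# gene: (samples with suggestive evidence, samples tested), from the text.
SUGGESTIVE = {"CNBP": (23, 25), "DMPK": (9, 30), "RFC1": (4, 30)}

total_suggestive = sum(hit for hit, _ in SUGGESTIVE.values())
total_samples = sum(n for _, n in SUGGESTIVE.values())
percent_by_gene = {
    gene: round(100 * hit / n, 1) for gene, (hit, n) in SUGGESTIVE.items()
}

print(total_suggestive, total_samples, percent_by_gene)
```

Samples flagged with only "A" or only "B" are deliberately not counted, which matches the conservative rule of requiring both lines of evidence.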

Figure 3.

Overview of the data analysis outputs of the three OGM repeat expansion workflows for sample DMPK_10. This figure only shows the visual results of the data analysis. The results of the data interpretation are mainly the estimates of the actual repeat sizes resulting from the manual de novo assembly and local-GA workflows, as well as the visualization of the label distances in each molecule covering the locus of interest resulting from the molecule distance script. (A) Representation of the repeat expansion locus in the de novo assembly showing the position of the repeat expansion in the gene (3′ UTR). Labels of interest are indicated by red arrowheads. These labels were used to manually calculate the repeat size by subtracting the reference distance (green bar) from the distances of the respective sample maps (blue bars). (B) Consensus-guided assemblies across the DMPK repeat expansion locus. The DMPK gene is indicated by the red box. Based on the estimated repeat length, each map is assigned to allele 1 or allele 2 in order to separate the two alleles. Final repeat sizes are calculated by combining the repeat sizes of the maps assigned to the same allele (see also Methods). (C) This bar plot shows the distance between the labels of interest in each molecule ordered from smallest to largest. (D) This histogram shows the result of the molecule distance script that automatically assigns molecules to one of the alleles. The blue peak represents allele 1, while the orange peak represents allele 2. Both the bar plot and histogram can then be used to assess whether a sample contains evidence for somatic instability or not.
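The manual calculation in panel A (sample label distance minus reference distance) and the allele separation in panels C and D can be illustrated with a short sketch. This is not the study's molecule distance script. The largest-gap split is a swapped-in, simplified stand-in for the allele assignment, the function names are hypothetical, and all distances are invented numbers for illustration. A 3 bp motif (CTG) is assumed for DMPK.

```python
from statistics import median


def repeat_units(sample_distance_bp, reference_distance_bp, motif_bp):
    """Size difference to the reference between two flanking labels, in repeat units."""
    return (sample_distance_bp - reference_distance_bp) / motif_bp


def split_by_largest_gap(distances_bp):
    """Split label distances into a shorter and a longer group at the largest gap.

    Simplified stand-in for allele assignment; not the published script.
    """
    ordered = sorted(distances_bp)
    if len(ordered) < 2:
        raise ValueError("need at least two molecules")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Hypothetical label distances (bp) of six molecules spanning the locus.
molecules_bp = [9990, 10000, 10020, 18400, 18500, 18600]
REFERENCE_BP = 10000  # hypothetical reference distance between the labels

shorter, longer = split_by_largest_gap(molecules_bp)
units_shorter = repeat_units(median(shorter), REFERENCE_BP, motif_bp=3)
units_longer = repeat_units(median(longer), REFERENCE_BP, motif_bp=3)
print(units_shorter, units_longer)
```

Because the size is a difference to the reference distance, a sample distance shorter than the reference yields a negative value, which is presumably how the small negative entries in Table 1 arise. A broad spread within the longer group, rather than a tight cluster, would correspond to the gradient pattern that the bar plot is inspected for.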

Discussion

Determining the exact length of specific repeat expansions is of great importance for the patient and their family due to the rough correlation between repeat size and disease severity and age of onset, but also due to genetic anticipation. Current molecular diagnostic efforts for repeat expansion disorders entail labor-intensive and time-consuming PCR and/or Southern blot efforts. The current SOC only determines a repeat size range but does not detect/estimate the actual repeat length (due to artifacts or resolution limits of the respective tests). Also, the read length of short-read sequencing methods has proven to be too limited to accurately detect all repeat expansions. Long-read sequencing is still not routinely used in most laboratories and is currently too expensive, even though it has recently enabled the detection of an increasing number of novel repeat expansion and contraction disorders (Pellerin et al. 2023). Here, we present a generic assay that works for three different repeat loci (i.e., CNBP, DMPK, and RFC1) and most likely also for all other repeat expansion loci for which the pathogenic repeat size extends beyond ∼300 bp in size. OGM's use of native DNA molecules without any experimental noise (e.g., PCR artifacts, or bias for one of the alleles) allows detection of very long repeat expansions. An additional benefit of this approach is the possibility to detect somatic instability in the repeat expansion of interest.

Overall, our results increased the repeat allele sizing resolution for 84 of the 85 investigated repeat expansion samples. In addition, when checking 20 alleles in 10 control samples, no repeat expansion beyond the pathogenic repeat size threshold was detected (Supplemental Table S2). Being able to provide a more accurate repeat length measurement, especially for very long repeat expansion alleles, is one of the apparent strengths of OGM. Here, we even detected CNBP expansions >7000 repeat units, suggesting that OGM has no upper size limit, which may still exist for most short- and long-read sequencing approaches. In addition, for 19 of the RFC1 samples, the SOC reported a biallelic or homozygous repeat expansion and OGM made it possible to distinguish between the two alleles of similar size, which is not possible with the current SOC. Because OGM can confirm, size, and distinguish both heterozygous and biallelic repeat expansions, it also increases molecular diagnostic capabilities and allows for improved patient and family counseling. This is particularly important for families with RFC1 repeat expansions because the repeat length of these expansion alleles, and especially the length of the smaller allele, is an important factor for predicting disease onset, phenotype variability, and severity (Currò et al. 2024).

An additional benefit of this approach is the possibility to detect somatic instability in the repeat expansion of interest, a phenomenon that could potentially lead to variability in disease severity and age of onset, especially if affected tissue could be sampled (Monckton et al. 1995; Wong et al. 1995; Gomes-Pereira et al. 2004; Swami et al. 2009; Goold et al. 2021). Here, we detected evidence of somatic instability for at least 36/85 samples, or 30.0% of DMPK samples (9/30), 92.0% of CNBP samples (23/25), and 13.3% of RFC1 samples (4/30). Repeat instability seems to occur for almost all long repeats, i.e., all but two of the 25 largest repeats. Finding this large number of somatically unstable repeat expansions was not expected beforehand. For CNBP (Alfano et al. 2022) and DMPK (Morales et al. 2023) repeat expansion alleles, the presence of somatic instability is well known; however, so far, RFC1 repeat alleles have been considered stable and evidence for somatic instability in RFC1 repeat expansions is limited (Currò et al. 2024). Finding this new evidence highlights an opportunity for future repeat expansion research using OGM, as OGM can easily identify somatic instability for various repeat loci. The sensitivity of this approach may increase with higher OGM coverage, ideally even utilizing DNA from affected tissue instead of blood-derived DNA. Also, updates of the molecule distance script workflow should allow more accurate cutoffs for instability to be determined in the future.

Notwithstanding the accurate repeat expansion detection and improved allele sizing resolution using OGM, our results confirm the suspicion that OGM might not be accurate for repeat sizes smaller than 300 bp. For 10 DMPK cases, smaller repeat sizes than expected by SOC were detected. For seven of those, SOC also confirmed allele sizes smaller than 500 bp. However, due to technical difficulties such as the extinction of the RP-PCR signal, the SOC result may not represent the ground truth in all cases either. Our data also suggest that using a pathogenic repeat length threshold of ∼300 bp (74 repeat units for CNBP, i.e., 296 bp) does not result in false positive findings, i.e., the overestimation of wild-type alleles, while a smaller threshold (50 repeat units for DMPK, i.e., 150 bp) may result in an overestimation of wild-type allele sizes as seen for 15/85 samples (all called with >50 repeat units for the suspected wild-type DMPK allele). This overestimation of wild-type allele sizes seems to mainly occur in the manual de novo assembly workflow and is less of an issue when using the local-GA workflow. In total, our study suggests that OGM is highly accurate for identifying large repeat expansions. There was only one sample for which full expansion status was called by SOC but only premutation status by OGM. This may not be surprising as this sample presented with a repeat size by SOC of only 88 repeat units or 264 bp.
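The conversions between repeat units and base pairs quoted above can be checked with a short sketch. The motif lengths follow from the repeat units listed in Table 3 (CCTG, CTG, AAGGG); the function name and the nominal 300 bp constant are illustrative assumptions and are not part of the published workflows.

```python
# Motif lengths in bp, derived from the repeat units in Table 3.
REPEAT_UNIT_BP = {"CNBP": 4, "DMPK": 3, "RFC1": 5}

# Approximate OGM resolution limit discussed in the text (assumed constant).
NOMINAL_RESOLUTION_BP = 300


def units_to_bp(gene, repeat_units):
    """Convert a repeat count into a length in base pairs."""
    return repeat_units * REPEAT_UNIT_BP[gene]


# Values quoted in the paragraph above.
examples = [("CNBP", 74), ("DMPK", 50), ("DMPK", 88)]
for gene, units in examples:
    size_bp = units_to_bp(gene, units)
    relation = "below" if size_bp < NOMINAL_RESOLUTION_BP else "at or above"
    print(f"{gene}: {units} units = {size_bp} bp ({relation} ~300 bp)")
```

All three examples fall below 300 bp, which is consistent with the observation that the CNBP threshold sits right at the resolution limit whereas the DMPK threshold lies well below it.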

Even though both the manual de novo assembly workflow and the local-GA workflow use the same BNX molecule file as starting input, we show that there was an average deviation of 10.4% between the repeat sizes as estimated by these two different workflows across the three genes (Fig. 2; Supplemental Table S1). However, the correlation between the two workflows was still highly significant (P < 0.001; Fig. 2). The CNBP samples showed the highest average deviation (20.0%), while the DMPK and RFC1 samples had an average deviation of 12.7% and 1.6%, respectively. The high deviation for the CNBP samples was likely due to the high level of somatic instability in these samples: because the molecules carry different repeat lengths, it is more difficult for the workflows to determine an exact repeat size. For DMPK, the deviation was mainly caused by the larger deviation for the repeats <500 bp in size compared to the ones >500 bp in size. Excluding the DMPK samples with repeat sizes <500 bp resulted in an average deviation of 4% between the two sizing workflows. Our results also suggest that it may not be necessary to choose only one of the three OGM repeat workflows, because they can also be used sequentially or in parallel, effectively forming a single method for repeat expansion detection using OGM data in which the analysis workflows complement each other. First, the manual de novo workflow can indicate a potential repeat expansion even beyond the currently specified 500 bp resolution cutoff of SV calling using OGM. Next, the local-GA workflow allows a more targeted size estimate by collecting molecules and aligning these molecules to each other to create a consensus map for only the specific region of interest. The algorithm then determines the size of the expansion in the different maps specifically at the respective locus of interest. Finally, the molecule distance script can separate the two alleles and clearly visualize this separation by plotting individual molecule lengths at the locus of interest. The plots resulting from this latter part are particularly useful for identifying unstable repeat expansions. Altogether, this suggests that the three separate workflows work best in a complementary fashion and all three can be performed locally. The manual de novo assembly workflow can be performed using the Bionano Access analysis software by loading in a pregenerated de novo assembly file; the local-GA and molecule distance script workflows were developed as part of this study and are publicly available (https://github.com/bionanogenomics/local_guided_assembly/ and https://github.com/bionanogenomics/molecule_distance/) (van der Sanden et al. 2024).

The local-GA data not only provide repeat size estimates for both alleles but also generate confidence intervals for each repeat length. In this study, these confidence intervals remained outside of the scope because we only worked with the repeat size estimates that could be equally compared between the two sizing workflows. However, these confidence intervals could be a valuable addition for clinical laboratories when using OGM for repeat expansion detection, because the two different workflows present different repeat sizes and it may be difficult to rationalize which sizes to use. That said, potential somatic instability must be taken into account when using the confidence intervals, which suggests that potential somatic repeat instability should be assessed using the molecule distance script before using the confidence intervals in downstream analyses. In addition, the molecule distance script can be improved in identifying and characterizing somatic expansion alleles by implementing a statistical method to automate the output interpretation. Currently, the somatic instability assessment relies on manual inspection, but an automated model would help to reduce variability in reporting results.

The advantages of OGM over SOC and sequencing methods are not limited to the sizing resolution and detection of somatic instability. Considering the unexpectedly high level of somatic instability, OGM offers another advantage: higher coverage than for GS can routinely be reached, with the latest OGM iterations allowing coverage up to 1500-fold without extra cost (Smith et al. 2023). In addition, OGM only uses natural UHMW DNA molecules that are not sheared and are not subjected to any obvious bias, such as PCR or sequencing bias. Even though the laboratory process for OGM requires up to 5 h of hands-on time and contains multiple incubation steps, this method provides higher accuracy and higher throughput. Moreover, after analyzing the labeled DNA on the Saphyr machine, the results can easily be reanalyzed for different repeat expansion loci, without the need to rerun any sample, while for SOC new PCRs or blots have to be performed. Since some repeat expansion disorders have overlapping phenotypic characteristics and strong heterogeneity of symptoms, this option of analyzing the entire human genome at once is a large benefit, and would allow OGM to become a truly generic test for all established expansion disorders for which expansions lead to SVs >500 bp or even ∼300 bp as shown for the smallest alleles here (DMPK_04). In line with this, 11 additional samples with a repeat expansion in ATXN10 (Morato Torres et al. 2022), C9orf72 (Barseghyan et al. 2022), FXN, NOP56, or STARD7 were also analyzed successfully (Supplemental Table S3), suggesting indeed that OGM is suited for known repeat expansion disorders with a pathogenic repeat size threshold >300 bp. Finally, if a repeat expansion disorder is suspected, but is not confirmed by OGM, the generated de novo assembly still allows the identification of different types of SVs, including other insertions and deletions, as well as inversions and translocations. This makes the method more versatile than other repeat expansion disorder tests in the SOC.

Besides the advantages of OGM over SOC and sequencing efforts, it also has a known limitation, being the inability to provide sequence context for all its SV calls and therefore also for the repeat expansion insertion calls. For certain repeat expansion disorders, the sequence context can be of high importance, since repeat interruptions may influence repeat (in)stability and thereby mitigate the disease severity. Also, for RFC1 repeats, where pathogenic AAGGG and benign AAAAG and AAAGG motifs are known (Table 3), OGM cannot determine which repeat motif is expanded. Therefore, if the sequence context is of importance for the specific repeat expansion disorder, the OGM test still must be complemented with, preferably, (targeted) long-read sequencing (LRS), which adds to the financial considerations that have to be made before choosing OGM as the technology to detect those specific repeat expansions. In general, LRS seems very accurate for the detection of repeat expansions, but performing whole-genome LRS is still very expensive and not yet available to all (clinical or diagnostic) laboratories and thereby not yet feasible as a first-line test for most centers. However, with Oxford Nanopore Technologies’ adaptive sampling and PacBio's PureTarget, two more targeted approaches to detect repeat expansions have become available, which combine deeper coverage and improved cost efficiency. A potential benefit of these targeted approaches, which allow the use of nonamplified DNA molecules, over OGM is the possibility to also assess methylation, a biochemical process that has been shown to contribute to disease development (De Roeck et al. 2019). It remains to be seen if (targeted) long-read sequencing will allow the study of all repeat expansions, as even here some challenges may be expected, such as sequence context (DNA-quadruplexes), and very long expansions beyond the actual read lengths, e.g., CNBP.
In addition, a separate copy number variant or SV in or around the region of interest, as well as variation of the label site can influence the results of the workflows. Therefore, a thorough inspection of the de novo assembly in workflow 1 using the Circos plot or genome browser in the Bionano Access software is of great importance. When there is any indication of another large variant, the results of the different repeat detection workflows must be analyzed with extra care to prevent the reporting of false positive or false negative results. Finally, the ∼300 bp resolution of OGM limits the application of OGM to a subset of all known repeat expansions, which suggests that several disease-associated repeat expansions in genes, such as ATXN1, ATXN3, and HTT, can only partially or not at all be assessed by the presented method (Supplemental Table S4). Therefore, OGM may have to be supplemented with SOC or a (targeted) sequencing approach to test the most important repeat expansion loci with a pathological repeat size threshold between 300 and 500 bp. For repeat sizes <300 bp, an SOC or sequencing-based approach is definitely needed because these repeats are currently beyond OGM's capability.

In conclusion, our data demonstrate that OGM can efficiently and accurately identify the repeat lengths across multiple STR loci simultaneously, thereby detecting large STR expansions and determining their repeat sizes. This supports the technical validity of OGM for the detection of repeat expansion alleles larger than ∼300 bp in size. OGM increased the allele sizing resolution for 84/85 repeat samples, and it indicated 36 samples with suggestive evidence of somatic repeat instability. Our results also suggest that OGM can detect all large repeat expansions >300 bp in size using a single test, which is in contrast to the current SOC that uses multiple gene-specific tests to reach the same conclusions while potentially taking more time and being more expensive. To move toward clinical testing, prospectively designed clinical validity and utility studies may be warranted in addition to our current retrospective technical feasibility study. This study suggests that OGM could serve as an efficient workflow for repeat expansion detection, although (targeted) long-read sequencing approaches, which we have not directly compared, are also emerging. However, whether the efficiency of OGM can compensate for the unavailability of exact sequence context remains to be determined.

Methods

Patient selection

The Department of Human Genetics of the Radboudumc is a referral center for patients with suspected repeat expansion disorders. In total, 85 patients with a known (biallelic) repeat expansion in CNBP (n = 25), DMPK (n = 30), and RFC1 (n = 30) were selected from our patient cohort and anonymized for further use in this study. Further repeat expansion details can be found in Table 3. This study was approved by the Medical Review Ethics Committee Arnhem-Nijmegen under 2011-188 and 2020-7142. Deanonymization and subsequent data sharing of these samples was not allowed by the specific consent, which also made additional genetic analyses, downstream from the application of OGM, not possible.

Table 3.

Repeat expansion details

| Gene | Disease | Inheritance | Location of repeat in gene | Repeat unit | Normal repeat size | Premutation repeat size | Pathogenic repeat size |
|---|---|---|---|---|---|---|---|
| CNBP | DM2 | AD | Intron | CCTG | <27 | 27–74 | >74 |
| DMPK | DM1 | AD | 3′ UTR | CTG | 5–35 | 36–49 | >49 |
| RFC1 | CANVAS | AR | Intron | AAGGG (pathogenic); AAAAG (benign); AAAGG (benign) | 11 (AAAAG) | n/a | >400 (a) |

(a) SOC for RFC1 repeat expansions is not suited to detect full repeat sizes. It uses a combination of locus-spanning PCR, resulting in allelic dropouts for repeats >120 units, and RP-PCR to detect the repeats up to 20 units. For the sake of this technical study, repeat sizes >20 units were already considered expansions irrespective of their pathogenicity. Table adjusted from van der Sanden et al. (2021).

Standard of care tests

PCR and fragment-length analysis, RP-PCR, and Southern blotting for CNBP and DMPK repeat expansions were previously performed as part of routine diagnostic repeat expansion testing according to previously described standard protocols (Kamsteeg et al. 2012). Locus-spanning PCR and RP-PCR for RFC1 repeat expansions were also performed as part of routine diagnostic repeat expansion testing according to the previously described standard protocol (Ghorbani et al. 2022).

DNA isolation, labeling, and optical genome mapping

DNA isolation, labeling, and OGM were performed as described previously (Mantere et al. 2021; Neveling et al. 2021). For each individual, UHMW DNA was isolated from 650 µL of whole peripheral blood (EDTA) or 1–1.5 million cultured cells using the SP Blood and Cell Culture DNA Isolation Kit according to the manufacturer's instructions (Bionano, San Diego, CA, USA). Briefly, cells were treated with a lysis-and-binding buffer (LBB) to release UHMW DNA, which was then bound to a nanobind disk, washed, and eluted in the provided elution buffer. UHMW DNA molecules were labeled with the DLS (Direct Label and Stain) DNA Labeling Kit (Bionano). Direct Label Enzyme (DLE-1) and DL-green fluorophores were used to label 750 ng of UHMW DNA. After a wash-out of the DL-green fluorophore excess, the DNA backbone was counterstained overnight before quantitation. Labeled UHMW DNA was loaded on a Saphyr chip G2.3 for linearization and imaging on the Saphyr instrument (Bionano).

OGM repeat expansion workflows

The entire data analysis was performed as previously described (van der Sanden et al. 2024). In the following section, we summarize only the most important steps in the data analysis process.

The BNX molecule files generated by the Bionano Saphyr machine were sequentially used in three different workflows (Fig. 1).

  1. Manual de novo assembly

  2. Local guided assembly (local-GA)

  3. Molecule distance script

Manual de novo assembly

In the manual de novo assembly workflow, for each individual, a de novo assembly was generated on Solve 3.7.2 and Access 1.7.2 using default parameters against the GRCh38/hg38 reference genome. The de novo assembly was then used to estimate the repeat length for both alleles by calculating the genomic distance between the reference start and end label flanking the repeat locus of interest (Fig. 4A; Supplemental Table S5). The reference length between the two labels of interest was then subtracted from both allele lengths in the sample to get a repeat size estimate for both alleles. These sizes were then divided by the repeat unit length of the respective repeat locus to get the manual de novo assembly size estimates.
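The arithmetic described above can be written out as a minimal sketch. It assumes that the label-to-label distances for the reference and for each sample allele have already been read from the de novo assembly; the function name and the example distances are hypothetical and are not taken from the Bionano software or from Supplemental Table S5.

```python
def repeat_units_from_label_distance(allele_distance_bp,
                                     reference_distance_bp,
                                     repeat_unit_bp):
    """Estimate the repeat size of one allele in repeat units.

    The reference distance between the two flanking labels is subtracted
    from the distance measured in the sample allele, and the difference
    is divided by the repeat unit length, as described in the text.
    """
    if repeat_unit_bp <= 0:
        raise ValueError("repeat_unit_bp must be positive")
    inserted_bp = allele_distance_bp - reference_distance_bp
    return inserted_bp / repeat_unit_bp


# Hypothetical CNBP-like example (CCTG motif, 4 bp per unit).
reference_bp = 10_000          # assumed reference label-to-label distance
allele_distances = [10_012, 14_000]  # assumed distances for the two alleles

for distance in allele_distances:
    units = repeat_units_from_label_distance(distance, reference_bp, 4)
    print(f"{distance} bp between labels -> {units:.1f} repeat units")
```

Because the estimate is a difference of two measured distances, small alleles are dominated by measurement noise, which is in line with the reduced accuracy reported for repeats below ∼300 bp.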

Figure 4.

Representative plots of a sample with evidence and without evidence of somatic instability. The left part represents a stable RFC1 repeat expansion and the right part represents an unstable CNBP repeat expansion. (A) The number of assembled maps at the region of interest in the local-GA data might indicate somatic instability. In this case, the stable repeat had two consensus maps while the unstable repeat had six consensus maps. (B) A gradient of label distance in the molecule pile-up might also indicate mosaicism. The stable repeat had no gradient, while the unstable repeat presented a gradient of label distances based on the large variability in the distance between the red label and black label in each molecule. This variability results in the gradient or “stairway” pattern. (C) The molecule distance script output plots show the repeat expansion size that is detected in each molecule by determining the distance between two specific labels of interest. This bar plot represents the distance between the labels of interest in each molecule ordered from smallest to largest. Molecule distance bar plots with a steep gradient or a stairway distribution of label distances would suggest somatic instability. The stable repeat had no stairway pattern, while the unstable repeat showed a stairway pattern for the expanded allele. The plot for the stable repeat visualizes the separation of the smaller allele and the larger allele around the middle of the plot (molecule number 57). The plot for the unstable repeat visualizes the same separation of the smaller allele and the larger allele (around molecule number 75). (D) The histogram plots outputted by the molecule distance script represent the separation of the two alleles based on the label distances in each molecule. The smaller alleles are indicated with blue peaks and the larger alleles are indicated with orange peaks. 
A “smear” instead of a real peak in the histogram for one of the alleles might indicate somatic instability. For the stable repeat, no smear was detected, while the unstable repeat presented with a “smear” for the expanded allele. This is due to large variability in molecule label distances and therefore repeat expansion size.

Local guided assembly

For the local-GA workflow, the local-GA script was run on the command line with locus-specific seed and coordinate files using default settings (van der Sanden et al. 2024) (https://github.com/bionanogenomics/local_guided_assembly, https://github.com/bionanogenomics/local_guided_assembly/tree/master/seed_files, and https://github.com/bionanogenomics/local_guided_assembly/tree/master/coo_csvs). Each of the output analysis reports lists the consensus map IDs (Fig. 4B) and calculated repeat expansion counts for each of those consensus maps. Maps were subsequently assigned to one of the two different alleles based on the estimated repeat counts. Generally, an output analysis report could contain maps with no or short repeat counts and maps with large repeat counts. For homozygous and biallelic repeat expansions, the maps for both alleles could present large repeat counts. If the local-GA workflow resulted in a single consensus map and only one allele was expanded in the manual de novo assembly workflow for the same sample, the single local-GA consensus map was used as a heterozygous call. If both alleles were expanded in the manual de novo assembly workflow, the single map was used as a homozygous call. For repeat report maps with ambiguous repeat counts, the global mean of repeat counts was used as a cutoff value to assign alleles 1 or 2. Maps reported with “−1” repeat counts were excluded since the repeat counts could not be determined. Resulting repeat lengths were used as local-GA size estimates.
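The allele assignment rule described above can be illustrated with a short sketch. It assumes that the repeat counts per consensus map have already been extracted from a local-GA report into a dictionary, that the "global mean" is the mean over all usable maps in that report, and that a count equal to the mean goes to allele 1. The function name, data layout, and example counts are hypothetical and do not reproduce the actual report format.

```python
from statistics import mean


def assign_alleles(map_repeat_counts):
    """Split consensus maps into two alleles using the mean as cutoff.

    map_repeat_counts maps a consensus map ID to its repeat count.
    Maps with a count of -1 are dropped because their repeat count
    could not be determined.
    """
    usable = {m: c for m, c in map_repeat_counts.items() if c != -1}
    alleles = {"allele_1": [], "allele_2": []}
    if not usable:
        return alleles
    cutoff = mean(usable.values())
    for map_id, count in sorted(usable.items()):
        # Assumption: counts at or below the mean belong to the smaller allele.
        key = "allele_1" if count <= cutoff else "allele_2"
        alleles[key].append(map_id)
    return alleles


# Hypothetical report: two short maps, two expanded maps, one undetermined.
report = {1: 12, 2: 15, 3: 980, 4: 1040, 5: -1}
print(assign_alleles(report))
```

A report with a single usable map always lands in allele 1 under this sketch; in the study, that case was resolved by consulting the manual de novo assembly result instead.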

Molecule distance script

The molecule distance script (https://github.com/bionanogenomics/molecule_distance) workflow was run on the command line and required the intermediate alignmolvref files from the local-GA workflow. This alignmolvref result shows molecules aligned to the reference assembly (GRCh38/hg38). The script subsequently queried the distance between two predefined labels in each molecule (Supplemental Table S6). To successfully calculate the distance between the two labels of interest, only the molecules that contain both labels of interest were considered. Genomic distances were calculated using the distance between the start and end coordinates of the labels of interest in each molecule. The resulting repeat lengths were used as input for generating bar plots and histograms that visualize the repeat lengths to provide evidence for potential somatic instability (Fig. 4C,D).
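The core of the per-molecule measurement can be sketched as follows. This is not the published script and does not parse alignmolvref files; it assumes the aligned labels of each molecule are already available as a dictionary from reference label ID to position along the molecule. All names, label IDs, and positions are hypothetical.

```python
def label_distances(molecules, start_label, end_label):
    """Return (molecule_id, distance) pairs sorted from smallest to largest.

    Only molecules that contain both labels of interest are considered,
    mirroring the filter described in the text.
    """
    distances = []
    for molecule_id, labels in molecules.items():
        if start_label in labels and end_label in labels:
            distance = abs(labels[end_label] - labels[start_label])
            distances.append((molecule_id, distance))
    # Sorting by distance gives the order used for the bar plots.
    return sorted(distances, key=lambda item: item[1])


# Hypothetical molecules: reference label ID -> position (bp) on the molecule.
molecules = {
    "m1": {101: 5000.0, 102: 9200.0},
    "m2": {101: 7000.0},               # lacks label 102, so it is skipped
    "m3": {101: 1000.0, 102: 4100.0},
}

for molecule_id, distance in label_distances(molecules, 101, 102):
    print(molecule_id, distance)
```

Plotting the sorted distances as bars, or binning them into a histogram, would give the kind of views shown in Fig. 4C,D.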

OGM repeat data interpretation

First, we determined for the manual de novo assembly workflow and local-GA workflow whether a repeat expansion in the locus of interest of each respective sample was detected. A repeat expansion was considered detected when the workflow result indicated that the longest allele was expanded beyond a gene-specific repeat size threshold. For CNBP and DMPK the pathogenic repeat size threshold was used as the gene-specific threshold, while for RFC1 a repeat size threshold of 20 repeat units was used (Table 1). Subsequently, for the RFC1 samples, we assessed whether the results of the SOC corresponded with the results of the two OGM sizing workflows. For each detected RFC1 repeat expansion, we determined whether it was monoallelic, biallelic, or homozygous by comparing the detected repeat size(s) to the respective gene-specific repeat size thresholds. The results of the two OGM workflows were then independently compared to the results of the SOC. Both OGM workflows had to indicate the same type of repeat as the SOC. If SOC reported a homozygous repeat expansion, OGM was allowed to identify either a homozygous or a biallelic repeat expansion. Finally, the actual repeat sizes resulting from the manual de novo assembly workflow and the local-GA workflow were compared to the repeat sizes reported after SOC. For each sample, we determined whether at least one of the two OGM workflows identified a repeat expansion larger than or equal to the SOC result.
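The interpretation rules above can be summarized in a small sketch. The thresholds are the ones stated in the text and in Table 3 (pathogenic sizes >74 units for CNBP and >49 units for DMPK, and >20 units for RFC1); the function names and the string labels for zygosity are illustrative assumptions rather than part of the published analysis.

```python
# Gene-specific thresholds in repeat units (an allele counts as expanded
# when its size exceeds the threshold).
THRESHOLD_UNITS = {"CNBP": 74, "DMPK": 49, "RFC1": 20}


def expansion_detected(gene, allele_units):
    """True when the longest allele exceeds the gene-specific threshold."""
    return max(allele_units) > THRESHOLD_UNITS[gene]


def zygosity_concordant(soc_call, ogm_call):
    """Compare the type of expansion called by SOC and by an OGM workflow.

    A homozygous SOC call may be matched by either a homozygous or a
    biallelic OGM call; all other calls must agree exactly.
    """
    if soc_call == "homozygous":
        return ogm_call in {"homozygous", "biallelic"}
    return soc_call == ogm_call


def sized_at_least_soc(soc_units, de_novo_units, local_ga_units):
    """True when at least one OGM workflow reports a size >= the SOC size."""
    return de_novo_units >= soc_units or local_ga_units >= soc_units


print(expansion_detected("DMPK", [12, 88]))          # longest allele > 49
print(zygosity_concordant("homozygous", "biallelic"))
print(sized_at_least_soc(400, 380, 450))
```

Keeping the three checks separate mirrors the order of the text: detection first, then type of expansion for RFC1, then the size comparison against SOC.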

Detecting somatic instability

To identify potential somatic instability, multiple checks were performed. Firstly, the number of assembled maps at the region of interest in the local-GA data might indicate mosaicism (Fig. 3A). Stable repeat expansions usually form two maps during local-GA, indicating the reference and expanded allele. Additional maps are formed by molecules of unstable repeats clustered by the pipelines. Secondly, in the Bionano Access genome browser view, the molecule alignments to each of the assembled local-GA maps were visualized to search for a “gradient” of label distance in the molecule pile-up (Fig. 3B). Such a gradient might also indicate mosaicism. Finally, the molecule-to-reference alignment plots—or molecule distance plots—generated by the molecule distance script were examined for evidence of unstable alleles. When the expanded allele portion of a stable repeat locus is visualized using the molecule distance script, the molecule distances plateau at a certain length. Molecule distance bar plots with a steep gradient or a “stairway” distribution of label distances and histograms with a “smear” instead of a peak, would suggest somatic instability (Fig. 3C,D). We considered the data suggestive of somatic instability if a sample had both multiple consensus maps and a gradient distribution of molecule distances.
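The final decision rule can be expressed as a one-line check. The sketch below assumes that "multiple consensus maps" means more than the two maps expected for a stable heterozygous locus, and that the presence of a gradient is supplied as the outcome of manual inspection, since the study did not automate that step. The function and parameter names are hypothetical.

```python
# Number of local-GA maps expected for a stable repeat (one per allele).
EXPECTED_STABLE_MAPS = 2


def suggests_somatic_instability(n_consensus_maps, gradient_observed):
    """Apply the combined criterion described in the text.

    n_consensus_maps: number of local-GA consensus maps at the locus.
    gradient_observed: True when manual inspection of the molecule
        distance plots showed a gradient or stairway pattern.
    """
    has_extra_maps = n_consensus_maps > EXPECTED_STABLE_MAPS
    return has_extra_maps and gradient_observed


# Values resembling the two examples in Figure 4.
print(suggests_somatic_instability(2, False))  # stable repeat
print(suggests_somatic_instability(6, True))   # unstable repeat
```

Requiring both signals keeps a sample with extra maps but no visible gradient, or the reverse, out of the "suggestive" group.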

Data access

The optical genome mapping data generated in this study have been uploaded to the Radboud Data Repository (https://data.ru.nl/). These data can be accessed at https://doi.org/10.34973/c48g-kv10. Access to this data set will be granted to research institutions for academic purposes following a request made to the Data Access Committee. The local guided assembly and the molecule distance scripts are available at GitHub (https://github.com/bionanogenomics/local_guided_assembly/blob/master/run_local_guided_assembly.sh and https://github.com/bionanogenomics/molecule_distance/, respectively) and as Supplemental Codes 1 and 2, respectively.

Acknowledgments

We acknowledge colleagues from the diagnostic division of the Radboudumc (Genome Diagnostics Nijmegen) as well as the Radboud Genomics Technology Center for their support. A.Ho. was supported by a ZonMW (The Netherlands Organization for Health Research and Development) Vici grant (No. 09150182310053). L.E.L.M.V. and A.Ho. were supported by the Solve-RD project. The Solve-RD project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 779257. The aims of this study contribute to the ERDERA project, which has received funding from the European Union's Horizon Europe research and innovation program under grant agreement no. 101156595. The aims of this study contribute to the PPP project OGM-NGC. This research was part of the Netherlands X-omics Initiative and partially funded by NWO (Dutch Research Council, 184.034.019).

Author contributions: Conceptualization: E.-J.K. and A.Ho.; Data curation: B.v.d.S., K.N., S.S., M.D.G., J.L., and A.W.C.P.; Formal analysis: B.v.d.S., K.N., S.S., M.D.G., J.L., M.P., R.v.B., M.O., E.K.-B., E.K., and A.W.C.P.; Funding acquisition: A.Ho.; Investigation: K.N., S.S., M.D.G., J.L., M.P., R.v.B., M.O., E.K.-B., E.K., and A.W.C.P.; Methodology: B.v.d.S., S.S., M.D.G., J.L., S.L.B., A.W.C.P., and A.Ha.; Project administration: B.v.d.S., E.-J.K., and A.Ho.; Resources: A.A.T., N.C.V., I.E.S., J.G., M.A.C., A.Ha., and A.Ho.; Software: S.S., M.D.G., J.L., A.W.C.P., and A.Ha.; Supervision: L.E.L.M.V., A.Ha., E.-J.K., and A.Ho.; Validation: B.v.d.S., S.S., M.D.G., J.L., and A.W.C.P.; Visualization: B.v.d.S., S.S., S.L.B., and A.W.C.P.; Writing—original draft: B.v.d.S., K.N., E.-J.K., and A.Ho.; Writing—review and editing: B.v.d.S., K.N., S.S., M.D.G., J.L., S.L.B., A.A.T., N.C.V., A.W.C.P., and A.Ho. All authors have contributed to the manuscript and have read and approved the final version of the manuscript.

Footnotes

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279491.124.

Freely available online through the Genome Research Open Access option.

Competing interest statement

S.S., M.D.G., S.L.B., A.W.C.P., and A.Ha. are employees and shareholders of Bionano Genomics, a company commercializing an optical genome mapping technology. J.L. is a former employee of Bionano Genomics. The remaining authors declare that they have no competing interests.

References

  1. Alfano M, De Antoni L, Centofanti F, Visconti VV, Maestri S, Degli Esposti C, Massa R, D'Apice MR, Novelli G, Delledonne M, et al. 2022. Characterization of full-length CNBP expanded alleles in myotonic dystrophy type 2 patients by Cas9-mediated enrichment and nanopore sequencing. Elife 11: e80229. 10.7554/eLife.80229 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Barseghyan H, Pang AWC, Zhang Y, Sahajpal NS, Delpu Y, Lai C-YJ, Lee J, Tessereau C, Oldakowski M, Kolhe RB, et al. 2022. Neurogenetic variant analysis by optical genome mapping for structural variation detection-balanced genomic rearrangements, copy number variants, and repeat expansions/contractions. In Genomic structural variants in nervous system disorders (ed. Proukakis C), pp. 155–172. Springer, New York. [Google Scholar]
  3. Chiu R, Rajan-Babu IS, Friedman JM, Birol I. 2021. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol 22: 224. 10.1186/s13059-021-02447-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cumming SA, Hamilton MJ, Robb Y, Gregory H, McWilliam C, Cooper A, Adam B, McGhie J, Hamilton G, Herzyk P, et al. 2018. De novo repeat interruptions are associated with reduced somatic instability and mild or absent clinical features in myotonic dystrophy type 1. Eur J Hum Genet 26: 1635–1647. 10.1038/s41431-018-0156-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Currò R, Dominik N, Facchini S, Vegezzi E, Sullivan R, Galassi Deforie V, Fernández-Eulate G, Traschütz A, Rossi S, Garibaldi M, et al. 2024. Role of the repeat expansion size in predicting age of onset and severity in RFC1 disease. Brain 147: 1887–1898. 10.1093/brain/awad436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, Davis M, Lamont P, Clayton JS, Laing NG, et al. 2018. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol 19: 121. 10.1186/s13059-018-1505-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Depienne C, Mandel J-L. 2021. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet 108: 764–785. 10.1016/j.ajhg.2021.03.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. De Roeck A, De Coster W, Bossaerts L, Cacace R, De Pooter T, Van Dongen J, D'Hert S, De Rijk P, Strazisar M, Van Broeckhoven C, et al. 2019. NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION. Genome Biol 20: 239. 10.1186/s13059-019-1856-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Dolzhenko E, Deshpande V, Schlesinger F, Krusche P, Petrovski R, Chen S, Emig-Agius D, Gross A, Narzisi G, Bowman B, et al. 2019. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35: 4754–4756. doi: 10.1093/bioinformatics/btz431
  10. Dolzhenko E, English A, Dashnow H, De Sena Brandine G, Mokveld T, Rowell WJ, Karniski C, Kronenberg Z, Danzi MC, Cheung WA, et al. 2024. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol 42: 1606–1614. doi: 10.1038/s41587-023-02057-3
  11. Facchini S, Dominik N, Manini A, Efthymiou S, Currò R, Rugginini B, Vegezzi E, Quartesan I, Perrone B, Kutty SK, et al. 2023. Optical Genome Mapping enables detection and accurate sizing of RFC1 repeat expansions. Biomolecules 13: 1546. doi: 10.3390/biom13101546
  12. Ghorbani F, de Boer-Bergsma J, Verschuuren-Bemelmans CC, Pennings M, de Boer EN, Kremer B, Vanhoutte EK, de Vries JJ, van de Berg R, Kamsteeg EJ, et al. 2022. Prevalence of intronic repeat expansions in RFC1 in Dutch patients with CANVAS and adult-onset ataxia. J Neurol 269: 6086–6093. doi: 10.1007/s00415-022-11275-9
  13. Giesselmann P, Brändl B, Raimondeau E, Bowen R, Rohrandt C, Tandon R, Kretzmer H, Assum G, Galonska C, Siebert R, et al. 2019. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat Biotechnol 37: 1478–1481. doi: 10.1038/s41587-019-0293-x
  14. Gomes-Pereira M, Fortune MT, Ingram L, McAbney JP, Monckton DG. 2004. Pms2 is a genetic enhancer of trinucleotide CAG.CTG repeat somatic mosaicism: implications for the mechanism of triplet repeat expansion. Hum Mol Genet 13: 1815–1825. doi: 10.1093/hmg/ddh186
  15. Goold R, Hamilton J, Menneteau T, Flower M, Bunting EL, Aldous SG, Porro A, Vicente JR, Allen ND, Wilkinson H, et al. 2021. FAN1 controls mismatch repair complex assembly via MLH1 retention to stabilize CAG repeat expansion in Huntington's disease. Cell Rep 36: 109649. doi: 10.1016/j.celrep.2021.109649
  16. Guruju NM, Jump V, Lemmers R, Van Der Maarel S, Liu R, Nallamilli BR, Shenoy S, Chaubey A, Koppikar P, Rose R, et al. 2023. Molecular diagnosis of facioscapulohumeral muscular dystrophy in patients clinically suspected of FSHD using optical genome mapping. Neurol Genet 9: e200107. doi: 10.1212/NXG.0000000000200107
  17. Gymrek M. 2017. A genomic view of short tandem repeats. Curr Opin Genet Dev 44: 9–16. doi: 10.1016/j.gde.2017.01.012
  18. Gymrek M, Golan D, Rosset S, Erlich Y. 2012. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res 22: 1154–1162. doi: 10.1101/gr.135780.111
  19. Halman A, Oshlack A. 2020. Accuracy of short tandem repeats genotyping tools in whole exome sequencing data. F1000Res 9: 200. doi: 10.12688/f1000research.22639.1
  20. Höijer I, Tsai YC, Clark TA, Kotturi P, Dahl N, Stattin EL, Bondeson ML, Feuk L, Gyllensten U, Ameur A. 2018. Detailed analysis of HTT repeat elements in human blood using targeted amplification-free long-read sequencing. Hum Mutat 39: 1262–1272. doi: 10.1002/humu.23580
  21. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921. doi: 10.1038/35057062
  22. Kamsteeg EJ, Kress W, Catalli C, Hertz JM, Witsch-Baumgartner M, Buckley MF, van Engelen BG, Schwartz M, Scheffer H. 2012. Best practice guidelines and recommendations on the molecular diagnosis of myotonic dystrophy types 1 and 2. Eur J Hum Genet 20: 1203–1208. doi: 10.1038/ejhg.2012.108
  23. Loose M, Malla S, Stout M. 2016. Real-time selective sequencing using nanopore technology. Nat Methods 13: 751–754. doi: 10.1038/nmeth.3930
  24. Mantere T, Kersten S, Hoischen A. 2019. Long-read sequencing emerging in medical genetics. Front Genet 10: 426. doi: 10.3389/fgene.2019.00426
  25. Mantere T, Neveling K, Pebrel-Richard C, Benoist M, van der Zande G, Kater-Baats E, Baatout I, van Beek R, Yammine T, Oorsprong M, et al. 2021. Optical genome mapping enables constitutional chromosomal aberration detection. Am J Hum Genet 108: 1409–1422. doi: 10.1016/j.ajhg.2021.05.012
  26. Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N. 2019. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol 20: 58. doi: 10.1186/s13059-019-1667-6
  27. Miyatake S, Koshimizu E, Fujita A, Doi H, Okubo M, Wada T, Hamanaka K, Ueda N, Kishida H, Minase G, et al. 2022. Rapid and comprehensive diagnostic method for repeat expansion diseases using nanopore sequencing. NPJ Genom Med 7: 62. doi: 10.1038/s41525-022-00331-y
  28. Monckton DG, Wong LJ, Ashizawa T, Caskey CT. 1995. Somatic mosaicism, germline expansions, germline reversions and intergenerational reductions in myotonic dystrophy males: small pool PCR analyses. Hum Mol Genet 4: 1–8. doi: 10.1093/hmg/4.1.1
  29. Morales F, Corrales E, Vásquez M, Zhang B, Fernández H, Alvarado F, Cortés S, Santamaría-Ulloa C, Marigold Myotonic Dystrophy Biomarkers Discovery Initiative-Mmdbdi, Krahe R, et al. 2023. Individual-specific levels of CTG•CAG somatic instability are shared across multiple tissues in myotonic dystrophy type 1. Hum Mol Genet 32: 621–631. doi: 10.1093/hmg/ddac231
  30. Morato Torres CA, Zafar F, Tsai YC, Vazquez JP, Gallagher MD, McLaughlin I, Hong K, Lai J, Lee J, Chirino-Perez A, et al. 2022. ATTCT and ATTCC repeat expansions in the ATXN10 gene affect disease penetrance of spinocerebellar ataxia type 10. HGG Adv 3: 100137. doi: 10.1016/j.xhgg.2022.100137
  31. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. 2019. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res 47: e90. doi: 10.1093/nar/gkz501
  32. Neveling K, Mantere T, Vermeulen S, Oorsprong M, van Beek R, Kater-Baats E, Pauper M, van der Zande G, Smeets D, Weghuis DO, et al. 2021. Next-generation cytogenetics: comprehensive assessment of 52 hematological malignancy genomes by optical genome mapping. Am J Hum Genet 108: 1423–1435. doi: 10.1016/j.ajhg.2021.06.001
  33. Nolin SL, Glicksman A, Tortora N, Allen E, Macpherson J, Mila M, Vianna-Morgante AM, Sherman SL, Dobkin C, Latham GJ, et al. 2019. Expansions and contractions of the FMR1 CGG repeat in 5,508 transmissions of normal, intermediate, and premutation alleles. Am J Med Genet A 179: 1148–1156. doi: 10.1002/ajmg.a.61165
  34. Paulson H. 2018. Repeat expansion diseases. Handb Clin Neurol 147: 105–123. doi: 10.1016/B978-0-444-63233-3.00009-9
  35. Pellerin D, Danzi MC, Wilke C, Renaud M, Fazal S, Dicaire MJ, Scriba CK, Ashton C, Yanick C, Beijer D, et al. 2023. Deep intronic FGF14 GAA repeat expansion in late-onset cerebellar ataxia. N Engl J Med 388: 128–141. doi: 10.1056/NEJMoa2207406
  36. Rudaks LI, Yeow D, Ng K, Deveson IW, Kennerson ML, Kumar KR. 2024. An update on the adult-onset hereditary cerebellar ataxias: novel genetic causes and new diagnostic approaches. Cerebellum 23: 2152–2168. doi: 10.1007/s12311-024-01703-z
  37. Ruiz de Sabando A, Ciosi M, Galbete A, Cumming SA, Álvarez V, Martinez-Descals A, Mila M, Trujillo-Tiebas MJ, López-Sendón JL, Fenollar-Cortés M, et al. 2024. Somatic CAG repeat instability in intermediate alleles of the HTT gene and its potential association with a clinical phenotype. Eur J Hum Genet 32: 770–778. doi: 10.1038/s41431-024-01546-6
  38. Smith AC, Hoischen A, Raca G. 2023. Cytogenetics is a science, not a technique! Why optical genome mapping is so important to clinical genetic laboratories. Cancers (Basel) 15: 5470. doi: 10.3390/cancers15225470
  39. Sone J, Mitsuhashi S, Fujita A, Mizuguchi T, Hamanaka K, Mori K, Koike H, Hashiguchi A, Takashima H, Sugiyama H, et al. 2019. Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat Genet 51: 1215–1221. doi: 10.1038/s41588-019-0459-y
  40. Srivastava S, Love-Nichols JA, Dies KA, Ledbetter DH, Martin CL, Chung WK, Firth HV, Frazier T, Hansen RL, Prock L, et al. 2019. Meta-analysis and multidisciplinary consensus statement: exome sequencing is a first-tier clinical diagnostic test for individuals with neurodevelopmental disorders. Genet Med 21: 2413–2421. doi: 10.1038/s41436-019-0554-6
  41. Stevanovski I, Chintalaphani SR, Gamaarachchi H, Ferguson JM, Pineda SS, Scriba CK, Tchan M, Fung V, Ng K, Cortese A, et al. 2022. Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing. Sci Adv 8: eabm5386. doi: 10.1126/sciadv.abm5386
  42. Swami M, Hendricks AE, Gillis T, Massood T, Mysore J, Myers RH, Wheeler VC. 2009. Somatic expansion of the Huntington's disease CAG repeat in the brain is associated with an earlier age of disease onset. Hum Mol Genet 18: 3039–3047. doi: 10.1093/hmg/ddp242
  43. Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, et al. 2017. Profiling of short-tandem-repeat disease alleles in 12,632 human whole genomes. Am J Hum Genet 101: 700–715. doi: 10.1016/j.ajhg.2017.09.013
  44. Tankard RM, Bennett MF, Degorski P, Delatycki MB, Lockhart PJ, Bahlo M. 2018. Detecting expansions of tandem repeats in cohorts sequenced with short-read sequencing data. Am J Hum Genet 103: 858–873. doi: 10.1016/j.ajhg.2018.10.015
  45. Tanudisastro HA, Deveson IW, Dashnow H, MacArthur DG. 2024. Sequencing and characterizing short tandem repeats in the human genome. Nat Rev Genet 25: 460–475. doi: 10.1038/s41576-024-00692-3
  46. van der Sanden BPGH, Corominas J, de Groot M, Pennings M, Meijer RPP, Verbeek N, van de Warrenburg B, Schouten M, Yntema HG, Vissers LELM, et al. 2021. Systematic analysis of short tandem repeats in 38,095 exomes provides an additional diagnostic yield. Genet Med 23: 1569–1573. doi: 10.1038/s41436-021-01174-1
  47. van der Sanden B, Neveling K, Pang AWC, Shukor S, Gallagher MD, Burke SL, Kamsteeg E-J, Hastie A, Hoischen A. 2024. Optical genome mapping for applications in repeat expansion disorders. Curr Protoc 4: e1094. doi: 10.1002/cpz1.1094
  48. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. 2017. Genome-wide profiling of heritable and de novo STR variations. Nat Methods 14: 590–592. doi: 10.1038/nmeth.4267
  49. Wong LJ, Ashizawa T, Monckton DG, Caskey CT, Richards CS. 1995. Somatic heterogeneity of the CTG repeat in myotonic dystrophy is age and size dependent. Am J Hum Genet 56: 114–122.

Associated Data

Supplementary Materials

Supplement 1
Supplement 2
Supplemental_Code1.zip (319.5KB, zip)
Supplement 3
Supplemental_Code2.zip (1.4MB, zip)

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press
