Abstract
Microsatellites are abundant genomic elements that contribute to genetic diversity and disease-associated regulatory variation. Although long-read sequencing enables accurate resolution of repetitive regions, computational methods for fully resolved microsatellite genotyping remain limited. Here, we introduce variant motif where (vmwhere), a computational framework for identifying, genotyping, decomposing, and visualizing complex tetrameric microsatellites from long-read sequencing data. Using simulated error-free reads, vmwhere accurately measures several genotyping metrics, including allele length, repeat length, maximum consecutive repeat length, and motif density. Applied to long-read whole-genome sequencing data, vmwhere identified sequence interruptions, motif-specific differences in repeat architecture, and ancestry-associated allele variation, including long repeat alleles that exceed short-read sequencing limitations. We applied vmwhere to GGAA microsatellites in Ewing sarcoma, an aggressive pediatric cancer driven by the EWS-FLI1 fusion oncoprotein, which binds to microsatellites and remodels chromatin. Genome-wide integration of long-read-defined microsatellite architecture with chromatin accessibility and EWS-FLI1 binding revealed that GGAA repeat structure was associated with chromatin state, with longer consecutive repeat microsatellites exhibiting increased EWS-FLI1 binding and chromatin accessibility. Cell line–specific expansions and contractions of GGAA microsatellite repeat length were associated with gains and losses of chromatin accessibility. Further, we identified haplotype-specific chromatin states, with preferential binding and accessibility at longer alleles. Together, these results establish vmwhere as a scalable framework for resolving population-level microsatellite variation and linking repeat architecture to chromatin state.
Repeat structure and length characteristics provide insights into genotype–function relationships at microsatellite repeats in cancer.
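To make the genotyping metrics named in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the vmwhere implementation, and the abstract does not define these metrics precisely. The function name `microsatellite_metrics`, the dictionary keys, and the working definitions (allele length as sequence length in bp, repeat count as the number of non-overlapping motif copies, maximum consecutive repeats as the longest uninterrupted tandem run, motif density as the fraction of bases covered by motif copies) are all assumptions chosen for illustration.

```python
import re


def microsatellite_metrics(seq: str, motif: str = "GGAA") -> dict:
    """Toy metrics for one allele sequence (hypothetical definitions, not vmwhere's)."""
    seq = seq.upper()
    motif = motif.upper()
    pattern = re.escape(motif)

    # Assumed definition: count non-overlapping motif copies, scanning left to right.
    motif_copies = len(re.findall(pattern, seq))

    # Assumed definition: longest uninterrupted tandem run of the motif, in copies.
    runs = re.findall(f"(?:{pattern})+", seq)
    max_consecutive = max((len(run) // len(motif) for run in runs), default=0)

    # Assumed definition: fraction of allele bases covered by motif copies.
    density = motif_copies * len(motif) / len(seq) if seq else 0.0

    return {
        "allele_length_bp": len(seq),
        "motif_copies": motif_copies,
        "max_consecutive_repeats": max_consecutive,
        "motif_density": density,
    }


# Example: three GGAA copies, a single-base interruption (T), then two more copies.
allele = "GGAA" * 3 + "T" + "GGAA" * 2
metrics = microsatellite_metrics(allele)
print(metrics)
```

In this toy allele the single-base interruption leaves the total motif count at five but caps the longest consecutive run at three, which is the kind of distinction between overall repeat content and uninterrupted repeat length that the abstract relates to EWS-FLI1 binding and chromatin accessibility.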
