Abstract
Variable Number Tandem Repeats (VNTRs) are tandem repeat (TR) loci that vary in copy number across a population. Using our program, VNTRseek, we analyzed human whole genome sequencing datasets from 2770 individuals in order to detect minisatellite VNTRs, i.e., those with pattern sizes ≥7 bp. We detected 35 638 VNTR loci and classified 5676 as commonly polymorphic (i.e. with non-reference alleles occurring in >5% of the population). Commonly polymorphic VNTR loci were found to be enriched in genomic regions with regulatory function, i.e. transcription start sites and enhancers. Investigation of the commonly polymorphic VNTRs in the context of population ancestry revealed that 1096 loci contained population-specific alleles and that those could be used to classify individuals into super-populations with near-perfect accuracy. Search for quantitative trait loci (eQTLs), among the VNTRs proximal to genes, indicated that in 187 genes expression differences correlated with VNTR genotype. We validated our predictions in several ways, including experimentally, through the identification of predicted alleles in long reads, and by comparisons showing consistency between sequencing platforms. This study is the most comprehensive analysis of minisatellite VNTRs in the human population to date.
INTRODUCTION
Over 50% of the human genome consists of repetitive DNA sequence (1,2) and tandem repeats (TRs) comprise one such class. A TR consists of a pattern of nucleotides repeated two or more times in succession. TR loci are defined by their position on the genome, the sequence and length of their repeat unit, and copy number.
TRs are commonly divided into three classes based on pattern size: short tandem repeats (STRs), or microsatellites, with pattern lengths of six or fewer base pairs (bp), minisatellites with patterns ranging from seven to several hundred bp, and macrosatellites with patterns from hundreds to thousands of bp (3,4). This study focuses on the TR minisatellite class, which comprises more than a million loci in the human genome.
Many TRs appear to be monoallelic with regard to copy number. However a significant fraction exhibit copy number variability in the population and these variant TR loci are called Variable Number Tandem Repeats (VNTRs). Changes in VNTR copy number have been proposed to arise by slipped strand mispairing (5–7), unequal crossover (8,9), and gene conversion (8,10).
VNTRs are highly mutable, with germline mutation rates estimated between 10−3 and 10−7 per cell division (11–15). This mutation rate, which far exceeds that of SNPs, makes VNTRs useful for DNA fingerprinting (16–18). VNTRs have also been predicted to have high heterozygosity, ranging from 43% to 59% (19), and the copy numbers of several VNTR loci have been shown to be population-specific in humans (20,21), suggesting that these VNTRs may be useful for population wide studies.
More than half of previously identified human VNTR loci (22) are located near or within genes (23), and so their potential effects on gene expression or protein products are substantial. Indeed VNTRs have been associated with changes in levels of gene expression (24), including tissue specific expression (25,26).
Minisatellite VNTRs have been proposed to regulate genes in several ways. In promoters, they can carry binding sites for transcription factors such as NF-κB and myc/HLH (27,28), permitting copy number to affect transcription factor binding and therefore level of transcription (29–32). VNTRs in introns have been shown to contain enhancer sequences (33–35) or to cause differential splicing (36,37). VNTRs in exons can also affect transcription (38,39) and the stability or translation rate of the resulting proteins (40).
Microsatellites have been extensively studied (41–45) and linked to various diseases and cancer (46–49). For example, in Ewing sarcoma, the EWS-FLI fusion protein activation of enhancer regions at GGAA microsatellite repeats contributes to tumor development (50–52) and repression of these enhancers impairs tumor growth (51).
Furthermore, minisatellite VNTRs have been associated with a variety of diseases (53–55), including neurodegenerative disorders such as Alzheimer’s disease (36,56,57) and Huntington’s disease (28,58), and other psychiatric conditions, such as PTSD (59), ADHD (60,61), depression (62) and addiction (63). VNTRs have been shown to be risk factors in various cancers (64–74) and have been linked to cancer prognosis and outcome (75–80). Commercial cancer diagnosis kits using minisatellite VNTRs have been introduced (81) and it has been proposed that VNTRs associated with cancers be used for targeted sequencing in personalized therapies (82–84).
VNTRs have been proposed as drivers of phenotypic variation in evolution (85–87). For example, an examination of functional enrichment in genes containing TRs or VNTRs, in both humans and apes, found that genes with ape-specific TR variability were associated with the senses of taste and smell, while genes with human-specific TR variability or human-specific copy number were associated with neurogenesis and neural development. These results suggest that VNTR polymorphisms may help account for ‘human-specific cognitive traits’ (24). Additionally, the Eichler group (87) has examined TR loci on human and ape genome assemblies from PacBio sequencing data and identified 1584 human-specific VNTR loci with 52 as candidate regions associated with disease.
Despite their biological significance, until recently, relatively few human minisatellite VNTRs have been identified and studied in detail. The ever–increasing availability of accurate whole genome sequencing (WGS) data, however, provides extensive opportunity for high throughput, genome-wide VNTR genotyping. Further, the emergence of PCR-free WGS datasets is reducing locus selection bias and enabling better filtering of false positive VNTR variants.
Nonetheless, genotyping variability in repeat sites remains challenging (88,89). Although a number of tools have been designed to detect microsatellite copy number variability such as lobSTR (90), popSTR (91), hipSTR (92), GangSTR (93) and ExpansionHunter (94), only two high-throughput tools are available for minisatellite genotyping. The adVNTR tool (95) trains a Hidden Markov Model (HMM) for each VNTR locus of interest and has been used to predict variability in 2944 VNTRs intersecting coding regions.
VNTRseek (96), developed in our lab, uses the Tandem Repeats Finder (TRF) (97) to detect and characterize TRs inside reads and then maps read TRs to TRs in a reference set. Because it builds pattern profiles before mapping, VNTRseek is robust in the presence of SNPs and small indels. While VNTRseek has high precision, it has two major limitations. It only detects minisatellites which contain a minimum of 1.8–1.9 pattern copies (depending on pattern length), a limitation from the Tandem Repeats Finder (TRF) software, and the tandem array, plus short flanking sequences, must fit completely inside a read. This means that arrays longer than the read length cannot be detected. Given these limitations, the results we report underestimate the true polymorphic nature of minisatellite TRs.
In this paper, we present the most comprehensive catalog of minisatellite VNTRs in the human genome to date, pooling results for WGS datasets from 2770 individuals, processed with VNTRseek on the GRCh38 human reference genome. We report a large collection of previously unknown VNTR loci, find that many VNTR loci and alleles are common in the population, show that VNTR loci are enriched in gene and regulatory sequences, provide evidence of gene expression differences correlated with VNTR genotype and evidence of population-specific VNTR alleles.
MATERIALS AND METHODS
Datasets
Datasets comprising 2801 PCR-free, WGS samples from 2,770 individuals were used in this study (Table 1): 30 individuals from the 1000 Genomes Project Phase 3 (98), including the Utah (CEU) and Yoruban (YRI) trios (mother-father-child); 2,504 unrelated individuals mostly overlapping with the 1000 Genomes Project, recently sequenced at >30× coverage by the New York Genome Center (NYGC); 253 individuals from the Simons Genome Diversity Project (SGDP) (43), seven individuals sequenced by the Genome in a Bottle (GIAB) Consortium (99), including the Chinese (HAN) and Ashkenazi Jewish (AJ) trios and NA12878 (with ID HG001); two ‘haploid’ hydatidiform mole cell line genomes, CHM1 (100) and CHM13 (101); tumor/normal tissues from two unrelated individuals with breast cancer (breast invasive ductal carcinoma cell line/lymphoblastoid cell line) from the Illumina Basespace public WGS datasets (102); and the AJ child sequenced with PacBio Circular Consensus Sequencing (CCS) reads (103). Duplicates of 27 genomes were present in two datasets, 1000 Genomes and NYGC. One of these, NA12878, was also included in the GIAB dataset.
Table 1.
Data
| Data source | Read length (bp) | Read coverage | Samples in set | Ref. | |
|---|---|---|---|---|---|
| 1000 Genomes Phase 3 HC | Yoruban (YRI) trio | 250 | 71–73× | 3 | (98) |
| Utah (CEU) trio | 250 | 55–63× | 3 | ||
| Others | 250 | 33–66× | 24 | ||
| New York Genome Center (NYGC) | 150 | 29–101× | 2504 | * | |
| Simon’s Genome Diversity Project (SGPD) | 100 | 33–133× | 253 | (43) | |
| Genome In A Bottle (GIAB) | Ashkenazi Jewish (AJ) trio | 250 | 61–69× | 3 | (99) |
| Chinese (HAN) trio | 148/250 | 111–333× | 3 | ||
| NA12878 (HG001) | 148 | 291× | 1 | ||
| Haploid genomes | CHM1 | 148 | 40× | 1 | (100) |
| CHM13 | 250 | 128× | 1 | (101) | |
| Illumina basespace | Tumor/Normal | 101 | 38–88× | 4 | (102) |
| PacBio | Ashkenazi Jewish (AJ) child | ∼13 500 | 27× | 1 | (103) |
2801 publicly available WGS samples, for 2770 individuals, were used in this study. Read coverage was calculated as the product of the number of reads and the average read length, divided by the haploid genome size, as in the Lander/Waterman equation (104). All coverage values are approximate. The 1000 Genomes Phase 3 samples were released in 2015. The NYGC samples were released in 2020 by the New York Genome Center (NYGC). For the Simons Genome Diversity Project (SGDP), released in 2016, only datasets which were not present in the the 1000 Genomes datasets were used. The PacBio data were used only for comparison and validation purposes but not for our VNTR results.*These data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1.
Overall, read coverage ranged from approximately 27×, in the PacBio sample to 333×, in the GIAB Chinese child. Besides the PacBio data, reads consisted of three lengths, 100/101 bp (257 samples), 148/150 bp (2508 samples), and 250 bp (35 samples). All data were downloaded as raw fastq files, except for the PacBio data which were obtained as a BAM file with reads aligned to GRCh37. SRA links to the data are given in Table 4.
Table 4.
Dataset sources
| Dataset (individuals) | URL or accession numbers |
|---|---|
| 1000 Genomes phase 3 HC (30) | EBI: 1000 Genomes Project |
| NYGC (2504) | Data collection 30X on GRCh38 ENA project: PRJEB31736 |
| SGDP (253) | IGSR: Data collection SGDP |
| GIAB (7) | NCBI SRA project: SRP047086 |
| CHM1 (1) | NCBI SRA: SRX652547 |
| CHM13 (1) | NCBI SRA: SRX1009644 |
| Tumor/Normal Pairs (4) | Illumina basespace project |
| PacBio (1) | NCBI SRA: SRX5327410 |
The majority of the analyses in this paper were performed on the 2504 genomes from NYGC. The 253 genomes from SGDP provided insight into under-represented populations. The 27 genomes duplicated in the 1000 Genomes and NYGC datasets were used to measure consistency across sequencing platforms. The trios from the 1000 Genomes (CEU and YRI) and GIAB (AJ and Chinese HAN) datasets were used for analyzing Mendelian inheritance. The cancer datasets were used to find possible changes in VNTRs in tumor tissues. The PacBio data were used for validation purposes only.
Curating the reference TRs
The 22 autosomes and sex chromosomes from the human reference genome GRCh38 were used to produce a reference set of TRs in TRDB (105) with the TRF software and four quality filtering steps as described in (96). In addition, centromere regions were excluded from the reference set. These filtering tools are available online in TRDB. Starting with 1 199 362 TRs found by TRF, we curated a filtered reference set with 228 486 TRs. Using VNTRseek, we classified the TRs into two subcategories, singletons and indistinguishables (Supplementary Section S2.1). A singleton TR appears to be unique in the genome based on a combination of its repeat pattern and flanking sequence. An indistinguishable TR belongs to a family of genomically dispersed TRs which share highly similar patterns and flanking sequence and may therefore produce misleading genotype calls. Indistinguishable TRs (total 37 200 or about 16% of the reference set) were flagged. Simulation testing revealed that some singletons produced false positive VNTRs. To minimize this issue, an additional filtering step was added to eliminate problematic singleton loci from the reference set (see Supplementary Section S2.2 and Supplementary MaterialReference_TRs.txt for the reference sets).
We assumed that genotyping was possible for a reference TR locus, given a particular read length, if the TR array length plus a minimum 10 bp flank on each side, would fit within the read. The number of reference TR alleles that could be genotyped using each of the read lengths in our data is summarized in Supplementary Table S1.
Genotyping TRs and VNTRs
Each dataset was processed separately with VNTRseek using default parameters: minimum and maximum flanking sequence lengths of 10 and 50 bp, respectively, on each side of the array, and requiring at least two reads mapped with the same array copy number to make an allele call. Output from VNTRseek included two VCF files containing genotype calls, one reporting all detected TR and VNTR loci, and the other limited to VNTR loci only. (A locus was considered a VNTR if a non-reference allele was detected.) The VCF files contained two specialized FORMAT fields: SP, for number of reads supporting each allele, and CGL, to indicate genotype by the number of copies gained or lost with respect to the reference. For example, a genotype of 0 indicated detection of only the TR reference allele (zero copies gained or lost), while 0, +2 indicated a heterozygous locus with a reference allele and an allele with a gain of two copies.
To remove clear inconsistencies, for this study we filtered the VCF files to remove per sample VNTR loci with more alleles than the expected number of chromosomes. The filtering criteria for these loci, termed multis is detailed in Supplementary Section S2.3. After multi filtering, a TR locus was labeled as a VNTR if any remaining allele, different from the reference, was observed in any sample.
Experimental validation
Accuracy of VNTRseek genotyping was experimentally tested for 13 predicted VNTR loci in the Ashkenazi Jewish (AJ) trio. The following DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: NA24385, NA24149 and NA24143 (also identified as GIAB IDs HG002, HG003 and HG004). Selection criteria required that the PCR product was not contained in a repeat region, unique primers could be designed, the primer-defined allele length difference was between 10% and 20% of the longest allele, and the primer-defined GC content was between 40% and 60%. Given these criteria, we prioritized VNTRs in genes and regulatory regions which might be of interest to researchers. Primers were designed with Primer-BLAST (106) and used to amplify the VNTR loci from the genomic DNA of each individual using the following reagents: 0.2 μl 5 U/μl DreamTaq DNA Polymerase (ThermoFisher Scientific), 4.0 μl 10× DreamTaq Buffer (ThermoFisher Scientific), 0.8 μl 10mM dNTP mix (ThermoFisher Scientific), 3.2 μl primer mix at a final concentration of 0.5 μM, 1.6 μl genomic DNA (40 ng), and 30.2 μl nuclease-free water. PCR cycling conditions were as follows: 30 s at 95°C, 30 s at 56–60°C, 20 s at 72°C, for 30 cycles, with an initial denaturation of 3 min at 95°C and a final extension of 7 min at 72°C. The resulting amplicons were run on a 2% agarose gel at 100 V for 2 h and visualized with UV light using Ethidium Bromide. A complete list of loci and primers is given in the Supplementary Material as Experiment_primers.txt.
In addition, VNTRseek predictions in the NA12878 genome were compared to experimental validations in the paper describing the adVNTR software (95). We had three datasets for NA12878: HG001 (148 bp) from GIAB, NA12878 (150 bp) from NYGC and NA12878 (250 bp) from 1000 Genomes. The adVNTR predictions used GRCh37 coordinates which were converted using the UCSC liftover tool (107) to coordinates to GRCh38.
Validation using long reads
Aligned PacBio reads for the AJ child (GIAB ID HG002) were processed to validate VNTRseek predictions. The read sequences were extracted from the BAM file and mapped back to the GRCh38 genome using BWA MEM default settings (108). Using bedtools (109), the reads aligning to each TR reference locus were extracted. For each read, a local wraparound dynamic programming alignment was performed using the reference pattern and the same scoring parameters used to generate the reference set (match = +2, mismatch = −5 and gap = −7). The number of copies of the pattern in the resulting alignment was then determined and compared with the VNTRseek predictions. If the difference between a PacBio copy number in at least one read and the VNTRseek copy number was within ±0.25 of a copy, we considered the VNTRseek allele to be validated.
Measuring consistency of Mendelian inheritance
A locus on an autosomal chromosome is consistent with Mendelian inheritance if the genotype of a child can be explained as one allele from the mother and one from the father. Genotype consistency was evaluated for all mother-father-child trios, i.e. the AJ, CEU, HAN and YRI trios. We evaluated loci defined by several increasingly stringent criteria: both parents heterozygous, all members of the trio heterozygous, all members of the trio heterozygous and with different genotypes. These criteria were selected to avoid false interpretations of consistency.
TR loci on the X and Y chromosome of male children were also selected for evaluation when both the son and the appropriate parent had a predicted genotype. In these cases, inheritance consistency means a son’s X chromosome allele is observed on one of the mother’s X chromosomes, and a son’s Y chromosome allele is observed on the father’s Y chromosome.
Measuring allele consistency across platforms
VNTR calls were compared for each of 27 genomes that were represented twice, once in the 1000 Genomes dataset, sequenced in 2015 on an Illumina HiSeq2500 with 250 bp read length and once in the NYGC dataset, sequenced in 2019 on an Illumina NovaSeq 6000 with 150 bp read length. The two platforms have different error profiles.
Because read length and coverage differed among datasets, for each pairwise comparison, we only considered loci that were genotyped in both samples and classified as VNTR in at least one. We extracted the non-reference VNTR alleles (detected in at least one sample) and computed consistency as the ratio of the size of the intersection set of those alleles (found in both platforms) over the size of their union (found in either). For alleles detected in the 250 bp reads, we only counted those that could have been detected in the shorter 150 bp reads. Reference alleles were excluded to avoid inflating the ratio.
Common and private VNTRs
To classify commonly polymorphic VNTRs (hereafter ‘common VNTRs’) and private VNTRs, we used results from the NYGC dataset (2504 individuals) as the read length and coverage were comparable across all genomes. Additionally, these genomes contain no related individuals and represent a wide set of populations (26 populations from five continents). Loci were classified as common VNTRs if non-reference alleles were identified in at least 5% (126) of the individuals and classified as private VNTRs if non-reference alleles were identified in less than 1% (25).
Annotation and enrichment
Annotation based on overlap with functional genomic regions was performed for the reference TR loci. Genomic annotations for GRCh38 were obtained from the UCSC Table Browser (110) in BED format. Known gene transcripts from GENCODE V32 (111) were used along with tracks for introns, coding exons, and 5′ and 3′ exons. Regulatory annotations included transcription factor binding site (TFBS) clusters (112,113) and DNAse clusters (114) from ENCODE 3 (115), and CpG island tracks (116), comprising 25%, 15% and 1% of the genome, respectively. Bedtools (109) was used to find overlaps between TR loci and the annotation features. Any size overlap was allowed.
LOLAweb (117) was used to determine VNTR enrichment for genomic regions in comparison to the background TR annotations, and common and private VNTR enrichment in comparison to all VNTR annotations. TRs on the sex chromosomes were excluded in the background set. To identify gene and pathway functions that could be affected by common VNTR copy number change, genes with exons or introns overlapping with common VNTRs were collected and their enrichment computed using GSEA (118) for Gene Ontology (GO) terms (119) for biological process and KEGG pathways (120) with FDR P-value < 0.05.
Association of VNTR alleles with gene expression
To detect expression differences among individuals with different VNTR genotypes, mRNA expression counts from lymphoblastoid cell lines of 660 individuals by the Geuvadis consortium (Accession: E-GEUV-1) were downloaded (121). A total of 445 individuals overlapped with the 2504 NYGC genomes set. We paired VNTR loci with genes within 10 kb, and extracted the genotypes for each individual at those VNTRs. When no genotype was observed for an individual, we classified the genotype as other. We did this because we assumed that the alleles were outside the detection range, given that genotypes were observed in other individuals with similar coverage. VNTR loci were retained for analysis if at least two genotypes were detected for that VNTR across all individuals (at least three if other was one of the genotypes) and if each genotype was observed in at least 20 individuals. Genes were excluded from analysis if the median TPM (transcripts per kilobase million) expression value equaled zero.
To control for confounders we used covariates for sex and population structure and detected additional hidden covariates using Iteratively Adjusted Surrogate Variable Analysis (IA-SVA) (122) on the log2 normalized TPM values. For population structure we used the top five principal components determined from a principal components analysis of the informative SNP genotypes from the 445 individuals as reported by the 1000 Genomes project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/). Using IA-SVA and observing that covariates sixteen and above were over 85% correlated with other covariates (Supplementary Figures S22–S24), we chose fifteen hidden factors to include in our model. Finally, we used a linear regression
with the log2 normalized TPM values to extract residuals to be used in the downstream association model.
For each gene-VNTR pair, we used a one-way ANOVA test as residuals ∼ genotype to detect if the mean of any genotype class was different from the others. The P-values of the ANOVA tests were adjusted using FDR. Any gene–VNTR pair with FDR <0.05 was reported. For significant eQTLs, we calculated, for reporting purposes, the maximum difference of the residual means over all pairs of genotype classes.
To associate eQTLs with histone marks or open chromatin, we downloaded narrow peaks data in GRCh38 in bed format from 14 experiments on histone marks and one on DNAse hypersensitive sites from the GM12878 (B-lymphocyte) cell line from the ENCODE project (123) (source IDs are given in Supplementary Table S13). Any overlaps of peaks with the eQTL VNTRs were reported.
Population-specific alleles
The 2504 genomes in the NYGC dataset consisted of 26 populations of individuals with ancestry from five super-populations: African, American, East Asian, European, and South Asian. To investigate the predictive power of common VNTRs with regard to super-population membership, Principal Component Analysis (PCA) clustering was applied. For each sample, a vector of common loci alleles showing presence/absence (1/0) was produced. Uninformative alleles (that were not present in at least 5% of the samples) were removed and principal components (PCs) calculated over the resulting vector set. Using a 70% training to 30% testing split of the data, a decision tree based on the first 10 PCs was trained using 10-fold cross validation and then validated on the testing data.
In order to find super-population markers among the common VNTRs, a one-sided Fisher’s exact test was used to calculate the odds ratio and P-value of each allele being in one super-population versus being collectively in all the others. We only considered alleles over-represented rather than both over- and under-represented because of an interest in identifying alleles that have a phenotypic effect. Odds ratio values were log2 transformed and P-values were adjusted for false discovery rate (FDR) (124). Any allele with FDR <0.05 and log2(odds ratio) >1 was chosen as a significant marker for that population.
RESULTS
In this section, we start with a summary and characterization of our VNTR predictions, followed by identification of commonly polymorphic VNTRs and an enrichment analysis of their association with genomic functional regions and genes sets. We next report on the effect of VNTRs on expression of nearby genes, and then identify population-specific VNTR alleles and show that they are predictive of ancestry. We conclude this section with evidence confirming the accuracy of our predictions using several validation methods.
About one in five minisatellite TRs are variable in the human population
WGS datasets from 2770 human genomes were analyzed with VNTRseek to detect VNTRs. Overall, 184 315 out of 191 286 singleton reference TR loci (∼96%) were genotyped across all samples (Table 2) while 5% of the loci had TR arrays too long to fit within the longest reads and could only be genotyped if they lost a sufficient number of copies.
Table 2.
TRs and VNTRs detected, by dataset. TRsgenotyped is the number of distinct TR loci genotyped across all individuals within a dataset. (All other numbers are also per dataset.) Multis are TR loci genotyped in a single individual with more than the expected number of alleles. They could be artifacts or indicate copy number variation in a genomic segment. Multis were excluded from further analysis on a per sample basis. VNTRsdetected is the number of TR loci, excluding multis, with a detected allele different from the reference
| Dataset | Samples | TRs genotyped | Multis | VNTRs detected |
|---|---|---|---|---|
| 1000 Gen. | 30 | 178 395 | 366 | 8761 |
| NYGC | 2504 | 177 612 | 1181 | 33 403 |
| SGDP | 253 | 156 803 | 221 | 9944 |
| GIAB | 7 | 178 804 | 239 | 6736 |
| CHM1 | 1 | 159 563 | 175 | 1118 |
| CHM13 | 1 | 170 805 | 632 | 1977 |
| Tumor-Normal | 4 | 150 531 | 21 | 1291 |
| Totals | 2800 | 184 315 | - | 35 638 |
A total of 5 198 392 loci with non-reference alleles were detected, corresponding to 35 638 (∼19%) distinct VNTR loci, indicating wide occurrence of these variable repeats. Their occurrence within genes was common, totaling 7698 protein coding genes, and 3512 exons. The resulting genotypes were output in VCF format files (see Data Availability Section) and summarized for each genome (Supplementary Material Summary_of_results.txt). A website is under development to view the VNTR alleles (http://orca.bu.edu/VNTRview/).
The number of TRs genotyped and VNTRs detected depends on coverage and read length
To determine the effect of coverage and read length on genotyping, we measured two quantities: the percentage of reference singleton TRs that were genotyped, and the total number of singleton VNTRs that were detected in each genome. Only singleton loci were considered in all further analyses. Figure 1A shows that there was a strong positive correlation between coverage and the ability to genotype TRs. A strong correlation with read length was also apparent, however, the effect was larger, primarily due to the ability of longer reads to span, and thus allow VNTRseek to detect, longer TR arrays. These results suggest that our VNTR numbers are undercounts.
Figure 1.
(A) TR Genotyping sensitivity. Graph shows the relationship between coverage, read length, and the percentage of TRs in the reference set that were genotyped. Each symbol represents a single sample and specific samples are labeled. Increasing read length had the largest effect on sensitivity because many reference TR alleles could not be detected at the shorter read lengths (see Supplementary Table S1). (B) VNTRs detected. Graph shows the relationship between coverage, read length, and the number of VNTR loci detected. Read length and coverage both had large effects. Coloring of symbols shows that population also had a strong effect, reflecting distance from the reference, which is primarily European. Note the reduced numbers for CHM1 (150 bp) and CHM13 (250 bp). Because they are ‘haploid’ genomes, parental heterozygous loci with one reference allele would appear to be VNTRs, on average, only about half the time. (C) Alleles detected per locus. Each bar represents a specific number of alleles detected across all datasets. Coloring shows that proportion of loci where the reference allele was or was not observed. (D) Copies gained or lost. Each bar represents a specific number of copies gained or lost in non-reference VNTR alleles relative to the reference allele. Loss was always more frequently encountered. (E) VNTR locus sample support. Data shown are the common loci from the 2504 sample NYGC dataset. Each bar represents the number of samples calling a locus as a VNTR. Bin size is 100. Bar height is number of loci with that sample support. Red line indicates the 5% cutoff for common loci (126 samples).
VNTR detection was similarly dependent on coverage and read length, as shown in Figure 1B. However, detection was also positively correlated with population, which seemed likely due to the evolutionary distance of populations from the reference genome, which is primarily European (125). For example, in the 250 bp trios with comparable coverage, the African Yoruban genomes (YRI) had the highest number of VNTRs detected, followed by the Ashkenazi Jewish genomes (AJ), and finally, the Utah genomes (CEU). Notably, within each trio, the VNTR counts were similar.
The ‘haploid’ genomes CHM1 (150 bp) and CHM13 (250 bp) had greatly reduced VNTR counts relative to genomes with similar coverage and read length. This was because in these genomes, which consist of two copies of an underlying haploid genome, the single allele represented at any VNTR locus would frequently be a reference allele and so the locus would not be called as a VNTR.
More than two alleles are common in VNTRs
Two alleles were detected in the majority of VNTR loci across all datasets (Figure 1C). However at 10 698 loci (29%), three or more alleles were detected. In a substantial number of loci (5395), the reference allele was never seen, but in only 105 of these (2%) was the reference allele in the VNTRseek detectable range for the 150 and 100 bp reads, which made up the bulk of our data. Interestingly, in 1166 loci, the reference allele, although detectable, was not the major allele (Supplementary Material Major_genotypes.txt).
Loss of VNTR copies relative to the reference is more common than gain
Overall, VNTRseek found approximately 1.8-fold more alleles with copy losses (3 444 128), with respect to the reference copy number, than gains (1 958 250). Loss of one copy (2 263 608) was the most common type of VNTR polymorphism (Figure 1D). Although there were allele lengths that VNTRseek could not detect, this bias persisted even when restricting the loci to only those where gain and loss could both be observed (Supplementary Figure S5). The overabundance of VNTR copy loss may actually be an underestimate. Since VNTRseek required a read to span a TR array for it to be detected, gain of one copy would have been possible in approximately 68%, 82% and 92% of loci for samples with read lengths of 100, 150 and 250 bp, respectively. By contrast, the reference locus needed to have a minimum of 2.8 copies for a loss of one copy to be observed by TRF, and only 16% of the reference loci met this criterion. Higher observed copy loss could be explained by a bias in the reference genome towards including higher copy number repeats (126), or by an overall mutational preference for copy loss.
VNTRs have high heterozygosity
High heterozygosity in human populations suggests higher genetic variability and may have beneficial effects on a range of traits associated with human health and disease (127). Since calculating heterozygosity for VNTRs is not straightforward (because of limitations on discovering alleles, especially within shorter reads), we used the percentage of detected, per-sample heterozygous VNTRs as an estimate for heterozygosity. At read length 250 bp, per-sample heterozygous VNTR loci comprised approximately 46–55% of the total, which is comparable to previous theoretical estimates of 43–59% (19). At shorter read lengths, the bottom of the range extended lower (∼38–57% for 150 bp reads, ∼29–51% for 100 bp reads, Supplementary Figure S1), as expected, because longer alleles were undetectable if they did not fit within a single read.
Interestingly, despite the previous comment, within genomes that were comparable in read length and coverage, the fraction of heterozygous loci clustered within populations (Supplementary Figures S2–S4), with African genomes generally having more heterozygous calls and East Asians fewer. This result is consistent with previous findings of population differences in SNP heterozygosity among Yoruban and Ashkenazi Jewish individuals with respect to European individuals (128,129), and suggests higher genomic diversity among African genomes, as has been previously noted (130).
Loss of heterozygosity observed in tumor samples
A significant loss of heterozygosity (LOH) was observed in predicted VNTRs of one of the tumor tissues compared to its matching normal tissue (sample ID HC1187). The percentage of heterozygous VNTRs was roughly double in the normal tissue (∼38% versus ∼19%) (Supplementary Tables S2 and S3). Extreme loss of heterozygosity in small variants has previously been reported in these samples by Illumina Basespace (131) with the number of heterozygous small variants in HC1187 being four times lower in the tumor tissue compared to the normal. Taken together, these results suggest that VNTR LOH could be linked to tumor progression.
Additionally, in both tumors a large number of loci exhibited loss of both alleles in comparison to the normal tissue (Supplementary Table S2). Given that the coverage for the tumor samples was significantly higher than for the normal tissue, it is unlikely that these observations were due to artifacts. Also, the tumor samples did not show a higher percentage of filtered multi VNTRs (too many alleles) than the normal samples (1.37% and 1.23% in normal tissue versus 1.72% and 1.71% in tumor tissue).
Knowledge of gene associations with somatic tumor mutations (VNTR alleles present in a tumor, but not normal tissue) could be useful as indicators of cancer prognosis and for therapy. In the HC2218 individual, somatic tumor mutations overlapped with lncRNAs (ACO73336.1, AC107959.2, AL355388.2), introns (C3orf67, COX17, DHRS3, DPP6, GAN, PCGF3, RGS12, SLC25A13, SLC6A19, TACR2, TEPP), and promoter regions (TRIM24, DUSP4). Of these, DHRS3, DUSP4, GAN, RGS12, and TRIM14 are known oncogenes or tumor suppressors (as indicated by CancerMine (132)). TRIM24 has been associated with prognosis in breast cancer (133–135) and over-expression of DUSP4 has been shown to improve the outcome of chemotherapy and overall survival (136,137).
In the HC1187 individual, somatic tumor mutations overlapped with lncRNAs (LINC01708, AC1058290.1, AC104596.1), exons (THNSL2), introns (AJAP1, SMAD1, FLT4, PTPN3, ADAMTSL2, ANO2, SOX5, SGCG, WDR72, NQO1, CCDC200, ARHGAP45, AC005258.1, PEAK3) and promoters (HFM1, TBK1, GNS, LEMD3, FGFR3, VIPR2). Of these, AJAP1, FGFR3, GNS, NQO1, PTPN3, SMAD1, SOX5 and TBK1 are known oncogenes or tumor suppressors (132). TBK1 and FGFR3 have been used as treatment targets for HER2+ breast cancer (138,139).
Common versus private VNTRs
Following methodology used with SNPs (140), we classified VNTR loci in the 2504 healthy, unrelated individuals from the NYGC dataset (150 bp and coverage >30×) as commonly polymorphic, or ‘common VNTRs,’ if non-reference alleles occurred in at least 5% of a population (126 individuals) and as private VNTRs if they occurred in less than 1% (25 individuals).
We classified 5676 VNTRs as common (17% of the 33 403 VNTRs detected in this population) and 68% as private. In each sample, we detected, on average, 1951 VNTR loci, and among those, 1783 were common VNTRs (median 1,677) and 46 were private VNTRs (median 17). A total of 3627 common VNTRs overlapped with 2173 protein coding genes including 254 exons. Interestingly, increasing the threshold for common VNTRs did not reduce the number dramatically (Supplementary Figure S6), suggesting that these VNTRs have not occurred randomly, but rather have undergone natural selection. Widespread occurrence of common VNTRs indicates a fitness for use in Genome Wide Association Studies (GWAS). A list of common and private VNTRs can be found in Supplementary Material Common_VNTRs.txt and Private_VNTRs.txt.
Common VNTR enrichment in functionally annotated regions
To determine possible functional effects of the common VNTRs, we classified the overlap of reference TRs with various functionally annotated genomic regions: upstream and downstream of genes, 3′ UTRs, 5′ UTRs, introns, exons, transcription factor binding site (TFBS) clusters, CpG islands, and DNAse clusters (Supplementary Material Reference_set.txt).
Our reference TR set comprised only 0.52% of the genome, however, 49% of human genes contained at least one TR and 5% of all the TFBS clusters overlapped with TRs. Moreover, high proportions of our TR reference set and common VNTRs intersected with genes (63% and 64%, respectively), TFBS clusters (38% and 51%), and DNAse clusters (21% and 28%) (Supplementary Table S5).
In comparison to TRs, VNTR loci were positively enriched in 1 kb upstream and downstream regions of genes, 5′ and 3′ UTRs, coding exons, TFBS clusters, DNAse clusters, and CpG islands (P-values < 0.05) (Supplementary Tables S4 and S5). The common VNTRs, on the other hand, compared to all VNTRs, were enriched in 1 kb upstream regions of genes, TFBS, and CpG islands, suggesting regulatory function. Private VNTRs were less likely to occur in 1 kb upstream or downstream regions, inside TFBS clusters, open DNAse clusters, or CpG islands.
Focusing on the common VNTRs, we used the LOLAweb (141) online tool to perform enrichment analysis with various curated feature sets (Supplementary Figures S7–S12 in Supplementary Section S5.2), Among the results, DNAse enrichments by tissue type (142) pointed to brain, muscle, epithelial, fibroblast, bone, hematopoietic, cervix, skin, and endothelial (Supplementary Figure S11) with brain showing up multiple times, consistent with findings in the literature (24). These results suggest that VNTR alleles may affect gene regulation in multiple tissues.
VNTR genotypes are correlated with gene expression differences
To detect association between VNTR genotypes and expression of nearby genes, we paired VNTRs to any gene within 10 kb and after removing genes with low expression and controlling for confounders, applied a one-way ANOVA test to determine if there was a significant difference between the average gene expression levels for the VNTR genotypes. A total of 1071 gene-VNTR pairs were tested and 193 pairs (187 genes, 188 VNTRs) exhibited significant expression differences at FDR <0.05 (Supplementary Figure S25 and Supplementary MaterialRNA_VNTRs.txt)). The top 10 genes by mean expression difference were: DPYSL4, KLF11, B4GALNT3, PIP5K1B, DNAJA4, THNSL2, CD151, MXRA7, HEBP1 and FARP1. Of these eQTL VNTRs, 47 also exhibited population-specific alleles (next section and Supplementary Table S15) and 81 overlapped with peaks for histone marks and DNAse hypersensitive sites (see Supplementary Table S13 and Supplementary Figure S31), a significant level of enrichment in these marks (Supplementary Table S14).
Three of the top genes are shown in Figure 2. Gene MXRA7 is associated with a VNTR (id 182606303) in the 5’ UTR exon, DPYSL4 is associated with a VNTR (id 182316137) in the first intron, and CSTB is associated with an upstream VNTR (id 182814480). The VNTR region in MXRA7 is a target site for transcription factors METTL23 and JMJD6. METTL23 is known to function as a regulator in the transcriptional pathway for human cognition (143) and has been associated with mental retardation and intellectual disability (144). JMJD6 is associated with pancreatitis (145) and tumorgenesis (146,147). Copy number expansions in the VNTR upstream of CSTB have been previously associated with progressive myoclonic epilepsy (EPM1) (148). For this VNTR, we observed the −1 and 0 alleles (2 and 3 copies, respectively), which are common in healthy individuals. However, 201 individuals had genotypes outside of our detection range which likely represented longer expansions and these individuals showed higher expression of this gene. More examples are given in Supplementary Figures S26–S30.
Figure 2.

Gene expression differences and VNTR genotype. Shown are violin plots of gene expression values (log2 normalized TPM) for three genes which displayed significant differential expression when samples were partitioned by VNTR allele genotype. Additional examples are shown in Supplementary Figures S26–S30. Genotype is indicated in labels on the X-axis and numbers refer to copies gained or lost relative to the reference allele. ‘Other’ indicates a partition with undetected alleles presumed outside the range of VNTRseek detection (see text). Number of samples in each partition is shown in parenthesis. In these examples, the effect size for at least one genotype class was significant. Top: VNTR 182606303 is upstream of MXRA7 and partially overlaps the 5’ UTR exon. Middle: VNTR 182316137 occurs inside the first intron of DPYSL4. Bottom: VNTR 182814480 occurs upstream of CSTB.
One in five common VNTR loci have population-specific alleles
We further investigated whether VNTR alleles are population-specific and whether they can be used to predict ancestry. Understanding the occurrence of population-specific VNTR alleles will be useful when controlling for population effects in GWAS, and more generally in interpreting gene expression differences among people of different ancestry.
A total of 4605 alleles from the common VNTR loci were classified as common if they were detected in at least 5% of the population (NYGC). We then constructed a matrix of presence/absence of each allele by sample and clustered the samples using Principal Component Analysis. We found that the first, second, fourth, and fifth principal components (PCs) separated the super populations as shown in Figure 3. Each PC captured a small fraction of the variation in the dataset, suggesting that there was substantial variation between individuals from the same population.
Figure 3.
Principal Component Analysis (PCA) of common VNTR alleles in the NYGC population (150 bp). PCA was performed to reduce the dimensions of the data. Left: PC1 captured ∼5% of the variation and separated Africans from the other super-populations, suggesting that they had the greatest distance from the others. PC2 separated East Asian and European populations but left individuals from the Americas and South Asia mixed. Right: PC4 separated the South Asian population and PC5 separated the American populations. PC3 (not shown) captured batch effects due to differences in coverage. Some American sub-populations proved hardest to separate, likely due to ancestry mixing.
The first PC separated Africans, suggesting furthest evolutionary distance. The second PC separated East Asians. The third PC captured coverage bias. The fourth and fifth PCs separated South Asians and Americans, respectively. The American population had a sub-population of Puerto Ricans that clustered with the Iberian Spanish population, suggesting mixed ancestry (149). To show the power of these alleles to predict ancestry, we next trained a decision tree model (Supplementary Figure S19) using the top 10 PCs (11% of the total variation) and achieved a recall of >98% on every population when applied to the 30% test partition (Supplementary Table S10).
A one-sided Fisher’s exact test was applied to determine the population-specific VNTR alleles that were over-represented in one population versus all the others. A total of 3850 VNTR alleles were identified as population-specific in one or more super-populations, corresponding to 1096 VNTR loci (Supplementary Figures S20 and S21). The complete list of population-specific alleles can be found in Supplementary Material Superpopulation_VNTRs.txt. These loci overlapped with 689 genes and 51 coding exons. Africans had the highest number of population-specific alleles (266), followed by East Asians (65), while Americans had the lowest (13), suggesting more mixed ancestry. We observed 63 loci that had a population-specific allele in each population. Figure 4 illustrates seven of the top population-specific loci in a ‘virtual gel’ representation, mimicking the appearance of bands on an agarose gel for easier interpretation. Forty-eight genes that displayed expression differences correlated with VNTR genotype were associated with population-specific VNTR loci (Supplementary Table S15), including the VNTR 182316137 associated with the gene DPYSL4, discussed in the previous section, which exhibited seven different alleles, five of which were population-specific.
Figure 4.

‘Virtual gel’ representation of seven population-specific VNTR alleles. Each dot represents an allele in one sample. Samples are separated vertically by super-population. Dots are jiggered in a rectangular area to reduce overlap. Population-specific alleles show up as bands over-represented in one population. Numbers and labels at bottom are VNTR locus ids with nearby genes indicated and the population-specific allele expressed as copy number change (+1, −2, etc.) from the reference. For example, in the leftmost column, the +1 allele was over-represented in the African population. Note that the allele bias towards pattern copy loss relative to the reference allele is apparent and that at one locus (second from left) the reference allele was the population-specific allele since almost no reference alleles were observed in the four other populations. The details of these seven loci are given in Supplementary Table S11.
Finally, to identify potential functional roles of the population-specific VNTR loci we performed Gene Set Enrichment Analysis (GSEA) for the associated genes against the Broad Institute MSigDB (150). Genes overlapping with the population-specific VNTRs were enriched for Endocytosis (hsa04144), Fatty acid metabolism (hsa01212), and Arrhythmogenic right ventricular cardiomyopathy (ARVC) (hsa05412) pathways (Supplementary Table S12). Among the GO biological processes affected by these genes were neurogenesis (GO:0022008; FDR = 3.62e−8), neuron differentiation (GO:0030182; FDR = 2.52e−7), and neuron development (GO:0048666; FDR = 4.31e−7). These processes are potentially related to other findings that have linked VNTRs to neurodegenerative disorders and cognitive abilities (28,36,56–62,151) (Supplementary Material Population_specific_Go_BP.xlsx). The GO term behavior (GO:0007610; FDR = 2.22e−4) was also found, which could be related to the association of VNTR loci with aggressive behavior (152–154). Other notable GO terms were regulation of muscle contraction (GO:0006937) and neuromuscular processes related to balancing (GO:0050885) with FDR <1%. The genes were also highly enriched in midbrain neurotype cell gene signatures (FDR = 5.49e−25), which might affect movement and emotions (155–157).
Accuracy of VNTR predictions
To show the reliability of our results, we experimentally validated VNTR predictions at 13 loci in the three related AJ genomes, and also compared VNTRseek predictions to alleles experimentally validated in the literature. We additionally used accurate long reads on one genome (HG002) to find evidence of the predicted alleles. Separately, we showed the consistency of our predictions in two ways: first, we looked at inheritance consistency among four trios (mother, father, child), and second, we compared result for genomes sequenced on two different platforms.
Experimental validation
All but one of the 66 predicted VNTR alleles were confirmed at 13 loci in the three related AJ genomes (child HG002, father HG003 and mother HG004). In the remaining case, two predicted alleles were separated by only 15 nucleotides and could not be distinguished. At two loci, other bands were also observed. In one, all three family members contained an allele outside the detectable range of VNTRseek (longer than the reads). In the other, one allele that was detectable was missed in two family members (see Table 3 for a summary of results, Supplementary Table S6 for details of the experiment, and Supplementary Figures S13–S17 for gel images).
Table 3.
Experimental validation results
| VNTRseek predicted | Experimental validation | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| # | TR id | Pattern size | Ref. copies | Gene | C | F | M | C | F | M |
| 1* | 182316181 | 105 | 4 | STK32C | −2 | −2 | −2 | −2, −1 | −2, −1 | −2, +1 |
| 2 | 182316985 | 27 | 6 | LINC01168+ | −3, 0 | −1, 0 | −3, −2 | yes | yes | yes |
| 3 | 182453735 | 30 | 2 | DNAJC3 | 0, +1 | 0, +1 | 0, +1 | yes | yes | yes |
| 4 | 182461997 | 38 | 7 | RASA3 | −5, −4 | −5, −4 | −5, −2 | yes | yes | yes |
| 5 | 182493720 | 70 | 3 | BEGAIN | 0 | 0 | 0, −1 | yes | yes | yes |
| 6 | 182515357 | 34 | 8 | MEGF11 | −5 | −5, −6 | −5 | yes | yes | yes |
| 7 | 182608886 | 27 | 6 | RPTOR | 0, −2 | 0, −3 | −2, +1 | yes | yes | yes |
| 8 | 182620950 | 48 | 3 | RNF138 | 0, −1 | 0, −1 | 0, −1 | yes | yes | yes |
| 9 | 182982510 | 34 | 4 | SLC12A7 | 0, −1 | 0 | 0, −1 | yes | yes | yes |
| 10# | 183046759 | 38 | 4 | ARL10 | 0 | 0, −1 | −1 | 0, +1 | yes | −1, +1 |
| 11† | 183081195 | 15 | 4 | TENT5A | +2, −1 | +2, +1 | +2, −1 | yes | +2 | yes |
| 12 | 183117043 | 17 | 9 | MRM2+, LFNG+ | −5, −3 | −5, −3 | 0, −3 | yes | yes | yes |
| 13 | 183169331 | 15 | 4 | IRF5 | −2 | −2, 0 | −2 | yes | yes | yes |
Thirteen VNTR loci were selected for experimental validation in the AJ trio. All but one of the 66 bands predicted by VNTRseek were validated. †For the remaining band, the results were questionable because the two predicted alleles for the father were only 15 nucleotides different in length, which was too close to distinguish in the image. *For all three individuals, the gel contained bands (bold) not predicted (or detectable) by VNTRseek. The extra band for the son corresponded to the −1 allele as found in the PacBio reads. The father’s extra band appeared to match with the −1 allele. The mother’s extra band appeared to be a +1 allele. #An extra band for the mother and son (bold) was not predicted by VNTRseek, although it seemed to match the +1 allele that was detectable. +These VNTRs overlapped regulatory sites that target the given genes.
We also compared VNTRseek predictions in three datasets from the NA12878 genome with VNTRs validated in the adVNTR paper (95). Out of the original 17 VNTR loci experimentally validated in that paper, four were not included in our reference set and for one, the matching TR could not be determined. In total, 11 out of 16 detectable alleles were correctly predicted, four were not found in the NA12878 sample with sufficient read size (250 bp), and one was incorrectly predicted in the HG001 sample and not found in the other two (Supplementary Table S7).
Validation of predicted VNTRs using long reads
PacBio Circular Consensus Sequencing reads from the HG002 genome (103), with an average length of 13.5 kb and an estimated 99.8% sequence accuracy, were computationally tested to determine if they confirmed VNTRseek predicted alleles for the GIAB Illumina reads from the same genome. Overall, >97% of predicted alleles were confirmed, and at the predicted VNTR loci, >87% of alleles were confirmed (Supplementary Table S8).
VNTR predictions are consistent with Mendelian inheritance
We compared the predicted alleles in four trios (CEU and YRI trios from 1000 Genomes; Chinese HAN and AJ from GIAB), testing loci on autosomes and X and Y chromosomes (see Materials and Methods). In all cases, only a handful of loci were inconsistent (Supplementary Table S9).
VNTR non-reference allele predictions showed substantial consistency across platforms
In 2015, the 1000 Genomes Phase 3 sequenced 30 genomes using Illumina HiSeq2500 at read length 250 bp. In 2020, 27 of those 30 genomes were resequenced by NYGC using Illumina Novaseq 6000 at read length 150 bp. Comparing VNTR loci genotyped in both platforms and only non-reference alleles detectable at both read lengths, agreement ranged from 76% to 91% (Supplementary Figure S18). When including reference alleles, the agreement increased to greater than 99%. Note, however, that read coverage was not the same for both datasets, causing variation in statistical power.
DISCUSSION
The current study represents, to our knowledge, the largest analysis of human whole genome sequencing data to detect copy number variable tandem repeats (VNTRs) and greatly expands the growing information on this class of genetic variation. The TRs genotyped consisted of some 184 000 minisatellites occupying the mid-range of pattern sizes, from seven to 126 bp. Our results reveal that nearly 20% (35 828) are variable, exhibiting at least one non-reference allele, a number much larger than has been generally understood. Moreover, we have classified a large subset of these (5676) as common VNTRs, with non-reference alleles occurring in >5% of the population. When considering the largest dataset in our study (2504 individuals), we found that, on average, each genome exhibited non-reference alleles at 1951 VNTR loci and among those, 1783 were common VNTRs.
In addition to their widespread occurrence, further evidence of minisatellite VNTR importance can be seen in the enrichment of these loci in genes and gene regulatory regions (promoters, transcription factor binding sites, DNAse hypersensitive sites, and CpG islands). Our entire set of VNTRs overlapped with 7698 protein coding genes and 3512 exons. The common VNTRs occurred within or were proximal to over 2173 protein coding genes, including overlapping with 254 exons. Biological function enrichment among these genes included neuron development and differentiation, and behavior. Examples include low-density lipoprotein receptor-related protein 6 (LRP6), a co-receptor for Wnt signaling, which is down-regulated in Alzheimer’s disease and which appears critical for maintaining synaptic integrity (158); Down syndrome cell adhesion molecule-like 1 (Dscaml1), which appears essential for development of GABAergic neurons in the entorhinal cortex (159); and ZMIZ1, a transcription factor co-activator, for which mutant allele over-expression leads to pyramidal neuron morphology abnormalities (160). These observations are consistent with the finding that VNTR expansions in humans compared to primates are associated with gain of cognitive abilities (24), and possible involvement of VNTRS with many neurodegenerative diseases and behavioral disorders (87).
The overabundance of VNTR proximity to genes suggests that variability at these loci could affect gene expression and indeed, we observed that the expression levels of 187 genes were significantly correlated with the presence of specific VNTR alleles in lymphoblastoid cell lines of 445 individuals. Others have also recently detected minisatellite VNTR eQTLs. In (25), expression levels in 46 tissues for 6802 genes were tested and 161 eQTLs were found. Thirteen of those genes coincided with ones detected in our study: BPTF, CCDC57, CCDC66, EPDR1, EPS8L2, LSS, MVB12A, PTPRVP, S100A10, SNAPC1, TMEM52, TRAPPC2L, ZNF736. Similarly, in (26) 21 VNTRs were found associated with expression in 38 genes. However, their VNTR alleles were larger and did not overlap the ones we tested in this study.
eQTLs have also been found among the variant STR class. In (161), 2060 genes were found to be affected by eQTL STRs. In another study (162), 28 375 variant STR loci were associated with differential expression of 12 494 genes in at least one tissue (FDR < 10%) and 1420 of the loci were predicted, by CAVIAR (163), to be causal in 17 tissues.
These findings are suggestive, but more study is required, both to determine if there is more evidence of tissue specific gene expression variation associated with VNTR genotypes (25,26,87) and if such correlational differences can be definitively tied to actions associated with specific VNTR alleles such as regulator binding affinity changes in regulatory regions. For more elaborate studies such as these, it will be essential that for each sample used to measure gene expression, the raw whole genome sequencing data be available, so that specialized software programs, such as VNTRseek can be used to determine VNTR genotype.
The frequency of VNTR occurrence and possible effects on gene expression suggest that minisatellite VNTR loci could be useful in genome-wide association studies (GWAS). However, it is well known that hidden differences can lead to misinterpretation of GWAS results, and care is particularly important when those differences are tied to human ancestry. Relevant to this, we have determined that 1096 of the common VNTR loci contain alleles showing significant population specificity and that these loci intersect with 689 genes. Understanding such hidden variability will be essential for interpreting GWAS and future studies should investigate possible haplotype linkages between specific VNTR alleles and nearby SNP alleles.
Population-specific alleles also have the potential for use in tracing early human migration. We have shown through principal component analysis with common VNTR alleles that super-populations are easily separated. Further, we have constructed a decision tree based on common VNTR alleles that obtains nearly perfect classification of individuals at the super population level. It will be interesting to see whether, with more information, classification can be refined further to encompass specific sub-populations, whether a minimal minisatellite VNTR set can be established for high accuracy population classification, and whether VNTR alleles can be used to estimate mixed ancestry as is done now with SNP haplotyping.
Despite the high precision of VNTRseek, (see Supplementary Figure S32 and Supplementary Table S16– S20) our curation of VNTR loci has certainly produced an undercount. This is true because VNTRseek requires that the tandem array fit within a read. Longer reads will help, but the abundance of high-coverage. low-error, long-read datasets is currently limited. Alternate methods exist (87,95), but these have not reported an ability to handle macrosatellite VNTRs where the arrays and patterns are hundreds to thousands of base pairs long. For this range of the tandem repeat spectrum, new tools must be developed.
Another limitation comes from use of the Tandem Repeats Finder, which requires that the array contain at least 1.9 copies to be detected. At read length 150 bp, which included the majority of our samples, a gain of one copy compared to the reference genome could be detected in 82% of the TR loci while loss of one copy could be detected in only 16%. Despite this imbalance, one copy loss was observed nearly 40% more often than one copy gain, an important observation with regard to potential tandem repeat copy number bias in the reference genome.
Previous studies on VNTR prevalence in the human genome have been limited to a subset of minisatellites inside the transcriptome and a limited number of genomes. Here, we have shown a broader prevalence of VNTR loci and suggested their importance with regard to gene function, population studies, and GWAS. Future research can be expected to further enhance our understanding of this important class of genomic variation.
DATA AVAILABILITY
The reference TR set files, output VCF files, and the pre-processed data files along with the code to create figures and tables are published at: DOI 10.5281/zenodo.4065850 VNTRseek can be downloaded at: https://github.com/Benson-Genomics-Lab/VNTRseek.
Supplementary Material
ACKNOWLEDGEMENTS
We thank Thomas Gilmore (Boston University) for helpful discussions and comments on the manuscript, Alberto Cruz-Martiń (Boston University) for discussion on gene association with neuron development and Xiaoling Zhang (Boston University School of Medicine) for her help with the eQTL analysis.
Contributor Information
Marzieh Eslami Rasekh, Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA.
Yözen Hernández, Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA.
Samantha D Drinan, Department of Biology, Boston University, Boston, MA 02215, USA.
Juan I Fuxman Bass, Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA; Department of Biology, Boston University, Boston, MA 02215, USA.
Gary Benson, Graduate Program in Bioinformatics, Boston University, Boston, MA 02215, USA; Department of Biology, Boston University, Boston, MA 02215, USA; Department of Computer Science, Boston University, Boston, MA 02215, USA.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
NSF [IIS-1017621, IIS-1423022, DBI-1559829, in part]; NIH [R35 GM128625, in part]. Funding for open access charge: University funds.
Conflict of interest statement. None declared.
REFERENCES
- 1. Treangen T.J., Salzberg S.L.. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2012; 13:36. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. de Koning A.J., Gu W., Castoe T.A., Batzer M.A., Pollock D.D.. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011; 7:e1002384. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Lim K.G., Kwoh C.K., Hsu L.Y., Wirawan A.. Review of tandem repeat search tools: a systematic approach to evaluating algorithmic performance. Brief. Bioinform. 2013; 14:67–81. [DOI] [PubMed] [Google Scholar]
- 4. Richard G.-F., Kerrest A., Dujon B.. Comparative genomics and molecular dynamics of DNA repeats in eukaryotes. Microbiol. Mol. Biol. R. 2008; 72:686–727. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Taylor J.S., Breden F.. Slipped-strand mispairing at noncontiguous repeats in Poecilia reticulata: a model for minisatellite birth. Genetics. 2000; 155:1313–1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Levinson G., Gutman G.A.. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 1987; 4:203–221. [DOI] [PubMed] [Google Scholar]
- 7. Madsen C.S., Ghivizzani S.C., Hauswirth W.W.. In vivo and in vitro evidence for slipped mispairing in mammalian mitochondria. Proc. Natl. Acad. Sci. U.S.A. 1993; 90:7671–7675. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Jeffreys A.J., Neil D.L., Neumann R.. Repeat instability at human minisatellites arising from meiotic recombination. EMBO J. 1998; 17:4147–4157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Debrauwère H., Buard J., Tessier J., Aubert D., Vergnaud G., Nicolas A.. Meiotic instability of human minisatellite CEB1 in yeast requires DNA double-strand breaks. Nat. Genet. 1999; 23:367. [DOI] [PubMed] [Google Scholar]
- 10. Pâques F., Richard G.-F., Haber J.E.. Expansions and contractions in 36-bp minisatellites by gene conversion in yeast. Genetics. 2001; 158:155–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Bustamante A.V., Sanso A.M., Segura D., Parma A.E., Lucchesi P. M.A.. Dynamic of mutational events in variable number tandem repeats of Escherichia coli O157: H7. BioMed Res. Int. 2013; 2013:390354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Vogler A.J., Keys C., Nemoto Y., Colman R.E., Jay Z., Keim P.. Effect of repeat copy number on variable-number tandem repeat mutations in Escherichia coli O157: H7. J. Bacteriol. 2006; 188:4253–4263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Fu S., Octavia S., Wang Q., Tanaka M.M., Tay C.Y., Sintchenko V., Lan R.. Evolution of variable number tandem repeats and its relationship with genomic diversity in Salmonella typhimurium. Front. Microbiol. 2016; 7:2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Verstrepen K.J., Jansen A., Lewitter F., Fink G.R.. Intragenic tandem repeats generate functional variability. Nat. Genet. 2005; 37:986. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Legendre M., Pochet N., Pak T., Verstrepen K.J.. Sequence-based estimation of minisatellite and microsatellite repeat variability. Genome Res. 2007; 17:1787–1796. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Panigrahi I. Genetic fingerprinting for human diseases: applications and implications. DNA Fingerprinting: Advancements and Future Endeavors. 2018; Springer; 141–150. [Google Scholar]
- 17. Sinha M., Rao I.A., Mitra M.. Molecular basis of identification through DNA fingerprinting in humans. DNA Fingerprinting: Advancements and Future Endeavors. 2018; Springer; 129–140. [Google Scholar]
- 18. Imam J., Reyaz R., Rana A.K., Yadav V.K.. DNA fingerprinting: discovery, advancements, and milestones. DNA Fingerprinting: Advancements and Future Endeavors. 2018; Springer; 3–24. [Google Scholar]
- 19. Denoeud F., Vergnaud G., Benson G.. Predicting human minisatellite polymorphism. Genome Res. 2003; 13:856–867. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Deka R., Chakraborty R., Ferrell R.E.. A population genetic study of six VNTR loci in three ethnically defined populations. Genomics. 1991; 11:83–92. [DOI] [PubMed] [Google Scholar]
- 21. Deka R., DeCroo S., Yu L.M., Ferrell R.E.. Variable number of tandem repeat (VNTR) polymorphism at locus D17S5 (YNZ22) in four ethnically defined human populations. Human Genet. 1992; 90:86–90. [DOI] [PubMed] [Google Scholar]
- 22. Hancock J.M., Santibáñez-Koref M.F.. Trinucleotide expansion diseases in the context of micro- and minisatellite evolution Hammersmith Hospital, April 1–3, 1998. EMBO J. 1998; 17:5521–5524. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Duitama J., Zablotskaya A., Gemayel R., Jansen A., Belet S., Vermeesch J.R., Verstrepen K.J., Froyen G.. Large-scale analysis of tandem repeat variability in the human genome. Nucleic Acids Res. 2014; 42:5728–5741. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Sonay T.B., Carvalho T., Robinson M.D., Greminger M.P., Krützen M., Comas D., Highnam G., Mittelman D., Sharp A., Marques-Bonet T.et al.. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res. 2015; 25:1591–1599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Bakhtiari M., Park J., Ding Y.-C., Shleizer-Burko S., Neuhausen S.L., Halldórsson B.V., Stefansson K., Gymrek M., Bafna V.. Variable Number Tandem Repeats mediate the expression of proximal genes. 2020; bioRxiv doi:25 May 2020, preprint: not peer reviewed 10.1101/2020.05.25.114082. [DOI] [PMC free article] [PubMed]
- 26. Lu T.-Y.T., Chaisson M.J.Human Genome Structural Variation Consortium et al.. Profiling variable-number tandem repeat variation across populations using repeat-pangenome graphs. 2020; bioRxiv doi:04 January 2021, preprint: not peer reviewed 10.1101/2020.08.13.249839. [DOI] [PMC free article] [PubMed]
- 27. Trepicchio W.L., Krontiris T.G.. Members of the rel/NF-χB family of transcriptional regulatory proteins bind the HRAS1 minisatellite DNA sequence. Nucleic Acids Res. 1992; 20:2427–2434. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Krontiris T.G., Devlin B., Karp D.D., Robert N.J., Risch N.. An association between the risk of cancer and mutations in the HRAS1 minisatellite locus. New. Engl. J. Med. 1993; 329:517–523. [DOI] [PubMed] [Google Scholar]
- 29. Wang S., Wang M., Yin S., Fu G., Li C., Chen R., Li A., Zhou J., Zhang Z., Liu Q.. A novel variable number of tandem repeats (VNTR) polymorphism containing Sp1 binding elements in the promoter of XRCC5 is a risk factor for human bladder cancer. Mutat. Res./Fundam. Mol. Mech. Mutagen. 2008; 638:26–36. [DOI] [PubMed] [Google Scholar]
- 30. Zukic B., Radmilovic M., Stojiljkovic M., Tosic N., Pourfarzad F., Dokmanovic L., Janic D., Colovic N., Philipsen S., Patrinos G.P., et al.. Functional analysis of the role of the TPMT gene promoter VNTR polymorphism in TPMT gene transcription. Pharmacogenomics. 2010; 11:547–557. [DOI] [PubMed] [Google Scholar]
- 31. Vasiliou S.A., Ali F.R., Haddley K., Cardoso M.C., Bubb V.J., Quinn J.P.. The SLC6A4 VNTR genotype determines transcription factor binding and epigenetic variation of this gene in response to cocaine in vitro. Addict. Biol. 2012; 17:156–170. [DOI] [PubMed] [Google Scholar]
- 32. Vafiadis P., Bennett S.T., Todd J.A., Nadeau J., Grabs R., Goodyer C.G., Wickramasinghe S., Colle E., Polychronakos C.. Insulin expression in human thymus is modulated by INS VNTR alleles at the IDDM2 locus. Nat. Genet. 1997; 15:289–292. [DOI] [PubMed] [Google Scholar]
- 33. Greenwood T.A., Kelsoe J.R.. Promoter and intronic variants affect the transcriptional regulation of the human dopamine transporter gene. Genomics. 2003; 82:511–520. [DOI] [PubMed] [Google Scholar]
- 34. Lovejoy E., Scott A., Fiskerstrand C., Bubb V., Quinn J.. The serotonin transporter intronic VNTR enhancer correlated with a predisposition to affective disorders has distinct regulatory elements within the domain based on the primary DNA sequence of the repeat unit. Eur. J. Neurosci. 2003; 17:417–420. [DOI] [PubMed] [Google Scholar]
- 35. Klenova E., Scott A.C., Roberts J., Shamsuddin S., Lovejoy E.A., Bergmann S., Bubb V.J., Royer H.-D., Quinn J.P.. YB-1 and CTCF differentially regulate the 5-HTT polymorphic intron 2 enhancer which predisposes to a variety of neurological disorders. J. Neurosci. 2004; 24:5966–5973. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. De Roeck A., Duchateau L., Van Dongen J., Cacace R., Bjerke M., Van den Bossche T., Cras P., Vandenberghe R., De Deyn P.P., Engelborghs S.et al.. An intronic VNTR affects splicing of ABCA7 and increases risk of Alzheimer’s disease. Acta Neuropathol. 2018; 135:827–837. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Pacheco A., Berger R., Freedman R., Law A.J.. A VNTR regulates miR-137 expression through novel alternative splicing and contributes to risk for schizophrenia. Sci. Rep.-UK. 2019; 9:11793. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Schoots O., Van Tol H.. The human dopamine D4 receptor repeat sequences modulate expression. Pharmacogenomics J. 2003; 3:343–348. [DOI] [PubMed] [Google Scholar]
- 39. Xiao X., Jones G., Sevilla W.A., Stolz D.B., Magee K.E., Haughney M., Mukherjee A., Wang Y., Lowe M.E.. A carboxyl ester lipase (CEL) mutant causes chronic pancreatitis by forming intracellular aggregates that activate apoptosis. J. Biol. Chem. 2016; 291:23224–23236. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Ræder H., Johansson S., Holm P.I., Haldorsen I.S., Mas E., Sbarra V., Nermoen I., Eide S.Å., Grevle L., Bjørkhaug L.et al.. Mutations in the CEL VNTR cause a syndrome of diabetes and pancreatic exocrine dysfunction. Nat. Genet. 2006; 38:54. [DOI] [PubMed] [Google Scholar]
- 41. Willems T., Gymrek M., Highnam G., Mittelman D., Erlich Y.Yaniv and 1000 Genomes Project Consortium et al.. The landscape of human STR variation. Genome Res. 2014; 24:1894–1904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Willems T., Zielinski D., Yuan J., Gordon A., Gymrek M., Erlich Y.. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods. 2017; 14:590–592. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Mallick S., Li H., Lipson M., Mathieson I., Gymrek M., Racimo F., Zhao M., Chennagiri N., Nordenfelt S., Tandon A.et al.. The Simons genome diversity project: 300 genomes from 142 diverse populations. Nature. 2016; 538:201–206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Gettings K.B., Borsuk L.A., Zook J., Vallone P.M.. Unleashing novel STRs via characterization of genome in a bottle reference samples. Forensic Sci. Int.: Genet. Suppl. Ser. 2019; 7:218–220. [Google Scholar]
- 45. Krishnan V., Utiramerur S., Ng Z., Datta S., Snyder M.P., Ashley E.A.. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics. 2021; 22:85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Brouwer J.R., Willemsen R., Oostra B.A.. Microsatellite repeat instability and neurological disease. BioEssays. 2009; 31:71–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Rohilla K.J., Gagnon K.T.. RNA biology of disease-associated microsatellite repeat expansions. Acta Neuropathol. Commun. 2017; 5:63. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Hannan A.J. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 2018; 19:286. [DOI] [PubMed] [Google Scholar]
- 49. Rodriguez C., Todd P.. New pathologic mechanisms in nucleotide repeat expansion disorders. Neurobiol. Dis. 2019; 130:104515. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Beck R., Monument M.J., Watkins W.S., Smith R., Boucher K.M., Schiffman J.D., Jorde L.B., Randall R.L., Lessnick S.L.. EWS/FLI-responsive GGAA microsatellites exhibit polymorphic differences between European and African populations. Cancer Genetics. 2012; 205:304–312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Boulay G., Volorio A., Iyer S., Broye L.C., Stamenkovic I., Riggi N., Rivera M.N.. Epigenome editing of microsatellite repeats defines tumor-specific enhancer functions and dependencies. Genes Dev. 2018; 32:1008–1019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Nacev B.A., Jones K.B., Intlekofer A.M., Jamie S., Allis C.D., Tap W.D., Ladanyi M., Nielsen T.O.. The epigenomics of sarcoma. Nat. Rev. Cancer. 2020; 20:608–623. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Antwi-Boasiako C., Dzudzor B., Kudzi W., Doku A., Dale C., Sey F., Otu K., Boatemaa G., Ekem I., Ahenkorah J.et al.. Association between eNOS gene polymorphism (T786C and VNTR) and sickle cell disease patients in Ghana. Diseases. 2018; 6:90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Ksiazek K., Blaszczak J., Buraczynska M.. IL4 gene VNTR polymorphism in chronic periodontitis in end-stage renal disease patients. Oral Dis. 2019; 25:258–264. [DOI] [PubMed] [Google Scholar]
- 55. Cong L., Tu G., Liang D.. A systematic review of the relationship between the distributions of aggrecan gene VNTR polymorphism and degenerative disc disease/osteoarthritis. Bone Joint Res. 2018; 7:308–317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Katsumata Y., Fardo D.W., Bachstetter A.D., Artiushin S.C., Wang W.-X., Wei A., Brzezinski L.J., Nelson B.G., Huang Q., Abner E.L.et al.. Alzheimer disease pathology-associated polymorphism in a complex variable number of tandem repeat region within the MUC6 gene, near the AP2A2 gene. J. Neuropathol. Exp. Neurol. 2020; 79:3–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Chang H.-I., Chang Y.-T., Tsai S.-J., Huang C.-W., Hsu S.-W., Liu M.-E., Chang W.-N., Lien C.-Y., Huang S.-H., Lee C.-C.et al.. MAOA-VNTR genotype effects on ventral striatum-hippocampus network in Alzheimer’s disease: analysis using structural covariance network and correlation with neurobehavior performance. Mol. Neurobiol. 2019; 56:4518–4529. [DOI] [PubMed] [Google Scholar]
- 58. Scott H., Nelson P., Hopwood J., Morris C.. PCR of a VNTR linked to mucopolysaccharidosis type I and Huntington disease. Nucleic Acids Res. 1991; 19:6348. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Hoxha B., Goçi A.U., Agani F., Haxhibeqiri S., Haxhibeqiri V., Sabic E.D., Kucukalic S., Bravo A.M., Kucukalic A., Dzubur A.K., et al.. The role of TaqI DRD2 (rs1800497) and DRD4 VNTR polymorphisms in posttraumatic stress sisorder (PTSD). Psychiatria Danubina. 2019; 31:263–268. [DOI] [PubMed] [Google Scholar]
- 60. Šerỳ O., Paclt I., Drtílková I., Theiner P., Kopečková M., Zvolskỳ P., Balcar V.J.. A 40-bp VNTR polymorphism in the 3’-untranslated region of DAT1/SLC6A3 is associated with ADHD but not with alcoholism. Behav. Brain Funct. 2015; 11:21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Grünblatt E., Werling A.M., Roth A., Romanos M., Walitza S.. Association study and a systematic meta-analysis of the VNTR polymorphism in the 3’-UTR of dopamine transporter gene and attention-deficit hyperactivity disorder. J. Neural. Transm. 2019; 126:517–529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Van Assche E., Moons T., Van Leeuwen K., Colpin H., Verschueren K., Van Den Noortgate W., Goossens L., Claes S.. Depressive symptoms in adolescence: The role of perceived parental support, psychological control, and proactive control in interaction with 5-HTTLPR. Eur. Psychiatry. 2016; 35:55–63. [DOI] [PubMed] [Google Scholar]
- 63. Stolf A.R., Cupertino R.B., Müller D., Sanvicente-Vieira B., Roman T., Vitola E.S., Grevet E.H., von Diemen L., Kessler F.H., Grassi-Oliveira R.et al.. Effects of DRD2 splicing-regulatory polymorphism and DRD4 48 bp VNTR on crack cocaine addiction. J. Neural. Transm. 2019; 126:193–199. [DOI] [PubMed] [Google Scholar]
- 64. Ramírez-Patiño R., Figuera L.E., Puebla-Pérez A.M., Delgado-Saucedo J.I., Legazpí-Macias M.M., Mariaud-Schmidt R.P., Ramos-Silva A., Gutiérrez-Hurtado I.A., Gómez Flores-Ramos L., Zúñiga-González G.M.et al.. Intron 4 VNTR (4a/b) polymorphism of the endothelial nitric oxide synthase gene is associated with breast cancer in Mexican women. J. Korean Med. Sci. 2013; 28:1587–1594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65. Vairaktaris E., Serefoglou Z.C., Yapijakis C., Vassiliou S., Nkenke E., Avgoustidis D., Vylliotis A., Stathopoulos P., Neukam F.W., Patsouris E.. The platelet glycoprotein Ibα VNTR polymorphism is associated with risk for oral cancer. AntiCancer Res. 2007; 27:4121–4125. [PubMed] [Google Scholar]
- 66. Sousa H., Santos A.M., Catarino R., Pinto D., Moutinho J., Canedo P., Machado J.C., Medeiros R.. IL-1RN VNTR polymorphism and genetic susceptibility to cervical cancer in Portugal. Mol. Biol. Rep. 2012; 39:10837–10842. [DOI] [PubMed] [Google Scholar]
- 67. Safarinejad M.R., Safarinejad S., Shafiei N., Safarinejad S.. Effects of the T-786C, G894T, and Intron 4 VNTR (4a/b) polymorphisms of the endothelial nitric oxide synthase gene on the risk of prostate cancer. Urologic Oncology: Seminars and Original Investigations. 2013; 31:Elsevier; 1132–1140. [DOI] [PubMed] [Google Scholar]
- 68. Ibrahimi M., Moossavi M., Mojarad E.N., Musavi M., Mohammadoo-khorasani M., Shahsavari Z.. Positive correlation between interleukin-1 receptor antagonist gene 86bp VNTR polymorphism and colorectal cancer susceptibility: a case-control study. Immunol. Res. 2019; 67:151–156. [DOI] [PubMed] [Google Scholar]
- 69. Cui J., Luo J., Kim Y.C., Snyder C., Becirovic D., Downs B., Lynch H., Wang S.M.. Differences of variable number tandem repeats in XRCC5 promoter are associated with increased or decreased risk of breast cancer in BRCA gene mutation carriers. Front. Oncol. 2016; 6:92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70. Al-Eitan L.N., Rababa’h D.M., Alghamdi M.A., Khasawneh R.H.. The influence of an IL-4 variable number tandem repeat (VNTR) polymorphism on breast cancer susceptibility. Pharmacogenomics Personalized Med. 2019; 12:201. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71. Ahn E.-K., Kim W.-J., Kwon J.-A., Choi P.-J., Kim W.J., Sunwoo Y., Heo J., Leem S.-H.. Variants of MUC5B minisatellites and the susceptibility of bladder cancer. DNA Cell Biol. 2009; 28:169–176. [DOI] [PubMed] [Google Scholar]
- 72. Kwon J.-A., Lee S.-Y., Ahn E.-K., Seol S.-Y., Kim M.C., Kim S.J., Kim S.I., Chu I.-S., Leem S.-H.. Short rare MUC6 minisatellites-5 alleles influence susceptibility to gastric carcinoma by regulating gene. Hum. Mut. 2010; 31:942–949. [DOI] [PubMed] [Google Scholar]
- 73. Weitzel J.N., Ding S., Larson G.P., Nelson R.A., Goodman A., Grendys E.C., Ball H.G., Krontiris T.G.. The HRAS1 minisatellite locus and risk of ovarian cancer. Cancer Res. 2000; 60:259–261. [PubMed] [Google Scholar]
- 74. Wang L., Soria J.-C., Chang Y.-S., Lee H.-Y., Wei Q., Mao L.. Association of a functional tandem repeats in the downstream of human telomerase gene and lung cancer. Oncogene. 2003; 22:7123–7129. [DOI] [PubMed] [Google Scholar]
- 75. Calvo R., Pifarré A., Rosell R., Sánchez J.J., Monzoó M., Ribera J.M., Feliu E.. H-RAS 1 minisatellite rare alleles: a genetic susceptibility and prognostic factor for non-Hodgkin’s lymphoma. JNCI: J. Natl. Cancer Inst. 1998; 90:1095–1098. [DOI] [PubMed] [Google Scholar]
- 76. Yoon S.-L., Jung S.-I., Kim W.-J., Kim S.I., Park I.-h., Leem S.-H.. Variants of BORIS minisatellites and relation to prognosis of prostate cancer. Genes Genomics. 2011; 33:49–56. [Google Scholar]
- 77. Batra S.K., McLendon R.E., Koo J.S., Castelino-Prabhu S., Fuchs H.E., Krischer J.P., Friedman H.S., Bigner D.D., Bigner S.H.. Prognostic implications of chromosome 17p deletions in human medulloblastomas. J. Neuro-Oncol. 1995; 24:39–45. [DOI] [PubMed] [Google Scholar]
- 78. Andersson U., Osterman P., Sjöström S., Johansen C., Henriksson R., Brännström T., Broholm H., Christensen H.C., Ahlbom A., Auvinen A.et al.. MNS16A minisatellite genotypes in relation to risk of glioma and meningioma and to glioblastoma outcome. Int. J. Cancer. 2009; 125:968–972. [DOI] [PubMed] [Google Scholar]
- 79. Lim S.W., Kim H.R., Kim H.Y., Huh J.W., Kim Y.J., Shin J.H., Suh S.P., Ryang D.W., Kim H.R., Shin M.G.. High-frequency minisatellite instability of the mitochondrial genome in colorectal cancer tissue associated with clinicopathological values. Int. J. Cancer. 2012; 131:1332–1341. [DOI] [PubMed] [Google Scholar]
- 80. Xia X., Rui R., Quan S., Zhong R., Zou L., Lou J., Lu X., Ke J., Zhang T., Zhang Y.et al.. MNS16A tandem repeats minisatellite of human telomerase gene and cancer risk: a meta-analysis. PLoS One. 2013; 8:e73367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81. Leem S.H., Jeong Y.H., Yoon S.L., Seol S.-Y.. Diagnosis kits and method for detecting cancer using polymorphic minisatellite. 2011; (July 19, 2011) US Patent 7,981,613.
- 82. Singh R., Bandyopadhyay D.. MUC1: a target molecule for cancer therapy. Cancer Biol. Ther. 2007; 6:481–486. [DOI] [PubMed] [Google Scholar]
- 83. Yoon S.-L., Roh Y.-G., Chu I.-S., Heo J., Kim S.I., Chang H., Kang T.-H., Chung J.W., Koh S.S., Larionov V.et al.. A polymorphic minisatellite region of BORIS regulates gene expression and its rare variants correlate with lung cancer susceptibility. Exp. Mol. Med. 2016; 48:e246. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84. Rose A.M. Therapeutics and diagnostics based on minisatellite repeat element 1 (msr1). 2015; (December 17, 2015) US Patent App. 14/761,952.
- 85. Fondon J.W., Garner H.R.. Molecular origins of rapid and continuous morphological evolution. Proc. Natl. Acad. Sci. U.S.A. 2004; 101:18058–18063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86. Laidlaw J., Gelfand Y., Ng K.-W., Garner H.R., Ranganathan R., Benson G., Fondon III J.W.. Elevated basal slippage mutation rates among the Canidae. J. Hered. 2007; 98:452–460. [DOI] [PubMed] [Google Scholar]
- 87. Sulovari A., Li R., Audano P.A., Porubsky D., Vollger M.R., Logsdon G.A., Warren W.C., Pollen A.A., Chaisson M.J., Eichler E.E.et al.. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc. Natl. Acad. Sci. U.S.A. 2019; 116:23243–23253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88. Gymrek M. A genomic view of short tandem repeats. Curr. Opin. Genet. Dev. 2017; 44:9–16. [DOI] [PubMed] [Google Scholar]
- 89. Tørresen O.K., Star B., Mier P., Andrade-Navarro M.A., Bateman A., Jarnot P., Gruca A., Grynberg M., Kajava A.V., Promponas V.J.et al.. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019; 47:10994–11006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90. Gymrek M., Golan D., Rosset S., Erlich Y.. lobSTR: a short tandem repeat profiler for personal genomes. Genome Res. 2012; 1154–1162. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91. Kristmundsdóttir S., Sigurpálsdóttir B.D., Kehr B., Halldórsson B.V.. popSTR: population-scale detection of STR variants. Bioinformatics. 2017; 33:4041–4048. [DOI] [PubMed] [Google Scholar]
- 92. Willems T., Zielinski D., Yuan J., Gordon A., Gymrek M., Erlich Y.. Genome-wide profiling of heritable and extlessi extgreaterde novo extless/i extgreater STR variations. Nat. Methods. 2017; 14:590–592. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93. Mousavi N., Shleizer-Burko S., Yanicky R., Gymrek M.. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019; 47:e90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94. Dolzhenko E., Deshpande V., Schlesinger F., Krusche P., Petrovski R., Chen S., Emig-Agius D., Gross A., Narzisi G., Bowman B.et al.. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics. 2019; 35:4754–4756. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95. Bakhtiari M., Shleizer-Burko S., Gymrek M., Bansal V., Bafna V.. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 2018; 28:1709–1719. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96. Gelfand Y., Hernandez Y., Loving J., Benson G.. VNTRseek-a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res. 2014; 42:8884–8894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999; 27:573–580. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98. 1000 Genomes Project Consortium and others A global reference for human genetic variation. Nature. 2015; 526:68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99. Zook J.M., Catoe D., McDaniel J., Vang L., Spies N., Sidow A., Weng Z., Liu Y., Mason C.E., Alexander N.et al.. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Scientific Data. 2016; 3:160025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100. Chaisson M.J.P., Huddleston J., Dennis M.Y., Sudmant P.H., Malig M., Hormozdiari F., Antonacci F., Surti U., Sandstrom R., Boitano M.et al.. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2014; 517:608–611. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101. Huddleston J., Chaisson M. J.P., Steinberg K.M., Warren W., Hoekzema K., Gordon D., Graves-Lindsay T.A., Munson K.M., Kronenberg Z.N., Vives L.et al.. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 2017; 27:677–685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102. Drmanac R., Sparks A.B., Callow M.J., Halpern A.L., Burns N.L., Kermani B.G., Carnevali P., Nazarenko I., Nilsen G.B., Yeung G.et al.. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010; 327:78–81. [DOI] [PubMed] [Google Scholar]
- 103. Wenger A.M., Peluso P., Rowell W.J., Chang P.-C., Hall R.J., Concepcion G.T., Ebler J., Fungtammasan A., Kolesnikov A., Olson N.D.et al.. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 2019; 37:1155–1162. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104. Lander E.S., Waterman M.S.. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988; 2:231–239. [DOI] [PubMed] [Google Scholar]
- 105. Gelfand Y., Rodriguez A., Benson G.. TRDB—the tandem repeats database. Nucleic Acids Res. 2007; 35:D80–D87. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106. Ye J., Coulouris G., Zaretskaya I., Cutcutache I., Rozen S., Madden T.L.. Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics. 2012; 13:134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107. Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., Haussler D.. The human genome browser at UCSC. Genome Res. 2002; 12:996–1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; doi:10.6084/M9.FIGSHARE.963153.V1.
- 109. Quinlan A.R., Hall I.M.. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110. Karolchik D., Hinrichs A.S., Furey T.S., Roskin K.M., Sugnet C.W., Haussler D., Kent W.J.. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004; 32:D493–D496. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111. Hsu F., Kent W.J., Clawson H., Kuhn R.M., Diekhans M., Haussler D.. The UCSC known genes. Bioinformatics. 2006; 22:1036–1046. [DOI] [PubMed] [Google Scholar]
- 112. ENCODE Project Consortium and others An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113. Davis C.A., Hitz B.C., Sloan C.A., Chan E.T., Davidson J.M., Gabdank I., Hilton J.A., Jain K., Baymuradov U.K., Narayanan A.K.et al.. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018; 46:D794–D801. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114. Thurman R.E., Rynes E., Humbert R., Vierstra J., Maurano M.T., Haugen E., Sheffield N.C., Stergachis A.B., Wang H., Vernot B.et al.. The accessible chromatin landscape of the human genome. Nature. 2012; 489:75–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J.et al.. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518:317–330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116. Gardiner-Garden M., Frommer M.. CpG islands in vertebrate genomes. J. Mol. Biol. 1987; 196:261–282. [DOI] [PubMed] [Google Scholar]
- 117. Sheffield N.C., Bock C.. LOLA: enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics. 2016; 32:587–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S.et al.. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 119. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T.et al.. Gene ontology: tool for the unification of biology. Nat. Genet. 2000; 25:25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120. Kanehisa M., Goto S.. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28:27–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121. Lappalainen T., Sammeth M., Friedländer M.R., Ac’t Hoen P., Monlong J., Rivas M.A., Gonzalez-Porta M., Kurbatova N., Griebel T., Ferreira P.G.et al.. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501:506–511. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122. Lee D., Cheng A., Lawlor N., Bolisetty M., Ucar D.. Detection of correlated hidden factors from single cell transcriptomes using Iteratively Adjusted-SVA (IA-SVA). Sci. Rep.-UK. 2018; 8:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123. Consortium E.P.et al.. The ENCODE (ENCyclopedia of DNA Elements) project. Science. 2004; 306:636–640. [DOI] [PubMed] [Google Scholar]
- 124. Benjamini Y., Hochberg Y.. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (Methodological). 1995; 57:289–300. [Google Scholar]
- 125. Günther T., Nettelblad C.. The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLoS Genet. 2019; 15:e1008302. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126. Kent W.J., Haussler D.. Assembly of the working draft of the human genome with GigAssembler. Genome Res. 2001; 11:1541–1548. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127. Campbell H., Carothers A.D., Rudan I., Hayward C., Biloglav Z., Barac L., Pericic M., Janicijevic B., Smolej-Narancic N., Polasek O.et al.. Effects of genome-wide heterozygosity on a range of biomedically relevant human quantitative traits. Hum. Mol. Genet. 2007; 16:233–241. [DOI] [PubMed] [Google Scholar]
- 128. Herráez D.L., Bauchet M., Tang K., Theunert C., Pugach I., Li J., Nandineni M.R., Gross A., Scholz M., Stoneking M.. Genetic variation and recent positive selection in worldwide human populations: evidence from nearly 1 million SNPs. PLoS One. 2009; 4:e7888. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129. Bray S.M., Mulle J.G., Dodd A.F., Pulver A.E., Wooding S., Warren S.T.. Signatures of founder effects, admixture, and selection in the Ashkenazi Jewish population. Proc. Natl. Acad. Sci. U.S.A. 2010; 107:16222–16227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130. Edea Z., Bhuiyan M., Dessie T., Rothschild M., Dadi H., Kim K.. Genome-wide genetic diversity, population structure and admixture analysis in African and Asian cattle breeds. Animal. 2015; 9:218. [DOI] [PubMed] [Google Scholar]
- 131. Raczy C., Petrovski R., Saunders C.T., Chorny I., Kruglyak S., Margulies E.H., Chuang H.-Y., Källberg M., Kumar S.A., Liao A.et al.. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics. 2013; 29:2041–2043. [DOI] [PubMed] [Google Scholar]
- 132. Lever J., Zhao E.Y., Grewal J., Jones M.R., Jones S.J.. CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer. Nat. Methods. 2019; 16:505–507. [DOI] [PubMed] [Google Scholar]
- 133. Chambon M., Orsetti B., Berthe M.-L., Bascoul-Mollevi C., Rodriguez C., Duong V., Gleizes M., Thénot S., Bibeau F., Theillet C.et al.. Prognostic significance of TRIM24/TIF-1α gene expression in breast cancer. Am. J. Pathol. 2011; 178:1461–1469. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134. LU G.M., Wang S., Zhao L., Zhou X.Q., Yan H., Wang Q.C., Huang J.F.et al.. Expression of CD44, TRIM24, TAGLN-2, ER and PR in breast invasive ductal carcinoma and their clinicopathologic significance. Chinese J. Clin. Exp. Pathol. 2017; 33:724–727. [Google Scholar]
- 135. Pathiraja T.N., Thakkar K.N., Jiang S., Stratton S., Liu Z., Gagea M., Shi X., Shah P.K., Phan L., Lee M.-H.et al.. TRIM24 links glucose metabolism with transformation of human mammary epithelial cells. Oncogene. 2015; 34:2836–2845. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136. Balko J.M., Cook R.S., Vaught D.B., Kuba M.G., Miller T.W., Bhola N.E., Sanders M.E., Granja-Ingram N.M., Smith J.J., Meszoely I.M.et al.. Profiling of residual breast cancers after neoadjuvant chemotherapy identifies DUSP4 deficiency as a mechanism of drug resistance. Nat. Med. 2012; 18:1052–1059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 137. Menyhart O., Budczies J., Munkácsy G., Esteva F.J., Szabó A., Miquel T.P., Győrffy B.. DUSP4 is associated with increased resistance against anti-HER2 therapy in breast cancer. Oncotarget. 2017; 8:77207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 138. Deng T., Liu J.C., Chung P.E., Uehling D., Aman A., Joseph B., Ketela T., Jiang Z., Schachter N.F., Rottapel R.et al.. shRNA kinome screen identifies TBK1 as a therapeutic target for HER2+ breast cancer. Cancer Res. 2014; 74:2119–2130. [DOI] [PubMed] [Google Scholar]
- 139. Long X., Shi Y., Ye P., Guo J., Zhou Q., Tang Y.. MicroRNA-99a suppresses breast cancer progression by targeting FGFR3. Front. Oncol. 2020; 9:1473. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140. Cichon S., Craddock N., Daly M., Faraone S., Gejman P., Kelsoe J., Lehner T., Levinson D., Moran A., Sklar P.et al.. Psychiatric GWAS Consortium Coordinating Committee Genomewide association studies: history, rationale, and prospects for psychiatric disorders. Am. J. Psychiat. 2009; 166:540–556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 141. Nagraj V., Magee N.E., Sheffield N.C.. LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis. Nucleic Acids Res. 2018; 46:W194–W199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142. Sheffield N.C., Thurman R.E., Song L., Safi A., Stamatoyannopoulos J.A., Lenhard B., Crawford G.E., Furey T.S.. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Res. 2013; 23:777–788. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 143. Reiff R.E., Ali B.R., Baron B., Yu T.W., Ben-Salem S., Coulter M.E., Schubert C.R., Hill R.S., Akawi N.A., Al-Younes B.et al.. METTL23, a transcriptional partner of GABPA, is essential for human cognition. Hum. Mol. Genet. 2014; 23:3456–3466. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 144. Bernkopf M., Webersinke G., Tongsook C., Koyani C.N., Rafiq M.A., Ayaz M., Müller D., Enzinger C., Aslam M., Naeem F.et al.. Disruption of the methyltransferase-like 23 gene METTL23 causes mild autosomal recessive intellectual disability. Hum. Mol. Genet. 2014; 23:4015–4023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 145. Yanagihara T., Tomino T., Uruno T., Fukui Y.. Thymic epithelial cell–specific deletion of JMJD6 reduces Aire protein expression and exacerbates disease development in a mouse model of autoimmune diabetes. Biochem. Biophys. Res. Commun. 2017; 489:8–13. [DOI] [PubMed] [Google Scholar]
- 146. Poulard C., Rambaud J., Lavergne E., Jacquemetton J., Renoir J.-M., Trédan O., Chabaud S., Treilleux I., Corbo L., Le Romancer M.. Role of JMJD6 in breast tumourigenesis. PLoS One. 2015; 10:e0126181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 147. Wong M., Sun Y., Xi Z., Milazzo G., Poulos R.C., Bartenhagen C., Bell J.L., Mayoh C., Ho N., Tee A.E.et al.. JMJD6 is a tumorigenic factor and therapeutic target in neuroblastoma. Nat. Commun. 2019; 10:3319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148. Lalioti M., Antonarakis S., Scott H.S.. The epilepsy, the protease inhibitor and the dodecamer: progressive myoclonus epilepsy, cystatin b and a 12-mer repeat expansion. Cytogenet. Genome. Res. 2003; 100:213–223. [DOI] [PubMed] [Google Scholar]
- 149. Sudmant P.H., Rausch T., Gardner E.J., Handsaker R.E., Abyzov A., Huddleston J., Zhang Y., Ye K., Jun G., Fritz M. H.-Y.et al.. An integrated map of structural variation in 2,504 human genomes. Nature. 2015; 526:75–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 150. Liberzon A., Subramanian A., Pinchback R., Thorvaldsdóttir H., Tamayo P., Mesirov J.P.. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011; 27:1739–1740. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151. Marinho F. V.C., Pinto G.R., Oliveira T., Gomes A., Lima V., Ferreira-Fernandes H., Rocha K., Magalhães F., Velasques B., Ribeiro P.et al.. The SLC6A3 3-UTR VNTR and intron 8 VNTR polymorphisms association in the time estimation. Brain Struct. Funct. 2019; 224:253–262. [DOI] [PubMed] [Google Scholar]
- 152. Schlüter T., Winz O., Henkel K., Eggermann T., Mohammadkhani-Shali S., Dietrich C., Heinzel A., Decker M., Cumming P., Zerres K.et al.. MAOA-VNTR polymorphism modulates context-dependent dopamine release and aggressive behavior in males. NeuroImage. 2016; 125:378–385. [DOI] [PubMed] [Google Scholar]
- 153. Zammit S., Jones G., Jones S.J., Norton N., Sanders R.D., Milham C., McCarthy G.M., Jones L.A., Cardno A.G., Gray M.et al.. Polymorphisms in the MAOA, MAOB, and COMT genes and aggressive behavior in schizophrenia. Am. J. Med. Genet. B: Neuropsychiatric Genet. 2004; 128:19–20. [DOI] [PubMed] [Google Scholar]
- 154. Vernaleken I., Schlüter T., Eggermann T., Roesch F., Zerres K., Mottaghy F.. Effect of MAOA-VNTR polymorphism on aggression and dopamine release. J. Nuclear Med. 2015; 56:300–300. [Google Scholar]
- 155. Mill J., Asherson P., Browes C., D’Souza U., Craig I.. Expression of the dopamine transporter gene is regulated by the 3’ UTR VNTR: evidence from brain and lymphocytes using quantitative RT-PCR. Am. J. Med. Genet. 2002; 114:975–979. [DOI] [PubMed] [Google Scholar]
- 156. Diatchenko L., Nackley A.G., Tchivileva I.E., Shabalina S.A., Maixner W.. Genetic architecture of human pain perception. Trends Genet. 2007; 23:605–613. [DOI] [PubMed] [Google Scholar]
- 157. Kang A.M., Palmatier M.A., Kidd K.K.. Global variation of a 40-bp VNTR in the 3’-untranslated region of the dopamine transporter gene (SLC6A3). Biol. Psychiatry. 1999; 46:151–160. [DOI] [PubMed] [Google Scholar]
- 158. Liu C.-C., Tsai C.-W., Deak F., Rogers J., Penuliar M., Sung Y.M., Maher J.N., Fu Y., Li X., Xu H.et al.. Deficiency in LRP6-mediated Wnt signaling contributes to synaptic abnormalities and amyloid pathology in Alzheimer’s disease. Neuron. 2014; 84:63–77. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 159. Hayase Y., Amano S., Hashizume K., Tominaga T., Miyamoto H., Kanno Y., Ueno-Inoue Y., Inoue T., Yamada M., Ogata S.et al.. Down syndrome cell adhesion molecule like-1 (DSCAML1) links the GABA system and seizure susceptibility. Acta Neuropathol. Commun. 2020; 8:206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 160. Carapito R., Ivanova E.L., Morlon A., Meng L., Molitor A., Erdmann E., Kieffer B., Pichot A., Naegely L., Kolmer A.et al.. ZMIZ1 variants cause a syndromic neurodevelopmental disorder. Am. J. Hum. Genet. 2019; 104:319–330. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 161. Gymrek M., Willems T., Guilmatre A., Zeng H., Markus B., Georgiev S., Daly M.J., Price A.L., Pritchard J.K., Sharp A.J.et al.. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 2016; 48:22–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 162. Fotsing S.F., Margoliash J., Wang C., Saini S., Yanicky R., Shleizer-Burko S., Goren A., Gymrek M.. The impact of short tandem repeat variation on gene expression. Nat. Genet. 2019; 51:1652–1659. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 163. Hormozdiari F., Kostem E., Kang E.Y., Pasaniuc B., Eskin E.. Identifying causal variants at loci with multiple signals of association. Genetics. 2014; 198:497–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The reference TR set files, output VCF files, and the pre-processed data files along with the code to create figures and tables are published at: DOI 10.5281/zenodo.4065850 VNTRseek can be downloaded at: https://github.com/Benson-Genomics-Lab/VNTRseek.


