Summary
Variable number tandem repeats (VNTRs) are composed of large tandemly repeated motifs, many of which are highly polymorphic in copy number. However, because of their large size and repetitive nature, they remain poorly studied. To investigate the regulatory potential of VNTRs, we used read-depth data from Illumina whole-genome sequencing to perform association analysis between copy number of ∼70,000 VNTRs (motif size ≥ 10 bp) with both gene expression (404 samples in 48 tissues) and DNA methylation (235 samples in peripheral blood), identifying thousands of VNTRs that are associated with local gene expression (eVNTRs) and DNA methylation levels (mVNTRs). Using an independent cohort, we validated 73%–80% of signals observed in the two discovery cohorts, while allelic analysis of VNTR length and CpG methylation in 30 Oxford Nanopore genomes gave additional support for mVNTR loci, thus providing robust evidence to support that these represent genuine associations. Further, conditional analysis indicated that many eVNTRs and mVNTRs act as QTLs independently of other local variation. We also observed strong enrichments of eVNTRs and mVNTRs for regulatory features such as enhancers and promoters. Using the Human Genome Diversity Panel, we define sets of VNTRs that show highly divergent copy numbers among human populations and show that these are enriched for regulatory effects and preferentially associate with genes that have been linked with human phenotypes through GWASs. Our study provides strong evidence supporting functional variation at thousands of VNTRs and defines candidate sets of VNTRs, copy number variation of which potentially plays a role in numerous human phenotypes.
Keywords: VNTR, minisatellite, macrosatellite, eQTL, mQTL
Introduction
Tandem repeats (TRs) are stretches of DNA comprised of two or more contiguous copies of a sequence of nucleotides arranged in head-to-tail pattern, e.g., CAG-CAG-CAG. The human genome contains >1 million TRs that collectively span ∼3% of our total genome.1 These TRs range in motif size from mono-nucleotide repeats at one extreme (e.g., TTTTTTT) to those with much larger motifs that can in some cases be several kilobases in size, even containing entire exons or genes within each repeated unit.2,3 Because of their repetitive nature, TRs often show high mutation frequencies, and many show extremely high levels of length polymorphism.4,5 For example, a recent comprehensive study of genome variation showed that ∼50% of insertion-deletion events within the human genome map to TR regions.6 However, despite contributing to a large fraction of genetic variation, TRs remain poorly studied and, as a result, their influence on human phenotypes is almost certainly underestimated. This is largely due to their repetitive and highly variable nature, which until the recent advent of specialized algorithms designed to genotype TR lengths from sequencing data, made them largely inaccessible to high-throughput genotyping approaches.7, 8, 9, 10, 11, 12
Previously, we and others have demonstrated functional effects on local gene expression and epigenetics of length variation in TRs with both short motifs (motif size 1–6 bp, often termed microsatellites) and TRs with very large motifs (motif size > 2 kb, also termed macrosatellites).13, 14, 15, 16 In contrast, TRs with motif sizes between these two extremes, often termed variable number tandem repeats (VNTRs) or minisatellites, have been less well studied. This is largely due to the technical difficulty of genotyping variation at loci composed of moderate-to-large tandem repeat motifs and is further compounded by the fact that many TRs undergo a relatively high rate of recurrent mutation, meaning that copy number variation of large TRs is often poorly tagged by flanking SNVs.16 As a result, variation of many TR loci is poorly ascertained by standard SNV-based genome-wide association studies (GWASs). Thus, there is currently a knowledge gap regarding the role of TR variation in human disease.
Numerous targeted studies in the literature have implicated length variation of VNTR loci as putative drivers of human molecular and disease phenotypes. Specific examples include a 12-mer repeat upstream of CSTB (MIM: 601145) that is the strongest known expression quantitative trait locus (eQTL) associated with CSTB expression; a 30-mer repeat in the promoter of MAOA (MIM: 309850) implicated in multiple neurologic and behavioral phenotypes; a 14-mer repeat upstream of INS (MIM: 176730) that is associated with multiple metabolic traits, insulin production, and diabetes risk; and a 25-mer repeat intronic within ABCA7 (MIM: 605414) that is enriched for long alleles in Alzheimer disease and correlates with ABCA7 splicing and amyloid β levels in cerebrospinal fluid.17, 18, 19, 20, 21, 22
Building on this prior work, here we used read depth from Illumina whole-genome sequencing (WGS) data to perform a genome-wide analysis of copy number variation at ∼70,000 VNTR loci (defined here as TRs that have motif size ≥ 10 bp and span ≥100 bp in the reference genome) in two discovery cohorts and a third replication population. Our study provides functional insight into a previously understudied fraction of human genetic variation and suggests that future studies of VNTR variation may explain some of the “missing heritability” of the human genome.23,24
Subjects and methods
Description of cohorts used for VNTR association analysis
Individuals included in this study provided proper informed consent for research use, and all procedures followed were in accordance with the ethical standards of the responsible committee(s) on human studies. Local ethical approval for this study was granted under HS#: 20-00153.
GTEx
We obtained Illumina 150 bp paired-end WGS data and resulting variant calls made with GATK for 620 individuals from the Genotype-Tissue Expression (GTEx) project from dbGAP (dbGAP: phs000424.v7.p2). Normalized RNA sequencing (RNA-seq) expression data for these samples were downloaded from the GTEx portal (v.7), comprising quality-controlled and processed files for 48 tissues generated by the GTEx Consortium. These data were aligned to hg19 and had already undergone filtering to remove genes with low expression and been subject to rank-based inverse normal transformation.
PCGC
WGS and methylation data for 249 individuals were selected from the cohort collected by the Pediatric Cardiac Genomic Consortium (PCGC). An extensive description of PCGC samples as well as further details about sample collection can be found in summary publications released by the PCGC.25,26 Briefly, the cohort comprises individuals aged from newborn to 47 years (mean 8.2 years) diagnosed with a range of congenital heart defects; conotruncal and left-sided obstructive lesions were the two most common diagnoses. Illumina 150 bp paired-end WGS data generated via PCR-free libraries from peripheral blood DNA (average of 36× genome coverage, range 25–39×) were obtained from dbGAP (dbGAP: phs001138.v1.p2). Peripheral blood methylomes were downloaded from GEO (GEO: GSE159930) and normalized as described previously.27 We utilized the array data to infer the likely sex of each sample on the basis of scatterplots of mean β value of probes located on the X chromosome (chrX) versus the fraction of probes located on the Y chromosome (chrY) with detection p > 0.01. We compared these predictions against self-reported sex for each sample and removed four samples with a potential sex mismatch. We utilized data from autosomal probes, excluding any that mapped to multiple genomic locations. We also utilized the genotypes obtained from GATK analysis of the WGS data and in each sample excluded β values for any CpG that contained an SNV within either the probe-binding site or the interrogated CpG. After these filters, a total of 821,035 CpG sites were retained for downstream analysis. All PCGC data were aligned to hg19.
PPMI
We utilized data from the Parkinson’s Progression Markers Initiative (PPMI) cohort, corresponding to 712 individuals (189 healthy control individuals and 523 affected individuals with varying types of Parkinsonism), with available Illumina WGS data aligned to the hg38 reference genome.28 RNA-seq data generated from peripheral blood were available for 676 PPMI samples, comprising read counts for 22,582 genes listed in GENCODE v.19, aligned to hg19. The read counts were filtered, normalized, and subjected to rank-based inverse normal transformation via scripts provided by the GTEx Consortium. DNA methylation data generated via the Illumina 850k array from peripheral blood DNA were available for 524 PPMI samples, aligned to hg19.
Estimation of VNTR copy number in two discovery cohorts
We downloaded 886,954 autosomal TRs listed in the simple repeats track from the hg19 build of the UCSC Genome Browser, retaining only those repeats with motif size ≥ 10 bp and total length of repeat tract ≥ 100 bp. Where multiple TR annotations overlapped, these were merged together, resulting in 89,893 unique autosomal VNTR regions that were used in subsequent analysis. All analyses in GTEx and PCGC discovery cohorts were performed with the hg19 assembly.
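The filtering and merging step described above can be sketched in a few lines of Python. This is a minimal illustration only: the tuple layout (chromosome, start, end, motif size), the half-open coordinate convention, and the function name `filter_and_merge` are assumptions made for the example and are not taken from the original pipeline.

```python
from typing import List, Tuple

# Hypothetical record layout: (chrom, start, end, motif_size), half-open coordinates
Repeat = Tuple[str, int, int, int]
Region = Tuple[str, int, int]


def filter_and_merge(repeats: List[Repeat],
                     min_motif: int = 10,
                     min_span: int = 100) -> List[Region]:
    """Keep repeats that pass both size thresholds, then merge overlapping ones."""
    kept = sorted((c, s, e) for c, s, e, m in repeats
                  if m >= min_motif and (e - s) >= min_span)
    merged: List[Region] = []
    for chrom, start, end in kept:
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


example = [
    ("chr1", 100, 300, 20),   # kept
    ("chr1", 250, 400, 15),   # kept, overlaps the first annotation
    ("chr1", 500, 550, 12),   # dropped: span < 100 bp
    ("chr1", 600, 800, 5),    # dropped: motif < 10 bp
    ("chr2", 100, 250, 30),   # kept
]
regions = filter_and_merge(example)
```

Sorting before merging means each annotation only needs to be compared with the most recently emitted region.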
In each sample of the discovery cohort (GTEx and PCGC), we estimated relative diploid copy number of each autosomal VNTR region by using CNVnator (v.0.3.3 with default thresholds and bin size 100 bp), which uses normalized read depth to estimate copy number of a locus.29 It should be noted that in VNTR regions, where by definition there are multiple copies of a repeated motif, CNVnator copy number estimates represent the fold change in total (diploid) repeat number relative to the number of motifs annotated in the (haploid) reference genome. For example, Figure 1A shows CNVnator-estimated copy number for a 44-mer repeat that has 43 copies in the reference genome (chr12: 132,148,891–132,150,764, hg38). An individual with a relative CNVnator copy estimate of 6 is therefore predicted to carry a total of 43 × 6 = 258 copies of this repeat.
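The arithmetic behind this interpretation can be written out explicitly. The following Python sketch is illustrative only; the function names are hypothetical and are not part of CNVnator.

```python
def total_repeat_copies(reference_copies: float, relative_copy_number: float) -> float:
    """Total (diploid) repeat copies implied by a relative copy number estimate."""
    return reference_copies * relative_copy_number


def mean_copies_per_allele(reference_copies: float, relative_copy_number: float) -> float:
    """Average copies per allele, assuming the total is split over two alleles."""
    return total_repeat_copies(reference_copies, relative_copy_number) / 2


# Worked example from the text: 43 reference copies, relative estimate of 6
worked_example = total_repeat_copies(43, 6)
```

Under this reading, a relative estimate of 2 corresponds to two alleles that each match the reference repeat length on average.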
Utilizing CNVnator copy number estimates of invariant regions of the genome, we observed a strong technical bias in GTEx WGS data: samples that were sequenced prior to 2016 showed systematic shifts in estimated copy number compared with later batches (Figure S1). As a result, we removed 136 samples that were sequenced in batch “2015-10-06.” On the basis of analysis of invariant loci, and principal component analysis and density plots based on VNTR copy number, we excluded a further 60 samples from the GTEx cohort and 10 samples from the PCGC cohort that were outliers in one or more of these analyses (Figure S2).
In situations where a VNTR is embedded within a larger copy number variable region, copy number estimates for a VNTR based on read depth can be confounded by variations of the wider region because these would result in gains or losses in the total number of VNTR copies present but without any change in the length of the VNTR array. To identify VNTRs where our copy number estimates were potentially subject to this confounder, we performed copy number analysis of the 3′ and 5′ regions flanking each VNTR by using CNVnator (Figure S3). In cases where the 1 kb flanking region of a VNTR overlapped a simple repeat with motif size ≥ 6 bp, we trimmed the flanking region, retaining only the flanking portion that was adjacent to the VNTR. We then removed from our analysis any VNTRs where
(1) both flanks had trimmed length < 500 bp;
(2) correlation (R) between copy number of the VNTR and either of the flanking regions was >0.5;
(3) either flanking region showed large variations in copy number, defined as those flanks where the difference between 99th and 1st percentile was >2;
(4) they overlapped CNVs with minor allele frequency > 10% in Europeans.34
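The four exclusion criteria listed above can be expressed as a single predicate. The Python sketch below is a hedged illustration: the function names, the use of Pearson correlation for R, and the linear-interpolation percentile are assumptions, because the text does not specify how these quantities were computed.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile (q in 0..100) by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fails_flank_filters(vntr_cn: Sequence[float],
                        flank5_cn: Sequence[float],
                        flank3_cn: Sequence[float],
                        flank5_len: int,
                        flank3_len: int,
                        overlaps_common_cnv: bool) -> bool:
    """Return True if the VNTR should be removed under criteria (1) to (4)."""
    if flank5_len < 500 and flank3_len < 500:                      # criterion (1)
        return True
    for flank in (flank5_cn, flank3_cn):
        if pearson_r(vntr_cn, flank) > 0.5:                         # criterion (2)
            return True
        if percentile(flank, 99) - percentile(flank, 1) > 2:        # criterion (3)
            return True
    return overlaps_common_cnv                                      # criterion (4)
```

Copy number vectors are assumed to be ordered by sample, so that position i in each vector refers to the same individual.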
As copy number estimates in GTEx WGS samples showed high variability based on analysis of density plots, we normalized VNTR copy numbers in the 424 remaining samples by applying an inverse rank normal transformation.35 On the basis of visual inspection of density plots of these transformed copy numbers, we removed a further 20 outlier samples (Figure S2), leaving 404 samples that were used for association analysis with gene expression (Table S1). Finally, we removed VNTRs that showed very low levels of variation in the population (standard deviation < 0.2).
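For readers unfamiliar with the transformation, a minimal Python sketch of an inverse rank normal transformation is given below. The rank offset `c = 0.5` and the averaging of tied ranks are assumptions; the text does not state which variant was used.

```python
from statistics import NormalDist
from typing import List, Sequence


def inverse_rank_normal(values: Sequence[float], c: float = 0.5) -> List[float]:
    """Map values to standard normal quantiles according to their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Offset c keeps the quantile argument strictly between 0 and 1
    return [nd.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


transformed = inverse_rank_normal([5.0, 1.0, 3.0])
```

The transformation preserves the ordering of samples while forcing the distribution of each VNTR toward a standard normal shape, which limits the influence of extreme copy number estimates.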
Comparison of VNTR copy number estimates with genotypes obtained via long reads and the adVNTR algorithm
Using the tool MsPAC, we generated diploid genome assemblies for 14 individuals from available Pacific Biosciences (PacBio) WGS data and phased SNVs (Table S2).36 Where phased SNVs were not available (samples HG02059, HG02818, HG03486, and HG0386), we performed phasing by using GATK HaplotypeCaller and WhatsHap.37 We generated VNTR genotypes directly from the diploid long-read assemblies by using PacMonSTR.38 For each of these individuals, PCR-free Illumina WGS data were also available, and we processed them with CNVnator to estimate VNTR copy number, as described above. To estimate the accuracy of our VNTR genotypes derived via CNVnator, we utilized a set of 2,027 VNTRs that are associated with local gene expression (eVNTRs) that showed significant associations with gene expression in one or more GTEx tissues and that were composed of single annotated (i.e., non-merged) tandem repeats, copy number of which could be unambiguously genotyped with PacMonSTR. We discarded genotypes where both haplotypes in a sample were not represented in the PacBio genome assemblies or where VNTR copy number was >200, yielding a final total of 16,403 pairwise genotypes derived from 1,891 VNTR loci across the 14 samples, representing all eVNTR loci genotyped by CNVnator for which we also obtained at least one set of diploid genotypes from the 14 PacBio genome assemblies analyzed. To assess the performance of an alternative approach for genotyping VNTRs from short-read WGS, we were also able to generate genotypes for 1,746 of these same 1,891 loci from the Illumina WGS reads with adVNTR in the 14 samples by using default parameters.11
Identification of eVNTRs in the GTEx cohort
Using RNA-seq data from the filtered set of 404 WGS samples that passed our quality control steps, we adjusted gene expression data for sex, RNA-seq platform, the first three principal components from SNV genotypes, and between 15 and 60 covariates per tissue estimated via PEER.39 Within each tissue, we performed linear regression between VNTR copy number and corrected expression level of each gene located within ±500 kb by using the lm function in R. We applied a false discovery rate (FDR) correction and reported all VNTR:gene pairs with FDR q < 0.1 in any tissue.40
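To illustrate the multiple testing step, the sketch below converts a list of regression p values into FDR q values. It assumes the Benjamini-Hochberg procedure, which is a common choice but is not named explicitly in the text above; the function name is hypothetical, and the regressions themselves were run with lm in R rather than in Python.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg q values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q < 0.1 for q in q_values]
```

In this toy example the first three tests pass the q < 0.1 threshold and the fourth does not.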
Identification of mVNTRs in the PCGC cohort
After excluding samples that either did not pass our quality control for DNA methylation or were outliers for VNTR copy number on the basis of density plots, 235 individuals from the PCGC cohort were utilized for association analysis of VNTR copy number with CpG methylation levels (VNTRs that are associated with DNA methylation levels, or mVNTRs). We excluded CpGs with low levels of variation (standard deviation < 0.02), leaving 316,169 CpGs that were located within ±50 kb of VNTRs that were used for association analysis. CpG methylation data (β values) were adjusted for age, sex, the top two ancestry-related principal components derived from principal-component analysis (PCA) of SNVs, and blood cell fractions estimated directly from the methylation data with the Houseman method.41,42 We used the resulting residuals to test the association between DNA methylation and estimated VNTR copy number by using the lm function in R. We applied a Bonferroni correction to the resulting p values based on the total number of pairwise VNTR:CpG tests performed and considered those with Bonferroni-adjusted p < 0.01 as significant.
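The Bonferroni step described above amounts to multiplying each p value by the number of tests and capping the result at 1. A minimal Python sketch follows; the function name is hypothetical.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p values based on the total number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


adjusted = bonferroni([1e-9, 0.002, 0.2, 0.9])
significant = [p < 0.01 for p in adjusted]
```

Because the adjustment scales with the number of VNTR:CpG tests, it is considerably more conservative than an FDR threshold when many pairs are tested.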
Replication of eVNTRs and mVNTRs in the PPMI cohort
We utilized available WGS, RNA-seq and methylation data for 712 individuals from the PPMI cohort. We generated copy number estimates for all VNTR loci utilized in the GTEx and PCGC discovery cohorts by using CNVnator (v.0.4.1) and applied the same quality control and normalization steps as used in the discovery cohorts, including inverse rank normal transformation to the VNTR copy numbers, resulting in the exclusion of nine outlier samples. We normalized gene expression data by using the same method as applied to the GTEx cohort, including application of inverse rank normal transformation. These normalized expression data were adjusted for sex, the first three ancestry-related principal components derived from PCA of SNVs, and 60 additional components estimated via PEER.39 We performed association between VNTRs and normalized adjusted gene expression levels by using linear regression, as described above for the GTEx cohort.
For replication of mVNTRs, we applied the same quality control and normalization pipeline to the methylation data as used for the PCGC cohort, as described above. Normalized β values were adjusted for sex, age, the top three principal components from SNV genotypes, and estimated blood cell fractions. We then used the residuals to perform linear regression with VNTR genotype. CNVnator analysis of the PPMI cohort was performed in the hg38 assembly, and we used liftover to convert VNTR coordinates to the hg19 assembly for association analysis with methylation and expression data.
Enrichment analysis
We performed all enrichment analyses by comparing the frequency of significant eVNTRs and mVNTRs against the background set of all VNTR:gene pairs that were tested in each cohort, and we generated p values by using the hypergeometric distribution. We defined promoter regions as ±2 kb of gene transcription start sites (TSSs). We utilized a set of enhancer element annotations downloaded from the GeneHancer track in the UCSC Genome Browser, utilizing only loci labeled “Enhancers.”43 We utilized a composite list of silencer element annotations, corresponding to all significant silencer elements identified in two cell types under different conditions.44
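A hypergeometric enrichment test of this kind can be sketched as follows. The function name, argument names, and the use of a one-sided upper-tail p value are assumptions made for illustration; exact integer arithmetic is used so that no external libraries are required.

```python
from math import comb
from typing import Tuple


def hypergeom_enrichment(total: int, annotated: int,
                         significant: int, overlap: int) -> Tuple[float, float]:
    """
    total:       number of background VNTRs tested
    annotated:   background VNTRs overlapping the annotation (e.g., promoters)
    significant: number of significant VNTRs
    overlap:     significant VNTRs overlapping the annotation
    Returns (fold enrichment, upper-tail hypergeometric p value).
    """
    expected = significant * annotated / total
    fold = overlap / expected
    upper = min(annotated, significant)
    # P(X >= overlap) when drawing `significant` items without replacement
    tail = sum(comb(annotated, k) * comb(total - annotated, significant - k)
               for k in range(overlap, upper + 1))
    return fold, tail / comb(total, significant)


fold, p = hypergeom_enrichment(total=10, annotated=4, significant=3, overlap=2)
```

For extremely small p values the floating-point result may underflow to 0.0, in which case a log-scale calculation would be needed.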
Replication of VNTR:CpG associations via Oxford Nanopore long-read data
FASTQ files of WGS reads generated with Oxford Nanopore Technologies (ONT) sequencing from 30 EBV-transformed lymphoblastoid cell lines were downloaded from the Human Pangenome Reference Consortium and aligned to hg38 via minimap2 with default parameters.45 SNV calls were generated for each sample via the bwa-GATK pipeline based on Illumina WGS downloaded from the International Genome Sample Resource. Variants were phased with WhatsHap37 by using the alignment of the ONT reads to the reference genome. We generated diploid genome assemblies by using MsPAC36 with ONT reads aligned to hg38 and phased SNVs as input. VNTRs were genotyped on each assembled haplotype via PacMonSTR.38 We used the call-methylation function in nanopolish to score CpG sites in each read as either methylated or not methylated.46 Because MsPAC partitioned reads into the two possible haplotypes per sample, we calculated the methylation fraction for each CpG site per haplotype on the basis of all haplotype-phased reads overlapping each CpG. In order to ensure robust methylation measurements, we only retained CpGs for phased haplotypes that were covered by ≥10 reads. We then calculated correlation coefficients between VNTR copy number and methylation fractions for mVNTR:CpG pairs identified in the PCGC cohort where there were ≥20 haplotypes with both VNTR genotypes and CpG measurements available and where the VNTR showed an allelic range ≥ 2 copies.
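The per-haplotype filters described above can be sketched in Python as follows. The function names and input layout are hypothetical, and the allelic range is assumed to be computed only over haplotypes that have both a VNTR genotype and a CpG measurement.

```python
from typing import List, Optional, Sequence


def methylation_fraction(calls: Sequence[bool], min_reads: int = 10) -> Optional[float]:
    """Fraction of methylated calls on one haplotype, or None if coverage is too low."""
    if len(calls) < min_reads:
        return None
    return sum(calls) / len(calls)


def eligible_for_correlation(vntr_copies: Sequence[Optional[float]],
                             meth_fractions: Sequence[Optional[float]],
                             min_haplotypes: int = 20,
                             min_range: float = 2) -> bool:
    """Check the haplotype count and allelic range requirements for one VNTR:CpG pair."""
    copies: List[float] = [v for v, m in zip(vntr_copies, meth_fractions)
                           if v is not None and m is not None]
    if len(copies) < min_haplotypes:
        return False
    return max(copies) - min(copies) >= min_range
```

Pairs that pass `eligible_for_correlation` would then be carried forward to the correlation analysis between copy number and methylation fraction.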
Population stratification of VNTRs
We obtained Illumina WGS reads mapped to hg38 from samples in the Human Genome Diversity Panel, utilizing data for a subset of 676 samples that were sequenced with PCR-free protocols.47 We used CNVnator (v.0.4.1) to estimate relative copy number of autosomal VNTRs (hg38). We performed quality control by using PCA and density plots to remove outliers and compared the reported sex of each sample against sex chromosome copy number, removing any discordant samples. We filtered VNTRs to remove those within putative larger CNVs, as detailed above. After applying these filters, we utilized genotypes of 66,796 VNTRs in 643 samples from seven different super-populations. For each super-population, we calculated VST as follows:
$$V_{ST} = \frac{V_A - (V_T \times C_T + V_B \times C_B)}{V_A}$$
where VA is the variance of all the samples, VT is the variance of the target population, VB is the variance of the background population, and CT and CB are the fractions of samples belonging to the target and background populations, respectively.48 For each of the seven super-populations, we calculated VST for each VNTR by considering one super-population as the target and using all other samples as background. p values were generated by permutation testing (n = 1,000 permutations), in which samples were randomly assigned to either the target or background groups. We selected those VNTRs in each super-population with VST ≥ 0.1 and permutation p < 0.01.
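A minimal Python sketch of this calculation is shown below. It assumes the commonly used form VST = (VA − (VT × CT + VB × CB)) / VA, population (rather than sample) variances, and an add-one correction for the permutation p value; none of these details is confirmed by the text, and the function names are hypothetical.

```python
import random
from statistics import pvariance
from typing import Sequence


def vst(target: Sequence[float], background: Sequence[float]) -> float:
    """VST for one VNTR, given copy numbers in the target and background groups."""
    pooled = list(target) + list(background)
    v_a = pvariance(pooled)
    if v_a == 0:
        return 0.0
    c_t = len(target) / len(pooled)
    c_b = len(background) / len(pooled)
    within = pvariance(target) * c_t + pvariance(background) * c_b
    return (v_a - within) / v_a


def permutation_p(target: Sequence[float], background: Sequence[float],
                  n_perm: int = 1000, seed: int = 0) -> float:
    """Permutation p value: how often shuffled labels give VST >= the observed value."""
    rng = random.Random(seed)
    observed = vst(target, background)
    pooled = list(target) + list(background)
    n_t = len(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if vst(pooled[:n_t], pooled[n_t:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction (assumption)
```

A VNTR whose copy number differs completely between groups, with no variation within either group, gives VST = 1, whereas identical distributions give VST = 0.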
Annotation of VNTRs with potential trait associations
In order to link eVNTRs with human traits that they might influence, we used two complementary approaches. First, we used results of PrediXcan applied to 44 GTEx tissues and >100 phenotypes from GWASs, annotating eVNTRs with phenotypes if they shared the same gene name and tissue as indicated by PrediXcan.49 However, because PrediXcan has been applied to a relatively limited set of traits, we further annotated eVNTRs by using a combination of eQTLs identified by the GTEx project and SNVs from the GWAS Catalog.50 Here, eVNTRs were linked to putative associated phenotypes as follows: (1) for each eVNTR identified in a specific tissue, we joined these with eQTLs identified in the same GTEx tissue based on gene name; (2) we extracted all SNVs from the GWAS Catalog with p < 5 × 10⁻⁸ and joined these to the GTEx eQTLs; and (3) where an eVNTR was joined with an SNV that was both a GWAS variant and an eQTL for the same gene in the same tissue, we annotated the eVNTR with the GWAS phenotype(s).
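The three-step join described above can be sketched with plain Python dictionaries. The record layouts, identifiers, and function name below are hypothetical and serve only to show the logic of the join.

```python
from typing import Dict, Iterable, Set, Tuple

EVntr = Tuple[str, str, str]     # (vntr_id, gene, tissue)
Eqtl = Tuple[str, str, str]      # (snv_id, gene, tissue)
Gwas = Tuple[str, str, float]    # (snv_id, phenotype, p_value)


def annotate_evntrs(evntrs: Iterable[EVntr],
                    eqtls: Iterable[Eqtl],
                    gwas: Iterable[Gwas],
                    p_threshold: float = 5e-8) -> Dict[EVntr, Set[str]]:
    """Annotate eVNTRs with GWAS phenotypes via SNVs that are eQTLs for the same gene and tissue."""
    # Step 2: keep genome-wide significant GWAS SNVs, indexed by SNV
    gwas_by_snv: Dict[str, Set[str]] = {}
    for snv, phenotype, p in gwas:
        if p < p_threshold:
            gwas_by_snv.setdefault(snv, set()).add(phenotype)
    # Step 1: index eQTL SNVs by (gene, tissue)
    snvs_by_key: Dict[Tuple[str, str], Set[str]] = {}
    for snv, gene, tissue in eqtls:
        snvs_by_key.setdefault((gene, tissue), set()).add(snv)
    # Step 3: transfer phenotypes to eVNTRs sharing gene and tissue with a GWAS eQTL
    annotations: Dict[EVntr, Set[str]] = {}
    for record in evntrs:
        _, gene, tissue = record
        for snv in snvs_by_key.get((gene, tissue), ()):
            for phenotype in gwas_by_snv.get(snv, ()):
                annotations.setdefault(record, set()).add(phenotype)
    return annotations


demo = annotate_evntrs(
    evntrs=[("VNTR1", "GENE_A", "Thyroid")],
    eqtls=[("rs1", "GENE_A", "Thyroid"), ("rs2", "GENE_A", "Lung")],
    gwas=[("rs1", "trait X", 1e-9), ("rs2", "trait Y", 1e-12), ("rs1", "trait Z", 1e-5)],
)
```

In the toy example, only "trait X" is transferred: "trait Y" is linked through an eQTL in a different tissue, and "trait Z" does not reach the GWAS significance threshold.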
Analysis of eVNTRs and mVNTRs on the X chromosome
Although all analysis described above was based on autosomal loci, we also performed association analysis of VNTRs on the X chromosome in the GTEx, PCGC, and PPMI cohorts. Here, we only analyzed 46,XY males, as determined by read-depth analysis of the sex chromosomes (n = 182 in GTEx, n = 123 in PCGC, and n = 465 in PPMI). After removal of the pseudo-autosomal regions and quality filtering (as described above), we performed association analysis of 2,348 VNTRs with gene expression and DNA methylation, utilizing the same statistical thresholds as for autosomal loci.
Results
Robust genotyping of VNTRs via read depth
Using read depth from Illumina WGS data as a proxy for diploid copy number, we generated copy number estimates for a set of 70,787 large TRs (median motif size 116 bp, mean span of repeat tract in reference genome 353 bp), henceforth referred to as variable number tandem repeats (VNTRs). Many VNTR loci showed highly variable copy number estimates among different individuals, indicative of extreme levels of inter-individual polymorphisms at many of these loci (Figure 1A).
In order to assess the validity of genotyping VNTRs from read depth, we compared estimated VNTR copy numbers from CNVnator with genotypes obtained directly from spanning long reads from de novo diploid PacBio genome assemblies. Using 14 individuals for which both Illumina and PacBio WGS data were available, we observed good global correlation between these two approaches, with an overall R² = 0.81, indicating that read depth is generally an effective proxy for measuring total copy number at the majority of VNTR loci (Figure 1B, Table S3). In comparison, we found that an alternative tool designed for genotyping VNTRs from short-read data performed relatively poorly, yielding an R² = 0.14 when compared with direct genotypes generated from long-read WGS (Figure S4, Table S3).11
Given that some VNTR motifs are not unique and can occur at multiple genomic loci, we investigated the reliability of reads mapped to VNTR loci. Using high-coverage Illumina WGS data in a Yoruba individual from the 1000 Genomes Project (NA18874), we assessed mapping quality scores for reads that overlapped VNTRs on the basis of both their MAPQ score and the MAPQ score of their mate pairs. We classified reads from VNTR loci into three categories. The first category was MAPQ ≥ 10, which we considered reliably mapped. The second category was MAPQ < 10 but with a mate pair that mapped reliably within ±10 kb. We considered these reads as reliably mapped to the correct VNTR on the basis of their mate pair. Likely many such reads that are contained entirely within a VNTR yield low mapping quality because of the fact that VNTRs are composed of repeated copies, giving multiple possible map positions within a single VNTR tract. The third category was MAPQ < 10 and with a mate pair that was not anchored within ±10 kb. We considered these reads unreliably mapped. Overall, we observed that the vast majority of reads from VNTR loci were reliably mapped: 97.5% of VNTRs comprised <10% of overlapping reads that were unreliably mapped (MAPQ < 10 and no anchoring mate pair), and only a single VNTR contained >50% of unreliably mapped reads (Figure S5). These data indicate that ambiguous read mapping to tandemly repeated regions is not a significant confounder of our approach.
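The three read categories can be summarized as a small classification function. In the Python sketch below, the category labels, the function name, and the interpretation of a "reliably mapped" mate as one with MAPQ ≥ 10 are assumptions made for illustration.

```python
from typing import Optional


def classify_read(mapq: int,
                  mate_mapq: Optional[int],
                  mate_distance: Optional[int],
                  min_mapq: int = 10,
                  max_distance: int = 10_000) -> str:
    """Assign a read overlapping a VNTR to one of three mapping categories."""
    if mapq >= min_mapq:
        return "reliable"                      # category 1
    mate_anchored = (
        mate_mapq is not None
        and mate_distance is not None
        and mate_mapq >= min_mapq
        and abs(mate_distance) <= max_distance
    )
    if mate_anchored:
        return "anchored_by_mate"              # category 2
    return "unreliable"                        # category 3


def fraction_unreliable(categories: list) -> float:
    """Fraction of reads at a locus that fall in the unreliable category."""
    if not categories:
        return 0.0
    return categories.count("unreliable") / len(categories)
```

Applying `fraction_unreliable` per VNTR gives the quantity that is compared against the 10% and 50% thresholds quoted above.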
Overview of association analysis of VNTRs with gene expression and DNA methylation
To assess the potential regulatory effects of copy number changes of VNTRs on local gene expression and epigenetics we utilized two discovery cohorts for which PCR-free Illumina WGS data were available. First, we used a subset of quality-filtered samples from the GTEx project, comprising 404 individuals with expression data from 48 different tissues. Here, we performed cis-association analysis between estimated VNTR copy number and normalized gene expression within ±500 kb. Second, we used 235 quality-filtered samples from the PCGC for which DNA methylation profiles from whole blood were available. Here, we performed cis-association analysis between estimated VNTR copy number and CpG methylation levels within ±50 kb.
Summary of autosomal gene expression associations in the GTEx cohort
After multiple testing correction, in the GTEx cohort we identified a total of 13,752 significant pairwise VNTR:gene expression associations (10% FDR) across 48 different tissues, corresponding to 2,980 unique expression QTL VNTRs (henceforth termed eVNTRs) that were associated with the expression level of 3,167 different genes (Table S4). Using Q-Q plots to explore the distribution of observed versus expected associations, in each GTEx tissue we observed a clear enrichment for significant associations compared with the null distribution and little evidence of genomic inflation (mean λ = 1.019, range 0.997–1.040) (Figure 1C). As expected, the number of significant associations observed in different tissues was strongly associated with sample size (i.e., statistical power), varying from 13 identified in uterus to 1,080 in thyroid (Table S4). An example of the distribution of genome-wide association signals observed in skeletal muscle is shown in Figure 1D. Of note, we frequently observed the same VNTR:gene pairwise associations in multiple different tissues (35% of VNTR:gene associations were seen in ≥2 tissues), and of these, 99.4% showed consistent directionality in different tissues (Figure S6, Table S4). In addition, 34% of eVNTRs were associated with the expression of multiple different genes (mean of 3 associated genes per eVNTR, range 1–48).
Supporting a biological role in modulating gene expression, eVNTRs showed enrichments for several genome annotations with regulatory potential. We observed a 7.9-fold enrichment for eVNTRs located within ±2 kb of transcription start sites (p = 1.1 × 10⁻⁷³, Figure 1E). Consistent with this observation, the sequence content of eVNTRs also showed a strong bias toward higher GC content (permutation p < 10⁻⁷) (Figure S7). We also observed that eVNTRs were enriched at both annotated enhancer (1.7-fold enrichment, p = 1.7 × 10⁻¹⁴) and silencer elements (2.5-fold enrichment, p = 1.9 × 10⁻⁴). Further examples of results observed at eVNTRs in the GTEx cohort are shown in Figure S8.
In further support of our results, we successfully replicated three associations of VNTRs with the expression level of individual genes that had been identified in previous targeted studies: a 36-mer coding VNTR in exon 1 of AS3MT (MIM: 611806) that is associated with AS3MT expression and schizophrenia risk, a 72-mer intronic VNTR that regulates SIRT3 expression (MIM: 604481), and a 33-mer promoter VNTR that regulates expression of TRIB3 (MIM: 607898), a gene that has been linked with multiple human phenotypes.51, 52, 53
Summary of autosomal DNA methylation associations in the PCGC cohort
As for eVNTRs in the GTEx cohort, a Q-Q plot showed a clear enrichment for significant associations compared with the null distribution, although with some evidence for genomic inflation (λ = 1.297) (Figure S9). Because of this, in order to ensure robust associations, we chose to apply a more stringent multiple testing correction, identifying a total of 3,152 VNTR:CpG pairwise associations in the PCGC cohort (Bonferroni-corrected p < 0.01), corresponding to 1,480 unique methylation QTL VNTRs (henceforth termed mVNTRs) and 2,466 unique CpGs (Table S5). Similar to observations made for eVNTRs, mVNTRs also showed a strong bias to occur in close proximity to the CpGs they associated with, and the majority are separated by <5 kb (Figure S10). mVNTRs tended to have a significantly higher GC content than all VNTRs in the genome (permutation p < 10⁻⁷, Figure S7) and were 2.2-fold enriched for annotated enhancers (p = 3.7 × 10⁻¹⁹) and 2.2-fold enriched for annotated silencers (p = 8.1 × 10⁻³). Three examples of the association signals observed around mVNTRs are shown in Figure 2, while additional plots of eight other mVNTR loci are shown in Figure S11.
Conditional analysis indicates many VNTR associations are independent of SNV QTLs
Given that multiple different genetic variants may exert regulatory effects on gene expression and CpG methylation, we considered the possibility that the VNTR associations we observed might be indirect correlations driven by linkage disequilibrium between VNTRs and other variants that are the primary QTLs. To assess whether VNTRs act as QTLs independent of other local genetic variation, we performed conditional analyses by removing the effect of the strongest SNV QTL associated with each gene and CpG that were putatively associated with VNTR copy number.
First, we utilized SNV genotypes from the WGS data in our two discovery cohorts to identify SNVs that were significantly associated (FDR q < 0.1) with local gene expression and CpG methylation levels (Figure 3A). For each VNTR pairwise association, we then retained only the subset of individuals that were homozygous for the major allele of the lead QTL SNV and repeated the association analysis between VNTR copy number and gene expression/DNA methylation (Figure 3B). Doing so, we observed a clear trend where the majority of VNTR associations retained the same directionality as in our original analyses (Figures 3C and 3D). Overall, 9,791 of 12,784 eVNTR:gene pairs (76.6%) and 2,280 of 3,152 mVNTR:CpG pairs (72%) showed the same direction of association after conditioning on the lead QTL SNV. Despite a considerable loss of statistical power due to the reduced sample size when conditioning based on the strongest SNV QTL, in the GTEx cohort, 2,146 associations showed the same direction of effect with p < 0.01 and 1,434 met our genome-wide significance discovery threshold (FDR q < 0.1) (Table S4). Similarly, for mVNTRs identified in the PCGC cohort, after conditioning on the lead mQTL SNV, 693 associations showed the same direction of effect with p < 0.01 and 273 associations met our genome-wide significance discovery threshold (Bonferroni p < 0.01) (Table S5). Overall, these results indicate that many of the VNTR associations we detected are independent of other local QTLs and are not simply driven by the linkage disequilibrium architecture of the genome.
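The conditioning strategy described above (restricting to individuals homozygous for the major allele of the lead SNV and repeating the regression) can be sketched as follows. This Python illustration computes only the regression slope so that the direction of effect can be compared; the function names and the genotype encoding (count of minor alleles) are assumptions.

```python
from typing import List, Sequence


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def conditional_slope(vntr_cn: Sequence[float],
                      phenotype: Sequence[float],
                      minor_allele_count: Sequence[int]) -> float:
    """Slope of phenotype on VNTR copy number among major-allele homozygotes only."""
    keep: List[int] = [i for i, g in enumerate(minor_allele_count) if g == 0]
    return slope([vntr_cn[i] for i in keep], [phenotype[i] for i in keep])


def same_direction(a: float, b: float) -> bool:
    """True if two non-zero slopes share the same sign."""
    return a * b > 0


vntr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
expr = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
geno = [0, 0, 0, 1, 1, 2]
full = slope(vntr, expr)
conditioned = conditional_slope(vntr, expr, geno)
```

Restricting to one genotype class removes the contribution of the lead SNV entirely, at the cost of the reduced sample size noted in the text.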
Large-scale replication of eVNTRs and mVNTRs in an independent cohort
In order to assess the robustness of the associations we identified in the GTEx and PCGC discovery cohorts, we conducted replication analysis in the PPMI cohort, consisting of a total of 703 individuals with WGS, gene expression, and methylation data. We used CNVnator to analyze VNTR copy number in each sample and then performed association analysis with both gene expression and CpG methylation levels by using identical pipelines as applied in the two discovery cohorts. These analyses identified 3,537 significant autosomal eVNTRs that were associated with the expression level of 3,615 unique genes (6,454 pairwise associations) (Table S6) and 3,288 significant autosomal mVNTRs that were associated with methylation levels of 6,999 unique CpGs (9,730 pairwise associations) (Table S7).
When compared to the associations identified in whole blood from the GTEx and PCGC cohorts, we observed replication at genome-wide significance levels and with concordant directionality for 278 of 381 (73%) GTEx eVNTR:gene pairwise associations and 2,507 of 3,139 (80%) PCGC mVNTR:CpG pairwise associations (Figure 4), providing strong evidence that the majority of the associations we report are robust. We also observed a trend for many VNTR loci to be associated with both gene expression and CpG methylation. In the PPMI cohort, of the 3,537 unique eVNTR loci identified, 1,489 (42.1%) were also associated with the methylation level of one or more cis-linked CpGs. Of these, 653 (43.9%) had one or more associated CpGs located in either the promoter or an annotated enhancer element of the same gene whose expression they were associated with. Using GeneHancer annotations, which define promoter and enhancer elements that are linked to the gene(s) they most likely regulate,43 we identified CpGs that were associated with an mVNTR and lie within annotated promoters or enhancers of eVNTR target genes. We then compared the correlation coefficients between VNTR copy number and both methylation and expression, thereby comparing associations of VNTRs with both gene expression and epigenetics of their annotated regulatory regions. We observed that for both promoters (p = 5.8 × 10⁻¹⁰) and enhancers (p = 3.9 × 10⁻¹⁶), there was a significant inverse relationship between CpG methylation and gene expression, i.e., functional VNTRs preferentially showed opposite directionality of effects on methylation of regulatory elements and expression of the associated genes (Figure S12). This high degree of convergence between the two data types lends further support to our results and suggests that, in at least a subset of cases, the potential mechanism of action of VNTRs on gene expression is via epigenetic modification of regulatory elements.
Replication of VNTR:CpG associations via Oxford Nanopore long-read data
We utilized a set of 30 genomes sequenced with Oxford Nanopore technology (ONT) to further validate a subset of mVNTRs. Here, after generating phased genome assemblies,36 we directly genotyped VNTR copy number on each haplotype by using spanning long reads and determined allelic CpG methylation levels by analysis of electrical current signals from each phased read,46 allowing direct association of DNA methylation levels with cis-linked VNTR alleles (Figure 5A). After quality filtering, data for each VNTR:CpG pair were available for a mean of 24 independent haplotypes and with a mean read depth of 24× per CpG on each haplotype.
Because of the small number of samples, a genome-wide analysis would carry a prohibitive multiple testing burden. We therefore restricted the analysis to replication of mVNTR:CpG pairs identified in the PCGC discovery cohort; after quality filtering, 228 pairs had sufficient data for robust analysis. We observed a clear trend in which the majority of VNTR:CpG associations identified via read depth and Illumina 850k array profiling showed concordant directionality with direct VNTR and methylation measurements from ONT reads: 163 of 228 (71%) VNTR:CpG pairs showed consistent directionality of association in the two datasets (Figure 5B). It should be noted that, given the very small size of this cohort, the relatively coarse resolution of methylation measurements compared with some of the effect sizes, and the different cell type compared with the PCGC discovery cohort, not all loci were expected to show strong replication.
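Directional concordance of the kind reported here is simply the fraction of tested pairs whose effect estimates share a sign in both datasets. A minimal Python sketch follows; the function name, the handling of zero effects, and the toy effect values are assumptions for illustration and do not reproduce the study's data.

```python
def directional_concordance(discovery_effects, replication_effects):
    """Fraction of paired effects with the same sign.
    Pairs in which either effect is exactly zero are skipped."""
    pairs = [(d, r) for d, r in zip(discovery_effects, replication_effects)
             if d != 0 and r != 0]
    if not pairs:
        raise ValueError("no informative pairs")
    agree = sum(1 for d, r in pairs if (d > 0) == (r > 0))
    return agree / len(pairs)

# Hypothetical correlation coefficients for five VNTR:CpG pairs
discovery = [0.42, -0.31, 0.18, -0.55, 0.27]
replication = [0.30, -0.12, -0.05, -0.40, 0.09]
print(directional_concordance(discovery, replication))
```

Under the null of no shared signal, roughly half of the pairs would be expected to agree in sign by chance, which is the baseline against which a value such as 71% should be read.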
In addition to providing replication for individual CpGs, the use of ONT reads provided much more complete assessment of CpG methylation levels compared with the targeted coverage of the Illumina 850k array, and in several cases, we observed broad clusters of multiple CpGs that showed strong associations with VNTR copy number that were not apparent from array profiling (Figures 5C and 5D). Overall, these data provided additional supporting evidence that read-depth profiling of VNTRs is effective for identifying genuine biological associations.
Population stratification and trait associations of VNTRs
We analyzed VNTR copy number in samples from the Human Genome Diversity Panel and used these data to estimate the degree of population stratification in VNTR copy number with the VST statistic.47,48 Examples of VNTRs with high population stratification are shown in Figure 6. We observed strong enrichment for VNTRs with high population divergence within the set of putatively functional VNTRs identified in our discovery cohorts: there were 27 GTEx eVNTRs with VST > 0.2 (5.7-fold enrichment compared with all VNTR loci tested, p = 7.9 × 10⁻¹⁴) and 120 with VST > 0.1 (3.8-fold enrichment, p = 9.2 × 10⁻³⁸), while for mVNTRs in the PCGC cohort, 15 had VST > 0.2 (6.3-fold enrichment, p = 1.3 × 10⁻⁸) and 112 had VST > 0.1 (6.6-fold enrichment, p = 1.5 × 10⁻⁵⁷). We also compared this set of population-stratified VNTRs to TRs that were previously identified as having expanded specifically in the human lineage compared to other primates and observed similar enrichments (GTEx eVNTRs 5.7-fold enriched, p = 0.045; PCGC mVNTRs 9.2-fold enriched, p = 0.018).54
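For readers unfamiliar with VST, it is commonly defined, following Redon et al.,48 as (VT − VS)/VT, where VT is the variance of copy number across all individuals and VS is the average within-population variance weighted by population size. The Python sketch below implements that definition. The use of population (rather than sample) variance, the absence of any transformation of copy number, and the toy values are assumptions of this sketch and are not taken from the study.

```python
from statistics import pvariance

def vst(groups):
    """Compute VST from a dict mapping population label -> list of
    copy-number estimates. Returns 0.0 when there is no variance at all."""
    all_values = [v for values in groups.values() for v in values]
    v_total = pvariance(all_values)
    if v_total == 0:
        return 0.0
    n = len(all_values)
    # Within-population variance, weighted by population size
    v_within = sum(len(values) * pvariance(values) for values in groups.values()) / n
    return (v_total - v_within) / v_total

# Hypothetical copy-number estimates in two populations
stratified = {"pop_A": [2, 2, 2, 2], "pop_B": [4, 4, 4, 4]}
unstratified = {"pop_A": [1, 2, 3], "pop_B": [1, 2, 3]}
print(vst(stratified), vst(unstratified))
```

With these choices VST ranges from 0 (populations indistinguishable) to 1 (all variance lies between populations), so thresholds such as VST > 0.1 or VST > 0.2 select loci at which a substantial share of copy-number variance is explained by population membership.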
To investigate whether eVNTRs with elevated VST levels were enriched for phenotype associations, we annotated eVNTRs with human phenotypes that they potentially regulate by using both the results of PrediXcan and a combination of tissue-matched eQTLs joined with variants from the GWAS Catalog (Table S4). This identified 198 of 2,980 eVNTRs (6.6%) that had trait annotations from PrediXcan, while 634 eVNTRs (21.3%) had annotations derived from the overlap of GWAS Catalog variants and eQTLs. Examples of several functionally interesting candidate eVNTRs that are potentially linked to human traits via annotations from PrediXcan include the following.
(1) An 87-mer VNTR (chr6: 166,997,608–166,997,912, hg38) that associates with expression of RNASET2 (MIM: 612944) in esophagus mucosa. RNASET2 is a secreted extracellular ribonuclease with roles in immune sensing and response and is linked by PrediXcan with risk of Crohn disease, inflammatory bowel disease, and rheumatoid arthritis.55,56
(2) A VNTR region composed of multiple motifs (chr16: 29,196,863–29,197,354, hg38) that associates with expression of TUFM (MIM: 602389) in thyroid. TUFM is a mitochondrial elongation factor involved in mitochondrial replication and is linked by PrediXcan with body mass index and hip and waist circumference.57
(3) A 53-mer VNTR (chr17: 83,032,018–83,032,543, hg38) located within an intron of B3GNTL1 (MIM: 615337) that associates with B3GNTL1 expression in aorta. B3GNTL1 is a glycosyltransferase that transfers sugar moieties to acceptor molecules and is linked by PrediXcan with levels of glycated hemoglobin.58
Consistent with the notion that selection may have acted to modify copy number of functional VNTR loci in specific populations, we observed that eVNTRs with elevated VST levels were enriched for putative phenotype associations: 44 GTEx eVNTRs with VST > 0.1 were linked with GWAS traits, representing a 1.7-fold enrichment when compared with all eVNTRs identified (p = 9.0 × 10⁻⁵), while 13 had trait associations from PrediXcan (1.6-fold enrichment, p = 0.058).
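Fold enrichments of this kind compare the fraction of a subset carrying an annotation with the fraction in the background set. The statistical test used is not restated in this section; the Python sketch below assumes a one-sided hypergeometric test, which is a common choice for such comparisons. The function names and the toy counts are hypothetical and are not the study's values.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """(k/n) / (K/N): k annotated items among n selected,
    K annotated items among N in the background."""
    return (k / n) / (K / N)

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    of which K carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total

# Hypothetical counts: 2 of 2 selected loci are annotated,
# against 5 annotated loci in a background of 10.
print(fold_enrichment(2, 2, 5, 10))
print(hypergeom_upper_tail(2, 2, 5, 10))
```

Note that a modest fold enrichment can still give a small p value when the counts are large, and a larger fold enrichment can fail to reach significance when the counts are small, as with the PrediXcan comparison above.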
Analysis of eVNTRs and mVNTRs on the X chromosome
In addition to analysis of autosomal loci, we also performed association analysis between VNTR copy number and gene expression and DNA methylation on the X chromosome. Because of the confounder of X chromosome inactivation, which results in large epigenetic and expression changes over most of the X chromosome in females,59 we utilized only male samples, resulting in an approximate halving of sample size and a large corresponding reduction in statistical power. We identified 14 eVNTR:gene pairwise associations in the GTEx cohort (Table S8), 12 mVNTR:CpG pairwise associations in the PCGC cohort (Table S9), and 36 eVNTR:gene and 36 mVNTR:CpG pairwise associations in the PPMI cohort (Tables S10 and S11).
Discussion
Here, we have conducted a genome-wide scan for putatively functional VNTRs that associate with local gene expression (eVNTRs) and DNA methylation (mVNTRs) by using two separate cohorts for initial discovery, followed by subsequent replication in a third cohort. In addition, we provided further validation of mVNTRs by using phased genomes sequenced with ONT long reads. We identified thousands of VNTRs where repeat copy number associated with local expression and epigenetics and successfully replicated the majority of these signals at stringent genome-wide significance thresholds. Multiple observations are consistent with a functional role for these loci, including an enrichment for regulatory elements such as gene promoters, annotated enhancer and silencer elements, a strong bias for eVNTRs/mVNTRs to lie in close proximity to their associated gene/CpG, and replication of several known VNTR associations from prior targeted studies. We hypothesize that VNTRs might act to modify gene expression and epigenetics via several different mechanisms. These include modifying the structural properties of the DNA and/or chromatin fiber, changing the number of binding sites for DNA and/or chromatin-associated factors, or altering spacing between regulatory elements and their targets.
Using conditional analysis where we removed the effect of known SNV QTLs for the same gene or CpG that was associated with VNTR copy number, we show that many of the signals we detected are not simply driven by linkage disequilibrium between VNTRs and flanking SNVs. We also investigated stratification of VNTRs by using diverse human populations. As selection resulting from differing environmental pressures represents a potential mechanism leading to high population divergence, elevated VST can be an indicator of possible selective effects acting on VNTR copy number. We observed multiple examples of putatively functional eVNTRs and mVNTRs that showed population-specific expansion or contraction. By annotating eVNTRs with possible human traits that they might influence based on the genes they regulate, we found that eVNTRs with elevated VST levels were enriched for putative phenotype associations. Finally, we also observed that eVNTRs and mVNTRs were enriched for TRs that have undergone human-specific expansions in copy number.54 Overall, these data provide strong evidence to support the notion that copy number variation of some VNTR loci exerts a regulatory effect on the local genome, is most likely associated with a wide variety of human traits and disease susceptibilities, and similar to single nucleotide and other types of structural variation, has most likely been subject to selective pressures during recent evolutionary history.60, 61, 62
The majority of VNTRs we assayed via read depth were relatively large, exceeding the read length of Illumina WGS (mean motif size 116 bp, mean span of repeat tract in reference genome 353 bp). Because of their size and tandemly repeated nature, copy number variation of VNTRs is difficult to assay in Illumina WGS data via standard tools for genotyping structural variants. By direct comparison with VNTR genotypes derived from long-read sequencing, we observed that CNVnator generally provides relatively good estimates of relative diploid VNTR copy number for the majority of VNTRs in the genome. In contrast, other published tools for genotyping VNTRs are either limited to only being able to genotype alleles that are shorter than the sequencing read length or performed poorly in our hands for the set of VNTRs we assayed.11,12
However, the use of read depth does have some major limitations. First, read depth does not provide any allelic information and only yields a relative estimate of total copy number from the sum of both alleles. For example, a heterozygous individual with alleles of two and eight repeats (total n = 10) will be indistinguishable from an individual who is homozygous for an allele with five repeats (total n = 10). Also, the use of read depth does not differentiate between specific repeat motifs with divergent sequence that may independently vary in copy number, as has been observed to occur at some VNTRs.4 Furthermore, where a repeat motif strongly diverges from those represented in the reference genome, it might be poorly measured or missed entirely because mapping of reads to a VNTR is based on alignment to the reference sequence. We observed some evidence to support this: VNTRs that showed underestimation of copy number compared with direct genotypes from PacBio WGS often had consistently low read depth within the VNTR locus. Finally, by studying discrepancies between VNTR copy estimates derived from read depth and PacBio WGS, we observed that the size of these discrepancies is inversely related to both motif size and copy number. Thus, read depth is best suited for studying those VNTRs with larger motifs and higher copy numbers.
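The loss of allelic information can be made concrete with an idealised model in which depth over the repeat, relative to flanking diploid sequence, is proportional to the summed copy number of the two alleles. The Python sketch below is a simplification for illustration only: the function name and the assumed reference copy number are hypothetical, and real read depth is additionally affected by mappability, GC content, and sampling noise.

```python
def expected_relative_depth(allele_a, allele_b, reference_copies):
    """Idealised depth over a VNTR relative to flanking diploid sequence.
    A genome carrying the reference copy number on both alleles gives 1.0."""
    return (allele_a + allele_b) / (2 * reference_copies)

# Assumed reference allele of five repeats
heterozygote = expected_relative_depth(2, 8, reference_copies=5)
homozygote = expected_relative_depth(5, 5, reference_copies=5)
print(heterozygote == homozygote)
```

Both genotypes yield the same expected depth, so any effect that depends on the length of an individual allele rather than on the diploid total cannot be resolved from read depth alone.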
Given that TR loci are frequently misassembled or collapsed during genome assembly, it is therefore likely that our study has not effectively assessed some fraction of VNTRs that are poorly represented in the current reference genome.52,54,60,63 Ongoing efforts to improve and diversify the human reference genome will most likely provide a more complete ascertainment of VNTRs that are present in the human population.54,64 The use of read depth to genotype VNTRs can also potentially be confounded where a VNTR is contained within a larger underlying copy number variation or through batch effects in WGS data. However, here we applied stringent quality control steps to remove such confounders, and through visualization of the underlying data at individual VNTRs and large-scale replication in an independent cohort, we minimized the possibility that these significantly influenced our results.
Other limitations of our analysis are that by using linear regression in our association analysis, we tested a model in which the relationship between expression/DNA methylation and VNTR copy number is presumed to be linear, and therefore, we had limited power to identify more complex non-linear effects of different VNTR alleles that have been observed at some TR loci.65 Furthermore, we were only able to assay CpG methylation in whole blood, and the measurements of DNA methylation that we used were based on methylation arrays, which only assay a small fraction of all CpGs in the genome. However, using a set of 30 genomes sequenced with ONT long reads, we were able to perform a more complete assessment of methylation levels for a much larger number of CpGs, although here the small size of this cohort and the corresponding lack of statistical power effectively limited us to performing replication analysis of those mVNTR:CpG associations already identified via read depth. Finally, it should be noted that the PCGC and PPMI cohorts are composed of individuals with either congenital heart defects or Parkinson disease, respectively. However, we consider it unlikely that this significantly influences our overall conclusion that variation of some VNTRs is associated with local gene expression and DNA methylation. Overall, despite various technical and biological differences among the cohorts we profiled with Illumina or ONT WGS, we were able to replicate the majority of eVNTRs and mVNTRs, indicating the overall robustness of our results.
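The limitation of a linear model can be illustrated with a toy example in which the phenotype depends on copy number in a symmetric, U-shaped fashion. The values below are invented for illustration and the function name is an assumption; the sketch shows only that a least-squares slope can be zero even when the phenotype is fully determined by copy number.

```python
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

copy_number = [2, 3, 4, 5, 6]
linear_effect = [1, 2, 3, 4, 5]      # expression rises steadily with copy number
u_shaped_effect = [4, 1, 0, 1, 4]    # extreme alleles both raise expression

print(ols_slope(copy_number, linear_effect))
print(ols_slope(copy_number, u_shaped_effect))
```

A regression on copy number alone detects the first relationship but not the second, which is the sense in which the analysis had limited power for non-linear allelic effects.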
Our study provides an initial map of putatively functional VNTRs, and hints that future studies of tandem repeat variation will most likely yield novel insights into the genetic basis of human phenotypes that have been largely ignored in the era of SNV-based GWASs. In order to make results of our association analysis of eVNTRs and mVNTRs easily accessible to the community, we have created new tracks viewable in the UCSC Genome Browser (Figure S13 and data and code availability). In the future, we postulate that the application of long-read sequencing that provides improved genotyping of VNTRs in large cohorts will lead to deeper insights into the effects of this class of structural variation on diverse human traits.
Declaration of interests
The authors declare no competing interests.
Published: March 31, 2021
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.03.016.
Data and code availability
All reported associations of VNTRs with gene expression and DNA methylation are available as Track Hubs within the UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=VNTR.
Web resources
1000 Genomes Project high-coverage WGS data, https://www.internationalgenome.org/data-portal/data-collection/30x-grch38
Database of Genotypes and Phenotypes (dbGaP) GTEx data, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v7.p2
Database of Genotypes and Phenotypes (dbGaP) PCGC data, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001138.v1.p2
Gene Expression Omnibus (GEO) PCGC data, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159930
GTEx portal, https://www.gtexportal.org/
GWAS catalog, https://www.ebi.ac.uk/gwas/
Human Genome Diversity Panel, https://www.internationalgenome.org/data-portal/data-collection/hgdp
Human Pangenome Reference Consortium, https://github.com/human-pangenomics/hpgp-data
OMIM, http://www.omim.org/
Parkinson’s Progression Markers Initiative (PPMI), https://www.ppmi-info.org/
The International Genome Sample Resource, https://www.internationalgenome.org/data-portal/data-collection/30x-grch38
UCSC Genome Browser, http://genome.ucsc.edu
References
- 1.Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062. [DOI] [PubMed] [Google Scholar]
- 2.Perry G.H., Dominy N.J., Claw K.G., Lee A.S., Fiegler H., Redon R., Werner J., Villanea F.A., Mountain J.L., Misra R. Diet and the evolution of human amylase gene copy number variation. Nat. Genet. 2007;39:1256–1260. doi: 10.1038/ng2123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Warburton P.E., Hasson D., Guillem F., Lescale C., Jin X., Abrusan G. Analysis of the largest tandemly repeated DNA families in the human genome. BMC Genomics. 2008;9:533. doi: 10.1186/1471-2164-9-533. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Course M.M., Gudsnuk K., Smukowski S.N., Winston K., Desai N., Ross J.P., Sulovari A., Bourassa C.V., Spiegelman D., Couthouis J. Evolution of a Human-Specific Tandem Repeat Associated with ALS. Am. J. Hum. Genet. 2020;107:445–460. doi: 10.1016/j.ajhg.2020.07.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Song J.H.T., Lowe C.B., Kingsley D.M. Characterization of a Human-Specific Tandem Repeat Associated with Bipolar Disorder and Schizophrenia. Am. J. Hum. Genet. 2018;103:421–430. doi: 10.1016/j.ajhg.2018.07.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Chaisson M.J.P., Sanders A.D., Zhao X., Malhotra A., Porubsky D., Rausch T., Gardner E.J., Rodriguez O.L., Guo L., Collins R.L. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 2019;10:1784. doi: 10.1038/s41467-018-08148-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Dashnow H., Lek M., Phipson B., Halman A., Sadedin S., Lonsdale A., Davis M., Lamont P., Clayton J.S., Laing N.G. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19:121. doi: 10.1186/s13059-018-1505-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Mousavi N., Shleizer-Burko S., Yanicky R., Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Res. 2019;47:e90. doi: 10.1093/nar/gkz501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Dolzhenko E., van Vugt J.J.F.A., Shaw R.J., Bekritsky M.A., van Blitterswijk M., Narzisi G., Ajay S.S., Rajan V., Lajoie B.R., Johnson N.H., US–Venezuela Collaborative Research Group Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res. 2017;27:1895–1903. doi: 10.1101/gr.225672.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Willems T., Zielinski D., Yuan J., Gordon A., Gymrek M., Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods. 2017;14:590–592. doi: 10.1038/nmeth.4267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Bakhtiari M., Shleizer-Burko S., Gymrek M., Bansal V., Bafna V. Targeted genotyping of variable number tandem repeats with adVNTR. Genome Res. 2018;28:1709–1719. doi: 10.1101/gr.235119.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Gelfand Y., Hernandez Y., Loving J., Benson G. VNTRseek-a computational tool to detect tandem repeat variants in high-throughput sequencing data. Nucleic Acids Res. 2014;42:8884–8894. doi: 10.1093/nar/gku642. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Quilez J., Guilmatre A., Garg P., Highnam G., Gymrek M., Erlich Y., Joshi R.S., Mittelman D., Sharp A.J. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 2016;44:3750–3762. doi: 10.1093/nar/gkw219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Gymrek M., Willems T., Guilmatre A., Zeng H., Markus B., Georgiev S., Daly M.J., Price A.L., Pritchard J.K., Sharp A.J., Erlich Y. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat. Genet. 2016;48:22–29. doi: 10.1038/ng.3461. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Fotsing S.F., Margoliash J., Wang C., Saini S., Yanicky R., Shleizer-Burko S., Goren A., Gymrek M. The impact of short tandem repeat variation on gene expression. Nat. Genet. 2019;51:1652–1659. doi: 10.1038/s41588-019-0521-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Brahmachary M., Guilmatre A., Quilez J., Hasson D., Borel C., Warburton P., Sharp A.J. Digital genotyping of macrosatellites and multicopy genes reveals novel biological functions associated with copy number variation of large tandem repeats. PLoS Genet. 2014;10:e1004418. doi: 10.1371/journal.pgen.1004418. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Borel C., Migliavacca E., Letourneau A., Gagnebin M., Béna F., Sailani M.R., Dermitzakis E.T., Sharp A.J., Antonarakis S.E. Tandem repeat sequence variation as causative cis-eQTLs for protein-coding gene expression variation: the case of CSTB. Hum. Mutat. 2012;33:1302–1309. doi: 10.1002/humu.22115. [DOI] [PubMed] [Google Scholar]
- 18.Deckert J., Catalano M., Syagailo Y.V., Bosi M., Okladnova O., Di Bella D., Nöthen M.M., Maffei P., Franke P., Fritze J. Excess of high activity monoamine oxidase A gene promoter alleles in female patients with panic disorder. Hum. Mol. Genet. 1999;8:621–624. doi: 10.1093/hmg/8.4.621. [DOI] [PubMed] [Google Scholar]
- 19.Guo G., Ou X.M., Roettger M., Shih J.C. The VNTR 2 repeat in MAOA and delinquent behavior in adolescence and young adulthood: associations and MAOA promoter activity. Eur. J. Hum. Genet. 2008;16:626–634. doi: 10.1038/sj.ejhg.5201999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Rodríguez S., Gaunt T.R., O’Dell S.D., Chen X.H., Gu D., Hawe E., Miller G.J., Humphries S.E., Day I.N.M. Haplotypic analyses of the IGF2-INS-TH gene cluster in relation to cardiovascular risk traits. Hum. Mol. Genet. 2004;13:715–725. doi: 10.1093/hmg/ddh070. [DOI] [PubMed] [Google Scholar]
- 21.Santoro N., Cirillo G., Amato A., Luongo C., Raimondo P., D’Aniello A., Perrone L., Miraglia del Giudice E. Insulin gene variable number of tandem repeats (INS VNTR) genotype and metabolic syndrome in childhood obesity. J. Clin. Endocrinol. Metab. 2006;91:4641–4644. doi: 10.1210/jc.2005-2705. [DOI] [PubMed] [Google Scholar]
- 22.De Roeck A., Duchateau L., Van Dongen J., Cacace R., Bjerke M., Van den Bossche T., Cras P., Vandenberghe R., De Deyn P.P., Engelborghs S., BELNEU Consortium An intronic VNTR affects splicing of ABCA7 and increases risk of Alzheimer’s disease. Acta Neuropathol. 2018;135:827–837. doi: 10.1007/s00401-018-1841-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Manolio T.A., Collins F.S., Cox N.J., Goldstein D.B., Hindorff L.A., Hunter D.J., McCarthy M.I., Ramos E.M., Cardon L.R., Chakravarti A. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Eichler E.E., Flint J., Gibson G., Kong A., Leal S.M., Moore J.H., Nadeau J.H. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 2010;11:446–450. doi: 10.1038/nrg2809. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Gelb B., Brueckner M., Chung W., Goldmuntz E., Kaltman J., Kaski J.P., Kim R., Kline J., Mercer-Rosa L., Porter G., Pediatric Cardiac Genomics Consortium The congenital heart disease genetic network study: Rationale, design, and early results. Circ. Res. 2013;112:698–706. doi: 10.1161/CIRCRESAHA.111.300297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Hoang T.T., Goldmuntz E., Roberts A.E., Chung W.K., Kline J.K., Deanfield J.E., Giardini A., Aleman A., Gelb B.D., Mac Neal M. The congenital heart disease genetic network study: Cohort description. PLoS ONE. 2018;13:e0191319. doi: 10.1371/journal.pone.0191319. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Martin-Trujillo A., Patel N., Richter F., Jadhav B., Garg P., Morton S.U., McKean D.M., DePalma S.R., Goldmuntz E., Gruber D. Rare genetic variation at transcription factor binding sites modulates local DNA methylation profiles. PLoS Genet. 2020;16:e1009189. doi: 10.1371/journal.pgen.1009189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Marek K., Chowdhury S., Siderowf A., Lasch S., Coffey C.S., Caspell-Garcia C., Simuni T., Jennings D., Tanner C.M., Trojanowski J.Q., Parkinson’s Progression Markers Initiative The Parkinson’s progression markers initiative (PPMI) - establishing a PD biomarker cohort. Ann. Clin. Transl. Neurol. 2018;5:1460–1477. doi: 10.1002/acn3.644. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Abyzov A., Urban A.E., Snyder M., Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–984. doi: 10.1101/gr.114876.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Näslund K., Saetre P., von Salomé J., Bergström T.F., Jareborg N., Jazin E. Genome-wide prediction of human VNTRs. Genomics. 2005;85:24–35. doi: 10.1016/j.ygeno.2004.10.009. [DOI] [PubMed] [Google Scholar]
- 31.Audano P.A., Sulovari A., Graves-Lindsay T.A., Cantsilieris S., Sorensen M., Welch A.E., Dougherty M.L., Nelson B.J., Shah A., Dutcher S.K. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019;176:663–675.e19. doi: 10.1016/j.cell.2018.12.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Stranger B.E., Nica A.C., Forrest M.S., Dimas A., Bird C.P., Beazley C., Ingle C.E., Dunning M., Flicek P., Koller D. Population genomics of human gene expression. Nat. Genet. 2007;39:1217–1224. doi: 10.1038/ng2142. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Gibbs J.R., van der Brug M.P., Hernandez D.G., Traynor B.J., Nalls M.A., Lai S.L., Arepalli S., Dillman A., Rafferty I.P., Troncoso J. Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain. PLoS Genet. 2010;6:e1000952. doi: 10.1371/journal.pgen.1000952. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Conrad D.F., Pinto D., Redon R., Feuk L., Gokcumen O., Zhang Y., Aerts J., Andrews T.D., Barnes C., Campbell P., Wellcome Trust Case Control Consortium Origins and functional impact of copy number variation in the human genome. Nature. 2010;464:704–712. doi: 10.1038/nature08516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.McCaw Z.R., Lane J.M., Saxena R., Redline S., Lin X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics. 2020;76:1262–1272. doi: 10.1111/biom.13214. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Rodriguez O.L., Ritz A., Sharp A.J., Bashir A. MsPAC: a tool for haplotype-phased structural variant detection. Bioinformatics. 2020;36:922–924. doi: 10.1093/bioinformatics/btz618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Patterson M., Marschall T., Pisanti N., van Iersel L., Stougie L., Klau G.W., Schönhuth A. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J. Comput. Biol. 2015;22:498–509. doi: 10.1089/cmb.2014.0157. [DOI] [PubMed] [Google Scholar]
- 38. Ummat A., Bashir A. Resolving complex tandem repeats with long reads. Bioinformatics. 2014;30:3491–3498. doi: 10.1093/bioinformatics/btu437.
- 39. Stegle O., Parts L., Piipari M., Winn J., Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 2012;7:500–507. doi: 10.1038/nprot.2011.457.
- 40. Benjamini Y., Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. B. 1995;57:289–300.
- 41. Pedersen B.S., Quinlan A.R. Who’s Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. Am. J. Hum. Genet. 2017;100:406–413. doi: 10.1016/j.ajhg.2017.01.017.
- 42. Houseman E.A., Accomando W.P., Koestler D.C., Christensen B.C., Marsit C.J., Nelson H.H., Wiencke J.K., Kelsey K.T. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13:86. doi: 10.1186/1471-2105-13-86.
- 43. Fishilevich S., Nudel R., Rappaport N., Hadar R., Plaschkes I., Iny Stein T., Rosen N., Kohn A., Twik M., Safran M. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford). 2017;2017:bax028. doi: 10.1093/database/bax028.
- 44. Pang B., Snyder M.P. Systematic identification of silencers in human cells. Nat. Genet. 2020;52:254–263. doi: 10.1038/s41588-020-0578-5.
- 45. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191.
- 46. Simpson J.T., Workman R.E., Zuzarte P.C., David M., Dursi L.J., Timp W. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods. 2017;14:407–410. doi: 10.1038/nmeth.4184.
- 47. Bergström A., McCarthy S.A., Hui R., Almarri M.A., Ayub Q., Danecek P., Chen Y., Felkel S., Hallast P., Kamm J. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020;367:eaay5012. doi: 10.1126/science.aay5012.
- 48. Redon R., Ishikawa S., Fitch K.R., Feuk L., Perry G.H., Andrews T.D., Fiegler H., Shapero M.H., Carson A.R., Chen W. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329.
- 49. Barbeira A.N., Dickinson S.P., Bonazzola R., Zheng J., Wheeler H.E., Torres J.M., Torstenson E.S., Shah K.P., Garcia T., Edwards T.L., GTEx Consortium. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 2018;9:1825. doi: 10.1038/s41467-018-03621-1.
- 50. Battle A., Brown C.D., Engelhardt B.E., Montgomery S.B., GTEx Consortium; Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group; Statistical Methods groups—Analysis Working Group; Enhancing GTEx (eGTEx) groups; NIH Common Fund; NIH/NCI. Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. doi: 10.1038/nature24277.
- 51. Li M., Jaffe A.E., Straub R.E., Tao R., Shin J.H., Wang Y., Chen Q., Li C., Jia Y., Ohi K. A human-specific AS3MT isoform and BORCS7 are molecular risk factors in the 10q24.32 schizophrenia-associated locus. Nat. Med. 2016;22:649–656. doi: 10.1038/nm.4096.
- 52. Bellizzi D., Covello G., Di Cianni F., Tong Q., De Benedictis G. Identification of GATA2 and AP-1 Activator elements within the enhancer VNTR occurring in intron 5 of the human SIRT3 gene. Mol. Cells. 2009;28:87–92. doi: 10.1007/s10059-009-0110-3.
- 53. Örd T., Puurand T., Örd D., Annilo T., Möls M., Remm M., Örd T. A human-specific VNTR in the TRIB3 promoter causes gene expression variation between individuals. PLoS Genet. 2020;16:e1008981. doi: 10.1371/journal.pgen.1008981.
- 54. Sulovari A., Li R., Audano P.A., Porubsky D., Vollger M.R., Logsdon G.A., Warren W.C., Pollen A.A., Chaisson M.J.P., Eichler E.E., Human Genome Structural Variation Consortium. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc. Natl. Acad. Sci. USA. 2019;116:23243–23253. doi: 10.1073/pnas.1912175116.
- 55. Scaldaferri D., Bosi A., Fabbri M., Pedrini E., Inforzato A., Valli R., Frattini A., De Vito A., Noonan D.M., Taramelli R. The human RNASET2 protein affects the polarization pattern of human macrophages in vitro. Immunol. Lett. 2018;203:102–111. doi: 10.1016/j.imlet.2018.09.005.
- 56. Ostendorf T., Zillinger T., Andryka K., Schlee-Guimaraes T.M., Schmitz S., Marx S., Bayrak K., Linke R., Salgert S., Wegner J. Immune Sensing of Synthetic, Bacterial, and Protozoan RNA by Toll-like Receptor 8 Requires Coordinated Processing by RNase T2 and RNase 2. Immunity. 2020;52:591–605.e6. doi: 10.1016/j.immuni.2020.03.009.
- 57. Bogenhagen D.F., Rousseau D., Burke S. The layered structure of human mitochondrial DNA nucleoids. J. Biol. Chem. 2008;283:3665–3675. doi: 10.1074/jbc.M708444200.
- 58. Zheng H., Li Y., Ji C., Li J., Zhang J., Yin G., Xu J., Ye X., Wu M., Zou X. Characterization of a cDNA encoding a protein with limited similarity to β1,3-N-acetylglucosaminyltransferase. Mol. Biol. Rep. 2004;31:171–175. doi: 10.1023/b:mole.0000043552.32411.67.
- 59. Galupa R., Heard E. X-Chromosome Inactivation: A Crossroads Between Chromosome Architecture and Gene Regulation. Annu. Rev. Genet. 2018;52:535–566. doi: 10.1146/annurev-genet-120116-024611.
- 60. Sabeti P.C., Varilly P., Fry B., Lohmueller J., Hostetter E., Cotsapas C., Xie X., Byrne E.H., McCarroll S.A., Gaudet R., International HapMap Consortium. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–918. doi: 10.1038/nature06250.
- 61. Pickrell J.K., Coop G., Novembre J., Kudaravalli S., Li J.Z., Absher D., Srinivasan B.S., Barsh G.S., Myers R.M., Feldman M.W., Pritchard J.K. Signals of recent positive selection in a worldwide sample of human populations. Genome Res. 2009;19:826–837. doi: 10.1101/gr.087577.108.
- 62. Almarri M.A., Bergström A., Prado-Martinez J., Yang F., Fu B., Dunham A.S., Chen Y., Hurles M.E., Tyler-Smith C., Xue Y. Population Structure, Stratification, and Introgression of Human Structural Variation. Cell. 2020;182:189–199.e15. doi: 10.1016/j.cell.2020.05.024.
- 63. Tørresen O.K., Star B., Mier P., Andrade-Navarro M.A., Bateman A., Jarnot P., Gruca A., Grynberg M., Kajava A.V., Promponas V.J. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019;47:10994–11006. doi: 10.1093/nar/gkz841.
- 64. Miga K.H., Koren S., Rhie A., Vollger M.R., Gershman A., Bzikadze A., Brooks S., Howe E., Porubsky D., Logsdon G.A. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585:79–84. doi: 10.1038/s41586-020-2547-7.
- 65. Vinces M.D., Legendre M., Caldara M., Hagihara M., Verstrepen K.J. Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009;324:1213–1216. doi: 10.1126/science.1170097.
Associated Data
Data Availability Statement
All reported associations of VNTRs with gene expression and DNA methylation are available as Track Hubs within the UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=VNTR.