Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project

Xinmeng Jasmine Mu; Zhi John Lu; Yong Kong; Hugo Y K Lam; Mark B Gerstein

doi:10.1093/nar/gkr342

Nucleic Acids Res. 2011 May 19;39(16):7058–7076. doi: 10.1093/nar/gkr342

PMCID: PMC3167619 PMID: 21596777

Abstract

In the human genome, it has been estimated that considerably more sequence is under natural selection in non-coding regions [such as transcription-factor binding sites (TF-binding sites) and non-coding RNAs (ncRNAs)] compared to protein-coding ones. However, less attention has been paid to them. To study selective pressure on non-coding elements, we use next-generation sequencing data from the recently completed pilot phase of the 1000 Genomes Project, which, compared to traditional methods, allows for the characterization of a full spectrum of genomic variations, including single-nucleotide polymorphisms (SNPs), short insertions and deletions (indels) and structural variations (SVs). We develop a framework for combining these variation data with non-coding elements, calculating various population-based metrics to compare classes and subclasses of elements, and developing element-aware aggregation procedures to probe the internal structure of an element. Overall, we find that TF-binding sites and ncRNAs are less selectively constrained for SNPs than coding sequences (CDSs), but more constrained than a neutral reference. We also determine that the relative amounts of constraint for the three types of variations are, in general, correlated, but there are some differences: counter-intuitively, TF-binding sites and ncRNAs are more selectively constrained for indels than for SNPs, compared to CDSs. After inspecting the overall properties of a class of elements, we analyze selective pressure on subclasses within an element class, and show that the extent of selection is associated with the genomic properties of each subclass. We find, for instance, that ncRNAs with higher expression levels tend to be under stronger purifying selection, and the actual regions of TF-binding motifs are under stronger selective pressure than the corresponding peak regions. 
Further, we develop element-aware aggregation plots to analyze selective pressure across the linear structure of an element, with the confidence intervals evaluated using both simple bootstrapping and block bootstrapping techniques. We find, for example, that both micro-RNAs (particularly the seed regions) and their binding targets are under stronger selective pressure for SNPs than their immediate genomic surroundings. In addition, we demonstrate that substitutions in TF-binding motifs inversely correlate with site conservation, and SNPs unfavorable for motifs are under more selective constraints than favorable SNPs. Finally, to further investigate intra-element differences, we show that SVs have the tendency to use distinctive modes and mechanisms when they interact with genomic elements, such as enveloping whole gene(s) rather than disrupting them partially, as well as duplicating TF motifs in tandem.

INTRODUCTION

Only 1.5% of the human genome is protein-coding (1), and the vast genomic regions of non-coding DNA have long been thought of as ‘junk’ DNA. However, 5% of the human genome is estimated to be under natural selection (2), suggesting that more sequence is under selection in non-coding DNA than in protein-coding regions. Moreover, analyses of conserved non-coding elements (CNCs) and genome-wide association studies (GWAS) have shown that non-coding DNA is involved in biological functions and disease associations (3). The recent ENCODE Project (Encyclopedia of DNA Elements) has also elucidated a variety of ways in which non-coding elements can be biochemically active within the genome, such as interacting with transcription factors (TFs) (4,5). Despite the work described above, much less effort has been invested in the functional analysis of non-coding elements than in that of the extensively studied protein-coding regions.

One way to evaluate the functional relevance of non-coding elements is to examine the levels of naturally occurring genomic variations therein (i.e. DNA polymorphism within populations). A reduction of polymorphism in non-coding elements, compared to sequences under neutral evolution, suggests non-coding elements are subject to natural selection or lower mutation rates. Polymorphism naturally co-varies with divergence between species regardless of the mutation rate (6). Thus, to see if varying diversity is a mark of selection, one may test whether it is not varying proportionally to divergence—the regime of the McDonald–Kreitman test (MK test) (7). In addition, selective constraints maintain deleterious mutations at low frequencies in a population, resulting in a skew of the derived allele frequency spectrum towards the low-frequency alleles; whereas positive selection raises advantageous alleles to high frequencies. We have studied these signatures of natural selection using genomic variation data provided by the 1000 Genomes Project (8). The Project has recently completed its pilot phase, in which whole genome next-generation sequencing data of 2–6× of genomic coverage has been generated from 179 unrelated individuals within three population groups. The data include 60 individuals of European ancestry in Utah (CEU), 59 individuals of Yoruban ancestry from Nigeria (YRI) and 60 individuals of Han Chinese ancestry from Beijing and Japanese ancestry from Tokyo (CHBJPT) (8).

There are two major advantages in using this dataset to study the impact of genomic variations on non-coding elements. First, the 1000 Genomes Project provides a more comprehensive catalog of genomic variations than previous studies. Previous efforts, such as the HapMap, utilize the array-based single-nucleotide polymorphism (SNP) genotyping method by designing probes at certain genomic loci (9,10). This type of study is limited to previously identified SNPs, and SNPs adjacent to probed SNPs are typically missing [inference through linkage disequilibrium (LD) has limited power for rare variants]. By contrast, using next-generation sequencing technology, the 1000 Genomes Project generates reads from the genome in a relatively unbiased and uniform fashion, allowing for a more complete identification and genotyping of genomic variations. Another type of study exploits Sanger sequencing to obtain genomic variations within targeted local regions in the genome (11), whereas the 1000 Genomes Project achieves shotgun sequencing at a genome-wide scale.

A second advantage of the 1000 Genomes data is the discovery of genomic variations spanning a full spectrum, instead of merely SNPs. Variation between two random copies of the human genome was initially estimated to be ∼0.1%, most of which was attributed to SNPs (12). Nonetheless, taking into account SNPs, short deletions and insertions (indels), as well as structural variations (SVs) that include large deletions, duplications, insertions and inversions, two copies of the human genome differ by 0.5% of the DNA sequence (13). Moreover, indels and SVs are also found to contribute considerably to phenotypes and diseases (14–19). In this regard, the 1000 Genomes Project has systematically identified and genotyped all three types of variations: SNPs, indels and SVs. Hitherto, little has been known about the functional relevance of the latter two types of variations. Previous studies have identified various types of genomic variations in several personal genomes (13,20–22). However, the 1000 Genomes Project has advanced SV detection well beyond these studies in terms of number, size range and breakpoint precision (8,23). In addition, the scale of the 1000 Genomes data enables us to apply population-based approaches in our analyses.

In this study, we examine the functional impact of genomic variations on non-coding elements to elucidate selective pressures acting on them. We investigate this at three progressive levels: comparing classes of elements, comparing subclasses within an element class and inspecting the internal structure of a given element.

Comparing classes of elements

Through studying levels of polymorphism and divergence, as well as the allele frequency spectrum, we find that TF-binding sites and non-coding RNAs (ncRNAs) are less constrained for SNPs than are coding sequences (CDSs), but more constrained than a neutral reference. We also determine that the levels of constraint for the three types of variations (SNPs, indels and SVs) are, in general, correlated, but there is some heterogeneity: counter-intuitively, TF-binding sites and ncRNAs are relatively more selectively constrained for indels than for SNPs, compared to CDSs. Further investigation reveals that this difference is largely attributed to relaxed constraints for in-frame indels in CDSs.

Comparing subclasses within an element class

After examining the overall properties of a class of elements, we analyze the selective pressure upon various subclasses within an element class, and show that the extent of selection can be rationalized in terms of genomic properties of each subclass, e.g. the exact sequences of the TF-binding motifs are under stronger selective pressure than the corresponding peak regions, and ncRNAs with higher expression levels tend to be under stronger purifying selection.

Intra-element differences of a given element

In order to make further statements about the selection on non-coding elements, we have developed element-aware aggregation techniques to investigate the differences across the linear genomic structure of a given element. We find that a similar level of additional selective pressure for SNPs is imposed on TF-binding motifs relative to their surrounding regions. We also demonstrate that substitutions in TF-binding motifs inversely correlate with site conservation, and SNPs unfavorable for motifs are under more selective constraints than favorable SNPs. Moreover, both the micro-RNAs (miRNAs) (particularly the seed regions) and their binding targets are under stronger selective pressure than their surroundings. Finally, we show that SVs have a tendency to use distinctive modes and mechanisms when they interact with genomic elements, such as enveloping whole gene(s) rather than disrupting them partially, as well as duplicating TF motifs in tandem.

MATERIALS AND METHODS

Overall framework for integrating genomic variation data and non-coding elements

We have developed a framework, ncVAR, for an integrative analysis of genomic variation data and non-coding elements (schematics shown in Figure 1A). We first compile datasets of annotations for various types of genomic variations (SNPs, indels and SVs), and datasets of different non-coding elements annotations (TF-binding sites, ncRNAs, pseudogenes, etc.). We further subdivide each class of non-coding elements into subclasses, based on their genomic properties.

We then carry out integrative analysis of the two data sources using two strategies. In the first strategy, we annotate genomic variations within non-coding elements, and compute population genetics metrics, such as the global mean of nucleotide diversity and divergence, for each class or subclass of elements. This allows comparison of functional impact of various types of genomic variations in different classes or subclasses of genomic elements. In the second strategy, we develop techniques of element-aware aggregations for genomic variations within non-coding elements. This enables evaluation of the functional relevance of the internal structures of each element. Results from the two strategies are represented in the form of X–Y plots and aggregation plots, respectively.

Data preparation for genomic variation

SNP, indel and SV annotations, and allele frequency data are obtained from the pilot release of the 1000 Genomes Project (8) (Supplementary Table S1). Since indels and SVs have only been genotyped for the autosomes, we have carried out all our analyses only on the autosomes. Of the 10 871 SVs in the data release, we retain those that are genotyped across all three populations, have a non-zero allele frequency, and have at least 50% of the individuals passing the genotyping quality filter, which leaves 6379 SVs. Among these SVs, 4470 have been polarized as deletions (8,24); polarization (also known as rectification) means inferring the ancestral allele of the variant by comparison with other primate genomes.

Data preparation for genomic elements

The non-coding elements we survey include genome-wide annotations of nine TF-binding sites and related sites (i.e. DNase I hypersensitive sites), ncRNAs, pseudogenes and non-coding domains of protein-coding genes, that is, introns, 3′ untranslated regions (3′UTRs) and 5′UTRs (Figure 1B and Supplementary Table S1).

TF-binding sites

For TF-binding sites, we use peak signals from a variety of chromatin immunoprecipitation with sequencing (ChIP-seq) experiments. Although the precision of the ChIP-seq method is no longer limited by the probe spacing of array-based methods, it cannot yet detect the boundaries of TF-binding sites at single-nucleotide resolution. In fact, the ChIP procedure pulls down DNA up to hundreds of base pairs away from the actual interacting sites (25). To better represent the DNA–protein interaction sites, we scan the TF peaks with consensus sequences of the corresponding motifs to obtain sites representing TF-binding motifs (Supplementary Data). Eight TF-binding sites and the DNase sites (26) are downloaded from the UCSC genome browser (27). Specific file names and other information are available at the supplementary website for this study (http://info.gersteinlab.org/NCVAR). The TF-binding sites include CTCF (26), STAT1 (28), NFκB (29), c-Myc (30), c-Fos (30), c-Jun (30), JunD (30) and PolII (29). The NRSF binding peaks and motifs are obtained from the original publication (31), with genomic coordinates mapped from NCBI build 35 to build 36 using the Liftover tool from the UCSC genome browser. Analyses throughout this study use NCBI build 36 of the human reference genome and the GENCODE gene set version 3b for gene annotations (32). We intersect TF-binding peaks with genes and retain for analysis the peak and motif regions that fall into intergenic regions, truncating peaks and discarding motifs that partially overlap with genes.

ncRNAs

ncRNAs are genes that are transcribed but not translated into proteins. They have diverse regulatory functions, including regulation of transcription (miRNA), RNA splicing (small nuclear RNA—snRNA), translation (messenger RNA—mRNA, transfer RNA—tRNA and ribosomal RNA—rRNA) and chemical modification of other RNA molecules (small nucleolar RNA—snoRNA) (33). The ncRNA annotations are obtained from Ensembl release 53 (33) and GtRNAdb (34).

Non-coding domains of the protein-coding genes

Non-coding domains of the protein-coding genes may play various regulatory roles. For instance, UTRs contain structured regions, such as the riboswitches and the internal ribosome entry sites (IRES), which modulate gene expression (36,37). 3′UTRs also provide binding sites to miRNAs, which inhibits translation (33). Moreover, introns have been found to harbor sites that are associated with disease (38). CDS, intron and UTR annotations are obtained from the longest transcript of each gene with annotated start and end codons. 5′UTRs are extracted as sequences from the transcription start site (TSS) to the start codon (exclusive). 3′UTRs are extracted as sequences from the stop codon (inclusive) to the transcription end site (TES).

Pseudogenes

Finally, pseudogenes are usually disabled gene homologues, and are thus not functional (39). Hence, we use them as a neutral reference in this study. In addition to pseudogenes, we have considered a number of other candidates for a neutral reference, such as ancestral repeats (ARs) and random intergenic regions. Pseudogene annotations are obtained from the Ensembl 53 build at the Pseudogene.org database (40). ARs are obtained by intersecting the repeat elements in the human genome (annotated by the RepeatMasker program) with human–mouse alignments. Both the repeat element and alignment annotations are downloaded from the UCSC genome browser. Aligned regions of 100 bp or larger are extracted. A set of random 500 bp regions is generated from the intergenic regions of the human reference genome that lie at least 200 kb away from any gene. The set of random regions also excludes annotations of TF-binding sites, ncRNAs, ARs and pseudogenes.

Calculation of population-based statistics

SNP diversity and divergence analysis

The SNP diversity (π) of a region is estimated as the per-site heterozygosity (2pq) across the portion of the region that is ‘accessible’ (passing all the filters for SNP detection, including depth-of-sequence coverage, uniqueness of mapping, and gaps in the human reference genome), where p is the allele frequency, and q = 1−p. The ‘accessible’ genome annotations for the three population groups are obtained from the 1000 Genomes pilot release (8).

Denote d as the number of nucleotide differences per site between the human and the chimp reference genomes, excluding gaps across the accessible and alignable region. Divergence (D_xy) is then obtained by applying to d the Jukes-Cantor correction for multiple hits (6). The human-chimpanzee alignment is between the human build 36 and the chimp panTro2 assembly, and is obtained from the UCSC genome browser.
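The two quantities defined above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the ncVAR implementation: the function names are hypothetical, diversity is taken as the sum of 2pq over the SNPs in a region divided by the number of accessible sites, and the Jukes-Cantor correction is the standard form −(3/4) ln(1 − 4d/3).

```python
import math


def snp_diversity(allele_freqs, accessible_sites):
    """Per-site heterozygosity: sum of 2pq over SNPs / number of accessible sites.

    allele_freqs     -- allele frequency p of each SNP in the region (hypothetical input)
    accessible_sites -- number of sites in the region passing the accessibility filters
    """
    if accessible_sites <= 0:
        raise ValueError("region has no accessible sites")
    return sum(2.0 * p * (1.0 - p) for p in allele_freqs) / accessible_sites


def jukes_cantor(d):
    """Jukes-Cantor corrected distance for an observed per-site difference d."""
    if not 0.0 <= d < 0.75:
        raise ValueError("d must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


# Toy region: three SNPs across 1000 accessible sites, 1.2% observed human-chimp difference.
pi = snp_diversity([0.5, 0.1, 0.02], accessible_sites=1000)
d_xy = jukes_cantor(0.012)
print(f"pi = {pi:.6f}, D_xy = {d_xy:.6f}")
```

At the low divergence typical of human and chimpanzee, the correction changes d only slightly, since few sites are expected to have been hit more than once.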

MK test for neutrality

Polymorphism (P) denotes the number of SNPs across the accessible and alignable region. The number of fixed differences (D) is obtained by computing $d \times l$ followed by the Jukes–Cantor correction, where d is the per-site number of differences between human and chimp, and l is the total length of the accessible and alignable region. To carry out the MK test, a 2 × 2 contingency table is formed from P and D in a region i under study and in a neutral reference n. Fisher’s exact test is used to assess the significance of the MK test. The neutrality index (NI) is calculated as

$$\mathrm{NI} = \frac{P_i / D_i}{P_n / D_n}$$
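The MK comparison can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the ncVAR implementation: the function names are hypothetical, NI is taken to be the ratio of P/D in the region to P/D in the neutral reference (which reproduces the NI values reported in Table 1), and Fisher's exact test is written directly from hypergeometric probabilities using the common two-sided convention of summing all tables that are no more probable than the observed one.

```python
import math


def neutrality_index(p_i, d_i, p_n, d_n):
    """NI = (P_i / D_i) / (P_n / D_n); NI > 1 suggests selective constraint."""
    return (p_i / d_i) / (p_n / d_n)


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def _log_hypergeom(k, row1, row2, col1):
    """Log-probability that the top-left cell equals k given fixed margins."""
    return _log_comb(row1, k) + _log_comb(row2, col1 - k) - _log_comb(row1 + row2, col1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = _log_hypergeom(a, row1, row2, col1)
    total = 0.0
    for k in range(lo, hi + 1):
        log_p = _log_hypergeom(k, row1, row2, col1)
        if log_p <= log_obs + 1e-7:  # small tolerance for floating-point ties
            total += math.exp(log_p)
    return min(1.0, total)


# CDS (region i) versus pseudogenes (neutral reference n), counts taken from Table 1.
ni_cds = neutrality_index(49636, 181193, 46122, 206922)
p_cds = fisher_exact_two_sided(49636, 181193, 46122, 206922)
print(f"NI = {ni_cds:.2f}, P = {p_cds:.3g}")
```

The log-gamma formulation avoids overflow for counts in the hundreds of thousands, at the cost of one pass over all admissible values of the top-left cell.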

Allele frequency spectrum analysis

A derived allele frequency (DAF) is computed for each polarized SNP, indel and SV from the genotyping allele frequency and ancestral allele information. SNPs and indels have been identified and genotyped for each of the three populations separately. Therefore, we assess the allele frequency spectrum for each population. The SVs have been genotyped across all three populations. Thus, we effectively use the average allele frequency of the three populations to evaluate the DAF of an SV (see Supplementary Data for more details).
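The polarization step can be illustrated with a short Python sketch. It is a minimal example with hypothetical function names, assuming a biallelic variant whose frequency is reported for the non-reference allele; the unweighted mean across populations used for SVs is an assumption made for illustration, since the exact procedure is described in the Supplementary Data.

```python
def derived_allele_frequency(alt_freq, ref_allele, alt_allele, ancestral_allele):
    """Polarize a biallelic variant using the inferred ancestral allele.

    Returns the frequency of the derived allele, or None when the ancestral
    allele matches neither observed allele (the variant cannot be polarized).
    """
    if ancestral_allele == ref_allele:
        return alt_freq          # the non-reference allele is derived
    if ancestral_allele == alt_allele:
        return 1.0 - alt_freq    # the reference allele is derived
    return None


def sv_daf(population_dafs):
    """Assumed unweighted mean of per-population DAFs for an SV."""
    return sum(population_dafs) / len(population_dafs)


print(derived_allele_frequency(0.2, "A", "G", "A"))  # reference allele is ancestral
print(derived_allele_frequency(0.2, "A", "G", "G"))  # non-reference allele is ancestral
```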

To quantify the intensity of natural selection using allele frequency, traditional tests such as Tajima’s D and the Fu and Li tests draw comparisons to the standard neutral coalescence model (6). However, due to the low sequence coverage (2−6×) in the 1000 Genomes pilot dataset, there is a severe bias towards common alleles: even neutrally evolving sequences display a depletion of rare variants compared to the standard neutral model. Therefore, to overcome this intrinsic variant detection bias, we have sought to derive a measurement that quantifies the intensity of selection relative to a neutral reference. We define the excess of low-frequency variants (ε) within a region relative to a neutral reference as

$$\varepsilon = \frac{N_i - N_n}{N_n}$$

where N_i is the fraction of variants in the region i that have a DAF < 0.05, and N_n is the fraction of variants in the neutral reference n that have a DAF < 0.05.
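The excess statistic can be illustrated with a short Python sketch. It is a minimal example with hypothetical function names and toy data, assuming that the excess is expressed relative to the neutral fraction, i.e. (N_i − N_n) / N_n.

```python
def low_freq_fraction(dafs, threshold=0.05):
    """Fraction of variants whose derived allele frequency is below the threshold."""
    if not dafs:
        raise ValueError("no variants supplied")
    return sum(1 for f in dafs if f < threshold) / len(dafs)


def excess_low_frequency(region_dafs, neutral_dafs, threshold=0.05):
    """Relative excess of low-frequency variants in a region versus a neutral reference."""
    n_i = low_freq_fraction(region_dafs, threshold)
    n_n = low_freq_fraction(neutral_dafs, threshold)
    return (n_i - n_n) / n_n


# Toy data: 6 of 10 variants are rare in the region, 5 of 10 in the neutral reference.
region = [0.01, 0.02, 0.03, 0.04, 0.01, 0.02, 0.30, 0.50, 0.70, 0.90]
neutral = [0.01, 0.02, 0.03, 0.04, 0.01, 0.30, 0.50, 0.70, 0.90, 0.20]
eps = excess_low_frequency(region, neutral)
print(f"excess of low-frequency variants = {eps:.2f}")
```

Because both fractions are measured on the same low-coverage call set, the detection bias against rare variants affects numerator and denominator alike, which is the motivation for a relative measure.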

Element-aware aggregation of nucleotide diversity

Basic aggregation procedure

In element-aware aggregations, we aggregate both SNP and indel diversity in an element-aware fashion. For simplicity, we just refer to nucleotide diversity, but the same logic applies to indels. To develop the aggregation procedure, each sequence of an element annotation is divided into a fixed number of bins with uniform size. For a given annotation, sequences with different lengths might be chosen to have different bin sizes, but the number of bins is fixed in all the sequences for the annotation. A nucleotide diversity measure for each bin, in each sequence, is calculated as described above. The diversity measures for each bin are then averaged across all the sequences to obtain an overall measure for the bin, which is represented by one data point in the element-aware aggregation plot. An aggregation mean is then calculated from all the data points within an annotation. Sequences shorter than the number of bins are discarded.
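The basic aggregation procedure described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the ncVAR implementation: each element is represented as a hypothetical list of per-site heterozygosity values (2pq at SNP sites, 0 elsewhere), every site is treated as accessible, and bin boundaries are placed by integer division so that bins within one element differ in size by at most one site.

```python
def aggregate(elements, n_bins):
    """Element-aware aggregation of per-site diversity.

    elements -- list of sequences, each a list of per-site diversity values
    n_bins   -- fixed number of bins applied to every sequence
    Returns (profile, aggregation_mean), where profile[b] is the mean
    diversity of bin b averaged across all retained sequences.
    """
    # Sequences shorter than the number of bins are discarded.
    kept = [seq for seq in elements if len(seq) >= n_bins]
    if not kept:
        raise ValueError("no sequence is long enough for the requested number of bins")

    profile = []
    for b in range(n_bins):
        per_sequence = []
        for seq in kept:
            start = b * len(seq) // n_bins
            end = (b + 1) * len(seq) // n_bins
            per_sequence.append(sum(seq[start:end]) / (end - start))
        profile.append(sum(per_sequence) / len(per_sequence))

    return profile, sum(profile) / n_bins


# Toy input: two usable sequences of different lengths and one that is too short.
toy = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.3, 0.3, 0.3],
    [0.1],
]
profile, agg_mean = aggregate(toy, n_bins=2)
print(profile, agg_mean)
```

Fixing the number of bins rather than the bin size is what allows sequences of different lengths to be aligned along a common relative coordinate.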

Block bootstrapping

To estimate the standard deviation (SD) of the aggregation within an element, we apply bootstrapping methods. In the human genome, neighboring variants can be co-inherited, which results in the association of these variants in the population. This property is termed LD (41). As a result, the nucleotide diversity values we calculate for genomic elements that are sufficiently close to each other can be dependent. To overcome the dependence between the observations, we apply a block bootstrapping procedure, extended from traditional simple bootstrapping, to gene annotations and their surrounding regions (Supplementary Data).

For each element annotation, we randomly resample n = 1 000 000 blocks from the genome. For each block resampled, the basic aggregation procedure described above is applied to all the sequences of the element within the block. Those blocks for which the nucleotide diversity cannot be calculated are discarded. Denote x₁, x₂, … , x_n as the aggregation mean from resampled blocks 1, 2, … , n, respectively. Let w₁, w₂, … , w_n be the number of sequences of the element within blocks 1, 2, … , n, respectively. Then,

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

and

$$s^2 = \frac{\sum_{i=1}^{n} w_i}{\left(\sum_{i=1}^{n} w_i\right)^2 - \sum_{i=1}^{n} w_i^2} \sum_{i=1}^{n} w_i \left(x_i - \bar{x}\right)^2$$

where the bootstrapping mean $\bar{x}$ is calculated as the weighted average of the aggregation means, and $s$ is an unbiased estimator of the SD (42) for weighted samples of blocks. $s$ is then renormalized according to the effective genome size G and the block size L to obtain the bootstrapping SD for the whole genome (S):

$$S = \frac{s}{\sqrt{G / L}}$$

The effective genome refers to the portion of the genome where resampled blocks contain at least one sequence of the element (i.e. excluding deserts in the genome for the element). A 95% confidence interval (CI) of the aggregation is then calculated from $\bar{x} \pm 1.96\,S$. Since LD extends to up to 1 Mb in the human genome (41), we use 1 Mb as the block size L, which is designed to capture the dependence between the sequences.
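The summary statistics of the block bootstrap can be illustrated with a short Python sketch. It is a minimal example under explicit assumptions, not the ncVAR implementation: the function name is hypothetical, the weighted variance uses the standard unbiased estimator for weighted samples, the genome-wide SD is assumed to be the block-level SD divided by the square root of the number of blocks in the effective genome (G / L), and the 95% CI is taken as the mean ± 1.96 SD. Resampling of the blocks themselves is not shown.

```python
import math


def weighted_block_stats(block_means, block_weights, genome_size, block_size):
    """Weighted mean, block-level SD, genome-wide SD and 95% CI from resampled blocks.

    block_means   -- aggregation mean x_i of each resampled block
    block_weights -- number of element sequences w_i in each block
    genome_size   -- effective genome size G (same units as block_size)
    block_size    -- block size L
    """
    w_sum = sum(block_weights)
    w_sq_sum = sum(w * w for w in block_weights)
    mean = sum(w * x for w, x in zip(block_weights, block_means)) / w_sum

    # Unbiased weighted sample variance (assumed form of the estimator).
    scatter = sum(w * (x - mean) ** 2 for w, x in zip(block_weights, block_means))
    s = math.sqrt(w_sum / (w_sum ** 2 - w_sq_sum) * scatter)

    # Assumed renormalization: the effective genome holds G / L blocks.
    genome_sd = s / math.sqrt(genome_size / block_size)
    ci = (mean - 1.96 * genome_sd, mean + 1.96 * genome_sd)
    return mean, s, genome_sd, ci


# Toy input: three equally weighted blocks, 4 Mb effective genome, 1 Mb blocks.
mean, s, genome_sd, ci = weighted_block_stats(
    [1.0, 2.0, 3.0], [1, 1, 1], genome_size=4_000_000, block_size=1_000_000
)
print(mean, s, genome_sd, ci)
```

With equal weights the estimator reduces to the ordinary sample variance with n − 1 in the denominator, which is a convenient sanity check.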

Randomization test for SVs interacting with genomic elements

We use the 10 871 SVs with single-nucleotide resolution mentioned in the data preparation section for the randomization test. SV formation mechanisms are classified using the BreakSeq tool (24). The association of SVs with a class of elements is determined by calculating the number of SVs overlapping the elements. The enrichment and P-value for each association are computed from a non-parametric randomization test. A global background is obtained by randomly shuffling the SV locations within the human genome. This procedure is repeated 10 000 times. The enrichment measure is calculated as the ratio of the observed statistic to the average of the statistics taken from the background measures. The P-value is computed by fitting a Gaussian model to the background measures, and calculating the area under the density curve corresponding to Z-scores as extreme as, or more extreme than, the observed one. A local background is obtained by randomly shuffling the SV locations within a 10 Mb window around them. The calculations for the local background that follow are the same as those for the global background. The association is reported to be significant for a P-value <0.05.
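The global-background randomization test can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the procedure used in the study: the function names are hypothetical, the genome is reduced to a single linear coordinate system with half-open intervals, each SV keeps its length when shuffled, and the Gaussian P-value is taken as two-sided. The local background (shuffling within a 10 Mb window) is not shown.

```python
import random
from statistics import NormalDist, mean, stdev


def count_overlaps(svs, elements):
    """Number of SVs (start, end) that overlap at least one element interval."""
    return sum(
        1 for s, e in svs if any(s < el_end and el_start < e for el_start, el_end in elements)
    )


def shuffle_svs(svs, genome_length, rng):
    """Place each SV at a uniformly random position, preserving its length."""
    shuffled = []
    for s, e in svs:
        length = e - s
        new_start = rng.randrange(genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def randomization_test(svs, elements, genome_length, n_iter=1000, seed=0):
    """Return (enrichment, p_value) of SV-element overlap against a shuffled background."""
    rng = random.Random(seed)
    observed = count_overlaps(svs, elements)
    background = [
        count_overlaps(shuffle_svs(svs, genome_length, rng), elements) for _ in range(n_iter)
    ]
    mu, sigma = mean(background), stdev(background)
    if mu == 0 or sigma == 0:
        raise ValueError("background is degenerate; increase n_iter or check the inputs")
    z = (observed - mu) / sigma
    p_value = 2.0 * NormalDist().cdf(-abs(z))  # two-sided, Gaussian approximation
    return observed / mu, p_value


# Toy input: 18 short SVs clustered inside one 100 bp element on a 10 kb "genome".
toy_elements = [(0, 100)]
toy_svs = [(i * 5, i * 5 + 10) for i in range(18)]
enrichment, p_value = randomization_test(toy_svs, toy_elements, 10_000, n_iter=200, seed=1)
print(f"enrichment = {enrichment:.1f}, P = {p_value:.3g}")
```

Fitting a Gaussian to the background allows P-values smaller than 1/n_iter to be reported, at the cost of assuming that the background statistic is approximately normal.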

RESULTS

Nucleotide diversity, divergence and allele frequency spectrum in non-coding elements

Non-coding elements are under selective constraints

First, we examine the levels of SNP diversity within humans and divergence between the human and the chimpanzee genomes in non-coding elements (‘Materials and Methods’ section). We compare the global means of diversity and divergence in non-coding elements to those in the neutral reference and CDSs. We find that, with the exception of diversity in tRNA, non-coding elements exhibit a lower level of diversity and divergence than the neutral reference, and a higher level of both measures than CDSs (Figure 2, Table 1 and Supplementary Table S2). For the CEU population, on average, SNP diversity in TF-binding sites and ncRNAs is about double that of CDSs (208 and 202% that of CDSs, respectively), and about one-fifth lower than the neutral reference (22 and 24% lower, respectively); divergence in TF-binding sites and ncRNAs is about double that of CDSs (194 and 193% that of CDSs, respectively), and approximately one-third lower than the neutral reference (33 and 34% lower, respectively). We interpret these results as evidence of purifying selection on non-coding elements. The MK test also shows that, relative to the neutral reference, all non-coding elements are under selective constraints, as indicated by NI > 1 and by significant P-values in most classes of elements (Table 1 and Supplementary Table S2). As a control, a set of accelerated elements in human, identified in a previous study (43), shows clear signatures of positive selection in our analysis (i.e. an elevated level of divergence and a reduced level of diversity compared to the neutral reference, NI < 1 and the MK test P-value = 3.6E-57; Figure 2 and Table 1).

Figure 2. Levels of SNP diversity within humans and divergence between the human and the chimpanzee genomes. The diversity and divergence are calculated only for the accessible and alignable regions. The dashed line represents data points with the same ratio of diversity to divergence as for the neutral reference. Data are shown for CEU.

Table 1.

Diversity, divergence, and test of neutrality in genomic elements in CEU

Element	SNP diversity (π × 1000)	Divergence (D_xy × 100)	Polymorphism (P)	Number of fixed differences (D)	Neutrality index (NI)	McDonald–Kreitman test P-value
Pseudogene	1.02	2.02	46 122	206 922	1.00	–
CDS	0.38	0.69	49 636	181 193	1.23	2.38E-179
Intron	0.69	1.22	2 244 675	8 610 702	1.17	3.03E-205
3′UTR	0.61	1.12	60 129	232 581	1.16	3.53E-103
5′UTR	0.70	1.22	293 916	1 116 579	1.18	3.78E-202
TF peak	0.80	1.34	111 140	417 405	1.19	5.30E-186
TF motif	0.67	1.11	2409	8545	1.26	2.13E-22
ncRNA	0.78	1.33	2254	8023	1.26	1.42E-20
Accelerated element	0.60	2.30	701	5656	0.56	5.07E-55


Relative strengths of selective constraints

Since the diversity of indels and SVs is not as straightforward to assess as that of SNPs (Supplementary Data), and neither is their divergence (due to complications in alignment), we examine another signature of selection, the allele frequency spectrum, for SNPs, indels and SVs collectively. In CEU, TF-binding sites and ncRNAs, respectively, display a 4 and 16% excess of low-frequency SNPs (DAF < 0.05) compared to the neutral reference (Wilcoxon rank-sum test P-value = 2.8E-5 and 5.4E-7, respectively), and a 21 and 12% reduction in the fraction of low-frequency SNPs compared to CDSs (Wilcoxon test P-value <2.2E-16 and 7.5E-4, respectively, see Figure 3A and Supplementary Table S3). A similar pattern is observed for UTRs and introns (Figure 3A and Supplementary Table S3). This confirms the results from the previous section: the extent of selective constraints on non-coding elements is higher than that on the neutral reference, but lower than that on CDSs. Moreover, among the gene domains, 3′UTRs are more selectively constrained than 5′UTRs and introns (Wilcoxon test P-value = 9.4E-23 and 2.5E-28, respectively), but less than CDSs (Wilcoxon test P-value = 5.8E-38), indicating that 3′UTRs might include a larger fraction of functionally important sequences than the other non-coding gene domains. We also note that there are no evident elevations of high-DAF SNPs (DAF > 0.95) in the allele frequency spectra (Figure 3A), implying that positive selection is not prevalent in non-coding elements in humans.

Figure 3. The derived allele frequency spectra for (A) SNPs, (B) indels and (C) SVs. SNP and indel allele frequencies are shown for CEU, and SV allele frequencies represent the average of the three populations. Black boxes highlight the low-frequency alleles (DAF < 0.05).

As mentioned previously, we have also considered ancestral repeats and random intergenic regions as candidates for a neutral reference in addition to pseudogenes (‘Materials and Methods’ section). We choose pseudogenes, however, because they show signatures that are the most consistent with neutral sequences. In other words, the alternative candidates still display some signatures of purifying selection compared to pseudogenes (Supplementary Tables S2 and S3). Although pseudogenes have been reported in individual cases to be involved in functionality (44), they are mostly deactivated and nonfunctional gene fossils. Even if some pseudogenes are under slight selective constraints, our analyses of purifying selection in non-coding elements will be conservative.

We have obtained similar results for YRI and CHBJPT populations (Supplementary Figures S1–S3, and Supplementary Tables S2 and S3). Altogether, these results suggest that, in the human genome, non-coding elements are under different levels of selective constraints for SNPs. The constraint levels usually fall between those of neutral sequences and coding regions.

Allele frequency spectrum of indels and SVs, and comparison to SNPs

As previously mentioned, the 1000 Genomes Project provides the first dataset that includes annotations of indels and SVs alongside SNPs at a population level in humans. It is thus appealing to examine the properties of indels, SVs and SNPs simultaneously, and make comparisons. We investigate the allele frequency spectrum for indels in CEU and find that, overall, non-coding elements demonstrate an excess of low-frequency indels compared to the neutral reference (Figure 3B)—a similar finding to that for SNPs. However, the most prominent increase in the fraction of indels in CDSs compared to non-coding elements occurs within a DAF range of 0.05–0.20, whereas that of SNPs occurs within a DAF range of 0–0.05 (Figure 3A and B).

For SVs, pseudogenes no longer show the lowest fraction of low-frequency alleles as for SNPs and indels (Figure 3C). Rather, the fraction of low-frequency SVs in pseudogenes is comparable to, or larger than, that of introns, 5′UTRs and 3′UTRs (Wilcoxon test P-value = 0.37, 0.43 and 0.81, respectively). Why would pseudogenes lose neutrality for SVs? Referring to the formation mechanisms of SVs might provide a clue. SVs formed by the non-allelic homologous recombination mechanism (NAHR) exploit sequences of extensive homologies at the two breakpoints (45). Hence, repeat elements associated with pseudogenes may mediate NAHR events by providing homologous stretches at the breakpoints. Then, selection in sequences around pseudogenes may place the associated SV events under selective constraints.

Figure 4A–C draw pair-wise comparisons of the excess of low-frequency alleles (ε) among SNPs, indels and SVs (‘Materials and Methods’ section). Correlations between the fractions of low-frequency variants are calculated using the data points representing the average for each of the seven major classes of elements: pseudogenes, TF-binding sites, ncRNAs and four gene domains (Correlation = 0.75 between SNP and indel, 0.64 between indel and SV, and 0.11 between SNP and SV). The results show that selective constraints for the three types of variants are, in general, correlated in non-coding elements, especially between SNP and indel, and between SV and indel. Nevertheless, we also see differences.

Within CDSs, SNPs and in-frame indels modify only one or two local amino acids, whereas a frame-shift indel alters the entire amino acid sequence that follows and may introduce premature stop codons that truncate the protein product; such changes are very detrimental and are therefore expected to be quickly removed from the genome. Thus, one might expect selective pressure against indels in CDSs to be stronger than that against SNPs, compared to the other functional elements. To our surprise, we find that TF-binding sites and ncRNAs are, on average, relatively even more constrained for indels than for SNPs, compared to what we observe for CDSs (Figure 4A–C). In fact, the ratio of low-frequency indel fraction to low-frequency SNP fraction is increased by 8 and 22% in TF-binding sites and ncRNAs relative to CDSs, respectively (P-value = 8.9E-2 and 1.1E-1, respectively, by bootstrapping).

To further explore the above observations, we consider the differences between in-frame and frame-shift indels. The size distribution of indels shows a periodicity of 3 nt for CDSs but not for non-coding elements (i.e. the fraction of indels of size 3, 6, 9 bp… is elevated for CDSs, see Figure 4D). Since the majority of the indels are no larger than 3 bp, we extract the indels of size 1, 2 and 3 bp, and examine the excess of low-frequency indels for each size separately. We find that frame-shift indels (1 and 2 bp) have more low-frequency alleles than in-frame indels (3 bp) in CDSs (ε is 160 and 116% higher than for 3 bp indels, respectively), consistent with a relaxed constraint for in-frame indels compared to frame-shift indels (Figure 4D). However, even for 1 bp indels, which introduce frame-shifts, we do not see an elevated level of constraint for indels relative to SNPs in CDSs, compared to non-coding elements (Figure 4D). Therefore, the selective pressure for indels relative to SNPs in TF-binding sites and ncRNAs is as much as, if not more than, that in CDSs. Taken together, comparisons of the allele frequency spectrum between SNPs, indels and SVs reveal heterogeneity in the selective pressure for the three types of variations in non-coding elements, despite an overall correlation.

Differences in selective pressure between subclasses within an element class

Instead of treating each class of elements as a whole as described in the preceding section, we further analyze the mode and extent of selection with respect to subclasses of elements having different genomic properties, such as the genomic locations, RNA expression levels, number of binding targets, sequence divergence, conservation of secondary structure, etc. (Supplementary Data).