Abstract
Although thousands of genomic regions have been associated with heritable human diseases, attempts to elucidate biological mechanisms are impeded by a general inability to discern which genomic positions are functionally important. Evolutionary constraint is a powerful predictor of function that is agnostic to cell type or disease mechanism. Here, single base phyloP scores from the whole genome alignment of 240 placental mammals identified 3.5% of the human genome as significantly constrained, and likely functional. We compared these scores to large-scale genome annotation, genome-wide association studies (GWAS), copy number variation, clinical genetics findings, and cancer data sets. Evolutionarily constrained positions are enriched for variants explaining common disease heritability (more than any other functional annotation). Our results improve variant annotation but also highlight that the regulatory landscape of the human genome still needs to be further explored and linked to disease.
Introduction
In the past 15 years, increasingly large genomic studies have delivered many novel associations for a wide array of human diseases, disorders, biomarkers, and other traits. Approximately 200K genetic associations have been identified that span the allelic spectrum, from ultra-rare variants in large sequencing datasets to variants common in all humans, in both coding and regulatory regions (see Supplementary Methods, Section 1). Although these associations meet rigorous standards for statistical significance and replicability, their functional importance is generally unknown. Inferring functional importance is crucial to translating the results of rare and common variant association studies into the biological, clinical, and therapeutic knowledge required to understand and treat human disease. Exceptional efforts have been made to annotate the human genome using functional genomics, e.g., ENCODE (1) and GTEx (2), as well as by inferring deleterious effects from allele frequencies and location in coding sequence, e.g., gnomAD (3) and TOPMed (4). Although these seminal projects greatly expanded knowledge, this “central problem in biology” remains unresolved and motivated the NHGRI Impact of Genomic Variation on Function initiative.
Evolutionary constraint is complementary to these efforts. Functional importance is inferred from the signatures of evolution in the human genome: “constraint” indicates genomic positions that have changed more slowly than expected under neutral drift due to purifying selection. A key advantage of constraint lies in its mechanistic agnosticism; a highly constrained base has an impact on some biological process, in some cell, at some life stage (discussed in Supplementary Methods, Section 2). Constraint has been used in efforts to understand the human genome for over 50 years beginning with cross-species protein sequence comparisons. More recently, at the extremes of the allelic spectrum, constraint is often used by clinical geneticists to prioritize potentially causal rare variants (5, 6), and common variants in regions under constraint are highly enriched in genome-wide association study (GWAS) results (7–9). Despite its reported importance, evolutionary constraint is not systematically leveraged in interpreting the function of GWAS loci (10–15).
Our companion paper describes the Zoonomia reference-free alignment of 240 placental mammals spanning ~100 million years of evolution (Companion paper #1, Christmas et al.). The analyses showed the unprecedented informativeness of this alignment at multiple scales: from exceptionally constrained 100 kb bins (e.g., all HOX clusters) to smaller ultra-conserved and human accelerated regions, non-coding regulatory regions, nuances of the genetic code, and specific base positions in binding motifs. These results strongly suggested the utility of constraint as a functional annotation that can be leveraged to deepen our understanding of heritable human diseases. In this paper, we demonstrate the importance of mammalian constraint for connecting genotype to phenotype for human disease.
The properties of evolutionary constraint at single base resolution
Defining constraint.
Placental mammalian constraint was estimated using phyloP scores (16) across 240 species for 2,852,623,265 bases in the human genome (chr1–22, X, Y; Supplementary Methods, Section 3). In our companion paper we estimated that ~13% of the human genome is under some degree of constraint due to purifying selection; for these disease-focused analyses, we used an empirical subset with the strongest constraint signatures. We defined a base as constrained in mammals if its phyloP score was ≥ 2.27 (FDR 0.05 threshold, 100,651,377 bases or 3.53% of the genome). We defined constraint across 43 primates using a phastCons (17) threshold (≥ 0.961, 101,134,907 bases) selected to match the same fraction of the genome annotated as constrained in mammals. Mammalian and primate constraint overlapped significantly but not fully (Jaccard index 0.30). In Supplementary Methods, Section 4, we describe the properties of constrained genomic positions, from base level to higher order annotations. Briefly, we found that mammalian constrained bases had a marked tendency to cluster (median distance 2 bases) compared to random expectations (median 24 bases), and that specific genomic elements were highly enriched in constrained bases (particularly coding sequence, CDS, as expected) as well as multiple regulatory features (Figs. 1A and S1), and that constraint scores captured nuances of the genetic code (fig. S2).
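As a minimal illustration of the definitions above, the following Python sketch applies the two thresholds quoted in the text to per-base scores and computes a Jaccard index between the resulting sets. The positions and scores are invented toy values and the function names are hypothetical; this is not the pipeline described in the Supplementary Methods.

```python
# Illustrative sketch only: toy scores, not Zoonomia data.
PHYLOP_THRESHOLD = 2.27      # mammalian phyloP threshold quoted in the text
PHASTCONS_THRESHOLD = 0.961  # primate phastCons threshold quoted in the text


def constrained_positions(scores, threshold):
    """Return the set of positions whose score meets or exceeds the threshold."""
    return {pos for pos, score in scores.items() if score >= threshold}


def jaccard(a, b):
    """Jaccard index: size of intersection over size of union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical per-base scores keyed by genomic position.
phylop = {100: 3.1, 101: 0.4, 102: 2.27, 103: -1.2, 104: 5.0}
phastcons = {100: 0.99, 101: 0.97, 102: 0.10, 103: 0.00, 104: 1.00}

mammal = constrained_positions(phylop, PHYLOP_THRESHOLD)
primate = constrained_positions(phastcons, PHASTCONS_THRESHOLD)
overlap = jaccard(mammal, primate)

# Genome-wide fraction implied by the base counts reported in the text.
fraction = 100_651_377 / 2_852_623_265
print(f"toy Jaccard = {overlap:.2f}; constrained fraction = {fraction:.2%}")
```

In the toy data two of four positions in the union are shared, so the Jaccard index is 0.5; the reported genome-wide value of 0.30 reflects the real, much larger sets.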
Constraint across the allelic spectrum.
Genetic variation is fundamental to heritable human diseases, disorders, and other traits. We thus evaluated the relationship between allele frequency and constraint (Fig. 1B). Using whole genome sequencing data from over 140K humans (TOPMed, v8) (4), we observed an inverse correlation between allele count and phyloP score (rho = −0.07) with stronger correlations in CDS regions and for non-synonymous variants (rho = −0.12 and −0.18, all P < 2.2×10−308). As expected due to negative selection, common genetic variants were depleted for constrained bases (1.85% vs. 3.53% expected by chance, P < 2.2×10−308). However, this relatively high fraction of constrained bases highlights the ability of mammalian constraint to predict deleterious effects across the allele frequency spectrum. To evaluate these relations more formally, genome-wide models contrasting singletons (AC = 1) to common variants (AF ≥ 0.005) found that common variants had lower phyloP scores and a marked increase in CG context (fig. S3, Supplementary Methods, Section 4). Models for CDS SNPs found an inverse association of AC with constraint, and that common SNPs had greater odds of occurring at a C or G base and tended not to occur in important CDS positions (e.g., codon position 1 or 2, or at bases that could mutate to stop).
Common constrained SNPs are relevant for human diseases.
We conducted additional analyses of common SNPs (AF ≥ 0.005) as these variants are foundational for GWAS (Supplementary Methods, Section 4). Of these 15,777,878 SNPs in TOPMed, 1.85% (N = 291,669) are constrained, far less than genome-wide constraint (3.53%). Our modeling showed that constrained SNPs were 22x more likely to occur in CDS bases, 3x more likely to occur in promoters, and ~2x more likely to be a “fine-mapped” eQTL-SNP or to occur in open chromatin or an enhancer.
The strong tendency of these constrained SNPs to occur in CDS was unexpected given that (by definition) these positions are highly constrained in placental mammals and yet variable in humans. We hypothesized that this could occur if selection effects were variable across genes (some generate peptide variability whereas others are highly intolerant of CDS variation). We found that 37.8% of protein-coding (PC) genes had no constrained CDS SNPs and other genes had appreciable fractions (up to 10% of all CDS bases are common SNPs). The top 5% (N = 980) of genes with the most constrained CDS SNPs have medical relevance (131 have an OMIM entry including multiple neurological disorders) and were strongly enriched for G-protein coupled receptors (GPCR), “druggable” genes (both GPCR and non-GPCR) (18), taste receptors, skin development, and multiple immune processes. These biological processes are at the interface of a mammal and its environment and allow adaptation to an environmental niche. We suggest that many of these genes could be prioritized in searches for gene-environment interactions, as constrained variants reaching high frequency in human populations are relevant for human diseases.
Base pair resolution of deleterious effects.
We contrasted constraint scores to metrics used to aid the interpretation of functional variation for human health. First, pathogenic ClinVar (19) variants were significantly skewed to higher phyloP in comparison to benign variants (two-tailed Wilcoxon rank sum test, P < 2.2×10−16, Fig. 1C), and phyloP scores were strongly associated with the improvement in annotations of variants in ClinVar from 2016 to 2021 (e.g. uncertain to benign or to pathogenic; Supplementary Methods, Section 5). For a second metric, CADD (6), which incorporates evolutionary constraint, we found variant positions with a higher likelihood of deleteriousness were also enriched for constrained phyloP scores (two-tailed Wilcoxon rank sum test, P < 2.2×10−16, Fig. 1C). A focused analysis of human non-synonymous variants at constrained sites across the mammalian tree using TOGA (Tool to infer Orthologs from Genome Alignments, Companion paper #1, Christmas et al.; Companion paper #10, Kirilenko et al.) identified 1,570 genes for which a non-synonymous change resulted in a ClinVar pathogenic or likely pathogenic phenotype in humans (Supplementary Methods, Section 5). For example, the CFTR gene underlying cystic fibrosis (20) showed a high burden of pathogenic compared to benign sites (123 vs. 1 out of 1,585 alignment sites). A further 12,889 genes had identifiable constrained sites, but lacked records of non-synonymous pathogenic alterations (Supplementary Methods, Section 5). Several of these constrained positions, currently lacking ClinVar pathogenic annotations, likely represent novel sources of deleterious variation resulting in a disease state. We tested this by leveraging functionally explored variation in two G-protein coupled receptors, GPR75 (21) and ADRB2 (22), and showed that functionally important SNP and amino acid sites, respectively, were marked by higher constraint scores (Supplementary Methods, Section 5).
Species alignments at this scale also allow for the identification of potential model systems, those for which a substitution may result in a human disease state, but is otherwise naturally occurring in non-human mammals. We found 697 such sites across 330 genes, including multiple positions in SOD1 (pathogenic sites for amyotrophic lateral sclerosis). These observations open the avenue for natural adaptive variants to inform the development of new therapies for treatment (Supplementary Methods, Section 5).
Common variation and human diseases and complex traits
GWAS have found that the genetic architecture of human diseases and complex traits is highly polygenic and dominated by common variants with weak effects (10). Here, we dissected the impact of common variants (defined in this section as AF ≥ 0.05) on this architecture via polygenic analyses of disease SNP-heritability (h2) using stratified LD score regression (S-LDSC) (7, 23, 24) applied to the results of 63 independent European ancestry GWAS (25) (mean N = 314K; table S1, Supplemental Methods, Section 6).
Constraint scores are proportional to common variant SNP-heritability enrichments.
We first validated the relevance of our constraint scores to investigate the role of common variants in human diseases and complex traits. We found that common variants in the highest constraint score percentiles had greater enrichment for GWAS trait associated variants (measured by SNP-h2 enrichment, the proportion of h2 divided by the proportion of SNPs; Fig. 2A and table S2). We observed decreasing but significant enrichments (P < 0.05/15) for SNPs in the first four percentiles of mammalian constraint scores (phyloP) (in line with 3.53% of the genome bases being considered as constrained using a 5% FDR threshold), and in the first five percentiles of primate (phastCons) constraint scores. We justified the use of different scores to measure constraint in mammals and primates by the fact that phyloP scores were unable to detect single-base constraint in primates due to lack of power and were too noisy to lead to significant h2 enrichment (fig. S4). While both phyloP and phastCons scores performed similarly in heritability analyses, phyloP is superior for having single-base resolution (fig. S4 and additional justification in Supplemental Methods, Section 6).
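The enrichment statistic defined above (proportion of h2 divided by proportion of SNPs) can be written out as a short Python sketch. S-LDSC estimates the underlying h2 quantities by regression; the code below only illustrates the final ratio, using invented numbers and a hypothetical function name.

```python
def h2_enrichment(h2_annotation, h2_total, n_snps_annotation, n_snps_total):
    """SNP-h2 enrichment: (proportion of h2) / (proportion of SNPs)."""
    if h2_total <= 0 or n_snps_total <= 0 or n_snps_annotation <= 0:
        raise ValueError("totals and annotation SNP count must be positive")
    prop_h2 = h2_annotation / h2_total
    prop_snps = n_snps_annotation / n_snps_total
    return prop_h2 / prop_snps


# Toy example (assumed values): an annotation holding 2% of SNPs
# that accounts for 16% of SNP-heritability.
enrichment = h2_enrichment(
    h2_annotation=0.08, h2_total=0.50,
    n_snps_annotation=20_000, n_snps_total=1_000_000,
)

# Bonferroni-style significance threshold quoted in the text (0.05 / 15).
alpha = 0.05 / 15
print(f"enrichment = {enrichment:.1f}-fold; threshold P < {alpha:.4f}")
```

An enrichment of 1 means the annotation explains exactly as much heritability as its share of SNPs would predict; values above 1 indicate enrichment.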
Mammal constraint scores are base pair specific.
We evaluated the resolution of constraint scores by estimating SNP-h2 with different distances to a constrained base. First, we confirmed the base pair resolution of mammalian constraint scores by observing that SNPs ~1 bp from a constrained variant were significantly less enriched than constrained SNPs (P ≤ 3.35×10−3) (Fig. 2B and table S3). We also observed a log-linear decrease of h2 enrichment as a function of the distance to a constrained base, with significant h2 enrichment up to 100 kb from constrained bases, confirming the larger-scale clustering of constrained bases. Finally, demonstrating the power of a broad placental mammal-wide genome sampling, constraint scores obtained only from primate species have lower resolution (~10 bp, Fig. 2B) as these are based on fewer species (43), from a single mammalian order, and thus less branch length.
Zoonomia constraint is uniquely informative.
Annotations derived from mammal and primate constrained positions were more informative for human diseases than key functional annotations, including previously published constrained annotations (17, 26, 27) (Fig. 2D and table S4). First, their degrees of enrichment (7.84 ± 0.37 fold for mammals and 11.10 ± 0.40 fold for primates) exceeded those of previously published constraint and key functional annotations, such as non-synonymous coding variants (7.20 ± 0.78 fold) or fine-mapped eQTL-SNPs (4.81 ± 0.31 fold) (28). Second, in conditional analyses involving 106 annotations analyzed jointly (Supplemental Methods, Section 6), we observed that these constrained annotations were among the most significant (P = 1.17×10−10 for mammals, and P = 1.19×10−53 for primates, respectively), and more significant than previously published constrained annotations (Fig. 2D and table S4).
Variants at constrained positions are less enriched in blood and immune traits heritability than in other complex traits.
We did not observe disease-specific patterns for our constrained annotations, without any trait exhibiting significantly higher h2 enrichment than the mean calculated for the mammal and primate constrained annotations (fig. S5 and table S5). However, we observed consistently lower h2 enrichments for constrained annotations in a meta-analysis of 11 blood and immune traits, as previously observed (7), but no differential enrichment in 9 brain disorders (Fig. 2C, table S1, and table S6).
Variants at positions constrained in primates are informative for non-coding common variants.
Surprisingly, SNPs constrained in primates have greater SNP-h2 enrichment than SNPs constrained in mammals (Figs. 2A-C). To investigate, we intersected mammalian and primate constraint information, and observed significantly higher h2 enrichment in SNPs constrained in both mammals and primates (16.52 ± 0.73 fold), compared with constraint only in primates (8.66 ± 0.38 fold), or only in mammals (3.56 ± 0.40 fold) (Fig. 2E and table S7). We verified that these results are mostly driven by the intersection of mammal and primate constrained bases (and not due to the different scoring tests, fig. S6). By stratifying constrained mammalian bases by their primate constraint scores, we found that variants identified as constrained in mammals but not in primates are not significantly enriched in h2, whereas SNPs constrained in primates were significantly enriched regardless of their constraint scores in mammals (fig. S7). These results explain the lower SNP-h2 for constraint in mammals, and demonstrate increased informativeness when combining information from primates and mammals. Interestingly, we observed consistently higher h2 enrichment for SNPs that are constrained in both mammals and primates when stratifying by genomic function (i.e., coding regions, promoters, and enhancers), but that constraint is more informative in primates than in mammals only for non-coding variants (Fig. 2E). Strikingly, we observed that constrained SNPs defined as non-functional (see Supplemental Methods, Section 6) were still enriched in h2 (>2.67 fold with P < 1.22×10−4, except for SNPs constrained only in mammals or primates; Fig. 2E), emphasizing the informativeness of our constrained annotations to annotate non-coding variants with unknown function.
Disease effect sizes of common variants at constrained positions differ across human populations.
While our heritability analyses focused on European ancestry GWAS, variant effect sizes differ across human populations, especially for variants with stronger gene-environment interactions (29). To quantify how effect sizes of constrained common variants differ across populations, we applied S-LDXR (29) on 31 diseases and complex traits with GWAS data from East Asian (mean N = 90K) and European (mean N = 267K) populations. Variants at constrained sites in mammals and primates were among the most significantly depleted in squared trans-ancestry genetic correlation (P = 4.38×10−9 and P = 1.63×10−14, the third and most significant investigated annotation, respectively; Fig. 2F and table S8). These results highlight more population-specific causal effect sizes for variants at constrained positions, in line with stronger gene-environment interactions at these loci, and potentially explain how genetic variations at constrained bases could have become common in human populations.
Strong effect sizes for coding low-frequency variants at constrained positions.
Annotations constrained by purifying selection tend to have low-frequency variants (0.5% ≤ AF < 5%) with larger effect sizes leading to higher enrichment in low-frequency variant h2 compared to common variant h2 (8). We quantified low-frequency SNP-h2 enrichments of constrained annotations by analyzing 27 well-powered independent UK Biobank traits (same as in (8); mean N = 355K; table S9). We observed that constrained annotations had consistently larger low-frequency h2 enrichment than common h2 enrichment, especially for variants at constrained sites in mammals (16.83 ± 0.92 vs. 8.70 ± 0.72 fold; P = 3.22×10−11 for difference) (fig. S8 and table S10) in line with greater effect sizes as allele frequency decreases (Fig. 2G and table S11). This enrichment difference was driven by coding variants at constrained sites in mammals (48.84 ± 3.10 vs. 19.42 ± 1.91 fold; P = 6.36×10−16 for difference); we note that the low-frequency h2 enrichment for these variants was similar to that of non-synonymous variants (40.38 ± 2.40 fold), suggesting that constraint information is as informative as protein change information at the coding level.
In conclusion, we observed that our mammalian constraint scores have unprecedented base pair resolution to investigate common variants in GWAS findings for human complex traits and diseases, are uniquely informative compared to known functional annotations and previously published constraint scores, are even more informative when combined with primate constraint scores, and could be utilized to investigate variants defined as non-functional.
Leveraging constraint to move from prioritization to function
Zoonomia constraint scores improve functionally-informed fine-mapping analyses.
Based on our heritability results, we expect that our constraint scores will improve functionally-informed fine-mapping of constrained genetic variants associated with common traits. We compared PolyFun (30) fine-mapping results obtained with no annotations (non-functional model), with its default set of annotations (baseline-LF model), and with an augmented set of baseline-LF annotations containing multiple Zoonomia constrained annotations (baseline-LF+Zoonomia model) on 24 well-powered UK Biobank diseases and complex traits (30, 31) (mean N = 440K; table S12 and Supplemental Methods, Section 7). We observed significantly (P < 1.00×10−4) greater posterior inclusion probability (PIP) for variants at constrained sites in mammals and primates when using PolyFun with the baseline-LF+Zoonomia model compared to the non-functional and baseline-LF models (Figs. 3A and 3B). Notably, PolyFun with the baseline-LF+Zoonomia model detected 1,407 variants at constrained sites in mammals fine-mapped with high confidence (PIP > 0.75) across all the UK Biobank traits (32.80% of high confidence fine-mapped variants), against 732 and 1,216 when using the non-functional and baseline-LF models, respectively (24.50% and 29.67% of high confidence fine-mapped variants, respectively) (fig. S9).
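The summary statistic reported above (the share of high-confidence fine-mapped variants that fall at constrained sites) can be sketched in Python as follows. The record type, variant identifiers, and PIP values are hypothetical; the sketch does not call PolyFun and only shows the filtering step applied to fine-mapping output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineMappedVariant:
    """Hypothetical record for one fine-mapped variant."""
    variant_id: str
    pip: float         # posterior inclusion probability
    constrained: bool  # True if the variant lies at a constrained base


def constrained_share(variants, pip_cutoff=0.75):
    """Count high-confidence variants (PIP > cutoff) and the share that are constrained."""
    high = [v for v in variants if v.pip > pip_cutoff]
    if not high:
        return 0, 0, 0.0
    n_constrained = sum(v.constrained for v in high)
    return n_constrained, len(high), n_constrained / len(high)


# Invented toy records; var4 sits exactly at the cutoff and is excluded.
toy = [
    FineMappedVariant("var1", 0.99, True),
    FineMappedVariant("var2", 0.80, False),
    FineMappedVariant("var3", 0.76, True),
    FineMappedVariant("var4", 0.75, True),
    FineMappedVariant("var5", 0.10, False),
]

n_con, n_high, share = constrained_share(toy)
print(f"{n_con} of {n_high} high-confidence variants are constrained ({share:.1%})")
```

Running the same function on the output of each model (non-functional, baseline-LF, baseline-LF+Zoonomia) would give the three percentages compared in the text.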
Fine-mapping examples.
We highlight the utility of evolutionary constraint scores in fine-mapping analyses. First, rs1421085 has a causal and experimentally validated association with BMI (the SNP is located in FTO but has regulatory effects on IRX5 and IRX3) (32, 33); this variant is extremely constrained in mammals (phyloP = 6.31) and primates (phastCons = 1.00), leading to a higher PIP when using the baseline-LF+Zoonomia model (0.84) than when using the non-functional and baseline-LF models (0.13 and 0.58, respectively; Fig. 3C). Interestingly, the fractions of CDS and promoter bases that are constrained for IRX5 (0.79 and 0.58) and IRX3 (0.74 and 0.34) were higher than for FTO (0.61 and 0.23), suggesting that constrained variants in regulatory regions could be more likely to target genes with constrained CDS and/or promoters (see below). Second, rs6914622 is constrained in mammals and primates (phyloP = 2.37 and phastCons = 1.00) and may be causal in hypothyroidism via the baseline-LF+Zoonomia model (PIP = 0.76; Fig. 3D) but not via the non-functional and baseline-LF models (PIP ≤ 0.14). Conversely, the sentinel variant rs9497965 is not evolutionarily constrained but has a notable PIP in the baseline-LF model (PIP ≥ 0.85) but not in the baseline-LF+Zoonomia model (PIP = 0.24). Using epigenetic marks from four thyroid cell types (34) (functional information not in the fine-mapping models), rs6914622 was in an active enhancer in all thyroid cell types and rs9497965 was inferred as being in an enhancer in only one thyroid cell type (weak transcription and quiescent for the others), suggesting a causal role for rs6914622 over rs9497965. While functional follow-up is necessary, these examples illustrate how Zoonomia constraint scores can significantly impact fine-mapping.
One method of functional follow-up, Cell-TACIT, is explored in a companion paper (Companion paper #11, Phan et al.), in which the conservation of human neural cell type-specific open chromatin across mammals is used to improve the fine-mapping of GWAS for brain disorders. Some regulatory elements may not be conserved at the nucleotide level but lie in a cell type regulatory element predicted to be conserved across mammals. Fine-mapping genetic variants with constraint and Cell-TACIT provide examples of how mammalian genomes can be leveraged to discover nucleotide and regulatory conservation to link variation to function. Finally, as discussed in another companion paper, Human Accelerated Regions can also improve fine-mapping interpretation (Companion paper #8, Keough et al.).
Measures of constraint can reveal unannotated variants impacting human health.
Due to the challenge of generating functional datasets in all cell types and cell states, much of the genome’s regulatory space is still not fully annotated (35). The high levels of constraint and low levels of variant diversity in UNannotated Intergenic COnstraint RegioNs (UNICORNs, Supplemental Methods, Section 8, Companion paper #1, Christmas et al.) suggest that they are likely of functional importance despite lacking functional annotations (consistent with our observation that non-functional constrained SNPs are enriched in h2, Fig. 2E). While fewer fine-mapped SNPs were located within UNICORNs (833 SNPs) compared to a matched set of random unannotated non-constrained intergenic regions (5,895 SNPs) and to SNPs located elsewhere in the genome (305,599 SNPs), those variants had higher mean PIP scores (0.15 for UNICORNs vs 0.05 for the other two regions). This demonstrates that UNICORNs can reveal unannotated variants impacting human health and disease. UNICORNs contain fine-mapped SNPs with significantly higher PIP scores compared to the background sets across multiple traits (linear regression, P < 0.01 in all cases after correcting for multiple testing; table S13). For example, a 163 bp UNICORN contains rs72782676 with fine-mapping evidence for multiple traits (e.g., eosinophil count, asthma, eczema, respiratory and ENT diseases; AF in TOPMed = 0.005; PIP > 0.99 in all GWAS) (Fig. 3E). The nearest gene, GATA3, sits 915 kb upstream, is a master transcriptional regulator for T Helper 2 lineage commitment (36), and is known to play an important role in inflammatory disease (37, 38). This UNICORN highlights a strong regulatory candidate for GATA3 in a disease-relevant region currently lacking annotation.
Predicted variant effect validated at single base resolution.
Massively parallel reporter assays (MPRAs) have been used to rapidly test thousands of genomic variants for their potential regulatory effects on gene expression. While the functional output from these high-throughput methods is useful for localising putative causal alleles, overlaying constraint scores may help further elucidate functional variants (Supplemental Methods, Section 8). To investigate this, we integrated our Zoonomia-derived phyloP scores with > 35,000 assayed variants from existing 3’UTR (39) and eQTL (40) MPRAs. Using the 3’UTR MPRA data to highlight our results, we found that phyloP scores could differentiate between sequence backgrounds with and without regulatory activity (e.g. across multiple tissues, Neutral vs Active: Polig = 2.32×10−5, Fig. 3F). PhyloP scores further highlighted variants with allele-specific regulatory effects (e.g. Neutral vs Skew: Pbase = 1.4×10−5; Fig. 3G). Additionally, we found that selection on constrained phyloP positions enriched the allele-specific regulatory effects by 1.3 fold (Supplemental Methods, Section 8). Similar trends were observed in promoter and enhancer saturation mutagenesis MPRAs (41). For example, phyloP constraint was a strong predictor for variant effect within the LDLR promoter (Spearman rho = 0.51), with five of the most constrained sites providing the strongest regulatory effects and also tagging pathogenic ClinVar positions (Fig. 3H). Further, in our companion paper (Companion paper #? CONDEL, Xue et al.), we use MPRAs to directly assess the regulatory impacts of bases under high constraint that have been deleted specifically in the human lineage. For many, we can precisely identify how the deletions impact transcription factor binding, which is well correlated with the observed regulatory changes, linking sequence change to mechanism. We found these human-specific deletions were enriched for overlap with psychiatric disease GWAS signals (i.e. schizophrenia, bipolar disorder), and discovered 717 deletions with significant species-specific regulatory effects, providing candidate targets that may have contributed to the prevalence of human neurological disorders.
Evolutionary constraint, protein-coding genes, and human disease
Gene-based measures of evolutionary constraint have an important role in understanding the impact of genetic variation on human disease (e.g., LOEUF) (3). As detailed in Supplementary Methods, Section 9, we defined 7 measures of gene constraint based on the Zoonomia alignment including fraction of CDS constrained, normalization against 32.13 million CDS bases, a model-based approach adjusting for 12 covariates (codon information, mutational consequences, and positional features), and cross-species amino acid constraint (normalized Shannon entropy). After evaluation, we selected the fraction of constrained CDS bases per gene (fracCdsCons) as a simple measure of gene constraint, given its continuous distribution, low missingness, high correlations with more complex measures of gene constraint, and external validation (Fig. 4A). These gene-based constraint metrics are provided in table S14.
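To make the selected measure concrete, the sketch below computes a per-gene fraction of constrained CDS bases from per-base phyloP scores. It assumes that "constrained" means phyloP ≥ 2.27, as defined earlier in the text; the gene names, scores, and function name are hypothetical, and the exact procedure is the one given in Supplementary Methods, Section 9.

```python
PHYLOP_THRESHOLD = 2.27  # constrained-base threshold quoted earlier in the text


def frac_cds_cons(cds_scores, threshold=PHYLOP_THRESHOLD):
    """Fraction of a gene's CDS bases with phyloP >= threshold.

    Returns None for a gene with no scored CDS bases (treated as missing).
    """
    if not cds_scores:
        return None
    n_constrained = sum(score >= threshold for score in cds_scores)
    return n_constrained / len(cds_scores)


# Invented per-base scores for three hypothetical genes.
toy_genes = {
    "GENE_A": [3.0, 2.5, 0.1, 2.27],   # 3 of 4 bases constrained
    "GENE_B": [0.2, -1.0, 1.5, 0.0],   # no constrained bases
    "GENE_C": [],                      # no scored CDS bases
}

gene_constraint = {name: frac_cds_cons(scores) for name, scores in toy_genes.items()}
print(gene_constraint)
```

Because the measure is a simple proportion, it is bounded between 0 and 1 and can be computed for any gene with aligned CDS bases, which is consistent with the continuous distribution and low missingness noted above.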
Given the complexities of human PC genes, it would be surprising if any one gene metric applies to all genes (e.g., LOEUF and pLI are missing for 10.1% of PC genes). We used an empirical approach to identify gene outliers, and identified 277 genes (1.43%) inaccessible to fracCdsCons (clusters A-B, Fig. 4A; Supplementary Methods, Section 10).
We validated fracCdsCons in several ways (Supplementary Methods, Section 10). First, given its widespread use, we compared fracCdsCons to the inverse-scored LOEUF (3) and found rho = −0.55. This is notable given the markedly different basis of each measure—constraint over ~100 million years of mammalian evolution vs statistical modeling of pLoF counts in human WES catalogs (Supplemental Methods, Section 2): empirical confirmation is an important validator for both measures. We next compared fracCdsCons to external gene sets with established patterns of constraint (similar to the LOEUF validation strategy)(3) and obtained similar patterns between both scores (Figs. 4B and 4C).
Second, we used an empirical approach to cluster genes based on different constrained metrics (Fig. 4A; Supplementary Methods, Section 10; table S14). We identified 277 gene outliers (1.43%) inaccessible to fracCdsCons (clusters A-B), and conducted gene set analyses for 19,109 PC genes (clusters C-E, tables S15 and S16). The 5% most constrained genes (N=955, fracCdsCons 0.811–0.975) were strongly enriched in gene sets: basic embryology (stem cell proliferation/differentiation, tube formation, anterior/posterior patterning, endoderm/mesoderm formation); organ morphogenesis (central/peripheral nervous system, connective tissue, ear, epithelium, eye, gastrointestinal tract, heart, kidney, lung, muscle, myeloid, pancreas, skeleton); cell cycle (phase transition, fate, WNT), cell signaling, positive and negative regulatory processes; and pre-/post-synaptic processes (synapse assembly, postsynaptic density, neurotransmitter regulation, synaptic vesicle cycle, modulation of transsynaptic signaling). The 5% least constrained genes (N=956, fracCdsCons 0–0.150) were strongly enriched in gene sets: microbial defense response (adaptive immunity, bacteria/virus, cell killing, cytokine/interferon); bitter taste and olfaction; and skin development (keratinization, keratinocyte differentiation, epidermal cell differentiation, and epidermis development). The most constrained genes captured processes fundamental to the making of a mammal and the least constrained genes are central to the adaptive evolution of a mammal to its environment—i.e., the specific microbiota, adaptations of smell and taste to detect mates, prey, predators, and poisons, and adaptations of skin for temperature regulation, camouflage, and defense.
Finally, we evaluated the relevance of mammalian gene constraint to human disease. Fig. S10A shows the relationship of fracCdsCons with multiple human disease annotations. For all comparisons, increasing constraint is correlated with increasing relevance for human disease. Fig. S10B depicts the relation with GTEx gene expression, and greater gene constraint is correlated with greater expression in all tissues. “Housekeeping” genes that are uniformly expressed across tissues had greater constraint (P < 3×10−197) and comprised 3.0% of the least constrained decile and 30.5% of the most constrained decile. Finally, we evaluated the impact of common SNPs linked to PC genes in each fracCdsCons decile by estimating their gene h2 enrichment (defined as h2 enrichment for the decile annotation divided by the mean h2 enrichment over all deciles) using S-LDSC on 63 independent GWAS datasets (Supplemental Methods, Section 10). We observed significantly higher gene h2 enrichment for SNPs linked to genes in the most constrained deciles (P = 6.96×10−59; Fig. 4D and table S17). Interestingly, we observed stronger gene h2 enrichment patterns in a meta-analysis of nine brain disorders, and gene h2 enrichment patterns nearly independent of gene constraint in a meta-analysis of 11 blood and immune traits (Fig. 4D and table S17).
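The gene h2 enrichment defined above (h2 enrichment for a decile annotation divided by the mean h2 enrichment over all deciles) is a simple normalization, sketched here in Python with invented per-decile values. The function name is hypothetical and the inputs stand in for S-LDSC estimates that are not reproduced here.

```python
def gene_h2_enrichment(decile_enrichments):
    """Divide each decile's h2 enrichment by the mean enrichment over all deciles."""
    if not decile_enrichments:
        raise ValueError("at least one decile is required")
    mean_enrichment = sum(decile_enrichments) / len(decile_enrichments)
    if mean_enrichment == 0:
        raise ValueError("mean enrichment is zero; ratio is undefined")
    return [value / mean_enrichment for value in decile_enrichments]


# Invented h2 enrichments for fracCdsCons deciles 1 (least) to 10 (most constrained).
toy_deciles = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
relative = gene_h2_enrichment(toy_deciles)

for decile, value in enumerate(relative, start=1):
    print(f"decile {decile:2d}: gene h2 enrichment = {value:.2f}")
```

By construction the normalized values average to 1, so a decile above 1 carries more heritability than the typical decile and a decile below 1 carries less.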
Mammalian constraint is correlated between coding and regulatory elements.
We extended our approach to measure gene constraint on different regulatory features (including promoters, and ENCODE3 distal enhancers linked to their genes using EpiMap (34)), as human diseases and complex traits are predominantly impacted by common regulatory variants. We found substantial correlations of constraint between CDS and the regulatory parts of protein-coding genes, with a higher correlation between CDS and promoter gene constraint (r = 0.55) than between CDS and distal enhancer gene constraint (r = 0.25) (Figs. 4E-G; gene scores reported in table S18). These correlations are consistent with the idea that if the function of a gene in mammals requires high conservation of protein structure, then its regulatory sequences tend to also be constrained. Interestingly, we observed families of genes with shared constraint patterns (such as HOX genes that have constrained exons, promoters and enhancers), and with distinct constraint patterns (such as defensin beta (DEFB) genes, which only have constrained enhancers). Finally, we observed that common SNPs linked to genes with constrained promoters and distal enhancers are as enriched in h2 as those linked to genes with constrained CDS, suggesting that constraint in regulatory elements can be leveraged in the analyses of human diseases and complex traits (Fig. 4F and table S17).
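A correlation of this kind can be computed from two per-gene score vectors. The sketch below assumes a Pearson correlation (the text reports r without naming the estimator) and uses made-up scores; the function name and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up per-gene constraint scores for four hypothetical genes.
cds_constraint = [0.9, 0.7, 0.5, 0.2]
promoter_constraint = [0.8, 0.6, 0.55, 0.1]
r_toy = pearson_r(cds_constraint, promoter_constraint)
```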
Mammalian constraint and copy number variation
Copy number variants (CNVs) are genomic segments that have fewer or more copies compared to a reference genome. CNVs are important drivers of evolution and risk factors for multiple human diseases (42–44). However, CNVs often occur in high repeat/low mappability regions, so detecting their presence and assessing their significance often carries uncertainty (45, 46). We thus evaluated whether mammalian constraint could help prioritize potentially disease-related CNVs. First, as a qualitative check, we evaluated a pathogenic CNV (a small distal enhancer upstream of SOX9 with a ClinVar pathogenic annotation as a cause of Pierre Robin sequence) and found that it was highly constrained (47) (Supplementary Methods, Section 11). Second, we evaluated constraint in structural variants (SV) identified in TOPMed (4). We found that singleton (AC=1) SV deletions, inversions, and duplications had similar fractions of constrained bases. However, common (AF ≥ 0.005) SV deletions had far less constraint than SV inversions or duplications. We speculate that singletons are recent mutations relatively unexposed to purifying selection whereas common SV deletions are directly exposed to selection pressures due to the impacts of haploinsufficiency.
Third, these analyses suggest that constrained bases could have utility in CNV prioritization and burden calculations. Given that CNVs are known risk factors for schizophrenia (48), we obtained the CNV call set from the largest published study (21,094 cases, 20,227 controls) (49). After replicating the main analysis, we found that schizophrenia cases had greater CNV constraint burden (the total number of constrained bases impacted by a CNV) compared to controls. The case-control differences were 4–5 logs more significant than those for two commonly used measures of CNV burden (total number and total bases per person). The improvements were particularly notable for CNV deletions. We suggest that the number of constrained bases impacted by a CNV is a more direct assessment of functional impact: for example, a large CNV with no constrained bases is less likely to be deleterious than a far smaller CNV that deletes constrained exons, promoters, and/or enhancer elements.
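The burden measure described above, the number of constrained bases impacted by a person's CNVs, can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in the study: it assumes half-open intervals, constrained positions stored as sorted lists per chromosome, and CNVs that do not overlap one another within a person (overlaps would be counted twice). All names and coordinates are hypothetical.

```python
from bisect import bisect_left


def constrained_bases_in_interval(sorted_positions, start, end):
    """Count constrained positions p with start <= p < end (half-open interval)."""
    return bisect_left(sorted_positions, end) - bisect_left(sorted_positions, start)


def cnv_constraint_burden(cnvs, constrained_by_chrom):
    """Sum constrained bases over one person's CNVs, given as (chrom, start, end)."""
    total = 0
    for chrom, start, end in cnvs:
        positions = constrained_by_chrom.get(chrom, [])
        total += constrained_bases_in_interval(positions, start, end)
    return total


# Toy constrained positions (sorted) on two hypothetical chromosomes.
constrained = {"chr1": [100, 101, 102, 500, 900], "chr2": [50]}

large_cnv = [("chr1", 1000, 200000)]   # large CNV, no constrained bases
small_cnv = [("chr1", 95, 105)]        # small CNV covering three constrained bases

large_burden = cnv_constraint_burden(large_cnv, constrained)
small_burden = cnv_constraint_burden(small_cnv, constrained)
```

In this toy case the small CNV carries the higher burden, which mirrors the contrast drawn in the text between CNV size and the number of constrained bases affected.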
Cancer driver genes identified with mammalian constraint
Moving from the germline to the somatic genome, we demonstrated how mammalian constraint in non-coding regions of the genome could be applied to detect candidate cancer driver genes (Supplementary Methods, Section 12). Non-coding constraint mutations (NCCMs, phyloP ≥ 1.2 (50)) were identified using whole genome sequencing data (International Cancer Genome Consortium) (51) for two types of brain tumors primarily affecting children. Pilocytic astrocytoma is a low-grade tumor (52) and medulloblastomas are malignant brain tumors with intertumoral heterogeneity informed by subgroups determined by molecular profiling (i.e., Wingless/Integrated (WNT), Sonic Hedgehog Signaling (SHH), Group 3 and Group 4) (53). We identified NCCMs within introns, 5′ and 3′ UTRs, and regions within 100 kb of each gene (50).
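The selection rule stated above (phyloP ≥ 1.2, within the gene or 100 kb of it) can be illustrated with a short Python sketch. It assumes that the input mutations lie on the gene's chromosome and have already been restricted to non-coding positions, and it expresses the count as a density per 100 kb of the searched window; that normalization, the function names, and the toy coordinates are assumptions made for illustration only.

```python
PHYLOP_THRESHOLD = 1.2   # NCCM cutoff stated in the text
FLANK = 100_000          # flank size stated in the text (100 kb)


def count_nccms(mutations, gene_start, gene_end,
                flank=FLANK, threshold=PHYLOP_THRESHOLD):
    """Count (position, phyloP) mutations inside the gene window that pass the cutoff."""
    window_start = max(0, gene_start - flank)
    window_end = gene_end + flank
    return sum(
        1
        for position, score in mutations
        if window_start <= position < window_end and score >= threshold
    )


def nccms_per_100kb(n_nccms, region_length):
    """Express an NCCM count as a density per 100 kb of sequence."""
    if region_length <= 0:
        raise ValueError("region length must be positive")
    return n_nccms * 100_000 / region_length


# Toy non-coding mutations for one hypothetical gene spanning 200,000-250,000.
toy_mutations = [(150_000, 2.5), (180_000, 0.3), (260_000, 1.2), (900_000, 3.0)]
n_toy = count_nccms(toy_mutations, 200_000, 250_000)
rate_toy = nccms_per_100kb(n_toy, (250_000 + FLANK) - (200_000 - FLANK))
```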
We found drastically different NCCM rates between the two cancers. In pilocytic astrocytoma, known to have coding/translocation mutations primarily in BRAF, high NCCM rates were restricted to the BRAF locus, in line with the low somatic mutation burden of this tumor type. Strikingly, for medulloblastoma, 114 genes had ≥ 2 NCCMs/100 kb (Fig. 5A) and 525 genes had ≥ 5 NCCMs per gene. These genes were enriched for the GO biological processes “nervous system development” (P = 1.32×10⁻²⁶) and “generation of neurons” (P = 1.68×10⁻²²). Among the top 114 genes, 15 gene loci were primarily seen in adult cases (≥18 years of age) and 7 loci in pediatric cases (<18 years of age). A subset of these loci is shown in Fig. 5B (Companion paper #12, Sakthikumar et al). An example is ZFHX4, previously reported to be differentially expressed in medulloblastoma (54), where NCCMs were predominantly identified in adult patients of the SHH subgroup, and found in high constraint ZFHX4 intronic regions (Fig. 5C). For the pediatric set of medulloblastoma, potential driver genes included BMP4 and the HOXB locus (containing multiple genes), mostly in patients diagnosed as Group 3 or Group 4. Multiple NCCMs in these two loci were shown to have differential DNA binding capacity in a medulloblastoma cell line (Companion paper #12, Sakthikumar et al). Further, we noted differential gene expression in medulloblastoma compared to cerebellum for multiple NCCM genes, e.g., HOXB2 (55), for which expression levels correlate with patient survival (56).
The addition of evolutionary constraint measures may help advance stratification of medulloblastoma, both with regard to age, and molecular subgroups. More generally, we demonstrate how NCCM analysis can be used as a tool for the identification of novel driver genes in cancer. We suggest that NCCM analysis should be evaluated in more cancer types for its potential to yield a better understanding of disease biology and improved diagnosis and prognosis.
Discussion
The strength of evolutionary constraint can deepen our understanding of human diseases. The alignment of 240 placental mammals, representing ~100 million years of evolution, achieved single base resolution that allows detailed evaluation of individual mutations, in contrast to previous methodologies that offered only gene-sized resolution. Evolutionary constraint compares favourably with large amounts of functional genomics data, because constraint detects functionality in any tissue at any time point. We demonstrate that constraint can be used to detect candidate causal mutations in both rare and common disease as well as in cancer, and could be particularly leveraged for brain diseases that are more impacted by constrained genes and biological processes. Finally, we note that in non-coding regions, constraint measured in primates shows stronger heritability enrichment than constraint measured across placental mammals, suggesting that sequencing more primates would complement the current efforts to validate function of the multitude of regulatory elements present in the human lineage.
Supplementary Material
Acknowledgments
Computations and data handling were enabled by resources in projects SNIC 2017/7-385, SNIC 2017/7-386, SNIC 2019/3-415, SNIC 2019/30-57, SNIC 2019/8-369, SNIC 2021/2-11, SNIC 2021/5-296, SNIC 2021/6-208, and SNIC 2021/5-28, provided by the Swedish National Infrastructure for Computing (SNIC) at UPPMAX, partially funded by the Swedish Research Council through grant agreement no. 2018-05973.
Funding
Swedish Research Council and Knut and Alice Wallenberg Foundation, Swedish Cancer Society, Swedish Childhood Cancer Fund, NIMH U01MH116438, Gladstone Institutes, NIDA DP1DA04658501, NIDA F30DA053020, UCD Ad Astra Fellowship, R00 HG010160 and NHGRI U41HG002371.
Footnotes
Diversity & Inclusion
One or more of the authors of this paper self-identifies as a member of the LGBTQ+ community.
Competing interests
PFS is a consultant and shareholder for Neumora.
References and Notes:
- 1.ENCODE Project Consortium, Moore J. E., Purcaro M. J., Pratt H. E., Epstein C. B., Shoresh N., Adrian J., Kawli T., Davis C. A., Dobin A., Kaul R., Halow J., Van Nostrand E. L., Freese P., Gorkin D. U., Shen Y., He Y., Mackiewicz M., Pauli-Behn F., Williams B. A., Mortazavi A., Keller C. A., Zhang X.-O., Elhajjajy S. I., Huey J., Dickel D. E., Snetkova V., Wei X., Wang X., Rivera-Mulia J. C., Rozowsky J., Zhang J., Chhetri S. B., Zhang J., Victorsen A., White K. P., Visel A., Yeo G. W., Burge C. B., Lécuyer E., Gilbert D. M., Dekker J., Rinn J., Mendenhall E. M., Ecker J. R., Kellis M., Klein R. J., Noble W. S., Kundaje A., Guigó R., Farnham P. J., Cherry J. M., Myers R. M., Ren B., Graveley B. R., Gerstein M. B., Pennacchio L. A., Snyder M. P., Bernstein B. E., Wold B., Hardison R. C., Gingeras T. R., Stamatoyannopoulos J. A., Weng Z., Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 583, 699–710 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.GTEx Consortium, The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Karczewski K. J., Francioli L. C., Tiao G., Cummings B. B., Alföldi J., Wang Q., Collins R. L., Laricchia K. M., Ganna A., Birnbaum D. P., Gauthier L. D., Brand H., Solomonson M., Watts N. A., Rhodes D., Singer-Berk M., England E. M., Seaby E. G., Kosmicki J. A., Walters R. K., Tashman K., Farjoun Y., Banks E., Poterba T., Wang A., Seed C., Whiffin N., Chong J. X., Samocha K. E., Pierce-Hoffman E., Zappala Z., O’Donnell-Luria A. H., Minikel E. V., Weisburd B., Lek M., Ware J. S., Vittal C., Armean I. M., Bergelson L., Cibulskis K., Connolly K. M., Covarrubias M., Donnelly S., Ferriera S., Gabriel S., Gentry J., Gupta N., Jeandet T., Kaplan D., Llanwarne C., Munshi R., Novod S., Petrillo N., Roazen D., Ruano-Rubio V., Saltzman A., Schleicher M., Soto J., Tibbetts K., Tolonen C., Wade G., Talkowski M. E., Neale B. M., Daly M. J., MacArthur D. G., The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 581, 434–443 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Taliun D., Harris D. N., Kessler M. D., Carlson J., Szpiech Z. A., Torres R., Taliun S. A. G., Corvelo A., Gogarten S. M., Kang H. M., Pitsillides A. N., LeFaive J., Lee S.-B., Tian X., Browning B. L., Das S., Emde A.-K., Clarke W. E., Loesch D. P., Shetty A. C., Blackwell T. W., Smith A. V., Wong Q., Liu X., Conomos M. P., Bobo D. M., Aguet F., Albert C., Alonso A., Ardlie K. G., Arking D. E., Aslibekyan S., Auer P. L., Barnard J., Barr R. G., Barwick L., Becker L. C., Beer R. L., Benjamin E. J., Bielak L. F., Blangero J., Boehnke M., Bowden D. W., Brody J. A., Burchard E. G., Cade B. E., Casella J. F., Chalazan B., Chasman D. I., Chen Y.-D. I., Cho M. H., Choi S. H., Chung M. K., Clish C. B., Correa A., Curran J. E., Custer B., Darbar D., Daya M., de Andrade M., DeMeo D. L., Dutcher S. K., Ellinor P. T., Emery L. S., Eng C., Fatkin D., Fingerlin T., Forer L., Fornage M., Franceschini N., Fuchsberger C., Fullerton S. M., Germer S., Gladwin M. T., Gottlieb D. J., Guo X., Hall M. E., He J., Heard-Costa N. L., Heckbert S. R., Irvin M. R., Johnsen J. M., Johnson A. D., Kaplan R., Kardia S. L. R., Kelly T., Kelly S., Kenny E. E., Kiel D. P., Klemmer R., Konkle B. A., Kooperberg C., Köttgen A., Lange L. A., Lasky-Su J., Levy D., Lin X., Lin K.-H., Liu C., Loos R. J. F., Garman L., Gerszten R., Lubitz S. A., Lunetta K. L., Mak A. C. Y., Manichaikul A., Manning A. K., Mathias R. A., McManus D. D., McGarvey S. T., Meigs J. B., Meyers D. A., Mikulla J. L., Minear M. A., Mitchell B. D., Mohanty S., Montasser M. E., Montgomery C., Morrison A. C., Murabito J. M., Natale A., Natarajan P., Nelson S. C., North K. E., O’Connell J. R., Palmer N. D., Pankratz N., Peloso G. M., Peyser P. A., Pleiness J., Post W. S., Psaty B. M., Rao D. C., Redline S., Reiner A. P., Roden D., Rotter J. I., Ruczinski I., Sarnowski C., Schoenherr S., Schwartz D. A., Seo J.-S., Seshadri S., Sheehan V. A., Sheu W. H., Shoemaker M. B., Smith N. L., Smith J. A., Sotoodehnia N., Stilp A. 
M., Tang W., Taylor K. D., Telen M., Thornton T. A., Tracy R. P., Van Den Berg D. J., Vasan R. S., Viaud-Martinez K. A., Vrieze S., Weeks D. E., Weir B. S., Weiss S. T., Weng L.-C., Willer C. J., Zhang Y., Zhao X., Arnett D. K., Ashley-Koch A. E., Barnes K. C., Boerwinkle E., Gabriel S., Gibbs R., Rice K. M., Rich S. S., Silverman E. K., Qasba P., Gan W., NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Papanicolaou G. J., Nickerson D. A., Browning S. R., Zody M. C., Zöllner S., Wilson J. G., Cupples L. A., Laurie C. C., Jaquish C. E., Hernandez R. D., O’Connor T. D., Abecasis G. R., Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 590, 290–299 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Cooper G. M., Shendure J., Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics. 12 (2011), pp. 628–640. [DOI] [PubMed] [Google Scholar]
- 6.Kircher M., Witten D. M., Jain P., O’Roak B. J., Cooper G. M., Shendure J., A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Finucane H. K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R., Anttila V., Xu H., Zang C., Farh K., Ripke S., Day F. R., ReproGen Consortium, Schizophrenia Working Group of the Psychiatric Genomics Consortium, RACI Consortium, Purcell S., Stahl E., Lindstrom S., Perry J. R. B., Okada Y., Raychaudhuri S., Daly M. J., Patterson N., Neale B. M., Price A. L., Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Gazal S., Loh P.-R., Finucane H. K., Ganna A., Schoech A., Sunyaev S., Price A. L., Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hujoel M. L. A., Gazal S., Hormozdiari F., van de Geijn B., Price A. L., Disease Heritability Enrichment of Regulatory Elements Is Concentrated in Elements with Ancient Sequence Age and Conserved Function across Species. Am. J. Hum. Genet. 104, 611–624 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Visscher P. M., Wray N. R., Zhang Q., Sklar P., McCarthy M. I., Brown M. A., Yang J., 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101, 5–22 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Gallagher M. D., Chen-Plotkin A. S., The Post-GWAS Era: From Association to Function. Am. J. Hum. Genet. 102, 717–730 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Tam V., Patel N., Turcotte M., Bossé Y., Paré G., Meyre D., Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484 (2019). [DOI] [PubMed] [Google Scholar]
- 13.Claussnitzer M., Cho J. H., Collins R., Cox N. J., Dermitzakis E. T., Hurles M. E., Kathiresan S., Kenny E. E., Lindgren C. M., MacArthur D. G., North K. N., Plon S. E., Rehm H. L., Risch N., Rotimi C. N., Shendure J., Soranzo N., McCarthy M. I., A brief history of human disease genetics. Nature. 577, 179–189 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Uffelmann E., Huang Q. Q., Munung N. S., de Vries J., Okada Y., Martin A. R., Martin H. C., Lappalainen T., Posthuma D., Genome-wide association studies. Nature Reviews Methods Primers. 1, 1–21 (2021). [Google Scholar]
- 15.Lappalainen T., MacArthur D. G., From variant to function in human disease genetics. Science. 373, 1464–1468 (2021). [DOI] [PubMed] [Google Scholar]
- 16.Siepel A., Pollard K. S., Haussler D., New Methods for Detecting Lineage-Specific Selection. Lecture Notes in Computer Science (2006), pp. 190–205.
- 17.Siepel A., Bejerano G., Pedersen J. S., Hinrichs A. S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L. W., Richards S., Weinstock G. M., Wilson R. K., Gibbs R. A., Kent W. J., Miller W., Haussler D., Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Finan C., Gaulton A., Kruger F. A., Lumbers R. T., Shah T., Engmann J., Galver L., Kelley R., Karlsson A., Santos R., Overington J. P., Hingorani A. D., Casas J. P., The druggable genome and support for target identification and validation in drug development. Sci. Transl. Med. 9 (2017), doi: 10.1126/scitranslmed.aag1166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Landrum M. J., Lee J. M., Benson M., Brown G. R., Chao C., Chitipiralla S., Gu B., Hart J., Hoffman D., Jang W., Karapetyan K., Katz K., Liu C., Maddipatla Z., Malheiro A., McDaniel K., Ovetsky M., Riley G., Zhou G., Holmes J. B., Kattman B. L., Maglott D. R., ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Lopes-Pacheco M., CFTR Modulators: Shedding Light on Precision Medicine for Cystic Fibrosis. Front. Pharmacol. 7 (2016), doi: 10.3389/fphar.2016.00275. [DOI] [PMC free article] [PubMed]
- 21.Akbari P., Gilani A., Sosina O., Kosmicki J. A., Khrimian L., Fang Y.-Y., Persaud T., Garcia V., Sun D., Li A., Mbatchou J., Locke A. E., Benner C., Verweij N., Lin N., Hossain S., Agostinucci K., Pascale J. V., Dirice E., Dunn M., Regeneron Genetics Center, DiscovEHR Collaboration, Kraus W. E., Shah S. H., Chen Y.-D. I., Rotter J. I., Rader D. J., Melander O., Still C. D., Mirshahi T., Carey D. J., Berumen-Campos J., Kuri-Morales P., Alegre-Díaz J., Torres J. M., Emberson J. R., Collins R., Balasubramanian S., Hawes A., Jones M., Zambrowicz B., Murphy A. J., Paulding C., Coppola G., Overton J. D., Reid J. G., Shuldiner A. R., Cantor M., Kang H. M., Abecasis G. R., Karalis K., Economides A. N., Marchini J., Yancopoulos G. D., Sleeman M. W., Altarejos J., Della Gatta G., Tapia-Conyer R., Schwartzman M. L., Baras A., Ferreira M. A. R., Lotta L. A., Sequencing of 640,000 exomes identifies variants associated with protection from obesity. Science. 373 (2021), doi: 10.1126/science.abf8683. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Jones E. M., Lubock N. B., Venkatakrishnan A. J., Wang J., Tseng A. M., Paggi J. M., Latorraca N. R., Cancilla D., Satyadi M., Davis J. E., Babu M. M., Dror R. O., Kosuri S., Structural and functional characterization of G protein-coupled receptors with deep mutational scanning. Elife. 9 (2020), doi: 10.7554/eLife.54895. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Gazal S., Finucane H. K., Furlotte N. A., Loh P.-R., Palamara P. F., Liu X., Schoech A., Bulik-Sullivan B., Neale B. M., Gusev A., Price A. L., Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Gazal S., Marquez-Luna C., Finucane H. K., Price A. L., Reconciling S-LDSC and LDAK functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Gazal S., Weissbrod O., Hormozdiari F., Dey K., Nasser J., Jagadeesh K., Weiner D., Shi H., Fulco C., O’Connor L., Pasaniuc B., Engreitz J. M., Price A. L., Combining SNP-to-gene linking strategies to pinpoint disease genes and assess disease omnigenicity. medRxiv, 2021.08.02.21261488 (2021). [DOI] [PMC free article] [PubMed]
- 26.Davydov E. V., Goode D. L., Sirota M., Cooper G. M., Sidow A., Batzoglou S., Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Lindblad-Toh K., Garber M., Zuk O., Lin M. F., Parker B. J., Washietl S., Kheradpour P., Ernst J., Jordan G., Mauceli E., Ward L. D., Lowe C. B., Holloway A. K., Clamp M., Gnerre S., Alföldi J., Beal K., Chang J., Clawson H., Cuff J., Di Palma F., Fitzgerald S., Flicek P., Guttman M., Hubisz M. J., Jaffe D. B., Jungreis I., Kent W. J., Kostka D., Lara M., Martins A. L., Massingham T., Moltke I., Raney B. J., Rasmussen M. D., Robinson J., Stark A., Vilella A. J., Wen J., Xie X., Zody M. C., Broad Institute Sequencing Platform and Whole Genome Assembly Team, Baldwin J., Bloom T., Chin C. W., Heiman D., Nicol R., Nusbaum C., Young S., Wilkinson J., Worley K. C., Kovar C. L., Muzny D. M., Gibbs R. A., Baylor College of Medicine Human Genome Sequencing Center Sequencing Team, Cree A., Dihn H. H., Fowler G., Jhangiani S., Joshi V., Lee S., Lewis L. R., Nazareth L. V., Okwuonu G., Santibanez J., Warren W. C., Mardis E. R., Weinstock G. M., Wilson R. K., Genome Institute at Washington University, Delehaunty K., Dooling D., Fronik C., Fulton L., Fulton B., Graves T., Minx P., Sodergren E., Birney E., Margulies E. H., Herrero J., Green E. D., Haussler D., Siepel A., Goldman N., Pollard K. S., Pedersen J. S., Lander E. S., Kellis M., A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 478, 476–482 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Hormozdiari F., Gazal S., van de Geijn B., Finucane H. K., Ju C. J.-T., Loh P.-R., Schoech A., Reshef Y., Liu X., O’Connor L., Gusev A., Eskin E., Price A. L., Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat. Genet. 50, 1041–1047 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Shi H., Gazal S., Kanai M., Koch E. M., Schoech A. P., Siewert K. M., Kim S. S., Luo Y., Amariuta T., Huang H., Okada Y., Raychaudhuri S., Sunyaev S. R., Price A. L., Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun. 12, 1098 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Weissbrod O., Hormozdiari F., Benner C., Cui R., Ulirsch J., Gazal S., Schoech A. P., van de Geijn B., Reshef Y., Márquez-Luna C., O’Connor L., Pirinen M., Finucane H. K., Price A. L., Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Loh P.-R., Kichaev G., Gazal S., Schoech A. P., Price A. L., Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Claussnitzer M., Dankel S. N., Kim K.-H., Quon G., Meuleman W., Haugen C., Glunk V., Sousa I. S., Beaudry J. L., Puviindran V., Abdennur N. A., Liu J., Svensson P.-A., Hsu Y.-H., Drucker D. J., Mellgren G., Hui C.-C., Hauner H., Kellis M., FTO Obesity Variant Circuitry and Adipocyte Browning in Humans. N. Engl. J. Med. 373, 895–907 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Claussnitzer M., Hui C.-C., Kellis M., FTO Obesity Variant and Adipocyte Browning in Humans. N. Engl. J. Med. 374 (2016), pp. 192–193. [DOI] [PubMed] [Google Scholar]
- 34.Boix C. A., James B. T., Park Y. P., Meuleman W., Kellis M., Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature. 590, 300–307 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Umans B. D., Battle A., Gilad Y., Where Are the Disease-Associated eQTLs? Trends Genet. 37, 109–124 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Zhu J., Yamane H., Cote-Sierra J., Guo L., Paul W. E., GATA-3 promotes Th2 responses through three different mechanisms: induction of Th2 cytokine production, selective growth of Th2 cells and inhibition of Th1 cell-specific factors. Cell Res. 16, 3–10 (2006). [DOI] [PubMed] [Google Scholar]
- 37.Mjösberg J., Bernink J., Golebski K., Karrich J. J., Peters C. P., Blom B., te Velde A. A., Fokkens W. J., van Drunen C. M., Spits H., The transcription factor GATA3 is essential for the function of human type 2 innate lymphoid cells. Immunity. 37, 649–659 (2012). [DOI] [PubMed] [Google Scholar]
- 38.Wohlfert E. A., Grainger J. R., Bouladoux N., Konkel J. E., Oldenhove G., Ribeiro C. H., Hall J. A., Yagi R., Naik S., Bhairavabhotla R., Paul W. E., Bosselut R., Wei G., Zhao K., Oukka M., Zhu J., Belkaid Y., GATA3 controls Foxp3+ regulatory T cell fate during inflammation in mice. J. Clin. Invest. 121, 4503–4515 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Griesemer D., Xue J. R., Reilly S. K., Ulirsch J. C., Kukreja K., Davis J., Kanai M., Yang D. K., Montgomery S. B., Novina C. D., Tewhey R., Sabeti P. C., Genome-Wide Functional Screen of 3’UTR Variants Uncovers Causal Variants for Human Disease and Evolution. SSRN Electronic Journal, doi: 10.2139/ssrn.3762769. [DOI] [PMC free article] [PubMed]
- 40.Tewhey R., Kotliar D., Park D. S., Liu B., Winnicki S., Reilly S. K., Andersen K. G., Mikkelsen T. S., Lander E. S., Schaffner S. F., Sabeti P. C., Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell. 172, 1132–1134 (2018). [DOI] [PubMed] [Google Scholar]
- 41.Kircher M., Xiong C., Martin B., Schubach M., Inoue F., Bell R. J. A., Costello J. F., Shendure J., Ahituv N., Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Innan H., Kondrashov F., The evolution of gene duplications: classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010). [DOI] [PubMed] [Google Scholar]
- 43.Zarrei M., MacDonald J. R., Merico D., Scherer S. W., A copy number variation map of the human genome. Nat. Rev. Genet. 16, 172–183 (2015). [DOI] [PubMed] [Google Scholar]
- 44.Mérot C., Oomen R. A., Tigano A., Wellenreuther M., A Roadmap for Understanding the Evolutionary Significance of Structural Genomic Variation. Trends Ecol. Evol. 35, 561–572 (2020). [DOI] [PubMed] [Google Scholar]
- 45.Lappalainen T., Scott A. J., Brandt M., Hall I. M., Genomic Analysis in the Age of Human Genome Sequencing. Cell. 177, 70–84 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Mahmoud M., Gobet N., Cruz-Dávalos D. I., Mounier N., Dessimoz C., Sedlazeck F. J., Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Long H. K., Osterwalder M., Welsh I. C., Hansen K., Davies J. O. J., Liu Y. E., Koska M., Adams A. T., Aho R., Arora N., Ikeda K., Williams R. M., Sauka-Spengler T., Porteus M. H., Mohun T., Dickel D. E., Swigut T., Hughes J. R., Higgs D. R., Visel A., Selleri L., Wysocka J., Loss of Extreme Long-Range Enhancers in Human Neural Crest Drives a Craniofacial Disorder. Cell Stem Cell. 27, 765–783.e14 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Sullivan P. F., Geschwind D. H., Defining the Genetic, Genomic, Cellular, and Diagnostic Architectures of Psychiatric Disorders. Cell. 177, 162–183 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Marshall C. R., Howrigan D. P., Merico D., Thiruvahindrapuram B., Wu W., Greer D. S., Antaki D., Shetty A., Holmans P. A., Pinto D., Gujral M., Brandler W. M., Malhotra D., Wang Z., Fajarado K. V. F., Maile M. S., Ripke S., Agartz I., Albus M., Alexander M., Amin F., Atkins J., Bacanu S. A., Belliveau R. A. Jr, Bergen S. E., Bertalan M., Bevilacqua E., Bigdeli T. B., Black D. W., Bruggeman R., Buccola N. G., Buckner R. L., Bulik-Sullivan B., Byerley W., Cahn W., Cai G., Cairns M. J., Campion D., Cantor R. M., Carr V. J., Carrera N., Catts S. V., Chambert K. D., Cheng W., Cloninger C. R., Cohen D., Cormican P., Craddock N., Crespo-Facorro B., Crowley J. J., Curtis D., Davidson M., Davis K. L., Degenhardt F., Del Favero J., DeLisi L. E., Dikeos D., Dinan T., Djurovic S., Donohoe G., Drapeau E., Duan J., Dudbridge F., Eichhammer P., Eriksson J., Escott-Price V., Essioux L., Fanous A. H., Farh K.-H., Farrell M. S., Frank J., Franke L., Freedman R., Freimer N. B., Friedman J. I., Forstner A. J., Fromer M., Genovese G., Georgieva L., Gershon E. S., Giegling I., Giusti-Rodríguez P., Godard S., Goldstein J. I., Gratten J., de Haan L., Hamshere M. L., Hansen M., Hansen T., Haroutunian V., Hartmann A. M., Henskens F. A., Herms S., Hirschhorn J. N., Hoffmann P., Hofman A., Huang H., Ikeda M., Joa I., Kähler A. K., Kahn R. S., Kalaydjieva L., Karjalainen J., Kavanagh D., Keller M. C., Kelly B. J., Kennedy J. L., Kim Y., Knowles J. A., Konte B., Laurent C., Lee P., Lee S. H., Legge S. E., Lerer B., Levy D. L., Liang K.-Y., Lieberman J., Lönnqvist J., Loughland C. M., Magnusson P. K. E., Maher B. S., Maier W., Mallet J., Mattheisen M., Mattingsdal M., McCarley R. W., McDonald C., McIntosh A. M., Meier S., Meijer C. J., Melle I., Mesholam-Gately R. I., Metspalu A., Michie P. T., Milani L., Milanova V., Mokrab Y., Morris D. W., Müller-Myhsok B., Murphy K. C., Murray R. M., Myin-Germeys I., Nenadic I., Nertney D. A., Nestadt G., Nicodemus K. 
K., Nisenbaum L., Nordin A., O’Callaghan E., O’Dushlaine C., Oh S.-Y., Olincy A., Olsen L., O’Neill F. A., Van Os J., Pantelis C., Papadimitriou G. N., Parkhomenko E., Pato M. T., Paunio T., Psychosis Endophenotypes International Consortium, Perkins D. O., Pers T. H., Pietiläinen O., Pimm J., Pocklington A. J., Powell J., Price A., Pulver A. E., Purcell S. M., Quested D., Rasmussen H. B., Reichenberg A., Reimers M. A., Richards A. L., Roffman J. L., Roussos P., Ruderfer D. M., Salomaa V., Sanders A. R., Savitz A., Schall U., Schulze T. G., Schwab S. G., Scolnick E. M., Scott R. J., Seidman L. J., Shi J., Silverman J. M., Smoller J. W., Söderman E., Spencer C. C. A., Stahl E. A., Strengman E., Strohmaier J., Stroup T. S., Suvisaari J., Svrakic D. M., Szatkiewicz J. P., Thirumalai S., Tooney P. A., Veijola J., Visscher P. M., Waddington J., Walsh D., Webb B. T., Weiser M., Wildenauer D. B., Williams N. M., Williams S., Witt S. H., Wolen A. R., Wormley B. K., Wray N. R., Wu J. Q., Zai C. C., Adolfsson R., Andreassen O. A., Blackwood D. H. R., Bramon E., Buxbaum J. D., Cichon S., Collier D. A., Corvin A., Daly M. J., Darvasi A., Domenici E., Esko T., Gejman P. V., Gill M., Gurling H., Hultman C. M., Iwata N., Jablensky A. V., Jönsson E. G., Kendler K. S., Kirov G., Knight J., Levinson D. F., Li Q. S., McCarroll S. A., McQuillin A., Moran J. L., Mowry B. J., Nöthen M. M., Ophoff R. A., Owen M. J., Palotie A., Pato C. N., Petryshen T. L., Posthuma D., Rietschel M., Riley B. P., Rujescu D., Sklar P., St Clair D., Walters J. T. R., Werge T., Sullivan P. F., O’Donovan M. C., Scherer S. W., Neale B. M., Sebat J., CNV and Schizophrenia Working Groups of the Psychiatric Genomics Consortium, Contribution of copy number variants to schizophrenia from a genome-wide study of 41,321 subjects. Nat. Genet. 49, 27–35 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Sakthikumar S., Roy A., Haseeb L., Pettersson M. E., Sundström E., Marinescu V. D., Lindblad-Toh K., Forsberg-Nilsson K., Whole-genome sequencing of glioblastoma reveals enrichment of non-coding constraint mutations in known and novel genes. Genome Biol. 21, 127 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Zhang J., Bajari R., Andric D., Gerthoffert F., Lepsa A., Nahal-Bose H., Stein L. D., Ferretti V., The International Cancer Genome Consortium Data Portal. Nat. Biotechnol. 37, 367–369 (2019). [DOI] [PubMed] [Google Scholar]
- 52.Louis D. N., Perry A., Reifenberger G., von Deimling A., Figarella-Branger D., Cavenee W. K., Ohgaki H., Wiestler O. D., Kleihues P., Ellison D. W., The 2016 World Health Organization Classification of Tumors of the Central Nervous System: a summary. Acta Neuropathol. 131, 803–820 (2016). [DOI] [PubMed] [Google Scholar]
- 53.Northcott P. A., Buchhalter I., Morrissy A. S., Hovestadt V., Weischenfeldt J., Ehrenberger T., Gröbner S., Segura-Wang M., Zichner T., Rudneva V. A., Warnatz H.-J., Sidiropoulos N., Phillips A. H., Schumacher S., Kleinheinz K., Waszak S. M., Erkek S., Jones D. T. W., Worst B. C., Kool M., Zapatka M., Jäger N., Chavez L., Hutter B., Bieg M., Paramasivam N., Heinold M., Gu Z., Ishaque N., Jäger-Schmidt C., Imbusch C. D., Jugold A., Hübschmann D., Risch T., Amstislavskiy V., Gonzalez F. G. R., Weber U. D., Wolf S., Robinson G. W., Zhou X., Wu G., Finkelstein D., Liu Y., Cavalli F. M. G., Luu B., Ramaswamy V., Wu X., Koster J., Ryzhova M., Cho Y.-J., Pomeroy S. L., Herold-Mende C., Schuhmann M., Ebinger M., Liau L. M., Mora J., McLendon R. E., Jabado N., Kumabe T., Chuah E., Ma Y., Moore R. A., Mungall A. J., Mungall K. L., Thiessen N., Tse K., Wong T., Jones S. J. M., Witt O., Milde T., Von Deimling A., Capper D., Korshunov A., Yaspo M.-L., Kriwacki R., Gajjar A., Zhang J., Beroukhim R., Fraenkel E., Korbel J. O., Brors B., Schlesner M., Eils R., Marra M. A., Pfister S. M., Taylor M. D., Lichter P., The whole-genome landscape of medulloblastoma subtypes. Nature. 547, 311–317 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Smits M., van Rijn S., Hulleman E., Biesmans D., van Vuurden D. G., Kool M., Haberler C., Aronica E., Vandertop W. P., Noske D. P., Würdinger T., EZH2-regulated DAB2IP is a medulloblastoma tumor suppressor and a positive marker for survival. Clin. Cancer Res. 18, 4048–4058 (2012). [DOI] [PubMed] [Google Scholar]
- 55.Weishaupt H., Johansson P., Sundström A., Lubovac-Pilav Z., Olsson B., Nelander S., Swartling F. J., Batch-normalization of cerebellar and medulloblastoma gene expression datasets utilizing empirically defined negative control genes. Bioinformatics. 35, 3357–3364 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Cavalli F. M. G., Remke M., Rampasek L., Peacock J., Shih D. J. H., Luu B., Garzia L., Torchia J., Nor C., Morrissy A. S., Agnihotri S., Thompson Y. Y., Kuzan-Fischer C. M., Farooq H., Isaev K., Daniels C., Cho B.-K., Kim S.-K., Wang K.-C., Lee J. Y., Grajkowska W. A., Perek-Polnik M., Vasiljevic A., Faure-Conter C., Jouvet A., Giannini C., Nageswara Rao A. A., Li K. K. W., Ng H.-K., Eberhart C. G., Pollack I. F., Hamilton R. L., Gillespie G. Y., Olson J. M., Leary S., Weiss W. A., Lach B., Chambless L. B., Thompson R. C., Cooper M. K., Vibhakar R., Hauser P., van Veelen M.-L. C., Kros J. M., French P. J., Ra Y. S., Kumabe T., López-Aguilar E., Zitterbart K., Sterba J., Finocchiaro G., Massimino M., Van Meir E. G., Osuka S., Shofuda T., Klekner A., Zollo M., Leonard J. R., Rubin J. B., Jabado N., Albrecht S., Mora J., Van Meter T. E., Jung S., Moore A. S., Hallahan A. R., Chan J. A., Tirapelli D. P. C., Carlotti C. G., Fouladi M., Pimentel J., Faria C. C., Saad A. G., Massimi L., Liau L. M., Wheeler H., Nakamura H., Elbabaa S. K., Perezpeña-Diazconti M., Chico Ponce de León F., Robinson S., Zapotocky M., Lassaletta A., Huang A., Hawkins C. E., Tabori U., Bouffet E., Bartels U., Dirks P. B., Rutka J. T., Bader G. D., Reimand J., Goldenberg A., Ramaswamy V., Taylor M. D., Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell. 31, 737–754.e6 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Buniello A., MacArthur J. A. L., Cerezo M., Harris L. W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E., Suveges D., Vrousgou O., Whetzel P. L., Amode R., Guillen J. A., Riat H. S., Trevanion S. J., Hall P., Junkins H., Flicek P., Burdett T., Hindorff L. A., Cunningham F., Parkinson H., The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Liu B., Gloudemans M. J., Rao A. S., Ingelsson E., Montgomery S. B., Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 51, 768–769 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Amberger J. S., Bocchini C. A., Scott A. F., Hamosh A., OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res. 47, D1038–D1043 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Landrum M. J., Lee J. M., Riley G. R., Jang W., Rubinstein W. S., Church D. M., Maglott D. R., ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Lek M., Karczewski K. J., Minikel E. V., Samocha K. E., Banks E., Fennell T., O’Donnell-Luria A. H., Ware J. S., Hill A. J., Cummings B. B., Tukiainen T., Birnbaum D. P., Kosmicki J. A., Duncan L. E., Estrada K., Zhao F., Zou J., Pierce-Hoffman E., Berghout J., Cooper D. N., Deflaux N., DePristo M., Do R., Flannick J., Fromer M., Gauthier L., Goldstein J., Gupta N., Howrigan D., Kiezun A., Kurki M. I., Moonshine A. L., Natarajan P., Orozco L., Peloso G. M., Poplin R., Rivas M. A., Ruano-Rubio V., Rose S. A., Ruderfer D. M., Shakir K., Stenson P. D., Stevens C., Thomas B. P., Tiao G., Tusie-Luna M. T., Weisburd B., Won H.-H., Yu D., Altshuler D. M., Ardissino D., Boehnke M., Danesh J., Donnelly S., Elosua R., Florez J. C., Gabriel S. B., Getz G., Glatt S. J., Hultman C. M., Kathiresan S., Laakso M., McCarroll S., McCarthy M. I., McGovern D., McPherson R., Neale B. M., Palotie A., Purcell S. M., Saleheen D., Scharf J. M., Sklar P., Sullivan P. F., Tuomilehto J., Tsuang M. T., Watkins H. C., Wilson J. G., Daly M. J., MacArthur D. G., Exome Aggregation Consortium, Analysis of protein-coding genetic variation in 60,706 humans. Nature. 536, 285–291 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Cummings B. B., Karczewski K. J., Kosmicki J. A., Seaby E. G., Watts N. A., Singer-Berk M., Mudge J. M., Karjalainen J., Satterstrom F. K., O’Donnell-Luria A. H., Poterba T., Seed C., Solomonson M., Alföldi J., Daly M. J., MacArthur D. G., Transcript expression-aware annotation improves rare variant interpretation. Nature. 581, 452–458 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Johansson P. A., Brattås P. L., Douse C. H., Hsieh P., Adami A., Pontis J., Grassi D., Garza R., Sozzi E., Cataldo R., Jönsson M. E., Atacho D. A. M., Pircs K., Eren F., Sharma Y., Johansson J., Fiorenzano A., Parmar M., Fex M., Trono D., Eichler E. E., Jakobsson J., A cis-acting structural variation at the ZNF558 locus controls a gene regulatory network in human brain development. Cell Stem Cell (2021), doi: 10.1016/j.stem.2021.09.008. [DOI] [PubMed]
- 64.Zoonomia Consortium, A comparative genomics multitool for scientific discovery and conservation. Nature. 587, 240–245 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J. E., Mudge J. M., Sisu C., Wright J. C., Armstrong J., Barnes I., Berry A., Bignell A., Boix C., Carbonell Sala S., Cunningham F., Di Domenico T., Donaldson S., Fiddes I. T., García Girón C., Gonzalez J. M., Grego T., Hardy M., Hourlier T., Howe K. L., Hunt T., Izuogu O. G., Johnson R., Martin F. J., Martínez L., Mohanan S., Muir P., Navarro F. C. P., Parker A., Pei B., Pozo F., Riera F. C., Ruffier M., Schmitt B. M., Stapleton E., Suner M.-M., Sycheva I., Uszczynska-Ratajczak B., Wolf M. Y., Xu J., Yang Y. T., Yates A., Zerbino D., Zhang Y., Choudhary J. S., Gerstein M., Guigó R., Hubbard T. J. P., Kellis M., Paten B., Tress M. L., Flicek P., GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Hickey G., Paten B., Earl D., Zerbino D., Haussler D., HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics. 29, 1341–1342 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Earl D., Nguyen N., Hickey G., Harris R. S., Fitzgerald S., Beal K., Seledtsov I., Molodtsov V., Raney B. J., Clawson H., Kim J., Kemena C., Chang J.-M., Erb I., Poliakov A., Hou M., Herrero J., Kent W. J., Solovyev V., Darling A. E., Ma J., Notredame C., Brudno M., Dubchak I., Haussler D., Paten B., Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 24, 2077–2089 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Felsenstein J., Churchill G. A., A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13, 93–104 (1996). [DOI] [PubMed] [Google Scholar]
- 69.Storey J. D., A direct approach to false discovery rates. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002). [Google Scholar]
- 70.Pollard K. S., Hubisz M. J., Rosenbloom K. R., Siepel A., Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Navarro Gonzalez J., Zweig A. S., Speir M. L., Schmelter D., Rosenbloom K. R., Raney B. J., Powell C. C., Nassar L. R., Maulding N. D., Lee C. M., Lee B. T., Hinrichs A. S., Fyfe A. C., Fernandes J. D., Diekhans M., Clawson H., Casper J., Benet-Pagès A., Barber G. P., Haussler D., Kuhn R. M., Haeussler M., Kent W. J., The UCSC Genome Browser database: 2021 update. Nucleic Acids Res. 49, D1046–D1057 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Zdobnov E. M., Kuznetsov D., Tegenfeldt F., Manni M., Berkeley M., Kriventseva E. V., OrthoDB in 2020: evolutionary and functional annotations of orthologs. Nucleic Acids Res. 49, D389–D393 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Partridge E. C., Chhetri S. B., Prokop J. W., Ramaker R. C., Jansen C. S., Goh S.-T., Mackiewicz M., Newberry K. M., Brandsmeier L. A., Meadows S. K., Messer C. L., Hardigan A. A., Coppola C. J., Dean E. C., Jiang S., Savic D., Mortazavi A., Wold B. J., Myers R. M., Mendenhall E. M., Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature. 583, 720–728 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.ENCODE Project Consortium, Snyder M. P., Gingeras T. R., Moore J. E., Weng Z., Gerstein M. B., Ren B., Hardison R. C., Stamatoyannopoulos J. A., Graveley B. R., Feingold E. A., Pazin M. J., Pagan M., Gilchrist D. A., Hitz B. C., Cherry J. M., Bernstein B. E., Mendenhall E. M., Zerbino D. R., Frankish A., Flicek P., Myers R. M., Perspectives on ENCODE. Nature. 583, 693–698 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Grubert F., Srivas R., Spacek D. V., Kasowski M., Ruiz-Velasco M., Sinnott-Armstrong N., Greenside P., Narasimha A., Liu Q., Geller B., Sanghi A., Kulik M., Sa S., Rabinovitch M., Kundaje A., Dalton S., Zaugg J. B., Snyder M., Landscape of cohesin-mediated chromatin loops in the human genome. Nature. 583, 737–743 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Van Nostrand E. L., Freese P., Pratt G. A., Wang X., Wei X., Xiao R., Blue S. M., Chen J.-Y., Cody N. A. L., Dominguez D., Olson S., Sundararaman B., Zhan L., Bazile C., Bouvrette L. P. B., Bergalet J., Duff M. O., Garcia K. E., Gelboin-Burkhart C., Hochman M., Lambert N. J., Li H., McGurk M. P., Nguyen T. B., Palden T., Rabano I., Sathe S., Stanton R., Su A., Wang R., Yee B. A., Zhou B., Louie A. L., Aigner S., Fu X.-D., Lécuyer E., Burge C. B., Graveley B. R., Yeo G. W., A large-scale binding and functional map of human RNA-binding proteins. Nature. 583, 711–719 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Fullard J. F., Giambartolomei C., Hauberg M. E., Xu K., Voloudakis G., Shao Z., Bare C., Dudley J. T., Mattheisen M., Robakis N. K., Haroutunian V., Roussos P., Open chromatin profiling of human postmortem brain infers functional roles for non-coding schizophrenia loci. Hum. Mol. Genet. (2018), doi: 10.1093/hmg/ddy229. [DOI] [PMC free article] [PubMed]
- 78.de la Torre-Ubieta L., Stein J. L., Won H., Opland C. K., Liang D., Lu D., Geschwind D. H., The Dynamic Landscape of Open Chromatin during Human Cortical Neurogenesis. Cell. 172, 289–304.e18 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Fullard J. F., Hauberg M. E., Bendl J., Egervari G., Cirnaru M.-D., Reach S. M., Motl J., Ehrlich M. E., Hurd Y. L., Roussos P., An atlas of chromatin accessibility in the adult human brain. Genome Res. 28, 1243–1252 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Bryois J., Garrett M. E., Song L., Safi A., Giusti-Rodriguez P., Johnson G. D., Shieh A. W., Buil A., Fullard J. F., Roussos P., Sklar P., Akbarian S., Haroutunian V., Stockmeier C. A., Wray G. A., White K. P., Liu C., Reddy T. E., Ashley-Koch A., Sullivan P. F., Crawford G. E., Evaluation of chromatin accessibility in prefrontal cortex of individuals with schizophrenia. Nat. Commun. 9, 3121 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Wang D., Liu S., Warrell J., Won H., Shi X., Navarro F. C. P., Clarke D., Gu M., Emani P., Yang Y. T., Xu M., Gandal M. J., Lou S., Zhang J., Park J. J., Yan C., Rhie S. K., Manakongtreecheep K., Zhou H., Nathan A., Peters M., Mattei E., Fitzgerald D., Brunetti T., Moore J., Jiang Y., Girdhar K., Hoffman G. E., Kalayci S., Gümüş Z. H., Crawford G. E., PsychENCODE Consortium, Roussos P., Akbarian S., Jaffe A. E., White K. P., Weng Z., Sestan N., Geschwind D. H., Knowles J. A., Gerstein M. B., Comprehensive functional genomic resource and integrative model for the human brain. Science. 362 (2018), doi: 10.1126/science.aat8464. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Hopkins A. L., Groom C. R., The druggable genome. Nat. Rev. Drug Discov. 1, 727–730 (2002). [DOI] [PubMed] [Google Scholar]
- 83.Rentzsch P., Witten D., Cooper G. M., Shendure J., Kircher M., CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 47, D886–D894 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Katoh K., Standley D. M., MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Jones C. T., Swingler R. J., Brock D. J., Identification of a novel SOD1 mutation in an apparently sporadic amyotrophic lateral sclerosis patient and the detection of Ile113Thr in three others. Hum. Mol. Genet. 3, 649–650 (1994). [DOI] [PubMed] [Google Scholar]
- 86.Draper A. C. E., Wilson Z., Maile C., Faccenda D., Campanella M., Piercy R. J., Species-specific consequences of an E40K missense mutation in superoxide dismutase 1 (SOD1). FASEB J. 34, 458–473 (2020). [DOI] [PubMed] [Google Scholar]
- 87.1000 Genomes Project Consortium, Auton A., Brooks L. D., Durbin R. M., Garrison E. P., Kang H. M., Korbel J. O., Marchini J. L., McCarthy S., McVean G. A., Abecasis G. R., A global reference for human genetic variation. Nature. 526, 68–74 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Bulik-Sullivan B. K., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Loh P.-R., Finucane H. K., Ripke S., Yang J., Patterson N., Daly M. J., Price A. L., Neale B. M., LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Bulik-Sullivan B., ReproGen Consortium, Finucane H. K., Anttila V., Gusev A., Day F. R., Loh P.-R., Duncan L., Perry J. R. B., Patterson N., Robinson E. B., Daly M. J., Price A. L., Neale B. M., Psychiatric Genomics Consortium, Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium, An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Wang G., Sarkar A., Carbonetto P., Stephens M., A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 82, 1273–1300 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Petrovski S., Gussow A. B., Wang Q., Halvorsen M., Han Y., Weir W. H., Allen A. S., Goldstein D. B., The Intolerance of Regulatory Sequence to Genetic Variation Predicts Gene Dosage Sensitivity. PLoS Genet. 11, e1005492 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Grant B. J., Skjaerven L., Yao X.-Q., The Bio3D packages for structural bioinformatics. Protein Sci. 30, 20–30 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.McInnes L., Healy J., Melville J., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2018), (available at http://arxiv.org/abs/1802.03426).
- 94.Hahsler M., Piekenbrock M., Doran D., dbscan: Fast Density-Based Clustering with R. J. Stat. Softw. 91 (2019), doi: 10.18637/jss.v091.i01. [DOI] [Google Scholar]
- 95.Howe K. L., Achuthan P., Allen J., Allen J., Alvarez-Jarreta J., Amode M. R., Armean I. M., Azov A. G., Bennett R., Bhai J., Billis K., Boddu S., Charkhchi M., Cummins C., Da Rin Fioretto L., Davidson C., Dodiya K., El Houdaigui B., Fatima R., Gall A., Garcia Giron C., Grego T., Guijarro-Clarke C., Haggerty L., Hemrom A., Hourlier T., Izuogu O. G., Juettemann T., Kaikala V., Kay M., Lavidas I., Le T., Lemos D., Gonzalez Martinez J., Marugán J. C., Maurel T., McMahon A. C., Mohanan S., Moore B., Muffato M., Oheh D. N., Paraschas D., Parker A., Parton A., Prosovetskaia I., Sakthivel M. P., Salam A. I. A., Schmitt B. M., Schuilenburg H., Sheppard D., Steed E., Szpak M., Szuba M., Taylor K., Thormann A., Threadgold G., Walts B., Winterbottom A., Chakiachvili M., Chaubal A., De Silva N., Flint B., Frankish A., Hunt S. E., Ilsley G. R., Langridge N., Loveland J. E., Martin F. J., Mudge J. M., Morales J., Perry E., Ruffier M., Tate J., Thybert D., Trevanion S. J., Cunningham F., Yates A. D., Zerbino D. R., Flicek P., Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Han X., Zhou Z., Fei L., Sun H., Wang R., Chen Y., Chen H., Wang J., Tang H., Ge W., Zhou Y., Ye F., Jiang M., Wu J., Xiao Y., Jia X., Zhang T., Ma X., Zhang Q., Bai X., Lai S., Yu C., Zhu L., Lin R., Gao Y., Wang M., Wu Y., Zhang J., Zhan R., Zhu S., Hu H., Wang C., Chen M., Huang H., Liang T., Chen J., Wang W., Zhang D., Guo G., Construction of a human cell landscape at single-cell level. Nature. 581, 303–309 (2020). [DOI] [PubMed] [Google Scholar]
- 97.Gene Ontology Consortium, The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Koopmans F., van Nierop P., Andres-Alonso M., Byrnes A., Cijsouw T., Coba M. P., Cornelisse L. N., Farrell R. J., Goldschmidt H. L., Howrigan D. P., Hussain N. K., Imig C., de Jong A. P. H., Jung H., Kohansalnodehi M., Kramarz B., Lipstein N., Lovering R. C., MacGillavry H., Mariano V., Mi H., Ninov M., Osumi-Sutherland D., Pielot R., Smalla K.-H., Tang H., Tashman K., Toonen R. F. G., Verpelli C., Reig-Viader R., Watanabe K., van Weering J., Achsel T., Ashrafi G., Asi N., Brown T. C., De Camilli P., Feuermann M., Foulger R. E., Gaudet P., Joglekar A., Kanellopoulos A., Malenka R., Nicoll R. A., Pulido C., de Juan-Sanz J., Sheng M., Südhof T. C., Tilgner H. U., Bagni C., Bayés À., Biederer T., Brose N., Chua J. J. E., Dieterich D. C., Gundelfinger E. D., Hoogenraad C., Huganir R. L., Jahn R., Kaeser P. S., Kim E., Kreutz M. R., McPherson P. S., Neale B. M., O’Connor V., Posthuma D., Ryan T. A., Sala C., Feng G., Hyman S. E., Thomas P. D., Smit A. B., Verhage M., SynGO: An Evidence-Based, Expert-Curated Knowledge Base for the Synapse. Neuron. 103, 217–234.e4 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Sanyal A., Lajoie B. R., Jain G., Dekker J., The long-range interaction landscape of gene promoters. Nature. 489, 109–113 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Innan H., Kondrashov F., The evolution of gene duplications: classifying and distinguishing between models. Nat. Rev. Genet. 11, 97–108 (2010). [DOI] [PubMed] [Google Scholar]
- 101.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M. A., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Casper J., Zweig A. S., Villarreal C., Tyner C., Speir M. L., Rosenbloom K. R., Raney B. J., Lee C. M., Lee B. T., Karolchik D., Hinrichs A. S., Haeussler M., Guruvadoo L., Navarro Gonzalez J., Gibson D., Fiddes I. T., Eisenhart C., Diekhans M., Clawson H., Barber G. P., Armstrong J., Haussler D., Kuhn R. M., Kent W. J., The UCSC Genome Browser database: 2018 update. Nucleic Acids Res. 46, D762–D769 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Thomas-Chollier M., Hufton A., Heinig M., O’Keeffe S., Masri N. E., Roider H. G., Manke T., Vingron M., Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nat. Protoc. 6, 1860–1869 (2011). [DOI] [PubMed] [Google Scholar]
- 104.Portales-Casamar E., Thongjuea S., Kwon A. T., Arenillas D., Zhao X., Valen E., Yusuf D., Lenhard B., Wasserman W. W., Sandelin A., JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105–D110 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Bandopadhayay P., Bergthold G., Nguyen B., Schubert S., Gholamin S., Tang Y., Bolin S., Schumacher S. E., Zeid R., Masoud S., Yu F., Vue N., Gibson W. J., Paolella B. R., Mitra S. S., Cheshier S. H., Qi J., Liu K.-W., Wechsler-Reya R., Weiss W. A., Swartling F. J., Kieran M. W., Bradner J. E., Beroukhim R., Cho Y.-J., BET bromodomain inhibition of MYC-amplified medulloblastoma. Clin. Cancer Res. 20, 912–925 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data