A genomic mutational constraint map using variation in 76,156 human genomes

Siwei Chen; Laurent C Francioli; Julia K Goodrich; Ryan L Collins; Masahiro Kanai; Qingbo Wang; Jessica Alföldi; Nicholas A Watts; Christopher Vittal; Laura D Gauthier; Timothy Poterba; Michael W Wilson; Yekaterina Tarasova; William Phu; Riley Grant; Mary T Yohannes; Zan Koenig; Yossi Farjoun; Eric Banks; Stacey Donnelly; Stacey Gabriel; Namrata Gupta; Steven Ferriera; Charlotte Tolonen; Sam Novod; Louis Bergelson; David Roazen; Valentin Ruano-Rubio; Miguel Covarrubias; Christopher Llanwarne; Nikelle Petrillo; Gordon Wade; Thibault Jeandet; Ruchi Munshi; Kathleen Tibbetts; Genome Aggregation Database Consortium; Anne O’Donnell-Luria; Matthew Solomonson; Cotton Seed; Alicia R Martin; Michael E Talkowski; Heidi L Rehm; Mark J Daly; Grace Tiao; Benjamin M Neale; Daniel G MacArthur; Konrad J Karczewski

doi:10.1038/s41586-023-06045-0

Author manuscript; available in PMC: 2024 Dec 10.

Published in final edited form as: Nature. 2023 Dec 6;625(7993):92–100. doi: 10.1038/s41586-023-06045-0


Siwei Chen ¹,²,†, Laurent C Francioli ¹,²,†, Julia K Goodrich ¹, Ryan L Collins ¹,³,⁴, Masahiro Kanai ¹,², Qingbo Wang ¹,⁵, Jessica Alföldi ¹,², Nicholas A Watts ¹,², Christopher Vittal ¹,², Laura D Gauthier ⁶, Timothy Poterba ¹,²,⁷, Michael W Wilson ¹,², Yekaterina Tarasova ¹, William Phu ¹,⁸, Riley Grant ¹, Mary T Yohannes ¹, Zan Koenig ²,⁷, Yossi Farjoun ⁹, Eric Banks ⁶, Stacey Donnelly ¹⁰, Stacey Gabriel ¹¹, Namrata Gupta ¹,¹¹, Steven Ferriera ¹¹, Charlotte Tolonen ⁶, Sam Novod ⁶, Louis Bergelson ⁶, David Roazen ⁶, Valentin Ruano-Rubio ⁶, Miguel Covarrubias ⁶, Christopher Llanwarne ⁶, Nikelle Petrillo ⁶, Gordon Wade ⁶, Thibault Jeandet ⁶, Ruchi Munshi ⁶, Kathleen Tibbetts ⁶; Genome Aggregation Database Consortium, Anne O’Donnell-Luria ¹,³,⁸, Matthew Solomonson ¹,², Cotton Seed ²,⁷, Alicia R Martin ¹,²,⁷, Michael E Talkowski ¹,³,⁷, Heidi L Rehm ¹,³, Mark J Daly ¹,²,¹², Grace Tiao ¹,², Benjamin M Neale ¹,²,†, Daniel G MacArthur ¹,¹³,¹⁴,†, Konrad J Karczewski ¹,²,⁷

¹Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

²Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA

³Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA

⁴Division of Medical Sciences, Harvard Medical School, Boston, MA, USA

⁵Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan

⁶Data Science Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁷Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁸Division of Genetics and Genomics, Boston Children’s Hospital, Boston, MA, USA

⁹Richards Lab, Lady Davis Institute, Montreal, QC, Canada

¹⁰Broad Institute of MIT and Harvard, Cambridge, MA, USA

¹¹Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

¹²Institute for Molecular Medicine Finland, (FIMM) Helsinki, Finland

¹³Centre for Population Genomics, Garvan Institute of Medical Research and UNSW Sydney, Sydney, Australia

¹⁴Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Australia

† These authors contributed equally.

Author Contribution

SC, LCF, JKG, QW, AOD-L, HLR, MJD, BMN, DGM, and KJK contributed to the writing of the manuscript and generation of figures. SC, RLC, MK, and KJK contributed to the analysis of data. LCF, QW, CV, LDG, TP, CS, MET, BMN, and KJK developed tools and methods that enabled the scientific discoveries herein. LCF, JKG, JA, MWW, YT, WP, MTY, ZK, YF, EB, SD, SG, NG, SF, CT, SN, LB, DR, VR-R, MC, CL, NP, GW, TJ, RM, KT, ARM, GT, and KJK contributed to the production and quality control of the gnomAD dataset. NAW, RG, MS, and KJK contributed to the gnomAD browser. All authors listed under The Genome Aggregation Database Consortium contributed to the generation of the primary data incorporated into the gnomAD resource. All authors reviewed the manuscript. SC and LCF contributed equally to this study. BMN and DGM contributed equally to this study.

✉ Correspondence should be addressed to K.J.K. (konradk@broadinstitute.org) and S.C. (siwei@broadinstitute.org).

PMCID: PMC11629659 NIHMSID: NIHMS2037379 PMID: 38057664

Summary

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders^1–4, but attempts to assess constraint for non-protein-coding regions have proven more difficult. Here we aggregate, process, and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use this dataset to build a genomic constraint map for the whole genome (called Gnocchi). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association, and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.

The expansion in the scale of human whole-genome or exome sequencing data has allowed characterization of the patterns of variation in human genes. With these data it is possible to directly assess the strength of negative selection on loss-of-function (LoF) and missense variation by modeling “constraint,” the depletion of variation in a gene compared to an expectation conditioned on that gene’s mutability. Using coding variant data from sequencing thousands to hundreds of thousands of humans⁵, we and others previously developed constraint metrics that classify each protein-coding gene along a spectrum of LoF/missense intolerance^5–7, providing a valuable resource for studying the functional significance of human genes^1–4. Although of outsized biological importance, protein-coding regions comprise less than 2% of the human genome, and the vast non-coding genome has been much less characterized, even though the importance of non-coding variation in human complex diseases has been long recognized^8–12.

Several challenges arise when extending the gene constraint model to the non-coding space. First, the sample size of human whole-genome reference data has been relatively small compared to the exome, limiting the power of detecting depletions of variation at a fine scale. Second, in coding regions, the gene model enables accurate prediction of the effect of specific variants on amino acid translation; such nucleotide-specific models of the consequences of basepair changes are not available in non-coding regions. Third, there is a strong expectation from Mendelian genetics and existing constraint analyses that the coding regions, while a small fraction of the genome, are grossly overrepresented for rare and common disease mutations under selection. Fourth, the mutation rate in non-coding regions is highly heterogeneous and can be affected not only by local sequence context as commonly modeled in gene constraint metrics but also by a variety of genomic features at larger scales^13,14.

Current methods attempting to evaluate non-coding constraint can be broadly divided into three categories: 1) context-dependent mutational models that assess the deviation of observed variation from an expectation based on the sequence composition of k-mers (e.g., Orion¹⁵, CDTS¹⁶, DR¹⁷); 2) machine-learning classifiers that are trained to differentiate between disease-associated variants and benign variants (e.g., CADD¹⁸, GWAVA¹⁹, JARVIS²⁰); and 3) phylogenetic conservation scores that use comparative genomics data to infer evolutionary constraint (e.g., phastCons²¹, phyloP²²). While all these methods aid our understanding of the non-coding genome, each suffers from limitations or biases, respectively: 1) overlooking the influence on mutation rate of regional genomic features beyond the scale of flanking nucleotides; 2) a strong dependence on the availability of well-characterized functional mutations as training data; and 3) compromised power to detect regions that have only recently been under selection in the human lineage and may have a functional impact on human-specific traits or diseases.

Here we present a genome-wide map of human constraint (called Gnocchi: Genomic NOn-Coding Constraint of HaploInsufficient variation), generated from a high-quality set of variant calls from 76,156 whole-genome sequences (gnomAD v3.1.2 https://gnomad.broadinstitute.org). We describe an improved model of human mutation rates that jointly analyzes local sequence context and regional genomic features and quantifies the depletion of variation in tiled windows across the entire genome. Incorporating constraint evidence from functional elements linked to genes can enhance the identification of genes under strong constraint and aid in the functional interpretation of non-coding regions. Our study aims to depict a genome-wide view of how natural selection shapes patterns of human genetic variation and identify which functional genomic elements likely harbor variation with potential clinical significance.

Aggregating 76,156 whole genomes

We aggregated, reprocessed, and performed joint variant-calling on 153,030 whole genomes mapped to human genome reference build GRCh38, of which 76,156 samples were retained as high-quality sequences from unrelated individuals, without known severe pediatric disease, and with appropriate consent and data use permissions for the sharing of aggregate variant data (Supplementary Fig. 1–5 and Supplementary Table 1–3). Among these samples, 36,811 (48.3%) are of non-European ancestry, including 20,744 individuals with African ancestries and 7,647 individuals with admixed Amerindigenous ancestries. After stringent quality control, we discovered a set of 644,267,978 high-confidence short nuclear variants (single nucleotide/indel variants; gnomAD v3.1.2), of which 390,393,900 low-frequency (allele frequency [AF]≤0.1%), high-quality single nucleotide variants were used for building the genome-wide constraint map. These correspond to approximately one variant every 4.9 bp (one low-frequency variant every 8 bp) of the genome, providing a high density of variation.

Gnocchi quantifies genomic constraint

To construct a genome-wide mutational constraint map, we divided the genome into continuous non-overlapping 1kb windows, and quantified constraint for each window by comparing the expected and the observed variation in our gnomAD dataset. Here, we implemented a refined mutational model, which incorporates trinucleotide sequence context, base-level methylation, and regional genomic features to predict expected levels of variation under neutrality. In brief, we estimated the relative mutability for each single nucleotide substitution with one base of adjacent nucleotide context (e.g., ACG -> ATG), with adjustment for the effect of methylation on mutation rate at CpG sites, which become saturated for mutation at sample sizes above ~10K genomes²³ (Extended Fig. 1a,b and Supplementary Fig. 6; Methods). Meanwhile, we adjusted the effects of regional genomic features for each trinucleotide mutation rate based on the occurrence of de novo mutations (N=413,304 previously detected in family-based whole-genome sequencing studies^24,25; Extended Fig. 1c), and then applied it to establish the expected number of variants per 1kb across the entire genome (Methods).
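As a rough illustration of how an expected variant count might be assembled for one window, the sketch below counts the trinucleotide context of each interior base and sums a per-context mutability, scaled by a single regional factor. This is a simplified sketch, not the model described in Methods: the mutability values, the default value, the function names, and the use of one multiplicative `regional_factor` are all hypothetical, and methylation levels and strand collapsing are ignored.

```python
from collections import Counter

# Hypothetical per-context probabilities that a site is seen as variant.
# Real values would be estimated from data; these are placeholders only.
HYPOTHETICAL_MUTABILITY = {"ACG": 0.30, "CCG": 0.28, "GCG": 0.27, "TCG": 0.29}
DEFAULT_MUTABILITY = 0.05  # placeholder for every other context


def trinucleotide_counts(seq: str) -> Counter:
    """Count the trinucleotide context centred on each interior base."""
    seq = seq.upper()
    return Counter(seq[i - 1:i + 2] for i in range(1, len(seq) - 1))


def expected_variants(seq: str, regional_factor: float = 1.0) -> float:
    """Sum context mutabilities over a window, scaled by a regional factor.

    regional_factor is a hypothetical stand-in for the adjustment by
    regional genomic features; 1.0 means no adjustment.
    """
    counts = trinucleotide_counts(seq)
    base = sum(n * HYPOTHETICAL_MUTABILITY.get(ctx, DEFAULT_MUTABILITY)
               for ctx, n in counts.items())
    return regional_factor * base


# Toy 5-bp "window": interior contexts are AAC, ACG and CGT.
print(round(expected_variants("AACGT"), 2))       # 0.4
print(round(expected_variants("AACGT", 1.5), 2))  # 0.6
```

In practice the same summation would run over each 1kb window, with the CpG context (here ACG) contributing most of the expected count per site.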

We quantified the deviation from expectation for each 1kb window using a Z score⁷ - hereinafter referred to as “Gnocchi” (Methods; Extended Fig. 1d,e) - which was centered around zero for non-coding regions (median=0.08), and was significantly higher (more constrained) for windows containing any protein-coding sequences (median=1.47, Wilcoxon P<10⁻²⁰⁰; Fig. 1a). Gnocchi was positively correlated with the percentage of coding bases in a window and showed a substantial shift towards higher constraint for exonic sequences when coding exons were directly concatenated into 1kb windows (median=3.17; Extended Fig. 2a–c). About 3.12% and 0.05% of the non-coding windows exhibited constraint as strong as the 50th and 90th percentiles of exonic regions, respectively (Extended Fig. 2d). Comparing Gnocchi against the adjusted proportion of singletons (APS) score, a measure of constraint developed for structural variation (SV)²⁶, we found a significant correlation (linear regression beta=0.01, P=4.3×10⁻⁶⁵, Fig. 1b; Methods), providing an internal validation of our approach.
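The exact definition of the Z score is given in Methods, which are not reproduced in this section. As an assumption, the sketch below uses the form (expected − observed)/√expected, chosen because it reproduces the value reported later in the text for the ASCL2 enhancer window (188.6 expected versus 112 observed variants, Gnocchi=5.58). The function name `gnocchi_z` is hypothetical.

```python
import math


def gnocchi_z(observed: float, expected: float) -> float:
    """Signed deviation of the observed from the expected variant count.

    Positive values mean fewer variants than expected (more constrained).
    Assumed form: (expected - observed) / sqrt(expected).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (expected - observed) / math.sqrt(expected)


# Figures quoted in the text for the ASCL2 enhancer window.
z = gnocchi_z(observed=112, expected=188.6)
depletion = 1 - 112 / 188.6  # fraction of expected variants that are missing

print(round(z, 2))          # 5.58
print(round(depletion, 3))  # 0.406
```

Under this form, a window with exactly as many variants as expected scores zero, and a window with more variants than expected scores negative.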

Gnocchi highlights non-coding function

To further validate the Gnocchi metric and investigate the functional relevance of non-coding regions under selection, we examined the correlation between Gnocchi and several annotations of functional non-coding sequences (Fig. 2a). First, we found that candidate cis-regulatory elements (cCREs, derived from ENCODE²⁷ integrated DNase- and ChIP-seq data) are significantly enriched in the most constrained percentile of the genome (Gnocchi≥4, OR=2.77 compared to the genome-wide average, Fisher’s exact P<10⁻²⁰⁰); cCREs with a promoter-like signature (cCRE-PLS) presented the strongest enrichment (OR=7.28), followed by elements with a proximal/distal enhancer-like signature (pELS OR=4.35, dELS OR=2.14), and as a negative control, elements bound by CTCF but not associated with a regulatory signature showed no enrichment (CTCF-only OR=0.82). These patterns indicate that a large fraction of the constrained non-coding regions may serve a regulatory role, in line with previous findings^15,16,20. Similarly, significant enrichment was found for an independent set of active, in vivo-transcribed enhancers (identified by FANTOM CAGE analyses²⁸; OR=3.58) and super enhancers²⁹ (OR=3.41), which are groups of enhancers in close genomic proximity regulating genes important for cell type specification³⁰. By aggregating the regulatory annotations, we estimated that ~10.4% and ~6.3% of promoters and enhancers, respectively, are under selection as strong as the average constraint for coding exons (Extended Fig. 3a; Methods). A much higher proportion, 22.2%, was found for sequences encoding microRNAs (miRNAs), which are increasingly recognized as key mediators in various developmental and physiological processes³¹. In contrast, only 3.7% of long non-coding RNAs (lncRNAs) exhibited such strong constraint, similar to that of non-coding regions overall (3.1%; Extended Fig. 2d and 3b).

Fig. 2: Correlation between Gnocchi and functional non-coding annotations. a,b, Distributions of candidate regulatory elements (a) and GWAS variants (b) along the spectrum of Gnocchi in non-coding regions. Enrichment was evaluated by comparing the proportion of non-coding 1kb windows, binned by Gnocchi, that overlap with a given functional annotation to the genome-wide average. Error bars indicate 95% confidence intervals of the odds ratios. cCRE, candidate cis-regulatory element: N=34,803 with a promoter-like signature (PLS), N=141,830 with a proximal enhancer-like signature (pELS), N=667,599 with a distal enhancer-like signature (dELS), N=56,766 bound by CTCF without a regulatory signature (CTCF-only); Super enhancers: N=331,601; FANTOM enhancers: N=63,285; GWAS Catalog: N=111,308 variants with an association P ≤5.0×10⁻⁸, N=9,229 with an independent replication; GWAS fine-mapping: N=2,191 variants fine-mapped with a posterior inclusion probability of causality≥0.9. See Methods for details on data collection. c, Enrichment of fine-mapped variants in constrained non-coding regions (Gnocchi≥4). Credible set (CS)-trait pairs with a significant enrichment are shown, ordered by the lower bound of the 95% confidence interval; only lower bounds are shown for presentation purposes. d, The distribution of variants fine-mapped for coronary artery disease (CAD) in constrained regions (Gnocchi≥4) of *PLG*. Each bar shows the Gnocchi score of a 1kb window (gaps indicate windows removed by quality filters); windows containing fine-mapped variants are colored in purple, and the number of variants in each window is annotated on top of the corresponding bar. Ten variants are located within *PLG* introns, four are mapped to the antisense gene of *PLG* (ENSG00000287558), and 14 reside in the downstream intergenic regions.

We next examined the distribution of putatively functional non-coding variants on the constraint spectrum. There was significant enrichment for non-coding variants implicated by genome-wide association studies (GWAS) in the constrained end of the genome: 837/19,471 constrained windows [Gnocchi≥4] overlapped with GWAS Catalog³² annotations (OR=1.57 compared to the genome-wide average of 51,430/1,843,559, Fisher’s exact P=2.5×10⁻³², Fig. 2b; Methods). The enrichment became stronger when restricted to the subset of variants that had been replicated by an independent study (OR=2.08, P=4.1×10⁻¹³). Moreover, further strong signals were found for likely causal GWAS variants fine-mapped for 148 complex diseases and traits in large-scale biobanks³³ (OR=3.24, P=3.0×10⁻¹⁰; Methods). Across the 95% credible set (CS)-trait pairs, strong enrichment was predominantly seen in disease phenotypes, including coronary artery disease (CAD), inguinal hernia, fibroblastic disorders, and glaucoma (ORs 3.31–6.02, Fig. 2c; Methods). In the 95% CS of CAD, for instance, the highest Gnocchi score was found for rs1897107 and rs1897109 (both within the same genomic window chr6:160725000–160726000, Gnocchi=6.32); high constraint (Gnocchi≥4) was also found for 26 variants from the same CS (totaling 28/52), which together spanned a ~153 kb sequence downstream of the gene PLG (Fig. 2d). PLG encodes the plasminogen protein that circulates in blood plasma and is converted to plasmin to dissolve the fibrin of blood clots. While dysregulation of the PLG-plasmin system has been frequently associated with CAD^34–39, no specific variants in PLG have been implicated. Our results prioritized a set of non-coding variants in highly constrained regions of PLG, which adds quantitative evidence to the implication of PLG in CAD and may help direct or prioritize follow-up functional experiments.
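The GWAS Catalog enrichment quoted above can be checked from the window counts alone. The sketch below compares the odds of overlap among constrained windows with the odds among all non-coding windows; treating the genome-wide counts as the reference group is an assumption about how the comparison was set up, and the function name is hypothetical. Fisher's exact P value is not recomputed here.

```python
def odds_ratio(hits_a: int, total_a: int, hits_b: int, total_b: int) -> float:
    """Odds of a hit in group A divided by the odds of a hit in group B."""
    odds_a = hits_a / (total_a - hits_a)
    odds_b = hits_b / (total_b - hits_b)
    return odds_a / odds_b


# Constrained windows (Gnocchi>=4) overlapping GWAS Catalog annotations,
# versus the genome-wide average, using the counts quoted in the text.
or_gwas = odds_ratio(837, 19_471, 51_430, 1_843_559)
print(f"{or_gwas:.2f}")  # 1.57
```

The result agrees with the reported OR=1.57; using the non-constrained remainder of the genome as the reference group instead gives a very similar value.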

Collectively, these results demonstrated a significant positive correlation between constraint and functional non-coding annotations, illustrating the utility of Gnocchi in characterizing non-coding regions. Yet, we suggest that Gnocchi provides additional information to existing annotations. For instance, prioritizing ENCODE cCREs by Gnocchi revealed increasingly stronger GWAS enrichment in the more constrained cCREs (Extended Fig. 4a), and constrained regions outside cCREs also captured significant signals, reflecting the value of Gnocchi independent of regulatory annotations. Moreover, besides prioritizing existing GWAS results, Gnocchi can be used as a prior for statistical fine-mapping. Using UK Biobank (UKBB) traits as examples, incorporating Gnocchi into the functionally informed fine-mapping model⁴⁰ predicted ~13K variant-trait pairs to have an increased posterior inclusion probability of causality (ΔPIP≥0.01), in which 164 likely causal associations were newly identified at PIP≥0.8 (Extended Fig. 4b; Methods). While only functional tests can ultimately validate the underlying causality, our constraint map presents a valuable resource for expanding or refining the catalog of functional non-coding variants in the human genome.

Gnocchi versus other non-coding metrics

To benchmark the performance of Gnocchi in prioritizing non-coding variants, we extended the analyses of GWAS variants to compare it with other population genetics-based constraint metrics (Orion¹⁵, CDTS¹⁶, gwRVIS²⁰, and DR¹⁷). Specifically, we assessed the performance of different metrics in identifying putative functional non-coding variants – as aforementioned, a) GWAS Catalog³² variants (N=9,229 with an independent replication); b) GWAS fine-mapping³³ variants (N=2,191), and additionally, c) a subset of high-confidence causal variants from b (N=140); and d) likely pathogenic Mendelian variants (N=1,026 from ClinVar⁴¹ and the Human Gene Mutation Database (HGMD)⁴²) – against background variants in the population with a similar allele frequency (hereafter referred to as “positive” and “negative” variant set, respectively; Methods). Overall, Gnocchi achieved the highest performance across all comparisons, as measured by the area under the curve (AUC) statistic (Fig. 3a,b and Extended Fig. 5). The performance was also more stable than that of the other metrics when varying the allele frequency threshold for the negative variant set (Extended Fig. 5). This may be due to other metrics being informed by the site frequency spectrum, which made the classification performance sensitive to differences in allele frequency between the positive and negative variants. We also showed that our performance was robust to the artificial break of genomic windows (non-overlapping 1kb) by reconstructing Gnocchi scores in a sliding-window (1kb stepped by 100bp) approach as adopted by other metrics (Extended Fig. 6).
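For readers less familiar with the AUC statistic used in this benchmark, a minimal sketch follows. It computes AUC as the probability that a randomly chosen positive variant scores higher than a randomly chosen negative variant, which is equivalent to the area under the ROC curve. The scores below are invented toy values, not data from this study, and the function name is hypothetical.

```python
def auc(positive_scores, negative_scores) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, matching the usual rank-based definition.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical constraint scores of the windows containing each variant.
positive = [4.2, 2.9, 5.1, 1.0]        # e.g. putatively functional variants
negative = [0.3, 1.0, -0.8, 2.0, 3.5]  # e.g. frequency-matched background

print(auc(positive, negative))  # 0.825
```

An AUC of 0.5 corresponds to a metric that cannot separate the two sets, and 1.0 to perfect separation. The pairwise loop is quadratic and only suitable for small illustrations.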

Fig. 3: Performance of Gnocchi and other predictive metrics in prioritizing non-coding variants. a,b, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants – 2,191 GWAS fine-mapping variants (a) and 1,026 likely pathogenic variants (b) – against background variants in the population. The performance of each metric was measured and ranked by the area under the curve (AUC) statistic. c,d, The relative contribution of different metrics in classifying GWAS variants (c) and likely pathogenic variants (d). The eight metrics were modeled as eight independent predictors for the classification, and the relative contribution of one predictor over another was evaluated by estimating their additional R² contributions across all subset models.

Extending the comparison to include phylogeny-based conservation scores (phyloP²², phastCons²¹, and GERP⁴³) revealed relatively low performance compared to the population genetics-based constraint metrics (Fig. 3a,b). The conservation scores were weakly correlated with constraint (Spearman’s rank correlation coefficient 0.017–0.19, Extended Fig. 7), suggesting that intraspecies (human lineage-specific) constrained regions complement, rather than reflect a subset of, regions that are conserved across species. Each individual metric also contributed to the classification when modeled as independent predictive variables (Fig. 3c,d; Methods), reinforcing the complementary nature of different approaches. Variants that were uniquely captured by Gnocchi, for instance, tended to be in regions with high recombination rates (3.45-fold the rest of the positive variant set) and high DNA methylation (2.74-fold; Methods), both associated with an increased mutation rate that had been adjusted in our refined mutational model. To further illustrate this improvement, we rebuilt our constraint model from solely the local sequence context, i.e., without adjustment on mutation rate by regional genomic features, and confirmed that Gnocchi outperformed such metrics (Extended Fig. 6). Altogether, we demonstrate that Gnocchi is an effective metric for identifying functional variants in the non-coding genome; at the same time, we suggest that a combination of different metrics is likely to provide the most informative results for prioritizing functional variation.

Gnocchi prioritizes copy number variants

Besides single nucleotide variants (SNVs) that have been extensively studied in GWAS, copy number variants (CNVs) causing dosage alterations (deletions/loss or duplications/gain) of DNA represent another important class of variation contributing to variability in risk for human disease^44–49. Yet, unlike SNVs, CNVs can be large, and determining the “minimal critical region”⁵⁰ with a pathogenic effect has been a major challenge. Although CNVs primarily affect non-coding sequences, the most commonly studied mechanism is still the dosage alteration of overlapping protein-coding genes⁵¹. Using our genome-wide constraint map, we explored the possibility that constrained non-coding regions are also sensitive to a dosage effect, which may underlie the pathogenicity of corresponding CNVs.

We surveyed a collection of ~100K CNVs from a genome-wide CNV morbidity map of developmental delay and congenital birth defects^52,53. There was a substantial excess of CNVs that affected constrained non-coding regions (Gnocchi≥4) among individuals with developmental disorders (DD cases) in comparison to healthy controls (42.6% versus 12.5%, OR=5.21, Fisher’s exact P<10⁻²⁰⁰, Fig. 4a; Methods). Moreover, of the 19 loci that had been previously identified as pathogenic⁵², all but one (94.7%) affected constrained non-coding regions; the high incidence was recapitulated in a curated set of ~4K putative pathogenic CNVs (85.5% in ClinVar⁴¹, Fig. 4a). Importantly, the case-control enrichment remained significant, albeit attenuated, after adjusting for the size and gene content of each CNV and when being tested in the subset of CNVs that are exclusively non-coding (Fig. 4b; Methods). Non-coding constraint presented high association with DD CNVs conditioning on gene constraint (log[OR]=1.06, logistic regression P<10⁻¹⁰⁰), lending support to the possibility that dosage alteration of constrained non-coding regions may be an alternative explanation for the mechanism of CNVs underlying DDs.
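The unadjusted case-control comparison can be reproduced from the CNV counts given in the Fig. 4 legend (7,239 of 17,004 case CNVs and 10,403 of 83,526 control CNVs affect constrained regions). The sketch below computes each proportion with its binomial standard error, and the crude odds ratio. The helper name is hypothetical, and none of the adjustments for CNV size or gene content described above are applied.

```python
import math


def proportion_with_se(k: int, n: int):
    """Return the proportion k/n and its binomial standard error."""
    p = k / n
    return p, math.sqrt(p * (1 - p) / n)


# Constrained CNVs (Gnocchi>=4) among DD cases and healthy controls.
case_p, case_se = proportion_with_se(7_239, 17_004)
ctrl_p, ctrl_se = proportion_with_se(10_403, 83_526)

# Crude (unadjusted) odds ratio of a CNV being constrained, cases vs controls.
crude_or = (case_p / (1 - case_p)) / (ctrl_p / (1 - ctrl_p))

print(f"cases:    {case_p:.1%} (SE {case_se:.2%})")
print(f"controls: {ctrl_p:.1%} (SE {ctrl_se:.2%})")
print(f"crude OR: {crude_or:.2f}")  # 5.21
```

The crude odds ratio matches the reported OR=5.21; the adjusted estimates in Fig. 4b come from a regression model and cannot be recovered from these four counts.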

Fig. 4: Contribution of non-coding constraint in evaluating copy number variants (CNVs). a, Proportions of constrained CNVs (Gnocchi≥4) identified in individuals with developmental delay (DD cases) versus healthy controls. Constrained CNVs are more common in DD cases than controls (7,239/17,004=42.6% versus 10,403/83,526=12.5%) and are most frequent for CNVs previously implicated as pathogenic (18/19=94.7% by DD and 3,433/4,014=85.5% by ClinVar). Error bars indicate standard errors of the proportions. b, Contribution of non-coding constraint to predicting CNVs in DD cases versus controls. Non-coding constraint remains a significant predictor for the case/control status of CNVs after adjusting for gene constraint (LOEUF score), gene number, and size of CNVs (N_case=17,004, N_control=83,526; purple), as well as when tested in the subset of non-coding CNVs (N_case=8,702, N_control=66,795; blue). Error bars indicate 95% confidence intervals of the log odds ratios. c, CNVs at the *IHH* locus associated with synpolydactyly and craniosynostosis. The four implicated duplications (grey horizontal bars) span a ~102kb sequence upstream of *IHH*. Each vertical bar shows the Gnocchi score of a 1kb window within the locus, with the highest score overlapping the *IHH* gene (red) and the highest non-coding score overlapping the major *IHH* enhancers (purple); gaps indicate windows removed by quality filters. d, Non-coding CNVs with the highest Gnocchi score identified in DD cases. The highest-scored window is located within the potential “critical region” (purple vertical bars) shared by 12 DD deletions (red horizontal bars; grey indicates two deletions observed in controls). The critical region overall has a significantly higher Gnocchi score than the other regions affected by DD or control deletions, as shown in the kernel density estimate (KDE) plot on the right.

One known example of pathogenic non-coding dosage alteration is the duplication of IHH regulatory domain in synpolydactyly and craniosynostosis^54–56. The four implicated duplications covered a ~102kb sequence upstream of IHH, with a ~10kb overlapping region (“critical region”⁵⁰; Fig. 4c). The region contained no genes but exhibited high levels of constraint (median Gnocchi=2.52, Wilcoxon P=1.3×10⁻³ compared to the rest of the genome). The most constrained window (chr2:219111000–219112000, Gnocchi=4.12) overlapped with the major enhancer of IHH, the duplication of which has been shown to result in dosage-dependent IHH misexpression and consequently syndactyly and malformation of the skull⁵⁶. This result highlights a potential use of the Gnocchi metric to prioritize non-coding regions within large CNVs. As a further illustration, we examined a set of non-coding CNVs that had the highest Gnocchi score among the DD cases. The most constrained genomic window (chr11:133208000–133209000, Gnocchi=8.87) was affected by 12 deletions spanning a ~400kb non-coding sequence (Fig. 4d). While of varying size, the deletions shared a common region of ~20kb (potential “critical region”), which encompassed the most constrained window and overall, showed a significantly higher constraint than the other affected regions (median Gnocchi=1.63 versus 0.84, Wilcoxon P=1.6×10⁻³; Fig. 4d). In addition, the ~400kb sequence also harbored two deletions from healthy controls, which interestingly, overlapped with the two lowest Gnocchi scores within the region and were significantly less constrained than those from DD cases (median Gnocchi=1.07 versus 0.62, Wilcoxon P=4.74×10⁻⁴). These findings suggest that Gnocchi can be a useful indicator of critical regions affected by large CNVs, facilitating the interpretation of non-coding risk factors in CNV disease association studies.

Gnocchi informs gene function

Given the significant role of non-coding regions in gene regulation, it is natural to expect that more constrained regulatory elements would regulate more constrained genes. To test this, we analyzed the constraint for enhancers that had been linked to specific genes⁵⁷ (Methods). More constrained non-coding regions were more frequently linked to regulating a gene (Fig. 5a), and as expected, enhancers linked to constrained genes (predicted by loss-of-function observed/expected upper bound fraction [LOEUF]⁵, or curated disease genes from^58–60; Methods) were significantly more constrained than those linked to presumably less constrained genes (median Gnocchi=2.71 versus 1.99, Wilcoxon P=1.3×10⁻²⁶, Fig. 5b; Methods), thus supporting a correlated constraint between genes and their regulatory elements.

Fig. 5: Correlation of constraint between non-coding regulatory elements and protein-coding genes. a, The proportion of non-coding 1kb windows overlapping with enhancers that were predicted to regulate specific genes, as a function of their Gnocchi scores. More constrained non-coding regions are more frequently linked to a gene (left to right: N=2,022/62,894, 2,743/62,653, 7,475/134,279, 20,383/252,354, 43,414/376,829, 66,343/417,743, 65,343/313,110, 38,785/152,787, 15,417/51,439, 6,663/19,471). Error bars indicate standard errors of the proportions. b, Comparison of the Gnocchi scores of enhancers linked to constrained and unconstrained genes. Enhancers of established sets of constrained genes (four blue boxes: N=189 haploinsufficient genes, N=2,454 essential genes, N=1,771 autosomal dominant disease genes, N=1,920 LOEUF-predicted constrained genes) are more constrained than enhancers of presumably less constrained genes (two grey boxes: N=356 olfactory receptor genes, N=189 LOEUF-predicted unconstrained genes). Enhancers of genes that are underpowered for gene constraint detection (“LOEUF underpowered”, N=1,117) present a higher constraint than enhancers of powered yet unconstrained genes (“LOEUF unconstrained”). The box plots show the distribution of Gnocchi scores of enhancers linked to different gene sets, denoting the median, quartiles and range (excepting outliers). c, Improvement of incorporating enhancer constraint into LOEUF in prioritizing underpowered genes. ROC curves and AUCs show the performance of two logistic regression models using LOEUF (blue) and LOEUF+enhancer Gnocchi score (purple) as independent predictive variables to classify constrained and unconstrained genes, tested on a set of 77 underpowered genes. d, Contribution of enhancer constraint to predicting gene expression in specific tissue types. The x-axis shows the linear regression coefficient of tissue-specific enhancer Gnocchi score predicting the expression level of target genes in matched tissue types (N_HSC&B-cell=11,970, N_Brain=11,555, N_Heart=10,759, N_Pancreas=10,572, N_Blood&T-cell=10,403, N_Muscle=10,380, N_Adipose=9,316, N_Liver=8,838, N_Spleen=8,308, N_Ovary=7,926, N_Lung=7,499), conditioning on gene constraint (LOEUF score). Error bars indicate 95% confidence intervals of the coefficient estimates.

On the other hand, a particularly interesting set of associations is the links between constrained enhancers and the “unconstrained” genes predicted by LOEUF, because these links may reflect functional significance of the “unconstrained” genes that had been previously unrecognized. The lack of predicted gene constraint can be explained by the design of LOEUF as a measure of intolerance to rare LoF variation, where small genes with few expected LoF variants are likely underpowered. Indeed, stratifying genes by the number of expected LoF variants showed a significantly higher enhancer constraint for genes that were underpowered (≤5 expected LoF variants)⁵ compared to genes that were sufficiently powered while scored as unconstrained (median Gnocchi=2.64 versus 2.27, Wilcoxon P=9.8×10⁻⁴, Fig. 5b). This suggests that certain underpowered genes may be functionally important but were not recognized in gene constraint evaluation. For instance, ASCL2, a basic helix-loop-helix (bHLH) transcription factor, had only 0.57 expected LoFs (versus 0 observed) across >125K exomes⁵; although the gene is depleted for LoF variation, the absolute difference was too small to obtain a precise estimate of LoF intolerance. Yet, we found ASCL2 had a highly constrained enhancer (Gnocchi=5.58), located ~16kb upstream of the gene, where >40% of the expected variants were depleted (188.6 expected versus 112 observed, chr11:2286000–2287000). The same genomic window also contained an eQTL chr11:2286192:G>T that was predicted to be significantly associated with ASCL2 expression⁶¹; elevated ASCL2 expression has been implicated in the development and progression of several human cancers^62–64. This example highlights the value of non-coding constraint – as a complementary metric to gene constraint – for identifying functionally important genes.

A practical implementation of this finding is to integrate the constraint of regulatory elements into the modeling of gene constraint, which essentially gains power from extending the functional unit of a gene to encompass its regulatory components. As a proof-of-principle, we tested whether adding the enhancer Gnocchi score to LOEUF improves the prioritization of underpowered genes. The enhancer Gnocchi score was a significant predictor of constrained genes (logistic regression P=7.4×10⁻¹¹ conditioning on LOEUF) and improved the performance of LOEUF in identifying constrained genes that were underpowered (AUC = 0.80 versus 0.73, bootstrap P=0.03, Fig. 5c; Methods). Moreover, such approaches would allow incorporation of tissue/cell-type specific information into gene constraint modeling given the diverse range of epigenomic data. We explored this by testing whether the constraint of tissue-specific enhancers is predictive of tissue-specific gene expression (as a proxy for tissue-specific gene function). The enhancer Gnocchi score, again conditioning on LOEUF, was a significant predictor of the expression level of target genes in matched tissue types (Fig. 5d; Methods). These results further support the application of the Gnocchi metric for improving the characterization of gene function. While we acknowledge that the biological consequences of mutations in enhancers are not clearly understood and thus natural selection may differ in strength depending on mechanistic consequence, an extended model to incorporate non-coding variation information in a biologically-informed way holds promise to facilitate our understanding of the molecular mechanisms underlying selection.

Discussion

We have previously developed constraint metrics that leverage population-scale exome and genome sequencing data to evaluate genic intolerance to coding variation for each protein-coding gene^5,23. Here, we adopted the same principle with an extended mutational model to assess constraint across the entire genome, using our latest release of gnomAD (v3.1.2), a dataset of harmonized high-quality whole-genome sequences from 76,156 individuals of diverse ancestries. Improvements to constraint modeling include unified fitting of the mutation rate for all substitution and trinucleotide contexts and inclusion of regional genomic features to refine the expected variation in non-coding regions (Methods). We validated our metric, called Gnocchi, using a series of external functional annotations, with a focus on the non-coding genome, and demonstrated the value of Gnocchi for prioritizing non-coding elements and identifying functionally important genes. We have made the Gnocchi scores publicly accessible via the gnomAD browser (https://gnomad.broadinstitute.org).

One key challenge in quantifying non-coding constraint is the estimation of the true base mutation rate, which can be affected by various genomic phenomena, potentially operating at different scales. To this end, we extended our previous mutational model, which computed the relative mutability of each substitution in a trinucleotide context, to include adjustments for regional genomic features that may index processes influencing mutagenesis. The adjustment was applied to each specific trinucleotide context and allowed a varying genomic scale for each specific feature (Methods). The added value of this adjustment was demonstrated by the improved performance of Gnocchi in identifying functional variants (Extended Fig. 6). Gnocchi also outperformed other genome-wide predictive scores, while each metric tended to provide complementary information. We note that all comparisons were restricted to non-coding regions for explicitly evaluating the metrics in prioritizing non-coding variants, and we further eliminated potential bias from nearby genes by recapitulating the results within regions >10kb away from any protein-coding exons (Supplementary Fig. 7). Overall, Gnocchi presented consistent, high performance in identifying functional non-coding variants in the human genome.

Despite the clear constraint signal identified for non-coding regions, many limitations exist. First, the lack of prior classification of the molecular consequences of non-coding variants, analogous to “nonsynonymous” versus “synonymous” informed by the genetic code in coding regions, limits the resolution of non-coding constraint assessment (e.g., to measure constraint against “LoF” variation). While there are rich resources defining regulatory elements in the non-coding genome, no method is available for determining the impact of each possible variant on gene regulation and the distribution of their effect sizes genome-wide. Further, the interpretation of non-coding constraint, especially in the context of gene regulation, can only be informative when considered in a particular context, such as a tissue/cell type, developmental stage, or environment. Such information is not inherently built into our constraint metric or the mutational dataset; thus ad hoc integration of external annotations (e.g., tissue-specific enhancers as analyzed in this study) is often necessary for justifying specific biological implications. Also, since the detection of depletion of variation is insensitive to negative selection acting after reproductive age, genomic regions involved in late-onset phenotypes are likely to go underrecognized.

Finally, while this is among the largest datasets of human genomes examined to date for non-coding constraint, our method will substantially increase in power and resolution as sample sizes increase. Benchmarking on the depletion of variation seen in coding regions, we are currently well-powered to detect extreme non-coding constraint as strong as the 90th percentile of coding exons of similar size, and we estimate that a sample size of ~340K genomes would be needed to detect constraint as strong as the 50th percentile (Extended Fig. 8a; Methods). Much larger sample sizes will be required to further increase the resolution; for instance, moving from a 1kb to a 100bp scale would require ~5.3M samples (Extended Fig. 8b). Under the current sample size, 1kb presented optimal performance when compared to varying window sizes tested from 100bp-3kb (Extended Fig. 8c). Meanwhile, we emphasize the importance of increasing genetic ancestral diversity in population-scale datasets like gnomAD. A more diverse population would identify a larger number of rare variants, thereby increasing the power of detecting depletions of variation. We explicitly demonstrated this by reconstructing Gnocchi from the subset of European population and comparing it to that from an equal-sized subset containing all diverse populations – the latter achieved a higher predictive power (Extended Fig. 8d). Future efforts towards a larger, more diverse human reference dataset would empower finer studies of the influence of human demography on constraint metrics, facilitating a fuller understanding of the distribution and effect of human genetic variation.

Overall, our study demonstrates the value of the genome-wide constraint map in characterizing both non-coding regions and protein-coding genes, providing a significant step towards a comprehensive catalog of functional genomic elements for humans.

Methods

Aggregation, variant-calling, and quality control of gnomAD genome data

We aggregated whole genome sequence data from 153,030 individuals spanning projects from case-control consortia and population cohorts, in a similar fashion to previous efforts⁶⁵. Informed consent was obtained for the original studies that generated sequencing data and we keep a blank copy of those consents on file with our local Office of Research Subject Protection (ORSP). The Institutional Review Board (IRB) has approved our study protocol, and we confirm that we have complied with all relevant ethical regulations relating to human research subjects.

We harmonized the sequencing data using the GATK Best Practices pipeline and joint-called all samples using Hail⁶⁶, and developed and utilized an updated pipeline of sample, variant, and genotype quality control to create a high-quality callset of 76,156 individuals, computing frequency information for several strata of this dataset based on attributes such as ancestry and sex for each of 644,267,978 short nuclear variants (see Supplementary Information).

Estimation of trinucleotide context-specific mutation rates

We estimated the probability of a given nucleotide mutating to one of the three other possible bases in a trinucleotide context (XY₁Z → XY₂Z), by computing the proportion of all possible variants observed per context in the human genome. Since CpG transitions begin to saturate (proportion observed approaching 1) at a sample size of ~10K genomes, we downsampled the gnomAD dataset to 1,000 genomes for this calculation. The computed proportion observed values, which represent the relative mutability of each trinucleotide context, were further scaled so that the weighted genome-wide average is the human per-base, per-generation mutation rate (1.2×10⁻⁸) to obtain the absolute mutation rates $μ$. To estimate the proportion of variants expected to be observed in the full gnomAD dataset of 76,156 genomes, we fitted the actual proportion observed in the dataset against $μ$, using an exponential regression that caps at 1 for refining the estimates of (near-)saturated variant types (R²=0.999, Extended Fig. 1a,b; Supplementary Data 1).

A total of 390,393,900 high-quality, rare (AF≤0.1%) variants observed in 76,156 gnomAD genomes, a dataset of 6,079,733,538 possible variants at 2,026,577,846 autosomal sites (30–32X coverage), were used in the calculation of trinucleotide context-specific mutation rates. The estimates are well-correlated with the mutation rates reported in previous independent studies and are highly stable across different AF thresholds in gnomAD (Supplementary Fig. 6).
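The scaling step described in this section can be illustrated with a minimal sketch. The function name `scale_to_absolute_rates`, the context labels, and all numeric inputs are hypothetical; only the target rate of 1.2×10⁻⁸ comes from the text. The sketch assumes that the "weighted genome-wide average" weights each entry by its count of sites, and it does not reproduce the exponential regression used for saturated variant types.

```python
# Hypothetical helper: rescale relative mutabilities (proportion observed)
# so that their count-weighted mean equals the target mutation rate.
TARGET_RATE = 1.2e-8  # per-base, per-generation rate quoted in the text


def scale_to_absolute_rates(prop_observed, n_sites, target=TARGET_RATE):
    """Both arguments are dicts keyed by context/substitution label."""
    total_sites = sum(n_sites.values())
    weighted_mean = sum(prop_observed[k] * n_sites[k] for k in prop_observed) / total_sites
    factor = target / weighted_mean
    return {k: v * factor for k, v in prop_observed.items()}


# Toy input (illustrative numbers only, not estimates from the study).
prop = {"ACG>ATG": 0.12, "AAA>ACA": 0.004, "TCT>TGT": 0.006}
sites = {"ACG>ATG": 1_000, "AAA>ACA": 60_000, "TCT>TGT": 39_000}
mu = scale_to_absolute_rates(prop, sites)
```

Because every entry is multiplied by the same factor, the ranking of contexts by mutability is preserved; only the overall scale changes.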

Adjustment of the effect of DNA methylation on CpG mutation rates

Given the strong effect of DNA methylation on increasing the mutation rate at CpG sites, we stratified all CpG sites by their methylation levels and computed the proportion observed within each context and methylation level. As an improvement to our previous methylation annotation (by averaging different tissues⁶⁵), we analyzed methylation data from germ cells across 14 developmental stages, comprising eight from preimplantation embryos (sperm, oocyte, pronucleus, two-cell-, four-cell-, eight-cell-, morula-, and blastocyst-stage embryos)⁶⁷ and six from primordial germ cells (7Wk, 10Wk, 11Wk, 13Wk, 17Wk, and 19Wk)⁶⁸. For each stage, we computed methylation level at each CpG site as the proportion of whole-genome bisulfite sequencing reads corresponding to the methylated allele. To derive a composite score from the 14 stages, we regressed the observation of a CpG variant in gnomAD (0 or 1) on the methylation computed at the corresponding site (a vector of 14), and we used the coefficients from the regression model as weights to compute a composite methylation score for each CpG site. This metric was further discretized into 16 levels (by a minimum step of 0.05: [0,0.05], (0.05,0.1], (0.1,0.15], (0.15,0.2], (0.2,0.25], (0.25,0.3], (0.3,0.5], (0.5,0.55], (0.55,0.6], (0.6,0.65], (0.65,0.7], (0.7,0.75], (0.75,0.8], (0.8,0.85], (0.85,0.9], (0.9,1.0]) to stratify CpG variants in the mutation rate analysis.
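The discretization into 16 levels can be sketched as a simple lookup over the interval upper bounds listed above. The function name `methylation_level` is hypothetical, and the sketch assumes that the composite score has already been scaled to lie in [0, 1].

```python
from bisect import bisect_left

# Upper bounds of the 16 right-closed intervals listed in the text:
# [0,0.05], (0.05,0.1], ..., (0.25,0.3], (0.3,0.5], (0.5,0.55], ..., (0.9,1.0]
UPPER_BOUNDS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50, 0.55,
                0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 1.00]


def methylation_level(score):
    """Return the 0-based index (0..15) of the interval containing `score`."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("composite methylation score must lie in [0, 1]")
    # bisect_left finds the first upper bound >= score, matching "(a, b]".
    return bisect_left(UPPER_BOUNDS, score)
```

Note that the seventh interval, (0.3,0.5], is wider than the others, which is why the bounds are listed explicitly rather than generated with a fixed step.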

Adjustment of the effects of regional genomic features on mutation rates

To estimate the effects of regional genomic features on mutation rates under neutrality, we utilized de novo mutations (DNMs), as a proxy of spontaneous mutations, and fitted logistic regression models using the genomic features as predictive variables. A set of 413,304 unique DNMs were compiled from two large-scale family-based whole-genome sequencing studies^69,70, and an exclusive set of 4,104,879 genomic sites (~10× the DNMs) randomly drawn from the genome was used as the “nonmutated” background. For each DNM or background site, we computed 13 genomic features (see Collection of genomic features) at four scales by taking the mean value of 1kb, 10kb, 100kb, and 1Mb windows centering at the site. This generated a feature matrix of 13×4=52 columns and 413,304+4,104,879=4,518,183 rows. The matrix was further divided based on the trinucleotide context of each DNM or background site (by row) to assess the effects of genomic features on context-specific mutation rates. In particular, for CpG contexts, features that were correlated with DNA methylation (GC content, CpG island, short interspersed nuclear element, and nucleosome density), which had been used for adjusting CpG mutation rates, were excluded from the analysis.

For each trinucleotide context, we first performed univariable logistic regression to select features that are significantly associated with an increased/decreased probability of observing a DNM. Features with a significant association surpassing the Bonferroni correction for 13×4=52 tests were selected; if a feature was significant at multiple genomic scales, the smallest window size was selected for the highest resolution (Extended Fig. 1c). Next, we fitted multivariable logistic regression using the selected features to predict DNMs from the background. To control for multicollinearity, we transformed the input feature matrix using principal components analysis (PCA⁷¹) to generate decorrelated predictive variables (i.e., the principal components or PCs). The regression coefficients were the primary output of interest, which represent the effects of genomic features on increasing (a positive coefficient) or decreasing (a negative coefficient) the mutation rate, and were used for adjusting the expected number of variants in a given region. The selected features, the PCs, and the coefficients are summarized in Extended Fig. 1c and are available as pickle files for implementation (see Code availability).
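The feature selection rule for a single trinucleotide context can be sketched as follows. The function name `select_features` and the toy P values are hypothetical, and the family-wise significance level of 0.05 is an assumption, since the text states only that a Bonferroni correction for 52 tests was applied.

```python
SCALES_BP = [1_000, 10_000, 100_000, 1_000_000]
N_FEATURES = 13
ALPHA = 0.05  # assumed family-wise level; not stated in the text


def select_features(pvalues, alpha=ALPHA):
    """pvalues maps (feature, scale_bp) -> univariable regression P value.

    Returns {feature: smallest significant scale_bp}.
    """
    threshold = alpha / (N_FEATURES * len(SCALES_BP))  # Bonferroni over 52 tests
    selected = {}
    for (feature, scale), p in pvalues.items():
        # Keep the feature only if significant, preferring the smallest window.
        if p < threshold and scale < selected.get(feature, float("inf")):
            selected[feature] = scale
    return selected


# Toy P values (illustrative only).
toy = {("GC content", 1_000): 0.2, ("GC content", 10_000): 1e-6,
       ("GC content", 100_000): 1e-9, ("CpG island", 1_000): 0.01}
chosen = select_features(toy)
```

In this toy input, GC content is significant at both 10kb and 100kb, so the 10kb scale is retained; CpG island does not pass the corrected threshold and is dropped.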

Prediction of expected number of variants per 1kb

Using the trinucleotide mutation rate estimates and the above adjustments, we computed the expected number of variants in a given 1kb genomic window as follows:

$$Exp(w) = \sum_{i=1}^{64} {r(w)}_{i} \sum_{j=1}^{3} \sum_{m=1}^{k} {n(w)}_{i,j,m} \times p_{i,j,m}$$

where $i$ denotes one of the 64 trinucleotide contexts; $j$ denotes one of the three bases substituting the central nucleotide; $m$ denotes one of the $k$ DNA methylation levels, where $k$ =16 for CpG sites (see Adjustment of the effect of DNA methylation on CpG mutation rates) and $k$ =1 for non-CpG sites (i.e., no stratification). Essentially, the expected value of variants in a genomic window $w$ is calculated by multiplying the number of possible variants ( $n$ ) in $w$ by the probability of a variant ( $p$ ) and summing across all trinucleotide contexts ( $i$ ), substitutions ( $j$ ), and methylation levels ( $m$ ); $p_{i, j, m}$ is the trinucleotide mutation rate estimated in this study (as described in Estimation of trinucleotide context-specific mutation rates).

Additionally, $E x p$ is adjusted by a factor $r$, which represents the effect of regional genomic features of $w$ on mutation rate. For each $i$, specific features have been pre-selected and their effects on mutation rate have been estimated using logistic regression models (see Adjustment of the effects of regional genomic features on mutation rates). Denote the feature values, computed centering $w$ and decorrelated by PCA, and the regression coefficients by $x = \{x_{1}, x_{2}, \ldots, x_{t}\}$ and $β = \{β_{1}, β_{2}, \ldots, β_{t}\}$, respectively, where $t$ is the number of selected features for $i$. The adjustment factor $r$ is defined as the ratio of the logit given $x(w)$ to that of the genome-wide average $\bar{x}$: $r = (β \cdot x(w)) / (β \cdot \bar{x})$; since the adjustment is specific to each trinucleotide context, $r$ is further subscripted by $i$.
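The calculation of the expected count can be sketched in a few lines. The function names, the dictionary layout, and the toy numbers are hypothetical and are not taken from the released code. The sketch follows the formula as written (no intercept term in the logit) and assumes a factor of 1 for contexts without selected features.

```python
def adjustment_factor(beta, x_window, x_genome_mean):
    """r = (beta . x(w)) / (beta . x_bar) for one trinucleotide context."""
    num = sum(b * x for b, x in zip(beta, x_window))
    den = sum(b * x for b, x in zip(beta, x_genome_mean))
    return num / den


def expected_variants(n_possible, p_rate, r_by_context):
    """Sum r_i * n_{i,j,m} * p_{i,j,m} over contexts i, substitutions j, levels m.

    n_possible and p_rate map (i, j, m) to counts and probabilities;
    r_by_context maps i to its regional adjustment factor.
    """
    total = 0.0
    for (i, j, m), n in n_possible.items():
        # Assumption: contexts without an adjustment use a factor of 1.
        total += r_by_context.get(i, 1.0) * n * p_rate[(i, j, m)]
    return total


# Toy window (illustrative values only).
n = {("ACG", "T", 3): 10, ("AAA", "C", 0): 200}
p = {("ACG", "T", 3): 0.5, ("AAA", "C", 0): 0.01}
exp_w = expected_variants(n, p, {"ACG": 1.2})
```

Here the CpG context contributes 1.2 × 10 × 0.5 = 6 expected variants and the non-CpG context contributes 200 × 0.01 = 2, for a total of 8.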

Construction of Gnocchi

We created a signed score - called Gnocchi - to quantify the depletion of variation (constraint) at a 1kb scale by comparing the observed variation to an expectation:

$$χ^{2} = \frac{{(Obs - Exp)}^{2}}{Exp}$$

$$Gnocchi = \begin{cases} \sqrt{χ^{2}} & \text{if } Obs < Exp \\ -\sqrt{χ^{2}} & \text{if } Obs \geq Exp \end{cases}$$

The observed variant count ( $O b s$ ) is the number of unique rare (AF≤0.1%) variants in a 1kb window identified in the gnomAD dataset of 76,156 genomes, and the expected number of variants ( $E x p$ ) is established as described above based on the sequence context and the regional genomic features of the 1kb window.
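As a worked illustration of the score, the short sketch below applies the definition to the ASCL2 enhancer window reported in the main text (188.6 expected versus 112 observed variants). The function name `gnocchi` is hypothetical and is not taken from the released code.

```python
import math


def gnocchi(obs, exp):
    """Signed square root of the chi-squared statistic; positive means depletion."""
    if exp <= 0:
        raise ValueError("expected count must be positive")
    chi2 = (obs - exp) ** 2 / exp
    return math.sqrt(chi2) if obs < exp else -math.sqrt(chi2)


# ASCL2 enhancer window from the main text: 112 observed, 188.6 expected.
score = gnocchi(112, 188.6)
print(round(score, 2))  # 5.58
```

The result agrees with the Gnocchi score of 5.58 quoted for this window. A window with more observed than expected variants receives a negative score.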

Gnocchi scores were created for 2,689,987 non-overlapping 1kb windows across the human genome, comprising 2,561,056 on autosomes and 128,931 on chromosome X. Due to the lack of DNM data on chromosome X, the genomic feature adjustment factor $r$ was assessed using autosomal regions and extrapolated to chromosome X. We performed downstream analyses separately for autosomes and chromosome X and presented the former as primary, with the latter provided in Supplementary Fig. 8. For the analyses, we filtered the dataset to windows where 1) the sites contained at least 1,000 possible variants, 2) at least 80% of the observed variants passed all variant call filters (INFO/FILTER equal to “PASS”), and 3) the mean coverage in the gnomAD genomes was between 25–35X (or 20–25X for chromosome X). This resulted in 1,984,900 autosomal windows (77.5% of initial) for the primary analyses, of which 141,341 overlapped with coding regions and 1,843,559 were exclusively non-coding. The computed Gnocchi scores are available in Supplementary Data 2. We also computed the scores in a sliding window approach (1kb stepped by 100bp) and provided them in Supplementary Data 3.
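The three window-level filters can be expressed as a single predicate. This is a sketch only: the function name and arguments are hypothetical, the coverage bounds are treated as inclusive, and a window with no observed variants is assumed to satisfy the second criterion.

```python
def passes_window_filters(n_possible, n_observed, n_pass, mean_coverage, on_chr_x=False):
    """Apply the three window-level criteria listed in the text."""
    # 1) at least 1,000 possible variants in the window
    if n_possible < 1000:
        return False
    # 2) at least 80% of observed variants pass all variant call filters
    #    (assumption: a window with no observed variants trivially passes)
    if n_observed > 0 and n_pass / n_observed < 0.8:
        return False
    # 3) mean coverage of 25-35X on autosomes, 20-25X on chromosome X
    low, high = (20, 25) if on_chr_x else (25, 35)
    return low <= mean_coverage <= high  # inclusive bounds are an assumption
```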

Collection of genomic features

The 13 regional genomic features used for adjusting trinucleotide mutation rate are 1) GC content⁷², 2) low-complexity region⁷³, 3) short and 4) long interspersed nuclear element⁷², distance from the 5) telomere and the 6) centromere⁷², 7) male and 8) female recombination rate⁶⁹, 9) DNA methylation, 10) CpG island⁷², 11) nucleosome density⁷⁴, 12) maternal and 13) paternal DNM cluster⁷⁵. Data were downloaded from the referenced resources, lifted over to GRCh38 coordinates when needed using CrossMap⁷⁶, and files in .bed or .BigWig format were processed using bedtools⁷⁷ and bigWigAverageOverBed⁷⁸ to obtain feature values within specific genomic windows.

Correlation between Gnocchi and APS

As an internal validation, we compared our Gnocchi score against the SV constraint score APS⁷⁹. For each SV from the original study⁷⁹, we assessed its constraint by assigning the highest Gnocchi score among all overlapping 1kb windows. The correlation between Gnocchi and APS was evaluated across 116,184 high-quality autosomal SVs scored by both metrics, using a linear regression test. In Fig. 1b, the correlation was presented by the mean value of APS across ascending constraint Gnocchi score bins, with 95% confidence intervals computed from 100-fold bootstrapping.
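Assigning each SV the highest score among its overlapping windows can be sketched as below. The function name and coordinate conventions are assumptions: intervals are treated as 0-based and half-open, and windows are assumed to start at multiples of 1,000 on a single chromosome.

```python
def max_overlapping_score(start, end, window_scores, window_size=1000):
    """Return the highest score among windows overlapping [start, end).

    window_scores maps a window start coordinate to its Gnocchi score;
    windows without a score (for example, filtered windows) are skipped.
    """
    first = (start // window_size) * window_size
    scores = [window_scores[w] for w in range(first, end, window_size)
              if w in window_scores]
    return max(scores) if scores else None


# Toy scores for three consecutive windows (illustrative only).
toy_scores = {0: 1.0, 1000: 4.2, 2000: -0.5}
sv_score = max_overlapping_score(500, 1500, toy_scores)
```

The same rule is described later for CNVs in the morbidity map analysis.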

Correlation between Gnocchi and putative functional non-coding annotations

We validated the Gnocchi metric using a number of external functional annotations, including 926,535 ENCODE cCREs⁸⁰ (34,803 promoter-like [PLS], 141,830 proximal enhancer-like [pELS], 667,599 distal enhancer-like [dELS], and 56,766 CTCF-only elements), 63,285 FANTOM5⁸¹ enhancers, 331,601 super enhancers (SEdb⁸²), 111,308 GWAS Catalog⁸³ variants (with an association P ≤ 5.0×10⁻⁸; 9,229 with an independent replication), 2,191 GWAS variants fine-mapped across population biobanks with a posterior inclusion probability of causality≥0.9⁸⁴, and 100,530 CNVs from a CNV morbidity map of developmental delay^85,86.

To assess the correlation between Gnocchi and the collected functional elements, we intersected each annotation with the scored 1kb windows binned by Gnocchi score (<-4, [-4,-3), [-3,-2), [-2,-1), [-1,0), [0,1), [1,2), [2,3), [3,4), ≥4), and counted the frequency of overlapping windows within each bin. The enrichment of a given annotation (except CNVs) at a constraint level was evaluated by comparing the corresponding frequency to the genome-wide average using a Fisher’s exact test. In the analysis of CNVs, we assessed their enrichment in constrained regions by assigning each CNV the highest Gnocchi score among its overlapping windows and comparing the proportions of constrained CNVs (Gnocchi≥4) from cases of developmental delay and healthy controls (Supplementary Data 4). The enrichment was further examined using a logistic regression model to adjust for the size and gene content (gene constraint⁶⁵ and gene number) of each CNV. We note that we performed all of the above analyses restricting to exclusively non-coding windows to evaluate the use of Gnocchi in characterizing the non-coding genome.
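The binning and per-bin counting described above can be sketched as follows. The function names and the toy input are hypothetical, and the Fisher's exact test applied to the resulting counts is not reproduced here.

```python
import math

BIN_LABELS = ["<-4", "[-4,-3)", "[-3,-2)", "[-2,-1)", "[-1,0)",
              "[0,1)", "[1,2)", "[2,3)", "[3,4)", ">=4"]


def gnocchi_bin(score):
    """Return the index (0..9) of the score bin; interior bins are left-closed."""
    clipped = min(max(math.floor(score), -5), 4)
    return clipped + 5


def overlap_frequency(windows):
    """windows: iterable of (gnocchi_score, overlaps_annotation).

    Returns one (n_overlapping, n_total) tuple per bin.
    """
    counts = [[0, 0] for _ in BIN_LABELS]
    for score, overlaps in windows:
        b = gnocchi_bin(score)
        counts[b][0] += int(overlaps)
        counts[b][1] += 1
    return [tuple(c) for c in counts]


# Toy windows (illustrative only).
toy = [(4.3, True), (5.1, False), (-0.2, False), (0.4, True)]
freq = overlap_frequency(toy)
```

Each (n_overlapping, n_total) pair has the same form as the counts reported per bin in the Fig. 5a legend.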

Estimation of constraint for aggregated regulatory annotations

We estimated how constrained the sequences encoding regulatory elements are overall compared to coding exons by aggregating the regulatory annotations at a 1kb scale. These included 7,246 promoter-, 154,003 enhancer-, 117 microRNA (miRNA)-, and 414,084 long non-coding RNA (lncRNA)-1kb elements, created from concatenating ENCODE cCREs-PLS, cCREs-dELS, GENCODE⁸⁷ miRNA, and FANTOM5 lncRNA⁸⁸ annotations, respectively, into 1kb windows. Similarly, 27,875 exonic 1kb elements were created from aggregating all protein-coding exons. Gnocchi scores were computed for the created 1kb elements and the percentiles of each regulatory annotation were compared against the exonic region. Benchmarking on the 50th percentile (median) of exonic regions, we estimated the proportion of the regulatory elements that are under selection as strong as the coding exons.
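The benchmarking against the exonic median can be sketched as a simple proportion. The function name and toy scores are hypothetical, and treating a tie with the median as "as strong as" is an assumption.

```python
from statistics import median


def fraction_at_or_above_exonic_median(element_scores, exonic_scores):
    """Share of regulatory 1kb elements scoring at least the exonic median."""
    cutoff = median(exonic_scores)
    return sum(s >= cutoff for s in element_scores) / len(element_scores)


# Toy scores (illustrative only).
exonic = [1.0, 2.0, 3.0, 4.0, 5.0]
enhancer = [0.0, 3.0, 4.0, 2.0]
share = fraction_at_or_above_exonic_median(enhancer, exonic)
```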

Incorporation of Gnocchi into GWAS fine-mapping

To demonstrate the use of Gnocchi in statistical fine-mapping, we performed approximate functionally informed fine-mapping⁸⁹ incorporating Gnocchi score and our previous fine-mapping results for 119 UK Biobank (UKBB) traits⁸⁴. The Gnocchi scores were normalized and used as functional prior probabilities to update the posterior inclusion probabilities (PIPs; denoted as PIP_Z) based on the previous UKBB fine-mapping (using a uniform prior, PIP_unif) and SuSiE⁹⁰. To exclude signals that potentially correspond to coding variants, we restricted our analysis to 60,121 non-coding variants in 6,592 SuSiE 95% credible set (CS)-trait pairs that do not contain variants within 1 kb of exonic regions. A total of 13,069 variant-trait pairs were predicted to have an increased PIP (ΔPIP≥0.01) of causality. The variants, associated traits, and PIP scores (PIP_unif and PIP_Z) are provided in Supplementary Data 5.

Comparison of Gnocchi and other predictive metrics

We compared the Gnocchi metric with seven other genome-wide predictive scores – Orion⁹¹, CDTS⁹², gwRVIS⁹³, DR⁹⁴, phyloP⁹⁵, phastCons⁹⁶, and GERP⁹⁷. Each score was downloaded from the original study, lifted over to GRCh38 coordinates (for Orion) and multiplied by −1 (for CDTS, gwRVIS, and DR) when needed so that a higher value represents a higher constraint/conservation for all metrics. Pairwise correlation between the scores was assessed by comparing the mean value of each score on 1kb windows, using a Spearman’s rank correlation test.

We evaluated the predictive performance of each metric in distinguishing functional non-coding variants (“positive” variant set) from background variants (“negative” variant set). Four positive variant sets were compiled from public databases: 1) 9,229 variants from GWAS Catalog⁸³ (with an independent replication), 2) 2,191 variants from a recent fine-mapping study⁸⁴ (with a posterior inclusion probability of causality≥0.9), 3) 140 high-confidence variants from 2), and 4) 1,026 variants from ClinVar⁹⁸ (annotated as “pathogenic” or “likely pathogenic”) and HGMD (annotated as ‘disease-causing mutation’ [DM] curated by⁹²). All variants were filtered to non-coding regions; in particular, pathogenic variants were more strictly filtered to intergenic/intron variants given their strong predominance of variants close to protein-coding exons (>90% were splice site/region variants). A further stringent non-coding subset was generated by excluding variants within 10kb of any exon, which resulted in 1) 4,379, 2) 967, 3) 59, and 4) 45 variants. For each positive variant set, a negative variant set was created by randomly drawing variants from the Trans-Omics for Precision Medicine (TOPMed) whole-genome sequencing dataset (Freeze 8)⁹⁹ to ~10× the size of the corresponding positive variant set, of which the most severe molecular consequence is intergenic or intron and the AF approximates the positive variant set; AF>5% and allele count (AC)=1 were applied respectively for matching positive variant set 1)-3) and 4), based on their AF distributions in TOPMed (Fig. 3b). The selected variants were scored by each of the eight metrics, using bedtools⁷⁷ (for .bed files) and bigWigAverageOverBed⁷⁸ (for .BigWig files), and the performance of each metric in classifying positive and negative variants was assessed by the area under curve (AUC) statistic, as presented by the receiver operating characteristic (ROC) curve.
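The AUC statistic used for this comparison can be computed directly from the scores of the positive and negative variant sets. The sketch below uses the rank-based (Mann-Whitney) formulation with a plain double loop for clarity; the function name and toy scores are hypothetical, and this is not the implementation used in the study.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5, which equals the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores (illustrative only).
toy_auc = auc([2.0, 0.5], [1.0, 1.5, 0.5])
```

An AUC of 0.5 corresponds to no discrimination between the two sets, and 1.0 to perfect separation.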

To investigate whether different metrics capture complementary information in the classification, we fitted logistic regression models using all eight metrics as independent variables. The relative contribution of each metric was evaluated by the dominance analysis^100,101, which estimates the dominance of one predictor over another by comparing their additional $R^{2}$ contributions across all subset models. We further explored whether specific features were particularly captured by (and may have contributed to the performance of) our metric. We merged all positive variant sets and focused on a set of variants (N=204) that were uniquely prioritized by our metric, defined as being captured in the 99th percentile of Gnocchi score but not in that of any other scores. Specific features associated with these variants were evaluated by comparing values of the 13 genomic features of these variants to the rest of the positive variant set. The fold change was used to indicate the extent to which a feature is distinguished in variants captured by Gnocchi from others.

Correlation of constraint between non-coding regulatory elements and protein-coding genes

To examine whether constraint of non-coding regulatory elements informs the constraint of their target genes, we compared Gnocchi scores of enhancers linked to constrained genes and unconstrained genes. The former included well-established gene sets of 189 ClinGen¹⁰² haploinsufficient genes, 2,454 MGI¹⁰³ essential genes mapped to human orthologs, 1,771 OMIM¹⁰⁴ autosomal dominant genes, and 1,920 LOEUF⁶⁵ first-decile genes; and the latter included a curated list of 356 olfactory receptor genes and 189 LOEUF last-decile genes with at least 10 expected LoF variants (which are sufficiently powered to be classified into the most constrained decile⁶⁵). The LOEUF underpowered list included 1,117 genes with ≤5 expected LoF variants. Enhancers linked to each gene were obtained from the Roadmap Epigenomics Enhancer-Gene Linking database, which used correlated patterns of activity between histone modifications and gene expression to predict enhancer-gene links^105,106. For each gene, we aggregated and merged enhancers predicted from all 127 reference epigenomes and assigned the most constrained enhancer to each gene for the analysis of enhancer-gene constraint correlation (Supplementary Data 6).

In the analysis of correlation between tissue-specific enhancer constraint and tissue-specific gene expression, we processed the enhancer-gene links with the same principle as described above but within specific tissue types (as defined in the Roadmap Epigenomics metadata¹⁰⁷). For each gene and tissue type, we searched for tissue-specific gene expression in the Genotype-Tissue Expression (GTEx¹⁰⁸) database (RNASeQCv1.1.9) and computed a normalized median expression for each gene (log₂(TPM+1)). Enhancer constraint and gene expression values were calculated for 11 matched tissue types, and the correlation within each tissue type was evaluated by regressing gene expression on enhancer constraint, including gene constraint (LOEUF score) as a covariate.

Incorporation of non-coding constraint of regulatory elements into gene constraint modeling

To demonstrate the practical value of non-coding constraint in improving gene constraint modeling, we compared two models – using 1) LOEUF and 2) LOEUF+enhancer Gnocchi score (as described in Correlation of constraint between non-coding regulatory elements and protein-coding genes) – in predicting constrained genes, with a particular focus on genes that were underpowered in LOEUF. A set of 3,220 unique constrained genes was curated from ClinGen¹⁰², MGI¹⁰³, and OMIM¹⁰⁴ (see Correlation of constraint between non-coding regulatory elements and protein-coding genes), and a set of 356 olfactory receptor genes was used as the unconstrained genes. We trained logistic regression models on 50% of the genes and tested the performance on 77 underpowered genes in the remaining 50%. The predictive performance of the two models was measured by AUC, and the significance of the difference in AUCs was assessed using a bootstrap test¹⁰⁹.

Power of constraint detection

We estimated the power of our metric to detect non-coding constraint as the percentage of the non-coding genome that would obtain a high Gnocchi score (Gnocchi≥4) under a given strength of negative selection, which was quantified by the level of depletion of variation (i.e., 1-observed/expected). For a given depletion of variation, the minimum number of expected variants to achieve a Gnocchi≥4 was determined, and the number of samples required to achieve the expected number of variants was estimated using a linear model of log(number of expected variants) ~ log(number of samples) from downsampling the gnomAD dataset. The power was estimated at two scales – 1kb (used in this study) and 100bp – and benchmarked by the depletion of variation observed in coding exons of similar size.
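The minimum number of expected variants follows from the score definition: with Obs = (1 − d) × Exp for a depletion d, Gnocchi = (Exp − Obs)/√Exp = d × √Exp, so Gnocchi ≥ 4 requires Exp ≥ (4/d)². The sketch below encodes this relation and the inversion of the log-log linear model. The function names are hypothetical, and the intercept and slope are placeholders to be supplied by the user, since the fitted coefficients are not given in the text.

```python
import math


def min_expected_variants(depletion, threshold=4.0):
    """Smallest Exp giving Gnocchi >= threshold when Obs = (1 - depletion) * Exp."""
    if not 0.0 < depletion <= 1.0:
        raise ValueError("depletion must be in (0, 1]")
    return (threshold / depletion) ** 2


def samples_required(expected_needed, intercept, slope):
    """Invert log(expected) = intercept + slope * log(samples).

    intercept and slope are hypothetical inputs from a downsampling fit.
    """
    return math.exp((math.log(expected_needed) - intercept) / slope)
```

For example, a 40% depletion requires about 100 expected variants in a window, which is consistent with the ASCL2 enhancer window (188.6 expected variants) exceeding the Gnocchi≥4 threshold.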

Extended Data

Extended Data Fig. 3: Estimation of constraint for aggregated regulatory annotations. a,b, Gnocchi scores of aggregated promoter (dark purple), enhancer (light purple), microRNA (miRNA; dark blue), and long non-coding RNA (lncRNA; light blue) annotations are compared against those of exonic (a) and non-coding (b) regions at a 1kb scale. The Gnocchi score percentiles of each annotation (y-axis) are benchmarked by the score deciles of exonic or non-coding regions (10–100 percentiles; x-axis); the grey dashed vertical line indicates the median (50th percentile).

Extended Data Fig. 4: Applications of Gnocchi for characterizing non-coding regions in addition to existing functional annotations. a, Use of Gnocchi for prioritizing non-coding regions with or without a regulatory annotation (N=464,504 and 1,379,055, respectively). Constrained non-coding regions are enriched for GWAS variants, independent of the candidate cis-regulatory element (cCRE) annotation from ENCODE. Error bars indicate 95% confidence intervals of the odds ratios. b, Use of Gnocchi in statistical fine-mapping. The increase in posterior inclusion probability (PIP) when incorporating Gnocchi score as a functional prior into previous fine-mapping results (that used a uniform prior; denoted as PIP_Gnocchi and PIP_unif, respectively) is shown for 164 new likely causal associations with a PIP_Gnocchi ≥0.8 as a function of PIP_Gnocchi.

Extended Data Fig. 5: Comparison of Gnocchi and other predictive metrics in prioritizing non-coding variants. a, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants (the “positive” variant set; left to right: 9,229 GWAS Catalog variants, 2,191 GWAS fine-mapping variants, a subset of 140 high-confidence fine-mapped variants, and 1,026 likely pathogenic variants) against a “negative” variant set randomly drawn from the population with a similar allele frequency (AF). AF>5% and allele count (AC)=1 were applied, respectively, to match the three GWAS variant sets and the likely pathogenic variant set, based on their AF distributions in TOPMed (shown in b). b, AUCs of the classification with a varying AF threshold for the negative variant set. As most GWAS variants are common and most likely pathogenic variants are very rare (not seen in the population), AF>5% and AC=1 were applied, respectively, in the primary analyses shown in a.

Extended Data Fig. 6: Comparison of constraint scores built from different mutational models and genomic windows. Gnocchi (presented in this study) outperforms scores rebuilt from mutational models that consider only local sequence context, trinucleotide (trimer-only) or heptanucleotide (heptamer-only), without adjusting the mutation rate for regional genomic features. The performance is also robust to the artificial breaks between genomic windows, as shown by scores computed on 1kb windows sliding by 100bp.

Extended Data Fig. 7: Pairwise correlations between different constraint/conservation metrics. The Spearman’s rank correlation between each pair of the eight metrics was computed based on the mean value of each score on 1kb windows across the genome.

Extended Data Fig. 8: Power of constraint detection. a,b, The sample size required for well-powered non-coding constraint detection. The percentage of non-coding regions powered to detect constraint (Gnocchi≥4) at a 1kb (a) and 100bp (b) scale under varying levels of selection (depletion of variation) is shown as a function of log-scaled sample size. Lighter color indicates milder depletion of variation (weaker selection), which requires a larger sample size to detect constraint; the grey dashed vertical line indicates the current sample size of 76,156 genomes. Dotted curves (left to right) benchmark the 95th, 90th, and 50th percentiles of depletion of variation observed in coding exons of similar size. The number of samples required to obtain an 80% detection power is labeled at the corresponding benchmarks. c, AUCs of Gnocchi scores computed on different window sizes in identifying putative functional non-coding variants. The 1kb window (used in this study) offers high performance while maintaining reasonable resolution. d, AUCs of Gnocchi scores computed from different subsets of gnomAD in identifying putative functional non-coding variants. At an equal sample size, the downsampled dataset with diverse ancestries shows higher performance than the Non-Finnish European (NFE)-only dataset.

Supplementary Material

Supplementary Information

NIHMS2037379-supplement-Supplementary_Information.pdf (2.5 MB, pdf)

Supplementary Datasets

NIHMS2037379-supplement-Supplementary_Datasets.zip (314 MB, zip)

Acknowledgements

We thank the individuals whose data is in gnomAD for their contributions to research. Development of the Genome Aggregation Database was supported by NIDDK U54DK105566 and the NHGRI of the National Institutes of Health under award number U24HG011450. Additional funding for Genome Aggregation Database Consortium members is listed in the Supplementary Information. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Competing Interest

KJK is a consultant for Vor Biopharma, Tome Biosciences, and is on the Scientific Advisory Board of Nurture Genomics. DGM is a paid advisor to GSK, Insitro, Variant Bio and Overtone Therapeutics, and has previously received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer and Sanofi-Genzyme.

Genome Aggregation Database Consortium

Maria Abreu¹⁵, Carlos A. Aguilar Salinas¹⁶, Tariq Ahmad¹⁷, Christine M. Albert^18,19, Jessica Alföldi^1,2, Diego Ardissino²⁰, Irina M. Armean^1,2, Elizabeth G. Atkinson^21,22, Gil Atzmon^23,24, Eric Banks⁶, John Barnard²⁵, Samantha M. Baxter¹, Laurent Beaugerie²⁶, Emelia J. Benjamin^27,28,29, David Benjamin⁶, Louis Bergelson⁶, Michael Boehnke³⁰, Lori L. Bonnycastle³¹, Erwin P. Bottinger³², Donald W. Bowden^33,34,35, Matthew J. Bown^36,37, Harrison Brand^3,38, Steven Brant^39,40,41, Ted Brookings^6,42, Sam Bryant^2,22, Sarah E. Calvo^1,3, Hannia Campos^43,44, John C. Chambers^45,46,47, Juliana C. Chan⁴⁸, Katherine R. Chao^1,2, Sinéad Chapman^1,2,7, Daniel I. Chasman^18,49, Siwei Chen^1,2, Rex Chisholm⁵⁰, Judy Cho³², Rajiv Chowdhury⁵¹, Mina K. Chung⁵², Wendy K. Chung^53,54,55, Kristian Cibulskis⁶, Bruce Cohen^56,57, Ryan L. Collins^1,3,4, Kristen M. Connolly⁵⁸, Adolfo Correa⁵⁹, Miguel Covarrubias⁶, Beryl B. Cummings^1,4, Dana Dabelea⁶⁰, Mark J. Daly^1,2,12, John Danesh⁵¹, Dawood Darbar⁶¹, Phil Darnowsky¹, Joshua Denny⁶², Stacey Donnelly¹⁰, Ravindranath Duggirala⁶³, Josée Dupuis^64,65, Patrick T. Ellinor^1,66, Roberto Elosua^67,68,69, James Emery⁶, Eleina England^1,70, Jeanette Erdmann^71,72,73, Tõnu Esko^1,74, Emily Evangelista¹, Yossi Farjoun⁹, Diane Fatkin^75,76,77, Steven Ferriera¹¹, Jose Florez^49,78,79, Laurent C. Francioli^1,2, Andre Franke^80,81, Jack Fu^1,3,38, Martti Färkkilä^82,83,84, Stacey Gabriel¹¹, Kiran Garimella⁶, Laura D. Gauthier⁶, Jeff Gentry⁶, Gad Getz^49,85,86, David C. Glahn^87,88, Benjamin Glaser⁸⁹, Stephen J. Glatt⁹⁰, David Goldstein^91,92, Clicerio Gonzalez⁹³, Julia K. Goodrich¹, Riley Grant¹, Leif Groop^94,95, Sanna Gudmundsson^1,2,8, Namrata Gupta^1,11, Andrea Haessly⁶, Christopher Haiman⁹⁶, Ira Hall⁹⁷, Craig L. Hanis⁹⁸, Matthew Harms^99,100, Mikko Hiltunen¹⁰¹, Matti M. Holi¹⁰², Christina M. Hultman^103,104, Chaim Jalas¹⁰⁵, Thibault Jeandet⁶, Mikko Kallela¹⁰⁶, Masahiro Kanai^1,2, Diane Kaplan⁶, Jaakko Kaprio⁹⁵, Konrad J. 
Karczewski^1,2,7, Sekar Kathiresan^3,49,107, Eimear E. Kenny¹⁰⁸, Bong-Jo Kim¹⁰⁹, Young Jin Kim¹⁰⁹, Daniel King¹, George Kirov¹¹⁰, Zan Koenig^2,7, Jaspal Kooner^46,111,112, Seppo Koskinen¹¹³, Harlan M. Krumholz¹¹⁴, Subra Kugathasan¹¹⁵, Soo Heon Kwak¹¹⁶, Markku Laakso^117,118, Nicole Lake¹¹⁹, Trevyn Langsford⁶, Kristen M. Laricchia^1,2, Terho Lehtimäki¹²⁰, Monkol Lek¹¹⁹, Emily Lipscomb¹, Christopher Llanwarne⁶, Ruth J.F. Loos^32,121,122, Wenhan Lu¹, Steven A. Lubitz^1,66, Teresa Tusie Luna^123,124, Ronald C.W. Ma^48,125,126, Daniel G. MacArthur^1,13,14, Gregory M. Marcus¹²⁷, Jaume Marrugat^128,129, Alicia R. Martin^1,2,7, Kari M. Mattila¹²⁰, Steven McCarroll^7,130, Mark I. McCarthy^131,132,133, Jacob L. McCauley^134,135, Dermot McGovern¹³⁶, Ruth McPherson¹³⁷, James B. Meigs^1,49,138, Olle Melander¹³⁹, Andres Metspalu¹⁴⁰, Deborah Meyers¹⁴¹, Eric V. Minikel¹, Braxton D. Mitchell¹⁴², Vamsi K. Mootha^1,143, Ruchi Munshi⁶, Aliya Naheed¹⁴⁴, Saman Nazarian^145,146, Benjamin M. Neale^1,2, Peter M. Nilsson¹⁴⁷, Sam Novod⁶, Anne O’Donnell-Luria^1,3,8, Michael C. O’Donovan¹⁴⁸, Yukinori Okada^5,149,150, Dost Ongur^49,56, Lorena Orozco^151,152, Michael J. Owen¹⁴⁸, Colin Palmer¹⁵³, Nicholette D. Palmer³³, Aarno Palotie^2,7,95, Kyong Soo Park^116,154, Carlos Pato¹⁵⁵, Nikelle Petrillo⁶, William Phu^1,8, Timothy Poterba^1,2,7, Ann E. Pulver¹⁵⁶, Dan Rader^145,157, Nazneen Rahman¹⁵⁸, Heidi L. Rehm^1,3, Alex Reiner^159,160, Anne M. Remes¹⁶¹, Dan Rhodes¹, Stephen Rich^162,163, John D. Rioux^164,165, Samuli Ripatti^10,95,166, David Roazen⁶, Dan M. Roden^167,168, Jerome I. Rotter¹⁶⁹, Valentin Ruano-Rubio⁶, Nareh Sahakian⁶, Danish Saleheen^170,171,172, Veikko Salomaa¹⁷³, Andrea Saltzman¹, Nilesh J. Samani^37,174, Kaitlin E. Samocha^1,3, Alba Sanchis-Juan³, Jeremiah Scharf^1,3,7, Molly Schleicher¹, Heribert Schunkert^175,176, Sebastian Schönherr¹⁷⁷, Eleanor G. Seaby^1,178, Cotton Seed^2,7, Svati H. Shah^179,180, Megan Shand⁶, Ted Sharpe⁶, Moore B. Shoemaker¹⁸¹, Tai Shyong^182,183, Edwin K. 
Silverman^184,185, Moriel Singer-Berk¹, Pamela Sklar^186,187,188, Jonathan T. Smith⁶, J. Gustav Smith^189,190, Hilkka Soininen¹⁹¹, Harry Sokol^192,193,194, Matthew Solomonson^1,2, Rachel G. Son¹, Jose Soto⁶, Tim Spector¹⁹⁵, Christine Stevens^1,2,7, Nathan O. Stitziel^196,197,198, Patrick F. Sullivan^103,199, Jaana Suvisaari¹⁷³, E. Shyong Tai^200,201,202, Michael E. Talkowski^1,3,7, Yekaterina Tarasova¹, Kent D. Taylor¹⁶⁹, Yik Ying Teo^200,203,204, Grace Tiao^1,2, Kathleen Tibbetts⁶, Charlotte Tolonen⁶, Ming Tsuang^205,206, Tiinamaija Tuomi^95,207,208, Dan Turner²⁰⁹, Teresa Tusie-Luna^210,211, Erkki Vartiainen²¹², Marquis Vawter²¹³, Christopher Vittal^1,2, Gordon Wade⁶, Lily Wang²¹⁴, Qingbo Wang^1,5, Arcturus Wang^1,2,7, James S. Ware^1,215,216, Hugh Watkins²¹⁷, Nicholas A. Watts^1,2, Rinse K. Weersma²¹⁸, Ben Weisburd⁶, Maija Wessman^95,219, Nicola Whiffin^1,220,221, Michael W. Wilson^1,2, James G. Wilson²²², Ramnik J. Xavier^223,224, Mary T. Yohannes¹

¹⁵University of Miami Miller School of Medicine, Gastroenterology, Miami, USA

¹⁶Unidad de Investigacion de Enfermedades Metabolicas, Instituto Nacional de Ciencias Medicas y Nutricion, Mexico City, Mexico

¹⁷Peninsula College of Medicine and Dentistry, Exeter, UK

¹⁸Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA, USA

¹⁹Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA

²⁰Department of Cardiology University Hospital, Parma, Italy

²¹Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA

²²Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA, USA

²³Department of Biology Faculty of Natural Sciences, University of Haifa, Haifa, Israel

²⁴Departments of Medicine and Genetics, Albert Einstein College of Medicine, Bronx, NY, USA

²⁵Department of Quantitative Health Sciences, Lerner Research Institute Cleveland Clinic, Cleveland, OH, USA

²⁶Sorbonne Université, APHP, Gastroenterology Department Saint Antoine Hospital, Paris, France

²⁷NHLBI and Boston University’s Framingham Heart Study, Framingham, MA, USA

²⁸Department of Medicine, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA

²⁹Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA

³⁰Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, USA

³¹National Human Genome Research Institute, National Institutes of Health Bethesda, MD, USA

³²The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA

³³Department of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC, USA

³⁴Center for Genomics and Personalized Medicine Research, Wake Forest School of Medicine, Winston-Salem, NC, USA

³⁵Center for Diabetes Research, Wake Forest School of Medicine, Winston-Salem, NC, USA

³⁶Department of Cardiovascular Sciences and NIHR Leicester Biomedical Research Centre, University of Leicester, Leicester, UK

³⁷NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK

³⁸Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA

³⁹Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, New Brunswick, NJ, USA

⁴⁰Department of Genetics and the Human Genetics Institute of New Jersey, School of Arts and Sciences, Rutgers, The State University of New Jersey, Piscataway, NJ, USA

⁴¹Meyerhoff Inflammatory Bowel Disease Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA

⁴²Fulcrum Genomics, Boulder, CO, USA

⁴³Harvard School of Public Health, Boston, MA, USA

⁴⁴Central American Population Center, San Pedro, Costa Rica

⁴⁵Department of Epidemiology and Biostatistics, Imperial College London, London, UK

⁴⁶Department of Cardiology, Ealing Hospital, NHS Trust, Southall, UK

⁴⁷Imperial College, Healthcare NHS Trust Imperial College London, London, UK

⁴⁸Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, China

⁴⁹Department of Medicine, Harvard Medical School, Boston, MA, USA

⁵⁰Northwestern University, Evanston, IL, USA

⁵¹University of Cambridge, Cambridge, England

⁵²Departments of Cardiovascular Medicine, Cellular and Molecular Medicine, Molecular Cardiology, Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA

⁵³Department of Pediatrics, Columbia University Irving Medical Center, New York, NY, USA

⁵⁴Herbert Irving Comprehensive Cancer Center, Columbia University Medical Center, New York, NY, USA

⁵⁵Department of Medicine, Columbia University Medical Center, New York, NY, USA

⁵⁶McLean Hospital, Belmont, MA, USA

⁵⁷Department of Psychiatry, Harvard Medical School, Boston, MA, USA

⁵⁸Genomics Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁵⁹Department of Medicine, University of Mississippi Medical Center, Jackson, MS, USA

⁶⁰Department of Epidemiology Colorado School of Public Health Aurora, CO, USA

⁶¹Department of Medicine and Pharmacology, University of Illinois at Chicago, Chicago, IL, USA

⁶²Vanderbilt University Medical Center, Nashville, TN, USA

⁶³Department of Life Sciences, College of Arts and Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA

⁶⁴Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA

⁶⁵Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada

⁶⁶Cardiac Arrhythmia Service and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA

⁶⁷Cardiovascular Epidemiology and Genetics, Hospital del Mar Medical Research Institute (IMIM), Barcelona, Catalonia, Spain

⁶⁸CIBER CV, Spain

⁶⁹Department of Medicine, Faculty of Medicine, University of Vic-Central University of Catalonia, Vic, Catalonia, Spain

⁷⁰Clalit Genomics Center, Israel

⁷¹Institute for Cardiogenetics, University of Lübeck, Lübeck, Germany

⁷²German Research Centre for Cardiovascular Research, Hamburg/Lübeck/Kiel, Lübeck, Germany

⁷³University Heart Center Lübeck, Lübeck, Germany

⁷⁴Estonian Genome Center, Institute of Genomics University of Tartu, Tartu, Estonia

⁷⁵Victor Chang Cardiac Research Institute, Darlinghurst, NSW, Australia

⁷⁶Faculty of Medicine, UNSW Sydney, Kensington, NSW, Australia

⁷⁷Cardiology Department, St Vincent’s Hospital, Darlinghurst, NSW, Australia

⁷⁸Diabetes Unit and Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA

⁷⁹Programs in Metabolism and Medical & Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁸⁰Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany

⁸¹University Hospital Schleswig-Holstein, Kiel, Germany

⁸²Helsinki University and Helsinki University Hospital Clinic of Gastroenterology, Helsinki, Finland

⁸³Helsinki University and Helsinki University Hospital, Helsinki, Finland

⁸⁴Abdominal Center

⁸⁵Bioinformatics Program MGH Cancer Center and Department of Pathology, Boston, MA, USA

⁸⁶Cancer Genome Computational Analysis, Broad Institute of MIT and Harvard, Cambridge, MA, USA

⁸⁷Department of Psychiatry and Behavioral Sciences, Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA

⁸⁸Harvard Medical School Teaching Hospital, Boston, MA, USA

⁸⁹Department of Endocrinology and Metabolism, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Israel

⁹⁰Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, NY, USA

⁹¹Institute for Genomic Medicine, Columbia University Medical Center Hammer Health Sciences, New York, NY, USA

⁹²Department of Genetics & Development Columbia University Medical Center, Hammer Health Sciences, New York, NY, USA

⁹³Centro de Investigacion en Salud Poblacional, Instituto Nacional de Salud Publica, Mexico

⁹⁴Lund University Sweden, Sweden

⁹⁵Institute for Molecular Medicine Finland, (FIMM) HiLIFE University of Helsinki, Helsinki, Finland

⁹⁶Center for Genetic Epidemiology, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, USA

⁹⁷Washington School of Medicine, St Louis, MO, USA

⁹⁸Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX, USA

⁹⁹Department of Neurology Columbia University, New York City, NY, USA

¹⁰⁰Institute of Genomic Medicine, Columbia University, New York City, NY, USA

¹⁰¹Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland

¹⁰²Department of Psychiatry, Helsinki University Central Hospital Lapinlahdentie, Helsinki, Finland

¹⁰³Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

¹⁰⁴Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁰⁵Bonei Olam, Center for Rare Jewish Genetic Diseases, Brooklyn, NY, USA

¹⁰⁶Department of Neurology, Helsinki University, Central Hospital, Helsinki, Finland

¹⁰⁷Cardiovascular Disease Initiative and Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA

¹⁰⁸Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁰⁹Division of Genome Science, Department of Precision Medicine, National Institute of Health, Republic of Korea

¹¹⁰MRC Centre for Neuropsychiatric Genetics & Genomics, Cardiff University School of Medicine, Cardiff, Wales

¹¹¹Imperial College, Healthcare NHS Trust, London, UK

¹¹²National Heart and Lung Institute Cardiovascular Sciences, Hammersmith Campus, Imperial College London, London, UK

¹¹³Department of Health THL-National Institute for Health and Welfare, Helsinki, Finland

¹¹⁴Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, Center for Outcomes Research and Evaluation Yale-New Haven Hospital, New Haven, CT, USA

¹¹⁵Division of Pediatric Gastroenterology, Emory University School of Medicine, Atlanta, GA, USA

¹¹⁶Department of Internal Medicine, Seoul National University Hospital, Seoul, Republic of Korea

¹¹⁷The University of Eastern Finland, Institute of Clinical Medicine, Kuopio, Finland

¹¹⁸Kuopio University Hospital, Kuopio, Finland

¹¹⁹Department of Genetics, Yale School of Medicine, New Haven, CT, USA

¹²⁰Department of Clinical Chemistry Fimlab Laboratories and Finnish Cardiovascular Research Center-Tampere Faculty of Medicine and Health Technology, Tampere University, Finland

¹²¹The Mindich Child Health and Development, Institute Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹²²The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark

¹²³National Autonomous University of Mexico, Mexico City, Mexico

¹²⁴Salvador Zubirán National Institute of Health Sciences and Nutrition, Mexico City, Mexico

¹²⁵Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China

¹²⁶Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Hong Kong, China

¹²⁷Division of Cardiology, University of California San Francisco, San Francisco, CA, USA

¹²⁸Hospital del Mar Medical Research Institute (IMIM), Barcelona, Spain

¹²⁹CIBERCV, Madrid, Spain

¹³⁰Department of Genetics, Harvard Medical School, Boston, MA, USA

¹³¹Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital Old Road Headington, Oxford, OX, LJ, UK

¹³²Wellcome Centre for Human Genetics, University of Oxford, Oxford, OX, BN, UK

¹³³Oxford NIHR Biomedical Research Centre, Oxford University Hospitals, NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX, DU, UK

¹³⁴John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA

¹³⁵The Dr. John T. Macdonald Foundation Department of Human Genetics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA

¹³⁶F. Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute Cedars-Sinai Medical Center, Los Angeles, CA, USA

¹³⁷Atherogenomics Laboratory University of Ottawa, Heart Institute, Ottawa, Canada

¹³⁸Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA

¹³⁹Department of Clinical Sciences University, Hospital Malmo Clinical Research Center, Lund University, Malmö, Sweden

¹⁴⁰Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, Estonia

¹⁴¹University of Arizona Health Science, Tucson, AZ, USA

¹⁴²University of Maryland School of Medicine, Baltimore, MD, USA

¹⁴³Howard Hughes Medical Institute and Department of Molecular Biology, Massachusetts General Hospital, Boston, MA, USA

¹⁴⁴International Centre for Diarrhoeal Disease Research, Bangladesh

¹⁴⁵Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

¹⁴⁶Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA

¹⁴⁷Lund University, Dept. Clinical Sciences, Skåne University Hospital, Malmö, Sweden

¹⁴⁸Centre for Neuropsychiatric Genetics & Genomics, Cardiff University School of Medicine, Cardiff, Wales

¹⁴⁹Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita, Japan

¹⁵⁰Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Suita, Japan

¹⁵¹Instituto Nacional de Medicina Genómica, (INMEGEN) Mexico City, Mexico

¹⁵²Laboratory of Immunogenomics and Metabolic Diseases, INMEGEN, Mexico City, Mexico

¹⁵³Medical Research Institute, Ninewells Hospital and Medical School University of Dundee, Dundee, UK

¹⁵⁴Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Republic of Korea

¹⁵⁵Department of Psychiatry Keck School of Medicine at the University of Southern California, Los Angeles, CA, USA

¹⁵⁶Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, USA

¹⁵⁷Children’s Hospital of Philadelphia, Philadelphia, PA, USA

¹⁵⁸Division of Genetics and Epidemiology, Institute of Cancer Research, London, UK

¹⁵⁹University of Washington, Seattle, WA, USA

¹⁶⁰Fred Hutchinson Cancer Research Center, Seattle, WA, USA

¹⁶¹Medical Research Center, Oulu University Hospital, Oulu Finland and Research Unit of Clinical Neuroscience Neurology University of Oulu, Oulu, Finland

¹⁶²Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA

¹⁶³Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA

¹⁶⁴Research Center Montreal Heart Institute, Montreal, Quebec, Canada

¹⁶⁵Department of Medicine, Faculty of Medicine Université de Montréal, Québec, Canada

¹⁶⁶Department of Public Health Faculty of Medicine, University of Helsinki, Helsinki, Finland

¹⁶⁷Departments of Medicine, Pharmacology, Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA

¹⁶⁸Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

¹⁶⁹The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA

¹⁷⁰Department of Biostatistics and Epidemiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

¹⁷¹Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA

¹⁷²Center for Non-Communicable Diseases, Karachi, Pakistan

¹⁷³National Institute for Health and Welfare, Helsinki, Finland

¹⁷⁴Department of Cardiovascular Sciences, University of Leicester, Leicester, UK

¹⁷⁵Department of Cardiology, Deutsches Herzzentrum München, Technical University of Munich, DZHK Munich Heart Alliance, Germany

¹⁷⁶Technische Universität München, Germany

¹⁷⁷Institute of Genetic Epidemiology, Department of Genetics, Medical University of Innsbruck, 6020 Innsbruck, Austria

¹⁷⁸Faculty of Medicine, University of Southampton, Southampton, SO16 6YD, UK

¹⁷⁹Duke Molecular Physiology Institute, Durham, NC, USA

¹⁸⁰Division of Cardiology, Department of Medicine, Duke University School of Medicine, Durham, NC, USA

¹⁸¹Division of Cardiovascular Medicine, Nashville VA Medical Center, Vanderbilt University School of Medicine, Nashville, TN, USA

¹⁸²Division of Endocrinology, National University Hospital, Singapore

¹⁸³NUS Saw Swee Hock School of Public Health, Singapore

¹⁸⁴Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA

¹⁸⁵Harvard Medical School, Boston, MA, USA

¹⁸⁶Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁸⁷Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁸⁸Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

¹⁸⁹The Wallenberg Laboratory/Department of Molecular and Clinical Medicine, Institute of Medicine, Gothenburg University and the Department of Cardiology, Sahlgrenska University Hospital, Gothenburg, Sweden

¹⁹⁰Department of Cardiology, Wallenberg Center for Molecular Medicine and Lund University Diabetes Center, Clinical Sciences, Lund University and Skåne University Hospital, Lund, Sweden

¹⁹¹Institute of Clinical Medicine Neurology, University of Eastern Finland, Kuopio, Finland

¹⁹²Sorbonne Université, INSERM, Centre de Recherche Saint-Antoine, CRSA, AP-HP, Saint Antoine Hospital, Gastroenterology department, F-75012 Paris, France

¹⁹³INRA, UMR1319 Micalis & AgroParisTech, Jouy en Josas, France

¹⁹⁴Paris Center for Microbiome Medicine, (PaCeMM) FHU, Paris, France

¹⁹⁵Department of Twin Research and Genetic Epidemiology King’s College London, London, UK

¹⁹⁶Department of Medicine, Washington University School of Medicine, Saint Louis, MO, USA

¹⁹⁷Department of Genetics, Washington University School of Medicine, Saint Louis, MO, USA

¹⁹⁸The McDonnell Genome Institute at Washington University, Saint Louis, MO, USA

¹⁹⁹Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, NC, USA

²⁰⁰Saw Swee Hock School of Public Health National University of Singapore, National University Health System, Singapore

²⁰¹Department of Medicine, Yong Loo Lin School of Medicine National University of Singapore, Singapore

²⁰²Duke-NUS Graduate Medical School, Singapore

²⁰³Life Sciences Institute, National University of Singapore, Singapore

²⁰⁴Department of Statistics and Applied Probability, National University of Singapore, Singapore

²⁰⁵Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, CA, USA

²⁰⁶Institute of Genomic Medicine, University of California San Diego, San Diego, CA, USA

²⁰⁷Endocrinology, Abdominal Center, Helsinki University Hospital, Helsinki, Finland

²⁰⁸Institute of Genetics, Folkhalsan Research Center, Helsinki, Finland

²⁰⁹Juliet Keidan Institute of Pediatric Gastroenterology Shaare Zedek Medical Center, The Hebrew University of Jerusalem, Jerusalem, Israel

²¹⁰Instituto de Investigaciones Biomédicas, UNAM, Mexico City, Mexico

²¹¹Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico

²¹²Department of Public Health Faculty of Medicine University of Helsinki, Helsinki, Finland

²¹³Department of Psychiatry and Human Behavior, University of California Irvine, Irvine, CA, USA

²¹⁴Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA; Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA

²¹⁵National Heart & Lung Institute & MRC London Institute of Medical Sciences, Imperial College London, London, UK

²¹⁶Royal Brompton & Harefield Hospitals, Guy’s and St. Thomas’ NHS Foundation Trust, London, UK

²¹⁷Radcliffe Department of Medicine, University of Oxford, Oxford, UK

²¹⁸Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, Netherlands

²¹⁹Folkhälsan Institute of Genetics, Folkhälsan Research Center, Helsinki, Finland

²²⁰Big Data Institute, University of Oxford, UK

²²¹Wellcome Centre for Human Genetics, University of Oxford, UK

²²²Division of Cardiology, Beth Israel Deaconess Medical Center, Boston, MA, USA

²²³Program in Infectious Disease and Microbiome, Broad Institute of MIT and Harvard, Cambridge, MA, USA

²²⁴Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, USA

Footnotes

Code availability

All code to perform quality control of the resource is publicly available at https://github.com/broadinstitute/gnomad_qc, and many of the functions are documented in a Python package (gnomad) at https://broadinstitute.github.io/gnomad_methods/index.html. The code to compute the constraint statistics is available at https://github.com/atgu/gnomad_nc_constraint.

Lists of authors and their affiliations appear at the end of the paper

Data availability

We release the aggregated allele frequency dataset, together with all constraint statistics described in this manuscript, at https://gnomad.broadinstitute.org, both through a browser and as bulk downloads of VCFs and Hail Tables. Additionally, we provide a subset of the dataset that includes individual-level data for the HGDP¹¹⁰ and the 1000 Genomes projects¹¹¹; the generation and use of this dataset is described in a companion manuscript¹¹². There are no restrictions on the aggregate data released. External datasets used in this study are available in the following public resources:

ENCODE cCREs https://screen-v2.wenglab.org/, super enhancers http://www.licpathway.net/sedb/download.php, FANTOM5 enhancers https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_latest/extra/enhancer/, miRNA https://genome.ucsc.edu/cgi-bin/hgTables (All GENCODE V32 track), FANTOM5 lncRNA https://fantom.gsc.riken.jp/cat/v1/#/genes, GWAS Catalog https://genome.ucsc.edu/cgi-bin/hgTables (GWAS Catalog track), GWAS fine-mapping https://www.finucanelab.org/data, CNV morbidity map of developmental delay https://genome.ucsc.edu/cgi-bin/hgTables (Development Delay track), ClinVar https://genome.ucsc.edu/cgi-bin/hgTables (ClinVar Variants track), TOPMed https://bravo.sph.umich.edu/freeze8/hg38/downloads, ClinGen https://genome.ucsc.edu/cgi-bin/hgTables (ClinGen track), MGI https://www.informatics.jax.org/, OMIM https://www.omim.org/, Roadmap Epigenomics Enhancer-Gene Linking https://ernstlab.biolchem.ucla.edu/roadmaplinking/, GTEx https://gtexportal.org/home/datasets.

References

1. Short PJ et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616, doi: 10.1038/nature25983 (2018).
2. Satterstrom FK et al. Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell 180, 568–584 e523, doi: 10.1016/j.cell.2019.12.036 (2020).
3. Singh T et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat Genet 49, 1167–1173, doi: 10.1038/ng.3903 (2017).
4. Ganna A et al. Quantifying the Impact of Rare and Ultra-rare Coding Variation across the Phenotypic Spectrum. Am J Hum Genet 102, 1204–1211, doi: 10.1016/j.ajhg.2018.05.002 (2018).
5. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi: 10.1038/s41586-020-2308-7 (2020).
6. Petrovski S, Wang Q, Heinzen EL, Allen AS & Goldstein DB Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet 9, e1003709, doi: 10.1371/journal.pgen.1003709 (2013).
7. Samocha KE et al. A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944–950, doi: 10.1038/ng.3050 (2014).
8. Hindorff LA et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362–9367, doi: 10.1073/pnas.0903103106 (2009).
9. Lanyi JK Photochromism of halorhodopsin. cis/trans isomerization of the retinal around the 13–14 double bond. J Biol Chem 261, 14025–14030 (1986).
10. Mathelier A, Shi W & Wasserman WW Identification of altered cis-regulatory elements in human disease. Trends Genet 31, 67–76, doi: 10.1016/j.tig.2014.12.003 (2015).
11. Spielmann M & Mundlos S Looking beyond the genes: the role of non-coding variants in human disease. Hum Mol Genet 25, R157–R165, doi: 10.1093/hmg/ddw205 (2016).
12. Zhang F & Lupski JR Non-coding genetic variants in human disease. Hum Mol Genet 24, R102–110, doi: 10.1093/hmg/ddv259 (2015).
13. Seplyarskiy VB & Sunyaev S The origin of human mutation in light of genomic data. Nat Rev Genet 22, 672–686, doi: 10.1038/s41576-021-00376-2 (2021).
14. Seplyarskiy VB et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035, doi: 10.1126/science.aba7408 (2021).
15. Gussow AB et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS One 12, e0181604, doi: 10.1371/journal.pone.0181604 (2017).
16. di Iulio J et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337, doi: 10.1038/s41588-018-0062-7 (2018).
17. Halldorsson BV et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740, doi: 10.1038/s41586-022-04965-x (2022).
18. Kircher M et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310–315, doi: 10.1038/ng.2892 (2014).
19. Yousefian-Jazi A, Jung J, Choi JK & Choi J Functional annotation of noncoding causal variants in autoimmune diseases. Genomics 112, 1208–1213, doi: 10.1016/j.ygeno.2019.07.006 (2020).
20. Vitsios D, Dhindsa RS, Middleton L, Gussow AB & Petrovski S Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat Commun 12, 1504, doi: 10.1038/s41467-021-21790-4 (2021).
21. Siepel A et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050, doi: 10.1101/gr.3715005 (2005).
22. Pollard KS, Hubisz MJ, Rosenbloom KR & Siepel A Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121, doi: 10.1101/gr.097857.109 (2010).
23. Lek M et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291, doi: 10.1038/nature19057 (2016).
24. Halldorsson BV et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, doi: 10.1126/science.aau1043 (2019).
25. An JY et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, doi: 10.1126/science.aat6576 (2018).
26. Collins RL et al. A structural variation reference for medical and population genetics. Nature 581, 444–451, doi: 10.1038/s41586-020-2287-8 (2020).
27. ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710, doi: 10.1038/s41586-020-2493-4 (2020).
28. Andersson R et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461, doi: 10.1038/nature12787 (2014).
29. Jiang Y et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res 47, D235–D243, doi: 10.1093/nar/gky1025 (2019).
30. Pott S & Lieb JD What are super-enhancers? Nat Genet 47, 8–12, doi: 10.1038/ng.3167 (2015).
31. Bartel DP Metazoan MicroRNAs. Cell 173, 20–51, doi: 10.1016/j.cell.2018.03.006 (2018).
32. Welter D et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42, D1001–1006, doi: 10.1093/nar/gkt1229 (2014).
33. Kanai M et al. Insights from complex trait fine-mapping across diverse populations. medRxiv, 2021.09.03.21262975, doi: 10.1101/2021.09.03.21262975 (2021).
34. Jung RG et al. Association between plasminogen activator inhibitor-1 and cardiovascular events: a systematic review and meta-analysis. Thromb J 16, 12, doi: 10.1186/s12959-018-0166-4 (2018).
35. Song C, Burgess S, Eicher JD, O’Donnell CJ & Johnson AD Causal Effect of Plasminogen Activator Inhibitor Type 1 on Coronary Heart Disease. J Am Heart Assoc 6, doi: 10.1161/JAHA.116.004918 (2017).
36. Schaefer AS et al. Genetic evidence for PLASMINOGEN as a shared genetic risk factor of coronary artery disease and periodontitis. Circ Cardiovasc Genet 8, 159–167, doi: 10.1161/CIRCGENETICS.114.000554 (2015).
37. Li YY Plasminogen activator inhibitor-1 4G/5G gene polymorphism and coronary artery disease in the Chinese Han population: a meta-analysis. PLoS One 7, e33511, doi: 10.1371/journal.pone.0033511 (2012).
38. Drinane MC, Sherman JA, Hall AE, Simons M & Mulligan-Kehoe MJ Plasminogen and plasmin activity in patients with coronary artery disease. J Thromb Haemost 4, 1288–1295, doi: 10.1111/j.1538-7836.2006.01979.x (2006).
39. Lowe GD et al. Tissue plasminogen activator antigen and coronary heart disease. Prospective study and meta-analysis. Eur Heart J 25, 252–259, doi: 10.1016/j.ehj.2003.11.004 (2004).
40. Wang QS et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat Commun 12, 3394, doi: 10.1038/s41467-021-23134-8 (2021).
41. Landrum MJ et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062–D1067, doi: 10.1093/nar/gkx1153 (2018).
42. Stenson PD et al. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21, 577–581, doi: 10.1002/humu.10212 (2003).
43. Davydov EV et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 6, e1001025, doi: 10.1371/journal.pcbi.1001025 (2010).
44. Greenway SC et al. De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot. Nat Genet 41, 931–935, doi: 10.1038/ng.415 (2009).
45. Mefford HC et al. Recurrent reciprocal genomic rearrangements of 17q12 are associated with renal disease, diabetes, and epilepsy. Am J Hum Genet 81, 1057–1069, doi: 10.1086/522591 (2007).
46. Sebat J et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449, doi: 10.1126/science.1138659 (2007).
47. Stefansson H et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–236, doi: 10.1038/nature07229 (2008).
48. Walsh T et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543, doi: 10.1126/science.1155174 (2008).
49. Wright CF et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314, doi: 10.1016/S0140-6736(14)61705-0 (2015).
50. Spielmann M, Lupianez DG & Mundlos S Structural variation in the 3D genome. Nat Rev Genet 19, 453–467, doi: 10.1038/s41576-018-0007-0 (2018).
51. Spielmann M & Mundlos S Structural variations, the regulatory landscape of the genome and their alteration in human disease. Bioessays 35, 533–543, doi: 10.1002/bies.201200178 (2013).
52. Coe BP et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat Genet 46, 1063–1071, doi: 10.1038/ng.3092 (2014).
53. Cooper GM et al. A copy number variation morbidity map of developmental delay. Nat Genet 43, 838–846, doi: 10.1038/ng.909 (2011).
54. Klopocki E et al. Copy-number variations involving the IHH locus are associated with syndactyly and craniosynostosis. Am J Hum Genet 88, 70–75, doi: 10.1016/j.ajhg.2010.11.006 (2011).
55. Barroso E et al. Identification of the fourth duplication of upstream IHH regulatory elements, in a family with craniosynostosis Philadelphia type, helps to define the phenotypic characterization of these regulatory elements. Am J Med Genet A 167A, 902–906, doi: 10.1002/ajmg.a.36811 (2015).
56. Will AJ et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat Genet 49, 1539–1545, doi: 10.1038/ng.3939 (2017).
57. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330, doi: 10.1038/nature14248 (2015).
58. Rehm HL et al. ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242, doi: 10.1056/NEJMsr1406261 (2015).
59. Blake JA et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 39, D842–848, doi: 10.1093/nar/gkq1008 (2011).
60. McKusick VA Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80, 588–604, doi: 10.1086/514346 (2007).
61. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585, doi: 10.1038/ng.2653 (2013).
62. Xu H et al. Elevated ASCL2 expression in breast cancer is associated with the poor prognosis of patients. Am J Cancer Res 7, 955–961 (2017).
63. Jubb AM et al. Achaete-scute like 2 (ascl2) is a target of Wnt signalling and is upregulated in intestinal neoplasia. Oncogene 25, 3445–3457, doi: 10.1038/sj.onc.1209382 (2006).
64. Tian Y et al. MicroRNA-200 (miR-200) cluster regulation by achaete scute-like 2 (Ascl2): impact on the epithelial-mesenchymal transition in colon cancer cells. J Biol Chem 289, 36101–36115, doi: 10.1074/jbc.M114.598383 (2014).
65. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi: 10.1038/s41586-020-2308-7 (2020).
66. Hail v. 0.2.62–84fa81b9ea3d. https://github.com/hail-is/hail/commit/84fa81b9ea3d.
67. Zhu P et al. Single-cell DNA methylome sequencing of human preimplantation embryos. Nat Genet 50, 12–19, doi: 10.1038/s41588-017-0007-6 (2018).
68. Tang WW et al. A Unique Gene Regulatory Network Resets the Human Germline Epigenome for Development. Cell 161, 1453–1467, doi: 10.1016/j.cell.2015.04.053 (2015).
69. Halldorsson BV et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, doi: 10.1126/science.aau1043 (2019).
70. An JY et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, doi: 10.1126/science.aat6576 (2018).
71. Ross DA, Lim J, Lin R-S & Yang M-H Incremental learning for robust visual tracking. International journal of computer vision 77, 125–141 (2008).
72. Karolchik D et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32, D493–496, doi: 10.1093/nar/gkh103 (2004).
73. Li H Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851, doi: 10.1093/bioinformatics/btu356 (2014).
74. Davis CA et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 46, D794–D801, doi: 10.1093/nar/gkx1081 (2018).
75. Goldmann JM et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat Genet 50, 487–492, doi: 10.1038/s41588-018-0071-6 (2018).
76. Zhao H et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007, doi: 10.1093/bioinformatics/btt730 (2014).
77. Quinlan AR & Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842, doi: 10.1093/bioinformatics/btq033 (2010).
78. Kent WJ, Zweig AS, Barber G, Hinrichs AS & Karolchik D BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207, doi: 10.1093/bioinformatics/btq351 (2010).
79. Collins RL et al. A structural variation reference for medical and population genetics. Nature 581, 444–451, doi: 10.1038/s41586-020-2287-8 (2020).
80. ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710, doi: 10.1038/s41586-020-2493-4 (2020).
81. Andersson R et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461, doi: 10.1038/nature12787 (2014).
82. Jiang Y et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res 47, D235–D243, doi: 10.1093/nar/gky1025 (2019).
83. Welter D et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42, D1001–1006, doi: 10.1093/nar/gkt1229 (2014).
84. Kanai M et al. Insights from complex trait fine-mapping across diverse populations. medRxiv, 2021.09.03.21262975, doi: 10.1101/2021.09.03.21262975 (2021).
85. Coe BP et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat Genet 46, 1063–1071, doi: 10.1038/ng.3092 (2014).
86. Cooper GM et al. A copy number variation morbidity map of developmental delay. Nat Genet 43, 838–846, doi: 10.1038/ng.909 (2011).
87. Harrow J et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22, 1760–1774, doi: 10.1101/gr.135350.111 (2012).
88. Hon CC et al. An atlas of human long non-coding RNAs with accurate 5’ ends. Nature 543, 199–204, doi: 10.1038/nature21374 (2017).
89. Wang QS et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat Commun 12, 3394, doi: 10.1038/s41467-021-23134-8 (2021).
90. Wang G, Sarkar A, Carbonetto P & Stephens M A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv, 501114 (2020).
91. Gussow AB et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS One 12, e0181604, doi: 10.1371/journal.pone.0181604 (2017).
92. di Iulio J et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337, doi: 10.1038/s41588-018-0062-7 (2018).
93. Vitsios D, Dhindsa RS, Middleton L, Gussow AB & Petrovski S Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat Commun 12, 1504, doi: 10.1038/s41467-021-21790-4 (2021).
94. Halldorsson BV et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740, doi: 10.1038/s41586-022-04965-x (2022).
95. Pollard KS, Hubisz MJ, Rosenbloom KR & Siepel A Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121, doi: 10.1101/gr.097857.109 (2010).
96. Siepel A et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050, doi: 10.1101/gr.3715005 (2005).
97. Davydov EV et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 6, e1001025, doi: 10.1371/journal.pcbi.1001025 (2010).
98. Landrum MJ et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062–D1067, doi: 10.1093/nar/gkx1153 (2018).
99. Taliun D et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299, doi: 10.1038/s41586-021-03205-y (2021).
100. Budescu DV Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psychological bulletin 114, 542 (1993).
101. Azen R & Budescu DV The dominance analysis approach for comparing predictors in multiple regression. Psychological methods 8, 129 (2003).
102. Rehm HL et al. ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242, doi: 10.1056/NEJMsr1406261 (2015).
103. Blake JA et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 39, D842–848, doi: 10.1093/nar/gkq1008 (2011).
104. McKusick VA Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80, 588–604, doi: 10.1086/514346 (2007).
105. Ernst J et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49, doi: 10.1038/nature09906 (2011).
106. Liu Y, Sarkar A, Kheradpour P, Ernst J & Kellis M Evidence of reduced recombination rate in human regulatory domains. Genome Biol 18, 193, doi: 10.1186/s13059-017-1308-x (2017).
107. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330, doi: 10.1038/nature14248 (2015).
108. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585, doi: 10.1038/ng.2653 (2013).
109. Robin X et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC bioinformatics 12, 1–8 (2011).
110. Bergstrom A et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, doi: 10.1126/science.aay5012 (2020).
111. 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74, doi: 10.1038/nature15393 (2015).
112. Koenig Z et al. A harmonized public resource of deeply sequenced diverse human genomes. bioRxiv, 2023.01.23.525248 (2023).

Supplementary Materials

Supplementary Information

NIHMS2037379-supplement-Supplementary_Information.pdf (2.5 MB, pdf)

Supplementary Datasets

NIHMS2037379-supplement-Supplementary_Datasets.zip (314 MB, zip)

Data Availability Statement

[R1] 1.Short PJ et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616, doi: 10.1038/nature25983 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Satterstrom FK et al. Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell 180, 568–584 e523, doi: 10.1016/j.cell.2019.12.036 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Singh T et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat Genet 49, 1167–1173, doi: 10.1038/ng.3903 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Ganna A et al. Quantifying the Impact of Rare and Ultra-rare Coding Variation across the Phenotypic Spectrum. Am J Hum Genet 102, 1204–1211, doi: 10.1016/j.ajhg.2018.05.002 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi: 10.1038/s41586-020-2308-7 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Petrovski S, Wang Q, Heinzen EL, Allen AS & Goldstein DB Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet 9, e1003709, doi: 10.1371/journal.pgen.1003709 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Samocha KE et al. A framework for the interpretation of de novo mutation in human disease. Nat Genet 46, 944–950, doi: 10.1038/ng.3050 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Hindorff LA et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362–9367, doi: 10.1073/pnas.0903103106 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Lanyi JK Photochromism of halorhodopsin. cis/trans isomerization of the retinal around the 13–14 double bond. J Biol Chem 261, 14025–14030 (1986). [PubMed] [Google Scholar]

[R10] 10.Mathelier A, Shi W & Wasserman WW Identification of altered cis-regulatory elements in human disease. Trends Genet 31, 67–76, doi: 10.1016/j.tig.2014.12.003 (2015). [DOI] [PubMed] [Google Scholar]

[R11] 11.Spielmann M & Mundlos S Looking beyond the genes: the role of non-coding variants in human disease. Hum Mol Genet 25, R157–R165, doi: 10.1093/hmg/ddw205 (2016). [DOI] [PubMed] [Google Scholar]

[R12] 12.Zhang F & Lupski JR Non-coding genetic variants in human disease. Hum Mol Genet 24, R102–110, doi: 10.1093/hmg/ddv259 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Seplyarskiy VB & Sunyaev S The origin of human mutation in light of genomic data. Nat Rev Genet 22, 672–686, doi: 10.1038/s41576-021-00376-2 (2021). [DOI] [PubMed] [Google Scholar]

[R14] 14.Seplyarskiy VB et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035, doi: 10.1126/science.aba7408 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Gussow AB et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS One 12, e0181604, doi: 10.1371/journal.pone.0181604 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.di Iulio J et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337, doi: 10.1038/s41588-018-0062-7 (2018). [DOI] [PubMed] [Google Scholar]

[R17] 17.Halldorsson BV et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740, doi: 10.1038/s41586-022-04965-x (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Kircher M et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310–315, doi: 10.1038/ng.2892 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Yousefian-Jazi A, Jung J, Choi JK & Choi J Functional annotation of noncoding causal variants in autoimmune diseases. Genomics 112, 1208–1213, doi: 10.1016/j.ygeno.2019.07.006 (2020). [DOI] [PubMed] [Google Scholar]

[R20] 20.Vitsios D, Dhindsa RS, Middleton L, Gussow AB & Petrovski S Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat Commun 12, 1504, doi: 10.1038/s41467-021-21790-4 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Siepel A et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050, doi: 10.1101/gr.3715005 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Pollard KS, Hubisz MJ, Rosenbloom KR & Siepel A Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121, doi: 10.1101/gr.097857.109 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Lek M et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291, doi: 10.1038/nature19057 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Halldorsson BV et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, doi: 10.1126/science.aau1043 (2019). [DOI] [PubMed] [Google Scholar]

[R25] 25.An JY et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, doi: 10.1126/science.aat6576 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Collins RL et al. A structural variation reference for medical and population genetics. Nature 581, 444–451, doi: 10.1038/s41586-020-2287-8 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Consortium, E. P. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710, doi: 10.1038/s41586-020-2493-4 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Andersson R et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461, doi: 10.1038/nature12787 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29. Jiang Y et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res 47, D235–D243, doi: 10.1093/nar/gky1025 (2019).

[R30] 30. Pott S & Lieb JD What are super-enhancers? Nat Genet 47, 8–12, doi: 10.1038/ng.3167 (2015).

[R31] 31. Bartel DP Metazoan MicroRNAs. Cell 173, 20–51, doi: 10.1016/j.cell.2018.03.006 (2018).

[R32] 32. Welter D et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42, D1001–1006, doi: 10.1093/nar/gkt1229 (2014).

[R33] 33. Kanai M et al. Insights from complex trait fine-mapping across diverse populations. medRxiv, 2021.09.03.21262975, doi: 10.1101/2021.09.03.21262975 (2021).

[R34] 34. Jung RG et al. Association between plasminogen activator inhibitor-1 and cardiovascular events: a systematic review and meta-analysis. Thromb J 16, 12, doi: 10.1186/s12959-018-0166-4 (2018).

[R35] 35. Song C, Burgess S, Eicher JD, O’Donnell CJ & Johnson AD Causal Effect of Plasminogen Activator Inhibitor Type 1 on Coronary Heart Disease. J Am Heart Assoc 6, doi: 10.1161/JAHA.116.004918 (2017).

[R36] 36. Schaefer AS et al. Genetic evidence for PLASMINOGEN as a shared genetic risk factor of coronary artery disease and periodontitis. Circ Cardiovasc Genet 8, 159–167, doi: 10.1161/CIRCGENETICS.114.000554 (2015).

[R37] 37. Li YY Plasminogen activator inhibitor-1 4G/5G gene polymorphism and coronary artery disease in the Chinese Han population: a meta-analysis. PLoS One 7, e33511, doi: 10.1371/journal.pone.0033511 (2012).

[R38] 38. Drinane MC, Sherman JA, Hall AE, Simons M & Mulligan-Kehoe MJ Plasminogen and plasmin activity in patients with coronary artery disease. J Thromb Haemost 4, 1288–1295, doi: 10.1111/j.1538-7836.2006.01979.x (2006).

[R39] 39. Lowe GD et al. Tissue plasminogen activator antigen and coronary heart disease. Prospective study and meta-analysis. Eur Heart J 25, 252–259, doi: 10.1016/j.ehj.2003.11.004 (2004).

[R40] 40. Wang QS et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat Commun 12, 3394, doi: 10.1038/s41467-021-23134-8 (2021).

[R41] 41. Landrum MJ et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062–D1067, doi: 10.1093/nar/gkx1153 (2018).

[R42] 42. Stenson PD et al. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21, 577–581, doi: 10.1002/humu.10212 (2003).

[R43] 43. Davydov EV et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 6, e1001025, doi: 10.1371/journal.pcbi.1001025 (2010).

[R44] 44. Greenway SC et al. De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot. Nat Genet 41, 931–935, doi: 10.1038/ng.415 (2009).

[R45] 45. Mefford HC et al. Recurrent reciprocal genomic rearrangements of 17q12 are associated with renal disease, diabetes, and epilepsy. Am J Hum Genet 81, 1057–1069, doi: 10.1086/522591 (2007).

[R46] 46. Sebat J et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449, doi: 10.1126/science.1138659 (2007).

[R47] 47. Stefansson H et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–236, doi: 10.1038/nature07229 (2008).

[R48] 48. Walsh T et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543, doi: 10.1126/science.1155174 (2008).

[R49] 49. Wright CF et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314, doi: 10.1016/S0140-6736(14)61705-0 (2015).

[R50] 50. Spielmann M, Lupianez DG & Mundlos S Structural variation in the 3D genome. Nat Rev Genet 19, 453–467, doi: 10.1038/s41576-018-0007-0 (2018).

[R51] 51. Spielmann M & Mundlos S Structural variations, the regulatory landscape of the genome and their alteration in human disease. Bioessays 35, 533–543, doi: 10.1002/bies.201200178 (2013).

[R52] 52. Coe BP et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat Genet 46, 1063–1071, doi: 10.1038/ng.3092 (2014).

[R53] 53. Cooper GM et al. A copy number variation morbidity map of developmental delay. Nat Genet 43, 838–846, doi: 10.1038/ng.909 (2011).

[R54] 54. Klopocki E et al. Copy-number variations involving the IHH locus are associated with syndactyly and craniosynostosis. Am J Hum Genet 88, 70–75, doi: 10.1016/j.ajhg.2010.11.006 (2011).

[R55] 55. Barroso E et al. Identification of the fourth duplication of upstream IHH regulatory elements, in a family with craniosynostosis Philadelphia type, helps to define the phenotypic characterization of these regulatory elements. Am J Med Genet A 167A, 902–906, doi: 10.1002/ajmg.a.36811 (2015).

[R56] 56. Will AJ et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat Genet 49, 1539–1545, doi: 10.1038/ng.3939 (2017).

[R57] 57. Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330, doi: 10.1038/nature14248 (2015).

[R58] 58. Rehm HL et al. ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242, doi: 10.1056/NEJMsr1406261 (2015).

[R59] 59. Blake JA et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 39, D842–848, doi: 10.1093/nar/gkq1008 (2011).

[R60] 60. McKusick VA Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80, 588–604, doi: 10.1086/514346 (2007).

[R61] 61. Consortium, G. T. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585, doi: 10.1038/ng.2653 (2013).

[R62] 62. Xu H et al. Elevated ASCL2 expression in breast cancer is associated with the poor prognosis of patients. Am J Cancer Res 7, 955–961 (2017).

[R63] 63. Jubb AM et al. Achaete-scute like 2 (ascl2) is a target of Wnt signalling and is upregulated in intestinal neoplasia. Oncogene 25, 3445–3457, doi: 10.1038/sj.onc.1209382 (2006).

[R64] 64. Tian Y et al. MicroRNA-200 (miR-200) cluster regulation by achaete scute-like 2 (Ascl2): impact on the epithelial-mesenchymal transition in colon cancer cells. J Biol Chem 289, 36101–36115, doi: 10.1074/jbc.M114.598383 (2014).

[R65] 65. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443, doi: 10.1038/s41586-020-2308-7 (2020).

[R66] 66. Hail v. 0.2.62–84fa81b9ea3d. https://github.com/hail-is/hail/commit/84fa81b9ea3d.

[R67] 67. Zhu P et al. Single-cell DNA methylome sequencing of human preimplantation embryos. Nat Genet 50, 12–19, doi: 10.1038/s41588-017-0007-6 (2018).

[R68] 68. Tang WW et al. A Unique Gene Regulatory Network Resets the Human Germline Epigenome for Development. Cell 161, 1453–1467, doi: 10.1016/j.cell.2015.04.053 (2015).

[R69] 69. Halldorsson BV et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, doi: 10.1126/science.aau1043 (2019).

[R70] 70. An JY et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, doi: 10.1126/science.aat6576 (2018).

[R71] 71. Ross DA, Lim J, Lin R-S & Yang M-H Incremental learning for robust visual tracking. International journal of computer vision 77, 125–141 (2008).

[R72] 72. Karolchik D et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32, D493–496, doi: 10.1093/nar/gkh103 (2004).

[R73] 73. Li H Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851, doi: 10.1093/bioinformatics/btu356 (2014).

[R74] 74. Davis CA et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 46, D794–D801, doi: 10.1093/nar/gkx1081 (2018).

[R75] 75. Goldmann JM et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat Genet 50, 487–492, doi: 10.1038/s41588-018-0071-6 (2018).

[R76] 76. Zhao H et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007, doi: 10.1093/bioinformatics/btt730 (2014).

[R77] 77. Quinlan AR & Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842, doi: 10.1093/bioinformatics/btq033 (2010).

[R78] 78. Kent WJ, Zweig AS, Barber G, Hinrichs AS & Karolchik D BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207, doi: 10.1093/bioinformatics/btq351 (2010).

[R79] 79. Collins RL et al. A structural variation reference for medical and population genetics. Nature 581, 444–451, doi: 10.1038/s41586-020-2287-8 (2020).

[R80] 80. Consortium EP et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710, doi: 10.1038/s41586-020-2493-4 (2020).

[R81] 81. Andersson R et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461, doi: 10.1038/nature12787 (2014).

[R82] 82. Jiang Y et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res 47, D235–D243, doi: 10.1093/nar/gky1025 (2019).

[R83] 83. Welter D et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42, D1001–1006, doi: 10.1093/nar/gkt1229 (2014).

[R84] 84. Kanai M et al. Insights from complex trait fine-mapping across diverse populations. medRxiv, 2021.09.03.21262975, doi: 10.1101/2021.09.03.21262975 (2021).

[R85] 85. Coe BP et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat Genet 46, 1063–1071, doi: 10.1038/ng.3092 (2014).

[R86] 86. Cooper GM et al. A copy number variation morbidity map of developmental delay. Nat Genet 43, 838–846, doi: 10.1038/ng.909 (2011).

[R87] 87. Harrow J et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22, 1760–1774, doi: 10.1101/gr.135350.111 (2012).

[R88] 88. Hon CC et al. An atlas of human long non-coding RNAs with accurate 5’ ends. Nature 543, 199–204, doi: 10.1038/nature21374 (2017).

[R89] 89. Wang QS et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat Commun 12, 3394, doi: 10.1038/s41467-021-23134-8 (2021).

[R90] 90. Wang G, Sarkar A, Carbonetto P & Stephens M A simple new approach to variable selection in regression, with application to genetic fine-mapping. BioRxiv, 501114 (2020).

[R91] 91. Gussow AB et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS One 12, e0181604, doi: 10.1371/journal.pone.0181604 (2017).

[R92] 92. di Iulio J et al. The human noncoding genome defined by genetic diversity. Nat Genet 50, 333–337, doi: 10.1038/s41588-018-0062-7 (2018).

[R93] 93. Vitsios D, Dhindsa RS, Middleton L, Gussow AB & Petrovski S Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat Commun 12, 1504, doi: 10.1038/s41467-021-21790-4 (2021).

[R94] 94. Halldorsson BV et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740, doi: 10.1038/s41586-022-04965-x (2022).

[R95] 95. Pollard KS, Hubisz MJ, Rosenbloom KR & Siepel A Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121, doi: 10.1101/gr.097857.109 (2010).

[R96] 96. Siepel A et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050, doi: 10.1101/gr.3715005 (2005).

[R97] 97. Davydov EV et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 6, e1001025, doi: 10.1371/journal.pcbi.1001025 (2010).

[R98] 98. Landrum MJ et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062–D1067, doi: 10.1093/nar/gkx1153 (2018).

[R99] 99. Taliun D et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299, doi: 10.1038/s41586-021-03205-y (2021).

[R100] 100. Budescu DV Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psychological bulletin 114, 542 (1993).

[R101] 101. Azen R & Budescu DV The dominance analysis approach for comparing predictors in multiple regression. Psychological methods 8, 129 (2003).

[R102] 102. Rehm HL et al. ClinGen--the Clinical Genome Resource. N Engl J Med 372, 2235–2242, doi: 10.1056/NEJMsr1406261 (2015).

[R103] 103. Blake JA et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res 39, D842–848, doi: 10.1093/nar/gkq1008 (2011).

[R104] 104. McKusick VA Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 80, 588–604, doi: 10.1086/514346 (2007).

[R105] 105. Ernst J et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49, doi: 10.1038/nature09906 (2011).

[R106] 106. Liu Y, Sarkar A, Kheradpour P, Ernst J & Kellis M Evidence of reduced recombination rate in human regulatory domains. Genome Biol 18, 193, doi: 10.1186/s13059-017-1308-x (2017).

[R107] 107. Roadmap Epigenomics, C. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330, doi: 10.1038/nature14248 (2015).

[R108] 108. Consortium, G. T. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585, doi: 10.1038/ng.2653 (2013).

[R109] 109. Robin X et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC bioinformatics 12, 1–8 (2011).

[R110] 110. Bergstrom A et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, doi: 10.1126/science.aay5012 (2020).

[R111] 111. Genomes Project, C. et al. A global reference for human genetic variation. Nature 526, 68–74, doi: 10.1038/nature15393 (2015).

[R112] 112. Koenig Z et al. A harmonized public resource of deeply sequenced diverse human genomes. bioRxiv, 2023.01.23.525248 (2023).
