Abstract
Measures of selective constraint on genes have been used for many applications including clinical interpretation of rare coding variants, disease gene discovery, and studies of genome evolution. However, widely used metrics are severely underpowered at detecting constraint for the shortest ~25% of genes, potentially causing important pathogenic mutations to be overlooked. We developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, $s_{\mathrm{het}}$. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease, and other phenotypes, especially for short genes. Our new estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve estimation of many gene-level properties, such as rare variant burden or gene expression differences.
1. Introduction
Identifying the genes important for disease and fitness is a central goal in human genetics. One particularly useful measure of importance is how much natural selection constrains a gene [1–4]. Constraint has been used to prioritize de novo and rare variants for clinical followup [5, 6], predict the toxicity of drugs [7], link GWAS hits to genes [8], and characterize transcriptional regulation [9, 10], among many other applications.
To estimate the amount of constraint on a gene, several metrics have been developed using loss-of-function variants (LOFs), such as protein-truncating or splice-disrupting variants. These metrics rest on the intuition that if a gene is important, natural selection will act to remove LOFs from the population, and they take advantage of large exome sequencing studies.
In one line of research, the number of observed unique LOFs is compared to the expected number under a model of no selective constraint. This approach has led to the widely-used metrics pLI [11] and LOEUF [12].
While pLI and LOEUF have proved useful for identifying genes intolerant to LOF mutations, they have important limitations [3]. First, they are uninterpretable in that they are only loosely related to the fitness consequences of LOFs. Their relationship with natural selection depends on the study’s sample size and other technical factors [3]. Second, they are not based on an explicit population genetics model so it is impossible to compare a given value of pLI or LOEUF to the strength of selection estimated for variants other than LOFs [3, 4].
Another line of research has solved these issues of interpretability by estimating the fitness reduction for heterozygous carriers of an LOF in any given gene [1, 2, 4]. Throughout, we will adopt the notation of Cassa and colleagues and refer to this reduction in fitness as $s_{\mathrm{het}}$ [1, 2], although the same population genetic quantity has been referred to as hs [4, 13]. In [1], a deterministic approximation was used to estimate $s_{\mathrm{het}}$, which was relaxed to incorporate the effects of genetic drift in [2]. This model was subsequently extended by Agarwal and colleagues to include the X chromosome and applied to a larger dataset, with a focus on the interpretability of $s_{\mathrm{het}}$ [4].
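For intuition about what a deterministic approximation looks like, the classical mutation-selection balance argument can be written as follows. This is a textbook relation offered as a sketch, not necessarily the exact estimator used in [1]; the symbols $q$ (aggregate LOF allele frequency in a gene) and $U$ (per-gene LOF mutation rate per generation) are introduced here for illustration only, and the approximation assumes $q \ll 1$ and selection strong enough that drift can be ignored.

```latex
\[
  \Delta q \;\approx\; \underbrace{U}_{\text{new LOFs}} \;-\; \underbrace{s_{\mathrm{het}}\, q}_{\text{LOFs removed by selection}} \;=\; 0
  \quad\Longrightarrow\quad
  q \;\approx\; \frac{U}{s_{\mathrm{het}}},
  \qquad
  \hat{s}_{\mathrm{het}} \;\approx\; \frac{U}{q}.
\]
```

Under this relation, a gene with few expected LOFs (small $U$) is expected to have very low $q$ across a wide range of $s_{\mathrm{het}}$ values, which is one way to see why short genes are hard to call from LOF data alone.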
A major issue for most previous methods is that thousands of genes have few expected unique LOFs under neutrality, as they have short protein-coding sequences. For example, there are >5,000 genes that cannot be called as constrained by LOEUF, as they have too few expected unique LOFs to fall under the recommended LOEUF cutoff of 0.35 [14]. This problem is not limited to LOEUF, however, and all of these methods are severely underpowered to detect selection for this ~25% of genes.
Here, we present an approach that can accurately estimate $s_{\mathrm{het}}$ even for genes with few expected LOFs, while maintaining the interpretability of previous population-genetics based estimates [1, 2, 4].
Our approach has two main technical innovations. First, we use a novel population genetics model of LOF allele frequencies. Previous methods have either only modeled the number of unique LOFs, throwing away frequency information [11,12,15], or considered the sum of LOF frequencies across the gene [1,2,4], an approach that is not robust to misannotated LOFs. In contrast, we model the frequencies of individual LOF variants, allowing us to not only use the information in such frequencies but also to model the possibility that any given LOF variant has been misannotated, making our estimates more robust. Our approach uses new computational machinery, described in a companion paper [16], to accurately obtain the likelihood of observing an LOF at a given frequency without resorting to simulation [2, 4] or deterministic approximations [1].
Second, our approach uses thousands of gene features, including gene expression patterns, protein structure information, and evolutionary constraint, to improve estimates for genes with few expected LOFs. By using these features, we can share information across similar genes. Intuitively, this allows us to improve estimates for genes with few expected LOFs by leveraging information from genes with similar features that do have sufficient LOF data.
Adopting a similar approach, a recent preprint [15] used gene features in a deep learning model to improve estimation of constraint for genes with few expected LOFs, but did not use an explicit population genetics model, resulting in the same issues with interpretability faced by pLI and LOEUF.
We applied our method to a large exome sequencing cohort [12]. Our estimates of $s_{\mathrm{het}}$ are substantially more predictive than previous metrics at prioritizing essential and disease-associated genes. We also interrogated the relationship between gene features and natural selection, finding that evolutionary conservation, protein structure, and expression patterns are more predictive of $s_{\mathrm{het}}$ than co-expression and protein-protein interaction networks. Expression patterns in the brain and expression patterns during development are particularly predictive of $s_{\mathrm{het}}$. Finally, we use $s_{\mathrm{het}}$ to highlight differences in selection on different categories of genes and consider $s_{\mathrm{het}}$ in the context of selection on variants beyond LOFs.
Our approach, GeneBayes, is extremely flexible and can be applied to improve estimation of numerous gene properties beyond $s_{\mathrm{het}}$. Our implementation is available at https://github.com/tkzeng/GeneBayes.
2. Results
2.1. Model Overview
Using LOF data to infer gene constraint is challenging for genes with few expected LOFs, with metrics like LOEUF considering almost all such genes to be unconstrained (Figures 1A,B). We hypothesized that it would be possible to improve estimation using auxiliary information that may be predictive of LOF constraint, including gene expression patterns across tissues, protein structure, and evolutionary conservation. Intuitively, genes with similar features should have similar levels of constraint. By pooling information across groups of similar genes, constraint estimated for genes with sufficient LOF data may help improve estimation for underpowered genes.
However, while the frequencies of LOFs can be related to $s_{\mathrm{het}}$ through models from population genetics [1, 2, 4], we lack an understanding of how other gene features relate to constraint a priori.
To address this problem, we developed a flexible empirical Bayes framework, GeneBayes, that learns the relationship between gene features and $s_{\mathrm{het}}$ (Figure 1C). Our model consists of two main components. First, we model the prior on $s_{\mathrm{het}}$ for each gene as a function of its gene features (Figure 1C, left). Specifically, we train gradient-boosted trees using NGBoost [17] to predict the parameters of each gene’s prior distribution from its features. Our gene features include gene expression levels, Gene Ontology terms, conservation across species, neural network embeddings of protein sequences, gene regulatory features, co-expression and protein-protein interaction features, sub-cellular localization, and intolerance to missense mutations (see Methods and Supplementary Note C for a full list).
Second, we use a model from population genetics to relate $s_{\mathrm{het}}$ to the observed LOF data (Figure 1C, right). This model allows us to fit the gradient-boosted trees for the prior by maximizing the likelihood of the LOF data. Specifically, we use the discrete-time Wright-Fisher model with genic selection, a standard model in population genetics that accounts for mutation and genetic drift [13, 18]. In our model, $s_{\mathrm{het}}$ is the reduction in fitness per copy of an LOF, and we infer $s_{\mathrm{het}}$ while keeping the mutation rates and demography fixed to values taken from the literature (Supplementary Note B). Likelihoods are computed using new methods described in a companion paper [16].
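To make the population genetics model concrete, the following is a minimal simulation sketch of a discrete-time Wright-Fisher model with genic selection, one-way mutation, and binomial drift. It is illustrative only: it assumes a constant population size, whereas the analysis in this paper uses a fixed demographic model and computes likelihoods with the methods of [16] rather than by simulation. The function names and default parameter values are hypothetical.

```python
import random


def expected_next_freq(freq, s_het, mu):
    """Deterministic part of one generation: selection, then mutation."""
    # Genic selection: each LOF copy has relative fitness 1 - s_het.
    after_selection = freq * (1.0 - s_het) / (1.0 - s_het * freq)
    # One-way mutation from the functional allele to an LOF at rate mu.
    return after_selection + mu * (1.0 - after_selection)


def simulate_lof_frequency(s_het, mu, n_chrom=500, generations=200, seed=0):
    """Simulate an LOF allele frequency in a constant-size population (assumed)."""
    rng = random.Random(seed)
    freq = 0.0
    for _ in range(generations):
        p = expected_next_freq(freq, s_het, mu)
        # Genetic drift: binomial resampling of n_chrom chromosomes.
        count = sum(1 for _ in range(n_chrom) if rng.random() < p)
        freq = count / n_chrom
    return freq


if __name__ == "__main__":
    # Compare a neutral, a moderately selected, and a strongly selected site.
    for s in (0.0, 0.01, 0.2):
        print(s, simulate_lof_frequency(s_het=s, mu=1e-3))
```

Repeating the simulation over many seeds gives an empirical distribution of LOF frequencies for each value of $s_{\mathrm{het}}$, which is the quantity a likelihood for observed frequencies has to describe.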
Previous methods use either the number of unique LOFs or the sum of the frequencies of all LOFs in a gene, but we model the frequency of each individual LOF variant. We used LOF frequencies from gnomAD, a dataset of exome sequences from ~125,000 individuals, and retained 18,563 genes after filtering.
Combining these two components (the learned priors and the likelihood of the LOF data), we obtained posterior distributions over $s_{\mathrm{het}}$ for every gene. Throughout, we use the posterior mean value of $s_{\mathrm{het}}$ for each gene as a point estimate. See Methods for more details and Supplementary Table 2 for estimates of $s_{\mathrm{het}}$.
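The way a prior and a likelihood combine into a posterior can be sketched on a discrete grid of $s_{\mathrm{het}}$ values. The grid, the Gaussian-in-log10 prior shape, and the toy likelihoods below are assumptions made for illustration; they are not the prior family or the likelihood used by GeneBayes.

```python
import math


def posterior_on_grid(log_prior, log_lik):
    """Normalize prior x likelihood over a discrete grid of s_het values."""
    log_post = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    peak = max(log_post)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(grid, post):
    """Point estimate: posterior mean of s_het."""
    return sum(s * p for s, p in zip(grid, post))


def prob_above(grid, post, threshold):
    """Posterior probability that s_het exceeds a threshold."""
    return sum(p for s, p in zip(grid, post) if s > threshold)


# Hypothetical log-spaced grid from 1e-6 to 1.
grid = [10 ** (-6 + 6 * i / 120) for i in range(121)]


def toy_log_prior(center_log10, width):
    """Stand-in for a feature-derived prior: Gaussian in log10(s_het)."""
    return [-0.5 * ((math.log10(s) - center_log10) / width) ** 2 for s in grid]


# A gene whose LOF data are uninformative (flat likelihood): the posterior
# then simply reproduces the feature-based prior.
prior = toy_log_prior(center_log10=-1.0, width=0.5)
flat_lik = [0.0] * len(grid)
post = posterior_on_grid(prior, flat_lik)
print(posterior_mean(grid, post), prob_above(grid, post, 0.01))
```

The flat-likelihood case mirrors the situation of genes with few expected LOFs: when the data cannot distinguish between values of $s_{\mathrm{het}}$, the estimate is driven by the prior learned from gene features. The tail probability returned by `prob_above` is one example of a posterior summary other than the mean.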
2.2. Population genetics model and gene features both affect the estimation of $s_{\mathrm{het}}$
First, we explored how LOF frequency and mutation rate relate to $s_{\mathrm{het}}$ in our population genetics model (Figure 2A). Invariant sites with high mutation rates are indicative of strong selection , consistent with [19], while such sites with low mutation rates are consistent with essentially any value of $s_{\mathrm{het}}$ for the demographic model considered here. Regardless of mutation rate, singletons are consistent with most values of $s_{\mathrm{het}}$ but can rule out extremely strong selection, and variants observed at a frequency of >10% rule out even moderately strong selection .
To assess how informative gene features are about $s_{\mathrm{het}}$, we trained our model on a subset of genes and evaluated the model on held-out genes (Figure 2B, Methods). We computed the Spearman correlation between $s_{\mathrm{het}}$ estimates from the prior and $s_{\mathrm{het}}$ estimated from the LOF data only. The correlation is high and comparable between train and test sets (Spearman and 0.78 respectively), indicating that the gene features alone are highly predictive of $s_{\mathrm{het}}$ and that this is not a consequence of overfitting.
To further characterize the impact of features on our estimates of $s_{\mathrm{het}}$, we removed all features from our model and recalculated posterior distributions (Figure 2C). For most genes, posteriors are substantially more concentrated when using gene features.
Next, we compared our estimates of $s_{\mathrm{het}}$ using GeneBayes to LOEUF and to selection coefficients estimated by [4] (Figure 2D). To facilitate comparison, we use the posterior modes of $s_{\mathrm{het}}$ reported in [4] as point estimates, but we note that [4] emphasizes the value of using full posterior distributions. While the correlation between our estimates is high for genes with sufficient LOFs (for genes with more LOFs than the median, Spearman ρ with $s_{\mathrm{het}}$ from [4] = 0.88), it is lower for genes with few expected LOFs (for genes with fewer LOFs than the median, Spearman ρ with $s_{\mathrm{het}}$ from [4] = 0.71).
We further explored the reduced correlations for genes with few expected LOFs. For example, TBC1D3 and PLN have few expected LOFs, and their likelihoods are consistent with any level of constraint (Figure 2E). Due to the high degree of uncertainty, LOEUF considers both genes to be unconstrained, while the point estimates from [4] err in the other direction and consider both genes to be constrained (Figure 2D). This uncertainty arises from use of the LOF data alone, and is captured by the wide posterior distributions for the estimates from [4]. In contrast, by using gene features, our posterior distributions of $s_{\mathrm{het}}$ indicate that PLN is strongly constrained but TBC1D3 is not, consistent with the observation that heterozygous LOFs in PLN cause severe cardiac dilation and heart failure [20].
In contrast to estimates of $s_{\mathrm{het}}$, LOEUF further ignores information about allele frequencies by considering only the number of unique LOFs, resulting in a loss of information. For example, AARD and TWIST1 have almost the same numbers of observed and expected unique LOFs, so LOEUF is similar for both (LOEUF = 1.1 and 1.06 respectively). However, while TWIST1’s observed LOF is present in only 1 of 246,192 alleles, AARD’s is ~40× more frequent. Consequently, the likelihood rules out the possibility of strong constraint at AARD (Figure 2F), causing the two genes to differ in their estimated selection coefficients (Figure 2D).
In contrast, TWIST1 has a posterior mean of 0.11 when using gene features, indicating very strong selection. Consistent with this, TWIST1 is a transcription factor critical for specification of the cranial mesoderm, and heterozygous LOFs in the gene are associated with Saethre-Chotzen syndrome, a disorder characterized by congenital skull and limb abnormalities [21, 22].
Besides PLN and TWIST1, many genes are considered constrained by $s_{\mathrm{het}}$ but not by LOEUF, which is designed to be highly conservative. In Table 1, we list 15 examples with and , selected based on their clinical significance and prominence in the literature (Methods). One notable example is a set of 16 ribosomal protein genes for which heterozygous disruption causes Diamond-Blackfan anemia, a rare genetic disorder characterized by an inability to produce red blood cells [23] (Supplementary Table 1). All are considered strongly constrained by $s_{\mathrm{het}}$ (minimum ). In contrast, only 6 are considered constrained by LOEUF (), as many of these genes have few expected unique LOFs.
Table 1:
Gene | $s_{\mathrm{het}}$ | LOEUF | Obs. | Exp. | Condition and reference |
---|---|---|---|---|---|
RPS15A * | 0.61 | 0.56 | 0 | 5.4 | Diamond-Blackfan anemia: Red blood cell aplasia resulting in growth, craniofacial, and other congenital defects [23] |
DCX | 0.48 | 0.62 | 3 | 12.6 | Lissencephaly: Migrational arrest of neurons resulting in mental retardation and seizures [24] |
SOX2 | 0.33 | 0.57 | 1 | 8.3 | Syndromic microphthalmia: Missing or small eyes from birth [25] |
NDP | 0.33 | 0.88 | 0 | 3.4 | Norrie disease: Retinal dystrophy resulting in early childhood blindness, mental disorders, and deafness [26] |
EIF5A | 0.32 | 0.54 | 1 | 8.7 | Faundes-Banka syndrome: Developmental delay, microcephaly, and facial dysmorphisms [27] |
CDKN1C | 0.27 | 0.53 | 0 | 5.7 | Beckwith-Wiedemann syndrome: Pediatric overgrowth with predisposition to tumor development [28] |
TGIF1 | 0.25 | 0.91 | 5 | 11.5 | Holoprosencephaly: Structural malformation of the forebrain during development [29] |
SH2D1A | 0.23 | 0.96 | 1 | 4.9 | Lymphoproliferative syndrome: Severe immune dysregulation due to improper lymphocyte apoptosis [30] |
CEBPA | 0.17 | 1.18 | 0 | 2.4 | Acute myeloid leukemia: Blood and bone marrow cancer with rapid progression [31] |
GATA4 | 0.15 | 0.53 | 3 | 14.7 | Atrial septal defect: Congenital heart defect resulting in a hole between the atria [32] |
TIMP3 | 0.13 | 0.53 | 2 | 11.8 | Sorsby fundus dystrophy: Retinal dystrophy that causes loss of vision [33] |
FOXC2 | 0.13 | 0.79 | 3 | 9.8 | Lymphedema-distichiasis syndrome: Lymphedema of the limbs and double rows of eyelashes [34] |
IGF2 | 0.12 | 1.13 | 3 | 6.8 | Silver-Russell syndrome: Growth retardation, relative macrocephaly, and feeding difficulties [35] |
PLN | 0.12 | 1.56 | 0 | 1.5 | Dilated cardiomyopathy: Enlarged heart chambers, decreased contractile function, and heart failure [20] |
TWIST1 | 0.11 | 1.06 | 1 | 4.5 | Saethre-Chotzen syndrome: Craniosynostosis, facial dysmorphism, and hand and foot abnormalities [21] [22] |
Mutations that disrupt the functions of these genes are associated with Mendelian diseases in the OMIM database [36]. Genes are ordered by $s_{\mathrm{het}}$ (posterior mean). Obs. and Exp. are the numbers of observed and expected unique LOFs, respectively.
* RPS15A is associated with Diamond-Blackfan anemia along with nine other genes considered constrained by $s_{\mathrm{het}}$ but not by LOEUF (Supplementary Table 1).
2.3. Utility of $s_{\mathrm{het}}$ in prioritizing phenotypically important genes
To assess the accuracy of our $s_{\mathrm{het}}$ estimates and evaluate their ability to prioritize genes, we first used these estimates to classify genes essential for survival of human cells in vitro. Genome-wide CRISPR growth screens have measured the effects of gene knockouts on cell survival or proliferation, quantifying the in vitro importance of each gene for fitness [37, 38]. We find that our estimates of $s_{\mathrm{het}}$ outperform other constraint metrics at classifying essential genes (Figure 3A, left; bootstrap for pairwise differences in AUPRC between our estimates and other metrics). The difference is largest for genes with few expected LOFs, where $s_{\mathrm{het}}$ (GeneBayes) retains similar precision and recall while other metrics lose performance (Figure 3A, right). In addition, our estimates of $s_{\mathrm{het}}$ outperform other metrics at classifying nonessential genes (Supplementary Figure 2A).
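As a reminder of what the AUPRC comparison measures, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. It is a minimal illustration; the exact estimator and bootstrap procedure used for Figure 3A are described in Methods and are not reproduced here. The function name and the toy inputs are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision for binary labels (1 = essential, 0 = not).

    Genes are ranked from highest to lowest score; precision is averaged
    over the ranks at which a positive gene is recovered.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


if __name__ == "__main__":
    # Toy example: three genes, two of which are essential.
    print(average_precision([0.9, 0.8, 0.1], [1, 0, 1]))
```

For metrics where lower values indicate more constraint, such as LOEUF, the scores would be negated before ranking so that the most constrained genes come first.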
DeepLOF [15], the only other method that combines information from both LOF data and gene features, outperforms methods that rely exclusively on LOF data, highlighting the importance of using auxiliary information. Yet, DeepLOF uses only the number of unique LOFs, discarding frequency information. As a result, it is outperformed by our method, indicating that careful modeling of LOF frequencies also contributes to the performance of our approach.
Next, we performed further comparisons of our estimates of $s_{\mathrm{het}}$ against LOEUF, as LOEUF and its predecessor pLI are extremely popular metrics of constraint. To evaluate the ability of these methods to prioritize disease genes, we first used $s_{\mathrm{het}}$ and LOEUF to classify curated developmental disorder genes [39]. Here, $s_{\mathrm{het}}$ outperforms LOEUF (Figure 3B; bootstrap for the difference in AUPRC) and performs favorably compared to additional constraint metrics (Supplementary Figure 2B).
Next, we considered a broader range of phenotypic abnormalities annotated in the Human Phenotype Ontology (HPO) [40]. For each HPO term, we calculated the enrichment of the 10% most constrained genes and depletion of the 10% least constrained genes, ranked using $s_{\mathrm{het}}$ or LOEUF. Genes considered constrained by $s_{\mathrm{het}}$ are 1.9-fold enriched in HPO terms, compared to 1.5-fold enrichment for genes considered constrained by LOEUF (Figure 3C, left). Additionally, genes considered unconstrained by $s_{\mathrm{het}}$ are 3.0-fold depleted in HPO terms, compared to 2.1-fold depletion for genes considered unconstrained by LOEUF (Figure 3C, right).
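A fold enrichment of this kind can be sketched as the annotation rate among the top-ranked genes divided by the rate among all genes. This is an assumed, simplified definition for illustration; the precise calculation used for Figure 3C is given in Methods. The function name and the toy data are hypothetical.

```python
def fold_enrichment(scores, in_term, top_frac=0.10):
    """Annotation rate in the top-scoring fraction of genes relative to all genes."""
    n = len(scores)
    k = max(1, int(round(n * top_frac)))
    # Indices of the k genes with the highest scores.
    top = sorted(range(n), key=lambda i: -scores[i])[:k]
    rate_top = sum(in_term[i] for i in top) / k
    rate_all = sum(in_term) / n
    return rate_top / rate_all


if __name__ == "__main__":
    # Toy data: 20 genes scored 0..19; genes 5, 17, 18, 19 carry the term.
    scores = list(range(20))
    in_term = [1 if i in (5, 17, 18, 19) else 0 for i in range(20)]
    print(fold_enrichment(scores, in_term))                # most constrained 10%
    print(fold_enrichment([-s for s in scores], in_term))  # least constrained 10%
```

Negating the scores ranks the least constrained genes first, so the same function returns a value below 1 when those genes are depleted for the term.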
X-linked inheritance is one of the terms with the largest enrichment of constrained genes (6.6-fold enrichment for $s_{\mathrm{het}}$ and 4.2-fold enrichment for LOEUF). The ability of $s_{\mathrm{het}}$ to prioritize X-linked genes may prove particularly useful, as many disorders are enriched for X-chromosome genes [41] and the selection on losing a single copy of such genes is stronger on average [4]. Yet, population-scale sequencing alone has less power to detect a given level of constraint on X-chromosome genes, as the number of X chromosomes in a cohort with males is smaller than the number of autosomes.
We next assessed if de novo disease-associated variants are enriched in constrained genes, similar to the analyses in [4,5]. To this end, we used data from 31,058 trios to calculate for each gene the enrichment of de novo missense and LOF mutations in offspring with DDs relative to unaffected parents [5]. We found that for both classes of variants, enrichment is higher for genes considered constrained by $s_{\mathrm{het}}$, with the highest enrichment observed for LOF variants (Figure 3D; enrichment for $s_{\mathrm{het}}$ and LOEUF respectively, for missense mutations = 2.2, 1.9; splice site mutations = 6.3, 4.6; and nonsense mutations = 9.5, 6.7). Consistent with previous findings, the excess burden of de novo variants is predominantly in highly constrained genes (Supplementary Figure 2C, left). Notably, this difference in enrichment remains after removing known DD genes (Supplementary Figure 2C, right). Together, these results indicate that $s_{\mathrm{het}}$ not only improves identification of known disease genes but may also facilitate discovery of novel DD genes [5].
Finally, constraint can also be related to longer-term evolutionary processes that give rise to the variation among individuals or species, including variation in gene expression levels. We expect constrained genes to maintain expression levels closer to their optimal values across evolutionary time scales, as each LOF can be thought of as a ~50% reduction in expression. Consistent with this expectation, we find that less constrained genes have larger absolute differences in expression between human and chimpanzee in cortical cells [42], with a stronger correlation for $s_{\mathrm{het}}$ than for LOEUF (Figure 3E). This pattern should also hold when considering the variation in expression within a species. We quantified variance using the normalized standard deviation of gene expression levels estimated from RNA-seq samples in GTEx [43] and found that the variance decreases with increased constraint, again with a stronger correlation for $s_{\mathrm{het}}$ (Figure 3E).
2.4. Interpreting the learned relationship between gene features and $s_{\mathrm{het}}$
Our framework allows us to learn the relationship between gene features and $s_{\mathrm{het}}$ in a statistically principled way. In particular, by fitting a model with all of the features jointly, we can account for dependencies between the features. To interrogate the relationship between features and $s_{\mathrm{het}}$, we divided our gene features into 10 distinct categories (Figure 4A) and trained a separate model per category using only the features in that category. We found that missense constraint, gene expression patterns, evolutionary conservation, and protein embeddings are the most informative categories.
Next, we further divided the expression features into 24 subgroups, representing tissues, cell types, and developmental stage (Table 6). Expression patterns in the brain, digestive system, and during development are the most predictive of constraint (Figure 4B). Notably, a study that matched Mendelian disorders to tissues through literature review found that a sizable plurality affect the brain [44]. Meanwhile, most of the top digestive expression features are also related to development (e.g., expression component loadings in a fetal digestive dataset [45]). The importance of developmental features is consistent with the severity of many developmental disorders and the expectation that selection is stronger on early-onset phenotypes [46], supported by the findings of [4].
Table 6:
Category | Terms in the feature (not case sensitive) |
---|---|
| |
Brain | brain, nerve, microglia, hippocampus |
Digestive | digestive, gut, gutendoderm, intestine, colon, ileum |
Development | development, gastrulation, embryo |
Lung | lung, airway |
Eye | eye, retina |
Endothelium | endothelium |
Muscle | muscle |
Hair follicle | hairfollicle |
Kidney | kidney |
Immune | immune, monocytes, nk, tcell, pbmc |
Prostate | prostate |
Blood | blood, heme, fetalblood |
Adipocyte | adipocyte |
Heart | heart, aorta |
Thymus | thymus |
Pancreas | pancreas, islets, pancreasductal |
Liver | liver |
Testis | testis |
Synovial fibroblast | synovialfibroblast |
Bladder | bladder |
Placenta | placenta |
Bone marrow | bonemarrow |
CSF | csf |
Lymph nodes | lymphnodes |
To quantify the relationship between constraint and individual features, we changed the value of one feature at a time and used the variation in predicted $s_{\mathrm{het}}$ over the feature values as the score for each feature (Methods).
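The one-feature-at-a-time procedure can be sketched as follows. The predictor interface, the baseline feature values, and the use of the standard deviation as the measure of variation are assumptions made for illustration; the scoring actually used is specified in Methods.

```python
def feature_score(predict, baseline, feature, values):
    """Score a feature by how much the prediction varies as only it changes.

    predict  : callable mapping a dict of feature values to a predicted s_het
    baseline : dict of feature values held fixed (hypothetical reference gene)
    feature  : name of the feature to perturb
    values   : values to substitute for that feature
    """
    preds = []
    for v in values:
        x = dict(baseline)  # copy so the baseline is not modified
        x[feature] = v
        preds.append(predict(x))
    mean = sum(preds) / len(preds)
    # Standard deviation of the predictions (assumed measure of variation).
    return (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5


if __name__ == "__main__":
    # Toy predictor that depends on conservation only.
    toy_predict = lambda x: 0.1 * x["conservation"]
    base = {"conservation": 0.5, "gc_content": 0.4}
    grid = [0.0, 0.5, 1.0]
    print(feature_score(toy_predict, base, "conservation", grid))
    print(feature_score(toy_predict, base, "gc_content", grid))
```

In this toy case the score for `gc_content` is zero (up to rounding) because the predictor ignores it, while `conservation` receives a positive score.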
We first explored some of the individual Gene Ontology (GO) terms most predictive of constraint (Figure 4C). Consistent with the top expression features, the top GO features highlight developmental and brain-specific processes as important for selection.
Next, we analyzed network (Figure 4D), gene regulatory (Figure 4E), and gene structure (Figure 4F) features. Protein-protein interaction (PPI) and gene co-expression networks have highlighted “hub” genes involved in numerous cellular processes [47,48], while genes linked to GWAS variants have more complex enhancer landscapes [49]. Consistent with these studies, we find that connectedness in PPI and co-expression networks as well as enhancer and promoter count are positively associated with constraint (Figure 4D,E). In addition, gene structure affects gene function—for example, UTR length and GC content affect RNA stability, translation, and localization [50, 51]—and likewise, several gene structure features are predictive of constraint (Figure 4F). Our results indicate that more complex genes—genes that are involved in more regulatory connections, that are more central to networks, and that have more complex gene structures—are generally more constrained.
2.5. Contextualizing the strength of selection against gene loss-of-function
A major benefit of $s_{\mathrm{het}}$ over LOEUF and pLI is that $s_{\mathrm{het}}$ has a precise, intrinsic meaning in terms of fitness [1–4]. This facilitates comparison of $s_{\mathrm{het}}$ between genes, populations, species, and studies. For example, $s_{\mathrm{het}}$ can be compared to selection estimated from mutation accumulation or gene deletion experiments performed in model organisms [52,53]. More broadly, selection applies beyond LOFs. While we focused on estimating changes in fitness due to LOFs, consequences of non-coding, missense, and copy number variants can be understood through the same framework, as we expect such variants to also be under negative selection [19] due to ubiquitous stabilizing selection on traits [54]. Quantifying differences in the selection on variants will deepen our understanding of the evolution and genetics of human traits (see Discussion).
To contextualize our $s_{\mathrm{het}}$ estimates, we compared the distributions of $s_{\mathrm{het}}$ for different gene sets (Figure 5A) and genes (Figure 5B), and analyzed them in terms of selection regimes. To define such regimes, we first conceptualized selection on variants as a function of their effects on expression (Figure 5C), where heterozygous LOFs reduce expression by ~50% across all contexts relevant to selection. Under this framework, we can directly compare $s_{\mathrm{het}}$ to selection on other variant types. For the hypothetical genes in Figure 5C, a GWAS hit affecting Gene 1 has a stronger selective effect than a LOF affecting Gene 2, despite having a smaller effect on expression.
Next, we divided the range of possible $s_{\mathrm{het}}$ values into four regimes determined by theoretical considerations [55] and comparisons to other types of variants [56, 57]: nearly neutral (9% of genes), weak selection (22%), strong selection (54%), and extreme selection (15%). LOFs in nearly neutral genes have minimal effects on fitness; the frequency of such variants is dominated by genetic drift rather than selection [55]. Under the weak selection regime ($s_{\mathrm{het}}$ from $10^{-4}$ to $10^{-3}$), gene LOFs have similar effects on fitness as typical GWAS hits, which usually have small or context-specific effects on gene expression or function [56]. Under the strong selection regime ($s_{\mathrm{het}}$ from $10^{-3}$ to $10^{-1}$), gene LOFs have fitness effects on par with the strongest selection coefficients measured for common variants, such as the selection estimated for adaptive mutations in LCT [57]. Finally, for genes in the extreme selection regime ($s_{\mathrm{het}} > 10^{-1}$), LOFs have an effect on fitness equivalent to a >10% chance of embryonic lethality, indicating that such LOFs have an extreme effect on survival or reproduction.
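The four regimes can be expressed as a simple lookup. The boundaries at $10^{-4}$ and $10^{-3}$ follow the weak-selection range stated above and the boundary at $10^{-1}$ follows the upper end of the strong-selection range; treating values below $10^{-4}$ as nearly neutral and values above $10^{-1}$ as extreme is inferred from those adjacent ranges and is an assumption of this sketch, as is the handling of values that fall exactly on a boundary.

```python
def selection_regime(s_het):
    """Map a point estimate of s_het to one of four selection regimes."""
    if not 0.0 <= s_het <= 1.0:
        raise ValueError("s_het must lie in [0, 1]")
    if s_het < 1e-4:
        return "nearly neutral"   # assumed lower regime: below the weak range
    if s_het < 1e-3:
        return "weak selection"
    if s_het <= 1e-1:
        return "strong selection"
    return "extreme selection"    # assumed upper regime: above the strong range


if __name__ == "__main__":
    # Posterior means for two genes listed in Table 1.
    for gene, s in (("TWIST1", 0.11), ("RPS15A", 0.61)):
        print(gene, selection_regime(s))
```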
Gene sets vary widely in their constraint. For example, genes known to be haploinsufficient for severe diseases are almost all under extreme selection. In contrast, genes that can tolerate homozygous LOFs are generally under weak selection. One notable example of such a gene is LPA: while high expression levels are associated with cardiovascular disease, low levels have minimal phenotypic consequences [58, 59], consistent with limited conservation in the sequence or gene expression of LPA across species and populations [60, 61].
Other gene sets have much broader distributions of $s_{\mathrm{het}}$ values. For example, manually curated recessive genes are under weak to strong selection, indicating that many such genes are either not fully recessive or have pleiotropic effects on other traits under selection. For example, homozygous LOFs in PROC can cause life-threatening congenital blood clotting [62], yet $s_{\mathrm{het}}$ for PROC is non-negligible (Figure 5B), consistent with observations that heterozygous LOFs can also increase blood clotting and cause deep vein thrombosis [63].
Similarly, $s_{\mathrm{het}}$ values for ClinVar disease genes [64] span the range from weak to extreme selection, with only moderate enrichment for greater constraint relative to all genes. Consistent with this, the effects of disease on fitness depend on disease severity, age-of-onset, and prevalence throughout human history. For example, even though heterozygous loss of BRCA1 greatly increases risk of breast and ovarian cancer [65], BRCA1 is under strong rather than extreme selection. Possible partial explanations are that these cancers have an age-of-onset past reproductive age and are less prevalent in males, or that BRCA1 is subject to some form of antagonistic pleiotropy [14, 66].
3. Discussion
Here, we developed an empirical Bayes approach to accurately infer $s_{\mathrm{het}}$, an interpretable metric of gene constraint. Our approach uses powerful machine learning methods to leverage vast amounts of functional and evolutionary information about each gene while coupling them to a population genetics model.
There are two advantages of this approach. First, the additional data sources result in substantially better performance than LOEUF across tasks, from classifying essential genes to identifying pathogenic de novo mutations. These improvements are especially pronounced for the large fraction of genes with few expected LOFs, where LOF data alone is underpowered for estimating constraint.
Second, by inferring , our estimates of constraint are interpretable in terms of fitness, and we can directly compare the impact of a loss-of-function across genes, populations, species, and studies.
As a selection coefficient, can also be directly compared to other selection coefficients, even for different types of variants [3, 4]. In general, we believe genes are close to their optimal levels of expression and experience stabilizing selection [54], in which case expression-altering variants decrease fitness, with larger perturbations causing greater decreases (Figure 5C). Estimating the fitness consequences of other types of expression-altering variants, such as duplications or eQTLs, will allow us to map the relationship between genetic variation and fitness in detail, deepening our understanding of the interplay of expression, complex traits, and fitness [10, 56, 67, 68].
A recent method, DeepLOF [15], uses a similar empirical Bayes approach, but by estimating constraint from the number of observed and expected unique LOFs, it inherits the same difficulties regarding interpretation as pLI and LOEUF, and loses information by not considering variant frequencies. On the other hand, another line of work [1, 2], culminating in [4], solved the issues with interpretability by directly estimating . Yet, by relying exclusively on LOFs, these estimates are underpowered for ~25% of genes. Furthermore, by using the aggregate frequencies of all LOF variants, previous estimates [1, 2, 4] are not robust to misannotated LOF variants. Our approach eliminates this tradeoff between power and interpretability present in existing metrics.
Our estimates of will be useful for many applications. For example, by informing gene-level priors, LOEUF, pLI, and previous estimates of have been used to increase the power of association studies based on rare or de novo mutations [5,6,69]. In such contexts, our estimates can be used as a drop-in replacement. Additionally, extremely constrained and unconstrained genes may be interesting to study in their own right. Genes of unknown function with particularly high values of should be prioritized for further study. Investigating highly constrained genes may give insights into the mechanisms by which cellular and organism-level phenotypes affect fitness [70].
While we primarily used the posterior means of $s_{\mathrm{het}}$ here, our approach provides the entire posterior distribution per gene, similar to [4]. In some applications, different aspects of the posterior may be more relevant than the mean. For example, when prioritizing rare variants for followup in a clinical setting, the posterior probability that $s_{\mathrm{het}}$ is high enough for the variant to severely reduce fitness may be more relevant.
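As a rough illustration of how different posterior summaries can be read off a per-gene posterior, the sketch below works on a discretized grid of selection coefficients. The grid, the prior and likelihood values, the threshold, and the function name `posterior_summary` are all assumptions made for illustration; they are not taken from the GeneBayes code.

```python
def posterior_summary(grid, prior, likelihood, threshold):
    """Return (posterior mean, P(s >= threshold)) on a discrete grid.

    grid       : candidate selection coefficients (hypothetical grid)
    prior      : prior weight at each grid point
    likelihood : likelihood of the gene's LOF data at each grid point
    threshold  : illustrative cutoff for "severe" selection
    """
    # Bayes' rule on the grid: posterior is proportional to prior * likelihood.
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    posterior = [u / total for u in unnormalized]

    mean = sum(s * w for s, w in zip(grid, posterior))
    tail = sum(w for s, w in zip(grid, posterior) if s >= threshold)
    return mean, tail


# Toy example with three grid points and a flat prior.
mean, tail = posterior_summary(
    grid=[0.001, 0.01, 0.1],
    prior=[1 / 3, 1 / 3, 1 / 3],
    likelihood=[1.0, 1.0, 2.0],
    threshold=0.1,
)
```

Two genes with similar posterior means can have quite different tail probabilities, which is why the choice of summary may matter in clinical prioritization.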
As more exomes are sequenced, one might expect that we would be able to estimate $s_{\mathrm{het}}$ more accurately. Yet, in a companion paper [16], we show that increasing the sample size used for estimating LOF frequencies will provide essentially no additional information for the ~85% of genes with the lowest values of $s_{\mathrm{het}}$. This fundamental limit on how much we can learn about these genes from LOF data alone highlights the importance of approaches like ours that can leverage additional data types. By sharing information across genes, we can overcome this limit on how accurately we can estimate constraint.
Here we focused on estimating $s_{\mathrm{het}}$, but our empirical Bayes framework, GeneBayes, can be used in any setting where one has a model that ties a gene-level parameter to gene-level observable data (Supplementary Note D). For example, GeneBayes can be used to find trait-associated genes using variants from case/control studies [71, 72], or to improve power to find differentially expressed genes in RNA-seq experiments [73]. We provide a graphical overview of how GeneBayes can be applied more generally in Figure 6. Briefly, GeneBayes requires users to specify a likelihood model and the form of a prior distribution for their parameter of interest. Then, using empirical Bayes and a set of gene features, it improves power to estimate the parameter by flexibly sharing information across similar genes.
In summary, we developed a powerful framework for estimating a broadly applicable and readily interpretable metric of constraint, $s_{\mathrm{het}}$. Our estimates provide a more informative ranking of gene importance than existing metrics, and our approach allows us to interrogate potential causes and consequences of natural selection.
4. Methods
Empirical Bayes overview
Many genes have few observed loss-of-function variants, making it challenging to infer constraint without additional information. Bayesian approaches that specify a prior distribution for each gene can provide such information to improve constraint estimates, but specifying prior distributions is challenging as we have limited prior knowledge about the selection coefficients $s_{\mathrm{het}}$. Empirical Bayes procedures allow us to learn a prior distribution for each gene by combining information across genes.
To use the information contained in the gene features, we learn a mapping from a gene’s features to a prior specific for that gene. We parameterize this mapping using gradient-boosted trees, as implemented in NGBoost [17]. Intuitively, this approach learns a notion of “similarity” between genes based on their features, and then shares information across similar genes to learn how $s_{\mathrm{het}}$ relates to the gene features. This approach has two major benefits. First, by sharing information between similar genes, it can dramatically improve the accuracy of the predicted $s_{\mathrm{het}}$ values, particularly for genes with few expected LOFs. Second, by leveraging the LOF data, this approach allows us to learn about how the various gene features relate to fitness, which cannot be modeled from first principles.
For a more in-depth description of our approach along with mathematical and implementation details, see Supplementary Note A.
Population genetic likelihood
To model how $s_{\mathrm{het}}$ relates to the frequency of individual LOF variants, we used the discrete-time Wright-Fisher model, with an approximation of diploid selection with additive fitness effects. We used a composite likelihood approach, assuming independence across individual LOF variants to obtain gene-level likelihoods. Within this composite likelihood, we model each individual variant as either having a selection coefficient of $s_{\mathrm{het}}$ with probability $1 - p_{\mathrm{miss}}$, or having a selection coefficient of 0 with probability $p_{\mathrm{miss}}$. That is, $p_{\mathrm{miss}}$ acts as the prior probability that a given variant is misannotated, and we assume that misannotated variants evolve neutrally regardless of the strength of selection on the gene. All likelihoods were computed using new machinery developed in a companion paper [16].
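The structure of this composite likelihood can be sketched as follows. The per-variant likelihoods are taken as given inputs (in the paper they come from the Wright-Fisher machinery of [16], which is not reproduced here); the function name and the name `p_miss` for the misannotation probability are chosen for illustration.

```python
import math


def gene_log_likelihood(lik_selected, lik_neutral, p_miss):
    """Composite log-likelihood for one gene at a fixed selection coefficient.

    lik_selected : per-variant likelihood of the observed frequency under
                   the gene's selection coefficient (assumed precomputed)
    lik_neutral  : per-variant likelihood under a selection coefficient of 0
    p_miss       : prior probability that a variant is misannotated
    """
    total = 0.0
    for under_selection, under_neutrality in zip(lik_selected, lik_neutral):
        # Two-component mixture: correctly annotated vs. misannotated variant.
        mixture = (1.0 - p_miss) * under_selection + p_miss * under_neutrality
        # Independence across variants: log-likelihoods add.
        total += math.log(mixture)
    return total


# Toy example with two variants.
ll = gene_log_likelihood([0.2, 0.5], [0.4, 0.1], p_miss=0.5)
```

Because each variant contributes its own mixture term, a single common variant that is likely misannotated has a bounded effect on the gene-level likelihood, in contrast to approaches that sum frequencies across all LOFs.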
Our model depends on a number of parameters—a demographic model of past population sizes, mutation rates for each site, and the probability of misannotation. The demographic model is taken from the literature [75] with modifications as described in [4]. The mutation rates account for trinucleotide context as well as methylation status at CpGs [12]. Finally, we estimated the probability of misannotation from the data.
For additional technical details and intuition see Supplementary Note B.
Curation of LOF variants
We obtained annotations for the consequences of all possible single nucleotide changes to the hg19 reference genome from [76]. The effects of variants on protein function were predicted using Variant Effect Predictor (VEP) version 85 [77] using GENCODE v19 gene annotations [78] as a reference. We defined a variant as a LOF if it was predicted by VEP to be a splice acceptor, splice donor, or stop gain variant. In addition, predicted LOFs were further annotated using LOFTEE [12], which implements a series of filters to identify variants that may be misannotated (for example, LOFTEE considers predicted LOFs near the ends of transcripts as likely misannotations). For our analyses, we only kept predicted LOFs labelled as High Confidence by LOFTEE, which are LOFs that passed all of LOFTEE’s filters.
Next, we considered potential criteria for further filtering LOFs: cutoffs for the median exome sequencing read depth, cutoffs for the mean pext (proportion expressed across transcripts) score [76], whether to exclude variants that fall in segmental duplications or regions with low mappability [79], and whether to exclude variants flagged by LOFTEE as potentially problematic but that passed LOFTEE’s primary filters.
We trained models with these filters one at a time and in combination, and chose the model that had the best AUPRC in classifying essential from nonessential genes in mice. The filters we evaluated and chose for the final model are reported in Table 2. Since we used mouse gene essentiality data to choose the filters, we do not further evaluate on these data.
Table 2:
Filtering criterion | Tested values | Best value |
---|---|---|
Cutoff for sequencing read depth (median across exomes) | 5×, 10×, 20× | 20× |
Cutoff for mean pext across tissues | 0.05, 0.1 | 0.05 |
Filter if variant falls in a segmental duplication or low mappability region | True, False | False |
Filter if variant is flagged as potentially problematic | True, False | True |
We considered genes to be essential in mice if they are heterozygous lethal, as determined by [12] using data from heterozygous knockouts reported in Mouse Genome Informatics [80]. We classify genes as nonessential if they are reported as “Viable with No Phenotype” by the International Mouse Phenotyping Consortium [81] (annotations downloaded on 12/08/22 from https://www.ebi.ac.uk/mi/impc/essential-genes-search/).
Finally, we annotated each variant with its frequency in the gnomAD v2.1.1 exomes [12], a dataset of 125,748 uniformly-analyzed exomes that were largely curated from case–control studies of common adult-onset diseases. gnomAD provides precomputed allele frequencies for all variants that they call.
For potential LOFs that are not segregating, gnomAD does not release the number of individuals that were genotyped at those positions. For these sites, we used the median number of genotyped individuals at the positions for which gnomAD does provide this information. We performed this separately on the autosomes and X chromosome.
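A minimal sketch of this imputation step is given below. The record layout (dictionaries with `chrom` and `n_genotyped` keys) is a hypothetical one chosen for illustration and does not reflect the gnomAD file format.

```python
from statistics import median


def impute_num_genotyped(sites):
    """Fill in missing genotyped-sample counts with a per-group median.

    sites : list of dicts with keys 'chrom' (e.g. '1', 'X') and
            'n_genotyped' (an int, or None when the count is not reported).
    Medians are computed separately for the autosomes and the X chromosome.
    """
    def group(chrom):
        return "X" if chrom == "X" else "autosome"

    medians = {}
    for name in ("autosome", "X"):
        known = [s["n_genotyped"] for s in sites
                 if group(s["chrom"]) == name and s["n_genotyped"] is not None]
        if known:
            medians[name] = median(known)

    return [s["n_genotyped"] if s["n_genotyped"] is not None
            else medians.get(group(s["chrom"]))
            for s in sites]


sites = [
    {"chrom": "1", "n_genotyped": 100},
    {"chrom": "2", "n_genotyped": 120},
    {"chrom": "3", "n_genotyped": None},
    {"chrom": "X", "n_genotyped": 80},
    {"chrom": "X", "n_genotyped": None},
]
filled = impute_num_genotyped(sites)
```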
Data sources for the variant annotations, filters, and frequencies, as well as additional information used to compute likelihoods are listed in Table 3.
Table 3:
Resource | Link |
---|---|
Annotations for possible LOFs | gs://gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv |
Mean methylation for CpG sites | gs://gcp-public-data--gnomad/resources/methylation |
Exome sequencing coverage | gs://gcp-public-data--gnomad/release/2.1/coverage/exomes/gnomad.exomes.coverage.summary.tsv.bgz |
Variant frequencies | gs://gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz |
Low mappability and segmental duplications | https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.1/GRCh37/Union/GRCh37_alllowmapandsegdupregions.bed.gz |
Feature processing and selection
We compiled 10 types of gene features from several sources:
- Gene structure (e.g., number of transcripts, number of exons, GC content)
- Gene expression across tissues and cell lines
- Biological pathways and Gene Ontology terms
- Protein-protein interaction networks
- Co-expression networks
- Gene regulatory landscape (e.g., number and properties of enhancers and promoters)
- Conservation across species
- Protein embeddings
- Subcellular localization
- Missense constraint
Additionally, we included an indicator variable that is 1 if the gene is on the non-pseudoautosomal region of the X chromosome and 0 otherwise.
For a description of the features within each category and where we acquired them, see Supplementary Note C.
Training and validation
We fine-tuned a set of hyperparameters for our full empirical Bayes approach, using the best hyperparameters from an initial feature selection step (described in Supplementary Note C) as a starting point. To minimize overfitting, we split the genes into three sets—a training set (chromosomes 7–22, X), a validation set for hyperparameter tuning (chromosomes 2, 4, 6), and a test set to evaluate overfitting (chromosomes 1, 3, 5). During each training iteration, one or more trees were added to the model to fit the natural gradient of the loss on the training set. We stopped model training once the loss on the validation set did not improve for 10 iterations in a row (or the maximum number of iterations, 1,000, was reached). Using this approach, we performed a grid search over the hyperparameters listed in Table 4 and used the combination that minimized the validation loss.
Table 4:
Parameter(s) | Tested values | Best value |
---|---|---|
Learning rate | 0.0125, 0.05, 0.2 | 0.0125 |
Maximum tree depth (max_depth) | 3, 4, 5 | 3 |
Data subsampling ratio (subsample) | 0.6, 0.8, 1 | 0.8 |
Minimum weight of a leaf node (min_child_weight) | 1, 2, 4 | 1 |
L1 regularization (alpha) | 0, 1, 2 | 2 |
L2 regularization (lambda) | 1, 2, 4 | 1 |
Number of trees to fit per iteration (n_estimators) | 1, 2, 4 | 4 |
For Figure 2B, we reported results from the best model learned using the training set. For all other results, we trained a model on all genes using the hyperparameters and number of training iterations learned during this hyperparameter fine-tuning step.
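The chromosome-based split and the early-stopping rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the GeneBayes training loop: the function names are hypothetical, validation losses are passed in as a precomputed list, and returning the best iteration (rather than the last one) is an assumption.

```python
VALIDATION_CHROMS = {"2", "4", "6"}
TEST_CHROMS = {"1", "3", "5"}
TRAIN_CHROMS = {str(c) for c in range(7, 23)} | {"X"}


def assign_split(chrom):
    """Map a chromosome name to its data split."""
    if chrom in VALIDATION_CHROMS:
        return "validation"
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in TRAIN_CHROMS:
        return "train"
    raise ValueError(f"chromosome {chrom!r} is not assigned to any split")


def early_stopping_iteration(val_losses, patience=10, max_iter=1000):
    """Return the 1-based iteration with the best validation loss.

    Scanning stops once the loss has not improved for `patience`
    consecutive iterations, or after `max_iter` iterations.
    """
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses[:max_iter], start=1):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            break
    return best_iter


# Loss improves for three iterations, then plateaus at a worse value.
stop_at = early_stopping_iteration([5.0, 4.0, 3.0] + [3.5] * 12)
```

Splitting by whole chromosomes rather than by random genes keeps neighboring genes, which may share features, in the same split.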
Choosing genes for Table 1
To identify genes that are considered constrained by $s_{\mathrm{het}}$ but not by LOEUF, we filtered for genes with (top ~17% most constrained genes, analogous to the recommended LOEUF cutoff of 0.35 [14], which corresponds to the top ~16% of genes) and (least constrained ~73% of genes). Of these, we identified genes where heterozygous or hemizygous mutations that decrease the amount of functional protein (e.g. LOF mutations) are associated with Mendelian disorders in the Online Mendelian Inheritance in Man (OMIM) database [36]. We chose genes for Table 1 primarily based on their prominence in the existing literature.
Evaluation on additional datasets
Definition of human essential and nonessential genes
We obtained data from 1,085 CRISPR knockout screens quantifying the effects of genes on cell survival or proliferation from the DepMap portal (22Q2 release) [37, 38]. Scores from each screen are normalized such that nonessential genes identified by [82] have a median score of 0 and that common essential genes identified by [82, 83] have a median score of −1.
In classifying essential genes (Figure 3A), we define a gene as essential if its score is < −1 in at least 25% of screens, and as not essential if its score is > −1 in all screens. In classifying nonessential genes, we define a gene as nonessential if it has a minimal effect on growth in most cell lines (score > −0.25 and < 0.25 in at least 99% of screens), and as not nonessential if its score is < 0 in all screens.
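These labeling rules translate directly into code. In the sketch below, `scores` is the list of one gene's scores across screens; the function names, the `None` return value for genes that fit neither label, and the order in which the two nonessential rules are checked are assumptions.

```python
def classify_essential(scores):
    """Label a gene for the essential-gene classification task."""
    frac_below = sum(s < -1 for s in scores) / len(scores)
    if frac_below >= 0.25:
        return "essential"
    if all(s > -1 for s in scores):
        return "not essential"
    return None  # excluded from this task


def classify_nonessential(scores):
    """Label a gene for the nonessential-gene classification task."""
    frac_minimal = sum(-0.25 < s < 0.25 for s in scores) / len(scores)
    if frac_minimal >= 0.99:
        return "nonessential"
    if all(s < 0 for s in scores):
        return "not nonessential"
    return None  # excluded from this task


label_a = classify_essential([-1.5, -0.2, -0.3, -0.1])   # 25% of screens below -1
label_b = classify_nonessential([0.1, -0.1, 0.0, 0.2])   # all screens near 0
```

Genes that satisfy neither criterion in a task are left unlabeled, so the positive and negative sets are separated by a margin rather than by a single cutoff.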
Definition of developmental disorder genes
Through the Deciphering Developmental Disorders (DDD) study [39], clinicians have annotated a subset of genes with the strength and nature of their association with developmental disorders. We classify genes as developmental disorder genes if they are annotated by the DDD study with confidence_category = definitive and allelic_requirement = monoallelic_autosomal, monoallelic_X_hem (hemizygous), or monoallelic_X_het (heterozygous).
We classify genes as not associated with developmental disorders if they are annotated by the DDD study, do not meet the above criteria, and are not annotated with confidence_category = strong or moderate and allelic_requirement = monoallelic_autosomal, monoallelic_X_hem, or monoallelic_X_het.
We downloaded genes with DDD annotations from https://www.deciphergenomics.org/ddd/ddgenes on 05/06/2023.
Enrichment/depletion of Human Phenotype Ontology (HPO) genes
The Human Phenotype Ontology (HPO) provides a structured organization of phenotypic abnormalities and the genes associated with them, with each HPO term corresponding to a phenotypic abnormality. We calculated the enrichment of constrained genes in each HPO term with at least 200 genes as the ratio (fraction of HPO genes under constraint)/(fraction of background genes under constraint). We defined genes under constraint to be the decile of genes considered most constrained by $s_{\mathrm{het}}$ or LOEUF. To choose background genes, we sampled from the set of all genes to match each HPO term’s distribution of expected unique LOFs. Similarly, we calculated the depletion of unconstrained genes in each HPO term as the ratio (fraction of HPO genes not under constraint)/(fraction of background genes not under constraint), where we define genes not under constraint to be the decile of genes considered least constrained by $s_{\mathrm{het}}$ or LOEUF.
We downloaded HPO phenotype-to-gene annotations from http://purl.obolibrary.org/obo/hp/hpoa/phenotype_to_genes.txt on 01/27/2023.
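The enrichment ratio and one possible way of drawing a matched background are sketched below. The matching scheme (drawing, for each HPO gene, a random gene from the same fixed-width bin of expected unique LOFs) and the `bin_width` value are assumptions; the text above does not specify how the matching was implemented.

```python
import random


def fraction_in(genes, flagged):
    """Fraction of `genes` that belong to the set `flagged`."""
    return sum(g in flagged for g in genes) / len(genes)


def enrichment(term_genes, background_genes, flagged):
    """Ratio of flagged fractions: HPO term genes over background genes."""
    return fraction_in(term_genes, flagged) / fraction_in(background_genes, flagged)


def sample_matched_background(term_genes, all_genes, expected_lofs,
                              bin_width=5.0, seed=0):
    """For each term gene, draw a gene from the same expected-LOF bin."""
    rng = random.Random(seed)
    bins = {}
    for gene in all_genes:
        bins.setdefault(int(expected_lofs[gene] // bin_width), []).append(gene)
    return [rng.choice(bins[int(expected_lofs[g] // bin_width)])
            for g in term_genes]


# Toy example: half of the term genes are constrained vs. a quarter of background.
ratio = enrichment(["a", "b", "c", "d"], ["e", "f", "g", "h"], {"a", "b", "e"})
```

The same `enrichment` function gives the depletion of unconstrained genes when `flagged` is the least-constrained decile instead of the most-constrained one.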
Enrichment of de novo mutations in developmental disorder patients
We used the enrichment metric developed by [5] in their analysis of de novo mutations (DNMs) identified from exome sequencing of 31,058 developmental disorder patients and their unaffected parents. Enrichment of DNMs in developmental disorder patients was calculated as the ratio of observed DNMs in patients over the expected number under a null mutational model that accounts for the study sample size and triplet mutation rate at the mutation sites [84].
For Figure 3D, we calculated the enrichment of DNMs in constrained genes, defined as the decile of genes considered most constrained by $s_{\mathrm{het}}$ or LOEUF. For Supplementary Figure 2C, we calculated the enrichment of DNMs in constrained genes with and without known associations with developmental disorders. We defined a gene as having a known association if it is annotated by the DDD study (see Methods section “Definition of developmental disorder genes”) with confidence_category = definitive or strong and allelic_requirement = monoallelic_autosomal, monoallelic_X_hem (hemizygous), or monoallelic_X_het (heterozygous).
For each set of genes, we computed the mean enrichment over sites and 95% Poisson confidence intervals for the mean using the code provided by [5].
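For intuition, an observed/expected enrichment with a Poisson confidence interval can be sketched as below. This is not the code of [5]: it treats the total observed DNM count in a gene set as a single Poisson draw and uses the exact (Garwood) interval, solved by bisection, which is one common construction and may differ from the per-site averaging used in the analysis.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def _solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target, where f is decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_ci(k, alpha=0.05):
    """Exact two-sided confidence interval for a Poisson mean given count k."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda lam: poisson_cdf(k - 1, lam), 1 - alpha / 2, 0.0, float(k))
    upper = _solve_decreasing(
        lambda lam: poisson_cdf(k, lam), alpha / 2,
        float(k), k + 10 * math.sqrt(k + 1) + 10)
    return lower, upper


def dnm_enrichment(observed, expected, alpha=0.05):
    """Return (enrichment, CI lower, CI upper) for observed vs. expected DNMs."""
    lower, upper = poisson_ci(observed, alpha)
    return observed / expected, lower / expected, upper / expected


ratio, ci_low, ci_high = dnm_enrichment(observed=10, expected=5.0)
```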
Expression variability across species
To understand the variability in expression between humans and other species, we focused on gene expression differences between human and chimpanzee as estimated from RNA sequencing of an in vitro model of the developing cerebral cortex for each species [42]. As a metric of variability between the two species, we used the absolute log-fold change (LFC) in gene expression between human and chimpanzee cortical spheroids, which was calculated from samples collected at several time points throughout differentiation of the spheroids. LFC estimates were obtained from Supplementary Table 9 of [42].
To visualize the relationship between constraint and absolute LFC, we plotted a LOESS curve between the constraint on a gene (gene rank from least to most constrained using either $s_{\mathrm{het}}$ or LOEUF as the constraint metric) and the absolute LFC for the gene. Curves were calculated using the LOWESS function from the statsmodels package with parameters frac = 0.15 and delta = 10.
Expression variability across individuals
We used the coefficient of variation (CV) as a metric for gene expression variability across individuals, defined as $\mathrm{CV}_g = \sigma_g / \mu_g$, where $\sigma_g$ and $\mu_g$ are the standard deviation and mean of the expression level of gene $g$, respectively. Here, expression is in units of Transcripts Per Million. We calculated CV using 17,398 RNA-seq samples in the GTEx v8 release [43], with data from 838 donors and 52 tissues/cell lines.
Another potential metric for gene expression variability is the standard deviation for a gene, $\sigma_g$. However, as the mean expression for a gene, $\mu_g$, is strongly correlated with $\sigma_g$ (Spearman in GTEx), the relation between $\sigma_g$ and $s_{\mathrm{het}}$ may be confounded by the relation between $\mu_g$ and $s_{\mathrm{het}}$. In contrast, we found that CV is only slightly correlated with $\mu_g$ (Spearman in GTEx).
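The CV of a single gene can be computed in a few lines. The sketch assumes the sample standard deviation; whether the sample or population form was used in the analysis is not stated above.

```python
from statistics import mean, stdev


def coefficient_of_variation(tpm_values):
    """Standard deviation divided by mean for one gene's expression values.

    tpm_values : expression of the gene across samples, in TPM.
    """
    return stdev(tpm_values) / mean(tpm_values)


cv = coefficient_of_variation([10.0, 20.0, 30.0])  # sd = 10, mean = 20
```

Dividing by the mean makes the metric unitless, which is what reduces its dependence on overall expression level.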
LOESS curves were computed as in “Expression variability across species.”
Feature interpretation
Training models on feature subsets
We grouped features into categories (see Supplementary Table 4 for the features in each category), and trained a model for each category to predict $s_{\mathrm{het}}$ from the corresponding features. For each model, we tuned hyperparameters over a subset of the values we considered for the full model (Table 5), and chose the combination of hyperparameters that minimized the loss over genes in the validation set. As a baseline, we trained a model with no features, such that all genes have a shared prior distribution that is learned from the LOF data; this model is analogous to a standard empirical Bayes model.
Table 5:
Parameter(s) | Tested values |
---|---|
Learning rate | 0.0125, 0.05 |
Maximum tree depth (max_depth) | 3 |
Data subsampling ratio (subsample) | 0.8, 1 |
Minimum weight of a leaf node (min_child_weight) | 1 |
L1 regularization (alpha) | 0, 1, 2 |
L2 regularization (lambda) | 1 |
Number of trees to fit per iteration (n_estimators) | 1, 2, 4 |
Definition of expression feature subsets
We grouped gene expression features into 24 categories representing tissues, cell types, and developmental stage using terms present in the feature names (Table 6).
Scoring individual features
To score individual gene features, we varied the value of one feature at a time and calculated the variability in predicted $s_{\mathrm{het}}$ as a feature score. In more detail, we fixed each feature to values spanning the range of observed values for that feature (0th, 2nd, ..., 98th, and 100th percentile), such that all genes shared the same feature value. Then, for each of these 51 feature values, we averaged the $s_{\mathrm{het}}$ values predicted by the learned priors over all genes, where the predicted $s_{\mathrm{het}}$ for each gene is the mean of its prior. We denote this averaged prediction by $\bar{s}_{f,p}$ for some feature $f$ and percentile $p$. Finally, we define the score for feature $f$ as $\mathrm{score}_f = \mathrm{sd}(\bar{s}_{f,0}, \bar{s}_{f,2}, \ldots, \bar{s}_{f,100})$, where sd is a function computing the sample standard deviation. In other words, a feature with a high score is one for which varying its value causes high variance in the predicted $s_{\mathrm{het}}$.
For the lineplots in Figures 4C–4F, we scale the predictions for each feature by subtracting from each prediction.
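The feature-scoring procedure can be sketched as follows. The prediction function is passed in as a callable standing in for the prior mean produced by a trained model; the row layout, the function names, and the use of linear interpolation for percentiles are assumptions.

```python
from statistics import mean, stdev


def percentile(sorted_values, q):
    """q-th percentile (0 to 100) with linear interpolation."""
    n = len(sorted_values)
    pos = q * (n - 1) / 100
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def feature_score(rows, feature_idx, predict_prior_mean):
    """Sample sd of the gene-averaged prediction as one feature is varied.

    rows               : list of per-gene feature lists
    feature_idx        : index of the feature to vary
    predict_prior_mean : hypothetical callable mapping a feature list to
                         the mean of that gene's learned prior
    """
    observed = sorted(row[feature_idx] for row in rows)
    averaged = []
    for q in range(0, 101, 2):  # 0th, 2nd, ..., 100th percentile (51 values)
        value = percentile(observed, q)
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_idx] = value  # all genes share this feature value
            preds.append(predict_prior_mean(modified))
        averaged.append(mean(preds))
    return stdev(averaged)


# Toy model whose prediction depends only on feature 0.
rows = [[float(i), 1.0] for i in range(101)]
score_used = feature_score(rows, 0, lambda r: r[0])
score_unused = feature_score(rows, 1, lambda r: r[0])
```

A feature the model ignores receives a score of zero, since the averaged prediction does not change as the feature is varied.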
Pruning features before computing feature scores
While investigating the effects of features on predicted $s_{\mathrm{het}}$, we found that including highly correlated features in the model could produce unintuitive results, such as opposite correlations with $s_{\mathrm{het}}$ for highly similar features. Therefore, for Figures 4C–4F, we first pruned the set of features to minimize pairwise correlations between the remaining features. To do this, we randomly kept one feature in each group of correlated features, where such a group is defined as a set of features where each feature in the set has an absolute Spearman to some other feature in the set.
For Figures 4C–4F, we trained models on the relevant features in this pruned set (gene ontology, network, gene regulatory, and gene structure features for Figures 4C, 4D, 4E, and 4F respectively). After feature pruning, we found the directions of effect for the features were consistent with their marginal directions of effect.
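One way to implement this kind of pruning is to treat features as nodes of a graph, connect pairs whose absolute Spearman correlation exceeds a cutoff, and keep one random feature per connected component. The sketch below follows that reading; the connected-component interpretation, the `threshold` argument (the cutoff value is not given above), and the precomputed `abs_corr` dictionary are assumptions.

```python
import random


def prune_correlated(features, abs_corr, threshold, seed=0):
    """Keep one randomly chosen feature per group of correlated features.

    features  : list of feature names
    abs_corr  : dict mapping (a, b), with a listed before b in `features`,
                to the absolute Spearman correlation between a and b
    threshold : hypothetical cutoff above which two features are linked
    """
    parent = {f: f for f in features}

    def find(f):
        # Union-find lookup with path halving.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if abs_corr.get((a, b), 0.0) > threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for f in features:
        groups.setdefault(find(f), []).append(f)

    rng = random.Random(seed)
    return [rng.choice(members) for members in groups.values()]


kept = prune_correlated(
    ["a", "b", "c", "d"],
    {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.1},
    threshold=0.7,
)
```

In the toy example, `a` and `c` are only weakly correlated with each other but fall in the same group because both are linked to `b`.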
Acknowledgements
We would like to thank Ipsita Agarwal, Molly Przeworski, Jesse Engreitz, and members of the Pritchard Lab for valuable feedback and discussions. This work was supported by NIH grants R01HG011432 and R01HG008140.
Footnotes
Additional Declarations: There is NO Competing Interest.
Code availability
GeneBayes and code for estimating $s_{\mathrm{het}}$ are available at https://github.com/tkzeng/GeneBayes.
Data availability
Posterior means and 95% credible intervals for $s_{\mathrm{het}}$ are available in Supplementary Table 2. Posterior densities for $s_{\mathrm{het}}$ are available in Supplementary Table 3. A description of the gene features is available in Supplementary Table 4. These supplementary tables are also available at [74], along with likelihoods for $s_{\mathrm{het}}$, LOF variants with misannotation probabilities, and gene feature tables.
References
- [1].Cassa CA, Weghorn D, Balick DJ, Jordan DM, Nusinow D, Samocha KE, et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nature Genetics. 2017;49(5):806–10.
- [2].Weghorn D, Balick DJ, Cassa C, Kosmicki JA, Daly MJ, Beier DR, et al. Applicability of the Mutation–Selection Balance Model to Population Genetics of Heterozygous Protein-Truncating Variants in Humans. Molecular Biology and Evolution. 2019;36(8):1701–10.
- [3].Fuller ZL, Berg JJ, Mostafavi H, Sella G, Przeworski M. Measuring intolerance to mutation in human genetics. Nature Genetics. 2019;51(5):772–6.
- [4].Agarwal I, Fuller ZL, Myers SR, Przeworski M. Relating pathogenic loss-of-function mutations in humans to their evolutionary fitness costs. eLife. 2023;12:e83172.
- [5].Kaplanis J, Samocha KE, Wiel L, Zhang Z, Arvai KJ, Eberhardt RY, et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature. 2020;586(7831):757–62.
- [6].Fu JM, Satterstrom FK, Peng M, Brand H, Collins RL, Dong S, et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nature Genetics. 2022;54(9):1320–31.
- [7].Whiffin N, Armean IM, Kleinman A, Marshall JL, Minikel EV, Goodrich JK, et al. The effect of LRRK2 loss-of-function variants in humans. Nature Medicine. 2020;26(6):869–77.
- [8].Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nature Genetics. 2022;54(6):827–36.
- [9].Wang X, Goldstein DB. Enhancer domains predict gene pathogenicity and inform gene discovery in complex disease. The American Journal of Human Genetics. 2020;106(2):215–33.
- [10].Mostafavi H, Spence JP, Naqvi S, Pritchard JK. Limited overlap of eQTLs and GWAS hits due to systematic differences in discovery. bioRxiv. 2022:2022–05.
- [11].Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–91.
- [12].Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43.
- [13].Gillespie JH. Population genetics: a concise guide. JHU press; 2004.
- [14].Gudmundsson S, Singer-Berk M, Watts NA, Phu W, Goodrich JK, Solomonson M, et al. Variant interpretation using population databases: Lessons from gnomAD. Human Mutation. 2022;43(8):1012–30.
- [15].LaPolice TM, Huang YF. A deep learning framework for predicting human essential genes from population and functional genomic data. bioRxiv. 2021:2021–12.
- [16].Spence J, Zeng T, Mostafavi H, Pritchard J. Scaling the discrete-time Wright-Fisher model to biobank-scale datasets. bioRxiv. 2023.
- [17].Duan T, Anand A, Ding DY, Thai KK, Basu S, Ng A, et al. NGBoost: Natural gradient boosting for probabilistic prediction. In: International Conference on Machine Learning. PMLR; 2020. p. 2690–700.
- [18].Ewens WJ. Mathematical population genetics: theoretical introduction. vol. 27. Springer; 2004.
- [19].Agarwal I, Przeworski M. Mutation saturation for fitness effects at human CpG sites. eLife. 2021;10:e71513.
- [20].Haghighi K, Kolokathis F, Pater L, Lynch RA, Asahi M, Gramolini AO, et al. Human phospholamban null results in lethal dilated cardiomyopathy revealing a critical difference between mouse and human. The Journal of Clinical Investigation. 2003;111(6):869–76.
- [21].Howard TD, Paznekas WA, Green ED, Chiang LC, Ma N, Luna RIOD, et al. Mutations in TWIST, a basic helix–loop–helix transcription factor, in Saethre-Chotzen syndrome. Nature Genetics. 1997;15(1):36–41.
- [22].Ghouzzi VE, Merrer ML, Perrin-Schmitt F, Lajeunie E, Benit P, Renier D, et al. Mutations of the TWIST gene in the Saethre-Chotzen syndrome. Nature Genetics. 1997;15(1):42–6.
- [23].Da Costa L, Leblanc T, Mohandas N. Diamond-Blackfan anemia. Blood. 2020;136(11):1262–73.
- [24].des Portes V, Pinard JM, Billuart P, Vinet MC, Koulakoff A, Carrié A, et al. A novel CNS gene required for neuronal migration and involved in X-linked subcortical laminar heterotopia and lissencephaly syndrome. Cell. 1998;92(1):51–61.
- [25].Fantes J, Ragge NK, Lynch SA, McGill NI, Collin JRO, Howard-Peebles PN, et al. Mutations in SOX2 cause anophthalmia. Nature Genetics. 2003;33(4):462–3.
- [26].Berger W, van de Pol D, Warburg M, Gal A, Bleeker-Wagemakers L, de Silva H, et al. Mutations in the candidate gene for Norrie disease. Human Molecular Genetics. 1992;1(7):461–5.
- [27].Faundes V, Jennings MD, Crilly S, Legraie S, Withers SE, Cuvertino S, et al. Impaired eIF5A function causes a Mendelian disorder that is partially rescued in model systems by spermidine. Nature Communications. 2021;12(1):833.
- [28].Hatada I, Ohashi H, Fukushima Y, Kaneko Y, Inoue M, Komoto Y, et al. An imprinted gene p57 KIP2 is mutated in Beckwith–Wiedemann syndrome. Nature Genetics. 1996;14(2):171–3.
- [29].Gripp KW, Wotton D, Edwards MC, Roessler E, Ades L, Meinecke P, et al. Mutations in TGIF cause holoprosencephaly and link NODAL signalling to human neural axis determination. Nature Genetics. 2000;25(2):205–8.
- [30].Coffey AJ, Brooksbank RA, Brandau O, Oohashi T, Howell GR, Bye JM, et al. Host response to EBV infection in X-linked lymphoproliferative disease results from mutations in an SH2-domain encoding gene. Nature Genetics. 1998;20(2):129–35.
- [31].Smith ML, Cavenagh JD, Lister TA, Fitzgibbon J. Mutation of CEBPA in familial acute myeloid leukemia. New England Journal of Medicine. 2004;351(23):2403–7.
- [32].Garg V, Kathiriya IS, Barnes R, Schluterman MK, King IN, Butler CA, et al. GATA4 mutations cause human congenital heart defects and reveal an interaction with TBX5. Nature. 2003;424(6947):443–7.
- [33].Langton KP, McKie N, Curtis A, Goodship JA, Bond PM, Barker MD, et al. A novel tissue inhibitor of metalloproteinases-3 mutation reveals a common molecular phenotype in Sorsby’s fundus dystrophy. Journal of Biological Chemistry. 2000;275(35):27027–31.
- [34].Fang J, Dagenais SL, Erickson RP, Arlt MF, Glynn MW, Gorski JL, et al. Mutations in FOXC2 (MFH-1), a forkhead family transcription factor, are responsible for the hereditary lymphedema-distichiasis syndrome. The American Journal of Human Genetics. 2000;67(6):1382–8.
- [35].Begemann M, Zirn B, Santen G, Wirthgen E, Soellner L, Büttel HM, et al. Paternally inherited IGF2 mutation and growth restriction. New England Journal of Medicine. 2015;373(4):349–56.
- [36].Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Research. 2015;43(D1):D789–98.
- [37].Meyers RM, Bryan JG, McFarland JM, Weir BA, Sizemore AE, Xu H, et al. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nature Genetics. 2017;49(12):1779–84.
- [38].Ghandi M, Huang FW, Jané-Valbuena J, Kryukov GV, Lo CC, McDonald III ER, et al. Next-generation characterization of the cancer cell line encyclopedia. Nature. 2019;569(7757):503–8.
- [39].Wright CF, Campbell P, Eberhardt RY, Aitken S, Perrett D, Brent S, et al. Genomic Diagnosis of Rare Pediatric Disease in the United Kingdom and Ireland. New England Journal of Medicine. 2023.
- [40].Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, et al. The human phenotype ontology in 2021. Nucleic Acids Research. 2021;49(D1):D1207–17.
- [41].Leitão E, Schröder C, Parenti I, Dalle C, Rastetter A, Kühnel T, et al. Systematic analysis and prediction of genes associated with monogenic disorders on human chromosome X. Nature Communications. 2022;13(1):6570. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Agoglia RM, Sun D, Birey F, Yoon SJ, Miura Y, Sabatini K, et al. Primate cell fusion disentangles gene regulatory divergence in neurodevelopment. Nature. 2021;592(7854):421–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Consortium G. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44]. Basha O, Argov CM, Artzy R, Zoabi Y, Hekselman I, Alfandari L, et al. Differential network analysis of multiple human tissue interactomes highlights tissue-selective processes and genetic disorder genes. Bioinformatics. 2020;36(9):2821–8.
- [45]. Gao S, Yan L, Wang R, Li J, Yong J, Zhou X, et al. Tracing the temporal-spatial transcriptome landscapes of the human fetal digestive tract using single-cell RNA-sequencing. Nature Cell Biology. 2018;20(6):721–34.
- [46]. Charlesworth B, et al. Evolution in age-structured populations. vol. 2. Cambridge University Press; Cambridge; 1994.
- [47]. Barrio-Hernandez I, Schwartzentruber J, Shrivastava A, Del-Toro N, Gonzalez A, Zhang Q, et al. Network expansion of genetic associations defines a pleiotropy map of human cell biology. Nature Genetics. 2023:1–10.
- [48]. Van Dam S, Vosa U, van der Graaf A, Franke L, de Magalhaes JP. Gene co-expression analysis for functional classification and gene–disease predictions. Briefings in Bioinformatics. 2018;19(4):575–92.
- [49]. Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, et al. Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593(7858):238–43.
- [50]. Mayr C. Regulation by 3’-untranslated regions. Annual Review of Genetics. 2017;51:171–94.
- [51]. Leppek K, Das R, Barna M. Functional 5’ UTR mRNA structures in eukaryotic translation regulation and how to find them. Nature Reviews Molecular Cell Biology. 2018;19(3):158–74.
- [52]. Agrawal AF, Whitlock MC. Inferences about the distribution of dominance drawn from yeast gene knockout data. Genetics. 2011;187(2):553–66.
- [53]. Mukai T, Chigusa SI, Mettler L, Crow JF. Mutation rate and dominance of genes affecting viability in Drosophila melanogaster. Genetics. 1972;72(2):335–55.
- [54]. Sella G, Barton NH. Thinking about the evolution of complex traits in the era of genome-wide association studies. Annual Review of Genomics and Human Genetics. 2019;20:461–93.
- [55]. Charlesworth B. Effective population size and patterns of molecular evolution and variation. Nature Reviews Genetics. 2009;10(3):195–205.
- [56]. Simons YB, Mostafavi H, Smith CJ, Pritchard JK, Sella G. Simple scaling laws control the genetic architectures of human complex traits. bioRxiv. 2022:2022–10.
- [57]. Mathieson I, Terhorst J. Direct detection of natural selection in Bronze Age Britain. Genome Research. 2022;32(11–12):2057–67.
- [58]. Emdin CA, Khera AV, Natarajan P, Klarin D, Won HH, Peloso GM, et al. Phenotypic characterization of genetically lowered human lipoprotein (a) levels. Journal of the American College of Cardiology. 2016;68(25):2761–72.
- [59]. Langsted A, Nordestgaard BG, Kamstrup PR. Low lipoprotein (a) levels and risk of disease in a large, contemporary, general population study. European Heart Journal. 2021;42(12):1147–56.
- [60]. Rausell A, Luo Y, Lopez M, Seeleuthner Y, Rapaport F, Favier A, et al. Common homozygosity for predicted loss-of-function variants reveals both redundant and advantageous effects of dispensable human genes. Proceedings of the National Academy of Sciences. 2020;117(24):13626–36.
- [61]. Reyes-Soffer G, Ginsberg HN, Berglund L, Duell PB, Heffron SP, Kamstrup PR, et al. Lipoprotein (a): a genetically determined, causal, and prevalent risk factor for atherosclerotic cardiovascular disease: a scientific statement from the American Heart Association. Arteriosclerosis, Thrombosis, and Vascular Biology. 2022;42(1):e48–60.
- [62]. Millar DS, Johansen B, Berntorp E, Minford A, Bolton-Maggs P, Wensley R, et al. Molecular genetic analysis of severe protein C deficiency. Human Genetics. 2000;106:646–53.
- [63]. Romeo G, Hassan HJ, Staempfli S, Roncuzzi L, Cianetti L, Leonardi A, et al. Hereditary thrombophilia: identification of nonsense and missense mutations in the protein C gene. Proceedings of the National Academy of Sciences. 1987;84(9):2829–32.
- [64]. Landrum MJ, Chitipiralla S, Brown GR, Chen C, Gu B, Hart J, et al. ClinVar: improvements to accessing data. Nucleic Acids Research. 2020;48(D1):D835–44.
- [65]. Couch FJ, Nathanson KL, Offit K. Two decades after BRCA: setting paradigms in personalized cancer care and prevention. Science. 2014;343(6178):1466–70.
- [66]. Smith KR, Hanson HA, Hollingshaus MS. BRCA1 and BRCA2 mutations and female fertility. Current Opinion in Obstetrics & Gynecology. 2013;25(3):207.
- [67]. O’Connor LJ, Schoech AP, Hormozdiari F, Gazal S, Patterson N, Price AL. Extreme polygenicity of complex traits is explained by negative selection. The American Journal of Human Genetics. 2019;105(3):456–76.
- [68]. Benton ML, Abraham A, LaBella AL, Abbot P, Rokas A, Capra JA. The influence of evolutionary history on human health and disease. Nature Reviews Genetics. 2021;22(5):269–83.
- [69]. Satterstrom FK, Kosmicki JA, Wang J, Breen MS, De Rubeis S, An JY, et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell. 2020;180(3):568–84.
- [70]. Gardner EJ, Neville MD, Samocha KE, Barclay K, Kolk M, Niemi ME, et al. Reduced reproductive success is associated with selective constraint on human genes. Nature. 2022;603(7903):858–63.
- [71]. He X, Sanders SJ, Liu L, De Rubeis S, Lim ET, Sutcliffe JS, et al. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genetics. 2013;9(8):e1003671.
- [72]. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The Annals of Applied Statistics. 2017;11(3):1561.
- [73]. Boyeau P, Regier J, Gayoso A, Jordan MI, Lopez R, Yosef N. An empirical Bayes method for differential expression analysis of single cells with deep generative models. bioRxiv. 2022:2022–05.
- [74]. Zeng T, Spence JP, Mostafavi H, Pritchard JK. s_het estimates from GeneBayes and other supplementary datasets. Zenodo; 2023. Available from: 10.5281/zenodo.7939768.
- [75]. Schiffels S, Durbin R. Inferring human population size and separation history from multiple genome sequences. Nature Genetics. 2014;46(8):919–25.
- [76]. Cummings BB, Karczewski KJ, Kosmicki JA, Seaby EG, Watts NA, Singer-Berk M, et al. Transcript expression-aware annotation improves rare variant interpretation. Nature. 2020;581(7809):452–8.
- [77]. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. The ensembl variant effect predictor. Genome Biology. 2016;17(1):1–14.
- [78]. Frankish A, Carbonell-Sala S, Diekhans M, Jungreis I, Loveland JE, Mudge JM, et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Research. 2023;51(D1):D942–9.
- [79]. Olson ND, Wagner J, McDaniel J, Stephens SH, Westreich ST, Prasanna AG, et al. PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions. Cell Genomics. 2022;2(5):100129.
- [80]. Blake JA, Baldarelli R, Kadin JA, Richardson JE, Smith CL, Bult CJ. Mouse Genome Database (MGD): Knowledgebase for mouse–human comparative biology. Nucleic Acids Research. 2021;49(D1):D981–7.
- [81]. Groza T, Gomez FL, Mashhadi HH, Muñoz-Fuentes V, Gunes O, Wilson R, et al. The International Mouse Phenotyping Consortium: comprehensive knockout phenotyping underpinning the study of human disease. Nucleic Acids Research. 2023;51(D1):D1038–45.
- [82]. Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Molecular Systems Biology. 2014;10(7):733.
- [83]. Blomen VA, Májek P, Jae LT, Bigenzahn JW, Nieuwenhuis J, Staring J, et al. Gene essentiality and synthetic lethality in haploid human cells. Science. 2015;350(6264):1092–6.
- [84]. Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, et al. A framework for the interpretation of de novo mutation in human disease. Nature Genetics. 2014;46(9):944–50.
- [85]. Amari SI. Natural Gradient Works Efficiently in Learning. Neural Computation. 1998;10(2):251–76.
- [86]. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. 2019;32.
- [87]. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. 2017.
- [88]. Gómez P, Toftevaag HH, Meoni G. torchquad: Numerical Integration in Arbitrary Dimensions with PyTorch. Journal of Open Source Software. 2021;6(64):3439.
- [89]. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–94.
- [90]. Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132(4):1161–76.
- [91]. Harpak A, Bhaskar A, Pritchard JK. Mutation rate variation is a primary determinant of the distribution of allele frequencies in humans. PLoS Genetics. 2016;12(12):e1006489.
- [92]. Varin C, Reid N, Firth D. An overview of composite likelihood methods. Statistica Sinica. 2011:5–42.
- [93]. Ramoni RB, Mulvihill JJ, Adams DR, Allard P, Ashley EA, Bernstein JA, et al. The undiagnosed diseases network: accelerating discovery about health and disease. The American Journal of Human Genetics. 2017;100(2):185–92.
- [94]. 1000 Genomes Project Consortium, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56.
- [95]. Lee Y, Nelder JA. Hierarchical generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(4):619–56.
- [96]. Meng XL. Decoding the h-likelihood. Statistical Science. 2009;24(3):280–93.
- [97]. Weeks EM, Ulirsch JC, Cheng NY, Trippe BL, Fine RS, Miao J, et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. medRxiv. 2020:2020–09.
- [98]. Boukas L, Bjornsson HT, Hansen KD. Promoter CpG density predicts downstream gene loss-of-function intolerance. The American Journal of Human Genetics. 2020;107(3):487–98.
- [99]. Pers TH, Karjalainen JM, Chan Y, Westra HJ, Wood AR, Yang J, et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nature Communications. 2015;6(1):5890.
- [100]. Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Research. 2021;49(D1):D325–34.
- [101]. Raina P, Guinea R, Chatsirisupachai K, Lopes I, Farooq Z, Guinea C, et al. GeneFriends: gene co-expression databases and tools for humans and model organisms. Nucleic Acids Research. 2023;51(D1):D145–58.
- [102]. GTEx Consortium, Ardlie KG, Deluca DS, Segrè AV, Sullivan TJ, Young TR, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348(6235):648–60.
- [103]. FANTOM Consortium and the RIKEN PMI and CLST (DGT), et al. A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462–70.
- [104]. Fulco CP, Nasser J, Jones TR, Munson G, Bergman DT, Subramanian V, et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature Genetics. 2019;51(12):1664–9.
- [105]. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–30.
- [106]. Liu Y, Sarkar A, Kheradpour P, Ernst J, Kellis M. Evidence of reduced recombination rate in human regulatory domains. Genome Biology. 2017;18(1):1–11.
- [107]. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005;15(8):1034–50.
- [108]. Sullivan PF, Meadows JR, Gazal S, Phan BN, Li X, Genereux DP, et al. Leveraging base-pair mammalian constraint to understand genetic variation and human disease. Science. 2023;380(6643):eabn2937.
- [109]. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research. 2010;20(1):110–21.
- [110]. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;44(10):7112–27.
- [111]. Stärk H, Dallago C, Heinzinger M, Rost B. Light attention predicts protein location from the language of life. Bioinformatics Advances. 2021;1(1):vbab035.
- [112]. Huang YF. Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genetics. 2020;16(7):e1008922.
Data Availability Statement
Posterior means and 95% credible intervals for s_het are available in Supplementary Table 2. Posterior densities for s_het are available in Supplementary Table 3. A description of the gene features is available in Supplementary Table 4. These supplementary tables are also available at [74], along with likelihoods for s_het, LOF variants with misannotation probabilities, and gene feature tables.