PLOS Genetics. 2020 Jul 15;16(7):e1008922. doi: 10.1371/journal.pgen.1008922

Unified inference of missense variant effects and gene constraints in the human genome

Yi-Fei Huang 1,2,*
Editor: Scott M Williams3
PMCID: PMC7384676  PMID: 32667917

Abstract

A challenge in medical genomics is to identify variants and genes associated with severe genetic disorders. Based on the premise that severe, early-onset disorders often result in a reduction of evolutionary fitness, several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows improved performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders, and feature importance analysis suggests that both gene-level selective constraints and variant-level predictors are important for accurate variant prioritization. Furthermore, based on UNEECON, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations, which can be partially explained by the prevalence of disordered protein regions that are highly tolerant to missense mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in the central nervous system and the autism spectrum disorders. Overall, UNEECON is a promising framework for both variant and gene prioritization.

Author summary

Numerous statistical methods have been developed to predict deleterious missense variants or constrained genes in the human genome, but unified prioritization methods that utilize both variant- and gene-level information are underdeveloped. Here we present UNEECON, an evolution-based deep learning framework for unified variant and gene prioritization. By integrating variant-level predictors and gene-level selective constraints, UNEECON outperforms existing methods in predicting missense variants and protein-coding genes associated with dominant disorders. Based on UNEECON, we show that disordered proteins are tolerant to missense mutations but not to loss-of-function mutations. In addition, we find that genes under strong selective constraints at both missense and loss-of-function levels are strongly associated with the central nervous system and the autism spectrum disorders, highlighting the need to investigate the function of these highly constrained genes in future studies.

Introduction

A fundamental question in biology is to understand how genomic variation contributes to phenotypic variation and disease risk. While millions of protein-altering variants have been identified in the human genome, it is challenging to assess the functional and clinical significance of these variants. In particular, a large fraction of missense variants have been annotated as “variants of uncertain significance” (VUS) [1, 2], forming a major hurdle for both basic research and medical practice. This problem is further exacerbated by the difficulty of experimentally validating the function of large numbers of missense variants in vivo. Therefore, there is a tremendous need for accurate computational tools to prioritize deleterious missense variants [3].

Since early-onset, severe genetic disorders are often associated with a reduction of evolutionary fitness, signatures of negative (purifying) selection, such as sequence conservation, have been widely used to predict deleterious variants associated with Mendelian disorders [4–12]. Among existing evolutionary approaches for variant prioritization, recently developed integrative methods are particularly powerful [8–15]. By learning a linear or nonlinear mathematical function from predictive variant features, such as sequence conservation scores and protein structural features, to the strength of negative selection, these statistical methods estimate negative selection on observed and potential mutations in the human genome. The estimated strength of negative selection can then be utilized to prioritize deleterious variants associated with severe genetic disorders. Because these evolutionary approaches are trained on the abundant natural polymorphisms observed in healthy individuals instead of sparsely annotated pathogenic variants, they have shown good performance in predicting pathogenic variants, frequently matching or outperforming supervised machine learning models trained on disease data [8, 10–12].

Despite the success of evolution-based metrics of variant effects, existing methods suffer from a few critical limitations. First, most existing methods focus on learning a shared mathematical function from predictive variant features to negative selection and assume that this function is equally applicable to all protein-coding genes. In practice, a subset of genes can depart from the genome-wide trend between variant features and negative selection, possibly due to enhanced or relaxed purifying selection on these genes in the human lineage [12]. Second, most existing methods are trained on common genetic variants, making it challenging to distinguish strong negative selection associated with severe genetic diseases from moderate negative selection without clinical implications. Third, these methods are typically agnostic to the mode of inheritance and, therefore, may be suboptimal in the prediction of deleterious variants associated with dominant disorders, on which we focus in this work.

In parallel with the development of variant-level interpretation methods, several complementary, gene-centric methods have been proposed to predict protein-coding genes associated with dominant genetic disorders [16–23]. Unlike variant-level predictors, the gene-level prioritization methods seek to identify constrained genes that are intolerant to heterozygous nonsynonymous mutations. These gene-level constraint metrics have been shown to provide complementary information on variant effects and have been successfully used to prioritize variants associated with dominant genetic disorders [16, 17, 24]. However, these methods typically assume that all the missense mutations in a gene or a genic region have identical effects and, therefore, may not be able to distinguish pathogenic missense variants from proximal benign missense variants.

Since variant-level and gene-level prioritization methods leverage complementary signatures of negative selection, unifying the two lines of research for joint inference of variant effects and gene constraints should be beneficial. Recently, a few studies have tried to address this question [25–27], but all of these methods are supervised machine learning models trained on disease variants. Therefore, we lack an evolution-based statistical framework to combine predictive variant features and gene-level selective constraints for variant and gene prioritization. Based on a novel deep learning framework, i.e., a deep mixed-effects model, we develop UNEECON (UNified inferencE of variant Effects and gene CONstraints), an evolution-based framework to predict deleterious variants and constrained genes from both variant features and gene-level intolerance to missense mutations. By integrating 30 predictive variant features and genomic variation from 141,456 human genomes [28, 29], UNEECON outperforms existing variant effect predictors and gene constraint scores in predicting pathogenic missense variants with a dominant mode of inheritance. In addition, deleterious de novo variants predicted by UNEECON are strongly enriched in individuals affected by severe developmental disorders [30, 31], highlighting its power for interpreting the effects of de novo mutations. Furthermore, UNEECON provides estimates of gene-level selective constraints (UNEECON-G scores) for all protein-coding genes. In the setting of gene prioritization, UNEECON-G scores show better performance than previous gene constraint scores in predicting human essential genes [32], mouse essential genes [33, 34], autosomal dominant disease genes [35, 36], and haploinsufficient genes [37]. UNEECON is a powerful framework for both variant and gene prioritization.

Results

UNEECON integrates variant- and gene-level information to predict detrimental variants and constrained genes

The key idea of UNEECON is to combine variant-level predictive features, such as sequence conservation scores and protein structural features, and gene-level signatures of selective constraints to infer negative selection on every potential missense mutation in the human genome. Inspired by classical sequence conservation models [38–42], which use site-specific substitution rate as a proxy of negative selection, we utilize the relative probability of the occurrence of a potential missense mutation in human populations, compared to neutral mutations, as an allele-specific predictor of negative selection.

We first fit a mutation model to calculate μij, i.e., the probability of the occurrence of mutation i in gene j when selection is absent. Our mutation model is trained on putatively neutral mutations in the gnomAD data [29] and captures the impact of multiple neutral and technical factors, including the 7-mer sequence context centered on the focal site, the identity of the alternative allele, the local mutation rate, and the average sequencing coverage, on the probability of occurrence of a neutral mutation. At the gene level, the number of synonymous mutations predicted by the mutation model was nearly perfectly correlated with the observed number of synonymous mutations (Spearman’s ρ = 0.976; Fig A in S1 File), suggesting that the mutation model can serve as a proper baseline for inferring negative selection.
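The gene-level calibration check described above can be illustrated with a minimal sketch. It assumes that the predicted number of synonymous mutations in a gene is the sum of the per-site neutral occurrence probabilities, which is a natural reading of the text but is not spelled out in it. The gene names, probabilities, and observed counts are toy values, and all function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumption): per-site neutral probabilities of synonymous mutations.
mu_by_gene = {
    "geneA": [0.01, 0.02, 0.03],
    "geneB": [0.10, 0.20],
    "geneC": [0.05, 0.05, 0.05, 0.05],
}
observed_counts = {"geneA": 0, "geneB": 2, "geneC": 1}

genes = sorted(mu_by_gene)
expected = [sum(mu_by_gene[g]) for g in genes]  # expected count = sum of probabilities
observed = [observed_counts[g] for g in genes]
rho = spearman_rho(expected, observed)
```

A rank-based statistic is a reasonable choice here because expected and observed counts span several orders of magnitude across genes of different lengths.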

Given the mutation model, UNEECON employs a novel deep learning framework, deep mixed-effects model, to infer the relative probability of the occurrence of each potential missense mutation (Fig 1). In more detail, we denote ηij as the relative probability of the occurrence of missense mutation i in gene j with respect to the neutral occurrence probability, μij. ηij captures the impact of natural selection, instead of mutation and genetic drift, on the occurrence of a missense mutation. Analogous to generalized linear mixed models, we assume that ηij depends on the sum of a variant-level fixed effect, zij, and a gene-level random effect, uj. We assume that zij captures the contribution of predictive variant features to negative selection. Denoting Xij as the vector of predictive features associated with mutation i in gene j, we use a feedforward neural network to model the relationship between Xij and zij (Fig 1). Furthermore, we assume that the random-effect term, uj, is a Gaussian (normal) random variable which models the variation of gene-level constraints that is not predictable from feature vector Xij. We then perform a logistic transformation on the sum of zij and uj to obtain ηij. Finally, we multiply ηij and μij to obtain the occurrence probability of missense mutation i in the gnomAD exome sequencing data.
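The forward computation described above can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the single hidden layer, the ReLU activation, the toy weights, and all function names are hypothetical. Only the structure (a feedforward network for zij, an additive gene-level term uj, a logistic transformation, and multiplication by μij) follows the text.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fixed_effect(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Map the feature vector X_ij to the variant-level fixed effect z_ij.

    Assumption: one hidden layer with ReLU units; the real architecture may differ.
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


def occurrence_probability(z, u, mu):
    """Return eta_ij * mu_ij, where eta_ij = logistic(z_ij + u_j)."""
    eta = sigmoid(z + u)
    return eta * mu


# Toy parameters (assumption) for a two-feature example.
features = [0.5, -1.0]
z = fixed_effect(
    features,
    hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
mu = 0.02  # neutral occurrence probability from the mutation model
p_average_gene = occurrence_probability(z, u=0.0, mu=mu)
p_constrained_gene = occurrence_probability(z, u=-2.0, mu=mu)
```

Because ηij is a relative occurrence probability, a lower value of zij + uj corresponds to stronger negative selection, so a gene with a more negative uj yields a lower predicted occurrence probability for the same variant features.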

Fig 1. Overview of the UNEECON model.

UNEECON estimates negative selection on missense mutation i in gene j based on the relative probability of the occurrence of the missense mutation, ηij, compared to the occurrence probability of neutral mutations, μij. ηij depends on the sum of a variant-level fixed effect, zij, and a gene-level random effect, uj. We assume that zij captures the contribution of variant-level features, Xij, to negative selection, and model the relationship between Xij and zij with a feedforward neural network. We assume that uj is a Gaussian random variable modeling the gene-level variation of selective constraints that cannot be predicted from variant features. The sum of zij and uj is then sent to a logistic function to obtain ηij. The neutral occurrence probability, μij, is from a context-dependent mutation model trained on putatively neutral mutations. Free parameters of the UNEECON model are estimated by minimizing the discrepancy between the predicted occurrence probability, ηijμij, and the observed occurrence of each potential missense mutation in the gnomAD exome sequencing data [29].

We estimate the parameters of the UNEECON model by minimizing the discrepancy between the predicted occurrence probability, ηijμij, and the observed presence/absence of each potential missense mutation, which is equivalent to maximizing a Bernoulli likelihood function. We also employ regularization techniques, including dropout [43] and early stopping [44], to avoid overfitting. After training, we calculate variant-effect scores (UNEECON scores) and gene-level intolerance to missense mutations (UNEECON-G scores) defined as the expected reduction of the occurrence probability at variant and gene levels, respectively. Both UNEECON and UNEECON-G scores range from 0 to 1, with higher scores suggesting stronger negative selection.
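As a minimal sketch of the objective and the score definition above: the first function computes the negative Bernoulli log-likelihood of the observed presence/absence indicators, and the second computes an expected reduction of the occurrence probability as 1 minus the average of ηij over sampled values of the gene-level effect. How the expectation over uj is taken in the actual method (for example, over a prior or a posterior distribution) is not stated in the text, so the sampled values passed in here are an assumption, as are the function names.

```python
import math


def bernoulli_nll(predicted_probs, observed):
    """Negative log-likelihood of 0/1 observations given predicted probabilities."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(predicted_probs, observed):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def logistic(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_reduction(z, u_samples):
    """Score = 1 - E[eta], with the expectation approximated over u_samples.

    Assumption: u_samples are draws of the gene-level random effect from
    whichever distribution the analysis calls for.
    """
    mean_eta = sum(logistic(z + u) for u in u_samples) / len(u_samples)
    return 1.0 - mean_eta


# Toy example: two potential mutations, one observed and one absent.
nll = bernoulli_nll([0.5, 0.5], [1, 0])
score = expected_reduction(z=0.0, u_samples=[-1.0, 0.0, 1.0])
```

Minimizing this negative log-likelihood is the same as maximizing the Bernoulli likelihood, and the score lies between 0 and 1 because each ηij does.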

UNEECON scores capture variation of negative selection within and across genes

We trained the UNEECON model with 30 missense variant features, including conservation scores, protein structural features, and regulatory features (Table A in S1 File), and rare missense variants with a minor allele frequency (MAF) lower than 0.1% in the gnomAD dataset [29]. After the training process, we first compared the distributions of UNEECON scores across potential missense mutations in haploinsufficient genes [37], autosomal dominant disease genes [35, 36], autosomal recessive disease genes [35, 36], and olfactory receptor genes [45]. In agreement with previous studies [16, 28, 35], missense mutations in autosomal dominant disease genes had higher UNEECON scores than those in autosomal recessive disease genes (Fig 2a), suggesting a stronger selection on heterozygous missense mutations in the genes associated with autosomal dominant disorders. Also, we observed that missense mutations in olfactory receptor genes had much lower UNEECON scores (Fig 2a), potentially due to relaxed purifying selection or enhanced positive selection in the human lineage [46].

Fig 2. Distributions of UNEECON scores across potential missense mutations.

(a) Distributions of UNEECON scores estimated for potential missense mutations in haploinsufficient (HI) genes [37], autosomal dominant disease genes [35, 36], autosomal recessive disease genes [35, 36], and olfactory receptor genes [45]. (b) Distributions of UNEECON scores estimated for potential missense mutations in various protein regions. The functional sites and protein secondary structures are based on UniProt annotations [47]. The predicted disordered protein regions are from MobiDB [48]. (c) Average UNEECON scores estimated for all codon positions in the CDKL5 protein. Each grey dot represents the UNEECON score averaged over all missense mutations in a codon position. Blue curve represents the locally estimated scatterplot smoothing (LOESS) fit. Blue and red dots represent pathogenic and benign missense variants from ClinVar [30], respectively. The horizontal line represents a constrained region reported in a previous study [25].

Second, we investigated whether the distributions of UNEECON scores varied across different types of protein regions annotated by UniProt [47, 49] and MobiDB [48]. We observed that the distributions of UNEECON scores were similar among α-helices, β-strands, and hydrogen-bonded turns (Fig 2b). In contrast, UNEECON scores were significantly lower in disordered protein regions. As expected, UNEECON scores were significantly higher in functional protein sites, including enzyme active sites and ligand binding sites (Fig 2b), suggesting much stronger purifying selection on these critical residues.

Interestingly, UNEECON scores showed a bimodal distribution in both enzyme active sites and ligand binding sites, with the first mode around 0.8 and the second mode around 0.2 (Fig 2b). Therefore, even though most enzyme active sites and ligand binding sites are believed to play a crucial role in maintaining the functional integrity of proteins, a substantial fraction of heterozygous missense mutations in these sites are not subject to strong negative selection. This result may arise because heterozygous mutations in recessive genes have a limited impact on protein function. In agreement with this explanation, functional protein sites in autosomal recessive disease genes had substantially lower UNEECON scores than those in haploinsufficient and autosomal dominant genes (Fig B in S1 File).

Furthermore, we used the human CDKL5 protein as an example to illustrate UNEECON’s capability to capture regional variation of negative selection within a gene. Similar to a previously published metric of regional missense constraints [25], UNEECON predicted that the N-terminus of CDKL5 was under strong negative selection (Fig 2c). In agreement with the evidence of strong selection, this region was also enriched for known pathogenic missense variants and depleted of benign missense variants (Fig 2c). Interestingly, UNEECON also predicted that the C-terminus of CDKL5 was under much weaker selection compared with the rest of the protein (Fig 2c). Accordingly, this unconstrained region was enriched for benign missense variants and depleted of pathogenic missense variants.

UNEECON scores accurately predict missense mutations associated with dominant genetic disorders

Since UNEECON measures the strength of negative selection on heterozygous missense mutations, we hypothesize that UNEECON scores are predictive of missense variants associated with dominant Mendelian disorders. To test this hypothesis, we compared the performance of UNEECON with eight existing methods in the setting of predicting autosomal dominant disease variants in ClinVar [30]. The eight previously published methods can be classified into three categories: 1) variant-level prediction methods, including LASSIE [12], CADD [8], Eigen [50], and PrimateAI [11]; 2) gene-level and region-level prediction methods, including RVIS [16], pLI [17], and CCR [22]; and 3) MPC [25], a supervised machine learning method that utilizes both variant features and regional constraint scores as input features. It is worth noting that several of the aforementioned methods, including UNEECON, used variants from the gnomAD dataset as part of their training data. If the same gnomAD variants were also used as negative controls in the evaluation of performance, we would overestimate the power of the methods trained on the gnomAD data. To avoid this problem, we used benign missense variants in ClinVar as negative controls and removed any ClinVar variants that are also present in gnomAD. Then, we matched the numbers of positives and negatives by random sampling without replacement.
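The evaluation protocol above can be sketched in a few lines. The variant identifiers, scores, and function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which is equivalent to the area under the receiver operating characteristic curve; the source does not say how the AUC was computed in practice.

```python
import random


def remove_overlap(variants, training_ids):
    """Drop (variant_id, score) pairs whose id also appears in the training data."""
    return [(vid, s) for vid, s in variants if vid not in training_ids]


def match_classes(positives, negatives, seed=0):
    """Downsample the larger class, without replacement, to equalize class sizes."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n), rng.sample(negatives, n)


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy data (assumption): (variant id, predicted score) pairs.
pathogenic = [("v1", 0.9), ("v2", 0.8), ("v3", 0.7)]
benign = [("v4", 0.1), ("v5", 0.3), ("v6", 0.2), ("v7", 0.6)]
gnomad_ids = {"v7"}  # benign variant also seen in the training data

pathogenic = remove_overlap(pathogenic, gnomad_ids)
benign = remove_overlap(benign, gnomad_ids)
pos, neg = match_classes(pathogenic, benign)
toy_auc = auc([s for _, s in pos], [s for _, s in neg])
```

Filtering before matching matters: it keeps variants seen during training from inflating the apparent performance of gnomAD-trained methods, and matching class sizes afterwards keeps the benchmark balanced.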

Overall, UNEECON outperformed previous methods in predicting ClinVar missense variants associated with autosomal dominant disorders (Fig 3a; Table B in S1 File). Among the previously published methods, variant-level predictors performed better than gene-level and region-level constraint scores. To test the robustness of these results, we constructed an alternative set of dominant pathogenic variants defined as all the pathogenic missense variants located in 709 genes associated with autosomal dominant diseases [35, 36]. UNEECON again outperformed the other methods in this dataset, even though the difference in performance between UNEECON and LASSIE was not statistically significant (Fig C in S1 File; Table B in S1 File).

Fig 3. Predictive power of various methods for distinguishing pathogenic missense variants from benign missense variants.

(a) Performance in predicting autosomal dominant pathogenic variants from ClinVar [30]. The true positive rate corresponds to the fraction of pathogenic variants exceeding a given threshold, and the true negative rate corresponds to the fraction of benign variants not exceeding it. AUC corresponds to the area under the receiver operating characteristic curve. (b) Enrichment of predicted deleterious de novo variants in individuals affected by developmental disorders [31]. The y-axis corresponds to the log2 odds ratio of the enrichment of predicted deleterious variants in the affected individuals for a given percentile threshold. The x-axis corresponds to the various percentile threshold values used in the enrichment analysis. Error bars represent the standard error of the log2 odds ratio.

Because some genes are better studied than others in medical genetics, known pathogenic variants are not evenly distributed across genes [51]. In the ClinVar database, the majority of protein-coding genes contain only a single class of variants, i.e., either “pathogenic-only” or “benign-only”. It is considerably more challenging to predict pathogenic variants in “mixed” genes that contain both pathogenic and benign variants [51]. We evaluated UNEECON on 157 autosomal dominant genes containing both pathogenic and benign variants. UNEECON again outperformed previous methods in separating pathogenic missense variants from benign missense variants in these “mixed” genes (Fig D1 in S1 File).

To evaluate the performance of UNEECON when the information of gene-level selective constraints is absent, we removed all protein-coding genes with at least one ClinVar pathogenic missense variant from the training data and retrained the UNEECON model on the new training set. Then, we evaluated the performance of this version of UNEECON in separating pathogenic missense variants from benign missense variants in the “mixed” genes containing both pathogenic and benign missense variants. It is worth noting that the “mixed” genes were not used in the training step. In the step of prediction, UNEECON effectively substituted the gene-level random-effect term with its genome-wide average, forcing UNEECON to make predictions solely based on variant-level features. Again, UNEECON outperformed previous methods in this setting (Fig D2 in S1 File), suggesting that UNEECON is a robust method and can still predict pathogenic variants when the information of gene-level selective constraints is absent.

While UNEECON was highly powerful in predicting autosomal dominant disease variants, it might not be as accurate when applied to predict recessive disease variants which are only detrimental in the homozygous state. To test this hypothesis, we evaluated UNEECON’s performance in predicting autosomal recessive disease variants in ClinVar [30]. As expected, UNEECON was outperformed by multiple methods, such as LASSIE [12] and Eigen [50], in this setting (Fig E in S1 File). Similar results were reached with an alternative set of recessive pathogenic variants defined as all the pathogenic missense variants located in 1,183 autosomal recessive genes [35, 36] (Fig F in S1 File).

As an orthogonal benchmark of UNEECON’s performance in predicting dominant disease variants, we investigated whether UNEECON was able to predict de novo missense mutations identified in individuals affected by severe developmental disorders [31]. We obtained de novo missense variants identified in affected individuals and healthy individuals from denovo-db [52]. Then, for multiple percentile rank cutoff values (top 10%, 20%, 30%, and 40%), we evaluated the enrichment of deleterious variants predicted by each method in the affected individuals. Overall, missense variants predicted by UNEECON and MPC showed the highest enrichments in the affected individuals (Fig 3b), suggesting that these two methods were more powerful in predicting de novo risk mutations. The performance gaps between UNEECON/MPC and the other methods were highest at the most stringent cutoff of 10% (Fig 3b).
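The enrichment analysis can be sketched as follows. Two points are assumptions, since the text does not specify them: the percentile cutoff is taken over the pooled scores of affected and healthy individuals, and the standard error uses the usual large-sample formula for a log odds ratio, rescaled to log2 units. The scores and function names are illustrative only.

```python
import math


def top_fraction_cutoff(scores, fraction):
    """Score of the k-th highest-ranked variant, where k is the requested top fraction."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return ranked[k - 1]


def log2_odds_ratio(case_hits, case_total, control_hits, control_total):
    """Log2 odds ratio and its standard error from a 2x2 table (all cells must be > 0)."""
    a, b = case_hits, case_total - case_hits
    c, d = control_hits, control_total - control_hits
    log_or = math.log2((a * d) / (b * c))
    # Large-sample SE of the natural-log odds ratio, converted to log2 units.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d) / math.log(2)
    return log_or, se


def enrichment(case_scores, control_scores, fraction):
    """Enrichment of top-scoring variants in cases relative to controls."""
    cutoff = top_fraction_cutoff(case_scores + control_scores, fraction)
    case_hits = sum(s >= cutoff for s in case_scores)
    control_hits = sum(s >= cutoff for s in control_scores)
    return log2_odds_ratio(case_hits, len(case_scores), control_hits, len(control_scores))


# Toy scores (assumption) for de novo variants in affected and healthy individuals.
affected = [0.9, 0.8, 0.7, 0.2, 0.1]
healthy = [0.6, 0.5, 0.4, 0.3, 0.05]
log_or, se = enrichment(affected, healthy, fraction=0.4)
```

With small counts, any cell of the 2x2 table can be zero at stringent cutoffs; a continuity correction would then be needed, which this sketch deliberately omits.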

UNEECON uses a data-driven approach to combine variant-level features and gene-level selective constraints. Alternatively, variant-level and gene-level predictors can be used as two successive filters in the same variant prioritization pipeline. We compared UNEECON with such a heuristic method [16] in the setting of predicting ClinVar missense variants and de novo missense mutations associated with severe developmental disorders. The heuristic method converted RVIS and PolyPhen-2 scores into two binary predictors, and only the variants predicted by both RVIS and PolyPhen-2 were considered to be deleterious. Compared with RVIS and the heuristic method combining RVIS and PolyPhen-2, UNEECON showed significantly better performance in predicting pathogenic missense variants in ClinVar, and the de novo mutations predicted by UNEECON showed substantially stronger enrichment in the individuals affected by severe developmental disorders (Fig G in S1 File).
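For contrast with the data-driven combination, the successive-filter heuristic can be sketched as a logical AND of two binarized scores. The cutoffs and the direction of each score are parameters here and the values below are hypothetical; the thresholds used in the cited heuristic are not given in the text.

```python
def binarize(score, cutoff, higher_is_deleterious=True):
    """Turn a continuous score into a yes/no call at a fixed cutoff."""
    return score >= cutoff if higher_is_deleterious else score <= cutoff


def successive_filter(gene_score, variant_score, gene_cutoff, variant_cutoff,
                      gene_higher_is_deleterious=False):
    """Call a variant deleterious only if both the gene filter and the variant filter pass."""
    gene_flag = binarize(gene_score, gene_cutoff, gene_higher_is_deleterious)
    variant_flag = binarize(variant_score, variant_cutoff, True)
    return gene_flag and variant_flag


# Hypothetical cutoffs: genes with a score <= -1.0 count as intolerant,
# variants with a score >= 0.85 count as damaging.
call_both = successive_filter(-1.5, 0.95, gene_cutoff=-1.0, variant_cutoff=0.85)
call_gene_only = successive_filter(-1.5, 0.40, gene_cutoff=-1.0, variant_cutoff=0.85)
call_variant_only = successive_filter(0.3, 0.95, gene_cutoff=-1.0, variant_cutoff=0.85)
```

A hard AND of two thresholds discards the magnitude of each score, so a variant just below either cutoff is treated the same as one far below it. A model that sums continuous variant-level and gene-level effects does not have this limitation.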

UNEECON-G scores perform favorably in predicting disease genes and essential genes

The UNEECON model also provided UNEECON-G scores which measure gene-level intolerance to missense mutations. We compared the performance of UNEECON-G scores with alternative gene constraint scores, including RVIS [16], pLI [17], mis-z [17], and GDI [19], in the setting of predicting disease genes and essential genes. The disease gene sets included 709 autosomal dominant disease genes [35, 36] and 294 haploinsufficient genes [37]. The essential gene sets included 2,454 human orthologs of mouse essential genes [33, 34] and 683 human essential genes identified by CRISPR knockout experiments in human cell lines [32]. For each set of autosomal dominant disease genes, haploinsufficient genes, and mouse essential genes, we constructed a negative gene set by sampling a matched number of genes from the other genes in the human genome. For the essential genes identified by CRISPR in cell lines, we constructed a negative gene set by sampling a matched number of genes from the nonessential genes reported in the same study [32]. Overall, UNEECON-G significantly outperformed alternative gene-level metrics in predicting the four sets of essential and disease genes (Fig 4; Table C in S1 File).

Fig 4. Predictive power of various methods for distinguishing disease and essential genes from genes not likely to have strong phenotypic effects.

(a) Performance in predicting autosomal dominant disease genes [35, 36]. (b) Performance in predicting haploinsufficient genes [37]. (c) Performance in predicting human orthologs of mouse essential genes [33, 34]. (d) Performance in predicting human essential genes in cell lines [32]. The true positive rate corresponds to the fraction of positive genes exceeding a given threshold, and the true negative rate corresponds to the fraction of negative genes not exceeding it. AUC corresponds to the area under the receiver operating characteristic curve.

Both variant-level features and gene-level constraints are important for variant interpretation

To gain more insights into which components of the UNEECON model are critical for predicting detrimental variants, we characterized the contributions of the 30 variant features and the gene-level random effect to negative selection. Unfortunately, it is challenging to directly interpret the UNEECON model due to the nonlinearity introduced by the feedforward artificial neural network. To bypass this difficulty, we used a linear version of the UNEECON model as an approximation of the nonlinear UNEECON model. The linear UNEECON model has no hidden units and can be interpreted as a generalized linear mixed model. Then, we evaluated the importance of the variant-level features and the gene-level random effect in this linear surrogate model. The variant score predicted by the linear UNEECON model was nearly perfectly correlated with the UNEECON score predicted by the original nonlinear UNEECON model (Spearman’s ρ = 0.95), supporting the use of the linear UNEECON model as a surrogate for model interpretation.

In the linear surrogate model, the relative probability of the occurrence of mutation i in gene j, ηij, is a linear combination of variant features, Xij, and the gene-level random effect, uj, up to a logistic transformation. Analogous to the interpretation of canonical linear models, we defined the contribution score of a variant feature as the negative value of the weight (regression coefficient) associated with this feature, and similarly defined the contribution score of the gene-level random effect as the negative value of its standard deviation. Positive and negative contribution scores suggest that the corresponding variables are positively and negatively correlated with negative selection, respectively, and contribution scores near zero indicate that the corresponding variables are not important for predicting detrimental mutations.
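The definition of contribution scores above can be written down directly. The feature names, weights, and standard deviation in this sketch are made up for illustration, and ranking by absolute value is one plausible way to compare importance, not necessarily the one used for the published figure.

```python
def contribution_scores(feature_weights, random_effect_sd):
    """Contribution score = negative weight for each feature, negative SD for the random effect."""
    scores = {name: -w for name, w in feature_weights.items()}
    scores["gene_random_effect"] = -random_effect_sd
    return scores


def rank_by_importance(scores):
    """Order variables by the magnitude of their contribution score, largest first."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Hypothetical fitted values of a linear surrogate model.
weights = {"conservation_score": -0.8, "helix_probability": -0.3, "noise_feature": 0.01}
scores = contribution_scores(weights, random_effect_sd=1.2)
ranking = rank_by_importance(scores)
```

The sign flip follows from the model structure: a negative weight lowers ηij, the relative occurrence probability, and therefore corresponds to stronger negative selection. Weights are only comparable across features when the features are on a common scale, which this sketch takes for granted.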

As shown in Fig H in S1 File, the gene-level random effect was the most important feature for predicting deleterious variants, highlighting the importance of gene-level constraints for variant interpretation. In addition, several conservation scores, such as the qualitative predictions from LRT, PROVEAN, and SIFT, and a subset of protein structural features, such as the predicted probabilities of forming various protein secondary structures, were also important for accurate prediction of deleterious variants. Even though these variant features individually were less important than the gene-level random effect, cumulatively they explained a significant fraction of the variation of negative selection across missense mutations. Therefore, both variant features and gene-level constraints are important for predicting deleterious variants.

Missense constraints and loss-of-function constraints are weakly correlated across genes

The UNEECON-G score represents gene-level intolerance to missense mutations in human populations. Alternatively, selective constraint on a gene can be defined as its degree of intolerance to loss-of-function mutations, such as stop-gained, frameshift, and splice-site mutations [17, 28, 29]. A recent study suggested that gene-level intolerance to loss-of-function mutations may not be a reliable predictor of pathogenic missense variants [53], implying a weak correlation between gene-level intolerance to loss-of-function mutations and that to missense mutations. To test this hypothesis at a genome-wide scale, we investigated the correlation between the UNEECON-G score and the pLI score, a metric of gene-level intolerance to loss-of-function mutations [17]. As expected, the UNEECON-G score was only moderately correlated with the pLI score (Spearman’s ρ = 0.59; Fig 5a).

Fig 5. Distributions of gene-level intolerance to missense and to loss-of-function mutations.

(a) Correlation between gene-level intolerance to missense mutations (UNEECON-G score) and that to loss-of-function (LOF) mutations (pLI score). Blue dots represent 956 genes intolerant to both missense and LOF mutations. Red dots represent 956 genes tolerant to missense but not to loss-of-function mutations. (b) Distribution of protein disorder content in the gene sets intolerant to loss-of-function mutations. (c) Enrichment of Reactome pathways in the gene set intolerant to both missense and loss-of-function mutations. The gene set tolerant to missense but not to loss-of-function mutations is used as a background. Only the highest-level Reactome terms from the PANTHER hierarchy view are included in the visualization. The term “unclassified” indicates that the corresponding genes have no known or inferred function. A fold enrichment below 1 indicates a depletion in the gene set intolerant to both missense and loss-of-function mutations, or equivalently, an enrichment in the gene set tolerant to missense but not to loss-of-function mutations. (d) Enrichment of autism genes in the gene sets intolerant to loss-of-function mutations. Error bars represent the standard error of the log2 odds ratio.

Intrinsically disordered proteins play key roles in a multitude of biological processes [54], but their sequences are poorly conserved across species [55]. Therefore, missense mutations in intrinsically disordered proteins may be under weak negative selection even if loss-of-function mutations in these proteins are deleterious. To test this hypothesis, we investigated the distributions of protein disorder content, i.e., the fraction of disordered regions in a protein [48], across 1,912 protein-coding genes intolerant to heterozygous loss-of-function mutations (pLI score ≥ 0.9). We split the 1,912 genes into two equal-size groups based on their UNEECON-G scores, and defined the 956 genes with higher UNEECON-G scores as the gene set intolerant to both missense and loss-of-function mutations. Accordingly, we defined the 956 genes with lower UNEECON-G scores as the gene set tolerant to missense but not to loss-of-function mutations. Compared with the 956 genes intolerant to both missense and loss-of-function mutations, the disorder contents were significantly higher in the 956 genes tolerant to missense mutations but not to loss-of-function mutations (Fig 5b). Therefore, disordered protein regions can at least partially explain the observed discrepancy between missense constraints and loss-of-function constraints.

We further investigated the enrichment of Reactome pathways [56] and Gene Ontology terms [57, 58] in the 956 genes intolerant to both missense and loss-of-function mutations, using the 956 genes tolerant to missense but not to loss-of-function mutations as a background. Several key pathways associated with the central nervous system, gene regulation, cell cycle, and innate immune response were overrepresented in the genes intolerant to both missense and loss-of-function mutations (Fig 5c; Table D in S1 File). We observed an enrichment of similar terms based on the Gene Ontology (Tables E & F in S1 File). In agreement with the enrichment of pathways associated with the central nervous system, the genes intolerant to both missense and loss-of-function mutations were strongly enriched in the gene set implicated in the autism spectrum disorders [59] (Fig 5d). In contrast, the genes tolerant to missense but not to loss-of-function mutations were more likely to have no known or inferred function (Fig 5c; Table D in S1 File).

Discussion

Here we present UNEECON for a unified prediction of deleterious missense mutations and highly constrained genes in the human genome. Compared with previously published methods, UNEECON shows superior performance in predicting missense variants and protein-coding genes associated with dominant genetic disorders. Therefore, UNEECON is a promising framework for both variant and gene prioritization. Furthermore, unlike supervised machine learning approaches, such as MPC [25], UNEECON integrates variant features and gene constraints based on signatures of negative selection instead of labeled disease variants. Therefore, UNEECON is unlikely to suffer from the circularity and the inflated performance commonly found in supervised methods [51].

It is worth noting that UNEECON is different from existing prioritization methods in multiple important aspects. First, unlike classical sequence conservation metrics [38–42] and integrative variant scores [8–15], UNEECON infers the strength of negative selection based on rare genetic variants. Therefore, UNEECON predicts strongly deleterious variants with a dominant/semidominant mode of inheritance [60]. Second, unlike previously published gene constraint metrics that typically assign the same score to all missense variants within a gene [16–23], UNEECON assigns different scores to the variants in the same gene in a feature-dependent manner, providing high-resolution maps of variant effects within genes. Third, UNEECON is able to adjust the distribution of variant scores within a gene according to the degree of depletion of missense variants in this gene, allowing for assigning different scores to missense mutations with similar variant features but located in different genes.

Similar to other metrics of negative selection, the performance of UNEECON strongly depends on the correlation between variant penetrance and negative selection. Our analysis of pathogenic variants suggests that UNEECON can predict deleterious variants associated with dominant genetic disorders but not necessarily those associated with recessive disorders. Also, because common variants are unlikely to be under strong negative selection [12], UNEECON scores might not be able to predict common variants associated with complex traits or late-onset diseases. In contrast, a recent study suggests that rare variants associated with complex traits are strongly enriched in coding regions and tend to be under negative selection [61], implying that UNEECON could be a useful tool for rare variant association studies.

UNEECON is based on a novel machine learning framework, a deep mixed-effects model, to integrate variant features and gene-level constraints. By comparing the nonlinear UNEECON model with a linear UNEECON model without hidden layers, we observe that the UNEECON scores from the nonlinear model are nearly perfectly correlated with the scores from the linear UNEECON model (Spearman’s ρ = 0.95). Therefore, the additional nonlinearity introduced by neural networks may not be critical for the dataset described in this work. Nevertheless, the deep mixed-effects model has the flexibility of modeling complex interactions between variant-level features, which may be important for analyzing other datasets.

Our analysis of the linear UNEECON model suggests that both the predictive variant features and the gene-level constraints are important for variant interpretation. In particular, gene-level evidence of selective constraints is the single most important predictor of negative selection on missense mutations. Also, predictive variant features cumulatively explain a large fraction of the variation of negative selection across missense variants. We expect that the combination of variant-level predictors and gene-level constraints will be an essential component in the future development of variant and gene prioritization methods.

By contrasting UNEECON-G scores against pLI scores, we observe a low correlation between gene-level intolerance to missense mutations and that to loss-of-function mutations. The prevalence of disordered protein regions in the human proteome is a key biological factor contributing to the low correlation between missense and loss-of-function constraints. Furthermore, we observe that the genes intolerant to both loss-of-function mutations and missense mutations may play key roles in the central nervous system and the autism spectrum disorders, highlighting the need to investigate the function of these genes using state-of-the-art experimental techniques. By combining powerful variant and gene prioritization tools, such as UNEECON, with high-throughput mutagenesis and genome editing techniques [62, 63], we will obtain more insights into the function of these strongly constrained genes in the future.

Materials and methods

Predictive variant features

UNEECON was trained on 30 missense variant features previously used to infer fitness effects of coding variants in the human genome (Table A in S1 File; [12]). These features can be classified into three categories: sequence conservation scores, protein structural features, and functional genomic features. The sequence conservation scores included SIFT [5], PROVEAN [7], SLR [64], Grantham [65], PSIC [66], LRT [67], MutationAssessor [4], HMMEntropy [68], and phyloP scores [40]. The protein structural features included predicted secondary structures, B-factors, contributions to protein stability, and relative solvent accessibilities from SNVBox [68]. The functional genomic features included the non-commercial version of SPIDEX splicing scores [69] and the maximum RNA-seq signals from the Roadmap Epigenomics Project [70]. Following a common practice in machine learning and statistics, we standardized continuous features by subtracting the mean and dividing by the standard deviation. All the features were based on the hg19 (GRCh37) assembly.
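The standardization step described above can be sketched as follows. This is a minimal illustration, not code from the UNEECON repository: the function name `standardize`, the toy values, and the use of the population standard deviation are all assumptions.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Population SD is assumed here; the sample SD would be an equally plausible choice.
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / sd for v in values]


# Hypothetical values of one continuous feature across four variants.
toy_feature = [0.5, 1.5, 2.5, 3.5]
z_scores = standardize(toy_feature)
```

After this transformation every continuous feature has mean 0 and unit variance, so the learned weights are on a comparable scale across features.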

Population genomic data

We downloaded whole genome variation data, exome variation data, and corresponding sequencing coverage data from the gnomAD browser [28, 29] (version 2.1.0). We only retained rare SNVs (MAF < 0.1%) that passed gnomAD’s built-in quality filter for downstream analysis.

Context-dependent mutation model

We trained a context-dependent mutation model to capture the impact of 7-mer sequence context, local mutation rate, and sequencing coverage on the probability of the occurrence of each mutation in the gnomAD exome sequencing data. Because of the intrinsic sparsity of putatively neutral variants in coding regions, it is difficult to build a mutation model solely based on the gnomAD exome sequencing data. Therefore, we first built a mutation model based on neutral noncoding variants in the gnomAD whole genome sequencing (WGS) data. Then, we recalibrated the WGS-based mutation model in the gnomAD exome sequencing data to adjust for the differences in population sample size and sequencing coverage between the WGS and the exome sequencing data.

To build the WGS-based mutation model, we first compiled a list of putatively neutral noncoding regions following a strategy described in previous studies [10, 71, 72]. We removed coding exons [73], conserved phastCons elements [74], nucleotide sites within 1000 bp of any coding exons, and nucleotide sites within 100 bp of any phastCons elements. We assumed that the remaining noncoding regions were largely depleted of functional elements and, therefore, mutations in these regions were putatively neutral. It is worth noting that we also removed any nucleotide sites with an average sequencing coverage below 20, due to the difficulty of variant calling in low coverage regions, and any sites overlapping CpGs, due to the high mutation rates in these sites.

Then, we fit a WGS-based mutation model to the gnomAD WGS data in putatively neutral noncoding regions. First, for each possible combination of mutation $l$ and sequence context $k$ (the focal nucleotide with 3 flanking nucleotides on each side), we calculated its mutability, $f_{kl}$, defined as the proportion of 7-mer sequences of $k$ with observed rare variant $l$ in the gnomAD WGS data. We defined $F_{kl} = \mathrm{logit}(f_{kl}) \equiv \log\left(\frac{f_{kl}}{1 - f_{kl}}\right)$, which represents the contribution of sequence context and mutation type to the probability of variant occurrence in logit scale. Second, we fit a genome-wide logistic regression model to estimate the contributions of sequencing depth and context-dependent mutability to the probability of the occurrence of a mutation. This logistic regression model assumed

$$P\left(Y_i^{\mathrm{WGS}} = 1\right) = \mathrm{logistic}\left(\alpha_0 + \alpha_1 \log(d_i) + \alpha_2 F_{k_i l_i}\right), \tag{1}$$

where $Y_i^{\mathrm{WGS}}$ is a binary indicator of the occurrence of neutral noncoding mutation $i$ in the gnomAD WGS data. $\alpha_0$, $\alpha_1$, and $\alpha_2$ are the free parameters in the logistic regression model. $d_i$ is the average sequencing depth at the nucleotide position of mutation $i$. $k_i$ and $l_i$ are the sequence context and the mutation type of mutation $i$, respectively. We then fit a local logistic regression model for each exon $m$ in the human genome with

$$P\left(Y_i^{\mathrm{WGS}} = 1\right) = \mathrm{logistic}\left(\alpha_{3m} + \hat{\alpha}_1 \log(d_i) + \hat{\alpha}_2 F_{k_i l_i}\right), \quad \text{for all mutations } i \text{ within 60 kb of exon } m, \tag{2}$$

where $\alpha_{3m}$ is an exon-specific free parameter independently estimated for each exon $m$, and $\hat{\alpha}_1$ and $\hat{\alpha}_2$ are the estimates of regression coefficients in Eq 1. The local regression model effectively added an exon-specific, multiplicative scaling factor to adjust for the variation of local mutation rates across exons. We defined

$$q_i = \hat{\alpha}_{3m} + \hat{\alpha}_1 \log(d_i) + \hat{\alpha}_2 F_{k_i l_i} \tag{3}$$

as the logit of the predicted occurrence probability of mutation $i$ in exon $m$ in the WGS-based mutation model, given the estimated $\hat{\alpha}_{3m}$, $\hat{\alpha}_1$, and $\hat{\alpha}_2$.
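To make the quantities in Eqs 1 to 3 concrete, the following sketch computes the context-dependent mutability on the logit scale and then evaluates Eq 3 for one mutation. The input layout (one tuple per possible mutation at each neutral site), the function names, and the coefficient values are hypothetical; in the study the coefficients were estimated with `glm` in R.

```python
import math
from collections import Counter


def logit(p):
    return math.log(p / (1.0 - p))


def mutability_logits(sites):
    """Map each (7-mer context k, mutation l) pair to F_kl = logit(f_kl).

    `sites` holds one (context, mutation, observed) tuple for every possible
    mutation at every putatively neutral site; `observed` is True when the
    rare variant is present in the data.
    """
    total, hits = Counter(), Counter()
    for context, mutation, observed in sites:
        total[(context, mutation)] += 1
        hits[(context, mutation)] += int(observed)
    out = {}
    for key, n in total.items():
        f_kl = hits[key] / n
        if 0.0 < f_kl < 1.0:  # the logit is undefined at 0 and 1, so skip those pairs
            out[key] = logit(f_kl)
    return out


def q_logit(alpha3_m, alpha1, alpha2, depth, F_kl):
    """Eq 3: logit of the predicted occurrence probability in the WGS model."""
    return alpha3_m + alpha1 * math.log(depth) + alpha2 * F_kl


# Toy data: one context/mutation pair observed at 1 of 4 neutral sites.
toy_sites = [("AAACAAA", "C>T", True)] + [("AAACAAA", "C>T", False)] * 3
F = mutability_logits(toy_sites)
# Hypothetical coefficient values, chosen only for illustration.
q = q_logit(alpha3_m=-0.2, alpha1=0.5, alpha2=1.0, depth=30.0,
            F_kl=F[("AAACAAA", "C>T")])
```

Because the exon-specific term enters additively on the logit scale, it multiplies the odds of variant occurrence by a constant factor for every mutation assigned to that exon.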

Finally, we recalibrated the WGS-based mutation model in the gnomAD exome sequencing data with a logistic regression model,

$$P\left(Y_i^{\mathrm{exome}} = 1\right) = \mathrm{logistic}\left(\beta_0 + q_i\right), \tag{4}$$

where $Y_i^{\mathrm{exome}}$ is a binary indicator of the presence of synonymous mutation $i$ in the gnomAD exome sequencing data, and $\beta_0$ is the free parameter of the logistic regression model. The exome-based logistic regression model effectively added a multiplicative scaling factor to account for the differences in population sample size and sequencing coverage between the WGS and the exome sequencing data. In the final exome mutation model, we defined the predicted probability of the occurrence of missense mutation $i$ in gene $j$ as

$$\mu_{ij} = \mathrm{logistic}\left(\hat{\beta}_0 + q_i\right) \equiv \frac{\exp\left(\hat{\beta}_0 + q_i\right)}{1 + \exp\left(\hat{\beta}_0 + q_i\right)}, \tag{5}$$

where $\hat{\beta}_0$ is the maximum likelihood estimate of $\beta_0$ in Eq 4. We fit all the logistic regression models using the glm function in R [75].
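Eq 4 has a single free parameter because $q_i$ enters with a fixed coefficient of 1, i.e., as an offset. The sketch below estimates that intercept by Newton's method and then evaluates Eq 5. It is a stand-in for the R `glm` fit used in the study, under the assumption that the fit is an intercept-only logistic regression with an offset; the function names and toy data are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_intercept_with_offset(q, y, n_iter=50, tol=1e-10):
    """Maximum likelihood estimate of beta0 in P(y_i = 1) = logistic(beta0 + q_i)."""
    beta0 = 0.0
    for _ in range(n_iter):
        p = [logistic(beta0 + q_i) for q_i in q]
        grad = sum(y_i - p_i for y_i, p_i in zip(y, p))  # first derivative of log-likelihood
        hess = -sum(p_i * (1.0 - p_i) for p_i in p)       # second derivative (always negative)
        step = grad / hess
        beta0 -= step                                     # Newton update
        if abs(step) < tol:
            break
    return beta0


# Toy synonymous sites: logits from the WGS model and presence in exome data.
toy_q = [-2.0, -1.0, 0.0, 1.0]
toy_y = [0, 0, 1, 0]
beta0_hat = fit_intercept_with_offset(toy_q, toy_y)
# Eq 5: predicted neutral occurrence probabilities.
mu = [logistic(beta0_hat + q_i) for q_i in toy_q]
```

At the maximum likelihood estimate the predicted probabilities sum to the observed number of variants, which is the sense in which the model is recalibrated to the exome data.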

Details of the UNEECON model

We assume that negative selection on a potential mutation is a mathematical function of the sum of a variant-level fixed effect, $z_{ij}$, and a gene-level random effect, $u_j$ (Fig 1). Denoting $X_{ij}$ as the vector of variant features associated with mutation $i$ in gene $j$, we assume that the variant-level fixed effect, $z_{ij}$, can be modeled by a feedforward neural network. At the bottom of the neural network, a nonlinear hidden layer and a dropout layer are employed to transform $X_{ij}$ to a hidden vector,

$$H_{ij} = \mathrm{dropout}\left(\mathrm{ReLU}\left(X_{ij} \cdot W_{\mathrm{hidden}} + B_{\mathrm{hidden}}\right)\right), \tag{6}$$

where ReLU and dropout are the rectified linear layer [76] and the dropout layer [43], respectively. $W_{\mathrm{hidden}}$ and $B_{\mathrm{hidden}}$ are the weight matrix and bias vector of the hidden layer, respectively. Then, we assume the fixed effect term, $z_{ij}$, is a linear combination of the hidden vector $H_{ij}$,

$$z_{ij} = H_{ij} \cdot W_{\mathrm{output}} + b_{\mathrm{output}}, \tag{7}$$

where $W_{\mathrm{output}}$ and $b_{\mathrm{output}}$ are the weight vector and the bias term, respectively. We use the Glorot method [77] to initialize the weights, $W_{\mathrm{hidden}}$ and $W_{\mathrm{output}}$, and initialize the bias terms, $B_{\mathrm{hidden}}$ and $b_{\mathrm{output}}$, with zeros. In the dropout layer, we use a fixed dropout rate of 0.5. It is worth noting that, if a linear version of the UNEECON model is used, Eq 6 will be replaced by an identity function $H_{ij} \equiv X_{ij}$.
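The forward pass of Eqs 6 and 7 can be sketched in a few lines. This is an illustration only, not the UNEECON implementation: the weights and the two-feature input are toy values, and the "inverted" dropout convention (rescaling surviving units during training, identity at prediction time) is an assumption, since the text does not specify it.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dropout(v, rate, training, rng):
    """Inverted dropout (assumed convention): zero each unit with probability
    `rate` during training and rescale the survivors; identity otherwise."""
    if not training:
        return list(v)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in v]


def fixed_effect(x, W_hidden, B_hidden, W_output, b_output,
                 rate=0.5, training=False, rng=None):
    """Variant-level fixed effect z for one feature vector x."""
    rng = rng or random.Random(0)
    # X . W_hidden + B_hidden, with W_hidden indexed as [feature][hidden unit]
    pre = [sum(x_i * W_hidden[i][h] for i, x_i in enumerate(x)) + B_hidden[h]
           for h in range(len(B_hidden))]
    hidden = dropout(relu(pre), rate, training, rng)                      # Eq 6
    return sum(h_k * w_k for h_k, w_k in zip(hidden, W_output)) + b_output  # Eq 7


# Toy example with 2 features and 2 hidden units.
x = [1.0, 2.0]
W_hidden = [[0.5, -1.0], [0.25, 0.25]]
B_hidden = [0.0, 0.0]
W_output = [2.0, 3.0]
b_output = -0.5
z = fixed_effect(x, W_hidden, B_hidden, W_output, b_output)
```

In this toy case the second hidden unit has a negative pre-activation and is zeroed by the ReLU, so only the first unit contributes to `z`.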

To capture the variation of gene-level selective constraints that cannot be predicted from feature vector $X_{ij}$, we introduce a gene-level random-effect term following a Gaussian distribution,

$$u_j \sim N(0, \sigma), \tag{8}$$

where $\sigma$ is the standard deviation of the Gaussian distribution. Given the fixed and random effect terms, we assume $\eta_{ij}$, the relative probability of the occurrence of mutation $i$ in gene $j$, follows

$$\eta_{ij} = \mathrm{logistic}\left(z_{ij} + u_j\right) \equiv \frac{\exp\left(z_{ij} + u_j\right)}{1 + \exp\left(z_{ij} + u_j\right)}. \tag{9}$$

We further assume that the total probability of the occurrence of potential missense mutation $i$ in gene $j$ is equal to the product of $\eta_{ij}$ and $\mu_{ij}$, the probability of variant occurrence under the neutral mutation model described in Eq 5. Accordingly, the likelihood function for the data associated with gene $j$ is defined as

$$L_j = \prod_i \left(\eta_{ij} \mu_{ij}\right)^{Y_{ij}} \left(1 - \eta_{ij} \mu_{ij}\right)^{1 - Y_{ij}}, \tag{10}$$

where $Y_{ij}$ is a binary indicator of the occurrence of missense mutation $i$ in gene $j$ in the gnomAD exome sequencing data.
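For a fixed value of the random effect, the logarithm of Eq 10 is a sum of Bernoulli log-probabilities. The sketch below evaluates it for one gene; the function name and the toy inputs are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def gene_log_likelihood(z, u, mu, y):
    """Log of Eq 10 for one gene, conditional on the random effect u.

    z  : fixed effects of the potential missense mutations in the gene
    mu : their neutral occurrence probabilities (Eq 5)
    y  : 0/1 indicators of occurrence in the exome data
    """
    total = 0.0
    for z_i, mu_i, y_i in zip(z, mu, y):
        p = logistic(z_i + u) * mu_i  # eta_ij * mu_ij
        total += math.log(p) if y_i else math.log1p(-p)
    return total


# Toy gene with two potential mutations, one of them observed.
log_lik = gene_log_likelihood(z=[0.0, 0.0], u=0.0, mu=[0.5, 0.5], y=[1, 0])
```

Here each mutation has $\eta = 0.5$ and $\mu = 0.5$, so the occurrence probability is 0.25 and the likelihood is $0.25 \times 0.75$.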

Training the UNEECON model

We trained UNEECON with the Adam optimization algorithm [78]. In each iteration, UNEECON loaded a mini-batch of data consisting of all the potential missense mutations in a single gene. Then, we calculated the log likelihood of the mini-batch of data using Eq 10 and numerically integrated out $u_j$ with a 20-point Gaussian quadrature rule. The negative log value of Eq 10 was used as the objective function in the Adam optimizer.
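Integrating out the Gaussian random effect can be sketched with Gauss-Hermite quadrature. Two assumptions are made here: that the quadrature rule is of the Gauss-Hermite type (the text only says Gaussian quadrature), and, to keep the example short, that 3 nodes are used instead of the 20 used in the study. Function names and toy data are hypothetical.

```python
import math

# 3-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
GH_WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def conditional_log_lik(z, u, mu, y):
    """Log of Eq 10 given the random effect u."""
    total = 0.0
    for z_i, mu_i, y_i in zip(z, mu, y):
        p = logistic(z_i + u) * mu_i
        total += math.log(p) if y_i else math.log1p(-p)
    return total


def marginal_neg_log_lik(z, mu, y, sigma):
    """Negative log of the likelihood averaged over u ~ Normal(0, sd = sigma)."""
    terms = []
    for x_k, w_k in zip(GH_NODES, GH_WEIGHTS):
        u_k = math.sqrt(2.0) * sigma * x_k  # change of variables for a Gaussian density
        terms.append(math.log(w_k / math.sqrt(math.pi)) + conditional_log_lik(z, u_k, mu, y))
    m = max(terms)                          # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))


toy_z, toy_mu, toy_y = [0.0, 0.0], [0.5, 0.5], [1, 0]
nll_no_random_effect = marginal_neg_log_lik(toy_z, toy_mu, toy_y, sigma=0.0)
nll = marginal_neg_log_lik(toy_z, toy_mu, toy_y, sigma=1.0)
```

With `sigma = 0` every node collapses to $u = 0$ and the result reduces to the conditional negative log likelihood, which gives a simple sanity check.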

We randomly split the data into a training set (80% genes; 51,108,443 potential missense mutations), a validation set (10% genes; 6,435,990 potential missense mutations), and a test set (10% genes; 6,414,968 potential missense mutations). After each epoch of training, UNEECON evaluated the objective function in the validation set and stopped training when the objective function did not improve over 5 successive epochs to avoid overfitting (early stopping). Furthermore, we performed a grid search to optimize two hyperparameters, i.e., the learning rate of the Adam algorithm ($10^{-2}$, $10^{-3}$, and $10^{-4}$) and the number of hidden units (64, 128, 256, 512, and a linear model without hidden units). The model was evaluated on the test dataset to choose optimal hyperparameters. We observed that a nonlinear UNEECON model with 512 hidden units and a learning rate of $10^{-4}$ had the lowest negative log likelihood on the test dataset. Therefore, we chose this model as the optimal one for downstream analysis.
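The early stopping rule and the hyperparameter grid described above can be sketched as follows. The function name, the representation of the linear model as `None`, and the toy loss values are illustrative assumptions.

```python
from itertools import product


def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has failed to
    improve for `patience` successive epochs.
    """
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Grid from the text: 3 learning rates x 5 architectures (None = linear model).
GRID = list(product([1e-2, 1e-3, 1e-4], [64, 128, 256, 512, None]))

# Hypothetical validation losses after each epoch.
toy_losses = [10.0, 9.0, 8.0, 8.5, 8.4, 8.3, 8.2, 8.1, 8.05]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses)
```

In the toy sequence the best loss occurs at epoch 2, and the five following epochs never beat it, so training would stop at epoch 7.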

Calculating UNEECON and UNEECON-G scores

After training, we fixed the free parameters, i.e., $W_{\mathrm{hidden}}$, $W_{\mathrm{output}}$, $B_{\mathrm{hidden}}$, $b_{\mathrm{output}}$, and $\sigma$, to the estimated values in the optimal model. Then, we calculated the UNEECON score for all the potential missense mutations in the human genome. For each gene $j$, UNEECON first calculated the posterior distribution of the random effect, $P(u_j \mid \mathrm{Data}_j)$, where $\mathrm{Data}_j = \{X_{1j}, X_{2j}, \cdots, X_{Nj}, Y_{1j}, Y_{2j}, \cdots, Y_{Nj}\}$ are the features and indicators of occurrence of all the potential missense mutations in gene $j$. Denoting $E_{u_j \mid \mathrm{Data}_j}(\eta_{ij} \mid u_j) \equiv \int \eta_{ij} P(u_j \mid \mathrm{Data}_j)\, du_j$ as the expected relative probability of the occurrence of mutation $i$ in gene $j$, we define the UNEECON score of mutation $i$ as $1 - E_{u_j \mid \mathrm{Data}_j}(\eta_{ij} \mid u_j)$. The UNEECON-G score is defined as the average UNEECON score of all the missense mutations in a gene.
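The posterior expectation that defines the UNEECON score can be approximated on a grid of quadrature nodes. The sketch below assumes a 3-point Gauss-Hermite rule for brevity (the text does not state how the posterior integral was evaluated), and the function name and toy inputs are hypothetical.

```python
import math

# 3-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = [-math.sqrt(1.5), 0.0, math.sqrt(1.5)]
GH_WEIGHTS = [math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def uneecon_scores(z, mu, y, sigma):
    """Return (variant scores, gene-level score) for one gene."""
    u_grid = [math.sqrt(2.0) * sigma * x_k for x_k in GH_NODES]
    # Unnormalized log posterior at each node: log prior weight + log likelihood.
    log_post = []
    for u_k, w_k in zip(u_grid, GH_WEIGHTS):
        value = math.log(w_k)
        for z_i, mu_i, y_i in zip(z, mu, y):
            p = logistic(z_i + u_k) * mu_i
            value += math.log(p) if y_i else math.log1p(-p)
        log_post.append(value)
    m = max(log_post)
    post = [math.exp(v - m) for v in log_post]
    norm = sum(post)
    post = [p / norm for p in post]  # discrete approximation of P(u_j | Data_j)
    # Variant score: 1 minus the posterior mean of eta_ij.
    scores = [1.0 - sum(p_k * logistic(z_i + u_k) for p_k, u_k in zip(post, u_grid))
              for z_i in z]
    gene_score = sum(scores) / len(scores)  # average over the gene's mutations
    return scores, gene_score


toy_scores, toy_gene_score = uneecon_scores(z=[0.0, 2.0], mu=[0.3, 0.3], y=[0, 1], sigma=1.0)
```

A mutation with a larger fixed effect has a higher expected relative occurrence probability and therefore a lower score, so in the toy gene the second mutation scores lower than the first.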

Distributions of UNEECON scores across gene categories and protein regions

We investigated the distributions of UNEECON scores across different gene categories and protein regions. We obtained lists of haploinsufficient genes [37] (n = 294), autosomal dominant disease genes [35, 36] (n = 709), autosomal recessive disease genes [35, 36] (n = 1,183), and olfactory receptor genes [45] (n = 371) from the GitHub repository for the MacArthur Lab at the Broad Institute (https://github.com/macarthur-lab/gene_lists). We obtained annotations of α-helices, β-strands, hydrogen-bonded turns, enzyme active sites, and binding sites from UniProt [47]. We obtained annotated disordered protein regions from MobiDB 3.0 [48]. We plotted the distributions of UNEECON scores using R [75].

Comparison with other methods in the prediction of ClinVar pathogenic variants

We evaluated the performance of UNEECON, compared with eight previously published methods, in the setting of predicting pathogenic missense variants in the ClinVar database downloaded on Feb 25, 2019 [30]. We obtained MPC [25], PrimateAI [11], Eigen [50], CADD [8], RVIS [16], and pLI [17] scores from dbNSFP (version 4.0b2a) [79]. We obtained LASSIE scores [12] from the LASSIE GitHub repository (https://github.com/CshlSiepelLab/LASSIE) and CCR scores (version 2) [22] from https://s3.us-east-2.amazonaws.com/ccrs/ccrs/ccrs.autosomes.v2.20180420.bed.gz. It is worth noting that we used the version of pLI scores trained on the gnomAD exome sequencing data [29].

We considered autosomal missense variants with annotations of “pathogenic/likely pathogenic” as positives and the ones with annotations of “benign/likely benign” as negative controls. We further removed the variants with conflicting pathogenicity annotations and the variants present in the gnomAD exome sequencing dataset [29]. For all the comparisons, we only included the variants scored by all methods. We plotted the receiver operating characteristic (ROC) curves and calculated the areas under the receiver operating characteristic curves (AUCs) using the ROCR package [80]. Because the AUC metric is sensitive to label imbalance, we matched the numbers of positives and negatives using random sampling without replacement. After all the filtering steps, we obtained 473 pathogenic and 473 benign variants in the autosomal dominant dataset, as well as 277 pathogenic and 277 benign variants in the autosomal recessive dataset.
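The balanced AUC evaluation can be sketched without any external package. This is a substitute for the ROCR-based calculation used in the study: it computes the AUC through its rank interpretation (the probability that a random positive outscores a random negative) after downsampling the larger class. Function names and toy scores are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Probability that a random positive scores above a random negative; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_auc(pos_scores, neg_scores, seed=0):
    """AUC after matching class sizes by random sampling without replacement."""
    rng = random.Random(seed)
    n = min(len(pos_scores), len(neg_scores))
    return auc(rng.sample(pos_scores, n), rng.sample(neg_scores, n))


toy_auc = auc([0.9, 0.8], [0.1, 0.8])
toy_balanced = balanced_auc([0.9, 0.8, 0.7], [0.1, 0.2])
```

The pairwise loop is quadratic in the number of variants, which is acceptable for a few hundred variants per class as in the datasets above; a rank-based formula would be preferable for much larger sets.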

Comparison with alternative methods in predicting de novo mutations associated with developmental disorders

We downloaded de novo mutations identified in 4,293 individuals affected by developmental disorders [31] and 2,278 healthy individuals from denovo-db (version v1.6.1). The healthy controls included individuals enrolled in previous studies of autism [81–85], severe non-syndromic sporadic intellectual disability [86], schizophrenia [87], and healthy populations [88–91]. We removed redundant variants and any variants present in the gnomAD exome sequencing data. Then, for various percentile rank cutoffs (top 10%, 20%, 30%, and 40%), we calculated the log2 odds ratio of the enrichment of predicted deleterious variants in affected individuals using the fisher.test function in R [75].
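The enrichment statistic can be illustrated with a 2×2 table. Note that this sketch uses the sample odds ratio with Woolf's standard error, which is a simplification: `fisher.test` in R reports a conditional maximum likelihood estimate of the odds ratio, which can differ slightly. The function name and the counts are hypothetical.

```python
import math


def log2_odds_ratio(a, b, c, d):
    """Log2 odds ratio and its standard error for a 2x2 table.

    a: predicted deleterious variants in affected individuals
    b: other variants in affected individuals
    c: predicted deleterious variants in controls
    d: other variants in controls
    All four counts must be positive.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)  # Woolf's formula, natural-log scale
    return log_or / math.log(2.0), se / math.log(2.0)        # convert both to log2 scale


# Hypothetical counts at one percentile cutoff.
toy_log2_or, toy_se = log2_odds_ratio(a=40, b=60, c=10, d=90)
```

With these counts the odds ratio is 6, so the log2 odds ratio is about 2.58; a value above 0 indicates enrichment of predicted deleterious variants in affected individuals.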

Comparison with alternative methods in predicting disease genes and essential genes

We obtained four sets of disease genes and essential genes from the GitHub repository for the MacArthur Lab at the Broad Institute (https://github.com/macarthur-lab/gene_lists). These gene sets included 294 haploinsufficient genes [37], 709 autosomal dominant disease genes [35, 36], 2,454 human orthologs of mouse essential genes [33, 34], and 683 essential genes based on CRISPR screen data from human cell lines [32]. For each set of the haploinsufficient, autosomal dominant, and mouse essential genes, we utilized the other autosomal genes as negative controls. Because it is easier to reject the null model of neutral evolution in a longer gene, gene constraint scores tend to be correlated with gene length [17, 29]. To control for the impact of gene length on performance evaluation, we used MatchIt [92] to pair each disease gene with a non-disease gene with matched gene length, resulting in a negative gene set with matched gene number and gene length. Similarly, for the 683 human essential genes based on CRISPR screen data, we used the 913 nonessential genes from the same study [32] as negative controls and constructed a negative gene set with matched gene length and gene number. We plotted the ROC curves and calculated the AUC metrics using the ROCR package [80].
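The idea of a length-matched negative set can be sketched with greedy one-to-one nearest-neighbor matching. This is a simplified stand-in, not a reproduction of MatchIt, whose options are not described in the text; here genes are matched directly on length, and the function name and toy gene lengths are hypothetical.

```python
def match_by_length(case_lengths, control_lengths):
    """Greedy 1:1 nearest-neighbor matching on gene length, without replacement.

    Both arguments map gene name -> gene length. Returns a dict mapping each
    case gene to its matched control gene.
    """
    available = dict(control_lengths)
    matches = {}
    # Process case genes from shortest to longest for a deterministic result.
    for gene, length in sorted(case_lengths.items(), key=lambda kv: kv[1]):
        if not available:
            break  # ran out of control genes
        # Closest remaining control; the gene name breaks ties.
        best = min(available, key=lambda g: (abs(available[g] - length), g))
        matches[gene] = best
        del available[best]  # each control gene is used at most once
    return matches


toy_cases = {"diseaseA": 1000, "diseaseB": 3000}
toy_controls = {"ctrl1": 900, "ctrl2": 3100, "ctrl3": 5000}
toy_matches = match_by_length(toy_cases, toy_controls)
```

Because each control gene is removed once used, the resulting negative set has the same number of genes as the positive set and a similar length distribution, provided enough control genes are available.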

Comparison of UNEECON-G and pLI scores

To evaluate the relationship between gene-level intolerance to missense mutations and that to loss-of-function mutations, we compared the UNEECON-G scores with the pLI scores trained on the gnomAD data [17, 79]. We identified 1,912 genes intolerant to loss-of-function mutations based on a threshold of pLI score of 0.9. Then, we split the 1,912 genes into 956 genes tolerant to missense mutations and 956 genes intolerant to missense mutations based on the median UNEECON-G score. We used PANTHER [93] to calculate the enrichment of Reactome pathways and Gene Ontology categories in the set of genes intolerant to both missense and loss-of-function mutations, compared with those tolerant to missense mutations but not to loss-of-function mutations. We also downloaded 1,054 autism related genes from the SFARI database on March 9, 2019 [59], and evaluated the enrichment of autism genes in the two gene sets.
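The construction of the two gene sets can be sketched as a threshold on pLI followed by a median split on the UNEECON-G score. The function name, the dictionary inputs, and the toy scores are hypothetical.

```python
def split_lof_intolerant_genes(pli, uneecon_g, pli_cutoff=0.9):
    """Split loss-of-function-intolerant genes at the median UNEECON-G score.

    pli, uneecon_g: dicts mapping gene name -> score.
    Returns (intolerant_to_both, missense_tolerant) as two sets of genes.
    """
    lof_intolerant = [g for g in pli if pli[g] >= pli_cutoff and g in uneecon_g]
    ranked = sorted(lof_intolerant, key=lambda g: uneecon_g[g], reverse=True)
    half = len(ranked) // 2
    return set(ranked[:half]), set(ranked[half:])


toy_pli = {"G1": 0.99, "G2": 0.95, "G3": 0.50, "G4": 0.92, "G5": 0.91}
toy_uneecon_g = {"G1": 0.8, "G2": 0.2, "G3": 0.9, "G4": 0.6, "G5": 0.1}
intolerant_to_both, missense_tolerant = split_lof_intolerant_genes(toy_pli, toy_uneecon_g)
```

With an even number of genes passing the pLI cutoff, as with the 1,912 genes in the study, the two sets have equal size.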

Supporting information

S1 File. Supplementary material.

Supplementary figures and tables.

(PDF)

S2 File. Supplementary dataset.

List of 1,912 genes intolerant to loss-of-function mutations.

(TSV)

Acknowledgments

The author thanks Noah Dukler for comments on the manuscript.

Data Availability

The UNEECON program and precomputed UNEECON/UNEECON-G scores are available at https://github.com/yifei-lab/UNEECON.

Funding Statement

This work was supported by start-up funds from the Pennsylvania State University. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genetics in Medicine. 2015;17(5):405–423. doi: 10.1038/gim.2015.30
  • 2. Maxwell K, Hart S, Vijai J, Schrader K, Slavin T, Thomas T, et al. Evaluation of ACMG-Guideline-Based Variant Classification of Cancer Susceptibility and Non-Cancer-Associated Genes in Families Affected by Breast Cancer. The American Journal of Human Genetics. 2016;98(5):801–817. doi: 10.1016/j.ajhg.2016.02.024
  • 3. Eilbeck K, Quinlan A, Yandell M. Settling the score: variant prioritization and Mendelian disease. Nature Reviews Genetics. 2017;18(10):599–612. doi: 10.1038/nrg.2017.52
  • 4. Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research. 2011;39(17):e118. doi: 10.1093/nar/gkr407
  • 5. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Research. 2003;31(13):3812–3814. doi: 10.1093/nar/gkg509
  • 6. Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics. 2011;12(9):628–640. doi: 10.1038/nrg3046
  • 7. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLOS ONE. 2012;7(10):1–13.
  • 8. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics. 2014;46(3):310–315. doi: 10.1038/ng.2892
  • 9. Gulko B, Hubisz MJ, Gronau I, Siepel A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nature Genetics. 2015;47(3):276–283. doi: 10.1038/ng.3196
  • 10. Huang YF, Gulko B, Siepel A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nature Genetics. 2017;49(4):618–624. doi: 10.1038/ng.3810
  • 11. Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics. 2018;50(8):1161–1170. doi: 10.1038/s41588-018-0167-z
  • 12. Huang YF, Siepel A. Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease. Genome Research. 2019;29(8):1310–1321. doi: 10.1101/gr.245522.118
  • 13. Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science. 2013;342(6154):1235587. doi: 10.1126/science.1235587
  • 14. Fu Y, Liu Z, Lou S, Bedford J, Mu X, Yip K, et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biology. 2014;15(10):480. doi: 10.1186/s13059-014-0480-5
  • 15. Gulko B, Siepel A. An evolutionary framework for measuring epigenomic information and estimating cell-type-specific fitness consequences. Nature Genetics. 2019;51(2):335–342. doi: 10.1038/s41588-018-0300-z
  • 16. Petrovski S, Wang Q, Heinzen EL, Allen AS, Goldstein DB. Genic intolerance to functional variation and the interpretation of personal genomes. PLOS Genetics. 2013;9(8):e1003709. doi: 10.1371/journal.pgen.1003709
  • 17. Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, et al. A framework for the interpretation of de novo mutation in human disease. Nature Genetics. 2014;46(9):944–950. doi: 10.1038/ng.3050
  • 18. Petrovski S, Gussow AB, Wang Q, Halvorsen M, Han Y, Weir WH, et al. The intolerance of regulatory sequence to genetic variation predicts gene dosage sensitivity. PLOS Genetics. 2015;11(9):e1005492. doi: 10.1371/journal.pgen.1005492
  • 19. Itan Y, Shang L, Boisson B, Patin E, Bolze A, Moncada-Vélez M, et al. The human gene damage index as a gene-level approach to prioritizing exome variants. Proceedings of the National Academy of Sciences. 2015;112(44):13615–13620. doi: 10.1073/pnas.1518646112
  • 20. Gussow A, Petrovski S, Wang Q, Allen A, Goldstein D. The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biology. 2016;17(1):9. doi: 10.1186/s13059-016-0869-4
  • 21. Pérez-Palma E, May P, Iqbal S, Niestroj LM, Du J, Heyne H, et al. Identification of pathogenic variant enriched regions across genes and gene families. bioRxiv. 2019.
  • 22. Havrilla JM, Pedersen BS, Layer RM, Quinlan AR. A map of constrained coding regions in the human genome. Nature Genetics. 2019;51(1):88–95. doi: 10.1038/s41588-018-0294-6
  • 23. Silk M, Petrovski S, Ascher DB. MTR-Viewer: identifying regions within genes under purifying selection. Nucleic Acids Research. 2019;47(W1):W121–W126. doi: 10.1093/nar/gkz457
  • 24. Iossifov I, Levy D, Allen J, Ye K, Ronemus M, Lee Yh, et al. Low load for disruptive mutations in autism genes and their biased transmission. Proceedings of the National Academy of Sciences. 2015;112(41):E5600–E5607. doi: 10.1073/pnas.1516376112
  • 25. Samocha KE, Kosmicki JA, Karczewski KJ, O’Donnell-Luria AH, Pierce-Hoffman E, MacArthur DG, et al. Regional missense constraint improves variant deleteriousness prediction. bioRxiv. 2017.
  • 26. Jagadeesh KA, Wenger AM, Berger MJ, Guturu H, Stenson PD, Cooper DN, et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nature Genetics. 2016;48(12):1581–1586. doi: 10.1038/ng.3703
  • 27. Evans P, Wu C, Lindy A, McKnight DA, Lebo M, Sarmady M, et al. Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets. Genome Research. 2019;29(7):1144–1151. doi: 10.1101/gr.240994.118
  • 28. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–291. doi: 10.1038/nature19057
  • 29. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv. 2019.
  • 30. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Research. 2014;42(D1):D980–D985. doi: 10.1093/nar/gkt1113
  • 31. Deciphering Developmental Disorders Study, McRae JF, Clayton S, Fitzgerald TW, Kaplanis J, Prigmore E, et al. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542:433–438. doi: 10.1038/nature21062
  • 32. Hart T, Brown KR, Sircoulomb F, Rottapel R, Moffat J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Molecular Systems Biology. 2014;10(7):733. doi: 10.15252/msb.20145216
  • 33. Blake JA, Bult CJ, Kadin JA, Richardson JE, Eppig JT, the Mouse Genome Database Group. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Research. 2010;39(suppl1):D842–D848.
  • 34. Georgi B, Voight BF, Bucan M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLOS Genetics. 2013;9(5):e1003484. doi: 10.1371/journal.pgen.1003484
  • 35. Blekhman R, Man O, Herrmann L, Boyko AR, Indap A, Kosiol C, et al. Natural Selection on Genes that Underlie Human Disease Susceptibility. Current Biology. 2008;18(12):883–889. doi: 10.1016/j.cub.2008.04.074
  • 36. Berg JS, Adams M, Nassar N, Bizon C, Lee K, Schmitt CP, et al. An informatics approach to analyzing the incidentalome. Genetics in Medicine. 2012;15:36. doi: 10.1038/gim.2012.112
  • 37. Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, et al. ClinGen—the clinical genome resource. New England Journal of Medicine. 2015;372(23):2235–2242. doi: 10.1056/NEJMsr1406261
  • 38. Armon A, Graur D, Ben-Tal N. ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. Journal of Molecular Biology. 2001;307(1):447–463. doi: 10.1006/jmbi.2000.4474
  • 39. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Research. 2005;15(7):901–913. doi: 10.1101/gr.3577405
  • 40. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research. 2010;20(1):110–121. doi: 10.1101/gr.097857.109
  • 41. Huang YF, Golding GB. Phylogenetic Gaussian process model for the inference of functionally important regions in protein tertiary structures. PLoS Computational Biology. 2014;10(1):e1003429 10.1371/journal.pcbi.1003429 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Huang YF, Golding GB. FuncPatch: a web server for the fast Bayesian inference of conserved functional patches in protein 3D structures. Bioinformatics. 2015;31(4):523–531. 10.1093/bioinformatics/btu673 [DOI] [PubMed] [Google Scholar]
  • 43. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. 2014;15:1929–1958. [Google Scholar]
  • 44. Bengio Y. Practical recommendations for gradient-based training of deep architectures In: Neural networks: tricks of the trade. Berlin, Heidelberg: Springer Berlin Heidelberg; 2012. p. 437–478. [Google Scholar]
  • 45. Mainland JD, Li YR, Zhou T, Liu WLL, Matsunami H. Human olfactory receptor responses to odorants. Scientific Data. 2015;2:150002 10.1038/sdata.2015.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Gilad Y, Bustamante CD, Lancet D, Pääbo S. Natural selection on the olfactory receptor gene family in humans and chimpanzees. The American Journal of Human Genetics. 2003;73(3):489–501. 10.1086/378132 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. McGarvey PB, Nightingale A, Luo J, Huang H, Martin MJ, Wu C, et al. UniProt genomic mapping for deciphering functional effects of missense variants. Human mutation. 2019;40(6):694–705. 10.1002/humu.23738 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Piovesan D, Tabaro F, Paladin L, Necci M, Mičetić I, Camilloni C, et al. MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins. Nucleic Acids Research. 2017;46(D1):D471–D476. 10.1093/nar/gkx1071 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research. 2018;47(D1):D506–D515. 10.1093/nar/gky1049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nature Genetics. 2016;48(2):214–220. 10.1038/ng.3477 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Grimm DG, Azencott CA, Aicheler F, Gieraths U, MacArthur DG, Samocha KE, et al. The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Human Mutation. 2015;36(5):513–523. 10.1002/humu.22768 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Turner TN, Yi Q, Krumm N, Huddleston J, Hoekzema K, Stessman HAF, et al. denovo-db: a compendium of human de novo variants. Nucleic Acids Research. 2016;45(D1):D804–D811. 10.1093/nar/gkw865 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Ziegler A, Colin E, Goudenège D, Bonneau D. A snapshot of some pLI score pitfalls. Human Mutation. 2019;40(7):839–841. [DOI] [PubMed] [Google Scholar]
  • 54. Wright PE, Dyson HJ. Intrinsically disordered proteins in cellular signalling and regulation. Nature Reviews Molecular Cell Biology. 2015;16:18 10.1038/nrm3920 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ, et al. Evolutionary rate heterogeneity in proteins with long disordered regions. Journal of Molecular Evolution. 2002;55(1):104–110. 10.1007/s00239-001-2309-6 [DOI] [PubMed] [Google Scholar]
  • 56. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The Reactome Pathway Knowledgebase. Nucleic Acids Research. 2018;46(D1):D649–D655. 10.1093/nar/gkx1132 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics. 2000;25(1):25–29. 10.1038/75556 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research. 2019;47(D1):D330–D338. 10.1093/nar/gky1055 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Abrahams BS, Arking DE, Campbell DB, Mefford HC, Morrow EM, Weiss LA, et al. SFARI Gene 2.0: a community-driven knowledgebase for the autism spectrum disorders (ASDs). Molecular Autism. 2013;4(1):36 10.1186/2040-2392-4-36 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Fuller ZL, Berg JJ, Mostafavi H, Sella G, Przeworski M. Measuring intolerance to mutation in human genetics. Nature Genetics. 2019;51(5):772–776. 10.1038/s41588-019-0383-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Wainschtein P, Jain DP, Yengo L, Zheng Z, Cupples LA, Shadyab AH, et al. Recovery of trait heritability from whole genome sequence data. bioRxiv. 2019; [Google Scholar]
  • 62. Starita LM, Ahituv N, Dunham MJ, Kitzman JO, Roth FP, Seelig G, et al. Variant interpretation: functional assays to the rescue. The American Journal of Human Genetics. 2017;101(3):315–325. 10.1016/j.ajhg.2017.07.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Kinney JB, McCandlish DM. Massively parallel assays and quantitative sequence-function relationships. Annual Review of Genomics and Human Genetics. 2019;20:99–127. 10.1146/annurev-genom-083118-014845 [DOI] [PubMed] [Google Scholar]
  • 64. Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005;169(3):1753–1762. 10.1534/genetics.104.032144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974;185(4154):862–864. 10.1126/science.185.4154.862 [DOI] [PubMed] [Google Scholar]
  • 66. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nature Methods. 2010;7:248–249. 10.1038/nmeth0410-248 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Research. 2009;19(9):1553–1561. 10.1101/gr.092619.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Wong WC, Kim D, Carter H, Diekhans M, Ryan MC, Karchin R. CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics. 2011;27(15):2147–2148. 10.1093/bioinformatics/btr357 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RKC, et al. The human splicing code reveals new insights into the genetic determinants of disease. Science. 2015;347:1254806 10.1126/science.1254806 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–330. 10.1038/nature14248 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Arbiza L, Gronau I, Aksoy BA, Hubisz MJ, Gulko B, Keinan A, et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nature Genetics. 2013;45(7):723–729. 10.1038/ng.2658 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Gronau I, Arbiza L, Mohammed J, Siepel A. Inference of natural selection from interspersed genomic elements based on polymorphism and divergence. Molecular Biology and Evolution. 2013;30(5):1159–1171. 10.1093/molbev/mst019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Research. 2012;22(9):1760–1774. 10.1101/gr.135350.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research. 2005;15(8):1034–1050. 10.1101/gr.3715005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. R Development Core Team. R: a language and environment for statistical computing; 2008. Available from: http://www.R-project.org. [Google Scholar]
  • 76. Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10. USA: Omnipress; 2010. p. 807–814.
  • 77. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Teh YW, Titterington M, editors. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. vol. 9 of Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR; 2010. p. 249–256.
  • 78. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv:1412.6980. 2014. [Google Scholar]
  • 79. Liu X, Jian X, Boerwinkle E. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Human Mutation. 2013;34(9):E2393–E2402. 10.1002/humu.22376 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21(20):3940–3941. 10.1093/bioinformatics/bti623 [DOI] [PubMed] [Google Scholar]
  • 81. Iossifov I, O’Roak BJ, Sanders SJ, Ronemus M, Krumm N, Levy D, et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature. 2014;515:216–221. 10.1038/nature13908 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Krumm N, Turner TN, Baker C, Vives L, Mohajeri K, Witherspoon K, et al. Excess of rare, inherited truncating mutations in autism. Nature Genetics. 2015;47(6):582–588. 10.1038/ng.3303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Turner T, Hormozdiari F, Duyzend M, McClymont S, Hook P, Iossifov I, et al. Genome sequencing of autism-affected families reveals disruption of putative noncoding regulatory DNA. The American Journal of Human Genetics. 2016;98(1):58–74. 10.1016/j.ajhg.2015.11.023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Yuen RKC, Merico D, Bookman M, Howe JL, Thiruvahindrapuram B, Patel RV, et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nature Neuroscience. 2017;20:602–611. 10.1038/nn.4524 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Werling DM, Brand H, An JY, Stone MR, Zhu L, Glessner JT, et al. An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder. Nature Genetics. 2018;50(5):727–736. 10.1038/s41588-018-0107-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Rauch A, Wieczorek D, Graf E, Wieland T, Endele S, Schwarzmayr T, et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. The Lancet. 2012;380(9854):1674–1682. 10.1016/S0140-6736(12)61480-9 [DOI] [PubMed] [Google Scholar]
  • 87. Gulsuner S, Walsh T, Watts A, Lee M, Thornton A, Casadei S, et al. Spatial and temporal mapping of de novo mutations in schizophrenia to a fetal prefrontal cortical network. Cell. 2013;154(3):518–529. 10.1016/j.cell.2013.06.049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. The 1000 Genomes Project, Conrad DF, Keebler JEM, DePristo MA, Lindsay SJ, Zhang Y, et al. Variation in genome-wide mutation rates within and between human families. Nature Genetics. 2011;43(7):712–714. 10.1038/ng.862 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Ramu A, Noordam MJ, Schwartz RS, Wuster A, Hurles ME, Cartwright RA, et al. DeNovoGear: de novo indel and point mutation discovery and phasing. Nature Methods. 2013;10(1):985–987. 10.1038/nmeth.2611 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90. The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics. 2014;46:818–825. 10.1038/ng.3021 [DOI] [PubMed] [Google Scholar]
  • 91. Besenbacher S, Liu S, Izarzugaza JMG, Grove J, Belling K, Bork-Jensen J, et al. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios. Nature Communications. 2015;6:5969 10.1038/ncomms6969 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92. Ho DE, Imai K, King G, Stuart EA. MatchIt: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software. 2011;42(8):1–28. [Google Scholar]
  • 93. Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, et al. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Research. 2017;45(D1):D183–D189. 10.1093/nar/gkw1138 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Scott M Williams

10 Jan 2020

Dear Dr Huang,

Thank you very much for submitting your Research Article entitled 'Unified inference of missense variant effects and gene constraints in the human genome' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Scott M. Williams

Section Editor: Natural Variation

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This paper describes the development and evaluation of UNEECON, a framework for jointly predicting deleterious variants and constrained genes. This is certainly an interesting topic in the context of variant effect prediction and interpretation. I find the attempt to unify variant-level and gene-level information quite innovative, and it is certainly an approach that will be useful in the study of severe, early-onset disorders. Further, the use of a deep neural network to learn parameters relevant to population genetics from millions of variants from gnomAD is a novel contribution to this area. From the perspective of constrained gene prediction, UNEECON results are quite promising. However, I remain skeptical of some of the claims with regard to pathogenicity prediction and the overall argument that this method would be better in practice than those evaluated here (see below). Overall, the paper is clearly written and the methods are outlined satisfactorily (see minor comments for some things that need to be clarified). I outline my comments below:

MAJOR:

- A general issue that has plagued the field is the problem of unevenly distributed variant information across genes. Some genes are over-studied and are likely to have more variants identified as pathogenic. More importantly, many genes are likely to contain variants from only one class. UNEECON is interesting in this context in that it is trained on genes that are mostly going to contain only benign variants and is evaluated on a ClinVar set that will skew mostly towards pathogenic-only or bi-class genes. Ref. 58 from this paper highlighted a method that performed extremely well in its evaluations when run on “pathogenic-only” or “benign-only” proteins but drastically underperformed on “mixed” genes. Given that UNEECON is heavily influenced by gene-level features, I wonder if it is susceptible to the same issue. One way to test this would be to perform a version of the ClinVar evaluation on only the subset of genes that contain both classes of variants. If performance drops, then perhaps unification of variant-level and gene-level information may not be the best approach for variant pathogenicity prediction.

- On a related note, I am concerned about information leakage between the training set and evaluation sets used in this paper. I agree that UNEECON benefits from actually not using the pathogenic variants in its training and evaluation. However, the sheer size of gnomAD is expected to include every known gene in the training of the deep neural network. Since UNEECON uses gene-level information, there is a distinct possibility that performance may be inflated even after excluding variants in both ClinVar and gnomAD. In fact, Ref. 58 (cited for the circularity and inflation issues) points this out and recommends gene-level partitioning for cross-validation experiments such as those conducted by PolyPhen-2 and MutPred. Of course, in the context of this paper, the proposed experiment would be to train a version of UNEECON without variants in genes from the ClinVar set and evaluate that version on the ClinVar set. That way any gene-level bias in the performance measures would be eliminated.

- On page 8, line 254, is it all that surprising that UNEECON-G and pLI scores do not correlate? Intuitively, the impacts of missense variants and LOF mutations are going to vary in magnitude even within the same gene. While a biological explanation (as provided here) for this may very well be plausible, it is more likely that the discrepancies are due to technical reasons. A recent commentary (PMID: 30977936) has touched upon issues related to the methodology and applicability of pLI scores. This commentary highlights the example of BRCA genes that have near-zero pLI scores but are known to harbor several deleterious missense variants.
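The first two major comments describe two concrete evaluation designs: scoring only genes that carry both variant classes, and holding out whole genes so that no gene contributes to both training and evaluation. The following is a minimal sketch of both, not the procedure used in the paper. The function names, the toy gene labels, and the toy scores are all hypothetical and chosen only for illustration (1 = pathogenic, 0 = benign).

```python
from collections import defaultdict
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a pathogenic variant outscores a benign one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mixed_class_genes(genes, labels):
    """Return the genes that carry at least one pathogenic and one benign variant."""
    seen = defaultdict(set)
    for g, y in zip(genes, labels):
        seen[g].add(y)
    return {g for g, ys in seen.items() if ys == {0, 1}}


def gene_holdout_split(genes, test_fraction=0.2, seed=0):
    """Assign whole genes to train or test so that no gene is shared between the two."""
    unique = sorted(set(genes))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(test_fraction * len(unique)))
    test_genes = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(genes) if g not in test_genes]
    test_idx = [i for i, g in enumerate(genes) if g in test_genes]
    return train_idx, test_idx


# Toy data (made up): gene A is the only gene with both classes.
genes = ["A", "A", "B", "B", "C", "C", "D"]
labels = [1, 0, 1, 1, 0, 0, 1]
scores = [0.4, 0.5, 0.8, 0.7, 0.1, 0.3, 0.6]

mixed = mixed_class_genes(genes, labels)
keep = [i for i, g in enumerate(genes) if g in mixed]
auc_all = auc(scores, labels)
auc_mixed = auc([scores[i] for i in keep], [labels[i] for i in keep])
train_idx, test_idx = gene_holdout_split(genes, test_fraction=0.25, seed=1)
```

In this toy example the overall AUC is high while the AUC within the single mixed-class gene is low, which is the pattern the reviewer is concerned about: a score can separate genes well without separating variants within a gene.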

MINOR:

- The bimodality of the UNEECON score distribution for active sites is worrisome, and the peak closer to 0.25 is a little confusing. I interpret this as “there are more variants in active sites that have low UNEECON scores than high.” This is counter-intuitive and warrants some explanation.

- In the functional analyses related to Fig. 5, are there any interesting depletions? I am curious about the functions of those genes that are tolerant to missense but not to LOF mutations. I am also not sure what “unclassified” means in this context.

- What is the difference between Eqns. 2 and 3? It is difficult to tell with q_i being defined.

- In the Methods section, it would be helpful to readers if a clear account of the parameters to be estimated is provided up front.

- The paper is missing details of the final model that emerged from the evaluation process, its architecture and its parameters.

- Similarly, the paper lacks details on dataset sizes, particularly in the context of model training and evaluation. How many variants were used to train the deep mixed-effects model? How many variants were included in the evaluations relevant to ClinVar? How many pathogenic and how many benign?

- I am also curious about the activation function of the output layer of the neural network. This is of particular relevance to z_ij and its scaling relative to u_j. Is there a potential for one quantity to systematically dominate the other in Eqn. 9?

Reviewer #2: The work represents an important advance in prioritization of genes and variants relevant to human disease. It has been known since the introduction of gene-level intolerance scoring in 2013 that gene-level metrics of the strength of purifying selection provide independent information about variant pathogenicity to the longer established variant-level metrics that largely depend on conservation and amino acid substitution features. While attempts have been made previously to integrate both approaches into a single predictive framework, these have been based on supervised learning approaches using a set of putatively pathogenic and benign variants. The work here combines a selected set of variant-level features with a gene-level term and estimates selective constraint operating against all possible gene sequence changes based on human polymorphism data compared against sequence-specific mutability. As such, it provides an integrated approach assessing purifying selection operating in the human population.

The authors have rerun the standard assessments used to test both gene-level and variant-level predictors with generally improved performance both for identifying relevant gene sets (e.g. haploinsufficient genes) and pathogenic variants. In addition to these advances, the model allows some novel biological insights, including explaining an important reason for the discrepancy between intolerance to missense and loss-of-function variation as being due to the proportion of proteins that is disordered. The model also highlights that the gene-level term is more informative than variant-level terms, which is still not as widely appreciated as it should be.

For these reasons the work here represents an important advance in the field.

While the paper is generally clearly written and the conclusions generally fair, I do have a couple of relatively minor suggestions for consideration. Perhaps most fundamentally, while the use of the UNEECON deep learning model to combine variant features and a gene-level term to predict the strength of selection operating against specific alleles is welcome, since it allows nonlinear combinations of these terms to be learned, it is striking that a linear approximation of the UNEECON model is very highly correlated, suggesting little benefit from the model learning optimum nonlinear combinations. The authors appropriately use the linear model to infer the relative importance of features, but the very high correlation between the two models suggests the linear model is likely to have similar performance to UNEECON. Given the more direct interpretability of the linear model, the authors should comment on whether the more complex model is in fact needed for use. The second small point is that some of the comparisons are inappropriate since some of the metrics are used in ways they were not intended for. For example, in Figure 3a, representing prediction in distinguishing pathogenic variants, gene-level metrics such as RVIS are compared directly to UNEECON. As outlined in the initial work, however, gene-level metrics are intended to be used alongside some version of a variant-level predictor (since, as emphasized here and in the original publications, the two approaches offer independent information). The fair comparison therefore for generating a version of Figure 3 focused on variants would be to use a combination of a variant- and gene-level metric for all those comparisons like RVIS that are gene-level metrics. This idea was outlined in the initial publications under the banner of a combined threshold for both gene level and variant level.
I have no doubt that UNEECON would still perform better, but one appropriate simple comparison would be to rerun these analyses including, for example, a hard threshold on some appropriate variant score such as PP2 alongside the quantitative gene-level score such as RVIS as currently used. Finally, the gene-level metrics in use are known to struggle with small genes since there is often not enough polymorphism data to infer selection. The authors should address robustness to gene size.
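The combined-threshold comparison suggested above can be sketched as follows. This is only one possible reading of the suggestion, not an analysis from the paper: a variant is ranked by its gene-level score when its variant-level score passes a hard cutoff, and is ranked below all other variants otherwise. The cutoff of 0.85, the variant identifiers, and all scores are hypothetical, and the gene-level score is assumed to be oriented so that higher means more constrained (a score such as RVIS, where lower values indicate intolerance, would need its sign flipped first).

```python
def combined_score(variant_score, gene_score, threshold, floor=float("-inf")):
    """Return the gene-level score if the variant-level score passes the cutoff, else `floor`."""
    return gene_score if variant_score >= threshold else floor


# Hypothetical variants: (id, variant-level score, gene-level score; higher = more constrained)
variants = [
    ("v1", 0.95, 1.2),
    ("v2", 0.10, 2.0),
    ("v3", 0.90, -0.5),
]

# Rank variants from most to least likely pathogenic under the combined rule.
ranked = sorted(
    variants,
    key=lambda v: combined_score(v[1], v[2], threshold=0.85),
    reverse=True,
)
order = [v[0] for v in ranked]
```

Here v2 sits in the most constrained gene but fails the variant-level cutoff, so it is ranked last. The resulting scores can be fed to the same ROC analysis as any single predictor.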

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Decision Letter 1

Scott M Williams

30 Apr 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Huang,

Thank you very much for submitting your Research Article entitled 'Unified inference of missense variant effects and gene constraints in the human genome' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Scott M. Williams

Section Editor: Natural Variation

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The revised version of the paper addresses most of the concerns that I had with the original version. However, I remain skeptical of the claim of UNEECON’s “unmatched” performance when it comes to pathogenicity prediction. Although the AUCs are indeed higher for UNEECON in Figs. 3, 4, S3 and S4, its performance in the most important region of the ROC curves (the low-false-positive-rate region) tends to be comparable to that of other methods. I suggest toning down such strong claims regarding the performance of UNEECON in pathogenicity prediction relative to other methods.
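To make the point about the low-false-positive-rate region concrete, one option is to compare methods by a partial AUC restricted to small false positive rates instead of the full AUC. The following is a minimal, illustrative sketch only: the function names `roc_points` and `partial_auc`, the 0.1 cutoff, and the toy labels and scores are assumptions introduced here and are not taken from the manuscript or the review.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        # Handle tied scores together so a tie yields a single ROC segment.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(labels: Sequence[int], scores: Sequence[float], max_fpr: float = 0.1) -> float:
    """Unnormalized area under the ROC curve for FPR in [0, max_fpr] (trapezoid rule)."""
    if not 0.0 < max_fpr <= 1.0:
        raise ValueError("max_fpr must be in (0, 1]")
    pts = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the FPR cutoff and clip the segment.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): 1 = pathogenic, 0 = benign.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
full_area = partial_auc(toy_labels, toy_scores, max_fpr=1.0)
low_fpr_area = partial_auc(toy_labels, toy_scores, max_fpr=0.1)
```

Two methods with similar full AUCs can differ substantially in `low_fpr_area`, and vice versa, which is why the two summaries are worth reporting side by side. The value is left unnormalized here; dividing by `max_fpr` would rescale it so that a perfect classifier scores 1.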

I would also like to follow up on the following statement in the item-by-item response: “Training a version of UNEECON without gnomAD variants in disease genes will disable UNEECON’s ability to learn gene-level constraints in disease genes, leading to an underestimation of UNEECON’s performance.” This gets to the actual motivation behind my comment. If gene-level constraints are that important to UNEECON’s performance, then UNEECON can be expected to underperform when predicting a pathogenic variant in a gene with no previously known disease association. The gnomAD subset that does not overlap with ClinVar serves as a proxy for such genes, as its coverage of the genome is quite comprehensive. My original concern was that UNEECON may simply be good at separating disease-associated genes (which, as the author correctly said, are subject to ascertainment bias) from those in gnomAD, and that this was a major driver of variant-level predictive performance. This concern is somewhat alleviated by the inclusion of Fig. S4, but a true test of UNEECON’s ability to contribute to novel discoveries is its ability to make correct variant-level predictions in “undiscovered” disease genes. If an experiment to test this seems infeasible, it would be helpful to clearly state this as a limitation of the model in the Discussion section.
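One way to frame the "undiscovered disease gene" test is as a gene-grouped holdout, in which every labelled variant from a held-out gene is kept out of anything used to fit or tune the predictor. The sketch below only illustrates the splitting step. The record layout `(gene, variant_id, label)`, the function name `split_by_gene`, and the toy gene names are hypothetical, and the sketch does not reproduce UNEECON's training procedure or the author's evaluation.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical record layout: (gene, variant_id, label), with label 1 = pathogenic.
Variant = Tuple[str, str, int]


def split_by_gene(
    variants: Sequence[Variant], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Variant], List[Variant]]:
    """Split variants so that no gene contributes to both the train and test sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be in (0, 1)")
    genes = sorted({gene for gene, _, _ in variants})
    if len(genes) < 2:
        raise ValueError("need at least two genes to split by gene")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(genes)
    # Hold out at least one gene, but always leave at least one for training.
    n_test = min(max(1, round(len(genes) * test_fraction)), len(genes) - 1)
    test_genes = set(genes[:n_test])
    train = [v for v in variants if v[0] not in test_genes]
    test = [v for v in variants if v[0] in test_genes]
    return train, test


# Toy data (hypothetical gene names): two variants per gene.
toy_variants = [
    (gene, f"{gene}_v{k}", k % 2)
    for gene in ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
    for k in range(2)
]
train_set, test_set = split_by_gene(toy_variants, test_fraction=0.2, seed=1)
```

Splitting at the gene level, not the variant level, is what prevents gene-level information from leaking between the two sets. Whether such a split is feasible for a model that learns gene-level constraints from population data is the open question raised above.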

Reviewer #2: The authors have done a thorough job of responding to the reviews, and I have no further comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Decision Letter 2

Scott M Williams

9 Jun 2020

Dear Dr Huang,

We are pleased to inform you that your manuscript entitled "Unified inference of missense variant effects and gene constraints in the human genome" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log in to Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to "Update My Information" (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Scott M. Williams

Section Editor: Natural Variation

PLOS Genetics

Hua Tang

Section Editor: Natural Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01659R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Scott M Williams

7 Jul 2020

PGENETICS-D-19-01659R2

Unified inference of missense variant effects and gene constraints in the human genome

Dear Dr Huang,

We are pleased to inform you that your manuscript entitled "Unified inference of missense variant effects and gene constraints in the human genome" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 File. Supplementary material.

    Supplementary figures and tables.

    (PDF)

    S2 File. Supplementary dataset.

    List of 1,912 genes intolerant to loss-of-function mutations.

    (TSV)

    Attachment

    Submitted filename: Response to the reviewers.pdf

    Attachment

    Submitted filename: Response to the reviewers.pdf

    Data Availability Statement

    The UNEECON program and precomputed UNEECON/UNEECON-G scores are available at https://github.com/yifei-lab/UNEECON.


    Articles from PLoS Genetics are provided here courtesy of PLOS
