Author manuscript; available in PMC: 2022 Nov 1.
Published in final edited form as: Hum Mutat. 2021 Aug 15;42(11):1503–1517. doi: 10.1002/humu.24269

A domain damage index to prioritizing the pathogenicity of missense variants

Hua-Chang Chen 1,2, Jing Wang 1,2, Qi Liu 1,2,*, Yu Shyr 1,2,*
PMCID: PMC8511099  NIHMSID: NIHMS1730062  PMID: 34350656

Abstract

Prioritizing causal variants is a major challenge in the clinical application of sequencing data. Prompted by the observation that 74.3% of pathogenic missense variants are located in protein domains, we developed an approach named the domain damage index (DDI). DDI identifies protein domains depleted of rare missense variants in the general population, and this depletion can be used as a metric to prioritize variants. DDI is significantly correlated with phylogenetic conservation, variant-level metrics, and reported pathogenicity. DDI achieved strong performance in distinguishing pathogenic variants from benign ones in three benchmark datasets. Combining DDI with the two other best-performing approaches considerably improved on each individual method, suggesting that DDI provides a powerful and complementary way to prioritize variants.

Keywords: protein domain, pathogenicity prediction, disease-causing, variant prioritization, constraint, conservation, missense variants

1. INTRODUCTION

Next-generation sequencing technologies, such as whole-exome and whole-genome sequencing, have been widely used in biomedical research and clinical diagnostics. By providing a comprehensive genetic profile of an individual, they have revolutionized our ability to identify the molecular basis of genetic disorders. Sequencing a typical personal exome or genome reveals thousands to millions of genetic variants (Belkadi et al., 2015; Ng et al., 2009); however, only a small fraction of these variants affect protein function. Among these functional variants, only a very few, typically one or two, are phenotypically causal. Distinguishing disease-causing variants from neutral ones remains an essential and challenging step in interpreting personal genomic data.

A number of computational tools have been developed for automatic pre-evaluation of variants pathogenicity based on genomic features and/or population allele frequency. The commonly used genomic features include conservation (such as GERP++(Davydov et al., 2010), SiPhy(Garber et al., 2009), phyloP(Pollard, Hubisz, Rosenbloom, & Siepel, 2010), and phastCons(Siepel et al., 2005)), protein sequence and structure (such as logR.E-value(Clifford, Edmonson, Nguyen, & Buetow, 2004), SIFT(Kumar, Henikoff, & Ng, 2009), PolyPhen-2(Adzhubei et al., 2010), MutPred(B. Li et al., 2009), MutationAssessor(Reva, Antipin, & Sander, 2011)), and physicochemical features of altered amino acids (such as MAPP(Stone & Sidow, 2005), FATHMM(Shihab et al., 2013), PROVEAN(Choi, Sims, Murphy, Miller, & Chan, 2012),GEMME(Laine, Karami, & Carbone, 2019), DeMaSk(Munro & Singh, 2020)). Some tools utilized unsupervised strategies by assuming the relationship between genomic scores and deleterious effect, e.g., high conservation suggesting functional importance. Others employed supervised machine learning algorithms to map genomic scores into potential pathogenicity based on known disease-causing variants, such as PolyPhen-2 (Adzhubei et al., 2010), MutationTaster(Schwarz, Roedelsperger, Schuelke, & Seelow, 2010), VEST(Carter, Douville, Stenson, Cooper, & Karchin, 2013), GWAVA (Ritchie, Dunham, Zeggini, & Flicek, 2014). Each genomic feature measures different properties of a variant and plays a critical role in identifying disease-causing variants. Recently, significant efforts have been made to integrate those diverse annotations into one consensus score, such as CONDEL(Gonzalez-Perez & Lopez-Bigas, 2011), CAROL(Lopes et al., 2012), and CADD(Kircher et al., 2014). CADD trained a support vector machine from 63 different annotations. 
They include not only scores derived from individual approaches (such as protein-level scores from SIFT and PolyPhen) but also other genomic features (such as expression value, histone modification, nucleosome occupancy, and open chromatin tracks from ENCODE) (Kircher et al., 2014). In addition, ensemble methods have emerged that integrate the output of multiple approaches into a unified prediction, such as metaSVM (Dong et al., 2015) and REVEL (Ioannidis et al., 2016). For example, REVEL incorporated 18 scores from 13 tools as features and trained a random forest model on known pathogenic and neutral missense variants (Ioannidis et al., 2016). Because the pool of known pathogenic variants is limited, however, some component tools of an ensemble may already have been trained on variants in the test dataset, potentially leading to overfitting or circularity (Grimm et al., 2015). Although genomic features provide useful site-specific information for variant prioritization, variants affecting protein function are not necessarily disease-causing. Evidence supporting pathogenicity, such as functional studies of the impact of a sequence variant, genome-wide association with genetic disorders, co-segregation with disease status, and background variation in population controls, is needed to prevent misannotation (MacArthur et al., 2014).

Population-based sequencing efforts, such as the Genome Aggregation Database (gnomAD)(Karczewski et al., 2020), Exome Aggregation Consortium (ExAC)(Lek et al., 2016), and 1000 Genomes(The Genomes Project et al., 2015), have generated large-scale reference landscapes of human genetic variants from very common to ultra-rare. Population allele frequency from these genetic catalogs provides powerful and complementary information for variants prioritization since rare variants are more likely to be damaging than common ones. Efforts have been made to incorporate the allele frequencies in different populations with genomic features to estimate variant pathogenicity, such as Eigen(Ionita-Laza, McCallum, Xu, & Buxbaum, 2016), eDiVA(Bosio et al., 2019) and LEAP(Lai et al., 2020). However, there might be tens of thousands of rare variants in an individual exome or genome(Belkadi et al., 2015). Moreover, population allele frequency confounds with population stratification, resulting in large diversity among populations. Some genes tend to have more rare variants naturally due to higher mutation rates or longer length. Due to these subpopulation-specific and gene-specific features, extra caution should be paid when interpreting the likelihood of a rare variant as disease-causing. Based on population allele frequency and variant density, gene-level metrics assess the possibility of one gene being pathogenic under the assumption that highly damaged genes in human populations are unlikely to cause disease(Itan et al., 2015; Lek et al., 2016; Petrovski, Wang, Heinzen, Allen, & Goldstein, 2013). For example, RVIS(Petrovski et al., 2013) and ExAC z-score(Lek et al., 2016) developed gene-level intolerance scores to estimate accumulated mutational damage in the general population. GDI integrated the populational genetic data with the CADD score to model the intolerance of a gene to potential damaging variants(Itan et al., 2015). 
DOMINO(Quinodoz et al., 2017) used multiple features such as gene-level risk score (ExAC z-score), protein-interaction score (STRING - combined score), and conservation score (PhyloP), to infer the genes’ association with dominant disorders by machine learning. These gene-level metrics have been very successful for identifying genes under strong purifying selection (constrained). However, they cannot describe the regional heterogeneity of constraint within one gene. MPC(Kaitlin E. Samocha et al., 2017), MTR(Traynelis et al., 2017), subRVIS(Gussow, Petrovski, Wang, Allen, & Goldstein, 2016), CCR(Havrilla, Pedersen, Layer, & Quinlan, 2019), Pathopredictor(Evans et al., 2019), and LIMBR(Hayeck et al., 2019), have been presented to address the regional variability in missense intolerance, which identified regions constrained owing to an atypical dearth of variation. As an example, MTR compared the observed and the expected missense ratio in a sliding window of 31 codons to define the intolerance of the window(Traynelis et al., 2017). Instead of comparing the number of variants, CCR utilized the length of regions without any protein-changing variations to estimate the constraint level based on the assumption that longer regions tend to be more constrained. Using similar logic as RVIS, subRVIS and LIMBR assessed the departure of the observed number of common missense variants from the expectation given the total number of variants in a region. Most recently, MVP predicts the pathogenicity of missense variants by deep residual network, which leverages not only “raw” genomic features, but also many deleterious scores from previous approaches, including gene-level constraint, region-level metrics and variant-level scores (Qi et al., 2021).

Protein domains are the most functionally important regions of protein-coding genes. According to Pfam (El-Gebali et al., 2019), they contain more than 96% of active sites and 89% of binding sites, including those for ATP, DNA, calcium, and other substrates. They also mediate about 75% of protein-protein interactions (Diella et al., 2008). Domain mutation analyses in cancer, which map somatic mutations to domains within gene families, have identified oncodomains and functional mutation hotspots (Miller et al., 2015; Peterson, Gauran, Park, Park, & Kann, 2017). Prompted by the observation that protein domains are enriched for pathogenic variants, we developed a domain-centric, population-based approach named DDI. DDI aims to identify highly constrained domains by their depletion of rare missense variants in the general population. We demonstrated that DDI is significantly correlated with evolutionary conservation, variant-level metrics, and reported pathogenicity. DDI achieved strong performance in distinguishing pathogenic variants from neutral ones. Combining DDI with either of two ensemble methods, MVP and REVEL, greatly improved performance, suggesting that DDI can serve as a complementary and informative measure for prioritizing damaging missense variants.

2. METHODS

To quantify each domain's intolerance to rare missense variants in the general population, we compared the observed and expected numbers of rare missense variants to obtain the DDI z-score. The observed number of rare missense variants was obtained from gnomAD using MAF<0.1%. The expected number of missense variants in a domain was estimated by a permutation procedure based on the assumption that coding regions (within and outside of protein domains) were under neutral selection (Figure 1). Domains with significantly fewer missense variants than expected show evidence of selective constraint.

Figure 1. Workflow of DDI.

(A) Construction of DDI using gnomAD genomic data. The variants in the CDS regions are used to generate the mutation table. The variants in the intronic regions are used to calculate the expected CDS variant count. (B) The expected domain variant count is then calculated via permutation. The variant counts in protein domains for each consequence are aggregated and compared with the observed counts to calculate the DDI.

2.1. Observed number of rare missense variants per domain

Amino acid coordinates of protein domains were obtained from Pfam (v31.0) (El-Gebali et al., 2019). To count the observed number of rare missense variants in each domain ($N_{Domain}^{obs}$), rare variants (MAF<0.1%) were extracted from the gnomAD genomes VCF file (v2.1.1) and mapped to gene structures (Figure 1). Amino acid changes in canonical transcripts were annotated by ANNOVAR (Wang, Li, & Hakonarson, 2010). Domains were traced back to protein sequences and mapped to GRCh37 genomic coordinates. Briefly, the canonical UniProt IDs of protein-coding genes with the correct start/stop codon and length were obtained from HGNC (Yates et al., 2017), and the canonical transcript IDs were then acquired from UniProt (UniProt, 2019). Amino acid coordinates of protein domains were converted to genomic coordinates, and rare variants were mapped to protein domain regions using bedtools (Quinlan & Hall, 2010).

2.2. Expected number of rare missense variants per domain

The expected number of rare missense variants per domain ($N_{Domain}^{exp}$) was estimated by a permutation procedure based on the assumption that coding regions (within and outside of protein domains) were under neutral selection. The permutation procedure relies on a specific mutational model and on the expected number of variants in the coding region ($N_{CDS}^{exp}$) (Figure 1).

The mutational model is an extension of the model developed by Samocha et al. (K. E. Samocha et al., 2014). We made two main extensions. First, mutational probabilities were generated from coding regions rather than intergenic regions like Samocha et al. did. Second, Samocha et al. counted the trinucleotide XY1Z to XY2Z change, which obtained a mutation table of 64 * 3 = 192 possible mutations. In comparison, our model also considered the codon position where the change occurred, i.e., the first, the second, or the third nucleotides in the codon. When the change occurred at the first nucleotide in the codon, we counted the change of the trinucleotide Y1XZ to Y2XZ. When the change occurred at the second nucleotide in the codon, we counted the change of the trinucleotide XY1Z to XY2Z. When the change occurred at the third nucleotide in the codon, we counted the change of the trinucleotide XZY1 to XZY2. Therefore, our model contained a mutation table of 3*64*3=576 possible mutations (Suppl. Table 1). This mutation model retains the intrinsic factors like codon context and codon position, which mimics the mutation profile under neutral selection more accurately.
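As a rough illustration of the two table sizes described above, the following Python sketch enumerates the entries of a context-only table and of a codon-aware table. The tuple layouts used here are assumptions made for illustration; the actual layout of Suppl. Table 1 may differ.

```python
from itertools import product

BASES = "ACGT"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 64 trinucleotides

# Context-only table: the middle base of each trinucleotide changes to one
# of its three alternatives, giving 64 * 3 entries.
context_only = [(tri, alt) for tri in CODONS for alt in BASES if alt != tri[1]]

# Codon-aware table: the change may occur at any of the three codon
# positions, giving 3 * 64 * 3 entries.
codon_aware = [
    (codon, pos, alt)
    for codon in CODONS
    for pos in range(3)
    for alt in BASES
    if alt != codon[pos]
]

print(len(context_only), len(codon_aware))  # 192 576
```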

We hypothesized that variants within intronic regions are more likely to be under neutral selection than those in coding regions. Therefore, the expected number of variants in the coding region ($N_{CDS}^{exp}$) of each gene was derived from the number of observed rare variants in the intronic region ($N_{intron}^{obs}$) and the coding/intron length ratio (Equation 1). For the 1408 genes without introns, the observed number of variants in the coding region was used. Furthermore, when summarizing all genes, we found that the number of coding variants per kb is lower than that in introns, with a coefficient (C in Equation 1) of about 0.916. Thus, the expected number of coding variants in each gene was further adjusted by this coefficient.

$$N_{CDS}^{exp} = C \cdot \frac{L_{CDS}}{L_{intron}} \cdot N_{intron}^{obs} \qquad (1)$$
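A minimal Python sketch of Equation 1 follows. The function name, argument names, and the example gene are hypothetical; only the formula and the coefficient of about 0.916 come from the text. Intronless genes, which the text handles by using the observed coding count, are deliberately left out.

```python
def expected_cds_variants(n_intron_obs, cds_length, intron_length, coefficient=0.916):
    """Equation 1: scale the observed intronic rare-variant count to the CDS length."""
    if intron_length <= 0:
        # Genes without introns are treated separately in the text.
        raise ValueError("intron_length must be positive")
    return coefficient * (cds_length / intron_length) * n_intron_obs


# Hypothetical gene: 1,500 bp of CDS, 30,000 bp of intron,
# and 400 rare intronic variants.
example_expected = expected_cds_variants(400, 1500, 30000)
print(round(example_expected, 2))  # 18.32
```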

Additionally, we estimated the mutation probabilities of the three codon positions ($f_i$, i = 1, 2, or 3). We tallied CDS mutation counts at the first, second, and third codon positions ($N_1$, $N_2$, and $N_3$) and obtained the mutation probability of each position ($f_i$) by dividing its count by the total number of mutations ($N_1 + N_2 + N_3$) (Equation 2). As a result, 37.6% of variants occur at the third base of the codon, whereas 32.4% and 30.0% occur at the first and second bases, respectively. This is consistent with a previous report that the third codon position is the least functionally constrained, while the second codon position is the most functionally constrained (Bofkin & Goldman, 2007).

$$f_i = \frac{N_i}{N_1 + N_2 + N_3} \qquad (2)$$

In the permutation procedure, the expected number of variants at each codon position was calculated from $N_{CDS}^{exp}$ (Equation 3).

$$N_i = N_{CDS}^{exp} \cdot f_i \qquad (3)$$
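Equations 2 and 3 can be sketched in Python as below. The function names are hypothetical, and the tallies 324, 300, and 376 are invented solely so that the fractions reproduce the reported 32.4%, 30.0%, and 37.6%; they are not the actual counts.

```python
def codon_position_fractions(n1, n2, n3):
    """Equation 2: fraction of CDS mutations at each codon position."""
    total = n1 + n2 + n3
    if total == 0:
        raise ValueError("at least one mutation is required")
    return (n1 / total, n2 / total, n3 / total)


def expected_per_position(n_cds_exp, fractions):
    """Equation 3: split the expected CDS count across the codon positions."""
    return tuple(n_cds_exp * f for f in fractions)


# Invented tallies chosen to match the reported percentages.
fractions = codon_position_fractions(324, 300, 376)
per_position = expected_per_position(50, fractions)
print([round(f, 3) for f in fractions])     # [0.324, 0.3, 0.376]
print([round(n, 1) for n in per_position])  # [16.2, 15.0, 18.8]
```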

For each permutation, $N_i$ sites at codon position i were randomly selected in the coding region. For each selected site, the mutational probabilities were obtained from the mutation table (Suppl. Table 1) based on the codon context and codon position, and a random mutation was generated using a roulette wheel selector. The consequence of the simulated mutation was determined, and the number of missense mutations in the domain was tallied. The permutation was repeated 2000 times to obtain the distribution of the expected number of missense variants in each domain (Figure 1).
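The permutation step can be sketched in Python as follows. This is a toy illustration under several assumptions, not the authors' implementation: the per-position site counts are taken as integers, sites are sampled without replacement within a codon position, a uniform placeholder (`uniform_weights`) stands in for Suppl. Table 1, and a missense change is simplified to an amino acid change that involves no stop codon. All function names and the example sequence are hypothetical.

```python
import itertools
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): a for c, a in zip(itertools.product(BASES, repeat=3), AMINO)}


def roulette_pick(options, weights, rng):
    """Roulette-wheel selection: pick an option with probability proportional to its weight."""
    spin = rng.random() * sum(weights)
    cumulative = 0.0
    for option, weight in zip(options, weights):
        cumulative += weight
        if spin < cumulative:
            return option
    return options[-1]  # guard against floating-point round-off


def uniform_weights(codon, pos):
    """Placeholder mutation table: equal weight for each alternative base."""
    return {b: 1.0 for b in BASES if b != codon[pos]}


def simulate_missense_counts(cds, domain, n_per_position, mutation_weights,
                             n_perm=2000, seed=0):
    """Return the simulated missense count inside `domain` for each permutation.

    domain is a half-open range of 0-based codon indices (start, end);
    n_per_position holds the integer number of sites drawn per codon position.
    """
    rng = random.Random(seed)
    n_codons = len(cds) // 3
    counts = []
    for _ in range(n_perm):
        missense = 0
        for pos, n_sites in enumerate(n_per_position):
            for codon_idx in rng.sample(range(n_codons), n_sites):
                codon = cds[3 * codon_idx: 3 * codon_idx + 3]
                weights = mutation_weights(codon, pos)
                alts = sorted(weights)
                alt = roulette_pick(alts, [weights[a] for a in alts], rng)
                mutated = codon[:pos] + alt + codon[pos + 1:]
                ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
                is_missense = alt_aa != ref_aa and "*" not in (ref_aa, alt_aa)
                if is_missense and domain[0] <= codon_idx < domain[1]:
                    missense += 1
        counts.append(missense)
    return counts


# Toy coding sequence of 8 codons with a "domain" spanning codons 2-5.
toy_cds = "ATGGCCGAAAAGCTGTGGACCGGT"
toy_counts = simulate_missense_counts(toy_cds, (2, 6), (2, 2, 2), uniform_weights, n_perm=200)
```

The mean and standard deviation of the returned counts are the quantities that a z-score comparison against the observed count would need.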

2.3. Domain damage index score

Based on the observed and the expected number of missense variants for each domain, a signed Z score was developed to estimate the significance of domains depleted or enriched of rare missense variants.

$$z\text{-score} = \frac{\mu[N_{Domain}^{exp}] - N_{Domain}^{obs}}{\sigma[N_{Domain}^{exp}]}$$

where μ and σ are the mean and the standard deviation of $N_{Domain}^{exp}$ over the 2000 permutations, and $N_{Domain}^{obs}$ is the observed number of missense variants in the domain. A positive z-score suggests that the domain has fewer rare variants than expected in the general population, i.e., that it is intolerant of missense variants (constrained). In contrast, a negative z-score indicates that mutational damage has accumulated in the general population (unconstrained). The percentile of each domain was also calculated by dividing its rank by the total number of domains. A z-score percentile of 99% indicates that the domain is more constrained than 99% of domains in the human genome, whereas a percentile of 1% suggests that the domain is highly damaged.
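The z-score and percentile calculation can be sketched in Python as below. Two details are assumptions because the text does not specify them: the population standard deviation (`statistics.pstdev`) is used for σ, and tied z-scores are ranked in input order. The function names and the example numbers are hypothetical.

```python
from statistics import mean, pstdev


def ddi_z_score(n_obs, permuted_counts):
    """Signed z-score: positive when fewer variants are observed than expected."""
    mu = mean(permuted_counts)
    sigma = pstdev(permuted_counts)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("permuted counts have zero variance")
    return (mu - n_obs) / sigma


def z_score_percentiles(z_scores):
    """Percentile of each domain: ascending rank divided by the number of domains."""
    order = sorted(range(len(z_scores)), key=lambda i: z_scores[i])
    result = [0.0] * len(z_scores)
    for rank, index in enumerate(order, start=1):
        result[index] = 100.0 * rank / len(z_scores)
    return result


# Toy permutation counts with mean 10; observing 6 variants gives a positive score.
z_depleted = ddi_z_score(6, [8, 10, 12, 10])
z_enriched = ddi_z_score(14, [8, 10, 12, 10])
example_percentiles = z_score_percentiles([0.5, -1.0, 2.0, 1.0])
print(example_percentiles)  # [50.0, 25.0, 100.0, 75.0]
```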

2.4. Odds ratio of pathogenicity for domain

The odds ratio of pathogenicity for a domain was defined as the ratio of the odds of being pathogenic for variants located in the domain to the odds of being pathogenic for those located outside of the domain. A high odds ratio indicates enrichment of pathogenic variants in the domain.

$$\text{odds ratio} = \frac{(N_{domain}^{patho} + 0.5)/(N_{domain}^{benign} + 0.5)}{(N_{out}^{patho} + 0.5)/(N_{out}^{benign} + 0.5)}$$

where $N_{domain}^{patho}$ and $N_{out}^{patho}$ are the numbers of pathogenic missense variants in the domain and outside of the domain, respectively, and $N_{domain}^{benign}$ and $N_{out}^{benign}$ are the corresponding numbers of benign missense variants. The pathogenic variants were from ClinVar (labelled as pathogenic/likely-pathogenic in the ClinSig field), while the benign variants came from rare variants in gnomAD (MAF<0.1%). Rare gnomAD variants reported as pathogenic in ClinVar were excluded from the benign dataset. To avoid division by zero, we added 0.5 to each count (Haldane-Anscombe correction) (Ruxton & Neuhäuser, 2013).
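A minimal Python sketch of the corrected odds ratio is given below. The function name and the example counts are hypothetical; the example shows how the 0.5 correction keeps the ratio finite when no pathogenic variant lies outside the domain.

```python
def domain_odds_ratio(domain_patho, domain_benign, out_patho, out_benign, correction=0.5):
    """Odds ratio of pathogenicity with the Haldane-Anscombe correction."""
    odds_in = (domain_patho + correction) / (domain_benign + correction)
    odds_out = (out_patho + correction) / (out_benign + correction)
    return odds_in / odds_out


# Hypothetical domain: 10 pathogenic and 5 benign variants inside,
# 0 pathogenic and 20 benign variants outside.
example_or = domain_odds_ratio(10, 5, 0, 20)
print(round(example_or, 2))  # 78.27
```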

2.5. Performance comparison

To evaluate the performance of DDI in distinguishing pathogenic variants from benign ones, we compiled three testing datasets: (a) missense variants from DoCM (Database of Curated Mutations) (Ainscough et al., 2016) as positives and randomly selected rare missense variants (MAF<0.1%) from MGRB (Medical Genome Reference Bank) (Lacaze et al., 2019) as negatives. DoCM is a highly curated database of disease-causing mutations. MGRB contains the whole-genome data of 4000 healthy elderly individuals (age >=70, no reported history of cancer, cardiovascular disease, or dementia). (b) missense variants from the cancer hotspots database (Chang et al., 2016) as positives and randomly selected rare missense variants from MGRB as negatives. Cancer hotspots data have been used as positives in a recent pathogenicity prediction study (Qi et al., 2021). (c) missense variants labeled as “pathogenic” in COSMIC (Tate et al., 2019) as positives and randomly selected rare missense variants from MGRB as negatives. Of note, we did not use pathogenic variants from ClinVar as positives because several classifiers, such as CADD and GDI, were trained on ClinVar directly or indirectly. Variants overlapping with ClinVar version 20150203 were excluded from all benchmark datasets to avoid circularity.
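The ClinVar exclusion step can be sketched in Python as a simple set filter. Representing a variant as a `(chromosome, position, reference, alternative)` tuple is an assumption, as are the function names and the toy variants; the text does not describe how overlap was determined.

```python
def remove_overlap(variants, excluded):
    """Drop variants whose (chrom, pos, ref, alt) key appears in `excluded`."""
    excluded_keys = set(excluded)
    return [v for v in variants if v not in excluded_keys]


def build_benchmark(positives, negatives, clinvar):
    """Label positives as 1 and negatives as 0 after removing ClinVar overlaps."""
    kept_pos = remove_overlap(positives, clinvar)
    kept_neg = remove_overlap(negatives, clinvar)
    return [(v, 1) for v in kept_pos] + [(v, 0) for v in kept_neg]


# Toy variants; the second positive also appears in the toy ClinVar set.
toy_positives = [("1", 100, "A", "G"), ("2", 200, "C", "T")]
toy_negatives = [("3", 300, "G", "A")]
toy_clinvar = [("2", 200, "C", "T")]
toy_benchmark = build_benchmark(toy_positives, toy_negatives, toy_clinvar)
```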

Without a single base pair resolution as variant-level methods have, the DDI score of one domain is extended to all variants in the domain. DDI was compared with 14 metrics, REVEL, MVP, GDI, RVIS, ExAC missense constraint, subRVIS, MTR, CADD, MutationTaster, MutationAssessor, FATHMM, PROVEAN, metaSVM, and MutPred. REVEL, MVP, metaSVM, and CADD are variant-level ensemble methods which integrated scores from multiple predictors and genomic features. Scores were downloaded from the corresponding websites and queried using tabix (H. Li, 2011). GDI, RVIS, and ExAC missense constraint are gene-level classifiers, which assume variants within intolerant genes are more likely to be pathogenic. subRVIS and MTR are region-level metrics to prioritize variants based on the identification of least-damaged regions in the general population. The scores of GDI, RVIS, ExAC missense constraint, subRVIS were downloaded from the supplementary data of the articles. MTR scores were retrieved from its website (http://mtr-viewer.mdhs.unimelb.edu.au/mtr-viewer/). The scores of other methods were obtained from dbNSFP v4.1 database (Liu, Li, Mou, Dong, & Tu, 2020).

2.6. Combination of DDI with REVEL and MVP

Because DDI, REVEL, and MVP showed the best performance on three benchmark datasets, we combined DDI with REVEL or MVP. The combination score was the mean percentile of DDI and REVEL/MVP, which was obtained by first ranking prediction scores of each method for all missense variants in the exome, calculating the percentile, and averaging the percentile from DDI and REVEL/MVP.
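The mean-percentile combination can be sketched in Python as follows. The sketch assumes that a higher score means more likely pathogenic for both methods and that ties are ranked in input order; neither detail is stated in the text. The function names and the three toy variants are hypothetical.

```python
def percentile_ranks(scores):
    """Rank scores in ascending order and convert each rank to a percentile."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = 100.0 * rank / len(scores)
    return result


def combine_by_mean_percentile(ddi_scores, other_scores):
    """Average the percentile of DDI with that of a second method, per variant."""
    if len(ddi_scores) != len(other_scores):
        raise ValueError("both score lists must cover the same variants")
    ddi_pct = percentile_ranks(ddi_scores)
    other_pct = percentile_ranks(other_scores)
    return [(a + b) / 2 for a, b in zip(ddi_pct, other_pct)]


# Three toy variants scored by DDI and by a second method.
combined = combine_by_mean_percentile([1.0, -0.5, 2.0], [0.9, 0.2, 0.4])
print([round(c, 2) for c in combined])  # [83.33, 33.33, 83.33]
```

Averaging percentiles rather than raw scores puts the two methods on a common scale, which matters because a z-score and a probability-like score have different ranges.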

2.7. Metrics for performance evaluation

We evaluated performance using the following metrics: 1) The ROC curve and the corresponding AUC value, which were used to assess the overall performance across thresholds. ROC analysis was performed using the scikit-learn (Pedregosa et al., 2011) package in Python 3.7. 2) The positive predictive value (PPV), the negative predictive value (NPV), the true positive rate (TPR, also referred to as sensitivity), the true negative rate (TNR, also referred to as specificity), the false positive rate (FPR), the false negative rate (FNR), accuracy, the Matthews correlation coefficient (MCC), and Youden’s index (Youden, 1950) at a given threshold. The threshold with the highest Youden’s index was chosen.

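$$PPV = \frac{TP}{TP + FP}$$
$$NPV = \frac{TN}{TN + FN}$$
$$TPR = \frac{TP}{TP + FN}$$
$$TNR = \frac{TN}{TN + FP}$$
$$FPR = \frac{FP}{FP + TN}$$
$$FNR = \frac{FN}{FN + TP}$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$\text{Youden index} = \frac{TP}{TP + FN} + \frac{TN}{TN + FP} - 1$$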
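The threshold metrics and the Youden-based threshold choice can be sketched in plain Python as below. This is not the scikit-learn workflow used for the ROC analysis; it only illustrates the formulas. The sketch assumes that scores at or above the threshold are called pathogenic, that both classes are present, and that candidate thresholds are the observed scores. The function names and toy data are hypothetical.

```python
from math import sqrt


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called pathogenic."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(tp, fp, tn, fn):
    """Compute the threshold-dependent metrics from the confusion counts."""
    tpr, tnr = ratio(tp, tp + fn), ratio(tn, tn + fp)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "TPR": tpr,
        "TNR": tnr,
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "Accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
        "Youden": tpr + tnr - 1,
    }


def best_youden_threshold(labels, scores):
    """Return the observed score that maximizes Youden's index."""
    def youden(threshold):
        return threshold_metrics(*confusion_counts(labels, scores, threshold))["Youden"]
    return max(sorted(set(scores)), key=youden)


# Toy data: four pathogenic (1) and five benign (0) variants.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05]
toy_threshold = best_youden_threshold(toy_labels, toy_scores)
toy_metrics = threshold_metrics(*confusion_counts(toy_labels, toy_scores, toy_threshold))
```

With these toy data the chosen threshold is 0.35, where all four pathogenic variants and four of the five benign variants are classified correctly.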

3. RESULTS

3.1. Missense pathogenic variants enriched in protein domains

There are 34,726 pathogenic/likely pathogenic missense variants reported in ClinVar, involving 3,533 genes and 3,982 domains. Of those variants, 74.3% (25,793) are located in protein domains, although protein domains account for only 35.5% of total coding regions, suggesting a 2.1-fold enrichment of pathogenic variants in domains (p-value<1e-300). Among these 3,982 domains, 1,237 (31.1%) contain all pathogenic variants of the protein, and 1,463 (36.7%) contain more than 80% of the protein's pathogenic variants (Figure 2a, Suppl. Table 2). Pathogenic variants from the COSMIC database show a similar pattern (Figure 2b). In the general population (e.g., gnomAD), however, the density of rare variants (MAF<0.1%) was significantly lower in domains than in outside regions (p=5.9e-9) (Figure 2a). The odds ratio for each domain was defined as the ratio of the odds of being pathogenic for variants located in the domain to the odds of being pathogenic for those located outside of the domain (see Methods). 3,025 (76%) domains have OR>1, and 2,285 (57.4%) have OR>2 (Figure 2c, Suppl. Table 2), suggesting that variants located in those domains are more likely to be pathogenic. The significant enrichment of pathogenic variants and the depletion of rare variants in the general population indicate that protein domains are hotspots for deleterious variants, which supports our hypothesis that a domain-centric metric has great potential to identify disease-causing variants.
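The fold enrichment quoted above follows directly from the reported counts, as this short Python check shows. The variable names are hypothetical, and the significance test behind the reported p-value is not specified in the text, so it is not reproduced here.

```python
pathogenic_total = 34726        # ClinVar pathogenic/likely pathogenic missense variants
pathogenic_in_domains = 25793   # those located in protein domains
domain_fraction_of_cds = 0.355  # share of coding regions covered by domains

fraction_in_domains = pathogenic_in_domains / pathogenic_total
fold_enrichment = fraction_in_domains / domain_fraction_of_cds
print(f"{fraction_in_domains:.1%} in domains, {fold_enrichment:.1f}-fold enrichment")
# 74.3% in domains, 2.1-fold enrichment
```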

Figure 2. Missense Pathogenic variants are enriched in protein domains.

The number of domains binned by the fraction of a protein's pathogenic variants from ClinVar (A) and COSMIC (B), and of rare missense variants from gnomAD (MAF<0.1%), that fall within the domain. The number of domains in each bin defined by the odds ratio of pathogenicity (C).

3.2. Correlation of DDI with reported pathogenicity

DDI was defined as the enrichment or depletion of rare variants to a given domain in the general population (gnomAD). Domains with high DDI values are the least damaged, while those with low DDI values are the most damaged. The least damaged domains are under high selective constraint and of most functional importance. Variants in the least damaged domains are more likely to cause disease.

Although DDI was calculated from rare missense variants in the general population, the DDI value is strongly correlated with reported pathogenicity in ClinVar and COSMIC. Pathogenic variants have significantly higher DDI values than benign variants in both the ClinVar and COSMIC databases (p-value<1e-300) (Figure 3a and 3b). In ClinVar, the median DDI value is 1.02 for pathogenic variants, compared with −1.1 for benign variants. We observed a similar pattern in COSMIC, where the median DDI value is 1.45 for pathogenic variants but −0.43 for benign ones. We divided DDI scores into percentile bins and defined the odds ratio as the ratio of the odds of pathogenicity for variants within a specific DDI percentile bin to the odds of pathogenicity for variants outside that bin. The odds ratio reached 3.60 (95% CI 3.47–3.73) when DDI was above the 90th percentile, suggesting that variants with high DDI are highly likely to cause disease. The odds ratio decreased as the DDI value dropped (Figure 3c). When DDI values fell below the 50th percentile, the odds ratio became less than 1, indicating that variants in highly damaged domains are more likely to be benign (Figure 3c). The strong correlation between DDI and reported pathogenicity confirmed that DDI has great potential as a metric for distinguishing pathogenic variants from benign ones.

Figure 3. Correlation of DDI with reported pathogenicity.

DDI score distribution for pathogenic (blue) and benign variants (red) in ClinVar (A) and in COSMIC (B). The odds ratio of pathogenicity in DDI percentile bins (C).

3.3. Correlation of DDI with phylogenetic conservation

Although evolutionary conservation was not used for DDI calculation, a strong correlation was observed between DDI and conservation score. When the average conservation scores of domains in different DDI percentile ranges were compared, domains with higher DDI were more conserved than domains with lower DDI. Domains with DDI above the 95th percentile had the highest conservation score (mean=5.0, p=1.57e-251), followed by those above the 90th, 80th, and 50th percentiles, while those with DDI in the bottom five percentiles showed the lowest conservation score (mean=2.46, p=3.2e-112) (Figure 4a). The Spearman correlation coefficient between conservation score and DDI is 0.489 (p-value<1e-300). This result suggests that domains with higher DDI tend to be under stronger purifying selection than those with lower DDI. Therefore, DDI can be considered a surrogate for evolutionary constraint.

Figure 4. Correlation of DDI with domain conservation and average CADD score.

The conservation score (A) and average CADD score (B) distribution for different DDI percentile bins. The red lines indicate the domains with >50% DDI percentile, the blue lines indicate the domains with ≤ 50% DDI.

3.4. Correlation of DDI with CADD score

The correlation between DDI and CADD score was also investigated. CADD is a variant-level damage prediction metric and one of the best methods for prioritizing variants (Kircher et al., 2014). CADD integrated 63 genomic features, including conservation, genic effects, regulatory element annotation, and protein structure information (Kircher et al., 2014; Rentzsch, Witten, Cooper, Shendure, & Kircher, 2019). The CADD score for a domain was the average of all possible CADD scores of each base in the domain. A significant correlation was observed between DDI and CADD scores (Figure 4b). Domains with DDI above the 95th percentile had the highest CADD score (mean=23.33, p=2.69e-284), followed by those above the 90th, 80th, and 50th percentiles, while those with DDI in the bottom five percentiles showed the lowest CADD score (mean=18.58, p=5.09e-170). The Spearman correlation coefficient between average CADD score and DDI is 0.443 (p-value<1e-300). These results indicate that DDI is a useful metric for predicting deleterious effects.

3.5. Identification of constrained domains within highly damaged genes

Although gene-level metrics, such as GDI, RVIS, and ExAC z-score, have proved successful in prioritizing genes most likely to cause disease, they cannot capture heterogeneity in missense intolerance across gene regions. Since pathogenic variants often cluster in particular gene regions, especially in functionally important elements, DDI holds promise for finding units under strong constraint even in highly damaged genes. There are 3,698 highly damaged genes with GDI constraint rank <20. Most domains in these genes had lower-than-average DDI values, indicating that domains in highly damaged genes also tend to be less constrained. However, 217 domains involving 196 proteins have DDI values above the 90th percentile (Suppl. Table 3), suggesting that they are under strong constraint and that variants in these domains are highly likely to be deleterious.

To understand the contradictory prediction between gene-level metric scores (GDI) and domain-level metric scores (DDI), the pathogenic variants from ClinVar and rare variants from gnomAD were examined in these 196 proteins. 47 domains involving 41 proteins were found to have at least one missense pathogenic variant reported in ClinVar (Suppl. Table 4). 42 out of these 47 domains have an odds ratio of pathogenicity > 1, with a median odds ratio of 10.2 (Suppl. Table 4). These results indicate that DDI is able to find those pathogenic domains even when genes are highly damaged. As one example, with a GDI rank of 4.54, CACNA1A is more tolerant to damage than 95.5% of genes, suggesting it is highly damaged in the general population and thus unlikely to cause disease. CACNA1A protein contains six domains, four ion transport domains (PF00520), one voltage-gated calcium channel IQ domain (PF08763), and one voltage-dependent L-type calcium channel domain (PF16905). The ion transport domain (PF00520) is very constrained with a DDI constraint rank of 99.76, while the other two domains have a DDI rank of 78.8 and 71.1 respectively. CACNA1A belongs to a family of voltage-gated calcium channels that mediate the entry of calcium ions into excitable cells and are also required for a variety of calcium-dependent processes. Variants in CACNA1A cause a spectrum of neurological disorders, such as Familial hemiplegic migraine type 1(FHM1), Episodic ataxia type 2(EA2), and spinocerebellar ataxia type 6(Indelicato et al., 2019; Luo et al., 2017) . There are 54 pathogenic variants reported in ClinVar, all of which are exclusively located in the ion transport domain (PF00520) (Figure 5a). In contrast, rare missense variants barely appear within the four ion-transport domains in the general population (Figure 5b). These findings support DDI results ranking the ion-transport domain of CACNA1A to be the most constrained. 
The enrichment of pathogenic variants was also observed in constrained domains of other highly damaged genes, such as SMARCA2, ERBB2, AR, FOXC1, and NSD1 (Suppl. Figure S1).

Figure 5. Constrained domains in highly damaged genes.

The lollipop plot of ClinVar pathogenic missense variants in CACNA1A gene (A). The gnomAD rare missense variants distribution in CACNA1A (B).

The ability to identify constrained domains in highly damaged genes demonstrates that DDI provides complementary information and high resolution for variant prioritization.

3.6. The most and least damaged domains

There are in total 6,354 domains found in 17,252 human proteins. Among them, 3,119 (49.2%) domains belong only to one protein, 1,147 (18%) domains relate to two proteins, and 2,088 (32.8%) domains are found in more than two proteins. The depletion of rare missense variants in one gene would affect DDI values of all domains in that protein, meaning that DDI values are gene dependent. Therefore, the same type of domain may have very different DDI values across genes (Suppl. Figure S2).

Although most domains are under gene-dependent constraint, the DDI values of some domains are consistently high across genes, suggesting functional conservation and essentiality. To find these domains, we applied two criteria: the domain occurs in more than 20 proteins, and its DDI scores are significantly higher than the background (p-value < 2.5e-5). As a result, 25 gene-independent constrained domains were identified (Figure 6a, Suppl. Table 5). These domains showed significantly higher conservation scores than others (Figure 6b, Suppl. Table 5), further supporting their functional importance. Ubiquitination is one of the most important post-translational modifications. The ubiquitination system includes E1, E2, and E3 enzymes. E2 enzymes are responsible for ubiquitin (Ub) loading and are also involved in determining the length and topology of the chain (Swatek & Komander, 2016; Ye & Rape, 2009; Zheng & Shabek, 2017). As the functional core of E2 enzymes, the ubiquitin-conjugating enzyme domain (PF00179, UQ_con) is highly constrained across 37 E2 enzyme family members (p=3.66e-17). E3 enzymes are responsible for ubiquitination specificity and for the transfer of ubiquitin from the E2 enzyme to the substrate. Slight variations in the N-terminal region of more than 600 E3 members enable the targeting of various protein substrates (Deshaies & Joazeiro, 2009). Although E3 enzymes are highly diverse at the N-terminus, we found that the ubiquitin-transferase domain (PF00632, HECT) at their C-terminus is highly constrained across proteins (p=1.74e-8). Phosphorylation is another important post-translational modification. Compared to ubiquitination, protein phosphorylation is involved in more general processes such as structural transformation, protein-protein interaction, signal transduction, and functional activation/deactivation.
Protein kinases are highly diverse so that they can target various protein substrates, but they share highly conserved “universal cores” (Scheeff & Bourne, 2005). In support of this idea, we found that the protein kinase domain (PF00069, Pkinase) and the protein tyrosine kinase domain (PF07714, Pkinase_Tyr), found in 340 and 124 proteins respectively, are also intolerant of missense variants in a gene-independent manner (p=1.14e-41 and 9.75e-08). Chromatin remodeling and transcriptional regulation form sophisticated and delicate networks, which involve numerous proteins and establish proper temporal and spatial expression. Sixteen of the 25 most constrained domain families are involved in these processes: seven are related to DNA binding or chromatin conformation (such as the bromodomain, chromo domain, and PHD domain), seven are related to transcription factors (such as the ETS and Forkhead domains), and two are responsible for RNA recognition (the RRM and K homology domains). Other constrained domains include the Ras family, the ion transport family, and the hormone receptor domain, all of which are related to signal transduction.
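
The two selection criteria (a domain present in more than 20 proteins, and DDI scores significantly above the background at p < 2.5e-5) can be sketched as a simple filter. This section does not name the statistical test, so the one-sided Mann-Whitney U test with a normal approximation used below is an assumption, as are the function names, the record layout, and the toy data.

```python
import math
from collections import defaultdict


def mann_whitney_greater(x, y):
    """One-sided p-value that values in x tend to exceed values in y.

    Uses average ranks for ties and a normal approximation without tie or
    continuity correction, which is adequate for a sketch with large samples.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def gene_independent_constrained(records, min_proteins=20, alpha=2.5e-5):
    """records: iterable of (domain_id, protein_id, ddi_percentile)."""
    by_domain = defaultdict(list)
    for domain, protein, score in records:
        by_domain[domain].append((protein, score))
    hits = {}
    for domain, rows in by_domain.items():
        if len({prot for prot, _ in rows}) <= min_proteins:
            continue  # criterion 1: more than 20 proteins
        inside = [s for _, s in rows]
        background = [s for d, r in by_domain.items() if d != domain for _, s in r]
        pval = mann_whitney_greater(inside, background)
        if pval < alpha:  # criterion 2: significantly above background
            hits[domain] = pval
    return hits


# Toy data with hypothetical domain and protein identifiers.
records = [("DOM_HIGH", f"P{i}", 90 + i % 10) for i in range(25)]
records += [("DOM_BG", f"Q{i}", i % 100) for i in range(200)]
records += [("DOM_SMALL", f"R{i}", 95) for i in range(5)]
print(sorted(gene_independent_constrained(records)))  # ['DOM_HIGH']
```

In the toy data, DOM_SMALL has high scores but is excluded by the protein-count criterion, which is the point of requiring both conditions.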

Figure 6. Most and least constrained domains across genes.

(A) Boxplot of DDI percentile for the 25 most constrained domains. Domains are sorted by the median DDI percentile. (B) Boxplot of the conservation score for the 25 most constrained domains. (C) Boxplot of DDI percentile for the 15 least constrained domains.

Additionally, 15 gene-independent unconstrained domains were identified, whose DDI values are significantly lower than others (p-value < 2.5e-5, Figure 6c, Suppl. Table 6). For example, the olfactory receptor domain (PF13853, 7tm_4) in all 373 human olfactory receptor proteins showed significantly lower DDI scores than other domains (DDI rank mean=31.3, std=20.1, p=9.4e-36). Having accumulated an elevated number of variants, the domain is highly damaged and functionally unconstrained. Consistently, the olfactory receptor proteins are known to be under subtle selection pressure. As another example, immunoglobulin domains such as V-set and Ig_2 are also among the most unconstrained domains. This finding is consistent with the highly variable nature of immunoglobulin proteins. DDI, therefore, can be used as an indicator of the relative biological essentiality or redundancy of a given domain.

3.7. Performance assessment

The performance of DDI in distinguishing pathogenic variants from neutral ones was compared with that of 14 other approaches. They include five variant-level metrics (MutationTaster, MutationAssessor, FATHMM, PROVEAN, and MutPred), two region-level metrics (subRVIS and MTR), three gene-level measurements (GDI, RVIS, and ExAC constraint), and four ensemble approaches (CADD, REVEL, MVP, and metaSVM).

DDI is an unsupervised approach, whereas several of the other methods are supervised and trained on known pathogenic variants from databases such as ClinVar and HGMD. Overlap between training and testing datasets would inflate the measured performance. To avoid potential circularity, we compiled three independent curated datasets, from DoCM, cancer hotspots, and pathogenic COSMIC, as positives and randomly selected rare variants from MGRB as negatives. The performance is summarized in Tables 1, 2, and 3 and Figure 7. For the DoCM dataset, three methods obtained AUC scores > 0.85: the latest ensemble method MVP with an AUC of 0.899, REVEL with an AUC of 0.892, and DDI with an AUC of 0.852 (Figure 7A). Their PPV values ranged from 45% to 47%, NPV values were around 96%, specificity ranged from 77% to 80%, sensitivity ranged from 83% to 89%, MCC values were above 0.52, and Youden’s indices were greater than 0.63 (Table 1). For the cancer hotspots dataset, two approaches, DDI with an AUC of 0.790 and REVEL with an AUC of 0.784, achieved better performance than the other methods (Figure 7B). Their PPV values were around 39%, NPV values were greater than 92%, specificity was around 77%, sensitivity was around 70%, and Youden’s indices were greater than 0.46 (Table 2). In the pathogenic COSMIC dataset, DDI outperformed all 14 other predictors with an AUC of 0.715 (Figure 7C), followed by REVEL with an AUC of 0.713. DDI achieved a PPV of 72.9%, an NPV of 64.7%, a sensitivity of 70%, and a specificity of ~68% (Table 3). Furthermore, combining DDI with REVEL (DDI+REVEL) and with MVP (DDI+MVP) greatly improved the performance of each individual method in all three datasets. Relative to REVEL alone, the combination of DDI with REVEL improved the AUC by 4.9%, 8.3%, and 8.7% in the three datasets, respectively. Relative to MVP alone, the combination of DDI with MVP improved the AUC by 5.0%, 11.7%, and 10.9%, respectively (Figures 7D, E, and F).
These results indicate that DDI, a measure of protein domain constraint, provides a key and complementary feature for variant prioritization.
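
The reported gains from combining DDI with REVEL or MVP can be reproduced from the tabulated AUC values if they are read as improvements relative to the AUC of the base method. The short check below is a sketch of that arithmetic only; the dictionary layout and function name are illustrative, and the AUC values are copied from Tables 1, 2, and 3 in the order DoCM, cancer hotspots, COSMIC.

```python
# AUC values copied from the performance tables (DoCM, cancer hotspots, COSMIC).
AUC = {
    "REVEL":     (0.892, 0.784, 0.713),
    "DDI+REVEL": (0.936, 0.849, 0.775),
    "MVP":       (0.899, 0.760, 0.695),
    "DDI+MVP":   (0.944, 0.849, 0.771),
}


def relative_gain(base, combined):
    """Percent AUC improvement of the combined model over the base model."""
    return [round(100 * (c - b) / b, 1) for b, c in zip(AUC[base], AUC[combined])]


print(relative_gain("REVEL", "DDI+REVEL"))  # [4.9, 8.3, 8.7]
print(relative_gain("MVP", "DDI+MVP"))      # [5.0, 11.7, 10.9]
```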

Table 1.

Performance evaluation based on the DoCM and MGRB datasets

| Methods | Missing Rate (%) | PPV (%) | NPV (%) | Specificity (%) | FPR (%) | TPR (%) | FNR (%) | Accuracy (%) | AUC | MCC | Youden’s index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DDI+MVP | 1.55 | 71.69 | 96.84 | 92.58 | 7.42 | 86.14 | 13.86 | 91.43 | 0.944 | 0.734 | 0.787 |
| DDI+REVEL | 1.24 | 63.27 | 97.29 | 88.75 | 11.25 | 88.70 | 11.30 | 88.74 | 0.936 | 0.685 | 0.774 |
| MVP | 0.47 | 47.04 | 95.78 | 80.47 | 19.53 | 83.02 | 16.98 | 80.91 | 0.899 | 0.521 | 0.635 |
| REVEL | 0.00 | 45.62 | 96.46 | 78.29 | 21.71 | 86.36 | 13.64 | 79.70 | 0.892 | 0.522 | 0.647 |
| DDI | 1.24 | 45.89 | 97.23 | 77.76 | 22.24 | 89.48 | 10.52 | 79.80 | 0.852 | 0.538 | 0.672 |
| metaSVM | 0.00 | 37.51 | 94.53 | 71.79 | 28.21 | 80.31 | 19.69 | 73.27 | 0.834 | 0.409 | 0.521 |
| FATHMM | 3.57 | 40.38 | 92.47 | 77.07 | 22.93 | 71.22 | 28.78 | 76.02 | 0.811 | 0.398 | 0.483 |
| MutPred | 2.95 | 54.97 | 88.38 | 74.53 | 25.47 | 76.04 | 23.96 | 74.97 | 0.802 | 0.468 | 0.506 |
| PROVEAN | 3.57 | 34.04 | 94.18 | 67.53 | 32.47 | 80.06 | 19.94 | 69.70 | 0.794 | 0.366 | 0.476 |
| ExAC missense | 1.24 | 40.13 | 96.56 | 72.39 | 27.61 | 87.76 | 12.24 | 75.07 | 0.790 | 0.470 | 0.601 |
| CADD | 0.00 | 35.09 | 93.03 | 71.44 | 28.56 | 74.26 | 25.74 | 71.92 | 0.784 | 0.359 | 0.457 |
| GDI | 2.48 | 32.32 | 90.84 | 70.96 | 29.04 | 65.98 | 34.02 | 70.09 | 0.731 | 0.292 | 0.369 |
| MTR | 34.42 | 25.84 | 93.28 | 76.10 | 23.90 | 60.28 | 39.72 | 74.18 | 0.712 | 0.264 | 0.364 |
| RVIS | 1.24 | 29.44 | 96.29 | 50.60 | 49.40 | 91.37 | 8.63 | 58.10 | 0.679 | 0.329 | 0.420 |
| MutationTaster | 0.16 | 24.89 | 93.56 | 45.75 | 54.25 | 85.09 | 14.91 | 52.61 | 0.672 | 0.239 | 0.308 |
| MutationAssessor | 1.24 | 28.17 | 86.16 | 70.49 | 29.51 | 50.55 | 49.45 | 66.77 | 0.603 | 0.174 | 0.210 |
| subRVIS | 0.00 | 18.34 | 96.95 | 8.22 | 91.78 | 98.76 | 1.24 | 23.86 | 0.28 | 0.10 | 0.07 |

PPV, positive predictive value; NPV, negative predictive value; FPR, false positive rate; TPR, true positive rate; FNR, false negative rate; AUC, area under the curve; MCC, Matthews correlation coefficient; DoCM, Database of Curated Mutations, a highly curated database of known disease-causing mutations; MGRB, Medical Genome Reference Bank, a resource containing approximately 4,000 whole-genome sequences from healthy people older than 70 years, used as controls in disease-specific genomic research. To avoid circularity, we excluded the variants overlapping with ClinVar version 20150203.

Table 3.

Performance evaluation based on the COSMIC and MGRB datasets

| Methods | Missing Rate (%) | PPV (%) | NPV (%) | Specificity (%) | FPR (%) | TPR (%) | FNR (%) | Accuracy (%) | AUC | MCC | Youden’s index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DDI+REVEL | 2.40 | 76.25 | 65.27 | 73.78 | 26.22 | 68.20 | 31.80 | 70.70 | 0.775 | 0.418 | 0.420 |
| DDI+MVP | 2.56 | 75.61 | 65.40 | 72.54 | 27.46 | 68.92 | 31.08 | 70.54 | 0.771 | 0.412 | 0.415 |
| DDI | 0.00 | 72.85 | 64.67 | 67.69 | 32.31 | 70.11 | 29.89 | 69.02 | 0.715 | 0.377 | 0.378 |
| REVEL | 2.40 | 69.67 | 60.91 | 64.20 | 35.80 | 66.63 | 33.37 | 65.54 | 0.713 | 0.307 | 0.308 |
| CADD | 2.23 | 67.71 | 60.98 | 59.18 | 40.82 | 69.32 | 30.68 | 64.78 | 0.700 | 0.286 | 0.285 |
| MVP | 2.56 | 68.64 | 59.53 | 63.20 | 36.80 | 65.21 | 34.79 | 64.31 | 0.695 | 0.283 | 0.284 |
| MTR | 13.96 | 73.07 | 57.65 | 78.82 | 21.18 | 49.81 | 50.19 | 63.28 | 0.684 | 0.297 | 0.286 |
| metaSVM | 2.40 | 67.55 | 58.19 | 62.15 | 37.85 | 63.83 | 36.17 | 63.08 | 0.669 | 0.259 | 0.260 |
| GDI | 0.18 | 66.60 | 60.95 | 55.79 | 44.21 | 71.14 | 28.86 | 64.29 | 0.662 | 0.272 | 0.269 |
| PROVEAN | 3.32 | 64.63 | 58.96 | 51.13 | 48.87 | 71.51 | 28.49 | 62.44 | 0.657 | 0.231 | 0.226 |
| ExAC missense | 0.00 | 67.78 | 64.03 | 56.22 | 43.78 | 74.47 | 25.53 | 66.31 | 0.656 | 0.312 | 0.307 |
| MutPred | 8.48 | 78.94 | 37.51 | 74.00 | 26.00 | 44.15 | 55.85 | 53.46 | 0.618 | 0.173 | 0.182 |
| MutationTaster | 2.23 | 63.42 | 57.86 | 49.37 | 50.63 | 70.95 | 29.05 | 61.30 | 0.616 | 0.208 | 0.203 |
| FATHMM | 3.35 | 65.10 | 53.61 | 61.90 | 38.10 | 57.02 | 42.98 | 59.19 | 0.611 | 0.188 | 0.189 |
| MutationAssessor | 6.35 | 64.57 | 50.10 | 61.54 | 38.46 | 53.34 | 46.66 | 56.88 | 0.599 | 0.148 | 0.149 |
| RVIS | 0.95 | 62.73 | 63.37 | 34.75 | 65.25 | 84.54 | 15.46 | 62.89 | 0.583 | 0.224 | 0.193 |
| subRVIS | 0.55 | 57.52 | 52.46 | 21.77 | 78.23 | 84.31 | 15.69 | 56.59 | 0.47 | 0.08 | 0.06 |

PPV, positive predictive value; NPV, negative predictive value; FPR, false positive rate; TPR, true positive rate; FNR, false negative rate; AUC, area under the curve; MCC, Matthews correlation coefficient; COSMIC, Catalogue Of Somatic Mutations In Cancer, a database for exploring the impact of somatic mutations in human cancer; MGRB, Medical Genome Reference Bank, a resource containing approximately 4,000 whole-genome sequences from healthy people older than 70 years, used as controls in disease-specific genomic research. To avoid circularity, we excluded the variants overlapping with ClinVar version 20150203.

Figure 7. Performance evaluation in three benchmark datasets.

ROC curves using pathogenic missense variants in DoCM (A, D), cancer hotspots (B, E), and pathogenic variants from COSMIC (C, F) as true positives and randomly selected rare (MAF < 0.1%) missense variants from MGRB as negatives.

Table 2.

Performance evaluation based on the Cancer Hotspot and MGRB datasets

| Methods | Missing Rate (%) | PPV (%) | NPV (%) | Specificity (%) | FPR (%) | TPR (%) | FNR (%) | Accuracy (%) | AUC | MCC | Youden’s index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DDI+MVP | 5.71 | 54.09 | 93.89 | 87.50 | 12.50 | 72.12 | 27.88 | 84.89 | 0.849 | 0.535 | 0.596 |
| DDI+REVEL | 5.71 | 52.92 | 94.84 | 86.07 | 13.93 | 76.97 | 23.03 | 84.53 | 0.849 | 0.549 | 0.630 |
| DDI | 5.71 | 38.51 | 93.39 | 77.38 | 22.62 | 72.12 | 27.88 | 76.52 | 0.790 | 0.397 | 0.495 |
| REVEL | 0.00 | 39.41 | 92.32 | 77.72 | 22.28 | 69.14 | 30.86 | 76.24 | 0.784 | 0.386 | 0.469 |
| ExAC missense | 5.71 | 35.56 | 94.93 | 71.31 | 28.69 | 80.61 | 19.39 | 72.84 | 0.770 | 0.398 | 0.519 |
| MVP | 0.00 | 36.12 | 92.07 | 74.55 | 25.45 | 69.14 | 30.86 | 73.62 | 0.760 | 0.351 | 0.437 |
| metaSVM | 0.00 | 35.20 | 91.00 | 75.09 | 24.91 | 64.57 | 35.43 | 73.27 | 0.744 | 0.322 | 0.397 |
| CADD | 0.00 | 28.16 | 94.44 | 55.93 | 44.07 | 84.00 | 16.00 | 60.72 | 0.732 | 0.300 | 0.399 |
| GDI | 8.00 | 64.60 | 89.87 | 95.13 | 4.87 | 45.34 | 54.66 | 86.97 | 0.729 | 0.470 | 0.405 |
| PROVEAN | 2.86 | 27.78 | 93.67 | 54.95 | 45.05 | 82.35 | 17.65 | 59.71 | 0.718 | 0.283 | 0.373 |
| MutPred | 10.29 | 42.48 | 86.21 | 64.25 | 35.75 | 71.97 | 28.03 | 66.32 | 0.710 | 0.322 | 0.362 |
| FATHMM | 5.14 | 33.59 | 88.74 | 77.95 | 22.05 | 53.01 | 46.99 | 73.61 | 0.699 | 0.263 | 0.310 |
| MTR | 0.57 | 43.71 | 87.98 | 88.72 | 11.28 | 41.95 | 58.05 | 80.64 | 0.664 | 0.312 | 0.307 |
| MutationAssessor | 4.57 | 34.93 | 87.66 | 80.57 | 19.43 | 47.90 | 52.10 | 74.73 | 0.660 | 0.254 | 0.285 |
| RVIS | 8.00 | 26.74 | 92.66 | 55.21 | 44.79 | 78.88 | 21.12 | 59.28 | 0.627 | 0.257 | 0.341 |
| MutationTaster | 0.00 | 22.28 | 90.15 | 43.73 | 56.27 | 77.14 | 22.86 | 49.51 | 0.612 | 0.161 | 0.209 |
| subRVIS | 1.14 | 17.99 | 95.95 | 8.39 | 91.61 | 98.27 | 1.73 | 23.65 | 0.35 | 0.10 | 0.07 |

PPV, positive predictive value; NPV, negative predictive value; FPR, false positive rate; TPR, true positive rate; FNR, false negative rate; AUC, area under the curve; MCC, Matthews correlation coefficient; Cancer Hotspot, recurrently mutated residues identified by Chang et al. (2016) across 11,119 human tumors spanning 41 cancer types; MGRB, Medical Genome Reference Bank, a resource containing approximately 4,000 whole-genome sequences from healthy people older than 70 years, used as controls in disease-specific genomic research. To avoid circularity, we excluded the variants overlapping with ClinVar version 20150203.
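
For reference, the columns of the performance tables follow the standard definitions for a binary classifier. The sketch below computes them from confusion-matrix counts; the function name and the example counts are hypothetical and are not taken from the benchmark datasets, and the sketch assumes every row and column of the confusion matrix is non-empty.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Summary metrics for a binary classifier from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity (true positive rate)
    tnr = tn / (tn + fp)  # specificity (true negative rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "specificity": tnr,
        "FPR": 1 - tnr,
        "TPR": tpr,
        "FNR": 1 - tpr,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
        "Youden": tpr + tnr - 1,
    }


# Hypothetical counts, not taken from the benchmark datasets.
m = binary_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

As a consistency check against the tables, the DDI row for DoCM has a TPR of 89.48% and a specificity of 77.76%, which gives 0.8948 + 0.7776 - 1 = 0.672, the tabulated Youden's index.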

4. DISCUSSION

We presented a population-based approach named DDI for estimating the accumulation or depletion of rare missense variants in protein domains. We showed that the least damaged domains in the general population are under stronger purifying selection pressure and are more likely to cause disease than the most damaged ones. We demonstrated that DDI performed well in distinguishing pathogenic variants from benign ones in three benchmark datasets.

We attribute the success of DDI to three factors. The first is the expansion of the mutational model to consider both the trinucleotide context and the codon phase. It is well known that the third position of the codon changes faster than the other two (Bofkin & Goldman, 2007). The mutational frequencies at the first, second, and third positions in our model are 0.32, 0.30, and 0.38, respectively, which reflects this biological reality. Taking the codon phase into consideration provides a close simulation of the expected mutation profile under neutral selective pressure. The second is the use of intronic variants rather than synonymous changes to estimate the expected number of variants under neutral selection. Although synonymous changes are often assumed to be neutral, they have been reported to be under purifying selection and might be a source of functionally important variation (Kristofich et al., 2018; Plotkin & Kudla, 2011; Sauna & Kimchi-Sarfaty, 2011). Therefore, it is more reasonable to use intronic variants. The third is that the mutation rate table is created directly from coding regions instead of from intergenic variants, which better reflects the reality of neutral selection.
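
To make the idea of a context- and phase-aware mutational model concrete, the sketch below sums a per-site rate over every possible missense substitution in a coding sequence, where the rate is looked up by trinucleotide context, codon phase, and alternate base. It is an illustration under stated assumptions and not the authors' implementation: the function name, the callable `rate` interface, the 'N' padding at the sequence ends, and the exclusion of nonsense changes are all choices made here, and the uniform rate in the example is a placeholder rather than a value from the DDI mutation table.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_missense(cds, rate):
    """Sum rate(context, phase, alt) over all missense substitutions in cds.

    cds:  upper-case coding sequence whose length is a multiple of 3.
    rate: callable returning a mutation rate for a trinucleotide context
          (reference base plus both neighbours, 'N' at the sequence ends),
          a codon phase (0, 1, or 2), and an alternate base.
    """
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    padded = "N" + cds + "N"
    total = 0.0
    for pos, ref in enumerate(cds):
        phase = pos % 3
        start = pos - phase
        codon = cds[start:start + 3]
        context = padded[pos:pos + 3]  # left neighbour, ref base, right neighbour
        for alt in BASES:
            if alt == ref:
                continue
            mutant = codon[:phase] + alt + codon[phase + 1:]
            ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
            # Count only amino-acid changes that do not involve a stop codon.
            if alt_aa != ref_aa and "*" not in (ref_aa, alt_aa):
                total += rate(context, phase, alt)
    return total


# With a placeholder rate of 1.0 the result is simply the number of
# possible missense substitutions in the sequence.
print(expected_missense("ATGGCC", lambda context, phase, alt: 1.0))  # 15.0
```

Comparing such an expected value with the observed number of rare missense variants in a domain is the kind of depletion signal that DDI is built on; how the two are combined into the final index is described in the Methods rather than here.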

The growing number of curated pathogenic variants greatly facilitates the development of machine-learning-based approaches. However, the performance of these methods depends heavily on the quality of the labelled data. DDI, in comparison, is an unsupervised approach and is independent of any existing pathogenicity dataset. Therefore, DDI avoids the risk of overfitting to such labels and can generalize to predicting the deleteriousness of novel variants. Although DDI is domain-centric and lacks the single-base-pair resolution of variant-level methods, it outperformed most variant-level predictors. Moreover, the combination of DDI with REVEL and MVP improved the performance of each individual method considerably. These results indicate that DDI captures valuable features of population genetic data and can serve as a key and complementary feature for developing new ensemble methods or for integration with current top-performing variant-level methods.

Supplementary Material

Table S1. The mutation table in DDI

Table S2. The statistics of pathogenic variants from ClinVar and rare variants from gnomAD in 3,982 domains.

Table S3. 217 constrained domains within highly damaged genes

Table S4. 47 constrained domains within highly damaged genes having pathogenic variants reported in ClinVar

Table S5. 25 gene-independent constrained domains.

Table S6. 15 gene-independent unconstrained domains.

Figure S1. The lollipop plot of ClinVar pathogenic missense variants in SMARCA2, ERBB2, AR, FOXC1, and NSD1.

Figure S2. The DDI percentile distribution for domains.

ACKNOWLEDGMENT

We would like to acknowledge funding from the National Cancer Institute [5U01 CA163056-05, U2C CA233291 and U54 CA217450 to Y.S.] and the Cancer Center Support Grant [2P30 CA068485-19 to Y.S.].

Footnotes

CONFLICT OF INTERESTS

The authors declare no competing interests.

SUPPORTING INFORMATION

Additional supporting information includes 2 figures and 6 tables.

DATA AVAILABILITY STATEMENT

The code generated during this study is freely available at https://github.com/chc-code/domain-damage-index

The website for the DDI database is freely available at http://bioinfo.vanderbilt.edu/ddi/

REFERENCES

  1. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, … Sunyaev SR (2010). A method and server for predicting damaging missense mutations. Nat Methods, 7(4), 248–249. doi: 10.1038/nmeth0410-248
  2. Ainscough BJ, Griffith M, Coffman AC, Wagner AH, Kunisaki J, Choudhary MN, … Mardis ER (2016). DoCM: a database of curated mutations in cancer. Nat Methods, 13(10), 806–807. doi: 10.1038/nmeth.4000
  3. Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, … Abel L (2015). Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A, 112(17), 5473–5478. doi: 10.1073/pnas.1418631112
  4. Bofkin L, & Goldman N (2007). Variation in evolutionary processes at different codon positions. Mol Biol Evol, 24(2), 513–521. doi: 10.1093/molbev/msl178
  5. Bosio M, Drechsel O, Rahman R, Muyas F, Rabionet R, Bezdan D, … Ossowski S (2019). eDiVA-Classification and prioritization of pathogenic variants for clinical diagnostics. Hum Mutat, 40(7), 865–878. doi: 10.1002/humu.23772
  6. Carter H, Douville C, Stenson PD, Cooper DN, & Karchin R (2013). Identifying Mendelian disease genes with the Variant Effect Scoring Tool. BMC Genomics, 14. doi: 10.1186/1471-2164-14-s3-s3
  7. Chang MT, Asthana S, Gao SP, Lee BH, Chapman JS, Kandoth C, … Taylor BS (2016). Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nat Biotechnol, 34(2), 155–163. doi: 10.1038/nbt.3391
  8. Choi Y, Sims GE, Murphy S, Miller JR, & Chan AP (2012). Predicting the Functional Effect of Amino Acid Substitutions and Indels. PLoS One, 7(10). doi: 10.1371/journal.pone.0046688
  9. Clifford RJ, Edmonson MN, Nguyen C, & Buetow KH (2004). Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms. Bioinformatics, 20(7), 1006–1014. doi: 10.1093/bioinformatics/bth029
  10. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, & Batzoglou S (2010). Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++. PLoS Computational Biology, 6(12). doi: 10.1371/journal.pcbi.1001025
  11. Deshaies RJ, & Joazeiro CA (2009). RING domain E3 ubiquitin ligases. Annu Rev Biochem, 78, 399–434. doi: 10.1146/annurev.biochem.78.101807.093809
  12. Diella F, Haslam N, Chica C, Budd A, Michael S, Brown NP, … Gibson TJ (2008). Understanding eukaryotic linear motifs and their role in cell signaling and regulation. Front Biosci, 13, 6580–6603. doi: 10.2741/3175
  13. Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, & Liu X (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum Mol Genet, 24(8), 2125–2137. doi: 10.1093/hmg/ddu733
  14. El-Gebali S, Mistry J, Bateman A, Eddy SR, Luciani A, Potter SC, … Finn RD (2019). The Pfam protein families database in 2019. Nucleic Acids Res, 47(D1), D427–D432. doi: 10.1093/nar/gky995
  15. Evans P, Wu C, Lindy A, McKnight DA, Lebo M, Sarmady M, & Abou Tayoun AN (2019). Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets. Genome Res, 29(7), 1144–1151. doi: 10.1101/gr.240994.118
  16. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, & Xie X (2009). Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics, 25(12), i54–i62. doi: 10.1093/bioinformatics/btp190
  17. Gonzalez-Perez A, & Lopez-Bigas N (2011). Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet, 88(4), 440–449. doi: 10.1016/j.ajhg.2011.03.004
  18. Grimm DG, Azencott CA, Aicheler F, Gieraths U, MacArthur DG, Samocha KE, … Borgwardt KM (2015). The evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum Mutat, 36(5), 513–523. doi: 10.1002/humu.22768
  19. Gussow AB, Petrovski S, Wang Q, Allen AS, & Goldstein DB (2016). The intolerance to functional genetic variation of protein domains predicts the localization of pathogenic mutations within genes. Genome Biol, 17, 9. doi: 10.1186/s13059-016-0869-4
  20. Havrilla JM, Pedersen BS, Layer RM, & Quinlan AR (2019). A map of constrained coding regions in the human genome. Nat Genet, 51(1), 88–95. doi: 10.1038/s41588-018-0294-6
  21. Hayeck TJ, Stong N, Wolock CJ, Copeland B, Kamalakaran S, Goldstein DB, & Allen AS (2019). Improved Pathogenic Variant Localization via a Hierarchical Model of Sub-regional Intolerance. Am J Hum Genet, 104(2), 299–309. doi: 10.1016/j.ajhg.2018.12.020
  22. Indelicato E, Nachbauer W, Karner E, Eigentler A, Wagner M, Unterberger I, … Boesch S (2019). The neuropsychiatric phenotype in CACNA1A mutations: a retrospective single center study and review of the literature. Eur J Neurol, 26(1), 66–e67. doi: 10.1111/ene.13765
  23. Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, … Sieh W (2016). REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99(4), 877–885. doi: 10.1016/j.ajhg.2016.08.016
  24. Ionita-Laza I, McCallum K, Xu B, & Buxbaum JD (2016). A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet, 48(2), 214–220. doi: 10.1038/ng.3477
  25. Itan Y, Shang L, Boisson B, Patin E, Bolze A, Moncada-Velez M, … Casanova JL (2015). The human gene damage index as a gene-level approach to prioritizing exome variants. Proc Natl Acad Sci U S A, 112(44), 13615–13620. doi: 10.1073/pnas.1518646112
  26. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, … MacArthur DG (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), 434–443. doi: 10.1038/s41586-020-2308-7
  27. Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, & Shendure J (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310–315. doi: 10.1038/ng.2892
  28. Kristofich J, Morgenthaler AB, Kinney WR, Ebmeier CC, Snyder DJ, Old WM, … Copley SD (2018). Synonymous mutations make dramatic contributions to fitness when growth is limited by a weak-link enzyme. PLoS Genet, 14(8), e1007615. doi: 10.1371/journal.pgen.1007615
  29. Kumar P, Henikoff S, & Ng PC (2009). Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols, 4(7), 1073–1082. doi: 10.1038/nprot.2009.86
  30. Lacaze P, Pinese M, Kaplan W, Stone A, Brion MJ, Woods RL, … Thomas DM (2019). The Medical Genome Reference Bank: a whole-genome data resource of 4000 healthy elderly individuals. Rationale and cohort design. Eur J Hum Genet, 27(2), 308–316. doi: 10.1038/s41431-018-0279-z
  31. Lai C, Zimmer AD, O’Connor R, Kim S, Chan R, van den Akker J, … Mishne G (2020). LEAP: Using machine learning to support variant classification in a clinical setting. Hum Mutat, 41(6), 1079–1090. doi: 10.1002/humu.24011
  32. Laine E, Karami Y, & Carbone A (2019). GEMME: a simple and fast global epistatic model predicting mutational effects. Mol Biol Evol. doi: 10.1093/molbev/msz179
  33. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, … Exome Aggregation Consortium (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536(7616), 285–291.
  34. Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, … Radivojac P (2009). Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics, 25(21), 2744–2750. doi: 10.1093/bioinformatics/btp528
  35. Li H (2011). Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27(5), 718–719. doi: 10.1093/bioinformatics/btq671
  36. Liu X, Li C, Mou C, Dong Y, & Tu Y (2020). dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med, 12(1), 103. doi: 10.1186/s13073-020-00803-9
  37. Lopes MC, Joyce C, Ritchie GR, John SL, Cunningham F, Asimit J, & Zeggini E (2012). A combined functional annotation score for non-synonymous variants. Hum Hered, 73(1), 47–51. doi: 10.1159/000334984
  38. Luo X, Rosenfeld JA, Yamamoto S, Harel T, Zuo Z, Hall M, … Members of the UDN (2017). Clinically severe CACNA1A alleles affect synaptic function and neurodegeneration differentially. PLoS Genet, 13(7), e1006905. doi: 10.1371/journal.pgen.1006905
  39. MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, … Gunter C (2014). Guidelines for investigating causality of sequence variants in human disease. Nature, 508(7497), 469–476. doi: 10.1038/nature13127
  40. Miller ML, Reznik E, Gauthier NP, Aksoy BA, Korkut A, Gao J, … Sander C (2015). Pan-Cancer Analysis of Mutation Hotspots in Protein Domains. Cell Syst, 1(3), 197–209. doi: 10.1016/j.cels.2015.08.014
  41. Munro D, & Singh M (2020). DeMaSk: a deep mutational scanning substitution matrix and its use for variant impact prediction. Bioinformatics. doi: 10.1093/bioinformatics/btaa1030
  42. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, … Shendure J (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461(7261), 272–276. doi: 10.1038/nature08250
  43. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, … Duchesnay E (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
  44. Peterson TA, Gauran IIM, Park J, Park D, & Kann MG (2017). Oncodomains: A protein domain-centric framework for analyzing rare variants in tumor samples. PLoS Comput Biol, 13(4), e1005428. doi: 10.1371/journal.pcbi.1005428
  45. Petrovski S, Wang Q, Heinzen EL, Allen AS, & Goldstein DB (2013). Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet, 9(8), e1003709. doi: 10.1371/journal.pgen.1003709
  46. Plotkin JB, & Kudla G (2011). Synonymous but not the same: the causes and consequences of codon bias. Nat Rev Genet, 12(1), 32–42. doi: 10.1038/nrg2899
  47. Pollard KS, Hubisz MJ, Rosenbloom KR, & Siepel A (2010). Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 20(1), 110–121. doi: 10.1101/gr.097857.109
  48. Qi H, Zhang H, Zhao Y, Chen C, Long JJ, Chung WK, … Shen Y (2021). MVP predicts the pathogenicity of missense variants by deep learning. Nat Commun, 12(1), 510. doi: 10.1038/s41467-020-20847-0
  49. Quinlan AR, & Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. doi: 10.1093/bioinformatics/btq033
  50. Quinodoz M, Royer-Bertrand B, Cisarova K, Di Gioia SA, Superti-Furga A, & Rivolta C (2017). DOMINO: Using Machine Learning to Predict Genes Associated with Dominant Disorders. Am J Hum Genet, 101(4), 623–629. doi: 10.1016/j.ajhg.2017.09.001
  51. Rentzsch P, Witten D, Cooper GM, Shendure J, & Kircher M (2019). CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res, 47(D1), D886–D894. doi: 10.1093/nar/gky1016
  52. Reva B, Antipin Y, & Sander C (2011). Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research, 39(17), e118. doi: 10.1093/nar/gkr407
  53. Ritchie GR, Dunham I, Zeggini E, & Flicek P (2014). Functional annotation of noncoding sequence variants. Nat Methods, 11(3), 294–296. doi: 10.1038/nmeth.2832
  54. Ruxton GD, & Neuhäuser M (2013). Review of alternative approaches to calculation of a confidence interval for the odds ratio of a 2 × 2 contingency table. Methods in Ecology and Evolution, 4(1), 9–13. doi: 10.1111/j.2041-210x.2012.00250.x
  55. Samocha KE, Kosmicki JA, Karczewski KJ, O’Donnell-Luria AH, Pierce-Hoffman E, MacArthur DG, … Daly MJ (2017). Regional missense constraint improves variant deleteriousness prediction. bioRxiv, 148353. doi: 10.1101/148353
  56. Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, … Daly MJ (2014). A framework for the interpretation of de novo mutation in human disease. Nat Genet, 46(9), 944–950. doi: 10.1038/ng.3050
  57. Sauna ZE, & Kimchi-Sarfaty C (2011). Understanding the contribution of synonymous mutations to human disease. Nat Rev Genet, 12(10), 683–691. doi: 10.1038/nrg3051
  58. Scheeff ED, & Bourne PE (2005). Structural evolution of the protein kinase-like superfamily. PLoS Comput Biol, 1(5), e49. doi: 10.1371/journal.pcbi.0010049
  59. Schwarz JM, Roedelsperger C, Schuelke M, & Seelow D (2010). MutationTaster evaluates disease-causing potential of sequence alterations. Nature Methods, 7(8), 575–576. doi: 10.1038/nmeth0810-575
  60. Shihab HA, Gough J, Cooper DN, Stenson PD, Barker GL, Edwards KJ, … Gaunt TR (2013). Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat, 34(1), 57–65. doi: 10.1002/humu.22225
  61. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou MM, Rosenbloom K, … Haussler D (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8), 1034–1050. doi: 10.1101/gr.3715005
  62. Stone EA, & Sidow A (2005). Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity. Genome Res, 15(7), 978–986. doi: 10.1101/gr.3804205
  63. Swatek KN, & Komander D (2016). Ubiquitin modifications. Cell Res, 26(4), 399–422. doi: 10.1038/cr.2016.39
  64. Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, … Forbes SA (2019). COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res, 47(D1), D941–D947. doi: 10.1093/nar/gky1015
  65. The 1000 Genomes Project Consortium, Auton A, Abecasis GR, Altshuler DM, Durbin RM, … Abecasis GR (2015). A global reference for human genetic variation. Nature, 526, 68. doi: 10.1038/nature15393
  66. Traynelis J, Silk M, Wang Q, Berkovic SF, Liu L, Ascher DB, … Petrovski S (2017). Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res, 27(10), 1715–1729. doi: 10.1101/gr.226589.117
  67. UniProt, C. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res, 47(D1), D506–D515. doi: 10.1093/nar/gky1049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Wang K, Li M, & Hakonarson H (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res, 38(16), e164. doi: 10.1093/nar/gkq603 [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Yates B, Braschi B, Gray KA, Seal RL, Tweedie S, & Bruford EA (2017). Genenames.org: the HGNC and VGNC resources in 2017. Nucleic Acids Res, 45(D1), D619–D625. doi: 10.1093/nar/gkw1033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Ye Y, & Rape M (2009). Building ubiquitin chains: E2 enzymes at work. Nat Rev Mol Cell Biol, 10(11), 755–764. doi: 10.1038/nrm2780 [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Youden WJ (1950). Index for rating diagnostic tests. Cancer, 3(1), 32–35. doi: [DOI] [PubMed] [Google Scholar]
  72. Zheng N, & Shabek N (2017). Ubiquitin Ligases: Structure, Function, and Regulation. Annu Rev Biochem, 86, 129–157. doi: 10.1146/annurev-biochem-060815-014922 [DOI] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Table S1. The mutation table in DDI

Table S2. The statistics of pathogenic variants from ClinVar and rare variants from gnomAD in 3,982 domains.

Table S3. 217 constrained domains within highly damaged genes.

Table S4. 47 constrained domains within highly damaged genes having pathogenic variants reported in ClinVar.

Table S5. 25 gene-independent constrained domains.

Table S6. 15 gene-independent unconstrained domains.

Figure S1. The lollipop plot of ClinVar pathogenic missense variants in SMARCA2, ERBB2, AR, FOXC1, and NSD1.

Figure S2. The DDI percentile distribution for domains.

Data Availability Statement

The code generated during this study is freely available at https://github.com/chc-code/domain-damage-index.

The website for the DDI database is freely available at http://bioinfo.vanderbilt.edu/ddi/.
