Abstract
Deleterious single amino acid variation (SAV) is one of the leading causes of human diseases. Evaluating the functional impact of SAVs is crucial for diagnosis of genetic disorders. We previously developed a deep convolutional neural network predictor, DeepSAV, to evaluate the deleterious effects of SAVs on protein function based on various sequence, structural, and functional properties. DeepSAV scores of rare SAVs observed in the human population are aggregated into a gene-level score called GTS (Gene Tolerance of rare SAVs) that reflects a gene’s tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. In this study, we aim to enhance the performance of DeepSAV by using expanded datasets of pathogenic and benign variants, more features, and neural network optimization. We found that multiple sequence alignments built from vertebrate-level orthologs yield better prediction results compared to those built from mammalian-level orthologs. For multiple sequence alignments built from BLAST searches, optimal performance was achieved with a sequence identity cutoff of 50% to remove distant homologs. The new version of DeepSAV exhibits the best performance among standalone predictors of deleterious effects of SAVs. We developed the DBSAV database (http://prodata.swmed.edu/DBSAV) that reports GTS scores of human genes and DeepSAV scores of SAVs in the human proteome, including pathogenic and benign SAVs, population-level SAVs, and all possible SAVs by single nucleotide variations. This database serves as a useful resource for research of human SAVs and their relationships with protein functions and human diseases.
Keywords: genetic variations, pathogenic variants, benign variants, variant deleteriousness prediction, neural network predictor
Introduction
Whole genome or exome sequencing is now routinely used in diagnosis and research of human diseases.1,2 Their successful applications to clinical studies depend on assessment of functional impact of genetic variations discovered in sequencing projects. Genetic variations include structural variations at the chromosomal level, gene copy number variations, insertions and deletions of long and short DNA segments, and single-base-pair (single-nucleotide) variations (SNVs).3 SNVs within or near protein-coding regions are of particular interest, as they could have functional impact on the protein products. Some of these variations most likely result in loss-of-function, such as changes that disrupt splice sites or introduce premature stop codons, while most synonymous variations are benign. It is much more difficult to assess the functional and clinical impact of non-synonymous SNPs (missense variations) that lead to single-amino acid changes.4 Clinical consequences of these single amino-acid variations (SAVs) can be benign or pathogenic, depending on a plethora of factors that affect the functions of proteins. Deleterious SAVs could affect various aspects of protein function, including protein folding and stability, protein–protein interactions, protein localization and degradation, post-translational modification, and the activity of enzymes.5,6 It should be noted that protein-level deleterious effects of genetic variants do not necessarily lead to disease phenotypes, as a fraction of non-essential genes can be compromised without causing diseases.7
In our previous work,8 a deep neural network method was developed to quantify the functional impact of human SAVs based on their sequence, structural and functional properties. The neural network prediction scores of SAVs observed in the human general population were used to calculate a mutation severity measure called GTS (Gene Tolerance of rare SAVs) that estimates the tolerance of each human protein-coding gene to deleterious missense mutations. We found that this measure correlates with gene essentiality and specific disease classes such as cancer and autism.8 A dichotomy of mutation severity for disease-associated genes was observed: SAV-intolerant genes tend to function in development and signal transduction pathways, whereas SAV-tolerant genes tend to function in metabolism.8
Here we report an update of the DeepSAV predictor of SAV deleteriousness with improvement from enlarged training datasets, more input features, and neural network optimization. We describe the DBSAV database that reports the GTS scores of human genes and the DeepSAV scores of various sources of SAVs in the human proteome. Each human protein has a web page where the SAVs are displayed together with the structural and functional properties of the protein. DBSAV would be a useful tool for researching SAVs and their impact on protein function and diseases.
Results and Discussion
Selection of datasets of human pathogenic and benign variations
We updated the datasets of human pathogenic and benign SAVs compiled from recent versions of ClinVar (2020-July-20)9 and UniProt (version 2020-3 (2020-June-12)).10 A total of 50,568 pathogenic SAVs (35,251 from ClinVar, 29,896 from UniProt, 14,579 in common) were retrieved from 3758 proteins. A total of 72,483 benign SAVs (40,638 from ClinVar, 39,064 from UniProt, 7219 in common) were retrieved from 13,343 human proteins (see Materials and methods). We aim to develop a neural network predictor of the deleterious effect of any SAV on its protein’s function. The pathogenic SAVs are used as positive cases in our predictor since their deleterious effects on the proteins’ function are reflected by the disease-causing phenotypes. However, protein-level deleterious effects do not necessarily lead to human diseases, as a significant fraction of human protein-coding genes (e.g., olfactory receptors) can be compromised without causing diseases. Certain SAVs classified as benign could therefore have detrimental effects on protein function if they reside in non-essential proteins whose loss-of-function does not lead to diseases. It is thus desirable to exclude SAVs that are classified as benign in terms of their effect on human health but could still have deleterious impact on protein function. To construct the dataset of negative cases (SAVs without deleterious effects on protein function) in our neural network predictor, we only selected the benign SAVs from the curated set of proteins that have been linked to diseases according to the DisGeNET database.11 After excluding benign SAVs in proteins not classified as disease-causing by DisGeNET, our negative dataset consists of 54,680 benign SAVs from 7451 proteins (Table S1).
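The dataset totals above follow from inclusion-exclusion over the two sources, and the negative set is a filter on protein membership. The following is a minimal sketch of both steps, not the authors' pipeline; the function names, the toy accessions, and the toy variants are hypothetical and only illustrate the bookkeeping.

```python
def union_size(n_a, n_b, n_common):
    """Number of distinct items in two sets, given each size and the overlap."""
    return n_a + n_b - n_common


def select_negatives(benign_savs, disease_proteins):
    """Keep benign SAVs only if their protein is in the disease-linked set."""
    return [sav for sav in benign_savs if sav[0] in disease_proteins]


# Counts as reported in the text: (ClinVar, UniProt, in common).
pathogenic_total = union_size(35251, 29896, 14579)
benign_total = union_size(40638, 39064, 7219)
print(pathogenic_total, benign_total)

# Toy illustration of the negative-set filter; accessions and variants are made up.
toy_benign = [("P_DISEASE", "A10V"), ("P_OLFACTORY", "R25Q"), ("P_DISEASE", "S77T")]
toy_negatives = select_negatives(toy_benign, disease_proteins={"P_DISEASE"})
print(toy_negatives)
```

The two totals reproduce the 50,568 pathogenic and 72,483 benign SAVs stated in the text.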
Input features of the neural network predictor
The input to the DeepSAV predictor consists of a set of features of protein sequence, structure, and functional properties including original and mutated amino acid types, sequence conservation, profile, secondary structure predictions, disordered residue predictions, low complexity regions, coiled coil regions, and various features retrieved from the UniProt Feature fields (see Materials and methods). We analyzed UniProt features by using an enrichment log-odds score that reflects enrichment or depletion of any feature in amino acid positions with pathogenic SAVs compared to positions with benign SAVs (Figure 1). In addition to previously used UniProt features (SIGNAL, TRANSIT, TRANSMEM, MODRES_A, MODRES_P, MODRES_M, DISULFID, CARBOHYD, METAL, BINDING, ACT_SITE, SITE, LIPID, and MOTIF, see their descriptions in Materials and methods),8 we added several UniProt features that showed enrichment/depletion in pathogenic SAVs including INTRAMEM (region that is buried within a membrane, but does not cross it), NP_BIND (nucleotide phosphate-binding region), REGION30 (extent of a region of interest in the sequence denoted by the UniProt feature REGION, with length no more than 30 amino acids), DNA_BIND (DNA-binding region), CA_BIND (calcium-binding region), ZN_FING (zinc finger region), PEPTIDE (released active peptide), and PROPEP (propeptide). Among the UniProt features, DISULFID, METAL, INTRAMEM, ACT_SITE, NP_BIND, DNA_BIND, SITE and BINDING (red bars in Figure 1) have the highest log-odds scores, showing more than 7-fold enrichment in pathogenic SAVs compared to benign SAVs. LIPID, REGION30, MOTIF, CA_BIND, and TRANSMEM (orange bars in Figure 1) are modestly enriched, with 2- to 4-fold enrichment in pathogenic SAVs. It should be noted that some of these features correspond to regions that could vary considerably in length.
For example, the majority of DNA_BIND features (>92%) in the human proteome correspond to regions with fewer than 100 amino acids and only a couple of them are longer than 300 amino acids. We restricted the UniProt feature REGION to those no longer than 30 amino acids, as these short regions (denoted as REGION30) have a much higher enrichment score (3.5-fold) than longer regions with over 30 amino acids (1.3-fold enrichment) when comparing their occurrences in pathogenic SAVs and benign SAVs. The UniProt features SIGNAL, TRANSIT, and PROPEP (light blue bars in Figure 1) are modestly depleted in pathogenic SAVs.
A major update in the new DeepSAV predictor is the source of the multiple sequence alignments (MSAs) that are used to calculate sequence conservation and sequence profile, the two most important features determining the performance of DeepSAV.8 We tested MSAs from three different sources: orthologous proteins at the vertebrate level from the OrthoDB12 database, orthologous proteins at the mammalian level from the OrthoDB database, and homologs identified by BLAST with e-value cutoff set to 0.001. Vertebrate-level orthologs gave better prediction results (area under the receiver operating characteristic curve (AUC) score: 0.894) than mammalian-level orthologs (AUC score: 0.886). For DeepSAV using vertebrate-level orthologs, the optimal accuracy score (the sum of the numbers of true positives and true negatives divided by the total number of cases) is 0.824 and the Matthews correlation coefficient (MCC) is 0.648 (supplementary Table S1) at the prediction score cutoff of 0.44. Several sequence identity cutoffs (25%, 30%, 40%, 45%, 50%, 55%, and 60%) were tested to study the effect of removing divergent sequences from the BLAST output. DeepSAV using BLAST homologs with sequence identity cutoffs around 50% gave similar performance to DeepSAV with vertebrate orthologs in terms of accuracy, ROC AUC, MCC, and F1-score (supplementary Table S1). DeepSAV performed worse when lower sequence identity cutoffs were applied to filter out divergent sequences from BLAST output (e.g., 25% cutoff ROC AUC: 0.879 and 30% cutoff ROC AUC: 0.883) (supplementary Table S1). The vertebrate-level MSAs were chosen to calculate scores of SAVs in the human proteome in the updated DeepSAV predictor.
Neural network training and testing
We trained and tested our neural network predictor by using a four-fold cross-validation procedure described previously8 (see Materials and methods). In contrast to the previous version of DeepSAV, we partitioned the datasets into four subsets of roughly equal sizes at the protein level, such that SAVs from any protein stay in the same subset. This procedure aims to reduce overfitting, as the training dataset and the testing dataset use variants from different proteins for any partitioning. Similar partitioning procedures have been used in variant effect predictors SuSPect,13 SAAPdb,14 and VEST.15
A number of parameters in the neural network architecture and hyperparameters in the training procedure were optimized, including kernel size, the number of filters, the number of convolutional layers, the number of dense layers, dropout rates, and the pool size of max pooling layers. Compared to the previous version of DeepSAV, the largest improvement resulted from using a kernel size of 1 instead of 3. Increasing the number of filters to 1000 also resulted in slight improvement. The AUC score of DeepSAV (using MSAs of vertebrate orthologs) is 0.894 and is the best among tested standalone programs that include MPC,16 LIST-S2,17 PrimateAI,18 MutationAssessor,19 SIFT,20 PolyPhen-2,21 PROVEAN,22 fathmm-XF,23 and CADD24 (orange bars in Figure 2(A)). DeepSAV using vertebrate-level orthologs also gave the highest accuracy score, Matthews correlation coefficient score, and F1 score among standalone programs (supplementary Table S1). Meta-predictors (sometimes named ensemble predictors) use prediction scores of a variety of standalone programs (blue bars in Figure 2(A)). As a standalone method, DeepSAV performs worse than most meta-predictors (MVP,25 MetaSVM,26 MetaLR,26 DEOGEN2,27 VEST4,15 REVEL,28 ClinPred,29 and BayesDel30) except for Eigen31 and Condel32 (Figure 2(A) and supplementary Table S1).
Our dataset used for training and testing the DeepSAV predictor was constructed from ClinVar and UniProt SAVs. ClinVar SAVs annotated as “Pathogenic” and “Benign” in general bear more reliable clinical interpretation than those annotated as “Likely pathogenic” and “Likely benign” according to the ACMG (American College of Medical Genetics and Genomics) guidelines.33 DeepSAV performance on the subset of SAVs with ClinVar annotations of “Pathogenic” and “Benign” is better in terms of accuracy (0.856) and MCC (0.712) than on the whole dataset (accuracy: 0.824, MCC: 0.648). DeepSAV performance on the subset of SAVs with ClinVar annotations of “Likely pathogenic” and “Likely benign” (accuracy: 0.835, MCC: 0.669) is slightly worse than that on the subset with ClinVar annotations of “Pathogenic” and “Benign”. On the other hand, DeepSAV performance is worse on the subset of SAVs only annotated in the UniProt database (accuracy: 0.784, MCC: 0.564) compared to the performance on the whole dataset.
Our cross-validation procedure involved dividing the full dataset into four subsets and using three subsets for training and the remaining subset for testing. Partitioning the dataset at the protein level reduced overfitting as variants from the same protein are used exclusively for training or testing. However, this procedure does not eliminate overfitting that may be caused by similarities among proteins. For example, two homologous proteins SEM5A and SEM5B share 59% sequence identity, but they were placed into different subsets. Potential overfitting could occur when the neural network model trained on subsets containing variants from SEM5A is used to calculate the prediction scores of variants from SEM5B. To reduce potential overfitting due to similarities among proteins, we clustered proteins in our dataset by BLASTCLUST (from the NCBI BLAST suite) at the sequence identity cutoff of 25% (BLASTCLUST options: -S: 25, -b: F, and other options default). For four-fold cross-validation, we discarded all variants from any protein in the testing subset if that protein is in the same BLASTCLUST cluster as one or more proteins in the three subsets used in training. Such a procedure reduced the number of variants tested (27,023 of 50,568 pathogenic variants and 33,087 of 54,680 benign variants are tested in the reduced dataset). The performance scores of DeepSAV and other methods on the reduced dataset are similar to those on the full dataset (ROC AUC scores are plotted in Figure 2(A) for the full dataset and Figure 2(B) for the reduced dataset; accuracy scores, F1-scores and Matthews correlation coefficient scores are shown in supplementary Table S1 and Table S2 for the full dataset and reduced dataset, respectively). The relative rankings of DeepSAV and other programs are also similar on the full dataset and the reduced dataset (Figure 2(B) and supplementary Table S2).
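The cluster-based filtering of the testing subset can be sketched as follows. This is an illustrative sketch that assumes the clustering result has already been parsed into a protein-to-cluster mapping; the function name, the mapping, and the identifiers P1 and P2 are hypothetical, and only SEM5A and SEM5B come from the text.

```python
def filter_test_proteins(test_proteins, train_proteins, cluster_of):
    """Drop test proteins that share a sequence cluster with any training protein.

    cluster_of maps each protein identifier to a cluster label.
    """
    train_clusters = {cluster_of[p] for p in train_proteins}
    return [p for p in test_proteins if cluster_of[p] not in train_clusters]


# Toy clustering: SEM5A and SEM5B fall in one cluster, P1 and P2 are singletons.
cluster_of = {"SEM5A": 0, "SEM5B": 0, "P1": 1, "P2": 2}
train_proteins = ["SEM5A", "P1"]
test_proteins = ["SEM5B", "P2"]

kept = filter_test_proteins(test_proteins, train_proteins, cluster_of)
print(kept)  # SEM5B is discarded because SEM5A is used in training
```

All variants of a discarded protein are removed from testing for that fold, which is why the reduced dataset contains fewer tested variants than the full one.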
DBSAV – A database of SAVs for human proteins
We developed the DBSAV database (http://prodata.swmed.edu/DBSAV) with web interfaces that display SAVs along with the primary sequence and various structural and functional properties for human proteins. An example of the web interface is shown for the protein Sonic Hedgehog (gene name: SHH, UniProt accession: Q15465, Figure 3(A)). The top of the web page lists the UniProt accession, gene name and description of the protein, followed by links to other web sites such as gnomAD,7 InterPro,34 ProViz,35 and SWISS-MODEL.36 The GTS score (see its definition and distribution in supplementary Figure S1) calculated based on the DeepSAV scores of gnomAD rare SAVs8 and its percentile are reported. Links are provided to tables with DeepSAV predictions for several sources of SAVs, including pathogenic SAVs, benign SAVs, SAVs found in exome sequencing (gnomAD SAVs), and all possible SAVs by single nucleotide changes of the coding regions of the gene (SnvSAV). Parts of two tables are shown in Figure 3(B) for pathogenic SAVs and Figure 3(C) for gnomAD SAVs, respectively. All SAV tables contain columns of the amino acid positions (AAPOS), original amino acids (OAA), mutated amino acids (MAA), and DeepSAV scores. For tables of pathogenic SAVs and benign SAVs, links to their sources (ClinVar and UniProt) are provided. For gnomAD and SnvSAV tables, fields with information of the nucleotide change are also shown, such as chromosome location, codon and codon change, gnomAD allele count (gnomAD_AC), gnomAD total number of alleles (gnomAD_AN), and gnomAD allele frequencies (gnomAD_AF). Three types of SAVs (PathogenicSAV, BenignSAV, and gnomAD_SAV, with the “#” sign indicating variations to more than one amino acid type) as well as a number of structural and functional features used in DeepSAV prediction are shown in sequence blocks of 100 amino acids. The first 200 amino acids of Sonic Hedgehog (two blocks) are in Figure 3(A).
The feature lines include sequence conservation (integer values from 0 to 9, with 9 being the most conserved), predicted secondary structures (H: alpha-helix; E: beta-strand) and predicted disordered regions (D: disordered position). The STMI line shows the location of signal peptide (S), transit peptide (T), transmembrane segment (M), and intramembrane regions (I) if they are present in the protein. Other UniProt features are displayed if they are present in the sequence block. Such an interface provides a valuable resource to study the relationships between SAVs and structural and functional properties of the protein, which could facilitate mechanistic interpretation of functional impact of SAVs. For example, as several pathogenic variations in Sonic Hedgehog occur at or near the zinc-binding residues (H40, D147, and C183) and at or near the cleavage site (G197, G198, and C199), future discoveries of new mutations around these sites could suggest similar deleterious consequences. We plan to retrain the DeepSAV neural network predictor and update DeepSAV scores and GTS scores in DBSAV in the future when a significant number of new SAVs with clinical annotations (e.g., more than 20% of the current dataset) have been accumulated in databases such as ClinVar and UniProt.
Materials and Methods
Human proteome and multiple sequence alignments
The human proteome was obtained from the UniProt database (version 2020-3, released on 17-JUN-2020).37 For each human protein, three sources of homologs were obtained: orthologous sequences at the vertebrate level in OrthoDB (version 10.1),12 orthologous sequences at the mammalian level in OrthoDB, and homologs detected by BLAST38 (against nr database, e-value inclusion cutoff: 0.001) with divergent sequences removed at various sequence identity cutoffs (25%, 30%, 40%, 45%, 50%, 55%, and 60%). For a small fraction of proteins missing in the OrthoDB database, BLAST homologs were used. For the few cases where OrthoDB vertebrate level homologs have an excessive number of paralogs (>2000 sequences total), OrthoDB mammalian-level homologs were used. Multiple sequence alignments of orthologs were obtained by MAFFT.39 Sequence profile of each position of an alignment, represented as the estimated amino acid frequencies, was calculated as described before.40 In the final version of the DeepSAV predictor, vertebrate-level multiple sequence alignments were used.
Datasets of pathogenic and benign SAVs
Pathogenic SAVs in our datasets include those classified as “Pathogenic” or “Likely Pathogenic” in the ClinVar database9 and those classified as “Disease” in the UniProt database. Benign SAVs are those classified as “Benign” or “Likely Benign” in the ClinVar database and those classified as “Polymorphism” in the UniProt database. Variants with contradicting annotations between ClinVar and UniProt were removed. We used all pathogenic variants as positive cases in neural network training. Negative cases are restricted to benign variants in proteins that are classified as disease-associated in the DisGeNET database.
Cross-validation procedure of training and testing the DeepSAV neural network predictor
To evaluate the performance of our neural network predictor, we performed four-fold cross-validation tests. We partitioned the dataset of pathogenic SAVs (positive dataset) and benign SAVs (negative dataset) into four subsets with similar sizes. The partitioning of SAVs is based on the partitioning of the proteins so that SAVs in the same protein stay in the same partition. Three subsets of pathogenic variants and three subsets of the benign variants were used to train the neural network and the remaining variants were used for testing. This process is repeated four times with each of the four subsets serving as the validation set.
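A protein-level partition of this kind can be sketched as below. The text does not specify how proteins were assigned to subsets, so the greedy balancing used here (largest proteins first, each placed in the currently smallest subset) is an assumption, and the function name and toy variants are hypothetical.

```python
from collections import defaultdict


def partition_by_protein(savs, n_folds=4):
    """Split (protein, variant) pairs into folds without splitting any protein.

    Greedy balancing: proteins with the most variants are placed first,
    each into the fold that currently holds the fewest variants.
    """
    by_protein = defaultdict(list)
    for protein, variant in savs:
        by_protein[protein].append(variant)

    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_protein, key=lambda p: (-len(by_protein[p]), p))
    for protein in ordered:
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend((protein, v) for v in by_protein[protein])
    return folds


# Toy data: five proteins with made-up variants.
toy_savs = [
    ("A", "R10H"), ("A", "G12D"), ("A", "L5P"),
    ("B", "K3E"), ("B", "T7M"),
    ("C", "S9F"), ("D", "V2A"), ("E", "Q4R"),
]
folds = partition_by_protein(toy_savs)
for index, fold in enumerate(folds):
    print(index, fold)
```

In each round of cross-validation, one fold is held out for testing and the other three are concatenated for training, so no protein contributes variants to both sides.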
Performance measurement of DeepSAV and other variant effect predictors
The area under the receiver operating characteristic curve (ROC AUC) was calculated for the prediction results of DeepSAV predictors. We also obtained ROC AUC scores of various prediction methods based on their prediction scores retrieved from the dbNSFP database,41 except for the method of Condel,32 which was obtained from FannsDB at https://bbglab.irb-barcelona.org/fannsdb/. Besides ROC AUC, three measures of performance were also used: accuracy, Matthews correlation coefficient, and F1 score. These measures require a score cutoff to differentiate positive and negative predictions. These cutoffs were optimized to achieve the best evaluation scores for each method.
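The cutoff-dependent measures and the cutoff optimization can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: it assumes that a score at or above the cutoff counts as a positive (deleterious) prediction, scans only the observed scores as candidate cutoffs, and uses made-up toy scores and labels.

```python
import math


def confusion(scores, labels, cutoff):
    """Return (tp, fp, tn, fn), calling score >= cutoff a positive prediction."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(scores, labels, metric):
    """Observed score that maximizes the metric (lowest such score on ties)."""
    return max(sorted(set(scores)), key=lambda c: metric(*confusion(scores, labels, c)))


# Toy predictions: 1 = pathogenic, 0 = benign.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
cutoff = best_cutoff(scores, labels, accuracy)
print(cutoff, accuracy(*confusion(scores, labels, cutoff)), roc_auc(scores, labels))
```

Because each method's cutoff is tuned on the same data used for evaluation, the resulting accuracy, MCC, and F1 values are best-case figures for every method, which keeps the comparison even-handed.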
Features used as input to the DeepSAV neural network predictor
For each amino acid position, features reflecting various sequence, structural, and functional properties were obtained as previously described.8 These features include the amino acid types of the original amino acid and the variant amino acid, sequence profile (estimated amino acid frequencies in the alignment position), sequence conservation,42 predicted 3-state secondary structures (PSIPRED,43 SPIDER,44 and PSSpred45), predicted disorder propensities (DISOPRED3,46 SPOT-Disorder,47 and IUPred2A48), low complexity regions identified by SEG,49 and coiled coil predictions by NCOILS.50 We updated UniProt features used in the previous DeepSAV predictor based on the newer version of UniProt. The new DeepSAV predictor kept previously used UniProt features including SIGNAL (regions of N-terminal signal peptide, an indication of proteins going through secretory pathway), TRANSIT (transit peptide, an indication of mitochondrion targeting), TRANSMEM (transmembrane segments), DISULFID (cysteines participating in disulfide bonds), CARBOHYD (site with covalently attached glycan group), METAL (binding site for a metal ion), BINDING (binding site for any chemical group (co-enzyme, prosthetic group, etc.)), ACT_SITE (amino acid directly involved in the activity of an enzyme), SITE (any single amino acid site that could be functionally relevant), LIPID (site with covalently attached lipid group(s)), MOTIF (short, i.e. up to 20 amino acids, sequence motif of biological interest), and three post-translational modifications (MODRES_P: phosphorylation, MODRES_A: acetylation, and MODRES_M: methylation) that were extracted from the UniProt MOD_RES records. In addition, the new DeepSAV predictor added eight UniProt features that exhibit enrichment or depletion when comparing their frequencies in pathogenic variants and benign variants. 
The newly added UniProt features include INTRAMEM (region that is buried within a membrane, but does not cross it), NP_BIND (nucleotide phosphate-binding region), REGION30 (extent of a region of interest in the sequence that is no more than 30 amino acids), DNA_BIND (DNA-binding region), CA_BIND (calcium-binding region), ZN_FING (zinc finger region), PEPTIDE (released active peptide), and PROPEP (propeptide). Addition of the 8 new UniProt features did not result in significant improvement of neural network performance. The ROC AUC score decreased only slightly (from 0.894 to 0.892) when the new UniProt features were left out in the training process. For the 1-dimensional convolutional neural network, the above features from a window of 21 amino acids (the target position and 10 neighboring positions on each side) were used as input. Features in neighboring positions beyond the first or last residues were zero-filled (zero-padding). One additional feature encodes the indicator of zero-padding for such positions (1 for positions beyond the first or last residues, and zero for normal amino acid positions within the protein length). The total number of features for each position is 98 (see Supplementary Table S3 for their descriptions and counts). By using a window of 21 positions, a total of 98 × 21 = 2058 values are used as the input of the convolutional neural network for each training and testing data point.
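The windowing and zero-padding scheme can be sketched as follows. The sketch assumes that the 98 features per position consist of 97 property features plus the single padding indicator; the exact breakdown is given in Supplementary Table S3 and is not restated here, and the function name and toy feature values are hypothetical.

```python
def window_features(per_position, center, half_width=10):
    """Build the input window around one residue.

    per_position: one feature list per residue (without the padding flag).
    Positions outside the protein are zero-filled and flagged with 1.0;
    positions inside the protein carry their features and a 0.0 flag.
    """
    n_features = len(per_position[0])
    window = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(per_position):
            window.append(list(per_position[pos]) + [0.0])
        else:
            window.append([0.0] * n_features + [1.0])
    return window


# Toy protein of 5 residues, each with 97 placeholder feature values.
toy_protein = [[float(i)] * 97 for i in range(5)]
window = window_features(toy_protein, center=0)
n_values = sum(len(row) for row in window)
n_padded = sum(1 for row in window if row[-1] == 1.0)
print(len(window), len(window[0]), n_values, n_padded)
```

For a variant at the first residue of this toy protein, 10 positions before the N-terminus and 6 positions past the C-terminus are padded, and the window still holds 21 × 98 = 2058 values.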
Enrichment analysis of UniProt features in pathogenic SAVs and benign SAVs
The enrichment log-odds score for a feature is defined as the logarithm (with base 2) of the ratio between its frequency in pathogenic variants and its frequency in benign variants. It reflects enrichment (if the log-odds score is above zero) or depletion (log-odds score less than zero) of the feature in the pathogenic variants compared to the benign variants.
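As a worked example of this definition, the short sketch below computes the log-odds score from raw counts. The counts are invented for illustration, the function name is hypothetical, and no pseudocounts are applied, so a feature absent from either set is not handled.

```python
import math


def enrichment_log_odds(n_feature_path, n_path, n_feature_benign, n_benign):
    """log2 of (feature frequency in pathogenic SAVs / frequency in benign SAVs)."""
    freq_path = n_feature_path / n_path
    freq_benign = n_feature_benign / n_benign
    return math.log2(freq_path / freq_benign)


# Made-up counts: the feature covers 7% of pathogenic and 1% of benign positions.
enriched = enrichment_log_odds(70, 1000, 10, 1000)
depleted = enrichment_log_odds(10, 1000, 20, 1000)
print(round(enriched, 3), round(depleted, 3))
```

A 7-fold enrichment corresponds to a log-odds score of about 2.81, and a 2-fold depletion to a score of -1.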
Optimization of the DeepSAV neural network predictor
We used a deep-learning artificial neural network for prediction of SAV deleteriousness with the network architecture described previously.8 It consists of multiple 1-dimensional convolutional (conv1d) layers, max-pooling layers, and dense layers before the output. The parameters in the neural network architecture and hyperparameters in the training procedure were varied, including kernel size, the number of filters, the number of convolutional layers, the number of dense layers, dropout rates, and the pool size of max pooling layers. More than 30 top parameter/hyperparameter settings gave similar performances that differ in AUC score by less than 0.001 (all above 0.893) using vertebrate homologs as the source of profile. The best performing setting has a kernel size of 1, 1000 filters, 2 dense layers with 200 nodes each, a dropout rate of 0.3, and a pool size of 4 in max pooling layers. The neural network program was written in Python using the TensorFlow package.
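To illustrate what a kernel size of 1 implies, the pure-Python sketch below applies a kernel-size-1 convolution followed by max pooling to a toy window. It is not the DeepSAV implementation: the weights are arbitrary, no activation function is applied, and dropping a trailing partial pooling group is an assumption about padding.

```python
def conv1d_kernel1(window, weights, bias):
    """Kernel-size-1 convolution: the same linear map applied at every position.

    window:  list of positions, each a list of C_in input features
    weights: C_in x C_out matrix (list of rows)
    bias:    list of C_out values
    """
    n_out = len(bias)
    return [
        [bias[j] + sum(x * weights[i][j] for i, x in enumerate(row)) for j in range(n_out)]
        for row in window
    ]


def max_pool1d(rows, pool_size):
    """Non-overlapping max pooling along positions; a trailing partial group is dropped."""
    n_groups = len(rows) // pool_size
    return [
        [max(col) for col in zip(*rows[g * pool_size:(g + 1) * pool_size])]
        for g in range(n_groups)
    ]


# Toy input: 21 positions, 2 input features, 3 filters.
window = [[float(p), float(p % 3)] for p in range(21)]
weights = [[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]]
bias = [0.0, 0.1, -0.2]

conv_out = conv1d_kernel1(window, weights, bias)
pooled = max_pool1d(conv_out, pool_size=4)
print(len(conv_out), len(pooled), len(pooled[0]))
```

With a kernel size of 1, each position is transformed independently of its neighbors, so information from neighboring residues is combined only in the pooling and dense layers.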
Acknowledgements
We thank Dr. Lisa Kinch for helpful discussions and Ming Tang for technical support. The study is supported in part by the grants (to NVG) from the National Institutes of Health (GM127390) and the Welch Foundation (I-1505).
Footnotes
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jmb.2021.166915.
References
- 1.Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, et al. , (2011). Exome sequencing as a tool for Mendelian disease gene discovery. Nature Rev. Genet, 12, 745–755. [DOI] [PubMed] [Google Scholar]
- 2.Cirulli ET, Goldstein DB, (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature Rev. Genet, 11, 415–425. [DOI] [PubMed] [Google Scholar]
- 3.Frazer KA, Murray SS, Schork NJ, Topol EJ, (2009). Human genetic variation and its contribution to complex traits. Nature Rev. Genet, 10, 241–251. [DOI] [PubMed] [Google Scholar]
- 4.Vitkup D, Sander C, Church GM, (2003). The amino-acid mutational spectrum of human genetic disease. Genome Biol., 4, R72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Peng Y, Alexov E, (2016). Investigating the linkage between disease-causing amino acid variants and their effect on protein stability and binding. Proteins, 84, 232–239. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Niroula A, Vihinen M, (2016). Variation interpretation predictors: principles, types, performance, and choice. Hum. Mutat, 37, 579–597. [DOI] [PubMed] [Google Scholar]
- 7.Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, et al. , (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581, 434–443. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Pei J, Kinch LN, Otwinowski Z, Grishin NV, (2020). Mutation severity spectrum of rare alleles in the human genome is predictive of disease type. PLoS Comput. Biol, 16, e1007775. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Landrum MJ, Chitipiralla S, Brown GR, Chen C, Gu B, Hart J, et al. , (2020). ClinVar: improvements to accessing data. Nucleic Acids Res., 48, D835–D844. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.UniProt C, (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res., 47, D506–D515. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Pinero J, Bravo A, Queralt-Rosinach N, Gutierrez-Sacristan A, Deu-Pons J, Centeno E, et al. , (2017). DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res., 45, D833–D839. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Kriventseva EV, Kuznetsov D, Tegenfeldt F, Manni M, Dias R, Simao FA, et al. , (2019). OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res., 47, D807–D811. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Yates CM, Filippis I, Kelley LA, Sternberg MJ, (2014). SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. J. Mol. Biol, 426, 2692–2701. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Al-Numair NS, Martin AC, (2013). The SAAP pipeline and database: tools to analyze the impact and predict the pathogenicity of mutations. BMC Genomics, 14 (Suppl 3), S4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Carter H, Douville C, Stenson PD, Cooper DN, Karchin R, (2013). Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics, 14 (Suppl 3), S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Samocha KE, Kosmicki JA, Karczewski KJ, O’Donnell-Luria AH, Pierce-Hoffman E, MacArthur DG, et al. , (2017). Regional missense constraint improves variant deleteriousness prediction. bioRxiv,, 1–32. [Google Scholar]
- 17.Malhis N, Jacobson M, Jones SJM, Gsponer J, (2020). LIST-S2: taxonomy based sorting of deleterious missense mutations across species. Nucleic Acids Res., 48, W154–W161.
- 18.Sundaram L, Gao H, Padigepati SR, McRae JF, Li Y, Kosmicki JA, et al., (2018). Predicting the clinical impact of human mutation with deep neural networks. Nature Genet., 50, 1161–1170.
- 19.Reva B, Antipin Y, Sander C, (2011). Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res., 39, e118.
- 20.Ng PC, Henikoff S, (2001). Predicting deleterious amino acid substitutions. Genome Res., 11, 863–874.
- 21.Adzhubei I, Jordan DM, Sunyaev SR, (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Curr. Protoc. Hum. Genet., Chapter 7, Unit 7.20.
- 22.Choi Y, Chan AP, (2015). PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels. Bioinformatics, 31, 2745–2747.
- 23.Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C, (2018). FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics, 34, 511–513.
- 24.Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J, (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genet., 46, 310–315.
- 25.Qi H, Chen C, Zhang H, Long JJ, Chung WK, Guan Y, et al., (2018). MVP: predicting pathogenicity of missense variants by deep neural networks. bioRxiv, 1–15.
- 26.Dong C, Wei P, Jian X, Gibbs R, Boerwinkle E, Wang K, et al., (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet., 24, 2125–2137.
- 27.Raimondi D, Tanyalcin I, Ferte J, Gazzo A, Orlando G, Lenaerts T, et al., (2017). DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Res., 45, W201–W206.
- 28.Ioannidis NM, Rothstein JH, Pejaver V, Middha S, McDonnell SK, Baheti S, et al., (2016). REVEL: An ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet., 99, 877–885.
- 29.Alirezaie N, Kernohan KD, Hartley T, Majewski J, Hocking TD, (2018). ClinPred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am. J. Hum. Genet., 103, 474–483.
- 30.Feng BJ, (2017). PERCH: A unified framework for disease gene prioritization. Hum. Mutat., 38, 243–251.
- 31.Ionita-Laza I, McCallum K, Xu B, Buxbaum JD, (2016). A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nature Genet., 48, 214–220.
- 32.Gonzalez-Perez A, Lopez-Bigas N, (2011). Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am. J. Hum. Genet., 88, 440–449.
- 33.Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al., (2015). Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med., 17, 405–424.
- 34.Mitchell AL, Attwood TK, Babbitt PC, Blum M, Bork P, Bridge A, et al., (2019). InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res., 47, D351–D360.
- 35.Jehl P, Manguy J, Shields DC, Higgins DG, Davey NE, (2016). ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences. Nucleic Acids Res., 44, W11–W15.
- 36.Bienert S, Waterhouse A, de Beer TA, Tauriello G, Studer G, Bordoli L, et al., (2017). The SWISS-MODEL Repository-new features and functionality. Nucleic Acids Res., 45, D313–D319.
- 37.UniProt Consortium T, (2018). UniProt: the universal protein knowledgebase. Nucleic Acids Res., 46, 2699.
- 38.Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al., (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- 39.Katoh K, Toh H, (2008). Recent developments in the MAFFT multiple sequence alignment program. Brief. Bioinform., 9, 286–298.
- 40.Pei J, Grishin NV, (2007). PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics, 23, 802–808.
- 41.Liu X, Wu C, Li C, Boerwinkle E, (2016). dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Nonsynonymous and Splice-Site SNVs. Hum. Mutat., 37, 235–241.
- 42.Pei J, Grishin NV, (2001). AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 17, 700–712.
- 43.Buchan DWA, Jones DT, (2019). The PSIPRED Protein Analysis Workbench: 20 years on. Nucleic Acids Res., 47, W402–W407.
- 44.Yang Y, Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, et al., (2017). SPIDER2: A package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks. Methods Mol. Biol., 1484, 55–63.
- 45.Yan R, Xu D, Yang J, Walker S, Zhang Y, (2013). A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci. Rep., 3, 2619.
- 46.Jones DT, Cozzetto D, (2015). DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 31, 857–863.
- 47.Hanson J, Yang Y, Paliwal K, Zhou Y, (2017). Improving protein disorder prediction by deep bidirectional long short-term memory recurrent neural networks. Bioinformatics, 33, 685–692.
- 48.Meszaros B, Erdos G, Dosztanyi Z, (2018). IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res., 46, W329–W337.
- 49.Wootton JC, Federhen S, (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. (Oxford), 17, 149–163.
- 50.Lupas A, (1996). Prediction and analysis of coiled-coil structures. Methods Enzymol., 266, 513–525.