2014 Jul 15;426(14):2692–2701. doi: 10.1016/j.jmb.2014.04.026

SuSPect: Enhanced Prediction of Single Amino Acid Variant (SAV) Phenotype Using Network Features

Christopher M Yates 1, Ioannis Filippis 1, Lawrence A Kelley 1, Michael JE Sternberg 1
PMCID: PMC4087249  PMID: 24810707

Abstract

Whole-genome and exome sequencing studies reveal many genetic variants between individuals, some of which are linked to disease. Many of these variants lead to single amino acid variants (SAVs), and accurate prediction of their phenotypic impact is important. Incorporating sequence conservation and network-level features, we have developed a method, SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction), for predicting how likely SAVs are to be associated with disease. SuSPect performs significantly better than other available batch methods on the VariBench benchmarking dataset, with a balanced accuracy of 82%. SuSPect is available at www.sbg.bio.ic.ac.uk/suspect. The Web site has been implemented in Perl and SQLite and is compatible with modern browsers. An SQLite database of possible missense variants in the human proteome is available to download at www.sbg.bio.ic.ac.uk/suspect/download.html.

Abbreviations: MCC, Matthews correlation coefficient; SVM, support vector machine; PPI, protein–protein interaction; SAV, single amino acid variant; MSA, multiple sequence alignment; PSSM, position-specific scoring matrix; RSA, relative solvent accessibility; RBF, radial basis function

Keywords: protein–protein interaction, nsSNP, missense mutation, SuSPect, SAV

Highlights

  • Bioinformatics approaches are key for identification of disease-causing variants.

  • SAV phenotype prediction can be improved using network information.

  • A method including these features, SuSPect, outperforms tested methods.

  • SuSPect is available to use at www.sbg.bio.ic.ac.uk/suspect.

Introduction

Large-scale projects, such as the Exome Sequencing Project and the 1000 Genomes Project [1], have uncovered substantial genetic variation between individuals. Genome-wide association studies, whole-genome sequencing and exome sequencing have also been used to identify variants associated with both Mendelian diseases, such as cystic fibrosis, and complex diseases, including diabetes and cancer [2,3].

Non-synonymous single nucleotide variants are one of the best-studied groups of variants in human disease. These are single-base changes that lead to a change in the amino acid sequence of the encoded protein, termed a single amino acid variant (SAV) or missense variant. SAVs can also be caused by multiple nucleotide substitutions. The amino acid change can affect, for example, protein stability, interactions and enzyme activity, thereby leading to disease.

In a genome-wide association study or sequencing study, a large number of SAVs can be identified as potentially causative of a disease. It is not feasible to experimentally determine the phenotype and biochemical impact of such a large number of mutations; thus, accurate computational predictions are vital for analysis of identified SAVs. These predictions are generally based on sequence conservation, protein structural features or a combination of these, although other features have been included, such as Gene Ontology terms [4–9]. Commonly, these features are combined using machine learning methods such as random forests and support vector machines (SVMs).

We have previously shown that certain proteins and domains are significantly more likely than others to contain disease-associated variants [10]. Disease-propensity is based on a binomial test comparing the observed numbers of disease-associated and neutral variants in the protein/domain to random expectation. Predicting the phenotypic effect of variants based on the disease-propensity of the domain in which they are located can give good performance but may be affected by underlying biases in the training data toward well-studied proteins and the method has limited coverage of the human proteome. We showed that the susceptibility of proteins and domains to contain disease-associated variants is related to other features including the location in the interactome network of the protein or domain and the function of the protein. Thus, in this work, we include features that correlate with disease-propensity, such as protein–protein interaction (PPI) network centrality. To our knowledge, this is the first time PPI network-based features have been used for SAV phenotype prediction.
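
As a rough illustration of the binomial test behind disease-propensity, the sketch below computes a one-sided p-value for a domain containing more disease-associated SAVs than expected. It rests on assumptions not stated in the text: the background probability is taken to be the overall disease fraction in the training data (20,728 of 57,527 SAVs, the counts reported in Table 1), the test is one-sided, and the domain counts are hypothetical.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed background rate: fraction of disease SAVs among all annotated SAVs
background = 20728 / (20728 + 36799)

# Hypothetical domain: 18 disease-associated SAVs out of 20 observed
p_value = binomial_upper_tail(18, 20, background)
print(f"background = {background:.3f}, p = {p_value:.2e}")
```

A small p-value here would mark the hypothetical domain as disease-susceptible; the mirror-image lower tail would identify disease-resistant domains.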

The function of a folded protein is intimately linked to its three-dimensional structure, and in many cases, the effects of SAVs can be understood by investigating their effects on the protein structure. Accordingly, several approaches to predict phenotype include structural features, particularly when evolutionary information is lacking [6,11,12]. For example, many disease-associated SAVs are located in the core of the protein, whereas SAVs on the surface are more likely to be neutral [13]. However, variants on the surface can affect PPIs, leading to disease [14,15]. In a recent review, we summarized the various mechanisms by which SAVs can affect PPIs, including steric clashes, loss of salt bridges, changes in intrinsic disorder and alterations in post-translational modifications [16].

There is currently low structural coverage of the human proteome and consequently structure-based prediction is only possible for a subset of variants. To counter this, we use both experimentally solved protein structures and homology models produced by Phyre2 [17], greatly increasing structural coverage of the proteome. Structural features may be useful in assessing the likely impact of an SAV, and in this work, we also test whether or not structural information can add value to SAV phenotype prediction.

In our approach, we combine sequence and structural features, which are used in other widely used algorithms, with several other features including network information to train an SVM to identify disease-associated SAVs. Because many of the features chosen are related to the differences between disease-susceptible and disease-resistant domains and proteins, we have termed our method SuSPect (Disease-Susceptibility-based SAV Phenotype Prediction).

We find that incorporating PPI network information improves predictive performance. Surprisingly, features derived from the three-dimensional structure of a protein do not contribute to performance. This is not due to the low structural coverage of the proteome, as the same pattern is seen when tested only on those SAVs with an experimentally solved structure available. Protein structures can, however, help with human interpretation.

SuSPect has been trained to determine the likelihood of an SAV to be associated with disease, and when tested on the VariBench benchmarking dataset [18], SuSPect shows greatly improved performance compared to other widely used methods allowing batch submission, such as PolyPhen-2 [8], SIFT [19], MutationAssessor [20], Condel [9] and FATHMM [21]. Feature selection is used to further improve performance. SuSPect is available as a Web server, where users can submit individual mutations or a VCF (variant call format) file or download a database of pre-calculated scores for all possible SAVs in the human proteome.

Results

A total of 77 features (see Table S1) were calculated for 20,728 disease and 36,799 polymorphism SAVs from the Humsavar database and these SAVs were used to train the SVM learning algorithm, SuSPect-All. Table 1 shows the number of SAVs that could be mapped to PDB structures and Phyre2 models. Using these homology models, we have been able to increase structural coverage from 7.6% to 60.4% of SAVs, an 8-fold increase. A structure or model is available for 48% of the human proteome.

Table 1.

Distribution of disease and polymorphism SAVs in PDB and Phyre2 structures

                Structure
Phenotype       PDB     Phyre2   N/A      Total
Disease         2914    13,560   4254     20,728
Polymorphism    1468    16,833   18,498   36,799
Total           4382    30,393   22,752   57,527

SAVs that could be mapped to PDB structures are significantly more likely to be annotated as disease-associated than those in Phyre2 models or with no structure available (χ² test, p < 2.2 × 10⁻¹⁶), while those for which a Phyre2 model is available are more likely to be disease-associated than those with no structural information available (χ² test, p < 2.2 × 10⁻¹⁶). These results may be due to an intrinsic bias in either the PDB database or the SAV database, with disease-associated proteins more likely to have been studied in detail and therefore to have an available structure or the structure of a homologue.
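
The coverage figures and the association between structural class and phenotype can be checked directly from the counts in Table 1. The following is a minimal sketch only: it computes a single Pearson χ² statistic over the whole 2 × 3 table, whereas the comparisons reported above are pairwise, and it does not convert the statistic to a p-value.

```python
# Counts from Table 1: rows = phenotype, columns = PDB, Phyre2, no structure
table = {
    "Disease":      [2914, 13560, 4254],
    "Polymorphism": [1468, 16833, 18498],
}

col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(col_totals)

pdb_coverage = 100 * col_totals[0] / grand_total
model_coverage = 100 * (col_totals[0] + col_totals[1]) / grand_total
print(f"PDB only: {pdb_coverage:.1f}%  PDB + Phyre2: {model_coverage:.1f}%")

def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_sums = [sum(r) for r in observed]
    col_sums = [sum(c) for c in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Fraction of disease SAVs in each structural class
for name, d, p in zip(["PDB", "Phyre2", "N/A"], *table.values()):
    print(f"{name}: {100 * d / (d + p):.1f}% disease")

print(f"chi-square (2 x 3 table) = {chi_square(list(table.values())):.0f}")
```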

Feature selection

Stability selection [22] was used in conjunction with mRMR (minimum redundancy, maximum relevance) [23] to select the most important features. In 10-fold cross-validation, performance following feature selection was similar to that using all features according to Matthews correlation coefficient (MCC) and balanced accuracy (Table S2). Feature selection was carried out on the full training set, with nine features chosen in all stability selection subsets. These features are described in Table 2 and include a combination of sequence conservation, predicted solvent accessibility and protein network centrality. These features were used to train an SVM, hereafter termed SuSPect-FS (for feature selection). As shown below, SuSPect-FS outperforms SuSPect-All on unseen data. Carrying out feature selection separately on the three sets of SAVs (grouped according to the availability of a structure) gives worse performance than when carried out on all SAVs together (DeLong's test, p < 2.2 × 10⁻¹⁶).
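
The idea of keeping only features chosen in every stability selection subset can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for SuSPect: the base selector is a toy ranking by difference in class means standing in for mRMR, the subsets are stratified half-samples, and the number of rounds, the seed and the data are all hypothetical.

```python
import random

def top_k_by_mean_difference(X, y, k):
    """Toy base selector (NOT mRMR): rank features by the absolute difference
    between class means and return the indices of the top k."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def score(j):
        return abs(sum(r[j] for r in pos) / len(pos)
                   - sum(r[j] for r in neg) / len(neg))

    return set(sorted(range(len(X[0])), key=score, reverse=True)[:k])

def stability_selection(X, y, selector, k, rounds=20, seed=0):
    """Keep features the base selector picks in every stratified half-sample."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    stable = None
    for _ in range(rounds):
        idx = (rng.sample(pos_idx, len(pos_idx) // 2)
               + rng.sample(neg_idx, len(neg_idx) // 2))
        chosen = selector([X[i] for i in idx], [y[i] for i in idx], k)
        stable = chosen if stable is None else stable & chosen
    return stable

# Hypothetical data: feature 0 separates the classes, features 1 and 2 do not
y = [0, 1] * 10
X = [[10.0 * label, float(i % 3), 0.0] for i, label in enumerate(y)]
print(stability_selection(X, y, top_k_by_mean_difference, k=1))
```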

Table 2.

Features chosen in feature selection on the full training set

Feature
(a) Degree centrality in a PPI network.
(b) Number of annotations at this position in UniProt FT feature table.
(c) Score for the wild-type amino acid in a PSSM.
(d) Score for the mutant amino acid in PSSM.
(e) Difference between PSSM scores for the wild type and mutant amino acids at the SAV position.
(f) Difference between Pfam HMM emission probabilities for the wild type and mutant amino acids at the SAV position.
(g) Jensen-Shannon divergence, a measure of sequence conservation.
(h) Percentage sequence identity with the first sequence in the MSA to have the mutant amino acid at the SAV position.
(i) RSA predicted by NetSurfP.

Degree centrality in a PPI network is selected as an important feature. We have previously shown that proteins with significantly more disease-associated than neutral SAVs (disease-susceptible) are positioned centrally in PPI networks [10]. SAVs can affect protein function without leading to disease, for example, if normal cellular function can be carried out even in the complete absence of the protein [24]. We and others have found that mutations in more centrally positioned proteins are more likely to be associated with disease; thus, PPI centrality is likely to be important in discriminating between mutations affecting proteins unlikely to be involved in disease from those in proteins whose mutation is likely to lead to disease. The importance of PPI centrality is shown by the fact that all four centrality measures used are ranked in the top 25% of features (Table S1). There may be bias toward well-studied proteins, although the PPI network used was filtered to only contain those interactions with experimental evidence, which should lessen any bias.

Sequence conservation and predicted solvent accessibility have previously been shown to be useful in SAV phenotype prediction [5]. Functional annotations from UniProt are used to identify variants affecting functionally important residues, for example, those that bind to metal ions or are in a disulfide bond. Because these residues have important functional roles, their mutation can lead to impaired function and therefore disease. While better-studied proteins are potentially more likely to be annotated, many annotations in the UniProt FT table come from similarity or predictions rather than from direct observations, which should decrease the bias toward better-studied proteins. Jensen-Shannon divergence is an information-theoretic measure for identifying important residues by comparing the observed distribution of amino acids in a multiple sequence alignment (MSA) with an estimated background distribution. Positions differing from this background are assumed to be under evolutionary pressure, constraining the observed distribution of amino acids [25].
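
To make the Jensen-Shannon divergence concrete, the sketch below scores a single MSA column against a background distribution. It is a simplified illustration: the uniform background, the small pseudocount and the absence of sequence weighting are assumptions made here, and the method of Ref. [25] may differ in these respects.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def js_divergence(column: str, background=None, pseudocount: float = 1e-6) -> float:
    """Jensen-Shannon divergence (base 2, range 0..1) between the amino acid
    distribution of one MSA column and a background distribution."""
    if background is None:
        # Assumption: uniform background; an empirical amino acid
        # distribution would normally be used instead.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Gap characters and unknown residues are skipped
    counts = Counter(aa for aa in column.upper() if aa in background)
    total = sum(counts.values()) + pseudocount * len(background)
    p = {aa: (counts[aa] + pseudocount) / total for aa in background}
    q = background
    m = {aa: 0.5 * (p[aa] + q[aa]) for aa in background}

    def kl(a, b):
        return sum(a[x] * log2(a[x] / b[x]) for x in a if a[x] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

conserved = js_divergence("CCCCCCCCCC")   # invariant column
variable = js_divergence("ACDEFGHIKL")    # ten different residues
print(f"conserved = {conserved:.3f}, variable = {variable:.3f}")
```

The invariant column diverges strongly from the background and so scores higher than the variable one, which is the sense in which a high divergence indicates conservation.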

Performance of SuSPect

As a test on previously unseen data, the neutral and pathogenic datasets were downloaded from VariBench and filtered to remove any SAVs present in the SuSPect training set, leaving 5432 pathogenic and 13,236 neutral SAVs. This dataset was chosen because of its size and because it has previously been used for benchmarking of similar methods [26]. Figure 1a shows ROC (receiver operating characteristic) curves comparing the performances of SuSPect-All and SuSPect-FS with those of FATHMM, PolyPhen-2, SIFT, Condel, MutationAssessor, MutPred and PANTHER with AUC (area under curve), balanced accuracy, precision, recall and MCC shown in Table 3. The first five of these methods were chosen based on the availability of a Web server allowing batch submission, reflecting the situation faced by a user with a large number of SAVs to analyze and filter. Scores for MutPred and PANTHER were obtained from Thusberg et al. [26]. Other methods, such as FunSAV [27], have shown good performance in benchmarking but are unavailable for batch submission or are only applicable to a subset of SAVs, such as those with an experimentally solved structure available.

Fig. 1.

Performance of five versions of SuSPect compared to seven other methods.

Table 3.

Performance of five versions of SuSPect compared to 11 other SAV phenotype prediction methods, ordered by MCC

Method                    Precision  Recall  F-measure  Balanced accuracy  MCC   AUC
SuSPect-FS                0.75       0.75    0.75       0.82               0.65  0.90
SuSPect-No Structure      0.73       0.67    0.70       0.79               0.59  0.89
SuSPect-All               0.72       0.67    0.69       0.78               0.58  0.88
SNPs&GO                   0.96       0.70    0.81       0.82               0.56  n/a
MutPred                   0.79       0.81    0.80       0.75               0.49  0.84
SuSPect-No Networks       0.78       0.64    0.70       0.71               0.44  0.78
PHD-SNP                   0.69       0.72    0.70       0.69               0.39  n/a
MutationAssessor          0.36       0.81    0.50       0.70               0.34  0.79
SuSPect-FS-No Networks    0.63       0.45    0.53       0.67               0.38  0.74
SNAP                      0.82       0.75    0.78       0.68               0.34  n/a
FATHMM                    0.41       0.71    0.52       0.63               0.24  0.63
SIFT                      0.14       0.58    0.23       0.62               0.22  0.65
Condel                    0.43       0.52    0.47       0.61               0.21  0.63
SNPanalyzer               0.94       0.61    0.74       0.65               0.20  n/a
PANTHER                   0.43       0.75    0.55       0.59               0.17  0.63
PolyPhen-2                0.37       0.60    0.46       0.58               0.14  0.62

For four methods, predictions were binary; thus, AUC could not be calculated.

In this benchmark, SuSPect-All and SuSPect-FS outperform other tested methods, achieving higher sensitivity without loss of selectivity. High sensitivity corresponds to a high proportion of disease-associated SAVs being correctly classified, which SuSPect is able to do without increasing the number of false positives. Feature selection improves performance: SuSPect-FS has an AUC of 0.90, which is significantly higher than SuSPect-All (Fig. 1a, DeLong's test [28], p < 10⁻¹⁰). In addition to these methods, SNAP, SNPs&GO, PHD-SNP and SNPanalyzer results were obtained from Thusberg et al. [26]. In these cases, results were provided as binary classifications, meaning ROC curves could not be produced, but other performance measures (precision, recall, balanced accuracy and MCC) are shown in Table 3, together with the methods mentioned previously.
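
The reason binary classifications cannot yield an AUC is visible in how the AUC is defined: it needs a continuous score to rank SAVs. The sketch below computes the AUC in its rank-based (Mann-Whitney) form on hypothetical scores; it illustrates the quantity only and is not the evaluation code used for this benchmark.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores on a 0-100 scale
disease = [92, 81, 69, 55, 40]
neutral = [60, 35, 22, 18, 10, 5]
print(f"AUC = {roc_auc(disease, neutral):.3f}")
```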

SuSPect-FS has the highest AUC and MCC, as well as the joint-highest balanced accuracy. These three are balanced measures of performance and thus are unaffected by the discrepancy in number of neutral and pathogenic SAVs in the dataset. SNPs&GO shows the highest precision and F-measure, meaning a user can be confident that if an SAV is predicted to be associated with disease, that is likely to be the case. However, SNPs&GO shows lower recall than SuSPect-FS, so more disease-associated SAVs are incorrectly classified as neutral. It also has a lower MCC, which is a measure of how much of the data overall falls in the true negative and true positive categories. A higher MCC therefore corresponds to higher confidence that SAVs called as disease-associated or neutral are truly disease-associated or neutral, respectively. While three methods (SNAP, SNPs&GO and MutPred) show higher F-measure than SuSPect-FS, this measure does not take into account the true negative rate of predictions. MCC does, and is therefore a better measure of overall performance because it reflects how often neutral variants are correctly classified as such.
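
The difference between F-measure and MCC can be seen by computing both from a confusion matrix. In the sketch below, the two confusion matrices are hypothetical and differ only in the number of true negatives; the formulas are the standard definitions of each measure.

```python
from math import sqrt

def metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical predictors: identical tp, fp and fn, very different tn
many_true_negatives = metrics(tp=80, fp=20, tn=900, fn=20)
few_true_negatives = metrics(tp=80, fp=20, tn=20, fn=20)

for name, m in [("many TN", many_true_negatives), ("few TN", few_true_negatives)]:
    print(f"{name}: F = {m['f_measure']:.2f}, MCC = {m['mcc']:.2f}, "
          f"balanced accuracy = {m['balanced_accuracy']:.2f}")
```

The F-measure is identical for both predictors, while MCC and balanced accuracy fall when neutral variants are rarely classified correctly.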

As a test of the importance of network centrality, we removed all network-related features (network centralities and protein–protein interface information) from SuSPect-All and retrained without these features. This SVM gives significantly worse performance than the full method (SuSPect-No Networks, DeLong's test, p < 2.2 × 10⁻¹⁶). Performance without network features is inferior to that of MutationAssessor (AUC, DeLong's test, p < 0.01), MutPred (AUC, p < 10⁻¹³) and SNPs&GO, showing that it is the use of network features that improves performance over other methods. Similar results are seen upon removing PPI network centrality from SuSPect-FS (its only network-related feature), with an AUC of 0.74 (p < 2.2 × 10⁻¹⁶), MCC of 0.38 and balanced accuracy of 0.67, all considerably lower than when PPI network centrality is included.

Interestingly, no structural features were selected for inclusion in SuSPect-FS, suggesting that these features do not give any extra information over that provided by the sequence. To assess this, we removed all structural features from SuSPect-All and retrained the SVM. On the VariBench dataset, this gives slightly better performance than SuSPect-All (DeLong's test, p < 10⁻¹⁴) although worse than SuSPect-FS (SuSPect-No Structure, DeLong's test, p < 10⁻³; Fig. 1 and Table 3). In cross-validation, performance is similar to that of SuSPect (Table S3). The lack of importance of structural features is not due to the poor structural coverage; an SVM was trained only on those SAVs that have a structure available from the PDB. This SVM performs the same as SuSPect-All on the VariBench SAVs with a PDB structure available (Supplementary Table S4, DeLong's test, p = 0.60). Interestingly, when only predicted solvent accessibility is included, performance is slightly but statistically significantly better than when the NACCESS calculated solvent accessibility is included (AUC = 0.91 and 0.90, respectively; DeLong's test, p < 10⁻³). One possible explanation for this is that NetSurfP may be providing information about protein quaternary structure, whereas the NACCESS solvent accessibility calculated from a monomeric structure will not. While structural features may not aid predictive performance, they are helpful in interpretation of how a mutation may have its effect, thus are included in the output of the SuSPect Web server (see below).

Web server and download

SuSPect is available for non-commercial use. Pre-calculated scores from SuSPect-FS are available for human mutations, queried using UniProt accessions or by uploading a VCF file. Scores range from 0 to 100, with a recommended cutoff of 50 for discriminating between neutral and disease-associated SAVs. The distribution of SuSPect scores on the VariBench test set is shown in Fig. 2. In addition to giving a score, detailed information about the SAV is provided, including an image of the SAV in a protein structure or model where available. This extra information, including predicted post-translational modifications, Pfam domains and sequence conservation, helps interpretation of the scores, which we see as being particularly helpful in determining which SAVs are functionally important in disease. These pre-calculated scores are also available to download as an SQLite database to allow users to obtain scores for SAVs locally without uploading large files or potentially sensitive information. If an SAV of known phenotype is uploaded, we inform the user of the phenotype using information from databases such as OMIM or dbSNP.

Fig. 2.

Distribution of SuSPect-FS scores for disease-associated (red) and neutral (blue) SAVs in the VariBench test set. The two sets of SAVs have significantly different distributions (Wilcoxon test, p < 2.2 × 10⁻¹⁶).

Alternatively, users can upload a sequence or structure and receive scores for all possible mutations at all positions. The Humsavar database used for training data consists mostly of nsSNPs (non-synonymous single nucleotide polymorphisms), which are SAVs brought about by only a single-base change, although there are 164 examples of SAVs requiring multiple base substitutions. Because there may be differences between nsSNPs and other SAVs and SuSPect has been trained primarily on nsSNPs, we highlight SAVs that cannot be reached by an nsSNP as part of the extra detailed information. Scores are obtained from the SuSPect-FS database for human proteins and using SuSPect-No Networks for proteins from other organisms, which lack PPI network centrality information. Where a structure has been provided, this can be viewed interactively using JSmol [29], with user-selected residues of interest highlighted.
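
Whether an SAV can be reached by a single-base change follows from the genetic code. The sketch below is one way such a check could be written and is not taken from the SuSPect server: it assumes the standard genetic code and considers every codon for the wild-type residue, ignoring which codon is actually present in the transcript.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop)
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA)}

def min_base_changes(wild_type: str, mutant: str) -> int:
    """Smallest number of nucleotide substitutions converting any codon for
    wild_type into any codon for mutant (one-letter amino acid codes)."""
    wt_codons = [c for c, aa in CODON_TABLE.items() if aa == wild_type]
    mut_codons = [c for c, aa in CODON_TABLE.items() if aa == mutant]
    return min(sum(a != b for a, b in zip(w, m))
               for w in wt_codons for m in mut_codons)

def reachable_by_nssnp(wild_type: str, mutant: str) -> bool:
    """True if a single-base change can produce the substitution."""
    return min_base_changes(wild_type, mutant) == 1

print(reachable_by_nssnp("C", "G"))  # Cys -> Gly, e.g. TGT -> GGT
print(reachable_by_nssnp("M", "W"))  # Met (ATG) -> Trp (TGG) needs two changes
```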

Scores range from 0 (neutral) to 100 (disease-associated), with more extreme scores corresponding to a greater degree of confidence in the prediction. The distributions of scores in the VariBench test set are shown in Fig. 2.

Discussion

We have developed SuSPect, a new method for predicting whether an SAV is associated with disease. Sequence conservation and solvent accessibility are known to be important determinants of the likelihood for an SAV to be deleterious [13,30]. In addition, we have previously shown that disease-susceptible proteins, in which SAVs are significantly more likely to be disease-associated than expected by chance, are located more centrally in PPI networks according to betweenness, degree and coreness centralities [10]. As such, network centrality helps to discriminate between disease-associated and tolerated SAVs by describing how likely any variation in the protein is to lead to disease. Removing network-based features gives a large drop in performance, showing that these features are important for our improved prediction of phenotype. Another important difference is that SuSPect has been specifically trained to discriminate between disease-associated and neutral SAVs, as opposed to predicting an effect on protein function. This is because genetic variants can affect protein function without leading to disease [24] and, while loss of function is often used as a proxy for disease, it is better to use a tool specifically designed for the task. In spite of this, SuSPect is still able to outperform SIFT at predicting the phenotypic effects of a set of mutations in non-human proteins (Supplementary Table S5).

Most of the SAVs used to train SuSPect are involved in Mendelian diseases, although there are some variants involved in complex disorders. While this may lead to difficulty of interpretation when used on complex diseases, many of the same principles could apply between Mendelian and complex phenotypes, and the SuSPect score could be a useful way of prioritizing variants for further investigation.

Using Phyre2 structural models, we increased the structural coverage of the human proteome. Only 7.6% of SAVs in our training data could be mapped to a structure from the PDB, but by also using Phyre2 models, 60.4% of variants had a structure available. However, we see no significant increase in performance when structural features are included and no structural features are chosen through feature selection, suggesting that the sequence of a protein contains sufficient information about protein structure for SAV phenotype prediction. For human interpretation, however, the sequence signal is highly complex and interpretation is problematic; thus, structural information can be helpful.

On a blind test, SuSPect-All and SuSPect-FS significantly outperform PolyPhen-2, SIFT, MutationAssessor, Condel and FATHMM. SuSPect-FS has an AUC of 0.90, which is significantly higher than all other methods tested. We have also tested its ability to predict the phenotypes of mutations in non-human proteins and seen good performance, although worse than on human proteins due to the lack of PPI network information and the fact that predicting a loss of protein function is not the same as predicting a disease-associated mutation, which is the task for which SuSPect was developed. A further test of SuSPect would be the CAGI (Critical Assessment of Genome Interpretation) experiment. A previous, development-stage version of SuSPect was entered into the CAGI 2012 experiment; thus, we would hope to see improved performance in future assessments.

An example of an SAV showing the potential importance of network centrality for phenotype prediction is p.Cys873Gly in MSH2 (UniProt: P43246), which has been identified in families with gastric cancer [31]. This position is not highly conserved, with low Jensen-Shannon divergence and only a small decrease in position-specific scoring matrix (PSSM) score (from 2 to −1), and is not predicted to be buried. However, MSH2 has high degree in the STRING PPI network, interacting with a number of cancer-related proteins, such as PCNA and MLH1. Because of this high degree, SuSPect-FS predicts this SAV to be deleterious (score = 69), whereas SIFT (0.52), PolyPhen-2 (0.003), Condel (0.002), MutationAssessor (1.87), PANTHER (0.43624) and MutPred (0.395) all predict that it will be tolerated. While this is only a single example, it does suggest that there are cases where SuSPect can identify deleterious variants that would be missed by other methods. One limitation of using network centrality is the potential bias toward well-studied proteins, although using data from high-throughput experiments should lessen this bias.

Unlike other methods, the SuSPect Web server also provides users with an explanation of the features and annotations associated with the SAV, which can aid understanding of why a mutation is predicted to be deleterious or not. If the SAV is present in the training data, this will also be noted and the phenotype returned. By providing improved SAV phenotype prediction performance compared to other methods, we consider SuSPect will be a useful tool for research into disease, protein evolution and protein structure.

Materials and Methods

SAV data

SAVs were downloaded from Humsavar (version 2011–09) and VariBench [18]. In the Humsavar database from UniProt-KB, mutations are annotated as Disease, Polymorphism or Unclassified, depending on whether they are disease-associated, neutral or of unknown phenotype. The VariBench neutral dataset is from dbSNP and the pathogenic dataset is from PhenCode, which collates mutations from SwissProt and numerous locus-specific databases [32]. Where necessary, VariBench SAVs were mapped to UniProt sequences based on mapping between UniProt, GenBank and RefSeq accessions and using BLASTP to align sequences. For the blind test, SAVs also present in the Humsavar database were removed, leaving 5432 pathogenic and 13,236 neutral SAVs, many of which are in proteins also present in the training data. However, in cross-validation, we did not see over-training due to protein-level features (see Supplementary Table S2).

Condel [9], PolyPhen-2 [8], SIFT [19] and MutationAssessor [20] scores were obtained from the Condel Web server and FATHMM [21] scores were from the FATHMM Web server. PolyPhen-2 provides both scores and a classification, but we only use the scores in our analysis. These methods were chosen based on the availability of batch submission. Predictions from MutPred, PANTHER, PHD-SNP, SNAP, SNPanalyzer and SNPs&GO were obtained from Thusberg et al. [26].

Protein structures and models

To obtain protein structures, we used the mapping file pdb2sp.txt from UniProt. PDB files were filtered to remove those containing multiple chains, meaning only monomeric structures were used, preventing any misinterpretation of a position as buried when it is in fact at an interface. For each SAV, the UniProt sequence was aligned to the PDB sequence using BLASTP [33]. We required the wild-type amino acid to match the amino acid in the PDB file. If multiple structures were available, that with the best resolution was used.

Where PDB structures were not available, Phyre2 structural models from the Genome3D project were used, requiring a confidence of at least 90% in the model [17,34]. Phyre2 uses HMM-HMM (hidden Markov model) alignments to compare a protein sequence to proteins in a fold library. If a match can be found, the query structure is modeled on the matching structure. If no model could be generated covering an SAV position, only sequence-based features were used for prediction.

SVM features

We used a total of 77 features in SuSPect-All, which were then reduced to nine by feature selection (see section 2.5). Previous studies have shown the importance of sequence conservation in SAV phenotype prediction. To this end, we obtained PSSMs and MSAs by running PSI-BLAST and storing the PSSM produced after three iterations and the MSA after a single iteration [35]. PSSMs are substitution matrices showing, for each position in the protein, how likely each amino acid is to occur, based on their frequencies in an alignment. Uniref50 was used as the sequence database as it has been suggested that it can improve performance in homology detection [36]. The best (measured by lowest E-value) sequences to (i) have any amino acid other than the wild type or (ii) have the mutant amino acid at an SAV position were found. Their BLAST E-values and sequence identities to the query sequence were used as features. These features show how far away two protein sequences have diverged in total before the SAV position changes and the new amino acid is observed. The MSA was also used to calculate Jensen-Shannon divergence for all columns with fewer than 99.9% gaps, and the proportion of gaps in a column was used as another feature [25]. Multiple sequence conservation-based features are used because they each provide differing information. For example, Jensen-Shannon divergence is a measure of conservation at a specific position of an MSA, whereas the sequence identity-based features show how far two protein sequences have diverged overall in order for a given variant to occur between them.

Structural features are also thought to be useful in SAV phenotype prediction. Where possible, SAVs were mapped to monomeric PDB structures or homology models produced using Phyre2. For each structure, DSSP was used to give secondary structure, ϕ/ψ backbone torsion angles and backbone hydrogen bonds [37] and NACCESS was used for relative solvent accessibility (RSA). Fpocket was used to find surface pockets [38]. Betweenness centrality on a residue interaction network was calculated using the igraph library for R [6,39]. Residue interaction networks were produced by connecting all amino acids located < 5 Å from one another. Catalytic sites from the Catalytic Site Atlas [40] and protein interface residues from PISite [41] and ProtInDB were obtained for PDB structures. For Phyre2 models, these annotations were mapped from the template structure.
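
Construction of a residue interaction network with a 5 Å cutoff can be sketched as below. The text does not specify which atoms define the distance between two residues, so this sketch assumes one representative point per residue; the coordinates and residue numbers are hypothetical, and the betweenness calculation itself (done with igraph in this work) is not reproduced.

```python
from itertools import combinations
from math import dist

def residue_interaction_network(coords, cutoff=5.0):
    """Adjacency sets linking residues whose representative points lie
    closer than `cutoff` angstroms."""
    graph = {res: set() for res in coords}
    for a, b in combinations(coords, 2):
        if dist(coords[a], coords[b]) < cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Hypothetical coordinates (angstroms), one point per residue
coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (7.6, 0.0, 0.0),
    4: (3.8, 3.0, 0.0),
}
network = residue_interaction_network(coords)
for residue, neighbours in network.items():
    print(residue, sorted(neighbours))
```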

NetSurfP was used to predict secondary structure and RSA based on the results of a PSI-BLAST search [42]. IUPred was used to identify disordered regions [43] and ANCHOR was used to identify disordered binding sites in both the wild type and mutant sequences [44]. Aliphatic index and GRAVY (grand average of hydropathy) were calculated for the wild-type sequence as described in Refs. [45] and [46], respectively. The former may be related to protein stability, while GRAVY characterizes the hydrophobicity of a protein.
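
For reference, both whole-sequence measures are simple functions of amino acid composition. The sketch below uses the widely used definitions, which are assumed here to match the cited references: GRAVY as the mean Kyte-Doolittle hydropathy, and the aliphatic index as X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is mole percent. The example sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy values
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean hydropathy over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

def aliphatic_index(sequence: str) -> float:
    """Relative volume occupied by aliphatic side chains (Ala, Val, Ile, Leu)."""
    def mole_percent(aa: str) -> float:
        return 100 * sequence.count(aa) / len(sequence)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))

peptide = "MKVLAAGIVGLS"  # hypothetical sequence
print(f"GRAVY = {gravy(peptide):.2f}, "
      f"aliphatic index = {aliphatic_index(peptide):.1f}")
```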

SAVs located in protein domains are more likely to be deleterious than those elsewhere (e.g., in linkers), and the change in Pfam E-value caused by the SAV has been used previously to predict its effects [47]. Domains were detected using Pfam and, if a residue was in a domain, the emission probability for the wild type and the mutant amino acid at that position of the Pfam HMM was obtained [48].

The centrality of each domain in both a domain–domain interaction network (from DOMINE [49]) and a domain bigram network [50] was calculated using four measures: betweenness, closeness, coreness and degree. The same four centrality measures were calculated for each protein in a PPI network obtained from STRING by filtering for human interactions with experimental evidence [51]. In a network, the shortest path between two nodes (protein or domain) is the path that requires the fewest edges. Betweenness centrality is the proportion of all shortest paths that pass through a given node. Closeness centrality is the inverse of the sum of the lengths of the shortest paths from a node to all other nodes, showing how far away the node is from the rest of the network. Coreness is defined using k-core centrality [52], an iterative process in which nodes and their adjacent edges are removed from the network if their degree is less than an integer k. This is repeated until all remaining nodes have degree of at least k, with these nodes constituting the k-core. The coreness of a node is the highest value of k for which it is present in the k-core but not the (k + 1)-core. Degree centrality is the number of edges adjacent to a node (e.g., the number of interactions a protein makes).
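
The definitions of degree, closeness and coreness can be made concrete with a short sketch. The Python code below is illustrative only: it assumes an undirected, unweighted network given as a dictionary of adjacency lists, uses unnormalized closeness, and leaves out betweenness, which needs a shortest-path counting routine such as Brandes' algorithm.

```python
from collections import deque


def degree(adj):
    """Number of edges adjacent to each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness(adj, node):
    """Inverse of the summed shortest-path lengths from node.

    Assumption: only nodes reachable from `node` contribute.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0


def coreness(adj):
    """Coreness obtained by peeling nodes of degree < k for k = 1, 2, ..."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    while remaining:
        k += 1
        while True:
            peel = [v for v, nbrs in remaining.items() if len(nbrs) < k]
            if not peel:
                break
            for v in peel:
                # v survived in the (k - 1)-core but not in the k-core.
                core[v] = k - 1
                for w in remaining.pop(v):
                    if w in remaining:
                        remaining[w].discard(v)
    return core
```

For a triangle with one extra node attached to a single corner, the three triangle nodes have coreness 2 and the attached node has coreness 1.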

Functional annotations were obtained from the UniProt FT table [53]. Finally, a number of features describing the wild type and the mutant amino acids were used: BLOSUM [54], change in charge, mutation to/from glycine/proline and changes to the values of the five principal components calculated by Atchley et al. [55] from the 494 amino acid indices in the Amino Acid Index [56].

LIBSVM 3.12 was used for SVMs, using a radial basis function (RBF) kernel [57]. RBF kernel SVMs have two parameters: C, which is the soft margin parameter, determining how heavily misclassifications in the training set are penalized, and gamma, which controls the radius of the RBF kernel (a larger gamma gives a narrower kernel), determining how far the influence of each training data point stretches. The sigest function in the kernlab library was used to obtain a suitable value for gamma [58], and 10-fold cross-validation was used to determine the optimum value for C. Following cross-validation, C = 128 and gamma = 0.01 were chosen.
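
To show what gamma controls, the sketch below evaluates the RBF kernel in the form used by LIBSVM, K(x, y) = exp(−gamma · ||x − y||²). The candidate grid for C is an assumption made for illustration (powers of two, a common choice for SVM parameter searches). The text does not state which values of C were tried, only that C = 128 was selected.

```python
import math


def rbf_kernel(x, y, gamma=0.01):
    """RBF kernel value for two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical search grid for the soft margin parameter C.
CANDIDATE_C = [2.0 ** k for k in range(-5, 16, 2)]
```

Identical points give a kernel value of 1, and the value decays towards 0 as the points move apart. The larger gamma is, the faster that decay.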

While there are protein-level features present, when cross-validation was carried out by dividing proteins into 10 sets, there was only a very small drop in MCC and balanced accuracy compared to dividing the SAVs into 10 groups with proteins present in multiple groups (see Supplementary Table S2), implying that there is little or no over-training for certain proteins.
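
The protein-level split can be sketched as follows. This is an illustrative assignment scheme, not necessarily the one used in the study: proteins are placed into folds round-robin in sorted order, and every SAV inherits the fold of its protein, so no protein contributes SAVs to more than one fold. The function name and the identifiers in the usage note are hypothetical.

```python
def protein_level_folds(sav_proteins, n_folds=10):
    """Return one fold index per SAV, grouping SAVs by their protein.

    sav_proteins: list giving the protein identifier of each SAV.
    """
    proteins = sorted(set(sav_proteins))
    fold_of = {p: i % n_folds for i, p in enumerate(proteins)}
    return [fold_of[p] for p in sav_proteins]
```

For five SAVs from proteins `["P1", "P2", "P1", "P3", "P2"]` and two folds, both SAVs of `P1` land in the same fold, as do both SAVs of `P2`.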

A model was trained using the SAVs from Humsavar and tested on those from VariBench, after filtering to remove any SAVs also present in the training set. Probability estimates were obtained by using the -b 1 option when training the SVM, as described in Ref. [57]. These probability estimates were then multiplied by 100 to give the SuSPect score. Before training and testing, all features were scaled to between 0 and 1.
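
The scaling and scoring steps are simple enough to sketch directly. The code below assumes that the minimum and maximum of each feature are taken from the training set and then reused for the test set. That is a common convention, but the text does not spell it out. Function names are chosen for this example.

```python
def fit_min_max(rows):
    """Per-feature minimum and maximum over a list of feature vectors."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row, mins, maxs):
    """Scale one feature vector to [0, 1] using stored minima and maxima.

    Constant features (max == min) are mapped to 0.0.
    """
    return [
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(row, mins, maxs)
    ]


def suspect_score(probability):
    """Convert an SVM probability estimate (0 to 1) to a 0 to 100 score."""
    return 100.0 * probability
```

A probability estimate of 0.5 therefore corresponds to a score of 50.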

Evaluation of performance

We carried out 10-fold cross-validation as an initial test (see Supplementary Information). As a test on unseen data, VariBench was used, having been filtered to remove SAVs present in the training data. ROC curves and their AUC were calculated using the pROC package in R [59]. ROC curves were compared using the roc.test function, which implements DeLong's non-parametric test for AUC comparison [28]. Accuracy, recall (sensitivity), precision, MCC, F-measure and balanced accuracy were calculated using the following equations, where TP is the number of correctly predicted disease SAVs, TN is the number of correctly predicted neutral polymorphisms, FP is the number of neutral SAVs misclassified as disease and FN is the number of disease SAVs misclassified as neutral. For cross-validation, the SVM was trained to give a binary classification. For blind testing, probabilistic outputs were used, with a cutoff of 50 used to discriminate between neutral and disease-associated SAVs.

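$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Balanced Accuracy} &= 0.5 \times \frac{TP}{TP + FN} + 0.5 \times \frac{TN}{TN + FP} \\
\text{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \\
\text{F-measure} &= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$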

Feature selection

For feature selection, stability selection was used in conjunction with mRMR. In stability selection, feature selection is carried out on multiple subsets of 50% of the SAVs and the frequency at which each feature is chosen is used to determine important features [22]. mRMR aims to identify the subset of features most relevant to the outcome while reducing the redundancy between selected features [23]. In 10-fold cross-validation, SAVs were split into 10 sets. Feature selection was carried out on nine of the sets and then an SVM trained on these sets with the selected features. This SVM was used to make predictions on the tenth set. This was repeated for all test sets. In both cross-validation and feature selection on the full training set, 100 subsets were used for stability selection, with those features selected in all 100 iterations chosen. For mRMR, parameter alpha was set to 0.5 and 30 features were selected in each iteration.
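
The stability selection loop can be sketched as below. The base selector here is a deliberately simple stand-in (ranking features by the absolute difference in class means), not mRMR. It and all function names are assumptions made to keep the example self-contained. The defaults mirror the text: 100 random subsets of 50% of the SAVs, keeping only features chosen in every subset.

```python
import random


def top_k_by_mean_difference(rows, labels, k=1):
    """Stand-in base selector (NOT mRMR): rank by class-mean difference."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return []
    n_features = len(rows[0])
    diff = [
        abs(sum(r[j] for r in pos) / len(pos)
            - sum(r[j] for r in neg) / len(neg))
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: diff[j], reverse=True)
    return ranked[:k]


def stability_selection(rows, labels, base_selector,
                        n_subsets=100, threshold=1.0, seed=0):
    """Keep features chosen in at least `threshold` of the random subsets."""
    rng = random.Random(seed)
    n = len(rows)
    counts = {}
    for _ in range(n_subsets):
        idx = rng.sample(range(n), n // 2)  # 50% subset, no replacement
        chosen = base_selector([rows[i] for i in idx],
                               [labels[i] for i in idx])
        for feature in chosen:
            counts[feature] = counts.get(feature, 0) + 1
    return sorted(f for f, c in counts.items() if c / n_subsets >= threshold)
```

In the setting described above, the base selector would instead return the 30 features picked by mRMR on each subset.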

Unless otherwise stated, statistical tests and calculations were carried out in R, with scripting in Perl.

Acknowledgements

We would like to thank Prof. Mauno Vihinen for generously providing VariBench benchmarking data and Dr. Suhail Islam for his invaluable help with the Web server. This work was supported by the Medical Research Council (grant number G1000390-1/1).

Appendix A. Supplementary data

Supplementary material.

mmc1.docx (163.7KB, docx)

References

  • 1. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
  • 2. Stranger B.E., Stahl E.A., Raj T. Progress and promise of genome-wide association studies for human complex trait genetics. Genetics. 2011;187:367–383. doi: 10.1534/genetics.110.120907.
  • 3. Roach J.C., Glusman G., Smit A.F.A., Huff C.D., Hubley R., Shannon P.T. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328:636–639. doi: 10.1126/science.1186802.
  • 4. Ng P.C. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
  • 5. Bromberg Y., Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35:3823–3835. doi: 10.1093/nar/gkm238.
  • 6. Cheng T.M.K., Lu Y.-E., Vendruscolo M., Lio' P., Blundell T.L. Prediction by graph theoretic measures of structural effects in proteins arising from non-synonymous single nucleotide polymorphisms. PLoS Comput Biol. 2008;4:e1000135. doi: 10.1371/journal.pcbi.1000135.
  • 7. Calabrese R., Capriotti E., Fariselli P., Martelli P.L., Casadio R. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat. 2009;30:1237–1244. doi: 10.1002/humu.21047.
  • 8. Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
  • 9. González-Pérez A., López-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score Condel. Am J Hum Genet. 2011;88:440–449. doi: 10.1016/j.ajhg.2011.03.004.
  • 10. Yates C.M., Sternberg M.J.E. Proteins and domains vary in their tolerance of non-synonymous single nucleotide polymorphisms (nsSNPs). J Mol Biol. 2013;425:1274–1286. doi: 10.1016/j.jmb.2013.01.026.
  • 11. Bao L., Cui Y. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics. 2005;21:2185–2190. doi: 10.1093/bioinformatics/bti365.
  • 12. Capriotti E., Altman R.B. Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics. 2011;12:S3. doi: 10.1186/1471-2105-12-S4-S3.
  • 13. Yue P., Li Z., Moult J. Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol. 2005;353:459–473. doi: 10.1016/j.jmb.2005.08.020.
  • 14. David A., Razali R., Wass M.N., Sternberg M.J.E. Protein–protein interaction sites are hot spots for disease-associated nonsynonymous SNPs. Hum Mutat. 2012;33:359–363. doi: 10.1002/humu.21656.
  • 15. Wang X., Wei X., Thijssen B., Das J., Lipkin S.M., Yu H. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat Biotechnol. 2012;30:159–164. doi: 10.1038/nbt.2106.
  • 16. Yates C.M., Sternberg M.J.E. The effects of non-synonymous single nucleotide polymorphisms (nsSNPs) on protein–protein interactions. J Mol Biol. 2013;425:3949–3963. doi: 10.1016/j.jmb.2013.07.012.
  • 17. Kelley L.A., Sternberg M.J.E. Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc. 2009;4:363–371. doi: 10.1038/nprot.2009.2.
  • 18. Nair P.S., Vihinen M. VariBench: a benchmark database for variations. Hum Mutat. 2013;34:42–49. doi: 10.1002/humu.22204.
  • 19. Sim N.-L., Kumar P., Hu J., Henikoff S., Schneider G., Ng P.C. SIFT Web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 2012;40:W452–W457. doi: 10.1093/nar/gks539.
  • 20. Reva B., Antipin Y., Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118. doi: 10.1093/nar/gkr407.
  • 21. Shihab H.A., Gough J., Cooper D.N., Stenson P.D., Barker G.L.A., Edwards K.J. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum Mutat. 2013;34:57–65. doi: 10.1002/humu.22225.
  • 22. Meinshausen N., Bühlmann P. Stability selection. J R Stat Soc Ser B Stat Methodol. 2010;72:417–473.
  • 23. Ding C., Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005;3:185–205. doi: 10.1142/s0219720005001004.
  • 24. MacArthur D.G., Balasubramanian S., Frankish A., Huang N., Morris J., Walter K. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–828. doi: 10.1126/science.1215040.
  • 25. Capra J.A., Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–1882. doi: 10.1093/bioinformatics/btm270.
  • 26. Thusberg J., Olatubosun A., Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32:358–368. doi: 10.1002/humu.21445.
  • 27. Wang M., Zhao X.-M., Takemoto K., Xu H., Li Y., Akutsu T. FunSAV: predicting the functional effect of single amino acid variants using a two-stage random forest model. PLoS One. 2012;7:e43847. doi: 10.1371/journal.pone.0043847.
  • 28. DeLong E.R., DeLong D.M., Clarke-Pearson D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
  • 29. Hanson R.M., Prilusky J., Renjian Z., Nakane T., Sussman J.L. JSmol and the next-generation Web-based representation of 3D molecular structure as applied to proteopedia. Isr J Chem. 2013;53:207–216.
  • 30. Ng P.C., Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11:863–874. doi: 10.1101/gr.176601.
  • 31. Kim J., Kim H., Roh S., Koo K., Lee D., Yu C. hMLH1 and hMSH2 mutations in families with familial clustering of gastric cancer and hereditary non-polyposis colorectal cancer. Cancer Detect Prev. 2001;25:503–510.
  • 32. Giardine B., Riemer C., Hefferon T., Thomas D., Hsu F., Zielenski J. PhenCode: connecting ENCODE data with mutations and phenotype. Hum Mutat. 2007;28:554–562. doi: 10.1002/humu.20484.
  • 33. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic Local Alignment Search Tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
  • 34. Lewis T.E., Sillitoe I., Andreeva A., Blundell T.L., Buchan D.W.A., Chothia C. Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. Nucleic Acids Res. 2013;41:D499–D507. doi: 10.1093/nar/gks1266.
  • 35. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  • 36. Chubb D., Jefferys B.R., Sternberg M.J.E., Kelley L.A. Sequencing delivers diminishing returns for homology detection: implications for mapping the protein universe. Bioinformatics. 2010;26:2664–2671. doi: 10.1093/bioinformatics/btq527.
  • 37. Kabsch W., Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
  • 38. Le Guilloux V., Schmidtke P., Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics. 2009;10:168. doi: 10.1186/1471-2105-10-168.
  • 39. Csardi G., Nepusz T. The igraph software package for complex network research. InterJ Complex Syst. 2006;1695:1–9.
  • 40. Porter C.T., Bartlett G.J., Thornton J.M. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Res. 2004;32:D129–D133. doi: 10.1093/nar/gkh028.
  • 41. Higurashi M., Ishida T., Kinoshita K. PiSite: a database of protein interaction sites using multiple binding states in the PDB. Nucleic Acids Res. 2009;37:D360–D364. doi: 10.1093/nar/gkn659.
  • 42. Petersen B., Petersen T.N., Andersen P., Nielsen M., Lundegaard C. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009;9:51. doi: 10.1186/1472-6807-9-51.
  • 43. Dosztányi Z., Csizmók V., Tompa P., Simon I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol. 2005;347:827–839. doi: 10.1016/j.jmb.2005.01.071.
  • 44. Meszaros B., Simon I., Dosztanyi Z. Prediction of protein binding regions in disordered proteins. PLoS Comput Biol. 2009;5:e1000376. doi: 10.1371/journal.pcbi.1000376.
  • 45. Ikai A. Thermostability and aliphatic index of globular proteins. J Biochem. 1980;88:1895–1898.
  • 46. Kyte J., Doolittle R.F. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157:105–132. doi: 10.1016/0022-2836(82)90515-0.
  • 47. Clifford R.J., Edmonson M.N., Nguyen C., Buetow K.H. Large-scale analysis of non-synonymous coding region single nucleotide polymorphisms. Bioinformatics. 2004;20:1006–1014. doi: 10.1093/bioinformatics/bth029.
  • 48. Punta M., Coggill P.C., Eberhardt R.Y., Mistry J., Tate J., Boursnell C. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–D301. doi: 10.1093/nar/gkr1065.
  • 49. Yellaboina S., Tasneem A., Zaykin D.V., Raghavachari B., Jothi R. DOMINE: a comprehensive collection of known and predicted domain–domain interactions. Nucleic Acids Res. 2011;39:D730–D735. doi: 10.1093/nar/gkq1229.
  • 50. Xie X., Jin J., Mao Y. Evolutionary versatility of eukaryotic protein domains revealed by their bigram networks. BMC Evol Biol. 2011;11:242. doi: 10.1186/1471-2148-11-242.
  • 51. Szklarczyk D., Franceschini A., Kuhn M., Simonovic M., Roth A., Minguez P. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–D568. doi: 10.1093/nar/gkq973.
  • 52. Dorogovtsev S., Goltsev A., Mendes J. k-Core organization of complex networks. Phys Rev Lett. 2006;96:3–6. doi: 10.1103/PhysRevLett.96.040601.
  • 53. The UniProt Consortium. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013;41:D43–D47. doi: 10.1093/nar/gks1068.
  • 54. Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915.
  • 55. Atchley W.R., Zhao J., Fernandes A.D., Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci. 2005;102:6395–6400. doi: 10.1073/pnas.0408677102.
  • 56. Kawashima S., Pokarowski P., Pokarowska M., Kolinski A., Katayama T., Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008;36:D202–D205. doi: 10.1093/nar/gkm998.
  • 57. Chang C.-C., Lin C.-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:1–27.
  • 58. Karatzoglou A., Smola A. kernlab—an S4 package for kernel methods in R. J Stat Softw. 2004;11:1–20.
  • 59. Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.-C. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. doi: 10.1186/1471-2105-12-77.
