Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models

Hashem A Shihab; Julian Gough; David N Cooper; Peter D Stenson; Gary L A Barker; Keith J Edwards; Ian N M Day; Tom R Gaunt

doi:10.1002/humu.22225

. 2012 Oct 3;34(1):57–65. doi: 10.1002/humu.22225

Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models

Hashem A Shihab ^1,^†, Julian Gough ^2,^†, David N Cooper ³, Peter D Stenson ³, Gary L A Barker ⁴, Keith J Edwards ⁴, Ian N M Day ^1,^‡, Tom R Gaunt ^1,^*,^‡

PMCID: PMC3558800 PMID: 23033316

Abstract

The rate at which nonsynonymous single nucleotide polymorphisms (nsSNPs) are being identified in the human genome is increasing dramatically owing to advances in whole-genome/whole-exome sequencing technologies. Automated methods capable of accurately and reliably distinguishing between pathogenic and functionally neutral nsSNPs are therefore assuming ever-increasing importance. Here, we describe the Functional Analysis Through Hidden Markov Models (FATHMM) software and server: a species-independent method with optional species-specific weightings for the prediction of the functional effects of protein missense variants. Using a model weighted for human mutations, we obtained performance accuracies that outperformed traditional prediction methods (i.e., SIFT, PolyPhen, and PANTHER) on two separate benchmarks. Furthermore, in one benchmark, we achieve performance accuracies that outperform current state-of-the-art prediction methods (i.e., SNPs&GO and MutPred). We demonstrate that FATHMM can be efficiently applied to high-throughput/large-scale human and nonhuman genome sequencing projects with the added benefit of phenotypic outcome associations. To illustrate this, we evaluated nsSNPs in wheat (Triticum spp.) to identify some of the important genetic variants responsible for the phenotypic differences introduced by intense selection during domestication. A Web-based implementation of FATHMM, including a high-throughput batch facility and a downloadable standalone package, is available at http://fathmm.biocompute.org.uk.

Keywords: SNP, hidden Markov models, FATHMM

Introduction

Nonsynonymous single nucleotide polymorphisms (nsSNPs) lead to amino acid substitutions (AASs) and have the potential to affect the function of the protein product of a gene via the structure, biochemistry and/or splicing of the protein. Advances in high-throughput sequencing technologies have accelerated the rate at which nsSNPs are now being identified [The 1000 Genomes Project, ²⁰¹⁰]. Accurate automated computational methods capable of predicting the effects of AASs and amenable to high-throughput analyses of large datasets are therefore of increasing importance for identifying and prioritizing functional nsSNPs for further studies [Thusberg and Vihinen, ²⁰⁰⁹].

The majority of computational prediction methods utilize evolutionary sequence conservation and/or structural annotations within homologous (orthologous and/or paralogous) proteins from a database of known sequences and/or structures [Ng and Henikoff, ²⁰⁰⁶]. Traditionally, the BLAST range of pairwise alignment [Altschul et al., ¹⁹⁹⁰] and sequence profile algorithms [Altschul et al., ¹⁹⁹⁷] have been used to search large sequence databases for homologous proteins falling within a predefined similarity threshold. However, weaknesses of these algorithms include the position-invariant scoring matrices in BLAST and the ad hoc estimation of algorithm parameters, that is, position-invariant gap penalties, in PSI-BLAST [Bateman and Haft, ²⁰⁰²]. On the other hand, hidden Markov models (HMMs) [Eddy, ¹⁹⁹⁶; Krogh et al., ¹⁹⁹⁴] are powerful probabilistic models that can be used to capture position-specific information within a multiple sequence alignment (MSA) of homologous sequences. Here, an MSA is represented as a series of match, insert, and delete states linked together via state transitions. A match state models the position-specific amino acid probabilities (with Dirichlet mixtures [Sjölander et al., ¹⁹⁹⁶]) at each column within the sequence alignment whereas insert/delete states allow for particular residues/states to be inserted and skipped, respectively, throughout the sequence alignment (position-specific insertions/deletions). HMM profiles are similar to PSI-BLAST profiles except they are applied within a more rigorous statistical framework and have been shown to perform considerably better when detecting distant relationships between homologous sequences [Madera and Gough, ²⁰⁰²].

Inspired by previous work [Calabrese et al., ²⁰⁰⁹; Ng and Henikoff, ²⁰⁰¹; Thomas et al., ²⁰⁰³], we have capitalized upon recent advances in the HMMER3 software suite [Eddy, ²⁰⁰⁹] to potentiate the computational prediction of the functional effects of AASs using HMMs. First, we present an unweighted/species-independent method in which homologous sequences are automatically collected and aligned using an iterative search procedure. The resulting MSA is then used to build an ab initio HMM where sequence conservation is then interrogated through the internal match states of the model. In conjunction, sequence conservation within manually curated HMMs representing the alignment of conserved protein domain families: SUPERFAMILY [Gough et al., ²⁰⁰¹] and Pfam [Sonnhammer et al., ¹⁹⁹⁷], is interrogated. This additional domain-based analysis is capable of capturing important structural and evolutionary constraints (via priors) that are potentially missed when using an automatically collected alignment of homologous sequences. Next, we introduce a weighted/species-specific method, which incorporates “pathogenicity weights”. These weights are derived from the relative frequencies of disease-associated and functionally neutral AASs mapping onto conserved protein domains. Using a model weighted for human mutations, we obtained performance accuracies that outperformed traditional prediction methods—SIFT, PolyPhen, and PANTHER—on two separate benchmarks. Furthermore, in one benchmark, we achieve performance accuracies that outperform current state-of-the-art prediction methods: SNPs&GO and MutPred. We demonstrate that our method, functional analysis through hidden Markov models (FATHMM), can be efficiently applied to all foreseeable high-throughput large-scale genomic datasets, and advances the field with the added benefit of providing phenotypic outcome associations. In addition to demonstrating the predictive capabilities of FATHMM on multiple benchmarks representing human mutations, we have applied it in practice to a large dataset of nsSNPs in wheat (Triticum spp.) to identify some of the key genetic variants responsible for the phenotypic differences introduced by intense selection during domestication and have made this analysis publicly available to the scientific community.

Materials and Methods

The Mutation Datasets

A collection of five human mutation datasets from online databases and the literature were downloaded and used in this study (Table 1). First, inherited disease-causing AASs annotated as DMs (damaging mutations) in the Human Gene Mutation Database [Stenson et al., ²⁰⁰⁹] (HGMD—November 2011; http://www.hgmd.org) and inherited putative functionally neutral AASs in the UniProt database [Apweiler et al., ²⁰⁰⁴] (UniProt—November 2011; http://www.uniprot.org/docs/humsavar) were downloaded and used to calculate the pathogenicity weights implemented in our weighted/species-specific method. Next, we obtained two human mutation datasets to assess the performance of FATHMM against the performance of other computational prediction algorithms previously reported in the literature: the VariBench database (VariBench—November 2011; http://bioinf.uta.fi/VariBench) used in a comprehensive review [Thusberg et al., ²⁰¹¹] of nine other computational prediction methods [Adzhubei et al., ²⁰¹⁰; Bao et al., ²⁰⁰⁵; Bromberg and Rost, ²⁰⁰⁷; Calabrese et al., ²⁰⁰⁹; Capriotti et al., ²⁰⁰⁶; Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Ramensky et al., ²⁰⁰²; Thomas et al., ²⁰⁰³] and 267 AASs in four cancer-associated genes (BRCA1, MSH2, MLH1, and TP53) used in a recent review [Hicks et al., ²⁰¹¹] of four alternative computational prediction algorithms [Adzhubei et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Reva et al., ²⁰¹¹; Tavtigian et al., ²⁰⁰⁶]. Finally, we downloaded a human mutation dataset consisting of disease-associated and putative functionally neutral AASs from the SwissVar portal [Mottaz et al., ²⁰¹⁰] (SwissVar—February 2011; http://swissvar.expasy.org) and performed an independent benchmark of FATHMM against eight other computational prediction algorithms [Adzhubei et al., ²⁰¹⁰; Calabrese et al., ²⁰⁰⁹; Capriotti et al., ²⁰⁰⁶; Ferrer-Costa et al., ²⁰⁰⁴; Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Ramensky et al., ²⁰⁰²; Thomas et al., ²⁰⁰³].

Table 1.

Summary of Mutation Datasets

Dataset	Proteins	Amino acid substitutions	Description
HGMD	2,298	49,532	Inherited disease-causing mutations from HGMD used to calculate our pathogenicity weights
UniProt	11,548	36,928	Inherited putative functionally neutral mutations from UniProt used to calculate our pathogenicity weights
VariBench	9,684	40,470	Benchmarking dataset used in a review of nine alternative computational prediction algorithms [Thusberg et al., ²⁰¹¹]
Hicks et al. ²⁰¹¹	4	267	Benchmarking dataset consisting of both disease-causing and functionally neutral mutations in four well-characterized genes (BRCA1, MSH2, MLH1, TP53) used in a recent review of four alternative prediction algorithms [Hicks et al., ²⁰¹¹]
SwissVar	11,986	59,976	Benchmarking dataset used as an independent benchmark of eight alternative prediction algorithms

Open in a new tab

Scoring the Magnitude of Effect of Amino Acid Substitutions

The procedure for predicting the functional consequences on the protein function is as follows (see Supp. Fig. S1 for a flow diagram detailing the procedure): the JackHMMER component of HMMER3 (one iteration with the optional –hand parameter applied; see Supp. Fig. S2) is used to search for homologous sequences within the UniRef90 [Suzek et al., ²⁰⁰⁷] database (November 2011). As part of this procedure, an ab initio HMM representing the MSA of homologous sequences (with Dirichlet mixtures [Sjölander et al., 1996]) is constructed and used. In conjunction, protein domain annotations from the SUPERFAMILY [Gough et al., ²⁰⁰¹] (version 1.75) and Pfam [Sonnhammer et al., ¹⁹⁹⁷] (Pfam-A and Pfam-B; version 26.0) databases are made. The relevant SUPERFAMILY and Pfam HMMs are then extracted only if and when the domain assignment is deemed significant (e-value ≤0.01) and the AAS maps onto a match state within the model.

The information gain (as measured by the Kullback–Leibler [Kullback and Leibler, ¹⁹⁵¹] divergence from the SwissProt/TrEMBL [Apweiler et al., ²⁰⁰⁴] amino acid composition) is then calculated at the corresponding match states within the HMMs extracted above. Next, we interrogate the underlying amino acid probabilities modeled by the most informative HMM and assume that a reduction in the amino acid probabilities (when comparing the wild-type to the mutant residue) indicates a potentially negative impact upon protein function whereas a gain in the amino acid probabilities indicates a more favorable substitution. Furthermore, we assume that larger reductions in amino acid probabilities have more substantial effects than smaller reductions in amino acid probabilities. Here, the predicted magnitude of the effect upon protein function is calculated as follows:

(1)

where P_w and P_m represent the underlying probabilities for the wild-type and mutant amino acid residues, respectively.

Incorporating Species-Specific Pathogenicity Weights

As before, we interrogate the amino acid probabilities within the most informative SUPERFAMILY [Gough et al., ²⁰⁰¹] or Pfam [Sonnhammer et al., ¹⁹⁹⁷] (Pfam-A and Pfam-B) HMM (as measured by the Kullback–Leibler [Kullback and Leibler, 1951] divergence from the SwissProt/TrEMBL [Apweiler et al., ²⁰⁰⁴] amino acid composition). However, for an improved performance in human, the predicted magnitude of effect is weighted by the relative frequency of disease-associated (HGMD) and functionally neutral (UniProt) AASs mapping onto the relevant SUPERFAMILY/Pfam HMM:

(2)

where P_w and P_m represent the underlying probabilities for the wild-type and mutant amino acid residues, respectively, and the pathogenicity weights, W_d and W_n, represent the relative frequencies of disease-associated and functionally neutral AASs mapping onto the relevant HMM, respectively. The pathogenicity weights also include a pseudo-count of 1.0 to avoid a zero divisible term.

Annotating the Molecular and Phenotypic Consequences of Amino Acid Substitutions

The overall biological function of a protein is commonly governed by the various combinations of protein domains within it [Peterson et al., ²⁰¹⁰]. Therefore, we annotate the potential molecular and phenotypic consequences of pathogenic mutations via domain-centric ontologies [de Lima Morais et al., ²⁰¹¹]. For example, the molecular consequences of AASs are statistically inferred by mapping SUPERFAMILY [Gough et al., ²⁰⁰¹] HMMs onto the Gene Ontology [Ashburner et al., ²⁰⁰⁰]. Moreover, the phenotypic consequences of AASs are annotated by extending these mappings onto several phenotype ontologies including the Human Phenotype Ontology [Robinson et al., ²⁰⁰⁸], the Mammalian Phenotype Ontology [Smith and Eppig, ²⁰⁰⁹] and the Plant Phenotype Ontology [Ilic et al., ²⁰⁰⁷; Pujar et al., ²⁰⁰⁶].

Performance Evaluation

In accordance with previous computational prediction methods, the following six parameters (formulae 3-8) were used to assess the performance of our models:

(3)

(4)

(5)

(6)

(7)

(8)

where tp and fp refer to the number of true positives and false positives reported and tn and fn denote the number of true negatives and false negatives reported.

Results

Calculating a Prediction Threshold

Theoretically, using our prediction formulae, scores approximately equal to zero indicate that there is no significant change in the underlying amino acid probabilities whereas scores less than zero indicate that an unfavorable substitution has been observed, that is, the mutant residue is less likely to be observed than the wild-type residue, and scores greater than zero indicate that a favorable substitution has been observed, that is, the mutant residue is more likely to be observed than the wild-type residue. However, in practice, FATHMM is sensitive to small fluctuations in the amino acid probabilities modeled within the HMMs. For example, the slightest reduction in amino acid probabilities would yield a pathogenic prediction in our unweighted/species-independent algorithm. Therefore, to eliminate the effects of these fluctuations, we plotted the distribution of the predicted magnitude of effect for both disease-associated and functionally neutral AASs within the SwissVar dataset (Fig. 1). From this, we calculated prediction thresholds for our unweighted and weighted methods at which the specificity and sensitivity were both maximized (−3.0 and −1.5, respectively). Using our unweighted method, we noted that the majority of disease-associated AASs (>60%) fell below our threshold, whereas the majority of functionally neutral polymorphisms (80%) fell above this threshold. Furthermore, using our weighted method, the majority of disease-associated AASs (80%) fell below our threshold whereas a significant proportion of functionally neutral polymorphisms (>80%) fell above this threshold.

The distribution of the predicted magnitude of effect for disease-associated (shaded region) and functionally neutral (unshaded region) AASs in the SwissVar dataset using our unweighted and weighted methods (A and B, respectively). From this, we calculated prediction thresholds at which both specificity and sensitivity were maximized (−3.0 and −1.5, respectively).

A Performance Comparison Against Published Reviews

The performance of FATHMM was compared against the performance of other computational prediction algorithms reported in two previously published reviews [Hicks et al., ²⁰¹¹; Thusberg et al., ²⁰¹¹]. First, the VariBench database was used to benchmark our method against nine alternative computational prediction algorithms [Adzhubei et al., ²⁰¹⁰; Bao et al., ²⁰⁰⁵; Bromberg and Rost, ²⁰⁰⁷; Calabrese et al., ²⁰⁰⁹; Capriotti et al., ²⁰⁰⁶; Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Ramensky et al., ²⁰⁰²; Thomas et al., ²⁰⁰³] (Table 2). Typically, the performance of trained/weighted computational prediction algorithms is superior to that of theoretical/unweighted algorithms. Therefore, to allow for a fair comparison to be made, we opted to compare our unweighted/species-independent method against other theoretical/unweighted computational algorithms and our weighted/species-specific method against other trained/weighted computational prediction algorithms. From Table 2, and in terms of performance accuracies, PANTHER [Thomas et al., ²⁰⁰³] appears to be the best performing theoretical/unweighted prediction method with an accuracy of 76%. It appears that both SIFT [Ng and Henikoff, ²⁰⁰¹] (another sequence-based method) and our unweighted method perform less favorably with accuracies of 65% and 69%, respectively, indicating that FATHMM is somewhat the better option of the two. The observed performances in our analysis indicate that our weighted method is the best performing method available with an overall performance accuracy of 86%, thereby outperforming the current state-of-the-art prediction methods MutPred [Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰] (81%) and SNPs&GO [Capriotti et al., ²⁰⁰⁶] (82%).

Table 2.

Performance of Computational Prediction Methods using the VariBench Benchmarking Dataset

	tp	fp	tn	fn	Accuracy^a	Precision^a	Specificity^a	Sensitivity^a	NVP^a	MCC^a
Theoretical/unweighted computational prediction methods
SIFT	10,464	4,856	12,188	7,433	0.65	0.64	0.62	0.68	0.66	0.30
PolyPhen 1^b	10,093	9,185	17,669	3,199	0.69	0.77	0.85	0.52	0.64	0.39
PolyPhen 1^c	14,285	4,993	13,671	7,197	0.70	0.68	0.66	0.74	0.72	0.40
PANTHER	9,689	2,859	8,676	2,797	0.76	0.76	0.76	0.77	0.77	0.53
FATHMM (unweighted)	11,561	4,839	16,257	7,707	0.69	0.72	0.77	0.60	0.66	0.38
Trained/weighted computational prediction methods
PolyPhen 2^b	13,807	5,102	13,863	6,010	0.71	0.71	0.70	0.73	0.72	0.43
PolyPhen 2^c	16,206	2,703	10,199	9,674	0.69	0.64	0.51	0.86	0.78	0.39
PhD-SNP	11,900	6,896	16,788	4,377	0.71	0.75	0.79	0.63	0.68	0.43
SNPs&GO	13,736	5,487	17,028	1,382	0.82	0.90	0.92	0.71	0.76	0.65
nsSNPAnalyzer	4,360	2,778	1,319	943	0.60	0.59	0.58	0.61	0.60	0.19
SNAP	16,000	2,146	8,190	6,387	0.72	0.67	0.56	0.88	0.83	0.47
MutPred	13,829	2,507	15,891	4,557	0.81	0.79	0.78	0.85	0.84	0.63
FATHMM (weighted)	14,231	1,633	10,146	2,336	0.86	0.86	0.86	0.86	0.86	0.72

Open in a new tab

tp, fp, tn, fn refer to the number of true positives, false positives, true negatives, and false negatives, respectively.

Accuracy, Precision, Specificity, Sensitivity, NVP, and MCC are calculated from normalized numbers.

“Probably Pathogenic” predictions classed as disease causing.

“Probably Pathogenic” predictions classed as functionally neutral.

Next, we used the Hicks dataset to benchmark FATHMM against four other computational prediction algorithms (using their native alignments) [Adzhubei et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Reva et al., ²⁰¹¹; Tavtigian et al., ²⁰⁰⁶] (Table 3). Overall, Align-GVGD [Tavtigian et al., ²⁰⁰⁶] appears to be the best performing method. However, Align-GVGD employs gene-specific alignments and its performance is severely affected when automatically generated alignments are used [Hicks et al., ²⁰¹¹]. These results appear to indicate that our unweighted method is more specific than either Align-GVGD or SIFT; however, we also noted higher false positive rates when compared with the other prediction methods. In general, and perhaps more surprisingly, it appears that the performance of all trained/weighted computational prediction methods is inferior across the four genes when compared to their theoretical/unweighted counterparts. Again, although no one trained/weighted prediction method performs best across the four genes, it would appear that our weighted method is, on average, the most specific/least sensitive.

Table 3.

Specificity and Sensitivity of Computational Prediction Methods in Four Well-Characterized Genes (BRCA1, MSH2, MLH1, and TP53)

	BRCA1		MSH2		MLH1		TP53

Algorithm	Specificity	Sensitivity	Specificity	Sensitivity	Specificity	Sensitivity	Specificity	Sensitivity
Theoretical/unweighted computational prediction methods
SIFT	0.31	0.94	0.46	0.90	0.52	0.72	0.75	0.84
Align-GVGD	0.94	0.71	0.55	0.90	0.52	0.97	1.00	0.82
FATHMM (unweighted)	0.56	0.65	0.73	0.84	0.71	0.77	1.00	0.71
Trained/weighted computational prediction methods
PolyPhen-2	0.38	0.77	0.36	0.90	0.67	0.90	1.00	0.84
X-Var	0.56	0.82	0.27	1.00	0.33	1.00	0.50	0.96
FATHMM (weighted)	0.70	0.47	0.50	0.79	0.24	0.97	NA^a	1.00

Open in a new tab

The specificity for our weighted method in this instance is uninformative as there was only one neutral mutation falling within conserved protein domains.

Table 4.

Performance of Computational Prediction Methods Using the SwissVar Benchmarking Dataset

	tp	fp	tn	fn	Accuracy^a	Precision^a	Specificity^a	Sensitivity^a	NVP^a	MCC^a
Unweighted computational prediction methods
SIFT	15,634	6,318	28,236	7,716	0.74	0.79	0.82	0.67	0.71	0.49
PolyPhen 1	12,803	8,759	18,603	4,497	0.71	0.70	0.68	0.74	0.72	0.42
PANTHER	8,283	5,842	17,447	5,162	0.68	0.71	0.75	0.62	0.66	0.37
FATHMM (unweighted)	14,311	6,717	29,454	9,429	0.71	0.76	0.81	0.60	0.67	0.43
Weighted/trained computational prediction methods
PolyPhen 2 (HumDiv)	19,782	13,592	20,874	3,204	0.73	0.69	0.61	0.86	0.81	0.48
PolyPhen 2 (HumVar)	19,928	13,239	21,227	3,058	0.74	0.69	0.62	0.87	0.82	0.50
PhD-SNP Sequence	15,695	9,380	26,838	8,062	0.70	0.72	0.74	0.66	0.69	0.40
PhD-SNP Profile	17,548	7,233	27,731	5,161	0.78	0.79	0.79	0.77	0.78	0.57
PMut	13,498	12,156	23,636	10,159	0.62	0.63	0.66	0.57	0.61	0.23
SNPs&GO	17,768	3,768	29,101	5,655	0.82	0.87	0.89	0.76	0.79	0.65
MutPred	21,365	3,500	32,719	2,392	0.90	0.90	0.90	0.90	0.90	0.80
FATHMM (weighted)	15,916	3,017	19,713	4,496	0.82	0.85	0.87	0.78	0.80	0.65

Open in a new tab

tp, fp, tn, fn refer to the number of true positives, false positives, true negatives, and false negatives, respectively.

Accuracy, Precision, Specificity, Sensitivity, NVP, and MCC are calculated from normalized numbers.

An Independent Benchmark Against Other Computational Prediction Methods

Although we recognize the importance of comparing prediction methods in relation to previously established benchmarks, we also conducted our own benchmark (using the SwissVar mutation dataset – see Materials & Methods) comparing the performance of FATHMM against eight published computational prediction methods [Adzhubei et al., ²⁰¹⁰; Calabrese et al., ²⁰⁰⁹; Capriotti et al., ²⁰⁰⁶; Ferrer-Costa et al., ²⁰⁰⁴; Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰; Ng and Henikoff, ²⁰⁰¹; Ramensky et al., ²⁰⁰²; Thomas et al., ²⁰⁰³] (Table 3—see Supp. Table S1). In contrast to the VariBench benchmark, and in terms of performance accuracies, it appears that both SIFT [Ng and Henikoff, ²⁰⁰¹] and our own unweighted method outperform PANTHER [Thomas et al., ²⁰⁰³] (68%) with performance accuracies of 74% and 71%, respectively, indicating that SIFT is somewhat the better option. The best performing method is MutPred [Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰] with a performance accuracy of 90%. However, the observed performances show that our weighted method once again performs favorably when compared to other state-of-the-art prediction methods: SNPs&GO [Calabrese et al., ²⁰⁰⁹], despite the domain-based restriction inherited from our pathogenicity weights. Next, we compared the performance of our unweighted method via receiver operating characteristic (ROC) curves against the top ranking theoretical/unweighted computational prediction methods: SIFT and PANTHER (Fig. 2A, B—see Supp. Fig. S3 for a comprehensive ROC curve against all evaluated methods). Impressively, given a 10% false positive rate, it seems that the performance of our unweighted method is comparable to SIFT thereby highlighting the sensitivity of our method to small fluctuations within the underlying amino acid probabilities. Furthermore, we compared the performance of our weighted method via ROC curves against the top-ranking trained/weighted prediction algorithms: MutPred and SNPs&GO (Fig. 2C, D). These results confirm that our weighted method performs favorably when compared to SNPs&GO.

Receiver operating characteristic (ROC) curves for the top-ranking computational prediction algorithms evaluated using the SwissVar dataset. Here, we compare our unweighted method against SIFT and PANTHER (A—full curve; B—10% false positive rate) whereas our weighted method is compared to SNPs&GO and MutPred (C—full curve; D—10% false positive rate). Full ROC curves for all computational prediction algorithms evaluated are made available in Supp. Figure S3.

The pathogenicity weights incorporated in FATHMM were not directly used to train for, or recognize, pathogenic sequences and/or mutations. We do nevertheless recognize the potential for bias in the performances observed. Therefore, to remove this bias, we performed a “leave-one-out” analysis on all benchmarking datasets. Here, we adjusted our pathogenicity weights, W_d and W_n, if and only when the AAS being evaluated was present in either the HGMD [Stenson et al., ²⁰⁰⁹] or UniProt [Apweiler et al., ²⁰⁰⁴] datasets. We observed no significant deviations in the performance measures reported above and hence concluded that the performances observed were not biased toward the pathogenicity weights employed (see Supp. Table S2).

To understand the potential complementarity/redundancy of FATHMM to other methods, we assessed the intersection of disease-associated AASs correctly identified (true positives) by our method and the top-ranking computational prediction algorithms (Fig. 3). From this analysis, it was clear that no one method completely encapsulates all other methods that is, each method succeeded in correctly and uniquely identifying some disease-associated AASs where other methods did not. These results reaffirm previous suggestions that combining predictions from multiple prediction methods has the potential to perform better than any individual method [Liu et al., ²⁰¹¹; Olatubosun et al., ²⁰¹²].

The intersection of disease-associated amino acid substitutions correctly identified by the top-ranking computational prediction algorithms evaluated using the SwissVar dataset. Here, we compare our unweighted method against SIFT and PANTHER (A) whereas our weighted method is compared to SNPs&GO and MutPred (B).

Facilitating the High-Throughput Analysis of Large Genomic Datasets

Anticipating a massive increase in the number of available whole-genome and whole-exome datasets, the need for accurate computational prediction methods capable of processing these datasets in a timely fashion is increasingly apparent. As a result, the majority of computational prediction algorithms now offer some form of precomputed facility allowing for near-instant predictions to be returned (see Supp. Table S3). However, only SIFT [Ng and Henikoff, ²⁰⁰¹] and PolyPhen-2 [Adzhubei et al., ²⁰¹⁰] allow for batch submissions (with restrictions) to be made. To facilitate the high-throughput analysis of large-scale genomic datasets, our public Web-server provides up-to-date (precomputed) domain assignments for several large sequence collections, including SwissProt/TrEMBL [Apweiler et al., ²⁰⁰⁴]; thereby enabling (unrestricted) near-instant predictions to be made for AASs falling within conserved protein domains. Furthermore, our precomputed database is available as an optional download enabling near-instant predictions to be made while running our software locally.

Annotating Phenotypic Outcome Associations

As previously alluded to, FATHMM not only predicts the potentially deleterious nature of AASs but is also capable of annotating the molecular and phenotypic consequences of these mutations via domain-centric ontologies. To illustrate this, we evaluated the predicted phenotypic consequences of disease-associated AASs within the SwissVar dataset (Supp. Table S4). As expected, the phenotypic consequences of well-characterized diseases are correctly identified. For example, the cardiovascular consequences of the C1971Y mutation in FBN1 (Marfan syndrome; MIM# 154700) are correctly identified via domain-based ontological associations. However, potential issues of using domain-centric ontologies arise when a common domain harbors multiple mutations with distinct and uniquely expressed phenotypes. In these instances, domain-centric ontological associations may have become diluted and should therefore be used with caution. For example, the predicted phenotypic consequences for the R239C mutation in CHRNG (Escobar syndrome; MIM# 265000) are consistent with the associated syndrome, which is characterized by a decrease in fetal movement and overall muscle weakness. However, phenotypes not associated with (or secondary to) Escobar syndrome, for example abnormalities in temperature regulation, were also predicted. Nevertheless, we foresee that these annotations will be most prominent in protein sequences of unknown function and/or ongoing nonhuman genome sequencing projects, as demonstrated below.

Case Study: Annotating the Functional and Phenotypic Consequences of nsSNPs in Wheat

As the world's population continues to grow, so does the demand for crops with particular characteristics such as drought resistance, high yield and resistance to pests and pathogens. The cultivation and repeat harvesting of wild “landrace” wheat varieties has led over time to the emergence of domesticated “elite” wheat varieties with desirable phenotypic characteristics. In an attempt to elucidate some of the important genetic variants responsible for these characteristics, we collected single nucleotide variants (SNVs) from four elite UK bread wheat varieties (Avalon, Cadenza, Rialto, and Savannah) and have predicted the functional effects of these mutations when compared to four landrace wheat varieties from the Watkins collection held at the John Innes Centre, Norwich, UK (304, 306, 311, and 328). For this analysis, SNVs were mapped onto the draft wheat genome assembly and six-frame translated. For each reading frame, SUPERFAMILY [Gough et al., 2001] and Pfam [Sonnhammer et al., ¹⁹⁹⁷] domain assignments on the full-length amino acid sequence were made and the corresponding AASs were evaluated using our unweighted method. We found several biologically interesting SNV differences between the landrace and elite wheat varieties (see Supp. Table S5). For example, wheat contig F0Z7V0F01D2DA5 had a SNP at position 172 in the casein kinase II beta subunit domain with phenotypic consequences predicted to affect the flower developmental stages and vegetative growth. The casein kinase II beta subunit domain has a putative function in flowering time regulation in the model plant Arabidopsis [Ogiso et al., ²⁰¹⁰] and is likely to be biologically significant as European domestic wheat will have been selected to grow under shortened seasons and different day lengths to the landraces. Next, wheat contig GIZP4PP04H5FGF had a SNP at position 219, which lies within the Pfam starch synthase catalytic domain. Once again, this is likely to be biologically significant as the quantity and properties of starch are important to the baking properties of cultivated wheat and will thus have been under strong selection. Finally, wheat contig 09781 had a SNP at position 368 in the cysteine proteinase domain with predicted phenotypic consequences affecting plant structure development. In cereals, cysteine proteases are known to be important in the laying down of storage proteins [Fahmy et al., ²⁰⁰⁴]. As with starch, the properties of wheat storage proteins will have come under intense selection during domestication as they are the most important determinant of baking qualities and economic yield. These results, made publicly available to the wheat genomics community at http://www.cerealsdb.uk.net/functional_snps/index.htm, illustrate the potential additional utility of FATHMM in predicting the functional consequences of variants identified in ongoing nonhuman genome sequencing projects (even in species very distantly related to human).

Discussion

Here, we have introduced and discussed the FATHMM software and server: a species-independent method with optional species-specific weightings for the prediction of the functional effects of protein missense variants. Inspired by previous sequence-based computational prediction algorithms [Ng and Henikoff, ²⁰⁰¹; Thomas et al., ²⁰⁰³], our unweighted/species-independent method interrogates sequence conservation through the underlying amino acid probabilities modeled by the internal match states of several HMMs representing the alignment of homologous sequences and conserved protein domains. Following a similar weighting scheme implemented in SNPs&GO [Calabrese et al., ²⁰⁰⁹], our weighted/species-specific method amalgamates sequence conservation within the HMMs with “pathogenicity weights” representing the relative frequencies of disease-associated and functionally neutral AASs mapping onto conserved protein domains. The pathogenicity weights incorporated here are not directly used to train for, or recognize, pathogenic sequences, and/or mutations. Instead, these weights are capable of recognizing protein domains (species-independent/evolutionary units) sensitive to or intolerant of missense mutations. Therefore, the pathogenicity weights implemented in FATHMM are also likely to represent an improvement for nonhuman organisms (especially those not too distantly related to human) [Ferrer-Costa et al., ²⁰⁰⁵].

The performance of FATHMM was compared to the performances of alternative computational prediction methods previously reported in two published reviews [Hicks et al., ²⁰¹¹; Thusberg et al., ²⁰¹¹]. Furthermore, we performed our own independent benchmark comparing the performance of FATHMM against the performance of other computational prediction methods. In two benchmarks (VariBench/SwissVar), the performance of our unweighted method is comparable to another sequence-based method: SIFT [Ng and Henikoff, 2001], and to a sequence/structure-based method: PolyPhen-1 [Ramensky et al., ²⁰⁰²]. This performance reaffirms the ability of FATHMM to recognize important structural and/or evolutionary constraints (via priors) modeled within manually curated HMMs representing the alignment of conserved protein domains: SUPERFAMILY [Gough et al., ²⁰⁰¹] and Pfam [Sonnhammer et al., ¹⁹⁹⁷]. A detailed analysis of four cancer-associated genes (Hicks; BRCA1, MSH2, MLH1, and TP53) shows Align-GVGD [Tavtigian et al., ²⁰⁰⁶] to be the best performing prediction method. However, this can be attributed to the manually curated (gene-specific) sequence alignments employed in the prediction method. On average, the performance of our unweighted method in this benchmark is comparable to SIFT.

An important issue to consider when comparing the performance of trained/weighted computational prediction methods is the cross-validation dataset, that is, these prediction methods should ideally be tested using “blind” datasets to minimize the bias in the performances observed. Unfortunately, this level of testing is not possible as it would require retraining/validating all prediction methods with common datasets. However, the majority of disease-associated AASs in the VariBench database were collected from Locus-Specific Databases (LSDB) and are not found in commonly used training datasets, for example, SwissProt/TrEMBL [Apweiler et al., ²⁰⁰⁴]. Therefore, the curators claim this bias is minimized in this dataset [Thusberg et al., ²⁰¹¹]. Here, the performance of our weighted method appears to outperform the current state-of-the-art prediction methods: MutPred [Li et al., ²⁰⁰⁹; Mort et al., ²⁰¹⁰] and SNPs&GO [Calabrese et al., ²⁰⁰⁹]. By contrast, the mutation dataset used in our independent benchmark was collected from the SwissVar [Mottaz et al., ²⁰¹⁰] portal. As a result, the estimated performances of other computational prediction methods which have been trained on SwissProt/TrEMBL mutations may be overinflated. Here, MutPred is the best performing method; however, the performance of our weighted method is comparable to SNPs&GO. To alleviate the potential bias in our method, we performed a leave-one-out analysis and found no significant deviations in the observed performances; we therefore concluded that the performances observed in FATHMM were not an artifact of the weighting scheme employed. The performances of all trained/weighted computational prediction algorithms were, somewhat surprisingly, inferior when compared to their theoretical/unweighted counterparts across four cancer-associated genes. The performances observed within Align-GVGD (gene-specific alignments) suggest that there may be some benefit in incorporating disease-specific weightings into our algorithm, for example, cancer-specific weightings similar to those employed by Capriotti and Altman (2011).

A potential disadvantage of our weighted method is the inherited restriction (via the weighting scheme employed) to AASs falling within conserved protein domains. However, protein domain annotations from the SUPERFAMILY and Pfam databases encompass around 80% of the SwissProt/TrEMBL database [Punta et al., ²⁰¹²]. In our analysis, we were able to analyse a large proportion (>70%) of the VariBench and SwissVar benchmarking datasets. On the other hand, unlike other sequence-based prediction methods (including our own unweighted method), which are too slow for practical use in large-scale sequencing projects, our weighted method uses computationally inexpensive domain assignments. Therefore, FATHMM can be efficiently applied to all foreseeable high-throughput large-scale genomic datasets with minimal reduction in coverage. In addition, our method advances the field with its unique ability to annotate the molecular and phenotypic consequences of AASs using several domain-centric ontologies [de Lima Morais et al., ²⁰¹¹] including the Human Phenotype Ontology [Robinson et al., ²⁰⁰⁸] and the Mammalian Phenotype Ontology [Smith and Eppig, ²⁰⁰⁹]. Thus, by coupling the functional predictions generated by FATHMM with domain-based ontological associations, as opposed to protein level annotations, we have developed a tool, which is capable of providing useful insights into the underlying mechanisms disrupted by AASs without any prior/background information on the protein itself. A Web-based implementation of FATHMM, which facilitates the high-throughput analysis of large-scale genomic datasets, and includes a downloadable open-source software package, is available at http://fathmm.biocompute.org.uk.

Acknowledgments

The authors thank Predrag Radivojac and Biao Li from Indiana University (MutPred) and Adam Hospital from the University of Barcelona (PMut) for their help in analyzing the SwissVar dataset. The authors also thank Dr. Philip Guthrie for his help in reviewing the manuscript.

Supplementary material

Disclaimer: Supplementary materials have been peer-reviewed but not copyedited.

humu0034-0057-SD1.pdf^{(258.6KB, pdf)}

humu0034-0057-SD2.png^{(113.7KB, png)}

humu0034-0057-SD3.txt^{(7.9KB, txt)}

References

Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004;32:D115–119. doi: 10.1093/nar/gkh131. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bao L, Zhou M, Cui Y. nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res. 2005;33:W480–482. doi: 10.1093/nar/gki372. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bateman A, Haft DH. HMM-based databases in InterPro. Brief Bioinformatics. 2002;3:236–245. doi: 10.1093/bib/3.3.236. [DOI] [PubMed] [Google Scholar]
Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35:3823–3835. doi: 10.1093/nar/gkm238. [DOI] [PMC free article] [PubMed] [Google Scholar]
Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat. 2009;30:1237–1244. doi: 10.1002/humu.21047. [DOI] [PubMed] [Google Scholar]
Capriotti E, Altman RB. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics. 2011;98:310–317. doi: 10.1016/j.ygeno.2011.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Capriotti E, Calabrese R, Casadio R. Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006;22:2729–2734. doi: 10.1093/bioinformatics/btl423. [DOI] [PubMed] [Google Scholar]
Eddy SR. Hidden Markov models. Curr Opin Struct Biol. 1996;6:361–365. doi: 10.1016/s0959-440x(96)80056-x. [DOI] [PubMed] [Google Scholar]
Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211. [PubMed] [Google Scholar]
Fahmy AS, Ali AA, Mohamed SA. Characterization of a cysteine protease from wheat Triticum aestivum (cv. Giza 164) Bioresour Technol. 2004;91:297–304. doi: 10.1016/s0960-8524(03)00193-7. [DOI] [PubMed] [Google Scholar]
Ferrer-Costa C, Orozco M, Cruz X dela. Sequence-based prediction of pathological mutations. Proteins. 2004;57:811–819. doi: 10.1002/prot.20252. [DOI] [PubMed] [Google Scholar]
Ferrer-Costa C, Orozco M, Cruz X dela. Use of bioinformatics tools for the annotation of disease-associated mutations in animal models. Proteins. 2005;61:878–887. doi: 10.1002/prot.20664. [DOI] [PubMed] [Google Scholar]
Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001;313:903–919. doi: 10.1006/jmbi.2001.5080. [DOI] [PubMed] [Google Scholar]
Hicks S, Wheeler DA, Plon SE, Kimmel M. Prediction of missense mutation functionality depends on both the algorithm and sequence alignment employed. Hum Mutat. 2011;32:661–668. doi: 10.1002/humu.21490. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ilic K, Kellogg EA, Jaiswal P, Zapata F, Stevens PF, Vincent LP, Avraham S, Reiser L, Pujar A, Sachs MM, Whitman NT, McCouch SR, et al. The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant. Plant Physiol. 2007;143:587–599. doi: 10.1104/pp.106.092825. [DOI] [PMC free article] [PubMed] [Google Scholar]
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 1994;235:1501–1531. doi: 10.1006/jmbi.1994.1104. [DOI] [PubMed] [Google Scholar]
Kullback S, Leibler RA. On Information and Sufficiency. The Annals of Mathematical Statistics. 1951;22:79–86. [Google Scholar]
Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–2750. doi: 10.1093/bioinformatics/btp528. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lima Morais DAde, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011;39:D427–434. doi: 10.1093/nar/gkq1130. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat. 2011;32:894–899. doi: 10.1002/humu.21517. [DOI] [PMC free article] [PubMed] [Google Scholar]
Madera M, Gough J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 2002;30:4321–4328. doi: 10.1093/nar/gkf544. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mort M, Evani US, Krishnan VG, Kamati KK, Baenziger PH, Bagchi A, Peters BJ, Sathyesh R, Li B, Sun Y, Xue B, Shah NH, et al. In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Hum Mutat. 2010;31:335–346. doi: 10.1002/humu.21192. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mottaz A, David FPA, Veuthey A-L, Yip YL. Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics. 2010;26:851–852. doi: 10.1093/bioinformatics/btq028. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11:863–874. doi: 10.1101/gr.176601. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ng PC, Henikoff S. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet. 2006;7:61–80. doi: 10.1146/annurev.genom.7.080505.115630. [DOI] [PubMed] [Google Scholar]
Ogiso E, Takahashi Y, Sasaki T, Yano M, Izawa T. The role of casein kinase II in flowering time regulation has diversified during evolution. Plant Physiol. 2010;152:808–820. doi: 10.1104/pp.109.148908. [DOI] [PMC free article] [PubMed] [Google Scholar]
Olatubosun A, Väliaho J, Härkönen J, Thusberg J, Vihinen M. PON-P: Integrated predictor for pathogenicity of missense variants. Hum Mutat. 2012;33:1166–1174. doi: 10.1002/humu.22102. [DOI] [PubMed] [Google Scholar]
Peterson TA, Adadey A, Santana-Cruz I, Sun Y, Winder A, Kann MG. DMDM: domain mapping of disease mutations. Bioinformatics. 2010;26:2458–2459. doi: 10.1093/bioinformatics/btq447. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pujar A, Jaiswal P, Kellogg EA, Ilic K, Vincent L, Avraham S, Stevens P, Zapata F, Reiser L, Rhee SY, Sachs MM, Schaeffer M, et al. Whole-plant growth stage ontology for angiosperms and its application in plant biology. Plant Physiol. 2006;142:414–428. doi: 10.1104/pp.106.085720. [DOI] [PMC free article] [PubMed] [Google Scholar]
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–301. doi: 10.1093/nar/gkr1065. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30:3894–3900. doi: 10.1093/nar/gkf493. [DOI] [PMC free article] [PubMed] [Google Scholar]
Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118. doi: 10.1093/nar/gkr407. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet. 2008;83:610–615. doi: 10.1016/j.ajhg.2008.09.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sjölander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996;12:327–345. doi: 10.1093/bioinformatics/12.4.327. [DOI] [PubMed] [Google Scholar]
Smith CL, Eppig JT. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdiscip Rev Syst Biol Med. 2009;1:390–399. doi: 10.1002/wsbm.44. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–420. doi: 10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l. [DOI] [PubMed] [Google Scholar]
Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, Cooper DN. The Human Gene Mutation Database: 2008 update. Genome Med. 2009;1:13. doi: 10.1186/gm13. [DOI] [PMC free article] [PubMed] [Google Scholar]
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288. doi: 10.1093/bioinformatics/btm098. [DOI] [PubMed] [Google Scholar]
Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, Silva Dde, Zharkikh A, Thomas A. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J Med Genet. 2006;43:295–305. doi: 10.1136/jmg.2005.033878. [DOI] [PMC free article] [PubMed] [Google Scholar]
The 1000 Genomes Project. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13:2129–2141. doi: 10.1101/gr.772403. [DOI] [PMC free article] [PubMed] [Google Scholar]
Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32:358–368. doi: 10.1002/humu.21445. [DOI] [PubMed] [Google Scholar]
Thusberg J, Vihinen M. Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat. 2009;30:703–714. doi: 10.1002/humu.20938. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

humu0034-0057-SD1.pdf^{(258.6KB, pdf)}

humu0034-0057-SD2.png^{(113.7KB, png)}

humu0034-0057-SD3.txt^{(7.9KB, txt)}

[b1] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b2] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]

[b3] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b4] Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004;32:D115–119. doi: 10.1093/nar/gkh131. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b5] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b6] Bao L, Zhou M, Cui Y. nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res. 2005;33:W480–482. doi: 10.1093/nar/gki372. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b7] Bateman A, Haft DH. HMM-based databases in InterPro. Brief Bioinformatics. 2002;3:236–245. doi: 10.1093/bib/3.3.236. [DOI] [PubMed] [Google Scholar]

[b8] Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35:3823–3835. doi: 10.1093/nar/gkm238. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b9] Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R. Functional annotations improve the predictive score of human disease-related mutations in proteins. Hum Mutat. 2009;30:1237–1244. doi: 10.1002/humu.21047. [DOI] [PubMed] [Google Scholar]

[b10] Capriotti E, Altman RB. A new disease-specific machine learning approach for the prediction of cancer-causing missense variants. Genomics. 2011;98:310–317. doi: 10.1016/j.ygeno.2011.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b11] Capriotti E, Calabrese R, Casadio R. Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics. 2006;22:2729–2734. doi: 10.1093/bioinformatics/btl423. [DOI] [PubMed] [Google Scholar]

[b12] Eddy SR. Hidden Markov models. Curr Opin Struct Biol. 1996;6:361–365. doi: 10.1016/s0959-440x(96)80056-x. [DOI] [PubMed] [Google Scholar]

[b13] Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23:205–211. [PubMed] [Google Scholar]

[b14] Fahmy AS, Ali AA, Mohamed SA. Characterization of a cysteine protease from wheat Triticum aestivum (cv. Giza 164) Bioresour Technol. 2004;91:297–304. doi: 10.1016/s0960-8524(03)00193-7. [DOI] [PubMed] [Google Scholar]

[b15] Ferrer-Costa C, Orozco M, Cruz X dela. Sequence-based prediction of pathological mutations. Proteins. 2004;57:811–819. doi: 10.1002/prot.20252. [DOI] [PubMed] [Google Scholar]

[b16] Ferrer-Costa C, Orozco M, Cruz X dela. Use of bioinformatics tools for the annotation of disease-associated mutations in animal models. Proteins. 2005;61:878–887. doi: 10.1002/prot.20664. [DOI] [PubMed] [Google Scholar]

[b17] Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001;313:903–919. doi: 10.1006/jmbi.2001.5080. [DOI] [PubMed] [Google Scholar]

[b18] Hicks S, Wheeler DA, Plon SE, Kimmel M. Prediction of missense mutation functionality depends on both the algorithm and sequence alignment employed. Hum Mutat. 2011;32:661–668. doi: 10.1002/humu.21490. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b19] Ilic K, Kellogg EA, Jaiswal P, Zapata F, Stevens PF, Vincent LP, Avraham S, Reiser L, Pujar A, Sachs MM, Whitman NT, McCouch SR, et al. The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant. Plant Physiol. 2007;143:587–599. doi: 10.1104/pp.106.092825. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b20] Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 1994;235:1501–1531. doi: 10.1006/jmbi.1994.1104. [DOI] [PubMed] [Google Scholar]

[b21] Kullback S, Leibler RA. On Information and Sufficiency. The Annals of Mathematical Statistics. 1951;22:79–86. [Google Scholar]

[b22] Li B, Krishnan VG, Mort ME, Xin F, Kamati KK, Cooper DN, Mooney SD, Radivojac P. Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics. 2009;25:2744–2750. doi: 10.1093/bioinformatics/btp528. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b23] Lima Morais DAde, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011;39:D427–434. doi: 10.1093/nar/gkq1130. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b24] Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum Mutat. 2011;32:894–899. doi: 10.1002/humu.21517. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b25] Madera M, Gough J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 2002;30:4321–4328. doi: 10.1093/nar/gkf544. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b26] Mort M, Evani US, Krishnan VG, Kamati KK, Baenziger PH, Bagchi A, Peters BJ, Sathyesh R, Li B, Sun Y, Xue B, Shah NH, et al. In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Hum Mutat. 2010;31:335–346. doi: 10.1002/humu.21192. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b27] Mottaz A, David FPA, Veuthey A-L, Yip YL. Easy retrieval of single amino-acid polymorphisms and phenotype information using SwissVar. Bioinformatics. 2010;26:851–852. doi: 10.1093/bioinformatics/btq028. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b28] Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11:863–874. doi: 10.1101/gr.176601. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b29] Ng PC, Henikoff S. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet. 2006;7:61–80. doi: 10.1146/annurev.genom.7.080505.115630. [DOI] [PubMed] [Google Scholar]

[b30] Ogiso E, Takahashi Y, Sasaki T, Yano M, Izawa T. The role of casein kinase II in flowering time regulation has diversified during evolution. Plant Physiol. 2010;152:808–820. doi: 10.1104/pp.109.148908. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b31] Olatubosun A, Väliaho J, Härkönen J, Thusberg J, Vihinen M. PON-P: Integrated predictor for pathogenicity of missense variants. Hum Mutat. 2012;33:1166–1174. doi: 10.1002/humu.22102. [DOI] [PubMed] [Google Scholar]

[b32] Peterson TA, Adadey A, Santana-Cruz I, Sun Y, Winder A, Kann MG. DMDM: domain mapping of disease mutations. Bioinformatics. 2010;26:2458–2459. doi: 10.1093/bioinformatics/btq447. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b33] Pujar A, Jaiswal P, Kellogg EA, Ilic K, Vincent L, Avraham S, Stevens P, Zapata F, Reiser L, Rhee SY, Sachs MM, Schaeffer M, et al. Whole-plant growth stage ontology for angiosperms and its application in plant biology. Plant Physiol. 2006;142:414–428. doi: 10.1104/pp.106.085720. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b34] Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–301. doi: 10.1093/nar/gkr1065. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b35] Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002;30:3894–3900. doi: 10.1093/nar/gkf493. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b36] Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39:e118. doi: 10.1093/nar/gkr407. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b37] Robinson PN, Köhler S, Bauer S, Seelow D, Horn D, Mundlos S. The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet. 2008;83:610–615. doi: 10.1016/j.ajhg.2008.09.017. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b38] Sjölander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996;12:327–345. doi: 10.1093/bioinformatics/12.4.327. [DOI] [PubMed] [Google Scholar]

[b39] Smith CL, Eppig JT. The mammalian phenotype ontology: enabling robust annotation and comparative analysis. Wiley Interdiscip Rev Syst Biol Med. 2009;1:390–399. doi: 10.1002/wsbm.44. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b40] Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–420. doi: 10.1002/(sici)1097-0134(199707)28:3<405::aid-prot10>3.0.co;2-l. [DOI] [PubMed] [Google Scholar]

[b41] Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, Cooper DN. The Human Gene Mutation Database: 2008 update. Genome Med. 2009;1:13. doi: 10.1186/gm13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b42] Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288. doi: 10.1093/bioinformatics/btm098. [DOI] [PubMed] [Google Scholar]

[b43] Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, Silva Dde, Zharkikh A, Thomas A. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J Med Genet. 2006;43:295–305. doi: 10.1136/jmg.2005.033878. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b44] The 1000 Genomes Project. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b45] Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13:2129–2141. doi: 10.1101/gr.772403. [DOI] [PMC free article] [PubMed] [Google Scholar]

[b46] Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32:358–368. doi: 10.1002/humu.21445. [DOI] [PubMed] [Google Scholar]

[b47] Thusberg J, Vihinen M. Pathogenic or not? And if so, then how? Studying the effects of missense mutations using bioinformatics methods. Hum Mutat. 2009;30:703–714. doi: 10.1002/humu.20938. [DOI] [PubMed] [Google Scholar]

PERMALINK

Predicting the Functional, Molecular, and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models

Hashem A Shihab

Julian Gough

David N Cooper

Peter D Stenson

Gary L A Barker

Keith J Edwards

Ian N M Day

Tom R Gaunt

Abstract

Introduction

Materials and Methods

The Mutation Datasets

Table 1.

Scoring the Magnitude of Effect of Amino Acid Substitutions

Incorporating Species-Specific Pathogenicity Weights

Annotating the Molecular and Phenotypic Consequences of Amino Acid Substitutions

Performance Evaluation

Results

Calculating a Prediction Threshold

Figure 1.

A Performance Comparison Against Published Reviews

Table 2.

Table 3.

Table 4.

An Independent Benchmark Against Other Computational Prediction Methods

Figure 2.

Figure 3.

Facilitating the High-Throughput Analysis of Large Genomic Datasets

Annotating Phenotypic Outcome Associations

Case Study: Annotating the Functional and Phenotypic Consequences of nsSNPs in Wheat

Discussion

Acknowledgments

Supplementary material

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases