Abstract
Single nucleotide variants (SNVs) are, together with copy number variation, the primary source of variation in the human genome and are associated with phenotypic variation such as altered response to drug treatment and susceptibility to disease. Linking structural effects of non-synonymous SNVs to functional outcomes is a major issue in structural bioinformatics. The SNPeffect database (http://snpeffect.switchlab.org) uses sequence- and structure-based bioinformatics tools to predict the effect of protein-coding SNVs on the structural phenotype of proteins. It integrates aggregation prediction (TANGO), amyloid prediction (WALTZ), chaperone-binding prediction (LIMBO) and protein stability analysis (FoldX) for structural phenotyping. Additionally, SNPeffect holds information on affected catalytic sites and a number of post-translational modifications. The database contains all known human protein variants from UniProt, but users can now also submit custom protein variants for a SNPeffect analysis, including automated structure modeling. The new meta-analysis application allows plotting correlations between phenotypic features for a user-selected set of variants.
INTRODUCTION
Human next-generation sequencing projects currently generate millions of previously unknown single nucleotide variants (SNVs) (1). On average, every newly sequenced genome yields about 300 000 novel SNVs (2). Although it is quite straightforward to annotate these SNVs according to their genomic location (coding, non-coding and regulatory regions), and for coding SNVs to denote their effect on the translated protein (synonymous or non-synonymous), predicting the detailed effect of a coding mutation on the structure and function of a protein is a largely unsolved problem. As these variants can influence drug selection, dosing and adverse effects (3), it is recognized that this genetic information is of great importance for drug development in general (4) and crucial for personalized medicine (5). Most current approaches classify SNVs as neutral or deleterious by using either conservation-based measures (6) or a combination of conservation scores and structural features (7–9). Tools for predicting stability changes upon mutation have also been developed (10,11); however, these do not use explicit stability predictions based on a high-resolution structure but rather depend on black-box predictions from machine-learning approaches such as support-vector machines or neural networks.
Coding non-synonymous SNVs can affect protein structure and function to various degrees (12,13). Although predicting neutral or fully disruptive variants is relatively easy, a large portion of variants will result in more subtle intermediate phenotypic effects that are much more challenging to predict.
To tackle this challenge, web servers such as PolyPhen (9) and HOPE (8) base their predictions on a statistical analysis of protein structures extrapolated to the protein under study and do not currently provide quantitative free-energy changes for point mutations. SNPeffect, on the other hand, uses the FoldX (14) force field and aims at calculating realistic free-energy changes upon mutation (ΔΔG), thereby providing high-accuracy protein stability information. As structure quality is crucial for the accuracy of ΔΔG predictions using FoldX, we currently do not model structures with <90% sequence identity to the modeling template structure. As a result, the structural coverage of SNPeffect is somewhat lower than that of PolyPhen or HOPE. However, by integrating several in-house developed structural bioinformatics tools designed to quantify protein misfolding (FoldX), protein aggregation [TANGO (15) and WALTZ (16)] and chaperone interaction [LIMBO (17)], SNPeffect was developed with the specific aim of mapping the effect of SNVs on the protein homeostasis landscape, i.e. the ability of a cell to maintain appropriate concentrations of properly folded proteins in the correct cellular compartment (18). Currently, SNPeffect provides pre-calculated mutant analyses for more than 60 000 human coding protein variants, which speeds up information retrieval, but it also allows calculation of custom mutant sets. Finally, SNPeffect provides features for meta-analysis of selected data sets, making it possible to analyze, for example, the proteostatic landscape of a given protein or protein family.
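For readers unfamiliar with the notation, the free-energy change upon mutation can be written out explicitly. The sign convention below is an assumption, since the text does not state it; it follows the usual FoldX convention, in which a positive ΔΔG indicates that the mutant is less stable than the wild type.

```latex
\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild-type}}
```

Here each ΔG is the predicted folding free energy of the corresponding protein, so ΔΔG > 0 would correspond to a destabilizing substitution under this convention.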
SNPeffect PIPELINE FOR MOLECULAR PHENOTYPING OF HUMAN PROTEIN VARIANTS
The raw data source of the SNPeffect database consists of the UniProt human variation database (http://www.uniprot.org/docs/humsavar), containing single amino acid polymorphisms, classified either as disease mutations, polymorphisms or yet unclassified mutations. SNPeffect predicts the impact of these variants on (i) protein aggregation and amyloid formation (TANGO and WALTZ, respectively), (ii) chaperone binding (LIMBO) and (iii) structural stability (FoldX). The availability of a crystal structure with a minimal resolution of 4 Å is required to accurately analyze the effect on protein stability with FoldX. If an exact structural match is not found, homologous structures with at least 90% sequence identity are considered as template structures to build a homology model of the original sequence with FoldX. The stability analysis is then applied to this model.
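The template criteria described above can be illustrated with a minimal Python sketch. It is not the SNPeffect implementation: the `Template` class, the function names and the tie-breaking rule (highest identity first, then best resolution) are assumptions made for illustration, and "minimal resolution of 4 Å" is interpreted here as a resolution value of 4 Å or lower.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Template:
    """Hypothetical record describing a candidate crystal structure."""
    name: str          # illustrative label, not a real PDB identifier
    identity: float    # percent sequence identity to the query (0-100)
    resolution: float  # crystallographic resolution in angstrom


def select_template(candidates: Sequence[Template],
                    min_identity: float = 90.0,
                    max_resolution: float = 4.0) -> Optional[Template]:
    """Return the best usable template, or None if no candidate qualifies."""
    usable = [t for t in candidates
              if t.identity >= min_identity and t.resolution <= max_resolution]
    if not usable:
        return None
    # Assumed preference: highest identity, then lowest (best) resolution.
    return max(usable, key=lambda t: (t.identity, -t.resolution))


def needs_homology_model(template: Template) -> bool:
    """An exact match (100% identity) is used directly; otherwise a model is built."""
    return template.identity < 100.0
```

With these defaults, a candidate at 95% identity and 2.0 Å would be preferred over one at 92% identity and 1.5 Å, and a structure at 85% identity would be rejected regardless of its resolution.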
Furthermore, SNPeffect holds annotations on functional sites, structural features, domain information, cellular processing and post-translational modifications for each variant.
The effect on functional sites and structural features is analyzed by investigating several properties of the position of the mutation. Data from the Catalytic Site Atlas is parsed to analyze whether the residue is part of the active site (19). Secondary structure information is generated by FoldX and transmembrane topology (extracellular, intracellular, transmembrane) is predicted by TMHMM (20). Domain information is provided by SMART (21) and PFAM (22). PSORT (23) provides a prediction on the sub-cellular localization. SNPeffect also maps changes in post-translational lipid anchor attachment and the peroxisomal targeting signal PTS1 (24). Lipid attachment predictions include myristoylation (25,26), farnesylation (26), GPI-anchor attachment (27) and type-1 and type-2 geranylgeranylation (26).
All entries are additionally linked to the OMIM genetic disorder database (Online Mendelian Inheritance in Man, OMIM. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), 2011. World Wide Web URL: http://omim.org/) and the Gene Ontology database (28).
SNPeffect DATABASE
SNPeffect currently contains data on 63 410 human non-synonymous SNVs. Automatic updates from the UniProt human variation database are scheduled every 6 months.
The database interface (Figure 1, left) allows users to search SNVs by filtering on molecular phenotypic effects, mutation type, disease, UniProt identifier, dbSNP identifier and gene name. Molecular phenotypic effects include changes in aggregation tendency (dTANGO), amyloid formation (dWALTZ), chaperone binding (dLIMBO) and structural stability change upon mutation (ddG). Applying the filter settings results in a set of variants that can be analyzed in a protein-centered or variant-centered view (Figure 1, right).
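The kind of filtering offered by the interface can be sketched in Python as follows. This is an illustrative sketch only: the `Variant` record, its field names and the threshold parameters are hypothetical and do not reflect the actual SNPeffect schema or its cut-off values.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record holding the four phenotypic effects."""
    uniprot_id: str
    mutation: str
    mutation_type: str     # "disease", "polymorphism" or "unclassified"
    d_tango: float         # change in aggregation tendency
    d_waltz: float         # change in amyloid propensity
    d_limbo: float         # change in chaperone binding
    ddg: Optional[float]   # stability change; None if no structure is available


def filter_variants(variants: Iterable[Variant],
                    mutation_type: Optional[str] = None,
                    min_d_tango: Optional[float] = None,
                    min_ddg: Optional[float] = None) -> List[Variant]:
    """Keep the variants that satisfy every filter that is set."""
    selected = []
    for v in variants:
        if mutation_type is not None and v.mutation_type != mutation_type:
            continue
        if min_d_tango is not None and v.d_tango < min_d_tango:
            continue
        # Variants without a structure cannot pass a stability filter.
        if min_ddg is not None and (v.ddg is None or v.ddg < min_ddg):
            continue
        selected.append(v)
    return selected
```

Filters left as `None` are ignored, so calling `filter_variants(variants)` without further arguments returns the full set.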
This SNPeffect update focuses primarily on the scientist user's ability to quickly retrieve and rapidly analyze the effect of protein variants. Moreover, the wild-type protein of each variant is also fully analyzed and directly linked from the variant webpage. The effects are visualized by self-explanatory barplots and histograms. Structural data is retrieved from the Protein Data Bank (PDB) (29). When an exact match to the wild-type sequence is not found, a homology model is built from a template structure that has at least 90% sequence identity to the original sequence. If structural information is retrieved, we offer visualization of both the wild-type and mutant residue environment in the protein structure. Additionally, every phenotypic analysis is accompanied by a graphical and textual comparison to the wild-type protein. Figure 1 illustrates the summary of a variant that meets the criteria set in the filter.
META-ANALYSIS
A new feature in SNPeffect 4.0 is the ability to analyze and plot phenotypic features of a specific subset (or all) of the SNPeffect database. The meta-analysis tool enables scientists to carry out large-scale data mining of the specified data and visualize the results in a graphical plot. The data set of variants is primarily chosen based on disease associations and the mutation type. Mutation types include disease, polymorphism and unclassified. An additional filter can be applied to limit the results by one or more disease terms that are selected from a list or specified by keywords. SNPeffect will then search for all variants of the selected type and retrieve those that are linked to the selected disease(s). For the disease type, these are solely the mutations annotated with that disease. For polymorphisms and unclassified variants, SNPeffect retrieves all such variants from proteins associated with the selected disease(s). Next, the two phenotypic effects that will be analyzed and plotted can be specified (dTANGO, dWALTZ, dLIMBO or ddG) (Figure 2). For example, one can create an aggregation/stability feature plot of a set of variants to correlate aggregation changes with stability changes. If the number of hits for one of the mutation types exceeds 500, the average Y is plotted for each X bin, to keep the plots clear and readable. The meta-analysis tool converts phenotypic features of a selected set of variants to comprehensible scatter plots, boxplots and frequency plots (Figure 2).
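The binning rule for large result sets can be sketched as follows. The 500-hit threshold comes from the text; the number of bins, the equal-width binning scheme and the function name `plot_points` are assumptions made for illustration and are not taken from the SNPeffect source.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def plot_points(xs: Sequence[float], ys: Sequence[float],
                max_points: int = 500,
                n_bins: int = 20) -> List[Tuple[float, float]]:
    """Return raw (x, y) pairs, or per-bin averages of y for large sets."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(xs) <= max_points:
        return list(zip(xs, ys))

    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all x are equal
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        # The maximum x would fall just outside the last bin, so clamp it.
        idx = min(int((x - lo) / width), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    # One point per non-empty bin: (bin centre, mean y in that bin).
    return [(lo + (i + 0.5) * width, sums[i] / counts[i]) for i in sorted(counts)]
```

For a dTANGO versus ddG plot, `xs` would hold one feature and `ys` the other; sets of 500 variants or fewer are returned unchanged, while larger sets are reduced to at most `n_bins` points.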
JOB SUBMISSION
New compared with previous versions of SNPeffect (30–32), the current version includes a data submission framework that allows users to submit custom single protein variants (human or non-human) for a detailed SNPeffect analysis including TANGO, WALTZ, LIMBO and FoldX. Possible input types are a UniProt ID, a FASTA sequence, a PDB ID or an uploaded PDB file. If only sequence information but no structural information is provided, SNPeffect will search the PDB for a matching structure to complete the stability analysis with FoldX. When an exact match is not found, a homology filter allows users to set the minimum percent sequence identity that a structural homolog must have to serve as a template for a homology model. The effect on structural stability is then determined by analyzing the homology model. Users receive an e-mail notification when the analysis has finished and can download the results from their SNPeffect account. The results include a PDF file with the complete phenotypic SNPeffect analysis. This file contains figures and extensive, life scientist-friendly text reports with a comparison to the wild-type protein. All separate figure files are also available and free to use.
SUMMARY
SNPeffect 4.0 offers a detailed and comprehensible molecular and structural phenotypic analysis of all known human protein variants. Major phenotypic features such as aggregation propensity prediction, stability analysis, structural features, post-translational modification and cellular localization are clearly visualized and explained for each variant. The meta-analysis tool allows plotting correlations between phenotypic effects for a specified set of variants. Custom protein variants can now be submitted for a detailed SNPeffect analysis, including automated structure modeling. SNPeffect 4.0 is available at http://snpeffect.switchlab.org.
FUNDING
Interuniversity Attraction Poles (IAP Network 6/43) of the Belgian Federal Science Policy Office (BelSPo) (VIB Switch laboratory); Flanders Institute for Science and Technology (IWT) (to G.D.B.); Fund for Scientific Research (FWO), Flanders (to J.V.D.); MICINN projects BIO2008-04212 and RD06/0020/1019 (RTICC, ISCIII) (Dopazo lab, partial); GVA-FEDER (PROMETEO/2010/001, partial). The CIBER de Enfermedades Raras is an initiative of the ISCIII, MICINN. Funding for open access charge: VIB.
Conflict of interest statement. None declared.
REFERENCES
- 1. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 2011;12:363–376. doi: 10.1038/nrg2958.
- 2. Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science. 1997;278:1580–1581. doi: 10.1126/science.278.5343.1580.
- 3. Giacomini KM, Brett CM, Altman RB, Benowitz NL, Dolan ME, Flockhart DA, Johnson JA, Hayes DF, Klein T, Krauss RM, et al. The pharmacogenetics research network: from SNP discovery to clinical drug response. Clin. Pharmacol. Ther. 2007;81:328–345. doi: 10.1038/sj.clpt.6100087.
- 4. Foot E, Kleyn D, Palmer Foster E. Pharmacogenetics–pivotal to the future of the biopharmaceutical industry. Drug Discov. Today. 2010;15:325–327. doi: 10.1016/j.drudis.2010.03.004.
- 5. Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011;27:1741–1748. doi: 10.1093/bioinformatics/btr295.
- 6. Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 2009;4:1073–1081. doi: 10.1038/nprot.2009.86.
- 7. Sunyaev S, Ramensky V, Bork P. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet. 2000;16:198–200. doi: 10.1016/s0168-9525(00)01988-0.
- 8. Venselaar H, Te Beek TA, Kuipers RK, Hekkelman ML, Vriend G. Protein structure analysis of mutations causing inheritable diseases. An e-Science approach with life scientist friendly interfaces. BMC Bioinformatics. 2010;11:548. doi: 10.1186/1471-2105-11-548.
- 9. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
- 10. Bromberg Y, Rost B. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res. 2007;35:3823–3835. doi: 10.1093/nar/gkm238.
- 11. Capriotti E, Fariselli P, Casadio R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 2005;33:W306–W310. doi: 10.1093/nar/gki375.
- 12. Hartl FU, Bracher A, Hayer-Hartl M. Molecular chaperones in protein folding and proteostasis. Nature. 2011;475:324–332. doi: 10.1038/nature10317.
- 13. Tokuriki N, Tawfik DS. Chaperonin overexpression promotes genetic variation and enzyme evolution. Nature. 2009;459:668–673. doi: 10.1038/nature08009.
- 14. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Res. 2005;33:W382–W388. doi: 10.1093/nar/gki387.
- 15. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat. Biotechnol. 2004;22:1302–1306. doi: 10.1038/nbt1012.
- 16. Maurer-Stroh S, Debulpaep M, Kuemmerer N, Lopez de la Paz M, Martins IC, Reumers J, Morris KL, Copland A, Serpell L, Serrano L, et al. Exploring the sequence determinants of amyloid structure using position-specific scoring matrices. Nat. Methods. 2010;7:237–242. doi: 10.1038/nmeth.1432.
- 17. Van Durme J, Maurer-Stroh S, Gallardo R, Wilkinson H, Rousseau F, Schymkowitz J. Accurate prediction of DnaK-peptide binding via homology modelling and experimental data. PLoS Comput. Biol. 2009;5:e1000475. doi: 10.1371/journal.pcbi.1000475.
- 18. Powers ET, Morimoto RI, Dillin A, Kelly JW, Balch WE. Biological and chemical approaches to diseases of proteostasis deficiency. Annu. Rev. Biochem. 2009;78:959–991. doi: 10.1146/annurev.biochem.052308.114844.
- 19. Torrance JW, Bartlett GJ, Porter CT, Thornton JM. Using a library of structural templates to recognise catalytic sites and explore their evolution in homologous families. J. Mol. Biol. 2005;347:565–581. doi: 10.1016/j.jmb.2005.01.044.
- 20. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 2001;305:567–580. doi: 10.1006/jmbi.2000.4315.
- 21. Letunic I, Doerks T, Bork P. SMART 6: recent updates and new developments. Nucleic Acids Res. 2009;37:D229–D232. doi: 10.1093/nar/gkn808.
- 22. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222. doi: 10.1093/nar/gkp985.
- 23. Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci. 1999;24:34–36. doi: 10.1016/s0968-0004(98)01336-x.
- 24. Neuberger G, Maurer-Stroh S, Eisenhaber B, Hartig A, Eisenhaber F. Prediction of peroxisomal targeting signal 1 containing proteins from amino acid sequence. J. Mol. Biol. 2003;328:581–592. doi: 10.1016/s0022-2836(03)00319-x.
- 25. Maurer-Stroh S, Eisenhaber B, Eisenhaber F. N-terminal N-myristoylation of proteins: prediction of substrate proteins from amino acid sequence. J. Mol. Biol. 2002;317:541–557. doi: 10.1006/jmbi.2002.5426.
- 26. Maurer-Stroh S, Eisenhaber F. Refinement and prediction of protein prenylation motifs. Genome Biol. 2005;6:R55. doi: 10.1186/gb-2005-6-6-r55.
- 27. Eisenhaber B, Bork P, Eisenhaber F. Prediction of potential GPI-modification sites in proprotein sequences. J. Mol. Biol. 1999;292:741–758. doi: 10.1006/jmbi.1999.3069.
- 28. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
- 29. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
- 30. Reumers J, Maurer-Stroh S, Schymkowitz J, Rousseau F. SNPeffect v2.0: a new step in investigating the molecular phenotypic effects of human non-synonymous SNPs. Bioinformatics. 2006;22:2183–2185. doi: 10.1093/bioinformatics/btl348.
- 31. Reumers J, Schymkowitz J, Ferkinghoff-Borg J, Stricher F, Serrano L, Rousseau F. SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs. Nucleic Acids Res. 2005;33:D527–D532. doi: 10.1093/nar/gki086.
- 32. Reumers J, Conde L, Medina I, Maurer-Stroh S, Van Durme J, Dopazo J, Rousseau F, Schymkowitz J. Joint annotation of coding and non-coding single nucleotide polymorphisms and mutations in the SNPeffect and PupaSuite databases. Nucleic Acids Res. 2008;36:D825–D829. doi: 10.1093/nar/gkm979.