ABSTRACT
Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.
INTRODUCTION
Genes can generate a wide range of mature RNA variants through the alternative splicing process (1,2). In fact, studies have revealed that virtually all multi-exon human genes (3,4) are capable of producing at least two RNA transcripts by alternative splicing. Alternative splicing events that occur within coding regions will produce alternative transcripts that have the potential to be translated into distinct gene products. Those alternative transcripts that are not picked up by the cellular surveillance machinery, such as nonsense-mediated decay (NMD; 5,6), non-stop decay (7) and no-go decay (8), may contribute to an increase in the complexity of the cell. Alternative gene products often have a surprising level of diversity (9) and can therefore have very different biological and cellular properties. This has led to the suggestion that the rearrangement of exons by alternative splicing may enrich the repertoire of cellular functions (10).
Genome annotation projects are producing gene sets saturated with alternative splicing variants (11). If alternative splicing does have the potential to expand the cellular functional repertoire in eukaryotic species, it would seem to be important to assign roles to these splicing variants. However, the sheer quantity of genomic data generated by these projects (12,13) presents serious challenges for functional annotation. At present, almost all the experimental information related to alternative coding variants has been generated for RNA transcripts rather than protein isoforms. Despite the fact that there is only piecemeal experimental data for the cellular role of alternative isoforms, it is possible to predict the likely biological effects of alternative splicing.
APPRIS has been developed within the GENCODE consortium (14) to annotate alternative gene products with reliable, biologically relevant data. GENCODE provides high-accuracy manual annotations of protein-coding loci and alternative variants as part of the ENCODE project (12,15). The GENCODE annotations are gradually replacing and augmenting the Ensembl (16) automatic annotations. As part of the GENCODE annotation process, APPRIS flags isoforms with likely altered structure, function or localization, and exons that are evolving unusually. The information from APPRIS is fed back to the manual annotators and has led to the annotation of new isoforms.
APPRIS annotates variants with protein structural and functional information, signal peptides and trans-membrane helices, conservation across related species, conservation of exonic structure and exon evolutionary rates. APPRIS currently annotates the GENCODE/Ensembl merge of the human genome.
The novel feature of APPRIS is that it selects a principal isoform for each gene based on the reliable annotations for protein structure, function and cross-species conservation. The principal isoform is the representative isoform of the gene, the isoform against which all other (alternative) isoforms should be compared. In APPRIS, the principal isoform is the isoform with the main cellular function, the isoform that is expressed in the majority of tissues or in most stages of development, or the isoform that is the most evolutionarily conserved. The process of selecting principal isoforms is illustrated with biologically relevant examples in the ‘APPRIS annotations’ section.
It is particularly difficult to automate the selection of a single representative for a gene, and all large-scale genomic analyses and databases such as Ensembl (16) and SwissProt (17) get round this problem by simply selecting the longest isoform as the main variant. Although this is a safe choice, and is often correct, we have shown that it is not always the best strategy—only ∼75% of the isoforms selected by this strategy are likely to be principal (18).
We performed an initial study on the feasibility of selecting principal isoforms using a number of prediction methods (18). The methods used to pinpoint principal functional isoforms were based on conservation and the characteristics of known proteins, principally structural and functional features. We determined a principal variant for 179 of the 215 human genes in the study, 83% of genes with multiple alternative variants. Where the principal variants selected in the study differed from the SwissProt display sequences, we found annotation evidence from cross-species alignments that supported our selection over the SwissProt display sequence.
Based on this initial study, we developed APPRIS. APPRIS is made up of eight separate annotation modules, each with a specific role. For example, firestar (19) predicts the presence of individual functionally important residues in splice variants, Matador3D predicts the effect of splicing events on 3D structure and INERTIA detects exons that are undergoing unusual evolution.
THE DATABASE
APPRIS was developed using version 3c of the GENCODE annotation (Ensembl 56), which was the initial Ensembl/GENCODE merge, and currently runs GENCODE version 7 (Ensembl 62). Between GENCODE 3c and 7, the annotation was cleaned of ∼2000 genes (mostly automatic annotations removed by Ensembl), while more than 10 000 new annotated coding variants were added. GENCODE release 7 recognizes 20 687 protein-coding genes and 84 408 distinct coding transcripts. APPRIS is updated with each new stable GENCODE release and is currently being updated to GENCODE version 12.
The APPRIS system (Figure 1) is composed of eight separate annotation modules. These eight modules do not comprise an exhaustive list of all possible protein features. Instead, the methods used in APPRIS were chosen for their ability to select principal isoforms. Each method either detects the absence of highly conserved protein features (as highly conserved protein features are extremely unlikely to have arisen by chance, we can discard variants that lack these features) or calculates cross-species conservation (the more conserved an exon/transcript, the more likely that it represents the principal variant). All of the computational methods behind the modules have been published previously. What is new is that the methods have either been combined in novel ways or have been adjusted especially for APPRIS, and the output of all the methods has been tuned to keep false-positive predictions to a minimum, albeit at the expense of coverage.
The eight modules are as follows (see Supplementary Data for further details). Matador3D checks for the presence of structural homologs in the PDB (20) and tests the integrity of the 3D structure; firestar (19) makes highly reliable predictions of conserved functionally important amino acid residues; SPADE uses the program Pfamscan (21) to count conserved and compromised Pfam functional domains; INERTIA uses three alignment methods (22–24) to generate cross-species alignments, from which SLR (25) identifies exons with unusual evolutionary rates; CRASH makes conservative predictions of signal peptides using the SignalP and TargetP programs (26); THUMP generates conservative predictions of trans-membrane helices from three separate trans-membrane predictors (27–29); CExonic uses exonerate (30) to align mouse and human transcripts and looks for patterns of conservation in exonic structure; and CORSAIR uses BLAST (31) to map vertebrate orthologs to each variant and counts the number of orthologs that align correctly and without gaps. All of these methods are available as web services.
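As an illustration of the kind of conservative counting that CORSAIR is described as performing, the following minimal Python sketch counts the species for which an ortholog aligns to a variant without gaps. It is not the CORSAIR implementation: the `Alignment` record, the function names and the 50% identity cut-off used to stand in for "aligns correctly" are all assumptions made for this example, and the sequences are toy data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One pairwise alignment between a variant and an ortholog (hypothetical record)."""
    species: str
    query: str    # aligned variant sequence, '-' marks a gap
    subject: str  # aligned ortholog sequence, '-' marks a gap


def is_gapless(aln: Alignment) -> bool:
    """True when neither aligned sequence contains a gap character."""
    return "-" not in aln.query and "-" not in aln.subject


def identity(aln: Alignment) -> float:
    """Fraction of alignment columns in which both residues are identical."""
    pairs = list(zip(aln.query, aln.subject))
    if not pairs:
        return 0.0
    matches = sum(1 for q, s in pairs if q == s and q != "-")
    return matches / len(pairs)


def count_supporting_species(alignments, min_identity=0.5):
    """Count distinct species with at least one gapless, sufficiently identical alignment.

    The identity threshold is an assumed stand-in for 'aligns correctly'.
    """
    supporting = {
        aln.species
        for aln in alignments
        if is_gapless(aln) and identity(aln) >= min_identity
    }
    return len(supporting)


toy_alignments = [
    Alignment("mouse", "MKTAY", "MKTAY"),      # gapless, identical
    Alignment("chicken", "MK-AY", "MKTAY"),    # contains a gap, not counted
    Alignment("zebrafish", "MKTAY", "MRTAY"),  # gapless, 4 of 5 columns identical
    Alignment("mouse", "MKTAY", "MKTAF"),      # second mouse hit, species counted once
]
print(count_supporting_species(toy_alignments))
```

Counting species rather than individual hits keeps a variant from being favoured merely because one species has many database entries; whether CORSAIR makes the same choice is not stated in the text.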
Annotation and selection of principal isoforms
In addition to annotating alternative isoforms with biological data, APPRIS selects a principal isoform from among the isoforms annotated for each gene. The selection of a principal isoform by APPRIS is based on two principles. The first principle is that there is often one isoform that performs the main cellular function or that is expressed in the majority of tissues or in most stages of development, and that the rest of the annotated isoforms are alternatively spliced isoforms that may perform distinct roles. Proteomics evidence suggests that this holds true for many genes (32–37), although there are genes for which it is more difficult to define a principal isoform precisely because there are a number of isoforms that might be regarded as equally important for cellular function.
The second principle is that the principal isoform should have more evolutionary history. The principal isoform ought to be the variant that is most conserved across related species. It has been shown that alternative exons tend to have evolved more recently (38), so this is a reasonable assumption.
The methods that make up APPRIS detect unusual, missing or non-conserved features and will flag these transcripts as alternative. The selection of a principal isoform is based on a jury of the eight methods that make up the pipeline. The isoform selected as principal will be either the variant that has the most conserved protein features (since it is much more likely that alternative isoforms have lost rather than gained protein features such as 3D structure and function) or that has more evidence of cross-species conservation, or, most frequently, both. Four methods (SPADE, CORSAIR, Matador3D and firestar) make up the core of the jury system, with the other methods becoming more important in cases where these four methods are not able to make a decision.
It should be noted that GENCODE coding transcripts are not all treated equally. First, transcripts with identical CDS (in other words those that undergo alternative splicing only in the 3′- or 5′-UTR) are regarded as identical for the purposes of selecting a principal protein isoform in APPRIS. Second, transcripts that are annotated as NMD targets are annotated with protein features, but cannot be selected as the principal variant by APPRIS. The same is also true of all transcripts annotated as fragments.
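The two rules in the preceding paragraph can be sketched in a few lines of Python. This is an illustrative sketch only, not APPRIS code: the `Transcript` record, its field names and the toy identifiers are hypothetical. Transcripts flagged as NMD targets or fragments are skipped, and the remaining transcripts are grouped by coding sequence so that UTR-only variants collapse into a single candidate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript record used only for this example."""
    transcript_id: str
    cds: str                  # protein sequence encoded by the CDS
    is_nmd: bool = False      # annotated as a nonsense-mediated decay target
    is_fragment: bool = False


def group_candidates(transcripts):
    """Group eligible transcripts by identical CDS.

    NMD targets and fragments are still annotated elsewhere, but they are
    excluded here because they cannot be selected as the principal variant.
    """
    groups = defaultdict(list)
    for t in transcripts:
        if t.is_nmd or t.is_fragment:
            continue
        groups[t.cds].append(t.transcript_id)
    return dict(groups)


toy_gene = [
    Transcript("T1", "MKT"),
    Transcript("T2", "MKT"),                 # differs from T1 only in the UTR
    Transcript("T3", "MKTAA", is_nmd=True),
    Transcript("T4", "MK", is_fragment=True),
    Transcript("T5", "MKTV"),
]
print(group_candidates(toy_gene))
```

Grouping by CDS also explains why the number of transcripts labelled principal can exceed the number of genes with a principal isoform: every transcript in the winning group shares the same protein product.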
For the few cases where the methods in APPRIS tag all the variants as alternative (in most cases these are genes with ‘read-through’ transcripts), the gene is brought to the attention of the GENCODE manual annotators.
A list of principal isoforms selected by APPRIS for each version of GENCODE/Ensembl is available. In the few cases where APPRIS is not able to determine the main isoform, the variant with the longest protein sequence is selected from among those isoforms not rejected by APPRIS.
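The selection procedure described in this section can be sketched as a simple jury. The sketch below is an assumption-laden illustration rather than the published scoring scheme: it assumes each isoform carries a set of module names that flagged it as alternative, it treats "fewest flags" as the voting rule, and the function names and toy data are hypothetical. What it does take from the text is the ordering of the decision: the four core methods are consulted first, the remaining methods are used only when the core cannot decide, and the longest remaining sequence is the final fallback.

```python
CORE_METHODS = frozenset({"SPADE", "CORSAIR", "Matador3D", "firestar"})
OTHER_METHODS = frozenset({"INERTIA", "CRASH", "THUMP", "CExonic"})


def least_flagged(pool, isoforms, methods):
    """Keep the isoforms in `pool` flagged by the fewest of the given methods."""
    counts = {iso: len(isoforms[iso]["flags"] & methods) for iso in pool}
    fewest = min(counts.values())
    return [iso for iso in pool if counts[iso] == fewest]


def select_principal(isoforms):
    """Return (isoform_id, deciding_stage) for one gene.

    `isoforms` maps an isoform id to {"length": int, "flags": set of method
    names that flagged the isoform as alternative} (an assumed data layout).
    """
    if not isoforms:
        raise ValueError("gene has no candidate isoforms")
    pool = least_flagged(list(isoforms), isoforms, CORE_METHODS)
    if len(pool) == 1:
        return pool[0], "core"
    pool = least_flagged(pool, isoforms, OTHER_METHODS)
    if len(pool) == 1:
        return pool[0], "all methods"
    # Fallback: longest protein sequence among the isoforms still in the pool.
    return max(pool, key=lambda iso: isoforms[iso]["length"]), "longest"


toy_gene = {
    "iso1": {"length": 520, "flags": {"SPADE", "Matador3D"}},
    "iso2": {"length": 504, "flags": set()},
    "iso3": {"length": 498, "flags": {"THUMP"}},
}
print(select_principal(toy_gene))
```

In this toy gene the longest isoform (iso1) is rejected by two core methods, so the jury and the longest-isoform strategy disagree, which is the situation the principal-isoform approach is meant to resolve.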
APPRIS will be updated with the Ensembl/GENCODE database and will be extended to cover mouse gene models in the near future as the GENCODE annotators focus on mouse models, although in theory APPRIS could be extended to incorporate data from any well-annotated eukaryotic genome.
System architecture and user notes
The APPRIS web site allows the user to search genes and transcripts and displays six panels of annotations. The first panel shows all the GENCODE/Ensembl coding variants and highlights the main functional variant. The second panel shows the APPRIS annotations in detail and includes information such as the number of functional residues detected by firestar, the Matador3D homologous structure score or the number of vertebrate species that align in CORSAIR. The next two panels map the APPRIS annotations onto the amino acid sequences of all coding variants and make the annotations visible in the genome regions provided by the UCSC Genome Browser (39). Finally, there are panels that allow the user to compare and contrast proteomic (37,40) and RNAseq (4) evidence tracks against the APPRIS annotation tracks in the UCSC Browser (see Supplementary Data for more details).
APPRIS has been designed to be portable, modular and flexible and it can be accessed as web services. These services can retrieve the results of the execution of APPRIS methods and other useful information for genes/transcripts. Plain text, JSON/GTF format or BED format (which facilitates the visualization of annotation tracks across genomic regions) outputs are available. In addition, APPRIS supports the downloading of data through the highly customizable BioMart (41) data mining tool.
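To make the role of the BED output concrete, the following minimal Python sketch formats one annotated region as a six-column BED line. It is not the APPRIS exporter: the function name, the feature name and the coordinates are hypothetical. The one convention it relies on is standard: BED intervals are 0-based and half-open, whereas GTF coordinates are 1-based and inclusive, so only the start coordinate shifts.

```python
def to_bed6(chrom, start_1based, end_1based, name, score=0, strand="+"):
    """Format a 1-based inclusive interval (GTF-style) as a BED6 line.

    BED uses 0-based, half-open coordinates, so the start is decremented
    by one and the end is left unchanged.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("invalid interval")
    if not 0 <= score <= 1000:
        raise ValueError("BED score must be between 0 and 1000")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [chrom, str(start_1based - 1), str(end_1based), name, str(score), strand]
    return "\t".join(fields)


# Hypothetical feature; the coordinates are illustrative only.
print(to_bed6("chr3", 100, 200, "functional_residue", score=1000, strand="+"))
```

A line in this form can be loaded as a custom track in a genome browser, which is the use the text describes for the BED output.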
APPRIS uses a MySQL relational database to store the information that can be downloaded from the APPRIS web site. A comprehensive set of application programming interfaces (APIs) serves as a middle layer between the underlying database schema and the applications that use it. The APIs encapsulate the database layout by providing efficient high-level access to data tables and isolate applications from data layout changes.
APPRIS ANNOTATIONS
APPRIS identifies a principal isoform for the majority of human genes. APPRIS determines a principal isoform for 17 731 (85.7%) of the 20 687 protein-coding genes in the GENCODE 7 (Ensembl 62) release (Supplementary Figure S4). A total of 53 307 variants were tagged as alternative by the methods in APPRIS and 22 799 transcripts were identified as principal isoforms (the discrepancy between genes and transcripts is because many transcripts are only alternatively spliced in the 3′- or 5′-regions and are regarded as identical by APPRIS because they have identical coding sequences).
Many of the isoforms recognized by APPRIS as alternative are likely to have substantial changes to their structure and function. A total of 37 550 alternative splice variants (70.4% of the variants tagged as alternative) would lose important functional or structural information relative to the principal isoform. The conservative estimates from the APPRIS methods show that 15 087 variants (28.3%) would lose important functional residues, 31 169 alternative gene products (58.5%) would have damaged or lost Pfam functional domains and 26 955 alternative isoforms (50.6%) would lose a substantial part of their 3D structure.
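The percentages quoted above follow directly from the counts given in the text. The short Python check below recomputes them; the dictionary keys are labels chosen for this example, while every number is taken from the text.

```python
PROTEIN_CODING_GENES = 20_687
GENES_WITH_PRINCIPAL = 17_731
TAGGED_ALTERNATIVE = 53_307

# Counts of alternative variants predicted to lose each kind of feature.
LOSSES = {
    "any functional or structural information": 37_550,
    "functional residues (firestar)": 15_087,
    "Pfam domains (SPADE)": 31_169,
    "3D structure (Matador3D)": 26_955,
}


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print("genes with a principal isoform:", percent(GENES_WITH_PRINCIPAL, PROTEIN_CODING_GENES))
for label, count in LOSSES.items():
    print(label + ":", percent(count, TAGGED_ALTERNATIVE))
```

The categories overlap (one variant can lose both a Pfam domain and part of its 3D structure), which is why the individual percentages sum to more than the 70.4% overall figure.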
More than 8175 of the annotated transcripts would lose at least one trans-membrane helix and 543 would have lost a signal sequence. Almost 50 000 alternative splice variants (49 899, 94.5% of variants tagged as alternative) were less conserved across related species than the principal variant (from the results of CORSAIR, CExonic or INERTIA).
The CCDS project (42) is identifying consistently annotated, high-quality protein-coding variants for the human genome. CCDS variants are annotated only when there is agreement between the three main public annotation resources (GENCODE/Ensembl, NCBI and UCSC). Although the CCDS project can annotate any number of variants for a gene, many genes have a single CCDS variant, a variant agreed upon by all annotation resources. A single CCDS variant is the closest thing to an APPRIS principal variant; therefore, we should expect to see high agreement between the APPRIS principal isoforms and the CCDS variants.
For those genes that have multiple isoforms and a single CCDS variant, APPRIS is in agreement with the CCDS variant 93.5% of the time. What is more, this rises to over 96% for the core modules. This compares to an agreement of just 79.2% for the strategy of selecting the longest isoform (see Supplementary Table S1).
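The comparison described above amounts to an agreement rate over a restricted set of genes. The following Python sketch shows one way such a rate could be computed; it is not the analysis code behind Supplementary Table S1, and the function name, argument layout and toy data are assumptions. Only genes with more than one isoform and exactly one CCDS variant are counted, as in the text.

```python
def agreement_rate(selected, ccds, isoform_counts):
    """Fraction of eligible genes whose selected isoform is the single CCDS variant.

    selected:       gene -> isoform chosen by the strategy under test
    ccds:           gene -> set of CCDS-annotated isoforms
    isoform_counts: gene -> number of distinct isoforms annotated for the gene
    """
    eligible = [
        gene
        for gene in selected
        if isoform_counts.get(gene, 0) > 1 and len(ccds.get(gene, ())) == 1
    ]
    if not eligible:
        raise ValueError("no genes with multiple isoforms and a single CCDS variant")
    agreeing = sum(1 for gene in eligible if selected[gene] in ccds[gene])
    return agreeing / len(eligible)


# Toy data: G3 has two CCDS variants and G4 has a single isoform, so both are excluded.
toy_selected = {"G1": "G1-001", "G2": "G2-002", "G3": "G3-001", "G4": "G4-001"}
toy_ccds = {"G1": {"G1-001"}, "G2": {"G2-001"}, "G3": {"G3-001", "G3-002"}, "G4": {"G4-001"}}
toy_counts = {"G1": 3, "G2": 2, "G3": 4, "G4": 1}
print(agreement_rate(toy_selected, toy_ccds, toy_counts))
```

Running the same function with the longest isoform of each gene as `selected` would give the baseline figure against which the principal-isoform selections are compared.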
Two examples (of many) serve to illustrate the utility of APPRIS in the selection of principal isoforms. In the first example (gene DNAJC5G), APPRIS disagrees with the CCDS annotation by selecting an isoform that is 16 residues shorter than the pair of isoforms with identical protein sequences that is chosen as the single CCDS variant, as the Ensembl reference sequence and as the SwissProt display sequence (Figure 2A). The variant selected by APPRIS (DNAJC5G-004) has a better score in Matador3D (it maps better to the known 3D structures in the PDB) and has a conserved Pfam domain. In contrast, the longer sequences would have broken Pfam domains and 3D structure (Figure 2B). The extra exon in the CCDS variant generates a 16-residue insertion that would be likely to disrupt a 3D structure (Figure 2C) and a conserved Pfam domain (Figure 2D).
The second example concerns the TP63 gene. There are two well-studied isoforms for this gene, TA-alpha (43) and deltaN (44). They are generated from different translation start sites and have different N-termini (Figure 3A). However, rather than select one of these two, APPRIS gives the best score to variant TP63-013, a 582-amino acid protein. Although this result might be surprising at first glance, it is perfectly logical.
TP63 is annotated with 15 coding variants in GENCODE 7, and all but 4 (TA-alpha, deltaN, P63delta and TP63-013) are rejected as potential principal isoforms by the SPADE (Pfam domains) and Matador3D (3D structure) modules in APPRIS. P63delta and TP63-013 are generated from TA-alpha and deltaN, respectively, by a known GYNGYN splicing event (45) that results in a swap of five amino acids ‘GTKRP’ for a single alanine in a non-critical region of the protein (Figure 3A). CORSAIR (vertebrate sequence database information) separates these four variants based on alignments with isoforms from other species (Figure 3B). It turns out that there is more ortholog evidence for deltaN and its GYNGYN variant TP63-013 than there is for TA-alpha and P63delta (the deltaN splice event is conserved as far back as chicken and Danio). There is also more CAGE data support for the deltaN/TP63-013 translation start site (A. Frankish, personal communication).
In addition, CORSAIR selects TP63-013 ahead of deltaN because Danio isoforms in the sequence databases have the single alanine instead of the GTKRP motif. In fact, the 3D structure of this region of the protein has also been solved for the isoform with the single alanine, adding weight to the APPRIS selection.
These examples neatly demonstrate the process behind APPRIS and reinforce the idea that it is possible to designate a principal isoform based on protein features and evolutionary antiquity, even where two isoforms (or more as in the case of TP63) have clearly defined functional roles.
DISCUSSION
APPRIS deploys a range of computational methods to annotate alternative isoforms with protein structural and functional information and to evaluate cross-species conservation. The database provides reliable functional annotations for the most recent version of the manual annotation of the human genome. The APPRIS annotations will allow genome annotation groups and individual researchers to track the effect of alternative splicing events on individual splice isoforms.
There are already a number of databases that can annotate alternative transcripts with some of these features (46–48). What sets APPRIS apart from all these databases is that APPRIS provides high-quality annotations that are being used in the annotation of the human genome and that these annotations are used to select a principal isoform for each gene. Principal isoforms are selected based on evolutionary evidence in the form of conserved functional and structural motifs and cross-species conservation. The success of APPRIS is due to the observation that most alternative isoforms lack regions of conserved structure or function, or have exons that are evolving at measurably different rates compared with their principal counterparts. The APPRIS database has been able to identify a principal isoform for the majority of human genes (85%).
APPRIS is the first database to include principal isoforms on a genome-wide scale. Previously all databases and large-scale studies have had to resort to selecting the longest annotated isoform as the reference variant. We have shown that this conservative solution is not ideal (18). The lack of reliably identified principal isoforms in annotation databases is an omission that is only going to become more glaring with time as the numbers of annotated variants in the sequence databases grow.
At present, most computational methods (49) and databases (21) are based on the assumption that a single isoform represents each gene. The SwissProt database, for example, combines all variants of the same gene in a single entry. These entries include experimental data and predictions, which are widely referenced from a number of external sources. One of the sequences in each entry, almost always the longest, is designated as the display sequence and the remaining sequences are included as splice variants. External databases and methods that use SwissProt as their standard often ignore these alternative sequences. If databases are going to condense gene products from the same gene into a single entry for technical reasons, it is better that the sequence that ‘represents’ the gene is the APPRIS principal isoform.
APPRIS principal isoforms have a wide range of uses and are applicable in all fields of research. Determining a principal isoform is important for research groups studying individual genes, since it is vitally important for designing experimental work. Researchers need to be able to work with the isoform that is most likely to have major functional activity, and this is not always clear for all genes. The designation of a single variant as the principal isoform is a critical first step for any genome analysis. For example, studies of cancer mutations (50) would be able to use APPRIS data to determine whether the mutations are in principal or alternative exons, and proteomics studies could use APPRIS data to decide whether a peptide would be generated from an alternative or principal exon. Since automatic prediction methods rely on the quality of input data, starting from the principal isoform should allow groups to perform more reliable studies. Finally, the selection of a principal isoform also serves as a starting point for investigations into the functional potential of alternative isoforms. These are just a few examples; the potential for the use of the APPRIS data in research is huge.
APPRIS is currently being used to annotate protein-coding genes by annotators in the GENCODE consortium (14) and the CCDS project (42). Annotations based on APPRIS data are already percolating to many users through these databases. We hope that the APPRIS principal isoforms will become accepted as the standard reference sequence for each gene. We believe that the principal isoforms identified by APPRIS are a significant advance on the current practice of selecting the longest variant as the reference isoform and that they should be used in all automatic genome-wide protocols and large-scale analyses.
The APPRIS annotations and the list of principal isoforms are accessible to all and are available for download in a range of formats.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online: Supplementary Table 1, Supplementary Figures 1–4, Supplementary Methods, Supplementary Results and Supplementary References [51–57].
FUNDING
The Spanish National Institute of Bioinformatics (www.inab.org), a project of the ‘Instituto de Salud Carlos III’; the Spanish Ministry of Science and Innovation [BIO2007-666855]; the ENCODE Project [U54 HG0004555]; Blueprint [282510]. Funding for open access charge: Spanish Ministry of Science and Innovation [BIO2007-666855].
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
The authors would like to thank Angel Carro and Edu Leon for invaluable technical assistance.
REFERENCES
- 1.Gilbert W. The exon theory of genes. Cold Spring Harb. Symp. Quant. Biol. 1987;52:901–905. doi: 10.1101/sqb.1987.052.01.098. [DOI] [PubMed] [Google Scholar]
- 2.Black DL. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell. 2000;103:367–370. doi: 10.1016/s0092-8674(00)00128-8. [DOI] [PubMed] [Google Scholar]
- 3.Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008;40:1413–1415. doi: 10.1038/ng.259. [DOI] [PubMed] [Google Scholar]
- 4.Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. doi: 10.1038/nature07509. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Nicholson P, Mühlemann O. Cutting the nonsense: the degradation of PTC-containing mRNAs. Biochem. Soc. Trans. 2010;38:1615–1620. doi: 10.1042/BST0381615. [DOI] [PubMed] [Google Scholar]
- 6.Weischenfeldt J, Waage J, Tian G, Zhao J, Damgaard I, Jakobsen JS, Kristiansen K, Krogh A, Wang J, Porse BT. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol. 2012;13:R35. doi: 10.1186/gb-2012-13-5-r35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Vasudevan S, Peltz SW, Wilusz CJ. Non-stop decay—a new mRNA surveillance pathway. Bioessays. 2002;24:785–788. doi: 10.1002/bies.10153. [DOI] [PubMed] [Google Scholar]
- 8.Harigaya Y, Parker R. No-go decay: a quality control mechanism for RNA in translation. Wiley Interdiscip. Rev. RNA. 2010;1:132–141. doi: 10.1002/wrna.17. [DOI] [PubMed] [Google Scholar]
- 9.Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink J-J, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, et al. The implications of alternative splicing in the ENCODE protein complement. Proc. Natl Acad. Sci. USA. 2007;104:5495–5500. doi: 10.1073/pnas.0700800104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Smith CW, Valcárcel J. Alternative pre-mRNA splicing: the logic of combinatorial control. Trends Biochem. Sci. 2000;25:381–388. doi: 10.1016/s0968-0004(00)01604-2. [DOI] [PubMed] [Google Scholar]
- 11.Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T, Howald C, Reymond A, Guigó R, Hubbard T, Harrow J. The origins, evolution, and functional potential of alternative splicing in vertebrates. Mol. Biol. Evol. 2011;28:2949–2959. doi: 10.1093/molbev/msr127. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Frankish A, Mudge JM, Thomas M, Harrow J. The importance of identifying alternative splicing in vertebrate genome annotation. Database. 2012;2012:bas014. doi: 10.1093/database/bas014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: the reference annotation for the ENCODE Project. Genome Res. 2012;22:1775–1789. doi: 10.1101/gr.135350.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.ENCODE Project Consortium, Bernstein,B.E., Birney,E., Dunham,I., Green,E.D., Gunter,C. and Snyder,M. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, et al. Ensembl 2012. Nucleic Acids Res. 2012;40:D84–D90. doi: 10.1093/nar/gkr991. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.UniProt Consortium. Reorganizing the protein space at the Universal Protein Resource. Nucleic Acids Res. 2012;40:D71–D75. doi: 10.1093/nar/gkr981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Tress ML, Wesselink J-J, Frankish A, López G, Goldman N, Löytynoja A, Massinghamm T, Pardi F, Whelan S, Harrow J, et al. Determination and validation of principal gene products. Bioinformatics. 2008;24:11–17. doi: 10.1093/bioinformatics/btm547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Lopez G, Maietta P, Rodriguez J-M, Valencia A, Tress ML. firestar—advances in the prediction of functionally important residues. Nucleic Acids Res. 2011;39:W235–W241. doi: 10.1093/nar/gkr437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prlic A, Quesada M, Quinn GB, Westbrook JD. The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 2011;39:392–401. doi: 10.1093/nar/gkq1021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290–D301. doi: 10.1093/nar/gkr1065. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005;169:1753–1762. doi: 10.1534/genetics.104.032144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. doi: 10.1101/gr.1933104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Lassmannm T, Sonnhammer EL. Kalign—an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics. 2005;6:298. doi: 10.1186/1471-2105-6-298. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Löytynoja A, Goldman N. An algorithm for progressive multiple alignment of sequences with insertions. Proc. Natl Acad. Sci. USA. 2005;102:10557–10562. doi: 10.1073/pnas.0409137102.
- 26.Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2007;2:953–971. doi: 10.1038/nprot.2007.131.
- 27.Jones DT. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007;23:538–544. doi: 10.1093/bioinformatics/btl677.
- 28.Käll L, Krogh A, Sonnhammer EL. A combined transmembrane topology and signal peptide prediction method. J. Mol. Biol. 2004;338:1027–1036. doi: 10.1016/j.jmb.2004.03.016.
- 29.Viklund H, Elofsson A. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci. 2004;13:1908–1917. doi: 10.1110/ps.04625404.
- 30.Slater G, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31.
- 31.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
- 32.Tress ML, Bodenmiller B, Aebersold R, Valencia A. Proteomics studies confirm the presence of alternative protein isoforms on a large scale. Genome Biol. 2008;9:R162. doi: 10.1186/gb-2008-9-11-r162.
- 33.Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. Discovery and revision of Arabidopsis genes by proteogenomics. Proc. Natl Acad. Sci. USA. 2008;105:21034–21038. doi: 10.1073/pnas.0811066106.
- 34.Chang K-Y, Georgianna DR, Heber S, Payne GA, Muddiman DC. Detection of alternative splice variants at the proteome level in Aspergillus flavus. J. Proteome Res. 2010;9:1209–1217. doi: 10.1021/pr900602d.
- 35.Severing E, van Dijk A, van Ham R. Assessing the contribution of alternative splicing to proteome diversity in Arabidopsis thaliana using proteomics data. BMC Plant Biol. 2011;11:82. doi: 10.1186/1471-2229-11-82.
- 36.Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS, et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and “resurrected” pseudogenes in the mouse genome. Genome Res. 2011;21:756–767. doi: 10.1101/gr.114272.110.
- 37.Ezkurdia I, Del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A, Tress ML. Comparative proteomics reveals a significant bias towards alternative protein isoforms with conserved structure and function. Mol. Biol. Evol. 2012;29:2265–2283. doi: 10.1093/molbev/mss100.
- 38.Alekseyenko AV, Kim N, Lee CJ. Global analysis of exon creation versus loss and the role of alternative splicing in 17 vertebrate genomes. RNA. 2007;13:661–670. doi: 10.1261/rna.325107.
- 39.Dreszer TR, Karolchik D, Zweig AS, Hinrichs AS, Raney BJ, Kuhn RM, Meyer LR, Wong M, Sloan CA, Rosenbloom KR, et al. The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res. 2012;40:D918–D923. doi: 10.1093/nar/gkr1055.
- 40.Desiere F, Deutsch EW, Nesvizhskii AI, Mallick P, King NL, Eng JK, Aderem A, Boyle R, Brunner E, Donohoe S, et al. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 2005;6:R9. doi: 10.1186/gb-2004-6-1-r9.
- 41.Kasprzyk A. BioMart: driving a paradigm change in biological data management. Database. 2011;2011:bar049. doi: 10.1093/database/bar049.
- 42.Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, et al. The consensus coding sequence (CCDS) project: identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009;19:1316–1323. doi: 10.1101/gr.080531.108.
- 43.Rouleau M, Medawar A, Hamon L, Shivtiel S, Wolchinsky Z, Zhou H, De Rosa L, Candi E, de la Forest Divonne S, Mikkola ML, et al. TAp63 is important for cardiac differentiation of embryonic stem cells and heart development. Stem Cells. 2011;29:1612–1683. doi: 10.1002/stem.723.
- 44.Crum CP, McKeon FD. p63 in epithelial survival, germ cell surveillance, and neoplasia. Annu. Rev. Pathol. 2010;5:349–371. doi: 10.1146/annurev-pathol-121808-102117.
- 45.Sinha R, Lenser T, Jahn N, Gausmann U, Friedel S, Szafranski K, Huse K, Rosenstiel P, Hampe J, Schuster S, et al. TassDB2—a comprehensive database of subtle alternative splicing events. BMC Bioinformatics. 2010;11:216. doi: 10.1186/1471-2105-11-216.
- 46.Birzele F, Küffner R, Meier F, Oefinger F, Potthast C, Zimmer R. ProSAS: a database for analyzing alternative splicing in the context of protein structures. Nucleic Acids Res. 2008;36:D63–D68. doi: 10.1093/nar/gkm793.
- 47.Shionyu M, Yamaguchi A, Shinoda K, Takahashi K, Go M. AS-ALPS: a database for analyzing the effects of alternative splicing on protein structure, interaction and network in human and mouse. Nucleic Acids Res. 2009;37:D305–D309. doi: 10.1093/nar/gkn869.
- 48.Martelli PL, D'Antonio M, Bonizzoni P, Castrignanò T, D'Erchia AM, D'Onorio De Meo P, Fariselli P, Finelli M, Licciulli F, Mangiulli M, et al. ASPicDB: a database of annotated transcript and protein variants generated by alternative splicing. Nucleic Acids Res. 2011;39:D80–D85. doi: 10.1093/nar/gkq1073.
- 49.Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Hériché JK, Hu Y, Kristiansen K, Li R, et al. TreeFam: 2008 Update. Nucleic Acids Res. 2008;36:D735–D740. doi: 10.1093/nar/gkm1005.
- 50.Quesada V, Conde L, Villamor N, Ordóñez GR, Jares P, Bassaganyas L, Ramsay AJ, Beà S, Pinyol M, Martínez-Trillos A, et al. Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nat. Genet. 2011;44:47–52. doi: 10.1038/ng.1032.
- 51.López G, Valencia A, Tress ML. firestar—prediction of functionally important residues using structural templates and alignment reliability. Nucleic Acids Res. 2007;35:W573–W577. doi: 10.1093/nar/gkm297.
- 52.Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
- 53.Castelo R, Reymond A, Wyss C, Câmara F, Parra G, Antonarakis SE, Guigó R, Eyras E. Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes. Nucleic Acids Res. 2005;33:1935–1939. doi: 10.1093/nar/gki328.
- 54.Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM–HMM alignment. Nat. Methods. 2011;9:173–175. doi: 10.1038/nmeth.1818.
- 55.López G, Valencia A, Tress ML. FireDB—a database of functionally important residues from proteins of known structure. Nucleic Acids Res. 2007;35:D217–D223. doi: 10.1093/nar/gkl897.
- 56.Tress ML, Graña O, Valencia A. SQUARE—determining reliable regions in sequence alignments. Bioinformatics. 2004;20:974–975. doi: 10.1093/bioinformatics/bth032.
- 57.Grishin NV. Fold change in evolution of protein structures. J. Struct. Biol. 2001;134:167–185. doi: 10.1006/jsbi.2001.4335.
- 58.Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R. The PeptideAtlas project. Nucleic Acids Res. 2006;34:D655–D658. doi: 10.1093/nar/gkj040.