Briefings in Bioinformatics. 2016 Jan 6;17(6):1024–1031. doi: 10.1093/bib/bbv109

A proteogenomic approach to understand splice isoform functions through sequence and expression-based computational modeling

Hong-Dong Li, Gilbert S Omenn, Yuanfang Guan
PMCID: PMC5142014  PMID: 26740460

Abstract

The products of multi-exon genes are a mixture of alternatively spliced isoforms, from which the translated proteins can have similar, different or even opposing functions. It is therefore essential to differentiate and annotate functions for individual isoforms. Computational approaches provide an efficient complement to expensive and time-consuming experimental studies. The input data of these methods range from DNA sequence, to RNA selection pressure, to expressed sequence tags, to full-length complementary DNA, to exon array, to RNA-seq expression, to proteomic data. Notably, RNA-seq technology generates quantitative profiling of transcript expression at the genome scale, with an unprecedented amount of expression data available for developing isoform function prediction methods. Integrative analysis of these data at different molecular levels enables a proteogenomic approach to systematically interrogate isoform functions. Here, we briefly review the state-of-the-art methods according to their input data sources, discuss their advantages and limitations and point out potential ways to improve prediction accuracies.

Keywords: alternatively spliced isoform, function prediction, selection pressure, isoform expression, functional networks

Introduction

Functional analysis of genes is a central feature of genetics and genomics. Functional experiments are traditionally analyzed at the gene level [1–14], and the resulting data are commonly recorded in Gene Ontology (GO) [15, 16] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [17] databases. A gene is assumed to carry a function if it is annotated to a biological process term in GO or a pathway in KEGG. However, in multicellular organisms, including humans, 95% of multi-exon genes have their primary transcription product, heterogeneous nuclear RNA, alternatively spliced into multiple messenger RNA (mRNA) isoforms [18]. Isoforms of the same gene may show functional differences when they differ in sequences and structures [18–27]. One isoform may contain a deletion or addition of a domain (a functional unit) [28] compared with another one because of the splicing mechanism of exon skipping [29]; for example, the longer splice variant of Bcl-x is antiapoptotic, while the shorter one because of the loss of two motifs becomes proapoptotic [23].

Alternative splicing greatly increases the repertoire of functional complexity of gene products [30–35]. Splicing helps explain how humans can 'get by' with only 20 000 protein-coding genes. However, because functions are assigned to genes, not to individual isoforms, the data in GO and KEGG are limited in the sense that they cannot differentiate functions of splice isoforms of the same gene [35]. Therefore, methods to perform functional annotation at the isoform level would promise a more precise understanding of gene functions. Developing computational approaches for probing gene functions at the isoform level is a promising alternative or complement to expensive, time-consuming experimental methods. Currently available computational methods dedicated to predicting isoform functions are limited, however. Recently, DNA sequence alone was shown to be able to predict splicing based on deep learning [36]. An evolutionary approach was proposed to analyze regulation of exons and hence isoforms [37]. Because proteins are encoded by isoforms, protein structure-based methods are directly applicable to isoform functions. These methods can use homologous templates [38, 39], threading [40, 41], or ab initio prediction [42, 43]. Among these, I-TASSER (Iterative Threading ASSEmbly Refinement) combines homology and threading methods [41, 42], and has been applied to build three-dimensional (3D) conformation models of pairs of isoforms of individual genes, with emphasis on domain differences [23]. Other groups also use domain information for function prediction [44–46].

We used expression-based methods, including the multiple instance learning (MIL) approach, to differentiate isoform functions for the mouse [35], with RNA-seq data as input. A method similar to MIL was used to annotate human isoform functions [47]. Isoform function prediction is becoming an active topic in functional genomics [29].

In this review, we organized these computational approaches based on their input data source: DNA sequence, evolutionary traits such as RNA selection pressure ratio (RSPR) and expression data such as expressed sequence tags (ESTs) and RNA-seq. A glossary of terms used in this field has been compiled (Table 1) to facilitate the understanding of different approaches. We present a representative rather than an exhaustive review of approaches and their applications. Finally, we discuss potential ways to achieve refined annotation of isoform functions. An outline of the content of this review is in Figure 1.

Table 1.

A glossary of special terms used in the field of isoform function predictions (in alphabetical order)

Term | Meaning
Alternative splicing | The process through which mature mRNAs are produced by splicing out introns and concatenating exons from pre-mRNA.
ESTs | A short subsequence of a cDNA sequence with 500–800 nucleotides.
Exon array | A type of microarray whose probes target only the exons of a gene.
HCI/NCI | Highest connected isoforms/non-highest connected isoforms.
iMILP | An algorithm to predict isoform functions through label propagation based on isoform coexpression networks.
MIL | A machine learning approach that is able to make predictions at 'instance' level by using input data with 'bag'-level class label. In the context of this work, a gene is a 'bag' containing its isoforms as 'instances'.
Proteogenomics | The integrated analysis of genomic and proteomic data to improve gene annotation, enhance peptide identification and address many other objectives.
RNA selection pressure | A metric that measures the degree for an individual alternative exon to be highly regulated and functional.
RNA-seq | A technology that quantifies the whole transcriptome using next-generation sequencing.
Splice isoforms | The products of alternative splicing from the same gene. A gene can express multiple splice isoforms, which is common in multicellular organisms.

Figure 1.

An outline of the content of this review. First, different methods are briefly described and discussed according to their input data type. Second, the features and limitations of these methods are compared and summarized in Table 2. Third, perspectives on future isoform function predictions are presented.

Predicting splicing from DNA sequence by deep learning

Understanding how genetic variants alter the expression of splice isoforms is an essential step toward precision medicine. Recently, Xiong et al. proposed a computational model that uses deep learning to predict splicing from DNA sequence alone [36]. Applying the model to spinal muscular atrophy (SMA), hereditary nonpolyposis colorectal cancer and autism spectrum disorder, the authors demonstrated its high accuracy in identifying single nucleotide variations (SNVs) as genetic determinants of splicing and disease. SMA, for example, is caused by homozygous loss of survival motor neuron 1 (SMN1) function; functional protein can still be generated by the survival motor neuron 2 (SMN2) gene, which is a mutated form of SMN1 and shows exon 7 skipping. The authors simulated the effects of over 700 known and novel mutations around exon 7 of the SMN1 and SMN2 genes and predicted that a C6T mutation is the most significant factor causing exon 7 skipping in SMN2, which is consistent with the literature. The predicted effects of all possible SMN1 and SMN2 mutations agreed well with reverse transcription polymerase chain reaction (RT-PCR) experimental results (Spearman correlation = 0.82).

This promising approach addresses the challenge of investigating how SNVs cause misregulation of splicing and possibly lead to disease. The method was released as a web tool (http://tools.genes.toronto.edu/), providing a convenient interface for users to analyze their own data. One limitation is that it is computationally expensive and can process only 40 variants at a time.
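The agreement with RT-PCR quoted above is a Spearman rank correlation, that is, a Pearson correlation computed on ranks rather than raw values. The following is a minimal sketch of that calculation; the function names and the toy numbers are illustrative assumptions and are not taken from [36].

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (hypothetical): predicted vs. measured change in exon inclusion.
predicted = [0.10, 0.40, 0.35, 0.80]
measured = [0.05, 0.50, 0.30, 0.90]
print(spearman(predicted, measured))
```

Because only ranks matter, any monotonic relationship between predicted and measured effects gives a correlation of 1, which is why the measure suits comparisons between a model score and an experimental readout on different scales.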

Identifying the ‘functional’ isoform by analyzing selection pressure

Alternative splicing undergoes selection pressure. The splicing pattern during evolution could provide evidence of biological function of isoforms [48–51]. Based on RSPR, Lu et al. proposed an approach for predicting functional alternative splicing with the aid of multi-genome alignment [37]. Briefly, RSPR values measure the degree to which the alternative splicing event of an exon is regulated and functional. A high RSPR value (empirically RSPR > 3) indicates that an exon is highly regulated, and the resulting isoform is highly likely to be functional. As a proof-of-principle, the two exons (exons 5 and 21), which were known to be highly regulated in alternative splicing of the gene GRIN1 (Glutamate Receptor, Ionotropic, N-Methyl D-Aspartate 1), were shown to have significantly higher RSPR values than all the other exons, indicating that they are highly regulated functional exons; therefore, the isoforms resulting from splicing-in or -out of these two exons are highly likely to be functional. Because tissue-specific splicing is a type of functional regulation, the authors further validated whether the exons with high RSPR values (>3) are regulated in a tissue-specific manner or not. Taking the human gene GIT1 (G Protein-Coupled Receptor Kinase Interactor) as an example, one of its exons (chr17:24930106-24930132, hg18 genome) has RSPR = 6.12 and was therefore predicted to be highly tissue specific. To validate this, the authors assayed the expression of its isoforms in a panel of 10 tissues, including cerebellum and kidney, and found that the isoform including this exon is only expressed in brain, being consistent with their prediction. Though not focusing on differentiating isoform functions, this method could be of potential use to identify functional splicing events and thus functional isoforms. This method features the use of multiple genome alignment from the perspective of evolution. 
In our opinion, this approach would be more powerful when integrated with expression data.

Identifying isoforms and functions from ESTs

EST data provide an abundant resource for analyzing splicing events. Neverov et al. developed an approach, called IsoformCounter, which generates splice isoforms from EST data and then distinguishes functional from nonfunctional isoforms [52]. The filtration for functional isoforms considers factors such as the start codon, premature termination codons and protein isoform length. They found that genes in different functional categories may have different numbers of mRNA isoforms. For example, genes in 'Small GTPase-mediated signal transduction' have fewer isoforms than average. In contrast, genes belonging to 'DNA replication and chromosome cycle' were shown to have more isoforms than average [52]. Castrignanò et al. proposed an approach, called alternative splicing prediction (ASPIC) [53], for predicting and characterizing splice isoforms using a novel multiple genome-EST alignment algorithm. Their method involves refining exon-intron boundaries, clustering ESTs by common splice sites and finding a minimal set of full-length transcript isoforms. EST-based methods are limited by the typically low quality of EST sequence data.
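The kind of filter described above (start codon, premature termination codon, protein length) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the default length threshold and the simplified definition of a premature stop (any in-frame stop before the final codon) are hypothetical and are not the criteria implemented in IsoformCounter [52].

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds):
    """Return the 0-based position of the first in-frame stop codon, or None."""
    for pos in range(0, len(cds) - 2, 3):
        if cds[pos:pos + 3] in STOP_CODONS:
            return pos
    return None


def passes_filter(cds, min_protein_len=50):
    """Heuristic check that a candidate coding sequence looks translatable.

    Assumed (hypothetical) criteria:
      1. length is a multiple of three and the sequence starts with ATG;
      2. the first in-frame stop codon is the last codon (no premature stop);
      3. the encoded protein has at least min_protein_len residues.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    stop = first_in_frame_stop(cds)
    if stop is None or stop != len(cds) - 3:
        return False
    return stop // 3 >= min_protein_len


# Toy sequences; a tiny length threshold is used so the example stays short.
print(passes_filter("ATGAAACCCGGGTAA", min_protein_len=3))  # intact reading frame
print(passes_filter("ATGTAACCCGGGTAA", min_protein_len=3))  # stop at codon 2
```

A production filter would normally judge premature termination relative to exon-exon junctions, which requires the transcript structure and not only the coding sequence.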

Analyzing isoform functions using full-length complementary DNA

Yura et al. [54] performed a systematic study of full-length human complementary DNAs (cDNAs) to assess the impact of alternative splicing on gene function. They found that more than half of the spliced regions consist of sites for protein–protein interactions. As an example, one isoform of the G-Protein Coupled Receptor has a deletion of 46 amino-acid residues, which causes the loss of a conserved (E/D)RY motif responsible for G-protein α-subunit interaction [54]. This isoform therefore cannot transduce extracellular signals to the cytoplasm as its normal counterpart does [54]. The method involves analyzing cDNA, identifying splice events and analyzing isoform functions. As a pipeline, it also allows novel isoforms to be identified.

Inferring isoform function through exon array data

Exon arrays are designed to analyze gene expression with exon-targeting probes. Such data are useful for studying alternative splicing and inferring isoform functions [55, 56]. Langer et al. compared exon array data of matched non-small-cell lung cancers (NSCLC) and normal lung tissues, and identified cancer-specific splice isoforms [57]. For example, ADD3 (Gamma-adducin) is a gene that contains a cassette exon. Its longer isoform, which includes this exon, is highly expressed in NSCLC samples. Structural analysis predicted that the cassette exon introduces an insertion of 32 amino acids that could cause a small coil to form in the tail domain, potentially leading to functional differentiation from the shorter isoform. LIMMA (Linear Models for Microarray Data) is an approach originally proposed to identify differentially expressed genes in microarray data [58]. It is available as a free Bioconductor R package (https://bioconductor.org/packages/release/bioc/html/limma.html). Applying LIMMA to a series of Affymetrix exon array data from human tissues, Shah et al. were able to discover functional splicing events in a set of genes [59]. For instance, ITSN1 (intersectin 1), a gene functioning in the MAP (mitogen-activated protein) kinase signaling pathway and clathrin-mediated endocytosis, has two major isoforms. The authors showed that the shorter isoform results from the splicing out of the 3′ end exons. Of interest, the shorter isoform is ubiquitously expressed, while the longer one, with three additional domains, shows brain-specific expression. Xing et al. developed the Microarray Analysis of Differential Splicing (MADS) method for identifying differential splicing events [60]. With a series of embedded analyses, including background correction, iterative probe selection and removal of off-target hybridization, MADS was shown to detect splicing of individual exons with improved sensitivity and specificity [60]. Taking the Tmem87a gene as an example, they identified two mutually exclusive exons. The differential expression of these two isoforms before and after polypyrimidine tract-binding protein depletion was confirmed by RT-PCR. The authors released the MADS source code in Python (http://www.mimg.ucla.edu/faculty/xing/MADS/). Emig et al. developed the AltAnalyze software, which aims to predict splicing events and assign functions to isoforms using expression data, including exon array data [61]. This software is developed as an open source project and can be run through both graphical and command-line interfaces.

Isoform function prediction methods based on RNA-seq data

RNA-seq is a powerful technology for massive high-throughput transcript expression profiling [62, 63]. It has provided an unprecedented amount of transcript-level expression data that can be used for predicting isoform functions. For example, the Sequence Read Archive (formerly the Short Read Archive) contained >52 000 and >56 000 publicly available RNA-seq samples for human and mouse, respectively, as of October 2015. To date, however, few RNA-seq-based methods for predicting isoform function are available. We describe two of them here.

Eksi et al. developed a multiple instance learning (MIL) algorithm for differentiation of splice isoform functions by exploring RNA-seq data for the mouse [35]. MIL treats a gene as a bag of splice isoforms (instances) of potentially different functions [64, 65]. The reason to use MIL is that supervised learning methods such as Bayesian networks are not directly applicable because few functional annotation data are available at the isoform level so far. An overview of the MIL algorithm is shown in Figure 2. Eksi et al. found that MIL can accurately identify the ‘responsible' isoforms that most likely carry the function of its originating gene and also predict novel functions for isoforms. Cdkn2a is presented as an example; one isoform NM_001040654.1 was predicted to function in apoptotic nuclear changes, while its other isoform NM_009877.2 was highly likely to be involved in positive regulation of the transmembrane receptor protein serine/threonine kinase signaling pathway [35]. Through 3D modeling, it was found that their protein structures were dramatically different. NM_001040654.1 contains five ankyrin repeats, whereas NM_009877.2 has a cyclin-dependent kinase inhibitor N-terminus domain [41] (Figure 3). The authors released genome-wide isoform function predictions through a web server (http://guanlab.ccmb.med.umich.edu/isoPred), such that one can search and investigate genes/isoforms of interest.
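To make the bag/instance idea concrete, the sketch below runs a generic MIL 'witness' selection loop on toy data. It is not the implementation of Eksi et al. [35]: a nearest-centroid scorer stands in for the classifier, and the feature vectors, gene names and function names are all hypothetical. Each gene is a bag, each isoform an instance, and only genes carry labels.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def score(x, pos_c, neg_c):
    """Higher when x is closer to the positive than to the negative centroid."""
    return math.dist(x, neg_c) - math.dist(x, pos_c)


def select_witnesses(bags, bag_labels, max_iter=20):
    """Pick one 'responsible' isoform per positively labeled gene.

    bags: gene -> list of (isoform_id, feature_vector)
    bag_labels: gene -> True if the gene is annotated with the function
    """
    negatives = [x for g, insts in bags.items() if not bag_labels[g]
                 for _, x in insts]
    neg_c = centroid(negatives)
    # Start by treating every isoform of a positive gene as positive.
    witnesses = {g: [iso for iso, _ in insts]
                 for g, insts in bags.items() if bag_labels[g]}
    for _ in range(max_iter):
        positives = [x for g, insts in bags.items() if bag_labels[g]
                     for iso, x in insts if iso in witnesses[g]]
        pos_c = centroid(positives)
        # Keep only the best-scoring isoform of each positive gene.
        updated = {}
        for g, insts in bags.items():
            if bag_labels[g]:
                best = max(insts, key=lambda item: score(item[1], pos_c, neg_c))
                updated[g] = [best[0]]
        if updated == witnesses:  # witness set is stable
            break
        witnesses = updated
    return {g: isoforms[0] for g, isoforms in witnesses.items()}


# Toy expression-derived features (hypothetical values).
bags = {
    "geneA": [("A.1", [1.0, 1.0]), ("A.2", [0.0, 0.1])],
    "geneB": [("B.1", [0.9, 1.1]), ("B.2", [0.1, 0.0])],
    "geneC": [("C.1", [0.0, 0.0]), ("C.2", [0.1, 0.1])],
}
bag_labels = {"geneA": True, "geneB": True, "geneC": False}
print(select_witnesses(bags, bag_labels))
```

The loop stops either when the witness set no longer changes or after `max_iter` rounds; as with MIL procedures of this kind in general, convergence is not guaranteed, hence the iteration cap.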

Figure 2.

An outline of the MIL algorithm in ref [35].

Figure 3.

Illustration of MIL predictions using Cdkn2a as an example. Its isoform NM_001040654.1 (168 amino acids) contains an alternate reading frame compared with NM_009877.2 (169 amino acids), and therefore, their sequences lack similarity. The former is predicted to function in apoptotic nuclear changes, while the latter is highly likely to be involved in positive regulation of the transmembrane receptor protein serine/threonine kinase signaling pathway. Three-dimensional modeling of these two protein isoforms showed their dramatically different structures, with the former containing five ankyrin repeats and the latter encoding a cyclin-dependent kinase inhibitor N-terminus domain (Recreated from Eksi et al. [35]).

Li et al. reported an instance-oriented multiple-instance label propagation (iMILP) algorithm for annotating human isoform functions, which also considers a gene as a bag of isoforms with potentially different functions [47]. This method first calculates an isoform co-expression network from RNA-seq data and then performs function prediction by propagating isoform labels, using the rule that an isoform receives a larger prediction score if it links to more isoforms from positive bags, and a smaller score otherwise. The accuracy of this method was assessed by cross-validation on single-isoform genes. As an example, they correctly predicted that five of the eight isoforms of the TP53 (tumor protein p53) gene are annotated to the GO term 'regulation of apoptotic process' (GO: 0042981). Further, the regulation direction of these five isoforms, either 'positive regulation of apoptotic process' (GO: 0043065) or 'negative regulation of apoptotic process' (GO: 0043066), was predicted with high accuracy. Predictions for the other three isoforms were not reported in the work. Overall, these results demonstrated the usefulness of the iMILP algorithm. The method involves the calculation of co-expression networks, which may be computationally expensive, especially for comprehensive gene annotation models such as ENSEMBL. The MATLAB source code is freely available (http://zhoulab.usc.edu/IsoFP/).
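The scoring rule described above can be illustrated with a deliberately simplified, single-step sketch: build a thresholded co-expression network and score each isoform by the balance of its neighbors from positive and negative bags. This is not the published iMILP code [47]; the correlation threshold, the exclusion of same-gene neighbors, the score normalization and all names and values are assumptions made for illustration.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def coexpression_network(expr, threshold=0.8):
    """Link two isoforms when their expression profiles correlate strongly."""
    neighbors = {iso: set() for iso in expr}
    for a, b in combinations(expr, 2):
        if pearson(expr[a], expr[b]) >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def neighbor_scores(neighbors, gene_of, positive_genes):
    """Score in [-1, 1]: +1 if all linked isoforms come from positive bags."""
    scores = {}
    for iso, linked in neighbors.items():
        # Assumption: isoforms of the same gene do not vote for each other.
        others = [n for n in linked if gene_of[n] != gene_of[iso]]
        pos = sum(gene_of[n] in positive_genes for n in others)
        neg = len(others) - pos
        scores[iso] = (pos - neg) / len(others) if others else 0.0
    return scores


# Toy expression profiles across four samples (hypothetical values).
expr = {
    "G1.a": [1, 2, 3, 4],
    "G1.b": [4, 3, 2, 1],
    "G2.a": [2, 4, 6, 8],
    "G3.a": [8, 6, 4, 2],
}
gene_of = {"G1.a": "G1", "G1.b": "G1", "G2.a": "G2", "G3.a": "G3"}
positive_genes = {"G1", "G2"}  # genes annotated with the function of interest

network = coexpression_network(expr)
print(neighbor_scores(network, gene_of, positive_genes))
```

The all-pairs correlation step is quadratic in the number of isoforms, which illustrates why network construction becomes costly for comprehensive annotation models.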

In sum, methods based on RNA-seq data integration are readily scalable to the whole genome. In addition to coding isoforms, such methods can be applied to nonprotein-coding RNA isoforms, which may carry functions like regulating gene expression based on the ENCODE Project [66, 67]. However, the lack of gold standard isoform-level functional annotation data limits the accuracy of predictions and restricts the scope of the results at this time. When integrating RNA-seq data for function prediction, one needs to be cautious about the data quality, such as alignment errors and duplicated reads. Also, RNA coverage and expression are highly dependent on the gene annotation model (RefSeq, ENSEMBL, UniProt/SwissProt, or GENCODE), which defines how many isoforms one gene contains and the exons of each isoform.

Analyzing isoform function through a proteogenomic approach

Functional experiments are mostly carried out for coding genes from which splice isoforms are translated into proteins. Therefore, proteomic data can also be integrated for functional analysis. In recent work [68], Li et al. proposed a proteogenomic approach to interrogate mouse isoform functional networks by integrating functional genomic, transcriptomic and proteomic data. Functional connections of isoforms of the same gene are first identified, followed by selection of the highest connected isoform (HCI) based on protein–protein interactions. The expression behavior of HCIs at the transcript and protein levels was investigated with RNA-seq and proteomic data. Taking the Abca6 (ATP-Binding Cassette, Sub-Family A (ABC1), Member 6) gene as an example, its HCI is predicted to be NM_147218.2, which showed stronger interactions in its isoform network than the non-highest connected isoforms (NCIs) (NM_001166556.1 and NM_001166557.1). Further, the HCI was found to be expressed at the protein level through integrated analysis of proteomic data from eight normal mouse tissues, including liver, breast and brain. In contrast, none of its NCIs was expressed in any of the eight tissues. A set of 206 genes with their HCIs expressed at the protein level was provided as a functional resource [68]. Recently, the authors extended this proteogenomic approach to humans and identified HCIs accordingly [69]. For example, the HCI of the gene NDUFB6 [NADH dehydrogenase (ubiquinone) 1 beta subcomplex 6] was predicted to be NM_002493.4, which also showed protein expression evidence in retina. However, its NCI NM_182739.2, which had much weaker functional connections than the HCI, had no evidence at the protein level. The authors implemented web servers for both mouse and human through which users can easily search and investigate isoform networks and functions (http://guanlab.ccmb.med.umich.edu/misomine/; http://guanlab.ccmb.med.umich.edu/hisonet/). Likewise, Woo et al. developed a proteogenomic approach to identify aberrant peptides based on proteomic and RNA-seq data [70].
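The HCI selection step can be sketched as picking, for each gene, the isoform with the greatest connectivity in a weighted isoform network. In this sketch connectivity is assumed to be the sum of an isoform's edge weights; the exact measure used in [68, 69] is not specified here, and the isoform names and weights below are placeholders.

```python
from collections import defaultdict


def connectivity(edges):
    """Sum of edge weights touching each isoform.

    edges: (isoform_a, isoform_b) -> weight, e.g. a predicted
    functional-relationship probability (assumed interpretation).
    """
    total = defaultdict(float)
    for (a, b), weight in edges.items():
        total[a] += weight
        total[b] += weight
    return dict(total)


def highest_connected_isoforms(edges, gene_of):
    """Return gene -> isoform with the largest connectivity (the HCI)."""
    conn = connectivity(edges)
    best = {}
    for isoform, gene in gene_of.items():
        c = conn.get(isoform, 0.0)
        if gene not in best or c > best[gene][1]:
            best[gene] = (isoform, c)
    return {gene: isoform for gene, (isoform, _) in best.items()}


# Toy isoform network (hypothetical weights).
edges = {
    ("X.1", "Y.1"): 0.9,
    ("X.1", "Z.1"): 0.8,
    ("X.2", "Y.1"): 0.2,
    ("Y.2", "Z.1"): 0.1,
}
gene_of = {"X.1": "X", "X.2": "X", "Y.1": "Y", "Y.2": "Y", "Z.1": "Z"}
print(highest_connected_isoforms(edges, gene_of))
```

All remaining isoforms of a gene are its NCIs; checking whether the HCI and the NCIs have peptide-level evidence would then be a separate lookup against proteomic data.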

For a proteogenomic approach to isoform function analysis, large-scale proteomic data are essential. The Human Proteome Organization launched the Human Proteome Project in 2011, with annual updates of the metrics for high-quality protein identifications by mass spectrometry and complementary methods of protein 3D structures, Edman sequencing and antibody profiling. As of October 2014, neXtProt contained 16 491 highly confident protein identifications (82% of presumed protein-coding genes), and a search was on across the 25 Chromosome-centric HPP (Human Proteome Project) teams and many other research groups for the 'missing proteins' [71]. The HPP data sets at PeptideAtlas and GPMdb (Global Proteome Machine database) contain the results of a standardized reanalysis of the primary mass spectrometry data of numerous studies, including two large studies published in Nature in 2014 [72, 73] and the initial proteomics releases from the National Cancer Institute's The Cancer Genome Atlas Project. Dozens of different tissues and cell lines were analyzed by mass spectrometry in refs [72, 73] and by antibody profiling for the Human Protein Atlas [74]. Numerous tissue-specific isoforms have been reported; for instance, isoform 1 of FYN protein tyrosine kinase was expressed in the brain, while isoform 2 was expressed in hematopoietic cells [72]. These data provide a rich resource for functional investigation and annotation of splice isoforms. Recently, the Genomewide PDB (Protein Data Bank) 2.0 has incorporated new features for the identification of protein isoforms by integrating transcriptomic and proteomic information [75].

Summary and future perspectives

Functions have traditionally been analyzed and assigned to genes but not to individual isoforms resulting from alternative splicing. Even though splice isoforms have potentially different or even opposing functions, studies focused on differentiating isoform functions have been limited. Computational prediction provides an efficient alternative or complement to experimental methods that are usually time-consuming, expensive and small scale. A series of approaches have been developed for predicting isoform functions [35–37, 47, 52].

Computational approaches for isoform function prediction are different in many ways in their input data, models and performance. We organized these methods into three different categories based on DNA sequence, evolution and expression, respectively. A summary and comparison of these methods is in Table 2. Basics of these methods were outlined, and applications were illustrated. Integration of multi-omics level data in a proteogenomic approach presents a powerful way to study isoform functions.

Table 2.

A brief summary of methods for differentiating isoform functions categorized based on input data sources

Data source | Methods and reference | Features | Limitations
DNA sequence | Deep learning [36] | Use DNA alone; wide applicability; providing causal relationship. | Computationally expensive.
Evolutionary features | RSPR [37] | Use genome sequences from multiple species. | Only at exon level, though can be extended.
ESTs | IsoformCounter and ASPIC [52, 53] | Filtering for functional isoforms. | Need to assemble ESTs to transcripts; EST data are usually of low quality.
Full-length cDNA | Identifying exons by comparing cDNA with genome [54] | Powerful for identifying splicing. | Limited amount of full-length cDNA.
Exon array | From splicing events to function by approaches: LIMMA, MADS, AltAnalyze [56, 57, 59, 61] | Easy interpretation if exons map to domain. | Dependent on the probe design of the exon array; not able to discover novel splice isoforms.
RNA-seq | MIL [35] | Heterogeneous data integration; well scalable. | Convergence not guaranteed; 'witness' selection is challenging.
RNA-seq | iMILP [47] | Heterogeneous data integration; well scalable. | Convergence not guaranteed.
Proteogenomic | Isoform-level functional network approaches based on Bayesian networks [68, 69] | Multi-omics data integration; protein-level evidence; improve gene annotation and peptide identification. | Suitable to coding gene only; usually expensive to measure data at multi-omics level.

These methods have their limitations, with opportunities for improvement. First, a gold standard set of functional annotation data at the isoform level is unavailable, which limits the accuracy of these methods and the validation of prediction results. Creating a database of isoform-level functions from curation of individual studies and/or through text mining would be of great value and general interest to the biology and bioinformatics community. Also, it should be mentioned that a non-negligible fraction of predicted isoforms may be reconstruction artifacts or be expressed at such low levels that they are hardly functional. Therefore, a preliminary filter should be useful to focus the functional annotation on a smaller set of reliable candidates. RNA-seq-based machine learning methods are able to address both coding and noncoding isoforms, but their results are not as intuitive to interpret as those of, for example, DNA sequence-based methods, which are able to identify genetic determinants of splicing events [36]. Current methods are often limited by their use of only one type of data. Because of the complementarity of different types of genomic data, integrating heterogeneous data such as DNA sequence, exon array, RNA-seq and proteomics would improve the accuracy of prediction models. In addition, incorporating contextual experimental information such as tissues and developmental stages could enhance the accuracy of isoform function prediction.
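One simple form the preliminary filter mentioned above could take is an expression-based cutoff that keeps only isoforms detected in several samples. The sketch below assumes a normalized expression matrix; the threshold values and the function name are hypothetical and would need tuning for real data.

```python
def expressed_isoforms(expr, min_level=1.0, min_samples=2):
    """Keep isoforms whose expression reaches min_level in enough samples.

    expr: isoform -> list of normalized expression values, one per sample.
    min_level and min_samples are illustrative defaults, not recommendations.
    """
    return {
        isoform
        for isoform, values in expr.items()
        if sum(v >= min_level for v in values) >= min_samples
    }


# Toy expression values across three samples (hypothetical).
expr = {
    "iso1": [5.0, 3.2, 0.0],
    "iso2": [0.2, 0.0, 0.1],
    "iso3": [1.5, 0.4, 0.3],
}
print(expressed_isoforms(expr))
```

Requiring support in more than one sample is one way to reduce the influence of reconstruction artifacts that appear in a single library.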

Overall, systematically addressing functional differences of alternative splice isoforms that arise from structural variations has just begun, but is becoming an essential topic in genetics and proteogenomics. Integrative proteogenomics approaches to investigate isoform functions by leveraging molecular information at different omics levels will yield a more precise understanding of functions of genes and gene interactions. Protein isoforms are likely to be more specific diagnostic biomarker candidates and therapeutic targets in cancers and many other diseases than the mixtures of isoforms implied by gene-level analyses.

Key Points

  • Alternative splicing enables a single gene to produce multiple mRNAs that may carry different functions, which is true in all multicellular organisms with multi-exon genes. Traditional functional analysis methods treat a gene as a single entity and assign functions to genes without considering splice isoforms.

  • It is essential to differentiate isoform functions for more accurate understanding of gene functions, protein networks and protein interactions.

  • Many previously developed methods for protein function prediction are directly applicable to this challenge.

  • Dedicated methods for isoform function prediction, such as the MIL method, have limitations.

  • Combining structural, evolutionary and expression data gives a much more comprehensive assessment of splice isoforms.

  • A non-negligible fraction of predicted isoforms may be reconstruction artifacts or be expressed at such low levels that they are hardly functional. Therefore, a preliminary filter should be useful to focus the functional annotation on a smaller set of reliable candidates.

Acknowledgement

G.S.O. acknowledges support from NIH grants U54ES017885 and RM-08-029.

Biographies

Hong-Dong Li is a research scientist in Institute for Systems Biology, Seattle. He was formerly a postdoctoral research fellow in the Department of Computational Medicine and Bioinformatics at the University of Michigan, Ann Arbor. His research focuses on predicting functions and networks at the isoform level through heterogeneous genomic data integration.

Gilbert S. Omenn is a professor in Computational Medicine and Bioinformatics, Internal Medicine, Human Genetics, and Public Health. His research focuses on cancer proteomics and informatics. He is especially interested in the role of differential expression of alternative splice isoforms of proteins and transcripts in specific cancer-related pathways. He chairs the Human Proteome Organization (HUPO) global Human Proteome Project (www.thehpp.org).

Yuanfang Guan is an assistant professor in the Department of Computational Medicine and Bioinformatics at the University of Michigan, Ann Arbor. Her lab has contributed eight best-performing algorithms to the biggest systems biology benchmark study DREAM (Dialogue of Reverse Engineering and Methods).

Funding

This work was supported by National Institutes of Health [1R21NS082212-01], EU-FP VII [Systems Biology of Rare Disease] and NIH [University of Michigan O’Brien Kidney Translational Core Center].

References

  • 1. Guan Y, Gorenshteyn D, Schimenti JC, et al. Tissue-specific functional networks for prioritizing phenotypes and disease genes. PLoS Comput Biol 2012;8:e1002694.
  • 2. Pena-Castillo L, Tasan M, Myers C, et al. A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 2008;9:S2.
  • 3. Zhu F, Shi L, Li H, et al. Modeling dynamic functional relationship networks and application to ex vivo human erythroid differentiation. Bioinformatics 2014;30:3325–33.
  • 4. Zhu F, Guan Y. Predicting dynamic signaling network response under unseen perturbations. Bioinformatics 2014;30:2772–8.
  • 5. Wong AK, Park CY, Greene CS, et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2012;40:W484–90.
  • 6. Guan Y, Ackert-Bicknell CL, Kell B, et al. Functional genomics complements quantitative genetics in identifying disease-gene associations. PLoS Comput Biol 2010;6:e1000991.
  • 7. Guan Y, Myers CL, Lu R, et al. A genomewide functional network for the laboratory mouse. PLoS Comput Biol 2008;4:e1000165.
  • 8. Guan Y, Myers CL, Hess DC, et al. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 2008;9(Suppl. 1):S3.
  • 9. Huttenhower C, Hibbs M, Myers C, et al. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 2006;22:2890–7.
  • 10. Troyanskaya OG, Dolinski K, Owen AB, et al. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 2003;100:8348–53.
  • 11. Ray SS, Bandyopadhyay S, Pal SK. A weighted power framework for integrating multisource information: gene function prediction in yeast. IEEE Trans Biomed Eng 2012;59:1162–8.
  • 12. Li X, Chen H, Li J, et al. Gene function prediction with gene interaction networks: a context graph kernel approach. IEEE Trans Inf Technol Biomed 2010;14:119–28.
  • 13. Zhang C, Joshi T, Lin GN, et al. An integrated probabilistic approach for gene function prediction using multiple sources of high-throughput data. Int J Comput Biol Drug Des 2008;1:254–74.
  • 14. Walker MG, Volkmuth W, Sprinzak E, et al. Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res 1999;9:1198–203.
  • 15. Harris MA. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32:D258–61.
  • 16. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.
  • 17. Kanehisa M, Goto S, Kawashima S, et al. The KEGG resource for deciphering the genome. Nucleic Acids Res 2004;32:D277–80.
  • 18. Pan Q, Shai O, Lee LJ, et al. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008;40:1413–15.
  • 19. Shargunov AV, Krasnov GS, Ponomarenko EA, et al. Tissue-specific alternative splicing analysis reveals the diversity of Chromosome 18 transcriptome. J Proteome Res 2014;13:173–82.
  • 20. Corominas R, Yang X, Lin GN, et al. Protein interaction network of alternatively spliced isoforms from brain links genetic risk factors for autism. Nat Commun 2014;5:3650. doi:10.1038/ncomms4650.
  • 21. Omenn GS, Guan Y, Menon R. A new class of protein cancer biomarker candidates: differentially expressed splice variants of ERBB2 (HER2/neu) and ERBB1 (EGFR) in breast cancer cell lines. J Proteomics 2014;107:103–12.
  • 22. Omenn GS, Menon R, Zhang Y. Innovations in proteomic profiling of cancers: alternative splice variants as a new class of cancer biomarker candidates and bridging of proteomics with structural biology. J Proteomics 2013;90:28–37.
  • 23. Menon R, Roy A, Mukherjee S, et al. Functional implications of structural predictions for alternative splice proteins expressed in Her2/neu-induced breast cancers. J Proteome Res 2011;10:5503–11.
  • 24. Gao Z, Poon HY, Li L, et al. Splice-mediated motif switching regulates disabled-1 phosphorylation and SH2 domain interactions. Mol Cell Biol 2012;32:2794–808.
  • 25. Black DL. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem 2003;72:291–336.
  • 26. Stamm S, Ben-Ari S, Rafalska I, et al. Function of alternative splicing. Gene 2005;344:1–20.
  • 27. Lee C, Atanelov L, Modrek B, et al. ASAP: the alternative splicing annotation project. Nucleic Acids Res 2003;31:101–5.
  • 28. Ponting CP, Russell RR. The natural history of protein domains. Annu Rev Biophys 2002;31:45–71.
  • 29. Li H-D, Menon R, Omenn G, et al. The emerging era of genomic data integration for analyzing splice isoform functions. Trends Genet 2014;30:340–7.
  • 30. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet 2010;11:345–55.
  • 31. Li H-D, Omenn GS, Guan Y. MIsoMine: a genome-scale high-resolution data portal of expression, function and networks at the splice isoform level in the mouse. Database 2015;2015:bav045.
  • 32. Boue S, Letunic I, Bork P. Alternative splicing and evolution. BioEssays 2003;25:1031–4.
  • 33. Kriventseva EV, Koch I, Apweiler R, et al. Increase of functional diversity by alternative splicing. Trends Genet 2003;19:124–8.
  • 34. Ellis JD, Barrios-Rodiles M, Çolak R, et al. Tissue-specific alternative splicing remodels protein-protein interaction networks. Mol Cell 2012;46:884–92.
  • 35. Eksi R, Li H-D, Menon R, et al. Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput Biol 2013;9:e1003314.
  • 36. Xiong HY, Alipanahi B, Lee LJ, et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 2015;347.
  • 37. Lu H, Lin L, Sato S, et al. Predicting functional alternative splicing by measuring RNA selection pressure from multigenome alignments. PLoS Comput Biol 2009;5:e1000608.
  • 38. Marti-Renom MA, Stuart AC, Fiser A, et al. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys 2000;29:291–325.
  • 39. Ginalski K. Comparative modeling for protein structure prediction. Curr Opin Struct Biol 2006;16:172–7.
  • 40. Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res 2012;40:W471–7.
  • 41. Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 2010;5:725–38.
  • 42. Wu S, Skolnick J, Zhang Y. Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol 2007;5:17.
  • 43. Bonneau R, Tsai J, Ruczinski I, et al. Functional inferences from blind ab initio protein structure predictions. J Struct Biol 2001;134:186–90.
  • 44. Rentzsch R, Orengo CA. Protein function prediction using domain families. BMC Bioinformatics 2013;14(Suppl 3):S5.
  • 45. Fruhwald J, Londono JC, Dembla S, et al. Alternative splicing of a protein domain indispensable for function of transient receptor potential melastatin 3 (TRPM3) ion channels. J Biol Chem 2012;287:36663–72.
  • 46. Forslund K, Sonnhammer ELL. Predicting protein function from domain content. Bioinformatics 2008;24:1681–7.
  • 47. Li W, Kang S, Liu CC, et al. High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method. Nucleic Acids Res 2014;42:e39.
  • 48. Lareau LF, Green RE, Bhatnagar RS, et al. The evolving roles of alternative splicing. Curr Opin Struct Biol 2004;14:273–82.
  • 49. Chen FC, Chaw SM, Tzeng YH, et al. Opposite evolutionary effects between different alternative splicing patterns. Mol Biol Evol 2007;24:1443–6.
  • 50. Baek D, Green P. Sequence conservation, relative isoform frequencies, and nonsense-mediated decay in evolutionarily conserved alternative splicing. Proc Natl Acad Sci USA 2005;102:12813–18.
  • 51. Nurtdinov RN, Artamonova II, Mironov AA, et al. Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet 2003;12:1313–20.
  • 52. Neverov A, Artamonova I, Nurtdinov R, et al. Alternative splicing and protein function. BMC Bioinformatics 2005;6:266.
  • 53. Castrignanò T, Rizzi R, Talamo IG, et al. ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization. Nucleic Acids Res 2006;34:W440–3.
  • 54. Yura K, Shionyu M, Hagino K, et al. Alternative splicing in human transcriptome: Functional and structural influence on proteins. Gene 2006;380:63–71.
  • 55. Suzuki H, Osaki K, Sano K, et al. Comprehensive analysis of alternative splicing and functionality in neuronal differentiation of P19 cells. PLoS ONE 2011;6:e16880.
  • 56. Chen P, Lepikhova T, Hu Y, et al. Comprehensive exon array data processing method for quantitative analysis of alternative spliced variants. Nucleic Acids Res 2011;39:e123.
  • 57. Langer W, Sohler F, Leder G, et al. Exon array analysis using re-defined probe sets results in reliable identification of alternatively spliced genes in non-small cell lung cancer. BMC Genomics 2010;11:676.
  • 58. Smyth GK. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3:article3.
  • 59. Shah S, Pallas J. Identifying differential exon splicing using linear models and correlation coefficients. BMC Bioinformatics 2009;10:26.
  • 60. Xing Y, Stoilov P, Kapur K, et al. MADS: A new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. RNA 2008;14:1470–9.
  • 61. Emig D, Salomonis N, Baumbach J, et al. AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Res 2010;38:W755–62.
  • 62. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621–8.
  • 63. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.
  • 64. Maron O, Lozano-Perez T. A framework for multiple-instance learning. Adv Neural Inf Process Syst 1998;10:570–6.
  • 65. Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. Adv Neural Inf Process Syst 2003;15.
  • 66. Dunham I, Kundaje A, Aldred SF, et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
  • 67. Harrow J, Frankish A, Gonzalez JM, et al. GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res 2012;22:1760–74.
  • 68. Li H-D, Menon R, Omenn GS, et al. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence. Proteomics 2014;14:2709–18.
  • 69. Li H-D, Menon R, Govindarajoo B, et al. Functional networks of highest-connected splice isoforms: from the Chromosome 17 Human Proteome Project. J Proteome Res 2015;14:3484–91.
  • 70. Woo S, Cha SW, Na S, et al. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics 2014;14:2719–30.
  • 71. Lane L, Bairoch A, Beavis RC, et al. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res 2014;13:15–20.
  • 72. Kim M-S, Pinto SM, Getnet D, et al. A draft map of the human proteome. Nature 2014;509:575–81.
  • 73. Wilhelm M, Schlegl J, Hahne H, et al. Mass-spectrometry-based draft of the human proteome. Nature 2014;509:582–7.
  • 74. Uhlen M, Fagerberg L, Hallstrom BM, et al. Tissue-based map of the human proteome. Science 2015;347:1260419.
  • 75. Jeong S-K, Hancock WS, Paik Y-K. GenomewidePDB 2.0: a newly upgraded versatile proteogenomic database for the chromosome-centric human proteome project. J Proteome Res 2015;14:3710–19. doi:10.1021/acs.jproteome.5b00541.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
