Summary
The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating, diverse biological network data and can be broadly applied to other network science domains.
Introduction
Comprehensively understanding various functional aspects of genes or proteins, such as their involvement in a particular biological process, physical or genetic interactions, or disease association, is critical for both biological and translational medicine research. Since exhaustively characterizing genes or proteins through biological experiments is often intractable, systems-level integration of knowledge and computational hypothesis generation have garnered great interest in the field as effective ways to guide experiments (Berger et al., 2013).
With the advent of high-throughput experimental techniques, genome-scale interaction networks (also known as interactomes) have been an integral way of encapsulating information and have enabled approaches to extend and refine functional knowledge of genes and proteins (Yu et al., 2013). A key insight behind such approaches is that genes or proteins that are co-localized or have similar topological roles in the interaction networks are more likely to be functionally correlated. This insight allows us to infer properties of unknown proteins by transferring knowledge from similar genes and proteins that are better understood—a process known as “guilt-by-association”.
An important challenge has been to develop principled approaches for integrating heterogeneous sources of information (e.g., physical binding, genetic interaction, co-expression, or co-evolution) from which different interaction networks can be constructed. Most previous work has focused on summarizing a collection of heterogeneous data into a single integrated network, which is typically obtained by combining the edges across different networks via Bayesian inference (Franceschini et al., 2013; Lee et al., 2011; Wong et al., 2015) or adaptive weighted averaging (Mostafavi et al., 2008). The resulting integrated network is provided as input to existing network-based inference methods, such as label propagation (Mostafavi et al., 2008) or graph-based clustering (Dutkowski et al., 2013), to derive functional insights from the data. However, a key limitation of such approaches is the substantial information loss incurred by projecting various data sets onto a single network representation. For instance, context-specific interaction patterns (e.g., tissue-specific gene modules) that are only present in certain data sets are likely to be obscured by edges from other data sources in the integrated network.
A naïve approach for tackling this challenge would be to separately analyze the structure of each network and to concatenate the resulting network features (e.g., [Cao et al., 2014; Milenković and Pržulj, 2008; Mostafavi et al., 2012]) for each gene. However, this approach greatly increases the dimensionality of the feature space and often dilutes the signal in the data as a result. Noise in interaction networks based on high-throughput experiments further compounds this issue. Thus, it is imperative to develop integrative methods that can properly take advantage of the fine-grained topology of multiple heterogeneous networks while maintaining a low-dimensional feature space, thereby increasing robustness to noise and enhancing accuracy.
Here we address this challenge by introducing an integrative framework, Mashup, for obtaining high-quality, compact topological feature representations of genes from one or more interaction networks constructed from heterogeneous data types. We incorporate the following conceptual advances into our framework: (i) Mashup takes full advantage of network-specific topology by analyzing the structure of each network separately before learning a canonical representation that best explains the topological patterns across all networks, and (ii) Mashup decouples the dimensionality of feature representations from the data parameters (e.g., number of networks or genes), which allows it to cope with inherent noise in high-throughput data by obtaining compact representations that keep only the most explanatory features. By showing substantial improvements over the state-of-the-art methods in three distinct functional inference tasks—automated gene function annotation, gene ontology reconstruction, and genetic interaction prediction—we demonstrate Mashup’s wide applicability and its potential to effectively decipher functional properties of genes from interactomes. Notably, Mashup easily scales to a large number of networks—a critical requirement for network-based methods to fully utilize the ever-growing repository of interactomes. We provide software for Mashup along with ready-to-use compact vector representations of genes learned from existing interactome data sets for researchers to apply to their own application domains (http://mashup.csail.mit.edu).
Mashup can in principle be used to simultaneously analyze any large networks in which “guilt-by-association” properties hold for more accurate knowledge discovery. Not only do the substantial improvements in accuracy and scalability promise to enable new workflows for biomedical practitioners (e.g., integration of single-cell data), but also the general framework for network integration that we introduce can be straightforwardly applied to network analysis problems outside of biology.
Results
Overview of Mashup
The basic Mashup framework for heterogeneous network integration involves three steps (Figure 1): (i) Run a localized network diffusion process (e.g., random walks with restart [RWR; Tong et al., 2006]) on each network to obtain a distribution for each node, which captures its relevance to all other nodes in the network. Similar to the widely used PageRank algorithm (Page et al., 1999) in web and social network analysis, this step characterizes the topological context of each gene in a network, taking the global connectivity patterns into account. (ii) Approximate each of these distributions by constructing a model, parameterized by low-dimensional feature vectors for each node; these feature vectors are obtained by minimizing the difference between the model distribution and diffusion distributions for all networks simultaneously. Akin to Principal Component Analysis (PCA), which reveals the internal low-dimensional linear structure of the data that best explains the variance, Mashup computes a low-dimensional vector-space representation for all nodes such that the diffusion or the connectivity patterns in the networks can be best explained. (iii) Use the learned representations as input features for a wide range of network-based functional inference tasks. A more detailed description of Mashup is provided in Method Details.
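Step (i) can be illustrated with a minimal sketch of random walks with restart computed by power iteration. This is an illustration rather than the authors' implementation: the function name, the restart probability of 0.5, the convergence tolerance, the handling of isolated nodes, and the toy network are all assumptions introduced here.

```python
def random_walk_with_restart(adjacency, restart_prob=0.5, tol=1e-10, max_iter=1000):
    """Return one diffusion state (a probability distribution) per node.

    adjacency: square list of lists with non-negative edge weights.
    restart_prob: probability of jumping back to the start node at each step
                  (0.5 is an illustrative default, not a value from the text).
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize so that transition[i][j] = P(step to i | currently at j).
    # Assumption: an isolated node keeps the walker in place (self-loop).
    transition = [
        [
            adjacency[i][j] / col_sums[j] if col_sums[j] > 0 else float(i == j)
            for j in range(n)
        ]
        for i in range(n)
    ]
    states = []
    for start in range(n):
        restart = [1.0 if i == start else 0.0 for i in range(n)]
        state = restart[:]
        for _ in range(max_iter):
            walked = [sum(transition[i][j] * state[j] for j in range(n)) for i in range(n)]
            new_state = [
                (1.0 - restart_prob) * walked[i] + restart_prob * restart[i]
                for i in range(n)
            ]
            change = sum(abs(a - b) for a, b in zip(new_state, state))
            state = new_state
            if change < tol:
                break
        states.append(state)
    return states


# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
toy_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
diffusion = random_walk_with_restart(toy_network)
```

In this toy example, the diffusion state of node 0 places most of its mass on node 0 itself and progressively less on nodes farther along the path, which is the sense in which the distribution captures the relevance of every other node to the start node.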
Figure 1. Overview of Mashup.
Random walks with restart (RWR) are used to compute the diffusion state for each node in each individual network. Low-dimensional feature vectors describing the topological properties of each node are obtained by jointly minimizing the difference between the observed diffusion states and the parameterized multinomial logistic distributions across all networks. The low-dimensional representation can be readily plugged into machine learning methods for functional inference.
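The joint fitting step can be sketched as an objective function. The exact formulation is given in Method Details, which is not reproduced here, so the sketch below rests on two assumptions: that the multinomial logistic model is a softmax over inner products between a node's feature vector and network-specific context vectors, and that the "difference" is measured by Kullback-Leibler divergence. All function names and the toy data are hypothetical, and the sketch only evaluates the objective without optimizing it.

```python
import math


def model_distribution(node_vector, context_vectors):
    """Softmax over inner products: a multinomial logistic distribution over nodes."""
    scores = [sum(x * w for x, w in zip(node_vector, ctx)) for ctx in context_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(observed, model):
    """KL(observed || model); terms with zero observed probability contribute 0."""
    return sum(p * math.log(p / q) for p, q in zip(observed, model) if p > 0)


def joint_objective(diffusion_by_network, node_vectors, context_by_network):
    """Sum the divergence over every node of every network.

    node_vectors are shared across networks (one per gene);
    each network has its own list of context vectors.
    """
    total = 0.0
    for diffusion, contexts in zip(diffusion_by_network, context_by_network):
        for i, observed in enumerate(diffusion):
            total += kl_divergence(observed, model_distribution(node_vectors[i], contexts))
    return total


# Hypothetical toy setup: 3 nodes, 2-dimensional vectors, 2 networks.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
peaked = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
zero_vectors = [[0.0, 0.0] for _ in range(3)]

# All-zero vectors give a uniform model, which matches `uniform` exactly
# but not `peaked`, so only the second network contributes to the objective.
objective = joint_objective([uniform, peaked], zero_vectors, [zero_vectors, zero_vectors])
```

Sharing `node_vectors` across networks while keeping the context vectors network-specific is what would let a single compact representation per gene explain the diffusion patterns of all networks at once.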
Improved Gene Function Prediction
Automated annotation of gene function, the goal of which is to assign a poorly understood gene to the correct functional categories in an annotation database, is considered one of the most important and challenging problems of the post-genomic era (Radivojac et al., 2013). Many solutions based on high-throughput experimental data have been proposed in the past decade, each exploiting different types of information, including amino acid sequence (Clark and Radivojac, 2011), genomic context (Enault et al., 2005), evolutionary relationships (Gaudet et al., 2011), protein structure (Pal and Eisenberg, 2005), and gene expression (Huttenhower et al., 2006). Here we focus on the use of protein-protein interaction (PPI) networks, where we pursue the intuition that the topological role of a gene in interaction networks is correlated with its biological function.
Existing approaches for integrating multiple networks for function prediction have largely focused on combining the networks into a single representative network to be used for prediction. GeneMANIA (Mostafavi and Morris, 2010; Mostafavi et al., 2008) is a state-of-the-art function prediction server that uses a label propagation algorithm on an averaged network, whose mixing weights are optimized for each functional category. Another standard approach for network integration, adopted by the large public PPI network database STRING (Franceschini et al., 2013), is to use Bayesian inference to combine edges across multiple networks. STRING’s resulting integrated network can be used with single-network function prediction methods, such as diffusion state distance (DSD; Cao et al., 2014), a state-of-the-art diffusion-based method that uses RWR to characterize the local topology of each gene and assigns functions by majority vote based on a set of genes with most similar diffusion patterns.
We found that Mashup-based function prediction substantially outperforms these state-of-the-art integrative methods in assigning a previously unseen gene to its known functional categories in a cross-validation experiment on real data sets from yeast and human (Figure 2). We observed clear improvements for both the yeast and human data sets at different annotation levels of the Munich Information Center for Protein Sequences (MIPS) (Ruepp et al., 2004) and the Gene Ontology database (GO; Ashburner et al., 2000) hierarchies, respectively. For example, top predictions based on Mashup correctly assigned 38.4% of genes (on average) to their functional categories, in contrast to 28.7% for GeneMANIA and 25.9% for DSD with STRING integration (referred to as STRING-DSD), with respect to human Biological Process (BP) GO terms with highest specificity (11–30 genes). An exception to the general improvement was the top layer (Level 1) of the yeast data set, for which we performed comparably to GeneMANIA. This finding is likely due to the relative completeness of yeast interactomes and the fact that the top layer contains the largest functional terms that are easiest to predict, leaving little room for improvement. Moreover, Mashup’s improvement is consistent over a wide range of parameters in our framework, which includes the dimensionality of our learned representation and the restart probability of RWR (Figure S4). To enable function prediction with Mashup, we used a support vector machine (SVM) classifier for each functional category with Mashup’s compact topological representations as input features (Method Details).
Figure 2. Mashup improves gene function prediction performance in human and yeast.
We performed five-fold cross validation to compare the function prediction performance of Mashup to other state-of-the-art network integration methods, GeneMANIA and STRING’s Bayesian integration followed by a diffusion-based function prediction method DSD (STRING-DSD), in (a) human and (b) yeast. A precision-recall curve for each method is shown in (c). Additional figures, including the results on molecular function (MF) and cellular component (CC) ontologies in human and further comparisons to other integration approaches, are provided in Figures S1–3. Performance is measured by the fraction of top predictions correctly labeled (Acc), the harmonic mean of precision and recall when the top 3 predictions are assigned to each gene (F1), and the area under the precision-recall curve summarized over all labels, under both the micro-averaging (m-PR) and macro-averaging (M-PR) schemes. Results are summarized over 10 trials, and asterisks represent significance of Mashup’s improvement over GeneMANIA (one-sided rank-sum p-value < 0.01). Overall, Mashup achieves substantially greater predictive performance than previous methods.
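The Acc and F1 measures described in this caption can be sketched as follows. The caption does not state how counts are aggregated for F1, so pooling true positives over all genes is an assumption here, as are the function names and the toy scores. The area-under-curve measures are not covered by this sketch.

```python
def top_k_labels(scores, k):
    """Indices of the k highest-scoring labels for one gene."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def top1_accuracy(score_rows, true_label_sets):
    """Fraction of genes whose single top prediction is a true label."""
    hits = sum(
        1
        for scores, truth in zip(score_rows, true_label_sets)
        if top_k_labels(scores, 1)[0] in truth
    )
    return hits / len(score_rows)


def top_k_f1(score_rows, true_label_sets, k=3):
    """F1 when the top k labels are assigned to every gene.

    Assumption: counts are pooled over genes before computing precision and recall.
    """
    true_pos = predicted = actual = 0
    for scores, truth in zip(score_rows, true_label_sets):
        assigned = set(top_k_labels(scores, k))
        true_pos += len(assigned & truth)
        predicted += len(assigned)
        actual += len(truth)
    precision = true_pos / predicted
    recall = true_pos / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier scores for 3 genes over 4 functional categories.
toy_scores = [
    [0.9, 0.1, 0.4, 0.2],
    [0.2, 0.8, 0.1, 0.7],
    [0.3, 0.2, 0.6, 0.5],
]
toy_truth = [{0}, {1, 3}, {3}]

acc = top1_accuracy(toy_scores, toy_truth)
f1 = top_k_f1(toy_scores, toy_truth, k=3)
```

In the toy data, two of the three genes have a correct top prediction, while assigning three labels per gene recovers every true label at the cost of lower precision.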
Mashup’s accuracy improvement can be partially attributed to the fact that separately analyzing the structure of each individual network uncovers fine-grained topological patterns that are difficult to identify in the combined network where different edge types are not distinguished. For instance, we noted that many genes’ most topologically similar gene, based on Mashup’s integrated features, is not a direct neighbor in any of the networks, but rather a gene indirectly connected by numerous paths that go through different intermediary nodes in different networks. Such indirect, but consistent associations are often outweighed by direct neighbors if analyzed based on a single combined network, even if the direct connection exists only in a narrow context (few networks). Further inspection revealed that many of these top, indirect associations newly identified by Mashup in fact correspond to paralogous genes, suggesting that such patterns reflect coherent biological functions (Table S1).
Another important factor in Mashup’s enhanced accuracy is the compactness of its feature representations, which helps tease functionally relevant topological patterns apart from noise in the data. To assess this aspect in isolation, we applied Mashup to individual networks without integration. We still observed significantly better (rank-sum p-value < 0.01) prediction performance as compared to the single-network method DSD on all but one network (Figure 3). As additional evidence, we observed that even a favorably modified DSD, which uses log-transformed diffusion states as features to train SVM classifiers to closely approximate Mashup without dimensionality reduction, still achieved significantly lower accuracy than Mashup, which further corroborates the importance of the compactness of Mashup representations (Figure S6). Furthermore, randomly perturbing the network structure led to smaller changes in pairwise topological similarities between genes for Mashup features, compared to high-dimensional diffusion states used by DSD (Figure S7). This result demonstrates Mashup’s greater robustness to noise. We would like to emphasize that integrating all networks from STRING results in higher function prediction performance than any single network alone (Figure 3), which underscores the significance of integrating various types of data sources for understanding the functional roles of genes or proteins.
Figure 3. Integrating multiple networks outperforms individual networks in gene function prediction.
We measured Mashup’s five-fold cross validation performance by micro-averaged area under the precision-recall curve (AUPR). Performance on each individual network in STRING (gray shaded) is compared to using all networks simultaneously (Integrated, red shaded). The results of applying a diffusion-based, single-network method, DSD, to each network type are also shown (white shaded). Asterisks represent individual networks where Mashup outperformed DSD (one-sided rank-sum p-value < 0.01). Results are summarized over 10 trials.
Taken together, these results suggest that the key advances of Mashup—simultaneously capturing the patterns of multiple interaction networks by learning compact, canonical representations of topology—lead to substantially more accurate prediction of gene function than previous approaches.
Further comparisons to other data integration methods that have not previously been systematically evaluated for the task of function prediction are provided in Figure S1. In particular, we compared Mashup to a recently proposed matrix factorization-based approach, CMF (Žitnik et al., 2015), which views heterogeneous data matrices as relations between different object types that can be approximated via a low-rank factorization. While straightforward CMF makes limited use of network data, treating it only as additional constraints on the parameters to be learned, we considered a favorably modified CMF that directly factorizes the network data (i.e., more similar to Mashup) and found that Mashup significantly outperforms this approach as well (Figure S1).
More Precise Reconstruction of Gene Ontology
In addition to refining our functional knowledge of proteins via automatic function annotation, which assumes a predetermined set of functional categories, molecular networks can be used to guide the identification of functional categories and their hierarchical organization—widely known as gene ontology. Building an entire ontology based on only high-throughput interactome data—an approach recently pioneered by Dutkowski et al. (2013)—circumvents the inconsistencies and biases that are typically introduced by the manual curation process underlying existing ontology databases (e.g., Gene Ontology (GO) database [Ashburner et al., 2000]). Therefore, such unbiased approaches can produce valuable hypotheses for enhancing and expanding existing ontologies.
Dutkowski et al. (2013) used a graph-based agglomerative clustering algorithm (Park and Bader, 2011) to extract a hierarchy of gene clusters from an interaction network, where each cluster is viewed as a putative functional category. The resulting data-driven ontology, called NeXO, was then provided as input to an ontology alignment algorithm developed by the same researchers to show substantial overlap with the GO database. More recently, a new algorithm based on maximal clique detection, named CliXO (Kramer et al., 2014), was proposed as an alternative approach that can better handle weighted interaction networks. Motivated by the observation that Mashup’s integrated topological features are highly predictive of gene function, we set out to test whether clustering Mashup features in lieu of the original input networks would result in more accurate gene ontology than NeXO and CliXO. Both methods, unlike Mashup, take a single combined network as input, which obscures the fine-grained topological patterns that are specific to individual networks.
We first extracted compact topological representations with Mashup from the same set of four binary yeast PPI networks used by Dutkowski et al. (2013), which consists of a physical interaction network, a genetic interaction network, a co-expression network, and a functional association network from YeastNet (Kim et al., 2013). While Dutkowski et al. (2013) simply took the union of all edges to construct a combined network for clustering, we used topological features from Mashup’s integration to construct a gene ontology via a standard hierarchical agglomerative clustering algorithm. The clustering was followed by a post-processing step analogous to NeXO’s in order to introduce multi-way joins and multiple parents, which are common in real ontologies (Method Details). With our Mashup-based ontology, we achieved substantially better agreement with GO than NeXO (in F1 score, which measures the harmonic mean of precision and recall) for molecular function (MF) and cellular component (CC) ontologies, and comparable performance for biological process (BP) (Figure 4a). Overall, Mashup achieved a combined alignment score of 0.33 (geometric mean of F1 scores in three ontology categories), which was significantly higher than NeXO’s (0.24). To compare with CliXO (for weighted interaction networks), we applied Mashup to six weighted PPI networks, excluding text-mining, from STRING (Franceschini et al., 2013) and similarly constructed an ontology via hierarchical clustering. For CliXO, we uniformly combined the networks by Bayesian integration (following STRING’s approach) to construct a single integrated network as input. Even with an optimized parameter setting for CliXO (Method Details), the Mashup-based ontology achieved substantially higher alignment scores than CliXO in all three ontology categories: Mashup had a combined score of 0.35, whereas CliXO had 0.18 (Figure 4b).
Furthermore, we observed similar improvement for Mashup on the YeastNet networks (Kim et al., 2013), the original data set used by Kramer et al. (2014) to evaluate CliXO (Figure S8).
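The clustering step can be illustrated with a small sketch of hierarchical agglomerative clustering on feature vectors. The linkage criterion and distance used by the authors are specified in Method Details rather than here, so average linkage with Euclidean distance is an assumption, as are the function names and the toy features. The post-processing that introduces multi-way joins and multiple parents is not included.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_clustering(vectors):
    """Average-linkage clustering; returns the merges in the order they occur.

    Each merge is (members_of_left, members_of_right, linkage_distance).
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_distances = [
                    euclidean(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                ]
                linkage = sum(pair_distances) / len(pair_distances)
                if best is None or linkage < best[0]:
                    best = (linkage, a, b)
        linkage, a, b = best
        merges.append((clusters[a], clusters[b], linkage))
        merged = clusters[a] + clusters[b]
        # Remove the two merged clusters (b first, since b > a) and add their union.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return merges


# Hypothetical 2-dimensional gene features: two tight pairs far apart.
toy_features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
toy_merges = agglomerative_clustering(toy_features)
```

Each merge in the returned list corresponds to one internal node of a binary hierarchy, and each internal node can be read as a candidate functional category containing the genes below it.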
Figure 4. Mashup improves network-based gene ontology reconstruction.
Mashup features extracted from (a) the network data used to generate NeXO or (b) STRING networks were hierarchically clustered to generate an ontology, which is aligned using the same algorithm as NeXO/CliXO to biological process (BP), molecular function (MF), and cellular component (CC) ontologies in the GO database. We compared the alignment quality of Mashup-based ontologies to NeXO and CliXO on the respective data sets. Following a previous work (Dutkowski et al., 2013), we measured the alignment quality as the harmonic mean of the fraction of terms in the reconstructed ontology and the fraction of terms in GO that are aligned. The overall score is calculated as the geometric mean of the three scores for different GO types. (c) Breakdown of the number of terms in Mashup-based ontology using STRING networks aligned to GO (at FDR=10%).
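The two scores defined in this caption reduce to a harmonic mean and a geometric mean, sketched below. The function names and the term counts in the example are hypothetical and are not taken from the paper's results.

```python
def alignment_f1(aligned_in_reconstruction, total_in_reconstruction,
                 aligned_in_go, total_in_go):
    """Harmonic mean of the two aligned-term fractions."""
    frac_reconstructed = aligned_in_reconstruction / total_in_reconstruction
    frac_go = aligned_in_go / total_in_go
    if frac_reconstructed + frac_go == 0:
        return 0.0
    return 2 * frac_reconstructed * frac_go / (frac_reconstructed + frac_go)


def overall_score(bp, mf, cc):
    """Geometric mean of the three per-ontology alignment scores."""
    return (bp * mf * cc) ** (1.0 / 3.0)


# Hypothetical counts: 50 of 100 reconstructed terms and 50 of 200 GO terms are aligned.
example_f1 = alignment_f1(50, 100, 50, 200)

# Hypothetical per-ontology scores combined into one overall score.
example_overall = overall_score(0.2, 0.4, 0.8)
```

Because the geometric mean is pulled down sharply by any small factor, an ontology must align reasonably well in all three GO categories to obtain a high overall score.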
These results demonstrate that Mashup’s integration of topological features enables more precise identification of functionally coherent gene sets than the state-of-the-art approaches, which rely on a single combined network where network-specific topological information is obscured. Note also that Mashup allows the use of a simple off-the-shelf clustering algorithm through its convenient vector representation of topology.
Improved Prediction of Genetic Interaction and Drug Efficacy
A critical step towards attaining a thorough understanding of how genes carry out their biological function in a cell is to tease apart their sophisticated interplay with other genes or proteins. Synthetic lethality (SL) and synthetic dosage lethality (SDL) describe a particular type of interaction between genes where an otherwise non-essential gene becomes essential (i.e., its deletion reduces cell viability) given the deletion (SL) or over-expression (SDL) of another gene. SL interactions can reveal inherent redundancy in the genetic program, and SDL interactions, dosage dependence of gene products. There has been great interest in identifying SL or SDL interactions due to their clinical significance; these interactions can lead to the discovery of novel drugs for targeted therapies, where SL or SDL interaction partners of genes selectively deleted or over-expressed in disease cells are targeted (Chan and Giaccia, 2011). However, experimentally interrogating the presence of an interaction between every pair of genes is infeasible, and thus it is essential to develop computational approaches for predicting candidate interactions with high accuracy.
Several prediction methods have focused on the use of PPI networks either exclusively (Paladugu et al., 2008) or in conjunction with other types of information, such as gene expression or functional annotation (Pandey et al., 2010; Wong et al., 2004). The key insight behind these approaches is that observing a genetic interaction between genes A and B increases the likelihood of an interaction between A and other genes that are functionally similar to B. This kind of information transfer can be effectively derived from the topology of interactomes, as we demonstrated in the above applications of Mashup. Notably, Paladugu et al. (2008) used a manually curated list of conventional graph theoretic measures (e.g., degree, closeness, betweenness centralities) to train support vector machine (SVM) classifiers and demonstrated that analyzing the topology of a PPI network alone can be effective for predicting genetic interactions.
Here we asked whether the compact topological representation learned by Mashup can be used to further improve prediction of genetic interactions. To this end, we adopted the same prediction framework used by Paladugu et al. (2008) and measured, via cross validation, the impact of substituting Mashup’s topological representations for their curated topological features (Method Details). We observed that Mashup’s compact representation consistently outperforms manually curated topological features (referred to as GTM) for predicting both SL and SDL interactions in a real human data set (Figure 5a). Mashup achieved an average area under the precision-recall curve (AUPR) of 0.59 for SL and 0.51 for SDL, whereas GTM achieved 0.44 and 0.39, respectively. Mashup’s performance was highly consistent across a wide range of choices for the dimensionality of our learned representation (Figure S5). Furthermore, Mashup achieved substantially better accuracy than a recent approach that uses known GO annotations of each gene pair as input features for random forest classifiers (Yu et al., 2016; referred to as Ontotype), which was shown to achieve state-of-the-art performance in yeast. We attribute Ontotype’s relatively poor performance on our human data to the sparsity and incompleteness of functional annotations in human; this finding further highlights the strength of our approach where functional relationships are directly inferred from interactomes as opposed to relying on curated annotations. Additionally, we observed that Mashup achieves better overall performance, albeit by a small margin (AUPR of 0.13 compared to 0.1), than Ontotype on the original yeast data set (Costanzo et al., 2010) used by Yu et al. (2016; Figure S9).
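For readers unfamiliar with the AUPR metric reported above, a minimal sketch follows. The text does not specify how the area is computed, so the sketch uses average precision, one common step-wise estimate of the area under the precision-recall curve. The function name and the toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Area under the precision-recall curve, approximated as average precision.

    scores: predicted scores (higher means more likely to interact).
    labels: 1 for a known interaction, 0 otherwise.
    Tied scores are not treated specially; they keep their input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at each newly recalled positive
    return precision_sum / total_positives


# Hypothetical scores for four candidate gene pairs, two of which truly interact.
toy_aupr = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A perfect ranking, with all true interactions placed ahead of all non-interactions, yields a value of 1, while the expected value for a random ranking is roughly the fraction of positive pairs, which is why AUPR is informative when true interactions are rare.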
Figure 5. Mashup improves genetic interaction prediction and drug efficacy prediction.
(a) Cross validation performance of using Mashup representations from human STRING networks in SVM classifiers to predict SL/SDL interactions reported by Jerby-Arnon et al. (2014) as compared to a previous approach that uses various graph-theoretic measures (GTM) as input features instead (Paladugu et al., 2008) and a more recent approach, Ontotype, that uses the combined, known GO annotations of each gene pair as features in an ensemble of decision trees (Yu et al., 2016). Area under the precision-recall curve (AUPR) is used as the performance metric. (b) Number of single-target drugs in the CGP data (Garnett et al., 2012) whose efficacy is predicted with statistical significance at varying FDR levels based on SDL interactions originally identified by DAISY (Jerby-Arnon et al., 2014) or top 5, 10, 20 candidate interactions for each drug target predicted by Mashup-based classifiers using DAISY interactions as training data. (c) An illustration of a putative SDL interaction between TOP1 and TXNL1 predicted by Mashup and its associated literature evidence.
Notably, our cross-validation was based on the genetic interactions reported by Jerby-Arnon et al. (2014) as ground truth; these were computationally identified based on a diverse set of high-throughput experimental data, including somatic mutations, copy number alterations, gene expression, and shRNA-based functional screening data. Therefore, most of the “known” interactions in our data were not individually validated in greater depth, which raised a potential concern that Mashup’s improvement could be due to statistical artifacts in the high-throughput data. To address this concern, we tested whether Mashup’s predicted interactions produce meaningful predictions on an independent biological data set. In particular, we considered the task of efficacy prediction of cancer drugs; if a drug targets a gene with an SDL interaction, the expression level of the interaction partner is expected to correlate with the efficacy of the drug, which allows us to indirectly validate our predicted SDL interactions by analyzing their ability to predict drug efficacy.
We obtained the efficacy profiles of 50 drugs with single protein targets from the Cancer Genome Project (CGP) (Garnett et al., 2012) over 639 human cancer cell lines. Using an unsupervised prediction framework similar to the one employed by Jerby-Arnon et al. (2014), we calculated how many drug response profiles could be predicted, with statistical significance, using the top SDL interactions predicted by Mashup for each drug (Method Details). We were able to predict many more drugs as compared to using only the interactions identified by DAISY (Jerby-Arnon et al., 2014) (Figure 5b). For instance, at a false discovery rate of 10% and based on the top 5 candidate interactions, Mashup significantly predicted the efficacy of 7 drugs while DAISY interactions could only predict one. A potential reason for the response of the majority of drugs not being explained in our analysis, as previously noted by Jerby-Arnon et al. (2014), is that many factors other than the essentiality of the drug target, including cell membrane permeability, influence drug efficacy. Note that only 11 out of the 50 single-target drugs had at least one reported SDL interaction. Furthermore, 4 out of 7 drugs whose efficacy we significantly predicted did not have any known SDL interaction partners, which suggests that our classifier was able to produce meaningful predictions even for genes that were not included in the training data. The list of drugs whose response profiles were significantly predicted by Mashup and their top candidate interactions used for efficacy prediction are provided in Table S2.
TOP1-TXNL1 is one particularly convincing candidate interaction we identified using Mashup that was not significantly identified by other computational methods (Figure 5c). TXNL1, one of the top 5 candidate interactors for TOP1 (DNA topoisomerase I), has strong evidence in the literature that supports the interaction between the two genes with respect to the drug camptothecin, which targets TOP1. Topoisomerase I normally binds to DNA during transcription to control the topological states of DNA strands. However, in the presence of camptothecin, a normally transient topoisomerase I-DNA complex becomes persistent, resulting in a toxic lesion commonly known as a suicide complex. It has been noted in the literature that a gene named XRCC1 is involved in the repair of TOP1 suicide complexes (Plo et al., 2003). Furthermore, TXNL1 has recently been observed to down-regulate XRCC1 in a gastric cancer cell line (Xu et al., 2014). This finding implies that higher expression of TXNL1 likely indicates lower levels of XRCC1, which increases the vulnerability of cells to camptothecin-induced DNA damage. Consistent with the literature, the efficacy of camptothecin was significantly predicted in our experiments with TXNL1 as one of the predictors in two independent samples (Table S2). In one of the replicates, the expression level of TXNL1 had strong marginal correlation with the efficacy of camptothecin (Spearman correlation p-value = 5.16 × 10^−4). This example illustrates the unique potential of Mashup-based functional inference to produce new biological insights by effectively integrating various types of network data.
Discussion
We have presented Mashup, an integrative framework for analyzing the topology of multiple interaction networks from heterogeneous data sources, which can be used to infer various functional properties of genes or proteins. Mashup characterizes the topology of individual networks by diffusion and then computes compact but highly informative vector representations for nodes in the networks to approximate the diffusion patterns jointly for all networks. We have demonstrated the wide applicability of Mashup in exploiting functional topology in interaction networks by accurately predicting gene function, reconstructing the gene ontology hierarchy, and predicting genetic interactions from heterogeneous network data. We also demonstrated substantial improvements over previous approaches for each application.
While we have showcased the effectiveness of Mashup as a plug-in architecture for standard tools, such as SVM classifiers and hierarchical clustering, we note that our framework readily allows the use of more sophisticated methods and that such a direction is likely important for further improving performance for many applications. For instance, we recently developed an improved protein function prediction algorithm based on Mashup that exploits the semantic similarity between different functional categories from the ontology hierarchy, which led to significantly better predictions in sparsely annotated GO categories (Wang et al., 2015).
This work was initially inspired by a related line of research in natural language processing. In their seminal paper, Mikolov et al. (2013) introduced a framework that takes a corpus of text documents and gives each word a vector representation based on pairwise co-occurrence patterns and showed that the learned vectors capture semantics of words. In our work, we view genes as words and network diffusion as a way to characterize “co-occurrence” of genes in interaction networks to adapt Mikolov et al.’s idea to real biological networks and demonstrate that the learned features similarly represent functional properties of genes with high accuracy. Mashup generalizes Mikolov et al.’s approach to heterogeneous data sets where the co-occurrence patterns are subdivided into different contexts (different networks). An exciting future direction would be to apply our insights to improve text analysis where different “types” of documents are used to construct more fine-grained co-occurrence data, based on which semantic relationships among words can be more accurately inferred.
For applications in biology, another important direction to explore is incorporating into Mashup other types of information that are not commonly represented as networks, such as sequence, evolutionary, or biochemical properties of individual genes or proteins. Conveniently, this non-network information can be incorporated into our framework in a straightforward manner as additional entries in the feature representation. We also want to emphasize that there is ample opportunity to apply Mashup to other network-based applications, including but not limited to: inter-species network alignment (Liao et al., 2009; Milenković et al., 2010; Singh et al., 2008), protein complex detection (Nepusz et al., 2012), and drug-target interaction prediction (Cheng et al., 2012). Mashup is a versatile tool that provides an effective, unified and scalable framework for data integration in diverse applications.
STAR Methods
Contact for Reagent and Resource Sharing
Jian Peng, jianpeng@illinois.edu, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
Method Details
The Mashup Framework
Random walk with restart review
The random walk with restart (RWR) method is well established for analyzing network structure. By allowing the random walk to restart at each step with a fixed probability, RWR takes into account both local and global topology to identify the relevant or important nodes in the network. Let A denote the adjacency matrix of a (weighted) molecular interaction network G = (V, E) with n nodes, each denoting a gene or a protein. Each entry $B_{ij}$ in the transition probability matrix B, which stores the probability of a transition from node j to node i, is computed as
$$B_{ij} = \frac{A_{ij}}{\sum_{i'} A_{i'j}}.$$
Formally, the RWR from a node i is defined as
$$s_i^{t+1} = (1 - p_r)\, B\, s_i^{t} + p_r\, e_i,$$
where $p_r$ is the probability of restart, which controls the relative influence of local and global topological information in the diffusion (a higher restart probability places greater emphasis on local structure); $e_i$ is an n-dimensional distribution vector with $e_i(i) = 1$ and $e_i(j) = 0$ for all $j \neq i$; and $s_i^{t}$ is an n-dimensional distribution (column) vector in which each entry holds the probability of a node being visited after t steps of the random walk starting from node i. The first term in the above update corresponds to following a random edge connected to the current node, while the second term corresponds to restarting from the initial node i. At the fixed point of this iteration we obtain the stationary distribution $s_i^{\infty}$. Consistent with previous work (Cao et al., 2014), we define the diffusion state $s_i = s_i^{\infty} \in \Delta_n$ of each node i to be the stationary distribution of the RWR starting at that node, where $\Delta_n$ denotes the n-dimensional probability simplex. Intuitively, the jth element, $s_{ij}$, represents the probability of the RWR starting at node i ending up at node j in equilibrium. When the diffusion states of two nodes are close to one another, the nodes are in similar positions within the graph with respect to other nodes, which might suggest functional similarity. This insight provided the basis for several diffusion-based methods (Cao et al., 2014; Macropol et al., 2009; Wang et al., 2012) that aim to predict characteristics of genes or proteins by using the diffusion states to better capture topological associations. Instead of simply using the probabilities in the diffusion state, the diffusion state distance (DSD) approach, which uses L1 distances between diffusion states, achieved state-of-the-art performance in predicting protein function on yeast interactomes (Cao et al., 2014).
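As a minimal sketch of the RWR iteration described above (follow a random edge with probability 1 − pr, restart at the starting node with probability pr), the diffusion state of a node can be computed by repeating the update until it stops changing. The function names, the convergence tolerance, and the handling of isolated nodes are illustrative assumptions, not part of the original implementation.

```python
def transition_matrix(A):
    """Column-normalize A so that B[i][j] is the probability of stepping from node j to node i."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    # Assumption: every node has at least one edge; a zero column is left as zeros.
    return [[A[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def diffusion_state(A, start, p_restart=0.5, tol=1e-12, max_iter=1000):
    """Iterate s <- (1 - p_restart) * B s + p_restart * e_start until the L1 change is below tol."""
    n = len(A)
    B = transition_matrix(A)
    e = [1.0 if j == start else 0.0 for j in range(n)]
    s = list(e)
    for _ in range(max_iter):
        nxt = [(1.0 - p_restart) * sum(B[i][j] * s[j] for j in range(n)) + p_restart * e[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    return s


# Toy path graph 0 - 1 - 2 (hypothetical example data).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
s0 = diffusion_state(A, start=0)
```

On this toy graph the walk started at node 0 keeps most of its mass on node 0 and its neighbor, which is the local emphasis that the restart probability is meant to control.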
Novel dimensionality reduction
A key observation behind our approach is that the diffusion states obtained in this manner are still noisy, in part due to their high dimensionality and the incompleteness of the original network data. With the goal of noise and dimensionality reduction, we approximate each diffusion state $s_i$ with a multinomial logistic model based on a latent vector representation of nodes that uses far fewer dimensions than the original, n-dimensional state. Specifically, we compute the probability assigned to node j in the diffusion state of node i as
$$\hat{s}_{ij} = \frac{\exp(x_i^{\top} w_j)}{\sum_{j'} \exp(x_i^{\top} w_{j'})},$$
where $w_i, x_i \in \mathbb{R}^d$ for all i, with $d \ll n$. Each node is given two vector representations, $w_i$ and $x_i$. We refer to $w_i$ as the context feature and $x_i$ as the node feature of node i, both capturing the intrinsic topological properties of the network. If $x_i$ and $w_j$ are close in direction and have a large inner product, node j should be frequently visited in the random walk starting from node i. Ideally, if the vector representations w and x capture fine-grained topological properties, we can use them to retrieve genes with similar functions or as features for other network-based machine learning applications. While it is possible to enforce equality between these two vectors, decoupling them leads to a more manageable optimization problem and also allows our framework to be readily extended to the multiple-network case, which is described in the next section.
Given this model, we formulate the following optimization problem, which takes a set of observed diffusion states $s = \{s_1, \ldots, s_n\}$ as input and finds the low-dimensional vector representations of nodes w and x that best approximate s according to the multinomial logistic model. To obtain w and x for all nodes, we minimize the KL-divergence (relative entropy), a natural choice for comparing probability distributions, between the observed and model diffusion states:
$$\min_{w, x} \; C(s, \hat{s}) = \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}(s_i \,\|\, \hat{s}_i).$$
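By writing out the definition of relative entropy and ŝ, we can express the objective as
$$C(s, \hat{s}) = \frac{1}{n} \sum_{i=1}^{n} \left[ -H(s_i) - \sum_{j=1}^{n} s_{ij} \left( x_i^{\top} w_j - \log \sum_{j'=1}^{n} \exp(x_i^{\top} w_{j'}) \right) \right],$$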
where H(·) denotes the entropy.
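The model and objective above can be sketched in a few lines: a softmax over inner products gives the model distribution for each node, and the objective averages the KL-divergence between observed and model diffusion states. This is an illustrative sketch only; the function names and the 1/n averaging are assumptions, and real use would rely on a vectorized numerical library.

```python
import math


def model_distribution(x_i, W):
    """Softmax over the inner products x_i . w_j for every context vector w_j in W."""
    scores = [sum(a * b for a, b in zip(x_i, w_j)) for w_j in W]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scores]
    z = sum(exps)
    return [v / z for v in exps]


def kl_divergence(s, s_hat):
    """Relative entropy D_KL(s || s_hat); terms with s_j = 0 contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(s, s_hat) if p > 0)


def objective(S, X, W):
    """Average KL-divergence between observed states S[i] and the model built from X and W."""
    n = len(S)
    return sum(kl_divergence(S[i], model_distribution(X[i], W)) for i in range(n)) / n
```

With all-zero node features the model distribution is uniform, so the objective is zero exactly when every observed diffusion state is uniform as well.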
Novel integration of heterogeneous networks
We can naturally extend our dimensionality reduction framework to integrate network data from diverse sources. We first perform random walks on each individual network k separately and obtain network-specific diffusion states $s_i^{k}$ for each node i. We also construct the multinomial distribution $\hat{s}_i^{k}$ from the following logistic model
$$\hat{s}_{ij}^{k} = \frac{\exp(x_i^{\top} w_j^{k})}{\sum_{j'} \exp(x_i^{\top} w_{j'}^{k})},$$
where for each node i in network k, we assign a network-specific context vector representation $w_i^{k}$, which encodes the intrinsic topological properties of network data set k; the node features x are shared across all K networks. Finally, we jointly optimize the objective function
$$\min_{w, x} \; C(s, \hat{s}) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} D_{\mathrm{KL}}(s_i^{k} \,\|\, \hat{s}_i^{k})$$
and use the optimized node features x for various functional inference tasks. Note that it is possible to weight the divergence term for each network differently, but we give equal importance to each network in this work for simplicity. In addition, while here we assume that all networks are defined over the same set of nodes, given overlapping but different node sets one can take the union of distinct nodes and augment each network with missing nodes to unify the node sets. We believe this approach is preferable to taking the intersection of node sets, as paths over the nodes that are missing in another network could still contain useful topological information to be captured by our diffusion process.
Implementation Details
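To optimize the objective function of Mashup, we computed the gradients with respect to the parameters w and x:
$$\nabla_{w_j^{k}} C = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{s}_{ij}^{k} - s_{ij}^{k} \right) x_i, \qquad \nabla_{x_i} C = \frac{1}{n} \sum_{k=1}^{K} \sum_{j=1}^{n} \left( \hat{s}_{ij}^{k} - s_{ij}^{k} \right) w_j^{k}.$$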
Both the objective function and the gradients can be computed in $O(n^2 dK)$ time. We used a standard quasi-Newton method, L-BFGS (Zhu et al., 1997), with these gradients to find the low-dimensional vector representations corresponding to a local optimum of our optimization problem. We used uniform random numbers from [−0.05, 0.05] to initialize the vectors and observed that this consistently leads to good solutions.
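The following sketch shows one way the multi-network objective and its gradients could be computed, with the triple loop making the O(n² dK) cost explicit. It assumes the objective is the average KL-divergence over nodes, summed over networks, and that each diffusion state sums to one; the data layout (`S[k][i][j]`, `X[i]`, `W[k][j]`) and function names are hypothetical. The resulting gradients could be handed to any quasi-Newton optimizer.

```python
import math


def softmax_row(x_i, W_k):
    """Model distribution for node i in one network: softmax of x_i . w_j over all j."""
    scores = [sum(a * b for a, b in zip(x_i, w)) for w in W_k]
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    z = sum(exps)
    return [v / z for v in exps]


def mashup_objective(S, X, W):
    """S[k][i][j]: diffusion states; X[i]: shared node features; W[k][j]: context features."""
    n = len(X)
    total = 0.0
    for k in range(len(S)):
        for i in range(n):
            s_hat = softmax_row(X[i], W[k])
            total += sum(p * math.log(p / q) for p, q in zip(S[k][i], s_hat) if p > 0)
    return total / n


def mashup_gradients(S, X, W):
    """Gradients of the objective with respect to X and W, using residuals (s_hat - s)."""
    n, d, K = len(X), len(X[0]), len(S)
    grad_X = [[0.0] * d for _ in range(n)]
    grad_W = [[[0.0] * d for _ in range(n)] for _ in range(K)]
    for k in range(K):
        for i in range(n):
            s_hat = softmax_row(X[i], W[k])
            for j in range(n):
                r = (s_hat[j] - S[k][i][j]) / n
                for c in range(d):
                    grad_X[i][c] += r * W[k][j][c]
                    grad_W[k][j][c] += r * X[i][c]
    return grad_X, grad_W
```

A finite-difference comparison on a small instance is a cheap way to confirm that analytic gradients of this kind match the objective before running a full optimization.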
For the human data set, we used an alternative objective that allows for more efficient optimization based on singular value decomposition (SVD), in order to cope with the large number of genes. We first concatenated the diffusion states for each network k to form an n × n diffusion state matrix $S_k$, where $s_i^{k}$ is the ith column. Then, we concatenated the resulting matrices to obtain an nK × n matrix S. We took the logarithm of each element to obtain $\tilde{S}$ and performed truncated SVD on $\tilde{S}$ (with a user-specified number of components) to get a low-rank factorization $U \Sigma V$. We assigned the columns of $\Sigma^{1/2} U^{\top}$ to $\{w_i^{k}\}$ and the columns of $\Sigma^{1/2} V$ to $\{x_i\}$. Intuitively, this corresponds to a solution that minimizes the difference between the observed and model distributions as measured by the L2 norm in log space. A small smoothing constant (e.g., the reciprocal of the number of genes) was added to each entry in S to avoid taking the log of zero. Further performance optimization can be achieved by calculating the top eigenvectors of $R = \sum_{k=1}^{K} \tilde{S}_k^{\top} \tilde{S}_k$, whose eigenvectors correspond to the right singular vectors of $\tilde{S}$ (i.e., $\{x_i\}$). Note that $\tilde{S}_k$ denotes the log-transformed matrix of $S_k$. In order to calculate R, one needs to keep only a single network in memory at a time, which reduces the memory footprint of this approach from $O(n^2 K)$ to $O(n^2)$, thereby allowing Mashup to easily scale to a large number of networks. Thus, besides the time it takes to run RWR on each network, which scales linearly with the number of networks and is typically very fast, the running time of calculating Mashup representations via SVD from the diffusion states is constant with respect to the number of networks.
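A minimal NumPy sketch of the SVD-based variant is given below. It accumulates the n × n matrix of summed log-transformed Gram products one network at a time and takes its top eigenvectors, which is assumed here to match the memory-saving strategy described above. The function name `mashup_svd`, the default smoothing of 1/n, and the use of `numpy.linalg.eigh` are illustrative choices rather than the authors' implementation.

```python
import numpy as np


def mashup_svd(diffusion_mats, d, smoothing=None):
    """Return a d x n matrix whose i-th column is the feature vector x_i.

    diffusion_mats: iterable of n x n arrays S_k whose i-th column is the
    diffusion state of node i in network k.
    """
    R = None
    for S_k in diffusion_mats:                # only one network is processed at a time
        n = S_k.shape[0]
        eps = 1.0 / n if smoothing is None else smoothing
        S_log = np.log(S_k + eps)             # smoothing avoids log(0)
        gram = S_log.T @ S_log
        R = gram if R is None else R + gram
    eigvals, eigvecs = np.linalg.eigh(R)      # R is symmetric; eigenvalues ascend
    top = np.argsort(eigvals)[::-1][:d]       # indices of the d largest eigenvalues
    sigma = np.sqrt(np.clip(eigvals[top], 0.0, None))  # singular values of the stacked log matrix
    V = eigvecs[:, top]                       # n x d, eigenvectors as columns
    return np.sqrt(sigma)[:, None] * V.T      # scale each component by sigma^(1/2)


# Hypothetical toy input: two random column-stochastic 6 x 6 matrices.
rng = np.random.default_rng(0)
mats = []
for _ in range(2):
    M = rng.random((6, 6))
    mats.append(M / M.sum(axis=0, keepdims=True))
X = mashup_svd(mats, d=3)
```

Individual eigenvectors are only defined up to sign, so comparisons against a direct SVD of the stacked matrix are best made on sign-invariant quantities such as the Gram matrix of the features.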
All of our experiments used restart probability of 0.5 for RWR, and unless otherwise noted, we used 500-dimensional vectors for yeast networks and 800-dimensional vectors for human networks, roughly corresponding to 5–10% of the original number of dimensions. However, we observed that our results are robust to the choice of these parameters (Figures S4 and S5).
Networks and Functional Annotations
We obtained a collection of protein-protein interaction (PPI) networks of yeast and human from the STRING database v9.1 (Franceschini et al., 2013), which is based on a variety of data sources, including high-throughput interaction assays, curated PPI databases, and conserved co-expression. We excluded the network constructed from text mining of academic literature to prevent confounding caused by links based on functional similarity. The resulting collection consisted of six heterogeneous networks over 6,400 genes with the number of edges varying from 1,361 to 314,013 for yeast, and 18,362 genes with the number of edges varying from 3,717 to 1,544,348 for human. Every edge in these networks is associated with a weight between 0 and 1 representing the probability of edge presence, which we factor into the calculation of transition probabilities in the random walk process.
We downloaded functional annotations from Munich Information Center for Protein Sequences (MIPS) (Ruepp et al., 2004) for yeast and the Gene Ontology database (GO; Ashburner et al., 2000) for human. The functional categories in MIPS are organized in a three-layered hierarchy, where the top level (Level 1) consists of the 17 most general functional categories, the second level (Level 2) consists of 74, and the third (Level 3) consists of the 154 most specific categories. We grouped the GO terms for human in a similar fashion to obtain three distinct levels of functional categories of varying specificity, each containing GO terms with 11–30, 31–100, and 101–300 genes, respectively. Within each level, we iteratively removed categories that had Jaccard similarity greater than 0.1 with another category in the same level in order to avoid statistical artifacts arising from overlapping functional categories.
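The overlap filter described above can be sketched as a greedy pass over the categories in one level. The text does not specify the order in which categories are removed, so the keep-first rule used here is an assumption, as are the function names and the toy data.

```python
def jaccard(a, b):
    """Jaccard similarity between two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_overlapping(categories, threshold=0.1):
    """Keep a category only if its Jaccard similarity to every kept category is <= threshold.

    categories: dict mapping category name -> iterable of gene identifiers.
    Assumption: categories are visited in insertion order and earlier ones win.
    """
    kept = {}
    for name, genes in categories.items():
        if all(jaccard(genes, other) <= threshold for other in kept.values()):
            kept[name] = set(genes)
    return kept


# Hypothetical toy categories.
level = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {10, 11}}
filtered = filter_overlapping(level)
```

Here category B is dropped because it shares three of five distinct genes with A, while the disjoint category C is retained.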
Gene Function Prediction
To predict gene function using the topological feature representations obtained by Mashup, we formulated the task as a multi-label classification problem and applied an off-the-shelf support vector machine (SVM) toolbox, LIBSVM (Chang and Lin, 2011). We trained a binary classifier for each function and obtained per-class probability scores for each gene in the validation set. We used the standard radial basis function (RBF) kernel for the SVMs and performed a nested five-fold cross-validation within the training data to select the optimal parameters via grid search.
The performance of the baseline methods, DSD (Cao et al., 2014) and GeneMANIA (Mostafavi and Morris, 2010), was obtained as follows. For DSD, following the suggestions of the original paper, we obtained the diffusion states using RWR with a restart probability of 0.5 and used an L1 distance-based weighted majority voting scheme, in which the labels assigned to the k most similar genes are combined using the reciprocal of the distance between diffusion states as weights to produce per-label confidence scores (k = 10). Since DSD takes a single network as input, we used STRING's approach to integrate the networks as a preprocessing step for DSD: we assign $1 - \prod_{k} (1 - p_{ij}^{k})$ as the probability of each edge (i, j) in the combined network, where $p_{ij}^{k}$ is the probability associated with the edge (i, j) in network k. For GeneMANIA, we downloaded the MATLAB implementation online (http://morrislab.med.utoronto.ca/~sara/SW) and applied it to our data set. Since GeneMANIA generates predicted scores for genes that are not directly comparable across different functional labels, we applied the standard Platt calibration (Lin et al., 2007; Platt, 1999), based on the same implementation as the one provided in LIBSVM, to transform the GeneMANIA scores into probability scores before evaluating them in the same manner as other methods.
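The DSD baseline can be illustrated with two small helpers: one that merges per-network edge probabilities, and one that performs the distance-weighted vote over the k nearest labeled genes. The sketch assumes the combined probability is one minus the product of the per-network complements, and it caps the weight for zero distances; both choices, and all names, are illustrative.

```python
def combine_edge_probabilities(probs):
    """Probability that an edge is present in at least one network: 1 - prod(1 - p_k)."""
    remaining = 1.0
    for p in probs:
        remaining *= (1.0 - p)
    return 1.0 - remaining


def l1_distance(u, v):
    """L1 distance between two diffusion states."""
    return sum(abs(a - b) for a, b in zip(u, v))


def dsd_vote(query_state, labeled_states, labels, k=10):
    """Weighted majority vote over the k labeled genes closest to the query.

    labeled_states: dict gene -> diffusion state; labels: dict gene -> set of labels.
    Each neighbor votes for its labels with weight 1 / distance.
    """
    neighbors = sorted((l1_distance(query_state, st), g) for g, st in labeled_states.items())
    scores = {}
    for dist, g in neighbors[:k]:
        weight = 1.0 / max(dist, 1e-12)  # guard against division by zero (assumption)
        for lab in labels[g]:
            scores[lab] = scores.get(lab, 0.0) + weight
    return scores
```

For example, two independent networks that each report an edge with probability 0.5 yield a combined probability of 0.75.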
For each method, we repeatedly held out 20% of the annotated genes as the validation set and used the remaining 80% to predict their functions. We used four different metrics to evaluate prediction performance: (i) Accuracy is measured by assigning the top predicted function to each gene in the validation set and measuring how often that prediction is one of the known functions of the gene. (ii) Micro-averaged F1 score is calculated by assigning the top α predictions to each gene, constructing a 2-by-2 contingency table for each function (treating it as a binary classification task), and computing the F1 score (the harmonic mean of precision and recall) on the combined table where each cell is summed across all functions. We used α = 3 for the results presented in this paper, following previous work (Cao et al., 2014; Schwikowski et al., 2000). (iii) Micro-averaged area under the precision-recall curve (m-PR) is calculated by vectorizing the matrix of predicted confidence scores for each gene-functional category pair and measuring the area under the precision-recall curve constructed from the resulting vector, which combines the results from all functional categories. (iv) Macro-averaged area under the precision-recall curve (M-PR) is calculated by computing the area under the PR curve separately for each function and taking the average across all labels. Since M-PR gives equal weight to all labels, it is less prone (compared to m-PR) to potential biases caused by some functional labels being easier to predict. We chose not to consider the receiver operating characteristic (ROC) curve, because its behavior is closely related to the PR curve and the latter is more appropriate for a classification task with a large skew in the class distribution (Davis and Goadrich, 2006), which is the case for gene function prediction.
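The first two metrics can be sketched directly from their definitions. The sketch below assumes predictions are stored as a per-gene dictionary of label scores and that ties are broken by dictionary order; the function names and data layout are hypothetical.

```python
def top_labels(scores, k):
    """Return the k labels with the highest predicted scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def accuracy(pred_scores, truth):
    """Fraction of genes whose single top prediction is one of their known functions."""
    hits = sum(1 for g, sc in pred_scores.items() if top_labels(sc, 1)[0] in truth[g])
    return hits / len(pred_scores)


def micro_f1(pred_scores, truth, alpha=3):
    """F1 on the contingency table summed over all functions, using the top-alpha predictions."""
    tp = fp = fn = 0
    for g, sc in pred_scores.items():
        predicted = set(top_labels(sc, alpha))
        tp += len(predicted & truth[g])
        fp += len(predicted - truth[g])
        fn += len(truth[g] - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy predictions for two genes and three labels.
pred = {"g1": {"a": 0.9, "b": 0.5, "c": 0.1},
        "g2": {"a": 0.2, "b": 0.7, "c": 0.6}}
truth = {"g1": {"a"}, "g2": {"c"}}
```

Summing true positives, false positives, and false negatives over genes is equivalent to summing the per-function contingency tables, which is what makes the score micro-averaged.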
Gene Ontology Reconstruction
We reconstructed the gene ontology based on Mashup’s feature representations using a standard agglomerative hierarchical clustering algorithm (UPGMA; Sneath and Sokal, 1973), using a cosine distance function which is most appropriate given the use of pairwise inner products in the multinomial logistic model used by Mashup to learn the feature representations. The resulting tree of clusters was pruned (cluster size ≥ 3) and aligned to each type of GO ontology (biological process, molecular function, cellular component) using the ontology alignment algorithm proposed by Dutkowski et al. (2013). Following their work, only the statistically significant (FDR = 10%) alignments based on a permutation test were used to assess the level of agreement between GO and the reconstructed ontology, which is measured by the harmonic mean of precision (the fraction of reconstructed ontology terms that are aligned) and recall (the fraction of GO terms that are aligned). Overall alignment score was summarized by taking the geometric mean of the alignment scores of the three types of GO ontology.
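The distance function and the scoring of an alignment can be sketched as follows. The sketch assumes that the number of aligned terms is the same on both sides of the alignment (a one-to-one mapping), and the function names are illustrative; the clustering and alignment steps themselves are not shown.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def alignment_score(n_aligned, n_reconstructed_terms, n_go_terms):
    """Harmonic mean of precision and recall of aligned terms."""
    precision = n_aligned / n_reconstructed_terms
    recall = n_aligned / n_go_terms
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_alignment_score(scores):
    """Geometric mean of the per-ontology alignment scores."""
    return math.prod(scores) ** (1.0 / len(scores))
```

Because the geometric mean is zero whenever any single ontology has a zero score, this summary rewards reconstructions that align reasonably well to all three ontologies.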
For fair comparison with NeXO (Dutkowski et al., 2013), we also implemented heuristics to allow multi-way joins and multiple parents in the reconstructed ontology. Given a cluster, we tested whether pairwise distances within a cluster are significantly smaller than those between the cluster and its sibling, using a one-sided rank-sum test. Intermediary nodes in the tree with insignificant p-values (>0.05) were removed, which induces multi-way joins in the ontology. To allow terms to have more than one parent, we adopted a procedure identical to that of Dutkowski et al. (2013), where an additional link is iteratively added between two terms with significant connectivity that are not already on the same path to root.
To obtain the alignment results for NeXO, we downloaded the reconstructed ontology provided on its website (http://www.nexontology.org) and aligned it to GO ontologies using the original authors’ implementation of the ontology alignment algorithm (https://mhk7.github.io/alignOntology). For CliXO, we downloaded the implementation provided by the authors (https://mhk7.github.io/clixo_0.3). Following the recommendations in the original paper (Kramer et al., 2014), we set β = 0.5 and selected the value of α ∈ {.001, .005, .01, .015, .02, .03, .05} that resulted in the highest alignment score.
Genetic Interaction Prediction
Following previous work (Paladugu et al., 2008), we implemented a support vector machine (SVM) classification framework for genetic interaction prediction. Given Mashup’s topological feature representation of each gene, the feature vector for each pair of genes is constructed by taking the mean and the absolute difference of the gene features. We trained SVM classifiers with the standard radial basis function (RBF) kernel using an off-the-shelf package, LIBSVM (Chang and Lin, 2011), with such features as input to distinguish interacting gene pairs from non-interacting ones. For cross-validation, half of the known interactions and a matching number of sampled non-interactions were used to train the classifiers, and their performance was tested on a data set consisting of the remaining known interactions and a much larger set of non-interactions such that known interactions compose only 5% of the test data. Non-interactions were sampled uniformly at random from the set of gene pairs without known interactions, which relies on the assumption that unobserved interactions make up only a small fraction of these pairs. We used the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR) on the test set as performance metrics. The variance and cost parameters of the SVM were optimized via grid search in a nested cross-validation framework.
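The pair representation described above can be sketched in a few lines. Concatenating the mean and the absolute difference makes the feature vector independent of the order of the two genes. The helper for sizing the test set is an added illustration, and both function names are hypothetical.

```python
def pair_features(x_a, x_b):
    """Concatenate the element-wise mean and absolute difference of two gene feature vectors."""
    mean = [(a + b) / 2.0 for a, b in zip(x_a, x_b)]
    abs_diff = [abs(a - b) for a, b in zip(x_a, x_b)]
    return mean + abs_diff


def negatives_needed(n_positives, positive_fraction=0.05):
    """Number of sampled non-interactions so that positives form the given fraction of the test set."""
    return round(n_positives * (1.0 - positive_fraction) / positive_fraction)


features = pair_features([1.0, 4.0], [3.0, 0.0])
```

With 100 held-out interactions, for instance, 1,900 sampled non-interactions are needed for the known interactions to make up 5% of the test data.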
As a baseline, we considered using as input topological features a manually curated set of graph-theoretic measures used by Paladugu et al. (2008), which includes: degree (number of direct interactions), clustering coefficient (Watts and Strogatz, 1998), closeness centrality (Beauchamp, 1965), normalized betweenness centrality (Freeman, 1977), eigenvector centrality (Bonacich, 1972), stress centrality (Brandes, 2001), bridging centrality (Zhang et al., 2010), information centrality (Stephenson and Zelen, 1989), and current-flow betweenness centrality (Brandes and Fleischer, 2005). We calculated each measure for each gene based on an integrated PPI network using STRING’s Bayesian integration, as previously described. For the baseline model, we also included a few additional features based on shortest distance between genes and the presence of connecting paths of length two (see original paper for more details; Paladugu et al., 2008).
To evaluate the performance of Ontotype (Yu et al., 2016), we first built a hierarchy of GO terms across all three ontologies (biological process, molecular function, and cellular component) using only “is a” and “part of” relationships. Then, we assigned each gene in STRING to its known GO terms and all of their ancestors in the hierarchy. Given a pair of genes, we constructed a feature vector (termed Ontotype) that has 0, 1, or 2 for each GO term representing the number of genes in the pair that are assigned to the term. Following Yu et al. (2016), we used the implementation of random forest classifiers provided by the Python scikit-learn package (Pedregosa et al., 2011) to classify genetic interactions based on the Ontotype features. We explored a wide range of model parameters: {100, 300, 500, 1000} for number of trees, {10, 30, 50, Full} for maximum depth of the trees, and {0.1, 0.3, 0.5} for the fraction of features to consider at each split. To compare with Mashup, we selected the best parameters for SL and SDL interactions, respectively.
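A minimal sketch of the Ontotype feature construction follows: each gene's annotations are propagated to all ancestors, and each feature counts how many genes in the pair carry the term. The `parents` mapping, the `term_order` list, and the toy hierarchy are hypothetical stand-ins for a GO hierarchy restricted to “is a” and “part of” relationships.

```python
def with_ancestors(terms, parents):
    """Return the given terms together with all of their ancestors in the hierarchy."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def ontotype(gene_a, gene_b, annotations, parents, term_order):
    """Feature vector with 0, 1, or 2 per term: the number of genes in the pair assigned to it."""
    terms_a = with_ancestors(annotations.get(gene_a, ()), parents)
    terms_b = with_ancestors(annotations.get(gene_b, ()), parents)
    return [int(t in terms_a) + int(t in terms_b) for t in term_order]


# Hypothetical toy hierarchy: c -> b -> root, d -> root.
parents = {"c": ["b"], "b": ["root"], "d": ["root"]}
annotations = {"g1": {"c"}, "g2": {"d"}}
vec = ontotype("g1", "g2", annotations, parents, ["root", "b", "c", "d"])
```

Vectors of this form can then be passed to any standard classifier, such as a random forest.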
Drug Efficacy Prediction
For each drug, we took the target gene's top 5, 10, or 20 SDL interactors predicted by our Mashup-based classifier trained on all SDL interactions identified by Jerby-Arnon et al. (2014) and a matching number of sampled non-interactions. Then, following a procedure similar to that of Jerby-Arnon et al. (2014), we counted the number of (predicted) interactors that are over-expressed in each cell line. The one-sided Spearman correlation p-value between the number of over-expressed interactors and the IC50 value of the drug, which reflects drug efficacy, was used as a measure of prediction accuracy. We tried using each of the top five deciles of the expression level observed across all tissues as the threshold for determining over-expression for each gene and selected the most significant among the resulting correlation p-values as the final performance score. To assess the statistical significance of our prediction, we sampled the score 10⁵ times from a null distribution by using the same number of randomly selected genes as SDL interactors instead. Given the empirical p-values for each of the drugs tested, we used the Benjamini-Hochberg false discovery rate (FDR) controlling procedure (Benjamini and Hochberg, 1995) with varying FDR thresholds to see the number of drugs whose efficacy we could significantly predict given only each tissue’s expression profiles.
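The significance assessment can be sketched with two helpers: an empirical p-value computed against sampled null scores, and the Benjamini-Hochberg step-up procedure. The add-one correction in the empirical p-value is an assumption made here to avoid reporting exactly zero, and the function names are illustrative.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as extreme as the observed score.

    Scores are themselves correlation p-values, so smaller means more extreme.
    The +1 terms are a common small-sample correction (an assumption here).
    """
    hits = sum(1 for s in null_scores if s <= observed)
    return (hits + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values, fdr=0.1):
    """Return the sorted indices of hypotheses rejected at the given FDR level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])


# Hypothetical per-drug empirical p-values.
drug_p = [0.01, 0.04, 0.03, 0.5]
significant = benjamini_hochberg(drug_p, fdr=0.1)
```

Varying the `fdr` argument reproduces the idea of counting how many drugs remain significant at different FDR thresholds.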
Acknowledgments
A single page abstract of an earlier version of this work appeared in RECOMB 2015. This work was partially supported by NIH grant R01GM081871.
Author Contributions
Conceptualization, H.C., B.B., and J.P.; Data curation, H.C. and J.P.; Investigation, H.C. and J.P.; Methodology, H.C., B.B., and J.P.; Resources, B.B.; Software, H.C.; Validation, H.C.; Visualization, H.C.; Writing – Original Draft, H.C.; Writing – Review & Editing, H.C., B.B., and J.P.; Funding acquisition, B.B.; Supervision, B.B. and J.P.
Data and Software Availability
A MATLAB implementation of Mashup, pre-trained vectors for various organisms, and a benchmark data set for function prediction are available for download at: http://mashup.csail.mit.edu and in Data S1.
References
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Sherlock G. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beauchamp M. An improved index of centrality. Behavioral Science. 1965;10(2):161–163. doi: 10.1002/bs.3830100205. [DOI] [PubMed] [Google Scholar]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological) 1995;57(1):289–300. [Google Scholar]
- Berger B, Peng J, Singh M. Computational solutions for omics data. Nature Reviews Genetics. 2013;14(5):333–346. doi: 10.1038/nrg3433. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bonacich P. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology. 1972;2(1):113–120. [Google Scholar]
- Brandes U. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology. 2001;25(2):163–177. [Google Scholar]
- Brandes U, Fleischer D. STACS. Vol. 2005. Berlin Heidelberg: Springer; 2005. Centrality measures based on current flow; pp. 533–544. [Google Scholar]
- Cao M, Pietras CM, Feng X, Doroschak KJ, Schaffner T, Park J, Hescott BJ. New directions for diffusion-based network prediction of protein function: Incorporating pathways with confidence. Bioinformatics. 2014;30(12):i219–i227. doi: 10.1093/bioinformatics/btu263. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chan D, Giaccia A. Harnessing synthetic lethal interactions in anticancer drug discovery. Nature Reviews Drug Discovery. 2011;10(5):351–364. doi: 10.1038/nrd3374. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chang C, Lin C. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2011;2(3):27. [Google Scholar]
- Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Tang Y. Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Computational Biology. 2012;8(5):e1002503. doi: 10.1371/journal.pcbi.1002503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Clark W, Radivojac P. Analysis of protein function and its prediction from amino acid sequence. Proteins: Structure, Function, and Bioinformatics. 2011;79(7):2086–2096. doi: 10.1002/prot.23029. [DOI] [PubMed] [Google Scholar]
- Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Boone C. The genetic landscape of a cell. Science. 2010;327(5964):425–431. doi: 10.1126/science.1180823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves; Proceedings of the 23rd International Conference on Machine Learning (ICML); 2006. pp. 233–240. [Google Scholar]
- Dutkowski J, Kramer M, Surma MA, Balakrishnan R, Cherry JM, Krogan NJ, Ideker T. A gene ontology inferred from molecular networks. Nature Biotechnology. 2013;31(1):38–45. doi: 10.1038/nbt.2463. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Enault F, Suhre K, Claverie J-M. Phydbac “Gene Function Predictor”: a gene annotation tool based on genomic context analysis. BMC Bioinformatics. 2005;6(1):247. doi: 10.1186/1471-2105-6-247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Jensen LJ. STRING v9. 1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Research. 2013;41(D1):D808–D815. doi: 10.1093/nar/gks1094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Freeman L. A set of measures of centrality based on betweenness. Sociometry. 1977:35–41. [Google Scholar]
- Garnett M, Edelman E, Heidorn S. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012;483(7391):570–575. doi: 10.1038/nature11005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gaudet P, Livstone MS, Lewis SE, Thomas PD. Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium. Briefings in Bioinformatics. 2011;12(5):449–462. doi: 10.1093/bib/bbr042. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huttenhower C, Hibbs M, Myers C, Troyanskaya OG. A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics. 2006;22(23):2890–2897. doi: 10.1093/bioinformatics/btl492. [DOI] [PubMed] [Google Scholar]
- Jerby-Arnon L, Pfetzer N, Waldman YY, McGarry L, James D, Shanks E, Ruppin E. Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell. 2014;158(5):1199–1209. doi: 10.1016/j.cell.2014.07.027. [DOI] [PubMed] [Google Scholar]
- Kim H, Shin J, Kim E, Kim H, Hwang S, Shim JE, Lee I. YeastNet v3: a public database of data-specific and integrated functional gene networks for Saccharomyces cerevisiae. Nucleic Acids Research. 2013 doi: 10.1093/nar/gkt981. gkt981. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kramer M, Dutkowski J, Yu M, Bafna V, Ideker T. Inferring gene ontologies from pairwise similarity data. Bioinformatics. 2014;30(12):i34–i42. doi: 10.1093/bioinformatics/btu282. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Research. 2011;21(7):1109–1121. doi: 10.1101/gr.118992.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liao C-S, Lu K, Baym M, Singh R, Berger B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics. 2009;25(12):i253–i258. doi: 10.1093/bioinformatics/btp203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin H-T, Lin C-J, Weng RC. A note on Platt’s probabilistic outputs for support vector machines. Machine Learning. 2007;68(3):267–276. [Google Scholar]
- Macropol K, Can T, Singh AK. RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinformatics. 2009;10(1):283. doi: 10.1186/1471-2105-10-283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mikolov T, Sutskever I, Chen K. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (NIPS) 2013 Retrieved from http://papers.nips.cc/paper/5021-di. [Google Scholar]
- Milenković T, Ng WL, Hayes W, Pržulj N. Optimal network alignment with graphlet degree vectors. Cancer Informatics. 2010;9:121–137. doi: 10.4137/cin.s4744.
- Milenković T, Pržulj N. Uncovering biological network function via graphlet degree signatures. Cancer Informatics. 2008;6:257–273.
- Mostafavi S, Goldenberg A, Morris Q. Labeling nodes using three degrees of propagation. PLoS ONE. 2012;7(12):e51947. doi: 10.1371/journal.pone.0051947.
- Mostafavi S, Morris Q. Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics. 2010;26(14):1759–1765. doi: 10.1093/bioinformatics/btq262.
- Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology. 2008;9(Suppl 1):S4. doi: 10.1186/gb-2008-9-s1-s4.
- Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods. 2012;9(5):471–472. doi: 10.1038/nmeth.1938.
- Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the Web. 1999 Retrieved from http://ilpubs.stanford.edu:8090/422.
- Pal D, Eisenberg D. Inference of protein function from protein structure. Structure. 2005;13(1):121–130. doi: 10.1016/j.str.2004.10.015.
- Paladugu SR, Zhao S, Ray A, Raval A. Mining protein networks for synthetic genetic interactions. BMC Bioinformatics. 2008;9(1):426. doi: 10.1186/1471-2105-9-426.
- Pandey G, Zhang B, Chang AN, Myers CL, Zhu J, Kumar V, Schadt EE. An integrative multi-network and multi-classifier approach to predict genetic interactions. PLoS Computational Biology. 2010;6(9):e1000928. doi: 10.1371/journal.pcbi.1000928.
- Park Y, Bader J. Resolving the structure of interactomes with hierarchical agglomerative clustering. BMC Bioinformatics. 2011;12(Suppl 1):S44. doi: 10.1186/1471-2105-12-S1-S44.
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011 Oct;12:2825–2830.
- Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers. 1999;10(3):61–74.
- Plo I, Liao Z, Barceló J, Kohlhagen G. Association of XRCC1 and tyrosyl DNA phosphodiesterase (Tdp1) for the repair of topoisomerase I-mediated DNA lesions. DNA Repair. 2003;2(10):1087–1100. doi: 10.1016/s1568-7864(03)00116-2.
- Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Friedberg I. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013;10(3):221–227. doi: 10.1038/nmeth.2340.
- Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Mewes HW. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research. 2004;32(18):5539–5545. doi: 10.1093/nar/gkh894.
- Schwikowski B, Uetz P, Fields S. A network of protein-protein interactions in yeast. Nature Biotechnology. 2000;18(12):1257–1261. doi: 10.1038/82360.
- Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences. 2008;105(35):12763–12768. doi: 10.1073/pnas.0806627105.
- Sneath P, Sokal R. Numerical taxonomy. The principles and practice of numerical classification. San Francisco: 1973.
- Stephenson K, Zelen M. Rethinking centrality: methods and examples. Social Networks. 1989;11(1):1–37.
- Tong H, Faloutsos C, Pan J. Fast random walk with restart and its applications. IEEE International Conference on Data Mining. 2006. pp. 613–622.
- Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, Goldenberg A. Similarity network fusion for aggregating data types on a genomic scale. Nature Methods. 2014;11(3):333–337. doi: 10.1038/nmeth.2810.
- Wang PI, Hwang S, Kincaid RP, Sullivan CS, Lee I, Marcotte EM. RIDDLE: reflective diffusion and local extension reveal functional associations for unannotated gene sets via proximity in a gene network. Genome Biology. 2012;13(12):R125. doi: 10.1186/gb-2012-13-12-r125.
- Wang S, Cho H, Zhai C, Berger B, Peng J. Exploiting ontology graph for predicting sparsely annotated gene function. Bioinformatics. 2015;31(12):i357–i364. doi: 10.1093/bioinformatics/btv260.
- Watts D, Strogatz S. Collective dynamics of small-world networks. Nature. 1998;393(6684):440–442. doi: 10.1038/30918.
- Wong AK, Krishnan A, Yao V, Tadych A, Troyanskaya OG. IMP 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Research. 2015;43(W1):W128–W133. doi: 10.1093/nar/gkv486.
- Wong SL, Zhang LV, Tong AHY, Li Z, Goldberg DS, King OD, Roth FP. Combining biological networks to predict genetic interactions. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(44):15682–15687. doi: 10.1073/pnas.0406614101.
- Xu W, Wang S, Chen Q, Zhang Y, Ni P. TXNL1-XRCC1 pathway regulates cisplatin-induced cell death and contributes to resistance in human gastric cancer. Cell Death & Disease. 2014;5(2):e1055. doi: 10.1038/cddis.2014.27.
- Yu D, Kim M, Xiao G, Hwang T. Review of biological network data and its applications. Genomics & Informatics. 2013;11(4):200–210. doi: 10.5808/GI.2013.11.4.200.
- Yu MK, Kramer M, Dutkowski J, Srivas R, Licon K, Kreisberg JF, Ideker T. Translation of Genotype to Phenotype by a Hierarchy of Cell Subsystems. Cell Systems. 2016;2(2):77–88. doi: 10.1016/j.cels.2016.02.003.
- Zhang A, Ramanathan M, Hwang W, Cho Y. Bridging Centrality: A concept and formula to identify bridging nodes in scale-free networks. U.S. Patent No. 7,808,921. 2010.
- Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS) 1997;23(4):550–560.
- Žitnik M, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B. Gene prioritization by compressive data fusion and chaining. PLoS Computational Biology. 2015;11(10):e1004552. doi: 10.1371/journal.pcbi.1004552.
- Žitnik M, Zupan B. Data fusion by matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;37(1):41–53. doi: 10.1109/TPAMI.2014.2343973.
Associated Data