Abstract
High-throughput techniques for measuring protein interactions have enabled the systematic study of complex protein networks. Comparing the networks of different organisms and identifying their common substructures can lead to a better understanding of the regulatory mechanisms underlying various cellular functions. To facilitate such comparisons, we present an efficient framework based on hidden Markov models (HMMs) that can be used for finding homologous pathways in a network of interest. Given a query path, our method identifies the top k matching paths in the network, which may contain any number of consecutive insertions and deletions. We demonstrate that our method is able to identify biologically significant pathways in protein interaction networks obtained from the DIP database, and the retrieved paths are closer to the curated pathways in the KEGG database when compared to the results from previous approaches. Unlike most existing algorithms that suffer from exponential time complexity, our algorithm has a polynomial complexity that grows linearly with the query size. This enables the search for very long paths with more than 10 proteins within a few minutes on a desktop computer. A software program implementing the algorithm is available upon request from the authors.
Key words: hidden Markov model (HMM), pathway alignment, protein interaction network
1. Introduction
Recent advances in high-throughput experimental techniques for measuring protein interactions, such as two-hybrid systems (Ito et al., 2001) and co-immunoprecipitation assays (Mann et al., 2001; Uetz et al., 2004), have enabled the systematic study of biological interactions on a global scale for an increasing number of organisms (von Mering et al., 2002). These complex interaction networks can be represented by graphs, in which the nodes represent biological entities in a given network and the edges indicate interactions between them. Comparing the networks of different organisms and identifying their common substructures, such as signaling pathways and protein complexes, can lead to a better understanding of the regulatory mechanisms underlying various cellular processes. For this reason, there have been growing efforts to find conserved interaction patterns in protein interaction networks (Dost et al., 2008; Kelley et al., 2003; Koyutürk et al., 2004; Scott et al., 2006; Sharan et al., 2005; Shlomi et al., 2006; Singh et al., 2008; Yang and Sze, 2007), metabolic networks (Koyutürk et al., 2004; Pinter et al., 2005; Yang and Sze, 2007), gene regulatory networks (Akutsu et al., 1998), and signal transduction networks (Steffen et al., 2002). It has been demonstrated that searching for conserved interaction patterns can detect many well-known pathways and can also make statistically significant predictions of novel interaction pathways.
Despite the initial success of existing methods, a number of problems still limit their applicability. For example, many of these methods suffer from high computational complexity, which makes it difficult to use them to search for long query paths. The number of consecutive insertions and deletions is often restricted, which may prevent the detection of distant homologous pathways, and many methods rely on heuristics or randomized algorithms that do not necessarily yield optimal results.
In this paper, we focus on the problem of finding conserved pathways in a protein interaction network that are homologous to a known linear pathway. We formulate the problem as a pathway alignment problem, in which the goal is to find the path in a given protein network that is most similar to a given query. Based on hidden Markov models (HMMs), we propose a general probabilistic framework for scoring pathway alignments and present an efficient search algorithm that overcomes the aforementioned limitations of previous algorithms. Given a query path, our method identifies the top k matching paths in a network of interest, where the detected paths may contain any number of consecutive insertions and deletions. Since the search algorithm has a very low computational complexity that is linear in the query size as well as in the number of edges in the network, our method can be used to effectively search for long query paths in remotely related organisms. Although HMMs have been widely used in sequence homology search, we believe that this is their first application to homology search in pathways.
2. Methods
In this section, we present an algorithm for solving the following pathway alignment problem: Given a linear query path p and a protein interaction network G, find the optimal path q in the network that is closest to the query.
2.1. Pathway alignment
Let G = (V, ɛ) be an undirected graph representing a protein interaction network. We assume that G has a set V = {v1, v2, …, vN} of N nodes, representing the proteins in the network, and a set ɛ = {eij} of M edges, where eij represents the interaction between proteins vi and vj. As the network is an undirected graph, we assume that if there exists an interaction between vi and vj, both eij and eji are present in the set ɛ. For a protein pair (vi, vj) with eij ∈ ɛ, we define their interaction reliability as w(vi, vj). Given a query path p = p1p2 … pL that consists of L proteins, we define the sequence similarity score between pi and vj as h(pi, vj). Our goal is to find the best matching path q = q1q2 … qL′ in the protein interaction network that maximizes a predefined pathway alignment score S(p, q).
To obtain meaningful pathway alignments, the alignment score S(p, q) should combine the similarity score h(pi, qj) between the aligned proteins pi and qj, the interaction reliability score w(qj, qj+1) between the proteins qj and qj+1 (1 ≤ j ≤ L′ − 1), and the penalties for insertions (a protein qj that does not have a matching protein in the query path p) and deletions (a protein pi that does not have a matching protein in the retrieved path q). Figure 1C illustrates an example of a query path p (Fig. 1A) that is aligned to the best matching path q in the network (Fig. 1B). A dashed line between pi and vj or between pi and qj indicates that they have significant sequence similarity. In this example, the optimal alignment that maximizes the alignment score S(p, q) has one insertion (node q3 in the retrieved path) and one deletion (node p4 in the query).
2.2. Network representation using HMM
To define the alignment score S(p, q), we adopt the HMM formalism. For simplicity, we start with an HMM that does not allow insertions or deletions in the pathway alignment. We construct the HMM based on the network graph so that each node in the graph corresponds to a hidden state in the HMM. For convenience, we represent this hidden state using the same notation vi, so that the HMM has the same structure as the network graph G. The resulting HMM can be viewed as a generative model that produces (or “emits”) an interesting substructure of the original network, such as a signaling pathway. From this point of view, we regard the query path as an observation sequence generated by the constructed HMM. Figure 1D illustrates the HMM where the emittable symbols are shown next to the hidden states.
Based on this representation, the interaction reliability score and the sequence similarity score can be integrated naturally into the probabilistic framework. We define a mapping f : w(vm, vn) ↦ t(vn|vm) that converts the interaction reliability w(vm, vn) between two proteins to the following transition probability
$$t(v_n \mid v_m) = P(q_{j+1} = v_n \mid q_j = v_m) \tag{1}$$
between the corresponding hidden states in the HMM. The mapping f is defined so that (i) t(vn | vm) = 0 for emn ∉ ɛ, (ii) Σn t(vn | vm) = 1 for all m, and (iii) t(vn1 | vm) > t(vn2 | vm) for w(vm, vn1) > w(vm, vn2). We define another mapping g : h(pi, vm) ↦ e(pi | vm) that converts the sequence similarity score h(pi, vm) to the following emission probability
$$e(p_i \mid v_m) = P(p_i \mid q_j = v_m) \tag{2}$$
where qj = vm is a hidden state in the HMM (representing a protein in the network G) and pi is an emitted symbol (representing a protein in the query p that is aligned to qj). The mapping g is defined so that (i) Σp∈P e(p | vm) = 1 for all m, where P is the set of distinct proteins in the query path p, and (ii) e(pi1 | vm) > e(pi2 | vm) for h(pi1, vm) > h(pi2, vm).
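The conditions on f and g leave the exact form of the two mappings open. As a minimal sketch, assuming that plain normalization is used (one admissible choice, not necessarily the one used in our experiments), the mappings could look as follows. The function names, the dictionary layout, the absence of self-loops, and the small pseudocount `pseudo` are all illustrative assumptions.

```python
from collections import defaultdict


def transition_probs(weights):
    """Row-normalize reliabilities w(vm, vn) into t(vn | vm).

    `weights` maps an undirected edge (vm, vn) to a positive reliability.
    Each edge is assumed to be listed once and to join two distinct nodes.
    """
    totals = defaultdict(float)
    for (m, n), w in weights.items():
        totals[m] += w
        totals[n] += w  # undirected graph: e_mn implies e_nm
    t = defaultdict(dict)
    for (m, n), w in weights.items():
        t[m][n] = w / totals[m]  # non-edges are simply absent (probability 0)
        t[n][m] = w / totals[n]
    return dict(t)


def emission_probs(similarity, query, nodes, pseudo=1e-6):
    """Normalize similarity scores h(p, v) into e(p | v) for every node v.

    `similarity` maps (p, v) to a non-negative score; missing pairs count as 0.
    The pseudocount keeps every emission probability strictly positive.
    """
    distinct = sorted(set(query))  # the set P of distinct query proteins
    e = {}
    for v in nodes:
        raw = {p: similarity.get((p, v), 0.0) + pseudo for p in distinct}
        z = sum(raw.values())
        e[v] = {p: s / z for p, s in raw.items()}
    return e
```

Both functions preserve the ordering of the input scores within each row, so the monotonicity conditions (iii) and (ii) above hold by construction.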
2.3. Ungapped pathway alignment
Based on the HMM framework, the problem of finding the best matching path is transformed into the problem of finding the optimal state sequence in the HMM that maximizes the observation probability of the given query path. In an ungapped pathway alignment, the matching path q has the same length as the query path p; hence qi is the underlying state for the “observed symbol” pi.
To find the best matching path, we can use the Viterbi algorithm, a dynamic programming algorithm that solves the problem in polynomial time. We define γ(t, j) as the log-probability of the most probable path for the sub-query p1 … pt of length t (≤ L), where the underlying state of pt is qt = vj. We compute γ(t, j) recursively as follows:
$$\gamma(t, j) = \max_{i} \left[ \gamma(t-1, i) + \log t(v_j \mid v_i) + \log e(p_t \mid v_j) \right] \tag{3}$$
We repeat the above iterations until t = L. We then obtain the maximum log-probability of the query p as follows:
$$\log P(p, q^*) = \max_{j} \gamma(L, j) \tag{4}$$
where q* = arg maxq [log P(p, q)] is the best matching path for p in the network. Once we have log P(p, q*), it is straightforward to find q* by tracing the recursive equations that led to the maximum log-probability log P(p, q*). Although the above algorithm only finds the optimal path, we can extend the algorithm to find the top k paths simply by replacing the max operator by an operator that finds the k largest scores.
Note that S(p, q) = log P(p, q) can serve as a good alignment score for the paths p and q that effectively combines sequence similarity and interaction reliability. In principle, we can also use non-stochastic emission scores sem(pt | vj) and transition scores str(vj | vi) in place of the log-probabilities log e(pt | vj) and log t(vj | vi), respectively, in the recursive equation, Equation (3). This will yield a non-stochastic alignment score instead of an observation probability.
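The recursion and traceback described above can be sketched in a few lines. This is a minimal illustration rather than our actual implementation: it uses transition scores of 0 along edges and −∞ elsewhere, assumes that every node is an equally likely start state, and takes a caller-supplied `emit_score` function. All names are hypothetical.

```python
NEG_INF = float("-inf")


def ungapped_viterbi(query, neighbors, emit_score):
    """Return (best score, best path) for an ungapped pathway alignment.

    query      : list of query proteins p_1 .. p_L (assumed non-empty)
    neighbors  : dict mapping each node to a list of adjacent nodes
    emit_score : function (p, v) -> log e(p | v) or a non-stochastic score
    """
    nodes = list(neighbors)
    # gamma(1, j): emission of the first query protein (uniform start assumed)
    gamma = {v: emit_score(query[0], v) for v in nodes}
    back = []
    for p in query[1:]:
        new_gamma, ptr = {}, {}
        for v in nodes:
            best_u, best = None, NEG_INF
            for u in neighbors[v]:  # only interacting proteins may precede v
                if gamma[u] > best:
                    best_u, best = u, gamma[u]
            new_gamma[v] = best + emit_score(p, v)
            ptr[v] = best_u
        gamma = new_gamma
        back.append(ptr)
    end = max(nodes, key=lambda v: gamma[v])
    score = gamma[end]
    if score == NEG_INF:  # no path of the required length exists
        return score, []
    path = [end]
    for ptr in reversed(back):  # trace the maximizing predecessors
        path.append(ptr[path[-1]])
    path.reverse()
    return score, path
```

Each of the L steps visits every edge once in each direction, which gives the O(LM) behavior of the ungapped search.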
2.4. Pathway alignment with gaps
To accommodate insertions and deletions, we modify the HMM as follows. To model deletions, we add an accompanying state um for every state vm in the HMM. We add an outgoing edge from vm to um and add outgoing edges from um to all of the neighboring states of vm in the network G. To be more precise, um has an outgoing edge to every vn with emn ∈ ɛ. By varying the transition probability t(um | vm), we can control the probability (hence, the penalty) of having deletions. We adjust the outgoing transition probabilities from vm so that t(um | vm) + Σn t(vn | vm) = 1. We control the probability of having consecutive deletions by adjusting the probability t(um | um) for making self-transitions at um. The outgoing transition probabilities t(vn | um) from an accompanying state um are chosen so that they are proportional to t(vn | vm) and satisfy t(um | um) + Σn t(vn | um) = 1. To model insertions, we allow the original states in the HMM to emit a gap symbol φ in addition to the proteins in the query path p. The gap emission probability e(φ | vm) can be used to control the probability (hence, the penalty) of having insertions. The structure of the modified HMM is depicted in Figure 1E. Note that the matching path might have a different length L′ from the length L of the query path when there are insertions or deletions in the pathway alignment. Hence we may have j ≠ i for the underlying state qj of the “observed symbol” pi.
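The construction of the accompanying states can be illustrated with a short sketch. It assumes that the ungapped transition probabilities are already normalized per state, and it uses two illustrative parameters that are not values from our experiments: `p_del` for t(um | vm) and `p_ext` for t(um | um). The tuple `("u", vm)` is a hypothetical encoding of the accompanying state um.

```python
def add_deletion_states(t, p_del=0.1, p_ext=0.3):
    """Extend transition probabilities t[vm][vn] with accompanying states.

    t     : dict of dicts, each row summing to 1 (ungapped HMM)
    p_del : probability t(um | vm) of opening a deletion (assumed value)
    p_ext : probability t(um | um) of extending a deletion (assumed value)
    """
    ext = {}
    for vm, row in t.items():
        um = ("u", vm)
        # Rescale the regular transitions so that the row still sums to 1.
        ext[vm] = {vn: (1.0 - p_del) * prob for vn, prob in row.items()}
        ext[vm][um] = p_del
        # Transitions out of um are proportional to t(vn | vm).
        ext[um] = {vn: (1.0 - p_ext) * prob for vn, prob in row.items()}
        ext[um][um] = p_ext
    return ext
```

Choosing `p_del` and `p_ext` separately corresponds to treating the opening and the extension of a run of deletions differently.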
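To find the optimal path that may include one or more gaps, we modify our dynamic programming algorithm correspondingly. We define γ(t, d, j) as the log-probability of the most probable path for the sub-query p1 … pt that contains d insertions and ends at vj (1 ≤ j ≤ 2N), where we use vm+N = um for convenience. The value of γ(t, d, j) is computed recursively as follows:
$$\gamma(t, d, j) = \max \left\{ \max_{i} \left[ \gamma(t-1, d, i) + \log t(v_j \mid v_i) + \log e(p_t \mid v_j) \right],\; \max_{i} \left[ \gamma(t, d-1, i) + \log t(v_j \mid v_i) + \log e(\phi \mid v_j) \right] \right\} \tag{5}$$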
We repeat the above iterations until we reach t = L and d = D, where D is the maximum number of allowed insertions. Note that there is no explicit limit for the number of deletions. We then compute the log-probability of the optimal path as follows:
$$\log P(p, q^*) = \max_{0 \le d \le D,\; 1 \le j \le 2N} \gamma(L, d, j) \tag{6}$$
The path q* = arg maxq [log P(p, q)] is the closest match to the query p, and it may contain some number of insertions and deletions. As before, we can replace the max operator by an operator that finds the k largest scores if we want to find the top k matching paths instead of a single top-scoring path.
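A compact sketch of the gapped search is given below for illustration. It is written under several assumptions that go beyond the text: transition scores are 0 along allowed edges and −∞ elsewhere, an alignment must start in a regular state, accompanying states emit a query protein at a cost of `del_pen`, and regular states emit the gap symbol at a cost of `ins_pen`. The state encoding and all names are hypothetical.

```python
NEG_INF = float("-inf")


def gapped_viterbi(query, neighbors, emit_score, max_ins, ins_pen, del_pen):
    """Best gapped alignment score and retrieved path for a non-empty query.

    States are ("v", n) for a regular state and ("u", n) for the
    accompanying (deletion) state of node n.
    """
    def preds(state):
        kind, n = state
        if kind == "v":  # reached from a neighbor or from its u-state
            for m in neighbors[n]:
                yield ("v", m)
                yield ("u", m)
        else:  # u_n is reached from v_n or from itself
            yield ("v", n)
            yield ("u", n)

    def emit(p, state):
        kind, n = state
        return emit_score(p, n) if kind == "v" else -del_pen

    states = [(kind, n) for n in neighbors for kind in ("v", "u")]
    gamma, back = {}, {}
    for t in range(1, len(query) + 1):
        for d in range(max_ins + 1):
            for s in states:
                best, arg = NEG_INF, None
                if t == 1 and d == 0 and s[0] == "v":
                    best = emit(query[0], s)  # assumed start condition
                if t > 1:  # match, mismatch, or deletion of p_t
                    for r in preds(s):
                        cand = gamma[(t - 1, d, r)] + emit(query[t - 1], s)
                        if cand > best:
                            best, arg = cand, (t - 1, d, r)
                if d > 0 and s[0] == "v":  # insertion: s emits the gap symbol
                    for r in preds(s):
                        cand = gamma[(t, d - 1, r)] - ins_pen
                        if cand > best:
                            best, arg = cand, (t, d - 1, r)
                gamma[(t, d, s)] = best
                back[(t, d, s)] = arg
    last = len(query)
    end = max(((last, d, s) for d in range(max_ins + 1) for s in states),
              key=lambda key: gamma[key])
    score, trace, key = gamma[end], [], end
    while key is not None:  # follow the stored backpointers
        trace.append(key[2])
        key = back[key]
    trace.reverse()
    # Accompanying states mark deleted query proteins, not retrieved proteins.
    return score, [n for kind, n in trace if kind == "v"]
```

Only the number of insertions is tracked as a table dimension; deletions need no counter because every deletion consumes one query protein, so their number is bounded by L automatically.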
The computational complexity of the above algorithm is O(kLDM) for finding the top k matching paths, where L is the length of the query, D is the maximum number of allowed insertions, and M is the number of edges in the network G. Note that the complexity is linear with respect to the query size, the number of edges in the network, the maximum number of insertions, and the number of best matching paths we want to retrieve.
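The factor k in the complexity comes from keeping the k best candidates in every cell of the dynamic programming table instead of a single maximum. The following sketch shows this for the ungapped case only, again with transition scores of 0 along edges and a uniform start as assumptions. For brevity it stores each candidate path explicitly; an implementation that stores backpointers instead would avoid copying paths. All names are illustrative.

```python
import heapq


def top_k_ungapped(query, neighbors, emit_score, k):
    """Return up to k best (score, path) pairs, best first."""
    nodes = list(neighbors)
    # Each cell keeps at most k candidates that end at the given node.
    cells = {v: [(emit_score(query[0], v), (v,))] for v in nodes}
    for p in query[1:]:
        new_cells = {}
        for v in nodes:
            cands = ((s + emit_score(p, v), path + (v,))
                     for u in neighbors[v] for s, path in cells[u])
            # "max" is replaced by "k largest", as described in the text.
            new_cells[v] = heapq.nlargest(k, cands, key=lambda c: c[0])
        cells = new_cells
    final = (c for v in nodes for c in cells[v])
    return heapq.nlargest(k, final, key=lambda c: c[0])
```

Keeping k candidates per cell is sufficient because any of the k best complete paths must extend one of the k best partial paths ending at its second-to-last node.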
2.5. Statistical significance
We estimate the statistical significance of a retrieved path using an approach similar to the one proposed in Kelley et al. (2003). We first generate a large number of random graphs by permuting the protein locations in the original network G. Therefore, all random graphs comprise the same set of proteins and retain a structure similar to that of the original network. For each random graph, we construct an HMM and compute the best pathway alignment score for the query p. It is well known that the Gumbel distribution provides a good approximation for the extreme value distribution (EVD) of various random variables, and it has been widely used in sequence homology search to assess the significance of predicted results (Durbin et al., 1998; Karlin and Altschul, 1990). As we are interested in evaluating the statistical significance of the maximum pathway alignment score obtained from the original network, the Gumbel distribution provides a better approximation than the widely used Gaussian distribution. The two unknown parameters α and β in the Gumbel distribution function can be estimated using simple least squares regression based on the alignment scores obtained from the random graphs. Once we have estimated α and β, we can compute the p-value of the best pathway alignment score in the original network based on the estimated distribution.
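One way to carry out the least squares estimation is to linearize the Gumbel distribution function. The sketch below assumes the parameterization F(x) = exp(−exp(−(x − α)/β)), with α as the location and β as the scale, and it assumes the plotting position (i + 0.5)/n for the empirical CDF. Both are illustrative choices, since the exact regression setup is not specified above.

```python
import math


def fit_gumbel(scores):
    """Least squares fit of F(x) = exp(-exp(-(x - alpha) / beta)).

    Uses the linearization -ln(-ln F(x)) = x / beta - alpha / beta.
    """
    xs = sorted(scores)
    n = len(xs)
    # Empirical CDF at the i-th order statistic (plotting position assumed).
    ys = [-math.log(-math.log((i + 0.5) / n)) for i in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx            # slope = 1 / beta
    intercept = my - slope * mx  # intercept = -alpha / beta
    beta = 1.0 / slope
    alpha = -intercept * beta
    return alpha, beta


def gumbel_p_value(score, alpha, beta):
    """P(X >= score) under the fitted Gumbel distribution."""
    # expm1 keeps precision when the p-value is very small.
    return -math.expm1(-math.exp(-(score - alpha) / beta))
```

Here `scores` would hold the best alignment score from each random graph, and `gumbel_p_value` would be evaluated at the best score obtained from the original network.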
3. Results
3.1. HMM parameterization
We tested our algorithm using several protein interaction networks in the Database of Interacting Proteins (DIP) (Xenarios et al., 2002). We adopted a simple non-stochastic scoring scheme for parameterizing the HMMs.
We set the transition scores str(vn | vm) based on the presence of interaction between the corresponding proteins. If there exists an interaction between the two nodes vm and vn in the network G, we set the transition score to str(vn | vm) = 0 (and also str(vm | vn) = 0). Otherwise, we set the score to str(vn | vm) = −∞ (and also str(vm | vn) = −∞). This keeps the state vm in the HMM from making a direct transition to a non-relevant state vn, and vice versa, thereby preventing the inclusion of any irrelevant protein interactions with no biological support in the retrieved path q. The score for making a transition into an accompanying state um was set to str(um | vm) = 0, and we set the self-transition score to str(um | um) = 0 to allow consecutive deletions. The score for making a transition from um back to a regular state vn was set to str(vn | um) = 0 for emn ∈ ɛ and to str(vn | um) = −∞ for emn ∉ ɛ.
We set the emission score sem(pt | vm) based on the sequence similarity of the proteins pt and vm. We assume that every state in the HMM (including the accompanying states) can emit any protein pt in the query path p. The emission score was made larger for more similar proteins so that it is more likely that a hidden state vm will emit a protein pt that is closer to its corresponding protein. For all protein pairs (pt, vm) between a protein pt in the query and a protein vm in the network G, we computed their E-values using the PRSS routine in the FASTA package (Pearson and Lipman, 1988). PRSS (Pearson, 1996) computes accurate E-values using the Smith-Waterman algorithm with sophisticated shuffling methods, and it is believed to be better than BLASTP in detecting significant matches (Pagni and Jongeneel, 2001). We regarded a protein pair (pt, vm) as a “match” if its E-value Ev(pt, vm) was below some predefined threshold λth. Otherwise, we viewed (pt, vm) as a “mismatch,” which implies that the two proteins do not share significant similarity. Based on this criterion, we set the emission score sem(pt | vm) as follows:
$$s_{em}(p_t \mid v_m) = \begin{cases} -\log_{10} E_v(p_t, v_m), & \text{if } E_v(p_t, v_m) < \lambda_{th} \\ -\Delta, & \text{otherwise} \end{cases} \tag{7}$$
The value Δ can be viewed as the mismatch penalty, and is selected so that −Δ ≪ −log10 λth. We set the insertion penalty to sem(φ |vm) = −Δi and the deletion penalty to sem(pt|um) = −Δd. Note that the accompanying state um cannot emit a gap, hence sem(φ |um) = −∞. Finally, we set the maximum number of insertions to be the same as the length of the query (D = L).
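The thresholding rule for the emission score can be written as a small function. The sketch assumes that a match is scored as −log10 of its E-value, which is what the condition −Δ ≪ −log10 λth suggests, and it adds a hypothetical lower bound `floor` so that an E-value reported as exactly zero does not cause a math error. The default values mirror the settings reported in the experiments below.

```python
import math


def emission_score(e_value, threshold=0.5, mismatch_penalty=12.0, floor=1e-300):
    """Non-stochastic emission score for a (query protein, network protein) pair.

    e_value          : E-value of the pair, e.g. as reported by PRSS
    threshold        : lambda_th, the largest E-value still counted as a match
    mismatch_penalty : Delta, applied to every pair at or above the threshold
    floor            : assumed guard against E-values of exactly zero
    """
    if e_value < threshold:
        return -math.log10(max(e_value, floor))
    return -mismatch_penalty
```

With these defaults, the weakest possible match scores just above −log10(0.5) ≈ 0.3, while a mismatch scores −12, so a single mismatch outweighs many weak matches.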
3.2. Querying yeast pathways in various organisms
To verify the capability of our method for identifying relevant pathways in different organisms, we obtained the protein interaction networks of S. cerevisiae, C. elegans, D. melanogaster, and E. coli from DIP. We took a mating-pheromone response pathway of S. cerevisiae with 10 proteins as our query path, which contains the mitogen-activated protein (MAP) kinase cascade Ste11p–Ste7p–Fus3p (Fig. 2A). The same pathway has been used by other existing algorithms for performance evaluation (Scott et al., 2006; Yang and Sze, 2007). We searched for similar paths in the S. cerevisiae network with 17,579 interactions among 4969 proteins, the C. elegans network with 4037 interactions among 2647 proteins, and the D. melanogaster network with 22,840 interactions among 7476 proteins. We set λth = 0.5 and Δ = Δi = Δd = 12.
As we would expect, the best matching path in S. cerevisiae network was identical to the query path. This is shown in Figure 2B, where other top matches are shown with the best matching path. The retrieved paths in the C. elegans network and the D. melanogaster network are shown in Figures 2C and 2D. It is interesting to note that many proteins in the retrieved paths share similar functions with the corresponding proteins in the query path. The proteins Ste11p, nsy-1, mig-15, CG10033-PA, Pk17E-PA, and CG5169-PA, which are aligned to each other, all belong to the serine/threonine protein kinase family. Similarly, Ste7p, mig-15, sek-1, Mekk1-PA, CG10498-PB, and Lic-PA, also belong to the serine/threonine protein kinase family, with sek-1, Mekk1-PA, and Lic-PA being MAPK kinases. All the proteins Fus3p, Kss1p, mpk-1, and ERKA, which are aligned to each other in Figure 2, are MAP kinases (Flybase Consortium, 1996; Gustin et al., 1998; Stein et al., 2001). These results clearly indicate that our method is able to effectively identify similar pathways that are biologically meaningful.
In order to estimate the statistical significance of the predicted results, we computed the p-values for the best matching paths in the respective networks. The p-values were computed as described in Section 2.5 using 100 random networks. Figures 3A and 3B show the resulting cumulative distribution functions (CDFs) and histograms of the maximum alignment scores in the random networks of C. elegans and D. melanogaster, respectively. The fitted Gumbel distributions are also shown in the figure. The p-value of the optimal path in the original C. elegans network was p = 0.014, while the p-value of the optimal path in the D. melanogaster network was p = 0.069. For both organisms, the optimal alignment scores in the original networks ranked among the top scores, indicating that the retrieved results are statistically significant. One point to note is that using a shorter query path typically leads to a smaller p-value, as it becomes more likely to detect good matches that contain fewer insertions and deletions. For example, if we take a small portion (Ste11p–Ste5p–Ste7p–Fus3p) of the original query shown in Figure 2A and search the C. elegans network, the algorithm retrieves the corresponding part in Figure 2C, with a smaller p-value p = 2.6 × 10⁻³. Similarly, if we use the same short query to search the D. melanogaster network, the corresponding portion in Figure 2D is retrieved, where the p-value of the optimal path is only p = 3.6 × 10⁻⁹.
In addition to the D. melanogaster, C. elegans, and S. cerevisiae networks, we used the same mating-pheromone response pathway in Figure 2A to find relevant paths in the E. coli protein interaction network, which contains 6976 interactions among 1850 proteins. We reduced the mismatch and indel penalties to Δ = Δi = Δd = 2 so that the retrieved paths may contain more indels and mismatches. The p-value of the optimal path was p = 0.66 and its alignment score was among the lowest when compared to the alignment scores obtained from the random networks. Similarly, the p-value of the retrieved path was also high (p = 0.59) for the short query. This implies that the search results are statistically insignificant, which is consistent with the fact that there are no known MAP kinase pathways in bacteria (Chang and Stewart, 1998). This is a good indication that our method is very useful in identifying conserved pathways that are biologically meaningful.
3.3. Querying human pathways in fly
We further applied our algorithm to search the D. melanogaster network for matching paths that are similar to known human signaling pathways. We used the same parameters as before: λth = 0.5 and Δ = Δi = Δd = 12. Figure 4A shows the retrieved paths for the human hedgehog signaling pathway and Figure 4B shows the retrieved paths for the human MAP kinase pathway. In both cases, the top matching paths agreed well with the query paths, according to the known functional annotations of D. melanogaster. In Figure 4A, ptc is a receptor of the hedgehog pathway located at the plasma membrane (Lum and Beachy, 2004), and it has been shown that ci-PA plays an important role in the hedgehog pathway that regulates cell growth in many tissues (Lum and Beachy, 2004). For the MAP kinase query (Fig. 4B), Egfr-PB is a putative growth factor receptor; drk-PA is downstream of the receptor kinase; Rap21-PA and Ras85D-PA have putative GTPase activity; ph1-PA, Pkc98E-PA, Mekk1-PA, and cdc2c-PA all belong to the serine/threonine protein kinase family; and ERKA is an annotated nuclear MAP kinase which likely activates specific transcription factors (Flybase Consortium, 1996; Gustin et al., 1998).
Figures 4A and 4B contain the putative homologous pathways in D. melanogaster reported in the KEGG database (Kanehisa and Goto, 2000): one of the top matching paths in Figure 4A (shh–ptc–Smo–fu–ci) is the core of the D. melanogaster hedgehog signaling pathway given in Kanehisa and Goto (2000), and Egfr–drk–Sos–Ras85D–ph1–Mekk1–ERKA (Fig. 4B) is part of the putative MAP kinase pathway for D. melanogaster in Kanehisa and Goto (2000). By comparing the retrieved pathways with the corresponding putative pathways of D. melanogaster in the KEGG database (Kanehisa and Goto, 2000), we found that our algorithm was able to retrieve the identical core segment with five proteins in the putative hedgehog signaling pathway. For the MAP kinase query, the retrieved pathway included seven proteins that exactly matched the proteins in the putative D. melanogaster MAP kinase pathway. These results compare favorably to the previously reported results (Shlomi et al., 2006), which found two and five matched proteins for the respective pathways, indicating that our algorithm can make biologically meaningful predictions with better accuracy. As before, we also computed the p-values for the top matching paths using 100 random networks. The p-value of the optimal path for the human hedgehog signaling pathway was p = 7.0 × 10⁻²³ and the p-value of the top retrieved path for the human MAP kinase pathway was p = 2.6 × 10⁻⁴, which show the statistical significance of the predictions.
3.4. Running time
Our method has a very low computational complexity that is linear with respect to the length of the query as well as the number of interactions in the network. Unlike most existing methods whose utility is limited to relatively short queries with 3–10 proteins, our method can search for very long query paths in large protein interaction networks. Table 1 summarizes the running time of our method with different parameters. We searched for queries with six to 20 proteins in the S. cerevisiae network that consists of 4969 proteins and 17,579 interactions. The running time has been measured on a desktop computer with 2.13-GHz CPU and 2-GB memory. From Table 1, we see that it takes only about three minutes to find the top path for a query of length 20, where the retrieved path may contain any number of deletions and mismatches but no insertions. We see clearly in Table 1 that the running time grows linearly with the query length. If we search for the top five paths using the same query (of length 20), and allow up to five insertions and any number of deletions and mismatches, it still takes only about 80 minutes. Note that this is about 25 (= D × k) times larger compared to the previous running time (3 min.), which confirms that the computational complexity is also linear in the maximum number of insertions D and the number of paths k we want to retrieve.
Table 1. Running time in seconds, by query path length L
Maximum no. of insertions | No. of paths | L = 6 | L = 8 | L = 10 | L = 12 | L = 14 | L = 16 | L = 18 | L = 20 |
---|---|---|---|---|---|---|---|---|---|
D = 0 | k = 1 | 47.9 | 67.5 | 85.8 | 105.4 | 122.6 | 144.0 | 163.6 | 179.2 |
D = 5 | k = 5 | 1373.4 | 1878.3 | 2382.4 | 2881.5 | 3384.4 | 3894.5 | 4387.5 | 4817.5 |
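The linear trend in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the values from the table, assumes that they are given in seconds, fits a least squares line to the D = 0, k = 1 row, and computes the ratio between the two rows for each query length. The helper name `ls_slope` is illustrative.

```python
lengths = [6, 8, 10, 12, 14, 16, 18, 20]
time_d0_k1 = [47.9, 67.5, 85.8, 105.4, 122.6, 144.0, 163.6, 179.2]
time_d5_k5 = [1373.4, 1878.3, 2382.4, 2881.5, 3384.4, 3894.5, 4387.5, 4817.5]


def ls_slope(xs, ys):
    """Slope of the least squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Additional running time per extra query protein (D = 0, k = 1).
slope_d0 = ls_slope(lengths, time_d0_k1)

# Slowdown of the D = 5, k = 5 setting relative to D = 0, k = 1.
ratios = [slow / fast for fast, slow in zip(time_d0_k1, time_d5_k5)]

print(f"seconds per additional query protein: {slope_d0:.2f}")
print("slowdown factors:", [round(r, 1) for r in ratios])
```

The slowdown factors stay close to each other across all query lengths, which is what a cost proportional to both D and k would predict.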
3.5. Accuracy and robustness
To evaluate the accuracy and the robustness of our HMM-based algorithm, we performed the following experiments.
We first estimated the accuracy of our algorithm in retrieving homologous pathways by using synthetic query paths. For this purpose, we followed a procedure similar to that used in Dost et al. (2008). To obtain a reasonable set of query paths, we randomly extracted 10 paths from the S. cerevisiae network, whose lengths range from L = 6 to L = 10. Each of these paths was perturbed by inserting, deleting, and replacing one or more nodes. We also applied point mutations to the protein sequences in the query paths with different mutation rates of up to 80%. For each path, we used our algorithm to retrieve the top matching path in the S. cerevisiae network. As in the previous experiments, we used λth = 0.5 and Δ = Δi = Δd = 12. The retrieved results were compared to the original unperturbed paths that were used to obtain the query paths. We computed the edit distance between each retrieved path and the original path to evaluate the prediction accuracy. As a comparison, we also tried to find the best matching proteins based on sequence similarity alone. For each node in the query paths, we reported the node in the S. cerevisiae network with the highest alignment score using the PRSS routine (Pearson, 1996) as the “matching node” and counted the number of nodes that were correctly predicted. The prediction results are summarized in Table 2. We can see from this table that the predictions made by our HMM-based algorithm are far more accurate than those based on the best PRSS scores. For HMM-based predictions, the average distance between the retrieved optimal paths and the original paths was less than one for all types of perturbations with up to 70% point mutations. The advantage of our algorithm over the sequence-based approach becomes more pronounced for higher mutation rates. These results clearly show that the HMM-based algorithm can make accurate predictions by integrating sequence similarity and the interaction network in a sensible manner.
Table 2.
Point mutation rates are given in parentheses.
Node perturbation | PRSS (0%) | HMM (0%) | PRSS (50%) | HMM (50%) | PRSS (60%) | HMM (60%) | PRSS (70%) | HMM (70%) | PRSS (80%) | HMM (80%) |
---|---|---|---|---|---|---|---|---|---|---|
None | 0 | 0 | 0 | 0 | 0.2 | 0 | 0.9 | 0.3 | 3.9 | 0.9 |
1 replacement | 1 | 0.2 | 1 | 0.2 | 1.2 | 0.2 | 1.9 | 0.8 | 4.9 | 1.5 |
2 replacements | 2 | 0.5 | 2 | 0.5 | 2.2 | 0.5 | 2.9 | 0.8 | 5.9 | 2.2 |
1 insertion and 1 deletion | 1.8 | 0.3 | 1.8 | 0.3 | 2.0 | 0.3 | 2.7 | 0.7 | 5.7 | 2.3 |
2 insertions and 2 deletions | 3.3 | 0.6 | 3.3 | 0.6 | 3.5 | 0.6 | 4.2 | 0.7 | 7.2 | 3.8 |
The results obtained by our algorithm are denoted as “HMM,” and the results obtained based on the best PRSS hits are denoted as “PRSS.”
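The edit distance between a retrieved path and the original path can be computed with the standard dynamic programming recurrence. The sketch below assumes unit costs for inserting, deleting, and replacing a node (the Levenshtein distance), which is a common convention but an assumption here, since the cost model is not spelled out above.

```python
def edit_distance(path_a, path_b):
    """Levenshtein distance between two paths given as lists of protein names."""
    prev = list(range(len(path_b) + 1))  # distance from the empty prefix
    for i, x in enumerate(path_a, 1):
        cur = [i]
        for j, y in enumerate(path_b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # keep or replace
        prev = cur
    return prev[-1]
```

For example, a retrieved path that misses one protein of the original path, or that contains one extra protein, has a distance of one.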
We also measured the sensitivity of our algorithm with respect to parameter changes. For this purpose, we used the human hedgehog signaling pathway shown in Figure 4A to query the D. melanogaster network using different sets of parameters. For each parameter setting, we compared the proteins in the top 10 retrieved paths with the proteins included in the top 10 paths based on the original setting (λth = 0.5 and Δ = Δi = Δd = 12) that was used to obtain the results in Section 3.3. To estimate the relative changes of the retrieved proteins, we computed the following values:
$$r_I = \frac{C_I}{C}, \qquad r_D = \frac{C_D}{C}$$
where C is the number of proteins in the top retrieved paths based on the original setting, CI is the number of proteins that were added as a result of the parameter change, and CD is the number of proteins that were removed from the top paths due to the change.
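These two ratios are straightforward to compute from the sets of proteins in the retrieved paths. The sketch below assumes that each protein is counted once, no matter how many of the top paths contain it. That counting convention and the function name are assumptions made for illustration.

```python
def relative_changes(original_paths, changed_paths):
    """Return (r_I, r_D) for two collections of retrieved paths.

    Each argument is a list of paths, and each path is a list of protein names.
    """
    base = {p for path in original_paths for p in path}  # original setting
    new = {p for path in changed_paths for p in path}    # changed setting
    c = len(base)
    r_i = len(new - base) / c  # proteins added by the parameter change
    r_d = len(base - new) / c  # proteins removed by the parameter change
    return r_i, r_d
```

Identical sets of retrieved proteins give r_I = r_D = 0, which is the situation reported for most parameter settings in Table 3.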
Table 3 shows the relative number of added proteins and that of removed proteins for different sets of parameters. As we can see, the retrieved results are not significantly affected by parameter changes. For large ranges of λth (from 0.05 to 1.0) and Δ = Δi = Δd (from 6 to 18), the set of top 10 retrieved paths stayed the same. Even for larger values of λth (= 5.0) and Δ = Δi = Δd (= 26), the changes in the retrieved results were relatively small. In addition to measuring the relative changes in the predicted results, we computed the p-values of the top paths based on 100 random networks. In all cases, the p-values ranged between 1.1 × 10⁻²⁶ and 6.2 × 10⁻¹⁸, showing that the predictions made by our HMM-based method are statistically significant. This explains why the retrieved paths were so close to each other. As we can see from these results, our HMM-based algorithm performs robustly for a considerably large range of parameter values, especially when the retrieved paths are biologically meaningful. The predicted results may have larger variations when the retrieved paths are not statistically significant, as in the case of querying the E. coli network using the yeast MAP kinase pathway, described in Section 3.2.
Table 3.
The value of Δ = Δi = Δd is given in parentheses.
λth | rI (Δ = 6) | rD (Δ = 6) | rI (Δ = 12) | rD (Δ = 12) | rI (Δ = 18) | rD (Δ = 18) | rI (Δ = 26) | rD (Δ = 26) |
---|---|---|---|---|---|---|---|---|
0.05 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.03 |
0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.03 |
0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.03 |
1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.03 |
5.0 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.09 | 0.05 |
We computed the relative number of proteins that were added to or removed from the predicted results due to the parameter changes. The changes have been measured by comparing the predicted results to the results obtained using the original parameters (λth = 0.5 and Δ = Δi = Δd = 12) employed in Section 3.3. The relative number of added proteins is denoted by rI , and the relative number of removed proteins is denoted by rD.
4. Discussion
There exist a number of algorithms that can be applied to pathway search (Kelley et al., 2003; Pinter et al., 2005; Scott et al., 2006; Shlomi et al., 2006; Yang and Sze, 2007). Among these algorithms, PathBLAST (Kelley et al., 2003) focuses on finding conserved pathways by comparing two networks, and this reduces to a pathway search problem if one of the networks is a specific pathway. However, PathBLAST does not allow consecutive deletions of proteins in the query path nor consecutive insertions of proteins in the matching path, limiting its applicability to closely matching pathways. Furthermore, its computational complexity contains factorial factors in the query length, making it impractical for long paths that contain more than six to seven proteins.
To solve these problems, two algorithms based on the color coding technique have been proposed (Scott et al., 2006; Shlomi et al., 2006), which can search for simple paths of length around 10. Although these algorithms have significantly reduced the running time, their algorithmic complexity is still exponential in the query length. Hence, they quickly become infeasible for slightly longer queries. In addition, the use of randomized algorithms does not guarantee the optimality of the query result.
A more recent algorithm called PathMatch (Yang and Sze, 2007) reduces the pathway search problem to the problem of finding the longest weighted path in a directed acyclic graph by relaxing the constraint for finding simple paths. This allows top matching paths to be found in polynomial time and obviates the need for randomized algorithms. However, PathMatch has limited flexibility in the choice of the scoring scheme, since mismatches and indels are treated in an identical manner and it is difficult to use different penalties for different mismatches or indels.
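To illustrate why the reduction to a directed acyclic graph yields a polynomial-time search, the following generic sketch finds the longest weighted path in a DAG by dynamic programming over a topological order. It does not reproduce the specific graph construction of PathMatch; the function name and the toy edge list are hypothetical.

```python
from collections import defaultdict, deque


def longest_weighted_path(edges):
    """Longest weighted path in a DAG given as (u, v, weight) triples.

    Returns (best_score, path). A path may start at any node.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0.0, []

    # Kahn's algorithm: repeatedly remove nodes with no remaining predecessors.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Relax every edge once, in topological order.
    best = {n: 0.0 for n in nodes}
    prev = {n: None for n in nodes}
    for u in order:
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Trace back from the best end node.
    end = max(nodes, key=lambda n: best[n])
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    path.reverse()
    return best[end], path


dag_edges = [("s", "a", 2.0), ("s", "b", 1.0), ("a", "t", 1.0), ("b", "t", 5.0)]
dag_score, dag_path = longest_weighted_path(dag_edges)
```

Each edge is examined a constant number of times, so the running time is linear in the size of the graph.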
When compared to the existing methods, our HMM-based method has a significantly lower computational complexity that is linear with respect to the query length and the number of edges in the network, and it can search for query paths with more than 10 proteins in a network with thousands of nodes and tens of thousands of interactions within a few minutes. An important advantage of the HMM-based framework is that it is very flexible in the choice of the scoring scheme. We can use different penalties for mismatches, insertions, and deletions, and it is also possible to assign different penalties to different types of mismatches. Furthermore, our algorithm allows the matching paths to contain any number of consecutive deletions and insertions. We can also use an affine gap penalty model for scoring consecutive deletions, where gap openings and gap extensions are treated differently.
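The linear dependence on the query length and the number of edges can be seen in a simplified, match-only dynamic program, sketched below. This is not the HMM-based algorithm itself: it omits the insertion and deletion states, returns only the single best path instead of the top k, and uses hypothetical names and toy scores. Like the full method, it does not enforce simple paths, so a network node may in principle be visited more than once.

```python
def best_matching_walk(query, nodes, edges, sim):
    """Best-scoring walk that aligns one network node to each query protein.

    query: list of query proteins
    nodes: list of network nodes
    edges: dict mapping a directed pair (u, v) to an interaction score
    sim:   function (query_protein, node) -> similarity score
    Runs in O(len(query) * len(edges)) time.
    """
    neg_inf = float("-inf")
    score = {v: sim(query[0], v) for v in nodes}
    back_pointers = []
    for q in query[1:]:
        new_score = {v: neg_inf for v in nodes}
        pointer = {v: None for v in nodes}
        for (u, v), w in edges.items():  # one pass over all edges per query position
            if score[u] == neg_inf:
                continue
            candidate = score[u] + w + sim(q, v)
            if candidate > new_score[v]:
                new_score[v] = candidate
                pointer[v] = u
        back_pointers.append(pointer)
        score = new_score

    end = max(nodes, key=lambda v: score[v])
    if score[end] == neg_inf:
        return neg_inf, []
    path = [end]
    for pointer in reversed(back_pointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return score[end], path


# Toy network; an undirected interaction is stored in both directions.
edges = {}
for u, v, w in [("a", "b", 1.0), ("b", "c", 1.0), ("b", "d", 0.5)]:
    edges[(u, v)] = w
    edges[(v, u)] = w

similarity = {("q1", "a"): 3.0, ("q2", "b"): 3.0, ("q3", "c"): 3.0, ("q3", "d"): 1.0}
walk_score, walk = best_matching_walk(
    ["q1", "q2", "q3"],
    ["a", "b", "c", "d"],
    edges,
    lambda q, v: similarity.get((q, v), 0.0),
)
```

Extending the state space with insertion and deletion states multiplies the work per query position by a constant factor, which preserves the linear growth in the query length.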
Similar to the PathMatch algorithm (Yang and Sze, 2007), we may in principle have a repeated occurrence of a network node vj in the retrieved path. Such repeated occurrences are not frequently observed in practice unless all the proteins in the query p are very similar to each other. As mentioned in Yang and Sze (2007), these limited occurrences of repeats can be biologically useful in identifying proteins that have multiple roles in a signaling pathway, and there exist many known examples of such multi-functional proteins (Teichmann et al., 2001). We did not observe many repeated occurrences of the same protein in our retrieved paths.
In our tests, we have used relatively simple non-stochastic scores for parameterizing the HMMs. We have shown that, even with this simple scoring scheme, our retrieved paths are closer to the putative homologous D. melanogaster pathways in KEGG than the paths reported in Shlomi et al. (2006). Considering the flexibility of the proposed framework, it would be beneficial to use a more elaborate scoring scheme in the future. For example, we may incorporate additional information, such as gene ontology (GO) annotations and gene expression data (Sharan et al., 2005; Shlomi et al., 2006), for more reliable estimations of the HMM parameters. We may also incorporate methods for evaluating the reliabilities of protein interactions (Bader et al., 2004; Sharan et al., 2005; von Mering et al., 2002) to obtain more robust transition scores for the HMMs. Although the pathway alignment score S(p, q) used in this paper incorporates only the interaction reliability of the protein network, we can easily incorporate the interaction reliability of the query as well. This can be achieved by adding an additional term for the reliability of the interaction between p_{t−1} and p_t in the recursive equations, Equations (3) and (5). As a final remark, the HMM-based method proposed in this paper is currently limited to linear queries, and we are investigating the possibility of extending the framework to support more general queries such as trees.
Acknowledgments
This work was supported in part by the National Cancer Institute (grant 2-R25CA090301-06 to X.Q.) and the National Science Foundation (grant DBI-0624077 to S.S.).
Disclosure Statement
No conflicting financial interests exist.
References
- Akutsu T., Kuhara S., Maruyama O., et al. Identification of gene regulatory networks by strategic gene disruptions and gene overexpressions. Proc. 9th Annu. ACM-SIAM Symp. Discrete Algo. 1998:695–702.
- Bader J., Chaudhuri A., Rothberg J.M., et al. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 2004;22:78–85. doi: 10.1038/nbt924.
- Chang C., Stewart R.C. The two-component system. Regulation of diverse signaling pathways in prokaryotes and eukaryotes. Plant Physiol. 1998;117:723–731. doi: 10.1104/pp.117.3.723.
- Dost B., Shlomi T., Gupta N., et al. QNet: a tool for querying protein interaction networks. J. Comput. Biol. 2008;15:913–925. doi: 10.1089/cmb.2007.0172.
- Durbin R., Eddy S.R., Krogh A., et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; Cambridge, UK: 1998.
- Flybase Consortium. FlyBase: the Drosophila database. Nucleic Acids Res. 1996;24:53–56. doi: 10.1093/nar/24.1.53.
- Gustin M.C., Albertyn J., Alexander M., et al. MAP kinase pathways in the yeast Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev. 1998;62:1264–1300. doi: 10.1128/mmbr.62.4.1264-1300.1998.
- Ito T., Chiba T., Ozawa R., et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA. 2001;98:4569–4574. doi: 10.1073/pnas.061034498.
- Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
- Karlin S., Altschul S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA. 1990;87:2264–2268. doi: 10.1073/pnas.87.6.2264.
- Kelley B.P., Sharan R., Karp R.M., et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. USA. 2003;100:11394–11399. doi: 10.1073/pnas.1534710100.
- Koyutürk M., Grama A., Szpankowski W. An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics. 2004;20:SI200–SI207. doi: 10.1093/bioinformatics/bth919.
- Lum L., Beachy P.A. The Hedgehog response network: sensors, switches, and routers. Science. 2004;304:1755–1759. doi: 10.1126/science.1098020.
- Mann M., Hendrickson R., Pandey A. Analysis of proteins and proteomes by mass spectrometry. Annu. Rev. Biochem. 2001;70:437–473. doi: 10.1146/annurev.biochem.70.1.437.
- Pagni M., Jongeneel C.V. Making sense of score statistics for sequence alignments. Brief Bioinform. 2001;2:51–67. doi: 10.1093/bib/2.1.51.
- Pearson W.R. Effective protein sequence comparison. Methods Enzymol. 1996;266:227–258. doi: 10.1016/s0076-6879(96)66017-0.
- Pearson W.R., Lipman D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444.
- Pinter R.Y., Rokhlenko O., Yeger-Lotem E., et al. Alignment of metabolic pathways. Bioinformatics. 2005;21:3401–3408. doi: 10.1093/bioinformatics/bti554.
- Scott J., Ideker T., Karp R.M., et al. Efficient algorithms for detecting signaling pathways in protein interaction networks. J. Comput. Biol. 2006;13:133–144. doi: 10.1089/cmb.2006.13.133.
- Sharan R., Suthram S., Kelley R.M., et al. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. USA. 2005;102:1974–1979. doi: 10.1073/pnas.0409522102.
- Shlomi T., Segal D., Ruppin E., et al. QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinform. 2006;7:199. doi: 10.1186/1471-2105-7-199.
- Singh R., Xu J., Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA. 2008;105:12763–12768. doi: 10.1073/pnas.0806627105.
- Steffen M., Petti A., Aach J., et al. Automated modelling of signal transduction networks. BMC Bioinform. 2002;3:34. doi: 10.1186/1471-2105-3-34.
- Stein L., Sternberg P., Durbin R., et al. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001;29:82–86. doi: 10.1093/nar/29.1.82.
- Teichmann S.A., Rison S.C., Thornton J.M., et al. Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol. 2001;19:482–486. doi: 10.1016/s0167-7799(01)01813-3.
- Uetz P., Rajagopala S.V., Dong Y.A., et al. From ORFeomes to protein interaction maps in viruses. Genome Res. 2004;14:2029–2033. doi: 10.1101/gr.2583304.
- von Mering C., Krause R., Snel B., et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417:399–403. doi: 10.1038/nature750.
- Xenarios I., Salwinski L., Duan X.J., et al. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30:303–305. doi: 10.1093/nar/30.1.303.
- Yang Q., Sze S.H. Path matching and graph matching in biological networks. J. Comput. Biol. 2007;14:56–67. doi: 10.1089/cmb.2006.0076.