Skip to main content
BMC Systems Biology logoLink to BMC Systems Biology
. 2010 Sep 13;4(Suppl 2):S13. doi: 10.1186/1752-0509-4-S2-S13

Comparing biological networks via graph compression

Morihiro Hayashida 1,, Tatsuya Akutsu 1,
PMCID: PMC2982687  PMID: 20840727

Abstract

Background

Comparison of various kinds of biological data is one of the main problems in bioinformatics and systems biology. Data compression methods have been applied to comparison of large sequence data and protein structure data. Since it is still difficult to compare global structures of large biological networks, it is reasonable to try to apply data compression methods to comparison of biological networks. In existing compression methods, the uniqueness of compression results is not guaranteed because there is some ambiguity in selection of overlapping edges.

Results

This paper proposes novel efficient methods, CompressEdge and CompressVertices, for comparing large biological networks. In the proposed methods, an original network structure is compressed by iteratively contracting identical edges and sets of connected edges. Then, the similarity of two networks is measured by a compression ratio of the concatenated networks. The proposed methods are applied to comparison of metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis, and are compared with an existing method. These results suggest that our methods can efficiently measure the similarities between metabolic networks.

Conclusions

Our proposed algorithms, which compress node-labeled networks, are useful for measuring the similarity of large biological networks.

Background

Development of algorithms for comparing various kinds of biological data is one of the important topics in bioinformatics and systems biology. Methods for comparison of DNA and/or protein sequences have been extensively studied and have been applied to analyses of real sequence data quite successfully. Due to increased interests in systems biology, extensive studies have recently been done on comparison of biological networks.

For comparison of metabolic networks, Ogata et al. developed a method based on clustering [1], Tohsato et al. extended a multiple sequence alignment technique to multiple alignment of metabolic pathways using a scoring scheme based on EC (Enzyme Commission) numbers [2], Pinter et al. applied a tree matching technique to alignment of metabolic pathways [3], and Wernicke and Rasche developed a simple backtracking algorithm utilizing the local diversity property [4]. For comparison of protein-protein interaction networks, Kelley et al. developed PathBlast using dynamic programming [5], Liang et al. developed NetAlign using a clique-based method for computing maximal common subgraphs [6], Li et al. developed MNAligner using integer quadratic programming [7], Singh et al. developed IsoRank algorithm based on Google's PageRank method [8], and Zaslavskiy et al. developed a gradient ascent-based method and a message passing-based method [9].

On the other hand, data compression methods have been applied to comparison of large sequence data [10,11] and protein structure data [12,13]. Since it is still difficult to compare global structures of large biological networks and data compression-based methods can be applied to comparison of large-scale sequence data, it is reasonable to try to apply data compression methods to comparison of biological networks. In this paper, we propose such methods.

In order to apply data compression to biological networks, data compression methods for graphs are required. For compression of graphs, Adler and Mitzenmacher developed a method based of Huffman coding of vertices [14], Peshkin developed GRAPHITOUR based on iterative contractions of identical edges [15], and Cook and Holder developed SUBDUE based on contraction of frequent subgraphs and MDL (minimum description length) principle [16], which was further extended to EDIF for lossless compression by Yang et al. [17]. However, the method by Adler and Mitzenmacher does not seem to be useful for comparison of networks because it does not make much use of structural information. In GRAPHITOUR, the uniqueness of compression results is not guaranteed because there is some ambiguity in selection of overlapping edges (isomorphic graphs may be differently compressed depending on the orderings of vertices in input data), which is not suitable for comparison of network structures. This point is also unclear in EDIF and SUBDUE. Therefore, we develop in this paper novel graph compression methods for which it is guaranteed that two isomorphic graphs are compressed in the same way. Using these compression methods, we measure the similarity of two networks by means of the universal similarity metric (USM) proposed by Li et al.[11]. USM is defined using Kolmogorov complexity which represents the amount of information contained in data, and is obtained by removing redundant parts maximally. Therefore, Kolmogorov complexities are approximated by compression sizes.

We apply the proposed methods to comparison of metabolic networks, and compare the results with those of GRAPHITOUR. The results of hierarchical clustering for several organisms suggest that the proposed methods outperforms GRAPHITOUR, and can adequately measure the similarities between metabolic networks.

Methods

Graph compression method

Since our proposed methods are based on GRAPHITOUR, we briefly review the GRAPHITOUR algorithm [15]. GRAPHITOUR is based on iterative contractions of identical edges. In order to efficiently contract edges, GRAPHITOUR selects edges appearing most frequently, and solves an instance of maximum cardinality matching problem, which finds as many edges as possible such that no two edges share a common vertex.

Figure 1 shows an example of contraction of identical edges. The graph of (A) contains 4 edges labeled with 'a' and 'b', 2 edges with 'a' and 'a', 1 edge with 'b' and 'b', and 1 edge with 'a' and 'c'. GRAPHITOUR selects edges with 'a' and 'b' because they appear most frequently, and solves the maximum cardinality matching problem for their edges. However, optimal solutions are not necessarily uniquely determined. (B) shows a contracted graph after the top-left edge with 'a' and 'b' is substituted with a new vertex labeled with 'ab'. On the other hand, (C) shows a contracted graph after the top-right edge with 'a' and 'b' is substituted. This example implies that GRAPHITOUR can generate different compressed graphs. In order to measure the similarity of networks, the same compressed graph should always be obtained. Therefore, we improve GRAPHITOUR for that purpose, and propose the following algorithm, which we call CompressEdge, to uniquely determine contracted edges in each iteration.

Figure 1.

Figure 1

Example of contraction of identical edges. (A) Graph with 7 vertices and 8 edges. (B, C) Graphs contracted from graph (A) by GRAPHITOUR. (D) Graph contracted from graph (A) by our proposed method. (E) Graph contracted from graph (A) by substituting two edges ('a', 'a') and ('a', 'b') with 'aab'. (F) Graph contracted from graph (A) by substituting three vertices and the edges between the vertices with 'aab'.

Procedure CompressEdge

Input: undirected graph G(V, E) with labeled vertices V and edges E (a total order ≤ is defined for the set of labels L, and each νV is labeled with lνL);

Output: induced compression rules R and compressed graph;

Begin

R := ∅;

s(l) := {l} for each label lL;

while |E| > 0

ε(l1, l2) := {(ν1, ν2) ∈ E|Inline graphic = (l1, l2) where Inline graphic, l1l2};

ε := {ε(l1, l2)| no two edges in ε(l1, l2) share a common vertex};

if ε = ∅ then return (R,G);

εmost := {ε(l1, l2) ∈ ε| |ε(l1, l2)| ≥ |ε(l3, l4)| for all ε(l3, l4) ∈ ε};

selectε(l1, l2)(∈ εmost) such that s(l1) ∪ s(l2) <s(l3) ∪ s(l4)

or (s(l1) ∪ s(l2) = s(l3) ∪ s(l4) and (l1, l2) < (l3, l4)) for all ε(l3, l4) ∈ εmost,

where l1l2, l3l4;

add a new label ln to L such that ln >l for all lL;

s(ln) := s(l1) ∪ s(l2);

R = R ∪ {ln ← (l1, l2)};

for each edge eε(l1, l2)

substitutee with a new vertex labeled with ln;

return (R, G);

End

This proposed algorithm avoids contraction of edges which share a common vertex. In the example of Figure 1, our algorithm does not choose edges whose endpoints are 'a' and 'b', instead chooses the second candidate edges whose endpoints are 'a' and 'a', and obtains the graph of (D) as the result. It should be noted that the proposed algorithm does not solve the maximum cardinality matching problem because it selects only edges such that all edges with the same labels do not share a common vertex.

However, it is not sufficient to uniquely determine contracted edges because there can be more than one set which has the same number of edges, that is, |εmost| > 1. Therefore, we introduce a total order to sets of labels to determine the priority of edges. Each edge has a pair of labels (l1, l2) corresponding to two endpoints of the edge. Let s1 and s2 be sets of labels. We can define a total order for s1 and s2 as follows.

First, we sort s1 and s2 by descending order, respectively. We compare i-th elements Inline graphic ,Inline graphic of s1 and s2, and define s1 <s2 if Inline graphic <Inline graphic and Inline graphic = Inline graphic(for all j <i) hold for some i. The proposed algorithm selects edges with the smallest set of labels from εmost according to the total order. For example, if we compare s1 = {l1, l3} with s2 = {l3, l2} under l1 <l2 <l3, s1 and s2 are sorted as (l3, l1) and (l3, l2), respectively, and we have s1 <s2.

When edges with (l1, l2) are contracted, a new label ln is added to L, where ln >l for all l (≠ ln) ∈ L. In computational experiments, Morgan index [18] based on graph structures is assigned to each vertex. However, new added labels themselves do not reflect the original graph structure. Therefore, in order to make effective use of the total order of original labels, we introduce a set of labels for each label l, s (l), which consists of only original labels. Then, s (ln) is defined to be s (l1) ∪ s (l2) when (l1, l2) is substituted with ln. The algorithm compares s (l1) ∪ s(l2) with s (l3) ∪ s (l4) before comparing edges of (l1, l2) and (l3, l4). For example, for the graph of Figure 1D, the algorithm selects edges with ('aa', 'b') as contracted edges because it appears most frequently. However, if there is another edge with ('b', 'b') than shown in Figure 1D, edges of ('aa', 'b') and ('b', 'b') are compared. We suppose that 'a'<'b'<'c'<'aa' and 'aa' was obtained by contracting edges with ('a', 'a'). Then, the corresponding sets to ('aa', 'b') and ('b', 'b'), s1 = s('aa') ∪ s('b') = {'a','a','b'} and s2 = s('b') ∪ s('b') = {'b','b'}, are compared, sorted as {'b','a','a'} and {'b','b'}, respectively, and we have s1 <s2. Then, edges with ('aa', 'b') are selected, and contracted to vertices with a new label 'aab', where 'aab'>'aa' and s ('aab') = s ('aa') ∪ s('b') = {'a','a','b'}.

Extension to contraction of multiple edges

In the previous algorithm, identical edges are contracted at each iteration. In this section, we propose another algorithm, which we call CompressVertices, to contract identical sets of multiple connected edges. In order to uniquely determine the contracted sets of connected edges, we must introduce a total order to a set of edges. For that purpose, we apply degree sequence [19], which is defined to be the non-increasing sequence of degrees of vertices. For example, the degree sequence of Figure 1A is (3, 3, 3, 2, 2, 2, 1). Moreover, in order to distinguish labels of vertices, we introduce the non-increasing sequence of pairs of the degree and the label for each vertex included in the set of edges, which we call dl-sequence. In dl-sequence, the degree is not calculated for the original graph, but is for the set of edges, and we define the inequality of elements of dl-sequence by; (d1, l1) > (d2, l2) if d1 >d2 or (d1 = d2 and l1 >l2). Then, we can define a total order for dl-sequences Inline graphic and Inline graphic byInline graphic and Inline graphic hold for some i. For example, the dl-sequence of two connected edges of ('a', 'a') and ('a', 'b') in Figure 1A, that is a–a–b, is ((2, 'a'), (1, 'b'), (1, 'a')). Here, in the example, if the edges are substituted with a new vertex, a self loop is remained to the new vertex due to the edge of (' a', 'b') as shown in Figure 1E. The remained edges should be also contracted. Therefore, we modify CompressEdge, and propose the following algorithm to contract identical sets of vertices and the edges between the vertices instead of individual edges. In the example of Figure 1E, the graph of Figure 1F is obtained by this algorithm.

Procedure CompressVertices(M)

Input: maximum number of vertices M and undirected graph G(V, E) with labeled vertices V and edges E

Output: induced compression rules R and compressed graph;

Begin

R := ∅;

s(l) := {l} for each label lL;

while |E| > 0

V = ∅; m := 2;

whileV = ∅ and mM

(V, dli) := SelectVertices(m);

m := m + 1;

if V = ∅ then return (R, G);

add a new label ln to L such that ln >l for all lL;

graphic file with name 1752-0509-4-S2-S13-i11.gif

R = R ∪ {lndli};

for each set of vertices uV

substituteu with a new vertex labeled with ln;

return (R, G);

End

Subprocedure SelectVertices(m)

Begin

V(dli := ((d1, l1), …, (dm, lm))) := {{ν1, …, νm} ⊂ V | ν1, …, νm are connected and (the dl-sequence of {ν1, …, vm})= dli};

V := {V(dli)| u1u2 = ∅ for all u1,u2V(dli)};

if V = ∅ then return (∅, ∅);

Vmost := {V(dli) ∈ V| |V(dli)| ≥ |V(dlj)| for all V(dlj) ∈ V};

selectV(dli)(∈ Vmost) such that Inline graphic

or (si = sj and dli <dlj) for all V(dlj) ∈ Vmost;

return (V(dli),dli);

End

First, the algorithm tries to find two vertices and the edge between the vertices to be contracted. If it does not find such two vertices, it tries to find more than two vertices and the edges between the vertices until it finds or up to the given number of vertices, M. Figure 2 shows an example of contraction by the algorithm. The graph of (A) contains 4 edges labeled with 'a' and 'b', 2 edges with 'a' and 'c', and 2 edges with 'b' and 'c'. However, it cannot select any edge because they are overlapping each other. Therefore, it tries to find three vertices to be contracted. (B) shows the candidate sets of vertices {' a', 'b', 'c '} and three edges ('a', 'b'), ('b', 'c'), and ('c', 'a'), which appear most frequently two times in the graph. However, they are overlapping again. Finally, it selects the candidate sets of (C), vertices {'a', 'b', 'b'} and two edges of ('a', 'b'), which appear also most frequently. (D) shows the graph contracted from the graph (A) by substituting the selected vertices and edges with a new vertex labeled with 'abb'.

Figure 2 .

Figure 2

Example of contraction by CompressVertices Example of contraction of identical sets of vertices and the edges between the vertices by CompressVertices. (A) Graph with 7 vertices and 8 edges. (B) Candidate vertices {' a', 'b', 'c '} and three edges ('a', 'b'), ('b', 'c'), and ('c', 'a') to be contracted. (C) Candidate vertices {'a', 'b', 'b'} and two edges of ('a', 'b') to be contracted. (D) Graph contracted from graph (A) by substituting vertices {'a', 'b', 'b'} and the edges with 'abb'.

CompressVertices(2) of M = 2 is equivalent to CompressEdge. It should be noted that the algorithm uniquely determines the contracted sets of connected edges in each iteration although different subgraphs can be substituted with vertices of the same label in M ≥ 4. For example, both graphs of a-b-a-b and a-a-b-b are represented as ((2, 'b'), (2, 'a'), (1, 'b'), (1, 'a')) in dl-sequence. It is to be noted that for three vertices, different subgraphs cannot have the same dl-sequence. The algorithm is still efficient because it compares dl-sequences instead of comparing subgraphs, where subgraph isomorphism problem is known to be in NP-complete.

Similarity measure

The universal similarity metric (USM) was proposed by Li et al.[11], and has been applied to several biological data [12,13]. USM between two objects o1 and o2 is defined using Kolmogorov complexity K(o) as follows:

graphic file with name 1752-0509-4-S2-S13-i13.gif

(1)

where Inline graphic denote shortest programs for generating o1,o2, respectively.

Kolmogorov complexity K(o) of an object o is defined to be the length of the shortest program P for a universal Turing machine U which outputs o, and the conditional Kolmogorov complexity of o1 given o2 is defined to be the length of the shortest program P which outputs o1 when o2 is given as follows:

graphic file with name 1752-0509-4-S2-S13-i15.gif

(2)

It should be noted that K(o) is considered as a measure of the amount of information that the object o contains.

Since it is not possible to obtain these Kolmogorov complexities for real data, we approximate K(G) of a graph G by C(G) = |R| + |EC|, where |R| means the number of rules extracted from G by our method, and |EC| means the number of remaining edges after the compression of G. The conditional Kolmogorov complexity K(G1|G2) of G1 given G2 can be approximated to be C(G1G2) − C(G2) as in [12,13], where G1G2 means the concatenated graph G′(V′, E′) of G1(V1, E1) and G2(V2, E2) such that V′ = V1V2, E′ = E1E2, |V′| = |V1| + |V2| and |E′| = |E1| + |E2|. Even if there are identical vertices (i.e. vertices with identical labels) between G1 and G2, they are added to V′ as different vertices.

Substituting K(o) of Eq.(1) with C(G), the approximated USM for graph compression, GUSM, between two graphs G1 and G2 is given as follows:

graphic file with name 1752-0509-4-S2-S13-i16.gif

(3)

It should be noted that GU SM(G, G) = 0 if |Ec| =0 for C(G). If G1 and G2 are similar, then G1 and G2 are generated from almost the same set of rules R, that is, C(G1G2) ≈ C(G1) ≈ C(G2) ≈ |R|, and GU SM(G1 ,G2) approaches to 0.

Results and discussion

To evaluate the proposed measure, we used metabolic pathways for several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis, from the KEGG database [20] (see Table 1). We used chemical compounds as nodes and chemical reactions as edges. For evaluation purposes, it is not appropriate to use protein-protein interaction networks because the accuracy of the networks is not sufficient [21]. Furthermore, we compared the results with those of GRAPHITOUR because other methods such as PathBLAST [5], NetAlign [6], and MNAligner [7] do not give the similarity of networks, and focus on finding interesting subnetworks or most similar subnetworks to a query network.

Table 1.

Statistics of metabolic pathways

organism # nodes # edges
H. sapiens 1550 1673
M. musculus 1518 1640
A. thaliana 1389 1395
D. melanogaster 1238 1250
C. elegans 1049 1009
E. coli 1103 1256
S. cerevisiae 983 1028
B. subtilis 994 1065

Statistics of metabolic pathways for several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis.

In our first computational experiment, all nodes in the metabolic networks were labeled with chemical compounds, and there was only one edge having the same labels, that is, |ε(l1, l2)| = 1. Then, our compression algorithm for G(V, E) produced rules R and the remaining graph Gc(Vc, Ec) as C(G) = |R| + |Ec| = |E|. This means that G is not compressed. Many other methods also cannot select frequent subnetworks for such networks.

Since we would like to compare network structures for the metabolic networks, we replaced labels with Morgan index [18]. Figure 3 shows an example of calculation of Morgan index. First, 1 is assigned to each node. Next, the sum of values of adjacent nodes is assigned for each node. This iteration is repeated until the number of different values of Morgan index does not increase. We call the index obtained in this way the original Morgan index. Morgan index obtained by one iteration of this procedure is equivalent to the degree of each node, and Morgan index depends on graph structures.

Figure 3 .

Figure 3

Example of calculation of Morgan index. (A) 1 is assigned to each vertex. (B) first iteration. (C) second iteration.

We fixed the number of iterations of the Morgan index procedure, applied our compression algorithms to individual and concatenated metabolic networks, G1, G2, G1G2, and calculated GU SM (G1, G2) from C(G1), C(G2) and C(G1G2). To confirm that our compression algorithms work for measuring the similarity of metabolic networks, we obtained hierarchical clustering results using the nearest neighbor (single linkage) method, and compared them with actual phylogenetic trees and hierarchical clustering results by GRAPHITOUR. Moreover, we performed such experiments with several numbers of iterations from 1 to 20 because the number of iterations of the original Morgan index is at most 11 for the metabolic networks.

Figure 4 shows the results of hierarchical clustering using CompressEdge for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis with Morgan indices of 1, 2, 3, 6, 11, and 12 iterations. The numbers of contracted edges for the metabolic network of H. sapiens with Morgan indices of 1, 2, 3, 6, 11, and 12 iterations were 251, 1367, 1387, 1395, 1395, and 1395, respectively. The results of more than 5 iterations were almost similar to those of 12 iterations. Figure 5 shows the results on the number of different values of Morgan indices for the metabolic networks for 1-20 iterations of the Morgan index procedure. We can see from this figure that the number of different values of Morgan indices is almost constant for more than 11 iterations. For a small number of iterations, it is considered that metabolic networks were not compressed well because many edges have the same labels and share common nodes. This means that the number of iterations is required to be large for measuring the similarity more accurately. However, for that purpose, much larger numbers than the number of iterations of the original Morgan indices, that is at most 11, are not needed because the number of different values of Morgan indices is almost constant for more than 11 iterations (see Figure 5). According to the results of hierarchical clustering in Figure 4, H. sapiens was always the nearest to M. musculus. Bacterial organisms of B. subtilis and E. coli were furthest from H. sapiens in the result of 12 iterations. It is considered that the result of 12 iterations is almost consistent to actual phylogenetic trees. This suggests that the proposed method can adequately measure the similarities between metabolic networks.

Figure 4.

Figure 4

Results of hierarchical clustering using CompressEdge for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis with Morgan indices of 1, 2, 3, 6, 11, and 12 iterations.

Figure 5 .

Figure 5

Results on the number of different values of Morgan indices for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis for 1-20 iterations of the Morgan index procedure.

Figure 6 shows the results of hierarchical clustering using CompressVertices(3) for metabolic networks. The result of 3 iterations was already similar to the results of more than 5 iterations. It is considered that there are more overlapping edges in networks for a smaller number of iterations, and our proposed method contracted their edges well for a small number of iterations.

Figure 6.

Figure 6

Results of hierarchical clustering using CompressVertices(3) for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis with Morgan indices of 1, 2, 3, and 6 iterations.

Figure 7 shows the results of hierarchical clustering using GRAPHITOUR for metabolic networks. In the result of 10 iterations, E. coli was farthest from H. sapiens. The result of 12 iterations was not well clustered because D. melanogaster is likely to be closer to H. sapiens than to C. elegans [22]. These suggest that our proposed method can measure the similarities better than GRAPHITOUR.

Figure 7.

Figure 7

Results of hierarchical clustering using GRAPHITOUR for metabolic networks of several organisms, H. sapiens, M. musculus, A. thaliana, D. melanogaster, C. elegans, E. coli, S. cerevisiae, and B. subtilis with Morgan indices of 1, 2, 3, 6, 10, and 12 iterations.

Furthermore, the proposed methods are efficient. The computational time was at most 9 seconds in CompressEdge, 17 seconds in CompressVertices(3), even for the concatenated network of H. sapiens and M. musculus with Morgan index of 12 iterations. These experiments were done in a single processor core on a PC with Xeon X5460 3.16GHz CPUs and 8GB memory under the Linux (version 2.6) operating system, where the g++ compiler was used with optimization option -O3.

We also compared our proposed methods with simple methods based on node and edge numbers. We calculated distances between two organisms as the differences of the number of nodes or edges, and obtained hierarchical clustering results using the nearest neighbor method. We can see from the results (Figures S1 and S2 in the supplementary information web page) that H. sapiens was the nearest to M. musculus, but other relations were different from actual phylogenetic trees. This result shows that our proposed methods are better than those based on node or edge numbers. In addition, our methods generate rules to contract edges iteratively. It means that they try to find hierarchical substructures for a given graph. We defined the similarity between graphs, GUSM, based on the number of such rules. It can be considered that GUSM implies whether hierarchical substructures are also similar, or not.

We proposed two methods, CompressEdge and CompressVertices(M), where M denotes the maximum number of vertices to be contracted at a time. It should be noted that the minimum number is 2 because a vertex itself cannot be contracted. CompressEdge is equivalent to CompressVertices(2). If a graph has many overlapping edges, CompressEdge (and CompressVertices(2)) would not extract many rules, that is, many edges are not contracted and are remained. Consider a graph G that all edges of G are overlapping. The similarity between G and itself, GU SM (G, G), must be 0. However, GU SM (G, G) = 1 because any edge of G is not contracted, C (G) = |Ec| and C (GG) = 2|Ec|. Therefore, CompressVertices(M) for M > 2 is needed.

Conclusions

In this paper, we have proposed novel methods for compressing biological networks. One of the important properties of the proposed methods is that isomorphic networks are compressed in the same way. Unlike many other methods comparing biological networks, our methods are able to give the similarity metric of their networks. Moreover, our methods are very efficient because they do not compare subgraphs directly. We have applied the proposed compression methods to comparison of metabolic networks, and compared the results with those of an existing method. The results suggest that the proposed compression methods are useful for comparison of biological networks.

Our proposed methods were applied to only undirected graphs in this paper. However, it is possible to extend our algorithms to deal with directed graphs. It is easy to extend CompressEdge, and the modified method can still be efficient. Our methods are efficient as well as GRAPHITOUR, and it is important to keep the efficiency after extending them. Metabolic networks can also be represented as directed graphs and undirected graphs using chemical compounds as nodes and the relation between compounds (e.g., involve the same reaction) as edges. It is an interesting issue to examine whether our methods are robust for any representation of a network.

It is considered in general that random networks are more difficult to be compressed than scale-free networks. However, our methods cannot compress metabolic networks that are known as scale-free networks because all nodes in metabolic networks are labeled with distinct chemical compounds, and there is only one edge having the same labels. Our proposed methods are useful only for comparison of networks. If the same labels can appear multiple times, it is expected that our methods can also differentiate these networks. However, it is difficult to compare them in a simple way because the compression size depends on the distribution of node labels. Since we do not have realistic models for generation of random and scale-free networks with node labels, application of our proposed methods to differentiation of random networks from scale-free networks is left as future work.

Though we have applied the methods to comparison of networks, the application is not limited to comparison. It might be applied to detection of network motifs with hierarchical structures because our method iteratively compresses edges (edges can be replaced by small subgraphs).

One drawback of our proposed compression methods is that it is not a lossless compression method (i.e., the original network cannot be reconstructed from compressed data). Therefore, improvement of the method towards lossless compression is also important future work.

Availability

The source code of CompareVertices is available through the supplementary information web page (http://sunflower.kuicr.kyoto-u.ac.jp/morihiro/gracomp/).

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Methods were developed by MH and TA. The manuscript was prepared by MH and TA.

Contributor Information

Morihiro Hayashida, Email: morihiro@kuicr.kyoto-u.ac.jp.

Tatsuya Akutsu, Email: takutsu@kuicr.kyoto-u.ac.jp.

Acknowledgements

This work was partially supported by Grants-in-Aid #19650053 and #21700323 from MEXT, Japan.

This article has been published as part of BMC Systems Biology Volume 4 Supplement 2, 2010: Selected articles from the Third International Symposium on Optimization and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S2

References

  1. Ogata H, Fujibuchi W, Goto S, Kanehisa M. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Research. 2000;28:4021–4028. doi: 10.1093/nar/28.20.4021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Tohsato Y, Matsuda H, Hashimoto A. In: In Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology . Bourne PE, Gribskov M, Altman RB, Jensen N, Hope DA, Lengauer T, Mitchell JC, Scheeff ED, Smith C, Strande S, Weissig H, editor. 2000. A multiple alignment algorithm for metabolic pathway analysis using enzyme hierarchy. pp. 376–383. [PubMed] [Google Scholar]
  3. Pinter RY, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson N. Alignment of metabolic pathways. Bioinformatics. 2005;21:3401–3408. doi: 10.1093/bioinformatics/bti554. [DOI] [PubMed] [Google Scholar]
  4. Wernicke S, Rasche F. Simple and fast alignment of metabolic pathways by exploiting local diversity. Bioinformatics. 2007;23:1978–1985. doi: 10.1093/bioinformatics/btm279. [DOI] [PubMed] [Google Scholar]
  5. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Research. 2004;32:W83–W88. doi: 10.1093/nar/gkh411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Liang Z, Xu M, Teng M, Niu L. NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics. 2006;22:2175–2177. doi: 10.1093/bioinformatics/btl287. [DOI] [PubMed] [Google Scholar]
  7. Li Z, Zhang S, Wang Y, Zhang XS, Chen L. Alignment of molecular networks by integer quadratic programming. Bioinformatics. 2007;23:1631–1639. doi: 10.1093/bioinformatics/btm156. [DOI] [PubMed] [Google Scholar]
  8. Singh R, Xu J, Berger B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA. 2008;105:12763–12768. doi: 10.1073/pnas.0806627105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Zaslavskiy M, Bach F, Vert JP. Global alignment of protein-protein interaction networks by graph matching methods. Bioinformatics. 2009;25:i259–i267. doi: 10.1093/bioinformatics/btp196. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Kocsor A, Kertész-Farkas A, Kaiàn L, Pongor S. Application of compression-based distance measures to protein sequence classification: a methodological study. Bioinformatics. 2005;22:407–412. doi: 10.1093/bioinformatics/bti806. [DOI] [PubMed] [Google Scholar]
  11. Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics. 2001;17:149–154. doi: 10.1093/bioinformatics/17.2.149. [DOI] [PubMed] [Google Scholar]
  12. Hayashida M, Akutsu T. In: In Proc. 6th Asia-Pacific Bioinformatics Conference . Brazma A, Miyano S, Akutsu T, editor. 2008. Image compression-based approach to measuring the similarity of protein structures. pp. 221–230. [Google Scholar]
  13. Krasnogor N, Pelta DA. Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics. 2004;20:1015–1021. doi: 10.1093/bioinformatics/bth031. [DOI] [PubMed] [Google Scholar]
  14. Adler M, Mitzenmacher M. In: Towards compressing web graphs. In Proc. 2001 Data Compression Conference . Storer JA, Cohn M, editor. 2001. pp. 203–212. [Google Scholar]
  15. Peshkin L. In: In Proc. 2007 Data Compression Conference . Storer JA, Cohn M, editor. 2007. Structure induction by lossless graph compression. pp. 53–62. [Google Scholar]
  16. Cook DJ, Holder LB. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research. 1994;1:231–255. [Google Scholar]
  17. Yang J, Savari SA, Mencer O. In: An approach to graph and netlist compression. In Proc. 2008 Data Compression Conference. Storer JA, Cohn M, editor. 2008. pp. 33–42. [Google Scholar]
  18. Morgan H. The generation of unique machine description for chemical structures - a technique developped at chemical abstracts service. Journal of Chemical Documentation. 1965. pp. 107–113.
  19. Diestel R. In Graph Theory. 3. Heidelberg: Springer-Verlag; p. 278. [Google Scholar]
  20. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y. KEGG for linking genomes to life and the environment. Nucleic Acids Research. 2008;36:D480–D484. doi: 10.1093/nar/gkm882. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci., USA. 2001;98:4569–4574. doi: 10.1073/pnas.061034498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Mushegian AR, Garey JR, Martin J, Liu LX. Large-scale taxonomic profiling of eukaryotic model organisms: A comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Research. 1998;8:590–598. doi: 10.1101/gr.8.6.590. [DOI] [PubMed] [Google Scholar]

Articles from BMC Systems Biology are provided here courtesy of BMC

RESOURCES