Abstract
Background
Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.
Results
This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It implements an effective data storage based also on low-support data mining.
Conclusions
GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique, which applies to any searching system, also allows a significant reduction in index space.
Background
Application domains such as bioinformatics and cheminformatics represent data as graphs where nodes are basic elements (e.g. proteins or atoms) and edges model relations among them. In these domains, graph searching plays a key role. For example, in computational biology locating subgraphs matching a specific topology is useful to find motifs of networks that may have functional relevance. In drug discovery, the main task is to find novel bioactive molecules, i.e., chemical compounds that, for example, protect human cells against a virus. One way to support the solution of this task is to analyze a database of known and tested molecules with the aim of building a classifier which predicts whether a novel molecule will be active or not. Future chemical tests can then focus on the most promising candidates. Users may ask to find molecules containing the query graph (exact search) or subgraphs similar to the one described by the query (approximate querying) (see Figure 1 for an example).
The graph searching problem can be formalized as follows. Given a database of graphs D = {G1, G2,…, Gn} (e.g. a collection of molecules) and a query graph Q (e.g. a pattern), find all graphs in D containing Q as a subgraph. Moreover, all occurrences of Q in those graphs should be detected. In many important applications, D consists of a single huge graph. Since most of these problems involve solving the (sub)graph isomorphism problem, an efficient exact solution is unlikely to exist. To keep searching time acceptable, research efforts have focused on improving the following steps [1,2].
1. Reduce the search space by filtering. For a database of graphs, a filter limits the search to only possible candidate graphs. For a single-graph database, only the possible candidate subgraphs are identified. The common idea is to extract structural features of graphs and store them in a global index. When a query graph is presented, its own structural features are extracted and compared with the features stored in the index to check compatibility [3-6]. Most existing systems use subgraphs of small size (typically not larger than 10 nodes). However, even though small subgraphs are used, the size of the index and its construction time may be high. Therefore, high-support/high-confidence mining rules are used to index only frequent and non-redundant subgraphs (a subgraph is redundant when its presence in a graph can be predicted by the presence of its subgraphs) [7-9].
2. Store Data. In order to scale to very large databases of graphs, indexing structures and data must be stored in secondary memory. Applications make use of advanced database management systems [4].
3. Match. After candidate graphs have been selected, an exhaustive search on these graphs must be performed. This step is implemented either by traditional (sub)graph-to-graph matching techniques [10,11] or by an implementation on an extension of the SQL algebra [12].
In this paper, GraphFind, an enhancement of the application-independent graph searching system GraphGrep [5,12], is presented. Experiments show that GraphFind outperforms the compared systems on a very large collection of small molecules, available at the web site of the National Cancer Institute [13]. A key feature of GraphFind is the use of a low-support data mining technique (Min-Hashing [14]) to reduce the index size. It is shown that such a mining technique can be successfully applied to enhance other systems such as gIndex [7].
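To make the exhaustive search of step 3 concrete, the following is a minimal sketch of a plain backtracking matcher. It is not the VF2 algorithm [11] nor the SQL-based approach [12]; it only illustrates what "finding all occurrences" means. The function name `find_embeddings`, the adjacency-dictionary representation, and the choice of non-induced matching (query edges must be present in the data graph, extra data edges are allowed) are assumptions made for illustration.

```python
def find_embeddings(q_adj, q_labels, g_adj, g_labels):
    """Return every injective mapping query-node -> data-node that
    preserves node labels and query edges (non-induced matching).

    q_adj, g_adj: dict node -> collection of neighbours (undirected)
    q_labels, g_labels: dict node -> label
    Hypothetical helper, exponential in the worst case.
    """
    q_nodes = list(q_adj)
    results = []

    def backtrack(i, mapping, used):
        if i == len(q_nodes):
            results.append(dict(mapping))
            return
        u = q_nodes[i]
        for v in g_adj:
            if v in used or g_labels[v] != q_labels[u]:
                continue
            # Every already-mapped neighbour of u must be adjacent to v.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                backtrack(i + 1, mapping, used)
                del mapping[u]
                used.discard(v)

    backtrack(0, {}, set())
    return results


# Query: a C-O edge. Data graph: the chain C-C-O.
query_adj = {"a": ["b"], "b": ["a"]}
query_labels = {"a": "C", "b": "O"}
data_adj = {0: [1], 1: [0, 2], 2: [1]}
data_labels = {0: "C", 1: "C", 2: "O"}
print(find_embeddings(query_adj, query_labels, data_adj, data_labels))
```

Because every candidate graph goes through a search of this kind, the filtering step matters: each graph discarded by the index avoids one exponential-time verification.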
Results and discussion
Approach
GraphFind locates all exact and approximate occurrences of a query graph in collections of graphs. It combines filtering techniques described in [5] with a recent matching algorithm [11]. Each graph is stored as a set of small subgraphs. At query time, such a representation allows the selection of candidate subgraphs. GraphFind is implemented on top of Berkeley DB [15] to store both indexing structures and data graphs. A low-support data mining technique (Min-Hashing [14]) is applied to reduce the index size of GraphFind and gIndex [7].
Related compared systems
GraphGrep [5,12] finds all exact and approximate occurrences of a query graph in collections of graphs. Approximate queries are special subgraphs that may contain: (a) nodes with a special wildcard symbol “?”, that can match any node; (b) approximate paths (represented by a wildcard symbol “*”) which are paths of any length that can connect two nodes. GraphGrep enumerates all small subgraphs (say, paths with no more than 4 nodes) in the database together with all their occurrence sites and the number of such occurrences. Matching is performed by combining such occurrences, making use of an extension of the classical SQL algebra [12].
Daylight [3] is a commercial system for searching databases of molecules. The index of each graph is a fixed-size bit vector. It enumerates all existing small paths in a graph, hashes them, and adds them to the vector. A disadvantage of such an approach is that different and unrelated paths may “collide” at the same bit position. An academic, freely available emulation of Daylight, called Frowns [10], makes use of an efficient subgraph matching algorithm [11]. All the above systems are designed to optimize query time, at the cost of a large preprocessing time.
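The hashed bit-vector idea and its collision problem can be sketched as follows. This is a simplified illustration, not Daylight's or Frowns' actual scheme: the function names, the use of SHA-1, and the choice of setting a single bit per path are all assumptions.

```python
import hashlib


def path_bit(label_path, n_bits):
    """Map a label-path (tuple of labels) to one bit position."""
    digest = hashlib.sha1("-".join(label_path).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def bit_fingerprint(label_paths, n_bits=64):
    """Fixed-size fingerprint, stored as a Python int used as a bit vector."""
    bits = 0
    for path in label_paths:
        bits |= 1 << path_bit(path, n_bits)
    return bits


def may_contain(graph_bits, query_bits):
    """Necessary condition only: every query bit must be set in the graph."""
    return (graph_bits & query_bits) == query_bits


graph_fp = bit_fingerprint([("C", "C"), ("C", "O"), ("C", "C", "O")])
query_fp = bit_fingerprint([("C", "O")])
print(may_contain(graph_fp, query_fp))
```

With a very small vector, unrelated paths necessarily share bit positions, so `may_contain` can return true for a graph that does not contain the query path. The filter therefore never loses a true answer but may pass false candidates to the matching step.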
Data mining techniques have been applied to reduce the space and time complexity of index construction. Related recent work includes [8,9] (stable releases of such software are upcoming). gIndex [7] represents the state of the art in this area. The key ideas of gIndex are: (i) index frequent subgraphs with a size-increasing support function; (ii) represent them in a canonical form (strings); and (iii) store such strings in a prefix tree.
Although there is a long history of research on indexing for exact searching in databases of graphs, only recently have indexing structures for approximate search been proposed [16-18]. SAGA [18] appears to be the most flexible system. It finds subgraphs of a query which are similar (allowing node gaps, node mismatches and graph structural differences) to subgraphs in the database. Algorithms for network alignment [19] such as NetworkBlast [20] may be used to find approximate occurrences of a query path in a single graph. The main difference between those systems and GraphFind is that GraphFind users may specify precisely, at query construction time, which nodes or paths are approximate. Thus, GraphFind cannot be directly compared with those systems, because it controls the semantics of the output precisely.
Results
In order to evaluate the performance of GraphFind, we have compared it with the main graph search systems (GraphGrep [12], GFrowns, and gIndex [7]). GFrowns is an implementation of the system Frowns [10] that deals with general graphs. Experiments show that GraphFind scaled better than gIndex on the tested databases. In addition, GraphFind improves on our previous system GraphGrep, which is commonly used in the literature as a test system. Experimental analysis was performed on a Pentium IV with 1GB of memory running Linux. All algorithms were implemented in C++.
Test sets
To test the proposed system, a database of 40000 molecules, available at the web site of the National Cancer Institute [13], was used. It contains sparse graphs having from 20 to 270 nodes. The database was divided into subsets of size ranging from 1000 to 40000 molecules.
Systems were tested using a set of 40 queries drawn from the molecules database. The number of nodes, for each query, ranges from 4 to 32. Query time is given as the sum of filtering time and matching time.
Experiments on a single-graph database were performed using synthetic data described in [21]. The Min-Hashing technique was analyzed using both the synthetic and the molecule databases.
Comparisons
Figure 2 reports preprocessing time and index size of GraphFind, GraphGrep, GFrowns, and gIndex on the molecule databases. With lp = 4, GraphFind and gIndex were comparable and faster than the other systems. However, GFrowns and gIndex outperformed the others on index space. Note that the maximum database size that gIndex could handle was 16000 molecules. Figure 3 reports querying time. gIndex and GraphFind showed comparable behavior.
Preprocessing time and index size of GraphFind using lp = 10 are considerably higher than those obtained using lp = 4 (see Figure 2). Results of GraphGrep and GFrowns with lp = 10 are not reported since they are clearly outperformed by GraphFind and gIndex. The querying time of GraphFind (lp = 10) is not shown since it does not yield any speed-up with respect to the case of lp = 4.
Compared with lp = 4, gIndex filtering with lp = 10 discards more graphs but is slower.
In Figure 4, the performances of GraphFind, GraphGrep and GFrowns on a single large graph are shown. Although the amount of space required by GraphFind is high, querying is very efficient. gIndex is not reported since it does not handle graphs with thousands of nodes.
Figure 5 reports the performance of GraphFind on approximate queries on a database of 8000 molecules available at [13]. As expected, by increasing the allowed degree of approximation in a query, the execution time and the number of matching subgraphs returned by such a query grow.
Finally, the Min-Hashing algorithm was applied to reduce the index size of GraphFind and gIndex (see Figure 6). The running time of Min-Hashing does not affect the preprocessing performance (less than one percent of the total time in all tests). However, the index size is considerably reduced in both GraphFind and gIndex.
Conclusions
This paper has presented GraphFind, an application-independent graph searching system that enhances GraphGrep. The system allows exact and approximate graph searching where the approximations can be precisely specified. Comparisons with competitive systems show that GraphFind performs well and scales better. GraphFind significantly reduces the data storage overhead with respect to GraphGrep thanks to low-support data mining. The proposed low-support mining technique, which also applies to other searching methods, reduces indexing space significantly.
GraphFind can be easily implemented in a distributed environment. The database of graphs may be distributed among several servers according to a graph similarity criterion. When graph searching is applied to a huge graph (network), the graph may be partitioned into components based on a minimum cut strategy (e.g. locate hubs and cut at them). Future work will include the design and the experimental analysis of a distributed version of GraphFind on web-scale databases. Moreover, domain-specific methods to rank outputs will be added. Datasets, software and results are freely available at [22].
Methods
GraphFind models the nodes of data graphs as having an identification number (node-id) and a label (node-label). An id-path of length n is a list of n + 1 node-ids with an unlabeled edge between any two consecutive nodes. A label-path of length n is a list of n + 1 node-labels. Label-paths and the id-paths of the graphs in a database are used to construct the index of the database and to store the data graphs.
Index construction
Let lp be a fixed positive integer. For each graph in the database and for each node, all paths that start at this node and have length from one up to lp are collected. The index is implemented using a hash table. The keys of the hash table are the hash values of the label-paths. Collisions are resolved by chaining. This hash table is referred to as the fingerprint of the database. Each graph is associated with a column, and each entry in a column is the number of occurrences of a label-path in that graph (see Figure 7).
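The path enumeration behind the fingerprint can be sketched as below. This is an illustrative reading of the description above, not the GraphFind source: the function names are hypothetical, paths are assumed to be simple (no node visited twice), and a Python `Counter` stands in for the hash table (Python dictionaries resolve collisions internally by open addressing rather than chaining).

```python
from collections import Counter


def label_path_counts(adj, labels, lp):
    """Count the label-paths of length 1..lp (in edges) starting at each node.

    adj: dict node-id -> collection of neighbour node-ids (undirected graph)
    labels: dict node-id -> node-label
    Assumption: only simple paths are enumerated.
    """
    counts = Counter()

    def extend(path):
        if len(path) > 1:  # at least one edge
            counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == lp:  # maximum length reached
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return counts


def build_fingerprint(database, lp):
    """database: dict graph-id -> (adj, labels). One column per graph."""
    return {gid: label_path_counts(adj, labels, lp)
            for gid, (adj, labels) in database.items()}


# The chain C-C-O with lp = 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(label_path_counts(adj, labels, 2))
```

In this example the label-path (C, C) is counted twice because it is found once from each endpoint, which matches the rule of collecting paths starting at every node.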
Data storage
Since several paths may contain the same label sequence, the id-paths of all the paths representing a label sequence are grouped into a label-path-set. GraphFind uses Berkeley DB [15] as the underlying database to store data graph representation and index. GraphFind stores each fingerprint as a dynamic Berkeley DB hash table of linked lists (whose keys and values are described above). Each graph is stored in a set of Berkeley DB tables each corresponding to a label-path-set (see Figure 7).
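As a small illustration of the grouping described above, the sketch below builds label-path-sets in memory. An ordinary dictionary stands in for the Berkeley DB tables, whose API is deliberately not shown; the function name `group_id_paths` is hypothetical.

```python
from collections import defaultdict


def group_id_paths(id_paths, labels):
    """Group id-paths by their label sequence.

    id_paths: iterable of tuples of node-ids
    labels: dict node-id -> node-label
    Returns a dict label-path -> list of id-paths (a label-path-set).
    """
    sets = defaultdict(list)
    for path in id_paths:
        sets[tuple(labels[v] for v in path)].append(tuple(path))
    return dict(sets)


labels = {0: "C", 1: "C", 2: "O"}
print(group_id_paths([(0, 1), (1, 0), (1, 2)], labels))
```

Keeping the id-paths grouped by label sequence is what later allows the system to fetch only the parts of a graph whose label-paths appear in the query.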
Queries
A query is an undirected labeled graph. Approximate queries are special subgraphs that may contain: (a) nodes labeled with a special wildcard symbol “?”, which can match any label; (b) approximate paths (represented by a wildcard symbol “*”) which are paths of any length that can connect two nodes.
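One possible in-memory representation of such a query is sketched below. The class and function names are hypothetical, and the rule used to delimit exact subqueries (connected components over the exact edges, with approximate paths removed) is an assumption made for illustration rather than a description of GraphFind's internals.

```python
from dataclasses import dataclass, field

WILDCARD_NODE = "?"


@dataclass
class Query:
    labels: dict   # query node-id -> label, or "?" for a wildcard node
    edges: list    # exact edges as (u, v) pairs
    approx_paths: list = field(default_factory=list)  # (u, v) pairs joined by "*"


def label_matches(query_label, data_label):
    """A "?" node matches any label; otherwise labels must be equal."""
    return query_label == WILDCARD_NODE or query_label == data_label


def exact_components(query):
    """Connected components of the query over exact edges only (union-find)."""
    parent = {v: v for v in query.labels}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in query.edges:
        parent[find(u)] = find(v)
    comps = {}
    for v in query.labels:
        comps.setdefault(find(v), set()).add(v)
    return sorted(sorted(c) for c in comps.values())


q = Query(labels={1: "C", 2: "O", 3: "?", 4: "N"},
          edges=[(1, 2), (3, 4)],
          approx_paths=[(2, 3)])
print(exact_components(q))
```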
Database filtering
The database is filtered by comparing the fingerprint of the query with the fingerprints of the graphs in the database. A database graph for which at least one value in its fingerprint is less than the corresponding value in the fingerprint of the query is filtered out. The remaining graphs are candidates for matching (see Figure 7 (Filtered Database(1))). Next, parts of the candidate graphs are filtered out as follows: (i) decompose the query into patterns and (ii) select only those id-path sets associated with patterns in the query (see Figure 7 (Filtered Database(2))). The selected id-path sets correspond to one or several subgraphs of candidate graphs. Those subgraphs are the only ones that may match the query.
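The two filtering stages can be sketched as follows, assuming fingerprints are dictionaries from label-path to occurrence count and label-path-sets are dictionaries from label-path to lists of id-paths. The function names are hypothetical, and wildcard handling is omitted.

```python
def passes_filter(graph_fp, query_fp):
    """Keep a graph only if it has at least as many occurrences of every
    query label-path as the query itself (missing paths count as 0)."""
    return all(graph_fp.get(path, 0) >= count
               for path, count in query_fp.items())


def filter_database(db_fps, query_fp):
    """First stage: candidate graph-ids."""
    return [gid for gid, fp in db_fps.items() if passes_filter(fp, query_fp)]


def select_path_sets(path_sets, query_fp):
    """Second stage: keep only the id-path sets whose label-path
    occurs in the query."""
    return {path: ids for path, ids in path_sets.items() if path in query_fp}


db_fps = {
    "g1": {("C", "O"): 2, ("C", "C"): 1},
    "g2": {("C", "O"): 1},
    "g3": {("C", "C"): 3},
}
query_fp = {("C", "O"): 2}
print(filter_database(db_fps, query_fp))
```

The test is a necessary condition only: a graph that passes may still fail the subsequent matching step, but a graph that fails cannot contain the query.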
Subgraph exact and approximate matching
After filtering, subgraph matching on the possible matching candidates is performed by applying the VF2 algorithm [11] to each candidate. This is a refinement of Ullmann's subgraph isomorphism algorithm that uses more selective feasibility rules to prune the state search space. Approximate queries are handled by independently processing, as described above, all maximal exact (completely specified) subqueries. The resulting subgraph matchings are then “joined” by checking, for each pair of query nodes connected by an approximate path, whether there is a path in the data graph (of length equal to the wildcards' values) between the corresponding matched nodes. This check is performed using depth-first search. As shown in [11], the worst-case computational complexity of the VF2 algorithm is Θ(N!N), where N is the number of nodes in the query.
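The join step can be illustrated with a depth-first search that looks for a path of a given length between two matched nodes. This is a sketch under stated assumptions: the function names are hypothetical, the path is assumed to be simple, the wildcard length is passed as an explicit number of edges, and the two partial matchings are assumed to come from different exact subqueries.

```python
def has_path_of_length(adj, src, dst, length):
    """DFS for a simple path with exactly `length` edges from src to dst."""
    def dfs(node, remaining, visited):
        if remaining == 0:
            return node == dst
        for nxt in adj[node]:
            if nxt in visited:
                continue
            # Reaching dst too early cannot lead to a simple path ending at dst.
            if nxt == dst and remaining != 1:
                continue
            if dfs(nxt, remaining - 1, visited | {nxt}):
                return True
        return False

    return dfs(src, length, {src})


def join_matches(adj, left, right, approx_paths):
    """Combine partial matchings of two exact subqueries.

    left, right: lists of dicts query-node -> data-node
    approx_paths: list of (query_u, query_v, length) constraints
    """
    joined = []
    for a in left:
        for b in right:
            if set(a.values()) & set(b.values()):
                continue  # keep the combined mapping injective
            m = {**a, **b}
            if all(has_path_of_length(adj, m[u], m[v], n)
                   for u, v, n in approx_paths):
                joined.append(m)
    return joined


chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(join_matches(chain, [{"a": 0}], [{"b": 3}, {"b": 2}], [("a", "b", 3)]))
```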
Indexing by low support data mining techniques
Let M(m,n) be the fingerprint of a graph database. Rows correspond to graphs, columns correspond to patterns, and each entry is the number of occurrences of a pattern in a graph. Two patterns are similar if they have the same number of occurrences in a large number of graphs. More precisely, let the similarity Sim(Ci, Cj) of two columns be the percentage of non-null rows in which the two columns have the same value. The aim of the Min-Hashing algorithm [14] is to quickly find pairs of columns (indexed patterns) that have a similarity greater than a given threshold s*. It generates k random permutations of the row indices of M, say $p_j : \{1,\dots,m\} \to \{1,\dots,m\}$ for $j = 1,\dots,k$, where $p_j(i)$ denotes the i-th element of the permutation $p_j$. Let $\hat{M}(k,n)$ be the corresponding signature matrix of M. Each entry $\hat{M}[j,c]$ is the index t of the first row of M, in the order given by $p_j$, in which column c is non-null. Formally, $\hat{M}[j,c] = t$ if and only if $M[p_j(t),c] \neq 0$ and $\forall s < t$, $M[p_j(s),c] = 0$. Let the similarity of the two signature columns $\hat{C}_i$ and $\hat{C}_j$ be defined as $Sim(\hat{C}_i,\hat{C}_j) = |\{l : 1 \le l \le k,\ \hat{M}[l,i] = \hat{M}[l,j]\}| / k$. In [14] the authors show that the similarity of two columns is well approximated by the similarity of the corresponding columns in the signature matrix. Consequently, finding similar columns in the matrix becomes a lightweight computation. This allows deletion of columns which are similar to others. Such a technique can be applied to any indexing system. In GraphFind it is applied to the transposed database fingerprint matrix (see Figure 7). Moreover, in GraphFind s* is not a user parameter. The system is designed to find pairs of columns (patterns) with similarity s* = 100% in the fingerprint database. Therefore, two patterns that have the same number of occurrences in each graph are represented in the matrix by a single column indexed by both patterns. Notice that, by reducing the similarity threshold s*, correctness is maintained and the compression ratio may be higher. However, this implies a loss in filtering efficiency and therefore greater searching time.
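A minimal sketch of the s* = 100% case follows, assuming M is given as a list of rows (graphs) over columns (patterns). Function names are hypothetical. Columns are first bucketed by their Min-Hash signature and then compared exactly within each bucket; this exact verification step is an addition made here for safety, since equal signatures do not by themselves guarantee equal columns. Identical columns always share a signature, so no mergeable pair is missed.

```python
import random
from collections import defaultdict


def minhash_signatures(M, k, seed=0):
    """Signature matrix with k rows: entry [j][c] is the position t of the
    first row, in the j-th random row order, where column c is non-null
    (m if the column is entirely null)."""
    m, n = len(M), len(M[0])
    rng = random.Random(seed)
    signatures = []
    for _ in range(k):
        perm = list(range(m))
        rng.shuffle(perm)
        signatures.append([
            next((t for t, r in enumerate(perm) if M[r][c] != 0), m)
            for c in range(n)
        ])
    return signatures


def merge_identical_columns(M, k=8, seed=0):
    """Return groups of column indices that are equal in every row."""
    sig = minhash_signatures(M, k, seed)
    buckets = defaultdict(list)
    for c in range(len(M[0])):
        buckets[tuple(row[c] for row in sig)].append(c)
    groups = []
    for cols in buckets.values():
        exact = defaultdict(list)  # exact check inside each bucket
        for c in cols:
            exact[tuple(row[c] for row in M)].append(c)
        groups.extend(exact.values())
    return sorted(groups)


# Columns 0 and 1 have the same counts in every graph; column 2 differs.
M = [[1, 1, 0],
     [2, 2, 3],
     [0, 0, 1]]
print(merge_identical_columns(M))
```

Each returned group can be stored as one column indexed by all of its patterns, which is the source of the index size reduction.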
Figure 6 reports the compression ratio of index size in both GraphFind and gIndex after Min-Hashing.
List of abbreviations used
3D: Three-Dimensional
DB: Database
G: Number of Graphs
GB: Gigabyte
GraphFindNT: GraphFind Fingerprint
GFrowns: Graph Frowns, Implementation of Frowns for General Graphs
L: Number of Different Node Labels
lp: Length of Label Path
OS: Operating System
N: Number of Nodes
SQL: Structured Query Language
SAGA: Substructure Index-based Approximate Graph Alignment
VF2: Graph Matching Algorithm by Vento Foggia et al.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors designed, analyzed, implemented and tested the proposed algorithm. Each author contributed equally in writing the paper. All authors read and approved the final manuscript.
Acknowledgments
We thank all the users who have downloaded our software and contributed to its improvement. We would like to thank Xifeng Yan, Philip S. Yu and Jiawei Han for providing gIndex. Some of the authors were in part supported by PROGETTO FIRB ITALY-ISRAEL grant n. RBIN04BYZ7: Algorithms for Patterns Discovery and Retrieval in discrete structures with applications to Bioinformatics.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 4, 2008: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S4.
Contributor Information
Alfredo Ferro, Email: ferro@dmi.unict.it.
Rosalba Giugno, Email: giugno@dmi.unict.it.
Misael Mongiovì, Email: mongiovi@dmi.unict.it.
Alfredo Pulvirenti, Email: apulvirenti@dmi.unict.it.
Dmitry Skripin, Email: dskripin@dmi.unict.it.
Dennis Shasha, Email: shasha@cs.nyu.edu.
References
1. Cook DJ, Holder LB. Substructure Discovery Using Minimum Description Length and Background Knowledge. Artificial Intelligence Research. 1994;1:231–255.
2. Ferro A, Giugno R, Pigola G, Pulvirenti A, Skripin D, Bader GD, Shasha D. NetMatch: a Cytoscape plugin for searching biological networks. Bioinformatics. 2007;23:910–912. doi: 10.1093/bioinformatics/btm032.
3. Daylight Chemical Information Systems. [http://www.daylight.com/].
4. Kumar S, Srinivasa S. A Database for Storage and Fast Retrieval of Structure Data. In: Dayal U, Ramamritham K, Vijayaraman TM, IEEE Computer Society, editor. Proceedings of the 19th International Conference on Data Engineering: 5-8 March 2003; Bangalore. 2003. pp. 789–791.
5. Shasha D, Wang JTL, Giugno R. Algorithmics and Applications of Tree and Graph Searching. In: Popa L, ACM, editor. Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems: 3-5 June 2002; Madison. 2002. pp. 39–52.
6. Messmer BT, Bunke H. Subgraph Isomorphism Detection in Polynomial Time on Preprocessed Model Graphs. In: Li SZ, Mital DP, Teoh EK, Wang H, Springer, editor. Recent Developments in Computer Vision, Second Asian Conference on Computer Vision: 5-8 December 1995; Singapore, Volume 1035 of Lecture Notes in Computer Science. 1995. pp. 373–382.
7. Yan X, Yu PS, Han J. Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems. 2005;30:960–993.
8. Cheng J, Ke Y, Ng W, Lu A. Fg-index: towards verification-free query processing on graph databases. In: Chan CY, Ooi BC, Zhou A, ACM, editor. Proceedings of the ACM SIGMOD International Conference on Management of Data: 12-14 June 2007; Beijing. 2007. pp. 857–872.
9. Zhang S, Hu M, Yang J. TreePi: A Novel Graph Indexing Method. In: Chirkova R, Dogac A, Ozsu T, Sellis T, IEEE Computer Society, editor. Proceedings of the 23rd International Conference on Data Engineering: 15-20 April 2007; Istanbul. 2007. pp. 966–975.
10. Frowns. [http://frowns.sourceforge.net/].
11. Cordella L, Foggia P, Sansone C, Vento M. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;26:1367–1372. doi: 10.1109/TPAMI.2004.75.
12. Giugno R, Shasha D. GraphGrep: A Fast and Universal Method for Querying Graphs. In: Kasturi R, Suen DLC, IEEE Computer Society, editor. Proceedings of the 16th International Conference on Pattern Recognition: 11-15 August 2002; Quebec. 2002. pp. 112–115.
13. National Cancer Institute. U.S. National Institute of Health. [http://www.cancer.gov/].
14. Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD, Yang C. Finding interesting associations without support pruning. IEEE Transactions on Knowledge and Data Engineering. 2001;13:64–78.
15. Berkeley DB. [http://www.sleepycat.com/].
16. Yan X, Yu PS, Han J. Substructure Similarity Search in Graph Databases. In: Özcan F, ACM, editor. Proceedings of the ACM SIGMOD International Conference on Management of Data: 14-16 June 2005; Baltimore. 2005. pp. 766–777.
17. Yan X, Zhu F, Han J, Yu PS. Searching Substructures with Superimposed Distance. In: Liu L, Reuter A, Whang K, Zhang J, IEEE Computer Society, editor. Proceedings of the 22nd International Conference on Data Engineering: 3-8 April 2006; Atlanta. 2006. pp. 88–98.
18. Tian Y, McEachin RC, Santos C, States DJ, Patel JM. SAGA: a subgraph matching tool for biological graphs. Bioinformatics. 2007;23:232–239. doi: 10.1093/bioinformatics/btl571.
19. Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nature Biotechnology. 2006;24:427–433. doi: 10.1038/nbt1196.
20. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T. Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A. 2005;102:1974–1979. doi: 10.1073/pnas.0409522102.
21. Foggia P, Sansone C, Vento M. A Database of Graphs for Isomorphism and Sub-Graph Isomorphism Benchmarking. Proceedings of the 3rd IAPR TC-15 Workshop on Graph-based Representations in Pattern Recognition: 23-25 May 2001; Ischia, CUEN. 2001. pp. 176–188.
22. CTNYU Research Lab. [http://alpha.dmi.unict.it/~ctnyu/].