Abstract
This paper proposes a scheme to reduce big graphs to small graphs. It contracts obsolete parts and regular structures into supernodes. The supernodes carry a synopsis for each query class in use, to abstract key features of the contracted parts for answering queries of . Moreover, for various types of graphs, we identify regular structures to contract. The contraction scheme provides a compact graph representation and prioritizes up-to-date data. Better still, it is generic and lossless. We show that the same contracted graph is able to support multiple query classes at the same time, no matter whether their queries are label based or not, local or non-local. Moreover, existing algorithms for these queries can be readily adapted to compute exact answers by using the synopses when possible and decontracting the supernodes only when necessary. As a proof of concept, we show how to adapt existing algorithms for subgraph isomorphism, triangle counting, shortest distance, connected component and clique decision to contracted graphs. We also provide a bounded incremental contraction algorithm in response to updates, such that its cost is determined by the size of areas affected by the updates alone, not by the entire graphs. We experimentally verify that on average, the contraction scheme reduces graphs by 71.9% and improves the evaluation of these queries by 1.69, 1.44, 1.47, 2.24 and 1.37 times, respectively.
Keywords: Graph data management, Graph contraction, Graph algorithms, Incremental computation
Introduction
There has been prevalent use of graphs in artificial intelligence, knowledge bases, search, recommendation, business transactions, fraud detection and social network analysis. Graphs in the real world are often big, e.g., transaction graphs in e-commerce companies easily have billions of nodes and trillions of edges. Worse still, graph computations are often costly, e.g., graph pattern matching via subgraph isomorphism is intractable (cf. [42]). These highlight the need for developing techniques for speeding up graph computations.
There has been a host of work on the subject, either by making graphs compact, e.g., graph summarization [67] and compression [12, 82], or speeding up query answering by building indices [81]. The prior work often targets a specific class of queries, e.g., query-preserving compression [37] and 2-hop labeling [25] are for reachability queries. In practice, however, multiple applications often run on the same graph at the same time. It is infeasible to switch compression schemes or summaries between different applications. It is also too costly to build indices for each and every query class in use.
Another challenge stems from obsolete data. As a real-life example, consider graphs converted from IT databases at a telecommunication company. The databases were developed in stages over years and have a large schema with hundreds of attributes. About 80% of the attributes were copied from earlier versions and have not been touched for years. No one can tell what these attributes are for, but no one has the gut to drop them in the fear of information loss. As a result, a large bulk of the graphs is obsolete. As another example, there are a large number of zombie accounts in Twitter. As reported by The New York Times, 71% of Lady Gaga’s followers are fake or inactive, and it is 58% for Justin Bieber. The obsolete data incur heavy time and space costs and often obscure query answers.
The challenges give rise to several questions. Is it possible to find a compact representation of graphs that is generic and lossless? That is, we want to reduce big graphs to a substantially smaller form. Moreover, using the same representation, we want to compute exact answers to queries of different classes at the same time. In addition, can the representation separate up-to-date data from obsolete components without loss of information? Can we adapt existing evaluation algorithms to the compact form, without the need for redeveloping the algorithms starting from scratch? Furthermore, can we efficiently and incrementally maintain the representation in response to updates to the original graphs?
Contributions and organization. In this paper, we propose a new approach to tackling these challenges, by extending the idea of graph contraction.
(1) A contraction scheme (Sect. 2). We propose a contraction scheme to reduce big graphs into smaller ones. It contracts obsolete components and regular structures into supernodes, and prioritizes up-to-date data. For each query class , supernodes carry a synopsis that records key features needed for answering queries of . As opposed to conventional graph summarization and compression, the scheme is generic and lossless. A contracted graph retains the same topological structure for all query classes , and the same synopses work for all queries in the same class . Only may vary for different query classes . We identify regular structures to contract in different types of graphs, and develop a (parallel) contraction algorithm.
(2) Proof of concept (Sect. 3). We show that existing query evaluation algorithms can be readily adapted to contracted graphs. In a nutshell, we extend the algorithms to handle supernodes. When answering a query Q in , we make use of the synopsis of a supernode if it carries sufficient information for answering Q, and decontract the supernode only when necessary. We pick five different query classes: subgraph isomorphism (), triangle counting (), shortest distance (), connected component () and clique decision () based on the following dichotomies:
label-based queries () versus non-label based ones (, , , );
local queries (, , ) versus non-local ones (, ); and
various degrees of topological constraints ( ).
We show how easy to adapt existing algorithms for these query classes to contracted graphs, without increasing their complexity. Better still, all these queries can be answered without decontraction of topological structures except some supernodes for obsolete parts.
(3) Incremental contraction (Sect. 4). We develop an incremental algorithm for maintaining contracted graphs in response to updates to original graphs. Such updates may change both the topological structures and timestamps (obsolete data). We show that the algorithm is bounded [77], i.e., it takes at most time, where is the size of areas affected by updates, not the size of the entire (possibly big) graph. We parallelize the algorithm to scale with large graphs.
(4) Empirical evaluation (Sect. 5). Using 10 real-life graphs, we experimentally verify the following. On average, (a) the contraction scheme reduces graphs by 71.9%, up to . (b) Contraction makes , , , and 1.69, 1.44, 1.47, 2.24 and 1.37 times faster, respectively. (c) The total space cost of our contraction scheme for the five accounts only for 12.6% of indices for [44], [75], [4] and [68]. It is 9.0% when [92] also runs on the same graph. The synopses for each take of the space. Hence, the scheme is scalable with the number of applications on the same graph. (d) Contracting obsolete data improves the efficiency of conventional queries and temporal queries by 1.64 and 1.78 times on average, respectively. (e) Our (incremental) contraction scheme scales well with graphs, e.g., it takes 33.1s to contract graphs of 1.8B edges and nodes with 20 cores.
We survey related work in Sect. 6 and identify research topics for future work in Sect. 7.
A graph contraction scheme
In this section, we first present the graph contraction scheme (Sect. 2.1). We then identify topological components to contract for different types of real-life graphs (Sect. 2.2). Moreover, we develop a contraction algorithm (Sect. 2.3) and its parallelization (Sect. 2.4).
Preliminaries. We start with basic notations.
Graphs. Assume two infinite sets and for labels and timestamps, respectively. We consider undirected graphs , where (a) V is a finite set of nodes, (b) is a bag of edges, (c) for each node , L(v) is a label in ; and (d) T is a partial function such that for each node , if T(v) is defined, it is a timestamp in that indicates the time when v or its adjacent edges were last updated.
Queries. A graph query is a computable function from a graph G to another object, e.g., a Boolean value, a number, a graph, or a relation. For instance, a graph pattern matching query is a graph pattern Q to find the set of subgraphs in G that are isomorphic to pattern Q, denoted by Q(G). A query class is a set of queries of the same “type,” e.g., all graph pattern queries. We also refer to as an application. In practice, multiple applications run on the same graph G simultaneously.
Contraction scheme
A graph contraction scheme is a triple , where (1) is a contraction function such that given a graph G, is a graph deduced from G by contracting certain subgraphs H into supernodes ; we refer to H as the subgraph contracted to , and as the contracted graph of G by ; (2) is a set of synopsis functions such that for each query class in use, there exists that annotates each supernode of with a synopsis ; and (3) is a decontraction function that restores each supernode in to its contracted subgraph H.
Example 1
Graph G in Fig. 1a is a fraction of Twitter network. A node denotes a user (u), a tweet (t), a keyword (k), or a feature of a user such as id (i), name (n), number of followers (f) and link to other accounts of the same user in other social networks (l). An edge indicates the following: (1) , a user follows another; (2) (u, t), a user posts a tweet; (3) , a tweet retweets another; (4) (t, k), a tweet tags a keyword; (5) , two keywords are highly related; (6) (u, k), a user is interested in a keyword; (7) (i, l), a user has a feature; or (8) (i, f), a user has f followers.
Fig. 1.
Graph contraction
In G, subgraphs in dashed rectangles are contracted into supernodes, yielding a contracted graph shown in Fig. 1b. Synopses for are shown in Fig. 1d and are elaborated in Sect. 3.1.
Before we formally define functions and synopsis , observe the following.
(1) The contraction scheme is generic. (a) Note that and are application independent, i.e., they remain the same no matter what query classes run on the contracted graphs. (b) While is application dependent, it is query independent, i.e., all queries use the same synopses annotated by .
(2) The contraction scheme is lossless due to synopses and decontraction function . As shown in Sect. 3, an existing algorithm for a query class can be readily adapted to contracted graph and computes exact query answers.
We next give the details of and . We aim to strike a balance between space cost and query evaluation cost. When a graph is over-contracted, i.e., when the subgraphs contracted to supernodes are too large or too small, the decontraction cost goes up although the contracted graph may take less space. Moreover, the more detailed synopses are, the less likely decontraction is needed, but the higher space overhead is incurred.
(1) Contraction function. Function contracts subgraphs in G into supernodes in . To simplify the discussion, we contract the following basic structures.
(a) Obsolete component: a connected subgraph consisting of nodes with timestamps earlier than threshold .
(b) Topological component: a subgraph with a regular structure, e.g., clique, star, path and butterfly.
Different types of graphs have different regular substructures, e.g., cliques are ubiquitous and effective in social networks while paths are only effective in road networks. In Sect. 2.2, we will identify what regular structures H to contract in different types of graphs.
We contract subgraphs with the number of nodes in the range to avoid over-contraction (see Sects. 2.3 and 5 for the choices).
Contraction function maps each node v in graph G to a supernode in contracted graph , which is either a supernode if v falls in one of the subgraphs H in (a) or (b), or node v itself otherwise.
In Example 1, function maps nodes in each dashed rectangle to its corresponding supernode, e.g., , and .
Obsolete components help us prioritize up-to-date data, and topological ones reduce unnecessary checking when answering queries. As shown in Sect. 5, on average the first three regular structures and obsolete components contribute 18.3%, 14.9%, 2.8% and 63.1% to the contraction ratio, and speeds up query answering by 1.61, 1.44, 1.04 and 1.71 times, respectively.
(2) Contracted graph. For a graph G, its contracted graph by is = = , where (a) is a set of supernodes mapped from G as remarked above; (b) is a bag of superedges, where a superedge if there exist nodes and such that , and ; and (c) is the reverse function of , i.e., .
In Example 1, function maps each supernode in contracted graph of Fig. 1b back to the nodes in the corresponding rectangle in Fig. 1a, e.g., = , id), (, name), (, follower), (, link).
Intuitively, the reverse function recovers the contracted nodes and their associated labels, while the decontraction function restores the topological structures of the contracted subgraphs.
(3) Synopsis. For each query class in use, a synopsis function is in , to retain features necessary for answering queries in . For instance, when is the class of graph patterns, at each supernode , consists of the type of and the most distinguished features of , e.g., the central node of a star and the sorted node list of a path. We will give more details about in Sect. 3. As will also be seen there, and synopses taken together often suffice to answer queries in , without decontraction.
Note that not every synopsis has to reside in memory. We load to memory only if its corresponding application is currently in use.
(4) Decontraction. Function restores contracted subgraphs. For supernode , restores the edges between the nodes in , i.e., the subgraph induced by . For superedge , recovers the edges between and .
That is, the contracted subgraphs and edges are not dropped. They can be restored by when necessary. In light of , the scheme is guaranteed lossless.
For example, decontraction function restores the subgraph in Fig. 1a from supernodes, e.g., is a star with central node and leaves , , and . It also restores edges from superedges, e.g., .
Identifying regular structures
We now identify what regular structures to contract for different types of real-life graphs.
Different types of graphs. We investigated the following 10 different types of graphs: (1) social graphs: [70] and [94]; (2) communication networks: [62]; (3) citation networks: [63] and [63]; (4) Web graphs: [64] and [5]; (5) knowledge graphs: [61] and [71]; (6) collaboration networks: [2] and [15]; (7) biomedical graphs: [51]; (8) economic networks: [80]; (9) chemical graphs: [80]; and (10) road networks: [1].
Regular structures. For a certain type of graphs G, we apply a subgraph mining model to G. It returns a set of frequent subgraphs of G together with the support of each . Support metrics may vary in different mining models, e.g., [33] adopts minimum image based metric [19]. We pick subgraphs whose supports are above a threshold .
As an example, we adopt a subgraph miner [33] as . discovers all the frequent subgraphs in G that have a support above a predefined threshold, which are then manually inspected. We pick ’s with at least 4 nodes to avoid over-contraction.
As shown in Fig. 2, we found the following 6 structures in the 10 types of graphs: (a) clique: a fully-connected graph; (b) star: a single central node with neighbors; (c) path: a sequence of connected nodes with no edges between the head and tail (its two endpoints); (d) claw: a special star in which the central node has exactly 3 neighbors, denoted as its leaves; claws are quite frequent and are hence treated separately; (e) diamond: two triangles that share two endpoints; and (f) butterfly: two triangles sharing a single node.
Fig. 2.

Frequent regular structures
Note that within these structures H, the only edges allowed are those that form H. Moreover, edges are allowed from each node in H to nodes outside of H. The only exception is that for a path, only the two endpoints can connect to other nodes in the graph.
We summarize how these structures appear in the 10 types of graphs in Table 1, ordered by supports and importance from high to low. Note that different graphs have different frequent regular substructures. Cliques, stars and diamonds often occur in social graphs, while in road networks, stars, claws and paths are frequent.
Table 1.
Common structures in different types of graphs
| Graph type | Regular structure |
|---|---|
| Social graphs | Clique, star, diamond, butterfly, path |
| Communication networks | Star |
| Citation networks | Clique, star, diamond, butterfly |
| Web graphs | Star, clique, diamond |
| Knowledge graphs | Star, claw |
| Collaboration networks | Clique, star, diamond |
| Biomedical graphs | Star, clique, path |
| Economic networks | Star |
| Chemical graphs | Claw, path |
| Road networks | Star, claw, path |
Note that frequent pattern mining is conducted once for each type of graphs offline, not for each input graph. For instance, we always contract cliques, stars, diamonds, butterflies and paths for social graphs.
Contraction algorithm
We next present an algorithm to contract a given graph G, denoted as and shown in Fig. 3.
Fig. 3.

Algorithm
A tricky issue is that the contracted graphs depend on the order on the regular structures contracted. For example, if we contract diamonds first in the Twitter graph of Fig. 1a, then it contracts as a diamond; after this there are no cliques in . In contrast, if cliques are contracted first, then is extracted. As suggested by , cliques “dominate” in social graphs and hence should be “preserved” when contracting .
We adopt a deterministic order to ensure that important structures are contracted earlier and hence preserved. We order the importance of different types of regular structures in a graph G by their supports: the higher the support is, the more important the topology is. We denote by T(G) its ordered set of regular structures to contract in Table 1. Note that T(G) is determined by the type of G, e.g., social graphs, and is learned once offline regardless of individual G.
Given a graph G, algorithm first contracts all obsolete data into components to prioritize up-to-date data. Each obsolete component is a connected subgraph that contains only nodes with timestamps earlier than a threshold . It is extracted by bounded breadth-first-search () that stops at non-obsolete nodes. The remaining nodes are then either contracted into topological components, or are left as singletons.
Putting these together, we present the main driver of algorithm in Fig. 3. Given a graph G, a timestamp threshold and range , it constructs functions and of the contraction scheme. It first contracts nodes with timestamps earlier than into obsolete components (line 1). It then recalls the list T(G) of topological components to contract based on the type of graph G (line 2). Next, contracts topological components into supernodes following order T(G), and deduces and accordingly (lines 3-5). Each topological component consists of only uncontracted nodes. More specifically, it does the following.
(1) It extracts a clique by repeatedly selecting an un-contracted node that connects to all selected ones, subject to pre-selected size bounds and (see below).
(2) It extracts a star by first picking a central node , and then repeatedly selecting an un-contracted node as a leaf that is (a) connected to and (b) disconnected from all selected leaves, again subject to and .
(3) For paths, it first extracts intermediate nodes having only two neighbors that are not linked by an edge. It then finds a path consisting of only the intermediate nodes, along with two neighbors of the endpoints.
(4) For diamonds, it first selects an edge (u, v) and then picks x and y that are (a) connected to both u and v, and (b) pairwise disconnected.
(5) For butterflies, it first selects a node v that has a degree at least 4. It then checks whether there exist four neighbors u, x, y, z of node v such that exactly (u, x, v) and (y, z, v) form two triangles.
(6) For claws, it selects nodes with exactly 3 neighbors, and there is no edge between any two neighbors.
As remarked earlier, the remaining nodes that cannot be contracted into any component as above are treated as singleton, i.e., mapped to themselves by .
Example 2
Assume that timestamp threshold for graph G of Fig. 1a is larger than timestamps of nodes , , and , but is smaller than those of remaining nodes. Algorithm works as follows. (1) It first triggers bounded , and contracts , , and into an obsolete component in . (2) Since G is a social network, it contracts clique, star, diamond, butterfly and path in this order. (3) It builds a clique with nodes , ..., . (4) It picks and as central nodes for a star, and makes a star consisting of . Nodes cannot make a star due to lower bound . (5) No diamond exists. (6) It picks as central node for a butterfly and makes a butterfly . (7) It finds , and as candidate intermediate nodes for paths, and contracts them into a path with endpoints and . (8) Node is left as a singleton, and is mapped to itself by .
Range . We contract an (obsolete/topological) component H such that the number of its nodes is in the range . The reason is twofold. (1) If H is too small, a contracted graph would have an excessive number of supernodes; this leads to over-contraction with high overhead for possible decontraction and low contraction ratio. Thus, we set a lower bound . (2) We set an upper bound to avoid overlarge components and excessive superedge decontraction. We experimentally find that the best and are 4 and 500, respectively.
Diamonds, butterflies and claws have a fixed size with 4, 5 and 4 nodes, respectively, in the range above.
Complexity. Algorithm takes at most time. Indeed, (1) obsolete components can be contracted in O(|G|) time via edge-disjoint bounded s; (2) paths can be built in O(|G|) time; (3) it takes O(|G|) time to contract each clique and time for all cliques; and (4) similarly, the other regular structures can be contracted in time.
Properties. Observe the following about the contraction scheme. (1) It is lossless and is able to compute exact query answers. (2) It is generic and supports multiple applications on the same contracted graph at the same time. This is often necessary. For instance, on average 10 classes of queries run on a graph simultaneously in GDB benchmarks [32]. (3) It prioritizes up-to-date data by separating it from obsolete data. (4) It improves performance. (a) As discussed in Sect. 5, . In particular, each obsolete component is contracted into a single supernode. (b) Decontraction is often not needed. As shown in Sect. 3, none of , , , and needs to decontract any topological component, and for , and , even obsolete components do not need decontraction.
Parallel contraction algorithm
We next parallelize algorithm , denoted by , to speed up the contraction process. Note that contraction is conducted once offline, and is then incrementally maintained in response to updates (Sect. 4).
Parallel setting. Assume a master (processor) and n workers (processors) . Graph G is partitioned into n fragments by an edge-cut partitioner [17, 55], and the fragments are distributed to n workers , respectively. We adopt the model [88], which separates iterative computations into supersteps and synchronizes states after each superstep.
Parallel contraction algorithm . As shown in Fig. 4, the idea of is to leverage data-partitioned parallelism. first conducts locally on each fragment in parallel, and then contracts uncontracted “border nodes,” i.e., nodes with edges crossing fragments, by building neighbors of at most uncontracted nodes, referred to as uncontracted neighbors, which are subgraphs with uncontracted nodes.
Fig. 4.

Algorithm
More specifically, algorithm works as follows.
(1) All workers run on its local fragment in parallel (line 1), since after all, each fragment is a graph itself.
In contrast with single-thread , workers do not contract mirror nodes, i.e., nodes assigned to other fragments with edges linked to the local fragment. Adopting edge-cut partition, each node of G is assigned to a single fragment and is contracted at most once during .
(2) contracts “border nodes” (line 2-3). For each border node v, if v is not contracted, builds it uncontracted neighbors. Such neighbors are identified in parallel, coordinated by master .
(3) Master merges overlapped neighbors into one, and distributes disjoint ones to n workers (line 4-5). In this way, reduces communication cost and speeds up the process when contracting border nodes.
(4) Each worker contracts its assigned uncontracted-neighbors of border nodes, in parallel (line 6).
One can verify that each node v in G is contracted into at most one supernode . The graph contracted by may be slightly different from that of since border nodes may be contracted in different orders. One can fix this by repeating steps (1)–(4) for each of topological components following the order T(G). Nonetheless, we experimentally find that the differences are not substantial enough to worth the extra cost. Moreover, the contracted graphs of are ensured compact, i.e., they cannot be contracted further.
Proof of concept
In this section, we show that existing query evaluation algorithms can be readily adapted to the contracted graphs. As a proof of concept, we pick five query classes: (1) graph pattern matching via subgraph isomorphism (labeled queries with locality); (2) triangle counting (un-labeled queries with locality); (3) shortest distance (un-labeled and non-local queries); (4) connected component (un-labeled queries without locality); and (5) clique decision (un-labeled queries with locality). Among these, subgraph isomorphism and clique decision are intractable (cf. [42]).
Informally, when answering a query , we check whether the synopsis at a supernode has enough information for Q; it uses directly if so; otherwise it decontracts superedges adjacent to or restores the subgraph of via decontraction function . As will be seen shortly, often provides enough information to process Q at as a whole or safely skip . Thus, it suffices to answer queries in the five classes by decontracting superedges, without decontracting any topological components. Here decontraction of a superedge restores the edges between and (Sect. 2).
The main result of this section is as follows.
Theorem 1
Using linear synopsis functions,
(1) for each of and , there are existing algorithms that can be adapted to compute exact answers on contracted graphs , which decontract only supernodes of obsolete components and superedges between supernodes, not any topological components;
(2) for and , there are existing algorithms that can be adapted to and decontract no supernodes, neither topological nor obsolete components; and
(3) for , there are existing algorithms that can be adapted to and decontract neither supernodes (topological and obsolete) nor superedges.
Below we provide a constructive proof for Theorem 1 by adapting existing algorithms of the five query classes to contracted graphs one by one.
Graph pattern matching with contraction
We start with graph pattern matching ().
Preliminaries. We first review basic notations.
Pattern. A graph pattern is defined as a graph Q = (, , ), where (1) is a set of pattern nodes, (2) is a set of pattern edges, and (3) is a function that assigns a label to each .
We also investigate temporal patterns (Q, t), where Q is a pattern as above and t is a given timestamp.
To simplify the discussion, we consider connected patterns Q. This said, our algorithm can be adapted to disconnected ones. We denote by u, v pattern nodes in pattern Q, and by x, y nodes in graph G. A neighbor of node v is a node such that .
Pattern matching. A match of pattern Q in graph G is a subgraph of G that is isomorphic to Q, i.e., there exists a bijective function such that (1) for each node , ; and (2) is an edge in pattern Q iff (if and only if) is an edge in graph G. We denote by Q(G) the set of all matches of pattern Q in graph G.
A match of a temporal pattern (Q, t) in graph G is a match in Q(G) such that for each node v in , , i.e., a match of (conventional) pattern Q in which all nodes have timestamps later than t. We denote by Q(G, t) all matches of (Q, t) in G.
The graph pattern matching problem, denoted by , is to compute, given a pattern Q and a graph G, the set Q(G) of matches. Similarly, the temporal matching problem is to compute Q(G, t) for a given temporal pattern (Q, t) and a graph G, denoted by .
Graph pattern matching is widely used in graph queries [6, 40, 79, 90] and graph dependencies [36, 39].
Note that (1) patterns Q are labeled, i.e., nodes are matched by labels. Moreover, (2) Q has the locality, i.e., for any match of Q in G and any nodes , in , and are within hops when treating as undirected. Here is the diameter of Q, i.e., the maximum shortest distance between any two nodes in Q.
The decision problem of pattern matching is -complete (cf. [42]); similarly for temporal matching. A variety of algorithms have been developed for , notably [44] with indices and [28] without index. Both and can be adapted to contracted graphs as characterized in Theorem 1.
We give a constructive proof for , because (1) it is one of the most efficient algorithms for subgraph isomorphism and is followed by other algorithms e.g., [14, 78], and (2) it employs indexing to reduce redundant matches; by adapting we show that the indices for can be inherited by contracted graphs, i.e., contraction and indexing complement each other. The same algorithm works for temporal matching. The proof for is simpler (not shown).
Below we first present synopses for (Sect. 3.1.1), which are the same for both and . We then show how to adapt algorithm to contracted graphs (Sect. 3.1.2)
Contraction for
Observe that topological components have regular structures. The idea of synopses is to store the types and key features of regular structures so that we could check pattern matching without decontracting any supernodes of topological components.
The synopsis of a supernode for query class is defined as follows:
clique: = clique;
star: = star, records its central node;
path: = path, , storing all the nodes on the path in order;
diamond: = diamond, and store the two share nodes of the two triangles;
butterfly: =butterfly, records the node shared by the two triangles, and stores the two disjoint edges;
claw: =claw, stores the central node and () record its three neighbors;
obsolete component: = obsolete; and
each component maintains = , i.e., the largest timestamp of its nodes.
Node labels are stored in the reverse function of the contraction function (see Sect. 2.1).
For instance, the synopsis for each supernode in the contracted graph of Fig. 1b is given in Fig. 1d. Note that only stores the synopses of the regular structures contracted in a graph.
Properties. The synopses in have two properties. (1) Taken with the reverse function of , the synopsis of a supernode suffices to recover topological component H contracted to . For instance, given the central node and leaf nodes, a star can be uniquely determined. As a result, no supernode decontraction is needed for topological components. (2) The synopses can be constructed during the traversal of G for constructing contracted graph , as a byproduct.
We remark that the design of synopses needs domain knowledge. This said, (1) users only need to develop synopses for their applications in use, not exhaustively for all possible query classes; and (2) synopsis design is no harder than developing indexing structures.
Subgraph isomorphism
Below we first review algorithm [44] and then show how to adapt to contracted graphs.
. As shown in Fig. 5, given a graph G and a pattern Q, computes Q(G) as follows. It first rewrites pattern graph Q into a tree by performing from a start vertex (lines 1-2). Here each vertex in is a neighborhood equivalence class (NEC) that contains pattern nodes in Q having identically matching data vertices. Then, for each start vertex of each region, constructs a candidate region (), i.e., an index that maintains candidates for each NEC vertex in , via from (lines 3-4). If valid candidates are found, i.e., , enumerates all possible matches that map to following a matching order O (lines 5-6). The matching order O is decided by sorting the leaf NEC vertices based on the number of their candidate vertices. It expands Q(G) with valid matches identified in the process (line 7).
Fig. 5.

Algorithm
Algorithm . can be easily adapted to contracted graph , denoted by . As shown in Fig. 6, adopts the same logic as except minor adaptations in (line 4) and (line 7) to deal with supernodes. To see these, let H be the subgraph contracted to a supernode .
Fig. 6.

Algorithm
(1) . It adds a supernode as a candidate for a node u in Q if some node in can match u, which is checked by and . It also prunes based on , e.g., a node u in Q cannot match intermediate nodes on paths if u is in some triangle in Q; and u matches intermediate nodes on a path only if its degree is no larger than 2. No supernodes or superedges are decontracted.
(2) . Checking the existence of an edge (x, y) that matches edge is easy with synopses and functions and . Here x (resp. y) denotes a node in supernode (resp. ) in the candidates of (resp. ). When , (a) if =star or claw, (x, y) exists only if or ; (b) if = clique, (x, y) always exists; (c) if =path, (x, y) exists if x and y are next to each other in ; (d) if =diamond, (x, y) exists if at least one of x and y is the shared node or ; and (e) if =butterfly, (x, y) exists if x and y are not endpoints of the two disjoint edges in simultaneously. Hence, no topological component is decontracted by . (f) If =obsolete, it checks whether none of the labels in Q is in ; it safely skips if so, and decontracts by to check the existence of (x, y) otherwise. If x and y match distinct supernodes, it suffices to decontract superedge by .
Example 3
Query Q in Fig. 1c is to find potential friendship between users based on the retweet and shared keywords in their posted tweets. Nodes u and both have the same label u. Given Q, first chooses k as the start node, to which only and can match. For , adds and as candidates for t and , as candidate for u, and and as candidates for . Note that for obsolete supernode , none of the labels in Q is covered by ; hence, can be safely skipped. finds that matches t since there exists no edge between and . Thus, it matches with .
Similarly for , adds and as candidates for t and , as candidate for u, and and for u and . Next, finds that and match u and t by decontracting superedge ; then, matches k. However, since is an intermediate node of path , no match for can be found. Hence, match .
Analyses. One can easily verify that is correct since it has the same logic as except that it incorporates pruning strategies. While they have the same worst-case complexity, operates on , much smaller than G (see Sect. 5); moreover, its saves traversal cost and saves validation cost by pruning invalid matches.
Temporal pattern matching. Algorithm can also take a temporal pattern (Q, t) as part of its input, instead of Q. The only major difference is at construction (line 4), where a supernode is safely pruned if , when is obsolete or not. It skips a match if it contains a node v with .
Triangle counting with contraction
We next study triangle counting [26, 47], which has been used in clustering [91], cycle detection [48] and transitivity [74]. In graph G, a triangle is a clique of three vertices. The triangle counting problem is to find the total number of triangles in G, denoted by .
Similar to , is local with diameter 1. In contrast, it consists of a single query and is not labeled.
We adapt algorithm of [26] for to contracted graphs, since it is one of the most efficient algorithms [47], and it does not use indexing (as a different example from ). We show that for , the adapted algorithm needs to decontract no supernodes, neither topological components nor obsolete parts.
Contraction for
Observe that contraction function on G is equivalent to node partition of G, such that two nodes are in the same partition if they are contracted into the same supernode. The idea of synopses for is to pre-count triangles with at least two nodes in the same partition, without enumerating them. As will be seen shortly, this allows us to avoid supernode decontraction for both topological and obsolete components.
Consider a triangle (u, v, w) in G that is mapped to via . We have the following cases.
(1) If , where supernode contracts a subgraph H with node set V(H), i.e., when the three nodes of a triangle are contracted into the same supernode, then (a) when H is a clique, there are triangles inside H; (b) when H is a diamond or a butterfly, there are 2 triangles inside H; (c) when H is an obsolete component, then the number of triangles inside H can be pre-calculated, denoted by ; and (d) there are no triangles inside H otherwise.
(2) If , , where and contract subgraphs I and J, respectively, i.e., if two nodes of a triangle are contracted into the same supernode, then (a) when I is a clique, then w leads to triangles, where k is the number of the neighbors of w in I. Denote by the number of such triangles in a clique neighbor I of w. (b) Subgraph I cannot be a path since intermediate nodes on a path are not allowed to connect to nodes outside I. (c) Otherwise, nodes u and v yield k triangles, where k is the number of common neighbors of u and v in J. We denote by the number of such triangles in a common neighbor J of u and v.
(3) If , , , i.e., when the three nodes of a triangle are contracted into different supernodes, we count such triangles online and it suffices to decontract only superedges, not supernodes.
Synopsis of supernode for extends with an extra tag , which records the number of triangles pre-calculated as above. More specifically, is computed as follows. Below we use u and v to range over nodes in V(H), I to range over clique neighbors of u, and J to range over common neighbors of u, v. We define , and as above.
In a clique H, there are (1) triangles; (2) each node has triangles with its clique neighbor I; hence, . We can calculate similarly for other regular structures. Thus,
clique: ;
star: ;
path: , where and are the first and last node on the path;
claw: ;
diamond and butterfly: ,
obsolete: .
Synopses also share the properties of .
Example 4
In the contracted graph of Fig. 1b, only contracts a clique, denoted by I. Synopsis of a supernode extends with : (1) for , (a) H1 contracted to contains no triangles; thus, ; (b) I is not a neighbor of any node u in V(H1); thus, ; and (c) nodes in V(H1) have no common neighbors, i.e., no J exists for any connected ; thus, . Hence, . (2) For , =clique, and no other supernodes in are cliques. Hence, . (3) For , and have only 1 neighbor in clique I; thus, ; similarly, no J exists for any leaf u and ; thus, . Hence, . (4) Similarly, , and .
Triangle counting
We now adapt algorithm [26] to contracted graphs. The adapted algorithm is referred to as .
Algorithm . Given a graph G, assigns distinct numbers to all the nodes in G. It then enumerates triangles for each edge (u, v) by counting the common neighbors w of u and v such that and .
Algorithm . On a contracted graph with superedges decontracted, works in the same way as except that at a supernode (for both topological and obsolete components), it simply accumulates without decontraction or enumeration. It only restores superedges when necessary.
Example 5
From synopsis , directly finds 14 triangles. In , it finds two additional triangles and by restoring superedges. Thus, it finds 16 triangles in G. No supernodes of either topological or obsolete components are decontracted.
Analyses. One can verify that is correct since it counts all triangles in G once and only once. It speeds up since it works on a smaller contracted .
Temporal triangle counting. Algorithm can be adapted to count triangles with timestamp later than a given time t. It prunes a supernode if , and drops a triangle if it has a node v with .
Shortest distance with contraction
We next study the shortest distance problem.
Shortest distance. Consider an undirected weighted graph with additional weight W; for each edge e, W(e) is a positive number for the length of the edge. In a graph G, a path p from to is a sequence of nodes such that for all . The length of a path in G is simply .
The shortest distance problem, denoted by , is to compute, given a pair (u, v) of nodes in G, the shortest distance between u and v, denoted by d(u, v) [4, 25, 31].
Shortest distance has a wide range of applications, e.g., socially-sensitive search [89, 93], influential community detection [9, 56] and centrality analysis [16, 18].
As opposed to , shortest distance queries are unlabeled, i.e., the value of a query answer d(u, v) does not depend on labels. In contrast with and , is non-local, i.e., there exists no d independent of the input graph G such that .
We adapt Dijkstra’s algorithm [31] to contracted graphs, denoted by , which is one of the best known algorithms for . Just like , the adapted algorithm for decontracts no supernodes, neither topological components nor obsolete parts.
Contraction for
A path between nodes u and v can be decomposed into (1) edges between supernodes, and (2) edges within a supernode. The idea of synopses for is to pre-compute the shortest distances within supernodes to avoid supernode decontraction, for both topological and obsolete components. Edges between supernodes are recovered by superedge decontraction when necessary.
Suppose that and are nodes mapped to supernode by , i.e., . We compute the shortest distance for within the subgraph H contracted to , denoted by . The synopsis extends with a tag that is a set of triples for a path between and within , based on .:
clique: for all pairs of ;
path: , i.e., it records the path itself;
diamond, butterfly and obsolete components: .
In practice, the number of nodes in most contracted subgraphs is far below the upper bound . Indeed, diamonds and butterflies have a constant size, and we find that a clique (resp. star, path and obsolete component) typically contains 6.5 (resp. 7.3, 4.1 and 49.2) nodes. Hence, the size of a synopsis is fairly small. Note that the upper bound should be larger than typical sizes of components, since large components exist and may be more powerful for accelerating computations.
Example 6
Assume for all edges (u, v) in graph G of Fig. 1a. Then, for supernodes in the contracted graph of Fig. 1b, (1) ; (2) ; (3) , , ; and finally, (4) .
Shortest distance
We adapt algorithm [31] to contracted graphs , and refer to the adapted algorithm as .
Algorithm . Given a graph G and a pair (u, v) of nodes, finds the shortest distances from u to nodes in G in ascending order, and terminates as soon as d(u, v) is determined. It maintains a set S of nodes whose shortest distances from u are known; it initializes distance estimates , and for other nodes. At each step, moves a node w from to S that has minimal , and updates distance estimates of nodes adjacent to w accordingly.
Algorithm . is the same as except minor changes to updating distance estimates. When moving a node w from to S, suppose that is the supernode to which w is mapped, i.e., . updates distance estimates for as follows: (1) if is clique, butterfly, diamond or obsolete, update by using ; (2) if = star or claw, update by , where can be easily computed by synopsis; (3) if = path, update by for the other endpoint using ; in these cases, no supernode (for topological or obsolete components) is decontracted. updates by for all edges where , by decontracting superedge at worst, in the same way as .
Example 7
Given query on the contracted graph of Fig. 1b, works in the following steps: (1) initially, , , and for all other nodes; (2) , , by using ; (3) , by edge , and by ; by edge , and by ; similarly, and = 4, by making use of reverse function and synopsis (note that contracts a star); (4) , by edge ; and (5) , by edges , , . When moves node to S, it gets . The algorithm returns .
Analyses. By induction on the length of shortest paths, one can verify that is correct. In particular, for each node in G, when is updated by a node w that is mapped to the same supernode, the update is equivalent to a series of updates. Moreover, works on smaller contracted graphs and saves traversal cost inside contracted components without any decontraction, neither topological nor obsolete.
Temporal shortest distance. Similar to temporal and , we study temporal queries (u, v, t), where (u, v) is a pair of nodes as in , and t is a timestamp. It is to compute the shortest length of paths p from u to v such that for each node w on p, .
Algorithm can be easily adapted to temporal , by skipping nodes v with . In particular, it safely ignores a supernode if .
Connected component with contraction
We next study the connected component problem [29, 85]. In a graph G, a connected component is a maximal subgraph of G in which any two nodes are connected to each other via a path. The connected component problem, denoted as , is to compute the set of pairs (s, n) for a given graph G, where (s, n) indicates that there are n connected components in G that consist of s nodes.
Given a graph G, returns the numbers of connected components of various sizes in G. Similar to , is a non-local query, i.e., it has to traverse the entire graph when answering the query. It is also un-labeled, i.e., labels have no impact on its query answer.
This form of is used in pattern recognition [45, 53], graph partition [86] and random walk [49].
We adapt algorithm of [85] for to contracted graphs, since it is one of the most efficient algorithms. Better still, we show that the adapted algorithm decontracts neither supernodes nor superedges.
Contraction for
The synopsis for suffices for us to answer queries. Observe that each subgraph H contracted to a supernode is connected, no matter whether H is a topological component or an obsolete component. We can regard a supernode as a whole when evaluating queries, and leverage and to compute the size of connected components. We need neither additional synopses nor any decontraction.
Connected component
We now adapt algorithm [85] to contracted graphs. The adapted algorithm is referred to as .
Algorithm . We first review how works. (1) Starting from each unvisited node v in graph G, performs a depth-first-search () and collects all unvisited nodes reached in the traversal. These nodes are connected to v and are marked as visited. When no more nodes are unvisited, all visited nodes and v form a connected component. records its size s. (2) After all nodes in G are visited, groups connected components by size s and returns the aggregate (s, n).
Algorithm . On the contracted graph , works in the same way as except that (1) it only performs on , without decontracting any supernodes or superedges; and (2) the size of each connected component is aggregated as the sum of the size of all supernodes in the component.
Example 8
On the contracted graph in Fig. 1b, finds a connected component that consists of supernodes and . The size s of this component is simply the sum , i.e., . Since all the supernodes in have been visited, outputs (25, 1).
Analyses. is correct since it follows the same logic as and all contracted subgraphs are guaranteed to be connected. The algorithm takes at most time while takes O(|G|) time. Since is much smaller than G, always outperforms .
Temporal connected component. can be adapted to compute connected components with timestamp later than a given time t, by skipping nodes v with . It safely ignores a supernode if .
Clique decision with contraction
We next study a decision problem for clique. A clique in a graph G is a subgraph C in which there are edges between any two nodes; it is a k-clique if the number of nodes in C is k (i.e., ) . We consider the clique decision problem [20, 57], denoted by , to find whether there exists a k-clique in G for a given number k. is being widely used in community search [76], team formation [59] and anomaly detection [11, 65].
Similar to and , is un-labeled. In contrast with and , but similar to , it is local, i.e., all nodes in a clique are within 1 hop of each other.
The clique decision problem is known -complete (cf. [42]). A variety of algorithms have been developed for , notable of [57], which we will adapt next.
Contraction for
Observe the following. (1) Cliques in G contracted into supernodes in can help us find an initial maximum clique (see below). (2) The degree of a node can be used as an upper bound of the maximum clique containing it.
In light of these, we extend synopsis with tags and . For a subgraph H that is contracted to a supernode , the two tags record the maximum clique found in H and the maximum degree of the nodes in H, respectively. Specifically, is based on :
clique: ;
diamond and butterfly: ;
star, path and claw: ; and
obsolete component: we find a k-clique in an obsolete component online.
and is by aggregation:
node v: ; and
supernode : .
Synopses also share the properties of .
Example 9
In the contracted graph of Fig. 1b, extends with tags and as follows. Since contracts a clique, ; since contracts a butterfly, and for supernodes (star) and (path). For tag , ; similarly, , , , , and .
Clique decision
We adapt [57] to , denoted as .
Algorithm . We first review . Given a graph G, algorithm checks the existence of a k-clique in G by branch-and-bound. It branches from each node in G. Denote by C the current clique in the search, and by P the set of common neighbors of the nodes in C. (1) bounds the search from C if , or (2) branches from each node u in P to expand C. More specifically, it iteratively adds a node u from P to C and removes all those nodes in P that are not neighbors of u, enlarging C and shrinking P until P is empty. If , then C contains a k-clique and terminates with ; it returns if no k-clique is found after all branches are searched.
Algorithm . adopts the same logic as except the following: (1) it picks the maximum synopsis among all supernodes in ; a k-clique is found directly if ; and (2) it skips a supernode in if . Superedges adjacent to are skipped as well since no k-clique contains any node contracted to . Otherwise, it checks the synopsis of if contracts a topological component, or restores obsolete component H contracted to , to check cliques in the original graph G. Note that initiates the search with the largest clique contracted, by checking the synopses. Hence, cliques play a more important role than the other regular structures for .
Example 10
For query with , by of Fig. 1b, finds a 5-clique and returns true.
For query with , all supernodes except are skipped by synopses. Their adjacent superedges are skipped as well. Since only contracts a 5-clique, fails to find a 6-clique and returns false.
Analyses. One can verify that is correct since it follows the same logic as except that it adopts pruning strategies that are possible because of the use of synopses. While the two algorithms have the same worst-case complexity, starts with a supernode with a maximum clique and may find a k-clique directly; moreover, it skips a supernode as a whole by synopses, which reduces unnecessary search and validation.
Temporal k-clique. Algorithm can be adapted to find a k-clique with timestamp later than a given time t, by skipping nodes v with . Like and , it safely ignores a supernode if .
Incremental contraction
We next develop an incremental algorithm to maintain contracted graphs in response to updates to graphs G. We start with batch update , which is a sequence of edge insertions and deletions. We formulate the problem (Sect. 4.1), present the incremental algorithm (Sects. 4.2–4.3), discuss vertex updates (Sect. 4.4), and parallelize the algorithm (Sect. 4.5).
Incremental contraction problem
Updates to a graph G, denoted by , consists of (1) node updates, i.e., node insertions and deletions; and (2) edge updates, i.e., edge insertions and deletions.
Given a contraction scheme , a contracted graph , and updates , the incremental contraction problem, denoted as , is to compute (a) changes to such that , i.e., to get the contracted graph of the updated graph , where applies to ; (b) the updated synopses of affected supernodes; and (c) functions and w.r.t. the new contracted graph .
studies the maintenance of contracted graphs in response to update that may both change the topological structures of contracted graph , and refresh timestamps of nodes. As a consequence, obsolete nodes may be promoted to be non-obsolete ones if they are touched by , among other things.
Criterion. Following [77], we measure the complexity of incremental algorithms with the size of the affected area, denoted by . Here includes (a) changes to the input, (b) changes to the output, and (c) edges with at least an endpoint in (a) or (b).
An incremental algorithm is said to be bounded [77] if its complexity is determined by , not by the size |G| of the entire (possibly big) graph G.
Intuitively, is typically small in practice. When is small, so is . Hence, when is small, a bounded incremental algorithm is often far more efficient than a batch algorithm that recomputes starting from scratch, since the cost of the latter depends on the size of G, as opposed to of the former.
An incremental problem is said to be bounded if there exists a bounded incremental algorithm for it, and it is unbounded otherwise.
Challenges. Problem is nontrivial. (1) Topological components are fragile. For instance, when inserting an edge between two leaves of a star H, H is no longer a star, and its nodes may need to be merged into other topological components. (2) Refreshing timestamps by a query Q may make some obsolete nodes “fresh” and force us to reorganize obsolete and topological components. (3) When contracted graph is changed, so are their associated synopses and decontraction function.
Main result. Despite challenges, we show that bounded incremental contraction is within reach in practice.
Theorem 2
Problem is bounded for , , , and , and takes at most time.
We first give a constructive proof of Theorem 2 for edge updates, consisting of two parts: (1) the maintenance of the contracted graph and its associated decontraction function (Sect. 4.2); and (2) the maintenance of the synopses of affected supernodes (Sect. 4.3). We then give a constructive proof of Theorem 2 for vertex updates (Sect. 4.4), which is simpler.
Incremental contraction algorithm
An incremental algorithm is shown in Fig. 7, denoted by . It has three steps: preprocessing to initialize affected areas, updating to maintain contracted graph , and contracting to process refreshed singleton nodes. To simplify the discussion, we focus on how to update in response to , where consists of edge insertions and deletions; the handling of is similar.
Fig. 7.

Algorithm
(a) Preprocessing. Algorithm first identifies an initial area affected by edge update (lines 1-2). It removes “unaffecting” updates from that have no impact on (line 1), i.e., edges in that are between two supernodes when none of their nodes is an intermediate node of a path. These updates are made to corresponding subgraphs of G that are maintained by . It then refreshes timestamps of nodes u touched by edges in (line 2). Suppose that node u is mapped by to supernode with = obsolete. Then, is decomposed into singleton nodes, u is non-obsolete and is mapped to itself by . Such singleton nodes are collected in a set , as the initial area affected by . Node v is treated similarly.
Note that an unaffecting update would not become “affecting update” later on. All changes in are applied to graph G in the given order.
(b) Updating. Algorithm then updates contracted graph (lines 3-8). For each update , invokes procedure (resp. ) to update when e is to be inserted (resp. deleted) (lines 4-7). Updating may make some updates in unaffecting, which are further removed from (line 8). Moreover, some nodes may become “singleton” when a topological component is decomposed by the updates, e.g., leaves of a star. It collects such nodes in the set .
More specifically, to insert an edge , updates and adds new singleton nodes to . Suppose that u (resp. v) is mapped by to supernode (resp. ) (line 1). decomposes and into the regular structures of topological components (line 2). For instance, if , and star, u and v make a triangle with the central node; thus, decomposes the star into singleton nodes. When and , supernode is divided into two shorter paths. Note that components with less than nodes due to updates are decomposed into singleton nodes. All such singleton nodes are added to the set (line 3).
(c) Contracting. Finally, algorithm processes nodes in the set (line 10). It (a) merges nodes into neighboring supernodes; or (b) builds new components with these nodes, if possible; otherwise (c) it leaves node v as a singleton, i.e., by letting .
Example 11
Consider inserting four edges into graph G of Fig. 1a: (1) : nodes and are mapped to obsolete component , and is decomposed into singleton nodes, one for each of , , and ; then, is removed from ; (2) : it is unaffecting since and neither nor is an intermediate node of a path; (3) : it is also unaffecting; and (4) : is not a butterfly any longer, and is decomposed into singletons.
Edge deletions are handled similarly.
Analyses. Algorithm takes time: (a) the preprocessing step is in time; (b) the updating step takes time, in which updating is the dominating part; and (3) the cost of contracting into topological components is in time.
The algorithm is (a) bounded [77], since its cost is determined by alone, and (b) local [35], i.e., the changes are confined only to affected supernodes and their neighbors in the contracted graph .
Maintenance of synopses
We next show that for , , , and , (a) the number of supernodes whose synopses are affected is at most , and (2) the synopsis for each supernode can be updated in time. Hence, incremental synopses maintenance for each of , , , and takes at most time.
To see these, consider a supernode in .
(a) For , recall that stores the type and key features of (Sect. 3.1). One can see that the number of supernodes whose synopses are affected is at most , and for each such can be updated in O(1) time. Thus, the maintenance of is bounded in time due to bounds .
(b) For , synopsis extends with , which is updated by (i) clique neighbors I of nodes u in when ; (ii) itself if is clique or obsolete; and (iii) common neighbors J of connected nodes u, v in for . Thus, supernodes affected are enclosed in , which covers , and their neighbors. Moreover, for each affected can be updated in time. Thus, the maintenance of is bounded in time.
(c) For , extends with , which is confined to and can be updated in O(1) time since . Thus, the incremental maintenance of is bounded in time.
(d) For , recall that the synopsis suffices to answer queries. Hence, as in case (a), for each supernode can be updated in O(1) time, and the maintenance of is bounded in time.
(e) For , extends with and . Here is confined to and can be updated in O(1) time; is confined to and its neighbors, and can be updated in time. Thus, the maintenance of is in time.
Example 12
Continuing with Example 11, we show how to maintain in for supernodes in ; , , and are simpler since their affected synopses are confined to .
More specifically, (1) for edge insertion , supernode is decomposed into four singletons, for which synopses are defined as . (2) For (unaffecting) edge insertion , remains the same for all . (3) For (unaffecting) edge insertion , becomes a common neighbor of and ; let H denote the subgraph contracted by ; then, and . (4) When inserting edge , is decomposed into singletons. During the contraction phase, nodes are contracted into a diamond with . Node is left singleton, with .
Vertex updates
Vertex updates are a dual of edge updates [58], and can be processed accordingly. More specifically, we present incremental algorithm in Fig. 8, to deal with vertex updates. Consider node insertions and deletions.
Fig. 8.

Algorithm
(1) When inserting a new node u, algorithm first treats u as a singleton and collects it in set (lines 3-4); the node u is then contracted into a topological structure in the contracting step (line 7).
(2) When deleting a node u that is contracted into a supernode , there are three cases to consider, elaborated in of Fig. 8: (a) if is a clique, remains unchanged except that u is removed (lines 2-3); (b) if is a claw, a butterfly or an obsolete component, is decontracted and all nodes in except u are treated as singletons and are collected in set (lines 4-5); and (c) otherwise, we process u and by synopsis and add resulting singleton nodes into (lines 6-7). For instance, consider the case when contracts a star, (i) if u is the central node , is decontracted in the same way as case (b); and (ii) otherwise, remains to be a star, similar to case (a).
Similar to edge updates, contracting singleton nodes of into topological components dominates the cost of the process. One can verify that it can be done in at most time. Similarly, synopsis maintenance also takes time. Hence, incremental contraction remains bounded in the presence of vertex updates.
Parallel incremental contraction algorithm
We parallelize incremental algorithm , to speed up the incremental maintenance process.
Parallel setting. Similar to , we use a master and n workers, A contracted graph is edge-partitioned and is distributed to n workers. Each fragment consists of a part of the contracted graph and its corresponding (partial) decontraction function and synopses. For a crossing superedge between two fragments, i.e., when and are assigned to two distinct fragments, the decontraction function is maintained in both fragments.
Parallel incremental contraction. The parallel incremental algorithm is denoted by and shown in Fig. 9. To simplify the discussion, we focus on edge updates; node updates are processed similarly. It works under [88]. In a nutshell, it preprocesses crossing (super)edges (line 1). Then, all the workers run on its local fragment in parallel (line 2). After that, contracts refreshed singleton nodes into supernodes (lines 3-8) along the same lines as algorithm . Here, each fragment has its local set and all refreshed singleton nodes in can be coordinated and distributed by the master . Each node v is guaranteed to be contracted into one supernode . More specifically, algorithm works as follows.
Fig. 9.

Algorithm
(1) preprocess updated edges between two fragments (line 1), i.e., when u and v are contracted into supernodes and , and and are in two distinct fragments. Such updates are unaffecting as long as neither u nor v is an intermediate node of a path, and these updates are maintained by . Otherwise, the supernode of type path may be affected and is decomposed into singleton nodes; such refreshed singleton nodes are collected in a set as the initial area affected by . In the same way as , we refresh timestamps of obsolete nodes touched by updates.
(2) Each worker locally runs in parallel (line 2). Refreshed singleton nodes that cannot be contracted into supernodes are collected in (line 3).
(3) For each refreshed singleton node v in , build its uncontracted neighbors (of at most nodes) in parallel, similar to step (2) in (lines 4-5).
(4) Master merges overlapped neighbors into one and distributes disjoint ones to n workers (lines 6-7).
(5) Each worker contracts its assigned subgraphs, i.e., uncontracted neighbors, in parallel (line 8).
One can verify that each node v in G is contracted into one supernode (including v itself), and the contracted graph cannot be further contracted.
Experimental study
Using ten real-life graphs, we experimentally evaluated (1) the contraction ratio; (2) the speedup of the contraction scheme; (3) the impact of contracting each topological component and obsolete component; (4) the space cost of the contraction scheme compared to existing indexing methods; (5) the efficiency of the (incremental) contraction algorithms; and (6) the parallel scalability of the (incremental) contraction algorithms.
Experiment setting. We used the following datasets.
(1) Graphs. We used ten real-life graphs: three social networks [70], [94] and [10]; three Web graphs [64], [5] and [3]; three collaboration networks [2], [15] and [63]; and a road network [1]. Their sizes are shown in Table 2. We randomly generated a time series to simulate obsolete attributes, at most 70% (it is 80% for IT data of our industry collaborator). We tested obsolete components with random (temporal) queries generated on all datasets.
Table 2.
Contraction ratio (each column: CR or % of contribution to CR with/without obsolete mark)
| Graph | |V|, |E| | CR | 1st | 2nd | 3rd | Obsolete | |
|---|---|---|---|---|---|---|---|
| 81K, 1.3M | 100 | 0.176/0.286 | 7.78/27.7 | 15.44/50.71 | 4.29/14.39 | 69.69/– | |
| LiveJournal | 4M, 35M | 500 | 0.378/0.527 | 11.46/30.3 | 20.41/51.4 | 3.74/9.7 | 60.99/– |
| LivePokec | 1.6M, 22M | 500 | 0.467/0.651 | 4.46/9.91 | 35.91/77.76 | 2.32/4.83 | 54.4/– |
| 876K, 4.3M | 200 | 0.193/0.294 | 19.36/51.47 | 19.33/47.04 | 0.58/1.49 | 60.74/– | |
| NotreDame | 325K,1.1M | 200 | 0.274/0.441 | 23.16/60.64 | 9.47/26.95 | 4.56/12.4 | 62.81/– |
| GSH | 68M, 1.8B | 500 | 0.325/0.493 | 29.32/77.33 | 5.31/21.78 | 0.75/0.89 | 64.62/– |
| DBLP | 204K, 382K | 100 | 0.14/0.172 | 36.21/71.65 | 14.22/28.32 | 0.02/0.03 | 49.54/– |
| Hollywood | 1.1M, 56M | 500 | 0.239/0.534 | 17.36/71.76 | 6.05/16.46 | 3.21/11.79 | 73.38/– |
| citHepTh | 28K, 352K | 50 | 0.26/0.362 | 21.42/51.93 | 14.18/36.71 | 4.6/11.36 | 59.81/– |
| Traffic | 24M, 29M | 500 | 0.365/0.59 | 12.37/49.72 | 9.42/36.74 | 3.5/13.54 | 74.7/– |
We also generated synthetic graphs with up to 250 M nodes and 2.5 B edges, to test the parallel scalability of the (incremental) contraction algorithms.
Updates. We randomly generated edge updates , controlled by the size and a ratio of edge insertions to deletions. We kept unless stated otherwise, i.e., the size of remains stable. In the same manner, we generated vertex updates .
(2) Graph patterns. We implemented a generator for graph pattern queries controlled by three parameters: the number of pattern nodes, the number of pattern edges, and a set of labels for queries Q.
(3) Implementation. We implemented the following algorithms, all in C++. (1) Algorithms (Sect. 3.1.2), (Sect. 3.2.2), (Sect. 3.3.2), (Sect. 3.4.2), (Sect. 3.5.2), for by adapting [28] to contracted graphs; in addition, for by adapting [4] to contracted graphs. (2) Our contraction algorithm (Sect. 2.3) and its parallel version (Sect. 2.4), incremental algorithm for batch updates and its parallel version (Sect. 4). (3) The baselines include existing query evaluation algorithms: (a) [44] and [78] with indexing for , and [28] without indexing; (b) graph compression [69] for ; (c) [47] for ; (d) without indexing and [4] with indexing for [31]; (e) [85] for ; and (f) [57] for . We did not compare with summarization since it does not support any algorithm to compute exact answers for the five applications.
(4) Experimental environment. The experiments were conducted on a single-processor machine powered by Xeon 3.0 GHz with 64GB memory, running Linux. Since and synthetic graphs ran out of 32 GB memory without contraction, we used a machine with 64 GB memory. For parallel (incremental) contraction, we used 4 machines, each with 12 cores powered by Xeon 3.0 GHz, 32GB RAM, and 10Gbps NIC. Each experiment was run 5 times, and the average is reported here.
Experimental results. We now report our findings.
Exp-1: Effectiveness: Contraction ratio. We first tested the contraction ratio of our contraction scheme, defined as . Note that for each query class , CR is the same for all queries in . Moreover, all applications on G share the same contracted graph while incorporating different synopses. In addition, we report the impact of each of the first three topological components and obsolete component for each dataset, in the presence and absence of obsolete data.
As remarked in Sect. 2, we limit the nodes of contracted subgraphs within . We fixed and varied based on the size of each graph. We considered two settings: (a) when obsolete data are taken into account, with threshold , where denotes the maximum timestamp in each dataset; and (b) when we do not separate obsolete data, i.e., when . The results are reported in Table 2 for all the real-life graphs (in which each column indicates either CR or percentage of contribution to CR with/without obsolete mark). We can see the following.
(1) When , CR is on average 0.281, i.e., contraction reduces these graphs by 71.9%. When , i.e., if obsolete data are not considered, CR is 0.435. These show that real-life graphs can be effectively contracted in the presence and absence of obsolete data. Compared with the results of [38], by considering more regular structures, the contraction scheme improves the contraction ratio CR by 2.49% and 6.90% in the presence and absence of obsolete data, respectively.
(2) When obsolete data are present, the average CR is 0.34, 0.264, 0.213 and 0.365 in social networks, Web graphs, collaboration networks and road networks, respectively. When obsolete data are absent, CR is on average 0.488, 0.409, 0.356 and 0.59. The contraction scheme performs the best on collaboration networks in both settings, since such graphs exhibit evident inhomogeneities and community structures.
(3) When obsolete data are absent, on average the first three regular structures contribute 50.2%, 39.4% and 8.0% to CR, respectively. When obsolete mark is taken into account, their contribution is 18.3%, 14.9% and 2.8%, respectively. This is because nodes from these components may be moved to obsolete components.
(4) We also studied the impact of the contraction order on query evaluation. Topological components have different impacts on different types of graphs, e.g., stars, claws and paths are effective in , and cliques, stars and butterflies work better than the others in collaboration networks. Taking the order of Table 1 as the baseline, we tested the impact of (a) RE, by reversing the order, and (b) EX, by exchanging between different types of graphs, e.g., we use the order for road networks to contract social graphs. On average the CR of RE and EX is decreased by 9.42% and 7.05%, respectively. As shown in Table 3, the average slowdown of RE and EX is (a) 7.24% and 5.58% for , (b) 5.55% and 5.46% for , (c) 3.89% and 4.30% for , (d) 7.34% and 34.7% for , and (e) 2.38% and 19.1% for , respectively. These justify that the order of Table 1 is effective for most applications and most types of graphs. There are also exceptions, e.g., reversing the order for Web graphs improves the efficiency of . Recall that we contract stars, cliques and butterflies for Web graphs. For in particular, however, cliques play a more important role than the other two (Sect. 3.5); hence, contracting cliques first may work better for .
Table 3.
Slowdown (%) by RE and EX orders
| Graph | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| RE | EX | RE | EX | RE | EX | RE | EX | RE | EX | |
| 8.04 | 3.98 | 7.41 | 3.66 | 5.27 | 2.86 | 8.22 | 19.2 | 6.73 | 24.9 | |
| LiveJournal | 9.46 | 5.52 | 8.26 | 5.09 | 2.71 | 5.32 | 9.03 | 61.1 | 5.49 | 18.3 |
| LivePokec | 8.67 | 6.46 | 3.15 | 3.12 | 4.49 | 2.48 | 6.10 | 51.7 | 8.23 | 20.2 |
| 5.17 | 7.54 | 6.07 | 3.75 | 1.02 | 3.8 | 7.19 | 38.7 | 12.5 | ||
| NotreDame | 11.9 | 5.76 | 4.20 | 6.46 | 5.95 | 4.93 | 3.72 | 44.5 | 15.3 | |
| GSH | 3.52 | 6.22 | 4.59 | 6.08 | 2.78 | 4.15 | 4.25 | 32.1 | 16.4 | |
| DBLP | 2.13 | 5.53 | 11.3 | 14.2 | 4.38 | 5.31 | 18.8 | 19.6 | 5.05 | 34.2 |
| Hollywood | 6.32 | 6.39 | 2.25 | 4.73 | 3.89 | 5.81 | 5.75 | 30.3 | 3.02 | 29.3 |
| citHepTh | 7.48 | 3.24 | 3.98 | 4.91 | 2.56 | 3.23 | 7.43 | 35.5 | 7.92 | 17.1 |
| Traffic | 9.69 | 5.11 | 4.29 | 2.56 | 5.78 | 5.11 | 2.94 | 14.2 | 1.39 | 2.87 |
Exp-2: Effectiveness: query processing. We next evaluated the speedup of query processing introduced by the contraction scheme, measured by query evaluation time over original and contracted graphs.
Subgraph isomorphism. Varying the size of pattern queries from 4 to 7, we tested , and on and as G, [69] on the compressed graph, and and on the contracted graph of G. For each query, we output the first matches. As shown in Fig. 10a, b, (1) on average, on is1.69, 1.49 and 18.85 times faster than , and , respectively; (2) beats by 9.31 times; (3) without indices is only 19.1% slower than with indices, while and are 10.1 and 8.97 times faster than , respectively; and (4) the speedup is more substantial on collaboration networks, e.g., 2.11 times on , because cliques are prevalent in such graphs and are the most effective structure for due to the high capacity in pruning invalid matches.
Fig. 10.
Performance evaluation
Triangle counting. As shown in Fig. 10c, the results for are consistent with the results on subgraph isomorphism: (1) on the contracted is on average 1.44 times faster than on their original graphs G. (2) The speedup is more evident in collaboration networks: e.g., on is 1.57 times faster than while it is 1.47, 1.45 and 1.28 times on , and , respectively. spends more than 1000 seconds on (hence not shown).
Shortest distance. The results for are consistent with the results on . As reported in Fig. 10d, is 1.64 and 1.36 times faster than on and , respectively, by reducing search space and employing synopses. could not build indices on within 64G memory, while successfully builds indices on (smaller) contracted . On average, spends 94.2 to evaluate a query on . On other smaller datasets, in contrast, is 18% slower than due to overhead on supernodes.
Connected component. As shown in Fig. 10e over , , , and for social graphs, Web graphs, collaboration networks and road networks, respectively, the results for are consistent with the results on and : (1) algorithm on contracted graph is on average 2.24 times faster than on the original graph G, since operates on the smaller without decontracting supernodes or superedges. (2) The speedup is more evident in collaborations networks: e.g., on is 2.87 times faster than , since the contraction scheme performs the best on such graphs and the time complexity of is linear in the size of the contracted graph.
Clique decision. As also shown in Fig. 10f, (1) algorithm is 1.32, 1.54, 1.52 and 1.08 times faster than on , , and , respectively, by using synopses to start with an initial maximum clique that may find a k-clique directly. (2) The speedup is less evident in road networks. For road networks, the contraction scheme contracts stars, claws and paths into supernodes; hence, we can only find a 2-clique (an edge) as the initial maximum clique by using synopses, which is trivial and useless.
The results on the other graphs are consistent.
Temporal queries. Fixing pattern size |Q| = 4 and varying timestamp t in temporal queries from to , we tested , , , and . As shown in Fig. 10g–k on , (1) is on average 1.81 and 1.77 times faster than and , respectively; outperforms by 7.83 times. (2) The average speedup for , , and is 1.58, 2.31, 1.66 and 1.31 times, respectively. (3) The speedup is larger for temporal queries than for conventional ones since temporal information maintained in synopsis provides additional capacity to skip more supernodes, as expected. (4) It is more substantial for larger t on .
The results verify that our contraction scheme (a) is generic and speeds up evaluation for all five applications, and (b) it can be used together with existing algorithms, with indexing (e.g., and ) or not (e.g., and ). (c) It is effective by separating up-to-date data from obsolete.
We remark that our contraction scheme aims to make a generic optimization for multiple applications to run on the same graph at the same time. When a new application is considered, adding a specific synopsis suffices for our scheme. In contrast, a separate indexing structure has to be built for indexing approaches. Better still, it is much easier to develop synopses than indices. Moreover, existing indexing structures can be inherited by contracted graphs, to improve performance from contraction in addition to from indexing.
Exp-3: Impact of each component. We next evaluated the impact of contracting each of the topological components identified in Sect. 2.2.
Impact of topological components. Based on Table 1, we took contraction of the first three types of regular structures as the baseline, and tested the impact of each component on the efficiency of query answering by disabling it, using all the ten real-life datasets.
As shown in Table 4, the average slowdown in evaluation time by disabling each of the first three structures is (a) 37.6%, 22.7% and 3.02% for , (b) 70.3%, 36.3% and 2.0% for , (c) 28.4%, 28.8% and 5.2% for , (d) 141.0%, 118.4% and 9.9% for , and (e) 27.8%, 15.9% and 0.7% for , respectively. In particular, the impact of each regular structure is mostly consistent with the contraction order. This said, for specific application and graphs, the impact of each regular structure may be slightly different. For on Web graphs, the average slowdown in evaluation time by disabling the first structure (star) and the second structure (clique) is 4.1% and 43.5%, respectively, since cliques dominate the effectiveness of the synopses for .
Table 4.
Slowdown(%) by disabling certain topological component
| Graph | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | 1st | 2nd | 3rd | |
| 45.8 | 10.9 | 4.7 | 16.4 | 19.1 | 2.1 | 28.2 | 28.7 | 5.7 | 85.1 | 141.6 | 22.8 | 42.7 | 4.1 | 0.3 | |
| LiveJournal | 46.3 | 16.7 | 3.0 | 17.5 | 3.9 | 1.4 | 44.3 | 13.2 | 7.1 | 68.4 | 95.5 | 14.5 | 27.5 | 3.3 | 0.9 |
| LivePokec | 45.5 | 13.5 | 2.1 | 5.5 | 22.1 | 0.7 | 29.5 | 23.6 | 4.4 | 18.0 | 69.5 | 11.3 | 39.0 | 5.2 | 1.3 |
| GSH | 11.7 | 32.2 | 0.4 | 5.4 | 18.2 | 1.1 | 15.9 | 33.1 | 0.7 | 41.7 | 10.8 | 0.4 | 4.9 | 52.2 | 0.2 |
| 19.6 | 40.6 | 2.5 | 8.7 | 20.3 | 2.9 | 18.3 | 44.6 | 5.8 | 107.1 | 70.8 | 5.6 | 5.4 | 57.8 | 2.7 | |
| NotreDame | 15.2 | 42.3 | 3.3 | 29.5 | 41.2 | 0.4 | 27.7 | 47.8 | 4.9 | 55.4 | 50.2 | 8.0 | 2.1 | 20.6 | 0.5 |
| DBLP | 66.6 | 17.0 | 0.8 | 572.1 | 216.6 | 1.7 | 23.2 | 29.5 | 0.4 | 631.7 | 450.2 | 0.1 | 65.1 | 7.9 | 0.1 |
| Hollywood | 40.3 | 13.4 | 5.1 | 22.6 | 10.9 | 1.5 | 24.0 | 26.3 | 5.4 | 80.1 | 64.3 | 5.9 | 51.7 | 3.8 | 0.2 |
| citHepTh | 54.5 | 15.7 | 2.4 | 15.4 | 7.2 | 0.5 | 32.3 | 22.6 | 7.3 | 280.7 | 222.4 | 25.7 | 35.5 | 1.7 | 0.6 |
| Traffic | 30.1 | 24.3 | 5.7 | 10.1 | 3.5 | 9.4 | 40.2 | 18.7 | 10.6 | 41.7 | 8.7 | 5.2 | 4.3 | 2.5 | 0.3 |
Impact of obsolete components. We tested the impact of contracting obsolete components on the efficiency of answering conventional queries. Fixing |Q| = 4 and varying x for timestamp threshold such that , Fig. 10i–p reports the results of , , , and on , respectively. We find that (1) the speedup is bigger for larger when , i.e., more nodes are contracted into obsolete components; (2) obsolete components speed up , , , and by 1.56, 1.53, 1.39, 2.49 and 1.33 times, respectively; and (3) the speedup for and gets smaller when due to the overhead of decontracting obsolete components. The results are consistent for , and , except that their speedup does not go down when gets larger since they do not need to decontract obsolete components.
Impact of and . We also tested the impact of and on the contraction ratio CR and efficiency. As remarked in Sect. 2.3, diamonds, butterflies and claws have a fixed size, while cliques, stars and paths vary. Fixing (resp. ) and varying (resp. ) from 2 to 6 (resp. 20 to 1000), Fig. 10q (resp. Fig. 10r) reports the CR on , , and , respectively. As shown there, CR decreases when decreases or increases. Similarly, Fig. 10s (resp. Fig. 10t) reports the speedup of , , , and on . Query evaluation is slowed down when or for all algorithms except and due to excessive superedge decontractions or overlarge components. Recall that decontracts neither supernodes nor superedges, and precalculates triangles in both topological components and obsolete parts; hence, it prefers large . We find that the best and for the datasets tested are around 4 and 500, respectively.
The results on the other graphs are consistent.
Exp-4: Space cost. We next studied the space cost of our contraction scheme compared with indexing cost. We consider six algorithms: , , , , and . The space cost includes the sizes of the contracted graph , decontraction function and the sizes of synopses; as shown in Sect. 3, , , , and do not need to decontract topological components; thus, we only uploaded for obsolete components and superedges into memory. In particular, requires no decontraction (Theorem 1) and thus incurs no cost for storing at all. We compared the space cost with the indices used by , [75], [4] and [68].
Table 5 shows how the space cost increases when more applications run on (i.e., graph G). We find the following. (1) Our contraction scheme takes totally 1.62GB for , , , and , much smaller than 12.9GB taken by , , and . (2) With the contraction scheme, graph G is no longer needed. That is, compared to G, the scheme uses 0.89GB additional space for the supernodes/edges in and synopses for all five applications. It trades affordable space for speedup. (3) Synopses , , , and take 48.3% of the total space of contraction, i.e., and dominate the space cost, which are shared by all applications. Hence, the more applications are supported, the more substantial the improvement in the contraction scheme is over indices.
Table 5.
Total space cost of applications run on
| Application | Contraction | Indexing | ||
|---|---|---|---|---|
| Detail | Space | Detail | Space | |
| Shared parts | 837MB | G | 727MB | |
| + | 848MB | 1.07GB | ||
| + | 874MB | + | 2.1GB | |
| + | 1.51GB | + | 9.58GB | |
| + | – | 1.51GB | – | 9.58GB |
| + | 1.62GB | + | 12.9GB | |
| + | 1.75GB | + | 19.4GB | |
To inherit the indexing structures of [44] and , we use 1.14GB additional space to build a compact index for and on average 26MB for on . in addition to synopses and .
To verify the scalability with applications, we further adapted existing algorithms for k-nearest neighbors () [92]. The total space cost of the scheme for the six applications is 1.75GB, i.e., 18.1% increment for each. It accounts for only 9.0% of the indices for , , , and [22] of .
Exp-5: Efficiency of (incremental) contraction. We next evaluated the efficiency of contraction algorithm and incremental contraction algorithm . We also studied the impact of the order and varied rates of updates on incremental .
Efficiency of . We first report the efficiency of . As shown in Fig. 11a–d on , , and , respectively, (1) on average takes 109.7s to contract the graph, without the time of the computation for synopses. (2) It takes on average 4.13s, 21.2s, 18.1s, 0s and 3.38s only to compute the synopses for , , , and , respectively; i.e., computing synopses of the five only takes on average 37.3% of the time of . Recall that the synopses for suffice for us to answer queries; hence, it is unnecessary to compute synopses for .
Fig. 11.
Efficiency of (incremental) contraction
Efficiency of . We tested the efficiency of , by varying from to . As shown in Fig. 11e–h on , , and , respectively, (1) on average is 2.1 times faster than , up to 6.3 times when . It takes on average 26.6% time to update the synopses for 5% updates on the five applications. (2) beats even when is up to . This justifies the need for incremental contraction. (3) is sensitive to ; it takes longer for larger .
Impact of update order. We tested the impact of the orders of edge insertions and deletions in on . Fixing , we varied the order of updates by (1) random (RO), (2) insertion-first (IF) and (3) deletion-first (DF). On average we find that RO, IF and DF have a performance difference less than 3.5% on . That is, is stable on batch updates, regardless of the order on the updates. Similarly, we find that RO, IF and DF have a performance difference less than 3.7% on for vertex updates.
Impact of update rates. We also tested the efficiency of against real-time updates, measured by the updates coming in 1s intervals, i.e., /s. Varying /s from /s to /s, Fig. 11i shows the following on . (1) On average it takes 0.88s to update contracted graphs, i.e., is able to efficiently maintain the contracted graphs in real life. (2) The update time is less than 1s even when the updates are up to . can handle of “burst” updates on graph with 40M nodes and edges.
The results are consistent on the other graphs.
Exp-6: Scalability. Finally, we evaluated (1) the scalability of our contraction algorithm with graph size |G|, (2) the parallel scalability of algorithm and with the number of cores.
Scalability on . Varying the size of synthetic graphs from (50M, 0.5B) to (250M, 2.5B), we tested the scalability of using a single machine. As shown in Fig. 11j, scales well with G. It takes 1325s when graph G has 2.75B nodes and edges.
Scalability of and . Fixing , we tested the scalability of parallel and with the number k of cores. As shown in Fig. 11k and l on , (1) scales well with k: it is 10.1 times faster when using cores versus (single core), and it is 4.3 times faster when k varies from 4 to 20. (2) is on average 1.9 times faster than . (3) scales well with k; it is 3.7 times faster when k varies from 4 to 20, across 4 machines.
The results on other graphs are consistent.
Summary. We find the following over 10 real-life graphs. On average, (1) the contraction scheme reduces graphs by 71.9%. The contraction ratio is 0.34, 0.264, 0.213 and 0.365 in social networks, Web graphs, collaboration networks and road networks, respectively. (2) It improves the evaluation of , , , and by 1.69, 1.44, 1.47, 2.24 and 1.37 times, respectively. Existing algorithms can be adapted to the scheme, with indices or not. (3) On average, contracting the first three types of regular structures improves the efficiency of query evaluation by 1.61, 1.44 and 1.04 times, respectively. (4) Contracting obsolete data improves the efficiency of both conventional queries and temporal queries, by 1.64 and 1.78 times on average, respectively. (5) Its total space cost on , , , and is only of indexing costs of , , and . The synopses for the five query classes take only 48.3% of the total space of the contraction scheme. Thus, our contraction scheme scales with the number of applications. (6) Algorithms , , and scale well with graphs and updates. takes 344s when G has 1.8B edges and nodes, and takes only 33.1s with 20 cores, across 4 machines. is 4.9 times faster than when is , and is still faster when is up to . (7) and scale well with the number k of machines. When , is 4.3 times faster and is 3.7 times faster when k varies from 4 to 20.
Related work
This paper extends its conference version [38] as follows. (1) We identify a variety of frequent regular structures in different types of graphs, develop their synopses and contract graphs based on their types (Sect. 2.2). In contrast, [38] adopts an one-size-fit-all solution and contracts only cliques, paths and stars for all types of graphs. (2) In light of new regular structures, all examples and algorithms have been extended (Sects. 2–4). (3) We provide the pseudo code and details of a parallel contraction algorithm (Sect. 2.4). (4) We study two new query classes, namely, (non-local) connected component and (intractable) clique decision, for proof of concept (Sects. 3.4 and 3.5). We also extend the algorithms for the three other cases to cope with newly studied topological components (Sects. 3.1–3.3). (5) We extend the study of incremental contraction by presenting vertex updates and parallel incremental maintenance algorithm (Sect. 4). (6) The experimental study is almost entirely new and evaluates the contraction scheme w.r.t. different regular structures to contract as well as its effectiveness on new big graphs and new query classes of Sects. 3.4 and 3.5 (Sect. 5).
We discuss the other related work as follows.
Contraction. As a traditional graph programming technique [43], node contraction merges nodes, and subgraph contraction replaces connected subgraphs with supernodes. It is used in e.g., single source shortest paths [54], connectivity [43] and spanning tree [41].
In contrast, we extend the conventional contraction with synopses to build a compact representation of graphs as a generic optimization scheme, which is a departure from the programming techniques.
Compression. Graph compression has been studied for social network analysis [27], community queries [21], subgraph isomorphism [34, 69], graph simulation [37], reachability and shortest distance [50], and GPU-based graph traversal [82]. It often computes query-specific equivalence relations by merging equivalent nodes into a single node or replacing frequent patterns by virtual nodes. Some are query preserving (lossless), e.g., [37, 50, 69], and can answer certain types of queries on compressed graphs without decompression.
Another category of compression aims to minimize the number of bits required to represent a graph. WebGraph [15] exploits the inner redundancies of Web graphs; [8] proposes an encoding scheme based on node indices assigned by the BFS order; [24] approximates the optimal encoding with MinHash; and [52] removes the hub nodes for an scheme to have better locality.
Our contraction scheme differs from graph compression in the following. (a) It optimizes performance of multiple applications with the same contracted graph. In contrast, many compression schemes are query dependent and require different structures for different query classes. While some methods serve generic queries [8, 15, 24], they may incur heavy recovering cost. (b) Contraction is lossless, while some compression schemes are lossy, e.g., [34]. (c) For a number of query classes, their existing algorithms can be readily adapted to contracted graphs, while compression often requires to develop new algorithms e.g., [69] demands a decompose-and-join algorithm for subgraph isomorphism.
Summarization. Graph summarization aims to produce an abstraction or summary of a large graph by aggregating nodes or subgraphs (see [67] for a survey), classified as follows. (1) Node aggregation, e.g., [60] merges node clusters into supernodes labeled with the number of edges within and between the clusters; it is developed for adjacency, degree and centrality queries. [87] generates an approximate summary of a graph structure by aggregating nodes based on attribute similarity. (2) Edge aggregation, e.g., [73] generates a summary by aggregating edges, with a bounded number of edges different from the original graph. (3) Simplification: instead of aggregating nodes and edges, [83] drops low-degree nodes, duplicate paths and unimportant labels. Most summarization methods are lossy, e.g., and only retain part of attributes, and drops nodes, edges and labels.
Incremental maintenance of summarization has been studied [30, 46, 84]. It depends on update intervals [84]; short-period summarization is space-costly, while long-interval summarization may miss updates. To handle these, [46] aggregates updates into a graph of “frequent” nodes and edges and computes a summary based on all historical updates on entire graph.
Both summarization and contraction schemes aim to provide a generic graph representation to speed up graph analyses. However, contraction differs from summarization in the following. (1) The contraction scheme is lossless and returns exact answers for various classes of queries. In contrast, summarization is typically lossy and supports at best certain aggregate or approximate queries only. (2) Many existing algorithms for query answering can be readily adapted to contracted graphs, while new algorithms often have to be developed on top of graph summaries. (3) For a number of query classes studied, contracted graphs can be incrementally maintained with boundedness and locality, while summarization maintenance requires historical updates and often operates on the entire graph [46].
Indexing. Indices have been studied for, e.g., subgraph isomorphism [13, 14, 28, 44, 72], reachability [7, 23, 50, 95] and shortest distance [25, 66]. They are query specific, and take space and time to store and maintain.
Our contraction scheme differs from indexing as it supports multiple applications on the same contracted graph, while a separate indexing structure has to be built for each query class. Moreover, it is more efficient to maintain contracted graphs than indices. This said, the contraction scheme can be complemented with indices for further speedup, by building indices on smaller contracted graphs, as demonstrated in Sect. 3.1.
Conclusion
We have proposed a contraction scheme to make big graphs small, as a generic optimization scheme for multiple applications to run on the same graph at the same time. We have shown that the scheme is generic and lossless. Moreover, it prioritizes up-to-date data by separating it from obsolete data. In addition, existing query evaluation algorithms can be readily adapted to compute exact answers, often without decontracting topological components. Our experimental results have verified that the contraction scheme is effective.
A topic for future work is to build a hierarchy of contracted graphs by iteratively contracting regular structures into supernodes, until the one at the top fits into the memory; the objective is to make large graphs small enough to fit into the memory of a single machine, and make it possible to process large graphs under limited resources. Another topic is to study the capacity of a single multi-core machine for big graph analytics, by leveraging both contraction and multi-core parallelism.
Acknowledgements
Fan, Li and Liu are supported in part by ERC 652976 and Royal Society Wolfson Research Merit Award WRM/R1/180014. Liu is also supported in part by EPSRC EP/L01503X/1, EPSRC CDT in Pervasive Parallelism at the University of Edinburgh. Lu is supported in part by NSFC 62002236.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Wenfei Fan, Email: wenfei@inf.ed.ac.uk.
Yuanhao Li, Email: yuanhao.li@ed.ac.uk.
Muyang Liu, Email: muyang.liu@ed.ac.uk.
References
- 1.Traffic. http://www.dis.uniroma1.it/challenge9/download.html (2006)
- 2.DBLP. https://snap.stanford.edu/data/com-DBLP.html (2012)
- 3.Gsh host. http://law.di.unimi.it/webdata/gsh-2015-host (2015)
- 4.Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: SIGMOD (2013)
- 5.Albert, R., Jeong, H., Barabási, A.: The diameter of the World Wide Web. CoRR cond-mat/9907038 (1999)
- 6.Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: A core for future graph query languages. In: SIGMOD, pp. 1421–1432 (2018)
- 7.Anirban, S., Wang, J., Islam, M.S.: Multi-level graph compression for fast reachability detection. In: DASFAA (2019)
- 8.Apostolico A, Drovandi G. Graph compression by bfs. Algorithms. 2009;2(3):1031–1044. doi: 10.3390/a2031031. [DOI] [Google Scholar]
- 9.Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X.: Group formation in large social networks: membership, growth, and evolution. In: SIGKDD, pp. 44–54 (2006)
- 10.Bae SH, Halperin D, West JD, Rosvall M, Howe B. Scalable and efficient flow-based community detection for large-scale graph analysis. TKDD. 2017;11(3):1–30. doi: 10.1145/2992785. [DOI] [Google Scholar]
- 11.Berry, N., Ko, T., Moy, T., Smrcka, J., Turnley, J., Wu, B.: Emergent clique formation in terrorist recruitment. In: AAAI Workshop on Agent Organizations (2004)
- 12.Besta, M., Hoefler, T.: Survey and taxonomy of lossless graph compression and space-efficient graph representations. CoRR arXiv: 1806.01799 (2018)
- 13.Bhattarai, B., Liu, H., Huang, H.H.: CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching. In: SIGMOD (2019)
- 14.Bi, F., Chang, L., Lin, X., Qin, L., Zhang, W.: Efficient subgraph matching by postponing cartesian products. In: SIGMOD (2016)
- 15.Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: WWW, pp. 595–602 (2004)
- 16.Bonacich P. Power and centrality: a family of measures. Am. J. Sociol. 1987;92(5):1170–1182. doi: 10.1086/228631. [DOI] [Google Scholar]
- 17.Bourse, F., Lelarge, M., Vojnovic, M.: Balanced graph edge partition. In: SIGKDD, pp. 1456–1465 (2014)
- 18.Brandes U. A faster algorithm for betweenness centrality. J. Math. Sociol. 2001;25(2):163–177. doi: 10.1080/0022250X.2001.9990249. [DOI] [Google Scholar]
- 19.Bringmann, B., Nijssen, S.: What is frequent in a single graph? In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 858–863 (2008)
- 20.Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. CACM. 1973;16(9):575–577. doi: 10.1145/362342.362367. [DOI] [Google Scholar]
- 21.Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: WSDM, pp. 95–106 (2008)
- 22.Cantone D, Ferro A, Pulvirenti A, Recupero DR, Shasha D. Antipole tree indexing to support range search and k-nearest neighbor search in metric spaces. TKDE. 2005;17(4):535–550. [Google Scholar]
- 23.Cheng, J., Huang, S., Wu, H., Fu, A.W.C.: TF-label: a topological-folding labeling scheme for reachability querying in a large graph. In: SIGMOD (2013)
- 24.Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: SIGKDD, pp. 219–228 (2009)
- 25.Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. SICOMP 32(5) (2003)
- 26.Cohen, J.: Trusses: Cohesive subgraphs for social network analysis. Natl. Secur. Agency Tech. Rep. 16(3.1) (2008)
- 27.Cohen, S.: Data management for social networking. In: SIGMOD (2016)
- 28.Cordella LP, Foggia P, Sansone C, Vento M. A (sub) graph isomorphism algorithm for matching large graphs. TPAMI. 2004;26(10):1367–1372. doi: 10.1109/TPAMI.2004.75. [DOI] [PubMed] [Google Scholar]
- 29.Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms. MIT press (2009)
- 30.Cortes, C., Pregibon, D., Volinsky, C.: Communities of interest. In: IDA (2001)
- 31.Dijkstra, E.W., et al.: A note on two problems in connexion with graphs. Numer. Math. 1(1) (1959)
- 32.Dominguez-Sal, D., Martinez-Bazan, N., Muntes-Mulero, V., Baleta, P., Larriba-Pey, J.L.: A discussion on the design of graph database benchmarks. In: TPCTC, pp. 25–40 (2010)
- 33.Elseidy M, Abdelhamid E, Skiadopoulos S, Kalnis P. GRAMI: frequent subgraph and pattern mining in a single large graph. PVLDB. 2014;7(7):517–528. [Google Scholar]
- 34.Fairey, J., Holder, L.: Stariso: Graph isomorphism through lossy compression. In: DCC (2016)
- 35.Fan, W., Hu, C., Tian, C.: Incremental graph computations: Doable and undoable. In: SIGMOD (2017)
- 36.Fan, W., Jin, R., Liu, M., Lu, P., Tian, C., Zhou, J.: Capturing associations in graphs. PVLDB 13(11) (2020)
- 37.Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: SIGMOD (2012)
- 38.Fan, W., Li, Y., Liu, M., Lu, C.: Making graphs compact by lossless contraction (2021). SIGMOD [DOI] [PMC free article] [PubMed]
- 39.Fan, W., Wu, Y., Xu, J.: Functional dependencies for graphs. In: SIGMOD (2016)
- 40.Francis, N., Green, A., Guagliardo, P., Libkin, L., Lindaaker, T., Marsault, V., Plantikow, S., Rydberg, M., Selmer, P., Taylor, A.: Cypher: An evolving query language for property graphs. In: SIGMOD (2018)
- 41.Gabow, H.N., Galil, Z., Spencer, T.H.: Efficient implementation of graph algorithms using contraction. In: FOCS (1984)
- 42.Garey M, Johnson D. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W. H. Freeman and Company; 1979. [Google Scholar]
- 43.Gross J, Yellen J. Graph Theory and its Applications. Boca Raton: CRC Press; 1998. [Google Scholar]
- 44.Han, W.S., Lee, J., Lee, J.H.: Turbo: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD (2013)
- 45.He, L., Chao, Y., Suzuki, K., Wu, K.: Fast connected-component labeling. Pattern Recogn. 42(9) (2009)
- 46.Hill S, Agarwal DK, Bell R, Volinsky C. Building an effective representation for dynamic networks. J. Comput. Graph. Stat. 2006;15(3):584–608. doi: 10.1198/106186006X139162. [DOI] [Google Scholar]
- 47.Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD (2013)
- 48.Itai A, Rodeh M. Finding a minimum circuit in a graph. SICOMP. 1978;7(4):413–423. doi: 10.1137/0207033. [DOI] [Google Scholar]
- 49.Jaakkola, M.S.T., Szummer, M.: Partially labeled classification with markov random walks. NIPS 14 (2002)
- 50.Jin, R., Xiang, Y., Ruan, N., Wang, H.: Efficiently answering reachability queries on very large directed graphs. In: SIGMOD (2008)
- 51.Johnson AE, Pollard TJ, Shen L, Li-Wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Sci. Data. 2016;3(1):1–9. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Kang, U., Faloutsos, C.: Beyond’caveman communities’: Hubs and spokes for graph compression and mining. In: ICDM, pp. 300–309 (2011)
- 53.Kang, U., McGlohon, M., Akoglu, L., Faloutsos, C.: Patterns on the connected components of terabyte-scale graphs. In: ICDM, pp. 875–880 (2010)
- 54.Karimi, R., Koppelman, D.M., Michael, C.J.: GPU road network graph contraction and SSSP query. In: ICS (2019)
- 55.Karypis G, Kumar V. Multilevelk-way partitioning scheme for irregular graphs. JPDC. 1998;48(1):96–129. [Google Scholar]
- 56.Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: SIGKDD, pp. 137–146 (2003)
- 57.Koch I. Enumerating all connected maximal common subgraphs in two graphs. TCS. 2001;250(1–2):1–30. doi: 10.1016/S0304-3975(00)00286-3. [DOI] [Google Scholar]
- 58.Kropatsch, W.: Building irregular pyramids by dual-graph contraction. In: Vision Image and Signal Processing (1996)
- 59.Lappas, T., Liu, K., Terzi, E.: Finding a team of experts in social networks. In: KDD (2009)
- 60.LeFevre, K., Terzi, E.: Grass: Graph structure summarization. In: SDM (2010)
- 61.Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web. 2015;6(2):167–195. doi: 10.3233/SW-140134. [DOI] [Google Scholar]
- 62.Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. In: WWW, pp. 641–650 (2010)
- 63.Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: SIGKDD (2005)
- 64.Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR arXiv:0810.1355 (2008)
- 65.Leung, K., Leckie, C.: Unsupervised anomaly detection in network intrusion detection using clusters. In: ACSW (2005)
- 66.Liang, Y., Zhao, P.: Similarity search in graph databases: a multi-layered indexing approach. In: ICDE (2017)
- 67.Liu Y , Safavi T, Dighe A, Koutra D. Graph summarization methods and applications: A survey. ACM Comput. Surv. 2018;51(3):62:1–62:34. [Google Scholar]
- 68.Lu, C., Yu, J.X., Wei, H., Zhang, Y.: Finding the maximum clique in massive graphs. PVLDB 10(11) (2017)
- 69.Maccioni, A., Abadi, D.J.: Scalable pattern matching over compressed graphs via dedensification. In: SIGKDD (2016)
- 70.McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: NIPS (2012)
- 71.Miller GA. WordNet: a lexical database for English. Commun. ACM. 1995;38(11):39–41. doi: 10.1145/219717.219748. [DOI] [Google Scholar]
- 72.Myoungji, H., Hyunjoon, K., Geonmo, G., Kunsoo, P., Wook-Shin, H.: Efficient subgraph matching: harmonizing dynamic programming, adaptive matching order, and failing set together. In: SIGMOD (2019)
- 73.Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: SIGMOD (2008)
- 74.Newman ME, Watts DJ, Strogatz SH. Random graph models of social networks. PNAS. 2002;99(suppl 1):2566–2572. doi: 10.1073/pnas.012582999. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Pandey, S., Li, X.S., Buluc, A., Xu, J., Liu, H.: H-index: Hash-indexing for parallel triangle counting on GPUs. In: HPCS, pp. 1–7 (2019)
- 76.Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Min. Knowl. Discov. 24 (2012)
- 77.Ramalingam G, Reps T. On the computational complexity of dynamic graph problems. TCS. 1996;158(1–2):233–277. doi: 10.1016/0304-3975(95)00079-8. [DOI] [Google Scholar]
- 78.Ren X, Wang J. Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs. PVLDB. 2015;8(5):617–628. [Google Scholar]
- 79.van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: A property graph query language. In: GRADES (2016)
- 80.Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: AAAI (2015)
- 81.Sakr S, Al-Naymat G. Graph indexing and querying: a review. IJWIS. 2010;6(2):101–120. doi: 10.1108/17440081011053104. [DOI] [Google Scholar]
- 82.Sha, M., Li, Y., Tan, K.: Gpu-based graph traversal on compressed graphs. In: SIGMOD, pp. 775–792 (2019)
- 83.Shen Z, Ma KL, Eliassi-Rad T. Visual analysis of large heterogeneous social networks by semantic and structural abstraction. TVCG. 2006;12(6):1427–1439. doi: 10.1109/TVCG.2006.107. [DOI] [PubMed] [Google Scholar]
- 84.Soundarajan, S., Tamersoy, A., Khalil, E.B., Eliassi-Rad, T., Chau, D.H., Gallagher, B., Roundy, K.: Generating graph snapshots from streaming edge data. In: WWW (2016)
- 85.Tarjan R. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972;1(2):146–160. doi: 10.1137/0201010. [DOI] [Google Scholar]
- 86.Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From“ think like a vertex” to“ think like a graph”. PVLDB 7(3), 193–204 (2013)
- 87.Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: SIGMOD (2008)
- 88.Valiant LG. A bridging model for parallel computation. CACM. 1990;33(8):103–111. doi: 10.1145/79173.79181. [DOI] [Google Scholar]
- 89.Vieira, M.V., Fonseca, B.M., Damazio, R., Golgher, P.B., Reis, D.d.C., Ribeiro-Neto, B.: Efficient search ranking in social networks. In: CIKM (2007)
- 90.W3C Recommendation: SPARQL query language for RDF. https://www.w3.org/TR/rdf-sparql-query/ (2008)
- 91.Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’networks. Nature. 1998;393(6684):440. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
- 92.Wu Y, Jin R, Zhang X. Efficient and exact local search for random walk based top-k proximity query in large graphs. TKDE. 2016;28(5):1160–1174. doi: 10.1109/TKDE.2016.2515579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Yahia SA, Benedikt M, Lakshmanan LV, Stoyanovich J. Efficient network aware search in collaborative tagging sites. PVLDB. 2008;1(1):710–721. [Google Scholar]
- 94.Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. In: ICDM (2012)
- 95.Yildirim, H., Chaoji, V., Zaki, M.J.: Grail: Scalable reachability index for large graphs. PVLDB 3(1-2) (2010)



