2022 Feb 19;32(1):49–73. doi: 10.1007/s00778-022-00731-7

Making graphs compact by lossless contraction

Wenfei Fan 1,2,3, Yuanhao Li 1, Muyang Liu 1, Can Lu 2
PMCID: PMC9845199  PMID: 36686981

Abstract

This paper proposes a scheme to reduce big graphs to small graphs. It contracts obsolete parts and regular structures into supernodes. The supernodes carry a synopsis SQ for each query class Q in use, to abstract key features of the contracted parts for answering queries of Q. Moreover, for various types of graphs, we identify regular structures to contract. The contraction scheme provides a compact graph representation and prioritizes up-to-date data. Better still, it is generic and lossless. We show that the same contracted graph is able to support multiple query classes at the same time, no matter whether their queries are label based or not, local or non-local. Moreover, existing algorithms for these queries can be readily adapted to compute exact answers by using the synopses when possible and decontracting the supernodes only when necessary. As a proof of concept, we show how to adapt existing algorithms for subgraph isomorphism, triangle counting, shortest distance, connected component and clique decision to contracted graphs. We also provide a bounded incremental contraction algorithm in response to updates, such that its cost is determined by the size of areas affected by the updates alone, not by the entire graphs. We experimentally verify that on average, the contraction scheme reduces graphs by 71.9% and improves the evaluation of these queries by 1.69, 1.44, 1.47, 2.24 and 1.37 times, respectively.

Keywords: Graph data management, Graph contraction, Graph algorithms, Incremental computation

Introduction

There has been prevalent use of graphs in artificial intelligence, knowledge bases, search, recommendation, business transactions, fraud detection and social network analysis. Graphs in the real world are often big, e.g., transaction graphs in e-commerce companies easily have billions of nodes and trillions of edges. Worse still, graph computations are often costly, e.g., graph pattern matching via subgraph isomorphism is intractable (cf. [42]). These highlight the need for developing techniques for speeding up graph computations.

There has been a host of work on the subject, either by making graphs compact, e.g., graph summarization [67] and compression [12, 82], or speeding up query answering by building indices [81]. The prior work often targets a specific class of queries, e.g., query-preserving compression [37] and 2-hop labeling [25] are for reachability queries. In practice, however, multiple applications often run on the same graph at the same time. It is infeasible to switch compression schemes or summaries between different applications. It is also too costly to build indices for each and every query class in use.

Another challenge stems from obsolete data. As a real-life example, consider graphs converted from IT databases at a telecommunication company. The databases were developed in stages over years and have a large schema with hundreds of attributes. About 80% of the attributes were copied from earlier versions and have not been touched for years. No one can tell what these attributes are for, but no one has the guts to drop them for fear of information loss. As a result, a large bulk of the graphs is obsolete. As another example, there are a large number of zombie accounts on Twitter. As reported by The New York Times, 71% of Lady Gaga’s followers are fake or inactive, and it is 58% for Justin Bieber. The obsolete data incur heavy time and space costs and often obscure query answers.

The challenges give rise to several questions. Is it possible to find a compact representation of graphs that is generic and lossless? That is, we want to reduce big graphs to a substantially smaller form. Moreover, using the same representation, we want to compute exact answers to queries of different classes at the same time. In addition, can the representation separate up-to-date data from obsolete components without loss of information? Can we adapt existing evaluation algorithms to the compact form, without the need for redeveloping the algorithms starting from scratch? Furthermore, can we efficiently and incrementally maintain the representation in response to updates to the original graphs?

Contributions and organization. In this paper, we propose a new approach to tackling these challenges, by extending the idea of graph contraction.

(1) A contraction scheme (Sect. 2). We propose a contraction scheme to reduce big graphs into smaller ones. It contracts obsolete components and regular structures into supernodes, and prioritizes up-to-date data. For each query class Q, supernodes carry a synopsis SQ that records key features needed for answering queries of Q. As opposed to conventional graph summarization and compression, the scheme is generic and lossless. A contracted graph retains the same topological structure for all query classes Q, and the same synopses SQ work for all queries in the same class Q. Only SQ may vary for different query classes Q. We identify regular structures to contract in different types of graphs, and develop a (parallel) contraction algorithm.

(2) Proof of concept (Sect. 3). We show that existing query evaluation algorithms can be readily adapted to contracted graphs. In a nutshell, we extend the algorithms to handle supernodes. When answering a query of class Q, we make use of the synopsis SQ of a supernode if it carries sufficient information for answering the query, and decontract the supernode only when necessary. We pick five different query classes: subgraph isomorphism (SubIso), triangle counting (TriC), shortest distance (Dist), connected component (CC) and clique decision (CD) based on the following dichotomies:

label-based queries (SubIso) versus non-label based ones (TriC, Dist, CC, CD);

local queries (SubIso, TriC, CD) versus non-local ones (Dist, CC); and

various degrees of topological constraints (Dist ≺ CC ≺ TriC ≺ CD ≺ SubIso).

We show how easy it is to adapt existing algorithms for these query classes to contracted graphs, without increasing their complexity. Better still, all these queries can be answered without decontraction of topological structures except some supernodes for obsolete parts.

(3) Incremental contraction (Sect. 4). We develop an incremental algorithm for maintaining contracted graphs in response to updates to original graphs. Such updates may change both the topological structures and timestamps (obsolete data). We show that the algorithm is bounded [77], i.e., it takes at most O(|AFF|^2) time, where |AFF| is the size of areas affected by updates, not the size of the entire (possibly big) graph. We parallelize the algorithm to scale with large graphs.

(4) Empirical evaluation (Sect. 5). Using 10 real-life graphs, we experimentally verify the following. On average, (a) the contraction scheme reduces graphs by 71.9%, up to 86.0%. (b) Contraction makes SubIso, TriC, Dist, CC and CD 1.69, 1.44, 1.47, 2.24 and 1.37 times faster, respectively. (c) The total space cost of our contraction scheme for the five query classes accounts for only 12.6% of that of the indices for TurboIso [44], HINDEX [75], PLL [4] and RMC [68]. It is 9.0% when kNN [92] also runs on the same graph. The synopses for each take 9.7% of the space. Hence, the scheme is scalable with the number of applications on the same graph. (d) Contracting obsolete data improves the efficiency of conventional queries and temporal queries by 1.64 and 1.78 times on average, respectively. (e) Our (incremental) contraction scheme scales well with graphs, e.g., it takes 33.1s to contract graphs of 1.8B edges and nodes with 20 cores.

We survey related work in Sect. 6 and identify research topics for future work in Sect. 7.

A graph contraction scheme

In this section, we first present the graph contraction scheme (Sect. 2.1). We then identify topological components to contract for different types of real-life graphs (Sect. 2.2). Moreover, we develop a contraction algorithm (Sect. 2.3) and its parallelization (Sect. 2.4).

Preliminaries. We start with basic notations.

Graphs. Assume two infinite sets Θ and Γ for labels and timestamps, respectively. We consider undirected graphs G = (V, E, L, T), where (a) V is a finite set of nodes, (b) E ⊆ V × V is a bag of edges, (c) for each node v ∈ V, L(v) is a label in Θ; and (d) T is a partial function such that for each node v ∈ V, if T(v) is defined, it is a timestamp in Γ that indicates the time when v or its adjacent edges were last updated.

Queries. A graph query is a computable function from a graph G to another object, e.g., a Boolean value, a number, a graph, or a relation. For instance, a graph pattern matching query is a graph pattern Q to find the set of subgraphs in G that are isomorphic to pattern Q, denoted by Q(G). A query class Q is a set of queries of the same “type,” e.g., all graph pattern queries. We also refer to Q as an application. In practice, multiple applications run on the same graph G simultaneously.

Contraction scheme

A graph contraction scheme is a triple ⟨fC, S, fD⟩, where (1) fC is a contraction function such that given a graph G, Gc = fC(G) is a graph deduced from G by contracting certain subgraphs H into supernodes vH; we refer to H as the subgraph contracted to vH, and Gc as the contracted graph of G by fC; (2) S is a set of synopsis functions such that for each query class Q in use, there exists SQ ∈ S that annotates each supernode vH of Gc with a synopsis SQ(vH); and (3) fD is a decontraction function that restores each supernode vH in Gc to its contracted subgraph H.

Example 1

Graph G in Fig. 1a is a fragment of a Twitter network. A node denotes a user (u), a tweet (t), a keyword (k), or a feature of a user such as id (i), name (n), number of followers (f) and link to other accounts of the same user in other social networks (l). An edge indicates the following: (1) (u,u), a user follows another; (2) (u,t), a user posts a tweet; (3) (t,t), a tweet retweets another; (4) (t,k), a tweet tags a keyword; (5) (k,k), two keywords are highly related; (6) (u,k), a user is interested in a keyword; (7) (i,l), a user has a feature; or (8) (i,f), a user has f followers.

Fig. 1. Graph contraction

In G, subgraphs in dashed rectangles are contracted into supernodes, yielding a contracted graph Gc shown in Fig. 1b. Synopses SSubIso for SubIso are shown in Fig. 1d and are elaborated in Sect. 3.1.

Before we formally define functions fC,fD and synopsis S, observe the following.

(1) The contraction scheme is generic. (a) Note that fC, Gc and fD are application independent, i.e., they remain the same no matter what query classes Q run on the contracted graphs. (b) While S is application dependent, it is query independent, i.e., all queries in Q use the same synopses annotated by SQ.

(2) The contraction scheme is lossless due to synopses S and decontraction function fD. As shown in Sect. 3, an existing algorithm A for a query class Q can be readily adapted to contracted graphs to compute exact query answers.

We next give the details of fC,S and fD. We aim to strike a balance between space cost and query evaluation cost. When a graph is over-contracted, i.e., when the subgraphs contracted to supernodes are too large or too small, the decontraction cost goes up although the contracted graph Gc may take less space. Moreover, the more detailed synopses are, the less likely decontraction is needed, but the higher space overhead is incurred.

(1) Contraction function. Function fC contracts subgraphs in G into supernodes in Gc. To simplify the discussion, we contract the following basic structures.

(a) Obsolete component: a connected subgraph consisting of nodes with timestamps earlier than threshold t0.

(b) Topological component: a subgraph with a regular structure, e.g., clique, star, path and butterfly.

Different types of graphs have different regular substructures, e.g., cliques are ubiquitous and effective in social networks while paths are only effective in road networks. In Sect. 2.2, we will identify what regular structures H to contract in different types of graphs.

We contract subgraphs with the number of nodes in the range [kl,ku] to avoid over-contraction (see Sects. 2.3 and 5 for the choices).

Contraction function fC maps each node v in graph G to a supernode in contracted graph Gc, which is either a supernode vH if v falls in one of the subgraphs H in (a) or (b), or node v itself otherwise.

In Example 1, function fC maps nodes in each dashed rectangle to its corresponding supernode, e.g., fC(i1) = fC(n1) = fC(f1) = fC(l1) = vH1, fC(k1) = ... = fC(k5) = vH2 and fC(t2) = t2.

Obsolete components help us prioritize up-to-date data, and topological ones reduce unnecessary checking when answering queries. As shown in Sect. 5, on average the first three regular structures and obsolete components contribute 18.3%, 14.9%, 2.8% and 63.1% to the contraction ratio, and speed up query answering by 1.61, 1.44, 1.04 and 1.71 times, respectively.

(2) Contracted graph. For a graph G, its contracted graph by fC is Gc = fC(G) = (Vc, Ec, fC′), where (a) Vc is a set of supernodes mapped from G as remarked above; (b) Ec ⊆ Vc × Vc is a bag of superedges, where a superedge (vH1,vH2) ∈ Ec if there exist nodes v1 and v2 such that fC(v1) = vH1, fC(v2) = vH2 and (v1,v2) ∈ E; and (c) fC′ is the reverse function of fC, i.e., fC′(vH) = {(v,L(v)) | fC(v) = vH}.

In Example 1, the reverse function fC′ maps each supernode in contracted graph Gc of Fig. 1b back to the nodes in the corresponding rectangle in Fig. 1a, e.g., fC′(vH1) = {(i1, id), (n1, name), (f1, follower), (l1, link)}.

Intuitively, the reverse function fC′ recovers the contracted nodes and their associated labels, while the decontraction function fD restores the topological structures of the contracted subgraphs.

(3) Synopsis. For each query class Q in use, a synopsis function SQ is in S, to retain features necessary for answering queries in Q. For instance, when Q is the class of graph patterns, at each supernode vH, SQ(vH) consists of the type of vH and the most distinguished features of fD(vH), e.g., the central node of a star and the sorted node list of a path. We will give more details about SQ in Sect. 3. As will also be seen there, the reverse function fC′ and synopses SQ taken together often suffice to answer queries in Q, without decontraction.

Note that not every synopsis SQ has to reside in memory. We load SQ to memory only if its corresponding application Q is currently in use.

(4) Decontraction. Function fD restores contracted subgraphs. For supernode vH, fD(vH) restores the edges between the nodes in fC′(vH), i.e., the subgraph induced by fC′(vH). For superedge (vH1,vH2), fD(vH1,vH2) recovers the edges between fC′(vH1) and fC′(vH2).

That is, the contracted subgraphs and edges are not dropped. They can be restored by fD when necessary. In light of fD, the scheme is guaranteed lossless.

For example, decontraction function fD restores the subgraph in Fig. 1a from supernodes, e.g., fD(vH3) is a star with central node u10 and leaves u6, u7, u8 and u9. It also restores edges from superedges, e.g., fD(vH2,vH5)={(t1,k1),(k1,k6),(k2,k6)}.
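To make the roles of the contraction function, the reverse function and fD concrete, the following Python sketch builds a contracted graph from a given node-to-supernode map. It is an illustration only, not the implementation used in the paper: the function name contract, the dictionary-based representation, the choice to count one superedge per crossing edge, and the choice to keep edges inside a supernode out of the bag of superedges are all assumptions. The toy data reuses only the three crossing edges and one clique edge mentioned in the example above.

```python
from collections import Counter, defaultdict


def contract(nodes, edges, labels, f_c):
    """Return (V_c, E_c, reverse, f_d) for a node-to-supernode map f_c.

    nodes: iterable of node ids; edges: iterable of undirected pairs (u, v);
    labels: dict node -> label; f_c: dict node -> supernode id.
    Nodes absent from f_c are singletons mapped to themselves.
    Supernode ids are assumed not to collide with node ids.
    """
    image = {v: f_c.get(v, v) for v in nodes}
    v_c = set(image.values())

    e_c = Counter()             # bag of superedges (assumption: one per crossing edge)
    inner = defaultdict(set)    # supernode -> original edges inside it
    cross = defaultdict(set)    # superedge -> original edges it stands for
    for u, v in edges:
        su, sv = image[u], image[v]
        if su == sv:
            inner[su].add(frozenset((u, v)))
        else:
            key = frozenset((su, sv))
            e_c[key] += 1
            cross[key].add(frozenset((u, v)))

    reverse = defaultdict(set)  # reverse function: supernode -> {(node, label)}
    for v, s in image.items():
        reverse[s].add((v, labels[v]))

    def f_d(s1, s2=None):
        """Decontract supernode s1, or superedge (s1, s2) when s2 is given."""
        if s2 is None:
            return set(inner.get(s1, ()))
        return set(cross.get(frozenset((s1, s2)), ()))

    return v_c, e_c, dict(reverse), f_d


# Toy fragment: two clique nodes in vH2, two path nodes in vH5.
nodes = ["k1", "k2", "k6", "t1"]
edges = [("k1", "k2"), ("t1", "k1"), ("k1", "k6"), ("k2", "k6")]
labels = {"k1": "keyword", "k2": "keyword", "k6": "keyword", "t1": "tweet"}
f_c = {"k1": "vH2", "k2": "vH2", "k6": "vH5", "t1": "vH5"}
v_c, e_c, reverse, f_d = contract(nodes, edges, labels, f_c)
```

Here f_d("vH2", "vH5") returns the three original edges behind the superedge, while f_d("vH2") returns the edges inside the supernode. Nothing is dropped, which is the sense in which the scheme is lossless.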

Identifying regular structures

We now identify what regular structures to contract for different types of real-life graphs.

Different types of graphs. We investigated the following 10 different types of graphs: (1) social graphs: Twitter [70] and LiveJournal [94]; (2) communication networks: WikiTalk [62]; (3) citation networks: HepTh [63] and Patent [63]; (4) Web graphs: Google [64] and NotreDame [5]; (5) knowledge graphs: DBpedia [61] and WordNet [71]; (6) collaboration networks: DBLP [2] and Hollywood [15]; (7) biomedical graphs: Mimic [51]; (8) economic networks: Poli [80]; (9) chemical graphs: Enzymes [80]; and (10) road networks: Traffic [1].

Regular structures. For a certain type of graphs G, we apply a subgraph mining model M to G. It returns a set of frequent subgraphs M(G)={g1,g2,...} of G together with the support of each gi. Support metrics may vary in different mining models, e.g., GRAMI [33] adopts minimum image based metric [19]. We pick subgraphs whose supports are above a threshold ts.

As an example, we adopt a subgraph miner GRAMI [33] as M. GRAMI discovers all the frequent subgraphs in G that have a support above a predefined threshold, which are then manually inspected. We pick gi’s with at least 4 nodes to avoid over-contraction.

As shown in Fig. 2, we found the following 6 structures in the 10 types of graphs: (a) clique: a fully-connected graph; (b) star: a single central node with neighbors; (c) path: a sequence of connected nodes with no edges between the head and tail (its two endpoints); (d) claw: a special star in which the central node has exactly 3 neighbors, denoted as its leaves; claws are quite frequent and are hence treated separately; (e) diamond: two triangles that share two endpoints; and (f) butterfly: two triangles sharing a single node.

Fig. 2. Frequent regular structures

Note that within these structures H, the only edges allowed are those that form H. Moreover, edges are allowed from each node in H to nodes outside of H. The only exception is that for a path, only the two endpoints can connect to other nodes in the graph.

We summarize how these structures appear in the 10 types of graphs in Table 1, ordered by supports and importance from high to low. Note that different graphs have different frequent regular substructures. Cliques, stars and diamonds often occur in social graphs, while in road networks, stars, claws and paths are frequent.

Table 1. Common structures in different types of graphs

| Graph type | Regular structure |
|---|---|
| Social graphs | Clique, star, diamond, butterfly, path |
| Communication networks | Star |
| Citation networks | Clique, star, diamond, butterfly |
| Web graphs | Star, clique, diamond |
| Knowledge graphs | Star, claw |
| Collaboration networks | Clique, star, diamond |
| Biomedical graphs | Star, clique, path |
| Economic networks | Star |
| Chemical graphs | Claw, path |
| Road networks | Star, claw, path |

Note that frequent pattern mining is conducted once for each type of graphs offline, not for each input graph. For instance, we always contract cliques, stars, diamonds, butterflies and paths for social graphs.

Contraction algorithm

We next present an algorithm to contract a given graph G, denoted as GCon and shown in Fig. 3.

Fig. 3. Algorithm GCon

A tricky issue is that the contracted graphs depend on the order in which the regular structures are contracted. For example, if we contract diamonds first in the Twitter graph G0 of Fig. 1a, then {t2,k1,k5,k3} is contracted as a diamond; after this there are no cliques in G0. In contrast, if cliques are contracted first, then {k1,k2,k3,k4,k5} is extracted. As suggested by M, cliques “dominate” in social graphs and hence should be “preserved” when contracting G0.

We adopt a deterministic order to ensure that important structures are contracted earlier and hence preserved. We order the importance of different types of regular structures in a graph G by their supports: the higher the support is, the more important the topology is. We denote by T(G) its ordered set of regular structures to contract in Table 1. Note that T(G) is determined by the type of G, e.g., social graphs, and is learned once offline regardless of individual G.

Given a graph G, algorithm GCon first contracts all obsolete data into components to prioritize up-to-date data. Each obsolete component is a connected subgraph that contains only nodes with timestamps earlier than a threshold t0. It is extracted by bounded breadth-first-search (BFS) that stops at non-obsolete nodes. The remaining nodes are then either contracted into topological components, or are left as singletons.
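The extraction of obsolete components can be sketched as follows. This is a minimal illustration under stated assumptions, not the GCon code: the name obsolete_components is hypothetical, nodes whose timestamp is undefined are treated as non-obsolete, and the BFS is "bounded" here simply by stopping once a component reaches ku nodes, a detail the text above does not spell out.

```python
from collections import deque


def obsolete_components(adj, ts, t0, k_l=4, k_u=500):
    """Group obsolete nodes into connected components of size in [k_l, k_u].

    adj: dict node -> set of neighbors; ts: dict node -> timestamp
    (a missing key means the timestamp is undefined); t0: threshold.
    """
    def obsolete(v):
        # Assumption: an undefined timestamp never counts as obsolete.
        return v in ts and ts[v] < t0

    seen, components = set(), []
    for s in adj:
        if s in seen or not obsolete(s):
            continue
        comp, queue = [s], deque([s])
        seen.add(s)
        while queue and len(comp) < k_u:
            v = queue.popleft()
            for w in adj[v]:
                # The search never expands through non-obsolete nodes.
                if w not in seen and obsolete(w) and len(comp) < k_u:
                    seen.add(w)
                    comp.append(w)
                    queue.append(w)
        if len(comp) >= k_l:
            components.append(comp)
    return components


# Toy data echoing Example 1: four old feature nodes next to a recent user.
adj = {
    "i1": {"n1", "f1", "l1", "u1"},
    "n1": {"i1"}, "f1": {"i1"}, "l1": {"i1"}, "u1": {"i1"},
}
ts = {"i1": 1, "n1": 1, "f1": 2, "l1": 1, "u1": 9}
components = obsolete_components(adj, ts, t0=5)
```

Components smaller than kl are discarded in this sketch, so their nodes stay available for the later topological contraction or remain singletons.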

Putting these together, we present the main driver of algorithm GCon in Fig. 3. Given a graph G, a timestamp threshold t0 and range [kl,ku], it constructs functions fC and fD of the contraction scheme. It first contracts nodes with timestamps earlier than t0 into obsolete components (line 1). It then recalls the list T(G) of topological components to contract based on the type of graph G (line 2). Next, GCon contracts topological components into supernodes following order T(G), and deduces fC and fD accordingly (lines 3-5). Each topological component consists of only uncontracted nodes. More specifically, it does the following.

(1) It extracts a clique by repeatedly selecting an un-contracted node that connects to all selected ones, subject to pre-selected size bounds kl and ku (see below).

(2) It extracts a star by first picking a central node vc, and then repeatedly selecting an un-contracted node as a leaf that is (a) connected to vc and (b) disconnected from all selected leaves, again subject to kl and ku.

(3) For paths, it first extracts intermediate nodes having only two neighbors that are not linked by an edge. It then finds a path consisting of only the intermediate nodes, along with two neighbors of the endpoints.

(4) For diamonds, it first selects an edge (u,v) and then picks x and y that are (a) connected to both u and v, and (b) pairwise disconnected.

(5) For butterflies, it first selects a node v that has a degree at least 4. It then checks whether there exist four neighbors u, x, y, z of node v such that exactly (u,x,v) and (y,z,v) form two triangles.

(6) For claws, it selects nodes with exactly 3 neighbors, and there is no edge between any two neighbors.

As remarked earlier, the remaining nodes that cannot be contracted into any component as above are treated as singleton, i.e., mapped to themselves by fC.
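Steps (1) and (2) above can be sketched in Python as below. This is a hedged illustration rather than the procedure of Fig. 3: the helper names make_adj, extract_clique and extract_star are hypothetical, the seed and the central node are passed in by the caller, and candidates are picked in sorted order only to make the result deterministic. The toy graph loosely follows Example 2; its edge list is an assumption.

```python
from itertools import combinations


def make_adj(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def extract_clique(adj, free, seed, k_l=4, k_u=500):
    """Greedily grow a clique from seed using only uncontracted nodes in free."""
    clique = [seed]
    candidates = adj[seed] & free       # nodes adjacent to every selected node
    while candidates and len(clique) < k_u:
        v = min(candidates)             # deterministic pick; any candidate works
        clique.append(v)
        candidates = (candidates & adj[v]) - {v}
    return clique if len(clique) >= k_l else None


def extract_star(adj, free, center, k_l=4, k_u=500):
    """Pick leaves adjacent to center and pairwise disconnected."""
    leaves = []
    for v in sorted(adj[center] & free):
        if len(leaves) + 1 >= k_u:
            break
        if all(v not in adj[w] for w in leaves):
            leaves.append(v)
    return [center] + leaves if len(leaves) + 1 >= k_l else None


ks = ["k1", "k2", "k3", "k4", "k5"]
edges = list(combinations(ks, 2)) + [("k1", "k6"), ("k2", "k6")]
edges += [("u10", leaf) for leaf in ("u6", "u7", "u8", "u9")]
edges += [("u5", "u1"), ("u5", "u3")]
adj = make_adj(edges)
free = set(adj)                          # all nodes are still uncontracted
clique = extract_clique(adj, free, "k1")
star = extract_star(adj, free, "u10")
too_small = extract_star(adj, free, "u5")
```

In this toy graph the star around u5 has only three nodes and is rejected by the lower bound kl = 4, mirroring step (4) of Example 2 below.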

Example 2

Assume that timestamp threshold t0 for graph G of Fig. 1a is larger than timestamps of nodes i1, n1, f1 and l1, but is smaller than those of remaining nodes. Algorithm GCon works as follows. (1) It first triggers bounded BFS, and contracts i1, n1, f1 and l1 into an obsolete component vH1 in Gc. (2) Since G is a social network, it contracts clique, star, diamond, butterfly and path in this order. (3) It builds a clique vH2 with nodes k1, ..., k5. (4) It picks u10 and u5 as central nodes for a star, and makes a star vH3 consisting of u6,u7,u8,u9,u10. Nodes u5,u1,u3 cannot make a star due to lower bound kl=4. (5) No diamond exists. (6) It picks u5 as central node for a butterfly and makes a butterfly vH4. (7) It finds k7, k8 and k9 as candidate intermediate nodes for paths, and contracts them into a path vH5 with endpoints k6 and t1. (8) Node t2 is left as a singleton, and is mapped to itself by fC.

Range [kl,ku]. We contract an (obsolete/topological) component H such that the number of its nodes is in the range [kl,ku]. The reason is twofold. (1) If H is too small, a contracted graph would have an excessive number of supernodes; this leads to over-contraction with high overhead for possible decontraction and low contraction ratio. Thus, we set a lower bound kl. (2) We set an upper bound ku to avoid overlarge components and excessive superedge decontraction. We experimentally find that the best kl and ku are 4 and 500, respectively.

Diamonds, butterflies and claws have a fixed size with 4, 5 and 4 nodes, respectively, in the range above.

Complexity. Algorithm GCon takes at most O(|G|^2) time. Indeed, (1) obsolete components can be contracted in O(|G|) time via edge-disjoint bounded BFSs; (2) paths can be built in O(|G|) time; (3) it takes O(|G|) time to contract each clique and O(|G|^2) time for all cliques; and (4) similarly, the other regular structures can be contracted in O(|G|^2) time.

Properties. Observe the following about the contraction scheme. (1) It is lossless and is able to compute exact query answers. (2) It is generic and supports multiple applications on the same contracted graph at the same time. This is often necessary. For instance, on average 10 classes of queries run on a graph simultaneously in GDB benchmarks [32]. (3) It prioritizes up-to-date data by separating it from obsolete data. (4) It improves performance. (a) As discussed in Sect. 5, |Gc| ≪ |G|. In particular, each obsolete component is contracted into a single supernode. (b) Decontraction is often not needed. As shown in Sect. 3, none of SubIso, CD, TriC, Dist and CC needs to decontract any topological component, and for TriC, Dist and CC, even obsolete components do not need decontraction.

Parallel contraction algorithm

We next parallelize algorithm GCon, denoted by PCon, to speed up the contraction process. Note that contraction is conducted once offline, and is then incrementally maintained in response to updates (Sect. 4).

Parallel setting. Assume a master (processor) M0 and n workers (processors) P1, ..., Pn. Graph G is partitioned into n fragments F1, ..., Fn by an edge-cut partitioner [17, 55], and the fragments are distributed to n workers P1, ..., Pn, respectively. We adopt the BSP model [88], which separates iterative computations into supersteps and synchronizes states after each superstep.

Parallel contraction algorithm PCon. As shown in Fig. 4, the idea of PCon is to leverage data-partitioned parallelism. PCon first conducts GCon locally on each fragment in parallel, and then contracts uncontracted “border nodes,” i.e., nodes with edges crossing fragments, by building neighbors of at most ku uncontracted nodes, referred to as uncontracted neighbors, which are subgraphs with uncontracted nodes.

Fig. 4. Algorithm PCon

More specifically, algorithm PCon works as follows.

(1) Each worker runs GCon on its local fragment in parallel (line 1), since after all, each fragment Fi is a graph itself.

In contrast with single-thread GCon, workers do not contract mirror nodes, i.e., nodes assigned to other fragments with edges linked to the local fragment. Adopting edge-cut partition, each node of G is assigned to a single fragment and is contracted at most once during GCon.

(2) PCon contracts “border nodes” (lines 2-3). For each border node v, if v is not contracted, PCon builds its uncontracted neighbors. Such neighbors are identified in parallel, coordinated by master M0.

(3) Master M0 merges overlapped neighbors into one, and distributes disjoint ones to n workers (lines 4-5). In this way, PCon reduces communication cost and speeds up the process when contracting border nodes.

(4) Each worker contracts its assigned uncontracted-neighbors of border nodes, in parallel (line 6).

One can verify that each node v in G is contracted into at most one supernode vH. The graph Gc contracted by PCon may be slightly different from that of GCon since border nodes may be contracted in different orders. One can fix this by repeating steps (1)–(4) for each of the topological components following the order T(G). Nonetheless, we experimentally find that the differences are not substantial enough to be worth the extra cost. Moreover, the contracted graphs of PCon are ensured compact, i.e., they cannot be contracted further.
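Step (3) of PCon, in which the master merges overlapping uncontracted neighbors and hands out disjoint ones, can be sketched with a union-find structure. The sketch is an assumption-laden illustration: the names merge_overlapping and distribute are hypothetical, the load-balancing rule (largest group first onto the least loaded worker) is not taken from the paper, and the sketch does not re-check the bound ku after merging.

```python
import heapq


def merge_overlapping(neighborhoods):
    """Merge node sets that share at least one node (transitively)."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for nb in neighborhoods:
        members = list(nb)
        for v in members:
            parent.setdefault(v, v)
        for v in members[1:]:
            parent[find(v)] = find(members[0])

    groups = {}
    for v in parent:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def distribute(groups, n):
    """Assign disjoint groups to n workers, balancing the number of nodes."""
    heap = [(0, i) for i in range(n)]       # (current load, worker index)
    assignment = [[] for _ in range(n)]
    for g in sorted(groups, key=len, reverse=True):
        load, i = heapq.heappop(heap)
        assignment[i].append(g)
        heapq.heappush(heap, (load + len(g), i))
    return assignment


merged = merge_overlapping([{"a", "b"}, {"b", "c"}, {"d"}])
plan = distribute(merged, 2)
```

Because the merged groups are disjoint, each worker can contract its groups without coordinating with the others, which is consistent with each node being contracted into at most one supernode.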

Proof of concept

In this section, we show that existing query evaluation algorithms can be readily adapted to the contracted graphs. As a proof of concept, we pick five query classes: (1) graph pattern matching SubIso via subgraph isomorphism (labeled queries with locality); (2) triangle counting TriC (un-labeled queries with locality); (3) shortest distance Dist (un-labeled and non-local queries); (4) connected component CC (un-labeled queries without locality); and (5) clique decision CD (un-labeled queries with locality). Among these, subgraph isomorphism and clique decision are intractable (cf. [42]).

Informally, when answering a query Q of a given class, we check whether the synopsis SQ(vH) at a supernode vH has enough information for Q; if so, we use SQ(vH) directly; otherwise we decontract superedges adjacent to vH or restore the subgraph of vH via decontraction function fD. As will be seen shortly, SQ(vH) often provides enough information to process Q at vH as a whole or safely skip vH. Thus, it suffices to answer queries in the five classes by decontracting superedges, without decontracting any topological components. Here decontraction fD(vH1,vH2) of a superedge (vH1,vH2) restores the edges between the nodes contracted to vH1 and those contracted to vH2 (Sect. 2).

The main result of this section is as follows.

Theorem 1

Using linear synopsis functions,

(1) for each of SubIso and CD, there are existing algorithms that can be adapted to compute exact answers on contracted graphs Gc, which decontract only supernodes of obsolete components and superedges between supernodes, not any topological components;

(2) for TriC and Dist, there are existing algorithms that can be adapted to Gc and decontract no supernodes, neither topological nor obsolete components; and

(3) for CC, there are existing algorithms that can be adapted to Gc and decontract neither supernodes (topological and obsolete) nor superedges.

Below we provide a constructive proof for Theorem 1 by adapting existing algorithms of the five query classes to contracted graphs one by one.

Graph pattern matching with contraction

We start with graph pattern matching (SubIso).

Preliminaries. We first review basic notations.

Pattern. A graph pattern is defined as a graph Q = (VQ, EQ, LQ), where (1) VQ is a set of pattern nodes, (2) EQ is a set of pattern edges, and (3) LQ is a function that assigns a label LQ(u) to each u ∈ VQ.

We also investigate temporal patterns (Q,t), where Q is a pattern as above and t is a given timestamp.

To simplify the discussion, we consider connected patterns Q. This said, our algorithm can be adapted to disconnected ones. We denote by u, v pattern nodes in pattern Q, and by x, y nodes in graph G. A neighbor of node v is a node u such that (u,v) ∈ EQ.

Pattern matching. A match of pattern Q in graph G is a subgraph G′ = (V′, E′, L′, T′) of G that is isomorphic to Q, i.e., there exists a bijective function h: VQ → V′ such that (1) for each node u ∈ VQ, LQ(u) = L′(h(u)); and (2) e = (u,u′) is an edge in pattern Q iff (if and only if) (h(u),h(u′)) is an edge in graph G′. We denote by Q(G) the set of all matches of pattern Q in graph G.

A match of a temporal pattern (Q,t) in graph G is a match G′ in Q(G) such that for each node v in G′, T(v) > t, i.e., a match of (conventional) pattern Q in which all nodes have timestamps later than t. We denote by Q(G,t) all matches of (Q,t) in G.

The graph pattern matching problem, denoted by SubIso, is to compute, given a pattern Q and a graph G, the set Q(G) of matches. Similarly, the temporal matching problem, denoted by SubIso_t, is to compute Q(G,t) for a given temporal pattern (Q,t) and a graph G.
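As a small illustration of these definitions, the Python sketch below checks whether a given mapping h is a match, and whether it is also a temporal match. It is not an algorithm for SubIso, which must also search for h. The names is_match and is_temporal_match are hypothetical. The sketch reads the matched subgraph as the image of Q under h, so it requires every pattern edge to map to an edge of G, and it treats a node with an undefined timestamp as failing the temporal condition; both readings are assumptions.

```python
def is_match(h, q_nodes, q_edges, q_label, g_adj, g_label):
    """Check that h (pattern node -> graph node) is a match of Q in G."""
    # h must be defined on exactly the pattern nodes and be injective.
    if set(h) != set(q_nodes) or len(set(h.values())) != len(h):
        return False
    # Labels must agree.
    if any(q_label[u] != g_label.get(h[u]) for u in h):
        return False
    # Every pattern edge must map to an edge of G.
    return all(h[v] in g_adj.get(h[u], ()) for u, v in q_edges)


def is_temporal_match(h, t, ts, q_nodes, q_edges, q_label, g_adj, g_label):
    """A match whose nodes all have timestamps later than t."""
    if not is_match(h, q_nodes, q_edges, q_label, g_adj, g_label):
        return False
    # Assumption: an undefined timestamp does not satisfy T(v) > t.
    return all(h[u] in ts and ts[h[u]] > t for u in h)


# Pattern: a user who posts a tweet.
q_nodes = ["a", "b"]
q_edges = [("a", "b")]
q_label = {"a": "user", "b": "tweet"}

# Toy graph: x1 posts x2; x3 is an unrelated tweet.
g_adj = {"x1": {"x2"}, "x2": {"x1"}, "x3": set()}
g_label = {"x1": "user", "x2": "tweet", "x3": "tweet"}
ts = {"x1": 5, "x2": 3}

good = is_match({"a": "x1", "b": "x2"}, q_nodes, q_edges, q_label, g_adj, g_label)
bad = is_match({"a": "x1", "b": "x3"}, q_nodes, q_edges, q_label, g_adj, g_label)
```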

Graph pattern matching is widely used in graph queries [6, 40, 79, 90] and graph dependencies [36, 39].

Note that (1) patterns Q are labeled, i.e., nodes are matched by labels. Moreover, (2) Q has the locality, i.e., for any match G′ of Q in G and any nodes v1, v2 in G′, v1 and v2 are within dQ hops when treating G′ as undirected. Here dQ is the diameter of Q, i.e., the maximum shortest distance between any two nodes in Q.
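The diameter dQ that bounds the locality of a pattern can be computed by running one BFS per pattern node. The sketch below assumes a connected, undirected pattern given as an adjacency map; the function name diameter is hypothetical.

```python
from collections import deque


def diameter(adj):
    """Maximum shortest distance between any two nodes of a connected graph."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best


path_pattern = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle_pattern = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```

For a path pattern on three nodes dQ is 2, so any two nodes of a match lie within two hops of each other; for a triangle pattern dQ is 1.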

The decision problem of pattern matching is NP-complete (cf. [42]); similarly for temporal matching. A variety of algorithms have been developed for SubIso, notably TurboIso [44] with indices and VF2 [28] without index. Both TurboIso and VF2 can be adapted to contracted graphs as characterized in Theorem 1.

We give a constructive proof for TurboIso, because (1) it is one of the most efficient algorithms for subgraph isomorphism and is followed by other SubIso algorithms e.g., [14, 78], and (2) it employs indexing to reduce redundant matches; by adapting TurboIso we show that the indices for SubIso can be inherited by contracted graphs, i.e., contraction and indexing complement each other. The same algorithm works for temporal matching. The proof for VF2 is simpler (not shown).

Below we first present synopses for SubIso (Sect. 3.1.1), which are the same for both VF2 and TurboIso. We then show how to adapt algorithm TurboIso to contracted graphs (Sect. 3.1.2).

Contraction for SubIso

Observe that topological components have regular structures. The idea of synopses is to store the types and key features of regular structures so that we could check pattern matching without decontracting any supernodes of topological components.

The synopsis of a supernode vH for query class SubIso is defined as follows:

clique: vH.type = clique;

star: vH.type = star, vH.c records its central node;

path: vH.type = path, $v_H.\text{list} = \langle u_1, \ldots, u_{|v_c|} \rangle$, storing all the nodes on the path in order;

diamond: vH.type = diamond, vH.s1 and vH.s2 store the two nodes shared by the two triangles;

butterfly: vH.type = butterfly, vH.s records the node shared by the two triangles, and vH.e stores the two disjoint edges;

claw: vH.type = claw, vH.c stores the central node and $v_H.s_i$ ($i \in [1,3]$) record its three neighbors;

obsolete component: vH.type = obsolete; and

each component maintains $v_H.t = \max\{T(v) \mid v \in f'_C(v_H)\}$, i.e., the largest timestamp of its nodes.

Node labels are stored in the reverse function $f'_C$ of the contraction function $f_C$ (see Sect. 2.1).
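The synopsis fields listed above can be pictured as a small record per supernode. The sketch below is one possible encoding, assuming a plain dictionary; the helper name `make_synopsis` and the keyword-argument interface are hypothetical and are not taken from the paper. The only computed field is the tag t, the largest timestamp of the contracted nodes.

```python
ALLOWED_TYPES = {"clique", "star", "path", "diamond", "butterfly", "claw", "obsolete"}

def make_synopsis(kind, timestamps, **features):
    """Build a SubIso-style synopsis for one supernode (illustrative only).

    kind       -- structure type of the contracted component
    timestamps -- dict: contracted node -> timestamp T(v)
    features   -- type-specific fields, e.g. c=..., list=..., s1=..., s2=...
    """
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unknown structure type: {kind}")
    # every component keeps the largest timestamp of its nodes
    synopsis = {"type": kind, "t": max(timestamps.values())}
    synopsis.update(features)
    return synopsis
```

For instance, `make_synopsis("star", {"c": 3, "x": 7, "y": 5}, c="c")` yields a record with type star, central node c and t = 7.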

For instance, the synopsis SSubIso(vH) for each supernode vH in the contracted graph Gc of Fig. 1b is given in Fig. 1d. Note that SSubIso only stores the synopses of the regular structures contracted in a graph.

Properties. The synopses in SSubIso have two properties. (1) Taken with the reverse function $f'_C$ of $f_C$, the synopsis of a supernode vH suffices to recover the topological component H contracted to vH. For instance, given the central node and leaf nodes, a star can be uniquely determined. As a result, no supernode decontraction is needed for topological components. (2) The synopses can be constructed during the traversal of G for constructing contracted graph Gc, as a byproduct.

We remark that the design of synopses needs domain knowledge. This said, (1) users only need to develop synopses for their applications in use, not exhaustively for all possible query classes; and (2) synopsis design is no harder than developing indexing structures.

Subgraph isomorphism

Below we first review algorithm TurboIso [44] and then show how to adapt TurboIso to contracted graphs.

TurboIso. As shown in Fig. 5, given a graph G and a pattern Q, TurboIso computes Q(G) as follows. It first rewrites pattern graph Q into a tree $Q'$ by performing BFS from a start vertex $v_s$ (lines 1-2). Here each vertex in $Q'$ is a neighborhood equivalence class (NEC) that contains pattern nodes in Q having identically matching data vertices. Then, for each start vertex $x_s$ of each region, TurboIso constructs a candidate region (CR0), i.e., an index that maintains candidates for each NEC vertex in $Q'$, via DFS from $x_s$ (lines 3-4). If valid candidates are found, i.e., CR0 $\neq \emptyset$, TurboIso enumerates all possible matches that map $x_s$ to $v_s$ following a matching order O (lines 5-6). The matching order O is decided by sorting the leaf NEC vertices based on the number of their candidate vertices. It expands Q(G) with valid matches identified in the process (line 7).

Fig. 5. Algorithm TurboIso

Algorithm SubAc. TurboIso can be easily adapted to contracted graph Gc; the adapted algorithm is denoted by SubAc. As shown in Fig. 6, SubAc adopts the same logic as TurboIso except for minor adaptations in ExploreCR (line 4) and SGSearch (line 7) to deal with supernodes. To see these, let H be the subgraph contracted to a supernode vH.

Fig. 6. Algorithm SubAc

(1) ExploreCR. It adds a supernode vH as a candidate for a node u in Q if some node in vH can match u, which is checked by SSubIso(vH) and $f'_C(v_H)$. It also prunes CR0 based on vH.type, e.g., a node u in Q cannot match intermediate nodes on paths if u is in some triangle in Q; and u matches intermediate nodes on a path only if its degree is no larger than 2. No supernodes or superedges are decontracted.

(2) SGSearch. Checking the existence of an edge $(x, y)$ that matches edge $(v_x, v_y) \in Q$ is easy with synopses SSubIso and functions fC and fD. Here x (resp. y) denotes a node in supernode $f_C(x)$ (resp. $f_C(y)$) in the candidates of $v_x$ (resp. $v_y$). When $f_C(x) = f_C(y) = v_H$, (a) if vH.type = star or claw, $(x, y)$ exists only if x = vH.c or y = vH.c; (b) if vH.type = clique, $(x, y)$ always exists; (c) if vH.type = path, $(x, y)$ exists if x and y are next to each other in vH.list; (d) if vH.type = diamond, $(x, y)$ exists if at least one of x and y is the shared node vH.s1 or vH.s2; and (e) if vH.type = butterfly, $(x, y)$ exists if x and y are not endpoints of the two disjoint edges in vH.e simultaneously. Hence, no topological component is decontracted by fD. (f) If vH.type = obsolete, it checks whether none of the labels in Q is in $f'_C(v_H)$; it safely skips vH if so, and decontracts vH by fD to check the existence of $(x, y)$ otherwise. If x and y match distinct supernodes, it suffices to decontract superedge $(f_C(x), f_C(y))$ by fD.
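Rules (a) to (e) amount to a constant-time lookup on the synopsis. The sketch below illustrates this under several assumptions that are not in the paper: edges are treated as undirected, x and y are distinct nodes of the same supernode, the synopsis is a dictionary keyed by the tag names used above, and rule (e) is read as "x and y do not lie on different disjoint edges". The function name `edge_inside_supernode` is hypothetical. For obsolete components it returns None to signal that the synopsis alone cannot decide.

```python
def edge_inside_supernode(syn, x, y):
    """Decide from the synopsis whether edge (x, y) exists inside a supernode.

    Returns True/False for topological components and None for obsolete
    components, where decontraction would be needed.
    """
    kind = syn["type"]
    if kind == "clique":
        return True
    if kind in ("star", "claw"):
        # only edges incident to the central node exist
        return x == syn["c"] or y == syn["c"]
    if kind == "path":
        nodes = syn["list"]
        return abs(nodes.index(x) - nodes.index(y)) == 1
    if kind == "diamond":
        shared = (syn["s1"], syn["s2"])
        return x in shared or y in shared
    if kind == "butterfly":
        e1, e2 = syn["e"]  # the two disjoint edges
        crosses = (x in e1 and y in e2) or (x in e2 and y in e1)
        return not crosses
    return None  # obsolete: the synopsis does not describe the topology
```

Returning None rather than False for obsolete components keeps "no edge" and "unknown without decontraction" apart, which mirrors the distinction drawn in rule (f).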

Example 3

Query Q in Fig. 1c is to find potential friendship between users based on the retweet and shared keywords in their posted tweets. Nodes $u$ and $u'$ both have the same label u. Given Q, SubAc first chooses k as the start node, to which only vH2 and vH5 can match. For vH2, ExploreCR adds vH5 and t2 as candidates for $t$ and $t'$, vH3 as candidate for $u$, and vH3 and vH4 as candidates for $u'$. Note that for obsolete supernode vH1, none of the labels in Q is covered by $f'_C(v_{H1})$; hence, vH1 can be safely skipped. SGSearch finds that t2 matches $t$ since there exists no edge between vH3 and vH5. Thus, it matches $k, t, u, t', u'$ with $k_1, t_2, u_6, t_1, u_4$.

Similarly for vH5, ExploreCR adds vH5 and t2 as candidates for $t$ and $t'$, vH4 as candidate for u, and vH3 and vH4 for $u$ and $u'$. Next, SGSearch finds that u4 and t1 match $u'$ and $t'$ by decontracting superedge (vH3,vH4); then, k9 matches k. However, since k9 is an intermediate node of path vH5, no match for $t$ can be found. Hence, $k, t, u, t', u'$ only match $k_1, t_2, u_6, t_1, u_4$.

Analyses. One can easily verify that SubAc is correct since it has the same logic as TurboIso except that it incorporates pruning strategies. While they have the same worst-case complexity, SubAc operates on Gc, which is much smaller than G (see Sect. 5); moreover, its ExploreCR saves traversal cost and SGSearch saves validation cost by pruning invalid matches.

Temporal pattern matching. Algorithm SubAc can also take a temporal pattern $(Q, t)$ as part of its input, instead of Q. The only major difference is at CR0 construction (line 4), where a supernode vH is safely pruned if $v_H.t \le t$, whether or not vH.type is obsolete. It skips a match if it contains a node v with $T(v) \le t$.

Triangle counting with contraction

We next study triangle counting [26, 47], which has been used in clustering [91], cycle detection [48] and transitivity [74]. In graph G, a triangle is a clique of three vertices. The triangle counting problem is to find the total number of triangles in G, denoted by TriC.

Similar to SubIso, TriC is local with diameter 1. In contrast, it consists of a single query and is not labeled.

We adapt algorithm TriA of [26] for TriC to contracted graphs, since it is one of the most efficient TriC algorithms [47], and it does not use indexing (as a different example from TurboIso). We show that for TriC, the adapted algorithm needs to decontract no supernodes, neither topological components nor obsolete parts.

Contraction for TriC

Observe that contraction function fC on G is equivalent to node partition of G, such that two nodes are in the same partition if they are contracted into the same supernode. The idea of synopses for TriC is to pre-count triangles with at least two nodes in the same partition, without enumerating them. As will be seen shortly, this allows us to avoid supernode decontraction for both topological and obsolete components.

Consider a triangle $(u, v, w)$ in G that is mapped to Gc via fC. We have the following cases.

(1) If $f_C(u) = f_C(v) = f_C(w) = v_H$, where supernode vH contracts a subgraph H with node set V(H), i.e., when the three nodes of a triangle are contracted into the same supernode, then (a) when H is a clique, there are $\binom{|V(H)|}{3}$ triangles inside H; (b) when H is a diamond or a butterfly, there are 2 triangles inside H; (c) when H is an obsolete component, the number of triangles inside H can be pre-calculated, denoted by $t_H$; and (d) there are no triangles inside H otherwise.

(2) If $f_C(u) = f_C(v) = v_I$ and $f_C(w) = v_J$, where $v_I$ and $v_J$ contract subgraphs I and J, respectively, i.e., if two nodes of a triangle are contracted into the same supernode, then (a) when I is a clique, w leads to $\binom{k}{2}$ triangles, where k is the number of the neighbors of w in I. Denote by $t_w^I$ the number of such triangles in a clique neighbor I of w. (b) Subgraph I cannot be a path since intermediate nodes on a path are not allowed to connect to nodes outside I. (c) Otherwise, nodes u and v yield k triangles, where k is the number of common neighbors of u and v in J. We denote by $t_{u,v}^J$ the number of such triangles in a common neighbor J of u and v.

(3) If fC(u)=vI, fC(v)=vJ, fC(w)=vK, i.e., when the three nodes of a triangle are contracted into different supernodes, we count such triangles online and it suffices to decontract only superedges, not supernodes.

Synopsis STriC(vH) of supernode vH for TriC extends SSubIso(vH) with an extra tag tc, which records the number of triangles pre-calculated as above. More specifically, vH.tc is computed as follows. Below we use u and v to range over nodes in V(H), I to range over clique neighbors of u, and J to range over common neighbors of $(u, v)$. We define $t_u^I$, $t_H$ and $t_{u,v}^J$ as above.

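In a clique H, (1) there are $\binom{|V(H)|}{3}$ triangles; and (2) each node $u \in H$ has $t_u^I$ triangles with its clique neighbor I; hence, $v_H.tc = \binom{|V(H)|}{3} + \sum_u \sum_I t_u^I$. We can calculate vH.tc similarly for other regular structures. Thus,

clique: $v_H.tc = \binom{|V(H)|}{3} + \sum_u \sum_I t_u^I$;

star: $v_H.tc = \sum_u \sum_I t_u^I + \sum_u \sum_J t_{v_H.c,\,u}^J$;

path: $v_H.tc = \sum_I t_{u_1}^I + \sum_I t_{u_{|V(H)|}}^I$, where $u_1$ and $u_{|V(H)|}$ are the first and last node on the path;

claw: $v_H.tc = \sum_u \sum_I t_u^I + \sum_{u,v} \sum_J t_{u,v}^J$;

diamond and butterfly: $v_H.tc = 2 + \sum_u \sum_I t_u^I + \sum_{u,v} \sum_J t_{u,v}^J$;

obsolete: $v_H.tc = t_H + \sum_u \sum_I t_u^I + \sum_{u,v} \sum_J t_{u,v}^J$.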
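The closed-form counts used in these formulas are binomial coefficients, so the clique case can be evaluated without enumerating any triangle. The sketch below is illustrative only; the function names and the input encoding (for each node u of H, a list with one entry k per clique neighbor I, where k is the number of neighbors of u in I) are assumptions, not the paper's interface.

```python
from math import comb

def triangles_inside(kind, size, t_h=0):
    """Triangles with all three nodes inside one contracted component."""
    if kind == "clique":
        return comb(size, 3)
    if kind in ("diamond", "butterfly"):
        return 2
    if kind == "obsolete":
        return t_h  # pre-calculated when the component is contracted
    return 0        # star, path and claw contain no triangles

def clique_tc(size, neighbors_in_cliques):
    """Tag tc of a clique supernode: inner triangles plus, for each node u
    and each clique neighbor I of u, the comb(k, 2) triangles formed by u
    and two of its k neighbors in I."""
    outer = sum(comb(k, 2) for ks in neighbors_in_cliques.values() for k in ks)
    return comb(size, 3) + outer
```

With five nodes and no clique neighbors, `clique_tc(5, {})` gives 10. A node with a single neighbor in a clique contributes `comb(1, 2) = 0` triangles.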

Synopses STriC also share the properties of SSubIso.

Example 4

In the contracted graph Gc of Fig. 1b, only vH2 contracts a clique, denoted by I. Synopsis STriC(vH) of a supernode vH extends SSubIso(vH) with vH.tc: (1) for vH1, (a) H1 contracted to vH1 contains no triangles; thus, $t_{H_1} = 0$; (b) I is not a neighbor of any node u in V(H1); thus, $t_u^I = 0$; and (c) nodes in V(H1) have no common neighbors, i.e., no J exists for any connected $u, v \in V(H_1)$; thus, $t_{u,v}^J = 0$. Hence, vH1.tc = 0. (2) For vH2, vH2.type = clique, |V(H2)| = 5 and no other supernodes in Gc are cliques. Hence, $v_{H2}.tc = \binom{5}{3} = 10$. (3) For vH3, u6 and u9 have only 1 neighbor in clique I; thus, $t_u^I = 0$; similarly, no J exists for any leaf u and vH3.c; thus, $t_{v_{H3}.c,\,u}^J = 0$. Hence, vH3.tc = 0. (4) Similarly, vH4.tc = 2, vH5.tc = 1 and t2.tc = 1.

Triangle counting

We now adapt algorithm TriA [26] to contracted graphs. The adapted algorithm is referred to as TriAc.

Algorithm TriA. Given a graph G, TriA assigns distinct numbers to all the nodes in G. It then enumerates triangles for each edge $(u, v)$ by counting the common neighbors w of u and v such that w<u and w<v.
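The numbering ensures that each triangle is counted exactly once, at the edge joining its two highest-numbered nodes. The following sketch shows this counting rule on an ordinary (uncontracted) undirected graph. It is a simplified illustration written for this passage, not the implementation of [26]; the function name and the edge-list input are assumptions.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as node pairs."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # assign distinct numbers to all nodes (any fixed order works)
    rank = {node: i for i, node in enumerate(adj)}
    total = 0
    for u in adj:
        for v in adj[u]:
            if rank[u] < rank[v]:  # visit each undirected edge once
                # count common neighbors numbered below both endpoints
                total += sum(1 for w in adj[u] & adj[v] if rank[w] < rank[u])
    return total
```

On a 5-clique the function returns 10, in line with the binomial count used in the synopses.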

Algorithm TriAc. On a contracted graph Gc with superedges decontracted, TriAc works in the same way as TriA except that at a supernode vH (for both topological and obsolete components), it simply accumulates vH.tc without decontraction or enumeration. It only restores superedges when necessary.

Example 5

From synopsis STriC, TriAc directly finds 14 triangles. In Gc, it finds two additional triangles (u6,t2,k1) and (t1,t2,k1) by restoring superedges. Thus, it finds 16 triangles in G. No supernodes of either topological or obsolete components are decontracted.

Analyses. One can verify that TriAc is correct since it counts all triangles in G once and only once. It speeds up TriA since it works on a smaller contracted Gc.

Temporal triangle counting. Algorithm TriAc can be adapted to count triangles with timestamp later than a given time t. It prunes a supernode vH if $v_H.t \le t$, and drops a triangle if it has a node v with $T(v) \le t$.

Shortest distance with contraction

We next study the shortest distance problem.

Shortest distance. Consider an undirected weighted graph $G = (V, E, L, T, W)$ with additional weight W; for each edge e, W(e) is a positive number for the length of the edge. In a graph G, a path p from $v_0$ to $v_k$ is a sequence $\langle v_0, v_1, \ldots, v_k \rangle$ of nodes such that $(v_i, v_{i+1}) \in E$ for all $0 \le i < k$. The length of a path $p = (v_0, \ldots, v_k)$ in G is simply $\sum_{i \in [1,k]} W(v_{i-1}, v_i)$.

The shortest distance problem, denoted by Dist, is to compute, given a pair $(u, v)$ of nodes in G, the shortest distance between u and v, denoted by $d(u, v)$ [4, 25, 31].

Shortest distance has a wide range of applications, e.g., socially-sensitive search [89, 93], influential community detection [9, 56] and centrality analysis [16, 18].

As opposed to SubIso, shortest distance queries are unlabeled, i.e., the value of a query answer $d(u, v)$ does not depend on labels. In contrast with SubIso and TriC, Dist is non-local, i.e., there exists no d independent of the input graph G such that d(u,v)<d.

We adapt Dijkstra's algorithm [31], denoted by Dijkstra, to contracted graphs; it is one of the best-known algorithms for Dist. Just like for TriC, the adapted algorithm for Dist decontracts no supernodes, neither topological components nor obsolete parts.

Contraction for Dist

A path between nodes u and v can be decomposed into (1) edges between supernodes, and (2) edges within a supernode. The idea of synopses for Dist is to pre-compute the shortest distances within supernodes to avoid supernode decontraction, for both topological and obsolete components. Edges between supernodes are recovered by superedge decontraction when necessary.

Suppose that v1 and v2 are nodes mapped to supernode vH by fC, i.e., $f_C(v_1) = f_C(v_2) = v_H$. We compute the shortest distance for $(v_1, v_2)$ within the subgraph H contracted to vH, denoted by $d_{v_H}(v_1, v_2)$. The synopsis SDist(vH) extends SSubIso(vH) with a tag dis that is a set of triples $(v_1, v_2, d_{v_H}(v_1, v_2))$ for a path between v1 and v2 within vH, based on vH.type:

clique: $v_H.dis = \{(v_1, v_2, d_{v_H}(v_1, v_2))\}$ for all pairs of $v_1, v_2 \in f'_C(v_H)$;

path: $v_H.dis = \{(u_1, u_{|f'_C(v_H)|}, \sum_{1 \le i < |f'_C(v_H)|} W(u_i, u_{i+1}))\}$, i.e., it records the path itself;

diamond, butterfly and obsolete components: $v_H.dis = \{(v_1, v_2, d_{v_H}(v_1, v_2)) \mid v_1, v_2 \in f'_C(v_H)\}$.
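The tag dis can be pre-computed per component. The sketch below shows two cases under stated assumptions: edges are undirected and stored as a dictionary from node pairs to weights, the function names are hypothetical, and the all-pairs case uses Floyd-Warshall purely for brevity (the paper does not prescribe how the distances inside a component are obtained). Because components have at most ku nodes, a cubic-time routine on one component is still cheap.

```python
def path_dis(nodes, weight):
    """dis for a path: one triple (first node, last node, total length)."""
    total = sum(weight[(nodes[i], nodes[i + 1])] for i in range(len(nodes) - 1))
    return {(nodes[0], nodes[-1], total)}

def all_pairs_dis(nodes, weight):
    """dis for a small component: shortest distances between all node pairs,
    using only edges inside the component."""
    inf = float("inf")
    d = {(a, b): (0 if a == b else inf) for a in nodes for b in nodes}
    for (a, b), w in weight.items():  # undirected edges
        d[(a, b)] = min(d[(a, b)], w)
        d[(b, a)] = min(d[(b, a)], w)
    for m in nodes:                   # Floyd-Warshall relaxation
        for a in nodes:
            for b in nodes:
                if d[(a, m)] + d[(m, b)] < d[(a, b)]:
                    d[(a, b)] = d[(a, m)] + d[(m, b)]
    # one triple per unordered pair
    return {(a, b, d[(a, b)]) for i, a in enumerate(nodes) for b in nodes[i + 1:]}
```

For a five-node path with unit weights, `path_dis` returns a single triple of length 4; for a butterfly with unit weights, nodes on different triangles are at distance 2 through the shared node.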

In practice, the number of nodes in most contracted subgraphs is far below the upper bound ku. Indeed, diamonds and butterflies have a constant size, and we find that a clique (resp. star, path and obsolete component) typically contains 6.5 (resp. 7.3, 4.1 and 49.2) nodes. Hence, the size of a synopsis is fairly small. Note that the upper bound ku should be larger than typical sizes of components, since large components exist and may be more powerful for accelerating computations.

Example 6

Assume W(u,v)=1 for all edges $(u, v)$ in graph G of Fig. 1a. Then, for supernodes in the contracted graph of Fig. 1b, (1) vH1.dis = {(i1,f1,1), (i1,n1,1), (i1,l1,1), (f1,n1,2), (f1,l1,2), (n1,l1,2)}; (2) $v_{H2}.dis = \{(k_i, k_j, 1) \mid 1 \le i < j \le 5\}$; (3) vH4.dis = {(u1,u2,1), (u1,u5,1), (u1,u3,2), (u1,u4,2), $\ldots$}; and finally, (4) vH5.dis = {(k6,t1,4)}.

Shortest distance

We adapt algorithm Dijkstra [31] to contracted graphs Gc, and refer to the adapted algorithm as DisAc.

Algorithm Dijkstra. Given a graph G and a pair $(u, v)$ of nodes, Dijkstra finds the shortest distances from u to nodes in G in ascending order, and terminates as soon as $d(u, v)$ is determined. It maintains a set S of nodes whose shortest distances from u are known; it initializes distance estimates $\bar{d}(u) = 0$, and $\bar{d}(w) = \infty$ for other nodes. At each step, Dijkstra moves a node w from $V \setminus S$ to S that has minimal $\bar{d}(w)$, and updates distance estimates of nodes adjacent to w accordingly.

Algorithm DisAc. DisAc is the same as Dijkstra except for minor changes to updating distance estimates. When moving a node w from $V \setminus S$ to S, suppose that vH is the supernode to which w is mapped, i.e., $f_C(w) = v_H$. DisAc updates distance estimates $\bar{d}(w')$ for $w' \in f'_C(v_H)$ as follows: (1) if vH.type is clique, butterfly, diamond or obsolete, update $\bar{d}(w')$ by $\bar{d}(w) + d_{v_H}(w, w')$ using vH.dis; (2) if vH.type = star or claw, update $\bar{d}(w')$ by $\bar{d}(w) + d_{v_H}(w, w')$, where $d_{v_H}(w, w')$ can be easily computed by synopsis; (3) if vH.type = path, update $\bar{d}(w')$ by $\bar{d}(w) + d_{v_H}(w, w')$ for the other endpoint $w'$ using vH.dis; in these cases, no supernode (for topological or obsolete components) is decontracted. DisAc updates $\bar{d}(w')$ by $\bar{d}(w) + W(w, w')$ for all edges $(w, w')$ where $f_C(w) \neq f_C(w')$, by decontracting superedge $(f_C(w), f_C(w'))$ at worst, in the same way as Dijkstra.
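One way to picture this update rule is as Dijkstra's algorithm in which the pre-computed distances inside a supernode act as shortcut edges. The sketch below follows that reading and is not the paper's code. Its inputs are assumptions: `cross_adj` lists edges between different supernodes (as if the superedges had been decontracted), `intra_dist` holds the within-supernode distances in both directions, and `f_c` maps each node to its supernode. It uses a binary heap from the standard `heapq` module rather than an explicit set S.

```python
import heapq

def dis_ac(cross_adj, intra_dist, f_c, source, target):
    """Shortest distance from source to target using synopsis distances.

    cross_adj  -- node -> list of (neighbor, weight), edges across supernodes
    intra_dist -- supernode -> {node: {other node: distance inside supernode}}
    f_c        -- node -> supernode
    """
    best = {source: 0}
    settled = set()
    heap = [(0, source)]
    while heap:
        d, w = heapq.heappop(heap)
        if w in settled:
            continue  # stale heap entry
        settled.add(w)
        if w == target:
            return d  # terminate as soon as the target is settled
        # relax pre-computed distances inside w's supernode, then crossing edges
        shortcuts = intra_dist.get(f_c[w], {}).get(w, {}).items()
        for x, wt in list(shortcuts) + list(cross_adj.get(w, ())):
            if d + wt < best.get(x, float("inf")):
                best[x] = d + wt
                heapq.heappush(heap, (d + wt, x))
    return float("inf")
```

The sketch assumes node identifiers are mutually comparable (for example, all strings), since ties on distance are broken by comparing nodes in the heap.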

Example 7

Given Dist query $(u_2, k_5)$ on the contracted graph Gc of Fig. 1b, DisAc works in the following steps: (1) initially, $S = \emptyset$, $\bar{d}(u_2) = 0$, and $\bar{d}(v) = \infty$ for all other nodes; (2) $S = \{u_2\}$, $\bar{d}(u_1) = \bar{d}(u_5) = 1$, $\bar{d}(u_3) = \bar{d}(u_4) = 2$ by using SDist(vH4); (3) $S = \{u_2, u_1, u_5, u_3, u_4\}$, $\bar{d}(t_1) = 3$ by edge $(u_4, t_1)$, and $\bar{d}(k_6) = \bar{d}(t_1) + d_{v_{H5}}(k_6, t_1) = 7$ by vH5.dis; $\bar{d}(i_1) = 2$ by edge $(u_1, i_1)$, and $\bar{d}(f_1) = \bar{d}(n_1) = \bar{d}(l_1) = 3$ by vH1.dis; similarly, $\bar{d}(u_7) = 3$ and $\bar{d}(u_{10}) = 4$, $\bar{d}(u_6) = \bar{d}(u_8) = \bar{d}(u_9) = 5$ by making use of reverse function $f'_C$ and synopsis SDist(vH3) (note that vH3 contracts a star); (4) $S = \{u_2, u_1, u_5, u_3, u_4, i_1, t_1, u_7\}$, $\bar{d}(t_2) = 4$ by edge $(t_1, t_2)$; and (5) $S = \{u_2, u_1, u_5, u_3, u_4, i_1, t_1, u_7, f_1, n_1, l_1, t_2\}$, $\bar{d}(k_1) = \bar{d}(k_3) = \bar{d}(k_5) = 5$ by edges $(t_2, k_1)$, $(t_2, k_3)$, $(t_2, k_5)$. When DisAc moves node k5 to S, it gets $d(k_5) = 5$. The algorithm returns $d(u_2, k_5) = 5$.

Analyses. By induction on the length of shortest paths, one can verify that DisAc is correct. In particular, for each node $w'$ in G, when $\bar{d}(w')$ is updated by a node w that is mapped to the same supernode, the update is equivalent to a series of Dijkstra updates. Moreover, DisAc works on smaller contracted graphs Gc and saves traversal cost inside contracted components without any decontraction, neither topological nor obsolete.

Temporal shortest distance. Similar to temporal SubIso and TriC, we study temporal Dist queries $(u, v, t)$, where $(u, v)$ is a pair of nodes as in Dist, and t is a timestamp. It is to compute the shortest length of paths p from u to v such that for each node w on p, T(w)>t.

Algorithm DisAc can be easily adapted to temporal Dist, by skipping nodes v with $T(v) \le t$. In particular, it safely ignores a supernode vH if $v_H.t \le t$.

Connected component with contraction

We next study the connected component problem [29, 85]. In a graph G, a connected component is a maximal subgraph of G in which any two nodes are connected to each other via a path. The connected component problem, denoted as CC, is to compute the set of pairs $(s, n)$ for a given graph G, where $(s, n)$ indicates that there are n connected components in G that consist of s nodes.

Given a graph G, CC returns the numbers of connected components of various sizes in G. Similar to Dist, CC is a non-local query, i.e., it has to traverse the entire graph when answering the query. It is also unlabeled, i.e., labels have no impact on its query answer.

This form of CC is used in pattern recognition [45, 53], graph partition [86] and random walk [49].

We adapt algorithm CCA of [85] for CC to contracted graphs, since it is one of the most efficient CC algorithms. Better still, we show that the adapted algorithm decontracts neither supernodes nor superedges.

Contraction for CC

The synopsis SSubIso for SubIso suffices for us to answer CC queries. Observe that each subgraph H contracted to a supernode vH is connected, no matter whether H is a topological component or an obsolete component. We can regard a supernode vH as a whole when evaluating CC queries, and leverage SSubIso(vH) and fC to compute the size of connected components. We need neither additional synopses nor any decontraction.

Connected component

We now adapt algorithm CCA [85] to contracted graphs. The adapted algorithm is referred to as CCAc.

Algorithm CCA. We first review how CCA works. (1) Starting from each unvisited node v in graph G, CCA performs a depth-first search (DFS) and collects all unvisited nodes reached in the traversal. These nodes are connected to v and are marked as visited. When no more unvisited nodes can be reached, the collected nodes and v form a connected component. CCA records its size s. (2) After all nodes in G are visited, CCA groups connected components by size s and returns the aggregate $(s, n)$.

Algorithm CCAc. On the contracted graph Gc, CCAc works in the same way as CCA except that (1) it only performs DFS on Gc, without decontracting any supernodes or superedges; and (2) the size of each connected component is aggregated as the sum of the sizes $|f'_C(v_H)|$ of all supernodes vH in the component.
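The two changes can be sketched as an iterative DFS over supernodes that adds up component sizes. The function name and the input encoding (an adjacency map over supernodes and a size map giving the number of contracted nodes, with 1 for plain nodes) are assumptions made for this illustration.

```python
from collections import Counter

def cc_sizes(super_adj, size):
    """Return sorted (s, n) pairs: n connected components consisting of s nodes.

    super_adj -- supernode -> iterable of adjacent supernodes (undirected)
    size      -- supernode -> number of nodes contracted into it
    """
    visited = set()
    histogram = Counter()
    for start in size:
        if start in visited:
            continue
        visited.add(start)
        stack, total = [start], 0
        while stack:  # DFS on the contracted graph only
            x = stack.pop()
            total += size[x]  # a supernode counts for all nodes it contracts
            for y in super_adj.get(x, ()):
                if y not in visited:
                    visited.add(y)
                    stack.append(y)
        histogram[total] += 1
    return sorted(histogram.items())
```

As a usage example, six supernodes of sizes 4, 5, 5, 5, 5 and 1 that are all connected yield `[(25, 1)]`, whatever the exact superedge layout is.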

Example 8

On the contracted graph in Fig. 1b, CCAc finds a connected component that consists of supernodes vH1, vH2, vH3, vH4, vH5 and t2. The size s of this component is simply the sum $|f'_C(v_{H1})| + \cdots + |f'_C(v_{H5})| + |f'_C(t_2)|$, i.e., s = 25. Since all the supernodes in Gc have been visited, CCAc outputs (25, 1).

Analyses. CCAc is correct since it follows the same logic as CCA and all contracted subgraphs are guaranteed to be connected. The algorithm takes at most O(|Gc|) time while CCA takes O(|G|) time. Since Gc is much smaller than G, CCAc always outperforms CCA.

Temporal connected component. CCAc can be adapted to compute connected components with timestamp later than a given time t, by skipping nodes v with $T(v) \le t$. It safely ignores a supernode vH if $v_H.t \le t$.

Clique decision with contraction

We next study a decision problem for clique. A clique in a graph G is a subgraph C in which there are edges between any two nodes; it is a k-clique if the number of nodes in C is k (i.e., |V(C)|=k). We consider the clique decision problem [20, 57], denoted by CD, to find whether there exists a k-clique in G for a given number k. CD is widely used in community search [76], team formation [59] and anomaly detection [11, 65].

Similar to Dist and CC, CD is unlabeled. In contrast with Dist and CC, but similar to SubIso, it is local, i.e., all nodes in a clique are within 1 hop of each other.

The clique decision problem is known to be NP-complete (cf. [42]). A variety of algorithms have been developed for CD, notably CDA of [57], which we will adapt next.

Contraction for CD

Observe the following. (1) Cliques in G contracted into supernodes in Gc can help us find an initial maximum clique (see below). (2) The degree of a node can be used as an upper bound of the maximum clique containing it.

In light of these, we extend synopsis SSubIso(vH) with tags cs and md. For a subgraph H that is contracted to a supernode vH, the two tags record the maximum clique found in H and the maximum degree of the nodes in H, respectively. Specifically, vH.cs is based on vH.type:

clique: $v_H.cs = |f'_C(v_H)|$;

diamond and butterfly: vH.cs=3;

star, path and claw: vH.cs=2; and

obsolete component: we find a k-clique in an obsolete component online.

and vH.md is by aggregation:

node v: $v.md = |\{u \mid (u, v) \in E\}|$; and

supernode vH: vH.md=max{v.md|fC(v)=vH}.

Synopses SCD also share the properties of SSubIso.

Example 9

In the contracted graph Gc of Fig. 1b, SCD(vH) extends SSubIso(vH) with tags cs and md as follows. Since vH2 contracts a clique, vH2.cs=5; vH4.cs=3 since vH4 contracts a butterfly, and vH.cs=2 for supernodes vH3 (star) and vH5 (path). For tag md, vH1.md=i1.md=4; similarly, vH2.md=8, vH3.md=4, vH4.md=4, vH5.md=4, and t2.md=4.

Clique decision

We adapt CDA [57] to Gc, denoted as CDAc.

Algorithm CDA. We first review CDA. Given a graph G, algorithm CDA checks the existence of a k-clique in G by branch-and-bound. It branches from each node in G. Denote by C the current clique in the search, and by P the set of common neighbors of the nodes in C. CDA (1) bounds the search from C if |C|+|P|<k, or (2) branches from each node u in P to expand C. More specifically, it iteratively adds a node u from P to C and removes all those nodes in P that are not neighbors of u, enlarging C and shrinking P until P is empty. If $|C| \ge k$, then C contains a k-clique and CDA terminates with true; it returns false if no k-clique is found after all branches are searched.
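The branch-and-bound scheme can be sketched as a short recursion. This is a generic illustration of the bounding and branching rules described above, not the algorithm of [57]; the function name and the adjacency-set input are assumptions, and only the size of C is tracked because the decision problem does not need the clique itself.

```python
def has_k_clique(adj, k):
    """Decide whether an undirected graph contains a k-clique.

    adj -- dict: node -> set of neighbors (no self-loops)
    """
    def expand(clique_size, candidates):
        if clique_size >= k:
            return True
        if clique_size + len(candidates) < k:
            return False  # bound: C cannot grow to k nodes
        for u in list(candidates):
            # branch: add u to C and keep only common neighbors in P
            if expand(clique_size + 1, candidates & adj[u]):
                return True
            candidates = candidates - {u}  # u has been fully explored
        return False

    if k <= 0:
        return True
    return expand(0, set(adj))
```

Removing u from the candidate set after its branch fails avoids revisiting the same clique through a different node order.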

Algorithm CDAc. CDAc adopts the same logic as CDA except the following: (1) it picks the maximum synopsis vH.cs among all supernodes vH in Gc; a k-clique is found directly if $v_H.cs \ge k$; and (2) it skips a supernode vH in Gc if vH.md<k-1. Superedges adjacent to vH are skipped as well since no k-clique contains any node contracted to vH. Otherwise, it checks the synopsis of vH if vH contracts a topological component, or restores obsolete component H contracted to vH, to check cliques in the original graph G. Note that CDAc initiates the search with the largest clique contracted, by checking the synopses. Hence, cliques play a more important role than the other regular structures for CD.
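The two synopsis-based shortcuts can be isolated as a pre-check that runs before any search. The sketch below is an illustration under assumptions: synopses are dictionaries with tags cs and md, cs is None for obsolete components (whose cliques are searched online), and the function name `cd_precheck` is hypothetical. It returns the answer when the synopses alone decide the query, together with the supernodes that still need to be examined.

```python
def cd_precheck(synopses, k):
    """Apply the cs and md rules before searching.

    synopses -- supernode -> {"cs": int or None, "md": int}
    Returns (True, []) if some contracted clique already has k nodes,
    otherwise (None, supernodes that cannot be skipped).
    """
    known = [s["cs"] for s in synopses.values() if s["cs"] is not None]
    if max(known, default=0) >= k:
        return True, []  # rule (1): a k-clique is found directly
    # rule (2): a node of degree below k - 1 cannot belong to a k-clique
    keep = [v for v, s in synopses.items() if s["md"] >= k - 1]
    return None, keep
```

With the tag values of Example 9, the pre-check answers k = 5 directly, and for k = 6 it keeps only vH2 for further checking.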

Example 10

For query with k=5, by SCD(vH2) of Fig. 1b, CDAc finds a 5-clique and returns true.

For query with k=6, all supernodes except vH2 are skipped by synopses. Their adjacent superedges are skipped as well. Since vH2 only contracts a 5-clique, CDAc fails to find a 6-clique and returns false.

Analyses. One can verify that CDAc is correct since it follows the same logic as CDA except that it adopts pruning strategies that are possible because of the use of synopses. While the two algorithms have the same worst-case complexity, CDAc starts with a supernode with a maximum clique and may find a k-clique directly; moreover, it skips a supernode as a whole by synopses, which reduces unnecessary search and validation.

Temporal k-clique. Algorithm CDAc can be adapted to find a k-clique with timestamp later than a given time t, by skipping nodes v with $T(v) \le t$. Like SubAc and TriAc, it safely ignores a supernode vH if $v_H.t \le t$.

Incremental contraction

We next develop an incremental algorithm to maintain contracted graphs in response to updates ΔG to graphs G. We start with batch update ΔG, which is a sequence of edge insertions and deletions. We formulate the problem (Sect. 4.1), present the incremental algorithm (Sects. 4.2 and 4.3), discuss vertex updates (Sect. 4.4), and parallelize the algorithm (Sect. 4.5).

Incremental contraction problem

Updates to a graph G, denoted by ΔG, consist of (1) node updates, i.e., node insertions and deletions; and (2) edge updates, i.e., edge insertions and deletions.

Given a contraction scheme $\langle f_C, S, f_D \rangle$, a contracted graph $G_c = f_C(G)$, and updates ΔG, the incremental contraction problem, denoted as ICP, is to compute (a) changes $\Delta G_c$ to $G_c$ such that $G_c \oplus \Delta G_c = f_C(G \oplus \Delta G)$, i.e., to get the contracted graph of the updated graph $G \oplus \Delta G$, where $G_c \oplus \Delta G_c$ applies $\Delta G_c$ to $G_c$; (b) the updated synopses of affected supernodes; and (c) functions $f_C \oplus \Delta f_C$ and $f_D \oplus \Delta f_D$ w.r.t. the new contracted graph $G_c \oplus \Delta G_c$.

ICP studies the maintenance of contracted graphs in response to update ΔG that may both change the topological structures of contracted graph Gc, and refresh timestamps of nodes. As a consequence, obsolete nodes may be promoted to be non-obsolete ones if they are touched by ΔG, among other things.

Criterion. Following [77], we measure the complexity of incremental algorithms with the size of the affected area, denoted by AFF. Here AFF includes (a) changes ΔG to the input, (b) changes ΔGc to the output, and (c) edges with at least an endpoint in (a) or (b).

An incremental algorithm is said to be bounded [77] if its complexity is determined by |AFF|, not by the size |G| of the entire (possibly big) graph G.

Intuitively, ΔG is typically small in practice. When ΔG is small, so is ΔGc. Hence, when ΔG is small, a bounded incremental algorithm is often far more efficient than a batch algorithm that recomputes Gc starting from scratch, since the cost of the latter depends on the size of G, as opposed to |AFF| of the former.

An incremental problem is said to be bounded if there exists a bounded incremental algorithm for it, and it is unbounded otherwise.

Challenges. Problem ICP is nontrivial. (1) Topological components are fragile. For instance, when inserting an edge between two leaves of a star H, H is no longer a star, and its nodes may need to be merged into other topological components. (2) Refreshing timestamps by a query Q may make some obsolete nodes “fresh” and force us to reorganize obsolete and topological components. (3) When contracted graph Gc is changed, so are its associated synopses and decontraction function.

Main result. Despite challenges, we show that bounded incremental contraction is within reach in practice.

Theorem 2

Problem ICP is bounded for SubIso, TriC, Dist, CC and CD, and takes at most $O(|\text{AFF}|^2)$ time.

We first give a constructive proof of Theorem 2 for edge updates, consisting of two parts: (1) the maintenance of the contracted graph Gc and its associated decontraction function fD (Sect. 4.2); and (2) the maintenance of the synopses of affected supernodes (Sect. 4.3). We then give a constructive proof of Theorem 2 for vertex updates (Sect. 4.4), which is simpler.

Incremental contraction algorithm

An incremental algorithm is shown in Fig. 7, denoted by IncCR. It has three steps: preprocessing to initialize affected areas, updating to maintain contracted graph Gc, and contracting to process refreshed singleton nodes. To simplify the discussion, we focus on how to update Gc in response to ΔG, where ΔG consists of edge insertions and deletions; the handling of fD is similar.

Fig. 7. Algorithm IncCR

(a) Preprocessing. Algorithm IncCR first identifies an initial area affected by edge update ΔG (lines 1-2). It removes “unaffecting” updates from ΔG that have no impact on Gc (line 1), i.e.,  edges in ΔG that are between two supernodes when none of their nodes is an intermediate node of a path. These updates are made to corresponding subgraphs of G that are maintained by fD. It then refreshes timestamps of nodes u touched by edges e=(u,v) in ΔG (line 2). Suppose that node u is mapped by fC to supernode vH with vH.type = obsolete. Then, vH is decomposed into singleton nodes, u is non-obsolete and is mapped to itself by fC. Such singleton nodes are collected in a set Vs, as the initial area affected by ΔG. Node v is treated similarly.
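The filter for unaffecting updates (line 1) can be sketched as follows. This covers only the criterion stated above (an edge between two different supernodes, neither endpoint being an intermediate node of a path) and leaves out the timestamp refresh of line 2. The function name, the set of intermediate path nodes and the mapping used to represent fC are assumptions for illustration.

```python
def split_updates(delta, f_c, path_intermediate):
    """Split edge updates into (affecting, unaffecting) lists.

    delta             -- list of edges (u, v) to insert or delete
    f_c               -- node -> supernode
    path_intermediate -- set of nodes that are intermediate nodes of a path
    """
    affecting, unaffecting = [], []
    for (u, v) in delta:
        crosses = f_c[u] != f_c[v]
        touches_path = u in path_intermediate or v in path_intermediate
        if crosses and not touches_path:
            unaffecting.append((u, v))  # no impact on the contracted graph
        else:
            affecting.append((u, v))
    return affecting, unaffecting
```

Applied to four edge insertions such as those of Example 11, the two edges that join different supernodes without touching a path's interior are classified as unaffecting, while the two edges inside a single supernode remain affecting.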

Note that an unaffecting update would not become “affecting update” later on. All changes in ΔG are applied to graph G in the given order.

(b) Updating. Algorithm IncCR then updates contracted graph Gc (lines 3-8). For each update e=(u,v), IncCR invokes procedure IncCR+ (resp. IncCR-) to update Gc when e is to be inserted (resp. deleted) (lines 4-7). Updating Gc may make some updates in ΔG unaffecting, which are further removed from ΔG (line 8). Moreover, some nodes may become “singleton” when a topological component is decomposed by the updates, e.g., leaves of a star. It collects such nodes in the set Vs.

More specifically, to insert an edge e=(u,v), IncCR+ updates Gc and adds new singleton nodes to Vs. Suppose that u (resp. v) is mapped by fC to supernode vH1 (resp. vH2) (line 1). IncCR+ decomposes vH1 and vH2 into the regular structures of topological components (line 2). For instance, if vH1=vH2, and vH1.type=star, u and v make a triangle with the central node; thus, IncCR+ decomposes the star into singleton nodes. When vH1.type=clique and vH2.type=path, supernode vH2 is divided into two shorter paths. Note that components left with fewer than $k_l$ nodes due to updates are decomposed into singleton nodes. All such singleton nodes are added to the set Vs (line 3).

(c) Contracting. Finally, algorithm IncCR processes nodes in the set Vs (line 10). It (a) merges nodes into neighboring supernodes; or (b) builds new components with these nodes, if possible; otherwise (c) it leaves node v as a singleton, i.e., by letting fC(v)=v.

Example 11

Consider inserting four edges into graph G of Fig. 1a: (1) (n1,f1): nodes n1 and f1 are mapped to obsolete component vH1, and vH1 is decomposed into singleton nodes, one for each of n1, f1, i1 and l1; then, (n1,f1) is removed from ΔG; (2) (k1,u4): it is unaffecting since $f_C(k_1) \neq f_C(u_4)$ and neither k1 nor u4 is an intermediate node of a path; (3) (k1,u10): it is also unaffecting; and (4) (u1,u4): vH4 is not a butterfly any longer, and is decomposed into singletons.

Edge deletions are handled similarly.
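The test for unaffecting updates used in Example 11 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it encodes only the condition as stated in the example (fC(u)≠fC(v) and neither endpoint is an intermediate node of a path). The dictionaries `f_c`, `sn_type` and `sn_path`, as well as the node and supernode names, are hypothetical.

```python
from typing import Dict, List


def is_intermediate_path_node(node: str, supernode: str,
                              sn_type: Dict[str, str],
                              sn_path: Dict[str, List[str]]) -> bool:
    """True if `node` lies strictly inside a supernode of type path."""
    if sn_type.get(supernode) != "path":
        return False
    # Assumption: a path supernode stores its nodes in path order.
    return node in sn_path[supernode][1:-1]


def is_unaffecting(u: str, v: str, f_c: Dict[str, str],
                   sn_type: Dict[str, str],
                   sn_path: Dict[str, List[str]]) -> bool:
    """Condition from Example 11: endpoints fall into different supernodes
    and neither endpoint is an intermediate node of a path."""
    hu, hv = f_c[u], f_c[v]
    if hu == hv:
        return False
    return not (is_intermediate_path_node(u, hu, sn_type, sn_path)
                or is_intermediate_path_node(v, hv, sn_type, sn_path))


# Hypothetical contracted graph: a star H1, a clique H2 and a path H3.
f_c = {"a": "H1", "b": "H2", "p1": "H3", "p2": "H3", "p3": "H3"}
sn_type = {"H1": "star", "H2": "clique", "H3": "path"}
sn_path = {"H3": ["p1", "p2", "p3"]}

print(is_unaffecting("a", "b", f_c, sn_type, sn_path))   # crossing edge
print(is_unaffecting("a", "p2", f_c, sn_type, sn_path))  # touches inner path node
```

Updates that fail this test are the ones handed to IncCR+ or IncCR- for decomposition.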

Analyses. Algorithm IncCR takes O(|AFF|^2) time: (a) the preprocessing step is in O(|ΔG|) time; (b) the updating step takes O(|AFF|) time, in which updating fD is the dominating part; and (c) the cost of contracting Vs into topological components is in O(|AFF|^2) time.

The algorithm is (a) bounded [77], since its cost is determined by |AFF| alone, and (b) local [35], i.e., the changes are confined only to affected supernodes and their neighbors in the contracted graph Gc.

Maintenance of synopses

We next show that for SubIso, TriC, Dist, CC and CD, (a) the number of supernodes whose synopses are affected is at most O(|AFF|), and (b) the synopsis for each supernode can be updated in O(|AFF|) time. Hence, incremental synopses maintenance for each of SubIso, TriC, Dist, CC and CD takes at most O(|AFF|^2) time.

To see these, consider a supernode vH in Gc.

(a) For SubIso, recall that SSubIso(vH) stores the type and key features of vH (Sect. 3.1). One can see that the number of supernodes whose synopses are affected is at most |ΔGc|, and SSubIso(vH) for each such vH can be updated in O(1) time. Thus, the maintenance of SSubIso is bounded in O(|AFF|) time due to bounds [kl,ku].

(b) For TriC, synopsis STriC(vH) extends SSubIso(vH) with vH.tc, which is updated by (i) clique neighbors I of nodes u in vH when I⊆AFF; (ii) vH itself if vH.type is clique or obsolete; and (iii) common neighbors J of connected nodes u, v in vH for J⊆AFF. Thus, supernodes affected are enclosed in AFF, which covers ΔG, ΔGc and their neighbors. Moreover, STriC(vH) for each affected vH can be updated in O(|AFF|) time. Thus, the maintenance of STriC is bounded in O(|AFF|^2) time.

(c) For Dist, SDist(vH) extends SSubIso(vH) with vH.dis, which is confined to vH and can be updated in O(1) time since |fC(vH)|≤ku. Thus, the incremental maintenance of SDist is bounded in O(|AFF|) time.

(d) For CC, recall that the synopsis SSubIso suffices to answer CC queries. Hence, as in case (a), SCC(vH) for each supernode vH can be updated in O(1) time, and the maintenance of SCC is bounded in O(|AFF|) time.

(e) For CD, SCD(vH) extends SSubIso(vH) with vH.cs and vH.md. Here vH.cs is confined to vH and can be updated in O(1) time; vH.md is confined to vH and its neighbors, and can be updated in O(|AFF|) time. Thus, the maintenance of SCD is in O(|AFF|^2) time.

Example 12

Continuing with Example 11, we show how to maintain vH.tc in STriC(vH) for supernodes vH in Gc; SSubIso(vH), SDist(vH), SCC(vH) and SCD(vH) are simpler since their affected synopses are confined to ΔGc.

More specifically, (1) for edge insertion (n1,f1), supernode vH1 is decomposed into four singletons, for which synopses are defined as n1.tc=f1.tc=l1.tc=i1.tc=0. (2) For (unaffecting) edge insertion (k1,u4), vH.tc remains the same for all vH∈Gc. (3) For (unaffecting) edge insertion (k1,u10), k1 becomes a common neighbor of u10 and u6; let H denote the subgraph contracted by vH2; then, t^H_{u10,u6}=1 and vH3.tc=1. (4) When inserting edge (u1,u4), vH4 is decomposed into singletons. During the contraction phase, nodes u1,u2,u5,u4 are contracted into a diamond vH4 with vH4.tc=2. Node u3 is left singleton, with u3.tc=0.
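As a small check of step (4), the triangle count stored in vH.tc can be recomputed by brute force for a component of bounded size. The sketch below assumes that a diamond is a 4-clique with one edge missing; the exact edge set of the diamond in Fig. 1a is not given here, so the edges listed (in particular the chord (u1,u5)) are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def count_triangles(nodes: List[str],
                    edges: Iterable[Tuple[str, str]]) -> int:
    """Brute-force triangle count; adequate for components of at most ku nodes."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return sum(1 for a, b, c in combinations(nodes, 3)
               if b in adj[a] and c in adj[a] and c in adj[b])


# Hypothetical diamond on u1, u2, u5, u4: a 4-cycle plus the chord (u1, u5).
diamond_nodes = ["u1", "u2", "u5", "u4"]
diamond_edges = [("u1", "u2"), ("u2", "u5"), ("u5", "u4"),
                 ("u4", "u1"), ("u1", "u5")]

diamond_tc = count_triangles(diamond_nodes, diamond_edges)
singleton_tc = count_triangles(["u3"], [])
print(diamond_tc, singleton_tc)
```

Under this assumption the diamond contains two triangles and the singleton none, matching vH4.tc=2 and u3.tc=0 in the example.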

Vertex updates

Vertex updates are a dual of edge updates [58], and can be processed accordingly. More specifically, we present incremental algorithm IncCRV in Fig. 8, to deal with vertex updates. Consider node insertions and deletions.

Fig. 8. Algorithm IncCRV

(1) When inserting a new node u, algorithm IncCRV first treats u as a singleton and collects it in set Vs (lines 3-4); the node u is then contracted into a topological structure in the contracting step (line 7).

(2) When deleting a node u that is contracted into a supernode vH, there are three cases to consider, elaborated in IncCRV- of Fig. 8: (a) if vH is a clique, vH remains unchanged except that u is removed (lines 2-3); (b) if vH is a claw, a butterfly or an obsolete component, vH is decontracted and all nodes in fC(vH) except u are treated as singletons and are collected in set Vs (lines 4-5); and (c) otherwise, we process u and vH by synopsis and add resulting singleton nodes into Vs (lines 6-7). For instance, consider the case when vH contracts a star: (i) if u is the central node vH.c, vH is decontracted in the same way as case (b); and (ii) otherwise, vH remains a star, similar to case (a).
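The case analysis for node deletion can be sketched as follows. This is an illustrative sketch rather than IncCRV- itself: the `Supernode` class and `delete_node` function are hypothetical names, only the cases spelled out above are modeled, and two assumptions are added. First, a clique or star left with fewer than kl nodes is decomposed, in line with the rule for edge updates. Second, supernode types whose handling is not detailed here (e.g., paths and diamonds) are conservatively decontracted.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class Supernode:
    kind: str                      # "clique", "star", "claw", "butterfly", "obsolete", ...
    members: Set[str]
    center: Optional[str] = None   # central node, used for stars only


def delete_node(u: str, vh: Supernode,
                k_l: int = 4) -> Tuple[Optional[Supernode], Set[str]]:
    """Return (surviving supernode or None, nodes to add to Vs)."""
    rest = vh.members - {u}
    keeps_shape = vh.kind == "clique" or (vh.kind == "star" and u != vh.center)
    if keeps_shape and len(rest) >= k_l:
        # Case (a) and star case (ii): the structure survives without u.
        return Supernode(vh.kind, rest, vh.center), set()
    # Case (b), star case (i), undersized remainders (assumption), and the
    # conservative fallback for other types (assumption): decontract.
    return None, rest


clique = Supernode("clique", {"a", "b", "c", "d", "e"})
star = Supernode("star", {"c", "l1", "l2", "l3", "l4"}, center="c")
butterfly = Supernode("butterfly", {"b1", "b2", "b3", "b4", "b5"})

print(delete_node("a", clique))
print(delete_node("c", star))
print(delete_node("b1", butterfly))
```

The returned singletons would then be handled by the contracting step, as for edge updates.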

Similar to edge updates, contracting singleton nodes of Vs into topological components dominates the cost of the process. One can verify that it can be done in at most O(|AFF|^2) time. Similarly, synopsis maintenance also takes O(|AFF|^2) time. Hence, incremental contraction remains bounded in the presence of vertex updates.

Parallel incremental contraction algorithm

We parallelize incremental algorithm IncCR, to speed up the incremental maintenance process.

Parallel setting. Similar to PCon, we use a master M0 and n workers. A contracted graph Gc is edge-partitioned and is distributed to n workers. Each fragment Fi consists of a part of the contracted graph Gc and its corresponding (partial) decontraction function and synopses. For a crossing superedge (vH1,vH2) between two fragments, i.e., when vH1 and vH2 are assigned to two distinct fragments, the decontraction function fD(vH1,vH2) is maintained in both fragments.

Parallel incremental contraction. The parallel incremental algorithm is denoted by IncPC and shown in Fig. 9. To simplify the discussion, we focus on edge updates; node updates are processed similarly. It works under BSP [88]. In a nutshell, it preprocesses crossing (super)edges (line 1). Then, each worker runs IncCR on its local fragment in parallel (line 2). After that, IncPC contracts refreshed singleton nodes Vs into supernodes (lines 3-8) along the same lines as algorithm PCon. Here, each fragment has its local set Vs and all refreshed singleton nodes in Vs can be coordinated and distributed by the master M0. Each node v is guaranteed to be contracted into one supernode vH. More specifically, algorithm IncPC works as follows.

Fig. 9. Algorithm IncPC

(1) IncPC preprocesses updated edges e=(u,v) between two fragments (line 1), i.e., when u and v are contracted into supernodes vH1 and vH2, and vH1 and vH2 are in two distinct fragments. Such updates are unaffecting as long as neither u nor v is an intermediate node of a path, and these updates are maintained by fD. Otherwise, the supernode of type path may be affected and is decomposed into singleton nodes; such refreshed singleton nodes are collected in a set Vs as the initial area affected by ΔG. In the same way as IncCR, we refresh timestamps of obsolete nodes touched by updates.

(2) Each worker locally runs IncCR in parallel (line 2). Refreshed singleton nodes that cannot be contracted into supernodes are collected in Vs (line 3).

(3) For each refreshed singleton node v in Vs, IncPC builds its uncontracted neighbors (of at most ku nodes) in parallel, similar to step (2) in PCon (lines 4-5).

(4) Master M0 merges overlapped neighbors into one and distributes disjoint ones to n workers (lines 6-7).

(5) Each worker contracts its assigned subgraphs, i.e., uncontracted neighbors, in parallel (line 8).

One can verify that each node v in G is contracted into one supernode vH (including v itself), and the contracted graph Gc cannot be further contracted.
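Step (4) above, in which the master merges overlapping neighborhoods and hands out disjoint ones, can be sketched with a union-find structure followed by a greedy assignment. This is a minimal sketch under stated assumptions: the function names are hypothetical, neighborhoods are modeled as plain node sets, the least-loaded assignment is one possible balancing policy (the distribution policy is not specified here), and the bound ku on neighborhood size is not enforced.

```python
from typing import Dict, List, Set


def merge_overlapping(neighborhoods: List[Set[str]]) -> List[Set[str]]:
    """Merge node sets that share at least one node (transitively)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for nb in neighborhoods:
        members = list(nb)
        for x in members:
            parent.setdefault(x, x)
        for x in members[1:]:
            ra, rb = find(members[0]), find(x)
            if ra != rb:
                parent[rb] = ra

    groups: Dict[str, Set[str]] = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def distribute(groups: List[Set[str]], n_workers: int) -> List[List[Set[str]]]:
    """Assign each disjoint group to the currently least-loaded worker."""
    buckets: List[List[Set[str]]] = [[] for _ in range(n_workers)]
    loads = [0] * n_workers
    for g in sorted(groups, key=len, reverse=True):
        i = loads.index(min(loads))
        buckets[i].append(g)
        loads[i] += len(g)
    return buckets


neighborhoods = [{"a", "b"}, {"b", "c"}, {"d", "e"}, {"f"}]
merged = merge_overlapping(neighborhoods)
buckets = distribute(merged, n_workers=2)
print(merged)
print(buckets)
```

Because the merged groups are pairwise disjoint, each worker can contract its assigned subgraphs in step (5) without coordinating with the others.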

Experimental study

Using ten real-life graphs, we experimentally evaluated (1) the contraction ratio; (2) the speedup of the contraction scheme; (3) the impact of contracting each topological component and obsolete component; (4) the space cost of the contraction scheme compared to existing indexing methods; (5) the efficiency of the (incremental) contraction algorithms; and (6) the parallel scalability of the (incremental) contraction algorithms.

Experiment setting. We used the following datasets.

(1) Graphs. We used ten real-life graphs: three social networks Twitter [70], LiveJournal [94] and LivePokec [10]; three Web graphs Google [64], NotreDame [5] and GSH [3]; three collaboration networks DBLP [2], Hollywood [15] and citHepTh [63]; and a road network Traffic [1]. Their sizes are shown in Table 2. We randomly generated a time series to simulate obsolete attributes, at most 70% (it is 80% for IT data of our industry collaborator). We tested obsolete components with random (temporal) queries generated on all datasets.

Table 2.

Contraction ratio (each column: CR or % of contribution to CR with/without obsolete mark)

Graph |V|, |E| ku CR 1st 2nd 3rd Obsolete
Twitter 81K, 1.3M 100 0.176/0.286 7.78/27.7 15.44/50.71 4.29/14.39 69.69/–
LiveJournal 4M, 35M 500 0.378/0.527 11.46/30.3 20.41/51.4 3.74/9.7 60.99/–
LivePokec 1.6M, 22M 500 0.467/0.651 4.46/9.91 35.91/77.76 2.32/4.83 54.4/–
Google 876K, 4.3M 200 0.193/0.294 19.36/51.47 19.33/47.04 0.58/1.49 60.74/–
NotreDame 325K, 1.1M 200 0.274/0.441 23.16/60.64 9.47/26.95 4.56/12.4 62.81/–
GSH 68M, 1.8B 500 0.325/0.493 29.32/77.33 5.31/21.78 0.75/0.89 64.62/–
DBLP 204K, 382K 100 0.14/0.172 36.21/71.65 14.22/28.32 0.02/0.03 49.54/–
Hollywood 1.1M, 56M 500 0.239/0.534 17.36/71.76 6.05/16.46 3.21/11.79 73.38/–
citHepTh 28K, 352K 50 0.26/0.362 21.42/51.93 14.18/36.71 4.6/11.36 59.81/–
Traffic 24M, 29M 500 0.365/0.59 12.37/49.72 9.42/36.74 3.5/13.54 74.7/–

We also generated synthetic graphs with up to 250 M nodes and 2.5 B edges, to test the parallel scalability of the (incremental) contraction algorithms.

Updates. We randomly generated edge updates ΔG, controlled by the size |ΔG| and a ratio ρ of edge insertions to deletions. We kept ρ=1 unless stated otherwise, i.e., the size of G⊕ΔG remains stable. In the same manner, we generated vertex updates ΔG.
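A generator of this kind can be sketched as follows. It is an assumption-laden illustration rather than the generator used in the experiments: the function name and signature are hypothetical, the graph is treated as undirected and simple, deletions are sampled from existing edges and insertions from non-edges, and the batch is shuffled into a random order.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]
Update = Tuple[str, Edge]   # ("+", e) for insertion, ("-", e) for deletion


def generate_edge_updates(nodes: List[int], edges: List[Edge],
                          size: int, rho: float = 1.0,
                          seed: int = 0) -> List[Update]:
    """Generate `size` edge updates with insertion-to-deletion ratio `rho`."""
    rng = random.Random(seed)
    n_ins = round(size * rho / (1.0 + rho))
    n_del = size - n_ins
    existing = {(min(u, v), max(u, v)) for u, v in edges}
    max_edges = len(nodes) * (len(nodes) - 1) // 2
    if n_del > len(existing) or n_ins > max_edges - len(existing):
        raise ValueError("graph too small for the requested updates")

    deletions = rng.sample(sorted(existing), n_del)
    insertions = set()
    while len(insertions) < n_ins:
        u, v = rng.sample(nodes, 2)
        e = (min(u, v), max(u, v))
        if e not in existing:
            insertions.add(e)

    updates = [("+", e) for e in sorted(insertions)]
    updates += [("-", e) for e in deletions]
    rng.shuffle(updates)
    return updates


nodes = list(range(10))
edges = [(i, i + 1) for i in range(9)]          # a path on 10 nodes
updates = generate_edge_updates(nodes, edges, size=6, rho=1.0)
print(updates)
```

With ρ=1 the batch contains as many insertions as deletions, so the number of edges is unchanged after the batch is applied.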

(2) Graph patterns. We implemented a generator for graph pattern queries controlled by three parameters: the number VQ of pattern nodes, the number EQ of pattern edges, and a set LQ of labels for queries Q.

(3) Implementation. We implemented the following algorithms, all in C++. (1) Algorithms SubAc (Sect. 3.1.2), TriAc (Sect. 3.2.2), DisAc (Sect. 3.3.2), CCAc (Sect. 3.4.2) and CDAc (Sect. 3.5.2); VF2c for SubIso, obtained by adapting VF2 [28] to contracted graphs; and PLLc for Dist, obtained by adapting PLL [4] to contracted graphs. (2) Our contraction algorithm GCon (Sect. 2.3) and its parallel version PCon (Sect. 2.4), incremental algorithm IncCR for batch updates and its parallel version IncPC (Sect. 4). (3) The baselines include existing query evaluation algorithms: (a) TurboIso [44] and TurboIsoBoosted [78] with indexing for SubIso, and VF2 [28] without indexing; (b) graph compression DeDense [69] for SubIso; (c) TriA [47] for TriC; (d) Dijkstra without indexing and PLL [4] with indexing for Dist [31]; (e) CCA [85] for CC; and (f) CDA [57] for CD. We did not compare with summarization since it does not support any algorithm to compute exact answers for the five applications.

(4) Experimental environment. The experiments were conducted on a single-processor machine powered by Xeon 3.0 GHz with 64GB memory, running Linux. Since GSH and synthetic graphs ran out of 32 GB memory without contraction, we used a machine with 64 GB memory. For parallel (incremental) contraction, we used 4 machines, each with 12 cores powered by Xeon 3.0 GHz, 32GB RAM, and 10Gbps NIC. Each experiment was run 5 times, and the average is reported here.

Experimental results. We now report our findings.

Exp-1: Effectiveness: Contraction ratio. We first tested the contraction ratio of our contraction scheme, defined as CR=|Gc|/|G|. Note that for each query class Q, CR is the same for all queries in Q. Moreover, all applications on G share the same contracted graph Gc while incorporating different synopses. In addition, we report the impact of each of the first three topological components and obsolete component for each dataset, in the presence and absence of obsolete data.

As remarked in Sect. 2, we limit the nodes of contracted subgraphs within [kl,ku]. We fixed kl=4 and varied ku based on the size of each graph. We considered two settings: (a) when obsolete data are taken into account, with threshold t0=50%tm, where tm denotes the maximum timestamp in each dataset; and (b) when we do not separate obsolete data, i.e., when t0=0. The results are reported in Table 2 for all the real-life graphs (in which each column indicates either CR or percentage of contribution to CR with/without obsolete mark). We can see the following.

(1) When t0=50%tm, CR is on average 0.281, i.e., contraction reduces these graphs by 71.9%. When t0=0, i.e., if obsolete data are not considered, CR is 0.435. These show that real-life graphs can be effectively contracted in the presence and absence of obsolete data. Compared with the results of [38], by considering more regular structures, the contraction scheme improves the contraction ratio CR by 2.49% and 6.90% in the presence and absence of obsolete data, respectively.

(2) When obsolete data are present, the average CR is 0.34, 0.264, 0.213 and 0.365 in social networks, Web graphs, collaboration networks and road networks, respectively. When obsolete data are absent, CR is on average 0.488, 0.409, 0.356 and 0.59. The contraction scheme performs the best on collaboration networks in both settings, since such graphs exhibit evident inhomogeneities and community structures.

(3) When obsolete data are absent, on average the first three regular structures contribute 50.2%, 39.4% and 8.0% to CR, respectively. When obsolete mark is taken into account, their contribution is 18.3%, 14.9% and 2.8%, respectively. This is because nodes from these components may be moved to obsolete components.

(4) We also studied the impact of the contraction order on query evaluation. Topological components have different impacts on different types of graphs, e.g., stars, claws and paths are effective in Traffic, and cliques, stars and butterflies work better than the others in collaboration networks. Taking the order of Table 1 as the baseline, we tested the impact of (a) RE, by reversing the order, and (b) EX, by exchanging between different types of graphs, e.g., we use the order for road networks to contract social graphs. On average the CR of RE and EX is decreased by 9.42% and 7.05%, respectively. As shown in Table 3, the average slowdown of RE and EX is (a) 7.24% and 5.58% for SubIso, (b) 5.55% and 5.46% for TriC, (c) 3.89% and 4.30% for Dist, (d) 7.34% and 34.7% for CC, and (e) 2.38% and 19.1% for CD, respectively. These justify that the order of Table 1 is effective for most applications and most types of graphs. There are also exceptions, e.g., reversing the order for Web graphs improves the efficiency of CD. Recall that we contract stars, cliques and butterflies for Web graphs. For CD in particular, however, cliques play a more important role than the other two (Sect. 3.5); hence, contracting cliques first may work better for CD.
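The averages quoted in items (1) and (2) above can be rechecked directly from the CR column of Table 2. The short script below is only a sanity check over the rounded table entries; the variable names are illustrative. The mean of the rounded entries with obsolete data is about 0.2817, which agrees with the reported 0.281 (a reduction of roughly 72%) up to rounding.

```python
from collections import defaultdict

# (graph type, CR with obsolete mark, CR without), copied from Table 2.
table2_cr = {
    "Twitter":     ("social", 0.176, 0.286),
    "LiveJournal": ("social", 0.378, 0.527),
    "LivePokec":   ("social", 0.467, 0.651),
    "Google":      ("web",    0.193, 0.294),
    "NotreDame":   ("web",    0.274, 0.441),
    "GSH":         ("web",    0.325, 0.493),
    "DBLP":        ("collab", 0.140, 0.172),
    "Hollywood":   ("collab", 0.239, 0.534),
    "citHepTh":    ("collab", 0.260, 0.362),
    "Traffic":     ("road",   0.365, 0.590),
}


def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


overall_with = mean(w for _, w, _ in table2_cr.values())
overall_without = mean(wo for _, _, wo in table2_cr.values())

grouped = defaultdict(list)
for kind, w, wo in table2_cr.values():
    grouped[kind].append((w, wo))
by_type = {kind: (mean(w for w, _ in rows), mean(wo for _, wo in rows))
           for kind, rows in grouped.items()}

print(f"overall CR: {overall_with:.4f} (with), {overall_without:.4f} (without)")
print(f"reduction with obsolete data: {100 * (1 - overall_with):.1f}%")
for kind, (w, wo) in by_type.items():
    print(f"{kind}: {w:.3f} / {wo:.3f}")
```

The per-type means reproduce the figures in item (2), with collaboration networks having the smallest CR in both settings.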

Table 3.

Slowdown (%) by RE and EX orders

Graph SubIso(RE) SubIso(EX) TriC(RE) TriC(EX) Dist(RE) Dist(EX) CC(RE) CC(EX) CD(RE) CD(EX)
Twitter 8.04 3.98 7.41 3.66 5.27 2.86 8.22 19.2 6.73 24.9
LiveJournal 9.46 5.52 8.26 5.09 2.71 5.32 9.03 61.1 5.49 18.3
LivePokec 8.67 6.46 3.15 3.12 4.49 2.48 6.10 51.7 8.23 20.2
Google 5.17 7.54 6.07 3.75 1.02 3.8 7.19 38.7 -4.17 12.5
NotreDame 11.9 5.76 4.20 6.46 5.95 4.93 3.72 44.5 -4.33 15.3
GSH 3.52 6.22 4.59 6.08 2.78 4.15 4.25 32.1 -5.53 16.4
DBLP 2.13 5.53 11.3 14.2 4.38 5.31 18.8 19.6 5.05 34.2
Hollywood 6.32 6.39 2.25 4.73 3.89 5.81 5.75 30.3 3.02 29.3
citHepTh 7.48 3.24 3.98 4.91 2.56 3.23 7.43 35.5 7.92 17.1
Traffic 9.69 5.11 4.29 2.56 5.78 5.11 2.94 14.2 1.39 2.87

Exp-2: Effectiveness: query processing. We next evaluated the speedup of query processing introduced by the contraction scheme, measured by query evaluation time over original and contracted graphs.

Subgraph isomorphism. Varying the size |VQ| of pattern queries from 4 to 7, we tested VF2, TurboIso and TurboIsoBoosted on GSH and Hollywood as G, DeDense [69] on the compressed graph, and SubAc and VF2c on the contracted graph Gc of G. For each query, we output the first 108 matches. As shown in Fig. 10a, b, (1) on average, SubAc on Gc is 1.69, 1.49 and 18.85 times faster than TurboIso, TurboIsoBoosted and DeDense, respectively; (2) VF2c beats DeDense by 9.31 times; (3) VF2c without indices is only 19.1% slower than TurboIso with indices, while TurboIsoBoosted and TurboIso are 10.1 and 8.97 times faster than VF2, respectively; and (4) the speedup is more substantial on collaboration networks, e.g., 2.11 times on Hollywood, because cliques are prevalent in such graphs and are the most effective structure for SubIso due to the high capacity in pruning invalid matches.

Fig. 10. Performance evaluation

Triangle counting. As shown in Fig. 10c, the results for TriC are consistent with the results on subgraph isomorphism: (1) TriAc on the contracted Gc is on average 1.44 times faster than TriA on their original graphs G. (2) The speedup is more evident in collaboration networks: e.g., TriAc on Hollywood is 1.57 times faster than TriA while it is 1.47, 1.45 and 1.28 times on LiveJournal, Google and Traffic, respectively. TriA spends more than 1000 seconds on GSH (hence not shown).

Shortest distance. The results for Dist are consistent with the results on SubIso. As reported in Fig. 10d, DisAc is 1.64 and 1.36 times faster than Dijkstra on GSH and Hollywood, respectively, by reducing search space and employing synopses. PLL could not build indices on GSH within 64GB memory, while PLLc successfully builds indices on (smaller) contracted GSH. On average, PLLc spends 94.2μs to evaluate a query on GSH. On other smaller datasets, in contrast, PLLc is 18% slower than PLL due to overhead on supernodes.

Connected component. As shown in Fig. 10e over LiveJournal, GSH, Hollywood, and Traffic for social graphs, Web graphs, collaboration networks and road networks, respectively, the results for CC are consistent with the results on SubIso and TriC: (1) algorithm CCAc on contracted graph Gc is on average 2.24 times faster than CCA on the original graph G, since CCAc operates on the smaller Gc without decontracting supernodes or superedges. (2) The speedup is more evident in collaboration networks: e.g., CCAc on Hollywood is 2.87 times faster than CCA, since the contraction scheme performs the best on such graphs and the time complexity of CCAc is linear in the size of the contracted graph.

Clique decision. As also shown in Fig. 10f, (1) algorithm CDAc is 1.32, 1.54, 1.52 and 1.08 times faster than CDA on LiveJournal, Hollywood, GSH and Traffic, respectively, by using synopses to start with an initial maximum clique that may find a k-clique directly. (2) The speedup is less evident in road networks. For road networks, the contraction scheme contracts stars, claws and paths into supernodes; hence, we can only find a 2-clique (an edge) as the initial maximum clique by using synopses, which is trivial and useless.

The results on the other graphs are consistent.

Temporal queries. Fixing pattern size |Q| = 4 and varying timestamp t in temporal queries from 30%tm to 70%tm, we tested SubIsot, TriCt, Distt, CCt and CDt. As shown in Fig. 10g–k on LiveJournal, (1) SubAc is on average 1.81 and 1.77 times faster than TurboIsoBoosted and TurboIso, respectively; VF2c outperforms VF2 by 7.83 times. (2) The average speedup for TriC, Dist, CC and CD is 1.58, 2.31, 1.66 and 1.31 times, respectively. (3) The speedup is larger for temporal queries than for conventional ones since temporal information maintained in synopsis provides additional capacity to skip more supernodes, as expected. (4) It is more substantial for larger t on SubIsot.

The results verify that our contraction scheme (a) is generic and speeds up evaluation for all five applications; (b) can be used together with existing algorithms, with indexing (e.g., TurboIso and PLL) or not (e.g., VF2c and Dijkstra); and (c) is effective by separating up-to-date data from obsolete data.

We remark that our contraction scheme aims to make a generic optimization for multiple applications to run on the same graph at the same time. When a new application is considered, adding a specific synopsis suffices for our scheme. In contrast, a separate indexing structure has to be built for indexing approaches. Better still, it is much easier to develop synopses than indices. Moreover, existing indexing structures can be inherited by contracted graphs, to improve performance from contraction in addition to from indexing.

Exp-3: Impact of each component. We next evaluated the impact of contracting each of the topological components identified in Sect. 2.2.

Impact of topological components. Based on Table 1, we took contraction of the first three types of regular structures as the baseline, and tested the impact of each component on the efficiency of query answering by disabling it, using all the ten real-life datasets.

As shown in Table 4, the average slowdown in evaluation time by disabling each of the first three structures is (a) 37.6%, 22.7% and 3.02% for SubIso, (b) 70.3%, 36.3% and 2.0% for TriC, (c) 28.4%, 28.8% and 5.2% for Dist, (d) 141.0%, 118.4% and 9.9% for CC, and (e) 27.8%, 15.9% and 0.7% for CD, respectively. In particular, the impact of each regular structure is mostly consistent with the contraction order. This said, for specific application and graphs, the impact of each regular structure may be slightly different. For CD on Web graphs, the average slowdown in evaluation time by disabling the first structure (star) and the second structure (clique) is 4.1% and 43.5%, respectively, since cliques dominate the effectiveness of the synopses for CD.

Table 4.

Slowdown (%) by disabling each topological component

Graph SubIso(1st) SubIso(2nd) SubIso(3rd) TriC(1st) TriC(2nd) TriC(3rd) Dist(1st) Dist(2nd) Dist(3rd) CC(1st) CC(2nd) CC(3rd) CD(1st) CD(2nd) CD(3rd)
Twitter 45.8 10.9 4.7 16.4 19.1 2.1 28.2 28.7 5.7 85.1 141.6 22.8 42.7 4.1 0.3
LiveJournal 46.3 16.7 3.0 17.5 3.9 1.4 44.3 13.2 7.1 68.4 95.5 14.5 27.5 3.3 0.9
LivePokec 45.5 13.5 2.1 5.5 22.1 0.7 29.5 23.6 4.4 18.0 69.5 11.3 39.0 5.2 1.3
GSH 11.7 32.2 0.4 5.4 18.2 1.1 15.9 33.1 0.7 41.7 10.8 0.4 4.9 52.2 0.2
Google 19.6 40.6 2.5 8.7 20.3 2.9 18.3 44.6 5.8 107.1 70.8 5.6 5.4 57.8 2.7
NotreDame 15.2 42.3 3.3 29.5 41.2 0.4 27.7 47.8 4.9 55.4 50.2 8.0 2.1 20.6 0.5
DBLP 66.6 17.0 0.8 572.1 216.6 1.7 23.2 29.5 0.4 631.7 450.2 0.1 65.1 7.9 0.1
Hollywood 40.3 13.4 5.1 22.6 10.9 1.5 24.0 26.3 5.4 80.1 64.3 5.9 51.7 3.8 0.2
citHepTh 54.5 15.7 2.4 15.4 7.2 0.5 32.3 22.6 7.3 280.7 222.4 25.7 35.5 1.7 0.6
Traffic 30.1 24.3 5.7 10.1 3.5 9.4 40.2 18.7 10.6 41.7 8.7 5.2 4.3 2.5 0.3

Impact of obsolete components. We tested the impact of contracting obsolete components on the efficiency of answering conventional queries. Fixing |Q| = 4 and varying x for timestamp threshold such that t0=x%tm, Fig. 10i–p reports the results of SubIso, TriC, Dist, CC and CD on LiveJournal, respectively. We find that (1) the speedup is bigger for larger t0 when t0≤70%, i.e., more nodes are contracted into obsolete components; (2) obsolete components speed up SubIso, TriC, Dist, CC and CD by 1.56, 1.53, 1.39, 2.49 and 1.33 times, respectively; and (3) the speedup for SubIso and CD gets smaller when t0≥80% due to the overhead of decontracting obsolete components. The results are consistent for Dist, TriC and CC, except that their speedup does not go down when t0 gets larger since they do not need to decontract obsolete components.

Impact of kl and ku. We also tested the impact of kl and ku on the contraction ratio CR and efficiency. As remarked in Sect. 2.3, diamonds, butterflies and claws have a fixed size, while cliques, stars and paths vary. Fixing ku=500 (resp. kl=4) and varying kl (resp. ku) from 2 to 6 (resp. 20 to 1000), Fig. 10q (resp. Fig. 10r) reports the CR on LiveJournal, Hollywood, GSH and Traffic, respectively. As shown there, CR decreases when kl decreases or ku increases. Similarly, Fig. 10s (resp. Fig. 10t) reports the speedup of SubAc, TriAc, DisAc, CCAc and CDAc on Hollywood. Query evaluation is slowed down when kl≤3 or ku≥500 for all algorithms except CCAc and TriAc due to excessive superedge decontractions or overlarge components. Recall that CCAc decontracts neither supernodes nor superedges, and TriAc precalculates triangles in both topological components and obsolete parts; hence, it prefers large ku. We find that the best kl and ku for the datasets tested are around 4 and 500, respectively.

The results on the other graphs are consistent.

Exp-4: Space cost. We next studied the space cost of our contraction scheme compared with indexing cost. We consider six algorithms: SubAc, TriAc, DisAc, CCAc, CDAc and PLLc. The space cost includes the sizes of the contracted graph |Gc|, decontraction function |fD| and the sizes of synopses; as shown in Sect. 3, SubAc, TriAc, DisAc, CCAc and CDAc do not need to decontract topological components; thus, we only uploaded fD for obsolete components and superedges into memory. In particular, CCAc requires no decontraction (Theorem 1) and thus incurs no cost for storing fD at all. We compared the space cost with the indices used by TurboIso, HINDEX [75], PLL [4] and RMC [68].

Table 5 shows how the space cost increases when more applications run on Google (i.e., graph G). We find the following. (1) Our contraction scheme takes 1.62GB in total for SubIso, TriC, Dist, CC and CD, much smaller than the 12.9GB taken by TurboIso, PLL, HINDEX and RMC. (2) With the contraction scheme, graph G is no longer needed. That is, compared to G, the scheme uses 0.89GB additional space for the supernodes/edges in Gc and synopses for all five applications. It trades affordable space for speedup. (3) Synopses SSubIso, STriC, SDist, SCC and SCD take 48.3% of the total space of contraction, i.e., Gc and fD dominate the space cost, which are shared by all applications. Hence, the more applications are supported, the more substantial the improvement in the contraction scheme is over indices.

Table 5.

Total space cost of applications run on Google

Application | Contraction: detail | Contraction: space | Indexing: detail | Indexing: space
Shared parts | Gc, fD | 837MB | G | 727MB
+SubIso | SSubIso | 848MB | TurboIso | 1.07GB
+TriC | STriC | 874MB | +HINDEX | 2.1GB
+Dist | SDist | 1.51GB | +PLL | 9.58GB
+CC | – | 1.51GB | – | 9.58GB
+CD | SCD | 1.62GB | +RMC | 12.9GB
+kNN | SkNN | 1.75GB | +Antipole | 19.4GB

To inherit the indexing structures of [44] and PLL, we use 1.14GB additional space to build a compact index for PLLc and on average 26MB for SubAc on Google, in addition to synopses SDist and SSubIso.

To verify the scalability with applications, we further adapted existing algorithms for k-nearest neighbors (kNN) [92]. The total space cost of the scheme for the six applications is 1.75GB, i.e., 18.1% increment for each. It accounts for only 9.0% of the indices for TurboIso, PLL, HINDEX, RMC and Antipole [22] of kNN.

Exp-5: Efficiency of (incremental) contraction. We next evaluated the efficiency of contraction algorithm GCon and incremental contraction algorithm IncCR. We also studied the impact of the order and varied rates of updates on incremental IncCR.

Efficiency of GCon. We first report the efficiency of GCon. As shown in Fig. 11a–d on LiveJournal, Hollywood, GSH and Traffic, respectively, (1) on average GCon takes 109.7s to contract the graph, without the time of the computation for synopses. (2) It takes on average 4.13s, 21.2s, 18.1s, 0s and 3.38s only to compute the synopses for SubIso, TriC, Dist, CC and CD, respectively; i.e., computing synopses of the five only takes on average 37.3% of the time of GCon. Recall that the synopses for SubIso suffice for us to answer CC queries; hence, it is unnecessary to compute synopses for CC.

Fig. 11. Efficiency of (incremental) contraction

Efficiency of IncCR. We tested the efficiency of IncCR, by varying |ΔG| from 5%|G| to 35%|G|. As shown in Fig. 11e–h on LiveJournal, Hollywood, GSH and Traffic, respectively, (1) on average IncCR is 2.1 times faster than GCon, up to 6.3 times when |ΔG|=5%|G|. It takes on average 26.6% time to update the synopses for 5% updates on the five applications. (2) IncCR beats GCon even when |ΔG| is up to 30%|G|. This justifies the need for incremental contraction. (3) IncCR is sensitive to |ΔG|; it takes longer for larger |ΔG|.

Impact of update order. We tested the impact of the orders of edge insertions and deletions in ΔG on IncCR. Fixing |ΔG|=10%, we varied the order of updates by (1) random (RO), (2) insertion-first (IF) and (3) deletion-first (DF). On average we find that RO, IF and DF have a performance difference less than 3.5% on Hollywood. That is, IncCR is stable on batch updates, regardless of the order on the updates. Similarly, we find that RO, IF and DF have a performance difference less than 3.7% on Hollywood for vertex updates.

Impact of update rates. We also tested the efficiency of IncCR against real-time updates, measured by the updates coming in 1s intervals, i.e., |ΔG|/s. Varying |ΔG|/s from 0.2%|G|/s to 1%|G|/s, Fig. 11i shows the following on LiveJournal. (1) On average it takes 0.88s to update contracted graphs, i.e., IncCR is able to efficiently maintain the contracted graphs in real life. (2) The update time is less than 1s even when the updates are up to 0.8%|G|. IncCR can handle 0.8%|G|/s of “burst” updates on a graph with 40M nodes and edges.

The results are consistent on the other graphs.

Exp-6: Scalability. Finally, we evaluated (1) the scalability of our contraction algorithm GCon with graph size |G|, (2) the parallel scalability of algorithm PCon and IncPC with the number of cores.

Scalability on |G|. Varying the size |G|=(|V|,|E|) of synthetic graphs from (50M, 0.5B) to (250M, 2.5B), we tested the scalability of GCon using a single machine. As shown in Fig. 11j, GCon scales well with |G|. It takes 1325s when graph G has 2.75B nodes and edges.

Scalability of PCon and IncPC. Fixing |ΔG|=10%|G|, we tested the scalability of parallel PCon and IncPC with the number k of cores. As shown in Fig. 11k and l on GSH, (1) PCon scales well with k: it is 10.1 times faster when using k=20 cores versus k=1 (single core), and it is 4.3 times faster when k varies from 4 to 20. (2) IncPC is on average 1.9 times faster than PCon. (3) IncPC scales well with k; it is 3.7 times faster when k varies from 4 to 20, across 4 machines.

The results on other graphs are consistent.

Summary. We find the following over 10 real-life graphs. On average, (1) the contraction scheme reduces graphs by 71.9%. The contraction ratio is 0.34, 0.264, 0.213 and 0.365 in social networks, Web graphs, collaboration networks and road networks, respectively. (2) It improves the evaluation of SubIso, TriC, Dist, CC and CD by 1.69, 1.44, 1.47, 2.24 and 1.37 times, respectively. Existing algorithms can be adapted to the scheme, with indices or not. (3) On average, contracting the first three types of regular structures improves the efficiency of query evaluation by 1.61, 1.44 and 1.04 times, respectively. (4) Contracting obsolete data improves the efficiency of both conventional queries and temporal queries, by 1.64 and 1.78 times on average, respectively. (5) Its total space cost on SubIso, TriC, Dist, CC and CD is only 12.7% of indexing costs of TurboIso, PLL, HINDEX and RMC. The synopses for the five query classes take only 48.3% of the total space of the contraction scheme. Thus, our contraction scheme scales with the number of applications. (6) Algorithms GCon, PCon, IncCR and IncPC scale well with graphs and updates. GCon takes 344s when G has 1.8B edges and nodes, and PCon takes only 33.1s with 20 cores, across 4 machines. IncCR is 4.9 times faster than GCon when |ΔG| is 5%|G|, and is still faster when |ΔG| is up to 30%|G|. (7) PCon and IncPC scale well with the number k of cores. When |ΔG|=10%|G|, PCon is 4.3 times faster and IncPC is 3.7 times faster when k varies from 4 to 20.

Related work

This paper extends its conference version [38] as follows. (1) We identify a variety of frequent regular structures in different types of graphs, develop their synopses and contract graphs based on their types (Sect. 2.2). In contrast, [38] adopts a one-size-fits-all solution and contracts only cliques, paths and stars for all types of graphs. (2) In light of new regular structures, all examples and algorithms have been extended (Sects. 2–4). (3) We provide the pseudo code and details of a parallel contraction algorithm (Sect. 2.4). (4) We study two new query classes, namely, (non-local) connected component and (intractable) clique decision, for proof of concept (Sects. 3.4 and 3.5). We also extend the algorithms for the three other cases to cope with newly studied topological components (Sects. 3.1–3.3). (5) We extend the study of incremental contraction by presenting vertex updates and a parallel incremental maintenance algorithm (Sect. 4). (6) The experimental study is almost entirely new and evaluates the contraction scheme w.r.t. different regular structures to contract as well as its effectiveness on new big graphs and new query classes of Sects. 3.4 and 3.5 (Sect. 5).

We discuss the other related work as follows.

Contraction. As a traditional graph programming technique [43], node contraction merges nodes, and subgraph contraction replaces connected subgraphs with supernodes. It is used in, e.g., single-source shortest paths [54], connectivity [43] and spanning trees [41].

In contrast, we extend the conventional contraction with synopses to build a compact representation of graphs as a generic optimization scheme, which is a departure from the programming techniques.
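To make the notion concrete, the following is a minimal Python sketch of subgraph contraction with a synopsis. It rests on simplifying assumptions that are ours and not the paper's: graphs are undirected and stored as adjacency sets, and the synopsis is reduced to the members, internal edges and boundary edges of the contracted subgraph, i.e., just enough to decontract the supernode. The names contract and decontract and the synopsis fields are hypothetical; the synopses of the scheme are defined per query class and carry richer features.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def contract(graph: Graph, members: Iterable[Hashable],
             supernode: Hashable) -> Tuple[Graph, dict]:
    """Replace the nodes in `members` by one supernode.

    Returns the contracted graph and a synopsis that records what was
    hidden, so that the step can be undone (illustrative layout only).
    """
    members = set(members)
    if supernode in graph:
        raise ValueError("supernode id already used in the graph")
    # Edges inside the contracted part, stored in both directions.
    internal = {(u, v) for u in members for v in graph[u] if v in members}
    # Edges from a member to a node outside the contracted part.
    boundary = {(u, v) for u in members for v in graph[u] if v not in members}
    contracted = {
        u: {supernode if v in members else v for v in nbrs}
        for u, nbrs in graph.items() if u not in members
    }
    contracted[supernode] = {v for _, v in boundary}
    synopsis = {"members": members, "internal": internal, "boundary": boundary}
    return contracted, synopsis


def decontract(contracted: Graph, supernode: Hashable, synopsis: dict) -> Graph:
    """Restore the original graph from the contracted graph and the synopsis."""
    graph = {
        u: {v for v in nbrs if v != supernode}
        for u, nbrs in contracted.items() if u != supernode
    }
    for u in synopsis["members"]:
        graph[u] = set()
    for u, v in synopsis["internal"]:
        graph[u].add(v)
    for u, v in synopsis["boundary"]:
        graph[u].add(v)
        graph[v].add(u)
    return graph


if __name__ == "__main__":
    g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
    gc, syn = contract(g, {"a", "b", "c"}, "S")
    print(gc)                               # two nodes: supernode S and d
    print(decontract(gc, "S", syn) == g)    # round trip restores g
```

In this sketch, losslessness amounts to the round trip contract followed by decontract returning the input graph; a query that only needs the features stored in the synopsis would not have to call decontract at all.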

Compression. Graph compression has been studied for social network analysis [27], community queries [21], subgraph isomorphism [34, 69], graph simulation [37], reachability and shortest distance [50], and GPU-based graph traversal [82]. It often computes query-specific equivalence relations by merging equivalent nodes into a single node or replacing frequent patterns by virtual nodes. Some are query preserving (lossless), e.g., [37, 50, 69], and can answer certain types of queries on compressed graphs without decompression.
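As an illustration of merging equivalent nodes, the sketch below uses one very simple equivalence relation chosen by us for exposition: two nodes of a directed graph are equivalent if they have exactly the same set of out-neighbors. It is not the relation of any of the cited methods, and the function name merge_equivalent_nodes is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple

DiGraph = Dict[Hashable, Set[Hashable]]  # node -> set of out-neighbors


def merge_equivalent_nodes(adj: DiGraph) -> Tuple[Dict[int, Set[int]], Dict[Hashable, int]]:
    """Merge nodes that share the same out-neighbor set.

    Returns the compressed graph over class ids and the node-to-class map.
    """
    # Group nodes by their out-neighbor set (the equivalence relation).
    classes = defaultdict(list)
    for node, nbrs in adj.items():
        classes[frozenset(nbrs)].append(node)

    node_to_class: Dict[Hashable, int] = {}
    for cid, nodes in enumerate(classes.values()):
        for node in nodes:
            node_to_class[node] = cid

    # Each class becomes one node; edges are lifted to class ids.
    compressed: Dict[int, Set[int]] = {}
    for node, nbrs in adj.items():
        targets = compressed.setdefault(node_to_class[node], set())
        for v in nbrs:
            targets.add(node_to_class[v])
    return compressed, node_to_class


if __name__ == "__main__":
    g = {"x": {"a", "b"}, "y": {"a", "b"}, "a": set(), "b": set()}
    compressed, mapping = merge_equivalent_nodes(g)
    print(len(g), "nodes merged into", len(compressed), "classes")
```

The relation is fixed in advance by the kind of query to be preserved, which is why such structures are query specific: a different query class would in general call for a different relation and hence a different compressed graph.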

Another category of compression aims to minimize the number of bits required to represent a graph. WebGraph [15] exploits the inner redundancies of Web graphs; [8] proposes an encoding scheme based on node indices assigned by the BFS order; [24] approximates the optimal encoding with MinHash; and [52] removes the hub nodes so that the encoding scheme has better locality.

Our contraction scheme differs from graph compression in the following. (a) It optimizes performance of multiple applications with the same contracted graph. In contrast, many compression schemes are query dependent and require different structures for different query classes. While some methods serve generic queries [8, 15, 24], they may incur heavy recovering cost. (b) Contraction is lossless, while some compression schemes are lossy, e.g., [34]. (c) For a number of query classes, their existing algorithms can be readily adapted to contracted graphs, while compression often requires developing new algorithms, e.g., [69] demands a decompose-and-join algorithm for subgraph isomorphism.

Summarization. Graph summarization aims to produce an abstraction or summary of a large graph by aggregating nodes or subgraphs (see [67] for a survey), classified as follows. (1) Node aggregation, e.g., GraSS [60] merges node clusters into supernodes labeled with the number of edges within and between the clusters; it is developed for adjacency, degree and centrality queries. SNAP [87] generates an approximate summary of a graph structure by aggregating nodes based on attribute similarity. (2) Edge aggregation, e.g., [73] generates a summary by aggregating edges, with a bounded number of edges different from the original graph. (3) Simplification: instead of aggregating nodes and edges, OntoVis [83] drops low-degree nodes, duplicate paths and unimportant labels. Most summarization methods are lossy, e.g., GraSS and SNAP only retain part of attributes, and OntoVis drops nodes, edges and labels.

Incremental maintenance of summarization has been studied [30, 46, 84]. It depends on update intervals [84]; short-period summarization is space-costly, while long-interval summarization may miss updates. To handle these, [46] aggregates updates into a graph of “frequent” nodes and edges and computes a summary based on all historical updates on the entire graph.

Both summarization and contraction schemes aim to provide a generic graph representation to speed up graph analyses. However, contraction differs from summarization in the following. (1) The contraction scheme is lossless and returns exact answers for various classes of queries. In contrast, summarization is typically lossy and supports at best certain aggregate or approximate queries only. (2) Many existing algorithms for query answering can be readily adapted to contracted graphs, while new algorithms often have to be developed on top of graph summaries. (3) For a number of query classes studied, contracted graphs can be incrementally maintained with boundedness and locality, while summarization maintenance requires historical updates and often operates on the entire graph [46].

Indexing. Indices have been studied for, e.g., subgraph isomorphism [13, 14, 28, 44, 72], reachability [7, 23, 50, 95] and shortest distance [25, 66]. They are query specific, and take space and time to store and maintain.

Our contraction scheme differs from indexing as it supports multiple applications on the same contracted graph, while a separate indexing structure has to be built for each query class. Moreover, it is more efficient to maintain contracted graphs than indices. This said, the contraction scheme can be complemented with indices for further speedup, by building indices on smaller contracted graphs, as demonstrated in Sect. 3.1.

Conclusion

We have proposed a contraction scheme to make big graphs small, as a generic optimization scheme for multiple applications to run on the same graph at the same time. We have shown that the scheme is generic and lossless. Moreover, it prioritizes up-to-date data by separating it from obsolete data. In addition, existing query evaluation algorithms can be readily adapted to compute exact answers, often without decontracting topological components. Our experimental results have verified that the contraction scheme is effective.
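As a small illustration of answering a query exactly without decontraction, consider connected components. The sketch below assumes, in line with subgraph contraction as described above, that every supernode replaces a connected subgraph; the components of the original graph can then be read off the contracted graph by expanding each supernode into its member set, without inspecting any contracted edge. The function names and the members_of map are hypothetical, and the code is a plain BFS written for exposition, not the CC algorithm evaluated in the paper.

```python
from collections import deque
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def connected_components(adj: Graph) -> List[Set[Hashable]]:
    """Standard BFS labeling of connected components."""
    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


def components_via_contraction(contracted: Graph,
                               members_of: Dict[Hashable, Set[Hashable]]) -> List[Set[Hashable]]:
    """Components of the original graph, computed on the contracted graph.

    `members_of` maps each supernode to the original nodes it replaced
    (assumed to have induced a connected subgraph).
    """
    result = []
    for component in connected_components(contracted):
        expanded: Set[Hashable] = set()
        for node in component:
            # Ordinary nodes stand for themselves; supernodes are expanded.
            expanded |= members_of.get(node, {node})
        result.append(expanded)
    return result


if __name__ == "__main__":
    contracted = {"S": {"d"}, "d": {"S"}, "T": set()}
    members_of = {"S": {"a", "b", "c"}, "T": {"e", "f"}}
    for comp in components_via_contraction(contracted, members_of):
        print(sorted(comp))
```

The BFS runs over the three nodes of the contracted graph rather than the six nodes of the original one, and the answer is exact because contracting a connected subgraph neither joins nor splits components.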

A topic for future work is to build a hierarchy of contracted graphs by iteratively contracting regular structures into supernodes, until the one at the top fits into the memory; the objective is to make large graphs small enough to fit into the memory of a single machine, and make it possible to process large graphs under limited resources. Another topic is to study the capacity of a single multi-core machine for big graph analytics, by leveraging both contraction and multi-core parallelism.

Acknowledgements

Fan, Li and Liu are supported in part by ERC 652976 and Royal Society Wolfson Research Merit Award WRM/R1/180014. Liu is also supported in part by EPSRC EP/L01503X/1, EPSRC CDT in Pervasive Parallelism at the University of Edinburgh. Lu is supported in part by NSFC 62002236.

Contributor Information

Wenfei Fan, Email: wenfei@inf.ed.ac.uk.

Yuanhao Li, Email: yuanhao.li@ed.ac.uk.

Muyang Liu, Email: muyang.liu@ed.ac.uk.

References

  • 1.Traffic. http://www.dis.uniroma1.it/challenge9/download.html (2006)
  • 2.DBLP. https://snap.stanford.edu/data/com-DBLP.html (2012)
  • 3.Gsh host. http://law.di.unimi.it/webdata/gsh-2015-host (2015)
  • 4.Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: SIGMOD (2013)
  • 5.Albert, R., Jeong, H., Barabási, A.: The diameter of the World Wide Web. CoRR cond-mat/9907038 (1999)
  • 6.Angles, R., Arenas, M., Barceló, P., Boncz, P.A., Fletcher, G.H.L., Gutierrez, C., Lindaaker, T., Paradies, M., Plantikow, S., Sequeda, J.F., van Rest, O., Voigt, H.: G-CORE: A core for future graph query languages. In: SIGMOD, pp. 1421–1432 (2018)
  • 7.Anirban, S., Wang, J., Islam, M.S.: Multi-level graph compression for fast reachability detection. In: DASFAA (2019)
  • 8.Apostolico A, Drovandi G. Graph compression by BFS. Algorithms. 2009;2(3):1031–1044. doi: 10.3390/a2031031.
  • 9.Backstrom, L., Huttenlocher, D., Kleinberg, J., Lan, X.: Group formation in large social networks: membership, growth, and evolution. In: SIGKDD, pp. 44–54 (2006)
  • 10.Bae SH, Halperin D, West JD, Rosvall M, Howe B. Scalable and efficient flow-based community detection for large-scale graph analysis. TKDD. 2017;11(3):1–30. doi: 10.1145/2992785.
  • 11.Berry, N., Ko, T., Moy, T., Smrcka, J., Turnley, J., Wu, B.: Emergent clique formation in terrorist recruitment. In: AAAI Workshop on Agent Organizations (2004)
  • 12.Besta, M., Hoefler, T.: Survey and taxonomy of lossless graph compression and space-efficient graph representations. CoRR arXiv:1806.01799 (2018)
  • 13.Bhattarai, B., Liu, H., Huang, H.H.: CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching. In: SIGMOD (2019)
  • 14.Bi, F., Chang, L., Lin, X., Qin, L., Zhang, W.: Efficient subgraph matching by postponing cartesian products. In: SIGMOD (2016)
  • 15.Boldi, P., Vigna, S.: The WebGraph framework I: Compression techniques. In: WWW, pp. 595–602 (2004)
  • 16.Bonacich P. Power and centrality: a family of measures. Am. J. Sociol. 1987;92(5):1170–1182. doi: 10.1086/228631.
  • 17.Bourse, F., Lelarge, M., Vojnovic, M.: Balanced graph edge partition. In: SIGKDD, pp. 1456–1465 (2014)
  • 18.Brandes U. A faster algorithm for betweenness centrality. J. Math. Sociol. 2001;25(2):163–177. doi: 10.1080/0022250X.2001.9990249.
  • 19.Bringmann, B., Nijssen, S.: What is frequent in a single graph? In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 858–863 (2008)
  • 20.Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. CACM. 1973;16(9):575–577. doi: 10.1145/362342.362367.
  • 21.Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: WSDM, pp. 95–106 (2008)
  • 22.Cantone D, Ferro A, Pulvirenti A, Recupero DR, Shasha D. Antipole tree indexing to support range search and k-nearest neighbor search in metric spaces. TKDE. 2005;17(4):535–550.
  • 23.Cheng, J., Huang, S., Wu, H., Fu, A.W.C.: TF-label: a topological-folding labeling scheme for reachability querying in a large graph. In: SIGMOD (2013)
  • 24.Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: SIGKDD, pp. 219–228 (2009)
  • 25.Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. SICOMP 32(5) (2003)
  • 26.Cohen, J.: Trusses: Cohesive subgraphs for social network analysis. Natl. Secur. Agency Tech. Rep. 16(3.1) (2008)
  • 27.Cohen, S.: Data management for social networking. In: SIGMOD (2016)
  • 28.Cordella LP, Foggia P, Sansone C, Vento M. A (sub) graph isomorphism algorithm for matching large graphs. TPAMI. 2004;26(10):1367–1372. doi: 10.1109/TPAMI.2004.75.
  • 29.Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to algorithms. MIT press (2009)
  • 30.Cortes, C., Pregibon, D., Volinsky, C.: Communities of interest. In: IDA (2001)
  • 31.Dijkstra, E.W., et al.: A note on two problems in connexion with graphs. Numer. Math. 1(1) (1959)
  • 32.Dominguez-Sal, D., Martinez-Bazan, N., Muntes-Mulero, V., Baleta, P., Larriba-Pey, J.L.: A discussion on the design of graph database benchmarks. In: TPCTC, pp. 25–40 (2010)
  • 33.Elseidy M, Abdelhamid E, Skiadopoulos S, Kalnis P. GRAMI: frequent subgraph and pattern mining in a single large graph. PVLDB. 2014;7(7):517–528.
  • 34.Fairey, J., Holder, L.: Stariso: Graph isomorphism through lossy compression. In: DCC (2016)
  • 35.Fan, W., Hu, C., Tian, C.: Incremental graph computations: Doable and undoable. In: SIGMOD (2017)
  • 36.Fan, W., Jin, R., Liu, M., Lu, P., Tian, C., Zhou, J.: Capturing associations in graphs. PVLDB 13(11) (2020)
  • 37.Fan, W., Li, J., Wang, X., Wu, Y.: Query preserving graph compression. In: SIGMOD (2012)
  • 38.Fan, W., Li, Y., Liu, M., Lu, C.: Making graphs compact by lossless contraction. In: SIGMOD (2021)
  • 39.Fan, W., Wu, Y., Xu, J.: Functional dependencies for graphs. In: SIGMOD (2016)
  • 40.Francis, N., Green, A., Guagliardo, P., Libkin, L., Lindaaker, T., Marsault, V., Plantikow, S., Rydberg, M., Selmer, P., Taylor, A.: Cypher: An evolving query language for property graphs. In: SIGMOD (2018)
  • 41.Gabow, H.N., Galil, Z., Spencer, T.H.: Efficient implementation of graph algorithms using contraction. In: FOCS (1984)
  • 42.Garey M, Johnson D. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W. H. Freeman and Company; 1979.
  • 43.Gross J, Yellen J. Graph Theory and its Applications. Boca Raton: CRC Press; 1998.
  • 44.Han, W.S., Lee, J., Lee, J.H.: Turboiso: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD (2013)
  • 45.He, L., Chao, Y., Suzuki, K., Wu, K.: Fast connected-component labeling. Pattern Recogn. 42(9) (2009)
  • 46.Hill S, Agarwal DK, Bell R, Volinsky C. Building an effective representation for dynamic networks. J. Comput. Graph. Stat. 2006;15(3):584–608. doi: 10.1198/106186006X139162.
  • 47.Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD (2013)
  • 48.Itai A, Rodeh M. Finding a minimum circuit in a graph. SICOMP. 1978;7(4):413–423. doi: 10.1137/0207033.
  • 49.Jaakkola, M.S.T., Szummer, M.: Partially labeled classification with markov random walks. NIPS 14 (2002)
  • 50.Jin, R., Xiang, Y., Ruan, N., Wang, H.: Efficiently answering reachability queries on very large directed graphs. In: SIGMOD (2008)
  • 51.Johnson AE, Pollard TJ, Shen L, Li-Wei HL, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Sci. Data. 2016;3(1):1–9. doi: 10.1038/sdata.2016.35.
  • 52.Kang, U., Faloutsos, C.: Beyond ‘caveman communities’: Hubs and spokes for graph compression and mining. In: ICDM, pp. 300–309 (2011)
  • 53.Kang, U., McGlohon, M., Akoglu, L., Faloutsos, C.: Patterns on the connected components of terabyte-scale graphs. In: ICDM, pp. 875–880 (2010)
  • 54.Karimi, R., Koppelman, D.M., Michael, C.J.: GPU road network graph contraction and SSSP query. In: ICS (2019)
  • 55.Karypis G, Kumar V. Multilevel k-way partitioning scheme for irregular graphs. JPDC. 1998;48(1):96–129.
  • 56.Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: SIGKDD, pp. 137–146 (2003)
  • 57.Koch I. Enumerating all connected maximal common subgraphs in two graphs. TCS. 2001;250(1–2):1–30. doi: 10.1016/S0304-3975(00)00286-3.
  • 58.Kropatsch, W.: Building irregular pyramids by dual-graph contraction. In: Vision Image and Signal Processing (1996)
  • 59.Lappas, T., Liu, K., Terzi, E.: Finding a team of experts in social networks. In: KDD (2009)
  • 60.LeFevre, K., Terzi, E.: Grass: Graph structure summarization. In: SDM (2010)
  • 61.Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web. 2015;6(2):167–195. doi: 10.3233/SW-140134.
  • 62.Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. In: WWW, pp. 641–650 (2010)
  • 63.Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: SIGKDD (2005)
  • 64.Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR arXiv:0810.1355 (2008)
  • 65.Leung, K., Leckie, C.: Unsupervised anomaly detection in network intrusion detection using clusters. In: ACSW (2005)
  • 66.Liang, Y., Zhao, P.: Similarity search in graph databases: a multi-layered indexing approach. In: ICDE (2017)
  • 67.Liu Y, Safavi T, Dighe A, Koutra D. Graph summarization methods and applications: A survey. ACM Comput. Surv. 2018;51(3):62:1–62:34.
  • 68.Lu, C., Yu, J.X., Wei, H., Zhang, Y.: Finding the maximum clique in massive graphs. PVLDB 10(11) (2017)
  • 69.Maccioni, A., Abadi, D.J.: Scalable pattern matching over compressed graphs via dedensification. In: SIGKDD (2016)
  • 70.McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: NIPS (2012)
  • 71.Miller GA. WordNet: a lexical database for English. Commun. ACM. 1995;38(11):39–41. doi: 10.1145/219717.219748.
  • 72.Myoungji, H., Hyunjoon, K., Geonmo, G., Kunsoo, P., Wook-Shin, H.: Efficient subgraph matching: harmonizing dynamic programming, adaptive matching order, and failing set together. In: SIGMOD (2019)
  • 73.Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In: SIGMOD (2008)
  • 74.Newman ME, Watts DJ, Strogatz SH. Random graph models of social networks. PNAS. 2002;99(suppl 1):2566–2572. doi: 10.1073/pnas.012582999.
  • 75.Pandey, S., Li, X.S., Buluc, A., Xu, J., Liu, H.: H-index: Hash-indexing for parallel triangle counting on GPUs. In: HPCS, pp. 1–7 (2019)
  • 76.Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Min. Knowl. Discov. 24 (2012)
  • 77.Ramalingam G, Reps T. On the computational complexity of dynamic graph problems. TCS. 1996;158(1–2):233–277. doi: 10.1016/0304-3975(95)00079-8.
  • 78.Ren X, Wang J. Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs. PVLDB. 2015;8(5):617–628.
  • 79.van Rest, O., Hong, S., Kim, J., Meng, X., Chafi, H.: PGQL: A property graph query language. In: GRADES (2016)
  • 80.Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: AAAI (2015)
  • 81.Sakr S, Al-Naymat G. Graph indexing and querying: a review. IJWIS. 2010;6(2):101–120. doi: 10.1108/17440081011053104.
  • 82.Sha, M., Li, Y., Tan, K.: Gpu-based graph traversal on compressed graphs. In: SIGMOD, pp. 775–792 (2019)
  • 83.Shen Z, Ma KL, Eliassi-Rad T. Visual analysis of large heterogeneous social networks by semantic and structural abstraction. TVCG. 2006;12(6):1427–1439. doi: 10.1109/TVCG.2006.107.
  • 84.Soundarajan, S., Tamersoy, A., Khalil, E.B., Eliassi-Rad, T., Chau, D.H., Gallagher, B., Roundy, K.: Generating graph snapshots from streaming edge data. In: WWW (2016)
  • 85.Tarjan R. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972;1(2):146–160. doi: 10.1137/0201010.
  • 86.Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From “think like a vertex” to “think like a graph”. PVLDB 7(3), 193–204 (2013)
  • 87.Tian, Y., Hankins, R.A., Patel, J.M.: Efficient aggregation for graph summarization. In: SIGMOD (2008)
  • 88.Valiant LG. A bridging model for parallel computation. CACM. 1990;33(8):103–111. doi: 10.1145/79173.79181.
  • 89.Vieira, M.V., Fonseca, B.M., Damazio, R., Golgher, P.B., Reis, D.d.C., Ribeiro-Neto, B.: Efficient search ranking in social networks. In: CIKM (2007)
  • 90.W3C Recommendation: SPARQL query language for RDF. https://www.w3.org/TR/rdf-sparql-query/ (2008)
  • 91.Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393(6684):440. doi: 10.1038/30918.
  • 92.Wu Y, Jin R, Zhang X. Efficient and exact local search for random walk based top-k proximity query in large graphs. TKDE. 2016;28(5):1160–1174. doi: 10.1109/TKDE.2016.2515579.
  • 93.Yahia SA, Benedikt M, Lakshmanan LV, Stoyanovich J. Efficient network aware search in collaborative tagging sites. PVLDB. 2008;1(1):710–721.
  • 94.Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. In: ICDM (2012)
  • 95.Yildirim, H., Chaoji, V., Zaki, M.J.: Grail: Scalable reachability index for large graphs. PVLDB 3(1-2) (2010)

Articles from The Vldb Journal are provided here courtesy of Springer
