Efficient and Exact Local Search for Random Walk Based Top-K Proximity Query in Large Graphs

Yubao Wu; Ruoming Jin; Xiang Zhang

doi:10.1109/TKDE.2016.2515579

Author manuscript; available in PMC: 2019 Mar 11.

Published in final edited form as: IEEE Trans Knowl Data Eng. 2016 Jan 7;28(5):1160–1174. doi: 10.1109/TKDE.2016.2515579

PMCID: PMC6411071 NIHMSID: NIHMS777927 PMID: 30867621

Abstract

Top-k proximity query in large graphs is a fundamental problem with a wide range of applications. Various random walk based measures have been proposed to measure the proximity between different nodes. Although these measures are effective, efficiently computing them on large graphs is a challenging task. In this paper, we develop an efficient and exact local search method, FLoS (Fast Local Search), for top-k proximity query in large graphs. FLoS guarantees the exactness of the solution. Moreover, it can be applied to a variety of commonly used proximity measures. FLoS is based on the no local optimum property of proximity measures. We show that many measures have no local optimum. Utilizing this property, we introduce several operations to manipulate transition probabilities and develop tight lower and upper bounds on the proximity values. The lower and upper bounds monotonically converge to the exact proximity value when more nodes are visited. We further extend FLoS to measures that have local optima by exploiting the relationships among different measures. We perform comprehensive experiments on real and synthetic large graphs to evaluate the efficiency and effectiveness of the proposed method.

Index Terms: Local search, proximity search, random walk, top-k search, nearest neighbors

1 Introduction

Given a large graph and a query node, finding its k nearest neighbors (kNN) is a primitive operation that has recently attracted intensive research interest [1], [2], [3], [4]. In general, there are two challenges in top-k proximity query. One is to design proximity measures that can effectively capture the similarity between nodes. The other is to develop efficient algorithms to compute the top-k nodes for a given measure.

Designing effective proximity (similarity) measures is a nontrivial task. Random walk based measures have been shown to be effective in many applications. Some examples include discounted/truncated hitting time [1], [5], penalized hitting probability [6], [7], random walk with restart [8], [9], RoundTripRank [10], and absorption probability [11].

Although various proximity measures have been developed, how to efficiently compute them remains a challenging problem. For most random walk based measures, a naive method requires matrix inversion, which is prohibitive for large graphs. Two global approaches have been developed. One applies the power iteration method over the entire graph [12], [4], [13]. Another approach precomputes and stores the inversion of a matrix [8], [14], [15]. The precomputing step is usually expensive and needs to be repeated whenever the graph changes.

To improve the efficiency, local methods have been developed [1], [5], [7], [9]. The idea is to visit the nodes near the query node and dynamically expand the search range. Node proximities are estimated based on local information only. Without using the global information, however, most existing local search methods cannot guarantee that the exact solution is found. Moreover, they are usually designed for specific measures and cannot be generalized to other measures.

In this paper, we propose FLoS (Fast Local Search), a simple and unified local search method for efficient and exact top-k proximity query in large graphs. FLoS has the following properties.

Exact: It is guaranteed to find the exact top-k nodes.
Unified: It is a general method that can be applied to a variety of random walk based proximity measures. Most existing methods are designed for specific measures.
Efficient: It uses a simple local search strategy that needs neither preprocessing nor iterating over the entire graph. Experimental results show that it is orders of magnitude faster than alternatives.

The key idea behind FLoS is that we can develop upper and lower bounds on the proximity of the nodes near the query node. These bounds can be dynamically updated when a larger portion of the graph is explored and will finally converge to the exact proximity value. The top-k nodes can be identified once the differences between their upper and lower bounds are small enough to distinguish them from the remaining nodes.

The theoretical basis of FLoS relies on the no local optimum property of proximity measures. That is, given a query node q, for any node i (i ≠ q) in the graph, i always has a neighbor that is closer to q than i is. We show that many measures have no local optimum. This property ensures that the proximity of unvisited nodes is bounded by the maximum proximity (or minimum proximity for some measures) in the boundary of the visited nodes. It can be utilized to find the top-k nodes without exploring the entire graph under the assumption that the exact proximity can be computed based on local information. However, for most measures, the exact proximity cannot be computed without searching the entire graph. To tackle this challenge, we introduce several simple operations to modify transition probabilities, which enable developing upper and lower bounds on the proximity of visited nodes. The developed upper (lower) bounds monotonically decrease (increase) when more nodes are visited. We further study the relationship among different measures and show that FLoS can also be applied to measures having local optimum. Extensive experimental results show that, for a variety of measures, FLoS can dramatically improve the efficiency compared to the state-of-the-art methods.

2 Related Work

Various random walk based proximity measures have been proposed recently [16]. Examples include truncated or discounted hitting time [5], [17], [1], penalized hitting probability [6], [7], random walk with restart [8], [1], effective importance (degree normalized random walk with restart) [9], RoundTripRank [10], and absorption probability [11]. The Katz score [18] captures multiple paths between two nodes and is closely related to the random walk based proximity measures [13].

The basic approach for proximity query is to use the power iteration method [12]. An improved iteration method designed for random walk with restart decomposes the proximity into random walk probabilities of different lengths [4]. It tries to reduce the number of iterations by estimating the proximity based on the information collected so far. The iteration method can also be improved by the prioritized execution of the iterative computation, where the node with the largest residual proximity value is updated first [13]. Another approach precomputes the information needed for proximity estimation during the query process [8], [14], [15]. However, this step is time-consuming and becomes infeasible when the graph is large or constantly changing. Graph embedding methods embed nodes into a geometric space so that node proximities are preserved as much as possible [19]. The embedding step is also time-consuming. Moreover, the proximities in the new space are not exactly the same as the ones in the original graph.

Based on the intuition that nodes near the query node tend to have high proximity, local search methods try to visit a small number of nodes to approximate the proximities. Best-first [7] and depth-first [20] search strategies simply extract a fixed number of nodes near the query node. An approximate local search algorithm is proposed for truncated hitting time [5]. The key idea is to develop upper and lower bounds that can be used to approximate the proximities of local nodes. A similar local search algorithm is developed for personalized PageRank and degree normalized personalized PageRank [1]. The push style method is first developed in [21] for random walk with restart, and later improved by [22], [2] for the top-k query problem. Starting from the query node, the push style method propagates the proximity value to the nodes in the neighborhood of the query node, and obtains approximate proximity values for them. This basic idea has been adapted to compute the top-k nodes for effective importance [9], RoundTripRank [10], and Katz score [23]. Most of the existing local search methods cannot guarantee the exactness. Moreover, all of them are designed for specific proximity measures. It is unclear whether the local search methods can be generalized to other measures.

3 No Local Optimum Property

In this section, we first introduce the basic concept of the no local optimum property of proximity measures and discuss how it can be used to bound the proximity of the unvisited nodes. Then we study whether commonly used measures have no local optimum and discuss the relationship between them. Table 1 lists the main symbols and their definitions. The top-k query problem is stated in Definition 1.

TABLE 1.

Main symbols

Symbols	Definitions
G(V,E)	undirected graph G with node set V and edge set E
$N_i$	neighbors of node i
$w_{i,j}$	weight of edge (i, j)
$w_i$	degree of node i, $w_i = \sum_{j \in N_i} w_{i,j}$
q	query node
e	n×1 vector with $e_q = 1$ and $e_i = 0$ if i ≠ q, where n = |V|
k	number of returned nodes
S	a set of nodes
S̄	complement of S: S̄ = V \ S
δS	boundary of S: {i ∈ S | ∃j ∈ $N_i$ ∩ S̄}
δS̄	boundary of S̄: {i ∈ S̄ | ∃j ∈ $N_i$ ∩ S}
r	n×1 vector, $r_i$: proximity of node i w.r.t. the query node q
r̄	upper bound of r: $\bar{r}_i \geq r_i$, ∀i ∈ S
$\underline{r}$	lower bound of r: $\underline{r}_i \leq r_i$, ∀i ∈ S
$p_{i,j}$	transition probability from node i to j
P	transition probability matrix: $P_{q,j} = 0$; $P_{i,j} = p_{i,j}$ if i ≠ q
d, $r_d$	a dummy node d with constant proximity value $r_d$
c	decay factor in PHP, DHT, RWR, EI, or RT
u↝v	node u can reach node v in the transition graph
u↝̸v	node u cannot reach node v in the transition graph
u↝S	node u can reach at least one node in S
u↝̸S	node u cannot reach any node in S

Definition 1. [Top-k Query Problem]

Suppose that we have an undirected graph G(V,E), a query node q and a number k. Let r_i represent the proximity of node i with regard to the query node q. The top-k query problem aims at finding a node set K ⊆ V \{q} such that |K| = k and r_i≥r_j, for any node i∈K and j ∈V \(K∪{q}).

3.1 Theoretical Basis

Note that for some measures, such as penalized hitting probability, random walk with restart, effective importance, RoundTripRank, Katz score and absorption probability, the larger the proximity the closer the nodes. In this case, no local optimum means no local maximum. For other measures, such as discounted or truncated hitting time, the smaller the proximity the closer the nodes. In this case, no local optimum means no local minimum.

Given an undirected and edge weighted graph G = (V,E) and a query node q∈V, let r be the proximity vector with r_i representing the proximity of node i∈V with respect to the query node q.

Definition 2. [No Local Maximum]

A proximity measure has no local maximum if for any node i ≠q, there exists a neighbor node j of i (i.e., j ∈N_i), such that r_j >r_i.

Definition 3. [No Local Minimum]

A proximity measure has no local minimum if for any node i ≠q, there exists a neighbor node j of i (i.e., j ∈N_i), such that r_j <r_i.

We say that a proximity measure has no local optimum if it has no local maximum or no local minimum, whichever applies to the measure. In Section 3.2, we will examine whether the commonly used proximity measures have no local optimum. Unless otherwise mentioned, in the following we assume that the larger the proximity the closer the nodes, and focus on the no local maximum property. All conclusions can also be applied to the proximity measures with no local minimum.

Let S be a set of nodes, and S̄=V \S be the remaining nodes. We use δS = {i ∈ S |∃j ∈ N_i ∩ S̄} to denote the boundary of S, and δ S̄ ={i ∈ S̄ |∃j ∈N_i ∩ S} to denote the boundary of S̄.

Figure 1(a) shows an undirected graph with 8 nodes. Suppose that the node set S={1,2,3,4}, then we have S̄= {5,6,7,8}, δS={3,4}, and δ S̄={5,6,7}.
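The two boundary sets can be computed directly from adjacency lists. The sketch below is only an illustration: the graph is a hypothetical 8-node graph chosen to agree with the facts stated in the text about Figure 1(a) (node degrees, δS and δS̄), its edges at node 8 are guesses, and the helper name `boundaries` is not from the paper.

```python
def boundaries(adj, S):
    """Return (delta_S, delta_S_bar) for a node set S of an undirected graph."""
    S = set(S)
    S_bar = set(adj) - S
    # Nodes of S with at least one neighbor outside S.
    delta_S = {i for i in S if any(j in S_bar for j in adj[i])}
    # Nodes outside S with at least one neighbor inside S.
    delta_S_bar = {i for i in S_bar if any(j in S for j in adj[i])}
    return delta_S, delta_S_bar


# Hypothetical adjacency lists (an assumption; the edges at node 8 are guesses).
adj = {
    1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3, 6, 7],
    5: [3, 8], 6: [4, 8], 7: [4, 8], 8: [5, 6, 7],
}
delta_S, delta_S_bar = boundaries(adj, {1, 2, 3, 4})
print(sorted(delta_S), sorted(delta_S_bar))  # [3, 4] [5, 6, 7]
```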

Fig. 1 — An example graph and its transition graph

Theorem 1

Let S be a node set containing the query node, and u be the node with the largest proximity in δS. If a proximity measure has no local maximum, then r_u >r_j (∀j ∈ S̄).

Proof

Suppose otherwise. We have that ∃j ∈ S̄, such that r_u≤r_j. Now suppose that node v is the node with the largest proximity in S̄. We have that ∀i∈ S̄∪δS, r_v≥r_i. The neighbors of node v must exist in S̄ ∪ δS, i.e., N_v ⊆ S̄ ∪ δS. Therefore, we have r_v≥r_i (∀i∈N_v), which means node v is a local maximum. This contradicts the assumption.

Based on Theorem 1, assuming that we already have the exact proximity vector r, we can design a simple local search strategy as shown in Algorithm 1 to find the top-k nodes. It starts from the query node q and uses K and S to store the top-k nodes and visited nodes respectively. In each iteration, the algorithm finds the node u that has the largest proximity in S \K, adds it to K, and expands S to the neighbors of node u. Since δS ⊆ S \K, the maximum proximity value in S \K must be no less than the maximum proximity value in δS, and greater than the maximum proximity value in the unvisited nodes S̄ based on Theorem 1. Thus, nodes enter K in non-increasing order of proximity. Because a measure with no local maximum attains its largest proximity at q, the query node is the first node added to K, so the algorithm continues until |K|=k+1, and the k nodes in K other than q are the top-k nodes.

Let h be the average number of neighbors of a node. In each iteration t, on average h nodes are added to S. This takes $O(h \log ht)$ time for a sorted list. The overall complexity of Algorithm 1 is thus $O(\sum_{t=1}^{k} h \log ht) = O(hk \log hk)$.
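The strategy described above can be sketched with a binary heap in place of the sorted list. This is a minimal illustration rather than the paper's Algorithm 1 listing: it assumes the exact proximities are already available, uses PHP with unit edge weights as the measure, and runs on a hypothetical graph whose edges at node 8 are guesses. The names `php_all` and `basic_local_search` are made up for this example.

```python
import heapq


def php_all(adj, q, c=0.5, sweeps=200):
    """Global PHP values by repeated sweeps (unit edge weights assumed)."""
    r = {i: 0.0 for i in adj}
    for _ in range(sweeps):
        r = {i: 1.0 if i == q else c * sum(r[j] for j in adj[i]) / len(adj[i])
             for i in adj}
    return r


def basic_local_search(adj, r, q, k):
    """Pop the best visited node, expand it, and stop after k + 1 pops."""
    visited = {q}
    heap = [(-r[q], q)]          # max-heap via negated proximity
    top = []                     # popped nodes; the query node comes first
    while heap and len(top) < k + 1:
        _, u = heapq.heappop(heap)
        top.append(u)
        for v in adj[u]:         # expand S to the neighbors of u
            if v not in visited:
                visited.add(v)
                heapq.heappush(heap, (-r[v], v))
    return [u for u in top if u != q], visited


# Hypothetical graph (an assumption, not necessarily the graph of Figure 1(a)).
adj = {
    1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3, 6, 7],
    5: [3, 8], 6: [4, 8], 7: [4, 8], 8: [5, 6, 7],
}
r = php_all(adj, q=1)
top, visited = basic_local_search(adj, r, q=1, k=2)
print(top, sorted(visited))  # [2, 3] [1, 2, 3, 4, 5]
```

On this graph the search returns the top-2 nodes without ever touching nodes 6, 7 and 8, which is the point of Theorem 1.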

Algorithm 1.

The basic top-k local search algorithm

3.2 Measures With and Without Local Optimum

Table 2 summarizes whether the commonly used proximity measures have the no local optimum property. Next, we use penalized hitting probability (PHP) [6], [7] as an example to illustrate that it has no local maximum. We use $w_i$ to denote the degree of node i, and $w_{i,j}$ to denote the edge weight between i and j. The transition probability from i to j is thus $p_{i,j} = w_{i,j}/w_i$.

TABLE 2.

No local optimum property of some measures

Proximity measures	Abbr.	Ref.	Property
Penalized hitting probability	PHP	[6]	No local maximum
Effective importance	EI	[9]	No local maximum
Discounted hitting time	DHT	[1]	No local minimum
Truncated hitting time	THT	[5]	No local minimum (within L hops)
Random walk with restart	RWR	[8]	Local maximum
RoundTripRank	RT	[10]	Local maximum
Katz score	KZ	[18]	Local maximum
Absorption probability	AP	[11]	Local maximum

Suppose the undirected graph in Figure 1(a) has unit edge weight. Node 3 has degree 3, thus its transition probability to node 4 is p_3,4 = 1/3. Based on these transition probabilities, we can construct the corresponding transition graph as shown in Figure 1(b). In the transition graph, each directed edge and the number on the edge represent the transition probability from one node to the other.

PHP penalizes the random walk for each additional step. Given a query node q, let r denote the PHP proximity vector, with r_i representing the proximity value of node i. PHP can be defined recursively as

r_{i} = \begin{cases} 1, & \text{if } i = q, \\ c \sum_{j \in N_{i}} p_{i, j} r_{j}, & \text{if } i \neq q, \end{cases}

where c (0<c<1) is the decay factor in the random walk process. In [6], c = e⁻¹ is used as the decay factor. The query node q has constant proximity value 1, and there is no transition probability going out of the query node. For example, there are no outgoing edges from the query node 1 in the transition graph in Figure 1(b).

Let P be the transition probability matrix with

P_{i, j} = \begin{cases} 0, & \text{if } i = q, \\ p_{i, j}, & \text{if } i \neq q . \end{cases}

Then the above recursive definition can be expressed as the following matrix form

r = cPr + e,

where e_i=1 if i=q, and e_i=0 if i ≠ q.
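The matrix form suggests a simple way to evaluate PHP on a small graph: build P from the edge weights and iterate r ← cPr + e. The sketch below is an illustration under assumptions: the weighted three-node path, the function name `php`, the max-norm stopping rule and the tolerance `tau` are all made up for this example.

```python
def php(weights, q, c, tau=1e-12):
    """PHP proximities for an undirected weighted graph.

    weights[i][j] is the weight of edge (i, j); the dict must be symmetric.
    """
    # Transition matrix P with the row of the query node set to zero.
    P = {i: {} if i == q else
         {j: w / sum(nbrs.values()) for j, w in nbrs.items()}
         for i, nbrs in weights.items()}
    e = {i: 1.0 if i == q else 0.0 for i in weights}
    r = dict(e)
    while True:
        new = {i: c * sum(p * r[j] for j, p in P[i].items()) + e[i] for i in P}
        if max(abs(new[i] - r[i]) for i in P) < tau:
            return new
        r = new


# Hypothetical weighted path q - b - c with w(q,b) = 3 and w(b,c) = 1.
weights = {"q": {"b": 3.0}, "b": {"q": 3.0, "c": 1.0}, "c": {"b": 1.0}}
r = php(weights, q="q", c=0.5)
print(r)
```

For this path, p_{b,q} = 3/4 and p_{b,c} = 1/4, so solving the two equations by hand gives r_b = 0.4 and r_c = 0.2, which the iteration approaches.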

Lemma 1

PHP has no local maximum.

Proof

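Suppose that some node i ≠ q is a local maximum, i.e., $r_j \leq r_i$ for all $j \in N_i$. Since $\sum_{j \in N_i} p_{i,j} = 1$, we have $r_i = c \sum_{j \in N_i} p_{i,j} r_j \leq c \sum_{j \in N_i} p_{i,j} r_i = c r_i < r_i$, where the last step uses 0<c<1 and $r_i > 0$ (every node connected to q has positive proximity). This is the contradiction $r_i < r_i$.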

The proofs for whether other proximity measures in Table 2 have local optimum can be found in the Appendices. In particular, in Appendix A, we show that EI, DHT, and THT have no local optimum. In Appendix G, we show that RWR, RT, KZ, and AP have local maximum.

Some proximity measures have inherent relationship. The following theorem says that PHP, EI, and DHT are equivalent in terms of ranking.

Theorem 2

PHP, EI, and DHT give the same ranking results.

Please see Appendix A for the proof.

From the discussion in this section, we know that if a proximity measure has no local optimum, we can apply local search described in Algorithm 1 to find the top-k nodes under the assumption that the proximity values of all the nodes are given. However, without exploring the entire graph, the exact values of all the nodes are unknown. To tackle this challenge, we can develop lower and upper bounds on the proximity values of visited nodes. When more nodes are visited, the lower and upper bounds become tighter and eventually converge to the exact proximity value.

Next, we will use PHP, which has no local optimum, as a concrete example to explain how to derive the lower and upper bounds. The strategy can be applied to other proximity measures with no local optimum. How to apply the local search strategy to the measures having local optimum will be discussed in Section 6.

4 Bounding the Proximity

To develop the lower and upper bounds, we introduce three basic operations, i.e., deletion, restoration, and destination change of transition probability. We show how proximities change if we modify transition probabilities according to these operations. Then we discuss how to derive lower and upper bounds based on them.

4.1 Modifying Transition Probability

We first introduce the operation of deleting a transition probability. Figure 2(a) shows an example, in which the original graph is on the top, with node 1 being the query node and transition probabilities p_2,1 = p_2,3 = 0.5 and p_3,2 = 1. After deleting p_2,3, the resulting graph is shown at the bottom. Note that deleting a transition probability is different from deleting an edge. Deleting an edge will change the transition probabilities on the remaining graph, while deleting a transition probability will not.

Fig. 2 — Basic operations on transition probability

Theorem 3

Deleting a transition probability will not increase the proximity of any node.

Please see Appendix B for the proof.

Continue with the example in Figure 2(a). Suppose that the decay factor c=0.5. The original PHP proximity vector is r = [1,2/7,1/7]. After deleting p_2,3, the new proximity vector is r′ =[1,1/4,1/8].
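The numbers in this example can be checked with exact rational arithmetic. The sketch below solves (I − cP)r = e by Gauss-Jordan elimination over `fractions.Fraction`. The solver `php_exact` and the dict-of-dicts encoding of the transition graph are choices made for this illustration, not part of the paper.

```python
from fractions import Fraction as F


def php_exact(P, q, c):
    """Solve (I - c*P) r = e exactly by Gauss-Jordan elimination."""
    nodes = sorted(P)
    n = len(nodes)
    idx = {v: a for a, v in enumerate(nodes)}
    # Augmented matrix [I - c*P | e], with e_q = 1 and e_i = 0 otherwise.
    A = [[F(int(a == b)) for b in range(n)] + [F(int(nodes[a] == q))]
         for a in range(n)]
    for i, row in P.items():
        for j, p in row.items():
            A[idx[i]][idx[j]] -= c * p
    for col in range(n):
        piv = next(a for a in range(col, n) if A[a][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for a in range(n):
            if a != col and A[a][col] != 0:
                A[a] = [x - A[a][col] * y for x, y in zip(A[a], A[col])]
    return [A[a][n] for a in range(n)]


c = F(1, 2)
# Transition graph of Figure 2(a): p_{2,1} = p_{2,3} = 1/2 and p_{3,2} = 1.
original = {1: {}, 2: {1: F(1, 2), 3: F(1, 2)}, 3: {2: F(1)}}
# Same graph after deleting the transition probability p_{2,3}.
deleted = {1: {}, 2: {1: F(1, 2)}, 3: {2: F(1)}}
before = php_exact(original, q=1, c=c)
after = php_exact(deleted, q=1, c=c)
print(before, after)
```

The two vectors come out as [1, 2/7, 1/7] and [1, 1/4, 1/8], matching the text, and no entry increases after the deletion, as Theorem 3 states.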

It can be shown in a similar way that if we restore the transition probability as shown in Figure 2(a), the proximities will not decrease.

Theorem 4

Restoring a deleted transition probability will not decrease the proximity of any node.

Proof

Omitted.

Figure 2(b) shows an example in which we change the destination of the original transition probability p_3,2 from node 2 to 1. Thus, p_3,1 is set to p_3,2, and p_3,2 is set to 0.

Theorem 5

Changing the destination of transition probability $p_{i,j}$ from node j to node u with $r_u \geq r_j$ ($r_u \leq r_j$) will not decrease (increase) the proximity of any node.

Please see Appendix B for the proof.

Let us continue with the example in Figure 2(b), where node 1 is the query node. After we change the destination of p_3,2 from node 2 to 1, the proximity values should be non-decreasing. With a decay factor c=0.5, the proximity vectors before and after the destination change are r=[1,2/7,1/7] and r′ =[1,3/8,1/2] respectively.
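For readers who want to verify these vectors by hand, the recursive PHP definition gives the following short derivation (query node 1, c = 1/2, so r_1 = 1 in both cases):

```latex
% Destination change in Figure 2(b): p_{3,2} is redirected to the query node 1.
\begin{align*}
\text{before:}\quad & r_3 = c\, p_{3,2}\, r_2 = \tfrac{1}{2} r_2, \qquad
  r_2 = c\,(p_{2,1} r_1 + p_{2,3} r_3) = \tfrac{1}{4} + \tfrac{1}{8} r_2
  \;\Rightarrow\; r_2 = \tfrac{2}{7},\; r_3 = \tfrac{1}{7}, \\
\text{after:}\quad & r'_3 = c\, p_{3,1}\, r_1 = \tfrac{1}{2}, \qquad
  r'_2 = c\,(p_{2,1} r_1 + p_{2,3} r'_3)
       = \tfrac{1}{2}\left(\tfrac{1}{2} + \tfrac{1}{4}\right) = \tfrac{3}{8}.
\end{align*}
```

Both entries grow, consistent with Theorem 5, because the new destination (node 1) has a larger proximity than the old one (node 2).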

The proofs for other measures having no local optimum are similar and omitted here.

4.2 Lower Bound

Let S, S̄, δS and δS̄ represent the set of visited nodes, the set of unvisited nodes, the boundary of S, and the boundary of S̄, respectively. From Theorem 3, if we delete all transition probabilities $\{p_{i,j} : i \text{ or } j \in \bar{S}\}$ in the original graph, the proximity value of any node u computed using the resulting graph, $\underline{r}_u$, will be less than or equal to its original value $r_u$, i.e., $\underline{r}_u \leq r_u$. Thus, $\underline{r}$ can be used as the lower bound of r.

Let us take the undirected graph in Figure 1(a) for example. Its transition graph is shown in Figure 3(a), with node 1 being the query node and transition probabilities shown on the edges. Suppose that the current set of visited nodes S ={1,2,3,4}. Thus S̄ ={5,6,7,8}, δS ={3,4}, and δ S̄ ={5,6,7}. The nodes in S but not in δS are black. The nodes in δS are gray. The nodes in S̄ are white.

Fig. 3 — Lower and upper bounds based on basic operations

Figure 3(b) shows the resulting transition graph after deleting all transition probabilities $\{p_{i,j} : i \text{ or } j \in \bar{S}\}$. The proximity values $\underline{r}$ computed based on Figure 3(b) will lower bound the original proximity values r for the nodes in S.

4.3 Upper Bound

From Theorem 5, if we change the destination of the transition probabilities $\{p_{i,j} : i \in \delta S, j \in \delta\bar{S}\}$ to a newly added dummy node d with a constant proximity value $r_d$ and $r_d > r_v$ (∀v ∈ S̄), the proximity value of any node u computed using the resulting graph, $\bar{r}_u$, will be greater than or equal to its original value $r_u$, i.e., $\bar{r}_u \geq r_u$. Thus, r̄ can be used as the upper bound of r.

Continue with the example in Figure 3(a). In the original graph, δS={3,4}, δS̄={5,6,7}, $p_{3,5} = 1/3$, and $p_{4,6} = p_{4,7} = 1/4$. The left figure in Figure 3(c) shows the resulting graph after we change all the transition probabilities going from δS to δS̄ to the newly added dummy node d. Specifically, $p_{3,d} = 1/3$ and $p_{4,d} = p_{4,6} + p_{4,7} = 1/2$. Note that after changing the destination to d, there will be no transition probability from any node in S to any node in S̄. Therefore, for the nodes in S, the upper bounds can be computed by the subgraph induced by the nodes in S and the dummy node d, as shown on the right in Figure 3(c).

Note that to get the upper bound, we need to add a dummy node d with constant proximity value r_d≥r_v (∀v ∈ S̄). In the next section, we will present the fast local search algorithm, FLoS, and discuss how to choose the value r_d.
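Both bounds of this section can be reproduced on a small graph. The sketch below is a hedged illustration: the graph is hypothetical (chosen to agree with the probabilities quoted in the text, with guessed edges at node 8), unit edge weights are assumed, and r_d = 1 is used only because it is trivially valid for PHP, whose values never exceed 1; it is not the choice made by FLoS. The names `iterate` and `php_bounds` are made up.

```python
def iterate(P, e, c, sweeps=200):
    """Approximate the solution of r = c*P*r + e by repeated sweeps from zero."""
    r = {i: 0.0 for i in P}
    for _ in range(sweeps):
        r = {i: c * sum(p * r[j] for j, p in P[i].items()) + e.get(i, 0.0)
             for i in P}
    return r


def php_bounds(adj, q, S, r_d, c=0.5):
    """Lower and upper PHP bounds for the visited set S (unit edge weights)."""
    P = {i: {} if i == q else {j: 1.0 / len(adj[i]) for j in adj[i]} for i in S}
    # Lower bound: delete every transition that touches an unvisited node.
    low_P = {i: {j: p for j, p in P[i].items() if j in S} for i in S}
    lower = iterate(low_P, {q: 1.0}, c)
    # Upper bound: redirect the deleted probability mass to the dummy node "d".
    to_dummy = {i: sum(p for j, p in P[i].items() if j not in S) for i in S}
    up_P = {i: dict(low_P[i]) for i in S}
    for i, mass in to_dummy.items():
        if mass > 0:
            up_P[i]["d"] = mass
    up_P["d"] = {}
    upper = iterate(up_P, {q: 1.0, "d": r_d}, c)
    del upper["d"]
    return lower, upper, to_dummy


# Hypothetical graph (an assumption, not necessarily the graph of Figure 1(a)).
adj = {
    1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3, 6, 7],
    5: [3, 8], 6: [4, 8], 7: [4, 8], 8: [5, 6, 7],
}
S = {1, 2, 3, 4}
lower, upper, to_dummy = php_bounds(adj, q=1, S=S, r_d=1.0)
full_P = {i: {} if i == 1 else {j: 1.0 / len(adj[i]) for j in adj[i]} for i in adj}
exact = iterate(full_P, {1: 1.0}, 0.5)
for i in sorted(S):
    print(i, lower[i], exact[i], upper[i])
```

On this graph the redirected mass is 1/3 at node 3 and 1/2 at node 4, as in the example, and the exact value of each visited node lies between its two bounds.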

Algorithm 2.

Fast top-k local search (FLoS)

5 Fast Local Search

In this section, we present the FLoS algorithm, which utilizes the bounds developed in Section 4 to enable the local search. We show that the bounds can only change monotonically when more nodes are visited.

5.1 The FLoS Algorithm

Algorithm 2 describes the FLoS algorithm. It has four main steps. In the first step, the algorithm expands locally to the neighbors of a selected visited node. In the second and third steps, it updates the lower and upper bounds of the visited nodes. Finally, it checks whether the top-k nodes are identified.

The local expansion step is shown in Algorithm 3. It picks the node in δS having the largest average lower and upper bound values, and expands to the neighbors of this node. S and δS are then updated accordingly. We take the average of the lower and upper bound value as an approximation of the exact proximity value. Expanding the node with the largest value is the best-first search strategy.

Algorithm 4 shows how to update the lower bound by using PHP as an example. It can also be applied to other measures with no local optimum. To update the bounds, we first construct the transition matrix P. Note that the size of P is |S| × |S| instead of |V| × |V|. We do not allocate memory for the full matrix P, but only use adjacency lists to represent it. The lower bound vector $\underline{r}$ is initialized to its value in the previous iteration, or to 0 for the newly added nodes. We then use the standard iterative method, which is shown in Algorithm 7, to solve the linear equation $\underline{r} = cP\underline{r} + e$ and update $\underline{r}$.

Algorithm 3.

LocalExpansion()

1:	$u \leftarrow \operatorname{argmax}_{i \in \delta S^{t-1}} (\underline{r}_{i}^{t-1} + \bar{r}_{i}^{t-1})$;
2:	$S^{t} \leftarrow S^{t-1} \cup N_{u}$;
3:	Update $\delta S^{t}$;

Algorithm 4.

UpdateLowerBound()

1:	$P_{i,j}^{t} \leftarrow w_{i,j} / \sum_{v \in N_{i}} w_{i,v}$, if node i or j is newly added;
2:	$P_{q,j}^{t} \leftarrow 0$, if node j is newly added;
3:	$P_{i,j}^{t} \leftarrow P_{i,j}^{t-1}$, if nodes i and j exist in the last iteration;
4:	$\underline{r}_{i}^{t} \leftarrow 0$, if node i is newly added;
5:	$\underline{r}_{i}^{t} \leftarrow \underline{r}_{i}^{t-1}$, if node i exists in the last iteration;
6:	$e_{i} \leftarrow 1$, if i = q; $e_{i} \leftarrow 0$, otherwise;
7:	$\underline{r}^{t} \leftarrow$ IterativeMethod($P^{t}$, $\underline{r}^{t}$, e, c, τ);

Algorithm 5.

UpdateUpperBound()

1:	Extend $P^{t}$ with 1 column and 1 row for the dummy node d;
2:	$P_{i,d}^{t} \leftarrow 1 - \sum_{j \in N_{i} \cap S^{t}} P_{i,j}^{t}$, if node $i \in \delta S^{t}$;
3:	$P_{i,d}^{t} \leftarrow 0$, if $i \in S^{t} \setminus \delta S^{t}$;
4:	$P_{d,i}^{t} \leftarrow 0$, for any node i;
5:	$\bar{r}_{i}^{t} \leftarrow 1$, if node i is newly added;
6:	$\bar{r}_{i}^{t} \leftarrow \bar{r}_{i}^{t-1}$, if node i exists in the last iteration;
7:	$\bar{r}_{d}^{t} \leftarrow r_{d}^{t} \leftarrow \max_{i \in \delta S^{t-1}} \bar{r}_{i}^{t-1}$;	`// dummy node value`
8:	Extend e with 1 new element $e_{d} = r_{d}^{t}$ for the node d;
9:	$\bar{r}^{t} \leftarrow$ IterativeMethod($P^{t}$, $\bar{r}^{t}$, e, c, τ);

Algorithm 6.

CheckTerminationCriterion()

Algorithm 7.

IterativeMethod()

Input: matrix P, vector $r_{in}$, vector e, decay factor c, value τ
Output: proximity vector $r_{out}$
1:	$r^{0} \leftarrow r_{in}$; $l \leftarrow 0$;
2:	repeat $l \leftarrow l + 1$; $r^{l} \leftarrow cPr^{l-1} + e$; until $\|r^{l} - r^{l-1}\| < \tau$;
3:	return $r^{l}$;

Algorithm 5 shows how to update the upper bound. The transition matrix P has one additional dummy node d and its related transition probabilities $\{p_{i,d} : i \in \delta S\}$. The values in r̄ are initialized to their values in the previous iteration, or to 1 for the newly added nodes. The smaller the value of $r_d$, the tighter the upper bounds. On the other hand, we also need to make sure that the value $r_d$ is larger than the exact proximity value of any unvisited node. Therefore, in line 7, we use the largest upper bound value in the boundary of the last iteration as $r_{d}^{t}$. We have $r_{d}^{t} = \max_{i \in \delta S^{t-1}} \bar{r}_{i}^{t-1} \geq \max_{i \in \delta S^{t-1}} r_{i} > r_{j}$ $(\forall j \in \bar{S}^{t-1})$, where the last inequality is based on Theorem 1. Thus $r_{d}^{t} > r_{j}$ $(\forall j \in \bar{S}^{t})$, since $\bar{S}^{t} \subseteq \bar{S}^{t-1}$. This guarantees the correctness of the upper bound according to Theorem 5. Finally, we solve the linear equation r̄=cPr̄+e to update r̄ by using Algorithm 7. Algorithm 7 is very efficient in practice: when the initial values of the iterative method are close to the exact solution, the method converges in few iterations. In our method, the proximity values of the visited nodes change only slightly between two adjacent iterations, so updating the bounds is inexpensive.
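The effect of the warm start can be observed with a small experiment. The sketch below follows the repeat-until loop of Algorithm 7, with two assumptions that the listing leaves open: the norm is taken to be the max norm, and τ = 1e-12. The graph is hypothetical (edges at node 8 are guesses), and `iterative_method` and `lower_matrix` are names invented for this example. It solves the lower bound system for S = {1,...,5} twice, once from zero and once from the solution obtained for S = {1,...,4}.

```python
def iterative_method(P, r_in, e, c, tau=1e-12):
    """Repeat r <- c*P*r + e until the max-norm change drops below tau.

    Returns the final vector and the number of iterations used.
    """
    r, steps = dict(r_in), 0
    while True:
        new = {i: c * sum(p * r[j] for j, p in P[i].items()) + e.get(i, 0.0)
               for i in P}
        steps += 1
        if max(abs(new[i] - r[i]) for i in P) < tau:
            return new, steps
        r = new


def lower_matrix(adj, q, S):
    """Transition matrix restricted to S (unit weights, row of q zeroed)."""
    return {i: {} if i == q else
            {j: 1.0 / len(adj[i]) for j in adj[i] if j in S} for i in S}


# Hypothetical graph (an assumption, not necessarily the graph of Figure 1(a)).
adj = {
    1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3, 6, 7],
    5: [3, 8], 6: [4, 8], 7: [4, 8], 8: [5, 6, 7],
}
q, c, e = 1, 0.5, {1: 1.0}
S_prev, S_next = {1, 2, 3, 4}, {1, 2, 3, 4, 5}
prev, _ = iterative_method(lower_matrix(adj, q, S_prev),
                           {i: 0.0 for i in S_prev}, e, c)
P = lower_matrix(adj, q, S_next)
cold, cold_steps = iterative_method(P, {i: 0.0 for i in S_next}, e, c)
warm, warm_steps = iterative_method(P, {i: prev.get(i, 0.0) for i in S_next}, e, c)
print(cold_steps, warm_steps)
```

Both runs reach the same vector; the warm start typically needs fewer iterations because its initial error is smaller, although the exact counts depend on the graph and on τ.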

Algorithm 6 shows the termination criterion. We select the k nodes in S \ (δS ∪{q}) with the largest lower bound values $\underline{r}$. If the minimum lower bound of the selected nodes is greater than or equal to the maximum upper bound of the remaining visited nodes, the selected nodes will be the top-k nodes in the entire graph. This is because the maximum proximity of unvisited nodes is bounded by the maximum proximity in δS, which is in turn bounded by the maximum upper bound in δS.
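The criterion described above can be written as a small predicate. This is a sketch based on the prose only, since the Algorithm 6 listing is not reproduced here; the function name `check_termination` and the bound values in the example are made up for illustration.

```python
def check_termination(S, boundary, q, lower, upper, k):
    """Return the top-k list if the bounds already separate it, else None."""
    # Candidates are the interior visited nodes, ranked by lower bound.
    interior = sorted(S - boundary - {q}, key=lambda i: lower[i], reverse=True)
    if len(interior) < k:
        return None
    top = interior[:k]
    rest = S - set(top) - {q}      # remaining visited nodes, boundary included
    if all(lower[i] >= upper[j] for i in top for j in rest):
        return top
    return None


# Made-up bounds for five visited nodes with query node 1.
S, boundary, q = {1, 2, 3, 4, 5}, {4, 5}, 1
lower = {1: 1.0, 2: 0.26, 3: 0.18, 4: 0.05, 5: 0.04}
upper = {1: 1.0, 2: 0.29, 3: 0.21, 4: 0.14, 5: 0.13}
print(check_termination(S, boundary, q, lower, upper, k=2))  # [2, 3]
print(check_termination(S, boundary, q, lower, upper, k=3))  # None
```

With k = 3 the check fails simply because only two interior nodes are available, so the search has to expand further.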

Figure 4 shows the lower and upper bounds at different iterations using the example graph in Figure 1(a). One iteration represents one local expansion process. The newly visited nodes in each iteration are listed in Table 3.

Fig. 4 — Lower and upper bounds in different iterations (PHP: q=1, c=0.8)

TABLE 3.

Newly visited nodes in each iteration

Iteration	1	2	3	4	5
Newly visited nodes	{2, 3}	{4}	{5}	{6, 7}	{8}

The left figure in Figure 4 shows how the lower and upper bounds change through local expansions. Query node 1 has constant proximity value 1.0 and thus is not shown. It can be seen that the bounds change monotonically and eventually converge to the exact proximity value when all the nodes are visited. The monotonicity of the bounds is proved in the next two subsections.

The right figure in Figure 4 shows the lower and upper bounds in iteration 3 (at the top) and 4 (at the bottom). The interval from the lower to upper bounds is indicated by the vertical line segment. The interval of the bounds for the unvisited node is indicated by the dashed vertical line. In iteration 3, nodes {6,7,8} are unvisited, and their upper bound is the upper bound for node 4, which is the maximum upper bound for the boundary nodes {4,5}. In iteration 4, the bounds become tighter, and the minimum lower bound of nodes {2,3} is larger than the maximum upper bound of the remaining nodes {4,5,6,7,8}, which is indicated by the horizontal red dashed line. Therefore, nodes {2,3} are guaranteed to be the top-2 nodes after iteration 4, even though node 8 is still unvisited.
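Putting the four steps together, the whole procedure fits in a short program. The sketch below is an illustration of the method for PHP under several assumptions, not the authors' implementation: unit edge weights, a max-norm stopping rule with τ = 1e-12, a hypothetical graph with guessed edges at node 8, c = 0.5 instead of the c = 0.8 used in Figure 4, and a fallback for the case where nothing is left to expand, which is not taken from the paper. The names `solve`, `boundary_of` and `flos_php` are invented.

```python
def solve(P, e, r0, c, tau=1e-12):
    """Iterate r <- c*P*r + e from r0 until the max-norm change is below tau."""
    r = dict(r0)
    while True:
        new = {i: c * sum(p * r[j] for j, p in P[i].items()) + e.get(i, 0.0)
               for i in P}
        if max(abs(new[i] - r[i]) for i in P) < tau:
            return new
        r = new


def boundary_of(adj, S):
    """Visited nodes that still have at least one unvisited neighbor."""
    return {i for i in S if any(j not in S for j in adj[i])}


def flos_php(adj, q, k, c=0.5):
    """Top-k PHP nodes by local search; returns (top_k, visited). Needs k >= 1."""
    S = {q}
    lower, upper = {q: 1.0}, {q: 1.0}
    while True:
        old_boundary = boundary_of(adj, S)
        if not old_boundary:
            # Assumed fallback, not taken from the paper: nothing is left to
            # expand, so rank the visited nodes by their converged bounds.
            return sorted(S - {q}, key=lambda i: -lower[i])[:k], S
        r_d = max(upper[i] for i in old_boundary)  # value of the dummy node
        # Step 1: expand the boundary node with the largest lower + upper.
        u = max(old_boundary, key=lambda i: lower[i] + upper[i])
        S = S | set(adj[u])
        boundary = boundary_of(adj, S)
        # Step 2: lower bound, transitions into unvisited nodes are deleted.
        P = {i: {} if i == q else
             {j: 1.0 / len(adj[i]) for j in adj[i] if j in S} for i in S}
        lower = solve(P, {q: 1.0}, {i: lower.get(i, 0.0) for i in S}, c)
        # Step 3: upper bound, the deleted mass is redirected to dummy "d".
        P_up = {i: dict(row) for i, row in P.items()}
        for i in boundary:
            P_up[i]["d"] = 1.0 - sum(P[i].values())
        P_up["d"] = {}
        start = {i: upper.get(i, 1.0) for i in S}
        start["d"] = r_d
        upper = solve(P_up, {q: 1.0, "d": r_d}, start, c)
        del upper["d"]
        # Step 4: stop once k interior nodes dominate all other visited nodes.
        interior = sorted(S - boundary - {q}, key=lambda i: -lower[i])
        if len(interior) >= k:
            top = interior[:k]
            rest = S - set(top) - {q}
            if all(lower[i] >= upper[j] for i in top for j in rest):
                return top, S


# Hypothetical graph (an assumption, not necessarily the graph of Figure 1(a)).
adj = {
    1: [2, 3], 2: [1, 4], 3: [1, 4, 5], 4: [2, 3, 6, 7],
    5: [3, 8], 6: [4, 8], 7: [4, 8], 8: [5, 6, 7],
}
top, visited = flos_php(adj, q=1, k=2)
print(top, sorted(visited))  # [2, 3] [1, 2, 3, 4, 5]
```

With these settings the search stops after three expansions, before nodes 6, 7 and 8 are visited. The run in Figure 4 uses a larger decay factor and needs one more expansion, so the two are not directly comparable.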

5.2 Monotonicity of the Lower Bound

We first consider the monotonicity of the lower bound. Let $S^{t-1}$ and $S^{t}$ represent the set of visited nodes in iterations (t − 1) and t respectively. In the following, we prove that the lower bound is monotonically non-decreasing when more nodes are visited, i.e., $\underline{r}_{i}^{t} \geq \underline{r}_{i}^{t-1}$ $(i \in S^{t-1})$.

Given a directed transition graph, we say that a node u can reach a node v if there exists a sequence of adjacent nodes (i.e., a path) which starts from u and ends at v. For example, in the transition graph in Figure 5(a), node 1 can reach node 6, but node 5 cannot reach node 6. We use u↝v to denote that node u can reach v, and u↝̸v to denote that node u cannot reach v. We also use u↝S to denote that node u can reach at least one node in S, and u↝̸S to denote that node u cannot reach any node in S.

Fig. 5 — Example transition graphs between two adjacent iterations for analyzing lower bound monotonicity with query node 3.

From iteration (t−1) to t, we only restore some transition probabilities in {p_i,j : i or j ∈S^t\S^t⁻¹}. The following Theorem 6 says that if node i can reach at least one of the newly added nodes, the lower bound of node i is strictly increasing. If node i cannot reach any of the newly added nodes, the lower bound value of node i will not change during the iteration.

Figure 5 shows an example. Figure 5(a) shows the full transition graph when the query is node 3. Figures 5(b) and 5(c) show the transition graphs constructed in the first and second iteration of Algorithm 2 respectively. S¹={3,1,4,5}, S² ={3,1,4,5,2}, and node 2 is the newly visited node in the second iteration. We can see that node 5 cannot reach node 2. Thus, the lower bound value $\underline{r}_5$ is unchanged. The lower bound values of nodes {1,4} are strictly increasing.

Theorem 6

(Monotonicity of the lower bound) For any node i∈S^t⁻¹, we have that

\begin{cases} \underline{r}_{i}^{t} = \underline{r}_{i}^{t-1}, & \text{if } i \not\leadsto S^{t} \setminus S^{t-1}, \\ \underline{r}_{i}^{t} > \underline{r}_{i}^{t-1}, & \text{if } i \leadsto S^{t} \setminus S^{t-1}, \end{cases}

where ↝ and ↝̸ represent the reachability in the transition graph at the t-th iteration.

Please see Appendix C for the proof.

5.3 Monotonicity of the Upper Bound

In this subsection, we analyze the monotonicity of the upper bound. Specifically, we prove that the upper bound values are strictly decreasing until they converge to the exact proximity values. That is, for any node $i \in S^{t-1}$, $\bar{r}_{i}^{t} < \bar{r}_{i}^{t-1}$ until $\bar{r}_{i}^{t-1} = r_{i}$.

From iteration (t − 1) to t, we decrease the proximity value of the dummy node and add new nodes in S^t\S^t⁻¹. After adding the new nodes, the transition probabilities need to be updated accordingly. Specifically, we need to

1. Decrease the proximity value of the dummy node from ${\bar{r}}_{d}^{t - 1}$ to ${\bar{r}}_{d}^{t}$ ;
2. Add the transition probabilities {p_i,j} from the newly added nodes i(∈S^t\S^t⁻¹) to nodes j(∈S^t), and {p_i,_d} from i to the dummy node d;
3. Add the transition probabilities {p_j,i} from nodes j(∈δS^t⁻¹) to the newly added nodes i(∈S^t\S^t⁻¹), and remove their correspondences in {p_j,_d}.

An example is shown in Figure 6. Figure 6(a) shows the transition graph for the first iteration. Figures 6(b), 6(c) and 6(d) show the resulting graphs after applying steps 1, 2, and 3 respectively. The graph in Figure 6(d) is the final transition graph for the next iteration.

Fig. 6 — Transition graphs between two adjacent iterations (upper bound)

The upper bound values change monotonically at each step. In step 1, reducing the value of r_d will not increase the upper bound values. Applying step 2 will not change the upper bound values for the nodes in S^t⁻¹, since all the newly added transition probabilities begin from nodes i(∈ S^t \ S^t⁻¹). In step 3, we reset the transition probabilities from nodes j(∈ δS^t⁻¹) to the newly added nodes i(∈S^t\S^t⁻¹). This is equivalent to a destination change, i.e., changing {p_j,_d} to {p_j,i}. Moreover, we have that ${\bar{r}}_{d}^{t} \geq {\bar{r}}_{i}^{t}$ . Thus, in step 3, the upper bound values will not increase. We provide rigorous analysis for the three steps in Appendix D.

Theorem 7

(Monotonicity of the upper bound) For any node i ∈ S^t⁻¹, we have that

$\begin{cases} \bar{r}_{i}^{t-1} = \bar{r}_{i}^{t}, & \text{if } i \not\rightsquigarrow d, \\ \bar{r}_{i}^{t-1} > \bar{r}_{i}^{t}, & \text{if } i \rightsquigarrow d, \end{cases}$

where ↝ and ↝̸ represent the reachability in the transition graph at the t-th iteration.

Please see Appendix D for the proof.

If we have that i↝̸d in the transition graph at the t-th iteration, we have that i↝̸d in the transition graph at any future iteration. Thus, ${\bar{r}}_{i}^{t}$ will not change during the future iterations. Since ${\bar{r}}_{i}^{t}$ converges to the exact proximity value r_i when the entire graph is visited, we must have that ${\bar{r}}_{i}^{t} = r_{i}$ when i↝̸d. In conclusion, the upper bound strictly decreases until it converges to the exact proximity value.
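The behavior of the upper bound can be illustrated in the same spirit. The sketch below again assumes the PHP recursion r_q = 1, r_i = c·Σ_j p_i,j·r_j with p_i,j = 1/deg(i). Transitions from visited nodes to unvisited nodes are redirected to a dummy node d with value r_d. Setting r_d to 1 in the first iteration and then to the largest upper bound on the previous boundary is an assumption of this sketch, as are the toy graph and all function names.

```python
def php_upper_bound(adj, q, visited, r_d, c=0.5, sweeps=200):
    """Upper bound on PHP: edges leaving `visited` are redirected to a dummy node of value r_d."""
    r = {i: 0.0 for i in visited}
    r[q] = 1.0
    for _ in range(sweeps):
        nxt = {}
        for i in visited:
            if i == q:
                nxt[i] = 1.0
                continue
            inside = sum(r[j] for j in adj[i] if j in visited)
            to_dummy = sum(1 for j in adj[i] if j not in visited)  # redirected edges
            nxt[i] = c * (inside + to_dummy * r_d) / len(adj[i])
        r = nxt
    return r

def boundary(adj, visited):
    """Visited nodes that have at least one unvisited neighbor."""
    return {i for i in visited if any(j not in visited for j in adj[i])}

# Hypothetical undirected toy graph; node 0 is the query.
G = {0: [1, 2, 5], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: [0]}
S1 = {0, 1, 2, 5}
S2 = S1 | {3}

upper1 = php_upper_bound(G, 0, S1, r_d=1.0)
r_d2 = max(upper1[i] for i in boundary(G, S1))   # assumed update rule for the dummy value
upper2 = php_upper_bound(G, 0, S2, r_d=r_d2)
exact = php_upper_bound(G, 0, set(G), r_d=0.0)   # whole graph visited: dummy is unused

for i in sorted(S1 - {0}):
    print(i, upper1[i], upper2[i], exact[i])
```

Nodes 1 and 2 can reach the dummy node, so their upper bounds decrease from the first to the second iteration while staying above the exact values. Node 5 cannot reach d, and its upper bound already equals its exact proximity.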

The lower and upper bounds can be further tightened by adding self-loop transition probabilities to the nodes in δS. Please see Appendix E for more details.

5.4 Complexity

Assume that Algorithm 2 runs for β iterations. Let h be the average number of neighbors of a node. The LocalExpansion step takes O(ht) time to find the node to expand at the t-th iteration. To update the lower bound, updating P needs O(h²) operations, and updating r and e needs O(h) operations. The subgraph induced by S has O(h²t) edges, so matrix P has O(h²t) non-zero entries. Therefore, using the iterative method to solve the linear equations takes O(αh²t) time, where α is the number of iterations used in IterativeMethod. Thus the overall complexity of UpdateLowerBound in the t-th iteration is O(αh²t). The complexity of the UpdateUpperBound function is the same as that of UpdateLowerBound. In the CheckTerminationCriterion step, finding the nodes with the largest lower bounds takes O(ht) time. Therefore, the overall complexity of FLoS is $O (\sum_{t = 1}^{β} (α h^{2} t + h t)) = O (α h^{2} β^{2})$ .

At each iteration, FLoS visits h new nodes on average. In the worst case, where the whole graph is visited, FLoS needs to run β=n/h iterations. Thus, the worst case complexity of FLoS is O(αh²β²)=O(αn²).
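For completeness, the two bounds follow from the arithmetic series, written out here with the same symbols (α, h, β, n) as above:

```latex
\sum_{t=1}^{\beta}\left(\alpha h^{2} t + h t\right)
  = \left(\alpha h^{2} + h\right)\frac{\beta(\beta+1)}{2}
  = O\!\left(\alpha h^{2}\beta^{2}\right),
\qquad
\beta = \frac{n}{h}
  \;\Rightarrow\;
  \alpha h^{2}\beta^{2} = \alpha h^{2}\cdot\frac{n^{2}}{h^{2}} = \alpha n^{2}.
```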

In the above complexity analysis, the number of iterations β is proportional to the number of visited nodes. Appendix F provides theoretical analysis of the number of visited nodes. In Section 8, we show experimental results on the number of visited nodes using real graphs.

Note that so far, we have used PHP to illustrate the key principles underlying the fast local search method. EI and DHT are equivalent to PHP, so there is no need to develop separate algorithms for them. For THT, deleting a transition probability will not increase the proximity of any node. Therefore, when we delete all the transition probabilities {p_i,j : i or j ∈ S̄} in the original transition graph, the proximity value of any node computed based on the modified transition graph is a lower bound. For the upper bound, we add a dummy node with value L, which is the largest possible proximity value of THT. All other processes are similar to those of PHP and are omitted here.

6 Extensions of FLoS to the Proximity Measures Having Local Maximum

In this section, we study how to extend the FLoS method to random walk with restart, RoundTripRank, Katz score, and absorption probability.

The idea is to use the relationships between PHP and these proximity measures. Figure 7 summarizes the relationships between PHP and other proximity measures. For RWR, its proximity is proportional to the PHP proximity multiplied by the node degree. For RT, its proximity is proportional to the PHP proximity multiplied by the node degree to the power of β, where β is a constant in RT. For KZ, we define a new proximity measure PHP′, which is a variant of PHP. There is a simple relationship between KZ and PHP′. For AP, we define another new proximity measure PHP″, which is also a variant of PHP. There is a simple relationship between AP and PHP″. Compared with PHP, the transition probabilities in PHP′ and PHP″ are changed. In PHP, the transition probability is normalized by the node degree w_i. In PHP′, the transition probability is normalized by the maximum degree w_max. In PHP″, the transition probability is normalized by the value (λ_i +w_i), where λ_i is a constant in AP. Thus, the FLoS algorithm for PHP can be readily modified for PHP′ and PHP″. Appendix G provides proofs for these relationships.

Fig. 7 — Relationships between PHP and other proximity measures

Next, we use RWR as an example to show how to extend FLoS to these proximity measures by using the relationship. Suppose that node v ∈ δS^t has the largest PHP proximity value. Based on Theorem 1, for any node i ∈ S̄^t, we have that PHP(i) ≤ PHP(v). Let w(S̄^t) denote the maximum degree of unvisited nodes in S̄^t. We have that w_i · PHP(i) ≤ w(S̄^t) · PHP(i) ≤ w(S̄^t) · PHP(v). Therefore, if we maintain the maximum degree of unvisited nodes, we can develop the upper bound for the proximity values of unvisited nodes.
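The bound for unvisited nodes described above can be sketched in a few lines. The function and argument names are hypothetical, and the sketch uses an upper bound on the PHP values of the boundary nodes in place of the exact value PHP(v), which keeps the inequality valid. The returned quantity bounds w_i · PHP(i), which is proportional to the RWR proximity.

```python
def unvisited_rwr_upper_bound(degree, visited, php_upper, boundary):
    """Upper bound on w_i * PHP(i) over all unvisited nodes i.

    degree:    dict node -> degree w_i (for all nodes in the graph)
    visited:   set of visited nodes S
    php_upper: dict node -> upper bound on PHP, at least for boundary nodes
    boundary:  set of boundary nodes (delta S)
    """
    unvisited_degrees = [w for i, w in degree.items() if i not in visited]
    if not unvisited_degrees:
        return 0.0  # nothing left to bound
    w_max_unvisited = max(unvisited_degrees)          # w(S-bar)
    php_cap = max(php_upper[v] for v in boundary)     # bounds PHP(i) for unvisited i
    return w_max_unvisited * php_cap

# Hypothetical numbers for illustration only.
degree = {0: 3, 1: 2, 2: 2, 3: 3, 4: 1, 5: 1}
visited = {0, 1, 2, 5}
bound = unvisited_rwr_upper_bound(degree, visited, {1: 0.5, 2: 0.5}, {1, 2})
print(bound)
```

Maintaining the maximum unvisited degree incrementally, rather than rescanning all nodes as this sketch does, would be needed to keep the search local.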

Specifically, we can apply FLoS to RWR as follows. In Algorithm 3, we can change line 1 to the following line.

1
$u \leftarrow \arg\max_{i \in \delta S^{t-1}} w_{i} \cdot (\underline{r}_{i}^{t-1} + \bar{r}_{i}^{t-1})$ ;

In Algorithm 6, we can change lines 2 and 3 to the following two lines.

2
K←k nodes in S^t\(δS^t∪{q}) with largest $w_{i} \cdot {\underline{r}}_{i}^{t}$ values;
3
if $\min_{i \in K} w_{i} \cdot \underline{r}_{i}^{t} \geq \max_{i \in \delta S^{t} \setminus (K \cup \{q\})} w_{i} \cdot \underline{r}_{i}^{t}$ and $\min_{i \in K} w_{i} \cdot \underline{r}_{i}^{t} \geq w(\bar{S}^{t}) \cdot \max_{i \in S^{t}} \bar{r}_{i}^{t}$ then bStop ← true;

All other processes remain the same.

For other proximity measures, we extend the FLoS algorithm in a similar way. Please see Appendix G for further details.

7 Top-K Reverse-Proximity Query Problem

In this section, we study the top-k reverse-proximity query problem [24] and discuss how FLoS can be applied to solve it efficiently.

Given a query node q, we can compute the proximity values of all other nodes. We can also use each node i as the query, and compute the proximity value of q. We refer to this proximity of node q as the reverse proximity of node i. The top-k reverse-proximity query problem aims at finding the top-k nodes that are ranked by the reverse proximity. In Table 2, only EI and KZ are symmetric and all other proximity measures are not symmetric. The top-k reverse-proximity query problem is different from the top-k proximity query problem when the proximity measure is not symmetric.

Note that the top-k reverse-proximity query problem is different from the reverse top-k problem studied in [25]. Given a query node q, the reverse top-k problem aims at finding all the nodes that have q in their top-k proximity sets. In this paper, we study the top-k reverse-proximity query problem, which aims at finding the top-k nodes ranked by the reverse proximity [24].
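The difference between the two problems can be made concrete with a small example. In the sketch below, `prox[i][j]` is a hypothetical, asymmetric proximity of node j when node i is the query. The values and function names are made up for illustration and do not come from any measure in this paper.

```python
def top_k_reverse_proximity(prox, q, k):
    """Top-k nodes i ranked by the reverse proximity prox[i][q]."""
    scores = {i: prox[i][q] for i in prox if i != q}
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

def reverse_top_k(prox, q, k):
    """All nodes i that have q among their own top-k proximity nodes."""
    result = []
    for i in prox:
        if i == q:
            continue
        ranked = sorted((j for j in prox[i] if j != i),
                        key=lambda j: (-prox[i][j], j))
        if q in ranked[:k]:
            result.append(i)
    return result

# Hypothetical asymmetric proximity values.
prox = {
    "q": {"a": 0.5, "b": 0.3, "c": 0.2},
    "a": {"q": 0.6, "b": 0.3, "c": 0.1},
    "b": {"q": 0.2, "a": 0.5, "c": 0.3},
    "c": {"q": 0.4, "a": 0.35, "b": 0.25},
}

print(top_k_reverse_proximity(prox, "q", 1))
print(reverse_top_k(prox, "q", 1))
```

With k = 1, the first query always returns exactly one node, whereas the second returns every node whose own top-1 list contains q, so the result sizes differ.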

The top-k reverse-proximity query problem has been studied when RWR is used as the proximity measure [24]. In a recent paper [10], the original and reverse proximity values in RWR are interpreted as importance and specificity respectively. If node i has large RWR proximity value when the query node is q, node i is important for node q. On the other hand, if node q has large RWR proximity value when the query node is i, node i is specific for node q. The authors show that ranking by the combination of two directions performs better than ranking by one direction.

The naive method to solve the top-k reverse-proximity query problem is as follows. First, each node is used as the query node, and the proximity value of node q is computed by the iterative method. Then the top-k nodes with the largest reverse proximity values are selected. Suppose that the iterative method takes O(αm) time for each query node. The naive method then takes O(αmn) time, where α is the number of iterations in the iterative method, m is the number of edges, and n is the number of nodes. This is prohibitively expensive for large graphs.
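The naive method can be sketched as follows, using PHP as the measure. The sketch assumes the PHP recursion r_q = 1, r_i = c·Σ_j p_i,j·r_j with p_i,j = 1/deg(i) on an unweighted graph. The toy graph, the function names, and the fixed number of sweeps are illustrative assumptions.

```python
def php(adj, query, c=0.5, sweeps=200):
    """PHP values of all nodes for one query, by Jacobi iteration on the whole graph."""
    r = {i: 0.0 for i in adj}
    r[query] = 1.0
    for _ in range(sweeps):
        r = {i: 1.0 if i == query else c * sum(r[j] for j in adj[i]) / len(adj[i])
             for i in adj}
    return r

def naive_top_k_reverse(adj, q, k, c=0.5):
    """Run one full iterative computation per candidate node i and read off the value of q."""
    reverse = {i: php(adj, i, c)[q] for i in adj if i != q}
    ranked = sorted(reverse, key=lambda i: (-reverse[i], i))
    return ranked[:k], reverse

# Hypothetical undirected toy graph (adjacency lists).
GRAPH = {0: [1, 2, 5], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: [0]}

top2, reverse = naive_top_k_reverse(GRAPH, q=0, k=2)
forward = php(GRAPH, 0)
print(top2)
print(forward[5], reverse[5])   # PHP is not symmetric on this graph
```

Each call to `php` touches every edge in every sweep, which is the O(αm) cost per query node, and the outer loop repeats it once per node.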

For the RWR proximity measure, it is shown that the reverse proximity vector can be computed using the iterative method, which has the same complexity O(αm) as computing the original proximity vector [25]. However, the iterative method is still expensive since it needs to iterate over the entire graph. Moreover, it is unclear how to compute the reverse proximity vectors for other measures in a similar way.

To extend FLoS to the reverse proximity measures, we use the relationships between PHP and the reverse proximity measures. Figure 8 summarizes these relationships. rPHP, rRWR, rEI, rDHT, rRT, rKZ, and rAP represent the reversed version of their corresponding measures. Appendix H provides the proofs for these relationships. Based on these relationships, we can develop the bounds for the reverse proximity values based on the bounds for the PHP or its variant proximity values. Appendix H shows more details about how to extend the FLoS algorithm to the reverse proximity measures.

Fig. 8 — Relationships between PHP and the reverse proximity measures

8 Experimental Results

In this section, we present extensive experimental results evaluating the performance of the FLoS algorithm. The datasets are shown in Table 4. The real datasets are publicly available from the website http://snap.stanford.edu/data/. The synthetic datasets are generated using the Erdős-Rényi random graph (RAND) model [26] and the R-MAT model [27] with different parameters. All programs are written in C++. All experiments are performed on a server with 32 GB memory, an Intel Xeon 3.2 GHz CPU, and Redhat 4.1.2 OS.

TABLE 4.

Datasets used in the experiments

Datasets		Abbr.	#Nodes	#Edges
Real	Amazon	AZ	334,863	925,872
	DBLP	DP	317,080	1,049,866
	Youtube	YT	1,134,890	2,987,624
	LiveJournal	LJ	3,997,962	34,681,189
Synthetic	In-memory	–	Varying size
	In-memory	–	Varying density
	Disk-resident	–	Varying size

8.1 State-of-the-Art Methods

The measures we use include PHP, EI, RWR, RT, KZ, THT, and AP. We compare FLoS with the state-of-the-art methods for each measure as summarized in Table 5. These methods are categorized into global and local methods.

TABLE 5.

State-of-the-art methods used for comparison

Our methods (Exact)	State-of-the-art methods
	Abbr.	Key idea	Ref.	Exactness
FLoS_PHP	GI_PHP	Global iteration	[12]	Exact
	DNE	Local search	[7]	Approx.
	NN_EI	Local search	[9]	Exact
	LS_EI	Local search	[1]	Approx.
FLoS_RWR	GI_RWR	Global iteration	[12]	Exact
	GE_RWR	Graph embedding	[19]	Approx.
	Castanet	Improved GI	[4]	Exact
	K-dash	Matrix inversion	[14]	Exact
	LS_RWR	Local search	[1]	Approx.
FLoS_RT	GI_RT	Global iteration	[12]	Exact
	LS_RT	Local search	[10]	Approx.
FLoS_KZ	GI_KZ	Global iteration	[12]	Exact
	LS_KZ	Local search	[23]	Approx.
	AA_KZ	Improved GI	[13]	Exact
FLoS_THT	GI_THT	Global iteration	[12]	Exact
	LS_THT	Local search	[5]	Approx.
FLoS_AP	GI_AP	Global iteration	[12]	Exact
FLoS_rPHP	GI_rPHP	Global iteration	[12]	Exact

The global iteration (GI) method directly applies the iterative method on the entire graph [12]. It is guaranteed to find the exact top-k nodes. The graph embedding (GE) method can answer the query in constant time after embedding [19]. It can only be applied to RWR. However, the embedding process is very time consuming. Moreover, it only returns approximate results. The Castanet algorithm is specifically designed for RWR. It improves the GI method and guarantees the exactness of the results [4]. AA_KZ improves the global iteration method by prioritized execution of the iterative computation and also guarantees the exactness of the results [13]. K-dash is the state-of-the-art matrix-based method for RWR and guarantees result exactness [14]. Note that K-dash and GE can only be applied to the two medium-sized real graphs because of the expensive preprocessing step.

The dynamic neighborhood expansion (DNE) method applies a best-first expansion strategy to find the top-k nodes using PHP [7]. This strategy is heuristic and is not guaranteed to find the exact solution. The number of visited nodes is fixed to 4,000 in the experiments. NN_EI applies the push style method [21], [2] in local search, and guarantees the exactness of the top-k results [9]. Since PHP and EI are equivalent in terms of ranking, we can compare the methods for PHP and EI directly. LS_RWR applies the dynamic programming technique [28] to develop bounds in local search [1]. It returns approximate results. LS_EI is based on LS_RWR and has similar performance [1]. LS_RT leverages the push style method [21] developed for RWR to estimate the bounds and find the approximate top-k nodes with the largest RoundTripRank proximity values. LS_KZ locally searches a small portion of the graph and adapts the push style method [21] to find the approximate top-k results for the Katz score. LS_THT is a local search method for THT [5].

The decay factors in PHP, RWR, EI, and RT are all set to 0.5. The decay factor in KZ is set to 0.99/w_max. In RT, we set the parameter β = 0.4. In AP, we set the parameter λ_i = 10 for any node i. The truncated length in THT is set to 10.

We use FLoS_rPHP to denote the FLoS method for reverse PHP. Reverse RWR gives the same ranking as PHP, so we only evaluate FLoS_PHP. The FLoS method for reverse RT is quite similar to that for RT, thus we only evaluate FLoS_RT. When we set the parameter λ_i = 10 for any node i, AP becomes symmetric. Thus AP and reverse AP give the same ranking results, and we only evaluate FLoS_AP.

8.2 Evaluation on Real Graphs

We study the efficiency of the selected methods on real graphs when varying the number of returned nodes k. For each k, we repeat the experiments 10³ times, each with a randomly picked query node. The average running time is reported. For methods using the iteration procedure in Algorithm 7, the termination threshold is set to τ = 10⁻⁵. We also perform experiments using a fixed number of 10 iterations. The results are similar and omitted here.

8.2.1 Evaluation of FLoS_PHP

Figure 9 shows the running time of different methods for PHP. The running time of DNE is almost a constant for different k, because it visits a fixed number of nodes. The running time of NN_EI increases when k increases. FLoS_PHP is more efficient than NN_EI, which demonstrates that the bounds of FLoS are tighter. LS_EI has a constant running time. This is because it extracts the cluster containing the query node. Note that LS_EI takes tens of hours in the preprocessing step to cluster the graphs.

Fig. 9 — Running time of different methods for PHP on real graphs

Figure 11(a) shows the ratio between the number of nodes visited by FLoS_PHP and the total number of nodes in the graph. The value indicated by the bar is the average ratio over 10³ queries. The minimum and maximum ratios are also shown in the figure. As can be seen from the figure, FLoS needs to visit only a very small part of the graph to find the exact solution. Moreover, the ratio decreases when the graph size increases. This indicates that FLoS is more effective for larger graphs.

Fig. 11 — Ratio between the number of visited nodes and the total number of nodes on real graphs

8.2.2 Evaluation of FLoS_RWR

Figure 10 shows the running time for RWR. K-dash has the best performance after precomputing the matrix inversion as shown in Figures 10(a) and 10(b). The precomputing step of K-dash takes tens of hours for the medium-sized AZ and DP graphs and cannot be applied to the other two larger graphs. GE_RWR also has fast response time. However, as discussed before, its embedding step is time consuming and not applicable to larger graphs. Moreover, it does not find the exact solution. The Castanet method reduces the running time of the GI method by 72% to 91%. The LS_RWR method has constant running time, and it needs tens of hours in the precomputing step to cluster the graphs.

Fig. 10 — Running time of different methods for RWR on real graphs

Figure 11(b) shows the ratio of the number of visited nodes for the FLoS_RWR method. The results are similar to those in Figure 11(a).

8.2.3 Evaluation of FLoS_RT

Figure 12(a) shows the running time of different methods for RT. The number on the right side of the rectangle legend indicates the value of k. Since the GI_RT method has almost constant running time for different k, we only show the result when k = 10. The running time of FLoS_RT and LS_RT increases when k increases. FLoS_RT is the most efficient method. FLoS_RT is about 1 order of magnitude faster than LS_RT, and 2 orders of magnitude faster than GI_RT. LS_RT uses the push style method to develop the bounds, which are looser than those of FLoS_RT.

Fig. 12 — Running time of different methods for RT, KZ, THT, AP, and rPHP on real graphs

8.2.4 Evaluation of FLoS_KZ

Figure 12(b) shows the running time of different methods for KZ. We also only show the running time when k = 10 for the GI_KZ method since it has almost constant running time for different k. FLoS_KZ, LS_KZ, and AA_KZ methods all have increasing running time when increasing k. FLoS_KZ is about 1–2 orders of magnitude faster than LS_KZ and AA_KZ. LS_KZ uses the push style method to develop the bounds, which are not as tight as those of FLoS_KZ. The results also demonstrate that the bounds in AA_KZ are looser than those of FLoS_KZ.

8.2.5 Evaluation of FLoS_THT

Figure 12(c) shows the running time for THT. As we can see, FLoS_THT runs faster than LS_THT, which is specifically designed to speed up the computation for THT. This is because the lower and upper bounds of FLoS_THT are tighter than those of LS_THT. Both of the two local search methods are 2 to 3 orders of magnitude faster than GI_THT.

8.2.6 Evaluation of FLoS_AP

Figure 12(d) shows the running time for AP. Similar to the results of other proximity measures, FLoS_AP runs 2–3 orders of magnitude faster than the GI_AP method.

8.2.7 Evaluation of FLoS_rPHP

In FLoS_rPHP, we pre-compute the exact values EI_i(i) for each node i by the K-dash method [14]. The precomputation step takes 28.5 and 34.6 hours for two medium-sized graphs, AZ and DP. Thus we did not apply FLoS_rPHP on the large graphs.

Figure 12(e) shows the running time of our local search method and the global iteration method for reverse PHP. Similar to the results for other proximity measures, FLoS_rPHP runs 2–3 orders of magnitude faster than the GI_rPHP method.

8.2.8 Number of Visited Nodes in Local Search Methods

In this subsection, we study the number of visited nodes of different local search methods on real graphs. The number of visited nodes in the DNE method is fixed, thus it is not included. Figure 13(a) shows the ratio between the number of nodes visited by different local search methods and the total number of nodes in the YT graph. Figure 13(b) shows the same ratio for the LJ graph. The value indicated by the bar is the average ratio over 10³ queries. The minimum and maximum ratios are also shown in the figure. As can be seen from the figure, the other local search methods need to visit a larger number of nodes than the FLoS methods do. This demonstrates the tightness of the bounds in the FLoS methods. We can also observe that the LS_EI and LS_RWR methods visit a relatively large number of nodes and that the ratio is stable when k changes. This is because in each expansion of the LS_EI and LS_RWR methods, all the nodes in one cluster are visited. Thus, they need to visit a larger number of nodes.

Fig. 13 — Ratio between the number of visited nodes and the total number of nodes on real graphs for the local search methods

8.3 Evaluation on In-Memory Synthetic Graphs

We generate synthetic graphs with different parameters to evaluate the selected methods. More specifically, we study two types of graphs: Erdős-Rényi random graphs (RAND) [26] and scale-free graphs based on the R-MAT model [27]. There are two parameters, the size and the density of the graphs. We study how these two parameters affect the running time of different methods for PHP, RWR, RT, and KZ.

We download the graph generator available from the website https://github.com/dhruvbird/GTgraph and use the default parameters to generate two series of graphs with varying size and varying density, using RAND and R-MAT respectively. The graphs with varying size have the same density but different number of nodes. The graphs with varying density have the same number of nodes but different densities. The statistics are shown in Table 6.

TABLE 6.

Statistics of in-memory synthetic graphs

Varying size	#Nodes	1 × 2²⁰	2 × 2²⁰	4 × 2²⁰	8 × 2²⁰
	#Edges	1 × 10⁷	2 × 10⁷	4 × 10⁷	8 × 10⁷
	Density	9.5	9.5	9.5	9.5
Varying density	#Nodes	1 × 2²⁰	1 × 2²⁰	1 × 2²⁰	1 × 2²⁰
	#Edges	5 × 10⁶	10 × 10⁶	15 × 10⁶	20 × 10⁶
	Density	4.8	9.5	14.3	19.1

We apply the selected methods for PHP, RWR, RT and KZ on these graphs with k = 20. For each graph, we repeat the query 10³ times with randomly picked query nodes, and report the average running time.
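The Density row of Table 6 appears to be the number of edges divided by the number of nodes. This is our reading of the table rather than a definition stated in the text. Under that assumption, the short check below reproduces the tabulated values from the #Nodes and #Edges rows.

```python
def density(num_edges, num_nodes):
    """Edges per node (assumed definition of 'Density' in Table 6)."""
    return num_edges / num_nodes

# (num_nodes, num_edges) pairs taken from the #Nodes and #Edges rows of Table 6.
VARYING_SIZE = [(s * 2**20, s * 10**7) for s in (1, 2, 4, 8)]
VARYING_DENSITY = [(2**20, e * 10**6) for e in (5, 10, 15, 20)]

size_density = [round(density(m, n), 1) for n, m in VARYING_SIZE]
dens_density = [round(density(m, n), 1) for n, m in VARYING_DENSITY]
print(size_density)
print(dens_density)
```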

8.3.1 Evaluation of FLoS_PHP

Figure 14(a) shows the running time of the selected methods for PHP on the series of RAND graphs with varying size. The running time of GI_PHP increases as the number of nodes increases. FLoS_PHP, DNE, NN_EI and LS_EI all have almost constant running time when the number of nodes increases. This is because these methods only search locally. When the density of the graph is fixed, adding more nodes to the graph will not change the size of the search space of these methods. Figure 14(b) shows the running time on the series of R-MAT graphs with varying size. Similar trends are observed. Comparing Figures 14(a) and 14(b), GI_PHP runs faster on R-MAT than on RAND graphs, while the other methods run slower. The reason is that R-MAT graphs have a power-law degree distribution, so it is easier for FLoS_PHP, DNE, NN_EI and LS_EI to encounter hub nodes with large degree when expanding the subgraph. The faster performance of GI_PHP on R-MAT may be due to the greater data locality around the hub nodes.

Fig. 14 — Running time of different methods for PHP on in-memory synthetic graphs (k = 20)

Figure 14(c) shows the running time of the selected methods for PHP on the series of RAND graphs with varying density. The running time of all the methods increases as the density increases. FLoS_PHP and NN_EI have increasing running time because the number of visited nodes in these two methods increases when the density becomes larger. LS_EI has increasing running time because the number of nodes and edges increases in local clusters. Figure 14(d) shows the running time on the series of R-MAT graphs with varying density. Similar trends are observed.

8.3.2 Evaluation of FLoS_RWR

Figure 15(a) shows the running time of the selected methods for RWR on the series of RAND graphs with varying size. The running time of GI_RWR and Castanet increases as the number of nodes increases. The Castanet method reduces the running time of the GI method by 69% to 88%. FLoS_RWR and LS_RWR both have almost constant running time when the number of nodes increases. This is because FLoS_RWR and LS_RWR only search locally. Figure 15(b) shows the running time on the series of R-MAT graphs with varying size. Similar trends are observed. Comparing Figures 15(a) and 15(b), GI_RWR runs faster on the R-MAT graphs than on the RAND graphs, while the other methods run slower. The reason is similar to that discussed previously for PHP.

Fig. 15 — Running time of different methods for RWR on in-memory synthetic graphs (k = 20)

Figure 15(c) shows the running time on the series of RAND graphs with varying density. The running time of all the methods increases as the density increases. Figure 15(d) shows the running time on the series of R-MAT graphs with varying density. Similar trends are observed.

8.3.3 Evaluation of FLoS_RT

Figure 16(a) shows the running time of the selected methods for RT on the series of R-MAT graphs with varying size. The running time of GI_RT increases as the number of nodes increases. FLoS_RT has almost constant running time when the number of nodes increases, because it only searches locally. LS_RT has increasing running time. LS_RT needs to find the node with the largest residual proximity value in each iteration. When the number of nodes in the graph increases, the search space may also increase. This may be the reason why it has a slightly increasing running time.

Fig. 16 — Running time of different methods for RT on in-memory synthetic graphs (R-MAT, k = 20)

Figure 16(b) shows the running time of the selected methods for RT on the series of R-MAT graphs with varying density. The running time of all the methods increases as the density increases. Both FLoS_RT and LS_RT have increasing running time because they visit more nodes in a graph with larger density.

8.3.4 Evaluation of FLoS_KZ

Figure 17(a) shows the running time of the selected methods for KZ on the series of R-MAT graphs with varying size. The running time of GI_KZ increases as the number of nodes increases. FLoS_KZ has almost constant running time when the number of nodes increases. LS_KZ and AA_KZ both have increasing running time as the graph size increases. LS_KZ needs to update the node with the largest residual proximity value in each iteration, thus it has a slightly increasing running time when the graph size increases. In AA_KZ, computing the upper bound of each node requires linear time O(m). Thus it has increasing running time.

Fig. 17 — Running time of different methods for KZ on in-memory synthetic graphs (R-MAT, k = 20)

Figure 17(b) shows the running time of the selected methods for KZ on the series of R-MAT graphs with varying density. The running time of all the methods increases as the density increases. The reason why FLoS_KZ, LS_KZ and AA_KZ have increasing running time is that they need to visit more nodes as the graph density increases.

8.3.5 Number of Visited Nodes in Local Search Methods

In this subsection, we study the number of visited nodes using different local search methods on synthetic graphs. We use the synthetic graphs with 2²⁰ nodes and 10⁷ edges. The number of returned nodes is fixed to k = 20. Figure 18(a) shows the ratio between the number of nodes visited by different local search methods for PHP and RWR and the total number of nodes in the RAND graph. Figure 18(b) shows the ratio between the number of nodes visited by different local search methods for PHP, RWR, RT and KZ and the total number of nodes in the R-MAT graph. The value indicated by the bar is the average ratio over 10³ queries. The minimum and maximum ratios are also shown in the figure. As can be seen from the figure, the other local search methods need to visit a larger number of nodes than the FLoS methods do. This demonstrates the tightness of the bounds in the FLoS methods.

Fig. 18 — Ratio between the number of visited nodes and the total number of nodes on synthetic graphs for the local search methods (2²⁰ nodes and 10⁷ edges, k = 20)

8.4 Evaluation on Disk-Resident Synthetic Graphs

What if the graphs are too large to fit into memory? To test the performance of FLoS on disk-resident graphs, we generate disk-resident R-MAT graphs, whose statistics are in Table 7. We use the open source Neo4j (available from http://www.neo4j.org) version 2.0 graph database. The FLoS method for disk-resident graphs only calls some basic query functions provided by Neo4j, such as querying the neighbors of one node. The remaining work is the same as that for in-memory graphs. We apply the FLoS_PHP and FLoS_RWR methods on the disk-resident graphs with k = 20. We repeat the query 10³ times with randomly picked query nodes and report the average running time. In the experiments, we restrict the memory usage to 2 GB.

TABLE 7.

Statistics of disk-resident synthetic graphs

#Nodes	16 × 2²⁰	32 × 2²⁰	48 × 2²⁰	64 × 2²⁰
#Edges	16 × 10⁷	32 × 10⁷	48 × 10⁷	64 × 10⁷
Disk size	3.1 G	6.5 G	9.9 G	13.2 G

Figure 19(a) shows the running time of FLoS_PHP and FLoS_RWR. From the figure, we can see that FLoS can process disk-resident graphs in tens of seconds. The reason is that FLoS only needs to find the neighbors of visited nodes and the transition probabilities on the edges. These results also verify that FLoS has almost constant running time when the number of nodes increases. Figure 19(b) shows the ratio of the number of visited nodes to the total number of nodes in the graph. FLoS only needs to explore a small portion of the whole graph to return the top-k nodes. When the graph size becomes larger, the portion of visited nodes becomes smaller.

Fig. 19 — Results of FLoS_PHP and FLoS_RWR on disk-resident synthetic graphs (k = 20)

9 Conclusion

Top-k nodes query in large graphs is a fundamental problem that has attracted intensive research interests. Existing methods need expensive preprocessing steps or are designed for specific proximity measures. In this paper, we propose a unified method, FLoS, which adopts a local search strategy to find the exact top-k nodes efficiently. FLoS is based on the no local optimum property of proximity measures. By exploiting the relationship among different proximity measures, we can also extend FLoS to the proximity measures having local optimum. FLoS can be further extended to solve the top-k reverse-proximity query problem. Extensive experimental results demonstrate that FLoS enables efficient and exact query for a variety of random walk based proximity measures.

Supplementary Material

tkde-wu-2515579-mm.zip (560.4 KB, zip)

Acknowledgments

This work was partially supported by the National Science Foundation grants IIS-1162374, IIS-1218036, IIS-0953950, the NIH/NIGMS grant R01GM103309, and the OSC (Ohio Supercomputer Center) grant PGS0218.

Biographies

Yubao Wu received the Bachelor’s and Master’s degrees from Dalian University of Technology, China. He is a fourth year Ph.D. student in the Department of Electrical Engineering and Computer Science, Case Western Reserve University. His research interests include big data analytics, data mining and bioinformatics.

Ruoming Jin received the doctor’s degree in Computer Science from the Ohio State University in 2005. He is an associate professor in the Computer Science Department at Kent State University. His research interests are on Data Mining, Database, Biomedical Informatics and Cloud Computing.

Xiang Zhang received the doctoral degree in Computer Science from the University of North Carolina at Chapel Hill in 2011. He is the T&D Schroeder Assistant Professor in the Department of Electrical Engineering and Computer Science at Case Western Reserve University. His research bridges the areas of data mining, databases, and bioinformatics.

Contributor Information

Yubao Wu, Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, 44106.

Ruoming Jin, Computer Science Department, Kent State University, Kent, OH, 44240.

Xiang Zhang, Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, 44106.

References

1. Sarkar P, Moore AW. Fast nearest-neighbor search in disk-resident graphs. In: KDD. 2010:513–522.
2. Chakrabarti S, Pathak A, Gupta M. Index design and query processing for graph conductance search. VLDB J. 2011;20(3):445–470.
3. Lee P, Lakshmanan LV, Yu JX. On top-k structural similarity search. In: ICDE. 2012:774–785.
4. Fujiwara Y, Nakatsuji M, Shiokawa H, Mishima T, Onizuka M. Efficient ad-hoc search for personalized PageRank. In: SIGMOD. 2013:445–456.
5. Sarkar P, Moore AW. A tractable approach to finding closest truncated-commute-time neighbors in large graphs. In: UAI. 2007:335–343.
6. Guan Z, Wu J, Zhang Q, Singh A, Yan X. Assessing and ranking structural correlations in graphs. In: SIGMOD. 2011:937–948.
7. Zhang C, Shou L, Chen K, Chen G, Bei Y. Evaluating geosocial influence in location-based social networks. In: CIKM. 2012:1442–1451.
8. Tong H, Faloutsos C, Pan J-Y. Fast random walk with restart and its applications. In: ICDM. 2006:613–622.
9. Bogdanov P, Singh A. Accurate and scalable nearest neighbors in large networks based on effective importance. In: CIKM. 2013:523–528.
10. Fang Y, Chang K-C, Lauw HW. RoundTripRank: Graph-based proximity with importance and specificity. In: ICDE. 2013:613–624.
11. Wu X-M, Li Z, So AM, Wright J, Chang S-F. Learning with partially absorbing random walks. In: NIPS. 2012:3077–3085.
12. Saad Y. Iterative methods for sparse linear systems. SIAM; 2003.
13. Khemmarat S, Gao L. Fast top-k path-based relevance query on massive graphs. In: ICDE. 2014:316–327.
14. Fujiwara Y, Nakatsuji M, Onizuka M, Kitsuregawa M. Fast and exact top-k search for random walk with restart. PVLDB. 2012;5(5):442–453.
15. Fujiwara Y, Nakatsuji M, Yamamuro T, Shiokawa H, Onizuka M. Efficient personalized PageRank with accuracy assurance. In: KDD. 2012:15–23.
16. Cohen S, Kimelfeld B, Koutrika G. A survey on proximity measures for social networks. In: Search Computing. 2012:191–206.
17. Sarkar P, Moore AW, Prakash A. Fast incremental proximity search in large graphs. In: ICML. 2008:896–903.
18. Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43.
19. Zhao X, Chang A, Sarma AD, Zheng H, Zhao BY. On the embeddability of random walk distances. PVLDB. 2013;6(14):1690–1701.
20. Mei Q, Zhou D, Church K. Query suggestion using hitting time. In: CIKM. 2008:469–478.
21. Berkhin P. Bookmark-coloring algorithm for personalized PageRank computing. Internet Mathematics. 2006;3(1):41–62.
22. Gupta M, Pathak A, Chakrabarti S. Fast algorithms for top-k personalized PageRank queries. In: WWW. 2008:1225–1226.
23. Esfandiar P, Bonchi F, Gleich DF, et al. Fast Katz and commuters: Efficient estimation of social relatedness in large networks. In: Algorithms and Models for the Web-Graph. 2010:132–145.
24. Benczur AA, Csalogany K, Sarlos T, Uher M. SpamRank: Fully automatic link spam detection work in progress. In: AIRWeb. 2005.
25. Yu AW, Mamoulis N, Su H. Reverse top-k search using random walk with restart. PVLDB. 2014;7(5):401–412.
26. Erdős P, Rényi A. On the evolution of random graphs. Magyar Tud Akad Mat Kutató Int Közl. 1960;5:17–61.
27. Chakrabarti D, Zhan Y, Faloutsos C. R-MAT: A recursive model for graph mining. In: SDM. 2004:442–446.
28. Jeh G, Widom J. Scaling personalized web search. In: WWW. 2003:271–279.
29. Meyer C. Matrix analysis and applied linear algebra. SIAM; 2000.
30. Guillemin EA. Introductory circuit theory. John Wiley & Sons; 1953.
31. Jeh G, Widom J. SimRank: A measure of structural-context similarity. In: KDD. 2002:538–543.
