Scientific Reports. 2017 Aug 2;7:7147. doi: 10.1038/s41598-017-07315-4

Link Prediction in Evolving Networks Based on Popularity of Nodes

Tong Wang 1, Xing-Sheng He 1, Ming-Yang Zhou 2,3, Zhong-Qian Fu 1
PMCID: PMC5540936  PMID: 28769053

Abstract

Link prediction aims to uncover the underlying relationships behind networks, which can be used to predict missing edges or identify spurious ones. The key issue is to estimate the likelihood of potential links. Most classical methods based on static structure ignore the temporal aspects of networks and therefore perform poorly in evolving networks. In this paper, we propose the hypothesis that the ability of each node to attract links depends not only on its structural importance but also on its current popularity (activeness), since active nodes are much more likely to attract future links. We then propose a novel approach, the popularity based structural perturbation method (PBSPM), together with a fast algorithm, to characterize the likelihood of an edge from both the existing connectivity structure and the current popularity of its two endpoints. Experiments on six evolving networks show that the proposed methods outperform state-of-the-art methods in accuracy and robustness. In addition, visual results and statistical analysis reveal that the proposed methods tend to predict future edges between active nodes rather than between inactive nodes.

Introduction

Networks are effective descriptions of complex systems in society and nature [1, 2], with entities denoted as nodes and relations as links. The organization of real networks evolves under the influence of both regular patterns and irregular factors; in principle, only the former can be modeled with physical methodologies. A significant concern in complex networks is link prediction, which helps to explain these models and reveal the hidden driving mechanisms. Link prediction has therefore drawn much attention from various fields, including biology and sociology [3–6]. For example, in protein-protein interaction experiments in cells, only strong relations between proteins can be detected because of the limited precision of the equipment. Measuring the interaction between every pair of proteins is prohibitive because experimental costs increase sharply with the number of proteins [7, 8]; an appropriate approach is to evaluate the likelihood of potential relations and specifically test the non-existing relations with high likelihood. Likewise, in social contexts, two people are likely to become friends in the near future if they have many common friends or attributes, which can be used to uncover lost friends or predict future friends [9–11]. Further applications include personalized recommendation in e-commerce [12, 13] and aircraft route planning [14].

The crux of link prediction is to evaluate the likelihood of potential edges. Potential edges are then ranked in descending order of likelihood, and those at the top of the list are predicted as underlying or future edges [15, 16]. Similarity based approaches, which equate likelihood with similarity, are the most common framework and assume that prospective edges exist between similar nodes. Traditional attribute based methods measure the likelihood of a link by the number of common features (e.g. hobbies, ages, tastes, geographical locations) that its two endpoints share [17]. Many studies of social networks have shown that pervasive homophily promotes ties between similar people [18, 19]. However, such methods suffer from inaccessible and unreliable node information because of privacy policies in real scenarios [20]. Complex network theory offers an alternative that requires only the topological structure of the network and no private information. Structure based methods can be classified into three categories according to the structural information they use: local, global and quasi-global methods. Local similarity is mainly based on common neighbors, as in the well-known Common Neighbor (CN) index, which counts the number of common neighbors [21], and the Adamic-Adar (AA) and Resource Allocation (RA) indices, which depress the contribution of large-degree neighbors [22, 23]. For large networks, Cui et al. proposed a fast algorithm for calculating the number of common neighbors [24]. Global similarity emphasizes the global topology of the network, as in the Katz index, which counts all paths between two nodes [25]. Quasi-global similarity is a trade-off between local and global methods, as in the Local Path (LP) index, which considers only the short paths in the Katz index [23], and the Local Random Walk (LRW) index, which focuses on limited random walks in a local area [26].

Beyond similarity, algorithms based on maximum likelihood and other elaborate models have been proposed. Clauset et al. proposed a Hierarchical Structure Model that uses a dendrogram and performs well in hierarchical networks [27]. Lü et al. proposed a Structural Perturbation Method that approximates the observed network through repeated random perturbations and outperforms state-of-the-art methods in accuracy and robustness [28]. From the perspective of information theory, Xu et al. proposed the Path Entropy index, which considers the information entropies of shortest paths and penalizes long paths [29]. Tan et al. proposed a Mutual Information (MI) method with high accuracy and reasonable computation time, which defines the likelihood of a link as the conditional self-information of that link existing between a node pair given their common neighbors [30]. Zhu et al. generalized the MI index into Neighbor Set Information, which applies to multiple structural features and enhances the accuracy [31].

Real networks are highly dynamic, with nodes and edges coming and going [32]. However, the aforementioned algorithms all ignore the temporal aspects of real networks, in particular the trend of nodes: nodes that were active yesterday and contacted numerous neighbors may be unpopular today. Inspired by this, we propose the hypothesis that the emergence of future links is determined not only by the existing network structure but also by the popularity of the endpoints. Fig. 1 illustrates the effect of popularity. The red node will enter the network and connect with one of the existing nodes. In Fig. 1(a), according to the static analysis, node 10 prefers to connect with the large-degree node 1. When the birth time of each edge is given, as in Fig. 1(b), we can see that node 3 is highly popular because it is the only node that attracts edges at the present time $t_2$. In practice, the fresh edge is more likely to occur between node 10 and the active node 3 in the next period $t_3$. To reflect this scenario, unlike previous works that predict potential links mostly from static networks, we propose a popularity based structural perturbation method (PBSPM) and its fast algorithm, which integrate the popularity of nodes with the observed network topology to predict future edges. Experimental results on real-world networks show that the proposed methods outperform traditional approaches in accuracy and robustness.

Figure 1.

Illustration of popularity. A fresh link and node 10 will be added to the existing network at the next time $t_3$. In panel (a), the attractiveness of nodes is determined by static features: under preferential attachment, node 10 prefers to connect with node 1, which has the largest degree. In panel (b), temporal effects are considered: the currently popular node 3 becomes attractive and may connect with node 10 at time $t_3$.

Results

Popularity metrics

The definition of popularity is related to the concepts of temporal trend of nodes that could be obtained through the statistics and analysis of relevant historical information. For two nodes with the same degree, one may connect with its neighbors at early stage and not form any connections later, while the other one develops most of its connections at late stage. Intuitively the latter node would attract more fresh edges with high probability in the near future. Given this, a straightforward approach to evaluate the popularity of a node is counting the edges it recently attracts.

Given an undirected and unweighted network $G(V, E)$, where $V$ and $E$ are the sets of nodes and links, each link carries a time-stamp that records when it entered the network. Multi-links and self-loops are not allowed. Let $k_i(t)$ denote the degree of node $i$ at time $t$. In the next time span $T$, node $i$ attracts $\Delta k_i(t, T)$ new edges,

$$\Delta k_i(t, T) = k_i(t + T) - k_i(t). \tag{1}$$

Note that $\Delta k_i(t, T)$ in Eq. (1), which depends on both $t$ and $T$, cannot reflect the relative popularity of node $i$: even when large-degree nodes become inactive, they still attract more fresh edges than small-degree nodes because of the preferential attachment mechanism. To solve this issue, for a dataset spanning from $t_a$ to $t_c$, we divide its edges into a fresh set and an old set according to a boundary $t_b \in (t_a, t_c)$. An edge created in $(t_a, t_b)$ belongs to the old set; otherwise it belongs to the fresh set. The fractions of old and fresh edges are denoted $p_{older}$ and $p_{fresher}$; $p_{fresher}$ can be understood as the observation length of historical information. The popularity of node $i$ is then

$$s_i = \frac{\Delta k_i(t_b, t_c - t_b)}{\Delta k_i(t_a, t_c - t_a)} = \frac{k_{i,fresher}}{k_{i,all}}, \tag{2}$$

where $k_{i,all}$ and $k_{i,fresher}$ are the total degree and the fresh degree of node $i$. Equation (2) overcomes the drawback of simply counting new edges and normalizes the popularity. If all links of node $i$ lie in the fresh set, $s_i = 1$; if all of them lie in the old set, node $i$ is dormant and $s_i = 0$. Therefore $s_i \in [0, 1]$, and a higher $s_i$ means a higher popularity.
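As an illustration of Eq. (2), the following minimal sketch computes the popularity of every node from a time-stamped edge list. The function name `popularity`, the `(u, v, t)` edge format and the convention that an edge born exactly at the boundary counts as fresh are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def popularity(edges, t_b):
    """Return {node: s_i} with s_i = k_fresher / k_all, as in Eq. (2).

    edges: iterable of (u, v, t), where t is the birth time of edge (u, v).
    t_b:   boundary between the old set and the fresh set. Edges with
           t >= t_b are treated as fresh (an assumed convention).
    """
    k_all = defaultdict(int)
    k_fresh = defaultdict(int)
    for u, v, t in edges:
        for node in (u, v):
            k_all[node] += 1
            if t >= t_b:
                k_fresh[node] += 1
    return {node: k_fresh[node] / k_all[node] for node in k_all}


# Toy example: node 1 gained all of its edges early, node 3 gained most of them late.
edges = [(1, 2, 0), (1, 3, 1), (1, 4, 2), (3, 4, 8), (3, 5, 9)]
s = popularity(edges, t_b=5)
```

In this toy network node 1 has the same degree as node 3 but a popularity of 0, whereas node 5, whose only edge is fresh, has a popularity of 1.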

Popularity based structural perturbation method

In this section, we propose the hypothesis that the observed network is determined by some latent attractors (e.g. similar hobbies, ages, gender, location) that independently influence the structural properties. For an attractor $x_k = [x_{k,1}, x_{k,2}, \ldots, x_{k,n}]^T$, $x_{k,i}$ represents the attractiveness of node $i$ with respect to the latent attractor $x_k$. Inspired by the configuration model, the probability $p_{ij}$ that an edge exists between two nodes $i$ and $j$ is proportional to $x_{k,i} x_{k,j}$. Supposing that there are $m$ kinds of attractors, $p_{ij}$ is defined as the weighted influence of all attractors,

$$p_{ij} = \sum_{k=1}^{m} w_k x_{k,i} x_{k,j}, \tag{3}$$

where $w_k$ is a tunable parameter that balances the relative influence of attractor $x_k$. The problem is to find the $w_k$ and $x_{k,i}$ that make $p_{ij}$ approximate $a_{ij}$ as closely as possible. For a network $G$ with adjacency matrix $A = (a_{ij})_{n \times n}$, a special case is $p_{ij} = 1$ if $a_{ij} = 1$ and $p_{ij} = 0$ otherwise. For the optimal $w_k$ and $x_k$,

$$A_p = (p_{ij})_{n \times n} = \sum_{k=1}^{m} w_k x_k x_k^T. \tag{4}$$

If $m = n$ in Eq. (4), where $n$ is the size of the network, Eq. (4) can be understood as a matrix decomposition, with $w_k$ and $x_k$ being the eigenvalues and eigenvectors, respectively. In practice, many random connections exist in networks, and Lü et al. proposed the structural perturbation method (SPM) to reduce the influence of this randomness [28]. In SPM, a small fraction $p_H$ of the edges, $\Delta A$, is removed from the network, and the adjacency matrix $A_R$ of the remaining network is decomposed into

$$A_R = \sum_{k=1}^{n} \lambda_k x_k x_k^T, \tag{5}$$

where $\lambda_k$ and $x_k$ are the eigenvalues and eigenvectors of $A_R$, with $|x_k| = 1$. We can use $A_R$ to estimate $A$ with

$$\tilde{A} = \sum_{k=1}^{n} (\lambda_k + \Delta\lambda_k) x_k x_k^T, \tag{6}$$

where $\Delta\lambda_k \approx \frac{x_k^T \Delta A \, x_k}{x_k^T x_k}$ is the coupling influence of $x_k$ on $\lambda_k$. $\tilde{A}$ is in fact a special case of $A_p$, in which $(\lambda_k + \Delta\lambda_k)$ plays the role of the weight and the elements of eigenvector $x_k$ represent the attractiveness with respect to attractor $x_k$.
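The first-order reconstruction of Eq. (6) can be sketched in a few lines of NumPy. This is an illustrative sketch for one perturbation only, not the authors' implementation; the function name `spm_scores` and the toy graph are assumptions. It relies on `numpy.linalg.eigh` returning orthonormal eigenvectors, so the denominator $x_k^T x_k$ equals 1.

```python
import numpy as np


def spm_scores(A_R, dA):
    """Perturbed matrix of Eq. (6) for a single perturbation.

    A_R: symmetric adjacency matrix of the remaining network.
    dA:  symmetric adjacency matrix of the removed edges.
    """
    lam, X = np.linalg.eigh(A_R)               # columns of X are unit eigenvectors x_k
    # Delta lambda_k = x_k^T dA x_k (x_k^T x_k = 1 for eigh output)
    dlam = np.einsum("ik,ij,jk->k", X, dA, X)
    # sum_k (lam_k + dlam_k) x_k x_k^T
    return (X * (lam + dlam)) @ X.T


# Toy example (assumed): a 4-cycle from which the edge (0, 1) is removed.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
dA = np.zeros_like(A)
dA[0, 1] = dA[1, 0] = 1.0
A_R = A - dA
A_tilde = spm_scores(A_R, dA)
```

With an empty perturbation the function returns $A_R$ itself, which is a quick sanity check that the decomposition of Eq. (5) is reproduced.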

As we have argued, the ability of node $i$ to attract new edges is determined by both latent attractors and its current popularity. To better reflect practice, an advanced attractiveness $x'_{k,i}$ is proposed as

$$x'_{k,i} = x_{k,i}(1 + \alpha s_i), \tag{7}$$

where $\alpha$ controls the strength of temporal popularity. Equation (7) combines the static attractiveness with the popularity and thus captures both the static features and the temporal information of the evolving pattern. Substituting $x_k$ in Eq. (6) with $x'_k$ to predict future links gives

$$\tilde{A}' = \sum_{k=1}^{n} (\lambda_k + \Delta\lambda_k) x'_k {x'_k}^T. \tag{8}$$

Equation (5) degenerates into Eq. (4) if the number $m$ of attractors is less than $n$. Supposing that $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$, we substitute $w_k$ and $x_k$ in Eq. (4) with $\lambda_k$ and $x_k$ in Eq. (5). Following the same transition as from Eq. (5) to Eq. (8), we obtain

$$\tilde{A}' = (p_{ij})_{n \times n} = \sum_{k=1}^{m} (\lambda_k + \Delta\lambda_k) x'_k {x'_k}^T, \tag{9}$$

which reduces to Eq. (8) if $m = n$. In the following experiments, we first measure the performance of Eq. (8) and then show that the computational complexity can be reduced by using only a few eigenvalues and eigenvectors, that is, $m \ll n$ in Eq. (9).
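Equations (7) to (9) can be sketched for a single perturbation as follows. This is a hedged illustration rather than the authors' code: the function name `pbspm_scores`, the optional argument `m` and the toy inputs are assumptions. Eigenpairs are ordered by decreasing $|\lambda_k|$, and `m=None` keeps all of them (Eq. 8), while a small `m` gives the truncated form of Eq. (9).

```python
import numpy as np


def pbspm_scores(A_R, dA, s, alpha, m=None):
    """Popularity-weighted perturbed matrix for one perturbation.

    A_R, dA: remaining and removed adjacency matrices (symmetric).
    s:       popularity s_i of every node, values in [0, 1].
    alpha:   strength of temporal popularity (alpha = 0 gives plain SPM).
    m:       number of leading eigenpairs to keep; None keeps all n.
    """
    lam, X = np.linalg.eigh(A_R)
    dlam = np.einsum("ik,ij,jk->k", X, dA, X)          # first-order eigenvalue shift
    order = np.argsort(-np.abs(lam))                   # |lam_1| >= |lam_2| >= ...
    if m is not None:
        order = order[:m]
    # Eq. (7): row i of every kept eigenvector is scaled by (1 + alpha * s_i)
    Xp = X[:, order] * (1.0 + alpha * np.asarray(s, dtype=float))[:, None]
    return (Xp * (lam[order] + dlam[order])) @ Xp.T


# Toy inputs (assumed for illustration only).
A_T = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
dA = np.zeros_like(A_T)
dA[0, 1] = dA[1, 0] = 1.0
A_R = A_T - dA
s = np.array([0.0, 0.0, 0.25, 0.5, 1.0])

scores_full = pbspm_scores(A_R, dA, s, alpha=3.0)        # Eq. (8)
scores_fast = pbspm_scores(A_R, dA, s, alpha=3.0, m=2)   # Eq. (9) with m = 2
```

Because the popularity factor multiplies both endpoints, the score of a pair $(i, j)$ is scaled by $(1 + \alpha s_i)(1 + \alpha s_j)$ relative to the unweighted reconstruction, which is why larger $\alpha$ favors pairs of active nodes.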

Experiments on real networks

The proposed PBSPM, which integrates the attractiveness $x_{k,i}$ and the popularity $s_i$, reduces to the original SPM when $\alpha = 0$. As $\alpha$ increases, PBSPM increasingly favors links between popular nodes. Figure 2 shows the performance of PBSPM in contrast to SPM ($\alpha = 0$) under different $p_{fresher}$. The precision tends to stabilize or peak when $\alpha$ brings the static attractiveness and the popularity into balance. The optimal value of $\alpha$ clearly varies across networks. For Hypertext, Infec and UcSoci, future links are highly likely to appear between active nodes. For the Haggle dataset, however, the temporal trend of nodes is less obvious; hence its precision curve peaks at $\alpha = 2$, in contrast to the other three networks, whose curves eventually stabilize as $\alpha$ increases. Overall, when $\alpha \in [3, 5]$, PBSPM performs better than SPM in all four networks. Moreover, for the different lengths of historical information $p_{fresher}$, all curves show some degree of superiority in precision, which suggests that a broad range of $p_{fresher}$ works robustly. It is difficult to choose the optimal value, which should balance the length of historical information against that of future information (the probe set). With the 10% probe set used in this experiment, $p_{fresher} = 0.1$ is the balanced option because the corresponding curves all show large improvements.

Figure 2.

Precision versus $\alpha$ obtained by PBSPM. The experiments are performed on a 90% training set and a 10% probe set. Each data point is averaged over 10 independent realizations. The values of $p_{fresher}$ and $\alpha$ that give the optimal precision reported in Table 1 vary across networks: 0.05 and 9 for Hypertext, 0.05 and 2 for Haggle, 0.10 and 11 for Infec, and 0.10 and 7 for UcSoci.

Reducing the number of eigenvectors reduces the computational complexity. We therefore propose the fast PBSPM, which keeps only the few eigenvectors with the largest eigenvalues; these reflect the backbone structure of networks well [33]. In practical networks, a huge gap exists in the eigenvalue spectrum, and eigenvectors with large eigenvalues play more important roles than those with small eigenvalues. Taking Hypertext as an example, Fig. 3(a) plots the precision for various $m$ in Eq. (9). Compared with SPM, the curve shows significant improvements and peaks at $m = 1$, which confirms the effectiveness of Eq. (9). Figure 3(b) gives the difference between two adjacent eigenvalues, $g_m = |\lambda_m| - |\lambda_{m+1}|$ (with $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$). The distinct $g_1$ indicates a huge gap between $|\lambda_1|$ and $|\lambda_2|$, while the other gaps ($m \geq 2$) are all close to 0, suggesting that the huge gap $g_1$ induces the decline of precision when $m > 1$. We therefore choose $m = 1$ as the optimal value for Hypertext. Analogously, the values for Haggle, Infec and UcSoci are determined as $m = 2$, $19$ and $2$, respectively, after which $g_m$ is approximately 0. Consequently, it requires only $O(n^2)$ time to calculate the top-$m$ eigenvalues and corresponding eigenvectors, and the reconstruction of the similarity matrix (Eq. 9) needs $O(m \times n^2)$ time. To reduce the randomness, the fast PBSPM repeats the random perturbation ten times and obtains the averaged similarity matrix in $O(10 \times (m n^2 + n^2))$ time. Hence, with $m \ll n$ and increasing size $n$, the time complexity of the fast PBSPM is $O(n^2)$, in contrast with the $O(n^3)$ time of PBSPM and SPM, whose decomposition and reconstruction consume $O(n^3)$ time. For comparison, the time complexity is $O(n^2)$ for local similarity based methods such as CN, RA and AA, and $O(n^3)$ for Katz and SRW.
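The eigenvalue gaps $g_m$ can be inspected with a short NumPy sketch. The selection rule in `pick_m` (keep everything up to the last gap larger than a fraction `tol` of the largest gap) is a hypothetical heuristic introduced here to mimic the description "after which $g_m$ is approximately 0"; the paper does not specify a threshold. For simplicity the sketch computes the full spectrum with `numpy.linalg.eigvalsh`, which costs $O(n^3)$; a fast implementation would instead use an iterative top-$m$ solver such as `scipy.sparse.linalg.eigsh`.

```python
import numpy as np


def eigen_gaps(A):
    """Return g_m = |lambda_m| - |lambda_{m+1}| for m = 1 .. n-1."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]   # |lambda| in descending order
    return mags[:-1] - mags[1:]


def pick_m(A, tol=0.05):
    """Hypothetical rule: smallest m after which all gaps are below tol * max gap."""
    g = eigen_gaps(A)
    significant = np.flatnonzero(g > tol * g.max())
    return int(significant[-1]) + 1


# Toy graphs (assumed): a complete graph and a star, both on 6 nodes.
K6 = np.ones((6, 6)) - np.eye(6)
star = np.zeros((6, 6))
star[0, 1:] = star[1:, 0] = 1.0

m_complete = pick_m(K6)    # one dominant eigenvalue (5 versus |-1|)
m_star = pick_m(star)      # two eigenvalues of equal magnitude, sqrt(5) and -sqrt(5)
```

The two toy graphs show why the gap matters: the complete graph has a single dominant eigenvalue, whereas the star has a pair of equal magnitude that should be kept together.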

Figure 3.

Precision versus $m$ and gap $g_m = |\lambda_m| - |\lambda_{m+1}|$ for Hypertext. $\lambda_m$ is an eigenvalue of the adjacency matrix $A_T$. Panel (a) shows the performance of Eq. (9) for $m \in [1, 30]$ with fixed $p_{fresher} = 0.05$ and $\alpha = 9$. Each data point is obtained over ten simulations. Panel (b) shows the difference $g_m$ between $|\lambda_m|$ and $|\lambda_{m+1}|$; $g_1 = 34.37$ is distinct and the others are all close to 0.

Table 1 and Table 2 list the precision and computation time of the different link prediction algorithms. The proposed methods achieve remarkable improvements over the best baseline: up to 84.84% for Hypertext, 28.42% for Haggle, 6.19% for Infec and 95.97% for UcSoci. Nevertheless, PBSPM suffers from a high computational cost that limits its wider application. Fast PBSPM offers a trade-off between computational complexity and accuracy, with reasonable cost and high accuracy. Because of the repeated steps in the experimental procedure, the fast algorithm still consumes more time than some traditional predictors with the same time complexity. In addition, the attractors ignored by the fast algorithm contain secondary information that may either improve the accuracy as useful information or degrade it as network noise; hence the precision of the fast algorithm fluctuates slightly around that of PBSPM. In general, the proposed methods are robust because they perform well on disparate networks, whereas the baselines give poor predictions for some networks. Apart from the precision improvements, we also quantify the difference in the age of links selected by the various methods, defined as the average popularity of the endpoints, $\bar{s} = \sum_{e_{ij}} \frac{s_i + s_j}{2|E_P|}$, where the sum runs over the edges $e_{ij}$ selected by a given predictor. According to Table 3, links selected by the proposed methods have a much larger $\bar{s}$ than those selected by the other methods; that is, the predicted links tend to form between active nodes in the near future.

Table 1.

Precision of different methods for four networks. All the results are calculated under the optimal cases by adjusting parameters if any.

| Precision | CN | AA | RA | Katz | SRW | SPM | PBSPM | Fast PBSPM |
|---|---|---|---|---|---|---|---|---|
| Hypertext | 0.0959 | 0.1050 | 0.1005 | 0.0959 | 0.1187 | 0.0984 | 0.2128 | 0.2194 |
| Haggle | 0.1786 | 0.1888 | 0.1939 | 0.2041 | 0.2194 | 0.2928 | 0.3475 | 0.3760 |
| Infec | 0.0233 | 0.1163 | 0.1814 | 0.0233 | 0.3023 | 0.1949 | 0.3210 | 0.3070 |
| UcSoci | 0.0138 | 0.0153 | 0.0138 | 0.0138 | 0.0046 | 0.0298 | 0.0584 | 0.0574 |

The data in bold face are averaged over ten realizations with the same p fresher and α.
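The relative improvements quoted in the text can be reproduced from Table 1 with a few lines of Python. The sketch assumes that "improvement" means the better of PBSPM and Fast PBSPM compared with the best of the six baselines in the same row; this reading is an inference that happens to match the quoted percentages.

```python
# Precision values copied from Table 1.
table1 = {
    "Hypertext": {"CN": 0.0959, "AA": 0.1050, "RA": 0.1005, "Katz": 0.0959,
                  "SRW": 0.1187, "SPM": 0.0984, "PBSPM": 0.2128, "Fast PBSPM": 0.2194},
    "Haggle":    {"CN": 0.1786, "AA": 0.1888, "RA": 0.1939, "Katz": 0.2041,
                  "SRW": 0.2194, "SPM": 0.2928, "PBSPM": 0.3475, "Fast PBSPM": 0.3760},
    "Infec":     {"CN": 0.0233, "AA": 0.1163, "RA": 0.1814, "Katz": 0.0233,
                  "SRW": 0.3023, "SPM": 0.1949, "PBSPM": 0.3210, "Fast PBSPM": 0.3070},
    "UcSoci":    {"CN": 0.0138, "AA": 0.0153, "RA": 0.0138, "Katz": 0.0138,
                  "SRW": 0.0046, "SPM": 0.0298, "PBSPM": 0.0584, "Fast PBSPM": 0.0574},
}

BASELINES = ("CN", "AA", "RA", "Katz", "SRW", "SPM")
PROPOSED = ("PBSPM", "Fast PBSPM")


def improvement(row):
    """Relative gain (in percent) of the best proposed method over the best baseline."""
    best_baseline = max(row[name] for name in BASELINES)
    best_proposed = max(row[name] for name in PROPOSED)
    return 100.0 * (best_proposed / best_baseline - 1.0)


gains = {network: round(improvement(row), 2) for network, row in table1.items()}
```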

Table 2.

Computation time of different methods for four networks.

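| Time (ms) | CN | AA | RA | Katz | SRW | SPM | PBSPM | Fast PBSPM |
|---|---|---|---|---|---|---|---|---|
| Hypertext | 1.02 | 1.12 | 1.08 | 1.51 | 1.95 | 20.3 | 20.51 | 15.73 |
| Haggle | 2.25 | 2.58 | 2.54 | 3.08 | 4.78 | 50.45 | 51.62 | 28.86 |
| Infec | 5.62 | 6.22 | 5.95 | 7.39 | 11.4 | 175.71 | 179.15 | 92.8 |
| UcSoci | 204.27 | 239.23 | 228.62 | 272.06 | 856.08 | 15200.76 | 15902.53 | 1122.95 |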

All the results are averaged over ten runs on AMD R7 computer with MATLAB R2016b and 8GB RAM.

Table 3.

Average age of links selected by predictors.

| Average age | CN | AA | RA | Katz | SRW | SPM | PBSPM | Fast PBSPM |
|---|---|---|---|---|---|---|---|---|
| Hypertext | 0.0411 | 0.0420 | 0.0445 | 0.0407 | 0.0509 | 0.0420 | 0.2115 | 0.2243 |
| Haggle | 0.0508 | 0.0513 | 0.0518 | 0.0488 | 0.0489 | 0.0609 | 0.1313 | 0.1265 |
| Infec | 0.0275 | 0.1228 | 0.2180 | 0.0275 | 0.4511 | 0.2148 | 0.7901 | 0.8145 |
| UcSoci | 0.0611 | 0.0666 | 0.0861 | 0.0617 | 0.1498 | 0.0672 | 0.4407 | 0.4051 |

The bold data are averaged over ten runs and obtained under the optimal parameters.

In the following, we focus on SPM and PBSPM to explore the underlying reasons for the improvements. To examine the effect of popularity, we choose four typical nodes from the training set of Hypertext: the large-degree nodes 1 and 3 ($k_{1,training} = 78$, $s_1 = 0.051$; $k_{3,training} = 93$, $s_3 = 0.032$) and the active nodes 91 and 113 ($k_{91,training} = 29$, $s_{91} = 0.289$; $k_{113,training} = 14$, $s_{113} = 1$), and analyse their predicted connections and the corresponding variation of attractiveness. Figure 4 plots the future links attached to the selected nodes as predicted by SPM and PBSPM when $p_{fresher} = 0.05$ and $\alpha = 9$. The principal eigenvector $x_1$ of $A_R$ and the advanced $x'_1$ under the optimal case are then calculated to quantify the attractiveness with respect to the most heavily weighted attractor. The principal eigenvector also characterizes the ranking of nodes, i.e. their importance [34, 35]. In Fig. 4(a), nodes 1 and 3 ($x_{1,1} = 0.1715$, $x_{1,3} = 0.1899$), which have high importance, are much more attractive than nodes 91 and 113 ($x_{1,91} = 0.0648$, $x_{1,113} = 0.0329$); in particular, node 113, with the lowest importance, has no connections at all. In contrast, high popularity boosts the active nodes ($x'_{1,91} = 0.1158$, $x'_{1,113} = 0.1923$) and results in a burst of links connecting to them in Fig. 4(b), most notably to the most active node 113. In summary, PBSPM emphasizes nodes with higher popularity so that they attract many more links, whereas inactive nodes, despite their importance, are weakened and receive fewer connections.

Figure 4.

Predicted connections of the large-degree nodes 1 and 3 and the active nodes 91 and 113 in Hypertext. Only the selected nodes and their neighbors are plotted, and the connections are a subset of the top-$|E_P|$ predicted links. Panel (a) shows the connections predicted by SPM: nodes 1 and 3 are much more attractive than node 91, and node 113 is absent because it has no connections. Panel (b) shows the connections predicted by PBSPM: the active nodes 91 and 113 attract numerous nodes, which gives rise to an explosive growth of edges.

The figures above help to explain how popularity affects several typical nodes, but it is reasonable to expect that the improvements result from the advanced attractiveness of all nodes. As argued above, the principal eigenvector denotes the attractiveness with respect to the most heavily weighted attractor. Because $(\lambda_1 + \Delta\lambda_1)(x_1 x_1^T)$ makes up the main body of $\tilde{A}$, and neglecting the constant factor $\lambda_1 + \Delta\lambda_1$, the similarity $\tilde{a}_{ij}$ is mainly determined by the eigenvector $x_1$. The Pearson correlation coefficient (CC) between the principal eigenvector and the degree in the probe set, which reflects the overall extent to which the attractiveness $x_{1,i}$ coincides with the real degree increment $k_{i,probe}$, is computed as

$$cc = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_{1,i} - \bar{x}_{1,i}}{\delta_{x_{1,i}}} \right) \left( \frac{k_{i,probe} - \bar{k}_{i,probe}}{\delta_{k_{i,probe}}} \right), \tag{10}$$

where $\bar{x}_{1,i}$ and $\bar{k}_{i,probe}$ are the means of $x_{1,i}$ and $k_{i,probe}$, and $\delta_{x_{1,i}}$ and $\delta_{k_{i,probe}}$ are their standard deviations. The CC between the advanced $x'_1$ and the degree in the probe set is obtained similarly. Table 4 lists the variation of CC after the addition of popularity, together with the coupling influence $\Delta\lambda_1$, averaged over ten independent perturbations. The positive $\Delta$CC in all four networks suggests that the attractiveness of some nodes is corrected to match the future degree increment. Furthermore, the positive $\Delta\lambda_1$ strengthens the improvement of the correlations. As a result, popular nodes are assigned more connecting opportunities, which raises the precision.
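Equation (10) is the standard Pearson correlation written with population statistics (the $1/n$ factor), which the following sketch makes explicit. The function name `pearson_cc` and the sample vectors are illustrative assumptions; the numbers are not taken from the datasets.

```python
import numpy as np


def pearson_cc(x, k):
    """Pearson correlation of Eq. (10): mean of the product of z-scores."""
    x = np.asarray(x, dtype=float)
    k = np.asarray(k, dtype=float)
    zx = (x - x.mean()) / x.std()      # np.std defaults to the population form (ddof=0)
    zk = (k - k.mean()) / k.std()
    return float(np.mean(zx * zk))


# Illustrative values: attractiveness of five nodes and their degree in the probe set.
x1 = [0.17, 0.19, 0.06, 0.03, 0.10]
k_probe = [5, 7, 3, 1, 2]
cc = pearson_cc(x1, k_probe)
```

Applying the same function to the advanced eigenvector $x'_1$ and subtracting the two results would give a quantity of the kind reported as $\Delta$CC.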

Table 4.

Variation of correlation coefficient $\Delta$CC and coupling influence $\Delta\lambda_1$.

| Networks | Hypertext | Haggle | Infec | UcSoci |
|---|---|---|---|---|
| $\Delta$CC | 0.28 | 0.0582 | 0.3092 | 0.1155 |
| $\Delta\lambda_1$ | 4.1822 | 4.79 | 1.86 | 4.2183 |

Each value is averaged over ten perturbations.

Finally, to demonstrate the feasibility of the proposed methods in practical applications, we compare the fast PBSPM on continuous temporal networks with time series (TS) based methods, which have been applied effectively to temporal link prediction [36–38]. For each network, the dataset is divided into $T_N$ snapshots $(G_1, G_2, \ldots, G_{T_N})$, each covering a time period of $P_{length} = 7$ days. With a time window of $T = 5$, we use the graph series $(G_t, G_{t+1}, \ldots, G_{t+T-1})$ and its reduced static graph $G_{t \sim t+T-1}$ to predict the links that will occur in $G_{t+T}$ ($t = 1, 2, \ldots, T_N - T$). The popularity of each node is then calculated as

$$s_i = \frac{k_{i,G_{t+T-1}}}{k_{i,G_{t \sim t+T-1}}} = \frac{k_{i,fresher}}{k_{i,all}}. \tag{11}$$
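For the snapshot setting, Eq. (11) takes the degree in the last snapshot of the window as the fresh degree and the degree in the reduced static graph as the total degree. The sketch below assumes that the reduced graph is the union of the window's snapshots with repeated edges merged, that snapshots are given as sets of node pairs, and that `t` is a 0-based index (the text counts from 1); the function name `window_popularity` is hypothetical.

```python
from collections import Counter


def window_popularity(snapshots, t, T):
    """Popularity of Eq. (11) for the window snapshots[t : t + T].

    snapshots: list of sets of undirected edges (u, v).
    """
    window = snapshots[t:t + T]

    def normalise(edge):
        return tuple(sorted(edge))          # treat (u, v) and (v, u) as the same edge

    union = {normalise(e) for snap in window for e in snap}   # reduced static graph
    last = {normalise(e) for e in window[-1]}                 # most recent snapshot

    def degree(edge_set):
        k = Counter()
        for u, v in edge_set:
            k[u] += 1
            k[v] += 1
        return k

    k_all, k_fresh = degree(union), degree(last)
    return {i: k_fresh[i] / k_all[i] for i in k_all}


# Toy window of three snapshots (assumed data).
snapshots = [{(1, 2), (1, 3)}, {(2, 1)}, {(3, 4), (2, 3)}]
s = window_popularity(snapshots, t=0, T=3)
```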

During the evolution, certain mechanisms drive the network organization regularly, and the structural features remain relatively stable. Hence, we obtain the optimal $\alpha$ and $m$ from the networks observed in the period $1 \leq t \leq 6$ ($G_{1 \sim 5}$ as the training set, $G_6$ as the probe set) and apply them to the subsequent predictions. Figure 5 shows the precision at successive time steps and the average accuracy of the different methods. For LKMLR, although the fast PBSPM sometimes falls behind, its average precision shows a slight advantage (Fig. 5(a) and (c)). For Wiki, the fast PBSPM not only leads at every time step but also achieves a much higher average accuracy than the TS based methods (Fig. 5(b) and (d)). These experimental results demonstrate that the fast PBSPM has promising applications in evolving networks.

Figure 5.

Precision at different time steps and their average values. $G_{t \sim t+T-1}$ $(G_t, G_{t+1}, \ldots, G_{t+T-1})$ and $G_{t+T}$ play the roles of the training set $E_T$ and the probe set $E_P$. Panels (a) and (b) show the precision at different time steps for LKMLR and Wiki. The red curves are obtained by the fast PBSPM with $\alpha = 2$ and $m = 2$ for LKMLR and with $\alpha = 5$ and $m = 2$ for Wiki. The other results are obtained under optimal cases by different forecasting models. Panels (c) and (d) give the average precision of the different methods for the two networks.

Discussion

In this paper, we propose the PBSPM and its fast algorithm to predict future links. The main contribution is to investigate the popularity (activeness) of nodes in real-world evolving networks and apply it to link prediction. Unlike previous works that calculate temporal effects with complex theories, we infer the popularity of each node by its recently active edges. Then we propose a hypothesis that the future network is influenced by both existing structure and popularity of nodes. By introducing popularity into perturbation method, PBSPM could distinguish active and inactive historical important nodes, and prefer to predict new edges attached to active nodes. Subsequently, the fast method is proposed to get rid of the high computation complexity. Experimental results on real-world evolving networks reveal that compared with traditional methods, the proposed methods achieve better performance in precision and robustness. Besides, further experiments are conducted to uncover the underlying reasons of the improvements.

The performance of the proposed methods depends largely on the popularity of each node. In other words, popularity based methods are more applicable to networks with obvious temporal effects, where the popularity metric can effectively quantify the popularity of each node. Improving the popularity metric should therefore enhance the precision of link prediction, which we leave for future work. Since our work mainly explores prediction in evolving networks, it has possible applications in traffic prediction, airline control, recommendation in social networks, and so on.

Methods

Experimental procedures

To predict the future links of evolving networks with PBSPM, there are five detailed steps to follow:

Step 1: We first divide the network into the training set $E_T$ and the probe set $E_P$ according to the birth time of each edge; the corresponding adjacency matrices are denoted by $A_T$ and $A_P$.

Step 2: The training set is further divided into the old set and the fresh set to calculate the popularity via Eq. (2) or Eq. (11).

Step 3: We perturb the training set by randomly removing a small fraction $p_H = 0.1$ of its edges, $\Delta A$, so that $A_T = A_R + \Delta A$.

Step 4: We decompose the matrix $A_R$ and obtain $\tilde{A}'$ via Eq. (7) and Eq. (8).

Step 5: We repeat steps 3 and 4 ten times and average the resulting $\tilde{A}'$, whose entry $\tilde{a}'_{ij}$ represents the likelihood that a link exists between nodes $i$ and $j$. Finally, the non-observed edges with the top-$|E_P|$ scores are chosen as potential future edges.
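The five steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions that are not stated in the paper: nodes are labelled `0 .. n-1`, every node pair appears at most once in the edge list, `p_fresher` is taken as a fraction of the training edges, and the function name `pbspm_predict`, its default parameters and the random toy data are hypothetical.

```python
import numpy as np


def pbspm_predict(edges, n, probe_frac=0.1, p_fresher=0.1, alpha=3.0,
                  p_H=0.1, repeats=10, seed=0):
    """Return (predicted pairs, true probe pairs) for a time-stamped edge list."""
    rng = np.random.default_rng(seed)
    edges = sorted(edges, key=lambda e: e[2])

    # Step 1: chronological split into training and probe edges.
    n_probe = max(1, int(round(probe_frac * len(edges))))
    train, probe = edges[:-n_probe], edges[-n_probe:]
    A_T = np.zeros((n, n))
    for u, v, _ in train:
        A_T[u, v] = A_T[v, u] = 1.0

    # Step 2: popularity s_i = k_fresher / k_all on the training set (Eq. 2).
    n_fresh = max(1, int(round(p_fresher * len(train))))
    k_all = A_T.sum(axis=1)
    k_fresh = np.zeros(n)
    for u, v, _ in train[-n_fresh:]:
        k_fresh[u] += 1
        k_fresh[v] += 1
    s = np.divide(k_fresh, k_all, out=np.zeros(n), where=k_all > 0)

    # Steps 3 to 5: repeated random perturbation and averaging.
    iu, ju = np.nonzero(np.triu(A_T, 1))
    n_remove = max(1, int(round(p_H * len(iu))))
    score = np.zeros((n, n))
    for _ in range(repeats):
        pick = rng.choice(len(iu), size=n_remove, replace=False)
        dA = np.zeros((n, n))
        dA[iu[pick], ju[pick]] = dA[ju[pick], iu[pick]] = 1.0
        lam, X = np.linalg.eigh(A_T - dA)                  # decompose A_R
        dlam = np.einsum("ik,ij,jk->k", X, dA, X)          # first-order shift
        Xp = X * (1.0 + alpha * s)[:, None]                # Eq. (7)
        score += (Xp * (lam + dlam)) @ Xp.T                # Eq. (8)
    score /= repeats

    # Rank the non-observed pairs and keep the top |E_P|.
    ci, cj = np.nonzero(np.triu(1.0 - A_T, 1))
    top = np.argsort(-score[ci, cj])[:len(probe)]
    predicted = {(int(ci[r]), int(cj[r])) for r in top}
    truth = {(min(u, v), max(u, v)) for u, v, _ in probe}
    return predicted, truth


# Toy data (assumed): 30 random edges on 12 nodes with increasing time stamps.
rng = np.random.default_rng(1)
pairs = [(i, j) for i in range(12) for j in range(i + 1, 12)]
chosen = rng.permutation(len(pairs))[:30]
edges = [(pairs[c][0], pairs[c][1], t) for t, c in enumerate(chosen)]

predicted, truth = pbspm_predict(edges, n=12)
precision = len(predicted & truth) / len(truth)
```

On random toy data the precision is not meaningful; the example only shows how the pieces fit together and that the number of predicted pairs equals $|E_P|$.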

Data description

In this work, six datasets are used to evaluate the performance of the algorithms. (1) Hypertext 2009 (Hypertext): a network of face-to-face contacts among the attendees of the ACM Hypertext 2009 conference from June 30 to July 1, 2009, including 113 nodes and 2196 unique links [39]. (2) Haggle: an undirected network representing contacts between people measured by carried wireless devices [40], including 188 nodes and 1947 unique links. The time span is 4 days. (3) Infectious (Infec): a network describing the face-to-face behavior of people during the exhibition INFECTIOUS: STAY AWAY in 2009 [39], including 301 nodes and 2145 unique links. The time span is 8 hours. (4) UC Irvine messages (UcSoci): a directed network of messages between the users of an online community of students from the University of California, Irvine [41], including 1692 nodes and 13037 unique links. The dataset spans from April 15 to October 25, 2004. (5) Linux kernel mailing list replies (LKMLR): a communication network of the Linux kernel mailing list. The data considered in the experiments span January to June 2013 and include 2907 nodes and 78955 links. (6) Wikipedia elections (Wiki): a network of users of the English Wikipedia who voted for and against each other in admin elections. The data considered in the experiments span October 2005 to April 2006 and include 2309 nodes and 23707 links [42].

To simplify the problem, we ignore the direction and weight of links and remove isolated nodes. Moreover, the networks are divided into a historical training set and a future probe set solely according to the timestamps attached to the edges.

Evaluation metric

AUC (Area Under the receiver operating characteristic Curve) and Precision are two standard metrics for evaluating link prediction algorithms [43, 44]. The former compares the score of a randomly chosen missing link with that of a randomly chosen non-existent link. The latter focuses on the links with the top-$L$ scores. When dealing with highly skewed datasets, Precision gives a more informative picture of an algorithm's performance [45]. Hence, we choose Precision as the metric to evaluate the accuracy of the proposed methods and the baselines. Precision is defined as the ratio of correctly predicted links to all selected links. That is, if we select the top-$L$ links among all ranked non-observed links and $L_r$ of them are in the probe set $E_P$, then the accuracy of the predictor is

$$\text{Precision} = \frac{L_r}{L}. \tag{12}$$

In our experiments, we set $L = |E_P|$ and count how many of the top-$|E_P|$ links actually exist in the probe set.
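A minimal sketch of Eq. (12), assuming the scores of the non-observed pairs are stored in a dictionary keyed by node pair; the function name `precision_at_L` and the toy values are illustrative.

```python
def precision_at_L(scores, probe_edges, L=None):
    """Precision = L_r / L for the top-L scored non-observed pairs.

    scores:      dict mapping a node pair (u, v) to its predicted score.
    probe_edges: iterable of node pairs that actually appear in the probe set.
    L:           number of links to select; defaults to |E_P|.
    """
    probe = {frozenset(e) for e in probe_edges}        # undirected comparison
    if L is None:
        L = len(probe)
    ranked = sorted(scores, key=scores.get, reverse=True)[:L]
    hits = sum(frozenset(e) in probe for e in ranked)  # L_r
    return hits / L


# Toy example: two probe edges, so the top two scored pairs are selected.
scores = {(1, 2): 0.9, (1, 3): 0.8, (3, 4): 0.5, (2, 4): 0.1}
probe_edges = [(2, 1), (3, 4)]
p = precision_at_L(scores, probe_edges)
```

Here the two selected pairs are (1, 2) and (1, 3), of which only the first is in the probe set, so the precision is 0.5.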

Baselines

For comparison, we briefly introduce five traditional algorithms based on all three kinds of structural similarity.

  1. Common Neighbors (CN), related to the concept of triadic closure, is the best-known method. It assumes that two nodes tend to connect with each other if the new connection would produce many more triangles in the graph.
    $$s_{xy}^{CN} = |\Gamma(x) \cap \Gamma(y)|, \tag{13}$$
    where $\Gamma(x)$ is the set of neighbors of node $x$ and $|\Gamma(x) \cap \Gamma(y)|$ is the number of common neighbors of $x$ and $y$.
  2. Adamic-Adar (AA), an extension of CN, restricts the contributions of common neighbors by introducing a penalty factor, i.e., the reciprocal of the logarithm of their degree.
    $$s_{xy}^{AA} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}, \tag{14}$$
    where $k_z$ denotes the degree of common neighbor $z$.
  3. Resource Allocation (RA), motivated by the transfer of resources between two unconnected nodes, treats each common neighbor as an intermediary whose transfer capability equals the reciprocal of its degree.
    $$s_{xy}^{RA} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}. \tag{15}$$
  4. Katz index, based on global information of the network, counts all the paths connecting two endpoints while exponentially damping the contributions of longer paths:
    $$s_{xy}^{Katz} = \sum_{l=1}^{\infty} \alpha^{l} \, |\mathrm{paths}_{x,y}^{\langle l \rangle}|. \tag{16}$$
    When $|\alpha| < 1/\lambda_{\max}$, it can be rewritten as:
    $$S = (I - \alpha A)^{-1} - I, \tag{17}$$
    where $I$ is the identity matrix, $\alpha > 0$ is the tunable parameter, and $\lambda_{\max}$ is the largest eigenvalue of the adjacency matrix $A$.
  5. Superposed Random Walk (SRW) sums the local random walks within $t$ steps, weighted by the degrees of the two endpoints, to emphasize the local properties of real networks26.
    $$s_{xy}^{SRW}(t) = \sum_{\tau=1}^{t} \left[ q_x \pi_{xy}(\tau) + q_y \pi_{yx}(\tau) \right], \tag{18}$$
    where $q_x = \frac{k_x}{2|E|}$ denotes the initial distribution of resources and $\pi_{xy}(\tau)$ represents the transfer probability from $x$ to $y$ in $\tau$ steps.
  6. Time series based methods explore the evolution of topological metrics to predict future links37. They proceed as follows:

Step 1: Choose a static-structure method (e.g. CN, RA, Katz, etc.);

Step 2: Establish the time series by calculating the similarity between unconnected nodes in each time period;

Step 3: Compute the final score of unconnected nodes with a forecasting model (e.g. Moving Average, Linear Regression, Simple Exponential Smoothing, etc.);

Step 4: Evaluate the algorithm against the future links in the next time period.
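The baselines above can be written compactly in matrix form. The following NumPy sketch is illustrative only and is not the implementation used in the experiments: all function names are hypothetical, the adjacency matrix is assumed to be symmetric and binary with no isolated nodes, and the time series variant is reduced to a plain moving average over the most recent snapshots.

```python
import numpy as np


def local_indices(A):
    """CN, AA and RA score matrices (Eqs 13-15) for a symmetric 0/1 matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                          # node degrees
    cn = A @ A                                 # (A^2)_xy = number of common neighbours
    # A common neighbour of two distinct nodes has degree >= 2, so log k_z > 0.
    w_aa = np.zeros_like(k)
    w_aa[k > 1] = 1.0 / np.log(k[k > 1])
    w_ra = np.zeros_like(k)
    w_ra[k > 0] = 1.0 / k[k > 0]
    aa = (A * w_aa) @ A                        # sum_z A_xz * w_z * A_zy
    ra = (A * w_ra) @ A
    return cn, aa, ra


def katz(A, alpha):
    """Closed-form Katz index S = (I - alpha*A)^(-1) - I (Eq. 17)."""
    A = np.asarray(A, dtype=float)
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
    if alpha * lam_max >= 1.0:
        raise ValueError("alpha must be smaller than 1 / lambda_max")
    I = np.eye(len(A))
    return np.linalg.inv(I - alpha * A) - I


def srw(A, t):
    """Superposed random walk score (Eq. 18); assumes no isolated nodes."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)
    P = A / k[:, None]                         # one-step transition matrix
    q = k / k.sum()                            # k_x / (2|E|)
    pi = np.eye(len(A))
    S = np.zeros_like(A)
    for _ in range(t):
        pi = pi @ P                            # pi[x, y] = pi_xy(tau)
        lrw = q[:, None] * pi                  # q_x * pi_xy(tau)
        S += lrw + lrw.T                       # add q_y * pi_yx(tau)
    return S


def moving_average_scores(snapshots, score_fn, window):
    """Steps 1-3 with a moving-average forecaster over adjacency snapshots."""
    series = [score_fn(A) for A in snapshots]  # Step 2: one score matrix per period
    return np.mean(series[-window:], axis=0)   # Step 3: average the last `window`


A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
cn, aa, ra = local_indices(A)
K = katz(A, alpha=0.1)
S1 = srw(A, t=1)
```

Only the off-diagonal entries of unconnected pairs are meaningful as prediction scores. For t = 1 the SRW score reduces to A / |E|, which gives a quick sanity check of the transition matrix.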

Acknowledgements

This work is jointly supported by the National Natural Science Foundation of China (Nos 61471243, 11547040 and U1301252), Science and Technology Innovation Commission of Shenzhen (Nos JCYJ20160520162743717, JCYJ20150625101524056, JCYJ20140418095735561, JCYJ20150731160834611 and SGLH20131010163759789), the PhD Start-up Fund of Natural Science Foundation of Guangdong Province (2017A030310374), and the Young Teachers Start-up Fund of Natural Science Foundation of Shenzhen University.

Author Contributions

T.W. and M.Z. conceived and designed the experiments. T.W. and Z.F. performed the experiments. X.H. and Z.F. analysed the data and improved the methods. T.W., X.H. and M.Z. wrote the manuscript. All authors reviewed the manuscript.

Competing Interests

The authors declare that they have no competing interests.

Footnotes

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Albert R, Barabási A-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 2002;74:47. doi: 10.1103/RevModPhys.74.47.
  • 2. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918.
  • 3. Barzel B, Barabási A-L. Network link prediction by global silencing of indirect correlations. Nat. Biotechnol. 2013;31:720–725. doi: 10.1038/nbt.2601.
  • 4. Hulovatyy Y, Solava RW, Milenković T. Revealing missing parts of the interactome via link prediction. Plos One. 2014;9:e90073. doi: 10.1371/journal.pone.0090073.
  • 5. Ermiş B, Acar E, Cemgil AT. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery. 2015;290:203–236. doi: 10.1007/s10618-013-0341-y.
  • 6. Stanfield Z, Coşkun M, Koyutürk M. Drug response prediction as a link prediction problem. Sci. Rep. 2017;7:40321. doi: 10.1038/srep40321.
  • 7. Mamitsuka H. Mining from protein–protein interactions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2:400–410.
  • 8. Cannistraci CV, Alanis-Lobato G, Ravasi T. From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks. Sci. Rep. 2013;3:1613. doi: 10.1038/srep01613.
  • 9. Gong NZ, et al. Joint link prediction and attribute inference using a social-attribute network. ACM Trans. Intell. Syst. Technol. 2014;5:27:1–27:20. doi: 10.1145/2594455.
  • 10. He Y-L, Liu JN, Hu Y-X, Wang X-Z. Owa operator based link prediction ensemble for social network. Expert Syst. Appl. 2015;42:21–50. doi: 10.1016/j.eswa.2014.07.018.
  • 11. Tang, J., Chang, S., Aggarwal, C. & Liu, H. Negative link prediction in social media. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 87–96 (ACM, 2015).
  • 12. Daminelli S, Thomas JM, Durán C, Cannistraci CV. Common neighbours and the local-community-paradigm for topological link prediction in bipartite networks. New J. Phys. 2015;17:113037. doi: 10.1088/1367-2630/17/11/113037.
  • 13. He X-S, Zhou M-Y, Zhuo Z, Fu Z-Q, Liu J-G. Predicting online ratings based on the opinion spreading process. Physica A. 2015;436:658–664. doi: 10.1016/j.physa.2015.05.066.
  • 14. Guimerà R, Sales-Pardo M. Missing and spurious interactions and the reconstruction of complex networks. P. Natl. Acad. Sci. USA. 2009;106:22073–22078. doi: 10.1073/pnas.0908366106.
  • 15. Liben-Nowell D, Kleinberg J. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 2007;58:1019–1031. doi: 10.1002/asi.20591.
  • 16. Wang P, Xu B, Wu Y, Zhou X. Link prediction in social networks: the state-of-the-art. Sci. China Inform. Sci. 2015;58:1–38.
  • 17. Lin, D. An information-theoretic definition of similarity. Proceedings of the Fifteenth International Conference on Machine Learning, 296–304 (Morgan Kaufmann Publishers Inc., 1998).
  • 18. Yuan, G., Murukannaiah, P. K., Zhang, Z. & Singh, M. P. Exploiting sentiment homophily for link prediction. Proceedings of the 8th ACM Conference on Recommender Systems, 17–24 (ACM, 2014).
  • 19. Hâncean M-G, Perc M. Homophily in coauthorship networks of east european sociologists. Sci. Rep. 2016;6:36152. doi: 10.1038/srep36152.
  • 20. Lü L, Zhou T. Link prediction in complex networks: A survey. Physica A. 2011;390:1150–1170. doi: 10.1016/j.physa.2010.11.027.
  • 21. Newman ME. Clustering and preferential attachment in growing networks. Phys. Rev. E. 2001;64:025102. doi: 10.1103/PhysRevE.64.025102.
  • 22. Adamic LA, Adar E. Friends and neighbors on the web. Social Networks. 2003;25:211–230. doi: 10.1016/S0378-8733(03)00009-1.
  • 23. Zhou T, Lü L, Zhang Y-C. Predicting missing links via local information. Eur. Phys. J. B. 2009;71:623–630. doi: 10.1140/epjb/e2009-00335-8.
  • 24. Cui W, et al. Bounded link prediction in very large networks. Physica A. 2016;457:202–214. doi: 10.1016/j.physa.2016.03.041.
  • 25. Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18:39–43. doi: 10.1007/BF02289026.
  • 26. Liu W, Lü L. Link prediction based on local random walk. Europhys. Lett. 2010;89:58007. doi: 10.1209/0295-5075/89/58007.
  • 27. Clauset A, Moore C, Newman ME. Hierarchical structure and the prediction of missing links in networks. Nature. 2008;453:98–101. doi: 10.1038/nature06830.
  • 28. Lü L, Pan L, Zhou T, Zhang Y-C, Stanley HE. Toward link predictability of complex networks. P. Natl. Acad. Sci. USA. 2015;112:2325–2330. doi: 10.1073/pnas.1424644112.
  • 29. Xu Z, Pu C, Yang J. Link prediction based on path entropy. Physica A. 2016;456:294–301. doi: 10.1016/j.physa.2016.03.091.
  • 30. Tan F, Xia Y, Zhu B. Link Prediction in Complex Networks: A Mutual Information Perspective. Plos One. 2014;9:e107056. doi: 10.1371/journal.pone.0107056.
  • 31. Zhu B, Xia Y. An information-theoretic model for link prediction in complex networks. Sci. Rep. 2015;5:13037. doi: 10.1038/srep13037.
  • 32. Kim H, Anderson R. Temporal node centrality in complex networks. Phys. Rev. E. 2012;85:026107. doi: 10.1103/PhysRevE.85.026107.
  • 33. Godsil, C. & Royle, G. Algebraic graph theory (Springer-Verlag, New York, 2001).
  • 34. Estrada E, Rodríguez-Velázquez JA. Subgraph centrality in complex networks. Phys. Rev. E. 2005;71:056103. doi: 10.1103/PhysRevE.71.056103.
  • 35. Borgatti SP. Centrality and network flow. Social Networks. 2005;27:55–71. doi: 10.1016/j.socnet.2004.11.008.
  • 36. Huang Z, Lin DK. The time-series link prediction problem with applications in communication surveillance. INFORMS J. Comput. 2009;21:286–303. doi: 10.1287/ijoc.1080.0292.
  • 37. Soares, P. R. d. S. & Prudêncio, R. B. C. Time Series Based Link Prediction. The 2012 International Joint Conference on Neural Networks, 1–7 (IEEE, 2012).
  • 38. Güneș İ, Gündüz-Öğüdücü Ş, Çataltepe Z. Link prediction using time series of neighborhood-based node similarity scores. Data Mining and Knowledge Discovery. 2016;30:147–180. doi: 10.1007/s10618-015-0407-0.
  • 39. Isella L, et al. What’s in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol. 2011;271:166–180. doi: 10.1016/j.jtbi.2010.11.033.
  • 40. Chaintreau A, et al. Impact of human mobility on opportunistic forwarding algorithms. IEEE Transactions on Mobile Computing. 2007;6:606–620. doi: 10.1109/TMC.2007.1060.
  • 41. Opsahl T, Panzarasa P. Clustering in weighted networks. Social Networks. 2009;31:155–163. doi: 10.1016/j.socnet.2009.02.002.
  • 42. Leskovec, J., Huttenlocher, D. & Kleinberg, J. Predicting positive and negative links in online social networks. Proceedings of the 19th international conference on World wide web, 641–650 (ACM, 2010).
  • 43. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36. doi: 10.1148/radiology.143.1.7063747.
  • 44. Herlocker JL, Konstan JA, Terveen LG, Riedl JT. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 2004;22:5–53. doi: 10.1145/963770.963772.
  • 45. Davis, J. & Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. Proceedings of the 23rd international conference on Machine learning, 233–240 (ACM, 2006).

Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
