Identifying the perceived local properties of networks reconstructed from biased random walks

Lucas Guerreiro; Filipi Nascimento Silva; Diego Raphael Amancio

doi:10.1371/journal.pone.0296088

2024 Jan 19;19(1):e0296088. doi: 10.1371/journal.pone.0296088

Editor: Dariusz Siudak

PMCID: PMC10798465 PMID: 38241390

Abstract

Many real-world systems give rise to a time series of symbols. The elements in a sequence can be generated by agents walking over a networked space so that whenever a node is visited the corresponding symbol is generated. In many situations the underlying network is hidden, and one aims to recover its original structure and/or properties. For example, when analyzing texts, the underlying network structure generating a particular sequence of words is not available. In this paper, we analyze whether one can recover the underlying local properties of networks generating sequences of symbols for different combinations of random walks and network topologies. We found that the reconstruction performance is influenced by the bias of the agent dynamics. When the walker is biased toward high-degree neighbors, the best performance was obtained for most of the network models and properties. Surprisingly, this same effect is not observed for the clustering coefficient and eccentricity, even when large sequences are considered. We also found that the true self-avoiding walk displayed performance similar to that of the walk preferring highly connected nodes, with the advantage of yielding competitive performance in recovering the clustering coefficient. Our results may have implications for the construction and interpretation of networks generated from sequences.

1 Introduction

Many real-world phenomena are characterized by discrete series of events or decisions happening in succession [1, 2]. This includes how users navigate through websites or social media, how language is written and spoken, music, city navigation, and even people’s everyday decisions. In most cases, however, only a limited amount of information is available to infer the rules and the mechanisms driving the generative processes behind these phenomena. For instance, from the perspective of a social media user, the observed content may be limited to their own interests, political positions, friends’ preferences and what is being suggested by a recommendation algorithm. Such content normally constitutes only a small fragment of what is present in the complete social media platform. These aspects are often linked with the emergence of biases leading to polarization, formation of echo chambers, and other social phenomena like the friendship paradox [3]. In another example, because of individuals’ limited capacity, resources and available personnel, scientists or research groups adopt different strategies to choose the focus of their research among all the possible problems. Such a strategy could favor exploitation over exploration (or vice versa), a decision that could potentially impact the collective discovery process [4]. These examples raise the question of how well aspects of the inherent (generative) process are truly recovered through limited or biased information.

Since many complex systems have been successfully represented by networks (i.e., by the intricate relationships among their components) [5–8], it is possible to study the aforementioned question in terms of how the characterization of such structures changes according to the limited information observed through certain dynamics. In particular, for the case of networks, different behaviors of dynamics can be simulated through random walk heuristics. In such systems, an agent performs a walk in a network and reconstructs it based on the set of visited nodes and edges. This process has been investigated, in particular for the case of the knowledge acquisition process, in which pieces of knowledge are learned by agents walking across a network representing knowledge. Previous work [9] has found that it is possible to determine characteristics of the inherent model by only looking at features of the partially reconstructed networks. This indicates that different combinations of network topology and dynamics can lead to potentially different observed features in the generated sequences. In this work, we address the problem of checking how similar the observed features of partially reconstructed networks are to those of the original structure.

We approach the problem of reconstructing networks from limited information by employing different types of random walks performed by non-interacting agents. Each agent simulates an individual with limited information and stores a subgraph of the original network reconstructed by co-adjacency. Fig 1 shows an example network in which a random walk was performed starting at node A. This simulates, for instance, a user in social media navigating across different profiles. Given the user’s limited information, they may think that node A is the one with most connections in the network, in contrast to the correct answer: C. This is because the agent visited A’s neighborhood, while B’s and C’s neighbors were not visited. Other properties such as clustering coefficient and centrality measures are also not correctly recovered through this walk. The potential to recover network properties may also depend on the network topology itself. For instance, an irregular network with heterogeneous degrees and high density may be more challenging to navigate than a regular network, since in the first case, hubs may be visited more frequently, potentially leaving regions of low degree poorly explored. As a consequence, the recovered network characteristics may be different and biased compared to those from the original network.

In this work, we study how well network characteristics (such as node degree and clustering coefficient) can be recovered from reconstructions based on finite sequences. We explore the effects of different strategies to generate the sequences, including biased [9] and true self-avoiding [10, 11] random walks. First, we generate sequences of nodes based on the progression of visited nodes given by an agent dynamics. Next, each sequence is used independently to reconstruct a network based on co-adjacency. Network properties for both the original and reconstructed networks are obtained and compared via Pearson or Spearman correlations. We also vary the length of the sequences to simulate different levels of limited information. For this analysis, we considered real-world networks and realizations of traditional network models in addition to a community-based model, the LFR [12].

Our results indicate that the choice of dynamics employed to generate the sequences has an influence on the correlation values between the recovered and original network properties. The reconstruction performance depends, for instance, on whether the dynamics are biased by node degree. When the walker prefers to visit highly connected nodes (RWD), we achieve the best performance in recovering network properties for most of the considered networks and properties, with the exception of clustering coefficient and eccentricity. In those cases, even for long sequences, the correlation remains low. On the other hand, when the random walk dynamics avoids highly connected nodes (RWID), we see the worst performance among the considered dynamics. However, it is able to recover the clustering coefficient with similar performance as other dynamics. In addition to that, we explore three other types of random walks: the unbiased random walk (RW) and two self-avoiding strategies, known as true self-avoiding walks [10, 11], one based on edges, which avoids passing through already visited edges (TSAW-edge), and another based on nodes (TSAW-node). TSAW-edge displayed performance similar to that of the RWD approach, but with no problems in recovering the clustering coefficient. We discuss these results in detail and the potential implications in Section 4.

Finally, we also check if the community structure can be recovered from the partial information stored in sequences. This is accomplished by comparing the detected communities’ membership of the original networks (or planted for the LFR models) with those from the reconstructed versions. The results seem to depend strongly on the network topology, with mixed patterns across different mixing coefficients of the LFR and real networks. Nonetheless, TSAW-edge and RW display the best performance in that task.

This work is organized as follows. In Section 2 we present and discuss the related works. Section 3 describes the methodology adopted to generate networks, perform random walks on the topologies, reconstruct partial networks, and analyze and compare the properties of these networks. The results are reported and discussed in Section 4. Finally, in Section 5, we present the general conclusions and future work.

2 Related works

The process of random walkers exploring complex network topologies has already been studied by several works [2, 13–15]. In the context of knowledge acquisition, the sequence of visited nodes in random walks is used to recover the set of nodes in the network [2, 14]. In [2], the authors investigated how different agents walking over the network can reconstruct the network topology. In the proposed multi-agent random walk, the true self-avoiding and Lévy flight-based dynamics outperformed other walk strategies in terms of efficiency in discovering new nodes. Surprisingly, the study also showed that fine-tuning the parameters controlling the agent dynamics had little effect on the global knowledge acquisition performance.

The study conducted in [13] focused on the knowledge acquisition task when several network topologies and agent dynamics are used in a single-agent context. This study found that the true self-avoiding dynamics had the best performance over different settings in discovering nodes in the network. The degree-biased walk had the slowest learning curve in the experiments. The study also demonstrated that higher average degrees provide a faster learning rate.

While several studies focused on the knowledge (nodes) acquisition problem [14, 15], the study conducted in [9] used a machine learning approach to recover both the network topology and agent dynamics generating a sequence of symbols. To train the supervised classifiers, sequences of visited nodes were mapped into (reconstructed) networks via the co-occurrence strategy. Then, six different network properties were used to create features describing the observed reconstructed networks. Sixteen different combinations of network topology and agent dynamics were considered to generate sequences. The study revealed that it is possible to recover both the topology and dynamics with high accuracy, provided that the sequence (i.e. the random walk) length is not too short. The accuracy of identification increased with the observed sequence length. When less than 20% of the whole network was discovered, both the topology and dynamics were recovered with an accuracy higher than 86% in a supervised classification scenario with 16 classes.

In [16], the authors analyzed how network properties (e.g. average degree) evolve as the network sample size grows. If a network property is unstable for all sample sizes then it does not represent the network very well; however, if the property does not change as the sample size grows, then the property is considered a good representation of the network. The main contribution of this work is therefore a methodology to quantify if any network property is robust regarding the network size used in the experiments. In networks formed from sequences, it means that unstable properties may vary depending on the sequence length used to form the networks. This means that when recovering local properties, the original value of the property may only be recovered if the sample and original network sizes are consistent. However, one may still find a correlation between values observed in sampled and original networks for unstable metrics.

The study conducted in [7] investigated a teaching-learning perspective using complex networks. In the adopted representation, facts are graph nodes and the relationship or underlying connections between two facts are represented by edges. The study aimed to probe how students learn contents from linear algebra textbooks by considering the nodes exploration process simulating the human memory characteristics during the learning process. Among the main findings, the authors reported that human memory limitation plays a special role in long-term information retention effectiveness and problem-solving creativity.

The relationship between knowledge representation and complex networks has also been studied elsewhere. In [14], knowledge is acquired when nodes and edges are visited by random walkers. Different from other approaches, the experiments considered free and conditional transition edges. While the former is commonly used in most of the works, the latter allows new nodes to be accessed only when certain criteria are met. In this case, the main criteria consist in visiting a subset of nodes in order to make a new node accessible. The author analyzed the knowledge acquisition performance via hierarchical complex networks [17], which are explored via traditional random walks and variations biased toward new links. The study showed that the biased random walks are slower to acquire knowledge in the conditional exploration scenario.

While most of the related works tried to recover the set of nodes or identify the dynamics and topology generating a sequence, here we focused on a different network perspective. We studied if the properties of the reconstructed networks are consistent with the ones of the original networks. This is a relevant topic since in many scenarios one does not have access to the original networks generating a sequence of symbols.

3 Methodology

In this section, we discuss the proposed methodology. The main purpose of this paper is to analyze whether the local properties of reconstructed and original networks are correlated. The methodology can be divided into four main steps, which are summarized below and illustrated in Fig 2.

Fig 2. First, we look into the original network and annotate the properties of interest for each node. In the next step, we traverse the network with the desired dynamics in order to generate a sequence of symbols. In the third step, we reconstruct the network using the discovered nodes and edges, and we annotate the node properties in this reconstructed network. Finally, we compute correlations between the reconstructed and original node properties by comparing each node i in the reconstructed network to its respective node in the original network.

Original networks: here we used network models to represent different network topologies. Examples of models include random and geographical networks. We also used examples of real-world networks modeling e.g. social and biological complex systems.
Network dynamics: in many real-world situations, network data is only available as a sequence of symbols [18]. Sequences can be generated by an agent walking over the network via different rules.
Network reconstruction and properties extraction: the observed sequence generated in the previous step is used to reconstruct the network. Several properties of the obtained networks are then extracted.
Correlation analysis: the properties of the original and reconstructed networks are compared. The properties are compared in terms of network metrics (e.g. clustering coefficient) and, in modular networks, the partitions representing the network communities are compared.

3.1 Original networks

We considered the most common and diverse topology models in the literature [19]. Similar to previous studies [9], our analysis focused on networks with N = 5,000 nodes and average degree 〈k〉 = 4. The following models were used here:

Erdős-Rényi (ER): this is the traditional random network model. The probability π of two nodes being linked by an edge is constant [20, 21].
Barabási-Albert (BA): this model implements a scale-free topology [22]. Different from ER networks, the BA model is based on a growth model, since at each step new nodes are included in the network. The probability π_ij that a new node i connects to an existing node j with k_j links is proportional to k_j, normalized over all existing nodes l:
$\begin{matrix} π_{i j} = \frac{k_{j}}{\sum_{l} k_{l}} . \end{matrix}$ (1)
Waxman (Wax): this topology is an implementation of a geographic network. The first step consists in randomly placing each node in a two-dimensional plane. A link between two nodes i and j is provided by a probability formulation that decays exponentially with the geographical distance between the nodes [13, 23].
Modular Networks (LFR): this model [12] creates networks with nodes clustered into network communities. The main parameters used to construct modular networks are the number of communities (n_C), the exponent for the degree sequence in the network (t₁), the minus exponent for the community size distribution (t₂), and the mixing parameter (μ), which measures how well defined communities are. Lower values of the mixing parameter lead to well-defined network communities. We used the following parameters to construct the networks: n_C = 5, t₁ = 3, t₂ = 0 and μ ∈ {0.05, 0.2, 0.8}. Similar values have been used in related works [2, 9, 13].
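To make the preferential attachment rule of Eq (1) concrete, the following is a minimal sketch in pure Python. The function name `barabasi_albert`, the parameter `m` (number of links added per new node) and the fully connected seed network are assumptions made for illustration; the text does not specify how the model networks were generated. With m = 2, the average degree approaches the value 〈k〉 = 4 used here.

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow an undirected network by preferential attachment.

    Returns a dict mapping each node to the set of its neighbors.
    """
    rng = random.Random(seed)
    # Seed network: m + 1 fully connected nodes (an assumption of this sketch).
    adj = {u: set() for u in range(m + 1)}
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` selects node j with probability k_j / sum_l k_l.
    stubs = [u for u in adj for _ in adj[u]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # repeated draws are discarded
            targets.add(rng.choice(stubs))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend((new, t))
    return adj

network = barabasi_albert(5000, 2, seed=42)
avg_degree = sum(len(nb) for nb in network.values()) / len(network)
```

Because duplicate targets are redrawn, the selection is only approximately proportional to degree when m > 1, which is a common simplification in implementations of this model.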

We also conducted our experiments on real networks modeling diverse complex systems. The selected networks and their download links are described below:

Facebook: this network comprises social relationships among Facebook employees. The network comprises 320 nodes [24] and is available at [25].
Power Grid: this is a classic geographical network modeling the US Western States Power Grid. Nodes represent transformers or power relay points, while edges are power lines. The network comprises 4,941 nodes [26] and is available at [25].
Econ-Poli: the Economics Poli network contains 3,915 nodes and represents the behavior of interconnected economic agents [27]. The network is available at [27].
Web-EPA: the Web-EPA network comprises 4,271 nodes and represents web-level hyperlinks across the internet that link to the www.epa.gov website [27]. The network is available at [27].
Bio (DM-CX): this is a real biological network representing co-occurrences between pairs of genes of the fly Drosophila melanogaster, acquired from the FlyNet repository. The network comprises 4,032 nodes [27, 28] and is available at [27].
Bio (AI Interactions): the AI Interactions dataset is a real biological network built from the Arabidopsis Interactome map of protein-protein interactions. The largest component of this network comprises 4,519 nodes [29]. The network is available at [30].
socfb-JohnsHopkins55: this is a social network representing Facebook connections inside the Johns Hopkins community. The network comprises 5,180 nodes and is available at [27].

Since the size of the networks can influence the coverage speed of random walks, we opted to select networks of similar size. Except for the Facebook network, all the chosen networks have about 4,000 to 5,000 nodes.

3.2 Network dynamics

Agent dynamics have been used in a wide variety of network-based studies, including epidemic spreading and knowledge acquisition analysis [2, 13, 15]. In this work, agent dynamics are used to explore the network and generate a sequence of visited nodes. The sequence of nodes is assumed to be the information available for network reconstruction. We have used 5 well-known walks:

Traditional Random Walk (RW): in the traditional random walk the agent selects the next node to visit uniformly at random among its neighbors. The probability of the agent moving from node i to node j is p_ij = 1/k_i, where k_i represents the degree of node i.
Degree-biased Random Walk (RWD): this random walk considers the degree of the neighbor nodes when selecting the next node to be visited by the agent. The probability of visiting a node is proportional to its degree:
$\begin{matrix} p_{i j} = \frac{k_{j}}{\sum_{l \in Γ_{i}} k_{l}}, \end{matrix}$ (2)
where Γ_i is the set comprising the neighbors of i.
Inverse of the Degree-biased Random Walk (RWID): Similar to the RWD dynamics, the RWID walk uses the degree of the neighborhood when defining the probabilities. However, in this dynamics, the agent prefers to visit the nodes with smaller degrees, i.e.
$\begin{matrix} p_{i j} = \frac{k_{j}^{- 1}}{\sum_{l \in Γ_{i}} k_{l}^{- 1}} . \end{matrix}$ (3)
True Self-avoiding Random Walk on nodes (TSAW-node): in this random walk, the agent avoids the nodes that were already visited [10, 11]. Therefore, in this walk, the nodes not yet visited are preferred, which works in favor of the network exploration. Let f_j be the frequency that j has been visited. The mechanism to avoid already visited nodes is encoded according to:
$\begin{matrix} p_{i j} = \frac{e^{- λ f_{j}}}{\sum_{l \in Γ_{i}} e^{- λ f_{l}}} . \end{matrix}$ (4)
True Self-avoiding Random Walk on edges (TSAW-edge): similar to the node-based TSAW dynamics, this random walk has an avoidance bias. In this implementation, however, the agent avoids previously visited edges instead of nodes, as implemented in works related to network exploration [2, 13]. The transition probability is computed as
$\begin{matrix} p_{i j} = \frac{e^{- λ f_{i j}}}{\sum_{l \in Γ_{i}} e^{- λ f_{i l}}}, \end{matrix}$ (5)
where f_ij is the frequency with which the edge between i and j has been traversed. In both versions of the true self-avoiding random walks, we use λ = ln 2, as in related works [2, 9, 13].
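The five transition rules can be summarized in a single routine. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: the names `transition_weights` and `walk`, the adjacency representation (a dict mapping each node to the set of its neighbors) and the convention that `length` counts the nodes in the output sequence are all hypothetical. The weights are the unnormalized numerators of p_ij for RW and of Eqs (2) to (5); normalization over the neighbors of i is delegated to `random.choices`.

```python
import math
import random
from collections import defaultdict

def transition_weights(adj, i, rule, node_visits, edge_visits, lam=math.log(2)):
    """Unnormalized weights for moving from node i to each neighbor j."""
    weights = {}
    for j in adj[i]:
        if rule == "RW":
            weights[j] = 1.0                      # uniform over neighbors
        elif rule == "RWD":
            weights[j] = float(len(adj[j]))       # proportional to k_j
        elif rule == "RWID":
            weights[j] = 1.0 / len(adj[j])        # proportional to 1 / k_j
        elif rule == "TSAW-node":
            weights[j] = math.exp(-lam * node_visits[j])
        elif rule == "TSAW-edge":
            weights[j] = math.exp(-lam * edge_visits[frozenset((i, j))])
        else:
            raise ValueError(f"unknown rule: {rule}")
    return weights

def walk(adj, start, length, rule, seed=None):
    """Return a sequence of `length` visited nodes (no isolated nodes assumed)."""
    rng = random.Random(seed)
    node_visits = defaultdict(int)   # f_j: visits to node j
    edge_visits = defaultdict(int)   # f_ij: traversals of edge {i, j}
    current = start
    node_visits[current] += 1
    sequence = [current]
    while len(sequence) < length:
        weights = transition_weights(adj, current, rule, node_visits, edge_visits)
        neighbors = sorted(weights)
        nxt = rng.choices(neighbors, weights=[weights[j] for j in neighbors])[0]
        edge_visits[frozenset((current, nxt))] += 1
        node_visits[nxt] += 1
        current = nxt
        sequence.append(current)
    return sequence

# Toy network used only for illustration.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
sequence = walk(toy, start=0, length=20, rule="TSAW-edge", seed=7)
```

For example, from node 1 of the toy network the RWD rule assigns weights 3 and 2 to neighbors 0 and 2 (probabilities 0.6 and 0.4), while RWID reverses the preference.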

3.3 Network reconstruction and properties extraction

The sequences generated by the random walks are used to reconstruct the networks. In our experiments, we probed how the sequence length affects the properties of the reconstructed networks. We investigated the results for the following set of sequence lengths w: {100, 200, 400, 500, 600, 800, 1000, 2000, 5000, 20000, 50000}. The reconstruction is performed by recreating the edges observed in the sequence. This procedure is equivalent to the co-occurrence approach usually employed in network analysis, where two symbols are linked whenever they are adjacent in the sequence. When analyzing texts using network science, the co-occurrence approach is widely employed [6, 31, 32]. For each combination of network topology, agent dynamics and walk length, we considered 20 sequence realizations.

Once the network is reconstructed, our aim is to analyze if relevant properties can be recovered. To characterize the networks, we used well-known network metrics, including local, quasi-local and global metrics. We computed the degree, clustering coefficient, closeness, betweenness, eccentricity and coreness centrality of the networks [19, 33]. All metrics are defined for unweighted and undirected networks. For networks with modular structure, we also detect the communities using the Leiden method [34].
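As an illustration of how some of these node-level metrics can be obtained for an unweighted, undirected network, the sketch below computes the degree, clustering coefficient, eccentricity and closeness in pure Python. The function names and the adjacency representation (a dict of neighbor sets) are assumptions of this example, the distance-based metrics are restricted to the connected component of the node, and betweenness and coreness are omitted for brevity.

```python
from collections import deque

def degree(adj, u):
    return len(adj[u])

def clustering(adj, u):
    """Fraction of pairs of neighbors of u that are themselves linked."""
    nbrs = list(adj[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))

def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, u):
    """Largest shortest-path distance from u to a reachable node."""
    return max(bfs_distances(adj, u).values())

def closeness(adj, u):
    """Inverse of the mean distance from u to the other reachable nodes."""
    dist = bfs_distances(adj, u)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Toy network used only for illustration.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
properties = {u: (degree(toy, u), clustering(toy, u),
                  eccentricity(toy, u), closeness(toy, u)) for u in toy}
```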

3.4 Correlation analysis

To assess how well the structural properties are preserved by the reconstruction process, we employed the Pearson and Spearman correlations between structural properties of the original and reconstructed networks. More specifically, for each node i^(R) of the reconstructed networks we measure a structural feature μ(i^(R)) (e.g. degree or clustering coefficient). The same property is also measured for the corresponding node in the original network, i.e. μ(i^(O)). Finally, we measured the correlation between μ(i^(R)) and μ(i^(O)), for all nodes of the reconstructed network.
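The comparison can be sketched as follows, assuming each property is stored as a dict from node to value (a representation chosen for this example). The helper names are hypothetical, the Spearman correlation is computed as the Pearson correlation of average ranks, and only nodes present in the reconstructed network enter the calculation. The sketch does not handle constant inputs, for which the correlation is undefined.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def property_correlations(prop_original, prop_reconstructed):
    """Correlate a node property over the nodes of the reconstructed network."""
    nodes = sorted(prop_reconstructed)
    x = [prop_original[i] for i in nodes]
    y = [prop_reconstructed[i] for i in nodes]
    return pearson(x, y), spearman(x, y)

# Hypothetical degrees: node 4 was never visited, so it is left out.
degree_original = {0: 3, 1: 2, 2: 2, 3: 1, 4: 5}
degree_reconstructed = {0: 2, 1: 1, 2: 2, 3: 1}
c_p, c_s = property_correlations(degree_original, degree_reconstructed)
```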

Since we are also interested in the modular structure of networks, we also compared the similarity of partitions in the original and reconstructed network via normalized mutual information (NMI) [19, 35] and adjusted Rand index (ARI) [36, 37]. Higher values of NMI and ARI mean that the partitions of the original and reconstructed network are more similar.
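Both partition-similarity scores can be computed from the contingency counts of two label lists aligned by node. The sketch below is a minimal pure-Python illustration; the function names are hypothetical, and the NMI is normalized by the arithmetic mean of the two entropies, which is one of several conventions and is an assumption here since the text does not state which normalization was used.

```python
import math
from collections import Counter

def _comb2(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) / 2.0

def nmi(labels_a, labels_b):
    """Mutual information divided by the mean entropy of the two partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2.0 * mutual / (h_a + h_b)

def ari(labels_a, labels_b):
    """Rand index corrected for the agreement expected by chance."""
    n = len(labels_a)
    sum_joint = sum(_comb2(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(_comb2(c) for c in Counter(labels_a).values())
    sum_b = sum(_comb2(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / _comb2(n)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single community in both
    return (sum_joint - expected) / (max_index - expected)

# Hypothetical community labels for six nodes.
original_labels = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]
reconstructed_labels = [0, 0, 1, 1, 2, 2]
scores = (nmi(original_labels, reconstructed_labels),
          ari(original_labels, reconstructed_labels))
```

Both scores are invariant to a renaming of the communities, so `original_labels` and `relabeled` are treated as identical partitions.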

The steps of our framework described in this section are summarized in Algorithm 1. The code can be found on GitHub at: https://github.com/lucasguerreiro/localproperties.

To mitigate the inherent randomness associated with the starting nodes in random walks, we conducted multiple iterations of each walk for every network configuration (20 repetitions for each configuration), so that the reported Pearson correlation values are averages over these realizations.

Algorithm 1 Framework to explore networks (N_o), reconstruct subnetworks (N_r) and calculate correlations between N_r and N_o

function reconstruct(r)
    N_r ← ∅
    lastnode ← ∅
    for n ∈ r do
        if n ∉ N_r then
            N_r ← N_r ∪ {n} ⊳ add node n to N_r
        end if
        if lastnode ≠ ∅ then
            N_r ← N_r ∪ {(lastnode, n)} ⊳ add edge (lastnode, n) to N_r
        end if
        lastnode ← n ⊳ remember the previously visited node
    end for
    return N_r
end function

μ(i(O)) ← getproperty(N_o) ⊳ calculate property of original network
for p ∈ w do
    r ← walk(N_o, d, p) ⊳ perform dynamics d in network N_o with length p
    N_r ← reconstruct(r)
    μ(i(R)) ← getproperty(N_r) ⊳ calculate property of reconstructed network
    C(p) ← correlation(μ(i(O)), μ(i(R))) ⊳ calculate correlation for walk length p
end for
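For readers who prefer executable code, the loop of Algorithm 1 can be sketched in Python as below. This is not the code from the repository cited above: the toy original network, the restriction to the unbiased walk and to the node degree, and all function names are assumptions chosen to keep the example short. Correlations are averaged over 20 realizations, and realizations with an undefined correlation (zero variance) are skipped.

```python
import math
import random

def reconstruct(sequence):
    """Link every pair of nodes that are adjacent in the sequence."""
    adj = {}
    last = None
    for node in sequence:
        adj.setdefault(node, set())
        if last is not None:
            adj[last].add(node)
            adj[node].add(last)
        last = node
    return adj

def random_walk(adj, length, rng):
    """Unbiased walk (RW) from a random start node, returning `length` nodes."""
    current = rng.choice(sorted(adj))
    sequence = [current]
    while len(sequence) < length:
        current = rng.choice(sorted(adj[current]))
        sequence.append(current)
    return sequence

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var) if var > 0 else float("nan")

def mean_degree_correlation(original, length, repetitions=20, seed=0):
    """Average Pearson correlation between original and reconstructed degrees."""
    rng = random.Random(seed)
    values = []
    for _ in range(repetitions):
        rec = reconstruct(random_walk(original, length, rng))
        nodes = sorted(rec)  # only nodes discovered by the walk are compared
        c = pearson([len(original[i]) for i in nodes],
                    [len(rec[i]) for i in nodes])
        if not math.isnan(c):
            values.append(c)
    return sum(values) / len(values) if values else float("nan")

# Toy original network (hypothetical): a 30-node ring plus a hub (node 30)
# linked to every third ring node.
original = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
original[30] = set()
for i in range(0, 30, 3):
    original[30].add(i)
    original[i].add(30)

results = {w: mean_degree_correlation(original, w) for w in (50, 500, 5000)}
```

On such a small network a walk of 5,000 steps traverses every edge, so the reconstructed degrees coincide with the original ones and the correlation reaches 1.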

4 Results and discussions

In Section 4.1, we analyze if the local properties of the original networks are preserved for distinct biased random walks. In Section 4.2, the efficiency in recovering local properties is analyzed in the context of the knowledge acquisition task [2].

4.1 Efficiency in recovering the original properties

In our first analysis, we probed whether the properties of the reconstructed networks are consistent with the properties of the original ones, according to the procedure described in Fig 2. The scatter plot of the node degree observed in original and reconstructed networks is shown in Fig 3 for the five agent dynamics in the LFR network (with mixing parameter μ = 0.05). Each column represents a different sequence length (chosen proportionally to the original network size), while rows correspond to different agent dynamics. For each subpanel, the x- and y-axes show the node degree observed in the reconstructed and original networks, respectively. We also show in each subpanel the Pearson and Spearman correlations (C_p and C_s, respectively).

Fig 3. Each subpanel corresponds to a different configuration of agent dynamics and walk length (w). The walk length ranges from 100 to 50,000 steps.

We notice that for small sample sizes (w = 100 and w = 500) the node degree is not well represented, since the agent did not acquire enough information to create an accurate representation of the original network. This might explain previous results showing that networks reconstructed via very short walks are not consistent with their original topological nature [9]. Interestingly, larger sample sizes do not necessarily imply high correlation values. Considering the RWID agent dynamics and w = 5,000 steps, the Pearson correlation, C_p, reaches only 0.24, even though higher values were observed for the other considered dynamics. Considering the same walk length, we found C_p = 0.88 for the RWD walk. This means that the node degrees recovered by the RWD walk are consistent (i.e. linearly correlated) with the ones observed in the original network.

For large values of w, i.e. long sequences, one should expect that most of the network structure (nodes and links) is retrieved by the agents [2]. Therefore, the correlations should reach high values. In fact, this is observed for the TSAW-edge dynamics (C_p = 1). This means that this particular TSAW is not only efficient in knowledge acquisition, but it also captures the local connectivity [2]. RWD outperformed RW in this particular network. Finally, it is clear that after 5,000 steps, the RWID walk may recover relevant information regarding node connectivity but it does not perform as well as the other dynamics.

While in Fig 3 we focused on a scatter-plot analysis of a single network model, the scatter plots for the other network models are similar to the one shown for the LFR network (results not shown). The behavior of the correlations in different agent dynamics and network topologies is shown in Fig 4. The figure illustrates the evolution of the Pearson correlation for distinct sequence lengths. We also show the NMI and ARI, comparing the structure of communities found in the LFR original and reconstructed networks.

The results in Fig 4 reveal that the RWID dynamics displayed the lowest correlation values in almost all scenarios. This is especially true in networks where hubs play a prominent role, as is the case of BA and LFR networks. Therefore, avoiding hubs in networks where hubs are relevant causes a distortion in network metrics observed in reconstructed networks. Conversely, the RWD dynamics displayed competitive performance, even in networks with no evident presence of hubs (see e.g. degree in ER networks). The efficiency of RWD is more evident in BA and LFR networks, even in short sequences. The RW dynamics displayed a performance that is similar to the TSAW-edge walk. A major difference was found only for particular metrics, e.g. the eccentricity in ER networks for long sequences. Interestingly, we found that major differences in performance can be found for different versions of the TSAW walk. This is the case of the betweenness in BA networks. For walk lengths larger than 2,000 steps, the true self-avoiding rule applied on edges turned out to be more efficient than the same rule applied on nodes.

The efficiency of metrics recovery has a minor dependency on the walking strategy when considering the clustering coefficient, coreness and the NMI and ARI metrics. For the particular case of networks with community structure (i.e. LFR networks), we noticed that there are no evident differences in performance in networks with high values of the mixing parameter. The differences in performance are only evident for well-defined communities, i.e. for networks with a mixing parameter lower than 0.20. However, the NMI reaches roughly 0.50 even after long walks. In well-defined communities (mixing parameter = 0.05), we found that the structure can indeed be recovered. We also noticed that the ARI did not differ much from the NMI, presenting corresponding results. However, a large number of steps is still required to achieve high performance.

In Fig 5 we show the efficiency of the recovery for real-world networks obtained from systems of different disciplines (as described in Section 3.1). We observe the same overall behavior for both RWD and TSAW-edge dynamics. In most cases, the RWD outperforms other approaches for short sequences, while the TSAW dynamics performs better when longer sequences are considered. We can also notice that, again, the RWID dynamics had the worst performance in terms of correlation with the original network for most properties.

When comparing the efficiency of properties recovery across different networks, the Economics and Power Grid networks have almost all of the considered properties recovered with high correlation for sequences comprising more than 2,000 nodes. Conversely, for some properties in both Bio (DM-CX) and Social (JH) networks, 50,000 steps were not sufficient to recover the original metrics with high efficiency. This is the case of the clustering coefficient, eccentricity and coreness. Concerning the different network properties, the community structure could be recovered with efficiency only for three networks (according to the NMI and ARI metrics). Interestingly, we can see that no walk substantially outperformed the others regarding the recovery of network structure.

4.2 Efficiency in recovering network properties and knowledge acquisition

While in the previous section we focused on analyzing the recovery of network properties, here we also consider the knowledge acquisition performance as an additional feature of the random walks [13]. In the knowledge acquisition task, each node is considered as a piece of knowledge, and the performance metric corresponds to the fraction of the total number of nodes that have been discovered in the reconstructed network in comparison with the original network [13]. In this context, we analyze whether the reconstructed network is a good representation of the original ones in a twofold fashion: (i) the correlation of the properties of the discovered nodes; (ii) the computation of how many nodes from the original networks have been recovered.
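The two quantities in this twofold analysis can be sketched as follows. This is a minimal illustration under stated assumptions: node properties are given as dictionaries mapping node to value, the correlation is the Pearson correlation restricted to nodes present in both networks, and all function names are hypothetical.

```python
from math import sqrt


def knowledge_acquired(sequence, n_original_nodes):
    """Fraction of the original nodes that appear at least once in the sequence."""
    return len(set(sequence)) / n_original_nodes


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:  # correlation is undefined for constant data
        return float("nan")
    return cov / sqrt(var_x * var_y)


def property_correlation(original, reconstructed):
    """Correlation of a node property over the nodes found in both networks."""
    shared = sorted(set(original) & set(reconstructed))
    return pearson(
        [original[node] for node in shared],
        [reconstructed[node] for node in shared],
    )
```

Because the correlation is computed only over discovered nodes, a walk can score a high correlation while still having acquired little knowledge, which is why the two quantities are reported together.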

In Fig 6, each point on the x-axis represents the knowledge acquired by the walkers for a given sequence length L, i.e. we show the fraction of unique nodes discovered for that sequence length. On the y-axis, we show the correlation of properties obtained for the same value of L, according to the methodology described in Section 3.3. One may notice that, in most scenarios, the RWD dynamics improved the recovery correlation as more nodes are discovered. In particular scenarios, high correlation values are reached in the first steps of the walk; however, many more steps are required to discover a significant portion of the network (see e.g. the closeness metric for the BA network). We also note that the knowledge acquisition and correlation performance may increase at a similar speed (this is the case e.g. of the clustering coefficient in BA networks). As for the RWID dynamics, we observe that, in many cases, even when a large portion of the network is discovered, a low correlation is found (see e.g. the closeness centrality in BA networks). As for the TSAW, in most cases, the edge-based version presented a higher correlation when the same number of nodes was discovered. This is evident, for example, when recovering the betweenness in BA networks. When half of the network is discovered, the edge-based version presents an almost perfect correlation, while the node-based version only achieves a correlation value close to 0.50.

We have also analyzed both correlation and knowledge acquisition relationships in real-world networks. The results are shown in Fig 7. In the Facebook network, for a fixed number of discovered nodes, the highest correlation is mostly achieved with RWD dynamics. Both clustering and eccentricity metrics This result is compatible with the behavior of BA networks. Surprisingly, in the Power Grid, Economics and BIO (AI) networks, there is no evident difference in the behavior observed for distinct random walks for most of the considered metrics. The RWD again seems to provide the highest values of correlation when the same number of nodes is discovered for shortest-path-dependent metrics (closeness and betweenness) in the Web network. We also note that RWD achieved the highest accuracy for the degree, closeness and betweenness in the social network. Finally, in almost all metrics and networks, we again observed that the TSAW-edge strategy is more efficient in recovering nodes’ properties than its node-based counterpart.

All in all, the results indicate that the RWD agent dynamics performs as well as or better than the other walk dynamics in recovering local metrics when the same fraction of nodes is discovered. We should keep in mind, however, that the RWD is outperformed by other random walk strategies in the knowledge acquisition task [2, 13]. In other words, while the RWD is efficient in recovering nodes’ properties, it takes longer to discover new nodes. Another interesting finding is that many of the real-world networks displayed a behavior that is similar to the one observed for the respective models. This is the case of the Facebook network, which displayed a behavior consistent with BA networks.
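The trade-off between recovering properties and discovering nodes follows from how a degree-biased walker chooses its next node. The sketch below illustrates one possible step rule, assuming that the RWD selects a neighbor with probability proportional to its degree and the RWID with probability proportional to the inverse of its degree; the exact functional form used in the experiments is the one defined in the methodology, and the names here are hypothetical.

```python
import random


def biased_step(adjacency, current, bias, rng):
    """Pick the next node among the neighbors of `current`.

    `adjacency` maps each node to the set of its neighbors (undirected graph).
    The linear dependence on degree is an assumption made for illustration.
    """
    neighbors = sorted(adjacency[current])
    if bias == "degree":            # RWD-like: prefer hubs
        weights = [len(adjacency[v]) for v in neighbors]
    elif bias == "inverse_degree":  # RWID-like: avoid hubs
        weights = [1.0 / len(adjacency[v]) for v in neighbors]
    else:                           # unbiased random walk
        weights = [1.0] * len(neighbors)
    return rng.choices(neighbors, weights=weights, k=1)[0]


def walk(adjacency, start, length, bias="uniform", seed=0):
    """Generate a sequence of `length` visited nodes starting at `start`."""
    rng = random.Random(seed)
    sequence = [start]
    while len(sequence) < length:
        sequence.append(biased_step(adjacency, sequence[-1], bias, rng))
    return sequence
```

Under this rule a hub-seeking walker keeps returning to well-connected nodes, so their neighborhoods are sampled repeatedly while peripheral nodes are reached late.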

5 Conclusions

In the current paper, we proposed a framework to identify the efficiency in recovering network metrics arising from the reconstructed structure generated by a sequence of symbols. Networks were reconstructed using the well-known co-occurrence approach. The efficiency in recovering the network structure was evaluated by comparing reconstructed and original networks via correlation of network metrics. The analysis included four network models and six real networks. Five different random walks were evaluated. Here we focused on analyzing the ability to recover network metrics as many network-based applications depend on the accurate representation of network topology [38, 39].
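As an illustration of the reconstruction step, a co-occurrence network can be built by linking symbols that appear next to each other in the sequence. The sketch below assumes an undirected, unweighted network, a window of one (only adjacent symbols are linked) and no self-loops; these choices and the function names are assumptions made for the example.

```python
def reconstruct_cooccurrence(sequence):
    """Build an adjacency map by linking symbols that are adjacent in the sequence."""
    adjacency = {node: set() for node in sequence}
    for u, v in zip(sequence, sequence[1:]):
        if u != v:  # ignore repeated symbols instead of creating self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def degrees(adjacency):
    """Degree of every node in the reconstructed network."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


# Example: a short sequence of visited nodes.
reconstructed = reconstruct_cooccurrence(list("abcabd"))
print(degrees(reconstructed))
```

When the sequence is produced by a walk on a network, every reconstructed link is a true link of the original network, so the reconstruction can only miss nodes and edges, never invent them.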

Our experiments revealed that long walks do not necessarily yield a high correlation between original and reconstructed networks. We also found that the TSAW Edge dynamics achieved a high correlation for most of the experiments. Surprisingly, while having a similar strategy to select neighbors, the TSAW Node dynamics did not achieve competitive correlations. In modular networks, the walking strategy based on avoiding hubs did not achieve competitive performance in particular network models (e.g. RWID for all considered values of the mixing parameter). Conversely, the RWD dynamics performed well in most scenarios, especially when the sequence used in the reconstruction was shorter than 2,000 nodes. Such behavior was similar for model and real networks.

We also analyzed the interplay between network reconstruction and knowledge acquisition performance. The experiments demonstrated that the RWD outperforms other dynamics with regard to network metrics recovery efficiency. However, this random walk discovers new nodes more slowly than the others. In other words, discovering nodes faster may not translate into a good local network representation via network metrics. Finally, we also noted that the true self-avoiding walk, an efficient strategy in the knowledge acquisition task, might behave differently in recovering network metrics depending on which network elements are avoided. We found that avoiding visited edges is more efficient in network metrics recovery than avoiding nodes, according to the true self-avoiding rule.
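The difference between the two self-avoiding variants can be sketched as follows. The example assumes the commonly used true self-avoiding rule in which a candidate move is penalized exponentially in the number of previous visits, counted either per node or per edge; the penalty strength `lam` and all names are assumptions for illustration and not necessarily the settings used in the experiments.

```python
import random
from collections import Counter
from math import exp


def tsaw(adjacency, start, length, avoid="edge", lam=1.0, seed=0):
    """True self-avoiding walk that penalizes visited nodes or visited edges."""
    rng = random.Random(seed)
    visits = Counter()
    sequence = [start]
    if avoid == "node":
        visits[start] += 1
    while len(sequence) < length:
        current = sequence[-1]
        neighbors = sorted(adjacency[current])
        if avoid == "node":
            keys = neighbors
        else:  # an undirected edge is identified by its two endpoints
            keys = [frozenset((current, v)) for v in neighbors]
        counts = [visits[k] for k in keys]
        # Subtracting the minimum keeps the probabilities unchanged and
        # guarantees that at least one weight equals 1 (no underflow to zero).
        least = min(counts)
        weights = [exp(-lam * (c - least)) for c in counts]
        choice = rng.choices(range(len(neighbors)), weights=weights, k=1)[0]
        visits[keys[choice]] += 1
        sequence.append(neighbors[choice])
    return sequence
```

In the node version the walker is pushed away from any node it has already seen, whereas in the edge version it may revisit a node freely as long as it leaves through a link it has not used, which exposes more of the connections around each node.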

In general, for shorter walks, the performance in recovering the properties of the original networks can vary substantially depending on their architecture and the type of walk dynamics. In addition, for some combinations of networks, walks and metrics, even longer walks can lead to low correspondence between the real and observed properties. Such results indicate that biases can be easily formed depending on the walk dynamics, length of the sequences, and topological characteristics of the networks.

A potential application of this work is understanding how recommendation algorithms (in social media or content platforms) affect the perceived knowledge of the network. For instance, we found that the clustering coefficient was not reliably recovered from network reconstructions based on the RWD dynamics. This suggests that recommending new content to users based on its number of views (or the number of links to it, similar to the RWD dynamics) may lead to the misleading notion that related (local) contents are not interconnected among themselves but only through hubs.

Another application is understanding the different user behaviors in click-stream data [40]. This type of data covers sequences of web accesses or actions taken by users on an online platform or across the whole internet. Users may navigate across content by using different strategies, which could potentially be identified by the patterns of the reconstructed networks. A similar approach could be used to understand the foraging process of researchers in science [41], i.e., the different strategies they use to perform or seek new experiments, research questions and theories. This can be accomplished by considering researchers as agents walking across a knowledge space made from publications [42].

While this paper focused on a global recovery strategy, in future studies we intend to analyze whether different parts of the network are more easily recovered. In addition, we also intend to analyze whether other reconstruction methods lead to improved reconstruction accuracy. The results could lead to potential new approaches to model sequences as complex networks, with potential implications in applications relying on co-occurrence approaches [43]. The proposed framework can also be used to better understand and aid in developing new algorithms to estimate topological features of networks based on samples or partial observations of the data, such as in [44, 45]. In addition, future work can also focus on walks directly inspired by content recommendation strategies commonly used in media content platforms, which can lead to new ways to diversify content for the users or to mitigate biases in these platforms.

Data Availability

The data can be found from the cited sources in the paper. No new datasets are being introduced in this paper.

Funding Statement

Diego R. Amancio acknowledges financial support from CNPq (grant no. 311074/2021-9) and São Paulo Research Foundation (FAPESP grant no. 2020/06271-0). This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Sizemore AE, Karuza EA, Giusti C, Bassett DS. Knowledge gaps in the early growth of semantic feature networks. Nature human behaviour. 2018;2(9):682–692. doi: 10.1038/s41562-018-0422-4
2. Arruda HF, Silva FN, Costa LF, Amancio DR. Knowledge acquisition: A Complex networks approach. Information Sciences. 2017;421:154–166. doi: 10.1016/j.ins.2017.08.091
3. Cantwell GT, Kirkley A, Newman M. The friendship paradox in real and model networks. Journal of Complex Networks. 2021;9(2):cnab011. doi: 10.1093/comnet/cnab011
4. Rzhetsky A, Foster JG, Foster IT, Evans JA. Choosing experiments to accelerate collective discovery. Proceedings of the National Academy of Sciences. 2015;112(47):14569–14574. doi: 10.1073/pnas.1509757112
5. Costa LdF, Oliveira ON Jr, Travieso G, Rodrigues FA, Villas Boas PR, Antiqueira L, et al. Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Advances in Physics. 2011;60(3):329–412. doi: 10.1080/00018732.2011.572452
6. Akimushkin C, Amancio DR, Oliveira ON Jr. On the role of words in the network structure of texts: Application to authorship attribution. Physica A: Statistical Mechanics and its Applications. 2018;495:49–58. doi: 10.1016/j.physa.2017.12.054
7. Klishin AA, Christianson NH, Siew CSQ, Bassett DS. Learning Dynamic Graphs, Too Slow; 2022. Available from: https://arxiv.org/abs/2207.02177.
8. Wang J, Yang N. Dynamics of collaboration network community and exploratory innovation: The moderation of knowledge networks. Scientometrics. 2019;121(2):1067–1084. doi: 10.1007/s11192-019-03235-4
9. Guerreiro L, Silva FN, Amancio DR. Recovery of network topology and dynamics via sequence characterization. arXiv preprint arXiv:2206.15190. 2022.
10. Kim Y, Park S, Yook SH. Network exploration using true self-avoiding walks. Physical review E. 2016;94(4-1):042309. doi: 10.1103/PhysRevE.94.042309
11. Amit DJ, Parisi G, Peliti L. Asymptotic behavior of the “true” self-avoiding walk. Phys Rev B. 1983;27:1635–1645. doi: 10.1103/PhysRevB.27.1635
12. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Physical review E. 2008;78(4):046110. doi: 10.1103/PhysRevE.78.046110
13. Guerreiro L, Silva FN, Amancio DR. A comparative analysis of knowledge acquisition performance in complex networks. Information Sciences. 2021;555:46–57. doi: 10.1016/j.ins.2020.12.060
14. Costa LF. Learning about knowledge: A complex network approach. Physical Review E. 2006;74(2):026103. doi: 10.1103/PhysRevE.74.026103
15. Lima TS, Arruda HF, Silva FN, Comin CH, Amancio DR, Costa LF. The dynamics of knowledge acquisition via self-learning in complex networks. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2018;28(8):083106. doi: 10.1063/1.5027007
16. Latapy M, Magnien C. Complex Network Measurements: Estimating the Relevance of Observed Properties. In: IEEE Infocom 2008 - The 27th Conference on Computer Communications; 2008. p. 1660–1668.
17. Costa LF. Voronoi and fractal complex networks and their characterization. International Journal of Modern Physics C. 2004;15(01):175–183. doi: 10.1142/S0129183104005619
18. de Arruda HF, Silva FN, Comin CH, Amancio DR, Costa LF. Connecting network science and information theory. Physica A: Statistical Mechanics and its Applications. 2019;515:641–648. doi: 10.1016/j.physa.2018.10.005
19. Menczer F, Fortunato S, Davis CA. A first course in network science. Cambridge University Press; 2020.
20. Erdös P, Rényi A. On Random Graphs I. Publicationes Mathematicae Debrecen. 1959;6:290.
21. Erdös P, Rényi A. On the Evolution of Random Graphs. In: Publication of the Mathematical Institute of the Hungarian Academy of Sciences; 1960. p. 17–61.
22. Barabasi AL, Albert R. Emergence of Scaling in Random Networks. Science. 1999;286(5439):509–512. doi: 10.1126/science.286.5439.509
23. Waxman BM. Routing of multipoint connections. IEEE J Sel Area Comm. 1988;1:286–292.
24. Fire M, Puzis R. Organization Mining Using Online Social Networks. Networks and Spatial Economics. 2016;16(2):545–578. doi: 10.1007/s11067-015-9288-4
25. Peixoto TP. The Netzschleuder network catalogue and repository; 2020. doi: 10.5281/zenodo.7839981
26. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393(6684):440–442. doi: 10.1038/30918
27. Rossi RA, Ahmed NK. The Network Data Repository with Interactive Graph Analytics and Visualization. In: AAAI; 2015. Available from: https://networkrepository.com.
28. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic acids research. 2012;41(D1):D991–D995. doi: 10.1093/nar/gks1193
29. Consortium AIM, Dreze M, Carvunis AR, Charloteaux B, Galli M, Pevzner SJ, et al. Evidence for Network Evolution in an Arabidopsis Interactome Map. Science. 2011;333(6042):601–607. doi: 10.1126/science.1203877
30. Clauset A, Tucker E, Sainz M. Index of complex networks; 2016. Available from: https://icon.colorado.edu/.
31. Chen H, Chen X, Liu H. How does language change as a lexical network? An investigation based on written Chinese word co-occurrence networks. PloS one. 2018;13(2):e0192545. doi: 10.1371/journal.pone.0192545
32. Machicao J, Corrêa EA Jr, Miranda GH, Amancio DR, Bruno OM. Authorship attribution based on life-like network automata. PloS one. 2018;13(3):e0193703. doi: 10.1371/journal.pone.0193703
33. Das K, Samanta S, Pal M. Study on centrality measures in social networks: a survey. Social network analysis and mining. 2018;8(1):1–11. doi: 10.1007/s13278-018-0493-2
34. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. 2019;9(1):5233. doi: 10.1038/s41598-019-41695-z
35. McDaid AF, Greene D, Hurley N. Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515. 2011.
36. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/BF01908075
37. Steinley D. Properties of the Hubert-Arabie Adjusted Rand Index; 2004.
38. Arruda HF, Silva FN, Marinho VQ, Amancio DR, Costa LF. Representation of texts as complex networks: a mesoscopic approach. Journal of Complex Networks. 2018;6(1):125–144. doi: 10.1093/comnet/cnx023
39. Mehri A, Darooneh AH, Shariati A. The complex networks approach for authorship attribution of books. Physica A: Statistical Mechanics and its Applications. 2012;391(7):2429–2437. doi: 10.1016/j.physa.2011.12.011
40. Olmezogullari E, Aktas MS. Pattern2Vec: Representation of clickstream data sequences for learning user navigational behavior. Concurrency and Computation: Practice and Experience. 2022;34(9):e6546. doi: 10.1002/cpe.6546
41. Dubova M, Moskvichev A, Zollman K. Against theory-motivated experimentation in science; 2022. Available from: osf.io/preprints/metaarxiv/ysv2u.
42. Börner K, Silva FN, Milojević S. Visualizing big science projects. Nature Reviews Physics. 2021;3(11):753–761. doi: 10.1038/s42254-021-00374-7
43. Grames EM, Stillman AN, Tingley MW, Elphick CS. An automated approach to identifying search terms for systematic reviews using keyword co-occurrence networks. Methods in Ecology and Evolution. 2019;10(10):1645–1654. doi: 10.1111/2041-210X.13268
44. Ficara A, Fiumara G, De Meo P, Liotta A. Correlations Among Game of Thieves and Other Centrality Measures in Complex Networks. In: Fortino G, Liotta A, Gravina R, Longheu A, editors. Data Science and Internet of Things. Internet of Things. Cham: Springer; 2021.
45. Ficara A, Saitta R, Fiumara G, De Meo P, Liotta A. Game of Thieves and WERW-Kpath: Two Novel Measures of Node and Edge Centrality for Mafia Networks. In: Teixeira AS, Pacheco D, Oliveira M, Barbosa H, Gonçalves B, Menezes R, editors. Complex Networks XII. Springer Proceedings in Complexity. Cham: Springer; 2021.

PLoS One. doi: 10.1371/journal.pone.0296088.r001

Decision Letter 0

Dariusz Siudak

17 Sep 2023

PONE-D-23-25848

Identifying the Perceived Local Properties of Networks Reconstructed from Biased Random Walks

PLOS ONE

Dear Dr. Silva,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 01 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Dariusz Siudak, Ph.D., DSc.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

4. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"Diego R. Amancio acknowledges financial support from CNPq (grant no. 311074/2021-9) and São Paulo Research Foundation (FAPESP grant no. 2020/06271-0)."

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"Diego R. Amancio acknowledges financial support from CNPq (grant no. 311074/2021-9) and São Paulo Research Foundation (FAPESP grant no. 2020/06271-0). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

I propose to extend the assessment of the similarity of partitions in the original and reconstructed network using the ARI measure in addition to NMI.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In their work entitled “Identifying the Perceived Local Properties of Networks Reconstructed from Biased Random Walks”, the authors consider how to discover local properties of networks by using the reconstructed networks from biased random walks. Different walk strategies on different networks are addressed and the correlation of the local properties between the reconstructed networks and the original networks is evaluated.

The topic considered here is particularly meaningful and has wide applications, e.g. machine learning. I like this work and I think the results worth of publication potentially. However, the manuscript, in the current state, displays some flaws that the authors must address very carefully before I can recommend publication.

1. Is the data for the real networks, such as Facebook and the power grid, active and traceable? Can the authors provide related download links?

2. In Sec. 3.3, the authors investigated the results for the following set of sequence length w. However, the size of the networks may have big effect on the correlation between the reconstructed networks and the original networks. How did the authors handle this issue?

3. The sequence of nodes obtained by a random walk has a high degree of randomness. Different results will be obtained at different times. How did the authors handle this issue?

4. To improve the readability, it is better to point out they use Pearson correlation to measure the correlation of the structural properties between the reconstructed networks and the original networks.

Reviewer #2: This manuscript focuses on network reconstruction to recover the local properties of networks generating sequences of symbols for different combinations of random walks and network topologies.

The experiments are conducted on four types of network models (i.e. Erdős-Rényi, Barabási-Albert, Waxman and Modular Networks) and six types of real-world networks (i.e. Facebook, Power Grid, Econ-Poli, Web-EPA, Bio (DM-CX), socfb-JohnsHopkins55). The considered network models have 5,000 nodes, and the real ones have from 320 to 5,180 nodes.

Then, five types of random walks (i.e. traditional random walk, degree-biased random walk, inverse of degree-biased random walk, true self-avoiding random walk on nodes, true self-avoiding random walk on edges) are used to explore the initial network and generate a sequence of visited nodes which is assumed to be the information available for network reconstruction.

Degree, clustering coefficient, closeness, betweenness, eccentricity and coreness centrality of the networks are computed to analyze if relevant properties of the initial networks have been recovered.

In my opinion, the problem is certainly worth to be investigated. The results from the paper are significant and sufficient to be published in the journal.

Nevertheless, the used methodology should be expressed more clearly and logically. Maybe a pseudocode or a list of steps can be used to summarize the procedure used to prune and reconstruct the network, and also to compute the efficiency in recovering the network structure.

In the Related Works section, the authors refer to the use of random walkers exploring complex network topologies. First, I wish to suggest the following references on a recently defined centrality measure named Game of Thieves, which could be interesting for the authors’ future works:

- Ficara, A., Fiumara, G., De Meo, P., Liotta, A. (2021). Correlations Among Game of Thieves and Other Centrality Measures in Complex Networks. In: Fortino, G., Liotta, A., Gravina, R., Longheu, A. (eds) Data Science and Internet of Things. Internet of Things. Springer, Cham. https://doi.org/10.1007/978-3-030-67197-6_3

- Ficara, A., Saitta, R., Fiumara, G., De Meo, P., Liotta, A. (2021). Game of Thieves and WERW-Kpath: Two Novel Measures of Node and Edge Centrality for Mafia Networks. In: Teixeira, A.S., Pacheco, D., Oliveira, M., Barbosa, H., Gonçalves, B., Menezes, R. (eds) Complex Networks XII. CompleNet-Live 2021. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-030-81854-8_2

Then, I wish to suggest to add some additional references on network reconstruction.

Finally, the paper should be properly proofread, checking acronym definitions, grammar, punctuation, mechanics and other mistakes such as missing spaces.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jan 19;19(1):e0296088. doi: 10.1371/journal.pone.0296088.r002

Author response to Decision Letter 0

27 Nov 2023

Dear Editor,

Please find attached a revised version of our manuscript “Identifying the Perceived Local Properties of Networks Reconstructed from Biased Random Walks”, in which we have taken on board all of the referees’ suggestions. A response sheet addressing the referees’ comments is also attached.

Yours sincerely,

Additional Editor Comments:

I propose to extend the assessment of the similarity of partitions in the original and reconstructed network using the ARI measure in addition to NMI.

Response: We have conducted experiments using the ARI measure in addition to NMI. Accordingly, we have updated Figures 4 and 5 and revised the text. The changes are highlighted in blue.
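As a rough illustration of how the two partition-similarity scores are obtained, the sketch below computes both from the contingency table of two label vectors. It is not the authors' code: the function names, the toy labels, and the arithmetic-mean normalization chosen for NMI are assumptions made for this example.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie) between two flat partitions."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))  # contingency table n_ij
    rows = Counter(labels_a)                  # row sums a_i
    cols = Counter(labels_b)                  # column sums b_j
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)     # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                 # both partitions are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    mutual = sum(
        (c / n) * log(c * n / (rows[a] * cols[b]))
        for (a, b), c in pairs.items()
    )
    mean_entropy = 0.5 * (entropy(labels_a) + entropy(labels_b))
    return 1.0 if mean_entropy == 0 else mutual / mean_entropy


# Hypothetical community labels for six nodes in the original and
# reconstructed networks.
original = [0, 0, 0, 1, 1, 1]
reconstructed = [0, 0, 1, 1, 2, 2]
print("ARI:", adjusted_rand_index(original, reconstructed))
print("NMI:", normalized_mutual_information(original, reconstructed))
```

ARI subtracts the agreement expected by chance, so it can be close to zero for partitions that NMI still scores as moderately similar, which is one reason to report both.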

=====================================================================

1. Are the data for the real networks, such as Facebook and the power grid, active and traceable? Can the authors provide the related download links?

Response: Thank you for pointing this out. All the data are publicly available, and the links are already in the paper in the form of citations; we have added text to clarify this in the current version. During this review we noticed that the AI Interactions paragraph was not included in the version we submitted (we apologize for that), and we have added a paragraph with information regarding this network.

Response: Indeed, we acknowledge that the size of a network can influence the coverage of a random walk for a fixed sequence length w. To address this, we varied the value of w in our experiments. In Figure 6, the x-axis represents network coverage, which helps mitigate the dependence of our results on network size. This methodological choice allows us to focus more on comparing the dynamics within each network rather than across networks of different sizes.

Furthermore, to minimize the impact of network size variability, we carefully selected the dataset so that all real-world networks, except for the Facebook network, are of similar sizes. The artificial models were also constructed to have a number of nodes equivalent to these real-world counterparts. In the case of the Facebook network, which is notably smaller, we observed that the Pearson correlation reaches 1 more rapidly. This observation aligns with the expected influence of network size on correlation measures. We have included a new paragraph in the methodology about this issue.
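The dependence of coverage on network size for a fixed sequence length w can be illustrated with a short simulation. This is only a sketch under stated assumptions: the ring-shaped toy networks, the unbiased walker, and the definition of coverage as the fraction of distinct nodes visited are hypothetical choices for the example and are not taken from the manuscript.

```python
import random


def ring_graph(n):
    """Adjacency lists of a ring with n nodes (hypothetical toy network)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def walk_coverage(adj, w, rng):
    """Fraction of nodes visited by an unbiased random walk of w steps."""
    node = rng.choice(sorted(adj))   # uniformly chosen starting node
    visited = {node}
    for _ in range(w):
        node = rng.choice(adj[node])
        visited.add(node)
    return len(visited) / len(adj)


rng = random.Random(0)
w = 200
small = walk_coverage(ring_graph(20), w, rng)
large = walk_coverage(ring_graph(2000), w, rng)
print(f"w={w}: coverage small={small:.2f}, large={large:.2f}")
```

For the same w, the walk on the larger ring can cover at most (w + 1) / 2000 of the nodes, whereas the smaller ring is explored far more completely. Plotting results against coverage rather than w removes this size effect from the comparison.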

3. The sequence of nodes obtained by a random walk has a high degree of randomness. Different results will be obtained at different times. How did the authors handle this issue?

Response: Indeed, biases may emerge due to the random selection of starting nodes in each walk. To address this issue, we repeated the random walks multiple times for each combination of network and dynamics. Consequently, the Pearson correlation values depicted in our plots represent the mean over these repetitions. Our extensive testing revealed that increasing the number of realizations did not significantly alter the resulting curves. We have included additional details in Section 3.4 of the revised manuscript to clarify this approach.
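The averaging procedure described in this response can be sketched as follows. The example is illustrative rather than the authors' implementation: it assumes an unbiased walker, a hypothetical five-node network, node degree as the only property compared, and a degree of zero for nodes the walk never visits.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def random_walk(adj, w, rng):
    """Unbiased walk of w steps from a uniformly chosen start node."""
    node = rng.choice(sorted(adj))
    seq = [node]
    for _ in range(w):
        node = rng.choice(adj[node])
        seq.append(node)
    return seq


def reconstruct(seq):
    """Link every pair of consecutive symbols in the sequence."""
    rec = {}
    for u, v in zip(seq, seq[1:]):
        rec.setdefault(u, set()).add(v)
        rec.setdefault(v, set()).add(u)
    return rec


def mean_degree_correlation(adj, w, repetitions, seed=0):
    """Average Pearson correlation of original vs. reconstructed degree."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    original = [len(adj[v]) for v in nodes]
    total = 0.0
    for _ in range(repetitions):
        rec = reconstruct(random_walk(adj, w, rng))
        # Nodes missing from the walk get degree 0 (an assumption).
        total += pearson(original, [len(rec.get(v, ())) for v in nodes])
    return total / repetitions


# Hypothetical toy network: a hub (node 0) attached to a path 1-2-3-4.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [0, 3]}
for w in (5, 50, 5000):
    print(w, round(mean_degree_correlation(adj, w, repetitions=30), 3))
```

Each repetition draws a new starting node, so averaging over repetitions reduces the influence of any single start. Once the walk is long enough to traverse every edge, the reconstructed degrees match the original ones and the mean correlation approaches 1.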

Response: We have clarified in the “Introduction” and “Correlation analysis” sections that we adopted the Pearson correlation as our measure of quality for network reconstruction concerning observed properties. This choice was made because the Pearson correlation effectively captures the general similarity between measurements of the structural properties in the reconstructed and original networks.

=====================================================================

Degree, clustering coefficient, closeness, betweenness, eccentricity and coreness centrality of the networks are computed to analyze whether relevant properties of the initial networks have been recovered.

In my opinion, the problem is certainly worth investigating. The results of the paper are significant and sufficient for publication in the journal.

I also suggest adding some additional references on network reconstruction.

Finally, the paper should be carefully proofread for acronym definitions, grammar, punctuation, mechanics, and other mistakes such as missing spaces.

Response: Thanks for the positive comments. Regarding the pseudocode, it was inserted under Section 3.4. We have included the suggested references and discussed them in the conclusion, in particular for the general problem of using sampled information from networks to estimate topological measures. We have proofread the document to check for problems.

Attachment

Submitted filename: ResponseLetter.pdf


PLoS One. doi: 10.1371/journal.pone.0296088.r003

Decision Letter 1

Dariusz Siudak

6 Dec 2023

Identifying the Perceived Local Properties of Networks Reconstructed from Biased Random Walks

PONE-D-23-25848R1

Dear Dr. Silva,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Dariusz Siudak, Ph.D., DSc.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The topic considered here is particularly meaningful and has wide applications, e.g., in machine learning. I like this work and I think the results are worthy of publication. All the concerns I raised in my previous comments have been addressed. I thus recommend publication as is.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. doi: 10.1371/journal.pone.0296088.r004

Acceptance letter

Dariusz Siudak

11 Jan 2024

PONE-D-23-25848R1

PLOS ONE

Dear Dr. Silva,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Dariusz Siudak

Academic Editor

PLOS ONE

Associated Data


Data Availability Statement

The data can be found from the cited sources in the paper. No new datasets are being introduced in this paper.

