PLOS Computational Biology. 2022 Nov 22;18(11):e1010650. doi: 10.1371/journal.pcbi.1010650

Effective resistance against pandemics: Mobility network sparsification for high-fidelity epidemic simulations

Alexander Mercier*, Samuel Scarpino, Cristopher Moore
Editor: Feng Fu
PMCID: PMC9681106  PMID: 36413581

Abstract

Network science has increasingly become central to the field of epidemiology and our ability to respond to infectious disease threats. However, many networks derived from modern datasets are not just large, but dense, with a high ratio of edges to nodes. This includes human mobility networks where most locations have a large number of links to many other locations. Simulating large-scale epidemics requires substantial computational resources and in many cases is practically infeasible. One way to reduce the computational cost of simulating epidemics on these networks is sparsification, where a representative subset of edges is selected based on some measure of their importance. We test several sparsification strategies, ranging from naive thresholding to random sampling of edges, on mobility data from the U.S. Following recent work in computer science, we find that the most accurate approach uses the effective resistances of edges, which prioritizes edges that are the only efficient way to travel between their endpoints. The resulting sparse network preserves many aspects of the behavior of an SIR model, including both global quantities, like the epidemic size, and local details of stochastic events, including the probability each node becomes infected and its distribution of arrival times. This holds even when the sparse network preserves fewer than 10% of the edges of the original network. In addition to its practical utility, this method helps illuminate which links of a weighted, undirected network are most important to disease spread.

Author summary

Epidemiologists increasingly use social networks to understand how geography, demographics, and human mobility affect disease spread and the effectiveness of intervention strategies. While highly detailed data on human social networks are now available, the size and density of these modern networks makes them computationally intensive to study. To address this challenge, we study methods for reducing a network to its most important links. Following recent work in computer science, we use the effective resistance, which takes both local and global connectivity into account. We test this method in simulations on a U.S.-wide mobility network and find that it preserves epidemic dynamics with high fidelity. Combined with efficient epidemic simulation algorithms, our approach can facilitate a more effective response to epidemics.


This is a PLOS Computational Biology Methods paper.

Introduction

Networks are a powerful tool for understanding the effects of superspreading events, geographic and demographic communities, and other inhomogeneities in social structure on the spread of infectious diseases [1–4]. As a consequence, network-based models for simulating epidemics have become particularly popular. However, simulating a stochastic epidemic model typically takes time proportional to the number of edges or links along which the disease might spread [5]. This makes these models computationally expensive for dense networks where most nodes have edges of nonzero weight to many destinations, such as those derived from high-resolution mobility data [6,7]. This computational cost is exacerbated by the need to perform many independent runs on the same network to get a sense of the probability distribution of events, for instance to calculate the probability that each individual becomes infected or to test the effect of various intervention strategies and different initial conditions [8].

A natural way to reduce this computational cost is sparsification: choosing a subset of important links to produce a sparse network whose behavior is faithful to the original, but which is less costly to study. One popular method is simply to remove links whose weights are below a certain threshold (see [9] for an overview). This is intended to remove low-weight links that are unlikely to spread the contagion, or which have nonzero weight simply due to noise in the measurement process. However, it is unclear to what extent this naive thresholding approach preserves the true behavior of contagion spread. In particular, thresholding ignores the “strength of weak ties” [10]: a low-weight edge could play an important role in low-probability, but high-impact, events if it is one of the few ways to spread the epidemic from one region or community to another.

A more sophisticated family of sparsification algorithms comes from recent results in computer science. These algorithms efficiently solve large systems of linear equations by making them sparser, i.e., by choosing and reweighting a random subset of their terms or coefficients. By sampling in a specific way, the spectrum of the linear system is largely preserved, and thus the solution can be approximated with high accuracy [11,12].

In the context of networks, we can think of these algorithms as follows. Given a weighted, undirected network G with n nodes and m edges, we assign each edge a probability pe of being included in the sparsified network. This probability might depend on the entire network, not just on the weight of that edge, which we denote we. Let q be the fraction of edges of G we wish to preserve. Then, we form a sparse network G˜ on the same set of nodes by sampling s = qm edges independently from the distribution {pe}. If an edge e is chosen, we set its weight in G˜ to w˜e=we/(pes). An edge may be chosen multiple times, in which case we sum w˜e accordingly (so the number m˜ of distinct edges in G˜ is less than or equal to s). This reweighting from we to w˜e compensates for the fact that the network is sparser overall, while ensuring that w˜e equals the original weight we in expectation. Similarly, the weighted adjacency matrix and graph Laplacian of G˜ are equal, in expectation, to those of G.
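
To make this sampling-and-reweighting step concrete, the short sketch below (a minimal illustration with a made-up edge list, using weight-proportional probabilities as one possible choice of pe; it is not the procedure used later in the paper) checks numerically that the reweighted weights w˜e equal the original weights we in expectation.

----------------------------------------------------------------------------------
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted edge list (hypothetical): (u, v, weight)
edges = [(0, 1, 5.0), (1, 2, 1.0), (2, 3, 0.5), (0, 3, 2.0)]
w = np.array([e[2] for e in edges])

# Sampling probabilities proportional to edge weight (one possible choice of p_e)
p = w / w.sum()
s = 3            # number of samples per sparsified network
trials = 200000  # average over many independent sparsifications

w_tilde_sum = np.zeros(len(edges))
for _ in range(trials):
    picks = rng.choice(len(edges), size=s, p=p)   # sample s edges with replacement
    for e in picks:
        w_tilde_sum[e] += w[e] / (p[e] * s)       # reweight each chosen copy

print("original weights :", w)
print("mean of w_tilde  :", w_tilde_sum / trials) # approaches w as trials grows
----------------------------------------------------------------------------------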

It might seem strange to use random choices, rather than a deterministic criterion, to decide which edges of the original network to include in the sparsification. But the hope is that even if the fraction q of preserved edges is much smaller than 1, these choices cause the epidemic behavior of the network to concentrate around that of the original network, rather like the Central Limit Theorem makes empirical means converge after a moderate number of samples.

The question remains how to assign the probabilities pe. One choice is to make pe proportional to the weight we. An even simpler choice is to make these probabilities uniform, pe = 1/m for every e. However, neither of these choices takes into account whether e is structurally important—for instance, if it is the only way to cross between two communities or is redundant, with many alternate paths that play the same role.

Graph theorists and network theorists have invented various kinds of “betweenness” to measure the structural importance of an edge (see [13] for a review). However, edge betweenness takes O(nm) time to calculate for graphs with n nodes and m edges [14], making it impractical for large, dense networks. There is also a rich literature on identifying a “backbone” or “effective graph” of a network, in order to summarize its structure, preserve its statistical properties, or identify causal connections (e.g., [15–17]).

In particular, the distance backbone defined in [16] preserves all shortest-path distances on a weighted graph. This is clearly important to many types of dynamical systems on a network. On the other hand, epidemic spread is a setting in which many parallel paths can combine to transmit a disease more quickly than a single shortest path. Our goal is to sparsify networks in a way that takes this effect into account.

Here we consider a sparsification algorithm with rigorous guarantees for spectral properties, and therefore for linear dynamics, due to Spielman and Srivastava [11] (simplifying earlier work by Spielman and Teng [12]). It uses the edges’ effective resistance, denoted Re. Effective resistance can be understood by transforming the given network into an electrical circuit where each edge e becomes a resistor with resistance equal to 1/we. Then Re is the resistance of this network between e’s endpoints. This takes into account not just e, but all other possible paths between the endpoints of e, each of which reduces Re as in a parallel circuit. If e is the only path between its endpoints, then Re = 1/we. If there are many alternate paths that are short and consist of high-weight edges, then Re is small.

The effective resistance Re and the product weRe have many names in different fields, e.g., [18–22], including information distance, resistance distance, statistical leverage, current flow betweenness, and spanning edge betweenness; this last because, due to Kirchhoff’s matrix-tree theorem, weRe is the probability that a random spanning tree includes e if each spanning tree appears with probability proportional to the product of its edge weights [23]. Moreover, Re can be computed for all edges simultaneously by inverting the graph Laplacian and can be approximated in nearly-linear time using a random projection technique (see Methods). The Spielman–Srivastava algorithm chooses edges with probability pe proportional to weRe. This prioritizes edges that are the only efficient way to travel between their endpoints, for which weRe≈1. Intuitively, this strategy helps keep the network connected and preserves its global structure. Indeed, Spielman and Srivastava [11] proved that sampling just s = O(n log n) edges gives a Laplacian very close to that of the original network, making it possible to solve certain large systems of equations in nearly-linear time [12].

In our setting, if the original network is very dense with m = O(n²) edges, then we can radically sparsify it, reducing its average degree from O(n) to O(log n) and keeping just a fraction q = O((log n)/n) of the edges, as shown graphically in Fig 1. However, an ideal sparsifier will decrease the density of a network while preserving both structural and dynamical properties, so that models behave on the sparsified network much as they do on the full network. Said differently, the utility of the Spielman–Srivastava algorithm for our problem requires that weRe be a reasonable measure of edge importance in an epidemic. It is well established in network epidemiology that the spectrum of the Laplacian or adjacency matrix can be used to determine epidemic thresholds (e.g., [24–26]), since this is a matter of linear stability. However, epidemic models such as SIR are nonlinear as well as stochastic. Therefore, when evaluating the performance of various sparsification algorithms, we focus on measures that provide a richer description of the epidemic dynamics.

Fig 1. Sparsification of the U.S. mobility network.


The U.S. mobility network, where nodes are census tracts and edges are weighted according to average human mobility between census tracts. On the left, the original network with about 26 million edges. On the right, a sparsified network based on effective resistance sampling with q = 0.1, preserving about 7% of the edges of the original. Heavier-weight edges are lighter in color on a logarithmic scale. Note the mixture of local and long-range links, and how sparsification reweights edges to be heavier in order to compensate for the decrease in density.

Exploring whether effective resistance is a good measure of edge importance in the epidemic context is the goal of this paper. Some early work in this direction was done by [27] using Hamming distance as a metric of sparsifier performance. Here we confirm on a dense real-world mobility network that the Spielman–Srivastava algorithm preserves the behavior of the SIR model, even when we keep only a few percent of the original edges. This includes both the bulk behavior of the epidemic and the probability distribution of events, including the probability that each node becomes infected and the distribution of times at which it does so. According to these metrics, it achieves higher fidelity than simpler edge-sampling methods where the probability is uniform or proportional to edge weight, as well as the naive thresholding approach. While estimating the effective resistance for each edge has a higher computational cost than these methods, this cost is only incurred once for each network as a preprocessing step, after which independent trials of the SIR simulation (and, if desired, of the edge-sampling process) can be carried out at no additional cost.

In addition to reducing the computational effort required to carry out accurate simulations, these results offer some insight into which links are important for disease spread, both topologically and quantitatively.

Materials and methods

Commuter network data set

The real-world mobility network was constructed from publicly available United States Census Bureau inter-census-tract commuting flows for all fifty states. Each node is a single census tract, and integer edge weights denote the amount of inter-census-tract human mobility provided by the United States Census Bureau through a summary of Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES) across Origin-Destination (OD), Residence Area Characteristic (RAC), and Workplace Area Characteristic (WAC) data types for the year 2016. This commuting data is directional, so a priori this network is directed. Since effective resistance and many other sparsification techniques assume an undirected graph (which is standard in the field of network epidemiology [28]), we set the undirected edge weight for (i, j) to the average of the two directional weights between i and j. The resulting network comprises n = 72,721 nodes and m = 26,319,308 edges, giving it an average topological degree of 2m/n = 723.8. Each node has three pieces of metadata: the population of the census tract, its land area in square meters, and its Geographic Identifier (GEOID). Standardized census tract GEOIDs are used to merge node data with United States Census Bureau cartographic boundary shape files of all fifty states and assign corresponding Rural-Urban Commuting Area (RUCA) codes, standardized by the Economic Research Service of the United States Department of Agriculture. These codes range from 1 (urban core) to 10 (rural) and summarize the level of urbanization, daily commuting, and population density of the given census tract according to United States Census Bureau standards.

SIR simulation algorithm

To simulate the SIR model on large, complex networks, we use a continuous-time, event-driven Gillespie algorithm [29]. This algorithm stores a “heap” of potential events; a heap is an efficient implementation of a priority queue that allows events to be added or removed in O(log n) time. At each step the event in the heap with the smallest (soonest) time becomes the current event. It is removed from the heap, and new events driven by that event are added to the heap. Specifically, whenever a node becomes infected, infection events for all its neighbors are added to the queue along with its own recovery event. If an infection event i→j is drawn from the heap, we check that i is still infected and j is still susceptible; this is simpler than removing these events from the heap when, say, i recovers.

If the overall infection rate is β, each edge e transmits at a rate βwe, and we assume all nodes have the same recovery rate γ. New events are given a time equal to the current time t plus a waiting time Δt, where Δt is drawn from an exponential distribution with mean 1/(βwe) or 1/γ for infection and recovery events respectively. This exponential distribution assumes that infection and recovery are continuous-time Markov processes with constant rate, but the same approach easily generalizes to more complicated waiting time distributions.

For analysis, each run of the SIR simulation outputs whether each node becomes infected, and its arrival time if so. It also outputs the total number of nodes susceptible, infected, or recovered as a function of time, the total computation time on the CPU, and the maximum heap size.
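
The following is a minimal sketch of an event-driven SIR simulation of this kind, assuming the network is stored as a weighted adjacency dictionary. The function name and data layout are our own illustrative choices; this is not the implementation used for the experiments.

----------------------------------------------------------------------------------
import heapq
import random

def sir_event_driven(adj, beta, gamma, initial_infected, t_max=20.0, seed=0):
    """Continuous-time SIR on a weighted graph.
    adj: dict mapping node -> dict of {neighbor: weight}.
    Returns a dict of arrival (infection) times for nodes that became infected."""
    rng = random.Random(seed)
    status = {v: "S" for v in adj}
    arrival = {}
    heap = []  # entries: (time, kind, source, target)

    def infect(v, t):
        status[v] = "I"
        arrival[v] = t
        heapq.heappush(heap, (t + rng.expovariate(gamma), "R", v, v))  # recovery
        for u, w_e in adj[v].items():                                  # transmissions
            if status[u] == "S":
                heapq.heappush(heap, (t + rng.expovariate(beta * w_e), "T", v, u))

    for v in initial_infected:
        infect(v, 0.0)

    while heap:
        t, kind, src, dst = heapq.heappop(heap)
        if t > t_max:
            break
        if kind == "R":
            status[src] = "R"
        elif status[src] == "I" and status[dst] == "S":  # lazily discard stale events
            infect(dst, t)
    return arrival

# Toy example: a weighted path graph with three nodes
adj = {0: {1: 2.0}, 1: {0: 2.0, 2: 0.5}, 2: {1: 0.5}}
print(sir_event_driven(adj, beta=1.0, gamma=1.0, initial_infected=[0]))
----------------------------------------------------------------------------------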

Network sparsification

We use two types of sparsification methods: weight thresholding and edge sampling. Both methods focus on edge reduction and do not change the number of nodes. Weight thresholding removes all edges below a specified weight. For comparison with the edge sampling sparsifiers, we vary this threshold to preserve a given fraction of edges.

We employ three edge-sampling algorithms: uniform sampling, sampling by weight, and sampling according to effective resistances. These algorithms follow the same general scheme but differ in how much information they use about each edge, ranging from the most naive (uniform) to the most sophisticated (effective resistances). Each sampling procedure computes an importance ue for each edge, where ue for the three methods is shown in Table 1. It then samples s edges with replacement with probability pe = ue/∑e′ ue′ and adds the sampled edges to the sparse network with weight w˜e = we/(pes), as shown in Algorithm 1.

Table 1. Random Sampling Importances.

Sampling Procedure             Edge Importance ue
Uniform (Uni)                  1
Weights (Wts)                  we
Effective Resistance (EffR)    weRe

----------------------------------------------------------------------------------

Algorithm 1: Edge-Sampling Sparsification

----------------------------------------------------------------------------------

Input: dense network G(V,E,ϕ)

Output: sparse network G˜(V,E˜,ϕ˜)

Parameters: number of samples s, and edge importances {ue}

Procedure:

Take s samples with replacement:

    Choose a random edge e from G with probability pe ∝ ue

    Add edge e to G˜ with weight w˜e = we/(pes)

Sum weights if an edge is chosen more than once

----------------------------------------------------------------------------------

Since edges are chosen with replacement, the same edge may be sampled multiple times: in that case its weight is summed, i.e., multiplied by the number of times it is sampled.
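
A compact Python sketch of Algorithm 1 is shown below. It is illustrative rather than the published implementation; the edge-list representation and function name are our own choices.

----------------------------------------------------------------------------------
import numpy as np

def sparsify_by_sampling(edges, weights, importances, s, seed=0):
    """Edge-sampling sparsification (Algorithm 1).
    edges: list of (u, v) pairs; weights, importances: arrays of the same length.
    Samples s edges with replacement with p_e proportional to the importances,
    reweights each chosen copy to w_e / (p_e * s), and sums repeated choices."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    u = np.asarray(importances, dtype=float)
    p = u / u.sum()

    sparse = {}
    picks = rng.choice(len(edges), size=s, p=p)
    for e in picks:
        sparse[edges[e]] = sparse.get(edges[e], 0.0) + w[e] / (p[e] * s)
    return sparse  # dict: (u, v) -> reweighted weight

# Example with the three importance choices of Table 1 (effective resistances
# would be precomputed as in the next subsection and passed in the same way)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
weights = [5.0, 1.0, 0.5, 2.0]
print(sparsify_by_sampling(edges, weights, np.ones(4), s=3))   # uniform
print(sparsify_by_sampling(edges, weights, weights, s=3))      # by weight
----------------------------------------------------------------------------------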

Reweighting edges from we to w˜e in this way ensures that E[w˜e] = we, and therefore that E[A˜] = A and E[L˜] = L, where A˜, L˜ and A, L are the adjacency matrices and Laplacians of G˜ and G respectively. Thus, the sparsifier preserves the linear properties of the original network in expectation. Moreover, Spielman and Srivastava [11] showed that if we sparsify using effective resistances and s is a sufficiently large constant times n log n, then L˜ is concentrated around L in a powerful sense. Let ε > 0 be the desired error in the Laplacian. Then if s ≥ 8n log n/ε², with probability at least 1/2 (and tending to 1 as s increases further), for all vectors x ∈ ℝn we have

$(1 - \varepsilon)\, x^T L x \;\le\; x^T \tilde{L} x \;\le\; (1 + \varepsilon)\, x^T L x$  (1)

As a result, L˜ approximately preserves the important eigenvectors and eigenvalues of the original Laplacian L.

Effective resistance

The effective resistance between any two given nodes is the resistance across them in a network of resistors where each edge has conductance we or equivalently resistance 1/we. The entire matrix of effective resistances can be computed from the graph Laplacian as follows: the effective resistance for an edge e = (i, j) is

$R_e = (e_i - e_j)^T L^{+} (e_i - e_j)$  (2)

where L+ is the pseudoinverse of L and ei is the basis column vector with 1 at position i and zeros elsewhere.
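
For a small graph, Eq (2) can be evaluated directly from a dense pseudoinverse. The sketch below is a toy illustration of this formula (not the approach used for the full commuting network), and also prints the sampling importances weRe.

----------------------------------------------------------------------------------
import numpy as np

# Toy weighted, undirected graph: edge list with weights w_e (conductances)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
weights = [5.0, 1.0, 0.5, 2.0, 1.5]

# Weighted graph Laplacian L = D - A
L = np.zeros((n, n))
for (i, j), w in zip(edges, weights):
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

L_pinv = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse of L

# Eq (2): R_e = (e_i - e_j)^T L^+ (e_i - e_j) = L+_ii + L+_jj - 2 L+_ij
importance_sum = 0.0
for (i, j), w in zip(edges, weights):
    R = L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j]
    importance_sum += w * R
    print(f"edge ({i}, {j}): R_e = {R:.3f}, w_e R_e = {w * R:.3f}")

# Sanity check: for a connected graph the importances w_e R_e sum to n - 1
print("sum of w_e R_e =", round(importance_sum, 6), "(expect n - 1 =", n - 1, ")")
----------------------------------------------------------------------------------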

Computing the exact pseudoinverse of L is computationally expensive for large networks. Therefore, Spielman and Srivastava [11] approximate the effective resistances using a random projection technique. This yields approximate resistances R˜e such that, for all edges e,

$(1 - \varepsilon) R_e \;\le\; \tilde{R}_e \;\le\; (1 + \varepsilon) R_e$  (3)

where ε can be made as small as desired at increased computational cost. These approximate resistances can then be used by the Spielman–Srivastava algorithm with a slightly larger error parameter ε in (1).

We carried out this approximation for ε = 0.1 using the implementation of Koutis et al. [30], which takes time Õ(m log(1/ε)) where Õ hides factors logarithmic in n. This procedure took several hours for the U.S. commuting network; note that we only need to do this once, after which we can generate sparsified networks with any desired density and run the SIR model on each one as many times as we like.
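
The sketch below illustrates the random-projection idea on a toy graph. It is a simplified stand-in for the solver of Koutis et al., not that implementation: a dense least-squares solve replaces the fast Laplacian solver, and the projection dimension k is chosen loosely rather than from the theory.

----------------------------------------------------------------------------------
import numpy as np

rng = np.random.default_rng(1)

# Toy weighted graph
n = 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]
weights = np.array([2.0, 1.0, 0.5, 1.5, 1.0, 0.7])
m = len(edges)

# Signed incidence matrix B (m x n) and Laplacian L = B^T W B
B = np.zeros((m, n))
for idx, (i, j) in enumerate(edges):
    B[idx, i], B[idx, j] = 1.0, -1.0
L = B.T @ (weights[:, None] * B)

# Random +/-1 projection Q (k x m); larger k gives a smaller approximation error
k = 200
Q = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)
Y = Q @ (np.sqrt(weights)[:, None] * B)        # k x n

# Solve L z = y for each projected row; lstsq returns the pseudoinverse solution
Z = np.linalg.lstsq(L, Y.T, rcond=None)[0].T   # k x n, approximates Q W^(1/2) B L^+

# Compare approximate and exact effective resistances
L_pinv = np.linalg.pinv(L)
for (i, j) in edges:
    approx = np.sum((Z[:, i] - Z[:, j]) ** 2)
    exact = L_pinv[i, i] + L_pinv[j, j] - 2 * L_pinv[i, j]
    print(f"edge ({i}, {j}): approx {approx:.3f}  exact {exact:.3f}")
----------------------------------------------------------------------------------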

Experiments and parameters

To compare the four sparsification methods, ten sparse graphs of each type were created with a varying number of edges. For thresholding, we set the threshold in order to retain a given fraction of the edges, ranging from 0.2 to 0.9 in steps of 0.1. Since these sparsifications performed poorly compared to the edge-sampling methods, we did not reduce the density further.

For the edge-sampling methods, we set the number of samples for each method to s = qm where q = 0.001, 0.00325, 0.0055, 0.00775, 0.01, 0.0325, 0.055, 0.1, 0.55, and 1. Since edges can be sampled more than once, the fraction of edges preserved by the resulting sparse network can be slightly smaller.

We chose two representative initial conditions, where the initial infections are localized or dispersed. The localized initial condition consists of a single infected node, namely the census tract containing John F. Kennedy International Airport. In the dispersed initial condition, we infect 1% of the nodes chosen with probability proportional to their population.

We chose representative values of the parameters β and γ in order to create a nontrivial distribution of infection probabilities and arrival times, thus posing a challenge for sparsification. In both cases we set the recovery rate to γ = 1. We set β ≈ 0.0014 and β ≈ 0.0046 for the localized and dispersed initial condition, respectively.

Since each edge with weight we transmits the infection at rate weβ, and since the average weighted degree of a node in the network is w̄ ≈ 1782, these correspond to reproductive numbers R0 = w̄β/γ ≈ 2.50 and R0 ≈ 8.20 respectively, although of course in a heterogeneous network no single value of R0 is sufficient to model the dynamics [31].

For each sparsified network and each initial condition, 1000 simulations were run for analysis. We ran the simulation for a maximum time tmax = 20: in most but not all simulations, all nodes were either recovered or susceptible at that point.

Sparsifier evaluation

For each sparsifier, the fidelity of the stochastic SIR dynamics to that on the original network was assessed through three metrics: the probability of infection for each node across 1000 simulations, the distribution of arrival times for each node, and the average SIR curve across all simulations.

We compared the infection probabilities with scatterplots as shown in Fig 2, where for each node the horizontal and vertical coordinates are its infection probability in the original and sparsified network respectively. If the sparsifier preserved these probabilities exactly, all nodes would fall on the diagonal. We measured various quantities such as the squared correlation R² and the L1 and L2 distances to confirm that these probabilities are approximately preserved. The arrival time distributions, i.e., the distributions of the time at which each node becomes infected, were also compiled from the 1000 simulations.
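
As an illustration of how these comparisons are computed, the snippet below derives per-node infection probabilities from Boolean outcome arrays and reports R² and the L1 and L2 distances. The arrays here are random placeholders standing in for actual simulation output, and the variable names are hypothetical.

----------------------------------------------------------------------------------
import numpy as np

# Boolean outcome arrays of shape (runs, nodes): True if the node was ever infected.
# Random placeholders are used here purely to show the computation.
rng = np.random.default_rng(2)
infected_full = rng.random((1000, 500)) < 0.6
infected_sparse = rng.random((1000, 500)) < 0.6

p_full = infected_full.mean(axis=0)      # per-node infection probability, original
p_sparse = infected_sparse.mean(axis=0)  # per-node infection probability, sparsified

# Agreement metrics: squared correlation and L1 / L2 distances between the vectors
r2 = np.corrcoef(p_full, p_sparse)[0, 1] ** 2
l1 = np.abs(p_full - p_sparse).sum()
l2 = np.sqrt(((p_full - p_sparse) ** 2).sum())
print(f"R^2 = {r2:.3f}, L1 = {l1:.2f}, L2 = {l2:.2f}")
----------------------------------------------------------------------------------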

Fig 2. Sparsification of the U.S. mobility network: preserving infection probabilities.


Scatterplots of infection probabilities for localized (A) and dispersed (B) initial conditions for four sparsification methods: effective resistance, uniform sampling, weight-based sampling, and simple thresholding. For each sampling method we chose 0.1m edges of the original network, and we set the threshold to retain the 10% highest-weighted edges. Each dot represents a node in the network. The horizontal and vertical axes give the probability that the node becomes infected using the original and sparse network respectively, based on 1000 independent runs of the SIR model. Dots close to the diagonal are those for which sparsification preserves this probability, and the blue dots represent the 90% of nodes closest to the diagonal. While weight-based sampling achieves a similar R², it is not as good at preserving infection probabilities for low-probability nodes, especially for dispersed initial conditions. Effective resistance preserves the infection probability for both low- and high-probability nodes.

We compared these distributions using the Arrival Time Error Score (ATES) averaged over all nodes. As stated in the main text, for each node this score is the Wasserstein distance between the arrival time distributions conditioned on the node becoming infected, with a penalty tmax if the node has a zero infection probability in one network but a nonzero infection probability in the other, so that one of the two arrival time distributions is undefined. Typically, this occurred because the sparsifier cut that node off from the rest of the network. If a node has zero infection probability in both networks, we assign an error score of zero. Thus the ATES is small only if the sparsifier is faithful to the stochastic behavior of the SIR model on the original graph in two senses: 1) the two networks agree on which nodes are infected with nonzero probability, and 2) when a node does become infected, they agree on the distribution of times at which it does so.

The Wasserstein distance of two distributions is the average difference between a point from one distribution (in this case, an arrival time) and the corresponding point in the other distribution, minimized over all possible probability-preserving correspondences. This is also called the “earthmover distance” and comes from optimal transport theory [32]: it is the minimum, over all ways to transport earth from one pile to another, of the average distance we have to move each unit of earth. Formally, if X and Y are two random variables, their Wasserstein distance is

$W(X, Y) = \inf_{\pi \in \Gamma(X, Y)} \int |x - y| \, d\pi(x, y)$  (4)

where the infimum is taken over all couplings π between the two distributions, i.e., over the space Γ(X, Y) of joint distributions on X × Y whose marginals are the distributions of X and Y.

For distributions of one real-valued variable such as the arrival time, the optimal coupling is simple: we match times in the two distributions according to their quantiles, i.e., their cumulative distribution functions C(x) = Pr[X < x] and D(y) = Pr[Y < y]. Then

$W(X, Y) = \int_0^1 \left| C^{-1}(z) - D^{-1}(z) \right| \, dz$  (5)

Numerically, for each node we take the list of arrival times at which it became infected in the original network (from the subset of simulation runs in which it did so) and the analogous list in the sparsified network. We sort each list from smallest (earliest) to largest, giving times s1 < s2 < … and t1 < t2 < …. If these lists are the same size ℓ, then W(X, Y) is just the average over all 1 ≤ i ≤ ℓ of |si − ti|. (If the lists are of different sizes ℓ1 and ℓ2, we rescale them to produce two empirical distributions, assigning si in part to t⌊(ℓ2/ℓ1)i⌋ and in part to t⌈(ℓ2/ℓ1)i⌉.) Thus, for each node, W(X, Y) is the absolute difference in its arrival time for the two networks, averaged over all simulations, conditioned on the event that it becomes infected.
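
In code, this quantile-matching computation is short. The sketch below (with made-up arrival times) computes the equal-length case by matching sorted samples, and checks the result against scipy.stats.wasserstein_distance, which also handles lists of different lengths.

----------------------------------------------------------------------------------
import numpy as np
from scipy.stats import wasserstein_distance

# Toy arrival-time samples for one node (hypothetical values)
orig = np.array([0.8, 1.1, 1.3, 2.0])    # times on the original network
spars = np.array([0.9, 1.0, 1.6, 1.9])   # times on a sparsified network

# Equal-length case: match sorted samples quantile by quantile, as in Eq (5)
w1_manual = np.mean(np.abs(np.sort(orig) - np.sort(spars)))

# General case (lists of any lengths): 1-D Wasserstein distance from scipy
w1_scipy = wasserstein_distance(orig, spars)

print(w1_manual, w1_scipy)  # identical here because the lists have equal length
----------------------------------------------------------------------------------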

We note that some other common measures of distance between probability distributions, such as the Kullback-Leibler divergence, do not pay attention to the magnitude of the temporal error. For instance, the KL divergence is large for two arrival time distributions that are both narrowly peaked regardless of whether the distance between those peaks is large or small. Similarly, the KL divergence is low between two distributions with a large overlap, regardless of how severe the temporal difference is in their non-overlapping parts.

Results

We focus our attention on a U.S.-wide network of commuting patterns based on data from the U.S. Census Bureau (see Methods). Mobility data has proven a powerful tool for incorporating realistic social contact and population connectivity into epidemiological models [33–35], which has been especially true during the SARS-CoV-2 pandemic [36–41]. For example, [42] showed that reductions in commute flow correlated with lower SARS-CoV-2 prevalence in New York City. However, it is well established that commute flows alone are almost certainly too simplistic to capture the dynamics of modern epidemics in humans [3,43]. Thus, while we treat this network as a test case for sparsification in modern data drawn from human social structure because it is large, dense, and highly heterogeneous, we do not claim that it is an accurate representation of the epidemiological contact network. In particular, our simulations simplistically assign a single state (Susceptible, Infected, or Recovered) to each census tract. On the other hand, a more realistic model where each census tract has population-level variables (the fraction of people in that tract with each SIR state) would also be simplified by sparsification, since each edge corresponds to a coupling term in the resulting set of differential equations. This network has roughly 73 thousand nodes, each corresponding to one census tract, and roughly 26 million edges. Edge weights are given by commuting flows averaged over 2016. Its average topological degree (the number of connections with nonzero weight) is 724. Fig 1 shows a sparse version of this network, using effective resistance to sample s = qm edges with q = 0.1. Since some edges are chosen multiple times, this process preserves about 7% of the original number of edges.

We simulated an SIR model both on this network and on sparse networks generated by the three edge-sampling methods described above, where probabilities are uniform, proportional to edge weight, or based on effective resistance. For comparison, we also included the simple threshold method. In each case, we varied the fraction of edges preserved by varying either the weight threshold or the fraction q of edges sampled.

Our SIR simulations are continuous-time, event-driven, stochastic Markov processes. Each edge e transmits the disease at rate βwe, and each infected node recovers at rate γ. We consider two types of initial condition: a localized one, where only the node corresponding to JFK International Airport in New York City is infected, and a dispersed one, where 1% of the nodes (a total of 727) are chosen with probability proportional to their population. We chose representative values of β and γ to create a wide range of probabilities with which individual nodes become infected, and a wide range of arrival times at which they do so.

Since epidemics are inherently stochastic, we are not only concerned with macroscopic quantities like the fraction of the population that is susceptible or infected at a given time. We also wish sparsification to preserve important aspects of the probability distribution of events. In particular, we are concerned with the probability that each node becomes infected and the distribution of arrival times when it becomes infected. In order to have a distribution of events for each node, simulations were run independently 1000 times for each set of initial conditions and for each sparsifier.

In Fig 2, we show how the three sampling methods and simple thresholding perform at the task of preserving infection probabilities. As in Fig 1, we preserve about 7% of the distinct edges of the original network. Each dot represents a node, with the probability it becomes infected on the original network and the sparsified network plotted on the horizontal and vertical axes, respectively. Dots on the diagonal are those for which these probabilities are the same. We see that sampling with effective resistances accurately preserves these probabilities across the entire distribution, from low to high probability, and in both initial conditions. Weight-based and uniform sampling perform reasonably well, but with larger errors, as shown by the distance of the dots from the diagonal. Naive thresholding performs quite poorly, even though it retains 10% of the original edges.

We are also interested in the arrival time distribution of each node, i.e., the distribution of times at which it becomes infected [44,45]. We define the error with which a sparsification method preserves this distribution as follows. First, we use the Wasserstein distance to compare the arrival time distributions of each node in the original and sparse network, ignoring the runs of the SIR model in which it never becomes infected. The Wasserstein distance is essentially the average difference between the times at which the node becomes infected in the two networks (see Methods for a formal definition). However, if a node’s infection probability is nonzero in one network but zero in the other so that its arrival time distribution is empty, we impose a penalty equal to the duration tmax of the simulation (the maximum possible Wasserstein distance). We call the average of this quantity across all nodes the Arrival Time Error Score. Its value is low if the same nodes have nonzero infection probability in both networks—in particular, if the sparsifier doesn’t cut potentially infected nodes off from the rest of the network—and if each such node becomes infected at a similar distribution of times.

We show this error score, averaged over 1000 runs of the SIR model, for the four sparsification methods in Fig 3 for the localized (A) and dispersed (B) initial conditions. In both cases, even when we only preserve a few percent of the original edges, sampling by effective resistances (EffR) achieves a small error. Uniform and weight-based sampling also perform reasonably well on this measure. However, in the dispersed initial condition both are significantly worse than effective resistance when we preserve less than 10% of the original edges. In the localized initial condition, weight-based sampling is comparable to effective resistance, but uniform sampling is significantly worse below about 5%.

Fig 3. Sparsification of the U.S. mobility network: average Arrival Time Error Score.


Arrival Time Error Score averaged over nodes and over 1000 independent simulations for the (A) localized initial condition and (B) dispersed initial condition. The shaded regions correspond to one standard deviation of the average, which for several of these curves is too small to see. The horizontal axis shows the fraction of distinct edges of the original network preserved by the sparsifier. Note that this fraction is slightly less than q, since edge-sampling algorithms can choose the same edge multiple times.

For illustration, in Fig 4 we show the arrival time distributions for two representative nodes. We compare these distributions on the original network with the three edge-sampling sparsifiers. All three reproduce the shape and width of these distributions fairly well, giving a small Wasserstein distance. However, since these distributions are conditioned on the event that these nodes become infected, this comparison does not measure whether the infection probabilities are the same.

Fig 4. Sparsification of the U.S. mobility network: representative arrival time distributions.


Arrival time distributions for the original network (Org) and sparsified networks produced by the three edge-sampling methods with 7% of the original edges preserved. In each graph, we show the probability density of the time at which a particular node becomes infected, conditioned on the event that it becomes infected during the epidemic. We show this distribution for two representative nodes (top and bottom) under (A) the localized initial condition and (B) the dispersed initial condition. The top node is in a well-connected part of the network, with typical arrival times ranging from 0.5 to 1.8 in the localized initial condition and from 0.05 to 0.25 in the dispersed initial condition. The bottom node is in a sparser region and more remote from the initial infection, giving it arrival times of 3–8 and 0.2–1.5 in the localized and dispersed initial conditions respectively. All three edge-sampling methods do fairly well at reproducing the shape of these arrival time distributions.

Indeed, as shown in Fig 5, the main reason why uniform and weight-based sampling have a larger error is that they disconnect a significant fraction of the nodes from the rest of the network, giving them a zero infection probability.

Fig 5. Sparsification of the U.S. mobility network: average fraction of disconnected nodes.


The average fraction of nodes disconnected from the largest connected component of the network by sparsifiers of different types. The horizontal axis shows the fraction of edges of the original network preserved by the sparsifier. Even when only 1% of the edges are preserved, effective resistance keeps almost all the nodes connected, while uniform and weight-based sampling disconnect 5% and 7% of the nodes, respectively.

In contrast, effective resistance keeps almost all nodes connected even when preserving just 1% of the original edges.

We also studied the performance of these sparsifiers on census tracts of different types. We use the RUCA (Rural-Urban Commuting Area) codes of the U.S. Department of Agriculture, which classify tracts based on population density, level of urbanization, and daily commuting. As shown in Fig 6, sampling by effective resistances performs well in all 10 types, indicating that it is faithful to a wide range of network structures and mobility patterns. Specifically, sampling by effective resistances achieves the minimum Arrival Time Error Score across all RUCA categories, in both the core and periphery of dense metropolitan-, micropolitan-, or town-like areas. As we discuss in the next section, the other sparsification techniques do more poorly in low-commuting areas.

Fig 6. Map of U.S. Census Tracts by RUCA category.


(A) Map of the U.S. where each census tract is color-coded according to its RUCA designation: 1, 2, and 3 are metropolitan core, high, and low commuting, respectively; 4, 5, and 6 are micropolitan core, high, and low commuting; 7, 8, and 9 are small town core, high, and low commuting; and 10 is rural. On the right, we show the arrival time error for sparsifiers that sample s = 0.1m edges across the 10 RUCA categories for (B) the localized initial condition and (C) the dispersed initial condition. Across all RUCA categories, effective resistance performs better than uniform or weight-based sampling. This effect is especially pronounced for low-commuting areas, which have fewer or lower-weighted edges connecting them to the rest of the network. Base map provided by the U.S. Census Bureau (https://www2.census.gov/geo/tiger/TIGER2016/TRACT/) and in the public domain.

Discussion

As discussed above, naive thresholding ignores the fact that low-weight edges, individually or in the aggregate, can contribute to disease spread. Removing all such edges can separate communities, isolate rural areas, and remove important long-range links. In contrast, the edge-sampling methods studied here choose at least some low-weight edges, and can thus preserve their contribution to the network structure. Edge-sampling methods also give a principled way to reweight edges to compensate for sparsification, and thus preserve the rates and timescales of the SIR model, providing a convenient single-parameter slider (the number of samples taken) between a sparse network and the original network in expectation.

The question remains why effective resistance is better at preserving infection probabilities and arrival time distributions than weight-based or uniform sampling. We believe this is because effective resistance gives the highest priority to edges where few alternate paths exist—that is, edges that are the only efficient way to travel between their endpoints.

For instance, consider a sparse, tree-like region of the network, with few short loops. For each edge we have Re ≈ 1/we so the product weRe is close to its maximum value of 1. Thus, effective resistance recognizes the essential role each edge plays in the structure of this region and it gives these edges a priority at least as high as any others in the network. In contrast, weight-based and uniform sampling may let this region become disconnected, dropping nodes with few neighbors or low weighted degree from the network backbone. They will also tend to under-sample edges in such regions, focusing on other regions with larger total edge weight or simply more edges. Similarly, effective resistance recognizes the importance of long-range edges that are the only easy way to cross from one part of the world to another, since we again have weRe ≈ 1. Weight-based sampling gives these edges low priority if we is small, and uniform sampling will keep them only by chance, despite their importance in spreading disease at large scales across the network.

Now consider a dense region of the network with many short loops, and thus many competing paths of various weights between most pairs of nodes. In such a region we need to preserve edges that represent the most-likely paths for disease spread. Uniform sampling ignores both topology and edge weights, so even if it keeps the region connected it makes no attempt to sample more important edges. Weight-based sampling is a reasonable heuristic, since high weight means a high probability of transmission, but, unlike effective resistance, it does not take alternate paths into account and may disconnect nodes that only have edges with low weight.

This picture is borne out by the performance of these sparsification methods in different census tract types (Fig 6). In low-commuting areas, whose network structure is sparser (according to topological degree) and/or lower weight (according to weighted degree), uniform and weight-based sampling disconnect a larger fraction of nodes than effective resistance. This results in the Arrival Time Error Score of nodes with low infection probability being higher in weight-based sampling than either uniform sampling or sampling by effective resistances. In high-commuting areas, even though it keeps the network more connected, uniform sampling has a higher error than effective resistance in the arrival time distribution conditioned on a nonzero infection probability, indicating that it is not choosing the most important edges for disease spread.

Fig 6 also shows that the distinction between core, high-, and low-commuting areas is more important than the distinction between metropolitan, micropolitan, small town, or rural. Thus, regardless of the level of urbanization, effective resistance does a better job of preserving network structure and epidemic dynamics than the other methods. It keeps nodes connected even if they are in sparser, lower-weight regions, and it selects edges with high structural importance even if they have low weight.

We note that it is common in the literature, including in more sophisticated thresholding methods, to require that every node keeps at least one edge in an effort to keep the network connected (e.g., [15]). One could add that requirement to any of these methods, but it is unclear how to reweight these edges of last resort in a principled manner.

Conclusion

At a practical level, sparsification significantly reduces computation time. Our simulation takes about 12 minutes for a single run on the original network, and only about 1.75 minutes on a sparse network with about 7% of the original edges. While other implementations may be faster overall, a similar speedup will apply.

Epidemic simulations typically take a total of O((n + m) log n) time for networks with n nodes and m edges (e.g., [5,46]). This is because there are n + m possible recovery and infection events, and they can be managed with a data structure where events can be added or retrieved in O(log n) time. For dense networks where m = O(n²), the typical running time is then O(m log n), i.e., nearly linear in m. Thus, sparsification can be used as a preprocessing step for a wide range of epidemic models, reducing their computation time by roughly the same ratio as the fraction of edges preserved. We note a recent SIR algorithm [47] that takes O(m log log n) time using more sophisticated data structures. This algorithm would also benefit from sparsification, since all m edges have to be inserted into these data structures at the outset.

Beyond this practical application, these results shed light on which edges are the most important for disease spread. In particular, they suggest that effective resistance is a better guide to an edge’s importance than its weight in the epidemic context. We find it interesting that a technique designed to preserve linear-algebraic properties of a weighted graph also preserves nonlinear stochastic dynamics of an epidemic model. More generally, effective resistance belongs to a rich class of sparsifiers which seek to preserve dynamical, rather than topological, properties of a graph.

One caveat is that, while this method of sparsification preserves epidemic dynamics, it can obscure the original edge weights. For instance, if there is a bundle of low-weight edges that cross between two communities, the Spielman–Srivastava algorithm will choose one of them and give it a high weight, in essence designating it as the representative of the entire bundle. This makes sense in contexts like epidemic spreading where bundles of parallel edges can work together and act as one high-weight edge. But in some other contexts, such as genetic regulatory networks, where the goal is to understand the functional role of each link and where edge weights are of independent scientific interest, weight-preserving methods like those of [16,17] may be more appropriate.

We believe further work in this area, in both epidemiology and other biological network models, will help us understand how to identify which edges are important to a given dynamical process.

Acknowledgments

This work was carried out as part of an REU (Research Experience for Undergraduates) program at the Santa Fe Institute under the mentorship of Cristopher Moore and Maria Riolo.

Data Availability

All data and code used for running network sparsification, models, and plotting is available on GitHub at https://github.com/AMMercier/EffectiveResistanceSampling.

Funding Statement

This study was funded by the National Science Foundation via two grants (https://nsf.gov/index.jsp, https://nsf.gov/index.jsp) 1757923 and 1838251 to CM. The sponsors or funders have played no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Halloran ME, Vespignani A, Bharti N, Feldstein LR, Alexander K, Ferrari M, et al. Ebola: mobility data. Science. 2014;346(6208):433. doi: 10.1126/science.346.6208.433-a
2. Oliver N, Lepri B, Sterly H, Lambiotte R, Deletaille S, De Nadai M, et al. Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle; 2020.
3. Wesolowski A, Qureshi T, Boni MF, Sundsøy PR, Johansson MA, Rasheed SB, et al. Impact of human mobility on the emergence of dengue epidemics in Pakistan. Proceedings of the National Academy of Sciences. 2015;112(38):11887–11892. doi: 10.1073/pnas.1504964112
4. Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A. The architecture of complex weighted networks. Proceedings of the National Academy of Sciences. 2004;101(11):3747–3752. doi: 10.1073/pnas.0400087101
5. Kiss IZ, Miller JC, Simon PL. Mathematics of epidemics on networks. Cham: Springer; 2017.
6. Barbosa H, Barthelemy M, Ghoshal G, James CR, Lenormand M, Louail T, et al. Human mobility: Models and applications. Physics Reports. 2018;734:1–74.
7. Schlapfer M, Dong L, O'Keeffe K, Santi P, Szell M, Salat H, et al. The universal visitation law of human mobility. Nature. 2021;593(7860):522–527. doi: 10.1038/s41586-021-03480-9
8. Pellis L, Ball F, Bansal S, Eames K, House T, Isham V, et al. Eight challenges for network epidemic models. Epidemics. 2015;10:58–62. doi: 10.1016/j.epidem.2014.07.003
9. Yan X, Jeub LG, Flammini A, Radicchi F, Fortunato S. Weight thresholding on complex networks. Physical Review E. 2018;98(4):042304.
10. Granovetter MS. The strength of weak ties. American Journal of Sociology. 1973;78(6):1360–1380.
11. Spielman DA, Srivastava N. Graph sparsification by effective resistances. SIAM Journal on Computing. 2011;40(6):1913–1926.
12. Spielman DA, Teng SH. Spectral sparsification of graphs. SIAM Journal on Computing. 2011;40(4):981–1025.
13. Shugars S, Scarpino SV. One outstanding path from A to B. Nature Physics. 2021;17(4):540.
14. Brandes U. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology. 2001;25(2):163–177.
15. Serrano MÁ, Boguñá M, Vespignani A. Extracting the multiscale backbone of complex weighted networks. Proceedings of the National Academy of Sciences. 2009;106(16):6483–6488. doi: 10.1073/pnas.0808904106
16. Simas T, Correia RB, Rocha LM. The distance backbone of complex networks. Journal of Complex Networks. 2021;9(6).
17. Gates AJ, Brattig Correia R, Wang X, Rocha LM. The effective graph reveals redundancy, canalization, and control pathways in biochemical regulation and signaling. Proceedings of the National Academy of Sciences. 2021;118(12).
18. Stephenson K, Zelen M. Rethinking centrality: Methods and examples. Social Networks. 1989;11(1):1–37.
19. Bozzo E, Franceschet M. Resistance distance, closeness, and betweenness. Social Networks. 2013;35(3):460–469.
20. Teixeira AS, Santos FC, Francisco AP. Spanning Edge Betweenness in Practice. In: Cherifi H, Gonçalves B, Menezes R, Sinatra R, editors. Complex Networks VII. Springer International Publishing; 2016. p. 3–10.
21. Schaub MT, Lehmann J, Yaliraki SN, Barahona M. Structure of complex networks: Quantifying edge-to-edge relations by failure-induced flow redistribution. Network Science. 2014;2(1):66–89.
22. Drineas P, Mahoney MW. Effective resistances, statistical leverage, and applications to linear equation solving. arXiv preprint arXiv:1005.3097. 2010.
23. Bollobás B. Modern Graph Theory. Springer-Verlag; 1998.
24. Pastor-Satorras R, Castellano C, Van Mieghem P, Vespignani A. Epidemic processes in complex networks. Reviews of Modern Physics. 2015;87(3):925.
25. Van Mieghem P, Sahneh FD, Scoglio C. An upper bound for the epidemic threshold in exact Markovian SIR and SIS epidemics on networks. In: 53rd IEEE Conference on Decision and Control. IEEE; 2014. p. 6228–6233.
26. Newman MEJ. Networks: An Introduction. Oxford University Press; 2010.
27. Swarup S, Ravi SS, Hassin Mahmud MM, Lum K. Identifying Core Network Structure for Epidemic Simulations; 2016. https://nssac.bii.virginia.edu/~swarup/papers/swarup_etal_admi2016_sparsification.pdf.
28. Allard A, Moore C, Scarpino SV, Althouse BM, Hébert-Dufresne L. The role of directionality, heterogeneity and correlations in epidemic risk and spread. arXiv preprint arXiv:2005.11283. 2020.
29. Gillespie DT. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics. 1976;22(4):403–434.
30. Koutis I, Levin A, Peng R. Improved spectral sparsification and numerical algorithms for SDD matrices. In: STACS'12 (29th Symposium on Theoretical Aspects of Computer Science). vol. 14. LIPIcs; 2012. p. 266–277.
31. Hébert-Dufresne L, Althouse BM, Scarpino SV, Allard A. Beyond R0: heterogeneity in secondary infections and probabilistic epidemic forecasting. Journal of the Royal Society Interface. 2020;17(172):20200393. doi: 10.1098/rsif.2020.0393
32. Rubner Y, Tomasi C, Guibas LJ. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision. 2000;40(2):99–121.
33. Balcan D, Colizza V, Gonçalves B, Hu H, Ramasco JJ, Vespignani A. Multiscale mobility networks and the spatial spreading of infectious diseases. Proceedings of the National Academy of Sciences. 2009;106(51):21484–21489. doi: 10.1073/pnas.0906910106
34. Balcan D, Vespignani A. Phase transitions in contagion processes mediated by recurrent mobility patterns. Nature Physics. 2011;7(7):581–586. doi: 10.1038/nphys1944
35. Poletto C, Tizzoni M, Colizza V. Human mobility and time spent at destination: impact on spatial epidemic spreading. Journal of Theoretical Biology. 2013;338:41–58. doi: 10.1016/j.jtbi.2013.08.032
36. Buckee CO, Balsari S, Chan J, Crosas M, Dominici F, Gasser U, et al. Aggregated mobility data could help fight COVID-19. Science. 2020;368(6487):145–146. doi: 10.1126/science.abb8021
37. Gao S, Rao J, Kang Y, Liang Y, Kruse J, Dopfer D, et al. Association of Mobile Phone Location Data Indications of Travel and Stay-at-Home Mandates With COVID-19 Infection Rates in the US. JAMA Network Open. 2020;3(9):e2020485. doi: 10.1001/jamanetworkopen.2020.20485
38. Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, et al. Mobility network models of COVID-19 explain inequities and inform reopening. Nature. 2021;589(7840):82–87. doi: 10.1038/s41586-020-2923-3
39. Chinazzi M, Davis JT, Ajelli M, Gioannini C, Litvinova M, Merler S, et al. The Effect of Travel Restrictions on the Spread of the 2019 Novel Coronavirus (COVID-19) Outbreak. Science. 2020;368. doi: 10.1126/science.aba9757
40. Kraemer MUG, Yang CH, Gutierrez B, Wu CH, Klein B, Pigott DM, et al. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science. 2020;368(6490):493–497. doi: 10.1126/science.abb4218
41. Klein B, LaRock T, McCabe S, Torres L, Friedland L, Privitera F, et al. Reshaping a nation: Mobility, commuting, and contact patterns during the COVID-19 outbreak. Northeastern University Network Science Institute Report. 2020.
42. Kissler SM, Kishore N, Prabhu M, Goffman D, Beilin Y, Landau R, et al. Reductions in commuting mobility correlate with geographic differences in SARS-CoV-2 prevalence in New York City. Nature Communications. 2020;11(1):1–6.
43. Sattenspiel L, Dietz K. A structured epidemic model incorporating geographic mobility among regions. Mathematical Biosciences. 1995;128(1–2):71–91. doi: 10.1016/0025-5564(94)00068-b
44. Gautreau A, Barrat A, Barthelemy M. Global disease spread: statistics and estimation of arrival times. Journal of Theoretical Biology. 2008;251(3):509–522. doi: 10.1016/j.jtbi.2007.12.001
45. Jamieson-Lane A, Blasius B. Calculation of epidemic arrival time distributions using branching processes. Physical Review E. 2020;102(4):042301. doi: 10.1103/PhysRevE.102.042301
46. Miller JC, Ting T. EoN (Epidemics on Networks): a fast, flexible Python package for simulation, analytic approximation, and analysis of epidemics on networks. arXiv preprint arXiv:2001.02436. 2020.
47. St-Onge G, Young JG, Hébert-Dufresne L, Dubé LJ. Efficient sampling of spreading processes on complex networks using a composition and rejection algorithm. Computer Physics Communications. 2019;240:30–37. doi: 10.1016/j.cpc.2019.02.008
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010650.r001

Decision Letter 0

Feng Fu, Virginia E Pitzer

10 Jun 2022

Dear Mr. Mercier,

Thank you very much for submitting your manuscript "Effective resistance against pandemics: Mobility network sparsification for high-fidelity epidemic simulations" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Feng Fu

Associate Editor

PLOS Computational Biology

Virginia Pitzer

Deputy Editor-in-Chief

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this manuscript, the authors discuss a network sparsification method implemented in the context of epidemiology. The method is built upon the earlier algorithm from Spielman and Srivastava, which uses effective resistance to sample the edges. In this paper, the authors simulated the SIR model on a real-world mobility network. They showed that the effective resistance sampling method could preserve the behavior of the SIR model. In addition, the authors examined other, simpler sparsification algorithms and found that using effective resistance provides the most accurate simulations. Network sparsification is an interesting and important topic, especially when simulating large-scale epidemic models. Thus, the findings of this paper are potentially of great practical importance.
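For readers unfamiliar with the Spielman–Srivastava scheme the reviewer refers to, the sketch below illustrates effective-resistance edge sampling on a weighted, undirected NetworkX graph. This is a minimal illustration, not the authors' released code; the function name, the use of a dense Laplacian pseudoinverse, and the sampling-budget convention are assumptions made for brevity.

```python
import numpy as np
import networkx as nx

def effective_resistance_sparsify(G, n_samples, seed=None):
    """Sample edges with probability proportional to w_e * R_e and reweight them,
    so the sparsified Laplacian matches the original one in expectation."""
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    idx = {v: i for i, v in enumerate(nodes)}
    L = nx.laplacian_matrix(G, nodelist=nodes, weight="weight").toarray()
    Lp = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse of the Laplacian

    edges = list(G.edges(data="weight", default=1.0))
    # Effective resistance of edge (u, v): (e_u - e_v)^T L^+ (e_u - e_v)
    R = np.array([Lp[idx[u], idx[u]] + Lp[idx[v], idx[v]] - 2 * Lp[idx[u], idx[v]]
                  for u, v, _ in edges])
    w = np.array([wt for _, _, wt in edges])
    p = w * R
    p /= p.sum()

    H = nx.Graph()
    H.add_nodes_from(nodes)
    for k in rng.choice(len(edges), size=n_samples, p=p):
        u, v, wt = edges[k]
        # Each draw contributes w_e / (n_samples * p_e); repeated draws accumulate.
        new_w = wt / (n_samples * p[k])
        if H.has_edge(u, v):
            H[u][v]["weight"] += new_w
        else:
            H.add_edge(u, v, weight=new_w)
    return H
```

For a dense mobility network one would in practice replace the dense pseudoinverse with the fast approximate effective-resistance solvers discussed in the sparsification literature; the point of the sketch is only the sampling and reweighting logic.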

Detailed comments:

1. In the introduction, the authors stated, “It outperforms the simpler edge-sampling methods based on uniform probabilities and edge weights, as well as the naïve thresholding approach.” Simply saying “outperform” is a little vague. I think it would be better if the authors could explain precisely how the effective resistance sampling method is better than the other methods. It seems that the effective resistance sampling method is more accurate in capturing the behavior of the SIR model, but it is more computationally expensive than the other, simpler sparsification algorithms.

2. Figure 1: Is the color coding on the same scale for the figures on the left and right? The edges all appear to be the same darker blue in the figure on the left.

3. It would be better if the authors could provide some details of the network in the caption of Figure 1. For example, note that each node represents a census tract, define q as the fraction of edges to preserve, etc.

4. Figure 2: There are blue and black dots in the figure. What does the color represent? Moreover, the weight-based sampling method performed almost as well as the effective-resistance method, at least when measured by R-squared. I think the authors should mention this observation in the text.

5. The authors chose the same q values for the three edge-sampling methods and simulated the methods for each value of q. So in Figures 3 and 5, if I draw a vertical line at each chosen value of q, should three different dots lie on the vertical line? It seems that the dots are aligned for q values below 0.01 but not for values above 0.01.

6. Figure 3: In the caption, the authors mentioned: “the shaded regions corresponds to one standard deviation of the average.” However, I didn’t see the shaded regions in the figure on my end. Also, “corresponds” should be “correspond”.

7. I find Figure 4 difficult to understand; please provide a better description. What does the y-axis represent? How are these two nodes chosen? Is it meaningful that all sampling methods have overlapping curves for one node under the localized initial condition but not under the dispersed initial condition?

Reviewer #2: The authors propose a new statistical method for weighted graph sparsification, with a focus on preserving epidemic spread dynamics. The paper is very well written and indeed a pleasure to read. The method is of clear relevance to network science and computational biology as so many of the latter's analytical pipelines depend on large networks that benefit from sparsification---not just mobility or other networks involved in epidemic simulations. As such, I recommend publication with some minor suggested edits.

My main comment relates to a comparison with methods we developed in our group; thus, for maximum transparency, I sign this review. I could not agree more with the authors about the need for principled network reduction methods that preserve essential network dynamics and the hierarchical structure of networks. Two of the methods we have proposed are cited by the authors: distance backbones [Ref. 16] and the effective graph [Ref. 17]. The methodology most related to this paper is the former. While the effective resistance method is statistically principled ("the weighted adjacency matrix and graph Laplacian of \tilde{G} are equal, in expectation, to those of G"), distance backbones are algebraically principled (and parameter-free). Indeed, distance backbones are unique for a given distance function (typically in network science we sum edge distances, resulting in the unique metric backbone, but other distances are possible), whereas effective resistance leads to a different sparsification in each run (similar to the disparity filter proposed by Serrano, Boguna and Vespignani [Ref. 15], but likely more efficient at preserving spreading dynamics given the preservation of essential connectivity), and it also changes the original edge weights.

Perhaps more importantly, the distance backbone is guaranteed to preserve the entire distribution of shortest paths and the original edge weights intact, not only the network connectivity---even though distance backbones are also typically very small [Ref. 16]. The authors say (page 3) that their "strategy helps keep the network connected and preserves its global structure." So it would be interesting to understand how effective resistance sparsification affects the original distribution of shortest paths. Figure 4 tallies the fraction of disconnected nodes, which is related, but it is not the distribution of shortest paths. This suggests that effective resistance preserves the distance backbone for a large range of the fraction of edges removed (unlike the other methods), since the backbone would also keep all edges connected, but the impact on shortest paths may occur at smaller fractions of edges removed. So, while the effective resistance "sparsifier preserves the linear properties of the original networks in expectation," the distance backbone does not affect any shortest path on the network (nor the original distance weights). This speaks to the synergy between the two concepts, and of course I do not expect the authors to run simulations comparing the two methods in this paper. However, the impact of effective resistance sparsification on the distribution of shortest paths is a reasonable question to consider---in addition to the uniqueness of the distance backbone.
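One hedged way to probe the question the reviewer raises is to compare the empirical distribution of weighted shortest-path lengths on the original and sparsified graphs. The sketch below is illustrative only; it assumes NetworkX graphs G (original) and H (sparsified) whose edges carry a precomputed "distance" attribute (for example, the reciprocal of the mobility weight), and the helper name is hypothetical.

```python
import numpy as np
import networkx as nx

def sampled_shortest_paths(G, n_pairs=2000, seed=None):
    """Weighted shortest-path lengths between randomly sampled node pairs."""
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    lengths = []
    for _ in range(n_pairs):
        i, j = rng.choice(len(nodes), size=2, replace=False)
        try:
            lengths.append(nx.dijkstra_path_length(G, nodes[i], nodes[j],
                                                   weight="distance"))
        except nx.NetworkXNoPath:
            lengths.append(np.inf)  # disconnected pairs are recorded explicitly
    return np.array(lengths)

# e.g. compare the two empirical distributions (finite entries only):
# d_full, d_sparse = sampled_shortest_paths(G), sampled_shortest_paths(H)
```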

Another related question is what to do with the results of the effective resistance sparsifier since it changes the original weights? In the case of the epidemic models this is (more or less) clear because we tend to be more interested in the dynamical observables reported (time to infection, infection probability, etc.) But what about cases where the original weights are meaningful? For instance, in brain connectome or gene regulation networks (as in [Ref. 16]) the original edge weights have specific experimental significance. The distance backbone preserves the edge weights and thus their experimental significance. The authors could discuss or suggest how that would be handled with their sparsifier.

The authors say, in regard to Refs. 16 and 17, that they "know of no rigorous results on whether these techniques preserve dynamical behavior." The effective graph methodology [Ref. 17] is also principled (logically principled, based on the Quine-McCluskey algorithm), but it applies to networks with node dynamics (such as the automata networks used in systems biology). The effective graph is unique because our measure of effective connectivity is a parameter (not a statistic) of the dynamics of Boolean functions. Preserving dynamical behavior is the whole point of Ref. 17, indeed of the whole approach. Because the networks analyzed in this paper have no node dynamics and are instead used to study spreading dynamics (dynamics on networks rather than dynamics of networks), the effective graph is not directly comparable to the effective resistance methodology (despite the similar name). Still, one cannot say the latter was not studied with regard to preserving dynamical behavior (our recent discussion in https://doi.org/10.1093/bioinformatics/btac360 could be relevant here).

As for whether the distance backbone methodology [Ref. 16] has been studied with respect to preserving dynamical behavior (in the sense of dynamics on networks), this is a curious situation, since https://doi.org/10.1101/2022.02.02.478784 is under review in this same journal, and of course the authors would not know about it. Still, the utility of the distance backbone in epidemic models has also been studied. I would venture that the effective resistance sparsifier assumes propagation according to a particular distance (a resistance distance), which makes a lot of sense for epidemic spread, whereas the distance backbone methodology at large can consider any distance (our work under review on epidemic spread considers only the metric distance). I posit that it should be possible to derive a "resistance backbone" that is unique (not sampled), but that is clearly an idea for future work---again, supporting the synergy and complementarity between the two methods, which in my view are both very relevant and useful.

Minor comments:

I would welcome a little more justification for the values of the edge-sampling proportion q used. In particular, why only 0.1, 0.55, and 1 for anything above 10%? Probably because there is not much difference between the methods tested (save thresholding) in that range? By the way, since the other figures are derived for q = 0.1, which leads to nearly 7% of edges preserved (right?), a figure of q vs. the actual % of edges that remain would be useful, even if in the supporting materials, as I imagine this may differ from network to network depending on how many (essential) bridges they have (and other properties).

I don't quite get what the four panels in Figure 4 are. The A and B sides are explained, but not the upper and lower panels.

Really liked the use of the Wasserstein distance for comparisons.

Luis M. Rocha
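The Wasserstein comparison the reviewer highlights can be illustrated with SciPy's one-dimensional implementation. The values below are toy numbers, not results from the paper; in the paper's setting they would be, for instance, the arrival times at a given node across many runs on the full versus the sparsified network.

```python
from scipy.stats import wasserstein_distance

# Toy arrival times (days) for one node: full network vs. sparsified network runs.
arrival_full = [12.4, 13.8, 14.0, 15.1, 16.2]
arrival_sparse = [12.9, 13.5, 14.4, 15.6, 16.8]
print(wasserstein_distance(arrival_full, arrival_sparse))
```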

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Luis M. Rocha

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

 

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010650.r003

Decision Letter 1

Feng Fu, Virginia E Pitzer

12 Oct 2022

Dear Mr. Mercier,

We are pleased to inform you that your manuscript 'Effective resistance against pandemics: Mobility network sparsification for high-fidelity epidemic simulations' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Feng Fu

Academic Editor

PLOS Computational Biology

Virginia Pitzer

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank authors for the hard work. It's a pleasure reading this manuscript. All my previous comments have been addressed.

Reviewer #2: Thank you so much for addressing all my comments. I am fully satisfied with the responses and excited for the possibilities this research (and synergy with our own methods) brings to network science and the study of the dynamics of epidemic spread on networks.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Luis M. Rocha

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010650.r004

Acceptance letter

Feng Fu, Virginia E Pitzer

28 Oct 2022

PCOMPBIOL-D-22-00323R1

Effective resistance against pandemics: Mobility network sparsification for high-fidelity epidemic simulations

Dear Dr Mercier,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: AM_SS_CM_PLOSCompBio_RevResponse.pdf

    Data Availability Statement

    All data and code used for running network sparsification, models, and plotting is available on GitHub at https://github.com/AMMercier/EffectiveResistanceSampling.

