Abstract
Social connections are conduits through which individuals communicate, information propagates, and diseases spread. Identifying individuals who are more likely to adopt ideas and spread them is essential in order to develop effective information campaigns, maximize the reach of resources, and fight epidemics. Consequently, a lot of work has focused on identifying influencers in social networks with various influence maximization algorithms being proposed. Based on extensive computer simulations on synthetic and 10 diverse real-world social networks we show that seeding information in social networks using state-of-the-art influence maximization methods creates information gaps. Our results show that these algorithms select influencers who do not disseminate information equitably, threatening to create an increasingly unequal society. To overcome this issue, we devise a multiobjective algorithm which both maximizes influence and information equity. Our results demonstrate it is possible to reduce vulnerability at a relatively low trade-off with respect to spread. This highlights that in our search for maximizing the spread of information we do not need to compromise on information equality.
Keywords: algorithmic bias, influence maximization, informational vulnerability, multiobjective optimization, social networks
Significance Statement.
Algorithms are increasingly used to seed information in social networks. Unfairness in these systems can stem from many factors. Often biased data are believed to be the main culprit. However, here we show that influence maximization algorithms disseminate information in an unfair manner due to the mathematical definition of their optimization problem. By focusing on only one parameter, spread, these algorithms create large information gaps. Closing these is vital, as access to timely and relevant information is central for physical, social, and economic well-being. To address this, we develop a multiobjective algorithm which incorporates spread and fairness as objectives. Our algorithm improves the distribution of information without jeopardizing spread, demonstrating the possibility of building optimization algorithms with equity at their core.
Introduction
Social relationships serve as important vectors through which a multitude of behaviors spread, from health-related behaviors (1, 2), innovation (3), decisions of micro-financing (4), and happiness (5) to the emergence of social movements (6). Knowing through which pathways information spreads is vital for international development (7) and crucial for developing efficient methodologies that maximize the diffusion of potentially life-saving information (8, 9). Due to resource constraints, it is infeasible to send a piece of information to all individuals within a network. Instead, a frequently adopted strategy is to seed a small set of individuals, much smaller than the full population, located at strategic places in the network whose activation (or removal) would facilitate the spread of information (or, in the case of epidemics, inhibit a disease from spreading). Numerous methods have been proposed to identify this set of “influential nodes.” The methodologies can be divided into two fundamentally distinct classes (10, 11): superspreader and superblocker methods. Superspreader methods identify individuals who are highly connected and effective at diffusing information (4, 12–15). Superblocker methods identify individuals who occupy structurally vital positions in a network and whose removal would dismantle the network and subsequently block information from propagating (16–20). Although the methods are different, there is a general consensus that they both pinpoint nodes which are highly efficient conduits for information propagation (11).
Unfortunately, we have a limited understanding of which demographics these methods reach and which communities they leave behind, but there are alarming signs in the literature. Previous work has shown that an individual’s chances of being ranked as an influencer are highly correlated with personal economic status (21). Similarly, influence maximization methods have been shown to create gaps in information access (22), and to have a gender skew, with male individuals having an advantage in being selected as influencers (23–25). Further, social systems display strong levels of homophily: connections between similar individuals occur at higher rates than between dissimilar individuals, as individuals are more likely to befriend people who resemble them (26). As a consequence, information in social networks tends to be localized within social strata, restricting diffusion across demographic and socioeconomic gaps. In general, access to information is a major factor of social vulnerability (27). By naively using current influence maximization methodologies to select influencers, we run the risk of tailoring information campaigns towards the most affluent groups of our societies, while under-representing the most vulnerable and marginalized.
Defining informational vulnerability
Vulnerability is a complex issue determined by physical, economic, social, and environmental factors, which decrease the capacity of individuals and groups to cope, anticipate, and react to hazards (28). Here, we study one aspect of vulnerability, namely individuals’ access to information. Previous work has adapted metrics from the machine learning literature to study the interplay between influence maximization and characteristics of nodes (such as age, gender, ethnicity) which receive information (23, 25, 29). Within the machine learning literature, metrics such as statistical parity, equalized odds, and treatment equality, and variations of them, are used to detect and quantify unfairness (30). However, for these metrics to be successfully applied it is necessary for demographic information to be present in network datasets, for example, as node attributes. This is unfortunately not the case for a large majority of network science datasets. As such, we adopt a different approach. Rather than assuming we have access to demographic information of nodes, we focus on the general utility of the information individuals receive; this is called a “welfare approach” (22, 31). We study how access to information is distributed across all nodes in the network using numerical simulations. Specifically, we define informational vulnerability as the average likelihood that an individual receives information, estimated from many independent diffusion cascades. We focus on two aspects of this stochastic process: (i) frequency: how often does an individual receive information (i.e. how often are they reached by cascades) and (ii) recency: how old are the cascades when they reach them (i.e. at what step of a cascade is an individual reached). To simulate information diffusion processes in social networks, we apply the commonly used Independent Cascade Model (ICM) in its simplest form, unweighted and undirected (see Methods and Fig. S3).
ICMs are commonly used to study influence maximization in social networks (13, 32). The ICM allows an informed individual one attempt to convince their neighbors to adopt a behavior (according to some probability); if successful the neighbors will try to convince their neighbors, etc. (see Section S3).
To identify sets of influencers, we focus on two state-of-the-art influence maximization methods: degree discount (12) (DD), which is an efficient method for identifying superspreaders, and coreHD (20) (CHD), which infers superblockers. For comparison purposes, we also include two commonly used heuristics for selecting influencers: highest degree (33) (HD), which selects nodes according to their number of connections, and k-core (14) (KC), which selects nodes located in the core of the network (see Methods for more details on the four heuristics). Figure 1a shows, for a small real-world network, which nodes each method selects as influencers. As ICMs are stochastic, we average over multiple realizations. For each realization of the dynamic process, we track which nodes are activated and how long it takes for the spreading process to reach them. We quantify this using two measures. The first measure, information frequency:
Fig. 1.
Information is unequally distributed in networks. a) Initial seed sets selected according to HD, CHD, DD, and KC, showing variations in how the four methods select influencers for a social network between households in a South Indian village (4). Here, of nodes (colored) are selected as influencers for illustrative purposes (1% otherwise). b) Effective recency for the social network. Recency is estimated across 1,000 runs with transmission probability (see Section S3 and Fig. S4). c) Cumulative distribution of information frequency for synthetic SF-networks with , , and average transmission probability . The curves show the probability that ν is less than or equal to x, where x is an arbitrary value. Results are combined over 100 different network realizations. For each network, we select 1% of nodes as influencers (inferred by one of the heuristics), run the spreading process, track which nodes receive information, and repeat the process times to account for stochasticity. Red shaded regions denote parts of the distribution where the effective measure is below one, while gray shaded regions indicate parts where the ratio is above one. Results for higher transmission probabilities are shown in Fig. S5 and Table S2, and for larger seed populations in Fig. S6. d) Cumulative distribution of recency for SF networks. e) Fraction of nodes that are worse off with respect to information frequency in n of the four seeding heuristics when compared to the benchmark. Error bars are standard deviation over 100 network realizations. f) Fraction of nodes that are worse off with respect to recency.
summarizes the average fraction of times node i has been reached, computed from an indicator that equals one if node i received information in realization n and zero otherwise, where M is the total number of realizations. Information frequency (ν) lies in the interval 0 to 1, where zero indicates that a node is never reached by any cascade, while a value of 1 indicates the node is reached by all cascades.
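Written out from the description above, the frequency of node $i$ can be expressed as follows. The indicator name $x_i^{(n)}$ is an assumption introduced here for illustration; only ν and M appear in the surrounding text.

```latex
\nu_i = \frac{1}{M} \sum_{n=1}^{M} x_i^{(n)},
\qquad
x_i^{(n)} =
\begin{cases}
1 & \text{if node } i \text{ received information in realization } n,\\
0 & \text{otherwise.}
\end{cases}
```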
The second measure, information recency:
quantifies the temporal delay from process initialization () until node i is reached. Recency is calculated as the average of the inverse activation time to handle cases where the information spreading process dies out before reaching a node. Nodes that on average receive information fast have and , while nodes that are reached very late, or never, () have .
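As a minimal sketch of how both measures could be estimated from recorded cascades, the function below averages an activation indicator (frequency) and the inverse activation time (recency) over M realizations. The function name, the input layout, and the convention that activation steps are counted from 1 (so that the inverse is always defined) are assumptions of this sketch, not details taken from the implementation behind the reported results.

```python
def frequency_and_recency(cascades, nodes):
    """Estimate per-node frequency and recency from M recorded cascades.

    cascades: list of dicts mapping node -> activation step, with steps
              counted from 1 (an assumption of this sketch; if seeds are
              recorded at step 0, shift all steps by one first).
    nodes:    iterable of all nodes in the network.
    """
    nodes = list(nodes)
    M = len(cascades)
    reached = {v: 0 for v in nodes}          # number of cascades reaching v
    inverse_time = {v: 0.0 for v in nodes}   # sum of 1/t over cascades reaching v
    for activated_at in cascades:
        for v, t in activated_at.items():
            reached[v] += 1
            inverse_time[v] += 1.0 / t
    # Cascades that never reach v contribute zero to both sums.
    frequency = {v: reached[v] / M for v in nodes}
    recency = {v: inverse_time[v] / M for v in nodes}
    return frequency, recency
```

A node that is never reached ends up with frequency and recency of zero, matching the limiting behavior described above.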
To uncover the shortcomings of the four influencer heuristics, we compare them to a benchmark model where all nodes have an equal chance of being selected (selected at random). We call the ratio between a node's value under a heuristic and its value under the benchmark the effective measure. If the effective frequency of node i is above one, node i will on average receive information more frequently when seeds are selected using a specific influence maximization heuristic than when seeds are selected at random. If it is below one, node i will be better off when information is inserted at random entry points in the network. The same holds for recency: an effective recency above one indicates that node i, on average, receives more recent information when using an influence maximization method, while a value below one denotes that information is received faster when it is inserted at random nodes. Figure 1b illustrates the resulting effective recency values from using influencers inferred by the four heuristics as seeds. Independent of influence maximization methodology, the influencer nodes and their surrounding neighbors are always reached by the influencer set; however, a large fraction of nodes, typically those located on the periphery, seems to be left behind.
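The comparison against the random benchmark can be sketched as a per-node ratio. The function names are hypothetical, and skipping nodes whose benchmark value is zero is an assumption made here to avoid division by zero.

```python
def effective_measure(heuristic_values, baseline_values):
    """Per-node ratio of a measure under a heuristic to the random benchmark.

    Nodes with a zero benchmark value are skipped (an assumption of this
    sketch, since the ratio is undefined for them).
    """
    return {
        v: heuristic_values[v] / baseline_values[v]
        for v in baseline_values
        if baseline_values[v] > 0
    }


def fraction_worse_off(effective):
    """Fraction of nodes whose effective measure is below one."""
    return sum(1 for ratio in effective.values() if ratio < 1) / len(effective)
```

The same two functions apply to frequency and to recency, since both are compared to the benchmark in the same way.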
Quantifying informational vulnerability
To formalize our observations from Fig. 1b, we first investigate the four influence maximization heuristics on a testbed of synthetic unweighted and undirected networks with scale-free (SF, see also Fig. S1) degree distributions (see Section S6 and Figs. S7 and S8 for synthetic networks with normally distributed degrees). While perfect SF networks are rarely observed in nature, they are powerful simplifications of real-world networks (34). In order to compare heuristics, we construct influencer sets from a fixed fraction of the network population: 1% of nodes (see Section S5 for other seed sizes).
Seeding information through random nodes in SF networks results in a near-homogeneous frequency distribution (Fig. 1c, black line). In Fig. 1c, a completely equal distribution of information would be characterized by a vertical line in the cumulative probability distribution. (Note that the black line characterizing the random process is not vertical due to the intrinsic variations of network structures.) Using the four influence maximization methods to select influencers, however, results in fundamentally different frequency distributions (Fig. 1c, colored lines). Approximately half of nodes have effective frequency values above one, meaning they receive information more frequently than expected from a random process, while the other half receives much less. We observe similar results for recency, albeit slightly more polarized (Fig. 1d). Looking across the different influencer heuristics, Fig. 1e illustrates that if a node is under-informed by one heuristic, it will most likely not be better informed by any other heuristic. On average, of nodes are better informed when information is seeded using any influence maximization heuristic. We say that these nodes are always better off. However, of nodes are consistently left behind, independent of which method is used to select influencers. With respect to recency, Fig. 1f shows an even worse situation: of nodes receive out-of-date information, independent of which influence maximization methodology is used.
To understand the implications of information inequalities for real-world networks we look at 10 social networks, encompassing communication, interaction, and collaboration networks varying in size from hundreds to tens of thousands of individuals. The networks are diverse in context, ranging from face-to-face encounters (35), connections between households in multiple villages (4), connections between bloggers and blogs on political topics (36), digital communication between university students (37), email communication (38), and online friendships on Facebook (39) to scientific collaborations (40, 41) (see Section S2, Fig. S2, and Table S1 for details).
Repeating the analysis from Fig. 1 for these real-world networks, we find that influence maximization heuristics, on average, leave significant portions of the networks in disadvantaged positions in terms of information frequency (Fig. 2a). Overall, HD, CHD, and DD result in fairly similar ν distributions, while selecting influencers according to KC performs worse (with up to 80% of nodes being worse off). This is due to real-world networks having a large number of cliques, where small discrepancies in shell numbers can result in KC only selecting nodes from a single clique (14), effectively limiting the diffusion of information. Similar behavior is observed for information recency. Figure 2b shows that HD, CHD, and DD leave behind a comparable number of nodes, while KC consistently performs worse. Summarizing the average reach of influencer heuristics, we find that up to of a network might receive information less frequently than when it is seeded at random (Fig. 2c), and information can reach up to of individuals more slowly (Fig. 2d).
Fig. 2.
Information is unequally distributed in real-world social networks. Here, we show results for five of the networks; see Fig. S9 for results for the other networks. a) Cumulative distribution of individual node frequency for networks ordered according to size (number of nodes). Initial seeds contain of network nodes, and results are averaged over simulations (see Table S1). b) Cumulative distribution of recency for empirical networks. c) Fraction of nodes that are worse off with respect to frequency in n of the four seeding heuristics when compared to the random seeding procedure (), demonstrating that large parts of social networks are in disadvantaged positions. d) Fraction of nodes that are worse off with respect to recency () for n methods.
There is a connection between access to information and a node’s position in the network. The connection is so strong that a predictive model based on structural features of a node can accurately predict whether the node will fall into the group of “worse off according to all influencer heuristics” or “better off in all” (Section S8 and Figs. S10 and S11). We can, on average, predict the information status of nodes with 97.4% accuracy for frequency and 96.9% for recency. This demonstrates that current influencer heuristics suffer from biases which disadvantage low-connected and peripheral nodes.
Fair influence maximization
Different strategies can be employed to bridge the information gap. Previous work has shown that instead of using influencer algorithms to identify s individuals, one can select slightly more individuals , but at random (42). Even for small -values this has been shown to result in larger cascades. Another solution is to apply acquaintance methods, also called friendship-nomination, which work by selecting a random neighbor of a randomly selected node. These have been used for mass drug administration campaigns (43), to seed information about maternal and child health (44), and for inferring centrally located individuals suitable as monitors for detecting large-scale disease outbreaks (45). Lastly, algorithms that use iterative realizations of ICMs to equitably maximize social welfare objective functions have been proposed (22); however, we find them not to be effective at bridging the information gap (see Fig. S12).
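The acquaintance (friendship-nomination) idea mentioned above can be sketched in a few lines: pick a node at random, then nominate one of its neighbors at random. The adjacency-dictionary input, the function name, and the choice to repeat until k distinct seeds are found are assumptions of this sketch.

```python
import random


def acquaintance_seeds(adj, k, rng=None):
    """Select k distinct seeds, each a random neighbor of a random node.

    adj: dict mapping node -> list of neighbors (undirected graph).
    """
    rng = rng or random.Random()
    # Only nodes with at least one neighbor can nominate someone.
    nominators = [v for v in adj if adj[v]]
    if k > len(nominators):
        raise ValueError("not enough connected nodes to draw k distinct seeds")
    seeds = set()
    while len(seeds) < k:
        anchor = rng.choice(nominators)
        seeds.add(rng.choice(list(adj[anchor])))
    return seeds
```

Because a random neighbor is reached with probability proportional to its degree, this procedure tends to favor well-connected nodes without requiring knowledge of the full network.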
We embark on a different solution. Traditionally, influence maximization has focused on maximizing a single objective function, information spread. However, as literature from the field of AI shows, focusing on only one parameter can lead to troubling and unfair outcomes (46). Instead, we propose a multiobjective formulation of fair influence maximization, where both spread and fairness are taken into account in the fitness of a candidate solution (a set of selected influencers). We measure fairness as the fraction of nodes that receive information at a higher frequency and speed than what is expected from the benchmark model. We say a node is vulnerable if it receives less information than what can be expected at random (). In the following, we focus primarily on information frequency, but similar results can be obtained for recency. The fairer an influencer set is, the fewer nodes will be vulnerable, so we maximize the number of nonvulnerable nodes in addition to maximizing spread. Analytically calculating how information will spread from a set of nodes according to the ICM is a computationally hard (NP-hard) problem (13). Instead, we use an approximation of the fitness of an influencer set (Section S9). To find fairer seeds, we then use a genetic algorithm to solve the optimization problem (see Methods).
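To make the two objectives concrete, the sketch below scores a candidate seed set from per-node frequencies. It assumes these frequencies come from simulation, whereas the text above uses an approximation (Section S9) that is not reproduced here; the function name and input layout are likewise assumptions.

```python
def fitness(frequency, baseline_frequency):
    """Return (spread, nonvulnerable) for one candidate seed set.

    frequency:          dict node -> frequency under the candidate seed set.
    baseline_frequency: dict node -> frequency under random seeding.
    """
    # Expected cascade size equals the sum of per-node reach probabilities.
    spread = sum(frequency.values())
    # A node is vulnerable if it receives less than expected at random.
    nonvulnerable = sum(
        1
        for v, base in baseline_frequency.items()
        if frequency.get(v, 0.0) >= base
    )
    return spread, nonvulnerable
```

Both returned values are to be maximized, which is the form a multiobjective optimizer expects.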
Figure 3a shows the theoretical Pareto front (in multiobjective optimization, a Pareto front denotes the set of solutions for which no objective can be improved without worsening another) for a social network between households in a village. Our method identifies seed sets which are fairer and, at the same time, as effective at maximizing influence as the influence maximization heuristics (HD, CHD, and DD). (We disregard here seed sets identified by KC as they are far inferior to the ones produced by the other methods.) For this network, our algorithm identifies nine possible influencer sets, undiscovered by the traditional heuristics, each with different trade-offs between maximizing information reach (cascade size) and the number of nonvulnerable nodes (fairness). As these findings are based on a theoretical approximation, we also evaluate them numerically using ICMs. Figure 3b shows that our theoretical predictions are consistent with results from numerical simulations. For a negligible reduction in cascade size we can, for this specific network, choose fairer seeds that roughly correspond to 6–10 fewer vulnerable nodes. Figure 3c and d illustrate the difference between seed sets inferred by a state-of-the-art influence maximization heuristic (CHD) and our approach. The figure shows initial seed nodes (in black) and the activation of edges, where an activated edge indicates successful information propagation. Visually there is a clear difference between which nodes are selected as influencers. Using the average distance from seed nodes to the rest of the network, we calculate how far each seed set is from all nodes in the network. We find that our method identifies nodes which are more evenly distributed in the network and, on average, closer to the overall network (Section S11 and Table S4). This results in larger parts of the network being more easily reached.
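For readers less familiar with Pareto fronts, the following minimal sketch filters a list of candidate seed sets, each scored as a (spread, nonvulnerable) pair, down to the nondominated ones. The scoring pairs and function name are illustrative assumptions; the genetic algorithm described in Methods performs this kind of sorting internally.

```python
def pareto_front(scores):
    """Return the nondominated (spread, nonvulnerable) pairs; both are maximized."""

    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )

    return [c for c in scores if not any(dominates(o, c) for o in scores)]
```

For example, among the pairs (10, 5), (9, 7), (8, 6), and (10, 4), only (10, 5) and (9, 7) are on the front: each trades some spread for fairness, while the other two are beaten on both counts or tied on one and beaten on the other.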
Fig. 3.
Fair influence maximization for social networks. a) Theoretical Pareto front of optimal influencer sets identified by our multiobjective algorithm for the social network between households in a South Indian village, compared to influence maximization heuristics. We disregard KC as it consistently performs worse, both in terms of fairness and reach, compared to the other heuristics. Higher values of nonvulnerable nodes indicate higher values of fairness. b) Numerical evaluation of influencer sets using ICMs. Error bars are given as the standard deviation from 10 realizations of ICM simulations. c) Edge activations for the set of influencers identified by CHD. Edges are colored and sized according to how often they are activated during simulations. Nodes colored black are seed nodes. d) Edge activations for one of the fairer seed sets identified by our algorithm. A comprehensive comparison between the seed sets is available in Fig. S16. e–h) Theoretical Pareto fronts for four additional real-world social networks (see Fig. S15 for results for the remaining five networks, and for numerical results from ICMs).
Figure 3a and b show results for one network; however, our fair influence maximization method works equally well for other real-world graphs (Figs. 3e–h and S13–S15). While influence maximization heuristics identify seed sets which optimize cascade size, we find that these seed sets are not fair in terms of information equality (for numerical results, see Section S10). For all networks it is possible to improve this, with large gains in fairness being achievable with small trade-offs in cascade size. Overall we find that a decrease in cascade size can result in a decrease in number of vulnerable individuals by approximately: for the network of political blogs, for the student communication network, for email communication, up to for collaboration networks, and for online friendships on Facebook (here, we disregard the face-to-face and village networks due to the low number of identified seed sets). Accepting larger reductions in cascade size enables even larger reductions in the number of vulnerable nodes (see Table S3). For example, for the network of online friendships a reduction of 5% in cascade size can yield up to 71% fewer vulnerable individuals.
Lastly, for the collaboration networks (Fig. 3h), we find that seed sets identified by influence maximization heuristics are not even close to the Pareto frontier, meaning they are suboptimal both in terms of cascade size and fair information access. In this situation, our algorithm can also be used to identify previously undiscovered seed sets which optimize both cascade size and fairness. These findings signal the practical significance of our multiobjective approach and highlight the need to rethink how algorithms for information diffusion and influencer selection are designed and evaluated.
Discussion
The United Nations sustainable development goals (SDGs) recognize that eradicating inequalities in all their forms and dimensions is one of the greatest global challenges our societies face. Algorithms have the power to deliver on the SDGs. For example, access to information is critical for vaccination campaigns, and algorithms such as influence maximization have a role to play in making these campaigns more effective. However, algorithms can also bring potential biases into play. Our results show there are groups of nodes that are consistently left behind by influence maximization algorithms. In particular, for both real-world and synthetic networks, we find that access to information is unequal, both in terms of how often information is received and how recent the information is. This behavior is not limited to low or high clustered networks, nor to specific types of interactions (Table S1); we find it present across all networks we investigated. Although algorithmic systems can be biased due to many factors (30), it is often thought that biases appear due to skews or misrepresentations in training data. However, that is not the case for influence maximization algorithms. Here, the issue lies with the problem statement and the choice of objective function. An algorithmic bias is created by focusing algorithms solely on optimizing reach, without considering information equity. Unfortunately, not receiving information has real-world consequences. For example, experiences from mass drug administration campaigns have pinpointed that individuals are left untreated not due to lack of medicine, but because they never receive information about the campaign (9).
The pervasive usage of influence maximization algorithms in information diffusion and in online social networks (47, 48) can create large fractures in the social fabric of our societies. Thus, it is vital to understand if such algorithms are equitable, to quantify the level of inequality, and to propose potential alternatives that can balance potential reach and equity. Multiobjective optimization is a well-known computational tool which adds nuance to optimization problems and enables the inclusion of multiple criteria. Our results demonstrate it is possible to find influencer sets that reduce vulnerability at a relatively low trade-off with respect to spread. For example, we find that a mere reduction in reach can reduce the number of people left behind in information campaigns by up to .
We focus on state-of-the-art heuristics which are computationally efficient and widely used for identifying influencers. However, more specialized algorithms exist that leverage community structure, use adaptive seeding, or incorporate node attributes (e.g. demographic features). Comparing our multiobjective approach to these methods can enrich the discussion on equity and reach. Comparing our approach to a specialized adaptive seeding algorithm (the Myopic method (22)), we find that our approach is more effective at bridging the information gap (see Section S9). This showcases the potential of our method to mitigate digital inequalities.
Our multiobjective algorithm is a first approach at solving this critical problem, yet it is not perfect. We believe it can act as a starting point towards more systematic solutions for fair information access, as this issue arises across many other contexts within network science, artificial intelligence, and computational science. One particular application is online social networks, where incorporating additional algorithmic objectives can help detect vulnerable individuals, mitigate and reduce segregation, lessen polarization between groups, and guide the design of more equal information dissemination structures.
Our approach requires information about the full network. Noise and incomplete mappings of networks will naturally affect this. However, we believe the effect will not differ from what methods like degree discount (DD), CoreHD (CHD), or highest degree (HD) already experience, as they also require information about the total graph. Another shortcoming is that we focus purely on simple contagion effects, where nodes have equal and independent probabilities of adopting a behavior. Complex contagion, where individuals require social affirmation from multiple sources, has been observed for certain social settings, including sharing of content on social media (49), and online behaviors (1). The nature of contagion depends on the type of situation, and on whether interactions happen at a local or global level. We focus on simple contagion because it is believed to be the main factor in information spreading (50). For instance, if a person is looking for job opportunities, it is more beneficial to receive information from the global network, via weak ties, rather than just from close friends and family (51). However, future work should study how information inequalities develop in complex contagion scenarios, and how to mitigate their effects.
Our results are based on simulations using static, undirected, and unweighted networks. This is a simplification of the real world, as information propagation often occurs through directed and temporal connections. Our multiobjective framework can be extended to cover weighted and directed connections in a straightforward manner, either directly by weighting the transmission probability with edge weights and directions, or by splitting weighted edges into multiple edges of weight “one.” Expanding the framework to temporal networks is less trivial. As such, an important direction for future research would be to investigate how information inequalities manifest in temporal networks, and how to mitigate these.
Lastly, our definition of information inequality relies on benchmarking existing methods against random information spreading scenarios, as this is the fairest system we can imagine. Other metrics exist, such as statistical parity, equalized odds, or treatment equality (30). These, however, require information about demographic variables and protected features to be present in the network data (e.g. as node attributes). As this is rarely the case, we have avoided using them. Nonetheless, future work could focus on testing these metrics for cases where information about demographic or protected features is present.
Independent of the choice of definition, it is vital that inequalities which arise, or are amplified, as a result of algorithms be quantified and measured. As our world is becoming increasingly digitalized, access to correct, timely, and factual information will grow in significance. As such, it is critical to know how well algorithms which deal with information dissemination and delivery work, and which groups and individuals they leave behind.
Methods
Independent cascade model
The ICM process is as follows: at the initial time step all nodes are inactive, except for the initial seed nodes (activated). At each time step t, an activated node i contacts all its neighbors which have not previously been activated, and tries to convince/activate them according to an independent transmission probability p. After attempting to convince all its neighbors, a node becomes inactive and cannot be activated again in subsequent stages of the dynamic. The process is iterated until no active nodes remain.
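The process can be sketched as a breadth-first spreading loop. This is an illustrative implementation under stated assumptions (adjacency given as a dictionary, seeds recorded at step 0, a caller-supplied random generator), not the code used for the reported simulations.

```python
import random


def independent_cascade(adj, seeds, p, rng=None):
    """Run one realization of an unweighted, undirected ICM.

    adj:   dict mapping node -> iterable of neighbors.
    seeds: iterable of initially activated nodes.
    p:     transmission probability per contact.
    Returns a dict node -> step at which the node was activated
    (seeds at step 0, an assumed convention).
    """
    rng = rng or random.Random()
    activated_at = {s: 0 for s in seeds}
    frontier = list(activated_at)
    step = 0
    while frontier:
        step += 1
        next_frontier = []
        for node in frontier:
            for neighbor in adj[node]:
                # Each newly active node gets one attempt per inactive neighbor.
                if neighbor not in activated_at and rng.random() < p:
                    activated_at[neighbor] = step
                    next_frontier.append(neighbor)
        # Nodes in the old frontier have used their attempts and never retry.
        frontier = next_frontier
    return activated_at
```

Repeating the call many times with the same seeds yields the cascade records from which frequency and recency are averaged.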
Influencer selection heuristics
Highest degree (HD). Nodes are selected according to highest degree. Nodes are added to the set of influencers in an aggregative fashion until the set reaches the desired size S, starting with the nodes of highest degree k and continuing with successively lower degrees until a lowest degree value is reached, that is, the lowest k-value for nodes which are added to the set of influencers. If there are more nodes with this lowest degree than there is room for in the influencer set, then nodes with this degree are added to the set at random until the desired set size has been reached.
K-core (KC). Nodes are added to the set of influencers according to their k-core value. Similar to HD, nodes are added in an aggregative fashion according to their k-core value, highest ones first. If there are more nodes with identical k-core values than there is room for in the set, then we select at random from these nodes until the desired influencer set size is reached.
Degree discount (DD). This heuristic starts by adding the highest degree node to the seed set; it then discounts the degree of the other nodes in the network. The rationale is that nodes which are close to the seed node should have their degrees discounted, since there is a high likelihood they will be reached by an information cascade started at the seed node. Degree is discounted as $dd_v = d_v - 2t_v - (d_v - t_v)\,t_v\,p$, where $dd_v$ is the discounted degree of node v, $d_v$ is its degree, $t_v$ is the number of neighbors of v that have been selected as influencers, and p is a transmission probability. We set p as described below. Nodes are iteratively added to the seed set based on the highest discounted degree. Discounted degree values are recalculated after the addition of a node to the influencer set.
CoreHD (CHD). An iterative procedure to select “superblocker” nodes. The procedure finds the core of the network and then selects the node with the highest degree within the core. If more than one node has the same maximum degree, we randomly sample one of them. The selected node is then removed from the graph. The process repeats until S influencers have been selected, with the core being identified at each step and the highest degree node removed.
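The two iterative heuristics (DD and CHD) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the helper names (`from_edges`, `degree_discount`, `k_core`, `core_hd`) are hypothetical, the discount uses the standard degree-discount expression $d_v - 2t_v - (d_v - t_v)\,t_v\,p$, the core is assumed to be the 2-core, and falling back to the remaining graph when the core is empty is a choice made only for this example.

```python
import random


def from_edges(edges):
    """Build an undirected adjacency dict {node: set(neighbors)}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree_discount(adjacency, size, p):
    """Select `size` seeds by repeatedly taking the highest discounted degree."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    discounted = dict(degree)                  # dd_v, initially equal to d_v
    selected_nbrs = {v: 0 for v in adjacency}  # t_v
    candidates = set(adjacency)
    seeds = []
    while len(seeds) < size and candidates:
        best = max(candidates, key=discounted.get)  # ties broken arbitrarily
        seeds.append(best)
        candidates.remove(best)
        # Recalculate the discounted degree of the new seed's neighbors.
        for v in adjacency[best]:
            if v in candidates:
                selected_nbrs[v] += 1
                d, t = degree[v], selected_nbrs[v]
                discounted[v] = d - 2 * t - (d - t) * t * p
    return seeds


def k_core(adjacency, k=2):
    """Return the k-core as an adjacency dict by peeling nodes of degree < k."""
    core = {v: set(nbrs) for v, nbrs in adjacency.items()}
    stack = [v for v, nbrs in core.items() if len(nbrs) < k]
    while stack:
        v = stack.pop()
        if v not in core:
            continue
        for u in core.pop(v):
            core[u].discard(v)
            if len(core[u]) < k:
                stack.append(u)
    return core


def core_hd(adjacency, size, k=2, rng=None):
    """Repeatedly remove the highest-degree node of the core (assumed 2-core)."""
    if rng is None:
        rng = random.Random()
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    seeds = []
    while len(seeds) < size and graph:
        # Assumed fallback: use the remaining graph if the core is empty.
        core = k_core(graph, k) or graph
        top = max(len(nbrs) for nbrs in core.values())
        ties = [v for v, nbrs in core.items() if len(nbrs) == top]
        chosen = rng.choice(ties)              # random choice among ties
        seeds.append(chosen)
        for u in graph.pop(chosen):            # remove the node from the graph
            graph[u].discard(chosen)
    return seeds
```

On a small graph where a hub's neighbor has degree 3 and an unrelated node has degree 2, `degree_discount` prefers the unrelated node as its second seed, which is the behavior the discounting rationale describes.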
Transmission probability
For ICM, the only parameter is the activation probability p (the probability of convincing people to adopt a behavior). Too high a probability leads to a global information cascade in which the full network adopts the behavior; too low a probability entails no information spreading. We set p in terms of $p_c$, the critical probability separating the region of the phase diagram where cascades (outbreaks) are subextensive ($p < p_c$) from the supercritical region ($p > p_c$), where outbreaks reach a finite fraction of the whole network (11). For each network, we calculate the critical value of the transmission probability ($p_c$) as the position of the maximum of the susceptibility, which is defined in terms of the moments of the outbreak size distribution computed for randomly selected initial single spreaders (52). See the Supplementary information for details.
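The estimation of the critical probability can be sketched as a scan over candidate values of p. This is a rough illustration only: the function names are hypothetical, the network is assumed to be a dictionary of neighbor lists, and the susceptibility used here, the ratio of the second moment of the outbreak size to the squared first moment, is an assumption chosen for the example and may differ from the expression used in the paper and in (52).

```python
import random


def outbreak_size(adjacency, seed, p, rng):
    """Size of one independent cascade started from a single spreader."""
    activated, frontier = {seed}, [seed]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency[node]:
                if neighbor not in activated and rng.random() < p:
                    activated.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return len(activated)


def susceptibility(sizes):
    """Assumed susceptibility: <s^2> / <s>^2 of the outbreak sizes."""
    first = sum(sizes) / len(sizes)
    second = sum(s * s for s in sizes) / len(sizes)
    return second / (first * first)


def estimate_critical_probability(adjacency, p_grid, runs=200, rng=None):
    """Return the p in `p_grid` at which the susceptibility is largest."""
    if rng is None:
        rng = random.Random()
    nodes = list(adjacency)
    best_p, best_chi = None, float("-inf")
    for p in p_grid:
        # Outbreaks start from randomly selected single spreaders.
        sizes = [outbreak_size(adjacency, rng.choice(nodes), p, rng)
                 for _ in range(runs)]
        chi = susceptibility(sizes)
        if chi > best_chi:
            best_p, best_chi = p, chi
    return best_p
```

The resolution of the estimate is limited by the spacing of `p_grid` and by the number of runs per value, so a finer grid around the first estimate is usually needed.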
Fair influence maximization
We implement a simple version of a nondominated sorting genetic algorithm (NSGA-III) using the DEAP (Distributed Evolutionary Algorithms in Python) library (53). Our code is freely available at https://github.com/vedransekara/multi-objective-influence-maximization. Briefly, the main modeling set-up is:
- Initialization is performed by generating one set of influencers with each of the heuristics mentioned in this paper and the rest completely at random.
- Crossover of two individual sets is performed by generating the union of both sets and choosing from the joint set of nodes at random to form two new seed sets.
- Mutation is composed of two operators that are performed with different frequencies:
  - Random, where a fraction of the seeds is removed from an individual set and new ones are selected at random.
  - Tabu-like, where one seed is removed at random and a certain number (or all) of random seeds are inspected for addition to the set, and the one with the lowest vulnerability is finally selected.
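The crossover and mutation operators listed above can be sketched in plain Python. This is a hedged illustration, not the code in the authors' repository: the function names, the `fraction` and `n_candidates` parameters, and the `vulnerability` callable (which scores a candidate seed set) are all hypothetical, and the sketch assumes that node labels are sortable and that children keep the size of the first parent.

```python
import random


def crossover(set_a, set_b, rng):
    """Draw two children at random from the union of both parents."""
    pool = sorted(set(set_a) | set(set_b))  # sorted only for reproducibility
    size = len(set_a)                       # assumed: children keep parent size
    return set(rng.sample(pool, size)), set(rng.sample(pool, size))


def mutate_random(seeds, nodes, fraction, rng):
    """Replace a fraction of the seeds with randomly chosen non-seed nodes."""
    seeds = set(seeds)
    n_swap = max(1, round(fraction * len(seeds)))
    removed = set(rng.sample(sorted(seeds), n_swap))
    # Assumes at least `n_swap` nodes exist outside the current seed set.
    outside = sorted(set(nodes) - seeds)
    added = set(rng.sample(outside, n_swap))
    return (seeds - removed) | added


def mutate_tabu_like(seeds, nodes, vulnerability, n_candidates, rng):
    """Drop one random seed, then add the inspected node that yields
    the lowest vulnerability for the resulting seed set."""
    seeds = set(seeds)
    seeds.remove(rng.choice(sorted(seeds)))
    outside = sorted(set(nodes) - seeds)
    candidates = rng.sample(outside, min(n_candidates, len(outside)))
    best = min(candidates, key=lambda v: vulnerability(seeds | {v}))
    return seeds | {best}
```

Passing `n_candidates` equal to the number of nodes corresponds to inspecting all possible additions, while a smaller value trades solution quality for speed.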
As objectives, we use reach (R) and vulnerability (V). For a network of n nodes, reach is approximated for each node i from the transmission probability p and the length of the shortest path between nodes i and j, given the candidate seed set S; the overall reach R is then estimated from these per-node values. Vulnerability for node i is estimated by a piecewise expression that is nonzero only when a stated condition holds and is 0 otherwise; the overall vulnerability V is calculated from these per-node values. For the explicit expressions and a complete mathematical derivation, see Section S9. All experiments on empirical networks were run with a population of 100 individual sets, 100 generations, and a mutation probability of 1. The crossover probability, the Tabu-like mutation frequency, and the size of the tabu neighborhood (a fraction of the total number of nodes in the network) were likewise fixed across all experiments.
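With two objectives, candidate seed sets are compared by dominance rather than by a single score. The sketch below illustrates only that comparison, assuming each candidate has already been scored as a (reach, vulnerability) pair with reach to be maximized and vulnerability to be minimized; the function names are hypothetical, and a library such as DEAP provides its own nondominated sorting and selection routines.

```python
def dominates(a, b):
    """True if score `a` is at least as good as `b` on both objectives
    and strictly better on one. Scores are (reach, vulnerability)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(scores):
    """Return the indices of the candidates that no other candidate dominates."""
    front = []
    for i, score in enumerate(scores):
        dominated = any(dominates(other, score)
                        for j, other in enumerate(scores) if j != i)
        if not dominated:
            front.append(i)
    return front


# Hypothetical (reach, vulnerability) scores for four candidate seed sets.
example_scores = [(0.9, 0.5), (0.8, 0.2), (0.7, 0.3), (0.9, 0.6)]
example_front = pareto_front(example_scores)
```

In this example the first two candidates form the front: one offers more reach, the other less vulnerability, and neither is better on both.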
Acknowledgments
M.G.H. and I.D. want to thank AECID (Spanish Agency for International Development Cooperation) for their support to data innovation and Frontier Data Technologies through UNICEF’s Frontier Data Network.
Contributor Information
Vedran Sekara, Networks, Data, and Society (NERDS) Group, IT University of Copenhagen, Copenhagen DK-2300, Denmark; Pioneer Centre for AI (P1), Copenhagen DK-1350, Denmark.
Ivan Dotu, UNICEF, GIGA, 08019 Barcelona, Spain.
Manuel Cebrian, Center for Automation and Robotics (CAR), Spanish National Research Council (CSIC-UPM), 28500 Madrid, Spain.
Esteban Moro, Department of Physics, Network Science Institute, Northeastern University, Boston, MA 02115, USA.
Manuel Garcia-Herranz, UNICEF, Frontier Data Network, New York, NY 10017, USA.
Supplementary Material
Supplementary material is available at PNAS Nexus online.
Funding
M.C. acknowledges support from the Ministerio de Ciencia, Innovación y Universidades and from the “Convocatoria de la Universidad Carlos III de Madrid de Ayudas para la recualificación del sistema universitario español 2021-2023” (Real Decreto 289/2021, 2021 April 20), which funded his work while he was at the Department of Statistics, Universidad Carlos III de Madrid, Spain. Additional funding was provided by project PID2022-137243OB-I00 (MCIN/AEI/10.13039/501100011033) and by FEDER—“Una manera de hacer Europa.” E.M. acknowledges support from the U.S. National Science Foundation under Grants 2420945 and 2427150.
Author Contributions
All authors conceived the study. V.S. and I.D. performed the analysis. All authors analyzed the data and wrote the manuscript.
Preprints
A prior version of this manuscript was posted on a preprint server: https://arxiv.org/pdf/2405.12764.
Data Availability
Previously published data were used for this work (4, 35, 37, 40, 54); a full description can be found in the Supplementary material.
References
- 1. Centola D. 2010. The spread of behavior in an online social network experiment. Science. 329(5996):1194–1197.
- 2. Christakis NA, Fowler JH. 2007. The spread of obesity in a large social network over 32 years. N Engl J Med. 357(4):370–379.
- 3. Rogers EM. Diffusion of innovations. Simon and Schuster, 2010.
- 4. Banerjee A, Chandrasekhar AG, Duflo E, Jackson MO. 2013. The diffusion of microfinance. Science. 341(6144):1236498.
- 5. Fowler JH, Christakis NA. 2008. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham heart study. Bmj. 337:a2338.
- 6. González-Bailón S, Borge-Holthoefer J, Rivero A, Moreno Y. 2011. The dynamics of protest recruitment through an online network. Sci Rep. 1(1):197.
- 7. Barrett CB, Constas MA. 2014. Toward a theory of resilience for international development applications. Proc Natl Acad Sci U S A. 111(40):14625–14630.
- 8. Alexander M, Forastiere L, Gupta S, Christakis NA. 2022. Algorithms for seeding social networks can enhance the adoption of a public health intervention in urban India. Proc Natl Acad Sci U S A. 119(30):e2120742119.
- 9. Chami GF, et al. 2016. Profiling nonrecipients of mass drug administration for schistosomiasis and hookworm infections: a comprehensive analysis of praziquantel and albendazole coverage in community-directed treatment in Uganda. Clin Infect Dis. 62(2):200–207.
- 10. Borgatti SP. 2006. Identifying sets of key players in a social network. Comput Math Organ Theory. 12(1):21–34.
- 11. Radicchi F, Castellano C. 2017. Fundamental difference between superblockers and superspreaders in networks. Phys Rev E. 95(1):012318.
- 12. Chen W, Wang Y, Yang S. Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009. p. 199–208.
- 13. Kempe D, Kleinberg J, Tardos É. Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003. p. 137–146.
- 14. Kitsak M, et al. 2010. Identification of influential spreaders in complex networks. Nat Phys. 6(11):888.
- 15. Lokhov AY, Saad D. 2017. Optimal deployment of resources for maximizing impact in spreading processes. Proc Natl Acad Sci U S A. 114(39):E8138–E8146.
- 16. Braunstein A, Dall’Asta L, Semerjian G, Zdeborová L. 2016. Network dismantling. Proc Natl Acad Sci U S A. 113(44):12368–12373.
- 17. Chen Y, Paul G, Havlin S, Liljeros F, Stanley HE. 2008. Finding a better immunization strategy. Phys Rev Lett. 101(5):058701.
- 18. Clusella P, Grassberger P, Pérez-Reche FJ, Politi A. 2016. Immunization and targeted destruction of networks using explosive percolation. Phys Rev Lett. 117(20):208301.
- 19. Morone F, Makse HA. 2015. Influence maximization in complex networks through optimal percolation. Nature. 524(7563):65–68.
- 20. Zdeborová L, Zhang P, Zhou H-J. 2016. Fast and simple decycling and dismantling of networks. Sci Rep. 6(1):37954.
- 21. Luo S, Morone F, Sarraute C, Travizano M, Makse HA. 2017. Inferring personal economic status from social network location. Nat Commun. 8(1):15227.
- 22. Fish B, et al. Gaps in information access in social networks? In: The World Wide Web Conference. ACM, 2019, p. 480–490.
- 23. Jalali ZS, Wang W, Kim M, Raghavan H, Soundarajan S. On the information unfairness of social networks. In: Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020. p. 613–521.
- 24. Stoica A-A, Chaintreau A. Fairness in social influence maximization. In: Companion Proceedings of The 2019 World Wide Web Conference. ACM, 2019. p. 569–574.
- 25. Stoica A-A, Han JX, Chaintreau A. Seeding network influence in biased networks and the benefits of diversity. In: Proceedings of The Web Conference 2020. ACM, 2020. p. 2089–2098.
- 26. Leo Y, Fleury E, Alvarez-Hamelin JI, Sarraute C, Karsai M. 2016. Socioeconomic correlations and stratification in social-communication networks. J R Soc Interface. 13(125):20160598.
- 27. Shirley WL, Boruff BJ, Cutter SL. Social vulnerability to environmental hazards. In: Hazards vulnerability and environmental justice. Routledge, 2012. p. 143–160.
- 28. Turner BL, et al. 2003. A framework for vulnerability analysis in sustainability science. Proc Natl Acad Sci U S A. 100(14):8074–8079.
- 29. Wang X, Varol O, Eliassi-Rad T. 2022. Information access equality on generative models of complex networks. Appl Netw Sci. 7(1):1–20.
- 30. Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. 2021. A survey on bias and fairness in machine learning. ACM Comput Surv (CSUR). 54(6):1–35.
- 31. Heidari H, Ferrari C, Gummadi K, Krause A. 2018. Fairness behind a veil of ignorance: a welfare analysis for automated decision making. Adv Neural Inf Process Syst. 31.
- 32. Beaman L, BenYishay A, Magruder J, Mushfiq Mobarak A. 2021. Can network theory-based targeting increase technology adoption? Am Econ Rev. 111(6):1918–1943.
- 33. Albert R, Jeong H, Barabási A-L. 2000. Error and attack tolerance of complex networks. Nature. 406(6794):378.
- 34. Broido AD, Clauset A. 2019. Scale-free networks are rare. Nat Commun. 10(1):1017.
- 35. Isella L, et al. 2011. What’s in a crowd? Analysis of face-to-face behavioral networks. J Theor Biol. 271(1):166–180.
- 36. Adamic LA, Glance N. The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery. ACM, 2005. p. 36–43.
- 37. Opsahl T, Panzarasa P. 2009. Clustering in weighted networks. Soc Netw. 31(2):155–163.
- 38. Guimera R, Danon L, Diaz-Guilera A, Giralt F, Arenas A. 2003. Self-similar community structure in a network of human interactions. Phys Rev E. 68(6):065103.
- 39. Cho E, Myers SA, Leskovec J. Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011. p. 1082–1090.
- 40. Leskovec J, Kleinberg J, Faloutsos C. 2007. Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data. 1(1):2.
- 41. Newman MEJ. 2001. The structure of scientific collaboration networks. Proc Natl Acad Sci U S A. 98(2):404–409.
- 42. Akbarpour M, Malladi S, Saberi A. Diffusion, seeding, and the value of network information. In: Proceedings of the 2018 ACM Conference on Economics and Computation. ACM, 2018. p. 641–641.
- 43. Chami GF, Ahnert SE, Kabatereine NB, Tukahebwa EM. 2017. Social network fragmentation and community health. Proc Natl Acad Sci U S A. 114(36):E7425–E7431.
- 44. Airoldi EM, Christakis NA. 2024. Induction of social contagion for diverse outcomes in structured experiments in isolated villages. Science. 384(6695):eadi5147.
- 45. Garcia-Herranz M, Moro E, Cebrian M, Christakis NA, Fowler JH. 2014. Using friends as sensors to detect global-scale contagious outbreaks. PLoS One. 9(4):e92413.
- 46. Thomas RL, Uminsky D. 2022. Reliance on metrics is a fundamental challenge for AI. Patterns. 3(5):100476.
- 47. Coró F, D’angelo G, Velaj Y. 2021. Link recommendation for social influence maximization. ACM Trans Knowl Discov Data. 15(6):1–23.
- 48. Leskovec J, et al. Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007. p. 420–429.
- 49. Mønsted B, Sapieżyński P, Ferrara E, Lehmann S. 2017. Evidence of complex contagion of information in social media: an experiment using twitter bots. PLoS One. 12(9):e0184148.
- 50. Centola D, Macy M. 2007. Complex contagions and the weakness of long ties. AJS. 113(3):702–734.
- 51. Granovetter MS. 1973. The strength of weak ties. AJS. 78(6):1360–1380.
- 52. Castellano C, Pastor-Satorras R. 2016. On the numerical study of percolation and epidemic critical properties in networks. Eur Phys J B. 89(11):243.
- 53. Fortin F-A, De Rainville F-M, Gardner M-A, Parizeau M, Gagné C. 2012. DEAP: evolutionary algorithms made easy. J Mach Learn Res. 13:2171–2175.
- 54. Leskovec J, Mcauley JJ. 2012. Learning to discover social circles in ego networks. Adv Neural Inf Process Syst.