Significance
A central theme of network science is the heterogeneity present in real-life systems, for instance through the absence of a characteristic degree for the nodes. Despite their small-worldness, networks may present other types of heterogeneous patterns, with different parts of the network exhibiting different behaviors. Here we focus on assortativity, a network analogue of correlation used to describe how the presence and absence of edges covaries with the properties of nodes. We design a method to characterize the heterogeneity and local variations of assortativity within a network and exhibit, in a variety of empirical data, rich mixing patterns that would be obscured by summarizing assortativity with a single statistic.
Keywords: complex networks, assortativity, multiscale, node metadata
Abstract
Assortative mixing in networks is the tendency for nodes with the same attributes, or metadata, to link to each other. It is a property often found in social networks, manifesting as a higher tendency of links occurring between people of the same age, race, or political belief. Quantifying the level of assortativity or disassortativity (the preference of linking to nodes with different attributes) can shed light on the organization of complex networks. It is common practice to measure the level of assortativity according to the assortativity coefficient, or modularity in the case of categorical metadata. This global value is the average level of assortativity across the network and may not be a representative statistic when mixing patterns are heterogeneous. For example, a social network spanning the globe may exhibit local differences in mixing patterns as a consequence of differences in cultural norms. Here, we introduce an approach to localize this global measure so that we can describe the assortativity, across multiple scales, at the node level. Consequently, we are able to capture and qualitatively evaluate the distribution of mixing patterns in the network. We find that, for many real-world networks, the distribution of assortativity is skewed, overdispersed, and multimodal. Our method provides a clearer lens through which we can more closely examine mixing patterns in networks.
Networks are used as a common representation for a wide variety of complex systems, spanning social (1–3), biological (4, 5), and technological (6, 7) domains. Nodes are used to represent entities or components of the system, and links between them are used to indicate pairwise interactions. The link formation processes in these systems are still largely unknown, but the broad variety of observed structures suggests that they are diverse. One approach to characterize the network structure is based on the correlation, or assortative mixing, of node attributes (or “metadata”) across edges. This analysis allows us to make generalizations about whether we are more likely to observe links between nodes with the same characteristics (assortativity) or between those with different ones (disassortativity). Social networks frequently contain positive correlations of attribute values across connections (8). These correlations occur as a result of the complementary processes of selection (or “homophily”) and influence (or “contagion”) (9). For example, assortativity has frequently been observed with respect to age, race, and social status (10), as well as behavioral patterns such as smoking and drinking habits (11, 12). Examples of disassortative networks include heterosexual dating networks (gender), ecological food webs (metabolic category), and technological and biological networks (node degree) (13). It is important to note that, just as correlation does not imply causation, observations of assortativity are insufficient to imply a specific generative process for the network.
The standard approach to quantifying the level of assortativity in a network is to calculate the assortativity coefficient (13). Such a summary statistic is useful to capture the average mixing pattern across the whole network. However, such a generalization is only really meaningful if it is representative of the population of nodes in the network, i.e., if the assortativity of most individuals is concentrated around the mean. When networks are heterogeneous and contain diverse mixing patterns, a single global measure may not present an accurate description. Furthermore, it does not provide a means for quantifying the diversity or identifying anomalous or outlier patterns of interaction.
Quantifying diversity and measuring how mixing may vary across a network becomes a particularly pertinent issue with modern advances in technology that have enabled us to capture, store, and process massive-scale networks. Previously, social interaction data were collected via time-consuming manual processes of conducting surveys or observations. For practical reasons, these were often limited to a specific organization or group (1, 2, 15, 16). Summarizing the pattern of assortative mixing as a single value may be reasonable for these small-scale networks that tend to focus on a single social dimension (e.g., a specific working environment or common interest). Now, technology such as online social media platforms allows for the automatic collection of increasingly large amounts of social interaction data. For instance, the largest connected component of the Facebook network was previously reported to account for ∼10% of the global population (17). These vast multidimensional social networks present more opportunities for heterogeneous mixing patterns, which could conceivably arise, for example, due to differences in demographic and cultural backgrounds. Fig. 1 shows, using the methods we will introduce, an example of this variation in mixing on a subset of nodes in the Facebook social network (14). A high variation in mixing patterns indicates that the global assortativity may be a poor representation of the entire population. To address this issue, we develop a node-centric measure of the assortativity within a local neighborhood. Varying the size of the neighborhood allows us to interpolate from the mixing pattern between an individual node and its neighbors to the global assortativity coefficient. In a number of real-world networks, we find that the global assortativity is not representative of the collective patterns of mixing.
1. Mixing in Networks
Currently, the standard approach to measure the propensity of links to occur between similar nodes is to use the assortativity coefficient introduced by Newman (13). Here we will focus on undirected networks and categorical node attributes, but assortativity and the methods we propose naturally extend to directed networks and scalar attributes (Supporting Information).
The global assortativity coefficient $r_{\text{global}}$ for categorical attributes compares the proportion of links connecting nodes with the same attribute value, or type, relative to the proportion expected if the edges in the network were randomly rewired. The difference between these proportions is commonly known as modularity $Q$, a measure frequently used in the task of community detection (18). The assortativity coefficient is normalized such that $r_{\text{global}} = 1$ if all edges only connect nodes of the same type (i.e., maximum modularity $Q_{\max}$) and $r_{\text{global}} = 0$ if the number of edges is equal to the expected number for a randomly rewired network in which the total number of edges incident on each type of node is held constant. The global assortativity is given by (13)
$$r_{\text{global}} = \frac{Q}{Q_{\max}} = \frac{\sum_g e_{gg} - \sum_g a_g^2}{1 - \sum_g a_g^2}, \tag{1}$$
in which $e_{gh}$ is half the proportion of edges in the network that connect nodes with type $y_i = g$ to nodes with type $y_j = h$ (or the proportion of edges if $g = h$) and $a_g = \sum_h e_{gh} = \sum_{i: y_i = g} k_i / 2m$ is the sum of degrees ($k_i$) of nodes with type $g$, normalized by twice the number of edges, $m$. We calculate $e_{gh}$ as
$$e_{gh} = \frac{1}{2m} \sum_{i: y_i = g} \sum_{j: y_j = h} A_{ij}, \tag{2}$$
where $A_{ij}$ is an element of the adjacency matrix. The normalization constant $Q_{\max} = 1 - \sum_g a_g^2$ ensures that the assortativity coefficient lies in the range $-1 \le r_{\text{global}} \le 1$ (see Supporting Information).
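As an illustration of Eqs. 1 and 2, the following is a minimal NumPy sketch, not the authors' implementation. The function name `global_assortativity` and the toy graph (two triangles joined by one edge, one type per triangle) are assumptions made for this example only.

```python
import numpy as np

def global_assortativity(A, y):
    """Categorical assortativity coefficient of Eq. 1 for an undirected graph.

    A: symmetric (n, n) adjacency matrix; y: length-n array of node types.
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y)
    types = np.unique(y)
    # Indicator matrix: Y[i, g] = 1 if node i has type types[g], else 0.
    Y = (y[:, None] == types[None, :]).astype(float)
    two_m = A.sum()                     # sum of all degrees, i.e., 2m
    e = Y.T @ A @ Y / two_m             # Eq. 2: mixing matrix e_gh
    a = e.sum(axis=1)                   # a_g = sum_h e_gh
    q = np.trace(e) - np.sum(a ** 2)    # modularity Q
    q_max = 1.0 - np.sum(a ** 2)        # normalization constant Q_max
    return q / q_max

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])

r_global = global_assortativity(A, y)
print(round(r_global, 4))  # 0.7143, i.e., 5/7
```

Here six of the seven edges join nodes of the same type, so $\sum_g e_{gg} = 6/7$, while $\sum_g a_g^2 = 1/2$, giving $r_{\text{global}} = 5/7$. The sketch does not guard against the degenerate case $Q_{\max} = 0$, which occurs when every node has the same type.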
Local Patterns of Mixing.
The summary statistic $r_{\text{global}}$ describes the average mixing pattern over the whole network. However, as with all summary statistics, there may be cases where it provides a poor representation of the network, e.g., if the network contains localized heterogeneous patterns. Fig. 2 illustrates an analogy to Anscombe’s quartet of bivariate datasets with identical correlation coefficients (19). Each of the five networks in the top row has the same number of nodes and edges and has been constructed to have the same $r_{\text{global}}$ with respect to a binary attribute, indicated by a cross or a diamond. All five networks have the same number of edges between nodes of the same type and the same number of edges between nodes of different types, so that they share the same value of $r_{\text{global}}$. Local patterns of mixing are formed by splitting each of the types further into two equally sized subgroups. The middle row depicts the placement of edges within and between the four subgroups. Distributing edges uniformly between subgroups creates a network with homogeneous mixing (Fig. 2A).
We propose a local measure of assortativity that captures the mixing pattern within the local neighborhood of a given node of interest $\ell$. Trivially, one could calculate the local assortativity by adjusting Eq. 1 to only consider the immediate neighbors of $\ell$. However, this approach can encounter problems. For nodes with low degree, we would be calculating assortativity based only on a small sample, providing a potentially poor estimate of the node’s mixing preference. The case in which all of $\ell$’s neighbors are of the same type is also problematic for this approach.
We face similar issues in time series analysis when we wish to interpret how a noisy signal varies over time. Direct analysis of the series may be more descriptive of the noise process than of the underlying signal we are interested in. Averaging over the whole series provides an accurate estimate of the mean, but treats all variation as noise and ignores any important trends. A common solution to this problem is to use a local filter such as the exponentially weighted moving average, in which values farther in time from the point of interest are weighted less. We adopt a similar strategy in calculating the local assortativity. To make the connection with time series analysis concrete, we define a random time series where each value is the attribute of a node visited in a random walk on the graph. A simple random walker at node $i$ jumps to node $j$ by selecting an outgoing edge with equal probability, $A_{ij}/k_i$, and, in an undirected network, the stationary probability $\pi_i = k_i/2m$ of being at node $i$ is proportional to its degree. Then, every edge of the network is traversed in each direction with equal probability $\pi_i A_{ij}/k_i = 1/2m$. In this context, a key observation is that we can equivalently rewrite Eq. 2 as
$$e_{gh} = \sum_{i: y_i = g} \sum_{j: y_j = h} \pi_i \frac{A_{ij}}{k_i}, \tag{3}$$
which is the total probability that a simple random walker will jump from a node with type $g$ to one with type $h$. We can then interpret the global assortativity of the network as the autocorrelation (with time lag of 1) of this random time series (see SI Text, section D for details).
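The equivalence between the edge-counting form (Eq. 2) and the random-walk form (Eq. 3) can be checked numerically. The sketch below is illustrative only and uses a hypothetical toy graph (two triangles joined by one edge); the variable names are assumptions of this example.

```python
import numpy as np

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])

k = A.sum(axis=1)            # degrees k_i
two_m = k.sum()              # 2m
pi = k / two_m               # stationary distribution pi_i = k_i / 2m
P = A / k[:, None]           # transition probabilities A_ij / k_i
flow = pi[:, None] * P       # probability of stepping i -> j at stationarity

types = np.unique(y)
Y = (y[:, None] == types[None, :]).astype(float)
e_walk = Y.T @ flow @ Y          # Eq. 3: random-walk form
e_count = Y.T @ A @ Y / two_m    # Eq. 2: edge-counting form

print(np.allclose(pi @ P, pi))        # True: pi is stationary
print(np.allclose(e_walk, e_count))   # True: Eqs. 2 and 3 agree
```

Every nonzero entry of `flow` equals $1/2m$, which is the statement that each edge is traversed in each direction with equal probability.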
Global assortativity counts all edges in the network equally, just as the stationary random walker visits all edges with equal probability. To create our local measure of assortativity, we instead reweight the edges in the network based on how local they are to the node of interest, $\ell$. We do so by replacing the stationary distribution $\pi$ in Eq. 3 with an alternative distribution over the nodes, $w(i;\ell)$,
$$e_{gh}(\ell) = \sum_{i: y_i = g} \sum_{j: y_j = h} w(i;\ell) \frac{A_{ij}}{k_i}, \tag{4}$$
and compare the proportion of links between nodes of the same type in the local neighborhood to the global value. Then we can calculate the local assortativity as the deviation from the global assortativity,
$$r(\ell) = \frac{1}{Q_{\max}} \sum_g \left( e_{gg}(\ell) - a_g^2 \right) \tag{5}$$
$$\phantom{r(\ell)} = r_{\text{global}} + \frac{1}{Q_{\max}} \sum_g \left( e_{gg}(\ell) - e_{gg} \right). \tag{6}$$
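To make Eqs. 4 and 5 concrete, here is a minimal sketch that evaluates the local assortativity for an arbitrary weight vector over the nodes. The function name `local_assortativity`, the toy graph, and the example weight vectors are assumptions for illustration; the sketch is not the authors' code.

```python
import numpy as np

def local_assortativity(A, y, w):
    """Local assortativity (Eq. 5) for a weight distribution w over nodes."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)
    k = A.sum(axis=1)
    types = np.unique(y)
    Y = (y[:, None] == types[None, :]).astype(float)
    a = (Y.T @ A @ Y / k.sum()).sum(axis=1)          # global a_g
    q_max = 1.0 - np.sum(a ** 2)
    e_local = Y.T @ (w[:, None] * A / k[:, None]) @ Y   # Eq. 4
    return (np.trace(e_local) - np.sum(a ** 2)) / q_max

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])

pi = A.sum(axis=1) / A.sum()      # stationary distribution
w_node0 = np.eye(6)[0]            # all weight on node 0
w_node2 = np.eye(6)[2]            # all weight on node 2 (the bridge node)

print(round(local_assortativity(A, y, pi), 4))       # 0.7143 (global value)
print(round(local_assortativity(A, y, w_node0), 4))  # 1.0
print(round(local_assortativity(A, y, w_node2), 4))  # 0.3333
```

Choosing the stationary distribution as the weight vector recovers the global coefficient, while concentrating all weight on a single node gives a value determined only by that node's immediate neighbors.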
All that remains is to define a distribution $w(i;\ell)$. We choose the well-known personalized PageRank vector $w_\alpha(i;\ell)$, the stationary distribution of a simple random walk, modified so that, at each time step, we return to the node of interest $\ell$ with probability $1-\alpha$ (Fig. 3A). In the special case of a network consisting of nodes linked in a line, $w_\alpha(i;\ell)$ corresponds to an exponential distribution (Fig. 3B) and is analogous to the previously mentioned exponential filter commonly used in time series analysis. The personalized PageRank vector is an intuitive choice, given its role in local community detection (20) and connections to the stochastic block model (21). It is, however, not the only way to define a local neighborhood [e.g., a number of graph kernels may be suitable (22)].
We can now calculate a local assortativity $r_\alpha(\ell)$ for each node and use $\alpha$ to interpolate from the trivial local neighborhood assortativity ($\alpha = 0$, the random walker never leaves the initial node) to the global assortativity ($\alpha = 1$, the random walker never restarts) (Fig. 3C). We can also view this local assortativity as a (normalized) autocovariance of the random time series of node attributes, defined as before but now generated by a stationary random walker with restarts and computed only over steps that traverse edges of the original network.
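The interpolation between the two extremes can be sketched as follows. This is an illustrative example under stated assumptions: the personalized PageRank vector is obtained by a dense linear solve (practical only for small graphs and for $\alpha < 1$), and the function names and toy graph are hypothetical rather than taken from the original work.

```python
import numpy as np

def personalized_pagerank(A, ell, alpha):
    """Stationary distribution of a walk that follows an edge with probability
    alpha and jumps back to node ell with probability 1 - alpha (alpha < 1)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)
    restart = np.zeros(n)
    restart[ell] = 1.0
    # Solve w = (1 - alpha) * restart + alpha * w P for the row vector w.
    return np.linalg.solve((np.eye(n) - alpha * P).T, (1.0 - alpha) * restart)

def local_assortativity(A, y, w):
    """Local assortativity for node weights w, relative to global a_g."""
    k = A.sum(axis=1)
    types = np.unique(y)
    Y = (y[:, None] == types[None, :]).astype(float)
    a = (Y.T @ A @ Y / k.sum()).sum(axis=1)
    e_local = Y.T @ (w[:, None] * A / k[:, None]) @ Y
    return (np.trace(e_local) - np.sum(a ** 2)) / (1.0 - np.sum(a ** 2))

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])

ell = 2  # bridge node with two same-type neighbors and one of the other type
r_by_alpha = {
    alpha: local_assortativity(A, y, personalized_pagerank(A, ell, alpha))
    for alpha in (0.0, 0.5, 0.9, 1.0 - 1e-6)
}
print(round(r_by_alpha[0.0], 4))  # 0.3333: neighbors of node 2 only
```

At $\alpha = 0$ the value depends only on the immediate neighbors of the chosen node, and as $\alpha$ approaches 1 it approaches the global coefficient of this toy graph ($5/7$).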
Choice of $\alpha$.
We can use $\alpha$ to interpolate from the global measure at $\alpha = 1$ to the local measure based only on the neighbors of $\ell$ when $\alpha = 0$. As previously mentioned, either extreme can be problematic: $r_{\alpha=1}$ is uniform across the network, while $r_{\alpha=0}$ may be based on a small sample (particularly in the case of low degree nodes) and therefore subject to overfitting. Moreover, both extremes are blind to the possible existence of coherent regions of assortativity inside the network, as $r_{\alpha=1}$ considers the network as a whole, while $r_{\alpha=0}$ considers the local assortativities of the nodes as independent entities.
To circumvent these issues, we consider calculating the assortativity across multiple scales by calculating a “multiscale” distribution $w_{\text{multi}}$ by integrating over all possible values of $\alpha$ (23),
$$w_{\text{multi}}(i;\ell) = \int_0^1 w_\alpha(i;\ell)\, d\alpha, \tag{7}$$
which is effectively the same as treating $\alpha$ as an unknown with a uniform prior distribution (see Supporting Information for details). Using this distribution, we can calculate a multiscale measure $r_{\text{multi}}$ that captures the assortativity of a given node across all scales.
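One way to approximate the integral in Eq. 7 numerically is Gauss-Legendre quadrature over $\alpha \in (0, 1)$, whose nodes avoid the endpoint $\alpha = 1$. The sketch below is an assumption of this example and may differ from the procedure used in the original work; the function names, the number of quadrature nodes, and the toy graph are all hypothetical.

```python
import numpy as np

def personalized_pagerank(A, ell, alpha):
    """Personalized PageRank vector for restart node ell (requires alpha < 1)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)
    restart = np.zeros(n)
    restart[ell] = 1.0
    return np.linalg.solve((np.eye(n) - alpha * P).T, (1.0 - alpha) * restart)

def multiscale_weights(A, ell, n_quad=32):
    """Approximate Eq. 7 by Gauss-Legendre quadrature over alpha in (0, 1)."""
    x, q = np.polynomial.legendre.leggauss(n_quad)  # nodes, weights on [-1, 1]
    alphas = 0.5 * (x + 1.0)                        # map nodes to (0, 1)
    q = 0.5 * q                                     # rescaled weights sum to 1
    return sum(qi * personalized_pagerank(A, ell, ai) for ai, qi in zip(alphas, q))

def local_assortativity(A, y, w):
    """Local assortativity for node weights w, relative to global a_g."""
    k = A.sum(axis=1)
    types = np.unique(y)
    Y = (y[:, None] == types[None, :]).astype(float)
    a = (Y.T @ A @ Y / k.sum()).sum(axis=1)
    e_local = Y.T @ (w[:, None] * A / k[:, None]) @ Y
    return (np.trace(e_local) - np.sum(a ** 2)) / (1.0 - np.sum(a ** 2))

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
y = np.array([0, 0, 0, 1, 1, 1])

# Multiscale local assortativity for every node.
r_multi = np.array([
    local_assortativity(A, y, multiscale_weights(A, ell)) for ell in range(6)
])
print(np.round(r_multi, 3))
```

Because the local assortativity is linear in the weight vector, integrating the weights over $\alpha$ and then computing the assortativity gives the same result as averaging $r_\alpha(\ell)$ over $\alpha$ with the same quadrature rule.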
As a simple demonstration, we return to Fig. 2 in which the distribution of $r_{\text{multi}}$ for each synthetic network is shown in the bottom row. We see, under homogeneous mixing (Fig. 2A), a unimodal distribution peaked around the global value $r_{\text{global}}$, confirming that the global measure is representative of the mixing patterns in the network. However, when mixing is heterogeneous (Fig. 2 B–E), we observe multimodal distributions of $r_{\text{multi}}$ that allow us to disambiguate between different local mixing patterns.
2. Real Networks
Next we use $r_{\text{multi}}$ to evaluate the mixing patterns in some real networks: an ecological network and a set of online social networks. In both cases, nodes have multiple attributes assigned to them, providing different dimensions of analysis.
Weddell Sea Food Web.
We first examine a network of ecological consumer interactions between species dwelling in the Weddell Sea (4). Fig. 4 shows the distributions (green) of local assortativity for five different categorical node attributes. For comparison, we present a null distribution (black) obtained by randomly rewiring the edges such that the attribute values, degree sequence, and global assortativity are all preserved (see Supporting Information). In each case, we observe skewed and/or multimodal distributions. The empirical distributions appear overdispersed compared with the null distributions.
It may be surprising to see that, for some attributes, the null distribution appears to be multimodal. Closer inspection reveals that the different modes are correlated with the attribute values and that multimodality arises from the unbalanced distribution of nodes and incident edges across different node types. This effect is particularly pronounced for the attribute “Metabolic Category,” for which we observe two distinct peaks in the distribution. The larger peak represents all species that belong to the metabolic category “plant” and accounts for the majority of the species in the network, upon which approximately two-thirds of the edges are incident. This bias in the distribution of edges across the different node types means that randomly assigned edges are more likely to connect two nodes of the majority class than any other pair of nodes. In fact, it is impossible to assign edges such that nodes in each Metabolic Category exhibit (approximately) the same assortativity as the global value. Specifically, to match the global value, it is necessary that more than half of the edges connect species from different metabolic categories. However, this is impossible for the plant category without changing the distribution of edges over categories.
Facebook 100.
We next consider a set of online social networks collected from the Facebook social media platform at a time when it was only open to 100 US universities (3). The process of incrementally providing these universities access to the platform meant that, at this point in time, very few links existed between the universities’ networks, which provides the opportunity to study each of these social systems in a relatively independent manner. One of the original studies on this dataset examined the assortativity of each demographic attribute in each of the networks (3). This study found some common patterns that occurred in many of the networks, such as a tendency to be assortative by matriculation year and dormitory of residence, with some variation in the magnitude of assortativity for each of the attributes across the different universities.
In this case, it makes sense to analyze the universities separately, since it is reasonable to assume that university membership played an important and restricting role in the organization of the network. However, a modern version of this dataset might contain a higher density of interuniversity links, making it less reasonable to treat them independently; in general, partitioning networks based on attributes without careful consideration can be problematic (24).
Fig. 5 depicts the distributions of $r_{\text{multi}}$ for each of the 100 networks according to dormitory. For many of the networks, we observe a positively skewed distribution. The surrounding subplots show details for four universities with approximately the same global assortativity $r_{\text{global}}$, but with qualitatively different distributions of $r_{\text{multi}}$. Common to all four is that the empirical distribution exhibits a positive skew beyond that of the null distribution. Closer inspection reveals that, in all four networks, the nodes associated with a higher local assortativity belong to a community of nodes more loosely connected to the rest of the network. These nodes also correspond to first-year students, which suggests that residence is more relevant to friendship among new students than it is for the rest of the student body. We see this pattern in many of the other schools too: First-year students are more assortative by year (for all schools except one) and by residence (more than 75% of schools); see Supporting Information for details.
We can also use local assortativity to compare how the mixing of multiple attributes covaries across a network. This may be of interest, as a positive correlation could suggest a relationship between attributes, while a negative correlation indicates that assortativity of one attribute may replace the assortativity of another. Note that differences in normalization between attributes mean that the actual values may not be directly comparable, which is why we focus on correlation. Fig. 6 compares $r_{\text{multi}}$ for year of study and place of residence. The central scatter plot shows, for each university, the correlation between local assortativities of the two attributes (x axis) against the difference in the two global assortativities for each network (y axis), which was previously the only way to compare assortativities (3). The four surrounding subplots show the joint distribution of year and dorm local assortativity for specific universities. The yellow points indicate students in their first year. In most universities, we observed that first-year students were the most assortative by either year, residence, or both. In both Auburn and Pepperdine, there is a negative correlation between year and dorm assortativity, suggesting that many friendships are associated with either being in the same year or sharing a dorm.
For Simmons and Rice, we observe a positive correlation between dorm and year local assortativity. However, in Simmons, we see that the first-year students form a separate cluster, while, in Rice, they are much more interspersed. This difference may relate to how students are placed in university dorms. At Simmons, all first-year students live on campus (www.simmons.edu/student-life/life-at-simmons/housing/residence-halls) and form the majority of residents in the few dorms they occupy. Rice houses its new intake according to a different strategy, spreading them evenly across all of the available dorms. The fact that students are mixed across years and that the vast majority [almost 78% (campushousing.rice.edu/)] of students reside in university accommodation offers a possible explanation for why we observe a smooth variation in values of assortativity without a distinction between new students and the rest of the population.
3. Discussion
Characterizing the level of assortativity plays an important role in understanding the organization of complex systems. However, the global assortativity may not be representative, given the variation present in the network. We have shown that the distribution of mixing in real networks can be skewed, overdispersed, and possibly multimodal. In fact, for certain network configurations, we have seen that a unimodal distribution may not even be possible.
As network data grow bigger, there is a greater possibility for heterogeneous subgroups to coexist within the overall population. The presence of these subpopulations adds further to the ongoing discussions of the interplay between node metadata and network structure (24) and suggests that, while we may observe a relationship between particular node properties and the existence of links in part of a network, it does not imply that this relationship exists across the network as a whole. This heterogeneity has implications for how we make generalizations in network data, as what we observe in a subgraph might not necessarily apply to the rest of the network. However, it may also present new opportunities. Recent results show that, with an appropriately constructed learning algorithm, it is still possible to make accurate predictions about node attributes in networks with heterogeneous mixing patterns (25) and, in some cases, even utilize the heterogeneity to further improve performance (26). Quantifying local assortativity offers a new dimension to study this predictive performance. Heterogeneous mixing also offers a potential new perspective for the community detection problem (27), i.e., to identify sets of nodes with similar assortativity, which may be useful in the study of “echo chambers” in social networks (28).
Our approach to quantifying local mixing could easily be applied to any global network measure, such as clustering coefficient or mean degree. It may also be used to capture the local correlation between node attributes and their degree, a relationship that plays a definitive role in network phenomena such as the majority illusion (29) and the generalized friendship paradox (30).
Acknowledgments
This work was supported by Concerted Research Action (ARC) supported by the Federation Wallonia-Brussels Contract ARC 14/19-060 (to L.P., J.-C.D., and R.L.); Fonds de la Recherche Scientifique-Fonds National de la Recherche Scientifique (L.P.); and Flagship European Research Area Network (FLAG-ERA) Joint Transnational Call “FuturICT 2.0” (J.-C.D. and R.L.).
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1713019115/-/DCSupplemental.
References
- 1. Krackhardt D. The ties that torture: Simmelian tie analysis in organizations. Res Sociol Organ. 1999;16:183–210.
- 2. Lazega E. The Collegial Phenomenon: The Social Mechanisms of Cooperation Among Peers in a Corporate Law Partnership. Oxford Univ Press; Oxford, UK: 2001.
- 3. Traud AL, Mucha PJ, Porter MA. Social structure of Facebook networks. Phys A. 2012;391:4165–4180.
- 4. Brose U, et al. Body sizes of consumers and their resources. Ecology. 2005;86:2545.
- 5. Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-scale organization of metabolic networks. Nature. 2000;407:651–654. doi: 10.1038/35036627.
- 6. Albert R, Jeong H, Barabási A-L. Internet: Diameter of the worldwide web. Nature. 1999;401:130–131.
- 7. Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918.
- 8. McPherson M, Smith-Lovin L, Cook JM. Birds of a feather: Homophily in social networks. Annu Rev Sociol. 2001;27:415–444.
- 9. Aral S, Muchnik L, Sundararajan A. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proc Natl Acad Sci USA. 2009;106:21544–21549. doi: 10.1073/pnas.0908800106.
- 10. Moody J. Race, school integration, and friendship segregation in America. Am J Sociol. 2001;107:679–716.
- 11. Cohen JM. Sources of peer group homogeneity. Sociol Educ. 1977;50:227–241.
- 12. Kandel DB. Homophily, selection, and socialization in adolescent friendships. Am J Sociol. 1978;84:427–436.
- 13. Newman MEJ. Mixing patterns in networks. Phys Rev E. 2003;67:026126. doi: 10.1103/PhysRevE.67.026126.
- 14. McAuley JJ, Leskovec J. Learning to discover social circles in ego networks. Adv Neur. 2012;25:539–547.
- 15. Sampson SF. 1968. A novitiate in a period of change: An experimental and case study of social relationships. PhD thesis (Cornell Univ, Ithaca, NY).
- 16. Zachary WW. An information flow model for conflict and fission in small groups. J Anthropol Res. 1977;33:452–473.
- 17. Ugander J, Karrer B, Backstrom L, Marlow C. 2011. The anatomy of the Facebook social graph. arXiv:1111.4503.
- 18. Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev E. 2004;69:026113. doi: 10.1103/PhysRevE.69.026113.
- 19. Anscombe FJ. Graphs in statistical analysis. Am Stat. 1973;27:17–21.
- 20. Andersen R, Chung F, Lang K. Foundations of Computer Science (FOCS’06) Inst Electr Electron Eng; New York: 2006. Local graph partitioning using PageRank vectors; pp. 475–486.
- 21. Kloumann IM, Ugander J, Kleinberg J. Block models and personalized PageRank. Proc Natl Acad Sci USA. 2016;114:33–38. doi: 10.1073/pnas.1611275114.
- 22. Fouss F, Saerens M, Shimbo M. Algorithms and Models for Network Data and Link Analysis. Cambridge Univ Press; Cambridge, UK: 2016.
- 23. Boldi P. Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW) Assoc Comput Machinery; New York: 2005. Totalrank: Ranking without damping; pp. 898–899.
- 24. Peel L, Larremore DB, Clauset A. The ground truth about metadata and community detection in networks. Sci Adv. 2017;3:e1602548. doi: 10.1126/sciadv.1602548.
- 25. Peel L. SIAM International Conference on Data Mining. Soc Industrial Appl Math; Philadelphia: 2017. Graph-based semi-supervised learning for relational networks; pp. 435–443.
- 26. Altenburger KM, Ugander J. 2017. Bias and variance in the social structure of gender. arXiv:1705.04774.
- 27. Schaub MT, Delvenne J-C, Rosvall M, Lambiotte R. The many facets of community detection in complex networks. Appl Netw Sci. 2017;2:4. doi: 10.1007/s41109-017-0023-6.
- 28. Colleoni E, Rozza A, Arvidsson A. Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. J Commun. 2014;64:317–332.
- 29. Lerman K, Yan X, Wu X-Z. The “majority illusion” in social networks. PLoS One. 2016;11:e0147617. doi: 10.1371/journal.pone.0147617.
- 30. Eom Y-H, Jo H-H. Generalized friendship paradox in complex networks: The case of scientific collaboration. Sci Rep. 2014;4:4603. doi: 10.1038/srep04603.
- 31. Yule GU. On the methods of measuring association between two attributes. J R Stat Soc. 1912;75:579–652.
- 32. Ferguson GA. The factorial interpretation of test difficulty. Psychometrika. 1941;6:323–329.
- 33. Guilford JP. Fundamental Statistics in Psychology and Education. McGraw-Hill; New York: 1950.
- 34. Davenport EC, Jr, El-Sanhurry NA. Phi/Phimax: Review and synthesis. Educ Psychol Meas. 1991;51:821–828.
- 35. Cureton EE. Note on φ/φ max. Psychometrika. 1959;24:89–91.
- 36. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
- 37. Boldi P, Santini M, Vigna S. 2007. A deeper investigation of PageRank as a function of the damping factor. Web Information Retrieval and Linear Algebra Algorithms (Internationales Begegnungs- und Forschungszentrum für Informatik, Dagstuhl, Germany), 07071.
- 38. Fosdick BK, Larremore DB, Nishimura J, Ugander J. 2016. Configuring random graph models with fixed degree sequences. arXiv:1608.00607.