PLOS ONE. 2010 Apr 8;5(4):e10012. doi: 10.1371/journal.pone.0010012

Efficient and Exact Sampling of Simple Graphs with Given Arbitrary Degree Sequence

Charo I Del Genio 1,2, Hyunju Kim 3, Zoltán Toroczkai 3, Kevin E Bassler 1,2,*
Editor: Fabio Rapallo 4
PMCID: PMC2851615  PMID: 20386694

Abstract

Uniform sampling from graphical realizations of a given degree sequence is a fundamental component in simulation-based measurements of network observables, with applications ranging from epidemics through social networks to Internet modeling. Existing graph sampling methods are either link-swap based (Markov-Chain Monte Carlo algorithms) or stub-matching based (the Configuration Model). Both types are ill-controlled, with typically unknown mixing times for link-swap methods and uncontrolled rejections for the Configuration Model. Here we propose an efficient, polynomial time algorithm that generates statistically independent graph samples with a given, arbitrary, degree sequence. The algorithm provides a weight associated with each sample, allowing the observable to be measured either uniformly over the graph ensemble, or, alternatively, with a desired distribution. Unlike other algorithms, this method always produces a sample, without back-tracking or rejections. Using central limit theorem-based reasoning, we argue that, for large $N$ and for degree sequences admitting many realizations, the sample weights are expected to have a lognormal distribution. As examples, we apply our algorithm to generate networks with degree sequences drawn from power-law distributions and from binomial distributions.

Introduction

Network representation has become an increasingly widespread methodology of analysis to gain insight into the behavior of complex systems, ranging from gene regulatory networks to human infrastructures such as the Internet, power-grids and airline transportation, through metabolism, epidemics and social sciences [1]–[4]. These studies are primarily data driven, where connectivity information is collected, and the structural properties of the resulting graphs are analyzed for modeling purposes. However, rather frequently, full connectivity data is unavailable, and the modeling has to resort to considerations on the class of graphs that obeys the available structural data. A rather typical situation is when the only information available about the network is the degree sequence of its nodes Inline graphic. For example, in epidemiology studies of sexually transmitted diseases [5], anonymous surveys may only collect the number of sexual partners of a person in a given period of time, not their identity. Epidemiologists are then faced with constructing a typical contact graph having the observed degree sequence, on which disease spread scenarios can be tested. Another reason for studying classes or ensembles of graphs obeying constraints comes from the fact that the network structure of many large-scale real-world systems is not the result of a global design, but of complex dynamical processes with many stochastic elements. Accordingly, a statistical mechanics approach [1] can be employed to characterize the collective properties of the system emerging from its node level (microscopic) properties. In this approach, statistical ensembles of graphs are defined [6], [7], representing “connectivity microstates” from which macroscopic system level properties are inferred via averaging.
Here we focus on the degree as a node characteristic, which could represent, for example, the number of friends of a person, the valence of an atom in a chemical compound, the number of clients of a router, etc.

In spite of its practical importance, finding a method to construct degree-based graphs in a way that allows the corresponding graph ensemble to be properly sampled has been a long-standing open problem in the network modeling community (references using various approaches are given below). Here we present a solution to this problem, using a biased sampling approach. We consider degree-based graph ensembles on two levels: 1) sequence-level, where a specific sequence of degrees is given, and 2) distribution level, where the sequences are themselves drawn from a given degree distribution Inline graphic. In the remainder we will focus on the fundamental case of labeled, undirected simple graphs. In a simple graph any link connects a single pair of distinct nodes and self loops and multiple links between the same pair of nodes are not allowed. Without loss of generality, consider a sequence of Inline graphic positive integers Inline graphic, arranged in non-increasing order: Inline graphic. If there is at least one simple graph Inline graphic with degree sequence Inline graphic, the sequence Inline graphic is called a graphical sequence and we say that Inline graphic realizes Inline graphic. Note that not every sequence of positive integers can be realized by simple graphs. For example, there is no simple graph with degree sequence Inline graphic or Inline graphic, while the sequence Inline graphic can obviously be realized by a simple graph. In general, if a sequence is graphical, then there can be several graphs having the same degree sequence. Also note that given a graphical sequence, the careless or random placing of links between the nodes may not result in a simple graph.

Recently, a direct, swap-free method to systematically construct all the simple graphs realizing a given graphical sequence Inline graphic was presented [8]. However, in general (for exceptions see Ref. [9]), the number of elements of the set Inline graphic of all graphs that realize sequence Inline graphic, increases very quickly with Inline graphic: a simple upper bound is provided by the number of all graphs with sequence Inline graphic, allowing for multiple links and loops: Inline graphic. Thus, typically, systematically constructing all graphs with a given sequence Inline graphic is practical only for short sequences, such as when determining the structural isomers of alkanes [8]. For larger sequences, and in particular for modeling real-world complex networks, it becomes necessary to sample Inline graphic. Accordingly, several variants based on the Markov Chain Monte Carlo (MCMC) method were developed. They use link-swaps [10] (“switches”) to produce pseudo-random samples from Inline graphic. Unfortunately, most of them are based on heuristics, and apart from some special sequences, little has been rigorously shown about the methods' mixing time, and accordingly they are ill-controlled. The literature on such MCMC methods is too extensive to be reviewed here; instead, we refer the interested reader to Refs. [11]–[13] and the references therein. Finally, we recall the main swap-free method producing uniform random samples from Inline graphic, namely the configuration model (CM) [14]–[17]. This method picks a pair of nodes uniformly at random and connects them, until a rejection occurs due to a double link or a self-loop, in which case it restarts from the very beginning. For this reason, the CM can become very slow, as shown in the Discussion section. The CM has inspired approximation methods as well [18] and methods that construct random graphs with given expected degrees [19].
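
The restart behaviour of the configuration model can be illustrated with a minimal Python sketch. It assumes the usual stub-matching formulation, in which each node contributes as many stubs as its degree and stubs are paired at random. The function name `configuration_model` and the `max_tries` cap are illustrative choices, not part of the original method.

```python
import random


def configuration_model(degrees, rng, max_tries=10_000):
    """Stub matching with restart on any self-loop or double link.

    Returns (edges, attempts). `max_tries` is an assumed safety cap.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    # One stub per unit of degree, labelled by its node.
    stubs = [v for v, d in enumerate(degrees) for _ in range(d)]
    for attempt in range(1, max_tries + 1):
        rng.shuffle(stubs)  # a random pairing of consecutive stubs
        edges = set()
        rejected = False
        for a, b in zip(stubs[::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in edges:
                rejected = True  # self-loop or multiple link: start over
                break
            edges.add(edge)
        if not rejected:
            return sorted(edges), attempt
    raise RuntimeError("no simple graph found within max_tries")


if __name__ == "__main__":
    edges, attempts = configuration_model([2, 2, 1, 1], random.Random(0))
    print(edges, "accepted after", attempts, "attempt(s)")
```

The number of attempts before acceptance is the uncontrolled quantity: it depends on the degree sequence and is not bounded in advance.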

Here, by developing new results from the theorems in Ref. [8], we present an efficient algorithm that solves this fundamental graph sampling problem, and it is exact in the sense that it is not based on any heuristics. Given a graphical sequence, the algorithm always finishes with a simple graph realization in polynomial time, and it is rejection free. While the samples obtained are not uniformly generated, the algorithm also provides the exact weight for each sample, which can then be used to produce averages of arbitrary graph observables measured uniformly, or following any given distribution over Inline graphic.

Methods

Mathematical foundations

Before introducing the algorithm, we state some results that will be useful later on. We begin with the Erdős-Gallai (EG) theorem [20], which is a fundamental result that allows us to determine whether a given sequence of non-negative integers, called “degree sequence” hereafter, is graphical.

Theorem 1 (Erdős-Gallai)

A non-increasing degree sequence $d_1 \ge d_2 \ge \dots \ge d_N$ is graphical if and only if the sum of its degrees is even and, for all $1 \le k \le N$:

$$\sum_{i=1}^{k} d_i \le k(k-1) + \sum_{i=k+1}^{N} \min(k, d_i) \tag{1}$$

A necessary and sufficient condition for the graphicality of a degree sequence, which is constrained from having links between some node and a “forbidden set” of other nodes is given by the star-constrained graphicality theorem [8]. In this case the forbidden links are all incident on one node and thus form a “star”. To state the theorem, we first define the “leftmost adjacency set” of a node Inline graphic with degree Inline graphic in a degree sequence Inline graphic as the set consisting of the Inline graphic nodes with the largest degrees that are not in the forbidden set. If Inline graphic is non-increasing, then the nodes in the leftmost adjacency set are the first Inline graphic nodes in the sequence that are not in the forbidden set. The forbidden set could represent nodes that are either already connected to Inline graphic, and thus subsequent connections to them are forbidden, or just imposed arbitrarily. Using this definition, the theorem is:

Theorem 2 (Star-constrained graphical sequences)

Let Inline graphic be a non-increasing graphical degree sequence. Assume there is a set of forbidden links incident on a node Inline graphic. Then a simple graph avoiding the forbidden links can be constructed if and only if a simple graph can be constructed where Inline graphic is connected to all the nodes in its leftmost adjacency set.

A direct consequence [8] of Theorem 2 for the case of an empty forbidden set is the well-known Havel-Hakimi result [21], [22], which in turn implies:

Corollary 1

Let Inline graphic be a non-increasing unconstrained graphical degree sequence. Then, given any node Inline graphic, there is a realization of Inline graphic that includes a link between the first node and Inline graphic.

Another result we exploit here is Lemma 3 of Ref. [8], extended to star-constrained sequences:

Lemma 1

Let Inline graphic be a graphical sequence, possibly with a star constraint incident on node Inline graphic. Let Inline graphic and Inline graphic be distinct nodes not in the forbidden set and different from Inline graphic, such that Inline graphic. Then Inline graphic is also a graphical sequence with the same star constraint.

Proof. Let Inline graphic denote the set of nodes forbidden to connect to node Inline graphic. Since Inline graphic is star-constrained graphical, there is a simple graph Inline graphic realizing the sequence with no connections between Inline graphic and Inline graphic. Since Inline graphic, there is a node Inline graphic to which Inline graphic is connected but Inline graphic is not. Note that Inline graphic could be in Inline graphic. Now cut the edge Inline graphic of Inline graphic creating a stub at Inline graphic and another at Inline graphic. Remove the stub at Inline graphic so that its degree becomes Inline graphic, and add a stub at Inline graphic so that its degree becomes Inline graphic. Since there are no connections in Inline graphic between Inline graphic and Inline graphic, connect the two stubs at these nodes creating a simple graph Inline graphic thus realizing Inline graphic. Clearly there are still no connections between Inline graphic and Inline graphic in Inline graphic, and thus Inline graphic is also star-constrained graphical.

Finally, using Lemma 1 and Theorem 2, we prove:

Theorem 3

Let Inline graphic be a degree sequence, possibly with a star-constraint incident on node Inline graphic, and let Inline graphic and Inline graphic be two nodes with degrees such that Inline graphic that are not constrained from linking to node Inline graphic. If the residual degree sequence Inline graphic obtained from Inline graphic by reducing the degrees at Inline graphic and Inline graphic by unity is not graphical, then the degree sequence Inline graphic obtained from Inline graphic by reducing the degrees at Inline graphic and Inline graphic by unity is also not graphical.

Proof. By definition, Inline graphic for Inline graphic and Inline graphic, Inline graphic; Inline graphic for Inline graphic and Inline graphic, Inline graphic. We consider Inline graphic; however, the proof is not affected by this assumption. By assumption, Inline graphic is not graphical. Using proof by contradiction, assume that Inline graphic is graphical. Clearly, Inline graphic, and thus we can apply Lemma 1 on this sequence. As a result, the sequence Inline graphic, that is exactly Inline graphic, is graphical, a contradiction.

Note that if a sequence is non-graphical, then it is not star-constrained graphical either, and thus Theorem 3 is in its strongest form.

Biased sampling

The sampling algorithm described below is ergodic in the sense that every possible simple graph with the given finite degree sequence is generated with non-zero probability. However, it does not generate the samples with uniform probability; the sampling is biased. Nevertheless, the algorithm can be used to compute network observables that are unbiased, by appropriately weighting the averages measured from the samples. According to a well known principle of biased sampling [23],[24], if the relative probability of generating a particular sample $\mu_i$ is $\rho_{\mu_i}$, then an unbiased estimator for an observable $Q$ measured from a set of $M$ randomly generated samples $\mu_1, \dots, \mu_M$ is the weighted average

$$\langle Q \rangle = \frac{\sum_{i=1}^{M} w_{\mu_i} Q(\mu_i)}{\sum_{i=1}^{M} w_{\mu_i}} \tag{2}$$

where the weights are $w_{\mu_i} = \rho_{\mu_i}^{-1}$, and the denominator is a normalization factor. The key to this method is to find the appropriate weight $w_{\mu_i}$ to associate with each sample. Note that in addition to uniform sampling, it is in fact possible to sample with any arbitrary distribution by choosing an appropriate set of sample weights.
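
As an illustration of the weighted average in Eq. 2, the following Python sketch computes the estimate from per-sample observables and weights. Working with the logarithms of the weights and subtracting their maximum before exponentiating is an assumption made here for numerical safety, since the weights can span many orders of magnitude; the function name `weighted_mean` is hypothetical.

```python
import math


def weighted_mean(observables, log_weights):
    """Weighted average of Eq. 2, computed from log-weights.

    Subtracting the largest log-weight rescales numerator and
    denominator by the same factor, so the ratio is unchanged.
    """
    if not observables or len(observables) != len(log_weights):
        raise ValueError("need one log-weight per observable")
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    numerator = sum(w * q for w, q in zip(weights, observables))
    return numerator / sum(weights)


if __name__ == "__main__":
    # Two samples with weights 1 and 3 and observable values 1 and 2.
    print(weighted_mean([1.0, 2.0], [0.0, math.log(3.0)]))
```

Only the ratios of the weights matter, which is why relative rather than absolute generation probabilities suffice.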

Results

The algorithm

Let Inline graphic be a non-increasing graphical sequence. We wish to sample the set Inline graphic of graphs that realize this sequence. The graphs can be systematically constructed by forming all the links involving each node. To do so, begin by choosing the first node in the sequence as the “hub” node and then build the set of the “allowed nodes” Inline graphic that can be connected to it. Inline graphic contains all the nodes that can be connected to the hub such that if a link is placed between the hub and a node from Inline graphic, then a simple graph can still be constructed, thus preserving graphicality. Choose uniformly at random a node Inline graphic, and place a link between Inline graphic and the hub. If Inline graphic still has “stubs”, i.e. remaining links to be placed, then add it to the set of “forbidden nodes” Inline graphic that contains all the nodes which can't be linked anymore to the hub node and which initially contains only the hub; otherwise, if Inline graphic has no more stubs to connect, then remove it from further consideration. Repeat the construction of Inline graphic and link the hub with one of its randomly chosen elements until the stubs of the hub are exhausted. Then remove the hub from further consideration, and repeat the whole procedure until all the links are made and the sample construction is complete. Each time the procedure is repeated, the degree sequence Inline graphic considered is the “residual degree sequence”, that is the original degree sequence reduced by the links that have previously been made, and with any zero residual degree node removed from the sequence. Then, choose a new hub, empty the set of forbidden nodes Inline graphic and add the new hub to it. It is convenient, but not necessary, to choose the new hub to be a node with maximum degree in the residual degree sequence.

The sample weights needed to obtain unbiased estimates using Eq. 2 are the inverse relative probabilities of generating the particular samples. If in the course of the construction of the sample $m$ different nodes are chosen as the hub, and the $i$-th hub has residual degree $\bar{d}_i$ when it is chosen, then the sample weight $w$ can be computed by first taking the product of the sizes $k_{ij}$ of the allowed sets constructed, where $k_{ij}$ is the size of the allowed set built for the $j$-th link of the $i$-th hub, then dividing this quantity by a combinatorial factor which is the product of the factorials of the residual degrees of each hub:

$$w = \prod_{i=1}^{m} \frac{1}{\bar{d}_i!} \prod_{j=1}^{\bar{d}_i} k_{ij} \tag{3}$$

The weight accounts for the fact that at each step the hub node has $k_{ij}$ nodes it can be linked to, which is the size of the allowed set at that point, and that the number of equivalent ways to connect the residual stubs of a new hub is $\bar{d}_i!$. Note that it is always true that $w \ge 1$, with $w = 1$ occurring for sequences for which there is only one possible graph.

Building the allowed set

The most difficult step in the sampling algorithm is to construct the set of allowed nodes Inline graphic. In order to do so first note that Theorem 3 implies that if a non-forbidden node, that is a node not in Inline graphic, can be added to Inline graphic, then all non-forbidden nodes with equal or higher degree can also be added to Inline graphic. Conversely, if it is determined that a non-forbidden node cannot be added to Inline graphic, then all nodes with equal or smaller degree also cannot be added to Inline graphic. Therefore, referring to the degrees of nodes that cannot be added to Inline graphic as “fail-degrees”, the key to efficiently construct Inline graphic is to determine the maximum fail-degree, if fail-degrees exist.

The first time Inline graphic is constructed for a new hub, according to Corollary 1, there is no fail-degree and Inline graphic consists of all the other nodes. However, constructing Inline graphic becomes more difficult once links have been placed from the hub to other nodes. In this case, to find the maximum fail-degree note that at any step during the construction of a sample the residual sequence being used is graphical. Then, since according to Theorem 2 any connection to the leftmost adjacency set of the hub preserves graphicality, it follows from Theorem 3 that any fail-degree has to be strictly less than the degree of any node in the leftmost adjacency set of the hub.

If there are non-forbidden nodes in the residual degree sequence that have degree less than any in its leftmost adjacency set, then the maximum fail-degree can be found with a procedure that exploits Theorem 2. In particular, if the hub is connected to a node with a fail-degree, then, by Theorem 2, even if all the remaining links from the hub were connected to the remaining nodes in the leftmost adjacency set, the residual sequence will not be graphical. Our method to find fail-degrees, given below, is based on this argument.

Begin by constructing a new residual sequence Inline graphic by temporarily assuming that links exist between the hub and all the nodes in its leftmost adjacency set except for the last one, which has the lowest degree in the set. The nodes temporarily linked to the hub should also be temporarily added to the set of forbidden nodes Inline graphic. The nodes in Inline graphic should be ordered so that it is non-increasing, that forbidden nodes appear before non-forbidden nodes of the same degree, and that the hub, which now has residual degree 1, is last.

At this point, in principle one could find the maximum fail degree by systematically connecting the last link of the hub with non-forbidden nodes of decreasing degree, and testing each time for graphicality using Theorem 1. If it is not graphical then the degree of the last node connected to the hub is a fail-degree, and the node with the largest degree for which this is true will have the maximum fail-degree. However, this procedure is inefficient because each time a new node is linked with the hub the residual sequence changes and every new sequence must be tested for graphicality.

A more efficient procedure to find the maximum fail-degree instead involves only testing the sequence Inline graphic. To see how this can be done, note that Inline graphic is a graphical sequence, by Theorem 2. Thus, by Theorem 1, for all relevant values of Inline graphic, the left hand side of Inequality 1, Inline graphic, and the right hand side of it, Inline graphic, satisfy Inline graphic. Furthermore, for the purposes of finding fail-degrees it is sufficient to consider linking the final stub of the hub with only the last non-forbidden node of a given degree, if any exists. After any such link is made, the resulting degree-sequence Inline graphic will be non-increasing, and thus Theorem 1 can be applied to test it for graphicality. Therefore, if the degree of the node connected with the last stub of the hub is a fail-degree, then Inequality 1 for Inline graphic must fail for some Inline graphic. For each Inline graphic, the possible differences in Inline graphic and Inline graphic between Inline graphic and Inline graphic are as follows. Inline graphic is always reduced by 1 because the residual degree of the hub is reduced from 1 to 0. Inline graphic may be reduced by a further 1 if the last node connected to the hub, having index Inline graphic and degree Inline graphic, is such that Inline graphic and Inline graphic. Inline graphic is reduced by 1 if Inline graphic, otherwise it is unchanged.

Considering these conditions that can cause Inequality 1 to fail for Inline graphic, the set of allowed nodes Inline graphic can be constructed with the following algorithm that requires only testing Inline graphic. Starting with Inline graphic, compute the values of Inline graphic and Inline graphic for Inline graphic. There are three possible cases: (1) Inline graphic, (2) Inline graphic, and (3) Inline graphic. In case (1) fail-degrees occur whenever Inline graphic is unchanged by making the final link to the hub. Thus, the degree of the first non-forbidden node whose index is greater than Inline graphic is the largest fail-degree found with this value of Inline graphic. In case (2) fail-degrees occur whenever Inline graphic is unchanged and Inline graphic is reduced by 2 by making the final link to the hub. Thus, the degree of the first non-forbidden node whose index is greater than Inline graphic and whose degree is less than Inline graphic is the largest fail-degree found with this value of Inline graphic. In case (3) no fail-degree can be found with this value of Inline graphic. Repeat this process sequentially increasing Inline graphic, until all the relevant Inline graphic values have been considered, then retain the maximum fail-degree. It can be shown that the algorithm can be stopped either after a case (1) occurs, or after Inline graphic where Inline graphic is the lowest index of any node in Inline graphic with degree Inline graphic. Once the maximum fail-degree is found, remove the nodes that were temporarily added to Inline graphic and construct Inline graphic by including all non-forbidden nodes of Inline graphic with a higher degree. If no fail-degree is ever found, then all non-forbidden nodes of Inline graphic are included in Inline graphic. Inline graphic will always include the leftmost adjacency set of the hub and any non-forbidden nodes of equal degree.

Note that after a link is placed in the sample construction process, the residual degree sequence Inline graphic changes, and therefore, Inline graphic has to be determined every time.
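
The sampling loop can be sketched in Python as follows. This is a deliberately simple version under stated assumptions: instead of the fail-degree shortcut, each candidate is tested directly by linking it to the hub, saturating the hub's remaining stubs on its leftmost adjacency set (Theorem 2), and applying the Erdős-Gallai test to what is left. All function names are hypothetical, and the direct test is slower than the procedure described in the text.

```python
import random
from fractions import Fraction
from math import factorial


def is_graphical(degrees):
    """Direct Erdős-Gallai test (Theorem 1)."""
    d = sorted(degrees, reverse=True)
    if sum(d) % 2:
        return False
    return all(
        sum(d[:k]) <= k * (k - 1) + sum(min(k, x) for x in d[k:])
        for k in range(1, len(d) + 1)
    )


def keeps_graphicality(res, hub, cand, forbidden):
    """Theorem 2 check: can the sample still be completed after hub-cand?"""
    r = dict(res)
    r[hub] -= 1
    r[cand] -= 1
    blocked = forbidden | {cand}
    # Non-forbidden nodes with stubs left, largest residual degree first.
    free = sorted((v for v in r if v not in blocked and r[v] > 0),
                  key=lambda v: -r[v])
    if len(free) < r[hub]:
        return False
    for v in free[:r[hub]]:  # leftmost adjacency set of the hub
        r[v] -= 1
    r[hub] = 0
    return is_graphical(list(r.values()))


def sample_graph(degrees, rng):
    """Return (edges, weight) for one sample; the weight follows Eq. 3."""
    if not is_graphical(degrees):
        raise ValueError("degree sequence is not graphical")
    res = {v: d for v, d in enumerate(degrees) if d > 0}
    edges, num, den = [], 1, 1
    while res:
        hub = max(res, key=res.get)  # a node of maximum residual degree
        den *= factorial(res[hub])
        forbidden = {hub}
        while res[hub] > 0:
            allowed = [v for v in res
                       if v not in forbidden and res[v] > 0
                       and keeps_graphicality(res, hub, v, forbidden)]
            v = rng.choice(allowed)
            num *= len(allowed)
            edges.append((hub, v))
            res[hub] -= 1
            res[v] -= 1
            forbidden.add(v)
        # Drop exhausted nodes to obtain the residual degree sequence.
        res = {v: d for v, d in res.items() if d > 0}
    return edges, Fraction(num, den)


if __name__ == "__main__":
    print(sample_graph([2, 2, 1, 1], random.Random(1)))
```

For the sequence [1, 1, 1, 1] every run of this sketch returns weight 3, the number of perfect matchings on four labeled nodes, and for the star sequence [3, 1, 1, 1], which has a single realization, the weight is 1.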

Implementing the Erdős-Gallai test

Finally, Inline graphic and Inline graphic should be calculated efficiently. Calculating the sums that comprise them for each new value of Inline graphic can be computationally intensive, especially for long sequences. Even computing them only for as many distinct terms as there are in the sequence, as suggested in Ref. [25], can still become slow if the degree distribution is not quickly decreasing. Instead, it is much more efficient to use recurrence relations to calculate them.

A recurrence relation for the left hand side of Inequality 1, $L_k = \sum_{i=1}^{k} d_i$, is simply

$$L_k = L_{k-1} + d_k \tag{4}$$

with $L_0 = 0$.

For non-increasing degree sequences, define the “crossing-index” Inline graphic for each Inline graphic as the index of the first node that has degree less than Inline graphic, that is for which Inline graphic for all Inline graphic. If no such index exists, such as for Inline graphic since the minimum degree of any node in the sequence is 1, then set Inline graphic. Then, a recurrence relation for Inline graphic is

graphic file with name pone.0010012.e221.jpg (5)

where Inline graphic is a discrete equivalent of the Heaviside function, defined to be 1 on positive integers and 0 otherwise, and Inline graphic. Or, since the crossing-index can not increase with Inline graphic, that is Inline graphic for all Inline graphic, a value Inline graphic will exist for which Inline graphic for all Inline graphic, and so Eq. 5 can be written

graphic file with name pone.0010012.e230.jpg (6)

Thus, there is no need to find Inline graphic for Inline graphic.

Using Eqs. 4 and 6, the mechanism of the calculation of Inline graphic and Inline graphic at sequential values of Inline graphic is shifted from a slow repeated calculation of sums of many terms to the much less computationally intensive task of calculating the recurrence relations. In order to perform the test efficiently, a table of the values of crossing-index Inline graphic for each relevant Inline graphic can be created as Inline graphic is constructed.

It should be noted that the usefulness of this method for calculating Inline graphic and Inline graphic is broader than its use for calculating fail-degrees in our sampling algorithm. In particular, it can be used in an Erdős-Gallai test to efficiently determine whether a degree-sequence is graphical.
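
A linear-time Erdős-Gallai test built on recurrences of this kind can be sketched in Python. The increments used below were re-derived here from the definition of the right hand side of Inequality 1 and from a crossing index defined as the first 1-based position whose degree is below k, with the value N + 1 when no such position exists. They are an assumption of this sketch and are not claimed to match Eqs. 5 and 6 term by term. A direct evaluation is included only for comparison.

```python
def erdos_gallai_direct(degrees):
    """Reference test: evaluates both sides of Inequality 1 from scratch."""
    d = sorted(degrees, reverse=True)
    if sum(d) % 2:
        return False
    return all(
        sum(d[:k]) <= k * (k - 1) + sum(min(k, x) for x in d[k:])
        for k in range(1, len(d) + 1)
    )


def erdos_gallai_recurrence(degrees):
    """Same test, updating both sides incrementally for each k."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if sum(d) % 2:
        return False
    left = right = 0
    x = n + 1  # crossing index (1-based); it never increases with k
    for k in range(1, n + 1):
        dk = d[k - 1]
        left += dk  # L_k = L_{k-1} + d_k
        # Slide the crossing index left while the node before it has degree < k.
        while x > 1 and d[x - 2] < k:
            x -= 1
        if x > k:
            right += x - 2  # d_k >= k: increment derived for this sketch
        else:
            right += 2 * (k - 1) - dk  # d_k < k: increment derived for this sketch
        if left > right:
            return False
    return True


if __name__ == "__main__":
    print(erdos_gallai_recurrence([3, 3, 2, 2, 2]))
    print(erdos_gallai_recurrence([3, 2, 1]))
```

Because the crossing index only moves in one direction, the total work after sorting is proportional to the sequence length.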

Sample weights

As previously stated, the weight $w$ associated with a particular sample, given by Eq. 3, is the product of the sizes $k_{ij}$ of all the sets of allowed nodes that have been built for each of the $m$ hub nodes $i$, divided by the product of the factorials of the initial residual degrees $\bar{d}_i$ of each hub node. The logarithm of this weight is

$$\log w = \sum_{i=1}^{m} \left[ \sum_{j=1}^{\bar{d}_i} \log k_{ij} - \log\left(\bar{d}_i!\right) \right] \tag{7}$$
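
The logarithm of the weight can be accumulated hub by hub, which avoids overflow for long sequences. The sketch below assumes the allowed-set sizes have been recorded as one list per hub, so that the length of each list equals that hub's initial residual degree. The function name and the example sizes are illustrative only.

```python
import math


def log_weight(allowed_sizes_per_hub):
    """Sum over hubs of [sum_j log k_ij - log(d_i!)], as in Eq. 7.

    allowed_sizes_per_hub[i] lists the allowed-set sizes seen while
    connecting hub i, one entry per link placed from that hub.
    """
    total = 0.0
    for sizes in allowed_sizes_per_hub:
        residual_degree = len(sizes)
        # lgamma(n + 1) equals log(n!) without forming the factorial.
        total += sum(math.log(k) for k in sizes) - math.lgamma(residual_degree + 1)
    return total


if __name__ == "__main__":
    # Illustrative sizes: a hub of residual degree 2, then a hub of degree 1.
    print(math.exp(log_weight([[3, 2], [1]])))
```

Each bracketed term is non-negative whenever the weight of Eq. 3 is at least 1 for that hub, so the sum grows with the number of hubs.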

Generally, degree sequences with Inline graphic admit many graphical realizations. When this is true, each of the Inline graphic terms in square brackets in Eq. 7 are effectively random and independent, and, by virtue of the central limit theorem, their sum will be normally distributed. That is, the weight Inline graphic of graph samples generated from a given degree sequence with large Inline graphic is typically log-normally distributed. However, degree sequences with Inline graphic that have only a small number of realizations do exist, and Inline graphic is not expected to be log-normally distributed for those sequences.

Furthermore, one can consider not just samples of a particular graphical sequence, but of an ensemble of sequences. By a similar argument to that given above for individual sequences, the weight Inline graphic of graph samples generated from an ensemble of sequences will also typically be log-normally distributed in the limit of large Inline graphic. For example, consider an ensemble of sequences of randomly chosen power-law distributed degrees, that is, sequences of random integers chosen from a probability distribution Inline graphic. Hereafter, we refer to such sequences as “power-law sequences.” Figure 1 shows the probability distribution of the logarithm of weights for realizations of power-law sequences with exponent Inline graphic and Inline graphic. Note that this distribution is well approximated by a Gaussian fit.

Figure 1. Probability distribution Inline graphic of the logarithm of weights for an ensemble of power-law sequences with Inline graphic and Inline graphic.

The ensemble contained Inline graphic graphical sequences, and for each sequence Inline graphic graph samples were produced. Thus, the total number of samples produced was Inline graphic. The simulation data is given by the solid black line and a Gaussian fit of the data is shown by the dashed red line that nearly obscures the black line.

We have also studied the behavior of the mean and the standard deviation of the probability distribution of the logarithm of the weights of such power-law sequences as a function of Inline graphic. As shown in Fig. 2, they scale as a power-law. We have found qualitatively similar results, including power-law scaling of the growth of the mean and variance of the distribution of Inline graphic, for binomially distributed degree sequences that correspond to those of Erdős-Rényi random graphs with node connection probability Inline graphic such that Inline graphic, and for uniformly distributed degree sequences, that is power-law sequences with Inline graphic, with an upper limit, or cutoff, of Inline graphic for the degree of a node. However, for uniformly distributed degree sequences without an imposed upper limit on node degrees, we find that the sample weights are not log-normally distributed.

Figure 2. Mean Inline graphic and standard deviation Inline graphic of the distributions of the logarithm of the weights vs. number of nodes Inline graphic of samples from an ensemble of power-law sequences with Inline graphic.

The black circles correspond to Inline graphic, the red squares correspond to Inline graphic. The error bars are smaller than the symbols. The solid black line and the dashed red line show the outcomes of fits on the data. The linearity of the data on a logarithmic scale indicates that the Inline graphic and Inline graphic follow power-law scaling relations with Inline graphic: Inline graphic and Inline graphic. The slopes of the fit lines are an estimate of the value of the exponents: Inline graphic and Inline graphic.

Complexity

In this section we discuss the algorithm's computational complexity. We first provide an upper bound on the worst case complexity, given a degree sequence Inline graphic. Then, using extreme value arguments, we conservatively estimate the average case complexity for degree sequences of random integers chosen from a distribution Inline graphic. The latter is useful for realistically estimating the computational costs for sampling graphs from ensembles of long sequences.

To determine an upper bound on the worst case complexity for constructing a sample from a given degree sequence Inline graphic, recall that the algorithm connects all the stubs of the current hub node before it moves on to the hub node of the new residual sequence. For every stub from the hub one must construct the allowed set Inline graphic. The algorithm for constructing Inline graphic, which includes constructing Inline graphic, performing the Inline graphic vs Inline graphic comparisons, and determining the maximum fail-degree, can be completed in Inline graphic steps, where Inline graphic is the maximum possible number of nodes in the residual sequence after eliminating Inline graphic hubs from the process. Therefore, an upper bound on the worst case complexity Inline graphic of the algorithm given a sequence Inline graphic is:

graphic file with name pone.0010012.e294.jpg (8)

where the sum involves at most Inline graphic terms. Equivalently, Inline graphic, with Inline graphic being the number of links in the graph. For simple graphs, the maximum possible number of links is Inline graphic, and the minimum possible number is Inline graphic. If Inline graphic, then Inline graphic, and if Inline graphic, then Inline graphic, which is an upper bound, independent of the sequence.

From Eq. 8, the expected complexity for the algorithm to construct a sample for a degree sequence of random integers chosen from a distribution Inline graphic, normalized to unity, can be conservatively estimated as

graphic file with name pone.0010012.e305.jpg (9)

Here Inline graphic is the expectation value for the degree of the node with index Inline graphic, which is the largest degree for which the expected number of nodes with equal or larger degree is at least Inline graphic. That is,

graphic file with name pone.0010012.e309.jpg (10)

Notice that the sum in the above equation runs to the maximum allowed degree in the network Inline graphic, which is nominally Inline graphic, but a different value can be imposed. For example, in the case of power-law sequences, the so-called structural cutoff of Inline graphic is necessary if degree correlations are to be avoided [19], [26], [27]. However, such a cutoff needs to be imposed only for Inline graphic, because the expected maximum degree Inline graphic in a power-law network grows like Inline graphic. Thus, for Inline graphic, Inline graphic grows no faster than Inline graphic and no degree correlations exist for large Inline graphic [28].
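The rank-ordered expected degree described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `power_law_pmf` and `expected_ranked_degrees`, the minimum degree of 1, and the dictionary representation of the distribution are all assumptions made here. For each rank i, it returns the largest degree k such that the expected number of nodes with degree at least k is at least i.

```python
def power_law_pmf(gamma, k_max, k_min=1):
    """Normalized p(k) proportional to k**(-gamma) on k_min..k_max (assumed form)."""
    ks = range(k_min, k_max + 1)
    weights = [k ** (-gamma) for k in ks]
    total = sum(weights)
    return {k: w / total for k, w in zip(ks, weights)}


def expected_ranked_degrees(pmf, n):
    """Return [d_1, ..., d_n], where d_i is the largest degree k for which
    n * P(K >= k) >= i, i.e. at least i nodes are expected at or above k."""
    ks = sorted(pmf, reverse=True)   # degrees from largest to smallest
    ranked = []
    pos = 0
    tail = pmf[ks[0]]                # running value of P(K >= ks[pos])
    for i in range(1, n + 1):
        # Lower the degree until at least i nodes are expected at or above it.
        # The bound on pos guards against rounding error at the smallest degree.
        while n * tail < i and pos < len(ks) - 1:
            pos += 1
            tail += pmf[ks[pos]]
        ranked.append(ks[pos])
    return ranked
```

A maximum degree other than N - 1 is imposed simply by passing a smaller `k_max`, for example `int(n ** 0.5)` if the structural cutoff is taken to scale as the square root of N, which is the usual convention in the cited literature and is assumed here.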

Given a particular form of distribution Inline graphic, Eq. 9 can be computed for different values of Inline graphic. Subsequent fits of the results to a power-law function allow the order of the complexity of the algorithm to be estimated. Figure 3 shows the results of such calculations for power-law sequences with and without the structural cutoff of Inline graphic as a function of exponent Inline graphic. Note that, in the absence of a cutoff, the results indicate that the order of the complexity goes to a value of 3 for Inline graphic, that is, in the limit of a uniform degree distribution. However, if the structural cutoff is imposed, the order of the complexity is only Inline graphic in this limit. Both of these results are easily verified analytically.
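The fitting step can be illustrated with an ordinary least-squares fit on logarithmic axes. This is a sketch under the assumption that the cost follows a pure power law in the number of nodes; the function name `fit_power_law_exponent` is hypothetical, and the inputs may be either complexity estimates or measured execution times.

```python
import math


def fit_power_law_exponent(sizes, costs):
    """Fit costs ~ prefactor * sizes**exponent by least squares in log-log space.

    Returns (exponent, prefactor). All sizes and costs must be positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx                      # slope of the log-log line
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor
```

The returned exponent plays the role of the "order of the complexity" plotted in Figure 3. With real timing data the fit is only meaningful over the range of sizes for which the leading-order term dominates.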

Figure 3. The estimated computational complexity of the algorithm for power-law sequences.


The leading order of the computational complexity of the algorithm as a power of Inline graphic, where Inline graphic is the number of nodes, is plotted as a function of the degree distribution power-law exponent Inline graphic. The black circles correspond to ensembles of sequences without cutoff, while the red squares correspond to ensembles of sequences with structural cutoff in the maximum degree of Inline graphic. The fits that yielded the data points were carried out considering sequences ranging in size from Inline graphic to Inline graphic.

We have tested the estimates shown in Fig. 3 with our implementation of the sampling algorithm for power-law sequences with and without the structural cutoff for certain values of Inline graphic, including 0, 2, and 3. This was done by measuring the actual execution times for generating samples for different Inline graphic and fitting the results to a power-law function. In every case, the actual order of the complexity of our implementation of the sampling algorithm was equal to or slightly less than its estimated value shown in Fig. 3.

Discussion

We have solved the long-standing problem of how to efficiently and accurately sample the possible graphs of any graphical degree sequence, and of any ensemble of degree sequences. The algorithm we present for this purpose is ergodic and is guaranteed to produce an independent sample in, at most, Inline graphic steps. Although the algorithm generates samples non-uniformly, and is therefore biased, the relative probability of generating each sample can be calculated explicitly, permitting unbiased measurements to be made. Furthermore, because the sample weights are known explicitly, the algorithm makes it possible to sample with any desired distribution by appropriate re-weighting.
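The re-weighting idea can be sketched as a weighted average. This is a minimal illustration, not the authors' implementation: it assumes each sample comes with a weight inversely proportional to the probability with which the algorithm generated it, so that the weighted mean estimates the uniform ensemble average. Weights are passed as logarithms, an assumption made here because sample weights can span many orders of magnitude; the function name `weighted_mean` is hypothetical.

```python
import math


def weighted_mean(values, log_weights):
    """Estimate an ensemble average from biased samples.

    values[i] is the observable measured on sample i; log_weights[i] is the
    natural log of that sample's weight (assumed inversely proportional to
    the probability of generating the sample).
    """
    shift = max(log_weights)                  # avoid overflow in exp()
    scaled = [math.exp(lw - shift) for lw in log_weights]
    numerator = sum(q * w for q, w in zip(values, scaled))
    return numerator / sum(scaled)
```

To target a non-uniform distribution instead, one would multiply each weight by the desired relative probability of the sample, that is, add its logarithm to the corresponding entry of `log_weights`.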

It is important to note that the sampling algorithm is guaranteed to proceed successfully and systematically in constructing a graph. This behavior contrasts with that of other algorithms, such as the configuration model (CM), which can run into dead ends that require back-tracking or restarting, leading to considerable losses of time and potentially introducing an uncontrollable bias into the results. While there are classes of sequences for which it is perhaps preferable to use the CM instead of our algorithm, in other cases its performance relative to ours can be remarkably poor. For example, a configuration model code failed to produce even a single sample of a uniformly distributed graphical sequence, Inline graphic, with Inline graphic, after running for more than 24 hours, while our algorithm produced Inline graphic samples of the very same sequence in 30 seconds. Furthermore, each sample generated by our algorithm is independent, in contrast with samples from algorithms based on MCMC methods, for which the mixing time is typically unknown. Because our algorithm works for any graphical sequence and for any ensemble of random sequences, it allows arbitrary classes of graphs to be studied.
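The restarts mentioned above can be seen in a bare-bones stub-matching sketch. This is an illustrative version of the configuration model with whole-sample rejection, written for this discussion and not the code used in the comparison; the function names, the `max_tries` limit, and the choice to restart on the first self-loop or repeated link are all assumptions.

```python
import random


def configuration_model_try(degrees, rng):
    """One stub-matching attempt; returns a set of links, or None on a dead end."""
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for a, b in zip(stubs[::2], stubs[1::2]):
        edge = (min(a, b), max(a, b))
        if a == b or edge in edges:   # self-loop or multiple link: not simple
            return None
        edges.add(edge)
    return edges


def configuration_model(degrees, max_tries=1000, seed=None):
    """Retry stub matching until a simple graph appears or max_tries is reached.

    Returns (edges, attempts); edges is None if every attempt was rejected.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        edges = configuration_model_try(degrees, rng)
        if edges is not None:
            return edges, attempt
    return None, max_tries
```

The number of attempts is not bounded in advance, which is the uncontrolled cost referred to in the text: for sequences with many high-degree nodes, almost every matching contains a self-loop or a repeated link.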

One of the features of our algorithm that makes it efficient is a method of calculating the left and right sides of the inequality in the Erdös-Gallai theorem using recursion relations. Testing a sequence for graphicality can thus be accomplished without requiring repeated computations of long sums, and the method is efficient even when the sequence is nearly non-degenerate. This method is not limited to the graph sampling algorithm presented here; it can be used whenever a fast test of the graphicality of a sequence of integers is needed.
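A graphicality test in this spirit can be sketched as follows. This is not the authors' recursion; it is a standard running-sum variant of the Erdös-Gallai test, written here as an assumption-laden illustration, in which both sides of the inequality are updated incrementally as k increases instead of being recomputed from scratch. The function name `is_graphical` is hypothetical.

```python
def is_graphical(degrees):
    """Erdös-Gallai test: can `degrees` be realized by a simple graph?"""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if any(x < 0 for x in d) or sum(d) % 2:
        return False
    # suffix[i] holds d[i] + d[i+1] + ... + d[n-1].
    suffix = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + d[i]
    left = 0      # running sum of the k largest degrees
    m = n         # number of entries with degree > k; only shrinks as k grows
    for k in range(1, n + 1):
        left += d[k - 1]
        while m > 0 and d[m - 1] <= k:
            m -= 1
        start = max(m, k)
        # Entries k..start-1 exceed k and contribute k each;
        # entries from start onward contribute their own degree.
        right = k * (k - 1) + k * (start - k) + suffix[start]
        if left > right:
            return False
    return True
```

After the initial sort, each of the n inequalities is checked in amortized constant time, because the pointer `m` moves in one direction only.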

There are now over 6000 publications focusing on complex networks. In many of these publications various processes, such as network growth, flow on networks, epidemics, etc., are studied on toy network models used as “graph representatives” simply because it has become customary to study processes on them. These include the Erdös-Rényi random graph model, the Barabási-Albert preferential attachment model, the Watts-Strogatz small-world network model, random geometric graphs, etc. However, these toy models are based on specific processes that constrain their structure beyond their degree distribution. These processes might not correspond to the ones that have led to the structure of the networks being investigated, thus potentially introducing dangerous biases in the conclusions of these studies. The algorithm presented here provides a way to study classes of simple graphs constrained solely by their degree sequence, and nothing else. However, additional constraints, such as connectedness, or any functional of the adjacency matrix of the graph being constructed, can in principle be added to the algorithm to further restrict the graph class built.

After this paper was accepted for publication, we became aware of an unpublished work by J. Blitzstein and P. Diaconis that provides another direct construction method for sampling graphs with given degree sequences.

Acknowledgments

The authors gratefully acknowledge Y. Sun, B. Danila, M. M. Ercsey Ravasz, I. Miklós, E. P. Erdös and L. A. Székely for fruitful comments, discussions and support.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: CIDG and KEB are supported by the National Science Foundation (NSF) through grant DMR-0908286 and by the Norman Hackerman Advanced Research Program through grant 95921. HK and ZT are supported in part by the NSF BCS-0826958 and by the Defense Threat Reduction Agency (DTRA) through HDTRA 201473-35045. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Albert R, Barabási AL. Statistical mechanics of complex networks. Rev Mod Phys. 2002;74:47–97.
  • 2. Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003;45:167–256.
  • 3. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU. Complex networks: Structure and dynamics. Phys Rep. 2006;424:175–308.
  • 4. Newman MEJ, Barabási AL. The structure and dynamics of networks. Princeton University Press; 2006.
  • 5. Liljeros F, Edling CR, Amaral L, Stanley H, Åberg Y. The web of human sexual contacts. Nature. 2001;411:907–908. doi: 10.1038/35082140.
  • 6. Bianconi G. Entropy of network ensembles. Phys Rev E. 2009;79:036114. doi: 10.1103/PhysRevE.79.036114.
  • 7. Bianconi G, Coolen ACC, Perez Vicente CJ. Entropies of complex networks with hierarchically constrained topologies. Phys Rev E. 2008;78:016114. doi: 10.1103/PhysRevE.78.016114.
  • 8. Kim H, Toroczkai Z, Erdös P, Miklós I, Székely L. Degree-based graph construction. J Phys A: Math Theor. 2009;42:392001.
  • 9. Koren M. Sequences with a unique realization by simple graphs. J Comb Theor B. 1976;21:235.
  • 10. Taylor R. Constrained switchings in graphs. SIAM J Alg Disc Math. 1982;3:115–121.
  • 11. Cooper C, Dyer M, Greenhill C. Sampling regular graphs and a peer-to-peer network. Comb Prob Comp. 2007;16:557–593.
  • 12. Kannan R, Tetali P, Vempala S. Simple Markov-chain algorithms for generating bipartite graphs and tournaments. Random Struct Alg. 1999;14:293–308.
  • 13. Viger F, Latapy M. Efficient and simple generation of random simple connected graphs with prescribed degree sequence. Lect Notes Comp Sci. 2005;3595:440–449.
  • 14. Bollobás B. A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. Eur J Comb. 1980;1:311–316.
  • 15. Bender E, Canfield E. The asymptotic number of labelled graphs with given degree sequences. J Comb Th A. 1978;24:296–307.
  • 16. Molloy M, Reed B. A critical point for random graphs with a given degree sequence. Rand Struct Alg. 1995;6:161–179.
  • 17. Newman MEJ, Strogatz SH, Watts DJ. Random graphs with arbitrary degree distributions and their applications. Phys Rev E. 2001;64:026118. doi: 10.1103/PhysRevE.64.026118.
  • 18. Britton T, Deijfen M, Martin-Löf A. Generating simple random graphs with prescribed degree distribution. J Stat Phys. 2006;124:1377–1397.
  • 19. Chung F, Lu L. Connected components in random graphs with given expected degree sequences. Ann Combinatorics. 2002;6:125.
  • 20. Erdös P, Gallai T. Graphs with prescribed degree of vertices. Mat Lapok. 1960;11:477.
  • 21. Havel V. A remark on the existence of finite graphs. Časopis Pěst Mat. 1955;80:477.
  • 22. Hakimi SL. On the realizability of integers as the degrees of the vertices of a linear graph - I. J SIAM Appl Math. 1962;10:496.
  • 23. Newman MEJ, Barkema GT. Monte Carlo methods in statistical physics. Oxford University Press; 1999.
  • 24. Cochran WG. Sampling techniques. Wiley; 1977.
  • 25. Tripathi A, Vijay S. A note on a theorem of Erdös & Gallai. Discr Math. 2003;265:417.
  • 26. Burda Z, Krzywicki A. Uncorrelated random networks. Phys Rev E. 2003;67:046118. doi: 10.1103/PhysRevE.67.046118.
  • 27. Boguñá M, Pastor-Satorras R, Vespignani A. Cut-offs and finite size effects in scale-free networks. Eur Phys J B. 2004;38:205.
  • 28. Catanzaro M, Boguñá M, Pastor-Satorras R. Generation of uncorrelated random scale-free networks. Phys Rev E. 2005;71:027103. doi: 10.1103/PhysRevE.71.027103.

Articles from PLoS ONE are provided here courtesy of PLOS
