Tailored graph ensembles as proxies or null models for real networks I: tools for quantifying structure

A Annibale; ACC Coolen; LP Fernandes; F Fraternali; J Kleinjung

doi:10.1088/1751-8113/42/48/485001

Author manuscript; available in PMC: 2010 Sep 14.

Published in final edited form as: J Phys A Math Gen. 2009 Dec 4;42(48):485001. doi: 10.1088/1751-8113/42/48/485001


PMCID: PMC2938474 EMSID: UKMS27919 PMID: 20844594

Abstract

We study the tailoring of structured random graph ensembles to real networks, with the objective of generating precise and practical mathematical tools for quantifying and comparing network topologies macroscopically, beyond the level of degree statistics. Our family of ensembles can produce graphs with any prescribed degree distribution and any degree-degree correlation function, its control parameters can be calculated fully analytically, and as a result we can calculate (asymptotically) formulae for entropies and complexities, and for information-theoretic distances between networks, expressed directly and explicitly in terms of their measured degree distribution and degree correlations.

1. Introduction

In the study of natural or synthetic signaling networks, one of the key questions is how network structure relates to the execution of the process which it supports. This is especially true in systems biology, where, for instance, our understanding of how the structure of protein-protein interaction networks (PPIN) relates to their biological functionality is vital in the design of a new generation of intelligent and personalized medical interventions. In recent years, high-throughput proteomics has allowed for the drafting of large PPIN data sets, for different organisms, and with different experimental techniques and degrees of accuracy. With this accumulation of information, we now face the challenge of analyzing these data from a complex networks perspective, and using them optimally in order to increase our understanding of how PPIN control the functioning of cells, both in healthy and in diseased conditions. A prerequisite for achieving this is the availability of precise mathematical tools with which to quantify topological structure in large observed networks, to compare network instances and distinguish between meaningful and ‘random’ structural features. These tools have to be both systematic, i.e. with a sound statistical or information-theoretic basis, but also practical, i.e. preferably formulated in terms of explicit formulae as opposed to tedious numerical simulations.

Many quantities have been proposed for characterizing the structure of networks, such as degree distributions [1], degree sequences [2], degree correlations [3] and assortativity [4], clustering coefficients [5], and community structures [6]. To assess the relevance of an observed topological feature in a network, a common strategy is to compare it against similar observations in so-called ‘null models’, defined as randomized versions of the original network which retain some features of the original one. The choice of which topological features to conserve in the randomized models was mostly limited to degree distributions and degree sequences. Such null models were used to assess the statistical relevance of network motifs in real networks, viz. patterns which were observed significantly more often in the real networks than in their randomized counterparts [7, 8, 9]. Whether any such proposed motif is indeed functionally important and/or represents an evolutionarily arisen principle is, however, not obvious; topological deviations from randomized networks could also be merely irrelevant consequences of some neglected structural property of the network, i.e. the result of an inappropriate null hypothesis rather than of a distinctive feature of the process [10, 11]. The definition and generation of good null models for benchmarking topological measures of real world graphs (and the dynamical processes which they enable) is a nontrivial issue. Similarly, in comparing observed networks (which, as a result of experimental noise, will usually not even have identical nodes), one would seek to focus on the values of macroscopic topological observables, and know the typical properties of networks with the observed features.

In recent years there have been efforts to define and generate random graphs whose topological features can be controlled and tailored to experimentally observed networks. In [12] a parametrized random graph ensemble was defined where graphs have a prescribed degree sequence, and links are drawn in a way that allows for preferential attachment on the basis of arbitrary two-degree kernels. In this paper we generalize the definition of this ensemble, and show that it can be tailored asymptotically to the generation of graphs with any prescribed degree distribution and any prescribed degree correlation function (and that it is a maximum entropy ensemble, given the degree correlations). Moreover, in spite of its parameter space being in principle infinitely large, in contrast to most random graph ensembles used to mimic real networks, we can derive explicit analytical formulae for the parameters of the ensemble, to leading order in system size, expressed directly in terms of the observed characteristics of the network given. Graphs from this ensemble are thus ideally suited to be used as either proxies or null models for observed networks, depending on the question to be answered.

Statistical mechanics approaches have been proposed to quantify the information content of network structures. Especially the (Shannon or Boltzmann) entropy has been instrumental in characterizing the complexity of network ensembles [13, 14, 15]. Here, the crucial availability of analytical expressions for the parameters of our ensemble will enable us to derive explicit formulae, in the thermodynamic limit (based on combinatorial and saddle-point arguments), for our ensemble's Shannon entropy, and hence also for the complexity of its typical graphs. These formulae are compact and transparent, and expressed solely and explicitly in terms of the degree distribution and the degree correlations that our ensemble is targeting. Finally, along similar lines we can obtain an information theoretic distance between networks, again expressed solely in terms of their degree distributions and degree correlations. A companion paper [16] will be devoted to large scale applications to PPIN data of these complexity and distance measures; here we focus on their mathematical derivation. Although there is no need for numerical sampling in our derivations (all results can be obtained analytically), we note that exact algorithms for generating random graphs from the proposed ensemble exist [17].

2. Definitions and properties of network topology characterizations

2.1. Networks, degree distributions, and degree correlation functions

We study networks (or graphs) of N nodes (or vertices), labeled by Roman indices i,j,… etc, where every vertex can be connected to other vertices by undirected links (or ‘edges’). The microscopic structure of such a network is defined in full by an N × N matrix of binary variables c_ij ∈ {0, 1}, where the nodes i and j are connected by a link if and only if c_ij = 1. We define c_ij = c_ji and c_ii = 0 for all (i, j), and we abbreviate c = {c_ij}. Henceforth, unless indicated otherwise, any summation over Roman indices will always run over the set {1, …,N}.

A standard way of characterizing the topology of a network c, as e.g. observed in a biological or physical system under study, is to measure for each vertex i the degree k_i(c) = Σ_j c_ij, i.e. the number of links attached to this vertex. From these numbers then follow the empirical degree distribution p(k|c) and the observed average connectivity k̄(c):

p (k ∣ c) = \frac{1}{N} \underset{i}{Σ} δ_{k, k_{i} (c)}, \qquad \bar{k} (c) = \frac{1}{N} \underset{i}{Σ} k_{i} (c)

(1)

(using the Kronecker δ-symbol for n, m ∈ ℕ, defined as δ_nm = 1 for n = m and δ_nm = 0 otherwise). Objects such as p(k|c) have the advantage of being macroscopic in nature, allowing for size-independent characterization of network topologies, and for comparing networks that differ in size. However, networks with the same degree distribution can still differ profoundly in their microscopic structures. We need observables that capture additional topological information, in order to discriminate between different networks with the same degree distribution (1).
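As an illustration of definition (1), the following minimal sketch computes the degrees, the empirical degree distribution and the average connectivity from an adjacency matrix. The function name `degree_statistics` and the four-node toy graph are assumptions made for this example only; they do not appear in the original text.

```python
import numpy as np

def degree_statistics(c):
    """Return (degrees, p, kbar) for a symmetric 0/1 adjacency matrix c."""
    c = np.asarray(c)
    degrees = c.sum(axis=1)          # k_i(c) = sum_j c_ij
    counts = np.bincount(degrees)    # number of nodes with each degree k
    p = counts / len(degrees)        # p(k|c), indexed by k = 0, ..., k_max
    kbar = degrees.mean()            # average connectivity
    return degrees, p, kbar

# Hypothetical toy graph: path 0-1-2-3 plus the extra link 1-3
c = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (1, 3)]:
    c[i, j] = c[j, i] = 1

degrees, p, kbar = degree_statistics(c)
```

Here `p[k]` plays the role of p(k|c); because it is normalized by N, distributions obtained from graphs of different sizes can be compared directly.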

To construct macroscopic observables that quantify network topology beyond the level of degree statistics, it is natural to consider how the likelihood for two nodes of a network c to be connected depends on their degrees, which is measured by the degree correlation function

\tilde{Π} (k, k' ∣ c) = \frac{𝒫 [conn ∣ c, k, k']}{𝒫 [conn ∣ c]}

(2)

Here 𝒫[conn|c,k,k′] is the probability for two randomly drawn nodes with degrees (k, k′) to be connected, and 𝒫[conn|c] is the overall probability for two randomly drawn nodes to be connected, irrespectively of their degrees, viz.

𝒫 [conn ∣ c, k, k'] = \frac{Σ_{i \neq j} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)}}{Σ_{i \neq j} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)}}

(3)

𝒫 [conn ∣ c] = \frac{1}{N (N - 1)} \underset{i \neq j}{Σ} c_{ij} = \frac{\bar{k} (c)}{N - 1}

(4)

By definition, Π̃(k, k′|c) is symmetric under exchanging k and k′. For simple networks c₀, with some degree distribution p(k) but without any micro-structure beyond that required by p(k)^‡, it is known (see e.g. [18] and references therein) that in the limit N → ∞ one finds

\tilde{Π} (k, k' ∣ c_{0}) = kk' ∕ \bar{k}^{2} (c) .

(5)

It follows that those topological properties of a given (large) network c, that manifest themselves at the level of degree correlations and cannot be attributed simply to its degree statistics, can be quantified by a deviation from the simple law (5); see also [7, 19, 20]. One is therefore led in a natural way to the introduction of the relative degree correlations

Π (k, k' ∣ c) = \frac{\tilde{Π} (k, k' ∣ c)}{\tilde{Π} (k, k' ∣ c_{0})} = \frac{𝒫 [conn ∣ c, k, k']}{𝒫 [conn ∣ c]} \frac{\bar{k}^{2} (c)}{kk'} .

(6)

By definition, Π(k, k′|c₀) = 1 for sufficiently large simple networks c₀, whereas any statistically relevant deviation from Π(k, k′|c) = 1 signals the presence in network c of underlying criteria for connecting nodes beyond its degrees. Just like p(k|c), Π(k, k′|c) is again a macroscopic observable that can be measured directly and at low computation cost. It is therefore a natural tool for quantifying and comparing network structures beyond the level of degree statistics.
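The measurement of Π(k, k′|c) via (3), (4) and (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relative_degree_correlations`, the convention of returning NaN for degree pairs that do not occur in the graph, and the star-graph example are all choices made here.

```python
import numpy as np

def relative_degree_correlations(c):
    """Pi(k,k'|c) from definitions (3), (4) and (6); NaN where undefined."""
    c = np.asarray(c)
    N = c.shape[0]
    deg = c.sum(axis=1)
    kmax = deg.max()
    kbar = deg.mean()
    # ind[i, k] = 1 if node i has degree k
    ind = (deg[:, None] == np.arange(kmax + 1)[None, :]).astype(float)
    n_k = ind.sum(axis=0)                        # number of nodes with degree k
    links = ind.T @ c @ ind                      # sum_{i != j} c_ij delta delta (c_ii = 0)
    pairs = np.outer(n_k, n_k) - np.diag(n_k)    # sum_{i != j} delta delta
    p_conn = kbar / (N - 1)                      # eq. (4)
    k = np.arange(kmax + 1, dtype=float)
    kk = np.outer(k, k)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_conn_kk = links / pairs                # eq. (3)
        Pi = p_conn_kk / p_conn * kbar**2 / kk   # eq. (6)
    return Pi

# Hypothetical example: a star with one hub (degree 3) and three leaves (degree 1)
c = np.zeros((4, 4), dtype=int)
c[0, 1:] = 1
c[1:, 0] = 1
Pi = relative_degree_correlations(c)
```

For this star, leaves connect only to the hub, so `Pi[1, 3]` exceeds 1 while `Pi[1, 1]` vanishes; `Pi[3, 3]` is undefined because there is only one node of degree 3. For such a small graph the values are of course dominated by finite size effects.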

2.2. Properties of the relative degree correlation function

To prepare the ground for proving some asymptotic mathematical properties of the relative degree correlation function Π(k, k′|c), we first simplify the denominator of (3):

\begin{matrix} \underset{i \neq j}{Σ} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)} & = \underset{ij}{Σ} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)} - δ_{kk'} \underset{i}{Σ} δ_{k, k_{i} (c)} \\ & = N^{2} [p (k ∣ c) p (k' ∣ c) - N^{- 1} δ_{kk'} p (k ∣ c)] \end{matrix}

(7)

Upon inserting the result for (4) together with (3) into (6) we then find that

Π (k, k' ∣ c) = \frac{N^{- 1} Σ_{i \neq j} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)}}{[p (k ∣ c) - N^{- 1} δ_{kk'}] p (k' ∣ c)} \frac{\bar{k} (c)}{kk'} (1 - N^{- 1})

(8)

and hence, using c_ii = 0 for all i,

\lim_{N \to \infty} Π (k, k' ∣ c) = \lim_{N \to \infty} \frac{\bar{k} (c) Σ_{ij} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)}}{N p (k ∣ c) p (k' ∣ c) kk'} .

(9)

We are now in a position to establish three identities obeyed by Π(k, k′|c). The first two of these, viz. (10,12), are the main ones; they are used frequently in mathematical manipulations of subsequent sections. The third provides the physical intuition behind (10,12). It is assumed implicitly in all proofs that k̄(c) remains finite for N → ∞ and that the limits N → ∞ exist.

Linear constraints:
- $\forall k \in \mathbb{N} : \lim_{N \to \infty} \underset{k'}{Σ} \frac{k' p (k' ∣ c)}{\bar{k} (c)} Π (k, k' ∣ c) = 1$ (10)
  These are easily verified for simple graphs c₀, for which Π(k, k′|c₀) = 1 ∀k, k′. However, they turn out to hold for any graph c, as can be proven using (9) as follows:
  $\begin{matrix} \lim_{N \to \infty} \underset{k'}{Σ} \frac{k' p (k' ∣ c)}{\bar{k} (c)} Π (k, k' ∣ c) & = \lim_{N \to \infty} \frac{Σ_{k'} Σ_{ij} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)}}{N p (k ∣ c) k} \\ = \lim_{N \to \infty} \frac{Σ_{i} k_{i} (c) δ_{k, k_{i} (c)}}{N p (k ∣ c) k} = 1 . \end{matrix}$ (11)
Normalization:
- $\lim_{N \to \infty} \underset{kk'}{Σ} \frac{k p (k ∣ c)}{\bar{k} (c)} \frac{k' p (k' ∣ c)}{\bar{k} (c)} Π (k, k' ∣ c) = 1$ (12)
  This follows directly from (10) upon multiplying both sides by kp(k|c)/k̄(c), followed by summation over all k.
Interpretation of the linear constraints:
- The LHS of (10) can be rewritten as
  $\lim_{N \to \infty} \underset{k'}{Σ} \frac{k' p (k' ∣ c)}{\bar{k} (c)} Π (k, k' ∣ c) = \lim_{N \to \infty} \frac{\bar{k} (c)}{k} \underset{k'}{Σ} p (k' ∣ c) \frac{𝒫 [conn ∣ c, k, k']}{𝒫 [conn ∣ c]} = \lim_{N \to \infty} \frac{N}{k} 𝒫 [conn ∣ c, k]$ (13)
  where we used (6), (3) and (4), and where 𝒫[conn|c, k] = Σ_k′ p(k′|c) 𝒫[conn|c, k, k′] is the marginal of 𝒫[conn|c, k, k′], i.e. the probability that two randomly drawn nodes, one of which has degree k, are connected. We conclude that our first (proven) identity (10) boils down to the claim that for large N one has 𝒫[conn|c, k] = k/N (modulo irrelevant orders in N), which is easily understood: a node of degree k is linked to k of the other N − 1 nodes.

We end this section with two further observations. First, the relations (10) involve the degree distribution, so one must expect that the possible values for Π(k, k′|c) are dependent upon (or constrained by) p(k|c). Second, several other useful properties of the kernel Π(k, k′|c) can be extracted from (10). For instance, the only separable kernel Π is Π(k, k′|c) = 1 for all (k, k′): a separable kernel is of the form Π(k, k′|c) = G(k|c)G(k′|c) for some function G(k|c) (Π being symmetric), and insertion of this form into (10) leads immediately to G(k|c) = 1 for all k.
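The linear constraints (10) and the normalization (12) can be checked numerically. The sketch below is an illustration under assumptions: it uses a hypothetical Erdős–Rényi-type toy graph (N = 200, link probability 0.03), evaluates Π in its asymptotic form (9) at finite N, and restricts all sums to degrees k > 0 that actually occur in the graph. With these choices the steps in (11) carry over line by line, so the left-hand side of (10) should equal 1 up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
# Assumed toy graph: symmetric random adjacency matrix without self-links
upper = np.triu(rng.random((N, N)) < 0.03, k=1)
c = (upper | upper.T).astype(float)

deg = c.sum(axis=1).astype(int)
kbar = deg.mean()
k = np.arange(deg.max() + 1)
ind = (deg[:, None] == k[None, :]).astype(float)   # ind[i, k] = delta_{k, k_i(c)}
p = ind.mean(axis=0)                               # p(k|c)

present = (p > 0) & (k > 0)                        # degrees that occur, excluding k = 0
kp, pp = k[present], p[present]
links = (ind.T @ c @ ind)[np.ix_(present, present)]
Pi = kbar * links / (N * np.outer(pp, pp) * np.outer(kp, kp))   # form (9) at finite N

weights = kp * pp / kbar                           # k' p(k'|c) / kbar
lhs_10 = Pi @ weights                              # left-hand side of (10), one entry per k
lhs_12 = weights @ Pi @ weights                    # left-hand side of (12)
```

Both `lhs_10` and `lhs_12` depend on the graph only through sums that cancel exactly, which is why the constraints hold for any graph and not just for graphs without degree correlations.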

3. Random graphs with controlled macroscopic structure

3.1. Definition of the random graph ensembles

To study the signalling properties of real-world networks, or generate ‘null models’ to assess the relevance of observed topological features, one needs random graph ensembles in which one can control the topological characteristics one is interested in and ‘tune’ these to match the characteristics of the observed networks. Most ensembles studied in literature so far have focused on producing graphs with controlled degree statistics. The suggestion that (6) can be used for identifying network complexity beyond degree statistics goes back at least to [7, 18, 19, 20]. In contrast to these earlier studies, which were mostly limited to measuring (6) for real networks, here we take further mathematical steps that will allow us to use (6) as a systematic tool for quantifying complexity and distances in network structure beyond degree statistics. This requires generating random graphs in which we can control at will both the degree distribution p(k) and the relative degree correlations Π(k, k′).

It will turn out that we can achieve our objectives with the following random graph ensembles, in which all degrees k_i are drawn randomly and independently from p(k), and where in addition the edges are drawn in a way that allows for preferential attachment on the basis of an arbitrary symmetric function Q(k, k′) of the degrees of the two vertices concerned:

Prob (c ∣ p, Q) = \underset{k}{Σ} Prob (c ∣ k, Q) \prod_{i} p (k_{i})

(14)

Prob (c ∣ k, Q) = \frac{1}{Ƶ (k, Q)} \prod_{i < j} [\frac{\bar{k}}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] \prod_{i} δ_{k_{i}, k_{i} (c)}

(15)

Here $Ƶ (k, Q)$ is a normalization constant that ensures Σ_c Prob(c|k,Q) = 1 for all (k,Q), k̄ = N⁻¹ Σ_i k_i, and the function Q must obey Q(k, k′) ≥ 0 for all (k, k′) and N⁻² Σ_ij Q(k_i, k_j) = 1. The ensemble (15) with prescribed degrees k = (k₁, …, k_N) was defined and studied in [12, 21]. We note that in the above ensemble one will have k̄ = Σ_k p(k)k + 𝒪(N^{−1/2}).

Upon making the simplest choice Q(k, k′) = 1 for all (k, k′) one retrieves from (14) the ‘flat’ ensemble, where once the individual degrees are drawn randomly from p(k), all graphs c with the prescribed degrees carry equal probability:

Prob (c) = \underset{k}{Σ} [\prod_{i} p (k_{i})] \frac{\prod_{i} δ_{k_{i}, k_{i} (c)}}{Σ_{c'} \prod_{i} δ_{k_{i}, k_{i} (c')}} .

(16)

This follows from the property that for Q(k, k′) = 1 the factor ∏_{i<j} [… δ_{c_ij,1} + … δ_{c_ij,0}] in (14) depends on c via the degrees {k_i(c)} only, and will consequently drop out of the measure (15):

\begin{matrix} \prod_{i < j} [\frac{\bar{k}}{N} δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N}) δ_{c_{ij}, 0}] & = (1 - \frac{\bar{k}}{N})^{\frac{1}{2} N (N - 1)} e^{Σ_{i < j} c_{ij} \log [\frac{\bar{k}}{N} (1 - \frac{\bar{k}}{N})^{- 1}]} \\ & = (1 - \frac{\bar{k}}{N})^{\frac{1}{2} N (N - 1)} e^{\frac{1}{2} N \bar{k} \log [\frac{\bar{k}}{N} (1 - \frac{\bar{k}}{N})^{- 1}]} \end{matrix}

(17)
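Identity (17) can be verified numerically for any given graph. The following sketch does so in logarithms, to avoid numerical underflow of the product over all pairs; the toy graph (N = 30, link probability 0.1) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
# Assumed toy graph, stored via its i < j entries only
c_upper = np.triu(rng.random((N, N)) < 0.1, k=1).astype(float)
kbar = 2.0 * c_upper.sum() / N          # since sum_{i<j} c_ij = N * kbar / 2
x = kbar / N

iu = np.triu_indices(N, k=1)            # index pairs with i < j
# Logarithm of the left-hand side of (17): one factor per pair (i, j)
log_lhs = np.sum(np.where(c_upper[iu] == 1, np.log(x), np.log(1 - x)))
# Logarithm of the right-hand side of (17): depends on the graph via kbar only
log_rhs = 0.5 * N * (N - 1) * np.log(1 - x) + 0.5 * N * kbar * np.log(x / (1 - x))
```

Since `log_rhs` involves the graph only through `kbar`, all graphs with the same prescribed degrees receive the same weight, which is the property used to obtain (16).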

3.2. Asymptotic properties of the ensembles

One should expect that macroscopic physical observables such as p(k|c) (1) and Π(k, k′|c) (8) are self-averaging, and can therefore be calculated, to leading order in N, in terms of their expectation values over the ensemble (14)^§. We should therefore find that each graph drawn from (14) will for sufficiently large N have as its degree distribution p(k) and will have relative degree correlations identical to

\begin{matrix} Π (k, k') & = \lim_{N \to \infty} \underset{c}{Σ} Prob (c ∣ p, Q) Π (k, k' ∣ c) \\ = \frac{〈 k 〉}{p (k) p (k') kk'} \lim_{N \to \infty} \frac{1}{N} \underset{c}{Σ} Prob (c ∣ p, Q) \underset{ij}{Σ} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)} \end{matrix}

(18)

It turns out that (18) can be calculated analytically, and expressed in terms of p(k) and Q(k, k′). The first published result related to this connection in an appendix of [12] was unfortunately subject to an error; see the corrigendum [21] for the correct relation as given below, of which the actual derivation is given in Appendix A of this present paper:

Π (k, k') = \frac{Q (k, k')}{F (k ∣ Q) F (k' ∣ Q)}

(19)

where the function F(k|Q) is calculated self-consistently, for any Q(k, k′), as the solution of

\forall k : F (k ∣ Q) = \frac{1}{〈 k 〉} \underset{k'}{Σ} p (k') k' \frac{Q (k, k')}{F (k' ∣ Q)} .

(20)

It is satisfactory to observe, upon eliminating Q(k, k′) from (20) via (19), that (20) becomes identical to the set of relations (10) that we derived earlier for Π(k, k′) solely on the basis of the latter's microscopic definition. Clearly, (10) must indeed hold for every single graph of the ensemble (14), provided N is sufficiently large. On the other hand, for finite N a typical graph of the ensemble (14) will display deviations from (19) that are at least of order 𝒪(N^{−1}) (the difference between definition (8) and its asymptotic form (9)), but possibly of order 𝒪(N^{−1/2}) (the typical finite size corrections in empirical averages over 𝒪(N) independent samples).

Expression (19) also provides en passant the explicit proof that for graphs in which the only structure is that imposed by the degree sequence, viz. those generated from (16) corresponding to Q(k, k′) = 1 for all (k, k′), one indeed finds Π(k, k′) = 1 for N → ∞. Upon inserting Q(k, k′) = 1 into condition (20) we find that F²(k) = 1 for all k, upon which the desired result follows directly from (19).
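For a general kernel Q the self-consistency equation (20) has to be solved numerically. A minimal sketch is given below; the geometric-mean damping of the fixed-point iteration is a choice made here to suppress oscillations and is not taken from the paper, and the function names `solve_F` and `Pi_from_Q` as well as the three-point degree distribution are hypothetical.

```python
import numpy as np

def solve_F(p, k, Q, iters=500, tol=1e-13):
    """Damped fixed-point iteration for F(k|Q) in (20), on the degree grid k."""
    w = k * p / np.sum(k * p)            # k p(k) / <k>
    F = np.ones_like(w)
    for _ in range(iters):
        T = Q @ (w / F)                  # right-hand side of (20)
        F_new = np.sqrt(F * T)           # geometric-mean damping (assumed choice)
        if np.max(np.abs(F_new - F)) < tol:
            return F_new
        F = F_new
    return F

def Pi_from_Q(p, k, Q):
    """Relative degree correlations generated by kernel Q, via (19)."""
    F = solve_F(p, k, Q)
    return Q / np.outer(F, F)

k = np.array([1.0, 2.0, 3.0])
p = np.array([0.5, 0.3, 0.2])            # hypothetical degree distribution
kmean = np.sum(k * p)

Pi_flat = Pi_from_Q(p, k, np.ones((3, 3)))                 # Q(k,k') = 1
Pi_separable = Pi_from_Q(p, k, np.outer(k, k) / kmean**2)  # Q(k,k') = k k' / <k>^2
```

Both kernels should return Π(k, k′) = 1: for Q = 1 the iteration is stationary at F = 1, and for the separable kernel it converges to F(k) = k/〈k〉, so that the k-dependence of Q is absorbed entirely by F.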

Asymptotically (i.e. in leading relevant orders in N), the probabilities (14) of finding graphs c with the correct degree statistics, i.e. with degrees drawn randomly from p(k), depend on c via the degree distribution p(k|c) and the kernel Π(k, k′|c) only. To see this we study the following function for large N,

\begin{matrix} Ω (c ∣ p, Q) & = - N^{- 1} \log Prob (c ∣ p, Q) = - N^{- 1} \log \underset{k}{Σ} \prod_{i} [p (k_{i}) δ_{k_{i}, k_{i} (c)}] e^{- N Ω (c ∣ k, Q)} \\ & = - \frac{1}{N} \underset{i}{Σ} \log p (k_{i} (c)) + Ω (c ∣ k (c), Q) \end{matrix}

(21)

The leading order in Ω(c|k, Q) = −N⁻¹ log p(c|k, Q) was studied in [12]. If k(c) ≠ k one has Ω(c|k, Q) = ∞ (the degrees are imposed as strict constraints), whereas for k = k(c) one has

\begin{matrix} Ω (c ∣ k, Q) & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - N^{- 1} \underset{i}{Σ} \log k_{i}! + N^{- 1} \underset{i}{Σ} k_{i} \log F (k_{i} ∣ Q) \\ & \quad - N^{- 1} \underset{i < j}{Σ} c_{ij} \log Q (k_{i}, k_{j}) + 𝒪 (N^{- 1}) \\ & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - N^{- 1} \underset{i}{Σ} \log k_{i}! + N^{- 1} \underset{i}{Σ} k_{i} \log F (k_{i} ∣ Q) \\ & \quad - \frac{1}{2} \underset{kk'}{Σ} \log Q (k, k') \frac{1}{N} \underset{ij}{Σ} c_{ij} δ_{k, k_{i}} δ_{k', k_{j}} + 𝒪 (N^{- 1}) \end{matrix}

(22)

where k̄ = N⁻¹ Σ_i k_i. We introduce the further short-hand p̃(k) = N⁻¹ Σ_i δ_{k,k_i}, as well as the notation o(1) to denote finite size corrections that obey lim_{N→∞} o(1) = 0 (to determine the exact scaling with N of these corrections we would have to inspect e.g. the finite size corrections to (19)). We write the leading orders of (22) in terms of the kernel Π(k, k′|c), using (9) and (19), and substituting into (19) the present degree distribution p̃(k), and find

\begin{matrix} Ω (c ∣ k, Q) & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - \underset{k}{Σ} \tilde{p} (k) \log k! + \underset{k}{Σ} \tilde{p} (k) k \log F (k ∣ Q) \\ & \quad - \frac{1}{2} \underset{kk'}{Σ} \log Q (k, k') N^{- 1} \underset{ij}{Σ} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)} + o (1) \\ & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - \underset{k}{Σ} \tilde{p} (k) \log k! + \underset{k}{Σ} \tilde{p} (k) k \log F (k ∣ Q) \\ & \quad - \underset{kk'}{Σ} \frac{\tilde{p} (k) \tilde{p} (k') kk'}{2 \bar{k}} Π (k, k' ∣ c) \log [Π (k, k') F (k ∣ Q) F (k' ∣ Q)] + o (1) \\ & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - \underset{k}{Σ} \tilde{p} (k) \log k! \\ & \quad + \underset{k}{Σ} \tilde{p} (k) k \log F (k ∣ Q) [1 - \underset{k'}{Σ} \frac{\tilde{p} (k') k'}{\bar{k}} Π (k, k' ∣ c)] \\ & \quad - \underset{kk'}{Σ} \frac{\tilde{p} (k) \tilde{p} (k') kk'}{2 \bar{k}} Π (k, k' ∣ c) \log Π (k, k') + o (1) \\ & = \frac{1}{2} \bar{k} \log N + \frac{1}{2} \bar{k} [\log \bar{k} - 1] - \underset{k}{Σ} \tilde{p} (k) \log k! \\ & \quad - \underset{kk'}{Σ} \frac{\tilde{p} (k) \tilde{p} (k') kk'}{2 \bar{k}} Π (k, k' ∣ c) \log Π (k, k') + o (1) \end{matrix}

(23)

where in the last step we used the identities (10). Inserting this result into (21) then gives

\begin{matrix} Ω (c ∣ p, Q) = & \frac{1}{2} \bar{k} (c) \log N + \frac{1}{2} \bar{k} (c) [\log \bar{k} (c) - 1] - \underset{k}{Σ} p (k ∣ c) \log k! \\ - Ω [p (c), Π (c); p, Π] + o (1) \end{matrix}

(24)

\begin{matrix} Ω [p (c), Π (c); p, Π] = & \underset{kk'}{Σ} \frac{p (k ∣ c) p (k' ∣ c) kk'}{2 \bar{k} (c)} Π (k, k' ∣ c) \log Π (k, k') \\ + \underset{k}{Σ} p (k ∣ c) \log p (k) \end{matrix}

(25)

with k̄(c) = Σ_k k p(k|c), Π(c) = {Π(k, k′|c)}, and p(c) = {p(k|c)}. The leading order ½k̄(c) log N in Ω(c|p, Q) reflects the property that the number of finitely connected graphs grows asymptotically with N as exp[~N log N]. The next order is found to depend only on the macroscopic characterization {p(c), Π(c)} of the specific graph c, and on the macroscopic characterization {p, Π} of typical graphs from (14), with Π calculated for the kernel Q via (19).

3.3. Existence and uniqueness of tailored ensembles

We will now prove that for each degree distribution p(k) and each relative degree correlation function Π(k, k′) there exist kernels Q(k, k′) such that their associated ensembles (14) will for large N be tailored to the production of random graphs with precisely these statistical features. We identify these kernels and show that they all correspond in leading order in N to the same random graph ensemble.

Existence of a family of tailored kernels:
- For each non-negative function φ(k) such that p(k)φ(k)Π(k, k′)φ(k′)p(k′) is nonzero for at least one combination (k, k′), the following kernel satisfies all conditions required to define a random graph ensemble of the family (14) that generates graphs with degree distribution p(k) and relative degree correlation function Π(k, k′) as N → ∞:
  $Q (k, k') = \frac{ϕ (k) Π (k, k') ϕ (k')}{Z}, Z = \underset{kk'}{Σ} p (k) ϕ (k) Π (k, k') ϕ (k') p (k')$ (26)
  Q(k, k′) is by construction non-negative, symmetric, and correctly normalized. Also we will always find Z > 0 due to Π(k, k′) ≥ 0 in combination with our conditions on φ(k) and the normalization (12). Recovering the correct degree distribution is built into the ensemble (14) via the degree constraints. To prove that equations (19,20) are satisfied we define $F (k ∣ Q) = ϕ (k) ∕ \sqrt{Z}$ , and use the fact that by virtue of (19) the condition (20) reduces to (10), and is therefore guaranteed to hold, provided Π(k, k′) indeed represents a relative degree correlation function.
- What remains is to show that there exist functions φ(k) that meet the relevant conditions. The simplest candidate is φ(k) = k/〈k〉, for which we find Z = 1 via (12) and which is easily confirmed to meet all criteria. It gives what we will call the canonical kernel:
  $Q^{*} (k, k') = Π (k, k') kk' ∕ {〈 k 〉}^{2}$ (27)
Completeness of the family of tailored kernels:
- The set of kernels defined by (26) is complete: if a kernel Q(k, k′) generates random graphs with statistics p(k) and Π(k, k′), then it must be of the form (26).
- The proof is simple. If Q(k, k′) generates graphs with relative degree correlation function Π(k, k′), according to (19) it must be of the form Q(k, k′) = F(k)Π(k, k′)F(k′) for some function F(k). Since both Π(k, k′) and Q(k, k′) must be non-negative, the same must be true for F(k). Hence Q(k, k′) is also of the form (26), with $ϕ (k) = \sqrt{Z} F (k)$ and with the formula for Z in (26) satisfied automatically due to Q(k, k′) having to be normalized.
- A further corollary is that all kernels tailored to the generation of graphs with statistics p(k) and Π(k, k′) are related to the canonical kernel (27) via separable transformations, with suitably normalized non-negative functions G(k):
  $Q (k, k') = G (k) Q^{*} (k, k') G (k')$ (28)
Asymptotic uniqueness of the canonical ensemble:
- The random graph ensembles of all kernels of the family (26), tailored to generating random graphs with statistical properties p(k) and Π(k, k′), are asymptotically (i.e. for large enough N) identical: if all {k_i} are drawn randomly from p(k), and Q(k, k′) belongs to the family (26) with canonical member Q*(k, k′) defined in (27), then
  ${[Prob (c ∣ p, Q)]}^{1 ∕ N} = {[Prob (c ∣ p, Q^{*})]}^{1 ∕ N} e^{o (1)}$ (29)
  This follows from (24), which tells us that in the two leading orders in N the probabilities of graphs generated from (26) depend on the kernel Q(k, k′) of the ensemble only via its associated function Π(k, k′), so that N⁻¹ log Prob(c|p, Q) − N⁻¹ log Prob(c|p, Q*) = o(1).

The above results imply that we may regard the random graph ensemble (14), equipped with the kernel (27), as the natural ensemble for generating large random graphs with topologies controlled strictly by a prescribed degree distribution p(k) and prescribed relative degree correlations Π(k, k′). We will call p(c|p, Q*), with Q*(k, k′) = Π(k, k′)kk′/〈k〉², the canonical ensemble for graphs with p(k) and Π(k, k′). Note that for Π(k, k′) = 1 one has Q*(k, k′) = kk′/〈k〉², which is indeed equivalent to the trivial choice Q(k, k′) = 1 (as it is related to the latter by a separable transformation).
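The construction of the canonical kernel (27) can be illustrated with a small worked example. All inputs below are assumptions made for illustration: a valid pair {p, Π} is generated from a hypothetical symmetric joint distribution W(k, k′) of the degrees found at the two ends of a link, whose marginal is identified with kp(k)/〈k〉 so that the constraints (10) hold by construction.

```python
import numpy as np

k = np.array([1.0, 2.0, 3.0])
# Assumed toy input: symmetric joint distribution of the degrees at the two ends of a link
W = np.array([[0.10, 0.10, 0.05],
              [0.10, 0.20, 0.10],
              [0.05, 0.10, 0.20]])
w = W.sum(axis=1)                         # marginal, identified with k p(k) / <k>
p = (w / k) / np.sum(w / k)               # degree distribution consistent with W
kmean = np.sum(k * p)                     # <k>
Pi = W / np.outer(w, w)                   # relative degree correlations obeying (10)

Q_star = Pi * np.outer(k, k) / kmean**2   # canonical kernel (27)

Z = p @ Q_star @ p                        # normalization of the kernel, cf. (26)
F = k / kmean                             # candidate solution of (20)
rhs_20 = Q_star @ (k * p / F) / kmean     # right-hand side of (20) evaluated at F
Pi_back = Q_star / np.outer(F, F)         # relation (19)
```

In this example `Z` equals 1 by virtue of (12), `rhs_20` reproduces `F`, and `Pi_back` returns the Π that was prescribed, in line with the existence argument above.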

We can now also define what we mean by ‘null models’. Given the hypothesis that a network c has no structure beyond that imposed by its degree statistics, the appropriate null model is a random graph generated by the canonical ensemble with degree distribution p(k) = p(k|c) and relative degree correlations Π(k, k′) = 1 (giving the trivial kernel Q(k, k′) = 1; these are usually referred to as ‘simple graphs’). Similarly, given the hypothesis that a network has no structure beyond that imposed by its degree statistics and its degree-degree correlations, the appropriate null model is a random graph generated by the canonical ensemble with degree distribution p(k) = p(k|c) and relative degree correlations Π(k, k′) = Π(k, k′|c).

Finally, self-consistency demands that p(k) and the canonical kernel (or a member of its equivalent family, related by separable transformations) are also the most probable pair {p, Q} in a Bayesian sense. The probability Prob(p, Q|c) that a pair {p, Q} was the ‘generator’ of c via (14) can be expressed, via standard Bayesian relations, in terms of the probability Prob(c|p, Q) of drawing c at random from (14):

Prob (p, Q ∣ c) = \frac{Prob (c, p, Q)}{Prob (c)} = \frac{Prob (c ∣ p, Q) Prob (p, Q)}{Σ_{Q'} Σ_{p'} Prob (c ∣ p', Q') Prob (p', Q')}

(30)

The most probable pair {p, Q} is the one that maximizes log Prob(p, Q|c) = log Prob(p, Q) + log Prob(c|p, Q) (modulo terms independent of {p, Q}), so in the absence of any prior bias, i.e. if Prob(p, Q) is independent of {p, Q}, it is the kernel that maximizes Prob(c|p, Q). Since Σ_c Prob(c|p, Q) = 1 for any {p, Q}, finding the most probable {p, Q} for a graph c boils down to finding the smallest ensemble of graphs compatible with the structure of c. Intuitively this makes sense: a more detailed characterization of the topology of an observed graph allows for more information being carried over from the graph to the ensemble, reducing the number of potential graphs allowed for by the ensemble. The smaller the number of graphs in the ensemble, the more accurate will these graphs be when used as proxies for the observed one.

Maximizing Prob(c|p, Q) over {p, Q} means minimizing Ω(c|p, Q) in (21), of which the leading orders in N are given in (24). Demonstrating Bayesian self-consistency of our canonical graph ensemble for large N hence boils down to proving that the maximum of (25) over {p, Π} (subject to the relevant constraints) is obtained for {p, Π} = {p(c), Π(c)}. The constraints include the set (10). There are clearly more, e.g. Π(k, k′) ≥ 0 for all (k, k′); however, we show below that maximizing (25) over {p, Π} subject only to (10) and Σ_k p(k) = 1 already generates the desired result: {p, Π} = {p(c), Π(c)}. Extremizing (25) with the Lagrange formalism leads to the following equations, which are to be solved in combination with (10) and Σ_k p(k) = 1:

\forall (k, k') : \frac{\partial}{\partial Π (k, k')} Ω [p (c), Π (c); p, Π] = \underset{ℓ \geq 0}{Σ} λ (ℓ) \frac{\partial}{\partial Π (k, k')} (\frac{1}{〈 k 〉} \underset{ℓ'}{Σ} ℓ' p (ℓ') Π (ℓ, ℓ') - 1)

(31)

\begin{matrix} \forall k : \frac{\partial}{\partial p (k)} Ω [p (c), Π (c); p, Π] = \underset{ℓ \geq 0}{Σ} λ (ℓ) & \frac{\partial}{\partial p (k)} (\frac{1}{〈 k 〉} \underset{ℓ'}{Σ} ℓ' p (ℓ') Π (ℓ, ℓ') - 1) \\ + μ \frac{\partial}{\partial p (k)} (\underset{k'}{Σ} p (k') - 1) \end{matrix}

(32)

in which {λ(ℓ)} and μ are Lagrange multipliers. Working out (31) gives

\forall (k, k') : \frac{p (k) Π (k, k') p (k')}{〈 k 〉} = \frac{p (k ∣ c) Π (k, k' ∣ c) p (k' ∣ c)}{\bar{k} (c)} \frac{p (k) k}{2 λ (k)}

(33)

Since both Π(k, k′) and Π(k, k′|c) must satisfy the constraints (10), with degree distributions p(k) and p(k|c), respectively, it follows from (33) that

λ (k) = \frac{1}{2} p (k) k

(34)

With this expression we eliminate λ(k) from (33) to find

\forall (k, k') : \frac{p (k) Π (k, k') p (k')}{〈 k 〉} = \frac{p (k ∣ c) Π (k, k' ∣ c) p (k' ∣ c)}{\bar{k} (c)}

(35)

Next we work out (32) and substitute (34) into the result. This gives, using symmetry of Π:

\forall k : p (k ∣ c) = μ p (k) + \frac{p (k)}{2 〈 k 〉} \underset{k'}{Σ} p (k') k' Π (k', k) = (μ + \frac{1}{2}) p (k)

(36)

The normalization conditions Σ_kp(k|c) = Σ_kp(k) = 1 then tell us that μ = ½, so p(k) = p(k|c) for all k, and finally also (via (35)):

\forall (k, k') : Π (k, k') = Π (k, k' ∣ c)

(37)

Hence, the choice {p, Π} = {p(c), Π(c)} indeed extremizes the leading two orders in N of (24), subject to (10) and to normalization of p(k). The above extremum must be a maximum, since by making pathological choices for {p, Π} (viz. choices inconsistent with the structure of c) we can make Prob(c|p, Q) arbitrarily small, and hence Ω[p(c), Π(c); p, Π] arbitrarily small. Hence our canonical ensembles are indeed self-consistent in a Bayesian sense, as expected.

3.4. The random graphs ensemble as a conditioned maximum entropy ensemble

In this section we show that our canonical ensemble gives the maximum entropy within the subspace of graphs with prescribed degrees, upon imposing as a constraint the average values Π(k, k′) = ⟨Π(k, k′|c)⟩ of the relative degree correlations. First we define our constraining observables, i.e. the degree sequence and the re-scaled degree correlation:

k_{i} (c) = k_{i} (\forall i)

(38)

q (k, k' ∣ c) = N^{- 1} \underset{ij}{Σ} c_{ij} δ_{k, k_{i} (c)} δ_{k', k_{j} (c)} (k, k' > 0)

(39)

Note that if N⁻¹ Σ_i δ_{k,k_i(c)} = p(k) and ⟨k⟩ = Σ_k k p(k), then

Π (k, k' ∣ c) = \frac{〈 k 〉}{p (k) p (k') kk'} q (k, k' ∣ c) for N \to \infty

(40)
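As a minimal sketch of how the observables (38)-(40) can be measured in practice, the following Python function computes p(k), q(k, k′|c) and Π(k, k′|c) from a symmetric 0/1 connectivity matrix. The function name `degree_statistics`, the use of NumPy, and the restriction to degree values k > 0 that actually occur in c are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def degree_statistics(c):
    """Return (ks, p, q, Pi) for a symmetric 0/1 connectivity matrix c.

    ks : sorted degree values k > 0 that occur in c
    p  : p(k) = N^-1 sum_i delta_{k, k_i(c)}
    q  : q(k, k'|c) as in eq. (39)
    Pi : Pi(k, k'|c) as in eq. (40), evaluated at finite N
    """
    c = np.asarray(c)
    N = c.shape[0]
    k_nodes = c.sum(axis=1)                    # degrees k_i(c), eq. (38)
    mean_k = k_nodes.mean()                    # average degree <k>
    ks = np.unique(k_nodes[k_nodes > 0])       # only k > 0, as in (39)
    # indicator[a, i] = 1 if node i has degree ks[a]
    indicator = (k_nodes[None, :] == ks[:, None]).astype(float)
    p = indicator.sum(axis=1) / N
    q = indicator @ c @ indicator.T / N        # counts links between degree classes
    Pi = mean_k * q / np.outer(p * ks, p * ks)
    return ks, p, q, Pi
```

For a four-node star (one hub of degree 3, three leaves of degree 1) this gives Π(1, 3) = Π(3, 1) = 2 and Π(1, 1) = Π(3, 3) = 0.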

We are interested in the maximum entropy random graph ensemble p(c) (limited to symmetric graphs without self-interactions) such that q(k, k′) = Σ_cp(c)q(k, k′|c) for all (k, k′) and k_i = k_i(c) for all i. This is given by the ensemble p(c) for which the Shannon entropy

S [k, q] = - \underset{c}{Σ} p (c ∣ k, q) \log p (c ∣ k, q)

(41)

is maximal subject to our constraints. Extremization of (41) with Lagrange multipliers gives, without enforcing p(c) ≥ 0 explicitly,

\forall c : \frac{\partial}{\partial p (c)} {\underset{c'}{Σ} p (c') [\log p (c') + Λ_{0} + \underset{kk'}{Σ} λ (k, k') q (k, k' ∣ c')]} = 0

(42)

\forall c : \log p (c) + Λ_{0} + \underset{kk'}{Σ} λ (k, k') q (k, k' ∣ c) + 1 = 0

(43)

\forall c : p (c) = \frac{1}{Ƶ} e^{- Σ_{kk'} λ (k, k') q (k, k' ∣ c)}

(44)

\forall c : p (c) = \frac{1}{Ƶ} e^{- N^{- 1} Σ_{ij} c_{ij} λ (k_{i} (c), k_{j} (c))}

(45)

with Ƶ such that Σ_c p(c) = 1. As expected for an ensemble of random graphs with maximum entropy, where a set of averages of observables are constrained to assume prescribed values, the result of the extremization gives an exponential family, where the parameters {λ(k_i, k_j)} are to be calculated from the equations for the constraints. What is left is to show that the exponential family can be reduced to the micro-canonical ensemble (15), where degrees are prescribed, by a simple redefinition of the Lagrange multipliers. Let us first rewrite (45) as

\forall c : p (c) = \frac{1}{Ƶ} (\prod_{i < j} e^{- N^{- 1} c_{ij} [λ (k_{i}, k_{j}) + λ (k_{j}, k_{i})]}) (\prod_{i} δ_{k_{i}, k_{i} (c)})

(46)

\forall c : p (c) = \frac{1}{Ƶ} (\prod_{i < j} [e^{- N^{- 1} [λ (k_{i}, k_{j}) + λ (k_{j}, k_{i})]} δ_{c_{ij}, 1} + δ_{c_{ij}, 0}]) (\prod_{i} δ_{k_{i}, k_{i} (c)})

(47)

We can then redefine our Lagrange multipliers in terms of the function Q(k, k′) via

\frac{〈 k 〉}{N} Q (k_{i}, k_{j}) = \frac{e^{- N^{- 1} [λ (k_{i}, k_{j}) + λ (k_{j}, k_{i})]}}{1 + e^{- N^{- 1} [λ (k_{i}, k_{j}) + λ (k_{j}, k_{i})]}}

This results in

\begin{matrix} p (c) = & \frac{1}{Ƶ} \prod_{i < j} {(1 - \frac{〈 k 〉 Q (k_{i}, k_{j})}{N})}^{- 1} \\ \times \prod_{i < j} [\frac{〈 k 〉}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{〈 k 〉}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] . \prod_{i} δ_{k_{i}, k_{i} (c)} \end{matrix}

(48)

The first product in (48) only depends on the constrained degrees {k_i} (in fact, to leading order this dependence is only via their average ⟨k⟩, since ∏_{i<j}(1 − ⟨k⟩Q(k_i, k_j)/N)⁻¹ = e^{N⟨k⟩/2 + 𝒪(1)}), so it drops out of the measure, and hence (48) can be rewritten as

p (c) = \frac{1}{Ƶ (k, Q)} \prod_{i < j} [\frac{〈 k 〉}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{〈 k 〉}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] \prod_{i} δ_{k_{i}, k_{i} (c)}

(49)

Ƶ (k, Q) = \underset{c}{Σ} \prod_{i < j} [\frac{〈 k 〉}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{〈 k 〉}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] \prod_{i} δ_{k_{i}, k_{i} (c)}

(50)

which indeed reduces to (15), as claimed.
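The algebra behind the step from (47) to (48), and behind the estimate of the first product in (48), can be spelled out as follows. The shorthand a_{ij} and x_{ij} is introduced here for illustration only, and the last equality assumes the kernel normalization Σ_{kk′} p(k)p(k′)Q(k, k′) = 1.

```latex
% a_{ij} and x_{ij} are illustrative shorthand, not notation from the text.
\begin{align*}
a_{ij} &= \frac{\langle k\rangle}{N}\,Q(k_i,k_j), \qquad
x_{ij} = N^{-1}\,[\lambda(k_i,k_j)+\lambda(k_j,k_i)], \\
\frac{e^{-x_{ij}}}{1+e^{-x_{ij}}} = a_{ij}
 \;&\Longrightarrow\; e^{-x_{ij}} = \frac{a_{ij}}{1-a_{ij}}, \\
e^{-x_{ij}}\,\delta_{c_{ij},1}+\delta_{c_{ij},0}
 &= \frac{1}{1-a_{ij}}\,\bigl[a_{ij}\,\delta_{c_{ij},1}+(1-a_{ij})\,\delta_{c_{ij},0}\bigr], \\
\sum_{i<j}\log\frac{1}{1-a_{ij}}
 &= \frac{\langle k\rangle}{2N}\sum_{ij}Q(k_i,k_j)+\mathcal{O}(1)
 = \frac{N\langle k\rangle}{2}\sum_{kk'}p(k)p(k')Q(k,k')+\mathcal{O}(1)
 = \frac{N\langle k\rangle}{2}+\mathcal{O}(1).
\end{align*}
```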

3.5. Shannon entropy

The (rescaled) Shannon entropy of the canonical ensemble Prob(c|p, Q*), as defined by Q*(k, k′) = Π(k, k′)kk′/⟨k⟩² in combination with (14), is an important quantity as it allows us to define and calculate the effective number of graphs 𝒩[p, Π] in the ensemble:

S [p, Π] = - \frac{1}{N} \underset{c}{Σ} Prob (c ∣ p, Q^{*}) \log Prob (c ∣ p, Q^{*})

(51)

𝒩 [p, Π] = e^{NS [p, Π]}

(52)

In (51) one defines as always 0 log 0 = lim_{ε↓0} ε log ε = 0. For large N we can use our earlier results (21,24,25) to find the leading orders of the entropy, since

\begin{matrix} S [p, Π] = & \underset{c}{Σ} Prob (c ∣ p, Q^{*}) Ω (c ∣ p, Q^{*}) \\ = & \frac{1}{2} 〈 k 〉 [\log [N 〈 k 〉] - 1] - \underset{k}{Σ} p (k) \log k! \\ - \underset{c}{Σ} Prob (c ∣ p, Q^{*}) Ω [Π (c), p (c); Π, p] + o (1) \\ = & \frac{1}{2} 〈 k 〉 [\log [N 〈 k 〉] - 1] - \underset{k}{Σ} p (k) \log k! - \underset{k}{Σ} p (k) \log p (k) \\ - \underset{kk'}{Σ} \frac{p (k) p (k') kk'}{2 〈 k 〉} Π (k, k') \log Π (k, k') + o (1) \\ = & \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1] - \underset{k}{Σ} p (k) \log [p (k) ∕ π (k)] \\ - \frac{1}{2 〈 k 〉} \underset{kk'}{Σ} p (k) p (k') kk' Π (k, k') \log Π (k, k') + o (1) \end{matrix}

(53)

where π(k) denotes the Poissonian degree distribution with average degree ⟨k⟩, viz. π(k) = e^{−⟨k⟩}⟨k⟩^k/k!. To prove various properties of the above expression for the entropy it will be convenient to introduce a new (symmetric) quantity W(k, k′), defined as the probability that a randomly drawn link in a graph that has Π(k, k′|c) = Π(k, k′) connects two nodes with degrees k and k′. It can be shown to be related to Π(k, k′) via

W (k, k') = p (k) p (k') kk' Π (k, k') ∕ {〈 k 〉}^{2}

(54)

Irrespective of its exact meaning, the crucial mathematical advantage here of working with W(k, k′) is that it represents a probability distribution: W(k, k′) ≥ 0 and Σ_kk′ W(k, k′) = 1 (normalization follows from (10)). One also verifies explicitly that W(k) = Σ_k′ W(k, k′) = p(k)k/⟨k⟩ for all k. If we use (54) to eliminate Π(k, k′) from (53) in favour of W(k, k′) we get

\begin{matrix} S [p, Π] = & \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1] - \underset{k}{Σ} p (k) \log [p (k) ∕ π (k)] \\ - \frac{1}{2} 〈 k 〉 \underset{kk'}{Σ} W (k, k') \log [W (k, k') ∕ W (k) W (k')] + o (1) \end{matrix}

(55)

The term in (55) with W(k, k′) is seen to be proportional to minus the mutual information between two connected sites, and is therefore non-positive, vanishing if and only if Π(k, k′) = 1 for all (k, k′). Furthermore, the term in (55) involving π(k) is minus a KL-divergence, and therefore also non-positive, vanishing if and only if p(k) = π(k) for all k. Our result (55) therefore has a clear and elegant interpretation:

For the simplest graphs of the Erdös-Rényi type, where only the average degree ⟨k⟩ is imposed, one has p(k) = π(k) for all k and Π(k, k′) = 1 for all (k, k′). This gives W(k, k′) = W(k)W(k′), and the entropy takes its maximal value:
$S [p, Π] = \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1]$ (56)
For graphs where the degree distribution p(k) is imposed, but without further structure (i.e. still Π(k, k′) = 1 for all (k, k′)), the entropy decreases by an amount Σ_kp(k) log[p(k)/π(k)] which is the KL-distance between the imposed p(k) and the Poissonian degree distribution with the same average connectivity:
$S [p, Π] = \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1] - \underset{k}{Σ} p (k) \log [p (k) ∕ π (k)]$ (57)
For the more sophisticated graphs where both a degree distribution p(k) and nontrivial degree correlations defined via Π(k, k′) are imposed, one no longer has W(k, k′) = W(k)W(k′) and the entropy decreases further by an amount ½⟨k⟩ Σ_kk′ W(k, k′) log [W(k, k′)/W(k)W(k′)], which is proportional to the mutual information regarding degrees of connected nodes:
$\begin{matrix} S [p, Π] = & \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1] - \underset{k}{Σ} p (k) \log [p (k) ∕ π (k)] \\ - \frac{1}{2} 〈 k 〉 \underset{kk'}{Σ} W (k, k') \log [W (k, k') ∕ W (k) W (k')] \end{matrix}$ (58)
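The three contributions in (56)-(58) can be evaluated numerically as sketched below. This is an illustration under stated assumptions: the function name `entropy_terms` is hypothetical, the inputs are taken to be tabulated on the set of degree values with p(k) > 0, and Π is assumed to satisfy (10) so that W(k, k′)/W(k)W(k′) = Π(k, k′).

```python
import numpy as np
from math import lgamma, log

def entropy_terms(ks, p, Pi, N):
    """Leading-order terms of eq. (55): returns (S0, KL, MI).

    ks : degree values with p(k) > 0      p : their probabilities
    Pi : matrix Pi(k, k') on that support  N : number of nodes
    The entropy per node is then S = S0 - KL - MI, up to o(1).
    """
    ks = np.asarray(ks, dtype=float)
    p = np.asarray(p, dtype=float)
    Pi = np.asarray(Pi, dtype=float)
    mean_k = float(np.sum(ks * p))
    S0 = 0.5 * mean_k * (log(N / mean_k) + 1.0)              # eq. (56)
    # log of the Poissonian pi(k) with the same average degree
    log_pi = np.array([-mean_k + k * log(mean_k) - lgamma(k + 1.0) for k in ks])
    KL = float(np.sum(p * (np.log(p) - log_pi)))             # sum_k p log(p/pi)
    W = np.outer(p * ks, p * ks) * Pi / mean_k**2             # eq. (54)
    mask = W > 0                                              # convention 0 log 0 = 0
    MI = 0.5 * mean_k * float(np.sum(W[mask] * np.log(Pi[mask])))
    return S0, KL, MI
```

Setting Π(k, k′) = 1 everywhere makes the third term vanish, which reproduces (57); choosing p(k) = π(k) in addition makes the second term vanish as well, which reproduces (56).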

4. Quantitative tools for networks

In the previous sections we have shown that ensemble (14) is tailored, for large N, to the production of graphs with degree distribution p(k) and degree correlation Π(k, k′) given by (19,20). Conversely, for each desired function Π(k, k′), one may always choose the canonical kernel Q*(k, k′) = Π(k, k′)kk′/⟨k⟩² in (14) to tailor the ensemble to the production of graphs with the desired degree correlation.

The availability for any given/observed network c of a well-defined canonical random graph ensemble, that produces random graphs with microscopic topologies controlled solely by the observed degree statistics and degree correlations of the given c, allows us to develop practical quantitative tools with which to analyze and compare (structure in) real networks. Here we focus on three such tools.

4.1. Quantifying structural network complexity

The natural definition of the complexity of a given network c is based on the number 𝒩[p, Π] of graphs in its canonical ensemble {p, Π}, and hence on the entropy per node S[p, Π] given in (55). It makes sense to write this entropy for large N as S[p, Π] = S₀ − C[p, Π] + o(1), with a first (positive) contribution $S_{0} = \frac{1}{2} 〈 k 〉 [\log [N ∕ 〈 k 〉] + 1]$ that originates simply from counting the total number of bonds and would also be found for structureless Erdös-Rényi graphs (where only the average degree is prescribed), minus a second term C[p, Π] which acts to reduce the entropy as soon as there is structure in the graph beyond a prescribed average degree. This latter quantity C[p, Π] can be identified as the complexity of graphs in the canonical ensemble associated with c, and hence as the complexity of c:

𝒞 [p, Π] = \underset{k}{Σ} p (k) \log [p (k) ∕ π (k)] + \frac{1}{2 〈 k 〉} \underset{kk'}{Σ} p (k) p (k') kk' Π (k, k') \log Π (k, k')

(59)

where π(k) is the Poissonian distribution with average degree ⟨k⟩:

π (k) = e^{- 〈 k 〉} {〈 k 〉}^{k} ∕ k!

(60)

The larger C[p, Π], the more ‘rare’ or ‘special’ are graphs with characteristics {p, Π}. For every N, the complexity is bounded from above by (56); at this value the network undergoes an entropy ‘crisis’, as (58) vanishes and the degree distribution ceases to be graphical, i.e. no network can be found with this degree distribution (see [22] for the notion of graphicality). Note, however, that our results were obtained in the limit ⟨k⟩ ≪ N; they no longer apply for degree distributions with an average connectivity of the order of the system size. For e.g. fully connected graphs, where the complexity is maximal, the entropy should vanish, whereas (58) indeed yields an incorrect 𝒪(N) result. As an illustration one may check how close to the entropy crisis are PPIN of different species (PPIN typically meet the requirement ⟨k⟩ ≪ N for our theory to apply). For this purpose we have computed (58) for protein interaction networks of different species and show the results in Fig. 1. A more systematic and extensive application of our tools to PPIN will be published in [16].

Figure 1. Shannon entropy per node S[p, Π] (markers connected by solid lines) and complexity C[p, Π] (markers connected by dashed lines) of the canonical ensembles tailored to the production of random graphs with microscopic topologies controlled solely by the degree sequence and degree correlation of experimentally determined PPINs. The methods/sources for the experimental data sets are the following: BGD, BioGrid database; Y2h, yeast two-hybrid screen; PMs, purification mass spectrometry; HPRD, human protein reference database; DI, data integration (database with combined experimental data); PCA, protein fragment complementation assay. The studied organisms are listed in alphabetical order on the x-axis. Data set properties and references are summarised in Table 1.

4.2. Quantifying structural distance between networks

In the same spirit we can now also use our tailored graph ensembles to define an information-theoretic distance D_AB between any two networks c_A and c_B, based solely on the macroscopic structure statistics as captured by their associated (observed) structure function pairs {p_A, Π_A} and {p_B, Π_B}. The natural definition would be in terms of the Jeffreys divergence (i.e. the symmetrized KL-distance) between the two associated canonical ensembles, which is non-negative and equals zero if and only if {p_A, Π_A} = {p_B, Π_B}, i.e. if the graphs c_A and c_B belong to the same canonical ensemble:

\begin{matrix} D_{AB} = & \frac{1}{2 N} \underset{c}{Σ} Prob (c ∣ p_{A}, Q_{A}) \log [\frac{Prob (c ∣ p_{A}, Q_{A})}{Prob (c ∣ p_{B}, Q_{B})}] \\ + \frac{1}{2 N} \underset{c}{Σ} Prob (c ∣ p_{B}, Q_{B}) \log [\frac{Prob (c ∣ p_{B}, Q_{B})}{Prob (c ∣ p_{A}, Q_{A})}] \end{matrix}

(61)

Working out this formula, using (24) and (55), gives for large N:

\begin{matrix} D_{AB} = & \frac{1}{2} \underset{c}{Σ} Prob (c ∣ p_{A}, Q_{A}) Ω (c ∣ p_{B}, Q_{B}) - \frac{1}{2} S [p_{A}, Π_{A}] \\ + \frac{1}{2} \underset{c}{Σ} Prob (c ∣ p_{B}, Q_{B}) Ω (c ∣ p_{A}, Q_{A}) - \frac{1}{2} S [p_{B}, Π_{B}] \\ = & \frac{1}{2} \underset{k}{Σ} p_{A} (k) \log [\frac{p_{A} (k)}{p_{B} (k)}] + \underset{kk'}{Σ} \frac{p_{A} (k) p_{A} (k') kk'}{4 {〈 k 〉}_{A}} Π_{A} (k, k') \log [\frac{Π_{A} (k, k')}{Π_{B} (k, k')}] \\ + \frac{1}{2} \underset{k}{Σ} p_{B} (k) \log [\frac{p_{B} (k)}{p_{A} (k)}] + \underset{kk'}{Σ} \frac{p_{B} (k) p_{B} (k') kk'}{4 {〈 k 〉}_{B}} Π_{B} (k, k') \log [\frac{Π_{B} (k, k')}{Π_{A} (k, k')}] + o (1) \end{matrix}

(62)

This quantity is used in [16] for comparing and clustering PPIN data sets, even if these differ in size, solely on the basis of their degree sequence and degree correlations. The combination of its information-theoretic origin and explicit nature (so that it involves almost no computational cost) makes (62) an efficient practical tool in bio-informatics.
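Because (62) is fully explicit, it can be evaluated in a few lines. The sketch below is illustrative only: the function name `structural_distance` is hypothetical, both networks are assumed to be tabulated on a common grid of degree values, and p_B(k) and Π_B(k, k′) are assumed to be strictly positive wherever the corresponding quantities of network A are (and vice versa), since otherwise the logarithms in (62) diverge.

```python
import numpy as np

def structural_distance(ks, pA, PiA, pB, PiB):
    """Leading-order distance D_AB of eq. (62) on a common degree grid ks."""
    ks = np.asarray(ks, dtype=float)

    def half_kl(p, Pi, p_other, Pi_other):
        # one of the two symmetric halves of eq. (62)
        p, p_other = np.asarray(p, float), np.asarray(p_other, float)
        Pi, Pi_other = np.asarray(Pi, float), np.asarray(Pi_other, float)
        mean_k = np.sum(ks * p)
        m = p > 0                                   # 0 log 0 = 0
        term_p = 0.5 * np.sum(p[m] * np.log(p[m] / p_other[m]))
        weight = np.outer(p * ks, p * ks) * Pi / (4.0 * mean_k)
        w = weight > 0
        term_Pi = np.sum(weight[w] * np.log(Pi[w] / Pi_other[w]))
        return term_p + term_Pi

    return half_kl(pA, PiA, pB, PiB) + half_kl(pB, PiB, pA, PiA)
```

The two networks may have different sizes N, since only the pairs {p, Π} enter the formula.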

4.3. Numerical generation of canonical ‘null models’

We have shown that for any given network c it is possible to define a tailored ensemble of graphs, that share with c those structural aspects that follow directly from its degree distribution and degree correlations, and used it to define and calculate complexities and structural distances. Our next aim is to use the ensemble for generating random graphs with structure {p, Π} identical to that of a given network. The problems associated with generating complex random graphs with controlled properties are well known [23, 24, 25, 26, 27, 28, 29, 30]. In [17] a general method was proposed for generating random graphs with built-in constraints and specific statistical weights, such as described by the invariant measure (14), in the form of a Monte-Carlo process that is guaranteed to evolve from any initial graph c₀ that meets the relevant constraints towards the prescribed invariant measure (14). The initial graph c₀ can be constructed by hand, for any choice of p(k), such that for sufficiently large N it will have the required degree statistics (see e.g. [31]). With the general and exact algorithm [17] we can generate graphs according to the measure (14), with the kernel Q(k, k′) of (27) chosen such as to impose any desired degree correlations Π(k, k′). These graphs can then serve as ‘null models’, allowing us, for instance, to determine to what extent specific small motifs in biological networks (such as short loops) can be regarded as mere consequences of the overall structure dictated by their degree statistics and degree correlations, or whether they reflect deeper biological principles. See [16] for the results of such tests.
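To give a flavour of degree-preserving graph dynamics with a target measure of the form (14), here is a deliberately simplified sketch. It is not the algorithm of [17]: it is a plain Metropolis scheme in which a pair of links is drawn uniformly at random, forbidden swaps are rejected by staying put (so that the proposal is symmetric), and allowed swaps are accepted according to the ratio of link weights. The function name `edge_swap_sampler`, the callable kernel argument `Q` and the assumption 0 < ⟨k⟩Q(k, k′)/N < 1 for all degree pairs are illustrative choices.

```python
import numpy as np

def edge_swap_sampler(c, Q, n_attempts, seed=0):
    """Degree-preserving Metropolis edge swaps (simplified, NOT the method of [17]).

    c : symmetric 0/1 matrix with zero diagonal and at least two links
    Q : callable kernel Q(k, k') with 0 < <k> Q / N < 1 (assumption)
    Returns the final matrix and the number of accepted moves.
    """
    rng = np.random.default_rng(seed)
    c = np.array(c, dtype=int)                  # work on a copy
    N = c.shape[0]
    k = c.sum(axis=1)
    mean_k = k.mean()

    def w(i, j):
        # odds of a link between nodes i and j under a measure of type (14)
        a = mean_k * Q(k[i], k[j]) / N
        return a / (1.0 - a)

    edges = [(i, j) for i in range(N) for j in range(i + 1, N) if c[i, j]]
    accepted = 0
    for _ in range(n_attempts):
        e1, e2 = rng.choice(len(edges), size=2, replace=False)
        a, b = edges[e1]
        u, v = edges[e2]
        if rng.random() < 0.5:                  # random orientation of second link
            u, v = v, u
        # proposed move: links (a,b),(u,v) -> (a,v),(u,b); all degrees unchanged
        if len({a, b, u, v}) < 4 or c[a, v] or c[u, b]:
            continue                            # would give a self-loop or double link
        ratio = w(a, v) * w(u, b) / (w(a, b) * w(u, v))
        if rng.random() < min(1.0, ratio):
            c[a, b] = c[b, a] = c[u, v] = c[v, u] = 0
            c[a, v] = c[v, a] = c[u, b] = c[b, u] = 1
            edges[e1] = (min(a, v), max(a, v))
            edges[e2] = (min(u, b), max(u, b))
            accepted += 1
    return c, accepted
```

Rejecting forbidden swaps in place, rather than drawing only from the currently allowed moves, is a design choice made here to keep the proposal probabilities symmetric without further bookkeeping; it comes at the cost of wasted attempts.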

Here we generate, as an illustration, a synthetic network which is to have the same degree sequence and the same degree correlations as the protein interaction network of Escherichia coli, as given in [32] (i.e. we produce a member of the tailored graph ensemble of this particular PPIN), where N = 2457 and ⟨k⟩ = 7.05. The degree correlations of the resulting graph after 67,147 accepted moves of the Markov chain algorithm of [17] are shown in figure 2(b), and are seen to be in very good agreement with the degree correlations of the PPIN that are being targeted, displayed in figure 2(a) (note that there is no need to compare degree distributions, since all degrees are guaranteed to be conserved by the graph dynamics [17]). To rule out the possibility that the observed similarity in degree correlations between the synthetic graph and the original PPIN could have arisen from poor sampling of the microscopic configurations (and just reflect direct similarities in the connectivity matrices), we also calculated the Hamming distance between the connectivity matrices c and c′ of the original PPIN and the synthetic graph,

ρ (c, c') = \frac{1}{2 N 〈 k 〉} \underset{ij}{Σ} ∣ c_{ij} - c_{ij}' ∣,

(63)

(the prefactor is chosen such that when the two matrices differ in all the 2N⟨k⟩ entries which could be different, then ρ(c, c′) = 1). The Hamming distance vanishes if the two matrices are identical. In the present case we find ρ = 0.90, which implies that although our two graphs have similar macroscopic structure, their microscopic realizations are indeed very different.
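The normalized Hamming distance (63) is straightforward to compute. The sketch below (with the hypothetical function name `hamming_distance`) assumes that both matrices are symmetric 0/1 arrays with the same degree sequence, so that ⟨k⟩ can be read off from either of them.

```python
import numpy as np

def hamming_distance(c, c_prime):
    """Normalized Hamming distance rho(c, c') of eq. (63)."""
    c = np.asarray(c, dtype=int)
    c_prime = np.asarray(c_prime, dtype=int)
    N = c.shape[0]
    mean_k = c.sum() / N                      # average degree <k>, taken from c
    return np.abs(c - c_prime).sum() / (2.0 * N * mean_k)
```

For two graphs with the same degrees but no links in common, the sum over (i, j) counts every link of both graphs twice, which gives 2N⟨k⟩ and hence ρ = 1.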

Figure 2. Results of the Markov chain graph dynamics proposed in [17], tailored to generating equilibrium random graph ensembles with specific degree sequences *and* specific degree correlations. (a): colour plot of the relative degree correlations Π(k, k′|c) as measured in the *Escherichia coli* PPIN (here N = 2457 and ⟨k⟩ = 7.05). (b): colour plot of Π(k, k′|c′) in the synthetic graph c′ generated with Markov chain dynamics targeting the measured degree correlations of the PPIN, after 67,147 accepted moves. (c): colour plot of Π(k, k′|c′) in the final graph generated with dynamics targeting Π(k, k′) = 1 ∀ (k, k′), after 1,968,000 accepted moves.

For comparison, we also show in figure 2(c) the degree correlations of a synthetic graph obtained via the Markov chain dynamics of [17], starting from the same initial graph, but now targeting degree correlations described by Π(k, k′) = 1 for all (k, k′). All residual deviations in the bottom plot of figure 2 from the objective Π(k, k′) = 1 ∀ (k, k′) are due to finite size effects. Again we also calculate the Hamming distance between the original and the synthetic matrix, giving ρ = 0.93. This value is similar to the one found previously, but now the macroscopic structure of the synthetic graph in terms of the degree correlations is considerably different from the underlying PPIN.

It has been noted by several authors that most PPINs are disassortative, i.e. nodes with high degrees tend to connect with nodes with low degrees [4]. Measures of degree assortativity have been proposed in [3, 4, 33]. A conventional measure of assortativity is the correlation coefficient (⟨kk′⟩ − ⟨k⟩⟨k′⟩)/(⟨k²⟩ − ⟨k⟩²), calculated over the joint distribution W(k, k′) in (54). Degree assortativity has been shown to have important consequences for both the topology of a network and the process which it supports. In particular, it was shown that assortative networks are more resistant to random attacks, i.e. random vertex removal, whereas disassortative networks are less resistant [4]. It may be useful from a practical point of view to generate networks with a prescribed assortative character. This can again be achieved by using the measure (14), where the kernels Q(k, k′) are now chosen to produce assortative or disassortative graphs. In [17] it was shown that the kernel

Q (k, k') = \frac{∣ k - k' ∣^{2}}{2 (〈 k^{2} 〉 - {〈 k 〉}^{2})}

(64)

tailors the ensemble (14) to the production, for large N, of graphs with degree correlations

Π (k, k') = \frac{〈 k 〉 {(k - k')}^{2}}{[α_{3} - 2 α_{2} k + α_{1} k^{2}] [α_{3} - 2 α_{2} k' + α_{1} {k'}^{2}]}

(65)

where the three coefficients α_l are to be solved numerically from

α_{ℓ} = \underset{k}{Σ} \frac{k^{ℓ} p (k)}{α_{3} - 2 α_{2} k + α_{1} k^{2}}

(66)

This degree correlation has a disassortative character. In fact, any kernel of the form

Q (k, k') = C^{- 1} ∣ k - k' ∣^{n}, n = 1, 2, \dots

(67)

with C such that Σ_k,k′p(k)p(k′)Q(k, k′) = 1 will tailor the ensemble (14), for large N, to the production of graphs with increasingly negative assortative coefficients as n increases. A prototype of an assortative kernel would be

Q (k, k') = \frac{1}{C} \frac{1}{1 + ∣ k - k' ∣^{n}}

(68)

where C = Σ_k,k′ p(k)p(k′)[1 + |k − k′|ⁿ]⁻¹, so that again Σ_k,k′ p(k)p(k′)Q(k, k′) = 1. For sufficiently large N, the predicted values for Π(k, k′) follow from (19), where F(k) is to be solved numerically from (20). As an example we generated two synthetic graphs, both with the same degree sequence as the PPIN of Homo sapiens (the experimental data used was taken from the human protein reference database (HPRD), [34]). In the first graph we enforced an assortative connectivity using (68) with n = 1, and in the second one a disassortative connectivity using (67) with n = 1. Both graphs were generated with the algorithm of [17], starting from the actual Homo sapiens PPIN. In Fig. 3(a) we show the colour plot of the relative degree correlations Π(k, k′) as measured in the Homo sapiens PPIN, and in Fig. 3(c) and 3(e) we show the same quantity in the two synthetic graphs generated. For comparison we also show (Fig. 3(b) and 3(d), respectively) the functions Π(k, k′) that are being targeted, via the kernels in (68) and (67).
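The kernels (67) and (68) can be tabulated on the degree values of a given network as sketched below. The function name `normalised_kernel` and its arguments are illustrative assumptions; in both cases the constant C is fixed here by imposing the normalization Σ_k,k′ p(k)p(k′)Q(k, k′) = 1.

```python
import numpy as np

def normalised_kernel(ks, p, n=1, assortative=False):
    """Tabulate Q(k, k') of eq. (67), or of eq. (68) if assortative=True.

    ks : degree values, p : degree distribution p(k) on those values.
    C is chosen such that sum_{k,k'} p(k) p(k') Q(k, k') = 1.
    """
    ks = np.asarray(ks, dtype=float)
    p = np.asarray(p, dtype=float)
    diff = np.abs(ks[:, None] - ks[None, :]) ** n    # |k - k'|^n
    raw = 1.0 / (1.0 + diff) if assortative else diff
    C = p @ raw @ p
    return raw / C
```

With n = 2 and `assortative=False` the normalization constant equals 2(⟨k²⟩ − ⟨k⟩²), so the result coincides with the kernel (64).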

Figure 3. Colour plots of the relative degree correlations Π(k, k′) of networks which all have the degree sequence of the *Homo sapiens* (from the HPRD database) PPIN (with N = 9463 and ⟨k⟩ = 7.4). (a): Π(k, k′|c) as measured for the *Homo sapiens* PPIN. (b): the target assortative function Π(k, k′) given in (68). (c): the actual function Π(k, k′|c′) measured after 203,441 accepted moves of the Markov chain in [17], on the right. (d): the target disassortative function Π(k, k′) given in (67). (e): the actual function Π(k, k′|c′) measured after 266,763 accepted moves, on the right. These results confirm the efficiency of our canonical graph ensemble and its associated Markov chain algorithm, in generating controlled null models.

5. Discussion

In this paper we have studied the tailoring of a particular structured random graph ensemble to real-world networks. We have first derived several mathematical properties of this ensemble, including information-theoretic properties, its Shannon entropy, and the relation between its control parameters and the statistics and correlations of the degrees in the network to which the ensemble is tailored. We were then able to use the mathematical results in order to derive explicit and transparent mathematical tools with which to quantify structure in large real networks, define rational distance measures for comparing networks, and generate controlled null models as benchmark graphs. These tools are precise and based on information-theoretic principles, yet they take the form of fully explicit formulae (as opposed to implicit equations that require equilibration of extensive graph simulations). We therefore hope and anticipate that they will be particularly useful in bio-informatics; indeed a subsequent paper will be fully devoted to their application to a broad range of protein-protein interaction networks, involving multiple organisms and multiple experimental protocols [16].

Let us turn to the limitations of this study. Our work so far has focused on characterizing network structure macroscopically at the level of degree distributions and degree-degree correlations, and was limited to undirected networks and graphs. We therefore envisage two main directions in which the present theory could and should be developed further. The first, and relatively straightforward, one is generalization of the analysis to tailored directed random graph ensembles. Here one does not envisage insurmountable obstacles, and it would in bio-informatics open up the possibility of application to e.g. gene regulation networks. The second direction is towards the inclusion of measures of macroscopic structure that take account of loops, such as the distribution of length-three loops in which individual network nodes participate. Here the mathematical task is much more challenging, since in entropy calculations it is no longer clear whether and how one can achieve factorization over nodes.

Table 1.

Maximum degree k_max, detection method/source and reference for the biological network data sets. The detection methods/sources are abbreviated as in Fig. 1

Species	k_max	Method	Reference
C.elegans	99	Y2h	[35]
C.jejuni	207	Y2h	[36]
D.melanogaster	176	BGD	[37]
E.coli	641	PMs	[32]
H.pylori	55	Y2h	[38]
H.sapiens I	125	Y2h	[39]
H.sapiens II	95	Y2h	[40]
H.sapiens III	314	PMs	[41]
H.sapiens IV	247	HPRD	[34]
M.loti	401	Y2h	[42]
P.falciparum	51	Y2h	[43]
S.cerevisiae I	24	Y2h	[44]
S.cerevisiae II	55	Y2h	[45]
S.cerevisiae III	279	Y2h	[45]
S.cerevisiae IV	62	PMs	[46]
S.cerevisiae V	118	DI	[47]
S.cerevisiae VI	53	PMs	[48]
S.cerevisiae VII	32	DI	[49]
S.cerevisiae VIII	955	PMs	[50]
S.cerevisiae IX	141	PMs	[51]
S.cerevisiae X	127	DI	[52]
S.cerevisiae XI	58	PCA	[53]
S.cerevisiae XII	86	Y2h-PCA	[54]
Synechocystis	51	Y2h	[55]
T.pallidum	285	Y2h	[56]

Acknowledgements

ACCC would like to thank Conrad Pérez-Vicente for stimulating discussions and the Engineering and Physical Sciences Research Council (UK) for support in the form of a Springboard Fellowship.

Appendix A

Degree correlations in the random graph ensemble

In this appendix we prove the validity of the crucial relation (19), with Π(k, k′) as defined in (18) for the random graph ensemble (14):

Π (k, k') = \frac{〈 k 〉}{p (k) p (k') kk'} \lim_{N \to \infty} \underset{rs}{Σ} \underset{k}{Σ} [\prod_{ℓ} p (k_{ℓ})] δ_{k, k_{r}} δ_{k', k_{s}} \frac{1}{N} \underset{c}{Σ} Prob (c ∣ k, Q) c_{rs}

(A.1)

Let us work out the sum over the graphs c in (A.1), using the integral representation $δ_{nm} = {(2 π)}^{- 1} \int_{0}^{2 π} d ω e^{i ω (n - m)}$ to deal with the N degree constraints δ_{k_i,k_i(c)}. This introduces an N-fold integration over ω = (ω₁, … , ω_N) ∈ [0, 2π]^N. With a modest amount of foresight we introduce the two abbreviations ω · k = Σ_i ω_i k_i and

W (ω, k) = \prod_{i < j} {1 + \frac{\bar{k}}{N} Q (k_{i}, k_{j}) [e^{- i (ω_{i} + ω_{j})} - 1]}

(A.2)

These allow us to write

\begin{matrix} \underset{c}{Σ} Prob (c ∣ k, Q) c_{rs} = \frac{Σ_{c} c_{rs} \prod_{i < j} [\frac{\bar{k}}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] \cdot \prod_{i} δ_{k_{i}, k_{i} (c)}}{Σ_{c} \prod_{i < j} [\frac{\bar{k}}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] \cdot \prod_{i} δ_{k_{i}, k_{i} (c)}} \\ = \frac{\int d ω e^{i ω \cdot k} Σ_{c} c_{rs} \prod_{i < j} {[\frac{\bar{k}}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] e^{- i c_{ij} (ω_{i} + ω_{j})}}}{\int d ω e^{i ω \cdot k} Σ_{c} \prod_{i < j} {[\frac{\bar{k}}{N} Q (k_{i}, k_{j}) δ_{c_{ij}, 1} + (1 - \frac{\bar{k}}{N} Q (k_{i}, k_{j})) δ_{c_{ij}, 0}] e^{- i c_{ij} (ω_{i} + ω_{j})}}} \\ = \frac{\int d ω W (ω, k) e^{i ω \cdot k} [\frac{\frac{\bar{k}}{N} Q (k_{r}, k_{s}) e^{- i (ω_{r} + ω_{s})}}{1 + \frac{\bar{k}}{N} Q (k_{r}, k_{s}) [e^{- i (ω_{r} + ω_{s})} - 1]}]}{\int d ω W (ω, k) e^{i ω \cdot k}} \\ = \frac{\bar{k}}{N} Q (k_{r}, k_{s}) [1 + 𝒪 (N^{- 1})] \frac{\int d ω W (ω, k) e^{i ω \cdot k - i (ω_{r} + ω_{s})}}{\int d ω W (ω, k) e^{i ω \cdot k}} \end{matrix}

(A.3)

We next expand the function W(ω, k), as defined in (A.2), in leading orders for large N, using the abbreviation P(q, ω|ω, k) = N⁻¹ Σ_iδ_{q,k_i}δ(ω − ω_i):

\begin{matrix} W (ω, k) & = \prod_{i < j} \exp {\frac{\bar{k}}{N} Q (k_{i}, k_{j}) [e^{- i (ω_{i} + ω_{j})} - 1] + 𝒪 (N^{- 2})} \\ = \exp {\frac{\bar{k}}{2 N} \underset{ij}{Σ} Q (k_{i}, k_{j}) [e^{- i (ω_{i} + ω_{j})} - 1] + 𝒪 (1)} \\ = \exp {\frac{1}{2} \bar{k} N \underset{qq'}{Σ} \int d ω d ω' P (q, ω ∣ ω, k) P (q', ω' ∣ ω, k) Q (q, q') [e^{- i (ω + ω')} - 1] + 𝒪 (1)} \end{matrix}

(A.4)

We now insert the following representation of unity, for each combination of (q, ω),

\begin{matrix} 1 = & \int dP (q, ω) δ [P (q, ω) - P (q, ω ∣ ω, k)] \\ = & \int \frac{dP (q, ω) d \hat{P} (q, ω)}{2 π ∕ N} e^{iN \hat{P} (q, ω) [P (q, ω) - P (q, ω ∣ ω, k)]} \end{matrix}

(A.5)

and convert the previous expression for W(ω, k) into the form of a functional integral, with a path integral measure ${dP} = \prod_{q, ω} [dP (q, ω) Δ ω ∕ \sqrt{2 π}]$ (where the values of ω ∈ [0, 2π] are first discretized, with the discretization spacing Δω sent to zero as soon as this is possible):

\begin{matrix} W (ω, k) = & \int {dPd \hat{P}} e^{iN Σ_{q} \int d ω \hat{P} (q, ω) P (q, ω) - i Σ_{i} \hat{P} (k_{i}, ω_{i}) + 𝒪 (1)} \\ \times e^{\frac{1}{2} \bar{k} N Σ_{qq'} \int d ω d ω' P (q, ω) P (q', ω') Q (q, q') [e^{- i (ω + ω')} - 1]} \\ = & \int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}]) - i Σ_{i} \hat{P} (k_{i}, ω_{i}) + 𝒪 (1)} \end{matrix}

(A.6)

with

Ψ [{P, \hat{P}}] = i \underset{q}{Σ} \int_{0}^{2 π} d ω \hat{P} (q, ω) P (q, ω)

(A.7)

Φ [{P}] = \frac{1}{2} \bar{k} \underset{qq'}{Σ} \int_{0}^{2 π} d ω d ω' P (q, ω) P (q', ω') Q (q, q') [e^{- i (ω + ω')} - 1]

(A.8)

We can now integrate over the N-fold angles ω ∈ [0, 2π]^N, and obtain

\begin{matrix} \int d ω W (ω, k) e^{i ω \cdot k'} = & \int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}]) + 𝒪 (1)} \int d ω e^{i ω \cdot k' - i Σ_{i} \hat{P} (k_{i}, ω_{i})} \\ = & \int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}]) + 𝒪 (1)} \prod_{i} \int d ω e^{i [ω k_{i}' - \hat{P} (k_{i}, ω)]} \end{matrix}

(A.9)

and write the ratio of integrals in (A.3) as

\begin{matrix} \frac{\int d ω e^{i ω \cdot k - i (ω_{r} + ω_{s})} W (ω, k)}{\int d ω e^{i ω \cdot k} W (ω, k)} = \\ \frac{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}} ∣ k]) + 𝒪 (1)} {\frac{[\int d ω e^{i ω (k_{r} - 1) - i \hat{P} (k_{r}, ω)}] [\int d ω e^{i ω (k_{s} - 1) - i \hat{P} (k_{s}, ω)}]}{[\int d ω e^{i ω k_{r} - i \hat{P} (k_{r}, ω)}] [\int d ω e^{i ω k_{s} - i \hat{P} (k_{s}, ω)}]}}}{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}} ∣ k]) + 𝒪 (1)}} \end{matrix}

(A.10)

with

Ω [{\hat{P}} ∣ k] = \frac{1}{N} \underset{i}{Σ} \log \int d ω e^{i [ω k_{i} - \hat{P} (k_{i}, ω)]}

(A.11)

Therefore we find upon combining the previous intermediate results that the quantity of interest (A.1) can be written in the following form:

\begin{matrix} Π (k, k') = \frac{{〈 k 〉}^{2} Q (k, k')}{p (k) p (k') kk'} \lim_{N \to \infty} \underset{k}{Σ} [\prod_{ℓ} p (k_{ℓ})] [\frac{1}{N} \underset{r}{Σ} δ_{k, k_{r}}] [\frac{1}{N} \underset{s}{Σ} δ_{k', k_{s}}] \times \\ \times \frac{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}} ∣ k]) + 𝒪 (1)} {\frac{[\int d ω e^{i [ω (k - 1) - \hat{P} (k, ω)]}] [\int d ω e^{i [ω (k' - 1) - \hat{P} (k', ω)]}]}{[\int d ω e^{i [ω k - \hat{P} (k, ω)]}] [\int d ω e^{i [ω k' - \hat{P} (k', ω)]}]}}}{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}} ∣ k]) + 𝒪 (1)}} \\ = \frac{{〈 k 〉}^{2} Q (k, k')}{kk'} \times \\ \lim_{N \to \infty} \frac{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}}]) + 𝒪 (1)} {\frac{[\int d ω e^{i [ω (k - 1) - \hat{P} (k, ω)]}] [\int d ω e^{i [ω (k' - 1) - \hat{P} (k', ω)]}]}{[\int d ω e^{i [ω k - \hat{P} (k, ω)]}] [\int d ω e^{i [ω k' - \hat{P} (k', ω)]}]}}}{\int {dPd \hat{P}} e^{N (Ψ [{P, \hat{P}}] + Φ [{P}] + Ω [{\hat{P}}]) + 𝒪 (1)}} \end{matrix}

(A.12)

where

Ω [{\hat{P}}] = \underset{k^{″}}{Σ} p (k^{″}) \log \int d ω e^{i [ω k^{″} - \hat{P} (k^{″}, ω)]}

(A.13)

We conclude from (A.12), in which the functional integrals can be done by steepest descent in the limit N → ∞, that Π(k, k′) takes the form

Π (k, k') = Q (k, k') ∕ F (k ∣ Q) F (k' ∣ Q)

(A.14)

with

\frac{1}{F (k ∣ Q)} = \frac{〈 k 〉}{k} \frac{\int d ω e^{i ω (k - 1) - i \hat{P} (k, ω)}}{\int d ω e^{i ω k - i \hat{P} (k, ω)}}

(A.15)

and where the functions P(k, ω) and P̂(k, ω) are to be solved from extremization of Ψ[{P, P̂}] + Φ[{P}] + Ω[{P̂}], with the three functions given in (A.7,A.8,A.13), leading to the two coupled functional saddle-point equations δ[Ψ + Φ]/δP = 0 and δ[Ψ + Ω]/δP̂ = 0.

The last step in this appendix is to derive from the saddle-point equations an equation for the function F(k|Q) in (A.14). Upon transforming exp[−iP̂(k, ω)] = R(k, ω), our saddle-point equations simplify to

R (k, ω) = \exp {〈 k 〉 \underset{k'}{Σ} \int d ω' P (k', ω') Q (k, k') [e^{- i (ω + ω')} - 1]}

(A.16)

P (k, ω) = p (k) \frac{R (k, ω) e^{i ω k}}{\int d ω' R (k, ω') e^{i ω' k}}

(A.17)

Elimination of P(k, ω) from this set gives, using the identity ∫dω P(k, ω) = p(k),

\begin{matrix} R (k, ω) & = \exp {〈 k 〉 \underset{k'}{Σ} p (k') Q (k, k') [e^{- i ω} \frac{\int d ω' R (k', ω') e^{i ω' (k' - 1)}}{\int d ω' R (k', ω') e^{i ω' k'}} - 1]} \\ & = \exp {\underset{k'}{Σ} p (k') Q (k, k') e^{- i ω} k' ∕ F (k' ∣ Q) - G (k ∣ Q)} \end{matrix}

(A.18)

in which F(k|Q) is defined in (A.15), and G(k|Q) = 〈k〉 Σ_k′ p(k′)Q(k, k′). Insertion of our expression for R(k, ω) into (A.15), using exp[−iP̂(k, ω)] = R(k, ω), leaves us with an equation for F(k|Q) only, from which the object G(k|Q) simply drops out, since it gives an identical prefactor exp[−G(k|Q)] in both the numerator and the denominator of the formula for F(k|Q):

\begin{matrix} \frac{1}{F (k ∣ Q)} & = \frac{〈 k 〉}{k} \frac{\int d ω e^{i ω (k - 1)} R (k, ω)}{\int d ω e^{i ω k} R (k, ω)} \\ & = \frac{〈 k 〉}{k} \frac{Σ_{m \geq 0} \frac{1}{m!} {[Σ_{k'} p (k') Q (k, k') k' ∕ F (k' ∣ Q)]}^{m} \int d ω e^{i ω (k - 1) - i m ω}}{Σ_{m \geq 0} \frac{1}{m!} {[Σ_{k'} p (k') Q (k, k') k' ∕ F (k' ∣ Q)]}^{m} \int d ω e^{i ω k - i m ω}} \\ & = \frac{〈 k 〉}{k} \frac{Σ_{m \geq 0} \frac{1}{m!} δ_{m, k - 1} {[Σ_{k'} p (k') Q (k, k') k' ∕ F (k' ∣ Q)]}^{m}}{Σ_{m \geq 0} \frac{1}{m!} δ_{m, k} {[Σ_{k'} p (k') Q (k, k') k' ∕ F (k' ∣ Q)]}^{m}} \\ & = \frac{〈 k 〉}{Σ_{k'} p (k') Q (k, k') k' ∕ F (k' ∣ Q)} \end{matrix}

(A.19)
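
The last equality in (A.19) can be spelled out as follows. Writing A(k) for the bracketed sum (a shorthand introduced here purely for illustration), the Kronecker deltas select the term m = k − 1 in the numerator and m = k in the denominator, for k ≥ 1, while the common normalization factor of the ω-integrals cancels between the two:

A (k) = \underset{k'}{Σ} p (k') Q (k, k') k' ∕ F (k' ∣ Q), \quad \frac{1}{F (k ∣ Q)} = \frac{〈 k 〉}{k} \frac{A (k)^{k - 1} ∕ (k - 1)!}{A (k)^{k} ∕ k!} = \frac{〈 k 〉}{k} \frac{k}{A (k)} = \frac{〈 k 〉}{A (k)}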

Equivalently:

F (k ∣ Q) = {〈 k 〉}^{- 1} \underset{k'}{Σ} p (k') k' Q (k, k') F^{- 1} (k' ∣ Q)

(A.20)

Note that the present derivation of the combined result (A.14,A.20) also serves as the explicit proof of the validity of corrigendum [21].
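
As an illustration of how (A.14) and (A.20) might be evaluated numerically, the following is a minimal sketch and not the authors' implementation. It assumes a finite set of degrees, and the function name `solve_F`, the damping factor 1/2, the tolerance and the test kernel `g` are all hypothetical choices made here. A plain iteration F ← rhs can oscillate (in the scalar case F = c/F it alternates between two values), so the sketch averages the old and new values instead; convergence is not guaranteed for arbitrary Q.

```python
import numpy as np

def solve_F(p, Q, degrees, tol=1e-12, max_iter=10_000):
    """Damped fixed-point iteration for F(k|Q) in (A.20).

    p[i]    : degree distribution p(k) at k = degrees[i]
    Q[i, j] : kernel Q(k, k') at k = degrees[i], k' = degrees[j]
    Assumption (not from the paper): averaging old and new iterates
    is used to suppress the oscillation of the undamped map.
    """
    p = np.asarray(p, dtype=float)
    Q = np.asarray(Q, dtype=float)
    k = np.asarray(degrees, dtype=float)
    mean_k = np.sum(p * k)                 # <k>
    F = np.ones_like(k)                    # initial guess F(k) = 1
    for _ in range(max_iter):
        # Right-hand side of (A.20): <k>^{-1} sum_k' p(k') k' Q(k,k') / F(k')
        rhs = Q @ (p * k / F) / mean_k
        F_new = 0.5 * (F + rhs)            # damped update
        if np.max(np.abs(F_new - F)) < tol:
            return F_new
        F = F_new
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical example: degrees 1..5 with uniform p(k) and a separable
# kernel Q(k,k') = g(k) g(k'), for which F(k) = g(k) solves (A.20) exactly.
degrees = np.arange(1, 6)
p = np.full(5, 0.2)
g = 1.0 + 0.1 * degrees
Q = np.outer(g, g)

F = solve_F(p, Q, degrees)
Pi = Q / np.outer(F, F)                    # (A.14): Q(k,k') / [F(k) F(k')]
```

For the separable test kernel, substituting F(k) = g(k) into (A.20) gives g(k) 〈k〉^{-1} Σ_k′ p(k′)k′ = g(k), so the computed `Pi` should be close to 1 for all degree pairs, which gives a simple sanity check on the iteration.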

Footnotes

^‡

In section 2 we will give a precise and more general mathematical definition of ‘simple networks’, relative to some imposed macroscopic feature such as the degree distribution p(k).

^§

Proving this self-averaging property explicitly for the ensemble (14) is trivial in the case of p(k|c), and nontrivial but feasible in the case of Π(k, k′|c).

6. References

1. Albert R, Barabasi AL. Reviews of Modern Physics. 2002;74:47–97.
2. Barabasi AL, Albert R. Science. 1999;286:509. doi: 10.1126/science.286.5439.509.
3. Pastor-Satorras R, Vazquez A, Vespignani A. Phys. Rev. Lett. 2001;87:258701. doi: 10.1103/PhysRevLett.87.258701.
4. Newman MEJ. Phys. Rev. Lett. 2002;89:208701. doi: 10.1103/PhysRevLett.89.208701.
5. Watts DJ, Strogatz SH. Nature. 1998;393:440. doi: 10.1038/30918.
6. Newman MEJ, Leicht EA. Proc. Natl. Acad. Sci. U.S.A. 2007;104:9564. doi: 10.1073/pnas.0610537104.
7. Maslov S, Sneppen K. Science. 2002;296:910. doi: 10.1126/science.1065103.
8. Maslov S, Sneppen K, Zaliznyak A. Physica A. 2004;333:529–540.
9. Shen-Orr SS, Milo R, Mangan S, Alon U. Nature Genetics. 2002;31:64–68. doi: 10.1038/ng881.
10. Junker BH, Schreiber F. Analysis of biological networks. Wiley Series on Bioinformatics; Hoboken, NJ: 2008.
11. Artzy-Randrup Y, Fleishman SJ, Ben-Tal N, Stone L. Science. 2004;305:1107. doi: 10.1126/science.1099334.
12. Pérez-Vicente CJ, Coolen ACC. J. Phys. A: Math. Theor. 2008;41:255003.
13. Bianconi G. EPL. 2008;81:28005.
14. Bianconi G, Coolen ACC, Vicente CJP. Phys. Rev. E. 2008;78:016114. doi: 10.1103/PhysRevE.78.016114.
15. Bianconi G. Phys. Rev. E. 2009;79:036114. doi: 10.1103/PhysRevE.79.036114.
16. Fernandes LP, Annibale A, Kleinjung J, Coolen ACC, Fraternali F. doi: 10.1088/1751-8113/42/48/485001. in preparation.
17. Coolen ACC, De Martino A, Annibale A. accepted in J. Stat. Phys. 2009 arXiv:0905.4155.
18. Dorogovtsev SN, Mendes JF. Evolution of networks. Oxford University Press; Oxford: 2003.
19. Ivanic J, Wallqvist A, Reifman J. BMC Systems Biology. 2008;2:11. doi: 10.1186/1752-0509-2-11.
20. Ivanic J, Wallqvist A, Reifman J. PLoS Computational Biology. 2008;4:e1000114. doi: 10.1371/journal.pcbi.1000114.
21. Pérez-Vicente CJ, Coolen ACC. J. Phys. A: Math. Theor. 2009;42:169801.
22. Erdos P, Gallai T. Mat. Lapok. 1960;11:264–274.
23. Rao AR, Jana R, Bandyopadhyay S. Indian J. of Statistics. 1996;58:225.
24. Gkantsidis C, Mihail M, Zegura E. Proc. 5th workshop on algorithm engineering and experiments (ALENEX), SIAM; 2003.
25. Viger F, Latapy M. COCOON 2005, The eleventh international computing and combinatorics conference, LNCS; 2005. pp. 440–449.
26. Chen Y, Diaconis P, Holmes S, Liu JS. J. Amer. Statistical Assoc. 2005;100:109.
27. Catanzaro M, Boguña M, Pastor-Satorras R. Phys. Rev. E. 2005;71:027103. doi: 10.1103/PhysRevE.71.027103.
28. Serrano MA, Boguña M. Phys. Rev. E. 2005;72:036133. doi: 10.1103/PhysRevE.72.036133.
29. Foster JG, Foster DV, Grassberger P, Paczuski M. Phys. Rev. E. 2007;76:046112. doi: 10.1103/PhysRevE.76.046112.
30. Verhelst ND. Psychometrika. 2008;73:705.
31. Newman MEJ, Strogatz SH, Watts DJ. Phys. Rev. E. 2001;64:026118. doi: 10.1103/PhysRevE.64.026118.
32. Arifuzzaman M, et al. Genome Res. 2006;16:686–691. doi: 10.1101/gr.4527806.
33. Newman MEJ. Phys. Rev. E. 2003;67:026126.
34. Keshava Prasad T S, et al. Nucleic Acids Research. 2009;37(Database issue):D767. doi: 10.1093/nar/gkn892.
35. Simonis N, et al. Nature Methods. 2009;6(1):47–54. doi: 10.1038/nmeth.1279.
36. Parrish JR, et al. Genome Biology. 2007;8(7):R130. doi: 10.1186/gb-2007-8-7-r130.
37. Stark C, et al. Nucleic Acids Res. 2006;34(Database issue):D535. doi: 10.1093/nar/gkj109.
38. Rain JC, et al. Nature. 2001;409(6817):211–215. doi: 10.1038/35051615.
39. Rual J-FF, et al. Nature. 2005;437(7062):1173–1178. doi: 10.1038/nature04209.
40. Stelzl U, et al. Cell. 2005;122(6):957–968. doi: 10.1016/j.cell.2005.08.029.
41. Ewing RM, et al. Molecular systems biology. 2007;3:89. doi: 10.1038/msb4100134.
42. Shimoda Y, et al. DNA Res. 2008;15(1):13–23. doi: 10.1093/dnares/dsm028.
43. Lacount DJ, et al. Nature. 2005;438(7064):103–107. doi: 10.1038/nature04104.
44. Uetz P, et al. Nature. 2000;403(6770):623–627. doi: 10.1038/35001009.
45. Ito T, et al. Proc Natl Acad Sci U S A. 2001;98(8):4569–4574. doi: 10.1073/pnas.061034498.
46. Ho Y, et al. Nature. 2002;415(6868):180–183. doi: 10.1038/415180a.
47. von Mering, et al. Nature. 2002;417(6887):399–403. doi: 10.1038/nature750.
48. Gavin AC, et al. Nature. 2002;415(6868):141–147. doi: 10.1038/415141a.
49. Han JDJ, et al. Nature. 2004;430(6995):88–93. doi: 10.1038/nature02555.
50. Gavin ACC, et al. Nature. 2002;415(7084):141–147.
51. Krogan NJ, et al. Nature. 2006;440(7084):637–643. doi: 10.1038/nature04670.
52. Collins SR, et al. Mol Cell Proteomics. 2007;6(3):439–450. doi: 10.1074/mcp.M600381-MCP200.
53. Tarassov K, et al. Science. 2008;320(5882):1465–1470. doi: 10.1126/science.1153878.
54. Yu H, et al. Science. 2008;322(5898):104–110. doi: 10.1126/science.1158684.
55. Sato S, et al. DNA Res. 2007;14(5):207–216. doi: 10.1093/dnares/dsm021.
56. Titz B, et al. PLoS ONE. 2008;3(5):e2292. doi: 10.1371/journal.pone.0002292.

