Author manuscript; available in PMC: 2022 Jun 3.
Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2021 Jun 3;18(3):836–849. doi: 10.1109/TCBB.2020.2980260

Revisiting Parameter Estimation in Biological Networks: Influence of Symmetries

Jithin K Sreedharan 1,#, Krzysztof Turowski 2,#, Wojciech Szpankowski 3
PMCID: PMC8555700  NIHMSID: NIHMS1711334  PMID: 32175871

Abstract

Graph models often give us a deeper understanding of real-world networks. In the case of biological networks, they help in predicting the evolution and history of biomolecule interactions, provided we properly map real networks into the corresponding graph models. In this paper, we show that for biological graph models many of the existing parameter estimation techniques overlook the critical property of graph symmetry (formally, graph automorphisms); as a result, the estimated parameters give statistically insignificant results with respect to the observed network. To demonstrate this and to develop accurate estimation procedures, we focus on the biologically inspired duplication-divergence model and on up-to-date protein-protein interaction data for seven species, including human and yeast. Using exact recurrence relations of some prominent graph statistics, we devise a parameter estimation technique that provides the right order of symmetries and uses phylogenetically old proteins as the choice of seed graph nodes. We also find that our results are consistent with the ones obtained from maximum likelihood estimation (MLE). However, the MLE approach is significantly slower than our methods in practice.

Keywords: biological networks, protein-protein interaction, parameter estimation, duplication-divergence, random graphs

1. Introduction

Many biological processes are regulated at the level of interactions between protein molecules. In recent decades, the development of experimental and bioinformatic methods has allowed researchers to obtain, and make publicly available, a growing amount of data on various biological mechanisms involving different proteins. The whole network of these interactions, found and reconstructed using various techniques, is customarily summarized as a protein-protein interaction (PPI) network. PPI networks are often described as undirected graphs with nodes representing proteins and edges corresponding to protein-protein interactions.

The proliferation of biological data, in turn, encouraged the study of a series of theoretical models to develop a deeper understanding of the evolution and structural properties of the network representation. However, proper fitting of the biological data and the development of statistical tests to check the validity of the models are critical to take advantage of the theoretical developments in graph models. Parameter estimation remains a challenging problem in biological networks like PPIs, mainly because most of the classical estimation procedures were developed for static networks and are not tailored to the specific properties of biological data and its underlying dynamics. In this paper, we propose a parameter estimation scheme for biological data with a new perspective of symmetries and recurrence relations, and point out many fallacies in the previous estimation procedures.

In this paper, we assume that the networks evolve according to the following duplication-divergence stochastic graph model.

Duplication-divergence model (DD-model)

There is wide agreement [1], originally stemming from Ohno’s hypothesis on genome growth [2], that the main mechanism driving the evolution of PPI networks is duplication, in which new proteins appear as copies of some already existing proteins in the network. This is supplemented by a certain amount of divergence, i.e., random mutations that lead to some differences between the interaction patterns of the source protein and the duplicated protein.

There are many variations of duplication-divergence models in the literature, although they have not yet been studied or compared systematically. In this work, we use the model suggested by Pastor-Satorras et al. in [3], recommended in several surveys [4], [5] as a good possible theoretical match for PPI networks.

The model of Pastor-Satorras et al. constructs the graph as described below. Let Nk(u) be the set of neighbors of vertex u at time k. Given an undirected, simple seed graph Gn0 on n0 nodes and a target number of nodes n, the graph Gk+1 with k + 1 nodes evolves from the graph Gk as follows (the subscript k in Gk can also be interpreted as time instant k). First, a new vertex v is added to Gk. Then the following steps are carried out:

  • Duplication: Select a node u from Gk uniformly at random. The node v then makes connections to Nk(u).

  • Divergence: Each of the newly made connections from v to Nk(u) is deleted with probability 1 − p. Furthermore, for all the nodes in Gk except Nk(u), create an edge from each of them to v independently with probability r/k.

The above process is repeated until the number of nodes in the graph reaches n. We denote the graph Gn generated from the DD-model with parameters p and r, starting from seed graph Gn0, by Gn ∼ DD–model (n, p, r, Gn0). Note that the above model reduces to the pure duplication model when p = 1, r = 0 [6], [7]. In some variations of the model (e.g. [8]), the nodes v and u make a connection independently with a probability q that is much larger than r/k. However, the addition of q does not introduce significant changes in the properties of the graph, so we do not consider such a variation in this work.
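The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `dd_model`, the adjacency-dictionary representation, and the labelling of nodes as 0, 1, 2, ... are assumptions made here for the example. It also assumes r ≤ n0, so that r/k is a valid probability at every step.

```python
import random


def dd_model(n, p, r, seed_adj, rng=None):
    """Grow a seed graph to n nodes by duplication-divergence steps.

    seed_adj maps each seed node (labelled 0..n0-1, an assumption of this
    sketch) to the set of its neighbours. Returns a new adjacency dict.
    """
    rng = rng or random.Random()
    adj = {u: set(nb) for u, nb in seed_adj.items()}  # do not mutate the seed
    while len(adj) < n:
        k = len(adj)              # current number of nodes
        v = k                     # label of the new node
        u = rng.randrange(k)      # duplication: parent chosen uniformly
        parent_nb = adj[u]
        new_nb = set()
        for w in range(k):
            # Divergence: copied edges survive with probability p;
            # every other node (including u itself) connects with r/k.
            prob = p if w in parent_nb else r / k
            if rng.random() < prob:
                new_nb.add(w)
        adj[v] = new_nb
        for w in new_nb:
            adj[w].add(v)
    return adj
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when many graphs are generated for the same (p, r).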

Motivation and contributions

In this work, we rigorously study the problem of parameter estimation in the duplication-divergence model using PPI datasets of seven species. The following points motivate this work; we present our key results alongside them.

Symmetries of the graph

One important feature of the networks that was neglected in previous studies is the distribution of the number of symmetries generated by the fitted models with fine-tuned parameters. The symmetries of a graph H are formally called automorphisms: an automorphism is a permutation π of the vertex set of H with the property that, for any vertices u and v, we have π(u) ∼ π(v) if and only if u ∼ v, where ∼ denotes adjacency (i.e., an automorphism is an adjacency-preserving permutation). For real-world PPI networks, it turns out that the number of symmetries is considerably high, which is in stark contrast with the properties of many random graph models. For example, it is known that graphs generated from the Erdős-Rényi model [9] and from the preferential attachment model [10] are asymmetric with high probability. Therefore they cannot be reasonably justified as underlying generation schemes for PPI networks. We shall also see that the same phenomenon may occur for the DD-model for some ranges of the parameters p and r.

Automorphisms are rarely studied in the context of biological networks and graph models. So far there are no theoretical results on automorphisms in the duplication-divergence model except the work in [11] for the limiting case of r = 0, where it was discovered that for both p = 0 and p = 1 the model produces graphs with a significant amount of symmetry.

Our main focus in this work is to take into account the number of automorphisms of the observed network to restrict the parameter search to a more meaningful range. Moreover, we show that most parameters output by previous estimation techniques fail to produce graphs having an order of automorphisms close to that of the PPI networks and therefore, in this regard, they do not fit the DD-model well. We also note that cross-checking with the number of automorphisms of the real-world network forms a null hypothesis test for the model under consideration.
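To make the definition concrete, the following Python sketch counts automorphisms of a very small graph by brute force over all vertex permutations. It is only an illustration of the definition: the function name `count_automorphisms` is hypothetical, the cost grows as n!, and graphs of realistic size need a dedicated tool.

```python
from itertools import permutations


def count_automorphisms(n, edges):
    """Count adjacency-preserving permutations of vertices 0..n-1.

    edges is an iterable of undirected edges (a, b). A permutation is a
    bijection, so mapping every edge onto an edge is enough to guarantee
    that non-edges are mapped onto non-edges as well.
    """
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for perm in permutations(range(n)):
        if all(frozenset((perm[a], perm[b])) in edge_set
               for a, b in (tuple(e) for e in edge_set)):
            count += 1
    return count
```

For instance, a path on three vertices has two automorphisms (the identity and the reflection), while a triangle has all 3! = 6.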

Moreover, there are close relations between automorphism group of a graph and eigenvalues of the associated graph matrices (commonly called spectrum of a graph) [12]. The graph spectrum concisely abstracts many key characteristics of the graph like the number of triangles and other subgraphs, number of walks of a specified length, number of spanning trees etc. If the existing parameter estimation methods fit the given graph into a model that is not in agreement in terms of the number of automorphisms, the spectrum and many characteristics of the fitted graph do not get matched to the given graph.

Graph parameter recurrences

It is widely recognized that the asymptotics of structural properties such as the degree distribution and number of edges of the DD-model are crucial parameters, upon which judgment about the fitness of the model could be made. From the theoretical point of view, the analysis for the DD-model was presented in [13], [14], supporting the case that graphs derived from this model exhibit (under certain assumptions) power-law-like behavior. Moreover, the frequency of appearance of certain graph structures called graphlets (small subgraphs such as triangles, open triangles, etc.) can be viewed as another criterion for model fitting (see [4], [15] and the references therein). The triangles and wedges (paths of length 2 or star with two nodes) are particularly crucial as they are directly related to the network clustering coefficient. The high value of this coefficient is recognized in general as a significant characteristic of some biological networks including PPI networks [16], differentiating these networks from those which can be obtained, for example, from Erdős-Rényi or preferential attachment models.

Our approach is based on recreating graph evolution from a single snapshot of the observed network. We apply, for the first time, rigorous analyses to estimate parameters with the recurrence relations of degree and the number of wedges and triangles, recreating the dynamic process of DD-model construction. The advantages of this approach are twofold: first, the use of accurate iterative formulas allows us to achieve more realism and precision for finite graphs, which is in contrast to most of the previous studies that derive parameter estimates exclusively in terms of steady-state behavior. Second, it is not proven whether the steady state for such models even exists and whether the whole random process converges asymptotically (see Section 4 for more details).

Maximum likelihood method

To substantiate the accuracy of our estimation technique, we apply the maximum likelihood method (MLE) with importance sampling to the parameter inference problem, adapting the work of Wiuf et al. [17] to the DD-model (see Section 5.2 for the implementation and details). It turns out that the results of the MLE method are very similar to those derived from our estimation method, and the estimated parameters in both techniques generate data that is consistent with the observed network in terms of the number of automorphisms.

However, the MLE algorithm has much larger computational complexity, $\Theta\left(n^3 \frac{1}{\varepsilon^2}\right)$, compared to the $\Theta\left(n \frac{1}{\varepsilon}\log\frac{1}{\varepsilon}\right)$ needed by our approach based on the recurrences (n and ε being the number of nodes and the required resolution). Therefore, the analysis of graphs using the MLE method is significantly slower in practice, and its application is impractical for networks exceeding 1000 nodes, which includes most of the real-world PPI networks. Furthermore, in the case of MLE, a large set of parameter values maximizes the likelihood function when the true p value is close to one, thus making it less reliable (see Section 6.1).

Seed graph choice

It is well known that seed graphs play an important role in biological networks, and their improper selection will affect the estimated parameters of the fitted model [5], [18]. In particular, the task of determining the suitable range of parameters for the duplication-divergence model is always done under assumptions concerning the seed graph. In this work, we improve on the existing solutions by choosing the seed graph on the basis of the phylogenetic ages of the proteins in the PPI data: the oldest proteins form the seed graph. Although such a choice of seed graph is completely absent in prior literature, it is a natural pick, as the seed graph itself is defined as the network comprising the oldest entities in the given network.

Outline of the paper

In Section 2, we describe the PPI datasets that are considered in this paper. The influence of parameters of the DD-model on graph symmetries and the p-value calculation for comparing with the observed graph are given in Section 3. In Section 4, we provide a critique of various deficiencies of the previous approaches to parameter estimation for PPI networks, such as the lack of symmetries and the overemphasis on power-law behavior. Section 5 describes our approach based on both automorphism counting and exact iterative formulas for certain graph statistics. In this section, we also present an MLE algorithm for parameter inference and compare it with our approach in terms of complexity and practical usage. Section 6 contains numerical results for both synthetic data generated from the DD-model and real-world PPI networks. Section 7 reports the conclusions of the paper with a discussion of the obtained results and their significance. At the end, in Sections 8 and 9, we provide, as a supplement, an implementation of the MLE algorithm and the proof of our main theorem.

Table 1 provides the list of main notations used in the paper.

TABLE 1:

List of main notations

Notation      Meaning
Gobs          Observed real-world network
Gn0           Seed graph (initial graph) with n0 nodes
Gn            Realization of the DD-model with fixed parameters and n nodes
p, r          Parameters of the DD-model
γ             Power law exponent
Aut(G)        Automorphism group of graph G
|Aut(G)|      Number of automorphisms of graph G
Ns(t)         Set of neighbors of node t at time s
degs(t)       Degree of node t at time s
D(Gn)         $\frac{1}{n}\sum_{i=1}^{n}\deg_n(i)$
D2(Gn)        $\frac{1}{n}\sum_{i=1}^{n}\deg_n^2(i)$
S2(Gn)        Number of wedges (stars with two nodes) in Gn
C3(Gn)        Number of triangles in Gn

2. Datasets

We use protein-protein interaction (PPI) networks to verify the estimation techniques proposed in this paper. The data is collected from BioGRID, a popular curated biological database of protein-protein interactions. The networks formed from protein-protein interaction data are further cleaned by removing self-interactions (self-loops), multiple interactions (multiple edges), and interspecies (inter-organism) interactions of proteins. Thus the considered PPI networks only have physical and intra-species interactions. Unlike some of the previous studies that consider only the largest connected component, the DD-model we focus on in this work incorporates disconnected subgraphs and isolated nodes.

Table 2 shows the different PPI datasets considered in this paper. We have also listed the logarithm of the number of automorphisms in the original graph, obtained using the publicly available program nauty [19]. We note here that the PPI dataset is growing, as new interactions are added with every new release of the dataset. Many previous studies used older and less complete versions of the data, and therefore it is important to repeat the estimation procedures from those studies and compare them to our methods.

TABLE 2:

Statistics of PPI networks used in this paper and the generated seed graph Gn0 with nodes of the largest phylogenetic ages.

                                                  Original graph Gobs                 Seed graph Gn0
Organism         Scientific name                  # Nodes  # Edges  log |Aut(G)|      # Nodes  # Edges
Baker’s yeast    Saccharomyces cerevisiae         6,152    531,400  267               548      5,194
Human            Homo sapiens                     17,295   296,637  3026              546      2,822
Fruitfly         Drosophila melanogaster          9,205    60,355   1026              416      1,210
Fission yeast    Schizosaccharomyces pombe        4,177    58,084   675               412      226
Mouse-ear cress  Arabidopsis thaliana Columbia    9,388    34,885   6696              613      41
Mouse            Mus musculus                     6,849    18,380   7827              305      7
Worm             Caenorhabditis elegans           3,869    7,815    3348              185      15

2.1. Selection of seed graph Gn0

Previous studies typically assume the seed graph Gn0 to be the maximal clique (or the largest two cliques) in the graph Gn [4], [5]. Here we consider a novel formulation for the seed graph. We select the seed graph as the graph induced in the PPI network by the oldest proteins, that is, the proteins in the observed PPI data that are known to have the largest phylogenetic age (taxon age). It is reasonable to expect that a protein which appears in several different species also appeared in their common ancestor. Hence proteins shared across many different, distant species are supposed to be older than others.

More precisely, the age of a protein is based on a family’s appearance on a species tree, and it is estimated via protein family databases and ancestral history reconstruction algorithms. We use Princeton Protein Orthology Database (PPOD) [20] along with OrthoMCL [21] and PANTHER [22] for the protein family database and asymmetric Wagner parsimony as the ancestral history reconstruction algorithm. These algorithms can be accessed via ProteinHistorian software [23].

Table 2 also lists the statistics of seed graphs Gn0 for different PPI networks. Even if the original PPI network is connected, the DD-model under consideration allows a disconnected graph to be the seed graph. Thus, similar to the formation of the PPI network, we consider the graph induced by oldest proteins including isolated nodes and disconnected subgraphs, not restricting ourselves to a connected component that introduces biases in the results.

3. Influence of parameters on symmetries of the model

For certain ranges of values of the parameters p and r of the DD-model, given n and Gn0, we show in this section that the model generates virtually only asymmetric graphs. However, we can put forward a question: are there any values of the parameters that yield graphs with a number of automorphisms (i.e., the number of adjacency-preserving permutations of the vertex set) close to that of the real-world PPIs?

In Figure 1, we present the average number of symmetries on a logarithmic scale for graphs of different sizes generated from the DD-model with a fixed set of parameters. As p, r → 0, or when p becomes very close to 1, we observe significantly larger values for the average number of automorphisms (since the generated graphs tend to have numerous isolated nodes or become closer to a complete graph). For instance, in Figure 1a, p = 1, r = 0.4 has E[log |Aut(Gn)|] = 1114, and p = 0, r = 0.4 has E[log |Aut(Gn)|] = 1253. But for large ranges of p and r, it is practically impossible to generate a graph exhibiting any noticeable symmetries. For example, p = 0.2, r = 2.4 has E[log |Aut(Gn)|] = 3.2; p = 0.6, r = 0 has E[log |Aut(Gn)|] = 1.3; and p = 0.4, r = 2.4 has E[log |Aut(Gn)|] = 0.12. These observations are consistent for different n and Gn0 too, though the specific ranges of parameter values will obviously change.

Fig. 1:

Effect of the DD-model parameters on the number of automorphisms: Logarithm of the expected number of automorphisms of graphs generated from the DD-model for various values of p and r. The seed graph Gn0 = K20.

The number of automorphisms in the DD-model behaves differently than in many other graph models. Preferential-attachment graphs are asymmetric (no nontrivial symmetries) with high probability when the number of edges a new node brings into the graph exceeds 2 [10], and almost every graph from the Erdős-Rényi model is asymmetric [9]. On the other hand, the DD-model exhibits a large number of symmetries, which grows with the number of nodes, as shown in Figure 1.

These findings allow us to argue that only certain subsets of (p, r) pairs correspond to the expected number of automorphisms in the order of the required value. This means that it can be reasonably used as a falsification tool to discard certain parameter ranges and to verify parameter estimation methods. We provide a simple statistical test for checking the possibility of generating the required number of symmetries with the estimated parameters.

Statistical test for significance of the number of symmetries with the estimated parameters

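Given the real-world network Gobs, seed graph Gn0, and the estimated parameters $(\hat{p}, \hat{r})$ of the DD-model, we can estimate the statistical significance of the estimates with respect to the number of symmetries in Gobs as follows. Let $G_n^{(1)}, \ldots, G_n^{(m)}$ be m graphs generated from DD–model$(n, \hat{p}, \hat{r}, G_{n_0})$. Then the p-value is calculated as follows:

$$p_u = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\left\{\log|\mathrm{Aut}(G_n^{(i)})| \ge \log|\mathrm{Aut}(G_{\mathrm{obs}})|\right\}$$
$$p_l = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\left\{\log|\mathrm{Aut}(G_n^{(i)})| \le \log|\mathrm{Aut}(G_{\mathrm{obs}})|\right\},$$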

with 1{A} as the indicator function of the event A. Then p-value = 2 min{pu, pl}. As an example, for a fixed parameter set, the empirical distribution of log|Aut(G)| is shown in Figure 2. The distribution is symmetrical and this justifies use of the symmetrical definition of p-value. A lower p-value indicates that the estimated parameters do not fit the observed network, and a higher value gives an argument for the estimated parameters being in agreement with the number of symmetries in Gobs.
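The test can be sketched in Python as follows. This is a minimal illustration that assumes the values log|Aut(Gn(i))| of the m simulated graphs have already been computed by an external tool; the function name `symmetry_p_value` and its arguments are hypothetical.

```python
def symmetry_p_value(sim_log_aut, obs_log_aut):
    """Two-sided empirical p-value 2 * min(p_u, p_l).

    sim_log_aut: list of log|Aut| values of graphs simulated from the
                 fitted model (assumed precomputed).
    obs_log_aut: log|Aut| of the observed network.
    """
    m = len(sim_log_aut)
    # Fraction of simulated graphs at least as symmetric as the observation.
    p_u = sum(x >= obs_log_aut for x in sim_log_aut) / m
    # Fraction of simulated graphs at most as symmetric as the observation.
    p_l = sum(x <= obs_log_aut for x in sim_log_aut) / m
    return 2 * min(p_u, p_l)
```

Because both indicators include equality, ties between simulated and observed values can push the result above 1; capping it at 1 is a reasonable convention if a proper probability is required.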

Fig. 2:

Normalized histogram of logarithm of number of automorphisms when Gn ∼ DD–model (500, 0.3, 0.4, K20).

4. Parameter Estimation and Why Existing Methods Fail in Practice?

Previous methods for the parameter estimation problem in the DD-model were first sketched in [3] and then considered more rigorously in [13]. Later, [4], [5] provided some extensions to the estimation procedures using the mean-field approximation of the average degree D(Gobs) together with the steady-state expression of the power-law exponent γ of the degree distribution. Then, the values of p and r are computed, respectively, from the formulas:

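$$\gamma = 1 + \frac{1}{p} - p^{\gamma-2} \qquad (1)$$
$$r = \left(\frac{1}{2} - p\right) D(G_{\mathrm{obs}}), \quad \text{for } p < \frac{1}{2}. \qquad (2)$$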
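Equation (1) has no closed-form solution for p, so it has to be solved numerically. The Python sketch below uses bisection, relying on the fact that the right-hand side of (1) is decreasing in p on (0, 1] for γ > 1. The function names and the choice of bisection are assumptions of this example; the cited works may solve the equation differently.

```python
def mean_field_p(gamma, tol=1e-12):
    """Solve gamma = 1 + 1/p - p**(gamma - 2) for p in (0, 1) by bisection."""
    if gamma <= 1:
        raise ValueError("this sketch assumes gamma > 1")
    lo, hi = 1e-9, 1.0  # right-hand side is huge at lo and equals 1 at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 + 1 / mid - mid ** (gamma - 2) > gamma:
            lo = mid   # right-hand side still too large: increase p
        else:
            hi = mid
    return (lo + hi) / 2


def mean_field_r(p, mean_degree):
    """Equation (2): r = (1/2 - p) * D(G_obs), valid only for p < 1/2."""
    if p >= 0.5:
        raise ValueError("equation (2) requires p < 1/2")
    return (0.5 - p) * mean_degree
```

For example, `mean_field_p(2.85)` returns a value close to 0.43.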

Table 3 presents the estimates of parameters p and r using the above method. Additionally, we present the average logarithm of the number of automorphisms computed from 10,000 graphs generated from the DD-model with the estimated parameters.

TABLE 3:

Estimated parameters of the DD-model using the mean-field approach, the average number of symmetries with the estimated parameters, and the p-value of its statistical significance with respect to the actual number of symmetries for the PPI data in Table 2.

Organism $\hat{p}$ $\hat{r}$ E[log|Aut(Gn)|] p-value
Baker’s yeast 0.28 38.25 0 0
Human 0.43 2.39 10.81 0
Fruitfly 0.44 0.75 3771.99 0
Fission yeast 0.46 1.02 897.48 0
Mouse-ear cress 0.44 0.43 18596.72 0
Mouse 0.48 0.12 34961.69 0
Worm 0.47 0.14 15700.26 0

Mismatch in the number of symmetries and graph statistics

Comparing Tables 2 and 3, we observe that the number of symmetries of the graphs generated by the DD-model with parameters estimated via the mean-field approach differs significantly from that of the real-world PPI networks. Moreover, the estimated p-values are consistently zero for all the species because the observed values of the parameters under investigation fall far outside the range of the empirical distribution of the parameters for synthetic graphs generated with the estimated p and r. This shows that the previously established estimation methods for the DD-model fail to capture the critical graph property of automorphisms, and thus do not fit the PPI networks accurately.

As shown in Table 2, the PPI networks exhibit a significant amount of symmetry, but far less than the maximum possible value (log n! ≈ n log n), which is attained when every node can be interchanged with every other node. This observation, along with the p-value test in Section 3, not only allows us to discard many models that produce asymmetric graphs with high probability (such as the Erdős-Rényi or preferential attachment models), but also serves as a hypothesis test for checking whether the fit obtained by an estimation procedure can be safely assumed to match the model underlying real-world structures.

Similarly, for the graph statistics D(Gn), S2(Gn) and C3(Gn) (see Table 1 for notation), which are considered later in Section 5.1 for deriving our methods, we observe from Table 4 that the estimated parameters do not yield graphs whose statistics are close to those of the observed graph. Here the p-values are calculated in the same way as for the number of symmetries, except that they are now computed with respect to the graph statistics.

TABLE 4:

Comparison of some graph statistics of the observed graph and that of the synthetic data with parameters estimated via the mean-field approach.

Organism D(Gobs) E[D(Gn)] p-value S2 (Gobs) E[S2(Gn)] p-value C3(Gobs) E[C3(Gn)] p-value
Baker’s yeast 172.76 115.10 0 220.35M 45.33M 0 9.77M 370.49K 0
Human 34.30 19.39 0 52.25M 7.02M 0 1.07M 105K 0
Fruitfly 13.11 7.87 0 2.94M 1.45M 0 195.96K 77.61K 0
Fission yeast 27.64 6.72 0 7.42M 215.84K 0 223.61K 1.14K 0
Mouse-ear cress 7.39 2.23 0 2.98M 44.46K 0 23.34K 23.27 0
Mouse 5.35 0.82 0 2.95M 9.33K 0 10.22K 0.79 0
Worm 4.04 0.90 0 346.13K 5.32K 0 2.41K 0.49 0

Next, we point out several other deficiencies in the known estimation procedures, which could be the reasons behind such a divergence between the number of symmetries of the PPI networks and that of their proposed theoretical model.

Power-law behavior

The parameter estimation of the DD-model introduced in prior works, such as the one presented at the beginning of this section, assumes that the PPI networks are scale-free. This property states that the degree distribution of the PPI networks is heavy-tailed or, more precisely, that the number of vertices of degree k is proportional to $k^{-\gamma}$ for some constant γ > 0 [24], [25]. With this assumption, some (see [8] for example) argue that the estimated value of the exponent for the PPI networks satisfies 2 < γ < 3. However, there are counterarguments to this claim, and it is challenged on statistical grounds that the PPI graphs do not fall into the power-law degree distribution category [26], [27].

To each of the PPI networks in Table 2, we fit the coefficients of a power-law distribution with a cutoff, following the methodology of Clauset et al. [28]. We note here that a cutoff is required in all the cases, since the power-law behavior mostly happens in the tails of the degree distribution. However, we find that the cutoff neglects a huge percentage of the data. For example, for a fitting of the baker’s yeast PPI network, as shown in Figure 3, the cutoff is 582, which is at the 94.98th percentile of the degree data, i.e., the power-law fitting does not take into account 94.98% of the data. With the cutoffs and the percentiles of all the species listed in Table 5, we remark that any method to estimate the parameters p and r involving the power-law exponent does not give reliable approximations, since it discards the vast majority of the data.

Fig. 3:

Power-law fitting of the Baker’s yeast data: Complementary cumulative distribution function (CCDF) of the degrees and the fitted power-law lines. The solid green plot represents the CCDF curve. The dashed red line is the fitted power-law assuming a cutoff (i.e., neglecting initial part of the CCDF curve), while the dotted green line is the fitted power-law without assuming a cutoff.

TABLE 5:

Estimated power law exponent and required cutoff percentile with the mean-field approach

Organism $\hat{\gamma}$ Cutoff percentile
Baker’s yeast 4.55 94.98
Human 2.85 92.33
Fruitfly 2.71 88.00
Fission yeast 2.43 88.31
Mouse-ear cress 2.68 93.89
Mouse 2.29 78.58
Worm 2.41 88.23

Steady state assumption

Previous research on the DD-model, both on the level of theoretical analysis of the model properties and on the level of parameter estimation of real-world PPI networks, focuses heavily on the asymptotic and steady-state behavior [3]. Most of the previous results on the functional form of certain graph statistics in the DD-model are obtained under the strong assumption of steady state [8], [29], but they do not provide any theoretical proof of convergence to a steady state. Moreover, these steady-state asymptotic results, even when achievable, do not give any bounds on the rate of convergence. This, in turn, raises questions about the straightforward applicability of such theoretical results to parameter estimation.

The previously used methods of parameter estimation also suffer from another issue: for simplicity, they assume that the average degree of the network does not change during the whole evolution from Gn0 to Gn. This is not only highly implausible in practice, but also imposes a direct relation between p and r and hides any dependency that might be discovered from various properties of the networks.

Seed graph choice

As shown by previous studies (most notably in Hormozdiari et al. [5]), the choice of the seed graph plays a significant role in graph evolution, directly contributing to the order of growth of many important graph statistics.

The seed graph is typically assumed to be the largest clique (or a connected graph of the largest two cliques) of the observed graph. Then random vertices and edges are gradually added to the network, preserving the average degree of the final network, to bring the size of the network to a fixed value n0. This method is motivated by the infinitesimally small probability with which a clique of a greater size could appear during graph evolution. Such a procedure has no formal theoretical guarantees and does not have any clear justification from a biological perspective [5].

Our natural approach to select the seed graph is based on the extra-network information about the estimated age of proteins, described in Section 2.1.

5. Main Results

Our main constructive results concern the relation between the parameters of the model and the number of symmetries exhibited by graphs generated from it. Additionally, we present two parameter estimation algorithms, one based on recurrences characteristic for certain graph statistics, the other based on the well-known maximum likelihood approach.

5.1. Our method: parameter estimation using recurrence relations

Our basic tool to infer the parameters of the DD-model for a given PPI network is a set of exact recurrence relations for basic graph statistics, which relate their values at times k and k + 1 of graph evolution. Such recurrence relations are sufficient to estimate model parameters, as the whole sequence of graphs from the initial graph Gn0 to the final graph Gn can be split into steps consisting of the addition of a single vertex and the changes introduced by the added vertex.

Typically, five statistics of the random graph Gn are studied in the literature: the number of edges E(Gn), the mean degree of the network $D(G_n) = \frac{1}{n}\sum_{i=1}^{n}\deg_n(i)$, the mean squared degree $D_2(G_n) = \frac{1}{n}\sum_{i=1}^{n}\deg_n^2(i)$, the number of triangles (3-cliques) C3(Gn), and the number of wedges S2(Gn) (wedges are also called 2-stars or paths of length 2 in prior literature, and the number of wedges includes counts of triangles and open triangles).

However, for every graph H on n vertices it is true that $E(H) = \frac{n}{2} D(H)$ and $S_2(H) = \frac{n}{2}\left(D_2(H) - D(H)\right)$. Therefore, it is sufficient to analyze only three of the above-mentioned graph statistics: D(Gn), S2(Gn) and C3(Gn).
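The five statistics and the two identities can be checked directly on any small graph. The Python sketch below assumes an adjacency-dictionary representation (node mapped to its set of neighbours); the function name `graph_statistics` and the dictionary keys are hypothetical.

```python
from itertools import combinations


def graph_statistics(adj):
    """Return E, D, D2, S2 and C3 for an undirected simple graph."""
    n = len(adj)
    degs = [len(nb) for nb in adj.values()]
    edges = sum(degs) // 2                       # each edge is counted twice
    mean_deg = sum(degs) / n                     # D
    mean_sq_deg = sum(d * d for d in degs) / n   # D2
    wedges = sum(d * (d - 1) // 2 for d in degs)  # pairs of neighbours
    # Adjacent neighbour pairs; every triangle is seen from its 3 corners.
    hits = sum(1 for u in adj
               for v, w in combinations(adj[u], 2) if w in adj[v])
    return {"E": edges, "D": mean_deg, "D2": mean_sq_deg,
            "S2": wedges, "C3": hits // 3}
```

On the complete graph on four nodes this gives E = 6, D = 3, D2 = 9, S2 = 12 and C3 = 4, so that (n/2)·D = 6 = E and (n/2)·(D2 − D) = 12 = S2, as the identities require.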

As a first step, we derive the following recurrence relations for the chosen statistics.

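Theorem 1. If $G_{n+1} \sim \text{DD–model}(n + 1, p, r, G_n)$, then

$$\mathbb{E}[D(G_{n+1}) \mid G_n] = D(G_n)\left(1 + \frac{2p-1}{n+1} - \frac{2r}{n(n+1)}\right) + \frac{2r}{n+1}$$

$$\mathbb{E}[D_2(G_{n+1}) \mid G_n] = D_2(G_n)\left(1 + \frac{2p+p^2-1}{n+1} - \frac{2r(1+p)}{n(n+1)} + \frac{r^2}{n^2(n+1)}\right) + D(G_n)\left(\frac{2p-p^2+2pr+2r}{n+1} - \frac{2r+2r^2}{n(n+1)} + \frac{r^2}{n^2(n+1)}\right) + \frac{r^2+2r}{n+1} - \frac{r^2}{n(n+1)}$$

$$\mathbb{E}[C_3(G_{n+1}) \mid G_n] = C_3(G_n)\left(1 + \frac{3p^2}{n} - \frac{6pr}{n^2} + \frac{3r^2}{n^3}\right) + D_2(G_n)\left(\frac{pr}{n} - \frac{r^2}{n^2}\right) + D(G_n)\,\frac{r^2}{2n}$$

$$\mathbb{E}[S_2(G_{n+1}) \mid G_n] = S_2(G_n)\left(1 + \frac{2p+p^2}{n} - \frac{2(p+1)r}{n^2} + \frac{r^2}{n^3}\right) + D(G_n)\left(pr + p + r - \frac{pr+r+r^2}{n} + \frac{r^2}{n^2}\right) + \frac{r^2}{2} - \frac{r^2}{2n}.$$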

Proof. See Section 9. □
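As an informal sanity check, and not a substitute for the proof in Section 9, the recurrence for the mean degree can be recovered in a few lines directly from the model definition. Here v is the new node and u its uniformly chosen parent; each of the deg_n(u) copied edges survives with probability p, and each of the remaining n − deg_n(u) nodes connects to v with probability r/n.

```latex
\begin{align*}
\mathbb{E}[\deg_{n+1}(v) \mid G_n, u]
  &= p\,\deg_n(u) + \frac{r}{n}\bigl(n - \deg_n(u)\bigr), \\
\mathbb{E}[\deg_{n+1}(v) \mid G_n]
  &= \frac{1}{n}\sum_{u}\Bigl(p\,\deg_n(u) + r - \frac{r}{n}\deg_n(u)\Bigr)
   = p\,D(G_n) + r - \frac{r}{n}\,D(G_n), \\
(n+1)\,\mathbb{E}[D(G_{n+1}) \mid G_n]
  &= n\,D(G_n) + 2\,\mathbb{E}[\deg_{n+1}(v) \mid G_n], \\
\mathbb{E}[D(G_{n+1}) \mid G_n]
  &= D(G_n)\Bigl(1 + \frac{2p-1}{n+1} - \frac{2r}{n(n+1)}\Bigr) + \frac{2r}{n+1}.
\end{align*}
```

The third line uses the fact that every new edge raises the total degree by two: once for v and once for the old endpoint.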

In Figure 4, we verify Theorem 1 by comparing E[D(Gn)], for various n, computed using theory and experiments.

Fig. 4:

Comparison of E[D(Gn)] computed via Theorem 1 and via experiments.

The expressions given in Theorem 1 implicitly define a function $\mathbb{E}[D(G_n) \mid G_{n_0}] = F_D(n, p, r, G_{n_0})$, which is a cornerstone of our algorithm. Similar functions exist for recurrences based on other statistics of Gn0 and Gn. Now we claim that the result of Theorem 1 in terms of expectation can be used for the graph statistics with high probability too. Figure 5 shows the concentration of the empirical distribution of different graph statistics.

Fig. 5:

Empirical distribution of graph statistics: Gn ∼ DD–model(100, 0.5, 1.5, K10). Coefficient of variation (CV) is defined as the ratio of empirical standard deviation and empirical mean. The lower values of CV in the sub-figures show the concentration of the considered graph statistics.

Although we don’t need an explicit formula for FD in our algorithm, we may derive one from the recurrences:

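$$
E[D(G_n) \mid G_{n_0}] = D(G_{n_0}) \prod_{k=n_0}^{n-1}\left(1 + \frac{2p-1}{k+1} - \frac{2r}{k(k+1)}\right) + \sum_{k=n_0}^{n-1} \frac{2r}{k+1} \prod_{l=k+1}^{n-1}\left(1 + \frac{2p-1}{l+1} - \frac{2r}{l(l+1)}\right).
$$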

Although this is outside of the scope of this article, we note that such an expression allows us to find, for example, the asymptotic order of growth for D(Gn) and for other statistics.
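To make the role of $F_D$ concrete, the sketch below evaluates it in two ways: by iterating the recurrence for $D$ and by the explicit product-sum formula. The names `F_D` and `F_D_closed` are hypothetical, and, as an assumption made for brevity, the seed graph is passed through its size `n0` and its mean degree `D0` rather than as a graph object.

```python
from math import prod

def a(k, p, r):
    """Multiplicative factor of the D recurrence at step k -> k + 1."""
    return 1 + (2*p - 1)/(k + 1) - 2*r/(k*(k + 1))

def F_D(n, p, r, n0, D0):
    """Iterate the recurrence from size n0 (mean degree D0) up to size n."""
    D = D0
    for k in range(n0, n):
        D = D * a(k, p, r) + 2*r/(k + 1)
    return D

def F_D_closed(n, p, r, n0, D0):
    """Same quantity via the explicit product-sum expression."""
    head = D0 * prod(a(k, p, r) for k in range(n0, n))
    tail = sum(2*r/(k + 1) * prod(a(l, p, r) for l in range(k + 1, n))
               for k in range(n0, n))
    return head + tail

# Seed K_10 (D0 = 9), grown to n = 100 with p = 0.5 and r = 1.5.
it = F_D(100, 0.5, 1.5, 10, 9.0)
cf = F_D_closed(100, 0.5, 1.5, 10, 9.0)
print(abs(it - cf) < 1e-9)
```

The iterative version takes n - n0 constant-time steps, while the explicit form as written costs quadratic time, which is one reason the algorithm only needs the recurrence.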

Although a closed-form solution of the recurrences relating $G_n$ and $G_{n_0}$ could be difficult to obtain, Theorem 1 is sufficient to formulate an efficient algorithm for finding the parameters of the model. The crucial feature is that the statistics are monotonic in the parameters, that is, the larger the parameters p and r, the larger the values of $D(G_n)$ and of the other statistics.

Algorithm 1 presents our estimation technique for finding $\hat{p}$ with the recurrence relation for $D(G_n)$ (which will be $D(G_{\text{obs}})$ when we consider a real-world network), assuming $\hat{r}$ is known beforehand. In practice, a sufficient number of samples of $\hat{r}$ from the interval $[0, n_0]$ is adequate to obtain a feasible solution set $\{(\hat{p}, \hat{r})\}$ with a desired resolution. The algorithm also works for the recurrence relations of $S_2(G_n)$ and $C_3(G_n)$ with evident modifications.

Algorithm 1.

Estimation of p via recurrence relation of D(Gn).

function Recurrence-Relation(n, r, G_n0, D(G_n), ε)
  D_min ← F_D(n, 0, r, G_n0), D_max ← F_D(n, 1, r, G_n0)
  if D_min > D(G_n) or D_max < D(G_n) then
    return “no suitable solution for p”
  p_min ← 0, p_max ← 1
  while p_max − p_min > ε do
    p′ ← (p_min + p_max)/2, D′ ← F_D(n, p′, r, G_n0)
    if D′ < D(G_n) then p_min ← p′ else p_max ← p′
  return p_min
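Algorithm 1 can be sketched in Python as follows. This is an illustrative version only: the names `F_D` and `recurrence_relation` are hypothetical, the seed graph is represented by its size `n0` and mean degree `D0` (an assumption made to keep the example self-contained), and `None` stands in for the "no suitable solution" message.

```python
def F_D(n, p, r, n0, D0):
    """Expected mean degree at size n, iterating the D recurrence of Theorem 1."""
    D = D0
    for k in range(n0, n):
        D = D * (1 + (2*p - 1)/(k + 1) - 2*r/(k*(k + 1))) + 2*r/(k + 1)
    return D

def recurrence_relation(n, r, n0, D0, D_obs, eps):
    """Bisection on p for a fixed r; relies on F_D being increasing in p."""
    d_min = F_D(n, 0.0, r, n0, D0)
    d_max = F_D(n, 1.0, r, n0, D0)
    if d_min > D_obs or d_max < D_obs:
        return None  # no p in [0, 1] reproduces the observed mean degree
    p_min, p_max = 0.0, 1.0
    while p_max - p_min > eps:
        p_mid = (p_min + p_max) / 2
        if F_D(n, p_mid, r, n0, D0) < D_obs:
            p_min = p_mid
        else:
            p_max = p_mid
    return p_min

# Round trip: generate a target with a known p, then recover it.
target = F_D(100, 0.37, 0.5, 20, 19.0)
p_hat = recurrence_relation(100, 0.5, 20, 19.0, target, 1e-6)
print(abs(p_hat - 0.37) < 1e-5)
```

Calling `recurrence_relation` for a grid of r values would trace out the feasible curve of (p, r) pairs for the statistic D; the analogous curves for S2 and C3 would need their own recurrences.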

We note here that for each graph property under consideration, D, S2 or C3, the estimation algorithm returns a curve (more precisely, a set of feasible points). If the solutions to the recurrence relations of the various graph statistics concur, a necessary condition for the presence of the duplication-divergence model is satisfied. On the other hand, if the curves do not have a common crossing point, this suggests that the DD-model may not be an appropriate fit for the observed network. We refer to the above estimation procedure, which uses the recurrence relations of all three graph statistics, as the Recurrence-Relation method.

5.2. Parameter estimation via maximum likelihood method

An alternative way of estimating the parameters of the DD-model is maximum likelihood estimation (MLE). With MLE, the estimated parameters $\hat{p}$ and $\hat{r}$ are given by $\arg\max_{\theta=(p,r)} L(\theta, G_n)$, where the likelihood function $L(\theta, G_n)$ is the probability of generating $G_n$ from $G_{n_0}$ for fixed parameters θ, i.e.

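$$
L(\theta, G_n) := \Pr(G_n \mid G_{n_0}; \theta) = \sum_{(G_{n_0+1}, \ldots, G_{n-1}, G_n) \in \mathcal{G}(G_{n_0}, G_n)} \; \prod_{k=n_0+1}^{n} \Pr(G_k \mid G_{k-1}; \theta),
$$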

where $\mathcal{G}(G_{n_0}, G_n)$ is the set of all sequences of graphs that start with $G_{n_0}$ and end at $G_n$. Given a fixed graph evolution history $(G_{n_0}, \ldots, G_{n-1}, G_n)$, it is straightforward to calculate the likelihood, but $L(\theta, G_n)$ requires a summation over all histories, of which there are exponentially many. In [17], the authors present an importance sampling strategy to approximate the likelihood and thereby estimate the parameters. It is based on the idea of traversing backwards in history (from $G_n$ to $G_{n-1}$, from $G_{n-1}$ to $G_{n-2}$, and so on) along one sample path of the graph evolution sequence via a Markov chain. We adapt their algorithm to our DD-model; the complete algorithm is presented in Section 8.

We now provide a brief description of the importance sampling procedure. The idea is to express likelihood in terms of a known reference parameter θ0 instead of unknown θ. Now, the likelihood can be rewritten as an expectation with respect to θ0 and can be estimated via Monte Carlo simulations (see [17] for more details):

$$
L(\theta, G_n) = E_{\theta_0}\left[\prod_{k=n_0+1}^{n} S(\theta_0, \theta, G_k, v)\right],
$$

where

$$
S(\theta_0, \theta, G_k, v) = \frac{1}{k} \cdot \frac{\omega(G_k, \theta, v)\,\omega(G_k, \theta_0)}{\omega(G_k, \theta_0, v)}.
$$

Here $\omega(G_k, \theta, v)$ is the probability of creating the graph $G_k$ from $G_{k-1}$ through the addition of a node v, with parameter θ. The node v can be any node in $G_k$ whose removal results in a graph $G_{k-1}$ that has positive probability under the DD-model. The quantity $\omega(G_k, \theta)$ is the transition probability $\omega(G_k, \theta, v)$ summed over all possible v. In fact, $\omega(G_k, \theta, v)$ itself is the normalized sum, over all possible nodes w, of $\omega(G_k, \theta, v, w)$, which is the probability of producing the graph $G_k$ from $G_{k-1}$ by adding a node v that is duplicated from node w.

5.3. Computational complexity of parameter estimation methods

Let us now assume that $n_0$ is fixed and that we are interested in results up to a resolution ε, that is, the values of p and r are stored in such a way that two numbers within a distance less than ε are indistinguishable.

For Algorithm 1, Recurrence-Relation, a single pass requires $\Theta\left(n \log\frac{1}{\varepsilon}\right)$ time, as it uses a binary search for p and, for every intermediate value of p, it executes exactly $n - n_0$ iterations of the recurrence, each requiring constant time. It is sufficient to sample $\frac{n_0}{\varepsilon}$ different values of r; therefore, the total running time to find suitable $(\hat{p}, \hat{r})$ pairs is $\Theta\left(\frac{n}{\varepsilon} \log\frac{1}{\varepsilon}\right)$.

On the other hand, the MLE algorithm needs to compute, at every step, the values of the ω function for all possible pairs of v and w for each graph $G_k$. This is the case because in the DD-model every vertex v could be a duplicate of every other vertex w, always with some non-zero probability, at every stage of the algorithm. This means that we require $\Theta(k^2)$ steps at each iteration of the algorithm, and therefore $\Theta\left(\sum_{k=n_0}^{n} k^2\right) = \Theta(n^3)$ steps in total. Unfortunately, even clever bookkeeping and amortization are not much help here.

Additionally, we need to estimate the likelihood for each pair (p, r) independently, as the likelihood function does not have the monotonicity property, so it requires in total $\Theta\left(\frac{n^3}{\varepsilon^2}\right)$ steps to find all feasible pairs up to a desired resolution of ε.

Moreover, as suggested by Wiuf et al. in [17], importance sampling provides good quality results only with at least 1000 independent trials. This adds a constant factor that is not visible in the big-Θ notation but is significant in practice, effectively making the algorithm infeasible for real-world data without supercomputer power.
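As a rough worked example of these growth rates, take n = 100 and ε = 0.01, ignore the constants hidden in the Θ notation as well as the factor of 1000 importance-sampling trials, and assume a base-2 logarithm for the binary search (the choice of n and ε here is purely illustrative):

$$
\frac{n}{\varepsilon}\log_2\frac{1}{\varepsilon} = 10^{4} \cdot \log_2 100 \approx 6.6 \times 10^{4},
\qquad
\frac{n^{3}}{\varepsilon^{2}} = 10^{6} \cdot 10^{4} = 10^{10}.
$$

Even at this small size, the two counts differ by about five orders of magnitude, and the gap widens quadratically in n.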

6. Numerical Experiments

In this section, we evaluate our methods on synthetic graphs and real-world PPI networks.

We made publicly available all the code and data of this project at https://github.com/krzysztof-turowski/duplication-divergence. The code supports random graph models and real-world networks.

Estimation of tolerance interval

We find the tolerance interval of the estimated p and r values for the fitted DD-model as follows. For a given network $G_{\text{obs}}$ and a seed graph $G_{n_0}$, the Recurrence-Relation algorithm first outputs a set of solutions $\{(\hat{p}, \hat{r})\}$. For each of the feasible pairs, we then estimate the confidence interval of the graph property with which the recurrence was calculated. For instance, if the property used was the mean degree D, graph samples generated from $\text{DD-model}(n, \hat{p}, \hat{r}, G_{n_0})$ are used to estimate the expectation $E[D(G_n)]$ and the variance $\mathrm{Var}[D(G_n)]$. A 95% confidence interval of $D(G_n)$ is then given by

$$
E[D(G_n)] \pm 1.96\sqrt{\mathrm{Var}[D(G_n)]}.
$$

The Gaussian distribution assumption used in the above expression is a good approximation for the distribution of D for large n. By fixing $\hat{p}$, we can then calculate a tolerance interval $(\hat{r}_{\min}, \hat{r}_{\max})$ for the estimated parameter $\hat{r}$. In the following experiments, to demonstrate the above approach, we focus on the two graph statistics D and C3 (the parameter estimation itself also includes S2).
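The interval above can be computed from simulated values of the statistic as in the following minimal sketch. The function name `confidence_interval_95` is hypothetical, the sample values are placeholders rather than results from the paper, and the sample standard deviation (with the n - 1 denominator) is assumed as the empirical estimate.

```python
from statistics import mean, stdev

def confidence_interval_95(samples):
    """Gaussian 95% interval: empirical mean +/- 1.96 empirical std. dev."""
    center = mean(samples)
    spread = 1.96 * stdev(samples)  # stdev uses the n - 1 denominator
    return center - spread, center + spread

# Placeholder values standing in for D(G_n) over simulated graphs.
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
lo, hi = confidence_interval_95(samples)
print(lo, hi)
```

An observed statistic falling inside this interval for a given pair of parameters is what places that pair within the tolerance band around the curve.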

Parameter estimation procedure for the experiments

Our parameter estimation procedure can be summarized as follows:

  1. We employ the Recurrence-Relation algorithm for solving the graph recurrences of the three graph statistics D, S2 and C3, and we identify a set of solutions for p and r.

  2. With $G_n \sim \text{DD-model}(n, \hat{p}, \hat{r}, G_{n_0})$, we find the tolerance interval of $\hat{r}$ using the confidence intervals of $D(G_n)$ and $C_3(G_n)$, as explained in the above subsection.

  3. We look for crossing points of the plots in the figure, and for the range of values of p and r where the confidence intervals meet around the crossing point. We call such a range of values the feasible-box.

  4. Though any point in the feasible-box is a good estimate of p and r, to improve the accuracy, we uniformly sample a fixed number of points from the box and choose the pair that gives maximum p-value with respect to the number of automorphisms of the given graph Gobs.

Theorem 1 provides theoretical guarantees (in expectation) for the Recurrence-Relation algorithm and for the idea of convergence of the three curves, that is, the solution sets of the D, S2 and C3 statistics, in the above estimation procedure. In practice, however, we allow some discrepancy in the convergence of the three curves. In the following experiments, we declare that the DD-model fits the given dataset when at least two curves converge and the confidence intervals of the three curves intersect, or when one curve intersects the confidence interval of another curve.

6.1. Synthetic graphs

In this section, we derive preliminary insights by studying our method and maximum likelihood estimation (MLE) empirically on synthetic data. See Sections 5.2 and 8 for the MLE implementation. We note here that the infeasibility of the MLE method for large graphs (e.g. with tens of thousands of vertices) is already explained in Section 5.3; the comparison in this section between our method and the MLE on small graphs is provided only to show that our approach is competitive with the MLE method.

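We generate two random graph samples $G_n^{(1)}$ and $G_n^{(2)}$ from the DD-model with the following parameters:

$$
\begin{aligned}
G_n^{(1)} &\sim \text{DD-model}(n = 100,\ p = 0.1,\ r = 0.3,\ G_{n_0} = K_{20}), \\
G_n^{(2)} &\sim \text{DD-model}(n = 100,\ p = 0.99,\ r = 3.0,\ G_{n_0} = K_{20}).
\end{aligned}
$$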

The parameters of $G_n^{(1)}$ and $G_n^{(2)}$ are chosen to represent different regimes in the following studies. Moreover, the parameters are chosen in such a way that the generated graphs have non-trivial symmetries.
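For readers who want to reproduce such samples, the following is a minimal sketch of a DD-model generator, following the description of a single step used in Section 9: the parent is chosen uniformly at random, the new vertex connects to each neighbor of the parent with probability p and to every other vertex (including the parent) with probability r/n. The function names, the adjacency-set representation and the use of a complete seed graph are assumptions for this example; it is not the code from the project repository.

```python
import random

def dd_model_step(adj, p, r, rng):
    """Add one vertex to the graph adj (dict: vertex -> set of neighbors)."""
    n = len(adj)              # vertices are labeled 0 .. n-1
    parent = rng.randrange(n) # uniform choice of the duplicated vertex
    new_nbrs = set()
    for u in range(n):
        prob = p if u in adj[parent] else r / n
        if rng.random() < prob:
            new_nbrs.add(u)
    adj[n] = new_nbrs
    for u in new_nbrs:
        adj[u].add(n)

def dd_model(n, p, r, n0, seed=None):
    """Grow a graph from the complete seed K_{n0} up to n vertices."""
    rng = random.Random(seed)
    adj = {v: set(range(n0)) - {v} for v in range(n0)}
    while len(adj) < n:
        dd_model_step(adj, p, r, rng)
    return adj

g1 = dd_model(100, 0.1, 0.3, 20, seed=1)
g2 = dd_model(100, 0.99, 3.0, 20, seed=1)
print(len(g1), len(g2))
```

The sketch assumes r/n ≤ 1 at every step, which holds for the two parameter sets above since n ≥ 20.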

Figures 6a and 6b plot the sets of feasible points identified by the recurrence relations using the Recurrence-Relation method. The light shaded bands show the tolerance intervals of r. We observe that the crossing points and the tolerance intervals are fairly close to the original parameters. Figures 6c and 6d display heat plots of the log-likelihood function of the MLE for different values of the parameters; the maximum value of the log-likelihood in each heat plot gives the parameters returned by the MLE. The log-likelihood function is maximized at (p, r) pairs close to the original parameters, but not up to the resolution of the Recurrence-Relation method.

Fig. 6:

Results on synthetic networks: Recurrence-Relation and maximum likelihood estimation (MLE) methods. The yellow squares represent the parameters estimated via our Recurrence-Relation method, and the black squares denote the MLE method (see Table 6 for more details). We also estimate the parameters by the existing mean-field approach (using (1) and (2)), but the estimated values are very different from the true values; for the graph in (a), we get $\hat{p} = 0.51$ and a negative value of $\hat{r}$, and for the graph in (b), $\hat{p} = 0.15$, $\hat{r} = 15.42$.

In Table 6 we report the statistical significance of the best estimated parameter pairs obtained via both the Recurrence-Relation method and the MLE. In the Recurrence-Relation method, the best pair is found from 1000 uniform samples in the feasible-box centered at the point where the three curves are in agreement; for the MLE, it is found from 1000 uniform samples in the maximum log-likelihood area if no unique maximizer exists. The estimates from both techniques demonstrate the presence of the DD-model in the given graphs $G_n^{(1)}$ and $G_n^{(2)}$ (p-value > 0.1). The best pair of the Recurrence-Relation estimator has a much higher p-value and certainly outperforms the MLE.

TABLE 6:

Results on synthetic networks: average number of automorphisms and p-value.

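| Model parameters | $\log\lvert\mathrm{Aut}(G_{\text{obs}})\rvert$ | Recurrence-Relation $\hat{p}$ | Recurrence-Relation $\hat{r}$ | Recurrence-Relation $E[\log\lvert\mathrm{Aut}(G_n)\rvert]$ | Recurrence-Relation p-value | MLE $\hat{p}$ | MLE $\hat{r}$ | MLE $E[\log\lvert\mathrm{Aut}(G_n)\rvert]$ | MLE p-value |
|---|---|---|---|---|---|---|---|---|---|
| p = 0.1, r = 0.3 | 81.963 | 0.09 | 0.3 | 81.974 | 0.980 | 0.1 | 0.3 | 78.794 | 0.820 |
| p = 0.99, r = 3.0 | 16.178 | 0.99 | 2.5 | 16.588 | 0.980 | 0.95 | 0.3 | 0.368 | 0 |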

We note that for the first graph $G_n^{(1)}$ the results obtained by both methods are almost identical in terms of $E[\log|\mathrm{Aut}(G_n)|]$ and p-values. For the second graph $G_n^{(2)}$, the log-likelihood function of the MLE is nearly flat for large values of p, and thus the MLE returns less reliable estimates. This in turn results in a larger deviation of the number of automorphisms from that of the observed graph. Our algorithm, on the other hand, provides a better estimate even when p is close to 1. To sum up, we find that our algorithm does not perform worse than the MLE in terms of quality and achieves better performance than the MLE when p is high. It also has much lower computational complexity than the MLE. The MLE requires more than 1 million computations for this particular example, whereas Recurrence-Relation needs at most 100 computations (scaling factors excluded). Detailed complexity calculations are provided in Section 5.3.

6.2. Real-world PPIs

We apply the recurrence-based estimator to the PPI networks of the seven species listed in Table 2. As mentioned in Section 2.1, the seed graph $G_{n_0}$ is taken to be the graph induced by the nodes having the largest phylogenetic age.

The MLE solution is almost impossible to calculate for the PPI networks. Even for the smallest PPI network (Worm), MLE requires around 58 billion computations to obtain a result for a single (p, r) parameter set.

Figure 7 presents plots of Recurrence-Relation estimator for seven species. In all the figures, the plots meet or come very close at a specific point. This illustrates the presence of the DD-model in all the considered species. Furthermore, Table 7 calculates the statistical significance of the fitted DD-model with respect to the number of automorphisms in the observed PPI networks. The estimated p-values are remarkably high and most often much larger than 0.4 (except in one case), demonstrating that the fitted DD-models exhibit symmetries closer to the real-world PPIs.

Fig. 7:

Parameter estimation of the PPI networks in Table 2 with our Recurrence-Relation method. The yellow squares represent the parameters chosen by the Recurrence-Relation method (Table 7), and the black squares denote the estimated value by the existing mean-field method (Table 3).

TABLE 7:

Parameters of the real-world PPI networks estimated using Recurrence-Relation method

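| Organism | $\hat{p}$ | $\hat{r}$ | $E[\log\lvert\mathrm{Aut}(G_n)\rvert]$ | p-value |
|---|---|---|---|---|
| Baker’s yeast | 0.98 | 0.35 | 293.27 | 0.71 |
| Human | 0.64 | 0.49 | 2998.81 | 0.51 |
| Fruitfly | 0.53 | 0.92 | 1073.83 | 0.64 |
| Fission yeast | 0.983 | 0.85 | 705.278 | 0.74 |
| Mouse-ear cress | 0.98 | 0.49 | 6210.36 | 0.13 |
| Mouse | 0.96 | 0.32 | 8067.56 | 0.67 |
| Worm | 0.85 | 0.35 | 3352.91 | 0.48 |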

If we compare our parameter estimation results on the PPI data (Table 7) with the known estimation procedure based on the mean-field approximation (Table 3), we find a large difference in the values of p and r. To explain this difference, we show in Table 5 that the mean-field estimation procedure typically discards more than 85% of the data to estimate the power-law exponent γ, and the estimation of the parameters p and r is based on γ (see (1) and (2)), which makes them erroneous. Moreover, we observe that the results of the mean-field estimation are closer to the D curve, but not to the S2 or C3 curves in Figure 7, and hence they do not yield statistically significant $\log|\mathrm{Aut}(G_n)|$. This means that the mean-field method, due to the simplifications involved in its procedure, is able to capture only certain characteristics and misses others that are no less important.

7. Discussion

In this work we focus on fitting dynamic biological networks to a probabilistic graph model from a single snapshot of the networks. Our attention here is on a key characteristic of the networks, the number of automorphisms, that is often neglected in modeling. Using the number of automorphisms as a measure to sample parameters from the parameter space may raise serious questions about its practicality (as with some slower maximum likelihood estimation methods for graph fitting). To address this, we combine the number of symmetries with a faster method based on recurrence relations, which allows us to narrow down the parameter search and makes the approach relevant in practice.

We argue that many existing parameter estimation techniques fail to take into account the number of symmetries of real-world networks, leading to serious concerns in the fitting methodology. Previous studies made unrealistic assumptions like the steady-state behavior of the model, and it could be the reason behind erroneous estimates. Our proposed fitting method based on exact recurrence relations, derived from rigorous theory, with minimal assumptions works well on synthetic data and real-world protein-protein interaction (PPI) networks of seven species. We also formulate a simple statistical test in terms of the number of symmetries. Since the PPI networks are expanding with new protein-protein interactions getting discovered, we make sure to use up-to-date data so that the fitted parameters in this paper can serve as a benchmark for future studies.

We note here that the method introduced in this work is applicable to a variety of dynamic network models, as for many models there exist recurrence relations similar to the ones presented here. A systematic way of parameter estimation can also be seen as an introductory work to other important problems in biological networks. One example of such a problem is the temporal order problem [30]: given a network, the task is to recover the chronology of the node arrivals in the network. Parameter estimation provides us with better knowledge about the specific characteristics of the model that retains temporal information in its structure.

8. Maximum likelihood algorithm

Function MaximumLikelihoodValue, as shown below in Algorithm 2, presents a single pass of the MLE technique. Here θ0 = (p0, r0) is the initial parameter set which can be chosen with some extra knowledge of the given network or even arbitrarily; but with a proper choice of θ0, a faster convergence to the true value of likelihood is guaranteed.

We note here that the MLE procedure has to be run multiple times, and the average of the values of L (see the algorithm) from the multiple runs gives an estimate of the likelihood at the given (p, r). The procedure needs to be repeated for all the relevant (p, r) pairs. The estimated parameters are the points for which the likelihood function is maximized. See Section 5.2 for details.

Algorithm 2.

Single run for the likelihood value computation.

function MaximumLikelihoodValue(G, n, n0, p, r, p0, r0)
  L ← 1
  for k = n, n − 1, …, n0 + 1 do
    Pick v at random with Pr[v] ∝ ω(G, p0, r0, v)
    L ← L · (1/k) · (Σ_{u ∈ G} ω(G, p0, r0, u) / ω(G, p0, r0, v)) · ω(G, p, r, v)
    Remove v from G
  return L

function ω(G, p, r, v)
  sum ← 0, n ← |V(G)|
  for u ∈ V(G), u ≠ v do
    both ← |N_n(u) ∩ N_n(v)|
    only_v ← |N_n(v) \ N_n(u)|
    only_u ← |N_n(u) \ N_n(v)|
    none ← n − |N_n(u) ∪ N_n(v)|
    sum += p^both · (r/n)^only_v · (1 − p)^only_u · (1 − r/n)^none
  return sum
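The pseudocode of Algorithm 2 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the graph is a dictionary of neighbor sets, the function names `omega` and `likelihood_single_run` are hypothetical, the counting inside `omega` follows the pseudocode literally, and the last factor of the update of L is taken at the sampled node v. It is not the implementation from the project repository.

```python
import random

def omega(adj, p, r, v):
    """Unnormalized weight of v being the last added node (sum over parents u)."""
    n = len(adj)
    total = 0.0
    for u in adj:
        if u == v:
            continue
        both = len(adj[u] & adj[v])
        only_v = len(adj[v] - adj[u])
        only_u = len(adj[u] - adj[v])
        none = n - len(adj[u] | adj[v])
        total += p**both * (r/n)**only_v * (1 - p)**only_u * (1 - r/n)**none
    return total

def likelihood_single_run(adj, n0, p, r, p0, r0, rng):
    """One backward pass from the given graph down to n0 vertices."""
    adj = {u: set(nb) for u, nb in adj.items()}  # work on a copy
    L = 1.0
    for k in range(len(adj), n0, -1):            # k = n, n-1, ..., n0+1
        nodes = list(adj)
        w0 = [omega(adj, p0, r0, u) for u in nodes]
        # rng.choices raises ValueError if all weights are zero.
        v = rng.choices(nodes, weights=w0, k=1)[0]
        L *= (1 / k) * sum(w0) / w0[nodes.index(v)] * omega(adj, p, r, v)
        for u in adj.pop(v):                     # remove v from the graph
            adj[u].discard(v)
    return L

# Path 0-1-2: hand-computed omega for the end vertex 2 is 0.405 + 0.025.
path = {0: {1}, 1: {0, 2}, 2: {1}}
w = omega(path, 0.5, 0.3, 2)
L = likelihood_single_run(path, 2, 0.5, 0.3, 0.5, 0.3, random.Random(0))
print(w, L)
```

When the reference parameters equal the evaluated parameters, the ratio of weights cancels, so a single step returns the sum of the weights divided by k regardless of which node is sampled; this gives a convenient sanity check.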

9. Proof of Theorem 1

Let $\deg_t(s)$ be the degree at time t of the vertex added at time s (which is the same as the node with label s), and let $\mathrm{parent}(t)$ be the vertex chosen from $G_{t-1}$ for the duplication step at time t.

It follows from the definition of the model that the degree of the new vertex n + 1 is the total number of edges from the vertex n + 1 to $N_n(\mathrm{parent}(n+1))$ (each of which is formed by choosing nodes independently from $N_n(\mathrm{parent}(n+1))$ with probability p) and to all other vertices (each of which is formed by choosing nodes independently from the set $V(G_n) \setminus N_n(\mathrm{parent}(n+1))$ with probability $\frac{r}{n}$).

It can then be expressed as a sum of two independent binomial variables:

$$
\deg_{n+1}(n+1) \sim \mathrm{Bin}\left(\deg_n(\mathrm{parent}(n+1)),\, p\right) + \mathrm{Bin}\left(n - \deg_n(\mathrm{parent}(n+1)),\, \frac{r}{n}\right).
$$

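1. Recurrence for $D(G_n)$.

$$
\begin{aligned}
E[\deg_{n+1}(n+1) \mid G_n] &= \sum_{k=0}^{n-1} \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n) \sum_{a=0}^{k} \binom{k}{a} p^a (1-p)^{k-a} \sum_{b=0}^{n-k} \binom{n-k}{b} \left(\frac{r}{n}\right)^{b} \left(1 - \frac{r}{n}\right)^{n-k-b} (a+b) \\
&= \sum_{k=0}^{n-1} \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n) \left(pk + \frac{r}{n}(n-k)\right) \\
&= \left(p - \frac{r}{n}\right) \sum_{k=0}^{n-1} k \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n) + r.
\end{aligned}
$$

Since the definition of the model states that the parent is selected uniformly at random, we know that $\Pr(\mathrm{parent}(n+1) = i \mid G_n) = \frac{1}{n}$ and therefore

$$
D(G_n) = \sum_{i=1}^{n} \Pr(\mathrm{parent}(n+1) = i \mid G_n) \deg_n(i) = \sum_{k=0}^{n-1} k \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n).
$$

Combining the last two equations, we get

$$
E[\deg_{n+1}(n+1) \mid G_n] = \left(p - \frac{r}{n}\right) D(G_n) + r.
$$

Using the above, we find the following recurrence for the mean degree of $G_{n+1}$:

$$
\begin{aligned}
E[D(G_{n+1}) \mid G_n] &= \frac{1}{n+1} E\left[\sum_{i=1}^{n+1} \deg_{n+1}(i) \,\middle|\, G_n\right] = \frac{1}{n+1}\left(\sum_{i=1}^{n} \deg_n(i) + 2E[\deg_{n+1}(n+1) \mid G_n]\right) \\
&= \frac{1}{n+1}\left(n D(G_n) + 2E[\deg_{n+1}(n+1) \mid G_n]\right) = D(G_n)\left(1 + \frac{2p-1}{n+1} - \frac{2r}{n(n+1)}\right) + \frac{2r}{n+1}.
\end{aligned}
$$

Now, from the law of total expectation:

$$
E[D(G_{n+1})] = E[D(G_n)]\left(1 + \frac{2p-1}{n+1} - \frac{2r}{n(n+1)}\right) + \frac{2r}{n+1}.
$$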

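2. Recurrence for $D_2(G_n)$.

$$
\begin{aligned}
E[\deg_{n+1}^2(n+1) \mid G_n] &= \sum_{k=0}^{n-1} \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n) \sum_{a=0}^{k} \binom{k}{a} p^a (1-p)^{k-a} \sum_{b=0}^{n-k} \binom{n-k}{b} \left(\frac{r}{n}\right)^{b} \left(1 - \frac{r}{n}\right)^{n-k-b} (a+b)^2 \\
&= \sum_{k=0}^{n-1} \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n) \left(k^2\left(p^2 - \frac{2pr}{n} + \frac{r^2}{n^2}\right) + k\left(p - p^2 + 2pr - \frac{r + 2r^2}{n} + \frac{r^2}{n^2}\right) + r^2 + r - \frac{r^2}{n}\right) \\
&= D_2(G_n)\left(p^2 - \frac{2pr}{n} + \frac{r^2}{n^2}\right) + D(G_n)\left(p - p^2 + 2pr - \frac{r + 2r^2}{n} + \frac{r^2}{n^2}\right) + r^2 + r - \frac{r^2}{n},
\end{aligned}
$$

since we have, as before,

$$
D_2(G_n) = \sum_{i=1}^{n} \Pr(\mathrm{parent}(n+1) = i \mid G_n) \deg_n^2(i) = \sum_{k=0}^{n-1} \Pr(\deg_n(\mathrm{parent}(n+1)) = k \mid G_n)\, k^2.
$$

Now we proceed with the second moment of the degree distribution of $G_{n+1}$. Let $I_{n+1}(i)$ be the indicator variable of the event that there is an edge between n + 1 and i. Then the following basic results hold:

$$
\sum_{i=1}^{n} I_{n+1}^2(i) = \sum_{i=1}^{n} I_{n+1}(i) = \deg_{n+1}(n+1)
$$

and

$$
\begin{aligned}
E\left[\sum_{i=1}^{n} \deg_n(i) I_{n+1}(i) \,\middle|\, G_n\right] &= \sum_{i=1}^{n} \deg_n(i) E[I_{n+1}(i) \mid G_n] = \sum_{i=1}^{n} \deg_n(i)\left(\frac{\deg_n(i)}{n}\, p + \frac{n - \deg_n(i)}{n} \cdot \frac{r}{n}\right) \\
&= \frac{1}{n}\sum_{i=1}^{n} \deg_n^2(i)\left(p - \frac{r}{n}\right) + \frac{1}{n}\sum_{i=1}^{n} \deg_n(i)\, r = \left(p - \frac{r}{n}\right) D_2(G_n) + r D(G_n).
\end{aligned}
$$

Now,

$$
\begin{aligned}
E[D_2(G_{n+1}) \mid G_n] &= \frac{1}{n+1} E\left[\sum_{i=1}^{n+1} \deg_{n+1}^2(i) \,\middle|\, G_n\right] \\
&= \frac{1}{n+1} E\left[\sum_{i=1}^{n} \deg_n^2(i) + 2\sum_{i=1}^{n} \deg_n(i) I_{n+1}(i) + \deg_{n+1}(n+1) + \deg_{n+1}^2(n+1) \,\middle|\, G_n\right] \\
&= D_2(G_n)\left(1 + \frac{2p + p^2 - 1}{n+1} - \frac{2r(1+p)}{n(n+1)} + \frac{r^2}{n^2(n+1)}\right) \\
&\quad + D(G_n)\left(\frac{2p - p^2 + 2pr + 2r}{n+1} - \frac{2r + 2r^2}{n(n+1)} + \frac{r^2}{n^2(n+1)}\right) + \frac{r^2 + 2r}{n+1} - \frac{r^2}{n(n+1)}.
\end{aligned}
$$

Then, from the law of total expectation, we obtain the desired formula.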

3. Recurrence for $S_2(G_n)$.

The recurrence for $S_2(G_n)$ follows directly from the following deterministic relation, which holds for every graph:

$$
S_2(G_n) = \frac{n}{2}\left(D_2(G_n) - D(G_n)\right).
$$

Alternatively, it can be computed from the following relation using similar methods as before:

$$
E[S_2(G_{n+1}) \mid G_n] = E\left[\sum_{i=1}^{n+1} \binom{\deg_{n+1}(i)}{2} \,\middle|\, G_n\right].
$$

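4. Recurrence for $C_3(G_n)$.

Finally, we find the expected number of triangles in the following way. Let us denote by $e_t(A, B)$ the number of edges with one endpoint in A and the other in B, and by $e_t(A)$ the number of edges with both endpoints in A, for some fixed $A, B \subseteq V(G_t)$ at time t.

For brevity, let us also introduce the following notation.

  • $N_P(n+1) = N_n(\mathrm{parent}(n+1))$ – the set of neighbors of the parent of n + 1 in $G_n$,

  • $X_1(G_n) := e_n(N_P(n+1))$ – the number of edges within the (open) neighborhood of the parent of n + 1 in $G_n$,

  • $X_2(G_n) := e_n(N_P(n+1), V(G_n) \setminus N_P(n+1))$ – the number of edges between the (open) neighborhood of the parent of n + 1 and the other vertices in $G_n$,

  • $X_3(G_n) := e_n(V(G_n) \setminus N_P(n+1))$ – the number of edges between vertices not connected to the parent of n + 1 in $G_n$.

Taking the expectation over the uniform choice of the parent, and noting that the sum of the degrees over $N_P(n+1)$ counts every edge inside $N_P(n+1)$ twice, it can be easily verified that

$$
E[X_1(G_n) \mid G_n] = \frac{3C_3(G_n)}{n}, \qquad E[2X_1(G_n) + X_2(G_n) \mid G_n] = D_2(G_n), \qquad X_1(G_n) + X_2(G_n) + X_3(G_n) = \frac{n}{2} D(G_n).
$$

Therefore

$$
\begin{aligned}
E[C_3(G_{n+1}) \mid G_n] &= C_3(G_n) + p^2 E[X_1(G_n) \mid G_n] + \frac{pr}{n} E[X_2(G_n) \mid G_n] + \frac{r^2}{n^2} E[X_3(G_n) \mid G_n] \\
&= C_3(G_n)\left(1 + \frac{3p^2}{n} - \frac{6pr}{n^2} + \frac{3r^2}{n^3}\right) + D_2(G_n)\left(\frac{pr}{n} - \frac{r^2}{n^2}\right) + D(G_n)\,\frac{r^2}{2n}.
\end{aligned}
$$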

We can now apply the law of total expectation to get the final result. □
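The triangle recurrence can be checked exactly on a small graph by enumerating the parent and classifying each edge by how many of its endpoints lie in the parent's neighborhood. The sketch below assumes an adjacency-set representation; the function name `expected_new_triangles` and the example graph are illustrative only.

```python
def expected_new_triangles(adj, p, r):
    """Exact expected number of triangles created by one DD-model step."""
    n = len(adj)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # Probability that the new vertex joins both endpoints of an edge,
    # indexed by the number of endpoints outside the parent's neighborhood.
    weight = (p * p, p * r / n, (r / n) ** 2)
    total = 0.0
    for parent in adj:
        nbrs = adj[parent]
        for u, v in edges:
            inside = (u in nbrs) + (v in nbrs)
            total += weight[2 - inside]
    return total / n  # uniform choice of the parent

# 4-cycle 0-1-2-3 with chord 0-2: n = 4, D = 2.5, D2 = 6.5, C3 = 2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
n, D, D2, C3 = 4, 2.5, 6.5, 2
p, r = 0.5, 1.0
lhs = C3 + expected_new_triangles(adj, p, r)
rhs = (C3 * (1 + 3*p*p/n - 6*p*r/n**2 + 3*r*r/n**3)
       + D2 * (p*r/n - r*r/n**2) + D * r*r/(2*n))
print(abs(lhs - rhs) < 1e-12)
```

The left-hand side is obtained by direct enumeration, and the right-hand side is the closed form of Theorem 1; they agree up to floating-point rounding.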

Acknowledgments

This work was supported by NSF Center for Science of Information (CSoI) Grant CCF-0939370, and in addition by NSF Grants CCF-1524312, NIH Grant 1U01CA198941-01, Polish National Science Centre grant 2018/31/B/ST6/01294.

Biographies

Jithin K. Sreedharan is a Postdoctoral Research Associate at the NSF Center for Science of Information and Dept. of Computer Science in Purdue University. He received his Ph.D. in computer science from INRIA, France, in 2017 with a fellowship from INRIA-Bell Labs joint lab. Before that, he finished M.S. from Indian Institute of Science (IISc), Bangalore, in 2013, and received the best thesis award. His central research interest is in solving real-world problems in data science with a network perspective. His current works focus on data mining algorithms for large networks with probabilistic guarantees, statistical modeling, and inference on networks, and distributed techniques for analyzing big matrices.

Krzysztof Turowski is currently assistant professor at the Theoretical Computer Science Department at the Jagiellonian University, Krakow, Poland. He received his MS and PhD degrees from Gdansk University of Technology, Poland in 2011 and 2015, respectively, both in computer science. From 2010 to 2016 he was employed at the Department of Algorithms and System Modelling at Gdansk University of Technology and from 2016 to 2018 he worked at Google as a software developer for Google Compute Engine. From 2018 to 2019 he was a Postdoctoral Research Scholar in the NSF Center for Science of Information at Purdue University. His research interests include graph theory (especially various models of graph coloring), analysis of algorithms and information theory.

Wojciech Szpankowski is Saul Rosen Distinguished Professor of Computer Science at Purdue University where he teaches and conducts research in analysis of algorithms, information theory, analytic combinatorics, data science, random structures, and stability problems of distributed systems. He held several Visiting Professor/Scholar positions, including McGill University, INRIA, France, Stanford, Hewlett-Packard Labs, Universite de Versailles, University of Canterbury, New Zealand, Ecole Polytechnique, France, the Newton Institute, Cambridge, UK, ETH, Zurich, and Gdansk University of Technology, Poland. He is a Fellow of IEEE, and the Erskine Fellow. In 2010 he received the Humboldt Research Award and in 2015 the Inaugural Arden L. Bement Jr. Award. He is also the recipient of 2020 Flajolet Lecture Prize. He published two books: “Average Case Analysis of Algorithms on Sequences”, John Wiley & Sons, 2001, and “Analytic Pattern Matching: From DNA to Twitter”, Cambridge, 2015. In 2008 he launched the interdisciplinary Institute for Science of Information, and in 2010 he became the Director of the newly established NSF Science and Technology Center for Science of Information.

Contributor Information

Jithin K. Sreedharan, Dept. of Computer Science and the NSF Center for Science of Information, Purdue University, West Lafayette, IN 47907.

Krzysztof Turowski, Theoretical Computer Science Department, Jagiellonian University, Krakow, Poland.

Wojciech Szpankowski, Dept. of Computer Science and the NSF Center for Science of Information, Purdue University, West Lafayette, IN 47907.

References

  • [1] Zhang J, “Evolution by gene duplication: an update,” Trends in Ecology & Evolution, vol. 18, no. 6, pp. 292–298, 2003.
  • [2] Ohno S, Evolution by gene duplication. Berlin–Heidelberg: Springer-Verlag, 1970.
  • [3] Pastor-Satorras R, Smith E, and Solé RV, “Evolving protein interaction networks through gene duplication,” Journal of Theoretical Biology, vol. 222, no. 2, pp. 199–210, 2003.
  • [4] Shao M, Yang Y, Guan J, and Zhou S, “Choosing appropriate models for protein–protein interaction networks: a comparison study,” Briefings in Bioinformatics, vol. 15, no. 5, pp. 823–838, 2013.
  • [5] Hormozdiari F, Berenbrink P, Pržulj N, and Sahinalp SC, “Not all scale-free networks are born equal: the role of the seed graph in PPI network evolution,” PLoS Computational Biology, vol. 3, no. 7, p. e118, 2007.
  • [6] Ispolatov I, Krapivsky P, and Yuryev A, “Duplication-divergence model of protein interaction network,” Physical Review E, vol. 71, no. 6, p. 061911, 2005.
  • [7] Raval A, “Some asymptotic properties of duplication graphs,” Physical Review E, vol. 68, no. 6, p. 066119, 2003.
  • [8] Chung F, Lu L, Dewey TG, and Galas D, “Duplication models for biological networks,” Journal of Computational Biology, vol. 10, no. 5, pp. 677–687, 2003.
  • [9] Kim JH, Sudakov B, and Vu VH, “On the asymmetry of random regular graphs and random graphs,” Random Structures & Algorithms, vol. 21, no. 3-4, pp. 216–224, 2002.
  • [10] Łuczak T, Magner A, and Szpankowski W, “Asymmetry and structural information in preferential attachment graphs,” Random Structures & Algorithms, 2019.
  • [11] Turowski K, Magner A, and Szpankowski W, “Compression of Dynamic Graphs Generated by a Duplication Model,” in 56th Annual Allerton Conference on Communication, Control, and Computing. Monticello, IL, US: IEEE, 2018, pp. 1089–1096.
  • [12] Godsil C and Royle GF, Algebraic graph theory. Springer Science & Business Media, 2013.
  • [13] Bebek G, Berenbrink P, Cooper C, Friedetzky T, Nadeau J, and Sahinalp SC, “The degree distribution of the generalized duplication model,” Theoretical Computer Science, vol. 369, no. 1-3, pp. 239–249, 2006.
  • [14] Bebek G, Berenbrink P, Cooper C, Friedetzky T, Nadeau JH, and Sahinalp SC, “Improved duplication models for proteome network evolution,” in Systems Biology and Regulatory Genomics. Berlin, Heidelberg: Springer, 2007, pp. 119–137.
  • [15] Colak R, Hormozdiari F, Moser F, Schönhuth A, Holman J, Ester M, and Sahinalp SC, “Dense graphlet statistics of protein interaction and random networks,” in Biocomputing 2009. Singapore: World Scientific Publishing, 2009, pp. 178–189.
  • [16] Bhan A, Galas D, and Dewey TG, “A duplication growth model of gene expression networks,” Bioinformatics, vol. 18, no. 11, pp. 1486–1493, 2002.
  • [17] Wiuf C, Brameier M, Hagberg O, and Stumpf M, “A likelihood approach to analysis of network data,” Proceedings of the National Academy of Sciences, vol. 103, no. 20, pp. 7566–7570, 2006.
  • [18] Bubeck S, Mossel E, and Rácz MZ, “On the influence of the seed graph in the preferential attachment model,” IEEE Transactions on Network Science and Engineering, vol. 2, no. 1, pp. 30–39, 2015.
  • [19] McKay B and Piperno A, “Practical graph isomorphism,” Journal of Symbolic Computation, vol. 60, pp. 94–112, 2013.
  • [20] Heinicke S, Livstone M, Lu C, Oughtred R, Kang F, Angiuoli S, White O, Botstein D, and Dolinski K, “The princeton protein orthology database (p-pod): a comparative genomics analysis tool for biologists,” PloS one, vol. 2, no. 8, p. e766, 2007.
  • [21] Li L, Stoeckert C, and Roos D, “Orthomcl: identification of ortholog groups for eukaryotic genomes,” Genome research, vol. 13, no. 9, pp. 2178–2189, 2003.
  • [22] Thomas P, Campbell M, Kejariwal A, Mi H, Karlak B, Daverman et al., “Panther: a library of protein families and subfamilies indexed by function,” Genome research, vol. 13, no. 9, pp. 2129–2141, 2003.
  • [23] Capra J, Williams A, and Pollard K, “Proteinhistorian: tools for the comparative analysis of eukaryote protein origin,” PLoS Computational Biology, vol. 8, no. 6, p. e1002567, 2012.
  • [24] Brown K, Hill C, Calero G, Myers C, Lee K, Sethna J, and Cerione R, “The statistical mechanics of complex signaling networks: nerve growth factor signaling,” Physical Biology, vol. 1, no. 3, p. 184, 2004.
  • [25] Han J-D, Bertin N, Hao T, Goldberg D, Berriz G, Zhang L, Dupuy D, Walhout A, Cusick M, Roth F et al., “Evidence for dynamically organized modularity in the yeast protein–protein interaction network,” Nature, vol. 430, no. 6995, p. 88, 2004.
  • [26] Tanaka R, Yi T-M, and Doyle J, “Some protein interaction data do not exhibit power law statistics,” FEBS Letters, vol. 579, no. 23, pp. 5140–5144, 2005.
  • [27] Khanin R and Wit E, “How scale-free are biological networks,” Journal of Computational Biology, vol. 13, no. 3, pp. 810–818, 2006.
  • [28] Clauset A, Shalizi CR, and Newman ME, “Power-law distributions in empirical data,” SIAM review, vol. 51, no. 4, pp. 661–703, 2009.
  • [29] Solé R, Pastor-Satorras R, Smith E, and Kepler T, “A model of large-scale proteome evolution,” Advances in Complex Systems, vol. 5, no. 01, pp. 43–54, 2002.
  • [30] Sreedharan J, Magner A, Grama A, and Szpankowski W, “Inferring temporal information from a snapshot of a dynamic network,” Nature Scientific Reports, vol. 9, no. 1, p. 3057, 2019.
