Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2012 Apr 18;7(4):e35531. doi: 10.1371/journal.pone.0035531

Limitations of Gene Duplication Models: Evolution of Modules in Protein Interaction Networks

Frank Emmert-Streib 1,*
Editor: Lukasz Kurgan2
PMCID: PMC3329483  PMID: 22530042

Abstract

It has been generally acknowledged that the module structure of protein interaction networks plays a crucial role with respect to the functional understanding of these networks. In this paper, we study evolutionary aspects of the module structure of protein interaction networks, which forms a mesoscopic level of description with respect to the architectural principles of networks. The purpose of this paper is to investigate limitations of well known gene duplication models by showing that these models are lacking crucial structural features present in protein interaction networks on a mesoscopic scale. This observation reveals our incomplete understanding of the structural evolution of protein networks on the module level.

Introduction

The understanding of evolutionary processes is not only of great interest to reconstruct the history of organic life and its evolution but can also help to shed light on the molecular functioning of organisms [1][6]. With the availability of large-scale sequence information and protein structures, the information stored in these entities could be systematically exploited with the help of computational and statistical methods [7][12]. Such studies have in common that a functional understanding is usually not obtained by direct investigations of molecular interactions but by inductive reasoning based on a comparative analysis. This is in contrast to studies based on the analysis of gene networks [13], because with the advent of network biology [14] and the availability of genome-scale networks, evolutionary questions can be addresses on the network-level [15]. Due to the fact that the structure of gene networks, e.g., metabolic, protein, or transcriptional regulatory networks, represent causal molecular interactions, direct studies of the biological function are enabled [16][18].

Since the introduction of random networks in the 1950s [19], [20] many new network classes have been invented [21][24], commonly called complex networks, and shown to provide better models for numerous natural phenomena [25][27]. Over the years, the interest in these complex networks has been gradually shifted from studying local properties, e.g., degree distributions, toward larger substructures or subnetworks forming motifs or communities [28][31]. In biology, the rational for this shift lies in the opportunity that gene networks offer in revealing insights about functional working mechanisms of a cell, if studied appropriately [13], [32], [33]. Similarly, this trend can be also observed in studies of the structural evolution of gene networks [34], [35].

The major purpose of this paper is to study two biologically motivated models that have been introduced to describe the evolution of protein-protein interaction (PPI) networks. More precisely, we study the question if the network gene duplication (NGD) model [36], [37] and the duplication-mutation complementation (DMC) model [38] resemble the module structure of biological protein interaction networks. Due to the similarity of both models, as described in detail in section ‘Network data and models’, we use the term gene duplication model (GDM) to either indicate the network gene duplication model or the duplication-mutation complementation model. The general idea of our study is to probe these evolutionary models by comparing networks generated with these models with biological PPI networks from various organisms [39], [40] to gain a deeper understanding of their capabilities. Here, the fact that a gene duplication model may have limitations would be of no surprise, since every model is merely an abstraction of reality sacrificing certain aspects to gain a mathematical representation. However, it is important to understand what specific limitations a gene duplication model is suffering from to judge its usefulness to serve as a model for the evolution of protein interaction networks. Considering the results from investigations studying either the NGD model or the DMC model with respect to the degree distribution of the networks, it appears unlikely that on this level of description refuting information can be found. Instead, in this article we are focusing on mesoscopic properties of networks in form of modules [41], [42]. The motivating idea for choosing this level is not only the fact that the module structure of networks is by far less well studied compared to the degree distribution, but, from a biological point of view, a module appears to be a more important entity with respect to the biological function of an organism than the degree of a gene. More specifically, genes and gene products establish by interacting with each other a biological function. This entity is of central importance to understand the functioning of an organism. In more abstract terms the interactions among genes establish an information flow that gives rise to this biological function. In this respect, modules can be considered to represent basic entities of information processing on a molecular level. Due to the fact that both, the network gene duplication model and the duplication-mutation complementation model, have been introduced to resemble gene inheritance [43], but not information processing in modules, the answer to the question if these models resemble the module structure of biological protein interaction networks does not immediately follow from their definition, but needs to be investigated. For reasons of completeness, we would like to mention that the two evolutionary models [36][38] are not the only models that has been proposed for the evolution of protein networks but there are a few other models, e.g., [44][46]. However, the NGD model and the DMC model investigated in this paper might be the most widely used and studied models in the literature.

In order to study the proposed question quantitatively, we pursue the following approach. First, we select an algorithm to identify the modules in networks. Second, we define several network-based measures that capture important information about the module structure of a network. These measures will form the components of a feature vector that represents the network. Third, we use agglomerative clustering to cluster the feature vectors in order to reveal similarities respectively differences between the clustered networks.

This paper is organized as follows. In the ‘Methods’ section we specify the network data we are analyzing and the methods we are applying. In the ‘Results’ section we present numerical results of our analysis and discuss our findings. The paper finishes with the ‘Discussion’ section presenting a summary and an outlook to future problems.

Methods

In order to study the question if gene duplication models resemble the module structure of biological protein interaction networks, we need to realize that any gene duplication model (GDM) is formally a stochastic process [36][38]. That means if we generate two networks using the same model parameters, these networks will most likely not be identical. However, they share certain characteristics quantifiable by network-based measures, e.g., the exponent of their degree distribution or their edge density. This implies that a GDM, as any other stochastic process that generates networks, constitutes a network population or a network class. Throughout this paper, we use both terms synonymously. In the following we describe our general approach to study the population properties of a GDM.

Our overall approach is schematically visualized in Fig. 1. The basic idea is to map networks, which are part of a population, to feature vectors. That means the feature vectors are used as a representation of the networks, respectively the population of networks. We assume that there exists an underlying stochastic mechanism, or a model, that generates networks with common characteristics. These characteristics may vary from network to network because the underlying mechanism is stochastic rather than deterministic. The commonality of all networks generated from such a model forms a population. A specific example of a biologically motivated mechanism that generates protein networks is either the NGD model [36], [37] or the DMC model [38]. For a given set of model parameters these models establish a population of networks sharing common properties. Another example for such a mechanism is the preferential attachment model which generates scale-free networks [21]. In the following, we assume that also (biological) protein networks constitute a population which have been generated by evolutionary forces.

Figure 1. A schematic visualization of our approach to study properties of network populations.

Figure 1

First, we map networks to feature vectors. Then we analysis these feature vectors with a hierarchical clustering. The resulting clustering allow us to conclude back to the similarity of the network populations.

The quantitative analysis we will perform is based on the feature vectors derived from the networks. We conduct a comparative analysis applying a hierarchical clustering to investigate similarities between feature vectors. This allows us to conclude about the similarities of the underlying networks and, hence, about the similarities of the populations. We want to re-emphasize that our focus is on the properties of the network populations rather than on individual networks. This difference is crucial because we do not aim to derive results about individual networks but for the population.

Finding modules

In recent years, many algorithms have been introduced for finding the community or module structure in networks which are based on a variety of different principles and approaches [47][50]. In this paper, we use the edge-betweenness algorithm introduced in [51] for finding the modules in the networks we study. This method is probably the best studied module algorithm. Together with the measure modularity, Inline graphic, it has been widely used for analyzing biological networks [52][56]. The principle working mechanism of the edge-betweenness algorithm is to start from a connected network and remove successively edges with the highest edge-betweenness values, i.e., edges that occur on many shortest paths. The idea is that edges connecting separate modules are more likely to have high edge betweenness values because paths between modules must pass through them. Successively removing edges with the highest edge betweenness values results in a hierarchical tree of network components. The optimal partitioning of the network is obtained by finding the optimal cut of this tree. This is accomplished by using an optimization function, called modularity Inline graphic. Application of this algorithm results in a non-overlapping module structure meaning that each node in the network is allowed to belong to exactly one module.

Module measures

In order to characterize the modular structure of a network Inline graphic, found by the application of a partitioning method Inline graphic, we use Inline graphic different measures introduced in the following. Some of these measures bear a resemblance to indices frequently used for the analysis of biological or chemical networks [57][61]. The motivation for the selection of the following measures is to obtain a heterogeneous set of network-based measures because we will use them as components of feature vectors.

For each network Inline graphic we determine the number of communities Inline graphic and its modularity value Inline graphic found by application of the partitioning algorithm Inline graphic [51]. These measures provide a course overview of the network structure. To obtain more detailed information we calculate Inline graphic additional measures which are based on the connectivity matrix of the modules, Inline graphic, of the module structure of Inline graphic. The components of Inline graphic, for Inline graphic and Inline graphic, give the number of connections between nodes in module Inline graphic to nodes in module Inline graphic. All self-connections, Inline graphic, are set to zeros. In the following we consider only undirected networks Inline graphic, hence, Inline graphic is a symmetric matrix. In addition, we calculate a vector Inline graphic whose components Inline graphic correspond to the number of nodes in module Inline graphic. From these auxiliary measures we obtain further measures. We want to remark that the matrix Inline graphic can be considered as (weighted) adjacency matrix of a new network Inline graphic, whose nodes correspond to modules. We call Inline graphic the module network. The reason for this is, formally, Inline graphic can be seen as the result from a functional mapping from Inline graphic, namely, Inline graphic. The last equation illustrates that the application of a method for finding a community structure in Inline graphic leads to a new network Inline graphic. This is illustrated in Fig. 2. The measures defined in the following are calculated for module networks Inline graphic in which one node corresponds to one module, as defined above. Due to the fact that we apply these measures to the module network Inline graphic, and not to the network Inline graphic, we enforce these measures to capture module specific information because Inline graphic represents explicitly the modules found by the partitioning algorithm Inline graphic.

Figure 2. Mapping from the unweighted network Inline graphic to the weighted network Inline graphic by application of a partitioning method Inline graphic.

Figure 2

The numbers next to the edges of the module network Inline graphic refer to the values of the edge weights which correspond to the number of connections between nodes from module Inline graphic to module Inline graphic, i.e., Inline graphic.

We define the relative size, Inline graphic, of the largest module with respect to the size of the network,

graphic file with name pone.0035531.e045.jpg (1)
graphic file with name pone.0035531.e046.jpg (2)

Here Inline graphic gives the number of nodes found in the largest module and Inline graphic is the total number of nodes in the network Inline graphic. Further, we determine the normalized entropy of the module connectivity, Inline graphic, given by

graphic file with name pone.0035531.e051.jpg (3)

with

graphic file with name pone.0035531.e052.jpg (4)
graphic file with name pone.0035531.e053.jpg (5)
graphic file with name pone.0035531.e054.jpg (6)

whereby Inline graphic is twice the number of connections between all modules and Inline graphic corresponds to the Shannon Entropy [62]. The measure Inline graphic allows to get an impression of the connectivity among the different modules.

Next, we calculate the mean normalized module-wise entropy, Inline graphic, by

graphic file with name pone.0035531.e059.jpg (7)
graphic file with name pone.0035531.e060.jpg (8)
graphic file with name pone.0035531.e061.jpg (9)
graphic file with name pone.0035531.e062.jpg (10)

Both entropy measures are normalized because Inline graphic, due to the factors that we included in their definitions. The difference between the normalized entropy of the module connectivity, Inline graphic, and the mean normalized module-wise entropy, Inline graphic, is that for Inline graphic we calculate for each community Inline graphic a probability value, Inline graphic, based on its total connectivity to all other modules. In contrast, Inline graphic is obtained by calculating a probability vector, Inline graphic, with Inline graphic, for each community Inline graphic. Hence, both measures focus on different structural aspects and for this reason have different discriminative properties with respect to the modular structure of the communities.

Finally, the normalized mutual information (nMI) [63] is defined as

graphic file with name pone.0035531.e073.jpg (11)
graphic file with name pone.0035531.e074.jpg (12)

with Inline graphic, Inline graphic and Inline graphic. Because the module matrix Inline graphic is symmetric for undirected networks Inline graphic, Inline graphic holds. Briefly, Eqn 12 can be written as Inline graphic, whereas Inline graphic is the Shannon Entropy.

In order to illustrate the numerical usage of our measures we present in Fig. 3 an example. Suppose we have an undirected, unweighted network Inline graphic and application of an algorithm Inline graphic for community finding results in the shown results. Here each node in the network corresponds to a module which may consists of a variable number of nodes, indicated by a varying size of these nodes. Let's call these weighted network Inline graphic, because it describes the structural connectivity among the modules found in Inline graphic. The modules are numbered from Inline graphic to Inline graphic and Inline graphic gives the number of connections between module Inline graphic and Inline graphic. For instance in Fig. 3, module Inline graphic is connected to module Inline graphic via 18 links. These links are obtained by using the partitioning which is found by application of Inline graphic to Inline graphic, and its corresponding adjacency matrix (not shown). For reasons of simplicity we represent the number of links as weights of edges, instead of Inline graphic individual links. When two modules are not directly connected then the corresponding component of Inline graphic is zero, e.g., Inline graphic. Numerically, we obtain for the example shown in Fig. 3:

graphic file with name pone.0035531.e099.jpg (13)
graphic file with name pone.0035531.e100.jpg (14)
graphic file with name pone.0035531.e101.jpg (15)
graphic file with name pone.0035531.e102.jpg (16)
graphic file with name pone.0035531.e103.jpg (17)

Figure 3. An illustrative examples to demonstrate the usage of our measures (see text).

Figure 3

Different modules are shown as colored nodes.

Network data and models

Table 1 provides an overview of the Inline graphic protein interaction networks we are using for our analysis. These data are taken from the BioGrid (BG) and IntAct (IA) database [39], [40]. GCC shown in the fifth column corresponds to the giant connected component of the respective network which is the size of a subnetwork with the property that any two nodes are connected via an undirected path. In addition to the protein networks, table 1 contains also three non-biological networks. Specifically, one technological (power grid) and two social networks (netscience and hep-th) are also used in our analysis. ‘Power grid’ is the Western States power grid network, ‘Netscience’ represents a coauthorship network of scientists working on network theory, and ‘hep-th’ is a coauthorship network between scientists posting preprints on the high-energy theory e-print archive. The data for these networks are obtained from [24], [64], [65]. The merit for including these networks in our analysis will become clear in the results section.

Table 1. Overview of networks used in our analysis.

network type Inline graphic Inline graphic density GCC
Arabidopsis thaliana (BG) 1675 2953 0.00210 1212
Homo sapiens (BG) 8429 29321 0.00082 8114
Mus musculus (BG) 545 490 0.00330 141
Drosophila melanogaster (BG) 7034 22222 0.00089 6907
Caenorhabditis elegans (BG) 2806 4457 0.00113 2575
Saccharomyces cerevisiae (BG) 5620 53309 0.00337 5611
Schizosaccharomyces pombe (BG) 1411 2478 0.00249 1313
Escherichia coli (DIP) 2856 6712 0.00164 2159
Helicobacter pylori (DIP) 1066 1415 0.00249 976
Mycoplasma pneumoniae (IA) 415 735 0.00855 375
Rattus norvegicus (IA) 1232 1421 0.00187 1095
Western States Power Grid 4941 6594 0.00054 4941
Coauthorship Netscience 1589 2742 0.00217 379
Collaboration Hep-th 8361 15751 0.00045 5835

The first Inline graphic networks are protein networks and the bottom Inline graphic are technological and social networks. The columns refer to the number of nodes (Inline graphic) and edges (Inline graphic) in the network, density is the edge density and GCC is the giant connected component. BG: BioGrid database, IA: IntAct database.

We would like to remark that using the GCC of the protein interaction networks has the positive side effect to serve as a denoising of the network data. That means due to the fact that none of the available PPI networks is neither complete (comprises all proteins) nor error free nor unbiased with respect to the coverage of the biological processes, certain parts of the PPI networks are more reliable than others [66]. Here with reliable we mean having a lower error as measured by the false negative and false positive rate of interactions. Due to the fact that the GCC has the property, among others, to be the largest connected subnetwork, it appears to represent such a lower-error region in the network compared to, e.g., an unconnected subnetwork composed of many separate protein complexes and interactions. However, an explicit quantification of this effect is currently difficult because it would require to introduce assumptions about the occurring errors and their estimates.

In addition to the above (real) networks, we generate three types of synthetic networks with three different algorithms. The first two algorithm are very popular models to emulate the evolution of protein networks, namely the network gene duplication (NGD) model [36], [37] and the duplication-mutation complementation (DMC) [38] model. Briefly, the NGD model starts with a very small number of genes and connections among them, selects one of these genes randomly and makes a copy thereof including its connections to other genes. This corresponds to the introduction of a new gene to the genome. Then two probabilistic mechanisms are applied separately to emulate the divergence of these two genes. The first mechanism consists of a deletion of common links (with probability Inline graphic) and the second establishes new connections between the new gene and the rest of the genome (with probability Inline graphic). A schematic visualization of the three steps of the network gene duplication model are shown in Fig. 4. In the first step, gene Inline graphic and all its connections are duplicated, resulting in the gene highlighted in grey. In the second step, the three common edge pairs, highlighted in blue, red and green are independently tested and one randomly selected edge of each pair is deleted with probability Inline graphic. In the third step, the new gene receives with probability Inline graphic a new interaction to an existing gene. The dashed edges in Fig. 4 correspond to these potential new edges. A summary of the networks and their characteristics we generated with the NGD model, which we use for our analysis, can be found in table 2. In table 3 we provide the corresponding model parameters for their generation. The second algorithm we use models also the evolution of protein networks and is called the duplication-mutation complementation (DMC) [38] model. The DMC model is very similar to the NGD model, sharing the first two steps, namely gene duplication and edge deletion; step 1 and 2 in Fig. 4. However, the third step of the DMC model (step 3′) is different, consisting in the connection of the new gene with its template gene with probability Inline graphic. That means this model does not allow to create new connections to other genes already present. Regarding the selection of the parameters of the NGD and the DMC model, we choose Inline graphic (number of genes) to cover the size of the protein networks we use in our analysis listed in table 1. Also for the probabilities Inline graphic and Inline graphic we choose values to result in edge densities that are comparable to the protein networks. In the ‘Results’ section, we provide an additional discussion of the chosen parameters.

Figure 4. A schematic visualization of the three steps of the network gene duplication model (steps 1, 2 and 3: NGD) and the duplication-mutation complementation model (steps 1, 2 and 3′: DMC).

Figure 4

The colored and dashed edges highlight links or potential links effected with probability Inline graphic or Inline graphic.

Table 2. Overview of SCN, NGD and DMC networks.

network type Inline graphic Inline graphic density GCC
SCN1 2500 8904.6 0.00285 2500.0
SCN2 2500 8995.0 0.00287 2497.4
SCN3 3000 8271.6 0.00183 3000.0
SCN4 3000 12491.2 0.00277 2995.0
NGD1 500 1069.2 0.00857 461.6
NGD2 1000 2183.2 0.00437 826.8
NGD3 1500 7713.4 0.00686 1487.6
NGD4 1500 4495.4 0.00399 1058.2
NGD5 2000 12658.8 0.00633 1988.6
NGD6 2000 14332.2 0.00716 1988.4
NGD7 3000 28013.4 0.00622 2997.2
NGD8 7000 9056.0 0.00369 5826.6
DMC1 3000 5006.0 0.00112 3000.0
DMC2 1000 1423.0 0.00284 1000.0
DMC3 2000 5731.0 0.00286 2000.0
DMC4 3000 5018.5 0.00111 3000.0
DMC5 3000 5003.0 0.00112 3000.0

The shown network measures are averaged over Inline graphic networks.

Table 3. Parameters used to generate the NGD and DMC networks shown in table 2.

network name N Inline graphic Inline graphic
NGD1 500 0.0100 0.75
NGD2 1000 0.0050 0.70
NGD3 1500 0.0070 0.90
NGD4 1500 0.0050 0.70
NGD5 2000 0.0050 0.80
NGD6 2000 0.0050 0.70
NGD7 3000 0.0050 0.80
NGD8 7000 0.0005 0.95
DMC1 3000 0.47 0.80
DMC2 1000 0.22 0.85
DMC3 2000 0.27 0.65
DMC4 3000 0.52 0.80
DMC5 3000 0.12 0.80

The third algorithm we use to generate networks was introduced in [67] for generating a test set of networks with known community structure. The algorithm itself is not based on a biologically plausible mechanism that would correspond to an interpretable genomics mechanism, but serves purely as a benchmark generator. We name networks generated with this method synthetic community networks (SCN). In table 2 we show an overview of SCN networks we use for our analysis and table 3 provides the corresponding model parameters.

Results

In order to perform a numerical analysis to investigate the similarity respectively dissimilarity between protein networks and either synthetically generated or technological and social networks, we calculate for each of these networks Inline graphic different network-based features, Inline graphic, as described in the ‘Methods’ section. All networks we are using in the following analysis are listed in table 1 and 2. For each of the SCN and NGD networks we generated Inline graphic, and for each of the DMC parameter settings Inline graphic different networks in order to capture the variability of the stochastic process underlying each of the three network models. This gives a total of Inline graphic networks we are using in our study, namely, Inline graphic PPI networks, Inline graphic synthetic and Inline graphic technological and social networks.

From the Inline graphic features Inline graphic, we identify by an exhaustive search the best performing subset with the highest discriminatory power to separate PPI networks and gene duplication networks from each other. From this analysis, we found the Inline graphic-dimensional feature vector

graphic file with name pone.0035531.e138.jpg (18)

to perform best. The resulting hierarchical clustering is shown in Fig. 5. As distance measure for the feature vectors we used the Canberra distance [68]

graphic file with name pone.0035531.e139.jpg (19)

and for the agglomerative clustering we used the Mcquitty method [69] which joins clusters if they are reciprocally similar to each other.

Figure 5. Hierarchical clustering of all Inline graphic networks used in our analysis.

Figure 5

The color of the eight clusters from left to right (for a discussion see text): purple, gray, green, brown, magenta, blue, gold, red. The ‘NetType’ refers to the four principle network types: 1- synthetic networks (SCN), 2 - PPI networks, 3 - non-biological networks, 4 - network gene duplication networks (NGD), 5 - duplication mutation complementation networks (DMC).

The dendrogram in Fig. 5 consists of Inline graphic principle clusters (branches), each consisting of related networks, e.g., SCN, NGD, DMC or PPI networks. The colors of these eight clusters are from left to right: purple, gray, green, brown, magenta, blue, gold and red. For example, the left most cluster (purple) contains only SCN networks of various types (as defined in table 3). Similarly, the gray and the blue clusters consist only of NGD networks (for parameters see table 4). Also the PPI networks form two distinguished clusters shown in green and magenta, containing Inline graphic of the Inline graphic PPI networks. Interestingly, the remaining Inline graphic PPI networks are clustered together with the Inline graphic non-biological networks, namely, the coauthorship networks (Netscience and Hep-th) and the power grid (brown and red cluster). We want to remark that we repeated the above analysis also for other clustering methods, e.g., complete-linkage and Ward's method, and received qualitatively similar results to the presented ones.

Table 4. Parameters used to generate the synthetic community networks (SCN) shown in table 2.

network name N k maxk mu minc maxc
SCN1 2500 6 150 0.10 4 50
SCN2 2500 6 180 0.15 4 200
SCN3 3000 6 50 0.10 20 50
SCN4 3000 8 60 0.05 10 150

For explanation of the parameters the reader is refered to [67].

From the dendrogram in Fig. 5 follow two important observations. First, the Inline graphic network-based measures we employ to characterize the module structure of a network, result in feature vectors that allow a very good separation of the different network classes (populations). The class of a network can be seen as a label that assigns a network to a specific network category. Due to the fact that clustering analysis represents a form of unsupervised learning [70] these network labels haven't been used for this analysis. Hence, our feature vectors would allow to discover network classes, without a training sample, in a predictive manner. This demonstrates that our feature vectors, respectively the network-based measures, capture sensible information about the module structure of the networks that corresponds with an intuitive separation of them. Here, it is important to emphasize that the resulting clustering provides the intuitive grouping of network classes, although, the components of the feature vectors are abstract entities which may have no intuitive appeal at first sight. It is also interesting to highlight that the protein interaction network of Mus musculus is closest to the coauthorship network ‘Netscience’. From table 1, one can see that the GCC of Mus musculus is the smallest of all PPI networks and, hence, a largely incomplete network, because the available PPI network contains Inline graphic proteins of which only Inline graphic are in the GCC (see table 1), whereas the estimated number of genes for mouse is Inline graphic. Further, Arabidopsis thaliana is the only PPI network from a plant in our analysis. This distinctiveness may explain the separation of these two PPI networks from the two principle PPI clusters in Fig. 5 (green and magenta cluster). On the other hand, due to the fact that the PPI networks of Mus musculus and Arabidopsis thaliana are not randomly scattered in the dendrogram, there seems to be a clear signal with respect to the underlying characteristics of the population of PPI networks that is detactable by our feature vectors, even for such atypical networks. This is also the reason for using the PPI networks of Mus musculus and Arabidopsis thaliana in our analysis, because it allows for an indirect test of our assumption we made about the GCC, as explained in the section ‘Network data and models’.

The second interesting observation from Fig. 5 is that the networks generated with the duplication-mutation complementation (DMC) model are similar to Inline graphic of the Inline graphic PPI networks, namely the PPI network of Mus musculus and Arabidopsis thaliana (brown cluster). In contrast, the networks generated with the NGD model, appear not to resemble important structural characteristics of the PPI networks, because otherwise these networks would not be assigned to distinct branches of the clustering but would be found in the same clusters as the PPI networks. Here, the fact that there exist parameter settings of the network gene duplication model that lead to very different network structures which can be discriminated easily, as the purple cluster on the left-hand side shows, is not as important as the none existence of a parameter setting that would lead to a common cluster consisting of PPI and NGD networks.

We tried many different combinations of probabilities to add and delete links for the NGD and DMC model, as controlled by the two model parameters Inline graphic and Inline graphic, to explore the parameter space of these two models, however, none resulted in clusters that would be significantly different from the presented ones. Another interesting observation in this respect is that the second type of synthetic networks, indicated by SCN2, leads to network structures that are quite similar to the NGD8 networks (gold cluster). This is interesting because the model that underlies the SCN networks hasn't been conceived with the purpose to produce biologically plausible results. Instead, the underlying idea was solely to generate a set of benchmark networks with a know module structure [67].

It is amazing to see that the coauthorship network (Netscience) and the power grid resemble the module structure of PPI networks similarly good as the networks generated with the DMC model. This motivates also the reason for including them in the analysis because this finding hints that naturally generated networks are significantly different to mathematically constructed networks. Here, we consider the SCN, NGD and DMC networks as mathematically constructed.

The contribution of individual structural features

The above analysis is based on a Inline graphic-dimensional feature vector, namely, Inline graphic). In order to gain insights into the differences of the structural properties of PPI networks and gene duplication networks, we conduct an analysis to estimate the contribution of individual features to the separation of these networks. We start from a set of Inline graphic different features (Inline graphic) and eliminate subsets thereof. Specifically, we eliminate up to Inline graphic features from (Inline graphic) which gives a total number of

graphic file with name pone.0035531.e160.jpg (20)

different feature vector combinations, Inline graphic. Here, a Inline graphic corresponds to such a feature vector. The hierarchical clustering for each of these Inline graphic is assessed by a clustering score. This score is additively defined over all branches in a hierarchical clustering, whereas for each branch we calculate the product of the number of PPI networks and the number of NGD and DMC networks. That means branches with a mixed population of networks have a high score and branches with only one network type have a score of zero. Coarsely speaking, the clustering score quantifies the mixing of the PPI and gene duplication networks and a perfect score of zero corresponds to a perfect separation of these networks.

The results of this analysis is shown in the left Fig. 6. Here the x-axis corresponds to the indexed feature vector combinations, Inline graphic, and the y-axis is the normalized clustering score. The color of the dots indicate the number of features used for the hierarchical clustering. From left to right: blue (7 features), green (6 features), orange (5 features), purple (4 features) and brown (3 features). From Fig. 6 one can see that there is just one feature vector combination that leads to a minimal score of Inline graphic, which means that there only a few branches in the hierarchical clustering where PPI and gene duplication networks can be found together. This corresponds to the hierarchical clustering shown in Fig. 5.

Figure 6.

Figure 6

Left: Clustering score in dependence on the index of the used feature vectors. The colors indicate the number of features used for the hierarchical clustering. From left to right: blue (7 features), green (6), orange (5), purple (4), brown (3). The red surrounded dot (index Inline graphic) corresponds to the lowest score that was obtained for the features: Inline graphic. Right: Complementary cumulative distribution function (CCDF) for each of the eight features in dependence on the score.

To obtain a quantification for the contribution of individual features on these results, we conduct the following analysis. First, we estimate for each of the Inline graphic features Inline graphic its score density, i.e., Inline graphic. That means if a feature has not been used for a clustering, we use the obtained score for this clustering for a density estimation. Then, we calculate from the score densities the cumulative distribution function (CDF), Inline graphic, and the complementary cumulative distribution function (CCDF),

graphic file with name pone.0035531.e172.jpg (21)

The meaning of Inline graphic is as follows: If feature vector Inline graphic is not part of a feature vector combination than the probability of observing a score larger than Inline graphic is Inline graphic, i.e.,

graphic file with name pone.0035531.e177.jpg (22)

The results for the Inline graphic different complementary cumulative distribution functions are shown in the right Fig. 6. For any selected value of the score, one observes always the same two CCDFs with the highest probability, highlighted in green and red. These two CCDFs belong to the features Inline graphic (green) and Inline graphic (red). Hence, the absence of these features in a feature vector leads always to a higher probability to observe higher clustering scores. In other words, if the features Inline graphic or Inline graphic are not considered for a hierarchical clustering, the discriminative power of any feature set is compromised. However, due to the closeness of all Inline graphic CCDFs, see Fig. 6, this effect is not strong enough to claim that only these two features are sufficient to result in a clustering with a low or even the lowest score. This is confirmed by a numerical analysis which gives a score of Inline graphic for the feature vector Inline graphic.

This analysis demonstrates that there is no individual structural property in a network that has enough discriminatory power to allow separating branches of PPI and gene duplication networks. Instead, the combination of a variety of different features is needed.

Connecting data with models

Finally, we discuss the rational behind our analysis, which will also shed light on the robustness and interpretation of our results. In order to simplify the following discussion, a visual summary is presented in Fig. 7. First, we assume the existence of a (evolutionary) process that leads to the emergence of different species. For our discussion a species is represented by its underlying PPI network. An evolutionary process we consider abstractly as a model Inline graphic, which, depending on a set of parameters Inline graphic, generates PPI networks. The entity of all possible PPI networks that can be generated from the model Inline graphic constitutes the population of PPI networks. From each PPI network Inline graphic we can derive a feature vector Inline graphic, of a certain dimension, whose components represent properties of Inline graphic. This leads to the population of feature vectors that represents these properties for the whole population of PPI networks. It is important to note that we assume the dimension of Inline graphic to be finite. For this reason, there is an unidirectional mapping from Inline graphic to a feature vector Inline graphic which means that the properties given by Inline graphic may not be sufficient to reconstruct the network itself. Theoretically, one can assume that these feature vectors are drawn from an unknown probability distribution, Inline graphic. However, neither Inline graphic nor Inline graphic are known or observable. The data available for an analysis are the PPI networks from a (small) number of different species. These PPI networks can be seen as a sample from the population of all possible PPI networks from all species obtained experimentally. From this sample of networks, one can derive feature vectors which constitutes theselves a sample. We would like to note that the sample of feature vectors Inline graphic is only given indirectly via the observation of Inline graphic because Inline graphic is not accessible experimentally. In summary, we assume a mapping from an evolutionary model that constitutes the population of PPI networks, which is unobservable, to a sample of feature vectors, each representing one species. The practical merit from this mapping is that feature vector can be statistically studied within the framework of multivariate analysis. The model visualized in Fig. 7 describes exactly the way feature vectors were obtained for both synthetically generated network models. For example, in the case of the network gene duplication model we have Inline graphic with Inline graphic.

Figure 7. A schematic visualization of the connection between the underlying model Inline graphic to generate networks and the feature vectors Inline graphic, used in our study.

Figure 7

From the above model follow several implications that are important for the interpretation of our results shown in Fig. 5. First, we do not need to make detailed assumptions about the evolutionary model. That means it could be just one model or three different models, e.g., one for Bacteria, Archaea, and Eukaryota. Instead, it is enough to assume that a PPI network is represented by a random vector. From the clustering of feature vectors shown in Fig. 5 one can see that the Inline graphic different PPI networks are not randomly scattered, but clustered together in Inline graphic clusters. This is an indicator that despite the differences that certainly exist among the individual PPI networks, respectively the underlying species, they are more similar among each other than with other network types. As a side note, this could imply that they are describable by just one underlying evolutionary model Inline graphic to represent all PPI networks, but with different parameter values for different species. The fact that the PPI networks are not clustered in just one branch is no counter argument against this, because also the SCN and NGD networks, of which we know they are generated from the some underlying model, are distributed over several clusters.

More important for our analysis is the effect of the errors in the PPI networks. It is clear that none of the available PPI networks is error free, either missing true positive connections among proteins or, probably less likely, included false positive connections. In order to estimate these errors explicitly one would need to introduce a specific error model, which is based on assumptions. However, our framework does not require us to explicate such assumptions. More specifically, it is known that the errors in the PPI networks (false positives, false negatives) were created in a biased manner [66] effecting all networks. With respect to Fig. 5 this corresponds to a mechanism that influences the mapping

graphic file with name pone.0035531.e209.jpg (23)

in the following way

graphic file with name pone.0035531.e210.jpg (24)

This leads to the actually observed networks Inline graphic, which are different to the (true) PPI networks Inline graphic. Due to the fact that our analysis is comparative, based on a hierarchical clustering, investigating networks with respect to their similarity to other networks rather than individually, the effect of the presence of errors in the PPI networks is alleviated, when all PPI networks are approximately homogeneously effect by errors. If the differences between the errors on the feature vectors would be severely heterogeneous, we would not be able to observe clustered PPI networks, but they would be randomly scattered in the dendrogram. Hence, the presence of clusters of PPI networks in the dendrogram supports the claim that all PPI networks are similarly effected by errors and also that the noise level in these networks is smaller than the signal, as captured by the feature vectors, because otherwise there would be no meaningful clustering possible leading to discernible separations of networks of different classes.

Discussion

In this paper, we studied the question if gene duplication models allow to generate networks with a module structure that resembles the module structure one can find in experimentally obtained protein interaction networks. The results from our clustering analysis revealed the existence of different structural features on the module level, the NGD model and the DMC model exhibit compared with biological protein interaction networks and, hence, demonstrate limitations of these models [71]. We want to emphasize that we studied not only the parameter settings of the models listed in table 2, but many more. However, none resulted in qualitatively different results. This points to a general limitation of these models in the description of the evolution of modules in protein networks.

We would like to highlight that we found from our analysis the duplication-mutation complementation model [38] to be a better model for PPI networks than the network gene duplication model, at least for the two protein networks of Mus musculus and Arabidopsis thaliana. This difference in these two models hints that the new establishment of functional interactions between a new gene and genes already present in a genome (step 3 in Fig. 4) are less beneficial than their neglect (setp 3′ in Fig. 4). Hence, this information could be utilized to revise an evolutionary duplication model, e.g., in combination with other biological mechanisms. For example, higher-order extensions with respect to the number of duplicated genes may be needed to rectify the obtained module structure of the resulting networks. Biologically, there is a multitude of genomic mechanism ranging from whole-genome duplication (polyploidy) to a restricted duplication of chromosomal regions [43], [73][75] that provide ample opportunity for exploring plausible modifications of an extended duplication model.

From analyzing the influence individual features have on the separation of PPI and gene duplication networks, we found that there is no single structural feature that possess sufficiently discriminatory power to accomplish such a separation, but one needs a combination of several features. As most influential features we identified the relative size of the largest module (Inline graphic) and the normalized mutual information (Inline graphic). Interestingly, for PPI networks one finds a negative correlation coefficient of Inline graphic between the values of Inline graphic and Inline graphic. For NGD networks their correlation is Inline graphic and for DMC networks Inline graphic. Further, the correlation between the number of modules (Inline graphic) and the modularity (Inline graphic) is for PPI networks negative (Inline graphic) and for NGD and DMC networks positive (Inline graphic and Inline graphic). This indicates structural differences of modules in the two gene duplication models, but also between these models and the PPI networks. Interestingly, for gene duplication networks an increasing number of modules (Inline graphic) is associated with an increasing modularity (Inline graphic), whereas for PPI networks it is associated with a decreasing modularity.

The absence of individual features, allowing a separation of PPI and gene duplication networks, is not surprising because the module structure of networks corresponds to a mesoscopic level of description. This implies that systems properties are playing an important role which can not be reduced to individual proteins or features. Hence, our results are in accordance with the view of evolutionary systems biology [72] considering evolution as a high-dimensional process.

The NGD model and the DMC model have been studied numerously over the last few years and demonstrated to reproduce several features that are in accordance with protein networks [35], [76][80]. However, these studies focused either on global properties of protein networks, e.g., degree distributions, studied individual protein networks only or investigated network motifs. Instead, in this paper we studied the module structure of protein interaction networks, which is generally believed to play a key role in the functional understanding of an organism. Another difference is that in our analysis we did not focus on individual protein networks but we considered all protein networks to belong to a population (or a sample thereof). This is an important difference because it allows to capture biological variability that is inevitably present in protein interaction networks from different organisms as well as in any stochastic process that generates networks like the NGD or DMC model. Hence, conducting a comparative, instead of an individual protein interaction network analysis allows to borrow strength from different members of the population (sample) to alleviate errors. As a direct consequence thereof, the basic entities of our analysis are the resulting branches of the clustering and their composition and not the position of individual networks. A related, yet different aspect of our approach and the fact that a clustering analysis performs a comparative analysis is that our study does not aim to provide precise estimates for specific network statistics, e.g., by means of interval estimators and their corresponding confidence intervals. The latter would require the introduction of additional assumptions and a different methodology, specifically adopted to the characteristics of the studied networks; see [81] as an example for such a study. In general, a clustering analysis is considered as an exploratory analysis which provides a valuable comprehension into the pattern of data without the need of making strong assumptions [82]. Hence, the simplicity of our approach is that it requires only a minimum of assumptions compared to more elaborate methodological approaches, e.g., confirmatory methods [83], and, hence, constitutes in the light of our limited knowledge about the evolution of protein interaction networks a sensible first step to gain insight into the complex and important module structure of protein networks.

We would like to emphasize that the modularity of a network is just one property of a network, like the degree distribution or the average path length. For this reason, it would be interesting to study further network properties of the network gene duplication model, the duplication-mutation complementation model and protein interaction networks to see if there are additions differences between these networks. Using the module structure was guided by biological considerations, however, one could also approach this problem from a more theoretical perspective probing different network characteristics. If such an abstract distinctive network property could be found, it would be interesting to think about a biological elucidation for this effect. The potential gain from such an analysis could be to discover novel biological features that may have been overlooked so far, because only properties with a clear biological interpretation have been studied.

Despite the fact that the primary concern of this paper is a biological topic in evolutionary biology the similarity of module structures between PPI networks and the coauthorship networks (Netscience and Hep-th) and the power grid is interesting. It hints that much can be learned from analyzing networks from different origin and different disciplines [84].

Acknowledgments

We would like to thank Galina Glazko, Ricardo de Matos Simoes and Shailesh Tripathi for fruitful discussions. For our numerical simulations we used R [85] and for the visualization of the networks the igraph package [86]. Protein networks are obtained from the BioGrid (BG) and IntAct (IA) database [39], [40], technological and social networks are from [24], [64], [65]. The synthetic community networks (SCN) have been generated with an algorithm from [67].

Footnotes

Competing Interests: The author has declared that no competing interests exist.

Funding: Parts of this project are supported by the Department for Employment and Learning through its “Strengthening the all-Island Research Base" initiative. No additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Dobzhansky T. Genetics and the Origin of Species. New York: Columbia University Press, New York; 1937. [Google Scholar]
  • 2.Felsenstein J. Inferring Phylogenies. Sunderland, Mass.: Sinauer Associates; 2003. [Google Scholar]
  • 3.Futuyma D. Evolution. Sunderland, Ma.: Sinauer Associates; 2005. [Google Scholar]
  • 4.Gould S. The Structure of Evolutionary Theory. Cambridge: Belknap Press; 2002. [Google Scholar]
  • 5.Koonin E, Galperin M. Sequence - evolution - function : computational approaches in comparative genomics. Boston: Kluwer Academic; 2003. [PubMed] [Google Scholar]
  • 6.Mayr E. Evolution and the Diversity of Life : Selected Essays. Cambridge, MA: Belknap Press; 1976. [Google Scholar]
  • 7.Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Emmert-Streib F. Algorithmic computation of knot polynomials of secondary structure elements of proteins. Journal of Computational Biology. 2006;13:1503–1512. doi: 10.1089/cmb.2006.13.1503. [DOI] [PubMed] [Google Scholar]
  • 9.Emmert-Streib F, Mushegian A. A topological algorithm for identification of structural domains of proteins. BMC Bioinformatics. 2007;8:237. doi: 10.1186/1471-2105-8-237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Higgins DG, Sharp PM. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988;73:237–244. doi: 10.1016/0378-1119(88)90330-7. [DOI] [PubMed] [Google Scholar]
  • 11.Holm L, Sander C. Mapping the Protein Universe. Science. 1996;273:595–602. doi: 10.1126/science.273.5275.595. [DOI] [PubMed] [Google Scholar]
  • 12.Mushegian A. Foundations of Comparative Genomics. Amsterdam; Boston: Academic Press; 2007. [Google Scholar]
  • 13.Emmert-Streib F, Glazko G. Network biology: A direct approach to study biological function. Wiley Interdiscip Rev Syst Biol Med. 2011;3:379–391. doi: 10.1002/wsbm.134. [DOI] [PubMed] [Google Scholar]
  • 14.Barabási AL, Oltvai ZN. Network biology: Understanding the cell's functional organization. Nature Reviews. 2004;5:101–113. doi: 10.1038/nrg1272. [DOI] [PubMed] [Google Scholar]
  • 15.Wagner A. Robustness and Evolvability in Living Systems. Princeton University Press; 2007. [Google Scholar]
  • 16.Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA. Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology. 2004;14:283–291. doi: 10.1016/j.sbi.2004.05.004. [DOI] [PubMed] [Google Scholar]
  • 17.Ciliberti S, Martin O, Wagner A. Robustness can evolve gradually in complex regulatory networks with varying topology. PLoS Computational Biology. 2007;3:e15. doi: 10.1371/journal.pcbi.0030015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Wagner A. Evolutionary constraints permeate large metabolic networks. BMC Evolutionary Biology. 2009;9:231. doi: 10.1186/1471-2148-9-231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Erdös P, Rényi A. On random graphs. I. Publicationes Mathematicae. 1959;6:290–297. [Google Scholar]
  • 20.Solomonoff R, Rapoport A. Connectivity of random nets. Bulletin of Mathematical Biophysics. 1951;13:107–117. [Google Scholar]
  • 21.Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;206:509–512. doi: 10.1126/science.286.5439.509. [DOI] [PubMed] [Google Scholar]
  • 22.Durrett R. Random Graph Dynamics. Cambridge; New York: Cambridge University Press; 2006. [Google Scholar]
  • 23.Przulj N, Corneil DG, Jurisica I. Modeling interactome: scale-free or geometric? Bioinformatics. 2004;20:3508–3515. doi: 10.1093/bioinformatics/bth436. [DOI] [PubMed] [Google Scholar]
  • 24.Watts D, Strogatz S. Collective dynamics of ‘small-world’ networks. Nature. 1998;393:440–442. doi: 10.1038/30918. [DOI] [PubMed] [Google Scholar]
  • 25.Bornholdt S, Schuster H, editors. Handbook of Graphs and Networks: From the Genome to the Internet. Wiley-VCH; 2003. [Google Scholar]
  • 26.Dehmer M, Emmert-Streib F, editors. Analysis of Complex Networks: From Biology to Linguistics. Weinheim: Wiley-VCH; 2009. [Google Scholar]
  • 27.Newman M, Barabasi AL, Watts D. The Structure and and Dynamics of Networks. Princeton University Press; 2006. [Google Scholar]
  • 28.Ma HW, Kumar B, Ditges U, Gunzer F, Buer J, et al. An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Res. 2004;32:6643–6649. doi: 10.1093/nar/gkh1009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Shen-Orr S, Milo R, Mangan S, Alon U. Network motifs in the transcriptional regulatory network of Escherichia coli. Nat Genet. 2002;31:64–68. doi: 10.1038/ng881. [DOI] [PubMed] [Google Scholar]
  • 30.Wagner G, Pavlicev M, Cheverud J. The road to modularity. Nature Reviews Genetics. 2007;8:921–931. doi: 10.1038/nrg2267. [DOI] [PubMed] [Google Scholar]
  • 31.Palla G, Derenyi I, Farkas I, Viscsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435:814–818. doi: 10.1038/nature03607. [DOI] [PubMed] [Google Scholar]
  • 32.Emmert-Streib F, Dehmer M. Information processing in the transcriptional regulatory network of yeast: Functional robustness. BMC Systems Biology. 2009;3:35. doi: 10.1186/1752-0509-3-35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Wang X, Dalkic E, Wu M, Chan C. Gene module level analysis: identification to networks and dynamics. Current Opinion in Biotechnology. 2008;19:482–491. doi: 10.1016/j.copbio.2008.07.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Levy E, Pereira-Leal J. Evolution and dynamics of protein interactions and networks. Current Opinion in Structural Biology. 2008;18:349–357. doi: 10.1016/j.sbi.2008.03.003. [DOI] [PubMed] [Google Scholar]
  • 35.Wagner A. How the global structure of protein interaction networks evolves. Proceedings of the Royal Society of London Series B: Biological Sciences. 2003;270:457–466. doi: 10.1098/rspb.2002.2269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Chung F, Lu L, Dewey T, Galas D. Duplication Models for Biological Networks. Journal of Computational Biology. 2003;10:677–688. doi: 10.1089/106652703322539024. [DOI] [PubMed] [Google Scholar]
  • 37.Sole RV, Valverde S. Spontaneous emergence of modularity in cellular networks. Journal of The Royal Society Interface. 2008;5:129–133. doi: 10.1098/rsif.2007.1108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Vazquez A, Flammini A, Maritan A, Vespignani A. Modeling of protein interaction networks. ComPlexUs. 2003;1:38–44. [Google Scholar]
  • 39.Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, et al. The BioGRID Interaction Database: 2008 update. Nucl Acids Res. 2008;36:D637–640. doi: 10.1093/nar/gkm1001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, et al. The IntAct molecular interaction database in 2010. Nucl Acids Res. 2009:878. doi: 10.1093/nar/gkp878. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Ispolatov I, Mazo I, Yuryev A. Finding mesoscopic communities in sparse networks. Journal of Statistical Mechanics: Theory and Experiment. 2006;2006:P09014. doi: 10.1088/1742-5468/2006/09/P09014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Lozano S, Arenas A, Sanchez A. Mesoscopic structure conditions the emergence of cooperation on social networks. PLoS ONE. 2008;3:e1892. doi: 10.1371/journal.pone.0001892. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Lynch M. The Origins of Genome Architecture. Sunderland, MA: Sinauer, Sunderland; 2007. [Google Scholar]
  • 44.Evlampiev K, Isambert H. Modeling protein network evolution under genome duplication and domain shuffling. BMC Systems Biology. 2007;1:49. doi: 10.1186/1752-0509-1-49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Ispolatov I, Krapivsky PL, Yuryev A. Duplication-divergence model of protein interaction network. Phys Rev E. 2005;71:061911. doi: 10.1103/PhysRevE.71.061911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Kim WK, Marcotte EM. Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence. PLoS Comput Biol. 2008;4:e1000232. doi: 10.1371/journal.pcbi.1000232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Fortunato S. Community detection in graphs. Physics Reports. 2010;486:75–174. [Google Scholar]
  • 48.Newman M. Modularity and community structure in networks. Proc Natl Acad Sci USA. 2006;103:8577–8582. doi: 10.1073/pnas.0601602103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Rosvall M, Bergstrom C. An information-theoretic framework for resolving community structure in complex networks. Proc Natl Acad Sci U S A. 2007;104:7327–31. doi: 10.1073/pnas.0611034104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Ziv E, Middendorf M, Wiggins CH. Information-theoretic approach to network modularity. Phys Rev E. 2005;71:046117. doi: 10.1103/PhysRevE.71.046117. [DOI] [PubMed] [Google Scholar]
  • 51.Newman MEJ, Girvan M. Finding and evaluating community structures in networks. Physical Review E. 2004;69:026113. doi: 10.1103/PhysRevE.69.026113. [DOI] [PubMed] [Google Scholar]
  • 52.Cerami E, Demir E, Schultz N, Taylor BS, Sander C. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE. 2010;5:e8918. doi: 10.1371/journal.pone.0008918. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.del Sol A, Carbonell P. The modular organization of domain structures: Insights into proteinprotein binding. PLoS Comput Biol. 2007;3:e239. doi: 10.1371/journal.pcbi.0030239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Fernandez E, Collins MO, Uren RT, Kopanitsa MV, Komiyama NH, et al. Targeted tandem affinity purification of psd-95 recovers core postsynaptic complexes and schizophrenia susceptibility proteins. Molecular Systems Biology. 2009;5:269. doi: 10.1038/msb.2009.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Stouffer DB, Bascompte J. Compartmentalization increases food-web persistence. Proceedings of the National Academy of Sciences. 2011;108:3648–3652. doi: 10.1073/pnas.1014353108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Yoon J, Blumer A, Lee K. An algorithm for modularity analysis of directed and weighted biological networks based on edge-betweenness centrality. Bioinformatics. 2006;22:3106–3108. doi: 10.1093/bioinformatics/btl533. [DOI] [PubMed] [Google Scholar]
  • 57.Dehmer M, Mowshowitz A. A history of graph entropy measures. Information Sciences. 2011;1:57–78. [Google Scholar]
  • 58.Dehmer M, Varmuza K, Borgert S, Emmert-Streib F. On entropy-based molecular descriptors: Statistical analysis of real and synthetic chemical structures. Journal of Chemical Information and Modeling. 2009;49:1655–63. doi: 10.1021/ci900060x. [DOI] [PubMed] [Google Scholar]
  • 59.Mowshowitz A. Entropy and the complexity of graphs: I. an index of the relative complexity of a graph. Bulletin of Mathematical Biophysics. 1968;30:175–204. doi: 10.1007/BF02476948. [DOI] [PubMed] [Google Scholar]
  • 60.Mowshowitz A. Entropy and the complexity of graphs: Ii. the information content of digraphs and infinite graphs. Bulletin of Mathematical Biophysics. 1968;30:225–240. doi: 10.1007/BF02476692. [DOI] [PubMed] [Google Scholar]
  • 61.Todeschini R, Consonni V, Mannhold R. Handbook of Molecular Descriptors. Weinheim: Wiley-VCH; 2002. [Google Scholar]
  • 62.Shannon C, Weaver W. The Mathematical Theory of Communication. University of Illinois Press; 1949. [Google Scholar]
  • 63.Danon L, Diaz-Guilera A, Duch J, Arenas A. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment. 2005; 2005:P09008. [Google Scholar]
  • 64.NewmanMEJ. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America. 2001;98:404–409. doi: 10.1073/pnas.021544898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Newman MEJ. Finding community structure in networks using the eigenvectors of matrices. Phys Rev E. 2006;74:036104. doi: 10.1103/PhysRevE.74.036104. [DOI] [PubMed] [Google Scholar]
  • 66.Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, et al. High-quality binary protein interaction map of the yeast interactome network. Science. 2008;322:104–110. doi: 10.1126/science.1158684. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008;78:046110. doi: 10.1103/PhysRevE.78.046110. [DOI] [PubMed] [Google Scholar]
  • 68.Maridia K, Kent J, Bibby J. Multivariate Analysis. San Diego: Academic Press; 2003. [Google Scholar]
  • 69.McQuitty LL. Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement. 1966;26:825–831. [Google Scholar]
  • 70.Izenman AJ. Modern Multivariate Statistical Techniques. New York, NY USA: Springer; 2008. 16 [Google Scholar]
  • 71.Popper K. The logic of scientific discovery. New York: Basic Books; 1959. [Google Scholar]
  • 72.Koonin EV, Wolf YI. Evolutionary systems biology: links between gene evolution and function. Current Opinion in Biotechnology. 2006;17:481–487. doi: 10.1016/j.copbio.2006.08.003. [DOI] [PubMed] [Google Scholar]
  • 73.Dehal P, Boore J. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005;3:e314. doi: 10.1371/journal.pbio.0030314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Lynch M, O'Hely M, Walsh B, Force A. The Probability of Preservation of a Newly Arisen Gene Duplicate. Genetics. 2001;159:1789–1804. doi: 10.1093/genetics/159.4.1789. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Otto SP, Whitton J. Polyploid incidence and evolution. Annual Review of Genetics. 2000;34:401–437. doi: 10.1146/annurev.genet.34.1.401. [DOI] [PubMed] [Google Scholar]
  • 76.Berg J, Lassig M, Wagner A. Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evolutionary Biology. 2004;4:51. doi: 10.1186/1471-2148-4-51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Hallinan J. Gene duplication and hierarchical modularity in intracellular interaction networks. Biosystems. 2004;74:51–62. doi: 10.1016/j.biosystems.2004.02.004. [DOI] [PubMed] [Google Scholar]
  • 78.Plachetzki DC, Oakley TH. Key transitions during the evolution of animal phototransduction: novelty, “tree-thinking", co-option, and co-duplication. Integrative and Comparative Biology. 2007;47:759–769. doi: 10.1093/icb/icm050. [DOI] [PubMed] [Google Scholar]
  • 79.Middendorf M, Ziv E, Wiggins CH. Inferring network mechanisms: The drosophila melanogaster protein interaction network. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:3192–3197. doi: 10.1073/pnas.0409515102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Ratmann O, Wiuf C, Pinney J. From evidence to inference: probing the evolution of protein interaction networks. HFSP Journal. 2009;3:290–306. doi: 10.2976/1.3167215. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Thorne T, Stumpf M. Generating confidence intervals on biological networks. BMC Bioinformatics. 2007;8:467. doi: 10.1186/1471-2105-8-467. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Hoaglin D, Mosteller F, Tukey J. Understanding Robust and Exploratory Data Analysis. Wiley, New York; 1983. [Google Scholar]
  • 83.Lehman E. Testing Statistical Hypotheses. Springer; 2005. [Google Scholar]
  • 84.Onnela JP, Fenn DJ, Reid S, Porter MA, Mucha PJ, et al. A Taxonomy of Networks. ArXiv e-prints; 2010. [Google Scholar]
  • 85.R Development Core Team. R: A Language and Environment for Statistical Computing. 2008. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
  • 86.Csardi G, Nepusz T. igraph-package. 2008 [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES