Versatility and Connectivity Efficiency of Bipartite Transcription Networks

Mark P Brynildsen; Linh M Tran; James C Liao

doi:10.1529/biophysj.106.082560

2006 Jun 30;91(8):2749–2759. doi: 10.1529/biophysj.106.082560


PMCID: PMC1578464 PMID: 16815895

Abstract

The modulation of promoter activity by DNA-binding transcription regulators forms a bipartite network between the regulators and genes, in which a smaller number of regulators control a much larger number of genes. To facilitate representation of gene expression data with the simplest possible network structure, we have characterized the ability of bipartite networks to describe data. This has led to the classification of two types of bipartite networks, versatile and nonversatile. Versatile networks can describe any data of the same rank, and are indistinguishable from one another. Nonversatile networks require constraints to be present in data they describe, which may be used to distinguish between different network topologies. By quantifying the ability of bipartite networks to represent data we were able to define connectivity efficiency, which is a measure of how economic the use of connections is within a network with respect to data representation and generation. We postulated that it may be desirable for an organism to maximize its gene expression range per network edge, since development of a regulatory connection may have some evolutionary cost. We found that the transcriptional regulatory networks of both Saccharomyces cerevisiae and Escherichia coli lie close to their respective connectivity efficiency maxima, suggesting that connectivity efficiency may have some evolutionary influence.

INTRODUCTION

Bipartite networks have been used to represent many biological systems and engineering tasks, including gene expression regulation (1–6), signal processing (7,8), image processing (9–11), and spectrum analysis (12,13). These networks consist of a layer of sources connected to a layer of outputs, where every connection (edge) represents the influence of a source on an output (Fig. 1 A). In some cases, the output nodes are fully connected to the sources, for example, microphones recording simultaneous speeches in the same location. In others, the outputs are sparsely connected to the source signals, such as in transcriptional regulatory networks.

(A) Bipartite network depicting a hypothetical transcriptional regulatory network. (B) Z_A corresponding to the network in panel A. (C) Reduced forms created from Z_A in panel B. (D) Table of zero patterns, n_z, and E_rj from Z_A in panel B.

In general, it is advantageous to describe data with the simplest structure possible, both for interpretation and mechanistic reasons (14,15). However, conventional bipartite network analyses, such as principal component analysis (PCA) and independent component analysis, assume that networks are fully connected. For systems governed by sparsely connected networks, this assumption could lead to the deduction of unrealistic source signals (4,14,16,17). A variation of PCA, called Sparse PCA, has been developed that acknowledges this issue and attempts to alleviate it (16,17). However, Sparse PCA, like its precursor PCA, requires deduced source signals to be mutually orthogonal. Such a mathematical constraint without any phenomenological justification may hinder the ability to provide a simple representation, especially if the simplest structure requires oblique source signals (14,15). A complementary approach, network component analysis (NCA), takes into account known network connectivity in deducing source signals and allows for orthogonal and oblique source signals (4). However, if the a priori network connectivity has some degree of uncertainty, as in the case of ChIP-chip data being used to analyze DNA-microarray data, there may be simpler connectivities capable of describing the same data. Alternatively, exploratory factor analysis attempts to simplify structure by performing orthogonal or oblique rotations on a factorization. While the goal of this technique is to achieve simplicity of structure, the implementation has had difficulty with situations where the complexity of the simplest network exceeds that of maximal sparsity (one connection per output to the source layer) (15). To facilitate data representation with the simplest structure possible, we have characterized the ability of bipartite networks to describe data.

The ability of bipartite networks to describe data may be limited by network connectivity. In some cases, such as fully connected networks, any data within the span of the network can be described, while in other cases, such as sparsely connected networks, certain elements of the data may be required to lie on a single line or hyperplane. This leads to the classification of two types of bipartite networks: those whose output range is not limited by their connectivity, which we will term “versatile”, and those whose output range is hindered by their connectivity, which we will term “nonversatile”. Intuitively, one might think that any missing edge from a network might compromise its ability to describe data, and therefore that any network besides a fully connected network will be nonversatile. However, this is not true: there are networks that are not fully connected that can represent data as well as fully connected networks. These networks are also versatile and are not limited by their connectivity. The very existence of these networks demonstrates that there is no justification from data alone to conclude more than minimal versatile connectivity. Thereby, the most complex structure ever needed to describe data is the minimal versatile connectivity. Nonversatile networks, on the other hand, have their own utility, since their constraints are often present in datasets. Since nonversatile networks are often sparser than versatile networks, they would provide the simplest representation under many circumstances.

In this article we define the minimal connectivity to achieve versatility, define the constraints present in nonversatile networks, discuss the implications of versatile and nonversatile networks, and suggest possible applications for their use. To demonstrate the utility of these concepts we examined the transcriptional regulatory networks of Saccharomyces cerevisiae and Escherichia coli. We recognized that for bipartite networks the ability to represent data is equivalent to the ability to generate data. With this in mind, we defined connectivity efficiency, which is a measure of how economic the use of connections is within a network with respect to data representation/generation ability. We then analyzed the connectivity efficiencies of the transcriptional regulatory networks of S. cerevisiae and E. coli. We postulated that it may be biologically desirable for organisms to maximize their gene expression range (breadth of possible gene expression profiles) per network edge, since development of a regulatory connection may have some evolutionary cost. Subsequently, we found that both networks lay close to their respective connectivity efficiency maxima, suggesting that connectivity efficiency may have some evolutionary influence.

BACKGROUND

We are interested in the ability of bipartite networks to represent data. A bipartite network represents an output e_i(t) by the linear mixing of sources, p_j(t), through a mixing rule described by

$$ e_i(t) = \sum_{j=1}^{L} a_{ij}\, p_j(t) \tag{1} $$

where a_ij values are the connectivity strengths. The mixing rule can be written in a matrix form,

$$ E = A P \tag{2} $$

where E is the output data (N × M), A is the matrix of network connectivity strengths (N × L), and P is the collection of source signals (L × M). Bipartite network representation can further be generalized by considering only the connectivity pattern of matrix A,

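$$ Z_A = \{\, A \in \mathbb{R}^{N \times L} \mid a_{ij} = 0 \text{ if source } j \text{ is not connected to output } i \,\} \tag{3} $$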

where the values of the nonzero a_ij are left unconstrained and can take on any value—positive, negative, or zero. For the purpose of this article, networks with varying connectivity strengths but the same connectivity pattern, Z_A, will be discussed identically.
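The mixing rule can be illustrated with a short NumPy sketch. The connectivity pattern, the sizes, and the random values below are hypothetical and chosen only for illustration; they are not taken from the paper's figures or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical connectivity pattern: 4 outputs (genes) x 3 sources (regulators).
# True marks an allowed connection, False marks a structural zero.
pattern = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=bool)

N, L = pattern.shape
M = 6  # number of measurements (assumed value)

# One member of the class described by the pattern:
# random strengths on allowed edges, exact zeros elsewhere.
A = np.where(pattern, rng.normal(size=(N, L)), 0.0)
P = rng.normal(size=(L, M))  # source signals p_j(t), one row per source

E = A @ P  # matrix form of the mixing rule, E = A P

# Element-wise form for a single output i and measurement t
i, t = 1, 2
e_it = sum(A[i, j] * P[j, t] for j in range(L))
```

Any other choice of nonzero strengths on the same allowed edges belongs to the same connectivity pattern, which is why the analysis that follows depends only on where the zeros are.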

Versatile networks

Ideally, we would prefer to represent data with the simplest network connectivity possible. In the context of this work the simplicity and sparsity of networks will be synonymous. Thus, we seek to find the sparsest network connectivity that can reliably represent data. Naturally, we begin by considering networks that can represent any data. These networks are termed versatile, and are characterized by the following theorem.

Theorem 1

A linear bipartite network with connectivity pattern Z_A (N × L) can describe any data of rank at most L if all reduced forms of Z_A, denoted Z_ri, are full row rank.

Here, Z_ri is defined as the rows of Z_A which contain zeros in the i^th column of Z_A, where z_i is the number of zeros in the i^th column of Z_A. To test this, consider the nonzero entries of Z_ri as nonzero random values that cannot combine on their own to produce a rank deficiency.

To demonstrate use of Theorem 1 we have provided a hypothetical transcriptional regulatory network in Fig. 1 A, transformed the network into Z_A form (Fig. 1 B), and determined all Z_ri (Fig. 1 C). Both Z_r1 and Z_r2 are full row rank, but Z_r3 is not, and therefore the network in Fig. 1 A does not satisfy Theorem 1. For a network that would satisfy Theorem 1, simply connect TF₃ to the first, second, or fourth gene. The proof of this theorem along with examples is presented in Appendix A.
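The rank test in Theorem 1 can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the nonzero entries are filled with random normal values as the text suggests, and the example pattern is an invented five-gene, three-regulator network that merely shares one feature with Fig. 1 A (genes 1, 2, and 4 are not regulated by TF₃).

```python
import numpy as np

def reduced_form(pattern, i):
    """Rows of the pattern that have a zero in column i."""
    return pattern[~pattern[:, i]]

def is_versatile(pattern, seed=0):
    """Check that every reduced form is full row rank (Theorem 1 criterion)."""
    rng = np.random.default_rng(seed)
    pattern = np.asarray(pattern, dtype=bool)
    for i in range(pattern.shape[1]):
        rows = reduced_form(pattern, i)
        z_i = rows.shape[0]  # number of zeros in column i
        if z_i == 0:
            continue  # no zeros in this column, nothing to test
        # Generic (random) values on the allowed entries, zeros elsewhere
        values = np.where(rows, rng.normal(size=rows.shape), 0.0)
        if np.linalg.matrix_rank(values) < z_i:
            return False
    return True

# Hypothetical pattern: rows are genes 1..5, columns are TF1..TF3.
example = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
], dtype=bool)

# Same pattern after connecting TF3 to gene 1
repaired = example.copy()
repaired[0, 2] = True

print(is_versatile(example))   # False: three rows with a zero in column 3 cannot have rank 3
print(is_versatile(repaired))  # True
```

In the first pattern the reduced form for TF₃ has three rows that all lie in a two-dimensional space, so it cannot be full row rank; adding one edge from TF₃ leaves only two such rows and the criterion is met.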

A consequence of the versatility theorem is that all connectivity patterns that satisfy the required criterion will represent data equally. This means that there may exist a minimal connectivity that satisfies the criterion, which may be used to represent data created from denser network structures. To determine the minimal connectivity (sparsest network) to achieve versatility we must find the limit of the criterion. To do so we recognize that a reduced form Z_ri can only be full row rank if z_i < L, and this must hold for every column of Z_A. Therefore, the minimal connectivity to achieve versatility contains L(L − 1) missing edges, specifically (L − 1) per column of Z_A. However, not all network connectivities with (L − 1) missing edges per column are versatile. Any network must still comply with the above criterion to be versatile, even if it has the same number of missing edges as, or fewer missing edges than, the minimal connectivity to achieve versatility.
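As a small worked illustration, consider a hypothetical case with L = 3 and N = 6, constructed here only for this purpose (it is not one of the Appendix A examples). The symbol × stands for an unconstrained entry, and Z_{ri} denotes the reduced form for column i. The first pattern has L(L − 1) = 6 missing edges, two per column, and satisfies the criterion; the second has the same number of missing edges per column but does not.

```latex
% Versatile: every reduced form has z_i = 2 rows and generic rank 2
Z_A =
\begin{bmatrix}
0 & \times & \times \\
0 & \times & \times \\
\times & 0 & \times \\
\times & 0 & \times \\
\times & \times & 0 \\
\times & \times & 0
\end{bmatrix},
\qquad
Z_{r1} =
\begin{bmatrix}
0 & \times & \times \\
0 & \times & \times
\end{bmatrix},
\quad \operatorname{rank}(Z_{r1}) = 2 = z_1 .

% Not versatile: still two zeros per column, but Z'_{r1} is rank deficient
Z'_A =
\begin{bmatrix}
0 & 0 & \times \\
0 & 0 & \times \\
\times & \times & 0 \\
\times & \times & 0 \\
\times & \times & \times \\
\times & \times & \times
\end{bmatrix},
\qquad
Z'_{r1} =
\begin{bmatrix}
0 & 0 & \times \\
0 & 0 & \times
\end{bmatrix},
\quad \operatorname{rank}(Z'_{r1}) = 1 < z_1 = 2 .
```

The count of missing edges is therefore necessary but not sufficient; where the zeros fall in each row also matters.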

Minimal connectivity for versatility is maximal connectivity for NCA-compliance

Interestingly, there exists a relationship between the minimal connectivity to achieve versatility and NCA. To guarantee the uniqueness of NCA solutions there are three criteria that must be satisfied. The second criterion in Liao et al. (4) deals with the connectivity pattern, Z_A, and the ranks of its reduced forms, which are essentially identical to the reduced forms described here. It states that the rank of every reduced form, Z_ri, must be (L − 1). The maximum rank for any Z_ri is (L − 1), and can only be achieved if z_i ≥ (L − 1). Therefore, a necessary condition for NCA-compliance is that a network must have a minimum of (L − 1) zeros per column.

Versatility requires that all reduced forms Z_ri be full row rank. The maximum rank of a full-row-rank Z_ri is (L − 1), and can only be achieved if z_i = (L − 1). This corresponds to the minimal connectivity to achieve versatility described previously. Thus, the minimum connectivity to achieve versatility requires all Z_ri to have z_i = (L − 1) and be of rank (L − 1), and the maximum connectivity (largest number of nonzero connections) to be NCA-compliant requires all Z_ri to have z_i = (L − 1) and be of rank (L − 1). Therefore, the minimum connectivity to achieve versatility is equivalent to the maximum NCA-compliant connectivity. To illustrate, examples have been provided in Appendix A.

Nonversatile networks

Nonversatile networks are those connectivity patterns that do not satisfy the versatility criterion. These networks have a reduced ability to represent data compared to that of versatile networks. Fig. 2 illustrates this concept, where the y axis is a measure of data representation capability (versatility index) that will be defined in the next section, and the x axis is the number of edges in the network. The reduced ability of nonversatile networks to represent data is due to connectivity constraints that dictate the type of data the network is able to describe. However, these constraints are often present in datasets, leading to the possibility of data representation with simpler structures than versatile networks. Therefore, we have characterized these constraints in the following theorem, such that they may aid in simplifying network structures used to represent data.

Plot of versatility index versus number of edges, for >10,000 networks with 50 outputs and 10 sources.

Definitions

A zero pattern is a 1 × L vector that indicates, by the position of zero entries, which transcription factors (TFs) do not control expression of a gene. The number of zero entries in a zero pattern is designated by n_z. A system with three TFs has seven possible zero patterns, which are shown in Fig. 1 D. For example, the zero pattern whose only zero is in the third position indicates that a gene is not controlled by TF₃.
Any gene that satisfies the definition of a zero pattern is a member of that zero pattern. For instance, the zero pattern with a zero in the third position requires genes to not be regulated by TF₃; therefore, genes 1, 2, and 4 are all members.
An informative zero pattern is any zero pattern with more than (L − n_zj) members, where n_zj is the number of zeros in the j^th informative zero pattern. The network in Fig. 1 A has two: the pattern with a zero in the third position and the pattern with no zeros.
E_rj is a matrix composed of the genes (rows of E) that are members of the j^th informative zero pattern. From Fig. 1, E_r1 = E(rows 1, 2, 4).

Theorem 2

Any dataset, E (N × M), may be represented by a linear bipartite network characterized by connectivity pattern Z_A (N × L) if every E_rj has rank ≤ (L − n_zj).

Fig. 1 D summarizes all items needed to evaluate Theorem 2. As one can see, only two informative zero patterns exist, and only their E_rj need to be evaluated. For a dataset to be represented by the network in Fig. 1 A, E_r1 must have rank ≤ 2, and E_r2 must have rank ≤ 3. The proof of this theorem along with examples is presented in Appendix B. The theorem identifies bipartite connectivity constraints from Z_A that must be present within E for Z_A to represent it. Theorem 2 may be used to check whether a dataset can be represented by a network. The procedure to use Theorem 2 is presented in Table A1 of Appendix B along with an example to illustrate its use.

TABLE A1.

Procedure used to determine whether a dataset, E, contains the connectivity constraints dictated by Z_A

Procedure for using Theorem 2
1. Identify all possible 1 × L zero patterns.
2. Determine those zero patterns that have >(L – n_z) members. This will be the list of informative zero patterns.
3. Create E_rj for every informative zero pattern and check whether all E_rj have rank ≤(L – n_z).
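The three steps of the procedure can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the numerical rank tolerance, and the example network and data are all hypothetical. Enumerating every zero pattern grows as 2^L, so this direct form is only practical for small L.

```python
import itertools
import numpy as np

def informative_zero_patterns(pattern):
    """Steps 1 and 2: return (zero_columns, member_rows) for each informative zero pattern."""
    pattern = np.asarray(pattern, dtype=bool)
    N, L = pattern.shape
    found = []
    for n_z in range(L):  # the all-zero pattern (n_z = L) is not considered
        for zero_cols in itertools.combinations(range(L), n_z):
            # A gene is a member if it has no edge in any of the zero columns.
            members = np.flatnonzero(~pattern[:, list(zero_cols)].any(axis=1))
            if len(members) > L - n_z:
                found.append((zero_cols, members))
    return found

def satisfies_theorem2(E, pattern, tol=1e-8):
    """Step 3: check rank(E_rj) <= L - n_z for every informative zero pattern."""
    pattern = np.asarray(pattern, dtype=bool)
    L = pattern.shape[1]
    for zero_cols, members in informative_zero_patterns(pattern):
        # tol is an assumed absolute threshold on singular values
        if np.linalg.matrix_rank(E[members], tol=tol) > L - len(zero_cols):
            return False
    return True

# Hypothetical 5-gene, 3-regulator network (not the network of Fig. 1 A)
pattern = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
], dtype=bool)

rng = np.random.default_rng(1)
A = np.where(pattern, rng.normal(size=pattern.shape), 0.0)
E_good = A @ rng.normal(size=(3, 8))  # data generated by the network itself
E_bad = rng.normal(size=(5, 8))       # unstructured data of full rank

print(len(informative_zero_patterns(pattern)))  # 2
print(satisfies_theorem2(E_good, pattern))      # True
print(satisfies_theorem2(E_bad, pattern))       # False
```

With noisy measurements the exact rank test would not hold, so in practice the tolerance would have to be chosen relative to the noise level; that choice is outside what the text specifies.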


It should be noted that Theorem 2 is general and can be applied to any bipartite network. In fact, if one were to check whether a dataset could be represented by a versatile network, Z_A (N × L), only one informative zero pattern would be found that had >(L − n_zj) members. This zero pattern would not have any zero entries and would check whether the dataset has rank ≤ L, a condition present in Theorem 1.

Implications of nonversatile networks

Although a dataset may satisfy Theorem 2 for a particular nonversatile network, the dataset may still contain additional constraints. This is due to the fact that constraints from nonversatile networks are nonunique. In fact, any network that can be created from another network by edge deletion (which we call the offspring networks) will have the same set of constraints or a larger set that contains the previous network's constraints. This means that the nonversatility criterion does not identify the minimal nonversatile connectivity to represent data, but simply identifies whether a dataset may be represented by a particular nonversatile network. To deduce the minimal nonversatile connectivity to represent data a method must be developed that can efficiently search for constraints in data, rather than see if data fits the constraints of a nonversatile network. This leads to the question of network reconstruction from constraints embedded in the data, which we will leave for the Discussion.

Connectivity efficiency

For bipartite networks the ability to represent data is equivalent to the ability to generate data. For transcriptional regulatory networks the ability to generate data would be the ability to generate gene expression. Knowing that transcriptional regulatory networks are generally sparse and that versatile networks of the same size would be fairly dense, we knew that transcriptional regulatory networks would not be versatile, and thus not have the maximal capability. With this in mind, we postulated that it may be desirable for organisms to maximize their gene expression ability per connection of the network, since it is safe to assume that there could be an evolutionary cost associated with the development of every regulatory interaction in the network.

First, we needed to define an index which could give us an indication of how close a network is to being versatile. We wanted the index to range from 0 to 1, where any network with a value of 1 would be versatile and any network with a value of 0 would be the most nonversatile (one connection per output to the source layer). We also required that if a network failed Theorem 1 for every reduced form Z_ri, any edge deletion within the network would decrease its index. We require that the nonversatile network fail every Z_ri because those that comply with Theorem 1 correspond to columns of Z_A that are versatile in nature, and thus edge deletion may not change that if they have <(L − 1) zeros. With these conditions in mind we defined the versatility index,

(4)

where VI(Z_A) is the versatility index of Z_A, Z_A^c are the constraints imposed by Z_A, max(Z^c) are the constraints from the most nonversatile network the same size as Z_A, N is equal to the number of outputs, and L the number of regulators. The method to determine Z_A^c and max(Z^c) can be found within Appendix D. Both are based on the principles detailed in Theorem 2. Subsequently, we can define the connectivity efficiency (CE(Z_A)), as

$$ \mathrm{CE}(Z_A) = \frac{\mathrm{VI}(Z_A)}{\text{number of edges in } Z_A} \tag{5} $$

Connectivity efficiency (CE) is an average measure of how much each edge in a network contributes to the ability of that network to represent/generate data. We calculated the connectivity efficiency for the transcriptional regulatory networks of S. cerevisiae (CE = 4.7 × 10⁻⁵) and E. coli (CE = 1.9 × 10⁻⁴). While these values might not appear significant, when the connectivity efficiencies of networks of the same size (same number of genes and regulators) and edge distribution are plotted, the connectivity efficiency of S. cerevisiae is 87% of the maximum, and that of E. coli is 55% of the maximum, and both lie on the same shoulder of their respective maxima, as depicted in Fig. 3. That shoulder represents networks that are sparser than the maximum, and its significance will be addressed in detail within the Discussion.

(A) Connectivity efficiency plot for the transcriptional regulatory network of *S. cerevisiae* (*circle*) plotted against networks of the same size (same number of regulators and genes), sampled from the same edge distribution, with a varying degree of edge density (*line*). (B) Connectivity efficiency plot for the transcriptional regulatory network of *E. coli* (*circle*) plotted against networks of the same size (same number of regulators and genes), sampled from the same edge distribution, with a varying degree of edge density (*line*).

DISCUSSION

Generally, it is desirable to describe data in the simplest possible manner. For systems governed by bipartite networks, this translates into describing data with the simplest possible structure. It has long been argued that simplicity of structure has more physical meaning than other considerations, such as orthogonality, during data representation (14). In fact, it has been shown that such abstract constraints can yield erroneous results (4). In this work we have characterized the ability of bipartite networks to describe data, so as to facilitate data representation with the simplest possible structure. As we have shown, the ability of bipartite networks to describe data is dependent upon the network connectivity. Here we have classified bipartite networks into two categories based on their connectivity: versatile networks, which do not have any restrictions imposed by their connectivity on the type of data they can describe, and nonversatile networks, which do. This distinction gives rise to exclusive properties of each class that have implications for data representation, data compression, and network and source signal reconstruction.

Versatile networks can describe any data, and do not need to be fully connected. Therefore, the maximal connectivity necessary to describe any data would be the minimal versatile connectivity. This signifies the ability of some versatile networks to explain output generated from denser network structures. Theoretically, this would provide data compression capability superior to that of PCA. However, this capability comes at a cost. Since versatile networks are equally capable there is no way to discern the true network and source signals from data generated by versatile networks. Even if one were to assume that the true network was the minimal versatile connectivity, this would identify a whole class of networks that satisfy Theorem 1. The connections within the network would have no physical meaning since they could be rearranged in many different ways without impacting the system. This would be undesirable for situations where the actual arrangement of connections was of importance, such as in transcriptional regulatory networks. However, nonversatile networks do not share this deficiency.

Nonversatile networks are capable of describing a limited set of data. Restrictions that match those dictated by their network connectivity must be present in datasets for representation by them. This limitation, however, has its utility, since output created from nonversatile networks carries the connectivity restrictions derived from the original network. This enables network and source signal reconstruction on their outputs, and lends credibility to physical meanings attributed to their connections. Though reconstruction remains possible and seems plausible, efficient search algorithms must be designed to probe for connectivity restrictions from nonversatile networks. Whether these concepts will be incorporated into current techniques or form the basis of novel approaches, the additional complication of noise must be overcome. While versatile networks can describe any data, including data riddled with noise, the restrictions left by nonversatile networks may be obscured by noise and more difficult to locate. This, however, is an unavoidable complication when attempting to decipher underlying mechanisms, and does not change the basic principles of versatile and nonversatile network representation.

In addition, the concept of network versatility has been applied to the transcriptional regulatory networks of S. cerevisiae and E. coli. Connectivity efficiency, which is an economic measure of connection usage, was calculated for the transcriptional regulatory networks of S. cerevisiae and E. coli and plotted against the connectivity efficiencies of other networks of the same size and sampled from the same distribution. It was found that the connectivity efficiencies of S. cerevisiae and E. coli were 87% and 55% of the maximum of their respective plots, and that both were found on the same shoulder of their maxima. That shoulder represents networks that have fewer edges than the maximally efficient network. This is an important feature because the transcriptional networks of S. cerevisiae and E. coli are more likely to be missing connections than to contain erroneous edges. Therefore, the true transcriptional networks of these organisms should approach the maximal connectivity efficiency. In fact, Harbison et al. (18) claimed that the 203 transcription factors on which they performed genome-wide location analysis are most likely to comprise all of the DNA-binding transcriptional regulators in S. cerevisiae, and that the false-positive rate of their analysis should be ∼6% while the false-negative rate should be ∼24%. Combined with the fact that the majority of open reading frames in S. cerevisiae have been found after its genome sequencing, the size of the transcriptional network should not change much. Therefore, any addition of edges to the transcriptional network of yeast will push the network toward the maximal connectivity efficiency. For E. coli, since an analogous genome-wide location analysis has never been done, the likelihood of missing connections over erroneous connections seems to be even higher. These findings suggest that connectivity efficiency may be a quantity that transcriptional networks evolve to maximize.

In conclusion, we have characterized the ability of bipartite networks to represent data, which has led to the concepts of versatility and nonversatility. Both of these concepts have been derived, described, and discussed in detail. Lastly, we demonstrated the utility of these concepts by analyzing the connectivity efficiencies of S. cerevisiae and E. coli, which suggested that measures derived from these concepts may have some biological or evolutionary importance.

METHODS

Transcriptional networks

S. cerevisiae: Using a p-value threshold of 1 × 10⁻³, transcriptional regulatory networks were obtained from the ChIP-chip data of Lee et al. (19) and Harbison et al. (18) (YPD and all conditions). The networks were then merged to obtain a network comprised of all transcription factor-promoter binding relationships known through ChIP-chip experimentation.

Escherichia coli: The network was obtained by combining information from RegulonDB version 4 (20), Ver. 1.1 of Shen-Orr et al. (21), and Pernestig et al. (22). CsrA was included as a transcriptional regulator since small regulatory RNAs can be incorporated into bipartite networks without a loss of generality.

Network processing

Due to the size of the transcriptional networks of S. cerevisiae and E. coli, it was necessary to use the versatility index shortcut calculation described in Appendix D. To utilize this calculation, every regulator in the system must have a gene it solely controls. Not all regulators in the transcriptional networks of S. cerevisiae and E. coli have this attribute. Therefore, those regulators without this attribute along with all of the genes they participate in controlling were removed from the networks. The remaining networks (S. cerevisiae: 3630 genes, 147 regulators; E. coli: 680 genes, 71 regulators) were then analyzed as described in Appendix D.
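The pruning step described above can be sketched as follows. This is a minimal illustration with a hypothetical function name and a toy network; it assumes the network is stored as a boolean gene-by-regulator matrix in which every gene has at least one regulator.

```python
import numpy as np

def prune_for_shortcut(pattern):
    """Drop regulators that solely control no gene, plus every gene they regulate."""
    pattern = np.asarray(pattern, dtype=bool)
    # Genes controlled by exactly one regulator
    single = pattern.sum(axis=1) == 1
    # A regulator is kept if it is the only regulator of at least one gene
    keep_regs = pattern[single].any(axis=0)
    # A gene is kept only if none of its regulators is being removed
    keep_genes = ~pattern[:, ~keep_regs].any(axis=1)
    return pattern[np.ix_(keep_genes, keep_regs)], keep_genes, keep_regs

# Toy network: 5 genes x 3 regulators; regulator 3 has no solely controlled gene
toy = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
], dtype=bool)

pruned, keep_genes, keep_regs = prune_for_shortcut(toy)
print(pruned.astype(int))
# [[1 0]
#  [0 1]
#  [1 1]]
```

A single pass is enough here: a gene that is solely controlled by a retained regulator has no edge to any removed regulator, so it is never deleted and the retained regulators keep their solely controlled genes.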

Versatility index plot

Networks were created from an algorithm whose initial N × L network had one edge per output and the same number of edges per regulator. For every iteration an edge was randomly added to the network of the previous step. The algorithm concluded when the network was fully connected. A versatility index was calculated at every iteration for the network of that step. To ensure use of the versatility index shortcut calculation, an output for every regulator was required to contain a single edge, until the remaining N – L outputs were fully connected. Then edges were added at random to the remaining L outputs until the network was fully connected.
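The network-growing procedure can be sketched as a generator. This is an illustrative reading of the description above, not the authors' code: the function name is hypothetical, the initial assignment of output i to regulator i mod L is one assumed way to give every regulator the same number of edges (it requires N to be a multiple of L), and the versatility index itself is not computed because Appendix D is not reproduced here.

```python
import numpy as np

def growing_networks(N, L, seed=0):
    """Yield N x L boolean patterns, adding one random edge per step until fully connected."""
    if N % L != 0:
        raise ValueError("this sketch assumes N is a multiple of L")
    rng = np.random.default_rng(seed)
    pattern = np.zeros((N, L), dtype=bool)
    # Initial network: one edge per output, N / L edges per regulator
    pattern[np.arange(N), np.arange(N) % L] = True
    yield pattern.copy()
    # Phase 1 fills outputs L..N-1 while outputs 0..L-1 (one per regulator)
    # keep their single edge; phase 2 then fills those L reserved outputs.
    for rows in (range(L, N), range(L)):
        missing = [(i, j) for i in rows for j in range(L) if not pattern[i, j]]
        for k in rng.permutation(len(missing)):
            i, j = missing[k]
            pattern[i, j] = True
            yield pattern.copy()

nets = list(growing_networks(6, 3))
print(len(nets))  # 13 networks: 6 initial edges up to 18 edges
```

Each yielded pattern would then be passed to the versatility index calculation, giving one point per edge count for a plot such as Fig. 2.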

Acknowledgments

This work has been supported by the Center for Cell Mimetic Space Exploration and NASA University Research, Engineering and Technology Institute under award No. NCC 2-1364, National Science Foundation No. ITR CCF-0326605, and the University of California at Los Angeles-Department of Energy Institute for Genomics and Proteomics.

APPENDIX A

Proof of Theorem 1

Definition

The connectivity pattern, Z_A, can be defined as

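$$ Z_A = \{\, A \in \mathbb{R}^{N \times L} \mid a_{ij} = 0 \text{ if source } j \text{ is not connected to output } i \,\} \tag{A1} $$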

where the values of the nonzero a_ij are left unconstrained and can take on any value, positive, negative, or zero. Z_A characterizes a class of networks that all have the same zero pattern, but varying connectivity strengths (nonzero a_ij).