Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2005 Dec 15;102(52):19103–19108. doi: 10.1073/pnas.0505231102

The global transcriptional regulatory network for metabolism in Escherichia coli exhibits few dominant functional states

Christian L Barrett 1, Christopher D Herring 1, Jennifer L Reed 1, Bernhard O Palsson 1,*
PMCID: PMC1323155  PMID: 16357206

Abstract

A principal aim of systems biology is to develop in silico models of whole cells or cellular processes that explain and predict observable cellular phenotypes. Here, we use a model of a genome-scale reconstruction of the integrated metabolic and transcriptional regulatory networks for Escherichia coli, composed of 1,010 gene products, to assess the properties of all functional states computed in 15,580 different growth environments. The set of all functional states of the integrated network exhibits a discernable structure that can be visualized in 3-dimensional space, showing that the transcriptional regulatory network governing metabolism in E. coli responds primarily to the available electron acceptor and the presence of glucose as the carbon source. This result is consistent with recently published experimental data. The observation that a complex network composed of 1,010 genes is organized to achieve few dominant modes demonstrates the utility of the systems approach for consolidating large amounts of genome-scale molecular information about a genome and its regulation to elucidate an organism's preferred environments and functional capabilities.

Keywords: systems biology, transcriptional regulation


Two prominent paradigms in modern biology about which there has been much discussion are the reductionist and integrative (or systems) approaches to biological discovery and understanding (15). The utility of the systems approach is demonstrated when computational models are used to integrate large amounts of molecular data to interpret and predict the range of cellular phenotypes. We investigated the utility of the systems approach by using a genome-scale reconstruction of the integrated transcriptional regulatory and metabolic network for Escherichia coli (iMC1010v1) (6) to computationally assess growth phenotypes in an exhaustive library of minimal-medium growth environments. iMC1010v1 accounts for a total of 1,010 ORFs in E. coli, or about one-third of the functionally assigned ORFs on the E. coli genome. iMC1010v1 was constructed solely from experimentally verified molecular interactions reported in primary literature sources. iMC1010v1 represents 906 metabolic ORFs that enable 932 unique biochemical reactions (including transport reactions) among 625 metabolites and 104 transcription factors regulating the expression of 479 of the 1,010 ORFs in the reconstruction.

Genome-scale reconstructions like iMC1010v1 can be used to compute phenotypic states by using constraint-based approaches (7). Dynamic batch-culture growth simulations (8, 9) have been performed by using constraint-based analysis (1014). Constraint-based analysis has been used to assess optimal growth states (15), consequences of gene knockouts (1618), endpoints of adaptive evolution (19), synthetic lethals (20), epistatic interactions between metabolic genes (21), and enzyme dispensability (22). The aim of this work was to comprehensively simulate the full set of possible molecular interactions in iMC1010v1 to gain a global view of the range of functional network states and to compare the results with experimental outcomes.

Materials and Methods

Minimal Media. All possible media components (see Data Set 1, which is published as supporting information on the PNAS web site) that can be used by the metabolic and transcriptional regulatory network reconstruction were categorized as possible carbon, nitrogen, phosphate, sulfur, or electron-acceptor sources. Some components were placed in multiple categories (e.g., glucose 6-phosphate serves as both a carbon and phosphate source). Additionally, each category contained “None.” A complete library of minimal media was then constructed by enumerating all combinations of components from each category, resulting in 108,723 possible minimal-media compositions.

Growth Simulations. Of the 108,723 minimal-media compositions, 15,580 were predicted to allow cellular growth by iMC1010v1. Any condition that resulted in a biomass doubling time of, at most, 12 h was considered to be one that allowed cellular growth. Simulations corresponded to log-phase batch growth for 50 min, with time steps in 5-min intervals.

Computation-Based Gene-Expression Profiles. The expression level of each gene for a growth simulation was computed to be the fraction of all time steps for which the gene was active, or “on” and then rounded to 0 or 1.

Computation-Based Activity Profiles. Because of combinatorial logic at promoters, the “off/on” relationship between transcription factors (TFs) and the genes they regulate is often observed to change during a growth simulation. Therefore, we developed an encoding scheme that records the computed gene-expression states of iMC1010v1 and the logic of all regulatory interactions observed in a simulation in a single binary vector. Fig. 1 demonstrates the procedure for forming such computation-based activity profiles. This encoding scheme is regulatory-connection centric as opposed to promoter centric and, as such, loses some of the information about combinatorial regulation at individual promoters. This scheme does, however, have the strength of being comparable with (experimental) expression-based activity profiles (see below).

Fig. 1.

Fig. 1.

A graphical depiction of how growth simulations are encoded into computation-based activity profiles (Left) and how gene expression measurements are combined with the known structure of the transcriptional regulatory network to create expression-based activity profiles (Right). (A) The example network contains genes for two transcription factors and an enzyme. (B) The gene 0/1 (off/on) states in each time step tn of a simulation and the log-transformed signal-intensity values from microarray experiment for a defined growth environment. (C) How a “basic unit” is defined for both simulated and experimental data, one of which exists for every TF-regulated (target) gene pair in the E. coli reconstruction. For simulation, each cell of the basic unit gets either a “0” or a “1,” depending on whether its associated TF and target gene were observed to be in the indicated off/on combination in any simulation time step. For experiment, the logical interaction between the TF and target gene is encoded as the logarithm of the ratio of their signal-intensity values. (D) The basic units for the example in A are shown. Concatenating the basic units forms the final activity profile.

Clustering Profiles. The similarity between growth simulations was quantified by computing the Hamming (27) distance between either the computation-based gene-expression profiles or the computation-based activity profiles. We developed an agglomerative strategy for clustering profiles. This strategy used a distance cutoff to decide whether to place a profile within an existing cluster or to form a new cluster. The location of the first troughs in Fig. 2 A and B were used as the distance cutoff, giving a cutoff of 45 bits for expression-profile clustering and 200 bits for activity-profile clustering. The procedure placed each profile together with its “nearest” profile(s) until all vectors were part of a cluster. An iterative adjustment phase was then used to place each vector into another cluster when it was actually closer to all of the members of that cluster than to the members of its current cluster. This adjustment phase was reperformed until convergence. In practice, three such iterations were needed. Very similar clustering results were obtained with the program cluster (23) from the Eisen lab, using (average linkage) hierarchical clustering and the Euclidean distance metric.

Fig. 2.

Fig. 2.

The distribution of pair-wise Hamming distances between simulated gene-expression (A) and network (B) activity profiles of the 15,580 growth simulations using iMC1010v1.

Sensitivity of Clustering Results to Distance Cutoff Value. We defined a quantitative measure of how clustering results differ when distance cutoff values different from the automatically chosen ones are used. Based on the clustering results for the automatically chosen cutoff, every pair of profiles has the relationship of belonging or not belonging to the same cluster, which we term the reference relationship. We evaluated clustering results for different distance cutoff values by computing the fraction of reference relationships for all profile pairs maintained in the new clustering. We evaluated the sensitivity of the clustering results to distance cutoff value by computing this fractional value for many cutoff values around the automatically chosen values.

Visualizing Profiles in 3-Dimensional Space. The clustering described above resulted in 36 and 13 clusters for the computation-based gene expression profiles and computation-based activity profiles, respectively. Classical multidimensional scaling (24), also known as principal-components analysis, was then used to project the very high-dimensional activity profiles into 3-dimensional space. (The relative contribution of the first three eigenvalues was 76% and 90% for the expression and activity profiles, respectively, indicating that the location of activity profiles relative to each other in three dimensions reflected very well their location relative to each other in their original high-dimensional space.) As a final step, a convex hull (25) was placed around all of the members of each cluster so that clusters could be visualized.

Expression Data. Microarray data from nine different experimental conditions were obtained from various studies performed in this laboratory and in others (6, 26, 37). All data sets had the following features in common: (i) They were generated by using WT E. coli K-12 MG1655 in batch culture grown to midlog phase; (ii) mRNA was reverse-transcribed and labeled as recommended by Affymetrix (Santa Clara, CA); (iii) cDNA samples were hybridized to E. coli antisense arrays (part no. 900381, Affymetrix); and (iv) data from .CEL files were processed by using the program mas ver. 5.0 or 5.1 (Affymetrix) and normalized such that the average signal from all ORFs was 1,000 (an arbitrary value used by other laboratories). Log2 values of signal were used in this analysis. The growth conditions for the expression profiles are detailed in Table 1, which is published as supporting information on the PNAS web site.

Experiment-Based Activity Profiles. To use experimental expression data to support results stemming from computation-based activity profiles, we combined signal-intensity values from gene-expression profiles with the known structure of the transcriptional regulatory network to produce experiment-based activity profiles (see Fig. 1). Key to the computation-based activity profiles was the encoding of the expression relationship between a TF and target gene. For real expression profiles, this relationship was encoded as the logarithm of the ratio of the signal intensity values between a TF and a (regulated) target gene. Experiment-based activity profiles were constructed from experimental expression data for nine different environments.

Comparing Computation-Based Profiles with Experiment-Based Profiles. In this work, we investigated the correspondence between computational and experimental profiles. We were interested in measuring the degree to which the spatial relationship among the experimental profiles matched the spatial relationship among computational profiles for both gene-expression and activity profiles. Because computational and experimental profiles reside in different metric spaces, they are not directly comparable. To measure the degree of correspondence, we computed the distance between all pairs of experimental profiles and all pairs of computational profiles for the nine growth environments. This computation gave rise to 9 × 8/2 = 36 pairs of profiles between which distances could be calculated. Each set (computational and experimental) of distance values imparts a ranking among the 36 pairs of growth environments. We then used rank-correlation analysis (27) to measure the degree of correspondence between the computational and experimental profiles. We computed both Kendall's τ and Spearman's ρ rank-correlation coefficients and the P value associated with the computed values.

Results

Integrating E. coli Transcriptional Regulatory Network Structure with Simulated Gene-Expression Profiles Reveals a Global Functional Organization. Cellular growth was simulated (8, 9) in 15,580 media conditions and the expression state (off/on) of all genes was recorded for each time step of each simulation. From this recorded data, the activity of the iMC1010v1 network during the course of an entire simulation was summarized in two different ways: (i) by computation-based gene-expression profiles and (ii) by computation-based activity profiles.

The first approach produces a vector of zeros and ones, indicating the expression state (off or on) of each gene in the network during a growth simulation. A histogram of the Hamming distances between all pairs of simulated gene-expression profiles shows a trimodal distribution (Fig. 2A). To better assess the implied structure in the “space” of computation-based gene expression profiles, we performed clustering and subsequent dimensionality reduction of the profiles. This procedure elucidates clusters of computation-based expression profiles for E. coli iMC1010v1 that can be visualized in a 3-dimensional space (Fig. 3).

Fig. 3.

Fig. 3.

The clusters of all computation-based gene-expression profiles of iMC1010v1 projected into 3-dimensional space. The numbers in parentheses by each cluster in the key are the numbers of different profiles in the cluster. The clusters whose contained profiles all correspond to growth environments with the same terminal electron acceptor are listed (Upper Right). The units of each axis are in bits, as given by the Hamming distance computed between computation-based expression profiles that are contained within the clusters.

The second approach incorporates both the gene-expression profiles (as above) and the logic states of the TF-target-gene regulatory connections into an activity profile (see Fig. 1). Each activity profile is a bit vector composed of units describing the logical interactions observed for every transcription-factor–target-gene pair and also accounts for the expression state of each gene. A histogram of the Hamming distances between all pairs of such computation-based activity profiles is trimodal (Fig. 2 A) and has a more distinct structure than the expression state alone (Fig. 2 A). Through clustering and dimensionality reduction (as above), we visually elucidate clusters of activity profiles of iMC1010v1 in a 3-dimensional space (Fig. 4).

Fig. 4.

Fig. 4.

The clusters of all computation-based activity profiles of iMC1010v1 projected into 3-dimensional space, allowing visualization of the “space” of transcriptional regulation and metabolic functional capabilities. The numbers in parentheses by each cluster in the key are the numbers of different activity profiles in the cluster. Comparison of the clusters shows that they can be distinguished by the available electron acceptor (indicated by the ellipses) and the carbon source and, to a lesser degree, by the nitrogen source. The units of each axis are in bits, as given by the Hamming distance computed between computation-based activity profiles contained within the clusters.

We investigated how our clustering results changed when we used cutoff values progressively lesser and greater than the automatically chosen ones. We performed this assessment quantitatively by first defining a coclustering relationship for all pairs of profiles based on the clusters for the automatically chosen cutoff. We then established a “reference relationship” between every pair of profiles: either the two profiles of a pair were in the same cluster, or they were not. Next, we reperformed the clustering procedure for different cutoff values. For each different value, we observed the resulting coclustering relationship for all pairs of profiles and then calculated the fraction of the reference relationships that were maintained. We observed that >90% of the reference relationships were maintained for cutoff values ±30% of the automatically chosen value (data not shown), showing the stability of the structure of the clusters in Figs. 3 and 4 to the exact value of the clustering cutoff value.

Functional States of iMC1010v1 Are Globally Organized According to the Terminal Electron Acceptor and Carbon Source. We analyzed the growth environments corresponding to the computed gene expression and activity profiles in the clusters shown in Figs. 3 and 4. The clusters of expression profiles in Fig. 3 are generally distinguished by the fact that the growth environments to which they correspond use the same terminal electron acceptor. This distinction is not crisp, however, because many clusters characterized by different terminal electron acceptors overlap and/or are closer to each other than they are to other clusters characterized by the same electron acceptor. Distinguishing clusters based on other growth-environment components did not prove fruitful.

In contrast, the activity profile clusters in Fig. 4 could be clearly distinguished by three discriminating environmental characteristics. The vast majority (97%) of the activity profiles could be discriminated based solely on the available terminal electron acceptor, regardless of the available carbon, nitrogen, phosphate, or sulfur sources. All activity profiles in clusters 1–4 had dimethyl sulfoxide, fumarate, trimethylamine N-oxide, or none (i.e., anaerobic fermentation) as the electron acceptor; all in clusters 7 and 8 had NO2 or NO3 as the electron acceptor; and all in clusters 10–12 had O2 as the electron acceptor. Together, these nine clusters contained 97% of the activity profiles.

The clusters using the same electron-acceptor class are indicated by the ellipses in Fig. 4. The remaining 3% of the activity profiles were contained in the four clusters (5, 6, 9, and 13) that are separated within the ellipses. These four clusters represent discrimination based on the presence of glucose or gluconate in the growth environment. So, for example, whereas clusters 7–9 contained all activity profiles that used NO2 or NO3 as the electron acceptor, those that used glucose or gluconate were in cluster 9, and those using all other carbon sources were in clusters 7 and 8. This separation of clusters 7 and 8 from cluster 9 implies that iMC1010v1 varies significantly in its use when glucose or gluconate, as opposed to all other carbon sources, are present in the growth environment. This result is, in large part, due to Crp, which is inactive in the presence of glucose and gluconate because of low levels of cAMP and whose regulon is nearly twice the size (197 vs. 111) of the next-largest regulon (28).

The remaining organization in the set of activity profiles was attributable to the available nitrogen source. The distinct but overlapping clusters in Fig. 4 (indicated by different colors and cluster numbers) were mostly the result of some nitrogen sources eliciting different regulatory and metabolic responses. The nitrogen sources that distinguished overlapping clusters were adenine, adenosine, deoxyadenosine, guanosine, deoxyguanosine, cytidine, inosine, and N-acetyl-d-mannosamine. The fact that these clusters overlap indicates that the regulatory and metabolic networks are not as differently used for these nitrogen sources as compared with glucose/gluconate and the rest of the carbon sources.

The Activity of a Few TFs Distinguishes the Dominant Operating Modes of iMC1010v1. The clusters in Fig. 4 can be discriminated among based on three environmental components. Because the computation-based activity profiles include the state of the TFs, we can also distinguish the clusters based on TF activities. To determine the roles of the TFs in the clusters, we first detailed the activities of all TFs in the following sets of clusters: 1–4; 5, 6; 7, 8; 9; 10–12; and 13. Then, for every pair of sets of clusters, we report in Table 2 (which is published as supporting information on the PNAS web site) those TFs that were active in 100% of all simulations in one set and inactive in 100% of all simulations in another set. The implication of these results is that some or all of these identified TFs are global determinants of the expression program in response to the environments in which they are active.

Experimental Expression-Profiling Data Support the Clustering Results. To examine the computational results further, we sought to determine whether the observed structure in Figs. 3 and 4 was consistent with experimental gene-expression-profiling data. We computed gene-expression and activity profiles corresponding to nine different growth environments for which we had gene-expression data (data which were not used to generate the network and its computational results). We assessed whether the computational and experimental sets of profiles qualitatively mapped the same spatial organization. The assessment was performed by computing both Kendall's τ and Spearman's ρ rank-correlation coefficients based on the ranking of interprofile distances for the 9 × 8/2 = 36 pairs of experiment- and computation-based profiles. For the gene-expression profiles, we computed τ = 0.615 and ρ = 0.828, with associated P values of 6.3 × 10-8 and 4.9 × 10-7, respectively. For the activity profiles, we computed τ = 0.610 and ρ = 0.790, with associated P values of 8.5 × 10-8 and 1.5 × 10-6, respectively. These values are acceptably high, especially considering the measurement noise (mean r2 value between microarray replicates was 0.942). We concluded that the available experimental gene-expression data were highly consistent with the spatial organization of the clusters in Figs. 3 and 4. Because of the limited amount of available experimental data, this conclusion will require revalidation as additional expression-profiling data for appropriate minimal-media conditions are published.

Discussion

The reconstruction and integration of networks and assessment of their functional states is a major focus in systems biology. In this study, we presented a global characterization of the full range of functional states that an integrated transcriptional regulatory and metabolic network can exhibit. The literature-curated iMC1010v1 for E. coli was used as the basis for this study. Network growth states were simulated in an exhaustive library of 15,580 minimal-media conditions. Each functional state was described by an activity profile that contains both the calculated expression state(s) of each gene and the logical interactions between all TFs and the genes they are known to regulate. After clustering the activity profiles and applying a standard dimensionality-reduction technique, we showed that (i) the set of all possible network states has few dominant modes (Fig. 4); (ii) these modes are primarily organized according to the terminal electron acceptor and whether or not glucose or gluconate was the carbon source; (iii) the clusters can also be organized based on the activities of relatively few TFs (Table 2), consistent with recent genome-scale experimental (29, 30) results and limited expression profiling data; and (iv) the integrated network gives a crisper clustering structure of functional states than without regulatory information and should lead to focused further experimentation to determine the functional states of this network.

Fig. 4 demonstrates how all functional states of iMC1010v1 can be condensed to allow one to quickly and easily understand the functional capabilities and properties of the E. coli integrated transcriptional regulatory and metabolic network for a large number of growth environments. This representation and summary of functional states is not possible with a logical connectivity diagram of the transcriptional regulatory-network hierarchy. The result that a large genetic system of >1,000 genes operating in 15,580 different growth environments achieves few distinct overall responses is consistent with previous experimental results and discussion about “simplicity from complexity” (31) and “complexity for simplicity” (32). Singular value decomposition analysis of gene-expression data showed the transcriptional response of the yeast genome to be orchestrated through a few fundamental patterns of gene-expression change (35). With changing cellular conditions, a majority of genes were seen to transition between active and inactive states, at most, once or twice. The study of signaling networks (6) led to the proposition (also conceptually applicable to regulatory networks) that the objective of system complexity is to guarantee reliable system output. That is, system complexity is built in to robustly provide for simple behavior. The result of few global functional modes is also consistent with earlier theoretical work on logical control networks and their applicability to genomic regulatory networks (33). In this work, random Boolean networks parameterized to reflect fundamental features of polymer chemistry were found to have few state cycle attractors, which are interpreted to be cell phenotypes. The results presented here, based on the largest reconstructed genomic regulatory system to date, are consistent with this hypothesis.

The result that the terminal electron acceptor and carbon source are what primarily organize the iMC1010v1 network in E. coli is consistent with a priori intuition and two recent genome-scale experiments (29, 30). Both of these experiments pertain to cluster-defining TFs in Table 2. The first analysis (29) focused on Crp and found it to directly regulate over 200 operons involved in such diverse activities as carbohydrate transport and metabolism, energy production, amino acid metabolism, ion-transport systems, and further transcriptional regulation. Additionally, more than one-third of the directly regulated genes were of unknown function. The authors of that study predicted that there are several hundred genes directly or indirectly subjected to regulation by Crp. The second analysis (30) focused on ArcA, the cytoplasmic component of the ArcB/ArcA two-component signal-transduction system that regulates genes in response to redox conditions (i.e., changing respiratory conditions). The authors of that study conservatively estimated that 100–150 operons may be under transcriptional control of ArcA.

Of the other cluster-defining TFs identified in Table 2, Fnr and NarL are also known to be widely influential. Fnr, which belongs to the Crp family of transcriptional regulators and is active in the absence of O2, activates genes for fermentation and anaerobic growth and represses genes for aerobic growth. Fnr activates transcription initiation in a similar way to Crp, and each can bind to the DNA site for the other protein (34, 35). Because of these abilities, the authors of the Crp study proposed that some of their identified Crp-regulated operons may actually be Fnr-regulated in vivo. NarL acts as an activator and repressor in response to a nitrate/nitrite environment to enable anaerobic respiration using either nitrite or nitrate as the terminal electron acceptor.

The two remaining transcription factors identified in Table 2, CaiF and FeaR, are known to regulate only relatively few genes in comparison with the other TFs. It appears that CaiF and FeaR are narrowly acting TFs that are directly regulated only by global TFs. Nonetheless, the result that they were singled out with known global regulators may make them interesting targets for study.

Figs. 3 and 4 show that the more detailed, or informative, state description leads not to a more complicated view of the cell's “state space,” but to a markedly simpler one. We interpret this result to be a reflection of the internal organization of the cell, which is that of a system robustly structured against all but a few characteristic behaviors. If this is the case, then we expect that this reflection will become increasingly pronounced as large amounts of additional types of state information are integrated.

Conclusion

Taken together, the results of this study show consistency between the computed behavior of the iMC1010v1 network and what one would intuitively expect from the known physiology of E. coli. Fundamentally, iMC1010v1 represents a biochemically, genetically, and genomically structured representation of a large body of experimental information. The simulations investigate all of the functional states that this network can take on, and the clustering of these states shows that the dominant features of the network responses are consistent with physiological functions and a limited set of expression-profiling data.

This study, on the whole, represents an integration of diverse sources of data. Such structured integration of diverse data sets represents one of the main current challenges in systems biology. By demonstrating that the integration of the metabolic and transcriptional regulatory networks in a mechanistic format can be used to generate an improved understanding of the functional states of E. coli metabolism, we have demonstrated the utility of the systems-biology approach and a framework for its implementation.

Supplementary Material

Supporting Information

Acknowledgments

We thank Markus Herrgard for helpful review of the manuscript and the anonymous reviewers for helpful and insightful comments. This work was supported, in part, by National Institutes of Health Grant GM068837.

Author contributions: C.L.B. designed research; C.L.B. and C.D.H. performed research; C.L.B. and C.D.H. analyzed data; C.L.B. and B.O.P. wrote the paper; and J.L.R. reconstructed the organism.

Conflict of interest statement: University of California at San Diego has a related patent application (U.S. Patent Application 20040072723) that has been licensed.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: TF, transcription factor.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES