EURASIP Journal on Bioinformatics and Systems Biology. 2008 Jun 5;2008(1):248747. doi: 10.1155/2008/248747

Recovering Genetic Regulatory Networks from Chromatin Immunoprecipitation and Steady-State Microarray Data

Wentao Zhao 1, Erchin Serpedin 1, Edward R Dougherty 2
PMCID: PMC3171391  PMID: 18584039

Abstract

Recent advances in high-throughput DNA microarrays and chromatin immunoprecipitation (ChIP) assays have made it possible to learn the structure and functionality of genetic regulatory networks. In light of these heterogeneous data sets, this paper proposes a novel approach for reconstructing genetic regulatory networks based on the posterior probabilities of gene regulations. Built within the framework of Bayesian statistics and computational Monte Carlo techniques, the proposed approach avoids the dichotomy of classifying gene interactions as either connected or disconnected, and thereby significantly reduces inference errors. Simulation results corroborate the superior performance of the proposed approach relative to existing state-of-the-art algorithms. A genetic regulatory network for Saccharomyces cerevisiae is inferred from published real data sets, and biologically meaningful results are discussed.

1. Introduction

Currently, one of the most important research problems in molecular biology and bioinformatics is to uncover the mechanisms that govern gene regulation, which is considered to play a fundamental role in all processes taking place in living cells. Learning the structure and machinery of gene regulation opens up the possibility of understanding and controlling the functioning of organisms at the molecular level, and of designing intelligent therapies and drugs. In a biological process such as the cell cycle or an environmental response, a gene's product, the protein, can serve as a transcription factor of a target gene by binding to the target gene's regulatory region on chromatin and affecting its transcription. The protein can also influence another gene's expression in subsequent stages, for example, through splicing or translation. Alternatively, these protein-gene relationships can be viewed as gene-gene interactions, and are modeled in general as genetic regulatory networks.

Recent years have witnessed a number of different frameworks for modeling genetic regulatory networks, ranging from fine-scale modeling at the molecular level in terms of partial differential equations and stochastic equations, to large-scale modeling at the gene and protein level in terms of Boolean and probabilistic Boolean networks, and (dynamic) Bayesian networks; see, for example, [1–6] and their toolboxes [7–9]. The fine-scale modeling techniques are used to capture the detailed biochemical aspects of molecular interactions and are in general very computationally demanding. On the other hand, the large-scale models provide a global view of the interactions among the constituent elements of genetic regulatory networks and are generally represented in terms of graphs.

In the mid-1990s, the birth of DNA microarrays equipped the industry with the capability to measure simultaneously the concentrations of genome-wide mRNA expressions. The gene expression data produced thereafter by gene chips have attracted extensive research on the inference of genetic regulatory networks based on various network models [10–18]. There are two types of DNA microarray data sets: time series (or time-dependent) and time-independent (also called steady-state or single-point time series) data sets. In general, time-independent gene expression profiles are capable of recovering steady-state attractors, but fail to recover the direct and oriented (temporal regulating) relationships. On the other hand, time series data sets can improve the inference greatly compared with time-independent data sets [13]. However, financial costs, ethical concerns, and implementation issues often prevent the collection of the more informative time series data. Recent statistics show that about 70% of published data are time independent [19]. Therefore, steady-state analysis is highly valuable despite the difficulty of making accurate inferences about temporal relationships.

Inference of gene regulatory networks based solely on the information provided by microarray data is limited by a number of factors: number of available microarrays, quality of data samples, experimental noise, and errors (cross-hybridizations). It is also known that post-transcriptional modifications and transcripts that are present at low levels are generally not detectable by microarrays. Since the gene activity is measured by the mRNA level, the underlying assumption is that there is a significant correlation between the mRNA level and the amount of protein associated with mRNA. However, the magnitude of such a correlation varies significantly depending on the type of protein involved. Therefore, a combined approach which, besides gene expression data, exploits additional data sources is likely to enhance the inference process.

The advent of in vivo chromatin immunoprecipitation (ChIP) assays has made it possible to test whether a protein acting as a transcription factor binds to a specific DNA segment. Hence, ChIP assays serve as a promising mechanism to examine regulatory relationships. In ChIP experiments, the protein is immobilized on the chromatin, then the chromatin is broken into DNA fragments, and the DNA-protein complexes are immunoprecipitated by using antibodies corresponding to the tested protein. Afterwards the DNA bound by the protein in question can be isolated and identified by using a cDNA microarray chip. The whole process is also called a ChIP-chip experiment, and it has several disadvantages. The protein to be tested has to possess a specific antibody, which might not have been synthesized or discovered, or might not exist. In addition, transcriptional regulation is a complex process that takes several different forms. The binding of the transcription factor to the promoter region of the target gene is the most direct mode. Especially for eukaryotic organisms, some regulatory bindings take place at a region far away from the regulated gene. This fact makes the binding information questionable for determining the regulation relationships. Furthermore, the experimental results are represented by p-values and the determination of the binding relationship is achieved through threshold comparison. However, the selection of the p-value threshold introduces a dilemma. A stringent threshold retains only the most probable binding relationships but might miss many true ones, while a lenient threshold infers more relationships, among which more might be false alarms. A good tradeoff is not easy to make. Besides, the cost factor also has to be considered. Generally, ChIP-chip experiments are very expensive and testing thousands of proteins is not affordable.

A combination of both steady-state microarray data and ChIP-chip data might help in making more accurate inferences. Intuitively, each of these two types of data compensates for the shortcomings of the other. This motivates us to propose a Bayesian approach to analyze both data sets jointly and to establish a confidence measure of gene interactions. The proposed scheme possesses six key features which make it different from the existing algorithms. First, gene expression data in steady state are considered, while time course data are used in other works like [11,13,20]. Second, most of the current schemes recover a unique genetic network represented by a graph which best fits the observed data in a certain metric, while the proposed approach determines the posterior probabilities for all gene-pair interactions and avoids making a dichotomous decision that classifies each gene interaction as being either connected or disconnected. The proposed approach can be easily transformed into a dichotomous scheme by preserving only the highly probable gene interactions. Third, the underlying structural model is assumed to be a directed graph that allows cycles (feedback loops), and directed acyclic graphs are treated as special cases. This contrasts with Bayesian networks, which are directed acyclic graphs. Feedback loops are a common network motif in biological processes and their function is to yield the necessary redundancy and stability for the system [1]. Therefore, methods based on Bayesian networks, for example, [21–23], lose their validity in the inference of cyclic graphs. Fourth, the proposed approach assumes continuous-valued variables, and this prevents the information loss incurred by data quantization. This represents an advantage compared with discrete-valued networks such as [21–23]. Fifth, the proposed connectivity score is oriented and has a very clear meaning, in the sense of posterior probabilities, while the existing scores based on mutual information [14,18,24] are vague and lack orientation information. Sixth, in the proposed approach the system kinetics are assumed to be nonlinear, while linear models are commonly utilized for the purpose of simplification [12,15]. Besides, the proposed scheme establishes a general framework whose components can be customized to fit the nature of the underlying biological system.

The rest of the paper is organized as follows. Section 2 discusses the graphical model and system dynamics that govern the genetic expressions. Section 3 translates the p-values of ChIP-chip experiments into regulation probabilities and formulates the inference algorithm through Bayesian analysis. In Section 4, the proposed algorithm and three other schemes are simulated on a set of artificial networks. Performance comparisons illustrate that the proposed algorithm outperforms the others in terms of several metrics. The robustness of the kinetics model is also discussed via simulations. Real data sets are then used in the proposed inference framework, and a genetic network is presented to account for the genetic response to environmental changes. Finally, Section 5 concludes the paper with remarks on possible future work.

2. Methods

Genetic regulatory networks can be represented by a parameterized graph Inline graphic, where Inline graphic and Inline graphic stand for the graph structure and parameter set, respectively. The graph structure qualitatively explains the direct gene interactions, while the parameter set quantitatively describes the system kinetics.

2.1. Structural Model

The graph Inline graphic is employed to map gene interactions at transcriptional level, where Inline graphic denotes the set of vertices (genes) and Inline graphic stands for the set of edges (regulation relationships). If gene Inline graphic regulates gene Inline graphic, graphically such a relation is represented in terms of an oriented edge Inline graphic, where Inline graphic is a parent of Inline graphic and Inline graphic is considered a child of Inline graphic. All genes that present incidence edges with gene Inline graphic represent the set of parental genes of Inline graphic, and are compactly denoted in terms of the notation Inline graphic. If two genes Inline graphic and Inline graphic interact with each other but the regulation orientation cannot be determined, an undirected edge is laid between the two genes as Inline graphic, which means both orientations are possible. A sequence of consecutive-oriented edges constitutes a directed path. If there is no directed path which starts and ends at the same vertex, in other words, the graph contains no loops, the graph is called a directed acyclic graph (DAG). DAGs lie at the basis of Bayesian networks, which are commonly employed to model causal relationships [25].

General directed graphs (with possibly cycles) will serve as our structural model since they are consistent with the features exhibited by biological systems, in which loops account for system redundancy and stability. Given the graph structure Inline graphic, the parent set Inline graphic is specified for any gene Inline graphic. For conciseness, the subscript Inline graphic associated with some variables is omitted in the analysis procedure when the context has clearly specified the gene in question. Next, we discuss the system kinetics and parameters defined in Inline graphic.

2.2. System Kinetics

The system kinetics represents the dynamics that govern the gene mRNA concentrations in terms of gene-gene interactions. It can be modeled by a set of differential equations (DEs). A simplified form is a set of linear DEs. However, we adopt the more complex model previously employed in [16,17], since it is much more realistic and accounts for expression saturation. Given a gene Inline graphic, its parent set Inline graphic can be further partitioned into two disjoint subsets: the activator set Inline graphic and the repressor set Inline graphic, that is, Inline graphic and Inline graphic. The kinetics of gene Inline graphic can be explained by the following differential equation:

graphic file with name 1687-4153-2008-248747-i32.gif (1)

where Inline graphic is the concentration of gene Inline graphic's transcriptional product, namely, mRNA. In this paper, to simplify the exposition, the gene name and its expression are used interchangeably. The rate of change of gene Inline graphic is controlled by its activating and repressing parents, denoted individually by Inline graphic and Inline graphic. Inline graphic and Inline graphic serve as the regulating factors corresponding to each activator and repressor. Inline graphic and Inline graphic assume positive values, and hence can be modeled by a gamma distribution with shape and scale parameters Inline graphic. Without favoring either class, we assume that the activators and repressors share the same gamma distribution for their regulation factors. Other light-tailed distributions, such as the Weibull and lognormal distributions, could also be employed. However, since the gamma distribution is popular in modeling reaction rates and molecular concentrations [26], the gamma distribution is chosen here. Inline graphic stands for the gene degradation rate and the time scale can be properly chosen in order to normalize Inline graphic to the unit value Inline graphic. Inline graphic represents the expression baseline rate, that is, the expression rate for Inline graphic when there is neither activator nor repressor regulating the target gene Inline graphic. Suppose that Inline graphic represents the observation of Inline graphic, then Inline graphic assumes the form Inline graphic, where Inline graphic incorporates all noise sources and is modeled by an additive Gaussian random variable with zero mean and variance Inline graphic.

In response to environmental changes or stimuli, a mature biological system always converges to a certain steady state, in which all genes stay in equilibrium and do not change their expressions. In this context, periodic processes, for example, the cell cycle and circadian rhythm, are excluded from our research interest. By setting Inline graphic and Inline graphic, the observation Inline graphic of the steady-state gene expression for gene Inline graphic can be expressed as

graphic file with name 1687-4153-2008-248747-i59.gif (2)

Given a parent structure Inline graphic for gene Inline graphic, the parameters in Inline graphic can be summarized as follows.

(1) For each parent Inline graphic, a binary variable is required to specify whether the parent is an activator or repressor, that is, Inline graphic, where Inline graphic is the indicator function and it assumes the value 1 when Inline graphic, and 0 otherwise. It can be modeled by a Bernoulli random variable with known success probability Inline graphic.

(2) For each activator Inline graphic and repressor Inline graphic, it is assumed that the regulating factors Inline graphic, where Inline graphic are known.

(3) The baseline parameter Inline graphic is usually known.

(4) The noise Inline graphic, where Inline graphic can be set to a specific value or estimated.

It is worth noting that the choice of nonlinear differential equation and parameter priors does not influence the flow of analysis. Our scheme is a general framework and the detailed parameters can be easily customized to fit different scenarios. There are various mathematical models for system kinetics, such as [27–29]. The kinetics in (1) is chosen as our dynamic model because it possesses the property of saturation, a key idea of Michaelis-Menten kinetics [29]. Besides, it is fairly simple and it also takes account of most other biological properties. Therefore, in the analysis of the real data set, we assume that the proposed kinetics holds.

3. Inference Method

Consider a system composed of Inline graphic genes indexed by Inline graphic. ChIP-chip experiments can be conducted to examine whether gene Inline graphic's corresponding protein binds gene Inline graphic's regulatory region. Usually this regulatory sequence is a promoter region which is located within 600 base pairs upstream of the coding region of gene Inline graphic. The experimental results are represented in terms of p-values. In the first step, it is necessary to translate the p-value Inline graphic into the probability of existence of a regulation relationship from gene Inline graphic to gene Inline graphic, which is denoted as Inline graphic. This probability will act as the prior knowledge to integrate gene expression data.

3.1. Incorporating ChIP-Chip Data

The p-value lies within the range [0, 1]. After studying the properties of microarray data, Allison proposed to use a mixture of Beta distributions to model the p-value [30]. If the transcription factor Inline graphic regulates gene Inline graphic, it is assumed that the ChIP-chip experiment produces a p-value Inline graphic which conforms to a Beta distribution with parameters Inline graphic,

graphic file with name 1687-4153-2008-248747-i89.gif (3)

where Inline graphic stands for the probability density function and Inline graphic represents the beta function. On the other hand, if Inline graphic does not regulate Inline graphic, the p-value assumes a different Beta distribution with parameters Inline graphic:

graphic file with name 1687-4153-2008-248747-i95.gif (4)

Based on the knowledge provided by established and verified genetic networks, one can infer a prior knowledge about the probability of connectivity between arbitrary genes, denoted as Inline graphic for all Inline graphic. Such statistics regarding the network connectivity can be found in the open literature, for example, the data sets for yeast [31], and Drosophila [32]. By applying Bayes theorem, we obtain

graphic file with name 1687-4153-2008-248747-i98.gif (5)

For simplicity, a uniform distribution can be alternatively employed to account for the p-value when Inline graphic. In this case, Inline graphic, and (5) takes the form

graphic file with name 1687-4153-2008-248747-i101.gif (6)

The determination of Inline graphic and Inline graphic depends on the experimental knowledge of the accuracy of selecting p-value thresholds. In the first step, a p-value threshold Inline graphic is imposed, then the validity of all bindings with p-values less than Inline graphic is corroborated by biological experiments. In this way, we can gain knowledge of the probability Inline graphic, which can be written in the form of

graphic file with name 1687-4153-2008-248747-i107.gif (7)

Some works in the literature, for example, [33], have made the observation that at a p-value threshold of 0.001, the frequency of false positives is 6%–10%, that is, Inline graphic. Taking into account these special points, we can determine the pair Inline graphic in a small range. In our case, Inline graphic and Inline graphic. Finally, a lookup table that maps each p-value to the edge existence probability can be set up; it needs to be computed only once. Building the table is a one-time overhead, and it consumes few computational resources at run time.
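The translation from a p-value to a prior regulation probability can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `beta_pdf` and `edge_probability`, the argument names, and the numeric Beta parameters and connectivity prior in the usage lines are all hypothetical, since the values used in the paper are not recoverable from this text. The sketch applies Bayes' theorem to a two-component Beta mixture; leaving the no-regulation component at Beta(1, 1) gives the uniform simplification described for (6).

```python
import math

def beta_pdf(p, a, b, eps=1e-12):
    """Density of a Beta(a, b) distribution at p, with p clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log of the beta function
    return math.exp((a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p) - log_norm)

def edge_probability(p_value, prior, a1, b1, a0=1.0, b0=1.0):
    """Probability of regulation given a ChIP-chip p-value, by Bayes' theorem.

    prior  : assumed prior probability that an arbitrary gene pair is connected
    a1, b1 : Beta parameters of the p-value density when regulation exists
    a0, b0 : Beta parameters when it does not (Beta(1, 1) is the uniform case)
    """
    f1 = beta_pdf(p_value, a1, b1)
    f0 = beta_pdf(p_value, a0, b0)
    return f1 * prior / (f1 * prior + f0 * (1.0 - prior))

# Hypothetical parameters, chosen only to make the example concrete.
lookup = {p: edge_probability(p, prior=0.1, a1=0.5, b1=5.0)
          for p in (1e-4, 1e-3, 1e-2, 1e-1, 0.5)}
```

With a regulation density that concentrates near zero, smaller p-values map to larger edge probabilities, which is the behavior the lookup table is meant to capture.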

3.2. Exploiting Steady-State Gene Expression Data

Assume that Inline graphic observations of expression vector are obtained and stored in matrix Inline graphic. Next, we develop a computational approach to establish the posterior probability of the regulation Inline graphic, that is, the probability of the existence of the edge Inline graphic, which is represented by Inline graphic. This posterior can be obtained through integration over the whole parental gene set and parameter space for gene Inline graphic:

graphic file with name 1687-4153-2008-248747-i118.gif (8)

where the function Inline graphic is the indicator function, which takes 1 if Inline graphic and 0 otherwise. Applying Bayes theorem, Inline graphic can be expressed as

graphic file with name 1687-4153-2008-248747-i122.gif (9)

where Inline graphic denotes the observations of gene Inline graphic, and Inline graphic represents the collection of all the observations pertaining to all genes excluding those of gene Inline graphic. Inline graphic denotes the probability density of the high-dimensional parental model given the observation of ChIP-chip data. Inline graphic stands for the gene expression likelihood given the parental values and the graphical model. It is a Gaussian distribution with known variance and mean determined by the first part of (2). The second equality in (9) holds because the ChIP-chip experiment and the steady-state gene expression measurements are assumed to be independent. By plugging (9) into (8), it can be inferred that

graphic file with name 1687-4153-2008-248747-i129.gif (10)

The integrals in the numerator and denominator of (10) cannot generally be evaluated in closed form. However, Monte Carlo methods make it possible to evaluate the posterior probabilities numerically. We can generate Monte Carlo samples based on the model probability density Inline graphic and the integration can be obtained by averaging over these samples. Then the posterior probabilities can be estimated by

graphic file with name 1687-4153-2008-248747-i131.gif (11)

Assuming that the selection of a parent as an activator is performed in an independent manner, and that the selection of the regulation factor value is also performed independently, the model probability density Inline graphic can be further expanded by using the chain rule

graphic file with name 1687-4153-2008-248747-i133.gif (12)

Equation (12) conveys the idea that the random samples of graphical models can be sequentially created and processed. First the network structure is created based on the binding probability of gene regulation obtained in the ChIP-chip experiment, then each parent is randomly assigned to represent an activator or repressor, and finally regulation factors are generated.

3.3. Algorithm Formulation

Our computational procedure can be briefly formulated in terms of Algorithm 1, where the Matlab coding convention is used to write the pseudocode. There exist Inline graphic genes in the system. An Inline graphic matrix is created to represent the p-values produced in the ChIP-chip experiment. We collect Inline graphic steady-state gene expression samples. The output entry Inline graphic stands for Inline graphic, and Inline graphic denotes the number of Monte-Carlo iterations. Lines 1 and 2 deal with the ChIP-chip experimental data and translate p-values into the binding probabilities by using (5). The results are stored in matrix Inline graphic. Lines 3 and 4 perform the preprocessing of the gene expression data. Let Inline graphic be the values of a specific gene expression in ascending order. The smallest two values, Inline graphic, and the largest two values, Inline graphic, are treated as outliers and discarded. The dynamic range is defined as Inline graphic. The gene expressions are normalized as follows: the smallest two samples are assigned the null value and the largest two samples are assigned the unit value; the intermediary samples Inline graphic are normalized as Inline graphic; if there is a missing sample, it is recovered through interpolation by gene's mean expression. Lines 12 through 16 implement the numerator of (11), and Line 17 computes the denominator of (11).
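The preprocessing step described above can be sketched as follows. The exact expressions for the dynamic range and the normalized value are not recoverable from this text, so the sketch assumes the dynamic range is the difference between the third-largest and third-smallest values and that intermediate samples are scaled linearly over it; the function name `normalize_expression` and the use of `None` for a missing sample are also illustrative choices.

```python
def normalize_expression(samples):
    """Normalize one gene's steady-state samples to the range [0, 1].

    `samples` is a list of floats; None marks a missing measurement, which is
    replaced by the mean of the observed values before normalization.
    """
    observed = [x for x in samples if x is not None]
    if len(observed) < 6:
        raise ValueError("need at least six observed samples")
    mean = sum(observed) / len(observed)
    filled = [mean if x is None else x for x in samples]

    ordered = sorted(filled)
    # Assumed dynamic range: third-smallest to third-largest value, so the two
    # smallest and two largest samples are treated as outliers.
    low, high = ordered[2], ordered[-3]
    span = high - low
    if span <= 0:
        raise ValueError("dynamic range is zero")

    # Clamping sends the low outliers to 0 and the high outliers to 1.
    return [min(1.0, max(0.0, (x - low) / span)) for x in filled]

normalized = normalize_expression([5, 1, 2, 3, 4, 6, 7, 100])
```

Clamping keeps a single extreme measurement, such as the value 100 in the usage line, from compressing the remaining samples into a narrow band.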

The algorithm can be easily reorganized into a parallel form so that distributed computational resources can be exploited efficiently. The entries of the output matrix Inline graphic represent the posterior probabilities of regulation relationships between any two genes. The matrix is directional (asymmetric), and its entries have a clear probabilistic meaning compared with other vague connectivity metrics, for example, mutual information. It gives biologists the flexibility to examine the most significant interactions first and then to proceed to edges with less supporting evidence. Therefore, it is advantageous relative to a purely dichotomous scheme, in which genes are treated as being either connected or disconnected. A probability threshold can be imposed to change the algorithm into a dichotomous classifier. Since the posterior probability has a universal meaning, this threshold can be easily selected, usually within the range [0.3, 0.9]. The choice involves a tradeoff among the different performance metrics.

Algorithm 1: Inference of connectivity significance.

(1) Input ChIP-chip data set Inline graphic;

(2) Translate p-values to construct the binding probability matrix Inline graphic;

(3) Input gene expression data set Inline graphic;

(4) Normalize the expression data so that each expression is within the range of [0, 1];

(5) Initialize Inline graphic;

(6) for Inline graphic to Inline graphic do

(7)     Randomly create a directed graph and the adjacency matrix Inline graphic based on Inline graphic;

(8)     for Inline graphic to Inline graphic do

(9)         For gene Inline graphic's parents specified in Inline graphic, randomly assign them to be activators or repressors;

(10)        For each parent, randomly create their regulation factor Inline graphic or Inline graphic;

(11)        Inline graphic;

(12)        for Inline graphic to Inline graphic do

(13)            if Inline graphic then

(14)                Inline graphic;

(15)            end if

(16)        end for

(17)        Inline graphic;

(18)    end for

(19) end for

(20) Inline graphic;

(21) return Inline graphic.
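The Monte Carlo loop of Algorithm 1 can be sketched in Python as below. This is an illustration under several assumptions rather than the authors' code: the steady-state mean is supplied by the caller, and the `toy_mean` function used here is a hypothetical saturating stand-in, not equation (2); the names `infer_posteriors`, `rho`, `shape`, `scale`, and `sigma` are invented for the example; weights are accumulated as log-likelihoods and rescaled for numerical stability, a design choice that does not change the ratio estimated in (11). The convention is that `prior[j][i]` and `post[j][i]` refer to the edge from gene j to gene i.

```python
import math
import random

def toy_mean(sample, signs, factors, base=0.2):
    """Hypothetical saturating steady-state mean (a placeholder, NOT equation (2))."""
    act = sum(factors[j] * sample[j] for j in signs if signs[j] > 0)
    rep = sum(factors[j] * sample[j] for j in signs if signs[j] < 0)
    return (base + act) / (1.0 + act + rep)

def infer_posteriors(prior, X, mean_fn, n_iter=2000, rho=0.5,
                     shape=1.0, scale=1.0, sigma=0.05, seed=0):
    """Estimate edge posteriors by likelihood-weighted sampling from the prior."""
    rng = random.Random(seed)
    n = len(prior)
    draws = [[] for _ in range(n)]  # per target gene: (log-likelihood, parent list)
    for _ in range(n_iter):
        for i in range(n):
            # Sample a parent set from the ChIP-chip binding probabilities.
            parents = [j for j in range(n) if j != i and rng.random() < prior[j][i]]
            # Each parent is an activator (+1) with probability rho, else a repressor (-1).
            signs = {j: 1 if rng.random() < rho else -1 for j in parents}
            # Regulation factors are drawn from a gamma distribution.
            factors = {j: rng.gammavariate(shape, scale) for j in parents}
            # Gaussian log-likelihood; the constant term cancels in the ratio.
            log_l = sum(-0.5 * ((s[i] - mean_fn(s, signs, factors)) / sigma) ** 2
                        for s in X)
            draws[i].append((log_l, parents))
    post = [[0.0] * n for _ in range(n)]
    for i in range(n):
        top = max(log_l for log_l, _ in draws[i])
        total = 0.0
        for log_l, parents in draws[i]:
            w = math.exp(log_l - top)      # rescaled weight, at most 1
            total += w                     # denominator of the estimate
            for j in parents:
                post[j][i] += w            # numerator: samples containing edge j -> i
        for j in range(n):
            post[j][i] /= total
    return post

# Toy data: gene 0 activates gene 1 with factor 3; gene 2 is unrelated.
x0 = [k / 9 for k in range(10)]
x2 = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
x1 = [(0.2 + 3.0 * a) / (1.0 + 3.0 * a) for a in x0]
X = [[a, b, c] for a, b, c in zip(x0, x1, x2)]
prior = [[0.0 if i == j else 0.5 for j in range(3)] for i in range(3)]
post = infer_posteriors(prior, X, toy_mean)
```

Because the loop over target genes is independent for each gene, the inner body could be distributed across workers, in line with the remark that the algorithm can be reorganized into a parallel form.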

4. Results

The simulation consists of two parts. In the first part, artificial networks are created and the performance of the proposed algorithm is compared with other representative algorithms available in the literature, namely the relevance network (RN) method [14], Chow-Liu algorithm [24], and ARACNE [18]. In the second part, the algorithm is tested on the real Saccharomyces cerevisiae (budding yeast) data set and a biologically meaningful genetic network is inferred for the genetic response to environmental changes.

4.1. Simulation on Artificial Networks

The proposed algorithm is compared with three other algorithms to evaluate its capability of recovering genetic networks based on gene expression data alone. The relevance network (RN) model [14] represents a robust inference method based on gene expression profiles. In the first step, it computes the mutual information between any two genes Inline graphic and Inline graphic, denoted as Inline graphic. Then it declares two genes Inline graphic and Inline graphic to be relevant if their mutual information exceeds a prespecified threshold, and it lays down an undirected edge as Inline graphic. Hence, RN measures the significance of gene interactions in terms of mutual information between the gene expressions and produces an undirected cyclic graph. The Chow-Liu algorithm [24] approaches the inference problem by finding the maximum spanning tree in which the edge weights stand for the mutual information. However, it loses validity if the underlying model is a cyclic graph. In addition, when the graph is densely connected, this scheme might miss too many edges. The ARACNE algorithm [18] exploits the data processing inequality (DPI). It starts with a fully connected graph and a predefined mutual information threshold. Whenever the mutual information between two genes Inline graphic and Inline graphic, that is, Inline graphic, is less than the threshold, it disconnects the two genes. Next, in the preliminary graph, if there exists Inline graphic so that Inline graphic, then it disconnects Inline graphic and Inline graphic. In our simulations, we resort to an available and efficient Matlab toolbox [34] to estimate the mutual information.
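The thresholding and DPI pruning steps described for ARACNE can be sketched as follows. This is an illustrative sketch, not the reference implementation: it assumes a precomputed symmetric mutual information matrix `mi`, uses the commonly stated form of the DPI test (an edge is dropped when its mutual information is smaller than that of both other edges of a fully connected triplet), and the function name `aracne_prune` is hypothetical. The preliminary edge set it builds in the first step is also what a relevance network would return for the same threshold.

```python
def aracne_prune(mi, threshold):
    """Return undirected edges (i, j), i < j, kept after thresholding and DPI pruning."""
    n = len(mi)
    # Step 1: preliminary graph keeps every pair whose MI reaches the threshold.
    preliminary = {(i, j) for i in range(n) for j in range(i + 1, n)
                   if mi[i][j] >= threshold}

    def connected(a, b):
        return (min(a, b), max(a, b)) in preliminary

    # Step 2: within each triplet of the preliminary graph, drop the weakest edge.
    removed = set()
    for i, j in preliminary:
        for k in range(n):
            if k == i or k == j:
                continue
            if (connected(i, k) and connected(j, k)
                    and mi[i][j] < min(mi[i][k], mi[j][k])):
                removed.add((i, j))
                break
    return preliminary - removed

# Chain 0 - 1 - 2: the indirect pair (0, 2) has the weakest mutual information.
mi = [[0.0, 0.8, 0.3],
      [0.8, 0.0, 0.7],
      [0.3, 0.7, 0.0]]
kept = aracne_prune(mi, threshold=0.1)
```

All pruning decisions are made against the preliminary graph rather than the partially pruned one, so the result does not depend on the order in which edges are visited.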

4.1.1. Performance Definition

Before making performance comparisons, we define inference errors and performance metrics. Because RN, Chow-Liu, and ARACNE algorithms all construct undirected graphs, we have to disregard the orientation information inferred by the proposed algorithm. The synthetic and inferred graphs are represented by Inline graphic and Inline graphic, respectively. The two graphs share the same set of vertices but differ in the set of edges.

There are two types of inference errors. The type-1 errors are false positives (FP) and are also called false alarms. If the inference algorithm determines an interaction for two vertices Inline graphic and Inline graphic in the inferred graph, denoted as Inline graphic, but there is no such edge in the synthetic graph, that is, Inline graphic, then an FP is produced. The number of FPs, represented by Inline graphic, can be counted as follows:

graphic file with name 1687-4153-2008-248747-i191.gif (13)

where Inline graphic stands for the logical AND operator. The type-2 errors are false negatives (FN), also called misses. If the inference does not discover the connectivity Inline graphic which resides in the synthetic network, an FN is generated. The number of FNs, denoted by Inline graphic, is obtained by

graphic file with name 1687-4153-2008-248747-i195.gif (14)

Correct inference can also be divided into two categories. If Inline graphic and Inline graphic, the correctness is defined as a true positive (TP). Its summation, annotated by Inline graphic, is

graphic file with name 1687-4153-2008-248747-i199.gif (15)

On the other hand, if Inline graphic and Inline graphic, this correctness is called a true negative (TN). The number of TNs, represented by Inline graphic, is defined as follows:

graphic file with name 1687-4153-2008-248747-i203.gif (16)

Different performance metrics have been proposed in the literature. The three most popular are considered here. The first metric, referred to as the Hamming distance, is the sum of all the inference errors and is given by

$$\text{Hamming distance} = \#\mathrm{FP} + \#\mathrm{FN} \qquad (17)$$

The Hamming distance is widely accepted as a good measure of the distance between two graphs.

The second metric is called the sensitivity, and is defined as

$$\text{sensitivity} = \frac{\#\mathrm{TP}}{\#\mathrm{TP} + \#\mathrm{FN}} \qquad (18)$$

The sensitivity describes the inference algorithm's ability to identify the regulation relationships among genes. The third metric is called the specificity, and it assumes the form

$$\text{specificity} = \frac{\#\mathrm{TN}}{\#\mathrm{TN} + \#\mathrm{FP}} \qquad (19)$$

The specificity represents the inference algorithm's capability to avoid falsely connecting two unrelated genes.
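The error counts and the three metrics defined in this subsection can be computed with a short sketch like the one below. It assumes undirected comparison, as stated at the start of the subsection, and no self-loops; the function name `evaluate` and the dictionary keys are illustrative and do not come from the paper.

```python
def evaluate(n_genes, true_edges, inferred_edges):
    """Compare two undirected graphs on the same n_genes vertices.

    Edges are pairs (i, j); orientation is ignored and self-loops are assumed absent.
    """
    truth = {frozenset(e) for e in true_edges}
    guess = {frozenset(e) for e in inferred_edges}
    n_pairs = n_genes * (n_genes - 1) // 2

    tp = len(truth & guess)       # edges present in both graphs
    fp = len(guess - truth)       # false alarms
    fn = len(truth - guess)       # misses
    tn = n_pairs - tp - fp - fn   # pairs unconnected in both graphs

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "hamming": fp + fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

result = evaluate(4, [(0, 1), (1, 2), (2, 3)], [(1, 0), (0, 2)])
```

In the usage line the inferred edge (1, 0) matches the true edge (0, 1) because orientation is disregarded, so the example yields one true positive, one false positive, and two false negatives.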

4.1.2. Simulation on the Proposed Kinetics

A set of artificial networks is created based on the system dynamic equation (1). Each network has 30 vertices and 60 oriented edges. This network scale is selected in view of the available computational resources and the size of the biological network that we are going to infer. The steady-state data are sampled by emulating gene knockout experiments. In each experiment, one gene's expression is forced to 0 while all other genes are free to change their expressions. The initial values of the system are randomly generated. When the system converges to equilibrium, Gaussian noise Inline graphic is added and a few samples are obtained. All genes are shut down one by one. An extra in silico experiment in which no gene is shut down is also performed; its samples correspond to the wild-type strain.

Different numbers of steady-state samples were generated based on the adopted system kinetics. The transcription factor is assumed to be an activator or repressor with equal probability, that is, Inline graphic. The baseline parameter Inline graphic and the gamma parameters of regulation factors are Inline graphic so that the regulation factor has a unit mean. The Chow-Liu algorithm creates a spanning tree; therefore, it preserves only 29 edges, while the original synthetic network possesses 30 vertices and 60 edges. In order to make comparisons, we tune the parameters of the other three schemes so that the number of inferred edges is around 30. For the RN method, we keep the 30 edges with the highest mutual information. For ARACNE, the mutual information threshold is adjusted. In our proposed algorithm, the posterior probability threshold is varied in the range of Inline graphic so that approximately 30 edges are obtained.

It has to be noted that the RN, ARACNE, and Chow-Liu algorithms only preserve interactions and disregard their orientation. Therefore, in order to make consistent comparisons, we have to sacrifice the orientation information offered by the proposed algorithm. Moreover, these three schemes are not capable of processing ChIP-chip data. Therefore, we configure the proposed algorithm such that any two nodes are associated with a small prior probability of connection (0.1). This reflects the fact that a connection between two arbitrary nodes in the graph is very unlikely, but not impossible. It also exemplifies how the algorithm works in the absence of ChIP-chip data.
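The edge-selection step used for the comparison can be sketched as follows. The sketch assumes that the posterior probabilities are available as a square matrix of directed scores, and that orientation is discarded by scoring each unordered pair with the larger of its two directed posteriors; both the function name `top_undirected_edges` and this collapsing rule are assumptions made for illustration, and the random matrix merely stands in for real posterior estimates.

```python
import numpy as np

def top_undirected_edges(posterior, n_edges):
    """Keep the n_edges unordered gene pairs with the highest posterior score."""
    posterior = np.asarray(posterior, dtype=float)
    # Assumption: an unordered pair is scored by the larger directed posterior.
    sym = np.maximum(posterior, posterior.T)
    rows, cols = np.triu_indices(sym.shape[0], k=1)  # each pair once, i < j
    scores = sym[rows, cols]
    order = np.argsort(scores)[::-1][:n_edges]       # descending by score
    return [(int(rows[i]), int(cols[i]), float(scores[i])) for i in order]

# Placeholder posterior matrix for a 30-gene network (random, for illustration only).
rng = np.random.default_rng(0)
posterior = rng.uniform(size=(30, 30))
np.fill_diagonal(posterior, 0.0)

edges = top_undirected_edges(posterior, 30)
# The implied posterior threshold is the score of the weakest retained edge.
threshold = edges[-1][2]
```

Fixing the number of edges and reading off the implied threshold is equivalent to tuning the threshold until roughly 30 edges remain.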

Figure 1(a) compares the performance of the four schemes in terms of Hamming distance for different sample sizes. The proposed method provides much better inference accuracy because it achieves the lowest Hamming distance. A larger sample size yields better inference precision. Chow-Liu's algorithm and ARACNE do not perform well, which can be attributed to the network assumptions these methods make. Our synthetic networks are cyclic in order to reflect the real-world scenario, and cycles in the network degrade the inference precision of Chow-Liu and ARACNE. Figure 1(c) illustrates the impact of sample size on the sensitivity. The proposed scheme outperforms the other three schemes. The sensitivities of all algorithms are less than 0.5. This is mainly due to the constraint that we impose on the number of inferred edges, that is, 30 edges. If we relax the posterior probability threshold, the sensitivity improves at the expense of the specificity. Figure 1(e) depicts the specificity of all schemes. All of them have high specificities, greater than 0.90, and the proposed scheme again performs best. This high specificity is mainly due to the stringent constraint imposed on the number of inferred edges. When considering the orientation of the edges, we find that 90% of the true positives inferred by the proposed algorithm are oriented correctly. This represents a big advantage of the proposed algorithm compared with the other three schemes.

Figure 1.

Performance comparison in terms of Hamming distance, sensitivity, and specificity. Figures in the left column illustrate results obtained when the same kinetics model is employed in both data synthesis and network inference, while figures in the right column represent results obtained when different kinetics models are employed in the two steps. The number of Monte Carlo iterations is fixed at Inline graphic for the proposed algorithm. Thresholds for the different algorithms are selected to produce around 30 inferred edges.

4.1.3. Robustness of Inference

In the previous simulations, the proposed inference algorithm assumes the system dynamics depicted by (1). In fact, for different biological processes there exist various mathematical models that trade off the sophistication of the underlying molecular reactions against the simplicity of the formula description (see [27,29] for model comparisons). Savageau [28] proposed an alternative mathematical model to account for gene control and various forms of coupling among elementary gene circuits. This model can be denoted as

graphic file with name 1687-4153-2008-248747-i213.gif (20)

where the two new symbols Inline graphic and Inline graphic are the activation and degradation coefficients, and all other symbols have the same meanings as in (1).

Although the proposed inference framework can "plug and play" with different models, it is still necessary to examine its robustness against the underlying model. We evaluate this model dependence by the following steps: configure the model as (20) and create a set of synthetic data; then apply the proposed algorithm based on the dynamic equation (1); finally, determine the performance metrics for the different algorithms and compare the results with those in the previous section.

The simulation results are plotted in Figures 1(b), 1(d), and 1(f). Each figure corresponds to a different performance metric. All algorithms exhibit performance values that differ from those in the previous section. This shows that the inference depends on the particular data sets and their underlying model. Compared with the other three schemes, the proposed algorithm still achieves good performance in terms of the three metrics. However, the advantage of the proposed algorithm is no longer significant. ARACNE, Chow-Liu, and the RN method do not degrade much, which is mainly attributable to the nonparametric nature of these three schemes. The persistently good performance of the proposed algorithm is due to the fact that both dynamic models have to convey the basic properties of the gene interaction kinetics, such as the activation and repression effects and the coupling of the circuitry.

4.2. Simulation on Saccharomyces Cerevisiae Data Sets

Saccharomyces cerevisiae (yeast) has been extensively studied in the molecular biology literature because it is a unicellular eukaryotic organism that shares a similar cell structure with plants and animals. Yeast also has a short life cycle, which makes experiments easy to conduct. Lee et al. [33] performed the ChIP-chip experiment, in which 141 transcription factors were tested for binding to intergenic regions corresponding to 6270 genes. The gene expression data were published by Mnaimneh et al. [35], who created promoter shut-off strains for 2/3 of all essential genes. The data set contains 215 steady-state cDNA microarray samples. The model parameters are assumed to be the same as for the artificial networks.

The intracellular signalling pathway that responds to environmental changes has been conserved through evolution. Therefore, a study of this biological subsystem in Saccharomyces cerevisiae might help to decipher the cell survival mechanism of other organisms. We select 30 genes that are annotated as participating in the stress response process. The given ChIP-chip experiment did not provide full prior knowledge for every pair of genes (nodes in the graph). We believe that, among these genes, there are some whose protein products may also serve as transcription factors. Therefore, if the binding between two genes was not tested in the ChIP-chip experiment, a small probability value of 0.1 is assigned as the prior knowledge. The proposed inference algorithm leads to the genetic network illustrated in Figure 2.
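The treatment of untested gene pairs can be illustrated with a minimal sketch. It assumes that prior connection probabilities for the tested pairs have already been derived from the ChIP-chip data and are supplied in a dictionary; the names `build_prior` and `tested_priors`, and the example values, are hypothetical. Only the default value of 0.1 for untested pairs comes from the text.

```python
import numpy as np

def build_prior(n_genes, tested_priors, default=0.1):
    """Return an n_genes x n_genes matrix of prior connection probabilities.

    tested_priors maps (regulator, target) index pairs to priors that were
    derived elsewhere from ChIP-chip data (hypothetical input). Every pair
    that was not tested receives the small default prior.
    """
    prior = np.full((n_genes, n_genes), default, dtype=float)
    for (regulator, target), p in tested_priors.items():
        prior[regulator, target] = p
    return prior

# Illustrative values only: two tested regulator-target pairs among 4 genes.
prior = build_prior(4, {(0, 1): 0.9, (2, 3): 0.05})
```

Because the prior is stored per ordered pair, a tested regulator-to-target entry does not affect the reverse direction, which keeps its default value.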

Figure 2.

Recovered genetic regulatory network for yeast stress response. The number of Monte Carlo iterations is Inline graphic. Dashed edges represent interactions preserved by using ChIP-chip data alone under the p-value threshold of 0.001. Shaded vertices are transcription factors tested in the ChIP-chip experiment.

The inferred genetic regulatory network resembles a scale-free network much more than a random network. Some genes possess an especially high degree of connectivity. Three hub genes Inline graphicInline graphic already connect with more than 60% of all selected genes. Each of them has a connectivity degree of at least 8, while on average each gene in the network is connected with no more than 4 genes. These hub genes constitute the backbone of the network and are potential control targets. This scale-free property is advantageous in maintaining system robustness because a failure in one subsystem will not propagate to the whole network.
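The connectivity statistics quoted above can be computed from an edge list along the following lines. This is a sketch under stated assumptions: edges are treated as undirected, hubs are genes whose degree is at least 8, and the coverage counts the hubs together with their direct neighbours. The function name `hub_summary` and the toy network are illustrative and do not correspond to the inferred yeast network.

```python
from collections import defaultdict

def hub_summary(edges, n_genes, min_degree=8):
    """Return per-gene degree, hub genes, and the fraction of genes the hubs reach."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    degree = {g: len(neighbors[g]) for g in range(n_genes)}
    hubs = sorted(g for g, d in degree.items() if d >= min_degree)
    # Assumption: coverage includes the hubs themselves plus their neighbours.
    covered = set(hubs)
    for h in hubs:
        covered |= neighbors[h]
    return degree, hubs, len(covered) / n_genes

# Toy network: gene 0 is connected to genes 1..8, plus one unrelated edge.
toy_edges = [(0, k) for k in range(1, 9)] + [(9, 10)]
degree, hubs, coverage = hub_summary(toy_edges, n_genes=12)
```

In the toy example, gene 0 is the only hub and reaches 9 of the 12 genes.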

Multiple works, for example [36], have identified Inline graphic and Inline graphic as two of the most important genes in the response to environmental changes. A recent work [37] recognized the functionality of another crucial gene Inline graphic, which is a heat shock transcription factor and functions in a different domain than the one corresponding to Inline graphic. Our inferred network agrees with this experimental result by showing that Inline graphic and Inline graphic regulate different sets of genes, except for a weak connection between Inline graphic and Inline graphic. Inline graphic are not conserved in humans, while Inline graphic genes have been preserved in various organisms such as Drosophila melanogaster, chickens, and mammals. Therefore, a study of the Inline graphic pathway opens up the possibility of understanding the mechanism that governs the survival of normal cells under harsh conditions.

Inline graphic (Inline graphic) and Inline graphic are two genes that play key roles in controlling the resistance to drugs, for example, cisplatin [38]. Inline graphicInline graphic,Inline graphic, and Inline graphic share a structural motif called the basic leucine zipper (Inline graphic) and they are located close to each other in the network. However, they do not neighbor the other two Inline graphic genes: Inline graphic and Inline graphic. It is hypothesized that although they have similar molecular structures, their biological functionalities lie in distinct domains.

Several edges, discovered by imposing a stringent p-value threshold of 0.001 on the location data, were preserved in our inferred network. These connections constitute only a small portion of the proposed network, and they are Inline graphicInline graphicInline graphic, and Inline graphic. Evidence in the literature corroborates other recovered interactions, which cannot be obtained by applying a stringent p-value threshold to the location data alone. For example, Inline graphic is recovered as directly regulating Inline graphic. This regulation relationship has also been reported in the work of Horak [39]. The relationship between Inline graphic and Inline graphic is studied in [40] in the context of extending the life span.

It is worthwhile to note that gene expression data mainly provide statistical relationships among genes, while location data offer physical binding interactions at the molecular level. By combining the two data sources, we aim to refine the inferred network so that it is biologically more meaningful. However, this approach also runs the risk of confusing statistical regulatory relationships with real binding interactions. When such a case occurs, the proposed algorithm is capable of constraining the interacting genes within the same biological process and common functional relationships. A related discussion about the meaning of inferred networks can also be found in [41].

5. Conclusions

A novel algorithm is proposed to recover genetic regulatory networks in the light of knowledge of transcriptional kinetics, ChIP-chip data, and gene microarray data. The analysis is based on the Bayesian methodology and Monte Carlo techniques. The proposed scheme compensates for the shortcomings of using only one data set. Our in silico experiments corroborate that the algorithm outperforms three state-of-the-art schemes in terms of specificity, sensitivity, and Hamming distance. A budding yeast genetic regulatory network is proposed to account for the stress response.

There are several possible extensions to our current scheme. An analysis of the estimation error is desirable for the Monte Carlo simulation in order to determine the appropriate number of iterations. Several other knowledge sources could be integrated into the current framework. For example, protein-protein interactions are useful for identifying cobinding regulations. Genome sequencing data have been utilized to find regulatory motifs. Protein structure knowledge can be exploited to categorize proteins and find similar functionality. Cross-species research is also highly desirable since similar regulation mechanisms are expected to be conserved. If a gene is conserved in both humans and mice, then knowledge of the gene pathway in the mouse will be an excellent reference for the study of human genetic diseases. We envision a globally distributed framework, in which each data source acts as a separate component and its absence does not interfere with the whole computational process.

Contributor Information

Wentao Zhao, Email: wtzhao@neo.tamu.edu.

Erchin Serpedin, Email: serpedin@ece.tamu.edu.

Edward R Dougherty, Email: edward@ece.tamu.edu.

Acknowledgments

This work was supported by the National Cancer Institute (CA-90301) and the National Science Foundation (ECS-0355227 and CCF-0514644).

References

  1. Kauffman SA. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology. 1969;22(3):437–467. doi: 10.1016/0022-5193(69)90015-0.
  2. Murphy K, Mian S. Modelling gene expression data using dynamic Bayesian networks. Computer Science Division, University of California, Berkeley, Calif, USA; 1999.
  3. Sebastiani P, Abad MM, Ramoni M. The Data Mining and Knowledge Discovery Handbook. Springer, New York, NY, USA; 2005. Bayesian networks; pp. 193–230.
  4. Shmulevich I, Dougherty ER, Kim S, Zhang W. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics. 2002;18(2):261–274. doi: 10.1093/bioinformatics/18.2.261.
  5. Tabus I, Astola J. On the use of MDL principle in gene expression prediction. Journal on Applied Signal Processing. 2001;2001(4):297–303.
  6. Öktem H, Pearson R, Yli-Harja O, Nicorici D, Egiazarian K, Astola J. A computational model for simulating continuous time Boolean networks. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '02), Raleigh, NC, USA, October 2002.
  7. Murphy K. Bayes Net Toolbox for Matlab. http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
  8. Leray P. Structure learning toolbox of Bayesian Networks. http://bnt.insa-rouen.fr/ajouts.html
  9. Friedman N, Elidan G. Bayesian Network learning tool. http://www.cs.huji.ac.il/labs/compbio/LibB/programs.html#LearnBayes
  10. Rao A, Hero AO, States DJ, Engel JD. Manifold embedding for understanding mechanisms of transcriptional regulation. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06), College Station, Tex, USA, May 2006. pp. 3–4.
  11. Liang S, Fuhrman S, Somogyi R. REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. Proceedings of the Pacific Symposium on Biocomputing (PSB '98), Maui, Hawaii, USA, January 1998. pp. 18–29.
  12. Luna IT, Yin Y, Huang Y, Padillo DPR, Perez MCC, Wang Y. Uncovering gene regulatory networks using variational Bayes variable selection. Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06), College Station, Tex, USA, May 2006. pp. 111–112.
  13. Zhao W, Serpedin E, Dougherty ER. Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics. 2006;22(17):2129–2135. doi: 10.1093/bioinformatics/btl364.
  14. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Proceedings of the Pacific Symposium on Biocomputing (PSB '00), Honolulu, Hawaii, USA, January 2000. pp. 418–429.
  15. Rogers S, Girolami M. A Bayesian regression approach to the inference of regulatory networks from gene expression data. Bioinformatics. 2005;21(14):3131–3137. doi: 10.1093/bioinformatics/bti487.
  16. Rice JJ, Tu Y, Stolovitzky G. Reconstructing biological networks using conditional correlation analysis. Bioinformatics. 2005;21(6):765–773. doi: 10.1093/bioinformatics/bti064.
  17. Yeung MKS, Tegnér J, Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(9):6163–6168. doi: 10.1073/pnas.092576199.
  18. Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(supplement 1, S7):1–15. doi: 10.1186/1471-2105-7-S1-S7.
  19. Simon I, Siegfried Z, Ernst J, Bar-Joseph Z. Combined static and dynamic analysis for determining the quality of time-series expression profiles. Nature Biotechnology. 2005;23(12):1503–1508. doi: 10.1038/nbt1164.
  20. Bernard A, Hartemink AJ. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of data. Proceedings of the Pacific Symposium on Biocomputing (PSB '05), The Big Island of Hawaii, Hawaii, USA, January 2005. pp. 459–470.
  21. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Combining location and expression data for principled discovery of genetic regulatory network models. Proceedings of the Pacific Symposium on Biocomputing (PSB '02), Lihue, Hawaii, USA, January 2002. pp. 437–449.
  22. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 1992;9(4):309–347.
  23. Heckerman D, Geiger D, Chickering DM. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning. 1995;20(3):197–243.
  24. Chow C, Liu C. Approximating discrete probability distributions with dependence trees. IEEE Transaction on Information Theory. 1968;14(3):462–467. doi: 10.1109/TIT.1968.1054142.
  25. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, Calif, USA; 1988.
  26. Friedman N, Cai L, Xie XS. Linking stochastic dynamics to population distribution: an analytical framework of gene expression. Physical Review Letters. 2006;97(16). doi: 10.1103/PhysRevLett.97.168302. 4 pages.
  27. Wessels LF, van Someren EP, Reinders MJ. A comparison of genetic network models. Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), The Big Island of Hawaii, Hawaii, USA, January 2001. pp. 508–519.
  28. Savageau MA. Rules for the evolution of gene circuitry. Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), Maui, Hawaii, USA, January 1998. pp. 54–65.
  29. Edelstein-Keshet L. Mathematical Models in Biology. Random House, New York, NY, USA; 1988.
  30. Allison DB, Gadbury GL, Heo M, et al. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics & Data Analysis. 2002;39(1):1–20. doi: 10.1016/S0167-9473(01)00046-9.
  31. Guelzim N, Bottani S, Bourgine P, Képès F. Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics. 2002;31(1):60–63. doi: 10.1038/ng873.
  32. Giot L, Bader JS, Brouwer C, et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302(5651):1727–1736. doi: 10.1126/science.1090289.
  33. Lee TI, Rinaldi NJ, Robert F, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298(5594):799–804. doi: 10.1126/science.1075090.
  34. Ihler A. Kernel density estimation software. http://www.ics.uci.edu/~ihler/code/
  35. Mnaimneh S, Davierwala AP, Haynes J, et al. Exploration of essential gene functions via titratable promoter alleles. Cell. 2004;118(1):31–44. doi: 10.1016/j.cell.2004.06.013.
  36. Gasch AP, Spellman PT, Kao CM, et al. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell. 2000;11(12):4241–4257. doi: 10.1091/mbc.11.12.4241.
  37. Eastmond DL, Nelson HCM. Genome-wide analysis reveals new roles for the activation domains of the Saccharomyces cerevisiae heat shock transcription factor (Hsf1) during the transient heat shock response. Journal of Biological Chemistry. 2006;281(43):32909–32921. doi: 10.1074/jbc.M602454200.
  38. Furuchi T, Ishikawa H, Miura N, et al. Two nuclear proteins, Cin5 and Ydr259c, confer resistance to cisplatin in Saccharomyces cerevisiae. Molecular Pharmacology. 2001;59(3):470–474. doi: 10.1124/mol.59.3.470.
  39. Horak CE, Luscombe NM, Qian J, et al. Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes & Development. 2002;16(23):3017–3033. doi: 10.1101/gad.1039602.
  40. Fabrizio P, Pozza F, Pletcher SD, Gendron CM, Longo VD. Regulation of longevity and stress resistance by Sch9 in yeast. Science. 2001;292(5515):288–290. doi: 10.1126/science.1059497.
  41. Hartemink AJ. Reverse engineering gene regulatory networks. Nature Biotechnology. 2005;23(5):554–555. doi: 10.1038/nbt0505-554.

Articles from EURASIP Journal on Bioinformatics and Systems Biology are provided here courtesy of Springer
