Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2012 Dec 27;29(3):338–346. doi: 10.1093/bioinformatics/bts634

Inference of gene regulatory networks from genome-wide knockout fitness data

Liming Wang 1, Xiaodong Wang 2,*, Adam P Arkin 3,4, Michael S Samoilov 3,*
PMCID: PMC3562072  PMID: 23271269

Abstract

Motivation: Genome-wide fitness is an emerging type of high-throughput biological data generated for individual organisms by creating libraries of knockouts, subjecting them to broad ranges of environmental conditions, and measuring the resulting clone-specific fitnesses. Since fitness is an organism-scale measure of gene regulatory network behaviour, it may offer certain advantages when insights into such phenotypical and functional features are of primary interest over individual gene expression. Previous works have shown that genome-wide fitness data can be used to uncover novel gene regulatory interactions, when compared with results of more conventional gene expression analysis. Yet, to date, few algorithms have been proposed for systematically using genome-wide mutant fitness data for gene regulatory network inference.

Results: In this article, we describe a model and propose an inference algorithm for using fitness data from knockout libraries to identify underlying gene regulatory networks. Unlike most prior methods, the presented approach captures not only structural, but also dynamical and non-linear nature of biomolecular systems involved. A state–space model with non-linear basis is used for dynamically describing gene regulatory networks. Network structure is then elucidated by estimating unknown model parameters. Unscented Kalman filter is used to cope with the non-linearities introduced in the model, which also enables the algorithm to run in on-line mode for practical use. Here, we demonstrate that the algorithm provides satisfying results for both synthetic data as well as empirical measurements of GAL network in yeast Saccharomyces cerevisiae and TyrR–LiuR network in bacteria Shewanella oneidensis.

Availability: MATLAB code and datasets are available to download at http://www.duke.edu/∼lw174/Fitness.zip and http://genomics.lbl.gov/supplemental/fitness-bioinf/

Contact: wangx@ee.columbia.edu or mssamoilov@lbl.gov

Supplementary information: Supplementary data are available at Bioinformatics online

1 INTRODUCTION

In recent years, modelling and inference of biological regulatory networks have become an active area of research in large part owing to the emergence of microarray technology, which allows for simultaneous measurement of gene expression on the genome-wide scale (Bonneau et al., 2006; Chou and Voit, 2009; Friedman et al., 2000; Liang and Wang, 2008; Margolin et al., 2006; Reiss et al., 2006; Shmulevich et al., 2002; Stuart et al., 2003). The vast amounts of data provided by gene expression microarrays enable the possibility of accurate estimation of gene regulatory network organization, which has greatly benefited a broad range of disciplines—from basic biological sciences, to bioengineering, to medical diagnosis and treatment (Hanai et al., 2006; Mischel et al., 2004). The goal of inference algorithms is to discover the connectivity structure and, potentially, dynamic characteristics of these networks based on such time- or other state-series data. Among other things, the nature of inference algorithms varies depending on the types of biological networks and the way they are modelled (de Jong, 2002; Hendrickx et al., 2011; Lecca et al., 2011; Liu et al., 2006; Samoilov et al., 2001; Shmulevich et al., 2002; Tian and Burrage, 2003; Wang and Schonfeld, 2010). One category of models quantizes the empirical data into binary numbers and views network structures as Boolean constraints (Bornholdt, 2008; Kauffman et al., 2003). Although this could be attempted in a deterministic framework, both the uncertainties introduced by measurement errors as well as the inherent stochasticity of gene expression make any experimental data substantially probabilistic. To impart this random nature to the Boolean framework, the probability Boolean network models have been introduced (Akutsu et al., 1999; Huang, 1999; Shmulevich et al., 2002). However, as biological processes are neither digital nor homogeneous, further gene regulatory modelling and inference refinements may be achieved by using alternative probabilistic network descriptions (Craciun et al., 2013; Kellam et al., 2002; Liu et al., 2006), continuous-time differential equations (Chen et al., 1999; Holter et al., 2001; Wang et al., 2008), stochastic differential equations (Tian and Burrage, 2003; Yeung et al., 2002), and control theory methods (Beal et al., 2005; Cook et al., 1998; Rangel et al., 2004), among others. Although any of these methods offers certain advantages and disadvantages in attempting to capture the structure and dynamics of gene regulatory network, it should be noted that they have largely been designed toward describing gene expression data.

Recently, however, a new type of high-throughput data has emerged and seen rapid proliferation in empirical biosciences—the genome-wide fitness data. At its core, this involves using latest technological advances to massively scale the traditional gene deletion/interruption studies in order to achieve nearly genome-wide coverage by generating knockout/knockdown mutant strain libraries for all non-essential genes in an organism (Oh et al., 2010). These libraries are then further subjected to a large number of environmental conditions and stresses—with the observable in each of the settings being the fitness of individual clones (collected in stationary phase). Pairing the resulting data with an appropriate model of gene expression then allows for the inference of the underlying gene regulatory networks through estimation of significant interaction terms, along with those for production, degradation, expression level, etc. Although potentially applicable to any observable type, this approach may be particularly well-suited for the use with fitness data to help constrain any inferred gene regulatory network solutions to those dynamic modes that are most important for a given set of biological functions and conditions—e.g. growth on specific substrates or tolerance to certain stresses. Recent works have indeed suggested that the use of genome-wide fitness data can provide new perspectives on systems-level organization of cells and uncover novel gene regulatory interactions when compared with gene expression-based analysis (Hillenmeyer et al., 2008, 2010); Deutschbauer et al., 2011). Yet, although on a limited scale the idea of biological network characterization based on knockout data has been considered before, e.g. Winzeler et al. (1999), the emergence of high-throughput genome-wide gene deletion/interruption technology along with the use of population fitness rather than gene expression as an observable offers novel challenges as well as benefits to the task of gene network inference. On the one hand, microarray and other gene expression experiments typically generate high-dimensional data in the form of a real vector that comprises expression levels of multiple genes at each sampled time and/or condition point, whereas fitness measurements map the state of the system into a much lower dimensional space—e.g. that of a single real variable, such as growth rate. This inevitably leads to significant loss of information. On the other hand, deletion experiments usually involve simply cultivating and observing cells, which could be performed on a substantially larger scale, much more efficiently and under significantly greater range of conditions when compared with the relatively demanding gene expression assays. The ensuing ability to perform experiments simultaneously across the entire mutant collection substantially increases the overall genome-wide fitness data dimensionality—often putting it on par with available gene expression datasets. Furthermore, fitness observations allow for the preferential selection or overweighting of clones that display a desirable phenotype, e.g. the stronger the selection—the more significant the contribution of surviving strains. This becomes an increasingly important factor in many biotechnological and biomedical applications, whereby the contribution of practically irrelevant genes is effectively being filtered out—regardless of their statistical significance or dynamic state. Indeed, it has been shown that this type of data is very useful for the determination of target gene functions (Deutschbauer et al., 2002, 2011; Hillenmeyer et al., 2010; Pierce et al., 2009; Steinmetz et al., 2002).

Few systematic models and/or inference algorithms have been proposed for the elucidation of regulatory networks from fitness data. Conclusions are often being made on the basis of visual inspections or similar relatively naive strategies. Yet, greater prominence and availability of such data along with indications that fitness profiling might contain information about gene regulation suggest the need for a more comprehensive and rigorous inference approach. The analysis of ample data provided by genome-wide fitness experiments may also be useful in complementing network inference methods based on microarray gene expression and such other data by helping to initialize them or further refine their results.

In this work, we propose an algorithm for inferring gene regulatory networks from genome-wide knockout fitness data. Our approach is based on describing biological networks via a non-linear dynamical model and then elucidating model parameters from fitness measurements. The resulting parameter set can be used to identify the underlying regulatory network structure as well as to make forward-looking estimates of its function under temporal dynamics or environmental changes. In Section 2, we describe the system model and problem formulation. In Section 2.2, we provide a heuristic sample ordering selection algorithm based on correlation score to cope with the order selection problem that arises when using mutant fitness data. In Section 3, we describe the parameter estimation algorithm based on the unscented Kalman filter (UKF) technique. In Section 4, we use the proposed algorithm to analyse both a synthetic example as well as experimental data from yeast Saccharomyces cerevisiae and bacteria Shewanella oneidensis, which are further compared against known empirical results. We conclude the article with a summary and remarks regarding the proposed model and inference algorithm.

2 SYSTEM MODEL AND PROBLEM FORMULATION

2.1 System model

The outline of our approach is to describe a gene expression data model and to then extend it towards accommodating the observables supplied in the form of a large-scale knockout strain fitness dataset. To this end, we first introduce a basic model of gene expression as a weighted sum of (non-linear) functions of other genes with additive noise. We then obtain, through the removal of individual genes, the knockout strain network models used to drive the observed fitness model, which thus contains latent variables of our overall model, as described next in further detail.

Consider a gene regulatory network with total Inline graphic genes. Let Inline graphic, Inline graphic, Inline graphic denote the gene expression level for the Inline graphic-th gene at time Inline graphic. We denote observation or measurement data, Inline graphic, for Inline graphic at time Inline graphic as:

graphic file with name bts634m1.jpg (1)

where Inline graphic is the observation noise at time Inline graphic for Inline graphic-th gene. (Note that here the term ‘time’ is used in the generalized Bayesian inference sense and so may be loosely viewed as a discrete index enumerating individual experiments—rather than some continuous parameter of a kinetic biochemical system. Accordingly, the dynamic model we use is one based on ‘discrete time’, which thus fundamentally does not a priori assume or require continuity of states or observables.) We denote gene expression levels within the network by vector Inline graphic, the observation vector by Inline graphic and noise vector by Inline graphic. We assume that all vectors Inline graphic for Inline graphic are independent and jointly Gaussian with zero mean and variance matrix Inline graphic. We approximate any multivariate gene–gene interactions by a combination of a linear expansion around the stationary solution and univariate non-linear terms. Specifically, we follow a discrete-time regulation model proposed in (Chen and Aihara, 1997) and describe the regulatory functions among genes as:

graphic file with name bts634m2.jpg (2)

for Inline graphic, where Inline graphic denotes the linear regulation coefficient from gene Inline graphic to gene Inline graphic and Inline graphic; Inline graphic denotes the non-linear regulation coefficient from gene Inline graphic to gene Inline graphic and Inline graphic; Inline graphic is the non-linear function for gene Inline graphic which is given by:

graphic file with name bts634m3.jpg (3)

where Inline graphic is the parameter to be inferred and Inline graphic; Inline graphic denotes the system expression bias for Inline graphic-th gene and Inline graphic, which will be inferred later. The noise vectors Inline graphic for Inline graphic are assumed to be jointly Gaussian with zero mean and variance Inline graphic. We also assume that they are independent from all Inline graphic. The regulatory network is realized as a state–space model, where we view gene expression levels as states and measurements as observations. The goal of inference is to estimate all the unknown parameters in the model. Inference results then provide estimates for all regulatory relations across the network.

Note that Equations (1) and (2) provide the description of the system in a manner most commensurate with expression data. We now proceed to extend this model to accommodate fitness data. Without loss of generality, we consider the case of a mutant library with single gene knockout per strain (with multiple-knockout collections being handled analogously, as discussed later) and assume that Inline graphic-th gene has been deleted when the system is at time Inline graphic. Note that ‘time’ here corresponds to the experiment number, with the index Inline graphic being determined as discussed in the previous paragraph. For the purposes of the single-knockout state–space model, expressions of all genes evolve without participation of gene Inline graphic. Therefore we set all Inline graphic regulatory coefficients to zero. The states and system coefficients equations can then be summarized as:

graphic file with name bts634m4.jpg (4)

The last two equations determine the current system coefficients, which will be appended to the system state. Unlike the case of expression microarrays, here, each gene deletion/interruption strain measurement quantifies a single system property which is a function of all the remaining genes. In this article, we assume this measurement is a real number, which represents the fitness of the remaining network. (The model can be easily adapted into higher dimensional measurement case by direct extension.) Therefore, the observation Inline graphic becomes:

graphic file with name bts634m5.jpg (5)

where Inline graphic is typically not known a priori. We denote Inline graphic as the variance of Inline graphic as before.

Various bases could be used to estimate Inline graphic and associated coefficients. For instance, one could use a simple basis such as power series—i.e. a Taylor expansion. However, the speed of convergence for Taylor expansion is slow, resulting in a large number of parameters to be estimated. In contrast, the radial basis approach has been shown to be more robust and adaptive than Taylor expansion (Jiang et al., 2003). Furthermore, it has been shown that with certain additional assumptions, the approximation by radial basis will converge to the true function in Inline graphic sense (Powell, 1987). Thus, we approximate Inline graphic as:

graphic file with name bts634m6.jpg (6)

where Inline graphic for Inline graphic and Inline graphic are the centres of the basis with Inline graphic; Inline graphic is the total number of basis functions used, which is a fixed constant; Inline graphic, Inline graphic are the centre points of the waveform, which are chosen a priori; and Inline graphic—the Hardy multi-quadratic function with constant Inline graphic, which serves as a classical choice for efficient radial basis expansion (Buhmann, 2003).

All the coefficients Inline graphic as well as Inline graphic are parameters to be inferred and so are appended to the state variable. Therefore, the new augmented model state variable for knockout fitness data is:

graphic file with name bts634m7.jpg (7)

As a summary, the dynamical model we propose for regulatory networks with gene deletion/interruption mutant fitness data is:

graphic file with name bts634m8.jpg (8)
graphic file with name bts634m9.jpg

where Inline graphic is the system function described in (4); Inline graphic is the augmented noise vector; Inline graphic is the selection matrix, i.e. Inline graphic, where the index Inline graphic is determined by the time index Inline graphic; Inline graphic is the Gaussian noise as assumed before.

2.2 Data feeding order score

To infer the model given in (8), one needs to specify the order in which data are supplied to the algorithm. The question in what order data should be optimally fed into the inference algorithm thus arises. As this problem is fundamentally associated with the network structure itself, it can be solved exactly only if the structure of the network is already known. (And even then, the problem typically has NP-hard complexity since one needs to test all possible permutations.) Since the feeding order will have direct impact on the performance of the inference result, we still need a strategy to find an ‘optimal’ feeding order without explicitly exploring all potential network structures. In this section, we propose a heuristic strategy based on correlation score, which is then compared against selecting the feeding order randomly and shown to be more ‘optimal’ by offering certain advantages.

The basic idea behind the proposed heuristic is that we should feed the most useful data first. In this work, we use correlation as a measure of such ‘usefulness’. The intuition is that for sequential inference of the model (8), the data should be fed based on their importance in a certain sense. The reasoning for this strategy is that, in sequential inference, a good starting point usually provides a superior opportunity to converge to a good result and vice versa. Moreover, once we have already fallen into a steady state or an attracting basin, subsequent data may have less influence on the final result, since it may be difficult to jump away from the local attractor. By contrast, at early stages, this influence may be vital to the final inference result. Heuristically, the importance of a certain sample may be determined by the connectivity of the deleted gene. If it has many connections to other genes, it may likely play an important role in the network, which would be reflected in the measurement value (e.g. fitness). Based on this approach, the feeding order for a sample is related to the importance or connectivity of the corresponding deleted gene, which could be quantified by using correlation as a metric.

Specifically, we consider fitness observations Inline graphic, Inline graphic, where Inline graphic is a positive integer representing the total rounds of experiments. Note that without loss of generality, we can always assume that Inline graphic, Inline graphic, Inline graphic is the Inline graphic round fitness data for gene Inline graphic. The score Inline graphic for gene Inline graphic is calculated as:

graphic file with name bts634m10.jpg (9)

with gene feeding arranged in the order of descending Inline graphic. The summation over Inline graphic calculates the correlation under round Inline graphic. The final score is the summation over all rounds. This approach may be compared with choosing the feeding order randomly, as will be done later using concrete examples.

Finally, note that a similar feeding heuristic may be used to optimize the order of individual experimental conditions as well. The utility of this additional step, however, needs to be weight against the diminished significance of individual permutations among observation data points as the number of conditions becomes large as well as the increasing computational overhead this may entail. As in our applications the number of conditions was substantially greater than either the number of genes or the number of connections between them, we found the additional computational costs such extra step would entail to be unwarranted.

3 THE UNSCENTED KALMAN FILTER APPROACH FOR INFERENCE

In the previous section, we proposed a system model for describing gene regulatory networks at the state–space level. The approach now requires an algorithm for estimating the unknown parameters of the model, from which network organization and other biological system properties may be inferred. Here, a Kalman filter technique—a well-characterized estimation strategy for elucidating state-space models of regulatory networks (Wang et al., 2006, 2009), and specifically, an unscented Kalman filter (UKF)—is used to accomplish this task. UKF is used to estimate all the parameters in order to cope with the expected non-linearity of the model (Julier and Uhlmann, 1997), as it has been shown to have superior performance when compared with traditional approaches, such as the extended Kalman filter, especially with availability of enough data.

Classical Kalman filter technique iteratively uses innovations in state and measurement predictions—updating the system sequentially (Simon, 2006). The general idea of a Kalman filter can be summarized as:

graphic file with name bts634m11.jpg (10)

where Inline graphic is the ‘gain’ for the residue. The original approach by Kalman is based on linear differential equation model under Gaussian noise assumption. An extended Kalman filter (EKF) has been proposed for dealing with non-linear models (Corigliano and Mariani, 2004). The idea of EKF is to linearize the non-linear function by approximating it with the first-order Taylor expansion. However, such an approximation is quite coarse and insufficient under general circumstances. Some approaches look to remedying the situation by using higher-order terms Taylor expansion terms, which—while more accurate—generally leads to dramatic increases in complexity (Daum, 2005). Alternatively, UKF approximates the non-linear function by viewing it as a non-linear transform and then using the so-called ‘sigma points’ to capture the posterior mean and covariance accurately up to the third order. Compared with EKF, UKF provides a more accurate approximation without significant increase in complexity and has been shown superior in many practical situations (Wan and Van Der Merwe, 2000). Another advantage of UKF is that it does not require the calculation of model’s Jacobian or Hessian, which makes the algorithm and associated mathematical derivations less involved (see more below).

As noted, the UKF is based on the idea of choosing sigma points from the unscented transform. Consider a random vector Inline graphic being passed through a non-linear transform Inline graphic. In order to calculate the mean and variance of Inline graphic, we choose the sigma points Inline graphic, Inline graphic and their weights Inline graphic as follows:

graphic file with name bts634m12.jpg {\cal S}_0 = {\rm E}({\bf x}),
graphic file with name bts634m13.jpg (11)

where Inline graphic is the variance matrix of the random variable Inline graphic; Inline graphic denotes the Inline graphic-th column of the input matrix; Inline graphic is the scaling parameter; and Inline graphic is a parameter incorporating prior knowledge of Inline graphic. Under Gaussian noise assumption, we can set Inline graphic, Inline graphic, and Inline graphic (Julier and Uhlmann, 1997).

After computing all the sigma points Inline graphic and their corresponding weights, the mean and variance of Inline graphic can be approximated as:

graphic file with name bts634m14.jpg (12)

In order to infer the dynamical system model described in (8), we simply concatenate the state variable Inline graphic with the noise vectors Inline graphic and Inline graphic to form a new augmented vector

graphic file with name bts634m15.jpg (13)

Viewing the Inline graphic and Inline graphic in (8) as non-linear transforms allows us to calculate the corresponding sigma points as well as to approximate their mean and variance for use in sequential updates.

We summarize the UKF-based algorithm for inferring model (8) from knockout fitness data in Section S1 of the Supplementary Material.

4 RESULTS

4.1 Inference of synthetic network

In this section, we investigate the performance of the proposed algorithm for inference of a synthetic network. The network has both linear and non-linear connections with the graph structure specified in Figure 1. The dynamics of the network are based on the proposed model (8), with arrows denoting the direction of regulatory interactions. The parameters of the network are given in Table 1, with the variance of the model noise Inline graphic taken as Inline graphic for Inline graphic, where Inline graphic is the identity matrix. We also take the variance of measurement noise (error) Inline graphic in (1) to be Inline graphic. The deletion data are obtained by sequentially removing each gene and its corresponding connections. We use 10 rounds of sample data, i.e. Inline graphic. Finally, five basis functions are included in the fitness model. (Although there is always a trade-off between accuracy and complexity for number of basis functions to be used the dimension of the problem is proportional to the number of basis functions. The number of basis functions may be chosen quantitatively, if needed—e.g. by cross-validation—but here five basis functions appears to be sufficient.) The fitness function model Inline graphic in (5), thus becomes:

graphic file with name bts634m16.jpg (14)

where Inline graphic with Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic; and Inline graphic being Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic.

Fig. 1.

Fig. 1.

Structure of the synthetic network

Table 1.

Comparison of linear coefficients (LCs) and non-linear coefficients (NLCs) of the inferred regulatory network and the underlying model system

Edge Synthetic LC Synthetic NLC Inferred LC Inferred NLC
(2,1) 0.7 0.5 0.621 0.781
(3,1) 0.7 0.5 0.491
(3,2) 0.7 0.5 1.091
(3,4) 0.7 0.5 0.861 1.132
(3,5) 0.7 0.5 0.682 0.581
(4,2) 0.7 0.5
(5,4) 0.981 0.852

Following the proposed algorithm, we first determine the ‘optimal’ data feeding order using the correlation score described in Section 2.2. The resulting order is 3, 2, 1, 4, 5. We can see that this ordering generally coincides with the importance of each node in the sense of node connectivity. This further reaffirms the validity of the proposed heuristic that the data from most connected nodes should be fed in first.

Following subsequent steps, we infer parameters of the underlying network, which are then filtered to remove values below noise threshold set at 40% of their maximal variation (0.431 and 0.443 for linear and non-linear coefficients, respectively). The resulting inferred synthetic network is shown in Figure 2, with regulatory interaction parameters given in Table 1.

Fig. 2.

Fig.
2.

Connection structure of the inferred synthetic network. Linear connections are denoted by solid lines. Non-linear connections are denoted by dashed lines

Comparing the inferred network to the true model, we can see that the elucidated results correctly identify the presence of five out of six regulatory interactions and suggest one non-existent one (false negative rate of 16.7% and false positive rate of 8%, respectively, accounting for the direction of regulation). If we account for both linear and non-linear connections individually, a further two connections are not discovered (linear Inline graphic to Inline graphic and non-linear Inline graphic to Inline graphic).

Finally, we have compared the effect of choosing the feeding order according to the prescription provided in Section 2.2 versus selecting it randomly. As can be seen in Figure 3, using the original order 1, 2, 3, 4, 5 instead of the optimal one 3, 2, 1, 4, 5 results in a significant degradation of inference results.

Fig. 3.

Fig. 3.

Connection structure of the inferred network without using the optimal feeding order. Here, the original data order 1, 2, 3, 4, 5 is used instead of the optimal one 3, 2, 1, 4, 5 (which is used in previous figures). Linear connections are denoted by solid lines. Non-linear connections are denoted by dashed lines

To quantify this effect more rigorously, we define missing rate Inline graphic as the difference between Inline graphic and the ratio of the number of correctly identified edges to the total number of edges in the synthetic network, i.e. 12 here. We also define false rate Inline graphic as the ratio of number of incorrectly identified edges to the total number of edges not in the synthetic network, i.e. 228 here. Then, for the inference result using the optimal order, we have Inline graphic and Inline graphic. In contrast, using the original order yields, Inline graphic and Inline graphic.

Table 2 summarizes the results when using the optimal order versus the average of 30 randomly chosen orders. As we can see, the optimal order has performed better than randomly chosen orders in both the missing and false rates.

Table 2.

Inference algorithm performance when using the optimal feeding order versus the average of 30 randomly chosen orders (lower is better)

Order Inline graphic Inline graphic
Optimal order 0.333 0.00877
Average of random chosen orders 0.753 0.0729

4.2 Inference of S.cerevisiae GAL network

We now apply the described inference algorithm to GAL regulatory network that controls galactose utilization in yeast Saccharomyces cerevisiae. GAL regulation represents one of the most historically prominent model systems in yeast because of its importance for the studies of eukaryotic regulation and relatively self-contained nature. Figure 4 summarizes the empirical knowledge of GAL network structure (Egriboz et al., 2011; Flick and Johnston, 1990; Johnston et al., 1994; Lohr et al., 1995; Ostergaard et al., 2000).

Fig. 4.

Fig. 4.

Structure of the empirical GAL network

Our analysis is based on yeast deletion strains fitness data previously collected by Giaever and co-workers (Giaever et al., 2002). The one-dimensional measurements were performed under various environmental conditions—such as different concentrations of galactose, alkali, sodium chloride, sorbitol, etc. We utilize nine sets of samples from different environmental conditions for network inference using the described algorithm, with 40% variation threshold as before. The nine sets of samples are formed by combining arbitrary samples from each environmental condition as representatives. Figures 5 and 6 show the inferred network graph structures as identified by linear and non-linear coefficients, respectively. A combined linear–non-linear network graph is given in Figure 7 (note, in the interest of clarity, edge weight labels and lower-weight edges have been removed).

Fig. 5.

Fig. 5.

Linear structure of the inferred GAL network

Fig. 6.

Fig. 6.

Non-linear structure of the inferred GAL network

Fig. 7.

Fig. 7.

Combined structure of the inferred GAL network

It could be seen from Figures 5 and 6 that GAL 1, 3, 4 and 80 have the most connections and the largest coefficients, which is in accord with the known fact that these are the regulatory genes in the network, with others being regarded as the structural genes. Additionally, we note that GAL 80 has negative connections to GAL 3 and GAL 4 as well as that GAL 4 has negative connections to GAL 1 and 7, all of which coincides with the empirically known fact that GAL 80 negatively regulates GAL 3, 4 and that GAL 4 leads to the repression of transcription from GAL 1, 7. The connections between GAL 1, 2 and GAL 3 also reflect the fact that GAL 2 and GAL 1 regulate GAL 3 by protein utilization pathway. Finally, we see that there is no direct connection from GAL 11 to GAL 80, which also coincides with the fact that GAL 11 does not have direct interaction with GAL 80. On the other hand, we find that although inference results discover the connections between GAL 3 and GAL 80 as well as GAL 3 and GAL 4, they may not be in the correct orientation. Otherwise, inferred influences among regulatory genes appear to be in a general agreement with empirical understanding of the system.

4.3 Inference of Shewanella TyrRLiuR network

In this section, we apply our algorithm to the TyrRLiuR amino acid utilization–degradation network of S.oneidensis strain MR-1. Its ‘true’ structure, shown in Figure 8, was derived from hand-curated high-confidence regulatory interactions catalogued at MicrobesOnline and RegPrecise (Dehal et al., 2009; Novichkov et al., 2010). Inference was performed by using the proposed algorithm on 287 sets of fitness data (number of fitness measurements for each of the knockout strains under different growth conditions), with results shown in Figures 9 and 10. The inference results are compared to and seen to be in general correspondence with the true structure, Table 3.

Fig. 8.

Fig.
8.

Structure of the empirical TyrR–LiuR network

Fig. 9.

Fig. 9.

Linear structure of the inferred TyrR–LiuR network

Fig. 10.

Fig. 10.

Non-linear structure of the inferred TyrR–LiuR network

Table 3.

Comparison of Ground truth to the Inference algorithm performance

Number of edges (Ground truth) Number of edges (Inference) Number of correctly identified edges
57 69 25

5 DISCUSSION

We have proposed a dynamical model and an algorithm for inference of gene regulatory networks based on genome-wide knockout fitness data—an emerging data type, whose utility in biological systems identification has not been sufficiently explored to date. The algorithm uses a state–space model to capture the dynamical and non-linear nature of such networks. An unscented Kalman filter is used to infer the unknown parameters in order to cope with model non-linearity.

Although fitness data inherently suffers from loss of information caused by its reduced dimensionality, when compared with the more widely explored gene expression data type, the potentially larger amounts of and more contextually/phenotypically meaningful data provided may be able to compensate for the relative lack of resolution as our work appears to suggest. The analysis of a synthetic example as well as empirical GAL and TyrRLiuR network data presented here shows that the described algorithm is able to provide satisfying inference results even for relatively complex mechanisms.

Ultimately, the two data types—gene expression and knockout fitness—may be expected to be most informative when used in a complementary fashion. As noted previously, inferences generated from genome-wide knockout fitness data could be used to facilitate network elucidation methods based on gene expression by helping initialize or further refine their predictions. Conversely, information provided by gene expression data may be exploited by the proposed algorithm in conjunction with knockout fitness data to synergistically improve final inference results. For instance, gene expression data may be collected under a more limited set of conditions, for a subset of the mutant library, or just for the wild-type strain; and used for the initial inference of an augmented state vector Inline graphic—associated only with the gene expression part of the model, Equations (1)–(4). Such preliminary results could then be used as a prior for the subsequent network inference round, which uses genome-wide fitness data in order to take advantage of its potentially larger scale or more immediate availability across a range of conditions, as well as to introduce corresponding phenotypically significant refinements (as discussed earlier).

Several other directions appear promising toward potentially further extending and improving the inference methodology proposed here. For instance, the non-linear univariate interaction model, Equation (2), may be augmented with explicitly multivariate terms. This should serve to enhance resolution of non-linear interactions for a given gene across multiple reaction partners and improve modelling accuracy, though at a cost of substantial computational overhead owing to complications related to multiplicative noise propagation and the need to account for intermediate molecular complexes. Perhaps a more straightforward approach involves using the model to accommodate multiple-knockout experiments (i.e. those involving multiple inactivated genes in each strain). This is done analogously to the way single-knockout networks have been analyzed here by simply setting all of the corresponding interaction term coefficients to zero for the genes in question. The reason we have focused less on such applications in this work, however, is the present scarcity of multiple- versus single-knockout observation data. Finally, additional constraints could be incorporated to help account for other pre-existing sources of experimental or heuristic information. For example, we have noted earlier the possibility of applying feed optimization schemes to condition data ordering. One may further look at various sparseness conditions as a way of improving computational efficiency and incorporating pre-existing knowledge about the network during inference. [Though, care should be taken, as significant variations in effective regulatory network topology may naturally arise across experimental conditions or in going from one related organism to another (Bergmann et al., 2003; Luscombe et al., 2004)]

Overall, we believe that these questions help further support the suggestion that the analysis of genome-wide fitness data (whether directly or in conjunction with gene expression data) towards understanding of biological systems function and, in particular, inference of gene regulatory network organization offers a rich new area of future research—to which this article is seeking to make an initial contribution.

ACKNOWLEDGMENTS

We would like to thank Adam Deutschbauer for contributing his insights throughout this work. We would also like to thank the anonymous reviewers for useful comments and helpful suggestions that have offered much help in improving this article.

Funding: This work conducted by ENIGMA-Ecosystems and Networks Integrated with Genes and Molecular Assemblies was supported by the Office of Science, Office of Biological and Environmental Research, of the U. S. Department of Energy under Contract No. DE-AC02-05CH11231; and under U.S. National Science Foundation (NSF) grant DBI-0850030. Partial funding for open access charge by the Columbia Open-Access Publication (COAP) Fund.

Conflict of Interest: none declared.

REFERENCES

  1. Akutsu T, et al. Pacific Symposium on Biocomputing. Vol. 4. World Scientific Maui, Hawaii; 1999. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model; pp. 17–28. [DOI] [PubMed] [Google Scholar]
  2. Beal M, et al. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics. 2005;21:349–356. doi: 10.1093/bioinformatics/bti014. [DOI] [PubMed] [Google Scholar]
  3. Bergmann S, et al. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2003;2:e9. doi: 10.1371/journal.pbio.0020009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bonneau R, et al. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol. 2006;7:R36. doi: 10.1186/gb-2006-7-5-r36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bornholdt S. Boolean network models of cellular regulation: prospects and limitations. J. R. Soc. Interface. 2008;5(Suppl. 1):S85–S94. doi: 10.1098/rsif.2008.0132.focus. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Buhmann M. Radial Basis Functions: Theory and Implementations. Vol. 12. Cambridge, UK: Cambridge University Press; 2003. [Google Scholar]
  7. Chen L, Aihara K. Chaos and asymptotical stability in discrete-time neural networks. Phys. D. 1997;104:286–325. [Google Scholar]
  8. Chen T, et al. Modeling gene expression with differential equations. Pac. Symp. Biocomput. 1999;4:29–40. [PubMed] [Google Scholar]
  9. Chou IC, Voit EO. Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Math. Biosci. 2009;219:57–83. doi: 10.1016/j.mbs.2009.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Cook D, et al. Modeling stochastic gene expression: implications for haploinsufficiency. Proc. Natl Acad. Sci. 1998;95:15641–15646. doi: 10.1073/pnas.95.26.15641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Corigliano A, Mariani S. Parameter identification in explicit structural dynamics: performance of the extended Kalman filter. Comput. Methods Appl. Mech. Eng. 2004;193:3807–3836. [Google Scholar]
  12. Craciun G, et al. Statistical model for biochemical network inference. Commun. Stat.-Simul. Comput. 2013;42:121–137. doi: 10.1080/03610918.2011.633200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Daum F. Nonlinear filters: beyond the Kalman filter. IEEE Aerosp. Electron. Syst. Mag. 2005;20:57–69. [Google Scholar]
  14. Dehal PS, et al. Microbesonline: an integrated portal for comparative and functional genomics. Nucleic Acids Res. 2009;38(Suppl. 1):D396–D400. doi: 10.1093/nar/gkp919. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Deutschbauer A, et al. Evidence-based annotation of gene function in shewanella oneidensis MR-1 using genome-wide fitness profiling across 121 conditions. PLoS Genet. 2011;7:e1002385. doi: 10.1371/journal.pgen.1002385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Deutschbauer A, et al. Parallel phenotypic analysis of sporulation and postgermination growth in saccharomyces cerevisiae. Proc. Natl Acad. Sci. 2002;99:15530–15535. doi: 10.1073/pnas.202604399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. de Jong H. Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol. 2002;9:67–103. doi: 10.1089/10665270252833208. [DOI] [PubMed] [Google Scholar]
  18. Egriboz O, et al. The rapid GAL gene switch of Saccharomyces cerevisiae depends on nuclear Gal3, not Nucleo-cytoplasmic trafficking of Gal3 and Gal80. Genetics. 2011;189:825–836. doi: 10.1534/genetics.111.131839. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Flick J, Johnston M. Two systems of glucose repression of the GAL1 promoter in Saccharomyces cerevisiae. Mol. Cell. Biol. 1990;10:4757–4769. doi: 10.1128/mcb.10.9.4757. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Friedman N, et al. Using Bayesian networks to analyze expression data. J. Comput. Biol. 2000;7:601–620. doi: 10.1089/106652700750050961. [DOI] [PubMed] [Google Scholar]
  21. Giaever G, et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002;418:387–391. doi: 10.1038/nature00935. [DOI] [PubMed] [Google Scholar]
  22. Hanai T, et al. Application of bioinformatics for DNA microarray data to bioscience, bioengineering and medical fields. J. Biosci. Bioeng. 2006;101:377–384. doi: 10.1263/jbb.101.377. [DOI] [PubMed] [Google Scholar]
  23. Hendrickx DM, et al. Reverse engineering of metabolic networks, a critical assessment. Mol. Biosyst. 2011;7:511–520. doi: 10.1039/c0mb00083c. [DOI] [PubMed] [Google Scholar]
  24. Hillenmeyer ME, et al. Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug action. Genome Biol. 2010;11:R30. doi: 10.1186/gb-2010-11-3-r30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hillenmeyer ME, et al. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008;320:362–365. doi: 10.1126/science.1150021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Holter N, et al. Dynamic modeling of gene expression data. Proc. Natl Acad. Sci. 2001;98:1693–1698. doi: 10.1073/pnas.98.4.1693. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Huang S. Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery. J. Mol. Med. 1999;77:469–480. doi: 10.1007/s001099900023. [DOI] [PubMed] [Google Scholar]
  28. Jiang D, et al. Proceedings of Third IEEE Symposium on Bioinformatics and Bioengineering. Bethesda, Maryland: IEEE; 2003. DHC: a density-based hierarchical clustering method for time series gene expression data; pp. 393–400. [Google Scholar]
  29. Johnston M, et al. Multiple mechanisms provide rapid and stringent glucose repression of GAL gene expression in Saccharomyces cerevisiae. Mol. Cell. Biol. 1994;14:3834–3841. doi: 10.1128/mcb.14.6.3834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Julier S, Uhlmann J. International Symposium Aerospace/Defense Sensing, Simulations and Controls. Vol. 3. Bellingham, WA, USA: SPIE; 1997. A new extension of the Kalman filter to nonlinear systems; pp. 26–37. [Google Scholar]
  31. Kauffman S, et al. Random Boolean network models and the yeast transcriptional network. Proc. Natl Acad. Sci. 2003;100:14796–14799. doi: 10.1073/pnas.2036429100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kellam P, et al. A framework for modelling virus gene expression data. Intell. Data Anal. 2002;6:267–279. [Google Scholar]
  33. Lecca P, et al. Network inference from time-dependent omics data. Methods Mol. Biol. 2011;719:435–455. doi: 10.1007/978-1-61779-027-0_20. [DOI] [PubMed] [Google Scholar]
  34. Liang K, Wang X. Gene regulatory network reconstruction using conditional mutual information. EURASIP J. Bioinformatics Syst. Biol. 2008;2008:253894. doi: 10.1155/2008/253894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Liu T, et al. Model gene network by semi-fixed Bayesian network. Expert Syst. Appl. 2006;30:42–49. [Google Scholar]
  36. Lohr D, et al. Transcriptional regulation in the yeast GAL gene family: a complex genetic network. FASEB J. 1995;9:777–787. doi: 10.1096/fasebj.9.9.7601342. [DOI] [PubMed] [Google Scholar]
  37. Luscombe NM, et al. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature. 2004;431:308–312. doi: 10.1038/nature02782. [DOI] [PubMed] [Google Scholar]
  38. Margolin A, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl. 1):S7. doi: 10.1186/1471-2105-7-S1-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Mischel P, et al. DNA-microarray analysis of brain cancer: molecular classification for therapy. Nat. Rev. Neurosci. 2004;5:782–792. doi: 10.1038/nrn1518. [DOI] [PubMed] [Google Scholar]
  40. Novichkov PS, et al. Regprecise: a database of curated genomic inferences of transcriptional regulatory interactions in prokaryotes. Nucleic Acids Res. 2010;38(Suppl. 1):D111–D118. doi: 10.1093/nar/gkp894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Oh J, et al. A universal TagModule collection for parallel genetic analysis of microorganisms. Nucleic Acids Res. 2010;38:e146. doi: 10.1093/nar/gkq419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Ostergaard S, et al. Increasing galactose consumption by Saccharomyces cerevisiae through metabolic engineering of the GAL gene regulatory network. Nat. Biotechnol. 2000;18:1283–1286. doi: 10.1038/82400. [DOI] [PubMed] [Google Scholar]
  43. Pierce SE, et al. Chemogenomic approaches to elucidation of gene function and genetic pathways. Methods Mol. Biol. 2009;548:115–143. doi: 10.1007/978-1-59745-540-4_7. [DOI] [PubMed] [Google Scholar]
  44. Powell M. Algorithms for Approximation. Oxford, UK: Clarendon Press; 1987. Radial basis functions for multivariable interpolation: a review; pp. 143–167. [Google Scholar]
  45. Rangel C, et al. Modeling T-cell activation using gene expression profiling and state-space models. Bioinformatics. 2004;20:1361–1372. doi: 10.1093/bioinformatics/bth093. [DOI] [PubMed] [Google Scholar]
  46. Reiss D, et al. Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks. BMC Bioinformatics. 2006;7:280. doi: 10.1186/1471-2105-7-280. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Samoilov M, et al. On the deduction of chemical reaction pathways from measurements of time series of concentrations. Chaos. 2001;11:108–114. doi: 10.1063/1.1336499. [DOI] [PubMed] [Google Scholar]
  48. Shmulevich I, et al. From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proc. IEEE. 2002;90:1778–1792. [Google Scholar]
  49. Simon D. Optimal State Estimation: Kalman, H-infinity and Nonlinear Approaches. Hoboken, NJ: John Wiley and Sons; 2006. [Google Scholar]
  50. Steinmetz L, et al. Systematic screen for human disease genes in yeast. Nature Genet. 2002;31:400–404. doi: 10.1038/ng929. [DOI] [PubMed] [Google Scholar]
  51. Stuart J, et al. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255. doi: 10.1126/science.1087447. [DOI] [PubMed] [Google Scholar]
  52. Tian T, Burrage K. The Congress on Evolutionary Computation. Vol. 1. Canberra, Australia: IEEE Press; 2003. Stochastic neural network models for gene regulatory networks; pp. 162–169. [Google Scholar]
  53. Wan E, Van Der Merwe R. IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium. Lake Louise, Canada: IEEE; 2000. The unscented Kalman filter for nonlinear estimation; pp. 153–158. [Google Scholar]
  54. Wang H, et al. IEEE International Workshop on Genomic Signal Processing and Statistics. College Station, TX: IEEE; 2006. Inference of gene regulatory networks using genetic programming and Kalman filter; pp. 27–28. [Google Scholar]
  55. Wang L, Schonfeld D. 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) Dallas, TX: IEEE; 2010. Game theoretic model for control of gene regulatory networks; pp. 542–545. [Google Scholar]
  56. Wang Z, et al. On delayed genetic regulatory networks with polytopic uncertainties: robust stability analysis. IEEE Transact. NanoBiosci. 2008;7:154–163. doi: 10.1109/TNB.2008.2000746. [DOI] [PubMed] [Google Scholar]
  57. Wang Z, et al. An extended Kalman filtering approach to modeling nonlinear dynamic gene regulatory networks via short gene expression time series. IEEE/ACM Transact. Comput. Biol. Bioinformatics. 2009;6:410–419. doi: 10.1109/TCBB.2009.5. [DOI] [PubMed] [Google Scholar]
  58. Winzeler E, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 1999;285:901–906. doi: 10.1126/science.285.5429.901. [DOI] [PubMed] [Google Scholar]
  59. Yeung M, et al. Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl Acad. Sci. 2002;99:6163–6168. doi: 10.1073/pnas.092576199. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES