IET Systems Biology. 2018 Dec 1;12(6):294–303. doi: 10.1049/iet-syb.2018.5015

Hierarchical parameter estimation of GRN based on topological analysis

Wei Zhang 1, Feng Zhang 1, Jianming Zhang 1, Ning Wang 1
PMCID: PMC8687294  PMID: 30472694

Abstract

Reverse engineering of gene regulatory networks (GRNs) is an important and challenging task in systems biology. Existing parameter estimation approaches that treat all model parameters as equally important are usually computationally expensive or infeasible, especially in dealing with complex biological networks. In order to improve the efficiency of computational modelling, the paper applies a hierarchical estimation methodology to the computational modelling of GRN based on topological analysis. This paper divides nodes in a network into various priority levels using a graph‐based measure and a genetic algorithm. The nodes in the first level, which correspond to root strongly connected components (SCCs) in the digraph of the GRN, are given top priority in parameter estimation. The estimated parameters of vertices in the previous priority level are used to infer the parameters for nodes in the next priority level. The proposed hierarchical estimation methodology obtains lower error indexes while consuming fewer computational resources compared with a single estimation methodology. Experimental outcomes with in silico networks and a realistic network show that gene networks are decomposed into no more than four levels, which is consistent with the inherent modularity of GRNs. In addition, the proposed hierarchical parameter estimation achieves a balance between computational efficiency and accuracy.

Inspec keywords: biology computing, network theory (graphs), reverse engineering, graph theory, genetics, genetic algorithms, directed graphs, parameter estimation

Other keywords: hierarchical parameter estimation, GRN, topological analysis, gene regulatory network, important task, computational systems biology, compute model parameters, complex biological networks, efficient information, model quality, parameter reliability, computational modelling, study divides nodes, priority levels, graph‐based measure, previous priority level, hierarchical estimation methodology obtains, computational resources, single time estimation, insilico network, realistic network show, computational efficiency

1 Introduction

Mathematical modelling of biological networks provides a way to understand the regulatory mechanisms of the whole system by using the measured expression data as the information source. As typical biological networks, gene regulatory networks (GRNs) are represented by the weighted graphs. The purpose of gene network inference is to reveal the quantitative regulatory relationship between genes and transcription factors (TFs), thus providing theoretical support for genetic circuit engineering. TFs are those genes that are usually located in the upstream of information flow within a network and control other nodes. The applications of computational modelling for GRN cover various areas such as epidemic prevention, energy production and synthetic biology. In synthetic biology, accurate and reliable models provide guides to construct synthetic genetic networks with certain cellular functions.

With the rapid development of systems biology and high‐throughput technologies, network inference has become a feasible task with efficient inference algorithms. The problem of gene network inference has received increasing attention in recent years [1]. Early studies discussed network identification using time‐series expression measurements [2, 3]. Various computational approaches have been applied to infer regulatory relations among the network components [4, 5]. Incorporation of prior knowledge helps to improve the reliability and accuracy of computational modelling of gene regulatory networks [6]. Since the regulations between regulators and target genes are directed, directed graphs make it convenient to analyse the gene regulation mechanism. In the digraph, both the network structure and regulatory strengths are partially or completely unknown, and thus need to be inferred from microarray or single cell RNA sequencing data. Sometimes, network structure and model parameters are considered simultaneously during network inference [7].

Existing network inference algorithms usually break the modelling task into structure identification and parameter estimation [8]. Structure identification algorithms determine the network structure, i.e. the directed regulations between nodes, using transcription data. Using high‐throughput expression data as information sources, correlation‐based and mutual‐information‐based approaches are capable of determining an appropriate topology by calculating statistical dependencies between network components [9, 10]. Mutual information is used to quantitatively reflect the similarity between expression levels of a pair of genes. Inference algorithms such as ARACNE and its derivatives are classical reconstruction methods that calculate the statistical dependencies between genes [11, 12]. It should be noted that mutual‐information‐based approaches only obtain undirected regulations between genes.

In general, the current gene network inference algorithms mainly focus on inferring the network topology instead of unknown model parameters including regulatory strengths. These parameters are crucial to understand the regulation mechanism of gene regulations and to evaluate the importance of gene regulations. Research works have seriously discussed the parameter estimation for gene networks [13, 14]. With a well‐defined cost function, the problem of parameter estimation can be converted to an optimisation problem that needs considerable resources [15, 16]. Incorporating prior knowledge especially structural information is a promising solution to improve the performance of parameter estimation [17]. Even with efficient transcription profiles, parameter estimation algorithms are generally time‐consuming and consume considerable computational resources.

The challenges in parameter estimation of GRN partly lie in the computational burden during optimisation. The huge parameter space caused by the number of nodes and regulatory edges within a network makes it difficult for optimisation approaches to find the global optimal parameters. Another important issue is the lack of useful information. The number of biological experiments that perturb the expressions of GRN is quite limited compared with that of genes and relevant edges. Useful information underlying measured expression levels of genes should be extracted to infer unknown model parameters. Although this problem may be solved by adding more perturbations, this solution is expensive or even unfeasible due to the cost. Current parameter estimation of GRNs using time‐series expression data always faces a few drawbacks [18]. The first factor is huge feasible parameter space which is related with the number of genes in networks. Meanwhile, correlations between model parameters and states may lead to structural and parametric unidentifiability. Lack of efficient information for parameter estimation is also a factor that is related with the quality of mathematical models and inferred parameters.

In fact, the dynamics of the regulatory system are determined by a subset of nodes that corresponds to specific groups of key genes. Using weighted graphs to describe GRN, genetic regulations can be quantitatively evaluated by the weights of edges. Compared to peripheral gene nodes, key genes exert stronger regulation, such as activation or repression, on the downstream genes, which have relatively low degrees. On the basis of this hypothesis, this paper computes the parameters of nodes in the next stage using the estimated parameters of the previous stage as known information. The potential advantages of this strategy are improved computational efficiency due to decomposed network structures and a reduced number of key nodes in a network.

Such priority levels are determined by topology analysis of decomposition of the network structures using an elitism‐based genetic algorithm (GA). The first level of nodes selected by the optimisation approach is regarded as root strongly connected components (SCCs), which locate in upstream in information flow for a given gene network. The second step of topology analysis is SCC decomposition of the digraph of GRN. The purpose of hierarchical estimation is to obtain kinetic parameters for genes in root SCCs, which are regarded as the largest subgraphs in SCC. In this case, inferred parameters of gene nodes in root SCC can be used as known information to calculate the parameters of the second priority level of nodes. The proposed hierarchical estimation strategy that deals with genes with different priority levels has advantages of improving computational efficiency.

2 Deterministic modelling of GRNs

In network inference, selection of mathematical models is a crucial step. Bayesian networks are probabilistic modelling approaches that compute the posterior distribution of model parameters [19, 20]. Deterministic modelling using ordinary differential equations (ODEs) is another candidate solution to describe the expression behaviours of gene networks. Compared to probabilistic modelling, deterministic modelling based on ODE‐based models needs limited computational resources and still captures system dynamics with appropriate parameters. With an identified network structure, expression behaviours of nodes in a gene network are described by a set of differential equations, where unknown parameters including regulation strengths, transcription and degradation rates need to be computed using expression data.

2.1 Model formulation

The linear differential models for networks are described by the equation below:

$$\frac{dx(t)}{dt} = Ax(t) + Bu(t) \tag{1}$$

where the vector $x(t) = [x_1(t), x_2(t), \ldots, x_N(t)]^T$ is the state vector of a network consisting of $N$ nodes at the time point $t$, $x_i(t)$ represents the transcriptional expression level of the gene $i$ and the input vector $u(t)$ reflects the environmental perturbations. The coefficients $a_{ij}$ in the state matrix $A \in \mathbb{R}^{N \times N}$ reflect the strengths of gene regulation. Influence on inner states is reflected by the input matrix $B$. For a linear time‐invariant dynamic system, $\dot{x}(t) = Ax(t) + Bu(t)$ and $y(t) = Cx(t)$. According to the observability analysis theory, the observability matrix $O = [C^T, (CA)^T, \ldots, (CA^{N-1})^T]^T$ should satisfy the full rank requirement to guarantee the observability of a network [21]. Furthermore, if a network is described by non‐linear differential equations, denoted by the equation below:

$$\frac{dx(t)}{dt} = f(t, x(t), u(t)) \tag{2}$$

Then, the Jacobian matrix should satisfy the full rank condition, i.e. $\operatorname{rank} J = N$, where $J_{ij}$ is computed using the Lie derivatives $L_F$. During structure identification of GRN, the gene nodes are generally divided into hub and non‐hub genes [22]. Gene nodes with high degrees are considered hub genes and form a skeleton for a network. The remaining genes are non‐hub genes that are generally in peripheral positions. It should be noted that in‐degree is different from out‐degree from the viewpoint of information flow. If a gene has a high out‐degree, it is more likely to provide control and thus plays the role of a hub gene. The adjacency matrix of the network is decomposed into two parts, as shown in the equation below:

$$A = A_1 + A_2 \tag{3}$$

where $A_1$ and $A_2$ represent the subsets of hub and non‐hub gene nodes. The hub nodes in subset $A_1$ also play the role of a bridge transferring control information from upstream to downstream. Given time‐series measurements $\{x_t\}_{t=1}^{T}$, the cost function is defined by the equation below:

$$f(A) = \arg\min \|Y - AX\|^2 \tag{4}$$

where Y is the output vector of the target gene and X represents the expression of regulator genes. To model the network, output variables can be constructed. Such outputs correspond to the expression of downstream genes that are regulated by other regulators in the upstream of information flow. For a network, the network structure as well as the directed regulation between candidate regulators and target genes may be unclear. Feature selection methods are able to calculate scores for possible regulation between candidate regulators and target genes. By minimising the cost function f(A), the best possible solution A is computed to determine the network structure.

2.2 Topological characteristics of GRN

Considering the topological characteristics of GRNs, the directed graphs are useful tools to describe the connections and information flow between components within a network [23]. On the basis of the graph theory, a gene network is described by a directed graph G=(V,E), where V and E denote vertices and edges in a graph. In the study of GRNs, the majority of the genes are sparsely connected, while densely connected genes also exist. Such densely connected genes play the role of key nodes that regulate the system dynamics and can be identified by analysing degree distribution for a given network.

Scale‐free topology is an interesting characteristic owned by GRN and can be used to infer GRN as prior information [24–26]. The fraction $P(k)$ of nodes in the scale‐free network having $k$ connections to other nodes follows the power law, defined as $P(k) \sim k^{-\gamma}$, where $2 < \gamma < 3$. The degree of a node in a network corresponds to the number of directed relations with other nodes. The degree distribution, i.e. the probability distribution of node degrees, is closely related to the topological characteristics of a given network. For a graph $G$ with edge set $E$, the degree of a vertex $v$ is denoted by $\deg(v)$, then the scale‐free metric is defined by the equation below:

$$S(G) = \sum_{(u,v) \in E} \deg(u) \deg(v) \tag{5}$$

This metric $S(G)$ will be maximised when high‐degree nodes are connected to other high‐degree nodes [27]. For a given graph $G = (V, E)$, connected components in $G$ are defined as a subgraph $G' = (V', E')$ where vertices $i, j \in V'$ are reachable from each other. Nodes and directed edges are two basic components in a gene network. Properties of the biological network can be analysed from the viewpoint of network centrality [28]. The purpose of analysing network centrality is to identify the most important vertices in a graph based on the topological analysis. Degree and betweenness centrality are discussed to judge the importance of a node. The degree is closely related to the number of directed edges that are connected to a target node, while betweenness is used to measure how central a node in a given network is [29]. In weighted networks, the strength of a node is computed using the weights of adjacent edges that are denoted by the equation below:

$$s_i = \sum_{j=1}^{N} a_{ij} \omega_{ij} \tag{6}$$

where $a_{ij}$ and $\omega_{ij}$ are the entries of the adjacency and weighting matrices between nodes $i$ and $j$. Before modular decomposition, the SCCs are introduced to describe topological relations between nodes [30]. The set of root SCCs is a subset of SCC [31]. A directed graph is strongly connected if there is a directed path from each node to every other node. A subgraph is regarded as strongly connected when each vertex is reachable from every other vertex. One SCC of a directed graph corresponds to a maximal strongly connected subgraph. In the selection of important nodes within a network, SCCs that have no in‐degrees from outside the component are selected to form the subset of root SCC. Both SCC and root SCC are specific kinds of subsets in a graph. The subset of root SCC, which is the maximisation of G, can be considered as a condensate node from the viewpoint of information flow. A brief comparison of SCC and root SCC is depicted in Fig. 1.
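As a small illustration of the scale‐free metric in (5) and the node strength in (6), the following sketch evaluates both quantities on a toy network. The four‐node edge list, the weights and the function names are hypothetical, and the degree is assumed here to be the total degree (in‐degree plus out‐degree), which is one possible reading of the definition above.

```python
from collections import defaultdict


def degrees(edges):
    """Total degree (in + out) of each node of a directed edge list (assumption)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def scale_free_metric(edges):
    """S(G): sum of deg(u) * deg(v) over all edges (u, v), as in (5)."""
    deg = degrees(edges)
    return sum(deg[u] * deg[v] for u, v in edges)


def node_strength(adjacency, weights):
    """s_i = sum_j a_ij * w_ij for every node i, as in (6)."""
    return [
        sum(a * w for a, w in zip(a_row, w_row))
        for a_row, w_row in zip(adjacency, weights)
    ]


# Hypothetical toy network: g1 regulates g2, g3 and g4; g2 regulates g3.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]
S = scale_free_metric(edges)

# Rows are regulators g1..g4, columns are targets g1..g4 (illustrative weights).
adjacency = [[0, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
weights = [[0, 0.5, -1.0, 2.0], [0, 0, 0.25, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
strength = node_strength(adjacency, weights)

print(S)         # 19
print(strength)  # [1.5, 0.25, 0, 0]
```

In this toy case the hub‐like node g1 has the largest strength, which matches the intuition that nodes with many weighted outgoing regulations are candidates for key nodes.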

Fig. 1

SCC and root SCC in gene networks. There exists mutual regulation between gene nodes xk and xj, which indicates close information sharing within the subset. The expressions of genes xi and xk are directly controlled by the elements in root SCC. This indicates that importance of nodes in a given GRN is closely related with their topological positions

In Fig. 1, there exist three subsets of SCCs including one root SCC and two common SCCs. In this paper, vertices in the root SCC are considered as important nodes that provide regulation on the behaviours of the whole system. In the proposed hierarchical decomposition methodology, the connections within a subset are given lower priority since the inner connections generally reflect information flows within a module. Those nodes that locate at peripheral positions of a given network are ignored to some extent. Motivated by degree modularity of gene networks, the hierarchical estimation strategy first focuses on the key nodes and relevant edges.

3 Hierarchical decomposition methodology

The topological characteristics of gene network such as sparsity and hub network structure indicate that unknown parameters have different levels of importance in influencing system behaviours. However, current modelling algorithms estimate unknown coefficients equally using transcriptional data, thus leading to computational difficulties. A natural way to solve this problem is to perform the parameter estimation at different priority levels instead of inferring all parameters at one time. This strategy requires that, for a given network, its nodes should be divided according to topological characteristics and locations in information flow into several groups. Nodes that locate in upstream of information flow in a gene network provide genetic regulation to those located in downstream.

3.1 Selection of key genes

Graph representation of GRN brings powerful tools for analysing the topology of a network. Traditional controllability analysis for networks meets certain difficulties. With the graph representation of biological networks, characteristics such as accessibility and controllability of systems can be analysed [32, 33]. Each node in the set of SCC is regarded as an information source that can be used to infer the dynamics of other nodes within the subset. In this case, the structural control theory was proposed to analyse the linear network [34]. Vertices in root SCC have only out‐degrees, which are denoted by blue coloured nodes in Fig. 2, thus transferring regulation to downstream nodes. Generally speaking, the condition of root SCC is stricter than that of SCC for a given network.

Fig. 2

Two situations of root SCC within gene networks

(a) A single node controls three downstream genes, while a subset consisting of three closely related genes regulates the downstream genes, (b) Genes in the root SCC, coloured blue, play a similar role to the single regulator gene and are thus regarded as a condensate node

On the basis of the characteristics of GRN such as scale‐free topology and modularity, parameter estimation can be performed at different priority levels, determined by topological analysis. With a graph‐based measure, nodes in a network are divided into different levels according to their topological positions. Such a task can be accomplished by an elitism‐based GA.

3.2 Hierarchical decomposition algorithm

Gene networks in this paper are assumed to be connected, which means that every vertex is linked to the rest of the network and there are no isolated islands. The task of hierarchical decomposition is to select the subset of key nodes using a graph‐based measure. In this paper, the key nodes are defined as the root SCC. In graph theory, components in root SCC, which is a subset of SCC, are generally located upstream in the information flow of a given network. In the proposed methodology, the selected first‐level subset is considered as a merged node in the next level of estimation. Directed edges that are regulated by components in the first level L1 can be regarded as edges linked with the new merged node.

  1. Reduce the scope based on the connectivity : If there exists a directed edge $x_i \to x_j$, the parameters of the target gene $x_j$ can be inferred by measuring the expression of the regulator $x_i$. The regulator $x_i$ should therefore have a higher priority level during modelling, while parameter estimation for the target gene $x_j$ can be postponed to the next stage. From the viewpoint of information flow, the regulation information flows from the regulator $x_i$ to the target gene $x_j$.

  2. Determine the key subset V1V2 : This step randomly selects a subset V1V, where V is the set of vertices. The subset V2 is defined such as V2=V. For each gene node g in V1, those nodes in V2 that are inferred by g. Then, the nodes in the subset V1V2 are regarded as components in the first priority level.

  3. Correlation analysis and deletion of redundant nodes : The elitism‐based GA optimises the number of selected nodes using the fitness function defined in (7). After determining the subset L1=V1V2, there may be redundant nodes that are not components of the root SCC. Each node in L1 is then traversed to check the strongly coupled condition, shown in Fig. 2 b. Nodes that meet the strongly coupled condition are added to a new set T. If the constructed set T is not an SCC, it is deleted; otherwise, T is merged into the key nodes L1.

  4. Extension to the next priority level Li : Vertices in the first level L1, selected by the standard of root SCC, are those SCCs that have only out‐degrees. On the basis of this principle, nodes in the second level L2 are directly controlled by nodes in L1. Those nodes that have SCC relations with nodes in L2 are absorbed to L2.

Furthermore, the optimisation is performed by a modified GA. This elitism‐based GA uses a sub‐function to read the adjacency list of a given network and uses individuals to represent the selected nodes by means of binary encoding. The elitism‐based GA first generates an initial population, in which individuals are binary coded using elements 0 and 1. Every individual represents a candidate solution of the optimisation, i.e. a set of key nodes. The work flow of the hierarchical decomposition algorithm for gene networks is described in Fig. 3.

Fig. 3

Flowchart of the GA‐based key nodes selection algorithm. To get a consistent hierarchical decomposition outcome for a given network, the part of reversing left nodes will be accomplished

The fitness function in selecting the first level of key nodes that is also regarded as the set of root SCC is defined by the equation below:

f=N(V1)(V2) (7)

In (7), N is the number of nodes of a given graph. The subset V1 represents the node set of root SCC and V2 is another subset of V. The basic strategy is randomly selecting a subset V1 and reversing nodes to delete nodes that belong to V2. This study applies the elitism‐based GA to maximise the fitness f in (7). In each generation, the GA loops over the population and creates new individuals with the mutation and crossover operations.
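The elitism‐based GA loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph and the covering‐style fitness `toy_fitness` are hypothetical stand‐ins rather than the fitness in (7), uniform crossover and size‐2 tournaments are assumed operator choices, and the crossover and mutation probabilities (0.5 and 0.015) are the values reported in the experimental section.

```python
import random

# Hypothetical toy digraph: node index -> list of regulated node indices.
TOY_EDGES = {0: [1, 2], 1: [2], 3: [4, 5], 4: [3]}
N = 6


def toy_fitness(individual):
    """Stand-in fitness: prefer few selected nodes that still cover every node."""
    selected = {i for i, bit in enumerate(individual) if bit}
    covered = set(selected)
    for i in selected:
        covered.update(TOY_EDGES.get(i, ()))
    if len(covered) < N:
        return len(covered) - N  # negative penalty for uncovered nodes
    return N - len(selected)


def tournament(pop, fits, rng, k=2):
    """Return the fittest of k randomly drawn individuals."""
    picks = [rng.randrange(len(pop)) for _ in range(k)]
    return pop[max(picks, key=lambda i: fits[i])]


def evolve(fitness, n_bits, pop_size=30, generations=50,
           p_cross=0.5, p_mut=0.015, seed=0):
    """Binary-coded GA with elitism; returns the best individual and the
    best fitness recorded at each generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fits = [fitness(ind) for ind in pop]
        elite = pop[max(range(pop_size), key=lambda i: fits[i])]
        history.append(max(fits))
        new_pop = [elite[:]]  # elitism: the best individual survives unchanged
        while len(new_pop) < pop_size:
            a = tournament(pop, fits, rng)
            b = tournament(pop, fits, rng)
            # Uniform crossover: each bit comes from b with probability p_cross.
            child = [y if rng.random() < p_cross else x for x, y in zip(a, b)]
            # Bit-flip mutation with probability p_mut per bit.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            new_pop.append(child)
        pop = new_pop
    fits = [fitness(ind) for ind in pop]
    best = max(range(pop_size), key=lambda i: fits[i])
    history.append(fits[best])
    return pop[best], history


best, history = evolve(toy_fitness, n_bits=N)
print(best, history[-1])
```

Because the elite individual is copied unchanged into every new population, the best fitness recorded in `history` can never decrease from one generation to the next, which is the property that distinguishes an elitism‐based GA from a plain generational GA.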

After optimisation, the subset of nodes in L1 is selected as key nodes, which are hub genes in a network. The edges connected with these nodes transfer information from upstream to downstream. According to the theory of six degrees of separation, the maximum number of levels is expected to be no more than six [35]. This principle indicates that, in a small‐world network such as a social network, any node is connected to any other node in fewer than six steps.

To illustrate the hierarchical decomposition in detail, a simplified gene network consisting of five nodes is taken into consideration, as shown in Fig. 4.

Fig. 4

Illustration of hierarchical decomposition strategy. The nodes $x_1$ and $x_5$ have only out‐degrees and are divided into the first priority level. The decomposition then reverses other nodes to judge whether they belong to $L_1$. There is mutual regulation between $x_5$ and $x_6$, so the subset becomes $\{x_1, x_5, x_6\}$

The working flow of the hierarchical decomposition part is implemented in two steps. The first step of decomposition is finding the nodes that have only out‐degrees, corresponding to the subset of nodes $s = \{x_1, x_5\}$ in Fig. 4. The directed edge from node $i$ to $j$ is denoted as $(i, j)$. When there is a mutual connection between $i$ and $j$, these two vertices are regarded as strongly connected, denoted by i,j. Furthermore, the nodes $x_5$ and $x_6$ are strongly connected, thus the gene $x_6$ will be absorbed into the first priority level. In this case, the first level of subsets is $s = \{x_1, x_5, x_6\}$. For the first‐level subset, the out‐degree edges transfer the information to nodes in the next level of vertices. In the second step, extensions are performed independently from the subset $s = \{x_1, x_5, x_6\}$.

For vertices in the current level, the first task is to determine vertices in the next level. Vertices in the top priority level L1 are required to have only out‐degrees and no in‐degrees. The set of nodes in previous and next priority levels are denoted by preLevel and nextLevel, respectively. GraphLength defines the number of nodes in a given network and tmpQueue denotes a temporary queue that is used to store the nodes in the key subset v1. On the basis of GA algorithm, the hierarchical decomposition algorithm of directed graphs of GRN is described by the pseudocode in Algorithm 1 (see Fig. 5).

Fig. 5

Algorithm 1: Pseudocode of hierarchical decomposition algorithm of gene network
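The level‐by‐level extension described in this section can be sketched in a few lines. The code below is a hedged illustration rather than a transcription of Algorithm 1: it assumes that the first level consists of all nodes in root SCCs (SCCs with no incoming edge from outside), that each following level collects the unassigned nodes directly regulated by the current level, and that nodes sharing an SCC with them are absorbed into the same level. The GA search is replaced here by a direct reachability computation, the function names are hypothetical, and the six‐node graph is a made‐up example loosely inspired by the description of Fig. 4.

```python
from collections import deque


def reachable(graph, start):
    """Set of nodes reachable from start (including start) by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def strongly_connected_components(graph):
    """Map each node to its SCC (a frozenset) using mutual reachability.
    Simple rather than fast; adequate for small networks."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reach = {u: reachable(graph, u) for u in nodes}
    comp_of = {}
    for u in sorted(nodes):
        if u not in comp_of:
            comp = frozenset(v for v in reach[u] if u in reach[v])
            for v in comp:
                comp_of[v] = comp
    return comp_of


def hierarchical_levels(graph):
    """Return the priority levels as a list of sorted node lists."""
    comp_of = strongly_connected_components(graph)
    # An SCC is a root SCC if no edge enters it from a different SCC.
    has_outside_input = set()
    for u, targets in graph.items():
        for v in targets:
            if comp_of[u] != comp_of[v]:
                has_outside_input.add(comp_of[v])
    level = set()
    for comp in set(comp_of.values()):
        if comp not in has_outside_input:
            level |= comp
    assigned, levels = set(), []
    while level:
        levels.append(sorted(level))
        assigned |= level
        next_level = set()
        for u in level:
            for v in graph.get(u, ()):
                if v not in assigned:
                    # Absorb the whole SCC of a directly controlled node.
                    next_level |= comp_of[v] - assigned
        level = next_level
    return levels


# Hypothetical example: x5 and x6 regulate each other and have no outside input.
graph = {"x1": ["x2"], "x2": ["x3"], "x3": ["x4"], "x5": ["x6", "x3"], "x6": ["x5"]}
levels = hierarchical_levels(graph)
print(levels)  # [['x1', 'x5', 'x6'], ['x2', 'x3'], ['x4']]
```

In this example the mutually regulating pair x5 and x6 enters the first level together with x1, mirroring the absorption step described for Fig. 4, and the remaining nodes fall into two further levels.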

4 Hierarchical parameter estimation of GRN

The proposed hierarchical estimation strategy aims to evaluate the importance of genes according to the graph‐based measure, i.e. the in‐degree and out‐degree distributions. For a given network, the nodes in the first priority level L1 are inferred first using transcriptional profiles, thus alleviating the underdetermined problem caused by a huge number of gene nodes and limited measured data. After determining the first level L1, the hierarchical decomposition algorithm further extends to the next level L2. Considering the topological characteristics of GRN, the hierarchical levels are expected to be no more than six.

Considering the complexity of biological networks and the number of genes, the minimisation of the cost function in one‐time‐all strategy needs a considerable resource to search for the global optimum. Under a given network topology, the inferring algorithms search for the parameters via a defined cost function. For the task of computational modelling of GRN, linear ODE systems are usually applied to describe the expression dynamics. However, non‐linear ODE models have the advantage of capturing the complex dynamics of biological models [36]. Non‐linear modelling of GRN has its own advantages in reflecting the regulation mechanisms [37, 38]. In non‐linear ODE models of GRN, the regulation of the expression of a target gene can be described by a combinatorial action of multiple regulators. The corresponding ODE models are described as the equation below:

$$\frac{dx_i}{dt} = \sum_{j=1}^{m} \omega_{ij} x_j + b_i, \quad 1 \le i \le n, \ 1 \le m \le n \tag{8}$$

The weighted parameter $\omega_{ij}$ represents the directed regulatory relation from the regulator gene $j$ to the target gene $i$. Gene regulation is divided into activation and inhibition depending on the sign of $\omega_{ij}$: positive values correspond to activation and negative values to inhibition. Non‐linear models are popular in describing relatively complex expression dynamics compared with linear models. With a sigmoidal transfer function, the differential equation of expression dynamics is shown as the equation below:

$$\frac{dx_i}{dt} = \frac{k_i}{1 + \exp\left(-\left(\sum_{j} \omega_{ij} x_j(t) - b_i\right)\right)} - d_i x_i(t) \tag{9}$$

where $j = 1, \ldots, m$, the parameter $k_i$ denotes the maximal expression rate and the constant $d_i$ represents the degradation rates of biomolecular products. For each gene, there are three unknown parameters $k_i$, $b_i$ and $\omega_{ij}$ that are merged into $\theta$. The unknown parameter $\theta = \{k_i, b_i, \omega_{ij}\}$ will be estimated by optimisation approaches using a suitable cost function. To perform the task of parameter estimation, the cost function can be defined as the mean square error (MSE) function, shown as below:

$$E = \frac{1}{Q} \sum_{i=1}^{Q} \left(y_{t_i} - \hat{y}_{t_i}\right)^2 \tag{10}$$

where the weighted matrix $Q$ is user‐defined, $y_{t_i}$ and $\hat{y}_{t_i}$ denote the measured and model‐predicted expression levels of target genes, respectively. The non‐linear least‐square algorithm aims to minimise the errors between the model output and measured data. For a gene network consisting of $N$ genes with $T$ time points and $R$ replicated experiments, the corresponding cost function is defined as below:

$$J(\theta) = \sum_{s=1}^{R} \sum_{k=1}^{T} \sum_{i=1}^{N} \left| \hat{x}_i^s(t_k) - x_i(t_k, \theta) \right| \tag{11}$$

where $\hat{x}_i^s(t_k)$ represent the measured mRNA levels in the microarray dataset and $x_i(t_k, \theta)$ are the model outputs. Penalty terms are introduced to enhance the predictive ability of models. The penalised cost function is further described as the equation below:

$$J(\theta) = \sum_{s=1}^{R} \sum_{k=1}^{T} \sum_{i=1}^{N} \left| \hat{x}_i^s(t_k) - x_i(t_k, \theta) \right| + \alpha \|\theta\|^2 \tag{12}$$
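To make the penalised cost in (12) concrete, the sketch below simulates a sigmoidal model of the form (9) and evaluates the cost for a candidate parameter vector. It rests on several assumptions that are not taken from the paper: the sigmoid argument is written as $\sum_j \omega_{ij} x_j - b_i$, the degradation rates $d_i$ are treated as known, the ODE is integrated with a fixed‐step fourth‐order Runge–Kutta scheme, and the two‐gene network, its parameter values and the time grid are purely illustrative. The function names are hypothetical.

```python
import numpy as np


def sigmoid_rhs(x, W, k, b, d):
    """Right-hand side dx/dt = k / (1 + exp(-(W x - b))) - d * x (assumed form)."""
    return k / (1.0 + np.exp(-(W @ x - b))) - d * x


def simulate(x0, W, k, b, d, t_grid, substeps=200):
    """Integrate with classical RK4 and return the states at the times in t_grid."""
    x = np.asarray(x0, dtype=float)
    out = [x.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = (t1 - t0) / substeps
        for _ in range(substeps):
            k1 = sigmoid_rhs(x, W, k, b, d)
            k2 = sigmoid_rhs(x + 0.5 * h * k1, W, k, b, d)
            k3 = sigmoid_rhs(x + 0.5 * h * k2, W, k, b, d)
            k4 = sigmoid_rhs(x + h * k3, W, k, b, d)
            x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out.append(x.copy())
    return np.array(out)  # shape (T, N)


def penalised_cost(theta, measured, x0, d, t_grid, alpha):
    """Sum of absolute errors over replicates, time points and genes,
    plus alpha * ||theta||^2, with theta = [W.ravel(), k, b]."""
    n = len(x0)
    W = theta[: n * n].reshape(n, n)
    k = theta[n * n : n * n + n]
    b = theta[n * n + n :]
    model = simulate(x0, W, k, b, d, t_grid)
    data_term = sum(np.abs(rep - model).sum() for rep in measured)
    return data_term + alpha * float(np.dot(theta, theta))


# Illustrative two-gene network: gene 2 activates gene 1, gene 1 represses gene 2.
W_true = np.array([[0.0, 1.0], [-1.0, 0.0]])
k_true = np.array([1.0, 1.0])
b_true = np.array([0.0, 0.0])
d = np.array([0.5, 0.5])            # degradation rates, assumed known
x0 = np.array([1.0, 0.5])
t_grid = np.array([0.0, 10.0, 30.0, 120.0])   # hypothetical sampling times

theta_true = np.concatenate([W_true.ravel(), k_true, b_true])
replicate = simulate(x0, W_true, k_true, b_true, d, t_grid)
measured = [replicate, replicate]   # two noise-free synthetic replicates

cost_true = penalised_cost(theta_true, measured, x0, d, t_grid, alpha=0.0)
cost_shifted = penalised_cost(theta_true + 0.1, measured, x0, d, t_grid, alpha=0.0)
cost_penalised = penalised_cost(theta_true, measured, x0, d, t_grid, alpha=0.1)
print(cost_true, cost_shifted, cost_penalised)
```

With noise‐free synthetic data the data term vanishes at the true parameters, so the penalised cost there reduces to $\alpha \|\theta\|^2$; any optimiser (the paper uses fmincon in MATLAB) would then be applied to a function of this kind.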

The penalty factor $\alpha$ is introduced to avoid overfitting and can be determined by the L curve method [39]. For a given value of $\alpha$, the error term and the penalty term are calculated. Solving non‐linear ODEs is a time‐consuming process, and will become infeasible for complex biological systems. To estimate unknown parameters of gene networks consisting of a hundred nodes, linear regression approaches can be applied instead. In dealing with complex systems, linear models and ridge regression are applied to compute the kinetic parameters that correspond to the regulatory strength $\omega_{ij}$ [40]. The regularised cost function of ridge regression is defined as below:

$$J(A) = \arg\min \|Y - AX\|^2 + \lambda \|A\|^2 \tag{13}$$

The cost function is related to the adjacency matrix, which denotes regulations between genes. The matrices Y and X represent expression levels of the target gene and candidate regulators. The second term with coefficient λ reduces overfitting. With known network structures, the regulations between genes are clear. In the hierarchical estimation methodology, priority stages are determined by the decomposition algorithm based on the topological analysis.
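For the ridge objective in (13), the minimiser has the well‐known closed form $A = YX^T(XX^T + \lambda I)^{-1}$ when the squared Frobenius norm is used for both terms. The sketch below applies it to synthetic data; the three‐gene matrix `A_true`, the sample size and the function name `ridge_estimate` are illustrative assumptions, and columns of X and Y are taken to be samples.

```python
import numpy as np


def ridge_estimate(X, Y, lam):
    """Minimise ||Y - A X||^2 + lam * ||A||^2 over A (Frobenius norms).

    X has shape (n_regulators, n_samples), Y has shape (n_targets, n_samples).
    """
    n = X.shape[0]
    gram = X @ X.T + lam * np.eye(n)
    # A @ gram = Y @ X.T; gram is symmetric, so solve gram @ A.T = X @ Y.T.
    return np.linalg.solve(gram, X @ Y.T).T


rng = np.random.default_rng(0)
A_true = np.array([[0.0, 0.8, 0.0],
                   [-0.5, 0.0, 0.0],
                   [0.3, 0.0, -0.2]])   # hypothetical regulatory strengths
X = rng.normal(size=(3, 20))            # synthetic regulator expression
Y = A_true @ X                          # noise-free target expression

A_ls = ridge_estimate(X, Y, lam=0.0)     # ordinary least squares
A_ridge = ridge_estimate(X, Y, lam=10.0) # shrunk estimate

print(np.round(A_ls, 3))
print(np.linalg.norm(A_ridge) < np.linalg.norm(A_ls))
```

With λ = 0 and noise‐free data the estimate recovers `A_true`, while a positive λ shrinks the weights towards zero, which is the mechanism by which the second term in (13) limits overfitting when measurements are scarce.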

5 Experimental results and analysis

Simulated and realistic GRNs are selected as the benchmarks to validate the proposed hierarchical estimation. The selected realistic gene network uses microarray data of Saccharomyces cerevisiae measured under cold shock response, which involves 21 transcription factors (TFs) [16]. Kinetic parameters of five size‐100 gene networks that come from the DREAM4 challenge are estimated using microarray multifactorial data as the information source [41].

5.1 Hierarchical estimation of GRN of S. cerevisiae

The first example considers a medium‐sized network of S. cerevisiae consisting of 21 nodes and 32 directed edges. Each node represents the gene, mRNA and expressed proteins simultaneously. As a typical eukaryote, S. cerevisiae responds to environmental stresses including temperature changes. The cold shock experiments were performed between the temperatures of 10 and 18 °C. For each gene, there are three replicates of log2 fold changes in expression levels measured at four time points, which are taken 10, 30 and 120 min after cold shock perturbations [42].

In this elitism‐based GA, a function Graph.java is defined to describe a network using adjacency matrices that consist of elements 0 and 1. Then, the algorithm initialises a population and calculates the fitness values for individuals. In the elitism‐based GA, tournament selection as well as crossover and mutation operations are applied to evolve the individuals. In this paper, the probabilities of crossover and mutation are set to 0.5 and 0.015, respectively. Assuming that nodes in the current priority level have only out‐degrees and no in‐degrees, the nodes in the next level are inferred based on the nodes in the current priority level, as shown in Algorithm 1 (Fig. 5). The visualisation of the hierarchical decomposition of the S. cerevisiae subnetwork that contains 21 genes is shown in Fig. 6.

Fig. 6

Decomposition of a 21‐gene network in S. cerevisiae. The nodes in the first and second priority levels are coloured yellow and red. Nodes of the first level L1 have only out‐degrees and regulate downstream genes. Those nodes in the second level L2 are directly controlled by those in L1. Vertices in the third and fourth levels, which are coloured blue and purple, are considered as peripheral nodes

After 1000 generations, the nodes are divided into four priority levels, with seven genes selected in L1: SKN7, ACE2, ABF1, MAC1, PHD1, HAL9 and RAP1. These seven genes are located upstream in the information flow and control the expression of downstream nodes. Gene RAP1 has five out‐degrees and one autoregulation. In this subnetwork, the gene SKN7 controls three downstream genes; it also plays a vital role in controlling information flow between gene modules (in specific synthetic gene networks). Gene YAP6 is regulated by six genes and is assigned to the second priority level L2. Furthermore, CIN5, ROX1 and YAP6 form a feed‐forward motif. As the seven genes in L1 have no in‐degrees, the parameter estimation algorithm mainly considers their production rates. In the following stage, the computed parameters serve as known information to infer the parameters of nodes in the next priority level.

Afterwards, parameter estimation is performed in four priority levels based on the decomposition outcomes. In total there are 88 unknown parameters, of which 67 are estimated from measured expression data. The estimated parameters comprise 31 regulatory weights ωij that correspond to 31 edges, 21 production rates Pi that correspond to the 21 genes and 15 thresholds bi. Using a one‐time‐all estimation strategy, the MSE index of parameter estimation converges to 85 after 600 iterations using the built‐in fmincon function in MATLAB. The computations were run on an Intel Core i5‐4590 CPU at 3.3 GHz. The trajectory of the error index is shown in Fig. 7.

Fig. 7. Trajectory of MSE index in single estimation strategy. In a single estimation methodology, the MSE index stops decreasing after reaching a specific level

The MSE index of parameter estimation for the nodes in L1, which involve nine unknown parameters, is 4.59 after 64 iterations. These nine inferred parameters are then used as known information, thus reducing the modelling uncertainty in the second level L2. In this 21‐gene network, the second stage of estimation considers nodes that are directly controlled by L1. The gene ROX1 is assigned to L2. Both CIN5 and YAP6 are also assigned to L2 since there is mutual regulation between YAP6 and ROX1. Computation of parameters in L2 involves 28 weights ω, 12 production rates Pi and 12 thresholds bi. The MSE index of L2 is 30.15, and 815 iterations are needed to reach the convergence condition. The third and fourth levels each contain only one node. For the four levels of parameter estimation, the MSE indexes and running times are recorded in Table 1.

Table 1. Comparison of single and hierarchical estimations of size‐21 network

| Level | Nodes | Parameters | Iterations | Time, s | MSE index |
|---|---|---|---|---|---|
| Level 1 | 7 | 9 | 64 | 3.17 | 4.59 |
| Level 2 | 12 | 52 | 815 | 176 | 30.15 |
| Level 3 | 1 | 3 | 47 | 2.47 | 0.00845 |
| Level 4 | 1 | 3 | 40 | 2.46 | 0.198 |
| Single est | 21 | 67 | 200 | 716 | 85 |

The row ‘Single est’ in Table 1 shows that the MSE index using the single estimation strategy is 85. The first priority level contains about 33% of all nodes. Most of the computational time is spent in the second stage since this stage involves 57% of the nodes and 77% of the unknown parameters. The sum of the error indexes in hierarchical estimation is 34.95, which is much lower than 85 in the single estimation strategy. As the number of nodes in each level is different, the normalised MSE index of each priority level is calculated by dividing the MSE index by the number of nodes in that level. The normalised MSE indexes at the four priority levels are shown in Fig. 8.
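The shares, sums and normalised indexes quoted above follow directly from Table 1. The short Python sketch below reproduces that arithmetic; the numbers are copied from Table 1, while the dictionary layout and variable names are only illustrative.

```python
# Values copied from Table 1 (four hierarchical stages and the single-estimation row)
levels = {
    "level 1": {"nodes": 7, "params": 9, "time_s": 3.17, "mse": 4.59},
    "level 2": {"nodes": 12, "params": 52, "time_s": 176.0, "mse": 30.15},
    "level 3": {"nodes": 1, "params": 3, "time_s": 2.47, "mse": 0.00845},
    "level 4": {"nodes": 1, "params": 3, "time_s": 2.46, "mse": 0.198},
}
single = {"nodes": 21, "params": 67, "time_s": 716.0, "mse": 85.0}

# Totals over the four hierarchical stages
total_nodes = sum(row["nodes"] for row in levels.values())
total_params = sum(row["params"] for row in levels.values())
total_mse = sum(row["mse"] for row in levels.values())
total_time = sum(row["time_s"] for row in levels.values())

for name, row in levels.items():
    node_share = 100 * row["nodes"] / total_nodes
    param_share = 100 * row["params"] / total_params
    normalised_mse = row["mse"] / row["nodes"]  # MSE index divided by node count
    print(f"{name}: {node_share:.1f}% of nodes, {param_share:.1f}% of parameters, "
          f"normalised MSE {normalised_mse:.4f}")

print(f"summed MSE {total_mse:.2f} vs single {single['mse']:.0f}")
print(f"summed time {total_time:.2f} s vs single {single['time_s']:.0f} s")
```

With the Table 1 values, the summed MSE rounds to 34.95 and the four stages together take about 184 s, compared with 716 s for the single estimation.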

Fig. 8. Trajectories of MSE indexes in the hierarchical estimation of the 21‐gene network. The task of parameter estimation is decomposed into four stages. The parameters inferred in one stage are used in the next stage of estimation. As the numbers of nodes considered in each stage are quite different, normalised MSE indexes are calculated by dividing the error indexes by the number of genes in each priority level

Among the inferred parameters, regulation strengths are directly related to the regulation mechanism underlying the expression profiles. Positive and negative edge weights correspond to activation and repression, respectively. The estimated values of the regulation strengths ωij under the one‐time‐all strategy and hierarchical estimation are shown in Fig. 9.

Fig. 9. Comparison of regulation strengths in the 21‐gene network

5.2 Hierarchical estimations of GRN in DREAM challenge

Accurate and reliable estimation of model parameters for biological systems is still a challenging task when gene networks involve hundreds of parameters. The experiment uses insilico networks from the DREAM4 challenge as benchmarks to validate the performance of the proposed hierarchical estimation algorithm [41]. The goal of hierarchical estimation is to compute the parameters of gene regulatory networks from simulated expression data. Insilico network 1 in the DREAM4 challenge has 100 nodes and 176 edges; the hierarchical estimation algorithm decomposes it into four priority levels, which contain 16, 68, 15 and 1 nodes, respectively. The adjacency lists are read and converted to binary code in the elitism‐based GA. The optimisation algorithm determines the number of levels and the gene nodes in each level. The gold standard for insilico network 1 is visualised in Fig. 10.
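To make the notion of priority levels concrete, the following Python sketch layers a digraph given as adjacency lists: nodes without regulators (self‐regulation ignored) form level 1, and every other node is placed one level below the level from which it is first reached. This breadth‐first layering is a simplified stand‐in chosen for illustration, not the GA‐based decomposition used in this paper; in particular, it does not keep mutually regulating genes (such as YAP6 and ROX1 in Section 5.1) in the same level. The function name and the gene labels of the toy network are hypothetical.

```python
from collections import deque

def priority_levels(adj_list):
    """Layer a digraph: level 1 holds nodes without incoming edges
    (self-loops ignored); level k+1 holds nodes first reached from level k.
    Nodes that cannot be reached from any level-1 node stay unassigned."""
    nodes = set(adj_list) | {v for targets in adj_list.values() for v in targets}
    has_regulator = {v for u, targets in adj_list.items() for v in targets if v != u}
    level = {u: 1 for u in sorted(nodes - has_regulator)}
    queue = deque(level)
    while queue:
        u = queue.popleft()
        for v in adj_list.get(u, []):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Toy network (hypothetical): regulator -> list of regulated genes
toy = {
    "G1": ["G3"],
    "G2": ["G2", "G3"],   # G2 is autoregulated but has no other regulator
    "G3": ["G4"],
    "G4": ["G3", "G5"],   # G3 and G4 regulate each other
}
levels = priority_levels(toy)
print(levels)
```

In the toy network G1 and G2 end up in the first level, G3 in the second, and G4 and G5 in the third and fourth.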

Fig. 10. Network topology of insilico network 1 of size‐100 networks from DREAM4 challenge

Current estimation strategies, which usually treat nodes and related edges in a given network with the same importance level, need considerable computational resources. Such requirements are hard to meet in realistic cases, especially when dealing with complex regulatory systems. In this case, hierarchical parameter estimation provides a promising solution for computational modelling of GRNs. Considering the topological characteristics of gene networks such as hub structures, a subset of gene nodes has a higher influence on the whole system. Meanwhile, the importance levels of genes or TFs in a given network differ depending on their locations in the information flow.

The first step is to decompose a given network with a clear structure into several priority levels using the graph‐based decomposition algorithm. For insilico network 1, the second priority level contains 68 nodes. To simplify the parameter estimation problem, only regulation strengths are considered and the ODE models that describe the expression behaviours are simplified to dX/dt = AX. Second, ridge regression is applied to calculate the unknown weights of the edges that connect the corresponding nodes. To show more details about parameter estimation, both MSE and median absolute error (MAE) indexes are computed. In this linear modelling framework, the number of unknown model parameters equals the number of edges in the network, since each parameter is the weight of one edge. Computational times of hierarchical parameter estimation for the five insilico networks are 6.22, 5.21, 4.97, 4.93 and 5.14 s. These times are much shorter than for the size‐21 yeast network, partly owing to the linear model assumptions and the efficiency of ridge regression. The computational time of modelling the five insilico networks using single estimation is approximately the same as that of the hierarchical estimation methodology. Compared to stochastic modelling approaches, one advantage of deterministic modelling using linear regression is the stable error indexes and computational cost. The outcomes of hierarchical parameter estimation using ridge regression are shown in Table 2.
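The ridge regression step for a single target gene can be sketched as follows. The example assumes that the left‐hand side of the linear model for one gene is available as a vector y and that the expression levels of its known regulators form the columns of a matrix X; how these quantities are built from the DREAM4 data is not specified here, so the synthetic data, the regularisation strength lam and the function names are hypothetical. The sketch uses the closed‐form ridge solution and reports the MSE together with the median absolute error.

```python
import numpy as np

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge solution w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

def error_indexes(y_true, y_pred):
    """Return the mean squared error and the median absolute error."""
    residual = y_true - y_pred
    mse = float(np.mean(residual ** 2))
    medae = float(np.median(np.abs(residual)))
    return mse, medae

# Synthetic example (hypothetical data): one target gene with three regulators
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # regulator expression, one column per edge
true_w = np.array([0.8, -0.5, 0.3])        # positive = activation, negative = repression
y = X @ true_w + 0.01 * rng.normal(size=200)

w_hat = ridge_weights(X, y, lam=1e-3)
mse, medae = error_indexes(y, X @ w_hat)
print(w_hat, mse, medae)
```

Because each target gene only involves the columns of its own regulators, the regression can be solved node by node, which is one way to read the level‐by‐level estimation described above.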

Table 2. Hierarchical parameter estimation of DREAM4 networks

| Networks | Edges | Levels | Nodes in Li | MSE | MAE |
|---|---|---|---|---|---|
| insilico 1 | 176 | 4 | L1: 16 | 1.766 | 3.927 |
| | | | L2: 68 | 6.365 | 14.605 |
| | | | L3: 15 | 2.077 | 3.864 |
| | | | L4: 1 | 0.153 | 0.345 |
| insilico 2 | 249 | 3 | L1: 14 | 1.461 | 3.221 |
| | | | L2: 76 | 9.688 | 21.679 |
| | | | L3: 10 | 0.837 | 2.018 |
| insilico 3 | 195 | 4 | L1: 9 | 1.425 | 2.667 |
| | | | L2: 44 | 7.569 | 14.148 |
| | | | L3: 45 | 5.997 | 11.568 |
| | | | L4: 2 | 0.297 | 0.668 |
| insilico 4 | 211 | 3 | L1: 9 | 1.2082 | 2.392 |
| | | | L2: 45 | 6.338 | 12.437 |
| | | | L3: 46 | 6.253 | 13.352 |
| insilico 5 | 193 | 4 | L1: 7 | 1.052 | 2.013 |
| | | | L2: 36 | 5.538 | 10.933 |
| | | | L3: 56 | 8.621 | 16.056 |
| | | | L4: 1 | 0.152 | 0.285 |

For insilico network 1, hierarchical parameter estimation begins with the top priority level of 16 nodes and computes the relevant regulatory strengths with an error index of 1.676. The MSE indexes for the remaining three levels are 6.365, 2.077 and 0.153, in which 0.153 denotes the estimation error of G89 in insilico network 1. The total MSE of insilico network 1 sums to 10.271, which is slightly lower than 11.6395 in the single estimation strategy. Because a linear ODE model and ridge regression have been applied to model the size‐100 networks, the error indexes are relatively low, leaving limited room for improvement. In general, the size‐100 networks are decomposed into three to four priority levels using the hierarchical estimation algorithm. For those networks that are decomposed into four levels, the number of nodes in the last level L4 is either 1 or 2. For instance, the nodes in L4 of the insilico 3 network correspond to genes G25 and G46, which are considered peripheral nodes. For insilico network 1 of the size‐100 networks, the comparison of regulatory strengths using the single and hierarchical estimation strategies is shown in Fig. 11.

In Fig. 11, the trajectories of the estimated regulatory strengths ωij follow a similar pattern, indicating that the hierarchical estimation methodology essentially finds the solution obtained by the traditional one‐time‐all strategy. One‐time‐all estimation methods pose a huge computational burden, partly because the cost function considers all model parameters at the same importance level. In fact, the information flow in gene networks means that certain modules of genes are located upstream while other groups of genes lie downstream. Especially in engineered gene circuits or networks, the information flow and the regulatory relationships are vital to the cellular functions of the synthetic system. This indicates that the influences of nodes and relevant parameters on the whole system are different and should be considered according to their priority levels. To compare the accuracy of parameter estimation under the single and hierarchical estimation strategies, the MSE and MAE indexes are compared in Table 3.

Fig. 11. Comparison of regulation strengths in insilico network 1 in DREAM4 challenge

Table 3. Comparison of error indexes using single and hierarchical estimations

| Networks | Single MSE | Hierarchical MSE | Single MAE | Hierarchical MAE |
|---|---|---|---|---|
| insilico 1 | 11.639 | 10.0361 | 24.138 | 22.741 |
| insilico 2 | 14.288 | 11.986 | 29.556 | 26.918 |
| insilico 3 | 17.708 | 15.288 | 32.172 | 29.051 |
| insilico 4 | 16.994 | 13.790 | 31.583 | 28.281 |
| insilico 5 | 18.208 | 15.363 | 32.801 | 29.287 |

According to Table 3, both the MSE and MAE indexes for the insilico networks calculated under the hierarchical estimation strategy are lower than those calculated by the single estimation strategy. This indicates that the hierarchical estimation strategy is able to obtain parameters with relatively higher accuracy compared with the traditional single estimation methodology. For biological networks that involve hundreds of parameters, linear regression methods are suitable tools to determine the model parameters and obtain lower error indexes compared with non‐linear modelling approaches. The value of the hierarchical parameter estimation methodology lies not only in reducing error indexes and computational cost, but also in providing a feasible way to deal with regulatory systems with a small ratio of hub gene nodes.
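The size of the improvement reported in Table 3 can be expressed as a relative reduction of each error index. The Python sketch below computes it from the tabulated values; the numbers are copied from Table 3 as printed, and the helper name relative_reduction is only illustrative.

```python
# (single MSE, hierarchical MSE, single MAE, hierarchical MAE) as listed in Table 3
table3 = {
    "insilico 1": (11.639, 10.0361, 24.138, 22.741),
    "insilico 2": (14.288, 11.986, 29.556, 26.918),
    "insilico 3": (17.708, 15.288, 32.172, 29.051),
    "insilico 4": (16.994, 13.790, 31.583, 28.281),
    "insilico 5": (18.208, 15.363, 32.801, 29.287),
}

def relative_reduction(single, hierarchical):
    """Percentage by which the hierarchical index is lower than the single one."""
    return 100.0 * (single - hierarchical) / single

mse_reduction = {name: relative_reduction(r[0], r[1]) for name, r in table3.items()}
mae_reduction = {name: relative_reduction(r[2], r[3]) for name, r in table3.items()}

for name in table3:
    print(f"{name}: MSE reduced by {mse_reduction[name]:.1f}%, "
          f"MAE reduced by {mae_reduction[name]:.1f}%")
```

Every entry is positive for the tabulated values, which matches the statement that both indexes are lower under hierarchical estimation for all five networks.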

6 Conclusion

This paper proposes a hierarchical estimation strategy in which nodes in a given network are divided into various levels based on topology analysis. For a given gene network, the hierarchical estimation strategy first decomposes the network based on its topology, usually into three to four priority levels. This decomposition of the digraph of a gene network is completed by searching for the subset of root SCCs using optimisation. The nodes in the subset of root SCCs are selected as the top priority level. To perform this decomposition task, a fitness function based on the graph‐based measure is defined and minimised by a GA method. Experiments on insilico expression data from the DREAM4 challenge and on realistic microarray data indicate that the proposed hierarchical strategy is able to reduce the error index with lower computational cost.

7 Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61573311).

8 References

  • 1. Marbach D., Costello J.C., and Küffner B. et al.: ‘Wisdom of crowds for robust gene network inference’, Nat. Methods, 2012, 9, (8), pp. 796–806
  • 2. Bansal M., Gatta G.D., and Bernardo D.D.: ‘Inference of gene regulatory networks and compound mode of action from time course gene expression profiles’, Bioinformatics, 2006, 22, (7), pp. 815–822
  • 3. Xing H.M., and Gardner T.S.: ‘The mode‐of‐action by network identification algorithm: a network biology approach for molecular target identification’, Nat. Protocols, 2006, 1, (6), pp. 2551–2554
  • 4. Hase T., Ghosh S., and Kitano H. et al.: ‘Harnessing diversity towards the reconstructing of large scale gene regulatory networks’, PLOS Comput. Biol., 2013, 9, (11), p. e1003361
  • 5. Chang Y.H., Gray J.W., and Tomlin C.J. et al.: ‘Exact reconstruction of gene regulatory networks using compressive sensing’, BMC Bioinf., 2014, 15, (1), pp. 400–421
  • 6. Ghanbari M., Lasserre J., and Vingron M.: ‘Reconstruction of gene networks using prior knowledge’, BMC Syst. Biol., 2015, 9, (1), pp. 84–94
  • 7. Meyer P., Cokelaer T., and Chandran D. et al.: ‘Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach’, BMC Syst. Biol., 2014, 8, (1), pp. 13–31
  • 8. Xiong J., and Zhou T.: ‘Structure identification for gene regulatory networks via linearization and robust state estimation’, Automatica, 2014, 50, (11), pp. 2765–2776
  • 9. Zhang X.J., Zhao X.M., and He K. et al.: ‘Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information’, Bioinformatics, 2012, 28, (1), pp. 98–104
  • 10. Zhang X.J., Liu K.Q., and Liu Z.P. et al.: ‘NARROMI: a noise and redundancy reduction technique improves accuracy of gene regulatory network inference’, Bioinformatics, 2013, 29, (1), pp. 106–113
  • 11. Margolin A.A., Nemenman I., and Basso K. et al.: ‘ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context’, BMC Bioinf., 2004, 7, (1), pp. 1–15
  • 12. Lachmann A., Giorgi F.M., and Lopez G. et al.: ‘ARACNe‐AP: gene network reverse engineering through adaptive partitioning inference of mutual information’, Bioinformatics, 2016, 32, (14), pp. 2233–2235
  • 13. Cao J.G., and Zhao H.Y.: ‘Estimating dynamic models for gene regulation networks’, Bioinformatics, 2008, 24, (14), pp. 1619–1624
  • 14. Fan M., Kuwahara H., and Wang X. et al.: ‘Parameter estimation methods for gene circuit modeling from time‐series mRNA data: a comparative study’, Briefings Bioinf., 2015, 16, (6), pp. 987–999
  • 15. Biswas S., and Acharyya S.: ‘Parameter estimation of gene regulatory network using honey bee mating optimization’, Int. Conf. Emerg. Appl. Inf. Technol., 2015, 5, (3), pp. 1–10
  • 16. Dahlquist K.D., Fitzpatrick B.G., and Camacho E.T. et al.: ‘Parameter estimation for gene regulatory networks from microarray data: cold shock response in Saccharomyces cerevisiae’, Bull. Math. Biol., 2015, 77, (8), pp. 1457–1492
  • 17. Kuwahara H., Fan M., and Wang S.J. et al.: ‘A framework for scalable parameter estimation of gene circuit models using structural information’, Bioinformatics, 2013, 29, (13), pp. 98–107
  • 18. Faith J.J., Hayete B., and Thaden J.T. et al.: ‘Large‐scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles’, PLOS Biol., 2007, 5, (1), p. e8
  • 19. Vignes M., Vandel J., and Allouche D. et al.: ‘Gene regulatory network reconstruction using Bayesian networks, the Dantzig selector, the Lasso and their meta‐analysis’, PLOS One, 2011, 6, (12), p. e29165
  • 20. Liu F., Zhang S.W., and Guo W.F. et al.: ‘Inference of gene regulatory network based on local Bayesian networks’, PLOS Comput. Biol., 2016, 12, (8), p. e1005024
  • 21. Diop S., and Fliess M.: ‘Nonlinear observability, identifiability, and persistent trajectories’. IEEE Conf. Proc. Decision and Control, Brighton, UK, 1991, (1), pp. 714–719
  • 22. Gui S., Rice A.P., and Chen R. et al.: ‘A scalable algorithm for structure identification of complex gene regulatory network from temporal expression data’, BMC Bioinf., 2017, 18, (1), pp. 74–86
  • 23. Mason O., and Verwoerd M.: ‘Graph theory and networks in biology’, IET Syst. Biol., 2007, 1, (2), pp. 89–119
  • 24. Chen G., Larsen P., and Almasri E. et al.: ‘Rank‐based edge reconstruction for scale‐free genetic regulatory networks’, BMC Bioinf., 2008, 9, (1), p. 75
  • 25. Roy S.: ‘Systems biology beyond degree, hubs and scale‐free networks: the case for multiple metrics in complex networks’, Syst. Synth. Biol., 2012, 6, (1–2), p. 31
  • 26. Yang B., Xu J., and Liu B. et al.: ‘Inferring gene regulatory networks with a scale‐free property based informative prior’. IEEE Int. Conf. Biomedical Engineering and Informatics, Shenyang, China, 2015, pp. 542–547
  • 27. Li L., Alderson D., and Doyle J.C. et al.: ‘Towards a theory of scale‐free graphs: definition, properties, and implications’, Internet Math., 2005, 2, (4), pp. 431–523
  • 28. Barrat A., Barthélemy M., and Pastor‐Satorras R. et al.: ‘The architecture of complex weighted networks’, Proc. Natl. Acad. Sci. USA, 2004, 101, (11), pp. 3747–3752
  • 29. Piraveenan M., Prokopenko M., and Hossain L. et al.: ‘Percolation centrality: quantifying graph‐theoretic impact of nodes during percolation in networks’, PLOS One, 2013, 8, (1), p. e53095
  • 30. McLendon W., Hendrickson B., and Plimpton S.J. et al.: ‘Finding strongly connected components in distributed graphs’, J. Parallel Distrib. Comput., 2005, 65, (8), pp. 901–910
  • 31. Liu Y.Y., Slotine J.J., and Barabási A.L.: ‘Observability of complex systems’, Proc. Natl. Acad. Sci., 2013, 110, (7), pp. 2460–2465
  • 32. Travençolo B.A.N., and Costa L.F.: ‘Accessibility in complex networks’, Phys. Lett. A, 2008, 373, (1), pp. 89–95
  • 33. Liu Y.Y., Slotine J.J., and Barabási A.L.: ‘Controllability of complex networks’, Nature, 2011, 473, (7346), pp. 167–173
  • 34. Lin C.T.: ‘Structural controllability’, IEEE Trans. Autom. Control, 1974, 19, (3), pp. 201–208
  • 35. Dodds P.S., Muhamad R., and Watts D.J.: ‘An experimental study of search in global social networks’, Science, 2003, 301, (5634), pp. 827–829
  • 36. Vu T.T., and Vohradsky J.: ‘Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae’, Nucleic Acids Res., 2007, 35, (1), pp. 279–287
  • 37. Ghasemi O., Lindsey M.L., and Yang T. et al.: ‘Bayesian parameter estimation for nonlinear modelling of biological pathways’, BMC Syst. Biol., 2011, 5, (3), pp. 1–10
  • 38. Eydgahi H., William W.C., and Muhlich J. et al.: ‘Properties of cell death models calibrated and compared using Bayesian approaches’, Mol. Syst. Biol., 2013, 9, (1), pp. 644–660
  • 39. Reginska T.: ‘A regularization parameter in discrete ill‐posed problems’, SIAM J. Sci. Comput., 1996, 17, (3), pp. 740–749
  • 40. Arthur E.H., and Robert W.K.: ‘Ridge regression: biased estimation for nonorthogonal problems’, Technometrics, 2000, 42, (1), pp. 80–86
  • 41. Greenfield A., Madar A., and Ostrer H. et al.: ‘DREAM4: combining genetic and dynamic information to identify biological networks and dynamical models’, PLOS One, 2010, 5, (10), p. e13397
  • 42. Schade B., Jansen G., and Whiteway M. et al.: ‘Cold adaptation in budding yeast’, Mol. Biol. Cell, 2004, 15, (12), pp. 5492–5502

Articles from IET Systems Biology are provided here courtesy of Wiley
