Abstract
Reverse engineering of gene regulatory networks (GRNs) is an important and challenging task in systems biology. Existing parameter estimation approaches that treat all model parameters as equally important are usually computationally expensive or infeasible, especially for complex biological networks. To improve the efficiency of computational modelling, this paper applies a hierarchical estimation methodology, based on topological analysis, to the computational modelling of GRNs. Nodes in a network are divided into priority levels using a graph‐based measure and a genetic algorithm. The nodes in the first level, which correspond to the root strongly connected components (SCCs) in the digraph of the GRN, are given top priority in parameter estimation. The estimated parameters of vertices in the previous priority level are used to infer the parameters of nodes in the next priority level. The proposed hierarchical estimation methodology obtains lower error indexes while consuming fewer computational resources than the single estimation methodology. Experimental outcomes with in silico networks and a realistic network show that gene networks are decomposed into no more than four levels, which is consistent with the inherent modularity of GRNs. In addition, the proposed hierarchical parameter estimation achieves a balance between computational efficiency and accuracy.
Inspec keywords: biology computing, network theory (graphs), reverse engineering, graph theory, genetics, genetic algorithms, directed graphs, parameter estimation
Other keywords: hierarchical parameter estimation, GRN, topological analysis, gene regulatory network, important task, computational systems biology, compute model parameters, complex biological networks, efficient information, model quality, parameter reliability, computational modelling, study divides nodes, priority levels, graph‐based measure, previous priority level, hierarchical estimation methodology obtains, computational resources, single time estimation, insilico network, realistic network show, computational efficiency
1 Introduction
Mathematical modelling of biological networks provides a way to understand the regulatory mechanisms of the whole system by using the measured expression data as the information source. As typical biological networks, gene regulatory networks (GRNs) are represented by the weighted graphs. The purpose of gene network inference is to reveal the quantitative regulatory relationship between genes and transcription factors (TFs), thus providing theoretical support for genetic circuit engineering. TFs are those genes that are usually located in the upstream of information flow within a network and control other nodes. The applications of computational modelling for GRN cover various areas such as epidemic prevention, energy production and synthetic biology. In synthetic biology, accurate and reliable models provide guides to construct synthetic genetic networks with certain cellular functions.
With the rapid development of systems biology and high‐throughput technologies, network inference has become a feasible task with efficient inference algorithms. The problem of gene network inference has received increasing attention in recent years [1]. Early research discussed network identification using time‐series expression measurements [2, 3]. Various computational approaches have been applied to infer regulatory relations among the network components [4, 5]. Incorporating prior knowledge helps to improve the reliability and accuracy of computational modelling of gene regulatory networks [6]. Since the regulations between regulators and target genes are directed, directed graphs make it convenient to analyse the gene regulation mechanism. In the digraph, both the network structure and the regulatory strengths are partially or completely unknown and thus need to be inferred from microarray or single‐cell RNA sequencing data. Sometimes, network structure and model parameters are considered simultaneously during network inference [7].
Existing network inference algorithms usually break the modelling task into structure identification and parameter estimation [8]. Structure identification algorithms determine the network structure, i.e. the directed regulations between nodes, from transcription data. Using high‐throughput expression data as the information source, correlation‐based and mutual‐information‐based approaches are capable of determining an appropriate topology by calculating statistical dependencies between network components [9, 10]. Mutual information is used to quantitatively reflect the similarity between the expression levels of a pair of genes. Inference algorithms such as ARACNE and its derivatives are classical reconstruction methods that calculate the statistical dependencies between genes [11, 12]. It should be noted that mutual‐information‐based approaches only obtain undirected regulations between genes.
In general, current gene network inference algorithms mainly focus on inferring the network topology rather than unknown model parameters such as regulatory strengths. These parameters are crucial for understanding the regulation mechanism and for evaluating the importance of individual gene regulations. Parameter estimation for gene networks has been studied extensively [13, 14]. With a well‐defined cost function, the problem of parameter estimation can be converted to an optimisation problem that needs considerable resources [15, 16]. Incorporating prior knowledge, especially structural information, is a promising way to improve the performance of parameter estimation [17]. Even with sufficient transcription profiles, parameter estimation algorithms are generally time‐consuming and require considerable computational resources.
The challenges in parameter estimation of GRNs partly lie in the computational burden during optimisation. The huge parameter space caused by the number of nodes and regulatory edges within a network makes it difficult for optimisation approaches to find the globally optimal parameters. Another important issue is the lack of useful information. The number of biological experiments that perturb the expression of a GRN is quite limited compared with the number of genes and relevant edges. Useful information underlying the measured expression levels of genes should be extracted to infer unknown model parameters. Although this problem may be alleviated by adding more perturbations, this solution is expensive or even infeasible due to the cost. Current parameter estimation of GRNs using time‐series expression data therefore faces several drawbacks [18]. The first is the huge feasible parameter space, which grows with the number of genes in the network. Meanwhile, correlations between model parameters and states may lead to structural and parametric unidentifiability. Lack of sufficient information for parameter estimation is a further factor that affects the quality of mathematical models and inferred parameters.
In fact, the dynamics of the regulatory system are determined by a subset of nodes that corresponds to specific groups of key genes. Using weighted graphs to describe a GRN, genetic regulations can be quantitatively evaluated by the weights of edges. Compared with peripheral gene nodes, key genes exert stronger regulation, such as repression or activation, on the downstream genes, which have relatively low degrees. On the basis of this hypothesis, this paper computes the parameters of nodes in the next stage using the estimated parameters of the previous stage as known information. The potential advantage of this strategy is improved computational efficiency due to the decomposed network structure and the reduced number of key nodes in a network.
Such priority levels are determined by topological analysis and decomposition of the network structure using an elitism‐based genetic algorithm (GA). The first level of nodes selected by the optimisation approach is regarded as the root strongly connected components (SCCs), which are located upstream in the information flow of a given gene network. The second step of the topological analysis is the SCC decomposition of the digraph of the GRN. The purpose of hierarchical estimation is first to obtain kinetic parameters for genes in the root SCCs, which are regarded as the largest subgraphs in SCC. The inferred parameters of gene nodes in the root SCCs can then be used as known information to calculate the parameters of the second priority level of nodes. The proposed hierarchical estimation strategy, which deals with genes at different priority levels, has the advantage of improving computational efficiency.
2 Deterministic modelling of GRNs
In network inference, selection of mathematical models is a crucial step. Bayesian networks are probabilistic modelling approaches that compute the posterior distribution of model parameters [19, 20]. Deterministic modelling using ordinary differential equations (ODEs) is another candidate solution to describe the expression behaviours of gene networks. Compared to probabilistic modelling, deterministic modelling based on ODE‐based models needs limited computational resources and still captures system dynamics with appropriate parameters. With an identified network structure, expression behaviours of nodes in a gene network are described by a set of differential equations, where unknown parameters including regulation strengths, transcription and degradation rates need to be computed using expression data.
2.1 Model formulation
The linear differential models for networks are described by the equation below:
| $\dot{x}(t) = A x(t) + B u(t)$ | (1) |
where the vector $x(t) = [x_1(t), \ldots, x_N(t)]^T$ is the state vector of a network consisting of N nodes at the time point t, $x_i(t)$ represents the transcriptional expression level of the gene i and the input vector $u(t)$ reflects the environmental perturbations. The coefficients in the state matrix $A$ reflect the strengths of gene regulation. Influence on inner states is reflected by the input matrix $B$. For a linear time‐invariant dynamic system, $A$ and $B$ are constant matrices. According to the observability analysis theory, the observability matrix should satisfy the full rank requirement to guarantee the observability of a network [21]. Furthermore, if a network is described by non‐linear differential equations, denoted by the equation below:
| $\dot{x}(t) = f(x(t), u(t)), \quad y(t) = h(x(t))$ | (2) |
Then, the Jacobian matrix should satisfy the full rank condition, i.e. $\operatorname{rank}(J) = N$, where $J$ is computed using the Lie derivatives of the output function $h$ along $f$. During structure identification of GRNs, the gene nodes are generally divided into hub and non‐hub genes [22]. Gene nodes with high degrees are considered as hub genes and form a skeleton for a network. The remaining genes are non‐hub genes that are generally in peripheral positions. It should be noted that in‐degree is different from out‐degree from the viewpoint of information flow. If a gene has a high out‐degree, it is more likely to provide control and thus plays the role of a hub gene. The adjacency matrix of the network is decomposed into two parts, as shown in the equation below:
| (3) |
where and represent the subsets of hub and non‐hub gene nodes. Those hub nodes in subset also play the role of bridge transferring control information from the upstream to downstream. Given time‐series measurements , the cost function is defined by the equation below:
| (4) |
where is the output vector of the target gene and X represents the expression of regulator genes. To model the network, output variables can be constructed. Such outputs correspond to the expression of downstream genes that are regulated by other regulators in the upstream of information flow. For a network, the network structure as well as the directed regulation between candidate regulators and target genes may be unclear. Feature selection methods are able to calculate scores for possible regulation between candidate regulators and target genes. By minimising the cost function , the best possible solution is computed to determine the network structure.
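As a minimal numerical sketch of the full‐rank observability condition mentioned above, the following Python snippet builds the observability matrix for a linear model of the form $\dot{x} = Ax + Bu$. The output matrix `C` (which selects the measured genes), the function names and the two‐gene example are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def observability_matrix(A, C):
    """Stack C, CA, ..., CA^(N-1) row-wise for an N-state linear system."""
    A = np.asarray(A, dtype=float)
    C = np.atleast_2d(np.asarray(C, dtype=float))
    blocks = [C]
    for _ in range(A.shape[0] - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)


def is_observable(A, C):
    """Return True when the observability matrix has full rank N."""
    n_states = np.asarray(A).shape[0]
    return bool(np.linalg.matrix_rank(observability_matrix(A, C)) == n_states)


# Hypothetical two-gene chain: gene 1 regulates gene 2, both degrade.
A = [[-1.0, 0.0],
     [1.0, -1.0]]

measure_gene2 = is_observable(A, [0.0, 1.0])  # only the downstream gene is measured
measure_gene1 = is_observable(A, [1.0, 0.0])  # only the upstream gene is measured
print(measure_gene2, measure_gene1)
```

In this toy case, measuring the downstream gene alone gives a full‐rank observability matrix, whereas measuring only the upstream gene does not, which mirrors the idea that information flows from regulators to targets.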
2.2 Topological characteristics of GRN
Considering the topological characteristics of GRNs, directed graphs are useful tools to describe the connections and information flow between components within a network [23]. On the basis of graph theory, a gene network is described by a directed graph $G = (V, E)$, where V and E denote the vertices and edges of the graph. In GRNs, the majority of the genes are sparsely connected, while densely connected genes also exist. Such densely connected genes play the role of key nodes that regulate the system dynamics and can be identified by analysing the degree distribution of a given network.
Scale‐free topology is an interesting characteristic of GRNs and can be used as prior information to infer GRNs [24–26]. The fraction of nodes in a scale‐free network having k connections to other nodes follows a power law, defined as $P(k) \sim k^{-\gamma}$, where $\gamma$ is the degree exponent. The degree of a node in a network corresponds to the number of directed relations with other nodes. The degree distribution, i.e. the probability distribution of node degrees, is closely related to the topological characteristics of a given network. For a graph G with edge set E, the degree of a vertex v is denoted by $\deg(v)$; the scale‐free metric is then defined by the equation below:
| $s(G) = \sum_{(u,v) \in E} \deg(u)\,\deg(v)$ | (5) |
This metric is maximised when high‐degree nodes are connected to other high‐degree nodes [27]. For a given graph $G = (V, E)$, a connected component of G is defined as a subgraph in which the vertices are reachable from each other. Nodes and directed edges are the two basic components of a gene network. Properties of the biological network can be analysed from the viewpoint of network centrality [28]. The purpose of analysing network centrality is to identify the most important vertices in a graph based on topological analysis. Degree and betweenness centrality are used to judge the importance of a node. The degree is the number of directed edges that are connected to a target node, while betweenness measures how central a node is in a given network [29]. In weighted networks, the strength of a node is computed using the weights of adjacent edges, as denoted by the equation below:
| $s_i = \sum_{j} a_{ij} w_{ij}$ | (6) |
where $a_{ij}$ and $w_{ij}$ are the entries of the adjacency and weighting matrices between nodes i and j. Before modular decomposition, the SCCs are introduced to describe topological relations between nodes [30]. The set of root SCCs is a subset of the SCCs [31]. A directed graph is strongly connected if there is a directed path from each node to every other node. A subgraph is regarded as strongly connected when each vertex is reachable from every other vertex. One SCC of a directed graph corresponds to a maximal strongly connected subgraph. In the selection of important nodes within a network, SCCs that have no incoming edges from outside are selected to form the subset of root SCC. Both SCC and root SCC are specific kinds of subsets in a graph. The subset of root SCC, which is the maximisation of , can be considered as a condensate node from the viewpoint of information flow. A brief comparison of SCC and root SCC is depicted in Fig. 1.
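The degree‐based scale‐free metric and the node strength discussed above can be illustrated with a short Python sketch. It assumes the commonly used form of the metric (the sum of degree products over all edges), counts both incoming and outgoing edges in the degree, and sums the weights of outgoing edges for the strength. The function names and the toy graphs are hypothetical.

```python
from collections import Counter


def degree(edges):
    """Total degree (in plus out) of every node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def s_metric(edges):
    """Sum of deg(u) * deg(v) over all edges (u, v)."""
    deg = degree(edges)
    return sum(deg[u] * deg[v] for u, v in edges)


def node_strength(weights, i):
    """Sum of w_ij over the existing edges (i, j), i.e. where a_ij = 1."""
    return sum(w for (src, _), w in weights.items() if src == i)


# Toy graphs: a hub with three targets, and a four-node chain.
star = [("h", "a"), ("h", "b"), ("h", "c")]
chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(s_metric(star), s_metric(chain))

# Toy weighted network given as {(regulator, target): weight}.
weights = {("h", "a"): 0.5, ("h", "b"): 1.5, ("a", "b"): 2.0}
print(node_strength(weights, "h"))
```

With signed regulatory weights, activation and repression would partly cancel in this sum; whether absolute values should be used instead is a modelling choice that the text does not specify.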
Fig. 1.

SCC and root SCC in gene networks. There exists mutual regulation between gene nodes and , which indicates close information sharing within the subset. The expressions of genes and are directly controlled by the elements in root SCC. This indicates that importance of nodes in a given GRN is closely related with their topological positions
In Fig. 1, there exist three subsets of SCCs including one root SCC and two common SCCs. In this paper, vertices in the root SCC are considered as important nodes that provide regulation on the behaviours of the whole system. In the proposed hierarchical decomposition methodology, the connections within a subset are given lower priority since the inner connections generally reflect information flows within a module. Those nodes that locate at peripheral positions of a given network are ignored to some extent. Motivated by degree modularity of gene networks, the hierarchical estimation strategy first focuses on the key nodes and relevant edges.
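To make the distinction between SCCs and root SCCs concrete, the sketch below finds both for a small directed graph stored as an adjacency dictionary. It uses a simple reachability test rather than an optimised SCC algorithm, and the five‐gene example (a three‐gene loop regulating a mutually regulating pair) is a hypothetical network, not the one shown in Fig. 1.

```python
def reachable(adj, start):
    """Return the set of nodes reachable from `start` (start included)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected_components(adj):
    """Group nodes that can reach each other in both directions."""
    nodes = sorted(set(adj) | {v for targets in adj.values() for v in targets})
    reach = {n: reachable(adj, n) for n in nodes}
    components, assigned = [], set()
    for n in nodes:
        if n not in assigned:
            comp = {m for m in reach[n] if n in reach[m]}
            components.append(comp)
            assigned |= comp
    return components


def root_sccs(adj):
    """Keep the SCCs that receive no edge from a node outside themselves."""
    roots = []
    for comp in strongly_connected_components(adj):
        has_external_input = any(
            v in comp and u not in comp
            for u, targets in adj.items() for v in targets
        )
        if not has_external_input:
            roots.append(comp)
    return roots


# Hypothetical network: loop a -> b -> c -> a regulates the pair d <-> e.
adj = {"a": ["b", "d"], "b": ["c"], "c": ["a", "e"], "d": ["e"], "e": ["d"]}
print(strongly_connected_components(adj))
print(root_sccs(adj))
```

Here both `{a, b, c}` and `{d, e}` are SCCs, but only the first is a root SCC because the second receives regulation from outside.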
3 Hierarchical decomposition methodology
The topological characteristics of gene network such as sparsity and hub network structure indicate that unknown parameters have different levels of importance in influencing system behaviours. However, current modelling algorithms estimate unknown coefficients equally using transcriptional data, thus leading to computational difficulties. A natural way to solve this problem is to perform the parameter estimation at different priority levels instead of inferring all parameters at one time. This strategy requires that, for a given network, its nodes should be divided according to topological characteristics and locations in information flow into several groups. Nodes that locate in upstream of information flow in a gene network provide genetic regulation to those located in downstream.
3.1 Selection of key genes
Graph representation of GRNs provides powerful tools for analysing the topology of a network. Traditional controllability analysis for networks meets certain difficulties. With the graph representation of biological networks, characteristics such as accessibility and controllability of systems can be analysed [32, 33]. Each node in an SCC is regarded as an information source that can be used to infer the dynamics of other nodes within the subset. In this context, structural control theory was proposed to analyse linear networks [34]. Vertices in the root SCC, denoted by blue nodes in Fig. 2, have only out‐degrees and thus transfer regulation to downstream nodes. Generally speaking, the condition for a root SCC is stricter than that for an SCC in a given network.
Fig. 2.

Two situations of root SCC within gene networks
(a) A single node controls three downstream genes, while a subset consisting of three closely related genes regulates the downstream genes
(b) Genes in the root SCC, coloured blue, play a similar role to the single regulator gene and are thus regarded as a condensate node
On the basis of the characteristics of GRN such as scale‐free topology and modularity, parameter estimation can be performed at different priority levels, determined by topological analysis. With a graph‐based measure, nodes in a network are divided into different levels according to their topological positions. Such a task can be accomplished by an elitism‐based GA.
3.2 Hierarchical decomposition algorithm
Gene networks in this paper are assumed to be connected, which indicates that every vertex is linked to the rest of the network and there are no isolated islands. The task of hierarchical decomposition is to select the subset of key nodes using a graph‐based measure. In this paper, the key nodes are defined as the root SCC. In graph theory, components in root SCC, which is a subset of SCC, are generally located upstream in the information flow of a given network. In the proposed methodology, the selected first‐level subset is considered as a merged node in the next level of estimation. Directed edges that are regulated by components in the first level can be regarded as edges linked with the new merged node.
Reduce the scope based on the connectivity: If there exists a directed edge from a regulator to a target gene, the parameters of the target gene can be inferred by measuring the expression of the regulator. That means the regulator should be assigned a higher priority level during modelling, while parameter estimation for the target gene can be postponed to the next stage. From the viewpoint of information flow, the regulation information flows from the regulator to the target gene.
Determine the key subset : This step randomly selects a subset , where V is the set of vertices. The subset is defined such as . For each gene node g in , those nodes in that are inferred by g. Then, the nodes in the subset are regarded as components in the first priority level.
Correlation analysis and deletion of redundant nodes: The elitism‐based GA optimises the number of nodes in the key subset using the fitness function defined in (7). After determining the key subset, there are redundant nodes that are not components of the root SCC. Then, each node in the key subset is traversed to judge the strongly coupled condition, shown in Fig. 2b. Those nodes that meet the strongly coupled condition are added to a new set T. During the construction of T, the set T will be deleted if it is not an SCC. Otherwise, T will be merged into the key nodes.
Extension to the next priority level : Vertices in the first level , selected by the standard of root SCC, are those SCCs that have only out‐degrees. On the basis of this principle, nodes in the second level are directly controlled by nodes in . Those nodes that have SCC relations with nodes in are absorbed to .
Furthermore, the optimisation is performed by a modified GA. The elitism‐based GA uses a sub‐function to read the adjacency list of a given network and uses individuals to represent the selected nodes by means of binary encoding. It first generates an initial population, in which individuals are binary coded using the elements 0 and 1. Every individual represents a candidate solution of the optimisation, i.e. a set of key nodes. The workflow of the hierarchical decomposition algorithm for gene networks is described in Fig. 3.
Fig. 3.

Flowchart of the GA‐based key nodes selection algorithm. To get a consistent hierarchical decomposition outcome for a given network, the part of reversing left nodes will be accomplished
The fitness function in selecting the first level of key nodes that is also regarded as the set of root SCC is defined by the equation below:
| (7) |
In (7), N is the number of nodes of a given graph. The subset represents the node set of root SCC and is another subset of V. The basic strategy is randomly selecting a subset and reversing nodes to delete nodes that belong to . This study applies the elitism‐based GA to maximise the fitness f in (7). GA algorithm loops over the population size and creates new individuals with the mutation and crossover operations.
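The following Python sketch illustrates how an elitism‐based GA with binary encoding might select key nodes. It is not the authors' implementation: the fitness used here is a hypothetical stand‐in for (7) that rewards covering many nodes with few selected ones, the crossover is assumed to be uniform, and the function names, population size and toy network are invented for illustration. The default crossover and mutation probabilities (0.5 and 0.015) are the values reported in the experiments.

```python
import random


def reachable_from(adj, sources):
    """Return all nodes reachable from `sources` (sources included)."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def make_fitness(adj, nodes):
    """Hypothetical stand-in for (7): cover many nodes with few key nodes."""
    def fitness(bits):
        chosen = [n for n, bit in zip(nodes, bits) if bit]
        return len(nodes) * len(reachable_from(adj, chosen)) - len(chosen)
    return fitness


def evolve(fitness, n_bits, pop_size=30, generations=50,
           p_cross=0.5, p_mut=0.015, tournament=3, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation

    def pick():
        # Tournament selection: best of a few randomly drawn individuals.
        return max(rng.sample(pop, tournament), key=fitness)

    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        children = [elite[:]]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = pick(), pick()
            # Uniform crossover: each bit comes from parent a with prob. p_cross.
            child = [x if rng.random() < p_cross else y for x, y in zip(a, b)]
            # Bit-flip mutation with probability p_mut per bit.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Hypothetical five-gene network; g1 and g2 have no regulators.
adj = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4"], "g4": ["g3", "g5"]}
nodes = ["g1", "g2", "g3", "g4", "g5"]
best, history = evolve(make_fitness(adj, nodes), len(nodes))
print([n for n, bit in zip(nodes, best) if bit], history[-1])
```

Because the elite individual is copied unchanged into each new population, the best fitness in `history` can never decrease, which is the main practical benefit of elitism.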
After optimisation, the selected subset of nodes is taken as the key nodes, which are hub genes in a network. For these nodes, the edges connected with them transfer information from upstream to downstream. According to the theory of six degrees of separation, the maximum number of levels is expected to be no more than six [35]. This principle, well known for small‐world networks such as social networks, states that any node can be reached from any other node within six steps.
To illustrate the hierarchical decomposition in detail, a simplified gene network consisting of five nodes is considered, as shown in Fig. 4.
Fig. 4.

Illustration of hierarchical decomposition strategy. The nodes and have only out‐degrees and divided into the first priority level. The decomposition then reverses other nodes to judge whether other nodes belong to . At the node, there is mutual regulation between and . The subset becomes
The working flow of hierarchical decomposition part is implemented in two steps. The first step of decomposition is finding the nodes that have only out‐degrees, corresponding to the subset of nodes in Fig. 4. The directed edge from node i to j is denoted as . When there is a mutual connection between i and j, these two vertices are regarded as strongly connected, denoted by . Furthermore, the nodes and are strongly connected, thus the gene will be absorbed into the first priority level. In this case, the first level of subsets is . For the first‐level subset, the out‐degree edges transfer the information to nodes in the next level of vertices. In the second step, extensions are performed independently from the subsets .
For vertices in the current level, the first task is to determine vertices in the next level. Vertices in the top priority level are required to have only out‐degrees and no in‐degrees. The set of nodes in previous and next priority levels are denoted by preLevel and nextLevel, respectively. GraphLength defines the number of nodes in a given network and tmpQueue denotes a temporary queue that is used to store the nodes in the key subset . On the basis of GA algorithm, the hierarchical decomposition algorithm of directed graphs of GRN is described by the pseudocode in Algorithm 1 (see Fig. 5).
Fig. 5.

Algorithm 1: Pseudocode of hierarchical decomposition algorithm of gene network
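Since Algorithm 1 is only available as a figure, the level‐extension step can be sketched in Python as follows. This is an illustrative reading of the description in the text, not a transcription of the pseudocode: each next level collects the unassigned targets of the previous level and then absorbs unassigned nodes that are strongly connected to them. The names `pre_level` and `next_level` echo preLevel and nextLevel; the rest, including the five‐gene example, is assumed.

```python
def reachable(adj, start):
    """Return the set of nodes reachable from `start` (start included)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def decompose_levels(adj, first_level):
    """Split a digraph into priority levels, starting from `first_level`."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    reach = {n: reachable(adj, n) for n in nodes}
    assigned = set(first_level)
    levels = [sorted(first_level)]
    pre_level = set(first_level)
    while len(assigned) < len(nodes):
        # Nodes directly controlled by the previous level.
        next_level = {v for u in pre_level for v in adj.get(u, ())
                      if v not in assigned}
        if not next_level:
            break  # remaining nodes are not reachable from the first level
        # Absorb unassigned nodes that are mutually reachable with them.
        next_level |= {w for v in next_level for w in reach[v]
                       if w not in assigned and v in reach[w]}
        levels.append(sorted(next_level))
        assigned |= next_level
        pre_level = next_level
    return levels


# Hypothetical network: g1 and g2 have only out-degrees, g3 <-> g4 are coupled.
adj = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4"], "g4": ["g3", "g5"]}
levels = decompose_levels(adj, {"g1", "g2"})
print(levels)
```

For this toy network the sketch yields three levels: the two input genes, the mutually regulating pair and finally the single peripheral gene.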
4 Hierarchical parameter estimation of GRN
The proposed hierarchical estimation strategy aims to evaluate the importance of genes according to the graph‐based measure, i.e. the in‐degree and out‐degree distributions. For a given network, the nodes in the first priority level are inferred first using transcriptional profiles, thus alleviating the underdetermined problem caused by a huge number of gene nodes and limited measured data. After determining the first level, the hierarchical decomposition algorithm further extends to the next level. Considering the topological characteristics of GRNs, the number of hierarchical levels is expected to be no more than six.
Considering the complexity of biological networks and the number of genes, the minimisation of the cost function in one‐time‐all strategy needs a considerable resource to search for the global optimum. Under a given network topology, the inferring algorithms search for the parameters via a defined cost function. For the task of computational modelling of GRN, linear ODE systems are usually applied to describe the expression dynamics. However, non‐linear ODE models have the advantage of capturing the complex dynamics of biological models [36]. Non‐linear modelling of GRN has its own advantages in reflecting the regulation mechanisms [37, 38]. In non‐linear ODE models of GRN, the regulation of the expression of a target gene can be described by a combinatorial action of multiple regulators. The corresponding ODE models are described as the equation below:
| (8) |
The weight parameter represents the directed regulatory relation from the regulator gene i to the target gene j. Gene regulation is divided into activation and inhibition depending on the sign of the weight: positive weights correspond to activation and negative weights to inhibition. Non‐linear models are popular for describing relatively complex expression dynamics compared with linear models. With a sigmoidal transfer function, the differential equation of expression dynamics is shown as the equation below:
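| $\dfrac{dx_j(t)}{dt} = \dfrac{P_j}{1 + \exp\left(-\left(\sum_{i} w_{ij} x_i(t) - b_j\right)\right)} - d_j x_j(t)$ | (9) |
where $x_j(t)$ is the expression level of the target gene j, the parameter $P_j$ denotes the maximal expression rate and the constant $d_j$ represents the degradation rate of biomolecular products. For each gene, there are three unknown parameters $w_{ij}$, $P_j$ and $b_j$ that are merged into a parameter vector $\theta$. The unknown parameter $\theta$ will be estimated by optimisation approaches using a suitable cost function. To perform the task of parameter estimation, the cost function can be defined as the mean square error (MSE) function, shown as below: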
| (10) |
where the weighting matrix is user‐defined, and denote the expression levels of target genes, respectively. The non‐linear least‐squares algorithm aims to minimise the errors between the model output and the measured data. For a gene network consisting of N genes with T time points and R replicated experiments, the corresponding cost function is defined as below:
| (11) |
where represent the measured mRNA levels in the microarray dataset and are the model outputs. Penalty terms are introduced to enhance the predictive ability of models. The penalised cost function is further described as the equation below:
| (12) |
The penalty factor is introduced to avoid overfitting and can be determined by the L‐curve method [39]. For a given value of the penalty factor, the error term and the penalty term are calculated. Solving non‐linear ODEs is a time‐consuming process and becomes infeasible for complex biological systems. To estimate the unknown parameters of gene networks consisting of a hundred nodes, linear regression approaches can be applied instead. In dealing with such complex systems, linear models and ridge regression are applied to compute the kinetic parameters that correspond to regulatory strengths [40]. The regularised cost function of ridge regression is defined as below:
| (13) |
The cost function is related to the adjacency matrix, which denotes the regulations between genes. The two data matrices represent the expression levels of the target gene and the candidate regulators. The second term, weighted by the penalty coefficient, reduces overfitting. With known network structures, the regulations between genes are clear. In the hierarchical estimation methodology, the priority stages are determined by the decomposition algorithm based on topological analysis.
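A minimal sketch of the ridge‐regularised linear estimate for a single target gene is given below. It assumes the standard ridge form $\lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2$, whose minimiser is $w = (X^T X + \lambda I)^{-1} X^T y$. The symbol names, the function name and the synthetic data are assumptions for illustration and may differ from the notation of (13).

```python
import numpy as np


def ridge_weights(X, y, lam):
    """Closed-form minimiser of ||y - X w||^2 + lam * ||w||^2.

    X: (samples, regulators) expression of candidate regulators.
    y: (samples,) expression of the target gene.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_regulators = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_regulators)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic, noise-free data: one activator, one repressor, one non-regulator.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -0.5, 0.0])
y = X @ w_true

w_unreg = ridge_weights(X, y, lam=0.0)   # ordinary least squares
w_ridge = ridge_weights(X, y, lam=10.0)  # shrunk towards zero
print(w_unreg, w_ridge)
```

In a hierarchical setting, the columns of `X` would hold the regulators of the current target, with upstream genes handled in earlier levels; a larger `lam` shrinks the weights and trades fit for robustness.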
5 Experimental results and analysis
Simulated and realistic GRNs are selected as the benchmarks to validate the proposed hierarchical estimation. The selected realistic gene network uses microarray data of Saccharomyces cerevisiae measured under cold shock response, which involves 21 transcription factors (TFs) [16]. Kinetic parameters of five size‐100 gene networks that come from the DREAM4 challenge are estimated using microarray multifactorial data as the information source [41].
5.1 Hierarchical estimation of GRN of S. cerevisiae
The first example considers a medium‐sized network of S. cerevisiae that consists of 21 nodes and 32 directed edges. Each node represents the gene, its mRNA and the expressed protein simultaneously. As a typical eukaryote, S. cerevisiae responds to environmental stresses including temperature changes. The cold shock experiments were performed between the temperatures of 10 and . For each gene, there are three replicates of log 2 fold changes in expression levels measured at four time points, which are measured 10, 30 and 120 min after cold shock perturbations [42].
In this elitism‐based GA, a class, Graph.java, is defined to describe a network using adjacency matrices that consist of the elements 0 and 1. Then, the algorithm initialises a population and calculates the fitness values of the individuals. Tournament selection as well as crossover and mutation operations are applied to evolve the individuals. In this paper, the probabilities of crossover and mutation are set to 0.5 and 0.015, respectively. Assuming that nodes in the current priority level have only out‐degrees and no in‐degrees, the nodes in the next level are inferred from the nodes in the current priority level, as shown in Algorithm 1 (Fig. 5). The visualisation of the hierarchical decomposition of the S. cerevisiae subnetwork that contains 21 genes is shown in Fig. 6.
Fig. 6.

Decomposition of a 21‐gene network in S. cerevisiae. The nodes in the first and second priority levels are coloured yellow and red. Nodes of the first level have only out‐degrees and regulate downstream genes. Those nodes in the second level are directly controlled by that in . Vertices in the third and fourth levels that are coloured blue and purple are considered as peripheral nodes
After 1000 generations, the nodes are divided into four priority levels, in which seven genes are selected in . These genes are , , , , , and . These 7 genes are located in upstream of information flow and control expression of downstream nodes. Gene has five out‐degrees and one autoregulation. In this subnetwork, the gene controls three downstream genes. also plays a vital role in controlling information flow between gene modules (in specific synthetic gene networks) and controls three downstream genes. Gene is regulated by six genes and selected in the second priority level . Furthermore, , and form a feed‐forward motif. As these seven genes have no in‐degrees, parameter estimation algorithm mainly considers the production rates. In the following stage, computed parameters work as known information to infer parameters of nodes.
Afterwards, parameter estimation is performed in four priority levels based on the decomposition outcomes. There are 88 unknown parameters in total, of which 67 are estimated from measured expression data. Among the estimated parameters, there are 31 regulatory weights corresponding to the 31 edges, 21 production rates corresponding to the 21 genes and 15 thresholds b. Using a one‐time‐all estimation strategy, the MSE index of parameter estimation converges to 85 after 600 iterations using the built‐in fmincon function in MATLAB. The computations were run on an Intel Core(TM) i5‐4590 CPU at 3.3 GHz. The trajectory of the error index is shown in Fig. 7.
Fig. 7.

Trajectory of MSE index in single estimation strategy. In a single estimation methodology, the MSE index stops decreasing after reaching a specific level
The MSE index of parameter estimation of nodes that involve nine unknown parameters is 4.59 after 64 iterations. Afterwards, these nine inferred parameters are used as known information, thus reducing the modelling uncertainty in the second level . In this 21‐gene network, the second stage of estimation considers nodes that are directly controlled by . The gene is selected to . Both and are assigned to the second level since there is mutual regulation between and . Computation of the parameters in involves 28 weights , 12 production rates and 12 thresholds . The MSE index is 30.15, which requires 815 iterations to reach the convergence condition. The third and fourth levels each estimate only one node. For the four levels of parameter estimation, the MSE indexes and running times are recorded in Table 1.
Table 1.
Comparison of single and hierarchical estimations of size‐21 network
| | Nodes | Parameters | Iterations | Time, s | MSE index |
|---|---|---|---|---|---|
| level 1 | 7 | 9 | 64 | 3.17 | 4.59 |
| level 2 | 12 | 52 | 815 | 176 | 30.15 |
| level 3 | 1 | 3 | 47 | 2.47 | 0.00845 |
| level 4 | 1 | 3 | 40 | 2.46 | 0.198 |
| single est | 21 | 67 | 200 | 716 | 85 |
The row ‘single est’ in Table 1 shows that the MSE index using the single estimation strategy is 85. The first priority level contains about 33% of all nodes. Most of the computational time is spent in the second stage since this stage involves 57% of the nodes and 77% of the unknown parameters. The summed error index of the hierarchical estimation is 34.95, which is much lower than the 85 obtained with the single estimation strategy. As the number of nodes differs between levels, the normalised MSE index of each priority level is calculated by dividing the MSE index by the number of nodes in that level. The normalised MSE indexes of the four priority levels are shown in Fig. 8.
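The figures quoted in this paragraph can be reproduced directly from Table 1. The short script below is only a worked check of that arithmetic; the variable names are illustrative, and the normalisation is taken to be the per-level MSE divided by the number of nodes in that level, as described for Fig. 8.

```python
# (nodes, parameters, iterations, time in s, MSE index) per level, copied from Table 1
levels = {
    "level 1": (7, 9, 64, 3.17, 4.59),
    "level 2": (12, 52, 815, 176.0, 30.15),
    "level 3": (1, 3, 47, 2.47, 0.00845),
    "level 4": (1, 3, 40, 2.46, 0.198),
}
single = (21, 67, 200, 716.0, 85.0)  # row 'single est'

total_mse = sum(row[4] for row in levels.values())
total_time = sum(row[3] for row in levels.values())

# Share of nodes and of unknown parameters handled in each level
node_share = {name: row[0] / single[0] for name, row in levels.items()}
param_share = {name: row[1] / single[1] for name, row in levels.items()}

# Normalised MSE: error index of a level divided by its number of nodes
normalised_mse = {name: row[4] / row[0] for name, row in levels.items()}

print(f"summed hierarchical MSE: {total_mse:.2f} (single: {single[4]})")
print(f"summed hierarchical time: {total_time:.1f} s (single: {single[3]} s)")
for name in levels:
    print(f"{name}: {node_share[name]:.0%} of nodes, "
          f"{param_share[name]:.0%} of parameters, "
          f"normalised MSE {normalised_mse[name]:.4f}")
```

The summed MSE rounds to 34.95, and the per-level running times in Table 1 add up to about 184 s, against 716 s for the single estimation.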
Fig. 8.

Trajectories of MSE indexes in the hierarchical estimation of the 21‐gene network. The task of parameter estimation is decomposed into four stages. The parameters inferred in one stage are used in the next stage of estimation. As the numbers of nodes considered in each stage are quite different, normalised MSE indexes are calculated by dividing the error indexes by the number of genes in each priority level
Among the inferred parameters, regulation strengths are directly related to the regulation mechanism underlying the expression profiles. Positive and negative edge weights correspond to activation and repression, respectively. The regulation strengths estimated by the one‐time‐all strategy and by hierarchical estimation are shown in Fig. 9.
Fig. 9.

Comparison of regulation strengths in the 21‐gene network
5.2 Hierarchical estimations of GRN in DREAM challenge
Accurate and reliable estimation of model parameters for biological systems remains a challenging task when gene networks involve hundreds of parameters. The experiment uses the insilico networks of the DREAM4 challenge as benchmarks to validate the performance of the proposed hierarchical estimation algorithm [41]. The goal of hierarchical estimation is to compute the parameters of gene regulation networks from simulated expression data. Insilico network 1 in the DREAM4 challenge has 100 nodes and 176 edges; the hierarchical estimation algorithm decomposes it into four priority levels, which contain 16, 68, 15 and 1 nodes, respectively. The adjacency lists are read and converted to binary code in the elitism‐based GA. The optimisation algorithm determines the number of levels and the gene nodes in each level. The gold standard for insilico network 1 is visualised in Fig. 10.
Fig. 10.

Network topology of insilico network 1 of size‐100 networks from DREAM4 challenge
Current estimation strategies, which usually treat all nodes and related edges in a given network with the same importance level, need considerable computational resources. Such requirements are hard to meet in realistic cases, especially when dealing with complex regulatory systems. In this case, hierarchical parameter estimation provides a promising solution for computational modelling of GRNs. Owing to topological characteristics of gene networks such as hub structures, a subset of gene nodes has a higher influence on the whole system. Meanwhile, the importance levels of genes or TFs in a given network differ depending on their locations in the information flow.
The first step is to decompose a given network with a clear structure into several priority levels using the graph‐based decomposition algorithm. For insilico network 1, the second priority level contains 68 nodes. To simplify the parameter estimation problem, only regulation strengths are considered and the ODE models that are used to describe the expression behaviours are simplified to . Second, ridge regression is applied to calculate the unknown weights of the edges that connect the corresponding nodes. To give more detail about the parameter estimation, both the MSE and the median absolute error (MAE) indexes are computed. In this linear modelling framework, the number of edges in a network equals the number of unknown model parameters, which correspond to the weights of the edges. The computational times of hierarchical parameter estimation for the five insilico networks are 6.22, 5.21, 4.97, 4.93 and 5.14 s. These times are much shorter than that for modelling the size‐21 yeast network, partly owing to the linear model assumptions and the efficiency of ridge regression. The computational time of modelling the five insilico networks using single estimation is approximately the same as that of the hierarchical estimation methodology. Compared with stochastic modelling approaches, one advantage of deterministic modelling using linear regression is its stable error indexes and computational cost. The outcomes of hierarchical parameter estimation using ridge regression are shown in Table 2.
Table 2.
Hierarchical parameter estimation of DREAM4 networks
| Networks | Edges | Levels | Nodes per level | MSE | MAE |
|---|---|---|---|---|---|
| insilico 1 | 176 | 4 | level 1: 16 | 1.766 | 3.927 |
| | | | level 2: 68 | 6.365 | 14.605 |
| | | | level 3: 15 | 2.077 | 3.864 |
| | | | level 4: 1 | 0.153 | 0.345 |
| insilico 2 | 249 | 3 | level 1: 14 | 1.461 | 3.221 |
| | | | level 2: 76 | 9.688 | 21.679 |
| | | | level 3: 10 | 0.837 | 2.018 |
| insilico 3 | 195 | 4 | level 1: 9 | 1.425 | 2.667 |
| | | | level 2: 44 | 7.569 | 14.148 |
| | | | level 3: 45 | 5.997 | 11.568 |
| | | | level 4: 2 | 0.297 | 0.668 |
| insilico 4 | 211 | 3 | level 1: 9 | 1.2082 | 2.392 |
| | | | level 2: 45 | 6.338 | 12.437 |
| | | | level 3: 46 | 6.253 | 13.352 |
| insilico 5 | 193 | 4 | level 1: 7 | 1.052 | 2.013 |
| | | | level 2: 36 | 5.538 | 10.933 |
| | | | level 3: 56 | 8.621 | 16.056 |
| | | | level 4: 1 | 0.152 | 0.285 |
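The ridge regression step used to obtain the edge weights can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the simplified ODE model is not reproduced here, so the sketch only assumes that each row of `X` holds the expression levels of the regulators of one target gene at one time point and that `y` holds the corresponding quantity to be explained for that target. The penalty `lam`, the helper names and the synthetic data are hypothetical. MAE is computed as the median absolute error, following the text.

```python
from statistics import median


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return the weights w that minimise ||X w - y||^2 + lam * ||w||^2,
    obtained from the normal equations (X^T X + lam I) w = X^T y."""
    p = len(X[0])
    gram = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(gram, rhs)


def error_indexes(X, y, w):
    """Mean squared error and median absolute error of the fitted weights."""
    residuals = [sum(wi * xi for wi, xi in zip(w, row)) - yi
                 for row, yi in zip(X, y)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = median(abs(r) for r in residuals)
    return mse, mae


# Synthetic target with two regulators and true weights (2, -1):
# positive weight = activation, negative weight = repression.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [2.0 * a - 1.0 * b for a, b in X]

w_small = ridge_fit(X, y, lam=1e-6)   # nearly unpenalised fit
w_large = ridge_fit(X, y, lam=10.0)   # strong shrinkage towards zero
print(w_small, error_indexes(X, y, w_small))
print(w_large, error_indexes(X, y, w_large))
```

In a hierarchical setting, such a fit would be run level by level, each time restricted to the edges that enter the nodes of the current level. The closed-form solution is deterministic, which is consistent with the stable error indexes and running times reported above.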
For insilico network 1, hierarchical parameter estimation begins with the top priority level of 16 nodes and computes the relevant regulatory strengths with an error index of 1.676. The MSE indexes for the remaining three levels are 6.365, 2.077 and 0.153, in which 0.153 denotes the estimation error of the fourth level in insilico network 1. The total MSE of insilico network 1 sums to 10.271, which is slightly lower than the 11.6395 obtained with the single estimation strategy. Because a linear ODE model and ridge regression are applied to model the size‐100 networks, the error indexes are relatively low, leaving limited room for improvement. In general, the size‐100 networks are decomposed into three to four priority levels using the hierarchical estimation algorithm. For those networks that are decomposed into four levels, the number of nodes in the last level is either 1 or 2. For instance, the nodes in the last level of the insilico 3 network correspond to genes that are considered peripheral nodes. For insilico network 1 of the size‐100 networks, the comparison of regulatory strengths using the single and hierarchical estimation strategies is shown in Fig. 11.
In Fig. 11, the trajectories of the estimated regulatory strengths follow a similar pattern, indicating that the hierarchical estimation methodology essentially finds the solution obtained by the traditional one‐time‐all strategy. One‐time‐all estimation methods pose a huge computational burden, partly because the cost function considers all model parameters at the same importance level. In fact, the information flow in gene networks means that certain modules of genes are located upstream while specific groups of genes lie downstream. Especially in engineered gene circuits or networks, the information flow and the regulatory relationships are vital to the cellular functions of the synthetic system. This indicates that the influences of nodes and their parameters on the whole system differ and should be considered according to their priority levels. To compare the accuracy of parameter estimation under the single and hierarchical estimation strategies, the MSE and MAE indexes are compared in Table 3.
Fig. 11.

Comparison of regulation strengths in insilico network 1 in DREAM4 challenge
Table 3.
Comparison of error indexes using single and hierarchical estimations
| Networks | Single MSE | Hierarchical MSE | Single MAE | Hierarchical MAE |
|---|---|---|---|---|
| insilico 1 | 11.639 | 10.0361 | 24.138 | 22.741 |
| insilico 2 | 14.288 | 11.986 | 29.556 | 26.918 |
| insilico 3 | 17.708 | 15.288 | 32.172 | 29.051 |
| insilico 4 | 16.994 | 13.790 | 31.583 | 28.281 |
| insilico 5 | 18.208 | 15.363 | 32.801 | 29.287 |
According to Table 3, both the MSE and MAE indexes of the insilico networks calculated under the hierarchical estimation strategy are lower than those calculated by the single estimation strategy. This indicates that the hierarchical estimation strategy is capable of obtaining parameters with relatively higher accuracy than the traditional single estimation methodology. For biological networks that involve hundreds of parameters, linear regression methods are suitable tools for determining the model parameters and achieve lower error indexes than non‐linear modelling approaches. The significance of the hierarchical parameter estimation methodology is not limited to reducing error indexes and computational cost; it also provides a feasible way to deal with regulatory systems with a small ratio of hub gene nodes.
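The size of the improvement reported in Table 3 can be expressed as a relative reduction of each error index. The script below is a worked check on the tabulated values only; the helper name `reduction` is illustrative.

```python
# (single MSE, hierarchical MSE, single MAE, hierarchical MAE), copied from Table 3
table3 = {
    "insilico 1": (11.639, 10.0361, 24.138, 22.741),
    "insilico 2": (14.288, 11.986, 29.556, 26.918),
    "insilico 3": (17.708, 15.288, 32.172, 29.051),
    "insilico 4": (16.994, 13.790, 31.583, 28.281),
    "insilico 5": (18.208, 15.363, 32.801, 29.287),
}


def reduction(single, hierarchical):
    """Relative reduction of an error index, in per cent of the single-estimation value."""
    return 100.0 * (single - hierarchical) / single


mse_reduction = {name: reduction(row[0], row[1]) for name, row in table3.items()}
mae_reduction = {name: reduction(row[2], row[3]) for name, row in table3.items()}

for name in table3:
    print(f"{name}: MSE reduced by {mse_reduction[name]:.1f}%, "
          f"MAE reduced by {mae_reduction[name]:.1f}%")
```

For the five networks, the hierarchical strategy lowers the MSE index by roughly 14 to 19% and the MAE index by roughly 6 to 11% relative to the single estimation.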
6 Conclusion
This paper proposes a hierarchical estimation strategy in which the nodes of a given network are divided into various levels based on topology analysis. For a given gene network, the hierarchical estimation strategy first decomposes the network based on its topology, usually into three to four priority levels. This decomposition of the digraph of the gene network is completed by searching for the subset of root SCCs using optimisation. The nodes in the subset of root SCCs are selected as the top priority level. To perform this decomposition task, a fitness function based on the graph‐based measure is defined and minimised by a GA method. Experiments on insilico expression data from the DREAM4 challenge and on realistic microarray data indicate that the proposed hierarchical strategy is able to reduce the error index at a lower computational cost.
7 Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61573311).
8 References
- 1. Marbach D., Costello J.C., and Küffner R. et al.: ‘Wisdom of crowds for robust gene network inference’, Nat. Methods, 2012, 9, (8), pp. 796–806
- 2. Bansal M., Gatta G.D., and Bernardo D.D.: ‘Inference of gene regulatory networks and compound mode of action from time course gene expression profiles’, Bioinformatics, 2006, 22, (7), pp. 815–822
- 3. Xing H.M., and Gardner T.S.: ‘The mode‐of‐action by network identification algorithm: a network biology approach for molecular target identification’, Nat. Protocols, 2006, 1, (6), pp. 2551–2554
- 4. Hase T., Ghosh S., and Kitano H. et al.: ‘Harnessing diversity towards the reconstructing of large scale gene regulatory networks’, PLOS Comput. Biol., 2013, 9, (11), p. e1003361
- 5. Chang Y.H., Gray J.W., and Tomlin C.J. et al.: ‘Exact reconstruction of gene regulatory networks using compressive sensing’, BMC Bioinf., 2014, 15, (1), pp. 400–421
- 6. Ghanbari M., Lasserre J., and Vingron M.: ‘Reconstruction of gene networks using prior knowledge’, BMC Syst. Biol., 2015, 9, (1), pp. 84–94
- 7. Meyer P., Cokelaer T., and Chandran D. et al.: ‘Network topology and parameter estimation: from experimental design methods to gene regulatory network kinetics using a community based approach’, BMC Syst. Biol., 2014, 8, (1), pp. 13–31
- 8. Xiong J., and Zhou T.: ‘Structure identification for gene regulatory networks via linearization and robust state estimation’, Automatica, 2014, 50, (11), pp. 2765–2776
- 9. Zhang X.J., Zhao X.M., and He K. et al.: ‘Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information’, Bioinformatics, 2012, 28, (1), pp. 98–104
- 10. Zhang X.J., Liu K.Q., and Liu Z.P. et al.: ‘NARROMI: a noise and redundancy reduction technique improves accuracy of gene regulatory network inference’, Bioinformatics, 2013, 29, (1), pp. 106–113
- 11. Margolin A.A., Nemenman I., and Basso K. et al.: ‘ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context’, BMC Bioinf., 2004, 7, (1), pp. 1–15
- 12. Lachmann A., Giorgi F.M., and Lopez G. et al.: ‘ARACNe‐AP: gene network reverse engineering through adaptive partitioning inference of mutual information’, Bioinformatics, 2016, 32, (14), pp. 2233–2235
- 13. Cao J.G., and Zhao H.Y.: ‘Estimating dynamic models for gene regulation networks’, Bioinformatics, 2008, 24, (14), pp. 1619–1624
- 14. Fan M., Kuwahara H., and Wang X. et al.: ‘Parameter estimation methods for gene circuit modeling from time‐series mRNA data: a comparative study’, Briefings Bioinf., 2015, 16, (6), pp. 987–999
- 15. Biswas S., and Acharyya S.: ‘Parameter estimation of gene regulatory network using honey bee mating optimization’, Int. Conf. Emerg. Appl. Inf. Technol., 2015, 5, (3), pp. 1–10
- 16. Dahlquist K.D., Fitzpatrick B.G., and Camacho E.T. et al.: ‘Parameter estimation for gene regulatory networks from microarray data: cold shock response in Saccharomyces cerevisiae’, Bull. Math. Biol., 2015, 77, (8), pp. 1457–1492
- 17. Kuwahara H., Fan M., and Wang S.J. et al.: ‘A framework for scalable parameter estimation of gene circuit models using structural information’, Bioinformatics, 2013, 29, (13), pp. 98–107
- 18. Faith J.J., Hayete B., and Thaden J.T. et al.: ‘Large‐scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles’, PLOS Biol., 2007, 5, (1), p. e8
- 19. Vignes M., Vandel J., and Allouche D. et al.: ‘Gene regulatory network reconstruction using Bayesian networks, the Dantzig selector, the Lasso and their meta‐analysis’, PLOS One, 2011, 6, (12), p. e29165
- 20. Liu F., Zhang S.W., and Guo W.F. et al.: ‘Inference of gene regulatory network based on local Bayesian networks’, PLOS Comput. Biol., 2016, 12, (8), p. e1005024
- 21. Diop S., and Fliess M.: ‘Nonlinear observability, identifiability, and persistent trajectories’. IEEE Conf. Proc. Decision and Control, Brighton, UK, 1991, (1), pp. 714–719
- 22. Gui S., Rice A.P., and Chen R. et al.: ‘A scalable algorithm for structure identification of complex gene regulatory network from temporal expression data’, BMC Bioinf., 2017, 18, (1), pp. 74–86
- 23. Mason O., and Verwoerd M.: ‘Graph theory and networks in biology’, IET Syst. Biol., 2007, 1, (2), pp. 89–119
- 24. Chen G., Larsen P., and Almasri E. et al.: ‘Rank‐based edge reconstruction for scale‐free genetic regulatory networks’, BMC Bioinf., 2008, 9, (1), p. 75
- 25. Roy S.: ‘Systems biology beyond degree, hubs and scale‐free networks: the case for multiple metrics in complex networks’, Syst. Synth. Biol., 2012, 6, (1–2), p. 31
- 26. Yang B., Xu J., and Liu B. et al.: ‘Inferring gene regulatory networks with a scale‐free property based informative prior’. IEEE Int. Conf. Biomedical Engineering and Informatics, Shenyang, China, 2015, pp. 542–547
- 27. Li L., Alderson D., and Doyle J.C. et al.: ‘Towards a theory of scale‐free graphs: definition, properties, and implications’, Internet Math., 2005, 2, (4), pp. 431–523
- 28. Barrat A., Barthélemy M., and Pastor‐Satorras R. et al.: ‘The architecture of complex weighted networks’, Proc. Natl. Acad. Sci. USA, 2004, 101, (11), pp. 3747–3752
- 29. Piraveenan M., Prokopenko M., and Hossain L. et al.: ‘Percolation centrality: quantifying graph‐theoretic impact of nodes during percolation in networks’, PLOS One, 2013, 8, (1), p. e53095
- 30. McLendon W., Hendrickson B., and Plimpton S.J. et al.: ‘Finding strongly connected components in distributed graphs’, J. Parallel Distrib. Comput., 2005, 65, (8), pp. 901–910
- 31. Liu Y.Y., Slotine J.J., and Barabási A.L.: ‘Observability of complex systems’, Proc. Natl. Acad. Sci., 2013, 110, (7), pp. 2460–2465
- 32. Travençolo B.A.N., and Costa L.F.: ‘Accessibility in complex networks’, Phys. Lett. A, 2008, 373, (1), pp. 89–95
- 33. Liu Y.Y., Slotine J.J., and Barabási A.L.: ‘Controllability of complex networks’, Nature, 2011, 473, (7346), pp. 167–173
- 34. Lin C.T.: ‘Structural controllability’, IEEE Trans. Autom. Control, 1974, 19, (3), pp. 201–208
- 35. Dodds P.S., Muhamad R., and Watts D.J.: ‘An experimental study of search in global social networks’, Science, 2003, 301, (5634), pp. 827–829
- 36. Vu T.T., and Vohradsky J.: ‘Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae’, Nucleic Acids Res., 2007, 35, (1), pp. 279–287
- 37. Ghasemi O., Lindsey M.L., and Yang T. et al.: ‘Bayesian parameter estimation for nonlinear modelling of biological pathways’, BMC Syst. Biol., 2011, 5, (3), pp. 1–10
- 38. Eydgahi H., William W.C., and Muhlich J. et al.: ‘Properties of cell death models calibrated and compared using Bayesian approaches’, Mol. Syst. Biol., 2013, 9, (1), pp. 644–660
- 39. Reginska T.: ‘A regularization parameter in discrete ill‐posed problems’, SIAM J. Sci. Comput., 1996, 17, (3), pp. 740–749
- 40. Hoerl A.E., and Kennard R.W.: ‘Ridge regression: biased estimation for nonorthogonal problems’, Technometrics, 2000, 42, (1), pp. 80–86
- 41. Greenfield A., Madar A., and Ostrer H. et al.: ‘DREAM4: combining genetic and dynamic information to identify biological networks and dynamical models’, PLOS One, 2010, 5, (10), p. e13397
- 42. Schade B., Jansen G., and Whiteway M. et al.: ‘Cold adaptation in budding yeast’, Mol. Biol. Cell, 2004, 15, (12), pp. 5492–5502
