Abstract
Accurate determination of causalities between genes is a challenge in the inference of gene regulatory networks (GRNs) from the gene expression profile. Although many methods have been developed for the reconstruction of GRNs, most of them are insufficient in determining causalities or regulatory directions. In this work, we present a novel method, namely, DDTG, to improve the accuracy of causality determination in GRN inference by dissecting downstream target genes. In the proposed method, the topology and hierarchy of GRNs are determined by mutual information and conditional mutual information, and the regulatory directions of GRNs are determined by Taylor formula-based regression. In addition, indirect interactions are removed with the sparseness of the network topology to improve the accuracy of network inference. The method is validated on the benchmark GRNs from DREAM3 and DREAM4 challenges. The results demonstrate the superior performance of the DDTG method on causality determination of GRNs compared to some popular GRN inference methods. This work provides a useful tool to infer the causal gene regulatory network.
Keywords: gene regulatory networks, network inference, downstream targets, causality, machine learning
Introduction
Elucidating gene regulatory networks (GRNs) is a fundamental challenge in molecular biology (Hughes et al., 2000). High-throughput technologies have provided a wealth of gene expression data, which help to interrogate the complex regulatory dynamics inherent to organisms (Algabri et al., 2022; Wang and Liu, 2022). The network structure, consisting of genes (nodes) and regulatory interactions (edges), can be inferred from the observed data by minimizing the effects of noise and hidden genes (Baruch and Albert-László, 2013; Yang et al., 2022). Various methods have been developed to improve the accuracy of GRN reconstruction from gene expression profiles (Riet and Kathleen, 2010; Zhang et al., 2022). However, each method has its own strengths and weaknesses (Daniel et al., 2010), and most of them are insufficient in determining the causalities or regulatory directions (Krouk et al., 2013; Ahmed et al., 2018). Understanding the causality in the gene expression data is critical to elucidating the underlying regulatory mechanism of cellular machines (Jiang et al., 2000; Nagoshi et al., 2004; Rubin et al., 2019).
Existing methods for inferring GRNs from gene expression data, developed to improve the accuracy and scalability of network inference, include model-based approaches and machine learning-based approaches (Madar et al., 2009; Zhang et al., 2013). In the model-based approaches, the chemical reactions of transcription and translation, as well as other cellular processes, are described as linear or nonlinear differential equations, in which the parameters represent the regulation strengths of the regulators (Gardner et al., 2003; Honkela et al., 2010). Dynamical system models of differential equations can forecast future system behaviors and characterize formal properties such as stability (Zak et al., 2003; Wang et al., 2022). Furthermore, prior information, such as experimentally verified regulations, can be easily included in these models to improve the accuracy of network inference (Studham et al., 2014; Zhang et al., 2017). Moreover, the model-based methods are useful for removing possibly redundant indirect regulations by forcing sparseness on the model (Hurley et al., 2011; Jiang and Zhang, 2022). However, these models are computationally intractable for large GRNs owing to extensive and explicit parameterization requirements (Karlebach and Shamir, 2008; Tibshirani, 2011). In the machine learning-based approaches, the network is inferred by measuring the dependences or causalities between transcription factors and target genes (Khatamian et al., 2018; Deng et al., 2021). Popular methods in this category include mutual information (MI) (Modi et al., 2011), conditional mutual information (CMI) (Zhang et al., 2011), part mutual information (Zhao et al., 2016), Granger causality (Finkle et al., 2018), and maximal information coefficient (Reshef et al., 2011; Kinney and Atwal, 2014).
With no explicit mechanistic assumptions, the machine learning-based methods are usually more efficient than the model-based methods in terms of computational complexity (Zhang et al., 2015).
As the most popular methods, MI and CMI evaluate the association between genes by measuring the entropy of their mutual activities, where a lower entropy for a gene indicates that its activities are less randomly distributed; that is, it is statistically dependent on the activities of other genes (Butte and Kohane, 2000). In particular, MI can characterize nonlinear dependency and deal with thousands of variables (genes) in the presence of a limited number of samples. However, the MI between two variables is a symmetric quantity, and so the MI-based methods generally infer undirected interactions (Aghdam et al., 2015). The ordinary differential equation (ODE)-based methods can be used for inference of causal GRNs, but these methods require demanding datasets such as time-series data (Lu et al., 2021; Yang et al., 2021; Chen and Liu, 2022).
In this study, we develop a method for inferring causal GRNs from genetic knockout data. The activities of the downstream target genes respond to the knockout of regulatory genes; the targets are identified by comparing their relative change values and assigning a weight to the relative change values of each gene, which helps to remove spurious edges in the causal inference of GRNs. We dissect the downstream target genes using CMI, MI, and Taylor formula-based regression to determine the causalities among them. We first model the hierarchy structure of the downstream targets and determine the causalities between different layers with CMI, which efficiently removes redundant indirect regulations. We then determine the correlations within the same layer. Finally, we use Taylor formula-based regression to determine the causalities within the same layer. Repeating this process for each regulator yields the reconstructed GRN. Our method retains the advantages of the machine learning-based methods while also determining the regulatory directions. The results on the DREAM3 and DREAM4 datasets show that our method outperforms the existing methods in terms of false-positive rate and accuracy on most of the benchmark networks.
Methods
Downstream target identification
As the activity of the regulator can be approximated by the expression level of the gene encoding the regulator, we take the gene expression level as the regulator activity. Let represent the ith gene. Considering a network consisting of n genes, the gene expression matrix A denotes the gene expression levels under different conditions (samples), which can be measured directly from gene knockout experiments. The knockout experiment is implemented for every gene, and the downstream genes respond to the knockout of gene through fluctuations of their expression levels. The steady-state levels of genes in the wild type provide a reference for the gene expression changes. Thus, the gene expression levels of the wild-type and knockout experiments of each gene provide the information to identify downstream targets. Matrix A consists of n rows with n steady-state values of knockout experiments, and each row is obtained after deleting one of the n genes. The vector stands for the jth column of matrix A, in which represents the steady-state level of gene after knockout of gene . The wild-type data contain n steady-state levels of the unperturbed network. The vector stands for the wild-type data of each gene, in which denotes the steady-state level of the wild type of gene .
The genes whose steady-state values change as a result of a single-gene knockout are likely to be downstream of the perturbed gene. Most causal relationships owing to the gene knockout can be immediately recognized by comparing the steady-state data after gene knockout with the wild-type data. We calculate the changes of gene response to the knockout of every gene by using the following expression:
| (1) |
where denotes the changes of gene by comparing the response to the knockout of gene with the wild type. The changes describe the response of all genes as a consequence of the perturbation of the source gene. We use the mean change value to quantify the mean response strengths of the same target to different regulators. The mean change value in gene can be expressed as . for different genes varies widely because the wild-type data of different genes vary widely. So we use the relative change value to quantify the response strengths of the same target to different regulators. We obtain the relative change value vector which is , where denotes a relative change value of gene . Gene is called the regulator, and the genes that respond to the change of are called downstream target genes or targets. of each gene varies widely because the wild-type data of each gene vary widely and because the activities of the downstream target genes responding to the same knockout change of regulatory gene vary widely. To calculate the activities of the downstream target genes and compare the relative changes of with other genes, we assign a weight to by sigmoid function , where parameters r and u describe the coefficients of the sigmoid function, and describes the maximum response strength of to the changes of other genes. Parameters r and u are given rather than estimated, to balance the computation of w. Parameter r is set to a negative integer and parameter u to a positive number smaller than 1. In general, the values chosen will not affect the final results. By calculating , we obtain a matrix , where denotes the th column of matrix S, and the row vector denotes the ith row of matrix S. Given a threshold parameter for deciding the downstream target genes of regulator , the elements in above are regarded as downstream target genes (Figure 1A). Most causal relationships could be accurately recognized from . 
Due to the sparseness of GRNs, the downstream targets consist of a small number of genes, which is helpful to remove the indirect dependencies and reduce the computational complexity.
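The identification step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the exact expressions for the change values and the sigmoid weight are not reproduced here, so the sketch assumes absolute differences from the wild type, normalization of each target's change by its maximum response over all knockouts, and a logistic weight with a negative slope parameter r and a midpoint u between 0 and 1. The function names `downstream_scores` and `downstream_targets`, the default values, and the toy data are hypothetical.

```python
import numpy as np

def downstream_scores(A, wild_type, r=-10.0, u=0.5):
    """Score how strongly each gene j responds to the knockout of gene i.

    A[i, j] is the steady-state level of gene j after knocking out gene i.
    The normalization and the logistic form below are assumptions.
    """
    A = np.asarray(A, dtype=float)
    wt = np.asarray(wild_type, dtype=float)
    delta = np.abs(A - wt)             # change of every gene in every knockout
    np.fill_diagonal(delta, 0.0)       # ignore the knocked-out gene itself
    max_resp = delta.max(axis=0)       # strongest response of each target gene
    max_resp[max_resp == 0] = 1.0      # avoid division by zero for inert genes
    rel = delta / max_resp             # relative change in [0, 1]
    S = 1.0 / (1.0 + np.exp(r * (rel - u)))  # r < 0, so S increases with rel
    np.fill_diagonal(S, 0.0)
    return S

def downstream_targets(S, threshold=0.5):
    """Return, for each regulator i, the genes whose score exceeds the threshold."""
    return {i: np.flatnonzero(S[i] > threshold).tolist() for i in range(S.shape[0])}

# Toy example with three genes: knocking out gene 0 mainly perturbs gene 1,
# and knocking out gene 1 mainly perturbs gene 2.
wild_type = [1.0, 1.0, 1.0]
A = [[0.0, 0.2, 0.9],
     [1.0, 0.0, 0.5],
     [1.0, 1.0, 0.0]]
S = downstream_scores(A, wild_type)
targets = downstream_targets(S)
```

Because each regulator keeps only the genes above the threshold, the candidate sets stay small in sparse networks, which is what limits the later pairwise computations.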
FIGURE 1.
Overview of the DDTG method. (A) For each regulator , the downstream target genes of the regulator are identified by the Sigmoid function and by comparing the relative change values. (B) Hierarchy structure of downstream target genes will be decided by using CMI. Assuming genes are the downstream target genes of regulator . As an example, genes and belong to the 2 combination of the downstream target genes. If is small, it indicates that regulates directly and regulates directly. The dashed arrows denote the regulatory direction. (C) The correlations between two genes in the same layer. The dashed lines without arrows denote the genes being strongly dependent on each other. Edges – , – , and – are directly correlated with each other in the same layer. (D) The regulatory direction between two genes in the same layer is determined using the Taylor formula and linear regression. The solid arrows denote the causality in the same layer. The reconstruction of GRNs is achieved by aggregating the edges from each step.
Causality among hierarchy genes
Some of the downstream targets may be indirectly regulated by gene . The remaining task is thus to distinguish direct dependence from indirect dependence. To accomplish this, we use conditional mutual information (CMI) to discriminate the genes directly regulated by from the genes indirectly regulated by . Accordingly, we obtain a hierarchy structure of the downstream targets of . So the topological structure of the downstream target genes of is a two-layer network. The genes in the first layer are directly regulated by regulator , and the genes in the second layer are indirectly regulated by .
The CMI allows measuring the dependency of two genes in the context of a third gene. We assume that and are ’s downstream target genes. The interaction between gene and can be measured in the context of gene by the CMI, which is defined as follows:
The CMI can be easily calculated using the following equation:
| (2) |
where is the covariance matrix of variables, and is the determinant of matrix . If and carry the same information about , . It indicates that directly regulates and indirectly regulates mediated by ; that is, gene directly regulates gene . The genes regulated directly by form a layer, namely, the first layer, and the genes regulated indirectly by form a layer, namely, the second layer (Figure 1B).
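As an illustration, the sketch below computes CMI from covariance determinants under a Gaussian assumption, using the closed form CMI(X, Y | Z) = ½ log(|C(X, Z)| · |C(Y, Z)| / (|C(Z)| · |C(X, Y, Z)|)). This closed form is assumed here to correspond to the covariance-based expression referred to above; the function names and the simulated chain regulator → a → b are hypothetical.

```python
import numpy as np

def _logdet_cov(*cols):
    """Log-determinant of the sample covariance matrix of the given variables."""
    c = np.atleast_2d(np.cov(np.vstack(cols)))
    _, logdet = np.linalg.slogdet(c)
    return logdet

def gaussian_cmi(x, y, z):
    """CMI(x, y | z) for jointly Gaussian variables, from covariance determinants."""
    return 0.5 * (_logdet_cov(x, z) + _logdet_cov(y, z)
                  - _logdet_cov(z) - _logdet_cov(x, y, z))

# Simulated chain: regulator -> a -> b (b depends on the regulator only through a).
rng = np.random.default_rng(0)
regulator = rng.normal(size=2000)
a = regulator + 0.3 * rng.normal(size=2000)
b = a + 0.3 * rng.normal(size=2000)

cmi_indirect = gaussian_cmi(regulator, b, a)  # close to zero: b is mediated by a
cmi_direct = gaussian_cmi(regulator, a, b)    # clearly positive: a is a direct target
```

In this toy chain, conditioning on the intermediate gene removes almost all dependence between the regulator and the second gene, which is the signature used to place a gene in the second layer.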
Correlations among the genes in the same layer
For the genes in the same layer, the correlations among them are measured by mutual information (MI). MI between two genes and can be defined as follows (Altay and Emmert-Streib, 2010):
The MI can be easily calculated using the following equivalent formula:
| (3) |
where is the covariance matrix of variables, and is the determinant of matrix . If is large, it indicates that a strong dependency exists between genes and (Figure 1C).
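A minimal sketch of the covariance-based MI computation follows, assuming the Gaussian closed form MI(X, Y) = ½ log(|C(X)| · |C(Y)| / |C(X, Y)|), which for two variables equals −½ log(1 − ρ²) with ρ the Pearson correlation. The function name and the simulated data are hypothetical.

```python
import numpy as np

def gaussian_mi(x, y):
    """MI between two jointly Gaussian variables from their 2 x 2 covariance matrix."""
    c = np.cov(np.vstack([np.asarray(x, float), np.asarray(y, float)]))
    return 0.5 * np.log(c[0, 0] * c[1, 1] / np.linalg.det(c))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y_dep = x + 0.5 * rng.normal(size=500)   # strongly dependent on x
y_ind = rng.normal(size=500)             # independent of x

mi_dep = gaussian_mi(x, y_dep)
mi_ind = gaussian_mi(x, y_ind)
```

The value is the same whichever gene is passed first, so MI alone identifies the dependent pairs within a layer but not the direction of regulation between them.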
Regulatory directions between genes in the same layer
To determine the regulatory direction between two genes that are in the same layer and strongly dependent on each other, we propose an approach based on linear regression.
We assume that gene is the common regulator of genes and , and that a strong dependency exists between gene and gene by measuring the MI. We denote , , and by X, Y, and Z, respectively, for simplifying notations. We aim to determine the regulatory direction between Y and Z in the direct gene set (Figure 2A) or the indirect gene set (Figure 2B). If we assume gene Y regulates gene Z, the gene–gene regulation can be expressed as a nonlinear equation set:
| (4) |
FIGURE 2.

Target genes with co-regulator. (A) Genes Y and Z are direct targets of gene X, and (B) genes Y and Z are indirect targets of gene X.
The activity of Y is determined by X, and the activity of Z is determined by X and Y. Thus, Y is a function of X, and Z is a function of X and Y satisfying Eq. 4, which expresses the causality among X, Y, and Z. The nonlinearity of the regulatory functions in Eq. 4 makes it difficult to identify the model computationally. To address this issue, we apply the Taylor expansion, which approximates the nonlinear functions in Eq. 4 by polynomials. If , , and denote the wild-type data of X, Y, and Z, then and . The Taylor expansion corresponding to and at point is the infinite series whose nth term is , that is,
| (5) |
So we need to take the derivative of Z. The first derivative of with respect to can be written as follows:
The second derivative of with respect to can be written as follows:
| (6) |
The wild-type data can be viewed as a steady state of GRNs, only for as long as the flow of energy, nutrients, and other molecules is maintained. Hence, while the gene expression level of regulator X is at point , that is, the wild-type data of regulator X, the fluctuation of the gene expression level of targets Y and Z is minimal. This means the derivative of Y and Z with respect to X at point is zero. So we obtain the following equation:
| (7) |
that is,
| (8) |
By substituting (7) into (6), we obtain the second derivative of with respect to at point :
| (9) |
that is,
| (10) |
Because the value of is small enough, the terms of in (5) can be neglected. So we can obtain an equation set:
| (11) |
By substituting (8) into (11), we obtain the following equation:
| (12) |
Due to and , (12) can be written as the following equation:
| (13) |
Substituting (10) into the second equation in equation set (13), we obtain the following equation:
| (14) |
The first equation in Eq. 13 is equivalent to the following equation:
| (15) |
that is,
| (16) |
By substituting (16) into (14), we obtain the following equation:
| (17) |
where and are constants. Eq. 17 is a function for Y and Z. For simplicity, we set , , , , and . Hence, Eq. 17 can be written as . We use multivariate linear regression to estimate the coefficients a and b, and then determine the causality between Y and Z.
On the contrary, we assume that gene Z regulates gene Y, and the gene–gene regulation can be expressed as a nonlinear equation set:
| (18) |
Following a process similar to the one above, we estimate parameter to measure the relationship between Z and Y. If gene Y truly regulates gene Z, then will be the regression coefficient of the linearization of model (17). Conversely, if gene Z truly regulates gene Y, then will be the regression coefficient of model (18). Consequently, we compare the value of with the value of . If the value of is larger than the value of , it indicates that gene Y regulates gene Z. Conversely, if the value of is larger than the value of , it indicates that gene Z regulates gene Y (Figures 2A,B). With the iterative computation of gene and , the global network is constructed.
Pseudocode of the DDTG algorithm
To describe the algorithm clearly, the pseudocode of the DDTG algorithm (see Algorithm 1) is provided in detail as follows:
Algorithm 1
DDTG.
1: Input: Gene expression data A.
2: Output: Inferred causal network G.
3: for each gene i do
4: Select the candidate downstream target genes for gene i based on matrix S. The number of the candidate genes is noted as N.
5: Separate the N candidate genes into two layers of downstream targets by CMI, that is, a direct target set and an indirect target set.
6: Determine the causalities among the genes within the direct target set and within the indirect target set by Taylor expansion.
7: end for
Result
To test the predictive power of the DDTG method, we used the DREAM3 yeast gene knockout datasets (Prill et al., 2010) with network sizes 10, 50, and 100 and two networks of size 10 from the DREAM4 datasets (Daniel et al., 2009). The gold standard networks were generated with nonlinear ODE systems in which the network structures were determined with detailed dynamics of both transcriptional and translational processes.
The predictions were evaluated by the following measures: sensitivity or true-positive rate (TPR), false-positive rate (FPR), positive predictive value (PPV), accuracy (ACC), and Matthews correlation coefficient (MCC). Mathematically, they are defined as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP},$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. TPR and FPR are also used to plot the receiver operating characteristic (ROC) curves, and the area under the ROC curve (AUC) is calculated.
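The measures named above can be computed from the four confusion counts with their standard definitions, as in the short Python sketch below. The function name `confusion_metrics` is hypothetical. The example counts (9 true positives, 9 false positives, 71 true negatives, 1 false negative) are not stated in the text; they are inferred here from the ARACNE row for the size 10 network in Table 1, assuming 10 true edges among the 90 possible directed edges of a 10-gene network.

```python
import math

def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def confusion_metrics(tp, fp, tn, fn):
    """Standard binary classification measures for predicted network edges."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": _safe_div(tp, tp + fn),                  # sensitivity
        "FPR": _safe_div(fp, fp + tn),
        "PPV": _safe_div(tp, tp + fp),                  # precision
        "ACC": _safe_div(tp + tn, tp + fp + tn + fn),
        "MCC": _safe_div(tp * tn - fp * fn, mcc_den),
    }

# Counts inferred (as an assumption) from the ARACNE row, size 10, in Table 1.
m = confusion_metrics(tp=9, fp=9, tn=71, fn=1)
```

With these counts, the computed values agree with the tabulated ARACNE entries up to the last reported digit.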
To validate the performance of DDTG, we compared it with several popular methods including LP (Wang et al., 2006), RO (Zhang et al., 2013), regression model-based LASSO (Geert et al., 2012), MI-based ARACNE (Margolin et al., 2006), and random forest-based GENIE3 (Van Anh et al., 2010), where the two alternatives with parameters “sqrt” and “all” in GENIE3 were considered here, as they performed best in the DREAM challenges. For a fair comparison, the default parameter values from the previously published articles were used when running these comparative algorithms. For example, the regularization parameters of methods LP and RO were set to one, the ensemble parameter of method GENIE3 was set to 1,000, the threshold of MI filtering in method NARROMI was set to 0.05, and the threshold of MI in method DDTG was set to 0.1.
Performance on DREAM3 benchmark data
The DREAM3 yeast gene knockout datasets with network sizes 10, 50, and 100 were used.
First, DDTG was applied to the yeast gene expression data with network size 10 and 10 samples. The comparison of DDTG with other methods is shown in Figure 3A, where DDTG outperforms the other methods with an AUC score of 1.000. From Figure 3A, we can see that all of the edges were detected. The performance of DDTG and other methods with respect to PPV, ACC, MCC, and AUC is shown in Table 1, where DDTG is superior to the other methods.
FIGURE 3.
ROC curves of several methods on networks with different sizes. The solid line with star points is the ROC curve of method DDTG. (A) The ROC curves on the network with size 10. (B) The ROC curves on the network with size 50. (C) The ROC curves on the network with size 100.
TABLE 1.
Comparison on networks with sizes 10, 50, and 100.
| Method | TPR | FPR | PPV | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Size 10 | | | | | | |
| LASSO | 0.600 | 0.837 | 0.082 | 0.211 | −0.191 | 0.703 |
| LP | 0.100 | 0.412 | 0.029 | 0.533 | −0.202 | 0.738 |
| RO | 0.100 | 0.500 | 0.024 | 0.456 | −0.252 | 0.798 |
| ARACNE | 0.900 | 0.112 | 0.500 | 0.888 | 0.618 | 0.930 |
| GENIE3_FR_sqrt | 0.700 | 0.112 | 0.437 | 0.867 | 0.483 | 0.919 |
| GENIE3_FR_all | 0.700 | 0.138 | 0.389 | 0.844 | 0.442 | 0.894 |
| NARROMI | 0.700 | 0.050 | 0.636 | 0.922 | 0.623 | 0.938 |
| DDTG | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Size 50 | | | | | | |
| LASSO | 0.351 | 0.129 | 0.081 | 0.855 | 0.113 | 0.711 |
| LP | 0.389 | 0.085 | 0.130 | 0.899 | 0.182 | 0.669 |
| RO | 0.494 | 0.131 | 0.109 | 0.857 | 0.181 | 0.727 |
| ARACNE | 0.597 | 0.082 | 0.192 | 0.908 | 0.303 | 0.832 |
| GENIE3_FR_sqrt | 0.481 | 0.078 | 0.167 | 0.908 | 0.245 | 0.843 |
| GENIE3_FR_all | 0.442 | 0.073 | 0.164 | 0.912 | 0.231 | 0.796 |
| NARROMI | 0.532 | 0.062 | 0.217 | 0.925 | 0.307 | 0.839 |
| DDTG | 0.779 | 0.063 | 0.284 | 0.931 | 0.445 | 0.856 |
| Size 100 | | | | | | |
| LASSO | 0.403 | 0.112 | 0.175 | 0.861 | 0.199 | 0.696 |
| LP | 0.129 | 0.017 | 0.305 | 0.935 | 0.169 | 0.581 |
| RO | 0.245 | 0.056 | 0.206 | 0.906 | 0.175 | 0.580 |
| ARACNE | 0.118 | 0.016 | 0.304 | 0.936 | 0.161 | 0.695 |
| GENIE3_FR_sqrt | 0.007 | 0.001 | 0.308 | 0.944 | 0.040 | 0.710 |
| GENIE3_FR_all | 0.053 | 0.006 | 0.337 | 0.941 | 0.115 | 0.665 |
| NARROMI | 0.138 | 0.014 | 0.364 | 0.939 | 0.197 | 0.696 |
| DDTG | 0.635 | 0.185 | 0.169 | 0.805 | 0.254 | 0.726 |
The best performer for each item is noted in bold. LASSO, a regression method; LP, a linear programming-based method; RO, a recursive optimization-based method; ARACNE, an MI-based method; GENIE3, a random forest-based method; NARROMI, a method based on RO and MI; DDTG, a method based on dissecting the downstream target nodes.
Second, DDTG was applied to the yeast gene expression data with network size 50 and 50 samples. The comparison of DDTG with other methods is shown in Figure 3B, where DDTG outperforms the other methods with an AUC score of 0.856. From Figure 3B, we can see that most of the edges were recovered. The performance of DDTG and other methods with respect to PPV, ACC, MCC, and AUC is shown in Table 1, where DDTG is superior to the other methods.
Third, the yeast gene expression data with network size 100 and 100 samples were used to evaluate DDTG and other methods. The ROC curves obtained by different methods are shown in Figure 3C, where DDTG outperforms the other methods with an AUC score of 0.726. Table 1 shows the results obtained by different methods with respect to distinct performance indices. On this network, DDTG achieves the highest TPR, MCC, and AUC, although its FPR is higher and its PPV and ACC are lower than those of the other methods.
Performance on DREAM4 benchmark data
The performance of network inference methods may strongly vary depending on the structural properties of the target networks. In order to assess the performance of DDTG predicting the topology on different target networks, DREAM4 datasets with size 10 were adopted here to evaluate our method.
While the DREAM3 benchmarks were of great value, there are some notable differences between the DREAM3 and DREAM4 datasets. First, all the networks in DREAM3 were acyclic, while the networks considered in DREAM4 do contain cycles. Furthermore, a deterministic model was used in DREAM3, while a stochastic one was used in DREAM4. Finally, both biological noise and experimental noise were added to the DREAM4 datasets (Andrea et al., 2010). The DREAM4 benchmarks consist of a set of networks with widely varying topologies. Two networks of size 10 from the DREAM4 in silico challenge were adopted here to test our method.
Figures 4A,B show the ROC curves of different methods on two different networks of size 10 from the DREAM4 challenge. From the figures, we can see that the performance of DDTG is superior to that of the other methods, with AUC values of 0.862 and 0.761. Table 2 summarizes the results obtained by different methods with respect to distinct performance indices. From Table 2, we can see that DDTG performs better than the other methods on most indices. In particular, the accuracy of DDTG remains high on the DREAM4 datasets. By contrast, the other methods perform better on the DREAM3 datasets, and their performance degrades markedly on the DREAM4 datasets.
FIGURE 4.
ROC curves of several methods on two networks of size 10 from the DREAM4 datasets. The solid line with star points is the ROC curve of method DDTG. (A) The ROC curves on network_1 from DREAM4 datasets with size 10. (B) The ROC curves on network_2 from DREAM4 datasets with size 10.
TABLE 2.
Comparison on networks from DREAM4 datasets with size 10.
| Method | TPR | FPR | PPV | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Size 10_1 | | | | | | |
| LASSO | 0.533 | 0.720 | 0.129 | 0.322 | −0.150 | 0.584 |
| LP | 0.467 | 0.240 | 0.280 | 0.711 | 0.189 | 0.627 |
| RO | 0.467 | 0.373 | 0.200 | 0.600 | 0.071 | 0.492 |
| ARACNE | 0.467 | 0.147 | 0.389 | 0.789 | 0.298 | 0.648 |
| GENIE3_FR_sqrt | 0.333 | 0.160 | 0.294 | 0.756 | 0.165 | 0.668 |
| GENIE3_FR_all | 0.333 | 0.147 | 0.313 | 0.767 | 0.182 | 0.667 |
| NARROMI | 0.333 | 0.120 | 0.357 | 0.789 | 0.219 | 0.630 |
| DDTG | 0.800 | 0.080 | 0.667 | 0.900 | 0.671 | 0.862 |
| Size 10_2 | | | | | | |
| LASSO | 0.813 | 0.757 | 0.188 | 0.344 | 0.050 | 0.544 |
| LP | 0.250 | 0.297 | 0.154 | 0.622 | −0.040 | 0.566 |
| RO | 0.375 | 0.378 | 0.177 | 0.578 | −0.003 | 0.546 |
| ARACNE | 0.313 | 0.149 | 0.313 | 0.756 | 0.164 | 0.573 |
| GENIE3_FR_sqrt | 0.188 | 0.189 | 0.177 | 0.700 | −0.002 | 0.501 |
| GENIE3_FR_all | 0.188 | 0.216 | 0.158 | 0.678 | −0.027 | 0.558 |
| NARROMI | 0.313 | 0.149 | 0.313 | 0.756 | 0.164 | 0.573 |
| DDTG | 0.687 | 0.149 | 0.500 | 0.822 | 0.479 | 0.761 |
The best performer for each item is noted in bold. LASSO, a regression method; LP, a linear programming-based method; RO, a recursive optimization-based method; ARACNE, an MI-based method; GENIE3, a random forest-based method; NARROMI, a method based on RO and MI; DDTG, a method based on dissecting the downstream target nodes.
Figure 5 shows the performance of the compared methods on the DREAM3 dataset with size 10 and the two networks from the DREAM4 datasets with size 10. From Figure 5, we can see that the performance of DDTG varies less across datasets than that of the other methods. This indicates that DDTG is more robust than the other methods on different networks.
FIGURE 5.
AUC comparison of methods on different networks with size 10.
Discussion
In this article, we proposed a novel method, DDTG, to reconstruct GRNs from gene knockout data. Our method can also be applied to infer regulatory networks if data from gene knockdown or over-expression experiments are provided. The algorithm includes two steps. In the first step, the downstream targets are identified by comparing relative change values. In the second step, the hierarchy structure of the downstream targets is determined using CMI and MI. The results show that DDTG is the best performer on most measures on the benchmark datasets. The good performance of DDTG may be attributed to the following factors. Genes whose steady-state values change after gene knockout can be immediately recognized by comparing the relative change values, which can improve the accuracy of network reconstruction. Meanwhile, due to the sparseness of GRNs, the downstream targets consist of a small number of nodes, which helps to reduce the redundant edges. Moreover, we assign a weight to the relative change values using a sigmoid function. The parameters of the weight function depend on the expression level of each gene. This can reduce the noise for each gene (higher noise for a higher expression level). As a result, the accuracy of DDTG remains high on the DREAM4 datasets, where the other methods perform poorly.
Furthermore, we construct gene–gene regulations using the Taylor formula at the steady-state levels of the wild type, and, for the first time, we use linear regression to determine the causal relationship between genes in the same layer. Finally, we infer the causal structure of GRNs using CMI and MI. Our method has the advantages of machine learning-based methods, such as making no explicit mechanistic assumptions and being more computationally efficient.
Despite the advantages of DDTG, there are also limitations: DDTG is strongly dependent on the accuracy of identifying the downstream target nodes. Spurious downstream target nodes inevitably result in spurious edges. For instance, the FPR of DDTG on the dataset of size 100 is higher than that of the other methods in Table 1. A technique to filter out the spurious nodes and remove their impact may improve the performance of DDTG and will be considered in future work.
Conclusion
We proposed a novel method, namely, DDTG, to improve the accuracy of GRN inference by dissecting the downstream target nodes. In this algorithm, the downstream targets for each gene are identified by comparing the relative change values. Furthermore, the causal structure of downstream targets is determined by CMI and MI. In particular, we use a weight function to reduce the noise for each regulator and determine the causality between nodes in the same layer using the Taylor formula and linear regression. The method was validated on the benchmark GRNs from the DREAM challenges. The results confirmed the effectiveness of our method, which outperformed previous methods on most benchmarks.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.
Author contributions
ZJ and XZ conceived and designed the experiments. ZJ performed the experiments. ZJ analyzed the data. ZJ and XZ wrote the article. All authors contributed to the article and approved the submitted version.
Funding
This work was supported by the grants from the National Natural Science Foundation of China [32070682], the Technology Innovation Zone Project [1816315XJ00100216], and the CAS Pioneer Hundred Talents Program.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
- Aghdam R., Ganjali M., Zhang X., Eslahchi C. (2015). CN: A consensus algorithm for inferring gene regulatory networks using the SORDER algorithm and conditional mutual information test. Mol. Biosyst. 11, 942–949. 10.1039/c4mb00413b [DOI] [PubMed] [Google Scholar]
- Ahmed S. S., Roy S., Kalita J. K. (2018). Assessing the effectiveness of causality inference methods for gene regulatory networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 56–70. 10.1109/TCBB.2018.2853728 [DOI] [PubMed] [Google Scholar]
- Algabri Y. A., Li L., Liu Z-P. (2022). scGENA: A single-cell gene coexpression network analysis framework for clustering cell types and revealing biological mechanisms. Bioengineering 9, 353. 10.3390/bioengineering9080353 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Altay G., Emmert-Streib F. (2010). Revealing differences in gene network inference algorithms on the network level by ensemble methods. Bioinformatics 26, 1738–1744. 10.1093/bioinformatics/btq259 [DOI] [PubMed] [Google Scholar]
- Andrea P., Nicola S., Alberto D. L. F. (2010). From knock-outs to networks: Establishing direct cause-effect relationships through graph analysis. PloS one 5, e12912. 10.1371/journal.pone.0012912 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baruch B., Albert-László B. (2013). Network link prediction by global silencing of indirect correlations. Nat. Biotechnol. 31, 720–725. 10.1038/nbt.2601 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Butte A. J., Kohane I. S. (2000). Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput. 2000, 418–429. Epub 2000/07/21. 10.1142/9789814447331_0040 [DOI] [PubMed] [Google Scholar]
- Chen G., Liu Z. (2022). Inferring causal gene regulatory network via GreyNet: From dynamic grey association to causation. Front. Bioeng. Biotechnol. 10, 954610. 10.3389/fbioe.2022.954610 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daniel M., Prill R. J., Thomas S., Claudio M., Dario F., Gustavo S. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. U. S. A. 107, 6286–6291. 10.1073/pnas.0913357107 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Daniel M., Thomas S., Claudio M., Dario F. (2009). Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol. 16, 229–239. 10.1089/cmb.2008.09TT [DOI] [PubMed] [Google Scholar]
- Deng Z., Zhang J., Li J., Zhang X. (2021). Application of deep learning in plant–microbiota association analysis. Front. Genet. 12, 697090. 10.3389/fgene.2021.697090 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Finkle J. D., Wu J. J., Bagheri N. (2018). Windowed Granger causal inference strategy improves discovery of gene regulatory networks. Proc. Natl. Acad. Sci. U. S. A. 115, 2252–2257. 10.1073/pnas.1710936115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gardner T. S., Di Bernardo D., Lorenz D., Collins J. J. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105. 10.1126/science.1081900 [DOI] [PubMed] [Google Scholar]
- Geert G., van Kesteren R. E., Smit A. B., de Gunst M. C. (2012). Identification of context-specific gene regulatory networks with GEMULA--gene expression modeling using LAsso. Bioinformatics 28, 214–221. 10.1093/bioinformatics/btr641 [DOI] [PubMed] [Google Scholar]
- Honkela A., Girardot C., Gustafson E. H., Liu Y-H., Furlong E. E., Lawrence N. D., et al. (2010). Model-based method for transcription factor target identification with limited data. Proc. Natl. Acad. Sci. U. S. A. 107, 7793–7798. 10.1073/pnas.0914285107 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hughes T. R., Marton M. J., Jones A. R., Roberts C. J., Stoughton R., Armour C. D., et al. (2000). Functional discovery via a compendium of expression profiles. Cell. 102, 109–126. 10.1016/s0092-8674(00)00015-5 [DOI] [PubMed] [Google Scholar]
- Hurley D., Araki H., Tamada Y., Dunmore B., Sanders D., Humphreys S., et al. (2011). Gene network inference and visualization tools for biologists: Application to new human transcriptome datasets. Nucleic Acids Res. 40, 2377–2398. 10.1093/nar/gkr902 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang X., Zhang X. (2022). Rsnet: Inferring gene regulatory networks by a redundancy silencing and network enhancement technique. BMC Bioinforma. 23, 165–218. 10.1186/s12859-022-04696-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang Y. J., Aerne B. L., Smithers L., Haddon C., Ish-Horowicz D., Lewis J. (2000). Notch signalling and the synchronization of the somite segmentation clock. Nature 408, 475–479. 10.1038/35044091 [DOI] [PubMed] [Google Scholar]
- Karlebach G., Shamir R. (2008). Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell. Biol. 9, 770–780. 10.1038/nrm2503 [DOI] [PubMed] [Google Scholar]
- Khatamian A., Paull E. O., Califano A., Yu J. (2018). SJARACNe: A scalable software tool for gene network reverse engineering from big data. Bioinformatics 35, 2165–2166. 10.1093/bioinformatics/bty907 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kinney J. B., Atwal G. S. (2014). Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. U. S. A. 111, 3354–3359. 10.1073/pnas.1309933111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Krouk G., Lingeman J., Colon A. M., Coruzzi G., Shasha D. (2013). Gene regulatory networks in plants: Learning causality from time and perturbation. Genome Biol. 14, 123. 10.1186/gb-2013-14-6-123 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu J., Dumitrascu B., McDowell I. C., Jo B., Barrera A., Hong L. K., et al. (2021). Causal network inference from gene transcriptional time-series response to glucocorticoids. PLoS Comput. Biol. 17, e1008223. 10.1371/journal.pcbi.1008223 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Madar A., Greenfield A., Ostrer H., Vanden-Eijnden E., Bonneau R. (2009). The inferelator 2.0: A scalable framework for reconstruction of dynamic regulatory network models. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2009, 5448–5451. 10.1109/IEMBS.2009.5334018 [DOI] [PubMed] [Google Scholar]
- Margolin A. A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Favera R. D., et al. (2006). Aracne: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma. 7, S7. 10.1186/1471-2105-7-S1-S7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Modi S. R., Camacho D. M., Kohanski M. A., Walker G. C., Collins J. J. (2011). Functional characterization of bacterial sRNAs using a network biology approach. Proc. Natl. Acad. Sci. U. S. A. 108, 15522–15527. 10.1073/pnas.1104318108 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nagoshi E., Saini C., Bauer C., Laroche T., Naef F., Schibler U. (2004). Circadian gene expression in individual fibroblasts : Cell-autonomous and self-sustained oscillators pass time to daughter cells. Cell. 119, 693–705. 10.1016/j.cell.2004.11.015 [DOI] [PubMed] [Google Scholar]
- Prill R. J., Daniel M., Julio S. R., Sorger P. K., Alexopoulos L. G., Xiaowei X., et al. (2010). Towards a rigorous assessment of systems biology models: The DREAM3 challenges. PloS one 5, e9202. 10.1371/journal.pone.0009202 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Reshef D. N., Reshef Y. A., Finucane H. K., Grossman S. R., McVean G., Turnbaugh P. J., et al. (2011). Detecting novel associations in large data sets. science 334, 1518–1524. 10.1126/science.1205438 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Riet D. S., Kathleen M. (2010). Advantages and limitations of current network inference methods. Nat. Rev. Microbiol. 8, 717–729. 10.1038/nrmicro2419 [DOI] [PubMed] [Google Scholar]
- Rubin A. J., Parker K. R., Satpathy A. T., Qi Y., Wu B., Ong A. J., et al. (2019). Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell. 176, 361–376. 10.1016/j.cell.2018.11.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Studham M. E., Tjärnberg A., Nordling T. E., Nelander S., Sonnhammer E. L. (2014). Functional association networks as priors for gene regulatory network inference. Bioinformatics 30, i130–i138. 10.1093/bioinformatics/btu285 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tibshirani R. (2011). Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol. 73, 273–282. 10.1111/j.1467-9868.2011.00771.x [DOI] [Google Scholar]
- Van Anh H. T., Alexandre I., Louis W., Pierre G. J. P. O. (2010). Inferring regulatory networks from expression data using tree-based methods. PloS one 5, 4439–4451. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang T., Peng H., Cao Y., Xu J., Xiong Y., Liu K., et al. (2022). Dynamic network biomarker analysis reveals the critical phase transition of fruit ripening in grapevine. Genes. 13, 1851. 10.3390/genes13101851 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Y., Joshi T., Zhang X-S., Xu D., Chen L. (2006). Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 22, 2413–2420. 10.1093/bioinformatics/btl396 [DOI] [PubMed] [Google Scholar]
- Wang Y., Liu Z-P. (2022). Identifying biomarkers for breast cancer by gene regulatory network rewiring. BMC Bioinforma. 22, 308–315. 10.1186/s12859-021-04225-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang B., Bao W., Chen B. (2022). Pgrnig: Novel parallel gene regulatory network identification algorithm based on GPU. Brief. Funct. Genomics 2022, elac028. 10.1093/bfgp/elac028 [DOI] [PubMed] [Google Scholar]
- Yang B., Bao W., Zhang W., Wang H., Song C., Chen Y., et al. (2021). Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model. BMC Bioinforma. 22, 448–519. 10.1186/s12859-021-04367-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zak D. E., Gonye G. E., Schwaber J. S., Doyle F. J. (2003). Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: Insights from an identifiability analysis of an in silico network. Genome Res. 13, 2396–2405. 10.1101/gr.1198103 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang A., Xiong Y., Fang J., Jiang X., Wang T., Liu K., et al. (2022). Diversity and functional evolution of terpene synthases in rosaceae. Plants 11, 736. 10.3390/plants11060736 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X., Liu K., Liu Z. P., Duval B., Richer J. M., Zhao X. M., et al. (2013). Narromi: A noise and redundancy reduction technique improves accuracy of gene regulatory network inference. Bioinformatics 29, 106–113. 10.1093/bioinformatics/bts619 [DOI] [PubMed] [Google Scholar]
- Zhang X., Zhao J., Hao J. K., Zhao X. M., Chen L. (2015). Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks. Nucleic Acids Res. 43, e31. 10.1093/nar/gku1315 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X., Zhao X-M., He K., Lu L., Cao Y., Liu J., et al. (2011). Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics 28, 98–104. 10.1093/bioinformatics/btr626 [DOI] [PubMed] [Google Scholar]
- Zhang X-F., Ou-Yang L., Yan H. (2017). Incorporating prior information into differential network analysis using non-paranormal graphical models. Bioinformatics 33, 2436–2445. 10.1093/bioinformatics/btx208 [DOI] [PubMed] [Google Scholar]
- Zhao J., Zhou Y., Zhang X., Chen L. (2016). Part mutual information for quantifying direct associations in networks. Proc. Natl. Acad. Sci. U. S. A. 113, 5130–5135. 10.1073/pnas.1522586113 [DOI] [PMC free article] [PubMed] [Google Scholar]