Inference of gene networks using gene expression data with applications

Chi-Kan Chen

doi:10.1016/j.heliyon.2024.e26065

Heliyon. 2024 Feb 9;10(5):e26065. doi: 10.1016/j.heliyon.2024.e26065

PMCID: PMC10915353 PMID: 38449656

Abstract

Gene networks (GNs) use graphs to represent the interaction relationships between genes. Large-scale GNs are often sparse and contain hub genes that interact with many other genes. In this paper, we propose a novel method called NetARD, which utilizes Automatic Relevance Determination (ARD) to estimate partial correlations, to infer GNs with the hub genes from gene expression data. We test NetARD on simulated GNs and in silico GNs, and it outperforms existing methods. In our high-throughput gene expression data analysis, we integrate the NetARD into a method called GN Co-expression Extension (GNCE). This approach infers the GNs of co-expressed genes, with genes from a predefined GN serving as hub genes. We validate this approach by extending the core GN of transcription factor genes of E. coli using microarray data. In an application example, we identify biological process (BP) Gene Ontology (GO) terms that are significantly involved in cancer progression. This task is accomplished by analyzing the GN inferred through GNCE using the core GN associated with the colorectal cancer pathway and RNA-seq data.

Keywords: ARD, Cancer, Co-expression, Gene expression, Hub, Inference, Network, Partial correlation

Highlights

• The NetARD uses the ARD to estimate partial correlations between genes to infer GNs with hubs from gene expression data.
• When tested on simulated and in silico GNs, the NetARD method demonstrates superior performance compared to existing methods.
• The GNCE uses the NetARD to infer GNs for co-expressed genes using genes from a pre-defined core GN as hub genes.
• Validation of the GNCE involves extending the core GN of transcription factor genes in E. coli using microarray data.
• The GNCE is applied to extend the predefined core GN to identify GO terms related to colorectal cancer using RNA-seq data.

1. Introduction

Genes in a living cell interact with each other through their expression products. A gene network (GN) represents these interactions as a graph whose nodes correspond to genes; undirected edges represent interactions between genes, and directed edges represent causal regulatory relationships. The identification of GNs is crucial for gaining insights into cellular functions and disease-associated biological processes. The results obtained from studying GNs have important applications in the fields of cellular biology and medicine [1]. With the advancements in DNA microarray and RNA-seq technologies, it is now possible to measure the mRNA expression levels of thousands of genes in an experiment, thereby capturing information about gene interactions. Mathematical or statistical networks have been employed as models to infer GNs using gene expression data. For a comprehensive overview of model-based GN inference methods, see the reviews [2,3,4].

GNs involved in biological processes are often sparse: individual genes typically interact with only a limited number of other genes. However, these GNs also contain hub genes that interact with a large number of genes. A prime example is the genes encoding transcription factors (TFs), which can regulate the expression of numerous genes through their TF products. It has also been noted that large-scale GNs often exhibit hub genes in the presence of scale-free topology [5]. Hub genes, which are central to these GNs, play a critical role in cell survival and are potential targets for drug development. In recent years, network model-based methods have been developed to infer GNs with hub genes or scale-free topology from high-throughput gene expression data, e.g., correlation and weighted correlation networks [6,7], partial correlation networks or Gaussian graphical models (GGMs) [8,9], and vector autoregressions and regression-based dynamic Bayesian networks [10,11,12]. These methods employ diverse regularization techniques to incorporate desired topological properties into the inferred GNs. However, inferring large-scale GNs remains challenging due to the intricate nature of gene interactions. With a vast number of genes potentially interacting directly or indirectly, unraveling these complex interactions becomes increasingly difficult.

In this paper, we utilize Automatic Relevance Determination (ARD) [13], a class of sparse Bayesian regressions. Our approach, called NetARD, uses the ARD to estimate partial correlations and thereby infers GNs with hub genes from gene expression data. To validate the NetARD, we compare its performance against published methods in the inference of small-scale simulated GNs featuring hub genes, as well as in silico GNs. In our analysis of high-throughput gene expression data, we begin with a predefined core GN and extend it using the GN Co-expression Extension (GNCE) method, which incorporates co-expressed genes that interact with the core GN genes. In this process, we utilize the NetARD to infer the GNs of co-expressed genes, with genes from the predefined core GN serving as hub genes. To evaluate the effectiveness of GNCE, we apply it to microarray data to extend the transcription factor GN of E. coli. In our application example, we identify the Gene Ontology (GO) terms significantly associated with colorectal cancer progression. This task is achieved by analyzing the GN that is extended via the GNCE from the core GN associated with the colorectal cancer pathway using RNA-seq data.

2. Methods

2.1. Partial correlation network model

Let us consider the continuous random vector $X = {(X_{1}, \dots, X_{p})}^{T} \in R^{p \times 1}$ that contains the expression levels of genes $1, \dots, p$. We assume that the mean vector $E (X) = 0_{p \times 1} \in R^{p \times 1}$ and the covariance matrix $Cov (X) = Σ \in R^{p \times p}$, where $Σ$ is positive-definite, denoted as $Σ ≻ 0$. Let us denote by $X^{\ k^{'} k} \in R^{(p - 2) \times 1}$ the subvector of $X$ that excludes $X_{k^{'}}$, $X_{k}$ $(k^{'} \neq k)$. Let $\sum_{j \neq k^{'}, k} c_{j}^{(k)} X_{j}$ be the optimal linear regression of $X_{k}$ on $X^{\ k^{'} k}$, that is, the one with the minimum residual variance $Var (ε_{k^{'}}^{(k)})$ among all possible linear regressions of $X_{k}$ on $X^{\ k^{'} k}$, where $ε_{k^{'}}^{(k)} = X_{k} - \sum_{j \neq k^{'}, k} c_{j}^{(k)} X_{j}$. The correlation $π_{k^{'} k} = Corr (ε_{k}^{(k^{'})}, ε_{k^{'}}^{(k)}) \in [- 1, 1]$ defines the partial correlation between $X_{k^{'}}$ and $X_{k}$. It reveals the direct association between $X_{k^{'}}$ and $X_{k}$ after accounting for the effects of genes in $X^{\ k^{'} k}$. Let us denote by $ω_{k^{'} k}$ the $k^{'}, k$ entry of the precision matrix $Ω = Σ^{- 1}$. It holds that

$$π_{k^{'} k} = - \frac{ω_{k^{'} k}}{\sqrt{ω_{k^{'} k^{'}} ω_{k k}}} \tag{1}$$

Moreover, assume $\sum_{j \neq k} b_{j}^{(k)} X_{j}$ is the optimal linear regression of $X_{k}$ on $X^{\ k}$ . It holds that

$$b_{j}^{(k)} = - \frac{ω_{j k}}{ω_{k k}} \quad (j \neq k) \tag{2}$$

Both Eqs. (1), (2) can be verified by utilizing the partitions of $Σ$ and $Ω$ , along with the formulas for optimal regression coefficient vectors expressed in terms of submatrices of $Σ$ . For instance, a reference for the derivation of Eq. (2) can be found in Eqs. (17.17)–(17.9) of [14]. It follows that

$$π_{k^{'} k} = sign (b_{k^{'}}^{(k)}) \sqrt{b_{k^{'}}^{(k)} b_{k}^{(k^{'})}} \tag{3}$$

Let the undirected graph $G = (V, E)$ be a representation of the GN, where $V$ contains nodes corresponding to $X_{k}$ $(k = 1, \dots, p)$ and $E \subseteq V \times V$ comprises undirected edges connecting nodes in $V$ . The edge $(k^{'}, k)$ connecting nodes $k^{'}$ , $k$ signifies the interaction between gene $k^{'}$ and gene $k$ . Let $W \in R^{p \times p}$ be the weighted adjacency matrix of $G$ , where the $k^{'}, k$ entry $w_{k^{'} k} = | π_{k^{'} k} | \in [0, 1]$ if $k^{'} \neq k$ and 0 if $k^{'} = k$ . An edge $(k^{'}, k) \in E$ if $w_{k^{'} k} > 0$ , $\notin E$ if $w_{k^{'} k} = 0$ . Moreover, if $X \sim N_{p} (0_{p \times 1}, Σ)$ then $w_{k^{'} k} > 0$ if and only if $X_{k^{'}}$ , $X_{k}$ are conditionally dependent given $X^{\ k^{'}, k}$ . $G$ represents the structure of GGM in which $(k^{'}, k) \in E$ is often interpreted as the direct interaction between gene $k^{'}$ and gene $k$ . Gene $k^{'}$ is a regulator of gene $k$ and/or vice versa.
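
As an illustration of how the weighted adjacency matrix $W$ follows from a precision matrix, the sketch below applies the standard identities $π_{k^{'} k} = - ω_{k^{'} k} / \sqrt{ω_{k^{'} k^{'}} ω_{k k}}$ and $b_{k^{'}}^{(k)} = - ω_{k^{'} k} / ω_{k k}$ to a hypothetical 3-gene chain. The matrix `omega` and the function names are assumptions made for this example only; they are not taken from the paper's software.

```python
import math

def partial_correlations(omega):
    """Partial correlations implied by a precision matrix (zero diagonal)."""
    p = len(omega)
    pi = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                pi[i][j] = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
    return pi

def regression_coefficient(omega, j, k):
    """Coefficient of X_j in the optimal linear regression of X_k on all other genes."""
    return -omega[j][k] / omega[k][k]

# Hypothetical positive-definite precision matrix of the chain 1 - 2 - 3.
omega = [[2.0, -1.0, 0.0],
         [-1.0, 2.0, -1.0],
         [0.0, -1.0, 2.0]]

pi = partial_correlations(omega)
W = [[abs(v) for v in row] for row in pi]    # weighted adjacency matrix

# The product of the two node-wise regression coefficients equals the
# squared partial correlation between the two genes.
b_12 = regression_coefficient(omega, 0, 1)   # gene 1 as predictor of gene 2
b_21 = regression_coefficient(omega, 1, 0)   # gene 2 as predictor of gene 1
print(pi[0][1], b_12 * b_21, W[0][2])
```

In this toy graph, genes 1 and 3 have zero partial correlation, so no edge joins them even though they are marginally correlated through gene 2.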

2.2. p-node ARD

Assume that $X = (X^{(1)}, \dots, X^{(p)}) \in R^{N \times p}$ is the data matrix that contains $N$ i.i.d. samples of $X$. Let us denote by $X^{\ k} \in R^{N \times (p - 1)}$ the submatrix of $X$ excluding the column $X^{(k)}$. The linear regression of $X^{(k)}$ on $X^{\ k}$ is given as $X^{\ k} b^{(k)}$, where $b^{(k)} = {(b_{k^{'}}^{(k)} |_{k^{'} \neq k})}^{T} \in R^{(p - 1) \times 1}$. Let $ε_{k} = X^{(k)} - X^{\ k} b^{(k)}$. Assume that $P_{k} (ε_{k} | b^{(k)}, β) \sim N_{N} (0_{N \times 1}, β^{- 1} I_{N \times N})$, where $β > 0$ and $I_{N \times N} \in R^{N \times N}$ is the identity matrix. The likelihood function of $b^{(k)}$, $β$ given $X$ is defined as $L_{k} (b^{(k)}, β | X) = P_{k} (ε_{k} | b^{(k)}, β)$. In the Bayesian estimation framework, the ARD assumes that $b^{(k)}$ follows a 2-level prior [13]. At the first level, the prior $P_{k} (b^{(k)} | α^{(k)}) \sim N_{p - 1} (0_{(p - 1) \times 1}, D_{α^{(k)}}^{- 1})$, where $D_{α^{(k)}} \in R^{(p - 1) \times (p - 1)}$ is a diagonal matrix with $α^{(k)} = {(α_{k^{'}}^{(k)} |_{k^{'} \neq k})}^{T} \in R_{+}^{(p - 1) \times 1}$ on the diagonal. At the second level, the prior $P_{k} (α^{(k)} | s^{(k)}, r^{(k)}) = \prod_{k^{'} \neq k} g (α_{k^{'}}^{(k)} | s_{k^{'}}^{(k)}, r_{k^{'}}^{(k)})$, where $s^{(k)} = {(s_{k^{'}}^{(k)} |_{k^{'} \neq k})}^{T}$ and $r^{(k)} = {(r_{k^{'}}^{(k)} |_{k^{'} \neq k})}^{T} \in R_{+}^{(p - 1) \times 1}$, and $g (t | s, r) = \frac{r^{s}}{Γ (s)} t^{s - 1} e^{- r t}$ $(t > 0)$ is the Gamma probability density function. Additionally, we assume the prior $P (β | s_{β}, r_{β}) = g (β | s_{β}, r_{β})$. Under these assumptions, we formulate the pARD as follows.

From the rule of conditional probability, it follows that

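$$L_{k} (b^{(k)}, β | X) P_{k} (b^{(k)} | α^{(k)}) = P_{k} (b^{(k)} | X, α^{(k)}, β) L_{k} (α^{(k)}, β | X) \tag{4}$$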

Here, $P_{k} (b^{(k)} | X, α^{(k)}, β)$ represents the conditional posterior probability density function of $b^{(k)}$ and $L_{k} (α^{(k)}, β | X)$ the marginal likelihood function of regression model specified by $α^{(k)}$ , $β$ . The conditional posterior probability density function $P_{k} (α^{(k)}, β | X, s^{(k)}, r^{(k)}, s_{β}, r_{β})$ of regression model specified by $α^{(k)}$ , $β$ is proportional to the product of $L_{k} (α^{(k)}, β | X)$ and Gamma priors over $α^{(k)}$ , $β$ . By combining $p$ ARDs, the p-node ARD (pARD) solves the maximization problem of the objective function

Equation 5.

(5)

Namely, for each fixed $Θ = (s^{(1)}, r^{(1)}, \dots, s^{(p)}, r^{(p)}, s_{β}, r_{β})$, the optimization seeks the mode $({\tilde{α}}^{(1)}, \dots, {\tilde{α}}^{(p)}, \tilde{β})$ of the joint conditional posterior distribution $\prod_{k = 1}^{p} P_{k} (α^{(k)}, β | X, s^{(k)}, r^{(k)}, s_{β}, r_{β})$. The solution to the pARD determines a set of Bayesian regression models. According to Eq. (A1) in the Appendix, $P_{k} (b^{(k)} | X, {\tilde{α}}^{(k)}, \tilde{β})$ follows the multivariate normal distribution and its mean vector is given by

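$${\tilde{b}}^{(k)} = {\left( {(X^{\ k})}^{T} X^{\ k} + \frac{1}{\tilde{β}} D_{{\tilde{α}}^{(k)}} \right)}^{- 1} {(X^{\ k})}^{T} X^{(k)} \tag{6}$$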

This mean vector represents the Bayes estimate of $b^{(k)}$, which coincides with the weighted ridge estimate of $b^{(k)}$ with the regularization weights in $\frac{{\tilde{α}}^{(k)}}{\tilde{β}}$ for $k = 1, \dots, p$. Algorithm pARD, which generates ${\tilde{α}}^{(1)}, \dots, {\tilde{α}}^{(p)}$, $\tilde{β}$, ${\tilde{b}}^{(1)}, \dots, {\tilde{b}}^{(p)}$ simultaneously, is described in Section A1 in the Appendix.
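
The weighted ridge form of the Bayes estimate can be sketched as follows, assuming the standard closed form in which each coefficient $b_{j}$ carries its own penalty weight (here playing the role of an entry of ${\tilde{α}}^{(k)} / \tilde{β}$). The toy data `Z`, `y` and the helper names `solve` and `weighted_ridge` are hypothetical and serve only to show how a large weight shrinks the corresponding coefficient toward zero.

```python
def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def weighted_ridge(Z, y, weights):
    """Minimise ||y - Z b||^2 + sum_j weights[j] * b_j^2."""
    n, q = len(Z), len(Z[0])
    gram = [[sum(Z[i][a] * Z[i][c] for i in range(n)) for c in range(q)]
            for a in range(q)]
    for a in range(q):
        gram[a][a] += weights[a]          # add the per-coefficient penalty
    rhs = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(q)]
    return solve(gram, rhs)

# Hypothetical predictors (columns of Z) and response y.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

b_unpenalised = weighted_ridge(Z, y, [0.0, 0.0])
b_penalised = weighted_ridge(Z, y, [1000.0, 0.0])   # heavy penalty on the first coefficient
print(b_unpenalised, b_penalised)
```

A large weight on one coefficient effectively removes that predictor from the regression, which is the mechanism by which the ARD produces sparse node-wise regressions.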

For $k^{'} \neq k$ , we compute the plug-in estimate ${\tilde{π}}_{k^{'} k}$ of $π_{k^{'} k}$ using ${\tilde{b}}_{k^{'}}^{(k)}$ , ${\tilde{b}}_{k}^{(k^{'})}$ and Eq. (3). As $0 \leq {\tilde{b}}_{k^{'}}^{(k)} {\tilde{b}}_{k}^{(k^{'})} \leq 1$ is not guaranteed, we define ${\tilde{ρ}}_{k^{'} k} = sign ({\tilde{b}}_{k^{'}}^{(k)}) \sqrt{{\tilde{b}}_{k^{'}}^{(k)} {\tilde{b}}_{k}^{(k^{'})}}$ if ${\tilde{b}}_{k^{'}}^{(k)} {\tilde{b}}_{k}^{(k^{'})} > 0$ and $0$ otherwise, and ${\tilde{π}}_{k^{'} k} = - 1$ , ${\tilde{ρ}}_{k^{'} k}$ , $1$ if ${\tilde{ρ}}_{k^{'} k}$ is below $- 1$ , in the interval $[- 1, 1]$ , above $1$ , respectively. Accordingly, ${\tilde{w}}_{k^{'} k} = | {\tilde{π}}_{k^{'} k} |$ if $k^{'} \neq k$ , 0 if $k^{'} = k$ .
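
The plug-in rule described above can be written as a small function. This is a minimal sketch of the stated definition; the function name `plug_in_partial_correlation` and the test values are illustrative assumptions.

```python
import math

def plug_in_partial_correlation(b_kp_k, b_k_kp):
    """Plug-in estimate from the two node-wise coefficients.

    b_kp_k: estimated coefficient of gene k' in the regression of gene k.
    b_k_kp: estimated coefficient of gene k in the regression of gene k'.
    """
    product = b_kp_k * b_k_kp
    if product <= 0:
        return 0.0                        # opposite signs or a zero coefficient
    rho = math.copysign(math.sqrt(product), b_kp_k)
    return max(-1.0, min(1.0, rho))       # clip to the interval [-1, 1]

def edge_weight(b_kp_k, b_k_kp):
    """Edge weight: absolute value of the plug-in partial correlation."""
    return abs(plug_in_partial_correlation(b_kp_k, b_k_kp))

print(edge_weight(0.5, 0.5), edge_weight(0.5, -0.2), edge_weight(-2.0, -1.0))
```

The second call shows a pair with conflicting signs, which receives weight 0; the third shows clipping when the product of the coefficients exceeds 1.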

2.3. Adaptive estimation and NetARD

In the adaptive estimation approach, the hyper-parameters within $Θ$ serve to regularize $\tilde{W} = ({\tilde{w}}_{k^{'} k})$ by regularizing ${\tilde{b}}^{(k)}$ $(k = 1, \dots, p)$ generated by the pARD algorithm. Let $s_{β}$, $r_{β}$ be fixed. This approach alternates between updating $\tilde{W}$ and fine-tuning $(s^{(1)}, r^{(1)}, \dots, s^{(p)}, r^{(p)})$ over multiple iterations. Let us assume that $\tilde{W}$ is computed in one iteration of this estimation process. For $k = 1, \dots, p$, we compute ${\tilde{v}}_{k} = \frac{1}{p} \sum_{j = 1}^{p} {\tilde{w}}_{j k} \in [0, 1]$. In the subsequent iteration, we update $r^{(k)}$, $s^{(k)}$ according to the following rules:

Equation 7.

(7)

Here, $λ > 0$, $ξ \in [0, 1]$ are fixed constants, the function $φ (t)$ is a strictly increasing function of $t$ that maps $[0, 1]$ into $(0, \infty)$, and the symbol $\circ$ denotes the element-wise product. With the elements of $s^{(k)}$ greater than $\frac{1}{2}$ for $k = 1, \dots, p$, Algorithm pARD generates ${\tilde{b}}^{(1)}, \dots, {\tilde{b}}^{(p)}$ using the updated $Θ$. Note that $r^{(k)}$, $\frac{s_{β}}{r_{β}}$ are the prior means of $α^{(k)}$, $β$, respectively. Additionally, $\frac{{\tilde{α}}^{(k)}}{\tilde{β}}$ is the regularization weight vector of ${\tilde{b}}^{(k)}$. Indeed, we can view $\frac{r^{(k)} r_{β}}{s_{β}}$ as a form of “prior” regularization weight vector. Roughly, when ${\tilde{w}}_{k^{'} k}$ decreases (respectively, increases) in an iteration, it encourages an increase (respectively, decrease) in the “prior” regularization constants $\frac{r_{k^{'}}^{(k)} r_{β}}{s_{β}}$, $\frac{r_{k}^{(k^{'})} r_{β}}{s_{β}}$ of ${\tilde{b}}_{k^{'}}^{(k)}$, ${\tilde{b}}_{k}^{(k^{'})}$, respectively. Consequently, this encouragement leads to a further decrease (respectively, increase) in ${\tilde{w}}_{k^{'} k}$ in the subsequent iteration. Likewise, the inclusion of ${\tilde{v}}_{k^{'}}$ in $r_{new}^{(k)}$ has an impact on the change in ${\tilde{v}}_{k^{'}}$ during an iteration. A decrease (respectively, increase) in ${\tilde{v}}_{k^{'}}$ in one iteration encourages a further decrease (respectively, increase) of ${\tilde{v}}_{k^{'}}$ in the subsequent iteration. This iterative process, often described as “the strong get stronger and the weak get weaker”, results in estimates of $w_{k^{'} k}$ $(k^{'} \neq k)$ with notably non-uniform estimates of $w_{k} = \sum_{j = 1}^{p} w_{j k}$ $(k = 1, \dots, p)$. Algorithm apARD, which performs the adaptive estimation to generate $\tilde{W}$, is described in Section A2 in the Appendix.

Let $G_{0} = (V, E_{0})$ represent prior knowledge of $G$: the edges included in $E_{0}$ are likely to be present in $G$, while the edges absent from $E_{0}$ are likely to be missing from $G$. To implement the apARD in this scenario, we scale down ${\tilde{w}}_{k^{'} k}$ to $r {\tilde{w}}_{k^{'} k}$ for $(k^{'}, k) \notin E_{0}$ in each iteration of the adaptive estimation process, where $r \in [0, 1)$ is the shrinkage constant. The modified apARD generates $\tilde{W}$ in which the weights ${\tilde{w}}_{k^{'} k}$ with $(k^{'}, k) \notin E_{0}$ decrease toward $0$. The occurrence of ${\tilde{w}}_{k^{'} k} > 0$ with $(k^{'}, k) \notin E_{0}$ diminishes as $r$ decreases to $0$. The modified apARD is utilized to infer GNs with pre-specified hub genes. In this scenario, we define $(k^{'}, k) \in E_{0}$ if $k^{'} \neq k$ and at least one of genes $k^{'}$, $k$ is a given hub gene.
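
The shrinkage step for pre-specified hub genes can be sketched as below. This shows only the scaling of weights outside $E_{0}$ within a single iteration, not the full apARD algorithm; the function name `shrink_outside_prior`, the 0-based gene indices and the example matrix are assumptions for illustration.

```python
def shrink_outside_prior(W, hubs, r):
    """Scale by r every weight whose edge joins two non-hub genes.

    W:    symmetric weighted adjacency matrix (list of lists, zero diagonal).
    hubs: set of 0-based indices of the given hub genes.
    r:    shrinkage constant in [0, 1).
    """
    p = len(W)
    shrunk = [row[:] for row in W]
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            in_prior = i in hubs or j in hubs   # (i, j) belongs to E_0
            if not in_prior:
                shrunk[i][j] = r * W[i][j]
    return shrunk

# Hypothetical 3-gene example in which gene 0 is the only hub.
W = [[0.0, 0.6, 0.4],
     [0.6, 0.0, 0.5],
     [0.4, 0.5, 0.0]]

W_r02 = shrink_outside_prior(W, hubs={0}, r=0.2)
W_r0 = shrink_outside_prior(W, hubs={0}, r=0.0)
print(W_r02[1][2], W_r0[1][2], W_r02[0][1])
```

With $r = 0$, only edges incident to a hub gene can survive, so the inferred graph is a union of stars centered on the hubs.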

Let $\tilde{G} = (V, \tilde{E})$ be the inferred GN, where $(k^{'}, k) \in \tilde{E}$ if ${\tilde{w}}_{k^{'} k} > 0$ and $(k^{'}, k) \notin \tilde{E}$ otherwise. Under $X \sim N_{p} (0_{p \times 1}, Σ)$ , we utilize a modified regression algorithm (Algorithm 17.1) from Ref. [14] to compute the maximum likelihood (ML) estimate $\hat{Ω}$ of $Ω$ that maximizes the log-likelihood $l (Ω) = \ln \det (Ω) - trace (S Ω)$ , where $S = \frac{1}{N} X^{T} X \in R^{p \times p}$ , over all possible $Ω ≻ 0$ of $\tilde{G}$ . Accordingly, $W$ is re-estimated by $\hat{W}$ , where ${\hat{w}}_{k^{'} k} = | {\hat{π}}_{k^{'} k} |$ if $k^{'} \neq k$ , 0 if $k^{'} = k$ , and ${\hat{π}}_{k^{'} k}$ is obtained by substituting ${\hat{ω}}_{k^{'} k}$ into Eq. (1) for $ω_{k^{'} k}$ . We refer to this hybrid method for generating $\tilde{G}$ using the apARD and $\hat{W}$ using the ML estimation of $Ω$ for $\tilde{G}$ as the NetARD. The NetARD_H is the NetARD implemented with the modified apARD under given hub genes.

2.4. Co-expression extension of GN

Co-expressed genes are often involved in the same biological processes. Let us assume that a predefined core GN associated with specific biological pathways is given. We propose the GN Co-expression Extension (GNCE) to extend the core GN. Let $s_{k^{'} k} = | Corr (X^{(k^{'})}, X^{(k)}) |$ estimate the similarity between $X_{k^{'}}$ and $X_{k}$. The dissimilarities or distances between $X_{k^{'}}$ and $X_{k}$ are calculated as $1 - s_{k^{'} k}$ for all $k^{'} \neq k$, and then analyzed using hierarchical cluster analysis [15,16] to construct the co-expressed gene cluster hierarchy. Typically, the results are visualized using a dendrogram. The co-expressed gene sets are derived from the generated dendrogram using tree cut techniques [17,18,19,20]. We utilize the modified NetARD (NetARD_H) to infer the GN within each co-expressed gene set that includes at least one of the core GN genes, where the core GN genes serve as hub genes. This enables us to identify interactions between core GN genes and their co-expressed genes. The core GN is expanded by incorporating the inferred interacting co-expressed genes and the interactions between them.
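
The dissimilarity that feeds the hierarchical clustering can be computed as in the following sketch, which covers only the distance $1 - s_{k^{'} k}$ and not the clustering or tree cutting. The gene names, expression profiles and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_distances(profiles):
    """Distance 1 - |correlation| for every unordered pair of genes."""
    names = list(profiles)
    return {(a, b): 1.0 - abs(pearson(profiles[a], profiles[b]))
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical profiles over four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [-2.0, -4.0, -6.0, -8.0],   # perfectly anti-correlated with g1
    "g3": [1.0, -1.0, -1.0, 1.0],     # uncorrelated with g1 and g2
}
distances = coexpression_distances(profiles)
print(distances)
```

Because the absolute correlation is used, strongly anti-correlated genes such as `g1` and `g2` are treated as co-expressed and end up at distance 0.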

The computational experiments for GN inference and analysis are conducted in the R environment [21] on a personal computer running the Windows operating system. The computational settings of apARD are described at the end of Section A2 in the Appendix. When hub nodes are given, we set $ξ = 0$, $r = 0$ for the modified apARD. The $\tilde{W}$ resulting from the (modified) apARD is further sparsified by resetting ${\tilde{w}}_{k^{'} k} = 0$ for ${\tilde{w}}_{k^{'} k} < 10^{- 3}$. The hierarchical clustering analysis involved in the GNCE uses a chosen linkage method to generate the gene cluster hierarchies. Following this, the co-expressed gene clusters are derived from the resulting cluster hierarchy using the dynamic tree cut algorithm [20] implemented in the R package WGCNA [22]. Unless otherwise stated, the above computational settings remain unchanged.

3. Results

3.1. Simulated data

We perform simulation experiments to demonstrate the effectiveness of NetARD in inferring GNs with hub genes (hub GNs). For these experiments, we use the R package hglasso [9] to simulate hub GNs with $p = 100$ genes. Within each simulated graph, each hub node is connected to approximately 32 nodes on average, while each non-hub node is connected to approximately 2 nodes on average. The graph's sparsity level (SL) and hub level (HL) are defined as the fraction of missing edges among all possible edges and the connectivity degree centralization [23], respectively. Fig. 1 illustrates a simulated hub GN with 4 hub genes, exhibiting SL $\approx 0.97$ and HL $\approx 0.33$. In this diagram, the pink nodes represent the hub genes. For a simulated GN, we follow the procedure in Ref. [9] to generate the associated $Ω ≻ 0$. More specifically, assume $A = (a_{k^{'} k}) \in R^{p \times p}$ is the adjacency matrix of the simulated graph $G$, where $a_{k^{'} k} = 1$ if $(k^{'}, k) \in E$, 0 otherwise. $Ω$ is set equal to $A + (0.1 - Λ_{\min} (A)) I$, where $Λ_{\min} (A)$ is the minimum eigenvalue of $A$ and $I \in R^{p \times p}$ is the identity matrix, so that the minimum eigenvalue of $Ω$ equals 0.1. The simulated gene expression dataset comprises $N = 60$ i.i.d. samples generated from $N_{p} (0_{p \times 1}, Ω^{- 1})$.
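
The two graph summaries can be computed from an adjacency matrix as sketched below. The HL formula used here is Freeman's degree centralization, $\sum_{k} (d_{\max} - d_{k}) / ((p - 1) (p - 2))$, which is an assumption about the exact definition adopted from [23]; the example graphs and function names are hypothetical.

```python
def sparsity_level(adj):
    """Fraction of missing edges among all p(p-1)/2 possible edges."""
    p = len(adj)
    n_edges = sum(adj[i][j] for i in range(p) for j in range(i + 1, p))
    return 1.0 - n_edges / (p * (p - 1) / 2)

def hub_level(adj):
    """Degree centralization (assumed Freeman form), between 0 and 1."""
    p = len(adj)
    degrees = [sum(row) for row in adj]
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((p - 1) * (p - 2))

def star(p):
    """Adjacency matrix of a star graph: node 0 is linked to all others."""
    return [[1 if (i == 0) != (j == 0) else 0 for j in range(p)] for i in range(p)]

def ring(p):
    """Adjacency matrix of a ring graph: every node has degree 2."""
    return [[1 if abs(i - j) in (1, p - 1) else 0 for j in range(p)] for i in range(p)]

print(sparsity_level(star(5)), hub_level(star(5)))
print(sparsity_level(ring(5)), hub_level(ring(5)))
```

A star graph, the extreme single-hub case, reaches HL = 1, whereas a ring, in which all genes have the same degree, has HL = 0.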

We begin by demonstrating the NetARD using the simulated gene expression data for the simulated GN in Fig. 1. The tuning parameters $λ$, $ξ$ in the apARD control the level and type of regularization of $\tilde{W}$. For each of $ξ = 0, 0.1, 1$, we set $λ$ to generate $\tilde{G}$ with SL $\approx 0.97$. The heatmaps of $\hat{W}$ generated by the NetARD are displayed in Fig. 2. The bluish dots in each heatmap correspond to the edges of $\tilde{G}$, while the symmetrical pairs of horizontal and vertical stripes containing these bluish dots correspond to the hub genes. Notably, compared to panel (a), the bluish dots are more concentrated on specific stripes in the heatmaps of panels (b) and (c). The HLs of $\tilde{G}$ associated with $\hat{W}$ are approximately 0.36, 0.46, 0.52 for $ξ = 0, 0.1, 1$, respectively. Let the genes be ranked based on the number of their interaction partners. The top 1–3 genes in each $\tilde{G}$ correspond to the actual hub genes of the hub GN.

Fig. 2 — Heatmaps of estimated weighted adjacency matrices. The heatmaps display the estimated weighted adjacency matrices of hub GN obtained using the NetARD with $ξ = 0$ , $0.1$ , $1$ in panels (a), (b), (c), respectively.

We evaluate NetARD's performance in inferring 5 simulated hub GNs. These simulated GNs have varying numbers of hub genes, ranging from 2 to 6. Furthermore, we use the same simulated gene expression data to compare NetARD's performance with other published graph estimation methods. The methods used for comparison are as follows:

1. glasso [24]: It maximizes the regularized $l (Ω)$ with the $L_{1}$-norm penalty to generate sparse estimates of $Ω$.
2. hglasso [9]: It maximizes the regularized $l (Ω)$ with a convex penalty to produce the sparse estimates of $Ω$ with the hub property.
3. DWLasso [8]: This method symmetrizes the estimates of $b^{(k)}$ $(k = 1, \dots, p)$ of node-wise weighted LASSO regressions with estimated normalized regularization weights. This produces sparse estimates of $Ω$ with the hub property.

For the DWLasso, we re-estimate $Ω$ by $\hat{Ω}$ , which maximizes $l (Ω)$ subject to the estimate of $G$ corresponding to the estimate of $Ω$ produced by the DWLasso.

In the experiment, we set $ξ = 0.1$ and use different values of $λ$ for NetARD to generate $\hat{Ω}$ with the SL of corresponding $\tilde{G}$ in a range near 1. Similarly, we set a grid of tuning parameters for each reference method to generate $\hat{Ω}$. To evaluate the quality of $\hat{Ω}$, we use the extended Bayesian information criterion (EBIC), calculated as ${EBIC}_{γ} (\hat{Ω}) = - l (\hat{Ω}) + \frac{\ln N + 4 γ \ln p}{N} κ$ [25]. Here, $γ \in [0, 1]$ and $κ$ is the number of nonzero entries ${\hat{ω}}_{k^{'} k}$ of $\hat{Ω}$ with $1 \leq k^{'} < k \leq p$. We set $γ = 0$ for NetARD and $0.2$ for reference methods. For each method, $\hat{W}$ is obtained from the $\hat{Ω}$ that yields the lowest ${EBIC}_{γ} (\hat{Ω})$ among all generated $\hat{Ω}$.
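
The EBIC-based selection can be sketched as follows, using the formula stated above. The candidate list, with its log-likelihood values and edge counts, is made up for illustration and does not reproduce any result of the paper.

```python
import math

def ebic(loglik, kappa, n_samples, n_genes, gamma):
    """EBIC as defined in the text: -l + (ln N + 4*gamma*ln p) / N * kappa."""
    penalty = (math.log(n_samples) + 4.0 * gamma * math.log(n_genes)) / n_samples
    return -loglik + penalty * kappa

# Hypothetical candidates: (label, log-likelihood l, number of edges kappa).
candidates = [
    ("dense", -40.0, 400),
    ("medium", -55.0, 150),
    ("sparse", -80.0, 40),
]

N, p = 60, 100
scores = {label: ebic(l, kappa, N, p, gamma=0.0) for label, l, kappa in candidates}
best = min(scores, key=scores.get)
print(scores, best)
```

Setting `gamma` above 0 adds a penalty that grows with $\ln p$, so denser estimates are penalized more heavily when the number of genes is large.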

Several metrics are used to assess the accuracy of an estimated $G$. They include recall (Re), which is the percent of edges in $G$ that are also present in the estimated $G$, fall-out (Fa), which is the percent of edges not found in $G$ but present in the estimated $G$, and precision (Pr), which is the percent of edges in the estimated $G$ that are also found in $G$. Given a computed $\hat{W}$ and a specific threshold, the estimated $G$ consists of edges that correspond to edge weights in $\hat{W}$ that exceed the threshold. The receiver operating characteristic (ROC) curve plots Re against Fa of the estimated $G$ as the threshold moves through the interval $[0, 1]$. On the other hand, the precision-recall (PR) curve plots Pr against Re of the estimated $G$ as the threshold moves through the interval $[0, 1]$. To assess the efficiency of $\hat{W}$ in inferring $G$, we use the R package PRROC [26] to calculate the areas under the ROC curve (AUROC) and the PR curve (AUPR). The greater the AUROC and AUPR values are in the interval $[0, 1]$, the more effective $\hat{W}$ is in inferring $G$.
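
For a single threshold, the three metrics can be computed as in the sketch below, which returns them as fractions in $[0, 1]$. The weight matrix, the true edge set and the function name `edge_metrics` are hypothetical; the areas under the curves in the paper are computed with PRROC, not with this code.

```python
def edge_metrics(W_hat, true_edges, threshold):
    """Recall, fall-out and precision of the graph obtained by thresholding W_hat.

    true_edges: set of pairs (i, j) with i < j, using 0-based gene indices.
    """
    p = len(W_hat)
    tp = fp = fn = tn = 0
    for i in range(p):
        for j in range(i + 1, p):
            predicted = W_hat[i][j] > threshold
            actual = (i, j) in true_edges
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    fall_out = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, fall_out, precision

# Hypothetical 4-gene example with two true edges.
true_edges = {(0, 1), (0, 2)}
W_hat = [[0.0, 0.9, 0.05, 0.0],
         [0.9, 0.0, 0.3, 0.0],
         [0.05, 0.3, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]

re, fa, pr = edge_metrics(W_hat, true_edges, threshold=0.1)
print(re, fa, pr)
```

Sweeping `threshold` over $[0, 1]$ and collecting the resulting (Fa, Re) and (Re, Pr) pairs traces out the ROC and PR curves, respectively.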

Fig. 3 displays the AUROCs and AUPRs for each method in inferring the five 100-gene hub GNs. The horizontal dashed lines in light blue and pink indicate the expected AUROC and AUPR of randomly generated $W$ over the 5 GN inference experiments, respectively. These dashed lines serve as the baseline performance indicators. All compared methods yield AUROCs and AUPRs above the baseline levels. Table 1 presents the median values of AUROC, AUPR for $\hat{W}$, as well as the median values of SL, HL of estimated $G$ corresponding to $\hat{W}$ generated by each method. When the compared methods are ranked by median AUPR for sparse GN inference, the NetARD performs best in this experiment.

Table 1. Median values of AUROC, AUPR, SL, HL resulting from GN inference methods on 5 simulated 100-gene hub GNs.

Methods	AUROC	AUPR	SL	HL
NetARD	0.71	0.33	0.98	0.37
glasso	0.70	0.16	0.98	0.37
hglasso	0.71	0.18	0.98	0.38
DWLasso	0.67	0.16	0.97	0.38

3.2. GNW in silico data

The GeneNetWeaver (GNW) [27] is a software tool that generates in silico GNs by extracting subnetworks from real GNs and simulates gene expression data using a system of non-linear ordinary differential equations. The dataset gnw2000 in the R package grndata [28] includes an undirected yeast GN with 2000 genes and 2000 simulated steady-state samples of noise-free gene expression data of the GN generated by GNW. For our study, we generate 5 test datasets, each of which includes a connected subnetwork of $p = 100$ genes extracted from the yeast GN and $N = 60$ samples of gene expression data taken from the first $60$ samples in the gnw2000 dataset. To simulate observation errors, we add Gaussian noise with a standard deviation of 0.01 to the extracted data. Fig. 4 illustrates one of the generated $100$-gene yeast GNs, which contains 119 edges, with pink nodes representing hub genes that connect to more than $5$ other genes.

Fig. 4 — Graph of in silico 100-gene yeast GN. The pink nodes represent hub genes that interact with more than 5 genes in the GN. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

In the experiments for GN inference, we first standardize the expression data for each gene in a gene expression dataset. This is achieved by centering the mean to 0 and scaling the standard deviation to 1. The GN inference methods are then applied to these standardized gene expression datasets using similar computational settings as in previous experiments. We also apply the NetARD_H to the data, treating the known hub genes as given. Fig. 5 displays the AUROCs, AUPRs and boxplots for each method used to infer the five 100-gene yeast GNs. Table 2 presents the median values of AUROC, AUPR of $\hat{W}$ and the median values of SL, HL of estimates of $G$ yielded by each method. Overall, the NetARD outperforms the reference methods in terms of generating higher AUPRs. The performance of NetARD_H exhibits significant improvement compared to that of NetARD.
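
The per-gene standardization can be sketched as below. The use of the sample standard deviation (divisor $N - 1$, as in R's `scale`) is an assumption, since the text does not state the divisor, and the expression values are hypothetical.

```python
import math

def standardize(values):
    """Center a gene's expression values to mean 0 and scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # assumed divisor n - 1
    return [(v - mean) / sd for v in values]

# Hypothetical expression values of one gene across three samples.
z = standardize([1.0, 2.0, 3.0])
print(z)
```

Standardizing each gene separately puts all genes on a common scale, so that the regularization acts comparably on every regression coefficient.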

Fig. 5 — AUROCs, AUPRs and boxplots of AUROCs, AUPRs resulting from GN inference methods on 5 in silico 100-gene yeast GNs.

Table 2. Median values of AUROC, AUPR, SL, HL resulting from GN inference methods on 5 in silico 100-gene yeast GNs.

Methods	AUROC	AUPR	SL	HL
NetARD	0.63	0.17	0.96	0.11
glasso	0.62	0.08	0.97	0.08
hglasso	0.61	0.07	0.98	0.08
DWLasso	0.62	0.15	0.95	0.07
NetARD_H	0.68	0.28	0.96	0.30

3.3. E. coli microarray data

The Many Microbe Microarrays Database (M^3D) [29] provides gene expression datasets measured using Affymetrix microarrays. From this database, we select the E. coli dataset, which includes $466$ samples of expression data for 4297 genes. To acquire gene and transcription factor (TF) annotations, as well as experimentally validated regulatory relationships between TFs and genes in E. coli K-12, we utilize the R package regutools [30] to access datasets from RegulonDB [31].

To prepare the data for analysis, genes lacking gene symbol identifiers are excluded from the microarray dataset. The expression data of the remaining 4212 genes are log-transformed and then standardized. To construct the gold standard GN, we utilize the gene-gene regulation dataset, which includes 4405 regulatory relationships involving 201 TF-encoding gene or multi-gene regulators and 1842 target genes. We split multi-gene regulators into individual gene regulators. Genes not present in the measured genes for analysis, along with regulations involving missing regulators or target genes, are excluded. Self-regulations are omitted, and ultimately, regulatory directions are disregarded. The resulting gold standard GN includes 1887 genes and 4791 gene-gene interactions. Moreover, we utilize the TF-TF regulation dataset from the database, which includes 177 TFs and 449 TF-TF regulatory relationships, to extract interactions between TF genes from the gold standard GN. The resulting TF gene GN includes 154 TF genes and 333 gene-gene interactions. We observe that the gold standard GN exhibits a relatively low coverage of gene-gene interactions between the 4212 genes used for analysis.

To evaluate the performance of GNCE, we apply it to extend the TF gene GN using the processed microarray data. By employing the “ward.D2” linkage for the hierarchical clustering and applying the dynamic tree cut algorithm [20] with the minimum cluster size set to 7 to the resulting cluster hierarchy, we obtain 148 co-expressed gene clusters and 1 noise gene cluster. We use the largest minimum cluster size for which the noise gene cluster does not contain any TF genes. We identify and select 79 co-expressed gene sets, each of which includes at least one TF gene. The sizes of these sets range from 7 to 117 genes. A total of 264 gene-gene interactions within the gold standard GN fall within the selected co-expressed gene sets. Subsequently, the NetARD_H is utilized to infer the GN within each of the 79 co-expressed gene sets, where the included TF genes serve as hub genes. Of the 2006 inferred gene-gene interactions, 203 match gold standard gene-gene interactions, giving $Re \approx 0.77$ and $\Pr \approx 0.10$. The limited coverage of gene-gene interactions in the gold standard GN may contribute to the significantly lower $\Pr$ compared to the median $\Pr \approx 0.28$ induced by the NetARD_H. Merging the TF gene GN and the inferred gene-gene interactions results in an extension of the TF gene GN that comprises 1630 genes. Out of the 2327 edges of this inferred extension, 524 correspond to edges within the gold standard GN. Fig. 6 (a) illustrates the TF gene GN and (b) the extension of the TF gene GN.
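
The reported recall and precision follow directly from the counts in this paragraph, taking the 264 gold standard interactions that fall within the selected gene sets as the reference for recall and the 2006 inferred interactions as the reference for precision:

```latex
$$Re = \frac{203}{264} \approx 0.77, \qquad \Pr = \frac{203}{2006} \approx 0.10$$
```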

Fig. 6 — (a) GN of TF genes of *E. coli* (b) Co-expression extension of GN of TF genes of *E. coli*.

3.4. Colon cancer RNAseq data

The Cancer Genome Atlas (TCGA) Project is a useful resource for obtaining human cancer gene expression data. In our study, we use the COAD RNA sequencing (RNAseq) dataset, which pertains to colon adenocarcinoma and is available through the R package RTCGA.rnaseq [32]. The dataset provides normalized RNAseq expression data of $285$ cancer tissue and 41 matched normal tissue samples, encompassing 20531 human genes. To acquire the predefined gene-gene interactions involved in colorectal cancer, we refer to the pathway map hsa05210 stored in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [33]. This pathway map outlines the molecular interactions during the development of human colorectal cancer. To obtain the GN associated with this pathway map, we use the R package KEGGgraph [34] to parse the hsa05210 file that is in extensible markup language (XML) format. As a result, we generate a KEGG colorectal cancer GN comprising 86 genes and 149 edges. Here, the regulatory directions between genes are not considered.

We preprocess the data in several steps. First, genes with missing values are excluded. In addition, genes with duplicated gene symbol identifiers and genes that are not annotated by the R package org.Hs.eg.db [35] are excluded. Furthermore, genes with expression levels lower than 0.1 in 95% of the samples are excluded from the dataset. The expression data of the remaining 16234 genes are first log-transformed and then standardized. We then use the R package limma [36] to conduct a differential expression analysis on the standardized log-transformed gene expression data. This package uses robust statistical methods in a linear modeling framework to test whether the gene expression levels differ significantly between cancer and normal tissues, and it controls for multiple testing. From this analysis, a total of 11677 genes are identified as differentially expressed (DE) across cancer and normal tissue samples, with false discovery rate (FDR) adjusted p-values below 0.05. On the other hand, 15 KEGG colorectal cancer GN genes are identified as non-DE genes.
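As a minimal sketch, the filtering, log transformation and standardization steps can be written as follows. Several details are assumptions made only for illustration, since the text does not state them: a gene is dropped when at least 95% of its samples fall below 0.1, the transform is taken as log2(x + 1), and standardization is per gene to zero mean and unit sample variance. The function name and toy data are hypothetical.

```python
import math
import statistics


def preprocess(expr, low=0.1, frac=0.95, pseudocount=1.0):
    """Filter, log-transform and standardize expression values.

    expr maps a gene symbol to a list of expression values, with None
    marking a missing value. Returns a dict of standardized profiles.
    """
    out = {}
    for gene, values in expr.items():
        # Exclude genes with any missing value.
        if any(v is None for v in values):
            continue
        # Exclude genes that are lowly expressed in most samples (assumed rule).
        n_low = sum(v < low for v in values)
        if n_low >= frac * len(values):
            continue
        # Assumed transform: log2 with a pseudocount of 1.
        logged = [math.log2(v + pseudocount) for v in values]
        mu = statistics.fmean(logged)
        sd = statistics.stdev(logged)
        if sd == 0:
            continue  # a constant profile cannot be standardized
        out[gene] = [(x - mu) / sd for x in logged]
    return out


toy = {
    "A": [1.0, 3.0, 7.0, 15.0],   # kept
    "B": [0.0, 0.0, 0.0, 0.0],    # lowly expressed, dropped
    "C": [1.0, None, 2.0, 3.0],   # missing value, dropped
}
clean = preprocess(toy)
print(sorted(clean))
```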

We apply the GNCE to the standardized log-transformed expression data for the 11692 DE and KEGG colorectal cancer GN genes obtained from 285 cancer tissue samples. Using the “complete” linkage for the hierarchical clustering and applying the dynamic tree cut algorithm with the minimum cluster size set to 10, we obtain 559 co-expressed gene clusters and 1 noise gene cluster. Except for 1 gene (SMAD2) included in the noise gene cluster, the remaining 85 KEGG colorectal cancer GN genes are distributed across 74 co-expressed gene clusters. The NetARD_H with $r = 0.2$ is used to infer the GN within each of these 74 co-expressed gene sets, where the included KEGG colorectal cancer GN genes serve as hub genes. As a result, we identify 1144 interactions between hub genes and their co-expressed genes. Merging the KEGG colorectal cancer gene GN and the inferred gene-gene interactions results in an extended GN that contains 1269 genes and 4386 edges. Fig. 7 (a) illustrates the KEGG colorectal cancer gene GN and (b) the extension of the KEGG colorectal cancer gene GN.

3.5. Colorectal cancer-related Gene Ontology terms

We analyze the inferred GN presented in Fig. 7 (b) to discover Gene Ontology (GO) terms related to biological processes (BPs) that play an important role in the context of colorectal cancer progression. We first employ GO enrichment analysis (GOEA) to identify BP GO terms significantly enriched in each co-expressed gene set within the inferred GN. In this analysis, we use the gene set of inferred GN as the background gene set. For each GO term, the associated gene set comprises genes from the background gene set that are annotated by that particular GO term. The strength of association between a GO term and a co-expressed gene set within the inferred GN is assessed by evaluating the strength of association between the gene set associated with the GO term and the co-expressed gene set. We employ the R package topGO [37] to compute the p-value of the Fisher exact test, which quantifies the likelihood of observing such an association through random sampling. BP GO terms with their FDR adjusted p-values lower than a specified level are considered significantly enriched in the co-expressed gene set.
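To make the enrichment test concrete, the sketch below computes a one-sided Fisher exact p-value as the upper tail of the hypergeometric distribution and then adjusts a list of p-values. It is not the topGO implementation; the one-sided (over-representation) form and the Benjamini-Hochberg procedure for FDR adjustment are assumptions, and the function names are illustrative.

```python
from math import comb


def enrichment_pvalue(n_background, n_term, n_set, n_overlap):
    """P(X >= n_overlap) when n_set genes are drawn without replacement
    from n_background genes, of which n_term are annotated by the GO term."""
    total = comb(n_background, n_set)
    upper = min(n_term, n_set)
    tail = sum(
        comb(n_term, x) * comb(n_background - n_term, n_set - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 background genes, 5 annotated, a set of 5, all 5 annotated.
p_toy = enrichment_pvalue(10, 5, 5, 5)
adj_toy = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_toy, adj_toy)
```

A GO term would then be called enriched in a co-expressed gene set when its adjusted p-value falls below the chosen level.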

Subsequently, from the previously identified enriched BP GO terms, we identify those that are significantly associated with the inferred GN. Here, a GO term is considered associated with the inferred GN if the genes of its associated gene set lie close to one another in the inferred GN. We use the Knet function [38], akin to Ripley's K-function, to examine the distribution of a gene set within the inferred GN. A higher value of the area under the Knet function (AUK) indicates a greater degree of clustering of the gene set within the inferred GN. A GO term is associated with the inferred GN if the AUK of its associated gene set significantly exceeds that of randomly selected gene sets of the same size. For each GO term of interest, we employ the R package SANTA [38], in which the shortest-path distance quantifies the distance between two genes within the GN, to compute the p-value of a permutation test that compares the AUK of the GO term gene set to the AUKs of randomly generated gene sets of the same size. BP GO terms with FDR adjusted p-values lower than a specified level are considered significantly associated with the inferred GN.
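The logic of such a permutation test can be illustrated with a deliberately simplified statistic. The sketch below does not compute the Knet function or the AUK of SANTA; as a stand-in, it uses the mean pairwise shortest-path distance of a gene set, so that a smaller value indicates tighter clustering. It assumes a connected, unweighted, undirected network, and all names are hypothetical.

```python
import random
from collections import deque
from itertools import combinations


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_pairwise_distance(all_dist, genes):
    """Average shortest-path distance over all pairs of genes."""
    pairs = list(combinations(genes, 2))
    return sum(all_dist[a][b] for a, b in pairs) / len(pairs)


def clustering_pvalue(adj, gene_set, n_perm=1000, seed=0):
    """Permutation p-value: share of random gene sets of the same size
    that are at least as tightly clustered as gene_set."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    all_dist = {u: shortest_path_lengths(adj, u) for u in nodes}
    observed = mean_pairwise_distance(all_dist, gene_set)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(nodes, len(gene_set))
        if mean_pairwise_distance(all_dist, sample) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy network: a path of 12 nodes; the gene set occupies three adjacent nodes.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}
all_dist_toy = {u: shortest_path_lengths(adj, u) for u in adj}
p_value = clustering_pvalue(adj, [0, 1, 2])
print(p_value)
```

The p-values obtained for all GO terms of interest would then be FDR adjusted before thresholding.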

In the GO term analysis experiment, we set the significance level of the FDR-adjusted p-value for Fisher exact tests in GOEA at 0.05. We identify 891 BP GO terms that exhibit significant enrichment within 29 co-expressed gene sets. From these, we retain the BP GO terms whose associated gene sets contain 5 or more genes for subsequent analysis. Using a significance level of 0.1 for the FDR-adjusted p-values of the permutation tests, we identify 40 GO terms that are significantly associated with the inferred GN; these terms are significantly enriched within 11 co-expressed gene sets. We present the results of our GO term analysis for one specific co-expressed gene set as follows.

Fig. 8 illustrates the largest subnetwork consisting of 103 co-expressed genes within the inferred GN shown in Fig. 7 (b). In this diagram, the pink nodes represent hub genes, including TGFB3 (Transforming Growth Factor Beta 3) and TCF7L1 (Transcription Factor 7-like 1), both included in the KEGG colorectal cancer GN. Among the 92 BP GO terms significantly enriched in the genes in Fig. 8, there are 38 GO terms associated with gene sets containing 5 or more genes. Fig. 9 depicts the GO plot for the final 15 BP GO terms that are significantly associated with the inferred GN. These identified GO terms are related to the organization of extracellular matrix components (GO:0030198, GO:0030199), the development and differentiation of connective tissues (GO:0051216, GO:0001503, GO:0002062, GO:0048701, GO:0045669) and skin (GO:0043588), the development and differentiation of organs and tissues within a multicellular organism (GO:0007275, GO:0030324, GO:0035987, GO:0060021), as well as cell signaling (GO:0007229), regulation of cell differentiation (GO:0045597), and the reorganization of the actin cytoskeleton (GO:0031532). These GO terms provide informative annotations for the co-expressed genes associated with cancer-related BPs.

Fig. 9 — GO plot of identified BP GO terms associated with colorectal cancer. In this plot, red and blue dots represent the upregulated $(logFC > 0)$ and downregulated $(logFC < 0)$ GO term genes on the inferred GN in Fig. 7 (b), respectively. Here, the $logFC$ represents the difference between the averages of log2-transformed gene expression data over cancer and normal tissue samples. The z-score of a gene set is calculated as $\frac{u - d}{\sqrt{c}}$ , where $u$ is the number of upregulated genes, $d$ is the number of downregulated genes, and $c$ is the total number of genes in the gene set.
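The z-score in the caption of Fig. 9 can be computed from the signs of the $logFC$ values alone. A small illustration, with a hypothetical function name and made-up values:

```python
import math


def go_term_zscore(log_fc):
    """z-score (u - d) / sqrt(c) of a gene set from its logFC values."""
    up = sum(v > 0 for v in log_fc)     # u: number of upregulated genes
    down = sum(v < 0 for v in log_fc)   # d: number of downregulated genes
    return (up - down) / math.sqrt(len(log_fc))


# Three upregulated genes and one downregulated gene: (3 - 1) / sqrt(4).
z = go_term_zscore([1.2, 0.4, -0.3, 2.0])
print(z)
```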

Fig. 10 illustrates the workflow of the proposed method for identifying cancer-related GO terms using gene expression data and a predefined core cancer GN.

4. Conclusion

According to the AUPR, which provides a more insightful measure of sparse GN inference performance, the NetARD outperforms the reference GN inference methods when applied to simulated hub GNs and in silico yeast GNs. When provided with hub genes, the NetARD significantly improves the inference accuracy on in silico yeast GNs. The GNCE, which integrates co-expression analysis and the NetARD, retrieves the majority of interactions between TF genes and their co-expressed genes of E. coli. The GNCE enables the prediction of GO terms associated with colorectal cancer using combined information from a curated core cancer GN and gene expression data.

5. Discussion

The proposed NetARD uses the apARD to infer the GN and then re-estimates the non-zero partial correlations between genes using the GGM. It is important to note that gene expression data are frequently not normally distributed, and gene expression profiles can exhibit complex nonlinear correlations. Partial correlation or GGM-based graph estimation methods primarily capture direct linear dependencies between (log-transformed) gene expression levels. Consequently, these methods may encounter difficulties in identifying GNs, particularly as the GN grows larger.

It is known that multiple genes frequently collaborate and exhibit co-expression during cellular processes. The GNCE identifies interactions between predefined core GN genes and their co-expressed genes using the NetARD, after clustering the genes into small-scale co-expressed gene sets. In our experiments, by controlling the minimum cluster size in the dynamic tree cut algorithm, the resulting co-expressed gene sets typically contain fewer than 120 genes. Smaller co-expressed gene sets allow the NetARD to infer GNs in shorter computational time. However, the GNCE does not capture the interactions between non-co-expressed genes. To identify GO terms associated with the progression of colorectal cancer, we employ the GN inferred by GNCE, incorporating the KEGG colorectal cancer core GN and gene expression data. Our approach is based on the fundamental assumption that GNs consisting of genes displaying co-expression patterns and interacting with those involved in the curated cancer pathway primarily contribute to the coordination of cellular processes.

Finally, additive models extend linear regressions by incorporating nonlinear component functions, such as spline functions, to accommodate the nonlinear interactions between predictor variables and the response variable. One of our future research interests involves the inference of GNs with biologically meaningful topological properties using additive models of gene expression. The GNs inferred in this context are expected to yield novel insights into the interactions among genes that coordinate cellular processes in both healthy and diseased states.

CRediT authorship contribution statement

Chi-Kan Chen: Writing – original draft, Software, Methodology, Investigation, Formal analysis, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix.

A.1 p-node ARD.

From the normality assumptions of $P_{k} (ε_{k} | b^{(k)}, β)$ , $P_{k} (b^{(k)} | α^{(k)})$ and the application of conditional probability, we can express $P_{k} (b^{(k)} | X, α^{(k)}, β)$ in Eq (4) as

Equation A1.

(A1)

where ${\tilde{S}}_{b^{(k)}} = {(β {X^{\ k}}^{T} X^{\ k} + D_{α^{(k)}})}^{- 1}$ , ${\tilde{b}}^{(k)} = β {\tilde{S}}_{b^{(k)}} {X^{\ k}}^{T} X^{(k)}$ . Thus, $P_{k} (b^{(k)} | X, α^{(k)}, β)$ follows $N_{p - 1} ({\tilde{b}}^{(k)}, {\tilde{S}}_{b^{(k)}})$ . Moreover, we can express $L_{k} (α^{(k)}, β | X)$ in Eq (4) as

Equation A2.

(A2)

Using Eq. (A2), for $k^{'} \neq k$ , we derive the partial derivatives $\frac{\partial}{\partial α_{k^{'}}^{(k)}} \ln L_{k} (α^{(k)}, β | X)$ as

Equation A3.

(A3)

and $\frac{\partial}{\partial β} \ln L_{k} (α^{(k)}, β | X)$ as

Equation A4.

(A4)

Likewise, for log-Gamma priors, $\frac{\partial}{\partial α_{k^{'}}^{(k)}} \ln P_{k} (α^{(k)} | θ^{(k)}) = \frac{s_{k^{'}}^{(k)} - 1}{α_{k^{'}}^{(k)}} - r_{k^{'}}^{(k)}$ , $\frac{\partial}{\partial β} \ln P (β | θ_{β}) = \frac{s_{β} - 1}{β} - r_{β}$ . By solving $\frac{\partial}{\partial α_{k^{'}}^{(k)}} f (α^{(1)}, \dots, α^{(p)}, β) = 0$ , we obtain the maximizer of $f (α^{(1)}, \dots, α^{(p)}, β)$ in Eq. (5) with $α^{(k^{'})}$ $(k^{'} \neq k)$ , $β$ fixed as

Equation A5.

(A5)

where $diag ({\tilde{S}}_{b^{(k)}})$ is the diagonal vector of ${\tilde{S}}_{b^{(k)}}$ and the vector division is the element-wise division. By solving $\frac{\partial}{\partial β} f (α^{(1)}, \dots, α^{(p)}, β) = 0$ , we obtain the maximizer of $f (α^{(1)}, \dots, α^{(p)}, β)$ with $α^{(k)}$ $(k = 1, \dots, p)$ fixed as

Equation A6.

(A6)

Assume that $Θ$ with $s_{k^{'}}^{(k)} > \frac{1}{2}$ $(k^{'} \neq k)$ and the initial values of $α^{(k)}$ $(k = 1, \dots, p)$ and $β$ are specified. Algorithm pARD iteratively computes ${\tilde{S}}_{b^{(k)}}$ and ${\tilde{b}}^{(k)}$ under the current $α^{(k)}$ and $β$ , and then updates $α^{(k)}$ and $β$ using Eqs. (A5) and (A6), respectively, to obtain the optimal ${\tilde{b}}^{(k)}$ , ${\tilde{α}}^{(k)}$ $(k = 1, \dots, p)$ , $\tilde{β}$ . The algorithm is as follows.

Algorithm pARD

repeat

for $k = 1, \dots, p$ do

$C_{k} \leftarrow {X^{\ k}}^{T} X^{\ k}$

$S_{b^{(k)}} \leftarrow {(β C_{k} + D_{α^{(k)}})}^{- 1}$

$b^{(k)} \leftarrow β S_{b^{(k)}} {X^{\ k}}^{T} X^{(k)}$

$α^{(k)} \leftarrow \frac{2 s^{(k)} - 1_{(p - 1) \times 1}}{diag (S_{b^{(k)}}) + {\tilde{b}}^{(k)} \circ {\tilde{b}}^{(k)} + 2 r^{(k)}}$

end for

$β \leftarrow \frac{N - 2 + 2 s_{β}}{\frac{1}{p} \sum_{k = 1}^{p} \left[ trace (S_{b^{(k)}} C_{k}) + {‖ X^{(k)} - X^{\ k} b^{(k)} ‖}_{2}^{2} \right] + 2 r_{β}}$

if $f (α^{(1)}, \dots, α^{(p)}, β)$ ceases to increase then break

end repeat

return $\tilde{B} = (b^{(1)}, \dots, b^{(p)})$ , $\tilde{Α} = (α^{(1)}, \dots, α^{(p)})$ , $\tilde{β} = β$ .
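One possible reading of Algorithm pARD is given by the NumPy sketch below. It is not the author's implementation. It assumes that X is an N × p matrix of standardized expression values, that $X^{\ k}$ denotes X with column k removed, that $D_{α^{(k)}}$ is the diagonal matrix formed from $α^{(k)}$ , and that N is the number of samples. The stopping rule based on $f$ is replaced by a fixed number of sweeps, because evaluating $f$ requires Eq. (5). The function name, the prior settings and the toy data are hypothetical.

```python
import numpy as np


def p_ard(X, s, r, s_beta=1.0, r_beta=1e-3, n_iter=50):
    """Sketch of the pARD updates.

    X    : (N, p) data matrix.
    s, r : (p, p-1) arrays; row k holds the prior parameters of alpha^(k).
    Returns B (p, p-1), alpha (p, p-1) and the scalar beta.
    """
    N, p = X.shape
    alpha = np.array(r, dtype=float)   # initial alpha^(k) = r^(k)
    beta = 1.0                         # initial beta
    B = np.zeros((p, p - 1))
    for _ in range(n_iter):            # fixed sweeps instead of monitoring f
        terms = np.zeros(p)
        for k in range(p):
            others = [j for j in range(p) if j != k]
            X_rest = X[:, others]      # X with column k removed
            y = X[:, k]                # column k of X
            C = X_rest.T @ X_rest
            S = np.linalg.inv(beta * C + np.diag(alpha[k]))
            b = beta * S @ X_rest.T @ y
            # Element-wise update of alpha^(k); positive when s > 1/2.
            alpha[k] = (2.0 * s[k] - 1.0) / (np.diag(S) + b * b + 2.0 * r[k])
            B[k] = b
            terms[k] = np.trace(S @ C) + np.sum((y - X_rest @ b) ** 2)
        beta = (N - 2.0 + 2.0 * s_beta) / (terms.mean() + 2.0 * r_beta)
    return B, alpha, beta


# Toy data: gene 1 depends linearly on gene 0, the other genes are independent.
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.standard_normal((N, p))
X[:, 1] = X[:, 0] + 0.3 * rng.standard_normal(N)
X = (X - X.mean(axis=0)) / X.std(axis=0)
s = np.ones((p, p - 1))
r = np.ones((p, p - 1))
B, alpha, beta = p_ard(X, s, r)
print(np.round(B[1], 2))  # coefficients of genes 0, 2, 3, 4 on gene 1
```

In this toy setting the coefficient of gene 0 in the regression of gene 1 should dominate the others, which is the behaviour the sparsity-inducing prior is meant to support.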

A.2 Adaptive p-node ARD.

Assume that the parameters $λ$ , $ξ$ , $s_{β}$ , $r_{β}$ are fixed, and that the initial $W$ and the initial values of $α^{(k)}$ $(k = 1, \dots, p)$ , $β$ for Algorithm pARD are specified. Denote by $Φ$ the transformation of $\tilde{B} = ({\tilde{b}}^{(1)}, \dots, {\tilde{b}}^{(p)})$ to $\tilde{W}$ as described in Section 2.3. Algorithm apARD, which solves for $\tilde{W}$ with the adaptively tuned $(s^{(k)}, r^{(k)})$ $(k = 1, \dots, p)$ , is as follows.

Algorithm apARD

repeat

for $k = 1, \dots, p$ do

$r^{(k)} \leftarrow {\left( \left. λ \left( \frac{1 - ξ}{φ ({\tilde{w}}_{k^{'} k})} + \frac{ξ}{φ (\frac{1}{p} \sum_{j = 1}^{p} {\tilde{w}}_{j k^{'}})} \right) + \frac{1}{\sqrt{2}} \right|_{k^{'} \neq k} \right)}^{T}$

$s^{(k)} \leftarrow r^{(k)} \circ r^{(k)}$

end for

$B \leftarrow pARD (X; Θ) |_{\tilde{B}}$

$W \leftarrow Φ (B)$

if $W$ ceases to change then break

end repeat

return $\tilde{W} = W$ , $Θ$ .

In our computational experiments, we fix the Gamma prior over $β$ with $s_{β} = 1$ , $r_{β} = 10^{- 3}$ to approximate the flat prior. We define $φ (t) = \frac{t + 10^{- 3}}{1 - t + 10^{- 3}}$ , $t \in [0, 1]$ . To initialize the apARD, we set the $(k^{'}, k)$ entry of the initial $W$ to $w_{k^{'} k} = | Cor (X^{(k^{'})}, X^{(k)}) |$ if $k^{'} \neq k$ and to 0 if $k^{'} = k$ , and we set the initial $α^{(k)} = r^{(k)}$ $(k = 1, \dots, p)$ and $β = 1$ . The maximum numbers of iterations of apARD and pARD are 10 and 50, respectively.
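The hyperparameter step of Algorithm apARD can be sketched in plain Python as follows. The sketch reads the update of $r^{(k)}$ as assigning, for each $k^{'} \neq k$ , the value $λ (\frac{1 - ξ}{φ (w_{k^{'} k})} + \frac{ξ}{φ (\frac{1}{p} \sum_{j} w_{j k^{'}})}) + \frac{1}{\sqrt{2}}$ , which is an interpretation of the printed formula. The values of $λ$ and $ξ$ and the matrix W below are made up for illustration, and the function names are hypothetical.

```python
import math


def phi(t, eps=1e-3):
    """phi(t) = (t + eps) / (1 - t + eps) for t in [0, 1]."""
    return (t + eps) / (1.0 - t + eps)


def ard_hyperparameters(W, k, lam, xi):
    """Return (r, s) for node k, with one entry per k' != k.

    W is a p x p list of lists with entries in [0, 1] and a zero diagonal.
    """
    p = len(W)
    r = []
    for kp in range(p):
        if kp == k:
            continue
        col_mean = sum(W[j][kp] for j in range(p)) / p
        value = lam * ((1.0 - xi) / phi(W[kp][k]) + xi / phi(col_mean))
        r.append(value + 1.0 / math.sqrt(2.0))
    s = [v * v for v in r]   # s = r o r, so every entry exceeds 1/2
    return r, s


# Illustrative weights: node 1 is strongly tied to node 0, node 2 only weakly.
W = [
    [0.0, 0.5, 0.1],
    [0.5, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
r0, s0 = ard_hyperparameters(W, 0, lam=0.1, xi=0.5)
print(r0, s0)
```

Since $\frac{\partial}{\partial α} \ln P$ has the form $\frac{s - 1}{α} - r$ , the prior is a Gamma density with shape $s$ and rate $r$ , and setting $s = r \circ r$ gives a prior mean of $r$ for each $α$ . Under this reading, a pair with a larger weight receives a smaller $r$ and is therefore shrunk less.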

References

1. de Matos Simoes R., Dehmer M., Emmert-Streib F. B-cell lymphoma gene regulatory networks: biological consistency among inference methods. Front. Genet. 2013;4:281. doi: 10.3389/fgene.2013.00281.
2. Bansal M., et al. How to infer gene networks from expression profiles. Mol. Syst. Biol. 2007;3:78. doi: 10.1038/msb4100120.
3. Hawe J.S., Theis F.J., Heinig M. Inferring interaction networks from multi-omics data. Front. Genet. 2019;10:535. doi: 10.3389/fgene.2019.00535.
4. Hecker M., et al. Gene regulatory network inference: data integration in dynamic models-a review. Biosystems. 2009;96(1):86–103. doi: 10.1016/j.biosystems.2008.12.004.
5. Barabasi A.-L., Oltvai Z.N. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 2004;5(2):101–113. doi: 10.1038/nrg1272.
6. Chen G., et al. Rank-based edge reconstruction for scale-free genetic regulatory networks. BMC Bioinf. 2008;9:75. doi: 10.1186/1471-2105-9-75.
7. Zhang B., Horvath S. A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol. 2005;4:Article 17. doi: 10.2202/1544-6115.1128.
8. Sulaimanov N., et al. Inferring gene expression networks with hubs using a degree weighted Lasso approach. Bioinformatics. 2019;35(6):987–994. doi: 10.1093/bioinformatics/bty716.
9. Tan K.M., et al. Learning graphical models with hubs. J. Mach. Learn. Res. 2014;15:3297–3331.
10. Bock M., et al. Hub-centered gene network reconstruction using automatic relevance determination. PLoS One. 2012;7(5). doi: 10.1371/journal.pone.0035077.
11. Charbonnier C., Chiquet J., Ambroise C. Weighted-LASSO for structured network inference from time course data. Stat. Appl. Genet. Mol. Biol. 2010;9:Article 15. doi: 10.2202/1544-6115.1519.
12. Chen C.K. Inference of genetic regulatory networks with regulatory hubs using vector autoregressions and automatic relevance determination with model selections. Stat. Appl. Genet. Mol. Biol. 2021;20(4–6):121–143. doi: 10.1515/sagmb-2020-0054.
13. Tipping M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001;1:211–244.
14. Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. second ed. Springer; New York: 2009.
15. Defays D. An efficient algorithm for a complete-link method. Comput. J. 1977;28:364–366.
16. Sibson R. SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. 1973;16:30–34.
17. Carlson M.R., et al. Gene connectivity, function, and sequence conservation: predictions from modular yeast co-expression networks. BMC Genom. 2006;7:40. doi: 10.1186/1471-2164-7-40.
18. Dong J., Horvath S. Understanding network concepts in modules. BMC Syst. Biol. 2007;1:24. doi: 10.1186/1752-0509-1-24.
19. Gargalovic P.S., et al. Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc. Natl. Acad. Sci. U. S. A. 2006;103(34):12741–12746. doi: 10.1073/pnas.0605457103.
20. Langfelder P., Zhang B., Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. 2008;24(5):719–720. doi: 10.1093/bioinformatics/btm563.
21. R Core Team. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2023.
22. Langfelder P., Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008;9:559. doi: 10.1186/1471-2105-9-559.
23. Freeman L.C. Centrality in social networks conceptual clarification. Soc. Network. 1978/79;1:215–239.
24. Friedman J., Hastie T., Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
25. Foygel R., Drton M. Extended Bayesian information criteria for Gaussian graphical models. NIPS'10: Proceedings of the 23rd International Conference on Neural Information Processing Systems. Curran Associates Inc; 2010.
26. Grau J., Grosse I., Keilwagen J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics. 2015;31(15):2595–2597. doi: 10.1093/bioinformatics/btv153.
27. Schaffter T., Marbach D., Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–2270. doi: 10.1093/bioinformatics/btr373.
28. Bellot P., Olsen C., Meyer P.E. grndata: synthetic expression data for gene regulatory network inference. R package version 1.34.0. 2023.
29. Faith J.J., et al. Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008;36(Database issue):D866–D870. doi: 10.1093/nar/gkm815.
30. Chavez J., et al. Programmatic access to bacterial regulatory networks with regutools. Bioinformatics. 2020;36(16):4532–4534. doi: 10.1093/bioinformatics/btaa575.
31. Gama-Castro S., et al. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res. 2011;39(Database issue):D98–D105. doi: 10.1093/nar/gkq1110.
32. Kosinski M. RTCGA.rnaseq: RNA-seq datasets from The Cancer Genome Atlas Project. R package version 20151101.32.0. 2023.
33. Kanehisa M., Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27.
34. Zhang J.D. KEGGgraph: application examples. R package version 1.62.0. 2022.
35. Carlson M. org.Hs.eg.db: genome wide annotation for human. R package version 3.8.2. 2019.
36. Ritchie M.E., et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi: 10.1093/nar/gkv007.
37. Alex A., Rahnenfuhrer J. topGO: enrichment analysis for Gene Ontology. R package version 2.54.0. 2023.
38. Cornish A.J., Markowetz F. SANTA: quantifying the functional content of molecular networks. PLoS Comput. Biol. 2014;10(9). doi: 10.1371/journal.pcbi.1003808.

