Network Completion for Static Gene Expression Data

Natsu Nakajima; Tatsuya Akutsu

doi:10.1155/2014/382452

. 2014 Mar 26;2014:382452. doi: 10.1155/2014/382452

Network Completion for Static Gene Expression Data

Natsu Nakajima ¹, Tatsuya Akutsu ^1,^*

PMCID: PMC3984774 PMID: 24826192

Abstract

We tackle the problem of completing and inferring genetic networks under stationary conditions from static data, where network completion is to make the minimum amount of modifications to an initial network so that the completed network is most consistent with the expression data in which addition of edges and deletion of edges are basic modification operations. For this problem, we present a new method for network completion using dynamic programming and least-squares fitting. This method can find an optimal solution in polynomial time if the maximum indegree of the network is bounded by a constant. We evaluate the effectiveness of our method through computational experiments using synthetic data. Furthermore, we demonstrate that our proposed method can distinguish the differences between two types of genetic networks under stationary conditions from lung cancer and normal gene expression data.

1. Introduction

Estimation of genetic interactions from gene expression microarray data is an interesting and important issue in bioinformatics. There are two kinds of gene expression data: time series data and nontime series data. To estimate the dynamics of gene regulatory networks such as cell cycle and life cycle processes, various mathematical models and methods have been proposed using time series data. Since the number of observed time points in time series data is usually small, these methods suffer from low accuracies. On the other hand, a large number of nontime series data are available, for example, samples from normal people and patients of various types of diseases. Although these data are not necessarily static, we may regard these data as static data because these are averaged over a large amount of cells in rather steady states.

For inference of genetic networks, various reverse engineering methods have been proposed, which include methods based on Boolean networks [1, 2], Bayesian networks [3, 4], differential equations [5–7], and graphical Gaussian models [8–10]. Boolean networks can only be applied to inference of genetic networks from time series data because the Boolean network is intrinsically a dynamic model. Although Bayesian networks have widely been applied to analysis of static data, they can only output acyclic networks. Many methods have also been proposed using various kinds of differential equation models. However, in many cases, parameter estimation needs a huge amount of computation time. Overall, most methods suffer from inaccuracy and/or computational inefficiency and thus there is not yet an established or standard method for inference of genetic networks using only gene expression data. Therefore, it is reasonable to try to develop another approach for analysis of gene regulatory networks.

In recent years, there have been several studies and attempts for network completion, not necessarily for biological networks but also for social networks and web graphs. Different from network inference, we assume in network completion that a certain type of a prototype network is given, which can be obtained by using existing knowledge. Kim and Leskovec [11] addressed the network completion problem in which an incomplete network including unobserved nodes and edges is given and then the unobserved parts should be inferred. They proposed KronEM, which combined the Expectation Maximization with the Kronecker graphs model to estimate the missing part of the network. Guimerà and Sales-Pardo [12] presented a mathematical and computational method which can identify both missing and spurious interactions in complex networks by using the stochastic block models to capture the structural features in the network. This method was also applied to a protein interaction network of yeast. Hanneke and Xing [13] defined the network completion as a problem of inferring the rest part of the network, given an observed incomplete network sample and proposed a sampling method to derive confidence intervals from sample networks. As a related work, Saito et al. [14] developed a method to measure the consistency of an inferred network with the measured gene expression data.

Independently, Akutsu et al. [15] proposed another model of network completion in which the objective is to make the minimum amount of modifications to a given network so that the resulting network is most consistent with the observed data. Based on this concept, Nakajima et al. [16] developed a practical method, DPLSQ, for completion of genetic networks from time series data, in which addition and deletion of edges are the basic modification operations and the numbers of added and deleted edges are specified. In addition, if we begin with a network with an empty set of edges, it can be applied to network inference. DPLSQ is based on a combination of least-squares fitting and dynamic programming, where least-squares fitting is used for estimation of parameters in differential equations and dynamic programming is used for minimizing the sum of least-squares errors under the restriction of the number of added and deleted edges. Different from other heuristic or stochastic approaches, DPLSQ is guaranteed to output an optimal solution (in the sense of the minimum least-squares error) in polynomial time if the maximum indegree of nodes is bounded by a constant. Nakajima and Akutsu [17] proposed a method to complete and infer the time varying networks by extending DPLSQ so that additions and deletions of edges can be performed at several time points. However, since DPLSQ is based on a dynamic model, it cannot be applied to inference or completion of genetic networks from static data.

In this study, we propose a novel method, DPLSQ-SS (DPLSQ for Static Samples), for completing and inferring a network using static gene expression data, based on DPLSQ. The purpose of this study is twofold: first, to complete and infer gene networks from static expression profile, instead of time series data and, secondly, to investigate the relationship between different kinds of inferred networks under different conditions (e.g., comparison of normal and cancer networks obtained from samples of normal and cancer cells). Static data typically consist of expression levels of genes, which were measured at single time point but for a large number of samples. As discussed in the beginning part of this section, these types of data can be regarded as the gene expression measurements in a stationary phase. Many of static microarray data are publicly available, in particular for cancer microarray data with a relatively large size of tumor and normal samples. Therefore, it may be possible to estimate and investigate differences between cancer and normal networks. The basic strategy of DPLSQ-SS is the same as that of DPLSQ: least-squares fitting is used for parameter estimation and dynamic programming is used for minimizing the sum of least-squares errors when adding and deleting edges. In order to cope with static data, we modified the error function to be minimized. Although the idea is simple, it brings wider applicability because a large number of static gene expression data are available. We demonstrate the effectiveness of DPLSQ-SS through computational experiments using synthetic data and gene expression data for lung cancer and normal samples. We also perform computational comparison of DPLSQ-SS as an inference method with some of state-of-the-art tools using synthetic data.

2. Method

The purpose of network completion in this study is to modify a given network by making the minimum number of modifications so that the resulting network is most consistent with the observed data. Here we assume additions and deletions of edges as modification operations (see Figure 1). In the following, graph G(V, E) denotes a given network where V and E are the sets of nodes and directed edges, including loops, respectively. In this graph G, each node corresponds to a gene and each edge represents a direct regulation between two genes. We let n denote the number of genes and let V = {v ₁,…, v _n}. For each node v _i, e ⁻(v _i) and deg⁡⁻(v _i), respectively, denote the set of incoming edges to v _i and the number of incoming edges to v _i as defined below:

\begin{matrix} e^{-} (v_{i}) = {v_{j} ∣ (v_{j}, v_{i}) \in E}, \\ \deg^{-} (v_{i}) = | e^{-} (v_{i}) | . \end{matrix}

(1)

Network completion by addition and deletion of edges from m samples. The bold dashed and the thin dashed edges represent added and deleted edges, respectively.

We employ least-squares fitting for the parameter estimation and dynamic programming for identifying structure of the network. In the following we explain the algorithm of the proposed method.

2.1. Model of Nonlinear Equation and Estimation of Parameters

Since we consider static data in this paper, we adopt a mathematical model based on nonlinear equations, instead of differential equations in [16]. We assume that the static state of each node v _i is determined by the following equation:

\begin{matrix} x_{i} = a_{0}^{i} + \sum_{j = 1}^{h} a_{j}^{i} x_{i_{j}} + \sum_{j \leq k} a_{j, k}^{i} x_{i_{j}} x_{i_{k}} + b^{i} ω, \end{matrix}

(2)

where v _i₁,…, v _{i_h} are incoming nodes to v _i, x _i corresponds to the expression value of the ith gene, and ω denotes a random noise. The second and third terms of the right-hand side of the equation represent linear and nonlinear effects to node v _i, respectively (see Figure 2), where positive a _j ⁱ or a _j,k ⁱ corresponds to an activation effect and negative a _j ⁱ or a _j,k ⁱ corresponds to an inhibition effect.

Static model of genetic network. The expression level of v _i is determined from those of input nodes. a _1,2 ⁱ is a coefficient corresponding to cooperative regulation by genes v _i₁ and v _i₂ to gene v _i.

We assume that static expression data 〈y ₁(s), y ₂(s),…, y _n(s)〉, s = 1,…, m, are given, where m is the number of samples and y _i(s) denotes the expression value of node v _i in the sth sample. The parameters (i.e., a ₀ ⁱ, a _j ⁱ, a _j,k ⁱ) can be estimated by minimizing the following objective function using a standard least-squares fitting method:

\begin{matrix} S_{i_{1}, i_{2}, \dots, i_{h}}^{i} \\ = \sum_{s = 1}^{m} {| y_{i} (s) - (a_{0}^{i} + \sum_{j = 1}^{h} a_{j}^{i} y_{i_{j}} (s) + \sum_{j \leq k} a_{j, k}^{i} y_{i_{j}} (s) y_{i_{k}} (s)) |}^{2} . \end{matrix}

(3)

2.2. Completion by Addition of Edges

Once the objective function is determined, the completion procedure is the same as that for DPLSQ [16]. In order to make this paper self-contained, we also present the completion procedure here. For the simplicity, we begin with network completion by adding k edges in total so that the sum of least-squares errors is minimized.

We let σ _{k_j,j} ⁺ denote the minimum least-squares error when adding k _j edges to the jth node and they are defined as

\begin{matrix} σ_{k_{j}, j}^{+} = \min_{j_{1}, j_{2}, \dots, j_{k_{j}}} S_{j_{1}, j_{2}, \dots, j_{k_{j}}}^{j}, \end{matrix}

(4)

where each v _{j_l} must be selected from V − v _j − e ⁻(v _j). In order to avoid combinatorial explosion, we constrain the maximum k _j to be a small constant K and let σ _{k_j,j} ⁺ = +∞ for k _j > K or k _j + deg⁡⁻(v _j) ≥ n.

Here, we define D ⁺[k, i] by

\begin{matrix} D^{+} [k, i] = \min_{k_{1} + k_{2} + \dots + k_{i} = k} \sum_{j = 1}^{i} σ_{k_{j}, j}^{+} . \end{matrix}

(5)

The entries of D ⁺[k, i] can be computed by the dynamic programming algorithm as follows:

\begin{matrix} D^{+} [k, 1] = σ_{k, 1}^{+}, \\ D^{+} [k, j + 1] = \min_{k^{'} + k^{′′} = k} {D^{+} [k^{'}, j] + σ_{k^{′′}, j + 1}^{+}} . \end{matrix}

(6)

It is to be noted that D ⁺[k, n] is determined uniquely regardless of the ordering of nodes in the network. The correctness of this dynamic programming algorithm can be seen by

\begin{matrix} \min_{k_{1} + k_{2} + \dots + k_{n} = k} \sum_{j = 1}^{n} σ_{k_{j}, j}^{+} \\ = \min_{k^{'} + k^{′′} = k} {\min_{k_{1} + k_{2} + \dots + k_{n - 1} = k^{'}} \sum_{j = 1}^{n - 1} σ_{k_{j}, j}^{+} + σ_{k^{′′}, n}^{+}} \\ = \min_{k^{'} + k^{′′} = k} D^{+} [k^{'}, n - 1] + σ_{k^{′′}, n}^{+} . \end{matrix}

(7)

2.3. Completion by Addition and Deletion of Edges

The above dynamic programming procedure can be modified for addition and deletion of edges.

We let σ _{k_j,h_j,j} denote the minimum least-squares error when adding k _j edges to e ⁻(v _j) and deleting h _j edges from e ⁻(v _j), where added and deleted edges must be disjoint. As described in Section 2.2, we also constrain the maximum k _j and h _j to be small constants K and H, respectively. We let σ _{k_j,h_j,j} = +∞ if k _j > K, h _j > H, k _j − h _j + deg⁡⁻(v _j) ≥ n, or k _j − h _j + deg⁡⁻(v _j) < 0 holds. Then, the problem is stated as

\begin{matrix} \min_{\begin{matrix} k_{1} + k_{2} + \dots + k_{n} = k \\ h_{1} + h_{2} + \dots + h_{n} = h \end{matrix}} \sum_{j = 1}^{n} σ_{k_{j}, h_{j}, j} . \end{matrix}

(8)

Here, we define D[k, h, i] by

\begin{matrix} D [k, h, i] = \min_{\begin{matrix} k_{1} + k_{2} + \dots + k_{i} = k \\ h_{1} + h_{2} + \dots + h_{i} = h \end{matrix}} \sum_{j = 1}^{i} σ_{k_{j}, h_{j}, j} . \end{matrix}

(9)

Then, the network completion problem by addition and deletion of edges can be solved by using the dynamic programming algorithm as follows:

\begin{matrix} D [k, h, 1] = σ_{k, h, 1}, \\ D [k, h, j + 1] = \min_{\begin{matrix} k^{'} + k^{′′} = k \\ h^{'} + h^{′′} = h \end{matrix}} {D [k^{'}, h^{'}, j] + σ_{k^{′′}, h^{′′}, j + 1}} . \end{matrix}

(10)

We will also discuss the computational complexity of DPLSQ-SS. Since completion by addition of edges is a special case, we only analyze completion by addition and deletion of edges.

It is known that least-squares fitting for a linear system can be done in O(mp ² + p ³) time where m is the number of samples and p is the number of parameters. In our proposed method, we assume that the maximum indegree in a given network and the number of parameters are bounded by constants. In this case, the time complexity per least-squares fitting can be estimated as O(m).

Next we analyze the time complexity required for σ _{k_j,h_j,j} and D[k, h, i]. The time complexity required for computation of σ _{k_j,h_j,j} is O(mn ^K+1) [16], where the time complexity of computing the minimum least-squares for jth node depends on the upper bounds for the number of adding and deleting edges per node, K and H. In addition, the time complexity for D[k, h, i]s is O(n ³) [16], considering that the size of table D[k, h, i] is O(n ³). Therefore, total time complexity for DPLSQ-SS is

\begin{matrix} O ({m n}^{K + 1} + n^{3}) . \end{matrix}

(11)

This analysis suggests that DPLSQ-SS can be applicable to large-scale networks if K ≤ 2 and n is not too large.

If the maximum indegree of the initial network is not bounded by a constant, the time complexity per least-squares fitting increases to O(mn ⁴ + n ⁶) and the number of combinations to be examined per node increases to O(n ^H+K), as discussed in [16]. In this case, the total time complexity would be O(n ^H+K+1 · (mn ⁴ + n ⁶)), which suggests that network completion should not start with dense networks but with sparse networks.

3. Results

To evaluate the effectiveness of DPLSQ-SS, we performed two types of computational experiments using both synthetic data and real expression data. All experiments were performed on a PC with Intel Core 2 Quad CPU (3.0 GHz). We employed the liblsq library (http://www2.nict.go.jp/aeri/sts/stmg/K5/VSSP/install_lsq.html) for a least-squares fitting method.

3.1. Inference Using Synthetic Data

In order to assess the potential effectiveness of DPLSQ-SS, we begin with network inference using two kinds of synthetic data. Recall that network completion beginning with a null network corresponds to network inference.

We employed here nonlinear equations as gene regulation rules between genes. Since it is difficult to generate static data by numerical simulations, we made manually nonlinear equations with obvious solutions as the synthetic network topology and regarded each solution as static data for one sample. For example, if we make n equations with n variables, it is assumed that there exist n genes in the synthetic network. We give an example of nonlinear equations with 3 variables below:

\begin{matrix} x_{1} = x_{1}^{2} - 2, \\ x_{2} = x_{2}^{2} - 6, \\ x_{3} = x_{1} x_{2} - 1, \end{matrix}

(12)

where we assume that x _i (i = 1,…, 3) corresponds to the expression value of ith gene. Therefore, an example network consists of 3 genes and 4 edges, including self-loops. If we solve this set of equations, we can find four solutions as below:

\begin{matrix} (2,3, 5), (2, - 2, - 5), (- 1,3, - 4), (- 1, - 2,1) . \end{matrix}

(13)

Then, we can employ these solutions as synthetic data. Since the use of synthetic static data consisting only of a few solutions easily resulted in numerical calculation error, we generated additional 400 data sets for each of static solutions by adding random numbers uniformly distributed between −0.5 and 0.5.

Under the above model, we examined DPLSQ-SS for network inference, using synthetic data which is generated as described above and letting E = ∅ in the initial network. It should be noted that we let upper bounds for the number of adding and deleting edges per node to be K = 2 and H = 0, respectively. Furthermore, in order to examine the CPU time changes with respect to the size of the network, we made synthetic networks with 10 and 20 nodes by making the nonlinear equation with corresponding number of variables.

Since the number of added edges was always equal to the number of edges in the original network, we evaluated the performance of DPLSQ-SS by means of the averaged accuracy, which was defined as the number of correctly inferred edges to the number of edges in the original network (i.e., the number of added edges) and the averaged computational time over 5 modified networks.

We also compared DPLSQ-SS with two well-known existing tools for inference of genetic networks, ARACNE [18, 19] and GeneNet [9, 10]. ARACNE is based on mutual information between genes and GeneNet is based on graphical Gaussian models and partial correlations. Since both tools output only correlation values for genes, we selected the top M from them and regarded {v _i, v _j} as a correct edge if either (v _i, v _j) or (v _j, v _i) was included in the edge set of the original network. We employed datasets which were generated by the same way for DPLSQ-SS and default parameter settings for both tools. We evaluated the results by the ratio of correctly inferred edges and averaged CPU time (see Table 1). The CPU time used by ARACNE is user time + sys time and that used by GeneNet is time difference between the start time and end time.

Table 1.

Results on inference using synthetic data.

		Method
		DPLSQ-SS	ARACNE	GeneNet
n = 10	Accuracy	0.779	0.578	0.571
n = 10	CPU time (sec.)	1.784	1.113	4.020
n = 20	Accuracy	0.722	0.554	0.390
n = 20	CPU time (sec.)	14.482	4.795	4.040

Open in a new tab

The results on DPLSQ-SS and comparative methods using synthetic data show that the accuracies by DPLSQ-SS are higher than those by ARACNE and GeneNet. Although ARACNE cannot handle networks with self-loops but GeneNet can, both methods showed almost the same performance in the case of n = 10. On the whole, three methods have something in common, which perform with low accuracy as the size of the network grows. As for the CPU time, ARACNE was faster than DPLSQ-SS and GeneNet in case of n = 10. In addition, the CPU time by DPLSQ-SS increases rapidly as the size of the network grows, in contrast to those by the comparative methods. Since DPLSQ-SS works in polynomial time, if we obtain sufficient computer resource, DPLSQ-SS can handle large-scale networks. Since accuracy is the most important criterion and DPLSQ-SS is more accurate than existing methods, our proposed method might be a useful tool for network inference.

3.2. Inference Using DREAM4 Data

In this subsection, we try to evaluate the effectiveness of DPLSQ-SS and perform a comparison with other methods in order to perform an unbiased evaluation since the results in Section 3.1 are based on the mathematical model adopted by DPLSQ-SS. We used synthetic datasets generated by GeneNetWeaver (GNW) [20], which provide benchmarks and performance testing for network inference methods in the DREAM (Dialogue on Reverse Engineering Assessment and Methods) challenge (http://www.the-dream-project.org/challenges). One aim of the DREAM project is to provide benchmark data on real and simulated expression data for network inference. This challenge includes several editions, where GNW has been developed to generate genetic network motifs and simulated expression data. In this evaluation, we used the DREAM4 challenge which is divided into three subchallenges called InSilico_Size10, InSilico_Size100, and InSilico_Size100_Multifactorial, consisting of five networks.

We validated the performance using InSilico_Size10 subchallenge consisting of gold standard 10 gene networks and simulated expression data generated under different conditions (wild-type, knockouts, knockdowns, multifactorial perturbations, and time series). Since only one set of wild-type data, which corresponds to static data, is provided for each network and it is not enough for inference, we generated 500 static data sets by randomly perturbing each data as in Section 3.1. The result is shown in Table 2, where the accuracy was evaluated as in Section 3.1. It is seen from the table that the performance of any method is not good. It is reasonable because inference was preformed based on one set of expression data (i.e., m = 1) although perturbed data were also used. Although ARACNE was better than DPLSQ-SS in four cases, DPLSQ-SS was better than ARACNE in one case. DPLSQ-SS was better than GeneNet in four cases and was comparative to GeneNet in one case. This result suggests that although DPLSQ-SS is not necessarily the best for simulated data in DREAM4, it has reasonable performance when a very few samples are given.

Table 2.

Results on inference using DREAM4 data, where the accuracy is shown for each case.

	Method
	DPLSQ-SS	ARACNE	GeneNet
Insilico_size_10_1	0.2666	0.2000	0.0666
Insilico_size_10_2	0.1875	0.2500	0.1250
Insilico_size_10_3	0.1333	0.2000	0.0666
Insilico_size_10_4	0.1538	0.3076	0.0769
Insilico_size_10_5	0.0833	0.1666	0.0833

Open in a new tab

3.3. Completion Using Synthetic Data

We also examined network completion using synthetic data. In this experiment, we adopted the nonlinear equations described in Section 3.1. In order to examine network completion, we also applied DPLSQ-SS to synthetic networks, which are generated by randomly adding k edges and deleting h edges from an original network.

We assess the DPLSQ-SS performance in terms of the accuracy of modified edges and the computational time for network completion. The accuracy is defined as follows:

\begin{matrix} \frac{h + k + | E_{org} \cap E_{cmp} | - | E_{org} |}{h + k}, \end{matrix}

(14)

where E _org and E _cmp are the set of edges in the original network and the completed network, respectively. This value takes 1 if all added and deleted edges are correct and takes 0 if all added and deleted edges are incorrect. For each (k, h), we took the averaged accuracy and CPU time for completing the network over 5 modifications for 10 and 20 gene networks, where we used the default values of K = H = 2. To avoid the numerical calculation error, we also generated additional 400 data sets for each of static solutions by adding random numbers uniformly distributed between −0.5 and 0.5.

The results are shown in Table 3. It is observed that DPLSQ-SS has quite high accuracy regardless of the number of k and h except for k = h = 5. It is also seen that the CPU time increases rapidly when applied to networks with 20 genes. In comparison with the CPU time for network inference by DPLSQ-SS, there seems to be a significant difference even if n equals 10. In this study, we used the default values of K = 2 and H = 2 for network completion, which were K = 2 and H = 0 for network inference. Moreover, the number of modified edges for network inference is much larger than that for network completion. However, the latter procedure requires more CPU time than the former procedure. This result suggests that the time complexity of DPLSQ-SS depends not so much on the number of modified edges, k and h, but depends much on the number of K and H as indicated in Section 2.3.

Table 3.

Results on completion with synthetic data on DPLSQ-SS.

	Number of added edges and number of deleted edges	Accuracy	CPU time (sec.)
n = 10	k = 1, h = 1	1.000	0.720
	k = 2, h = 2	1.000	5.810
	k = 3, h = 3	1.000	4.610
	k = 4, h = 4	1.000	5.000
	k = 5, h = 5	0.700	4.410

n = 20	k = 1, h = 1	1.000	2.760
	k = 2, h = 2	1.000	51.870
	k = 3, h = 3	0.833	46.220
	k = 4, h = 4	1.000	53.880
	k = 5, h = 5	0.700	48.910

Open in a new tab

3.4. Inference Using Real Data

We also examined DPLSQ-SS for inference of gene networks from static data under multiple conditions. The aim of this experiment is to identify different static gene networks under different conditions and investigate the differences of these network topologies. We focus on the genetic network related to lung cancer and employed a partial network which contains RB/E2F pathway in human small cell lung cancer from the KEGG database [21] shown in Figure 3. RB/E2F pathway is one of two main tumor suppressor pathways and the retinoblastoma gene (RB) plays a key role in cancer [22]. RB is known to control the activity of E2F transcription factor which regulates the cell-cycle progression and E2F is under the control of both CDK4 and CCND1 (CyclinD1). It is also known that the activity of E2F plays an important role in the tumor cell proliferation and that absence of E2F leads to cancer formation. In this way, it is obvious that the gene abnormality of RB/E2F pathway is linked to cancer and there are precise differences between the cancer gene network and the normal gene network. Therefore, inferring the cancer and normal gene networks is quite meaningful. In this study, we demonstrate that DPLSQ-SS can distinguish the difference between cancer and normal networks from static expression data. Based on the KEGG database and Entrez Gene ID, we selected 9 genes shown in Table 4. We referred to RefGene (http://refgene.com/) for gene symbols and annotations and employed the resulting network as the original network.

A part of small cell lung cancer network, containing RB/E2F pathway.

Table 4.

List of gene symbols and annotations in human small cell lung cancer network.

Number	Gene symbol	Gene annotation
1	MYC	v-myc myelocytomatosis viral oncogene homolog (avian)
2	P15	Cyclin-dependent kinase inhibitor 2B
3	CDK4/6	Cyclin-dependent kinase 4
4	CCND1	Cyclin D1
5	MAX	MYC associated factor X
6	CKS1	CDC28 protein kinase regulatory subunit 1B
7	SKP2	S-phase kinase-associated protein 2 (p45)
8	CCNE1	Cyclin E1
9	E2F3	E2F transcription factor 3

Open in a new tab

As for the static expression data, we employed lung cancer microarray data obtained by Beer et al. [23]. They clustered hierarchically the gene expression profiles from lung adenocarcinoma tumor tissues and those from normal lung tissues. This data contains 86 tumor samples and 10 normal samples and is publicly available from the study of Choi and Kendziorski [24]. In order to investigate the relationship between cancer and normal gene network topologies, we performed network inference using these two types of data, where K = 2, H = 0, and k = 13 were used. In order to avoid the numerical calculation error, we also generated additional 5 data sets for each expression value by adding random numbers uniformly distributed between −0.5 and 0.5. The results are shown in Figure 4.

Results on inference with gene expression data for cancer and normal samples. We could verify the activation of RB/E2F pathway in the normal network and the inactivation of it in the cancer network. The bold dashed arrows denote the incorrectly added edges. The bold solid arrows denote the correctly added edges (i.e., added edges that are included in the original network). Since directions of edges are ignored in comparison with ARACNE and GeneNet, bold dashed arrows connecting to P15 are regarded as correct in the evaluation of accuracy in Table 5.

We also compared DPLSQ-SS with ARACNE and GeneNet using these real data. The result is shown in Table 5, where the accuracy (i.e., the ratio of the number of correctly inferred edges to the number of added edges) was calculated as in Section 3.1 and the networks obtained from the KEGG database were regarded as the correct networks. Although DPLSQ-SS was worse than ARACNE for the cancer network, it was better for the normal network. For both networks, DPLSQ-SS was better than GeneNet. This result suggests that the accuracy of DPLSQ-SS for real data is reasonable compared with existing methods.

Table 5.

Results on inference of real networks, where the accuracy is shown for each case.

	Method
	DPLSQ-SS	ARACNE	GeneNet
Normal network	0.3846	0.3076	0.0769
Cancer network	0.1538	0.3846	0.0769

Open in a new tab

Although the accuracy of DPLSQ-SS is not high for real data, there are significant differences between cancer and normal networks. The inferred normal network indicated the existence of RB/E2F pathway involved in the regulation of E2F activity. It is observed that the tumor suppressor gene P15 regulated CDK4 activity and E2F was under the regulation of both CDK4 and CCND1. On the other hand, in the inferred tumor network, we found no significant correlations between genes in RB/E2F pathway. Instead, we discovered the regulation of CCND1 and deregulation of E2F activity. It has been reported that overexpression of CDK4/6 and CCND1 and deregulated E2F could contribute to cancer progression [22, 25]. Therefore, inference of two types of networks could produce the reasonable outcome that matches biological knowledge as mentioned above and could capture the features of each network. Although there is no common edge between the inferred cancer network and the original network, it is reasonable because cancer networks may be very different from normal networks. This result suggests that our proposed method can infer different static networks under the different conditions and can identify the feature of cancer and normal networks.

4. Conclusion

In this study, we addressed the problem of completing and inferring gene networks under the stationary conditions from static gene expression data. In our approach, we defined network completion as making the minimum amount of modifications to an initial network so that the inferred network is most consistent with the gene expression data. The aim of this study is (1) to complete genetic networks using static data and (2) to investigate the differences between two types of gene networks under different conditions. In order to achieve our goal, we proposed a novel method called DPLSQ-SS for network completion and network inference based on dynamic programming and least-squares fitting. This method works in polynomial time if the maximum indegree is bounded by a constant. We demonstrated the effectiveness of DPLSQ-SS through computational experiments using synthetic data and real data. In particular, we tried to infer the normal and lung cancer networks from static gene microarray data. As the results using synthetic data, DPLSQ-SS showed relatively good performance in comparison to other existing methods. As the results using microarray data from normal and lung cancer samples, it is seen that this method allows us to distinguish the differences between gene networks under different conditions.

There is some room for extending DPLSQ-SS. For example, we employed here simple nonlinear equations as gene regulation rules, but it can be replaced by more complex types of nonlinear equations. Although DPLSQ-SS works in polynomial time, the degree of polynomial is not low, which prevents the method from being applied to completion of large networks. However, DPLSQ-SS can be highly parallelizable: σ _{k_j,h_j,j} can be computed independently for different σ _{k_j,h_j,j}s. Therefore, parallel implementation of DPLSQ-SS is also important future work. Although we have focused on completion and inference of gene regulatory networks, completion and inference of large-scale protein-protein or ChIP-chip/seq interaction networks are also important. Since the proposed method is only applicable to gene regulatory networks, extension and application of DPLSQ-SS for these networks should be studied in the future work.

Acknowledgment

This work was partially supported by Grant-in-Aid no. 22240009 from JSPS, Japan.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

1.Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing. 1998:18–29. [PubMed] [Google Scholar]
2.Akutsu T, Miyano S, Kuhara S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics. 2000;16(8):727–734. doi: 10.1093/bioinformatics/16.8.727. [DOI] [PubMed] [Google Scholar]
3.Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000;7(3-4):601–620. doi: 10.1089/106652700750050961. [DOI] [PubMed] [Google Scholar]
4.Imoto S, Kim S, Goto T, et al. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology. 2003;1(2):231–252. doi: 10.1142/s0219720003000071. [DOI] [PubMed] [Google Scholar]
5.D’Haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000;16(8):707–726. doi: 10.1093/bioinformatics/16.8.707. [DOI] [PubMed] [Google Scholar]
6.Wang Y, Joshi T, Zhang X-S, Xu D, Chen L. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics. 2006;22(19):2413–2420. doi: 10.1093/bioinformatics/btl396. [DOI] [PubMed] [Google Scholar]
7.Wang R-S, Wang Y, Zhang X-S, Chen L. Inferring transcriptional regulatory networks from high-throughput data. Bioinformatics. 2007;23(22):3056–3064. doi: 10.1093/bioinformatics/btm465. [DOI] [PubMed] [Google Scholar]
8.Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics. 2002;18(2):287–297. doi: 10.1093/bioinformatics/18.2.287. [DOI] [PubMed] [Google Scholar]
9.Opgen-Rhein R, Strimmer K. Inferring gene dependency networks from genomic longitudinal data: a functional data approach. REVSTAT. 2006;4(1):53–65. [Google Scholar]
10.Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Systems Biology. 2007;1, article 37 doi: 10.1186/1752-0509-1-37. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Kim M, Leskovec J. The network completion problem: inferring missing nodes and edges in networks. Proceedings of SIAM International Conference on Data Mining; 2011; pp. 47–58. [Google Scholar]
12.Guimerà R, Sales-Pardo M. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences of the United States of America. 2009;106(52):22073–22078. doi: 10.1073/pnas.0908366106. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Hanneke S, Xing EP. Network completion and survey sampling. Journal of Machine Learning Research. 2009;5:209–215. [Google Scholar]
14.Saito S, Aburatani S, Horimoto K. Network evaluation from the consistency of the graph structure with the measured data. BMC Systems Biology. 2008;2, article 84 doi: 10.1186/1752-0509-2-84. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Akutsu T, Tamura T, Horimoto K. Completing networks using observed data. Proceedings of the 20th International Conference on Algorithmic Learning Theory; 2009; pp. 126–140. [Google Scholar]
16.Nakajima N, Tamura T, Yamanishi Y, Horimoto K, Akutsu T. Network completion using dynamic programming and least-squares fitting. The Scientific World Journal. 2012;2012:8 pages. doi: 10.1100/2012/957620.957620 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Nakajima N, Akutsu T. Exact and heuristic methods for network completion for time-varying genetic networks. BioMed Research International. 2014;2014:14 pages. doi: 10.1155/2014/684014.684014 [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Margolin AA, Wang K, Lim WK, Kustagi M, Nemenman I, Califano A. Reverse engineering cellular networks. Nature Protocols. 2006;1(2):662–671. doi: 10.1038/nprot.2006.106. [DOI] [PubMed] [Google Scholar]
19.Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(1, article S7) doi: 10.1186/1471-2105-7-S1-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–2270. doi: 10.1093/bioinformatics/btr373. [DOI] [PubMed] [Google Scholar]
21.Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research. 2009;38(1):D355–D360. doi: 10.1093/nar/gkp896. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Nevins JR. The Rb/E2F pathway and cancer. Human Molecular Genetics. 2001;10(7):699–703. doi: 10.1093/hmg/10.7.699. [DOI] [PubMed] [Google Scholar]
23.Beer DG, Kardia SLR, Huang C-C, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002;8(8):816–824. doi: 10.1038/nm733. [DOI] [PubMed] [Google Scholar]
24.Choi Y, Kendziorski C. Statistical methods for gene set co-expression analysis. Bioinformatics. 2009;25(21):2780–2786. doi: 10.1093/bioinformatics/btp502. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Alao JP. The regulation of cyclin D1 degradation: roles in cancer development and the potential for therapeutic invention. Molecular Cancer. 2007;6, article 24 doi: 10.1186/1476-4598-6-24. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B1] 1.Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing. 1998:18–29. [PubMed] [Google Scholar]

[B2] 2.Akutsu T, Miyano S, Kuhara S. Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics. 2000;16(8):727–734. doi: 10.1093/bioinformatics/16.8.727. [DOI] [PubMed] [Google Scholar]

[B3] 3.Friedman N, Linial M, Nachman I, Pe’er D. Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000;7(3-4):601–620. doi: 10.1089/106652700750050961. [DOI] [PubMed] [Google Scholar]

[B4] 4.Imoto S, Kim S, Goto T, et al. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology. 2003;1(2):231–252. doi: 10.1142/s0219720003000071. [DOI] [PubMed] [Google Scholar]

[B5] 5.D’Haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000;16(8):707–726. doi: 10.1093/bioinformatics/16.8.707. [DOI] [PubMed] [Google Scholar]

[B6] 6.Wang Y, Joshi T, Zhang X-S, Xu D, Chen L. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics. 2006;22(19):2413–2420. doi: 10.1093/bioinformatics/btl396. [DOI] [PubMed] [Google Scholar]

[B7] 7.Wang R-S, Wang Y, Zhang X-S, Chen L. Inferring transcriptional regulatory networks from high-throughput data. Bioinformatics. 2007;23(22):3056–3064. doi: 10.1093/bioinformatics/btm465. [DOI] [PubMed] [Google Scholar]

[B8] 8.Toh H, Horimoto K. Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics. 2002;18(2):287–297. doi: 10.1093/bioinformatics/18.2.287. [DOI] [PubMed] [Google Scholar]

[B9] 9.Opgen-Rhein R, Strimmer K. Inferring gene dependency networks from genomic longitudinal data: a functional data approach. REVSTAT. 2006;4(1):53–65. [Google Scholar]

[B10] 10.Opgen-Rhein R, Strimmer K. From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Systems Biology. 2007;1, article 37 doi: 10.1186/1752-0509-1-37. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] 11.Kim M, Leskovec J. The network completion problem: inferring missing nodes and edges in networks. Proceedings of SIAM International Conference on Data Mining; 2011; pp. 47–58. [Google Scholar]

[B12] 12.Guimerà R, Sales-Pardo M. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences of the United States of America. 2009;106(52):22073–22078. doi: 10.1073/pnas.0908366106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13.Hanneke S, Xing EP. Network completion and survey sampling. Journal of Machine Learning Research. 2009;5:209–215. [Google Scholar]

[B14] 14.Saito S, Aburatani S, Horimoto K. Network evaluation from the consistency of the graph structure with the measured data. BMC Systems Biology. 2008;2, article 84 doi: 10.1186/1752-0509-2-84. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Akutsu T, Tamura T, Horimoto K. Completing networks using observed data. Proceedings of the 20th International Conference on Algorithmic Learning Theory; 2009; pp. 126–140. [Google Scholar]

[B16] 16.Nakajima N, Tamura T, Yamanishi Y, Horimoto K, Akutsu T. Network completion using dynamic programming and least-squares fitting. The Scientific World Journal. 2012;2012:8 pages. doi: 10.1100/2012/957620.957620 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17.Nakajima N, Akutsu T. Exact and heuristic methods for network completion for time-varying genetic networks. BioMed Research International. 2014;2014:14 pages. doi: 10.1155/2014/684014.684014 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 18.Margolin AA, Wang K, Lim WK, Kustagi M, Nemenman I, Califano A. Reverse engineering cellular networks. Nature Protocols. 2006;1(2):662–671. doi: 10.1038/nprot.2006.106. [DOI] [PubMed] [Google Scholar]

[B20] 19.Margolin AA, Nemenman I, Basso K, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(1, article S7) doi: 10.1186/1471-2105-7-S1-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] 20.Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–2270. doi: 10.1093/bioinformatics/btr373. [DOI] [PubMed] [Google Scholar]

[B18] 21.Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research. 2009;38(1):D355–D360. doi: 10.1093/nar/gkp896. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] 22.Nevins JR. The Rb/E2F pathway and cancer. Human Molecular Genetics. 2001;10(7):699–703. doi: 10.1093/hmg/10.7.699. [DOI] [PubMed] [Google Scholar]

[B23] 23.Beer DG, Kardia SLR, Huang C-C, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine. 2002;8(8):816–824. doi: 10.1038/nm733. [DOI] [PubMed] [Google Scholar]

[B24] 24.Choi Y, Kendziorski C. Statistical methods for gene set co-expression analysis. Bioinformatics. 2009;25(21):2780–2786. doi: 10.1093/bioinformatics/btp502. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] 25.Alao JP. The regulation of cyclin D1 degradation: roles in cancer development and the potential for therapeutic invention. Molecular Cancer. 2007;6, article 24 doi: 10.1186/1476-4598-6-24. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Network Completion for Static Gene Expression Data

Natsu Nakajima

Tatsuya Akutsu

Abstract

1. Introduction

2. Method

Figure 1.

2.1. Model of Nonlinear Equation and Estimation of Parameters

Figure 2.

2.2. Completion by Addition of Edges

2.3. Completion by Addition and Deletion of Edges

3. Results

3.1. Inference Using Synthetic Data

Table 1.

3.2. Inference Using DREAM4 Data

Table 2.

3.3. Completion Using Synthetic Data

Table 3.

3.4. Inference Using Real Data

Figure 3.

Table 4.

Figure 4.

Table 5.

4. Conclusion

Acknowledgment

Conflict of Interests

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Network Completion for Static Gene Expression Data

Natsu Nakajima

Tatsuya Akutsu

Abstract

1. Introduction

2. Method

Figure 1.

2.1. Model of Nonlinear Equation and Estimation of Parameters

Figure 2.

2.2. Completion by Addition of Edges

2.3. Completion by Addition and Deletion of Edges

3. Results

3.1. Inference Using Synthetic Data

Table 1.

3.2. Inference Using DREAM4 Data

Table 2.

3.3. Completion Using Synthetic Data

Table 3.

3.4. Inference Using Real Data

Figure 3.

Table 4.

Figure 4.

Table 5.

4. Conclusion

Acknowledgment

Conflict of Interests

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases