Reconstruction of Gene Regulatory Networks based on Repairing Sparse Low-rank Matrices

Young Hwan Chang; Roel Dobbe; Palak Bhushan; Joe W Gray; Claire J Tomlin

doi:10.1109/TCBB.2015.2465952

Author manuscript; available in PMC: 2017 Jul 1.

Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2015 Aug 7;13(4):767–777. doi: 10.1109/TCBB.2015.2465952

Young Hwan Chang ¹,✉, Roel Dobbe ¹, Palak Bhushan ¹, Joe W Gray ², Claire J Tomlin ¹,³,✉

PMCID: PMC5154690 NIHMSID: NIHMS809818 PMID: 27990101

Abstract

With the growth of high-throughput proteomic data, in particular time series gene expression data from various perturbations, a general question that has arisen is how to organize inherently heterogeneous data into meaningful structures. Since biological systems such as breast cancer tumors respond differently to various treatments, little is known about exactly how these gene regulatory networks (GRNs) operate under different stimuli. Challenges due to this lack of knowledge not only arise in modeling the dynamics of a GRN but also cause bias or uncertainty in identifying parameters or inferring the GRN structure. This paper describes a new algorithm that enables us to estimate the bias error due to the effect of perturbations and to correctly identify the common graph structure among biased inferred graph structures. To do this, we retrieve the common dynamics of the GRN subject to various perturbations. We refer to the task as “repairing”, inspired by “image repairing” in computer vision. The method can automatically repair the common graph structure across perturbed GRNs, even without precise information about the effect of the perturbations. We evaluate the method on synthetic data sets, demonstrate an application to the DREAM data sets, and discuss its implications for experiment design.

1 Introduction

One of the most exciting trends and important themes in systems biology involves the use of high-throughput measurement data to construct models of complex systems. These approaches are also becoming increasingly important in other areas of biology. While mechanistic modeling approaches should be based on prior biological understanding of the molecular mechanisms involved, a data-driven model can help us to analyze large data sets by simplifying measurements or by acquiring insight from the data sets, without having to make any assumptions about the underlying mechanism [1].

Among various data-driven modeling approaches, clustering methods are widely used on gene expression data to categorize genes with similar expression profiles [2]. In general, unraveling the complex coherent structure of the dynamics of a gene regulatory network (GRN) is the goal of high-throughput data analysis. Recently, much research has focused on time series gene expression data sets, for example, using functional data analysis techniques [3–5]. Analyzing these data sets has the advantage of being able to identify dynamic relationships between genes, since the spatio-temporal gene expression pattern results from both the GRN structure and the integration of regulatory signals. For example, drug-induced perturbation experimental data sets have been combined with temporal profiling, which provides the distinct possibility of observing the cellular mechanisms in action [6]. In cancer cells, since signaling networks frequently become compromised, leading to abnormal behaviors and responses to external stimuli, monitoring the change of gene expression patterns over time provides a profoundly different type of information. More specifically, the breast cancer that we study comprises distinct subtypes that may respond differently to pathway-targeted therapies [7]. Hence, comparing expression levels in the perturbed system with those in the original system reveals extra information about the underlying network structure. However, since the outcome of data-driven clustering or classification only represents the categorized or clustered responses, it has limitations in inferring the GRN structure directly. As a result, extra effort is needed to infer the network structure from the data.

In recent years, many data-driven inference algorithms have been developed and applied to reconstruct graph structures of GRNs from data. These include Bayesian networks, regression, correlation, mutual information, system-based approaches and l₁-penalized network inference [8–16]. Recent works [17, 18] provide a systematic method for inferring the direct dependencies in a network, corresponding to true interactions, and removing the effects of transitive relationships that result from indirect effects. Also, other works [19, 20] use systems biology approaches to model and reverse engineer gene regulatory networks from experimental data by performing successive perturbations to each modular component of the network. However, data-driven reconstruction of the network structure itself remains in general a difficult problem. Also, until recently, most studies on GRN inference have focused on exploiting a particular data set to identify the graph structure, and have applied the same method to other data sets independently. In addition, although many algorithms use time series gene expression data sets subject to drug-induced perturbations, these perturbations are either assumed to be known [21, 22] or simply ignored. However, such unknown perturbations can cause bias and/or variance in the outcome of the inference algorithm, because they act as corruptions in the measurement and the algorithm is often sensitive to such corruptions. For example, consider a simple inhibition reaction: A ⊣ B (i.e., A inhibits B) and suppose that we perturb A and B by applying two different inhibition drugs respectively. If the effects of both perturbations are dominant, we may incorrectly infer the relation between A and B (i.e., we may infer A → B) since as A decreases, B decreases. Note that each time series response represents the integration of both the gene regulatory signal and the effect of perturbations over time. Thus, in order to infer the GRN correctly, the effect of perturbations should be isolated.

Moreover, since the effects of the targeted drug can be propagated through the (unknown) underlying network over time, the dynamic responses of gene expressions can be affected directly or indirectly by the drug. For instance, when we design targeted therapies, we obviously know that the response of the target protein is perturbed, so we may assume structured perturbations. However, since these drug-induced perturbations can be propagated and also may have an effect on the other proteins directly, we might only have partial information of these perturbations. In addition, missing and corrupted data are quite common in biological data sets, and should be properly addressed.

In this paper, we propose a new method to harness various perturbation experimental data sets together, to retrieve commonalities under the sparse low-rank representation, and to improve identifiability of dynamics of GRNs, without any a priori information about the GRN structure. Intuitively, without retrieving commonalities, the inferred graph structures from each experimental data set may be biased because each data set has an inherent bias through the perturbation. Thus, the inferred graph structures may not be consistent with each other. By exploiting commonalities across the inferred graph structures, we can estimate bias error due to different perturbations, and correctly identify the common graph structure. We refer to the task as “repairing” inspired by “image repairing” in computer vision [23].

To do this, we first pose the problem as a sparse low-rank representation problem, by formulating the network inference as finding a sparsely connected structure that has low rank over multiple experiments. Inspired by repairing sparse low-rank structure [23] in the computer vision literature, we design a novel convex optimization formulation which enables us to combine temporal data sets from various perturbation experiments. The method can automatically repair the common graph structure from the data sets of perturbed GRNs, even without precise information about the effect of the perturbations. Through numerical examples, we demonstrate the advantage of both dealing with estimation of the perturbation effects and using that information to correctly learn the underlying gene regulatory structure. Also, we demonstrate a possible application using a DREAM data set [24] [25] [26] and we are currently applying this method to biological data sets in HER2 positive breast cancer [6] [7], in which the drugs perturb different parts of the network in each experiment.

The rest of this paper is organized as follows: Section 2 presents the image repairing method from computer vision that motivates our method of repairing common GRNs. In Section 3, we pose the graph inference problem as repairing a sparse low-rank representation, and we present the reconstruction of GRNs in Section 4. Section 5 demonstrates an application to the DREAM dataset and discusses the implications and limitations. Finally, conclusions are given.

2 Motivation

Although there are deep relationships between clustering and network inference, clustering gene expression data sets and inferring GRNs are tasks usually developed independently. We argue that clustering and network inference can potentially cover each other's shortcomings, since spatio-temporal gene expression patterns result from both the network structure and the integration of regulatory signals through the network [27]. For example, the seminal module networks study [28] and a recent study [29] exploit the relationship between clustering and network inference. In this paper, since we want to reveal the common graph structure of the GRN (not limited to the module level), we would like to estimate different responses across various perturbations by comparing gene expression levels under the different perturbation conditions and then correctly identify the common GRN structure.

Figure 1. Conceptual diagram of repairing the common GRN structure based on collections of time series gene expression from drug-induced perturbation (g_LAP(·), g_M(·), g_AKTi(·)) experiments in HER2 positive breast cancer. To show the analogy with repairing sparse low-rank textures [23] in computer graphics applications, each representation is presented with the corresponding illustration: input image (a), input support (b) and repairing result (c), shown in Figure S1.

In Figure 1A, we consider collections of time series gene expression of HER2 positive breast cancer cell lines [7] from pathway-targeted therapies involving drug-induced perturbation experiments (LAP, mutant, Akti; for each drug-induced perturbation, we add a single influence for a targeted protein). When a specific protein is perturbed, there are immediate effects on the target protein and compensatory responses on other proteins over time. Thus, comparing gene expression levels in the perturbed system with those in the unperturbed system reveals extra information about the different cellular mechanisms in action. A dynamical system of the GRN can be modeled as follows:

\dot{x} = \begin{cases} f(x) & \text{(w/o perturbation or wild-type)} \\ f(x) + g_{\{\cdot\}}(x) & \text{(perturbed / mutant-specific part)} \end{cases}

where x ∈ ℝⁿ denotes the concentrations of the rate-limiting species, ẋ represents the change in concentration of the species, n is the number of genes, f(·) represents the vector field of the typical dynamical system (or wild-type) and g_{·}(·) represents an additional perturbation or mutant-specific vector field (blue and red edges in Figure 1A and B). In other words, we have a unified model for wild-type cell line, ẋ = f(x) and in the perturbation case, we invoke a single change to the network topology or add a single influence for a specific gene by considering additional vector fields such as g_LAP(·), g_AKTi(·) and g_M(·). Although these additional vector fields affect only a single gene expression at time t, their influence can be propagated through the network over time.
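The last point, that a perturbation acting on a single gene still shows up in the responses of other genes, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the two-gene vector field `f`, the perturbation field `g_drug`, the step size and the forward-Euler integrator are all hypothetical choices, not taken from the paper.

```python
import numpy as np

def f(x):
    """Hypothetical wild-type vector field with steady state x = (1, 1)."""
    return np.array([1.0 - x[0], x[0] - x[1]])

def g_drug(x):
    """Hypothetical perturbation field: extra degradation of gene 1 only."""
    return np.array([-0.5 * x[0], 0.0])

def simulate(field, x0, dt=0.01, steps=500):
    """Forward-Euler integration of x' = field(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * field(x)
    return x

x0 = [1.0, 1.0]
x_wild = simulate(f, x0)                            # stays at the steady state
x_pert = simulate(lambda x: f(x) + g_drug(x), x0)   # gene 2 drifts as well
print(x_wild, x_pert)
```

Although `g_drug` has a zero second component, the second gene ends far from its wild-type level because it is driven by the first gene through `f`.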

Since each time series data set reflects the dynamic response of the GRN (ẋ = f(x)) under a drug perturbation (g_{·}(x)), we want to reconstruct the GRN by isolating these perturbation effects. By correctly inferring the bias or uncertainties (g_{·}) as shown in Figure 1B, we can correctly repair the common graph structure (ẋ = f(x)) in Figure 1C. Intuitively, we can think of these collections of time series gene expression as corrupted graphical images (a) in Figure 1 whose underlying texture shows a regular pattern. In [23], by using properties such as structured regular textures in images, the authors can correctly estimate the corrupted region (b) and perform image completion (c) by repairing the corruption in Figure 1.

3 Problem Formulation

Inspired by repairing sparse low-rank representations in computer vision [23], we first define a dynamical system whose parameters are time invariant but unknown. This is a classic way to pose the network inference problem [21, 22, 30]. By assuming sparsity of GRNs, we can rearrange the unknown GRN structure as a sparse signal. Then, we integrate different experimental data sets together and derive a sparse and low-rank matrix to be inferred from multiple time series, assuming that some of the inputs are not known and that some of the outputs are corrupted by noise. In other words, we pose a graph inference problem by formulating the network inference as finding a sparsely connected structure that has low rank over various experiments. In this section, we describe the methodological details.

3.1 Formulating Gene Regulatory Networks as a Dynamical System

We consider a dynamical system of GRN described by

y \triangleq \dot{x} = f(x) + u

(1)

where x ∈ ℝⁿ denotes the concentrations of the rate-limiting species which can be measured in experiments; $\dot{x} = [\dot{x}_1, \dot{x}_2, \dots, \dot{x}_n]^{\top} \in \mathbb{R}^{n}$ is a vector whose elements are the changes in concentration of the n species over time, which may not be measured directly in experiments but can be calculated by interpolating x and using numerical derivatives¹; f(·) : ℝⁿ → ℝⁿ represents biochemical reactions, which typically include functions of known form such as products of monomials, monotonically increasing or decreasing Hill functions, simple linear terms and constant terms, since biochemical reactions are typically governed by mass action, Michaelis-Menten, or Hill kinetics [21, 30]. Since f(x) determines how the dynamics ẋ_i of a protein i depend on the expression levels of all proteins, it contains the structural information of the network. u ∈ ℝⁿ denotes the control input, for example, drug-induced inhibition or stimulation, for which we only have partial information. For instance, when we inhibit a target protein by drug-induced perturbation, we only know that the dynamics of the targeted gene response may be affected, but we do not know how large the effect on the dynamics is or how long it persists. Moreover, this drug-induced perturbation might also directly affect other proteins in practice.
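As a rough illustration of how y = ẋ could be obtained from sampled concentrations, the sketch below applies second-order finite differences with `numpy.gradient`. This is one simple option, not necessarily the scheme the authors used; the synthetic trajectories and the sampling grid are assumptions chosen so that the exact derivative is known. Noisy measurements would typically need smoothing or spline interpolation first.

```python
import numpy as np

# Hypothetical measurements: 2 species sampled at 21 uniformly spaced times.
t = np.linspace(0.0, 10.0, 21)
X = np.column_stack([t**2, 3.0 * t])   # rows = time points, columns = species

# Second-order finite differences along the time axis; edge_order=2 keeps
# the end points at the same accuracy as the interior points.
Y = np.gradient(X, t, axis=0, edge_order=2)

print(Y[:3])
```

For these polynomial test signals the finite differences reproduce the analytic derivatives 2t and 3 up to rounding.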

The nonlinear function f(x) can be decomposed into a linear sum of scalar basis functions f_{b,i}(x) ∈ ℝ, where we select the set of possible candidate basis functions that capture fundamental biochemical kinetic laws [21, 30]:

f(x) = \sum_{i=1}^{N} f_{b,i}(x) \begin{bmatrix} q_{i1} \\ q_{i2} \\ \vdots \\ q_{in} \end{bmatrix}

(2)

where N is the number of possible candidate basis functions and q_{ij} is the coefficient of the i-th basis function for the j-th protein response. The biochemical reactions (1) can be written as follows:

y = \begin{bmatrix} \dot{x}_1 \\ \vdots \\ \dot{x}_n \end{bmatrix} = \begin{bmatrix} q_{11} & q_{21} & \dots & q_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ q_{1n} & q_{2n} & \dots & q_{Nn} \end{bmatrix} \begin{bmatrix} f_{b,1}(x) \\ \vdots \\ f_{b,N}(x) \end{bmatrix} + u = Q F_b(x) + u

(3)

where $Q \triangleq \begin{bmatrix} q_{11} & q_{21} & \dots & q_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ q_{1n} & q_{2n} & \dots & q_{Nn} \end{bmatrix} = \begin{bmatrix} q_1^{\top} \\ \vdots \\ q_n^{\top} \end{bmatrix} \in \mathbb{R}^{n \times N}$, $q_i = [q_{1i}, q_{2i}, \dots, q_{Ni}]^{\top} \in \mathbb{R}^{N}$ and $F_b(x) = [f_{b,1}(x), \dots, f_{b,N}(x)]^{\top} \in \mathbb{R}^{N}$ is the vector field which includes the possible candidate basis functions. Thus, the i-th row of Q determines the connectivity of the dynamics of the i-th protein, through the functional basis in F_b. In practice, we can construct F_b(x) by selecting the most commonly used candidate basis functions to model GRNs, for example, all monomials, binomials, other combinations or Hill functions. Thus, any biochemical reaction can be represented by a linear map Q and F_b(x), where Q reflects the influence map of the GRN structure and F_b(x) includes all possible candidate functions representing the underlying biochemical reactions. Hence, in order to infer the graph structure, we want to recover Q from the measured responses x, ẋ with the chosen basis functions F_b(x).
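To make the factorization y = Q F_b(x) + u concrete, here is a minimal sketch for a toy two-gene system. The basis `F_b`, the coefficient matrix `Q` and the state `x` are hypothetical values chosen for illustration only.

```python
import numpy as np

def F_b(x):
    """Hypothetical candidate basis: constant, two linear terms, one product."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2])

# Row i of Q holds the coefficients of gene i over the N = 4 basis functions.
# Toy network: x1' = -x1 + 0.5*x2,   x2' = 1 - x2 - 2*x1*x2
Q = np.array([[0.0, -1.0,  0.5,  0.0],
              [1.0,  0.0, -1.0, -2.0]])

x = np.array([2.0, 3.0])
u = np.zeros(2)              # no direct perturbation in this example
y = Q @ F_b(x) + u           # Equation (3); the two entries are -0.5 and -14.0
print(y)

# The nonzero pattern of Q is the influence map: (gene, basis function) pairs.
print(np.argwhere(Q != 0))
```

Recovering the zero pattern of `Q` from measured `x` and `y` is the inference task; the values themselves are secondary to the structure.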

By formulating the dynamics of GRN into Equation (3), we are able to extract the graph structure of GRN into Q. In the following section, we rearrange the dynamic equations of the GRN into the desired matrix form amenable for repairing sparse low-rank representation [23].

3.2 Organizing GRN Dynamic Equations into Sparse Low-rank Representation

Suppose the time series data are sampled from a real experimental system at discrete time points t_j. By taking the transpose of equation (3) and vectorizing Q as s, we obtain

y(t_j) = \begin{bmatrix} y_1(t_j) \\ y_2(t_j) \\ \vdots \\ y_n(t_j) \end{bmatrix} = \begin{bmatrix} F_b(x(t_j))^{\top} & 0 & \dots & 0 \\ 0 & F_b(x(t_j))^{\top} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & F_b(x(t_j))^{\top} \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix} + \begin{bmatrix} u_1(t_j) \\ u_2(t_j) \\ \vdots \\ u_n(t_j) \end{bmatrix} \triangleq \Phi(t_j) s + u(t_j)

(4)

where y(t_j) ∈ ℝⁿ, Φ(t_j)² ≜ (F_b(x)^⊤ ⊗ I_n) ∈ ℝ^{n×N·n} (⊗ denotes the Kronecker product) and u(t_j) ∈ ℝⁿ. Note that we define $s \triangleq \mathrm{vec}(Q) = [q_1^{\top}, q_2^{\top}, \dots, q_n^{\top}]^{\top} \in \mathbb{R}^{N \cdot n}$ by vectorizing Q, which represents the unknown GRN structure and is assumed to be sparse.
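The identity Φ(t_j) s = Q F_b(x(t_j)) can be checked numerically. The sketch below builds the block-diagonal matrix of Equation (4) and also the Kronecker form F_b^⊤ ⊗ I_n. Pairing the block-diagonal form with a row-stacked s and the Kronecker form with a column-major vec(Q) is our reading of the notation, not something the text states; the numerical values are hypothetical.

```python
import numpy as np

n, N = 2, 4
Q = np.array([[0.0, -1.0,  0.5,  0.0],
              [1.0,  0.0, -1.0, -2.0]])
Fb = np.array([1.0, 2.0, 3.0, 6.0])          # F_b(x(t_j)) at one time point

# Block-diagonal form of Equation (4): n copies of F_b^T on the diagonal,
# paired with s that stacks the rows q_1, ..., q_n of Q.
Phi_rows = np.kron(np.eye(n), Fb[None, :])   # shape (n, N*n)
s_rows = Q.reshape(-1)

# Kronecker form F_b^T (x) I_n, paired with the column-major vec(Q).
Phi_cols = np.kron(Fb[None, :], np.eye(n))   # shape (n, N*n)
s_cols = Q.flatten(order="F")

y = Q @ Fb
print(Phi_rows @ s_rows, Phi_cols @ s_cols, y)
```

Both constructions give the same product, so the choice only affects how the entries of s are ordered, not which entries are zero.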

Since we have partial information about u(t_j), we want to exploit this information to reconstruct s. If we inhibit the k-th gene by a drug, for example, we know that

\begin{cases} u_k(t_j) \neq 0, \\ u_l(t_j) = 0 \;\; (l \neq k), & \text{if the drug only affects the } k\text{-th gene} \end{cases}

From Equation (4), we expect that the k-th component of y(t_j) is corrupted due to unknown perturbation u_k(t_j). If we use this corrupted data to reconstruct s, the estimated q_k corresponding to this corruption (u_k(t_j) ≠ 0) might be biased. However, we can simply consider the corrupted part as unmeasured and ignore it for reconstruction, possibly using another experimental data set for this part. Since we consider various perturbation experimental data sets, we denote Equation (4) as follows:

y^{i}(t_j) = \Phi^{i}(t_j) s_j^{i} + u^{i}(t_j), \quad i = 1, \dots, p, \;\; j = 1, \dots, M

where the superscript i denotes the i-th experiment and the subscript j denotes the time step for each i-th experiment, and we have p different experiments and M time steps for each experiment. Then, we define 𝒴, 𝒬 and 𝒰 as follows:

\begin{aligned}
Y^{i} &\triangleq \begin{bmatrix} y^{i}(t_1) & y^{i}(t_2) & \dots & y^{i}(t_M) \end{bmatrix} \in \mathbb{R}^{n \times M}, \quad i = 1, \dots, p \\
&= \begin{bmatrix} \Phi^{i}(t_1) s_1^{i} & \Phi^{i}(t_2) s_2^{i} & \dots & \Phi^{i}(t_M) s_M^{i} \end{bmatrix} + \begin{bmatrix} u^{i}(t_1) & u^{i}(t_2) & \dots & u^{i}(t_M) \end{bmatrix} \\
\mathcal{Y} &\triangleq [Y^{1}, Y^{2}, \dots, Y^{p}] \\
\mathcal{Q} &\triangleq \begin{bmatrix} s_1^{1}, s_2^{1}, \dots, s_M^{1}, & s_1^{2}, s_2^{2}, \dots, s_M^{2}, & \dots, & s_1^{p}, s_2^{p}, \dots, s_M^{p} \end{bmatrix} \\
&= [S^{1}, S^{2}, \dots, S^{p}] \in \mathbb{R}^{N \cdot n \times M \cdot p} \\
\mathcal{U} &\triangleq \begin{bmatrix} u^{1}(t_1), \dots, u^{1}(t_M), & u^{2}(t_1), \dots, u^{2}(t_M), & \dots, & u^{p}(t_1), \dots, u^{p}(t_M) \end{bmatrix} \\
&= [U^{1}, U^{2}, \dots, U^{p}] \in \mathbb{R}^{n \times M \cdot p}
\end{aligned}

where Yⁱ represents the measured dynamic responses, $S^{i} \triangleq [s_1^{i}, s_2^{i}, \dots, s_M^{i}] \in \mathbb{R}^{N \cdot n \times M}$ represents the unknown GRN structure for the i-th experiment over time and Uⁱ ∈ ℝ^{n×M} represents the partially known perturbation input over time for the i-th experiment.

Without loss of generality, since the GRN structure is assumed to be sparse and not to change over time, there are many zero rows in Sⁱ, and hence there are many zero rows in 𝒬. For example, if the parameters of the influence map do not change over time, Sⁱ ∈ ℝ^{N·n×M} can be represented by s̄ⁱ · 1^⊤ (i.e., Sⁱ has rank 1, where s̄ⁱ ∈ ℝ^{N·n×1} is assumed to be sparse and 1^⊤ ∈ ℝ^{1×M}). Moreover, although all treatments result in down-regulation or up-regulation of gene regulatory signals, these can be well represented by Φⁱ(·) and the topology of the underlying influence map may not be changed. Therefore, 𝒬 ∈ ℝ^{N·n×M·p} can be well represented by a sparse and low-rank matrix. For instance, if the underlying graph structure is r-sparse, then 𝒬 can be represented by $\mathcal{Q} = \bar{S}_{N \cdot n \times r} \cdot \bar{T}_{r \times M \cdot p}$ where r ≪ min(N · n, M · p). For 𝒰, we have (partial) information on the structure of the corrupted region since we have information about the drug perturbations, even without precise information about their effects.
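The structure described above can be seen on a small synthetic instance. In the sketch below, the coefficient vectors `s_bar`, the number of time steps and the number of experiments are hypothetical; the point is only that time-invariant parameters give rank-1 blocks, and that a shared support gives all-zero rows in 𝒬.

```python
import numpy as np

M, p = 5, 2                                   # time steps, experiments
# Hypothetical sparse coefficient vectors sharing one support (N*n = 8).
s_bar = [np.array([0.0, -1.0, 0.5, 0.0, 1.0, 0.0, -1.0, -2.0]),
         np.array([0.0, -0.8, 0.5, 0.0, 1.0, 0.0, -1.0, -1.5])]

# Time-invariant parameters: S^i = s_bar^i * 1^T, so each block has rank 1.
S = [np.outer(s, np.ones(M)) for s in s_bar]
Q_cal = np.hstack(S)                          # shape (N*n, M*p)

print(Q_cal.shape)
print([np.linalg.matrix_rank(Si) for Si in S])
print(np.linalg.matrix_rank(Q_cal))
print(int(np.sum(~Q_cal.any(axis=1))))        # number of all-zero rows
```

Here the rank of 𝒬 is at most p because each experiment contributes one distinct column pattern, and it would drop to 1 if the parameters were identical across experiments.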

By formulating the GRN as a dynamical system, we can construct a sparse low-rank matrix 𝒬, which enables us to use the sparse low-rank texture repairing method [23]. More precisely, we consider a model whose structure is the same across all conditions but we allow parameter variations in 𝒬. Alternatively, one may simply apply an l₁-penalized method to all experimental data sets together, i.e., a single interconnection matrix for all conditions. However, this cannot handle the direct input perturbations properly. Also, since each experimental condition may affect the system dynamics differently, one model with fixed parameters may not be able to represent all the experimental data, while the proposed method is able to handle even a partially triggered GRN; for example, we could simply set the corresponding parameters to zero for the non-triggered part of the GRN.

Then, we ought to prefer the lower-rank solution in 𝒬 because we want to encourage a common GRN structure across various conditions. By doing this, we can infer the direct effects of these perturbations 𝒰. Note that in many GRN inference problem settings [21, 22, 30], each experimental data set is used independently, but this may lead the inferred GRN to be biased, since we do not know the effect of the direct perturbation exactly and the GRN may respond differently under various conditions. For instance, if a certain experimental data set reflects the network dynamics only partially, then the inferred GRN can be biased or inconsistent with other inferred GRNs. In other words, we cannot expect to observe or infer the same GRN across various conditions by applying each data set independently or without handling the direct input properly. However, since each inferred GRN still reflects the common uncorrupted part of the GRN, we can repair the entire GRN by integrating all experimental data sets together and estimating 𝒰 simultaneously. In order to do this, we make the following assumptions:

Assumption 1 𝒬 can be represented by a matrix of sparse and low rank. More precisely, parameters of the influence map are assumed to be fixed over time for each experiment.

Assumption 2 𝒰 is (partially) block-sparse and these nonzero blocks can be distributed uniformly by designing experiments. Also, we have partial information about the position of these blocks.

Assumption 1 asserts that 𝒬 can be represented by a sparse low-rank matrix, so that we can correctly repair the common graph structure from various perturbation experimental data sets. Without loss of generality, since GRNs are assumed to be sparse and 𝒬 denotes the underlying GRNs over time and across different experiments, the structure of 𝒬 has very low intrinsic dimensionality (sparse low-rank). Assumption 2 then states that our perturbations should excite the network uniformly, in order to retrieve the common structural and temporal information from which we can correctly repair the common GRN structure. Intuitively, if we corrupt or block entire rows of the image (a) in Figure 1, there is no way to correctly repair these rows. Similarly, if the responses of a specific protein are always corrupted directly by drug-induced perturbation across all experiments, there is no way to repair the corresponding structure. With this notion, as the size of the network increases, we may need more experiments, i.e., perturbations should excite the network uniformly well. Thus, the method can be scalable for larger networks under these assumptions.

4 Reconstruction of GRNs via Repairing 𝒬

Having constructed the desired matrix form in the previous section, we now show how to harness both the sparse and the low-rank structure for inferring the common graph structure from various perturbation experimental data sets.

4.1 Repairing 𝒬 by Refining Support Estimation

We consider the following optimization problem:

\min_{\mathcal{Q}, \mathcal{U}} \; \|\mathcal{Q}\|_{*} + \lambda \|\mathcal{Q}\|_{1} + \alpha \|\mathcal{U}\|_{1} \quad \text{s.t.} \quad P_{\Omega_i}(y^{i}(t_j)) = P_{\Omega_i}(\Phi^{i}(t_j) s_j^{i} + u^{i}(t_j)), \quad i = 1, \dots, p, \;\; j = 1, \dots, M

(5)

where λ and α are weighting parameters which trade off the rank and sparsity of the recovered graph structure, and the influence of the drug-induced perturbation, respectively. In practice, we can use these as tuning parameters to extract a meaningful graph structure and recover the common graph structure, which can be represented by the sparse low-rank matrix 𝒬, from various drug-induced perturbation data sets. For example, if we want to find the common core structure of the GRN, we may set both λ and α to small values so that the rank term, which encodes commonality, dominates the objective. Thus, by adjusting these parameters, we can also narrow down the key components of the GRN structure. Also, here we define a linear operator P_{Ω_i}(·) that restricts the equality to the entries belonging to Ω_i, and we could consider a simple set in ℝⁿ [23]:

\Omega_i = \{ (l, j) \mid |u_l^{i}(t_j)| \leq \varepsilon \}

(6)

for some threshold ε > 0. Thus, $P_{\Omega_i}(y^{i}(t_j)) = P_{\Omega_i}(\Phi^{i}(t_j) s_j^{i} + u^{i}(t_j)) \approx P_{\Omega_i}(\Phi^{i}(t_j) s_j^{i})$. Since the support need not be precise, the proposed method is inherently robust with respect to noise in the data sets. By doing this, we can not only obtain the support of the corrupted regions but also reduce the effect of noise in the integrated data sets. Since missing and corrupted data are quite common in biological data sets, we can address these uncertainties properly. For example, if the model (2) cannot fully capture the complexities of an underlying biological system, the method can still handle uncertainties such as the effects of hidden nodes, noise and deficiencies in the model through the set Ω_i and the additional input term 𝒰. Also, we could estimate the support of uⁱ using a more sophisticated model to encourage additional structures such as spatial or temporal continuity [23] or to incorporate a priori information such as positive perturbation ($u_l^{i}(t_j) > \varepsilon$) or negative perturbation ($u_l^{i}(t_j) < -\varepsilon$).
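The set Ω_i of Equation (6) and the operator P_{Ω_i} amount to a boolean mask over (gene, time) entries. The following sketch assumes a hypothetical prior guess `U_guess` of where one drug acts; the threshold and the data matrix `Y` are also made up for illustration.

```python
import numpy as np

eps = 1e-3
# Hypothetical prior guess of the perturbation for one experiment:
# rows = genes (n = 3), columns = time points (M = 4); the drug hits gene 2.
U_guess = np.array([[0.0, 0.0, 0.0, 0.0],
                    [0.9, 0.7, 0.4, 0.0],
                    [0.0, 0.0, 0.0, 0.0]])

# Equation (6): Omega_i collects the entries with |u_l^i(t_j)| <= eps.
omega = np.abs(U_guess) <= eps

def P_omega(A, mask):
    """Keep the entries inside the mask and zero out the rest."""
    return np.where(mask, A, 0.0)

Y = np.arange(12, dtype=float).reshape(3, 4)
print(int(omega.sum()))           # 9 trusted entries out of 12
print(P_omega(Y, omega)[1])       # targeted gene: only the last sample is kept
```

Requiring equality of the masked arrays is equivalent to enforcing the constraint of (5) only on the entries in Ω_i.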

Since we have partial information about uⁱ(t_j), we only use the uncorrupted information to reconstruct the graph structure. For the corrupted part, we estimate the corruption signal $u_l^{i}(t_j)$, update the set Ω_i, and solve the optimization iteratively. We iterate between the reconstruction and the refinement of the support estimate as follows:

\begin{aligned}
(\mathcal{Q}^{k}, \mathcal{U}^{k}) &= \arg\min_{\mathcal{Q}, \mathcal{U}} \; \|\mathcal{Q}\|_{*} + \lambda \|\mathcal{Q}\|_{1} + \alpha \|\mathcal{U}\|_{1} \quad \text{subject to} \quad P_{\Omega_i^{k}}(y^{i}(t_j)) = P_{\Omega_i^{k}}(\Phi^{i}(t_j) s_j^{i} + u^{i}(t_j)) \\
\Omega_i^{k+1} &= \Omega_i^{k} - \operatorname{supp}(U^{i,k}), \quad \text{where } \mathcal{U}^{k} = \begin{bmatrix} U^{1,k} & \dots & U^{p,k} \end{bmatrix}
\end{aligned}

(7)

where the superscript k represents the iteration step and the support of a function (supp(·)) is the set of points at which the function is nonzero. We iterate the above procedure (7) until it converges, and then we can recover the optimal 𝒬* and estimate the corresponding 𝒰*. For larger networks, we can use the linearized alternating direction method to solve (7) efficiently [23].
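Two pieces of the iteration (7) can be written down without committing to a particular solver: the value of the objective and the support update. The sketch below is illustrative only; the function names, the tolerance `tol` used to decide what counts as nonzero, and the small test matrices are assumptions, and the convex solve itself is deliberately left out.

```python
import numpy as np

def objective(Q_cal, U_cal, lam, alpha):
    """Value of ||Q||_* + lam*||Q||_1 + alpha*||U||_1 from Equation (5)."""
    nuclear = np.linalg.norm(Q_cal, ord="nuc")
    return nuclear + lam * np.abs(Q_cal).sum() + alpha * np.abs(U_cal).sum()

def refine_support(omega, U_est, tol=1e-8):
    """Support update of (7): remove supp(U_est) from the trusted set."""
    return omega & ~(np.abs(U_est) > tol)

Q_cal = np.diag([3.0, 4.0])
U_est = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(objective(Q_cal, U_est, lam=0.5, alpha=2.0))   # 7 + 3.5 + 4 = 14.5

omega = np.array([[True, True], [False, True]])
print(refine_support(omega, U_est))
```

In a full implementation these two functions would wrap a convex solver for the first line of (7), and the loop would stop once the mask no longer changes.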

4.2 Handling a Large Number of Candidate Basis Functions

In computer graphics applications, although being low-rank is a necessary condition for most regular, structured images, it is certainly not sufficient [23]. In order to repair a more realistic regular or near-regular pattern (typically piecewise smooth), Liang et al. consider additional structures by introducing certain transformed domains [23]. In the biological setting, since we select the set of possible candidate basis functions that capture fundamental biochemical kinetics and the number of sample time steps (M) is limited in biological data sets, the number of rows in 𝒬 is easily greater than the number of columns. Thus, 𝒬 becomes a tall matrix which easily has full column rank. Since we want to encourage the common graph structure across column spaces, being low-rank may not be sufficient to repair the same structure, especially when considering a large number of basis functions with limited time samples. For example, when we consider a tall matrix, reducing the rank of 𝒬 may not encourage the common graph structure across the different experiments; there could be variations across the horizontal direction without affecting the rank or sparsity of the matrix. Hence, in order to recover a more realistic regular or near-regular pattern across the column space of the tall matrix 𝒬, we modify the above convex program in Equation (5) as follows:

\min_{\mathcal{Q}, \mathcal{U}} \; \|\mathcal{Q}\|_{TV} + \lambda \|\mathcal{Q}\|_{1} + \alpha \|\mathcal{U}\|_{1} \quad \text{s.t.} \quad P_{\Omega_i}(\psi_j^{i} y^{i}(t_j)) = P_{\Omega_i}(\psi_j^{i} \Phi^{i}(t_j) s_j^{i} + \psi_j^{i} u^{i}(t_j)), \quad i = 1, \dots, p, \;\; j = 1, \dots, M

(8)

where instead of the nuclear norm ‖𝒬‖_*, we minimize the total variation ‖𝒬‖_TV defined by:

\|\mathcal{Q}\|_{TV} \triangleq \sum_{i=2}^{M \cdot p} \left| \mathcal{Q}_{:,i} - \mathcal{Q}_{:,i-1} \right|

(9)

where 𝒬_{:,i} denotes the i-th column of 𝒬. Also, compared to image repairing, we have extra information such as the time invariance of the GRN structure and thus we can impose additional constraints. For example, the parameters of the influence map for each experiment i are assumed to be fixed over time (i.e., $s_j^{i} = \bar{s}^{i}$), following Assumption 1. Thus, we regularize the variation across column spaces in 𝒬 in order to recover a meaningful common GRN structure; otherwise, for example, if we use ‖𝒬‖_* with a large number of basis functions, it is often hard to repair a common or near-common graph structure.
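The motivation for replacing the nuclear norm can be seen on a small tall matrix: column-to-column variation can leave the rank and the sparsity pattern unchanged while the total variation picks it up. In the sketch below, reading |·| in Equation (9) as the l₁ norm of the column difference is an assumption, and the vector `s` and the scaling factors are hypothetical.

```python
import numpy as np

def total_variation(Q_cal):
    """Sum over adjacent columns of the l1 norm of their difference.

    Reading |.| in Equation (9) as the l1 norm is an assumption.
    """
    return np.abs(np.diff(Q_cal, axis=1)).sum()

s = np.array([0.0, -1.0, 0.5, 0.0, 1.0, 0.0])
constant = np.column_stack([s, s, s, s])               # same parameters in every column
drifting = np.column_stack([s, 1.1 * s, 0.9 * s, s])   # parameters vary by column

for name, A in [("constant", constant), ("drifting", drifting)]:
    print(name, np.linalg.matrix_rank(A), total_variation(A))
```

Both matrices have rank 1 and the same zero rows, so the nuclear-norm and l₁ terms cannot tell the structure apart; only the total variation is zero for the time-invariant case.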

Lastly, we introduce transformations $\psi_j^{i}$ in Equation (8) by which the components of the sensing matrix Φⁱ(t_j) can be made more uniformly distributed, so that we reduce the coherence and improve identifiability, as discussed in [21, 22]. Here, we simply use a randomly chosen matrix for $\psi_j^{i}$. Since randomly chosen matrices spread out the components of Φⁱ(t_j) and uⁱ(t_j) uniformly, this helps to differentiate the influence of highly correlated bases in Φⁱ(t_j) in practice.
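A small numerical check shows what the transformation does to the linear model. The sketch assumes a Gaussian random matrix for ψ (the text only says "randomly chosen"), and all numerical values are hypothetical. It does not measure coherence; it only verifies that the transformed data obey the same model and that the rows of the sensing matrix become dense.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 2
Fb = np.array([1.0, 0.5])
Phi = np.kron(np.eye(n), Fb[None, :])     # block-diagonal sensing matrix, (n, N*n)
s = np.array([0.0, 1.0, -1.0, 0.0, 0.5, 0.0])
u = np.array([0.0, 0.3, 0.0])
y = Phi @ s + u

psi = rng.standard_normal((n, n))         # hypothetical Gaussian choice of psi

# The transformed data still satisfy the same linear model.
lhs = psi @ y
rhs = (psi @ Phi) @ s + psi @ u
print(np.allclose(lhs, rhs))

# Each row of Phi touches N columns; rows of psi @ Phi touch all N*n columns.
print(np.count_nonzero(Phi, axis=1), np.count_nonzero(psi @ Phi, axis=1))
```

Note that ψ also mixes the entries of u, so a perturbation confined to one gene is spread over all transformed measurements.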

5 Results and Discussion

We present more detailed examples for problem formulation and also evaluate the performance of reconstruction results with synthetic experimental data sets for both linear and nonlinear systems (see Supplementary Information for details). In this section, we demonstrate the practical relevance of the proposed method by applying it to the DREAM4 in silico Network Challenge dataset [24] [25] [26], a benchmark suite for performance evaluation of methods for gene network inference. Instead of random graph models, this dataset is generated by biologically plausible in silico networks, for example, by extracting sub-networks from transcriptional regulatory networks of E. coli and S. cerevisiae [24]. Also, time series datasets are generated from these networks using adequate dynamical models such as a detailed kinetic model of gene regulation. First, we show the result of GRN reconstruction based on the proposed method. Then, we interpret the implications of our result and use it to learn the relationship between the data and the identifiability of the proposed method.

5.1 Application to the DREAM4 in silico Network Challenge dataset

For simplicity of analysis and explanation, we consider networks of size 10 and focus on their time series datasets with all perturbations. Each perturbation only affects about a third of all genes, but the basal activation of these genes can be strongly increased or decreased. The genes that are directly targeted by the perturbation may then cause a change in the expression level of their downstream target genes, leading to an indirect effect. As such, these experiments try to simulate physical or chemical perturbations applied to the cells, which would then cause some genes, via regulatory mechanisms, to have an increased or decreased basal activation.

The perturbations increase or decrease the basal activation of genes of the network simultaneously as shown in Figure 2(a). Each data set contains time courses showing how the GRN responds to a perturbation and how it relaxes upon removal of the perturbation. We consider 5 different time series and each time series has 21 time points (sampled every 50 steps). At t = 0, a perturbation is applied to the network, for example, a drug being added. The first half of the time series (until t = 500) shows the response of the network to the perturbation which is constantly applied from t = 0 to t = 500. At t = 500, the perturbation is removed and, thus, the second half of the time series (until t = 1000) shows how the gene expression levels go back from the perturbed to the unperturbed steady state. Since there are two different modes in time courses (i.e., with perturbation and without perturbation), in order to use all the time points, we should handle this perturbation condition properly. Otherwise, one model may not be able to fit both the first half and the second half of the time series.
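The sampling schedule and the two modes of each time course can be written down explicitly. In the sketch below, assigning the sample at t = 500 to the relaxation phase is an assumption (the text says only that the perturbation is removed at t = 500), and the variable names are illustrative.

```python
# Sampling schedule described in the text: 21 points, one every 50 steps.
times = list(range(0, 1001, 50))

# Assumption: samples with t < 500 are "perturbation on"; the sample at
# t = 500 is assigned to the relaxation phase.
perturbed_phase = [t for t in times if t < 500]
relaxation_phase = [t for t in times if t >= 500]

# Indicator that a single model could use to switch the perturbation input
# on and off, so that both halves of the time series can be fitted together.
perturbation_on = [1 if t < 500 else 0 for t in times]

print(len(times), len(perturbed_phase), len(relaxation_phase))
```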

Figure 2. DREAM4 *in silico* dataset [24]: time series gene expressions are obtained by applying various perturbations to the original network in five different experiments. (a) Raw data plots show dynamic responses of all genes across various perturbations. (b) Possible separation of the raw data based on the experiment design information shown in Table 1, where gray color denotes responses (possibly) corrupted by direct perturbations. One may consider the directly perturbed gene responses (gray color) as the corrupted image shown in A and the other responses (non-gray color) as the uncorrupted images shown in B.

Table 1 lists the 5 different perturbation conditions. For example, in Exp#3 (the third row), Gene1 (G1), Gene2 (G2) and Gene8 (G8) are inhibited by drugs. Note that this information is known, i.e., we know whether a response is corrupted by a perturbation or not, but we do not know how much the perturbation affects the GRN response. Each treatment might also affect other genes, of which we have no a priori knowledge.

Table 1.

5 different experiment conditions where Ihb. represents inhibition and Sti. represents stimulation. This information can be used for initializing the support Ω_i in Equation (6) and separating perturbed and unperturbed responses shown in Figure 2(b).

	G1	G2	G3	G5	G7	G8	G9	G10
Exp#1	Ihb.
Exp#2	Ihb.				Ihb.			Sti.
Exp#3	Ihb.	Ihb.				Ihb.
Exp#4					Ihb.	Ihb.	Ihb.
Exp#5			Sti.	Ihb.	Ihb.		Ihb.

In order to infer the GRN structure using these time series gene expression data under various perturbations, we should identify how these perturbations affect a change in the expression level of the targeted genes. Otherwise, the inferred GRN can be biased or may only represent a partial structure of the whole GRN. To do this, we incorporate all data sets together and take advantage of the common structure of GRNs across the inferred GRNs. Since we only have partial information about the exact extent of the perturbations (or corruptions) as shown in Table 1, we should consider the (possibly) corrupted response as unmeasured and ignore it for reconstruction. By using the information in Table 1, we can initialize the support Ω_i in Equation (6) and further refine this support iteratively in Equation (7).
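The initialization of the support from the experiment design can be sketched as a simple lookup. The dictionary `PERTURBED` below is transcribed from Table 1 as it appears in this manuscript rendering, so it should be checked against the original table; the function name and the data layout are illustrative assumptions, and the iterative refinement of Equation (7) is not shown.

```python
GENES = [f"G{k}" for k in range(1, 11)]
EXPERIMENTS = [f"Exp#{k}" for k in range(1, 6)]

# Directly perturbed genes per experiment, as read from Table 1.
# Inhibition and stimulation are both treated as "possibly corrupted".
PERTURBED = {
    "Exp#1": {"G1"},
    "Exp#2": {"G1", "G7", "G10"},
    "Exp#3": {"G1", "G2", "G8"},
    "Exp#4": {"G7", "G8", "G9"},
    "Exp#5": {"G3", "G5", "G7", "G9"},
}


def initial_support(perturbed):
    """support[exp][gene] is True when the response is kept for reconstruction."""
    return {
        exp: {gene: gene not in perturbed[exp] for gene in GENES}
        for exp in EXPERIMENTS
    }


support = initial_support(PERTURBED)

# For each gene, the experiments whose responses are treated as measured.
usable = {g: [e for e in EXPERIMENTS if support[e][g]] for g in GENES}
print(usable["G2"])
print(usable["G4"])
```

With this initialization, G2 drops Exp#3 and G3 drops Exp#5, while G4 keeps all five experiments, matching the separation described for Figure 2(b).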

Figure 2(b) shows the initial separation of the (possibly) perturbed responses and the unperturbed responses based on the initial support. For example, from Table 1 and Figure 2(a), for Gene2 (G2), we know that we have to ignore Exp#3's response which contains the (unknown) influence from perturbation (shown in gray color), and instead use the other experimental data (non-gray color) as shown in Figure 2(b). Similarly, for Gene3 (G3), we ignore Exp#5's response but for Gene4 (G4), we can use all responses since there are no direct perturbations for G4. Intuitively, one may consider the directly perturbed gene responses (gray color) in Figure 2(b) as the analog of the corrupted parts of an image shown in Figure 2A and the other responses (non-gray color) as the analog of the uncorrupted parts of the image shown in Figure 2B.

Figure 3 shows the reconstruction result, the estimated corruption and the inferred GRN. In Figure 3(a)A, the first column represents the true GRN structure, where red represents an activation edge and green represents an inhibition edge. Since the proposed method uses both the sparsity of 𝒬 (which encourages sparsity of GRNs) and its low rank (which encourages commonalities across the inferred GRNs), we can reconstruct the common GRN, most of which is consistent with the true GRN (the first column). Also, we can infer and estimate how much each perturbation affects the dynamics of the GRN, as depicted in Figure 3(a)B. Since we choose the set of possible candidate basis functions in Equation (3) and assume that the commonality is uniform across all genes and experiments, a small fraction of the reconstructed GRN may be inconsistent with the true GRN.

Figure 3. Reconstruction results: (a) reconstructed GRNs (A), where red color denotes an activation edge and green color denotes an inhibition edge, and estimated corruption (B), where red color denotes positive values and green color denotes negative values. Note that the estimated corruptions represent temporal profiles which directly affect ẋ; for example, when we perturb Gene1 (G1) in Exp#1, green color represents inhibition of G1 by the drug perturbation, and red color after green represents the effect of removing the drug perturbation. (b) Inferred GRN, where solid lines denote true positives (consistent with the true GRN), dotted lines denote false negatives (missing links) and dashed lines denote false positives (red: activation, blue: inhibition). Analyses and further details of these results are presented in Section 5.2.

5.2 Implications

The proposed method is able to detect unknown influences caused by perturbations and then correctly repair the common graph structure across perturbed GRNs by isolating these effects in GRN inference.

In this section, we first analyze the reconstruction result together with the data set. Then, we discuss the implications of the proposed method through these analyses, discussing the relationship between the data and the identifiability of the network. We explain how one could optimize the experimental design to improve the identifiability of the network for the proposed method.

5.2.1 Existence of (dominant) common dynamic responses

In order to estimate the effect of perturbations, the proposed method retrieves the common dynamics of the GRN subject to various perturbations. In other words, the proposed method uses the low rank of 𝒬 (or a combination of ‖·‖₁ and total variation) to extract commonalities across different datasets. Thus, if these commonalities are not well exposed in the dataset, the method may fail to recover the corresponding components. In practice, we can determine whether the low-rank component reflects a dominant response by plotting the responses of output genes with respect to the input genes, as shown in Figures 4 and 5. By doing this, we can also validate the reconstruction results.

Figure 4. Responses of G2, G3, G4 and G5 with respect to G1 (influence from G1): each x-axis represents the input response (i.e., G1) and each y-axis represents the output responses (G2, G3, G4, and G5). (A) responses of G2 vs. G1 (B) responses of G3 vs. G1 (C) responses of G4 vs. G1 (D) responses of G5 vs. G1 (C′) (normalized) responses of G4 vs. G1 (D′) (normalized) responses of G5 vs. G1. Gray color represents the corrupted responses from perturbations, which are ignored for reconstruction. For example, G2 (A) is directly perturbed in Exp#3, G3 (B) is directly perturbed in Exp#5 and G5 (D) is directly perturbed in Exp#5, as shown in Table 1.

Figure 5. Input-output responses: each x-axis represents the input response and each y-axis represents the output response. (A/A′) responses of G2 with respect to G6 and G8 (B/B′) responses of G3 with respect to G9 and G10 (C/C′) responses of G4 with respect to G9 and G10. Gray color represents the responses corrupted by perturbations, which are ignored for reconstruction.

For instance, in Figure 3(b), the reconstruction result shows that we recover G1→G4 and G1⊣G5 but fail to recover G1⊣G2 and G1→G3 (dotted lines, missing links). In order to further investigate the reason for these false negatives, we plot the responses of G2, G3, G4 and G5 with respect to G1 in Figure 4. Again, the gray color denotes the responses corrupted by perturbations. For example, in Figure 4A, G2 is directly perturbed in Exp#3 (gray color) and, thus, we ignore it for reconstruction.

Figure 4A: Since the responses of G2 do not vary with respect to the responses of G1, there is no way to infer this connection (G1⊣G2) from this data set. Only the dataset for Exp#3 contains this inhibition response; however, it also contains the unknown direct perturbation effect. Thus, the reconstruction result with this missing link is actually the best result for this dataset, as it avoids overfitting. Note that the response of G2 in Exp#4 (cyan color in Figure 4A) is not caused by G1, because G1 shows a steady-state response.
Figure 4B: Given the data of Exp#1 and Exp#3, the response of G3 is more likely to be governed by (G1→G4⊣G3). Exp#2 is the only dataset that reflects the relationship (G1→G3). Thus, since the dynamic response corresponding to (G1→G3) is not a dominant common response for this dataset, we cannot reconstruct the corresponding GRN (G1→G3).
Figure 4C and Figure 4D: Since all the responses show the consistency or the common dynamic responses, we can capture the true connections such as (G1→G4) and (G1⊣G5). Also, Figure 4C shows the effect of activation (positive correlation) and Figure 4D represents the effect of inhibition (negative correlation). For example, in Figure 4C, as G1 decreases, G4 decreases. On the other hand, in Figure 4D, as G1 decreases, G5 increases. In Figure 4C′ and Figure 4D′, we plot normalized responses to show the common dynamic responses clearly. Since dynamic features can be represented by possible candidate basis functions in Equation (3), the sparse low-rank representation can capture the commonality of the GRN structure.

This result implies that, since the low rank of 𝒬 encourages commonalities across the inferred GRNs, it is challenging to infer an edge for which there are no dominant common responses. In the context of image repairing, if a certain part of the sparse low-rank texture is not well exposed due to corruptions, we may not be able to repair that texture properly.

Therefore, in order to reconstruct the GRN exactly, we have to design experiments with various perturbations that cause the underlying system to be perturbed and excited uniformly well. Or, we may have to consider different weighting factors across each gene for extracting the commonalities properly. For example, since we have more (uncorrupted) responses of G4 and G6, we can penalize the commonality more on G4 and G6. On the other hand, if there are only a few meaningful responses of a certain node, for example G3 in Figure 5B, we can reduce the weighting factor of commonality for that specific node.

Also, in practice, more information on drug perturbation (i.e., the GI-50 value) can help refine the effect of (unknown) perturbation. For example, in Figure 4A, if we know the effect of drug perturbation (i.e., independent dose-response data for drug), we can use the dataset for Exp#3 which contains the inhibition response (G1⊣G2) to infer the corresponding connection by isolating the effect of perturbation.

5.2.2 Avoiding overfitting

As we discussed for Figure 4A, our method avoids overfitting and thus fails to infer the true inhibition (G1⊣G2), which could only be identified from corrupted data (gray color in Figure 4A). Similarly, in Figure 3(b), we reconstruct G8⊣G2 but fail to recover G6⊣G2. Figure 5A shows the response of G2 vs. G6. Since this response shows positive correlation, which is inconsistent with inhibition (G6⊣G2), we fail to recover G6⊣G2. On the other hand, Figure 5A′ shows the response of G2 vs. G8, which matches the effect of inhibition. Thus, we are able to capture this edge.

Note that in Figure 5A′, the response of Exp#3 (gray color) shows both negative correlation (corresponding to true inhibition) and positive correlation (corresponding to direct perturbations). Since both G2 and G8 are inhibited by perturbations in Exp#3, if the effects of both perturbations are dominant, the response can show positive correlation. Thus, the responses driven by direct perturbations can distort the relationship between genes and lead to incorrect GRN inference. Therefore, if we cannot estimate and isolate the effect of these perturbations, we should ignore perturbed responses to avoid overfitting.

5.2.3 Ambiguity

In Figure 3(b), the reconstruction results show false positive edges (dashed lines, i.e., false discoveries). Since the proposed method relies on the common dynamic features in the data sets to infer the GRN structure, an edge without dominant responses is ambiguous and therefore hard to infer. For example, consider the influences on G3 and G4; since there is only one experiment which shows the dynamic response of G3 with respect to G9 in Figure 5B and G10 in Figure 5B′, it is hard to extract commonality. Thus, we identify a false positive link (G9→G3) (dashed line) in Figure 3(b). Similarly, since responses of G4 with respect to G9 show more common dynamic responses in Figure 5C and D, we infer a false positive link G9→G4 instead of G10→G4. Note that although the responses of G9 in Exp#4 and Exp#5 have similar scales in Figure 2(a) and G10 can only be affected by G9 (from the true GRN structure), the responses of G10 in Exp#4 and Exp#5 have quite different scales, which may cause a false positive inference.
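The three edge categories used in Figure 3(b) can be computed by comparing signed edge sets. The sketch below is illustrative: the edge lists contain only the edges explicitly named in this section (not the full DREAM4 network), the function name is hypothetical, and an inferred edge with the wrong sign is counted here as both a false positive and a missing link, which is an assumption about the scoring convention.

```python
def classify_edges(true_edges, inferred_edges):
    """Edges are dicts {(regulator, target): sign}, sign +1 activation, -1 inhibition."""
    true_pos = {e for e, s in inferred_edges.items() if true_edges.get(e) == s}
    false_pos = set(inferred_edges) - true_pos
    missing = set(true_edges) - true_pos
    return true_pos, false_pos, missing


# Subset of edges named in the text (partial, for illustration only).
true_edges = {
    ("G1", "G4"): 1,
    ("G1", "G5"): -1,
    ("G1", "G2"): -1,
    ("G1", "G3"): 1,
    ("G8", "G2"): -1,
    ("G6", "G2"): -1,
    ("G10", "G4"): 1,
}
inferred_edges = {
    ("G1", "G4"): 1,
    ("G1", "G5"): -1,
    ("G8", "G2"): -1,
    ("G9", "G3"): 1,
    ("G9", "G4"): 1,
}

true_pos, false_pos, missing = classify_edges(true_edges, inferred_edges)
print(sorted(true_pos))
print(sorted(false_pos))
print(sorted(missing))
```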

5.3 Summary

We now summarize the lessons learnt from the above analyses, where we applied our method to the DREAM data set. The proposed method

can reconstruct the GRN by incorporating all data sets together.
can incorporate information from the other inferred GRNs into the common GRN structure by using the low-rank property (commonalities).
can infer and estimate the (unknown) drug effect, which can distort the relationship between genes and may lead to incorrect GRN inference in general, by separating the common dynamic response from the inferred GRN.
can avoid overfitting but may fail to infer the true GRN when the dynamic responses corresponding to a certain edge do not show dominant common responses or they show ambiguities.

6 Conclusion

In this paper, we show how to harness both sparse and low-rank structures for reconstructing GRNs from heterogeneous data sets based on various drug-induced perturbation experiments. Our method proposes a new convex formulation for GRN reconstruction and can automatically repair the common graph structure of a partially perturbed GRN, even without precise information about the corrupting effects of drug-induced perturbations. Through simulations on synthetic experiments and application to the DREAM dataset, we show that our method can complete and repair GRN structures subjected to drug-induced perturbations. Also, through numerical comparisons, we demonstrate advantages over existing graph inference methods in dealing with different data sets and estimating perturbation inputs. We are currently applying this method to large-scale datasets and using this tool for designing effective experiments in inferring the HER2+ breast cancer signaling pathway.

Supplementary Material

Supp.pdf

NIHMS809818-supplement-Supp_pdf.pdf^{(6.8MB, pdf)}

Acknowledgments

This research was supported by the NIH NCI under the ICBP and PS-OC programs (5U54CA112970-08).

Biographies

Young Hwan Chang received his Bachelor's and Master's degrees in aerospace engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2002 and 2004, respectively, and the Ph.D. degree in mechanical engineering from the University of California, Berkeley, in 2013. He is currently a postdoctoral researcher in the Department of EECS, UC Berkeley, and will join Oregon Health and Science University as an assistant professor.

Roel Dobbe received his Bachelor's degree cum laude in Mechanical Engineering from Delft University of Technology in 2007. In 2010 he completed his Master's degree in Systems & Control cum laude at the Delft Center for Systems & Control, Delft University of Technology. His MSc research focused on Hybrid Systems and Control, with applications in Systems Biology. He is currently a second-year graduate student in the Department of EECS at the University of California, Berkeley.

Palak Bhushan received his Bachelor's degree in Electrical Engineering from IIT Kanpur, India, in 2013. He is currently working towards his PhD degree in the EECS Dept., UC Berkeley. His research interests lie in the design, modeling and analysis of nonlinear biological systems, particularly gene network reconstruction and neural decoding.

Dr. Joe W. Gray, a physicist and an engineer by training, is known for break-throughs that have changed clinical practices for patients. He has been employed as a staff scientist in the Biomedical Sciences Division of the Lawrence Livermore National Laboratory (1972-1991), professor of laboratory medicine at the University of California San Francisco (1991-2011), Associate Laboratory Director for Biosciences and Life Sciences Division Director at the Lawrence Berkeley National Laboratory (2003-2011). In 2011, he joined the Oregon Health and Science University, where he holds the Gordon Moore Endowed Chair, and serves as Chair, Department of Biomedical Engineering; Director, Center for Spatial Systems Biomedicine; and Associate Director for Translational Research, Knight Cancer Institute. He also holds positions as Emeritus Professor, University of California San Francisco; and as Senior Scientist, Lawrence Berkeley National Laboratory.

Claire J. Tomlin received the B.A.Sc. degree in electrical engineering from the University of Waterloo, Canada, in 1992, the M.Sc. degree in electrical engineering from Imperial College, University of London, in 1993, and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1998. She is a Professor of Electrical Engineering and Computer Sciences at Berkeley, where she holds the Charles A. Desoer Chair in Engineering. She held the positions of Assistant, Associate, and Full Professor at Stanford from 1998–2007, and in 2005 joined Berkeley. She has been an Affiliate at LBL in the Life Sciences Division since January 2012. She works in hybrid systems and control, with applications to biology, robotics, and air traffic systems. Dr. Tomlin received the Erlander Professorship of the Swedish Research Council in 2010, a MacArthur Fellowship in 2006, and the Eckman Award of the American Automatic Control Council in 2003.

Footnotes

In [31], the authors pointed out that although data on time derivative can be difficult to obtain especially in the presence of noise, it is possible to estimate the gene expressions relatively accurately by repeating measurement with careful instrumentation and statistics [5] [32].

Φ(t_j) is known as the sensing matrix in compressive sensing [21, 22]. Thus, for a given sensing matrix Φ(t_j) and measurement y(t_j), we reconstruct s while penalizing ‖s‖₁ to promote sparsity. In [21, 22], we assume that u(t_j) is known. Since we can then simply subtract u(t_j) from y(t_j), we may reconstruct an unbiased s. However, if u(t_j) is not assumed to be known, this causes bias or uncertainties in reconstructing s.

References

1. Janes KA, Yaffe MB. Data-driven modelling of signal-transduction networks. Nat Rev Mol Cell Biol. 2006;7(11):820–828. doi: 10.1038/nrm2041.
2. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(25):14863–14868. doi: 10.1073/pnas.95.25.14863.
3. Ramsay J, Silverman B. Functional data analysis. Springer Series in Statistics. 2005.
4. Summer G, Perkins TJ. Functional data analysis for identifying nonlinear models of gene regulatory networks. BMC Genomics. 2010;11(4). doi: 10.1186/1471-2164-11-S4-S18.
5. D'haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000;16(8):707–726. doi: 10.1093/bioinformatics/16.8.707.
6. Amin DN, Sergina N, Ahuja D, McMahon M, Blair JA, Wang D, Hann B, Koch KM, Shokat KM, Moasser MM. Resiliency and vulnerability in the HER2-HER3 tumorigenic driver. Science Translational Medicine. 2010;2(16):16ra7. doi: 10.1126/scitranslmed.3000389.
7. Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, Ng S, Gibb WJ, Wang NJ, Ziyad S, Tong F, Bayani N, Hu Z, Billig JI, Dueregger A, Lewis S, Jakkula L, Korkola JE, Durinck S, Pepin F, Guan Y, Purdom E, Neuvial P, Bengtsson H, Wood KW, Smith PG, Vassilev LT, Hennessy BT, Greshock J, Bachman KE, Hardwicke MA, Park JW, Marton LJ, Wolf DM, Collisson EA, Neve RM, Mills GB, Speed TP, Feiler HS, Wooster RF, Haussler D, Stuart JM, Gray JW, Spellman PT. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(8):2724–2729. doi: 10.1073/pnas.1018854108.
8. Sontag ED. Network reconstruction based on steady-state data. Essays in Biochemistry. 2008;45:161–176. doi: 10.1042/BSE0450161.
9. Zechner C, Ruess J, Krenn P, Pelet S, Peter M, Lygeros J, Koeppl H. Moment-based inference predicts bimodality in transient gene expression. Proceedings of the National Academy of Sciences of the United States of America. 2012;109(21):8340–8345. doi: 10.1073/pnas.1200161109.
10. Zavlanos MM, Julius AA, Boyd SP, Pappas GJ. Identification of stable genetic networks using convex programming; Proceedings of the American Control Conference (ACC); 2008. pp. 2755–2760.
11. Cooper NG, Belta CA, Julius AA. Genetic regulatory network identification using multi-variate monotone functions; Proceedings of the IEEE conference on Decision and Control and European Control Conference (CDC-ECC); 2011. pp. 2208–2213.
12. Porreca R, Drulhe S, de Jong H, Ferrari-Trecate G. Structural identification of piecewise-linear models of genetic regulatory networks. Journal of Computational Biology. 2008;15(10):1365–1380. doi: 10.1089/cmb.2008.0109.
13. Bernardo DD, Gardner T, Collins J. Robust identification of large genetic networks. Pacific Symposium on Biocomputing. 2004;9:486–497. doi: 10.1142/9789812704856_0046.
14. Richard G, Julius AA, Belta C. Optimizing regulation functions in gene network identification; IEEE Conference on Decision and Control (CDC); 2013. pp. 745–750.
15. Yeung MKS, Tegnér J, Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences. 2002;99(9):6163–6168. doi: 10.1073/pnas.092576199.
16. Song L, Kolar M, Xing EP. Keller: estimating time-varying interactions between genes. Bioinformatics. 2009;25(12):i128–i136. doi: 10.1093/bioinformatics/btp192.
17. Barzel B, Barabasi AL. Network link prediction by global silencing of indirect correlations. Nature Biotech. 2013;31(8):720–725. doi: 10.1038/nbt.2601.
18. Feizi S, Marbach D, Medard M, Kellis M. Network deconvolution as a general method to distinguish direct dependencies in networks. Nature Biotech. 2013;31(8):726–733. doi: 10.1038/nbt.2635.
19. Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, Santini S, di Bernardo M, di Bernardo D, Cosma MP. A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell. 2009;137(1):172–181. doi: 10.1016/j.cell.2009.01.055.
20. Kang T, White JT, Xie Z, Benenson Y, Sontag E, Bleris L. Reverse engineering validation using a benchmark synthetic gene circuit in human cells. ACS Synthetic Biology. 2013;2(5):255–262. doi: 10.1021/sb300093y.
21. Chang YH, Tomlin CJ. Reconstruction of gene regulatory networks with hidden node. European Control Conference (ECC) 2014;2014:1492–1497.
22. Chang YH, Gray JW, Tomlin CJ. Exact reconstruction of gene regulatory networks using compressive sensing. BMC Bioinformatics. 2014;15(400). doi: 10.1186/s12859-014-0400-4.
23. Liang X, Ren X, Zhang Z, Ma Y. in Computer Vision ECCV 2012, vol 7576 of Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2012. Repairing sparse low-rank texture; pp. 482–495.
24. Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology. 2009;16(2):229–239. doi: 10.1089/cmb.2008.09TT.
25. Stolovitzky G, Monroe D, Califano A. Dialogue on reverse-engineering assessment and methods. Annals of the New York Academy of Sciences. 2007;1115(1):1–22. doi: 10.1196/annals.1407.021.
26. Stolovitzky G, Prill RJ, Califano A. Lessons from the DREAM2 challenges. Annals of the New York Academy of Sciences. 2009;1158(1):159–195. doi: 10.1111/j.1749-6632.2009.04497.x.
27. Shiraishi Y, Kimura S, Okada M. Inferring cluster-based networks from differently stimulated multiple time-course gene expression data. Bioinformatics. 2010;26(8):1073–1081. doi: 10.1093/bioinformatics/btq094.
28. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003;34(06):166–176. doi: 10.1038/ng1165.
29. Roy S, Lagree S, Hou Z, Thomson J, Stewart R, Gasch A. Integrated module and gene-specific regulatory inference implicates upstream signaling networks. PLoS Comput Biol. 2013;9(10):e1003252. doi: 10.1371/journal.pcbi.1003252.
30. Pan W, Yuan Y, Goncalves J, Stan G. Reconstruction of arbitrary biochemical reaction networks: A compressive sensing approach; Decision and Control (CDC), 2012 IEEE 51st Annual Conference on; Dec, 2012. pp. 2334–2339.
31. Yeung MKS, Tegnér J, Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regression. Proceedings of the National Academy of Sciences. 2002;99(9):6163–6168. doi: 10.1073/pnas.092576199.
32. Ideker T, Thorsson V, Siegel AF, Hood LE. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. Journal of Computational Biology. 2000;7(6):805–817. doi: 10.1089/10665270050514945.
33. Komodakis N. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE; 2006. Image completion using global optimization; pp. 442–452.
34. Sun J, Yuan L, Jia J, Shum HY. Image completion with structure propagation. ACM Transactions on Graphics (ToG) 2005;24(3):861–868.
35. Bertalmio M, Sapiro G, Caselles V, Ballester C. Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, (New York, NY, USA) ACM Press/Addison-Wesley Publishing Co; 2000. Image inpainting; pp. 417–424.
36. Oliveira MM, Bowen B, McKenna R, Chang YS. Fast digital image inpainting; Proceedings of the International Conference on Visualization, Imaging and Image Processing (VIIP 2001), Marbella, Spain; 2001. pp. 106–107.
37. Bertalmio M, Bertozzi AL, Sapiro G. Navier-Stokes, fluid dynamics, and image and video inpainting. Proc IEEE Computer Vision and Pattern Recognition (CVPR) 2001:355–362.
38. Mairal J, Elad M, Sapiro G. Sparse representation for color image restoration. Image Processing, IEEE Transactions on. 2008 Jan;17:53–69. doi: 10.1109/tip.2007.911828.
39. Elad M, Starck JL, Querre P, Donoho D. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis. 2005;19(3):340–358. Computational Harmonic Analysis - Part 1.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp.pdf

NIHMS809818-supplement-Supp_pdf.pdf^{(6.8MB, pdf)}
