Bioinformatics. 2019 Jul 2;36(1):197–204. doi: 10.1093/bioinformatics/btz529

Inference of differential gene regulatory networks based on gene expression and genetic perturbation data

Xin Zhou, Xiaodong Cai
Editor: Janet Kelso
PMCID: PMC6956787  PMID: 31263873

Abstract

Motivation

Gene regulatory networks (GRNs) of the same organism can be different under different conditions, although the overall network structure may be similar. Understanding the difference in GRNs under different conditions is important to understand condition-specific gene regulation. When gene expression and other relevant data under two different conditions are available, they can be used by an existing network inference algorithm to estimate two GRNs separately, and then to identify the difference between the two GRNs. However, such an approach does not exploit the similarity in two GRNs, and may sacrifice inference accuracy.

Results

In this paper, we model GRNs with the structural equation model (SEM), which can integrate gene expression and genetic perturbation data, and develop an algorithm named fused sparse SEM (FSSEM) to jointly infer GRNs under two conditions and then identify the differences between the two GRNs. Computer simulations demonstrate that the FSSEM algorithm outperforms approaches that estimate the two GRNs separately. Analysis of a lung cancer dataset and a gastric cancer dataset with FSSEM inferred differential GRNs in cancer versus normal tissues, in which the genes with the largest network degrees have been reported to be implicated in tumorigenesis. The FSSEM algorithm provides a valuable tool for joint inference of two GRNs and identification of the differential GRN under two conditions.

Availability and implementation

The R package fssemR implementing the FSSEM algorithm is available at https://github.com/Ivis4ml/fssemR.git. It is also available on CRAN.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A gene regulatory network (GRN) consists of a set of genes that interact with each other to govern their expression and molecular functions. For example, transcription factors (TFs) can bind to promoter regions of their target genes and regulate the expression of target genes (Harbison et al., 2004). Gene-gene interactions can change under different environments, in different tissue types or disease states, and during development and speciation (Ideker and Krogan, 2012). Therefore, GRNs undergo substantial rewiring depending on specific molecular context in which they operate (Califano, 2011). Identification of condition-specific GRNs is critical to unravel the molecular mechanism of various tissue or disease-specific biological processes (Sonawane et al., 2017).

Although a number of computational methods have been developed to infer GRNs from gene expression and other relevant data, they are mainly concerned with the static structure of gene networks under one single condition. Several methods aim to infer GRNs using only gene expression data; they include the approaches constructing relevance networks based on a similarity measure, such as correlation or mutual information (Butte and Kohane, 1999; Faith et al., 2007; Margolin et al., 2006), Gaussian Graphical Model (GGM) (Friedman et al., 2008), Bayesian networks (Statnikov and Aliferis, 2010) and linear regression model (Haury et al., 2012). Several other methods infer GRNs by integrating genetic perturbations with gene expression data; these methods include approaches using Bayesian networks incorporating expression quantitative trait loci (eQTLs) (Zhu et al., 2007), likelihood-based causal models (Neto et al., 2008) and structural equation models (SEMs) (Cai et al., 2013; Liu et al., 2008; Logsdon and Mezey, 2010).

While it is possible to apply these methods to identify GRNs under different conditions separately, such an approach is apparently not optimal to identify the difference in GRNs, because it does not exploit the similarity in two GRNs. Several methods have been proposed to use the gene expression data of different conditions to jointly estimate GRNs under different conditions. Particularly, GRNs under multiple conditions are modeled with multiple GGMs, and these GGMs are inferred jointly from gene expression data (Danaher et al., 2014). When a gene is mutated, its regulatory effect on all its target genes may be changed. Taking into account such effects, a node-based approach to joint inference of multiple GGMs was developed in Mohan et al. (2014). GGMs exploit the sample covariance of the gene expression levels, but they cannot integrate genetic perturbations with gene expression data. Moreover, it has been demonstrated that genetic perturbation along with gene expression data can determine directed edges in GRNs (Logsdon and Mezey, 2010), but GGMs can only identify undirected edges.

In this paper, we employ SEMs to model GRNs as described in Cai et al. (2013), Logsdon and Mezey (2010) and Liu et al. (2008). This enables us to integrate genetic perturbation data with gene expression data. Taking into account the sparsity in GRNs, we previously developed a sparse-aware maximum likelihood (SML) method (Cai et al., 2013) to infer a single GRN based on the SEM. Here, taking into account not only the sparsity in GRNs but also the sparsity in the differences between GRNs under two different conditions, we develop an algorithm, named fused sparse SEM (FSSEM), to infer two GRNs from different conditions jointly and then identify the differences between the two GRNs. Computer simulations demonstrate the superior performance of our approach relative to existing ones that infer GRNs under two conditions separately.

2 Materials and methods

2.1 GRN model

Suppose that the expression levels of $n$ genes under two different conditions are measured using, e.g. microarray or RNA-Seq technology. Let $y_i^{(k)} = [y_{i1}^{(k)}, y_{i2}^{(k)}, \ldots, y_{in}^{(k)}]^T$ denote the expression levels of the $n$ genes in individual $i$ under condition $k$, where $k = 1, 2$ and $i = 1, 2, \ldots, n_k$, with $n_k$ being the number of individuals whose gene expression levels are measured under condition $k$. Suppose also that a set of naturally occurring genetic perturbations has been observed for these genes. These perturbations can be due to, e.g. different genotypes of eQTLs or gene copy number variants (CNVs). In this paper, we will consider only eQTLs. As in Cai et al. (2013) and Logsdon and Mezey (2010), we assume that each gene in the GRN of interest has at least one cis-eQTL, so that the structure of the underlying GRN is uniquely identifiable. Let $x_i^{(k)} = [x_{i1}^{(k)}, x_{i2}^{(k)}, \ldots, x_{iq}^{(k)}]^T$ denote the observed genotypes of $q$ cis-eQTLs in individual $i$ under condition $k$, where $k = 1, 2$ and $i = 1, 2, \ldots, n_k$. Since the expression level of a particular gene may be regulated by other genes and is affected by its eQTLs, we employ the following SEM to model the expression of the $n$ genes:

$$y_i^{(k)} = B^{(k)} y_i^{(k)} + F^{(k)} x_i^{(k)} + \mu^{(k)} + \epsilon_i^{(k)}, \tag{1}$$

where $i = 1, \ldots, n_k$ and $k = 1, 2$; the $n \times n$ matrix $B^{(k)}$ defines the unknown network structure under condition $k$; the $n \times q$ matrix $F^{(k)}$ captures the effect of cis-eQTLs on gene expression levels under condition $k$; the $n \times 1$ vector $\mu^{(k)}$ accounts for the model bias in the SEM; and the $n \times 1$ vector $\epsilon_i^{(k)}$ denotes the residual error, which is modeled as a Gaussian vector with zero mean and covariance $\sigma^2 I$ and is independent across samples. It is assumed that no gene has a self-loop in the GRNs, which implies that the diagonal entries of $B^{(k)}$ are zero. It is also assumed that the $q$ cis-eQTLs have been identified using an existing eQTL method but that the sizes of their effects are unknown; thus, $F^{(k)}$ has $q$ nonzero entries with known locations. Our main goal is to estimate the network matrices $B^{(1)}$ and $B^{(2)}$, but $F^{(1)}$ and $F^{(2)}$ will also be estimated jointly with $B^{(1)}$ and $B^{(2)}$.
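To make model (1) concrete, the following minimal Python sketch draws one expression vector by solving $(I - B)y = Fx + \mu + \epsilon$ for a toy three-gene network. The matrices, the genotype vector and the helper names `solve` and `simulate_expression` are illustrative assumptions, not values or code from this study.

```python
import random

def solve(a, b):
    """Solve a @ y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def simulate_expression(B, F, x, mu, sigma, rng):
    """Draw y from y = B y + F x + mu + eps with eps ~ N(0, sigma^2 I)."""
    n = len(B)
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    rhs = [sum(F[i][j] * x[j] for j in range(len(x))) + mu[i] + eps[i]
           for i in range(n)]
    i_minus_b = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(n)]
                 for i in range(n)]
    return solve(i_minus_b, rhs), eps

# Hypothetical toy network: gene 1 -> gene 2 -> gene 3, one cis-eQTL per gene.
# B[i][j] is the effect of gene j on gene i; the diagonal is zero (no self-loops).
B = [[0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, -0.6, 0.0]]
F = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
x = [1, 2, 3]              # genotypes of the three cis-eQTLs
mu = [0.0, 0.0, 0.0]
y, eps = simulate_expression(B, F, x, mu, sigma=0.5, rng=random.Random(0))
```

Because each gene has its own cis-eQTL, every row of `F` has a nonzero entry at a known position, which mirrors the identifiability assumption stated above.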

2.2 Joint inference of two GRNs

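Let $Y^{(k)} = [y_1^{(k)}, \ldots, y_{n_k}^{(k)}]$, $X^{(k)} = [x_1^{(k)}, \ldots, x_{n_k}^{(k)}]$ and $E^{(k)} = [\epsilon_1^{(k)}, \ldots, \epsilon_{n_k}^{(k)}]$, where $k = 1, 2$, and assume that the $n_1 + n_2$ observations are independent. Then, the negative log-likelihood function of the data can be written as

$$L(B, F, \mu, \sigma^2) = -\log \prod_{k=1}^{2} \prod_{i=1}^{n_k} P\big(y_i^{(k)} \mid x_i^{(k)}, \mu^{(k)}, B^{(k)}, F^{(k)}\big) = -\sum_{k=1}^{2} \frac{n_k}{2} \log |I - B^{(k)}|^2 + \frac{(n_1 + n_2) n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{k=1}^{2} \big\| (I - B^{(k)}) Y^{(k)} - F^{(k)} X^{(k)} - \mu^{(k)} \mathbf{1}^T \big\|_F^2, \tag{2}$$

where $B = [B^{(1)}, B^{(2)}]$, $F = [F^{(1)}, F^{(2)}]$, $\mu = [\mu^{(1)}, \mu^{(2)}]$, $\|\cdot\|_F$ stands for the Frobenius norm, and $\mathbf{1}$ is a vector with all its entries equal to 1. It is not difficult to show that minimizing (2) with respect to $\mu^{(k)}$ yields $\hat{\mu}^{(k)} = (I - B^{(k)}) \bar{y}^{(k)} - F^{(k)} \bar{x}^{(k)}$, where $\bar{y}^{(k)} = \frac{1}{n_k} \sum_{i=1}^{n_k} y_i^{(k)}$ and $\bar{x}^{(k)} = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}$ are the sample means. Substituting $\hat{\mu}^{(k)}$ into (2) replaces $Y^{(k)}$ and $X^{(k)}$ with the centered matrices $\tilde{Y}^{(k)} = Y^{(k)} - \frac{1}{n_k} \sum_{i=1}^{n_k} y_i^{(k)} \mathbf{1}^T$ and $\tilde{X}^{(k)} = X^{(k)} - \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)} \mathbf{1}^T$.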

Since a gene is regulated by a small number of other genes (Gardner et al., 2003; Tegner et al., 2003; Thieffry et al., 1998), GRNs are sparse, meaning that most entries of $B^{(1)}$ and $B^{(2)}$ are zeros. Moreover, it is reasonable to expect that changes in a GRN under two different conditions are relatively small. Therefore, most entries of $B^{(2)} - B^{(1)}$ are zeros. Let $\hat{\sigma}^2$ be an estimate of $\sigma^2$ that will be specified later. Replacing $\mu^{(k)}$ and $\sigma^2$ in (2) with $\hat{\mu}^{(k)}$ and $\hat{\sigma}^2$, respectively, and taking into account the sparsity in $B^{(1)}$ and $B^{(2)}$ and the sparsity in $B^{(2)} - B^{(1)}$, we estimate $B$ and $F$ by minimizing the following penalized negative log-likelihood function:

$$J(B, F) = -\sum_{k=1}^{2} n_k \log |I - B^{(k)}| + \frac{1}{2\hat{\sigma}^2} \sum_{k=1}^{2} \big\| (I - B^{(k)}) \tilde{Y}^{(k)} - F^{(k)} \tilde{X}^{(k)} \big\|_F^2 + \lambda \sum_{k=1}^{2} \|B^{(k)}\|_{1, w^{(k)}} + \rho \|B^{(2)} - B^{(1)}\|_{1, r}, \tag{3}$$

where $B_{ii}^{(k)} = 0$, $i = 1, \ldots, n$, $k = 1, 2$; $\|B^{(k)}\|_{1, w^{(k)}} = \sum_{i,j} w_{ij}^{(k)} |B_{ij}^{(k)}|$ is the weighted $\ell_1$-norm; $\|B^{(2)} - B^{(1)}\|_{1, r}$ is also a weighted $\ell_1$-norm with a similar definition; and $\lambda$ and $\rho$ are two nonnegative parameters. Note that the number of unknowns in $B^{(k)}$ and $F^{(k)}$, $k = 1, 2$, is $p = 2n(n-1) + 2q$, which is typically much larger than the number of samples $n_1 + n_2$. Due to the $\ell_1$-norm terms in (3), our FSSEM algorithm that minimizes $J(B, F)$ can handle the case where $p \gg n_1 + n_2$, as will be shown in computer simulations and real data analysis. The weights $w_{ij}^{(k)}$ and $r_{ij}$ in the penalty terms are introduced to improve estimation accuracy and robustness in line with the adaptive lasso (Zou, 2006) and the adaptive generalized fused lasso (Viallon et al., 2016), and they are selected as $1/|\hat{B}_{ij}^{(k)}|$ and $1/|\hat{B}_{ij}^{(2)} - \hat{B}_{ij}^{(1)}|$, respectively, where $\hat{B}^{(1)}$ and $\hat{B}^{(2)}$ are preliminary estimates of $B^{(1)}$ and $B^{(2)}$ obtained from the following ridge regression:

$$\{\hat{B}, \hat{F}\} = \arg\min_{\{B, F\}} \sum_{k=1}^{2} \left\{ \frac{1}{2} \big\| (I - B^{(k)}) \tilde{Y}^{(k)} - F^{(k)} \tilde{X}^{(k)} \big\|_F^2 + \tau \|B^{(k)}\|_F^2 \right\} \quad \text{s.t. } B_{ii}^{(k)} = 0,\ i = 1, \ldots, n,\ k = 1, 2, \tag{4}$$

where $\hat{B} = [\hat{B}^{(1)}, \hat{B}^{(2)}]$ and $\hat{F} = [\hat{F}^{(1)}, \hat{F}^{(2)}]$, and the estimate $\hat{\sigma}^2$ of $\sigma^2$ in (3) is given by

$$\hat{\sigma}^2 = \frac{\sum_{k=1}^{2} \big\| (I - \hat{B}^{(k)}) \tilde{Y}^{(k)} - \hat{F}^{(k)} \tilde{X}^{(k)} \big\|_F^2}{(n_1 + n_2) n}. \tag{5}$$

Of note, the likelihood function (2) is obtained under the assumption that $\epsilon_i^{(1)}$ and $\epsilon_i^{(2)}$ are independent, or equivalently, that $y_i^{(1)}$ and $y_i^{(2)}$ are independent. In some applications, $y_i^{(1)}$ and $y_i^{(2)}$ may be correlated, for example when $y_i^{(1)}$ is the gene expression level of a tumor and $y_i^{(2)}$ is the gene expression level of the normal tissue of the same individual. It turns out that our FSSEM algorithm, which minimizes the penalized negative log-likelihood function $J(B, F)$ in (3), is quite robust with respect to (w.r.t.) the correlation between $\epsilon_i^{(1)}$ and $\epsilon_i^{(2)}$. Our computer simulations will show that the performance of the FSSEM algorithm does not change significantly when the correlation coefficient between $\epsilon_i^{(1)}$ and $\epsilon_i^{(2)}$ varies from 0 to 1.
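The adaptive weights and the two penalty terms of (3) can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the matrices are toy values, and the small constant `eps` that guards against division by zero is an assumption of the sketch rather than something specified in the text.

```python
def adaptive_weights(B1_hat, B2_hat, eps=1e-8):
    """Weights w_ij^(k) = 1/|B_hat_ij^(k)| and r_ij = 1/|B_hat_ij^(2) - B_hat_ij^(1)|.

    eps (an assumption of this sketch) avoids division by zero.
    """
    n = len(B1_hat)

    def inv(v):
        return 1.0 / max(abs(v), eps)

    w1 = [[inv(B1_hat[i][j]) for j in range(n)] for i in range(n)]
    w2 = [[inv(B2_hat[i][j]) for j in range(n)] for i in range(n)]
    r = [[inv(B2_hat[i][j] - B1_hat[i][j]) for j in range(n)] for i in range(n)]
    return w1, w2, r

def penalty(B1, B2, w1, w2, r, lam, rho):
    """lam * sum_k ||B^(k)||_{1,w^(k)} + rho * ||B^(2) - B^(1)||_{1,r} (off-diagonal)."""
    n = len(B1)
    sparse, fused = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal entries are fixed at zero
            sparse += w1[i][j] * abs(B1[i][j]) + w2[i][j] * abs(B2[i][j])
            fused += r[i][j] * abs(B2[i][j] - B1[i][j])
    return lam * sparse + rho * fused

# Toy preliminary estimates (hypothetical) and a candidate solution.
B1_hat = [[0.0, 0.5], [0.1, 0.0]]
B2_hat = [[0.0, 0.5], [0.4, 0.0]]
w1, w2, r = adaptive_weights(B1_hat, B2_hat)
B1 = [[0.0, 0.5], [0.0, 0.0]]
B2 = [[0.0, 0.5], [0.3, 0.0]]
value = penalty(B1, B2, w1, w2, r, lam=0.1, rho=0.2)
```

Entries whose preliminary estimates are small receive large weights and are therefore pushed harder toward zero, which is the intended effect of the adaptive weighting.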

2.3 Ridge regression

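In the first stage, we solve the ridge regression problem (4) to find initial values of $B$ and $F$, and the weights $w^{(k)}$, $k = 1, 2$, and $r$, for the FSSEM algorithm to minimize (3). Let $B_i^{(k)}$, $F_i^{(k)}$ and $Y_i^{(k)}$ be the $i$th rows of $B^{(k)}$, $F^{(k)}$ and $Y^{(k)}$, respectively. Define $B_{i,-i}^{(k)}$ as the $1 \times (n-1)$ vector obtained by removing the $i$th entry from $B_i^{(k)}$. Let $S_q(i)$ be the set of indices of nonzero entries in $F_i^{(k)}$, $F_{i,S_q(i)}^{(k)}$ be the vector that contains the nonzero entries of $F_i^{(k)}$, $\tilde{X}_{S_q(i)}$ be the matrix formed by taking the rows of $\tilde{X}$ whose indices are in $S_q(i)$, and $\tilde{Y}_{-i}$ be the matrix formed by removing the $i$th row of $\tilde{Y}$.

Then, the ridge regression problem (4) can be decomposed into $n$ separate problems

$$\arg\min_{B_{i,-i},\, F_{i,S_q(i)}} \sum_{k=1}^{2} \left\{ \frac{1}{2} \big\| \tilde{Y}_i^{(k)} - B_{i,-i}^{(k)} \tilde{Y}_{-i}^{(k)} - F_{i,S_q(i)}^{(k)} \tilde{X}_{S_q(i)}^{(k)} \big\|_F^2 + \tau \|B_{i,-i}^{(k)}\|_F^2 \right\}, \quad i = 1, \ldots, n. \tag{6}$$

Minimizing the objective function in (6) w.r.t. $F_{i,S_q(i)}^{(k)}$ yields the following closed-form solution

$$\hat{F}_{i,S_q(i)}^{(k)} = \big( \tilde{Y}_i^{(k)} - \hat{B}_{i,-i}^{(k)} \tilde{Y}_{-i}^{(k)} \big) \tilde{X}_{S_q(i)}^{(k)T} \big( \tilde{X}_{S_q(i)}^{(k)} \tilde{X}_{S_q(i)}^{(k)T} \big)^{-1}. \tag{7}$$

Substituting $\hat{F}_{i,S_q(i)}^{(k)}$ into (6) and minimizing w.r.t. $B_{i,-i}^{(k)}$ gives $\hat{B}_{i,-i}^{(k)} = \tilde{Y}_i^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} \big( \tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} + \tau I \big)^{-1}$, which in turn results in $\hat{F}_{i,S_q(i)}^{(k)} = \tilde{Y}_i^{(k)} \Gamma_i^{(k)} \tilde{X}_{S_q(i)}^{(k)T} \big( \tilde{X}_{S_q(i)}^{(k)} \tilde{X}_{S_q(i)}^{(k)T} \big)^{-1}$, where $\Gamma_i^{(k)} = I - P_i^{(k)} \tilde{Y}_{-i}^{(k)T} \big( \tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} + \tau I \big)^{-1} \tilde{Y}_{-i}^{(k)}$ and $P_i^{(k)} = I - \tilde{X}_{S_q(i)}^{(k)T} \big( \tilde{X}_{S_q(i)}^{(k)} \tilde{X}_{S_q(i)}^{(k)T} \big)^{-1} \tilde{X}_{S_q(i)}^{(k)}$. After $\hat{B}^{(k)}$ and $\hat{F}^{(k)}$ are estimated, $\hat{\sigma}^2$ is computed from (5). The hyper-parameter $\tau$ in the ridge regression (4) or (6) is selected by 5-fold cross-validation.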
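The closed-form expressions above translate directly into matrix code. The NumPy sketch below evaluates them for one gene and one condition on random toy data; the function name `ridge_row`, the data, and the choice of `tau` are assumptions made for illustration, and the code simply implements the formulas as printed rather than the fssemR implementation.

```python
import numpy as np

def ridge_row(Yt, Xt, i, S, tau):
    """Closed-form ridge estimates for gene i under one condition.

    Yt: centered n x N expression matrix, Xt: centered q x N genotype matrix,
    S: list of row indices of the cis-eQTLs of gene i, tau: ridge parameter.
    Returns B_hat_{i,-i} (1 x (n-1)), F_hat_{i,S} (1 x |S|) and the projector P_i.
    """
    N = Yt.shape[1]
    yi = Yt[i:i + 1, :]                       # 1 x N row of gene i
    Y_rest = np.delete(Yt, i, axis=0)         # (n-1) x N, all other genes
    Xs = Xt[S, :]                             # |S| x N genotypes of the cis-eQTLs
    XXt_inv = np.linalg.inv(Xs @ Xs.T)
    P = np.eye(N) - Xs.T @ XXt_inv @ Xs       # removes the eQTL contribution
    G = Y_rest @ P @ Y_rest.T + tau * np.eye(Y_rest.shape[0])
    b = yi @ P @ Y_rest.T @ np.linalg.inv(G)  # B_hat_{i,-i}
    f = (yi - b @ Y_rest) @ Xs.T @ XXt_inv    # F_hat from (7)
    return b, f, P

# Toy data (hypothetical): 4 genes, 4 eQTLs, 20 samples, gene 0 has eQTL 0.
rng = np.random.default_rng(0)
Yt = rng.standard_normal((4, 20))
Xt = rng.standard_normal((4, 20))
Yt = Yt - Yt.mean(axis=1, keepdims=True)      # center each row
Xt = Xt - Xt.mean(axis=1, keepdims=True)
b, f, P = ridge_row(Yt, Xt, i=0, S=[0], tau=1.0)
```

Running the function for every gene and both conditions would give the preliminary estimates from which the adaptive weights and $\hat{\sigma}^2$ are computed.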

2.4 FSSEM algorithm

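In this section, we develop the FSSEM algorithm to minimize the objective function $J(B, F)$ in (3), with the initial values of $B^{(k)}$ and $F^{(k)}$ given by the ridge regression estimates of Section 2.3. The objective function is non-convex due to the log-determinant term, and non-smooth due to the $\ell_1$-norm terms. Recently, the proximal alternating linearized minimization (PALM) method (Bolte et al., 2014) was developed to solve a broad class of non-convex and non-smooth minimization problems. We next apply the PALM approach to develop the FSSEM algorithm.

We define the proximal operator associated with a proper and lower semi-continuous function $h(x): \mathbb{R}^d \to (-\infty, +\infty]$ as $\mathrm{prox}_{\alpha}^{h}(v) = \arg\min_{u \in \mathbb{R}^d} \left\{ \frac{\alpha}{2} \|u - v\|^2 + h(u) \right\}$, where $\alpha > 0$ and $v \in \mathbb{R}^d$ are given. We also define the fused lasso signal approximator (FLSA) (Friedman et al., 2007; Hoefling, 2010) on $x = [x_1^T, x_2^T]^T$ as the following proximal operator:

$$\mathrm{prox}_{\alpha}^{f}(z) = \arg\min_{x \in \mathbb{R}^{2d}} \left\{ \frac{\alpha}{2} \sum_{k=1}^{2} \|x_k - z_k\|^2 + f(x) \right\}, \tag{8}$$

where $f(x) = \lambda \sum_{k=1}^{2} \|x_k\|_1 + \rho \|x_2 - x_1\|_1$, $z_1$ and $z_2$ are two $d \times 1$ vectors and $z = [z_1^T, z_2^T]^T$. Denote the solution of (8) as $(x_1(\lambda), x_2(\lambda))$, and let $x_{1j}(\lambda)$ and $x_{2j}(\lambda)$ be the $j$th elements of $x_1(\lambda)$ and $x_2(\lambda)$, respectively. Then we have

$$\big( x_{1j}(0), x_{2j}(0) \big) = \begin{cases} (z_{1j} - \rho/\alpha,\ z_{2j} + \rho/\alpha), & z_{1j} - z_{2j} > 2\rho/\alpha \\ (z_{1j} + \rho/\alpha,\ z_{2j} - \rho/\alpha), & z_{1j} - z_{2j} < -2\rho/\alpha \\ \left( \frac{z_{1j} + z_{2j}}{2},\ \frac{z_{1j} + z_{2j}}{2} \right), & |z_{1j} - z_{2j}| \le 2\rho/\alpha, \end{cases} \tag{9}$$

where $z_{1j}$ and $z_{2j}$ are the $j$th elements of $z_1$ and $z_2$, respectively.

Define the soft-thresholding function $S(\beta, \lambda)$ as

$$S(\beta, \lambda) = \begin{cases} \beta - \lambda & \text{if } \beta > \lambda \\ \beta + \lambda & \text{if } \beta < -\lambda \\ 0 & \text{if } |\beta| \le \lambda. \end{cases} \tag{10}$$

If $\beta = [\beta_1, \ldots, \beta_d]^T$ is a $d \times 1$ vector, then $S(\beta, \lambda)$ is a $d \times 1$ vector whose $j$th element is $S(\beta_j, \lambda)$. The solution of (8) at $\lambda > 0$ is given in terms of the soft-thresholding operator as follows (Friedman et al., 2007):

$$\mathrm{prox}_{\alpha}^{f}(z) = \big( S(x_1(0), \lambda/\alpha),\ S(x_2(0), \lambda/\alpha) \big). \tag{11}$$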
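The two-step structure of (9)-(11), fusing each coordinate pair and then soft-thresholding, can be written compactly. The following Python sketch covers the unweighted operator in (8) only; the function names and the input vectors are illustrative assumptions.

```python
def soft_threshold(beta, lam):
    """Soft-thresholding function S(beta, lam) of (10)."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

def fuse_pair(z1, z2, rho, alpha):
    """Solution of (8) at lambda = 0 for one coordinate pair, following (9)."""
    shift = rho / alpha
    if z1 - z2 > 2 * shift:
        return z1 - shift, z2 + shift
    if z1 - z2 < -2 * shift:
        return z1 + shift, z2 - shift
    mean = (z1 + z2) / 2
    return mean, mean

def flsa_prox(z1, z2, lam, rho, alpha):
    """Fused lasso signal approximator for two vectors, following (11)."""
    x1, x2 = [], []
    for a, b in zip(z1, z2):
        u, v = fuse_pair(a, b, rho, alpha)
        x1.append(soft_threshold(u, lam / alpha))
        x2.append(soft_threshold(v, lam / alpha))
    return x1, x2

# Three coordinate pairs: clearly different, fused, and fused then zeroed.
x1, x2 = flsa_prox([3.0, 1.0, 0.2], [0.0, 0.5, -0.1], lam=0.5, rho=1.0, alpha=1.0)
```

In this toy input the first pair stays distinct, the second pair is fused to a common nonzero value, and the third pair is fused and then shrunk to zero, which are the three behaviours the fused penalty is meant to produce.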

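Minimizing (3) w.r.t. $F^{(k)}$ yields $\hat{F}_{i,S_q(i)}^{(k)}$ in (7). Substituting $\hat{F}_{i,S_q(i)}^{(k)}$ in (7) into (3) gives

$$J(B) = H(B) + \sum_{i=1}^{N_g} f_i(B_{i,-i}), \tag{12}$$

where

$$H(B) = -\sum_{k=1}^{2} \frac{n_k}{2} \log |I - B^{(k)}|^2 + \frac{1}{2\hat{\sigma}^2} \sum_{i=1}^{N_g} \sum_{k=1}^{2} \big\| \tilde{Y}_i^{(k)} P_i^{(k)} - B_{i,-i}^{(k)} \tilde{Y}_{-i}^{(k)} P_i^{(k)} \big\|_2^2, \tag{13}$$

and

$$f_i(B_{i,-i}) = \lambda \big( \|B_{i,-i}^{(1)}\|_{1, w^{(1)}} + \|B_{i,-i}^{(2)}\|_{1, w^{(2)}} \big) + \rho \|B_{i,-i}^{(1)} - B_{i,-i}^{(2)}\|_{1, r}. \tag{14}$$

Using the inertial version of the PALM approach (Pock and Sabach, 2016), the FSSEM algorithm efficiently minimizes the non-convex non-smooth function $J(B)$ with the block coordinate descent (BCD) method in an iterative fashion. More specifically, in each cycle of the iteration, $J(B)$ is minimized successively w.r.t. $[B_{i,-i}^{(1)}, B_{i,-i}^{(2)}]$, while $[B_{j,-j}^{(1)}, B_{j,-j}^{(2)}]$, $j = 1, \ldots, n$, $j \neq i$, are fixed.

Let us consider updating the $i$th block of variables $B_{i,-i} = [B_{i,-i}^{(1)}, B_{i,-i}^{(2)}]$ in the $(t+1)$th cycle. Let $B[t] = [B^{(1)}[t], B^{(2)}[t]]$ be the estimate of $B$ in the $t$th cycle. Define $\tilde{B}_{j,-j} = B_{j,-j}[t_1] + \alpha_t \big( B_{j,-j}[t_1] - B_{j,-j}[t_1 - 1] \big)$, where $t_1 = t + 1$ for $j < i$, $t_1 = t$ for $j \ge i$, and $\alpha_t$ is a constant in the interval $[0, 1]$. We obtain $B_{i,-i}$ from the FLSA proximal operator (11) as follows:

$$B_{i,-i} = \mathrm{prox}_{\gamma_i}^{f_i(\cdot)} \left( \tilde{B}_{i,-i} - \frac{1}{\gamma_i} \nabla_{B_{i,-i}} H(\tilde{B}) \right), \tag{15}$$

where $1/\gamma_i$ is the step size for the $i$th block, which will be given later, and $\nabla_{B_{i,-i}} H(\tilde{B})$ is the partial derivative of $H(B)$ w.r.t. $B_{i,-i}$ at $\tilde{B}$.

Since $B_{i,-i} = [B_{i,-i}^{(1)}, B_{i,-i}^{(2)}]$, we have $\nabla_{B_{i,-i}} H(B) = [\nabla_{B_{i,-i}^{(1)}} H(B), \nabla_{B_{i,-i}^{(2)}} H(B)]$. The determinant of $I - B^{(k)}$ can be expressed as $c_{ii}^{(k)} - B_{i,-i}^{(k)} c_{-i}^{(k)}$, where $c_{ii}^{(k)}$ is the $(i, i)$ co-factor of $I - B^{(k)}$, and the $j$th entry of the $(n-1) \times 1$ column vector $c_{-i}^{(k)}$ is the co-factor of $I - B^{(k)}$ corresponding to the $j$th entry of $B_{i,-i}^{(k)}$. Defining $B_{-i}^{(k)} = \{B_{j,-j}^{(k)},\ j = 1, \ldots, n,\ j \neq i\}$, we can write $\nabla_{B_{i,-i}^{(k)}} H(B)$, $k = 1, 2$, with $B_{-i}^{(k)}$ fixed, as follows:

$$\nabla_{B_{i,-i}^{(k)}} H(B) = \frac{n_k c_{-i}^{(k)T}}{c_{ii}^{(k)} - B_{i,-i}^{(k)} c_{-i}^{(k)}} + \frac{1}{\hat{\sigma}^2} \big( B_{i,-i}^{(k)} \tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} - \tilde{Y}_i^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} \big), \tag{16}$$

where $c_{-i}^{(k)}$ is calculated from $B_{-i}^{(k)}$. In Supplementary Text S1, we prove that, given $B_{-i}^{(k)}$, $\nabla_{B_{i,-i}^{(k)}} H(B)$, $k = 1, 2$, are Lipschitz continuous. Specifically, we can write $\nabla_{B_{i,-i}^{(k)}} H(B)$ as $\nabla_{B_{i,-i}^{(k)}} H(B_{i,-i}^{(k)}, B_{-i}^{(k)})$, which satisfies

$$\big\| \nabla_{B_{i,-i}^{(k)}} H(x, B_{-i}^{(k)}) - \nabla_{B_{i,-i}^{(k)}} H(y, B_{-i}^{(k)}) \big\| \le L_i(B_{-i}^{(k)}) \|x - y\|, \tag{17}$$

where the Lipschitz constant $L_i(B_{-i}^{(k)})$ is derived in Supplementary Text S1, and is given by

$$L_i(B_{-i}^{(k)}) = \frac{n_k \|c_{-i}^{(k)}\|_2^2}{\min_{B_{i,-i}^{(k)}} \big( \det(I - B^{(k)}) \big)^2} + \frac{\lambda_{\max} \big( \tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} \big)}{\hat{\sigma}^2}. \tag{18}$$

Here $\lambda_{\max} \big( \tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T} \big)$ is the maximum eigenvalue of $\tilde{Y}_{-i}^{(k)} P_i^{(k)} \tilde{Y}_{-i}^{(k)T}$, and the value of $\min_{B_{i,-i}^{(k)}} \big( \det(I - B^{(k)}) \big)^2$ can be computed by solving the optimization problem shown in (S12) in Supplementary Text S1. Let $L_i(B_{-i}) = \max\{L_i(B_{-i}^{(k)}),\ k = 1, 2\}$. Then, the step size in (15) is chosen to be $1/\gamma_i = 1/L_i(B_{-i})$.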
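The cofactor form of the log-determinant gradient in (16) can be checked numerically on a small network. The Python sketch below covers only the first term of (16); the helper names, the toy matrix and the sample size `nk` are hypothetical, and the recursive determinant is suitable for tiny matrices only.

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, v in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * v * det(minor)
    return total

def cofactor(m, i, j):
    """(i, j) cofactor of the square matrix m."""
    minor = [row[:j] + row[j + 1:] for r, row in enumerate(m) if r != i]
    return (-1) ** (i + j) * det(minor)

def i_minus(B):
    """Return the matrix I - B."""
    n = len(B)
    return [[(1.0 if r == c else 0.0) - B[r][c] for c in range(n)] for r in range(n)]

def logdet_term(B, nk):
    """-(nk / 2) * log(det(I - B)^2), the log-determinant part of H(B)."""
    return -0.5 * nk * math.log(det(i_minus(B)) ** 2)

def logdet_grad_row(B, i, nk):
    """Gradient of logdet_term w.r.t. B_{i,-i}: nk * c_{-i}^T / (c_ii - B_{i,-i} c_{-i})."""
    a = i_minus(B)
    cols = [j for j in range(len(B)) if j != i]
    c_ii = cofactor(a, i, i)
    c_rest = [cofactor(a, i, j) for j in cols]
    denom = c_ii - sum(B[i][j] * c for j, c in zip(cols, c_rest))  # equals det(I - B)
    return [nk * c / denom for c in c_rest]

# Toy cyclic network (hypothetical values) and a hypothetical sample size.
B = [[0.0, 0.5, 0.0],
     [0.0, 0.0, -0.7],
     [0.6, 0.0, 0.0]]
nk = 50
g = logdet_grad_row(B, 0, nk)
```

Because the cofactors of row $i$ do not depend on row $i$ itself, they can be computed once per block update, which is what makes the row-wise BCD scheme convenient.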

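Algorithm 1. Fused Sparse SEM (FSSEM)

Select $\tau^*$ in (4) via cross-validation.

Solve (4) with $\tau^*$ to obtain $(\hat{B}, \hat{F})$, and compute $\hat{\sigma}^2$ from (5).

Set $w_{ij}^{(k)} = 1/|\hat{B}_{ij}^{(k)}|$ and $r_{ij} = 1/|\hat{B}_{ij}^{(2)} - \hat{B}_{ij}^{(1)}|$.

Initialize $B[0] = \hat{B}$.

for $t$ in $1, 2, \ldots$ do

  Select $\alpha_t \in [0, 1]$

  for $i$ in $1, \ldots, n$ do

   Compute $L_i(B_{-i})$ from (18), set $\gamma_i = L_i(B_{-i})$

   Update $B_{i,-i}^{(k)}$, $k = 1, 2$, with (15)

   Set $\tilde{B}_{i,-i} = B_{i,-i}[t] + \alpha_t \big( B_{i,-i}[t] - B_{i,-i}[t-1] \big)$

  end for

  Update $F_i^{(k)}$ with (7) and $\hat{\sigma}^2$ with (5)

  if convergence then

   Break

  end if

end for

Return $\{\hat{B}^{(k)}, \hat{F}^{(k)},\ k = 1, 2\}$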

The FSSEM algorithm is summarized in Algorithm 1. The convergence criterion is defined as

$$\left\{ \frac{\sum_{k=1}^{2} \|B^{(k)}[t+1] - B^{(k)}[t]\|_F^2}{\sum_{k=1}^{2} \|B^{(k)}[t]\|_F^2} + \frac{\sum_{k=1}^{2} \|F^{(k)}[t+1] - F^{(k)}[t]\|_F^2}{\sum_{k=1}^{2} \|F^{(k)}[t]\|_F^2} \right\} < \epsilon_v, \qquad \frac{|J(B[t+1]) - J(B[t])|}{|J(B[t])|} < \epsilon_o, \tag{19}$$

where $\epsilon_v > 0$ and $\epsilon_o > 0$ are pre-specified small constants. Since the objective function is not convex, it is not guaranteed that the FSSEM algorithm converges to the global minimum. However, we prove in Supplementary Text S1 that the FSSEM algorithm always converges to a stationary point of the objective function. Note that if we drop the fused lasso term $\rho \|B^{(1)} - B^{(2)}\|_{1, r}$ in (3), then minimizing $J(B, F)$ is equivalent to estimating the two network matrices $B^{(1)}$ and $B^{(2)}$ separately. The BCD approach used in FSSEM can also be employed to solve this problem, because the proximal operator in (15) can be easily solved in terms of the soft-thresholding function $S(\beta, \lambda)$ defined in (10). This BCD approach is much more efficient than the SML algorithm in Cai et al. (2013), which employs the element-wise coordinate ascent approach. The parameters $\lambda$ and $\rho$ in (3) can be determined with cross-validation (CV) or the Bayesian information criterion (BIC). In Supplementary Text S1, we derive the expressions for the maximum values of $\lambda$ and $\rho$ and describe the CV process.
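The two quantities compared against $\epsilon_v$ and $\epsilon_o$ in (19) are simple to compute. The Python sketch below returns both and leaves it to the caller to decide how the two tests are combined; the function names and the toy matrices are assumptions made for illustration.

```python
def frob_sq(M):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in M for v in row)

def mat_diff(A, B):
    """Entry-wise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def relative_changes(B_new, B_old, F_new, F_old, J_new, J_old):
    """Return the relative parameter change and relative objective change of (19).

    B_* and F_* are lists [M1, M2] holding the matrices of the two conditions.
    """
    num_b = sum(frob_sq(mat_diff(a, b)) for a, b in zip(B_new, B_old))
    den_b = sum(frob_sq(b) for b in B_old)
    num_f = sum(frob_sq(mat_diff(a, b)) for a, b in zip(F_new, F_old))
    den_f = sum(frob_sq(f) for f in F_old)
    change_v = num_b / den_b + num_f / den_f
    change_o = abs(J_new - J_old) / abs(J_old)
    return change_v, change_o

# Toy iterates (hypothetical values).
B_old = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]]
B_new = [[[0.0, 1.1], [1.0, 0.0]], [[0.0, 1.0], [0.9, 0.0]]]
F_old = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
F_new = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
change_v, change_o = relative_changes(B_new, B_old, F_new, F_old, -10.001, -10.0)
```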

3 Results

3.1 Computer simulations

In this section, we conduct simulation studies to compare the performance of the FSSEM algorithm with that of the SML (Cai et al., 2013) and the QDG (Neto et al., 2008) algorithms. FSSEM estimates the network matrices $B^{(1)}$ and $B^{(2)}$ jointly, while SML and QDG estimate $B^{(1)}$ and $B^{(2)}$ separately. Other methods, such as the AL-based algorithm (Logsdon and Mezey, 2010), are also available to estimate $B^{(1)}$ and $B^{(2)}$ separately. However, as shown in Cai et al. (2013), SML outperforms the AL-based algorithm. Therefore, we select only SML and QDG for performance comparison.

Following the setup of Cai et al. (2013), both directed acyclic networks (DAGs) and directed cyclic networks (DCGs) were simulated in our experiments. Specifically, the adjacency matrix $A^{(1)}$ of a DAG or DCG of 30 or 300 gene nodes with expected number of edges per gene $d = 1$ or $d = 0.1$ was generated for the GRN under condition 1. Another adjacency matrix $A^{(2)}$ was generated by randomly changing 10% of the entries of $A^{(1)}$, with equal probabilities of changing entries from 0 to 1 and from 1 to 0. A network matrix $B^{(1)}$ was generated from $A^{(1)}$ as follows. For any entry $A_{ij}^{(1)} = 1$, $B_{ij}^{(1)}$ was generated from a random variable uniformly distributed over the interval $[0.5, 1]$ or $[-1, -0.5]$; for all $A_{ij}^{(1)} = 0$, we set $B_{ij}^{(1)} = 0$. The second network matrix $B^{(2)}$ was generated from $A^{(2)}$ and $B^{(1)}$ as follows. For all $A_{ij}^{(2)} = 0$, we set $B_{ij}^{(2)} = 0$; for all $A_{ij}^{(2)} = A_{ij}^{(1)}$, we set $B_{ij}^{(2)} = B_{ij}^{(1)}$; and for all $A_{ij}^{(1)} = 0$ but $A_{ij}^{(2)} = 1$, we generated $B_{ij}^{(2)}$ from a random variable uniformly distributed over the interval $[0.5, 1]$ or $[-1, -0.5]$. The genotypes of eQTLs were simulated from an F2 cross. Values 1 and 3 were assigned to the two homozygous genotypes, respectively, and value 2 to the heterozygous genotype. Then, $X^{(1)}$ and $X^{(2)}$ were generated from ternary random variables taking on values $\{1, 2, 3\}$ with corresponding probabilities $\{0.25, 0.5, 0.25\}$. The number of eQTLs per gene $n_e$ was chosen to be 3, and the effect sizes of all eQTLs were set to 1 in $F^{(1)}$ and $F^{(2)}$. The error terms $E^{(1)}$ and $E^{(2)}$ were independently sampled from Gaussian random variables with zero mean and variance $\sigma^2$; $\mu^{(1)}$ and $\mu^{(2)}$ were set to zero vectors; and the sample sizes $n_s = n_1 = n_2$ varied from 80 to 500. Finally, $Y^{(k)}$ was calculated as $Y^{(k)} = (I - B^{(k)})^{-1} (F^{(k)} X^{(k)} + E^{(k)})$, where $k = 1, 2$.
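The network and genotype generation described above can be sketched as follows. This is one possible reading, not the authors' simulation code: the edge probability $d/(n-1)$, the rule of removing and adding equal numbers of edges, the absence of any acyclicity constraint, and all function names are assumptions of the sketch.

```python
import random

def random_adjacency(n, d, rng):
    """Off-diagonal entries are 1 with probability d / (n - 1) (an assumption)."""
    p = d / (n - 1)
    return [[1 if i != j and rng.random() < p else 0 for j in range(n)]
            for i in range(n)]

def perturb_adjacency(A, n_change, rng):
    """Change n_change off-diagonal entries, half from 1 to 0 and half from 0 to 1."""
    n = len(A)
    ones = [(i, j) for i in range(n) for j in range(n) if i != j and A[i][j] == 1]
    zeros = [(i, j) for i in range(n) for j in range(n) if i != j and A[i][j] == 0]
    n_del = min(n_change // 2, len(ones))
    n_add = min(n_change - n_del, len(zeros))
    A2 = [row[:] for row in A]
    for i, j in rng.sample(ones, n_del):
        A2[i][j] = 0
    for i, j in rng.sample(zeros, n_add):
        A2[i][j] = 1
    return A2

def draw_weight(rng):
    """Uniform on [0.5, 1] or [-1, -0.5], with the sign chosen at random."""
    return rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.0)

def weights_from(A1, A2, rng):
    """Build B1 from A1, then B2 so that edges shared with A1 keep their weights."""
    n = len(A1)
    B1 = [[draw_weight(rng) if A1[i][j] else 0.0 for j in range(n)] for i in range(n)]
    B2 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A2[i][j] == 1:
                B2[i][j] = B1[i][j] if A1[i][j] == 1 else draw_weight(rng)
    return B1, B2

def genotypes(q, n_samples, rng):
    """F2-cross genotypes coded 1, 2, 3 with probabilities 0.25, 0.5, 0.25."""
    return [rng.choices([1, 2, 3], weights=[0.25, 0.5, 0.25], k=n_samples)
            for _ in range(q)]

rng = random.Random(1)
n = 30
A1 = random_adjacency(n, 1.0, rng)
A2 = perturb_adjacency(A1, 6, rng)
B1, B2 = weights_from(A1, A2, rng)
X = genotypes(q=3 * n, n_samples=80, rng=rng)
```

Expression data would then follow from $Y^{(k)} = (I - B^{(k)})^{-1} (F^{(k)} X^{(k)} + E^{(k)})$ using any linear solver.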

For each configuration of the two GRNs, 30 replicates of the GRN were simulated. For each replicate, QDG, SML and FSSEM were run to infer the network matrices $B^{(1)}$ and $B^{(2)}$. QDG was implemented with $\alpha = 0.01$, and the hyper-parameters of the SML and FSSEM algorithms were selected by 5-fold cross-validation and BIC, respectively. After $B^{(k)}$, $k = 1, 2$, are estimated, an edge from gene $j$ to gene $i$ under condition $k$ is declared if $B_{i,j}^{(k)} \neq 0$. The power of detection (PD) and the false discovery rate (FDR) for detecting network edges were calculated from $B^{(1)}$ and $B^{(2)}$ estimated from the data of each of the 30 network replicates. Detailed definitions of PD and FDR are given in Supplementary Text S1. The differential network was defined as $\Delta B = B^{(2)} - B^{(1)}$, and PD and FDR for the differential network were calculated accordingly.
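As an illustration of how edge-detection metrics of this kind are typically computed, the Python sketch below assumes the common definitions PD = TP / (TP + FN) and FDR = FP / (TP + FP) over off-diagonal entries. The exact definitions used in this study are in Supplementary Text S1, so the formulas, the function name and the toy matrices here are assumptions.

```python
def pd_fdr(B_true, B_est, tol=0.0):
    """Power of detection and false discovery rate for the support of B_est.

    Assumed definitions: PD = TP / (TP + FN), FDR = FP / (TP + FP),
    counted over off-diagonal entries; tol is a hypothetical magnitude cutoff.
    """
    n = len(B_true)
    tp = fp = fn = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            true_edge = B_true[i][j] != 0
            est_edge = abs(B_est[i][j]) > tol
            if true_edge and est_edge:
                tp += 1
            elif est_edge:
                fp += 1
            elif true_edge:
                fn += 1
    pd = tp / (tp + fn) if tp + fn else float("nan")
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return pd, fdr

# Toy example: two true edges, one recovered, one false positive.
B_true = [[0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0]]
B_est = [[0.0, 0.9, 0.2],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
pd, fdr = pd_fdr(B_true, B_est)
```

The same function can be applied to the true and estimated $\Delta B$ to score the differential network.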

The results for DAGs with $n = 300$, $n_e = 3$ and $\sigma^2 = 0.25$ are depicted in Figure 1, and results for DAGs under other settings are given in Supplementary Figures S1–S5. First, consider the PD and FDR of $B^{(1)}$ and $B^{(2)}$ in the left panel of Figure 1. FSSEM offers slightly better PD than SML and much better PD than QDG when the sample size is $\le 200$, and slightly better PD than QDG when the sample size is $> 200$. It offers much lower FDR than QDG and slightly lower FDR than SML. Next, consider the PD and FDR of $\Delta B = B^{(2)} - B^{(1)}$ in the right panel of Figure 1. FSSEM exhibits slightly worse PD than SML and much better PD than QDG when the sample size is $\le 200$, and similar or slightly better PD than SML and QDG when the sample size is $> 200$. Moreover, FSSEM offers much smaller FDR than both SML and QDG across all sample sizes. The same trend was also observed in Supplementary Figures S4 and S5 for $n = 300$ with different noise levels for the PD and the FDR of $B^{(1)}$, $B^{(2)}$ and $\Delta B$. Comparing the results in Figure 1 and Supplementary Figures S4 and S5 for DAGs with $n = 300$ with those in Supplementary Figures S1–S3 for DAGs with $n = 30$, we observed that both the PD and FDR of FSSEM and SML, as well as the PD of QDG, are similar, and that the FDR of QDG improves when $n = 30$ but is still higher than the FDR of FSSEM. Overall, FSSEM offered lower FDR than SML and QDG, particularly in identifying changed network edges, while exhibiting similar or higher PD. For the GRNs of $n = 300$ genes, there are $2(n^2 - n) = 179400$ unknown entries in $B^{(1)}$ and $B^{(2)}$ to be estimated, and $2 n n_e = 1800$ unknown entries in $F^{(1)}$ and $F^{(2)}$. Therefore, the number of observations is much smaller than the number of unknowns when the sample size $n_s \le 500$. Interestingly, the performance of the FSSEM and SML algorithms did not change much when the sample size $n_s$ varied from 80 to 500, but the performance of the QDG algorithm improved as the sample size increased.

Fig. 1.

The PD and FDR of FSSEM, SML and QDG algorithms for the DAG with $n = 300$ genes and $n_e = 3$ eQTLs per gene. The number of samples $n_s = n_1 = n_2$ varies from 80 to 500 and the noise variance is $\sigma^2 = 0.25$. PD and FDR were obtained from 30 network replicates.

Simulation results for DCGs with $n = 300$, $n_e = 3$ and $\sigma^2 = 0.25$ are depicted in Figure 2, and results for DCGs under other settings are shown in Supplementary Figures S6–S10. The performance of the three algorithms for DCGs is similar to that in the cases of DAGs. To evaluate the performance of these algorithms in inferring relatively large GRNs, we increased $n$ to 600. For both DAGs and DCGs, the performance of FSSEM is similar to the cases when $n = 300$, as shown in Supplementary Figure S11. However, QDG and SML were too time-consuming and failed to produce any results for $n = 600$.

Fig. 2.

The PD and FDR of FSSEM, SML and QDG algorithms for the DCG with $n = 300$ genes and $n_e = 3$ eQTLs per gene. The number of samples $n_s = n_1 = n_2$ varies from 80 to 500 and the noise variance is $\sigma^2 = 0.25$. PD and FDR were obtained from 30 network replicates.

For convenience of comparison, the simulation results for DAGs and DCGs with $n_1 = n_2 = 500$, $n_e = 3$ and $\sigma^2 = 0.25$ are summarized in Table 1, which clearly shows that FSSEM outperforms both SML and QDG. In particular, FSSEM offers much lower FDR than SML and QDG in inferring the differential GRN. In all simulations shown in Figures 1 and 2, Supplementary Figures S1–S11 and Table 1, we assumed that data samples are independent. We also tested the performance of the FSSEM algorithm for the case of paired data samples, where the paired samples $y_i^{(1)}$ and $y_i^{(2)}$ are correlated. To this end, we jointly generated the $(i, j)$th elements of $E^{(1)}$ and $E^{(2)}$ from two Gaussian random variables with zero mean, variance $\sigma^2$ and correlation coefficient $\rho$. The simulation results in Supplementary Figures S12 and S13 for DAG and DCG networks of 30 genes show that the performance of FSSEM for estimating the differential GRN ($\Delta B$) remains almost the same when $\rho$ varies from 0 to 1. The FDR for estimating $B$ degrades slightly when $\rho$ varies from 0 to 1. This shows that FSSEM is robust w.r.t. the correlation between paired data samples.

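Table 1. The PD and FDR of FSSEM, SML and QDG algorithms

| Network | n | FSSEM PD (B) | FSSEM FDR (B) | FSSEM PD (ΔB) | FSSEM FDR (ΔB) | SML PD (B) | SML FDR (B) | SML PD (ΔB) | SML FDR (ΔB) | QDG PD (B) | QDG FDR (B) | QDG PD (ΔB) | QDG FDR (ΔB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAG | 30 | 1.000 | 0.012 | 1.000 | 0.029 | 0.948 | 0.077 | 1.000 | 0.751 | 0.974 | 0.037 | 0.969 | 0.245 |
| DAG | 300 | 1.000 | 0.008 | 1.000 | 0.040 | 0.940 | 0.159 | 1.000 | 0.751 | 0.996 | 0.885 | 0.996 | 0.980 |
| DCG | 30 | 0.962 | 0.021 | 0.981 | 0.040 | 0.904 | 0.077 | 0.977 | 0.735 | 0.911 | 0.052 | 0.936 | 0.301 |
| DCG | 300 | 0.997 | 0.010 | 0.991 | 0.031 | 0.937 | 0.170 | 0.991 | 0.762 | 0.993 | 0.882 | 0.986 | 0.980 |

Note: Expected number of eQTLs per gene is $n_e = 3$, number of samples is $n_1 = n_2 = 500$, noise variance is $\sigma^2 = 0.25$. PD and FDR were obtained from 30 network replicates.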

3.2 Real data analysis

In Lu et al. (2011), gene expression levels in 42 tumors and their adjacent normal tissues of non-smoking female patients with lung adenocarcinoma were measured with 54 675 probe sets from Affymetrix Human Genome U133 Plus 2.0 arrays. The genotypes of single nucleotide polymorphisms (SNPs) in the same set of tissues were obtained using 906 551 SNP probes from Affymetrix Genome-Wide Human SNP 6.0 arrays. We applied FSSEM to this dataset to infer GRNs in lung cancer and normal tissues.

Both gene expression and SNP data in the Gene Expression Omnibus (GEO) database (GSE33356) were downloaded. The R package affy (Gautier et al., 2004) was employed to transform raw microarray data to normalized gene expression levels. Specifically, the raw gene expression data in the custom CDF format (Dai et al., 2005) were normalized using the robust multi-array average (RMA) method (Irizarry et al., 2003a). In total, gene expression levels of 18 807 genes with their Entrez IDs were obtained from 54 675 probe sets. The genotypes of the 906 551 SNP probes in the 84 tissue samples were transformed to values {0, 1, 2} using the following mapping: AA → 0, AB → 1 and BB → 2. The missing genotypes of SNP probes were imputed by Beagle (Browning and Browning, 2007), and SNPs with a minor allele frequency (MAF) of 5% or less were removed (Altshuler et al., 2005). Finally, the R package MatrixEQTL (Shabalin, 2012) was adopted to identify cis-eQTLs of genes. In total, 1260 genes were found to have at least one cis-eQTL within 1 M base pairs (bps) from the open reading frame (ORF) of the gene at an FDR < 0.05 and MAF > 5%.

We applied the FSSEM algorithm to the expression levels and genotypes of the eQTLs of these 1260 genes to infer the GRNs in lung tumor and normal tissues. An edge from gene $i$ to gene $j$ was detected if $B_{ji}^{(k)} \neq 0$, $k = 1, 2$, where $B^{(1)}$ and $B^{(2)}$ specify the GRNs in normal and tumor tissues, respectively. Then, we identified the differential GRN based on $\Delta B = B^{(2)} - B^{(1)}$. Since small changes in the coefficients $B_{ji}$ may not have much biological effect, we regarded the regulatory effect from gene $i$ to gene $j$ as different using the following two criteria rather than the simple criterion $B_{ji}^{(2)} \neq B_{ji}^{(1)}$. The first criterion is $|B_{ji}^{(2)} - B_{ji}^{(1)}| > \min\{|B_{ji}^{(1)}|, |B_{ji}^{(2)}|\}$, which ensures that there is at least a one-fold change relative to $\min\{|B_{ji}^{(1)}|, |B_{ji}^{(2)}|\}$. However, when one of $B_{ji}^{(k)}$, $k = 1, 2$, is zero or near zero, this criterion still fails to filter out very small changes. To avoid this issue, we added another criterion. Specifically, we obtained all nonzero entries of $B^{(k)}$, $k = 1, 2$, and computed the 20th percentile of all nonzero $|B_{ji}^{(k)}|$, $k = 1, 2$, as $\eta$. Then, we defined the second criterion as $\max\{|B_{ji}^{(k)}|, k = 1, 2\} > \eta$.
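The two filtering criteria can be sketched in Python as follows. The nearest-rank percentile is an assumption, since the text does not say which percentile estimator was used, and the function names and toy matrices are hypothetical.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (an assumption; the text does not name a method)."""
    s = sorted(values)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

def differential_edges(B1, B2, pct=20):
    """Edges (i, j), meaning gene i -> gene j, that satisfy both criteria."""
    n = len(B1)
    nonzero = [abs(v) for B in (B1, B2) for row in B for v in row if v != 0]
    eta = percentile(nonzero, pct)
    edges = []
    for j in range(n):
        for i in range(n):
            if i == j:
                continue
            b1, b2 = B1[j][i], B2[j][i]
            one_fold = abs(b2 - b1) > min(abs(b1), abs(b2))   # first criterion
            large_enough = max(abs(b1), abs(b2)) > eta         # second criterion
            if one_fold and large_enough:
                edges.append((i, j))
    return edges, eta

# Toy estimates: B1 for normal tissue, B2 for tumor tissue (hypothetical values).
B1 = [[0.0, 0.5, 0.0],
      [0.05, 0.0, 0.0],
      [0.0, 0.8, 0.0]]
B2 = [[0.0, 0.55, 0.0],
      [0.0, 0.0, 0.9],
      [0.0, 0.1, 0.0]]
edges, eta = differential_edges(B1, B2)
```

In this toy input the small change from 0.05 to 0 passes the first criterion but is removed by the second, which is the situation the threshold $\eta$ is designed to handle.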

Finally, our network analysis with FSSEM identified 325 genes that are associated with at least one edge in the differential network. The differential network has 765 edges and is depicted in Figure 3. To assess whether the 325 genes are related to the cancer status, we performed gene set enrichment analysis (GSEA) with the 4762 C2 gene sets in the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005). Using Fisher's exact test, we found that 25 C2 gene sets are enriched in the set of 325 genes at a p-value < 0.05. These 25 gene sets are listed in Supplementary Table S1. After searching the C2 gene sets for the keywords 'lung cancer', 'lung tumor', 'lung carcinoma' and 'LUCA' in their description text, we identified 140 gene sets that are related to lung cancer. Comparing the 140 lung cancer-related gene sets with the 25 enriched gene sets, we found that 5 of the 25 enriched gene sets are related to lung cancer, which is significant (Fisher's exact test, p-value $< 7.2 \times 10^{-4}$). Furthermore, we ranked the 325 genes according to the number of edges in the differential GRN that they are involved in. The top ten genes are C4BPA, ANPEP, LTF, SELE, HLA-DQA1, CLC, PNMAL1, TPSAB1, ERAP2 and PPP1R14C; all ten have been reported in the literature to be implicated in lung cancer, as discussed in Supplementary Text S1.
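The overlap test between the enriched gene sets and the lung cancer-related gene sets can be illustrated with a hypergeometric upper tail, which is what a one-sided Fisher's exact test on the corresponding 2 × 2 table computes. The sketch below uses the counts quoted above; whether the reported p-value was computed from exactly this table and with a one-sided test is an assumption, and the function name is hypothetical.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked items, n draws)."""
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return tail / total

# 4762 C2 gene sets, 140 lung cancer-related, 25 enriched, 5 in the overlap.
p = enrichment_pvalue(N=4762, K=140, n=25, k=5)
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that floating-point factorials would cause for counts of this size.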

Fig. 3.

The differential GRN of 325 genes with 765 edges in lung tumors versus normal tissues. The size of a node is proportional to its degree, i.e. the number of edges that the node connects to. The top ten genes with the highest degrees are labeled.

In Holbrook et al. (2011), gene expression levels and SNPs of 49 gastric tumors and normal tissues were measured with Illumina mRNA expression arrays and Affymetrix SNP arrays, respectively. We downloaded the dataset, which contains the Illumina expression data and SNP data, from the GEO database (GSE29999). A total of 558 genes were found to have at least one cis-eQTL within 1 M base pairs from the ORF of the gene at an FDR < 0.01 and MAF > 5%. Genotypes of the eQTLs and expression levels of these 558 genes were analyzed with FSSEM, which yielded a differential GRN of 88 genes in gastric cancer versus normal tissues. Two MSigDB C2 gene sets were found to be enriched in the set of 88 genes, as described in Supplementary Table S2. One of the enriched gene sets is related to gastric cancer. The 88 genes were ranked according to the number of edges that each gene is associated with, and the top ten genes are CLC, NLRP2, SNCAIP, IL33, WARS, PGGHG, CXCL13, CIDEC, TMEM45B and RASEF. Six of the top ten genes are related to gastric cancer, and two others are related to other cancers; see Supplementary Text S1 for a more detailed description. Therefore, analysis of both the lung and gastric datasets with the FSSEM algorithm demonstrates that the genes in the differential GRN identified by FSSEM are relevant to tumorigenesis.

4 Discussion

In this paper, we developed an efficient algorithm, named FSSEM, for joint inference of two similar GRNs by integrating genetic perturbations with gene expression data under two different conditions with the SEM. Computer simulations showed that FSSEM offered much better accuracy in identifying changed gene-gene interactions than both the SML and QDG algorithms, which infer the two GRNs separately. In particular, the FDR of gene interactions in the differential GRN estimated by FSSEM was much lower than that obtained with SML and QDG. This result is expected because FSSEM exploits the similarity between the two GRNs and penalizes changes in gene interactions in the inference process. We also analyzed real datasets of lung and gastric cancers. The FSSEM algorithm identified a set of genes involved in the differential GRNs in cancers, and these genes have been reported to be relevant to tumorigenesis.

The number of unknowns in the network matrices $B^{(k)}, k = 1, 2$, is $2n(n-1)$, and the number of unknowns in $F^{(k)}, k = 1, 2$, is $2q$. When $n$ is large, the number of unknowns is huge, which incurs a heavy computational cost. In the analysis of the lung cancer data, we have $n = 1260$, and we were able to complete the analysis in several days on a desktop with an Intel i7-5820K CPU. If $n$ is larger, analysis using FSSEM may be very time consuming. In future work, we will consider parallelizing FSSEM to increase its speed. In this paper, we consider joint inference of two similar GRNs. In some situations, we may need to infer more than two GRNs, e.g. in different developmental stages. Our future work also aims to develop a new algorithm for joint inference of more than two GRNs.
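
As a quick check of the scale involved, the counts above can be evaluated directly. The short Python sketch below assumes the formulas $2n(n-1)$ and $2q$ as stated; the function name is illustrative, and the value of `q` in the example is a placeholder because the number of eQTLs is not given in this section.

```python
def count_unknowns(n, q):
    """Return (network unknowns, perturbation unknowns) for two conditions.

    n: number of genes; q: number of eQTLs.
    Each n x n network matrix has n(n-1) off-diagonal unknowns.
    """
    network = 2 * n * (n - 1)
    perturbation = 2 * q
    return network, perturbation

# Lung cancer analysis: n = 1260 genes; q = 1500 is a hypothetical placeholder.
network, perturbation = count_unknowns(1260, 1500)
print(network)  # 3172680
```

For $n = 1260$ the two network matrices alone contain more than three million unknowns, which is consistent with the multi-day running time reported above.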

Funding

This work was supported by the National Science Foundation [Grant No. CCF-1319981], and the National Institute of General Medical Sciences [Grant No. 5R01GM104975].

Conflict of Interest: none declared.

Supplementary Material

btz529_Supplementary_Text

References

  1. Altshuler D. et al. (2005) A haplotype map of the human genome. Nature, 437, 1299.
  2. Bolte J. et al. (2014) Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146, 459–494.
  3. Browning S.R., Browning B.L. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Human Genet., 81, 1084–1097.
  4. Butte A.J., Kohane I.S. (1999) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Biocomputing 2000, pp. 418–429. World Scientific.
  5. Cai X. et al. (2013) Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Comput. Biol., 9, e1003068.
  6. Califano A. (2011) Rewiring makes the difference. Mol. Syst. Biol., 7, 463.
  7. Dai M. et al. (2005) Evolving gene/transcript definitions significantly alter the interpretation of genechip data. Nucleic Acids Res., 33, e175.
  8. Danaher P. et al. (2014) The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Series B Stat. Method., 76, 373–397.
  9. Faith J.J. et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5, e8.
  10. Friedman J. et al. (2007) Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302–332.
  11. Friedman J. et al. (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.
  12. Gardner T.S. et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105.
  13. Gautier L. et al. (2004) affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315.
  14. Harbison C.T. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99.
  15. Haury A.-C. et al. (2012) TIGRESS: trustful inference of gene regulation using stability selection. BMC Syst. Biol., 6, 145.
  16. Hoefling H. (2010) A path algorithm for the fused lasso signal approximator. J. Comput. Graphical Stat., 19, 984–1006.
  17. Holbrook J.D. et al. (2011) Deep sequencing of gastric carcinoma reveals somatic mutations relevant to personalized medicine. J. Transl. Med., 9, 119.
  18. Ideker T., Krogan N.J. (2012) Differential network biology. Mol. Syst. Biol., 8, 565.
  19. Irizarry R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
  20. Liu B. et al. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics, 178, 1763–1776.
  21. Logsdon B.A., Mezey J. (2010) Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Comput. Biol., 6, e1001014.
  22. Lu T.-P. et al. (2011) Integrated analyses of copy number variations and gene expression in lung adenocarcinoma. PLoS One, 6, e24829.
  23. Margolin A.A. et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, S7.
  24. Mohan K. et al. (2014) Node-based learning of multiple Gaussian graphical models. J. Mach. Learn. Res., 15, 445–488.
  25. Neto E.C. et al. (2008) Inferring causal phenotype networks from segregating populations. Genetics, 179, 1089–1100.
  26. Pock T., Sabach S. (2016) Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM J. Imag. Sci., 9, 1756–1787.
  27. Shabalin A.A. (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics, 28, 1353–1358.
  28. Sonawane A.R. et al. (2017) Understanding tissue-specific gene regulation. Cell Rep., 21, 1077–1088.
  29. Statnikov A., Aliferis C.F. (2010) Analysis and computational dissection of molecular signature multiplicity. PLoS Comput. Biol., 6, e1000790.
  30. Subramanian A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550.
  31. Tegner J. et al. (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc. Natl. Acad. Sci. USA, 100, 5944–5949.
  32. Thieffry D. et al. (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli. Bioessays, 20, 433–440.
  33. Viallon V. et al. (2016) On the robustness of the generalized fused lasso to prior specifications. Stat. Comput., 26, 285–301.
  34. Zhu J. et al. (2007) Increasing the power to detect causal associations by combining genotypic and expression data in segregating populations. PLoS Comput. Biol., 3, e69.
  35. Zou H. (2006) The adaptive lasso and its oracle properties. J. Am. Stat. Assoc., 101, 1418–1429.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
