Bioinformatics. 2017 Nov 27;34(7):1148–1156. doi: 10.1093/bioinformatics/btx748

Semi-supervised network inference using simulated gene expression dynamics

Phan Nguyen 1, Rosemary Braun 1,2
Editor: Inanc Birol
PMCID: PMC6455938  PMID: 29186340

Abstract

Motivation

Inferring the structure of gene regulatory networks from high-throughput datasets remains an important and unsolved problem. Current methods are hampered by problems such as noise, low sample size, and incomplete characterizations of regulatory dynamics, leading to networks with missing and anomalous links. Integration of prior network information (e.g. from pathway databases) has the potential to improve reconstructions.

Results

We developed a semi-supervised network reconstruction algorithm that enables the synthesis of information from partially known networks with time course gene expression data. We adapted partial least squares-variable importance in projection (PLS-VIP) for time course data and used reference networks to simulate expression data from which null distributions of VIP scores are generated and used to estimate edge probabilities for input expression data. By using simulated dynamics to generate reference distributions, this approach incorporates previously known regulatory relationships and links the network to the dynamics to form a semi-supervised approach that discovers novel and anomalous connections. We applied this approach to data from a sleep deprivation study with KEGG pathways treated as prior networks, as well as to synthetic data from several DREAM challenges, and found that it is able to recover many of the true edges and identify errors in these networks, suggesting its ability to derive posterior networks that accurately reflect gene expression dynamics.

Availability and implementation

R code is available at https://github.com/pn51/postPLSR.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

For cells to function properly, thousands of genes must interact in a concerted effort to produce the appropriate amounts of RNA and protein required for a variety of biological processes (Iyer et al., 2017; Karlebach and Shamir, 2008; MacNeil and Walhout, 2011; Thompson et al., 2015). The collection of genes, their products, and interactions between them comprise gene regulatory networks (GRNs) which regulate their abundance and activity. Understanding the dynamics and structure of these networks can shed light on the regulatory cascades responsible for the emergence of different phenotypes, disease mechanisms, metabolic processes and other biological functions.

One way to elucidate the interactions between genes is the use of microarray and sequencing assays that can measure the activity levels of thousands of genes simultaneously. Advances in high-throughput technologies have enabled the generation and widespread availability of rich datasets in an affordable and efficient manner. One of the major goals in functional genomics and systems biology is the prediction of functional relationships between genes from these datasets via computational means (de Jong, 2002; Wang and Huang, 2014). This procedure, gene network reconstruction, can offer new experimental directions to verify novel interactions, identify deficiencies in currently known networks and models, and understand how these networks function and can be perturbed to affect disease pathways and other vital processes.

Although high-throughput sequencing techniques now generate large datasets efficiently and affordably, constructing accurate GRNs from these measurements remains a challenge. Typically, sample sizes are small compared with the number of genes measured. Consequently, network reconstruction is an underdetermined problem in which many models fit the data and an exponentially large space of networks needs to be considered. Furthermore, problems such as the stochastic nature of gene expression, experimental noise, missing data, difficulties in distinguishing between direct and indirect effects and incomplete characterizations of the gene regulatory dynamics hinder the efficacy of many network reconstruction approaches. To produce accurate GRNs, algorithms need to address these issues with plausible, accurate modeling assumptions and constraints.

Many methods have been proposed to address these challenges. Early methods for GRN reconstruction used coexpression between gene expression profiles to identify relationships between genes and treat quantities such as correlation and mutual information as measures of edge confidence (Butte and Kohane, 1999, 2000; Rice et al., 2005). Context Likelihood of Relatedness (CLR) (Faith et al., 2007) and Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) (Margolin et al., 2006) built on mutual information-based relevance networks by filtering out indirect interactions. Regression-based methods with stability selection to control for false discoveries have also been adapted to estimate regulatory strength between genes (Haury et al., 2012; van Someren et al., 2006). Other approaches to determine causality between genes have been based on neural networks (Weaver et al., 1999), probabilistic graphical models (Friedman, 2004), Boolean networks (Kauffman, 1969), random forests (Huynh-Thu et al., 2010) and partial least squares regression (PLSR) (Ciaccio et al., 2015; Guo et al., 2016; Pihur et al., 2008).

In general, these methods assume that data samples are independent in order to infer edges by using similarity- and causality-based edge confidence measures or by incorporating these samples into regression-type methods to estimate the influence among genes on their expression. Furthermore, early work in GRN reconstruction has focused on static data. However, since gene expression regulation is a dynamic process that can exhibit high autocorrelation, temporal data can be used to detect periodicity, identify cascades of differential expression, observe temporal responses to knockouts and other perturbations, and study expression evolution across different environmental, phenotypic, and other conditions, all of which may be used to infer causality between genes (Bar-Joseph, 2004). Many GRN reconstruction methods have been developed to handle some of the features and challenges that are unique to time course data, some of which are adaptations of static methods, such as TD-ARACNE (Zoppoli et al., 2010). Other methods include those based on Granger causality (Tam et al., 2013), dynamic time warping (Riccadonna et al., 2016), dynamic Bayesian networks (Perrin et al., 2003), and differential equations (Bonneau et al., 2006; Stokić et al., 2009).

Despite varying levels of success, many of these methods do not take advantage of known regulatory dependencies that are available in databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000; Kanehisa et al., 2014). Several methods have been developed to leverage prior information in order to reduce false positive rates and improve network reconstructions. CoRe (Petri et al., 2015) uses a known network and supervised learning to learn a classification model for each regulator of the network with the expression data of a potential target as input and edge confidence as output. The edge confidence scores are then recalibrated based on models learned after permuting the gene labels of the expression data and degree-preserving randomizations of the known network. iRafNet (Petralia et al., 2015) extends the random forest-based algorithm GENIE3 (Huynh-Thu et al., 2010) by using prior regulatory information to bias the selection of known regulators when learning a random forest model for each target gene in a network.

Several methods to detect missing and anomalous links based on the network’s topological features have also been proposed (Clauset et al., 2008; Guimerà and Sales-Pardo, 2009; Lü and Zhou, 2011). Typically, these methods compute edge likelihoods based on structural and generative assumptions or derive association quantities based on node degrees, common neighbors, path ensembles, and other network-based features (Lü and Zhou, 2011). However, they do not take into account any biological assumptions or experimental data. Given the availability of both pathway databases and expression datasets, it is therefore of interest to develop hybrid approaches that integrate transcriptomic data with partially known networks and accurate modeling constraints in order to refine these networks by detecting discrepancies and novel relationships.

In this paper, we describe a semi-supervised approach that integrates existing GRN model topologies with new time course gene expression data to infer novel interactions that are not captured in pathway databases. Our approach builds on PLS-variable importance in projection (VIP) (Chong and Jun, 2005), a variable selection method using PLSR. Here, we adapt PLS-VIP for time course data by assuming that the expression of a gene is a function of the expression of the genes at the previous time point and calculate the VIP scores for the lagged PLSR model. Furthermore, we use a reference network derived from a pathway database to simulate expression data from which additional VIP scores are computed to generate a pair of null distributions for each ordered gene pair. The data-derived VIP scores are then compared with these respective null distributions to estimate posterior edge probabilities. In Section 2, we describe this approach in more detail. In Section 3, we apply the method to time course datasets from past DREAM challenges, as well as to an experimental dataset from an insufficient sleep study using KEGG pathways as prior networks. The synthetic DREAM challenge data provides a highly controlled framework in which to evaluate the performance of our approach, whereas application to the experimental dataset and KEGG pathways enables us to compare our method against competing approaches in real data. While the connections specified in KEGG pathways have been expertly curated, we expect that there remain inaccuracies (either missing edges that have yet to be experimentally identified, or anomalous links due to prior false positive results) that our method is designed to detect. In both applications, we show that the method is able to recover significant portions of these networks while also being able to identify novel and anomalous connections. These results suggest the method’s ability to derive posterior networks that accurately reflect gene expression dynamics by incorporating previously known regulatory relationships and linking the network to the dynamics. In Section 4, we conclude and discuss possible extensions to our approach.

2 Materials and methods

2.1 Background: PLSR-based methods

In general, regression-based methods assume that the expression of a gene can be modeled as a function of all other genes in a dataset, and determine if an edge exists using the magnitude of the association between a predictor and the target gene. Multiple linear regressions can be used to model the relationship between a set of predictors and a response when the number of predictors is low relative to the number of available samples and when the predictors are not highly collinear. However, gene expression and other biological datasets are subject to problems with low sample sizes, high dimensionality, and multicollinearity, making multiple linear regressions inappropriate for modeling relationships in those datasets. One approach that has been used for predictive modeling while also accounting for these problems is PLSR (Chong and Jun, 2005; Höskuldsson, 1988; Wold et al., 2001). PLSR seeks to simultaneously decompose predictor and response matrices with sets of latent variables for each matrix such that the covariance between the sets of latent variables is maximized. This procedure permits the high dimensionality of the gene expression space to be reduced to a smaller number of PLSR components, removing latent sources of variance that have little predictive value.

Several GRN reconstruction approaches have used PLSR as an underlying method. Pihur et al. (2008) introduced an unsupervised method that assumes that the expression of each gene is a linear function of the expression of the remaining genes in a dataset. PLSR is then used to construct a model for each gene, and an interaction score between two genes is derived based on the contribution of the latent variables to the predictor gene and on the contribution of the predictor gene to the latent variables of the target gene’s model. These scores are then aggregated and thresholded to predict an undirected network. The method was shown to be able to identify potential interactions, but it exhibited high false positive rates, required large sample sizes for better quality results, and only predicted undirected edges that corresponded to association rather than causation.

The PLSR ‘VIP’ score provides an efficient means to select predictor genes that most strongly predict the target gene, between which edges may be inferred to exist. Briefly, the VIP score quantifies the contribution of each gene to the latent variables as the weighted sum of the squared correlations between the PLSR components and the original variable (Chong and Jun, 2005). A variable with a large VIP score (close to or greater than 1) can be considered important in a given model, and thus PLS-VIP may be used to identify predictor genes that most strongly influence the target gene. A number of methods have used PLS-VIP to infer edges in GRNs. DIONESUS (Ciaccio et al., 2015) is an unsupervised GRN reconstruction algorithm that models the expression of each gene as a function of the expression of the remaining genes. It iterates through the genes of a dataset, treating each as a response and the remaining genes as potential predictors, and uses PLS-VIP to compute VIP scores for the predictors. These scores are treated as measures of edge confidence, with higher scores assumed to be more indicative of an edge, and are aggregated and thresholded to predict directed edges. DIONESUS was shown to be scalable, efficient, and accurate on in silico data. It was applied to microwestern array and cell viability assay data to reconstruct a cell signaling network in a human carcinoma cell line driven by the overexpression of Epidermal Growth Factor Receptor, and it was also successful in reconstructing networks in several DREAM challenges. PLSNET (Guo et al., 2016) is a variant of the PLS-VIP approach that incorporates sample-bagging and feature-bagging. At each iteration, the algorithm applies PLS-VIP to a bootstrapped expression dataset, and the VIP scores for each regulator-target pair are added across all iterations and rescaled to account for a regulator’s influence across different target genes.
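For reference, the VIP score is commonly written as follows. This is a sketch in the spirit of the notation of Chong and Jun (2005); the symbols $m$, $A$, $\mathbf{w}_a$ and $SS_a$ are introduced here for illustration and are not defined elsewhere in this paper.

$$\mathrm{VIP}_j = \sqrt{\frac{m \sum_{a=1}^{A} SS_a \left(w_{ja} / \lVert \mathbf{w}_a \rVert\right)^2}{\sum_{a=1}^{A} SS_a}}$$

Here $m$ is the number of predictor genes, $A$ is the number of PLSR components, $w_{ja}$ is the weight of predictor $j$ in component $a$, and $SS_a$ is the amount of response variance explained by component $a$. A predictor that carries large weights in the components that explain most of the response receives a large score.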

Our method extends these ideas in two novel and important ways. First, by using time-course gene expression data, we are able to make causal inferences about gene regulation, permitting the reconstruction of directed networks that reflect regulatory dependencies. Second, by simulating gene expression dynamics from known networks, our method incorporates previously known regulatory relationships in a semi-supervised manner and links the network structure to the dynamical behavior of the system.

2.2 Extending PLSR to time-course data

Since gene expression regulation is a dynamic process, time course data can be measured and used to infer causality, identify activated genes and changes in differential expression, detect periodicity, determine coexpression, and provide insight into other temporal aspects and mechanisms that cannot be ascertained from static data. One basic approach that has been used to model and analyze time course expression data, such as in methods based on Granger causality and Markov processes, is to assume that the data can be modeled with vector autoregression (Dewey and Galas, 2001; Tam et al., 2013). More specifically, the one-way temporal ordering of the data can be exploited by assuming that the expression of a gene is linearly dependent on the expression of its regulators at a previous time point. A lag of one time interval between the predictor and response variables is typically used, so that the expression of a gene is described by

$$x_i(t+\Delta t) = \sum_{j \neq i} w_{ji}\, x_j(t) + \epsilon, \tag{1}$$

where $x_i(t)$ is the expression of gene $i$ at time $t$, and $w_{ji}$ are weights indicating how much the expression level of gene $j$ influences that of gene $i$ over a time interval $\Delta t$.
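As a minimal sketch of how the lagged model can be set up for one target gene, the following pairs each predictor's expression at time t with the target's expression one time step later. The function name `lagged_design` and the dictionary-based input format are illustrative assumptions, not part of the published R code, and the target gene is excluded from its own predictors on the assumption that the sum in (1) runs over the remaining genes.

```python
def lagged_design(expression, target, lag=1):
    """Pair predictors at time t with the target at time t + lag.

    expression: dict mapping gene name -> list of values ordered in time.
    Returns (predictor_names, predictor_columns, response).
    """
    n_time = len(expression[target])
    # Every gene except the target is a candidate regulator (assumption).
    predictors = [gene for gene in expression if gene != target]
    # Predictor columns drop the last `lag` time points ...
    columns = [expression[gene][: n_time - lag] for gene in predictors]
    # ... and the response drops the first `lag` time points.
    response = expression[target][lag:]
    return predictors, columns, response


# Hypothetical three-gene time course with four time points.
toy = {"A": [1, 2, 3, 4], "B": [0, 1, 0, 1], "C": [5, 6, 7, 8]}
names, X_columns, y = lagged_design(toy, target="C")
```

The resulting columns and response can then be passed to any PLSR fitting routine.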

Gene expression has more typically been modeled with differential equations (Chen et al., 2004; Elowitz and Leibler, 1999; Gardner et al., 2001; Marbach et al., 2010), particularly when the interest is in how gene expression changes during a biological process or as a result of perturbations and stimuli. In the linear case,

$$\frac{dx_i(t)}{dt} = \sum_{j \neq i} a_{ji}\, x_j(t) + \epsilon,$$

where $a_{ji}$ is the rate of influence of the expression of gene $j$ on that of gene $i$. Since microarray data can only be collected at discrete time points, discretizing this equation results in

$$\Delta x_i(t+\Delta t) = \sum_{j \neq i} w_{ji}\, x_j(t) + \epsilon, \tag{2}$$

where $w_{ji} = a_{ji}\Delta t$ and $\Delta x_i(t+\Delta t) = x_i(t+\Delta t) - x_i(t)$. PLSR may be applied to fit (1) or (2) to time-course gene expression data.
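The discretization step can be spelled out as follows. This is a sketch that assumes a forward-difference approximation of the derivative, which is one standard way to arrive at (2) and is consistent with the definition of the weights as rate constants multiplied by the time interval.

$$\frac{x_i(t+\Delta t) - x_i(t)}{\Delta t} \approx \frac{dx_i(t)}{dt} = \sum_{j \neq i} a_{ji}\, x_j(t) + \epsilon \quad\Longrightarrow\quad x_i(t+\Delta t) - x_i(t) \approx \sum_{j \neq i} \left(a_{ji}\Delta t\right) x_j(t) + \epsilon\,\Delta t$$

The rescaled noise term $\epsilon\,\Delta t$ is again written as $\epsilon$ in (2). The only difference from (1) is therefore the response variable: the change in expression over the interval rather than the expression level itself.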

To assess the contribution of a predictor gene $j$ on a target gene $i$, we compute the VIP score $v_{ji}$ for $j$ in the expression model for $i$. However, rather than setting a common threshold $v_{\mathrm{thresh}}$ and assigning edges when $v_{ji} > v_{\mathrm{thresh}}$, we instead compare $v_{ji}$ to the distributions of VIP scores that would be expected if the edge is or is not present, using simulated network dynamics. This approach has the appealing feature of integrating known regulatory links into the analysis and provides a means to assign a posterior probability to a given edge, as detailed below.
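To make the VIP computation concrete, here is a minimal pure-Python sketch of a single-response PLSR fit (the NIPALS-style PLS1 algorithm) followed by the usual VIP formula. It is an illustration under stated assumptions, not the authors' implementation: the function name `pls1_vip`, the column-wise input layout, and the default of two components are all hypothetical choices.

```python
import math


def _center(column):
    """Subtract the mean from a list of values."""
    mean = sum(column) / len(column)
    return [value - mean for value in column]


def pls1_vip(predictor_columns, response, n_components=2):
    """Return one VIP score per predictor from a PLS1 fit."""
    X = [_center(col) for col in predictor_columns]
    y = _center(response)
    m, n = len(X), len(y)
    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the response.
        w = [sum(x * r for x, r in zip(col, y)) for col in X]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        # Scores t = X w, one value per sample.
        t = [sum(w[j] * X[j][k] for j in range(m)) for k in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        q = sum(a * b for a, b in zip(t, y)) / tt
        loadings = [sum(a * b for a, b in zip(col, t)) / tt for col in X]
        # Deflate predictors and response before the next component.
        X = [[x - s * load for x, s in zip(col, t)]
             for col, load in zip(X, loadings)]
        y = [r - q * s for r, s in zip(y, t)]
        weights.append(w)
        explained.append(q * q * tt)  # response variance explained
    total = sum(explained)
    if total == 0:
        raise ValueError("predictors explain none of the response variance")
    return [
        math.sqrt(m * sum(ss * w[j] ** 2 for ss, w in zip(explained, weights)) / total)
        for j in range(m)
    ]


# Toy data: the response depends strongly on the first predictor,
# weakly on the second, and not at all on the third.
vip = pls1_vip([[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]],
               [2.5, -1.5, 1.5, -2.5])
```

In the lagged setting, the predictor columns would hold expression at time t and the response would hold the target gene's expression at the next time point.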

2.3 Semi-supervised network reconstruction

Partial knowledge of GRNs is available in pathway databases such as KEGG. Given the availability of these knowledge-derived networks and transcriptomic datasets, it is useful to consider integrating this information in GRN reconstruction, using experimental data to identify novel or anomalous edges from existing networks. In contrast to many existing GRN reconstruction methods that construct de novo networks solely from expression data, semi-supervised methods can incorporate a partially known network with accurate modeling constraints and dynamics in order to explain the observed gene expression values and identify deficiencies that contribute to discrepancies between the observed expression values and the dynamics of the partially known network.

Here, we use prior networks from pathway databases to inform the inclusion of edges in the inferred networks. To motivate our approach, it is instructive to consider some of the properties of PLS-VIP and prior PLS-VIP-based approaches. In DIONESUS, a de novo network is constructed based on the intuition that because a VIP score is a measure of a predictor’s contribution to a PLSR model, a high VIP score for a predictor–response pair should be treated as evidence for the existence of a corresponding edge in the network. Using this assumption, the method aggregates the VIP scores across the PLSR models for all genes and imposes a cutoff to determine the edges of the network. However, the mathematical properties of the VIP scores—namely, that the mean of the squares of the VIP scores within each gene’s model is equal to 1—can contribute to a large number of errors when multiple models are aggregated and thresholded. If a gene is affected by most or all of the genes in a pathway, the VIP scores will summarize the relative importance of these regulators to each other, but when combining the scores and imposing a standard VIP cutoff of 1, many of those edges will not be identified. Similarly, when a gene has few regulators, the same procedure will identify many false positive edges. Therefore, while VIP scores from the same model may be compared with each other to determine the relative importance of potential predictors of the same target gene, they are not necessarily comparable across regression problems corresponding to different target genes.
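The normalization property mentioned above follows directly from the usual VIP definition. As a sketch, write $m$ for the number of predictors in one target gene's model, $\mathbf{w}_a$ for the weight vector of component $a$, and $SS_a$ for the response variance explained by that component (notation assumed here for illustration). Summing the squared scores over predictors gives

$$\sum_{j=1}^{m} \mathrm{VIP}_j^2 = \frac{m \sum_{a} SS_a \sum_{j=1}^{m} \left(w_{ja} / \lVert \mathbf{w}_a \rVert\right)^2}{\sum_{a} SS_a} = m, \qquad \text{so} \qquad \frac{1}{m}\sum_{j=1}^{m} \mathrm{VIP}_j^2 = 1,$$

because the normalized weights of each component have unit squared length. As a hypothetical illustration, if a target has $m = 10$ predictors that contribute equally, every score equals 1 and sits exactly at the standard cutoff. If instead a single predictor carried all of the weight, its score would be $\sqrt{10} \approx 3.16$ and the other nine would be 0. The scale of the scores therefore depends on how importance is distributed within each model, which is why a single cutoff does not transfer well across target genes.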

Rather than comparing VIP scores across PLSR models for different genes using a common threshold, we compare the VIP score for a pair of genes to distributions of scores that would be expected of that pair given the structure of the rest of the network and a dynamic model for gene expression. To obtain these distributions, we use the known regulatory connections of a pathway along with the dynamic model to simulate time course data from which reference distributions of VIP scores are computed. By comparing the data-derived VIP scores to those derived from simulated data that is potentially observable based on the network and the assumed dynamics on that network, we can identify VIP scores that are atypical for a pair of genes and may correspond to novel or anomalous edges.
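The simulation step can be sketched as follows, assuming the lagged linear model (1) as the dynamic model. The function names, the uniform weight range, the Gaussian initial state, and the noise level are illustrative assumptions; the settings actually used are described in the Supplementary Material.

```python
import random


def simulate_time_course(genes, edges, n_steps, rng, noise_sd=0.1):
    """Simulate x_i(t+1) = sum_j w_ji * x_j(t) + noise on a directed network.

    edges: iterable of (regulator, target) pairs.
    Returns a dict mapping gene -> list of n_steps simulated values.
    """
    # One random weight per edge (assumed range; not taken from the paper).
    weights = {(j, i): rng.uniform(-1.0, 1.0) for (j, i) in edges}
    state = {gene: rng.gauss(0.0, 1.0) for gene in genes}
    trajectory = {gene: [state[gene]] for gene in genes}
    for _ in range(n_steps - 1):
        new_state = {}
        for i in genes:
            drive = sum(w * state[j]
                        for (j, tgt), w in weights.items() if tgt == i)
            new_state[i] = drive + rng.gauss(0.0, noise_sd)
        state = new_state
        for gene in genes:
            trajectory[gene].append(state[gene])
    return trajectory


def with_and_without(edges, pair):
    """Return two edge sets: one containing `pair`, one lacking it."""
    edge_set = set(edges)
    return edge_set | {pair}, edge_set - {pair}


rng = random.Random(0)
reference = [("A", "B"), ("B", "C")]
edges_in, edges_out = with_and_without(reference, ("A", "C"))
sim_in = simulate_time_course(["A", "B", "C"], edges_in, n_steps=10, rng=rng)
sim_out = simulate_time_course(["A", "B", "C"], edges_out, n_steps=10, rng=rng)
```

Repeating the simulation many times for each of the two edge sets, and computing a VIP score from each simulated dataset, yields the pair of reference distributions for the gene pair of interest.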

A summary of the overall approach is shown in Figure 1. First, we assume that the expression of a gene can be modeled using (1) or (2) and calculate the data-derived VIP scores for the lagged PLSR model and an input expression dataset. We then use the same model and the reference network to simulate expression data from which additional VIP scores are computed to generate a pair of reference distributions for each ordered gene pair: one assuming the reference network includes the edge of interest, and another with the same edge excluded from the network. For each of the two networks, 2000 gene expression trajectories were simulated using randomly assigned weights $w_{ji}$ for every edge, and the VIP scores were computed from these simulated trajectories. Figure 2 shows examples of these simulated pairs of VIP distributions. The posterior edge probabilities are then estimated by comparing the data-derived VIP scores (shown as black vertical lines in Figure 2) to these respective distributions, and a network can be determined by thresholding the probabilities. We identify novel connections as prior non-edges with high posterior probabilities and erroneous edges as prior edges with low posterior probabilities. By using simulated dynamics to generate reference distributions, this approach incorporates previously known regulatory relationships and links the network to the dynamics to form a semi-supervised approach that uses the prior network and expression data to recover the true edges of the known network and discover novel and anomalous connections.

Fig. 1.


Workflow for the semi-supervised gene network reconstruction approach. The inputs are an observed time-course expression dataset and a network obtained from a pathway database. For the pair $(i, j)$ of interest, synthetic expression data are obtained from simulated dynamics on the network with and without edge $e_{ij}$. The VIP score for $(i, j)$ is computed from the input expression data and compared with the distributions of VIP scores that would be expected with and without $e_{ij}$ from the simulations to obtain the posterior probability that $e_{ij}$ is an edge (Color version of this figure is available at Bioinformatics online.)

Fig. 2.


Examples of pairs of distributions generated by the semi-supervised PLS-VIP-based approach. (a) Potential novel edge. The data-derived VIP score for the prior non-edge is high relative to the pairs of reference distributions of VIP scores, which contributes to a high posterior edge probability. The non-edge represents a potential novel edge in the network. (b) Potential anomalous edge. Since the data-derived VIP score for the prior edge is low relative to the pairs of reference distributions of VIP scores, the posterior edge probability will be low. This represents a potential anomalous edge. (c) Prior non-edge with a high VIP score but low posterior edge probability. Using a VIP threshold would identify a false positive edge with a high VIP score, whereas the semi-supervised approach assigns a low posterior edge probability because the VIP score is low relative to its reference distribution of edge VIP scores (Color version of this figure is available at Bioinformatics online.)

In our applications, we assume that gene expression is described by (1). Additional mathematical details, details about the posterior edge probability estimation, and analysis of the parameters are available in the Supplementary Material.
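One plausible way to turn the two reference distributions into a posterior edge probability is Bayes' rule with kernel density estimates of the simulated VIP scores. The sketch below is an assumption made for illustration: the actual estimator is specified in the Supplementary Material, and the Gaussian kernel, the bandwidth, and the function names are hypothetical. The argument `prior` stands in for the prior probability assigned to the pair (for example, one value for pairs that are edges in the reference network and another for pairs that are not).

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate built from `samples`."""
    scale = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(value):
        return sum(math.exp(-0.5 * ((value - s) / bandwidth) ** 2)
                   for s in samples) / scale

    return density


def posterior_edge_probability(vip_observed, vip_with_edge, vip_without_edge,
                               prior, bandwidth=0.1):
    """Bayes' rule with densities estimated from simulated VIP scores."""
    f_edge = gaussian_kde(vip_with_edge, bandwidth)(vip_observed)
    f_none = gaussian_kde(vip_without_edge, bandwidth)(vip_observed)
    numerator = prior * f_edge
    denominator = numerator + (1.0 - prior) * f_none
    # Fall back to the prior if the observed score is far from both samples.
    return numerator / denominator if denominator > 0 else prior


# Hypothetical simulated scores for one ordered gene pair.
with_edge = [1.4, 1.5, 1.6]
without_edge = [0.3, 0.4, 0.5]
high = posterior_edge_probability(1.5, with_edge, without_edge, prior=0.5)
low = posterior_edge_probability(0.4, with_edge, without_edge, prior=0.5)
```

Under this sketch, an observed score that is typical of the with-edge simulations yields a probability near 1, a score typical of the without-edge simulations yields a probability near 0, and a score that is equally compatible with both leaves the prior unchanged.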

2.4 Datasets

2.4.1 DREAM

We applied our method to synthetic time course gene expression data from several DREAM challenges. In one of the DREAM2 challenges, 50-node networks were derived from Erdős–Rényi and scale-free topologies with Hill-type kinetics driving gene expression (Stolovitsky et al., 2009). The DREAM3 in silico network challenge contained 10-, 50- and 100-gene subnetworks extracted from Escherichia coli and Saccharomyces cerevisiae gene networks with expression values simulated using GeneNetWeaver (Marbach et al., 2009, 2010; Prill et al., 2010). Finally, in the DREAM4 in silico network challenge, GeneNetWeaver was used to apply various perturbations to 10- and 100-gene networks and the network response was measured before and after the perturbations were removed.

2.4.2 Insufficient sleep

We also applied our method to a time course microarray dataset from a study of the mechanisms and effects of insufficient sleep and circadian rhythm disruption on gene expression, circadian regulation, and other related processes (Möller-Levet et al., 2013). Data were collected by subjecting 26 participants to restricted sleep and control conditions, each followed by an extended period of constant routine during which blood samples were periodically collected for RNA extraction. Since many time points and samples are available, the richness of the dataset makes it amenable to many types of analyses.

To apply our method to the insufficient sleep dataset, we treat KEGG pathways as reference networks. In particular, we only consider subgraphs consisting of genes for which expression data is available. Since there are edges that appear in one pathway and not another between the same pair of genes, we first merge all of the pathways together. With the resulting graph and for each pathway, we take the induced subgraph consisting of genes that are in both the pathway and expression dataset. Finally, since this procedure may leave many singleton nodes in the subgraph, we then take the largest component of the subgraph and use it as the input network.
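The preprocessing described above can be sketched as follows. The function name and input format are hypothetical, and "largest component" is interpreted here as the largest weakly connected component (edge directions ignored), which is an assumption.

```python
def input_network(pathway_genes, merged_edges, measured_genes):
    """Induced subgraph on measured pathway genes, reduced to its largest component.

    merged_edges: set of (regulator, target) pairs pooled over all pathways.
    Returns (nodes, edges) of the network passed to the method.
    """
    keep = set(pathway_genes) & set(measured_genes)
    edges = {(u, v) for (u, v) in merged_edges if u in keep and v in keep}
    # Undirected adjacency, used only to find connected components.
    neighbors = {gene: set() for gene in keep}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, largest = set(), set()
    for start in sorted(keep):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest, {(u, v) for (u, v) in edges if u in largest and v in largest}


# Hypothetical merged graph, pathway membership and measured genes.
merged = {("A", "B"), ("B", "C"), ("C", "D"), ("E", "F"), ("X", "A")}
nodes, edges = input_network(["A", "B", "C", "E", "F", "G"], merged,
                             ["A", "B", "C", "D", "E", "G"])
```

In this toy example, genes E and G become singletons once unmeasured genes are dropped, so only the connected A, B, C chain is retained.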

3 Results

3.1 DREAM

3.1.1 Network/edge recovery with synthetic data

We first evaluate the method by applying it directly to the DREAM datasets without any modifications to the networks. An area under the curve (AUC) can be computed by sorting the posterior edge probabilities and calculating the true- and false-positive rates as a function of the posterior probability threshold for including an edge. When using the original network as the input and reference for the AUC computation, the AUC can be interpreted as a network recovery rate, which will be close to 1 if the method is accurate and there are very few novel or anomalous edges in the network. Since the DREAM datasets are synthetic and the networks are completely known, the method should ideally assign high and low edge probabilities to the edges and non-edges, respectively, in order to return AUC values that are close to 1. If the input network is known to be incomplete, high AUC values are still desirable as an indicator that the regulatory connections that we know about are indeed correct. Furthermore, when a majority of the edges (non-edges) are assigned high (low) posterior probabilities, the high (low) edge probabilities that are assigned to non-edges (edges) in the input network can be treated as putative evidence for a corresponding novel (anomalous) edge in the network, since the probabilities will rank similarly with those of the true edges (non-edges).
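The AUC described above can be computed without tracing the full curve, because the area under the ROC curve equals the probability that a randomly chosen positive is ranked above a randomly chosen negative (with ties counted as one half). The sketch below uses that equivalence; the function names and the dictionary layout of the posterior probabilities are illustrative assumptions.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))


def network_recovery_auc(posterior, reference_edges):
    """AUC with reference edges as positives and all other pairs as negatives.

    posterior: dict mapping (regulator, target) -> posterior edge probability.
    """
    edges = [prob for pair, prob in posterior.items() if pair in reference_edges]
    non_edges = [prob for pair, prob in posterior.items()
                 if pair not in reference_edges]
    return auc(edges, non_edges)


# Hypothetical posterior probabilities for four ordered pairs.
posterior = {("A", "B"): 0.9, ("B", "C"): 0.4, ("A", "C"): 0.5, ("C", "A"): 0.1}
recovery = network_recovery_auc(posterior, {("A", "B"), ("B", "C")})
```

In this toy example one of the four edge/non-edge comparisons is lost, because the non-edge ("A", "C") outranks the edge ("B", "C").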

In Figure 3, the DREAM dataset AUCs are shown for prior probabilities $(p,q) \in \{(0.5,0.5), (0.75,0.25), (0.75,0.5)\}$. Even when uninformative priors p=q=0.5 are used, except in a few cases, the method returns relatively high AUC values. When p increases or q decreases from those values, the AUCs also increase; as p increases, the posterior edge probabilities of the prior edges will increase, and when q decreases, the posterior edge probabilities of the prior non-edges will decrease, both of which contribute to a higher AUC. Although this suggests using large values of p and small values of q, some care must be taken in choosing these parameters, especially when one of the goals of reconstructing GRNs is discovering novel regulatory relationships between genes. For a potential novel edge to be proposed, a prior non-edge should have a posterior edge probability that is higher than those of the true non-edges and comparable to those of the edges in the graph. If q is set close to zero or extremely low relative to p, then the posterior edge probabilities of all prior non-edges will be small compared with those of the prior edges. In this case, the AUC will be 1, but no potential novel edges will be proposed.

Fig. 3.


AUCs for the posterior PLS-VIP-based method when applied to the DREAM datasets. Pathways are arranged along the x-axis by their number of nodes; 10 networks contain 10 nodes, 7 networks contain 50 nodes and 10 networks contain 100 nodes

3.1.2 Detection of novel and anomalous edges

Although the previous analysis explains how posterior probabilities are generally assigned to edges and non-edges as a function of the prior parameters, how this affects the recovery of the input network, and why non-edges (edges) with high (low) posterior edge probabilities are possible novel (anomalous) connections, it says little about the method’s ability to detect novel edges in a network. To evaluate its link prediction ability, we can remove some of the edges from the DREAM networks, run the method with the modified networks, and consider the highest ranking non-edges. For a numerical measure of this ability, an AUC can be calculated by ranking the posterior edge probabilities of the non-edges and removed edges. In this case, the AUC can be interpreted as the probability that a randomly chosen missing edge has a higher posterior edge probability than that of a randomly chosen non-edge. The degree to which the AUC exceeds 0.5 is a measure of how much better the algorithm does than chance, so values closer to 1 are indicative of better novel edge detection performance (a graphical depiction of this procedure is given in Section 2, Supplementary Fig. SA1).
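The evaluation protocol can be sketched as follows. The step that runs the method on the modified network is not shown; `posterior` stands for whatever posterior edge probabilities that run would return, and all function names are hypothetical.

```python
import random


def hide_edges(edges, k, rng):
    """Randomly remove k edges; return (kept_edges, removed_edges)."""
    ordered = sorted(edges)  # fixed order so a seeded rng is reproducible
    removed = set(rng.sample(ordered, k))
    return set(ordered) - removed, removed


def novel_edge_auc(posterior, removed, true_edges):
    """AUC over removed edges (positives) versus true non-edges (negatives).

    posterior: dict mapping (regulator, target) -> posterior edge probability
               obtained by running the method on the network with edges removed.
    """
    hidden = [posterior[pair] for pair in removed]
    absent = [prob for pair, prob in posterior.items() if pair not in true_edges]
    # Count pairwise wins, with ties contributing one half.
    wins = sum((h > a) + 0.5 * (h == a) for h in hidden for a in absent)
    return wins / (len(hidden) * len(absent))


rng = random.Random(1)
true_edges = {("A", "B"), ("B", "C"), ("C", "A")}
kept, removed = hide_edges(true_edges, k=1, rng=rng)
```

The anomaly detection evaluation mirrors this sketch: edges are added rather than removed, and the posterior probabilities of the original edges are ranked against those of the added edges.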

In Figure 4, we randomly removed 4, 8, 16 and 32 edges over 20 iterations for each of the 50-node DREAM networks. Except for the first two networks, the method generally does better than chance at identifying the missing edges of the networks. In addition, even as the number of removed edges increases, the mean of the AUCs appears to be fairly stable. These observations suggest the method’s potential to detect novel edges, and when many regulatory links are missing, the method is still able to use the expression data along with the dynamics on the rest of the partially known network to identify many of the missing edges that contribute to inconsistencies between the expression data and the prior network.

Fig. 4.


AUCs for the posterior PLS-VIP-based method when used for novel edge detection. 4, 8, 16 and 32 edges are randomly removed from the 50-node DREAM networks for 20 iterations, and AUCs are calculated using the non-edges of the modified networks (Color version of this figure is available at Bioinformatics online.)

We can perform a similar analysis to evaluate the method’s ability to detect anomalous edges. To do so, we can add edges to the networks, run the method with the modified networks, and look at the lowest ranking edges. We can calculate an AUC by ranking the posterior probabilities of the edges and added edges to summarize the anomaly detection performance. In this case, the AUC can be interpreted as the probability that a randomly chosen edge has a higher posterior edge probability than that of a randomly chosen false edge, and higher AUCs correspond to better performance. In Figure 5, we randomly added 4, 8, 16 and 32 edges over 20 iterations for each of the 50-node DREAM networks. As with the novel edge detection case, the method does not perform well on the first two networks, but the rest of the networks exhibit ranges of AUCs that are above 0.5. Based on this performance, the method can potentially be used for anomaly detection.

Fig. 5.

AUCs for the posterior PLS-VIP-based method when used for anomaly detection. 4, 8, 16 and 32 edges are randomly added to the 50-node DREAM networks for 20 iterations, and AUCs are calculated using the edges of the modified networks (Color version of this figure is available at Bioinformatics online.)

3.2 Insufficient sleep

3.2.1. Network/edge recovery

We now consider an application to the insufficient sleep dataset with KEGG pathways as input networks. Since these networks are known to be incomplete, the primary interest should be discovering novel edges, represented by prior non-edges with high posterior edge probabilities. In addition, while the interactions that comprise these pathways have been expertly curated, it is possible for some studies to contain false positive results, while other study results may be irreproducible. These correspond to anomalies, represented by prior edges with low posterior probabilities.

As with the DREAM networks, we compute the AUCs with the KEGG pathways as reference networks using the same prior probability parameters in Figure 6. Compared to the DREAM network AUCs at the same prior parameters, the AUCs in this case are lower, which may be a result of the networks being incomplete and the expression data being real. With p = q = 0.5, the method appears to perform only slightly, though significantly, better than chance at recovering the pathway edges, so it is likely to suggest many potential novel edges, most of which may be false discoveries. To reduce the number of false discoveries, a combination of higher p and lower q should be used. The former results in higher posterior edge probabilities being assigned to many of the prior edges, which will reflect higher confidence in the studies used to form the pathways but yield few to no candidate anomalous edges. The latter will assign lower posterior edge probabilities to many of the prior non-edges, leaving very few non-edges with high posterior edge probabilities that can be proposed as novel edges and therefore reducing the number of false discoveries.
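The qualitative effect of p and q can be illustrated with a generic Bayes' rule update. This sketch assumes that the posterior takes the standard two-hypothesis form, with the prior edge probability set to p for prior edges and q for prior non-edges; the function `posterior_edge_probability` and the likelihood values are hypothetical and do not reproduce the estimator defined in the methods.

```python
def posterior_edge_probability(prior, lik_edge, lik_no_edge):
    """Generic Bayes update: P(edge | VIP) from a prior and two likelihoods.

    prior       -- prior edge probability (p for prior edges, q for prior non-edges)
    lik_edge    -- likelihood of the observed VIP score if the edge exists
    lik_no_edge -- likelihood of the observed VIP score if the edge is absent
    """
    numerator = prior * lik_edge
    denominator = numerator + (1.0 - prior) * lik_no_edge
    if denominator == 0.0:
        raise ValueError("observed score has zero likelihood under both hypotheses")
    return numerator / denominator


# Hypothetical likelihoods for one observed VIP score (illustrative values only).
lik_edge, lik_no_edge = 0.3, 0.6

uninformative = posterior_edge_probability(0.5, lik_edge, lik_no_edge)
high_p = posterior_edge_probability(0.9, lik_edge, lik_no_edge)  # prior edge, p = 0.9
low_q = posterior_edge_probability(0.1, lik_edge, lik_no_edge)   # prior non-edge, q = 0.1
print(round(uninformative, 4), round(high_p, 4), round(low_q, 4))
```

For the same data-derived evidence, raising the prior for a prior edge pulls its posterior up, and lowering the prior for a prior non-edge pushes its posterior down, which is the trade-off between anomaly candidates and false novel-edge discoveries described above.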

Fig. 6.

AUCs for the posterior PLS-VIP-based method when applied to the insufficient sleep dataset with KEGG pathways as known prior graphs. Wilcoxon test P-values for H0: μ_AUC ≤ 0.5 and H1: μ_AUC > 0.5 are also shown for each set of prior parameters (Color version of this figure is available at Bioinformatics online.)
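A one-sided test of whether a set of AUCs lies above 0.5 can be sketched in plain Python. The sketch assumes that the Wilcoxon test referred to here is the one-sample signed-rank test and computes an exact P-value; the function `wilcoxon_greater` and the AUC values are illustrative and are not taken from the paper.

```python
def wilcoxon_greater(values, mu0=0.5):
    """Exact one-sided Wilcoxon signed-rank test of H1: location > mu0.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive midranks.
    """
    diffs = [v - mu0 for v in values if v != mu0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all values equal mu0")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n  # ranks are doubled so that midranks stay integers
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        midrank_doubled = (start + 1) + (end + 1)
        for pos in range(start, end + 1):
            doubled_ranks[order[pos]] = midrank_doubled
        start = end + 1
    w_plus_doubled = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    # Under H0 each rank is positive with probability 1/2: count sign
    # assignments by the positive-rank sum they produce (subset-sum DP).
    total = sum(doubled_ranks)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(counts[w_plus_doubled:]) / 2 ** n
    return w_plus_doubled / 2, p_value


# Hypothetical per-pathway AUCs (made up for illustration only).
aucs = [0.55, 0.62, 0.48, 0.58, 0.66, 0.53, 0.61, 0.57]
w_plus, p_value = wilcoxon_greater(aucs)
print(w_plus, p_value)
```

Only the smallest absolute deviation is negative in this toy sample, so the positive rank sum is 35 out of a possible 36 and the exact one-sided P-value is 2/256.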

3.2.2. Posterior edge probability and VIP score comparison

Since our method transforms VIP scores into posterior edge probabilities, it is useful to see how the computed posterior probabilities compare with the data-derived VIP scores. In Figure 7, the posterior probabilities are plotted against the VIP scores for each ordered pair of genes in the circadian rhythm pathway using p = q = 0.5, colored by prior edge existence. We note that high (low) VIP scores do not necessarily correspond to high (low) edge probabilities. More specifically, many of the prior edges (cyan) have moderate to high edge probabilities, many having VIP scores that are below 1. Similarly, there are many non-edges (red) with relatively low edge probabilities, some having VIP scores that are >1. Therefore, many of the edges and non-edges that would have been misclassified by directly comparing VIP scores are more likely to be classified correctly using the posterior probabilities. We also see that the distribution of VIP scores for the prior edges is slightly more skewed towards lower values than that of the prior non-edges. However, the corresponding posterior probabilities for the prior edges tend to cover moderate to high values, whereas those of the non-edges are bimodally concentrated around 0 and less so around 1. It can also be observed that the posterior edge probabilities for true edges are generally higher than those for non-edges at the same VIP scores, and as with the AUCs, further improvements can be made by adjusting the prior probabilities appropriately. (The multiple curves visible in Figure 7 can be attributed to automorphically equivalent nodes; a detailed explanation may be found in the Supplementary Material.)

Fig. 7.

Posterior edge probability versus VIP score for each pair of genes in the circadian rhythm pathway with p = q = 0.5. Large VIP scores tend to have higher posterior probabilities, but a small VIP score can still result in a high posterior probability and vice versa (Color version of this figure is available at Bioinformatics online.)

3.2.3. Method comparisons

Finally, we compare our approach to iRafNet (Petralia et al., 2015) and PLSNET (Guo et al., 2016). In Figure 8, the AUCs for PLSNET are computed using the method’s default parameters and the AUCs for iRafNet and the posterior PLS-VIP-based approach are computed using uninformative priors. Even with uninformative prior parameters, our approach still produces AUCs that are slightly but significantly above 0.5, while the AUCs of the other methods tend to be smaller. By using an input prior network and simulations based on that network, our approach is able to better recover the true interactions of the network.

Fig. 8.

AUCs obtained by the proposed method (post PLS-VIP), iRafNet and PLSNET for KEGG pathways using the sleep dataset. Methods were applied using default and/or uninformative prior parameters. Wilcoxon test P-values for H0: μ_AUC ≤ 0.5 and H1: μ_AUC > 0.5 are given for each method (Color version of this figure is available at Bioinformatics online.)

4 Discussion

We have presented a semi-supervised approach for GRN reconstruction that can be used to refine partially known GRNs based on time-course gene expression data. In particular, we applied the PLS-VIP method to time course data by assuming that the expression or change in expression of a gene at a time point is dependent on the expression of its regulators at the previous time point. To evaluate whether each VIP score is evidence of a network edge, we developed a simulation framework that incorporates previously known regulatory relationships to model the expected gene expression dynamics, thus establishing reference distributions to which the data-derived VIP scores can be compared. This approach directly relates the network structure to the gene expression dynamics, and the semi-supervised approach of using a prior network enables the method to recover known edges while also discovering novel edges and detecting anomalous connections. The posterior edge probabilities that are estimated for each pair of genes can be used to guide and prioritize further experiments to validate the suggested connections.
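The time-lagged setup summarized above, in which expression (or the change in expression) at one time point is regressed on expression at the previous time point, can be sketched as a pair of matrices. The function `lagged_design`, the list-of-lists layout, and the assumption of equally spaced time points are illustrative; the authors' R code may assemble these inputs differently.

```python
def lagged_design(expression, use_change=False):
    """Build lag-1 predictor and response matrices from a time course.

    expression -- list of time points, each a list of expression values (one per gene)
    use_change -- if True, the response is the change x(t+1) - x(t) rather than x(t+1)
    Returns (X, Y), where row t of X holds expression at time t and row t of Y
    holds the corresponding response at time t+1.
    """
    if len(expression) < 2:
        raise ValueError("need at least two time points")
    previous, following = expression[:-1], expression[1:]
    X = [list(row) for row in previous]
    if use_change:
        Y = [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(previous, following)]
    else:
        Y = [list(row) for row in following]
    return X, Y


# Toy time course: 3 time points, 2 genes (values made up for illustration).
time_course = [[1.0, 2.0], [1.5, 1.0], [2.5, 0.5]]

X_level, Y_level = lagged_design(time_course)
X_change, Y_change = lagged_design(time_course, use_change=True)
print(Y_level, Y_change)
```

Each column of Y would then be regressed on the columns of X (for example by PLS regression) to obtain one VIP score per candidate regulator.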

To be useful for further biological studies, GRN reconstruction methods must be able to accurately identify novel regulatory interactions that can be experimentally verified. We have shown that our semi-supervised approach is able to recover extensive portions of the regulatory dependencies of an input network, as evidenced by the high AUCs corresponding to edge recovery for certain ranges of prior probability values when applied to the DREAM and insufficient sleep datasets. By recovering known edges at a rate better than chance, we can identify novel relationships with higher confidence. More specifically, prior non-edges with high posterior edge probabilities and prior edges with low posterior edge probabilities can be treated as putative evidence for novel and anomalous edges in the network, respectively. By incorporating the putative structure of the rest of the network, the method takes into account the local regulatory relationships between genes as well as the global features of the network and regulatory dynamics when deriving these posterior probabilities. We also showed that our method was capable of novel edge detection by removing true edges from the DREAM networks and attempting to recover those edges. Similarly, by adding false edges and using the method to identify them, we showed that our method can potentially be used for anomaly detection. We also showed that our approach can outperform other related methods.

Additional extensions can be made to our approach to better model the underlying gene regulatory dynamics and potentially improve link prediction performance. For example, the modifications to PLS-VIP for time course data assumed that the expression or change in expression is a linear function of the expression at the previous time point. Although this was a straightforward extension that led to good performance, other temporal modifications can be incorporated to take into account more realistic regulatory dynamics. Since genes are known to regulate the expression of other genes through their products, and the generation of those products and related processes can require different amounts of time, other time course-based methods have included the expression at multiple previous time points as predictors. Other methods have included specific time points by identifying an optimal delay in response for different pairs of genes. Also, when modeling with PLSR, we assumed that if an edge exists between a pair of genes, then the connection between them is always active, which may not be the case for true regulatory dynamics. In addition, when simulating expression data and sampling edge weights for these connections, we assumed that all of the weights were concentrated around one value instead of having separate parameters for different pairs of genes. Lastly, we assumed that the response of a gene to its predictors is locally linear, so that it can be modeled (to a first approximation) by (1) or (2). Other network reconstruction methods have used non-linear functions to model regulatory dynamics, and a similar modification here may make our approach more representative of the underlying biology.

Yet in spite of these simplifying assumptions, our method was able to recover many of the true edges of the input prior networks as well as identify novel and anomalous edges when they were introduced into the DREAM networks. These results suggest its ability to derive posterior networks that accurately reflect gene expression dynamics and can be used to guide and prioritize further experiments and analyses.

Funding

This work was supported by the James S. McDonnell Foundation [220020394] and the National Science Foundation [DMS-1547394].

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References

  1. Bar-Joseph Z. (2004) Analyzing time series gene expression data. Bioinformatics, 20, 2493–2503.
  2. Bonneau R. et al. (2006) The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol., 7, R36.
  3. Butte A.J., Kohane I.S. (1999) Unsupervised knowledge discovery in medical databases using relevance networks. In: Proceedings of the AMIA Symposium, pp. 711–715.
  4. Butte A.J., Kohane I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 5, 415–426.
  5. Chen K.C. et al. (2004) Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell, 15, 3841–3862.
  6. Chong I.G., Jun C.H. (2005) Performance of some variable selection methods when multicollinearity is present. Chemo. Intel. Lab. Syst., 78, 103–112.
  7. Ciaccio M.F. et al. (2015) The DIONESUS algorithm provides scalable and accurate reconstruction of dynamic phosphoproteomic networks to reveal new drug targets. Integr. Biol., 7, 776–791.
  8. Clauset A. et al. (2008) Hierarchical structure and the prediction of missing links in networks. Nature, 453, 98–101.
  9. de Jong H. (2002) Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol., 9, 67–103.
  10. Dewey T.G., Galas D.J. (2001) Dynamic models of gene expression and classification. Funct. Integr. Genomics, 1, 269–278.
  11. Elowitz M.B., Leibler S. (2000) A synthetic oscillatory network of transcriptional regulators. Nature, 403, 335–338.
  12. Faith J.J. et al. (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol., 5, e8.
  13. Friedman N. (2004) Inferring cellular networks using probabilistic graphical models. Science, 303, 799–805.
  14. Gardner T.S. et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105.
  15. Guimerà R., Sales-Pardo M. (2009) Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA, 106, 22073–22078.
  16. Guo S. et al. (2016) Gene regulatory network inference using PLS-based methods. BMC Bioinformatics, 17, 545.
  17. Haury A.C. et al. (2012) TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst. Biol., 6, 145.
  18. Höskuldsson A. (1988) PLS regression methods. J. Chem., 2, 211–228.
  19. Huynh-Thu V.A. et al. (2010) Inferring regulatory networks from expression data using tree-based methods. PLoS One, 5, e12776.
  20. Iyer A.S. et al. (2017) Computational methods to dissect gene regulatory networks in cancer. Curr. Opin. Syst. Biol., 2, 115.
  21. Kanehisa M., Goto S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30.
  22. Kanehisa M. et al. (2014) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res., 42, D199–D205.
  23. Karlebach G., Shamir R. (2008) Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol., 9, 770–780.
  24. Kauffman S. (1969) Homeostasis and differentiation in random genetic control networks. Nature, 224, 177–178.
  25. Lü L., Zhou T. (2011) Link prediction in complex networks: a survey. Phys. A Stat. Mech Appl., 390, 1150–1170.
  26. MacNeil L.T., Walhout A.J.M. (2011) Gene regulatory networks and the role of robustness and stochasticity in the control of gene expression. Genome Res., 21, 645–657.
  27. Marbach D. et al. (2009) Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J. Comput. Biol., 16, 229–239.
  28. Marbach D. et al. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. USA, 107, 6286–6291.
  29. Margolin A.A. et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, S7.
  30. Möller-Levet C.S. et al. (2013) Effects of insufficient sleep on circadian rhythmicity and expression amplitude of the human blood transcriptome. Proc. Natl. Acad. Sci. USA, 110, E1132–E1141.
  31. Perrin B. et al. (2003) Gene networks inference using dynamic Bayesian networks. Bioinformatics, 19, ii138–ii148.
  32. Petralia F. et al. (2015) Integrative random forest for gene regulatory network inference. Bioinformatics, 31, i197–i205.
  33. Petri T. et al. (2015) Addressing false discoveries in network inference. Bioinformatics, 31, 2836–2843.
  34. Pihur V. et al. (2008) Reconstruction of genetic association networks from microarray data: a partial least squares approach. Bioinformatics, 24, 561–568.
  35. Prill R.J. et al. (2010) Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One, 5, e9202.
  36. Riccadonna S. et al. (2016) DTW-MIC coexpression networks from time-course data. PLoS One, 11, e0152648.
  37. Rice J.J. et al. (2005) Reconstructing biological networks using conditional correlation analysis. Bioinformatics, 21, 765–773.
  38. Stokić D. et al. (2009) A fast and efficient gene-network reconstruction method from multiple over-expression experiments. BMC Bioinformatics, 10, 1–11.
  39. Stolovitzky G. et al. (2009) Lessons from the DREAM2 challenges. Ann. N. Y. Acad. Sci., 1158, 159–195.
  40. Tam G.H.F. et al. (2013) Gene regulatory network discovery using pairwise Granger causality. IET Syst. Biol., 7, 195–204.
  41. Thompson D. et al. (2015) Comparative analysis of gene regulatory networks: From network reconstruction to evolution. Annu. Rev. Cell Dev. Biol., 31, 399–428.
  42. van Someren E. et al. (2006) Regularization and noise injection for improving genetic network models. In: Zhang W., Shmulevich I. (eds) Computational and Statistical Approaches to Genomics. Springer, USA, pp. 279–295.
  43. Wang Y.X.R., Huang H. (2014) Review on statistical methods for gene network reconstruction using expression data. J. Theor. Biol., 362, 53–61.
  44. Weaver D.C. et al. (1999) Modeling regulatory networks with weight matrices. Pac. Symp. Biocomput., 4, 112–123.
  45. Wold S. et al. (2001) PLS-regression: a basic tool of chemometrics. Chem. Intell. Lab. Syst., 58, 109–130.
  46. Zoppoli P. et al. (2010) TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics, 11, 154.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
