PLOS Computational Biology. 2021 Jan 29;17(1):e1008223. doi: 10.1371/journal.pcbi.1008223

Causal network inference from gene transcriptional time-series response to glucocorticoids

Jonathan Lu 1,#, Bianca Dumitrascu 2,#, Ian C McDowell 3, Brian Jo 2, Alejandro Barrera 4,5, Linda K Hong 4, Sarah M Leichter 4, Timothy E Reddy 6,*, Barbara E Engelhardt 1,7,*
Editor: Christina S Leslie 8
PMCID: PMC7875426  PMID: 33513136

Abstract

Gene regulatory network inference is essential to uncover complex relationships among gene pathways and inform downstream experiments, ultimately enabling regulatory network re-engineering. Network inference from transcriptional time-series data requires accurate, interpretable, and efficient determination of causal relationships among thousands of genes. Here, we develop Bootstrap Elastic net regression from Time Series (BETS), a statistical framework based on Granger causality for the recovery of a directed gene network from transcriptional time-series data. BETS uses elastic net regression and stability selection from bootstrapped samples to infer causal relationships among genes. BETS is highly parallelized, enabling efficient analysis of large transcriptional data sets. We show competitive accuracy on a community benchmark, the DREAM4 100-gene network inference challenge, where BETS is one of the fastest among methods of similar performance and additionally infers whether causal effects are activating or inhibitory. We apply BETS to transcriptional time-series data of differentially-expressed genes from A549 cells exposed to glucocorticoids over a period of 12 hours. We identify a network of 2768 genes and 31,945 directed edges (FDR ≤ 0.2). We validate inferred causal network edges using two external data sources: Overexpression experiments on the same glucocorticoid system, and genetic variants associated with inferred edges in primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project. BETS is available as an open source software package at https://github.com/lujonathanh/BETS.

Author summary

We can better understand human health and disease by studying the state of cells and how environmental dysregulation affects cell state. Cellular assays, when collected across time, can show us how genes in cells respond to stimuli. These time-series assays provide an opportunity to identify causal relationships among thousands of genes without performing hundreds of thousands of experiments. However, inferring causal relationships from these time-series data needs to be fast, robust, and accurate. We present a method, BETS, that infers causal gene networks from gene expression time series. BETS runs quickly because it is parallelized, allowing even data sets with thousands of genes to be analyzed. We demonstrate the performance of BETS compared to 22 other state-of-the-art inference methods on benchmark data. We then use BETS to build causal networks from gene expression responses to the widely-prescribed drug dexamethasone. We replicate the estimated causal relationships using gene expression data from the Genotype-Tissue Expression (GTEx) project and from additional experiments with dexamethasone. We release our software so that BETS can be used to accurately and effectively infer causal relationships from gene expression time-series assays.


This is a PLOS Computational Biology Methods paper.

1 Introduction

The recent availability of gene expression measurements over time has enabled the search for interpretable statistical models of gene regulatory dynamics [1]. These time-series data present a unique opportunity to use the coordinated transcriptional response to environmental exposure to infer causal relationships between genes. However, there are several challenges to overcome in the analysis of time-series transcriptomic data. These data are generally high-dimensional: the number of quantified gene transcripts—approximately 20,000 in human samples—often dramatically exceeds the number of available time points and samples. Many classical statistical assumptions fail to hold in this high-dimensional regime [2, 3]. Moreover, the large number of gene transcripts poses a computational burden, as the number of possible edges in a gene network grows quadratically. Also, a transcriptional time series often has a small number of time points, and those time points are often not uniformly spaced; furthermore, because transcriptional time-series data often quantify transcription post exposure, the time series is not stationary, and genes respond to the exposure and return to baseline at different rates [4, 5].

In this work, we develop an approach that uses gene transcription time series following glucocorticoid (GC) exposure to build a directed gene network [6]. GCs play an essential role in regulating stress response, and are widely used as anti-inflammatory and immunosuppressive medications [6, 7]. Dexamethasone and other GCs have recently been recommended by the U.K. Government, U.S. National Institutes of Health, World Health Organization, and Infectious Disease Society of America for treatment of hospitalized COVID-19 patients with severe disease who are on mechanical ventilation or extracorporeal membrane oxygenation [8–11]. Despite clinical benefits, prolonged exposure to GCs has been linked to increased risk of type 2 diabetes mellitus (T2DM) [12] and obesity [13]. Understanding the immune, metabolic, and transcriptional effects of GCs may enable the development of improved anti-inflammatory treatments without metabolic side effects. A recent study assayed A549 lung cells over 12 hours to characterize the effect of GCs on cell state [6]. Here, we develop a method to accurately, interpretably, and efficiently infer a directed gene network using the study’s transcription time-series data. We focus our network analysis on immune-related genes, metabolism-related genes, and transcription factors (TFs) to study the inferred coordinated response of these systems to GCs.

Our method, Bootstrap Elastic net inference from Time Series (BETS), uses vector autoregression with elastic net regularization to infer directed edges between genes. Stability selection, which assesses the robustness of an edge to perturbations in the data, leads to improvements over baseline vector autoregression methods in this high-dimensional context [3]. Furthermore, BETS is biologically interpretable because estimated coefficients provide the direction (sign) and effect size of the causal relationship between a pair of genes. Finally, BETS’s parallelization enables efficient inference of networks with millions of possible edges in a computationally tractable way.

We use the causal network inferred by BETS on the GC time-series data to study the relationships between TFs, immune genes, and metabolic genes. We validate our network using two approaches: Ten measurements of the same GC system with an overexpressed TF, and an expression quantitative trait loci (eQTL) study in human primary lung tissue [14]. Although our framework is motivated by transcriptional response to GC exposure, our approaches are general, and BETS is able to infer directed networks from arbitrary high-dimensional time-series data.

1.1 Related work

Several methods have been developed to estimate directed gene networks from transcription time-series data (S1 Fig) [15–23]. These methods estimate directed networks in which the directed edges between nodes, which represent genes, indicate a cause-effect relationship between genes. In other words, perturbing expression of the causal gene would lead to changes in expression of the effect gene [24]. We briefly overview these methods; for detailed discussion, see the Supplemental Information. Here, we take g′ to be the causal gene and g to be the effect gene, and we quantify support for a causal edge g′ → g in the time-series data.

Mutual information (MI) methods assess the MI between the expression of g′ at the previous time point and the expression of g at the current time point (S1(A) Fig) [25–30]. A causal edge g′ → g is included in the network if the MI of the two genes across time exceeds a threshold.

Granger causality methods determine if including the expression of g′ at the previous time point improves our ability to predict the expression of g at the current time point beyond using the expression of g at the previous time point [31]. A common way to implement Granger causality is through a vector autoregression (VAR) model, which assumes a linear relationship between all genes’ expression at the previous time point and the expression of g at the current time point. A causal edge g′ → g is included in the network when g′ has a statistically significant coefficient in the VAR.

Ordinary differential equations (ODEs) fit the derivative of the expression of g as a function of all genes’ expression at a single time point (S1(C) Fig) [15, 32, 33]. ODE methods typically assume linearity, as small sample sizes make it challenging to infer the parameters of nonlinear functions. A causal edge g′ → g is included when g′ has a statistically significant coefficient in the ODE.

Decision trees (DTs) are a type of nonparametric function based on partitioning the data [34, 35]. DT methods fall either under VAR or ODE; either the DTs fit the expression of g at the current time as a function of all genes’ expression at the previous time point (VAR), or they fit the derivative of the expression of g as a function of all genes’ expression at a single time point (ODE) (S1(D) Fig) [36, 37]. A causal edge g′ → g is included in the network when an importance score for g′ exceeds some threshold, where importance scores are typically the reduction in variance of g when g′ is included as a predictor.

Dynamic Bayesian networks (DBNs) search the space of possible directed acyclic graphs between previous and current expression levels to identify the network structure with the highest posterior probability of each edge given the data (S1(E) Fig) [38–42]. DBNs typically assume a linear relationship between previous and current expression. A causal edge g′ → g is included in the network when its marginal posterior probability of existence exceeds some threshold.

A Gaussian process (GP) is a distribution over continuous, nonlinear functions. GPs are often used in the context of nonlinear DBNs, where GP regression is used to model a nonlinear relationship between previous expression and current expression (S1(F) Fig) [43, 44]. A causal edge g′ → g is included in the network when its posterior probability of existence exceeds some threshold.

Although these approaches produce directed networks that have the flavor of Bayesian networks, none of them except DBNs constrains the graph to be acyclic, so the resulting networks do not have the same statistical semantics as Bayesian networks.

2 Results

First, we briefly describe the approach that BETS uses to infer a directed gene network. Next, we compare results from BETS to those from 22 other methods on the 100-gene time-series data from the DREAM4 Network Inference Challenge [45]. Then, we describe the network estimated from the GC transcription time-series data. Finally, we validate the inferred network using two different frameworks: Overexpression experiments on the same system, and genetic variants associated with inferred edges in human primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project [14].

2.1 BETS: A vector autoregressive approach to causal inference of gene regulatory networks

Directed networks represent causal relationships among diverse interacting variables in complex systems. We developed a robust, scalable approach based on ideas from Granger causality to construct these directed networks from short, high-dimensional time-series observations of gene expression levels.

Let G be the set of all p = |G| genes in the data set and g ∈ G be a gene. Let ¬g be G with g removed. Let t be a single time point, ranging over {1, 2, …, T}. Let $X_t^g$ be the expression of gene g at time t. Let L be the time lag, or the number of previous time point observations; so L = 2 means that we use two previous time points, t − 1 and t − 2, to predict expression at time t. These types of autoregressive models work best with similarly-spaced time points, as the data sets in this paper approximate, and assume stationarity, or the same causal effects across each time gap.

Definition 2.1 (Granger causality). For lag L, a gene g′ is said to Granger-cause another gene g if using $X_{t-1}^{g'}, \ldots, X_{t-L}^{g'}$, the expression values of g′ at times t − 1 to t − L, improves prediction of $X_t^g$, the expression value of g at time t, beyond predicting $X_t^g$ using $X_{t-1}^{g}, \ldots, X_{t-L}^{g}$ alone.
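As a minimal sketch of Definition 2.1 for lag L = 1, the following pure-Python example compares the residual sum of squares (RSS) of a model that uses only the past of g with one that also uses the past of g′. The synthetic series, the coefficients 0.5 and 0.8, and the helper names are illustrative assumptions rather than part of BETS, and both models omit an intercept for brevity.

```python
def rss_restricted(y, own):
    """Least-squares fit y ~ a * own (no intercept); return residual sum of squares."""
    a = sum(o * v for o, v in zip(own, y)) / sum(o * o for o in own)
    return sum((v - a * o) ** 2 for o, v in zip(own, y))


def rss_full(y, own, other):
    """Least-squares fit y ~ a * own + b * other via the 2x2 normal equations."""
    s_oo = sum(o * o for o in own)
    s_pp = sum(p * p for p in other)
    s_op = sum(o * p for o, p in zip(own, other))
    s_oy = sum(o * v for o, v in zip(own, y))
    s_py = sum(p * v for p, v in zip(other, y))
    det = s_oo * s_pp - s_op * s_op
    a = (s_oy * s_pp - s_py * s_op) / det
    b = (s_py * s_oo - s_oy * s_op) / det
    return sum((v - a * o - b * p) ** 2 for o, p, v in zip(own, other, y))


# Synthetic series (assumption): g is driven by its own past and the past of g'.
cause = [1.0, -1.0, 2.0, 0.0, -2.0, 1.0, 3.0, -1.0, 0.0, 2.0]
effect = [0.0]
for t in range(1, len(cause)):
    effect.append(0.5 * effect[t - 1] + 0.8 * cause[t - 1])

y = effect[1:]            # X_t^g
own_past = effect[:-1]    # X_{t-1}^g
cause_past = cause[:-1]   # X_{t-1}^{g'}

restricted = rss_restricted(y, own_past)
full = rss_full(y, own_past, cause_past)
print(f"RSS without g': {restricted:.3f}, RSS with g': {full:.3e}")
```

A large drop in RSS when the history of g′ is added is the intuition behind Granger causality; a real analysis would still need a significance criterion rather than a raw comparison.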

To test for Granger causality from g′ to g, we first preprocessed the gene expression time-series data (Methods). For every potential effect gene g, we fit all other genes g′ ∈ ¬g simultaneously (Eq 1), echoing ideas from the graphical lasso for undirected network inference [46]. Intuitively, this adapts the idea of Granger causality to conditional Granger causality, where we consider how gene g′ Granger-causes g conditioning on the effects of all other genes. This approach uses the regression:

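$$X_t^g = \sum_{\ell=1}^{L} \alpha_\ell^{g} X_{t-\ell}^{g} + \sum_{g' \in \neg g} \sum_{\ell=1}^{L} \beta_\ell^{g',g} X_{t-\ell}^{g'} + \epsilon_t, \tag{1}$$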

where $\epsilon_t \sim N(0, 1)$. For BETS, we set L = 2. To test for an edge, if there is statistical support for $\beta_\ell^{g',g} \neq 0$, then we say g′ conditionally Granger-causes g at lag L. We build the directed network by including a directed edge to g from every gene g′ that has been inferred to conditionally Granger-cause g.
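The regression in Eq 1 needs one row per usable time point, holding the lagged values of every gene. The sketch below shows one way to assemble that design matrix for a single replicate; the function name, the dictionary layout, and the toy values are assumptions made for illustration and are not taken from the BETS code base.

```python
def lagged_design(series_by_gene, target, lag=2):
    """Build (genes, rows, responses) for one effect gene from one replicate.

    series_by_gene maps gene name -> list of expression values over time.
    Each row holds, for every gene, its values at t-1, ..., t-lag.
    """
    genes = [target] + sorted(g for g in series_by_gene if g != target)
    n_time = len(series_by_gene[target])
    rows, responses = [], []
    for t in range(lag, n_time):
        row = []
        for gene in genes:
            row.extend(series_by_gene[gene][t - k] for k in range(1, lag + 1))
        rows.append(row)
        responses.append(series_by_gene[target][t])
    return genes, rows, responses


# Toy replicate (assumption): 3 genes measured at 12 time points.
replicate = {
    "g": [float(t) for t in range(12)],
    "g1": [float(t * t) for t in range(12)],
    "g2": [float(-t) for t in range(12)],
}
genes, X, y = lagged_design(replicate, target="g", lag=2)
print(len(X), len(X[0]))  # rows, columns (genes x lags)
```

With lag L = 2, the first two time points have no complete history, so a 12-point series contributes 10 rows; rows from several replicates can then be stacked before fitting.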

Robustly building this network is difficult due to the high dimensionality of the problem: The number of genes that could Granger-cause a given g far exceeds the available time points and technical replicates. To address this challenge, BETS regularizes the VAR model parameters using an elastic net penalty (Methods, Fig 1A). Elastic net regression encourages sparsity and performs automatic variable selection on the genes being tested for causal influence [47]. The elastic net penalty, unlike the lasso penalty [48], is able to select groups of correlated variables and allows the number of selected variables to be greater than the number of samples. This is important for gene expression assays where gene expression levels are often well correlated, and there are far more genes than samples [2].
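To illustrate the grouping behavior described above, here is a small coordinate-descent sketch of elastic net regression in pure Python. It assumes the common parametrization (1/2n)·||y − Xb||² + λ(α·||b||₁ + (1 − α)/2·||b||²); the penalty form, the function names, and the toy data are assumptions, and the BETS implementation may differ.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove every predictor's contribution except j's.
            rho = 0.0
            z = 0.0
            for i in range(n):
                fitted_others = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fitted_others)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            coef[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
    return coef


# Toy data (assumption): columns 0 and 1 are identical, column 2 is irrelevant.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
noise = [1.0, -1.0, 0.0, -1.0, 1.0]
X = [[xi, xi, ni] for xi, ni in zip(x, noise)]
y = [2.0 * xi for xi in x]

lasso = elastic_net(X, y, lam=1.0, alpha=1.0)
enet = elastic_net(X, y, lam=1.0, alpha=0.5)
print("lasso:", [round(c, 3) for c in lasso])
print("elastic net:", [round(c, 3) for c in enet])
```

On this toy problem the pure lasso setting (α = 1) puts all weight on the first of the two duplicated predictors, whereas the mixed penalty shares the weight equally between them and still sets the irrelevant predictor to zero.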

Fig 1. BETS Algorithm.

A) Model fit. The VAR model is fit on both the original and a permuted data set (blue arrows indicate shuffling each gene’s expression independently across time). Based on the null distribution of coefficients, a threshold is chosen to control the edge FDR at ≤ 0.05. B) Stability selection. From the original data, 1000 bootstrap samples are generated. For each sample, a network is inferred as in A. Each edge’s selection frequency across the bootstrapped networks is computed. C) Statistical significance. For both the original and permuted data, a selection frequency distribution is generated for stability selection as in B. Edges are thresholded to control the stability FDR at ≤ 0.2. See S1 Fig for an overview of network inference methods.

In BETS, we fit the same VAR model to a data set in which causal genes have their expression permuted over time to generate a null distribution of edge coefficients. The coefficients are thresholded to produce a causal network with each edge at edge false discovery rate (FDR) ≤ 0.05 (Fig 1A). We then applied this network inference procedure to multiple (here, 1000) bootstrapped samples of the original data set (Fig 1B). Each edge has a selection frequency, or the frequency that the edge appears in networks inferred from the bootstrapped samples. Inspired by stability selection, this approach assesses if network edges are robust to perturbations of the data [3]. Finally, we ran this overall procedure on a permuted version of the original data set to obtain a null distribution of selection frequencies (Fig 1C). The selection frequency threshold for including each edge is chosen to control the stability FDR ≤0.2. As a baseline, we compare BETS against Enet, which runs elastic net regression without stability selection to produce a causal network with each edge at edge FDR ≤ 0.05 (Fig 1A).
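The two thresholding steps can be sketched in a few lines. The example below computes edge selection frequencies across bootstrap networks and then picks a cutoff from a permutation-based null. The FDR estimate used here (number of null values at or above the cutoff divided by the number of observed values at or above it) is an assumption for illustration, as are all names and toy numbers; the estimator used by BETS is described in the Methods.

```python
from collections import Counter


def selection_frequencies(bootstrap_networks):
    """Fraction of bootstrap networks in which each directed edge appears."""
    counts = Counter(edge for network in bootstrap_networks for edge in set(network))
    n = len(bootstrap_networks)
    return {edge: c / n for edge, c in counts.items()}


def fdr_threshold(observed, null, target_fdr):
    """Smallest cutoff whose estimated FDR (null hits / observed hits) <= target."""
    for cutoff in sorted(set(observed)):
        n_obs = sum(v >= cutoff for v in observed)
        n_null = sum(v >= cutoff for v in null)
        if n_null / n_obs <= target_fdr:
            return cutoff
    return None  # no cutoff meets the target


# Toy bootstrap networks (assumption): each is a set of (cause, effect) edges.
networks = [
    {("A", "B"), ("A", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("A", "C")},
]
freq = selection_frequencies(networks)
print(freq[("A", "B")], freq[("A", "C")], freq[("C", "B")])

# Toy selection frequencies from original and permuted data (assumption).
observed = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
null = [0.15, 0.1, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01]
print(fdr_threshold(observed, null, target_fdr=0.2))
```

The same thresholding helper could be applied to absolute coefficient values for the edge FDR step or to selection frequencies for the stability FDR step.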

2.2 Leading performance on DREAM Network Inference Challenge

We evaluated BETS against other directed network inference methods. We used the DREAM4 Network Inference Challenge [45], a community benchmark for directed network inference using gene time-series data. This benchmark consisted of five data sets, each with ten time-series measurements for 100 genes across 21 time points [45]. Evaluation was previously done by looking at the average of the area under the precision recall curve (AUPR) or the area under the receiver operating characteristic (AUROC) over the five data sets [37, 45]. Any method that provides a ranking of possible network edges could be evaluated in this framework.
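For readers unfamiliar with these metrics, the following sketch computes AUROC and a step-wise AUPR (average precision) from a ranked list of candidate edges. The scores and gold-standard labels are toy values, and the official DREAM4 scoring scripts may interpolate the precision-recall curve differently, so this is an approximation of the idea rather than the benchmark's exact procedure.

```python
def auroc(scores, labels):
    """Probability that a random true edge outranks a random false edge."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (ties broken by order)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each recalled true edge
    return total / n_pos


# Toy ranking (assumption): 6 candidate edges, 1 = true edge in the gold standard.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 1]
print(auroc(scores, labels), average_precision(scores, labels))
```

Because true edges are rare in gene networks, AUPR for a random ranking is close to the fraction of true edges, which is why AUPR values on this benchmark are much smaller than AUROC values.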

We tested BETS and Enet against 22 other methods on the DREAM challenge [36, 37, 40, 49–51]. We ran SWING-RF, SWING-Lasso, CSId, Jump3, CLR, MRNET, and ARACNE in-house and found our results consistent with those reported in the literature. All 22 methods reported AUPR, but only 17 reported AUROC.

BETS ranked 7th out of 24 in AUPR with an average AUPR of 0.128 (Fig 2A and S1 Table) and 4th out of 19 in AUROC with an average AUROC of 0.688 (Fig 2B and S2 Table). BETS was the top performer of all VAR methods, and Enet was second best. All 24 methods outperformed random selection of edges, which achieved an average AUPR of 0.002 and average AUROC of 0.50 [49]. We also found that BETS and Enet had similar performance to the DBN methods in AUPR, and outperformed most of them in AUROC. Ranked by the top AUPR of each class of methods, the best performing class was DT, followed by GP, MI, VAR, DBN, and ODE [36, 40, 49]. The VAR method used in BETS produces edge signs (indicating excitatory or inhibitory causal effects) and effect sizes. While other methods based on GPs (e.g., CSId), MI (e.g., tl-CLR) or DTs (e.g., SWING-RF) had marginally better overall network inference, they do not provide insight into the causal relationships because they only output a positive measure of a causal interaction [32, 44, 51].

Fig 2. Algorithm performance on the DREAM community benchmark.

A) AUPR scores from 24 methods, averaged across the five DREAM networks. B) AUROC scores from 19 methods, averaged across the five DREAM networks. Arrows indicate our methods. Stars indicate methods that we ran in-house; results were consistent with reported results. The bars reach one standard deviation from the average as calculated across the five DREAM networks; no bar indicates the standard deviation was not reported. See also S1–S5 Tables.

Next, we compared the speed of BETS and three other top-performing methods: SWING-RF, CSId and Jump3 (S3 Table). SWING-RF was the fastest at 0.11 hours, while BETS took 4.8 hours, CSId took 9.8 hours and Jump3 took 45 hours. Thus, while BETS had a lower AUPR compared to CSId and Jump3, it was substantially faster. BETS had both a lower AUPR and longer runtime than SWING-RF.

BETS improved upon Enet using stability selection. To quantify this improvement, we compared three other models: Elastic net with lag L = 1, ridge regression with lag L = 2, and lasso with lag L = 2 (S4 Table). In each case, the stability selection version outperformed the original version in average AUPR and AUROC. The improvement in average AUPR ranged between 0.016 and 0.03 (+20% to +31%), while the improvement in average AUROC ranged between 0.012 and 0.04 (+1.8% to +6.1%). Thus, our stability selection procedure leads to improved performance for multiple versions of VAR.

We also found that stability selection performance is robust to the number of bootstrap samples (S5 Table). Decreasing the number of bootstrap samples from 1000 to 100 led to minor decreases of −0.004 in AUPR and −0.008 in AUROC, within the standard deviation across the networks. It also resulted in a 10-fold decrease in memory usage and 3-fold decrease in run time, due to a constant-time hyperparameter search. If users face computational constraints, we recommend that they use 100 bootstrap samples for nearly equivalent performance.

Finally, we found that BETS’ performance on DREAM is robust to the choice of lag. We ran BETS with lag L = 1 instead of lag L = 2 (S4 Table). BETS with lag L = 1 achieves an average AUPR of 0.14 (an increase of 0.012 from lag L = 2) and an average AUROC of 0.686 (a decrease of 0.002 from lag L = 2, within the standard deviation across networks). BETS with lag L = 1 still ranks 7th out of 24 in AUPR and 4th out of 19 in AUROC (tied with GP4GRN). Thus, BETS achieves consistently good performance on DREAM for both lags L = 1 and L = 2.

2.3 Application to gene transcription response to glucocorticoids

To infer the causal relationships in the GC response network, we analyzed RNA-seq data collected from human adenocarcinoma and lung model cell line A549, which consists of two data sets. In an original exposure data set, cells were exposed to the synthetic GC dexamethasone (dex) for 0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, and 12 hours [6]. In an unperturbed data set, the cells were first exposed to dex for 12 hours, after which the media was replaced and dex removed, and then measurements were taken at the same intervals 0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, and 12 hours. BETS was fit jointly over the two data sets. In total there were 7 technical replicates (4 from original exposure and 3 from unperturbed). A single VAR was fit on 70 samples: Each of the 7 replicates had 10 samples, because using a lag L = 2 VAR model turns 12 time points into 10 samples.

We applied BETS to the GC-mediated expression responses to infer a causal network (Fig 3A). Edges with selection frequency (i.e., frequency of appearance among bootstrap networks) at least 0.097 were declared significant (FDR ≤ 0.2; Fig 3B). The network contained 2,768 nodes representing distinct genes and 31,945 directed edges (0.4% of possible edges). Of these, 466 genes were causes (i.e., had an outward directed edge) and all 2,768 genes were effects (i.e., had an incoming directed edge). In Granger causality, and dynamical systems more generally, a causal gene g′ is allowed to have incoming directed edges because g′ may be affected by the past value of another gene g″, and g′ may have a causal effect on the later value of gene g. The out-degree distribution was heavy-tailed and skewed right (Fig 3C), while the in-degree distribution was lighter-tailed and more symmetric (Fig 3D). The network’s edge in-degree had a heavier left tail and lighter right tail than a normal distribution (Fig 3E). This suggests that causal genes are relatively rare (only 1/6th of network genes are causes) and a fifth of those only affect a single gene, whereas genes that are effects tend to have multiple causes. The network was inferred efficiently due to parallelization across genes, taking six days in real time and 292 days in CPU time to fit 5.5 million elastic net models.
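The degree summaries and the density figure can be reproduced from an edge list with a few lines of Python. The edge list below is a toy example, and counting possible edges as ordered gene pairs without self-loops is an assumption about how the 0.4% figure was derived.

```python
from collections import Counter

# Toy edge list (assumption): (cause, effect) pairs from an inferred network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")]

out_degree = Counter(cause for cause, _ in edges)
in_degree = Counter(effect for _, effect in edges)
print(dict(out_degree), dict(in_degree))

# Density check with the counts reported in the text.
n_genes, n_edges = 2768, 31945
possible = n_genes * (n_genes - 1)  # ordered pairs, no self-loops
print(f"{100 * n_edges / possible:.2f}% of possible edges")
print(f"{466 / n_genes:.3f} of network genes are causes")
```

Under that assumption the density rounds to 0.4%, and 466 of 2,768 genes is roughly one sixth, matching the figures quoted in the text.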

Fig 3. Causal network inferred from glucocorticoid receptor data.

A) Causal network clustered by gene type. Edge color indicates the type of the causal gene: a red edge indicates an immune causal gene, a blue edge indicates a metabolic causal gene, a purple edge (both) indicates a causal gene that is both immune and metabolic, and a tan edge (other) indicates a causal gene that is neither immune nor metabolic. B) Significance thresholding for edges, based on the null distribution of selection frequencies. C) Out-degree distribution of network. For clarity, several high out-degree values with low frequencies are not shown. D) In-degree distribution of network. E) Quantile-quantile (Q-Q) plot of in-degree distribution against normal quantiles. The in-degrees have a heavier left tail and lighter right tail than the normal distribution. F) Enrichment of gene classes among network causal genes, measured by odds ratio. G) Enrichment of edge classes among network edges, measured by odds ratio. See also S6 Table.

To study the network with respect to the glucocorticoid system, we annotated specific genes as transcription factors (TFs), immune-related, or metabolism-related [52–55]. First, we inspected enrichment of each category among the causal genes (Fig 3F). At FDR ≤ 0.05, we found enrichment for TFs among causes: there were 226 TFs, representing 8.2% of the 2,768 input genes; 62 of these TFs were causes, representing 13% of all causal genes (odds ratio (OR) = 2.0, Fisher’s exact test (FET) adjusted p ≤ 2.9 × 10^−5). Similarly, we found an enrichment among immune-related genes as causes: of 109 immune genes, representing 3.9% of the input genes, 39 were causes, representing 8.4% of all causal genes (OR = 2.9, FET adjusted p ≤ 2.5 × 10^−6). In contrast, there was no enrichment among metabolism-related genes: there were 120 metabolic genes, representing 4.3% of input genes; 19 of these metabolism genes were causes, representing 4.1% of all causes (OR = 0.93, FET adjusted p ≤ 0.66). This suggests that our network is enriched for causal TFs and immune genes.
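The odds ratios can be checked directly from the reported counts. The sketch below reads 226, 109, and 120 as the numbers of TF, immune, and metabolic genes among the 2,768 input genes, and uses the 466 causal genes as the second margin of a 2×2 table; that table layout is an assumption about how the enrichment was set up, and the Fisher's exact test p-values are not recomputed here.

```python
def odds_ratio(class_causal, class_total, causal_total, gene_total):
    """Odds ratio from the 2x2 table (in class?) x (is a causal gene?)."""
    a = class_causal                       # in class, causal
    b = class_total - class_causal         # in class, not causal
    c = causal_total - class_causal        # not in class, causal
    d = gene_total - class_total - c       # not in class, not causal
    return (a * d) / (b * c)


N_GENES, N_CAUSES = 2768, 466
for name, total, causal in [("TF", 226, 62), ("immune", 109, 39), ("metabolic", 120, 19)]:
    print(name, round(odds_ratio(causal, total, N_CAUSES, N_GENES), 2))
```

Under this layout the three ratios come out close to the reported 2.0, 2.9, and 0.93.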

To study the interactions among gene classes inferred by our network, we quantified enrichment for edges between each of the four gene classes: immune, metabolic, TF, and other gene types (Fig 3G; S6 Table). We found enrichment of 10 of the 16 possible edge types (FDR ≤ 0.05). The network was enriched for edges from i) causal TFs to immune genes and other genes; ii) causal immune genes to TFs, immune genes, metabolic genes, and other genes; and iii) causal metabolic genes to TFs, immune genes, metabolic genes, and other genes. This suggests that our network is enriched for a broad range of edge types.

The inferred relationships in BETS are conditionally Granger-causal, meaning that we find that gene g′ Granger-causes g conditioning on the effects of other genes. Thus, an edge g′ → g should be assessed based on g’s residuals after controlling for the effects of other genes and itself, instead of g’s raw values. Consider the edge KRT6A → NKAIN4 (Fig 4): NKAIN4’s raw expression values suggest a negative relationship between KRT6A and NKAIN4 (Fig 4A and 4B). However, after controlling for the effects of all covariates besides KRT6A, a positive relationship between KRT6A and NKAIN4 appears (Fig 4C and 4D). Thus, conditionally Granger-causal relationships g′ → g should assess g only after controlling all other effects on g.

Fig 4. Conditional Granger causality reveals opposite sign of relationship KRT6A → NKAIN4.

A) Time series and B) scatter plot of expression values from KRT6A and NKAIN4. C) Time series and D) scatter plot of expression values from KRT6A and residual expression values from NKAIN4 after controlling for the effects of other covariates in NKAIN4. Each y-axis tick in A and C indicates 0.1 unit-variance standardized ln(TPM), where TPM is Transcripts Per kilobase Million. The grey line marks zero-centered expression. B and D axes are in units of ln(TPM).

Our network identified known biological interactions between genes with immune, metabolic, and TF roles; we highlighted 16 gene pairs with experimentally validated interactions (Fig 5, S2 Fig, and S7 Table). Known interactions were found using the BIOGRID PPI database [56]. SOCS1 and SOCS3 bind IRS2 and promote its degradation, leading to reduced insulin signaling [57, 58]; furthermore, SOCS1 represses IL-4-induced IRS2 signaling [59]. NR4A1 heterodimerizes with RXRA to activate it in order to promote gene expression under vitamin A signaling [60]; NR4A1 also inhibits p300-induced RXRA acetylation [61]. Eleven of the 16 edges had the correct interaction direction; the five that were reversed are TNFAIP3 → IRAK2, SOCS3 → HIVEP1, ATF3 → MDM2, E2F1 → CDH1, and FOS → EGFR. These results suggest that BETS infers biologically meaningful relationships, but transcriptional data, absent other assays on protein abundance and cellular dynamics, are often underpowered to resolve the direction of the edge.

Fig 5. Time-series profiles of experimentally validated causal interactions across gene classes.

For each gene pair, their profiles were from either the original exposure data set or the unperturbed data set. The effects of all covariates beside the causal gene were controlled in the effect gene values to show the conditional Granger-causal relationship. Colors encode gene classes: pink shows immune genes, dark blue/gray shows metabolic genes, teal shows TFs, and brown/tan shows other genes. Darker colors show causal genes and lighter colors show effect genes. The grey line marks zero-centered expression. Each y-axis tick indicates 0.1 unit-variance standardized ln(TPM). See also S7 Table and S2 Fig.

2.4 Validation of inferred network on overexpression data

We asked whether our inferred network edges validated on overexpression versions of the same experimental system, in which each of ten TFs was separately overexpressed over the same 12 hours of observations. Specifically, we assessed the concordance between inferred network edges g′ → g and their coefficient in the overexpression data set under a VAR model (Methods).

We first evaluated how well network edges replicated on individual overexpression data sets. We performed linear regression of a one-hot encoding of the original network’s edge sign (i.e., positive versus no edge or negative; negative versus positive or no edge) as the predictor against the VAR model edge coefficients estimated from each of the overexpression time series as the response (Fig 6A and 6B, Methods). Of the ten data sets, nine showed enriched positive effect sizes among positive edges at FDR ≤ 0.2 (CEBPB, CEBPD, FOSL2, FOXO1, FOXO3, KLF6, KLF9, KLF15, OCT4; Fig 6A). Three data sets showed enriched negative effect sizes among negative edges (OCT4, TFCP2L1, CEBPD) and four showed enriched positive effect sizes among negative edges (CEBPB, FOSL2, KLF9, KLF15; Fig 6B). Taken together, the positive edges inferred by BETS validate on the overexpression data, but the negative edges do not, suggesting that repressive effects may have inconsistent signs or may involve feedback loops.

Fig 6. Validation of inferred network on overexpression data.

A-B) Regression of one-hot encoding of positive (negative for B) edges as the predictor against the VAR model edge coefficient from the overexpression data as the response. A 1 indicates that an edge had a positive (in A) or negative (in B) coefficient in the original inferred network (FDR ≤ 0.2). C) For the 123 causal edges from TFCP2L1, regression of edge sign as the predictor against the VAR model edge coefficient from TFCP2L1 overexpression data as the response.

Next, we checked whether the 123 inferred causal edges from the TF TFCP2L1 validated in the TFCP2L1 overexpression data set (there were only about 10 causal edges from each of the other 9 TFs). We regressed the original network’s edge sign (+1 for positive edges, 0 for no edge, and −1 for negative edge) as the predictor against the overexpression VAR model edge coefficients as the response (Fig 6C). We found a positive relationship between the edge sign and overexpression coefficient (slope 0.17, two-sided t-test p ≤ 5 × 10^−5). This shows that causal edges from TFCP2L1 are enriched for matched effect directions in the TFCP2L1 overexpression data.
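The sign-concordance regression can be sketched as a simple least-squares fit of coefficient on edge sign. The values below are synthetic and chosen only to show the mechanics; they are not the paper's data, and the helper name is hypothetical.

```python
def slope_intercept(x, y):
    """Ordinary least squares fit of y on x with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Synthetic values (assumption): edge signs from the inferred network and
# VAR coefficients re-estimated on an overexpression time series.
edge_sign = [1, 1, 1, 0, 0, 0, -1, -1, -1]
overexpr_coef = [0.30, 0.10, 0.20, 0.05, -0.05, 0.00, -0.10, 0.05, -0.25]
slope, intercept = slope_intercept(edge_sign, overexpr_coef)
print(round(slope, 3), round(intercept, 3))
```

A positive slope means that edges inferred as activating tend to have larger coefficients in the overexpression data than edges inferred as inhibitory; a significance test on the slope would be needed to draw a conclusion.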

This validation may be limited by a misspecification of the linear regression model. As a supplementary analysis, we fit nonstationary GPs to gene trajectories using nsgp; genes that were differentially expressed in the TF overexpression data compared to the original exposure data were listed as potential targets of that TF (Supplemental Information). We found that BETS weakly predicts potential targets of 5 of the 10 over-expression TFs. This approach did not show substantial improvement above random prediction of potential targets on these data (FDR ≤ 0.1).

2.5 Validation of network edges through lung trans-eQTLs

We next validated our network edges on an expression quantitative trait-loci (eQTL) study. A single nucleotide polymorphism (SNP) S is an eQTL for a gene g′ if it is associated with g′’s expression level within a population. Given a true causal edge g′ → g, if a SNP S is a local (cis-) eQTL for g′, S might also be a distal (trans-) eQTL for g [62]. We used gene expression levels in primary lung tissue (n = 278) from the Genotype Tissue Expression (GTEx) project v6p [14]. We observed an enrichment of low trans-eQTL association p-values from the directed network compared to shuffling the variant labels (Fig 7A and 7B). This suggests our network captures more valid causal effects than expected by chance.

Fig 7. Network edge validation using known cis- elements from GTEx v6 lung cis-eQTLs.

A) Enrichment of trans associations in primary lung tissue among p-values from edges inferred by BETS compared to p-values from permutations. B) Quantile-quantile plot of validated edges shows signal enrichment in lung samples when compared to signals from four other tissues in the GTEx v6 study. C) SNPs associated with inferred gene pairs. Genotype-phenotype plots corresponding to the cis-effect (left column), correlation in the GTEx v6 data between cause (y-axis) and effect (x-axis) gene pairs (right column).

We next inspected specific associations and their corresponding edges. We found 340 trans-eQTL pairs in lung samples corresponding to 130 network edges (q-value FDR ≤ 0.2). There are more trans-eQTLs than edges because there are multiple cis-eQTLs for some causal genes g′. The 340 trans-eQTLs greatly improved upon the 2 identified in primary lung tissue in the GTEx v6 trans-eQTL study [63], demonstrating the utility of transcriptional time series for prioritizing promising associations. The top trans-associations were rs2302178-CLDN1 (q-value FDR ≤ 0.095; extended from the cis-association rs2302178-HS3ST6), rs590429-ADAMTS (q-value FDR ≤ 0.11; extended from the cis-association rs2302178-OLR1), and rs2072783-CLIP2 (q-value FDR ≤ 0.11; extended from the cis-association rs2302178-GMPR; Fig 7C).

We searched for validated associations between immune-related genes, metabolic-related genes, and TFs. One association was OLR1 → ITGAV, where the known association between SNP rs4329754 and OLR1 extends to an association between the same genetic variant and effect gene ITGAV (q-value FDR ≤ 0.13) [14]. OLR1 plays key roles in immunity and metabolism [64, 65]. It is associated with metabolic syndrome [66] and atherosclerosis [66], and modulates inflammatory and humoral immune responses [67, 68]. Meanwhile, ITGAV plays a key role in the motility of CD4+ T cells during inflammation [69].

Another association was between the TF SNAI2 and gene PTPN6, where we find that the known association between genetic variant rs56800165 and SNAI2 extends to an association between the same SNP rs56800165 and effect gene PTPN6 (q-value FDR ≤0.17) [14]. SNAI2 is a direct target of the glucocorticoid receptor GR that regulates cell migration in breast cancer [70], while PTPN6 is involved in glucose homeostasis via negative regulation of insulin signalling [71]. PTPN6 is also associated with inflammatory phenotypes in multiple diseases [72, 73]. Finally, both SNAI2 and PTPN6 are involved in the cell-cell adherens junctions pathway, as SNAI2 represses transcription of cadherin, while PTPN6 positively regulates the cadherin-catenin complex [74]. Thus, for several eQTL-validated edges for gene pairs, we find that the genes are involved in related biological processes, but further experimentation is required to confirm direct interactions.

As A549 cells are models for lung tissue [75], we quantified enrichment of validated edges in lung compared to enrichment in four non-lung tissues with similar sample sizes: subcutaneous adipose (n = 298), transformed fibroblasts (n = 272), tibial artery (n = 285), and thyroid (n = 278). We validated 341 unique network edges across the five tissues (FDR ≤ 0.2). 130 edges validated for lung, 4 for subcutaneous adipose, 125 for transformed fibroblasts, 3 for tibial artery, and 82 for thyroid tissues. More network edges validated in primary lung than in non-lung tissues, suggesting that A549 cells most closely match lung samples among GTEx tissues; this is consistent with their tissue of tumor origin.

3 Discussion and conclusion

We described an approach, BETS, to build directed networks using short time-series observations of high-dimensional transcription data. BETS combined ideas from elastic net regression, graphical lasso, stability selection, and VAR models to infer Granger-causal relationships in high-dimensional transcription time-series data. Our method achieved competitive performance on the DREAM4 100-Gene Network Inference Challenge, ranking 7th out of 24 methods in AUPR and 4th out of 19 methods in AUROC; it was also faster than several methods with similar or better performance and infers effect size and sign, unlike the other top performing methods. Stability selection resulted in consistent improvement to VAR models across different hyperparameter settings.

Next, we applied BETS to time-series RNA-seq data from human A549 cells exposed to glucocorticoids and identified a directed network of 31,945 edges (FDR ≤ 0.2), capturing the causal relationships among genes after exposure to GCs. Despite the intervention of GCs in this cell line, we expect that many of the causal relationships estimated using these data will exist across cellular environments in lung cells. In our estimated causal network, we found enrichment of immune genes and TFs among causal genes. We also found enrichment of 11 specific types of causal edges from TFs, immune genes, and metabolic genes. We validated our network first in ten overexpression data sets. Edges that were positive in the original network had an enrichment of positive VAR effect sizes in the overexpression data. However, edges in the original network did not predict differential expression of genes in the overexpression data, as called by nsgp. Validating network edges by searching for trans-eQTLs in GTEx primary lung tissue samples, we found an enrichment of associations with genetic variants across network edges. Finally, we discovered 340 trans-eQTLs, a dramatic improvement from the GTEx v6 trans-eQTL study [63].

While BETS has demonstrated effective inference of causal relationships, there are interesting future directions to explore. All methods that infer networks from transcriptional time series face several difficulties: i) Transcript levels are sometimes an imperfect proxy for protein levels, especially when transcript dynamics are changing [15, 76]; ii) the scarcity of time point samples causes statistical challenges for inferring millions of possible causal interactions between genes, let alone non-additive interactions among causes [3, 77, 78]; iii) transcription data do not capture the complete regulatory context including chromatin structure and epigenetic regulations [15]; iv) transcription relationships are often nonstationary: the relationship may change over time due to responses from the environment [4, 5]; and v) inferred networks are often sensitive to the choice of preprocessing and parameter choices [79]. Single-cell data also implicitly include transcription time-series information when pseudotime is inferred, making ideas from Granger causality exceptionally relevant. However, recent preprints suggest some limits to pseudotime’s potential for network inference [80, 81]. Finally, experimental followup is key to establishing causality; BETS only generates promising, interpretable hypotheses. Indeed, by discovering hundreds more trans-eQTLs than the GTEx study (a 170-fold increase) [63], BETS demonstrates its potential to prioritize biologically meaningful associations.

4 Methods

4.1 Method details

Bootstrap Elastic net regression from Time Series (BETS)

Bootstrap Elastic net regression from Time Series (BETS) is a vector-autoregressive approach to causal inference from gene expression time-series data. It is based on the principle of Granger causality [31]: a gene g′ Granger-causes another gene g if previous information from gene g′ improves our current predictions of gene g, beyond using previous information from g and from other genes.

BETS first preprocesses the data. BETS fits an elastic net vector autoregression model to handle the high dimensionality of the time series, inferring a network (Fig 1A). It infers one network for each of 1000 bootstrapped samples of the original data set and computes each edge’s selection frequency, or its frequency of appearance among the bootstrapped networks (Fig 1B) [3]. Finally, BETS includes an edge in the network using the selection frequencies (Fig 1B). Our baseline comparison, Enet, only preprocesses the data and fits an elastic net vector autoregression model from the original data (Fig 1A; Section 2.2).

Preprocessing temporal time-series data

For a gene temporal profile (i.e., one gene’s expression values across time for a single replicate), we used zero-mean unstandardized normalization, which centers each gene temporal profile to have mean zero across time. Gene temporal profiles range from nearly constant to drastically fluctuating, so BETS does not scale profiles to unit variance: doing so would over-represent the weak causal effects of genes with lower variability.
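A minimal sketch of this centering step, assuming a profile is held as a plain list of expression values for one gene in one replicate (the function name `center_profile` is hypothetical):

```python
def center_profile(profile):
    """Subtract the across-time mean; the variance is deliberately not rescaled."""
    mean = sum(profile) / len(profile)
    return [value - mean for value in profile]
```

Because only the mean is removed, a nearly flat profile stays nearly flat after normalization instead of being inflated to unit variance.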

Vector autoregression model

Let G be the set of all genes in the data, let p = |G| be the number of genes, and let g be a gene. Let ¬g be G with g removed. Let there be T time points total, and let t ∈ {1, 2, …, T} be a single time point. Let there be R replicates of the gene expression time series.

Let $X_{t,r}^{g}$ be the expression of gene g at time t for replicate r. Let $X_t^g = [X_{t,1}^g, X_{t,2}^g, \ldots, X_{t,R}^g]^T$ be the R × 1 vector of gene expression levels of gene g across R replicates at time t. The rest of the paper does not mention replicates for simplicity, but here we discuss replicates for completeness.

Let g′ be the gene we are testing to be causal for gene g and let ℓ refer to the time lag of the causal edge g′ → g. Let L be the maximum lag. In BETS, L = 2 is the default.

We model each gene g as

$$X_t^g = \sum_{\ell=1}^{L} \alpha_\ell^g X_{t-\ell}^g + \sum_{\ell=1}^{L} \sum_{g' \in \neg g} \beta_\ell^{g',g} X_{t-\ell}^{g'} + \epsilon_t, \tag{2}$$

where $\epsilon_t \sim N(0,1)$. In other words, the expression of each gene g is modeled as a linear function of its and other genes’ L previous expression values, under independent Gaussian noise. $\alpha_\ell^g$ represents the (scalar) effect size of gene g’s ℓth previous value, $X_{t-\ell}^g$, on its current value, $X_t^g$. $\beta_\ell^{g',g}$ represents the (scalar) effect size of the ℓth previous value of gene g′ ≠ g, $X_{t-\ell}^{g'}$, on gene g’s current value, $X_t^g$. Eq 2 requires that t > ℓ for the ℓth previous value, $X_{t-\ell}^g$, to exist.
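To make the model in Eq 2 concrete, the following sketch evaluates its noise-free right-hand side for one gene and one replicate. The data layout (a dict of per-gene lists indexed from 0) and the function name `predict_expression` are assumptions made for illustration only.

```python
def predict_expression(history, alpha, beta, gene, t, max_lag=2):
    """Noise-free prediction of `gene` at 0-based time index t (Eq 2).

    history[g][t]      : expression of gene g at time index t (one replicate)
    alpha[l - 1]       : self-effect of `gene` at lag l
    beta[g_prime][l-1] : effect of causal gene g_prime on `gene` at lag l
    """
    if t < max_lag:
        raise ValueError("need at least max_lag previous time points")
    total = 0.0
    for lag in range(1, max_lag + 1):
        total += alpha[lag - 1] * history[gene][t - lag]
        for g_prime, coefs in beta.items():
            total += coefs[lag - 1] * history[g_prime][t - lag]
    return total
```

For example, with two genes, self-effects (0.5, 0.25) and causal effects (1.0, −1.0), the prediction at the third time point combines the two previous values of both genes.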

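To demonstrate how our model is fit in practice, we reformulate Eq 2 using matrix notation. Each row represents one time point for one replicate. There are T − L time points with t > L and R replicates, so there are R(T − L) samples, or rows, in total. Let N = R(T − L).

Define $X_t^g$, an N × 1 vector, as:

$$X_t^g = [X_{L+1,1}^g, \ldots, X_{L+1,R}^g, X_{L+2,1}^g, \ldots, X_{L+2,R}^g, \ldots, X_{T,R}^g]^T. \tag{3}$$

We can similarly write $X_{t-\ell}^g$, which is $X_t^g$ with each entry replaced by its ℓth previous value. Define $X_{t-}^g$, an N × L matrix consisting of the first L previous vectors $X_{t-\ell}^g$, i.e., for ℓ ranging in {1, …, L}.

$$X_{t-}^g = [X_{t-1}^g \;\; \cdots \;\; X_{t-L}^g]. \tag{4}$$

Let $\alpha^g$ be an L × 1 vector of the L lagged coefficients.

$$\alpha^g = [\alpha_1^g, \ldots, \alpha_L^g]^T. \tag{5}$$

Next, let us formulate the part of Eq 2 involving the genes g′ in matrix notation. Let $X_{t-}^{\neg g}$ be an N × L(|G| − 1) predictor matrix of the vectors $X_{t-\ell}^{g'}$, for g′ ≠ g and ℓ ∈ {1, …, L}. Note the number of columns is L(|G| − 1), because there are L previous time points ℓ ∈ {1, …, L}, and for each ℓ, there are |G| − 1 genes g′ ≠ g, giving |G| − 1 vectors: $X_{t-\ell}^{g'_1}, \ldots, X_{t-\ell}^{g'_{|G|-1}}$.

$$X_{t-}^{\neg g} = [X_{t-1}^{g'_1} \;\cdots\; X_{t-1}^{g'_{|G|-1}} \;\; X_{t-2}^{g'_1} \;\cdots\; X_{t-2}^{g'_{|G|-1}} \;\cdots\; X_{t-L}^{g'_{|G|-1}}]. \tag{6}$$

Let $\beta^{\cdot,g}$ be an L(|G| − 1) × 1 vector of the causal coefficients $\beta_\ell^{g',g}$ where g′ ≠ g.

$$\beta^{\cdot,g} = [\beta_1^{g'_1,g}, \ldots, \beta_1^{g'_{|G|-1},g}, \beta_2^{g'_1,g}, \ldots, \beta_2^{g'_{|G|-1},g}, \ldots, \beta_L^{g'_{|G|-1},g}]^T. \tag{7}$$

We then fit the model:

$$X_t^g = X_{t-}^g \alpha^g + X_{t-}^{\neg g} \beta^{\cdot,g} + \epsilon_t, \tag{8}$$

where $\epsilon_t$ is an N × 1 vector with each element $\epsilon_{t,n} \sim N(0,1)$. In its most compact form, we can write

$$X_{t-}^{G} = [X_{t-}^g \;\; X_{t-}^{\neg g}], \qquad \bar{\beta}^g = \begin{bmatrix} \alpha^g \\ \beta^{\cdot,g} \end{bmatrix}. \tag{9}$$

Note that $X_{t-}^{G}$ is an N × L|G| matrix and $\bar{\beta}^g$ is an L|G| × 1 vector. Thus, the matrix formulation of Eq 2 is:

$$X_t^g = X_{t-}^{G} \bar{\beta}^g + \epsilon_t. \tag{10}$$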
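The matrix formulation can be sketched as follows. This is an illustrative construction under an assumed data layout (a list of replicates, each a dict mapping gene name to a list of T values); `build_var_design` is a hypothetical name, not a function from the BETS package.

```python
def build_var_design(data, genes, target, max_lag=2):
    """Return (y, X) for one target gene.

    data[r][g] is the list of T expression values of gene g in replicate r.
    Rows run over time points t > max_lag and, within each time point, over
    replicates. Columns hold the target's own lags first, then the lags of
    every other gene, ordered by lag and then by gene.
    """
    others = [g for g in genes if g != target]
    n_time = len(data[0][target])
    y, X = [], []
    for t in range(max_lag, n_time):  # 0-based indices of time points L+1..T
        for replicate in data:
            y.append(replicate[target][t])
            row = [replicate[target][t - lag] for lag in range(1, max_lag + 1)]
            for lag in range(1, max_lag + 1):
                row.extend(replicate[g][t - lag] for g in others)
            X.append(row)
    return y, X
```

With T = 4, L = 2, R = 2 and two genes, this yields N = R(T − L) = 4 rows and L|G| = 4 columns, matching the dimensions stated above.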

Elastic net penalty

Because of the large number of predictors as compared to the small number of samples, we use the elastic net penalty, which is a generalization of both ridge and lasso penalties. The elastic net fits the following objective:

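$$\hat{\beta}^g_{\mathrm{ELASTIC\,NET}} = \underset{\bar{\beta}^g \in \mathbb{R}^{L|G|}}{\arg\min} \; \left\| X_t^g - X_{t-}^{G} \bar{\beta}^g \right\|_2^2 + \lambda \left( a \left\| \bar{\beta}^g \right\|_1 + (1-a) \left\| \bar{\beta}^g \right\|_2^2 \right). \tag{11}$$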

Here ‖ ⋅ ‖1 represents the 1-norm and ‖ ⋅ ‖2 represents the 2-norm.

For the elastic net, we used the following ranges of hyperparameter values: $\lambda \in \{10^{-4}, 10^{-3}, \ldots, 1\}$, $a \in \{0.1, 0.3, \ldots, 0.9\}$. For lasso, we used $\lambda \in \{10^{-5}, \ldots, 1\}$. For ridge, when we used $\{10^{-5}, \ldots, 1\}$, we found that the optimal value selected in some cases was the maximum value of λ = 1. We thus expanded the range to $\{10^{-5}, \ldots, 10^{6}\}$ to ensure that we were not missing better hyperparameters at larger values. With the expanded range, the optimal λ was found to be 100.
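The penalized objective can be written out directly, as in the sketch below. It only evaluates the objective for a given coefficient vector and does not perform the minimization. The grids assume that the elided values are the intermediate powers of ten for λ and steps of 0.2 for a; all names are hypothetical.

```python
def elastic_net_objective(y, X, beta, lam, a):
    """Squared error plus lam * (a * L1 norm + (1 - a) * squared L2 norm)."""
    residual_ss = sum(
        (yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
        for yi, row in zip(y, X)
    )
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return residual_ss + lam * (a * l1 + (1 - a) * l2_squared)

# Assumed grids: lambda in {1e-4, 1e-3, ..., 1}, a in {0.1, 0.3, ..., 0.9}.
LAMBDA_GRID = [10.0 ** k for k in range(-4, 1)]
A_GRID = [0.1, 0.3, 0.5, 0.7, 0.9]
```

Setting a = 1 recovers the lasso penalty and a = 0 the ridge penalty, which is the sense in which the elastic net generalizes both.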

Hyperparameter tuning

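Hyperparameters were selected using leave-one-out cross-validation (LOOCV). The hyperparameter (or pair of hyperparameters, for elastic net) that minimizes the mean-squared error on the held-out observations is selected. More specifically, we first fix a hyperparameter (λ, a). Then, for a given gene g and row index i, we extract the ith row of $X_t^g$ and $X_{t-}^{G}$. We refer to this extracted validation set as $(X_t^g)_i$ (target) and $(X_{t-}^{G})_i$ (predictors). The remaining data form the training set, $(X_t^g)_{-i}$ (target) and $(X_{t-}^{G})_{-i}$ (predictors).

First, let $\hat{\beta}^g_{(\lambda,a),i}$ be the $\hat{\beta}^g_{\mathrm{ELASTIC\,NET}}$ that is fit from the training set.

$$\hat{\beta}^g_{(\lambda,a),i} = \underset{\bar{\beta}^g \in \mathbb{R}^{L|G|}}{\arg\min} \; \left\| (X_t^g)_{-i} - (X_{t-}^{G})_{-i} \bar{\beta}^g \right\|_2^2 + \lambda \left( a \left\| \bar{\beta}^g \right\|_1 + (1-a) \left\| \bar{\beta}^g \right\|_2^2 \right). \tag{12}$$

We then compute the prediction error on the validation set, $\left\| (X_t^g)_i - (X_{t-}^{G})_i \hat{\beta}^g_{(\lambda,a),i} \right\|_2^2$. We repeat the fit of $\hat{\beta}^g_{(\lambda,a),i}$ and the error computation for every row index i of $X_t^g$ and for every gene g. The mean held-out cross-validation error for (λ, a) is:

$$\mathrm{MSE}(\lambda,a) = \sum_{g \in G} \sum_{i=1}^{N} \frac{1}{N} \left\| (X_t^g)_i - (X_{t-}^{G})_i \hat{\beta}^g_{(\lambda,a),i} \right\|_2^2. \tag{13}$$

The (λ, a) that minimizes the error in Eq 13 is selected.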
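The cross-validation loop for a single gene might look like the sketch below. The solver is a plain coordinate-descent routine written for this illustration; it is an assumption, not the optimizer used by BETS, and it runs a fixed number of sweeps rather than checking convergence.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def fit_elastic_net(y, X, lam, a, n_sweeps=200):
    """Coordinate descent for ||y - Xb||^2 + lam * (a*||b||_1 + (1-a)*||b||^2)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho, norm = 0.0, 0.0
            for yi, row in zip(y, X):
                # Prediction from all coordinates except j.
                pred_others = sum(row[k] * beta[k] for k in range(p) if k != j)
                rho += row[j] * (yi - pred_others)
                norm += row[j] ** 2
            denom = norm + lam * (1 - a)
            beta[j] = soft_threshold(rho, lam * a / 2) / denom if denom > 0 else 0.0
    return beta

def loocv_error(y, X, lam, a):
    """Mean squared held-out error over all rows, for one effect gene."""
    total = 0.0
    for i in range(len(y)):
        beta = fit_elastic_net(y[:i] + y[i + 1:], X[:i] + X[i + 1:], lam, a)
        pred = sum(x * b for x, b in zip(X[i], beta))
        total += (y[i] - pred) ** 2
    return total / len(y)
```

Summing `loocv_error` over all effect genes for each candidate (λ, a) and taking the minimizer corresponds to the selection rule in the text.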

Permuted coefficients

We evaluate the significance of any given edge g′ → g through permutation. In detail, we remove the time dependency between g′ and g via permutations of individual gene temporal profiles over time.

We first generate a single permuted data set $\tilde{X}_t^g$. We independently shuffle the temporal profile of each gene g ∈ {1, …, |G|} across time (Fig 1A). This is done separately for distinct replicates.

We wish to model the hypothesis of no causal relations from any gene g′ ∈ ¬g upon a given effect gene g. We use the unpermuted values of the effect gene $X_t^g$ and the permuted values of all other causal genes g′ ∈ ¬g, as $\tilde{X}_t^{\neg g}$. The effect gene g remains unpermuted, as we do not consider self-regulatory loops.

Permutation-based causal coefficients $\tilde{\beta}_\ell^{g',g}$ are then fit as

$$X_t^g = \sum_{\ell=1}^{L} \alpha_\ell^g X_{t-\ell}^g + \sum_{\ell=1}^{L} \sum_{g' \in \neg g} \tilde{\beta}_\ell^{g',g} \tilde{X}_{t-\ell}^{g'} + \epsilon_t. \tag{14}$$

We use these coefficients to perform FDR calibration.
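The permutation step can be sketched as below, assuming the data are stored as a list of replicates, each mapping a gene name to its temporal profile. The function name and the explicit random generator argument are choices made for this illustration.

```python
import random

def permute_profiles(data, rng):
    """Shuffle each gene's temporal profile across time, per replicate.

    data[r][g] is the list of expression values of gene g in replicate r.
    Returns a new structure; the input is left unchanged.
    """
    permuted = []
    for replicate in data:
        shuffled = {}
        for gene, profile in replicate.items():
            values = list(profile)
            rng.shuffle(values)  # independent shuffle for each gene
            shuffled[gene] = values
        permuted.append(shuffled)
    return permuted
```

When fitting the null model for an effect gene g, the unpermuted profile of g would be paired with the permuted profiles of all other genes.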

Edge FDR

The result of the elastic net VAR model is a complete network whose edges are weighted according to the estimated regression coefficients.

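For each lag ℓ ∈ {1, …, L} and effect gene g, we control the edge FDR at ≤ 0.05 by finding the threshold $T_\ell^g$ such that

$$\frac{\sum_{g' \in \neg g} \mathbb{1}\{|\tilde{\beta}_\ell^{g',g}| > T_\ell^g\}}{\sum_{g' \in \neg g} \mathbb{1}\{|\tilde{\beta}_\ell^{g',g}| > T_\ell^g\} + \sum_{g' \in \neg g} \mathbb{1}\{|\beta_\ell^{g',g}| > T_\ell^g\}} \leq 0.05. \tag{15}$$

For each gene pair (g′, g), g′ ∈ ¬g, a directed edge g′ → g exists if, for at least one of the lags ℓ ∈ {1, …, L}, $|\beta_\ell^{g',g}| > T_\ell^g$.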
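One way to search for such a threshold is sketched below. The text does not say which threshold is chosen when several satisfy the inequality; this sketch assumes the smallest one that still admits at least one real edge, and the function name is hypothetical.

```python
def edge_fdr_threshold(real_coefs, null_coefs, fdr=0.05):
    """Smallest threshold T with null / (null + real) <= fdr among |coef| > T.

    real_coefs: fitted coefficients for one effect gene and one lag
    null_coefs: coefficients fitted with permuted causal genes
    Returns None if no candidate threshold admits any real edge at this FDR.
    """
    candidates = sorted({abs(c) for c in real_coefs + null_coefs})
    for threshold in candidates:
        n_real = sum(abs(c) > threshold for c in real_coefs)
        n_null = sum(abs(c) > threshold for c in null_coefs)
        if n_real > 0 and n_null / (n_null + n_real) <= fdr:
            return threshold
    return None
```

Edges whose real coefficient exceeds the returned threshold in absolute value, for at least one lag, would then be kept.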

Stability selection

Stability selection is used to ensure the robustness of BETS to small sample size. Stability selection is a method for high-dimensional graph estimation that uses bootstrap samples [82]. While the authors prove finite sample control for the family-wise error rate (FWER), we are interested in controlling the false discovery rate (FDR).

This procedure draws B = 1000 bootstrap samples, where each sample consists of N rows drawn with replacement from output $X_t^g$ and input $X_{t-}^{G}$ (Eq 10). Let the jth bootstrap sample from the original data be $X_t^{g,j}$ and $X_{t-}^{G,j}$. A set of N row indices, $I_j$, is sampled with replacement from [1, 2, …, N]. $X_t^{g,j}$ and $X_{t-}^{G,j}$ are created by choosing the rows $I_j$ of $X_t^g$ and $X_{t-}^{G}$. $X_t^{g,j}$ is an N × 1 vector and $X_{t-}^{G,j}$ is an N × L|G| matrix.

Now consider the permuted output and input, $\tilde{X}_t^g$ and $\tilde{X}_{t-}^{G}$, constructed from $\tilde{X}_t^g$ (Eq 10). Let the jth bootstrap sample from the permuted data be $\tilde{X}_t^{g,j}$ and $\tilde{X}_{t-}^{G,j}$. $\tilde{X}_t^{g,j}$ and $\tilde{X}_{t-}^{G,j}$ are created by choosing the rows $I_j$ of $\tilde{X}_t^g$ and $\tilde{X}_{t-}^{G}$.

Thus, the jth bootstrap sample for both the original and permuted data sets use the same row indices $I_j$.

For each of the 1000 bootstrap samples, we infer a network using the elastic net fit and edge FDR procedure described earlier. Each edge g′ → g’s selection frequency, $\pi^{g',g}$ (the frequency of g′ → g among the 1000 bootstrap networks), is computed (Fig 1B).
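A sketch of the two bookkeeping steps, drawing the shared row indices and tallying selection frequencies, is given below. Networks are represented as sets of (cause, effect) pairs, which is a representation assumed for this example.

```python
import random

def bootstrap_indices(n_rows, n_boot, rng):
    """n_boot index sets, each of n_rows row indices drawn with replacement."""
    return [[rng.randrange(n_rows) for _ in range(n_rows)] for _ in range(n_boot)]

def selection_frequency(networks):
    """Fraction of bootstrap networks in which each directed edge appears.

    networks: list of sets of (cause, effect) pairs, one set per bootstrap.
    """
    counts = {}
    for edges in networks:
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / len(networks) for edge, c in counts.items()}
```

The same index set would be applied to the rows of both the original and the permuted design matrices within a given bootstrap iteration.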

Stability FDR

To determine the appropriate cutoff for the selection frequency of each edge ($\pi^{g',g}$), we generate a null distribution of selection frequencies using permutations. First, we generate a second permuted data set $\hat{X}_t^g$ in which we again independently shuffle the temporal profile of each gene g ∈ {1, …, |G|} across time. This is done separately for distinct replicates.

We run the stability selection procedure on $\hat{X}_t^g$ as if it were $X_t^g$, using the same set of row indices $I_j$ to generate the bootstrap samples, and using $\tilde{X}_t^g$ to generate the permuted coefficients.

After running for all B = 1000 bootstrap samples, we obtain the null selection frequency of each edge, $\hat{\pi}^{g',g}$.

We control the stability FDR at 0.2 by finding the threshold $T_b$ such that

$$\frac{\sum_{g' \in \neg g} \mathbb{1}\{\hat{\pi}^{g',g} > T_b\}}{\sum_{g' \in \neg g} \mathbb{1}\{\hat{\pi}^{g',g} > T_b\} + \sum_{g' \in \neg g} \mathbb{1}\{\pi^{g',g} > T_b\}} \leq 0.2. \tag{16}$$

Because the maximum lag is 2, each edge g′ → g has two possible lags and thus two selection frequencies. The lag with larger absolute value of average coefficient across the 1,000 networks is considered in both the permuted and the real empirical distributions. So, if $|\beta_1^{g',g}|$ exceeds $|\beta_2^{g',g}|$, the lag is said to be 1 and the selection frequency $\pi_1^{g',g}$ is used.
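The lag choice described here can be expressed in a few lines. This sketch assumes the coefficients of one edge are stored per lag as a list with one value per bootstrap network; ties are broken toward the smaller lag, which is an arbitrary choice not specified in the text.

```python
def choose_lag(coefs_by_lag):
    """Lag whose mean coefficient across bootstrap networks is largest in magnitude.

    coefs_by_lag: dict mapping lag -> list of coefficients for one edge.
    """
    def abs_mean(values):
        return abs(sum(values) / len(values))
    # Iterating lags in ascending order makes ties resolve to the smaller lag.
    return max(sorted(coefs_by_lag), key=lambda lag: abs_mean(coefs_by_lag[lag]))
```

The selection frequency at the chosen lag is then the one compared against the threshold.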

Network inference performance metrics

Refer to every network edge inferred by a method as a positive and every missing edge as a negative. Let TP be True Positives, FP be False Positives, TN be True Negatives, and FN be False Negatives. Let TPR be True Positive Rate, (i.e., recall), and FPR be False Positive Rate. Then, we have

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

In the DREAM benchmark, each network inference method is evaluated by comparing the true network (i.e., the network used to generate the synthetic data) with the inferred network at different thresholds for edge inclusion. The two main evaluation metrics are Area Under the Receiver Operating Characteristic curve (AUROC) and Area Under the Precision-Recall curve (AUPR). AUROC plots TPR on the y axis and FPR on the x axis. AUPR plots precision on the y axis and recall on the x axis. When the number of negatives greatly exceeds the number of positives, as with gene networks, which are typically sparse, AUPR is a more relevant metric [83].
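These quantities can be computed from edge sets as in the sketch below. It is not the DREAM evaluation script; the edge representation, the function names, and the convention of returning precision 1.0 when nothing is predicted are assumptions made for this example.

```python
def confusion_rates(predicted, truth, all_pairs):
    """Return (TPR, FPR, precision) for a set of predicted directed edges."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(all_pairs) - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention for empty prediction
    return tpr, fpr, precision

def auroc(ranked_edges, truth, all_pairs):
    """Trapezoidal area under the ROC curve as edges are added in rank order."""
    points = [(0.0, 0.0)]
    for k in range(1, len(ranked_edges) + 1):
        tpr, fpr, _ = confusion_rates(set(ranked_edges[:k]), truth, all_pairs)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))  # closes the curve if the ranking is partial
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An AUPR routine would follow the same pattern with recall on the x axis and precision on the y axis.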

4.2 Software

BETS is available for download on GitHub at https://github.com/lujonathanh/BETS. The software is licensed under the terms of the Apache License, version 2.0. The analysis code is available at https://zenodo.org/record/4009546#.X5XEh0JKg1g.

4.3 Data sets and processing

DREAM Network Inference Challenge

There were five data sets in the DREAM4 Network Inference Challenge, each consisting of ten time series of 21 time points and 100 genes [45, 84]. For the first half of the time series, a “drug perturbation” was applied; this affected about 1/3 of genes. For the second half, the perturbation was removed and the system was allowed to relax back to the wild-type state.

Glucocorticoid gene expression data

We analyzed RNA-sequencing data from a set of experiments developed to study glucocorticoid receptors (GRs) in the human adenocarcinoma and lung model cell line, A549 [6]. There was an original exposure data set of 4 replicates in which cells were stimulated by the glucocorticoid dexamethasone (dex), and gene expression was profiled at {0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12} hours of dex stimulation. There was also an unperturbed data set of 3 replicates in which cells were exposed to dex for 12 hours, after which the conditioned media was replaced and dex removed. Gene expression was profiled at {0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12} hours after dex removal. We integrated the original exposure and unperturbed data into a joint data set with 7 replicates. The original exposure data set is available at the Gene Expression Omnibus (GEO), with reference numbers listed for the rows that list “RNASeq” as the Assay under the column “Experiment_GEO_Series” in Supplementary Table 3 of [6]. The GEO accession numbers for time points {0, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12} of the original exposure data set are GSE91305, GSE91198, GSE91311, GSE91358, GSE91303, GSE91243, GSE91281, GSE91229, GSE91255, GSE91284, GSE91222, and GSE91212, respectively. The unperturbed data set is available at GEO accession number GSE144662.

We selected 2768 genes for analysis, which had average expression > 2 Transcripts Per kilobase Million (TPM) and were differentially expressed in the original exposure data. A gene was called differentially expressed if its expression at any time point differed from its expression at time 0, ascertained by running edgeR (FDR ≤ 0.05) [6]. We added NR3C1, which encodes the glucocorticoid receptor (GR). NR3C1 was not found to be differentially expressed at FDR ≤ 0.05.

After genes were selected, gene expression TPM were log-normalized and corrected for surrogate variables using SVAseq [85]. Each gene’s temporal profile was centered to have mean zero across time. In the original exposure data, all replicates besides replicate 1 had a measurement for each time point. Replicate 1 was missing time points 5 and 6 hrs, so we imputed these values using a linear interpolation from time points 4 and 7 hrs in the log-transformed, surrogate-corrected space.
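The imputation of the two missing time points amounts to linear interpolation between the 4 h and 7 h measurements, as sketched below. The expression values used here are made up for illustration.

```python
def interpolate(t, t0, y0, t1, y1):
    """Linear interpolation between the points (t0, y0) and (t1, y1)."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

# Hypothetical log-scale, surrogate-corrected values at 4 h and 7 h
# for one gene in replicate 1.
value_4h, value_7h = 1.0, 4.0
value_5h = interpolate(5, 4, value_4h, 7, value_7h)
value_6h = interpolate(6, 4, value_4h, 7, value_7h)
```

Here the imputed values fall one third and two thirds of the way from the 4 h value to the 7 h value.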

Overexpression transcriptional time-series data

There were ten overexpression data sets, in which each of the transcription factors CEBPB, CEBPD, FOSL2, FOXO1, FOXO3, KLF6, KLF9, KLF15, POU5F1, and TFCP2L1 was separately overexpressed across 12 hours of dex stimulation. Each overexpression data set had three replicates; gene expression was profiled after {0, 1, 4, 8, 12} hours of dex stimulation. The same 2768 genes were selected and the same normalization and SVAseq correction as earlier was performed. The overexpression data sets are available at the Gene Expression Omnibus at GEO accession number GSE144660.

4.4 Application of methods to the data

DREAM benchmarking

We ran the methods BETS, Enet, CSId [44], Jump3 [36], CLR [27], MRNET [28], ARACNE [29], SWING-RF [51], and SWING-Lasso [51] on the DREAM challenge. In BETS, inferred edges were ranked by their selection frequency for calculating AUPR and AUROC. In Enet, edges were ranked by the absolute value of their coefficient. The Python3 version of CSId was run after obtaining it from correspondence with Dr. Penfold.

Jump3 required setting the “systematic noise” and “observational noise” parameters. We used Dr. Huynh-Thu’s settings on the DREAM challenge, with systematic noise at $10^{-4}$ and observational noise at 0.01 times the value of the gene’s expression. ARACNE, MRNET, and CLR were run using the minet R library. BETS, Enet, CSId, and Jump3 were run on a single node without parallelization. The node had 28 cores, 128 GB of memory, and 2.4 GHz processor speed. ARACNE, MRNET, and CLR were run on a 4 GB RAM, Intel Core i5 1.3 GHz laptop.

Network analysis: Gene annotations

We considered genes with three possible labels: immune system, metabolism, or transcription factor. Immune genes were labeled as such using two sources. The first source is the Gene Ontology (GO) annotation “Immune” (GO:0002376) [52]. We applied this label when the evidence codes were one of EXP, IDA, IGI, IMP, IPI, IC, TAS. The second source is the Gene Ontology Consortium’s curated, ranked list of immune-related genes based on multiple databases and experimental evidence [54]. For the GO annotation, we selected all genes with score ≥ 7. This resulted in 616 immune genes overall, and 109 immune genes in our list of 2768 genes.

Metabolic genes were called using two sources. The first source is the GO annotation “carbohydrate metabolic process” GO:0005975 [52]. We applied this label when the evidence codes were EXP, IDA, IGI, IMP, IPI, IC, TAS. The second source is the Gene Set Enrichment Analysis (GSEA)-curated list of metabolic-related genes [53]. We searched only among those with experimental evidence: the Canonical, KEGG, BIOCARTA, and Reactome pathways. We used the following four search queries: “gluconeogenesis OR (glucose AND metabolism) OR glycolysis,” “lipid AND metabolism,” “Diabetes,” “Obesity.” This resulted in 544 metabolic genes overall, of which 120 were in our gene list. 65 genes were both immune and metabolic overall; 12 of these were in our gene list.

Transcription factors (TFs) were called using the Bioguo database of human TFs [55]. There were 1463 TFs overall, of which 226 were present in our gene list.

Experimental interactions

We created a list of experimentally validated interactions from the BIOGRID Homo sapiens Protein-Protein Interactions database [56]. Proteins were mapped to genes using BioMart from Ensembl 94 [86]. Among genes in our gene list, there were 17,990 BIOGRID interactions.

Validation on overexpression data

The overexpression data had five time points with 1 to 4 hour time gaps, unlike the original 12 time points with 0.5 to 2 hour time gaps. On the overexpression data, we used a VAR model that regressed each effect gene’s expression level on its previous expression level and the causal gene’s previous expression level, assuming normal noise $\epsilon_t \sim N(0,1)$:

$$X_t^g = c^g X_{t-1}^g + d^{g',g} X_{t-1}^{g'} + \epsilon_t. \tag{17}$$

No regularization was included, and ordinary least squares was used to fit the equation. The expression $X_{t-1}^{g'}$ of a causal gene g′ is fit as a single predictor, without the expression of any other causal gene. Lag 1, not 2, is used due to the larger time gaps.
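Because Eq 17 has only two predictors and no intercept, the least-squares fit has a closed form through the 2 × 2 normal equations. The sketch below handles a single replicate; pooling rows across replicates, as the analysis would require, is omitted, and the function name is hypothetical.

```python
def fit_pairwise_var(effect, cause):
    """OLS fit of effect[t] = c * effect[t-1] + d * cause[t-1], no intercept.

    effect, cause: expression values at consecutive sampled time points
    for one replicate. Returns (c, d).
    """
    y = effect[1:]
    x1 = effect[:-1]
    x2 = cause[:-1]
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * w for u, w in zip(x1, y))
    s2y = sum(v * w for v, w in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("predictors are collinear")
    c = (s1y * s22 - s2y * s12) / det  # Cramer's rule on the normal equations
    d = (s2y * s11 - s1y * s12) / det
    return c, d
```

The coefficient d plays the role of the overexpression edge coefficient that is compared with the sign of the edge in the original network.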

Validation on lung trans-eQTLs in GTEx v6

Trans-eQTLs were discovered using the Genotype Tissue Expression (GTEx) v6 data [14, 63]. First, we mapped our genes from hg38 to hg19. For every edge g′ → g, we tested the set of genetic variants within 20 kilobases of g′ for trans-eQTL association with g [87]. Specifically, we computed the p-value for linear association of each variant with the corresponding effect gene g using MatrixEQTL [88]. A null distribution was generated by taking every edge g′ → g, permuting the effect gene g’s expression values, and repeating the linear association test. FDR over test statistics was calculated using q-value [89]. Because not every causal gene g′ had a cis-eQTL, only 26,839 edges (84% of the original 31,945 edges) were tested.

5 Supporting information

S1 Fig. Overview of gene regulatory network inference methods.

Panels show each inference method applied to a cause gene g’ (blue, solid) and an effect gene g (blue, dotted). A) Mutual information is computed between the cause and effect. B) The effect’s expression is fit as an autoregression from the cause’s past expression. C) The effect’s expression is fit as a differential equation from the cause’s current expression. D) The effect’s expression is fit as a decision tree function of the cause’s past expression. E) The space of dynamic causal networks is searched, with linear relationships between cause and effect. F) The space of dynamic causal networks is searched, with nonlinear relationships between cause and effect.

(PDF)

S2 Fig. Causal gene expression and effect gene residuals from experimentally validated interactions.

On the y-axes are the effect gene residual expression values after subtracting the effects of all other covariates. Axes are in units of ln(TPM). Related to Fig 5 and S7 Table.

(PDF)

S1 Table. DREAM4 100-gene network inference results, AUPR.

DBN is dynamic Bayesian network, DT is decision tree, GP is Gaussian process, MI is mutual information, ODE is ordinary differential equation, VAR is vector autoregression. The references that reported ebdbnet, ScanBMA, and LASSO did not provide AUPR values for individual networks. Algorithms that were run in-house were ARACNE, BETS, CLR, CSId, Enet, Jump3, MRNET, SWING-Lasso, and SWING-RF. Where reported literature values were available, they were consistent with these values. Values for CSIc, G1DBN, GCCA, GP4GRN, TSNI, VBSSMa and VBSSMb were taken from [49]. Values for ebdbnet, LASSO, and ScanBMA were taken from [40]. Values for dynGENIE3, GENIE3, OKVAR-Boost and tl-CLR were taken from [37]. Values for Inferelator and Jump3 were taken from [36]. Related to Fig 2.

(DOCX)

S2 Table. DREAM4 100-gene network inference results, AUROC.

DBN is dynamic Bayesian network, DT is decision tree, GP is Gaussian process, MI is mutual information, ODE is ordinary differential equation, VAR is vector autoregression. The references that reported ebdbnet, ScanBMA, and LASSO did not provide AUROC values for individual networks. Algorithms that were run in-house were ARACNE, BETS, CLR, CSId, Enet, Jump3, MRNET, SWING-Lasso, and SWING-RF. Values for CSIc, G1DBN, GCCA, GP4GRN, TSNI, VBSSMa and VBSSMb were taken from [49]. Values for ebdbnet, LASSO, and ScanBMA were taken from [40]. Related to Fig 2.

(DOCX)

S3 Table. Results of in-house algorithms on DREAM4 100-gene network inference.

AUPR, AUROC, and Time indicate average AUPR, AUROC, and time over the five networks, respectively. BETS and Enet are in bold to indicate that they are our own developed methods, based on vector autoregression. SWING-RF [51] and Jump3 [36] are decision tree methods. CSId is a Gaussian process method [44]. CLR [27], MRNET [90], and ARACNE [29] are mutual information methods. SWING-Lasso is a vector autoregression method [51]. Related to Fig 2.

(DOCX)

S4 Table. Improvement on DREAM4 100-gene network inference from bootstrap.

For each AUROC or AUPR column, the average is the listed value and the standard deviation is listed in parentheses. “Coefficient” denotes the result when ranking edges by their fitted coefficient, as in the original method. “Bootstrap” denotes the results when ranking edges by the frequency by which they appear in the bootstrap networks.

(DOCX)

S5 Table. Dependency of BETS performance on bootstrap samples.

DREAM results reported for running BETS on both 100 and 1000 bootstrap samples. All values in the columns are averages and the parenthetical values are standard deviations across the 5 DREAM4 networks. The 1000 samples row is bolded because 1000 samples is the default setting. These runs use zero-mean normalization, lag 2, and the elastic net penalty. Related to Fig 2.

(DOCX)

S6 Table. Enrichment of edges between specific gene classes in inferred causal network.

A Fisher’s Exact Test was performed, where the rows of the contingency table were whether or not an edge was of the edge type, and the columns were whether or not the edge was part of the inferred network. Related to Fig 3.

(DOCX)

S7 Table. Gene pair information from Fig 5.

Shown Data Set indicates whether the gene temporal profiles in Fig 5 are taken from the original exposure data or unperturbed data. The edge type indicates the gene class of the causal and effect gene; for example, I → M indicates an edge from an Immune causal gene to a Metabolic effect gene. I = Immune; M = Metabolic; T = Transcription Factor; A = Any gene. Related to Fig 5.

(DOCX)

S1 Text. Supplemental information.

Additional overview and analyses.

(DOCX)

Acknowledgments

The authors would like to thank Gregory Darnell, Derek Aguiar, Ariel Gewirtz, Allison Chaney, Isabella Grabski, Cristina Anastase, and Genna Gliner for helpful discussion, feedback, and generosity in running cluster jobs; and Jian Peng for productive discussion and helpful comments.

The authors gratefully acknowledge that this work was performed using the Princeton Research Computing resources sponsored by the Princeton Institute for Computational Science and Engineering (PICSciE) at Princeton University.

Data Availability

All files have been submitted to the Gene Expression Omnibus, accession number GSE91208 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE91208). All other data are contained in the manuscript and the Supporting information.

Funding Statement

This work was funded by the following grants to BEE: NIH R01 HL133218 and NIH U01 HG007900 (National Human Genome Research Institute), and an NSF CAREER 1750729 (Division of Information and Intelligent Systems). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nature Reviews Genetics. 2012;13(8):552–564. 10.1038/nrg3244 [DOI] [PubMed] [Google Scholar]
  • 2. Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, et al. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics. 2003;7:733–742. [Google Scholar]
  • 3. Bühlmann P, Kalisch M, Meier L. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application. 2014;1:255–278. 10.1146/annurev-statistics-022513-115545 [DOI] [Google Scholar]
  • 4. Mas P. Circadian clock function in Arabidopsis thaliana: time beyond transcription. Trends in cell biology. 2008;18(6):273–281. 10.1016/j.tcb.2008.03.005 [DOI] [PubMed] [Google Scholar]
  • 5. Robinson JW, Hartemink AJ. Learning non-stationary dynamic Bayesian networks. Journal of Machine Learning Research. 2010;11(Dec):3647–3680. [Google Scholar]
  • 6. McDowell IC, Barrera A, D’Ippolito AM, Vockley CM, Hong LK, Leichter SM, et al. Glucocorticoid receptor recruits to enhancers and drives activation by motif-directed binding. Genome Research. 2018;28(9):1272–1284. 10.1101/gr.233346.117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Cain DW, Cidlowski JA. Immune regulation by glucocorticoids. Nature Reviews Immunology. 2017;. 10.1038/nri.2017.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.UK Government. World first coronavirus treatment approved for NHS use by government; 2020. https://www.gov.uk/government/news/world-first-coronavirus-treatment-approved-for-nhs-use-by-government.
  • 9.National Institutes of Health. COVID-19 Treatment Guidelines: Corticosteroids; 2020. https://www.covid19treatmentguidelines.nih.gov/immune-based-therapy/immunomodulators/corticosteroids/. [PubMed]
  • 10.World Health Organization. Corticosteroids for COVID-19; 2020. https://www.who.int/publications/i/item/WHO-2019-nCoV-Corticosteroids-2020.1.
  • 11.Infectious Diseases Society of America. Infectious Diseases Society of America Guidelines on the Treatment and Management of Patients with COVID-19; 2020. https://www.idsociety.org/practice-guideline/covid-19-guideline-treatment-and-management. [DOI] [PMC free article] [PubMed]
  • 12. Geer EB, Islam J, Buettner C. Mechanisms of glucocorticoid-induced insulin resistance: focus on adipose tissue function and lipid metabolism. Endocrinology and Metabolism Clinics of North America. 2014;43(1):75–102. 10.1016/j.ecl.2013.10.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Spencer SJ, Tilbrook A. The glucocorticoid contribution to obesity. Stress. 2011;14(3):233–246. 10.3109/10253890.2010.534831 [DOI] [PubMed] [Google Scholar]
  • 14. GTEx Consortium, et al. Genetic effects on gene expression across human tissues. Nature. 2017;550(7675):204 10.1038/nature24277 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Liu ZP. Reverse Engineering of Genome-wide Gene Regulatory Networks from Gene Expression Data. Current Genomics. 2015;16(1):3–22. 10.2174/1389202915666141110210634 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Opgen-Rhein R, Strimmer K. Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics. 2007;8(2):1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Lozano AC, Abe N, Liu Y, Rosset S. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics. 2009;25(12):i110–i118. 10.1093/bioinformatics/btp199 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Cho H, Berger B, Peng J. Reconstructing Causal Biological Networks through Active Learning. PLOS One. 2016;11(3):e0150611 10.1371/journal.pone.0150611 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Maathuis MH, Kalisch M, Bühlmann P, et al. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics. 2009;37(6A):3133–3164. 10.1214/09-AOS685 [DOI] [Google Scholar]
  • 20. Murphy KP. Active Learning of Causal Bayes Net Structure. University of California, Berkeley; 2001. [Google Scholar]
  • 21. Rau A, Jaffrézic F, Nuel G. Joint estimation of causal effects from observational and intervention gene expression data. BMC Systems Biology. 2013;7(1):1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Hauser A, Bühlmann P. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning. 2014;55(4):926–939. 10.1016/j.ijar.2013.11.007 [DOI] [Google Scholar]
  • 23. He YB, Geng Z. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research. 2008;9(Nov):2523–2547. [Google Scholar]
  • 24. Grzegorczyk M. An introduction to Gaussian Bayesian networks. Systems Biology in Drug Discovery and Development: Methods and Protocols. 2010; p. 121–147. [DOI] [PubMed] [Google Scholar]
  • 25. Madar A, Greenfield A, Vanden-Eijnden E, Bonneau R. DREAM3: network inference using dynamic context likelihood of relatedness and the inferelator. PLOS One. 2010;5(3):e9803 10.1371/journal.pone.0009803 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Lopes M, Bontempi G. Experimental assessment of static and dynamic algorithms for gene regulation inference from time series expression data. Frontiers in Genetics. 2013;4:303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology. 2007;5(1):e8 10.1371/journal.pbio.0050008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP Journal on Bioinformatics and Systems Biology. 2007;2007:8–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. In: BMC Bioinformatics. vol. 7 BioMed Central; 2006. p. S7 10.1186/1471-2105-7-S1-S7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010;11(1):154 10.1186/1471-2105-11-154 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Granger CWJ. Testing for causality. Journal of Economic Dynamics and Control. 1980;2:329–352. 10.1016/0165-1889(80)90069-X. [DOI] [Google Scholar]
  • 32. Bansal M, Gatta GD, Di Bernardo D. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006;22(7):815–822. 10.1093/bioinformatics/btl003 [DOI] [PubMed] [Google Scholar]
  • 33. Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, et al. The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology. 2006;7(5):R36 10.1186/gb-2006-7-5-r36 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Quinlan JR. Induction of decision trees. Machine Learning. 1986;1(1):81–106. 10.1007/BF00116251 [DOI] [Google Scholar]
  • 35. Breiman L. Classification and regression trees. Routledge; 2017. [Google Scholar]
  • 36. Huynh-Thu VA, Sanguinetti G. Combining tree-based and dynamical systems for the inference of gene regulatory networks. Bioinformatics. 2015;31(10):1614–1622. 10.1093/bioinformatics/btu863 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Geurts P, et al. dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data. Scientific Reports. 2018;8(1):3384 10.1038/s41598-018-21715-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Lèbre S. Inferring dynamic genetic networks with low order independencies. Statistical Applications in Genetics and Molecular Biology. 2009;8(1):1–38. [DOI] [PubMed] [Google Scholar]
  • 39. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA, et al. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In: Pacific Symposium on Biocomputing. vol. 6; 2001. p. 266 [DOI] [PubMed] [Google Scholar]
  • 40. Young WC, Raftery AE, Yeung KY. Fast Bayesian inference for gene regulatory networks using ScanBMA. BMC Systems Biology. 2014;8(1):47 10.1186/1752-0509-8-47 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics. 2004;21(3):349–356. [DOI] [PubMed] [Google Scholar]
  • 42. Rau A, Jaffrézic F, Foulley JL, Doerge RW. An empirical Bayesian method for estimating biological networks from temporal microarray data. Statistical Applications in Genetics and Molecular Biology. 2010;9(1). 10.2202/1544-6115.1513 [DOI] [PubMed] [Google Scholar]
  • 43. Äijö T, Lähdesmäki H. Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics. Bioinformatics. 2009;25(22):2937–2944. 10.1093/bioinformatics/btp511 [DOI] [PubMed] [Google Scholar]
  • 44. Penfold CA, Shifaz A, Brown PE, Nicholson A, Wild DL. CSI: a nonparametric Bayesian approach to network inference from multiple perturbed time series gene expression data. Statistical Applications in Genetics and Molecular Biology. 2015;14(3):307–310. [DOI] [PubMed] [Google Scholar]
  • 45. Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology. 2009;16(2):229–239. 10.1089/cmb.2008.09TT [DOI] [PubMed] [Google Scholar]
  • 46. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. 10.1093/biostatistics/kxm045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]
  • 48. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996; p. 267–288. [Google Scholar]
  • 49. Penfold CA, Wild DL. How to infer gene networks from expression profiles, revisited. Interface Focus. 2011;1(6):857–870. 10.1098/rsfs.2011.0053 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Irrthum A, Wehenkel L, Geurts P, et al. Inferring regulatory networks from expression data using tree-based methods. PLOS ONE. 2010;5(9):e12776 10.1371/journal.pone.0012776 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Finkle JD, Wu JJ, Bagheri N. Windowed Granger causal inference strategy improves discovery of gene regulatory networks. Proceedings of the National Academy of Sciences. 2018;115(9):2252–2257. 10.1073/pnas.1710936115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–29. 10.1038/75556 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–15550. 10.1073/pnas.0506580102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Consortium TGO. Gene Ontology Consoritum’s Curated List of Immune Genes; 2014. http://wiki.geneontology.org/index.php/Immunology. [Google Scholar]
  • 55. Zhang HM, Chen H, Liu W, Liu H, Gong J, Wang H, et al. AnimalTFDB: a comprehensive animal transcription factor database. Nucleic Acids Research. 2012;40(Database issue):D144–D149. 10.1093/nar/gkr965 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Chatr-Aryamontri A, Oughtred R, Boucher L, Rust J, Chang C, Kolas NK, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Research. 2017;45(D1):D369–D379. 10.1093/nar/gkw1102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Rui L, Yuan M, Frantz D, Shoelson S, White MF. SOCS-1 and SOCS-3 block insulin signaling by ubiquitin-mediated degradation of IRS1 and IRS2. Journal of Biological Chemistry. 2002;277(44):42394–42398. 10.1074/jbc.C200444200 [DOI] [PubMed] [Google Scholar]
  • 58. Calegari VC, Alves M, Picardi PK, Inoue RY, Franchini KG, Saad MJ, et al. Suppressor of cytokine signaling-3 provides a novel interface in the cross-talk between angiotensin II and insulin signaling systems. Endocrinology. 2005;146(2):579–588. 10.1210/en.2004-0466 [DOI] [PubMed] [Google Scholar]
  • 59. McCormick SM, Gowda N, Fang JX, Heller NM. Suppressor of cytokine signaling (SOCS) 1 regulates IL-4-activated insulin receptor substrate (IRS)-2 tyrosine phosphorylation in monocytes and macrophages via the proteasome. Journal of Biological Chemistry. 2016; p. jbc–M116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Perlmann T, Jansson L. A novel pathway for vitamin A signaling mediated by RXR heterodimerization with NGFI-B and NURR1. Genes & Development. 1995;9(7):769–782. 10.1101/gad.9.7.769 [DOI] [PubMed] [Google Scholar]
  • 61. Zhao Wx, Tian M, Zhao Bx, Li Gd, Liu B, Zhan Yy, et al. Orphan receptor TR3 attenuates the p300-induced acetylation of retinoid X receptor-α. Molecular Endocrinology. 2007;21(12):2877–2889. 10.1210/me.2007-0107 [DOI] [PubMed] [Google Scholar]
  • 62. Peters J. Causality: Lecture Notes. ETH Zurich: ETH Zurich; 2015. [Google Scholar]
  • 63. Jo B, He Y, Strober BJ, Parsana P, Aguet F, Brown AA, et al. Distant regulatory effects of genetic variation in multiple human tissues. bioRxiv. 2016;. [Google Scholar]
  • 64. Chui PC, Guan HP, Lehrke M, Lazar MA. PPARγ regulates adipocyte cholesterol metabolism via oxidized LDL receptor 1. The Journal of Clinical Investigation. 2005;115(8):2244–2256. 10.1172/JCI24130 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Arslan C, Bayoglu B, Tel C, Cengiz M, Dirican A, Besirli K. Upregulation of OLR1 and IL17A genes and their association with blood glucose and lipid levels in femoropopliteal artery disease. Experimental and Therapeutic Medicine. 2017;13(3):1160–1168. 10.3892/etm.2017.4081 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Palmieri VO, Coppola B, Grattagliano I, Casieri V, Cardinale G, Portincasa P, et al. Oxidized LDL receptor 1 gene polymorphism in patients with metabolic syndrome. European Journal of Clinical Investigation. 2013;43(1):41–48. 10.1111/eci.12013 [DOI] [PubMed] [Google Scholar]
  • 67. Oh S, Joo H. LOX-1 boosts immunity. Oncotarget. 2015;6(26):21763 10.18632/oncotarget.4756 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Joo H, Li D, Dullaers M, Kim TW, Duluc D, Upchurch K, et al. C-type lectin-like receptor LOX-1 promotes dendritic cell-mediated class-switched B cell responses. Immunity. 2014;41(4):592–604. 10.1016/j.immuni.2014.09.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Overstreet MG, Gaylo A, Angermann B, Hughson A, Hyun Ym, Lambert K, et al. Inflammation-induced effector CD4+ T cell interstitial migration is alpha-v integrin dependent. Nature Immunology. 2013;14(9):949 10.1038/ni.2682 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Ling J, Singhal A, Lopez-Dee ZP, Porreca B, Sprague T. Snai2 is a new target to mediate glucocorticoid signaling on breast cancer cell migration. In: Proceedings of the American Association of Cancer Research Annual Meeting, July 2018. vol. 78. AACR; 2018.
  • 71. Dubois MJ, Bergeron S, Kim HJ, Dombrowski L, Perreault M, Fournès B, et al. The SHP-1 protein tyrosine phosphatase negatively modulates glucose homeostasis. Nature Medicine. 2006;12(5):549 10.1038/nm1397 [DOI] [PubMed] [Google Scholar]
  • 72. Eriksen KW, Woetmann A, Skov L, Krejsgaard T, Bovin LF, Hansen ML, et al. Deficient SOCS3 and SHP-1 expression in psoriatic T cells. Journal of Investigative Dermatology. 2010;130(6):1590–1597. 10.1038/jid.2010.6 [DOI] [PubMed] [Google Scholar]
  • 73. Christophi GP, Panos M, Hudson CA, Christophi RL, Gruber RC, Mersich AT, et al. Macrophages of multiple sclerosis patients display deficient SHP-1 expression and enhanced inflammatory phenotype. Laboratory Investigation. 2009;89(7):742 10.1038/labinvest.2009.32 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45(D1):D353–D361. 10.1093/nar/gkw1092 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Lieber M, Todaro G, Smith B, Szakal A, Nelson-Rees W. A continuous tumor-cell line from a human lung carcinoma with properties of type II alveolar epithelial cells. International Journal of Cancer. 1976;17(1):62–70. 10.1002/ijc.2910170110 [DOI] [PubMed] [Google Scholar]
  • 76. Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell. 2016;165(3):535–550. 10.1016/j.cell.2016.03.014 [DOI] [PubMed] [Google Scholar]
  • 77. De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nature Reviews Microbiology. 2010;8(10):717 10.1038/nrmicro2419 [DOI] [PubMed] [Google Scholar]
  • 78. Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences. 2010;107(14):6286–6291. 10.1073/pnas.0913357107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Uygun S, Peng C, Lehti-Shiu MD, Last RL, Shiu SH. Utility and limitations of using gene expression data to identify functional associations. PLOS Computational Biology. 2016;12(12):e1005244 10.1371/journal.pcbi.1005244 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Qiu X, Rahimzamani A, Wang L, Mao Q, Durham T, McFaline-Figueroa JL, et al. Towards inferring causal gene regulatory networks from single cell expression measurements. Cell Systems. 2020; 10(3):265–274.e11. 10.1016/j.cels.2020.02.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Deshpande A, Chu LF, Stewart R, Gitter A. Network Inference with Granger Causality Ensembles on Single-Cell Transcriptomic Data. BioRxiv. 2019; p. 534834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72(4):417–473. 10.1111/j.1467-9868.2010.00740.x [DOI] [Google Scholar]
  • 83.Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. ACM; 2006. p. 233–240.
  • 84. Marbach D, Schaffter T, Floreano D, Prill RJ, Stolovitsky G. The DREAM4 In-silico Network Challenge: Training data, gold standards, and supplementary information; 2009. http://gnw.sourceforge.net/resources/DREAM4%20in%20silico%20challenge.pdf. [Google Scholar]
  • 85. Leek JT, Storey JD. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLOS Genetics. 2007;3(9):1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, et al. Ensembl 2018. Nucleic Acids Research. 2017;46(D1):D754–D761. 10.1093/nar/gkx1098 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87. Gao C, Zhao S, McDowell IC, Brown CD, Engelhardt BE. Context-specific and differential gene co-expression networks via Bayesian biclustering models. PLOS Computational Biology. 2016;12:e1004791 10.1371/journal.pcbi.1004791 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28(10):1353–1358. 10.1093/bioinformatics/bts163 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences. 2003;100(16):9440–9445. 10.1073/pnas.1530509100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90. Meyer PE, Lafitte F, Bontempi G. minet: AR/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008;9(1):461 10.1186/1471-2105-9-461 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008223.r001

Decision Letter 0

Thomas Lengauer, Christina S Leslie

16 Sep 2019

Dear Dr Engelhardt,

Thank you very much for submitting your manuscript 'Causal network inference from gene transcription time series response to glucocorticoids' for review by PLOS Computational Biology. Your manuscript has been fully evaluated by the PLOS Computational Biology editorial team and in this case also by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial critiques of the manuscript as it currently stands. While your manuscript cannot be accepted in its present form, we are willing to consider a revised version in which the issues raised by the reviewers have been adequately addressed. We cannot, of course, promise publication at that time.

In your revision, note in particular the suggestions for method comparison made by Reviewer #1 to demonstrate the practical advantages of the novel features of BETS, as well as technical and model interpretation questions from both reviewers. Please also clarify if a new RNA-seq data set has been generated for this study and if it has been deposited in a public repository.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

Your revisions should address the specific points made by each reviewer. Please return the revised version within the next 60 days. If you anticipate any delay in its return, we ask that you let us know the expected resubmission date by email at ploscompbiol@plos.org. Revised manuscripts received beyond 60 days may require evaluation and peer review similar to that applied to newly submitted manuscripts.

In addition, when you are ready to resubmit, please be prepared to provide the following:

(1) A detailed list of your responses to the review comments and the changes you have made in the manuscript. We require a file of this nature before your manuscript is passed back to the editors.

(2) A copy of your manuscript with the changes highlighted (encouraged). We encourage authors, if possible, to show clearly where changes have been made to their manuscript, e.g. by highlighting text.

(3) A striking still image to accompany your article (optional). If the image is judged to be suitable by the editors, it may be featured on our website and might be chosen as the issue image for that month. These square, high-quality images should be accompanied by a short caption. Please note as well that there should be no copyright restrictions on the use of the image, so that it can be published under the Open-Access license and be subject only to appropriate attribution.

Before you resubmit your manuscript, please consult our Submission Checklist to ensure your manuscript is formatted correctly for PLOS Computational Biology: http://www.ploscompbiol.org/static/checklist.action. Some key points to remember are:

- Figures uploaded separately as TIFF or EPS files (if you wish, your figures may remain in your main manuscript file in addition).

- Supporting Information uploaded as separate files, titled Dataset, Figure, Table, Text, Protocol, Audio, or Video.

- Funding information in the 'Financial Disclosure' box in the online system.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see here

We are sorry that we cannot be more positive about your manuscript at this stage, but if you have any concerns or questions, please do not hesitate to contact us.

Sincerely,

Christina S. Leslie

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: This manuscript presents the Bootstrap Elastic net regression from Time Series (BETS) algorithm, a network inference approach for time series gene expression data. The core question in time series-based network inference is how to use past information from regulators to detect their influence on the later expression levels of target genes. BETS adopts a Granger causality approach to address this question. Although Granger causality has been widely applied to time series gene expression data, there are multiple characteristics that differentiate BETS from prior work. The most notable is that BETS provides a false discovery rate (FDR) framework to rank and select network edges. Because it operates with temporal data, permuting the time points generates a convenient dataset in which the true regulator-target dependencies should be destroyed. Running BETS on the permuted data then creates a null distribution of regression coefficients or edge frequencies that can be used for FDR calculations. This FDR evaluation is combined with traditional bootstrapping and stability selection to make BETS robust to small sample sizes and false positive associations that are problematic in short time series. In addition, BETS incorporates an elastic net regression penalty and automated hyperparameter tuning, which are not implemented in many prior Granger causality methods.
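
The permutation-based FDR idea summarized above can be sketched as follows. This is a simplified illustration, not the BETS implementation: it assumes one score per candidate edge (for example, a bootstrap selection frequency) from the real data and one from the time-permuted data, and it estimates the FDR at a threshold as the number of null scores at or above it divided by the number of observed scores at or above it. The function name `fdr_threshold` and the toy scores are hypothetical.

```python
def fdr_threshold(observed, null, target_fdr=0.2):
    """Return the smallest threshold t (taken from the observed scores) whose
    estimated FDR, (#null >= t) / (#observed >= t), is at most target_fdr.

    Assumes `observed` and `null` score the same number of candidate edges.
    Returns None if no threshold meets the target.
    """
    best = None
    for t in sorted(set(observed), reverse=True):
        n_obs = sum(1 for x in observed if x >= t)   # always >= 1 here
        n_null = sum(1 for x in null if x >= t)
        if n_null / n_obs <= target_fdr:
            best = t  # keep lowering the threshold while the target holds
    return best

# Toy scores for eight candidate edges (hypothetical values).
observed = [0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.1, 0.05]
null = [0.55, 0.3, 0.25, 0.15, 0.1, 0.08, 0.05, 0.02]
threshold = fdr_threshold(observed, null, target_fdr=0.2)
selected = [x for x in observed if x >= threshold]
```

In this toy case five edges pass at an estimated FDR of 0.2, which shows how the null scores, rather than an arbitrary cutoff, determine the size of the selected network.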

In general, the presentation (figures, writing, review of related work, etc.), methodological choices, and analyses are superb. The synthesis of different types of time series network inference techniques in Figure S1 is an excellent high-level overview of the field. The evaluation on the simulated data DREAM4 challenge is valuable, demonstrating that BETS performs reasonably well on this popular benchmarking dataset and putting BETS in the context of many other types of network inference tools. There is a sizeable AUPR performance gap between BETS and the top method. However, this is not a concern because there are other advantages of BETS besides its AUPR on a simulated data set (e.g. parallelism, speed, and statistical rigor). Furthermore, other work in the gene network inference field has shown that AUPR on simulated data does not reflect the challenges of network inference in real mammalian systems, so maximizing AUPR on the DREAM4 data should not be the main objective of new methods.

Therefore, the glucocorticoid case study is more relevant and interesting. Because there are not complete gold standard networks available for evaluating condition-specific human regulatory networks, the BETS predictions are assessed with two independent datasets. Overexpression of 10 transcription factors demonstrates that there is general agreement between the gene expression changes induced by overexpression and the edge signs predicted by BETS. The BETS predictions are not perfect, as four transcription factors have positive effects enriched among negative predicted edges. In addition, trans-eQTL analysis of GTEx lung gene expression data reveals many new trans-eQTLs that are nominated by the BETS network edges. These demonstrate how BETS predictions can be used to gain biological insights.

One weakness with respect to the method's originality and relevance to the network inference field is that there is not a direct analysis demonstrating that the novel methodological aspects of BETS have a practical impact. The DREAM4 results are 10 years old and do not reflect the state of the art in Granger causality analysis. Only BETS was run on the glucocorticoid expression data. Conceptually, the null distributions and FDR-based approach should improve network quality, but this claim is not directly assessed.

Major comments:

1) Related to the comment above, the manuscript would be improved by more specifically demonstrating to readers why they should use BETS over other modern Granger causality approaches. For instance, if the FDR framework is the main appeal, can it directly demonstrate the advantages of having a principled way to select the size of a network? If it is the scalability and parallelization, can the high-throughput software pipeline be made more robust and user friendly (see below)? There is a comparison with elastic net regression without stability selection, but ensembling or stability selection is now commonplace in network inference. The most closely related Granger causality work that shares some features with BETS is not included in the DREAM4 benchmarking. Examples of closely related methods include the last two methods in Section 2.1.6 of the supplement (the references are broken so specific manuscripts are unknown) or SWING (Finkle 2018 doi:10.1073/pnas.1710936115), which was not referenced.

2) The analyses all fix the lag at L=2. It is unknown whether that lag would still be appropriate for time series data with more time points. The lag can be set by the user, but the relationships between the lag hyperparameter, the length of the time series, and the effectiveness of the FDR procedures have not been assessed.

3) The permutation and bootstrapping concepts in Figure 1 are understandable at a high level. The methodological details are challenging to follow. It appears that a single temporal permutation is made in the outer loop and another is made in the inner loop. Is a single permutation sufficient to obtain a robust null distribution? In addition, if the bootstrapping is done on the matrix form in Equation 10, which has already unrolled or expanded the L previous time points, how is the inner loop permutation performed? The details of the permutation and FDR procedures are difficult to verify.

4) Using the BETS network's edge signs to predict the vector autoregressive model's edge coefficients is a creative way to use the overexpression data to validate the network that accounts for the edge signs. However, there are some unaddressed caveats with this approach. The overexpression data have fewer time points, so a simpler regression model is used (Equation 17). Any errors in that simple model's fits will confound the BETS network assessment. This approach may also ignore false negative edges in the BETS network. An alternative approach would be to focus on the edge directions in the BETS network by estimating the differentially expressed genes in each overexpression experiment using a temporally-aware statistical test. nsgp (Heinonen 2015 doi:10.1093/bioinformatics/btu699) or a similar method may be able to accommodate the different number of time points in the original and overexpression data. Then, these genes can be treated as the targets of that regulator in a pseudo-gold standard network. The predicted edges for that regulator in the BETS network could be evaluated with a precision-recall curve, even if they have very few predicted target genes.

5) The author contributions imply that the RNA-seq data were generated as part of this study. If that is the case, the experimental protocols and methods are incomplete. In addition, the expression data should be made available.
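To make the questions in comments 2 and 3 concrete, the following is a minimal sketch of how a single gene's time series is unrolled into lagged predictors and how a temporal permutation could be applied before that unrolling. It is a generic vector autoregression illustration, not the BETS implementation: the function names, the toy values, and the choice to permute before unrolling are all assumptions made for illustration.

```python
import random

def lagged_rows(series, lag):
    """Unroll one gene's time series into rows of its `lag` previous values.

    The row for time t holds [x[t-1], ..., x[t-lag]]; only t >= lag has a
    complete row, so a series of length T yields T - lag rows.
    """
    return [[series[t - k] for k in range(1, lag + 1)]
            for t in range(lag, len(series))]

def permute_in_time(series, rng):
    """Return a copy of the series shuffled across time points.

    Assumption for this sketch: the shuffle happens BEFORE unrolling, so each
    lagged row is rebuilt from the permuted series.
    """
    shuffled = list(series)
    rng.shuffle(shuffled)
    return shuffled

cause = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # toy expression values, T = 6
lag = 2

design = lagged_rows(cause, lag)
print(len(design))  # number of usable rows, T - lag
print(design[0])    # predictors for t = 2: [x[1], x[0]]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
null_design = lagged_rows(permute_in_time(cause, rng), lag)
```

The sketch shows why the lag and the series length interact: each additional lag removes one usable row per gene, which matters most for short time series.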

Minor comments:

6) The supplement and GitHub readme note that the time points must be approximately equally spaced. This is an important assumption that limits the applicability to irregular time series. It should be stated more clearly in the main text.

7) The supplement describes the global and local null distributions and FDR approaches in excellent detail. It is difficult to link this discussion to the methods that were actually used in the main text. Stating where the global and local versions were used in the main text methods would help connect these discussions.

8) The main text does not explain the biological goals of the glucocorticoid study or why immune and metabolic genes are of interest. It would help guide readers if some of the well-written explanation from the supplement's Sections 1 and 2.3 was moved to the main text results.

9) The discussion notes that applying Granger causality to single-cell pseudotemporal data is a relevant related area. This is indeed an exciting future direction for BETS, but recent preprints have shown that pseudotimes may not have the same information for network inference as bona fide time series data (Qiu 2018 doi:10.1101/426981; Deshpande 2019 doi:10.1101/534834). That related work may guide readers who attempt to apply BETS to pseudotemporal data.

10) The methods describe BIOGRID PPI, which does not appear to have been used in the analyses described in the results.

11) The references in the supplement are broken.

12) The supplement refers to a method named VAR-GEN instead of BETS. Are these the same?

13) Supplemental Figure 7 is difficult to understand. What are the indices and percentages?

14) It is commendable that the software pipelines are available on GitHub with an open source license. Some challenges in running the software are detailed below. In addition, the final version should be archived on Zenodo, Figshare, Software Heritage, or a comparable resource. The Google Drive materials could also be migrated to a permanent repository. The support group https://groups.google.com/forum/#!forum/bets-support displayed an error "You do not have permission to access this content. (#418)".

15) The supplement contains typos "we uses the" (page 9) and "multiple sclrosis patients" (page 29).

Software comments:

The software was tested on Windows 10 with Git for Windows (GNU bash, version 4.4.23(1)-release (x86_64-pc-msys)) in the following Python 2 conda environment:

$ conda create --name bets python=2.7 numpy=1.13 scipy=0.19 pandas=0.20 matplotlib scikit-learn

Python 2's end of life date is in 2020 and support will be dropped by several packages BETS requires (https://python3statement.org/). Porting the code to Python 3 is strongly recommended.

Overall, the software would be much easier to run if it adopted an established pipeline workflow. The five major steps require substantial user intervention even though everything could be automated after the options in package_params_cpipeline.sh are configured. This would also help with eventual cross-platform compatibility (currently macOS is supported) and formal testing to ensure the pipeline can run on a different system.

The BETS pipeline in BETS_tutorial.md did not work in the environment described above. After editing dozens of lines, it was possible to run the code through Step 3 (Fit the model on the original data), but then there were too many scripts to edit. The most common incompatibilities were:

- Assuming 'python2' instead of 'python' to run .py files in the shell scripts

- The 'rB' argument when loading pickled files generated "ValueError: Invalid mode ('rB')". Removing the 'rB' worked.

- The 'wB' argument when writing pickled files needed to be 'w' instead.

- The paths to scripts combined different path separators / and \\

- The 'module load' command is not needed to load Python in many systems (this does not terminate execution though)

- Exporting environment variables as a way to configure BETS can be unreliable.
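Two of the incompatibilities listed above (file modes for pickled files and mixed path separators) can be avoided with standard-library calls. The sketch below is one portable alternative, not the fix applied in BETS: the helper names and the file path are hypothetical, and it uses the lowercase binary modes 'rb' and 'wb', which are valid mode strings in both Python 2 and Python 3.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    # Lowercase "wb" is the binary write mode; uppercase variants such as
    # "wB" are rejected by Python 3's open().
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

def load_pickle(path):
    # Lowercase "rb" is the matching binary read mode.
    with open(path, "rb") as handle:
        return pickle.load(handle)

# os.path.join picks the right separator for the current platform, so the
# script never has to hard-code "/" or "\\".
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "fit", "coefs.pkl")  # hypothetical file name
os.makedirs(os.path.dirname(path), exist_ok=True)

save_pickle({"geneA": 0.5}, path)
restored = load_pickle(path)
```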

Reviewer #2: In the manuscript, Engelhardt and colleagues describe a new method called Bootstrap Elastic net regression from Time Series (BETS) to infer causal gene networks from time-series gene expression data. BETS uses vector autoregression with elastic net regularization to infer causal relationships (directed edges) between genes. The authors benchmark the performance of BETS by comparing their results against those from 21 other methods on the time-series data from the DREAM4 Network Inference Challenge. Assessed using previously used metrics of performance (AUPR, area under the precision recall curve; AUROC, area under the ROC), BETS ranks 6th out of 22 in terms of the AUPR metric (0.13 vs the top-performing CSId @ ~0.2) and ranks 3rd out of 17 in AUROC (0.7 vs the top-performing CSId @0.72). Compared to other top-performing methods, BETS is the fastest (~2-10 times faster), making it an attractive alternative to the top-performing CSId and Jump3. The authors demonstrate the utility of BETS by applying it to previously published time-series expression (RNA-Seq) data on A549 cells (human adenocarcinoma cell line) exposed to dexamethasone (synthetic glucocorticoid). The BETS-inferred causal gene network is validated against orthogonal over-expression datasets.
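For readers less familiar with the two metrics named above, the sketch below computes AUROC and an AUPR estimate for a ranked list of candidate edges. It is an illustration under stated assumptions rather than the DREAM4 scoring code: the scores and labels are toy values, AUPR is estimated by average precision (one common estimator), and tied scores are not treated specially in the AUPR calculation.

```python
def auroc(scores, labels):
    """Probability that a random true edge outscores a random false edge.

    Ties count as half a win. Assumes at least one label of each class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true edge.

    Assumes at least one true edge.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example: four candidate edges, two of which are in the gold standard.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]

roc = auroc(scores, labels)
ap = average_precision(scores, labels)
print(roc, ap)
```

Because true edges are rare in a 100-gene network, AUPR is typically much lower than AUROC for the same ranking, which is why the two metrics can order methods differently.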

The proposed method is technically sound, and the manuscript is generally well written, concise and easy to read. As the authors correctly state, BETS’ advantage over other methods is its speed, with performance comparable to the best-performing methods. BETS would be a nice addition to the arsenal of methods used to infer causal gene networks from time-series expression data. And, the authors have made their method available on GitHub.

Major Points

1. With regard to the text related to the inferred Glucocorticoid response network (on page 8), where the authors describe the inferred network containing 2,768 nodes (genes): It is noted that all 2,768 genes are ‘effect genes’ (had an incoming directed edge), and 466/2,768 genes are ‘causes’ (causal?), defined as nodes with an outward directed edge. If all the genes have an out-degree (outgoing edge), I don’t understand how the authors can define 466 genes that have both incoming and outgoing edges as causal. I would think that only those nodes with one or more out-going edges and no incoming edge are ‘causal.’ The authors need to explicitly clarify what their definition of ‘causal gene’ is because this raises questions about their ‘causal gene network’ definition.

2. The fact that all 2,768 genes have an out-going edge means that the resulting network/graph is not a directed acyclic graph (DAG; network free of cycles) and that it contains strongly connected components (SCCs), defined as sub‐networks where, for every pair of nodes u and v in the sub‐network, there exists a directed path from u to v, and from v to u. Given this, I wonder if a method like Vertex Sort (PMID: 19690563) would be more appropriate to infer the network hierarchy and thus ‘causal genes’.
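The definition of a strongly connected component given above can be illustrated with a short sketch. It uses Kosaraju's two-pass depth-first search on a toy edge list; it is neither the Vertex Sort method nor anything from BETS, and the gene names are placeholders. The recursive traversal is adequate for a small example but would need an iterative version for a network with thousands of genes.

```python
def strongly_connected_components(edges):
    """Return the strongly connected components of a directed graph."""
    graph, reverse = {}, {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
        reverse.setdefault(v, []).append(u)
        reverse.setdefault(u, [])

    # Pass 1: record nodes in order of DFS completion on the original graph.
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in graph:
        if node not in seen:
            visit(node)

    # Pass 2: walk the reversed graph in reverse completion order; each
    # traversal collects exactly one strongly connected component.
    components, assigned = [], set()

    def collect(node, comp):
        assigned.add(node)
        comp.add(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, comp)

    for node in reversed(order):
        if node not in assigned:
            comp = set()
            collect(node, comp)
            components.append(comp)
    return components

# Toy network: A -> B -> C -> A forms a cycle; D is only a downstream target.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
components = strongly_connected_components(edges)
print(components)
```

In this toy network the three genes on the cycle form one component and the downstream target forms its own, which is the kind of grouping a hierarchy-inference method would start from.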

Minor Points:

1. I recommend the authors consider using “time-series” instead of “time series.”

2. The authors have been objective overall in describing/interpreting their results/findings. Line 184 on page 6 states that “BETS had a slightly lower AUPR compared with CSId (~0.12 vs 0.20).” I take issue with ‘slightly’ since CSId is ~65% better than BETS w.r.t. this metric. I would just remove ‘slightly.’
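For reference, the relative difference implied by the two AUPR values quoted in this comment can be worked out as follows. The inputs are the rounded figures from the comment (approximately 0.12 and 0.20), so the result is only approximate.

```latex
\[
\frac{\mathrm{AUPR}_{\mathrm{CSId}} - \mathrm{AUPR}_{\mathrm{BETS}}}{\mathrm{AUPR}_{\mathrm{BETS}}}
\approx \frac{0.20 - 0.12}{0.12} \approx 0.67
\]
```

That is, CSId's AUPR is roughly two thirds higher than that of BETS, in line with the reviewer's figure of about 65%.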

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: The manuscript states "All files have been or are being submitted to the Gene Expression Omnibus under the same study name, including accession number GSE91208." However, that accession number only lists DNase-seq data, not RNA-seq data. Reviewer access links to any private data used in this study should be made available.

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008223.r003

Decision Letter 1

Thomas Lengauer, Christina S Leslie

11 May 2020

Dear Dr. Engelhardt,

Thank you very much for submitting your manuscript "Causal network inference from gene transcription time-series response to glucocorticoids" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript and accompanying software and materials according to the review recommendations.

In particular, Reviewer #1 has concerns about the usability of the BETS software, as this reviewer tried and was not able to run the package.  Minor additional concerns include ensuring the availability of the overexpression data by depositing in GEO as well as the source code and supplementary materials by placing in a suitable repository.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Christina S. Leslie

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************


Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have made substantial revisions that greatly enhance the manuscript and address almost all of my comments with the initial submission. These include clarifications in the Methods text, demonstration of robustness to the lag parameter values, results from new network inference algorithms, and another approach for evaluating the glucocorticoid predictions. Even though not all of the quantitative results favor the BETS algorithm, the additional results strengthen the manuscript. The authors have conducted fair and objective evaluations instead of skewing the results and language to only highlight advantages of BETS. For instance, they are honest in reporting that SWING-RF is now the top performer in the DREAM dataset and the nsgp-based glucocorticoid evaluation shows the BETS predictions are not significantly better than random guessing. (Those glucocorticoid predictions are still supported by two other analyses.) All network inference algorithms have some limitations, so the objective assessment will raise readers' confidence in the reported results and improves the manuscript overall.

The main reason I am enthusiastic about this manuscript is that the false discovery rate framework is a major and novel contribution to the network inference field. Figure 2 shows that BETS substantially improves upon other vector autoregressive network inference algorithms (including the new SWING-Lasso) in the DREAM evaluation. BETS remains very good at identifying trans-eQTLs. In addition, the analyses and results are rigorous, and the manuscript is very polished overall.

The only remaining major concern is the usability of the BETS software, as detailed below.

Major comments:

1) I am still unable to run the BETS pipeline. This time I tried running BETS in Python 3 on a Linux server. I created a fresh conda environment with

$ conda create --name bets python=3 numpy=1.13 scipy=0.19 pandas=0.20 matplotlib scikit-learn

There were fatal errors within the first few minutes of running the pipeline described in BETS_tutorial.md

- The line 'export NULL=g' comes after $NULL is used in package_params_cpipeline.sh so the directory names are incorrect

- prep_jobs_bootstrap.sh still uses the 'python2' command to run the Python script so it does not work in a Python 3 environment. Because this command is run within a script, aliasing python2 to point to python3 does not work.

Because the software is an important part of the contribution, the authors should ensure the pipeline can run in a fresh Python environment. One way to guarantee this would be to run it in a continuous integration service like Travis CI or GitHub Actions.
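The two fatal errors described above have generic remedies that can be sketched in Python. This is an illustration under stated assumptions, not a patch to BETS: `run_step`, `output_dir`, and the parameter names are hypothetical. The first function launches a child script with the interpreter of the active environment instead of a hard-coded 'python2' command; the second builds a directory name only from settings that have already been defined, so a missing setting raises an error instead of silently producing a wrong name.

```python
import subprocess
import sys

def run_step(script_args):
    """Run a pipeline step with the interpreter running this process.

    sys.executable is the active environment's Python, so no 'python2' or
    'python' command name needs to exist on PATH.
    """
    return subprocess.run(
        [sys.executable] + list(script_args),
        check=True,           # raise if the step exits with an error
        capture_output=True,  # requires Python 3.7 or later
        text=True,
    )

def output_dir(params):
    """Build a directory name from settings that must already be defined.

    A missing key raises KeyError rather than yielding an empty name part.
    """
    return "run-null-{}-lag-{}".format(params["null"], params["lag"])

# Stand-in for a real pipeline script: report the child's major version.
result = run_step(["-c", "import sys; print(sys.version_info[0])"])
child_major = int(result.stdout.strip())

params = {"null": "g", "lag": 2}  # hypothetical settings, defined first
run_dir = output_dir(params)
print(child_major, run_dir)
```

A smoke test of this kind is also the sort of check a continuous integration service could run in a fresh environment on every commit.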

Minor comments:

2) To keep the discussion balanced, some of the new negative results could be included alongside the existing positive results. SWING-RF is very fast and accurate (Table S3), but it is not included in the runtime discussion on line 191. BETS is substantially faster than CSId and Jump3 but not SWING-RF. The Discussion paragraph starting at line 379 only focuses on the positive validations for the glucocorticoid network but ignores the nsgp-based results.

3) The editors should ensure the overexpression data is available on GEO before publication.

4) The source code and supplementary materials on Google Drive should be archived in a more permanent repository, even if they are > 100 GB. The NIH figshare instance allows 100 GB of storage, and researchers can request more (https://nih.figshare.com/f/faq). Zenodo has 50 GB by default, but more is available upon request (https://about.zenodo.org/policies/). PLOS Computational Biology partners with Dryad (https://datadryad.org/stash/publishing_charges) and offers a 300 GB limit (https://datadryad.org/stash/faq).

5) Some inconsistencies and typos have been introduced over the different versions of the manuscript:

- References to 'STAR Methods' remain

- Some text describes 340 trans-eQTLs, other text states 341

- Line 442 states 'In BETS, L = 2.' but now the results include multiple values of L, so this could state L = 2 is the default

- Line 476 'that the the'

- Line 576 states 'reference numbers listed in Supplementary Table 3' but that table contains other data

- The 'DREAM benchmarking' section (line 601) omits the new SWING methods

- Line 638 still refers to STRING interactions

- Line 799 'gene g \\in \\not tf a random score' is missing a word

Reviewer #2: The authors have satisfactorily addressed the reviewers' comments.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: Not yet. The accession number GSE91208 is listed but does not seem relevant. The GEO accession numbers for the 100 nM dexamethasone treatment time course have now been provided. The GEO uploads for the overexpression data are still in progress. The Google Drive data should still be archived as well.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008223.r005

Decision Letter 2

Thomas Lengauer, Christina S Leslie

7 Aug 2020

Dear Dr. Engelhardt,

We are pleased to inform you that your manuscript 'Causal network inference from gene transcription time-series response to glucocorticoids' has been provisionally accepted for publication in PLOS Computational Biology. Please note the reviewer's additional comments, though.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Christina S. Leslie

Associate Editor

PLOS Computational Biology

Thomas Lengauer

Methods Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my previous comments about the manuscript, data, and software. I was able to access the GEO datasets and spot checked the expression data. The supplementary results and files are now archived on Zenodo. Finally, I confirmed that I could run BETS on a Linux server in the conda environment described in the previous review. I have no other concerns and remain enthusiastic about this research.

One last comment is that the Zenodo file port-from-della.zip contains many files that could be removed. For instance, I noticed

- drive-download-20200629T235020Z-001.zip

- ProbGenReceipt_2016.pdf

- Goldwater_ResearchEssay_1_25_17.docx

- BACKUP* files

- The code/ subdirectory

- Many other files in the presentations/ subdirectory that aren't referenced in 'Full Progeny.xlsx'

The Zenodo dataset can be updated at any time by uploading a new version of the zip file, so this suggestion does not impact my recommendation for the manuscript.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008223.r006

Acceptance letter

Thomas Lengauer, Christina S Leslie

22 Jan 2021

PCOMPBIOL-D-19-01120R2

Causal network inference from gene transcription time-series response to glucocorticoids

Dear Dr Engelhardt,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Jutka Oroszlan

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Overview of gene regulatory network inference methods.

    Panels show each inference method applied to a cause gene g’ (blue, solid) and an effect gene g (blue, dotted). A) Mutual information is computed between the cause and effect. B) The effect’s expression is fit as an autoregression from the cause’s past expression. C) The effect’s expression is fit as a differential equation from the cause’s current expression. D) The effect’s expression is fit as a decision tree function of the cause’s past expression. E) The space of dynamic causal networks is searched, with linear relationships between cause and effect. F) The space of dynamic causal networks is searched, with nonlinear relationships between cause and effect.

    (PDF)

    S2 Fig. Causal gene expression and effect gene residuals from experimentally validated interactions.

    On the y-axes are the effect gene residual expression values after subtracting the effects of all other covariates. Axes are in units of ln(TPM). Related to Fig 5 and S7 Table.

    (PDF)

    S1 Table. DREAM4 100-gene network inference results, AUPR.

    DBN is dynamic Bayesian network, DT is decision tree, GP is Gaussian process, MI is mutual information, ODE is ordinary differential equation, VAR is vector autoregression. The references that reported ebdbnet, ScanBMA, and LASSO did not provide AUPR values for individual networks. Algorithms that were run in-house were ARACNE, BETS, CLR, CSId, Enet, Jump3, MRNET, SWING-Lasso, and SWING-RF. Where reported literature values were available, they were consistent with these values. Values for CSIc, G1DBN, GCCA, GP4GRN, TSNI, VBSSMa and VBSSMb were taken from [49]. Values for ebdbnet, LASSO, and ScanBMA were taken from [40]. Values for dynGENIE3, GENIE3, OKVAR-Boost and tl-CLR were taken from [37]. Values for Inferelator and Jump3 were taken from [36]. Related to Fig 2.

    (DOCX)

    S2 Table. DREAM4 100-gene network inference results, AUROC.

    DBN is dynamic Bayesian network, DT is decision tree, GP is Gaussian process, MI is mutual information, ODE is ordinary differential equation, VAR is vector autoregression. The references that reported ebdbnet, ScanBMA, and LASSO did not provide AUROC values for individual networks. Algorithms that were run in-house were ARACNE, BETS, CLR, CSId, Enet, Jump3, MRNET, SWING-Lasso, and SWING-RF. Values for CSIc, G1DBN, GCCA, GP4GRN, TSNI, VBSSMa and VBSSMb were taken from [49]. Values for ebdbnet, LASSO, and ScanBMA were taken from [40]. Related to Fig 2.

    (DOCX)

    S3 Table. Results of in-house algorithms on DREAM4 100-gene network inference.

    AUPR, AUROC, and Time indicate average AUPR, AUROC, and time over the five networks, respectively. BETS and Enet are in bold to indicate that they are our own developed methods, based on vector autoregression. SWING-RF [51] and Jump3 [36] are decision tree methods. CSId is a Gaussian process method [44]. CLR [27], MRNET [90], and ARACNE [29] are mutual information methods. SWING-Lasso is a vector autoregression method [51]. Related to Fig 2.

    (DOCX)

    S4 Table. Improvement on DREAM4 100-gene network inference from bootstrap.

    For each AUROC or AUPR column, the average is the listed value and the standard deviation is listed in parentheses. “Coefficient” denotes the result when ranking edges by their fitted coefficient, as in the original method. “Bootstrap” denotes the result when ranking edges by the frequency with which they appear in the bootstrap networks.

    (DOCX)

    S5 Table. Dependency of BETS performance on bootstrap samples.

    DREAM results reported for running BETS on both 100 and 1000 bootstrap samples. All values in the columns are averages and the parenthetical values are standard deviations across the 5 DREAM4 Networks. The 1000 samples row is bolded because 1000 samples is the default setting. These use zero-mean normalization, lag 2, and the elastic net penalty. Related to Fig 2.

    (DOCX)

    S6 Table. Enrichment of edges between specific gene classes in inferred causal network.

    A Fisher’s Exact Test was performed, where the rows of the contingency table were whether or not an edge was of the edge type, and the columns were whether or not the edge was part of the inferred network. Related to Fig 3.

    (DOCX)

    S7 Table. Gene pair information from Fig 5.

    Shown Data Set indicates whether the gene temporal profiles in Fig 5 are taken from the original exposure data or unperturbed data. The edge type indicates the gene class of the causal and effect gene; for example, I → M indicates an edge from an Immune causal gene to a Metabolic effect gene. I = Immune; M = Metabolic; T = Transcription Factor; A = Any gene. Related to Fig 5.

    (DOCX)

    S1 Text. Supplemental information.

    Additional overview and analyses.

    (DOCX)
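The Fisher's Exact Test described for S6 Table above can be sketched on a 2x2 contingency table as follows. This is a minimal illustration, not the authors' analysis code: the counts are toy values, the function name is hypothetical, and the one-sided (enrichment) alternative is an assumption, since the caption does not state which alternative was used.

```python
from math import comb  # available in Python 3.8 and later

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment.

    Table layout (rows: edge is of the given type or not; columns: edge is
    in the inferred network or not):

                       in network   not in network
        of edge type        a             b
        other edges         c             d

    Returns the hypergeometric probability of seeing at least `a` edges of
    the given type in the network, with all margins held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k)
               for k in range(a, k_max + 1))
    return tail / total

# Toy counts: 3 of 4 edges of the given type are in the network, against
# 1 of 4 other edges.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```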

    Attachment

    Submitted filename: BETS_r2r.pdf

    Attachment

    Submitted filename: bets_r2r_v2.pdf

    Data Availability Statement

    All files have been submitted to the Gene Expression Omnibus, accession number: GSE91208. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE91208 All other data are contained in the manuscript and the Supporting information.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
