Author manuscript; available in PMC: 2018 May 24.
Published in final edited form as: Cell Syst. 2017 May 24;4(5):543–558.e8. doi: 10.1016/j.cels.2017.04.010

Inference and evolutionary analysis of genome-scale regulatory networks in large phylogenies

Christopher Koch 1,*, Jay Konieczka 2,*, Toni Delorey 2, Ana Lyons 3, Amanda Socha 4, Kathleen Davis 5, Sara A Knaack 6, Dawn Thompson 2, Erin K O'Shea 7,8,9,10, Aviv Regev 2,11, Sushmita Roy 6,12
PMCID: PMC5515301  NIHMSID: NIHMS873650  PMID: 28544882

Abstract

Changes in transcriptional regulatory networks can significantly contribute to species evolution and adaptation. However, identification of genome-scale regulatory networks is an open challenge, especially in non-model organisms. Here, we introduce multi-species regulatory network learning (MRTLE), a computational approach that uses phylogenetic structure, sequence-specific motifs, and transcriptomic data to infer the regulatory networks in different species. Using simulated data from known networks and transcriptomic data from six divergent yeasts, we demonstrate that MRTLE predicts networks with greater accuracy than existing methods because it incorporates phylogenetic information. We used MRTLE to infer the structure of the transcriptional networks that control the osmotic stress responses of divergent, non-model yeast species, then validated our predictions experimentally. Interrogating these networks reveals that gene duplication promotes network divergence across evolution. Taken together, our approach facilitates study of regulatory network evolutionary dynamics across multiple poorly studied species.

Graphical abstract

A new computational method for genome-scale regulatory network inference that uses phylogenetic structure, sequence-specific motifs and transcriptomes from diverse species. Here, it is used to study the evolution of stress-specific regulatory networks in six ascomycete yeasts.


Introduction

Transcriptional regulatory networks are key components of cellular information processing and transmit upstream signals to affect downstream context-specific expression patterns. Such networks are defined by connections of regulators such as transcription factors and signaling proteins to target genes (Kim et al., 2009). Changes in transcriptional regulatory networks have been repeatedly shown to contribute to phenotypic diversity of organisms (King & Wilson, 1975; Romero et al., 2012; Carroll, 2000; Wittkopp, 2007). However, our understanding of how regulatory networks evolve and impact complex phenotypes has been limited to a handful of transcription factors in a few species (Borneman et al., 2007; Tuch et al., 2008; Schmidt et al., 2010; Odom et al., 2007). An improved understanding of regulatory network evolution requires a systematic framework for both mapping global regulatory networks in multiple species as well as comparing the networks across species.

While significant effort has been invested in identifying regulatory networks in individual model organisms such as S. cerevisiae (Hughes & de Boer, 2013; Harbison et al., 2004; Macisaac et al., 2006) and E. coli (Faith et al., 2007), an open challenge is to identify these networks in newly sequenced species and compare networks across species. Recently, several comparative functional genomic studies have measured genome-wide mRNA levels in multiple species (Brawand et al., 2011; Thompson et al., 2013; Brawand et al., 2014). These quantitative datasets serve as “readouts” of the network state and provide the opportunity to comprehensively study how regulatory networks convert environmental signals into species-specific phenotypes and change globally across species. However, there are two major challenges that need to be overcome. First, most successful network reconstructions have used hundreds of samples, whereas the available data for each species in a comparative study is restricted to a few dozen samples. Second, to understand the role of regulatory network evolution in species evolution, regulatory networks need to be inferred for a complex phylogeny consisting of a sufficiently large number of species. Incorporating the phylogenetic structure enables us to account for the inherent relatedness of species based on their DNA sequence composition, to trace the evolution of individual regulatory connections (edges) at different points on the phylogeny, and to compare the relative contribution of sequence and network divergence to phenotypic divergence. A large phylogeny is important to be able to systematically observe patterns of conservation and divergence and to study different factors such as gene duplication that can contribute to regulatory network divergence.
Existing approaches to infer regulatory networks for multiple species have either not attempted to explicitly model the phylogeny of the species involved (Penfold et al., 2015; Joshi et al., 2014), or their applications have been restricted to two or three species (Xie et al., 2011; Penfold et al., 2015). Extending such approaches to infer genome-scale networks for a large phylogeny with complex orthologies can be computationally expensive. While a number of studies have compared gene expression profiles across multiple species (Bergmann et al., 2003; Ihmels et al., 2005; Kristiansson et al., 2013; Roy et al., 2013b), these approaches typically identify gene modules that are conserved or diverged across species and do not provide fine-grained regulatory network connectivity information. Such information is critical to identify specific regulatory connections evolution must have made and broken as the species diverged.

In this paper, we develop a probabilistic graphical model-based method, Multi-species Regulatory neTwork LEarning (MRTLE), that uses a phylogenetic framework to infer regulatory networks in multiple species simultaneously. In MRTLE, the regulatory network of each species is modeled as a probabilistic graphical model (Friedman, 2004) and the phylogenetic information is incorporated by specifying a prior probability distribution over edge gain and loss from the ancestral to extant species. We use the ascomycete yeasts as a model system to study the evolution of regulatory networks, and validate MRTLE using simulations, available reconstructions of the yeast S. cerevisiae network, and available ChIP-chip based TF datasets in other species. MRTLE reconstructs networks better than approaches that do not incorporate phylogenetic information, while also inferring networks that diverge in a manner consistent with the phylogeny of the species involved. We use our inferred networks to identify regulators with evolutionarily conserved roles in stress-related repression and induction across ascomycete yeasts. In total, our computational framework of simultaneously inferring regulatory networks for multiple species and assessing regulatory network divergence enables a systematic study of evolution of gene regulatory networks in a complex phylogeny.

Results

Inference and analysis of regulatory networks in multiple species using MRTLE

We developed a multi-species network inference algorithm called MRTLE that imposes a phylogenetically-motivated prior distribution on a set of graphs, each graph describing the regulatory network of a species (Figure 1A, Methods). The prior distribution encodes the belief that regulatory networks diverge according to the phylogeny, that is, the regulatory networks of species that are phylogenetically closer are likely to be more similar. This probability is in turn defined over individual edge states (RijZ in Figure 1A): the state of an edge in a given species Z is modeled as a function of its state in A, the immediate ancestor of Z, as P(RijZ|RijA). This is modeled as a continuous time Markov process parameterized by the rate matrix Q, which specifies the rates at which we expect regulators to gain or lose targets per unit time, and the branch length tz, which specifies the divergence time between species Z and its immediate ancestor A. This parameterization allows the probability of edge gain and loss to be branch-specific (Hobolth & Jensen, 2005; Garber et al., 2009; Habib et al., 2012). Each regulatory network is modeled by a dependency network, a special type of Probabilistic Graphical Model (PGM) (Friedman, 2004). A PGM has two components: the graph structure (Figure 1A, GX, GY, GZ) and parametric functions (Figure 1A, ψX, ψY, ψZ). Each node in the graph corresponds to a random variable that encodes the expression level of a gene. The graph structure specifies the regulators of each gene, while the parameters of the graph specify how the regulator levels determine the output expression level.
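To make the branch-specific prior concrete, the following is a minimal sketch of P(RijZ|RijA) under the assumption that each edge has only two states (absent or present) and that Q holds a single gain rate and a single loss rate. Under that assumption the continuous time Markov process has a closed-form solution. The function name, the example rates, and the branch lengths are hypothetical and are not taken from the MRTLE implementation.

```python
import math

def edge_transition_probs(gain_rate, loss_rate, branch_length):
    """Two-state CTMC transition probabilities, keyed by (ancestor_state, child_state).

    State 0 means the regulatory edge is absent, state 1 means it is present.
    Assumes a rate matrix Q = [[-gain, gain], [loss, -loss]].
    """
    total = gain_rate + loss_rate
    if total <= 0:
        raise ValueError("gain_rate + loss_rate must be positive")
    decay = math.exp(-total * branch_length)
    p_gain = (gain_rate / total) * (1.0 - decay)  # P(present | absent in ancestor)
    p_loss = (loss_rate / total) * (1.0 - decay)  # P(absent | present in ancestor)
    return {
        (0, 0): 1.0 - p_gain, (0, 1): p_gain,
        (1, 0): p_loss, (1, 1): 1.0 - p_loss,
    }

# Hypothetical rates and branch lengths, for illustration only.
short_branch = edge_transition_probs(gain_rate=0.5, loss_rate=2.0, branch_length=0.1)
long_branch = edge_transition_probs(gain_rate=0.5, loss_rate=2.0, branch_length=1.0)
print(short_branch[(1, 1)], long_branch[(1, 1)])
```

With the same rates, an edge present in the ancestor is more likely to be retained along the short branch than along the long one, which is the sense in which closely related species are expected to have more similar networks.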

Figure 1. Overview of the MRTLE learning algorithm and results on simulated data.


A. The MRTLE algorithm takes as input a phylogenetic tree relating multiple extant species, expression data for each extant species, and optionally sequence-specific transcription factor binding motifs for each species. MRTLE uses the phylogenetic tree and motif instances as prior knowledge, and outputs multiple regulatory networks, one for each species. Each regulatory network specifies the directed connections from regulatory proteins such as transcription factors (blue filled circles) to target genes (red filled squares). To capture the evolutionary dynamics of regulatory edge gain and loss, MRTLE uses a phylogenetic prior that is parameterized by a continuous time Markov chain. Each branch on the tree can have different gain and loss probabilities depending upon the branch length (e.g. tz for species Z) and an overall gain and loss rate of regulatory connections specified in the rate matrix Q. RijZ denotes the state of the edge between regulator i and target gene j in species Z. B. Pairwise similarities measured by F-score for the simulated ground truth (True) set of seven networks, Net1-Net7, and the inferred sets of networks using two baseline methods that do not incorporate any phylogenetic information (INDEP, GENIE3), and MRTLE that uses the phylogenetic tree of the considered species during network inference. C. Area under the precision-recall curve (AUPR) values comparing networks inferred by MRTLE, INDEP, and GENIE3 to the seven simulated ground truth networks. The greater the AUPR, the better the method. D. Comparison of AUPR between (i) MRTLE and INDEP, (ii) MRTLE and GENIE3, (iii) INDEP and GENIE3, when considering only true and predicted conserved edges between species pairs.

MRTLE takes as input expression data from k different species, a phylogenetic tree with branch lengths, gene orthology relationships including those arising from gene duplications, and rate parameters for regulatory edge loss and gain (Methods, Figure 1A). The output of MRTLE is k networks, one for each species. The prior is flexible and can integrate species-specific regulatory information such as sequence-specific motifs. The prior probability of a regulatory interaction between a target gene and a regulator depends upon both per-species prior regulatory information (e.g., presence of sequence-specific motifs if available), and the phylogenetic prior (Methods).

Since the majority of real regulatory network connections remain undiscovered, especially in non-model organisms, we first used simulations to assess our approach. The goal of the simulation is to ask, if the observed expression data from multiple species are generated from phylogenetically divergent networks, does a method such as MRTLE perform better than other methods? In our simulation, regulatory networks for seven extant species were evolved from an ancestral network using a phylogenetic tree (Figure 1B), followed by generation of simulated expression data at the extant species (Methods). We compared MRTLE to two baseline approaches, INDEP and GENIE3 (Huynh-Thu et al., 2010), that performed network inference in each species independently (Methods). INDEP is similar to MRTLE except it did not use a phylogenetic prior. GENIE3 was shown to have state of the art performance in network inference problems (Huynh-Thu et al., 2010; Marbach et al., 2012). Three criteria were used for evaluation: (a) do the inferred regulatory networks exhibit phylogenetic patterns of conservation that are similar to the true regulatory networks, (b) how well do the methods recover edges from the ground truth network, and (c) how well do the methods recover those edges that are conserved.

For (a) we computed the F-score-based similarity (Methods) for each pair of species' true networks, and compared this to the F-score for all pairs of inferred networks. Inclusion of the phylogenetic prior greatly aids in recovering a pattern of network similarity which agrees with the true pattern of conservation and divergence (Figure 1B). For example, when using MRTLE, inferred networks Net1 and Net2 are more similar to each other (F-score of 0.44), than Net1 is to Net7 (F-score 0.36). Similarly, Net6 and Net7 are more similar to each other (F-score of 0.40) than they are to any of the other species. This is in agreement with the observed trend in the ground truth networks. In contrast, both INDEP and GENIE3 substantially underestimated the similarity between all pairs of networks, and their inferred networks did not exhibit a strong phylogenetic pattern of conservation, but rather appeared uniformly similar to each other. For (b) we used edge precision and recall curves and the area under the precision-recall curve (AUPR). Overall, MRTLE outperforms both INDEP and GENIE3 (Figure 1C; Figure S1), achieving a higher AUPR than GENIE3 in six of the seven networks and a higher AUPR than INDEP on all seven networks. Although the differences in AUPR are small, they are significant when comparing MRTLE against the other two methods (T-test P-value < 0.05). GENIE3 and INDEP are comparable in performance with no significant difference in performance, with INDEP tending to have higher AUPRs than GENIE3. For (c) we considered only true conserved edges between pairs of species and again assessed the methods' accuracies in terms of AUPR. We found that MRTLE is generally better at recovering edges that are evolutionarily conserved compared to INDEP (Figure 1D (i)) and GENIE3 (Figure 1D (ii)). Furthermore, INDEP was better than GENIE3 at recovering true conserved edges (Figure 1D (iii)).
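The F-score-based similarity used for criterion (a) can be illustrated with a short sketch. It assumes that each network is a set of (regulator, target) pairs and that genes from different species have already been mapped to shared orthogroup identifiers; the exact definition is given in the Methods, and the identifiers below are hypothetical.

```python
def f_score_similarity(edges_a, edges_b):
    """F-score between two edge sets, treating edges_a as the reference network.

    Edges are (regulator, target) pairs in a shared identifier space
    (for example orthogroup IDs, which is an assumption of this sketch).
    """
    a, b = set(edges_a), set(edges_b)
    shared = len(a & b)
    if shared == 0:
        return 0.0
    precision = shared / len(b)
    recall = shared / len(a)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy networks for two species.
net1 = {("R1", "G1"), ("R1", "G2"), ("R2", "G3")}
net2 = {("R1", "G1"), ("R2", "G3"), ("R2", "G4")}
print(f_score_similarity(net1, net2))
```

Because the F-score is the harmonic mean of precision and recall, the value is the same whichever network is treated as the reference, so it can be read as a symmetric similarity between a pair of species.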

Our simulation results show that if the observed data are generated from networks that share an evolutionary history, an approach such as MRTLE that uses phylogenetic information can more effectively learn regulatory networks across multiple species. Having established the utility of MRTLE on simulated datasets, we next compared MRTLE, GENIE3 and INDEP for inferring regulatory networks from real expression data from six yeast species (Methods): S. cerevisiae, C. glabrata, S. castellii, C. albicans, K. lactis, and S. pombe. These datasets measure genome-wide transcriptome states in different stress conditions: glucose depletion, heat shock, oxidative stress, and osmotic stress. Glucose depletion, heat shock, and oxidative stress datasets were previously published (Thompson et al., 2013; Roy et al., 2013b; Wapinski et al., 2007), while osmotic stress was generated as part of this study. As potential regulators we included ∼500 genes that have known DNA binding roles in S. cerevisiae, as well as genes whose protein products are known to bind RNA (Table S1). We used the species tree branch lengths and gain and loss rate parameters inferred by Habib et al. (Habib et al., 2012) to specify the probabilities of edge loss and gain in MRTLE. We first assessed all three network inference methods without making use of sequence-specific motif priors. This enabled us to compare against GENIE3, which does not incorporate priors, and also to assess the broader, future applicability of MRTLE to species phylogenies for which such information may not be available. To evaluate the inferred networks, we used criteria similar to the simulation setting. To compute precision-recall curves, we used a ChIP-chip based regulatory network in S. cerevisiae, which has been a gold-standard in the field (MacIsaac et al., 2006). MRTLE outperforms INDEP and GENIE3 achieving higher precision at the same recall (Figure 2A). 
When comparing the phylogenetic pattern of conservation, we observe that MRTLE inferred networks diverge in a pattern consistent with the phylogeny (Figure 2B). In contrast, the networks inferred by INDEP and GENIE3 display extreme divergence. Furthermore, the extent of conservation in MRTLE networks is more consistent with observed conservation of ChIP-chip based binding profiles (Tuch et al., 2008), than either INDEP or GENIE3 (Figure S2). Overall, these results suggest that using phylogenetic information as prior can enable a more accurate reconstruction of a regulatory network, and, the absence of a phylogenetic prior leads to an overestimation of the divergence in the species' networks. Since GENIE3 did not have a significantly different performance than INDEP and does not incorporate sequence-specific motifs our subsequent results include only INDEP as the baseline.
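As an illustration of how a ranked list of inferred edges can be scored against a gold standard, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. It is not necessarily the estimator used in this study, and the edge names and confidence scores are hypothetical.

```python
def average_precision(scored_edges, gold_edges):
    """Average precision of a ranked edge list against a gold-standard edge set.

    scored_edges: iterable of ((regulator, target), confidence) pairs.
    gold_edges: iterable of (regulator, target) pairs taken as true.
    """
    gold = set(gold_edges)
    if not gold:
        return 0.0
    ranked = sorted(scored_edges, key=lambda item: item[1], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (edge, _score) in enumerate(ranked, start=1):
        if edge in gold:
            hits += 1
            precision_sum += hits / rank  # precision at each recovered gold edge
    return precision_sum / len(gold)

# Hypothetical inferred edges with confidence scores.
inferred = [(("A", "x"), 0.9), (("A", "y"), 0.8), (("B", "x"), 0.3)]
gold_standard = {("A", "x"), ("B", "x")}
print(average_precision(inferred, gold_standard))
```

A method that places true edges near the top of its ranking attains higher precision at the same recall, which is the behaviour reported for MRTLE in Figure 2A.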

Figure 2. Assessing inferred networks on the ascomycete yeast phylogeny.


A. Precision-recall curves for MRTLE, INDEP and GENIE3 without motifs assessing the agreement of inferred networks to an S. cerevisiae gold standard network derived from ChIP-chip experiments. B. Pairwise similarities measured by F-score for the networks inferred by GENIE3, INDEP, and MRTLE when motifs were withheld, for six yeast species. C. Precision-recall curves for MRTLE and INDEP when motifs were included, assessing the agreement of the inferred networks using an S. cerevisiae transcription factor (TF) knockout network from Hu et al. as the gold standard. D. Pairwise similarities measured by F-score for the networks inferred by INDEP and MRTLE and the prior motif network for six yeast species. E. AUPR values assessing MRTLE and INDEP at recovering ChIP-chip targets of the TF, MCM1, in three species, C. albicans, K. lactis, and S. cerevisiae. F. Fold enrichment of MCM1 ChIP-chip targets in MCM1's inferred target set by MRTLE or INDEP in the 30,000 most confident edges from each method. G. AUPR values assessing MRTLE and INDEP at recovering edges of regulatory networks consisting of ChIP-chip targets of six different TFs in C. albicans and S. cerevisiae. H. AUPR values for each TF assessing MRTLE and INDEP at recovering ChIP-chip targets of each of the six different TFs in C. albicans and S. cerevisiae. The ground truth in H is the same as in G, but presented at the per-TF level.

Having established that MRTLE is able to outperform methods that do not use phylogenetic priors (e.g., INDEP and GENIE3) when no method has access to sequence-specific motifs, we next evaluated MRTLE when given valuable sequence-based regulatory information. We used species-specific motifs from Habib et al. (Habib et al., 2012) as additional priors on the graphs. We could not use the MacIsaac et al. gold-standard network (MacIsaac et al., 2006), because it used evolutionary conservation as an additional filter to define TF-target edges, and our motif priors were also defined using an evolutionary signature (Habib et al., 2012). As an alternative gold standard, we used an S. cerevisiae regulatory network from Hu et al. (Hu et al., 2007), obtained by systematically deleting regulators and analyzing the downstream effects on expression (Methods). Using this gold standard we found MRTLE to outperform INDEP in edge recovery (Figure 2C). Notably, MRTLE outperforms INDEP at low recall (high precision) thresholds, suggesting that those regulatory edges supported by expression, evolutionary conservation, and a motif instance are more likely to be functional than those supported only by expression and a motif instance.

Next, we examined the networks inferred by MRTLE and INDEP to assess whether they diverge in a manner consistent with the phylogeny. Since the degree and pattern of network similarity depend upon the similarity in the motif networks used as priors in addition to the expression data, we also estimated the similarity of all pairs of motif networks (Figure 2D, Motif Prior). The motif prior networks exhibited stronger evolutionary conservation compared to networks learned from INDEP (Figure 2D). As was the case for simulated data (Figure 1B) and for real expression data alone (Figure 2B), the networks learned by MRTLE exhibit stronger evolutionary conservation than those learned by INDEP, and diverge in a pattern consistent with the phylogeny (Figure 2D). As in the no-motif case, the observed conservation levels for MRTLE with motifs agree more closely with previous studies (Figure S2; Tuch et al., 2008). The similarity scores for INDEP networks increased relative to the scores when not using motifs, consistent with the hypothesis that the motif prior constrains the inferred networks to be more conserved than expression alone (Figure 2B, Figure 2D). The similarity scores for the MRTLE networks were comparable with and without motifs, suggesting that MRTLE is robust to the prior inputs.

Although large-scale knockout and ChIP-chip networks are not available in non-model organisms, a handful of transcription factors have been studied across multiple species (Tuch et al., 2008; Lavoie et al., 2010) using ChIP-chip experiments. In particular, Tuch et al. measured the gene targets bound by the TF MCM1 in S. cerevisiae, K. lactis, and C. albicans (Tuch et al., 2008). Lavoie et al. measured targets of CBF1, HMO1, FHL1, IFH1, and RAP1, in S. cerevisiae and C. albicans (Lavoie et al., 2010). We used these two ChIP-chip datasets to test the ability of MRTLE and INDEP with motifs to recover these targets. On the MCM1 datasets MRTLE outperforms INDEP in K. lactis and S. cerevisiae, and performs comparably in C. albicans (Figure 2E). As an additional evaluation measure, we calculated the fold enrichment of the ChIP-chip MCM1 targets in the predicted MCM1 targets among the top ∼30,000 edges (Figure 2F, Methods). Although the predicted targets from both methods were enriched for ChIP-chip MCM1 targets, MRTLE achieved a higher fold enrichment compared to INDEP in all three species. We combined the predicted targets of all TFs studied by Lavoie et al. into a single network and found MRTLE to significantly outperform INDEP for C. albicans (Figure 2G). However, MRTLE was outperformed on S. cerevisiae (Figure 2G). To gain insight into the lower performance of MRTLE on this S. cerevisiae network, we analyzed our predictions per TF (Figure 2H; Figure S3). In S. cerevisiae, MRTLE outperformed INDEP on RAP1 and TBF1, and it was outperformed for CBF1 (Figure 2H; Figure S3). Both methods had low AUPRs on HMO1, IFH1, and FHL1, likely due to the small number of targets. It is likely that CBF1's targets diverge substantially across species, giving no additional advantage with MRTLE; alternatively, the current CBF1 target set may be incomplete. Future experiments combining ChIP-chip experiments with TF knockout are needed to examine this property.
Taken together, MRTLE was more effective than INDEP at recovering ChIP-based regulatory edges in non-model organisms, demonstrating that a phylogenetic prior-based framework is beneficial for non-model organisms as well.
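The fold-enrichment measure reported for MCM1 can be sketched as follows. This assumes the common definition of fold enrichment as the fraction of predicted targets that are ChIP-chip targets, divided by the fraction expected from the background gene set; the precise definition used here is given in the Methods, and all gene identifiers below are hypothetical.

```python
def fold_enrichment(predicted_targets, gold_targets, background_genes):
    """Observed over expected overlap between predicted and gold-standard targets.

    observed = |predicted and gold| / |predicted|
    expected = |gold| / |background|
    All sets are restricted to the background gene universe first.
    """
    background = set(background_genes)
    predicted = set(predicted_targets) & background
    gold = set(gold_targets) & background
    if not predicted or not gold:
        return 0.0
    observed = len(predicted & gold) / len(predicted)
    expected = len(gold) / len(background)
    return observed / expected

# Hypothetical example: 100 background genes, 10 ChIP-chip targets,
# 10 predicted targets of which 5 are ChIP-chip targets.
background = [f"g{i}" for i in range(100)]
chip_targets = [f"g{i}" for i in range(10)]
predicted = [f"g{i}" for i in range(5)] + [f"g{i}" for i in range(50, 55)]
print(fold_enrichment(predicted, chip_targets, background))
```

A value above 1 indicates that the predicted target set contains more ChIP-chip targets than expected by chance from the background.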

The genome-wide regulatory networks for these six species enable us to more systematically study factors driving regulatory network evolution. For example, estimated rates of gain and loss of edges can provide insights into the relative importance of these two types of network changes in regulatory network divergence. Previously, Habib et al. assessed gain and loss rates of computationally inferred binding sites of individual TFs (Habib et al., 2012). Using a similar framework to Habib et al., we computed gain and loss rates of targets for each regulator (TFs and signaling proteins, Methods, Table S2). We find loss rates to be higher (1.84±0.67) than gain rates (0.48±0.20). A similar trend was observed with the rates from Habib et al. (loss rate of 4.91±2.35 and gain rate of 0.17±0.17), as well as in our recalculations of the rates using motif instances only (loss rate 3.92±1.31, gain rate 0.94±0.31). Our results show that regulatory networks evolve by losing edges more rapidly than by gaining edges and this property is true for both purely-sequence based networks and MRTLE-inferred networks. Although the same trends are observed in all three sources of rates, rates inferred using MRTLE networks were significantly different from the rates inferred from Habib et al. (Habib et al., 2012) or the rates obtained in the prior networks. In particular, regulators in the MRTLE network have a relatively lower loss rate (mean 1.84), compared to the loss rate (mean 4.91) estimated by Habib et al. MRTLE gain (Figure 3A) and loss rates (Figure 3B) are also lower than those estimated directly on the motifs used as priors. The significant differences in the rates from Habib et al. prior networks and the MRTLE-inferred networks, suggest that the MRTLE-inferred networks represent the output of integrating expression and sequence-specific motifs.

Figure 3. Assessing rates of target gain and loss for regulators in MRTLE-inferred networks and motif networks.


A-B. Box plots of gain (A) and loss (B) rates calculated for the MRTLE-inferred networks, the rates calculated for the motifs used as prior knowledge in the MRTLE framework, and the rates calculated for the motifs used by Habib et al. For the MRTLE networks, rates were calculated using the top approximately 50,000 edges in each species' network. C-D. CDF plots of gain (C) and loss (D) rates calculated using the MRTLE-inferred networks at confidence thresholds amounting to approximately 50,000 edges for regulators with duplication (blue) and without duplications (red). E-F. Box plots of gain (E) and loss (F) rates for regulators in the MRTLE networks when considering all regulators and all targets (left; “All”), all regulators and targets without duplications (middle; “Uniform Targets”), and duplicated regulators collapsed into a single average regulator with all targets (right; “Collapsed Regulators”). Rates are computed using the top 50,000 edge set. P-values from KS tests are given in parentheses, testing the hypothesis that regulators with duplications have higher gain (E) or loss (F) rates than regulators without duplications. G. Each point represents a regulator, with the x-coordinate specifying the regulator's loss rate and the y-coordinate specifying its gain rate. Outlier regulators with high gain rates (2 STD above the mean) are noted. H. Comparison of MRTLE and motif prior rates of target gain (i, iii) and loss (ii, iv) for each regulator and its targets including only those regulators from orthogroups with at least one duplication (i, ii), and from orthogroups without duplications (iii, iv). Each point represents a specific regulator, with the x-coordinate specifying the gain/loss rate of the regulator's motif-based targets, and the y-coordinate specifying the gain/loss rate of the regulator's MRTLE-based targets from the top 50k edge set.

Duplication of transcription factors can significantly contribute to regulatory network divergence (Pougach et al., 2014; Voordeckers et al., 2015). We next asked if regulators with duplications differ in their rates of gain and loss compared to regulators without duplications. We find that regulators with duplications have significantly higher edge gain rates (KS test P-value <1E-6, Figure 3C) compared to regulators without duplications. Such regulators also tend to lose edges more than those without duplications, but the trend is less pronounced (KS test P-value <0.04, Figure 3D). We repeated the rate calculations using targets with uniform orthology, and collapsing duplicated regulators into a single orthogroup by taking the average rate, and found similar results (Figure 3E, F, Methods). Additionally, we calculated the rates at various confidence thresholds, and found the results to be robust to the threshold used (Figure S4).
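The comparison of rate distributions between regulators with and without duplications rests on the Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distribution functions. The sketch below computes only the two-sided statistic on hypothetical gain rates; it does not compute a P-value or the one-sided form of the test, for which a statistics library routine such as SciPy's ks_2samp would typically be used.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute difference between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # fraction of a at or below value
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical gain rates, for illustration only.
rates_with_duplication = [0.6, 0.7, 0.9, 1.1, 1.3]
rates_without_duplication = [0.2, 0.3, 0.4, 0.5, 0.8]
print(ks_statistic(rates_with_duplication, rates_without_duplication))
```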

We identified 19 regulators that had a significantly higher rate of edge gain (>2 STD from mean, Table S3). These regulators were associated with diverse processes including stress response (SKN7, CRZ1, CAD1, RLM1), response to nutrients (MIG1, GZF3, CBF1, HAP4), cell cycle (FKH1, FKH2, ACE2), RNA binding (SUI3, JSN1, NOT5), and chromatin organization (CBF1, FKH1, FKH2, RPH1, TBF1). Regulators with high gain rates tend to also have high loss rates (Pearson's Correlation of 0.66), but this pattern was defied by KRE33, which had one of the slowest loss rates (1.68 STD below mean) despite having the highest gain rate (4.77 STD above mean, Figure 3G). KRE33 is involved in ribosomal biogenesis, a process that has been shown to be inherently tied to species lifestyle in the ascomycete lineage (Thompson et al., 2013) and KRE33 might be an important factor in regulatory divergence in this phylogeny. Although the majority of these regulators were from orthogroups that had a duplication, four of the regulators (CBF1, KRE33, HAP4, SUI3) were from orthogroups that did not have duplications. Such regulators tend to be associated with response to stress and chemical stimuli suggesting that such processes may be subject to multiple forces of evolutionary turnover, including gene duplication.
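The outlier criterion (gain rate more than 2 STD above the mean) can be expressed as a simple z-score filter. The sketch below assumes the sample standard deviation is used, which the text does not specify, and the regulator names and rates are hypothetical placeholders rather than values from Table S3.

```python
import statistics

def high_gain_outliers(gain_rates, n_std=2.0):
    """Return {regulator: z-score} for regulators whose rate exceeds mean + n_std * STD.

    gain_rates: dict mapping regulator name to its estimated gain rate.
    Uses the sample standard deviation (an assumption of this sketch).
    """
    values = list(gain_rates.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std == 0:
        return {}
    cutoff = mean + n_std * std
    return {
        name: (rate - mean) / std
        for name, rate in gain_rates.items()
        if rate > cutoff
    }

# Hypothetical rates: nine typical regulators and one with a much higher gain rate.
example_rates = {f"REG_{i}": 0.5 for i in range(9)}
example_rates["REG_HIGH"] = 3.0
print(high_gain_outliers(example_rates))
```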

Recently, Pougach et al. showed that sequence affinity of paralogous TFs diverges after duplication, which can influence regulatory network rewiring (Pougach et al., 2014). To investigate the role of sequence affinity divergence in the overall edge gain rate, we correlated the MRTLE gain and loss rates with the motif gain and loss rates. We found a strong correlation between rates calculated using MRTLE networks and those calculated using the motif networks (Figure 3H (i,ii)) for TFs from duplicated families. This correlation was negative or weak for TFs from families with no duplications (Figure 3H (iii,iv)), although we had far fewer TFs that had motifs and came from non-duplicating families. This suggests that sequence divergence can contribute to network divergence of TFs from duplicated gene families. For two of the TF families we had sequence motifs for both paralogs: SKN7, HSF1 and YAP1, CAD1. The difference in MRTLE gain rates was much greater for the SKN7, HSF1 pair compared to the YAP1, CAD1 pair (Figure 3H (i)). Interestingly, SKN7 and HSF1 had very different sequence affinities (Figure S5) compared to YAP1 and CAD1. These results are consistent with published studies of regulatory divergence of individual TFs (Pougach et al., 2014) and offer preliminary evidence that sequence divergence could explain, in part, the greater tendency to gain targets. Taken together, our inferred networks enabled us to quantitatively assess regulatory network evolution and predict regulators that contribute to regulatory network divergence more than others. Such regulators tend to come from regulator families with duplications or are implicated in stress response.

Evolution of the Osmotic Stress Response (OSR) regulatory network

To gain insight into how changes in regulatory networks can affect complex phenotypes, we used MRTLE-inferred regulatory networks to study response to osmotic stress across six Ascomycota species. Response to environmental stress is a major driving force in the evolution of new phenotypic traits (Hiyama et al., 2012; Hoffmann & Willi, 2008), especially in unicellular organisms (Gasch, 2007). Our current understanding of the regulatory network in response to stress is strongly biased to S. cerevisiae and we understand little about its structure and function in other species. To address this gap, we first used microarrays to measure genome-wide gene expression profiles under osmotic stress in six species. We then identified stress-specific transcriptional modules using a multi-species module inference algorithm, Arboretum (Roy et al., 2013b). Application of Arboretum to our OSR-specific expression data identified five modules ranging from the most repressed genes (module 1) to the most induced genes (module 5, Figure 4A). We then inferred OSR-specific networks by filtering the original MRTLE inferred networks to keep only those edges that connected targets and regulators within the same OSR module (Methods). We refer to this approach of inferring context-specific expression networks as MRTLE+Arboretum. To assess the accuracy of our inferred context-specific regulatory network edges, we performed MiSeq expression profiling in knockout strains of two regulators, MSN2/4 and SKO1, under osmotic stress (Figure 4B-D, Methods). MSN2/4 is a general stress response regulator (Gasch et al., 2000), and SKO1 is an OSR-specific regulator. Both these regulators coordinate with the protein kinase, HOG1, to control OSR in S. cerevisiae (Capaldi et al., 2008). We compared our predicted targets against the MiSeq data in two ways. First, we asked whether the expression of MRTLE and MRTLE+Arboretum inferred targets of these two TFs was significantly different from that of non-targets under osmotic stress, based on a KS-test (Figure 4B, C). Second, we used LIMMA to define targets of these mutants in each species (Figure 4D, Methods) (Smyth et al., 2005). Based on the KS-test, both MRTLE and MRTLE+Arboretum targets are significantly repressed in the MSN2/4 knockout in S. cerevisiae compared to wildtype, which suggests that our predicted regulatory connections are valid. We did not find significant differences for the knockout of the ortholog of MSN2/4 in the two other species, C. albicans and S. pombe. The lack of significant differences in these species is consistent with previous observations where MSN2/4 does not play a significant role in general stress response (Nicholls et al., 2004; Chen et al., 2008; Sanso et al., 2008). In particular, the C. albicans MSN2/4 homologs, MNL1 and MSN4, do not play a role in general stress response (Nicholls et al., 2004). Only MNL1 is required for adaptation to weak acid stress (Ramsdale et al., 2008). For SKO1, we found a significant down-regulation of targets in C. albicans and a significant, albeit reduced, effect in S. cerevisiae.
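The module-based filtering that defines MRTLE+Arboretum can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the gene identifiers, and the module assignments are all hypothetical, and the handling of genes without a module assignment (dropped here) is an assumption.

```python
# Hypothetical inputs: `edges` holds (regulator, target) pairs inferred for one
# species, and `module_of` maps each gene to its OSR module id (1 to 5).
def filter_edges_by_module(edges, module_of):
    """Keep only edges whose regulator and target fall in the same module."""
    kept = []
    for regulator, target in edges:
        reg_module = module_of.get(regulator)
        tgt_module = module_of.get(target)
        # Assumption: a gene with no module assignment cannot satisfy the constraint.
        if reg_module is not None and reg_module == tgt_module:
            kept.append((regulator, target))
    return kept


# Toy example with made-up identifiers (not data from the study).
edges = [("TF_A", "gene1"), ("TF_A", "gene2"), ("TF_B", "gene3"), ("TF_C", "gene4")]
module_of = {"TF_A": 5, "gene1": 5, "gene2": 1, "TF_B": 1, "gene3": 1, "TF_C": 5}
osr_edges = filter_edges_by_module(edges, module_of)
print(osr_edges)
```

Because the filter only removes edges, the context-specific network is always a subset of the original inferred network, which is why the number of predicted targets drops after filtering.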

Figure 4. Osmotic Stress Response (OSR) module assessment.

Figure 4

A. Expression heat maps for each of five inferred OSR-specific expression modules, ranging from most repressed (left) to most induced (right). Height of each heat bar is proportional to the number of genes in each module. B. Box plots comparing differential expression under osmotic stress response (OSR) for predicted targets and non-targets of MRTLE and an approach that combines MRTLE with modular filtering (MRTLE + Arboretum). Targets inferred with the MRTLE + Arboretum approach are those targets inferred by MRTLE with the additional constraint that a target must be present in the same OSR-specific module as its regulator. Each plot shows the log2 ratio of expression in knockout over wild-type of the specified regulator's targets. P-values from KS tests are given for each pair of comparisons, testing the hypothesis that the predicted targets have decreased expression after knockout relative to the non-targets, implying that the knocked out TF has an activating role under salt stress. C. Fold-enrichment of LIMMA-based targets of MSN2/4 in S. cerevisiae and SKO1 in C. albicans. Targets were called with LIMMA and fold-enrichment.

The LIMMA-based analysis confirmed our observations. At a P-value < 0.05, we found 117 MSN2/4 targets in S. cerevisiae and 159 SKO1 targets in C. albicans. LIMMA identified relatively fewer targets (14) for S. cerevisiae SKO1 and therefore we excluded it from this analysis. After removing genes from these sets that were not in the data set used by MRTLE, we were left with 114 targets of MSN2/4 in S. cerevisiae and 149 targets of SKO1 in C. albicans. Our MRTLE+Arboretum approach yielded 311 predicted targets of MSN2/4, 31 of which were among the 114 LIMMA targets, representing a 4.2-fold enrichment (hypergeometric test P-value < 1.2e-12, Figure 4D). In contrast, the original MRTLE S. cerevisiae network predicted 891 MSN2/4 targets, 50 of which overlapped with the LIMMA results, representing a 2.4-fold enrichment (P-value < 1e-10). Similarly for C. albicans SKO1, MRTLE alone predicted 334 targets, 21 of which overlapped with LIMMA targets (2.3-fold enrichment, P-value < 1.7e-4). In contrast, 6 of MRTLE+Arboretum's 40 SKO1 targets overlapped with LIMMA, resulting in a higher fold enrichment (5.6-fold enrichment, P-value < 5.8e-4). These analyses suggest that the MRTLE+Arboretum approach can greatly improve the accuracy of stress-specific regulatory network learning.
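The fold enrichment and hypergeometric P-value used above can be computed as in the following sketch. The function names are illustrative, and the background gene count is an assumption: this section does not state it, so the value of 4,800 below is a hypothetical placeholder chosen only to show the arithmetic.

```python
from math import comb


def fold_enrichment(overlap, n_predicted, n_reference, n_background):
    """Observed overlap fraction divided by the fraction expected by chance."""
    observed = overlap / n_predicted
    expected = n_reference / n_background
    return observed / expected


def hypergeom_upper_tail(overlap, n_predicted, n_reference, n_background):
    """P(X >= overlap) when n_predicted genes are drawn without replacement
    from n_background genes, n_reference of which are reference targets."""
    max_overlap = min(n_predicted, n_reference)
    numerator = sum(
        comb(n_reference, x) * comb(n_background - n_reference, n_predicted - x)
        for x in range(overlap, max_overlap + 1)
    )
    return numerator / comb(n_background, n_predicted)


# Counts from the text: 31 of 311 predicted MSN2/4 targets are among 114 LIMMA targets.
ASSUMED_BACKGROUND = 4800  # hypothetical; the true background size is not given here
fold = fold_enrichment(31, 311, 114, ASSUMED_BACKGROUND)
p_value = hypergeom_upper_tail(31, 311, 114, ASSUMED_BACKGROUND)
print(f"fold enrichment: {fold:.2f}, P-value: {p_value:.2e}")
```

The exact values depend on the background size, so the printed numbers should be read as an illustration of the calculation rather than a reproduction of the reported statistics.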

To assess the overall extent of conservation in our complete OSR-specific networks, we calculated the F-score similarity between networks of each species pair (Figure 5A). We found a significant phylogenetic pattern, although the extent of conservation was lower than what we observed before (Figure 2D). We then examined the portions of our OSR-specific networks spanning the most repressed and most induced modules, and identified conserved regulators acting as hubs in each case (Figure 5B). In the repressed module, KRE33 remained a conserved hub across all species. BAS1 acted as a repressor in the three most recently diverged species, S. castellii, C. glabrata, and S. cerevisiae, while TOD6 acted as a repressor in all species except S. pombe, for which no ortholog exists. In the induced modules, we found MSN2/4 as a hub in the most recently diverged species (Table S4). Intriguingly, in C. albicans, we found COM2 (MNL1 in C. albicans), which belongs to the MSN2/4 family, as a hub. In the other species we found the YAP family of TFs and cell-cycle regulators (SWI5, SWI4, MBP1) to act as hubs. In S. pombe, glucose regulators were predicted as the strongest hubs, followed by the cell cycle related regulators. These regulatory networks thus predict several regulators that have not been associated with stress response in these species and that can be followed up with future validation studies.
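As a rough illustration of the pairwise network comparison, the sketch below computes an F-score between two edge sets. It assumes that the F-score is the harmonic mean of precision and recall with one network treated as the reference, and that edges from different species have already been mapped to a shared namespace (for example, orthogroup identifiers). Both points are assumptions; the function name and the toy edges are hypothetical.

```python
def f_score(edges_a, edges_b):
    """Harmonic mean of precision and recall between two edge sets."""
    a, b = set(edges_a), set(edges_b)
    shared = len(a & b)
    if shared == 0:
        return 0.0
    precision = shared / len(a)  # fraction of network A's edges found in B
    recall = shared / len(b)     # fraction of network B's edges found in A
    return 2 * precision * recall / (precision + recall)


# Toy networks with made-up orthogroup-level edges (not data from the study).
network_1 = {("R1", "T1"), ("R1", "T2"), ("R2", "T3")}
network_2 = {("R1", "T1"), ("R2", "T3"), ("R2", "T4"), ("R3", "T1")}
similarity = f_score(network_1, network_2)
print(round(similarity, 3))
```

This measure is symmetric in the two networks, so it does not matter which species of a pair is treated as the reference.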

Figure 5. MRTLE+Arboretum inferred OSR networks in six ascomycete yeast species.

Figure 5

A. Conservation of the inferred OSR networks for each species measured by F-score. B. Networks spanning the most repressed and most induced OSR modules. Node size is proportional to node degree. Networks were constructed at the gene level rather than the orthogroup level, but nodes are labeled with S. cerevisiae orthology names for species other than S. cerevisiae. Nodes with many S. cerevisiae orthologs were truncated due to space considerations.

While the structure of the network specifies which regulators regulate which genes, the function of a network specifies how the regulator drives the expression of its targets. A regulator can regulate expression by acting as an activator or repressor of expression. Do regulator roles of activation and repression change across species and to what extent do such changes depend upon the stress? To address these questions, we examined the regulator-module relationships in the OSR and Heat Shock Response (HSR) data (Table S5, (Roy et al., 2013b)).

We used two measures to assess a regulator's activating or repressive role. The first measure used the significance of enrichment of a regulator's targets in the activating versus repressive module (Methods). Our second measure compared the expression of the targets for each time point in the repressed or induced module. Our enrichment-based analysis identified several notable regulators with a conserved association with repression in response to osmotic stress, such as KRE33, NSR1, SFP1, LOC1, REH1/REI1, and CHA4/TEA1 (Figure 6A, Table S5). Interestingly, the majority of the conserved, repressed regulators are associated with ribosomal biogenesis, which is repressed in species under stress. Regulators with conserved activating roles across all six species included the MSN2/4 family, the SKN7/HSF1 family, and AFT1/AFT2. Most of these regulators have general or specific stress related functions. Our second analysis focused on regulators with targets in both activating and repressive modules. This was a complementary measure, which recapitulated regulators from our enrichment-based measure and also identified several additional candidates of regulator divergence (typically in one or two species, Figure 6B). This included cell cycle regulators such as FKH1/2 and MBP1/SWI4, stress regulators (CRZ1), chromatin remodelers (GIS1;RPH1) and HAP4. Notably, several of these regulators were also associated with higher gain rates suggesting that regulator expression divergence might be associated with the tendency of the regulators to gain or lose edges. However, additional datasets would be needed to more fully understand this phenomenon. Overall, regulators tended to not change signs between species from activating to repressive or vice versa.

Figure 6. Comparative analysis of regulator association to Osmotic Stress Response (OSR) expression levels.

Figure 6

A. Shown are regulator-module association scores computed using the most repressed and most induced OSR modules. Each association score represents the difference of the negative log P-value from two hypergeometric tests, one for the most induced and one for the most repressed module. Positive scores (red) represent a stronger association with the most induced module compared to the repressed module, while negative scores (blue) represent a stronger association with the repressed module compared to the induced module. Blank scores represent a species for which a regulator was not present due to a gene loss event, or for which no targets were predicted in the top 30,000 edges in MRTLE. Regulators for which all species had low scores for both stresses (absolute value < 2) are excluded from the figure (See Supplementary Table S5 for all regulators). B. Shown is the negative log P-value from a t-test comparing the expression levels of targets of a regulator in the induced vs repressed module for each experimental condition (time point and stress signal) for OSR. The intensity of red or blue in each entry is proportional to -log(p-value) (see colorscale at the bottom). Regulators with more targets in the induced module than the repressed module are considered as “activators” and use the red color map. Regulators with more targets in the repressed module than the induced module are considered as “repressors” and use the blue color map.
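The association score described in panel A can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the logarithm base (base 10 here), the choice of background gene set, and all names and counts are assumptions.

```python
from math import comb, log10


def hypergeom_upper_tail(overlap, n_targets, n_module, n_background):
    """P(X >= overlap) for n_targets genes drawn from n_background genes,
    n_module of which belong to the module of interest."""
    upper = min(n_targets, n_module)
    numerator = sum(
        comb(n_module, x) * comb(n_background - n_module, n_targets - x)
        for x in range(overlap, upper + 1)
    )
    return numerator / comb(n_background, n_targets)


def association_score(targets, induced, repressed, n_background):
    """-log10 P(induced enrichment) minus -log10 P(repressed enrichment).
    Positive values indicate a stronger association with the induced module."""
    targets, induced, repressed = set(targets), set(induced), set(repressed)
    p_induced = hypergeom_upper_tail(
        len(targets & induced), len(targets), len(induced), n_background)
    p_repressed = hypergeom_upper_tail(
        len(targets & repressed), len(targets), len(repressed), n_background)
    return -log10(p_induced) - (-log10(p_repressed))


# Toy example: 20 background genes, two modules of 5 genes each (made-up data).
induced_module = {"g1", "g2", "g3", "g4", "g5"}
repressed_module = {"g6", "g7", "g8", "g9", "g10"}
regulator_targets = {"g1", "g2", "g3", "g15"}
score = association_score(regulator_targets, induced_module, repressed_module, 20)
print(round(score, 3))
```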

To examine the generality of this observation, we compared the OSR regulator signs to those in HSR (Figure 7). The majority of the regulators had similar associations in these stresses, with stress-related regulators such as MSN2/4 exhibiting a conserved activation and ribosomal biogenesis regulators exhibiting a conserved repression across species. However, some notable differences were uncovered, including a pronounced inductive role under heat stress in all species for HSP60, which is known to have a regulatory role post heat stress. Consistent with its role in the S. cerevisiae OSR, SKO1 also exhibited a conserved role of up-regulation in all species except S. pombe, and showed no significant association in HSR. Examples of regulators that changed their association with expression modules between stresses primarily were observed in a species-specific manner. In particular, PHO4 and TYE7 were associated with repression in heat shock in C. albicans (Figure 7), but did not have a significant association in C. albicans in osmotic stress (Figure 6). In summary, regulator associations with module expression are generally conserved across species for particular stresses. Regulator-module associations change their sign between stresses, but these changes are rare and happen in a species and clade-specific manner.

Figure 7. Comparative analysis of regulator association to Heat Stress Response (HSR) expression levels.

Figure 7

A. Shown are regulator-module association scores computed using the most repressed and most induced HSR modules. Each association score represents the difference of the negative log P-value from two hypergeometric tests, one for the most induced and one for the most repressed module. Positive scores (red) represent a stronger association with the most induced module compared to the repressed module, while negative scores (blue) represent a stronger association with the repressed module compared to the induced module. Blank scores represent a species for which a regulator was not present due to a gene loss event, or for which no targets were predicted in the top 30,000 edges in MRTLE. Regulators for which all species had low scores for both stresses (absolute value < 2) are excluded from the figure (See Supplementary Table S5 for all regulators). B. Shown is the negative log P-value from a t-test comparing the expression levels of targets of a regulator in the induced vs repressed module for each experimental condition (time point and stress signal) for HSR. The intensity of red or blue in each entry is proportional to -log(p-value) (see heatmaps at bottom). Regulators with more targets in the induced module than the repressed module are considered as “activators” and use the red color map. Regulators with more targets in the repressed module than the induced module are considered as “repressors” and use the blue color map.

Discussion

A comparative framework for regulatory networks can provide insights into principles of gene regulation (Garfield & Wray, 2010; Li & Johnson, 2010; Wohlbach et al., 2009), as well as inform better learning of network structure (Penfold et al., 2015; Thompson et al., 2015). Here, we have presented our algorithm, MRTLE, for inferring regulatory networks for multiple species related by a known phylogeny. MRTLE makes use of a known phylogenetic tree to explicitly model evolutionary rates of regulatory edge gain and loss and can additionally incorporate sequence-specific motifs to identify regulatory networks in a complex phylogeny. Furthermore, MRTLE is able to incorporate complex many-to-many orthology relationships arising from gene duplications, which are known to play a crucial role in regulatory network evolution (Voordeckers et al., 2015; Teichmann & Babu, 2004; Pérez et al., 2014).

By leveraging data from related species within a phylogenetic framework, MRTLE is able to outperform methods that do not make use of evolutionary information (INDEP, GENIE3), in both simulated and real data settings. By favoring networks that are more phylogenetically coherent, MRTLE is able to recover the conserved parts of regulatory networks more accurately than methods that do not incorporate the phylogeny. MRTLE can accurately learn regulatory networks even when the sample size of expression data is small as evidenced by our cross-species ChIP-chip comparisons. These results suggest that MRTLE can be an effective tool for inferring regulatory networks in non-model organisms, for which data are just becoming available and little is known about their regulatory networks. Computationally-inferred high confidence regulatory interactions could be critical for prioritizing ChIP-seq and regulator perturbation experiments needed to understand the regulatory networks in these poorly characterized species.

Inferring genome-wide regulatory networks in a large set of species enabled us to perform several systematic analyses to study regulatory network evolution. One of the properties that we discovered was the relatively higher rates of target gain and loss in regulators with duplications versus regulators without duplications. Notable exceptions were a few stress-related regulators that exhibited high rates of turnover but did not have duplications. Consistent with previous work (Pougach et al., 2014), we find that the MRTLE rates of TFs in duplicated families are more correlated to the sequence-derived rates suggesting that sequence affinity divergence can facilitate TFs with duplications to diverge. However, additional experimental data measuring sequence affinity of individual members for a larger number of families are needed to more robustly examine this property. The MRTLE framework also enabled us to compare, for the first time to our knowledge, global transcriptional networks for a specific stress. We found that patterns of functional divergence of a regulator-module relationship typically were gradual and included a change from down- or up-regulation to no significant association with a module. While some regulators changed their association across different stresses, most of the divergence in association is likely to occur gradually through fine-tuning of expression.

MRTLE can be extended in several directions. A particular challenge to employing MRTLE is setting the prior probability of an edge gain or loss for each branch. In this paper, we used previously established motif gain and loss rates as a proxy for regulatory edge gain and loss rates. While this yielded good performance in our setting, a different approach may be necessary in phylogenies where motif turnover rates are not available. Additionally, one could incorporate variable gain and loss rates for each putative regulator, making use of prior information about each regulator's gain and loss rates. MRTLE's reliance on the high predictive power of a target's mRNA level based on the TF's expression makes it difficult to discover potential regulatory roles of genes such as HOG1, which is known to be important in osmotic stress response in S. cerevisiae. Integrating regulator activity levels that are less dependent on gene expression levels is another future extension to MRTLE. Another direction of future work is to extend our simulation to model the evolution of sequence-specific motifs together with the network evolution model, to enable a more controlled study of the role of sequence and expression evolution in regulatory network evolution. In summary, MRTLE represents a powerful framework to infer and compare regulatory networks on a genome-wide scale in a complex phylogeny, and should enable furthering our understanding of regulatory network evolution and its impact on how species interact and adapt to environmental changes.

STAR methods

Contact for Reagent and Resource Sharing

Information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact Sushmita Roy, sroy@biostat.wisc.edu.

Experimental Model and Subject Details

Osmotic stress response gene expression profiling

Strains and growth conditions

The following wild-type strains were used for each species in the study: S. cerevisiae W303 (Capaldi et al., 2008), C. glabrata CBS 138, S. castellii CLIB 592, K. lactis CLIB 209, C. albicans SC5314, C. albicans BWP17, S. pombe SPY73h+. Deletion mutant strain S. pombe hsr1 was obtained from Bioneer. S. cerevisiae deletion mutant strains of msn2/4 and sko1 were created on the W303 wild type strain previously described in Capaldi et al. (Capaldi et al., 2008). Deletion mutant strains for C. albicans sko1 and mnl1 were previously described in Homann et al. (Homann et al., 2009). All species were grown in the following rich medium chosen to minimize cross-species variation in growth (termed BMW): yeast extract (1.5%), peptone (1%), dextrose (2%), SC amino acid mix (Sunrise Science) 2 g/L, adenine 100 mg/L, tryptophan 100 mg/L, uracil 100 mg/L (Thompson et al., 2013). For each strain, cells were plated onto BMW plates from frozen glycerol stocks. After 2 days, cells were taken from plates and re-suspended into liquid BMW and grown overnight. Approximately 100-1500 ul of the overnight cultures (depending on the species growth rate and timing constraints for the day's experiments) were used to inoculate pre-warmed, 350 ml BMW cultures in 2L Erlenmeyer flasks in New Brunswick Scientific water bath model C76 shakers. All strains were grown at 180 rpm at 30 C except for S. castellii, which was grown at 25 C.

Osmotic Stress Response profile experiments

The OD600 was measured throughout the day to ensure culture growth was tracking as expected (Thompson et al., 2013). When samples reached a species-specific OD600, corresponding to slightly late mid-log, we transferred 150ml of the culture to each of 100ml BMW (CTRL) and 100ml BMW+KCl (EXP). Both CTRL and EXP media were pre-warmed for 40-60 minutes in the shaker prior to the experiment. EXP media was either BMW + 0.5M KCl, 1M KCl, or 2M KCl, yielding a final concentration of 0.2M KCl, 0.4M KCl, and 0.8M KCl, respectively, upon addition of the culture. In each case, CTRL media was added first, followed by the EXP media, whereupon the shaker was immediately activated to 180 rpm and the timer started simultaneously. Samples (20ml) were collected from the CTRL media + culture immediately upon activation of the shaker (T=0), then at T=10, 20, 40, and 80 minutes from the EXP media + culture. Samples were collected in 50 mL conicals filled with 30ml of 100% methanol to yield a 60/40 methanol/sample mixture. The methanol-filled tubes were stored at -80 C until ready for use. During sample collection, tubes were placed in a rack in a dry-ice ethanol bath kept at approximately -40 C. Once the sample was added to the methanol, the methanol and media were separated from the cells by centrifugation and poured off. The conicals containing a cell pellet were flash frozen in liquid nitrogen and then stored at -80 C until processed for permanent storage or RNA isolation. To process, the cell pellets were washed in 5 ml of nuclease-free water and spun for 5 min at 3700 rpm at 4 C. The supernatant was discarded and the pellet re-suspended in 2 mL of RNAlater (Ambion) and transferred to 2 ml Sarstedt tubes for storage.
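The final KCl concentrations follow from a simple dilution calculation. As a worked check, assuming that the culture itself contributes no additional KCl and that volumes are additive, the 0.5M case gives:

```latex
\[
C_{\text{final}}
  = C_{\text{EXP}} \times \frac{V_{\text{EXP}}}{V_{\text{EXP}} + V_{\text{culture}}}
  = 0.5\,\text{M} \times \frac{100\,\text{ml}}{100\,\text{ml} + 150\,\text{ml}}
  = 0.2\,\text{M}
\]
```

The same dilution factor of 100/250 = 0.4 gives 0.4M and 0.8M for the 1M and 2M media. Likewise, 30 ml of methanol plus a 20 ml sample gives 30/50 = 60% methanol, matching the stated 60/40 mixture.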

RNA preparation and labeling

Total RNA was isolated using the RNeasy Mini Kits (Qiagen) according to the provided instructions for mechanical lysis. Samples were quality controlled with the RNA 6000 Pico kit of the Bioanalyzer 2100 (Agilent). Total RNA samples were labeled with either Cy3 or Cy5 using a modification of the protocol developed by Joe DeRisi (University of California at San Francisco) and Rosetta Inpharmatics as described previously (Wapinski et al., 2010). In the case of the OSR profile experiments, the control was a pooled sample, consisting of equal quantities of 160ng RNA from each of the T= 0, 10, 20, 40, and 80 minute samples. The pool was constructed prior to the SS-III reverse transcription step, where the Agilent spike-in A (or spike-in B in the case of labeling with Cy5), could be incorporated into the reaction.

Microarray hybridization

We used two-color Agilent 55- or 60-mer oligo-arrays in the 4 × 44 K format (four to five probes per target gene) and 8 × 15 K format (two probes per target gene). After hybridization and washing per the manufacturer's instructions, arrays were scanned using an Agilent scanner and analyzed with Agilent's Feature Extraction software (release 10.5.1).

cDNA synthesis for Mi-seq RNA-sequencing gene expression studies

1 ug of total RNA in a volume of 11 uL was used as input. Heat fragmentation was completed by adding 3 uL of THE RNA Storage Solution (Ambion, AM7000) to each sample in an Eppendorf 96 well plate (951020401, Fisher Scientific) and heating at 98°C for 30 minutes. First strand cDNA was created by adding 1 uL of OligoDT to samples and heating at 70°C for 10 minutes. Samples were then put immediately on ice. A mastermix of 2 uL of 10× AffinityScript buffer, 0.8 uL of 25 mM dNTPs, 2 uL of DTT and 1 uL of the AffinityScript RT Enzyme (AffinityScript Multiple Temperature Reverse Transcriptase, 600109) was created. 5.8 uL of the mastermix was added to each sample well and mixed. Samples were incubated at room temperature for 10 minutes in a thermocycler, followed by 1 hour at 50°C, 15 minutes at 70°C and a 4°C hold.

Second strand cDNA synthesis was completed with mRNA Second Strand Synthesis Module, E6111L. cDNA synthesis reaction was cleaned up using Agencourt AMPure XP beads (A63881, Beckman Coulter). Sample and beads were used as input to library construction; beads remain in the plate well with sample until the adapter ligation cleanup.

Library Construction for sequencing

Libraries were created using KAPA Biosystems Library Preparation Kit (KK2505 and KK8202) in an Eppendorf 96 well plate. Enzymatic reactions were cleaned up by adding AMPure XP to the sample after end repair, and leaving the beads in the sample throughout adapter ligation. 20% PEG, NaCl 2.5M was added to samples and beads for A-base and Adapter ligation cleanup, as previously described (Fisher et al., 2011). Prior to library enrichment, samples were eluted from AMPure XP beads. For library enrichment and amplification, a mastermix containing 12 uL of 5X Kapa HiFi Fidelity Buffer 2mM Mg, 1 uL of 25 mM dNTPs, 4 uL of primer mix, 1 uL of Kapa HiFi HotStart Enzyme and 2 uL of water per sample was created. 20 uL of the mastermix was added to the sample and the following PCR program was run: 98°C for 45 seconds, 12 cycles of 98°C for 15 seconds, 60°C for 30 seconds and 72°C for 30 seconds, a final extension at 72°C for 1 min and a 4°C hold. The library enrichment reaction was cleaned by adding 60 uL of AMPure XP beads and samples were eluted off of the AMPure beads in 15 uL of Tris-HCl (pH 8). The samples were then transferred to a new plate and library quality was assessed.
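The per-sample mastermix volumes given for reverse transcription (5.8 uL) and library enrichment (20 uL) can be checked and scaled to a plate with a short script. The component volumes are taken from the text; the helper names and the 10% pipetting overage are hypothetical conveniences and are not part of the described protocol.

```python
import math

# Per-sample volumes in microliters, as listed in the protocol text.
RT_MIX_UL = {
    "10x AffinityScript buffer": 2.0,
    "25 mM dNTPs": 0.8,
    "DTT": 2.0,
    "AffinityScript RT enzyme": 1.0,
}
PCR_MIX_UL = {
    "5X Kapa HiFi Fidelity Buffer": 12.0,
    "25 mM dNTPs": 1.0,
    "primer mix": 4.0,
    "Kapa HiFi HotStart enzyme": 1.0,
    "water": 2.0,
}


def per_sample_total(mix):
    """Total mastermix volume added to each sample well."""
    return sum(mix.values())


def scale_mix(mix, n_samples, overage=0.10):
    """Scale each component to n_samples plus a fractional overage.
    The overage is an assumed allowance for pipetting loss."""
    factor = n_samples * (1.0 + overage)
    return {component: volume * factor for component, volume in mix.items()}


print(per_sample_total(RT_MIX_UL), per_sample_total(PCR_MIX_UL))
plate_mix = scale_mix(PCR_MIX_UL, n_samples=96)
for component, volume in plate_mix.items():
    print(f"{component}: {volume:.1f} uL")
```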

Library quality control

Libraries were checked for quality control using Agilent High Sensitivity D5000 Screentape assay; size range for each sample was between 200-500 base pairs.

Library Sequencing

Each library was diluted to 2nM and pooled prior to sequencing. Sequencing was completed on the MiSeq platform and a 25 × 25, paired end sequencing run was completed.

Method Details

Probabilistic framework for phylogeny-aware regulatory network learning for multiple species: MRTLE

Our multi-species network inference approach is based on a probabilistic graphical model representation of a regulatory network (Friedman, 2004; Segal et al., 2003; Friedman et al., 2000; Markowetz & Spang, 2007; Pe'er et al., 2006). Bayesian networks (Friedman et al., 2000) and dependency networks (Heckerman et al., 2001) are examples of probabilistic graphical models that have been used to represent regulatory networks. Here, we use a dependency network representation because they can be relatively easily learned from observed expression data and can capture cyclic dependencies (Huynh-Thu et al., 2010; Heckerman et al., 2001). Below, we first give a description of a probabilistic model representation of a regulatory network for a single species, followed by a description of the probabilistic priors we have employed to capture phylogenetic relationships, and then a sketch of the MRTLE algorithm.

Modeling a regulatory network in one species

A probabilistic graphical model (PGM) of a regulatory network has two components: the structure, which specifies the regulators of a target gene, and the parameterized functions, which describe the sign and magnitude of the interactions of individual and combinations of regulators specifying the expression of a target gene. In PGMs, the expression level of a gene i is captured by a random variable, Xi, and a conditional probability distribution relates the expression levels of regulators to the expression level of a target gene, by specifying the probability of a target gene taking a specific expression value given the expression values of its regulators. In MRTLE, Xi and its parents are assumed to be jointly Gaussian and the conditional distribution for each Xi given its parents is a conditional Gaussian.

Extending to multiple regulatory networks

Let N denote the number of species, and let $G_s$ denote the graph associated with the $s$th species. Let $D_s$ denote the expression datasets associated with the $s$th species that represent measured expression levels of both targets and regulators under multiple conditions. Given datasets $D_1, \ldots, D_N$ and a phylogenetic tree over N species, our goal is to simultaneously infer the unknown regulatory networks $G_1, \ldots, G_N$ for all species. We use a Bayesian framework to tackle this problem and optimize the posterior probability of the graphs given the data, $P(G_1, \ldots, G_N \mid D_1, \ldots, D_N)$. Using Bayes rule this is proportional to $P(D_1, \ldots, D_N \mid G_1, \ldots, G_N)\,P(G_1, \ldots, G_N)$, where $P(D_1, \ldots, D_N \mid G_1, \ldots, G_N)$ is the data likelihood, which is computed easily for each species independently as $\prod_{s=1}^{N} P(D_s \mid G_s)$. $P(G_1, \ldots, G_N)$ is the prior over the N graphs. To incorporate the phylogenetic similarity between species, we use a specific formulation of the multi-graph prior, which we describe below.
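Written on the log scale, which is the usual form for a search score, the factorization above becomes a sum of per-species likelihood terms plus a single joint prior term. The constant absorbs the marginal probability of the data, which does not depend on the graphs:

```latex
\[
\log P(G_1,\ldots,G_N \mid D_1,\ldots,D_N)
  = \sum_{s=1}^{N} \log P(D_s \mid G_s)
  + \log P(G_1,\ldots,G_N)
  + \text{const}
\]
```

Only the prior term couples the species, so it is the place where phylogenetic information enters the objective.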

Phylogenetic and species-specific graph priors

$P(G_1, \ldots, G_N)$ is defined as a product of $\Gamma_1(G_1, \ldots, G_N)$, which captures the multi-species phylogenetic prior, and $\Gamma_2(G_1, \ldots, G_N)$, which captures any species-specific regulatory information such as binding sites. $\Gamma_1$ and $\Gamma_2$ each define a distribution over a set of graphs, where each graph is represented by a set of edges from regulators to target genes. These are not bipartite graphs because there exist genes that act as both regulators and targets. To describe $\Gamma_1$ in more detail we make use of the concept of an orthogroup (Wapinski et al., 2007), which is defined as a set of orthologous genes. Each orthogroup contains 0 or more gene members from each species. We assume that $\Gamma_1$ decomposes as a product over sets of edges between regulator orthogroups and target orthogroups. For simplicity, we first assume that each species has one gene in each regulator orthogroup and one gene in each target orthogroup. Later, we describe how to relax this assumption. Let $I_{jk} = \{I_{jk}^{1}, \ldots, I_{jk}^{N}\}$ be a binary vector for each regulator j and target gene k pair. $I_{jk}^{i}$ is a binary variable capturing the state of the edge from regulator j to target k for the $i$th species, taking a value of 0 if the edge is absent and 1 if the edge is present. We express this prior as $\Gamma_1(G_1, \ldots, G_N) = \prod_{j,k} P(I_{jk})$, which assumes that the prior decomposes as a product over the edges. $P(I_{jk})$ can be efficiently computed using Felsenstein's algorithm for computing the probability of discrete observations at the leaf nodes of a phylogenetic tree (Felsenstein, 1981). First, we expand $I_{jk}$ to include the ancestral species at the N − 1 intermediate points in the tree, using indices N + 1 to 2N − 1 to represent these internal points. Computing $P(I_{jk}^{1}, \ldots, I_{jk}^{N})$ requires us to sum out the state of the edges at the internal nodes as $\sum_{I_{jk}^{N+1}, \ldots, I_{jk}^{2N-1}} P(I_{jk}^{1}, \ldots, I_{jk}^{N}, I_{jk}^{N+1}, \ldots, I_{jk}^{2N-1})$. Using the tree structure to make independence assumptions, we can write this as $\sum_{I_{jk}^{N+1}, \ldots, I_{jk}^{2N-1}} P(I_{jk}^{2N-1}) \prod_{l} P(I_{jk}^{l} \mid I_{jk}^{pa(l)})$, where the product runs over all nodes l other than the root (indexed 2N − 1) and pa(l) denotes the immediate ancestor species of l. Hence the probability, $P(I_{jk})$, can be computed efficiently using the probability of an edge state in l, given the state of the edge in the ancestor of l, $P(I_{jk}^{l} \mid I_{jk}^{pa(l)})$.
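The sum over internal edge states can be carried out with a post-order recursion over the tree, in the spirit of Felsenstein's pruning algorithm. The sketch below is a generic illustration for a binary edge state, not the MRTLE implementation: the nested-dictionary tree format, the per-branch `p_gain` and `p_maintain` fields, the toy tree, and the uniform prior on the root state are all assumptions.

```python
from itertools import product


def transition(child_state, parent_state, p_gain, p_maintain):
    """P(child edge state | parent edge state) along one branch."""
    p_present = p_maintain if parent_state == 1 else p_gain
    return p_present if child_state == 1 else 1.0 - p_present


def partial_likelihood(node, leaf_states):
    """Return [L(0), L(1)]: probability of the observed leaf states below
    `node`, given that the edge is absent (0) or present (1) at `node`."""
    if "children" not in node:  # leaf: the edge state is observed
        observed = leaf_states[node["name"]]
        return [1.0 if observed == state else 0.0 for state in (0, 1)]
    result = [1.0, 1.0]
    for child in node["children"]:
        child_l = partial_likelihood(child, leaf_states)
        for parent_state in (0, 1):
            # Sum out the child's state for this parent state.
            result[parent_state] *= sum(
                transition(cs, parent_state, child["p_gain"], child["p_maintain"])
                * child_l[cs]
                for cs in (0, 1)
            )
    return result


def edge_pattern_probability(tree, leaf_states, root_prior_present=0.5):
    """P(I_jk) for one regulator-target pair; the root prior is an assumption."""
    l_absent, l_present = partial_likelihood(tree, leaf_states)
    return (1.0 - root_prior_present) * l_absent + root_prior_present * l_present


# Toy three-species tree ((A, B), C) with made-up branch probabilities.
tree = {"children": [
    {"p_gain": 0.1, "p_maintain": 0.9, "children": [
        {"name": "A", "p_gain": 0.2, "p_maintain": 0.8},
        {"name": "B", "p_gain": 0.2, "p_maintain": 0.8},
    ]},
    {"name": "C", "p_gain": 0.3, "p_maintain": 0.7},
]}
conserved = edge_pattern_probability(tree, {"A": 1, "B": 1, "C": 1})
scattered = edge_pattern_probability(tree, {"A": 1, "B": 0, "C": 1})
total = sum(
    edge_pattern_probability(tree, dict(zip("ABC", pattern)))
    for pattern in product((0, 1), repeat=3)
)
print(conserved, scattered, total)
```

The recursion visits each node once, so the cost is linear in the number of species rather than exponential in the number of internal nodes.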

Two parameters, pg and pm, each taking values from zero to one, determine this probability. The first, pg, is the probability of gaining a regulatory edge given that the edge does not exist in the ancestral species. The second, pm, is the probability of maintaining a regulatory edge given its presence in the ancestral species. Setting these parameters to appropriate values is difficult. In our experiments on real data with six yeast species, we estimated a rate matrix using the average rates of motif binding site gain and loss from Habib et al. (2012). We then set pg and pm for each branch in the phylogenetic tree based on this rate matrix and the branch length. In this regard, we used binding site gain and loss rates as a proxy for regulatory edge gain and loss rates. Thus our prior Γ1 is parameterized by branch lengths and the two rate parameters that are multiplied to obtain the probabilities pg and pm. Because branch lengths vary, the probabilities pm and pg are modeled separately for each branch.

The second part of the prior, Γ2(G1,…,GN), acts in a per-species manner and can be further decomposed as a product over species-specific graphs. Each P(Gi) further decomposes as a product over edges, P(I(Xj → Xk)), where I is an indicator function for an edge existing between regulator j and target k. Similar to Roy et al. (2013a), we parameterize the prior probability as a logistic function: $P(I(X_j \to X_k) = 1) = \frac{1}{1+\exp(\beta_0+\beta_1 m_{jk})}$. Here, mjk specifies whether gene k has a motif in its promoter region that can be bound by regulator j. In our current implementation of the algorithm, each mjk takes a real value proportional to the significance of an instance of j's motif found in k's promoter. These weights can be estimated using a standard motif scanning tool, for example FIMO (Grant et al., 2011). β0 acts as a sparsity prior that controls the extent to which the algorithm penalizes the addition of a new edge. β1 controls the strength of the motif prior. Both β0 and β1 are user-tunable parameters. The motif prior enables us to select interactions that are weakly predicted by expression data but are supported by motif presence. Note that this framework is flexible and can easily be modified to fit a scenario where species-specific motif information is unavailable (β1 = 0), or settings where additional types of prior information for an edge are present.
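As a rough illustration of how branch-specific pg and pm could follow from a rate matrix and a branch length, the sketch below assumes a two-state (edge absent, edge present) continuous-time Markov chain and uses its closed-form transition probabilities. The function name, the example rates, and the choice of this particular mapping are assumptions made for illustration; the text states only that the probabilities were set from the rate matrix and the branch length. Under this assumption, for short branches the gain probability is approximately the gain rate multiplied by the branch length.

```python
import math

def branch_probabilities(gain_rate, loss_rate, branch_length):
    """Return (pg, pm) for one branch under a two-state Markov chain.

    pg: probability the edge is present in the child given it is absent in the parent.
    pm: probability the edge is present in the child given it is present in the parent.
    """
    total = gain_rate + loss_rate
    stationary_present = gain_rate / total          # long-run probability of "present"
    decay = math.exp(-total * branch_length)        # memory of the parent state
    pg = stationary_present * (1.0 - decay)
    pm = stationary_present + (1.0 - stationary_present) * decay
    return pg, pm

# Hypothetical rates and branch lengths, chosen only to show the trend.
for t in (0.05, 0.5, 2.0):
    pg, pm = branch_probabilities(gain_rate=0.1, loss_rate=0.4, branch_length=t)
    print(f"branch length {t}: pg={pg:.3f}, pm={pm:.3f}")
```

Longer branches pull both probabilities toward the same stationary value, so the prior favors conservation more strongly between closely related species.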

Score-based learning of regulatory networks

To infer graphs for all species, we use a score-based approach that searches over the space of possible graphs. Because the space of possible graphs is super-exponential in the number of variables, finding a global optimum is not feasible. Instead, it is typical to use heuristic search algorithms over the graph space, score each candidate graph, and select one that corresponds to a local optimum. In the multi-species setting, we need to search over the N graphs simultaneously. Specifically, the score of a current graph configuration is composed of the data likelihood, P(D1,…,DN|G1,…,GN), and the graph prior, P(G1,…,GN). As described above, P(D1,…,DN|G1,…,GN) is written as a product over the N species, ∏s P(Ds|Gs). In a dependency network we cannot easily compute the likelihood P(Ds|Gs); instead we compute a pseudo-likelihood, given by the product of conditional distributions, P(Xis|RXis), where RXis denotes the regulator set for Xis in species s. We assume that each variable Xis and its regulators RXis are distributed according to a multivariate Gaussian. The conditional P(Xis|RXis) is a conditional Gaussian distribution with mean μXis|RXis and variance σXis|RXis, estimated from the joint distribution as described in Lauritzen (1996). Using the conditional means, μXis|RXis, and variances, σXis|RXis, of a variable given its regulator set, we compute the conditional data likelihood for each variable Xis using data from species s. To compute the portion of the score representing the graph prior, we need to compute Γ1 and Γ2. As described above, Γ1 decomposes as a product over each possible regulator-target orthogroup pair and can be computed using Felsenstein's algorithm (Felsenstein, 1981), while Γ2 is computed in a species-specific manner.
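The pseudo-likelihood term can be sketched as follows for a single species. This is a minimal illustration, assuming an expression matrix with samples in rows and genes in columns and maximum-likelihood estimates of the joint Gaussian; the function names and the dictionary layout for regulator sets are hypothetical and are not taken from the MRTLE code.

```python
import numpy as np

def conditional_gaussian_loglik(data, target, regulators):
    """Log-likelihood of one gene (column `target`) given its regulator columns."""
    x = data[:, target]
    if len(regulators) == 0:
        # No regulators: the conditional reduces to the marginal Gaussian.
        cond_mean = np.full(x.shape, x.mean())
        cond_var = x.var()
    else:
        r = data[:, list(regulators)]
        # Joint covariance of [target, regulators], estimated from the data.
        joint = np.cov(np.column_stack([x, r]), rowvar=False, bias=True)
        cov_xr, cov_rr = joint[0, 1:], joint[1:, 1:]
        weights = np.linalg.solve(cov_rr, cov_xr)
        # Standard conditional mean and variance of a multivariate Gaussian.
        cond_mean = x.mean() + (r - r.mean(axis=0)) @ weights
        cond_var = joint[0, 0] - cov_xr @ weights
    resid = x - cond_mean
    return float(-0.5 * np.sum(np.log(2 * np.pi * cond_var) + resid ** 2 / cond_var))

def pseudo_loglik(data, regulator_sets):
    """Sum of per-gene conditional log-likelihoods; regulator_sets maps target -> regulators."""
    return sum(conditional_gaussian_loglik(data, t, regs)
               for t, regs in regulator_sets.items())
```

Because the pseudo-likelihood is a sum of per-gene terms, adding a regulator to one target changes only that target's term, which is what makes a per-target greedy search practical.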

Handling non-uniform orthogroups

In the description so far, we have assumed that each species has exactly one gene in the regulator orthogroup and exactly one gene in the target orthogroup. However, for most evolutionary studies, the ability to handle many-to-many mappings between species is essential. In our problem setting, when there are duplications, we need to specially handle the Ijk variable that specifies the state of edges between the regulator orthogroup j and the target orthogroup k. If a species has more than one gene in the regulator orthogroup or target orthogroup, we consider all possible edges from the genes in the regulator orthogroup to the genes in the target orthogroup, and select the edge that gives the highest improvement in score. That is, if a species l has p regulators and q targets in the jth and kth orthogroups, respectively, we consider all p × q edges for that species. We set Ijkl = 1 if any member of the jth regulator orthogroup has an edge to any member of the kth target orthogroup.
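The handling of duplicated genes described above can be sketched in a few lines. The helper names and the `score_gain` callback are hypothetical stand-ins for whatever scoring routine is used; the sketch only shows the p × q enumeration and the "any member" rule for the orthogroup-level indicator.

```python
from itertools import product

def best_gene_edge(regulator_genes, target_genes, score_gain):
    """Scan all p x q candidate edges in one species; return (edge, gain) with the highest gain."""
    candidates = product(regulator_genes, target_genes)
    return max(((edge, score_gain(*edge)) for edge in candidates),
               key=lambda pair: pair[1])

def orthogroup_edge_indicator(regulator_genes, target_genes, gene_edges):
    """Return 1 if any regulator gene has an edge to any target gene, else 0."""
    return int(any(edge in gene_edges
                   for edge in product(regulator_genes, target_genes)))
```

Note that `best_gene_edge` assumes both gene lists are non-empty; a species with no gene in one of the orthogroups would need to be skipped by the caller.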

Computational complexity of the MRTLE algorithm

MRTLE uses a greedy network learning algorithm that operates on one orthogroup at a time and can be parallelized because the priors decompose at the orthogroup level. The search decomposes into per-orthogroup regulator set estimation problems, where the orthogroup corresponds to the target gene. In each iteration, for a target orthogroup, MRTLE searches among all regulator orthogroups to find the regulator orthogroup that yields the best overall score improvement. For a target orthogroup j and regulator orthogroup i, this score improvement is calculated based on (a) a species-specific contribution that examines all regulator genes in the regulator orthogroup and all target genes in the target orthogroup to find the regulator-target gene pair with the highest score improvement, and (b) the computation of the phylogenetic prior. Operation (a) requires nis × mjs operations in species s, with nis regulator genes in the ith orthogroup and mjs target genes in the jth orthogroup. For N species, the overall complexity of this calculation is O(N ni mj), where ni and mj are the maximum numbers of regulator and target genes in the ith and jth orthogroups, respectively. Operation (b) uses Felsenstein's algorithm, which is linear in the number of species, O(N). Taken together, scoring a given target and regulator orthogroup pair is therefore O(N ni mj), which we write simply as O(Nnm), with n denoting the maximum number of genes in a species in a regulator orthogroup and m denoting the maximum number of genes in a species in a target orthogroup. This search procedure is executed for all regulator orthogroups to find the best move. If R is the total number of regulator orthogroups, finding the best move takes O(RNnm). Finally, the search for the next best regulator is repeated at most as many times as the pre-specified maximum number of regulators a gene can have. Let this be k. Hence the overall complexity of the MRTLE algorithm is O(kRNnm).

We note that the algorithm without the phylogenetic prior would also require O(kRNnm) operations. However, it can be parallelized across species and would therefore be faster.
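To make the linear cost of the phylogenetic prior concrete, the following sketch applies Felsenstein's pruning recursion to a single binary edge state (absent or present) on a small tree. The tree encoding as `(name, children)` tuples, the per-branch `(pg, pm)` table, and the root prior of 0.5 are assumptions made for illustration; the text does not specify these details.

```python
def pruning(node, leaf_states, branch_probs):
    """Return [L0, L1]: probability of the leaf states below `node`
    given that the edge is absent (0) or present (1) at `node`."""
    name, children = node
    if not children:
        state = leaf_states[name]
        return [1.0 - state, float(state)]
    like = [1.0, 1.0]
    for child in children:
        pg, pm = branch_probs[child[0]]       # gain / maintain on the branch to this child
        c0, c1 = pruning(child, leaf_states, branch_probs)
        like[0] *= (1.0 - pg) * c0 + pg * c1  # parent absent
        like[1] *= (1.0 - pm) * c0 + pm * c1  # parent present
    return like

def edge_pattern_probability(tree, leaf_states, branch_probs, root_present=0.5):
    """Probability of one presence/absence pattern of an edge across the leaf species."""
    l0, l1 = pruning(tree, leaf_states, branch_probs)
    return (1.0 - root_present) * l0 + root_present * l1
```

Each branch is visited exactly once, so the cost per regulator-target orthogroup pair grows linearly with the number of species.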

Details of the baseline algorithms compared

We compared MRTLE to two baseline algorithms, GENIE3 and INDEP, both of which learn a regulatory network for each species independently.

GENIE3

GENIE3 is a dependency network learning algorithm that infers the structure of the regulatory network by solving a set of individual regression problems, one per gene. Each regression problem is solved by learning tree-based ensembles (either Random Forests or Extra Trees) that represent the regulatory program of a gene. GENIE3 takes as input an expression data matrix and a set of candidate regulators and outputs a ranking of potential regulatory edges. GENIE3 was one of the best performers in the DREAM network inference challenge (Huynh-Thu et al., 2010; Marbach et al., 2012).

INDEP

The INDEP algorithm is also a dependency network learning algorithm; it infers the structure of the regulatory network by solving a set of individual linear regression problems. INDEP uses a per-gene greedy algorithm that infers the regulators of each gene one at a time and is described in more detail in Siahpirani & Roy (2016) as the Per-Gene Greedy (PGG) algorithm. Briefly, if Ds represents the dataset for the sth species, INDEP aims to learn the graph structure Gs by optimizing P(Gs|Ds), which is proportional to P(Ds|Gs)P(Gs). P(Gs) is defined in the same manner as the species-specific prior of MRTLE. INDEP makes the same assumptions as MRTLE about the Gaussian distribution of the gene expression data (see Sections Modeling a regulatory network in one species and Phylogenetic and species-specific graph priors).

Datasets

We evaluated our learning algorithm on simulated data with known ground truth, as well as with real yeast expression data.

Simulated datasets

Our simulation framework made use of a simple probabilistic process of network structure evolution, parameterized by the probability, pg, of gaining an edge that does not exist in the ancestral species, and the probability, pm, of maintaining an edge that exists in the ancestral species. The simulation started from an ancestral network of 300 genes and 33 regulators and the species tree shown in Figure 1B, and evolved each possible edge down each branch of the tree until the leaves of the species tree were reached. We set pg = 0.2 and pm = 0.8 for this process. Once we had the network structures for each species, we used GeneNetWeaver to generate data for each species (Schaffter et al., 2011). GeneNetWeaver uses stochastic differential equations to generate expression data. Specifically, each sample in each dataset represents a steady state measurement after perturbing a node and running the system to steady state. Each dataset consisted of 300 samples.
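The structure-evolution step of this simulation can be sketched as below. The toy tree, gene names, and ancestral edges are hypothetical and much smaller than the 300-gene, 33-regulator network used in the study; the sketch only illustrates how each candidate edge is propagated down a branch with probability pm if present in the parent and pg otherwise. Generating expression data from the resulting networks (done with GeneNetWeaver in the study) is not covered.

```python
import random

def evolve_network(node, edges, candidate_edges, pg, pm, rng, leaf_networks):
    """Recursively evolve an edge set from `node` down to the leaves of the tree."""
    name, children = node
    if not children:
        leaf_networks[name] = edges
        return
    for child in children:
        # Each candidate edge is kept with probability pm or gained with probability pg.
        child_edges = {e for e in candidate_edges
                       if rng.random() < (pm if e in edges else pg)}
        evolve_network(child, child_edges, candidate_edges, pg, pm, rng, leaf_networks)

# Toy setup (hypothetical names): 2 regulators, 4 targets, 3 extant species.
regulators = ["R1", "R2"]
targets = ["G1", "G2", "G3", "G4"]
candidate_edges = [(r, g) for r in regulators for g in targets]
ancestral = {("R1", "G1"), ("R1", "G2"), ("R2", "G3")}
tree = ("anc", [("sp1", []), ("inner", [("sp2", []), ("sp3", [])])])

leaf_networks = {}
evolve_network(tree, ancestral, candidate_edges, 0.2, 0.8, random.Random(0), leaf_networks)
```

Species that share a longer path from the root inherit more of the same draws, which produces the phylogenetic pattern of conservation that the prior is designed to exploit.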

Yeast expression datasets

We applied our algorithm to real expression data from six ascomycete yeast species (Thompson et al., 2013; Roy et al., 2013b; Wapinski et al., 2010), and a new osmotic stress response dataset collected in this work (GSE94628). These data measure gene expression for six species, namely S. cerevisiae, C. glabrata, S. castellii, K. lactis, C. albicans and S. pombe in four stresses: glucose depletion, heat shock, osmotic stress, and oxidative stress (oxidative stress data was not available for S. pombe). A total of 35 measurements were used for C. albicans and C. glabrata, 30 measurements were used for S. cerevisiae, S. castellii, and K. lactis, and 21 measurements were used for S. pombe for which oxidative stress data was not available. In addition to the phylogenetic priors, our study in yeast included species-specific sequence motifs identified using the Cladeoscope algorithm, developed by Habib et al. (Habib et al., 2012). We learned regulatory networks using a gene set drawn from 6,547 orthogroups, which included genes with complex orthology relationships and many duplication levels. 459 of these orthogroups contained at least one potential regulator in at least one species.

Evaluation of learned networks

We assessed the effectiveness of network reconstruction using the MRTLE approach by comparing against two baseline approaches described above, GENIE3 and INDEP, on both simulated and real expression data.

Experiments on simulated data

GENIE3

We downloaded GENIE3 from http://homepages.inf.ed.ac.uk/vhuynht/software.html. We ran GENIE3 on the entire dataset of 300 samples. GENIE3's internal ensemble framework automatically generates confidence estimates for individual regulatory edges. GENIE3 has two main parameters: the number of trees, nb, and the number of features to be used at each split, K. We tested multiple configurations for each parameter: nb ∈ {100,500,1000,1500} and K ∈ {sqrt, all}, where sqrt uses the square root of the number of regulators and all uses all the regulators. For each configuration of these parameters, GENIE3 outputs a confidence value for the presence of a regulatory edge for all potential edges. To select a particular configuration we used AUPR (described below in Evaluation metrics). We found the configuration of nb = 1500 and K = all to give the best AUPR and used the network inferred from this setting for our downstream evaluation. However, the overall performance of GENIE3 was stable across different parameter configurations (Figure S6A).

INDEP

INDEP was run within a stability selection framework, where a network was learned on each of 50 random subsamples of the data containing 150 samples each. This allowed us to compute a confidence for each regulatory edge, defined as the fraction of data subsets for which the edge was selected. The INDEP algorithm has two parameters that control the influence of the prior distribution: β0 for controlling the sparsity of the inferred network and β1 for controlling the influence of the sequence-specific motifs. In the simulation case β1 was set to 0. We tried different parameter configurations of β0 ∈ {−0.6,−0.8,−1.0,−1.2,−1.4,−1.6,−1.8,−2.0,−3.0,−4.0,−5.0}. As with GENIE3, we used AUPR to select the best setting. We found β0 = −3.0 to give the best performance; however, the overall performance of INDEP was stable across different parameter configurations (Figure S6B).
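The stability selection procedure can be sketched generically as follows. Here `learn_network` is a hypothetical placeholder for any structure learner (INDEP or MRTLE in the study) that takes a list of samples and returns the edges it selects; the defaults of 50 subsamples of 150 samples mirror the simulation setting described above, while the function name and seed handling are assumptions.

```python
import random
from collections import Counter

def stability_confidence(samples, learn_network, n_subsamples=50,
                         subsample_size=150, seed=0):
    """Edge confidence = fraction of random subsamples in which the edge is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_subsamples):
        subset = rng.sample(samples, subsample_size)   # sampling without replacement
        counts.update(set(learn_network(subset)))      # count each edge once per run
    return {edge: count / n_subsamples for edge, count in counts.items()}
```

Ranking edges by this confidence gives the thresholds needed to trace a precision-recall curve.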

MRTLE

Similarly to INDEP, MRTLE was also run within a stability selection framework, with 50 random subsamples of the data each comprising 150 samples. MRTLE has multiple parameters: pg for controlling the probability of an edge being gained in the child species, pm for controlling the probability of maintaining an existing edge, β0 for controlling sparsity, and β1 for controlling the influence of the motif prior. As in the INDEP case, we set β1 = 0. We tested different configurations for MRTLE: pg ∈ {0.1,0.2,0.3,0.4}, pm ∈ {0.7,0.8,0.9}, and β0 ∈ {−0.6,−0.8,−1.0,−1.2,−1.4,−1.6,−1.8,−2.0}. We found β0 = −2.0, pg = 0.2, and pm = 0.9 to give the best results; however, the performance of MRTLE was stable across different configuration settings (Figure S6C).

Evaluation in real expression setting

The evaluation proceeded in the same way as in the simulation case where we tried different parameter configurations and selected the one with the highest AUPR.

GENIE3

We ran GENIE3 with different values of the features used per split, K ∈ {all, sqrt}, and number of trees, nb ∈ {100,500,1000,1500,2000}. We selected the best configuration based on the AUPR performance on the MacIsaac gold standard available for S. cerevisiae (MacIsaac et al., 2006). This configuration was K = sqrt and nb = 500, but GENIE3 was quite stable across different parameter configurations (Figure S7A).

INDEP

To infer the networks, we used a stability selection framework, where we divided the expression datasets into 25 equal partitions each consisting of 20 measurements of the available stress response measurements. In S. pombe, for which oxidative stress data was not available, we partitioned the data into subsamples consisting of 14 measurements. We then inferred networks using each of the 25 data partitions, and calculated a confidence for each regulator-target interaction for each species, by calculating the percentage of the 25 networks that each edge was present in.

We used the “with motif” case to determine the optimal parameter configurations. Specifically, we set the sparsity parameter β0 ∈ {−1,−2,−3,−4,−5} and the motif parameter β1 ∈ {1,2,3,4,5} (Figure S7B). We used AUPR (Davis & Goadrich, 2006) computed on the Hu et al. dataset from S. cerevisiae to determine the best setting (a similar strategy as for GENIE3), and found β0 = −5.0 and β1 = 5.0 to give the best results.

In the case where motifs were not used, we need only to specify the sparsity parameter, β0. We selected β0 = −5.0, because this was the configuration that was ideal for the motif case. We checked the sensitivity of INDEP to multiple settings of β0: β0 ∈ {−1.0,−2.0,−3.0,−4.0,−5.0}. INDEP results were very stable across different β0 values (Figure S7C).

MRTLE

Similarly to INDEP, we used a stability selection framework to learn regulatory networks with MRTLE. We used settings of pg and pm that were derived from the species tree branch lengths, and used β0 ∈ {−0.8,−0.9} and β1 ∈ {3,4,5}. We ran MRTLE with these configurations using only those target orthogroups without duplications, and computed AUPR on the Hu et al. dataset. We found the AUPRs to be very stable (Figure S7D); β0 = −0.9 and β1 = 4 gave the best AUPR, and we used this configuration in all further analyses.

Evaluation metrics

We used different evaluation metrics to assess the quality of the inferred networks.

Area under the precision recall curve

On both simulated data (all species) and real expression data (S. cerevisiae), we compared the inferred networks with the true networks based on the area under the precision-recall curve (AUPR), computed using the aupr tool from Davis and Goadrich (Davis & Goadrich, 2006). Precision is defined as the ratio of the number of true positive edges to the total number of predicted edges. Recall is defined as the ratio of the number of true positive edges to the number of true edges. To compute the precision-recall curve, we need to estimate precision and recall at different confidence thresholds for edges. For MRTLE and INDEP, we obtained these confidences using stability selection. That is, we generated random subsamples of the data, learned a network from each subsample, and computed a confidence for each edge representing the fraction of inferred networks in which the edge was present. GENIE3 has its own bootstrap procedure within Random Forests learning and directly outputs a confidence for each edge. The area under the precision-recall curve gives an overall assessment of the quality of the inferred networks.
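The precision and recall definitions above translate directly into code. The sketch below ranks edges by confidence, records one (recall, precision) point per rank, and sums the area with a simple step rule. This is a simplification for illustration: the study used the aupr tool of Davis and Goadrich, which interpolates the curve differently, so values from this sketch are not expected to match that tool exactly. Function names are hypothetical, and ties in confidence are broken arbitrarily.

```python
def precision_recall_points(confidence, true_edges):
    """Return a list of (recall, precision) points, one per confidence rank."""
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    points, true_positives = [], 0
    for k, edge in enumerate(ranked, start=1):
        true_positives += edge in true_edges
        points.append((true_positives / len(true_edges), true_positives / k))
    return points

def step_aupr(confidence, true_edges):
    """Area under the precision-recall points using a step (average precision) rule."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(confidence, true_edges):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area
```

True edges that never receive a confidence value are still counted in the recall denominator, so an incomplete ranking is penalized rather than ignored.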

Pattern of phylogenetic conservation

We assessed the quality of the inferred regulatory networks using the extent of inferred conservation and the ability to capture phylogenetically coherent patterns of conservation between species. A pattern of conservation is said to be phylogenetic if it obeys the phylogenetic structure, that is, networks for species that are close on the phylogeny should exhibit greater similarity than networks of species that are further apart. We used an F-score measure to assess the similarity between pairs of networks, where F-score is defined as the harmonic mean of precision, P, and recall, R: $\text{F-score} = \frac{2PR}{P+R}$. This required us to specify a network at a specific confidence threshold for each species. For the simulated data we picked these thresholds to obtain ≈3,000 edges. For the real data, we picked thresholds to obtain ≈30,000 edges.
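A minimal sketch of the pairwise F-score follows. It assumes that the edges of both networks have already been mapped to a shared identifier space (for example, pairs of orthogroup identifiers) so that set intersection is meaningful; that mapping step and the function name are assumptions for illustration.

```python
def network_f_score(edges_a, edges_b):
    """Harmonic mean of precision and recall, treating network B as the reference."""
    shared = len(edges_a & edges_b)
    if shared == 0:
        return 0.0
    precision = shared / len(edges_a)
    recall = shared / len(edges_b)
    return 2 * precision * recall / (precision + recall)
```

The measure is symmetric in the two networks, since swapping them only exchanges precision and recall.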

In the simulation setting, since we had access to the true networks, we could additionally directly assess the extent of conservation and divergence present, and compare this to the conservation and divergence present in the inferred networks. This comparison was done by defining the predicted common edges between two inferred networks and comparing to true common edges using AUPR.

Evaluating regulator-target edge predictions in S. cerevisiae

To evaluate our networks inferred for S. cerevisiae when motifs were withheld, we used a ChIP-chip derived TF-target gene network from MacIsaac et al. (MacIsaac et al., 2006), which has previously been used as a gold standard in the field (Marbach et al., 2012). When evaluating the full power of MRTLE with motifs included in its prior formulation, we used a dataset from Hu et al., which was constructed by systematically examining the genome-wide expression profile in 268 individual deletion strains, each strain representing a transcription factor (TF) (Hu et al., 2007). The regulatory network was defined using a two-step approach. First, an initial network was defined as the total set of significantly differentially expressed genes in each deletion strain. Second, this network was refined using a regulatory epistasis approach in an effort to remove indirect interactions. See Hu et al. (Hu et al., 2007) for details.

Evaluation of regulator-target edge predictions in non- S. cerevisiae species

The evaluation of edges in the non-S. cerevisiae species was done using available ChIP-chip datasets for a handful of transcription factors, namely MCM1 from (Tuch et al., 2008) and CBF1, HMO1, FHL1, IFH1, and RAP1 from (Lavoie et al., 2010). We evaluated the quality of the inferred interactions for these TFs based on AUPR and fold enrichment of ChIP-based targets in the MRTLE inferred networks. Fold enrichment is defined as the ratio of the observed over expected proportion of true edges as follows:

$$\text{Fold enrichment} = \frac{(\#\text{ of true positive targets})/(\#\text{ of predicted targets})}{(\#\text{ of actual targets})/(\#\text{ of genes in dataset})}.$$
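The fold enrichment ratio can be computed as in the short sketch below; the function name and the toy counts in the usage comment are hypothetical.

```python
def fold_enrichment(predicted_targets, actual_targets, n_genes):
    """Observed fraction of true targets among predictions over the expected fraction."""
    true_positives = len(predicted_targets & actual_targets)
    observed = true_positives / len(predicted_targets)
    expected = len(actual_targets) / n_genes
    return observed / expected

# Example: 10 predicted targets, 5 of them among 50 ChIP-based targets, 1000 genes in total.
```

A value of 1 means the predictions contain true targets at the rate expected by chance; values above 1 indicate enrichment.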

Inference of stress-specific regulatory networks for multiple species

To define the regulatory network for each stress, e.g., the osmotic stress response, we first used the Arboretum algorithm to define five transcriptional modules as described in Roy et al. (Roy et al., 2013b). We next filtered the MRTLE regulatory network inferred in each species using the module assignments, such that an edge was removed from the network if its two end points were in different modules, resulting in a single stress-specific network for each species. We refer to this combined approach as Arboretum+MRTLE.

Assessing a regulator's role as repressive or activating

We used two measures to assess whether a regulator acted in a repressive or activating manner. In the first, we used a Hypergeometric distribution to calculate the significance of overlap between the MRTLE inferred targets of a regulator and a transcriptional module. The association of a regulator r with a repressed or induced module was quantified based on the difference $-\log(p\text{-value}_{ACT}) - (-\log(p\text{-value}_{REP}))$, where p-valueACT and p-valueREP are the Hypergeometric test p-values obtained when testing for enrichment of r's targets in the most induced or most repressed module, respectively. Regulators with negative values for this measurement were considered to be repressive regulators, while positive values indicated an activator. In the second analysis, we directly compared the expression of the targets of a regulator in the induced and repressed modules based on a one-sided t-test for each time point. We required a regulator to have at least 5 targets in one module (e.g. most induced) and at least 2 targets in the other (e.g. most repressed) module and tested whether the expression of the targets in the induced module was significantly higher than that of the targets in the repressed module. Next, to assign a sign to a regulator, we used the difference in the number of targets in each module; if a regulator had more targets in the repressed module, it was considered a repressor, whereas if it had more targets in the induced module, it was called an activator. This too gave a single statistic that could be used to assess whether a regulator was mostly repressive or activating.
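The first association measure can be sketched using only the standard library. The upper-tail Hypergeometric p-value is computed directly from binomial coefficients, and the score is the difference of negative log p-values as defined above. The function names are hypothetical, the natural logarithm is an assumption (the base is not stated in the text and does not affect the sign), and targets are assumed to be drawn from the same gene population as the modules.

```python
import math

def hypergeom_tail(overlap, population, module_size, n_targets):
    """P(X >= overlap) when n_targets genes are drawn from `population`
    genes of which `module_size` belong to the module."""
    total = math.comb(population, n_targets)
    upper = min(module_size, n_targets)
    hits = sum(math.comb(module_size, i) * math.comb(population - module_size, n_targets - i)
               for i in range(overlap, upper + 1))
    return hits / total

def association_score(targets, induced_module, repressed_module, population):
    """Positive values suggest an activator, negative values a repressor."""
    p_act = hypergeom_tail(len(targets & induced_module), population,
                           len(induced_module), len(targets))
    p_rep = hypergeom_tail(len(targets & repressed_module), population,
                           len(repressed_module), len(targets))
    return -math.log(p_act) - (-math.log(p_rep))
```

A regulator whose targets fall mostly in the induced module receives a small p-valueACT and therefore a positive score.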

Defining targets of selected regulators based on LIMMA

To validate predicted targets of key transcription factors, we measured mRNA levels using MiSeq and used the LIMMA software (Smyth et al., 2005), applying it to salt stress datasets from wild type and mutant strains of yeast. A wild type (Scer.WT) and an msn2/msn4 knockout mutant (Scer.msn2.4) S. cerevisiae strain were used to define the targets of msn2/4, and a wild type and sko1 mutant strain of C. albicans were used to define the targets of sko1 in C. albicans. For S. cerevisiae, two replicate RNA-seq experiments were performed for each of four conditions for both wild type and mutant: (1) T=0 minutes under no salt stress (BMW.T0), (2) T=20 minutes under no salt stress (BMW.T20), (3) T=0 minutes under a KCl salt stress treatment (KCL.T0), and (4) T=20 minutes under a KCl salt stress treatment (KCL.T20). Using LIMMA, differentially expressed genes were called for msn2/msn4 in S. cerevisiae with the following contrast function: (Scer.WT.KCL.T20-Scer.WT.BMW.T0)-(Scer.msn2.4.KCL.T20-Scer.msn2.4.BMW.T0). The rationale for this contrast function is that the genes that are under MSN2/4 control under osmotic stress are those whose expression changes from T0 to T20 in the wild type, but not in the msn2/4 strain when subjected to the same stress. Similarly for C. albicans, the contrast function used was (Calb.WT.KCL.T20-Calb.WT.BMW.T0)-(Calb.sko1.KCL.T20-Calb.sko1.BMW.T0). A similar contrast function was applied for the other strains as well. While targets were called for other species and regulators, only in the above two cases did MRTLE and the LIMMA analysis both find a sufficient number of targets to allow for enrichment analyses. The LIMMA results for these two contrasts provided us with a log-fold change and an adjusted p-value (q-value) measuring the significance of differential expression of each gene. A q-value < 0.05 was chosen to select targets of msn2/msn4 in S. cerevisiae and sko1 in C. albicans. These target lists were then used in the downstream analyses. When comparing our MRTLE results to the msn2/4 double knockout, any gene predicted to be regulated by either msn2 or msn4 by MRTLE was considered.

Estimation of gain and loss rates in MRTLE inferred network

Rates of target gain and loss were calculated for each regulator orthogroup by modeling gain and loss of targets with a continuous-time Markov process and using an expectation-maximization (EM) based approach to estimate the rates, as in Hobolth & Jensen and Garber et al. (Hobolth & Jensen, 2005; Garber et al., 2009). When assessing the rates in MRTLE, three separate sets of rates were calculated (Fig 3E, F). The first allowed for many-to-many orthology relationships within a regulator group, constructing rate matrices for each possible mapping when regulator duplications were present (Fig 3E, F; All). In the second set, we separated the effect of genes lost from the genome from the effect of lost regulator targets. For this, we calculated rates for each regulator after removing from consideration any targets that did not have uniform, one-to-one orthology (Fig 3E, F; Uniform Targets). In the third set, we tested whether double-counting of certain regulators biased the results. To address this, we calculated the rates by taking the average rate over the possible orthology mappings for a regulator (Fig 3E, F; Collapsed Regulators).

Data and Software Availability

The MRTLE software and inferred networks are available at https://bitbucket.org/roygroup/mrtle. The expression datasets generated as part of this study are available in GEO (GSE94628).

Supplementary Material

Figure S1. Evaluation of MRTLE, INDEP, and GENIE3 on simulated data, Related to Figure 1.

A. Precision-recall curves comparing networks inferred by MRTLE (red lines), INDEP (blue lines), and GENIE3 (green lines) to the seven simulated ground truth networks. The greater the area under a curve (AUPR) the better the method. The table shows the AUPR of all three methods on the seven networks.

Figure S2. Comparison of the extent of conservation of MCM1 targets inferred by MRTLE, GENIE3 and INDEP, Related to Figure 2.

Each bar indicates the percentage of genes predicted to be regulated by MCM1 in Species A that can be mapped via orthology from Species A to Species B and are predicted to be regulated by MCM1 in Species B. For example, in the first set of comparisons, S. cerevisiae is Species A, and K. lactis is species B.

Figure S3. Assessment of INDEP and MRTLE inferred targets for each TF from the Lavoie et al. dataset, Related to Figure 2.

A-B. Precision-recall curves per TF evaluating the target predictions made by MRTLE and INDEP for six TFs for S. cerevisiae (A) and C. albicans (B) using ChIP-chip datasets from Lavoie et al. as a gold standard.

Figure S4. Assessing rates of target gain and loss for regulators in MRTLE-inferred networks at varying confidence thresholds, Related to Figure 3.

A-B. CDFs showing the gain (A) and loss (B) rates for regulators with duplications (blue curve) and without duplications (red curve), calculated for the top ∼30,000 edges. C-D. CDFs showing the gain (C) and loss (D) rates for regulators with duplications (blue curve) and without duplications (red curve), calculated for the top ∼40,000 edges.

Figure S5. Analysis of sequence specificity of YAP1/CAD1 and HSF1/SKN7 families, Related to Figure 3.

Shown are the sequence logos of S. cerevisiae motifs for the SKN7 and CAD1 regulators from the Cladeoscope analysis, and of their paralogs. One possible origin of target set divergence for regulators is a change in the preferred binding sites of the regulators. One way to examine this phenomenon is to look at the motifs of a TF and its paralogs. SKN7 and CAD1 are examples of regulators with paralogs and with motif information in our dataset. SKN7 and its paralog (HSF1) have identifiably different motifs, while CAD1 and its paralog (YAP1) have motifs that are more similar. Because the binding sites of SKN7 and HSF1 differ, their associated target gene sets are likely more divergent than those of CAD1 and YAP1.

Figure S6. Sensitivity analysis of GENIE3, INDEP, and MRTLE parameters on simulated data, Related to Figure 1.

A-C. AUPRs obtained on simulated data by running GENIE3 (A), INDEP (B), and MRTLE (C) with various parameter settings for each algorithm.

Figure S7. Sensitivity analysis of GENIE3, INDEP, and MRTLE parameters on real data, Related to Figure 2.

A. GENIE3 evaluated with varying parameter settings on the gold standard from MacIsaac et al. GENIE3 does not incorporate prior information such as motif instances into its learning framework. B. INDEP evaluated with varying parameter settings on the gold standard from Hu et al., with motifs included as prior knowledge. C. INDEP evaluated with varying parameter settings on the gold standard from MacIsaac et al., with motifs not included as prior knowledge. D. MRTLE evaluated with varying parameter settings on the gold standard from Hu et al., with motifs included as prior knowledge. Only orthogroups without duplications were included in this analysis to reduce the computational burden of these experiments.

Table S1. Yeast dataset statistics, Related to Figure 2.

Shown are the numbers of regulators and targets, the numbers of orthogroups that contain these regulators and targets, and the number of expression samples used for each yeast species.

Table S2. Target gain and loss rates, Related to Figure 3.

Shown are the target gain and loss rates for each regulator, as well as the number of target orthogroups (Target Ogs) regulated by at least one regulator in a species in this orthogroup.

Table S3. Gene Ontology (GO) process annotations of regulators with high gain rates, Related to Figure 3.

Shown are all regulators with high gain rates (2 STD above the mean) and their GO process annotations.

Table S4. OSR network hubs, Related to Figure 5.

Top ten hubs defined based on the out degree in the repressed and induced OSR networks inferred by MRTLE.

Table S5. Regulator activator/repressor associations, Related to Figure 6 and Figure 7.

Shown are the HSR and OSR association scores calculated for each regulator. Negative numbers indicate an association with repression; positive numbers indicate an association with induction. Regulators with no predicted targets are not shown. A subset of these regulators are shown in Figures 6A and 7A.

Highlights.

  • Integrating phylogeny, motifs and expression improves regulatory network inference

  • A phylogenetic framework allows genome-scale network inference in non-model species

  • Comparative analyses of predicted networks identify properties of network evolution

  • Stress response networks of non-model organisms are inferred and validated

Acknowledgments

This research was made possible through generous support from NSF (NSF DBI: 1350677), the Sloan Foundation, the McDonnell Foundation (S.R), from HHMI (E.K.O), from NIH (R01CA119176-01, Pioneer award), Broad Institute, HHMI, Burroughs Wellcome Fund Career Award at the Scientific Interface and the Sloan Foundation (A.R) and by an NIH Ruth L. Kirschstein National Research Service Award (J.K).

Footnotes

Author contributions: Conceived and designed study: S.R, A.R, E.K.O, J.K, D.A.T. Algorithm development and implementation: S.R, C.K. Computational experiments: S.R, C.K, S.K. Expression time-series and validation experiments: J.K, A.S, A.L, T.D, K.D, D.A.T. Wrote the paper: S.R, C.K, J.K, A.L. All authors provided feedback on the manuscript.

References

  1. Bergmann S, Ihmels J, Barkai N. Similarities and Differences in Genome-Wide Expression Data of Six Organisms. PLoS biology. 2003;2(1):e9. doi: 10.1371/journal.pbio.0020009.
  2. Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M. Divergence of Transcription Factor Binding Sites Across Related Yeast Species. Science (New York, NY) 2007;317(5839):815–819. doi: 10.1126/science.1140748.
  3. Brawand D, Soumillon M, Necsulea A, Julien P, Csárdi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M, Albert FW, Zeller U, Khaitovich P, Grützner F, Bergmann S, Nielsen R, Pääbo S, Kaessmann H. The evolution of gene expression levels in mammalian organs. Nature. 2011;478(7369):343–348. doi: 10.1038/nature10532.
  4. Brawand D, Wagner CE, Li YI, Malinsky M, Keller I, Fan S, Simakov O, Ng AY, Lim ZWW, Bezault E, Turner-Maier J, Johnson J, Alcazar R, Noh HJJ, Russell P, Aken B, Alföldi J, Amemiya C, Azzouzi N, Baroiller JFCOF, Barloy-Hubler F, Berlin A, Bloomquist R, Carleton KL, Conte MA, D'Cotta H, Eshel O, Gaffney L, Galibert F, Gante HF, Gnerre S, Greuter L, Guyon R, Haddad NS, Haerty W, Harris RM, Hofmann HA, Hourlier T, Hulata G, Jaffe DB, Lara M, Lee AP, MacCallum I, Mwaiko S, Nikaido M, Nishihara H, Ozouf-Costaz C, Penman DJ, Przybylski D, Rakotomanga M, Renn SC, Ribeiro FJ, Ron M, Salzburger W, Sanchez-Pulido L, Santos ME, Searle S, Sharpe T, Swofford R, Tan FJ, Williams L, Young S, Yin S, Okada N, Kocher TD, Miska EA, Lander ES, Venkatesh B, Fernald RD, Meyer A, Ponting CP, Streelman JT, Lindblad-Toh K, Seehausen O, Di Palma F. The genomic substrate for adaptive radiation in African cichlid fish. Nature. 2014. doi: 10.1038/nature13726.
  5. Capaldi AP, Kaplan T, Liu Y, Habib N, Regev A, Friedman N, O'Shea EK. Structure and function of a transcriptional network activated by the MAPK Hog1. Nature genetics. 2008;40(11):1300–1306. doi: 10.1038/ng.235.
  6. Carroll SB. Endless Forms: The Evolution of Gene Regulation and Morphological Diversity. Cell. 2000;101(6):577–580. doi: 10.1016/s0092-8674(00)80868-5.
  7. Chen D, Wilkinson CR, Watt S, Penkett CJ, Toone WM, Jones N, Bähler J. Multiple pathways differentially regulate global oxidative stress responses in fission yeast. Molecular Biology of the Cell. 2008;19(1):308–317. doi: 10.1091/mbc.E07-08-0735.
  8. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. New York, New York, USA: ACM Press; 2006. pp. 233–240.
  9. Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS. Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biol. 2007;5(1):e8. doi: 10.1371/journal.pbio.0050008.
  10. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution. 1981;17(6):368–376. doi: 10.1007/BF01734359.
  11. Fisher S, Barry A, Abreu J, Minie B, Nolan J, Delorey TM, Young G, Fennell TJ, Allen A, Ambrogio L, Berlin AM, Blumenstiel B, Cibulskis K, Friedrich D, Johnson R, Juhn F, Reilly B, Shammas R, Stalker J, Sykes SM, Thompson J, Walsh J, Zimmer A, Zwirko Z, Gabriel S, Nicol R, Nusbaum C. A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries. Genome biology. 2011;12(1):R1. doi: 10.1186/gb-2011-12-1-r1.
  12. Friedman N. Inferring Cellular Networks using Probabilistic Graphical Models. Science (New York, NY) 2004;303:799–805. doi: 10.1126/science.1094068.
  13. Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian Networks to Analyze Expression Data. Journal of Comp Biol. 2000;7(3-4):601–620. doi: 10.1089/106652700750050961.
  14. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25(12):i54–i62. doi: 10.1093/bioinformatics/btp190.
  15. Garfield DA, Wray GA. The Evolution of Gene Regulatory Interactions. BioScience. 2010;60(1):15–23.
  16. Gasch AP. Comparative genomics of the environmental stress response in ascomycete fungi. Yeast (Chichester, England) 2007;24(11):961–976. doi: 10.1002/yea.1512.
  17. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO. Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Molecular Biology of the Cell. 2000;11(12):4241–4257. doi: 10.1091/mbc.11.12.4241.
  18. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics (Oxford, England) 2011;27(7):1017–1018.
  19. Habib N, Wapinski I, Margalit H, Regev A, Friedman N. A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Molecular Systems Biology. 2012;8(1). doi: 10.1038/msb.2012.50.
  20. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104. doi: 10.1038/nature02800.
  21. Heckerman D, Chickering DM, Meek C, Rounthwaite R, Kadie C. Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. 2001.
  22. Hiyama A, Taira W, Otaki JM. Color-Pattern Evolution in Response to Environmental Stress in Butterflies. Frontiers in Genetics. 2012;3:15. doi: 10.3389/fgene.2012.00015.
  23. Hobolth A, Jensen JL. Statistical Inference in Evolutionary Models of DNA Sequences via the EM Algorithm. Statistical applications in genetics and molecular biology. 2005;4(1). doi: 10.2202/1544-6115.1127.
  24. Hoffmann AA, Willi Y. Detecting genetic responses to environmental change. Nat Rev Genet. 2008;9(6):421–432. doi: 10.1038/nrg2339.
  25. Homann OR, Dea J, Noble SM, Johnson AD. A Phenotypic Profile of the Candida albicans Regulatory Network. PLoS genetics. 2009;5(12):1–12. doi: 10.1371/journal.pgen.1000783.
  26. Hu Z, Killion PJ, Iyer VR. Genetic reconstruction of a functional transcriptional regulatory network. Nature Genetics. 2007;39(5):683–687. doi: 10.1038/ng2012.
  27. Hughes TR, de Boer CG. Mapping Yeast Transcriptional Networks. Genetics. 2013;195(1):9–36. doi: 10.1534/genetics.113.153262.
  28. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PloS one. 2010;5(9):e12776. doi: 10.1371/journal.pone.0012776.
  29. Ihmels J, Bergmann S, Berman J, Barkai N. Comparative gene expression analysis by differential clustering approach: application to the Candida albicans transcription program. PLoS Genet. 2005;1(3). doi: 10.1371/journal.pgen.0010039.
  30. Joshi A, Beck Y, Michoel T. Multi-species network inference improves gene regulatory network reconstruction for early embryonic development in Drosophila. 2014. doi: 10.1089/cmb.2014.0290.
  31. Kim HD, Shay T, O'Shea EK, Regev A. Transcriptional Regulatory Circuits: Predicting Numbers from Alphabets. Science (New York, NY) 2009;325(5939):429–432. doi: 10.1126/science.1171347.
  32. King MC, Wilson AC. Evolution at two levels in humans and chimpanzees. Science (New York, NY) 1975;188(4184):107–116. doi: 10.1126/science.1090005.
  33. Kristiansson E, Österlund T, Gunnarsson L, Arne G, Larsson JG, Nerman O. A novel method for cross-species gene expression analysis. BMC Bioinformatics. 2013;14(1):70. doi: 10.1186/1471-2105-14-70.
  34. Lauritzen SL. Graphical Models. New York, USA: Oxford University Press; 1996.
  35. Lavoie H, Hogues H, Mallick J, Sellam A, Nantel A, Whiteway M. Evolutionary Tinkering with Conserved Components of a Transcriptional Regulatory Network. PLoS Biol. 2010;8(3):e1000329. doi: 10.1371/journal.pbio.1000329.
  36. Li H, Johnson AD. Evolution of Transcription Networks - Lessons from Yeasts. Curr Biol. 2010;20(17):R746–R753. doi: 10.1016/j.cub.2010.06.056.
  37. Macisaac K, Wang T, Gordon DB, Gifford D, Stormo G, Fraenkel E. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics. 2006;7(1):113. doi: 10.1186/1471-2105-7-113.
  38. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, DREAM5 Consortium, Kellis M, Collins JJ, Stolovitzky G. Wisdom of crowds for robust gene network inference. Nature Methods. 2012;9(8):796–804. doi: 10.1038/nmeth.2016.
  39. Markowetz F, Spang R. Inferring cellular networks - a review. BMC bioinformatics. 2007;8(Suppl 6):S5. doi: 10.1186/1471-2105-8-S6-S5.
  40. Nicholls S, Straffon M, Enjalbert B, Nantel AE, Macaskill S, Whiteway M, Brown AJP. Msn2- and Msn4-Like Transcription Factors Play No Obvious Roles in the Stress Responses of the Fungal Pathogen Candida albicans. Eukaryotic Cell. 2004;3(5):1111–1123. doi: 10.1128/EC.3.5.1111-1123.2004.
  41. Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, MacIsaac KD, Rolfe PA, Conboy CM, Gifford DK, Fraenkel E. Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nature Genetics. 2007;39(6):730–732. doi: 10.1038/ng2047.
  42. Pérez JC, Fordyce PM, Lohse MB, Hanson-Smith V, DeRisi JL, Johnson AD. How duplicated transcription regulators can diversify to govern the expression of nonoverlapping sets of genes. Genes & Development. 2014;28(12):1272–1277. doi: 10.1101/gad.242271.114.
  43. Pe'er D, Tanay A, Regev A. MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals. J Mach Learn Res. 2006;7:167–189.
  44. Penfold CA, Millar JBA, Wild DL. Inferring orthologous gene regulatory networks using interspecies data fusion. Bioinformatics. 2015;31(12):i97–i105. doi: 10.1093/bioinformatics/btv267.
  45. Pougach K, Voet A, Kondrashov FA, Voordeckers K, Christiaens JF, Baying B, Benes V, Sakai R, Aerts J, Zhu B, Van Dijck P, Verstrepen KJ. Duplication of a promiscuous transcription factor drives the emergence of a new regulatory network. Nature Communications. 2014;5:4868. doi: 10.1038/ncomms5868.
  46. Ramsdale M, Selway L, Stead D, Walker J, Yin Z, Nicholls SM, Crowe J, Sheils EM, Brown AJP. MNL1 Regulates Weak Acid–induced Stress Responses of the Fungal Pathogen Candida albicans. Molecular biology of the cell. 2008;19(10):4393–4403. doi: 10.1091/mbc.E07-09-0946.
  47. Romero IG, Ruvinsky I, Gilad Y. Comparative studies of gene expression and the evolution of gene regulation. Nature Reviews Genetics. 2012;13(7):505–516. doi: 10.1038/nrg3229.
  48. Roy S, Lagree S, Hou Z, Thomson JA, Stewart R, Gasch AP. Integrated Module and Gene-Specific Regulatory Inference Implicates Upstream Signaling Networks. PLoS computational biology. 2013a;9(10):e1003252. doi: 10.1371/journal.pcbi.1003252.
  49. Roy S, Wapinski I, Pfiffner J, French C, Socha A, Konieczka J, Habib N, Kellis M, Thompson D, Regev A. Arboretum: Reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules. Genome research. 2013b. doi: 10.1101/gr.146233.112.
  50. Sanso M, Gogol M, Ayte J, Seidel C, Hidalgo E. Transcription Factors Pcr1 and Atf1 Have Distinct Roles in Stress- and Sty1-Dependent Gene Regulation. Eukaryotic Cell. 2008;7(5):826–835. doi: 10.1128/EC.00465-07.
  51. Schaffter T, Marbach D, Floreano D. GeneNetWeaver: In silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011. doi: 10.1093/bioinformatics/btr373.
  52. Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, Talianidis I, Flicek P, Odom DT. Five-Vertebrate ChIP-seq Reveals the Evolutionary Dynamics of Transcription Factor Binding. Science (New York, NY) 2010;328(5981):1036–1040. doi: 10.1126/science.1186176.
  53. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature genetics. 2003;34(2):166–176. doi: 10.1038/ng1165.
  54. Siahpirani AF, Roy S. A prior-based integrative framework for functional transcriptional regulatory network inference. Nucleic acids research. 2016:gkw963. doi: 10.1093/nar/gkw963.
  55. Smyth GKK, Michaud JEL, Scott HSS. Use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005. doi: 10.1093/bioinformatics/bti270.
  56. Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nature Genetics. 2004;36(5):492–496. doi: 10.1038/ng1340.
  57. Thompson D, Regev A, Roy S. Comparative Analysis of Gene Regulatory Networks: From Network Reconstruction to Evolution. Annual Review of Cell and Developmental Biology. 2015;31(1). doi: 10.1146/annurev-cellbio-100913-012908.
  58. Thompson DA, Roy S, Chan M, Styczynsky MP, Pfiffner J, French C, Socha A, Thielke A, Napolitano S, Muller P, Kellis M, Konieczka JH, Wapinski I, Regev A, Tautz D. Evolutionary principles of modular gene regulation in yeasts. eLife. 2013;2. doi: 10.7554/eLife.00603.
  59. Tuch BB, Galgoczy DJ, Hernday AD, Li H, Johnson AD. The Evolution of Combinatorial Gene Regulation in Fungi. PLoS Biol. 2008;6(2):e38. doi: 10.1371/journal.pbio.0060038.
  60. Voordeckers K, Pougach K, Verstrepen KJ. How do regulatory networks evolve and expand throughout evolution? Current Opinion in Biotechnology. 2015;34:180–188. doi: 10.1016/j.copbio.2015.02.001.
  61. Wapinski I, Pfeffer A, Friedman N, Regev A. Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007;449(7158):54–61. doi: 10.1038/nature06107.
  62. Wapinski I, Pfiffner J, French C, Socha A, Thompson DA, Regev A. Gene duplication and the evolution of ribosomal protein gene regulation in yeast. Proceedings of the National Academy of Sciences. 2010;107(12):5505–5510. doi: 10.1073/pnas.0911905107.
  63. Wittkopp PJ. Variable gene expression in eukaryotes: a network perspective. Journal of Experimental Biology. 2007;210(9):1567–1575. doi: 10.1242/jeb.002592.
  64. Wohlbach DJ, Thompson DAA, Gasch AP, Regev A. From elements to modules: regulatory evolution in Ascomycota fungi. Current opinion in genetics & development. 2009;19(6):571–578. doi: 10.1016/j.gde.2009.09.007.
  65. Xie D, Chen CC, He X, Cao X, Zhong S. Towards an Evolutionary Model of Transcription Networks. PLoS computational biology. 2011;7(6):e1002064. doi: 10.1371/journal.pcbi.1002064.

Supplementary Materials

Figure S1. Evaluation of MRTLE, INDEP, and GENIE3 on simulated data, Related to Figure 1.

A. Precision-recall curves comparing networks inferred by MRTLE (red lines), INDEP (blue lines), and GENIE3 (green lines) to the seven simulated ground truth networks. A greater area under the precision-recall curve (AUPR) indicates better performance. The table shows the AUPR of all three methods on the seven networks.
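
As an illustration of the AUPR metric used in this figure, the following minimal sketch computes average precision, a common step-wise estimate of the area under a precision-recall curve, for a ranked list of predicted edges. The function name, the toy edge list, and the choice of average precision rather than an interpolated area are assumptions made for illustration; they are not taken from the MRTLE implementation.

```python
def average_precision(ranked_edges, true_edges):
    """Step-wise area under the precision-recall curve for a ranked edge list.

    ranked_edges: (regulator, target) pairs, most confident first.
    true_edges:   collection of (regulator, target) pairs in the gold standard.
    """
    true_edges = set(true_edges)
    if not true_edges:
        raise ValueError("gold standard is empty")
    hits = 0
    total = 0.0
    for rank, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            hits += 1
            total += hits / rank  # precision at the rank where recall increases
    return total / len(true_edges)


# Hypothetical toy data, not from the paper.
ranked = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"), ("TF2", "g1")]
truth = {("TF1", "g1"), ("TF2", "g3")}
print(average_precision(ranked, truth))
```

True edges that never appear in the ranked list contribute zero precision, so a method that recovers fewer gold-standard edges is penalized.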

Figure S2. Comparison of the extent of conservation of MCM1 targets inferred by MRTLE, GENIE3, and INDEP, Related to Figure 2.

Each bar indicates the percentage of genes predicted to be regulated by MCM1 in Species A that can be mapped via orthology from Species A to Species B and are predicted to be regulated by MCM1 in Species B. For example, in the first set of comparisons, S. cerevisiae is Species A, and K. lactis is Species B.
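
The percentage described in this caption can be sketched as follows. This is a minimal illustration, assuming a one-to-one orthology mapping stored as a dictionary and taking all predicted Species A targets as the denominator (a literal reading of the caption); the function name and gene identifiers are hypothetical.

```python
def percent_conserved(targets_a, targets_b, orthologs_a_to_b):
    """Percent of Species A targets whose ortholog is also a target in Species B.

    targets_a:        set of predicted target genes in Species A.
    targets_b:        set of predicted target genes in Species B.
    orthologs_a_to_b: dict mapping a Species A gene to its Species B ortholog
                      (one-to-one mapping assumed for this sketch).
    """
    if not targets_a:
        return 0.0
    conserved = sum(
        1
        for gene in targets_a
        if gene in orthologs_a_to_b and orthologs_a_to_b[gene] in targets_b
    )
    return 100.0 * conserved / len(targets_a)


# Hypothetical toy data: a4 has no ortholog, a2's ortholog is not a target in B.
targets_a = {"a1", "a2", "a3", "a4"}
targets_b = {"b1", "b3", "b9"}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
print(percent_conserved(targets_a, targets_b, orthologs))
```

If the denominator were instead restricted to Species A targets that have an ortholog in Species B, only the final division would change.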

Figure S3. Assessment of INDEP and MRTLE inferred targets for each TF from the Lavoie et al dataset, Related to Figure 2.

A-B. Precision and recall curves per TF evaluating the target predictions made by MRTLE and INDEP for six TFs for S. cerevisiae (A) and C. albicans (B) using ChIP-chip datasets from Lavoie et al as a gold standard.

Figure S4. Assessing rates of target gain and loss for regulators in MRTLE-inferred networks at varying confidence thresholds, Related to Figure 3.

A-B. CDFs showing the gain (A) and loss (B) rates for regulators with duplications (blue curve) and without duplications (red curve), calculated for the top ∼30,000 edges. C-D. CDFs showing the gain (C) and loss (D) rates for regulators with duplications (blue curve) and without duplications (red curve), calculated for the top ∼40,000 edges.

Figure S5. Analysis of sequence specificity of YAP1/CAD1 and HSF1/SKN7 families, Related to Figure 3.

Shown are the sequence logos of the S. cerevisiae motifs for the SKN7 and CAD1 regulators from the Cladeo-Scope analysis, and for their paralogs. One possible origin of target set divergence is a change in the preferred binding sites of regulators, which can be examined by comparing the motifs of a TF and its paralogs. SKN7 and CAD1 are examples of regulators with paralogs and with motif information in our dataset. SKN7 and its paralog (HSF1) have identifiably different motifs, while CAD1 and its paralog (YAP1) have more similar motifs. Because the binding sites of SKN7 and HSF1 differ, their target gene sets are likely more divergent than those of CAD1 and YAP1.

Figure S6. Sensitivity analysis of GENIE3, INDEP, and MRTLE parameters on simulated data, Related to Figure 1.

A-C. AUPRs obtained on simulated data by running GENIE3 (A), INDEP (B), and MRTLE (C) with various parameter settings for each algorithm.

Figure S7. Sensitivity analysis of GENIE3, INDEP, and MRTLE parameters on real data, Related to Figure 2.

A. GENIE3 evaluated with varying parameter settings on the gold standard from MacIsaac et al. GENIE3 does not incorporate prior information such as motif instances into its learning framework. B. INDEP evaluated with varying parameter settings on the gold standard from Hu et al, with motifs included as prior knowledge. C. INDEP evaluated with varying parameter settings on the gold standard from MacIsaac et al, with motifs not included as prior knowledge. D. MRTLE evaluated with varying parameter settings on the gold standard from Hu et al, with motifs included as prior knowledge. Only orthogroups without duplications were included in this analysis to reduce the computational burden of these experiments.

Table S1. Yeast dataset statistics, Related to Figure 2.

Shown are the number of regulators and targets, the number of orthogroups that contain these regulators and targets, and the number of expression samples used for each yeast species.

Table S2. Target gain and loss rates, Related to Figure 3.

Shown are the target gain and loss rates for each regulator, as well as the number of target orthogroups (Target Ogs) regulated by at least one regulator in a species in this orthogroup.

Table S3. Gene Ontology (GO) process annotations of regulators with high gain rates, Related to Figure 3.

Shown are all regulators with high gain rates (two standard deviations above the mean) and their GO process annotations.
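
The cutoff named in this caption can be sketched as below. This is a minimal illustration, assuming the sample standard deviation is used and that gain rates are available as a dictionary keyed by regulator; the function name, regulator labels, and rate values are hypothetical.

```python
from statistics import mean, stdev


def high_gain_regulators(gain_rates, n_std=2.0):
    """Return regulators whose gain rate exceeds mean + n_std * standard deviation.

    gain_rates: dict mapping regulator name to its target gain rate.
    The sample standard deviation is an assumption of this sketch.
    """
    values = list(gain_rates.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two values
    cutoff = mean(values) + n_std * stdev(values)
    return sorted(name for name, rate in gain_rates.items() if rate > cutoff)


# Hypothetical toy data: nine regulators with a low rate, one clear outlier.
rates = {f"REG{i}": 0.1 for i in range(1, 10)}
rates["REG10"] = 1.0
print(high_gain_regulators(rates))
```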

Table S4. OSR network hubs, Related to Figure 5.

The top ten hubs, defined by out-degree, in the repressed and induced OSR networks inferred by MRTLE.
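
Ranking hubs by out-degree can be sketched as follows. This is a minimal illustration, assuming a network is given as a list of (regulator, target) edges and that ties are broken alphabetically; the function name and the toy edges are hypothetical.

```python
from collections import Counter


def top_hubs(edges, k=10):
    """Return the k regulators with the highest out-degree as (name, degree) pairs.

    edges: iterable of (regulator, target) pairs; duplicate edges count once.
    Ties are broken alphabetically by regulator name (an assumption).
    """
    out_degree = Counter(regulator for regulator, _target in set(edges))
    ranked = sorted(out_degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy network, not from the paper.
edges = [
    ("regA", "g1"), ("regA", "g2"), ("regA", "g3"),
    ("regB", "g1"), ("regB", "g2"),
    ("regC", "g4"),
]
print(top_hubs(edges, k=2))
```

Applying the same function separately to the induced and the repressed edge lists would give one hub ranking per network.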

Table S5. Regulator activator/repressor associations, Related to Figure 6 and Figure 7.

Shown are the HSR and OSR association scores calculated for each regulator. Negative numbers indicate an association with repression; positive numbers indicate an association with induction. Regulators with no predicted targets are not shown. A subset of these regulators is shown in Figures 6A and 7A.
