Summary
Biological systems are driven by intricate interactions among molecules. Many methods have been developed that draw on large numbers of expression samples to infer connections between genes (or their products). The result is an aggregate network representing a single estimate for the likelihood of each interaction, or “edge,” in the network. Although informative, aggregate models fail to capture population heterogeneity. Here we propose a method to reverse engineer sample-specific networks from aggregate networks. We demonstrate our approach in several contexts, including simulated, yeast microarray, and human lymphoblastoid cell line RNA sequencing data. We use these sample-specific networks to study changes in network topology across time and to characterize shifts in gene regulation that were not apparent in the expression data. We believe that generating sample-specific networks will greatly facilitate the application of network methods to large, complex, and heterogeneous multi-omic datasets, supporting the emerging field of precision network medicine.
Subject Areas: Biological Sciences, Bioinformatics, Complex Systems
Graphical Abstract

Highlights
-
•
We developed LIONESS to extract single-sample networks from aggregate models
-
•
We tested LIONESS using in silico, yeast, and human expression data
-
•
LIONESS-estimated networks are reproducible, accurate, and biologically meaningful
-
•
Single-sample network analysis highlights important biological processes
Biological Sciences; Bioinformatics; Complex Systems
Introduction
In many instances, especially when analyzing complex traits and diseases, a single gene or pathway cannot fully explain a particular phenotype. In these cases, biological processes are often characterized as complex networks whose structures are altered as the phenotype changes. Studying the pattern of connections between biological components, and how these structures change between cell states, can yield new insights into the mechanisms driving disease. However, accurately reconstructing these networks in a way that captures both the properties and complexities of each phenotype remains a significant challenge.
Biological and phenotypic variability is a prominent feature in many complex traits and diseases. The generation of large multi-omic resources, including The Cancer Genome Atlas, the Encyclopedia of DNA Elements (ENCODE Project Consortium, 2012), and the Genotype-Tissue Expression (GTEx Consortium, 2015, GTEx Consortium et al., 2017) project, as well as the recent rise of single-cell genomic technologies and the cataloging of individual cell types in the Human Cell Atlas (Rozenblatt-Rosen et al., 2017), have brought this issue to the forefront. We now recognize that diversity in the regulatory processes active in different cells, across multiple tissues, between various phenotypes, and even in response to environmental exposures, all contribute to the complexity of observed disease manifestations. It is also increasingly clear that the cumulative effect of multiple individual-specific variations, each with a relatively small effect size, likely play an important role in the manifestation of many different diseases, including rare disease subtypes (McClellan and King, 2010). These observations speak to a multifactorial process. In other words, rather than individual molecules, it is alterations in biological processes, characterized as complex networks, that play a critical role in mediating the observed diversity (Loscalzo et al., 2007). Effectively capturing this network-level heterogeneity is critical as we seek to understand how gene expression and regulatory processes manifest at an increasingly individualized level.
Existing methods for estimating biological networks often rely upon combining information from large quantities of data (most commonly gene expression data). This means that even when the data represent a spectrum of phenotypes, these approaches, by default, estimate only a single “aggregate” network (De Smet and Marchal, 2010, Marbach et al., 2012). Although these types of aggregate networks have allowed us to gain important insights across a wide range of biological systems and diseases, they only capture the regulatory processes shared across a population of samples. More recently, several approaches have been suggested for exploring sample-level network information (Alvarez et al., 2016, Liu et al., 2015, Liu et al., 2016). However, these methods are severely limited. In particular, current single-sample methods rely upon differential analysis of the underlying expression data, thereby masking any information shared across the population (see “Transparent Methods” and Table S1). Regulatory processes act on a network that contains both common and context-specific interactions (Sonawane et al., 2017). However, there are currently no existing approaches designed to reconstruct the complete network for each sample in a population.
To fill this gap and effectively model the regulatory processes active in each sample in a population, we have developed a method to reverse engineer sample-specific networks. We call this approach LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples). LIONESS estimates individual sample networks by applying linear interpolation to the predictions made by existing aggregate network inference approaches. In this article, we demonstrate the accuracy, robustness, and applicability of LIONESS in the context of multiple aggregate network reconstruction approaches and in several datasets, including simulated data, microarray expression data from synchronized yeast cells, and RNA sequencing (RNA-seq) data collected from human lymphoblastoid cell lines (Figure 1A; Table S2). We also show how the predictions from LIONESS can be used to model regulatory network changes over time and to characterize the regulatory processes active in individual samples. Ultimately, we find that analyzing single-sample regulatory networks provides a view of biological systems that is distinct from, but complementary to, other sources of multi-omic data.
Figure 1.
Overview of LIONESS Approach and Evaluation
(A) Flow diagram summarizing the analyses performed in this article to evaluate the LIONESS approach. LIONESS was applied to multiple aggregate network reconstruction approaches including Pearson correlation coefficient, PANDA (Passing Attributes between Networks for Data Assimilation), MI (mutual information), and CLR (Context Likelihood of Relatedness).
(B) Visual illustration of how LIONESS estimates the network for a single sample based on two aggregate network models, one reconstructed using all biological samples in a given dataset and the other using all except the sample of interest (q).
Results
Complex Relationships in Biological Networks
Many widely used network inference methods start by calculating a score or statistic for each gene pair based on shared information across a set of input gene expression samples (De Smet and Marchal, 2010, Marbach et al., 2012). These scores are sometimes augmented to better account for regulatory complexity (Faith et al., 2007, Margolin et al., 2006, Langfelder and Horvath, 2008) but are ultimately used to infer the presence or absence of “interactions” between genes. This collection of genes and their corresponding complex set of inferred interactions are conceptualized as a network in which “nodes” represent genes and “edges” represent the interactions between those genes. In this context, heterogeneity in the underlying input samples is often essential for correctly estimating a network model, as variance in the data can amplify gene co-variation patterns, leading to more robust network predictions. However, at the same time, building this type of consensus, or “aggregate,” network model largely ignores the fact that there may be multiple different underlying regulatory networks represented across the individual input samples.
Consider the collection of cells within a tissue. We now recognize that within this system, each cell may have its own unique gene expression profile and corresponding unique active gene regulatory network. In the same way, each individual person in a group manifests a phenotype in a slightly different fashion, meaning that his or her gene expression profile and the gene regulatory network driving it should be subtly different. We have started to embrace this complexity in analyzing gene expression, whereas it has been largely ignored in the analysis of gene regulatory networks.
To better model network-level diversity across a population, we sought to develop a method that could model sample-specific networks. In developing our approach, we recognized that there are two types of relationships that needed to be considered: (1) intra-network relationships, or the connections among the nodes (genes) within a biological network, and (2) inter-network relationships, or the relationships between multiple different biological networks. The first of these (intra-network relationships) is an area that has been highly studied. It is now widely recognized that relationships among nodes within a biological network are very complex and that these networks are often characterized by nonlinear regulatory dynamics and synergistic effects. Fortunately, there are many approaches that have already been developed to model these complex interactions (Wang and Huang, 2014, Marbach et al., 2012), as outlined above. In contrast, the comparative study of networks (inter-network relationships) is still a relatively young field. However, a number of recent studies have used linear approaches to analyze and cluster sets of networks (Marbach et al., 2012, Schlauch et al., 2017, Mucha et al., 2010, Onnela et al., 2012).
LIONESS: Linear Interpolation to Obtain Network Estimates for Single Samples
With the above in mind, we developed our approach by using a linear framework to relate a set of networks, each representing a different biological sample. In other words, we suggest that an “aggregate” network predicted from a set of N samples can be thought of as the average of individual component networks reflecting the contributions from each member in the input sample set. Mathematically, this means that the weight of an edge, between two nodes (i and j) in an aggregate network derived using all samples (α) can be modeled as the linear combination of the weight of that edge across a set of networks:
| (Equation 1) |
where .
In this equation, each network () in the set directly corresponds to one of the samples (s) used to reconstruct the aggregate network (), and represents the relative contribution of that sample to the aggregate model; we note that the complex relationships between the nodes in the aggregate network () can be calculated using any aggregate network reconstruction approach. This allows us to ensure that higher-order, nonlinear relationships, such as those commonly found in complex biological networks, can be included in the network models.
Next, we also suggest that, as in Equation 1, the weight of an edge in a network reconstructed using all but one of the samples (α−q) can be written as:
| (Equation 2) |
where .
Comparing Equations 1 and 2, we find that so long as the relative contribution of each of the samples (s) has the same proportional contribution to the network reconstructed using all samples () when compared with the network reconstructed using all samples except one () (implying that is a constant; for more information see Equation E8 in Supplemental Information). This comparison also allows us to solve exactly for the network for an individual sample q. In particular, by subtracting the above equations we find:
| (Equation 3) |
The network specific to sample q in terms of the aggregate networks is then:
| (Equation 4) |
In summary, the edge scores for a given individual network are equal to the difference in edge scores for an aggregate network constructed using all the samples and an aggregate network reconstructed using all but the sample of interest, multiplied by a scaling factor, and added to the edge scores of the network reconstructed using all but the sample of interest (Figure 1B). What this means is that we can use pairs of aggregate network models to “extract” networks for each of the individual input samples. In the following analysis we give samples equal weight (), although one could, in principle, weight samples differently based on the quality of the data for individual samples or some other measure. A more detailed version of the LIONESS derivation is provided in the Supplemental Information.
We note that the mathematical framework presented in Equation 4 is independent of the inference method used to estimate the aggregate network edge weights. In other words, LIONESS can be thought of as a mathematical “wrapper” that can be applied to estimate networks based on any aggregation model. With this in mind, we have performed a detailed exploration of the behavior of Equation 4 when the aggregate network model is calculated using Pearson correlation or mutual information (MI), two measures commonly applied to quantify the level of a linear or nonlinear association between variables, respectively. For both measures, we are able to show that the inter-network linearity assumption of LIONESS (Equation 1) holds in the context of large sample size (see Supplemental Information). Simulation analysis also illustrates how LIONESS consistently assigns similar edge weights to the samples that most contribute to an expected relationship and correctly identifies and re-weights edges for the samples that are most inconsistent with an expected relationship (Figure S1).
LIONESS Accurately and Robustly Predicts Networks Using In Silico Data
To systematically evaluate LIONESS, we created a series of datasets where the underlying networks corresponding to each input expression sample are known. We used these data to (1) evaluate whether LIONESS accurately predicts individual sample networks, (2) to explore how sensitive these predictions are to the properties of the underlying data, and (3) to assess whether LIONESS is able to recover sample-specific network relationships (i.e., edges specific to a given sample's network).
Briefly, to create a benchmark in silico dataset, we started with a baseline network containing M nodes and random edges. We then permuted the edges within this baseline network, creating a single-sample network with the same degree distribution (Figure 2A). We repeated this N times, creating N “gold-standard” single-sample networks. To derive corresponding expression profiles for each of these networks, we generated 1,000 random initial expression states (0 or 1 corresponding to whether the gene is “on” or “off,” respectively) and applied a Boolean model (see “Transparent Methods”) to determine the corresponding network attractors (Wuensche, 1998). We averaged over all states defined within these 1,000 attractors to generate “expression” values for the M nodes (which represent genes) in each network. This gave us an M-by-N matrix of expression values, one for each of the nodes (genes) in each network. An overview of our approach is shown in Figure S2. The generated in silico datasets are included in Data S1.
Figure 2.
Evaluation of LIONESS′ Ability to Recover Known Single-Sample Networks in In Silico Data
(A) Toy example of how we create a single-sample network from an underlying baseline network.
(B) Illustration of the gene expression samples used to build a single-sample network. We evaluated the accuracy of both the aggregate network derived using all samples (red) and the LIONESS-estimated single-sample network (black) by benchmarking against the corresponding “gold-standard” single-sample network.
(C) The mean and standard deviation of the AUC values of the aggregate (red) and LIONESS-predicted single-sample networks (black) estimated from in silico datasets representing varying levels of heterogeneity.
(D) The mean and standard deviation of the AUC values of the aggregate (red) and LIONESS-predicted single-sample networks (black) estimated using increasing numbers of input expression samples. For each sample size, 10,000 random subsets of samples were used.
(E) Violin plots showing the distribution of AUC values for aggregate and LIONESS-predicted single-sample networks estimated using four different aggregate network reconstruction approaches. For (C–E) AUCs were calculated using all possible edges, and for edges that differ from the baseline model (permuted edges), see (A).
See also Figures S2–S4.
We first evaluated LIONESS′ predictions in the context of varying heterogeneity. To do this, we generated six different in silico datasets using the same baseline network but varying the amount of permutation used to obtain the single-sample network models. For this analysis we chose a network size of M=100 nodes and N=100 samples and used Pearson correlation to calculate an aggregate network before applying Equation 4 to reconstruct each of the individual sample networks. We evaluated the accuracy of the Pearson correlation aggregate network and each of the LIONESS-estimated single-sample networks (Figure 2B) by comparing with the original “gold-standard” networks and calculating the area under the receiver operator characteristic curve (AUCROC, or more simply AUC).
We observe that in the context of greater heterogeneity among the single-sample networks (increased permutation) the LIONESS-predicted networks are much more accurate than the aggregate network (Figure 2C). On the other hand, in the context of low heterogeneity, the accuracy of the LIONESS-predicted networks is similar to that of the aggregate network; this is to be expected because the aggregate network should not be significantly different from the single-sample networks in this context. Most interesting, however, is the fact that the accuracy of the permuted edges (those that appear in the single-sample network but not the baseline network, see Figure 2A) is independent of sample heterogeneity. These edges are not accurately captured in the aggregate network model, especially in the case of low heterogeneity.
We have repeated this analysis on in silico data for networks (1) of various sizes (contain more nodes) and (2) with varying levels of noise added to their associated expression data. We find that LIONESS′ performance is independent of the size of the network models (Figures S3A and S3B) and retains its ability to predict networks even in the presence of expression data noise (Figure S3C).
Next, we evaluated LIONESS′ predictions in the context of varying sample size. To do this, we generated an additional in silico dataset based on the same 100-node baseline network as the previous analysis. We used a moderate level of permutation (ρ=1) to generate a dataset with 10,000 paired network and expression samples. We selected subsets of this dataset containing N+1 samples, where N varied from 20 to 5,000; applied LIONESS to estimate the (N+1)th sample's network; and evaluated the accuracy of that network as well as the corresponding aggregate network from which it was derived (Figure 2D). We observe that as we increase the number of samples (N), the accuracy of LIONESS single-sample networks remains constant, both overall and for the sample-specific permuted edges. However, although including more samples improves the accuracy of the aggregate network model, the sample-specific permuted edges within the aggregate model are very poorly estimated with increasing sample size. This behavior is expected; including more samples provides increasing information that can help accurately estimate edges that are in the baseline network (those that are most likely to be common across all the single-sample networks). These edges are, by definition, the opposite of the sample-specific permuted edges.
We next assessed how sensitive LIONESS networks are to the chosen set of “background” samples. Using the same in silico dataset described above, we evaluated the similarity between pairs of single-sample networks that represent the same expression sample (q), but which were constructed using independent sets of background samples. We found high reproducibility, in particular as we increased the number of background samples (Figure S4A). We also tested how robust LIONESS predictions are when there are distinct subtypes represented in the background samples. To do this, we generated a separate in silico dataset that contained seven subtypes of different sizes (for more information see “Transparent Methods”). We found that LIONESS′ performance was similar when using a background consisting of all samples, or a background consisting of only subtype-specific samples (Figure S4B, p value = 0.639 for the overall analysis). Simulation analysis also illustrates how, in the case of multiple subtypes in the underlying expression data, using all samples allows for a robust estimation of single-sample edge weights (Figure S4C).
Finally, we tested the generalizability of LIONESS by estimating single-sample networks from aggregate models derived using several common network reconstruction approaches, including Pearson correlation, Passing Attributes between Networks for Data Assimilation (PANDA) (Glass et al., 2013), MI, and Context Likelihood of Relatedness (CLR) (Faith et al., 2007). These methods represent several commonly used reconstruction approaches, including both linear (Pearson) and nonlinear (MI) models, that infer either directed (PANDA) or undirected (Pearson, MI, CLR) networks (for more information, see “Transparent Methods”). Figure 2E shows the distribution in AUC values for the aggregate and LIONESS single-sample network predictions for each of these approaches. We find that LIONESS consistently and accurately predicts single-sample networks for all four network inference methods. Interestingly, although the difference in AUC between the overall aggregate and single-sample models is fairly similar for all four approaches, the AUC values are lowest for networks estimated using MI, a nonlinear approach for assessing correlation. This may reflect that our in silico data do not fully represent the complexity found in biological systems or that MI is not the optimal measure to use when estimating a regulatory network from expression data.
Estimating Single-Sample Networks Using Experimental Data from Yeast
We next tested LIONESS using experimental data from cell-cycle-synchronized yeast cells. We downloaded gene expression data (Gene Expression Omnibus: GSE4987) (Pramila et al., 2006) consisting of dye-swap technical replicates measured every 5 min for 120 min. We ma-normalized (Yang et al., 2009) these data, removed probe sets with missing information, batch-corrected using ComBat (Johnson et al., 2007), averaged probe sets mapping to the same open reading frame annotation, and quantile-normalized the resulting gene-by-sample matrix of expression values. We note that the 105-min time point was excluded in both replicates due to poor hybridization performance (Pramila et al., 2006).
We used four different network inference methods (Pearson Correlation, PANDA (Glass et al., 2013), MI, and CLR (Faith et al., 2007)) to reconstruct aggregate networks for this dataset and applied LIONESS to estimate the networks for each of the individual samples. The correlation between edge weights in each pair of the estimated sample-specific networks is shown in the first column of Figure 3 (R1&R2-from-R1&R2). We see that network estimates for the same technical replicate are highly similar, as evidenced by the strong diagonal in the upper-right and lower-left squares of each comparison; additional structure is also evident in off-diagonal similarities that reflect the fact that the time course data include more than one cell cycle.
Figure 3.
Analysis of LIONESS Networks Predicted for 48 Expression Samples Collected across a Yeast Cell Cycle Time Course Experiment
LIONESS was used to predict networks for each sample in the expression data by applying four different aggregate network reconstruction approaches. For each approach we built the aggregate models either using all samples (R1&R2 from R1&R2), or only the samples from the same technical replicate (R1-from-R1 & R2-from-R2). The Spearman correlation was used to evaluate how similar these networks are to each other. See also Figure S5.
To test if strong reproducibility was because of inclusion of replicates in the expression data, we also ran LIONESS separately on each individual replicate. This analysis produced 24 single-sample networks estimated using only the data in replicate one and 24 single-sample networks estimated using only the data in replicate two (R1-from-R1 & R2-from-R2). The correlation between these networks is shown in the second column of Figure 3. As before, we observe strong reproducibility in estimated edge weights between technical replicates. However, it is worth noting that even though we have corrected for batch effects in the expression data, several of the methods, especially CLR, appear to be sensitive to the “background” data used.
We note that this level of reproducibility is similar to that observed in the underlying expression data, demonstrating that we did not lose replicate information by applying LIONESS separately to the two sets of expression samples (Figure S5A). Interestingly, replicate PANDA networks had higher levels of similarity when compared with the other three reconstruction approaches. Based on these results, in the following analysis we focus on the single-sample networks derived using PANDA as the aggregate network inference method. Results for the other reconstruction approaches are presented in Figure S5B.
Single-Sample Networks Show Periodic Structure across the Cell Cycle
We next tested whether these single-sample networks could provide insight into gene regulation and dynamic cellular network processes. We averaged the aggregate networks and single-sample networks representing the same time point in each of the two replicates, identified the 1,000 edges with the highest variability across the individual networks, and visualized those edges as a heatmap in Figure 4A. The highly variable edges have a range of predicted weights in the aggregate network (left panel); however, we observe strong oscillatory patterns in edge weights (right panel), apparently reflecting changes in gene regulation across the cell cycle. Further investigation indicates that all these highly variable edges originate from one of four transcription factors (MBP1, SWI4, SWI6, and STB1), each of which is known to play a key role in regulating the yeast cell cycle (Ho et al., 1999).
Figure 4.
Characterizing Networks across the Yeast Cell Cycle
(A) A heatmap of the edge weights for the 1,000 most variable edges across the sample-specific network models. The left panel shows the weights of these edges in the aggregate network, and the right panel shows the edge weights across the single-sample networks. For the right panel rows are Z score normalized for visualization purposes.
(B) The average expression of genes targeted by the four transcription factors that were identified as regulatory nodes of the 1,000 topmost variable edges as well as the average weight of high-confidence edges that extend between those transcription factors and their target genes. The average weight of these edges in the aggregate network is shown as a dashed line.
See also Figure S5.
We examined the genes for which there is strong evidence of targeting by these transcription factors (average edge weight across all LIONESS networks greater than zero). In Figure 4B we plot the average weight of these high-evidence interactions for each regulating transcription factor and the average expression of their target genes. It is immediately apparent that oscillation in edge weights occurs at exactly twice the frequency of the oscillation in gene expression, and that the gene expression oscillates with a period approximately equal to that of the yeast cell cycle.
To understand this result we have to recognize that PANDA interprets correlation in target gene expression as an indication of co-regulation by an upstream transcription factor. Consequently, PANDA assigns greater edge weights when a transcription factor's targets are all coordinately increasing (activated) or decreasing (de-activated or repressed) their expression levels. High edge weights should be interpreted as evidence for information flow from a transcription factor (TF) to its targets, which could be due to a physically present TF actively regulating its downstream targets, and could also reflect a strong lack of regulation by that TF. In this light, the “turn on/turn off” behavior is exactly what one would predict given how PANDA estimates network relationships and is further evidence that LIONESS is extracting meaningful single-sample networks.
Reconstructing Single-Sample Networks for Human Lymphoblastoid Cell Lines
Lastly, we applied LIONESS to infer individual-specific human gene regulatory networks. We used a set of 155 RNA-seq samples from immortalized lymphoblastoid cell lines representing 65 different individuals (Pickrell et al., 2010). We downloaded raw fastq files from the Pritchard lab website (http://eqtl.uchicago.edu/) and aligned samples to hg19 using Bowtie (Langmead et al., 2009); subsequent quality control analysis using RNA-SeQC (DeLuca et al., 2012) excluded two samples due to low expression profile efficiency scores. This left us with a final set of 153 RNA-seq experiments that includes replicates and represents 65 distinct individuals. We normalized this dataset using DEseq2 (Love et al., 2014). For additional data processing and normalization information, see Transparent Methods.
Based on our results when applying LIONESS to network models in the simulated and yeast cell cycle data, we chose PANDA as our aggregate network reconstruction method for the human data. We used PANDA to estimate aggregate gene regulatory network models for the collection of 153 RNA-seq samples. We then applied LIONESS to these aggregate models, resulting in 153 single-sample networks, one for each of the RNA-seq expression samples. A hierarchical clustering (complete linkage, Spearman correlation) of the network edge weights demonstrates that networks for the same individual nearly always cluster more strongly with each other than with networks representing different individuals (Figure S6). This analysis demonstrates that even when constructing networks using biological data from higher-order organisms such as human, the sample-specific networks predicted by LIONESS are reproducible.
Complex Relationships between Network Targeting and Gene Expression
As with yeast, we investigated the relationship between gene targeting and expression in human networks. First, we averaged single-sample networks that represent the same individual, resulting in 65 “person-specific” regulatory networks. We then selected high-evidence regulatory interactions for each transcription factor (average edge weight across all single-sample networks greater than zero), and directly compared the mean edge weight for these interactions in each of the single-sample networks to the average expression of the targeted genes in the original expression samples.
We found nonlinear relationships between targeting and expression, with the highest average edge weights occurring when target genes have either high or low expression levels (Figure 5A); this is consistent with what we observed in our yeast analysis (Figure 4B). Coloring by the transcription factor expression level in each sample reveals additional patterns with some transcription factors primarily acting as activators (increased target gene expression upon increased TF expression and targeting) and others generally acting as repressors (decreased target gene expression upon increased TF expression and targeting). However, the relationship between a transcription factor and its target genes is not always simple, indicating that other regulatory mechanisms, such as co-activators, post-translational modifiers, or epigenetic mechanisms, are likely playing an important role in mediating these regulatory events.
Figure 5.
Comparison of Gene Regulation, Gene Expression, and DNase Hypersensitivity Data
(A) For six representative transcription factors, the mean expression of target genes and the mean weight of the edges targeting those genes across the 65 samples are plotted. For each sample, the expression of the TF is shown as a color, scaled to the normal distribution for visualization purposes.
(B) A cartoon illustrating how high edge weights and thus regulatory activity is not necessarily equivalent to the presence of a physical interaction.
(C) The distribution of the Spearman correlation values when comparing gene targeting (calculated by combining LIONESS predictions with TF expression; k(L+e)) and the significance level of DNase hypersensitivity in a gene's promoter across all the samples. We also show the percentage of genes whose targeting positively correlates with DNase hypersensitivity when targeting is calculated using only the LIONESS-predicted edge weight (k(L): no expression considered) or a combination of expression and motif information (k(m+e)). We performed these analyses either using all 12,424 genes included in our network model or for the set of 3,488 genes with a DNase peak called in all 65 samples (open genes).
See also Figure S6.
Increased Network Targeting Corresponds to Open Chromatin
DNase hypersensitivity profiling data are also available for these 65 lymphoblastoid cell lines (Degner et al., 2012), and we used it to investigate how network structures reflect epigenetic state. We downloaded the data from the Pritchard lab website (http://eqtl.uchicago.edu/) and called DNase “peaks” for each sample using Model-based Analysis for ChIP-seq (MACS) (Zhang et al., 2008). When a peak fell within the promoter region of a gene, we assigned that gene a sample-specific score reflecting the significance level of the associated peak call. We found 12,424 genes with a DNase promoter peak in at least one sample and 3,488 genes with a promoter peak in all samples. For details on the DNase data processing, see “Transparent Methods”.
A DNase hypersensitivity peak represents a region of open chromatin that is often presumed to be occupied by one or more regulatory proteins, including transcription factors. We wanted to determine if differences in chromatin state between the 65 individuals is reflected in alterations in transcription factor targeting within our single-sample networks. We assigned each edge in each sample a score by combining (1) the weight of that interaction in our single-sample network models (because this value indicates whether information is flowing between that transcription factor and target gene in the PANDA model) and (2) the expression level of the transcription factor itself (because this value indicates whether the TF is physically present in the cell (Figure 5B)). This resulted in a set of expression-modified edge weights for each sample. For more information on how we calculated these edge weights, see “Transparent Methods”.
We next used the sum of the edge weights associated with each gene to estimate the number of transcription factors regulating that gene in each of the 65 person-specific networks (k(L+e)). For comparison, we calculated gene targeting in two other ways: (1) using LIONESS edge weight estimates in the absence of gene expression information (k(L)) and (2) using gene expression information in the absence of LIONESS-predicted edge weights (k(m+e)); for the second measure we combined transcription factor expression in each sample with the motif information used for PANDA's prior (see “Transparent Methods”). We note that this last approach is conceptually similar to current methods for approximating sample-specific network information (see Introduction).
To evaluate the association of network targeting with chromatin state, for each gene we calculated the Spearman correlation between gene targeting across the networks and the significance scores of that gene's promoter DNase across the corresponding cell lines. We find that gene targeting in the expression-modified LIONESS model (k(L+e)) is very strongly correlated with promoter-DNase events, especially when only considering genes with measured chromatin information across all the cell lines (Figure 5C). This association is greater than when using only expression and motif information (k(m+e)), demonstrating that the LIONESS approach provides additional information on chromatin state not apparent in the data used to seed the algorithm.
Differential Targeting of Genes Highlights Important Biological Processes
Finally, we wanted to determine if there are common structures across these single-sample regulatory networks that might be reflective of important biological processes. We performed a hierarchical clustering (complete linkage, Spearman correlation) on the edge weights in the 65 single-sample lymphoblastoid networks and identified two distinct groups of samples defined by sample-specific edge weights (Figure 6A). In parallel, we performed a hierarchical clustering using gene expression values (Figure 6B) and found two groups of samples that are distinct from the groups defined by the edge weight clustering.
Figure 6.
Analysis of Human Lymphoblastoid Cell Line Networks
(A) A hierarchical clustering on the edge weights for 65 regulatory networks, one for each distinct subject-derived lymphoblastoid cell line included in the RNA-seq dataset.
(B) An equivalent hierarchical clustering on gene expression values for these 65 individuals. This clustering is distinct from the one based on the network edge weights. Subject labels are colored based on the network clustering.
(C) Reactome pathways enriched based on GSEA using gene targeting instead of gene expression and comparing samples from the right and left groups of networks presented in (A). No Reactome pathways were identified when comparing the expression values of genes in the different groups defined by the hierarchical clustering presented in (B).
See also Figure S6.
We then used Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005) to compare groups of samples defined in the expression-based and network-based clustering. First, we compared the expression levels of genes between groups of individuals defined in the expression-based clustering. Although there were many differentially expressed genes between the expression-based groups of samples (2,620 with false discovery rate [FDR] < 0.01), GSEA found no enrichment for known biological functions (Figure 6C). Next, we defined the targeting level of a gene in a sample as the sum of all edges pointing to that gene in a single-sample network. We then used GSEA to compare the targeting levels of genes between the two groups of individuals defined by the network-based clustering (Glass et al., 2014). In contrast to the expression-based analysis, in the differential-targeting analysis GSEA found enrichment for many cellular processes related to cell proliferation (in the smaller “orange” cluster; n = 18) and immune function (in the larger “blue” cluster; n = 45; Figure 6C).
Unfortunately, there is little phenotypic information for the 65 individuals in this study, and those available (Choy et al., 2008) are not significantly associated with the groups defined by clustering on either the expression or the network information. However, given our functional enrichment results, we believe that the regulatory differences we observe between the network groups is likely related to differences in cellular growth rate induced by variable Epstein-Barr virus (EBV) levels in the cell lines. EBV is used to transform human B cells into immortalized lymphoblastoids and is known to activate nuclear factor (NF)-κB transcriptional response (Cahir McFarland et al., 1999). Consistent with this hypothesis, we find the signature “Activation of NF-κB in B-cells” highly targeted in the small, “cell proliferation” cluster (ES = 0.5, FDR = 2 × 10−3).
We also performed GSEA on gene targeting in the aggregate network modeled using all 65 samples. This identified one significant pathway “Activation of IRF3/IRF7 mediated by TBK1/IKK epsilon.” This pathway is activated upon cell stimulation with viruses. In fact, Interferon Regulatory Factor 7 (IRF7) was originally identified in the context of EBV infection (Zhang and Pagano, 1997). Although this indicates that this pathway is likely to be highly regulated in all 65 lymphoblastoid samples, it also shows that many of the subtle differences between the individual cell lines cannot be detected with an aggregate network approach. Overall, these results indicate that evaluating single-sample networks can lend insight into the biological processes active in different individuals even when a similar analysis of the gene expression data or the aggregate network model does not.
Discussion
In this article, we present LIONESS as a method for estimating sample-specific regulatory networks. The core principle behind LIONESS is that the addition or removal of even a single sample will slightly perturb an aggregate network model. This perturbation can be used to estimate the contribution of a sample to the aggregate network, and therefore the network of that sample. Importantly, by relying on independent and existing aggregate models to capture the network of the interactions between genes, and a linear interpolation to estimate individual-level differences in the associated edge weights, LIONESS is able to reconstruct network estimates for each sample while preserving the biological complexity of the gene-gene interactions.
There are many network reconstruction methods, but there is no consensus as to the “correct” or “best” one to use—if in fact there is a single method that works best for all data types (Marbach et al., 2012). In the analysis presented here, we used four representative gene expression network reconstruction approaches: Pearson correlation, MI, CLR, and PANDA. These were chosen because they illustrate network reconstruction methods that use either a linear (Pearson) or nonlinear (MI) correlation measure, and the extensions of those measures to better capture true regulatory interactions instead of simple correlative effects. Within this representative collection of methods, our analysis suggests that applying LIONESS to aggregate networks reconstructed using PANDA has the greatest potential for reconstructing accurate network models that can be used to interpret phenotypic differences.
We also note that although we tested our approach in the context of using gene expression to reverse engineer regulatory networks, the linear algebraic framework at the heart of LIONESS is generalizable and can be applied in other settings where aggregate relationships are inferred from a collection of samples. In principle, this includes the application of LIONESS not only to other network inference methods but also in contexts where network relationships are inferred from other multi-sample omics data, such as metabolomics data, genetic/variant data, or epigenetic regulatory information such as CpG methylation. Incorporating this information into single-sample network models will be an important step in understanding the complexity of metazoan gene regulation.
Looking forward, LIONESS provides a way to unite the extensive literature and methodologies for modeling complex network relationships, with statistical analysis techniques that use sample-level information to model heterogeneity. Great progress has been made in assigning patients to disease subgroups based on gene expression profiles, or in using mutational profiles to match individual patients to specific therapies. LIONESS provides a framework in which one could imagine using a similar approach to analyze networks for precision medicine applications. For example, the network-interactions and properties predicted using LIONESS could be directly associated with patient phenotype, genotype, progression, survival, drug response, etc. Therefore LIONESS not only addresses the problem of estimating multiple networks for populations with significant phenotypic of biological heterogeneity but also provides a means of estimating and analyzing networks when samples of a particular phenotype or disease subtype are rare. Ultimately, one could imagine using the LIONESS approach to identify and target the regulatory pathways active in an individual patient (rather than using mutations or gene expression as surrogates for those pathways).
In summary, our approach to modeling single-sample networks is the first single-sample approach that estimates each sample's complete network rather than simply re-purposing differential expression information for network-based analysis. More importantly, LIONESS fills a critical gap, enabling the predictions made by existing network reconstruction methodologies to be evaluated using the same statistical techniques widely applied in other areas of genomic data analysis. The mathematical framework of LIONESS is highly generalizable and has the potential to be used to study many different and important questions in the fields of precision medicine, health, and biomedical research.
Limitations of Study
One central feature of LIONESS is that it is not a stand-alone algorithm. Rather it is designed to be applied to the output of other aggregate network approaches, and thus is subject to the limitations inherent in those approaches. In this work we focused on testing and evaluating LIONESS in the context of gene regulatory network reconstruction approaches, which primarily leverage expression data to estimate edges and edge weights. It should be mathematically straightforward to apply LIONESS in other contexts, such as to aggregate approaches that consider other network features (e.g., node weights), or to methods that reconstruct different types of networks and/or leverage other types of omics data. Research in these areas is undergoing rapid development. Therefore it will be critical to systematically evaluate LIONESS in these contexts to fully interpret the method's predictions. Finally, it is important to recognize that the accuracy of LIONESS′ single-sample networks depends on the accuracy of the method used for inferring the underlying aggregate network. As genomic data continue to grow and methods for network inference improve, this relationship implies that the accuracy of the single-sample networks inferred by LIONESS will only continue to improve.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Acknowledgments
This work was supported by grants from the US National Heart, Lung, and Blood Institute of the National Institutes of Health (R01HL111759, P01HL105339, K25HL133599, 1R35CA220523), from the Charles A. King Trust Postdoctoral Research Fellowship Program, Sara Elizabeth O'Brien Trust, Bank of America, N.A., Co-Trustees, and from the Norwegian Research Council, Helse Sør-Øst, and University of Oslo through the Centre for Molecular Medicine Norway (NCMM). We would also like to thank Farrah Roy, Abhijeet Sonawane, and John Platig for useful insights and suggestions in drafting this manuscript.
Author Contributions
Conceptualization, M.G.T. and K.G.; Software, Validation, and Formal Analysis, M.L.K. and K.G.; Methodology and Investigation, M.L.K., M.G.T., and K.G.; Resources, G.C.Y., J.Q., and K.G.; Data Curation and Writing – Original Draft, M.L.K., M.G.T., and K.G.; Writing – Review & Editing, all authors; Visualization, M.L.K. and K.G.; Supervision, J.Q., K.G., and G.C.Y.; Funding Acquisition, M.L.K., G.C.Y., J.Q., and K.G.
Declaration of Interests
The authors declare no competing interests.
Published: April 26, 2019
Footnotes
Supplemental Information can be found online at https://doi.org/10.1016/j.isci.2019.03.021.
Data and Software Availability
The data sets generated during and/or analyzed during the current study are available in Data S1 (in silico data), and from the Gene Expression Omnibus under accession number GSE4987 (yeast data), GSE19480 (human RNA-seq data), and GSE31388 (human DNase data). The human data is also available online at http://eqtl.uchicago.edu/. Fully processed and normalized versions of the yeast and human data used in this study are also available from the authors upon request.
Supplemental Information
The text files within this zipped directory contain the edges and the expression data for each generated in silico dataset used to benchmark LIONESS.
References
- Alvarez M.J., Shen Y., Giorgi F.M., Lachmann A., Ding B.B., Ye B.H., Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat. Genet. 2016;48:838–847. doi: 10.1038/ng.3593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cahir McFarland E.D., Izumi K.M., Mosialos G. Epstein-barr virus transformation: involvement of latent membrane protein 1-mediated activation of NF-kappaB. Oncogene. 1999;18:6959–6964. doi: 10.1038/sj.onc.1203217. [DOI] [PubMed] [Google Scholar]
- Choy E., Yelensky R., Bonakdar S., Plenge R.M., Saxena R., de Jager P.L., Shaw S.Y., Wolfish C.S., Slavik J.M. Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines. PLoS Genet. 2008;4:e1000287. doi: 10.1371/journal.pgen.1000287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Degner J.F., Pai A.A., Pique-Regi R., Veyrieras J.B., Gaffney D.J., Pickrell J.K., de Leon S., Michelini K., Lewellen N., Crawford G.E. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature. 2012;482:390–394. doi: 10.1038/nature10808. [DOI] [PMC free article] [PubMed] [Google Scholar]
- DeLuca D.S., Levin J.Z., Sivachenko A., Fennell T., Nazaire M.D., Williams C., Reich M., Winckler W., Getz G. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics. 2012;28:1530–1532. doi: 10.1093/bioinformatics/bts196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Faith J.J., Hayete B., Thaden J.T., Mogno I., Wierzbowski J., Cottarel G., Kasif S., Collins J.J., Gardner T.S. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5:e8. doi: 10.1371/journal.pbio.0050008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Glass K., Huttenhower C., Quackenbush J., Yuan G.C. Passing messages between biological networks to refine predicted interactions. PLoS One. 2013;8:e64832. doi: 10.1371/journal.pone.0064832. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Glass K., Quackenbush J., Silverman E.K., Celli B., Rennard S.I., Yuan G.C., Demeo D.L. Sexually-dimorphic targeting of functionally-related genes in COPD. BMC Syst. Biol. 2014;8:118. doi: 10.1186/s12918-014-0118-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- GTEx Consortium Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015;348:648–660. doi: 10.1126/science.1262110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- GTEx Consortium, Laboratory, Data Analysis &Coordinating Center (LDACC)—Analysis Working Group, Statistical Methods groups—Analysis Working Group; Enhancing GTEx (eGTEx) groups, NIH Common Fund, NIH/NCI, NIH/NHGRI, NIH/NIMH, NIH/NIDA, Biospecimen Collection Source Site—NDRI, Biospecimen Collection Source Site—RPCI, Biospecimen Core Resource—VARI, Brain Bank Repository—University of Miami Brain Endowment Bank, Leidos Biomedical—Project Management, ELSI Study, Genome Browser Data Integration &Visualization—EBI; Genome Browser Data Integration &Visualization—UCSC Genomics Institute, University of California Santa Cruz;Lead analysts, Laboratory, Data Analysis &Coordinating Center (LDACC), NIH program management Genetic effects on gene expression across human tissues. Nature. 2017;550:204–213. [Google Scholar]
- Ho Y., Costanzo M., Moore L., Kobayashi R., Andrews B.J. Regulation of transcription at the Saccharomyces cerevisiae start transition by Stb1, a Swi6-binding protein. Mol. Cell. Biol. 1999;19:5267–5278. doi: 10.1128/mcb.19.8.5267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Johnson W.E., Li C., Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
- Langfelder P., Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Langmead B., Trapnell C., Pop M., Salzberg S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu C., Louhimo R., Laakso M., Lehtonen R., Hautaniemi S. Identification of sample-specific regulations using integrative network level analysis. BMC Cancer. 2015;15:319. doi: 10.1186/s12885-015-1265-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu X., Wang Y., Ji H., Aihara K., Chen L. Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res. 2016;44:e164. doi: 10.1093/nar/gkw772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Loscalzo J., Kohane I., Barabasi A.L. Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Mol. Syst. Biol. 2007;3:124. doi: 10.1038/msb4100163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marbach D., Costello J.C., Kuffner R., Vega N.M., Prill R.J., Camacho D.M., Allison K.R., Kellis M., Collins J.J., Stolovitzky G. Wisdom of crowds for robust gene network inference. Nat. Methods. 2012;9:796–804. doi: 10.1038/nmeth.2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Dalla Favera R., Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7(Suppl 1):S7. doi: 10.1186/1471-2105-7-S1-S7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McClellan J., King M.C. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032. [DOI] [PubMed] [Google Scholar]
- Mucha P.J., Richardson T., Macon K., Porter M.A., Onnela J.P. Community structure in time-dependent, multiscale, and multiplex networks. Science. 2010;328:876–878. doi: 10.1126/science.1184819. [DOI] [PubMed] [Google Scholar]
- Onnela J.P., Fenn D.J., Reid S., Porter M.A., Mucha P.J., Fricker M.D., Jones N.S. Taxonomies of networks from community structure. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2012;86 doi: 10.1103/PhysRevE.86.036104. 036104-36104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pickrell J.K., Marioni J.C., Pai A.A., Degner J.F., Engelhardt B.E., Nkadori E., Veyrieras J.B., Stephens M., Gilad Y., Pritchard J.K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pramila T., Wu W., Miles S., Noble W.S., Breeden L.L. The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle. Genes Dev. 2006;20:2266–2278. doi: 10.1101/gad.1450606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rozenblatt-Rosen O., Stubbington M.J.T., Regev A., Teichmann S.A. The human cell Atlas: from vision to reality. Nature. 2017;550:451–453. doi: 10.1038/550451a. [DOI] [PubMed] [Google Scholar]
- Schlauch D., Glass K., Hersh C.P., Silverman E.K., Quackenbush J. Estimating drivers of cell state transitions using gene regulatory network models. BMC Syst. Biol. 2017;11:139. doi: 10.1186/s12918-017-0517-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- De Smet R., Marchal K. Advantages and limitations of current network inference methods. Nat. Rev. Microbiol. 2010;8:717–729. doi: 10.1038/nrmicro2419. [DOI] [PubMed] [Google Scholar]
- Sonawane A.R., Platig J., Fagny M., Chen C.Y., Paulson J.N., Lopes-Ramos C.M., Demeo D.L., Quackenbush J., Glass K., Kuijjer M.L. Understanding tissue-specific gene regulation. Cell Rep. 2017;21:1077–1088. doi: 10.1016/j.celrep.2017.10.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S., Mesirov J.P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U S A. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang Y.X., Huang H. Review on statistical methods for gene network reconstruction using expression data. J. Theor. Biol. 2014;362:53–61. doi: 10.1016/j.jtbi.2014.03.040. [DOI] [PubMed] [Google Scholar]
- Wuensche A. Discrete dynamical networks and their attractor basins. Complex. Int. 1998;6:3–21. [Google Scholar]
- Yang Y.H., Paquet A., Dudoit S. 2009. marray: Exploratory Analysis for Two-Color Spotted Microarray Data. https://www.bioconductor.org/packages/release/bioc/html/marray.html. [Google Scholar]
- Zhang L., Pagano J.S. IRF-7, a new interferon regulatory factor associated with Epstein-Barr virus latency. Mol. Cell. Biol. 1997;17:5748–5757. doi: 10.1128/mcb.17.10.5748. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nusbaum C., Myers R.M., Brown M., Li W., Liu X.S. Model-based analysis of ChIP-seq (MACS) Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
The text files within this zipped directory contain the edges and the expression data for each generated in silico dataset used to benchmark LIONESS.






