This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2023 May 5:2023.05.03.539308. [Version 1] doi: 10.1101/2023.05.03.539308

A Bayesian Noisy Logic Model for Inference of Transcription Factor Activity from Single Cell and Bulk Transcriptomic Data

Argenis Arriojas 1,2,3, Susan Patalano 3, Jill Macoska 3,*, Kourosh Zarringhalam 1,3,*
PMCID: PMC10187261  PMID: 37205561

Abstract

The advent of high-throughput sequencing has made it possible to measure the expression of genes at relatively low cost. However, direct measurement of regulatory mechanisms, such as Transcription Factor (TF) activity, is still not readily feasible in a high-throughput manner. Consequently, there is a need for computational approaches that can reliably estimate regulator activity from observable gene expression data. In this work, we present a noisy Boolean logic Bayesian model for TF activity inference from differential gene expression data and causal graphs. Our approach provides a flexible framework to incorporate biologically motivated TF-gene regulation logic models. Using simulations and controlled over-expression experiments in cell cultures, we demonstrate that our method can accurately identify TF activity. Moreover, we apply our method to bulk and single cell transcriptomics measurements to investigate transcriptional regulation of fibroblast phenotypic plasticity. Finally, to facilitate usage, we provide user-friendly software packages and a web interface to query TF activity from user-supplied differential gene expression data: https://umbibio.math.umb.edu/nlbayes/.

Author Summary

NextGen RNA sequencing (RNA-Seq) has enabled simultaneous measurement of the expression level of all genes. Measurements can be done at the population level or single-cell resolution. However, direct measurement of regulatory mechanisms, such as Transcription Factor (TF) activity, is still not possible in a high-throughput manner. As such, there is a need for computational models to infer regulator activity from gene expression data. In this work, we introduce a Bayesian methodology that utilizes prior biological knowledge on bio-molecular interactions in conjunction with readily available gene expression measurements to estimate TF activity. The Bayesian model naturally incorporates biologically motivated combinatorial TF-gene interaction logic models and accounts for noise in gene expression data as well as prior knowledge. The method is accompanied by efficiently implemented R and Python software packages as well as a user-friendly web-based interface that allows users to upload their gene expression data and run queries on a TF-gene interaction network to identify and rank putative transcriptional regulators. This tool can be used for a wide range of applications, such as identification of TFs downstream of signaling events and environmental or molecular perturbations, the aberration in TF activity in diseases, and other studies with ‘case-control’ gene expression data.

Introduction

Gene regulation plays an essential role in many cellular processes, including metabolism, signal transduction, development, and cell fate (Lelli, Slattery, & Mann, 2012; de Jong, 2002). At the transcriptional level, regulation of genes is orchestrated by the concerted action between Transcription Factors (TFs), histone modifiers, and distal cis-regulatory elements to finely tune and modulate expression of genes (Wilkinson, Nakauchi, & Gottgens, 2017). Sequence specific TFs, which have affinity for specific DNA sequences, may bind to cis-regulatory elements at the enhancer or promoter region of genes to either activate (upregulate) or repress (downregulate) the expression of genes. Aberration in TF activity and the dysregulation of target genes have been implicated in many pathological states and human disease (Lee & Young, 2013). Activity of TFs can be triggered downstream of signaling events, which in turn may be activated in response to environmental and molecular perturbations (Barolo & Posakony, 2002). Perturbations in TF activity often result in modulation of gene expression. The technological advancements in high-throughput sequencing have made it possible to measure expression of genes at relatively low cost. However, direct measurement of regulatory mechanisms, such as TF protein expression and functional activity in a high-throughput manner is still not readily available. Consequently, there is a need for computational approaches that can identify active regulatory mechanisms from observable gene expression data.

The scientific community has proposed several computational algorithms and biophysical models to study the impact of TF activity on gene expression. Some of these algorithms use statistical and probabilistic approaches to infer TF activity and dynamics directly from gene expression data (Bonneau, et al., 2006; Asif & Sanguinetti, 2011; Ocone & Sanguinetti, 2011; Bulashevska & Eils, 2005; Veber, Guziolowski, Le Borgne, Radulescu, & Siegel, 2008; Segal, et al., 2003), and more recently (Schacht, Oswald, Eils, Eichmüller, & König, 2014; Jiang, Freedman, Liu, & Liu, 2015; Fröhlich, 2015; Alvarez, et al., 2016; Yu, et al., 2017; Keenan, et al., 2019), while others rely on biophysical approaches to model expression of genes based on known TF-gene interactions (Honkela, et al., 2010; Djordjevic, Sengupta, & Shraiman, 2003). Boolean networks and probabilistic extensions have also been used to model gene regulation (Hashimoto, et al., 2004; Friedman, 2003; Bar-Joseph, et al., 2003; Segal, Taskar, Gasch, Friedman, & Koller, 2001; Segal, et al., 2003; Bulashevska & Eils, 2005). In logic models, genes are assumed to be either ON or OFF and Boolean logic (AND, OR, NOR, etc.) is utilized to model combinatorial regulation. For example, (Bulashevska & Eils, 2005) introduced a Bayesian approach to generalize the Boolean logic to incorporate noise and utilized their approach to reconstruct gene regulatory networks in yeast.

Another class of algorithms uses prior biological knowledge on biomolecular interactions to link a differential gene expression (DGE) profile to upstream regulators (e.g., TFs) (Zarringhalam, Enayetallah, Gutteridge, Sidders, & Ziemek, 2013; Fakhry, et al., 2016; Subramanian, et al., 2005; Chindelevitch, et al., 2012; Chindelevitch, Loh, Enayetallah, Berger, & Ziemek, 2012; Kramer, Green, Pollard, & Tugendreich, 2014; Farahmand, O’Connor, Macoska, & Zarringhalam, 2019). The essential ingredients of this type of algorithm are (i) an input DGE profile, (ii) a network of biomolecular interactions or pre-defined gene sets, and (iii) an inference algorithm to query the network. The output is a set of candidate regulators, pathways, or biological processes with associated probabilities or significance p-values. The DGE profile, as obtained from RNA-Seq or microarray studies, is the observable input that quantifies the difference in transcript abundance between two conditions (e.g., healthy vs. disease, stimulated vs. not stimulated, etc.). The network of biomolecular interactions encapsulates the prior biological knowledge. The inference algorithms typically map the DGE profile to the network to identify drivers (nodes, terms, or paths in the graph) of the observed transcriptional changes.

Despite the popularity and success of these methods, several challenges remain to be sufficiently addressed. For example, biophysical models are computationally expensive and are suitable for small scale applications or simulation studies. Boolean logic models, although simple to implement, cannot directly account for noise in gene expression data. On the other hand, probabilistic models for inference of gene regulatory networks typically overlook the context of experiment. Regulatory networks may be noisy or contain interactions that are applicable in a specific context only. Properly modeling any dependencies on the biological context in active regulator inference and enrichment analysis algorithms can lead to more accurate inference of the regulatory mechanisms specific to that context. Utilizing causality and information on mode of regulation (activation vs. repression) can also significantly reduce false positive predictions, resulting in more interpretable models. Moreover, biologically motivated TF-gene interaction logic models (e.g., combinatorial effects of activators and repressors on gene expression) must be taken into consideration when inferring transcriptional regulatory programs. To address these challenges, we have developed a Noisy-Logic Bayesian (NLBayes) TF activity inference model that accounts for these factors in a unified manner. Given an input DGE profile, our model incorporates the prior information on causal regulatory interactions and makes posterior adjustments to further account for noise and determine the context-specific posterior network structure and active regulators through a Gibbs sampling procedure.

We evaluate the performance of our model using simulation studies as well as over-expression datasets and demonstrate that our method can accurately identify active transcriptional regulators from gene expression data and causal graphs. We benchmark our algorithm against VIPER, a closely related method that is widely used for identification of regulon activity (Alvarez, et al., 2016). Both methods are able to identify relevant TFs in the corresponding experiments, with several TFs identified by both methods at the intersection, indicating that the algorithms complement each other. Our method can be used for novel biological discoveries. To illustrate this, we apply our method to investigate transcriptional regulation of fibroblast phenotypic plasticity in response to signaling molecules TGFβ and CXCL12. This study utilizes differential expression profiles from a bulk RNA-Seq experiment using prostate cell lines (Gharaee-Kermani, et al., 2012), stimulated with TGFβ and CXCL12. Our analysis recovers several TFs, including YAP1 as the top prediction, which has been identified as a driver of myofibroblast differentiation in multiple tissue phenotypes (Piersma, et al., 2015; Pang, et al., 2023; Li, et al., 2022; Li, et al., 2022; Xu, et al., 2021; Lee, et al., 2022; Salloum, et al., 2021; Wang, et al., 2020; Li, et al., 2021; Allison, 2021). Additionally, we present new single cell gene expression data from the same prostate cell lines as well as three additional human prostate fibroblast cell lines. We characterize the cell lines at the transcriptional level and apply our algorithm to identify major transcriptional regulators in each cell line and study the impact of immortalization on transcriptional regulation. Our algorithm provides a general framework and a widely applicable tool to study transcriptional regulators of differential gene expression. To facilitate wider use, we provide R and Python packages and a web interface for running inference experiments. Fig. 1 summarizes the overall approach.

Fig. 1.


Schematic diagram of the NLBayes inference pipeline. The starting point is gene expression data from two conditions, from which a differential gene expression profile is calculated by discretizing gene values to −1 (down regulated), 0 (not regulated) or 1 (up regulated) using cutoff thresholds on p-values and/or foldchange. A TF-gene interaction network is used to build the graphical model. Values for gene nodes are populated from the differential expression profile. NLBayes runs a probabilistic query on the causal network and outputs the posterior distribution of TFs, from which the activation state of the TFs is determined.
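The discretization step described in the caption can be sketched as follows. This is a minimal illustration and not code from the NLBayes packages: the function name `discretize_dge`, the toy gene table, and the choice to require both cutoffs at once are assumptions. The default cutoffs mirror the values later reported for the E2F3 and MYC experiments in Table 1.

```python
def discretize_dge(genes, max_p=0.01, min_abs_log2fc=1.0):
    """Map each gene to -1 (down regulated), 0 (not regulated) or 1 (up regulated).

    `genes` maps a gene name to a (p_value, log2_fold_change) pair.
    Assumption: a gene must pass BOTH cutoffs to be called regulated.
    """
    profile = {}
    for name, (p_value, log2fc) in genes.items():
        if p_value <= max_p and abs(log2fc) >= min_abs_log2fc:
            profile[name] = 1 if log2fc > 0 else -1
        else:
            profile[name] = 0
    return profile


# Hypothetical input table: gene -> (p-value, log2 fold change)
toy = {
    "GENE_A": (0.001, 2.3),
    "GENE_B": (0.004, -1.7),
    "GENE_C": (0.200, 3.0),  # large change, but not significant
    "GENE_D": (0.001, 0.2),  # significant, but small change
}
profile = discretize_dge(toy)
```

Only the resulting ternary profile, not the raw expression values, is passed on to the graphical model.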

Results

To test our approach, we performed a series of experiments including simulation studies, benchmarks against an alternative approach, and inference of TF activity on novel datasets.

The core of the algorithm uses an OR-NOR transcriptional regulation logic to predict TF activity. In this logic, two conditions must be satisfied for gene activation: 1) at least one activator targets the gene and 2) no inhibitor targets the gene. For down regulation of genes, on the other hand, we use a simple OR transcriptional regulation logic model: at least one inhibitor must target the gene.
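The noise-free version of this logic can be written down in a few lines. The sketch below is illustrative only: it interprets "targeting" as "an active TF with an edge to the gene", it omits the noise terms of the Bayesian model, and the function name `predicted_gene_state` and the toy regulators are hypothetical.

```python
def predicted_gene_state(regulators, active_tfs):
    """Noise-free OR-NOR rule for a single target gene.

    `regulators` maps a TF name to +1 (activator) or -1 (inhibitor).
    Returns 1 (up regulated), -1 (down regulated) or 0 (unchanged).
    """
    active_signs = [sign for tf, sign in regulators.items() if tf in active_tfs]
    has_activator = any(sign > 0 for sign in active_signs)
    has_inhibitor = any(sign < 0 for sign in active_signs)
    if has_inhibitor:   # OR over inhibitors: one active inhibitor suffices
        return -1
    if has_activator:   # OR over activators, and no active inhibitor (NOR)
        return 1
    return 0


# Hypothetical gene with two activators and one inhibitor
regs = {"TF1": 1, "TF2": -1, "TF3": 1}
up = predicted_gene_state(regs, {"TF1"})            # activator only
down = predicted_gene_state(regs, {"TF1", "TF2"})   # inhibitor overrides
```

Under this rule an active inhibitor always dominates, which is why the two active regulators in the second call yield a down-regulated gene.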

The algorithm outputs posterior probabilities for each TF activation state. The prior probability for TF activation is set to a small value, p0 = 0.01, and as such we consider a TF with posterior probability p ≥ 0.2 as potentially relevant. We then define three thresholds to classify inferred active TFs: High-confidence (p ≥ 0.8), Mid-confidence (p ≥ 0.5), and Low-confidence (p ≥ 0.2).
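The three-level classification can be expressed as a small helper. This sketch assumes the thresholds 0.8, 0.5 and 0.2 are inclusive lower bounds; the function name `confidence_level` is hypothetical, and the example posteriors are taken from the E2F3 column of Table 1.

```python
def confidence_level(posterior):
    """Return 'high', 'mid', 'low', or None for a posterior probability.

    Assumption: thresholds are inclusive (p >= 0.8, p >= 0.5, p >= 0.2).
    """
    if posterior >= 0.8:
        return "high"
    if posterior >= 0.5:
        return "mid"
    if posterior >= 0.2:
        return "low"
    return None  # below 0.2: not considered potentially relevant


# Posterior values as listed for the E2F3 experiment in Table 1
posteriors = {"E2F1": 1.00, "HNF1B": 0.77, "PGR": 0.39}
labels = {tf: confidence_level(p) for tf, p in posteriors.items()}
```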

Simulation studies

We performed several simulation studies to assess the ability of our algorithm in recovering active transcriptional regulators from gene expression data and to test the robustness of the inference process to noise in gene expression data and the causal graph of TF-gene interactions. For this analysis we generated a random interaction network consisting of 250 TFs and 5000 downstream target genes. The downstream targets of each TF were picked at random from a binomial distribution, resulting in ~30000 interactions (edges) in the network. The edges of the network were randomly set as activation (65%) and inhibition (35%).
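A network with these properties can be generated as in the sketch below. It assumes that every possible TF-gene edge is included independently with a fixed probability, so that the number of targets per TF is binomial; the edge probability 0.024 is simply 30000 / (250 × 5000). The function name and the seed are hypothetical, and this is not the simulation code used in the study.

```python
import random


def random_network(n_tfs=250, n_genes=5000, edge_prob=0.024,
                   frac_activation=0.65, seed=0):
    """Return {tf: {gene: sign}} with sign +1 (activation) or -1 (inhibition)."""
    rng = random.Random(seed)
    network = {}
    for tf in range(n_tfs):
        targets = {}
        for gene in range(n_genes):
            # Independent Bernoulli draw per edge -> binomial target counts
            if rng.random() < edge_prob:
                targets[gene] = 1 if rng.random() < frac_activation else -1
        network[tf] = targets
    return network


network = random_network()
n_edges = sum(len(targets) for targets in network.values())
n_activating = sum(sign == 1
                   for targets in network.values()
                   for sign in targets.values())
```

With these defaults each TF has about 120 targets on average, and roughly 65% of the edges are activating.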

We randomly selected 10 TFs and assigned them as active (the ground truth) and simulated downstream differential gene expression data by assigning +1 or −1 to 10% of the target genes according to the causal graph. For genes targeted by multiple active TFs, we calculated the algebraic sum of all incoming interactions and took the net sign. This produced an average of 120 differentially expressed genes. Each experiment was repeated 20 times.
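One way to implement this simulation step is sketched below. Several details are assumptions rather than statements from the text: the 10% of targets are sampled separately for each active TF, only the sampled edges contribute to the algebraic sum, and genes whose contributions cancel exactly are treated as not differentially expressed. The function name and the toy network are hypothetical.

```python
import random


def simulate_dge(network, active_tfs, frac_modulated=0.10, seed=0):
    """Return {gene: +1 or -1} for genes modulated by the active TFs.

    `network` maps each TF to {gene: sign}, sign +1 (activation) or -1 (inhibition).
    """
    rng = random.Random(seed)
    net_effect = {}
    for tf in active_tfs:
        targets = sorted(network[tf])
        n_pick = round(frac_modulated * len(targets))
        for gene in rng.sample(targets, n_pick):
            # Algebraic sum of the incoming (sampled) interactions
            net_effect[gene] = net_effect.get(gene, 0) + network[tf][gene]
    # Keep the net sign; exact ties are dropped (assumption)
    return {g: (1 if s > 0 else -1) for g, s in net_effect.items() if s != 0}


# Hypothetical toy network: TF1 activates genes 0-19, TF2 inhibits genes 10-39
toy_network = {
    "TF1": {g: 1 for g in range(20)},
    "TF2": {g: -1 for g in range(10, 40)},
}
dge = simulate_dge(toy_network, ["TF1", "TF2"])
```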

Impact of data randomization:

For this analysis, we randomized a fraction of the input data (0, 0.25, 0.5, 0.75, 1.00) by randomly toggling the values. At 0% the data is not randomized, while at 100% the input data is completely random. The inference procedure was run on each input data set. Figs. 2A and 2B show the ROC and precision-recall curves for each randomization experiment respectively. Fig. 2C illustrates the sampled posterior distributions for active TFs at different randomization levels. As expected, at 100% the model fails to recover active TFs. However, up to 75% randomization the algorithm is still able to recover active TFs, with significant increase in accuracy at lower levels of noise.
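The data randomization can be sketched as follows. The text does not define "toggling" precisely, so this sketch makes an assumption: a chosen fraction of genes has its value redrawn uniformly from {−1, 0, 1}, which makes the profile fully random at a fraction of 1.00. The function name and the toy profile are hypothetical.

```python
import random


def randomize_profile(profile, fraction, seed=0):
    """Redraw the value of a random `fraction` of genes from {-1, 0, 1}."""
    rng = random.Random(seed)
    genes = sorted(profile)
    n_pick = round(fraction * len(genes))
    noisy = dict(profile)
    for gene in rng.sample(genes, n_pick):
        noisy[gene] = rng.choice((-1, 0, 1))  # may coincide with the old value
    return noisy


# Hypothetical ternary profile for 100 genes
clean = {f"g{i:03d}": (1 if i % 10 == 0 else 0) for i in range(100)}
noisy = randomize_profile(clean, fraction=0.25)
n_changed = sum(clean[g] != noisy[g] for g in clean)
```

Because a redrawn value can coincide with the original one, the number of genes that actually change is at most the selected fraction.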

Fig. 2.


Performance evaluation of the model before and after randomization of data. ROC curves and Precision vs Recall curves are shown for randomization simulations in input gene expression (A, B) and the TF-gene interaction network (E, F). p indicates the corresponding fraction of randomization used. AUC scores are displayed in the legend. (C) The posterior distributions for TF activity are shown. The colors indicate the ground truth, i.e. whether the TF was set to active in the simulation. (D) The impact of the number of target genes on the inference results for the case of 25% randomization of data.

To assess the impact of the number of target genes regulated by a TF on the posterior probability of the TF’s activity, we ran multiple simulations and plotted the posterior probability vs the number of genes regulated by TFs, color coded by TF activity (Fig. 2D). For these simulations, we used noisy data at 25% randomization. We observe that for active TFs with fewer than 30 target genes, the posterior probability is low. This is expected because, in these simulations, only 10% of target genes are set as differentially expressed, so a TF with 30 targets has on average only 3 modulated target genes. This is too little information to shift the posterior probability. However, as the total number of target genes increases, we observe a threshold effect, where posterior probabilities stabilize and the total number of targets no longer has an impact on the inference. These experiments were repeated for several other randomly generated networks with consistent results (SI File Table S4).

Impact of network randomization:

For this analysis, we select a fraction (0, 0.25, 0.5, 0.75, 1.00) of the edges in the network and randomly reassign them to different target genes. The inference was run using unperturbed input gene expression data and randomized networks. Figs. 2E and 2F show the ROC and precision-recall curves for each randomization experiment. A similar picture as in data randomization emerges.
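The edge reassignment can be sketched as below. The assumptions here are that a rewired edge keeps its source TF and its sign, that the new target is drawn uniformly from genes the TF does not already regulate, and that the network is stored as a list of (TF, gene, sign) triples. The function name and the toy data are hypothetical.

```python
import random


def rewire_edges(edges, genes, fraction, seed=0):
    """Reassign a random `fraction` of edges to different target genes.

    `edges` is a list of (tf, gene, sign) triples; TF and sign are preserved.
    """
    rng = random.Random(seed)
    edges = list(edges)
    existing = {(tf, gene) for tf, gene, _ in edges}
    n_pick = round(fraction * len(edges))
    for idx in rng.sample(range(len(edges)), n_pick):
        tf, old_gene, sign = edges[idx]
        # Candidate targets: genes this TF does not regulate yet
        candidates = [g for g in genes if (tf, g) not in existing]
        if not candidates:
            continue
        new_gene = rng.choice(candidates)
        existing.discard((tf, old_gene))
        existing.add((tf, new_gene))
        edges[idx] = (tf, new_gene, sign)
    return edges


# Hypothetical toy network with 7 edges over 10 genes
toy_edges = [("TF1", g, 1) for g in range(4)] + [("TF2", g, -1) for g in range(3, 6)]
rewired = rewire_edges(toy_edges, genes=list(range(10)), fraction=1.0)
```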

These results demonstrate the robustness of our algorithm at 25% randomization level in both the gene expression data as well as the input network, while still retaining some prediction power at the 50% and 75% randomization levels.

Over-expression datasets

To test the ability of our algorithm in recovering known active regulators, we used three publicly available over-expression datasets (GSE3151), performed on human primary mammary epithelial cell cultures, each generated by over-expression of an oncogene: E2F3, c-Myc, and H-Ras (Bild, et al., 2006). For these inference experiments, we utilized a TF-gene interaction network generated by Farahmand et al. (Farahmand, O’Connor, Macoska, & Zarringhalam, 2019). The network was generated using several high-throughput datasets and a Gaussian Graphical Model. We chose to use this network as it showed consistently high predictive power across several datasets (Farahmand, O’Connor, Macoska, & Zarringhalam, 2019). In each experiment, differentially expressed genes were determined relative to the control sample and used as input to the TF activity inference algorithm. Table 1 summarizes the inference results. Inferred active regulators are split into three categories based on posterior probability p of inferred activity: High-confidence (p ≥ 0.8), Mid-confidence (p ≥ 0.5), and Low-confidence (p ≥ 0.2). The percentage of differentially expressed genes targeted (explained) by at least one inferred active regulator is also presented in the table.
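The "DEG explained" percentage can be computed as in the following sketch. It assumes that a gene counts as explained whenever any regulator at or above the confidence threshold has an edge to it, regardless of edge sign; the function name and the toy data are hypothetical and do not reproduce the values in Table 1.

```python
def fraction_explained(deg_genes, network, posteriors, threshold):
    """Fraction of DEGs targeted by at least one TF with posterior >= threshold.

    `network` maps each TF to the set of its target genes.
    """
    deg_genes = set(deg_genes)
    if not deg_genes:
        return 0.0
    covered = set()
    for tf, posterior in posteriors.items():
        if posterior >= threshold:
            covered.update(network.get(tf, ()))
    return len(deg_genes & covered) / len(deg_genes)


# Hypothetical toy example
degs = {"g1", "g2", "g3", "g4"}
toy_network = {"TF1": {"g1", "g2"}, "TF2": {"g3"}, "TF3": {"g4", "g9"}}
toy_posteriors = {"TF1": 0.95, "TF2": 0.60, "TF3": 0.10}
explained = {t: fraction_explained(degs, toy_network, toy_posteriors, t)
             for t in (0.8, 0.5, 0.2)}
```

Lowering the threshold can only add regulators, so the explained fraction is non-decreasing from the 0.80 row to the 0.20 row, as in Table 1.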

Table 1.

Predicted active TFs from 3 over-expression experiments: E2F, MYC, and RAS. Top 3 rows summarize the total number of Differentially Expressed Genes (DEGs) and the number of genes with increased and decreased RNA expression. Rows 4 and 5 show the proportion of explained DEGs by at least 1 inferred regulator at the indicated confidence level. Bottom colored panel lists the inferred regulators, split into three categories, high-confidence (green), mid-confidence (blue) and low-confidence (yellow).

E2F3 MYC RAS

max p-value = 0.01 max p-value = 0.01 max p-value = 0.01
min abs(log2 fold-change) = 1.00 min abs(log2 fold-change) = 1.00 min abs(log2 fold-change) = 2.00

DEG Increased Decreased DEG Increased Decreased DEG Increased Decreased
418 321 97 127 75 52 409 163 246

Confidence level DEG explained Confidence level DEG explained Confidence level DEG explained

0.80 53% 0.80 32% 0.80 58%
0.50 61% 0.50 49% 0.50 64%
0.20 65% 0.20 70% 0.20 71%

Name Enrichment Inference Name Enrichment Inference Name Enrichment Inference

E2F1 3.7E-06 1.00 TCF3 4.9E-01 1.00 E2F3 6.1E-01 1.00
FLI1 2.7E-02 1.00 HDAC2 1.0E+00 0.90 PPARG 4.0E-01 1.00
FOXA2 7.0E-01 1.00 TP53 1.2E-02 0.85 TEAD4 4.9E-01 1.00
HOXA4 7.0E-01 1.00 - - - KLF5 2.2E-02 1.00
MXI1 2.6E-01 1.00 - - - FOSL1 2.4E-06 1.00
BCOR 2.2E-02 1.00 - - - RELA 1.4E-01 1.00
AR 1.4E-02 1.00 - - - ASCL1 1.0E+00 0.99
FOXF1 7.9E-01 1.00 - - - STAT3 6.1E-01 0.99
TCF3 8.6E-01 1.00 - - - PRKDC 1.0E+00 0.97
MBD3 1.0E+00 0.99 - - - JUND 1.4E-01 0.97
SOX2 7.3E-01 0.97 - - - NFKB2 1.2E-02 0.96
- - - - - - SOX2 5.3E-01 0.96
- - - - - - CTCFL 1.0E+00 0.94
- - - - - - GATA3 9.0E-01 0.92
- - - - - - KDM6B 3.2E-03 0.86
HNF1B 1.0E+00 0.77 SIX2 1.0E+00 0.77 MYC 2.1E-01 0.75
FOXM1 1.7E-02 0.76 ETV1 1.0E+00 0.73 NME2 1.0E+00 0.66
NR5A2 9.2E-01 0.72 KLF11 1.0E+00 0.54 ESRRA 1.0E+00 0.65
HNF4A 1.0E+00 0.71 ILF3 2.9E-01 0.51 SREBF1 1.8E-01 0.64
POU5F1 7.0E-01 0.69 - - - MBD2 9.7E-01 0.64
GATA3 1.5E-01 0.68 - - - - - -
TCF21 1.0E+00 0.50 - - - - - -
PGR 8.6E-01 0.39 PAX5 1.0E+00 0.49 PBX4 1.0E+00 0.41
TEAD1 1.0E+00 0.39 TFAP4 2.0E-01 0.49 HOXB13 1.0E+00 0.37
GATA6 1.0E+00 0.39 ZC3H8 5.0E-01 0.45 FOXF1 3.6E-03 0.36
VEZF1 1.0E+00 0.30 PPARG 7.6E-01 0.44 KLF6 1.8E-02 0.34
EGR1 7.0E-01 0.29 PDX1 6.9E-01 0.42 BCOR 9.7E-01 0.33
NKX2-1 1.0E+00 0.29 TEAD4 2.9E-01 0.37 HNF1B 1.0E+00 0.32
HOXA9 4.3E-01 0.26 BATF 7.3E-01 0.37 SRF 9.0E-01 0.31
HOXC5 1.0E+00 0.26 THAP11 1.0E+00 0.36 GATA4 1.0E+00 0.27
- - - SMARCC2 1.0E+00 0.33 CEBPB 4.1E-02 0.24
- - - NCOA1 1.0E+00 0.33 MAFF 3.6E-03 0.22
- - - TRIM28 2.9E-01 0.24 - - -
- - - MYB 1.0E+00 0.21 - - -

For the E2F3 over-expression data, E2F1 is returned as the top regulator. E2F1 and E2F3 have a similar function in control of the cell cycle and are similarly implicated in cancer (Chen, Tsai, & Leone, 2009). E2F3 also regulates expression of FLI1, an ETS domain transcription factor and proto-oncogene (Li, Luo, Liu, Zacksenhaus, & Ben-David, 2015), as well as the FOXA2 transcription factor that promotes aggressive prostate cancer (Qi, Pellecchia, & Ronai, 2010) and HOXA4, a transcription factor important for embryonic development but often over-expressed in human colorectal (Bhatlekar, et al., 2014) and ovarian (Yamashita, et al., 2006) cancers.

The MYC and RAS oncogenic proteins transcriptionally activate multiple genes associated with tumor progression and prognosis. MYC transcriptionally activates TCF3, HDAC2 and TP53. The TCF3 transcription factor was recently found to promote gastric (Xie, et al., 2021) and endometrial (Gui, et al., 2021) cancers, among others. HDAC2 has been reported to promote metastasis in pancreatic (Krauß, et al., 2021) and breast (Huang, et al., 2021) cancers. MYC also regulates transcription of the well-known TP53 tumor suppressor gene. RAS activates transcription of E2F3 itself, as well as PPARG and TEAD4. Modulation of PPARG activity has been intensively examined as an anti-cancer therapeutic target (Chi, et al., 2021), and TEAD4 is known to modulate different cellular processes in cancer via its transcriptional output (Chen, et al., 2020).

Taken together, these results demonstrate that the algorithm can accurately detect modulated transcriptional signals from DNA binding proteins.

Benchmarks

We compared the performance of our algorithm against VIPER (Alvarez, et al., 2016), a widely used method for inference of regulon activity from input gene expression data and causal graphs. For this benchmark, we used the Human Breast Carcinoma context-specific network from the ARACNe interactome (Lachmann, Giorgi, Lopez, & Califano, 2016). This network is appropriate for the over-expression datasets, which used human breast cell cultures, and allows a fair comparison. Fig. 3 summarizes the overlap between the two algorithms. Overall, there is good agreement between the methods, and each algorithm also recovers regulators that are not predicted by the other, demonstrating the viability of both methods in recovering modulated TFs. In the E2F3 over-expression experiment, our algorithm infers SOX17 as an active TF. SOX genes have been shown to interact with E2F3, and to be regulated by common micro-RNAs. For example, MicroRNA-141 regulates both SOX17 and E2F3 (Hamidi, Taghehchian, Basirat, Zangouei, & Moghbeli, 2022; Zhou, et al., 2015; Jia, et al., 2012). On the other hand, VIPER detects MEIS2 as a relevant TF in the over-expression of E2F3. siRNA-mediated silencing of RB1 and MEIS2 increases expression of E2F3 (Alam, et al., 2019). This shows that both approaches can complement each other, providing a wider picture of TF activity. While TFs predicted by only one method may be important and can provide useful clues to the underlying biology, there is higher confidence in predictions at the intersection of both methods.

Fig. 3.


Comparison of active TF predictions by our method (y axis) and VIPER (x axis), in three separate overexpression experiments. Input network used is the BRCA derived regulon from (Alvarez, et al., 2016). Jointly predicted regulators are colored in red. Top predictions specific to one algorithm are labeled in black. Gray area shows low confidence predictions by both algorithms.

Fibroblast phenotypic plasticity.

To demonstrate the utility of our methodology in discovery of novel biology, we applied our algorithm to study fibroblast-to-myofibroblast phenotypic conversion in response to pro-fibrotic signaling molecules TGFβ and CXCL12 (Patalano, et al., 2018; Gharaee-Kermani, et al., 2012; Rodríguez-Nieves, Patalano, Almanza, Gharaee-Kermani, & Macoska, 2016). In this experiment, patient-derived, immortalized prostate N1 cells were treated with the pro-fibrotic proteins TGFβ and CXCL12, both of which are known to promote collagen expression (Patalano, et al., 2018). TGFβ acts upon the TGF/TGFR signaling axis and activates multiple Smad proteins, while CXCL12 acts upon the CXCL12/CXCR4 axis, which transactivates EGFR and downstream signaling through MEK/ERK and PI3K/Akt pathways. Both signaling axes converge in the nucleus and promote the expression of multiple collagen genes (Gharaee-Kermani, et al., 2012; Rodríguez-Nieves, Patalano, Almanza, Gharaee-Kermani, & Macoska, 2016). RNA-Seq data was processed and compared to the background model to generate differential gene expression profiles as previously described (Patalano, et al., 2018). Differentially expressed genes in TGFβ- and CXCL12-treated cells were identified at fold change ≥ 2 and four cutoff thresholds for p-value. The different cutoffs for p-value were applied to examine the impact of stringency in significance on the inference results, and were chosen such that only the top 200, 400, 600 or 800 differentially expressed genes are considered. To achieve this, we sorted the table of differentially expressed genes by p-value in ascending order and took the top rows for the analysis. Both datasets were used as input to the TF activity inference algorithm. For these experiments (and the remaining experiments), we used the 3-tissue TF-gene interaction network generated by Farahmand et al.
This network contains interaction edges that are common in at least three of the tissues used in that work and showed consistent performance across multiple datasets (Farahmand, O’Connor, Macoska, & Zarringhalam, 2019). Table 2 summarizes the results. Inferred active regulators upon CXCL12 induction are largely similar to that of TGFβ (Table 2). This is expected as transcriptional profiles induced by TGFβ and CXCL12 are 75% similar (Patalano, et al., 2018). The top predicted regulators for TGFβ and CXCL12 for the top 200 DEGs are YAP1, RBPJ, KMT2C, ELF1, STAT1, and BPTF. YAP1 is known to play a role in the development and progression of multiple cancers as a transcriptional regulator of this signaling pathway and may function as a potential target for cancer treatment (Astudillo, 2022). Moreover, YAP1 has been identified as a driver of myofibroblast differentiation in several tissue phenotypes, like skin, heart, lung, pharynx, liver and kidney (Piersma, et al., 2015; Pang, et al., 2023; Li, et al., 2022; Li, et al., 2022; Xu, et al., 2021; Lee, et al., 2022; Salloum, et al., 2021; Wang, et al., 2020; Li, et al., 2021; Allison, 2021). STAT1 is a member of the STAT protein family. In response to cytokines and growth factors, STAT family members are phosphorylated by the receptor associated kinases, and then form homo- or heterodimers that translocate to the cell nucleus where they act as transcription activators. The protein encoded by this gene can be activated by various ligands including interferon-alpha, interferon-gamma, EGF, PDGF and IL6. This protein mediates the expression of a variety of genes, which is thought to be important for cell viability in response to different cell stimuli (Harrison & Moseley, 2020). RB1 is a negative regulator of the cell cycle and was the first tumor suppressor gene found. The encoded protein also stabilizes constitutive heterochromatin to maintain the overall chromatin structure. 
The active, hypophosphorylated form of the protein binds transcription factor E2F1. The deletion or mutation of the RB1 gene in many human cancers defines RB1 as a tumor suppressor gene (Knudsen, Pruitt, Hershberger, Witkiewicz, & Goodrich, 2019). LIN9 is a tumor suppressor protein that inhibits DNA synthesis and oncogenic transformation through association with the retinoblastoma 1 protein. The encoded protein also interacts with a complex of other cell cycle regulators to repress cell cycle-dependent gene expression in non-dividing cells (Walston, Iness, & Litovchick, 2021).
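The gene-selection step described earlier in this section (fold change ≥ 2, then the top 200, 400, 600 or 800 genes by p-value) can be sketched as follows. The sketch assumes the fold-change filter applies in either direction, i.e. |log2 fold change| ≥ 1; the function name `top_degs` and the toy rows are hypothetical. The largest p-value among the selected genes corresponds to the "Max p-value" row of Table 2.

```python
def top_degs(rows, n_top, min_abs_log2fc=1.0):
    """Select the `n_top` most significant genes passing the fold-change filter.

    `rows` is a list of (gene, p_value, log2_fold_change) tuples.
    Returns (selected_rows, max_p_value_among_selected).
    """
    kept = [row for row in rows if abs(row[2]) >= min_abs_log2fc]
    kept.sort(key=lambda row: row[1])  # ascending p-value
    selected = kept[:n_top]
    max_p = selected[-1][1] if selected else None
    return selected, max_p


# Hypothetical differential expression table
toy_rows = [
    ("GENE_A", 1e-50, 2.5),
    ("GENE_B", 1e-80, -1.2),
    ("GENE_C", 1e-90, 0.4),   # very significant, but below the fold-change cutoff
    ("GENE_D", 1e-10, 3.1),
]
selected, max_p = top_degs(toy_rows, n_top=2)
```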

Table 2.

Predicted active TFs upon TGFβ (left) and CXCL12 induction (right). The top row shows total number of DEGs using four different p-value cutoff thresholds (Max p-value). Bottom colored panel lists the inferred regulators, split into three categories, high-confidence (green), mid-confidence (blue) and low-confidence (yellow).

TGFβ CXCL12


Top DE genes Top DE genes
200 400 600 800 200 400 600 800


Max p-value 4.7E-133 2.6E-95 1.6E-71 8.0E-53 Max p-value 6.0E-132 4.6E-94 1.2E-72 4.1E-58


YAP1 1.00 1.00 1.00 1.00 YAP1 1.00 1.00 1.00 1.00
RBPJ 1.00 1.00 0.53 1.00 BCLAF1 1.00 1.00 1.00 1.00
KMT2C 1.00 1.00 0.96 0.61 BPTF 1.00 1.00 0.82 0.04
ELF1 1.00 1.00 0.07 0.75 RBPJ 0.99 1.00 1.00 1.00
STAT1 0.96 0.96 1.00 0.80 KMT2C 0.98 0.91 0.07 0.62
BPTF 0.92 0.02 1.00 0.95 GABPA 0.98 0.28 1.00 1.00
HIF1A 0.90 1.00 0.05 0.12 STAT1 0.96 0.99 1.00 0.82
BRCA1 0.63 0.00 0.00 0.00 RB1 0.91 0.95 0.99 1.00
YY2 0.50 0.00 0.00 0.00 LIN9 0.89 0.57 0.83 0.99
JMJD1C 0.08 0.90 0.00 0.00 PRDM1 0.33 0.10 0.02 0.00
VEZF1 0.06 0.89 0.09 0.22 CBFB 0.00 1.00 0.78 0.17
ERG 0.00 0.84 0.00 0.00 AHR 0.01 0.68 0.00 0.00
TCF12 0.00 0.82 0.86 0.91 SP4 0.00 0.45 0.00 0.00
ZNF12 0.00 0.57 0.90 0.25 ATF2 0.00 0.37 0.02 1.00
LIN9 0.15 0.33 0.97 0.90 MYBL2 0.02 0.30 0.01 0.00
BACH1 0.00 0.00 0.98 0.85 BACH1 0.00 0.27 0.24 0.45
RB1 0.02 0.00 0.97 1.00 ELF1 0.05 0.13 0.94 0.38
GABPA 0.00 0.09 0.96 1.00 CHD1 0.00 0.00 0.40 0.10
BCLAF1 0.00 0.15 0.96 0.97 ELK4 0.00 0.01 0.10 0.74
TCF4 0.00 0.00 0.93 0.00 TCF4 0.00 0.00 0.00 0.63
MEF2A 0.01 0.01 0.48 0.38 PIAS1 0.00 0.00 0.04 0.59
SETX 0.00 0.02 0.03 0.94 SMAD4 0.00 0.01 0.17 0.47
ATF2 0.00 0.00 0.16 0.34 SETX 0.00 0.00 0.02 0.29
PIAS1 0.00 0.00 0.05 0.25 - - - - -
IRF1 0.04 0.04 0.00 0.20 - - - - -


We note that most of these predicted active TFs also appear as top regulators for the top 400, 600 and 800 DEGs (Table 2), indicating the robustness of the algorithm to cutoff stringency criteria in input DGE profiles.

Fibroblast heterogeneity: Single cell experiments.

We performed several scRNA-seq experiments to further investigate the phenotypic plasticity of human prostate fibroblasts and characterize heterogeneity in cell populations. For this study, we utilized four human prostate cell lines: N1, SFT1, pHPF, and iHPF (for more information, see SI File Table S1). N1 cells are HPV E6/E7-immortalized prostate stromal fibroblasts originally explanted and grown from a stromal nodule of benign prostatic hyperplasia (Begley, Kasina, MacDonald, & Macoska, 2008). They exhibit a fibroblastic morphology and express the fibroblastic markers vimentin and calponin. These cells demonstrate secretion and proliferation profiles consistent with aging primary prostate fibroblasts. SFT1 cells are spontaneously immortalized prostate fibroblasts grown from the prostate of a patient with a Solitary Fibrous Tumor of the prostate (Gharaee-Kermani, Mehra, Robinson, Wei, & Macoska, 2014). These cells carry an uncommon NAB2/STAT6 fusion gene that is associated with solitary fibrous tumors and likely accounted for cellular immortalization. pHPF cells are primary Human Prostate Fibroblasts, harvested from a young adult male and purchased at passage 3 from Lifeline Cell Technology. Finally, iHPF cells were created through transduction of pHPF cells with an EF1α-driven hTERT lentivirus construct and have grown continuously in culture for >30 passages.

We applied scRNA-seq to all 4 cell lines. Fig. 4A shows the UMAP projection of the cell lines. The N1 and SFT1 cells form distinct clusters in close proximity. Most of these cells are in G1 phase. pHPF cells also form a single cluster, mostly consisting of cells in G1 phase. Interestingly, the iHPF cells cluster in two groups (A and B) that surround the primary pHPF cells. The majority of cells in the iHPF_A cluster are in G1 phase, while iHPF_B consists of a mix of cells in G1, G2, S, and M phases. To further investigate the identity of these cells, we merged the data with FACS-sorted single cell expression data derived from prostate tissue generated by Henry et al. (Henry, et al., 2018). Fig. 4B shows that the RNA expression profiles of the five human prostate fibroblast cell lines N1, SFT1, iHPF_A and iHPF_B, and primary pHPF cells, cluster as expected with that of tissue-derived human prostate fibroblasts. As seen in Fig. 4C, the five human prostate fibroblast cell lines share a large signature of highly and commonly expressed genes, likely reflecting their common fibroblastic cell type. In particular, all five cell lines express COL1A1 (collagen 1) and VIM (vimentin). Examination of the top 10 differentially expressed genes in the four immortalized cell lines compared to primary human fibroblasts shows that N1 and SFT1 demonstrate a high degree of overlap and commonly express several inflammation-associated genes (CXCL1, ZNFAS1, CHI3L1). The iHPF_A and iHPF_B cells share a common gene signature as well, which includes genes encoding signaling proteins (BEX1, WNT5A), growth factors and pathways (EREG, IGFBP5), and a gene over-expressed in the autoimmune disease rheumatoid arthritis (TGM2). However, iHPF_B cells also highly express genes that are not expressed by iHPF_A, including several associated with vasculogenesis or angiogenesis (ANGPT1, F3, ADAMT1) or connective tissue and bone growth (TNFRSF11B). This suggests that iHPF_B cells may phenotypically resemble endothelial cells, which can differentiate from fibroblasts (Junker, et al., 2013). Fig. 4E quantifies the average log FC of top expressed markers (Fig. 4C) compared to the background (pHPF). Taken together, these data suggest that a seemingly homogenous culture of primary stromal prostate fibroblasts may comprise several subpopulations, as has recently been shown for dermal fibroblasts (Hu, Moore, & Longaker, 2018; Philippeos, et al., 2018).

Fig. 4.

Single cell gene expression data from 4 prostate cell lines. A) UMAP projection of cell lines. iHPF shows two separate phenotypes, here termed iHPF_A and iHPF_B. Cell cycle phase G1 is prominent in N1 and SFT1, but pHPF cells show a significant number of cells in S and G2/M phases. B) Integration with single cell prostate tissue data (Henry, et al., 2018), showing clusters for Fibroblasts (Fib), Smooth Muscle (SM), Basal Epithelia (BE), Luminal Epithelia (LE) and Other Epithelia (OE1, OE2). All 4 cell lines, N1, SFT1, pHPF and iHPF, appear as interconnected clusters, lying between epithelial and fibroblast cells. C) Top expressed genes. Overall, all 5 cell classes share the same highly expressed genes, including COL1A1. D) Top 10 differentially expressed genes with respect to pHPF as a reference. Smallest and largest p-values are 1E-128 and 1E-26 respectively. See SI File Tables S2 and S3 for a list of top genes for each cell line, as shown in (C) and (D); the full list of differentially expressed genes is available in SI Table 1. E) Differential expression (FC) with respect to pHPF for the highly expressed genes shown in C. Notably, even though collagen expression is high across all groups, it is downregulated in N1 when compared to pHPF.

Next, we sought to quantitatively characterize similarities between cell lines at the transcriptional level. We first performed a differential gene expression analysis using the pHPF cell line as the background. Fig. 5A shows a bar plot of the total number of upregulated and downregulated genes in each cell line compared to pHPF cells. We then performed a gene set enrichment analysis on the up- and down-regulated genes (Fig. 5B). As expected, the N1 and SFT1 cells demonstrate a high level of similarity, as we have previously shown that they respond similarly to stimulation with pro-fibrotics (Rodríguez-Nieves, Patalano, Almanza, Gharaee-Kermani, & Macoska, 2016). Conversely, although immortalized from the pHPF cells, iHPF_A and iHPF_B demonstrate a higher than expected dissimilarity, potentially reflecting fibroblast heterogeneity in the primary cell culture from which they were derived.

Fig. 5.

A) Total number of DEGs compared to the background model (pHPF). B) GO term Enrichment analysis of up regulated genes in each cell line (columns). See SI File Fig S1 for the GO term Enrichment analysis on down regulated genes.

Next, we applied our algorithm to the DEG profiles from each cell line to quantify similarity in transcriptional gene regulation. Fig. 6 shows top inferred regulators in each cell line (left panel bar plots), along with their RNA expression level (middle panel bar plots), and the corresponding enrichment of the differentially expressed target genes (right panel bar plots). The enrichment analysis was performed by quantifying the overlap between the targets of the TF and DEGs using Fisher’s exact test. This analysis was performed to compare enrichment-based methods with our approach. Enrichment-based approaches do not take the global topology of the TF-gene interaction network into consideration and yield results that are based purely on the local overlap of TF targets and the set of DEGs. In Fig 6 we observe that many TFs inferred by our method have low enrichment scores.
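To make the enrichment baseline concrete, the following minimal sketch computes a one-sided Fisher's exact p-value for the overlap between a TF's targets and the set of DEGs. It is an illustration only: the function name and arguments are hypothetical and are not taken from the authors' packages, and the one-sided (over-representation) form is an assumption.

```python
from math import comb

def fisher_enrichment_pvalue(n_universe, n_targets, n_degs, n_overlap):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of observing at least `n_overlap` DEGs among a TF's
    `n_targets` targets if the `n_degs` DEGs were drawn at random from
    a universe of `n_universe` genes.
    """
    max_overlap = min(n_targets, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_targets, k) * comb(n_universe - n_targets, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 20 genes, a TF with 5 targets, 5 DEGs, all 5 overlap.
p_value = fisher_enrichment_pvalue(20, 5, 5, 5)
```

Because this score depends only on the overlap counts for a single TF, it cannot account for targets that are shared with other regulators, which is the limitation the paragraph above describes.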

Fig 6.

Active TF inference for each cluster of immortalized cells. Gene markers were identified by comparison with pHPF and fed into the inference algorithm. In each data set, the left panel bar plots show the inferred probability of regulator activity by the algorithm. Vertical lines mark the 0.2, 0.5, and 0.8 probability thresholds. The middle panel bar plots show the enrichment analysis of DE targets (Fisher’s exact test) for comparison. Significant p-values (<0.05) are highlighted in pink. The right panel bar plots show mean RNA expression across single cells. TFs with expression above the 25th percentile are highlighted in red. TFs with inference posterior probability p>0.2, 0.5, and 0.8 that also show significant enrichment and RNA expression are highlighted in yellow, blue, and green respectively. TFs with posterior probability p>0.2 but an expression level below the 25th percentile are highlighted in gray. TFs with p<0.2 are shown with a dotted pattern. The bottom panel shows active TFs inferred in each cell line (rows) by the algorithm. N1 and SFT1 show a similar pattern of TF activity.

Among the TF regulators identified, HAND2 was shared across 2 cell lines. The protein encoded by this gene belongs to the basic helix-loop-helix family of transcription factors and, among many other development-related functions, is required for vascular development and regulation of angiogenesis, possibly through a VEGF signaling pathway (Yamagishi, Olson, & Srivastava, 2000).

The N1 and SFT1 cell lines shared expression of CEBPD, GTF2F1, and MXI1. CEBPD is an intron-less gene that encodes a bZIP transcription factor which can bind as a homodimer to certain DNA regulatory regions. It can also form heterodimers with the related protein CEBP-alpha. The encoded protein is important in the regulation of genes involved in immune and inflammatory responses and may be involved in the regulation of genes associated with activation and/or differentiation of macrophages. It may also be involved in the early stages of adipogenesis (Balamurugan & Sterneck, 2013; Hishida, Nishizuka, Osada, & Imagawa, 2009).

GTF2F1 encodes TFIIF, a general transcription initiation factor that binds to RNA polymerase II and helps to recruit it to the initiation complex in collaboration with TFIIB. It is also a JNK1/3-binding partner and may modulate c-JUN-mediated MAPK signaling in cell proliferation, differentiation, migration, senescence, and apoptosis (Sun, et al., 2015). MXI1 encodes a basic helix-loop-helix protein that inhibits the transcriptional activity of MYC by sequestering MAX, thus preventing the formation of MYC-MAX heterodimers, and by competing with MYC-MAX heterodimers for binding to target sites (Zervos, Gyuris, & Brent, 1993).

The iHPF_A and iHPF_B cell lines shared expression of the TF regulators PPARG and TCF21. PPARG encodes a member of the peroxisome proliferator-activated receptor (PPAR) subfamily of nuclear receptors. PPARs form heterodimers with retinoid X receptors (RXRs) and these heterodimers regulate transcription of various genes that regulate adipocyte differentiation and, pathologically, the development or progression of obesity, diabetes, atherosclerosis and cancer (Rosen, 2005; Cataldi, Costa, Ciccodicola, & Aprile, 2021). TCF21 encodes a transcription factor of the basic helix-loop-helix family. The TCF21 product is mesoderm specific, and expressed in embryonic epicardium, mesenchyme-derived tissues of lung, gut, gonad, and both mesenchymal and glomerular epithelial cells in the kidney. It is involved in the differentiation of mesenchymal cells to fibroblasts (Lighthouse & Small, 2016).

Of note, many of these transcriptional regulators are basic helix-loop-helix TFs, and three TF regulators in particular (CEBPD, TCF21, and HAND2) have been identified as promoters of mesenchymal cell differentiation towards the fibroblast lineage (as opposed to the smooth muscle cell lineage). This suggests that the immortalized fibroblast cell lines express TF regulators that function to maintain the fibroblast phenotype as well as those that may extend this phenotype towards that of immune/inflammatory cells (CEBPD), adipocytes (CEBPD, PPARG) or vascular cells (HAND2). Together, these observations suggest that fibroblast phenotypic plasticity is perhaps a common rather than exceptional cellular state that may be identified by the expression of particular TF regulators.

Discussion

In this work, we presented an algorithm for inference of TF activity from differential gene expression profiles and causal graphs. The algorithm incorporates transcriptional logic in the context of Bayesian networks, allowing for probabilistic deviation from deterministic logic rules. The probabilistic framework provides the flexibility for ‘plug-and-play’ integration of various logic models. In this study, we focused on one such model (OR-NOR logic). As a future direction we plan to extend the packages so users can choose the logic prior to running the inference.

The queries are run on causal graphs of TF-gene interactions. We provide several options for such graphs assembled from small-scale curated databases (Kolchanov, et al., 2002; Han, et al., 2015), large-scale public databases (Kanehisa & Goto, 2000; Jensen, et al., 2009; Cerami, et al., 2011), as well as de novo reconstructed graphs from high-throughput experiments (Farahmand, O’Connor, Macoska, & Zarringhalam, 2019). We note that the quality and the coverage of the causal graph have a major impact on the performance of regulator activity inference models. Most curated publicly available networks of transcriptional regulation with annotation on mode of regulation are small and very limited in their coverage, while other higher-coverage networks may consist of noisy inferred interactions. Unlike standard enrichment analysis methods, our framework has been designed to account for noise (applicability of interactions and noise in direction of regulation).

Bayesian Networks are Directed Acyclic Graphs (DAGs) and as such, feedback loops cannot be directly modeled in this context, which is a limitation of this approach. Another limitation of our approach is that it is designed to detect TF activation, but not TF deactivation. Moreover, since we only consider the OR-NOR transcriptional regulatory logic, results produced by this approach may miss TFs with alternative regulatory relationships. Since the approach is Bayesian and takes the entire topology of the network into account, by design it outputs a minimal number of TFs whose activation can explain the gene expression data. This is an advantage of our algorithm over enrichment analysis methods, whose results typically contain a large proportion of false positives. The disadvantage may be that sometimes not all true positives get high posterior probabilities, especially if there are many active regulators present.

For the inference process, we utilized Gibbs sampling, an MCMC algorithm that is widely used in Bayesian networks. A drawback of MCMC methods in Bayesian networks is the convergence time. We implemented the core of the inference in C++ to reduce the wait time. Other strategies can be taken to speed up processing time. For instance, an enrichment-based test can be run a priori to exclude TFs with insufficient differentially expressed targets. This will result in a significant speed-up in convergence time, although some borderline cases may be lost.

Our tool is an exploratory discovery tool that provides a narrow list of potentially relevant TFs, summarizing the observed differential gene expression data. This is similar to standard GO term and pathway enrichment analyses, which are also typically applied to summarize differential gene expression data. The focus of our tool is transcriptional regulation, and our algorithm can be used as a complementary tool in conjunction with enrichment analysis methods.

To increase the utility of our algorithm, we provide user-friendly R and Python packages as well as a web-based platform with integrated interactive visualization. The pre-processing steps for speeding up the algorithm are implemented as default in the webserver. As databases of causal transcriptional regulatory interactions become more available, we will integrate them in the web-platform and accordingly optimize the inference algorithm for each network.

Materials and Methods

Noisy Logic-based gene regulation graphical model

As a starting point, we construct a causal graph from the TF-gene interaction network as follows. The causal graph is a triplet (G,E,S), where G represents the nodes, E represents the edges (pairs of nodes), and S represents the signs associated with each edge (+, −). Figure 7 shows a graphical representation of the proposed model. The nodes in the graph consist of the following layers:

Fig 7.

The proposed graphical model for each interaction ij.

  • Transcript nodes $N=Y_1,\dots,Y_m$: These are the terminal nodes in the network and represent the transcripts. The domain of these nodes is $\mathcal{D}(Y)=\{(-),(0),(+)\}$, representing downregulated, not regulated, and upregulated respectively. The value of these nodes will be populated from the input gene expression data.

  • True state nodes $H=H_1,\dots,H_m$: These nodes represent the true unobserved state of the transcript nodes. This is done to account for noise in input data. These nodes have domain $\mathcal{D}(H)=\mathcal{D}(Y)$ and are central in the implementation of noisy logic gates.

  • Regulator state nodes $X=X_1,\dots,X_n$: These nodes represent the activation state of TFs in the network. Here we use $\mathcal{D}(X)=\{(0),(+)\}$, for no activation or activation respectively.

  • TF activity noise nodes $\theta=\theta_1,\dots,\theta_n$: For each $X_i$ in the network, we assign a node $\theta_i$, representing a continuous random variable with domain $\mathcal{D}(\theta_i)=[0,1]$. These nodes represent the probability of activation for the corresponding node $X_i$ and are modeled by a beta distribution.

  • Mode of regulation $S=S_{11},\dots,S_{1m},\dots,S_{ij},\dots,S_{nm}$: These nodes represent the mode of action (activation vs. repression) between parent TF node $X_i$ and target transcript node $H_j$. These nodes have a domain $\mathcal{D}(S)=\{(I),(NA),(A)\}$, representing inhibition (I), non-applicable (NA) and activation (A) respectively. We use one-hot encoding for this variable, i.e., $S_{ij}$ is a vector of size 3 with components $S_{ij}^{(I)},S_{ij}^{(NA)},S_{ij}^{(A)}\in\{0,1\}$.

Transcriptional Logic

In our model, we incorporate logic gates as in (Bulashevska & Eils, 2005) to explicitly account for combinatorial effects using Boolean logic while accounting for uncertainty. In this work, we consider a combination of noisy OR and NOR gates. We consider two models as follows.

OR model

This model is used to describe the likelihood of downregulation of a gene by a set of TFs. In this model, presence of one active inhibitor is sufficient to downregulate the gene. The probability mass function is modeled as a Bernoulli trial with probability of success

$$P(H_j=(-)\mid\Theta)=1-\prod_{i=1}^{n}\left(1-\theta_i\right)^{x_i s_{ij}^{(I)}},$$

where $\Theta=\{X,\theta,S\}$ denotes all model parameters involving the H nodes. Although this model seems like a sensible choice, it assumes that all the genes targeted by a TF should strictly follow its influence. However, target regulation depends on many other factors, and we should expect only a fraction of targets to be effectively regulated by a given TF. To make our model more flexible, we incorporate a hyper-parameter $\xi$ that allows the likelihood model to tolerate more zero-genes in the evidence data. The OR model above is now

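$$P(H_j=(-)\mid\Theta)=\left[1-\prod_{i=1}^{n}\left[(1-\xi)(1-\theta_i)\right]^{x_i s_{ij}^{(I)}}\right]\left(1-\xi^{n}\right)+\xi^{n}q_{-},$$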

where $q_{-}$ represents a prior probability of finding a downregulated gene in the evidence data. For convenience, let us define $\zeta=(1-\xi)$, and let $\zeta_{deg}$ and $\zeta_{nondeg}$ denote two different values for the $\zeta$ parameter that are set depending on whether the current gene has been observed as modulated or not, respectively. The value of $\zeta$ is set close to, but greater than, zero for non-differentially expressed genes, e.g. $0<\zeta_{nondeg}\le 0.1$, while for differentially expressed genes it may be set close or equal to one. Additionally, an extra $(1-\xi)$ term now multiplies $(1-\theta_i)$, effectively increasing the sensitivity of these TF-gene interactions. This has proven beneficial to improve specificity of the inference results. We set $\zeta_{deg}=0.99$, and $\zeta_{nondeg}$ is set to be proportional to $N_{edges}^{deg}$, i.e., the number of edges in the network that point to genes that are observed as differentially expressed. More specifically, we have used the following relation:

$$\zeta_{nondeg}=\frac{1}{10}\,\frac{N_{edges}^{deg}}{N_{edges}},$$

where $N_{edges}$ is the total number of edges in the network.
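As an illustration of the OR model, the sketch below evaluates the basic noisy-OR probability of downregulation (without the $\xi$ correction) and the relation used for $\zeta_{nondeg}$. The function names and argument layout are hypothetical and chosen for readability; they do not reflect the authors' C++ implementation.

```python
def noisy_or_downregulation(theta, x, s_inh):
    """Basic noisy-OR: P(H_j = (-)) = 1 - prod_i (1 - theta_i)^(x_i * s_ij^(I)).

    theta : activation probabilities theta_i, one per TF
    x     : TF states x_i (0 = inactive, 1 = active)
    s_inh : indicators s_ij^(I) (1 if TF i inhibits gene j, else 0)
    """
    prob_no_inhibition = 1.0
    for theta_i, x_i, s_i in zip(theta, x, s_inh):
        if x_i and s_i:  # only active inhibitors of this gene contribute
            prob_no_inhibition *= 1.0 - theta_i
    return 1.0 - prob_no_inhibition


def zeta_nondeg(n_edges_deg, n_edges):
    """zeta for non-DE genes: one tenth of the fraction of edges pointing to DEGs."""
    if n_edges <= 0:
        raise ValueError("the network must contain at least one edge")
    return 0.1 * n_edges_deg / n_edges


# Two active inhibitors (theta = 0.8 and 0.5); the third TF is inactive.
p_down = noisy_or_downregulation([0.8, 0.5, 0.9], [1, 1, 0], [1, 1, 1])
```

With these inputs the probability of no inhibition is 0.2 × 0.5 = 0.1, so the gene is downregulated with probability 0.9. Since at most all edges can point to DEGs, `zeta_nondeg` never exceeds 0.1.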

OR-NOR model for gene activation

This model offers a relatively simple way of describing combinatorial effects of both up- and down-regulation within the same interaction network. The rationale is that for a target gene to be activated, at least one of its upstream activators must be activated (OR gate), while at the same time none of its inhibitors is active (NOR gate). In this case the target gene up-regulation event is modeled as a Bernoulli trial with probability of success

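$$P(H_j=(+)\mid\Theta)=\left[1-\prod_{i=1}^{n}\left[(1-\xi)(1-\theta_i)\right]^{x_i s_{ij}^{(A)}}\right]\left[\prod_{i=1}^{n}\left[(1-\xi)(1-\theta_i)\right]^{x_i s_{ij}^{(I)}}\right]\left(1-\xi^{n}\right)+\xi^{n}q_{+},$$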

where $q_{+}$ is the prior probability of finding an upregulated gene in the evidence data. Target gene activation state is regarded as a multinomial trial with three possible outcomes: upregulation, downregulation, or not changed. This is congruent with discretized differential expression data and allows building a complete model likelihood. Correspondingly, the complementary outcome likelihood is then represented by a NOR-NOR model:

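$$P(H_j=(0)\mid\Theta)=\left[\prod_{i=1}^{n}\left[(1-\xi)(1-\theta_i)\right]^{x_i s_{ij}^{(A)}}\right]\left[\prod_{i=1}^{n}\left[(1-\xi)(1-\theta_i)\right]^{x_i s_{ij}^{(I)}}\right]\left(1-\xi^{n}\right)+\xi^{n}q_{0},$$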

Here $q_{0}$ is the prior probability of finding a non-differentially expressed gene in the observed data.
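The three outcome probabilities can be sketched together as follows. This is a hedged reading of the formulas, not the authors' code: the weight given to the prior probabilities is passed in directly as a hypothetical argument `mix` (standing in for the $\xi$-dependent mixing term), `zeta` stands for $(1-\xi)$, and all names are illustrative. Under this reading the three probabilities sum to one whenever the priors $q_{-}$, $q_{0}$, $q_{+}$ do.

```python
def outcome_probabilities(theta, x, s_act, s_inh, zeta, mix, q):
    """Return (P(H=-), P(H=0), P(H=+)) for one target gene.

    theta, x     : per-TF activation probabilities and states (0/1)
    s_act, s_inh : per-TF indicators for activating / inhibiting edges
    zeta         : the (1 - xi) factor multiplying (1 - theta_i)
    mix          : weight on the prior outcome probabilities (assumption)
    q            : (q_minus, q_zero, q_plus), assumed to sum to one
    """
    prod_act = 1.0  # probability that no active activator takes effect
    prod_inh = 1.0  # probability that no active inhibitor takes effect
    for theta_i, x_i, a_i, i_i in zip(theta, x, s_act, s_inh):
        factor = zeta * (1.0 - theta_i)
        if x_i and a_i:
            prod_act *= factor
        if x_i and i_i:
            prod_inh *= factor
    q_minus, q_zero, q_plus = q
    p_minus = (1.0 - prod_inh) * (1.0 - mix) + mix * q_minus           # OR
    p_plus = (1.0 - prod_act) * prod_inh * (1.0 - mix) + mix * q_plus  # OR-NOR
    p_zero = prod_act * prod_inh * (1.0 - mix) + mix * q_zero          # NOR-NOR
    return p_minus, p_zero, p_plus


# One active activator, one inactive inhibitor, hypothetical priors.
probs = outcome_probabilities(
    theta=[0.9, 0.8], x=[1, 0], s_act=[1, 0], s_inh=[0, 1],
    zeta=0.99, mix=0.1, q=(0.05, 0.90, 0.05),
)
```

The normalization follows from the algebra of the three terms alone, so it holds regardless of how the per-TF factors are defined.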

The model likelihood

The posterior probability of model parameters given the observed data is given by:

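$$P(\Theta\mid Y_j=y_j)=\frac{P(Y_j=y_j\mid\Theta)\,P(\Theta)}{P(Y_j=y_j)}=\frac{\sum_{h}P(Y_j=y_j\mid H_j=h)\,P(H_j=h\mid X,\theta,S_j)\,P(X)\,P(\theta)\,P(S_j)}{\sum_{h}P(Y_j=y_j\mid H_j=h)\,P(H_j=h)},$$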

where $P(Y\mid H)$ is the conditional probability of the observed expression value Y given the true state. This conditional probability models the true positive and false positive rate in input gene expression data. Table 3 shows the values used in our implementation. These values are estimations based on typical experimental errors. This component of the model may encompass several sources of uncertainty, such as dropped reads during RNA sequencing or type I errors in the statistical analysis made for computation of differential gene expression. This may be treated as a prior probability, representing our belief that we would observe these rates.

Table 3.

Conditional probability of the observed state Y given the true state H, encoding the assumed false positive and false negative rates.

$P(Y=\cdot\mid H=\cdot)$ $H=(-)$ $H=(0)$ $H=(+)$
$Y=(-)$ 0.945 0.050 0.005
$Y=(0)$ 0.050 0.900 0.050
$Y=(+)$ 0.005 0.050 0.945
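The role of Table 3 can be illustrated with a small sketch that combines the table with a prior over the true state to obtain $P(H\mid Y)$ by Bayes' rule. The table values are those given above; the prior over H used in the example is hypothetical, and the function name is illustrative.

```python
# P(Y = y | H = h) from Table 3, keyed by observed state y, then true state h.
P_Y_GIVEN_H = {
    "-": {"-": 0.945, "0": 0.050, "+": 0.005},
    "0": {"-": 0.050, "0": 0.900, "+": 0.050},
    "+": {"-": 0.005, "0": 0.050, "+": 0.945},
}


def posterior_true_state(y_obs, prior_h):
    """Bayes' rule: P(H = h | Y = y) is proportional to P(Y = y | H = h) P(H = h)."""
    unnormalized = {h: P_Y_GIVEN_H[y_obs][h] * p for h, p in prior_h.items()}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical prior: most genes are truly unchanged.
prior_h = {"-": 0.05, "0": 0.90, "+": 0.05}
posterior = posterior_true_state("+", prior_h)
```

With this prior, an observed upregulation corresponds to a truly upregulated gene with probability of roughly 0.51, which shows why a hidden true-state layer is useful when most genes are expected to be unchanged.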

Fitting the model

The next task is to find the set of model parameters that maximize the model likelihood. Given the large scale of the parameter space, this problem is intractable analytically. A widely used approach for inferring posterior probability in large scale Bayesian networks is Markov Chain Monte Carlo (MCMC) sampling. In particular, Gibbs sampling is a suitable MCMC method to approximate the posterior distribution of the model parameters given the observed data. In Gibbs sampling, we sequentially sample from each random variable, conditioned on the current state of its Markov blanket. To assess the convergence to the posterior distribution of the model parameters, we run at least three independent sampling chains and periodically compute the Gelman-Rubin statistic for each random variable (Gelman & Rubin, 1992). We stop the process after this diagnostic statistic is less than 1.1 for every random variable in the model. The core of the sampling algorithm has been implemented in C++, and user-friendly R and Python packages have been developed.

Algorithm

In this section we present Algorithm 1, which was used throughout this work, along with a detailed description of each of the steps.

Algorithm 1.

Sample model’s posterior distribution

1. Preprocess input data.
2. Populate the causal graph with observed evidence.
3. Generate N replicas of the graph and initialize random variables by sampling from their prior distributions.
4. Run the Gibbs sampling procedure:
for each graph replica do
  for k iterations do
   for each random variable in the model do
    4.1. Sample a new value given the current state of its Markov blanket.
    4.2. Add the new value to this random variable’s Markov chain.
   end for.
  end for.
end for.
5. Compute Gelman-Rubin statistics for the Markov chains and get the maximum value among all random variables.
6. If the maximum Gelman-Rubin statistic is greater than 1.1, go back to step 4.
7. Return posterior mean values for every random variable.

Step 1: Data preprocessing. The input differential gene expression data should contain the computed p-values and fold-change scores for all genes. In this step, we select thresholds for p-values and fold change. The aim is to limit the input data to the most significant DEGs, at most 800 in total as a rule of thumb.
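A minimal sketch of this preprocessing step is shown below. The default cutoffs, the tuple layout of the input records, and the choice to rank genes by adjusted p-value are assumptions made for illustration; only the cap of 800 genes comes from the text.

```python
def select_degs(records, p_cutoff=0.01, lfc_cutoff=1.0, max_genes=800):
    """Discretize a DGE table into evidence: gene -> +1 (up) or -1 (down).

    records : iterable of (gene, adjusted_p_value, log2_fold_change)
    Genes failing either threshold are left out (treated as unchanged).
    """
    passing = [
        record for record in records
        if record[1] <= p_cutoff and abs(record[2]) >= lfc_cutoff
    ]
    passing.sort(key=lambda record: record[1])  # most significant first
    return {
        gene: (1 if lfc > 0 else -1)
        for gene, _, lfc in passing[:max_genes]
    }


# Hypothetical input table.
records = [
    ("GENE_A", 0.001, 2.0),
    ("GENE_B", 0.200, 3.0),    # fails the p-value cutoff
    ("GENE_C", 0.005, -1.5),
    ("GENE_D", 0.0001, 0.2),   # fails the fold-change cutoff
]
evidence = select_degs(records)
```

In practice the thresholds would be tightened until the number of selected genes falls under the cap, rather than relying on truncation alone.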

Step 2: Populate the causal graph. Here we assign the observed values (evidence) for the differential expression of each gene to the corresponding Y nodes in the graph. These values will remain fixed during the sampling process.

Step 3: We create N independent copies of the graph, allowing us to store separate states of N Markov chains to sample. The use of multiple Markov chains provides a way to compare independent results and enables early stopping. Ideally, all N chains should converge to the same posterior distribution for all random variables in the model.

Step 4: Gibbs sampling. For each random variable we retrieve the current state of its Markov blanket, which is given by the current values of its children nodes, its parent nodes, and the other parents of its children nodes. Then, given this Markov blanket, we compute the conditional probability of the random variable, from which a new sample is drawn. The sampled value is stored, and we move to the next random variable and repeat the process. After completing the sampling process for all variables, we start over from the first random variable. This process is repeated for each of the N graph replicas created in Step 3. To assess convergence of the posterior distribution, we pause the sampling process after completing k rounds to compute the Gelman-Rubin statistics described in the next step of the algorithm. The choice of k, the period in which convergence is checked, is arbitrary and has been set to k=20 in our implementation.
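The sampling loop can be illustrated with a deliberately simplified toy model. The sketch below is not the model of this paper: it assumes binary gene states (DE or not), a single fixed interaction strength instead of the $\theta$ and S nodes, no hidden H layer, and a hypothetical leak probability. It only shows how each TF state is resampled from its conditional distribution given its Markov blanket (its target genes and the other TFs regulating those genes).

```python
import random


def gibbs_tf_activity(targets, observed, prior=0.1, strength=0.9,
                      leak=0.05, n_sweeps=1000, burn_in=200, seed=0):
    """Toy Gibbs sampler for TF activity under a binary noisy-OR model.

    targets  : dict TF -> set of target genes
    observed : dict gene -> 1 if differentially expressed, else 0
    Returns the estimated posterior probability that each TF is active.
    """
    rng = random.Random(seed)
    tfs = sorted(targets)
    parents = {g: [tf for tf in tfs if g in targets[tf]] for g in observed}
    state = {tf: int(rng.random() < prior) for tf in tfs}
    counts = {tf: 0 for tf in tfs}

    def gene_likelihood(gene):
        # Noisy-OR with leak: a gene stays unchanged only if the leak and
        # every active parent all fail to act on it.
        p_unchanged = 1.0 - leak
        for tf in parents[gene]:
            if state[tf]:
                p_unchanged *= 1.0 - strength
        return 1.0 - p_unchanged if observed[gene] else p_unchanged

    for sweep in range(n_sweeps):
        for tf in tfs:
            weight = {}
            for value in (0, 1):
                state[tf] = value
                w = prior if value else 1.0 - prior
                for gene in targets[tf]:  # children; co-parents enter via state
                    w *= gene_likelihood(gene)
                weight[value] = w
            p_active = weight[1] / (weight[0] + weight[1])
            state[tf] = int(rng.random() < p_active)
        if sweep >= burn_in:
            for tf in tfs:
                counts[tf] += state[tf]
    return {tf: counts[tf] / (n_sweeps - burn_in) for tf in tfs}


targets = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g5", "g6", "g7", "g8"},
    "TF_C": {"g1", "g5"},
}
observed = {"g1": 1, "g2": 1, "g3": 1, "g4": 1,
            "g5": 0, "g6": 0, "g7": 0, "g8": 0}
posterior = gibbs_tf_activity(targets, observed)
```

In this toy example the TF whose targets are all differentially expressed receives a posterior close to one, while the TF with no differentially expressed targets receives a posterior close to zero.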

Step 5: Maximum Gelman-Rubin statistic computation. During the Markov chains sampling process, we need to determine whether the sampled distributions for the random variables in the model have converged to their respective posterior distributions. For this purpose, we compare the N Markov chain distributions for every single variable in the model by means of the Gelman-Rubin statistic R, which is a dimensionless score that combines the between chain and within chain variances of the random variables. The statistic approaches R=1 as the sampled distributions converge to a steady state. This score is computed for all random variables and we keep the maximum value obtained.

Step 6: Convergence assessment. Convergence of the sampling process is called when max(R)<1.1 for every random variable. If this condition has not been satisfied, we continue the sampling process in Step 4.
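Steps 5 and 6 can be sketched as follows, using the standard form of the Gelman-Rubin statistic (between-chain variance B, within-chain variance W). The function names and the dictionary layout of the chains are illustrative assumptions, and the handling of zero within-chain variance is a convenience choice for this sketch.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Gelman-Rubin statistic R for one variable.

    chains : list of equal-length sample lists (at least 2 chains, 2 samples each)
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W
    between = n * variance(chain_means)                   # B
    if within == 0:
        return 1.0 if between == 0 else float("inf")
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


def has_converged(chains_by_variable, threshold=1.1):
    """Convergence is called when the maximum R over all variables is below the threshold."""
    return max(gelman_rubin(c) for c in chains_by_variable.values()) < threshold


# Two chains that agree, and two chains stuck in different regions.
agreeing = [[0, 1, 0, 1], [1, 0, 1, 0]]
disagreeing = [[0, 1, 0, 1], [10, 11, 10, 11]]
```

Taking the maximum over variables means that a single poorly mixed variable is enough to keep the sampler running.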

Step 7: At this point the sampling process is complete, and the mean value for each random variable is returned.

Simulations

To assess the impact of noise on model performance, we simulated interaction networks and input differential gene expression data. First, a random interaction network was generated by selecting a total number of NX TFs and a total number of NY genes in the network. For each TF, its number of target genes was sampled according to a negative binomial distribution. Target genes were assigned randomly using a uniform distribution among all genes in the network. Finally, for each interaction edge a sign of regulation was assigned with 0.65 probability of being activation (+) and 0.35 probability of being repression (−). Subsequently, differential gene expression data was simulated as follows. First, we select a random set of TFs that are assigned as active. This set of TFs constitutes the ground truth. For each active TF, we select 10% of its target genes and assign differential expression according to the sign of regulation as predicted by the graph: +1 if the target is upregulated by the TF, −1 if downregulated. For genes targeted by multiple active TFs, we perform the algebraic sum of all incoming interactions and take the net sign.
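The simulation procedure can be sketched as below. Only the 0.65/0.35 sign probabilities and the 10% fraction of regulated targets come from the text; the network sizes, the negative binomial parameters, the minimum of one target per TF, and the treatment of a zero net sum as "not changed" are assumptions made for this illustration.

```python
import random


def negative_binomial(rng, r, p):
    """Number of failures before the r-th success in Bernoulli(p) trials."""
    failures = successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def simulate(n_tfs=20, n_genes=200, n_active=3, r=2, p=0.1,
             frac_regulated=0.1, seed=0):
    """Return (network, active_tfs, evidence) for one simulated experiment."""
    rng = random.Random(seed)
    genes = list(range(n_genes))
    network = {}  # TF -> {gene: +1 (activation) or -1 (repression)}
    for tf in range(n_tfs):
        n_targets = min(n_genes, max(1, negative_binomial(rng, r, p)))
        network[tf] = {
            gene: (1 if rng.random() < 0.65 else -1)
            for gene in rng.sample(genes, n_targets)
        }
    active = rng.sample(range(n_tfs), n_active)  # ground truth
    net_effect = {}
    for tf in active:
        tf_targets = sorted(network[tf])
        n_regulated = max(1, round(frac_regulated * len(tf_targets)))
        for gene in rng.sample(tf_targets, n_regulated):
            net_effect[gene] = net_effect.get(gene, 0) + network[tf][gene]
    # Net sign of all incoming interactions; a zero sum is left as unchanged.
    evidence = {g: (1 if s > 0 else -1) for g, s in net_effect.items() if s != 0}
    return network, set(active), evidence


network, active_tfs, evidence = simulate()
```

The returned `evidence` dictionary has the same shape as the discretized input expected by the inference step, so the ground-truth set `active_tfs` can be compared against the inferred posteriors.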

TF-gene interaction network

The TF-gene interaction network was obtained from (Farahmand, O’Connor, Macoska, & Zarringhalam, 2019), in which interaction networks were constructed from direct experimental evidence, integrating data from ChIP-Atlas (Wang, et al., 2018) and The Genotype-Tissue Expression (GTEx) databases (Oki, et al., 2018; Consortium, et al., 2017). Integration was achieved through a regularized Gaussian Graphical model that softly integrated TF-gene interactions derived from ChIP-Seq data into gene expression derived from tissues, resulting in 15 tissue-specific TF-gene interaction networks as well as a “merged network” obtained by overlapping tissue-specific networks. In the present work we use the merged network containing interactions that are consistent in at least three tissue types, resulting in 338680 TF-gene interactions from 750 TF molecules.

Differential gene expression data

DNA microarray-based gene expression profiles for over-expression studies in human primary mammary epithelial cell cultures (Bild, et al., 2006) were obtained from the GEO repository (GSE3151) by using the GEO2R tool for sample selection (Barrett, et al., 2012). Differential expression was computed using the limma R package (Smyth, 2005). We limited the number of differentially expressed genes by applying cutoff thresholds for the adjusted p-values (cutoff 0.01) and log2 fold change (cutoff 1) for the E2F3 and MYC datasets. For the RAS dataset these cutoff values produced 2226 differentially expressed genes. To further limit the number of DEGs in this experiment, we increased the log2 fold change cutoff threshold to 2.

Additionally, we used data from a study on fibroblast-to-myofibroblast phenotypic conversion in response to the pro-fibrotic signaling molecules TGFβ and CXCL12 (Patalano, et al., 2018; Gharaee-Kermani, et al., 2012; Rodríguez-Nieves, Patalano, Almanza, Gharaee-Kermani, & Macoska, 2016). Differential expression profiles for TGFβ and CXCL12 treated fibroblasts were generated using the R package edgeR (Robinson, McCarthy, & Smyth, 2009).

Single cell RNA sequencing

Following trypsin digestion, cells were collected in a 50 mL conical tube, washed, and resuspended in PBS to a final concentration of ~700 cells/µL. After gentle resuspension, 2.3 µL (~1,610 cells) per sample was combined with nuclease-free water and master mix per 10X recommendations for a targeted recovery of ~1000 cells per sample.

The samples were loaded onto a 10X Chromium Chip A (PN230027, deprecated) (N1 and SFT1) or Chip G (PN 2000177) (pHPF and iHPF) and run on the 10X Chromium Controller instrument following the manufacturer's protocols for RNA recovery and library preparation. Briefly, cells were partitioned into individual lipid droplets, lysed to release RNA and tagged with UMIs (unique molecular identifiers) for cell of origin. mRNA was isolated with dT oligo beads, then reverse transcribed to DNA, ligated with Illumina-compatible sequencing adapters with multiplex-capable barcodes (PN-120262 for N1/SFT1 (deprecated), PN-1000213 for pHPF/iHPF) for sample of origin, and PCR amplified for 13 cycles. N1 and SFT1 samples were prepared with the Chromium v2 Library and gel bead kit (PN-120267, deprecated), while pHPF and iHPF were prepared with v3 (PN 1000128).

Following library preparation, samples were assessed by Agilent 2100 Bioanalyzer using High Sensitivity chips and reagents (PN 5067–4626) to confirm a normal size distribution to minimize bias, quantified by qPCR with Illumina adapter compatible primers and Sybr Green (Kapa ROX Low Universal Library Quant kit PN KK4873) and molarity calculated by size-correcting to the bioanalyzer average size. N1 and SFT1 samples were pooled together while pHPF and iHPF were pooled together separately.

The samples were sequenced on a HiSeq 2500 in Rapid Run mode in the CPCT Genomics Core, using paired-end on-board clustering (PN PE-402–4002) and sequencing by synthesis (SBS) reagents (PN FC-402–4021). Twenty-eight bases were sequenced in Read 1 to capture the UMIs, 8 bases for the single indexes, and 91 bases in Read 2 to capture transcripts, yielding ~100 million total sequencing reads per sample.

RNA-Seq alignment

Reads were aligned to human genome version GRCh38 using the 10X cellranger v4.0.0 pipeline (cellranger mkref and cellranger count) with default parameters. Over 93% of reads were mapped to the genome for all 4 datasets. Mapping rates were 93.6%, 93.3%, 96.3% and 96.3% for N1, SFT1, iHPF and pHPF, respectively. To improve the quality of the data, the resulting count matrices were reanalyzed, forcing the number of accepted cells so that only those with the highest UMI counts were retained.

Single Cell Data Analysis

Count data from single cell alignment were processed using the Seurat R package (Stuart, et al., 2019; Hafemeister & Satija, 2019). Seurat objects were created from the filtered matrices resulting from the alignment step. Low quality cells were filtered out by removing those with high mitochondrial contamination and cells with either too few or too many unique genes or total RNA counts. Cell cycle scores were assigned to each cell with the method CellCycleScoring, using default parameters and the cell cycle genes provided by the Seurat package. Datasets N1, SFT1, pHPF and iHPF were combined by using Seurat’s merge function and normalized with SCTransform using 3000 variable features and no centering. Batch effect was removed by considering the number of genes, RNA counts and mitochondrial RNA contamination as unwanted sources of variation. Dimensionality reduction was performed through principal component analysis (PCA) and Uniform Manifold Approximation & Projection (UMAP) as implemented in the Seurat package, with functions RunPCA (30 PCs) and RunUMAP respectively. Cells were then filtered to work with G1 cells only. Differential expression was computed with respect to pHPF by using the FindMarkers method. All Seurat methods were used with default parameters, unless otherwise stated.

Prostate tissue single cell data were retrieved from (Henry, et al., 2018). This dataset corresponds to FACS-sorted cells and contains fibroblasts, smooth muscle, endothelial and epithelial cells. We ignored the FACS sorting labels, as we ran our own classification process. The dataset was preprocessed using Seurat and its SCTransform pipeline, in the same way as the cell line data. Cells were clustered using the FindNeighbors and FindClusters methods. Each cluster was classified by inspecting cell type markers taken from (Henry, et al., 2018) and assigned one of the labels: fibroblast, smooth muscle, endothelia, basal epithelia, luminal epithelia, and other epithelia 1 and 2. Cells in the G1 cell cycle phase were retained for downstream analysis. Table 4 shows the markers used for the tissue cell classification.

Table 4.

Gene markers used for classification of single cells in the prostate tissue single cell dataset by (Henry, et al., 2018). These are the same markers used in that study.

Tissue type          Gene markers
Basal Epithelia      KRT14, DST, KRT15, KRT5, RGCC
Luminal Epithelia    MSMB, KLK3, ACPP, PLA2G2A, KLK2
Other Epithelia 1    SCGB3A1, LCN2, PIGR, WFDC2, FCGBP
Other Epithelia 2    KRT13, APOBEC3A, CSTB, LYPD3, SERPINB1
Fibroblast           APOD, FBLN1, PTGDS, CFD, DCN
Smooth muscle        TPM2, ACTA2, RGS5, MT1A, MYH11
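One way to turn the Table 4 markers into cluster labels is sketched below in Python. The scoring rule (average the cluster-mean expression over each marker set and pick the highest-scoring type) and the function name `label_clusters` are assumptions made for illustration. The text only states that clusters were classified by inspecting these markers, and this sketch covers only the cell types listed in Table 4.

```python
import numpy as np

# Marker sets copied from Table 4
MARKERS = {
    "Basal Epithelia": ["KRT14", "DST", "KRT15", "KRT5", "RGCC"],
    "Luminal Epithelia": ["MSMB", "KLK3", "ACPP", "PLA2G2A", "KLK2"],
    "Other Epithelia 1": ["SCGB3A1", "LCN2", "PIGR", "WFDC2", "FCGBP"],
    "Other Epithelia 2": ["KRT13", "APOBEC3A", "CSTB", "LYPD3", "SERPINB1"],
    "Fibroblast": ["APOD", "FBLN1", "PTGDS", "CFD", "DCN"],
    "Smooth muscle": ["TPM2", "ACTA2", "RGS5", "MT1A", "MYH11"],
}

def label_clusters(cluster_means, genes):
    """Assign each cluster the cell type whose markers have the highest mean expression.

    cluster_means: dict mapping cluster id -> per-gene mean expression (aligned to genes).
    genes: list of gene symbols giving the order of the expression vectors.
    Hypothetical scoring rule; not the procedure reported in the paper.
    """
    index = {g: i for i, g in enumerate(genes)}
    labels = {}
    for cluster, means in cluster_means.items():
        means = np.asarray(means, dtype=float)
        scores = {}
        for cell_type, marker_genes in MARKERS.items():
            present = [index[g] for g in marker_genes if g in index]
            # Marker sets with no measured gene cannot win
            scores[cell_type] = means[present].mean() if present else float("-inf")
        labels[cluster] = max(scores, key=scores.get)
    return labels

# Toy example with six measured genes and three clusters
genes = ["KRT5", "KRT14", "DCN", "APOD", "ACTA2", "MYH11"]
cluster_means = {
    0: [5.0, 4.0, 0.1, 0.2, 0.3, 0.1],
    1: [0.1, 0.2, 6.0, 5.0, 1.0, 0.5],
    2: [0.2, 0.1, 1.0, 0.5, 7.0, 6.0],
}
labels = label_clusters(cluster_means, genes)
```

In practice a scoring rule of this kind would be checked against marker expression plots before the labels are accepted.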

Cell lines data and tissue data were combined by using Seurat’s integration pipeline with SCTransform, using 3000 features, k.anchor = 6, and reduction = ‘rpca’.

Supplementary Material

Supplement 1
media-1.pdf (1.1MB, pdf)
Supplement 2
media-2.xlsx (841KB, xlsx)

Acknowledgements

This work was supported by grants AI150090 (KZ), R01AI167570 (KZ), DK104310 (JAM), and CA156734 (JAM, KZ) from the National Institutes of Health. AA was supported in part by College of Science and Mathematics Dean’s Doctoral Research Fellowship through fellowship support from Oracle, project ID R20000000025727. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Footnotes

Competing interests

The authors declare no competing interests.

Data and software availability

“scRNAseq of Primary and Immortalized Human Prostate Fibroblast Cell Lines”

BioProject Accession number: PRJNA881605

Study Accession number: SRP397809

SRA Accession numbers: SRX17617080, SRX17617081, SRX17617082, SRX17617083

Use NCBI’s SRA toolkit to download the 4 datasets above. For further instructions, see: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload

We make our inference algorithm available to use through the following web application: https://umbibio.math.umb.edu/nlbayes

Open-source R and Python packages are available on GitHub: https://github.com/umbibio/nlbayes-r (doi:10.5281/zenodo.7105306) and https://github.com/umbibio/nlbayes-python (doi:10.5281/zenodo.7105233)

We have used R version 4.1.3 and Python 3.10 to develop and test the corresponding packages. Detailed instructions and examples are available in each corresponding repository.

Bibliography

  1. Akutsu T., Miyano S., & Kuhara S. (2000, 08). Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics, 16(8), pp. 727–734. [DOI] [PubMed] [Google Scholar]
  2. Alam P., Haile B., Arif M., Pandey R., Rokvic M., Nieman M., . . . Kanisicak O. (2019). Inhibition of Senescence‐Associated Genes Rb1 and Meis2 in Adult Cardiomyocytes Results in Cell Cycle Reentry and Cardiac Repair Post–Myocardial Infarction. Journal of the American Heart Association, 8(15), p. e012089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Allison S. (2021, may). A SOX-9-NAV3-YAP1 axis in kidney fibrosis. Nat. Rev. Nephrol., 17(5), p. 297. [DOI] [PubMed] [Google Scholar]
  4. Alvarez M. J., Shen Y., Giorgi F. M., Lachmann A., Ding B. B., Ye B. H., & Califano A. (2016, Jun 20). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics, 48, pp. 838–847. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Asif H. M., & Sanguinetti G. (2011, May). Large-scale learning of combinatorial transcriptional dynamics from gene expression. Bioinformatics, 27(9), pp. 1277–1283. [DOI] [PubMed] [Google Scholar]
  6. Astudillo P. (2022). An emergent Wnt5a/YAP/TAZ regulatory circuit and its possible role in cancer. Seminars in Cell & Developmental Biology, 125, pp. 45–54. [DOI] [PubMed] [Google Scholar]
  7. Balamurugan K., & Sterneck E. (2013). The Many Faces of C/EBPδ and their Relevance for Inflammation and Cancer. Int J Biol Sci, 9, pp. 917–933. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bar-Joseph Z., Gerber G. K., Lee T. I., Rinaldi N. J., Yoo J. Y., Robert F., . . . Gifford D. K. (2003, Nov). Computational discovery of gene modules and regulatory networks. Nat. Biotechnol., 21(11), pp. 1337–1342. [DOI] [PubMed] [Google Scholar]
  9. Barolo S., & Posakony J. W. (2002, May). Three habits of highly effective signaling pathways: principles of transcriptional control by developmental cell signaling. Genes Dev., 16(10), pp. 1167–1181. [DOI] [PubMed] [Google Scholar]
  10. Barrett T., Wilhite S. E., Ledoux P., Evangelista C., Kim I. F., Tomashevsky M., . . . Soboleva A. (2012, 11). NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research, 41(D1), pp. D991–D995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Begley L. A., Kasina S., MacDonald J., & Macoska J. A. (2008). The inflammatory microenvironment of the aging prostate facilitates cellular proliferation and hypertrophy. Cytokine, 43(2), pp. 194–199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Bhatlekar S., Addya S., Salunek M., Orr C. R., Surrey S., McKenzie S., . . . Boman B. M. (2014, aug). Identification of a Developmental Gene Expression Signature, Including HOX Genes, for the Normal Human Colonic Crypt Stem Cell Niche: Overexpression of the Signature Parallels Stem Cell Overpopulation During Colon Tumorigenesis. Stem Cells and Development, 23(2), pp. 167–179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Bild A. H., Yao G., Chang J. T., Wang Q., Potti A., Chasse D., . . . Nevins J. R. (2006, Jan). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439(7074), pp. 353–357. [DOI] [PubMed] [Google Scholar]
  14. Bonneau R., Reiss D. J., Shannon P., Facciotti M., Hood L., Baliga N. S., & Thorsson V. (2006, May). The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol., 7(5), pp. 1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Bulashevska S., & Eils R. (2005, 03). Inferring genetic regulatory logic from expression data. Bioinformatics, 21(11), pp. 2706–2713. [DOI] [PubMed] [Google Scholar]
  16. Casella G., & George E. I. (1992). Explaining the Gibbs Sampler. The American Statistician, 46(3), pp. 167–174. [Google Scholar]
  17. Cataldi S., Costa V., Ciccodicola A., & Aprile M. (2021, apr). PPARγ and Diabetes: Beyond the Genome and Towards Personalized Medicine. Curr. Diab. Rep., 21(6), pp. 18–15. [DOI] [PubMed] [Google Scholar]
  18. Cerami E. G., Gross B. E., Demir E., Rodchenkov I., Babur Ö., Anwar N., . . . Sander C. (2011, Jan). Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res., 39(suppl_1), pp. D685–D690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Chen H.-Z., Tsai S.-Y., & Leone G. (2009, nov). Emerging roles of E2Fs in cancer: an exit from cell cycle control. Nat. Rev. Cancer, 9(11), pp. 785–797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Chen M., Huang B., Zhu L., Chen K., Liu M., & Zhong C. (2020, oct). Structural and Functional Overview of TEAD4 in Cancer Biology. OTT, 13, pp. 9865–9874. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Chi T., Wang M., Wang X., Yang K., Xie F., Liao Z., & Wei P. (2021, sep). PPAR-γ Modulators as Current and Potential Cancer Treatments. Frontiers in Oncology, 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Chindelevitch L., Loh P.-R., Enayetallah A., Berger B., & Ziemek D. (2012, Dec). Assessing statistical significance in causal graphs. BMC Bioinf., 13(1), pp. 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Chindelevitch L., Ziemek D., Enayetallah A., Randhawa R., Sidders B., Brockel C., & Huang E. S. (2012, Apr). Causal reasoning on biological networks: interpreting transcriptional changes. Bioinformatics, 28(8), pp. 1114–1121. [DOI] [PubMed] [Google Scholar]
  24. Consortium G., Aguet F., Brown A. A., Castel S. E., Davis J. R., He Y., . . . Zhu J. (2017, Oct 11). Genetic effects on gene expression across human tissues. Nature, 550, pp. 204–213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Davis S., & Meltzer P. S. (2007, 05). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23(14), pp. 1846–1847. [DOI] [PubMed] [Google Scholar]
  26. de Jong H. (2002). Modeling and Simulation of Genetic Regulatory Systems: A Literature Review. Journal of Computational Biology, 9(1), pp. 67–103. [DOI] [PubMed] [Google Scholar]
  27. Djordjevic M., Sengupta A. M., & Shraiman B. I. (2003, Nov). A Biophysical Approach to Transcription Factor Binding Site Discovery. Genome Res., 13(11), pp. 2381–2390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Edgar R., Domrachev M., & Lash A. E. (2002, 01). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1), pp. 207–210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Fakhry C. T., Choudhary P., Gutteridge A., Sidders B., Chen P., Ziemek D., & Zarringhalam K. (2016, Dec). Interpreting transcriptional changes using causal graphs: new methods and their practical utility on public networks. BMC Bioinf., 17(1), pp. 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Fang F., Ooka K., Bhattachyya S., Wei J., Wu M., Du P., . . . Varga J. (2011, May). The Early Growth Response Gene Egr2 (Alias Krox20) Is a Novel Transcriptional Target of Transforming Growth Factor-β that Is Up-Regulated in Systemic Sclerosis and Mediates Profibrotic Responses. Am. J. Pathol., 178(5), pp. 2077–2090. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Farahmand S., O’Connor C., Macoska J. A., & Zarringhalam K. (2019, Dec). Causal Inference Engine: a platform for directional gene set enrichment analysis and inference of active transcriptional regulators. Nucleic Acids Res., 47(22), pp. 11563–11573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Friedman N. (2003, Sep). Probabilistic models for identifying regulation networks. Bioinformatics, 19(suppl_2), p. ii57. [Google Scholar]
  33. Fröhlich H. (2015, 06). biRte: Bayesian inference of context-specific regulator activities and transcriptional networks. Bioinformatics, 31(20), pp. 3290–3298. [DOI] [PubMed] [Google Scholar]
  34. Gelman A., & Rubin D. B. (1992, 11). Inference from Iterative Simulation Using Multiple Sequences. Statist. Sci., 7(4), pp. 457–472. [Google Scholar]
  35. Gharaee-Kermani M., Kasina S., Moore B. B., Thomas D., Mehra R., & Macoska J. A. (2012, Nov). CXC-Type Chemokines Promote Myofibroblast Phenoconversion and Prostatic Fibrosis. PLoS One, 7(11), p. e49278. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Gharaee-Kermani M., Mehra R., Robinson D. R., Wei J. T., & Macoska J. A. (2014, mar). Complex Cellular Composition of Solitary Fibrous Tumor of the Prostate. The American Journal of Pathology, 184(3), pp. 732–739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Giaimo B. D., Gagliani E. K., Kovall R. A., & Borggrefe T. (2021). Transcription Factor RBPJ as a Molecular Switch in Regulating the Notch Response. In Reichrath J., & Reichrath S. (Eds.). Cham: Springer International Publishing. [DOI] [PubMed] [Google Scholar]
  38. Gui T., Liu M., Yao B., Jiang H., Yang D., Li Q., . . . Zhao Q. (2021, dec). TCF3 is epigenetically silenced by EZH2 and DNMT3B and functions as a tumor suppressor in endometrial cancer. Cell Death Differ., 28(12), pp. 3316–3328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Guo Y.-J., Pan W.-W., Liu S.-B., Shen Z.-F., Xu Y., & Hu L.-L. (2020, Mar 01). ERK/MAPK signalling pathway and tumorigenesis (Review). Experimental and Therapeutic Medicine, 19(3), pp. 1997–2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Hafemeister C., & Satija R. (2019, Dec 23). Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biology, 20(1), p. 296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Hamidi A. A., Taghehchian N., Basirat Z., Zangouei A. S., & Moghbeli M. (2022, Jun 04). MicroRNAs as the critical regulators of cell migration and invasion in thyroid cancer. Biomarker Research, 10(1), p. 40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Han H., Shim H., Shin D., Shim J. E., Ko Y., Shin J., . . . Lee I. (2015, Jun 12). TRRUST: a reference database of human transcriptional regulatory interactions. Scientific Reports, 5, p. 11432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Hao Y., Hao S., Andersen-Nissen E., Mauck W. M., Zheng S., Butler A., . . . Satija R. (2021). Integrated analysis of multimodal single-cell data. Cell. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Harrison A. R., & Moseley G. W. (2020). The Dynamic Interface of Viruses with STATs. Journal of Virology, 94(22), pp. e00856–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Hashimoto R. F., Kim S., Shmulevich I., Zhang W., Bittner M. L., & Dougherty E. R. (2004, May). Growing genetic regulatory networks from seed genes. Bioinformatics, 20(8), pp. 1241–1247. [DOI] [PubMed] [Google Scholar]
  46. Henry G. H., Malewska A., Joseph D. B., Malladi V. S., Lee J., Torrealba J., . . . Strand D. W. (2018). A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra. Cell Reports, 25(12), pp. 3530–3542.e5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Hishida T., Nishizuka M., Osada S., & Imagawa M. (2009). The role of C/EBPδ in the early stages of adipogenesis. Biochimie, 91(5), pp. 654–657. [DOI] [PubMed] [Google Scholar]
  48. Hjortoe G. M., Weilguny D., & Willumsen B. M. (2005, Jan). Elk3 from Hamster—A Ternary Complex Factor with Strong Transcriptional Repressor Activity. DNA Cell Biol., 24(1), pp. 35–42. [DOI] [PubMed] [Google Scholar]
  49. Honkela A., Girardot C., Gustafson E. H., Liu Y.-H., Furlong E. E., Lawrence N. D., & Rattray M. (2010). Model-based method for transcription factor target identification with limited data. Proceedings of the National Academy of Sciences, 107(17), pp. 7793–7798. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Hu M. S., Moore A. L., & Longaker M. T. (2018, apr). A Fibroblast Is Not a Fibroblast Is Not a Fibroblast. J. Invest. Dermatol., 138(4), pp. 729–730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Huang W., Chen J., Liu X., Liu X., Duan S., Chen L., . . . Zhou J. (2021, sep). MIER3 induces epithelial-mesenchymal transition and promotes breast cancer cell aggressiveness via forming a co-repressor complex with HDAC1/HDAC2/Snail. Experimental Cell Research, 406(1), p. 112722. [DOI] [PubMed] [Google Scholar]
  52. Jensen L. J., Kuhn M., Stark M., Chaffron S., Creevey C., Muller J., . . . von Mering C. (2009, Jan). STRING 8—a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res., 37(suppl_1), pp. D412–D416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Jia Y., Yang Y., Zhan Q., Brock M. V., Zheng X., Yu Y., . . . Guo M. (2012, Nov). Inhibition of SOX17 by MicroRNA 141 and Methylation Activates the WNT Signaling Pathway in Esophageal Cancer. Journal of Molecular Diagnostics, 14(6), pp. 577–585. [DOI] [PubMed] [Google Scholar]
  54. Jiang C., Xuan Z., Zhao F., & Zhang M. Q. (2007, 01). TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Research, 35(suppl_1), pp. D137–D140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Jiang P., Freedman M. L., Liu J. S., & Liu X. S. (2015). Inference of transcriptional regulation in cancers. Proceedings of the National Academy of Sciences, 112(25), pp. 7731–7736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Jung F., Johnson A. D., Kumar M. S., Wei B., Hautmann M., Owens G. K., & McNamara C. (1999, Nov). Characterization of an E-box–Dependent cis Element in the Smooth Muscle α-Actin Promoter. Arterioscler., Thromb., Vasc. Biol. [DOI] [PubMed] [Google Scholar]
  57. Junker J. P., Lönnqvist S., Rakar J., Karlsson L. K., Grenegård M., & Kratz G. (2013). Differentiation of human dermal fibroblasts towards endothelial cells. Differentiation, 85(3), pp. 67–77. [DOI] [PubMed] [Google Scholar]
  58. Kanehisa M., & Goto S. (2000, Jan). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28(1), pp. 27–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Keenan A. B., Torre D., Lachmann A., Leong A. K., Wojciechowicz M. L., Utti V., . . . Ma’ayan A. (2019, 05). ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Research, 47(W1), pp. W212–W224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Knudsen E. S., Pruitt S. C., Hershberger P. A., Witkiewicz A. K., & Goodrich D. W. (2019, may). Cell Cycle and Beyond: Exploiting New RB1 Controlled Mechanisms for Cancer Therapy. Trends in Cancer, 5(5), pp. 308–324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Kolchanov N. A., Ignatieva E. V., Ananko E. A., Podkolodnaya O. A., Stepanenko I. L., Merkulova T. I., . . . Romashchenko A. G. (2002, Jan). Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucleic Acids Res., 30(1), pp. 312–317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Kramer A., Green J., Pollard J. J., & Tugendreich S. (2014, Feb). Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics, 30(4), pp. 523–530. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Krauß L., Urban B. C., Hastreiter S., Schneider C., Wenzel P., Hassan Z., . . . Schneider G. (2021, 12). HDAC2 Facilitates Pancreatic Cancer Metastasis. Cancer Research, 82(4), pp. 695–707. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Lachmann A., Giorgi F. M., Lopez G., & Califano A. (2016, 04). ARACNe-AP: gene network reverse engineering through adaptive partitioning inference of mutual information. Bioinformatics, 32(14), pp. 2233–2235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Lambert S. A., Jolma A., Campitelli L. F., Das P. K., Yin Y., Albu M., . . . Weirauch M. T. (2018). The Human Transcription Factors. Cell, 172(4), pp. 650–665. [DOI] [PubMed] [Google Scholar]
  66. Lee P.-J., Sui Y.-H., Liu T.-T., Tsang N.-M., Huang C.-H., Lin T.-Y., . . . Liu S.-C. (2022, aug). Epstein-Barr viral product-containing exosomes promote fibrosis and nasopharyngeal carcinoma progression through activation of YAP1/FAPα signaling in fibroblasts. J. Exp. Clin. Cancer Res., 41(1), p. 254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Lee T. I., & Young R. A. (2013, Mar). Transcriptional Regulation and Its Misregulation in Disease. Cell, 152(6), pp. 1237–1251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Lelli K. M., Slattery M., & Mann R. S. (2012). Disentangling the Many Layers of Eukaryotic Transcriptional Regulation. Annual Review of Genetics, 46(1), pp. 43–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Li J., Zeng Z., Chen J., Liu X., Jiang X., Sun W., . . . Xie C. (2020). Pathologic evolution-related Gene Analysis based on both single-cell and bulk transcriptomics in Colorectal Cancer. J Cancer, 11, pp. 6861–6873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Li S., Zhou X., Zeng R., Lin L., Zou X., Yan Y., . . . Zhao Y. (2022, nov). YAP1 silencing attenuated lung injury/fibrosis but worsened diaphragmatic function by regulating oxidative stress and inflammation response in mice. Free Radic. Biol. Med., 193(Pt 2), pp. 485–498. [DOI] [PubMed] [Google Scholar]
  71. Li T.-Y., Su W., Li L.-L., Zhao X.-G., Yang N., Gai J.-X., . . . Liang H.-H. (2022, nov). Critical role of PAFR/YAP1 positive feedback loop in cardiac fibrosis. Acta Pharmacol. Sin., 43(11), pp. 2862–2872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Li X., Zhang F., Qu L., Xie Y., Ruan Y., Guo Z., . . . Guo B. (2021, nov). Identification of YAP1 as a novel downstream effector of the FGF2/STAT3 pathway in the pathogenesis of renal tubulointerstitial fibrosis. J. Cell. Physiol., 236(11), pp. 7655–7671. [DOI] [PubMed] [Google Scholar]
  73. Li Y., Luo H., Liu T., Zacksenhaus E., & Ben-David Y. (2015, apr). The ets transcription factor Fli-1 in development, cancer and disease. Oncogene, 34(16), pp. 2022–2031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Lighthouse J. K., & Small E. M. (2016, feb). Transcriptional control of cardiac fibroblast plasticity. J. Mol. Cell. Cardiol., 91, pp. 52–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Liu D., Skomorovska Y., Song J., Bowler E., Harris R., Ravasz M., . . . Ewing R. M. (2019, Jan). ELF3 is an antagonist of oncogenic-signalling-induced expression of EMT-TF ZEB1. Cancer Biol. Ther., 20(1), pp. 90–100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Macoska J. A., Paris P., Collins C., Andaya A., Beheshti B., Chaib H., . . . Squire J. A. (2004). Evolution of 8p loss in transformed human prostate epithelial cells. Cancer Genetics and Cytogenetics, 154(1), pp. 36–43. [DOI] [PubMed] [Google Scholar]
  77. Mondal A., Das S., Samanta J., Chakraborty S., & Sengupta A. (2022, jun). YAP1 induces hyperglycemic stress-mediated cardiac hypertrophy and fibrosis in an AKT-FOXM1 dependent signaling pathway. Arch. Biochem. Biophys., 722(109198), p. 109198. [DOI] [PubMed] [Google Scholar]
  78. Munkácsy G., Herman P., & Gyorffy B. (2019). Independent validation of induced overexpression efficiency across 242 experiments shows a success rate of 39%. Scientific Reports, 9(1), p. 343. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Ocone A., & Sanguinetti G. (2011, Oct). Reconstructing transcription factor activities in hierarchical transcription network motifs. Bioinformatics, 27(20), pp. 2873–2879. [DOI] [PubMed] [Google Scholar]
  80. Oki S., Ohta T., Shioi G., Hatanaka H., Ogasawara O., Okuda Y., . . . Meno C. (2018). ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO reports, 19(12), p. e46255. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Pang P., Si W., Wu H., Wang C., Liu K., Jia Y., . . . Yang B. (2023, feb). The circular RNA circHelz enhances cardiac fibrosis by facilitating the nuclear translocation of YAP1. Transl. Res. [DOI] [PubMed] [Google Scholar]
  82. Patalano S., Rodríguez-Nieves J., Colaneri C., Cotellessa J., Almanza D., Zhilin-Roth A., . . . Macoska J. (2018, Feb). CXCL12/CXCR4-Mediated Procollagen Secretion Is Coupled To Cullin-RING Ubiquitin Ligase Activation. Sci. Rep., 8(3499), pp. 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Pavlides S. C., Lecanda J., Daubriac J., Pandya U. M., Gama P., Blank S., . . . Gold L. I. (2016, Apr). TGF-β activates APC through Cdh1 binding for Cks1 and Skp2 proteasomal destruction stabilizing p27kip1 for normal endometrial growth. Cell Cycle, 15(7), pp. 931–947. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Philippeos C., Telerman S. B., Oulès B., Pisco A. O., Shaw T. J., Elgueta R., . . . Watt F. M. (2018, apr). Spatial and Single-Cell Transcriptional Profiling Identifies Functionally Distinct Human Dermal Fibroblast Subpopulations. J. Invest. Dermatol., 138(4), pp. 811–825. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Piersma B., de Rond S., Werker P. M., Boo S., Hinz B., van Beuge M. M., & Bank R. A. (2015). YAP1 Is a Driver of Myofibroblast Differentiation in Normal and Diseased Fibroblasts. The American Journal of Pathology, 185(12), pp. 3326–3337. [DOI] [PubMed] [Google Scholar]
  86. Qi J., Pellecchia M., & Ronai Z. A. (2010, sep). The Siah2-HIF-FoxA2 axis in prostate cancer–new markers and therapeutic opportunities. Oncotarget, 1(5), pp. 379–385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Robinson M. D., McCarthy D. J., & Smyth G. K. (2009, 11). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp. 139–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Rodríguez-Nieves J. A., Patalano S. C., Almanza D., Gharaee-Kermani M., & Macoska J. A. (2016, Jul). CXCL12/CXCR4 Axis Activation Mediates Prostate Myofibroblast Phenoconversion through Non-Canonical EGFR/MEK/ERK Signaling. PLoS One, 11(7), p. e0159490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Rosen E. D. (2005, jul). The transcriptional basis of adipocyte development. Prostaglandins, Leukotrienes Essent. Fatty Acids, 73(1), pp. 31–34. [DOI] [PubMed] [Google Scholar]
  90. Salloum S., Jeyarajan A. J., Kruger A. J., Holmes J. A., Shao T., Sojoodi M., . . . Chung R. T. (2021, jun). Fatty acids activate the transcriptional coactivator YAP1 to promote liver fibrosis via p38 mitogen-activated protein kinase. Cell. Mol. Gastroenterol. Hepatol., 12(4), pp. 1297–1310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Satija R., Farrell J. A., Gennert D., Schier A. F., & Regev A. (2015). Spatial reconstruction of single-cell gene expression data. Nature Biotechnology, 33, pp. 495–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  92. Sato M., Matsumoto M., Saiki Y., Alam M., Nishizawa H., Rokugo M., . . . Igarashi K. (2020, Mar). BACH1 Promotes Pancreatic Cancer Metastasis by Repressing Epithelial Genes and Enhancing Epithelial–Mesenchymal Transition. Cancer Res., 80(6), pp. 1279–1292. [DOI] [PubMed] [Google Scholar]
  93. Schacht T., Oswald M., Eils R., Eichmüller S. B., & König R. (2014, 08). Estimating the activity of transcription factors by the effect on their target genes. Bioinformatics, 30(17), pp. i401–i407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Segal E., Shapira M., Regev A., Pe’er D., Botstein D., Koller D., & Friedman N. (2003, Jun). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat. Genet., 34(2), pp. 166–176. [DOI] [PubMed] [Google Scholar]
  95. Segal E., Taskar B., Gasch A., Friedman N., & Koller D. (2001, Jun). Rich probabilistic models for gene expression. Bioinformatics, 17(suppl_1), pp. S243–S252. [DOI] [PubMed] [Google Scholar]
  96. Smyth G. K. (2005). limma: Linear Models for Microarray Data. New York, NY: Springer New York. [Google Scholar]
  97. Somogyi R., & Sniegoski C. A. (1996). Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation. Complexity, 1(6), pp. 45–63. [Google Scholar]
  98. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W. M., . . . Satija R. (2019). Comprehensive Integration of Single-Cell Data. Cell, 177(7), pp. 1888–1902.e21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Subramanian A., Tamayo P., Mootha V. K., Mukherjee S., Ebert B. L., Gillette M. A., . . . Mesirov J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102(43), pp. 15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Sun Y., Liu W.-Z., Liu T., Feng X., Yang N., & Zhou H.-F. (2015, jun). Signaling pathway of MAPK/ERK in cell proliferation, differentiation, migration, senescence and apoptosis. Journal of Receptors and Signal Transduction, 35(6), pp. 600–604. [DOI] [PubMed] [Google Scholar]
  101. Szklarczyk D., Franceschini A., Wyder S., Forslund K., Heller D., Huerta-Cepas J., . . . von Mering C. (2014, 10). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1), pp. D447–D452. [DOI] [PMC free article] [PubMed] [Google Scholar]
  102. Theiler J. (90s). Combining Statistical Tests By Multiplying p-values.
  103. Trescher S., Münchmeyer J., & Leser U. (2017, Mar 27). Estimating genome-wide regulatory activity from multi-omics data sets using mathematical optimization. BMC Systems Biology, 11(1), p. 41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Veber P., Guziolowski C., Le Borgne M., Radulescu O., & Siegel A. (2008, Dec). Inferring the role of transcription factors in regulatory networks. BMC Bioinf., 9(1), pp. 1–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  105. Walston H., Iness A. N., & Litovchick L. (2021). DREAM On: Cell Cycle Control in Development and Disease. Annual Review of Genetics, 55(1), pp. 309–329. [DOI] [PubMed] [Google Scholar]
  106. Wang Q., Armenia J., Zhang C., Penson A. V., Reznik E., Zhang L., . . . Schultz N. (2018, Apr 17). Unifying cancer and normal RNA sequencing data from different sources. Scientific Data, 5, p. 180061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  107. Wang X., Wang G., Qu J., Yuan Z., Pan R., & Li K. (2020, mar). Calcipotriol inhibits NLRP3 signal through YAP1 activation to alleviate cholestatic liver injury and fibrosis. Front. Pharmacol., 11, p. 200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  108. Wilkinson A. C., Nakauchi H., & Gottgens B. (2017, Oct). Mammalian Transcription Factor Networks: Recent Advances in Interrogating Biological Complexity. Cell Systems, 5(4), pp. 319–331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  109. Xie G., Dong P., Chen H., Xu L., Liu Y., Ma Y., . . . Shen L. (2021, nov). Decreased expression of ATF3, orchestrated by β-catenin/TCF3, miR-17–5p and HOXA11-AS, promoted gastric cancer progression via increased β-catenin and CEMIP. Exp. Mol. Med., 53(11), pp. 1706–1722. [DOI] [PMC free article] [PubMed] [Google Scholar]
  110. Xu W., Song W., Wang Y. U., Zan Y., Zhang M., Li M., . . . Wu J. (2021, may). Efficacy of YAP1-gene knockdown to inhibit alveolar-epithelial-cell senescence and alleviate idiopathic pulmonary fibrosis (IPF). Cancer Genomics Proteomics, 18(3 Suppl), pp. 451–459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  111. Yamagishi H., Olson E. N., & Srivastava D. (2000, 2). The basic helix-loop-helix transcription factor, dHAND, is required for vascular development. The Journal of Clinical Investigation, 105(3), pp. 261–270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  112. Yamashita T., Tazawa S., Yawei Z., Katayama H., Kato Y., Nishiwaki K., . . . Ishikawa M. (2006, apr). Suppression of invasive characteristics by antisense introduction of overexpressed HOX genes in ovarian cancer cells. Int. J. Oncol., 28(4), pp. 931–938. [PubMed] [Google Scholar]
  113. Yu B., Zhang K., Milner J. J., Toma C., Chen R., Scott-Browne J. P., . . . Goldrath A. W. (2017, May 01). Epigenetic landscapes reveal transcription factors that regulate CD8+ T cell differentiation. Nature Immunology, 18(5), pp. 573–582. [DOI] [PMC free article] [PubMed] [Google Scholar]
  114. Zarringhalam K., Enayetallah A., Gutteridge A., Sidders B., & Ziemek D. (2013, 09). Molecular causes of transcriptional response: a Bayesian prior knowledge approach. Bioinformatics, 29(24), pp. 3167–3173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  115. Zervos A. S., Gyuris J., & Brent R. (1993, jan). Mxi1, a protein that specifically interacts with Max to bind Myc-Max recognition sites. Cell, 72(2), pp. 223–232. [DOI] [PubMed] [Google Scholar]
  116. Zhou X., Ji G., Ke X., Gu H., Jin W., & Zhang G. (2015, Nov 01). MiR-141 Inhibits Gastric Cancer Proliferation by Interacting with Long Noncoding RNA MEG3 and Down-Regulating E2F3 Expression. Digestive Diseases and Sciences, 60(11), pp. 3271–3282. [DOI] [PubMed] [Google Scholar]
  117. Zhou Z., Patel M., Ng N., Hsieh M. H., Orth A. P., Walker J. R., . . . Liu J. (2014, Dec). Identification of synthetic lethality of PRKDC in MYC-dependent human cancers by pooled shRNA screening. BMC Cancer, 14(1), pp. 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints