Nucleic Acids Research. 2023 Feb 10;51(7):e38. doi: 10.1093/nar/gkad053

Using single cell atlas data to reconstruct regulatory networks

Qi Song 1, Matthew Ruffalo 2, Ziv Bar-Joseph 3,4
PMCID: PMC10123116  PMID: 36762475

Abstract

Inference of global gene regulatory networks from omics data is a long-term goal of systems biology. Most methods developed for inferring transcription factor (TF)–gene interactions either relied on small datasets or used snapshot data, which are not suitable for inferring a process that is inherently temporal. Here, we developed a new computational method that combines neural networks and multi-task learning to predict RNA velocity rather than gene expression values. This allows our method to overcome many of the problems faced by prior methods, leading to a more accurate and more comprehensive set of identified regulatory interactions. Application of our method to atlas-scale single cell data from six HuBMAP tissues led to several validated and novel predictions and greatly improved on prior methods proposed for this task.

INTRODUCTION

The reconstruction of regulatory networks from functional genomics data has been a major research focus in computational biology (1–4). Several methods for inferring transcription factor (TF)–gene interactions focused on the use of the expression of TFs to predict gene expression (5–9). While computational methodology and data differed, many methods shared a common, underlying, assumption: If the expression of a TF, or a combination of TFs, is a good proxy for the expression of a specific gene, then these TFs are likely the regulators of that gene. Starting from the early days of microarrays, through the use of next generation sequencing and more recently with single cell technologies, several computational methods have been developed and tested using these ideas (1–11). Some predictions of these methods, either general or context specific, have also been successfully experimentally validated (4,7,10,11).

While the assumption about the relationship between TF and gene expression levels proved successful, it did not always reflect biological realities. Most studies relied on snapshot or static data. In such studies, the expression levels of genes may not correlate with those of their regulating TFs because of the time delay between TF binding and the accumulation of transcripts (12,13). In other cases, gene expression levels may be elevated prior to the expression of the TF and remain high after the activation, with no causal relationship between the two (14). These issues can lead to both false positive and false negative predictions.

Another issue arises from the data itself. Single cell data is more powerful than bulk data for this task because interactions can be learned from expression values measured in the same cells (15). However, expression alone may not be enough to predict such interactions. Several studies have shown that many TFs are post-translationally activated (16–18). For these TFs, high expression levels may not be directly correlated with activity, leading to false positive predictions (19).

Several methods have been used to infer such interactions. Early efforts mapped expression data to regulatory interactions by evaluating correlations between genes, using mutual information (20) and co-expression (21). Other approaches utilized regression to identify the best TFs for predicting target gene expression. These methods include linear models based on LARS (22) and LASSO (23–25), and autoregressive models (26,27), which attempt to capture temporal expression patterns. Models that can learn more complex non-linear relationships have also been developed, including random forests (6,7), gradient-boosting trees (5), neural networks (28), and Bayesian networks (29–31). A few methods have also incorporated additional types of data beyond expression, including ChIP-seq and TF motif information (2,32,33), ATAC-seq and other epigenetic data (9), and protein–protein interactions (32). However, these methods have also attempted to use the interaction information to predict gene expression which, as discussed above, may not reflect the current activation level of the target gene.

To overcome the above problems and improve the prediction of a global, cross tissue, gene regulatory network we developed methods that combined three novel strategies for regulatory networks inference. First, we predicted RNA velocity (34) instead of gene expression. Unlike gene expression, which may not be able to represent dynamic activity of a gene, RNA velocity measures the real time activity of genes and thus can serve as a much better proxy for the level of gene regulation. A second aspect of our model is the use of scATAC-seq in addition to expression for inferring TF activation. Unlike expression, scATAC-seq is impacted by post-translational modifications and so can be used to infer the actual activity of TFs. Finally, our method integrates much larger single cell data from recent atlas studies (HuBMAP (35)), allowing it to utilize big data for the regulatory network inference task.

To combine different data types across tissues, we developed a multi-task deep learning framework to predict RNA velocity values for target genes. We constructed the final tissue-specific regulatory networks by ranking TFs using the deep SHAP algorithm (36) on the trained models.

We tested our method and compared it to several previous methods. As we show, by using RNA velocity we can obtain more accurate results compared to previous methods. We discuss both global tissue-specific networks and specific organ function related subnetworks identified by our method and provide a comprehensive list of tissue specific predictions of TF-gene interactions for use in further downstream analysis.

MATERIALS AND METHODS

Data preprocessing

Gene quantification and calculation of RNA velocities

Gene counts for scRNA-seq data were generated using the in-house HuBMAP transcriptomic pipelines, which use Salmon (37) to map reads to the NCBI GRCh38 reference genome (38) and quantify gene counts. Next, the scVelo package (34) was used to estimate the RNA velocity of each gene. Only genes with sufficient read counts in both spliced and unspliced regions, and cells with enough genes having spliced and unspliced counts, were kept.

Peak calling for scATAC-seq and SNARE-seq data

Read mapping was performed using the BWA short read aligner (39). The SnapATAC package (40) was used to generate the cell-by-bin matrix and the MACS2 package was used to call peaks (saved in BED format) over all cells. For SNARE-seq data, read mapping and peak calling were performed using the same pipeline.

TF activity score

The activity score of each TF is computed from binding site information from ChIP-seq data and chromatin accessibility information from scATAC-seq data. We downloaded all ChIP-seq data from the Cistrome database (41), which includes binding site locations for 1359 human TFs in its recent release. We kept only TFs that satisfied the following criteria: (i) sample median sequence quality score ≥25 (scores calculated with the FastQC software); (ii) uniquely mapped ratio ≥60%; (iii) PBC score ≥80%; (iv) FRiP score ≥1%; (v) at least 500 peaks with fold change >10 (PeaksFoldChangeAbove10 score ≥500). See the instruction page of Cistrome for more details: https://cistrome.org/chilin/_downloads/instructions.pdf. After the filtering steps, 623 TFs were selected for calculating the TF activity score matrices $A^{\mathrm{sum}}$ and $A^{\mathrm{mean}}$:

$$A^{\mathrm{sum}}_{ij} = \sum_{k=1}^{m} f(b_{ijk}, P_{ijk})$$

$$A^{\mathrm{mean}}_{ij} = \frac{1}{m}\sum_{k=1}^{m} f(b_{ijk}, P_{ijk})$$

where $A^{\mathrm{sum}}_{ij}$ is the summed TF activity of the ith TF with respect to the promoter region of the jth gene, $b_{ijk}$ represents the kth binding site of the ith TF in the promoter region of the jth gene, and $P_{ijk}$ denotes the set of scATAC-seq peaks overlapping with $b_{ijk}$. Note that $P_{ijk}$ is specific to each binding site of the TF-gene pair (see Figure 1B). In this study, we used ChIPseeker (42) to map the ChIP-seq binding sites to their nearest genes. We defined the promoter region of a gene using the default parameter in the ChIPseeker package (42) (from 3000 bp upstream to 3000 bp downstream of the transcription start site). The function $f$ performs the following operations: (i) it computes the length of the overlap between the kth binding site of TF i relative to gene j and any scATAC-seq peak regions, and (ii) it divides the length from (i) by the length of that binding site. $A^{\mathrm{sum}}_{ij}$ sums these fractions over the m overlapped regions for TF i relative to gene j. Similarly, $A^{\mathrm{mean}}_{ij}$ averages them over the same m regions. For scATAC-seq data, we used peak regions from BED files, which represent aggregated open chromatin regions from all single cells. For SNARE-seq data, we used peak regions from BED files but mapped each region back to each cell if any read that produced the peak came from that cell. Therefore, for scATAC-seq data there is a single $A^{\mathrm{sum}}$ and $A^{\mathrm{mean}}$ for each tissue, and for SNARE-seq data there is a single $A^{\mathrm{sum}}$ and $A^{\mathrm{mean}}$ for each single cell.
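
The overlap-based scoring described above can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, intervals are assumed to be half-open and on the same chromosome, the scATAC-seq peaks are assumed not to overlap one another, and m is assumed to count only the binding sites that overlap at least one peak.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) on one chromosome


def overlap_fraction(site: Interval, peaks: List[Interval]) -> float:
    """Fraction of a ChIP-seq binding site covered by open-chromatin peaks.

    Assumes the peaks do not overlap each other, so covered lengths can be added.
    """
    start, end = site
    covered = 0
    for p_start, p_end in peaks:
        covered += max(0, min(end, p_end) - max(start, p_start))
    return covered / (end - start)


def tf_activity(sites: List[Interval], peaks: List[Interval]) -> Tuple[float, float]:
    """Return (summed, averaged) activity for one TF-gene pair.

    `sites` are the TF's binding sites in the gene's promoter region.
    Only sites overlapping a peak contribute (an assumption about m).
    """
    fractions = [overlap_fraction(s, peaks) for s in sites]
    fractions = [f for f in fractions if f > 0]
    if not fractions:
        return 0.0, 0.0
    total = sum(fractions)
    return total, total / len(fractions)
```

For example, with binding sites `(100, 200)`, `(300, 400)`, `(500, 600)` and peaks `(150, 250)`, `(300, 350)`, two sites are each half covered and the third is not covered at all.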

Figure 1.


The flowchart of the MTLRank framework. (A) HuBMAP tissue-specific scRNA-seq and scATAC-seq data sets were used. The TF RPKM matrix was generated from tissue-specific scRNA-seq data and the TF activity matrix was generated by integrating Cistrome DB ChIP-seq TF prior information with scATAC-seq data. (B) Computation of TF activity scores from ChIP-seq data and scATAC-seq data. Blue blocks represent TF binding sites from ChIP-seq data and the green block represents an open chromatin region from scATAC-seq data. TF binding sites were weighted by the scATAC-seq open chromatin regions (see Materials and Methods for more details). (C) Multi-task model training and deep SHAP based TF ranking. The final tissue-specific regulatory networks were constructed from the TF ranking results.

Filtering, standardization and scaling

First round of filtering was performed to remove target genes without enough cells producing RNA velocity values. For liver, we removed target genes having <1000 cells, and for other tissues, we removed target genes having <5000 cells. This filtering process was performed on the velocity matrix Y (see Table 2 for number of target genes, and cells after filtering in each tissue). Then we used the remaining cells in the velocity matrix to filter RPKM matrix X. To train MTLRank models, we only used the TF columns in RPKM matrix X and removed other columns (the list of current known human TFs was downloaded from (43)). RPKM values of each TF were then transformed by log10 and standardized to zero-mean and unit-variance. As a result, each tissue is assigned with a column-wise standardized RPKM matrix Inline graphic, where m is the number of cells and n is the number of TFs (see Table 2 for number of TFs in each tissue). Similarly, velocity matrix Y was also column-wise standardized. TF activity scores were scaled by Inline graphic and Inline graphic. In the next step, RPKM matrix Inline graphic, velocity matrix Y, TF activity score matrix Inline graphicand Inline graphic were used as inputs to the MTLRank pipeline.
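
The filtering and standardization steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: missing velocity values are assumed to be encoded as NaN, a pseudocount is added before the log10 transform so that zero RPKM values are defined (the text does not say how zeros were handled), and all function names are hypothetical.

```python
import numpy as np


def filter_genes(velocity: np.ndarray, min_cells: int) -> np.ndarray:
    """Keep target genes (columns) with velocity values in at least `min_cells` cells.

    Missing velocities are assumed to be stored as NaN.
    """
    n_valid = np.sum(~np.isnan(velocity), axis=0)
    return velocity[:, n_valid >= min_cells]


def standardize_columns(mat: np.ndarray) -> np.ndarray:
    """Column-wise standardization to zero mean and unit variance."""
    mean = mat.mean(axis=0)
    std = mat.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns at zero
    return (mat - mean) / std


def preprocess_rpkm(rpkm: np.ndarray, pseudocount: float = 1.0) -> np.ndarray:
    """log10-transform TF RPKM values, then standardize each TF column.

    The pseudocount is an assumption made here to avoid log10(0).
    """
    return standardize_columns(np.log10(rpkm + pseudocount))
```

In this sketch the thresholds from the text would be passed as `min_cells=1000` for liver and `min_cells=5000` for the other tissues.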

Table 2.

Number of TFs, target genes, cells and peaks for SNARE-seq data

Type | Metric | Left kidney | Right kidney | Right lung
scRNA-seq | TFs | 2641 | 2533 | 2577
scRNA-seq | Genes | 750 | 1066 | 1803
scRNA-seq | Cells | 44 367 | 62 297 | 119 206
scATAC-seq | Peaks | 117 584 | 113 638 | 101 587
scATAC-seq | Cells | 23 941 | 20 930 | 52 230

MTLRank framework

Overview of MTLRank framework

MTLRank prediction consists of two main steps. The first step involves training multi-task models with RPKM expressions and activity scores of TFs as inputs to predict velocities of genes. Each gene has its own model, but model parameters are shared through a multi-task learning framework. This reduces overfitting and can lead to better performance (44). The second step is to rank the TFs for each of the models trained in the first step. In MTLRank's framework, we adopted the deep SHAP algorithm to compute a SHAP value for each TF and ranked TFs by the sum of absolute SHAP values. A regulatory network was then constructed based on the ranking results. Details of each step are described in the following subsections.

Model architecture and loss function

We trained a multi-layer neural network to identify TFs regulating genes. Supplementary Figure S1 illustrates the model architecture. The first layer is the input layer that takes in TF activity score matrices Inline graphic and Inline graphic, and RPKM expression matrix Inline graphic for the same TFs. The second layer is a TF aggregation layer in which each neuron represents a TF that takes the activity scores and RPKM expressions for that TF. For example, in a model that predicts velocity values for the ith gene, the kth neuron in the TF aggregation layer performs the operation Inline graphic for a single input, where Inline graphic, Inline graphic and Inline graphic are learned during training and are specific to kth TF, and Inline graphic represents standardized RPKM expression for the kth TF in cell c. In such case, each neuron in the TF aggregation layer transforms the weighted means of TF activity scores and TF RPKM expressions. TF aggregation layer is then followed by three fully connected layers with 64, 32 and 16 neurons. We named these layers as FC1, FC2 and FC3. FC3 is connected to the final output layer that predicts RNA velocity of the target gene i. We note that we tested additional architectures with more or less layers but did not observe large changes in results. We applied trace norm (44) to perform multi-task learning across different models. Trace norm regularization adopts a ‘soft sharing’ strategy, meaning that all models share their parameters in an indirect manner. This is achieved by first concatenating parameters from different models into a single matrix, then minimizing the sum of singular values of the concatenated matrix to encourage it to become a low-rank matrix. To reduce the computational cost, instead of sharing parameters among all models, our framework only shared parameters within each gene cluster. For each tissue, we assigned genes into nearly equal-sized clusters based on their velocity values. 
Gene clustering was performed using constrained k-means clustering algorithm (minimum cluster size = 24, maximum cluster size = 25) (45). Within each cluster, the loss function of models is defined as:

$$\mathcal{L} = \sum_{i=1}^{T}\frac{1}{N_i}\sum_{c=1}^{N_i}\ell\big(y_{i,c},\, f(x_{i,c};\theta_i)\big) + \lambda_1\sum_{i=1}^{T}\sum_{k}\lVert W_i^{(k)}\rVert_1 + \lambda_2\sum_{k=1}^{m}\lVert W^{(k)}\rVert_*$$

where T is the number of tasks within each cluster, which is equal to the number of genes in each cluster. $N_i$ denotes the number of cells (training examples) available for task i. $y_{i,c}$ is the true RNA velocity for gene i in cell c, and $f(x_{i,c};\theta_i)$ computes the predicted RNA velocity through forward propagation using a single input and the current model parameters $\theta_i$. $x_{i,c}$ denotes the activity scores of all TFs with respect to gene i and the expressions of all TFs in cell c. $\ell$ is the squared-error loss, so that each task contributes its mean squared error. $W_i^{(k)}$ is the model weight matrix for the kth layer of the ith task. The term $\lambda_1\sum_{i}\sum_{k}\lVert W_i^{(k)}\rVert_1$ represents L1 regularization that penalizes the complexity of all models and all layers. We used the default value of $\lambda_1$ from Keras (version 2.8.0). $W^{(k)}$ represents a 2D matrix of model parameters concatenated from the kth layer of each task. The parameter tensor from each task is flattened into a 1D vector before concatenation so that each column of $W^{(k)}$ represents parameters from a single task (Supplementary Figure S1 and Supplementary file 1). $\lambda_2\sum_{k=1}^{m}\lVert W^{(k)}\rVert_*$ is the regularization term that sums the trace norm over m shared layers, and $\lambda_2$ controls the strength of trace norm regularization. Given that lower-level features are usually more similar than higher-level features across different tasks (44), we only shared layers FC1 and FC2 (Supplementary Figure S1 and Supplementary file 1). Accordingly, layer FC3 and the output layer were trained as task-specific layers. The trace norm of $W^{(k)}$ is computed as (44):

$$\lVert W^{(k)}\rVert_* = \mathrm{Tr}\Big(\sqrt{W^{(k)\top}W^{(k)}}\Big) = \sum_{j}\sigma_j$$

where $\sum_{j}\sigma_j$ is the sum of singular values of $W^{(k)}$. To train the models, the gradient of $\lVert W^{(k)}\rVert_*$ should be defined during back-propagation. A numerically stable sub-gradient of $\lVert W^{(k)}\rVert_*$ can be computed as (44):

$$\frac{\partial \lVert W^{(k)}\rVert_*}{\partial W^{(k)}} = UV^{\top}$$

where U and V were obtained from the singular value decomposition of the $W^{(k)}$ matrix:

$$W^{(k)} = U\Sigma V^{\top}$$

The loss function was optimized using Adam optimizer (46) provided by TensorFlow version 2.7.0 (47).
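
The trace norm regularizer and its SVD-based sub-gradient can be sketched in NumPy. This is an illustration of the technique only, not the authors' TensorFlow implementation; the function names are hypothetical, and the stacking helper simply follows the description that each task's parameter tensor is flattened into one column.

```python
import numpy as np


def stack_task_weights(task_weights):
    """Flatten each task's layer weights and stack them as columns of one 2D matrix."""
    return np.stack([w.ravel() for w in task_weights], axis=1)


def trace_norm(w: np.ndarray) -> float:
    """Trace (nuclear) norm: the sum of the singular values of w."""
    return float(np.linalg.svd(w, compute_uv=False).sum())


def trace_norm_subgradient(w: np.ndarray) -> np.ndarray:
    """Sub-gradient U V^T obtained from the thin SVD w = U S V^T."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt
```

Minimizing this norm over the stacked matrix pushes the columns (tasks) toward a shared low-rank subspace, which is the "soft sharing" effect described above.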

Training and testing

Since not all genes had velocity values for all cells, training and testing sets were constructed on a per-gene basis. For each tissue, we split the RPKM matrix and TF activity score matrices by cells. The ratio between training cells and testing cells is 9:1. Then, for each gene, we removed the cells that did not have a velocity value available for that gene. This produced inputs with different numbers of training examples among the shared models. During training, an ensemble training batch was generated by sampling an equally sized batch for each task. The number of ensemble training batches for one epoch is equal to $\lceil N_{\max}/b \rceil$, where $N_{\max}$ is the number of cells for the gene with the most training examples among all tasks and b is the pre-specified batch size. $\lceil\cdot\rceil$ rounds a value up to the smallest integer greater than or equal to that value. Bootstrapping was used for genes with fewer available training samples. The above sampling process was repeated until all training examples had been used at least once, which constitutes one complete training epoch. After the specified number of epochs was finished, testing was performed on each gene separately using the test set. Standardization and scaling of the RPKM matrix and TF activity score matrices were performed separately for the training and testing sets as described previously. For each tissue, the above training and testing procedures were repeated three times. The R2 score was computed for each run and the final score was the average R2 from the three runs. All hyperparameters used during training are shown in Supplementary Table S2.
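
The batching scheme above can be illustrated with a small sketch. It assumes that bootstrapping means resampling a task's own cells with replacement until that task has enough examples to fill every ensemble batch; the function names and the use of a NumPy random generator are choices made for this example only.

```python
import math
import numpy as np


def n_ensemble_batches(cells_per_task, batch_size: int) -> int:
    """Batches per epoch: ceiling of (largest task size / batch size)."""
    return math.ceil(max(cells_per_task) / batch_size)


def sample_task_indices(n_cells: int, n_batches: int, batch_size: int,
                        rng: np.random.Generator) -> np.ndarray:
    """Return an (n_batches, batch_size) array of cell indices for one task.

    Every cell appears at least once; tasks with fewer cells than needed
    are topped up by sampling with replacement (assumed bootstrapping).
    """
    needed = n_batches * batch_size
    order = rng.permutation(n_cells)
    if n_cells < needed:
        extra = rng.integers(0, n_cells, size=needed - n_cells)
        order = np.concatenate([order, extra])
    return order[:needed].reshape(n_batches, batch_size)
```

With task sizes of 100, 250 and 90 cells and a batch size of 64, an epoch has four ensemble batches, and the 90-cell task is resampled to fill all 256 slots.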

Ranking TFs using deep SHAP

TFs were ranked by deep SHAP (Shapley Additive exPlanation) algorithm (36). Given a single input (a cell), deep SHAP assigns a SHAP value to each input feature. The SHAP value represents the impact of a specific input on the output compared to using a set of reference examples as inputs. In this study, we defined a reference example as an input vector with mean expressions of all cells and zero-value TF activity scores. The final importance score for each TF is then calculated by the following steps: (i) for each tissue, we computed a single reference example; (ii) we ran deep SHAP and compared all training examples to the single reference example; (iii) we summed absolute SHAP value for each input over all training examples and further aggregated for each TF the SHAP values from expressions and the TF activity scores. This resulted in a single non-negative importance score for each TF. The larger the importance score is, the more it contributes to the predicted outcome.
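
Step (iii) above, the aggregation of SHAP values into one importance score per TF, can be sketched as follows. The SHAP values themselves would come from a deep SHAP run on the trained model; here they are taken as given arrays. The layout assumed in this sketch (one array per input type, each of shape cells × TFs) and the function name are illustrative assumptions.

```python
import numpy as np


def tf_importance(shap_expr: np.ndarray,
                  shap_act_sum: np.ndarray,
                  shap_act_mean: np.ndarray) -> np.ndarray:
    """Aggregate SHAP values into one non-negative score per TF.

    Each input has shape (n_examples, n_tfs): SHAP values for the TF's
    expression, summed-activity and averaged-activity features.
    """
    total = np.zeros(shap_expr.shape[1])
    for block in (shap_expr, shap_act_sum, shap_act_mean):
        total += np.abs(block).sum(axis=0)  # sum |SHAP| over training examples
    return total


def rank_tfs(importance: np.ndarray) -> np.ndarray:
    """Indices of TFs sorted from most to least important."""
    return np.argsort(-importance, kind="stable")
```

Taking absolute values before summing means that strong activators and strong repressors both rank highly, since only the magnitude of the contribution counts.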

Construction of regulatory networks

We performed the following steps to construct a single regulatory network for each tissue: (i) for each gene, we ran model training and deep SHAP TF ranking for five rounds; (ii) we extracted the top-ranked 50 TFs in each run and obtained the common top TFs across all five runs; (iii) we connected the TFs obtained in step (ii) to the corresponding target gene; (iv) we repeated the above steps for all genes. Once a regulatory network is constructed, TFs can be further ranked by their degrees.
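
The construction steps above amount to a set intersection followed by a degree count, as in the following sketch. The data layout (a dictionary mapping each gene to its ranked TF lists, one list per run) and the function names are assumptions made for illustration.

```python
from collections import Counter


def consensus_tfs(rankings, top_k: int = 50):
    """TFs that appear in the top `top_k` of every run for one gene."""
    top_sets = [set(run[:top_k]) for run in rankings]
    return set.intersection(*top_sets)


def build_network(rankings_by_gene, top_k: int = 50):
    """Return (edges, degree): TF -> gene edges and the out-degree of each TF.

    `rankings_by_gene` maps a gene name to a list of ranked TF lists,
    one list per training run.
    """
    edges = {
        (tf, gene)
        for gene, runs in rankings_by_gene.items()
        for tf in consensus_tfs(runs, top_k)
    }
    degree = Counter(tf for tf, _ in edges)
    return edges, degree
```

Calling `degree.most_common()` on the result gives the degree-based TF ranking mentioned at the end of the paragraph.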

Benchmarking different methods and validating TFs

Benchmarking different methods

We compared MTLRank to other regulatory network inference methods. Linear models have been widely adopted for inferring regulatory networks using only gene expression data (9,23–25). We thus included a baseline LASSO regression method which predicts gene expression rather than RNA velocity. Furthermore, we also included a baseline NN model that uses the same architecture as MTLRank but without parameter sharing, two widely used network inference methods that only take expressions as inputs (GENIE3 (6) and GRNBoost2 (5)), and a network inference approach that takes expressions and ATAC-seq data (CellOracle (48)). For each tissue, we randomly sampled 500 genes and ran all prediction methods on these 500 genes. Specifically, for MTLRank, we first clustered the 500 genes into 20 equally sized clusters and ran the MTLRank training and testing framework as described previously. Unlike neural networks, LASSO, GENIE3 and GRNBoost2 cannot include different types of prior information as inputs. Hence, for these methods we only used the RPKM matrix as input to predict expressions of target genes. For CellOracle, in addition to expression data, we also used our TF activity matrices computed from scATAC-seq and ChIP-seq data as inputs, binarized by setting every non-zero entry to 1. We used the ridge regression model adopted by CellOracle (48) to train tissue-level models and predict velocities. Training and testing were performed as described previously. Hyperparameters used for these methods are summarized in Supplementary Table S3.

R2 score

We used the standard R2 score to evaluate how well the model predicts RPKM expressions or RNA velocities of the target genes. It is defined by the equation below:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ denotes the true RNA velocity/RPKM expression for the ith target gene in the testing set, $\hat{y}_i$ denotes the corresponding predicted value, and $\bar{y}$ denotes the mean of these true values.
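
For reference, the standard R2 score (coefficient of determination) can be computed as in the short NumPy sketch below. The function name is chosen for this example; the definition itself is the usual one, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction gives a score of 1, predicting the mean of the true values gives 0, and worse-than-mean predictions give negative scores.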

Validating TFs

While the gold-standard for tissue-specific regulatory network is lacking, the roles of the predicted TFs could be validated using various types of experimental evidence. To validate TFs in the predicted tissue-specific regulatory networks, we downloaded curated TFs deposited at TF-Marker database (49) (https://bio.liclab.net/TF-Marker/). TF-Marker has collected various types of tissue-specific TFs that are tissue markers, or are regulating the tissue marker genes, or are regulated by tissue marker genes. We used the overlap between tissue-specific marker TFs and the input TFs in RPKM matrix as the ‘ground truth TF’ for the predicted tissue regulatory networks. Supplementary Table S4 shows the number of ‘ground truth TF’ in each tissue. The recall rate was used to evaluate the recovery of tissue-specific TFs:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP stands for true positives, the number of successfully recovered tissue-specific TFs, and FN stands for false negatives, the number of true tissue-specific TFs not identified by the model. An alternative metric is the precision score, which is defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where FP stands for false positives, the number of predicted positive TFs that were not collected in the TF-Marker database. The precision score, however, could be biased because we might not have enough ‘true’ TFs. The numerator could be underestimated when the set of TP is incomplete. We have collected from the database a number of tissue-specific TFs (TP), but we are likely missing several more. TFs selected by the model but not collected by the database are not necessarily false. Because the ground truth is incomplete, we did not include the precision score in our evaluation.
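
The recall computation can be expressed directly on sets of TF names, as in this minimal sketch. The function name and the example gene sets are illustrative only; `truth` plays the role of the ‘ground truth TF’ set and `predicted` the TFs in a tissue-specific network.

```python
def recall(predicted: set, truth: set) -> float:
    """Fraction of ground-truth marker TFs recovered by the predicted network."""
    tp = len(predicted & truth)   # marker TFs present in the network
    fn = len(truth - predicted)   # marker TFs missed by the network
    return tp / (tp + fn)
```

Because TP + FN is simply the size of the ground-truth set, recall does not depend on how many unlisted TFs the model selects, which is why it is less affected by an incomplete marker database than precision.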

RESULTS

MTLRank, a multi-task learning based framework for predicting regulatory associations

To characterize the dynamics of gene regulation, we developed a multi-task learning (MTL) based framework, MTLRank, that uses tissue-specific single cell data and general information to determine TF–gene interactions. Our method uses single cell RNA-seq and ATAC-seq, and ChIP-seq from bulk studies. For this, we collected and processed scRNA-seq and scATAC-seq data sets from 6 human tissues and 29 individual donors (Figure 1A, Supplementary Table S1). All data sets were from the HuBMAP consortium (35). To model RNA velocity using activities from TFs, we used TF expressions from the processed scRNA-seq data. scATAC-seq data was combined with ChIP-seq binding site data to generate TF–gene activity scores (Figure 1B, Table 1). Using these data sets as inputs, we built gene-specific models in each tissue to predict the velocity of genes from TF expression and activity. Model parameters were shared for models within the same tissue using a ‘soft sharing’ MTL method (Figure 1C, Materials and Methods). We further used the learned models to rank TFs and construct tissue-specific regulatory networks.

Table 1.

Number of TFs, target genes, cells for scRNA-seq data and number of cells, peaks for scATAC-seq data

Type | Metric | Liver | Heart | Left kidney | Large intestine | Spleen | Right lung
scRNA-seq | TFs | 864 | 2147 | 2276 | 1736 | 1350 | 2399
scRNA-seq | Genes | 1308 | 554 | 2019 | 515 | 1827 | 2380
scRNA-seq | Cells | 4656 | 25 280 | 93 640 | 11 960 | 67 226 | 54 127
scATAC-seq | Peaks | 113 726 | 270 033 | 8887 | 98 077 | 27 645 | 27 371
scATAC-seq | Cells | 50 748 | 310 661 | 48 264 | 8368 | 11 857 | 17 482

Comprehensive evaluation of MTLRank models

Evaluation based on prediction of velocities

Following the preprocessing steps (Methods), we obtained 4656–93 640 scRNA cells and 8368–310 661 scATAC cells for each tissue (Table 1). To evaluate MTLRank, we compared velocity and gene expression predictions, based on cross-validation R2 scores, between MTLRank and several prior methods for learning TF-gene interactions. These methods included LASSO regression (50), a baseline neural network model (NN), GENIE3 (6) and GRNBoost2 (5) (see Materials and Methods). We tested these methods either as expression-based methods or velocity-based methods. For expression-based methods, we used the TF RPKM as input and the gene expression RPKM as output, similar to prior strategies (5,6,51). For velocity-based methods, we incorporated two additional sources of information: TF activity scores as inputs and the gene velocity values as outputs.

Results, presented in Figure 2, show that MTLRank using RPKM of TFs + TF activity scores as inputs and velocities as outputs outperforms other methods and other input-output combinations (Figure 2A). Additionally, the baseline NN model that uses the same architecture as MTLRank yielded better predictions for velocity values than for RPKM expressions even when using the same inputs (Figure 2A). We observed only a marginal improvement when adding TF activity scores for liver, large intestine, and left kidney in baseline NN models (Figure 2A). Unlike TF expressions, which are derived from the same cells from which we obtained the velocity for the gene, scATAC-seq is from a different cell (or, in our case, a set of cells) for the above tissues (Materials and Methods). Our current framework addresses this by averaging the scATAC-seq signals across different cells and using these averaged signals as inputs for model training. This may have impacted model performance. To test this, we further collected SNARE-seq (single-nucleus chromatin accessibility and mRNA expression sequencing (52)) data sets for lung and kidney tissues (35). Unlike standard scATAC-seq, SNARE-seq profiles chromatin accessibility and RNA expression in the same single cell (Table 2). Training and testing were performed using the same strategy as previously described (Materials and Methods). Our results demonstrate that the single-cell-specific TF activity scores indeed improved the model performance when such information was available (Figure 2B). This suggests that averaging the signals from scATAC-seq data can lower the model performance. Note that the data used for Figure 2A and B are from different experiments (and tissues) and therefore the results are not directly comparable.

Figure 2.


Comprehensive evaluation of MTLRank models. (A) R2 scores for the 500 randomly sampled genes in each tissue. TFA: TF activity score matrix; velo: RNA velocities; NN: baseline neural network model. (B) R2 scores for the 500 randomly sampled genes in each tissue from SNARE-seq data. The left part shows R2 for all 500 genes and the right part shows R2 for genes that have >200 available TF activity scores from the SNARE-seq data. Bars marked ‘single cell’ correspond to scATAC-seq signals paired with the scRNA-seq data at the single cell level, and bars marked ‘average’ correspond to scATAC-seq signals averaged across cells. (C) Recall percentage of the TF-Marker database marker genes. The recall percentage was computed for the TFs in each tissue-specific network. The two numbers on top of each bar indicate the total number of recallable TFs versus the number of TFs recalled by the method. P-values from the hypergeometric test are also marked on top of the bars; P-values <0.05 are marked in red. The total number of recallable TFs is the size of the intersection between the tissue markers from the TF-Marker database and all available TFs in the RPKM matrix.

scRNA-seq is known to suffer from drop-out events, characterized by a high proportion of zero expression values (53). We performed an additional experiment to test the robustness of our method to drop-out events. Results, presented in Supplementary Figure S3 and Supplementary file 1, indicate that our framework can still perform well even when the drop-out rate reaches 60%, although there is a slight loss of performance at this rate.

Evaluation based on recall of marker TFs

The R2 score-based evaluation can quantify the model performance in general. However, it does not validate whether the models correctly recover specific TFs. Therefore, we sought to validate whether the TFs predicted in the tissue-specific networks are indeed key TFs for that tissue (Materials and Methods). Due to the lack of ‘gold-standard’ data sets, systematic evaluation of TFs is a challenging task. We thus first looked at whether the recovered TFs are known to be tissue-specific markers using the TF-Marker database (49). We compared the recall percentage of TF markers to that of a random selection method, which selected the same number of TFs as present in each tissue-specific network. The results, presented in Figure 2C, indicate that MTLRank models outperformed the random selection method for TF marker recall (Table 3, Supplementary file 4). MTLRank achieved the best performance in liver, which is also the best performing tissue in the R2 score evaluation. We computed P-values using the hypergeometric test (using the number of input TFs in each tissue as background) to evaluate whether the identified TFs were enriched for known tissue-specific TF markers. We observed significant results for liver (corrected P-value = 1.59 × 10^−8), kidney (corrected P-value = 1.70 × 10^−4), and lung (corrected P-value = 2.01 × 10^−2). In contrast, TFs selected by the random method did not significantly overlap with TF markers in any tissue (Figure 2C). Further, the recall values for MTLRank are significantly higher than the recall values from the random method (Wilcoxon rank-sum test, P-value = 9 × 10^−3). Next, we ranked identified TFs by degree in each tissue and found that MTLRank successfully identified some well-known tissue TF markers. For example, JUN, FOS and MAF were TF markers that appeared among the top 10 TFs in liver (Table 3). JUN and FOS are involved in liver development and regeneration (54), and MAF plays an important role in erythropoiesis in fetal liver (55). 
In the top 10 TFs of kidney, we found two TF markers (Table 3) including ID1, a transcriptional inhibitor that has been reported to drive dedifferentiation of kidney epithelial cells (56), and EGR1, an early growth response protein associated with diabetic kidney disease (57,58). ID1 and FOS were also found in the top 10 TFs of right lung (Table 3). ID1 has been shown to promote migration of lung cancer cells (59), and FOS was reported to regulate inflammatory response during acute lung injury (60).
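
The enrichment test used above can be illustrated with a standard-library sketch of the hypergeometric upper tail. The function and parameter names are chosen for this example, and the multiple-testing correction applied to the reported P-values is not reproduced because the text does not name the method.

```python
from math import comb


def hypergeom_upper_tail(k: int, background: int, markers: int, selected: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    background: number of input TFs in the tissue
    markers:    number of those TFs that are known tissue markers
    selected:   number of TFs in the predicted network
    k:          number of marker TFs found in the predicted network
    """
    upper = min(markers, selected)
    favourable = sum(
        comb(markers, i) * comb(background - markers, selected - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(background, selected)
```

As a small worked case, with 10 background TFs, 3 markers and 4 selected TFs, the probability of recovering at least 2 markers by chance is (3·21 + 1·7)/210 = 1/3.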

Table 3. Validated TF markers among the top 20 predicted TFs (sorted by degree) in each tissue-specific network

| Ensembl ID | Gene name | Comment | PMID | Tissue |
| --- | --- | --- | --- | --- |
| ENSG00000177606 | JUN | AP-1 Transcription Factor Subunit | 22105228; 31612883; 31781649; 30901906; 21997551 | Liver, heart, large intestine, |
| ENSG00000170345 | FOS | AP-1 Transcription Factor Subunit | 31781649; 31536749 | Liver, right lung |
| ENSG00000178573 | MAF | V-Maf Avian Musculoaponeurotic Fibrosarcoma Oncogene | 31612883; 32116021 | Liver |
| ENSG00000125968 | ID1 | Inhibitor Of DNA Binding 1 | 21921784; 16473539 | Left kidney |
| ENSG00000120738 | EGR1 | Early Growth Response 1 | 24819335 | Left kidney |
| ENSG00000107485 | GATA3 | GATA-Binding Factor 3 | 30696889 | Left kidney |
| ENSG00000141905 | NFIC | Nuclear Factor I C | 32195335 | Right lung |

Evaluation based on TF perturbation data

Gold-standard gene regulatory networks (GRNs) for evaluating GRN inference approaches are currently lacking. We therefore used a high-throughput TF perturbation data set as an additional validation data set, in which the ground truth is anticipated to be similar to genes related to the perturbed TF. We applied our framework to a single cell dataset with perturbed c-Jun TF in human CAR-T cells (61). We obtained target genes for c-Jun from JASPAR motif analysis (downloaded from the Harmonizome website (62)). We aimed to evaluate whether our framework can successfully recover the TFs that are target genes of c-Jun (hereafter referred to as c-Jun related TFs). We used scRNA-seq data and ATAC-seq data from CAR-T cells, and ChIP-seq data from Cistrome DB (41), as inputs and examined whether our method can recover c-Jun related TFs. We ranked the top TFs following the same procedure as in our previous analysis. The results suggest that the top ranked TFs from our model are significantly enriched for c-Jun related TFs (Supplementary Figure S2 and Supplementary file 1), and that MTLRank successfully captured the regulatory response to the perturbation on c-Jun related TFs.

Tissue-specific networks reveal key pathways and essential organ functions

We next constructed tissue-specific networks by performing model-based ranking for TFs. Briefly, we trained models using all available target genes in each tissue. To construct tissue-specific networks, the top 50 commonly predicted TFs among different runs were selected as the regulators for the corresponding target gene. We next used the networks to compute the importance of each TF for each gene using deep SHAP (Materials and Methods). See Supplementary file 2 for the predicted tissue-specific networks. We ranked TFs in each network by their degrees and found that four TFs (FOS, JUN, RPS27A and ELF3) are present among the top 10 TFs in at least three tissues. FOS was found in liver, spleen, left kidney, and right lung; JUN was found in liver, spleen, and right lung; RPS27A was found in spleen, left kidney, and right lung; ELF3 was found in left kidney, large intestine, and right lung. Among the four TFs, FOS, JUN and RPS27A were labeled as ‘low tissue specificity’ in The Human Protein Atlas portal (https://www.proteinatlas.org/) (63), supporting their universal roles in gene regulation. FOS and JUN are known to interact by forming the heterodimeric protein AP-1 (64), which is involved in various key cellular pathways including cell proliferation (64), apoptosis (65), differentiation (66) and activation of T cells (67). In the tissue-specific networks, FOS and JUN were jointly selected in liver, spleen, and lung. In particular, for spleen there are 266 predicted targets for FOS and 184 predicted targets for JUN, with 65 common targets (hypergeometric P-value = 5.14 × 10⁻¹⁴, using genes listed in Table 1 as background). We further investigated the union of FOS-JUN targets by hypergeometric test-based GO overrepresentation analysis. The results show that GO terms related to T cell activation/differentiation and interleukin production are significantly overrepresented in these genes (Supplementary Table S5 and Supplementary file 1).
This observation is consistent with previous findings that FOS and JUN can co-regulate interleukin 2, which in turn modulates cell-mediated immunity (68). RPS27A is a ribosomal protein involved in cell proliferation, regulation of the cell cycle, and apoptosis (69). ELF3 is one of the ETS (Erythroblast Transformation Specific) family transcription factors that are involved in the regulation of the inflammatory response (70). The predicted active roles of ELF3 in kidney and lung were also previously reported (71).
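
The network construction and degree-based ranking described above can be sketched as follows. This is one possible reading of "commonly predicted TFs among different runs", under the assumption that TFs are ordered by how many runs report them and then by their mean rank; the function names, toy TF and gene labels, and the cutoff `k=2` (50 in the text) are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_regulators(runs, k):
    """Pick the k TFs most consistently reported across runs.

    runs: list of ranked TF lists (best first) for one target gene.
    TFs are ordered by number of runs reporting them (descending),
    then by mean rank (ascending), then by name for determinism.
    """
    freq = Counter()
    rank_sum = defaultdict(int)
    for ranked in runs:
        for pos, tf in enumerate(ranked):
            freq[tf] += 1
            rank_sum[tf] += pos
    order = sorted(freq, key=lambda tf: (-freq[tf], rank_sum[tf] / freq[tf], tf))
    return order[:k]


def rank_tfs_by_degree(network):
    """network: dict mapping target gene -> list of regulator TFs.
    Returns (TF, number of targets) pairs, highest degree first."""
    degree = Counter(tf for regs in network.values() for tf in set(regs))
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy input: two training runs per target gene.
runs_by_target = {
    "GENE_A": [["TF1", "TF2", "TF3"], ["TF2", "TF1", "TF4"]],
    "GENE_B": [["TF1", "TF5", "TF2"], ["TF5", "TF1", "TF2"]],
}
network = {gene: consensus_regulators(runs, k=2)
           for gene, runs in runs_by_target.items()}
ranking = rank_tfs_by_degree(network)
print(network)
print(ranking)
```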

We also performed gene set enrichment analysis (GSEA) for the TFs and target genes in each network, using their respective degree rankings as inputs to GSEA. The results (Supplementary file 3) show that TFs/target genes in the liver, spleen and kidney networks were enriched for distinctive annotations related to essential organ functions. For example, the KEGG (72) ‘Drug metabolism’ pathway is enriched among liver target genes (corrected P-value = 0.040); the KEGG ‘Th1 (T helper) and Th2 cell differentiation’ pathway is enriched among spleen target genes (corrected P-value = 0.002), and the KEGG ‘MAPK signaling pathway’ is enriched among spleen TFs (corrected P-value = 0.010). Based on the GSEA results, we performed leading edge analysis (73) to extract the genes that contribute most to the identified enriched terms. We then used these genes and their regulators to construct three tissue subnetworks with distinctive organ functions, as we discuss below.
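
To make the leading edge idea concrete, the sketch below computes a GSEA-style weighted running-sum enrichment score on a degree-ranked gene list and returns the subset of set members that precede the peak of the running sum. It is a simplified illustration, not the GSEA software used in the analysis: permutation-based significance and multiple-testing correction are omitted, and the gene names, scores and function name are hypothetical.

```python
def enrichment_score(ranked, scores, gene_set, weight=1.0):
    """Return (ES, leading_edge) for a gene set on a pre-ranked list.

    ranked: genes sorted by decreasing score (here, network degree).
    scores: dict mapping gene -> ranking score.
    """
    in_set = [g in gene_set for g in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    hit_norm = sum(abs(scores[g]) ** weight
                   for g, hit in zip(ranked, in_set) if hit)
    if n_hits == 0 or n_miss == 0 or hit_norm == 0:
        raise ValueError("gene set must partially overlap the ranked list")

    running, best, peak = 0.0, 0.0, -1
    for i, (g, hit) in enumerate(zip(ranked, in_set)):
        if hit:
            running += abs(scores[g]) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best, peak = running, i

    if best >= 0:
        # Positive ES: set members at or before the peak.
        leading = [g for g, hit in zip(ranked[:peak + 1], in_set[:peak + 1]) if hit]
    else:
        # Negative ES: set members after the trough.
        leading = [g for g, hit in zip(ranked[peak + 1:], in_set[peak + 1:]) if hit]
    return best, leading


# Hypothetical toy data: eight genes with degrees 8..1.
scores = {f"G{i}": 9 - i for i in range(1, 9)}
ranked = sorted(scores, key=scores.get, reverse=True)
pathway = {"G1", "G2", "G4", "G8"}

es, leading_edge = enrichment_score(ranked, scores, pathway)
print(round(es, 3), leading_edge)
```

In this toy case only the set members that appear before the running sum reaches its maximum are retained, which mirrors how leading edge genes are used above to build the tissue subnetworks.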

Liver drug metabolism network

Drug metabolism is one of the key liver functions (74). Leading edge analysis identified 12 genes from the KEGG pathway ‘Drug metabolism’ (Supplementary file 3), among which CYP3A4 is one of the top 10 target genes in terms of degree in the liver network (Figure 3A). CYP3A4 is an important enzyme from the cytochrome P450 family that catalyzes many drug metabolism related reactions (75). As predicted, CYP3A4 is primarily found in liver (76). The other four target genes in the liver drug metabolism network (CYP2B6, CYP2E1, CYP3A5, CYP2C19) are also from the cytochrome P450 family (Figure 3A). Their active roles during drug metabolism processes were previously reported (77–80).

Figure 3.

Tissue-specific subnetworks with distinctive organ functions. (A) Predicted drug metabolism network in liver. (B) Predicted MAPK-Th1/Th2 network in spleen. (C) Predicted calcium signaling network in kidney. In each panel, the top 10 regulators and top 10 target genes from the original tissue-specific network are marked by red circles.

Spleen MAPK-Th1/Th2 network

Activation of the immune response is an essential function of the spleen (81). Target genes identified for spleen are enriched for various immune-related KEGG pathway terms, including ‘C-type lectin receptor signaling pathway’ (corrected P-value = 0), ‘Th1 (T helper 1) and Th2 cell differentiation’ (corrected P-value = 0.002) and ‘T cell receptor signaling pathway’ (corrected P-value = 0.023) (Supplementary file 2). MAPKs (mitogen-activated protein kinases) are known to promote the Th1 immune response by regulating the production of cytokines (82). To further investigate how MAPKs can modulate the Th1/Th2 immune response, we extracted leading edge genes annotated as ‘MAPK signaling pathway’ for TFs and those annotated as ‘Th1 and Th2 cell differentiation’ for target genes. This yielded a small network with 5 TFs and 13 target genes. This subnetwork contained two HSPA (heat shock protein A) family TFs targeting IL2RB (interleukin 2 receptor subunit beta) (Figure 3B), a cytokine receptor important for Th1 cell differentiation (83). Previous studies have also observed a strong correlation between MAPK genes and HSPA family genes (84,85). Although direct interactions between HSPA genes and IL2RB have not been reported, it is likely that HSPA genes activate the MAPKs upon sensing extracellular/intracellular stress, which in turn regulate the immune response by regulating cytokines and their receptors. Taken together, the spleen MAPK-Th1/Th2 network may represent the MAPK-mediated immune response in spleen tissue.

Kidney calcium signaling network

One of the important functions of the kidney is maintaining calcium balance (86,87). Target genes for kidney are enriched for the KEGG pathway ‘Calcium signaling pathway’ (corrected P-value = 0). Leading edge analysis identified 49 target genes associated with calcium signaling. We further extracted these 49 target genes and their regulators from the kidney network (Figure 3C). Among these genes we found two CAMK (Ca²⁺/calmodulin-dependent protein kinase) family genes (CAMK4 and CAMK2D) and one CALM (calmodulin) family gene (CALM1). Genes from both families are important for calcium signaling (88,89). Additionally, we found multiple genes from the FGF (fibroblast growth factor) family and its receptor family FGFR in the calcium signaling network (FGF10, FGF7, FGFR3, FGF9; Figure 3C). FGFs can regulate calcium metabolism by interacting with klotho proteins (90), suggesting possible roles for the identified FGF and FGFR genes.

We also conducted experiments to test the ability of MTLRank to infer cell type specific regulatory networks. See Supplementary Figure S4 and Supplementary file 1 for more information.

Web portal for the query of tissue-specific networks

To facilitate the query of the predicted tissue-specific interactions, we built a web portal, HuBNet (https://hubnet-qs.herokuapp.com/). Interactions in each tissue are ranked by their importance scores (Materials and Methods). Users may query the predicted tissue-specific interactions by providing a list of TFs and target genes of interest and selecting a tissue. The web portal returns the query results in real time, which can be viewed in a table format or as network graphs (Supplementary Figure S5 and Supplementary file 1).
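
The kind of query described above can be illustrated offline with a small in-memory filter. The sketch below does not use or reproduce the HuBNet interface; the record layout (`Interaction`), the function name, and all tissue, TF, gene and score values are hypothetical placeholders.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    tissue: str
    tf: str
    target: str
    importance: float  # hypothetical importance score


def query_interactions(interactions, tissue, tfs=None, targets=None):
    """Return interactions for one tissue, optionally restricted to the
    given TFs and/or target genes, sorted by decreasing importance."""
    tf_filter = set(tfs) if tfs else None
    target_filter = set(targets) if targets else None
    hits = [
        x for x in interactions
        if x.tissue == tissue
        and (tf_filter is None or x.tf in tf_filter)
        and (target_filter is None or x.target in target_filter)
    ]
    return sorted(hits, key=lambda x: x.importance, reverse=True)


# Hypothetical toy records.
interactions = [
    Interaction("liver", "TF_A", "GENE_1", 0.9),
    Interaction("liver", "TF_B", "GENE_1", 0.4),
    Interaction("liver", "TF_A", "GENE_2", 0.7),
    Interaction("spleen", "TF_A", "GENE_1", 0.8),
]

hits = query_interactions(interactions, tissue="liver", tfs=["TF_A"])
for h in hits:
    print(h.tf, h.target, h.importance)
```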

DISCUSSION

Recent advances in single cell technologies have led to a surge of studies that generate tissue-specific genomic profiles at atlas scale. The HuBMAP consortium is one such effort, generating several tissue-specific single cell genomic data sets (91–93).

Here, we developed MTLRank, a deep neural network (NN) method that incorporates chromatin accessibility, TF binding site information, and RNA velocity values to predict TF–gene interactions across multiple HuBMAP tissues. We showed that MTLRank can accurately identify TF–gene interactions and that it provides known and new interactions that can shed new light on the activity and pathways in several different tissues.

While RNA velocity analysis improves performance, it also raises new challenges. Due to the nature of RNA velocity computation, many genes do not have available velocity values in all cells. To address this challenge, MTLRank relies on multi-task learning to share parameters across individual gene models. A similar parameter-sharing strategy has been shown in the past to reduce model overfitting and improve model generalization (44).

The predicted regulatory associations between TFs and target genes are quantified in a model-specific manner, rather than by using a model-agnostic ranking method. This improves interpretability of the trained models and may help identify higher order TF–TF correlations that contribute to the ranking of input features. As co-regulation among TFs is also a topic of interest in regulatory genomics, the trained models may be further explored to reveal such interactions.

The lack of a gold-standard validation data set poses a great challenge for methods that predict TF–gene associations. While synthetic data sets can provide a ground truth to be recovered (94), we believe such a strategy cannot evaluate approaches that use multiple types of prior information and does not characterize the importance of tissue-specific TFs. To address this issue, we performed an R² score-based evaluation and a tissue-specific TF marker-based evaluation in this study. We also examined whether these sources of validation overlapped with the ChIP-seq TF data we used for learning the models. We found that in the TF-Marker database, only five TFs for heart were assigned using ChIP-seq experiments (ELK1, ESR1, HAND1, NFIL3, POU3F2), and we thus removed these from the analysis. All other validated TFs are based on independent data and so can be used for validation.

While the tissue-specific networks have successfully reconstructed the organ functional pathways, our framework has a few limitations. First, the size of the training data set is limited by the number of cells and genes with available RNA velocity values. This could possibly be improved in the future with improvements in methods to compute RNA velocity values. Second, it is difficult to fine-tune the hyperparameters for thousands of models. For tissues with >2000 target genes and 2000 TFs, training models and computing SHAP values could be challenging (Supplementary Figure S6). Third, due to the small number of known TF markers, we were not able to test the recall of TF markers for spleen, where strong immune response pathways were found.

Finally, there are several potential directions to further expand the work. When time series data are available in addition to snapshot data, such data can be integrated to further improve predictions. In addition, extracting TF–TF correlations from the trained model may identify co-factors important for the given functional pathways.

AVAILABILITY

The scripts for preprocessing, modeling and analysis of predicted interactions are available in the GitHub repository: https://github.com/alexQiSong/MTLRank.

DATA AVAILABILITY

The data underlying this article were provided by the HuBMAP consortium and Cistrome database. Data will be shared on request to the HuBMAP consortium and Cistrome database maintenance team.

ACCESSION NUMBERS

scRNA-seq, scATAC-seq and SNARE-seq datasets used in this study can be accessed at the HuBMAP data portal: https://portal.hubmapconsortium.org/. All accession numbers used in this study are listed in Supplementary Table S1.

ACKNOWLEDGEMENTS

The results here are based upon data generated by the NIH Human BioMolecular Atlas Program (HuBMAP).

Author contributions: Q.S. and Z.B.J. conceived the original idea of the project. Q.S. developed the MTLRank machine learning framework and performed analysis on the predicted interactions. M.R. processed raw single cell data from the HuBMAP database. Q.S., M.R. and Z.B.J. wrote the manuscript.

Contributor Information

Qi Song, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Matthew Ruffalo, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Ziv Bar-Joseph, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

NIH [OT2OD026682, 1U54AG075931, 1U24CA268108 to Z.B.J., in part]. Funding for open access charge: NIH.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Sima C., Hua J., Jung S. Inference of gene regulatory networks using time-series data: a survey. Curr. Genomics. 2009; 10:416–429.
  • 2. Qin J., Hu Y., Xu F., Yalamanchili H.K., Wang J. Inferring gene regulatory networks by integrating ChIP-seq/chip and transcriptome data via LASSO-type regularization methods. Methods. 2014; 67:294–303.
  • 3. Gu F., Hsu H.K., Hsu P.Y., Wu J., Ma Y., Parvin J., Huang T.H.M., Jin V.X. Inference of hierarchical regulatory network of estrogen-dependent breast cancer through ChIP-based data. BMC Syst. Biol. 2010; 17:170.
  • 4. Chen X., Gu J., Wang X., Jung J.G., Wang T.L., Hilakivi-Clarke L., Clarke R., Xuan J. CRNET: an efficient sampling approach to infer functional regulatory networks by integrating large-scale ChIP-seq and time-course RNA-seq data. Bioinformatics. 2018; 34:1733–1740.
  • 5. Moerman T., Aibar Santos S., Bravo González-Blas C., Simm J., Moreau Y., Aerts J., Aerts S. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics. 2019; 35:2159–2161.
  • 6. Huynh-Thu V.A., Irrthum A., Wehenkel L., Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010; 5:e12776.
  • 7. Aibar S., González-Blas C.B., Moerman T., Huynh-Thu V.A., Imrichova H., Hulselmans G., Rambow F., Marine J.C., Geurts P., Aerts J. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods. 2017; 14:1083–1086.
  • 8. Matsumoto H., Kiryu H., Furusawa C., Ko M.S.H., Ko S.B.H., Gouda N., Hayashi T., Nikaido I. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-Seq during differentiation. Bioinformatics. 2017; 33:2314–2321.
  • 9. Song Q., Lee J., Akter S., Rogers M., Grene R., Li S. Prediction of condition-specific regulatory genes using machine learning. Nucleic Acids Res. 2021; 48:e62.
  • 10. Hamey F.K., Nestorowa S., Kinston S.J., Kent D.G., Wilson N.K., Gottgens B. Reconstructing blood stem cell regulatory network models from single-cell molecular profiles. Proc. Natl. Acad. Sci. U.S.A. 2017; 114:5822–5829.
  • 11. Moignard V., Woodhouse S., Haghverdi L., Lilly A.J., Tanaka Y., Wilkinson A.C., Buettner F., MacAulay I.C., Jawaid W., Diamanti E. et al. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nat. Biotechnol. 2015; 33:269–276.
  • 12. Zhu Z., Pilpel Y., Church G.M. Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J. Mol. Biol. 2002; 318:71–81.
  • 13. Popp A.P., Hettich J., Gebhardt J.C.M. Altering transcription factor binding reveals comprehensive transcriptional kinetics of a basic gene. Nucleic Acids Res. 2021; 49:6249–6266.
  • 14. Larsen S.J., Röttger R., Schmidt H.H.H.W., Baumbach J. E. coli gene regulatory networks are inconsistent with gene expression data. Nucleic Acids Res. 2019; 47:85–92.
  • 15. Chen G., Ning B., Shi T. Single-cell RNA-seq technologies and related computational data analysis. Front. Genet. 2019; 10:317.
  • 16. Zhang Z., Liao B., Xu M., Jin Y. Post-translational modification of POU domain transcription factor Oct-4 by SUMO-1. FASEB J. 2007; 21:3042–3051.
  • 17. Morse A.M., Whetten R.W., Dubos C., Campbell M.M. Post-translational modification of an R2R3-MYB transcription factor by a MAP Kinase during xylem development. New Phytol. 2009; 183:1001–1013.
  • 18. Orosa-Puente B., Leftley N., von Wangenheim D., Banda J., Srivastava A.K., Hill K., Truskina J., Bhosale R., Morris E., Srivastava M. et al. Root branching toward water involves posttranslational modification of transcription factor ARF7. Science. 2018; 362:1407–1410.
  • 19. de la Fuente A. From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010; 26:326–333.
  • 20. Butte A.J., Kohane I.S. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput. 2000; 2000:418–429.
  • 21. Margolin A.A., Nemenman I., Basso K., Wiggins C., Stolovitzky G., Favera R.D., Califano A. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006; 7(Suppl. 1):S7.
  • 22. Haury A.C., Mordelet F., Vera-Licona P., Vert J.P. TIGRESS: trustful Inference of Gene REgulation using Stability Selection. BMC Syst. Biol. 2012; 6:145.
  • 23. Liu L.Z., Wu F.X., Zhang W.J. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets. BMC Syst. Biol. 2014; 8(Suppl. 3):S1.
  • 24. Omranian N., Eloundou-Mbebi J.M.O., Mueller-Roeber B., Nikoloski Z. Gene regulatory network inference using fused LASSO on multiple data sets. Sci. Rep. 2016; 6:20533.
  • 25. Nguyen P., Braun R. Time-lagged Ordered Lasso for network inference. BMC Bioinformatics. 2018; 19:545.
  • 26. Fujita A., Sato J.R., Garay-Malpartida H.M., Morettin P.A., Sogayar M.C., Ferreira C.E. Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics. 2007; 23:1623–1630.
  • 27. Fujita A., Sato J.R., Garay-Malpartida H.M., Yamaguchi R., Miyano S., Sogayar M.C., Ferreira C.E. Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Syst. Biol. 2007; 1:39.
  • 28. Yuan Y., Bar-Joseph Z. Deep learning for inferring gene relationships from single-cell expression data. Proc. Natl. Acad. Sci. U.S.A. 2019; 116:27151–27158.
  • 29. Yu J., Smith A., Wang P.P., Hartemink E.J., Jarvis E.D., Smith V.A., Hartemink A.J. Using Bayesian network inference algorithms to recover molecular genetic regulatory networks. 3rd International Conference on Systems Biology. 2002.
  • 30. Li Z., Li P., Krishnan A., Liu J. Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis. Bioinformatics. 2011; 27:2686–2691.
  • 31. Perrin B.E., Ralaivola L., Mazurie A., Bottani S., Mallet J., D’Alché-Buc F. Gene networks inference using dynamic Bayesian networks. Bioinformatics. 2003; 19:ii138–ii148.
  • 32. Glass K., Huttenhower C., Quackenbush J., Yuan G.C. Passing messages between biological networks to refine predicted interactions. PLoS One. 2013; 8:e64832.
  • 33. Banf M., Rhee S.Y. Enhancing gene regulatory network inference through data integration with markov random fields. Sci. Rep. 2017; 7:41174.
  • 34. Bergen V., Lange M., Peidli S., Wolf F.A., Theis F.J. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat. Biotechnol. 2020; 38:1408–1414.
  • 35. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature. 2019; 574:187–192.
  • 36. Lundberg S.M., Lee S.I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017; 30:4768–4777.
  • 37. Patro R., Duggal G., Love M.I., Irizarry R.A., Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017; 14:417–419.
  • 38. Schneider V.A., Graves-Lindsay T., Howe K., Bouk N., Chen H.-C., Kitts P.A., Murphy T.D., Pruitt K.D., Thibaud-Nissen F., Albracht D. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017; 27:849–864.
  • 39. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760.
  • 40. Fang R., Preissl S., Li Y., Hou X., Lucero J., Wang X., Motamedi A., Shiau A.K., Zhou X., Xie F. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 2021; 12:1337.
  • 41. Zheng R., Wan C., Mei S., Qin Q., Wu Q., Sun H., Chen C.H., Brown M., Zhang X., Meyer C.A. et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 2019; 47:D729–D735.
  • 42. Yu G., Wang L.G., He Q.Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015; 31:2382–2383.
  • 43. Lambert S.A., Jolma A., Campitelli L.F., Das P.K., Yin Y., Albu M., Chen X., Taipale J., Hughes T.R., Weirauch M.T. The Human Transcription Factors. Cell. 2018; 175:598–599.
  • 44. Yang Y., Hospedales T.M. Trace norm regularised deep multi-task learning. 5th International Conference on Learning Representations, ICLR 2017 - Workshop Track Proceedings. 2019.
  • 45. Bradley P., Bennett K., Demiriz A. Constrained k-means clustering. 2000.
  • 46. Kingma D.P., Ba J.L. Adam: a method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. 2015.
  • 47. Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G.S., Davis A., Dean J., Devin M. et al. Tensorflow: large-scale machine learning on heterogeneous distributed systems. 2016; arXiv doi: https://arxiv.org/abs/1603.04467, 14 March 2016, preprint: not peer reviewed.
  • 48. Kamimoto K., Hoffmann C.M., Morris S.A. CellOracle: dissecting cell identity via network inference and in silico gene perturbation. 2020; bioRxiv doi: 10.1101/2020.02.17.947416, 21 April 2020, preprint: not peer reviewed.
  • 49. Xu M., Bai X., Ai B., Zhang G., Song C., Zhao J., Wang Y., Wei L., Qian F., Li Y. et al. TF-Marker: a comprehensive manually curated database for transcription factors and related markers in specific cell and tissue types in human. Nucleic Acids Res. 2022; 50:D402–D412.
  • 50. Santosa F., Symes W.W. Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 1986; 7:1307–1330.
  • 51. Jensen P.A., Lutz K.A., Papin J.A. TIGER: toolbox for integrating genome-scale metabolic models, expression data, and transcriptional regulatory networks. BMC Syst. Biol. 2011; 23:147.
  • 52. Chen S., Lake B.B., Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 2019; 37:1452–1457.
  • 53. Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 2020; 11:1169.
  • 54. Stepniak E., Ricci R., Eferl R., Sumara G., Sumara I., Rath M., Hui L., Wagner E.F. c-Jun/AP-1 controls liver regeneration by repressing p53/p21 and p38 MAPK activity. Genes Dev. 2006; 20:2306–2314.
  • 55. Kusakabe M., Hasegawa K., Hamada M., Nakamura M., Ohsumi T., Suzuki H., Tran M.T.N., Kudo T., Uchida K., Ninomiya H. et al. c-Maf plays a crucial role for the definitive erythropoiesis that accompanies erythroblastic island formation in the fetal liver. Blood. 2011; 118:1374–185.
  • 56. Li Y., Yang J., Luo J.H., Dedhar S., Liu Y. Tubular epithelial cell dedifferentiation is driven by the helix-loop-helix transcriptional inhibitor Id1. J. Am. Soc. Nephrol. 2007; 18:449–460.
  • 57. Yang Y.L., Hu F., Xue M., Jia Y.J., Zheng Z.J., Li Y., Xue Y.M. Early growth response protein-1 upregulates long noncoding RNA arid2-IR to promote extracellular matrix production in diabetic kidney disease. Am. J. Physiol. - Cell Physiol. 2019; 316:C340–C352.
  • 58. Xue Y.-M. Early growth response 1 (Egr1) is a transcriptional activator of RAAS in diabetic kidney disease. Diabetes. 2018; 67:507–P.
  • 59. Li J., Li Y., Wang B., Ma Y., Chen P. Id-1 promotes migration and invasion of non-small cell lung cancer cells through activating NF-κB signaling pathway. J. Biomed. Sci. 2017; 24:95.
  • 60. Zhou L., Xue C., Chen Z., Jiang W., He S., Zhang X. c-Fos is a mechanosensor that regulates inflammatory responses and lung barrier dysfunction during ventilator-induced acute lung injury. BMC Pulm. Med. 2022; 22:9.
  • 61. Lynn R.C., Weber E.W., Sotillo E., Gennert D., Xu P., Good Z., Anbunathan H., Lattin J., Jones R., Tieu V.et al.. c-Jun overexpression in CAR T cells induces exhaustion resistance. Nature. 2019; 576:293–300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Rouillard A.D., Gundersen G.W., Fernandez N.F., Wang Z., Monteiro C.D., McDermott M.G., Ma’ayan A. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database (Oxford). 2016; 2016:baw100.
  • 63. Uhlén M., Fagerberg L., Hallström B.M., Lindskog C., Oksvold P., Mardinoglu A., Sivertsson Å., Kampf C., Sjöstedt E., Asplund A. et al. Tissue-based map of the human proteome. Science. 2015; 347:1260419.
  • 64. Shaulian E., Karin M. AP-1 in cell proliferation and survival. Oncogene. 2001; 20:2390–2400.
  • 65. Ameyar M., Wisniewska M., Weitzman J.B. A role for AP-1 in apoptosis: the case for and against. Biochimie. 2003; 85:747–752.
  • 66. Eckert R.L., Adhikary G., Young C.A., Jans R., Crish J.F., Xu W., Rorke E.A. AP1 transcription factors in epidermal differentiation and skin cancer. J. Skin Cancer. 2013; 2013:537028.
  • 67. Atsaves V., Leventaki V., Rassidakis G.Z., Claret F.X. AP-1 transcription factors as regulators of immune responses in cancer. Cancers (Basel). 2019; 11:1037.
  • 68. Schwarz E.M., Salgame P., Bloom B.R. Molecular regulation of human interleukin 2 and T-cell function by interleukin 4. Proc. Natl. Acad. Sci. U.S.A. 1993; 90:7734–7738.
  • 69. Wang H., Yu J., Zhang L., Xiong Y., Chen S., Xing H., Tian Z., Tang K., Wei H., Rao Q. et al. RPS27a promotes proliferation, regulates cell cycle progression and inhibits apoptosis of leukemia cells. Biochem. Biophys. Res. Commun. 2014; 446:1204–1210.
  • 70. Conde J., Otero M., Scotece M., Abella V., Gómez R., López V., Pino J., Mera A., Goldring M.B., Gualillo O. E74-like factor (ELF3) and leptin, a novel loop between obesity and inflammation perpetuating a pro-catabolic state in cartilage. Cell. Physiol. Biochem. 2018; 45:2401–2410.
  • 71. Kushwah R., Oliver J.R., Wu J., Chang Z., Hu J. Elf3 regulates allergic airway inflammation by controlling dendritic cell-driven T cell differentiation. J. Immunol. 2011; 187:4639–4653.
  • 72. Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 27:29–34.
  • 73. Tan Y., Godec J., Wu F., Tamayo P., Mesirov J.P., Haining W.N. A method for downstream analysis of gene set enrichment results facilitates the biological interpretation of vaccine efficacy studies. bioRxiv doi: 10.1101/043158, 11 April 2016, preprint: not peer reviewed.
  • 74. Almazroo O.A., Miah M.K., Venkataramanan R. Drug metabolism in the liver. Clin. Liver Dis. 2017; 21:1–20.
  • 75. Michaels S., Wang M.Z. The revised human liver cytochrome P450 ‘pie’: absolute protein quantification of CYP4F and CYP3A enzymes using targeted quantitative proteomics. Drug Metab. Dispos. 2014; 42:1241–1251.
  • 76. Lynch T., Price A. The effect of cytochrome P450 metabolism on drug response, interactions, and adverse effects. Am. Fam. Physician. 2007; 76:391–396.
  • 77. Zanger U.M., Klein K. Pharmacogenetics of cytochrome P450 2B6 (CYP2B6): advances on polymorphisms, mechanisms, and clinical relevance. Front. Genet. 2013; 4:24.
  • 78. García-Suástegui W.A., Ramos-Chávez L.A., Rubio-Osornio M., Calvillo-Velasco M., Atzin-Méndez J.A., Guevara J., Silva-Adaya D. The role of CYP2E1 in the drug metabolism or bioactivation in the brain. Oxid. Med. Cell. Longev. 2017; 2017:4680732.
  • 79. Liu J., Feng D., Kan X., Zheng M., Zhang X., Wang Z., Sun L., Chen H., Gao X., Lu T. et al. Polymorphisms in the CYP3A5 gene significantly affect the pharmacokinetics of sirolimus after kidney transplantation. Pharmacogenomics. 2021; 22:903–912.
  • 80. El Rouby N., Lima J.J., Johnson J.A. Proton pump inhibitors: from CYP2C19 pharmacogenetics to precision medicine. Expert Opin. Drug Metab. Toxicol. 2018; 14:447–460.
  • 81. Cesta M.F. Normal structure, function, and histology of the spleen. Toxicol. Pathol. 2006; 34:455–465.
  • 82. Cargnello M., Roux P.P. Activation and function of the MAPKs and their substrates, the MAPK-activated protein kinases. Microbiol. Mol. Biol. Rev. 2011; 75:50–83.
  • 83. Liao W., Lin J.X., Leonard W.J. Interleukin-2 at the crossroads of effector responses, tolerance, and immunotherapy. Immunity. 2013; 38:13–25.
  • 84. Yu J., Jiang Z., Ning L., Zhao Z., Yang N., Chen L., Ma H., Li L., Fu Y., Zhu H. et al. Protective HSP70 induction by Z-ligustilide against oxygen-glucose deprivation injury via activation of the MAPK pathway but not of HSF1. Biol. Pharm. Bull. 2015; 38:1564–1572.
  • 85. Qi Z., Qi S., Gui L., Shen L., Feng Z. Daphnetin protects oxidative stress-induced neuronal apoptosis via regulation of MAPK signaling and HSP70 expression. Oncol. Lett. 2016; 12:1959–1964.
  • 86. Hill Gallant K.M., Spiegel D.M. Calcium balance in chronic kidney disease. Curr. Osteoporos. Rep. 2017; 15:214–221.
  • 87. Goodman W.G. Calcium and phosphorus metabolism in patients who have chronic kidney disease. Med. Clin. North Am. 2005; 89:631–647.
  • 88. Swulius M.T., Waxham M.N. Ca2+/calmodulin-dependent protein kinases. Cell. Mol. Life Sci. 2008; 65:2637–2657.
  • 89. Kobayashi H., Saragai S., Naito A., Ichio K., Kawauchi D., Murakami F. Calm1 signaling pathway is essential for the migration of mouse precerebellar neurons. Development. 2015; 142:375–384.
  • 90. Beenken A., Mohammadi M. The FGF family: biology, pathophysiology and therapy. Nat. Rev. Drug Discov. 2009; 8:235–253.
  • 91. Stachtea X., Loughrey M.B., Salvucci M., Lindner A.U., Cho S., McDonough E., Sood A., Graf J., Santamaria-Pang A., Corwin A. et al. Stratification of chemotherapy-treated stage III colorectal cancer patients using multiplexed imaging and single-cell analysis of T-cell populations. Mod. Pathol. 2022; 35:564–576.
  • 92. Melani R.D., Gerbasi V.R., Anderson L.C., Sikora J.W., Toby T.K., Hutton J.E., Butcher D.S., Negrão F., Seckler H.S., Srzentic K. et al. The Blood Proteoform Atlas: a reference map of proteoforms in human hematopoietic cells. Science. 2022; 375:411–418.
  • 93. Hickey J.W., Neumann E.K., Radtke A.J., Camarillo J.M., Beuschel R.T., Albanese A., McDonough E., Hatler J., Wiblin A.E., Fisher J. et al. Spatial mapping of protein composition and tissue organization: a primer for multiplexed antibody-based imaging. Nat. Methods. 2022; 19:284–295.
  • 94. Schaffter T., Marbach D., Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011; 27:2263–2270.

Associated Data

Supplementary Materials

gkad053_Supplemental_Files

Data Availability Statement

The data underlying this article were provided by the HuBMAP consortium and Cistrome database. Data will be shared on request to the HuBMAP consortium and Cistrome database maintenance team.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press