Significance
Accurate inference of gene interactions and causality is required for pathway reconstruction, which remains a major goal for many studies. Here, we take advantage of 2 recent technological developments, single-cell RNA sequencing and deep learning, to propose an encoding scheme for gene expression data. We use this encoding in a supervised framework to perform several different types of analysis using minimal assumptions. Our method, convolutional neural network for coexpression (CNNC), first transforms expression data lacking locality to an image-like object on which convolutional neural networks (CNNs) work very well. We then utilize CNNs for learning relationships between genes, causality inferences, functional assignments, and disease gene predictions. For all of these tasks, CNNC significantly outperforms all prior task-specific methods.
Keywords: gene interactions, deep learning, causality inference
Abstract
Several methods have been developed to mine gene–gene relationships from expression data. Examples include correlation and mutual information methods for coexpression analysis, clustering and undirected graphical models for functional assignments, and directed graphical models for pathway reconstruction. Using an encoding for gene expression data, followed by deep neural network analysis, we present a framework that can successfully address all of these diverse tasks. We show that our method, convolutional neural network for coexpression (CNNC), improves upon prior methods in tasks ranging from predicting transcription factor targets to identifying disease-related genes to causality inference. CNNC’s encoding provides insights about some of the decisions it makes and their biological basis. CNNC is flexible and can easily be extended to integrate additional types of genomics data, leading to further improvements in its performance.
Several computational methods have been developed to infer relationships between genes based on gene expression data. These range from methods for inferring coexpression relationships between pairs of genes (1) to methods for inferring a biological or disease process for a gene based on other genes [either using clustering or guilt by association (2)] to causality inferences (3, 4) and pathway reconstruction methods (5). To date, each of these tasks was handled by a different computational framework. For example, gene coexpression analysis is usually performed using Pearson correlation (PC) or mutual information (MI) (6). Functional assignment of genes is often performed using clustering (7) or undirected graphical models including Markov random fields (8), while pathway reconstruction is often based on directed probabilistic graphical models (4). These methods also serve as an initial step in some of the most widely used tools for the analysis of genomics data, including network inference and reconstruction approaches (3, 9, 10), methods for classification based on gene expression (11), and many more.
While successful and widely used, these methods also suffer from serious drawbacks. First, most of these methods are unsupervised. Given the large number of genes that are profiled, and the often relatively small (at least in comparison) number of samples, several genes that are determined to be coexpressed or cofunctional may only reflect chance or noise in the data (12). In addition, most of the widely used methods are symmetric, which means that each pair has only one relationship value. While this is advantageous for some applications (e.g., clustering), it may be problematic for methods that aim at inferring causality (e.g., network reconstruction tasks).
To address these issues, we developed a method, convolutional neural network for coexpression (CNNC), which provides a supervised way (that can be tailored to the condition/question of interest) to perform gene relationship inference. CNNC utilizes a representation of the input data specifically suitable for deep learning. It represents each pair of genes as an image (histogram) and uses convolutional neural networks (CNNs) to infer relationships between different expression levels encoded in the image. The network is trained with positive and negative examples for the specific domain of interest (e.g., known targets of a transcription factor [TF], known pathways for a specific biological process, known disease genes, etc.), and the output can be either binary or multinomial.
We applied CNNC using a large cohort of single-cell (SC) expression data and tested it on several inference tasks. We show that CNNC outperforms prior methods for inferring interactions (including TF–gene and protein–protein interactions), causality inference, and functional assignments (including biological processes and diseases).
Results
We developed CNNC, a general computational framework for supervised gene relationship inference (Fig. 1). CNNC is based on a CNN, which is used to analyze summarized co-occurrence histograms from pairs of genes in single-cell RNA-sequencing (scRNA-seq) data. Given a relatively small labeled set of positive pairs (with either negative or random pairs serving as negative), CNNC learns to discriminate between interacting, causal pairs, negative pairs, or any other gene relationship types that can be defined.
Fig. 1.
CNNC input, output, and architecture. CNNC aims to infer gene–gene relationships using single-cell expression data. For each gene pair, scRNA-seq expression levels are transformed into 32 × 32 normalized empirical probability function (NEPDF) matrices. The NEPDF serves as an input to a convolutional neural network (CNN). The intermediate layer of the CNN can be further concatenated with input vectors representing DNase-seq and PWM data. The output layer can have 1, 3, or more values, depending on the application. For example, for causality inference the output layer contains 3 probability nodes, where p0 is the probability that genes a and b do not interact, p1 the probability that gene a regulates gene b, and p2 the probability that gene b regulates gene a.
Learning a CNNC Model.
CNNC can be trained with any expression dataset, although as with other neural network applications, the more data, the better its performance. Given expression data, we first generate a normalized empirical probability distribution function (NEPDF) for each gene pair (genes a and b) (Fig. 1). For this, we calculate a normalized 2-dimensional (2D) histogram of fixed size (32 × 32). The specific dimension of the input is a hyperparameter that can be learned for each dataset (SI Appendix, Fig. S3A and Supplementary Notes). In the histogram, columns represent gene a expression levels and rows represent gene b expression levels, such that entries in the matrix represent the (normalized) co-occurrences of these values. If different data types are combined (e.g., bulk and SC), they can either be used separately or concatenated to form a combined NEPDF with dimensions of 32 × 64 (SI Appendix, Fig. S3B). Next, the distribution matrix is used as input to a CNN, which is trained using an N-dimensional (ND) output label vector, where N depends on the specific task. For example, for coexpression or interaction prediction N is set to 1 (interacting or not), while for causality inference it is set to 3, where label 0 indicates that genes a and b are not interacting and label 1 (2) indicates that gene a (b) regulates gene b (a). In general, our CNN model consists of one 32 × 32 input layer; 10 intermediate layers comprising 6 convolutional layers, 3 maxpooling layers, and 1 flatten layer; and a final ND “softmax” layer or 1 scalar “sigmoid” layer (Methods and SI Appendix, Fig. S1).
For the analysis presented in this paper, we used processed scRNA-seq and bulk RNA-seq from different studies (13). All raw data were uniformly processed and assigned to a predetermined set of more than 20,000 mouse genes for each task (Methods).
In addition to gene expression data, CNNC can integrate other data types including DNase-seq (14), position weight matrix (PWM) (15), etc. For this, we concatenated the additional information as a vector to the intermediate output of the gene expression data and continued with the standard CNN architecture. See Methods and SI Appendix, Fig. S1 for different architectures and details of CNNC and SI Appendix, Table S1, for information on training and run time.
Using CNNC to Predict TF–Gene Interactions.
We first tested the CNNC framework on the task of predicting pairwise interactions from gene expression data (16). Chromatin immunoprecipitation (ChIP)-seq has been widely used as a gold standard for studying cell-type–specific protein–DNA interactions (17). We thus evaluated CNNC’s performance using cell-type-specific scRNA-seq datasets (for mouse embryonic stem cells [mESCs], bone marrow-derived macrophages, and dendritic cells; Methods) and ChIP-seq data from Gene Transcription Regulation Database (GTRD) (18).
We extracted data from GTRD for 38 TFs for which ChIP-seq experiments were performed in mESCs, 13 TFs studied in macrophages, and 16 TFs for dendritic cells. To determine targets for each TF using the ChIP-seq data, we followed prior work (19, 20) and defined a promoter region as 10 kb upstream to 1 kb downstream from the transcription start site (TSS) for each gene. If a TF a has at least one detected peak signal in or overlapping the promoter region of gene b, we say that TF a regulates gene b. For this prediction task, we compared CNNC with several popular methods for gene–gene coexpression analysis: PC and MI, which are the 2 most popular coexpression analysis methods; Genie3 (9), which was the best performer in the Dialogue for Reverse Engineering Assessments and Methods (DREAM4) in silico network construction challenge (21); count statistics (CS) (22), which relies on local information based on gene expression ranks in large heterogeneous samples; conditional-density resampled estimate of mutual information (DREMI) (23); and a fully connected deep neural network (DNN), which also uses our NEPDF as input. Since most of the prior methods used for comparison are symmetric, we focused here on the 2-label setting (interacting or not). We applied 3-fold cross-validation to all datasets. Each fold contains several TFs, although the test is performed separately for each TF (Methods).
Fig. 2 presents the results of these comparisons. As can be seen, CNNC and DNN outperform all prior methods for all cell types. We observe significant improvement over all prior methods (Fig. 2 A–G). To evaluate the performance of CNNC, we chose both the area under the receiver operating characteristic curve and the area under the precision recall curve (AUROC/AUPRC) as our evaluation scores. The AUPRC achieved by CNNC is around 20% higher than PC and MI on some datasets (see SI Appendix, Fig. S2, for details). Importantly, as can be seen in Fig. 2 A–D, the difference is even more pronounced for the top ranked predictions. For CNNC, we see almost no false positives (less than 20%) for the top 5% ranked pairs. Such top predictions are often the most important since the ability to validate predicted interactions is usually limited to the top few predictions.
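For readers implementing a similar evaluation, the per-TF scoring described above can be sketched with a rank-based AUROC (equivalent to the Mann–Whitney U statistic, up to tie handling). The TF names, scores, and labels below are purely illustrative, not data from the study:

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); ties are broken arbitrarily."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# hypothetical per-TF prediction scores and ChIP-seq-derived labels;
# each TF is evaluated separately and the median is reported
per_tf = {
    "TF_a": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
    "TF_b": ([0.7, 0.4, 0.6, 0.1], [1, 0, 1, 0]),
}
aurocs = [auroc(s, y) for s, y in per_tf.values()]
median_auroc = float(np.median(aurocs))
```

Evaluating per TF and summarizing with the median (as in Fig. 2) avoids letting one well-predicted TF with many targets dominate the overall score.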
Fig. 2.
GTRD TF-target prediction. (A–D) Precision recall curve (PRC) of CNNC, deep neural network (DNN), count statistics (CS), and mutual information (MI), which are the top 4 performing methods for TF-target prediction. Training and testing for these plots were done using bone marrow-derived macrophage scRNA-seq and ChIP-seq data. Median (mean) AUPRC is shown at the top of each panel. Here, each gray line represents one TF, the red line represents the median curve, and the light green part represents the region between the 25th and 75th quantiles. (E and F) AUPRC and AUROC for all 7 methods we compared. We tested all methods on the macrophage and mESC tissue-specific datasets. For mESC, the comparison P values with the best 2 other methods in terms of AUPRC (AUROC) were based on the Wilcoxon signed-rank test. For macrophage, the comparison P values are based on AUPRC and AUROC because the number of TFs is not sufficient for the Wilcoxon test: CS, 2.85 × 10−2; DNN, 3.55 × 10−2. See SI Appendix, Fig. S2, for similar analysis using 3 additional datasets. (G) Overall AUPRC and AUROC of the methods in all 5 train and test experiments (3 tissue-specific datasets and 2 with the larger 1.3M mouse brain scRNA-seq from 10×) with comparison P values using the Wilcoxon signed-rank test between CNNC and the best 2 other methods in terms of AUPRC (AUROC): CS, 4.90 × 10−8 (2.55 × 10−12); DNN, 9.42 × 10−3 (3.15 × 10−3). (H) Comparison of TF-target predictions with additional data using mESC expression and TFs. Columns 1 to 4 show the AUPRCs of PC, MI, CS, and CNNC using scRNA-seq data, respectively. The fifth and sixth columns show performance when only using PWM or DNase. The last 3 columns show performance of the integration of expression, sequence (PWM), and DNase data. The comparison P value between the CS and CNNC integration methods is (AUPRC): 8.37 × 10−4.
Data Integration Further Improves TF Target Gene Prediction.
The above analysis was based on expression values alone. However, as noted above, gene relationship inference is often used as a component in more extensive procedures that integrate different types of genomics data. To test how the use of the NN-based method can aid such procedures, we extended CNNC so that it can utilize sequence and DNase hypersensitivity information. For sequence, we used PWMs from Jaspar (24). DNase-seq data for mESCs were obtained from the mouse ENCODE project (25). We used a simple strategy for processing the PWM and DNase data, resulting in an additional 2D vector for each pair, which we embedded into a 128D vector (Methods). We next concatenated this vector with the NEPDF’s 128D vector in the flatten layer to form a 256D vector, as shown in Fig. 1 and SI Appendix, Fig. S1A.
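The dimensional bookkeeping of this concatenation can be illustrated with a small NumPy sketch; the random weights and values below merely stand in for the learned Dense embedding and the CNN branch's flatten output, so this shows the shapes involved rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D auxiliary input for one TF-gene pair: a PWM score and a DNase signal
# (the numeric values here are hypothetical)
aux = np.array([12.3, 4.7])

# stand-in for a learned Dense embedding of the auxiliary vector -> 128D
W_embed = rng.normal(scale=0.1, size=(2, 128))
aux_embed = np.maximum(0.0, aux @ W_embed)       # ReLU embedding

# stand-in for the 128D flatten-layer output of the expression-image branch
image_embed = rng.normal(size=128)

# the two branches are joined into a single 256D vector for classification
combined = np.concatenate([image_embed, aux_embed])
```

Joining the branches at the flatten layer lets the classifier weigh expression-based and sequence/chromatin-based evidence jointly, rather than averaging two separate predictions.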
Results, presented in Fig. 2H, show that these additional data sources indeed improve the ability to predict TF–gene interactions. As before, a combined framework utilizing CNNC outperforms methods that used CS.
CNNC Can Predict Pathway Regulator–Target Gene Pairs.
While TFs usually directly impact the expression of their targets, several methods have also utilized RNA-seq data to infer pathways that combine protein–protein and protein–DNA interactions (26). To test whether CNNC can serve as a component in pathway inference methods, we selected 2 representative pathway databases, Kyoto Encyclopedia of Genes and Genomes (KEGG) (27) and Reactome (28), as gold standards and used these, together with a large scRNA-seq dataset of 43,261 cells collected from over 500 different studies representing a wide range of cell types, conditions, etc. (13), and bulk RNA-seq data from ENCODE (25), to train and test our framework. Since we are interested in causal relationships, we only used directed edges (from regulator to target gene) with activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). As for the negative data, here we limited the negative set to a random set of pairs where both genes appear in pathways in the database but do not interact. Given the large number of genes, we performed a 3-fold cross-validation, keeping the sets of regulator genes in the training and test sets separated (Methods and SI Appendix, Supplementary Notes).
Results are presented in Fig. 3. As can be seen, CNNC performs very well on the KEGG pathways, reaching a median (mean) AUROC of 0.9949 (0.8822) compared to less than 0.9309 (0.7324) for the methods we compared against, which here also included Bayesian directed networks (BDNs) (4), which learn a global directed interaction graph (Fig. 3F). CNNC also performs well on Reactome pathways (SI Appendix, Fig. S4). We also used the KEGG data to test the specific architecture CNNC utilizes and observed that it improves upon 2 alternative deep NN architectures: a deep fully connected NN (DNN) and a CNN without pooling layers (Fig. 3H and SI Appendix, Fig. S5).
Fig. 3.
Predicting undirected pathway edges. (A) Overall ROCs for CNNC performance on KEGG pathway gene interaction prediction using a large compendium of scRNA-seq data and bulk data. Here, each gray line represents one regulator with outgoing edges. Median (mean) AUROC is shown on the top of each panel. (B–G) Overall ROCs for Pearson correlation (PC), mutual information (MI), count statistics (CS), Bayesian directed network (BDN), DREMI, and Genie3 when tested on the KEGG pathway gene interaction prediction task. (H) Comparison of the 7 methods on the gene interaction prediction task. The comparison P values are (AUPRC [AUROC]): DREMI, 0.0 (0.0); PC, 5.71 × 10−184 (3.48 × 10−214); MI, 1.79 × 10−191 (2.25 × 10−193); BDN, 8.31 × 10−231 (2.84 × 10−226); CS, 7.91 × 10−160 (1.45 × 10−189); GENIE3, 8.04 × 10−66 (7.31 × 10−110); DNN, 7.61 × 10−7 (3.90 × 10−7). Boxplot shown with median, first quartile, third quartile, maximum, and minimum.
Using CNNC for Causality Prediction.
So far, we have focused on general interaction predictions. However, as discussed above, CNNC can also be used to infer directionality by changing the output of the NN. We next used CNNC to infer causal edges for the KEGG and Reactome datasets. Specifically, we used CNNC to predict whether, for 2 genes a and b, the interaction is from a to b or vice versa. For the pathway databases, we only analyzed directed edges and so had the ground truth for that data as well. As can be seen in Fig. 4, for the KEGG dataset, CNNC is very successful, achieving a median AUROC of 0.9949 (Fig. 4 A and B). For Reactome (SI Appendix, Fig. S4), we see that the most confident predictions are correct, but beyond the top predictions performance levels off. We compared the performance of CNNC to another method developed for learning causal relationships from gene expression data, BDN (4), which learns a global directed interaction graph. Results presented in SI Appendix, Fig. S6, show that CNNC greatly outperforms BDN on this causality prediction task. We also tested several other applications of CNNC, including its use for determining the impact of the interaction (activation or repression) and its ability to distinguish directed from undirected (complex-based) interactions. In both cases, CNNC performs well, as we show in SI Appendix, Fig. S7.
Fig. 4.
Directed (causal) edge prediction. (A) Overall ROCs for performance of CNNC on KEGG pathway directed edge prediction using a large compendium of scRNA-seq and bulk data. Median (mean) AUROC is shown on the Top. (B) The AUROC histogram for A. (C–H) A typical NEPDF sample from a KEGG interaction that is correctly predicted as label 0, 1, and 2, in the form of 2D and 3D plot. (I–L) Variance (var), mean, and coefficient of variance (CV) of gene 1 as the expression of gene 2 increases for top correctly predicted pairs with label 1. (M–P) Same for top predictions for label 2. (Q–T) Average and variance of CV for the top prediction groups correctly predicted as label 1 (Q and S) and label 2 (R and T).
To try to understand the basis for the decisions reached by CNNC, we plotted 2D and 3D figures of 3 NEPDF inputs (Fig. 4 C–H) that were correctly predicted as different labels (0 in Fig. 4 C and D, 1 in Fig. 4 E and F, and 2 in Fig. 4 G and H). As can be seen, random pairs look uninformative and symmetric. In the label 1 and label 2 figures, by contrast, the 2 genes display partial correlations: there are places where both are up or down concurrently, and the main difference between the histograms in Fig. 4 E–H is in cases where one gene is up and the other is not. In Fig. 4 E and F, gene 2 is up while gene 1 is not, indicating that the causal relationship is likely 1→2. The opposite holds for Fig. 4 G and H, and so the method infers that 2→1 for that input. While relationships between expression values, including the ones mentioned above, can be manually prescribed for an algorithm, we also noted that the encoding used by CNNC allows it to capture more complicated relationships between genes. In Fig. 4 I–P, we plot the mean, variance, and coefficient of variance (CV) of gene 2 as a function of the expression of gene 1 for both prediction directions (1→2, top and 2→1, bottom). As can be seen, the variance and CV trends are consistent within each category and diverge between categories, indicating that CNNC can make use of second-order or even higher-order distribution properties. Similar phenomena have been anecdotally observed in specific cases, for example for micro-RNA regulation (29), but the ability of CNNC to learn such relationships on its own strongly suggests that it can generalize much better than prior methods for inferring such causal interactions.
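Conditional statistics of this kind can be reproduced in spirit on synthetic data. The generative model below (gene 2 driven by gene 1 with expression-dependent noise) is our own illustrative assumption, not data from the study; it merely shows how binning one gene's expression and summarizing the other yields the mean/variance/CV trends plotted in the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic causal pair: gene 2 tracks gene 1, noise grows with expression
gene1 = rng.gamma(shape=2.0, scale=1.0, size=5000)
gene2 = gene1 + rng.normal(0.0, 0.2 + 0.1 * gene1)

# bin gene 1's expression range, then summarize gene 2 within each bin
edges = np.linspace(gene1.min(), gene1.max(), 9)
bin_idx = np.digitize(gene1, edges[1:-1])
stats = []
for b in range(8):
    vals = gene2[bin_idx == b]
    if len(vals) < 10:          # skip sparse bins in the distribution's tail
        continue
    mean, var = vals.mean(), vals.var()
    stats.append((mean, var, np.sqrt(var) / mean))  # (mean, var, CV)
```

Under this toy model the conditional mean rises across bins while the CV follows a distinct trend, which is the kind of second-order signature the text argues CNNC can exploit.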
We also performed a number of experiments to test the robustness of CNNC to dropouts, size of the input data, the impact of unbalanced labels, and different cross-validation strategies. Results, presented in SI Appendix, Fig. S8, indicate that CNNC is robust and can be successfully applied even to unbalanced datasets, which are common in biology.
Using CNNC for Functional Assignments.
We next explored the use of CNNC for assigning function or disease relevance to genes. For this, we applied CNNC to predict genes associated with 3 essential cell functions: cell cycle, circadian rhythm, and immune system. For each of these functions, we obtained known genes from gene set enrichment analysis (GSEA) (30) and trained CNNC using all expression data on 2/3 of these genes, holding out the other 1/3 as a test set. In this setting, the network is trained to predict 1 for a pair of genes that are both cell cycle genes (for the cell cycle task) and 0 for all other pairs (Methods and SI Appendix, Supplementary Notes). When testing on the held-out set, CNNC achieved an AUROC of 0.79 (Fig. 5A), significantly outperforming both guilt by association (GBA) and DNN. Importantly, the top 10% predicted genes were all true positives (SI Appendix, Fig. S9). CNNC also performs best for the circadian rhythm and immune system functions (Fig. 5 B and C).
Fig. 5.
Functional assignment and pathway reconstruction using CNNC. CNNC can be used as a component in downstream analysis algorithms including functional assignments and disease gene prediction. (A–C) Performance of CNNC on the cell cycle, immune system, and circadian rhythm gene prediction tasks. (D–F) Performance of CNNC on the asthma, COPD, and HNC disease gene prediction tasks. (G) Predicted expression pattern of a cell cycle∼cell cycle gene pair. (H) Predicted expression pattern of a cell cycle∼non-cell cycle gene pair. (I and J) The most significant GO terms of the top 5% predicted asthma (I) and COPD (J) disease genes by CNNC and GBA, respectively.
Given its success on a well-studied functional set, we next asked whether CNNC can be used to predict novel disease genes. We focused on 2 lung diseases, asthma and chronic obstructive pulmonary disease (COPD), and on head and neck cancer (HNC). We obtained 147, 44, and 72 genes for asthma, COPD, and HNC, respectively, from “Malacards” (31). We next trained CNNC with all known genes for each of the 3 diseases and used it to predict additional genes for each disease. We evaluated the predicted sets both manually and by statistical analysis using gene ontology (GO) and compared these to prior methods for GBA (32) analysis, as can be seen in Fig. 5 D–F. For all 3 diseases, CNNC outperforms GBA, and it obtained much more significant GO terms than GBA (Fig. 5 I and J). Manual inspection of the top 10 genes for asthma indicated that 7 of them are supported by recent studies (Dataset S1), including “Lck,” which was recently determined to be a potential drug target for asthma therapy (33).
Discussion and Conclusion
Several methods for inferring gene–gene relationships from expression data have been developed over the last 2 decades. While these methods perform well in some cases, they suffer from a number of drawbacks that often lead to false positives or missed key relationships (false negatives). The former can be attributed to the unsupervised nature of most methods (including methods for coexpression and clustering), which makes it hard to “train” them on a labeled dataset. The latter often results from the assumptions made by specific methods (e.g., distribution assumptions for BDNs) that do not always hold.
To address these issues, we presented CNNC, a general framework for gene relationship inference, which is based on CNNs. The key idea is to convert the input data into a co-occurrence histogram. Such a representation enables us to fully utilize both the information contained in SC data and the ability of CNNs to exploit spatial information. On the one hand, SC data provide information about the actual, cell-based relationships, while relationships in bulk studies only provide information on averages and so do not accurately reflect real interactions and causality. Furthermore, the large number of cells in recent SC datasets enables us to accurately estimate the joint distribution for gene pairs. Here, we used tens of thousands of expression profiles from a relatively small number of experiments (a few hundred), whereas bulk datasets contain far fewer profiles (the bulk data we use, which is from one of the largest experiments, has only ∼300 profiles). In addition, unlike most prior methods, CNNC is supervised, which allows the CNN to zoom in on subtle differences between positive and negative pairs. Supervision also helps fine-tune the scoring function for the application at hand. For example, different features may be important for analyzing TF–gene interactions than for inferring proteins in the same pathway. Finally, the fact that the network can utilize the large volumes of scRNA-seq data without requiring explicit assumptions about the distribution of the input allows it to better overcome noise and other errors, reducing false negatives.
Analysis of several different interaction prediction and functional assignment tasks indicates that CNNC can improve upon prior, unsupervised methods. It can also be naturally extended to integrate complementary data including epigenetic and sequence information. Comparisons to more advanced methods for biological network reconstruction further highlight the advantages of CNNC. In addition, CNNC can be used as a preprocessing step, or as a component in more advanced network reconstruction methods. Finally, CNNC is easy to use either with general data or with condition-specific data. For the former, users can download the data and implementation from the supporting website, provide a list of labels (positive and negative pairs for their system of interest), and retrieve the scores for all possible gene pairs. These in turn can be used for any downstream application including network analysis, functional gene assignment, etc.
While a number of prior NN methods were developed, by us and others, to analyze single-cell expression vectors (11, 34–38), these methods are very different from CNNC. First, their goal is usually to compare data across cells rather than to analyze gene relationships within cells as CNNC does. Second, unlike CNNC, these prior methods rely on a vector (or matrix for multiple cells) representation of expression data, which does not utilize the spatial analysis advantages of deep NNs. CNNC takes advantage of these by converting coexpression relationships to image-like histograms prior to their analysis. While this was applied here to gene expression data, such an approach may also be appropriate for other types of data, for example, financial data.
Since CNNC is supervised, it would indeed not generalize to cases where no labels are available, unlike some of the methods we compare to. On the other hand, when labels are available, which is common to several cases with genomics data (including all of the tasks we presented), CNNC is a much better choice than unsupervised methods.
CNNC is implemented in Python, and both data and an open-source version of the software are available from the supporting website (https://github.com/xiaoyeye/CNNC).
Methods
Dataset Sources and Preprocess Pipelines.
We used genomics data of different types from several studies. The mouse scRNA-seq dataset collected by Alavi et al. (13) consists of 43,261 uniformly processed expression profiles from over 500 different scRNA-seq studies. For each profile, expression values are available for the same set of 20,463 genes. Among the cells, 4,126 are dendritic cells, and 6,283 are bone marrow-derived macrophage cells. Additionally, mESC data, which contain 2,717 cells, were downloaded from the Gene Expression Omnibus with accession number GSE65525 (39). The 1.3 million mouse SC data were downloaded from ref. 40. The mouse bulk RNA-seq dataset was downloaded from the Mouse ENCODE project (25). This dataset includes 249 samples, and we only utilized genes that are present in the scRNA-seq dataset, leading to the same number of genes for both datasets. mESC DNase data were also downloaded from the Mouse ENCODE project (25) (ENCFF096WRW.bed). Mouse TF motif information is from the TRANSFAC database (41). PWM values were calculated with the Python package “Biopython” (42).
For the DNase and PWM analysis, we followed prior papers and defined the TSS region as 10 kb upstream to 1 kb downstream from the TSS for each gene (19, 20). For each TF and gene pair, using the Biopython package, we calculated the score of the TF motif against both the “+” and “−” strands at all possible positions along the TSS region of the gene, and then selected the maximum as the final PWM score. The maximum DNase peak signal in the TSS region was used as the scalar DNase value for each gene.
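In the paper this scan is done with Biopython; a minimal pure-Python equivalent of the max-over-positions-and-strands step, using a hypothetical 3-bp log-odds matrix and a toy sequence in place of the real 11-kb TSS window, might look like:

```python
# hypothetical log-odds PWM for a 3-bp motif (one row of weights per base)
pwm = {
    "A": [1.2, -0.5, -1.0],
    "C": [-0.8, 1.0, -0.7],
    "G": [-1.0, -0.6, 1.5],
    "T": [-0.5, -0.9, -1.2],
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement, to score the '-' strand."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def max_pwm_score(seq, pwm):
    """Best motif match over all positions on either strand."""
    width = len(next(iter(pwm.values())))
    best = float("-inf")
    for strand in (seq, revcomp(seq)):
        for i in range(len(strand) - width + 1):
            score = sum(pwm[strand[i + j]][j] for j in range(width))
            best = max(best, score)
    return best

tss_region = "ACGTGACGG"  # stand-in for the real 11-kb TSS window
score = max_pwm_score(tss_region, pwm)
```

Taking the single maximum reduces the whole promoter scan to one scalar per TF–gene pair, which is what gets fed into the auxiliary input vector.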
Labeled Data.
mESC ChIP-seq peak region data were downloaded from the GTRD database, and we used peaks with threshold P value < 10−400 for mESC cells and 10−200 for macrophage cells and dendritic cells. If a TF a has at least one ChIP-seq peak signal in or partially in the TSS region of gene b, as defined above, we say that a regulates b. See SI Appendix, Table S3, for details. KEGG and Reactome pathway data were downloaded with the R package “graphite” (43). KEGG contains 290 pathways, and Reactome contains 1,581 pathways. For both, we only selected directed edges with either activation or inhibition edge types and filtered out cyclic gene pairs where genes regulate each other mutually (to allow for a unique label for each pair). In total, we have 3,057 proteins with outgoing directed edges in KEGG, and the total number of directed edges is 33,127. For Reactome, the corresponding numbers are 2,519 and 33,641.
Constructing the Input Histogram.
Image size is very important, since a small image size leads to information loss (very few expression levels), whereas large sizes can miss important relationships due to noise. Thus, the best image size depends on the particular task and the amount of data available. The best way to determine the optimal size is to treat it as another network hyperparameter and perform cross-validation with different sizes to select the optimal one. As we show in SI Appendix, Fig. S3, applying this approach to the KEGG prediction tasks identifies 32 × 32 as the optimal input size for the scRNA-seq data we used, and so this is the dimension used throughout this paper. For any gene pair a and b, we first log transformed their expression and then uniformly divided the expression range of each gene into 32 bins. Next, we created the 32 × 32 histogram by assigning each sample to an entry in the matrix and counting the number of samples for each entry. Due to the very low expression levels, and even more so to dropouts in scRNA data, the value in the zero–zero bin is always very large and often dominates the entire matrix. To overcome this, we added pseudocounts to all entries and applied another log transformation to each entry to obtain the final matrix. We combined bulk and scRNA-seq NEPDFs by concatenating them as a 32 × 64 matrix to achieve better performance. See SI Appendix, Fig. S3, for additional ways to integrate the different data types.
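A minimal NumPy sketch of this construction follows; the log offset and pseudocount values are illustrative assumptions (the exact constants used in CNNC are not specified here), and the toy Poisson counts stand in for real scRNA-seq profiles:

```python
import numpy as np

def nepdf(x, y, bins=32, pseudocount=1.0):
    """Build one 32 x 32 NEPDF image for a gene pair (a sketch of the
    steps described above, not the exact CNNC implementation)."""
    # log-transform raw expression; the small offset guards against log(0)
    lx = np.log10(np.asarray(x, dtype=float) + 1e-2)
    ly = np.log10(np.asarray(y, dtype=float) + 1e-2)
    # uniform 2D binning over each gene's expression range
    hist, _, _ = np.histogram2d(lx, ly, bins=bins)
    # pseudocount + second log tame the dominant zero-zero bin, then scale
    img = np.log10(hist + pseudocount)
    return img / img.max()

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=10000)        # toy counts for gene a
y = x + rng.poisson(1.0, size=10000)    # gene b loosely tracks gene a
img = nepdf(x, y)
```

Combining bulk and SC data would amount to computing one such image per data type and concatenating them along the column axis into a 32 × 64 input.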
CNN for RPKM Data.
We followed VGGnet (44) to build our CNN model (SI Appendix, Fig. S1). The CNN consists of stacked layers of 3 × 3 convolutional filters (Eq. 1), where the number of filters per layer is a power of 2 ranging from 32 to 128, interleaved with layers of 2 × 2 maxpooling (Eq. 2). The constructed NEPDF matrices serve as the input to the CNN. Each convolutional layer computes the following function:
$X^{\text{out}}_{i,j,k}=\sum_{m=1}^{3}\sum_{n=1}^{3}W_{m,n,k}\,X_{i+m-1,\,j+n-1}$ [1]
where X is the input from the previous layer, (i,j) is output position, k is convolutional filter index, and W is the filter matrix of size 3 × 3. In other words, each convolutional layer computes a weighted average of the prior layer values where the weights are determined based on training. The maxpooling layer computes the following function:
$X^{\text{out}}_{i,j,k}=\max\left(X_{2i-1,2j-1,k},\,X_{2i-1,2j,k},\,X_{2i,2j-1,k},\,X_{2i,2j,k}\right)$ [2]
where X is input, (i,j) is output position, and k is the convolutional filter index. In other words, the layer selects one of the values of the previous layer to move forward.
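The two operations in Eqs. 1 and 2 can be sketched in NumPy for a single-channel input and a single filter (this is an illustrative sketch, not the authors' implementation, which uses a standard deep learning framework):

```python
import numpy as np

def conv3x3(X, W):
    """Eq. 1: valid 3x3 convolution -- a weighted sum of each
    3x3 neighborhood, with weights W learned during training."""
    H, Wd = X.shape
    out = np.zeros((H - 2, Wd - 2))
    for i in range(H - 2):
        for j in range(Wd - 2):
            out[i, j] = np.sum(W * X[i:i + 3, j:j + 3])
    return out

def maxpool2x2(X):
    """Eq. 2: 2x2 max pooling -- keeps the largest value
    in each nonoverlapping 2x2 block."""
    H, Wd = X.shape
    return X[:H - H % 2, :Wd - Wd % 2].reshape(
        H // 2, 2, Wd // 2, 2).max(axis=(1, 3))
```

For example, a 3 × 3 averaging filter `W = np.ones((3, 3)) / 9` turns `conv3x3` into a local mean, matching the "weighted average" description above.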
Overall Structure of CNNC.
The overall structure of the CNN is presented in SI Appendix, Fig. S1. The input layer of the CNN is either 32 × 32 (scRNA-seq) or 32 × 64 (scRNA-seq and bulk RNA-seq), as discussed above. In addition, the CNN contains 10 intermediate layers and a single 1- or 3-dimensional output layer. The 10 layers include both convolutional and maxpooling layers, and the exact dimensions of each layer are shown in SI Appendix, Fig. S1. Following ref. 45, we used the rectified linear unit (ReLU) as the activation function (Eq. 3) across the whole network, except in the final classification layer, where the sigmoid function (Eq. 4) was used for binary classification and the softmax function (Eq. 5) for multicategory classification. These functions are defined below:
$\mathrm{ReLU}(x)=\max(0,x)$ [3]
$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}$ [4]
$\mathrm{softmax}(\mathbf{x})_i=\frac{e^{x_i}}{\sum_{j}e^{x_j}}$ [5]
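The three activation functions (Eqs. 3–5) can be written directly in NumPy; the max shift in `softmax` is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def relu(x):
    """Eq. 3: rectified linear unit."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Eq. 4: logistic function, used for binary classification."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Eq. 5: softmax over class scores, used for
    multicategory classification (shifted for stability)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```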
Training and Testing Strategy.
We evaluated the CNN using 3-fold cross-validation across all tasks. In each fold, training and test datasets are strictly separated to avoid information leakage. See SI Appendix, Supplementary Notes, for details. For TF-binding prediction, we used binary classification: an NEPDF matrix with label 1 was generated for each target b of TF a, and a matrix with label 0 was generated for a randomly selected nontarget gene r to balance the positive and negative sets. Similar to prior work, targets are defined by the presence of a ChIP-seq peak in their promoter (46).
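The balanced labeling scheme for TF-binding prediction can be sketched as follows (function and variable names are illustrative, not the authors' code):

```python
import random

def balanced_pairs(tf, targets, all_genes, seed=0):
    """For each ChIP-seq target of `tf`, emit one positive pair
    (label 1) and one matched negative pair (label 0) drawn from
    the nontargets, keeping the two classes balanced."""
    rng = random.Random(seed)
    target_set = set(targets)
    nontargets = [g for g in all_genes if g not in target_set and g != tf]
    pairs = []
    for b in targets:
        pairs.append(((tf, b), 1))                       # positive example
        pairs.append(((tf, rng.choice(nontargets)), 0))  # matched negative
    return pairs
```

Each pair would then be encoded as an NEPDF matrix and fed to the CNN with its label.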
For KEGG and Reactome pathway prediction tasks, we used 3 labels (to enable causality analysis): for each gene pair (a, b), we generated (a, b)’s and (b, a)’s NEPDF matrices with labels of 1 and 2, respectively. NEPDF matrices with a label of 0 were generated from random (r, s) gene pairs among the KEGG or Reactome gene sets. For each candidate gene pair, CNNC computes a probability vector [p0, p1, p2], where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a. After training, we used p1(a, b) + p2(a, b) as the probability that a interacts with b and p2(a, b) − p2(b, a) as the pseudoprobability that b regulates a. For interaction prediction, only the probability vectors of (a, b) and (r, s) were used in the evaluations, while for causality prediction, only the probability vectors of (a, b) and (b, a) were used. To avoid overfitting, we used an early-stopping strategy that monitors the validation loss. To evaluate the detailed performance for every TF and regulator in the KEGG and Reactome tasks, we calculated the AUROC and AUPRC for each of them and combined all values into the final result (SI Appendix, Supplementary Notes). We also performed 4-category tasks using the same data separation and cross-validation strategy (see SI Appendix, Fig. S7, for details).
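Combining the two output vectors into interaction and causality scores can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
def interaction_and_direction(p_ab, p_ba):
    """Combine CNNC output vectors [p0, p1, p2] for the ordered
    pairs (a, b) and (b, a) into an interaction score and a signed
    causality score, as described above."""
    interaction = p_ab[1] + p_ab[2]  # probability that a and b interact
    direction = p_ab[2] - p_ba[2]    # > 0 suggests b regulates a
    return interaction, direction
```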
Integrating Expression, Sequence, and DNase Data.
To integrate DNase and PWM data with the processed RNA-seq data, we first computed the maximum value of a PWM scan and the DNase accessibility for each promoter region. We next generated a 2-value vector from these data for each pair and embedded it into a 128-dimensional vector using one fully connected layer of 128 nodes. This embedding was then concatenated with the processed expression representation to form a 256-dimensional vector, which serves as input to a fully connected 128-node layer followed by a binary classifier. See SI Appendix, Fig. S1, for details.
Functional Gene Assignment.
To assign genes to a function (biological process or disease), we train a CNNC model for each function using its set of known genes. As in all other tasks, the input to each model is a pair of genes, where the first is a known gene g for the function and the second is either a positive (known) gene or a negative gene (randomly selected from the set of genes not known to have the function).
Known Genes for Functional Assignment Testing.
We downloaded human gene sets for cell cycle (855 genes), immune system (332), rhythm (103), asthma (182), COPD (59), and HNC (128) from GSEA and MalaCards (31) (a human disease database, https://www.malacards.org/). Mapping all genes to their mouse orthologs yielded 682, 278, 98, 147, 47, and 71 genes, respectively. For training, we used all genes for each disease and a randomly selected set of unknown genes. See SI Appendix, Supplementary Notes, for details.
Data Availability.
All data, scripts, and instructions required to run CNNC in Python can be found on our support website, https://github.com/xiaoyeye/CNNC. All other public data can be found by following the pipelines in Dataset Sources and Preprocess Pipelines and Labeled Data in Methods.
Acknowledgments
This work was partially supported by NIH Grants 1R01GM122096 and OT2OD026682 (to Z.B.-J.), and a James S. McDonnell Foundation Scholars Award in Studying Complex Systems (to Z.B.-J.).
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission. N.R.Z. is a guest editor invited by the Editorial Board.
Data deposition: The software in this paper has been deposited in GitHub, https://github.com/xiaoyeye/CNNC.
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1911536116/-/DCSupplemental.
References
- 1.Kuzmin E., et al. , Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Itzel T., et al. , Translating bioinformatics in oncology: Guilt-by-profiling analysis and identification of KIF18B and CDCA3 as novel driver genes in carcinogenesis. Bioinformatics 31, 216–224 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Hill S. M., et al. , Inferring causal molecular networks: Empirical assessment through a community-based effort. Nat. Methods 13, 310–318 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Maathuis M. H., Colombo D., Kalisch M., Buhlmann P., Predicting causal effects in large-scale systems from observational data. Nat. Methods 7, 247–248 (2010). [DOI] [PubMed] [Google Scholar]
- 5.Marbach D., et al. , Wisdom of crowds for robust gene network inference. Nat. Methods 9, 796–804 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Song L., Langfelder P., Horvath S., Comparison of co-expression measures: Mutual information, correlation, and model based indices. BMC Bioinf. 13, 328 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Langfelder P., Horvath S., WGCNA: An R package for weighted correlation network analysis. BMC Bioinf. 9, 559 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wei Z., Li H., A Markov random field model for network-based analysis of genomic data. Bioinformatics 23, 1537–1544 (2007). [DOI] [PubMed] [Google Scholar]
- 9.Huynh-Thu V. A., Irrthum A., Wehenkel L., Geurts P., Inferring regulatory networks from expression data using tree-based methods. PLoS One 5, e12776 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Chan T. E., Stumpf M. P. H., Babtie A. C., Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5, 251–267.e3 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lin C., Jain S., Kim H., Bar-Joseph Z., Using neural networks for reducing the dimensions of single-cell RNA-seq data. Nucleic Acids Res. 45, e156 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Freytag S., Gagnon-Bartsch J., Speed T. P., Bahlo M., Systematic noise degrades gene co-expression signals but can be corrected. BMC Bioinf. 16, 309 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Alavi A., Ruffalo M., Parvangada A., Huang Z., Bar-Joseph Z., A web server for comparative analysis of single-cell RNA-seq data. Nat. Commun. 9, 4768 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Song L., Crawford G. E., DNase-seq: A high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010, pdb.prot5384 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Sinha S., On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics 22, e454–e463 (2006). [DOI] [PubMed] [Google Scholar]
- 16.Crow M., Gillis J., Co-expression in single-cell analysis: Saving grace or original sin? Trends Genet. 34, 823–831 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Johnson D. S., Mortazavi A., Myers R. M., Wold B., Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007). [DOI] [PubMed] [Google Scholar]
- 18.Yevshin I., Sharipov R., Valeev T., Kel A., Kolpakov F., GTRD: A database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res. 45, D61–D67 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Schulz M. H., et al. , Reconstructing dynamic microRNA-regulated interaction networks. Proc. Natl. Acad. Sci. U.S.A. 110, 15686–15691 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Schulz M. H., et al. , DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data. BMC Syst. Biol. 6, 104 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Greenfield A., Madar A., Ostrer H., Bonneau R., DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PLoS One 5, e13397 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Wang Y. X., Waterman M. S., Huang H., Gene coexpression measures in large heterogeneous samples using count statistics. Proc. Natl. Acad. Sci. U.S.A. 111, 16371–16376 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Krishnaswamy S., et al. , Systems biology. Conditional density-based analysis of T cell signaling in single-cell data. Science 346, 1250689 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Khan A., et al. , JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Yue F., et al. , A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Gitter A., Carmi M., Barkai N., Bar-Joseph Z., Linking the signaling cascades and dynamic regulatory networks controlling stress responses. Genome Res. 23, 365–376 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K., KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Fabregat A., et al. , The reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Schmiedel J. M., et al. , Gene expression. MicroRNA control of protein expression noise. Science 348, 128–132 (2015). [DOI] [PubMed] [Google Scholar]
- 30.Subramanian A., et al. , Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Rappaport N., et al. , MalaCards: An amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 45, D877–D887 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Oliver S., Guilt-by-association goes global. Nature 403, 601–603 (2000). [DOI] [PubMed] [Google Scholar]
- 33.Zhang S., Yang R., Zheng Y., The effect of siRNA-mediated lymphocyte-specific protein tyrosine kinase (Lck) inhibition on pulmonary inflammation in a mouse model of asthma. Int. J. Clin. Exp. Med. 8, 15146–15154 (2015). [PMC free article] [PubMed] [Google Scholar]
- 34.Amodio M., et al. , Exploring single-cell data with deep multitasking neural networks. Nat. Methods, 10.1038/s41592-019-0576-7 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ding J., Condon A., Shah S. P., Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat. Commun. 9, 2002 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Lopez R., Regier J., Cole M. B., Jordan M. I., Yosef N., Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Eraslan G., Simon L. M., Mircea M., Mueller N. S., Theis F. J., Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Arvaniti E., Claassen M., Sensitive detection of rare disease-associated cell subsets via representation learning. Nat. Commun. 8, 14825 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Klein A. M., et al. , Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.10x Genomics (2018) 1.3 Million Brain Cells from E18 Mice. https://support.10xgenomics.com/single-cell-gene-expression/datasets. Accessed 8 May 2019.
- 41.Wingender E., Dietze P., Karas H., Knuppel R., TRANSFAC: A database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24, 238–241 (1996). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Cock P. J., et al. , Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Sales G., Calura E., Cavalieri D., Romualdi C., Graphite—a bioconductor package to convert pathway topology to gene network. BMC Bioinf. 13, 20 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Simonyan K., Zisserman A., Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (10 April 2015).
- 45.Glorot X., Bordes A., Bengio Y., “Deep sparse rectifier neural networks” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, Gordon G., Dunson D., Dudík M., Eds. (2011), vol. 15, pp. 315–323. [Google Scholar]
- 46.Ernst J., Plasterer H. L., Simon I., Bar-Joseph Z., Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Res. 20, 526–536 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]