Genomics, Proteomics & Bioinformatics. 2022 Mar 12;20(3):496–507. doi: 10.1016/j.gpb.2021.08.015

DeepCAGE: Incorporating Transcription Factors in Genome-wide Prediction of Chromatin Accessibility

Qiao Liu 1,2, Kui Hua 1, Xuegong Zhang 1, Wing Hung Wong 2, Rui Jiang 1
PMCID: PMC9801045  PMID: 35293310

Abstract

Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors (TFs) and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding statuses of TFs, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core TFs to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of TF activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a TF to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.

Keywords: Chromatin accessibility, Deep learning, Transcription factor, Gene expression

Introduction

One of the fundamental questions in functional genomics is how activities of genes are spatially and temporally controlled through interactive effects of transcription factors (TFs) and regulatory elements such as promoters, enhancers, and silencers. These regulatory elements, as short regions of non-coding DNA sequence, are known to typically reside in chromatin accessible regions and be bound by a set of TFs to carry out regulatory functions in a manner specific to cellular contexts [1]. Therefore, the exploration of a landscape of chromatin accessible regions across major cell types will greatly facilitate the deciphering of gene regulatory mechanisms and further provide insights into cell differentiation, tissue homeostasis, and disease development [2].

Recent advances in deep sequencing techniques have enabled genome-wide assays of chromatin accessibility. For example, DNase-seq utilizes the DNase I enzyme to digest DNA sequences and identify DNase I-hypersensitive regions that are largely chromatin accessible [3]. ATAC-seq uses the Tn5 transposase to integrate primer DNA sequences into cleaved fragments that mainly come from chromatin accessible regions [4]. With the accomplishment of the ENCODE [5] and Roadmap [6] projects, these techniques have been successfully applied to the establishment of the chromatin accessibility landscape for dozens of cell lines across several species. The accumulation of these data provides an unprecedented opportunity for deepening our understanding of both gene regulation and occurrence of diseases [7], [8], [9].

However, due to limitations such as experimental cost, it is still impractical to further extend the landscape to cover all possible cell types, with the consideration of the huge variability in cellular biological contexts such as cell differentiation, environmental stimuli, and other factors. Toward this concern, computational approaches have been proposed to predict chromatin states by using such information as DNA sequence, gene expression, and other types of data [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. For example, Kelley et al. proposed a deep convolutional neural network model called Basset to predict chromatin accessible regions purely relying on one-hot encoded DNA sequences [12]. Liu et al. developed a hybrid deep learning model for integrating multiple forms of sequence representations to achieve high prediction performance [14]. Quang et al. used a hybrid convolutional and recurrent neural network for predicting chromatin signals [18]. However, a model purely relying on sequence data can hardly be generalized to make predictions across different cell types as the sequence itself is not cell type-specific. To overcome this limitation, Zhou et al. proposed a regression model called BIRD that utilized only gene expression data to predict chromatin accessible regions [13]. Nevertheless, with the complete removal of sequence data, the scope of application of this method is limited because the availability of gene expression is not as wide as sequence data. With the aforementioned understanding, Nair et al. proposed a deep residual neural network [20] model called ChromDragoNN to combine both sequence and expression data toward the prediction of chromatin accessibility [21]. However, sequence signatures and expression features are combined by simple concatenation in this method. This formulation, though simple in computation, lacks enough interpretability and is not consistent with existing biological knowledge.

To address these limitations, we propose a method called DeepCAGE, that is, a Deep densely connected convolutional network for predicting Chromatin Accessibility by incorporating Gene Expression and binding statuses of TFs. Unlike BIRD and ChromDragoNN, which take full expression data as predictors, our method explicitly considers the binding statuses of chromatin-binding factors (e.g., TFs), based on the biological understanding that chromatin accessibility is largely determined by chromatin-binding factors that have access to DNA [2]. In a series of systematic evaluations, DeepCAGE achieves state-of-the-art performance in not only the classification of chromatin accessible statuses but also the regression of DNase-seq signals. To make DeepCAGE more understandable, we propose a strategy for visualizing the weights in the first convolutional layer. Interestingly, many known motifs were successfully recovered by DeepCAGE. In the downstream application to whole-genome sequencing (WGS) data analysis, DeepCAGE effectively prioritizes deleterious variants for the prediction and interpretation of complex phenotypes.

Method

Overview of DeepCAGE

DeepCAGE was designed based on the premise that binding statuses and gene expression of TFs could complement sequence data toward the precise prediction of chromatin accessibility. With this understanding, we designed DeepCAGE as a hybrid neural network that consisted of a convolutional module for sequence data and a feedforward module for chromatin accessibility prediction (Figure 1). Briefly, we applied the one-hot encoding to the input sequence data, fed the encoded data to a densely connected convolutional neural network (DenseNet), and took the output as the sequence feature. For binding statuses, we scanned the input sequence for potential binding sites for a set of 402 human TFs by using non-redundant motifs in the HOCOMOCO database [22] with the tool Homer [23]. We then selected the maximum score of reported binding sites for each TF to obtain a vector of 402 dimensions as the motif feature. For gene expression, we focused on log-transformed transcripts per million (TPM) values of the 402 TFs and obtained a vector of 402 dimensions after quantile normalization as the expression feature. With these data, we combined the two vectors of the motif and expression features by taking the element-wise product, and we concatenated the result to the sequence feature to obtain the hybrid feature, which went through a feedforward neural network with a fully connected hidden layer and an output layer for either classification or regression. We presented detailed hyperparameters of the hybrid network in Table S1.
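The feature combination described above can be illustrated with a minimal sketch in plain Python. The function names, the toy TF subset, the scores, and the choice of 0.0 for TFs without a reported binding site are all assumptions made for illustration; the actual model works on 402-dimensional vectors and a DenseNet output.

```python
def motif_feature(site_scores, tf_names):
    """Maximum reported binding-site score per TF, in the order of tf_names.

    TFs with no reported site get 0.0, which is an assumption of this sketch.
    """
    return [max(site_scores.get(tf) or [0.0]) for tf in tf_names]


def hybrid_feature(seq_feature, motif_feat, expr_feat):
    """Element-wise product of motif and expression vectors, appended to the sequence feature."""
    if len(motif_feat) != len(expr_feat):
        raise ValueError("motif and expression vectors must have the same length")
    tf_feature = [m * e for m, e in zip(motif_feat, expr_feat)]
    return list(seq_feature) + tf_feature


# Toy example with 3 TFs instead of 402 and a 2-dimensional sequence feature.
tf_names = ["EGR1", "E2F7", "JUNB"]
site_scores = {"EGR1": [2.0, 5.0], "JUNB": [1.5]}  # hypothetical motif-scan scores
expression = [0.5, 2.0, 2.0]                       # hypothetical normalized log TPM
motif = motif_feature(site_scores, tf_names)
hybrid = hybrid_feature([0.1, 0.2], motif, expression)
```

Because the two TF vectors are multiplied rather than concatenated, a TF contributes to the hybrid feature only when it is both expressed and has a candidate binding site in the region.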

Figure 1.


Overview of the DeepCAGE model

The sequence of the input DNA region is converted to a one-hot matrix and goes through a DenseNet to extract sequence features. Normalized expression levels of the 402 human TFs and the corresponding motif binding scores are combined by using an element-wise product and then concatenated with sequence features. The combined features are finally fed to a feedforward neural network for chromatin accessibility prediction. DenseNet, densely connected convolutional neural network; TPM, transcripts per million; Conv, convolution; GIS, gradient importance score; TF, transcription factor.

DeepCAGE extracts sequence features by using an architecture called the DenseNet, which has the advantage of alleviating the vanishing-gradient problem and strengthening feature propagation [24]. As shown in Figure 1, there are three dense blocks in our model. Each block includes five convolutional layers, and each layer connects to every other layer in a feedforward fashion. A convolutional layer consists of two consecutive small kernels of size 1×1 and 3×1, where the former reduces the concatenated channels to a fixed number, and the latter acts as the traditional convolution. A transition module is placed before each dense block for feature extraction and dimensionality reduction. An input sequence is first extended to a fixed length of 1000 bp centered at the midpoint of the sequence and then converted to a 1000×4 binary matrix by using the one-hot encoding. The matrix is then fed to the first transition module, which contains a convolutional layer and a max-pooling layer. The convolutional layer has 160 kernels of size 4×15 for extracting low-level features and detecting DNA binding motifs, while the max-pooling layer finds the most significant activation signal in a given sliding window of each kernel. Similar settings are used for the other two transition modules for extracting high-level features and reducing dimensionality. Rectified linear units (ReLU) are used after each convolution operation to keep positive activations and set negative activation values to zero. Batch normalization [25] and dropout [26] are used after each ReLU function to reduce internal covariate shift and avoid overfitting, respectively. The DeepCAGE regression model differs from the classification model in two major ways. First, the output layer directly uses a linear transformation instead of a sigmoid function. Second, the mean square error (MSE) instead of the cross-entropy is used as the loss function.
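As a small illustration of the input preparation only (not of the network), the sketch below computes a fixed-length window around a locus midpoint and one-hot encodes a sequence. The column order A, C, G, T and the all-zero encoding of ambiguous bases such as N are assumptions; the text only states that a 1000 bp sequence becomes a 1000×4 binary matrix.

```python
# Column order is an assumption of this sketch.
ALPHABET = "ACGT"


def locus_window(midpoint, length=1000):
    """Return (start, end) of a fixed-length window centered at a locus midpoint."""
    start = midpoint - length // 2
    return start, start + length


def one_hot(seq):
    """Return a len(seq) x 4 list of 0/1 rows; bases outside ACGT become all zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


window = locus_window(10_100)   # a 200 bp locus spanning 10,000-10,200 has midpoint 10,100
matrix = one_hot("ACGTN")       # 5 x 4 toy matrix; a real input would be 1000 x 4
```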

Data processing

DNase-seq BAM files and narrow peaks across 55 human cell types were downloaded from the ENCODE project [5] (Tables S2 and S3). The human hg19 reference genome was divided into non-overlapping regions (loci) of 200 bp. Considering that a cell type may have multiple DNase-seq replicates, a locus is regarded as chromatin accessible if it overlaps with narrow peak regions of at least half of the replicates and inaccessible otherwise (Figure S1). For the classification design, a binary label $y_{lk}$ is assigned to locus $l$, representing whether it is accessible in cell type $k$. For the regression design, BAM files of multiple replicates for a cell type are pooled, and the raw read count, $n_{lk}$, is obtained for locus $l$ in cell type $k$. To eliminate the effect of sequencing depth, the normalized read count, $\tilde{n}_{lk} = N n_{lk} / N_k$, is calculated, where $N_k$ denotes the total number of pooled reads for cell type $k$, and $N = \min_k \{N_k\}$ is the minimal number of pooled reads across all cell types. The normalized read counts are further log-transformed after adding a pseudocount of one. The transformed data represent the level of chromatin accessibility and are then used as the response variable in the regression model.
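The labeling rule and the depth normalization can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the function names and toy numbers are hypothetical, and the natural logarithm is an assumption because the text does not state the base of the log transform.

```python
import math


def accessible_label(overlaps_per_replicate):
    """Return 1 if the locus overlaps a narrow peak in at least half of the replicates."""
    if not overlaps_per_replicate:
        raise ValueError("at least one replicate is required")
    hits = sum(1 for overlap in overlaps_per_replicate if overlap)
    return 1 if 2 * hits >= len(overlaps_per_replicate) else 0


def normalize_counts(raw_counts, total_reads):
    """Scale pooled counts to the smallest library size, then apply log(1 + x).

    raw_counts[k] lists the raw counts per locus for cell type k;
    total_reads[k] is the total number of pooled reads for cell type k.
    """
    n_min = min(total_reads.values())
    return {
        k: [math.log(1.0 + n_min * n / total_reads[k]) for n in counts]
        for k, counts in raw_counts.items()
    }


raw = {"A": [10, 0], "B": [40, 4]}       # hypothetical raw counts for two loci
totals = {"A": 1000, "B": 4000}          # hypothetical library sizes
signal = normalize_counts(raw, totals)
```

In the toy data, cell type B has four times the sequencing depth of A, so a raw count of 40 in B maps to the same normalized signal as a raw count of 10 in A.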

RNA-seq data across the same 55 human cell types were also downloaded from the ENCODE project (Table S4). TPM of the 402 core human TFs were extracted from the gene expression data. After further log transformation and quantile normalization based on TPM values, the normalized expression within each cell type was averaged across multiple replicates, and the mean expression profile of each cell type was finally used.

WGS data and RNA-seq profiles of Genotype-Tissue Expression (GTEx) muscle tissues were downloaded from the Database of Genotypes and Phenotypes (dbGaP: phs000424.v7.p2). Matching these two types of data, a total of 491 donors were selected for downstream analysis (Table S5). For each of these donors, RNA-seq data were processed in the same way as ENCODE data, and WGS data were filtered by excluding all insertions/deletions (indels) and rare variants whose minor allele frequencies were less than or equal to 5% across all donors.

Model evaluation

Cell type-level five-fold cross-validation experiments are designed for evaluating our method. In each fold, the 55 cell types are partitioned into a training set with 44 cell types and a testing set with the remaining 11 cell types (Tables S6 and S7). Putative known accessible loci are identified as genomic regions (loci) that are chromatin accessible in at least two cell types in the training set. Putative novel accessible loci are identified as genomic regions that are accessible in at least two testing cell types and are not present in the training data.
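A minimal sketch of the cell type-level split and the locus definitions is given below. The helper names, the random seed, and the toy accessibility sets are hypothetical. Reading "not present in the training data" as "not in the putative known set" is an assumption of this sketch.

```python
import random
from collections import Counter


def cell_type_folds(cell_types, n_folds=5, seed=0):
    """Partition cell types into folds; each fold yields (training, testing) lists."""
    shuffled = list(cell_types)
    random.Random(seed).shuffle(shuffled)
    folds = []
    for i in range(n_folds):
        test = shuffled[i::n_folds]
        held_out = set(test)
        train = [c for c in shuffled if c not in held_out]
        folds.append((train, test))
    return folds


def loci_in_at_least(accessible, cell_types, min_types=2):
    """Loci that are accessible in at least min_types of the given cell types."""
    counts = Counter(locus for c in cell_types for locus in accessible[c])
    return {locus for locus, n in counts.items() if n >= min_types}


folds = cell_type_folds([f"cell_{i}" for i in range(55)])

# Toy accessibility sets: cell type -> set of accessible locus ids.
accessible = {"a": {1, 2}, "b": {2, 3}, "c": {4, 5}, "d": {5, 4, 2}}
known = loci_in_at_least(accessible, ["a", "b"])            # training cell types
novel = loci_in_at_least(accessible, ["c", "d"]) - known    # testing cell types
```

Splitting by cell type rather than by locus means every testing cell type is unseen during training, which is what makes the evaluation a cross-cell type prediction task.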

Cell type-wise and locus-wise metrics are defined to evaluate our method from different perspectives (Figure S2). Cell type-wise metrics are calculated within a testing cell type across genomic regions to provide high-level assessment of a method. Locus-wise metrics are calculated based on a genomic region across cell types to give a detailed analysis of the performance. These metrics provide a comprehensive and systematic evaluation of our method in both the classification and the regression designs.

Let $Y_{L \times K}$ and $\hat{Y}_{L \times K}$ be the true label matrix and the predicted matrix, where $L$ denotes the number of putative loci and $K$ denotes the number of cell types. In the classification design, $y_{lk}$ and $\hat{y}_{lk}$ denote the true binary label and the predicted probability of chromatin accessible status for locus $l$ in cell type $k$, respectively. In this situation, the cell type-wise area under the precision-recall curve (auPR) for cell type $k$ is calculated based on $y_k = (y_{1k}, y_{2k}, \ldots, y_{Lk})$ and $\hat{y}_k = (\hat{y}_{1k}, \hat{y}_{2k}, \ldots, \hat{y}_{Lk})$ as follows. Given a threshold $t$ for a cell type $k$, the precision is defined as the number of correct predictions ($\sum_l y_{lk} I(\hat{y}_{lk} > t)$) over the number of all predictions ($\sum_l I(\hat{y}_{lk} > t)$), and the recall is defined as the number of correct predictions over the number of truly accessible loci ($\sum_l y_{lk}$), where $I(x)$ is an indicator function that is equal to 1 if $x$ is true and 0 otherwise. Varying the threshold from 0 to 1 and calculating the precision and recall at each threshold value, the precision-recall curve can be drawn, and the area under this curve can be obtained. The locus-wise auPR for locus $l$ is calculated based on $y_l = (y_{l1}, y_{l2}, \ldots, y_{lK})$ and $\hat{y}_l = (\hat{y}_{l1}, \hat{y}_{l2}, \ldots, \hat{y}_{lK})$ in a similar way.
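The two views of the same matrices can be sketched in plain Python. The text does not say how the area is integrated, so this sketch uses the step-wise average precision as a stand-in for auPR, which is an assumption; tied scores are ordered arbitrarily here. The matrices are toy values with loci as rows and cell types as columns.

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    true_pos, area = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            true_pos += 1
            area += true_pos / rank   # precision at each recall step
    return area / n_pos


def cell_type_wise_aupr(Y, Y_hat, k):
    """auPR over all loci (rows) for one cell type (column k)."""
    return average_precision([row[k] for row in Y], [row[k] for row in Y_hat])


def locus_wise_aupr(Y, Y_hat, l):
    """auPR over all cell types (columns) for one locus (row l)."""
    return average_precision(Y[l], Y_hat[l])


Y = [[1, 0], [0, 1], [1, 1]]                       # toy labels: 3 loci x 2 cell types
Y_hat = [[0.9, 0.2], [0.8, 0.6], [0.7, 0.9]]       # toy predicted probabilities
by_cell = cell_type_wise_aupr(Y, Y_hat, 0)
by_locus = locus_wise_aupr(Y, Y_hat, 1)
```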

In the regression design, $y_{lk}$ and $\hat{y}_{lk}$ denote the true and predicted DNase-seq signals for locus $l$ in cell type $k$, respectively. In this situation, the cell type-wise Pearson correlation coefficient (PCC) for cell type $k$ is calculated as the PCC of $y_k$ and $\hat{y}_k$, and the locus-wise PCC is calculated based on $y_l$ and $\hat{y}_l$ in a similar way. The prediction squared error (PSE), which considers both cell type-wise prediction and locus-wise prediction, is calculated as $\mathrm{PSE} = \sum_k \sum_l (y_{lk} - \hat{y}_{lk})^2 / \sum_k \sum_l (y_{lk} - \bar{y}_k)^2$, where $\bar{y}_k = \sum_l y_{lk} / L$ is the mean of $y_k$.
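Both regression metrics are simple enough to write out directly. The sketch below uses toy matrices with loci as rows and cell types as columns; the function names are hypothetical. By construction, predicting the per-cell-type mean for every locus gives a PSE of exactly 1, so values below 1 indicate an improvement over that baseline.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (non-constant inputs)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def prediction_squared_error(Y, Y_hat):
    """Squared error of the predictions relative to the per-cell-type mean baseline."""
    n_loci, n_types = len(Y), len(Y[0])
    means = [sum(Y[l][k] for l in range(n_loci)) / n_loci for k in range(n_types)]
    num = sum((Y[l][k] - Y_hat[l][k]) ** 2 for l in range(n_loci) for k in range(n_types))
    den = sum((Y[l][k] - means[k]) ** 2 for l in range(n_loci) for k in range(n_types))
    return num / den


Y = [[1.0, 2.0], [3.0, 6.0]]                       # toy true signals
baseline = [[2.0, 4.0], [2.0, 4.0]]                # per-cell-type means as predictions
pse_baseline = prediction_squared_error(Y, baseline)
pse_perfect = prediction_squared_error(Y, Y)
pcc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```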

Two statistics, cell range and cell variability, are introduced to describe the activity of a locus based on the true DNase-seq signals across testing cell types. The cell range of locus $l$ is calculated as $\max(y_l) - \min(y_l)$, and the cell variability of locus $l$ is defined as the standard deviation of $y_l$.
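For concreteness, the two statistics can be computed per locus as below. The text does not say whether the population or the sample standard deviation is used, so the population form (`statistics.pstdev`) is an assumption of this sketch, and the signal values are hypothetical.

```python
import statistics


def cell_range(signals):
    """Difference between the largest and smallest signal of a locus across cell types."""
    return max(signals) - min(signals)


def cell_variability(signals):
    """Standard deviation of the signals of a locus across cell types (population form)."""
    return statistics.pstdev(signals)


locus_signals = [0.5, 2.0, 1.0]          # one locus across three testing cell types
locus_range = cell_range(locus_signals)
locus_sd = cell_variability([0.0, 2.0])
```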

Baseline methods

Basset [12], DeepSEA [10], and DanQ [18] are three representative neural network models that take only DNA sequences as input. BIRD [13] is a regression model that takes only gene expression data as input. ChromDragoNN [21] is a neural network-based model that takes both DNA sequences and gene expression data as input. Our method and ChromDragoNN have the following major differences. First, the design principles of these two methods are notably different. ChromDragoNN predicts chromatin accessibility through directly concatenating DNA sequences and expression data of all genes. DeepCAGE explains chromatin accessibility with DNA sequences and binding statuses of TFs. Therefore, DeepCAGE tries to interpret chromatin accessibility in a more natural way since chromatin accessibility is believed to be largely determined by the occupancy and topological organization of nucleosomes as well as chromatin-binding factors [2]. Second, the network architectures of these two methods are different. ChromDragoNN uses a ResNet to extract sequence features, while DeepCAGE uses a DenseNet that is a relatively new architecture and has also been experimentally validated to outperform ResNet in many tasks [20]. Third, inputs of these two methods are also different. ChromDragoNN requires DNA sequences and expression data of all genes, while DeepCAGE takes DNA sequences and expression data of 402 human core TFs as input. Motif binding profiles of these TFs can be annotated with the existing motif database, which can be precomputed without additional experimental cost.

Gradient importance score

DeepCAGE takes advantage of the gradient importance score (GIS) to prioritize TFs given a cell type and a genomic locus. Briefly, a locus is extended to a 200 kb genomic region centered at the midpoint of the locus. Then, the average absolute gradient of predicted accessibility within the extended region with respect to the expression of a TF is calculated as:

$$\mathrm{GIS}_{ki} = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \left| \frac{\partial \hat{y}_{lk}}{\partial g_{ki}} \right|$$

where $\hat{y}_{lk}$ denotes the predicted accessibility of locus $l$ in cell type $k$, $g_{ki}$ denotes the expression of TF $i$ in cell type $k$, and $\mathcal{L}$ denotes the set of putative regulatory elements that contains all accessible loci within the extended region. The GIS gives an intuition of which TFs play an important role in a specific cell type.
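The averaging of absolute gradients can be illustrated without a deep learning framework. The sketch below approximates the gradient by central finite differences, which is a stand-in for the backpropagated gradient a trained network would provide, and it uses a hypothetical linear predictor so the result is easy to check by hand.

```python
def gradient_importance_score(predict, loci, expression, tf_index, eps=1e-4):
    """Mean absolute gradient of the prediction with respect to one TF's expression.

    predict(locus, expression) is any scalar predictor; the gradient is
    approximated numerically, which is an assumption of this sketch.
    """
    total = 0.0
    for locus in loci:
        up, down = list(expression), list(expression)
        up[tf_index] += eps
        down[tf_index] -= eps
        grad = (predict(locus, up) - predict(locus, down)) / (2.0 * eps)
        total += abs(grad)
    return total / len(loci)


def toy_predict(locus, expression):
    """Hypothetical predictor: each locus is a vector of per-TF motif scores."""
    return sum(score * expr for score, expr in zip(locus, expression))


loci = [[1.0, -2.0], [3.0, 0.5]]    # two loci, two TFs (toy motif scores)
expression = [0.7, 1.2]             # toy TF expression in one cell type
gis_tf0 = gradient_importance_score(toy_predict, loci, expression, 0)
gis_tf1 = gradient_importance_score(toy_predict, loci, expression, 1)
```

For the toy predictor the gradient with respect to a TF's expression is that TF's motif score, so the scores reduce to the mean absolute motif score across loci (2.0 and 1.25).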

Motif analysis

The weights of the kernels from the first convolutional layer are converted into position weight matrices (PWMs) by counting subsequence occurrences in a set of input sequences that activate a kernel at a threshold value. All subsequences with activation values that are greater than the threshold of a kernel are pooled together and aligned. The PWMs are then composed of the frequencies of the four nucleotides (A, C, G, and T) at each position. A subsequence at position $i$ is regarded as activated if

$$\sum_{m=0}^{M-1} \sum_{n=0}^{N-1} w^{k}_{m,n} \, x^{j}_{i+m,n} > \alpha \cdot \mathrm{MAV}_k$$

where $M \times N$ denotes the size of the kernels (4×15 in the first convolutional layer), and $\alpha$ is the control coefficient with the default value of 0.7 in all experiments. $\mathrm{MAV}_k$ denotes the maximal activation value of kernel $k$ and is represented as:

$$\mathrm{MAV}_k = \max_{i,j} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} w^{k}_{m,n} \, x^{j}_{i+m,n}$$

Motifs are identified using the tool TomTom (v4.12.0) [27] with the E-value threshold of 0.05 and are compared to known motifs in the JASPAR database (v2018) [28]. Besides, the information content of recovered motifs is calculated based on the information entropy, as $\mathrm{IC} = \sum_{i,j} (p_{ij} \log_2 p_{ij} - b_i \log_2 b_i)$, where $p_{ij}$ is the element in the PWM, $i$ and $j$ are the nucleotide type and position, respectively, and $b_i$ (default value: 0.25) is the background frequency of nucleotide $i$.
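The kernel-to-PWM conversion and the information content can be sketched on a toy kernel. Here the kernel is stored as one row of four weights per position (order A, C, G, T), sequences are assumed to contain only those four letters, and the convention 0·log2(0) = 0 is used; these layout choices and the two-position kernel are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def activation(kernel, seq, start):
    """Kernel response at one position; with one-hot input this just picks weights."""
    return sum(kernel[m][ALPHABET.index(seq[start + m])] for m in range(len(kernel)))


def kernel_to_pwm(kernel, seqs, alpha=0.7):
    """Nucleotide frequencies of subsequences whose activation exceeds alpha * max."""
    width = len(kernel)
    scored = [(activation(kernel, s, i), s[i:i + width])
              for s in seqs for i in range(len(s) - width + 1)]
    max_activation = max(score for score, _ in scored)
    hits = [sub for score, sub in scored if score > alpha * max_activation]
    if not hits:
        raise ValueError("no subsequence passed the activation threshold")
    return [[sum(sub[m] == base for sub in hits) / len(hits) for base in ALPHABET]
            for m in range(width)]


def information_content(pwm, background=0.25):
    """Sum over all PWM entries of p*log2(p) - b*log2(b)."""
    total = 0.0
    for column in pwm:
        for p in column:
            if p > 0:
                total += p * math.log2(p)
            total -= background * math.log2(background)
    return total


toy_kernel = [[1, 0, 0, 0], [0, 1, 0, 0]]   # hypothetical weights favoring "AC"
pwm = kernel_to_pwm(toy_kernel, ["GACT", "TTAC"])
ic = information_content(pwm)
```

A fully conserved position contributes 2 bits and a uniform position contributes 0 bits, so the two-position toy motif has an information content of 4 bits.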

Phenotype prediction

A linear regression model with an $\ell_1$ penalty is adopted to predict the heights of GTEx donors using the deleterious scores of variants, as:

$$h = \alpha_0 + \sum_{k=1}^{K} \alpha_k \Delta O_k$$

where $h$ is the height of a GTEx donor, and $\Delta O_k$ denotes the deleterious score of variant $k$ calculated using DeepCAGE. The coefficient of the $\ell_1$ penalty is set to 0.5. A ten-fold cross-validation experiment is used in validation, and the average coefficient of determination ($R^2$) is used for evaluating how much variance in the phenotype can be explained.
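A compact way to see what the penalized model does is a coordinate-descent fit in plain Python. The text does not name the solver or the scaling of the loss, so this sketch assumes the objective (1/(2n))·Σ(h − α0 − Σ αk·ΔOk)² + λ·Σ|αk| with λ = 0.5; the data are hypothetical and the ten-fold cross-validation is omitted.

```python
def soft_threshold(value, penalty):
    """Shrink a value toward zero by the penalty; small values become exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X, y, penalty=0.5, n_iter=100):
    """Coordinate descent for L1-penalized least squares with an unpenalized intercept."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction without feature j, used to form the partial residual.
                partial = intercept + sum(coef[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, penalty) / (norm / n) if norm > 0 else 0.0
        intercept = sum(y[i] - sum(coef[k] * X[i][k] for k in range(p))
                        for i in range(n)) / n
    return intercept, coef


def r_squared(y, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


X = [[-1.0], [1.0]]            # toy deleterious scores: 2 donors, 1 variant
heights = [168.0, 172.0]       # toy phenotype values
alpha0, alphas = lasso_fit(X, heights)
predicted = [alpha0 + alphas[0] * row[0] for row in X]
r2 = r_squared(heights, predicted)
```

In the toy data the unpenalized slope would be 2.0; the penalty shrinks it to 1.5, which illustrates how weakly informative variants are driven toward zero.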

Results

DeepCAGE accurately predicts binary chromatin accessibility statuses

We first evaluated the performance of DeepCAGE in predicting whether an input DNA sequence is chromatin accessible or not. To achieve this objective, we downloaded paired DNase-seq and RNA-seq data across 55 cell types from the ENCODE project [5] and conducted a five-fold cross-validation experiment at the cell type level. In each fold of the validation, we partitioned the data into a training set of 44 cell types and a testing set of the remaining 11 cell types. We then defined putative known accessible loci as genomic regions that are chromatin accessible in at least two cell types in the training data. For each cell type, we further identified a positive set of putative loci that are accessible in the cell type and a negative set of putative loci that are inaccessible. After that, we trained our model on the training data and classified positive loci against negative ones for each testing cell type. Finally, we calculated a criterion called the cell type-wise auPR (see Method) to evaluate the performance of a classification method.

We compared the performance of DeepCAGE with four existing methods, including Basset [12], DeepSEA [10], DanQ [18], and ChromDragoNN [21] in the aforementioned cross-validation experiment. Results (Figure 2A) show that DeepCAGE achieves the highest performance with the mean cell type-wise auPR of 0.418 for known accessible loci, compared to 0.166 of Basset, 0.195 of DeepSEA, 0.188 of DanQ, and 0.319 of ChromDragoNN. Particularly, DeepCAGE outperforms sequence-based methods by a large margin, suggesting that these methods may fail in capturing cell type-specific information. Further analysis shows that the proportion of positive loci is in general small in a cell type and exhibits large variation (ranging from 2.6% to 29%), suggesting the ability of our method in dealing with unbalanced data.

Figure 2.


Performance of the DeepCAGE classification model

A. DeepCAGE achieves the highest cell type-wise auPR for both known accessible loci and novel accessible loci compared to baseline methods (Basset, DeepSEA, DanQ, and ChromDragoNN). B. The performance of DeepCAGE for loci with different activities across testing cell types. auPR, area under the precision-recall curve.

We then took one step further to assess the ability of our method in predicting novel chromatin accessible loci. In each fold of the validation experiment, we identified putative novel accessible loci as genomic regions that are accessible in at least two testing cell types and are not present in the training data, and we applied the trained model to predict whether these loci are accessible or not in a testing cell type. Results, as shown in Figure 2A, also suggest the superiority of DeepCAGE with a mean cell type-wise auPR of 0.181, compared to 0.107 of Basset, 0.104 of DeepSEA, 0.110 of DanQ, and 0.151 of ChromDragoNN.

We finally analyzed how the cell type specificity of accessible regions affects the prediction performance of our method. To achieve this objective, we divided the putative known accessible loci into three groups based on the proportion of cell types in which a locus is accessible. We then evaluated the cross-validation results using a criterion called the locus-wise auPR, which evaluates the prediction performance of a method on an accessible locus across cell types (see Method). Results show that for a locus accessible in fewer than 10% of cell types, DeepCAGE achieves a mean locus-wise auPR of 0.578, and this criterion increases when a locus is accessible in more cell types (Figure 2B). These results suggest that the cell type specificity is likely a factor that affects the prediction performance of a method.

DeepCAGE recovers a continuous degree of chromatin accessibility

In the aforementioned classification experiments, we only considered the binary accessible status of a genomic region in a specific cell type. In the real situation, however, the accessibility of a genomic region given by a DNase-seq experiment is in a continuous form. Considering this situation, we further proposed a DeepCAGE regression model to predict the degree of chromatin accessibility for a DNA region, which is defined as the normalized average count of raw reads that fall into the corresponding region.

With the same cross-validation settings as in the aforementioned section, we compared the performance of DeepCAGE to two baseline methods, BIRD [13] and ChromDragoNN [21], and we assessed regression results in terms of two criteria, the cell type-wise PCC and PSE (Figure 3A–C; see Method). Results show that DeepCAGE achieves a mean cell type-wise PCC of 0.785, compared to 0.637 for BIRD and 0.735 for ChromDragoNN (Figure 3B). Further analysis shows that in 18.2% of the testing cell types, DeepCAGE achieves a cell type-wise PCC of 0.85 or higher. In two cell types, DeepCAGE even achieves a cell type-wise PCC of 0.9 or higher (see examples in Figure 3A). DeepCAGE also achieves the minimal PSE (0.42), outperforming the two baseline methods (0.77 for BIRD and 0.57 for ChromDragoNN) by a quite large margin (Figure 3C).

Figure 3.


Performance of the DeepCAGE regression model

A. DeepCAGE predicts DNase-seq signals in five testing cell types. B. Cell type-wise PCC for three different methods across all testing cell types. *, two-sided paired-sample Wilcoxon signed-rank test P value = 3.37 × 10^-5. C. PSE for three different methods across all testing cell types. D. Locus-wise PCC achieved by DeepCAGE with respect to two statistics with both known accessible loci and novel accessible loci. E. Locus-wise PCC achieved by DeepCAGE considering the number of accessible cell types under known accessible loci. F. An example of true (green) and predicted (yellow) DNase-seq signals of three testing cell types under the same genomic region (Chr1:42.83–42.93 Mb). Mean signal (red) denotes the average DNase-seq signal across all training cell types. PCC, Pearson correlation coefficient; PSE, prediction squared error.

We then explored the performance of DeepCAGE for putative accessible loci with different cell type specificity by introducing two statistics, cell range and cell variability, to describe the activity dynamics of a genomic region based on the true DNase-seq signals across cell types (see Method). We divided known and novel accessible loci into three groups (low, medium, and high) according to the 1/3 and 2/3 quantiles of these statistics. Results show that DeepCAGE has high performance for accessible loci with medium cell range and cell variability (Figure 3D), consistent with the results in BIRD [13]. Briefly, DeepCAGE achieves a median locus-wise PCC (see Method) of 0.512 for known accessible loci with medium cell range, compared to 0.435 and 0.399 for loci with low and high cell ranges, respectively. When using the statistic of cell variability, DeepCAGE achieves median locus-wise PCCs of 0.384, 0.514, and 0.448 for known accessible loci with low, medium, and high cell variabilities, respectively. The results are similar for novel accessible loci, except that the values of the criteria are slightly lower. We further divided known accessible loci into five groups based on the number of cell types in which a locus is accessible. Results (Figure 3E) show that the performance of DeepCAGE varies considerably for loci accessible in different numbers of cell types. Briefly, the performance is high for loci accessible in a medium proportion of cell types and low for those accessible in only a small proportion of cell types.

Finally, we visualized both the true (green) and predicted (yellow) DNase-seq signals of a sample genomic region across three testing cell types (GM12878, HepG2, and H1-hESC) in the UCSC genome browser [29]. We also provided the mean signal (red; calculated by taking the average DNase-seq signals across all training cell types) as a reference. As shown in Figure 3F, DeepCAGE clearly distinguishes the DNase-seq signals of the three testing cell types, whereas the mean signal does not.

Model ablation analysis of DeepCAGE

We studied the contributions of gene expression and binding scores of TFs to the performance of our method. Taking the DeepCAGE regression model as an example, by discarding gene expression data, the median cell type-wise PCC decreased by 13.1% (Figure S3; P = 6.53 × 10^-11, one-sided paired-sample Wilcoxon signed-rank test). By removing binding scores, the median cell type-wise PCC decreased by 3.6% (Figure S3; P = 3.78 × 10^-4, one-sided paired-sample Wilcoxon signed-rank test). These results suggest that gene expression data could significantly help improve the performance of DeepCAGE in cross-cell type prediction, while binding scores slightly increase the performance. One potential reason behind this observation is that a large proportion of DNA sequence motifs have already been learned in the convolution layers of the neural network, and thus the binding scores only provide complementary information regarding DNA sequence features.

Besides, to demonstrate the superiority of the network architecture used by DeepCAGE, we additionally conducted the following two experiments. First, we replaced the DenseNet with a ResNet which had the same number of layers as the number of dense blocks and the same hidden nodes in the convolutional layers. Results show that DenseNet leads to a 6.4% increase in performance over ResNet in terms of the median cell type-wise PCC (Figure S4; P = 3.15 × 10^-6, one-sided paired-sample Wilcoxon signed-rank test). Second, we explored the influence of two key hyperparameters (the number of residual blocks and the convolutional layers within a residual block) on the performance of ChromDragoNN. It is noted that a deeper model architecture does not help improve the performance significantly (Figure S5).

GIS helps prioritize cell type-related TFs

We proposed a strategy for prioritizing cell type-related TFs according to the absolute gradient of the predicted accessibility with respect to the expression of a TF. Taking the K562 cell line as an example, we calculated the average GISs of all TFs over all putative loci within the region from 100 kb upstream to 100 kb downstream of the tumor suppressor gene TP53, which has been shown to have a key role in myeloid blast transformation [30]. The average GISs of all TFs across cell types with respect to the transcription start site (TSS) of this gene are shown in Figure 4A. The 402 human core TFs were then prioritized by their average GISs in the K562 cell line (Figure 4B). Interestingly, many top-ranked TFs were related to functions in leukemia cells validated by literature. For example, EGR1 (ranked 1st) was involved in regulating PMA-induced megakaryocytic differentiation of the K562 cell line [31]; the inhibition of E2F7 (ranked 3rd) might lead to a reduction of miRNAs involved in leukemic cell lines [32]; the expression of JunB (ranked 5th) was inactivated by methylation in chronic myeloid leukemia [33]. The Gene Ontology (GO) terms enriched by the top 5% prioritized TF coding genes also included biological processes of leukocyte differentiation and hematopoietic development (Figure 4C). To sum up, the GIS gives us an intuitive interpretation of which TF may play an important role in predicting chromatin accessibility given a specific cell type and a genomic region.

Figure 4.


GIS helps identify important TFs

A. GIS heatmap of the 402 human core TFs across 55 cell types. B. Bar chart showing the GISs of the 20 top-ranked TFs in the K562 cell line. C. Enriched GO terms by top-ranked TFs in the K562 cell line. GO, Gene Ontology.

DeepCAGE automatically learns binding motifs of TFs

To make DeepCAGE more interpretable, we examined the features it learned automatically by inspecting the weights of the 160 kernels in the first convolutional layer. Briefly, we converted the weights into PWMs (see Method) and compared them with known motifs in the JASPAR database [28]. We found that 48 (30%) of the kernels matched known motifs at an E-value threshold of 0.05. Among the matched kernels, 25 (52%) matched at least one of the human core TFs used in the DeepCAGE model. We then calculated the information content of each kernel (see Method). In addition, we set the weights of each kernel to zero in turn and defined the resulting decrease in the cell type-wise PCC as the influence score of that kernel. Figure 5A shows several learned motifs with high influence scores that match no known motif, and Figure 5B illustrates examples of learned motifs that match known motifs in the JASPAR database. These results demonstrate that DeepCAGE not only recovers known binding motifs but also has the potential to guide the discovery of novel motifs that have not yet been identified experimentally.
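As an illustration of the information content mentioned above, the sketch below uses the standard definition for a DNA motif: the sum over positions and bases of p × log2(p / background), with a uniform background of 0.25. That the paper uses exactly this definition is an assumption (its Method section gives the actual formula), and the example PWMs are invented.

```python
import math

def information_content(pwm, background=0.25):
    """Total information content in bits of a PWM.

    `pwm` is a list of positions, each a list of [A, C, G, T] probabilities.
    """
    total = 0.0
    for column in pwm:
        for p in column:
            if p > 0:  # the term 0 * log2(0) is taken to be 0
                total += p * math.log2(p / background)
    return total

# Invented example motifs.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 4                  # no base preference
specific = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # one fixed base per position
mixed = [[0.5, 0.5, 0.0, 0.0]]                            # two equally likely bases

for name, pwm in [("uniform", uniform), ("specific", specific), ("mixed", mixed)]:
    print(name, information_content(pwm))
```

A kernel with no base preference carries no information, whereas each fully specified position contributes the maximum of 2 bits.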

Figure 5.

DeepCAGE recovers both known and novel motifs

A. DeepCAGE identifies both known and novel motifs in the learning process. Green dots and yellow dots represent known and novel motifs recovered by DeepCAGE, respectively. B. Matched motifs with an E-value threshold of 0.05 in the format of sequence logos (above: known motif from the JASPAR database; below: motif learned by DeepCAGE).

DeepCAGE prioritizes putative deleterious variants in personal genomes

We applied DeepCAGE to WGS data analysis to demonstrate how our method can benefit the detection of individual-specific deleterious variants in regulatory elements that potentially influence a phenotype. The principle is to quantify the degree to which a genetic variant affects the chromatin accessibility of a nearby genomic region and then prioritize variants accordingly. As shown in Figure 6A, for each individual, we fed the individual genome and the reference genome separately to the trained DeepCAGE regression model and calculated a prediction score for each. We then took the absolute log2 fold change of these two scores as a measure of the change in chromatin accessibility. For a variant, we defined its individual-level deleterious score as the change in chromatin accessibility of the 200 bp genomic region around it. Finally, we obtained the cohort-level deleterious score of a variant by applying this procedure to all individuals in a cohort who carry the variant and then averaging their individual-level deleterious scores. Note that we also took the expression profile of TFs in muscle tissue as input and considered only WGS variants with a minor allele frequency larger than 5%.
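The two scores defined above reduce to a few lines of arithmetic. The sketch below assumes that the predicted accessibility values are strictly positive (the text does not say whether a pseudocount is used), and the numbers are invented; the function names are hypothetical.

```python
import math

def individual_score(pred_personal, pred_reference):
    """Individual-level score: absolute log2 fold change of the two predictions.

    Assumes both predictions are strictly positive.
    """
    return abs(math.log2(pred_personal / pred_reference))

def cohort_score(carrier_predictions, pred_reference):
    """Cohort-level score: mean individual-level score over carriers of the variant."""
    scores = [individual_score(p, pred_reference) for p in carrier_predictions]
    return sum(scores) / len(scores)

# Invented predictions for the 200 bp region around one variant.
reference = 2.0              # prediction from the reference genome
carriers = [4.0, 1.0, 2.0]   # predictions from three carriers' personal genomes

print(cohort_score(carriers, reference))
```

Taking the absolute value means that a variant that doubles predicted accessibility and one that halves it receive the same score, so the measure captures the magnitude of the change rather than its direction.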

Figure 6.

DeepCAGE helps prioritize and interpret WGS variants

A. The deleterious score is calculated as the absolute value of the log2 fold change between the predicted chromatin accessibility of the REF genome and that of the personal genome from WGS data. B. WGS variants within a risk region were ranked by averaging deleterious scores across donors carrying the variant. C. The absolute log2 fold change of average height with respect to top-K and bottom-K ranked variants (K = 20, 40, and 80) around a height-associated gene. *, P < 0.05. D. Prediction of the height phenotype using the deleterious scores of all variants, top-ranked variants, and bottom-ranked variants. REF, reference; WGS, whole-genome sequencing.

We downloaded WGS data of 491 donors with the height phenotype from the dbGaP of the GTEx project (Table S5). We collected 3290 risk single nucleotide polymorphisms (SNPs) that were associated with height in a large-scale genome-wide association study [34]. For each risk SNP, we defined a risk region as the 200 kb genomic region centered at the SNP. We then ranked the SNPs within each risk region according to their cohort-level deleterious scores obtained from the donors (Figure 6B). As an illustration, we examined the risk region around a risk SNP (rs5742714) in the promoter region of IGF1, a gene encoding a well-known growth factor [35]. The top-ranked variants within this risk region showed a clearly greater absolute log2 fold change of average height than the bottom-ranked variants (Figure 6C).
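The windowing and ranking step can be sketched as below. The variant records, positions, and scores are invented, all variants are assumed to lie on the same chromosome as the risk SNP, and treating the window boundaries as inclusive is also an assumption.

```python
def rank_risk_region(risk_pos, variants, window=200_000):
    """Return the variants inside the window centered at risk_pos, highest score first."""
    half = window // 2
    in_region = [v for v in variants if abs(v["pos"] - risk_pos) <= half]
    return sorted(in_region, key=lambda v: v["score"], reverse=True)

# Hypothetical variants with cohort-level deleterious scores.
variants = [
    {"id": "var1", "pos": 1_000_500, "score": 0.12},
    {"id": "var2", "pos": 1_090_000, "score": 0.40},
    {"id": "var3", "pos": 1_250_000, "score": 0.90},  # outside the 200 kb window
    {"id": "var4", "pos": 950_000, "score": 0.05},
]

ranked = rank_risk_region(1_000_000, variants)
print([v["id"] for v in ranked])
```

Note that `var3` has the highest score overall but is excluded because it lies more than 100 kb from the risk SNP.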

We then quantitatively explored how much of the variance of the height phenotype can be explained by the deleterious scores of risk variants. To this end, we used a linear regression model with an L1 penalty, which takes the deleterious scores of a set of variants as predictors and the height phenotype as the response variable (see Method). The 1,103,572 WGS variants within the 3290 risk regions together explained 2.49% of the height variance. Furthermore, the variants ranked among the top 10% by deleterious score in each risk region together explained 2.11% of the height variance. These results suggest that the small portion of variants prioritized by our method already contains most of the information that helps explain the phenotype. In contrast, the bottom-ranked 10% of variants failed to explain the height phenotype (Figure 6D). In conclusion, DeepCAGE is capable of fine-mapping putative risk variants and prioritizing WGS variants that might be associated with a specific phenotype.
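For readers unfamiliar with L1-penalized regression, the sketch below fits a lasso model by cyclic coordinate descent and reports the fraction of variance explained (R²). It is a toy illustration only: the solver, the penalty strength `alpha`, the in-sample R², and the data are all assumptions for the example, and the paper's Method section describes the procedure actually used.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_fit(X, y, alpha, n_iter=100):
    """Minimize (1/2n) * ||y - b - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Center the data so that the intercept can be recovered after fitting.
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Residual with the contribution of feature j left out.
                partial = yc[i] - sum(w[k] * Xc[i][k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
            norm = sum(Xc[i][j] ** 2 for i in range(n))
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
    b = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, b

def r_squared(X, y, w, b):
    """Fraction of the variance of y explained by the fitted model (in-sample)."""
    y_mean = sum(y) / len(y)
    ss_res = sum(
        (yi - b - sum(wk * xk for wk, xk in zip(w, xi))) ** 2 for xi, yi in zip(X, y)
    )
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Invented data: the response depends on the first predictor only.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
y = [1.0 + 2.0 * row[0] for row in X]

w, b = lasso_fit(X, y, alpha=0.1)
print(w, b, r_squared(X, y, w, b))
```

The L1 penalty sets the coefficient of the uninformative second predictor exactly to zero, which is the property that makes this kind of model suitable when the number of candidate variants is large.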

Discussion

In this study, we introduce DeepCAGE, a deep learning framework for genome-wide prediction of chromatin accessibility. A hallmark of our method is the incorporation of sequence data and the binding statuses of TFs into a unified deep neural network. With these two types of information complementing each other, our method overcomes the limitations of existing approaches and demonstrates state-of-the-art performance in both classification and regression of chromatin accessibility signals. Our method provides insights into functional genomics in two respects. First, the GIS gives an intuitive measure of the contribution of a TF to regulation at a specific locus in a given cell type. Second, the visualization of convolutional kernels demonstrates that the features automatically extracted by our method are not only consistent with existing knowledge but also contain potentially novel binding motifs of TFs. Such interpretability will benefit the dissection of the regulatory landscape under a variety of cellular conditions. Our method also makes it possible to interpret and prioritize putative deleterious variants in genetic studies. This ability to explain complex traits can be further explored to promote the understanding of disease-associated genetic variants.

Our model can certainly be improved in the following respects. First, we currently ignore the expression of genes that encode proteins other than TFs. However, it has been shown that proteins such as chromatin regulators, a class of enzymes with specialized functional domains, can shape and maintain the epigenetic state in a cell context-dependent fashion [36], and thus can also provide information for inferring the chromatin accessibility state [37], [38]. Incorporating information on these chromatin regulators into our model is one direction of our future work. Second, since the prediction of chromatin accessibility has been explored at the single-cell level [39], [40], [41], [42], it should be possible to extend the predictive power of DeepCAGE to the single-cell level by incorporating single-cell gene expression data. Third, our model currently identifies chromatin accessible regions in a cell type-specific manner but cannot further distinguish the specific types of potential regulatory elements in these regions. With the accumulation of annotations of cis-regulatory elements such as enhancers and silencers [43], [44], [45], [46], as well as computational methods for predicting interactions between these elements [47], [48], [49], [50], [51], we expect that our framework can be further extended to uncover the comprehensive relationship between different types of genomic regulatory elements and the genome-wide transcriptomic profile.

Code availability

DeepCAGE is freely available at https://github.com/kimmo1019/DeepCAGE with step-by-step instructions. DeepCAGE is also available at NGDC BioCode with accession https://ngdc.cncb.ac.cn/biocode/tools/BT007170.

CRediT author statement

Qiao Liu: Conceptualization, Software, Formal analysis, Writing - original draft, Writing - review & editing, Visualization. Kui Hua: Writing - review & editing. Xuegong Zhang: Supervision. Wing Hung Wong: Conceptualization, Investigation, Supervision, Writing - review & editing, Funding acquisition. Rui Jiang: Conceptualization, Investigation, Supervision, Writing - review & editing, Funding acquisition. All authors have read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

This work has been partly supported by the National Natural Science Foundation of China (Grant Nos. 61721003, 61873141, and 61573207), the National Key R&D Program of China (Grant No. 2018YFC0910404), and the Tsinghua-Fuzhou Institute for Data Technology. This work was also supported by the National Institutes of Health grants (Grant Nos. P50HG007735 and R01HG010359). The Genotype-Tissue Expression project was supported by the Common Fund of the Office of the Director of the National Institutes of Health. We thank Mengmeng Wu, Zhana Duren, and Fengling Chen for their helpful discussions.

Handled by Zhihua Zhang

Footnotes

Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences / China National Center for Bioinformation and Genetics Society of China.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2021.08.015.

Contributor Information

Wing Hung Wong, Email: whwong@stanford.edu.

Rui Jiang, Email: ruijiang@tsinghua.edu.cn.

Supplementary material

The following are the Supplementary data to this article:

Supplementary Figure S1

Identification of putative accessible loci. Putative accessible loci were determined from DNase-seq peaks in the following three steps. Step 1: The human reference genome was divided into non-overlapping regions of 200 bp. Step 2: A region was kept if it was contained in half of the replicates of a cell type. Step 3: Putative accessible loci were determined by collecting the regions that appear in at least two cell types.

mmc1.pdf (12.1KB, pdf)
Supplementary Figure S2

Definition of cell type-wise and locus-wise criteria. In the cell type-wise evaluation, the auPR and Pearson’s correlation coefficient were calculated based on the rows of both the label matrix and the predicted matrix. In the locus-wise evaluation, the Pearson’s correlation coefficient was calculated based on the columns of both the label matrix and the predicted matrix. auPR, area under the precision-recall curve.

mmc2.pdf (27.4KB, pdf)
Supplementary Figure S3

Model ablation analysis of DeepCAGE. A. Four models were designed by considering different inputs. B. When both the expression and the motif scores of transcription factors are removed (model 1), the median cell type-wise PCC decreases from 0.795 to 0.660. If only gene expression (model 2) or only motif scores (model 3) are discarded, the median cell type-wise PCC decreases to 0.664 and 0.759, respectively. PCC, Pearson correlation coefficient.

mmc3.pdf (1.6MB, pdf)
Supplementary Figure S4

Ablation study for the model architecture of DeepCAGE. A. We implemented the DeepCAGE model with two different architectures (DenseNet and ResNet). Note that we used a ResNet with three layers (equal to the number of dense blocks), and that the number of hidden nodes and the convolutional kernel size are the same as those of the convolutional layers in DenseNet. B. DeepCAGE with the DenseNet architecture achieves a median cell type-wise PCC of 0.795, while DeepCAGE with the ResNet architecture achieves an average cell type-wise PCC of 0.731.

mmc4.pdf (210.8KB, pdf)
Supplementary Figure S5

The cross-cell-type prediction performance of ChromDragoNN with different hyperparameter settings. A. The cell type-wise PCC of ChromDragoNN with different numbers of convolutional layers in each residual block. B. The cell type-wise PCC of ChromDragoNN with different numbers of residual blocks.

mmc5.pdf (323.4KB, pdf)
Supplementary Table S1

Hyperparameters of the DeepCAGE model. Note: The hyperparameters were determined mainly by choosing the number of dense blocks {1,3,5}, the learning rate {0.01,0.001,0.0001}, and the number of hidden nodes in the feed-forward network {128,256,512} with the help of the Hyperopt library (http://hyperopt.github.io/hyperopt/).

mmc6.docx (18.8KB, docx)
Supplementary Table S2

Information on the DNase-seq peaks across 55 cell types from the ENCODE project

mmc7.xlsx (20.2KB, xlsx)
Supplementary Table S3

Information on the DNase-seq BAM files across 55 cell types from the ENCODE project

mmc8.xlsx (19KB, xlsx)
Supplementary Table S4

The information of RNA-seq data across 55 cell types from the ENCODE project

mmc9.xlsx (30.5KB, xlsx)
Supplementary Table S5

The information of GTEx data collected from 491 donors

mmc10.xlsx (28.5KB, xlsx)
Supplementary Table S6

Information of the 55 cell types used in this study

mmc11.docx (17.8KB, docx)
Supplementary Table S7

Data partition in the five-fold cross-validation experiment

mmc12.docx (15.4KB, docx)

References

1. Kellis M., Wold B., Snyder M.P., Bernstein B.E., Kundaje A., Marinov G.K., et al. Defining functional DNA elements in the human genome. Proc Natl Acad Sci U S A. 2014;111:6131–6138. doi: 10.1073/pnas.1318948111.
2. Klemm S.L., Shipony Z., Greenleaf W.J. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet. 2019;20:207–220. doi: 10.1038/s41576-018-0089-8.
3. Crawford G.E., Holt I.E., Whittle J., Webb B.D., Tai D., Davis S., et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006;16:123–131. doi: 10.1101/gr.4074106.
4. Buenrostro J.D., Giresi P.G., Zaba L.C., Chang H.Y., Greenleaf W.J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013;10:1213. doi: 10.1038/nmeth.2688.
5. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
6. Roadmap Epigenomics Consortium, Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
7. Corces M.R., Granja J.M., Shams S., Louie B.H., Seoane J.A., Zhou W., et al. The chromatin accessibility landscape of primary human cancers. Science. 2018;362:6413. doi: 10.1126/science.aav1898.
8. Trevino A.E., Sinnott-Armstrong N., Andersen J., Yoon S.J., Huber N., Pritchard J.K., et al. Chromatin accessibility dynamics in a model of human forebrain development. Science. 2020;367:6476. doi: 10.1126/science.aay1645.
9. Song S., Cui H., Chen S., Liu Q., Jiang R. EpiFIT: functional interpretation of transcription factors based on combination of sequence and epigenetic information. Quant Biol. 2019;7:233–243.
10. Zhou J., Troyanskaya O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12:931–934. doi: 10.1038/nmeth.3547.
11. Liu Q., Gan M., Jiang R. A sequence-based method to predict the impact of regulatory variants using random forest. BMC Syst Biol. 2017;11:7. doi: 10.1186/s12918-017-0389-1.
12. Kelley D.R., Snoek J., Rinn J.L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26:990–999. doi: 10.1101/gr.200535.115.
13. Zhou W., Sherwood B., Ji Z., Xue Y., Du F., Bai J., et al. Genome-wide prediction of DNase I hypersensitivity using gene expression. Nat Commun. 2017;8:1038. doi: 10.1038/s41467-017-01188-x.
14. Liu Q., Xia F., Yin Q., Jiang R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics. 2018;34:732–738. doi: 10.1093/bioinformatics/btx679.
15. Min X., Zeng W., Chen N., Chen T., Jiang R. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics. 2017;33:i92–101. doi: 10.1093/bioinformatics/btx234.
16. Xu C., Liu Q., Zhou J., Xie M., Feng J., Jiang T. Quantifying functional impact of non-coding variants with multi-task Bayesian neural network. Bioinformatics. 2020;36:1397–1404. doi: 10.1093/bioinformatics/btz767.
17. Yin Q., Wu M., Liu Q., Lv H., Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20:193. doi: 10.1186/s12864-019-5489-4.
18. Quang D., Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44:e107. doi: 10.1093/nar/gkw226.
19. Ding K., Liu Q., Lee E., Zhou M., Lu A., Zhang S. Feature-enhanced graph networks for genetic mutational prediction using histopathological images in colon cancer. Proc Int Conf Med Image Comput Comput Assist Interv. 2020:294–304.
20. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proc IEEE Conf Comput Vision Pattern Recognit. 2016:770–778.
21. Nair S., Kim D.S., Perricone J., Kundaje A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics. 2019;35:i108–i116. doi: 10.1093/bioinformatics/btz352.
22. Kulakovskiy I.V., Vorontsov I.E., Yevshin I.S., Soboleva A.V., Kasianov A.S., Ashoor H., et al. HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2015;44:D116–D125. doi: 10.1093/nar/gkv1249.
23. Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., Laslo P., et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38:576–589. doi: 10.1016/j.molcel.2010.05.004.
24. Huang G., Liu Z., Weinberger K.Q., van der Maaten L. Densely connected convolutional networks. Proc IEEE Conf Comput Vision Pattern Recognit. 2017;1:3.
25. Ioffe S., Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. Proc 32nd Int Conf Mach Learn. 2015:448–456.
26. Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–1958.
27. Gupta S., Stamatoyannopoulos J.A., Bailey T.L., Noble W.S. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. doi: 10.1186/gb-2007-8-2-r24.
28. Khan A., Fornes O., Stigliani A., Gheorghe M., Castro-Mondragon J.A., van der Lee R., et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2017;46:D260–D266. doi: 10.1093/nar/gkx1126.
29. Kent W.J., Sugnet C.W., Furey T.S., Roskin K.M., Pringle T.H., Zahler A.M., et al. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.
30. Law J.C., Ritke M.K., Yalowich J.C., Leder G.H., Ferrell R.E. Mutational inactivation of the p53 gene in the human erythroid leukemic K562 cell line. Leuk Res. 1993;17:1045–1050. doi: 10.1016/0145-2126(93)90161-d.
31. Cheng T., Wang Y., Dai W. Transcription factor egr-1 is involved in phorbol 12-myristate 13-acetate-induced megakaryocytic differentiation of K562 cells. J Biol Chem. 1994;269:30848–30853.
32. Gabra M.M., Salmena L. MicroRNAs and acute myeloid leukemia chemoresistance: a mechanistic overview. Front Oncol. 2017;7:255. doi: 10.3389/fonc.2017.00255.
33. Yang M.Y., Liu T.C., Chang J.G., Lin P.M., Lin S.F. JunB gene expression is inactivated by methylation in chronic myeloid leukemia. Blood. 2003;101:3205–3211. doi: 10.1182/blood-2002-05-1598.
34. Yengo L., Sidorenko J., Kemper K.E., Zheng Z., Wood A.R., Weedon M.N., et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet. 2018;27:3641–3649. doi: 10.1093/hmg/ddy271.
35. Becker N.S., Verdu P., Georges M., Duquesnoy P., Froment A., Amselem S., et al. The role of GHR and IGF1 genes in the genetic determination of African pygmies’ short stature. Eur J Hum Genet. 2013;21:653–658. doi: 10.1038/ejhg.2012.223.
36. Chen T., Dent S.Y. Chromatin modifiers and remodellers: regulators of cellular differentiation. Nat Rev Genet. 2014;15:93. doi: 10.1038/nrg3607.
37. Duren Z., Chen X., Jiang R., et al. Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci U S A. 2017;114:E4914–E4923. doi: 10.1073/pnas.1704553114.
38. Wang Y., Jiang R., Wong W.H. Modeling the causal regulatory network by integrating chromatin accessibility and transcriptome data. Natl Sci Rev. 2016;3:240–251. doi: 10.1093/nsr/nww025.
39. Chen S., Yan G., Zhang W., et al. RA3 is a reference-guided approach for epigenetic characterization of single cells. Nat Commun. 2021;12:1–13. doi: 10.1038/s41467-021-22495-4.
40. Liu Q., Xu J., Jiang R., et al. Density estimation using deep generative neural networks. Proc Natl Acad Sci U S A. 2021;118. doi: 10.1073/pnas.2101344118.
41. Liu Q., Chen S., Jiang R., et al. Simultaneous deep generative modelling and clustering of single-cell genomic data. Nat Mach Intell. 2021;3:536–544. doi: 10.1038/s42256-021-00333-y.
42. Chen X., Chen S., Song S., et al. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat Mach Intell. 2022;4:116–126.
43. Khan A., Zhang X. dbSUPER: a database of super-enhancers in mouse and human genome. Nucleic Acids Res. 2016;44:D164–D171. doi: 10.1093/nar/gkv1002.
44. Zeng W., Min X., Jiang R. EnDisease: a manually curated database for enhancer-disease associations. Database (Oxford). 2019;2019:baz020. doi: 10.1093/database/baz020.
45. Chen S., Liu Q., Cui X., Feng Z., Li C., Wang X., et al. OpenAnnotate: a web server to annotate the chromatin accessibility of genomic regions. Nucleic Acids Res. 2021;49:W483–W490. doi: 10.1093/nar/gkab337.
46. Zeng W., Chen S., Cui X., Chen X., Gao Z., Jiang R. SilencerDB: a comprehensive database of silencers. Nucleic Acids Res. 2021;49:D221–D228. doi: 10.1093/nar/gkaa839.
47. Li W., Wong W.H., Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 2019;47:e60. doi: 10.1093/nar/gkz167.
48. Liu Q., Lv H., Jiang R. hicGAN infers super resolution Hi-C data with generative adversarial networks. Bioinformatics. 2019;35:i99–107. doi: 10.1093/bioinformatics/btz317.
49. Zeng W., Xin J., Jiang R., Wang Y. Reusability report: compressing regulatory networks to vectors for interpreting gene expression and genetic variants. Nat Mach Intell. 2021;3:576–580.
50. Liu Q., Hu Z., Jiang R., Zhou M. DeepCDR: a hybrid graph convolutional network for predicting cancer drug response. Bioinformatics. 2020;36:i911–i918. doi: 10.1093/bioinformatics/btaa822.
51. Singh S., Yang Y., Poczos B., Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quant Biol. 2019;7:122–137. doi: 10.1007/s40484-019-0154-0.

Articles from Genomics, Proteomics & Bioinformatics are provided here courtesy of Oxford University Press
