Abstract
The development of chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing (ChIP-seq) technologies has promoted the generation of large-scale epigenomics data, providing unprecedented opportunities to explore the landscape of epigenomic profiles across both histone marks and tissue types. In addition to the many tools designed directly for data analysis, advanced computational approaches, such as deep learning, have recently shown promise for mining data structures and identifying important regulators from complex functional genomics data. We implemented a neural network framework, a Variational Auto-Encoder (VAE) model, to explore the epigenomic data from the Roadmap Epigenomics Project and the Encyclopedia of DNA Elements (ENCODE) project. Our model was applied to 935 reference samples, covering 28 tissues and 12 histone marks. We used the enhancer and promoter regions as the annotation features and ChIP-seq signal values in these regions as the feature values. Through a parameter sweep process, we identified suitable hyperparameter values and built a VAE model to represent the epigenomics data and to further explore the biological regulation. The resultant Roadmap-ENCODE VAE (RE-VAE) model provided both data compression and feature representation. Using the compressed data in the latent space, we found that samples clustered well by histone mark for the majority of marks, but not by tissue or cell type. Tissue or cell specificity was observed only in some histone marks (e.g., H3K4me3 and H3K27ac) and could be characterized only when the number of tissue samples was large (e.g., blood and brain). In blood, the contributive regions and genes identified by the RE-VAE model were confirmed by tissue-specificity enrichment analysis with an independent tissue expression panel. Finally, we demonstrated that the RE-VAE model could detect cancer cell lines with similar epigenomics profiles.
In conclusion, we introduced and implemented a VAE model to represent large-scale epigenomics data. The model could be used to explore classifications of histone modifications and tissue/cell specificity and to classify new data with unknown sources.
Keywords: Epigenomic, Histone mark, Roadmap Epigenomics, Tissue specificity, Variational Auto-Encoder
1. Introduction
Histones are the central components of the nucleosomal subunits in eukaryotic cells. There are four core histone proteins: H3, H4, H2A, and H2B. These histone proteins wrap around segments of DNA and form nucleosomes. Each of these core histones has a long side chain or N-terminal tail rich in lysine and arginine residues [1]. The N-terminal tails of histone proteins are subject to posttranslational modifications (PTMs), such as methylation, phosphorylation, acetylation, ubiquitylation, and sumoylation [2]. Through different types of modifications or different combinations of modifications, PTMs exert their roles in regulating transcriptional processes [3], which have been extensively linked to many diseases, such as immune diseases [4], cancer [2, 5, 6], and others. Technologies have been developed for detecting histone modifications, such as chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing (ChIP-seq).
Histone modifications do not alter DNA sequence, but they may control the accessibility of specific DNA regions, especially regulatory elements in the genome. While the DNA sequence of an individual remains nearly identical in all cells, the landscape of histone modifications varies greatly among cell types, leading to different epigenetic and gene expression programs as well as distinct biological processes [7]. Each type of histone modification has a corresponding coded mark named a histone mark. To uncover the epigenomic information of histone modifications in different human cells and tissue types and to understand how the epigenomic landscape contributes to cellular circuitry, lineage specification, and the onset and progression of human disease, the NIH Roadmap Epigenomics Project was launched with the goal of producing a public resource of human epigenomic data [8]. The epigenomic landscape data of representative primary human tissues and cells is one of the main outcomes of this program. The data have been deposited into a publicly accessible web resource, including 111 reference epigenomes as well as 16 additional epigenomes previously reported by the Encyclopedia of DNA Elements (ENCODE) project [9]. The data have been extensively used in other studies, but many new regulatory features remain to be uncovered [10–14]. In this study, we aimed to implement an unsupervised deep learning approach to decode regulatory structures and features from these epigenomics profiles.
The generation of numerous large and highly complex biomedical datasets has brought us into the era of big data. There is strong demand for mining such data for better feature discovery, which can provide important insights into biological complexity [15]. Recently, artificial neural networks have achieved great success, especially in image and natural language processing, and interest in applying them to emerging data has been increasing [16]. In this study, we implemented a Variational Auto-Encoder (VAE) neural network model to investigate the underlying patterns of human histone modifications by using the data from the 127 Roadmap-ENCODE (RE) reference epigenomes. The VAE model is a type of artificial neural network and can be used to learn efficient data coding in an unsupervised manner. It has been used for dimensionality reduction and feature representation [17, 18]. We utilized VAE models to investigate the hidden patterns of human histone modifications by using the ChIP-seq signals in enhancer and promoter regions of 12 histone marks. We further applied our trained model to new datasets to demonstrate its potential utility in distinguishing cell types and identifying the regulatory regions (elements) and genes that are highly associated with the histone marks in specific tissues and cell types.
2. Materials and methods
2.1. Datasets
ChIP-seq narrowPeak data.
The narrow contiguous regions of enrichment (also called “narrow peaks”) for histone ChIP-seq and DNase-seq epigenomic data were downloaded from the website [19]. The originally downloaded data included measurements of 32 types of histone marks from 1,032 samples covering 28 anatomical locations and 19 lineage groups. To control data heterogeneity, we retained only histone marks with sample size ≥ 10. In total, 12 histone marks were included in our subsequent analysis, covering 935 samples from 28 anatomical locations. Among the 935 samples, 527 were from primary cell lines, 331 from primary tissue, and 77 were considered cell line derived. For each anatomy group, there were an average of 33 samples (range: 6 ~ 189), with spleen and esophagus having the fewest samples (n = 6) and blood having the largest number of samples (n = 189). Among the 12 histone marks, H3K27me3, H3K36me3, H3K4me1, H3K4me3, and H3K9me3 had the most comprehensive data (n = 127 for each mark). There were both normal cell lines (e.g., foreskin fibroblast primary cells) and disease cell lines (e.g., HeLa-S3 cervical carcinoma cell line). A detailed description of these data is presented in Supplementary Table S1.
2.1.1. Enhancer and promoter region data.
We downloaded enhancer and promoter annotation data from the GeneHancer (version 4.7) database [20]. GeneHancer is a resource for genome-wide enhancer-to-gene and promoter-to-gene associations. It integrated genomic regulatory elements from four different genome-wide databases: ENCODE, the Ensembl regulatory build [21], the Functional ANnoTation Of the Mammalian genome (FANTOM) project [22], and the VISTA Enhancer Browser [23]. In total, GeneHancer integrated more than 250,000 non-redundant enhancer and promoter regions and linked these regions to genes using tissue co-expression, Hi-C, and expression quantitative trait loci, among others [20].
2.1.2. Validation data.
We downloaded two additional datasets from Gene Expression Omnibus (GEO) for validation; both were completely independent of the RE samples. For each dataset, the ChIP-seq data in the narrowPeak format were downloaded for different histone marks. The first dataset (GEO accession ID: GSE104481) included ChIP-seq data for the NCI-H23 non-small cell lung cancer cell line (epithelial cell) [24], covering H3K27ac and H3K4me3. The second dataset (GEO ID: GSE106563) was generated from esophageal squamous carcinoma cells [25] and covered H3K27ac only.
2.2. Data preprocess
All downloaded regions were mapped to hg19 wherever necessary using the liftOver tool [26]. The sequence length of enhancers from GeneHancer ranged between 1 and 183,369 base pairs (bps), with a mean of 1,514 bps. The sequence length of promoters ranged between 59 and 183,369 bps, with a mean of 304 bps. In our analysis, we chose regions of 20 bps or more for initial sample annotation. After preprocessing (see Feature annotation section), the length of enhancers and promoters selected as feature regions ranged from 837 to 86,252 bps. The lengths of ChIP-seq peak regions varied over a broad range and depended on the type of histone mark. Some marks tended to have sharp peak regions, such as H3K9me3 (mean: 309 bps, standard deviation: 368 bps), while others tended to have wide peak regions, such as H3K4me3 (mean: 730 bps, standard deviation: 868 bps) in our dataset. We intersected the GeneHancer regulatory regions with the RE ChIP-seq data for each sample using bedtools [27]. We required the overlap between a GeneHancer regulatory region and a peak region to cover at least 50% of either of the two regions. Specifically, the following command was used for the analysis: bedtools intersect -a GeneHancer -b narrowPeak -e -f 0.50 -F 0.50 -wo > output.txt.
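The overlap rule encoded by the command above can be illustrated with a minimal Python sketch. The function name and the example coordinates are hypothetical, the intervals are assumed to be half-open as in BED files, and the "either" logic is meant to mirror the combined effect of the -e, -f, and -F options rather than to reproduce bedtools itself.

```python
def overlap_passes(region, peak, min_frac=0.50):
    """Return True if the overlap covers at least min_frac of either interval.

    region and peak are (start, end) tuples in half-open BED coordinates.
    """
    r_start, r_end = region
    p_start, p_end = peak
    overlap = min(r_end, p_end) - max(r_start, p_start)
    if overlap <= 0:
        return False  # the two intervals do not intersect
    frac_region = overlap / (r_end - r_start)
    frac_peak = overlap / (p_end - p_start)
    # "Either" semantics: one of the two fractions reaching the threshold is enough.
    return frac_region >= min_frac or frac_peak >= min_frac


# Hypothetical coordinates for illustration only.
# 200 bp overlap: 20% of the region but about 67% of the peak, so the pair is kept.
print(overlap_passes((1000, 2000), (1800, 2100)))
# 100 bp overlap: 10% of the region and 20% of the peak, so the pair is dropped.
print(overlap_passes((1000, 2000), (1900, 2400)))
```

In this sketch a short peak lying mostly inside a long regulatory region still counts as overlapping, which is the practical consequence of applying the threshold to either interval.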
2.3. Feature annotation
The GeneHancer regulatory regions were used as the features. For each feature, its overlapping peak regions in each RE sample were selected and the signal values of the peak regions were used as the feature value in the corresponding RE sample. If a feature region had multiple overlapping peak regions, the peak region with the longest overlap was selected and its signal value was used as the feature value. If a feature region had no overlapping peak region that met the aforementioned criteria, it was assigned 0 in the corresponding sample. Following this procedure, we built a sample × feature matrix to represent the RE samples. Features that were prevalently inactive, defined as those with a signal value of 0 in more than 50% of samples, were excluded. Consequently, 14,735 feature regions remained in 935 epigenome samples. We further calculated the variance of each feature and selected the top 10,000 feature regions with the highest variance. Finally, we built a feature matrix with 10,000 features by 935 epigenomic samples for the 12 histone marks.
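The two filtering steps described above (removing prevalently inactive features, then keeping the most variable ones) can be sketched with NumPy as follows. The function name, argument names, and toy matrix are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np


def select_features(matrix, max_zero_frac=0.5, top_k=10000):
    """Filter a samples x features signal matrix.

    1. Drop features whose value is 0 in more than max_zero_frac of samples.
    2. Keep the top_k remaining features with the highest variance.
    Returns the sorted indices (into the original columns) of the kept features.
    """
    matrix = np.asarray(matrix, dtype=float)
    zero_frac = (matrix == 0).mean(axis=0)
    active = np.flatnonzero(zero_frac <= max_zero_frac)
    variances = matrix[:, active].var(axis=0)
    order = np.argsort(variances)[::-1][:top_k]
    return np.sort(active[order])


# Toy matrix: 4 samples x 5 features (values are made up).
toy = np.array([
    [0.0, 1.0, 5.0, 2.0, 0.0],
    [0.0, 1.1, 0.0, 2.0, 3.0],
    [0.0, 0.9, 9.0, 2.0, 0.0],
    [4.0, 1.0, 1.0, 2.0, 0.0],
])
kept = select_features(toy, top_k=2)
print(kept)
```

Columns 0 and 4 are zero in three of the four toy samples and are removed first; of the remaining three columns, the constant one has the lowest variance and is dropped by the top-k step.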
2.4. Construction of the VAE model
An Auto-Encoder (AE) is a type of artificial neural network. An AE model typically consists of multiple layers (e.g., the input layer, the latent layer, and the output layer). The part from the input layer to the latent layer is called the encoder and the part from the latent layer to the output layer is called the decoder. An AE model therefore always comprises an encoding and a decoding process. An AE aims to learn a representation of the input matrix by minimizing the reconstruction error, i.e., the loss between the outputs and the inputs. Variational Auto-Encoder models inherit the AE architecture and similarly consist of encoders and decoders. In contrast to AE models, which are deterministic and are constructed to minimize reconstruction error, VAEs are stochastic and learn two distinct latent representations: a mean and a standard deviation for each vector encoding (Fig. 1A). Thus, VAEs make strong assumptions concerning the distribution of latent variables. One of the advantages of VAE models is that they can discover nonlinear explanatory features through nonlinear activation functions by implementing the variational approach for latent representation learning. Thus, the resultant latent variables from a VAE could achieve better feature representation and independence. In addition, a VAE adds a Kullback-Leibler (KL) divergence term to the reconstruction loss, which regularizes the weights by constraining the latent vectors to match a Gaussian distribution [28]. Because AE and VAE models learn to compress data through their encoders and uncompress data through their decoders, they are widely used for dimensionality reduction and feature representation [17, 29].
Figure 1.
Construction and performance of the Variational Auto-Encoder (VAE) model. (A) The architecture of a typical VAE framework. The VAE model encodes a vector of ChIP-seq signal values (n = 10,000) into mean (μ) and standard deviation (σ) vectors (m = 100). A reparameterization trick enables learning z, which is then converted back to the input (X). (B) Distribution of validation loss values for different batch sizes, epochs, and learning rates with latent vector dimension = 100. (C) Training and validation loss of our VAE model across training epochs. (D) Distributions of the difference between the original input values and the reconstructed values. Each dot represents one feature (region). The mean values of the differences are around 0, and the sums of the absolute values between input and output are very close to 0, indicating that our VAE model essentially restores the values.
In this work, we aimed to build a VAE model that can compress epigenomics data and reveal a biologically relevant latent space. The input feature by sample matrix was first log2-transformed using log2(value+1), where value was the peak signal, followed by z-score normalization. We then used the minimum-maximum normalization method to re-scale the values into the [0, 1] interval. Samples were divided into two subgroups, one for training (80% of all samples) and the other for validation (the remaining 20% of samples). We defined three layers in our VAE model: an input layer, a latent layer, and an output layer. We used two dense layers to encode the mean vector and standard deviation (SD) vector separately for each input vector. The mean and SD vectors were transformed into a nonlinear space using the rectified linear unit (ReLU) transformation. The two representations were learned concurrently through the use of a reparameterization trick that permits a back-propagated gradient [30, 31]. Next, we built a customized Lambda layer to combine the two nonlinear-transformed vectors into a latent vector z, using noise sampled from a normal distribution with mean = 0 and SD = 1. The latent vector represented the activities of each hidden node and the compressed representations of input features. Next, we computed the reconstructed input by multiplying z with a weight vector and adding a visible bias vector b, followed by applying the sigmoid activation function. The loss value contained two parts: the mean squared error (MSE) and the KL loss. MSE was used to measure the difference between the original input vector (x) and the reconstructed output (x’). KL loss was used to measure the difference between the encoded distribution and a Gaussian distribution to ensure that the encoded variables are an efficient descriptor of the input (Fig. 1A). To accelerate the training process, we trained the VAE model in batches of samples; the number of samples in each batch was termed the batch size.
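The input transformation and the 80/20 split described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say whether the z-score and minimum-maximum steps were applied per feature or per sample (per feature is assumed here), nor how samples were assigned to the two subgroups (a seeded random split is assumed). The function names are hypothetical.

```python
import numpy as np


def preprocess(signal):
    """log2(value + 1), then z-score, then min-max scaling to [0, 1].

    Assumption: z-score and min-max are applied per feature (column).
    """
    x = np.log2(np.asarray(signal, dtype=float) + 1.0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    z = (x - x.mean(axis=0)) / std
    span = z.max(axis=0) - z.min(axis=0)
    span[span == 0] = 1.0
    return (z - z.min(axis=0)) / span


def split_train_validation(n_samples, train_frac=0.8, seed=0):
    """Randomly split sample indices into training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    return order[:n_train], order[n_train:]


# Toy signal matrix (3 samples x 2 features); values are made up.
scaled = preprocess([[0.0, 3.0], [1.0, 7.0], [3.0, 15.0]])
train_idx, val_idx = split_train_validation(935)
print(scaled)
print(len(train_idx), len(val_idx))
```

With 935 samples, an 80/20 split in this sketch gives 748 training and 187 validation samples.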
The reconstruction error was optimized through stochastic gradient descent, with the weight matrix W and bias vectors b being updated in each batch. The magnitudes of weight and bias changes were controlled by a specified learning rate. Training proceeded through epochs; each epoch used enough batches to include all training samples. Training stopped once the specified number of epochs (termed the epoch size) was reached. The Adam optimizer [32] was used in training the VAE model, together with a batch normalization strategy in the encoding stage. We used Keras [33] (version 2.1.6) with a TensorFlow backend (version 1.0.1) to implement the VAE.
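The encoding, sampling, decoding, and loss computation described in this section can be sketched as a single forward pass in NumPy. This is a simplified illustration, not the Keras implementation used in the study. It rests on several assumptions: the second encoder output is treated as a log-variance (a common convention, whereas the text describes an SD vector), the MSE and KL terms are summed with equal weight, batch normalization is omitted, and all parameter names are hypothetical.

```python
import numpy as np


def relu(a):
    return np.maximum(a, 0.0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def vae_forward(x, params, rng):
    """One forward pass: encode, sample z, decode, and compute the loss."""
    # Two dense layers with ReLU give the mean and (assumed) log-variance vectors.
    mu = relu(x @ params["W_mu"] + params["b_mu"])
    log_var = relu(x @ params["W_var"] + params["b_var"])
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # Decoder: affine map followed by a sigmoid, matching inputs scaled to [0, 1].
    x_hat = sigmoid(z @ params["W_dec"] + params["b_dec"])
    # Loss = reconstruction error (MSE) + KL divergence to a standard Gaussian.
    mse = np.mean((x - x_hat) ** 2, axis=1)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return z, x_hat, np.mean(mse + kl)


rng = np.random.default_rng(0)
n_features, n_latent = 20, 4  # toy sizes; the text uses 10,000 and 100
params = {
    "W_mu": rng.normal(0, 0.1, (n_features, n_latent)),
    "b_mu": np.zeros(n_latent),
    "W_var": rng.normal(0, 0.1, (n_features, n_latent)),
    "b_var": np.zeros(n_latent),
    "W_dec": rng.normal(0, 0.1, (n_latent, n_features)),
    "b_dec": np.zeros(n_features),
}
x = rng.random((5, n_features))  # a toy batch of inputs already scaled to [0, 1]
z, x_hat, loss = vae_forward(x, params, rng)
print(z.shape, x_hat.shape, float(loss))
```

Because the noise term eps is sampled outside the learned parameters, gradients can flow through mu and the variance term, which is what allows the two representations to be trained concurrently.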
2.5. Model hyperparameter sweep
To achieve the appropriate performance of the VAE model for our epigenomic datasets, we performed a parameter sweep over the latent vector dimensions (50, 100, 200, 300), batch size (10, 30, 50, 100), epochs (30, 50, 100), and learning rates (0.0005, 0.005, 0.01, 0.05). After the VAE was fully trained, we obtained two matrices: the weight matrix of the decoder part and the learned representation values for each original input vector of the sample. To allow the manual interpretation of nodes, we named each latent vector in the hidden layer as “Latent i” based on the order in which they appeared.
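The sweep can be sketched as an enumeration of hyperparameter combinations. The sketch assumes a full grid over the listed values, which the text does not state explicitly, and the scoring function passed to `sweep` is a hypothetical placeholder for training a model and returning its validation loss.

```python
from itertools import product

# Hyperparameter values listed in the text.
grid = {
    "latent_dim": [50, 100, 200, 300],
    "batch_size": [10, 30, 50, 100],
    "epochs": [30, 50, 100],
    "learning_rate": [0.0005, 0.005, 0.01, 0.05],
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def sweep(grid, train_and_score):
    """Run train_and_score on every configuration and return (loss, config) of the best.

    train_and_score is a caller-supplied function (hypothetical here) that
    trains a model with the given configuration and returns its validation loss.
    """
    results = [(train_and_score(cfg), cfg) for cfg in iter_configs(grid)]
    return min(results, key=lambda item: item[0])


configs = list(iter_configs(grid))
print(len(configs))  # 4 * 4 * 3 * 4 combinations under the full-grid assumption
```

Under the full-grid assumption this amounts to 192 training runs, so in practice each run would record both training and validation loss for later comparison.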
2.6. Application of RE-VAE model to identify tissue-of-origin of cancer samples
The trained VAE model (hereafter called the RE-VAE model) had two components: one latent matrix to represent the input RE samples (935 × 100) and another matrix with weights that were used to obtain the latent matrix (10,000 × 100). We used the weight matrix to evaluate unseen samples and infer their potential tissue-of-origin. Specifically, we focused on two histone marks: H3K4me3 and H3K27ac, because they had relatively better performance in distinguishing samples from different tissues (see the Results section). We preprocessed the two validation datasets in the same way as we processed the training data, including mapping the peak regions to the 10,000 features and calculating feature values. For each query sample from the two validation datasets, we used the weight matrix from our RE-VAE model and calculated their latent representation. We calculated Pearson’s Correlation Coefficient (PCC) using the latent representation between the query samples and the RE samples and obtained the most related RE tissue or cell line types for the query samples.
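The comparison of a query sample with the reference samples can be sketched as a Pearson correlation ranking in the latent space. The function name and toy data are hypothetical. The projection of features onto the weight matrix is shown as a plain matrix product, which is an assumption: the text does not state whether a bias term or activation is applied at this step.

```python
import numpy as np


def rank_reference_samples(query_latent, reference_latent, labels, top_n=3):
    """Rank reference samples by Pearson correlation with a query latent vector."""
    q = query_latent - query_latent.mean()
    r = reference_latent - reference_latent.mean(axis=1, keepdims=True)
    pcc = (r @ q) / (np.linalg.norm(r, axis=1) * np.linalg.norm(q))
    order = np.argsort(pcc)[::-1][:top_n]
    return [(labels[i], float(pcc[i])) for i in order]


rng = np.random.default_rng(0)
weights = rng.normal(size=(50, 8))             # toy stand-in for the 10,000 x 100 weight matrix
reference_features = rng.normal(size=(6, 50))  # toy stand-in for the 935 reference samples
labels = [f"ref_{i}" for i in range(6)]

# Assumption: latent representation obtained by a plain linear projection.
reference_latent = reference_features @ weights
# The toy query is a slightly perturbed copy of reference sample 2.
query_features = reference_features[2] + 0.01 * rng.normal(size=50)
query_latent = query_features @ weights

top = rank_reference_samples(query_latent, reference_latent, labels)
print(top)
```

The ranked list plays the role of the "most related tissue/cell line" column: the first entries are the reference samples whose latent profiles correlate most strongly with the query.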
2.7. Contributive features in the RE-VAE model implied tissue specificity
To evaluate whether the samples of a given tissue type were associated with particular dimensions of the latent matrix, we first applied a t-test to each latent vector for each of the 28 tissue types and obtained a p-value for each tissue type with each latent vector. For most learned latent vectors, the distribution of the corresponding weights was close to a normal distribution in shape. To characterize the patterns explained by selected latent vectors of interest, we performed tissue-specific enrichment analysis separately for the genes extracted from the annotation information of the top 2.5% most highly weighted enhancer and promoter regions using our in-house tool deTS [34]. deTS provides a reference panel containing tissue-specific genes for 47 human tissues defined using the GTEx transcriptomic data and compares lists of query genes with the reference panel to identify the most enriched tissues for the query genes.
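The two steps above (testing each latent dimension against a tissue group, then extracting the most highly weighted regions) can be sketched as follows. Several choices here are assumptions not specified in the text: the test is run as one tissue versus all other samples, Welch's form of the t statistic is used, and "top 2.5%" is taken as the upper tail of the weights rather than the largest absolute values. The sketch computes only the t statistic; p-values could be obtained from the t distribution, for example with `scipy.stats.ttest_ind`.

```python
import numpy as np


def welch_t_per_latent(latent, in_group):
    """Welch t statistic for each latent dimension, group vs. all other samples."""
    a, b = latent[in_group], latent[~in_group]
    var_a = a.var(axis=0, ddof=1) / len(a)
    var_b = b.var(axis=0, ddof=1) / len(b)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(var_a + var_b)


def top_weighted_features(weights, latent_index, frac=0.025):
    """Indices of the features with the highest weights for one latent dimension."""
    column = weights[:, latent_index]
    n_top = int(round(frac * len(column)))
    return np.argsort(column)[::-1][:n_top]


rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 5))      # toy: 40 samples x 5 latent dimensions
in_group = np.zeros(40, dtype=bool)
in_group[:10] = True                   # the first 10 samples play the role of one tissue
latent[in_group, 3] += 3.0             # plant a group-specific shift in dimension 3
t_stats = welch_t_per_latent(latent, in_group)
best = int(np.argmax(np.abs(t_stats)))

weights = rng.normal(size=(10000, 5))  # toy weights: features x latent dimensions
top = top_weighted_features(weights, best)
print(best, len(top))
```

With 10,000 features, the 2.5% cutoff corresponds to 250 regions per latent vector, whose annotated genes would then be passed to the enrichment tool.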
3. Results
In this study, we used the Roadmap Epigenomics Project and ENCODE data to train a VAE model. The VAE model compressed the original features into a low dimension space and learnt the latent representation of the input samples. The latent representation was then used to decode histone mark classification and tissue or cell type classification. We also identified the latent vectors that were most significantly associated with each histone mark or each tissue or cell group. We further investigated the tissue specificity of the contributive regions and genes using our inhouse tissue-specificity enrichment tool deTS. Finally, we demonstrated the utility of our trained RE-VAE model in new ChIP-seq datasets from independent studies to distinguish cell lines among different tissue types.
3.1. Performance of the VAE model
Fig. 1 illustrates the framework and performance of our VAE model. There are three layers in our model: an input layer, a latent layer, and an output layer. Notably, in the encoding step, we added a batch normalization process because it can produce faster training with heterogeneous feature activation. In machine learning, the batch normalization process scales the activation to zero mean and unit variance by adding additional feature regularization processes. Thus, we included the batch normalization layer to speed up training and reduce batch-to-batch variability. We used the loss function, which combines the reconstruction loss and the KL loss, to evaluate the model fitting.
Several parameters could impact the performance of the model, such as the batch size, the number of epochs, and the learning rate. We conducted a parameter sweep to investigate how the loss changes with different settings of batch size, epochs, and learning rate. The training and validation set loss was evaluated at each run. As shown in Fig. 1B, training was in general relatively stable for many parameter combinations. The reconstruction loss tended to be small when the batch size was large (e.g., 50 or 100). Based on the results from the parameter sweep, we selected the following parameters: latent dimension = 100, batch size = 50, training epochs = 50, and learning rate = 0.005 (Fig. 1B). With these settings, we trained our three-layer VAE model. As shown in Fig. 1C, there was a sharp reduction of both the training loss and the validation loss in the first few steps, after which the curves flattened. At epoch = 50, both the training loss and the validation loss had stabilized at approximately 4500.
We further compared the original values and the reconstructed values for each feature (i.e., an enhancer or promoter region) across all samples. We calculated the mean and the sum of absolute values of the difference between the original value and the reconstructed value in each sample. As shown in Fig. 1D, the mean of difference was centered on 0 (mean = −0.01, SD = 0.07) and the sum of the absolute difference was very close to 0 (mean = 0.07, SD = 0.02). This indicated that our VAE model restored the original data matrix very well.
3.2. Visualization of the features based on the compressed representation
We used t-distributed stochastic neighbor embedding (tSNE) to visualize the resultant latent representation of the RE data. As shown in Fig. 2A, the main groups of samples corresponded to several histone marks, including H3K27me3, DNase, H3K9me3, H3K4me1, H3K36me3, and H4K20me1. However, clusters of other marks were not segregated clearly, such as H3K27ac, H3K4me2, H3K4me3, and H2A.Z. There were no obvious tissue patterns or cell type patterns among all these epigenomics data (Fig. 2B), indicating that differences between histone mark types were much larger than differences between tissues. We then examined each histone mark individually for potential tissue patterns, on the rationale that some marks were previously reported to have relatively strong tissue specificity while others were not [7, 35, 36]. As shown in Fig. 2C, 2D and Supplementary Fig. S1, we indeed observed tissue-specific clusters of samples in some histone marks, such as H3K27ac, where a group of blood samples and a group of brain-related samples formed notable clusters (Fig. 2C). Similarly, we observed tissue clusters using the histone mark H3K4me3 (Fig. 2D).
Figure 2.
The tSNE plot of the learned latent features. Each dot denotes an epigenome sample. (A) Samples of the same histone mark tend to cluster together. (B) There are no obvious tissue patterns across all the samples. (C) The tissue patterns in H3K27ac samples. The blood cluster is strong. (D) The tissue patterns in H3K4me3 samples. The blood cluster is strong. Dot colors in B, C, and D are shown at the bottom, and five GI tissues are labeled in the same color.
3.3. Biological interpretation of latent vectors and contributive regulatory regions
We investigated whether the latent vectors could be used to infer classification of the samples. As shown in Fig. 2, several histone marks formed clusters in the tSNE plot. Thus, we used t-test to identify latent vectors that were significantly associated with each of the 12 histone marks. The best distinguished histone mark was H3K27me3 (latent 79, p = 7.88×10−207), followed by H3K9me3 (latent 78, p = 1.19×10−206) and H3K36me3 (latent 19, p = 3.22×10−140) (Fig. 3). Several other marks could also be distinguished well, such as H3K4me3, DNase, and H3K4me1. However, we failed to find any latent vectors that could distinguish H2A.Z, H3K4me2, and H3K79me2 with both high specificity and high sensitivity. This is consistent with the tSNE plot in Fig. 2, where H2A.Z, H3K4me2 and H3K79me2 were randomly distributed in the figure. Fig. 3 shows the value distributions of the 12 histone marks and their most related latent vector with p-values.
Figure 3.
The latent value distribution for the 12 histone marks. For each histone mark, the most related latent vector and its p-value are shown.
We next explored whether there were latent vectors associated with tissues. Among the 28 tissues, we found only one latent vector that was significantly associated with blood (p = 6.84×10−22) and one latent vector that was associated with ESC (p = 5.8×10−3). Other tissues only had latent vectors with marginal significance (e.g., cervix, p = 0.03; muscle, p = 0.03; skin, p = 0.03; iPSC, p = 0.04; and thymus, p = 0.04) (Fig. 4A-E). For each of these tissues, we further collected the top 250 most contributive regions according to their weights in the corresponding vector. For the genes matched to these regions, we performed tissue-specificity enrichment analysis using the computational tool deTS that we developed. We obtained about 200 genes (range: 194 – 201) for each tissue (Supplementary Table S2). Interestingly, when these genes were input into the deTS tool, the genes obtained from the latent vector most related to blood could be validated, as they were enriched in the “whole blood” tissue among the 47 tissue types in the GTEx panel. However, this strong pattern was not observed for the other six tissues.
Figure 4.
Seven tissues having significantly associated latent vectors (total of five latents). (A) Latent 24 is significantly associated with blood (p-value = 6.84E-22). (B) Latent 44 is significantly associated with ESC (p-value = 0.0058) and with iPSC (p-value = 0.04). (C) Latent 33 is significantly associated with cervix (p-value = 0.03). (D) Latent 46 is significantly associated with skin (p-value = 0.03) and with muscle (p-value = 0.03). (E) Latent 24 is significantly associated with thymus (p-value = 0.04). (F) After removing the blood samples and re-running VAE model, latent 62 is significantly associated with brain (p-value = 0.04).
Brain is the tissue with the second largest sample size in our dataset (n = 87). Thus, we removed the blood samples and trained the model again. As shown in Fig. 4F and Fig. 5, we found latent vectors that were associated with brain with nominal significance (p = 0.04). Importantly, the contributive regions and their target genes were significantly enriched in brain-related tissues using deTS tool and the GTEx tissue expression panel (brain - spinal cord (cervical c-1), p = 1.25.8×10−3; Brain - Caudate (basal ganglia), p = 0.01; Brain – Hippocampus, p = 0.03) (Fig. 5). Collectively, these results indicated that the RE-VAE model could be used to identify histone mark and tissue associated latent representation vectors when the sample size was large.
Figure 5.
Tissue-specific enrichment analysis of the genes from the highly weighted enhancer and promoter regions. The genes obtained from the latent vector most significantly associated with blood were found to be enriched in whole blood of the GTEx panel. This self-supporting pattern was not present in the other six tissues. After removing the blood samples, the genes from the latent vector most associated with brain tissue were found to be enriched in brain tissues of the GTEx panel.
3.4. Distinguishing tissue types using the trained RE-VAE model
From the trained RE-VAE model, we saved the weight matrix so that, for any unseen sample, we could generate its latent representation following our RE-VAE model. In this way, we were able to predict the status (e.g., the cell-of-origin, histone mark, and disease status) of any new sample by comparing it with the 935 reference RE samples. We focused on the two histone marks (H3K27ac and H3K4me3) that mainly tag enhancer and promoter regions, as the features we used were from GeneHancer regulatory elements. In addition, as shown in the previous results, these two marks had a relatively stronger performance in distinguishing tissue groups.
3.4.1. H3K27ac.
We downloaded two ChIP-seq datasets for the histone mark H3K27ac in the format of narrow peak files: GSE104481 and GSE106563. As described in the Methods section, both datasets contained samples of carcinoma cell lines. For each dataset, we used the saved RE-VAE model to calculate the latent representation of the new data following the pre-trained weight matrix. As a result, each sample was recoded and represented using a 1×100 vector, where each element in the vector was its value corresponding to a latent representation. Each sample was then compared with the RE samples, represented by a 935×100 matrix in the latent space. The PCC was calculated for each sample with the RE samples (see Methods), resulting in a ranked list of RE samples. For the first dataset (GEO ID: GSE104481), we extracted two replicates of the NCI-H23 non-small cell lung cancer cell line, which is a type of epithelial cell. Initially, we expected that the new samples would be related to similar tissues, such as the IMR90 fetal lung fibroblasts cell line (Roadmap Epigenomics Project ID: E017), fetal lung (E088), lung (E096), and NHLF lung fibroblast primary cells (E128). However, the results showed that the new samples were more related to cancer cell lines than to their similar tissues (Fig. 6A). Specifically, the query NCI-H23 non-small cell lung cancer cell line was most similar to the A549 EtOH 0.02pct lung carcinoma cell line (PCC = 0.8323), followed by the HeLa-S3 cervical carcinoma cell line (PCC = 0.8301) and the K562 leukemia cell line (PCC = 0.7237). The second dataset (GEO ID: GSE106563) had four samples from esophageal squamous carcinoma cell lines. As shown in Fig. 7, the new samples were more related to the cancer cell lines than to their matched tissue of origin.
For example, query GSM2842762_KYSE70 was most similar to A549 EtOH 0.02pct lung carcinoma cell line (PCC = 0.7134), followed by HeLa-S3 cervical carcinoma cell line (PCC = 0.5832), and K562 leukemia cell line (PCC = 0.5249). These results collectively showed that for carcinoma samples, their profiles of the histone mark H3K27ac were more similar to carcinoma cell lines rather than their matched tissues-of-origin, implying that there are likely some shared and dominant processes that drove the overall epigenomic profiles of cancer samples (Table 1).
Figure 6.
The dendrogram cluster of histone mark H3K27ac and H3K4me3 from dataset GSE104481. (A) The new samples of H3K27ac from dataset GSE104481 are clustered with cancer cell lines, e.g. E114, E123, E118. (B) The new samples of H3K4me3 from dataset GSE104481 are clustered with epithelial cell lines, e.g. E053, E054, E057.
Figure 7.

The heatmap of histone mark H3K27ac from dataset GSE106563. The four new samples of H3K27ac from dataset GSE106563 formed a cluster with four cancer cell lines E117, E114, E118, and E123.
Table 1.
| Dataset | Histone mark | Query | Most related tissue/cell line | Lineage group | PCC* |
|---|---|---|---|---|---|
| GSE104481 | H3K27ac | GSE104481_R1$ | GSE104481_R2 | Epithelial | 0.9591 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.8323 |
| | | | HeLa-S3 cervical carcinoma cell line | ENCODE | 0.8301 |
| | | | K562 leukemia cell line | ENCODE | 0.7237 |
| | | GSE104481_R2 | GSE104481_R1 | Epithelial | 0.9591 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.8969 |
| | | | HeLa-S3 cervical carcinoma cell line | ENCODE | 0.8300 |
| | | | K562 leukemia cell line | ENCODE | 0.7052 |
| | H3K4me3 | GSE104481_R1 | GSE104481_R2 | Epithelial | 0.9982 |
| | | | Foreskin melanocyte primary cells skin03 | Epithelial | 0.8662 |
| | | | Foreskin melanocyte primary cells skin01 | Epithelial | 0.8618 |
| | | | Foreskin keratinocyte primary cells skin02 | Epithelial | 0.8556 |
| | | GSE104481_R2 | GSE104481_R1 | Epithelial | 0.9982 |
| | | | Foreskin melanocyte primary cells skin01 | Epithelial | 0.8616 |
| | | | Foreskin melanocyte primary cells skin03 | Epithelial | 0.8608 |
| | | | Foreskin keratinocyte primary cells skin02 | Epithelial | 0.8505 |
| GSE106563 | H3K27ac | GSM2842762_KYSE70 | GSM2842764_TE5 | NA | 0.9702 |
| | | | GSM2842768_TT | NA | 0.8580 |
| | | | GSM2842760_KYSE140 | NA | 0.7828 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.7134 |
| | | | HeLa-S3 cervical carcinoma cell line | ENCODE | 0.5832 |
| | | | K562 leukemia cell line | ENCODE | 0.5249 |
| | | GSM2842768_TT | GSM2842764_TE5_H3K27Ac | NA | 0.8865 |
| | | | GSM2842760_KYSE140_H3K27Ac | NA | 0.8686 |
| | | | GSM2842762_KYSE70_H3K27Ac | NA | 0.8580 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.7366 |
| | | | Foreskin Keratinocyte Primary Cells skin03 | Epithelial | 0.6605 |
| | | | K562 leukemia cell line | ENCODE | 0.6063 |
| | | GSM2842764_TE5 | GSM2842762_KYSE70_H3K27Ac | NA | 0.9702 |
| | | | GSM2842768_TT_H3K27Ac | NA | 0.8865 |
| | | | GSM2842760_KYSE140_H3K27Ac | NA | 0.8231 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.7038 |
| | | | HeLa-S3 cervical carcinoma cell line | ENCODE | 0.5889 |
| | | | K562 leukemia cell line | ENCODE | 0.5644 |
| | | GSM2842760_KYSE140 | GSM2842768_TT_H3K27Ac | NA | 0.8686 |
| | | | GSM2842764_TE5_H3K27Ac | NA | 0.8231 |
| | | | GSM2842762_KYSE70_H3K27Ac | NA | 0.7828 |
| | | | A549 EtOH 0.02pct lung carcinoma cell line | ENCODE | 0.6727 |
| | | | Foreskin Keratinocyte Primary Cells skin03 | Epithelial | 0.6425 |
| | | | K562 leukemia cell line | ENCODE | 0.5288 |

*PCC: Pearson’s Correlation Coefficient.
$R: Replicate.
3.4.2. H3K4me3.
The ChIP-seq narrowPeak files for the histone mark H3K4me3 were obtained from the GEO dataset GSE104481. As described above, this dataset was also derived from carcinoma cell lines. The data were processed in the same way as for H3K27ac. Similarly, the PCC was calculated between each new sample and the RE samples, resulting in a ranked list of RE samples. As we initially expected, the new samples from dataset GSE104481 were related to skin tissue (Fig. 6B). Unlike the H3K27ac profiles, which were more similar to carcinoma cell lines, the H3K4me3 profiles were more similar to their matched tissues. This observation indicates that H3K4me3 profiles might retain more information about the tissue of origin than H3K27ac profiles do. If this is true, H3K4me3 might be used to distinguish a sample's tissue type and to verify the tissue identity of new samples (Table 1).
4. Discussion
In this paper, we introduced a deep learning approach for feature representation of epigenomic data. The VAE model is promising for feature extraction from functional genomic data, but it still requires careful validation and a more thorough evaluation. The VAE model can learn features that are generally non-redundant and can disentangle large sources of variation in the data. Currently, the most widely applied methods for dimension reduction are classic linear approaches, such as principal component analysis. We used the auto-encoder model for dimension reduction for two main reasons. First, an auto-encoder makes use of all the input information to capture the most salient features of the training data, whereas some other approaches achieve dimension reduction by dropping features. This is especially critical when the dataset is very large, where an auto-encoder can retain more information and be more efficient. Second, each latent feature is a non-linear combination of all the input features, which may help reveal more complex relationships. These properties are not available in linear dimension reduction approaches. To address overfitting, we conducted a parameter sweep and selected a set of appropriate hyperparameters for our final model. We split the whole dataset into training and validation datasets and evaluated the loss function on both. Thus, we do not consider overfitting to be a concern.
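The overfitting check mentioned above, comparing the loss on training and validation data, can be sketched as follows. This is only an illustration under stated assumptions: the reconstruction term is taken to be a squared error (the text does not specify which term was used), the KL term is the standard closed form for a diagonal Gaussian posterior against a standard normal prior, and the function names and toy numbers are hypothetical.

```python
import math


def vae_loss(x, x_recon, z_mean, z_log_var):
    """Per-sample VAE loss: reconstruction error plus KL divergence.

    Assumption: squared-error reconstruction. The KL term is the closed form
    for N(z_mean, exp(z_log_var)) against a standard normal prior.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    kl = -0.5 * sum(
        1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(z_mean, z_log_var)
    )
    return recon + kl


def mean_loss(batch):
    """Average loss over a list of (x, x_recon, z_mean, z_log_var) tuples."""
    return sum(vae_loss(*sample) for sample in batch) / len(batch)


def overfitting_gap(train_batch, val_batch):
    """Validation loss minus training loss; a large positive gap hints at overfitting."""
    return mean_loss(val_batch) - mean_loss(train_batch)


# Toy values only: a perfectly reconstructed training sample and a
# poorly reconstructed validation sample.
train_batch = [([1.0, 2.0], [1.0, 2.0], [0.0], [0.0])]
val_batch = [([1.0], [0.0], [1.0], [0.0])]
gap = overfitting_gap(train_batch, val_batch)
print(f"validation - training loss = {gap:.2f}")
```

Tracking this gap across epochs or across the hyperparameter sweep is one simple way to see whether training and validation losses stay close, which is the behavior the paragraph above relies on.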
In our results, the encoded features recapitulated histone mark patterns, but within each histone mark they did not show obvious tissue patterns across samples. From the tSNE plots of each histone mark, we could still observe some potential tissue patterns: blood samples clustered in most of the histone mark plots, and brain samples clustered as well. In our evaluation using an independent dataset, we found that histone marks such as H3K4me3 could distinguish tissue-specific patterns. Our evaluation also revealed that the histone mark H3K27ac could distinguish cancer cell lines from other samples. In future applications, such analysis may help detect sample contamination, identify sample origin, and distinguish different types of samples (e.g., cell lines).
The learned feature vectors captured the tissue pattern for blood but were not sufficient to detect the patterns for other tissues. This result is likely due to the small sample size in the whole dataset. In addition, samples were missing for some tissues, which could be another limiting factor. Another possibility is that the genes we used for tissue specificity enrichment analysis came from the annotations of GeneHancer, which does not provide tissue information. While genes are expressed in a tissue-specific way (e.g., tissue-specific or housekeeping genes), enhancers and promoters may not be annotated in a highly tissue-specific manner. Blood is a systemic tissue spread throughout the body, so it is less biased in such feature analyses. If the dataset were ideal (i.e., 127 samples for each histone mark, or a large number of samples for each tissue type), we would have uniform data distributions for all tissue types. Accordingly, we would likely have good power to detect the features that associate histone marks with their related tissue types. Interpretation of the decoder weights would help identify the contribution of different enhancers or promoters, as well as genes, to unique tissue patterns. However, at the current stage of the data, gene-based analysis should be performed with caution because gene-level information was derived computationally from complex data, which likely leads to a higher false positive rate.
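As a rough illustration of how decoder weights might be interpreted, the sketch below ranks input regions by the absolute weight connecting them to one latent feature. Ranking by absolute weight is an assumption made here for illustration, not necessarily the criterion used for the highly weighted regions in this study; the function name, region labels, and weight values are hypothetical.

```python
def top_regions(decoder_weights, region_ids, latent_index, top_k=5):
    """Return the regions with the largest absolute decoder weight for one latent feature.

    decoder_weights[i][j] is the weight from latent feature i to input region j.
    """
    weights = decoder_weights[latent_index]
    if len(weights) != len(region_ids):
        raise ValueError("one weight per region is required")
    ranked = sorted(
        zip(region_ids, weights), key=lambda pair: abs(pair[1]), reverse=True
    )
    return ranked[:top_k]


# Toy weights for 2 latent features and 4 regions (illustrative values only).
decoder_weights = [
    [0.10, -0.90, 0.30, 0.05],   # latent feature 0
    [0.70, 0.20, -0.10, -0.80],  # latent feature 1
]
region_ids = ["region_A", "region_B", "region_C", "region_D"]

top = top_regions(decoder_weights, region_ids, latent_index=1, top_k=2)
for region, weight in top:
    print(f"{region}\t{weight:+.2f}")
```

The selected regions could then be mapped to their annotated target genes for enrichment analysis, subject to the caution about gene-level false positives noted above.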
5. Conclusion
We presented an integrative deep learning framework, the Roadmap-ENCODE VAE (RE-VAE) model, to systematically investigate the potential features of histone modification signals in enhancer and promoter regions. We demonstrated that the RE-VAE model may be useful for decoding complex epigenomics data and revealing regulation patterns and tissue-specific features. The results will help further interpret and understand gene regulation mechanisms in specific cell types or tissues.
Supplementary Material
FIGURE S1. tSNE plots of tissue patterns of the 12 histone marks. The x-axis is tSNE1 and the y-axis is tSNE2. The dot colors are as described in the legends of Figure 2B–2D.
TABLE S1. Summary of sample tissues and cell lines.
TABLE S2. Gene lists from highly weighted regions for 7 tissues.
Highlights.
We implemented a neural network framework, a Variational Auto-Encoder (VAE) model, for dimension reduction and feature representation of the epigenomic data from the Roadmap Epigenomics Project and the Encyclopedia of DNA Elements (ENCODE) project.
We found that histone marks are clearly segregated using the represented features, while only blood and brain samples show tissue-specific clusters for given histone marks.
Two cancer ChIP-seq datasets of the histone marks H3K27ac and H3K4me3 were tested on the trained VAE model. By comparing them with the 935 reference samples, we found that the cancer samples are more related to cancer cell lines than to their similar tissues.
Active histone marks showed stronger tissue specificity than repressive histone marks.
Acknowledgments
This work was partially supported by National Institutes of Health grants (LM012806, DE027393, DE028103 and DE027711) and the Cancer Prevention and Research Institute of Texas grant (CPRIT RP180734). We would like to thank the members of Bioinformatics and Systems Medicine Laboratory (BSML) for valuable discussion.