Bioinformatics Advances. 2025 Jul 23;5(1):vbaf175. doi: 10.1093/bioadv/vbaf175

Statistical relationships across epigenomes using large-scale hierarchical clustering

Anastasiia Kim 1, Nicholas Lubbers 2, Christina R Steadman 3, Karissa Y Sanbonmatsu 4
Editor: Anna-Sophie Fiston-Lavier
PMCID: PMC12373635  PMID: 40861392

Abstract

Motivation

Recent advances in genomics and sequencing platforms have revolutionized our ability to create immense data sets, particularly for studying epigenetic regulation of gene expression. However, the avalanche of epigenomic data is difficult to parse for biological interpretation given nonlinear complex patterns and relationships. This attractive challenge in epigenomic data lends itself to machine learning for discerning infectivity and susceptibility. In this study, we explore over 3000 epigenomes of uninfected individuals and provide a framework to characterize the relationships among epigenetic modifiers, their modifiers, genetic loci, and specific immune cell types across all chromosomes using hierarchical clustering.

Results

Hierarchical clustering of epigenomic data revealed consistent epigenetic patterns across chromosomes, demonstrating that variation due to epigenetic modifiers is greater than variation between cell types. Gene Ontology and KEGG pathway analyses indicated significant enrichment of genes involved in chromatin remodeling, mRNA splicing, immune responses, and the regulation of microRNAs and snoRNAs. Epigenetic modifiers frequently formed biologically relevant clusters, including the cohesin complex, RNA Polymerase II transcription factors, and PRC2 complex members. These clustering behaviors remained consistent across all chromosomes, supported by entropy analysis and high Adjusted Rand Index scores, indicating robust cross-chromosomal similarity. Co-occurrence analysis further revealed specific sets of modifiers that consistently appeared together within clusters, reflecting shared biological functions and interactions. Validation using another dataset confirmed the reproducibility of these clustering patterns and modifier co-occurrence relationships, underscoring the reliability and generalizability of the methodology.

Availability and implementation

The analysis pipeline for this study is freely available online at the GitHub repository: https://github.com/lanl/epigen.

1 Introduction

Epigenetic modification of the genome contributes to the function and physiology of all organisms. These covalent additions of functional groups can occur directly on DNA and histone proteins via chromatin modifying enzymes and are termed DNA methylation and histone (posttranslational) modifications, respectively. Epigenetic modifications influence various cellular and organismal processes ranging from basic biochemistry to immune function and even human behavior (Keverne et al. 2015, Obata et al. 2015, Tiffon 2018). The booming growth of the epigenomics field—i.e. the study of epigenetic modifications across the entire genome—can be attributed not only to growing interest of the scientific community but also the recent onslaught of advances in sequencing technology (Callinan and Feinberg 2006, Clark et al. 2016). As sequencing platforms continue to evolve, the number of epigenomic datasets continues to grow, and such data is housed within the ENCODE, SRA, and GEO databases (Barrett et al. 2005, Leinonen et al. 2011, The Encode Project Consortium 2012). These datasets provide a wealth of information, particularly for understanding the contribution of aberrant epigenetic modifications in disease states, such as cancer, and the impact of environmental perturbations on function and behavior (Wolffe 2001, Jirtle and Skinner 2007, Hou et al. 2012, Ilango et al. 2020, Perera et al. 2020). However, the complexity and sheer enormity of these datasets precludes definitive biological interpretation; as such, this is a unique opportunity to utilize machine learning approaches to identify patterns and signatures that may reveal mechanisms of normative and/or aberrant physiological processes. In particular, the immune system utilizes epigenetic modifications and chromatin remodeling processes to respond to threats: the orchestrated epigenomic regulation of immunity, both innate and adaptive function, presents a large complex story to unravel. 
The details and precise mechanisms of this regulation would provide useful information for prediction, early detection, and development of effective treatments (Fernández-Morera et al. 2010, Janson and Winqvist 2011).

The machine learning and data science revolution of the past decade has now been directed at a great many scientific endeavors. Unsupervised learning methods have been used extensively in genomics and epigenomics studies, with particular focus on DNA methylation. For example, studies showed that variational autoencoders can learn latent DNA methylome (methylation across the genome) representations that can be used for lower dimensional epigenetic analyses (Titus et al. 2018). Zamanighomi et al. (2018) identified clusters of informative peaks for single-cell methylation data using unsupervised clustering. Hierarchical clustering was employed in studies to analyze DNA methylation patterns: Virmani et al. (2002) showed that DNA methylation patterns differ between lung cancer cell lines, while Lin et al. revealed tumor-specific hypermethylated clusters and expressed breast cancer genes (Lin et al. 2015). Hierarchical clustering of diffuse large B-cell lymphoma based on the extent of DNA methylation variability identified novel epigenetic clusters (Chambwe et al. 2014). These important studies demonstrate the utility of hierarchical clustering and unsupervised learning for identifying patterns based on DNA methylation. Yet, few studies have addressed histone modifications and associated chromatin regulatory proteins. Compared with the simplicity of DNA methylation datasets, the sheer number of histone modifications, variations, and interactions creates a complex pattern recognition challenge (Seligson et al. 2005, Xi et al. 2018). Thus, in this work, we turn to the tools of data science to meet this challenge, focusing on hierarchical clustering of large-scale epigenomic data to provide a bird's-eye view of the relationships between histone modifications and associated chromatin regulatory proteins across a variety of immune cell types. 
We collated a dataset comprising 3111 whole-genome histone and chromatin modifier samples totaling approximately 1.5 TB of data from The Encode Project Consortium (2012). We note that this article does not delve into a detailed analysis for each chromosome; rather, we present a methodological framework which serves as a foundation for future, more detailed investigations of individual chromosome analysis.

2 Data description

To understand how epigenetic modifications of chromosomes are impacted by pathogen exposures resulting in activated immune function, it is necessary to first analyze data that come from uninfected individuals to understand the relationships between epigenetic modifications and associated chromatin regulatory proteins. We acquired 3111 ChIP-seq samples on various histone modifications and chromatin binding proteins from cells derived from healthy, normal subjects (uninfected) from The Encode Project Consortium (2012). Note that we could not confirm whether one healthy donor contributed multiple samples across different cell types and modifiers, nor can we estimate the total number of unique individuals in the dataset as this information was not provided in the ENCODE metadata. These samples were based on either the hg19 (1479 samples) or the hg38 (1632 samples) human genome reference assembly, with each sample characterized by a specific epigenetic modifier in a specific immune cell type. In addition to being classified as either a histone or not, each epigenetic modifier was further characterized by the factor it is associated with and its activity. The factor represents different roles and mechanisms in the epigenetic regulation process, and the activity of each modifier indicates its role in regulating gene expression. In total, there were 28 unique cell types (from 16 to 300 samples per cell type) and 28 unique epigenetic modifiers (from 22 to 83 samples per modifier) in the hg19 (GRCh37) assembly samples and 14 unique cell types (from 68 to 340 samples per cell type) and 68 unique epigenetic modifiers (24 samples per modifier) in the hg38 (GRCh38) assembly samples. Each sample was stored in a BigWig file, which contained the signal P value expressed as −log10(P value). The signal P value tests the null hypothesis that the signal at a particular genomic locus is present in the control. 
BigWig files are a format used to store large sets of genomic data that would otherwise take up too much space and be cumbersome to process. They allow for efficient retrieval of subsets of data, making it easier to access and visualize specific regions of a genome without the need to process the entire file. Note that we do not aggregate data across epigenetic modifiers or cell types, but rather build the dendrogram on individual samples and analyze the structure of the dendrogram with respect to epigenetic modifiers and cell types; the structure of the dendrogram is entirely determined by the 200 bp binned features.

3 Methods

Our analysis focused primarily on samples aligned to the hg38 (GRCh38) genome assembly reference. To evaluate the generalizability of our methodology, we also applied our analysis to samples aligned to the hg19 (GRCh37) genome assembly. This step was taken to confirm that the patterns we observed were not unique to a specific dataset. For each chromosome, we divided the genome into 200 bp regions and extracted the median signal P value for each region. These median values were then transformed into regular P values. Using these 200 bp regions as features, we conducted hierarchical clustering on the samples. As part of this clustering and subsequent analyses, it’s important to note that each sample can be characterized by multiple types of “labels”—including modifiers, cell type, factor, or activity. In the context of co-occurrence analyses, however, we focus specifically on modifiers to examine how these epigenetic modifiers cluster together.
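The binning and transformation steps described above can be illustrated with a minimal sketch. It assumes the per-base signal for one chromosome has already been read from a BigWig file into a plain list of −log10(P value) numbers (the reading step is not shown), and the function names are hypothetical rather than taken from the published pipeline.

```python
from statistics import median

BIN_SIZE = 200  # bp, the region width stated in the text


def bin_medians(signal, bin_size=BIN_SIZE):
    """Median of a per-base signal within consecutive fixed-width bins.

    `signal` is assumed to hold one -log10(P value) per base pair.
    The last bin may be shorter than `bin_size`.
    """
    return [median(signal[start:start + bin_size])
            for start in range(0, len(signal), bin_size)]


def to_p_value(neg_log10_p):
    """Invert the -log10(P) encoding to recover a regular P value."""
    return 10.0 ** (-neg_log10_p)


def chromosome_features(signal, bin_size=BIN_SIZE):
    """One regular P value per bin: the feature vector of one sample."""
    return [to_p_value(m) for m in bin_medians(signal, bin_size)]
```

Each sample then contributes one feature vector per chromosome, with one entry per 200 bp region.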

3.1 Hierarchical clustering

Hierarchical clustering has gained popularity due to its straightforward implementation and visualization capabilities (Murtagh and Contreras 2012). The results of agglomerative clustering might differ based on the choice of distance metrics and linkage techniques, which dictate cluster merging decisions. We conducted hierarchical clustering using correlation distance (1 − r, where r is the Pearson correlation coefficient) and complete linkage clustering. While correlation distance is not strictly a distance in mathematical terms, it allows us to derive clusters that focus on relationships between features (regions of the genome in our case) rather than absolute magnitudes. In our study, this distance was particularly useful for capturing trends where there were peaks in regions of the genome across multiple samples. We used complete linkage, where the distance between two clusters is the largest distance between any member of one cluster and any member of the other, as it is robust to outliers and tends to produce tighter clusters, aligning with our objective of clustering samples based on similar feature (regions of the genome) behaviors.
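To make the two choices concrete, the following is a deliberately naive, standard-library sketch of correlation distance and complete-linkage agglomeration stopped at a distance threshold. It is not the implementation used in this study; for real data one would normally rely on an optimized library routine. The function names are hypothetical, and constant feature vectors (zero variance) are assumed not to occur.

```python
from math import sqrt


def correlation_distance(x, y):
    """1 - r, where r is the Pearson correlation of two feature vectors.

    Assumes neither vector is constant (otherwise r is undefined).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage_clusters(samples, max_distance):
    """Merge clusters until the closest pair is farther than max_distance.

    Returns clusters as lists of sample indices. Naive and slow; intended
    only to illustrate the merging rule.
    """
    n = len(samples)
    dist = [[correlation_distance(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance across clusters.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With complete linkage, stopping at a threshold t guarantees that every pair of samples inside a cluster is at most t apart, which is why a cut at a given correlation distance bounds the correlation among all cluster members.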

As hierarchical clustering is an unsupervised learning technique, it naturally leads one to the question: how can we assess the quality of the clusters, especially when many clusters are formed? The optimal clustering scenario occurs when the within-cluster distance is minimized while the between-cluster distance is maximized. Typically, cluster validation employs internal or external metrics (Murtagh and Contreras 2017). Given that we used correlation rather than, for example, Euclidean distance, many traditional internal validation metrics become unsuitable, as they often rely on minimizing the sum of squared errors. The challenge with correlation is that one cannot merely average correlations in the same way distances are averaged. Furthermore, internal metrics can exhibit biases and might not always offer the best validation. On the other hand, external validation methods such as the Adjusted Rand Index (ARI), Jaccard coefficient, entropy, purity, and Fowlkes-Mallows index require a known ground truth for comparison, which might not always be available, as in our case. However, a practical approach in our study involved leveraging the comparative analysis of different chromosomes and genome assemblies using entropy and ARI. For example, by treating one chromosome (or genome reference assembly) as a “ground truth,” we were able to compare the clusterings performed on two different chromosomes (or assemblies) with each other to determine their similarity.

3.2 Entropy-based analysis across chromosomes

We used entropy to compare the branching patterns of the dendrograms between chromosomes. The weighted entropy over the clusters was defined as

$$\text{weighted\_entropy} = \sum_{\text{cluster\_id}} E(\text{cluster\_id}) \times \frac{\text{size}(\text{cluster\_id})}{\text{total\_size}}$$

where

$$E(\text{cluster\_id}) = -\frac{\sum_i p_i \log_2(p_i)}{\log_2(n)},$$

$p_i$ is the proportion of the ith label (i.e. modifier, cell type, factor, or activity) in the cluster and n is the number of unique labels. E(cluster_id) is the entropy of a particular cluster, size(cluster_id) is the number of elements in that cluster, and total_size is the total number of elements across all clusters. In this way, we normalized the entropy to be between 0 and 1, which allowed for fair comparison across clusters with different numbers of labels, focusing on how evenly distributed the labels are, rather than just how many there are in a specific cluster. The weighted entropy over the clusters is defined so that each cluster is weighted by its size to ensure that the contribution of each cluster to the overall entropy is proportional to its representation in the data.
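A small sketch of this calculation is given below. The text does not state whether n counts the unique labels within each cluster or across the whole dataset, so the sketch treats this as an assumption: by default n is counted per cluster, and a dataset-wide value can be passed through the hypothetical n_labels argument. A cluster with a single label is assigned entropy 0.

```python
from collections import Counter
from math import log2


def normalized_entropy(labels, n_labels=None):
    """Entropy of the labels in one cluster, divided by log2(n).

    Assumption: n is the number of unique labels in the cluster unless
    `n_labels` is supplied. With n <= 1 the entropy is defined as 0.
    """
    counts = Counter(labels)
    n = n_labels if n_labels is not None else len(counts)
    if n <= 1:
        return 0.0
    total = len(labels)
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / log2(n)


def weighted_entropy(clusters, n_labels=None):
    """Size-weighted average of the per-cluster normalized entropies.

    `clusters` is a list of clusters, each a list of sample labels.
    """
    total_size = sum(len(c) for c in clusters)
    return sum(normalized_entropy(c, n_labels) * len(c) / total_size
               for c in clusters)
```

For instance, two clusters that each contain a single modifier give a weighted entropy of 0, whereas two clusters that each hold an even mix of two modifiers give 1.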

To validate our findings, we calculated the entropy for datasets with randomly shuffled labels, repeating the process 100× to obtain an averaged result. This procedure was implemented to determine the presence of any underlying structure in the data to ensure that clustering did not happen by chance. In this study, entropy served as a quantitative measure of the level of variability within each cluster in the dendrogram, with lower entropy values indicating more homogeneous clusters. For each chromosome, we performed a detailed analysis by calculating entropy values at various cut points within the dendrogram. This was done separately for each case, depending on how the samples were labeled: as epigenetic modifiers, cell types, factors, or by their activity. By assessing entropy across these different labeling scenarios, we gained insights into how the clustering patterns varied depending on the specific characteristics assigned to the samples. To assess the robustness of the observed cluster structure, we performed a permutation test. We calculated the frequency with which entropy values of reshuffled labels were smaller than entropy of the data labels to create an empirical P value for the clustering of that label.
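The label-shuffling comparison can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the cluster assignment is held fixed while labels are permuted, n in the entropy normalization is taken as the number of unique labels in the whole dataset, and the function names, the default of 100 shuffles, and the seed argument are choices made for the example.

```python
import random
from collections import Counter
from math import log2


def weighted_entropy(cluster_ids, labels):
    """Size-weighted normalized entropy of labels grouped by cluster id.

    Assumption: entropy is normalized by log2 of the number of unique
    labels in the whole dataset.
    """
    n = len(set(labels))
    if n <= 1:
        return 0.0
    groups = {}
    for cid, lab in zip(cluster_ids, labels):
        groups.setdefault(cid, []).append(lab)
    total = len(labels)
    result = 0.0
    for members in groups.values():
        size = len(members)
        h = -sum((c / size) * log2(c / size)
                 for c in Counter(members).values())
        result += (h / log2(n)) * size / total
    return result


def permutation_p_value(cluster_ids, labels, n_shuffles=100, seed=0):
    """Fraction of shuffles whose entropy is smaller than the observed one."""
    rng = random.Random(seed)
    observed = weighted_entropy(cluster_ids, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if weighted_entropy(cluster_ids, shuffled) < observed:
            hits += 1
    return hits / n_shuffles
```

A small empirical P value indicates that random labelings rarely produce clusters as homogeneous as the observed ones.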

To ensure that the clustering structure is driven by meaningful epigenetic relationships rather than random noise, technical errors, or dataset imbalances, we assessed the internal consistency of the dataset. We split the hg38 dataset into two subsets (A and B), each containing 476 samples. Both subsets included 68 unique modifiers, with exactly 7 samples per modifier, and the cell type counts were also matched between the two subsets.

3.3 ARI analysis across chromosomes

In our study, alongside the entropy-based analysis where each sample was assigned a specific label type (modifier, cell type, factor, or activity), we also calculated the ARI to compare the dendrograms across all chromosomes. The advantage of using ARI lies in its independence from specific labels, allowing for a broader comparison of clustering patterns across different chromosomes. The ARI is a version of the Rand index corrected for chance, which accounts for the likelihood that two samples might end up in the same cluster purely by chance. This makes the ARI a more reliable measure in situations in which random agreement is more likely: cases where the dataset is large or the number of clusters is high. The ARI has a value close to 0 for random clustering and 1 for perfect clustering agreement.
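For reference, the standard pair-counting form of the ARI can be written in a few lines. The sketch below is a generic implementation, not code from the study; it takes two cluster assignments of the same samples (for example, from two chromosomes cut at the same distance) and assumes at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two cluster assignments of the same samples."""
    n = len(labels_a)
    # Pairs of samples placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

Because only co-membership of sample pairs is counted, the cluster identifiers themselves do not need to match between the two assignments.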

3.4 Investigating clusters: dendrogram and UMAP visualization

To understand the patterns within each chromosome of our dataset independently, we analyzed clusters in dendrograms and UMAP plots. Additionally, we examined the branching fraction in the dendrograms for each chromosome, where we analyzed the relationship between the number of clusters and the distance at which the dendrogram is cut. This assessment helped in understanding the clustering dynamics at different levels.
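The relationship between cut distance and number of clusters can be read directly from the merge heights of a dendrogram. The sketch below assumes the heights are available as a plain list (one per merge, as in the distance column of a typical linkage matrix) and that heights never decrease as merging proceeds, which holds for complete linkage; the function name is hypothetical.

```python
from bisect import bisect_right


def clusters_at_cut(merge_heights, cut):
    """Number of clusters remaining when the dendrogram is cut at `cut`.

    A dendrogram over n samples has n - 1 merges; every merge at a height
    less than or equal to the cut joins two clusters into one.
    """
    heights = sorted(merge_heights)
    n_samples = len(heights) + 1
    return n_samples - bisect_right(heights, cut)
```

Evaluating this function over a grid of cut distances yields the curve of cluster count against distance for one chromosome.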

In addition to analyzing clustering patterns, we also sought to identify the most important (i.e. common) regions in those clusters. We defined a region of the genome as important in a certain cluster if its P value was less than 0.05 in all samples within that cluster. Every 200 bp region marked as important was expanded by a 500 bp buffer. These expanded regions of the genome were then matched with known gene positions from the Genome Browser, allowing us to associate certain regions with specific genes. Although many of the important regions were not linked to any known genes, we identified many known genes for each chromosome and their positions in the genome using the Genome Browser’s web API (Kent et al. 2002). Because the formation of clusters is influenced by important regions of the genome, understanding their distribution and relationships can reveal significant biological patterns (e.g. functional similarities) among the samples in the same clusters.
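The selection and gene-matching logic can be sketched as follows. Several details are assumptions made for the example: the buffer is applied on both sides of each region, coordinates are 0-based half-open intervals, and the gene table is a hypothetical list of (name, start, end) tuples standing in for positions retrieved from the Genome Browser.

```python
BIN_SIZE = 200  # bp per region
BUFFER = 500    # bp added around each important region (assumed: both sides)
ALPHA = 0.05    # significance threshold from the text


def important_bins(cluster_p_values, alpha=ALPHA):
    """Indices of bins whose P value is below alpha in every sample.

    `cluster_p_values` holds one list of per-bin P values per sample.
    """
    n_bins = len(cluster_p_values[0])
    return [b for b in range(n_bins)
            if all(sample[b] < alpha for sample in cluster_p_values)]


def expand_bin(bin_index, bin_size=BIN_SIZE, buffer=BUFFER):
    """Genomic interval of a bin after adding the buffer on each side."""
    start = max(0, bin_index * bin_size - buffer)
    end = (bin_index + 1) * bin_size + buffer
    return start, end


def genes_for_bins(bins, genes):
    """Names of genes overlapping any expanded important bin."""
    hits = set()
    for b in bins:
        lo, hi = expand_bin(b)
        hits.update(name for name, g_start, g_end in genes
                    if g_start < hi and lo < g_end)
    return sorted(hits)
```

Requiring significance in all samples of a cluster makes the criterion strict: a single sample without a peak in a region removes that region from the cluster's important set.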

As the hierarchical clustering dendrogram becomes visually intractable for large datasets, we utilized the ETE toolkit to construct a circular dendrogram, which allows for better visual exploration of patterns. In this dendrogram, we highlighted branches in different colors—red, green, blue, and black—depending on the number of important known genes shared by the cluster members. Additionally, we labeled each cluster with pie charts to indicate the number of distinct epigenetic modifiers present in the clusters where more than 30 important genes were found.

We used UMAP to quickly examine how our data clustered, observing whether samples grouped by activity, factor, modifier, or cell type. UMAP’s capability to handle nonmetric distances, such as correlation, proved particularly useful in our analysis, since t-SNE and other nonlinear dimensionality reduction techniques usually require metric distances as input.

We set a moderate value of 0.6 for the min_dist parameter (which controls how tightly UMAP is allowed to pack points together) and a high number of neighbors, 100, to focus on the global structures within the dataset. UMAP’s ability to capture the global structure of a dataset allows it to reveal larger-scale patterns that might not be as clearly visible in methods emphasizing local structures or individual linkages, such as hierarchical clustering.

3.5 Co-occurrence analysis of epigenetic modifiers

To analyze data more efficiently, we cut the dendrogram at a specific height so that the correlation distance remains below the threshold of 0.3, ensuring that clusters formed at or below this level had a correlation coefficient of at least 0.7 among their members. We then examined the co-occurrences of epigenetic modifiers in the clusters across all chromosomes. We set a minimum cluster size criterion, considering only those clusters with four or more members. Within these clusters, we determined the frequency at which pairs of unique epigenetic modifiers appeared together. This was accomplished by summing the minimum count of the two modifiers each time they were found within the same cluster. We then plotted these sums for the 50 top co-occurring pairs to visualize the most frequent interactions among the epigenetic modifiers. We normalized co-occurrences $c_{ij}$ by the total number of modifiers in the dataset and the number of chromosomes using $c_{ij}/(N_i \cdot N_j)/N_{\mathrm{chr}}$.

To count co-occurrences, we determine a quantity for each cluster and sum this quantity over all clusters. The attribute whose co-occurrence is counted will be called the label. We have considered several labels in this work: modifier, cell type, factor, and activity. Within each cluster, the quantity is the smaller of the number of samples with label A and the number of samples with label B. In this way, the un-normalized co-occurrence reflects the number of samples that co-occur in any cluster. Let us assume that B is the less frequent label. Co-occurrence will be maximized if, for each sample of type B in each cluster, there is at least one sample of type A in the same cluster that can be paired with that sample. For example, co-occurrence can be maximized if all A and B samples fall in exactly one cluster.
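The un-normalized count described here can be sketched in a few lines. The function name is hypothetical, the clusters are assumed to be given as lists of modifier labels for one chromosome, and the normalization step is deliberately left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrence(clusters, min_size=4):
    """Un-normalized co-occurrence counts for every pair of labels.

    For each cluster with at least `min_size` members, each pair of
    distinct labels contributes the smaller of their two counts.
    """
    totals = defaultdict(int)
    for members in clusters:
        if len(members) < min_size:
            continue
        counts = Counter(members)
        for a, b in combinations(sorted(counts), 2):
            totals[(a, b)] += min(counts[a], counts[b])
    return dict(totals)
```

Summing the returned dictionaries over chromosomes gives the totals from which the top co-occurring pairs can be ranked. Using the minimum count means a pair scores highly only when both labels are well represented in the same cluster, rather than when one label dominates.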

We also produced self-organizing maps (Kohonen 1990) to aid visualization of the patterns of epigenetic modifier co-occurrences within the clusters in chromosome 6 derived from the distinct hg19 and hg38 human genome assembly datasets.

Finally, we conducted a jackknife z-test to determine which labels in the co-occurrence patterns differed significantly between the hg38 and hg19 datasets. We generated 600 jackknife replicates for each of the hg19 and hg38 datasets. For every pair of epigenetic modifiers, we then computed its mean co-occurrence across the 600 hg19 replicates and likewise across the 600 hg38 replicates. To assess significance, we formed a z-score by dividing the difference in those two means by the square root of the sum of their jackknife variances. P values were obtained from the normal approximation and all tests were corrected for multiple comparisons using the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995) to control the false discovery rate. Any pair with an adjusted P value below 0.05 was concluded to co-occur at significantly different rates between these two datasets. Future work will include formal equivalence testing to statistically support similarities, which requires careful selection of an acceptable difference threshold. Choosing this threshold is not straightforward in this context, as it must reflect biologically meaningful bounds on co-occurrence variability, a decision that depends on domain-specific knowledge and empirical variability across modifier pairs.
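The test statistic and the multiple-testing correction can be illustrated with a short sketch. It takes the replicate means and jackknife variances as given inputs, since the text does not specify how the replicates or variances were generated, and it assumes a two-sided test; the function names are hypothetical.

```python
from math import erfc, sqrt


def z_test_p_value(mean_a, var_a, mean_b, var_b):
    """z-score and two-sided normal P value for a difference in means.

    `var_a` and `var_b` are the jackknife variances of the two means.
    """
    z = (mean_a - mean_b) / sqrt(var_a + var_b)
    return z, erfc(abs(z) / sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Applying z_test_p_value to each modifier pair and passing the resulting P values to benjamini_hochberg gives adjusted values that can be compared with the 0.05 threshold.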

3.6 Linking with gene ontology for functional insights

Building on our analysis, we integrated our findings with external databases to deepen our understanding of the important genes identified across all chromosomes. We used Gene Ontology (GO) and the KEGG pathway database for this purpose. KEGG offers a vast collection of databases that shed light on biological pathways and diseases, linking genomic information with higher-order functional insights. In contrast, GO provides a well-structured framework for biological activity, defining key concepts used to describe gene function and the relationships among these concepts. It categorizes gene functions into molecular functions, cellular components, and biological processes. The integration with GO and KEGG helped us determine what roles these genes might play and how they are involved in different biological processes and pathways. We used the gProfiler tool (Reimand et al. 2007) and Cytoscape (Shannon et al. 2003) to visualize GO and KEGG pathway results, respectively.

4 Results

In this section, we present the results of our analysis, which primarily focused on the hg38 (GRCh38) genome assembly reference dataset and chromosome 6, known for its abundance of immune-related genes (Shiina et al. 2009). Additionally, we validated our findings on a different dataset consisting of samples aligned to the hg19 (GRCh37) genome assembly to ensure the robustness of our methodology (Fig. 1).

Figure 1.


A flowchart of our approach. Note that datasets from the two genome reference assemblies are different samples. This image was produced by authors using BioRender.

4.1 Numerous known genes, identified among important features, drove the clustering

We observed an abundance of important known genes in most of the clusters, indicated by the color-coded branches in the hierarchical clustering dendrogram of chromosome 6. The pie charts in the dendrogram, each varying in the number of colors, show the count of distinct epigenetic modifiers within each cluster (Fig. 2). Several pie charts consisted of only one or two colors, signifying that clusters containing at least 30 important known genes were characterized by only one or two specific epigenetic modifiers, respectively. This pattern suggested that a considerable number of epigenetic modifiers tended to group together, reflecting a clustering of samples influenced predominantly by the specific traits of these modifiers.

Figure 2.


Dendrogram of a representative chromosome (chromosome 6). Here, red branches denote clusters where the number of important (common) known genes in all samples is at least 30. The labels of each such cluster are also represented by pie charts, where the number of colors indicates how many distinct epigenetic modifiers are in the cluster. Note that if there are more than 4 distinct modifiers, the least frequent ones are combined in one pie chart slice.

4.2 Samples showed a tendency to cluster together according to the epigenetic modifiers first and then by the cell types

While the dendrogram hinted that similar epigenetic modifiers tend to cluster together, uniform manifold approximation and projection (UMAP) plots not only confirmed this tendency but also revealed that samples clustered first by modifier and then by cell type. However, some samples characterized by certain cell types, such as peripheral blood mononuclear cells, common myeloid progenitors, CD34-positive cells, neutrophils, and CD14-positive monocytes, showed a tendency to cluster primarily by cell type (Fig. 3). In contrast, for other cell types, the clustering was predominantly driven by modifiers (Fig. 4). The clustering patterns observed in UMAP plots suggested that samples tended to group together based on the most distinctive features in the dataset. In this context, if a specific cell type has more unique and defining characteristics than an epigenetic modifier, we expect the samples to cluster according to cell type, and vice versa. The cell types that tended to form clusters (peripheral blood mononuclear cells, common myeloid progenitors, CD34-positive cells, neutrophils, and CD14-positive monocytes) were also the most prevalent within the dataset, which might have had an effect on the UMAP plots. In the hg38 genome dataset, the epigenetic modifiers were evenly distributed, with each of the 68 types represented by 24 samples. However, the representation of cell types in the dataset showed a different pattern. Two out of the 14 cell types (common myeloid progenitor and CD34-positive peripheral blood mononuclear cells) accounted for about a third of the dataset, indicating a skewed distribution in the representation of cell types.

Figure 3.


UMAP projections from different angles of chromosome 6. Each sample is color-coded to indicate its associated cell type. Each color represents one of the 14 cell types. The most visually distinct samples are labeled for clarity. The axes represent the components or dimensions that UMAP has reduced the data to. The left and right plots show different projections of the data to clarify the 3D UMAP embedding.

Figure 4.


UMAP projections from different angles of chromosome 6. Each sample is color-coded to indicate its associated epigenetic modifier. Given that there are 68 unique modifiers in the dataset, some colors may not be distinctly differentiable to the eye. The most visually distinct samples are labeled for clarity. The axes represent the components or dimensions that UMAP has reduced the data to. The left and right plots show different projections of the data to clarify the 3D UMAP embedding.

The main distinction in clustering was notably between histone and nonhistone modifiers (Fig. 5). Modifiers typically clustered based on their activity (such as permissive or repressive) and factor (transcription factor, chromatin modifier, histone modification, etc.). However, a more pronounced distinction was observed between histone permissive and histone repressive modifiers compared to that between non-histone permissive and non-histone repressive modifiers (Fig. 5).

Figure 5.


UMAP projections from different angles of chromosome 6. Each sample is color-coded to indicate its associated factor and activity. The shapes of the markers indicate whether each epigenetic modifier is a histone or not. The axes represent the components or dimensions that UMAP has reduced the data to. The left and right plots show different projections of the data to clarify the 3D UMAP embedding.

4.3 Clustering behavior was consistent across all chromosomes

We were interested in identifying which features drove clustering and whether they were linked to specific genes. To achieve this, we needed to examine each cluster that can be obtained by cutting the dendrogram at a chosen height. Our observation of a roughly constant branching fraction across the dendrograms suggested a similar structure across all chromosomes (Fig. 6).

Figure 6.


The number of clusters (log10) found using the hierarchical clustering on each chromosome, as a function of the distance (1 − Pearson correlation) used to cut the hierarchical clustering. All chromosomes have a similar structure with a roughly exponential decay in clusters. This corresponds to a roughly constant branching fraction for the dendrogram.

Additionally, plotting entropy against distance offered further insights, especially when comparing different chromosomes (Fig. 7). We observed how entropy varied as a function of the cut point; where entropy decreased rapidly, the dendrogram cuts were effectively dividing clusters into more similar labels. The entropy behavior was analyzed with respect to the epigenetic modifiers, activity and factor associated with each epigenetic modifier, and cell types (Fig. 7). On the right-hand limit, each dendrogram corresponded to one unique cluster which contained all of the samples, and thus all chromosomes had exactly the same entropy value, which was the entropy of the dataset. On the left-hand limit, each cluster had exactly one sample, and so the entropy was precisely zero for each cluster, and thus zero for each dendrogram. In between, the entropy varied as a function of the cut point. Cross-referencing with the number of clusters plot (Fig. 6), we observed that the entropy started to diminish at an early stage, marked by higher distance values, when there were only a few clusters. It made a more dramatic jump downwards around 10 clusters (where the distance was around 1) for the activity, factor, and epigenetic modifier plots. For the cell type, the entropy curves were qualitatively different from the entropy curves associated with the epigenetic modifiers. These curves were approximately flat from the cut point 1 upwards, indicating that the highest levels of the dendrogram did not have any noticeable effect on the cell type distribution. By comparing the behavior of the black curve corresponding to the randomly shuffled labels to the original entropy curves, we observed that there was less evident structure in the cell type behavior, meaning that clustering by the cell types was less meaningful than the clustering by other identifiers. 
To assess the robustness of the observed cluster structure, we conducted a permutation test for each entropy plot (Fig. 7), calculating the frequency with which the entropy values of reshuffled labels were smaller than those of the true labels. A P value of less than 0.05 was observed across clustering distances (i.e. the x-axis in Fig. 7) smaller than 1.4 for epigenetic modifier, activity, and factor labels. For cell labels, the distance threshold was lower, at approximately 0.9. At a distance of 0.3, corresponding to a correlation of 0.7, the P value was less than 0.001 for each of the labels (cell type, modifier, activity, and factor), suggesting that the observed cluster structure is statistically significant and not due to random chance. This distance of 0.3 is the threshold used for subsequent cluster analysis.
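
The entropy and permutation calculations described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that the weighted entropy is the cluster-size-weighted mean of the per-cluster Shannon entropy of the labels, and that the cluster assignment at a given cut distance is already available as a list of cluster ids. The function names `weighted_entropy` and `permutation_p_value` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def weighted_entropy(cluster_ids, labels):
    """Cluster-size-weighted mean of per-cluster Shannon entropy (natural log).

    Assumption: this weighting is one plausible reading of "weighted entropy";
    the exact definition used in the study may differ.
    """
    members = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        members[cid].append(lab)
    n = len(labels)
    total = 0.0
    for labs in members.values():
        size = len(labs)
        # Shannon entropy of the label distribution inside this cluster
        h = -sum((c / size) * math.log(c / size) for c in Counter(labs).values())
        total += (size / n) * h
    return total


def permutation_p_value(cluster_ids, labels, n_perm=100, seed=0):
    """Fraction of label shuffles whose entropy is no larger than the observed one.

    Using "<=" instead of "<" is a conservative choice made for this sketch.
    """
    rng = random.Random(seed)
    observed = weighted_entropy(cluster_ids, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if weighted_entropy(cluster_ids, shuffled) <= observed:
            hits += 1
    return hits / n_perm
```

When every cluster contains a single label, `weighted_entropy` returns zero, and when all samples fall into one cluster it returns the entropy of the whole dataset, matching the two limits of the curves discussed above.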

Figure 7.

Weighted entropy versus distance for true and random labels. The entropy values were calculated with respect to activity (permissive/repressive/both), type of factor (transcription/histone modifier/chromatin modifier/others), modifier (68 types), and cell type (14 types) for each chromosome, as a function of the distance (1 − Pearson correlation) used to cut the hierarchical clustering dendrogram. Black lines on the plot correspond to the entropy values when all labels in the dendrogram were randomly shuffled. Calculations were performed 100 times. Note that the y-axis is scaled differently in each plot.

We also used the adjusted Rand index (ARI) to compare dendrograms across all chromosomes. Unlike entropy, the ARI does not depend on specific labels, which allows a more comprehensive comparison of clustering patterns. Our analysis revealed that the pairwise ARI values in chromosome comparisons ranged from approximately 0.5 to 0.9. This range suggested that clustering patterns were more similar than dissimilar between all chromosomes.
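
As an illustration of the pairwise comparison, the ARI between two flat clusterings of the same samples can be computed from pair counts. The sketch below uses only the Python standard library. It assumes that each chromosome's dendrogram has already been cut into cluster ids listed in the same sample order, and the names `adjusted_rand_index` and `pairwise_ari` are hypothetical helpers, not functions from the study's code.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must cover the same samples")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions trivially agree
    # Pairs of samples placed together in both partitions
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


def pairwise_ari(clusterings):
    """ARI for every pair of chromosomes.

    `clusterings` maps a chromosome name to the cluster ids of the samples,
    listed in the same sample order for every chromosome (an assumption).
    """
    return {
        (a, b): adjusted_rand_index(clusterings[a], clusterings[b])
        for a, b in combinations(sorted(clusterings), 2)
    }
```

Because the index depends only on which samples are grouped together, renaming the cluster ids on one chromosome leaves the score unchanged.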

To check the consistency of the dataset itself, we computed entropy for the hg38 subsets, ensuring that samples with the same (mark, cell) labels were matched. The entropy plots for subsets A and B exhibit similar behavior, indicating that the dataset is internally consistent (Fig. 8). This validation supports the robustness of our clustering approach and confirms that observed patterns are not artifacts of dataset composition. Furthermore, the similarity in entropy trends suggests that the underlying biological organization of labels is preserved across subsets, while any observed differences may reflect meaningful biological variability rather than technical noise.

Figure 8.

Weighted entropy versus distance. The calculated entropy values across all chromosomes in the hg38 (GRCh38) dataset, shown separately for subsets A and B. Both subsets have the same size and matching (modifier, cell type) pair distribution. Entropy is computed with respect to activity type, factor, modifier, and cell type, as a function of the distance threshold used to cut the hierarchical clustering dendrogram. The similar behavior between subsets A and B supports the internal consistency of the dataset.

4.4 Several modifiers co-occurred in the same clusters across all chromosomes

The co-occurrence matrix heatmap revealed how epigenetic modifiers were related, specifically whether they tended to be present together in the same clusters across all chromosomes (Fig. 9). The co-occurrence patterns of modifiers showed that histone H2 and H3 acetylation modifiers commonly co-occurred. The trio of RAD21, SMC3, and CTCF was also frequently found to co-occur, consistent with their participation in the cohesin complex. POLR2A and its phosphorylated form, POLR2AphosphoS5, often clustered together with the pair of TAF1 and TBP. TAF1 and TBP were also frequently found in the same cluster as GTF2F1, consistent with their involvement in RNA polymerase II (Pol II) transcription. EZH2, alongside its phosphorylated variant, EZH2phosphoT487, co-occurred with SUZ12 and H3K27me3, consistent with the role of the polycomb repressive complex 2 (PRC2) in depositing the H3K27me3 modification. Another set of modifiers that tended to appear together was BHLHE40, MAX, CHD2, and MAZ; BHLHE40 also co-occurred with the pair of USF1 and USF2. Another notable co-occurrence included TCF12, RCOR1, ZFP36, RXRA, and HDAC2. EP300 and FOXA1 also tended to appear within the same cluster. Interestingly, these clustering trends were not confined to chromosome 6 alone; many of the modifiers that co-occurred on chromosome 6 were also found clustering together on other chromosomes. This pattern was evident in both UMAP plots and co-occurrence heatmaps, indicating a broader chromosomal consistency in the clustering behavior of these epigenetic modifiers (Figs 4 and 9).
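
A simplified version of the co-occurrence counting can be sketched as follows. The assumptions here are ours: a pair of modifiers is counted once for every sufficiently large cluster in which both are present, and the counts are averaged over chromosomes. The normalization used in the study (by the total number of modifiers and the number of chromosomes) and its treatment of same-modifier pairs may differ. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(cluster_ids, modifiers, min_size=4):
    """Count, per modifier pair, the clusters in which both modifiers appear.

    Clusters with fewer than `min_size` samples are ignored, mirroring the
    minimum cluster size of 4 used for the heatmaps.
    """
    clusters = defaultdict(list)
    for cid, mod in zip(cluster_ids, modifiers):
        clusters[cid].append(mod)
    counts = Counter()
    for mods in clusters.values():
        if len(mods) < min_size:
            continue
        # Presence-based counting: each distinct pair is counted once per cluster
        for pair in combinations(sorted(set(mods)), 2):
            counts[pair] += 1
    return counts


def average_over_chromosomes(per_chromosome_counts):
    """Sum the per-chromosome counts and divide by the number of chromosomes."""
    total = Counter()
    for counts in per_chromosome_counts:
        total.update(counts)
    n = len(per_chromosome_counts)
    return {pair: c / n for pair, c in total.items()}
```

Pairs are stored with the modifier names in alphabetical order, so a lookup such as `counts[("CTCF", "RAD21")]` does not depend on the order in which samples were listed.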

Figure 9.

Co-occurrences of epigenetic modifiers within the clusters across all chromosomes. The intensity of red coloration corresponds to higher frequencies (sums of counts) of co-occurrences within the same clusters, following a dendrogram cut-off at a correlation distance below 0.3. Only clusters with a minimum size of 4 samples were chosen. The heatmap highlights the top 50 co-occurring modifiers. The result is normalized by the total number of modifiers in the dataset and the number of chromosomes.

4.5 Many of the known genes were associated with microRNAs and snoRNAs

Finally, we connected important genes found across all chromosomes to the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway databases. We found several significant GO terms that were enriched in our important genes (Fig. 10), including those involved in chromatin remodeling and immune responses. For example, the GO term “Innate immune response in mucosa” was associated with the H2BC10, H2BC11, H2BC6, H2BC7, H2BC8, and RNASE3 genes, which are directly related to immune system processes. Meanwhile, other terms were related to fundamental cellular processes and structures involved in various aspects of immune cell function. KEGG pathway analysis revealed important genes that were connected to microRNAs, which are best annotated and studied in the context of cancer (Virmani et al. 2002) and were therefore identified as the most significant in our analysis. We also identified other microRNAs related to roles of the spliceosome, immune function (i.e. neutrophil extracellular trap formation and systemic lupus erythematosus), and alcoholism (Fig. 11).

Figure 10.

Gene Ontology results for the important genes found across all chromosomes (using the g:Profiler tool). Several molecular functions, biological processes, and cellular components are identified. Innate immune response in mucosa, the body’s early defense mechanism in mucosal tissues, is associated with genes such as H2BC10, H2BC11, H2BC6, H2BC7, H2BC8, and RNASE3.

Figure 11.

KEGG pathway network of important genes (P<0.05 in all samples in cluster) gathered across all chromosomes. The functionally grouped network is visualized using Cytoscape based on the connectivity between pathways and genes. Many of the 51 microRNA genes are annotated as involved in cancer pathways, as those are the most well studied. Some small nuclear RNAs are connected to the spliceosome. Many genes in the histone H2B family are found to be dysregulated in viral-induced cancers. These H2B genes and several others from the H2A histone family are linked to neutrophil extracellular trap formation, as histones play a major role in the NETs (Neutrophil Extracellular Traps) framework to halt invaders. Dysregulation of such processes occurs in autoimmune diseases, such as systemic lupus erythematosus, and in cancer, diabetes, and alcoholism (not shown in the plot).

4.6 Validation on another dataset

We evaluated our approach on another dataset consisting of 1479 samples aligned to the hg19 genome assembly. To make the comparison fair, we randomly selected 600 samples from each of the hg19-aligned and hg38-aligned datasets, resulting in an equal number of 14 cell types and 28 modifiers. We observed qualitatively similar branching factors on the entropy plot for all types of labels (Fig. 12). However, all entropies were lower for the hg19-aligned dataset than for the hg38-aligned dataset, making it hard to tell whether the observed differences were due to the datasets themselves or whether the difference between the hg19 and hg38 genome assemblies played a significant role. The co-occurrence matrix heatmaps revealed that the same epigenetic modifiers tended to cluster together regardless of the dataset, with the exception of H4K8ac, which co-occurred with many different modifiers in the hg19-aligned dataset but not in the hg38-aligned one (Figs 13 and 14). We also produced self-organizing maps to aid visualization (Fig. 15). CTCF, RAD21, and SMC3 modifiers appeared together in both plots. The trio of H3K4me1, H3K4me2, and H3K4me3 also appeared near each other in both plots. To determine which modifier pairs’ co-occurrence patterns differed significantly between the hg38 and hg19 datasets, we obtained P values by applying a z-test to the jackknife replicate differences (using the normal approximation) and then corrected all tests for multiple comparisons via the Benjamini–Hochberg procedure. We visualized the P values in a heatmap (Fig. 14). Cells colored red correspond to modifier pairs with adjusted P values of 1 under the jackknife z-test. These P values indicate that the result is maximally consistent with the null hypothesis between hg19 and hg38 for these pairs (i.e. we did not detect a significant difference in co-occurrence patterns of a modifier pair between datasets).
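
The testing procedure can be sketched with the standard library alone. This is an illustration under stated assumptions, not the study's code: the jackknife standard error uses the usual delete-one formula, the combined standard error is taken as the square root of the sum of the two squared standard errors (which assumes the two datasets are independent), and the helper names `jackknife_se`, `two_sided_z_p_value`, and `benjamini_hochberg` are hypothetical.

```python
import math


def jackknife_se(replicates):
    """Delete-one jackknife standard error from a list of replicate estimates."""
    n = len(replicates)
    mean = sum(replicates) / n
    return math.sqrt((n - 1) / n * sum((x - mean) ** 2 for x in replicates))


def two_sided_z_p_value(mean_a, se_a, mean_b, se_b):
    """Two-sided P value for the difference of two means (normal approximation)."""
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)  # assumes independent datasets
    if combined_se == 0:
        return 1.0 if mean_a == mean_b else 0.0
    z = (mean_a - mean_b) / combined_se
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, a modifier pair whose mean co-occurrence is identical in both datasets receives a raw P value of 1, which stays at 1 after adjustment and would correspond to a red cell in a heatmap like Fig. 14.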

Figure 12.

Weighted entropy versus distance. The calculated entropy values across all chromosomes for hg19 (GRCh37) and hg38 (GRCh38) datasets, with respect to activity type, factor, modifier, and cell type, as a function of the distance used to cut the hierarchical clustering dendrogram. To ensure a fair comparison, we randomly selected 600 samples from each dataset, resulting in an equal number of 14 cell types and 28 modifiers.

Figure 13.

Heatmaps illustrating the patterns of epigenetic modifier co-occurrences within the clusters in chromosome 6, derived from datasets aligned to the human genome assemblies hg19 (left) and hg38 (right). The intensity of red coloration corresponds to higher frequencies (sums of counts) of co-occurrences within the same clusters, following a dendrogram cut-off at a correlation distance below 0.3. Only clusters with a minimum size of 4 samples were chosen. The heatmaps exhibit a similar pattern of co-occurrences, indicating that certain epigenetic modifiers tend to co-occur regardless of the dataset. To ensure a fair comparison, we randomly selected 600 samples from each dataset, resulting in 28 unique modifiers. Note that some diagonal elements are not equal to 1. This is because samples of some modifiers co-occur with each other less frequently than samples of others. The result is normalized by the total number of modifiers in the dataset.

Figure 14.

Heatmap of FDR-adjusted P values from the jackknife z-test comparing hg19 versus hg38 co-occurrence. Each cell shows the Benjamini–Hochberg-corrected P value for a pair of epigenetic modifiers, computed by taking the difference in their mean co-occurrence over 600 jackknife replicates and dividing by the combined jackknife standard error. P values <0.05 indicate pairs whose co-occurrence rates differ significantly between the two datasets. Cells colored light gray correspond to modifier pairs with zero co-occurrence in both datasets, which are masked rather than assigned a P value of 1. The red cells have P values of exactly 1. The overall pattern parallels the co-occurrence patterns shown in Fig. 13.

Figure 15.

Self-organizing maps illustrating the patterns of epigenetic modifier co-occurrences within the clusters in chromosome 6, derived from datasets aligned to the human genome assemblies hg19 (left) and hg38 (right). The same data were used as in Fig. 13. The darker hexagons represent areas with higher distances to neighboring neurons, indicating potential boundaries between clusters. In contrast, lighter hexagons represent more homogeneous regions where features are similar to their neighboring neurons. CTCF, RAD21, and SMC3 appear together in both plots. H3K4me1, H3K4me2, and H3K4me3 appear near each other in both plots.

5 Discussion

Hierarchical clustering helped to reveal trends in the epigenome. On a large scale, clusterings from different chromosomes exhibited similar branching statistics. Also, regardless of chromosome, clustering first separated samples into groups based on their epigenetic modifier, and then, at smaller scales, into groups of similar cell types; however, the variation between epigenetic modifiers was larger than the variation between cell types. We also analyzed the ARI between clusterings on different chromosomes to investigate whether these clusters were similar. The results showed that indeed, there is broad similarity between the epigenetic information on different chromosomes.

We found important regions of the genome in most of the clusters. Through the use of GO information, many of the known genes were associated with microRNAs and snoRNAs, which are key to regulating gene expression and thought to be dysregulated during aberrant immune function and in various cancers. GO revealed several common terms for the important genes found across each chromosome (Figs 10 and 11). These genes are related to chromatin remodeling and nucleosome occupancy, mRNA splicing and regulation, and immune responses. This finding suggests that in addition to the traditional perspective that epigenetic modifiers help regulate the expression of genes involved with cellular processes, they also function in a feedforward manner to regulate their own expression. That is, epigenetic modifications, regardless of a permissive or repressive function, regulate chromatin states for all genes including epigenetic genes. Further, our analysis discovered that processes impacting mRNA regulation, including splicing and microRNAs, are also regulated systematically by epigenetic modifiers. Overall, this suggests that the genes consistently regulated by epigenetic modifiers, regardless of the type of modifier and the extent of permissiveness, are those involved in the formation of chromatin, gene expression processes, and small noncoding RNAs, including microRNAs and snoRNAs. As such, it is possible that identifying epigenetic regulation of noncoding RNAs may be key to understanding disease states, which requires support via empirical data.

The co-occurrence patterns of modifiers showed that histone H2 and H3 acetylation modifiers commonly co-occurred. The trio of RAD21, SMC3, and CTCF was also frequently found to co-occur. As components of the cohesin complex, RAD21 and SMC3, in conjunction with CTCF, are instrumental in promoting chromatin looping. Several studies have highlighted a correlation between CTCF and cohesin with both the frequency of interaction and gene expression during differentiation, suggesting their significant influence in mediating how chromatin structure affects gene regulation (Phillips-Cremins et al. 2013, Zuin et al. 2014). The co-occurrence of GTF2F1, TAF1, TBP, POLR2A, and POLR2AphosphoS5 in clusters can be attributed to their involvement in RNA Pol II transcription (Wang et al. 2012, Grau et al. 2015). EZH2 and its phosphorylated form, EZH2phosphoT487, clustered together with SUZ12 and H3K27me3. This aligns with the function of the PRC2 complex in laying down the H3K27me3 mark (Hansen et al. 2008).

Overall, the biologically plausible reasons for the observed clustering of epigenetic modifiers include their involvement in common pathways or biological processes, functional interactions where certain modifiers work together, and their co-binding to similar epigenomic regions. Additionally, posttranslational modifications like phosphorylation can link modifiers closely in function and regulation, contributing to their clustering.

In summary, a number of trends were identified by examining whole-genome epigenetic sequencing data for immune cells from the ENCODE consortium. Our validation, applying the method to two genome assembly datasets, demonstrated that some trends are persistent across the datasets, while others are not. Because the underlying data for these two assemblies come from fundamentally different samples, these findings reflect not only differences in processing but also inherent differences in the data itself. Determining whether the observed variation stems from biological differences or from the workflows used to collect the samples is beyond the scope of this work and remains an important task for future research. Ideally, for a fair comparison, the datasets should have similar composition in terms of sample size and label distribution to minimize potential biases. However, it is possible to analyze datasets of different sizes and label distributions by applying appropriate normalization techniques, such as subsampling, weighting, or statistical corrections to account for disparities. When such adjustments are made, the robustness of observed patterns across datasets can still be assessed. Nevertheless, any remaining differences should be interpreted cautiously, considering whether they arise from biological variation or dataset composition effects.

We elaborate on the rationale behind analyzing healthy data as a preparatory step for future infectious disease studies. Understanding the usual clustering patterns of epigenomic features in a non-diseased state provides a reference for identifying deviations associated with infection in future work. This study highlights that certain epigenetic modifiers tend to form clusters in healthy individuals due to their roles in normal physiological processes. These patterns, however, may be disrupted in the context of an infectious disease, leading to altered clustering behaviors. Such changes are indicative not only of disease-specific epigenomic alterations but also of broader biological responses to infection. For instance, the modulation of genes tied to the immune response during an infection could shift the clustering of related epigenetic modifiers away from what is observed in the healthy state. Similarly, epigenetic modifiers that are typically active in maintaining health may exhibit altered behavior under disease conditions. The entropy plots of the healthy and infectious datasets can be compared for variations in entropy levels, along with shifts in the overall variability of the dataset. Furthermore, combining exposure datasets with the healthy dataset and analyzing the resulting dendrogram could help identify which genes are important in the epigenetic response to the exposure. For example, a new infected sample can be incorporated into an existing dendrogram by computing the distances between the new data and the original data, and building a new dendrogram from the full collection. This would show where the new data sit in the spectrum of all of the data. In the same way that our study can show associations between labels like cell type or epigenetic mark, as well as identify genes involved in clusters, the new dendrogram could be used to find how the exposure label (i.e. modifier or cell type) relates to the healthy dataset, and which genes can be used to summarize this relationship. As data collection techniques improve, other possibilities for future work include using a similar data analysis pipeline to study other modalities of epigenetic information such as single-cell data.
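
The first step of this proposal, computing distances between a new sample and the existing ones, can be sketched as follows. The sketch uses the same distance as the rest of the analysis (1 − Pearson correlation) and covers only the distance computation, not the rebuilding of the dendrogram. The function names and the dictionary layout of the reference profiles are assumptions made for illustration.

```python
import math


def pearson_distance(x, y):
    """Correlation distance, 1 - Pearson correlation, between two profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - cov / math.sqrt(var_x * var_y)


def nearest_reference_samples(new_profile, reference_profiles, k=3):
    """Return the k reference samples closest to the new profile.

    `reference_profiles` maps a sample name to its signal profile over the
    same genomic bins as `new_profile` (hypothetical data layout).
    """
    distances = [
        (pearson_distance(new_profile, profile), name)
        for name, profile in reference_profiles.items()
    ]
    return sorted(distances)[:k]
```

Appending these distances as a new row and column of the existing distance matrix and rerunning the hierarchical clustering would then place the new sample within the full collection.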

6 Conclusion

Our study presents a robust methodological framework that employs hierarchical clustering to analyze large-scale epigenomic datasets, revealing critical relationships among epigenetic modifiers. The consistent clustering patterns observed across chromosomes highlight the essential role of these modifiers in influencing epigenetic regulation throughout the genome. Furthermore, our analysis underscores the feedforward nature of epigenetic mechanisms, emphasizing the involvement of noncoding RNAs in maintaining normal physiological states.

The reproducibility and consistency of clustering patterns and modifier co-occurrences identified here provide a foundational reference to recognize deviations associated with disease states, particularly infections. Future research can leverage this framework to identify disease-specific epigenetic alterations and their biological consequences, potentially facilitating early detection and informing targeted therapeutic strategies.

Acknowledgements

This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001.

Contributor Information

Anastasiia Kim, Computing and AI division at Los Alamos National Laboratory, Los Alamos, NM 87544, United States.

Nicholas Lubbers, Computing and AI division at Los Alamos National Laboratory, Los Alamos, NM 87544, United States.

Christina R Steadman, Bioscience division at Los Alamos National Laboratory, Los Alamos, NM 87544, United States.

Karissa Y Sanbonmatsu, Theoretical division at Los Alamos National Laboratory, Los Alamos, NM 87544, United States.

Author contributions

Anastasiia Kim (Formal analysis, data curation, code development, methodology, visualization, writing—original draft), Nicholas Lubbers (Conceptualization, formal analysis, methodology, validation, writing—original draft), Christina R. Steadman (Conceptualization, data curation, validation, writing—original draft), and Karissa Y. Sanbonmatsu (Conceptualization, funding acquisition, supervision, writing—review and editing)

Conflict of interest

None declared.

Funding

The study was supported by the Defense Threat Reduction Agency, grant DTRA1308139949.

Data availability

The code supporting the results of this article is available in the https://github.com/lanl/epigen repository.

References

  1. Barrett T, Suzek TO, Troup DB, et al. NCBI GEO: mining millions of expression profiles—database and tools. Nucleic Acids Res 2005;33:D562–6.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Methodol 1995;57:289–300.
  3. Callinan PA, Feinberg AP. The emerging science of epigenomics. Hum Mol Genet 2006;15:R95–101.
  4. Chambwe N, Kormaksson M, Geng H, et al. Variability in DNA methylation defines novel epigenetic subgroups of DLBCL associated with different clinical outcomes. Blood 2014;123:1699–708.
  5. Clark SJ, Lee HJ, Smallwood SA, et al. Single-cell epigenomics: powerful new methods for understanding gene regulation and cell identity. Genome Biol 2016;17:72–10.
  6. Fernández-Morera JL, Calvanese V, Rodríguez-Rodero S, et al. Epigenetic regulation of the immune system in health and disease. Tissue Antigens 2010;76:431–9.
  7. Grau J, Grosse I, Posch S, et al. Motif clustering with implications for transcription factor interactions. Technical report. PeerJ PrePrints 2015.
  8. Hansen KH, Bracken AP, Pasini D, et al. A model for transmission of the H3K27me3 epigenetic mark. Nat Cell Biol 2008;10:1291–300.
  9. Hou L, Zhang X, Wang D, et al. Environmental chemical exposures and human epigenetics. Int J Epidemiol 2012;41:79–105.
  10. Ilango S, Paital B, Jayachandran P, et al. Epigenetic alterations in cancer. Front Biosci 2020;25:1058–109.
  11. Janson PC, Winqvist O. Epigenetics–the key to understand immune responses in health and disease. Am J Reprod Immunol 2011;66:72–4.
  12. Jirtle RL, Skinner MK. Environmental epigenomics and disease susceptibility. Nat Rev Genet 2007;8:253–62.
  13. Kent WJ, Sugnet CW, Furey TS, et al. The human genome browser at UCSC. Genome Res 2002;12:996–1006.
  14. Keverne EB, Pfaff DW, Tabansky I. Epigenetic changes in the developing brain: effects on behavior. Proc Natl Acad Sci USA 2015;112:6789–95.
  15. Kohonen T. The self-organizing map. Proc IEEE 1990;78:1464–80.
  16. Leinonen R, Sugawara H, Shumway M; International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res 2011;39:D19–21.
  17. Lin I-H, Chen D-T, Chang Y-F, et al. Hierarchical clustering of breast cancer methylomes revealed differentially methylated and expressed breast cancer genes. PLoS One 2015;10:e0118453.
  18. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. WIREs Data Mining Knowl Discov 2012;2:86–97.
  19. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview, II. Wiley Interdiscip Rev Data Min Knowl Discov 2017;7:e1219.
  20. Obata Y, Furusawa Y, Hase K. Epigenetic modifications of the immune system in health and disease. Immunol Cell Biol 2015;93:226–32.
  21. Perera BPU, Faulk C, Svoboda LK, et al. The role of environmental exposures and the epigenome in health and disease. Environ Mol Mutagen 2020;61:176–92.
  22. Phillips-Cremins JE, Sauria MEG, Sanyal A, et al. Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell 2013;153:1281–95.
  23. Reimand J, Kull M, Peterson H, et al. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res 2007;35:W193–200.
  24. Seligson DB, Horvath S, Shi T, et al. Global histone modification patterns predict risk of prostate cancer recurrence. Nature 2005;435:1262–6.
  25. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003;13:2498–504.
  26. Shiina T, Hosomichi K, Inoko H, et al. The HLA genomic loci map: expression, interaction, diversity and disease. J Hum Genet 2009;54:15–39.
  27. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57.
  28. Tiffon C. The impact of nutrition and environmental epigenetics on human health and disease. Int J Mol Sci 2018;19:3425.
  29. Titus AJ, Bobak CA, Christensen BC. A new dimension of breast cancer epigenetics. In: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018, Funchal, Madeira, Portugal) - Volume 3: BIOINFORMATICS, pp.140–145. Setúbal, Portugal: SCITEPRESS, 2018.
  30. Virmani AK, Tsou JA, Siegmund KD, et al. Hierarchical clustering of lung cancer cell lines using DNA methylation markers. Cancer Epidemiol Biomarkers Prevent 2002;11:291–7.
  31. Wang J, Zhuang J, Iyer S, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res 2012;22:1798–812.
  32. Wolffe AP. Chromatin remodeling: why it is important in cancer. Oncogene 2001;20:2988–90.
  33. Xi Y, Shi J, Li W, et al. Histone modification profiling in breast cancer cell lines highlights commonalities and differences among subtypes. BMC Genomics 2018;19:150–11.
  34. Zamanighomi M, Lin Z, Daley T, et al. Unsupervised clustering and epigenetic classification of single cells. Nat Commun 2018;9:2410.
  35. Zuin J, Dixon JR, van der Reijden MIJA, et al. Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells. Proc Natl Acad Sci USA 2014;111:996–1001.

Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press
