Abstract
Cancer cells show remarkable plasticity and can switch lineages in response to the tumor microenvironment. Cellular plasticity drives invasiveness and metastasis and helps cancer cells to evade therapy by developing resistance to radiation and cytotoxic chemotherapy. Increased understanding of cell fate determination through epigenetic reprogramming is critical to discover how cancer cells achieve transcriptomic and phenotypic plasticity.
Glioblastoma is a perfect example of cancer evolution where cells retain an inherent level of plasticity through activation or maintenance of progenitor developmental programs. However, the principles governing epigenetic drivers of cellular plasticity in glioblastoma remain poorly understood. Here, using machine learning (ML), we perform cross-patient prediction of transcript expression using a combination of epigenetic features (ATAC-seq, CTCF ChIP-seq, RNAPII ChIP-seq, and H3K27Ac ChIP-seq) and RNA-seq of glioblastoma stem cells (GSCs). We investigate different ML and deep learning (DL) models for this task and build our final pipeline using XGBoost. The model trained on one patient generalizes to another, suggesting that the epigenetic signals governing gene transcription are consistent across patients even though GSCs can be very different. We demonstrate that H3K27Ac is the epigenetic feature that contributes most to cross-patient prediction of gene expression. In addition, using H3K27Ac signals from patient-derived GSCs, we can predict gene expression of human neural crest stem cells, suggesting a shared developmental epigenetic trajectory between subpopulations of these malignant and benign stem cells.
Our cross-patient ML/DL models determine weighted patterns of influence of epigenetic marks on gene expression across patients with glioblastoma and between GSCs and neural crest stem cells. We propose that broader application of this analysis could reshape our view of glioblastoma tumor evolution and inform the design of new epigenetic targeting therapies.
Keywords: glioblastoma, epigenetics, transcriptomics, cross-patient prediction, machine learning, deep learning
1. Author summary
This study aimed to develop a machine learning (ML) pipeline that can be used to investigate the role of epigenetic regulation on gene transcription in glioblastoma stem cells (GSCs). We developed a cross-patient prediction pipeline with multi-epigenomic data of patient-derived GSCs to predict gene expression. Our pipeline includes in-silico perturbation analysis, which examines the impact of different epigenetic regulators including chromatin accessibility (ATAC-seq), distal chromatin looping (CTCF ChIP-seq), histone modifications (H3K27Ac ChIP-seq), and active transcription (RNAPII ChIP-seq), on gene transcription across patients. Our in-silico perturbation analysis inferred that the various epigenetic modulators are essential for regulating gene expression, with a higher weight on H3K27Ac, across patients. Collectively, we developed a cross-patient prediction pipeline that can be used to unravel the multi-epigenetic-driven mechanism of gene expression and propose potential drivers of cellular plasticity in GSCs.
2. Introduction
Glioblastoma stem cells (GSCs) are characterized by tumor-initiating and self-renewal properties and are known to drive chemoresistance and heterogeneity in glioblastoma (1–5). To date, there is limited understanding of the epigenetic factors that underlie the failure of GSCs to attenuate their stemness potential in the face of differentiation cues. In addition, the influence of epigenetic mechanisms on GSC phenotypic plasticity is not fully understood. To address these questions, previous studies have examined the contribution of histone modifications, chromatin accessibility, and distal chromatin looping to gene transcription (6–9). Conventional studies employ correlation analysis to examine the linear relationship between gene expression and a single epigenetic modulator. For example, histone modifications were shown to be predictive of gene expression on the basis of their linear relationship with expression (10). In addition, CTCF was shown to play a role in gene regulation by participating in the establishment and maintenance of chromatin loops (11). Finally, expression of oncogenes is amplified by increased chromatin accessibility and enhancer activation (12). Although these studies provide mechanistic insights into epigenetics-based modulation, they report poor correlation coefficients between each epigenetic modulator and gene expression, suggesting that correlation analysis alone cannot comprehensively interrogate the large volume of epigenomic and transcriptomic data.
To address this issue, machine learning (ML) techniques have been applied to epigenomics data such as ATAC-seq and ChIP-seq of histone modification marks or transcription factors. Most of these studies predict gene expression from one type of epigenetic modulator, such as histone modifications, as seen in AttentiveChrome and GraphReg (13–15). In the last few years, a few studies have also applied ML algorithms to build models that take multiple epigenetic modulators as input to predict gene expression (16–19). However, machine learning has not yet been applied to study the combinatorial effect of multiple epigenetic modulators across patients with cancer.
Here, we develop an ML-based prediction model that predicts gene expression levels in patient-derived GSCs using multiple epigenetic regulators (ATAC-seq, RNA polymerase II (RNAPII) ChIP-seq, CTCF ChIP-seq, and H3K27Ac ChIP-seq) at high performance. To examine the contribution of each epigenetic regulator on gene expression, we perform in-silico perturbation analysis and show that all epigenetic regulators contribute to gene expression prediction, with a higher contribution of H3K27Ac ChIP-seq signals, followed by RNAPII ChIP-seq, ATAC-seq, and CTCF ChIP-seq across patients.
Glioblastoma is a perfect example of cancer evolution where cells retain an inherent level of plasticity through activation or maintenance of progenitor developmental programs. To determine the contribution of H3K27Ac and CTCF binding to predicting gene expression in GSCs and neural stem cells, we used publicly available H3K27Ac and CTCF ChIP-seq data of neural crest cells (NCCs) and neural progenitor cells (NPCs) as test data for our GSC-trained model and compared the Pearson Correlation Coefficients (PCCs). Our analysis shows that GSCs and NCCs share patterns of influence of the H3K27Ac enhancer landscape on gene expression.
Overall, our ML approach determines weighted patterns of influence of epigenetic marks on gene expression across patients with glioblastoma. Moreover, our approach reveals pattern similarity of H3K27Ac marks across GSCs and NCCs, which provides insights into epigenetic-driven cellular plasticity. Broader application of this analysis could reshape our view of glioblastoma tumor evolution and inform the design of new epigenetic targeting therapies.
3. Materials and methods
3.1. Datasets and pre-processing
We model and investigate the relationship between epigenetic modulators and gene transcription of patient-derived GSCs using machine learning. To achieve this, we used the following four markers to compose our GSC patient datasets: ChIP-sequencing of H3K27Ac (enhancer marker), RNAPII (active transcription marker), and CTCF (distal chromatin looping marker), and ATAC-sequencing (chromatin accessibility), together with RNA-sequencing data (Fig 1). This dataset was created for two patients (GSC1 and GSC2). For investigative downstream experiments, we included datasets composed of markers from non-GSC neural crest and neural progenitor cell data sourced from ENCODE (accession codes: ENCFF056WDN, ENCFF521XJN, ENCFF400MZX, ENCFF123YLB, ENCFF503KKJ, ENCFF655GGB, and ENCFF583OOM)(18,20–26). Both neural cell samples were derived from the H9 human embryonic stem cell line. The progenitor cell dataset included H3K27Ac ChIP-seq, CTCF ChIP-seq, and DNase-seq (as an analog for ATAC-sequencing). The crest cell dataset included H3K27Ac ChIP-seq and CTCF ChIP-seq. For detailed information on the preparation of these datasets, see supplementary section S1.
Fig 1.
Schematic overview of epigenetics-driven gene transcription and epigenomics sequencing data processing: Gene transcription heavily relies on epigenetics. These epigenetic mechanisms can be categorized into four categories: A. Chromatin accessibility, B. Active Transcription, C. Chromatin looping, D. Histone modifications. Counts of these sequencing +/− 2.5 kilo base-pairs (kbp) flanking the TSS region of each gene were measured and divided into 50 bins, with each bin representing 100 base pairs to create a heatmap for the input of the model.
We focused on the +/− 2.5 kilo base-pairs (kbp) flanking region of the transcription start site (TSS) for each gene and divided it into 50 bins, with each bin representing 100 base pairs. Further, we created a 50 × 4 matrix with rows representing bins and columns for epigenetic features for each of the 20,015 genes, and each bin contains summarized counts. As a result, we passed 20,015 × 50 × 4 to our model as an input. To prepare the gene expression labels, we summarized counts of the +/− 2.5kb flanking the TSS per gene and normalized it using transcripts per million (Fig 2).
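The binning step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the "summarized counts" are per-bin sums of a per-base coverage vector, and the function names and the toy Poisson signal are hypothetical.

```python
import numpy as np

FLANK_BP = 2500                    # +/- 2.5 kbp around the TSS (from the text)
BIN_BP = 100                       # bin width in base pairs (from the text)
N_BINS = 2 * FLANK_BP // BIN_BP    # 50 bins

def bin_counts(per_base_signal):
    """Sum a per-base signal of length 5,000 into 50 bins of 100 bp.

    Assumption: 'summarized counts' means the sum of counts in each bin.
    """
    per_base_signal = np.asarray(per_base_signal, dtype=float)
    assert per_base_signal.shape == (2 * FLANK_BP,)
    return per_base_signal.reshape(N_BINS, BIN_BP).sum(axis=1)

def gene_matrix(signals):
    """Stack binned signals column-wise: rows are bins, columns are features."""
    return np.stack([bin_counts(s) for s in signals], axis=1)

# Toy per-base coverage for one gene and four assays (hypothetical data).
rng = np.random.default_rng(0)
toy_signals = [rng.poisson(1.0, size=2 * FLANK_BP) for _ in range(4)]
m = gene_matrix(toy_signals)
print(m.shape)  # (50, 4)
```

Stacking one such matrix per gene along a new leading axis would give the 20,015 × 50 × 4 input described in the text.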
Fig 2. Patient datasets representation after preprocessing.
The figure depicts the standardized epigenetic marker values per gene for a patient. It highlights the 3-dimensional arrangement of the datasets prior to model input. Here the “x” axis corresponds to the 50 bins of 100bp counts for each feature. The “y” axis represents each gene’s 4 epigenetic features. The figure’s “z” axis is representative of the gene arrangement in the dataset.
The second preprocessing step included separate standardization of the epigenetic features and log2 transformation of the gene expression measurements (27). To account for the variation in count values across different types of sequencing and experimental conditions, we standardized the counts of each sequencing data type using the mean value of the corresponding sequencing data. This standardization was performed after the train, validation, and test data split. Each of the four epigenetic features was standardized separately within the train, validation, and test sets. We chose to standardize the data at this point, rather than before the data split, to avoid potential data leakage (28). Each gene’s target label was log2-transformed with a pseudocount of 1 prior to the data split. Additional information regarding the model input process is contained in supplementary section S2 (Fig S1 & S2).
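The order of operations in this step (label transform before the split, feature standardization after it and within each split) can be sketched as below. The sketch assumes z-score standardization per feature, which is consistent with 0.0 being described later as the mean after standardization; the helper names and toy arrays are hypothetical.

```python
import numpy as np

def standardize_per_feature(x):
    """Z-score each feature channel of an array shaped (genes, bins, features).

    Assumption: statistics are computed over all genes and bins of one split.
    """
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True)
    return (x - mean) / np.where(std == 0, 1.0, std)

def log2_labels(tpm):
    """log2 transform with a pseudocount of 1 (from the text)."""
    return np.log2(np.asarray(tpm, dtype=float) + 1.0)

# Toy data standing in for one patient (hypothetical values).
rng = np.random.default_rng(0)
x = rng.poisson(5.0, size=(100, 50, 4)).astype(float)
y = log2_labels(rng.gamma(2.0, 10.0, size=100))   # transformed before the split

# Split first, then standardize each split on its own statistics.
idx = rng.permutation(100)
train_idx, val_idx = idx[:70], idx[70:]
x_train = standardize_per_feature(x[train_idx])
x_val = standardize_per_feature(x[val_idx])
y_train, y_val = y[train_idx], y[val_idx]
```

Because each split uses only its own mean and standard deviation, no statistic computed on validation or test genes enters the training features.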
3.2. Machine learning modeling
We tested several regression-based cross-patient predictive models to examine the novel application of our specific combination of epigenetic markers to predict gene expression using machine learning. Given the two patients (GSC1 and GSC2), we sought to use machine learning to extract a common pattern seen across patients, as GSCs are highly heterogeneous from one patient to the next. Therefore, we trained each model using a subset of genes from patient 1 followed by testing the same model with patient 2’s dataset (represented as GSC1→GSC2). Our experimental setup also included the inverse operation where the models were trained on patient 2’s dataset and tested using patient 1’s data (represented as GSC2 →GSC1).
To our knowledge this was the first time this specific combination of markers was explored in a machine learning study. In every case, the prediction task was a regression where the models predicted the RNA-seq gene expression value per gene. Cross-patient prediction experiments required two datasets (one for training and validation, and the other for testing), each composed of ChIP-seq, ATAC-seq, and RNA-seq.
To select the best machine learning model, we tested deep learning architectures like a Multi-layered Perceptron (MLP), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and a Branched Multi-layer Perceptron (Branched MLP). With the branched MLP we combined the genomic sequences from the region as inputs.
Traditional machine learning algorithms included Gradient Boosting Regression (GBR), Support Vector Regression (SVR), and Multiple Linear Regression (MLR) architectures. Detailed information regarding these models is outlined in supplementary section S3.
Evaluation metrics.
Pearson Correlation Coefficient (PCC) was used as the primary metric for hyperparameter tuning, performance evaluation, and feature perturbation. Spearman Correlation Coefficient (SCC) was calculated concurrently with PCC.
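For readers less familiar with the two metrics, the following is a minimal NumPy sketch of PCC and SCC (Spearman computed as the Pearson correlation of average ranks). It is an illustration only; the source does not state which implementation was used, and the function names are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    _, inverse = np.unique(v, return_inverse=True)
    tie_sums = np.bincount(inverse, weights=ranks)
    tie_counts = np.bincount(inverse)
    return (tie_sums / tie_counts)[inverse]

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# A monotonic but non-linear relationship: SCC is 1, PCC is below 1.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true ** 3
scc = spearman(y_true, y_pred)
pcc = pearson(y_true, y_pred)
```

Reporting both is useful because PCC is sensitive to the linear scale of the predictions, whereas SCC only reflects whether genes are ranked in the right order.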
3.2.1. Criteria for model selection.
Out of all the models tested, we selected our final machine learning model based on the analysis of each model’s overall Pearson Correlation Coefficient (PCC) metric results.
3.2.2. XGBoost model details
The XGBoost-based model exhibited the best average performance across all the cross-patient experiments in this study. The downstream analysis and discussions we share in later sections are drawn from this architecture’s predictions. Since XGBoost takes 2-dimensional input, the dataset features were flattened to 20,015 × 200 while the target variable (RNA-seq) became a 20,015 × 1 array. The relative positioning of the bins of each of the four gene features was kept contiguous (Fig 3).
Fig 3. Cross-patient prediction methodology using the XGBoost model architecture.
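One way to flatten a (genes × 50 × 4) array so that the 50 bins of each feature stay contiguous is sketched below. The resulting column order (all bins of feature 0, then all bins of feature 1, and so on) is an assumption made for illustration; the source states only that the bins of each feature were kept contiguous.

```python
import numpy as np

def flatten_features(x):
    """Flatten (genes, bins, features) to (genes, features * bins).

    The transpose puts the feature axis before the bin axis so that each
    feature's bins occupy one contiguous block of columns.
    """
    n_genes, n_bins, n_features = x.shape
    return x.transpose(0, 2, 1).reshape(n_genes, n_features * n_bins)

# Toy array with 3 genes (hypothetical values).
x = np.arange(3 * 50 * 4, dtype=float).reshape(3, 50, 4)
flat = flatten_features(x)
print(flat.shape)  # (3, 200)
```

A plain `x.reshape(n_genes, -1)` would instead interleave the four features bin by bin, which would not keep each feature's bins adjacent.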
The model input for training and validation was derived from a different patient than the testing dataset. As shown, the matrices were flattened before going into the model where the RNA-seq value was predicted.
Loss function.
We used mean squared error (MSE) for loss function calculations as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$
In the calculation, $y_i$ represents the actual RNA-seq measurement and $\hat{y}_i$ represents the predicted value for gene $i$, for the $n$ genes in the dataset.
3.3. Model training, validation and testing
We split the data into training, validation, and test sets to perform hyperparameter tuning. We trained two models (GSC1→GSC2 and GSC2→GSC1). The hyperparameters for the first model were tuned by dividing the GSC1 dataset into training (70%) and validation (30%) sets. All the genes in the GSC2 dataset constituted the test set. A similar setup was used for the GSC2→GSC1 model. For each model, a base set of hyperparameters for experimental testing was chosen by grid search (evaluating each unique combination in the Cartesian product of candidate hyperparameter values), optimizing for higher PCC. Supplementary section S3 (Tables S1–S8) includes additional information on hyperparameter tuning.
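The split and the grid enumeration can be sketched as follows. The search space shown is hypothetical (the actual values are in supplementary section S3); the parameter names are standard XGBoost options but their use here is an assumption, and the helper names are illustrative.

```python
import itertools
import numpy as np

def split_indices(n_genes, train_frac=0.7, seed=0):
    """Shuffle gene indices and split them into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_genes)
    n_train = int(round(train_frac * n_genes))
    return idx[:n_train], idx[n_train:]

def grid(param_values):
    """Yield one dict per combination in the Cartesian product of the values."""
    keys = list(param_values)
    for combo in itertools.product(*(param_values[k] for k in keys)):
        yield dict(zip(keys, combo))

# Hypothetical search space, for illustration only.
search_space = {
    "max_depth": [4, 6],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 400],
}
candidates = list(grid(search_space))   # 2 * 2 * 2 = 8 unique sets

# Genes of the training patient are split 70/30; the other patient's genes
# are held out entirely as the test set.
train_idx, val_idx = split_indices(20015)
```

Each candidate set would then be fitted on the training genes and scored by PCC on the validation genes, with the best-scoring set carried forward to cross-patient testing.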
We ran each model 10 times (with different random seeds) in the two cross-patient arrangements (GSC1→GSC2 and GSC2→GSC1) for experimental testing. The mean and standard deviation of our metrics are reported for our study’s experimental results (Fig 4A and 4B).
Fig 4A & 4B. PCC cross-patient regression model results.
Our experimental results were compiled as the mean PCC scores over 10 runs of each model. In both graphs, the error bars indicate the standard deviation of the model results. A) Our cross-patient XGBoost-based regression model outperformed all other architectures when training with GSC1 and testing with GSC2 (GSC1→GSC2). B) Our XGBoost-based algorithm also outperformed the others when training with GSC2 and testing with GSC1 (GSC2→GSC1).
3.4. Perturbation experimental setup and analysis
To study the individual effect of each epigenetic signal on prediction, we perturbed one of the dataset’s four epigenetic features at a time and ran the perturbed input through the trained model. For this set of experiments, each signal perturbation was repeated for 10 runs, each with a different random seed. The perturbation experiments maintained our cross-patient prediction arrangement and were run in both prediction directions (GSC1→GSC2 and GSC2→GSC1).
Perturbation of a particular epigenetic signal was achieved by replacing all of that feature’s values with 0.0 (the mean after standardization) for all genes across the test dataset. Our analysis compared the mean and standard deviation of the results for each set of perturbations with the original model performance and with each other. The per-gene prediction results were also recorded for later analysis.
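The perturbation procedure can be sketched on a flattened test matrix as below. To keep the example runnable without a trained model, a fixed linear function with arbitrary toy weights stands in for the trained XGBoost regressor; the weights, the feature-block order, and the toy data are all hypothetical and do not reproduce any result from the study.

```python
import numpy as np

N_BINS = 50
# Assumed column-block order of the flattened input (illustrative only).
FEATURES = ["H3K27Ac", "RNAPII", "ATAC", "CTCF"]

def perturb(x_flat, feature):
    """Return a copy of x_flat with one feature's 50-bin block set to 0.0."""
    k = FEATURES.index(feature)
    out = x_flat.copy()
    out[:, k * N_BINS:(k + 1) * N_BINS] = 0.0
    return out

def pcc(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Toy stand-in for a trained model: arbitrary weights, not a study result.
toy_weights = np.zeros(4 * N_BINS)
toy_weights[:N_BINS] = 1.0              # first block weighted strongly
toy_weights[N_BINS:2 * N_BINS] = 0.2    # second block weighted weakly

def predict(x_flat):
    return x_flat @ toy_weights

rng = np.random.default_rng(0)
x_test = rng.standard_normal((500, 4 * N_BINS))   # standardized test features
y_test = predict(x_test) + 0.1 * rng.standard_normal(500)

baseline = pcc(y_test, predict(x_test))
drops = {f: baseline - pcc(y_test, predict(perturb(x_test, f))) for f in FEATURES}
```

The drop in PCC for each feature is then compared against the unperturbed baseline; in the study this comparison is repeated over 10 seeds and in both prediction directions.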
4. Results
4.1. Machine and deep learning models perform similarly in different cross-patient prediction scenarios
The study found that, when the models were trained with GSC1’s (patient 1) and tested with GSC2’s (patient 2) data, which we referred to as GSC1→GSC2, the highest PCC value was obtained with the XGBoost Regression (XGBR) model at 0.826199 ± 0.000888. Multiple Linear Regression (MLR) had the lowest performance at 0.676872 ± 0.0, a difference of 0.149327 in PCC. The sub-par performance of MLR indicated that the relationships among the epigenetic markers, and between the markers and gene expression, were non-linear. This characteristic of the data reinforced the need to experiment with many different algorithms of higher complexity to optimize our study’s application of machine learning. Interestingly, given the non-linearity of the data, five of our models successfully found patterns leading to RNA-seq values using completely different architectures. In fact, this sub-group (XGBR, GBR, MLP, CNN, and RNN) performed within 0.014497 of each other. This indicated that the results were to some degree agnostic to the model used and depended more on the epigenetic features of the dataset.
For the following set of experiments, we trained all of our models with GSC2’s (patient 2) dataset and tested with GSC1’s (patient 1). We referred to this as the GSC2→GSC1 prediction direction. The highest performance we achieved with this set of experiments was a PCC score of 0.809537 ± 0.000377, again from XGBR. Comparing this prediction direction to the previous one, we found that the general trends in metrics were similar. The separation between XGBR and MLR was 0.12818 for this prediction direction. Meanwhile, the difference between the first- and sixth-ranked models was 0.016044. These two findings indicate that the differences in PCC among our higher-performing models were small for this prediction direction as well.
Our data suggest that across both prediction directions, XGBR emerged as the model with the highest test set PCC measurements. Additionally, the strong PCC values for most of our models, including XGBR, contrasted with the consistently lower results from MLR and underscored the non-linearity of our GSC datasets. Therefore, we used the XGBR model for our following perturbation experiments and downstream analysis. Finally, the strong PCC values highlighted the success of our cross-patient prediction approach in generalizing the epigenetic input patterns connected to gene expression from one patient to another. Our methodology of experimenting with both the GSC1→GSC2 and GSC2→GSC1 prediction directions yielded high performance and consistency across our metrics.
Our study’s corresponding SCC metric results are detailed in the supplementary section S5 (Fig S4A & S4B). Script time considerations are detailed in the supplementary section S6 (Table S9).
4.2. Perturbation Results: All the combined epigenetic markers (epigenetic modification, chromatin accessibility, and histone modifications) contribute to gene expression with a higher weight on H3K27Ac signals
To evaluate the contribution of each epigenetic marker to predicting gene expression, we conducted perturbation analyses on each marker and observed the resulting performance metric (PCC). Specifically, when the model was trained on GSC1 and evaluated on GSC2 patient data (GSC1→GSC2), the most striking change occurred when we perturbed the H3K27Ac signals, which produced a decline of 0.583851 (70.667%) in PCC. We noted declines of 0.036298 (4.393%) for RNAPII, 0.004939 (0.597%) for ATAC-seq, and 0.00419 (0.507%) for CTCF perturbations (Fig 5A). This suggests that prediction of gene transcription depends on all epigenetic markers (histone modifications, RNAPII binding, broad chromatin accessibility, and chromatin looping), with the greatest weight on H3K27Ac signals, followed by RNAPII, ATAC-seq, and then CTCF.
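As a worked check of how the percentages relate to the absolute declines, the relation below assumes each percentage is the decline divided by the unperturbed XGBR PCC for this direction (0.826199, Section 4.1). This reading is an inference from the reported numbers rather than a formula stated in the source.

```latex
\[
\text{relative decline} \;=\;
\frac{\mathrm{PCC}_{\text{baseline}} - \mathrm{PCC}_{\text{perturbed}}}{\mathrm{PCC}_{\text{baseline}}} \times 100\%
\]
\[
\text{H3K27Ac: } \frac{0.583851}{0.826199} \times 100\% \approx 70.67\%,
\qquad
\text{RNAPII: } \frac{0.036298}{0.826199} \times 100\% \approx 4.39\%
\]
```

Both values agree with the reported 70.667% and 4.393% to the precision shown.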
Fig 5A & 5B. Epigenetic signal perturbation model performance comparison using XGBoost in both prediction directions.
Each epigenetic signal was perturbed over 10 separate experiments in the two prediction directions. Shown are the mean PCC model results over 10 runs (different random seeds) for each epigenetic marker, with error bars indicating the standard deviation. The hyperparameters used were identical to our other testing (see Table 1). The figures rank the signals by the effect of their perturbation on model performance, from most to least: H3K27Ac, RNAPII, ATAC-seq, and CTCF. The effect was consistent in both prediction directions. A) GSC1→GSC2 prediction direction. B) GSC2→GSC1 prediction direction.
To determine whether this finding is consistent across patients, we applied our perturbation methodology to the opposite prediction direction (GSC2→GSC1). Every perturbation again produced a decline in mean PCC, although to a lesser degree. Perturbing the H3K27Ac signal of the GSC1 patient data, used here as the evaluation dataset, yielded a decrease of 0.14429 (17.824%). This was lower than the percentage decrease for the same feature in the GSC1→GSC2 direction, but still larger than the decreases for the other three features. The remaining results for this prediction direction showed declines of 0.008748 (1.08%) for RNAPII perturbation, 0.005978 (0.738%) for ATAC-seq, and 0.0021 (0.2594%) for CTCF perturbation (Fig 5B). Thus, the in-silico perturbation analysis in this direction shows the same ordering of the predictive dependence of gene transcription on the epigenetic modulators. This suggests that, according to the data, gene transcription is most strongly connected with the H3K27Ac signals, followed by RNAPII, ATAC-seq, and CTCF, and that this pattern holds across both patients. Interestingly, perturbing H3K27Ac led to different degrees of PCC reduction in GSC1→GSC2 and GSC2→GSC1 (0.456 and 0.250, respectively). This indicates that the degree to which gene transcription depends on H3K27Ac signals varies among GSCs. This could potentially be due to their heterogeneity, as H3K27Ac signals can vary among patients (29).
Comparing these PCC values to the correlation values between RNA-seq and each epigenetic marker, we can say that our cross-patient prediction analysis captured a trend that is not captured by simple correlation analysis (S4, Fig S3A & S3B). Additionally, this result on the importance of H3K27Ac signals was further supported by the SCC results (S7, Fig S5A & S5B) and the primary model’s feature importance output (S8, Fig S6A & S6B). To address the generalizability of this model to unseen data, we altered the hyperparameters for the “reverse” direction and observed a measurable increase in PCC/SCC, supporting the generalizability of this model (S9).
4.3. Cross-patient analysis with neural crest and progenitor cell epigenetic data
H3K27Ac is known as an enhancer marker, and CTCF is known as a mediator between enhancers and promoters. Enhancers are crucial modulators of gene expression in GSCs as well as in other cell types such as neural crest and progenitor cells. Thus, we hypothesized that H3K27Ac and CTCF would be critical markers for predicting gene expression in GSCs as well as in NCCs and neural progenitor cells. Furthermore, we hypothesized that if we tested our GSC-trained model with neural crest or progenitor cells, we would observe similar PCC values. To test this hypothesis, we used publicly available H3K27Ac and CTCF ChIP-seq data of neural crest and progenitor cells as test data and compared the PCCs. When testing the trained model on the GSC patient datasets, we curated each dataset to retain its H3K27Ac and CTCF values while setting the ATAC and RNAPII values to 0 (the mean of the standardizations) across all genes, and then compared the PCCs. Our cross-cell analysis (Fig 6A and 6B) showed that testing the model with GSCs gave the highest PCC (0.780–0.794), followed by NCCs (0.649–0.672) and neural progenitor cells (0.549–0.567). This indicates that, computationally, the H3K27Ac/CTCF-related epigenetic landscape is most similar between the two GSC datasets, followed by GSCs and NCCs, and then GSCs and neural progenitor cells.
Fig 6A & 6B.
Our cross-patient analysis was extended to include neural crest and neural progenitor cell data compiled from features available on ENCODE. In each experiment, the test set included only the H3K27Ac and CTCF markers. A) The model was trained with GSC1. From the prediction performance it was inferred that the GSCs were most similar to each other. Interestingly, the plot indicates that the NCC epigenetic data is more similar to GSC1 than the neural progenitor cell data. B) When the model was trained with GSC2, we saw the same trend where the GSCs were most similar followed by the NCC data. Both plots’ values are the mean of 10 experiments and the standard deviation indicated by the error bars.
To further investigate this similarity in epigenetic landscape, we looked specifically at the genes predicted with higher accuracy, as these genes contribute significantly to training the model. Our XGBoost-trained model allowed us to rank genes based on prediction accuracy and to identify a group of genes that contributes to training the model. Thus, we ranked the genes based on performance, defined as the mean squared error between observed and predicted values over 10 runs. We focused only on expressed genes because H3K27Ac and CTCF are known to correlate positively with gene expression, so low-expressed genes would not be influenced by the H3K27Ac and CTCF signals around their TSS (11,29). Looking at the gene rankings for the GSCs, neural crest cells, and neural progenitor cells, we set 4500 as the threshold separating “accurately predicted” genes from the rest. Our PCC result indicates that GSCs are more similar to NCCs than to neural progenitor cells in terms of the H3K27Ac- and CTCF-based epigenetic landscape. Given this result, we decided to focus on GSCs and NCCs.
To assess the epigenetic landscape of GSC1, GSC2, and NCCs, we took the intersection of the “accurately predicted” genes of GSC1, GSC2, and NCCs and identified 544 shared genes. We then examined the aggregated H3K27Ac and CTCF signals of these 544 genes and compared them to the aggregated signals of randomly selected genes (Fig 7A and 7B). For H3K27Ac, we observed a shift in the peak distribution for the “accurately predicted” genes of NCCs compared to the randomly selected genes, suggesting that H3K27Ac signals contribute to accurately predicting gene expression of NCCs (Fig 7A). Given this result and the contribution of H3K27Ac to predicting GSC gene expression, we hypothesized that the intersection peaks for NCCs would show good concordance with the GSC peaks, and indeed they did. Meanwhile, as expected given the epigenetic disparities between GSCs and NCCs, we observed a small difference in standardized counts just before the TSS (e.g., around bins 22–23) between GSCs and NCCs (Fig 7B). Regarding CTCF, we observed a comparable peak distribution between “accurately predicted” genes and randomly selected genes. This indicates that the CTCF signal of “accurately predicted” genes does not contribute significantly to predicting gene expression. Moreover, the contributions of H3K27Ac and CTCF for “accurately predicted” genes align with the observed difference in their predictive capabilities for gene expression. Overall, this result suggests that H3K27Ac contributes to accurately predicting gene expression of NCCs as it does for GSCs, whereas CTCF appears to be less impactful for prediction. Additionally, the resemblance of the H3K27Ac peak distribution between GSCs and NCCs underscores epigenetic similarities in the enhancer landscape when it comes to predicting gene expression.
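The ranking and intersection steps can be sketched as below. The sketch assumes the threshold acts as a rank cutoff on per-gene squared error averaged over runs (one plausible reading of the 4500 threshold); the expressed-gene filter is omitted because the source does not give its cutoff, and all names and toy values are hypothetical.

```python
import numpy as np

def top_k_genes(gene_ids, y_true, y_pred_runs, k):
    """Return the k gene IDs with the lowest squared error averaged over runs.

    y_pred_runs has shape (runs, genes); y_true has shape (genes,).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred_runs = np.asarray(y_pred_runs, dtype=float)
    err = ((y_pred_runs - y_true[None, :]) ** 2).mean(axis=0)
    order = np.argsort(err, kind="stable")
    return set(np.asarray(gene_ids)[order[:k]].tolist())

def shared_genes(*gene_sets):
    """Intersection of the accurately predicted gene sets of several test sets."""
    return set.intersection(*gene_sets)

# Toy example with four genes and two runs (hypothetical values).
gene_ids = ["A", "B", "C", "D"]
y_true = np.array([1.0, 2.0, 3.0, 4.0])
offsets = np.array([0.0, 0.5, 2.0, 0.1])
runs = np.tile(y_true + offsets, (2, 1))

best_here = top_k_genes(gene_ids, y_true, runs, k=2)    # lowest-error genes
best_elsewhere = {"A", "B"}                             # from another test set
common = shared_genes(best_here, best_elsewhere)
```

In the study the same idea is applied with three test sets (GSC1, GSC2, and NCCs), and the aggregated signal is then the per-bin mean of the standardized counts over the shared genes.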
Fig 7A & 7B. Epigenetic signal analysis.
We performed an analysis of H3K27Ac signals (standardized counts per bin), to investigate similarity/dis-similarity between GSCs and NCCs. A) Each line represents the mean of the standardized count values for 544 common genes identified by the intersection of the model’s lowest error per gene, on the respective test set. The plot shows the bi-modal nature of the H3K27Ac signal with corresponding peaks at the bin 27 within the TSS region. B) The heatmap visualizes the Euclidean distance between pairs of dataset mean values in the highlighted region. We identified similarity when comparing GSC1 and GSC2 indicated by the relatively small distance between their mean values. The same observation of similarity was made at bin 27 where the Euclidean distance is also relatively small between the GSCs and the NCC datasets. The observation is contrasted by the relatively larger difference in the signals at bin 27 and 28 between the GSCs and a group of corresponding 544 genes within the NCCs that were randomly chosen from outside of the lowest error rate gene population.
5. Discussion
Origin and maintenance of GSC plasticity are regulated by endogenous cell processes affecting DNA, chromatin, and RNA, as well as by factors of the microenvironment that help propagate cancer stem cell phenotypes. Epigenetic pressure contributes to the ability of GSCs to remain plastic and allows GSCs and differentiated progenies to adopt a population equilibrium that facilitates tumor persistence (30–32).
To understand non-linearity of epigenetic mechanisms that drive gene expression, machine-learning has been recently employed as demonstrated in AttentiveChrome and GraphReg (16). Here, we apply machine learning to a combination of epigenetic bulk NGS data to discover epigenetic marks that can predict gene expression across patients with glioblastoma. To explore the developmental origin of cellular plasticity in GSCs, we successfully developed machine-learning models that utilize cross-patient prediction to investigate the epigenetic regulation of gene expression in heterogeneous GSCs, NCCs, and NPCs.
Our model comparison shows XGBR architecture to be the best performing model. XGBR is known as a strong architecture for processing tabular data and our datasets consisted of various types of epigenetic modulators arranged as a series of tables (33). Our cross-patient perturbation analysis using the XGBR model indicates that H3K27Ac signal contributes more to prediction of gene expression compared to RNAPII, broad chromatin accessibility, and chromatin looping across patients. Our perturbation analysis shows that the sum of the drop in PCC by each epigenetic modulator is less than PCC without perturbation. This suggests that other potential contributors to gene expression prediction may exist, such as other histone modification marks (e.g., H3K9me3, H3K27me3, etc.) and modulators of distal chromatin looping (e.g., Cohesin).
Recently, application of a neural network projection on the developmental trajectory of normal brain cells uncovered that glioblastoma cells share features of common lineage with perivascular neural crest and radial glial cells (34). Here we show that patient-derived GSCs exhibit common patterns of H3K27Ac marks with NCCs. Remarkably, H3K27Ac marks of human NCCs can predict gene expression of GSCs from different patients with glioblastoma. Since these are bulk NGS data, they suggest a conserved enhancer landscape between certain subpopulations of GSCs and NCCs. In the future, it will be important to define the specific subpopulation of GSCs that shares gene regulatory networks with NCCs to determine conserved epigenetic traits of cellular plasticity between glioblastoma and the developing nervous system.
6. Conclusion
We built a cross-patient prediction framework that provides insight into how multiple epigenetic marks contribute to predicting gene expression in cancer stem cells and in other cells with stem-cell-like plasticity. Applying this framework to GSCs, we found that H3K27Ac signals contribute most to predicting gene expression, followed by RNAPII, ATAC-seq, and CTCF, and that this ranking is preserved across patients. Furthermore, we applied the framework to neural progenitor cells and neural crest cells and found that the H3K27Ac/CTCF-related epigenetic landscape is similar across GSCs, neural progenitor cells, and neural crest cells. Overall, we present a cross-patient gene expression prediction framework that can be used to gain insight into epigenetically driven gene expression mechanisms and into the epigenetic landscape of cellular plasticity across cancer stem cells and multiple cell types.
Acknowledgments
For the neural crest and progenitor cell data, we acknowledge the ENCODE Consortium and the Bradley Bernstein lab, from which the data originated. This research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University. We are grateful to the members of the COBRE-CBHD Computational Biology Core at Brown University and to Eduardo Fajardo at Albert Einstein College of Medicine for their support. Y.S. was supported by a Honjo International Foundation Fellowship. N.T. gratefully acknowledges support from the Warren Alpert Foundation. Effort for R.S. and H.B. was funded by NIH award 1R35HG011939-01.
7. Abbreviations
- GSC1
Glioblastoma Stem Cells 1. Nomenclature to indicate patient 1
- GSC2
Glioblastoma Stem Cells 2. Nomenclature to indicate patient 2
- GSC1→GSC2
This denotes the experimental setup whereby the model is trained on data derived from patient GSC1 (patient 1) while the test set is composed of data from patient GSC2 (patient 2)
- GSC2→GSC1
Indicates that the training dataset is derived from patient GSC2 and the prediction direction is toward the test set of GSC1
- NCC
Neural Crest Cell
- NPC
Neural Progenitor Cell
- PCC
Pearson Correlation Coefficient metric
- SCC
Spearman Correlation Coefficient metric
- SD
Standard deviation
Footnotes
Code availability
The study’s code is located at https://github.com/rsinghlab/ML_epigenetic_features_glioblastoma.
References
- 1.Lathia JD, Mack SC, Mulkearns-Hubert EE, Valentim CL, Rich JN. Cancer stem cells in glioblastoma. Genes Dev. 2015. Jun 15;29(12):1203–17.
- 2.Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014. Jun 20;344(6190):1396–401.
- 3.Ignatova TN, Kukekov VG, Laywell ED, Suslov ON, Vrionis FD, Steindler DA. Human cortical glial tumors contain neural stem-like cells expressing astroglial and neuronal markers in vitro. Glia. 2002. Sep;39(3):193–206.
- 4.Liu G, Yuan X, Zeng Z, Tunici P, Ng H, Abdulkadir IR, et al. Analysis of gene expression and chemoresistance of CD133+ cancer stem cells in glioblastoma. Mol Cancer. 2006. Dec 2;5(1):67.
- 5.Singh SK, Hawkins C, Clarke ID, Squire JA, Bayani J, Hide T, et al. Identification of human brain tumour initiating cells. Nature. 2004. Nov 18;432(7015):396–401.
- 6.Wang S, Zang C, Xiao T, Fan J, Mei S, Qin Q, et al. Modeling cis-regulation with a compendium of genome-wide histone H3K27ac profiles. Genome Res. 2016. Oct;26(10):1417–29.
- 7.Cheng C, Yan KK, Yip KY, Rozowsky J, Alexander R, Shou C, et al. A statistical framework for modeling gene expression using chromatin features and application to modENCODE datasets. Genome Biol. 2011;12(2):R15.
- 8.Schmidt F, Kern F, Schulz MH. Integrative prediction of gene expression with chromatin accessibility and conformation data. Epigenetics Chromatin. 2020. Feb 6;13(1):4.
- 9.Ouyang Z, Zhou Q, Wong WH. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc Natl Acad Sci U S A. 2009. Dec 22;106(51):21521–6.
- 10.Karlić R, Chung HR, Lasserre J, Vlahovicek K, Vingron M. Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A. 2010. Feb 16;107(7):2926–31.
- 11.Kubo N, Ishii H, Xiong X, Bianco S, Meitinger F, Hu R, et al. Promoter-proximal CTCF binding promotes distal enhancer-dependent gene activation. Nat Struct Mol Biol. 2021. Feb;28(2):152–61.
- 12.Liu Y, Wu Z, Zhou J, Ramadurai DKA, Mortenson KL, Aguilera-Jimenez E, et al. A predominant enhancer co-amplified with the SOX2 oncogene is necessary and sufficient for its expression in squamous cancer. Nat Commun. 2021. Dec 8;12(1):7139.
- 13.Singh R, Lanchantin J, Sekhon A, Qi Y. Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin. Adv Neural Inf Process Syst. 2017. Dec;30:6785–95.
- 14.Chen Y, Xie M, Wen J. Predicting gene expression from histone modifications with self-attention based neural networks and transfer learning. Front Genet. 2022;13:1081842.
- 15.Lee D, Yang J, Kim S. Learning the histone codes with large genomic windows and three-dimensional chromatin interactions using transformer. Nat Commun. 2022. Nov 5;13(1):6678.
- 16.Karbalayghareh A, Sahin M, Leslie CS. Chromatin interaction-aware gene regulatory modeling with graph attention networks. Genome Res. 2022. May;32(5):930–44.
- 17.Bigness J, Loinaz X, Patel S, Larschan E, Singh R. Integrating Long-Range Regulatory Interactions to Predict Gene Expression Using Graph Convolutional Networks. J Comput Biol. 2022. May;29(5):409–24.
- 18.Massa AT, Mousel MR, Herndon MK, Herndon DR, Murdoch BM, White SN. Genome-Wide Histone Modifications and CTCF Enrichment Predict Gene Expression in Sheep Macrophages. Front Genet. 2020;11:612031.
- 19.Read DF, Cook K, Lu YY, Le Roch KG, Noble WS. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS Comput Biol. 2019. Sep;15(9):e1007329.
- 20.ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012. Sep 6;489(7414):57–74.
- 21.Luo Y, Hitz BC, Gabdank I, Hilton JA, Kagda MS, Lam B, et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020. Jan 8;48(D1):D882–9.
- 22.Hitz B, Kagda M, Lam B, Litton C, Small C, Sloan C, et al. Data navigation on the ENCODE Portal [Internet]. 2023. [cited 2024 Jun 23]. Available from: https://www.researchsquare.com/article/rs-3088639/v1
- 23.Hitz BC, Lee JW, Jolanki O, Kagda MS, Graham K, Sud P, et al. The ENCODE Uniform Analysis Pipelines.
- 24.Epigenome-based splicing prediction using a recurrent neural network - PMC [Internet]. [cited 2024 Jun 23]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7343189/
- 25.Zhang J, Lee D, Dhiman V, Jiang P, Xu J, McGillivray P, et al. An integrative ENCODE resource for cancer genomics. Nat Commun. 2020. Jul 29;11(1):3696.
- 26.Meuleman W, Muratov A, Rynes E, Halow J, Lee K, Bates D, et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature. 2020;584(7820):244–51.
- 27.Zwiener I, Frisch B, Binder H. Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures. Emmert-Streib F, editor. PLoS ONE. 2014. Jan 8;9(1):e85150.
- 28.Kapoor S, Narayanan A. Leakage and the Reproducibility Crisis in ML-based Science [Internet]. arXiv; 2022. [cited 2023 Apr 9]. Available from: http://arxiv.org/abs/2207.07048
- 29.Mack SC, Singh I, Wang X, Hirsch R, Wu Q, Villagomez R, et al. Chromatin landscapes reveal developmentally encoded transcriptional states that define human glioblastoma. J Exp Med. 2019. May 6;216(5):1071–90.
- 30.Carén H, Stricker SH, Bulstrode H, Gagrica S, Johnstone E, Bartlett TE, et al. Glioblastoma Stem Cells Respond to Differentiation Cues but Fail to Undergo Commitment and Terminal Cell-Cycle Arrest. Stem Cell Rep. 2015. Nov 10;5(5):829–42.
- 31.Dirkse A, Golebiewska A, Buder T, Nazarov PV, Muller A, Poovathingal S, et al. Stem cell-associated heterogeneity in Glioblastoma results from intrinsic tumor plasticity shaped by the microenvironment. Nat Commun. 2019. Apr 16;10(1):1787.
- 32.Auffinger B, Tobias AL, Han Y, Lee G, Guo D, Dey M, et al. Conversion of differentiated cancer cells into cancer stem-like cells in a glioblastoma model after primary chemotherapy. Cell Death Differ. 2014. Jul;21(7):1119–31.
- 33.Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on tabular data? [Internet]. arXiv; 2022. [cited 2023 Mar 29]. Available from: http://arxiv.org/abs/2207.08815
- 34.Hu Y, Jiang Y, Behnan J, Ribeiro MM, Kalantzi C, Zhang MD, et al. Neural network learning defines glioblastoma features to be of neural crest perivascular or radial glia lineages. Sci Adv. 8(23):eabm6340.
- 35.Genome-wide analysis of polymerase III–transcribed Alu elements suggests cell-type–specific enhancer function - PMC [Internet]. [cited 2024 Jun 23]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6724667/
- 36.Random Forests(TM) in XGBoost — xgboost 2.0.3 documentation [Internet]. [cited 2024 Jun 10]. Available from: https://xgboost.readthedocs.io/en/stable/tutorials/rf.html
- 37.XGBoost Parameters — xgboost 2.0.3 documentation [Internet]. [cited 2024 Jun 10]. Available from: https://xgboost.readthedocs.io/en/stable/parameter.html