Author manuscript; available in PMC: 2022 Dec 1.
Published in final edited form as: Biochim Biophys Acta Rev Cancer. 2021 Jul 7;1876(2):188588. doi: 10.1016/j.bbcan.2021.188588

Machine Learning in Epigenomics: Insights into Cancer Biology and Medicine

Emre Arslan 1,#, Jonathan Schulz 1,#, Kunal Rai 1,#
PMCID: PMC8595561  NIHMSID: NIHMS1725057  PMID: 34245839

Abstract

The recent deluge of genome-wide technologies for mapping the epigenome, and the resulting data from cancer samples, has provided an opportunity to gain insights into the roles of epigenetic processes in cancer. However, the complexity, high dimensionality, sparsity, and noise associated with these data pose challenges for extensive integrative analyses. Machine Learning (ML) algorithms are particularly suited for epigenomic data analyses due to their flexibility and ability to learn underlying hidden structures. We discuss four overlapping but distinct major categories under ML: dimensionality reduction, unsupervised methods, supervised methods, and deep learning (DL). We review the preferred use cases of these algorithms in analyses of cancer epigenomics data in the hope of providing an overview of how ML approaches can be used to explore fundamental questions on the roles of the epigenome in cancer biology and medicine.

Background

The epigenome consists of a diverse repertoire of covalent modifications of histones and nucleic acids that cooperatively regulate chromatin structure and gene expression. Epigenetic modifications are reversible and dynamically regulated, initially added and subsequently removed by specialized chromatin-modifying enzymes known as epigenetic ‘writers’ and ‘erasers’, respectively1,2. Epigenetic mechanisms, acting in conjunction with transcription factors, play a critical role in orchestrating the transcriptional changes associated with differentiation of a variety of cell types during normal development3–8. Specifically, they allow signal transduction cascades acting through common transcription factors to drive cell type-specific transcriptional responses, and they provide a mechanism for the heritable maintenance of cell type-specific gene expression after inciting signals have dissipated. While the contributions of specific epigenetic elements in isolation have been studied extensively, how multiple epigenetic elements together, and in conjunction with other features such as DNA sequence, precisely control gene expression patterns is not completely understood.

It is becoming increasingly evident that the epigenome plays a major role in the etiology of various diseases, including cancer, diabetes, and autoimmune disorders. Epigenome aberrations occur extensively across multiple cancer types and are thought to play a major role in establishing the neoplastic cellular state in conjunction with genetic anomalies. While the precise roles of individual cancer-specific epigenetic elements are currently being investigated, an important and poorly understood element of this process is the interplay of various epigenomic elements with other modules such as metabolic, proteomic, and transcriptomic states.

Advances in next-generation sequencing, combined with classical biochemical approaches, have led to innovation in mapping techniques for many epigenetic elements. For example, genome-wide profiles of histone modifications and transcription factor binding can be mapped by ChIP-Seq9–11, or more recently CUT&Tag12 among other methods; chromatin accessibility can be determined using ATAC-Seq13 and DNaseI-Seq14. Higher-order chromatin structure is determined using methods such as Hi-C15. DNA methylation is widely determined using RRBS16, WGBS17,18, or array-based technologies. These technologies have provided massive amounts of new data in various cancer systems that can provide insights into how the epigenome may regulate the evolution of cancer cells during progression and response to therapies. Furthermore, the epigenomic landscape of cancer tissues can be used to stratify patients into groups for precision medicine. To achieve these objectives, we need to implement automated decision systems that can help generate novel approaches for prevention, diagnosis, and treatment.

Overview

Machine Learning (ML) enables computers to learn from data without being explicitly programmed and to make accurate predictions. ML models have created unprecedented momentum in different domains, including epigenomics studies. In this review, we categorize traditional ML frameworks based on the principal focus of ML applications in epigenomics. We review several conventional supervised and unsupervised learning methods that have been used for the examination of epigenomics data.

Machine learning applications have played crucial roles in addressing multiple questions related to the basic biology of epigenetic elements, their role in gene regulation, and the utility of the epigenome in cancer diagnostics and treatment. We present specific studies from the literature where conventional ML tools have been used to address these fundamentally important biological questions. Figure 1 presents the overall structure of this review. The paper is organized as follows. The Dimensionality Reduction section presents an overview of several feature extraction and feature selection algorithms used in epigenomics studies. We then review major supervised learning and unsupervised learning algorithms and their application in the epigenomics field. Next, several deep learning (DL) methods in epigenomics analysis are presented in detail. Even though all deep learning models follow a supervised or unsupervised learning strategy, we separated the deep learning section to discuss many methods in detail. Finally, we present a brief discussion on several challenges in epigenomics analysis and close with a focused discussion of future directions in the field.

Figure 1. Overview of the Machine Learning in Epigenomics.

Machine Learning methods can be used for analyses of datasets defining maps of higher-order chromatin structure, chromatin accessibility, DNA Methylation, and histone modification in various cancer systems to address fundamentally important biological questions.

Dimensionality Reduction

The dimension of a dataset is the number of features (also called attributes or variables in the ML literature). If the dimension p is much larger than the number of samples N (p >> N), it is called high-dimensional data19. The sizeable gap between the number of features and the observation size in high-dimensional epigenomic datasets leads to the curse of dimensionality20, where several fundamental problems arise such as increased computational complexity, multicollinearity21, and decrease in expected predictive power (Hughes phenomenon)22. Applying a dimensionality reduction algorithm before training a classification or clustering method is crucial to identify relevant epigenomic signatures and avoid overfitting. Dimensionality reduction methods are also valuable to visually inspect the data and check the distribution of samples/cells based on their epigenomic background. There are two main approaches for dimensionality reduction: feature extraction and feature selection.

Feature Extraction (FE) methods transform the original high-dimensional data to a lower dimensionality using linear or non-linear operations. The primary goal of an FE model is to preserve as much information as possible with fewer variables. Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP) are some of the many FE methods used in epigenomics analysis. PCA is a linear transformation based on an eigenvector search23,24. Principal components (PCs) are directions that capture the maximum variance in the original data and are orthogonal to each other. PCA is a classical technique for exploratory data analysis (EDA) as its probabilistic formulation is computationally efficient, especially for very high-dimensional datasets like epigenomic datasets25. While there are more than ten variants of the original PCA applied in different domains26–36, only a couple are implemented in epigenomic analysis37,38. The major limitation of PCA is its inability to capture non-linear relationships in the original data. t-SNE is a powerful FE method to capture local and non-linear patterns from the original high-dimensional data. It extends the Stochastic Neighbor Embedding (SNE) method39, which maps objects represented by high-dimensional vectors to a low-dimensional space while preserving neighbor identities, by introducing a symmetrized version of the cost function and a Student-t distribution to measure similarities between observation pairs40. UMAP uncovers a low-dimensional graph representation of the original data41 and is widely used to visualize high-dimensional genomics and epigenomics datasets. UMAP’s solid theoretical framework, built on manifold theory and topological data analysis, serves as a fast FE method that captures non-linear structure in the high-dimensional representation41.
Several studies comparing t-SNE and UMAP have concluded that UMAP is faster42 and scales better than t-SNE41, yet a more recent study shows no difference between UMAP and t-SNE in preserving the global structure of high-dimensional data and finds that the FIt-SNE implementation is as fast as UMAP. An issue for t-SNE is choosing the right perplexity43–46, as different perplexities often produce different outcomes47.
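As a minimal sketch of the eigenvector search behind PCA, the following pure-Python example centers a toy matrix, builds its covariance matrix, and finds the first principal component by power iteration. The data values (five samples by three hypothetical CpG sites), the function names, and the use of power iteration in place of a full eigendecomposition are all assumptions made for illustration; they are not taken from any of the cited studies.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every sample."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (features x features) of centered data."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize.

    For a covariance matrix this converges to the eigenvector with the
    largest eigenvalue, i.e. the direction of maximum variance.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical beta values: sites 1 and 2 vary together, site 3 is nearly constant.
samples = [
    [0.10, 0.20, 0.50],
    [0.30, 0.40, 0.52],
    [0.50, 0.60, 0.49],
    [0.70, 0.80, 0.51],
    [0.90, 1.00, 0.50],
]
centered = center(samples)
pc1 = first_principal_component(covariance(centered))
# Project each sample onto PC1 to obtain a one-dimensional representation.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

In this toy case the first component loads almost equally on the two co-varying sites and almost not at all on the near-constant one, so three features are summarized by a single score per sample.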

Feature selection (FS) is a straightforward approach to reduce the number of features before any ML step. FS algorithms select a few features from the entire feature space and discard the rest. Choosing the right FS method can improve prediction performance, generalizability, and scalability. Many FS methods exist to efficiently reduce the number of input features, but we will focus on the most well-known and widely used ones. Correlation-based Feature Selection (CFS) ranks features according to their correlation with the class label and chooses the most correlated features, assuming that features with a low correlation with the label are irrelevant48. Information Gain (IG) evaluates the relevance of each feature to a corresponding class by mutual information. Higher mutual information indicates a higher relevance of the variable to the label49. IG scores each feature on its own and does not depend on multivariate analysis, so it cannot identify redundant features. Gain Ratio (GR) extends IG by normalizing with the entropy of the variable (intrinsic information), which reduces the symmetric measurement bias50. ReliefF51 is a descendant of the original Relief algorithm52; the F denotes the sixth variation (A-F) of the original algorithm. It weights each feature according to how well its values agree between an instance and its nearest same-class neighbors and differ between that instance and its nearest neighbors from other classes. ReliefF chooses a random sample, finds its k nearest samples from the same class (Near Hits) and its k nearest samples from other classes (Near Misses), and then favors attributes that are preserved within the same class and differ between distinct classes51. ReliefF also extends the original idea to handle missing data and multi-class labels53.
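To make the IG and GR definitions concrete, the sketch below scores three hypothetical binarized CpG features against tumor/normal labels. The feature names, the toy values, and the helper functions are assumptions introduced for illustration only.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """Mutual information between a discrete feature and the class labels:
    H(labels) minus the entropy of the labels within each feature value."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    intrinsic = entropy(feature)
    return information_gain(feature, labels) / intrinsic if intrinsic > 0 else 0.0

# Hypothetical data: 1 = methylated, 0 = unmethylated, for eight samples.
labels = ["tumor"] * 4 + ["normal"] * 4
features = {
    "cpg_a": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the classes perfectly
    "cpg_b": [1, 0, 1, 0, 1, 0, 1, 0],  # carries no class information
    "cpg_c": [1, 1, 1, 0, 0, 0, 0, 1],  # partially informative
}
ig_scores = {name: information_gain(col, labels) for name, col in features.items()}
ranking = sorted(ig_scores, key=ig_scores.get, reverse=True)
```

A filter of this kind would keep the top-ranked features (here `cpg_a`, then `cpg_c`) and discard the rest before a classifier is trained.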

A few recent FE or FS methods have focused on addressing the high dimensionality of epigenomic datasets. For example, Alkuhlani et al.54 introduce a multistage FS strategy to identify the optimal CpG sites for breast, colon, and lung cancer datasets. Han et al.55 propose a dynamic Recursive Feature Elimination (dRFE) based FS method. In general, deep learning-based pipelines in epigenomics skip the FE or FS step, but applying an FS method before a deep neural network model can significantly improve performance56.

A dimensionality reduction approach is a must for all epigenomic data analysis pipelines. An FS method selects the most relevant epigenomic features, those that are the most discriminative for a particular experimental setup. An FE method helps to project the epigenomic data from hundreds of thousands of features to several hundred while preserving most of the information, and to visualize the high-dimensional epigenomic data in 2D or 3D. Detailed mathematical explanations of feature selection and feature extraction methods are beyond the scope of this paper, but comprehensive FE and FS reviews are available57–59.

Supervised Learning

Supervised Learning is a widely used ML strategy to solve various real-world problems60–62. The existence of labels associated with the data is the fundamental requirement for a supervised learning setup. Supervised Learning algorithms search for linear or non-linear patterns in data-label pairs and estimate the output for newly presented input data. They help answer several fundamental biological questions such as diagnosing cancer63,64, predicting the stage of cancer65, understanding factors affecting aging66, and reporting oncogenes and tumor suppressor genes67. There are two subcategories of this branch of ML: Classification and Regression. Classification tasks predict categorical labels (e.g., estimating cancer stage via ATAC-Seq), while numerical outputs are estimated by regression tasks (e.g., evaluating gene expression values using histone modification marks). Support Vector Machine (SVM), Decision Trees, Random Forest, Naïve Bayes (NB), and Logistic Regression are some of the many off-the-shelf supervised learning algorithms. SVM relies on finding the hyperplane that separates two classes of input data (binary classification)68. The hyperplane serves as the decision boundary, which needs to be equally distant from the closest points of the two categories. The term ‘Support Vectors’ refers to the samples from each category/label that are closest to the hyperplane69. Decision Trees (DTs) are flexible and easy-to-use classification methods. The algorithm builds a tree, splitting on feature values in each layer, and repeats this partitioning until each leaf node contains observations of a single class70. A large number of relatively uncorrelated DTs make up the Random Forest (RF) classifier71, which uses the wisdom of the crowd. The ensemble structure of the RF reduces the contribution of individual DT errors.
Naïve Bayes (NB) is another widely used supervised learning method that is fast and easy to interpret, and it depends on conditional probabilities from Bayes' Theorem72. NB’s principal assumption is the independence of all attributes from each other, which is unrealistic in many real-world applications. Logistic Regression (LR) models the relationship between input data and labels using a logistic function, which has an S-like shape and maps any real-valued number to a value between 0 and 1. LR predicts a dichotomous output by applying a threshold to this range. Linear Regression takes features (explanatory variables) as input and estimates a continuous output (dependent variable) by fitting a linear equation73. It is fast and easy to interpret, as the estimated weights reveal the importance of each feature for estimating the output. Despite its popularity in data analysis, Linear Regression is prone to over-fitting and is sensitive to outliers and multicollinearity73. Lasso Regression74 and Ridge Regression75 mitigate these problems and reduce the complexity of the model. Lasso Regression performs L1 regularization, which adds the sum of the absolute values of the estimated coefficients as the penalty term to the sum-of-squared-errors objective function, whereas Ridge Regression uses L2 regularization, where the penalty term is the sum of the squared coefficients.
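The logistic function and the two penalized objectives can be written out explicitly. The notation below is an assumption introduced for illustration and does not appear in the original text: y_i is the response for sample i, x_i its feature vector, β the coefficient vector over p features, N the number of samples, and λ ≥ 0 the regularization strength.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}

\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

Because the L1 term can shrink individual coefficients exactly to zero, lasso also acts as a feature selector, whereas the L2 term shrinks coefficients toward zero but generally keeps all of them in the model.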

Diagnoses of most cancer types are based on biopsies, computerized imaging scans, or even autopsies. While highly effective, these approaches sometimes lead to diagnostic errors. For example, pathological examination of needle biopsies to diagnose prostate cancer has high false-negative rates63. Therefore, exploiting the molecular features of tumors could be useful in such cases. For this, cost-efficient and time-efficient diagnostic tools are crucial. The vast amounts of epigenomic data generated from cancer cells are great candidates for a supervised learning framework, as each observation has a label corresponding to its epigenomic features. We can potentially find epi-signatures for those cancer types by training classifiers. It has been shown that, using genome-wide DNA methylation changes as input, an L1-regularized logistic regression model can differentiate tumors from benign samples with high sensitivity and specificity63. A striking example of the utility of cancer epigenomes to classify subtypes of tumors was shown in Capper et al.64, where DNA methylation was used as input to train an RF classifier that predicted 100 known tumor types across all entities and various age groups. Large cohorts in autopsy studies show that up to 26% of cancer patients develop Brain Metastases (BM)76, which emphasizes the importance of an accurate diagnosis before an optimal treatment plan. The BrainMETH77 algorithm was developed for the accurate diagnosis of BM as a three-step RF-based classifier that uses DNA methylation as the input. Class A (first step) distinguishes primary from metastatic brain tumors, Class B (second step) determines the tissue of origin, and Class C (third step) characterizes breast cancer BM (BCBM) subtypes. High prediction performance over a large cohort (n = 165) has made BrainMETH a reliable diagnostic tool.
In another study, Kaur et al.65 used SVM, RF, NB, and DTs as classifiers and DNA methylation as input for the clinical staging prediction of Liver Hepatocellular Carcinoma (LIHC). Finally, Deng et al.78 combined SVM-RFE and LASSO to predict a higher risk of progression for colorectal cancer (CRC) patients and listed 222 candidate epigenetic CpGs related to CRC progression.

It is well known that several potent tumor suppressors are silenced by DNA methylation1,2,79–84, whereas enhancers activate oncogenes85–89. Therefore, DNA methylation, histone modification marks, chromatin openness/closeness, and 3D chromatin interaction datasets can be highly useful in identifying oncogenes or tumor suppressor genes90–92. The DORGE (Discovery of Oncogenes and tumor suppressoR genes using Genetic and Epigenetic features)67 ML model, which is an elastic net–based logistic regression classifier, was proposed for this purpose. The authors show that histone modifications effectively predict tumor suppressor gene expression, while super-enhancers and methylation changes effectively predict oncogene expression. Li et al. use five conventional supervised learning classifiers to find differentially expressed genes (DEGs) via CpG methylation, histone modification marks (H3K4me3, H3K27me3, and H3K36me3), and genomic nucleotide sequence datasets for lung cancer samples93. The study shows the predictive power of histone modifications and CpG methylation for DEG estimation. Dixon et al.94 use chromatin state features and the RF classifier to predict changes in interaction frequency. Based on their feature ranking, H3K4me1 has the most predictive power for estimating long-range chromatin interactions. Their integrative analysis, which combines Hi-C, histone modifications, and CTCF binding data, reveals dynamic chromatin structure reorganization during embryonic stem cell differentiation.

Enhancers are shown to be activated in multiple cancers. However, identification of enhancers is not a straightforward task, as enhancers can be several hundred kilobases or megabases away from the gene body, or even on a different chromosome95. Sethi and Gu96 developed a framework that uses a linear SVM as the classifier. The model takes self-transcribing active regulatory region sequencing (STARR-Seq), DNase-I hypersensitive sites (DHS), and H3K27ac as input in the first step (matched-filter scores step) and then integrates the matched-filter scores with H3K27ac, H3K4me1, H3K4me2, H3K4me3, and H3K9ac histone modification marks in Drosophila. Their findings on the test set confirmed the high enhancer prediction performance of this model. The authors also show that their framework can predict enhancers in mammals with no re-parametrization. TargetFinder97, which ensembles several classifiers, integrates DNA methylation, ChIP-Seq, DNase-Seq, and Cap Analysis of Gene Expression (CAGE) to estimate promoter-enhancer contacts.

Supervised learning methods can help to predict missing epigenomic signals from other epigenomic datasets. Fu et al.98 use logistic regression to understand the relationship between DNA methylation and histone modifications. They show the importance of five core histone marks (H3K4me1, H3K4me3, H3K27me3, H3K36me3, and H3K9me3) in explaining the variance observed across methylomes. 3DEpiLoop99 takes H3K4me1, H3K27me3, H3K36me3, CTCF, and RAD21 as input and predicts chromatin looping interactions within topologically associating domains (TADs). The authors applied several classifiers such as AdaBoost classification trees, Neural Networks, SVM, and Stochastic Gradient Boosting and chose an RF classifier for the model. ChromImpute100 uses an ensemble of regression trees to impute a genome-wide epigenomic signal at 25-bp resolution. The authors show that motif analysis of these divergent locations agrees with cell type–specific regulatory mechanisms. PREDICTD (PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition) is a tensor factorization-based model to impute epigenomic data101. It is a computationally efficient method and is one of the few methods102,103 that use tensor decomposition in epigenomics analysis.

Unsupervised Learning

Unsupervised ML methods assign samples/observations to several clusters (groups) based on the similarity of their feature sets104. The power of unsupervised learning is that it can discover underlying biology in an unbiased manner, as it does not need labeled data. Unsupervised ML algorithms can reveal novel biological findings, such as previously unrecognized molecular subgroups of patients within a histologically defined cancer type.

We will discuss traditional unsupervised learning methods and novel clustering algorithms proposed for epigenomics datasets. Hierarchical Clustering (HC) groups samples through the similarity matrix and recursively clusters a pair of observations at a time. Most of the HC methods used in epigenomics are agglomerative HC models. In the beginning, each observation starts in its own cluster, and the closest clusters are then merged step by step until a single cluster containing all observations remains. Wiwie et al.105 show that hierarchical clustering is among the best-performing of 13 well-known clustering methods evaluated on 24 biomedical datasets. A significant drawback of HC is that the algorithm needs a cut-off value to define the number of final clusters106. Lin and Chen et al.107 applied HC to DNA methylation data and found tumor-specific hypermethylated clusters for breast cancer cell lines. Virmani et al.108 analyze the methylation status of 89 cell lines. The dendrogram of the HC has two major groups that are consistent with small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) cell lines. Lin and Wang109 integrated gene promoter methylation and gene expression profiles to find candidate genes in NSCLC. Hierarchical clustering of 578 candidate genes agreed with the Epithelial-to-Mesenchymal Transition (EMT) phenotype segregation of cell lines.
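The agglomerative procedure can be sketched in a few lines of pure Python. The example below assumes Euclidean distance and average linkage, and it uses a target number of clusters as the cut-off; these choices, the function names, and the toy methylation profiles are illustrative assumptions rather than the settings of any cited study.

```python
def euclidean(a, b):
    """Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(points, n_clusters):
    """Start with one cluster per sample and repeatedly merge the closest
    pair until only n_clusters remain (the cut-off)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical beta values at three CpG sites: samples 0-2 are hypomethylated,
# samples 3-5 are hypermethylated.
profiles = [
    [0.10, 0.20, 0.10],
    [0.15, 0.25, 0.10],
    [0.20, 0.10, 0.15],
    [0.80, 0.90, 0.85],
    [0.85, 0.80, 0.90],
    [0.90, 0.85, 0.80],
]
clusters = agglomerative(profiles, n_clusters=2)
```

Running the loop all the way down to a single cluster while recording each merge and its distance would yield the dendrogram; stopping at `n_clusters` corresponds to cutting that dendrogram.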

k-means is a simple yet widely used clustering method. This unsupervised learning method partitions all samples into a fixed number ‘k’ of clusters. The algorithm starts with a random initialization (assigning some individuals as ‘means’), associates every observation with the closest ‘mean’, then assigns the centroid of each cluster as the new ‘mean’, and returns to the second step (associating all samples with the new ‘means’)110. This iterative procedure is repeated until the clustering is stable. k-means is sensitive to its random initialization: different initializations can lead to different final clusters111. k-means and HC methods have been extensively used by the TCGA and other genomic studies to identify patient subgroups based on their DNA methylation profiles. Examples include the classical hypermethylated CIMP phenotype in colorectal cancer112 and G-CIMP in glioblastoma113. Substantial biological insights have since been gained on these subgroups of patients, and specific therapeutic strategies are being designed114,115. Zhang et al.116 use an extension of the k-means algorithm117 to cluster TCGA ovarian cancer samples by DNA methylation, microRNA expression, and copy number alteration datasets. The authors uncover seven subtypes of ovarian cancer, which have significantly different survival rates. They also list ‘driver’ genes for each subtype that could be targeted in a treatment plan.
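The assign-then-update loop can be sketched as follows. To keep the example deterministic, the random draw of starting ‘means’ is replaced here by an explicit `initial_means` argument; that argument, the function names, and the toy profiles are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_means, max_iter=100):
    """Alternate between assigning samples to the closest mean and
    recomputing each mean as the centroid of its cluster."""
    means = [list(m) for m in initial_means]
    k = len(means)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_assignment == assignment:  # stable clustering reached
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster becomes empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

# Hypothetical beta values at three CpG sites for six samples.
profiles = [
    [0.10, 0.20, 0.10],
    [0.15, 0.25, 0.10],
    [0.20, 0.10, 0.15],
    [0.80, 0.90, 0.85],
    [0.85, 0.80, 0.90],
    [0.90, 0.85, 0.80],
]
# Start from one sample in each apparent group.
assignment, means = kmeans(profiles, initial_means=[profiles[0], profiles[3]])
```

Because the result depends on the starting means, a common practice is to run the algorithm from several random starts and keep the solution with the smallest within-cluster sum of squared distances.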

Non-negative Matrix Factorization (NMF) is a popular clustering and dimensionality reduction method in epigenomics. It was proposed to obtain a low-rank representation of a non-negative matrix118. Its first use for clustering, in document clustering, showed that NMF can be a powerful unsupervised learning method119. A year after this successful document clustering attempt, NMF was applied to biological data120, and it has since become very popular in genomics and epigenomics analyses. Mishra and Guda121 use NMF clustering on differentially methylated sites between healthy and Pancreatic Cancer samples. They identify three molecular subtypes in Pancreatic Cancer that have different enrichment of neoplasm histological grades and pathologic T-stages. Through NMF on H3K27ac data, we identified four clusters (EpiCs) of colorectal cancer samples based on their underlying enhancer patterns122. Importantly, these EpiC subtypes allowed stratification of these patients into groups that were vulnerable to specific combinations of targeted therapeutics in which one agent was an enhancer-blocking bromodomain inhibitor.
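As an illustration of how a low-rank non-negative factorization doubles as a clustering method, the sketch below factorizes a small regions-by-samples matrix V into W (regions by factors) and H (factors by samples) and assigns each sample to the factor with the largest coefficient in H. It assumes the classic multiplicative update rules for the squared-error objective; the matrix values, the starting factors, and the function names are hypothetical and not taken from the cited studies.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, w, h, iterations=200, eps=1e-9):
    """Multiplicative updates; entries stay non-negative by construction."""
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps) for j in range(len(h[0]))]
             for r in range(len(h))]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + eps) for r in range(len(w[0]))]
             for i in range(len(w))]
    return w, h

# Hypothetical signal at four regions (rows) in four samples (columns).
v = [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0],
     [1.0, 0.0, 4.0, 5.0]]
# Fixed, slightly asymmetric starting factors (rank 2) for reproducibility.
w0 = [[1.0, 0.5], [0.9, 0.6], [0.5, 1.0], [0.6, 0.9]]
h0 = [[1.0, 0.9, 0.5, 0.6], [0.5, 0.6, 1.0, 0.9]]

initial_error = squared_error(v, w0, h0)
w_fit, h_fit = nmf(v, w0, h0)
final_error = squared_error(v, w_fit, h_fit)
# Cluster label of each sample = index of its dominant factor.
labels = [max(range(len(h_fit)), key=lambda r: h_fit[r][j]) for j in range(len(v[0]))]
```

In practice the rank (number of clusters) has to be chosen, and results are usually checked across multiple starting points.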

Ernst and Kellis proposed ChromHMM123, which is based on a multivariate hidden Markov model, as an unsupervised ML method to analyze epigenomic marks. ChromHMM assigns chromatin states to the corresponding genomic locations. The ChromHMM method permeates the epigenomic unsupervised analysis literature124–128. One may think ChromHMM takes binarized epigenomic signals and predicts chromatin states as in a supervised learning scheme, yet the predictions are not ground-truth labels (cancer type/stage, age, expression of a gene, etc.) associated with the given data, but candidate-state annotations. Other similar methods are Segway129 and EpiCSeg130. Segway is a dynamic Bayesian network (DBN) that models the transformed data (inverse hyperbolic sine) with a Gaussian distribution and trains the model with the expectation-maximization (EM) algorithm. EpiCSeg130 passes read counts from several histone marks to a Hidden Markov Model (HMM), similar to ChromHMM.
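To illustrate how a hidden Markov model turns binarized marks into a state annotation, the sketch below decodes the most likely state path for a short run of genomic bins with the Viterbi algorithm. It is not ChromHMM's implementation: the two state names, the two marks, and every probability are hand-set hypothetical values, whereas a real tool would learn such parameters from the data.

```python
import math

def emission_logprob(observation, mark_probs):
    """Log-probability of a binary mark vector, treating marks as
    independent Bernoulli variables given the state."""
    return sum(math.log(p if x else 1.0 - p) for x, p in zip(observation, mark_probs))

def viterbi(observations, states, start, transition, emission):
    """Return the most probable sequence of hidden states."""
    scores = {s: math.log(start[s]) + emission_logprob(observations[0], emission[s])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: scores[r] + math.log(transition[r][s]))
            new_scores[s] = (scores[best_prev] + math.log(transition[best_prev][s])
                             + emission_logprob(obs, emission[s]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):  # trace back from the last bin
        path.append(pointers[path[-1]])
    return path[::-1]

states = ["promoter_like", "quiescent"]
start = {"promoter_like": 0.5, "quiescent": 0.5}
transition = {  # states tend to persist across neighboring bins
    "promoter_like": {"promoter_like": 0.9, "quiescent": 0.1},
    "quiescent": {"promoter_like": 0.1, "quiescent": 0.9},
}
emission = {  # probability that each of two hypothetical marks is present
    "promoter_like": [0.9, 0.8],
    "quiescent": [0.05, 0.05],
}
# One tuple per genomic bin: presence (1) or absence (0) of the two marks.
bins = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0)]
path = viterbi(bins, states, start, transition, emission)
```

The decoded labels are annotations inferred from the model itself, which is why this remains an unsupervised setting even though each bin receives a "predicted" state.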

Deep Learning

Deep Learning is a subfield of Machine Learning that uses artificial neural networks (ANN or NN) for analysis123. As a general rule, NNs are more flexible than other algorithms, can handle highly complex non-linear data, and have shown highly accurate empirical performance. However, they also require more data for training and can be more difficult to train and interpret than traditional machine learning algorithms. NNs are composed of multiple sequential layers that each consist of two components: a set of artificial neurons and a non-linear activation function. This structure allows each layer to flexibly model relatively simple non-linear data with practically no assumptions. Stacking these layers sequentially nests these simple but flexible non-linear functions to model complex non-linear data. All models make some assumptions about input data. Stricter assumptions decrease training data requirements, but possibly limit model accuracy if the data do not match those assumptions. Because of their architecture, NNs can analyze highly complex data with minimal assumptions. This makes NNs well suited to analyzing high-dimensional and highly complex epigenomic data to arbitrarily high theoretical accuracy when sufficient training data are available. Because NNs use a wide variety of techniques that are specific to the subfield of deep learning and every NN uses a slightly different combination of those techniques, in-depth descriptions will focus on a small number of papers per data type, each of which introduces one or more techniques or approaches. We will then introduce a larger variety of papers in less depth to provide a general awareness of current research.
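The layered structure described above (a set of neurons followed by a non-linear activation, stacked sequentially) can be sketched as a forward pass. The weights, biases, and input values below are arbitrary hypothetical numbers, and the sigmoid activation is one possible choice among several; training the weights is outside the scope of this sketch.

```python
import math

def sigmoid(z):
    """Non-linear activation that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """Each neuron computes a weighted sum of its inputs plus a bias,
    then applies the activation function."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical network: 3 input features -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
inputs = [0.2, 0.9, 0.4]  # e.g. three normalized epigenomic signal values
prediction = forward(inputs, layers)
```

Each intermediate activation lies between 0 and 1, which matches the intuition of a neuron reporting how strongly its learned pattern is present in its input.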

Neural Networks in Epigenomic Data Analysis

Prediction of gene expression from epigenomic data:

A number of epigenetic elements function together to exert precise control of the expression of target genes. The complex interaction of these elements makes gene expression difficult to predict, and the regulatory contribution of each epigenetic element even more difficult to interpret. Due to their ability to handle highly complex non-linear data, NNs are being increasingly used to predict gene expression from epigenetic data, with tools being developed to inspect NNs' internal functioning and interpret the contribution of individual epigenetic elements. An outline of how an NN could analyze epigenetic data is shown in Figure 2.

Figure 2. Neural networks function through pattern recognition.

A hypothetical small NN is shown which has learned to use the local ATAC-seq pileup pattern to predict expression and is predicting expression of the highlighted gene. Each circle represents a neuron, and each row of neurons represents a layer. Arrows represent connections between neurons in adjacent layers. Each neuron learns from training data to identify a specific combination of its inputs that represents a biologically relevant pattern. The images on the neurons represent possible learned patterns, and the numbers next to each neuron represent how well the input data matches the neuron’s learned pattern on a scale from 0 (no match) to 1 (perfect match). Because neurons in later layers take patterns from previous layers as input, learned patterns become more complex in later layers, allowing a “deep” NN with many layers to learn highly complex patterns. Real NNs have hundreds of neurons per layer and many tens of layers and are expected to learn much more complex patterns than those shown here.

Methylation and DNA Sequence Data:

MRCNN used convolutional NNs (CNNs) to predict genome-wide methylation levels based on the nearby DNA sequences124. CNNs prioritize local connections by using sliding windows that look for patterns in small, limited regions. Where each layer in a typical NN has artificial neurons and a non-linear activation function, CNNs add a compression step, called “pooling”, that decreases the size of the data after each layer. By compressing the data after each layer and using a window of the same size, the sliding window at each layer effectively covers larger and larger portions of the input data, but at lower resolution, learning more complex patterns that span larger portions of the genome. The final layer of a CNN connects all of the patterns found across the region of interest in previous layers to detect highly complex patterns that would require many more connections to identify in a classic NN. This means CNNs125 could be useful in predicting gene expression patterns from DNA methylation and histone modification patterns as well as the underlying DNA sequence.
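The sliding-window and pooling steps can be sketched on a one-hot encoded DNA sequence. This is not the MRCNN architecture: the single hand-set filter (a "CG" dinucleotide detector), the ReLU activation, the pooling size, and the toy sequence are all assumptions chosen for illustration, whereas a real CNN learns many filters from training data.

```python
def one_hot(sequence):
    """Encode each base as a 4-element vector in the order A, C, G, T."""
    alphabet = "ACGT"
    return [[1.0 if base == letter else 0.0 for letter in alphabet]
            for base in sequence]

def conv1d(encoded, kernel):
    """Slide the kernel along the sequence and score every window;
    a ReLU activation sets negative scores to zero."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(w * x
                    for k_row, x_row in zip(kernel, window)
                    for w, x in zip(k_row, x_row))
        outputs.append(max(0.0, score))
    return outputs

def max_pool(values, size):
    """Compress the feature map by keeping the maximum of each block."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

sequence = "ATCGCGTA"
cg_filter = one_hot("CG")  # hypothetical filter that responds to "CG"
feature_map = conv1d(one_hot(sequence), cg_filter)
pooled = max_pool(feature_map, size=2)
```

After pooling, each value summarizes two neighboring window positions, so a filter of the same width applied in the next layer would effectively cover a longer stretch of the original sequence.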

MRCNN uses a 400bp window around the methylation site in the first layer, using the raw DNA sequence to learn 400bp motifs associated with hyper-, or hypo-, methylation in their samples. This method predicted methylated vs unmethylated regions with an accuracy of 93.2%, outperforming previous methods such as DeepCpG (approximately 87%)126.

Methylation and Variational Autoencoders:

Autoencoders (AE) are NN-based dimensionality reduction models which consist of two NNs: an encoder, which creates the compressed representation called an encoding, and a decoder, which recreates the original data from the encoding. Forcing the decoded output to match the original data incentivizes AEs to produce informative compressed encodings. Compressed representations are expected to resemble those of principal component analysis (PCA)128, wherein regions with high variability between samples take priority in the encoding over those that are nearly identical between all samples, and multiple highly correlated regions are compressed into a single value. The primary advantage of AEs over PCA is the ability to learn non-linear correlations, allowing AEs to learn biologically relevant patterns that PCA may miss, such as buffer effects, synergistic effects, and antagonistic effects. Variational AEs (VAE129) extend the AE model by putting additional constraints on the distribution of the lower-dimensional encoding. In the VAE, some noise is added to the encoding and it is required to match some pre-defined distribution, typically a multidimensional normal (Gaussian) distribution. These constraints make the encoding process probabilistic, which improves the correlation between sample similarity and encoding similarity. These probabilistic NNs are hence called “generative models”, because the enforced structure of their encodings gives them the ability to regenerate the original dataset. This allows them to be used more widely for varied forms of statistical modeling.
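The two ingredients that distinguish a VAE from a plain AE, the noise added to the encoding and the penalty that pulls the encoding toward a standard normal distribution, can be written down compactly. The sketch below assumes a diagonal Gaussian encoding described by a mean and a log-variance per latent dimension and a squared-error reconstruction term; the function names are illustrative, and the encoder and decoder networks themselves are omitted.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus the KL penalty on the encoding."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, 0.0], rng)  # one noisy draw of a 2-D encoding
```

The KL term is zero only when the encoding already is a standard normal, so it acts as the "additional constraint" on the encoding distribution that the text refers to.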

Wang and Wang, 2019, used a variational autoencoder (VAE) alongside t-SNE to compress 450K methylation data for logistic regression classification127. They used the VAE to reduce the dimensionality of their data from 300,000 probes to 100 compressed dimensions. For visualization, t-SNE was used to further compress the data to two dimensions. The data consisted of two lung cancer subtypes (LUAD-01 and LUSC-01) and two matching adjacent normal groups (LUAD-11 and LUSC-11). The two-dimensional non-linearly compressed representation split the four groups well, implying that the features learned by the VAE and t-SNE mapping represent biologically relevant features that separate the four sample groups. Additionally, expected biological features were recapitulated in the compressed representation: non-cancer samples produced tighter groupings than cancer samples, and the greatest overlap between groups occurred between cancer samples and their matched non-cancer samples. A logistic regression model used to separate the four groups obtained classification precisions of 0.92 (LUAD-01), 0.99 (LUSC-01), 0.75 (LUAD-11), and 1.00 (LUSC-11), with many errors mixing a minority of cancer samples with their matched non-cancer samples and a few LUSC-01 samples showing LUAD-01 profiles. This paper shows the value of VAE encodings for interpreting complex, high-dimensional, non-linear data.

Histone modification data (ChIP-seq):

We will discuss three models that have been developed to predict transcription from raw ChIP-seq pileup values, with progressively more features improving performance. The first, DeepChrome, used a CNN to predict transcription from local ChIP-seq pileup values130. DeepChrome outperformed baseline models (linear regression, support vector machines, random forest, and a custom rule-based algorithm) in predicting transcription, implying that it learned biologically relevant regulatory patterns. However, the model did not include a procedure to view and assess these patterns, limiting biological interpretability.

The second method, AttentiveChrome, used a recurrent NN (RNN)131. Unlike typical NNs, RNNs effectively analyze sequential data because they preserve information from previous inputs. This allows them to analyze an input from a sequence in the context of previously provided inputs. RNNs are described in greater detail in the referenced review by Lipton et al. (2015)132,133. AttentiveChrome used a specific form of RNN, a Long Short-Term Memory (LSTM) model, which uses multiple NNs to intelligently update only small portions of the saved state (the preserved information) each time an input is provided. This limits the loss of older data and improves the model's ability to learn complex long-range interactions. AttentiveChrome also included an “attention module” which explicitly models the importance of specific inputs, thus increasing the “attention” paid to highly important inputs on a sample-by-sample basis. This has two benefits. First, it decreases the noise influencing predictions, thereby increasing model accuracy on future samples by decreasing the contribution of inputs with likely spurious correlations. Second, the outputs of the attention module can be extracted and viewed, elucidating the regions of the ChIP-seq pileup that most correlate with expression on a sample-by-sample basis. This method identified regions that may play mechanistic roles in regulating expression, providing targets for further mechanistic studies. The attention module, and its contribution to biological interpretability, was the primary contribution of the AttentiveChrome model.
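The core of an attention module can be illustrated without any NN library. In the sketch below, each genomic bin has a feature vector and a relevance score; in a trained model such as AttentiveChrome the scores would be produced by learned layers, whereas here they are supplied by hand as an assumption. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(bin_features, bin_scores):
    """Return the attention-weighted sum of per-bin feature vectors and the weights."""
    weights = softmax(bin_scores)
    dim = len(bin_features[0])
    context = [sum(w * f[d] for w, f in zip(weights, bin_features)) for d in range(dim)]
    return context, weights


# Three hypothetical bins with two histone-mark signals each; the middle bin scores highest.
features = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1]]
context, weights = attend(features, [0.0, 3.0, 0.5])
```

Because the weights are explicit numbers per bin, they can be read out after prediction, which is what makes the attention output usable for interpretation.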

The third method, DeepDiff, is an LSTM NN that instead predicts differential genes between two samples based on their local ChIP-seq profiles134. This model further improves prediction accuracy through two additional techniques: multitask learning135 and Siamese contrastive loss136–138. Multitask learning improves prediction accuracy by forcing the network to make multiple predictions per sample on input data. DeepDiff learned to predict differential expression as the main task and learned two auxiliary tasks: first, it learned to classify the sample’s cell type; second, it learned to minimize the “Siamese contrastive loss” on its sample encoding. The Siamese contrastive loss forces a single NN to learn to identify similar encodings in similarly expressed genes and different encodings in differentially expressed genes. It does this by passing two ChIP-seq profiles for the same gene in different samples through the same network to generate an encoding for each. The model then measures the difference between the encodings and compares it to the difference in expression, punishing the NN if a differentially expressed gene has two similar encodings or vice versa. DeepDiff performed slightly better than AttentiveChrome in predicting gene expression; however, DeepDiff’s primary contribution was the combination of three tasks that, taken together with its attention module, improve biological interpretation of the model. Attention mechanisms showed multiple expected correlations: the enhancer mark H3K4me1 and the activating mark H3K4me3 were associated with increased expression, while the heterochromatin mark H3K9me3 and inhibitory mark H3K27me3 were associated with decreased expression. However, attention also unexpectedly showed the gene body-associated mark H3K36me3 associated with decreased expression. This could result from a failure of the model or could point to some yet unknown regulatory relationship, highlighting the potential to uncover previously unknown biological relationships using highly flexible models such as NNs.
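The contrastive objective described for DeepDiff can be illustrated with the classic margin-based form of the loss. This is a sketch under assumptions: the exact formulation, distance measure, and margin used by DeepDiff may differ, and the encodings here are hand-written vectors rather than network outputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two encodings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(enc_a, enc_b, differential, margin=1.0):
    """Margin-based contrastive loss for one gene profiled in two samples.

    Similarly expressed genes are penalised for distant encodings;
    differentially expressed genes are penalised for encodings closer than `margin`.
    """
    d = euclidean(enc_a, enc_b)
    if differential:
        return max(0.0, margin - d) ** 2
    return d ** 2


# Same gene, two samples: identical encodings are costly only if the gene is differential.
loss_same_expr = contrastive_loss([0.2, 0.4], [0.2, 0.4], differential=False)
loss_diff_expr = contrastive_loss([0.2, 0.4], [0.2, 0.4], differential=True)
```

The same encoder produces both encodings, which is what makes the arrangement "Siamese": the loss shapes a single network rather than two.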

Higher order chromatin structure (Hi-C):

DeepExpression predicts gene expression from promoter DNA sequence and Hi-C promoter-enhancer interactions139. DeepExpression is composed of three NNs: a “densely connected CNN”140 that analyzes promoter DNA sequence, a fully connected NN that analyzes enhancer-promoter (EP) interactions, and a fully connected NN that integrates output from the previous two NN modules. DeepExpression outperformed random forest, linear regression, and lasso-regularized linear regression in predicting transcription. To interpret model results, researchers implemented “model ablation analysis”, in which the model is run multiple times with different portions of the input hidden from the model, to assess the contribution of different portions of the input to prediction accuracy. They showed that the majority of the effect of the promoter sequence could be predicted using just the sequence within 1kb of the promoter, whereas the contribution of enhancer-promoter contacts (predicted by Hi-C data) was less than that of the DNA sequence. Lastly, researchers extracted learned sequences from the promoter and EP interaction modules and used these sequences for motif analysis, identifying transcription factors associated with promoter and enhancer regions. DeepExpression shows the power of NNs to learn complex patterns in input data and extract actionable information to direct downstream experiments.
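Model ablation analysis is simple to express in code. The sketch below uses a hypothetical stand-in predictor (`toy_model`, a fixed weighted sum with made-up weights) in place of a trained network, and hides each input block by setting it to zero; the input names and the masking strategy are assumptions for illustration, not the DeepExpression procedure.

```python
def toy_model(inputs):
    """Hypothetical stand-in for a trained predictor: a fixed weighted sum."""
    weights = {"promoter_seq": 0.7, "ep_contacts": 0.2, "distal_seq": 0.1}
    return sum(weights[name] * value for name, value in inputs.items())


def mean_abs_error(model, samples, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(s) - t) for s, t in zip(samples, targets)) / len(samples)


def ablation(model, samples, targets):
    """Error increase when each input block is hidden (set to 0) in turn."""
    baseline = mean_abs_error(model, samples, targets)
    report = {}
    for name in samples[0]:
        masked = [{**s, name: 0.0} for s in samples]
        report[name] = mean_abs_error(model, masked, targets) - baseline
    return report


samples = [
    {"promoter_seq": 1.0, "ep_contacts": 1.0, "distal_seq": 1.0},
    {"promoter_seq": 2.0, "ep_contacts": 2.0, "distal_seq": 2.0},
]
targets = [toy_model(s) for s in samples]  # toy targets the model fits exactly
report = ablation(toy_model, samples, targets)
```

Inputs whose removal causes the largest increase in error are the ones the model relies on most, which is the logic behind ranking promoter sequence above enhancer-promoter contacts.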

Neural networks for patient stratification based on epigenomic data

ATAC-seq Sample Clustering: Wasserstein Autoencoder, Generative Adversarial Network

Using an Encoder-GAN based on the “Wasserstein autoencoder” (WAE)141, a recent study developed a model, ClusterATAC, to accurately cluster 401 TCGA tumor samples based on the chromatin accessibility profiles mapped by ATAC-seq142. Wasserstein AEs, similarly to VAEs, extend the AE model by putting additional constraints on the distribution of the lower-dimensional encoding. The difference between Encoder-GAN and VAEs is complex and beyond the scope of this review, but both are probabilistic generative models. In ClusterATAC, the NN was used to compress the data down to an encoding of 200 values between approximately −10 and 10. Sample encodings were then clustered using a Gaussian mixture model, producing 22 cancer subgroups from 401 TCGA samples. These clusters correlated strongly with Kaplan-Meier survival curves and cancer type, implying that learned sample clusters effectively represented real underlying biological differences between TCGA samples. This appears to be an effective technique for biological interpretation of sparse, noisy, and high-dimensional epigenomic data and will be seen again as a popular strategy for analyzing multiOmics data.

MultiOmics analyses using NNs

Integrative analyses of mRNA, miRNA, and Methylation using AutoEncoders

AEs are highly useful in analyses of multiOmics data. Whereas linear techniques would suffice for analysis if combined regulatory effects were purely additive, the non-linear nature of AEs effectively models expected synergistic and antagonistic effects when combining multiple mechanisms of expression regulation. As an example, Chaudhary et al. 2018 used an AE as a pre-processing step for k-means clustering to predict survival in liver cancer143. In this study, the partially pre-processed matrices from the three data types (mRNA, miRNA, methylation) for each sample were stacked and input into the network. The multidimensional compressed representations of each sample were then passed through a univariate Cox proportional hazards (Cox-PH) model to select compressed features for which a significant Cox-PH model was obtained. These compressed features function similarly to components in PCA; however, they represent the presence or absence of some non-linear pattern. These significantly survival-associated compressed features were then used to cluster samples using the k-means clustering algorithm. Lastly, to classify samples to each cluster for downstream analysis, the mRNA, miRNA, and CpG sites most associated with each cluster were selected by ANOVA to train classifiers for each cluster. These survival-associated profiles were also shown to correlate with risk factors known to be associated with HCC survival, which implies an association with meaningful underlying biological processes.
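The clustering step applied to the selected compressed features is standard k-means. A minimal sketch of Lloyd's algorithm follows, with two simplifying assumptions: the first `k` points serve as initial centres so that the result is deterministic, and the toy two-dimensional "encodings" are invented for illustration. The AE and Cox-PH selection steps are not reproduced here.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm using the first k points as initial centres."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
            for p in points
        ]
        # Update step: move each centre to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical survival-associated encodings for six samples (two features each).
encodings = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
labels, centres = kmeans(encodings, k=2)
```

In practice, random restarts or a seeding scheme such as k-means++ are usually preferred over fixed initial centres, since Lloyd's algorithm only finds a local optimum.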

Integrative analyses of miRNA, Methylation, and CNV using AutoEncoders, SHAP Values

PathME, 2020, used multi-modal sparse denoising AEs in concert with NCI Pathway Interaction Database pathways to encode a pathway score for each pathway for each sample144. The matrix of scores for every pathway-sample combination was then clustered using NMF clustering to generate sample clusters and pathway clusters simultaneously. In addition, researchers used Shapley Additive exPlanations (SHAP) values to interpret the contribution of specific epigenomic elements to pathway scores, providing insight into the regulatory processes underlying known biological processes. As previously described, AEs are non-linear dimensionality reduction NN models that use an encoder to compress data and a decoder to reconstruct the initial input. PathME uses a two-step encoder to produce pathway-specific encodings. The first step uses datatype-specific encoders to encode epigenetic features for all genes in a specific pathway. The second step concatenates these encodings and further compresses them to a single pathway score for the sample.

Researchers tested the model on four multiOmic TCGA datasets: colorectal cancer (CRC), lung squamous cell carcinoma (LSCC), glioblastoma multiforme (GBM), and breast cancer (BRCA). PathME appears to cluster data well based on multiple metrics that can be viewed in the original paper. Interestingly, clustering of GBM samples showed three clusters with similar survival durations and a single cluster with significantly longer survival duration, implying that these clusters reflect meaningful biological differences. Lastly, researchers used SHAP values to interpret predictions, obtaining greater insight into model predictions than other papers represented here. SHAP values are a metric based on game theory that attempts to quantify the importance of specific inputs to the final prediction. Researchers used the absolute values of SHAP values, meaning that larger values correspond to a larger effect on the prediction. The calculation of SHAP values is highly complex and described in Lundberg et al. (2017)145. SHAP values were used to identify specific epigenetic features in specific genes that contributed strongly to scores for biologically relevant pathways. In CRC, the fibroblast growth factor signaling pathway played a major role, which was expected biologically. SHAP values showed that the known prognostic markers FGF19, FGFR2, and miR-31–3p/5p expression contributed strongly to this pathway’s score.
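The game-theoretic idea underlying SHAP can be shown exactly on a very small problem. The sketch below computes exact Shapley values by averaging each feature's marginal contribution over every ordering of the features. It is not the SHAP library and not the approximation PathME used; `toy_pathway_score` and the feature names `expr_A`, `meth_B`, and `mir_C` are hypothetical, and exhaustive enumeration is feasible only for a handful of features.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            totals[f] += value_fn(frozenset(present)) - before
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_pathway_score(present):
    """Hypothetical score: two additive effects plus one synergy term."""
    score = 0.0
    if "expr_A" in present:
        score += 2.0
    if "meth_B" in present:
        score += 1.0
    if "expr_A" in present and "mir_C" in present:
        score += 1.0  # synergy counted only when both features are present
    return score


values = shapley_values(toy_pathway_score, ["expr_A", "meth_B", "mir_C"])
```

Here the synergy term is split evenly between the two interacting features (2.5 for `expr_A`, 0.5 for `mir_C`, 1.0 for `meth_B`), and the values sum to the full score of 4.0, which illustrates how attribution methods of this kind distribute non-additive effects.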

PathME incorporated biological knowledge through pathway-specific encodings and improved model interpretability by using SHAP values to interrogate the contribution of individual epigenetic features to pathway scores. These methods circumvent some of the major limitations of NN models and show new ways that NNs can be incorporated into biological analysis pipelines.

Integrative analyses of multiOmic single cell datasets:

scMVAE146 incorporated the expected structure of single cell RNA-seq and ATAC-seq data into its model using generative techniques inspired by scVI147, SCALE148, and MVAE149. The contributions of these models will be discussed when the techniques they inspired are introduced. There are three versions of the model, each of which uses a different encoder model. Inspired by SCALE, every version of the model encodes to a Gaussian mixture rather than a single Gaussian. The Gaussian mixture encoding distribution has the advantage of allowing multiple modes of cell state in the data. In complex samples, these modes tend to represent different clusters of cells, such as different cell types, or cancer vs non-cancer cells. This tends to improve model accuracy when multiple clusters of cells are expected because the model is not required to force the data into the single mode assumed by classical generative NNs such as VAEs and WAEs. Based on the scVI model, every version of the model decodes the encoding to predict a distribution of mRNA expression levels and ATAC-seq pileup levels rather than a specific value. The distribution used, the zero-inflated negative binomial distribution (ZINB), builds assumptions about the distribution of scRNA-seq and scATAC-seq data into the model. Allowing the model to explicitly predict a ZINB distribution when decoding rather than a specific value is expected to improve model robustness and generalizability by allowing the model to seamlessly incorporate noisy values without being thrown off by extreme values. For reasons beyond the scope of this article, typical NN predictions using least squares error are vulnerable to extreme values.
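To make the ZINB assumption concrete, the sketch below evaluates its log-probability for a single count. It assumes the common mean/inverse-dispersion parameterisation of the negative binomial (`mu`, `theta`) and a dropout probability `pi` strictly below 1; these parameter names are illustrative and are not taken from the scMVAE or scVI code.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu, inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero (dropout)."""
    log_nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can arise from dropout or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(log_nb))
    return math.log(1.0 - pi) + log_nb


# A decoder would output (mu, theta, pi) per gene or peak; the training loss is the
# negative log-likelihood of the observed count under those parameters.
nll_zero = -zinb_log_pmf(0, mu=3.0, theta=2.0, pi=0.3)
nll_large = -zinb_log_pmf(40, mu=3.0, theta=2.0, pi=0.3)
```

Under a negative binomial the probability of a large count decays roughly geometrically, so its negative log-likelihood grows about linearly in the count rather than quadratically as a squared error would, which is one way to see why a count-based likelihood is less dominated by extreme values.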

The three encoding models used in scMVAE will be discussed in order of increasing complexity, with the simplest encoding making minimal assumptions about the data and the most complex model making the greatest assumptions. As discussed in the introduction, models that make stringent assumptions about input data often require less training data to make accurate predictions, but may be inaccurate in spite of infinite training data if model assumptions are incorrect. This means that the most complex version of the scMVAE model may be expected to make the most accurate predictions in the presence of limited data, but only if model assumptions are empirically shown to fit the data. The first version of the encoder stacks all input data formats and passes them directly into the encoder. Given the limited amount of available data, this version, called scMVAE-direct, had the worst results among the three versions. The second version, called scMVAE-NN, used a single encoder for each data type to encode data type-specific patterns, followed by an encoder that combined the two single data type encodings. By allowing the final encoder to combine already encoded representations of each data type, this version likely improved the final encoder’s ability to find complex multiOmic patterns on a limited dataset. The last encoder, called scMVAE-POE, is based on the MVAE model. This model encodes each data type independently to a single multivariate Gaussian distribution, similar to the VAE. To combine the encodings from each data type, all data type-specific encodings are input into a Product of Experts (PoE) model, a special kind of inference network that limits the amount of data required for efficient training150. scMVAE-POE is more complex and makes greater assumptions than previous NN models. However, using generative assumptions inspired by Bayesian generative modeling in tandem with the flexibility of NNs, this model minimizes the data required for accurate modeling.
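When every expert is Gaussian, a Product of Experts has a closed form: precisions (inverse variances) add, and the combined mean is the precision-weighted average of the expert means. The one-dimensional sketch below shows this; the optional standard-normal prior expert follows a common PoE formulation and is an assumption here, as are the function name and the example numbers.

```python
def product_of_gaussians(means, variances, include_prior=True):
    """Combine 1-D Gaussian experts; the product is Gaussian with summed precisions."""
    precisions = [1.0 / v for v in variances]
    weighted = [m * p for m, p in zip(means, precisions)]
    if include_prior:
        # Standard-normal prior expert N(0, 1): precision 1, mean 0.
        precisions.append(1.0)
        weighted.append(0.0)
    total_precision = sum(precisions)
    return sum(weighted) / total_precision, 1.0 / total_precision


# Hypothetical encodings of one cell from two data types (e.g. RNA and ATAC).
mean_both, var_both = product_of_gaussians([2.0, 4.0], [1.0, 1.0], include_prior=False)
# A confident expert (small variance) dominates an uncertain one.
mean_skewed, var_skewed = product_of_gaussians([2.0, 4.0], [0.1, 10.0], include_prior=False)
```

Because the combination rule has no trainable parameters of its own, no separate fusion network needs to be fitted, which is one reason this approach can be economical with training data.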

The models were tested on simulated data generated by SPLATTER151 and on cancer samples. The models were compared against clustering using the non-linear singleOmics methods scVI and Seurat (on scRNA-seq data and scATAC-seq data separately), and the linear multiOmic techniques intNMF and MOFA. Most models performed well when clustering simple simulated data; however, scMVAE-NN and scMVAE-POE performed better on complex simulated data. On real data most models performed well, but as dropout increased, the performance of other models degraded more rapidly than the performance of scMVAE. These results should be confirmed on additional datasets, although they are expected given the structure of scMVAE-NN and scMVAE-POE, which have the flexibility of NN models while building in structure to reduce the required training data. Lastly, researchers showed that scMVAE also recovered correlations in expression between transcription factors (TFs) and their target genes (TGs). This implies that scMVAE may be useful for imputing expression values for biological interpretation of single cell multiOmics experiments.

A word on Performance Evaluation

Evaluation metrics are crucial to assess the performance of ML models. For classification tasks, accuracy is the ratio of the number of correct predictions to the total number of samples. Sensitivity is the proportion of actual positive cases that the model correctly labels as positive. Specificity is the proportion of actual negative cases that the model correctly labels as negative. The Receiver Operating Characteristic (ROC) curve plots the performance of a classification model at different thresholds using the false positive rate (FPR; 1 - specificity) and the true positive rate (TPR; sensitivity). The area under the ROC curve (AUC) is widely used as a metric to measure performance. Maximizing generalizability of the ML model is the primary goal in an ML model development process. To test accuracy, one ideally needs enough observations to split the data into training and testing sets. Most of the time, there are not enough samples in epigenomics analysis and therefore k-fold cross-validation152,153 is used. Selecting the ‘k’ of k-fold cross-validation involves a well-known ML phenomenon, the bias-variance trade-off. Although there is no formal rule for choosing ‘k’ and the number of samples affects the choice, 5-fold and 10-fold cross-validation are commonly chosen to balance the bias and the variance152,153. Unfortunately, there is no consensus on selecting the ‘k’ of k-fold cross-validation in epigenomics analysis. There are papers that use 2-fold66,154, 3-fold62,155, 4-fold154,156,157, 5-fold91,154, 8-fold154,158, and 10-fold159–161 cross-validation to investigate the model’s predictive validity.
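The metrics above follow directly from the counts of true and false positives and negatives. The sketch below computes them for binary labels (1 = positive, 0 = negative) and obtains the AUC through its rank interpretation, the probability that a randomly chosen positive scores higher than a randomly chosen negative. Function names and the toy labels are illustrative assumptions.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion(y_true, y_pred)
    return tp / (tp + fn)  # true positive rate


def specificity(y_true, y_pred):
    _, tn, fp, _ = confusion(y_true, y_pred)
    return tn / (tn + fp)  # 1 - false positive rate


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.1, 0.2, 0.3, 0.35, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # one threshold on the ROC curve
```

On this toy example accuracy is 0.75 while sensitivity is only 2/3, a reminder that accuracy alone can hide poor detection of the positive class.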

Challenges and Future Directions

Despite numerous opportunities with ML frameworks, epigenomics data analysis and machine learning applications in epigenomics face several significant challenges. We will first discuss the most critical challenges and highlight some future research directions. The unfavorable ratio of epigenomic features to sample size makes extracting reliable knowledge challenging. This statistical barrier occurs due to the high-dimensional nature of epigenomic datasets and it relates to the curse of dimensionality. Typically, there are hundreds of thousands of features compared with 10–100 samples in most epigenomics experimental designs. Feature extraction and feature selection methods, discussed under the dimensionality reduction section, help to remove irrelevant and redundant features’ effects before downstream analysis.

Imbalanced data is a common problem, wherein the number of target labels is not the same or even close in many epigenomics studies189. While the number of methods for epigenomic profiling has exploded in the past few years, they are limited by the availability of the patient material for profiling. This imbalanced class problem further worsens for rare diseases or those where biopsies are difficult to perform (such as GBM, prostate cancers). Another contribution to imbalance of the data is the number of genomic regions with specific epigenetic elements (e.g. enhancers, polycomb elements or TF binding regions) that associate with clinical or molecular phenotype of interest190,191. An ML model that was trained on an imbalanced dataset may have a bias towards the majority class labels, whereas the actual target is the rare disease or cancer type or condition. Minority over-sampling192 and boosting193 are major solutions to overcome this problem.
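The simplest form of minority over-sampling duplicates randomly chosen minority-class samples until every class matches the largest one. The sketch below implements this plain random over-sampling; it is deliberately simpler than synthetic approaches that interpolate new samples and is not a reproduction of the cited method. The function name and toy labels are assumptions.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for label, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == label]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(label)
    return out_x, out_y


# Eight majority-class samples and two minority-class samples (sample IDs stand in for data).
sample_ids = list(range(10))
classes = ["common"] * 8 + ["rare"] * 2
balanced_x, balanced_y = random_oversample(sample_ids, classes)
```

Over-sampling should be applied only to the training portion of each split; duplicating samples before splitting lets copies of the same sample leak into the test set and inflates performance estimates.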

Bayesian analysis provides appropriate mathematical modeling to overcome the sample size problem which is common for many epigenomics datasets. Bayesian approaches combine the prior probability (expert’s knowledge on the domain) and the likelihood (observed data) to get the posterior distribution and interpret the results194. Several genomics analyses showed the Bayesian setting’s superiority over frequentist methods195–199. Bayesian frameworks are getting more popular in epigenomics analysis200,201 but there are still opportunities for further enhancements. In addition, generative NN models inspired by Bayesian approaches may provide an opportunity for a synthesis of these techniques.
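A small worked example shows how a prior stabilises estimates from few observations. The sketch assumes a Beta prior on the methylation fraction at a single CpG and a binomial likelihood for the reads, a textbook conjugate pair chosen here for illustration; the prior parameters and read counts are invented.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data under a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Prior belief centred on 50% methylation with the weight of four pseudo-reads.
prior = (2.0, 2.0)
# Only four reads cover the site: three methylated, one unmethylated.
posterior = beta_binomial_update(*prior, successes=3, failures=1)

raw_estimate = 3 / 4                       # frequentist proportion from the reads alone
posterior_estimate = beta_mean(*posterior)  # shrunk toward the prior mean
```

With four reads the posterior mean (0.625) sits between the prior mean (0.5) and the raw proportion (0.75); as coverage grows, the data dominate and the two estimates converge.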

Maximizing generalizability of the ML model is the primary goal in an ML model development process. To test accuracy, one ideally needs enough observations to split the data into training and testing sets. Most of the time, there are not enough samples in epigenomics analysis and therefore k-fold cross-validation202,203 is used. Selecting the ‘k’ of k-fold cross-validation involves a well-known ML phenomenon, the bias-variance trade-off. Although there is no formal rule for choosing ‘k’ and the number of samples affects the choice, 5-fold and 10-fold cross-validation are commonly chosen to balance the bias and the variance202,203. Unfortunately, there is no consensus on selecting the ‘k’ of k-fold cross-validation in epigenomics analysis. Various studies have used 2-fold78,204, 3-fold64,205, 4-fold66,204,206, 5-fold99,204, 8-fold204,207, and 10-fold208–210 cross-validation to investigate the model’s predictive validity. We also want to point out nested cross-validation211, a recent modification of cross-validation, to the machine learning community in epigenomics.
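The index bookkeeping behind k-fold and nested cross-validation can be sketched as follows. The code assumes samples are already in random order (in practice they would be shuffled, and often stratified by class, first), and the function names are illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def nested_cv_splits(n_samples, outer_k, inner_k):
    """For each outer split, build inner splits drawn only from the outer training set."""
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        inner = [
            ([outer_train[i] for i in tr], [outer_train[i] for i in va])
            for tr, va in k_fold_indices(len(outer_train), inner_k)
        ]
        yield outer_train, outer_test, inner


folds = list(k_fold_indices(10, 3))
nested = list(nested_cv_splits(10, outer_k=5, inner_k=4))
```

In the nested scheme, hyperparameters are tuned on the inner splits and performance is reported on the outer test folds, so the samples used to estimate generalization never influence model selection.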

ML methods in epigenomics should be designed to help explain the mechanisms and design new clinical protocols; however, the prevalent practice is to propose accuracy-oriented or efficiency-oriented ML methods. The interpretability and explainability of ML models suffer from the black-box nature of the models212. In addition, nonlinear classification boundaries in many complex supervised learning methods hinder the explanation of how epigenomic markers affect diseases, limiting planning for further clinical purposes. Clear reasoning must be established for any ML model before it can be trusted in high stakes clinical settings.

Ensuring reproducibility of any computational analyses is absolutely essential, especially in the genomics and epigenomics field213. The limited ability to replicate results due to version changes of the utilized packages may cause consistency and reliability issues for biological findings. Many supervised and unsupervised algorithms in epigenomics depend on several packages, libraries, and modules. However, sometimes reproducibility is not achieved even by using the same data and the same pipeline214. Future research must address this critical need with novel ML/DL algorithms. In addition, many genomics and epigenomics data analyses suffer from the exclusion of sample/feature outliers. Outliers decrease the ability to extract real biological signals and mislead many downstream analyses. Mallik and Zhao204 use Density-Based Spatial Clustering of Applications with Noise (DBSCAN) as the outlier detector followed by hierarchical clustering.

Many research studies use Random Forest (RF) among many off-the-shelf classifiers for their prediction analysis. Several ML methods offer alternatives to RF that may be worth using in further epigenomics analysis. For instance, the gradient boosting machine (GBM) was part of the Netflix Prize winning solution215, XGBoost216 has achieved the highest prediction performance in many Kaggle competitions, and SPORF (Sparse Projection Oblique Randomer Forests) addresses many issues related to RF217.

To gain a comprehensive understanding of the biological roles of epigenetic processes, integration of epigenomic data with genomics, proteomics, metabolomics, pharmacogenomics, and clinical responses is warranted218. It is a non-trivial task to harmonize even the same data type generated by different labs or various scientists, since the distribution of epigenomic signals varies across labs and batches. The more difficult problem will be integrating several heterogeneous data types219. There are some current multi-modal integration approaches with a range of success rates; however, they suffer from various limitations220. Another hurdle is data privacy. Ching et al.221 discussed the shortcomings of electronic health records (EHRs) and data privacy issues thoroughly. Patient privacy restrictions may critically affect the power of research studies by limiting the sample sizes222.

Transfer learning (TL) is an ML technique where a model trained on one task is re-purposed on a second related task. TL is a popular approach in limited data resource computer vision223–225 and natural language processing tasks226–228 where the pre-trained model is used as the starting point to improve analysis performance. The TL models have become more popular in genomics229,230 and epigenomics231,232 data analysis recently. This area of ML enables researchers to ask new biological questions linking several phenotypes, diseases, or cancer types. TL will need more attention with the integration of several data types over the coming years in epigenomic analysis.

NNs are highly flexible, non-linear “function approximation” algorithms, meaning that they find patterns in high dimensional, highly complex data that allow them to model said data more accurately than the vast majority of competing algorithms. There are NN models for dimensionality reduction, clustering data, learning regulatory correlations between data types, and many other tasks. Due to the flexibility and empirically shown accuracy of NNs, these models have gained widespread use across these tasks. This is especially true in multiOmics analysis, where complex, non-linear models may be necessary to uncover the complex relationships between multiple epigenomic modalities. However, NNs are not without limitations. Compared to simpler models such as PCA, SVM, or UMAP, they can be slower, require more expertise, and be difficult to interpret when digging into both expected and unexpected results. Over time, we expect usage of NNs to increase as improved tools make development and model interpretation easier. This will grant greater access to NN’s results and insights, allowing them to provide biological insight that is not yet accessible to specialized and generalized labs alike.

Overall, machine learning is being increasingly utilized to gain novel biological insights from cancer epigenomic datasets. We expect that cutting-edge deep learning tools will find various uses in the future as a growing number of epigenomic datasets in cancer systems become available to the community, which will enhance our understanding of the epigenome's contribution to cancer progression.

Table 1.

Overview of the Literature on Machine Learning in Epigenomics Analysis

| Cancer Type | Machine Learning Method | ML Task | Epigenomics Data Type | Ref |
|---|---|---|---|---|
| Liver Hepatocellular Carcinoma, Lung | Support Vector Machine | Supervised Learning | 450K Methylation array, ChIP-Seq from ENCODE, Illumina Human Methylation27 | 65,93,131 |
| Pan-Cancer, Liver Hepatocellular Carcinoma, Lung, Colon, Breast, Prostate, Brain Metastasis | Random Forest | Supervised Learning | ChIP-Seq, 450K Methylation array, Illumina Human Methylation27 | 64,65,67,77,93,131–135 |
| Pan-Cancer | Logistic regression | Supervised Learning | 450K Methylation array, ChIP-Seq from ENCODE, ATAC-Seq | 67,136,137 |
| Prostate Cancer | Linear Regression | Supervised Learning | 450K Methylation array | 63 |
| Pan-Cancer | Decision Tree | Supervised Learning | ATAC-Seq | 136 |
| Breast, Lung, Colorectal, Glioblastoma | Hierarchical Clustering | Unsupervised Learning | 450K Methylation array, ChIP-Seq, Illumina GoldenGate | 107–109,112,113 |
| Glioblastoma, Pan-Cancer, Ovarian | k-means Clustering | Unsupervised Learning | 450K Methylation array, Illumina Human Methylation27, Illumina GoldenGate | 113,116,138 |
| Pancreatic, Colorectal | NMF | Unsupervised Learning | 450K Methylation array, ChIP-Seq | 121,122 |
| Breast, Prostate | ChromHMM | Unsupervised Learning | ChIP-Seq, Illumina HiSeq 2500 | 139–141 |
| Glioma, Pan-Cancer, HeLa, Breast, GM12878, Gastric | Fully Connected Deep Networks | Supervised Learning | 450K Methylation Array, ChIP-Seq, DNA sequence, Affymetrix SNP 6.0, CNV, WGBS | 142–149 |
| Hepatocellular carcinoma, Breast, K562, Ovarian, Lung, Colon, CRC, GBM | AE | Unsupervised Learning | 450K Methylation Array, Affymetrix SNP 6.0 CNV, SNARE-seq, scCAT-seq | 150–159 |
| Lung, Esophageal | VAE | Unsupervised Learning | 450K Methylation array, ChIP-Seq | 160,161 |
| Acute myeloid leukemia, K562, HCT116 | CNN | Supervised Learning | WGBS, ChIP-Seq, H3K27ac HiChIP, DNase | 162–166 |
| Breast | LSTM | Supervised Learning | 450K Methylation array, ChIP-Seq | 167–169 |
| Hepatoblastoma | GRU | Unsupervised Learning | Single cell WGBS, Single cell RRBS | 170 |
| Pan-Cancer | GAN | Unsupervised Learning | ATAC-Seq | 171 |

Abbreviations: NMF = Non-negative matrix factorization, NN = Neural Network, DNN = Deep Neural Network, here referring to classic NNs, CNN = Convolutional Neural Network, DBN = Deep Belief Network, GAN = Generative Adversarial Network, AE = Autoencoder, VAE = Variational Autoencoder, LSTM = Long Short-Term Memory, GRU = Gated Recurrent Unit, WGBS = Whole Genome Bisulfite Sequencing, RRBS = Reduced Representation Bisulfite Sequencing, GBM = Glioblastoma Multiforme, 450K Methylation array = Illumina Infinium Human Methylation 450k

ACKNOWLEDGEMENTS

The authors are supported by grants from the National Institutes of Health (NIH R21CA231654; R01CA222214; R01DE028061; R01CA226269; R01CA245395), the American Cancer Society (ACS 133407-RSG-19-187-01-DMC), the Department of Defense (DoD W81XWH1710269, W81XWH2010098 and W81XWH2010646), the Cancer Prevention and Research Institute of Texas (CPRIT RP200390 and RP170407), and the Melanoma Research Alliance (MRA 508397).

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declarations of interest: none

REFERENCES

  • 1. Dawson MA The cancer epigenome: Concepts, challenges, and therapeutic opportunities. Science 355, 1147–1152, doi: 10.1126/science.aam7304 (2017).
  • 2. Dawson MA & Kouzarides T Cancer epigenetics: from mechanism to therapy. Cell 150, 12–27, doi: 10.1016/j.cell.2012.06.013 (2012).
  • 3. Henning AN, Roychoudhuri R & Restifo NP Epigenetic control of CD8(+) T cell differentiation. Nat Rev Immunol 18, 340–356, doi: 10.1038/nri.2017.146 (2018).
  • 4. Tough DF, Rioja I, Modis LK & Prinjha RK Epigenetic Regulation of T Cell Memory: Recalling Therapeutic Implications. Trends Immunol 41, 29–45, doi: 10.1016/j.it.2019.11.008 (2020).
  • 5. Kouzarides T Chromatin modifications and their function. Cell 128, 693–705, doi: 10.1016/j.cell.2007.02.005 (2007).
  • 6. Badeaux AI & Shi Y Emerging roles for chromatin as a signal integration and storage platform. Nature reviews. Molecular cell biology 14, 211–224, doi: 10.1038/nrm3545 (2013).
  • 7. Maunakea AK, Chepelev I & Zhao K Epigenome mapping in normal and disease States. Circ Res 107, 327–339, doi: 10.1161/CIRCRESAHA.110.222463 (2010).
  • 8. Margueron R & Reinberg D The Polycomb complex PRC2 and its mark in life. Nature 469, 343–349, doi: 10.1038/nature09784 (2011).
  • 9. Landt SG et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome research 22, 1813–1831, doi: 10.1101/gr.136184.111 (2012).
  • 10. Terranova C et al. An Integrated Platform for Genome-wide Mapping of Chromatin States Using High-throughput ChIP-sequencing in Tumor Tissues. Journal of visualized experiments : JoVE, doi: 10.3791/56972 (2018).
  • 11. Rotem A et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nature biotechnology 33, 1165–1172, doi: 10.1038/nbt.3383 (2015).
  • 12. Kaya-Okur HS et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nature communications 10, 1930, doi: 10.1038/s41467-019-09982-5 (2019).
  • 13. Buenrostro JD, Wu B, Chang HY & Greenleaf WJ ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current protocols in molecular biology / edited by Ausubel Frederick M. … [et al. ] 109, 21.29.1–21.29.9, doi: 10.1002/0471142727.mb2129s109 (2015).
  • 14. John S et al. Genome-scale mapping of DNase I hypersensitivity. Current protocols in molecular biology / edited by Ausubel Frederick M. … [et al. ] Chapter 27, Unit 21.27, doi: 10.1002/0471142727.mb2127s103 (2013).
  • 15. van Berkum NL et al. Hi-C: a method to study the three-dimensional architecture of genomes. Journal of visualized experiments : JoVE, doi: 10.3791/1869 (2010).
  • 16. Meissner A et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic acids research 33, 5868–5877, doi: 10.1093/nar/gki901 (2005).
  • 17. Cokus SJ et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452, 215–219, doi: 10.1038/nature06745 (2008).
  • 18. Lister R et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133, 523–536, doi: 10.1016/j.cell.2008.03.029 (2008).
  • 19. Friedman J, Hastie T & Tibshirani R The elements of statistical learning. Springer series in statistics New York; (Vol. 1) (2001).
  • 20. Bellman R Dynamic programming. Princeton univ. Press Princeton; (1957).
  • 21. Altman N & Krzywinski M The curse (s) of dimensionality. Nat Methods 15, 399–400 (2018).
  • 22. Hughes G On the mean accuracy of statistical pattern recognizers. IEEE transactions on information theory 14, 55–63 (1968).
  • 23. Hotelling H Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24, 417 (1933).
  • 24. Pearson K LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572 (1901).
  • 25. Van Der Maaten L, Postma E & Van den Herik J Dimensionality reduction: a comparative review. J Mach Learn Res 10, 13 (2009).
  • 26. Kambhatla N & Leen TK Dimension reduction by local principal component analysis. Neural computation 9, 1493–1516 (1997).
  • 27. Locantore N et al. Robust principal component analysis for functional data. Test 8, 1–73 (1999).
  • 28. Hubert M, Rousseeuw PJ & Vanden Branden K ROBPCA: a new approach to robust principal component analysis. Technometrics 47, 64–79 (2005).
  • 29. Serneels S & Verdonck T Principal component analysis for data containing outliers and missing elements. Computational Statistics & Data Analysis 52, 1712–1727 (2008).
  • 30. Vidal R, Ma Y & Sastry S Generalized principal component analysis (GPCA). IEEE transactions on pattern analysis and machine intelligence 27, 1945–1959 (2005).
  • 31. Wang T, Gu IY & Shi P Object tracking using incremental 2D-PCA learning and ML estimation. IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07. I-933-I-936 (2007).
  • 32. Su Y, Huang Y & Kuo C-CJ Efficient text classification using tree-structured multi-linear principal component analysis. 24th International Conference on Pattern Recognition (ICPR) (2018).
  • 33. Zou H, Hastie T & Tibshirani R Sparse principal component analysis. Journal of computational and graphical statistics 15, 265–286 (2006).
  • 34. Journée M, Nesterov Y, Richtárik P & Sepulchre R Generalized power method for sparse principal component analysis. Journal of Machine Learning Research 11 (2010).
  • 35. Yi S, Lai Z, He Z, Cheung Y. m. & Liu Y Joint sparse principal component analysis. Pattern Recognition 61, 524–536 (2017).
  • 36. Schölkopf B, Smola A & Müller K-R Kernel principal component analysis. Lecture Notes in Computer Science. International conference on artificial neural networks. 583–588 (1997).
  • 37. Rahmani E et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature methods 13, 443 (2016).
  • 38. Zhang Q et al. A comparative study of five association tests based on CpG set for epigenome-wide association studies. PloS one 11, e0156895 (2016).
  • 39. Hinton G & Roweis ST Stochastic neighbor embedding. NIPS. Vol 15 (2002).
  • 40. Van der Maaten L & Hinton G Visualizing data using t-SNE. Journal of machine learning research 9 (2008).
  • 41. McInnes L, Healy J & Melville J Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
  • 42. Becht E et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology 37, 38–44 (2019).
  • 43. Pedregosa F et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
  • 44. Krijthe J, van der Maaten L & Krijthe MJ Package ‘Rtsne’. GitHub (2018).
  • 45. Donaldson J & Donaldson MJ Package ‘tsne’. CRAN Repository (2010).
  • 46. Wolf FA, Angerer P & Theis FJ SCANPY: large-scale single-cell gene expression data analysis. Genome biology 19, 1–5 (2018).
  • 47. Kobak D & Berens P The art of using t-SNE for single-cell transcriptomics. Nature communications 10, 1–14 (2019).
  • 48. Hall MA Correlation-based feature selection for machine learning. (1999).
  • 49. Hall MA & Smith LA Practical feature subset selection for machine learning. (1998).
  • 50. Han J, Kamber M & Pei J Data mining concepts and techniques third edition. The Morgan Kaufmann Series in Data Management Systems 5, 83–124 (2011).
  • 51. Kononenko I et al. Overcoming the myopia of inductive learning algorithms with RELIEFF. European conference on machine learning. 171–182 (1997).
  • 52. Kira K & Rendell LA A practical approach to feature selection. Machine learning proceedings (1992).
  • 53. Urbanowicz RJ, Meeker M, La Cava W, Olson RS & Moore JH Relief-based feature selection: Introduction and review. Journal of biomedical informatics 85, 189–203 (2018).
  • 54. Alkuhlani A, Nassef M & Farag I Multistage feature selection approach for high-dimensional cancer data. Soft Computing 21, 6895–6906 (2017).
  • 55. Han Y, Huang L & Zhou F A dynamic recursive feature elimination framework (dRFE) to further refine a set of OMIC biomarkers. Bioinformatics (2021).
  • 56. Chen Z et al. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics 36, 1542–1552 (2020).
  • 57. Tang J, Alelyani S & Liu H Feature selection for classification: A review. Data classification: Algorithms and applications, 37 (2014).
  • 58. Xu X, Liang T, Zhu J, Zheng D & Sun T Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing 328, 5–15 (2019).
  • 59. Bolón-Canedo V, Sánchez-Maroño N & Alonso-Betanzos A A review of feature selection methods on synthetic data. Knowledge and information systems 34, 483–519 (2013).
  • 60. Levatić J, Ceci M, Kocev D & Džeroski S Semi-supervised learning for multi-target regression. International workshop on new frontiers in mining complex patterns (2014).
  • 61. Chappell D Introducing azure machine learning. A guide for technical professionals, sponsored by microsoft corporation (2015).
  • 62. LeCun Y, Bengio Y & Hinton G Deep learning. Nature 521, 436–444 (2015).
  • 63. Aref-Eshghi E et al. Genomic DNA methylation-derived algorithm enables accurate detection of malignant prostate tissues. Frontiers in oncology 8, 100 (2018).
  • 64. Capper D et al. DNA methylation-based classification of central nervous system tumours. Nature 555, 469–474 (2018).
  • 65. Kaur H, Bhalla S & Raghava GP Classification of early and late stage liver hepatocellular carcinoma patients from their genomics and epigenomics profiles. PloS one 14, e0221476 (2019).
  • 66. Wang T et al. Epigenetic aging signatures in mice livers are slowed by dwarfism, calorie restriction and rapamycin treatment. Genome biology 18, 1–11 (2017).
  • 67. Lyu J et al. DORGE: Discovery of Oncogenes and tumoR suppressor genes using Genetic and Epigenetic features. Science advances 6, eaba6784 (2020).
  • 68. Steinwart I & Christmann A Support vector machines. Springer Science & Business Media (2008).
  • 69. Wang L Support vector machines: theory and applications. Springer Science & Business Media (Vol. 177) (2005).
  • 70. Rokach L & Maimon O Data mining and knowledge discovery handbook. Springer Science+ Business Media (2005).
  • 71. Qi Y Random forest for bioinformatics. Ensemble machine learning Springer (2012).
  • 72. Murphy KP Naive bayes classifiers. University of British Columbia 18 (2006).
  • 73. Montgomery DC, Peck EA & Vining GG Introduction to linear regression analysis. John Wiley & Sons; (2021).
  • 74. Tibshirani R Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288 (1996).
  • 75. Hoerl AE & Kennard RW Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970).
  • 76. Gavrilovic IT & Posner JB Brain metastases: epidemiology and pathophysiology. Journal of neuro-oncology 75, 5–14 (2005).
  • 77. Orozco JI et al. Epigenetic profiling for the molecular classification of metastatic brain tumors. Nature communications 9, 1–14 (2018).
  • 78. Deng Y et al. CpG-methylation-based risk score predicts progression in colorectal cancer. Epigenomics 12, 605–615 (2020).
  • 79. Micevic G, Theodosakis N & Bosenberg M Aberrant DNA methylation in melanoma: biomarker and therapeutic opportunities. Clin Epigenetics 9, 34, doi: 10.1186/s13148-017-0332-8 (2017).
  • 80. Vogelstein B et al. Cancer genome landscapes. Science 339, 1546–1558, doi: 10.1126/science.1235122 (2013).
  • 81. Wouters J et al. Comprehensive DNA methylation study identifies novel progression-related and prognostic markers for cutaneous melanoma. BMC Med 15, 101, doi: 10.1186/s12916-017-0851-3 (2017).
  • 82. Weber M et al. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nature genetics 39, 457–466, doi: 10.1038/ng1990 (2007).
  • 83. Jin SG, Xiong W, Wu X, Yang L & Pfeifer GP The DNA methylation landscape of human melanoma. Genomics 106, 322–330, doi: 10.1016/j.ygeno.2015.09.004 (2015).
  • 84. Chiappinelli KB et al. Inhibiting DNA Methylation Causes an Interferon Response in Cancer via dsRNA Including Endogenous Retroviruses. Cell 162, 974–986, doi: 10.1016/j.cell.2015.07.011 (2015).
  • 85. Herz HM, Hu D & Shilatifard A Enhancer malfunction in cancer. Molecular cell 53, 859–866, doi: 10.1016/j.molcel.2014.02.033 (2014).
  • 86. Calo E & Wysocka J Modification of enhancer chromatin: what, how, and why? Molecular cell 49, 825–837, doi: 10.1016/j.molcel.2013.01.038 (2013).
  • 87. Sur I & Taipale J The role of enhancers in cancer. Nature reviews. Cancer 16, 483–493, doi: 10.1038/nrc.2016.62 (2016).
  • 88. Hnisz D et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–947, doi: 10.1016/j.cell.2013.09.053 (2013).
  • 89. Bradner JE, Hnisz D & Young RA Transcriptional Addiction in Cancer. Cell 168, 629–643, doi: 10.1016/j.cell.2016.12.013 (2017).
  • 90. Hnisz D et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458 (2016).
  • 91. Schmitt AD et al. A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell reports 17, 2042–2059 (2016).
  • 92. Akdemir KC et al. Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nature Genetics 52, 1178–1188 (2020).
  • 93. Li J, Ching T, Huang S & Garmire LX in BMC bioinformatics. 1–12 (BioMed Central).
  • 94. Dixon JR et al. Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331–336 (2015).
  • 95. Pennacchio LA, Bickmore W, Dean A, Nobrega MA & Bejerano G Enhancers: five essential questions. Nature Reviews Genetics 14, 288–295 (2013).
  • 96. Sethi A et al. Supervised enhancer prediction with epigenetic pattern recognition and targeted validation. Nature methods 17, 807–814 (2020).
  • 97. Whalen S, Truty RM & Pollard KS Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature genetics 48, 488–496 (2016).
  • 98. Fu K, Bonora G & Pellegrini M Interactions between core histone marks and DNA methyltransferases predict DNA methylation patterns observed in human cells and tissues. Epigenetics 15, 272–282 (2020).
  • 99. Al Bkhetan Z & Plewczynski D Three-dimensional epigenome statistical model: genome-wide chromatin looping prediction. Scientific reports 8, 1–11 (2018).
  • 100. Ernst J & Kellis M Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nature biotechnology 33, 364–376 (2015).
  • 101. Durham TJ, Libbrecht MW, Howbert JJ, Bilmes J & Noble WS PREDICTD parallel epigenomics data imputation with cloud-based tensor decomposition. Nature communications 9, 1–15 (2018).
  • 102. Zhu Y et al. Constructing 3D interaction maps from 1D epigenomes. Nature communications 7, 1–11 (2016).
  • 103. Hore V et al. Tensor decomposition for multiple-tissue gene expression experiments. Nature genetics 48, 1094–1100 (2016).
  • 104. Ghahramani Z Unsupervised learning. Summer School on Machine Learning. 72–112 Springer; (2003).
  • 105. Wiwie C, Baumbach J & Röttger R Comparing the performance of biomedical clustering methods. Nature methods 12, 1033 (2015).
  • 106. Murtagh F & Contreras P Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 86–97 (2012).
  • 107. Lin I-H et al. Hierarchical clustering of breast cancer methylomes revealed differentially methylated and expressed breast cancer genes. PloS one 10, e0118453 (2015).
  • 108. Virmani AK et al. Hierarchical clustering of lung cancer cell lines using DNA methylation markers. Cancer Epidemiology and Prevention Biomarkers 11, 291–297 (2002).
  • 109. Lin SH et al. Genes suppressed by DNA methylation in non-small cell lung cancer reveal the epigenetics of epithelial–mesenchymal transition. BMC genomics 15, 1–15 (2014).
  • 110. Likas A, Vlassis N & Verbeek JJ The global k-means clustering algorithm. Pattern recognition 36, 451–461 (2003).
  • 111. Celebi ME, Kingravi HA & Vela PA A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert systems with applications 40, 200–210 (2013).
  • 112. Hinoue T et al. Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome research 22, 271–282, doi: 10.1101/gr.117523.110 (2012).
  • 113. Noushmehr H et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer cell 17, 510–522, doi: 10.1016/j.ccr.2010.03.017 (2010).
  • 114. Malta TM et al. Glioma CpG island methylator phenotype (G-CIMP): biological and clinical implications. Neuro Oncol 20, 608–620, doi: 10.1093/neuonc/nox183 (2018).
  • 115. Weisenberger DJ, Liang G & Lenz HJ DNA methylation aberrancies delineate clinically distinct subsets of colorectal cancer and provide novel targets for epigenetic therapies. Oncogene 37, 566–577, doi: 10.1038/onc.2017.374 (2018).
  • 116. Zhang W et al. Integrating genomic, epigenomic, and transcriptomic features reveals modular signatures underlying poor prognosis in ovarian cancer. Cell reports 4, 542–553 (2013).
  • 117. Liu Y et al. A novel Bayesian network inference algorithm for integrative analysis of heterogeneous deep sequencing data. Cell research 23, 440–443 (2013).
  • 118. Lee DD & Seung HS Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
  • 119. Xu W, Liu X & Gong Y Document clustering based on non-negative matrix factorization. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (2003).
  • 120. Brunet J-P, Tamayo P, Golub TR & Mesirov JP Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences 101, 4164–4169 (2004).
  • 121. Mishra NK & Guda C Genome-wide DNA methylation analysis reveals molecular subtypes of pancreatic cancer. Oncotarget 8, 28990 (2017).
  • 122. Orouji E et al. Chromatin State Dynamics Confers Specific Therapeutic Strategies in Enhancer Subtypes of Colorectal Cancer. bioRxiv (2020).
  • 123. Ernst J & Kellis M ChromHMM: automating chromatin-state discovery and characterization. Nature methods 9, 215–216 (2012).
  • 124. Kundaje A et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  • 125. Trapnell C et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature biotechnology 32, 381 (2014).
  • 126. Savage JE et al. Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence. Nature genetics 50, 912–919 (2018).
  • 127. Gil N & Ulitsky I Regulation of gene expression by cis-acting long non-coding RNAs. Nature Reviews Genetics 21, 102–117 (2020).
  • 128. Jansen PR et al. Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nature genetics 51, 394–403 (2019).
  • 129. Hoffman MM et al. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods 9, 473 (2012).
  • 130. Mammana A & Chung H-R Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome biology 16, 1–12 (2015).
  • 131. Cai Z et al. Classification of lung cancer using ensemble-based feature selection and machine learning methods. Molecular BioSystems 11, 791–800 (2015).
  • 132. Uzunangelov V, Wong CK & Stuart JM Accurate cancer phenotype prediction with AKLIMATE, a stacked kernel learner integrating multimodal genomic data and pathway knowledge. PLoS Computational Biology 17, e1008878 (2021).
  • 133. Jin W, Li Q-Z, Liu Y & Zuo Y-C Effect of the key histone modifications on the expression of genes related to breast cancer. Genomics 112, 853–858 (2020).
  • 134. Toth R et al. Random forest-based modelling to detect biomarkers for prostate cancer progression. Clinical epigenetics 11, 1–15 (2019).
  • 135. List M et al. Classification of breast cancer subtypes by combining gene expression and DNA methylation data. Journal of integrative bioinformatics 11, 1–14 (2014).
  • 136. Mäenpää T GENE EXPRESSION PREDICTION WITH MACHINE LEARNING. Information technology (2020).
  • 137. Malta TM et al. Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell 173, 338–354.e15 (2018).
  • 138. Sánchez-Vega F, Gotea V, Margolin G & Elnitski L Pan-cancer stratification of solid human epithelial tumors and cancer cell lines reveals commonalities and tissue-specific features of the CpG island methylator phenotype. Epigenetics & chromatin 8, 1–24 (2015).
  • 139. Xi Y et al. Histone modification profiling in breast cancer cell lines highlights commonalities and differences among subtypes. BMC genomics 19, 1–11 (2018).
  • 140. Taberlay PC, Statham AL, Kelly TK, Clark SJ & Jones PA Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer. Genome research 24, 1421–1432 (2014).
  • 141. Taberlay PC et al. Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome research 26, 719–731 (2016).
  • 142. Polano M et al. A New Epigenetic Model to Stratify Glioma Patients According to Their Immunosuppressive State. Cells 10, doi: 10.3390/cells10030576 (2021).
  • 143. Liu B et al. DNA Methylation Markers for Pan-Cancer Prediction by Deep Learning. Genes (Basel) 10, doi: 10.3390/genes10100778 (2019).
  • 144. Pan X et al. D-GPM: A Deep Learning Method for Gene Promoter Methylation Inference. Genes (Basel) 10, doi: 10.3390/genes10100807 (2019).
  • 145. Kim SG et al. Opening up the blackbox: an interpretable deep neural network-based classifier for cell-type specific enhancer predictions. BMC Syst Biol 10 Suppl 2, 54, doi: 10.1186/s12918-016-0302-3 (2016).
  • 146. Wang Y, Yang Y, Chen S & Wang J DeepDRK: a deep learning framework for drug repurposing through kernel-based multi-omics integration. Briefings in Bioinformatics, doi: 10.1093/bib/bbab048 (2021).
  • 147. Lin Y, Zhang W, Cao H, Li G & Du W Classifying Breast Cancer Subtypes Using Deep Neural Networks Based on Multi-Omics Data. Genes (Basel) 11, doi: 10.3390/genes11080888 (2020).
  • 148. Ashoor H et al. Graph embedding and unsupervised learning predict genomic sub-compartments from HiC chromatin interaction data. Nature Communications 11, 1173, doi: 10.1038/s41467-020-14974-x (2020).
  • 149. Zhang G, Xue Z, Yan C, Wang J & Luo H A Novel Biomarker Identification Approach for Gastric Cancer Using Gene Expression and DNA Methylation Dataset. Front. Genet 12, 644378, doi: 10.3389/fgene.2021.644378 (2021).
  • 150. Chaudhary K, Poirion OB, Lu L & Garmire LX Deep Learning–Based Multi-Omics Integration Robustly Predicts Survival in Liver Cancer. Clin Cancer Res 24, 1248–1259, doi: 10.1158/1078-0432.CCR-17-0853 (2018).
  • 151. Tong L, Mitchel J, Chatlin K & Wang MD Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis. BMC Med Inform Decis Mak 20, 225, doi: 10.1186/s12911-020-01225-8 (2020).
  • 152. Zuo C & Chen L Deep-joint-learning analysis model of single cell transcriptome and open chromatin accessibility data. Briefings in Bioinformatics, bbaa287, doi: 10.1093/bib/bbaa287 (2020).
  • 153. Tong L, Wu H & Wang MD Integrating multi-omics data by learning modality invariant representations for improved prediction of overall survival of cancer. Methods 189, 74–85, doi: 10.1016/j.ymeth.2020.07.008 (2021).
  • 154. Lee T-Y, Huang K-Y, Chuang C-H, Lee C-Y & Chang T-H Incorporating deep learning and multi-omics autoencoding for analysis of lung adenocarcinoma prognostication. Comput Biol Chem 87, 107277, doi: 10.1016/j.compbiolchem.2020.107277 (2020).
  • 155. Lv J, Wang J, Shang X, Liu F & Guo S Survival prediction in patients with colon adenocarcinoma via multi-omics data integration using a deep learning algorithm. Biosci Rep, doi: 10.1042/BSR20201482 (2020).
  • 156. Lemsara A, Ouadfel S & Fröhlich H PathME: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data. BMC Bioinformatics 21, 146, doi: 10.1186/s12859-020-3465-2 (2020).
  • 157. Seal DB, Das V, Goswami S & De RK Estimating gene expression from DNA methylation and copy number variation: A deep learning regression model for multi-omics integration. Genomics 112, 2833–2841, doi: 10.1016/j.ygeno.2020.03.021 (2020).
  • 158. Xu J et al. A hierarchical integration deep flexible neural forest framework for cancer subtype classification by integrating multi-omics data. BMC Bioinformatics 20, 527, doi: 10.1186/s12859-019-3116-7 (2019).
  • 159. Poirion OB, Chaudhary K & Garmire LX Deep Learning data integration for better risk stratification models of bladder cancer. AMIA Jt Summits Transl Sci Proc 2017, 197–206 (2018).
  • 160. Wang Z & Wang Y Extracting a biologically latent space of lung cancer epigenetics with variational autoencoders. BMC Bioinformatics 20, 568, doi: 10.1186/s12859-019-3130-9 (2019).
  • 161. Hu R, Pei G, Jia P & Zhao Z Decoding regulatory structures and features from epigenomics profiles: A Roadmap-ENCODE Variational Auto-Encoder (RE-VAE) model. Methods 189, 44–53, doi: 10.1016/j.ymeth.2019.10.012 (2021).
  • 162. Tian Q et al. MRCNN: a deep learning model for regression of genome-wide DNA methylation. BMC Genomics 20, 192, doi: 10.1186/s12864-019-5488-5 (2019).
  • 163. Williams J et al. MethylationToActivity: a deep-learning framework that reveals promoter activity landscapes from DNA methylomes in individual tumors. Genome Biology 22, 24, doi: 10.1186/s13059-020-02220-y (2021).
  • 164. Singh R, Lanchantin J, Robins G & Qi Y DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32, i639–i648, doi: 10.1093/bioinformatics/btw427 (2016).
  • 165. Zeng W, Wang Y & Jiang R Integrating distal and proximal information to predict gene expression via a densely connected convolutional neural network. Bioinformatics, btz562, doi: 10.1093/bioinformatics/btz562 (2019).
  • 166. Jaroszewicz A & Ernst J An integrative approach for fine-mapping chromatin interactions. Bioinformatics (Oxford, England) 36, 1704–1711, doi: 10.1093/bioinformatics/btz843 (2020).
  • 167. Bichindaritz I, Liu G & Bartlett C Integrative Survival Analysis of Breast Cancer with Gene Expression and DNA Methylation Data. Bioinformatics (Oxford, England), doi: 10.1093/bioinformatics/btab140 (2021).
  • 168. Singh R, Lanchantin J, Sekhon A & Qi Y Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin. Adv Neural Inf Process Syst 30, 6785–6795 (2017).
  • 169. Sekhon A, Singh R & Qi Y DeepDiff: DEEP-learning for predicting DIFFerential gene expression from histone modifications. Bioinformatics 34, i891–i900, doi: 10.1093/bioinformatics/bty612 (2018).
  • 170.Angermueller C, Lee HJ, Reik W & Stegle O DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biology 18, 67, doi: 10.1186/s13059-017-1189-z (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 171.Yang H, Wei Q, Li D & Wang Z Cancer classification based on chromatin accessibility profiles with deep adversarial learning model. PLoS Comput Biol 16, e1008405, doi: 10.1371/journal.pcbi.1008405 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 172.Zhang Y et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9, R137, doi: 10.1186/gb-2008-9-9-r137 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 173.Gu J et al. Recent advances in convolutional neural networks. Pattern Recognition 77, 354–377, doi: 10.1016/j.patcog.2017.10.013 (2018). [DOI] [Google Scholar]
  • 174.Agarwal V & Shendure J Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Reports 31, 107663, doi: 10.1016/j.celrep.2020.107663 (2020). [DOI] [PubMed] [Google Scholar]
  • 175.Rumelhart DE & McClelland JL Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations. A Bradford Book; (1987) [Google Scholar]
  • 176.Lipton ZC, Berkowitz J & Elkan C A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv:1506.00019 [cs] (2015). [Google Scholar]
  • 177.Ruder S An Overview of Multi-Task Learning in Deep Neural Networks. arXiv:1706.05098 [cs, stat] (2017). [Google Scholar]
  • 178.Chopra S, Hadsell R & LeCun Y Learning a similarity metric discriminatively, with application to face verification. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) (2005). [Google Scholar]
  • 179.Bromley J, Guyon I, LeCun Y, Säckinger E & Shah R Signature Verification using a “Siamese” Time Delay Neural Network. Advances in neural information processing systems (1993) [Google Scholar]
  • 180.Chicco D Siamese neural networks: An overview. Artificial Neural Networks Vol. 2190 (2021). [DOI] [PubMed] [Google Scholar]
  • 181.Huang G, Liu Z, van der Maaten L & Weinberger KQ Densely Connected Convolutional Networks. arXiv:1608.06993 [cs] (2018). [Google Scholar]
  • 182.Shlens J A Tutorial on Principal Component Analysis. arXiv preprint arXiv:1404.1100 (2014). [Google Scholar]
  • 183.Tolstikhin I, Bousquet O, Gelly S & Schoelkopf B Wasserstein Auto-Encoders. arXiv preprint arXiv:1711.01558 (2017). [Google Scholar]
  • 184.Kingma DP & Welling M Auto-Encoding Variational Bayes. (2013).
  • 185.Lopez R, Regier J, Cole MB, Jordan MI & Yosef N Deep generative modeling for single-cell transcriptomics. Nature Methods 15, 1053–1058, doi: 10.1038/s41592-018-0229-2 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 186.Xiong L et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nature Communications 10, 4576, doi: 10.1038/s41467-019-12630-7 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 187.Wu M & Goodman N Multimodal Generative Models for Scalable Weakly-Supervised Learning. arXiv:1802.05335 [cs, stat] (2018). [Google Scholar]
  • 188.Zappia L, Phipson B & Oshlack A Splatter: simulation of single-cell RNA sequencing data. Genome Biology 18, 174, doi: 10.1186/s13059-017-1305-0 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 189.Holder LB, Haque MM & Skinner MK Machine learning for epigenetics and future medical applications. Epigenetics 12, 505–514 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 190.Singh AP, Mishra S & Jabin S Sequence based prediction of enhancer regions from DNA random walk. Scientific reports 8, 1–12 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 191.Deng L et al. PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine. BMC bioinformatics 19, 135–145 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 192.Lin W & Xu D Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics 32, 3745–3752 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 193.Kelchtermans P et al. Machine learning applications in proteomics research: how the past can boost the future. Proteomics 14, 353–366 (2014). [DOI] [PubMed] [Google Scholar]
  • 194.Sorensen D & Gianola D Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer Science & Business Media (2007). [Google Scholar]
  • 195.Arslan E & Braga-Neto UM A bayesian approach to top-scoring pairs classification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 871–875 (IEEE) (2017). [Google Scholar]
  • 196.Arslan E A Novel Bayesian Rank-Based Framework for the Classification of High-Dimensional Biological Data, (2018).
  • 197.Knight JM, Ivanov I & Dougherty ER MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-Seq classification. BMC bioinformatics 15, 1–13 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 198.Osabe T, Shimizu K & Kadota K Accurate Classification of differential expression patterns in a bayesian framework with robust normalization for multi-group RNA-Seq count data. Bioinformatics and biology insights 13, 1177932219860817 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 199.Sun Z et al. A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies. Nature communications 10, 1–10 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 200.Klein H-U, Schäfer M, Bennett DA, Schwender H & De Jager PL Bayesian integrative analysis of epigenomic and transcriptomic data identifies Alzheimer’s disease candidate genes and networks. PLoS computational biology 16, e1007771 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 201.Banos DT et al. Bayesian reassessment of the epigenetic architecture of complex traits. Nature communications 11, 1–14 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 202.Kuhn M & Johnson K Applied predictive modeling. Vol. 26 Springer; (2013). [Google Scholar]
  • 203.James G, Witten D, Hastie T & Tibshirani R An introduction to statistical learning. Vol. 112 Springer; (2013). [Google Scholar]
  • 204.Mallik S & Zhao Z Detecting methylation signatures in neurodegenerative disease by density-based clustering of applications with reducing noise. Scientific reports 10, 1–14 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 205.Ma X, Liu Z, Zhang Z, Huang X & Tang W Multiple network algorithm for epigenetic modules via the integration of genome-wide DNA methylation and gene expression data. BMC bioinformatics 18, 1–13 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 206.Nishino K et al. Identification of an epigenetic signature in human induced pluripotent stem cells using a linear machine learning model. Human cell 34, 99–110 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 207.Alag A Machine learning approach yields epigenetic biomarkers of food allergy: A novel 13-gene signature to diagnose clinical reactivity. PloS one 14, e0218253 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 208.Dogan MV, Grumbach IM, Michaelson JJ & Philibert RA Integrated genetic and epigenetic prediction of coronary heart disease in the Framingham Heart Study. PloS one 13, e0190549 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 209.Zhang R, Wang Y, Yang Y, Zhang Y & Ma J Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 34, i133–i141 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 210.Su W-X et al. Gene expression classification using epigenetic features and DNA sequence composition in the human embryonic stem cell line H1. Gene 592, 227–234 (2016). [DOI] [PubMed] [Google Scholar]
  • 211.Bates S, Hastie T & Tibshirani R Cross-validation: what does it estimate and how well does it do it? arXiv preprint arXiv:2104.00673 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 212.Chakraborty Supriyo, et al. Interpretability of deep learning models: a survey of results. IEEE smartworld, ubiquitous intelligence & computing, advanced & trusted computed, scalable computing & communications, cloud & big data computing, Internet of people and smart city innovation (2017) [Google Scholar]
  • 213.Baker M 1,500 scientists lift the lid on reproducibility. Nature News 533, 452 (2016). [DOI] [PubMed] [Google Scholar]
  • 214.Kulkarni N et al. Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines. BMC bioinformatics 19, 5–13 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 215.Bennett J & Lanning S The netflix prize. Proceedings of KDD cup and workshop. 35 (2007). [Google Scholar]
  • 216.Chen T & Guestrin C Xgboost: A scalable tree boosting system. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794 (2016). [Google Scholar]
  • 217.Tomita TM et al. Sparse Projection Oblique Randomer Forests. arXiv preprint arXiv:1506.03410 (2015). [Google Scholar]
  • 218.Cazaly E et al. Making sense of the epigenome using data integration approaches. Frontiers in pharmacology 10, 126 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 219.Libbrecht MW & Noble WS Machine learning applications in genetics and genomics. Nature Reviews Genetics 16, 321–332 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 220.Chen T & Tyagi S Integrative computational epigenomics to build data-driven gene regulation hypotheses. GigaScience 9, giaa064 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 221.Ching T et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface 15, 20170387 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 222.Rahu M & McKee M Epidemiological research labelled as a violation of privacy: the case of Estonia. International journal of epidemiology 37, 678–682 (2008). [DOI] [PubMed] [Google Scholar]
  • 223.Gopalakrishnan K, Khaitan SK, Choudhary A & Agrawal A Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construction and Building Materials 157, 322–330 (2017). [Google Scholar]
  • 224.Cao X, Wipf D, Wen F, Duan G & Sun J A practical transfer learning algorithm for face verification. Proceedings of the IEEE international conference on computer vision. 3208–3215. (2013) [Google Scholar]
  • 225.Dawei W et al. Recognition pest by image-based transfer learning. Journal of the Science of Food and Agriculture 99, 4524–4531 (2019). [DOI] [PubMed] [Google Scholar]
  • 226.Howard J & Ruder S Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018). [Google Scholar]
  • 227.Radford A, Narasimhan K, Salimans T & Sutskever I Improving language understanding by generative pre-training. arXiv preprint arXiv:1902.00993 (2018). [Google Scholar]
  • 228.Devlin J, Chang M-W, Lee K & Toutanova K Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). [Google Scholar]
  • 229.López-García G, Jerez JM, Franco L & Veredas FJ Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data. PloS one 15, e0230536 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 230.Wang T et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome biology 20, 1–15 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 231.Schwessinger R et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nature Methods 17, 1118–1124 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 232.Arslan E & Rai K Transfer learning for gene expression prediction with deep neural networks. (AACR, 2020). [Google Scholar]
