Abstract
The large-scale application of the mammalian methylation array has substantially expanded the availability of DNA methylation data in mammalian species. However, this data captures only a small portion of species-tissue combinations. To address this, we develop CMImpute (Cross-species Methylation Imputation), a method based on a conditional variational autoencoder, to impute DNA methylation representing species-tissue combinations. We demonstrate that CMImpute achieves strong sample-wise correlation between imputed and observed values. Using CMImpute and data from 348 species and 59 tissue types, we impute methylation data for 19,786 new species-tissue combinations. We expect CMImpute will be a useful resource for DNA methylation analyses.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03561-2.
Keywords: DNA methylation, Machine learning, Epigenetics, Imputation
Background
DNA methylation is an epigenetic mark in which a methyl group is added to a cytosine. It is associated with gene regulation and disease [1–3] and is a biomarker for individual characteristics such as age [4, 5]. There is thus extensive interest in profiling DNA methylation in humans [6, 7] as well as other species [5, 8–11]. In addition to studying DNA methylation profiles in individual species, insights have been gained from comparative epigenomic analyses across species as epigenetic information from one species will likely be informative to another species [12–18]. Along with varying at the species level, DNA methylation levels typically vary significantly across different tissue types and thus associate with cell and tissue identity.
Various methods exist for profiling DNA methylation data in biological samples, including microarrays [19–21] and sequencing-based assays such as whole genome bisulfite sequencing (WGBS) [22] and reduced representation bisulfite sequencing (RRBS) [23]. Microarrays typically profile fewer cytosines than sequencing-based assays but allow for easier and more robust data collection and thus remain a popular approach to profile DNA methylation [24]. Historically, however, microarray-based profiling of DNA methylation in species other than human and, until recently, mouse was limited by the lack of applicable microarrays [19, 25]. This recently changed with the development of the mammalian methylation array, which has array probes that allow the measurement of DNA methylation across mammalian species at a set of 36k CpGs that are well conserved across mammals [14]. This array has been used by the Mammalian Methylation Consortium to profile DNA methylation samples in at least one tissue type for over 300 mammalian species, collectively covering over 50 different tissue types [12, 13]. However, the biological samples were gathered opportunistically, and thus the collected data has an incomplete and imbalanced tissue type representation across species. For certain species, such as horse and human, data from many tissue types were collected. For many other species, however, data from only one or two tissue types were collected. As a result, experimental data is available for only a small percentage of the potential species-tissue combinations. The incomplete and imbalanced coverage of the experimental data thus motivates the need for computational approaches to accurately impute a DNA methylation sample representing a species and tissue type combination for which there is no experimental data available.
Current methods have shown that large-scale imputation of epigenetic datasets including DNA methylation [26–34] can be effective in certain contexts. For instance, some methods can impute missing or low-coverage CpG sites within existing samples but are unable to impute a whole missing methylation sample in an unprofiled species and tissue type [26–30]. Other methods can impute whole datasets for an assay when there is a different epigenetic assay conducted in the same sample, primarily in the context of a single species [31, 33, 34] or in one case in human and mouse [35]. However, data from multiple epigenetic assays is not available for the vast majority of samples and species profiled by the mammalian methylation array. Furthermore, existing methods do not leverage large compendia of cross-species DNA methylation data that have emerged [14, 16], and in particular coverage of a common set of conserved CpGs profiled by the mammalian methylation array [14]. Overall, in cases in which there is no data available in a given tissue type for a target species, such as in a less common species or tissue types that are difficult to access, methods that only consider data from a single species and thus do not harness cross-species compendia would not be able to make predictions for that tissue type.
To harness compendia of newly available cross-species methylation data to impute methylation values of shared CpGs across species for missing species and tissue combinations, we developed CMImpute (Cross-species Methylation Imputation). CMImpute specifically imputes samples representing a species’ mean methylation within a specific tissue type, henceforth referred to as a species-tissue combination mean sample or for short combination mean sample. Given the association of DNA methylation with species-level characteristics and cell and tissue identity, these types of combination mean samples have proven useful in cross-species epigenetic analyses [12, 13, 18]. CMImpute takes as input exclusively methylation data with corresponding species and tissue labels to output combination mean samples. To perform species-tissue combination mean imputation, CMImpute uses a neural network architecture called a conditional variational autoencoder (CVAE), an extension of the variational autoencoder (VAE). VAEs and CVAEs have been used in various bioinformatics applications [36, 37] including in the context of DNA methylation [30, 32]. Previous applications in the context of DNA methylation include imputing missing CpG values within an existing sample [30] and generating additional human cancer DNA methylation samples solely for data augmentation purposes when data for that cancer type is already experimentally available for some individuals [32]. However, none of these existing VAE and CVAE-based approaches have been designed for or applied in the context of cross-species DNA methylation imputation and thus do not impute methylation in an unprofiled species and tissue type.
We demonstrate that CMImpute is able to accurately impute combination mean samples of missing species-tissue combinations through a cross-validation analysis of mammalian methylation array data. We show that imputed samples strongly correlate with observed species-tissue combination mean samples for held out combinations, when considering both samples across all probes and probes across all samples. In addition, we train CMImpute using all available observed samples from 746 species-tissue combinations to impute 19,786 mean samples representing the remaining 96.4% of combinations of the 348 species and 59 tissue types that had not been previously experimentally profiled. We demonstrate that the imputed samples, both from the cross-validation analysis and the full imputation, maintain inter-combination mean sample correlation patterns related to species and tissue types that are present in observed combination mean samples. Furthermore, we show how the imputed combination mean samples can be used to study the relationship between DNA methylation and maximum lifespan. These combination mean samples imputed by CMImpute vastly expand the coverage of species-tissue combination mean samples providing a resource for cross-species epigenetic studies or studies within a species lacking coverage of tissue types of interest.
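The coverage figures quoted above follow from simple counting. The short check below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts stated in the text.
n_species = 348
n_tissues = 59
n_observed = 746  # species-tissue combinations with observed samples

n_total = n_species * n_tissues    # every possible species-tissue combination
n_imputed = n_total - n_observed   # combinations without observed data
pct_imputed = 100 * n_imputed / n_total

print(n_total, n_imputed, round(pct_imputed, 1))
```

The 746 observed combinations therefore cover roughly 3.6% of the full species-by-tissue grid.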
Results
Overview of CMImpute
CMImpute takes as input individual methylation samples, spanning a common set of CpGs, and the corresponding species and tissue label for each sample. We note there can be multiple training samples representing the same species and tissue combination since samples from more than one individual are collected for most species-tissue combinations that have observed data available. CMImpute outputs imputed species-tissue combination mean samples for combinations with no observed samples available but where other tissues were profiled in the target-species and other species were profiled in the target-tissue (Fig. 1a,b, Additional file 1). To capture inter- and intra-species tissue signals for imputation, CMImpute trains a neural network using the input methylation samples and species and tissue labels (Fig. 1c). Using the trained neural network, CMImpute then imputes the methylation level for each CpG in missing species-tissue combinations.
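The condition on which combinations can be imputed can be made concrete with a small sketch. The species and tissue names below are hypothetical toy labels, and `imputable_targets` is an illustrative helper, not part of the CMImpute software.

```python
from itertools import product

# Hypothetical observed (species, tissue) combinations.
observed = {
    ("horse", "blood"), ("horse", "liver"),
    ("human", "blood"), ("human", "brain"),
    ("mouse", "liver"),
}

def imputable_targets(observed, species, tissues):
    """Missing combinations whose species and tissue both appear in the observed data.

    Because a target combination is itself unobserved, any observed sample of its
    species comes from another tissue, and any observed sample of its tissue
    comes from another species.
    """
    seen_species = {s for s, _ in observed}
    seen_tissues = {t for _, t in observed}
    return [
        (s, t)
        for s, t in product(species, tissues)
        if (s, t) not in observed and s in seen_species and t in seen_tissues
    ]

# "zebra" has no observed samples at all, so none of its tissues are imputable here.
targets = imputable_targets(
    observed,
    species=["horse", "human", "mouse", "zebra"],
    tissues=["blood", "brain", "liver"],
)
print(targets)
```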
Fig. 1.
Data and method overview. a Grid of all species-tissue combinations colored by what type of data is now available for each combination (observed, imputed, or both). Observed combinations were observed in at least one individual and have no imputed data available. Combinations with both observed and imputed data available (Observed + Imputed) were both observed in at least one individual and had predictions generated for it in cross-validation. Imputed combinations represent all combinations without observed data and were included in the final imputed data compendium only. Species are sorted top-to-bottom by the number of available tissues. Tissues are sorted left-to-right by the number of available species. Species and tissues available in the same number of tissues or species, respectively, listed in alphabetical order. Number of individual samples in each species and tissue type (listed in same order as figure) available in Additional file 1. Subset of species and tissues outlined in the dotted red line highlighted for use in b. b Example displaying the different categories of data used during training by CMImpute. In this example, there are no observed samples for certain horse tissues (target combinations). Samples from three categories of training data are used as input for CMImpute: target species data from non-target tissues (same species-different tissue), data from other species in the target tissues (same tissue-different species), and data from overlapping tissues between the target species and other species if available (overlapping species). For this example, the target species is horse and the target tissues are brain, ear, tail, fetus, and lymph node. c Method overview illustrating the neural network architecture used for Training and Imputation. CMImpute’s CVAE framework takes as input a matrix of individual observed samples with corresponding species and tissue labels. 
During Training, the CVAE learns methylation patterns from the three categories of training data. Once trained, Imputation can occur. The CVAE uses the learned parameters to impute species-tissue combination mean samples of the missing target species-tissue combinations. In the example illustrated, CMImpute imputes the missing horse tissues. For visualization purposes, X, y, X' , and Ximpute are shown transposed (Methods)
The specific neural network CMImpute uses to perform the imputation is a CVAE, an extension of the VAE. A VAE is a self-supervised neural network architecture trained to reconstruct the original input and regularized to maintain a probabilistic latent space [38]. This regularization enables VAEs to both encode an input sample into and to generate a new sample from its latent space. However, VAEs do not have control over the types of data generated. CVAEs extend the VAE framework by adding labels corresponding to each input sample [39]. These labels provide additional information about each sample during training and allow for control over the generated samples. We specifically condition the CVAE on the species and tissue labels to generate methylation samples representing previously unseen species-tissue combinations.
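To illustrate the conditioning idea, the following is a minimal NumPy sketch of a single CVAE forward pass, its loss, and label-conditioned generation. It is not the CMImpute architecture: the layer sizes, random untrained weights, squared-error reconstruction term, and the choice to decode the prior mean (z = 0) for a target label are all assumptions made for illustration, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
n_samples, n_probes = 8, 20
n_species, n_tissues = 3, 2
hidden_dim, latent_dim = 16, 4

# Toy methylation matrix in [0, 1] and one-hot species/tissue condition labels.
X = rng.uniform(0.0, 1.0, size=(n_samples, n_probes))
species_idx = rng.integers(0, n_species, size=n_samples)
tissue_idx = rng.integers(0, n_tissues, size=n_samples)
y = np.concatenate([np.eye(n_species)[species_idx], np.eye(n_tissues)[tissue_idx]], axis=1)
label_dim = y.shape[1]

def init(n_in, n_out):
    return rng.normal(0.0, 0.1, size=(n_in, n_out))

# Randomly initialized (untrained) weights.
params = {
    "enc": init(n_probes + label_dim, hidden_dim),
    "mu": init(hidden_dim, latent_dim),
    "logvar": init(hidden_dim, latent_dim),
    "dec1": init(latent_dim + label_dim, hidden_dim),
    "dec2": init(hidden_dim, n_probes),
}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(X, y):
    # The encoder sees the sample together with its labels.
    h = np.tanh(np.concatenate([X, y], axis=1) @ params["enc"])
    return h @ params["mu"], h @ params["logvar"]

def decode(z, y):
    # The decoder sees the latent code together with the labels.
    h = np.tanh(np.concatenate([z, y], axis=1) @ params["dec1"])
    return sigmoid(h @ params["dec2"])  # outputs lie in (0, 1)

def cvae_loss(X, y):
    mu, logvar = encode(X, y)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)  # reparameterization
    X_rec = decode(z, y)
    recon = np.mean(np.sum((X - X_rec) ** 2, axis=1))
    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon + kl

loss = cvae_loss(X, y)

# Generation for a chosen label: species 0 combined with tissue 1.
target = np.zeros((1, label_dim))
target[0, 0] = 1.0
target[0, n_species + 1] = 1.0
X_impute = decode(np.zeros((1, latent_dim)), target)
print(float(loss), X_impute.shape)
```

The key point is that the label vector enters both the encoder and the decoder, so after training the decoder can be queried with a species-tissue label that never co-occurred in the training samples.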
CMImpute predictions qualitatively agree with observed data
To assess CMImpute’s imputation performance, we first applied it to a subset of the species and tissues for which there was mammalian methylation array data available. Specifically, we applied CMImpute in fivefold cross-validation to impute data for 465 combination mean samples for which we also have observed data available. These 465 combination mean samples correspond to 134 species with data from more than one tissue type available and 23 tissues with data from more than one species available. We compared CMImpute’s performance to the performance of four baseline methods (Methods). One baseline was logistic regression, where for each probe we applied logistic regression with the species and tissue labels as the features. Another baseline was a global baseline which was the mean of all training samples. The other two baselines were species and tissue baselines, which were based on the mean of training samples within the same species or the same tissue, respectively.
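The three mean-based baselines can be sketched as follows. The toy values and the helper name `mean_baselines` are hypothetical, and the logistic regression baseline is omitted.

```python
import numpy as np

# Hypothetical toy training data: rows are individual samples, columns are probes.
X_train = np.array([
    [0.10, 0.80],   # horse, blood
    [0.20, 0.70],   # horse, liver
    [0.30, 0.90],   # human, blood
    [0.50, 0.50],   # mouse, brain
])
species = np.array(["horse", "horse", "human", "mouse"])
tissue = np.array(["blood", "liver", "blood", "brain"])

def mean_baselines(X, species, tissue, target_species, target_tissue):
    """Baseline predictions for one held-out species-tissue combination."""
    return {
        "global": X.mean(axis=0),                              # all training samples
        "species": X[species == target_species].mean(axis=0),  # same species, any tissue
        "tissue": X[tissue == target_tissue].mean(axis=0),     # same tissue, any species
    }

# Target combination with no training samples: horse brain.
pred = mean_baselines(X_train, species, tissue, "horse", "brain")
print(pred)
```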
We first qualitatively evaluated CMImpute’s predictions by generating heatmaps that show the methylation values for each combination mean sample and probe after applying hierarchical clustering with optimal leaf ordering [40]. We did this both for all probes (Fig. 2a–c, Additional file 2: Fig. S1a) and a subset of 11,749 probes that are mappable to a unique genomic location in most mammalian species, referred to as highest-coverage probes (Methods) (Additional file 2: Fig. S1b). The samples mainly clustered by phylogenetic order with tissue clustering primarily occurring within the orders. We compared these heatmaps to corresponding heatmaps based on observed data. The CMImpute and species baseline-imputed heatmaps appeared similar to the observed methylation patterns at the inter-species level. However, when the species contribution was removed from the observed and imputed datasets by subtracting the average of all same-species training samples for visualization purposes (Fig. 2d–f, Additional file 2: Fig. S1c), we observed differentially methylated regions in the observed and CMImpute combination mean samples but not the species baseline-imputed samples. This lack of tissue signal in the species baseline was expected as it was defined as the average of all available same-species samples. The logistic regression and tissue baseline, while appearing to capture tissue-specific methylation patterns, did not appear to effectively capture the observed species-specific methylation patterns (Additional file 2: Fig. S1a,c).
Fig. 2.
Visualization of imputed species-tissue combination mean samples relative to held-out observed values. a–c Heatmaps of methylation probe values for the a observed data held-out during cross-validation and b CMImpute’s and c the species baseline’s predictions of the held-out data (additional baselines are shown in Additional file 2: Fig. S1a). Each row is a species-tissue combination mean sample and each column is a methylation probe. Samples and probes were ordered based on hierarchical clustering followed by optimal leaf ordering. Color bars on the left indicate the phylogenetic order (inner) and tissue (outer) corresponding to the samples. Legends corresponding to the color bars are above the heatmaps. Color scale representing methylation values from 0 to 1 on the right. d–f Heatmaps of the d observed, e CMImpute, and f species baseline datasets with the species signal removed to highlight the differentially methylated tissue regions (observed tissue AUC score of 0.850, CMImpute tissue AUC score of 0.874). Species signal was removed by subtracting the average methylation values of same-species training samples from the full methylation values displayed in a. Color scale representing methylation delta values from -1 to 1 on the right
Analysis of combination mean sample-wise imputation performance
We next quantitatively evaluated the imputation performance of CMImpute predictions generated in fivefold cross-validation relative to the baselines. For this, we evaluated the agreement of CMImpute and baseline-imputed species-tissue combination mean samples with the corresponding held-out combination mean samples using the average Pearson correlation and mean squared error (MSE). On average when considering all probes, CMImpute combination mean samples had a 0.920 correlation, compared to 0.906 for the species baseline, 0.886 for logistic regression, 0.778 for the tissue baseline, and 0.803 for the global baseline (Fig. 3a). To further put the agreement of CMImpute’s predictions with held-out data in context, we also computed average pairwise correlations between samples of the same species and tissue combination, which was 0.981 (Fig. 3a). This suggests that there still exists some reproducible biological signal not captured by CMImpute’s predictions. When considering the subset of highest-coverage probes, CMImpute’s performance increased to 0.932 and continued to be greater than the species baseline’s performance of 0.897, logistic regression’s performance of 0.923, tissue baseline’s performance of 0.880, and global baseline’s performance of 0.877 (Additional file 2: Fig. S2a). Similar performance trends were also seen using MSE as the evaluation metric (Additional file 2: Fig. S2b-c).
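The two sample-wise metrics can be computed as in the sketch below, which assumes observed and imputed combination mean samples stored as rows of two matrices with probes as columns. The toy values are hypothetical.

```python
import numpy as np

def samplewise_metrics(observed, imputed):
    """Pearson correlation and MSE for each row (combination mean sample)."""
    obs_c = observed - observed.mean(axis=1, keepdims=True)
    imp_c = imputed - imputed.mean(axis=1, keepdims=True)
    r = (obs_c * imp_c).sum(axis=1) / np.sqrt(
        (obs_c**2).sum(axis=1) * (imp_c**2).sum(axis=1)
    )
    mse = ((observed - imputed) ** 2).mean(axis=1)
    return r, mse

observed = np.array([[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]])
imputed = np.array([[0.2, 0.5, 0.8], [0.8, 0.4, 0.2]])
r, mse = samplewise_metrics(observed, imputed)

# The reported summary statistic is the average over combination mean samples.
print(r.mean(), mse.mean())
```

The first toy sample has a perfect correlation despite a nonzero MSE, which illustrates why both metrics are reported.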
Fig. 3.

Sample-wise performance of imputed species-tissue combination mean samples. a Sample-wise Pearson correlation of imputed species-tissue combination mean samples with held-out observed values when considering all methylation probes. Individual to individual variability represents the average pairwise correlation of observed data between individuals of the same species and tissue type for each combination. Baselines and individual to individual variability labeled by Wilcoxon signed-rank test p-value comparing CMImpute’s sample-wise Pearson correlation to the individual to individual variability and each baseline’s sample-wise Pearson correlation for each imputed species-tissue combination ([CMImpute, Individual to Individual Variability], [CMImpute, Species Baseline], [CMImpute, Logistic Regression], [CMImpute, Tissue Baseline], [CMImpute, Global Baseline]). b Comparison of CMImpute and baseline imputation performance measured via sample-wise Pearson correlation with held-out observed data across all probes. The y-axis is CMImpute’s performance on each imputed combination. The x-axis is the species baseline’s (top left), logistic regression’s (top right), global baseline (bottom left), or tissue baseline’s (bottom right) performance on each imputed combination. Each dot is a single imputed species-tissue combination mean sample. The black diagonal line represents equal performance between CMImpute and the baseline. If a point is above the diagonal, CMImpute outperforms the baseline on the corresponding imputed combination mean sample and vice versa. Values in the upper left corners are the fractions of samples where CMImpute outperforms the corresponding baseline
We additionally investigated CMImpute’s sample-wise performance across different phylogenetic orders considering all probes (Additional file 2: Fig. S2d) and the subset of highest-coverage probes (Additional file 2: Fig. S2e). Overall, CMImpute yielded high sample-wise Pearson correlations across all but one phylogenetic order, with the mean correlation remaining between 0.877 and 0.935. The only outlier was the order Monotremata, which had a mean correlation of 0.806 but contained only 13 samples and encompasses two egg-laying mammals. CMImpute outperformed the species baseline across orders that collectively represent the majority of species considered and outperformed all other baselines across all phylogenetic orders (Additional file 2: Fig. S2d-e, Additional file 1). These results demonstrate that CMImpute is able to accurately impute combination mean samples across a wide range of mammalian species.
In addition to having higher mean correlation with the observed data than baselines, CMImpute also had higher correlation for a large majority of individual species-tissue combination mean samples (Fig. 3b, Additional files 3,4). Specifically, based on sample-wise Pearson correlation across all probes, CMImpute outperformed the species baseline in 68% of species-tissue combination mean samples, logistic regression in 78%, the tissue baseline in 98%, and the global baseline in 97%. Combination mean samples where the species baseline had higher correlation than CMImpute were for combinations for which there was overall a relatively low number of individual samples from the target species or target tissue type represented in the training data (Additional file 2: Fig. S3a-b). Combination mean samples where logistic regression had higher correlation than CMImpute also included combinations for which there was a relatively lower number of individual samples from the same tissue (Additional file 2: Fig. S3a-b). CMImpute additionally outperformed the baselines for the large majority of samples when restricting to the subset of highest-coverage probes and when considering MSE instead of correlation (Additional file 5: Table S1). These results demonstrate that CMImpute is able to impute species-tissue combination mean samples for held-out combinations with greater accuracy than the baselines for a large majority of species-tissue combinations.
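The per-combination comparison amounts to counting how often one method beats another on the same held-out combination. A minimal sketch follows; the values are hypothetical rather than taken from the paper, and `fraction_outperformed` is an illustrative helper.

```python
import numpy as np

def fraction_outperformed(method, baseline, higher_is_better=True):
    """Fraction of paired entries on which `method` beats `baseline`."""
    method, baseline = np.asarray(method), np.asarray(baseline)
    wins = method > baseline if higher_is_better else method < baseline
    return float(np.mean(wins))

# Hypothetical sample-wise correlations for four held-out combinations.
cmimpute_r = [0.95, 0.91, 0.88, 0.93]
baseline_r = [0.90, 0.92, 0.85, 0.89]
frac_r = fraction_outperformed(cmimpute_r, baseline_r)

# For MSE, lower values are better.
frac_mse = fraction_outperformed([0.01, 0.02], [0.02, 0.01], higher_is_better=False)
print(frac_r, frac_mse)
```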
Analysis of probe-wise imputation performance
We also quantified probe-wise performance, which was based on the agreement of observed and imputed probe values across samples. This contrasts with the sample-wise performance, which was based on the agreement of observed and imputed probe values within the same combination mean sample. For this, we again conducted evaluations in fivefold cross-validation using both the Pearson correlation coefficient and MSE. Our primary evaluation was for the subset of highest-coverage probes since for these probes, methylation values across samples would less likely be driven by differences in mappability across species. However, we additionally report evaluation results when considering all probes.
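The distinction between the two views is the axis along which the correlation is taken. The sketch below computes probe-wise Pearson correlation down the columns of the same kind of matrices, samples as rows and probes as columns. Returning NaN for constant probes is an assumption made here, not a documented choice of the authors.

```python
import numpy as np

def probewise_pearson(observed, imputed):
    """Correlate each probe (column) across combination mean samples (rows)."""
    obs_c = observed - observed.mean(axis=0)
    imp_c = imputed - imputed.mean(axis=0)
    num = (obs_c * imp_c).sum(axis=0)
    den = np.sqrt((obs_c**2).sum(axis=0) * (imp_c**2).sum(axis=0))
    # Probes with zero variance have an undefined correlation; report NaN for them.
    return np.divide(num, den, out=np.full(num.shape, np.nan), where=den > 0)

# Toy data: the second probe is constant in the observed samples.
observed = np.array([[0.1, 0.5], [0.2, 0.5], [0.3, 0.5]])
imputed = np.array([[0.2, 0.4], [0.3, 0.5], [0.4, 0.6]])
r = probewise_pearson(observed, imputed)
print(r)
```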
For the subset of highest-coverage probes, CMImpute had a mean probe-wise correlation of 0.623, significantly outperforming the species, logistic regression, tissue, and global baselines, which had mean correlations of 0.518, 0.494, 0.217, and 0.002, respectively (Fig. 4a). When considering all probes, CMImpute’s mean probe-wise correlation of 0.688 was also higher than those of the species, logistic regression, tissue, and global baselines, which were 0.650, 0.545, 0.142, and 0.004, respectively (Fig. 4b). We note that when considering the median as opposed to the mean correlation and all probes, the species baseline did have a higher median correlation of 0.716 compared to CMImpute’s 0.703. However, this was not the case for other baselines or for the highest-coverage probes, where the median correlations for the species baseline and CMImpute were 0.535 and 0.626, respectively (Fig. 4a,b).
Fig. 4.
Probe-wise performance of imputed species-tissue combination mean samples. a Distributions of probe-wise Pearson correlations with held-out observed values when considering highest-coverage probes. The top boxplots show the distribution of probe-wise correlations with held-out observed values. The bottom histograms show the number of imputed combination mean samples across 50 Pearson correlation bins. As the global baseline predictions do not vary within a fold, the probe-wise performance is not meaningful and this is not included in the histograms. Legend for both boxplots and histograms shown in histogram plot. Baselines labeled by Wilcoxon signed-rank test p-value comparing CMImpute’s probe-wise Pearson correlation and each baseline’s probe-wise Pearson correlation for each imputed probe ([CMImpute, Species Baseline], [CMImpute, Logistic Regression], [CMImpute, Tissue Baseline], [CMImpute, Global Baseline]). Corresponding plots for subsets of higher variance probes can be found in Additional file 2: Fig. S8a-c. b Same as a except for all probes. Corresponding plots for subsets of higher variance probes can be found in Additional file 2: Fig. S8d-f. c 2-d histogram showing CMImpute (top row) and species baseline (bottom row) probe-wise correlation as a function of mean inter-tissue probe variance when considering the subset of highest-coverage probes. Each row contains four heatmaps corresponding to variance quartiles. Within each variance quartile, the heatmap shows the number of probes with a certain probe variance (x-axis) and certain probe-wise correlation (y-axis) split into 50 bins along each axis. Each quartile is labeled with its own color bar. Color bar scales for each quartile are consistent across methods. Remaining baselines can be found in Additional file 2: Fig. S11. Corresponding plot when considering all probes can be found in Additional file 2: Fig. S12a. 
d Probe-wise MSE (y-axis) relationship to mean inter-tissue variance (x-axis) when considering the subset of highest-coverage probes shown in separate plots for CMImpute, species baseline, logistic regression, tissue baseline, and global baseline (left to right). Each dot corresponds to a single probe. Corresponding plot for all probes can be found in Additional file 2: Fig. S13b. e Boxplot of the probe-wise Pearson correlation with held-out observed values for the subset of highest-coverage probes in each mean inter-tissue variance quartile. Each variance quartile represented in the boxplots correspond to the variance quartile in the 2-d histograms from c. Corresponding plot when considering all probes can be found in Additional file 2: Fig. S12b. f Boxplot of the probe-wise MSE for each mean inter-tissue variance quartile. Same format as e. Corresponding plot when considering all probes can be found in Additional file 2: Fig. S14b
While CMImpute significantly outperformed the baselines (Fig. 4a,b), the absolute probe-wise correlations (0.623 for the subset of highest-coverage probes and 0.688 for all probes) were, as we would expect, lower than the sample-wise correlations (0.932 for the subset of highest-coverage probes and 0.920 for all probes) (Fig. 3a, Additional file 2: Fig. S2a). We note that for probes that have almost no variance across combination mean samples, we would expect the probe-wise Pearson correlation to be less informative and have low correlation values. We additionally evaluated probe-wise performance using MSE and found that CMImpute significantly outperformed the baselines for both all probes and the subset of highest-coverage probes with mean MSEs of 0.0171 and 0.0148, respectively (Additional file 2: Fig. S2b-c,4). Unlike for correlation, mean probe-wise MSE values are equal to the mean sample-wise MSE values, because both average the same squared errors over all sample-probe pairs.
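The equality of the two mean MSE values can be checked numerically. The sketch below uses random toy matrices and assumes a complete samples-by-probes matrix with no missing entries.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.uniform(size=(5, 7))   # 5 combination mean samples x 7 probes (toy values)
imputed = rng.uniform(size=(5, 7))

sq_err = (observed - imputed) ** 2
mean_samplewise_mse = sq_err.mean(axis=1).mean()  # average over probes, then over samples
mean_probewise_mse = sq_err.mean(axis=0).mean()   # average over samples, then over probes

print(mean_samplewise_mse, mean_probewise_mse)
```

Both quantities equal the grand mean of the squared errors, so only the order of averaging differs. Pearson correlation has no such property because it is normalized separately within each row or column.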
We further analyzed probe-wise performance as a function of probe variance, allowing us to compare imputation performance across differentially and non-differentially methylated regions. We used three types of variances: inter-combination variance, mean inter-tissue variance, and mean inter-species variance (“Methods”). Inter-combination variance represents the mean probe variation between different species-tissue combinations. Mean inter-tissue variance represents the average probe variation between tissues within a species. Mean inter-species variance represents the average probe variation between species within a tissue. To analyze probe-wise performance as a function of probe variance, we used variance quartiles for each type of variance where each quartile contains one fourth of the probes being considered (Additional file 2: Fig. S5). We analyzed the performance within each variance quartile using both probe-wise Pearson correlation and MSE.
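One way to compute the three variance measures and the variance quartiles is sketched below, following the verbal definitions above. The exact definitions are given in Methods and may differ in details, such as how species with a single tissue are handled. Here, groups with only one combination are skipped, which is an assumption. The function names and toy data are hypothetical.

```python
import numpy as np

def probe_variances(X, species, tissue):
    """X: combination mean samples (rows) x probes (columns)."""
    def mean_within_group_variance(groups):
        # Variance across the combinations in each group, averaged over groups.
        per_group = [
            X[groups == g].var(axis=0)
            for g in np.unique(groups)
            if np.sum(groups == g) > 1
        ]
        return np.mean(per_group, axis=0)

    return {
        "inter_combination": X.var(axis=0),
        "mean_inter_tissue": mean_within_group_variance(species),   # across tissues within a species
        "mean_inter_species": mean_within_group_variance(tissue),   # across species within a tissue
    }

def variance_quartile(v):
    """Rank-based quartile index (0 = lowest, 3 = highest) for each probe."""
    ranks = np.argsort(np.argsort(v, kind="stable"), kind="stable")
    return ranks * 4 // len(v)

rng = np.random.default_rng(0)
X = rng.uniform(size=(4, 8))                 # 2 species x 2 tissues, 8 probes (toy values)
species = np.array(["A", "A", "B", "B"])
tissue = np.array(["t1", "t2", "t1", "t2"])

variances = probe_variances(X, species, tissue)
quartiles = variance_quartile(variances["mean_inter_tissue"])
print(quartiles)
```

Assigning quartiles by rank rather than by value thresholds places exactly one fourth of the probes in each quartile when the probe count is divisible by four.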
We first analyzed the probe-wise performance as a function of inter-combination and mean inter-species variance considering the subset of highest-coverage probes. This revealed that for both these variance types, CMImpute consistently had a higher mean correlation than the baselines across all variance quartiles with a minimum mean correlation of 0.571 in the lowest variance quartile and maximum mean correlation of 0.667 in the highest variance quartile across both variance measures compared to the species baseline mean correlations of 0.558 and 0.550, respectively (Additional file 2: Fig. S6a-b,7a-b,8a-b). All other baselines yielded lower performance than the species baseline. These results were also consistent with those from the probe-wise MSE metric, where CMImpute had a lower mean MSE for all inter-combination and mean inter-species variance quartiles (Additional file 2: Fig. S9,10a-b).
For mean inter-tissue variance considering the subset of highest-coverage probes, CMImpute consistently had a higher mean correlation and lower mean MSE than the species baseline in all but the lowest variance quartile (Fig. 4c,e, Additional file 2: Fig. S5a). CMImpute also had a higher median correlation in the top two variance quartiles. While the species baseline relative to CMImpute more accurately imputed probes with low mean inter-tissue variance, it less accurately imputed probes of higher mean inter-tissue variance, which contain greater tissue-specific activity signal. We also note that while CMImpute outperformed the species baseline on these higher varying probes, CMImpute’s performance did decrease as the mean inter-tissue probe variance increased, with the mean correlation going from 0.692 to 0.589 and the mean MSE going from 0.005 to 0.032 in the lowest and highest variance quartiles, respectively. Compared to logistic regression, CMImpute achieved higher mean probe-wise correlation across all mean inter-tissue variance quartiles and lower mean MSE across the first three quartiles (Fig. 4d–f, Additional file 2: Fig. S11). CMImpute outperformed the remaining baselines in all mean inter-tissue variance quartiles considering Pearson correlation (Fig. 4c,e, Additional file 2: Fig. S8c,11) and MSE (Fig. 4d,f, Additional file 2: Fig. S10c).
When considering all probes for each variance metric, similar results to the subset of highest-coverage probes were seen with CMImpute outperforming the species baseline for most variance quartiles and outperforming the other baselines across all variance quartiles for all three variance metrics (Additional file 2: Fig. S6c-d,7c-d,8d-f,10d-f,12,13,14). We also investigated specific probes where CMImpute outperformed the species baseline and vice versa, and similarly for logistic regression, across all probes and the subset of highest-coverage probes. Probes where CMImpute outperformed the species baseline had higher mean inter-tissue variance (Additional file 2: Fig. S15a,c) and probes where CMImpute outperformed logistic regression had lower mean inter-tissue variance (Additional file 2: Fig. S15b,d). These results demonstrate that CMImpute generally outperformed baselines and that overall probe-wise performance is associated with probe variance between different tissue types.
Impact of amount of available data on imputation accuracy
We next sought to understand how the amount of available training data impacted imputation performance. We first investigated the sample-wise performance as a function of the number of tissue types within a target species. We note this evaluation does not consider the amount of data available in the tissue types for non-target species. As the number of tissue types in the target species increased, CMImpute’s mean Pearson correlation across all probes with held-out data showed positive correlation (r = 0.181), with the performance increasing from 0.915 mean correlation for one tissue type in the target species to 0.951 for five tissue types. We did not observe a corresponding increase for the mean correlation between individual observed samples within the same tissue and species combination (r = -0.048), which had values of 0.982 and 0.978 for one and five tissue types in the target species, respectively (Fig. 5a). This trend, which was also seen when considering the sample-wise correlation across only higher variance probes (Additional file 2: Fig. S16a-c), was consistent with CMImpute’s improved performance with additional tissue types being driven by the additional available training data and not differences in the observed variability across individuals within the species and tissue combinations. Consistent with results based on all imputed combination mean samples (Fig. 3a), CMImpute also generally outperformed the baselines for subsets of imputed samples as a function of the amount of available training data across all probes (Additional file 2: Fig. S17a). CMImpute also outperformed baselines when restricted to higher variance probes, particularly on combination mean samples with a low number of same-species tissues, which make up the majority of samples in the cross-validation analysis (Additional file 2: Fig. S16a-c).
Fig. 5.
Impact of the amount of available training data on CMImpute performance. a Sample-wise Pearson correlation distributions as a function of the number of tissue types available in the target species during training. The box plot shows the distribution of Pearson correlation for each number of tissue types. Line connects the median correlations for an imputation method across all tissue type counts. Individual to individual variability represents the average pairwise correlation of observed data between individuals of the same species and tissue type for each combination. Baseline performance can be found in Additional file 2: Fig. S17a. b Similar to a, but sample-wise Pearson correlation as a function of the number of species available during training in the target tissue. Baseline performance can be found in Additional file 2: Fig. S17b
We also evaluated the sample-wise performance as a function of the number of different species within the target tissue (Fig. 5b). We note this evaluation does not consider the number of available tissue types in the target species. In this evaluation, CMImpute’s performance was positively correlated with the number of same-tissue species (r = 0.166), with the mean performance increasing from 0.893 to 0.932 when going from one to two same-tissue species and reaching a maximum of 0.938 at the maximum number of same-tissue species. This trend was also seen when considering the sample-wise correlation across only higher variance probes (Additional file 2: Fig. S16d-f). Consistent with the results from the evaluation as a function of the number of tissue types in the target species, CMImpute generally outperformed the baselines in sample-wise imputation performance evaluations as a function of the number of species for the target tissue type across all probes (Additional file 2: Fig. S17b). This was not only due to CMImpute’s performance on lower variance probes, as it also outperformed the baselines when restricting to higher variance probes (Additional file 2: Fig. S16d-f). Overall, CMImpute yielded high performance with limited amounts of same-tissue and same-species training data, but performance still increased with additional training data.
Imputation of non-observed species and tissue combination mean samples
Using all the mammalian methylation array data considered here (Additional file 1), we applied CMImpute to impute all combinations not present in this input compendium (Methods). This resulted in imputed data for 19,786 species-tissue combination mean samples without observed data, spanning all 348 species and 59 tissue types (Fig. 1a imputed).
We first clustered and visualized heatmaps of the methylation values for all probes in the CMImpute species-tissue combination mean samples (Fig. 1a imputed and observed + imputed, Fig. 6a). As these heatmaps are based on the full imputed set of combination mean samples, there was no observed data for most samples to directly compare to. However, similar to what we previously observed when clustering based on the observed data (Fig. 2a), these heatmaps also showed sample clusters that corresponded to phylogenetic order. Also consistent with these phylogenetic order associated clusters, heatmaps of pairwise correlations between samples showed a correlated block structure between phylogenetic orders in both observed (Fig. 1a observed, Fig. 6b) and CMImpute-imputed (Fig. 1a imputed and observed + imputed, Fig. 6c) combination mean samples. We confirmed that these patterns could not be explained based on mappability differences between species as we saw similar patterns when we clustered and visualized the data restricted to the highest-coverage probes (Additional file 2: Fig. S18). For comparison, we also conducted a similar set of clustering and visualizations for the data imputed from the baseline methods (Additional file 2: Fig. S19-21). This showed that the logistic regression and tissue baselines did not show clear clustering of samples corresponding to species (Additional file 2: Fig. S19b,c-21b,c), while as expected the species baseline did (Additional file 2: Fig. S19a-21a).
Fig. 6.
Visualization of CMImpute-imputed samples of non-observed combinations. a Heatmap of the imputed dataset’s methylation probe values. Samples and probes were ordered based on hierarchical clustering followed by optimal leaf ordering. Color bars on the left indicate the phylogenetic order (inner) and tissue (outer) corresponding to the samples. Legends corresponding to the color bars can be found above the heatmaps. Color scale representing methylation values from 0 to 1 on the right. CMImpute-imputed combination mean samples of missing species-tissue combinations mainly cluster by phylogenetic order. b-c Heatmaps of pairwise correlations between species-tissue combination mean samples for b all 746 observed species-tissue combinations and c 20,251 CMImpute-imputed samples from both the cross-validation analysis and full imputed compendium. Samples are ordered based on hierarchical clustering followed by optimal leaf ordering of the methylation samples (same order as a). Color bars on the left indicate the phylogenetic order (inner) and tissue (outer) corresponding to the samples. Despite the observed heatmap considering a small subset of the imputed datasets, both the observed and imputed heatmaps demonstrate highly correlated block structures between and within phylogenetic orders. d Heatmaps of imputed dataset considered in a) with the species signal removed to highlight the differentially methylated tissue regions. Samples are ordered based on hierarchical clustering followed by optimal leaf ordering of the regressed combination mean samples, which shows greater clustering by tissue and less clustering by order compared to a
To highlight tissue-specific signal captured in observed and CMImpute-imputed values, we also clustered and visualized the methylation values and pairwise correlations based on all probes after regressing out the species contribution (Fig. 1a imputed and observed + imputed, Fig. 2d, Fig. 6d). This revealed clusters of samples corresponding to the same or similar tissue types and a correlated block structure corresponding to tissue types in both the observed (Additional file 2: Fig. S22a) and CMImpute-imputed (Additional file 2: Fig. S22b) combination mean samples, despite covering different sets of species-tissue combinations. For comparison, we also conducted a similar analysis for the baseline methods (Additional file 2: Fig. S19a-d,22c-f). Unlike CMImpute, the species baseline did not capture tissue-specific methylation patterns (Additional file 2: Fig. S22c,23a). The tissue and logistic regression baselines, which previously did not show species-specific signal, did show tissue-specific methylation patterns (Additional file 2: Fig. S22d-e,23b-c).
Quantifying species and tissue signals in combination mean samples
In addition to identifying species and tissue signals through clustering and visualization, we also directly quantified species and tissue signals in combination mean samples. We did this for species signal by evaluating the ability of pairwise correlations between species-tissue combination mean samples to predict whether a combination mean sample pair is of the same species quantified using an Area Under receiver operating characteristic Curve (AUC), and similarly for tissue signal based on whether the pair is of the same tissue, first using all probes (Fig. 7). We performed these evaluations on observed combination mean samples as well as imputed combination mean samples from CMImpute and the baseline methods. To directly compare the AUC values based on observed and imputed combination mean samples, we restricted this analysis to the species and tissue combinations included in the cross-validation analysis (Fig. 1a observed + imputed). The observed and CMImpute combination mean samples had similar tissue signals with AUC values of 0.656 and 0.667, respectively, and similar species signals with AUC values of 0.992 and 0.979, respectively. The tissue and species AUC values for combination mean samples based on logistic regression (0.750, 0.786) were higher and lower, respectively, than observed and CMImpute AUC values. As expected, the species baseline had a high species AUC value (0.993) and a low tissue AUC value (0.503), while the tissue baseline had a high tissue AUC value (0.857) and low species AUC value (0.471). To confirm that the species and tissue signals were not simply reflecting mappability differences between species, we additionally restricted this analysis to the subset of highest-coverage probes and saw similar trends (Additional file 2: Fig. S24a). In addition, we confirmed that when using all imputed combination mean samples (20,251 combinations considered in Fig. 6c), including those for which we did not have observed data, we saw similar tissue and species AUC values for CMImpute and the baselines (Additional file 2: Fig. S24b,c).
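To make the AUC-based quantification concrete, the following is a minimal sketch on synthetic data, not the analysis code used for the reported values. It assumes combination mean samples are rows of a matrix, scores every unordered sample pair by its Pearson correlation, and computes the AUC for predicting whether the pair shares a label through the rank-based (Mann-Whitney) form. The function name `pairwise_auc` and the toy species and tissue profiles are hypothetical.

```python
import numpy as np

def pairwise_auc(samples, labels):
    """AUC for predicting 'same label' from pairwise Pearson correlations."""
    corr = np.corrcoef(samples)                  # rows are samples, columns are probes
    iu = np.triu_indices(len(labels), k=1)       # each unordered pair counted once
    scores = corr[iu]
    labels = np.asarray(labels)
    same = labels[iu[0]] == labels[iu[1]]
    pos, neg = scores[same], scores[~same]
    # Mann-Whitney form of the AUC: P(pos > neg) + 0.5 * P(pos == neg)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy data (assumption): 4 species x 3 tissues, strong species effect, weak tissue effect.
rng = np.random.default_rng(0)
species = np.repeat(np.arange(4), 3)
tissue = np.tile(np.arange(3), 4)
species_profile = rng.normal(size=(4, 200))
tissue_profile = 0.3 * rng.normal(size=(3, 200))
toy = species_profile[species] + tissue_profile[tissue] + 0.1 * rng.normal(size=(12, 200))

species_auc = pairwise_auc(toy, species)
tissue_auc = pairwise_auc(toy, tissue)
print(f"species AUC = {species_auc:.3f}, tissue AUC = {tissue_auc:.3f}")
```

In this toy setting the species effect dominates, so the species AUC is expected to be much higher than the tissue AUC, which mirrors the qualitative pattern reported for the observed data.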
Fig. 7.
Species and tissue signal in observed and imputed samples. Area Under ROC values for predicting whether samples within the cross-validation dataset are from the same species or tissue based on their pairwise correlations for all probes. Corresponding plot for the subset of highest-coverage probes can be found in Additional file 2: Fig. S24a
Imputed species-tissue combination mean samples are predictive of a species’ maximum lifespan
We next applied the imputed species-tissue combination mean samples to analyze the relationship between species-level methylation values and species’ maximum lifespan [12, 18]. For this we followed a similar approach to Li et al. [18] and performed linear regression to predict the logarithm of a species’ maximum lifespan based on methylation data. Specifically, we first performed a linear regression analysis in a tissue-agnostic setting based on the average of combination mean samples within a species. We did this to see if similar predictive performance could be achieved with CMImpute’s imputed data for tissue types without observed data as could be with observed data for tissue types in which observed data was available (Methods).
We evaluated the predictive performance using Pearson correlation with log maximum lifespan using a leave-one-species-out (LOSO) analysis and saw similar correlations of 0.813 and 0.829 for the observed and imputed data, respectively (Fig. 8a,b). The MSE distributions were also similar with low median MSEs for both observed and imputed data of 0.064 and 0.047, respectively (Additional file 2: Fig. S25a). In addition to the imputed and observed data leading to similar predictive performance, the actual predicted values of the logarithm of maximum lifespan were also highly correlated with each other (0.973, Fig. 8c), demonstrating that CMImpute samples capture similar signals related to species maximum lifespan as the observed data. In a tissue-specific setting when considering individual tissue types (Additional file 2: Fig. S25b), similar predictive performances were also seen between observed and imputed data (average Pearson correlation of 0.772 and 0.762 and median MSE of 0.072 and 0.064, for observed and imputed data respectively, restricted to tissue types with observed data in at least three species) (Methods). As the maximum lifespan is a species-level characteristic, we additionally performed these evaluations using the species baseline’s imputed data. The species baseline achieved similar performance to the observed and CMImpute-imputed data in the tissue-agnostic (correlation 0.831, median MSE 0.064) and tissue-specific (correlation 0.817, median MSE 0.063) settings (Additional file 2: Fig. S25). However, the species baseline predicts the same maximum lifespan for each tissue type within a species removing the possibility of performing a tissue-level analysis regarding species’ maximum lifespan or any other downstream analysis regarding tissue type. In general, this lack of tissue specificity may restrict the utility of the species baseline’s predictions.
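The leave-one-species-out evaluation can be illustrated with a short sketch. It assumes ordinary least squares on a handful of synthetic features; the actual analysis follows Li et al. [18] on methylation data and may differ in its features and fitting details. The function `loso_predictions` and the toy data are hypothetical.

```python
import numpy as np

def loso_predictions(X, y):
    """Predict each species' value from a linear model fit on all other species."""
    n = X.shape[0]
    design = np.hstack([np.ones((n, 1)), X])     # prepend an intercept column
    preds = np.empty(n)
    for held_out in range(n):
        train = np.arange(n) != held_out         # mask excluding the held-out species
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        preds[held_out] = design[held_out] @ coef
    return preds

# Toy data (assumption): 40 species, 5 species-average methylation features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 5))
true_w = np.array([1.5, -0.8, 0.0, 0.6, 0.0])
log_lifespan = 2.0 + X @ true_w + 0.1 * rng.normal(size=40)

preds = loso_predictions(X, log_lifespan)
r = np.corrcoef(preds, log_lifespan)[0, 1]               # Pearson correlation
median_se = np.median((preds - log_lifespan) ** 2)       # median squared error per species
print(f"LOSO Pearson r = {r:.3f}, median squared error = {median_se:.4f}")
```

Because every prediction comes from a model that never saw the held-out species, the correlation reflects generalization across species rather than in-sample fit.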
Fig. 8.
Prediction of species’ maximum lifespan using combination mean samples. a,b Leave-one-species-out (LOSO) linear regression analysis using species-average samples to predict a species’ maximum lifespan. For each plot, each dot corresponds to a species. Dashed red line is the regression line between predicted and reported log-maximum lifespan. Solid red line denotes y = x. Regression coefficients, Pearson correlation, p-value, and standard error are shown above each plot. Average methylation calculated over a exclusively observed methylation samples or b CMImpute-imputed species-tissue combination mean samples. Predicted log-maximum lifespan (x-axis) plotted against the reported log-maximum lifespan (y-axis). c Comparison of maximum lifespan predictions based on average species methylation samples between using observed and imputed data
Discussion
Following the development and large-scale application of the mammalian methylation array [12–14], there has been a large increase in available methylation data from a wide range of mammalian species. However, while high, though still incomplete, tissue coverage is present in certain species such as horse, mouse, and human, most species have limited profiled tissue types. Handling this incomplete and imbalanced tissue sample coverage across the 348 species in the Mammalian Methylation Consortium compendium [12–14] presents a significant bioinformatics challenge. To tackle this, we introduced CMImpute, designed to estimate mean methylation values for various species-tissue combinations. We trained CMImpute on the data from the Mammalian Methylation Consortium [12–14]. CMImpute was specifically designed for imputing these species and tissue combinations that have not been previously experimentally profiled but where other tissues have been profiled in the target species and other species have been profiled in the target tissue type (Fig. 1b). To do this, CMImpute uses a CVAE which has the advantages of being able to share information across probes and learn non-linear relationships across various species and tissues simultaneously using a single model.
Through a cross-validation analysis, we demonstrated that CMImpute accurately imputed combination mean samples of missing species and tissue combinations, outperforming multiple baselines in agreement with held-out observed data for both sample-wise and probe-wise performance. CMImpute was able to accurately impute samples across a wide range of phylogenetic orders. CMImpute yielded reasonable sample-wise performance that was better than the baselines when limited same-species and same-tissue information for the target species-tissue combination was available, and as expected performance increased with the availability of additional same-species or same-tissue information. CMImpute’s probe-wise performance was more robust than the baselines across lower and higher variance probes. However, despite this relative robustness, we note that CMImpute’s probe-wise performance did decrease as tissue-to-tissue variability increased.
Finally, we trained CMImpute on all data from the mammalian methylation array that we were considering and used the subsequent model to impute 19,786 new species-tissue combination mean samples representing 348 species and 59 tissue types. We showed based on these predictions and the cross-validation predictions that CMImpute’s imputed samples contained species and tissue signals that were consistent with observed patterns. We also demonstrated that using the new imputed combination samples we could predict the maximum lifespan of a species with similar accuracy as when using observed samples.
While CMImpute already showed effective performance, there are possible extensions that could be investigated in future work. Currently, CMImpute does not account for other sample attributes besides species and tissue; however, a potential avenue for future work is to investigate if extending CMImpute with additional labels corresponding to additional attributes such as age, sex, or individual donor performs effectively. CMImpute also does not currently explicitly account for phylogenetic information, which could potentially be used to improve predictive performance. CMImpute makes its predictions based on exclusively methylation data as opposed to incorporating sequence or other biochemical data so as not to confound downstream analyses with other layers of information. Furthermore, in the context of highly conserved sites across species, limited species-specific variation could be predicted from sequence, and other sources of informative biochemical data in the target species are often not available. However, it might be possible to improve predictive performance by taking into account data from other sources, particularly for species with extensive additional epigenetic data available such as human and mouse. Future work could investigate approaches for also incorporating other biochemical data, when available, or sequence information into predictions. Additionally, CMImpute currently does not utilize the CVAE’s probabilistic latent space for tasks such as modeling uncertainty. CMImpute could be adjusted to model uncertainty in the model’s predictions due to inherent noise in the data itself, referred to as aleatoric uncertainty [41, 42], presenting an interesting future direction for investigating uncertainty-aware training [43, 44].
Finally, this work was limited to applying CMImpute to data from the mammalian methylation array, while there is also now a large-scale cross-species methylation dataset based on RRBS [16]. However, we note that such data, unlike data from the mammalian methylation array, does not specifically target highly conserved regions across mammals. Future work could also investigate applying and possibly extending CMImpute to RRBS or other methylation assays.
Conclusions
Here we introduced CMImpute, a generative neural network-based method that we used to impute the mean methylation values for various species-tissue combinations. We have shown that CMImpute achieves strong correlation across probes within a sample and low MSE with observed values. While CMImpute’s correlations across samples within a probe were not as strong, they were still high, though they decreased for probes that were highly variable across tissues. We have also shown that CMImpute’s predictions surpass several baseline methods for all measures considered. We additionally used CMImpute to impute samples representing previously unprofiled species-tissue combinations. We demonstrated that the imputed samples maintain similar species and tissue relationships to those in the observed data. We expect that these imputed samples will be useful for downstream analyses, though we caution that the samples are computational predictions and note that the most variable probes are the hardest to impute accurately. These imputed samples are publicly available [45, 46] and provide computational predictions that vastly expand the current compendium of methylation information. We expect CMImpute and its imputed datasets will be a resource for comparative epigenetic studies analyzing species and tissue-level methylation patterns across mammalian species.
Methods
Mammalian methylation array data
We used a dataset of 13,245 individual DNA methylation samples across 348 mammalian species and 59 tissues spanning 746 unique species-tissue combinations [12, 13] (Additional file 1) (Gene Expression Omnibus accession number GSE223748). All of this data was generated by the Mammalian Methylation Consortium on the mammalian methylation array [14] and corresponded to the subset of the consortium data available at the time of our analyses. The array provides coverage of 37,492 methylation probes with the large majority selected from conserved genomic loci across mammalian species and the remaining approximately two thousand selected based on known human biomarkers. Each probe contains a 50-bp sequence on one side of a CpG site [14]. The methylation value of each probe is the beta value derived using SeSAMe normalization [47] and represents the proportion of methylation on a scale from 0 to 1.
Definition of probe subset used for highest-coverage probe analyses
While most probes on the mammalian methylation array were selected from overall highly conserved genomic regions, for any given probe there could be non-human mammals for which it is not expected to work because of sequence divergence in that mammal or the sequence being non-unique in the mammal. If a probe does not work in a species, it is also not expected to reflect within species methylation signals, such as differential tissue signals. To determine if a probe is expected to work in a mammalian species that has a genome available, we obtained previously computed mappability information [14], which is whether the probe maps to a unique genomic location in that particular species. We obtained this information from annotation files at [48]. In total, mappability information was available for 57 of the species included in the cross-validation analysis (Additional file 6). We defined a subset of 11,749 probes as “highest-coverage” probes based on being mappable in at least 90% of these 57 species.
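The highest-coverage filter can be expressed in a few lines. This is a minimal sketch that assumes the mappability information has already been loaded into a boolean probe-by-species matrix; the matrix here is randomly generated, and the variable names are hypothetical rather than taken from the annotation files.

```python
import numpy as np

rng = np.random.default_rng(2)
n_probes, n_species = 1000, 57

# Hypothetical input: mappable[p, s] is True if probe p maps uniquely in species s.
probe_rate = rng.uniform(0.5, 1.0, size=(n_probes, 1))
mappable = rng.uniform(size=(n_probes, n_species)) < probe_rate

# Keep probes that are mappable in at least 90% of the species with mappability data.
fraction_mappable = mappable.mean(axis=1)
highest_coverage = np.flatnonzero(fraction_mappable >= 0.9)
print(f"{highest_coverage.size} of {n_probes} toy probes pass the 90% threshold")
```

With 57 species, the 90% threshold corresponds to a probe being mappable in at least 52 species, since 51 of 57 is just under 90%.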
CMImpute inputs and outputs for training and imputation
For training, CMImpute takes as input two matrices and outputs a single matrix. One of these input matrices is an $N_{TRAIN} \times M$ matrix of methylation values where $N_{TRAIN}$ is the number of individual samples in the training dataset, $M$ is the number of mammalian methylation array probes, the rows correspond to individuals, and the columns correspond to methylation probes. The other of these input matrices is an $N_{TRAIN} \times L$ labels matrix that contains one-hot encoded species and tissue labels, where $L$ is the sum of the total number of available species and tissues and the rows correspond to an individual while the columns correspond to either a species or a tissue. This $N_{TRAIN} \times L$ matrix is a column-wise concatenation of an $N_{TRAIN} \times S$ array of one-hot encoded species labels and an $N_{TRAIN} \times T$ array of one-hot encoded tissue labels, where $S$ is the number of species with observed data and $T$ is the number of tissues with observed data. During training, CMImpute transforms the input into an $N_{TRAIN} \times Z$ latent space representation where $Z$ is the latent space dimension. The output is an $N_{TRAIN} \times M$ matrix of reconstructed methylation samples, which contain CMImpute’s predicted methylation values of the original samples after reducing the original samples to a latent space representation. As with the input matrix of methylation values, the rows correspond to individuals and the columns correspond to methylation probes. For imputation, CMImpute then takes as input just a $1 \times L$ labels matrix and outputs a $1 \times M$ matrix containing species-tissue combination mean sample values, where each value represents the average methylation value of a probe across individuals for a given species-tissue combination. The row corresponds to an imputed species-tissue combination and columns correspond to methylation probe values.
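As an illustration of the labels matrix described above, the following sketch builds the column-wise concatenation of one-hot species and tissue blocks. The helper `one_hot_labels` and the example vocabularies are hypothetical and are not taken from the CMImpute code base.

```python
import numpy as np

def one_hot_labels(species, tissues, species_vocab, tissue_vocab):
    """Return an (N, S + T) matrix: species one-hot block followed by tissue one-hot block."""
    s_idx = np.array([species_vocab.index(s) for s in species])
    t_idx = np.array([tissue_vocab.index(t) for t in tissues])
    species_block = np.eye(len(species_vocab))[s_idx]    # N x S
    tissue_block = np.eye(len(tissue_vocab))[t_idx]      # N x T
    return np.concatenate([species_block, tissue_block], axis=1)

species_vocab = ["horse", "mouse", "human"]   # S = 3 (toy example)
tissue_vocab = ["blood", "liver"]             # T = 2 (toy example)
labels = one_hot_labels(
    ["mouse", "human", "mouse"], ["liver", "blood", "blood"],
    species_vocab, tissue_vocab,
)
print(labels)
```

Each row contains exactly two ones, one in the species block and one in the tissue block, so a single label row is enough to specify a species-tissue combination at imputation time.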
CMImpute model
CMImpute uses a CVAE which is formally defined as a conditional directed graphical generative model and in practice is implemented as a neural network architecture that consists of an encoder, latent space, and decoder. The CVAE provides a flexible framework that is able to generate new samples with previously unseen characteristics based on an arbitrary number of conditions. The parameters of this architecture are trained to maximize the conditional log-likelihood [39]. The values the model components take on are dependent on the input observation matrix X and the corresponding label matrix y. The encoder representation is a recognition network $Q(z \mid X, y)$, which is used to approximate the true conditional prior $P(z \mid y)$, while the decoder representation is a generation network $P(X \mid z, y)$, where z denotes the encoded latent space vector.
In order to maximize the conditional log-likelihood $\log P(X \mid y)$, the theoretical variational lower bound is used as the objective function. The variational lower bound represents the lower bound for the probability of observations given the CVAE’s learned parameters and is given as [39] $\log P(X \mid y) \geq -KL\left(Q(z \mid X, y) \,\|\, P(z \mid y)\right) + \mathbb{E}_{Q(z \mid X, y)}\left[\log P(X \mid z, y)\right]$, where KL refers to the Kullback–Leibler divergence. The model assumes that both the encoder and true conditional prior are multivariate Gaussians, where the learned latent space z is a vector sampled from $\mathcal{N}(\mu, \Sigma)$ where $\Sigma$ is assumed to be diagonal and $P(z \mid y) = \mathcal{N}(0, I)$ [38, 39, 49] where I is the identity matrix. Thus, the expectation term of the variational lower bound represents the expected output of the generation network and the KL term regularizes the latent space to be as close to $\mathcal{N}(0, I)$ as possible.
CMImpute represents the encoder and decoder as fully connected neural networks. The encoder takes as input training methylation samples X concatenated with their corresponding one-hot encoded species and tissue labels y. Each training sample in X is a vector of numbers between 0 and 1. With these inputs, the encoder predicts the latent space representation z based on the Gaussian parameters.
According to the variational lower bound and the definition of the encoder as the recognition network Q, z should be directly sampled from $Q(z \mid X, y)$. However, sampling z from Q is a non-continuous operation; thus a gradient cannot be calculated and backpropagation cannot be used to learn the parameters if this sampling operation is performed within the network [49]. Instead, the sampling operation is performed outside the network by using the “reparametrization trick”, which reparametrizes the latent space, z, with a deterministic function $z = g(X, y, \epsilon)$ where $\epsilon \sim \mathcal{N}(0, I)$ is an auxiliary variable. This allows for the expectation term of the variational lower bound, $\mathbb{E}_{Q(z \mid X, y)}\left[\log P(X \mid z, y)\right]$, to be estimated via Monte Carlo (MC) approximation [38]. By substituting this term with the approximation consisting of replacing z with the deterministic function g, the variational lower bound is replaced with a differentiable estimator in the form of the empirical lower bound: $\tilde{\mathcal{L}}(X, y) = -KL\left(Q(z \mid X, y) \,\|\, P(z \mid y)\right) + \frac{1}{L}\sum_{l=1}^{L}\log P\left(X \mid z^{(l)}, y\right)$ where $z^{(l)} = g(X, y, \epsilon^{(l)})$, $\epsilon^{(l)}$ is the lth sample from the standard normal distribution, and L is the number of MC samples [38, 39]. Following previous work, we set L to 1, which has been shown to yield an effective approximation in most settings while minimizing computational cost [38]. As is typical, μ and Σ were each represented as network layers directly before z in the encoder (Fig. 1c). μ was a vector representing the multivariate Gaussian distribution mean and $\log\sigma^2$ was a vector representing the logarithms of each term of the Σ diagonal [$\log\Sigma_{11}, \ldots, \log\Sigma_{kk}$]. Representing the Σ diagonal as a logarithm allows for an exponentiation operation during reparameterization, which makes computing the loss function (defined below) more numerically stable. Based on the “reparametrization trick” described above and the definitions of μ and $\log\sigma^2$, the latent space is defined as follows to approximate the true distribution while remaining differentiable: $z = \mu + \exp\left(\tfrac{1}{2}\log\sigma^2\right) \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$ [49] and where exp denotes an operation that takes the exponentiation of each term of a vector.
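The reparameterization step can be sketched in NumPy as follows. This is an illustration only: in the actual model the mean and log-variance vectors are outputs of encoder layers and the operation runs inside the deep learning framework, whereas here they are fixed toy arrays. The function name `reparameterize` and all values are assumptions for the example.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Return z = mu + exp(0.5 * log_var) * eps with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mu.shape)          # auxiliary noise, sampled outside the network
    return mu + np.exp(0.5 * log_var) * eps      # deterministic in (mu, log_var) once eps is fixed

rng = np.random.default_rng(3)
# Toy encoder outputs (assumption): means (0.5, -1.0) and variances (0.04, 0.25).
mu = np.tile(np.array([0.5, -1.0]), (100_000, 1))
log_var = np.tile(np.log([0.04, 0.25]), (100_000, 1))

z = reparameterize(mu, log_var, rng)
print("empirical mean:", z.mean(axis=0))
print("empirical std: ", z.std(axis=0))
```

Exponentiating half of the log-variance yields the standard deviation, so the draws have the intended Gaussian distribution while the randomness is confined to the auxiliary variable.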
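The decoder takes as input z and y and attempts to reconstruct the training samples; the reconstructed samples are represented as X’. The training loss function ($L_{TRAIN}$) represents the empirical lower bound via the sum of the reconstruction loss (binary cross entropy, $L_{RECON}$) and KL regularization ($L_{REG}$) [49]:
$$L_{TRAIN} = L_{RECON} + L_{REG}$$
where
$$L_{RECON} = -\frac{1}{N}\sum_{i=1}^{N}\left[X_i \log X'_i + (1 - X_i)\log\left(1 - X'_i\right)\right], \qquad L_{REG} = -\frac{1}{2}\sum_{j=1}^{k}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
and $X_i$ is the ith training sample, N is the number of samples, k is the latent space dimension, and $\mu_j$ and $\sigma_j^2$ are the jth terms of μ and of the Σ diagonal, respectively.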
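A minimal NumPy sketch of this loss is given below. It assumes the binary cross entropy is summed over probes and that both terms are averaged over samples; the exact reduction used in the Keras implementation may differ. The function `cvae_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def cvae_loss(X, X_recon, mu, log_var, eps=1e-7):
    """Return (total, reconstruction, KL) loss for a batch of samples."""
    X_recon = np.clip(X_recon, eps, 1.0 - eps)   # avoid log(0)
    # Binary cross entropy, summed over probes (assumed reduction).
    recon = -(X * np.log(X_recon) + (1.0 - X) * np.log(1.0 - X_recon)).sum(axis=1)
    # KL divergence between N(mu, diag(exp(log_var))) and N(0, I), summed over latent dims.
    kl = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var)).sum(axis=1)
    return recon.mean() + kl.mean(), recon.mean(), kl.mean()

rng = np.random.default_rng(5)
X = rng.uniform(0.05, 0.95, size=(4, 6))         # toy beta values in (0, 1)
mu0 = np.zeros((4, 3))
lv0 = np.zeros((4, 3))

# Perfect reconstruction with a standard normal latent: KL term is zero.
total_match, recon_match, kl_match = cvae_loss(X, X, mu0, lv0)
# Poor reconstruction with shifted latent means: both terms grow.
total_far, recon_far, kl_far = cvae_loss(X, 1.0 - X, mu0 + 1.0, lv0)
print(total_match, total_far)
```

The KL term vanishes exactly when the encoder outputs zero mean and unit variance, which is the sense in which it pulls the latent space toward the standard normal distribution.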
CMImpute training and hyperparameter selection
CMImpute was implemented in Python 3.9.13 with Keras 2.10.0 (built on top of TensorFlow 2.10.0). CMImpute uses the Adam optimizer [50] to learn the CVAE parameters. CMImpute selects the hyperparameters of the number of hidden layers in the encoder and decoder, hidden layer dimensions, activation function, latent space dimension, learning rate, and epsilon value via grid search (Additional file 5: Table S2, Additional file 2: Fig. S26). For each hyperparameter combination evaluated, a model was trained on a training dataset and used to impute combination mean samples for species and tissue combinations in a corresponding validation dataset. Validation datasets consisted of species and tissue combinations not present in the training dataset, but for which at least one same-species sample and at least one same-tissue sample were present in the training dataset (see below). CMImpute then selected the hyperparameter combination with the highest average sample-wise Pearson correlation between the imputed and held-out species-tissue combinations in the validation dataset.
Species-tissue combination mean imputation
Species-tissue combination mean imputation refers to using a trained model to predict a sample that represents a species’ average methylation values in a specific tissue. CMImpute uses its trained decoders to generate species-tissue combination mean samples for every desired combination of species and tissue via the following steps (Fig. 1c):
1. CMImpute draws a random sample from a standard normal distribution of shape $1 \times Z$ where $Z$ is the latent space dimension. This sample is used as the latent space representation. CMImpute performed this random normal sampling with NumPy version 1.23.4.
2. CMImpute inputs the random normal latent space representation from step 1 and a $1 \times (S + T)$ one-hot encoded label of the target species and tissue type into the decoder, where $S$ and $T$ are the number of profiled species and tissue types, respectively. The resulting output of the decoder is an imputed species-tissue combination mean sample for the target combination.
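The two steps above can be sketched as follows. The decoder here is a stand-in with random placeholder weights, used only to show the data flow; a trained CMImpute decoder would be used in practice. The toy sizes, the function names `decoder` and `impute_combination`, and the two-layer architecture are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
S, T, Z, M = 5, 3, 8, 20     # toy numbers of species, tissues, latent dims, probes

# Placeholder decoder weights (assumption); a trained CVAE decoder would supply these.
W1 = rng.normal(scale=0.5, size=(Z + S + T, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, M))
b2 = np.zeros(M)

def decoder(z, label):
    """Stand-in fully connected decoder: ReLU hidden layer, sigmoid output in (0, 1)."""
    h = np.maximum(0.0, np.concatenate([z, label], axis=1) @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

def impute_combination(species_idx, tissue_idx, rng):
    label = np.zeros((1, S + T))
    label[0, species_idx] = 1.0          # species one-hot block
    label[0, S + tissue_idx] = 1.0       # tissue one-hot block
    z = rng.standard_normal((1, Z))      # step 1: latent draw from N(0, I)
    return decoder(z, label)             # step 2: decode latent draw plus label

sample = impute_combination(species_idx=2, tissue_idx=1, rng=rng)
print(sample.shape)
```

No observed methylation sample is passed through an encoder at imputation time; only the label and a standard normal draw enter the decoder.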
In some previous CVAE applications, the generative process is composed of two parts: (i) obtaining a latent representation by inputting a chosen input into the encoder and (ii) generating a new sample by inputting the resulting latent representation along with a conditional label into the decoder [32, 36, 39, 51, 52]. Using an encoded input as the latent representation makes the resulting generated output dependent on the chosen input sample and thus specific to the individual from which the sample was obtained. However, for the problem of imputing species-tissue combination mean samples that we consider here, an individual-agnostic generative process based on the species and tissue label is required. Since the provided conditional label is sufficient to drive sample imputation of a specific species and tissue combination [36, 39] and the latent space is a multivariate Gaussian regularized to be close to a standard normal distribution, CMImpute samples from a standard normal distribution to obtain a latent representation. CMImpute’s sampling scheme approximates a true sampling from the latent space, as the latent space is regularized to be as close to the standard normal distribution as possible, without encoding information from existing methylation samples. We investigated the impact of the specific random samples from the normal distribution on the final imputed result using a random fold from the cross-validation analysis. For each held out species-tissue combination in a fold, we imputed 20 combination mean samples using a different sample from a normal distribution each time. We then took the variance across combination mean samples for each probe for samples of the same combination but different latent samples and samples representing different combinations. 
We found that samples representing the same species-tissue combination but imputed using different random latent space representations had low variation (median probe variation $2 \times 10^{-5}$) across samples, while samples representing different combinations had a wider range of variation (median probe variation 0.024) accounting for the difference in species and tissue types, verifying that the specific sampled latent space values have minimal impact on the final predicted sample (Additional file 2: Fig. S27). Thus, as expected, the imputed samples are mainly based on the overall species and tissue label inputted into the trained decoder and the final imputed combination mean samples are agnostic to individual variation.
Logistic regression baseline
We compared CMImpute to a logistic regression baseline with L2 regularization. For this baseline, we trained one model per methylation probe with separate species and tissue features.
For a particular probe and species and tissue combination, the trained model was then used to predict the methylation value. Specifically, the predicted value for a probe was $\hat{y} = \frac{1}{1 + e^{-(XW + c)}}$, where $X$ was the one-hot-encoded representation of the species and tissue labels, $W$ the learned feature weights, and $c$ the learned intercept. The feature weights were learned by minimizing an L2-regularized cross-entropy loss between the predicted values and $y$, where $y$ is the real methylation probe values, with the strength of the penalty set by the regularization coefficient. We trained each logistic regression model in Python 3.9.13 using scikit-learn version 0.24.2. Using this package, we included each one-hot-encoded species and tissue label from the training dataset twice in the training input, once with a corresponding y-label of 1 and sample weight of the probe’s methylation value and once with a corresponding y-label of 0 and sample weight of one minus the probe’s methylation value. In this setup, the methylation value prediction corresponds to the probability of a positive classification. Once trained, we concatenated the predictions of each probe-specific model together to form a full imputed methylation sample for a particular species and tissue combination.
For the cross-validation analysis, we tuned the regularization coefficient across values of 1, 2, 4, 8, and 16. We selected the value that yielded combination mean samples with the highest Pearson correlation with held-out samples. A value of 2 yielded this highest testing performance (0.886 Pearson correlation, Additional file 2: Fig. S28).
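The duplicated-row training scheme for a single probe can be sketched with scikit-learn as below. This is an illustrative sketch on made-up data, not the authors' code: the function name `fit_probe_model` and the toy species, tissue, and methylation values are hypothetical, and passing the regularization coefficient as scikit-learn's `C` parameter is an assumption about how the coefficient was supplied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_probe_model(X, y, C=2.0):
    """Fit L2-regularized logistic regression to fractional targets y in [0, 1].

    Each row of X appears twice: once labeled 1 with weight y, and once
    labeled 0 with weight 1 - y, so the fitted probability approximates y.
    """
    n = X.shape[0]
    X_dup = np.vstack([X, X])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    weights = np.concatenate([y, 1.0 - y])
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    model.fit(X_dup, labels, sample_weight=weights)
    return model

def encode(species_idx, tissue_idx, n_species=3, n_tissues=2):
    """One-hot encode species and tissue separately, then concatenate."""
    return np.hstack([np.eye(n_species)[species_idx], np.eye(n_tissues)[tissue_idx]])

# Toy training data for one probe (hypothetical values).
species = np.array([0, 0, 1, 1, 2])
tissue = np.array([0, 1, 0, 1, 0])
X = encode(species, tissue)
y = np.array([0.9, 0.7, 0.3, 0.2, 0.6])

model = fit_probe_model(X, y)

# Predicted methylation = probability of the positive class (column 1).
pred_unseen = model.predict_proba(encode(2, 1).reshape(1, -1))[0, 1]
pred_high = model.predict_proba(encode(0, 0).reshape(1, -1))[0, 1]
pred_low = model.predict_proba(encode(1, 1).reshape(1, -1))[0, 1]
print(pred_unseen, pred_high, pred_low)
```

Repeating this fit once per probe and concatenating the per-probe predictions would give a full imputed sample for the target combination.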
Mean imputation baselines
We compared CMImpute to three mean imputation baselines referred to as the species baseline, tissue baseline, and global baseline. Let $N$ be the total number of experimentally profiled samples, $X_i$ an individual methylation sample ($1 \times M$, where $M$ is the total number of probes), and $\hat{X}_{S,T}$ the imputed combination mean sample representing the species $S$ and tissue $T$.
(1) The species baseline imputed a combination mean sample by taking the average of all training samples of the target species:
$$\hat{X}_{S,T} = \frac{1}{N_S} \sum_{i=1}^{N} \mathbb{1}_S(i)\, X_i ,$$
where $\mathbb{1}_S(i)$ is an indicator variable indicating whether a sample $i$ is from a particular species $S$, and $N_S$ is the number of experimentally profiled samples within a species $S$.
(2) The tissue baseline imputed a combination mean sample by taking the average of all training samples of a target tissue:
$$\hat{X}_{S,T} = \frac{1}{N_T} \sum_{i=1}^{N} \mathbb{1}_T(i)\, X_i ,$$
where $\mathbb{1}_T(i)$ is an indicator variable indicating whether a sample $i$ is from a particular tissue $T$, and $N_T$ is the number of experimentally profiled samples within a tissue $T$.
(3) The global baseline imputed a combination mean sample by taking the average of all training samples:
$$\hat{X}_{S,T} = \frac{1}{N} \sum_{i=1}^{N} X_i .$$
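The three baselines amount to masked averages over the training samples. The following NumPy sketch illustrates them on a tiny made-up dataset; the function name `mean_baselines` and the example values are hypothetical.

```python
import numpy as np

def mean_baselines(X, species, tissue, target_species, target_tissue):
    """Return the species, tissue, and global baseline samples (each 1 x M)."""
    species = np.asarray(species)
    tissue = np.asarray(tissue)
    return {
        # Average of all training samples from the target species.
        "species": X[species == target_species].mean(axis=0),
        # Average of all training samples from the target tissue.
        "tissue": X[tissue == target_tissue].mean(axis=0),
        # Average of all training samples.
        "global": X.mean(axis=0),
    }

# Three training samples, two probes (hypothetical values).
X = np.array([[0.2, 0.4],
              [0.6, 0.8],
              [1.0, 0.0]])
species = ["horse", "horse", "human"]
tissue = ["liver", "blood", "blood"]

# Impute the non-observed combination (human, liver).
baselines = mean_baselines(X, species, tissue, "human", "liver")
print(baselines)
```

Note that none of the baselines uses both labels at once: the species baseline ignores the target tissue, the tissue baseline ignores the target species, and the global baseline ignores both.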
Cross-validation datasets to compare imputed species-tissue combination mean samples with held-out observed data
To compare CMImpute and baseline predictions to held-out observed species-tissue combination mean samples, we created multiple training and testing datasets. We considered the 520 observed species-tissue combinations where the target species is available in more than one tissue type and the target tissue is available in more than one species. We randomly divided these combinations into five folds, resulting in 465 imputed species-tissue combinations for evaluation. These 465 combinations correspond to 134 species with data from more than one tissue type available and 23 tissues with data from more than one species available. This final number is smaller than the 520 combinations initially considered because we only evaluated the imputation performance of a species-tissue combination if both same-species different-tissue and same-tissue different-species data were available in the corresponding training dataset.
In cross-validation, we considered each of the five folds a testing dataset. When each fold was considered as a testing dataset, the remaining data outside the fold was included in either the training or validation dataset. To determine which combinations were included in the training or validation dataset, we first randomly divided the combinations into candidate training and validation datasets. To perform this division, we randomly selected 20% of the species-tissue combinations to form the candidate validation dataset, while the remaining 80% of combinations formed the candidate training dataset. For each combination in the candidate validation dataset, if the combination did not have at least one combination of the same species and at least one combination of the same tissue present in the candidate training dataset, then the combination was moved from the candidate validation dataset to the candidate training dataset. If at the end of this procedure the candidate validation dataset consisted of less than 10% of the remaining combinations outside the testing fold, we made a new split of the training and validation data and repeated the process. Otherwise, the candidate training and validation datasets were used for the final training and validation datasets. This process resulted in the validation dataset consisting of at least 10% of the combinations remaining outside the testing fold while still having same-species and same-tissue information available in the training dataset.
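The training and validation split procedure can be sketched as below. This is an illustrative sketch rather than the authors' code: the function name, the `max_tries` safeguard, and the choice to check each validation combination against the training set as it grows (rather than against the initial candidate training set) are assumptions.

```python
import random

def split_train_validation(combos, val_fraction=0.2, min_val_fraction=0.1,
                           seed=0, max_tries=1000):
    """Split (species, tissue) combinations outside the testing fold.

    A validation combination is kept only if the training set contains at
    least one combination of the same species and one of the same tissue;
    otherwise it is moved to training. The split is redrawn if the
    validation set ends up below min_val_fraction of all combinations.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        shuffled = list(combos)
        rng.shuffle(shuffled)
        n_val = round(val_fraction * len(shuffled))
        candidates, train = shuffled[:n_val], shuffled[n_val:]
        train_species = {s for s, _ in train}
        train_tissues = {t for _, t in train}
        val = []
        for s, t in candidates:
            if s in train_species and t in train_tissues:
                val.append((s, t))
            else:
                # Move the combination to training and update what training covers.
                train.append((s, t))
                train_species.add(s)
                train_tissues.add(t)
        if len(val) >= min_val_fraction * len(combos):
            return train, val
    raise RuntimeError("no valid split found within max_tries")

# Toy example: a full grid of 5 species by 4 tissues (hypothetical names).
combos = [(f"species{i}", f"tissue{j}") for i in range(5) for j in range(4)]
train, val = split_train_validation(combos)
print(len(train), len(val))
```

Because combinations are only ever moved from validation to training, every combination that remains in the validation set is guaranteed to have same-species and same-tissue data in the final training set.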
The hyperparameters (activation function, latent space dimension, learning rate, epsilon value, number of hidden layers, and hidden layer dimensions) were selected via grid search based on the validation performance. Using the models selected by hyperparameter tuning, we imputed species-tissue combination mean samples representing combinations held out from the corresponding training and validation sets.
For performance evaluations, we concatenated all imputed samples into one grid of species-tissue combination mean samples. For methylation value and pairwise correlation visualizations, we ordered both the samples and probes based on hierarchical clustering followed by optimal leaf ordering [40].
Prediction of non-observed species-tissue combinations
To impute non-observed species and tissue combinations, we first selected the hyperparameters for a model. We did this by creating four random 80–20% training–testing splits on observed combinations that involve a species with more than one tissue type and a tissue type with more than one species available. This criterion ensures that when a combination is held out during hyperparameter tuning, same-species different-tissue and same-tissue different-species information will still be available during training. Of the 746 experimentally profiled species-tissue combinations, 520 combinations satisfied this criterion (Additional file 1). Once the splits were created, we performed a hyperparameter grid search on each of the four splits and determined the highest performing hyperparameter combination for each split based on sample-wise Pearson correlation with held-out samples. For each of these four best hyperparameter combinations, we averaged the performance for those hyperparameters across all four random splits and selected the hyperparameters that resulted in the highest average. We saw the highest average sample-wise performance across all tuning datasets (0.933) for the following hyperparameter combination: two hidden layers of dimensions 1024 and 512, TanH activation function, latent space dimension of 8, learning rate of 0.001, and epsilon value of 0.0001. We then trained a single model based on these hyperparameter values using all available methylation samples. Finally, we used the trained model to generate samples of species-tissue combinations not experimentally profiled.
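The two-stage hyperparameter selection described above (best setting per split, then highest average across splits among those finalists) can be sketched as follows. The function name and the score table are hypothetical; the scores are made-up numbers chosen to show that a setting with the highest overall mean is not selected if it is never the best on any single split.

```python
def select_hyperparameters(scores):
    """scores[h] is a list with the performance of setting h on each split."""
    n_splits = len(next(iter(scores.values())))
    # Stage 1: the best-performing setting on each split becomes a finalist.
    finalists = []
    for k in range(n_splits):
        best = max(scores, key=lambda h: scores[h][k])
        if best not in finalists:
            finalists.append(best)
    # Stage 2: among finalists, pick the highest average across all splits.
    return max(finalists, key=lambda h: sum(scores[h]) / n_splits)

# Hypothetical sample-wise Pearson correlations on four splits.
scores = {
    "a": [0.95, 0.80, 0.80, 0.80],
    "b": [0.90, 0.92, 0.91, 0.90],
    "c": [0.93, 0.91, 0.93, 0.93],
    "d": [0.94, 0.915, 0.925, 0.925],  # highest mean, but never best on a split
}
selected = select_hyperparameters(scores)
print(selected)
```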
Probe variance calculations
We measured three types of variances across the experimentally profiled data to determine how a probe’s variance among different species and tissues impacts imputation performance. These three types of probe-wise variances are as follows: (1) inter-combination variance which measures the variance between species and tissue combinations, (2) mean inter-tissue variance which measures the average variance between tissues within a species, and (3) mean inter-species variance which measures the average variance between species within a tissue type.
Let $M$ be the number of probes in the mammalian methylation array. We used the following process for calculating the probe-wise inter-combination variance $\sigma^2_{\text{comb}}$ ($1 \times M$ vector) of the experimentally profiled data:
(1) Calculate the mean methylation of each observed species and tissue combination (e.g., human heart, horse liver). This step prevents the number of individual samples in a particular combination from skewing the variance calculation, so the variance is measured between combination mean samples. In the equations below, let $N$ be the number of observed samples, $N_c$ be the number of observed samples within a species and tissue combination $c$, $X_i$ be an individual methylation sample ($1 \times M$), $\mathbb{1}_c(i)$ be an indicator variable indicating whether a sample $i$ is from a particular species and tissue combination $c$, $\mu_c$ be the resulting mean methylation of a particular species and tissue combination ($1 \times M$), $C$ be the set of observed species and tissue combinations, and $\mu_C$ be the resulting $|C| \times M$ array containing the mean methylation value of each probe for each unique species-tissue combination. Formally, $\mu_c$ and $\mu_C$ are defined as:
$$\mu_c = \frac{1}{N_c} \sum_{i=1}^{N} \mathbb{1}_c(i)\, X_i$$
and
$$\mu_C = \left[ \mu_c \right]_{c \in C} .$$
(2) Calculate the variance of each probe across each unique species-tissue combination:
$$\sigma^2_{\text{comb}} = \operatorname{Var}_{c \in C}\left( \mu_c \right) .$$
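Below is the process for calculating the probe-wise mean inter-tissue variance ($\bar{\sigma}^2_{\text{tissue}}$) ($1 \times M$ vector) of the observed data.
(1) Let $S$ represent the set of species with more than one tissue type available within the species. Let $X$ represent the individual methylation samples from all species in $S$.
(2) For each species $s$ in $S$, calculate the mean methylation of each tissue. Let $T_s$ be the set of observed tissues within species $s$, and let $\mu_{(s,t)}$ be the mean ($1 \times M$) of the samples in $X$ from species $s$ and tissue $t$. $\mu_{T_s}$ is the resulting $|T_s| \times M$ array containing the average methylation value for each tissue observed in the target species. Formally, $\mu_{T_s}$ is defined as
$$\mu_{T_s} = \left[ \mu_{(s,t)} \right]_{t \in T_s} .$$
(3) For each species $s$ in $S$, calculate the variance of each probe across each tissue available in the species. $\sigma^2_{T_s}$ is the resulting variance of each probe across each tissue observed in the target species, that is
$$\sigma^2_{T_s} = \operatorname{Var}_{t \in T_s}\left( \mu_{(s,t)} \right) .$$
(4) Calculate the average variance for each probe across all species in $S$:
$$\bar{\sigma}^2_{\text{tissue}} = \frac{1}{|S|} \sum_{s \in S} \sigma^2_{T_s} .$$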
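Below is the process for calculating the probe-wise mean inter-species variance ($\bar{\sigma}^2_{\text{species}}$) ($1 \times M$ vector) of the observed data.
(1) Let $T$ represent the set of tissues profiled in more than one species. Let $X$ represent the individual methylation samples from all tissues in $T$.
(2) For each tissue $t$ in $T$, calculate the mean methylation of each species. $S_t$ is the set of experimentally profiled species within tissue $t$, and $\mu_{(s,t)}$ is the mean ($1 \times M$) of the samples in $X$ from species $s$ and tissue $t$. $\mu_{S_t}$ is the resulting $|S_t| \times M$ array containing the average methylation value for each species observed in the target tissue. Formally, $\mu_{S_t}$ is defined as
$$\mu_{S_t} = \left[ \mu_{(s,t)} \right]_{s \in S_t} .$$
(3) For each tissue $t$ in $T$, calculate the variance of each probe across each species available in the tissue. $\sigma^2_{S_t}$ is the resulting variance of each probe across each species observed in the target tissue, that is
$$\sigma^2_{S_t} = \operatorname{Var}_{s \in S_t}\left( \mu_{(s,t)} \right) .$$
(4) Calculate the average variance for each probe across all tissues in $T$:
$$\bar{\sigma}^2_{\text{species}} = \frac{1}{|T|} \sum_{t \in T} \sigma^2_{S_t} .$$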
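The three probe-wise variances can be sketched in NumPy as below. This is a minimal illustration on made-up data; the function names are hypothetical, and the use of the population variance (NumPy's default, `ddof=0`) is an assumption, since the text does not state which variance estimator was used.

```python
import numpy as np

def group_means(X, labels):
    """Return an (n_groups x M) array with the mean sample of each distinct label."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    return np.vstack([X[idx].mean(axis=0) for idx in groups.values()])

def inter_combination_variance(X, species, tissue):
    """Variance of each probe across species-tissue combination means."""
    return group_means(X, list(zip(species, tissue))).var(axis=0)

def mean_within_variance(X, outer, inner):
    """For each outer group with more than one inner group, take the variance
    across inner-group means, then average those variances over outer groups."""
    outer_groups = {}
    for i, lab in enumerate(outer):
        outer_groups.setdefault(lab, []).append(i)
    variances = []
    for idx in outer_groups.values():
        inner_labels = [inner[i] for i in idx]
        if len(set(inner_labels)) > 1:
            variances.append(group_means(X[idx], inner_labels).var(axis=0))
    return np.mean(variances, axis=0)

# Four samples, two probes (hypothetical values).
X = np.array([[0.2, 0.5],
              [0.4, 0.5],
              [0.7, 0.5],
              [0.1, 0.9]])
species = ["horse", "horse", "horse", "human"]
tissue = ["liver", "liver", "blood", "blood"]

var_comb = inter_combination_variance(X, species, tissue)
var_tissue = mean_within_variance(X, species, tissue)   # mean inter-tissue variance
var_species = mean_within_variance(X, tissue, species)  # mean inter-species variance
print(var_comb, var_tissue, var_species)
```

Averaging within each combination first means the two horse liver samples count once, as a single combination mean, rather than twice.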
Linear regression analysis of species-tissue combination mean samples relative to species maximum lifespan
We evaluated how predictive combination mean methylation samples were of the log-maximum lifespan of a species in a linear regression model through a leave-one-species-out (LOSO) analysis. For this we used maximum lifespan values for 114 species obtained from the AnAge database [53]. Our methodology follows a structure similar to that of Li et al. [18]. We implemented and trained the linear regression models in Python 3.9.13 using scikit-learn version 0.24.2. The LOSO analysis was performed in the following four settings:
Tissue-agnostic observed samples: Observed species-tissue combination mean samples for each observed tissue and species combination were averaged across all tissue types within a species to form a single average observation per species. The number of tissues being averaged in each species sample is dependent on the number of observed tissues available.
Tissue-agnostic imputed samples: Imputed species-tissue combination mean samples for each non-observed species and tissue combination were averaged across all tissue types within a species to form a single average sample per species. The number of tissues being averaged in each species sample is dependent on the number of imputed tissues available.
Tissue-specific observed samples: Instead of averaging across tissue types, each observed species-tissue combination mean sample remains intact and was used as a training sample. Each tissue within a species shares the same log-maximum lifespan. The number of samples held out in the LOSO analysis corresponds to the number of tissue types observed in the species.
Tissue-specific imputed samples: Imputed species-tissue combination mean samples were used as training samples. Each tissue within a species shares the same log-maximum lifespan. The number of samples held out in the LOSO analysis corresponds to the number of tissue types that are not observed in the species. The imputed combination mean samples span the same 114 species as the observed setting, but the tissues represented in a species do not overlap with the observed samples.
We evaluated the predictive performance of the tissue-agnostic species’ averages for both the observed and imputed data by computing the Pearson correlation and MSE between the predicted and reported log-maximum lifespans. We also calculated the Pearson correlation between the imputed and observed predicted log-maximum lifespan values across all species. We similarly evaluated the predictive performance of the tissue-specific species-tissue combination mean samples for both observed and imputed data using Pearson correlation and MSE. For each tissue, we computed the Pearson correlation and MSE between the predicted log-maximum lifespan for each species in which the tissue was observed and the reported log-maximum lifespan and averaged the Pearson correlations and MSEs across the tissues. We restricted this analysis to tissue types observed in three or more species as the Pearson correlation is either not defined or not informative when there are fewer.
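A leave-one-species-out evaluation of this kind can be sketched with scikit-learn as follows. This is an illustrative sketch on synthetic data, not the authors' pipeline: the function name `loso_predictions`, the synthetic profiles, and the use of `LeaveOneGroupOut` to hold out all samples of one species at a time are assumptions. The synthetic data has far fewer probes than samples, unlike real methylation data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut

def loso_predictions(X, log_lifespan, species):
    """Predict each species' log-maximum lifespan from a model trained
    on all samples from the other species."""
    preds = np.empty(len(log_lifespan))
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(X, log_lifespan, groups=species):
        model = LinearRegression().fit(X[train_idx], log_lifespan[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds

# Synthetic tissue-specific setting: 12 species, 2 tissues each, 3 probes.
rng = np.random.default_rng(0)
n_species, n_tissues, n_probes = 12, 2, 3
profiles = rng.uniform(0.2, 0.8, size=(n_species, n_probes))
w = np.array([1.0, -0.5, 2.0])  # hypothetical probe effects

species = np.repeat(np.arange(n_species), n_tissues)
X = np.repeat(profiles, n_tissues, axis=0) + 0.01 * rng.standard_normal((n_species * n_tissues, n_probes))
# Every tissue within a species shares the same log-maximum lifespan.
y = np.repeat(1.0 + profiles @ w, n_tissues)

preds = loso_predictions(X, y, species)
pearson = np.corrcoef(preds, y)[0, 1]
mse = np.mean((preds - y) ** 2)
print(pearson, mse)
```

In the tissue-agnostic settings, the rows of `X` would instead be one per-species average, so each held-out group would contain a single sample.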
Supplementary Information
Additional file 1. Number of experimentally profiled samples from each species-tissue combination used in the study. This data was made available by the Mammalian Methylation Consortium. Each row corresponds to a tissue and each column corresponds to a species. Species-tissue combinations with no samples are colored grey and combinations with at least one sample are colored green. Row and column totals shown in the second to last row and column, respectively. Number of unique tissues available in a species or unique species with a tissue type available shown in the last row and column, respectively. Species are sorted by number of unique tissues available in the species and tissues sorted by number of unique species with the tissue type available
Additional file 2. Supplementary figures S1-S28
Additional file 3. Cross-validation sample-wise Pearson correlation with held-out observed combination mean samples, when considering all probes, for each species-tissue combination. Results shown for CMImpute and all four baselines (species baseline, logistic regression, tissue baseline, global baseline). Performance values colored from dark to light grey for lower and higher performance, respectively
Additional file 4. Cross-validation sample-wise Pearson correlation with held-out observed combination mean samples, when considering the subset of highest-coverage probes, for each species-tissue combination. Results shown for CMImpute and all four baselines (species baseline, logistic regression, tissue baseline, global baseline). Performance values colored from dark to light grey for lower and higher performance, respectively
Additional file 5. Supplementary tables S1-S2
Additional file 6. List of probe annotation files used to select the subset of highest-coverage probes. All annotation files are available at https://github.com/shorvath/MammalianMethylationConsortium
Additional file 7. Review history. The review history is available as Additional file 7
Acknowledgements
We thank the Mammalian Methylation Consortium for generating the mammalian methylation array data. We thank Caesar Li for assistance with using the data. We thank members of Ernst Lab for their feedback.
Peer review information
Wenjing She was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Authors' contributions
E.M. and J.E. developed the CMImpute method. E.M. implemented the method, applied it to the mammalian methylation array data, and performed analyses. J.E. conceived the study and supervised the project. S.H. provided data and advice for the project. E.M. and J.E. wrote the main text. All authors participated in editing the text. All authors read and approved the final manuscript.
Funding
This work was supported by the National Institutes of Health (NIH) (DP1DA044371, U01MH130995) (J.E.), UCLA Jonsson Comprehensive Cancer Center, Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research Ablon Scholars Program (J.E.), and the NIH Training Grant in Genomic Analysis and Interpretation T32HG002536 (E.M.).
Data availability
CMImpute code and the full grid of imputed species-tissue combination mean samples can be found at [45, 46, 54]. The code is provided under the open source MIT license at [45]. All data used was previously published by the Mammalian Methylation Consortium [12] and available from the Gene Expression Omnibus GSE223748. Probe annotations can be found at [48].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The Regents of the University of California filed a patent (publication number WO2020150705) for the mammalian methylation array for which J.E. and S.H. are named inventors. S.H. is a founder of the non-profit Epigenetic Clock Development Foundation, which has licensed several patents from UC Regents, and distributes the mammalian methylation array. S.H. is also an employee of Altos Labs. The remaining author declares no competing interests.
References
- 1. Greenberg MVC, Bourc’his D. The diverse roles of DNA methylation in mammalian development and disease. Nat Rev Mol Cell Biol. 2019;20(10):590–607.
- 2. Moore LD, Le T, Fan G. DNA methylation and its basic function. Neuropsychopharmacology. 2013;38(1):23–38.
- 3. Berman BP, Weisenberger DJ, Aman JF, Hinoue T, Ramjan Z, Liu Y, et al. Regions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with nuclear lamina–associated domains. Nat Genet. 2012;44(1):40–6.
- 4. Horvath S, Raj K. DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nat Rev Genet. 2018;19(6):371–84.
- 5. Wilkinson GS, Adams DM, Haghani A, Lu AT, Zoller J, Breeze CE, et al. DNA methylation predicts age and provides insight into exceptional longevity of bats. Nat Commun. 2021;12(1):1615.
- 6. del C Gomez-Alonso M, Kretschmer A, Wilson R, Pfeiffer L, Karhunen V, Seppälä I, et al. DNA methylation and lipid metabolism: an EWAS of 226 metabolic measures. Clin Epigenet. 2021;13(1):7.
- 7. You C, Wu S, Zheng SC, Zhu T, Jing H, Flagg K, et al. A cell-type deconvolution meta-analysis of whole blood EWAS reveals lineage-specific smoking-associated DNA methylation changes. Nat Commun. 2020;11(1):4779.
- 8. Larison B, Pinho GM, Haghani A, Zoller JA, Li CZ, Finno CJ, et al. Epigenetic models developed for plains zebras predict age in domestic horses and endangered equids. Commun Biol. 2021;4(1):1–9.
- 9. Horvath S, Haghani A, Macoretta N, Ablaeva J, Zoller JA, Li CZ, et al. DNA methylation clocks tick in naked mole rats but queens age more slowly than nonbreeders. Nat Aging. 2022;2(1):46–59.
- 10. Horvath S, Haghani A, Peng S, Hales EN, Zoller JA, Raj K, et al. DNA methylation aging and transcriptomic studies in horses. Nat Commun. 2022;13(1):40.
- 11. Lowdon RF, Jang HS, Wang T. Evolution of epigenetic regulation in vertebrate genomes. Trends Genet. 2016;32(5):269–83.
- 12. Haghani A, Li CZ, Robeck TR, Zhang J, Lu AT, Ablaeva J, et al. DNA methylation networks underlying mammalian traits. Science. 2023;381(6658):eabq5693.
- 13. Lu AT, Fei Z, Haghani A, Robeck TR, Zoller JA, Li CZ, et al. Universal DNA methylation age across mammalian tissues. Nat Aging. 2023;3(9):1144–66.
- 14. Arneson A, Haghani A, Thompson MJ, Pellegrini M, Kwon SB, Vu H, et al. A mammalian methylation array for profiling methylation levels at conserved sequences. Nat Commun. 2022;13(1):783.
- 15. Zhou J, Sears RL, Xing X, Zhang B, Li D, Rockweiler NB, et al. Tissue-specific DNA methylation is conserved across human, mouse, and rat, and driven by primary sequence conservation. BMC Genomics. 2017;18(1):724.
- 16. Klughammer J, Romanovskaia D, Nemc A, Posautz A, Seid CA, Schuster LC, et al. Comparative analysis of genome-scale, base-resolution DNA methylation profiles across 580 animal species. Nat Commun. 2023;14(1):232.
- 17. Ding W, Kaur D, Horvath S, Zhou W. Comparative epigenome analysis using Infinium DNA methylation BeadChips. Brief Bioinform. 2023;24(1):bbac617.
- 18. Li CZ, Haghani A, Yan Q, Lu AT, Zhang J, Fei Z, et al. Epigenetic predictors of species maximum life span and other life-history traits in mammals. Sci Adv. 2024;10(23):eadm7273.
- 19. Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, et al. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011;6(6):692–702.
- 20. Lippman Z, Gendrel AV, Colot V, Martienssen R. Profiling DNA methylation patterns using genomic tiling microarrays. Nat Methods. 2005;2(3):219–24.
- 21. Moran S, Arribas C, Esteller M. Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics. 2016;8(3):389–99.
- 22. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 2012;13(10):R83.
- 23. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005;33(18):5868–77.
- 24. Kurdyukov S, Bullock M. DNA methylation analysis: choosing the right method. Biology. 2016;5(1):3.
- 25. Garcia-Prieto CA, Álvarez-Errico D, Musulen E, Bueno-Costa A, N. Vazquez B, Vaquero A, et al. Validation of a DNA methylation microarray for 285,000 CpG sites in the mouse genome. Epigenetics. 2022;17(12):1677–85.
- 26. Yu F, Xu C, Deng HW, Shen H. A novel computational strategy for DNA methylation imputation using mixture regression model (MRM). BMC Bioinformatics. 2020;21(1):552.
- 27. Zou LS, Erdos MR, Taylor DL, Chines PS, Varshney A, Parker SCJ, et al. BoostMe accurately predicts DNA methylation values in whole-genome bisulfite sequencing of multiple human tissues. BMC Genomics. 2018;19(1):390.
- 28. Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18(1):67.
- 29. Tian Q, Zou J, Tang J, Fang Y, Yu Z, Fan S. MRCNN: a deep learning model for regression of genome-wide DNA methylation. BMC Genomics. 2019;20(2):192.
- 30. Qiu YL, Zheng H, Gevaert O. Genomic data imputation with variational auto-encoders. GigaScience. 2020;9(8):giaa082.
- 31. Ernst J, Kellis M. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat Biotechnol. 2015;33(4):364–76.
- 32. Choi J, Chae H. methCancer-gen: a DNA methylome dataset generator for user-specified cancer type based on conditional variational autoencoder. BMC Bioinform. 2020;21(1):181.
- 33. Durham TJ, Libbrecht MW, Howbert JJ, Bilmes J, Noble WS. PREDICTD PaRallel epigenomics data imputation with cloud-based tensor decomposition. Nat Commun. 2018;9(1):1402.
- 34. Schreiber J, Durham T, Bilmes J, Noble WS. Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biol. 2020;21(1):81.
- 35. Schreiber J, Hegde D, Noble W. Zero-shot imputations across species are enabled through joint modeling of human and mouse epigenomics. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '20). Association for Computing Machinery, New York, NY, USA, Article 39, 1–9. 2020. 10.1145/3388440.3412412.
- 36. Lim J, Ryu S, Kim JW, Kim WY. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J Cheminformatics. 2018;10(1):31.
- 37. Wang Z, Wang Y. Extracting a biologically latent space of lung cancer epigenetics with variational autoencoders. BMC Bioinformatics. 2019;20(18):568.
- 38. Kingma DP. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 2013.
- 39. Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. Adv Neural Inf Process Syst. 2015;2:3483–91.
- 40. Bar-Joseph Z, Gifford DK, Jaakkola TS. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics. 2001;17(suppl_1):S22-9.
- 41. Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv Neural Inf Process Syst. 2017;30:5574–84.
- 42. Zhang J, Dai Y, Xiang M, Fan DP, Moghadam P, He M, Barnes N. Dense uncertainty estimation. arXiv preprint arXiv:2110.06427. 2021.
- 43. Ran X, Xu M, Mei L, Xu Q, Liu Q. Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation. Neural Netw. 2022;1(145):199–208.
- 44. Böhm V, Lanusse F, Seljak U. Uncertainty quantification with generative models. arXiv preprint arXiv:1910.10046. 2019.
- 45. Maciejewski E, Horvath S, Ernst J. CMImpute. Github. 2023. https://github.com/ernstlab/CMImpute.
- 46. Maciejewski E, Ernst J, Horvath S. Cross-species and tissue imputation of species-level DNA methylation samples. Zenodo. 2024. https://zenodo.org/records/13376705.
- 47. Zhou W, Triche TJ, Laird PW, Shen H. SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 2018;46(20):e123.
- 48. Haghani A, Horvath S. Mammalian Methylation Consortium. GitHub. 2023. https://github.com/shorvath/MammalianMethylationConsortium.
- 49. Doersch C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. 2016.
- 50. Kingma DP. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
- 51. Zhao T, Zhao R, Eskenazi M. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960. 2017.
- 52. Pagnoni A, Liu K, Li S. Conditional variational autoencoder for neural machine translation. arXiv preprint arXiv:1812.04405. 2018.
- 53. de Magalhães JP, Curado J, Church GM. Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics. 2009;25(7):875–81.
- 54. Maciejewski E, Ernst J, Horvath S. CMImpute: cross-species and tissue imputation of species-level DNA methylation samples across mammalian species. Zenodo. 2025. 10.5281/zenodo.14675967.







