Abstract
The cumulative number of stem cell divisions in a tissue, known as mitotic age, is thought to be a major determinant of cancer-risk. Somatic mutational and DNA methylation (DNAm) clocks are promising tools to molecularly track mitotic age, yet their relationship is underexplored and their potential for cancer risk prediction in normal tissues remains to be demonstrated. Here we build and validate an improved pan-tissue DNAm counter of total mitotic age called stemTOC. We demonstrate that stemTOC’s mitotic age proxy increases with the tumor cell-of-origin fraction in each of 15 cancer-types, in precancerous lesions, and in normal tissues exposed to major cancer risk factors. Extensive benchmarking against 6 other mitotic counters shows that stemTOC compares favorably, specially in the preinvasive and normal-tissue contexts. By cross-correlating stemTOC to two clock-like somatic mutational signatures, we confirm the mitotic-like nature of only one of these. Our data points towards DNAm as a promising molecular substrate for detecting mitotic-age increases in normal tissues and precancerous lesions, and hence for developing cancer-risk prediction strategies.
Subject terms: Computational biology and bioinformatics, Cancer genomics, Cancer stem cells, Epigenomics, Ageing
DNA methylation (DNAm) clocks can track mitotic age, but their potential use for cancer risk prediction remains less explored. Here, the authors develop a DNAm counter of total mitotic age (stemTOC) that shows an increase of mitotic age in normal tissues and precancerous lesions.
Introduction
A key priority area of precision and preventive oncology is to develop novel strategies for early detection and risk prediction1,2. Given the growing evidence that the risk of neoplastic transformation of any given tissue in any given individual is strongly influenced by the cumulative number of stem-cell divisions (aka mitotic age) that the underlying cell-of-origin has undergone2–5, the ability to measure this mitotic age in human tissues could help address this clinical need. Underpinning the association of mitotic age with cancer risk is the gradual accumulation of molecular alterations following stem-cell division, eventually predisposing specific subclones to neoplastic transformation6–12. Conversely, these molecular alterations, if measured, could be used to estimate mitotic age11,13–22.
Among the potential molecular alterations that could be used to measure mitotic age, somatic mutations and DNA methylation (DNAm) have emerged as the most promising ones11,14,15,23. DNAm-based mitotic-like counters hypothesized to track the cumulative number of DNA methylation errors arising during cell division in both stem-cell and expanding progenitor cell populations have been proposed16–18,24–29, yet confounders such as cell-type heterogeneity (CTH)30 and chronological age24 have cast doubt on the biological and clinical significance of these counters and their total mitotic age estimates (defined here as the cumulative number of divisions in both stem-cells and amplifying progenitor populations). Moreover, DNAm changes in normal and preneoplastic lesions often display a stochastic pattern19,31,32, which can pose specific statistical challenges to estimating mitotic age. Finally, although the relation between clock-like DNAm and somatic mutational signatures has already undergone preliminary investigations18,33, these studies have only explored this in the context of B cell malignancies33 or did not directly link total mitotic age estimates to the specific somatic mutational signatures from Alexandrov et al.18,34. Exploring this relationship across multiple cancer types is vital to better understand which molecular substrate may be more suitable for tracking mitotic age on the time scales in which precancerous lesions progress to invasive cancer35,36.
Here we build and validate a pan-tissue epigenetic counter of total mitotic age called stemTOC (Stochastic Epigenetic Mitotic Timer of Cancer) that addresses the above challenges, and subsequently apply it in conjunction with a DNAm atlas and advanced cell-type deconvolution methods37,38, to demonstrate that a sample’s total mitotic age correlates with the fraction of its putative tumor cell-of-origin, thus establishing a direct link between mitotic age and tumor progression. We show that stemTOC can track the subtle increases in mitotic age of normal tissues exposed to cancer-risk factors, or at risk of cancer development. By cross-correlating stemTOC to clock-like mutational signatures, we find that the number of C > T mutations representing deamination of 5-methylcytosine is a marker of mitotic age. Collectively, our data point towards a potential advantage of DNAm changes over clock-like mutational signatures11,34 as a means of tracking total mitotic age in normal tissues at cancer risk, whilst also adding substantial evidence to the hypothesis put forward by Tomasetti and Vogelstein3, that the mitotic age of a tissue is a major cancer-risk factor.
Results
Construction of stemTOC
We aimed to build a mitotic counter that is, to the largest extent possible, unaffected by confounders such as CTH and chronological age, and which can also account for the potential stochasticity of DNAm changes. To reduce confounding by CTH it is critical to build a mitotic counter with CpGs that are not cell-type specific in an appropriate ground state39. To this end, we collected Illumina 450k/EPIC DNAm profiles from 86 fetal samples encompassing 13 fetal/neonatal tissue-types (“Methods” section), to select 30,257 promoter-associated CpGs that are constitutively unmethylated across all fetal/neonatal tissue-types (Fig. 1a). To avoid confounding by chronological age, we used the cell-line data from Endicott et al.24 to require that CpGs, which undergo significant DNA hypermethylation (i.e. displaying gains of DNAm) with increased population doublings (PDs) in-vitro across a range of different normal cell lines, that they simultaneously do not undergo hypermethylation in these same cell lines when treated with a cell-cycle inhibitor (e.g mitomycin) or under reduced growth-promoting conditions (i.e. serum deprivation) (“Methods” section). A total of 6 cell lines were considered representing a variety of cell types including fibroblasts, smooth muscle, and endothelial cells24. Of the 30,257 CpGs, only 629 (denoted “vitro-mitCpGs”, Fig. 1a) satisfied these requirements. In order to avoid confounding by cell-culture effects, we next required these CpGs to also undergo significant DNA hypermethylation with chronological age in three separate large in-vivo blood DNAm datasets (Fig. 1b). Since the cell-line data removes many CpGs that accumulate DNA hypermethylation purely because of “passage of time” (and hence chronological age), this additional requirement is designed to ascertain that these vitro-mitCpGs do display age-associated DNA hypermethylation in-vivo, in line with the expectation that in a tissue with a relatively high stem-cell division rate (i.e. blood), mitotic age and chronological age should be strongly correlated. Blood-tissue was chosen for another 2 reasons. First, the availability of many large whole-blood DNAm datasets ensures adequate power to detect DNAm changes associated with chronological and mitotic age. Second, for blood-tissue, we can adjust for CTH at high resolution (12 immune cell subtypes)40,41, which further ensures that the observed DNAm changes are not due to shifts in underlying cell-type proportions31. At the end of this step, we thus obtained a reduced subset of 371 mitotic CpG candidates, which we call “vivo-mitCpGs” defining our stemTOC CpGs (Fig. 1b, Supplementary Data 1). Next, we subjected these 371 stemTOC CpGs to further tests to verify their validity. First, using independent DNAm data from neonatal buccal swabs and cord blood we verified that our 371 stemTOC CpGs retain ultra-low DNAm levels, as required (“Methods” section, Supplementary Fig. 1). Second, we used 3 separate age-matched DNAm datasets of sorted cells42–44 (including adult neutrophils, monocytes, naïve CD4+ T-cells, B cells, neurons, adipocytes, endothelial cells, lung- epithelial, colon-epithelial, hepatocytes, exocrine and endocrine pancreas), as well as the age-matched multi-tissue eGTEX dataset45, to confirm that stemTOC CpGs are not cell-type specific markers of adult cell types (“Methods” section, Supplementary Figs. 2 and 3). Indeed, any cell-type specific DNAm differences in these aged cell and tissue-types were exclusively restricted to comparisons involving colon-epithelial cells, the tissue with the highest turnover rate, thus clearly indicating that these DNAm differences are likely due to differences in cell-type specific mitotic rates (Supplementary Figs. 2 and 3). Applying eFORGE246,47 to the 371 stemTOC CpGs, we observed strong and exclusive enrichment for DNase Hypersensitive Sites (DHSs) as defined in hESCs (Supplementary Fig. 4a). Among chromatin states, we observed strong enrichment for bivalent transcription start sites (TSS), which was strongest for those also defined in hESCs (Supplementary Fig. 4b). Correspondingly, we observed strong enrichment for both H3K27me3 and H3K4me3 marks, although notably stronger for the repressive H3K27me3 mark (Supplementary Fig. 4a), consistent with previous observations that H3K4me3 is moderately protective of age-associated DNAm accrual39,48.
Based on the 371 vivo-mitCpGs, we finally derive a relative estimate of total mitotic age that explicitly accounts for the underlying stochasticity of mitotic DNAm changes in normal tissues19,31,32 (“Methods” section, Fig. 1c). This stochasticity implies that the 371 vivo-mitCpGs could display marked variations in DNAm within a tissue and subject, owing to the presence of multiple subclones of varying carcinogenic potential. We reasoned that a specific CpG subset displaying the highest DNAm levels could mark the clone at the highest risk and that this subset may vary randomly between subjects19,32,49. We built a simple yet realistic simulation model (“Methods” section) to demonstrate how taking a specific upper quantile of the DNAm distribution over mitotic CpGs should yield an improved estimator of total mitotic age compared to taking an average DNAm over these same CpGs (Fig. 1c)19,32. In effect, taking an upper quantile can better capture the mitotic age of any dominant subclone within the complex subclonal mosaic characteristic of any aging tissue19,32,50–56. To identify a sensible upper-quantile threshold, we computed upper quantiles for the 371 vivo-mitCpGs in a real DNAm dataset encompassing 42 normal-breast samples adjacent to breast cancer (“normal-tissue at risk”) and 50 age-matched normal-breast samples from healthy women (“normal healthy”)32 (“Methods” section, Fig. 1d). This revealed substantially larger effect sizes for higher upper-quantile values consistent with DNAm changes in the normal tissue at risk being highly stochastic. We decided on a 95% upper quantile threshold to maximize effect size without compromising sampling variability (Fig. 1d).
Validation of stemTOC in normal tissues and sorted cell populations
One way to assess if stemTOC is measuring mitotic age in in-vivo human tissue samples is by cross-comparing correlation strengths of mitotic age estimates with chronological age across tissues characterized by very different stem-cell division rates. We reasoned that for tissues with a high stem-cell division rate, the total mitotic age will be strongly determined by the subject’s chronological age, whilst for tissues with a low stem-cell division rate, or for tissues where the mitotic age is more strongly influenced by temporary turnover or other factors operating on shorter time scales (e.g. hormonal factors), the correlation between mitotic age and chronological age should be much weaker or non-evident. Applying stemTOC to the normal-adjacent tissue data from the TCGA confirmed this: tissues with a high stem-cell division rate like the colon and rectum displayed strong correlations between mitotic and chronological age, whilst slower-dividing and hormone-sensitive tissues like breast and endometrium did not, despite these being appropriately powered (Fig. 2a, b, Supplementary Fig. 5, Supplementary Data 2). Similar results were obtained when estimating mitotic age in the enhanced GTEX (eGTEX) DNAm dataset, encompassing 987 samples from 9 normal tissue-types (Supplementary Fig. 6, Supplementary Data 3). For instance, the colon displayed a clear correlation, whilst the ovary and breast did not (Supplementary Fig. 6). Applying stemTOC to large purified immune cell datasets from healthy individuals (“Methods” section) also revealed relatively strong correlations with chronological age (Fig. 2c, d, Supplementary Data 4), in line with the hematopoietic system’s relatively high stem-cell division rate3,57. Of note, the correlation of stemTOC with chronological age was also observed in a fairly large dataset of 139 naïve CD4+ T-cell samples (Fig. 2c, d), supporting the view that the observed correlation in sorted CD4+ T-cell datasets is not the result of age-associated shifts in the naïve to mature CD4+ T-cell fractions41. Finally, in sorted neurons, a predominantly post-mitotic population, the correlation was generally speaking not significant (Fig. 2e).
To further demonstrate that the associations of stemTOC with chronological age are not confounded by cell-type heterogeneity, we assembled a large collection of 18 whole-blood cohorts encompassing 14,515 samples (“Methods” section), computing associations of stemTOC with chronological age in each one of them, before and after adjustment for immune-cell fractions using our recently validated DNAm reference matrix for 12 immune-cell types41. Associations of mitotic age with chronological age remained significant and even displayed a marginal increase after adjustment for CTH (Supplementary Fig. 7). Similar findings were obtained in the eGTEX normal tissue datasets for which we could infer cell-type fractions using our EpiSCORE DNA-atlas38 (“Methods” section, Supplementary Fig. 8a, b). Thus, the correlation of stemTOC with chronological age is clearly independent of any putative age-associated variations in cell-type fractions as well as of any differences in the mitotic ages of underlying cell types.
To further validate stemTOC, we compared the estimated mitotic ages of tissues stratified by high, medium, and low stem-cell division rates3,16. Using the normal-adjacent samples from the TCGA encompassing 15 tissue-types, we observed a strong correlation (PCC = 0.94) between stemTOC’s median mitotic age of the tissue and the tissue-specific annual intrinsic rate of stem-cell division, as estimated by Tomasetti & Vogelstein3,8 (Fig. 2f, “Methods” section). This association was confirmed by plotting stemTOC against the estimated total number of stem-cell divisions for each normal-adjacent sample on a log-scale, revealing clear differences in mitotic age between low, medium, and high stem-cell division rate tissues (Fig. 2g).
Benchmarking of stemTOC in normal tissues reveals improvements
We next benchmarked stemTOC against 6 previously proposed DNAm-based mitotic counters, five of which are built from, or restrict to, 450k array probes (epiTOC230, epiTOC116, HypoClock18,30 and EpiCMIT-hyper/hypo33), with the remaining one (RepliTali24) being built entirely from EPIC beadarrays. CpGs making up each clock displayed relatively little overlap except for the epiTOC1 and epiTOC2 clocks (Supplementary Fig. 9). We observed that counters based on hypermethylated CpGs (i.e. stemTOC, epiTOC1/2 and epiCMIT-hyper) displayed stronger correlations with age than counters based on hypomethylated ones (i.e. displaying loss of DNAm) (HypoClock, RepliTali, epiCMIT-hypo) (Fig. 2b). Similar patterns were observed in the normal-tissue EPIC DNAm data from eGTEX (Supplementary Fig. 6) as well as in the sorted immune-cell subsets (Fig. 2d). StemTOC and the hypermethylation-based counters also displayed much stronger correlations with the Vogelstein–Tomasetti stem-cell division rates, as assessed over 15 tissue-types (Fig. 2h), suggesting that these counters yield more accurate mitotic age estimates during normal/healthy aging compared to those based on hypomethylation.
To explore the effect of CTH and the additional caveat that RepliTali is EPIC-based and thus less likely to perform well on 450k datasets, we compared regression statistics with chronological age before and after adjustment for CTH, and, in the case of whole blood, restricting also only to EPIC datasets. We observed that hypermethylated counters displayed less sensitivity to the underlying CTH of the tissue, as compared to the hypomethylated ones (Supplementary Figs. 7, 8, 10). It is also noteworthy that stemTOC’s mitotic age displayed stronger correlations with chronological age than those of all hypomethylated clocks, including RepliTali (Supplementary Figs. 8b, S10b). For instance, in a high-turnover-rate tissue (colon eGTEX EPIC dataset), RepliTali’s association with chronological age was only significant upon adjustment for CTH, whilst stemTOC’s association was stronger and independent of CTH adjustment (Supplementary Fig. 8b). Moreover, the effect of beadarray technology seemed to be relatively minor as RepliTali’s mitotic age estimates on EPIC data were very robust upon restricting to 450k probes (Supplementary Fig. 8c). Thus, overall, the results in normal tissue suggest an improvement of stemTOC and the other hypermethylated clocks over the hypomethylated based ones.
stemTOC’s mitotic age correlates with tumor cell-of-origin fraction
We reasoned that tumor samples of higher tumor purity would display a higher mitotic age if mitotic age is a key marker of cancer progression. One way to estimate tumor purity is by estimating the tumor cell-of-origin fraction in cancer samples, which could be accomplished with our EpiSCORE algorithm and DNAm atlas encompassing tissue-specific DNAm reference matrices for 13 tissue and 40 cell types (“Methods” section)38. Before estimating cell-type fractions in the TCGA samples however, we sought additional validation of our DNAm atlas using an independent whole-genome bisulfite-sequencing (WGBS) DNAm atlas of 102 sorted samples from Loyfer et al.58 encompassing 15 tissue-subtypes and 17 cell types (“Methods” section). Applying EpiSCORE we were able to correctly predict the cell type in Loyfer’s WGBS DNAm atlas with an overall 85% accuracy (Fig. 3a). Given this good validation, we next applied stemTOC and EpiSCORE to the TCGA samples (Supplementary Data 5–21), which revealed strong correlations between mitotic age and tumor cell-of-origin fraction, especially for those tumor-types where the cell-of-origin is reasonably well established (Fig. 3b). For instance, colon and rectal adenocarcinoma (COAD & READ) displayed very strong correlations between stemTOC’s mitotic age and the epithelial cell fraction (Fig. 3b). Mitotic age in luminal and basal breast cancer displayed strongest correlations with the luminal and basal fractions, respectively. In the case of pancreatic adenocarcinoma (PAAD), mitotic age was most strongly correlated with the ductal fraction. Liver hepatocellular- (LIHC) and cholangio- (CHOL) carcinoma displayed the strongest correlations with hepatocyte and cholangiocyte fractions, as required. Mitotic age in skin cutaneous melanoma (SKCM) displayed the strongest correlations with the melanocyte fraction. Lung squamous cell (LUSC) and lung-adeno (LUAD) carcinoma displayed the strongest correlations with basal and alveolar epithelial fractions, respectively. It is worth stressing that for each tumor-type, the strongest correlation with mitotic age was attained by the presumed tumor cell-of-origin (Supplementary Fig. 11a, b). stemTOC’s mitotic age also displayed strong correlations with other tumor purity indices as estimated by Aran et al.59, although the correlations with EpiSCORE’s tumor cell fraction were generally higher than for gene-expression or IHC-based tumor purity estimation methods (Supplementary Fig. 11c). These data clearly indicate that despite the CTH of TCGA samples, the mitotic age estimates from stemTOC are tracking DNAm changes in the tumor cell-of-origin of the respective cancer type. Mitotic age estimates can also inform on the potential cell-of-origin in tumor types where this is controversial or less well established. For instance, in low-grade glioma (LGG) and glioblastoma multiforme (GBM), mitotic age correlated most strongly with astrocyte fraction (Fig. 3c, Supplementary Fig. 11b), favoring an astrocyte-like progenitor as the putative cell of origin60. Prostate adenocarcinoma (PRAD) has long been linked to a luminal phenotype61,62, yet studies have also indicated a potential role for basal cells in the initiation of this cancer type63. Our analysis is consistent with the known luminal-cell expansion associated with PRAD (Fig. 3c, Supplementary Fig. 11b), and the prevailing view that the cell-of-origin of human PRAD is a luminal progenitor cell64–66.
Of note, the correlation strengths with tumor cell-of-origin fraction obtained with stemTOC were significantly stronger than those obtained with the other 6 counters (Fig. 3d, e). Moreover, none of the other 6 mitotic counters achieved 100% accuracy in predicting the putative cell-of-origin (Supplementary Fig. 11d). We also combined epiCMIT-hyper and epiCMIT-hypo into epiCMIT following Duran-Ferrer et al.33, but in line with epiCMIT-hypo’s worse performance, stemTOC and epiCMIT-hyper also outperformed epiCMIT (Supplementary Fig. 12). However, all mitotic counters were universally accelerated in TCGA cancer types, as compared to their respective age-matched normal-adjacent tissue (Supplementary Fig. 13). To further explore the relationship between mitotic counters, we computed correlations between each pair of counters across all sorted immune cell subsets, the normal tissues from GTEX and the normal-adjacent and tumor samples from TCGA, demonstrating that the hypermethylated counters are well-correlated with each other and to a lesser extent also with RepliTali (Supplementary Fig. 14−16), which is noteworthy given the aforementioned relatively little CpG overlap between them (Supplementary Fig. 9). In summary, the correlation of stemTOC’s mitotic age with the tumor cell-of-origin fraction underscores its potential to quantify cancer risk.
stemTOC predicts increased mitotic age in precancerous lesions
We next assembled 9 DNAm datasets representing different normal healthy tissues and corresponding age-matched precancerous conditions (“Methods” section). Mitotic age, as estimated with stemTOC, was significantly higher in the precancerous tissue in each of these datasets (Fig. 4a). For instance, stemTOC’s mitotic age could discriminate neoplasia from benign lesions in the prostate, intestinal metaplasia from normal gastric mucosa, or Barrett’s esophagus from normal esophagus. Some of the other counters did not provide the expected discrimination, and a formal comparison of all counters across the 9 datasets revealed an overall improved performance of stemTOC (Supplementary Fig. 17). Using EpiSCORE we next estimated cell-type fractions in these datasets, and broadly speaking, for most tissue-types we observed correlations of the mitotic age with tumor cell-of-origin fraction in the precancerous lesions and in cancer itself, but not in the histologically normal tissue, although in many cases the number of samples with normal histology was much lower, which limits power to detect associations in this subgroup (Supplementary Fig. 18). For instance, in colon adenoma there was a clear correlation (R = 0.75, P = 3e-8, n = 39) but not so in normal colon (R = 0.44, P = 0.27, n = 8), likely due to the much lower number of normal samples (Supplementary Fig. 18). In premalignant cholangiocarcinoma (CCA) lesions, there was a strong correlation with cholangiocyte fraction (R = 0.45, P < 0.001, n = 60), which was not observed in normal liver samples (R = 0.16, P = 0.25, n = 50), probably because the cholangiocyte fraction is much lower in these normal samples (Supplementary Fig. S18). Correlations between stemTOC’s mitotic age with tumor cell-of-origin fraction in normal-adjacent tissue from the TCGA was broadly speaking also observed (e.g. COAD, BRCA, PRAD), but there were also exceptions (e.g. KIRP, LUAD) (Supplementary Fig. 19). Of note, using an upper quantile over the 371 stemTOC CpGs to define the mitotic age generally resulted in larger effect sizes as well as stronger correlations with cancer-status and tumor cell-of-origin fraction compared to taking an average, highlighting the importance of taking the underlying stochasticity of DNAm patterns into account (Supplementary Fig. 20). Overall, these data underscore the potential of stemTOC’s mitotic age to indicate cancer risk in precancerous lesions.
Correlation of mitotic age with smoking and obesity-associated inflammation
Smoking and obesity are two main cancer-risk factors that promote inflammation and which can increase the tissue’s intrinsic rate of stem-cell division67,68. Hence, one would expect the mitotic age of a tissue to correlate with the level of exposure to such cancer-risk factors2,8,67,69–75. To assess this, we analyzed an Illumina 450k DNAm dataset of 790 buccal swabs from healthy women all aged 53 at sample draw (“Methods” section)76. Buccal swabs contain approximately 50% squamous epithelial and 50% immune cells77 and we reasoned that the mitotic age in these buccal swabs should correlate with both the squamous epithelial fraction as well as an individual’s lifelong smoking habit. Estimating epithelial and immune-cell fractions following our validated HEpiDISH procedure77, we observed a strong correlation of stemTOC’s mitotic age with the squamous epithelial fraction, but importantly also a significant positive correlation with smoking status, which was independent of epithelial fraction (Multivariate linear regression, P = 5e-5, Fig. 4b).
Epithelial cells in lung tissue are the cell-of-origin of lung cancer, and so we next estimated stemTOC’s mitotic age in over 200 normal lung-tissue samples from eGTEX45, also encompassing smokers, ex-smokers, and never-smokers. Fractions for 7 cell types including epithelial cells were estimated using EpiSCORE (“Methods” section, Fig. 4c)38. In this cohort, donors were of different ages, and correspondingly, we observed a strong correlation between stemTOC’s mitotic age with chronological age (Fig. 4c). Importantly, multivariate regression analysis including age, epithelial fraction, and smoking status, revealed a significant positive correlation of mitotic age with smoking-exposure, with the mitotic age of older individuals displaying bigger differences between smokers and non-smokers (Fig. 4c).
As another example, we focused on an EPIC DNAm dataset of liver-tissue samples from 325 obese individuals diagnosed with non-alcoholic fatty liver disease (NAFLD)78, of which 210 displayed no fibrosis (grade-0), with the rest displaying severe fibrosis (grade 3–4) (“Methods” section). Fractions for 5 cell types including hepatocytes were estimated using EpiSCORE (“Methods” section)38, which confirmed the known reduction of hepatocyte fraction in NAFLD78 (Fig. 4d). We observed a strong correlation of stemTOC’s mitotic age with chronological age, as well as with disease stage, both of which were significant in a multivariate regression analysis that also included hepatocyte fraction (Fig. 4d). Thus, the increased mitotic age with NAFLD stage is observed despite the reduction in hepatocyte fraction, attesting to stemTOC’s sensitivity. Overall, these data support the view that stemTOC can track increased mitotic age in disease-relevant tissues exposed to major cancer-risk factors.
stemTOC correlates with clock-like somatic mutational signature−1
Next, we explored the relationship of stemTOC with somatic mutational “clock-like” signatures11,34,79. Among the somatic mutational signatures derived from the TCGA cancer samples, two single-base substitution signatures SBS1 and SBS5 (here termed MS1 and MS5) have been shown to be associated with chronological age, with MS1 in particular being linked to deamination of methylated CpG dinucleotides resulting in C > T mutations11. Given that somatic mutations and DNAm data in the same normal tissue specimens have not yet been extensively characterized, we asked if the mutational loads in TCGA cancer samples correlate with the estimated total (lifetime) number of stem-cell divisions (TNSC), as derived using the Vogelstein–Tomasetti estimates and chronological age at diagnosis (“Methods” section). As remarked previously79, although cancers display a clear acceleration of mitotic age, a broad correlation between a tumor’s mutational signature load and the baseline (normal) number of stem-cell divisions is expected if the signature is of a mitotic nature. Vindicating Alexandrov et al.11, MS1 displayed a correlation with the total number of stem-cell divisions (TNSC) across cancer types, whilst MS5 did not (Fig. 5a, b). To check that this correlation is independent of chronological age, we computed for each cancer type the median mutational load per Mbp and calendar year over all samples of a given cancer type, and compared it to the intrinsic rate of stem-cell division of the corresponding normal tissue-type, revealing a positive correlation for MS1 but not for MS5 (Fig. 5c, d). Like MS1, stemTOC displayed a clear positive correlation with TNSC across tumors (Fig. 5e), which was also independent of age (Fig. 5e, f). However, unlike MS1, stemTOC displayed a saturation effect, with highly proliferative cancer types displaying high stemTOC-values despite the relatively low turnover rate of the underlying normal tissue (e.g. lung cancers). We verified that this saturation effect is also observed if we define stemTOC in terms of an average DNAm over the 371 CpGs (as opposed to taking the upper quantile), indicating that this is not a technical artifact of our stemTOC definition (Supplementary Fig. 21). Given that in normal-adjacent tissues, stemTOC displays correlations with TNSC without evidence of a saturation effect (Fig. 2f, g), this suggests that stemTOC is a sensitive marker of mitotic age.
Next, we asked if stemTOC’s mitotic age correlates with the mutational loads of MS1 and MS5. Correlating the median stemTOC’s mitotic age of each cancer type to the corresponding median MS1 and MS5 loads, revealed a significant association with MS1, but not for MS5, further attesting to the more mitotic-like nature of MS1 (Fig. 5g, h). Of note, stemTOC correlated with MS1 within most TCGA cancer types, even after adjusting for chronological age (Supplementary Fig. 22). However, in some cancer types (e.g. lung squamous cell carcinoma) no correlation was evident (Supplementary Fig. 22). This indicates that although stemTOC and MS1 both approximate mitotic age, that they are also distinct. This is consistent with reports that MS1 may also reflect somatic mutations arising from other mutational processes2. Of note, a strong correlation with tumor cell-of-origin fraction is not seen if one were to use the MS1-load as a proxy for mitotic age (Supplementary Fig. 23), and correlations of MS1-load with CPE-based tumor purity estimates59 were weaker than with EpiSCORE-derived tumor cell-of-origin fractions (Supplementary Fig. 24). Overall, these data indicate broad agreement between stemTOC and MS1.
Discussion
We have here derived a pan-tissue epigenetic mitotic counter (stemTOC) that avoids as much as possible, the confounding effects of cell-type heterogeneity and chronological age, and have used it in conjunction with a state-of-the-art DNAm atlas encompassing 13 tissue and 40 cell types, to establish a concrete direct link between the mitotic age of a sample and its tumor cell-of-origin fraction in each of 15 cancer types. We note that this result is consistent with mitotic age correlating with tumor purity under the reasonable assumption that tumor cells have a higher mitotic age than the surrounding stroma, likely owing to the tumor cell’s higher proliferative potential. Most importantly though, the potential of stemTOC to detect mitotic age increases was also demonstrated in precancerous lesions from 8 normal tissue-types, as well as in normal oral/lung tissues exposed to smoking and normal liver tissue from obese individuals, which collectively represent scenarios where “tumor” purity indices have not been defined or validated. As such, our findings are of profound biological and clinical significance.
First, they add substantial weight to the hypothesis put forward by Tomasetti and Vogelstein3, that the mitotic age of the cell-of-origin is a major determinant of cancer risk, and that it increases with exposure to exogenous cancer-risk factors. Second, the ability to accurately measure mitotic age and tumor cell-of-origin fraction in preneoplastic lesions opens up new avenues for risk prediction as well as preventive and precision oncology. As a concrete example, the increased mitotic age in the buccal swab squamous epithelium of healthy smokers vs non-smokers could potentially be used as a non-invasive tool to monitor cancer risk in relation to oral, lung, and esophageal squamous cell carcinomas. Third, the 371 CpG loci making up stemTOC could form the basis for developing non-invasive targeted bisulfite-sequencing assays on cell-free DNA in serum20. Fourth, by comparing stemTOC to somatic mutational signatures MS1 and MS5 in the TCGA samples, we have confirmed that MS1 is clearly mitotic-like, whereas MS5 is not. This is consistent with MS5 (and not MS1) being the dominant somatic mutational signature in post-mitotic neurons80, and contrasts with a previous study focusing on B cell malignancies which found that the combined epiCMIT-hyper/hypo mitotic age estimate correlated with both MS1 and MS533.
It is illuminating to discuss the comparison of stemTOC to MS1 in more detail. Although recent studies have demonstrated a correlation of MS1 with chronological age in normal healthy tissue-types52,80,81, the analogous correlation of DNAm-based mitotic age with chronological age has been more widely demonstrated across many more normal tissue-types, including precancerous lesions. When correlating a molecular clock’s mitotic age to chronological age in cancer samples, it is worth pointing out that for highly proliferative tumor types arising from tissues with a relatively low intrinsic rate of stem-cell division (e.g. liver, pancreas, lung), the estimated mitotic age would not be expected to correlate well with age at diagnosis since for these cancer samples the majority of cell divisions would have occurred after tumor onset. This is exactly the pattern seen for stemTOC, which displayed a more non-linear correlation and saturation effect with age across tumor samples, even if defined by an average DNAm over the stemTOC CpGs. In contrast, in the same tumors, MS1, which is effectively also an average molecular load estimate, displayed a more linear pattern. In contrast to tumors, in normal tissue stemTOC displays strong linear correlations with age without evidence of any saturation effect. Although normal samples with matched DNAm and somatic mutational profiles representing different tissue-types are still lacking to allow for an objective comparison, the lack of extensive somatic mutational data in normal tissues underscores the greater intrinsic difficulty of detecting somatic mutations in such tissues80. Indeed, the advantage of DNAm over somatic mutations as a technically more feasible substrate to track mitotic age in normal and preneoplastic tissues, specially over the potentially shorter time scales between preneoplastic and cancer stages36, should not be surprising given that the rate at which DNAm changes are acquired in normal cells is about 10−100 times higher than somatic mutations82. Thus, DNAm could be more useful than somatic mutations to track the evolution of precancerous states on shorter time scales36.
The benchmarking analysis of stemTOC against previous DNAm-based clocks also revealed an important biological insight: in general, clocks anchored on sites gaining methylation, specially stemTOC and epiCMIT-hyper, appear to provide better proxies of mitotic age in normal adult tissues compared to HypoClock and epiCMIT-hypo, which are based on hypomethylation. However, a caveat as far as HypoClock is concerned, is that this clock’s assessment on Illumina beadarray data only considers a small fraction of the millions of solo-CpGs that were originally proposed to lose DNAm with cell division18. RepliTali, a clock trained and optimized on a subset of solo-CpGs with representation on EPIC beadarrays performed much better than HypoClock. Correspondingly, the comparison of stemTOC (restricted to common 450k+EPIC probes) to RepliTali (trained on EPIC probes) only revealed a relatively minor improvement: whilst on most 450k datasets (a comparison that would favor stemTOC), we observed a consistent improvement of stemTOC over RepliTali, on EPIC datasets the improvements were more marginal. For instance, on the whole blood and eGTEX EPIC datasets, stemTOC’s mitotic age was more strongly correlated with chronological age, and in a few datasets including high-turnover tissues like colon and blood, RepliTali’s mitotic age did not pass the statistical significance threshold. Upon adjustment for CTH, RepliTali’s correlations became significant, an indication that a small fraction of RepliTali’s CpGs may be confounded by CTH. Overall, our data reinforces the view that DNAm gains at specific genes that are initially unmethylated in fetal tissue is a reliable way to track mitotic age, especially in the context of normal adult tissue turnover30. And whilst the results comparing the specific clocks favor the ones based on hypermethylation, this does not mean that the optimal subset of CpGs for measuring mitotic age are necessarily those gaining DNAm with cell division. Indeed, it is entirely plausible that a subset of solo-CpGs that are currently underrepresented on Illumina beadarrays, could lead to further improvements in mitotic-age prediction.
Any potential residual confounding by CTH and chronological is also worth discussing further. First, it is important to stress that CTH can confound mitotic age estimates in three distinct ways. One way is if the CpGs making up the counter are cell-type specific DNAm markers, i.e. if these CpGs display big differences in DNAm (typically >50% DNAm change) between cell types. In such a scenario, variations in cell-type composition between bulk-samples could affect relative mitotic age estimates. In relation to this, it is worth pointing out that by construction, stemTOC’s CpGs (as well as those defining epiTOC1/216,30) are not cell-type specific markers as defined in a suitably defined ground state (fetal-stage). Furthermore, we have shown that stemTOC’s CpGs do not display big DNAm differences between age-matched sorted cell and tissue-types, a clear indication that these CpGs are not cell-type specific markers of adult cell or tissue-types. A second way in which CTH could bias mitotic age estimates is through the selection of CpGs that only change with cell division in a specific cell type, so that they don’t generalize to other cell types and tissues. We also addressed this type of confounding, by ensuring that stemTOC CpGs correlate with cell division in cell lines representing different cell types (fibroblasts, endothelial cells and smooth muscle). In addition, because these were selected by comparing DNAm changes in-vitro with and without treatment by cell-cycle inhibitors, these stemTOC CpGs are not confounded by chronological age. Hence, by further requiring that these same CpGs also change with chronological age in-vivo whilst adjusting for 12 immune-cell fractions in blood, this likely selects for a subset of CpGs that change with cell division in-vivo. Finally, the third way in which CTH can influence mitotic age estimates is when applying the mitotic counter to a bulk-tissue sample which is made up of different cell types. Even if the CpGs making up the counter are not cell-type specific, it is clear that different cell types in a bulk-sample may have different mitotic ages, hence the mitotic age estimate reflects a weighted average over the mitotic ages of each cell type with the weights reflecting their cell-type proportions. Whilst we acknowledge that stemTOC is unable to address this type of confounding, the observed strong correlation between stemTOC’s mitotic age and tumor cell-of-origin fraction in each of 15 TCGA cancer types, demonstrates that this is not a major source of confounding, probably because the mitotic age of the highly proliferative cancer or precancer cells dominates the average estimate. A related challenge and limitation is that stemTOC’s mitotic age estimate not only reflects an average over cell types, but also an average over the cells that make up the differentiation hierarchy within a lineage. As such, the mitotic-age estimate not only reflects the number of cell divisions of the underlying long-lived stem-cell pool, but more broadly also the cell divisions of short-term stem and progenitor cell expansions and their population sizes83. In this regard though, it is worth pointing out that it is mainly and only the DNAm changes that accumulate in the long-lived stem-cell pool, and which are inherited by progenitors and differentiated cells, that are able to accumulate in the cell population at large.
Finally, this study has also highlighted the importance of the underlying stochastic DNAm changes in aging and precancerous lesions19,32,84. Indeed, by approximating mitotic age with an upper quantile statistic over the pool of mitotic CpGs (as opposed to taking an ordinary average), we can better account for the inherent inter-CpG and inter-subject stochasticity, enabling the identification of DNAm outliers that very likely reflect epigenetic mosaicism and subclonal expansions31,32. In line with this, these DNAm outliers led to mitotic age estimates displaying stronger correlations with cancer and tumor cell-of-origin fraction. On the other hand, these DNAm outliers may also not necessarily mark the preneoplastic clones of higher mitotic age but those characterized by subtle quenching or even silencing of tissue-specific developmental genes, as such silencing may confer a selective advantage49,85,86.
In summary, this work demonstrates how a DNAm-based mitotic age counter can detect subtle increases of mitotic age in the tumor cell-of-origin of normal and precancerous tissues, opening up new biotechnological opportunities for developing early detection and cancer-risk prediction strategies.
Methods
All data analyzed in this manuscript has already been published in the respective publications, as described below and in the “Data availability” section. As such, this research complies with all ethical regulations.
Illumina DNAm datasets used in the construction of stemTOC
Fetal tissue DNAm sets
We obtained and normalized Illumina 450k data from the Stem-Cell Matrix Compendium-2 (SCM2)87, as described by us previously16. There were a total of 37 fetal tissue samples encompassing 10 tissue-types (stomach, heart, tongue, kidney, liver, brain, thymus, spleen, lung, adrenal gland). We also obtained Illumina 450k data of 15 cord-blood samples88 and 34 fetal-tissue samples from Slieker et al.89 encompassing amnion, muscle, adrenal, and pancreas. Both sets were normalized like the SCM2 data, i.e. by processing idat files with minfi90 followed by type-2 probe bias adjustment with BMIQ91.
Cell-line data from Endicott et al.24
The Illumina EPICv1 cell-line data was downloaded from GEO under accession number GSE197512. Raw idat files were processed with minfi and BMIQ normalized, resulting in a normalized DNAm data matrix defined over 843,298 probes and 182 “baseline-profiling” samples. A total of 6 human cell lines were used, including AG06561 (skin fibroblast), AG11182 (veil endothelial), AG11546 (iliac vein smooth muscle), AG16146 (skin fibroblast), AG21839 (neonatal foreskin fibroblast), and AG21859 (foreskin fibroblast). One cell line (AG21837, skin keratinocyte), which displayed global non-monotonic DNAm patterns with population doublings (PDs) was removed.
Whole-blood DNAm datasets
We used 3 Illumina whole-blood datasets. One EPIC dataset encompassing 710 samples from Han Chinese was processed and normalized as described by us previously92. Briefly, idat files were processed with minfi, followed by BMIQ type-2 probe adjustment, and due to the presence of beadchip effects, data was further normalized with ComBat93. Another dataset is an Illumina 450k set of 656 samples from Hannum et al.94. The normalization of this dataset is as described by us previously92. Finally, we also analyzed a 450k dataset from Johansson et al.95. This dataset was obtained from NCBI GEO website under accession number GSE87571. The file “GSE87571_RAW.tar” containing the IDAT files was downloaded and processed with minfi R-package. Probes with P-values <0.05 across all samples were kept. Filtered data was subsequently normalized with BMIQ, resulting in a normalized data matrix for 475,069 probes across 732 samples.
Independent buccal swab and cord blood Illumina DNAm datasets
We analyzed two independent EPIC DNAm datasets to assess the robustness and stability of the 371 stemTOC CpGs. One dataset consists of 44 buccal swabs from infants96 and the other of 128 cord blood samples from neonates97. Briefly, idat files were downloaded from GEO under accession numbers GSE229463 and GSE195595, respectively, and subsequently processed with minfi90 and BMIQ91 as described for the other datasets.
Illumina DNAm cancer datasets
The SeSAMe98 processed Illumina 450k beta value matrices were downloaded from Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/) with TCGAbiolinks99. We analyzed 32 cancer types (ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, MESO, PAAD, PCGP, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM). We did further processing of the downloaded matrices as follows: For each cancer type, probes with missing values in more than 30% samples were removed. The missing values were then imputed with impute.knn (k = 5)100. Then, the type-2 probe bias was adjusted with BMIQ91. Technical replicates were removed by retaining the sample with the highest CpG coverage. Clinical information on TCGA samples was downloaded from Liu et al.101. In addition, we analyzed an Illumina 450k DNAm dataset from Pipinikas et al.102, consisting of 45 primary pancreatic neuroendocrine tumors (PNETs). Processing of the idat files and QC was performed with minfi90, impute, and BMIQ as described for the TCGA datasets.
Illumina DNAm dataset from eGTEX
The ChAMP103 processed Illumina EPIC beta-valued data matrix was downloaded from GEO (GSE213478)45. The dataset includes 754,119 probes and 987 samples from 9 normal tissue-types: breast mammary tissue (n = 52), colon transverse (n = 224), kidney cortex (n = 50), lung (n = 223), skeletal muscle (n = 47), ovary (n = 164), prostate (n = 123), testis (n = 50), and whole blood (n = 54) derived from 424 GTEX subjects.
Illumina DNAm datasets of sorted immune-cell types
We analyzed Illumina 450k DNAm data from BLUEPRINT42, encompassing 139 monocyte, 139 CD4+ T-cell, and 139 neutrophil samples from 139 subjects. This dataset was processed as described by us previously104. In addition, we analyzed Illumina 450k DNAm data from GEO (GSE56046) encompassing 1202 monocyte and 214 naïve CD4+ T-cell samples from Reynolds et al.105, and GEO (GSE56581) encompassing 98 CD8+ T-cell samples from Tserel et al.106. Data was normalized as described by us previously107. We also analyzed Illumina 450k DNAm data from EGA (EGA: EGAS00001001598) encompassing 100 B cell, 98 CD4+ T cell, and 104 monocyte samples from Paul et al.43. Data was normalized as described by us previously43.
Other Illumina whole-blood DNAm datasets
We analyzed a large collection of 18 whole-blood cohorts encompassing a total of 14,515 samples. This is a subset of the over 20,000 samples we previously analyzed41, consisting mostly of healthy samples. Illumina DNAm data was normalized as described by us previously41. Briefly, the 18 whole-blood cohorts chosen here were: LiuMS(n = 279): The 450k dataset from Kular et al.108 was obtained from the NCBI GEO website under the accession number GSE106648.
Song (n = 2052): The EPIC dataset from Song et al.109 was obtained from the NCBI GEO website under the accession number GSE169156.
HPT-EPIC (n = 1394) & HPT-450k (n = 418): These datasets110 were obtained from the NCBI GEO websites under the accession numbers GSE210255 and GSE210254.
Barturen (n = 5740): The EPIC dataset from Barturen et al.111 was obtained from GEO under accession number GSE179325.
Airwave (n = 1032): The EPIC dataset from the Airwave study112 was obtained from GEO under accession number GSE147740.
VACS (n = 529): The 450k dataset from Zhang et al.113 was obtained from GEO under accession number GSE117860.
Ventham (n = 380): The 450k dataset from Ventham et al.114 was obtained from NCBI GEO website under accession number GSE87648.
Hannon−1 and 2 (n = 636 and 665): The 450k datasets from Hannon et al.115,116 were obtained from NCBI GEO websites under accession numbers GSE80417 and GSE84727.
Zannas (n = 422): This 450k dataset117 was obtained from GEO under accession number GSE72680
Flanagan/FBS (n = 184): The 450k dataset Flanagan et al.118 was obtained from NCBI GEO under the accession number GSE61151.
Johansson (n = 729): The 450k dataset from Johansson et al.95 was obtained from GEO under accession number GSE87571.
Lehne (n = 2707): This 450k DNAm dataset consists of peripheral blood samples119, and we used the already QC-processed and normalized version previously described by Voisin et al.120.
TZH(n = 705), Hannum (n = 656), LiuRA (n = 689), Tsaprouni (n = 464): The TZH (EPIC)92, Hannum (450k)94, LiuRA (450k)121, Tsaprouni (450k)122 were downloaded and normalized as described by us previously92.
Illumina DNAm atlas from Moss et al.
We downloaded Illumina 450k & EPIC DNAm data (idat files)44 from GEO under accession number GSE122126. We processed the 450k and EPIC data separately with minfi, removing low-quality controls (detection P-value <0.05) and further normalized the data with BMIQ, resulting in a merged dataset defined over 449,156 probes and 28 samples. These 28 samples included sorted pancreatic beta cells (n = 4), pancreatic ductal (n = 3), pancreatic acinar (n = 3), adipocytes (n = 3), hepatocytes (n = 3), cortical neurons (n = 3), leukocyte (n = 1), lung epithelial (n = 3), colon-epithelial (n = 3), and vascular endothelial (n = 2).
Illumina DNAm datasets of sorted neurons
We analyzed a total of 4 Illumina DNAm datasets of sorted neurons (neuronal nuclei NeuN+), in all cases only using normal control samples. In all cases, raw Illumina EPIC/450k DNAm files were downloaded from GEO under accession numbers GSE112179 (n = 28, post-mortem frontal cortex)123, GSE41826 (n = 29, post-mortem frontal cortex)124, GSE66351 (n = 16, post-mortem human brains)125 and GSE98203 (n = 29, post-mortem human brains)126. In all cases, probes not detected above the background were assigned NA. CpGs with coverage 0.99 were kept (in the 2nd and 4th sets, this threshold was relaxed slightly to 0.98). The remaining NAs were imputed with impute.knn (k = 5). Finally, the beta-valued data matrix was normalized with BMIQ.
Illumina DNAm datasets from normal healthy and normal “at cancer-risk” tissue
Lung preinvasive dataset : This is an Illumina 450k DNAm dataset of lung-tissue samples that we have previously published76. We used the normalized dataset from Teschendorff et al.76 encompassing 21 normal lungs and 35 age-matched lung-carcinoma in situ (LCIS) samples, and 462,912 probes after QC. Of these 35 LCIS samples, 22 progressed to an invasive lung cancer (ILC).
Breast preinvasive dataset : This is an Illumina 450k dataset of breast tissue samples from Johnson et al.127. Raw idat files were downloaded from GEO under accession number GSE66313, and processed with minfi. Probes with sample coverage <0.95 (defined as a fraction of samples) with detected (P < 0.05) P-values were discarded. The rest of the unreliable values were assigned NA and imputed with knn (k = 5)100. After BMIQ normalization, we were left with 448,296 probes and 55 samples, encompassing 15 normal-adjacent breast tissue and 40 age-matched ductal carcinoma in situ (DCIS) samples, of which 13 were from women who later developed an invasive breast cancer (BC).
Gastric metaplasia dataset : Raw idat files were downloaded from GEO (GSE103186)128 and processed with minfi. Probes with over 99% coverage were kept and missing values imputed using impute R-package using impute.knn (k = 5). Subsequently, data was intra-array normalized with BMIQ, resulting in a final normalized data matrix over 482,975 CpGs and 191 samples, encompassing 61 normal gastric mucosas, 22 mild intestinal metaplasias, and 108 metaplasias. Although age information was not provided, we used Horvath’s clock129 to confirm that normal and mild intestinal metaplasias were age-matched. This is justified because Horvath’s clock is not a mitotic clock30 and displays a median absolute error of ±3 years129.
Barrett’s Esophagus and adenocarcinoma dataset : This Illumina 450k dataset49 is freely available from GEO under accession number GSE104707. Data was normalized as described by us previously130. The BMIQ-normalized dataset is defined over 384,999 probes and 157 samples, encompassing 52 normal squamous epithelial samples from the esophagus, 81 age-matched Barrett’s Esophagus specimens, and 24 esophageal adenocarcinomas.
Colon adenoma dataset131: Illumina 450k raw idat files were downloaded from ArrayExpress E-MTAB-6450 and processed with minfi. Only probes with 100% coverage were kept. Subsequent data was intra-array normalized with BMIQ, resulting in a normalized data matrix over 483,422 CpGs and 47 samples, encompassing 8 normal colon specimens and 39 age-matched colon adenomas. Although age information was not made publicly available, we imputed them using Horvath’s clock, confirming that normals and adenomas are age-matched.
Cholangiocarcinoma (CCA) dataset132: Raw idat files were downloaded from GEO (GSE156299). This is an EPIC dataset and was processed with minfi. Probes with >99% coverage (fraction of samples with P < 0.05) were kept. NAs were imputed with impute.knn (k = 5). Subsequent data was intra-array normalized with BMIQ, resulting in a normalized data matrix over 854,026 probes and 137 samples, encompassing 50 normal bile duct specimens, 60 premalignant, and 27 cholangiocarcinomas. Normal bile duct, premalignant, and CCA specimens were derived from the same patient. Thus, normal and premalignant samples are mostly age-matched.
Prostate cancer progression dataset : Illumina 450k raw idat files were downloaded from GEO (GSE116338)133 and processed with minfi. CpGs with coverage >0.95 were kept. NAs were imputed with impute.knn (k = 5). Subsequent data matrix was intra-array normalized with BMIQ. Samples included benign lesions (n = 10), neoplasia (n = 6), primary tumors (n = 14) and metastatic prostate cancer (n = 6). In this dataset, benign lesions were significantly older compared to neoplasia and primary tumors, thus this dataset provides a particularly good test that mitotic clocks are measuring mitotic age as opposed to chronological age.
Oral squamous cell carcinoma (OSCC) progression dataset : The processed beta-valued Illumina 450k DNAm data matrix was downloaded from GEO (GSE123781)134. Probes not detected above background had been assigned NA. CpGs with coverage >0.95 were kept. The remaining NAs were imputed with impute.knn (k = 5). Then the beta matrix was BMIQ normalized. Samples included 18 healthy controls, 8 lichen planus (putative premalignant condition), and 15 OSCCs. This is the only dataset where premalignant samples older than the healthy controls.
Colon adenoma dataset 2. Processed beta-valued Illumina 450k DNAm data matrix was downloaded from GEO (GSE48684)135, encompassing 17 normal-healthy samples, 42 age-matched colon adenoma samples, and 64 colon cancer samples. Probes not detected above background were assigned NA. CpGs with coverage >0.95 were kept. The remaining NAs were imputed with impute.knn (k = 5). Then the beta matrix was BMIQ normalized. Although age information was not made publicly available, ages were imputed with Horvath’s clock confirming that normals and adenomas are age-matched.
Normal breast Erlangen dataset : This Illumina 450k dataset is freely available from GEO under accession number GSE69914. Data was normalized as described by us previously32. The BMIQ-normalized dataset is defined over 485,512 probes and 397 samples, encompassing 50 normal-breast samples from healthy women, 42 age-matched normal-adjacent samples, and 305 invasive breast cancers. We note that this dataset was excluded from the formal comparison of stemTOC to all other clocks, because this dataset was used to select the upper-quantile threshold in the definition of stemTOC (see later).
Illumina DNAm datasets of normal tissues exposed to cancer-risk factors
Buccal swabs+smoking (n = 790): This Illumina 450k DNAm dataset was generated and analyzed previously by us76. We used the same normalized DNAm dataset as in this previous publication. Briefly, the samples derive from women all aged 53 years at sample draw and belong to the MRC1946 birth cohort. This cohort has well-annotated epidemiological information, including smoking-status information. Among the 790 women, 258 were never-smokers, 365 ex-smokers, and 167 current smokers at sample draw.
Lung-tissue+smoking (n = 204): This Illumina EPIC DNAm dataset of normal lung-tissue samples derives from eGTEX45 and was processed as the other tissue datasets from eGTEX. Age information was only provided in age-groups. Smoking status distribution was 89 current smokers, 54 ex-smokers, and 61 never-smokers.
Liver-tissue+obesity (n = 325): This Illumina EPIC DNAm dataset of liver tissue is derived from GEO: GSE180474, and encompasses liver tissue samples from obese individuals (minimum BMI = 32.6), all diagnosed with non-alcoholic fatty liver disease (NAFLD)78. Of the 325 samples, 210 had no evidence of fibrosis (grade-0), 55 had grade-3 fibrosis, 36 intermediate grade 3-4 fibrosis, and 24 grade-4 fibrosis. We downloaded the processed beta and P-values and only kept probes with 100% coverage across all samples. Data was further adjusted for type-2 probe bias using BMIQ.
Construction of stemTOC: identification of mitotic CpGs
The construction of stemTOC initially involves a careful selection of CpGs that track mitotic age. Initially, we follow the procedure as described for the epiTOC+epiTOC2 mitotic clocks, identifying CpGs mapping to within 200 bp of transcription start sites and that are constitutively unmethylated across 86 fetal tissue samples encompassing 15 tissue-types. Here, by constitutive unmethylation, we mean a DNAm beta value <0.2 in each of 86 fetal tissue samples encompassing 13 tissue-types. Of note, this means that all these CpGs occupy the same methylation state in the fetal ground state, i.e. these CpGs are not cell-type or tissue-specific in this ground state. Next, we use the cell-line data from Endicott et al.24 to further identify the subset of CpGs that display significant hypermethylation with population doublings, but which do not display such hypermethylation when the cell lines are treated with a cell-cycle inhibitor (mitomycin-C) or when the cell-culture is deprived of growth-promoting serum. In detail, for each of 6 cell lines (AG06561, AG11182, AG11546, AG16146, AG21837, AD21859) representing fibroblasts (3), smooth muscle (1), keratinocyte (1) and endothelial cell types, we ran linear regressions of DNAm against population doublings (PD) (a total of 182 samples), identifying for each cell-line CpGs where DNAm increases with PDs (q-value (FDR) < 0.05). Out of a total of 843,98 CpGs, 14,255 CpGs displayed significant hypermethylation with PDs in each of the 6 cell lines. Of these 14,225 CpGs, we next removed those still displaying hypermethylation (unadjusted P < 0.05) with days in culture in cell lines treated with mitomycin-C or in cell lines deprived of serum. This resulted in a much-reduced set of 629 “vitro-mitotic” CpG candidates. In the next step, we asked how many of these “vitro-mitotic” CpGs display age-associated hypermethylation in-vivo, using 3 separate large whole-blood cohorts. The rationale for using whole blood is that this is a high-turnover-rate tissue, and so chronological age should be correlated with mitotic age. Moreover, large whole-blood cohorts guarantee adequate power to detect age-associated DNAm changes. And thirdly, by intersecting CpGs undergoing hypermethylation with PDs in-vitro with those undergoing hypermethylation with age in-vivo, we are more likely to be identifying CpGs that track mitotic age in-vivo. We used the 3 large whole-blood datasets as described in the previous section, each containing approximately 700 samples, whilst adjusting for all potential confounders including cell fractions for 12 immune-cell subtypes40. Specifically, we used a DNAm reference matrix defined over 12 immune cell types, which we have recently validated across over 23,000 whole-blood samples from 22 cohorts41, to estimate cell-type proportions using our EpiDISH procedure136. These proportions were then included as covariates when identifying age-DMCs in each cohort separately. To arrive at a final set of age-DMCs, we used the directional Stouffer method over the 3 large cohorts to compute an overall Stouffer z-statistic and P-value, selecting CpGs with z > 0 and P < 0.05. Of the 629 “vitro-mitotic” CpGs, 371 were significant in the Stouffer meta-analysis of whole-blood cohorts. This set of 371 CpGs defined our “vivo-mitotic” CpGs making up stemTOC.
Estimating relative mitotic age with stemTOC
Given an arbitrary sample with a DNAm profile defined over these 371 CpGs, we next define the mitotic age of the sample as the 95% upper quantile of DNAm values over the 371 vivo-mitotic CpGs. The justification for taking an upper quantile value, as opposed to taking an average is as follows: extensive DNAm data from previous studies indicate that DNAm changes that mark cancer risk (and hence also mitotic age) are characterized by an underlying inter-CpG and inter-subject stochasticity19,31,32,84. Specifically, relevant DNAm changes in normal tissues at cancer risk constitute mild outliers in the DNAm distribution representing subclonal expansions, that occur only infrequently across independent subjects, with the specific CpG outliers displaying little overlap between subjects. Thus, for a given pool of mitotic CpG candidates (i.e. the pool of 371 vivo-mitotic CpGs derived earlier), only a small subset of these may be tracking mitotic age in any given subject at-risk of tumor development. Thus, taking an average DNAm over the 371 CpGs is not optimized to capturing the effect of subclonal expansions that track the mitotic age. To understand this, we ran a simple simulation model, with parameter choices inspired by real DNAm data19,31,32,84, for a reduced set of 20 CpGs and 40 samples representing 4 disease stages (normal, normal at-risk, preneoplastic, cancer) with 10 samples in each stage. In the initial stage, all CpGs are unmethylated with DNAm beta-values drawn from a Beta distribution Beta(a = 10,b = 90), where Beta(a,b) is defined by the probability distribution
1 |
which has a mean and variance . For each “normal at-risk” sample, we randomly selected 1–3 CpGs (precise number was drawn from a uniform distribution), simulating DNAm gains for these CpGs by drawing DNAm values from Beta(a = 3,b = 7). For each preneoplastic sample, we randomly selected 5–10 CpGs with DNAm values drawn from Beta(a = 5,b = 5). Finally, for invasive cancer, we randomly selected 11–17 CpGs with DNAm gains drawn from Beta(a = 8,b = 2). Under this model, discrimination of normal-at-risk samples from normal healthy is difficult if one were to average DNAm over 20 CpGs. However, taking a 95% upper quantile (UQ) of the DNAm distribution over the 20 CpGs, one can discriminate the normal at-risk samples. In this instance, a 95% UQ over 20 CpGs corresponds to taking the maximum value over these 20 CpGs. Another way to understand this is by first identifying CpGs that display DNAm outliers in the normal at-risk group compared to normal. This can be done using differential variance statistics19,137 to identify hypervariable CpGs, or alternatively, by finding CpGs for which a suitable upper quantile over the normal at-risk samples is much greater than the corresponding UQ value over the normal samples. To determine what UQ-threshold may be suitable, we analyzed our Illumina 450k DNAm dataset encompassing 50 normal-breast tissues from healthy women and 42 age-matched normal-adjacent samples (“at-risk” samples) from women with breast cancer32. For each of the 371 CpGs, we computed the mean and UQ over the 50 normal-breast samples, and separately over the 42 normal-adjacent “at-risk” samples. We considered a range of UQ thresholds from 0.75 to 0.99. We then compared the difference (i.e. effect size) in the obtained values between the normal-adjacent and normal-healthy, averaging over the 371 CpGs. The effect size increases with higher UQ values. We selected a 95% UQ because at this threshold we maximized effect size without compromising variability (choosing higher UQs leads to increased random variation). Although in this analysis, the UQ is taking across samples, it is reasonable, given the underlying stochasticity of the patterns, to apply the same UQ-threshold across CpGs. Thus, for any independent sample, we define the relative mitotic age of stemTOC as the 95% upper quantile of the DNAm distribution defined over the 371 CpGs. We note however that results in this manuscript are not strongly dependent on the particular UQ value, i.e. results are generally very robust to UQ values in the range 75–95%.
Estimating cell-type fractions in solid tissues with EpiSCORE and HEpiDISH
In this work we estimate the proportions of all main cell types within tissues from the TCGA using our validated EpiSCORE algorithm37 and its associated DNAm atlas of tissue-specific DNAm reference matrices38. This atlas comprises DNAm reference matrices for lung (7–9 cell types: alveolar epithelial, basal, other epithelial, endothelial, granulocyte, lymphocyte, macrophage, monocyte and stromal), pancreas (6 cell types: acinar, ductal, endocrine, endothelial, immune, stellate), kidney (4 cell types: epithelial, endothelial, fibroblasts, immune), prostate (6 cell types: basal, luminal, endothelial, fibroblast, leukocytes, smooth muscle), breast (7 cell types: luminal, basal, fat, endothelial, fibroblast, macrophage, lymphocyte), olfactory epithelium (9 cell types: mature neurons, immature neurons, basal, fibroblast, gland, macrophages, pericytes, plasma, T-cells), liver (5 cell types: hepatocytes, cholangiocytes, endothelial, Kupffer, lymphocytes), skin (7 cell types: differentiated and undifferentiated keratinocytes, melanocytes, endothelial, fibroblast, macrophages, T-cells), brain (6 cell types: astrocytes, neurons, microglia, oligos, OPCs and endothelial), bladder (4 cell types: endothelial, epithelial, fibroblast, immune), colon (5 cell types: endothelial, epithelial, lymphocytes, myeloid and stromal) and esophagus (8 cell types: endothelial, basal, stratified, suprabasal and upper epithelium, fibroblasts, glandular, immune). Hence, when estimating cell-type fractions in the TCGA tissue samples, we were restricted to those tissues with an available DNAm reference matrix. EpiSCORE was run on the BMIQ-normalized DNAm data from the TCGA with default parameters and 500 iterations. EpiSCORE was also used to estimate cell-type fractions in solid tissues from non-TCGA datasets. For instance, we used EpiSCORE to estimate 7 lung cell-type fractions (endothelial, epithelial, neutrophil, lymphocyte, macrophage, monocyte, and stromal) in normal lung-tissue samples from eGTEX45, and 5 liver cell-type fractions (cholangiocyte, hepatocyte, Kupffer, endothelial and lymphocyte) in liver tissue78. For the buccal swab dataset, because buccal swabs only contain squamous epithelial and immune-cells, we used the validated HEpiDISH algorithm and associated DNAm reference matrix to estimate these fractions in this tissue77.
Validation of EpiSCORE fractions using a WGBS DNAm atlas
Whilst EpiSCORE has already undergone substantial validation38, here we decided to further validate it against the recent human WGBS DNAm atlas from Loyfer et al.58, which comprises 207 sorted samples from various tissue-types. Briefly, we downloaded the beta-valued data and bigwig files (hg38) from the publication website. We required CpGs to be covered by at least 10 reads. We then averaged the DNAm values of CpGs mapping to within 200 bp of the transcription start sites of coding genes, resulting in a DNAm data matrix defined over 25,206 gene promoters and 207 sorted samples. The 207 sorted samples represent 37 cell types from 42 distinct tissues (anatomical regions). We validate EpiSCORE in this WGBS-atlas by asking if it could predict the corresponding cell type. This was only done for tissues for which there is an EpiSCORE DNAm reference matrix. For instance, for breast tissue, the WGBS DNAm atlas profiled 3 luminal and 4 basal epithelial samples. Hence, for breast tissue, we applied EpiSCORE with our breast DNAm reference matrix, derived via imputation from a breast-specific scRNA-Seq atlas, to these 7 sorted samples to assess if it can predict the basal/luminal subtypes using a maximum “probability” criterion (using the estimated fraction as a probability). This analysis was done for WGBS-sorted cells from 15 anatomical sites (brain, skin, colon, breast, bladder, liver, esophagus, pancreas, prostate, kidney tubular, kidney glomerular, lung alveolar, lung pleural, lung bronchus, and lung interstitial) using our EpiSCORE DNAm reference matrices for brain, skin, colon, breast, bladder, liver, esophagus, pancreas, prostate, kidney and lung. An overall accuracy was estimated as the number of sorted WGBS samples where the corresponding cell type was correctly predicted.
Benchmarking against other DNAm-based mitotic clocks
We benchmarked stemTOC against 6 other epigenetic mitotic clocks: epiTOC16, epiTOC230, HypoClock18, RepliTali24, epiCMIT-hyper and epiCMIT-hypo33. Briefly, epiTOC gives a mitotic score called pcgtAge, which is an average DNAm over 385 epiTOC-CpGs. In the case of epiTOC2, we used the estimated total cumulative number of stem-cell divisions (tnsc). In the case of HypoClock, the score is given by the average DNAm over 678 solo-WCGWs with representation on Illumina 450k arrays. Of note, for Hypoclock, smaller values mean a larger deviation from the methylated ground state. In the case of RepliTali, which was trained on EPIC data, the score is calculated with 87 RepliTali CpGs and their linear regression coefficients as provided by Endicott et al.24. Of the 87 RepliTali CpGs, only 30 are present on the Illumina 450k array. As far as epiCMIT33 is concerned, this clock is based on two separate lists of 184 hypermethylated and 1164 hypomethylated CpGs. Because the biological mechanism by which CpGs gain or lose DNAm during cell division is distinct, the strategy recommended by Ferrer-Durante to compute an average DNAm over the two lists to then select the one displaying the biggest deviation from the ground state as a measure of mitotic age, is in our opinion not justified. Their strategy could conceivably result in the hypermethylated CpGs being used for one sample, and the hypomethylated CpGs being used for another. Instead, in this work we separately report the average DNAm of the hypermethylated and hypomethylated components. Thus, for the hypermethylated CpGs, the average DNAm over these sites defines the “epiCMIT-hyper” clock’s mitotic age, whereas for the hypomethylated CpGs we take 1-average(hypomethylated CpGs) as a measure of mitotic age, thus defining the “epiCMIT-hypo” clock.
We compare all mitotic clocks using the following evaluation strategies. In the analysis correlating mitotic ages to the fractions of the tumor cell-of-origin in the various TCGA cancer types, we compare the inferred Pearson Correlation Coefficients (PCCs) between each pair of clocks across the TCGA cancer types using a one-tailed Wilcoxon rank sum test. Likewise, in the analysis where we correlate mitotic age to chronological age in the normal-adjacent samples from the TCGA, in the normal samples from eGTEX, and the sorted immune-cell subsets, we compare the inferred PCCs between each pair of clocks across the independent datasets using a one-tailed Wilcoxon rank sum test. Finally, when assessing the clocks for predicting cancer risk, for each of the 9 datasets with normal-healthy and age-matched “normal at-risk” samples and for each clock, we first computed an AUC-metric, quantifying the clock’s discriminatory accuracy. For each pair of mitotic clocks, we then compare their AUC values across the 9 independent datasets using a one-tailed Wilcoxon rank sum test.
Comparison to stem-cell division rates and somatic mutational signatures
We obtained estimated tissue-specific intrinsic stem-cell division rates of 13 tissue-types (bladder, breast, colon, esophagus, oral, kidney, liver, lung, pancreas, prostate, rectum, thyroid, and stomach) from Vogelstein & Tomasetti3,8, supplemented with estimates for skin and blood30, corresponding to 17 TCGA cancer types (BLCA, BRCA, COAD, ESCA, HNSC, KIRP & KIRC, LIHC, LSCC & LUAD, PAAD, PRAD, READ, THCA, LAML, STAD, and SKCM). Thus, for each normal-adjacent sample of the TCGA, we can estimate the total number of stem-cell divisions (TNSC) by multiplying the intrinsic rate (IR) of the tissue with the chronological age of the sample. Somatic mutational clock signature−1 (MS1) and signature-5 (MS5) were derived from Alexandrov et al.11. These loads represent the number of mutations per Mbp. Of note, because these mutational loads and stemTOC’s mitotic age both correlate with chronological age, in specific analyses we divide these values by the chronological age of the sample to arrive at age-adjusted values.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Source data
Acknowledgements
This work was supported by NSFC (National Science Foundation of China) grants, grant numbers 32170652 and 32370699. We would also like to thank the TCGA and everyone who supports open-access data.
Author contributions
A.E.T. conceived and designed the study. A.E.T. wrote the manuscript. T.Z., H.T., and A.E.T. performed the statistical and bioinformatic analyses. H.T. and Z.D. helped with bioinformatic analyses. S.B. provided useful feedback.
Peer review
Peer review information
Nature Communications thanks Georg Luebeck and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The main Illumina DNA methylation 450k/EPIC datasets used here are freely available from the following public repositories: Endicott (182 cell-line samples, GSE197512)24. Hannum (656 whole blood, GSE40279, [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40279])94; MESA (214 purified CD4+ T-cells and 1202 Monocyte samples, GSE56046GSE56581)105; Tserel (98 CD8+ T-cells, GSE59065)106; BLUEPRINT (139 matched CD4+ T-cells, Monocytes and Neutrophils, EGAS00001001456)42; Paul (100 B cells, 98 T cells, and 104 monocytes, EGA: EGAS00001001598)43; Liu (335 whole blood, GSE42861)121; Pai (n = 28 sorted neurons, GSE112179)123, Guintivano (n = 29 sorted neurons, GSE41826)124, Gasparoni (n = 16 sorted neurons, GSE66351)125 and Kozlenkov (n = 29 sorted neurons, GSE98203)126; Gastric tissue (191 normal and metaplasia, GSE103186)128; Colon tissue (47 normal and adenoma, E-MTAB-6450)131; Colon tissue2 (123 normal, adenoma, and cancer samples, GSE48684)135; Breast Erlangen (397 normal, GSE69914)32; Breast2 (121 normal, GSE101961)138; Breast Johnson (55 normal and ductal carcinoma in situ samples, GSE66313)127; Liver tissue (137 normal, premalignant and cholangiocarcinoma samples, GSE156299)132; Prostate tissue (36 normal, neoplasia, primary and metastatic samples, GSE116338)133; Oral tissue (41 normal, lichen planus and oral squamous cell carcinoma samples, GSE123781)134; Esophagus (50 normal, 81 Barrett’s Esophagus and 24 adenocarcinomas, GSE104707)49; Liver-NAFLD (n = 325, GSE180474)78; SCM2 (37 fetal tissue samples, GSE31848)87; Cord Blood (15 samples, GSE72867)88; Slieker (34 fetal tissue samples, GSE56515)89; eGTEX (987 samples from 9 normal tissue-types, GSE213478)45; Buccal Swabs from Infants (44 samples, GSE229463)96; Cord Blood from Neonates (128 samples, GSE195595)97; TCGA data was downloaded from https://gdc.cancer.gov; The DNAm dataset in buccal cells from the NSHD MRC194676 is available by submitting data requests to mrclha.swiftinfo@ucl.ac.uk; see full policy at http://www.nshd.mrc.ac.uk/data.aspx. Managed access is in place for this 69-year-old study to ensure that the use of the data is within the bounds of consent given previously by participants and to safeguard any potential threat to anonymity since the participants are all born in the same week. The DNAm atlas encompassing DNAm reference matrices for 13 tissue-types encompassing over 40 cell types is freely available from the EpiSCORE R-package https://github.com/aet21/EpiSCORE. Source data are provided with this paper. The remaining data are available within the Article, Supplementary Information, or Source Data file. Source data are provided with this paper.
Code availability
An R-package “EpiMitClocks” with functions to estimate the mitotic age for each of the clocks in this work is freely available from https://github.com/aet21/EpiMitClocks The package comes with a tutorial vignette. The R-package EpiSCORE for estimating cell-type fractions in solid tissue-types is available from https://github.com/aet21/EpiSCORE The package comes with a tutorial vignette. We have also made an OceanCode Capsule available from https://codeocean.com/capsule/3811164/tree.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Tianyu Zhu, Huige Tong.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-024-48649-8.
References
- 1.Spira A, et al. Precancer Atlas to drive precision prevention trials. Cancer Res. 2017;77:1510–1541. doi: 10.1158/0008-5472.CAN-16-2346. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Jassim A, Rahrmann EP, Simons BD, Gilbertson RJ. Cancers make their own luck: theories of cancer origins. Nat. Rev. Cancer. 2023;23:710–724. doi: 10.1038/s41568-023-00602-5. [DOI] [PubMed] [Google Scholar]
- 3.Tomasetti C, Vogelstein B. Cancer etiology. Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science. 2015;347:78–81. doi: 10.1126/science.1260825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Zhu L, et al. Multi-organ mapping of cancer risk. Cell. 2016;166:1132–1146.e7. doi: 10.1016/j.cell.2016.07.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Klutstein M, Moss J, Kaplan T, Cedar H. Contribution of epigenetic mechanisms to variation in cancer risk among tissues. Proc. Natl Acad. Sci. USA. 2017;114:2230–2234. doi: 10.1073/pnas.1616556114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Baylin SB, Ohm JE. Epigenetic gene silencing in cancer - a mechanism for early oncogenic pathway addiction? Nat. Rev. Cancer. 2006;6:107–116. doi: 10.1038/nrc1799. [DOI] [PubMed] [Google Scholar]
- 7.Feinberg AP, Ohlsson R, Henikoff S. The epigenetic progenitor origin of human cancer. Nat. Rev. Genet. 2006;7:21–33. doi: 10.1038/nrg1748. [DOI] [PubMed] [Google Scholar]
- 8.Tomasetti C, Li L, Vogelstein B. Stem cell divisions, somatic mutations, cancer etiology, and cancer prevention. Science. 2017;355:1330–1334. doi: 10.1126/science.aaf9011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Tomasetti C, Vogelstein B, Parmigiani G. Half or more of the somatic mutations in cancers of self-renewing tissues originate prior to tumor initiation. Proc. Natl Acad. Sci. USA. 2013;110:1999–2004. doi: 10.1073/pnas.1221068110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Alexandrov LB, et al. Mutational signatures associated with tobacco smoking in human cancer. Science. 2016;354:618–622. doi: 10.1126/science.aag0299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Alexandrov LB, et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 2015;47:1402–1407. doi: 10.1038/ng.3441. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Schumacher B, Pothof J, Vijg J, Hoeijmakers JHJ. The central role of DNA damage in the ageing process. Nature. 2021;592:695–703. doi: 10.1038/s41586-021-03307-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Blackburn EH, Greider CW, Szostak JW. Telomeres and telomerase: the path from maize, Tetrahymena and yeast to human cancer and aging. Nat. Med. 2006;12:1133–1138. doi: 10.1038/nm1006-1133. [DOI] [PubMed] [Google Scholar]
- 14.Siegmund KD, Marjoram P, Woo YJ, Tavare S, Shibata D. Inferring clonal expansion and cancer stem cell dynamics from DNA methylation patterns in colorectal cancers. Proc. Natl Acad. Sci. USA. 2009;106:4828–4833. doi: 10.1073/pnas.0810276106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Kim JY, Tavare S, Shibata D. Counting human somatic cell replications: methylation mirrors endometrial stem cell divisions. Proc. Natl Acad. Sci. USA. 2005;102:17739–17744. doi: 10.1073/pnas.0503976102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Yang Z, et al. Correlation of an epigenetic mitotic clock with cancer risk. Genome Biol. 2016;17:205. doi: 10.1186/s13059-016-1064-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Youn A, Wang S. The MiAge Calculator: a DNA methylation-based mitotic age calculator of human tissue types. Epigenetics. 2018;13:192–206. doi: 10.1080/15592294.2017.1389361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Zhou W, et al. DNA methylation loss in late-replicating domains is linked to mitotic cell division. Nat. Genet. 2018;50:591–602. doi: 10.1038/s41588-018-0073-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Teschendorff AE, et al. Epigenetic variability in cells of normal cytology is associated with the risk of future morphological transformation. Genome Med. 2012;4:24. doi: 10.1186/gm323. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shen SY, et al. Sensitive tumour detection and classification using plasma cell-free DNA methylomes. Nature. 2018;563:579–583. doi: 10.1038/s41586-018-0703-0. [DOI] [PubMed] [Google Scholar]
- 21.Wang T, et al. Epigenetic aging signatures in mice livers are slowed by dwarfism, calorie restriction and rapamycin treatment. Genome Biol. 2017;18:57. doi: 10.1186/s13059-017-1186-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Field AE, et al. DNA methylation clocks in aging: categories, causes, and consequences. Mol. Cell. 2018;71:882–895. doi: 10.1016/j.molcel.2018.08.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Yatabe Y, Tavare S, Shibata D. Investigating stem cells in human colon by using methylation patterns. Proc. Natl Acad. Sci. USA. 2001;98:10839–10844. doi: 10.1073/pnas.191225998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Endicott JL, Nolte PA, Shen H, Laird PW. Cell division drives DNA methylation loss in late-replicating domains in primary human cells. Nat. Commun. 2022;13:6659. doi: 10.1038/s41467-022-34268-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Curtius K, et al. A molecular clock infers heterogeneous tissue age among patients with Barrett’s esophagus. PLoS Comput. Biol. 2016;12:e1004919. doi: 10.1371/journal.pcbi.1004919. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Luebeck GE, et al. Implications of epigenetic drift in colorectal neoplasia. Cancer Res. 2019;79:495–504. doi: 10.1158/0008-5472.CAN-18-1682. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Genereux DP, Miner BE, Bergstrom CT, Laird CD. A population-epigenetic model to infer site-specific methylation rates from double-stranded DNA methylation patterns. Proc. Natl Acad. Sci. USA. 2005;102:5802–5807. doi: 10.1073/pnas.0502036102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Sontag LB, Lorincz MC, Georg Luebeck E. Dynamics, stability and inheritance of somatic DNA methylation imprints. J. Theor. Biol. 2006;242:890–899. doi: 10.1016/j.jtbi.2006.05.012. [DOI] [PubMed] [Google Scholar]
- 29.Minteer CJ, et al. More than bad luck: cancer and aging are linked to replication-driven changes to the epigenome. Sci. Adv. 2023;9:eadf4163. doi: 10.1126/sciadv.adf4163. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Teschendorff AE. A comparison of epigenetic mitotic-like clocks for cancer risk prediction. Genome Med. 2020;12:56. doi: 10.1186/s13073-020-00752-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Teschendorff AE, Relton CL. Statistical and integrative system-level analysis of DNA methylation data. Nat. Rev. Genet. 2018;19:129–147. doi: 10.1038/nrg.2017.86. [DOI] [PubMed] [Google Scholar]
- 32.Teschendorff AE, et al. DNA methylation outliers in normal breast tissue identify field defects that are enriched in cancer. Nat. Commun. 2016;7:10478. doi: 10.1038/ncomms10478. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Duran-Ferrer M, et al. The proliferative history shapes the DNA methylome of B-cell tumors and predicts clinical outcome. Nat. Cancer. 2020;1:1066–1081. doi: 10.1038/s43018-020-00131-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Alexandrov LB, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578:94–101. doi: 10.1038/s41586-020-1943-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Gabbutt C, Wright NA, Baker AM, Shibata D, Graham TA. Lineage tracing in human tissues. J. Pathol. 2022;257:501–512. doi: 10.1002/path.5911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Gabbutt C, et al. Fluctuating methylation clocks for cell lineage tracing at high temporal resolution in human tissues. Nat. Biotechnol. 2022;40:720–730. doi: 10.1038/s41587-021-01109-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Teschendorff AE, Zhu T, Breeze CE, Beck S. EPISCORE: cell type deconvolution of bulk tissue DNA methylomes from single-cell RNA-Seq data. Genome Biol. 2020;21:221. doi: 10.1186/s13059-020-02126-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Zhu T, et al. A pan-tissue DNA methylation atlas enables in silico decomposition of human tissue methylomes at cell-type resolution. Nat. Methods. 2022;19:296–306. doi: 10.1038/s41592-022-01412-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Nejman D, et al. Molecular rules governing de novo methylation in cancer. Cancer Res. 2014;74:1475–1483. doi: 10.1158/0008-5472.CAN-13-3042. [DOI] [PubMed] [Google Scholar]
- 40.Salas LA, et al. Enhanced cell deconvolution of peripheral blood using DNA methylation for high-resolution immune profiling. Nat. Commun. 2022;13:761. doi: 10.1038/s41467-021-27864-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Luo Q, et al. A meta-analysis of immune-cell fractions at high resolution reveals novel associations with common phenotypes and health outcomes. Genome Med. 2023;15:59. doi: 10.1186/s13073-023-01211-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Chen L, et al. Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell. 2016;167:1398–1414 e24. doi: 10.1016/j.cell.2016.10.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Paul DS, et al. Increased DNA methylation variability in type 1 diabetes across three immune effector cell types. Nat. Commun. 2016;7:13555. doi: 10.1038/ncomms13555. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Moss J, et al. Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 2018;9:5068. doi: 10.1038/s41467-018-07466-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Oliva M, et al. DNA methylation QTL mapping across diverse human tissues provides molecular links between genetic variation and complex traits. Nat. Genet. 2023;55:112–122. doi: 10.1038/s41588-022-01248-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Breeze CE, et al. eFORGE: a tool for identifying cell type-specific signal in epigenomic data. Cell Rep. 2016;17:2137–2150. doi: 10.1016/j.celrep.2016.10.059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Breeze CE, et al. eFORGE v2.0: updated analysis of cell type-specific signal in epigenomic data. Bioinformatics. 2019;35:4767–4769. doi: 10.1093/bioinformatics/btz456. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Chen Y, Breeze CE, Zhen S, Beck S, Teschendorff AE. Tissue-independent and tissue-specific patterns of DNA methylation alteration in cancer. Epigenet. Chromatin. 2016;9:10. doi: 10.1186/s13072-016-0058-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Luebeck EG, et al. Identification of a key role of widespread epigenetic drift in Barrett’s esophagus and esophageal adenocarcinoma. Clin. Epigenet. 2017;9:113. doi: 10.1186/s13148-017-0409-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Brunner SF, et al. Somatic mutations and clonal dynamics in healthy and cirrhotic human liver. Nature. 2019;574:538–542. doi: 10.1038/s41586-019-1670-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Martincorena I, et al. Somatic mutant clones colonize the human esophagus with age. Science. 2018;362:911–917. doi: 10.1126/science.aau3879. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Moore L, et al. The mutational landscape of normal human endometrial epithelium. Nature. 2020;580:640–646. doi: 10.1038/s41586-020-2214-z. [DOI] [PubMed] [Google Scholar]
- 53.Izzo F, et al. DNA methylation disruption reshapes the hematopoietic differentiation landscape. Nat. Genet. 2020;52:378–387. doi: 10.1038/s41588-020-0595-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Genovese G, et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N. Engl. J. Med. 2014;371:2477–2487. doi: 10.1056/NEJMoa1409405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Mitchell E, et al. Clonal dynamics of haematopoiesis across the human lifespan. Nature. 2022;606:343–350. doi: 10.1038/s41586-022-04786-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Jaiswal S, et al. Age-related clonal hematopoiesis associated with adverse outcomes. N. Engl. J. Med. 2014;371:2488–2498. doi: 10.1056/NEJMoa1408617. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Sender R, Milo R. The distribution of cellular turnover in the human body. Nat. Med. 2021;27:45–48. doi: 10.1038/s41591-020-01182-9. [DOI] [PubMed] [Google Scholar]
- 58.Loyfer N, et al. A DNA methylation atlas of normal human cell types. Nature. 2023;613:355–364. doi: 10.1038/s41586-022-05580-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 2015;6:8971. doi: 10.1038/ncomms9971. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Alcantara Llaguno SR, Parada LF. Cell of origin of glioma: biological and clinical implications. Br. J. Cancer. 2016;115:1445–1450. doi: 10.1038/bjc.2016.354. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Xin L. Cells of origin for cancer: an updated view from prostate cancer. Oncogene. 2013;32:3655–3663. doi: 10.1038/onc.2012.541. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Wang X, et al. A luminal epithelial stem cell that is a cell of origin for prostate cancer. Nature. 2009;461:495–500. doi: 10.1038/nature08361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Goldstein AS, Huang J, Guo C, Garraway IP, Witte ON. Identification of a cell of origin for human prostate cancer. Science. 2010;329:568–571. doi: 10.1126/science.1189992. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Zhang D, Zhao S, Li X, Kirk JS, Tang DG. Prostate luminal progenitor cells in development and cancer. Trends Cancer. 2018;4:769–783. doi: 10.1016/j.trecan.2018.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Wang ZA, Toivanen R, Bergren SK, Chambon P, Shen MM. Luminal cells are favored as the cell of origin for prostate cancer. Cell Rep. 2014;8:1339–1346. doi: 10.1016/j.celrep.2014.08.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Karthaus WR, et al. Regenerative potential of prostate luminal cells revealed by single-cell analysis. Science. 2020;368:497–505. doi: 10.1126/science.aay0267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Issa JP. Aging and epigenetic drift: a vicious cycle. J. Clin. Invest. 2014;124:24–29. doi: 10.1172/JCI69735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Wattacheril JJ, Raj S, Knowles DA, Greally JM. Using epigenomics to understand cellular responses to environmental influences in diseases. PLoS Genet. 2023;19:e1010567. doi: 10.1371/journal.pgen.1010567. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Wu S, Powers S, Zhu W, Hannun YA. Substantial contribution of extrinsic risk factors to cancer development. Nature. 2016;529:43–47. doi: 10.1038/nature16166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Tomasetti C, Vogelstein B. Cancer risk: role of environment-response. Science. 2015;347:729–731. doi: 10.1126/science.aaa6592. [DOI] [PubMed] [Google Scholar]
- 71.Ushijima T, Hattori N. Molecular pathways: involvement of Helicobacter pylori-triggered inflammation in the formation of an epigenetic field defect, and its usefulness as cancer risk and exposure markers. Clin. Cancer Res. 2012;18:923–929. doi: 10.1158/1078-0432.CCR-11-2011. [DOI] [PubMed] [Google Scholar]
- 72.Ushijima T. Epigenetic field for cancerization. J. Biochem Mol. Biol. 2007;40:142–150. doi: 10.5483/bmbrep.2007.40.2.142. [DOI] [PubMed] [Google Scholar]
- 73.Issa JP. Epigenetic variation and cellular Darwinism. Nat. Genet. 2011;43:724–726. doi: 10.1038/ng.897. [DOI] [PubMed] [Google Scholar]
- 74.Herceg Z, et al. Roadmap for investigating epigenome deregulation and environmental origins of cancer. Int J. Cancer. 2018;142:874–882. doi: 10.1002/ijc.31014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Lappalainen T, Greally JM. Associating cellular epigenetic models with human phenotypes. Nat. Rev. Genet. 2017;18:441–451. doi: 10.1038/nrg.2017.32. [DOI] [PubMed] [Google Scholar]
- 76.Teschendorff AE, et al. Correlation of smoking-associated dna methylation changes in buccal cells with DNA methylation changes in epithelial cancer. JAMA Oncol. 2015;1:476–485. doi: 10.1001/jamaoncol.2015.1053. [DOI] [PubMed] [Google Scholar]
- 77.Zheng SC, et al. A novel cell-type deconvolution algorithm reveals substantial contamination by immune cells in saliva, buccal and cervix. Epigenomics. 2018;10:925–940. doi: 10.2217/epi-2018-0037. [DOI] [PubMed] [Google Scholar]
- 78.Johnson ND, et al. Differential DNA methylation and changing cell-type proportions as fibrotic stage progresses in NAFLD. Clin. Epigenet. 2021;13:152. doi: 10.1186/s13148-021-01129-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Alexandrov LB, Nik-Zainal S, Wedge DC, Campbell PJ, Stratton MR. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013;3:246–259. doi: 10.1016/j.celrep.2012.12.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Abascal F, et al. Somatic mutation landscapes at single-molecule resolution. Nature. 2021;593:405. doi: 10.1038/s41586-021-03477-4. [DOI] [PubMed] [Google Scholar]
- 81.Lee-Six H, et al. The landscape of somatic mutation in normal colorectal epithelial cells. Nature. 2019;574:532–537. doi: 10.1038/s41586-019-1672-7. [DOI] [PubMed] [Google Scholar]
- 82.Horsthemke B. Epimutations in human disease. Curr. Top. Microbiol. Immunol. 2006;310:45–59. doi: 10.1007/3-540-31181-5_4. [DOI] [PubMed] [Google Scholar]
- 83.Laurenti E, Gottgens B. From haematopoietic stem cells to complex differentiation landscapes. Nature. 2018;553:418–426. doi: 10.1038/nature25022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Teschendorff AE, Jones A, Widschwendter M. Stochastic epigenetic outliers can define field defects in cancer. BMC Bioinform. 2016;17:178. doi: 10.1186/s12859-016-1056-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Chen Y, Widschwendter M, Teschendorff AE. Systems-epigenomics inference of transcription factor activity implicates aryl-hydrocarbon-receptor inactivation as a key event in lung cancer development. Genome Biol. 2017;18:236. doi: 10.1186/s13059-017-1366-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Liu T, et al. Computational identification of preneoplastic cells displaying high stemness and risk of cancer progression. Cancer Res. 2022;82:2520–2537. doi: 10.1158/0008-5472.CAN-22-0668. [DOI] [PubMed] [Google Scholar]
- 87.Nazor KL, et al. Recurrent variations in DNA methylation in human pluripotent stem cells and their differentiated derivatives. Cell Stem Cell. 2012;10:620–634. doi: 10.1016/j.stem.2012.02.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Aranyi T, et al. Systemic epigenetic response to recombinant lentiviral vectors independent of proviral integration. Epigenet. Chromatin. 2016;9:29. doi: 10.1186/s13072-016-0077-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Slieker RC, et al. DNA methylation landscapes of human fetal development. PLoS Genet. 2015;11:e1005583. doi: 10.1371/journal.pgen.1005583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90.Aryee MJ, et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014;30:1363–1369. doi: 10.1093/bioinformatics/btu049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Teschendorff AE, et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics. 2013;29:189–196. doi: 10.1093/bioinformatics/bts680. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.You C, et al. A cell-type deconvolution meta-analysis of whole blood EWAS reveals lineage-specific smoking-associated DNA methylation changes. Nat. Commun. 2020;11:4779. doi: 10.1038/s41467-020-18618-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
- 94.Hannum G, et al. Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell. 2013;49:359–367. doi: 10.1016/j.molcel.2012.10.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Johansson A, Enroth S, Gyllensten U. Continuous aging of the human DNA methylome throughout the human lifespan. PLoS ONE. 2013;8:e67378. doi: 10.1371/journal.pone.0067378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Kocher K, et al. Genome-wide neonatal epigenetic changes associated with maternal exposure to the COVID-19 pandemic. BMC Med Genom. 2023;16:268. doi: 10.1186/s12920-023-01707-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Petroff RL, et al. Translational toxicoepigenetic meta-analyses identify homologous gene DNA methylation reprogramming following developmental phthalate and lead exposure in mouse and human offspring. Environ. Int. 2024;186:108575. doi: 10.1016/j.envint.2024.108575. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Zhou W, Triche TJ, Jr., Laird PW, Shen H. SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 2018;46:e123. doi: 10.1093/nar/gky691. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Colaprico A, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44:e71. doi: 10.1093/nar/gkv1507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17:520–525. doi: 10.1093/bioinformatics/17.6.520. [DOI] [PubMed] [Google Scholar]
- 101.Liu J, et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173:400–416 e11. doi: 10.1016/j.cell.2018.02.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Pipinikas CP, et al. Epigenetic dysregulation and poorer prognosis in DAXX-deficient pancreatic neuroendocrine tumours. Endocr. Relat. Cancer. 2015;22:L13–L18. doi: 10.1530/ERC-15-0108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Tian Y, et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics. 2017;33:3982–3984. doi: 10.1093/bioinformatics/btx513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Teschendorff AE, Jing H, Paul DS, Virta J, Nordhausen K. Tensorial blind source separation for improved analysis of multi-omic data. Genome Biol. 2018;19:76. doi: 10.1186/s13059-018-1455-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Reynolds LM, et al. Age-related variations in the methylome associated with gene expression in human monocytes and T cells. Nat. Commun. 2014;5:5366. doi: 10.1038/ncomms6366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Tserel L, et al. Age-related profiling of DNA methylation in CD8+ T cells reveals changes in immune response and transcriptional regulator genes. Sci. Rep. 2015;5:13107. doi: 10.1038/srep13107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Zheng SC, et al. Correcting for cell-type heterogeneity in epigenome-wide association studies: revisiting previous analyses. Nat. Methods. 2017;14:216–217. doi: 10.1038/nmeth.4187. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Kular L, et al. DNA methylation as a mediator of HLA-DRB1*15:01 and a protective variant in multiple sclerosis. Nat. Commun. 2018;9:2397. doi: 10.1038/s41467-018-04732-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Song N, et al. Persistent variations of blood DNA methylation associated with treatment exposures and risk for cardiometabolic outcomes in long-term survivors of childhood cancer in the St. Jude Lifetime Cohort. Genome Med. 2021;13:53. doi: 10.1186/s13073-021-00875-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Shang L, et al. meQTL mapping in the GENOA study reveals genetic determinants of DNA methylation in African Americans. Nat. Commun. 2023;14:2711. doi: 10.1038/s41467-023-37961-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Barturen G, et al. Whole blood DNA methylation analysis reveals respiratory environmental traits involved in COVID-19 severity following SARS-CoV−2 infection. Nat. Commun. 2022;13:4597. doi: 10.1038/s41467-022-32357-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Robinson O, et al. Determinants of accelerated metabolomic and epigenetic aging in a UK cohort. Aging Cell. 2020;19:e13149. doi: 10.1111/acel.13149. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Zhang X, et al. Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality. Clin. Epigenet. 2018;10:155. doi: 10.1186/s13148-018-0591-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Ventham NT, et al. Integrative epigenome-wide analysis demonstrates that DNA methylation may mediate genetic risk in inflammatory bowel disease. Nat. Commun. 2016;7:13507. doi: 10.1038/ncomms13507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Hannon E, et al. An integrated genetic-epigenetic analysis of schizophrenia: evidence for co-localization of genetic associations and differential DNA methylation. Genome Biol. 2016;17:176. doi: 10.1186/s13059-016-1041-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116.Hannon E, et al. DNA methylation meta-analysis reveals cellular alterations in psychosis and markers of treatment-resistant schizophrenia. eLife. 2021;10:e58430. doi: 10.7554/eLife.58430. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Zannas AS, et al. Epigenetic upregulation of FKBP5 by aging and stress contributes to NF-kappaB-driven inflammation and cardiovascular risk. Proc. Natl Acad. Sci. USA. 2019;116:11370–11379. doi: 10.1073/pnas.1816847116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118.Flanagan JM, et al. Temporal stability and determinants of white blood cell DNA methylation in the breakthrough generations study. Cancer Epidemiol Biomarkers Prev. 2015;24:221–9. doi: 10.1158/1055-9965.EPI-14-0767. [DOI] [PubMed] [Google Scholar]
- 119.Lehne B, et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol. 2015;16:37. doi: 10.1186/s13059-015-0600-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Voisin S, et al. Meta-analysis of genome-wide DNA methylation and integrative omics of age in human skeletal muscle. J. Cachexia Sarcopenia Muscle. 2021;12:1064–1078. doi: 10.1002/jcsm.12741. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121.Liu Y, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 2013;31:142–147. doi: 10.1038/nbt.2487. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Tsaprouni LG, et al. Cigarette smoking reduces DNA methylation levels at multiple genomic loci but the effect is partially reversible upon cessation. Epigenetics. 2014;9:1382–1396. doi: 10.4161/15592294.2014.969637. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Pai S, et al. Differential methylation of enhancer at IGF2 is associated with abnormal dopamine synthesis in major psychosis. Nat. Commun. 2019;10:2046. doi: 10.1038/s41467-019-09786-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Guintivano J, Aryee MJ, Kaminsky ZA. A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics. 2013;8:290–302. doi: 10.4161/epi.23924. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Gasparoni G, et al. DNA methylation analysis on purified neurons and glia dissects age and Alzheimer’s disease-specific changes in the human cortex. Epigenet. Chromatin. 2018;11:41. doi: 10.1186/s13072-018-0211-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Kozlenkov A, et al. DNA methylation profiling of human prefrontal cortex neurons in heroin users shows significant difference between genomic contexts of hyper- and hypomethylation and a younger epigenetic age. Genes. 2017;8:152. doi: 10.3390/genes8060152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Johnson KC, et al. DNA methylation in ductal carcinoma in situ related with future development of invasive breast cancer. Clin. Epigenet. 2015;7:75. doi: 10.1186/s13148-015-0094-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Huang KK, et al. Genomic and epigenomic profiling of high-risk intestinal metaplasia reveals molecular determinants of progression to gastric cancer. Cancer Cell. 2018;33:137–150 e5. doi: 10.1016/j.ccell.2017.11.018. [DOI] [PubMed] [Google Scholar]
- 129.Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14:R115. doi: 10.1186/gb-2013-14-10-r115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Maity AK, et al. Novel epigenetic network biomarkers for early detection of esophageal cancer. Clin. Epigenet. 2022;14:23. doi: 10.1186/s13148-022-01243-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Bormann F, et al. Cell-of-origin DNA methylation signatures are maintained during colorectal carcinogenesis. Cell Rep. 2018;23:3407–3418. doi: 10.1016/j.celrep.2018.05.045. [DOI] [PubMed] [Google Scholar]
- 132.Goeppert B, et al. Integrative analysis reveals early and distinct genetic and epigenetic changes in intraductal papillary and tubulopapillary cholangiocarcinogenesis. Gut. 2022;71:391–401. doi: 10.1136/gutjnl-2020-322983. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Silva R, et al. Longitudinal analysis of individual cfDNA methylome patterns in metastatic prostate cancer. Clin. Epigenet. 2021;13:168. doi: 10.1186/s13148-021-01155-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Nemeth CG, et al. Recurrent chromosomal and epigenetic alterations in oral squamous cell carcinoma and its putative premalignant condition oral lichen planus. PLoS ONE. 2019;14:e0215055. doi: 10.1371/journal.pone.0215055. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Luo Y, et al. Differences in DNA methylation signatures reveal multiple pathways of progression from adenoma to colorectal cancer. Gastroenterology. 2014;147:418–29 e8. doi: 10.1053/j.gastro.2014.04.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Teschendorff AE, Breeze CE, Zheng SC, Beck S. A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinform. 2017;18:105. doi: 10.1186/s12859-017-1511-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 137.Teschendorff AE, Widschwendter M. Differential variability improves the identification of cancer risk markers in DNA methylation studies profiling precursor cancer lesions. Bioinformatics. 2012;28:1487–1494. doi: 10.1093/bioinformatics/bts170. [DOI] [PubMed] [Google Scholar]
- 138.Song MA, et al. Landscape of genome-wide age-related DNA methylation in breast tissue. Oncotarget. 2017;8:114648–114662. doi: 10.18632/oncotarget.22754. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The main Illumina DNA methylation 450k/EPIC datasets used here are freely available from the following public repositories: Endicott (182 cell-line samples, GSE197512)24. Hannum (656 whole blood, GSE40279, [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40279])94; MESA (214 purified CD4+ T-cells and 1202 Monocyte samples, GSE56046GSE56581)105; Tserel (98 CD8+ T-cells, GSE59065)106; BLUEPRINT (139 matched CD4+ T-cells, Monocytes and Neutrophils, EGAS00001001456)42; Paul (100 B cells, 98 T cells, and 104 monocytes, EGA: EGAS00001001598)43; Liu (335 whole blood, GSE42861)121; Pai (n = 28 sorted neurons, GSE112179)123, Guintivano (n = 29 sorted neurons, GSE41826)124, Gasparoni (n = 16 sorted neurons, GSE66351)125 and Kozlenkov (n = 29 sorted neurons, GSE98203)126; Gastric tissue (191 normal and metaplasia, GSE103186)128; Colon tissue (47 normal and adenoma, E-MTAB-6450)131; Colon tissue2 (123 normal, adenoma, and cancer samples, GSE48684)135; Breast Erlangen (397 normal, GSE69914)32; Breast2 (121 normal, GSE101961)138; Breast Johnson (55 normal and ductal carcinoma in situ samples, GSE66313)127; Liver tissue (137 normal, premalignant and cholangiocarcinoma samples, GSE156299)132; Prostate tissue (36 normal, neoplasia, primary and metastatic samples, GSE116338)133; Oral tissue (41 normal, lichen planus and oral squamous cell carcinoma samples, GSE123781)134; Esophagus (50 normal, 81 Barrett’s Esophagus and 24 adenocarcinomas, GSE104707)49; Liver-NAFLD (n = 325, GSE180474)78; SCM2 (37 fetal tissue samples, GSE31848)87; Cord Blood (15 samples, GSE72867)88; Slieker (34 fetal tissue samples, GSE56515)89; eGTEX (987 samples from 9 normal tissue-types, GSE213478)45; Buccal Swabs from Infants (44 samples, GSE229463)96; Cord Blood from Neonates (128 samples, GSE195595)97; TCGA data was downloaded from https://gdc.cancer.gov; The DNAm dataset in buccal cells from the NSHD MRC194676 is available by submitting data requests to mrclha.swiftinfo@ucl.ac.uk; see full policy at http://www.nshd.mrc.ac.uk/data.aspx. Managed access is in place for this 69-year-old study to ensure that the use of the data is within the bounds of consent given previously by participants and to safeguard any potential threat to anonymity since the participants are all born in the same week. The DNAm atlas encompassing DNAm reference matrices for 13 tissue-types encompassing over 40 cell types is freely available from the EpiSCORE R-package https://github.com/aet21/EpiSCORE. Source data are provided with this paper. The remaining data are available within the Article, Supplementary Information, or Source Data file. Source data are provided with this paper.
An R-package “EpiMitClocks” with functions to estimate the mitotic age for each of the clocks in this work is freely available from https://github.com/aet21/EpiMitClocks The package comes with a tutorial vignette. The R-package EpiSCORE for estimating cell-type fractions in solid tissue-types is available from https://github.com/aet21/EpiSCORE The package comes with a tutorial vignette. We have also made an OceanCode Capsule available from https://codeocean.com/capsule/3811164/tree.