Abstract
DNA methylation is an important biological process which involves the reversible addition of chemical tags called methyl groups to DNA and affects whether genes are active or inactive. Individual methylation profiles are determined by both genetic and environmental influences. Inter-individual variation in DNA methylation profiles can be exploited to estimate or predict a wide variety of human characteristics and disease risk profiles. Indeed, a number of methylation-based predictors of human traits have been developed and linked to important health outcomes. However, there is an unmet need to communicate the applicability and limitations of state-of-the-art methylation-based predictors to the wider community. To address this, we created a secure, web-based interactive platform called ‘MethylDetectR’ which calculates estimated values or scores for a variety of human traits using blood methylation data. These traits include age, lifestyle traits, high-density lipoprotein cholesterol and the levels of 27 blood proteins related to inflammatory and neurological processes and disease. Methylation-based predictors often return scores on arbitrary scales. To provide meaning to these scores, users can interactively view how estimated trait scores for a given individual compare against other individuals in the sample. Users can optionally upload binary phenotypes and investigate how estimated traits vary according to case vs. control status for these phenotypes. Users can also view how different methylation-based predictors correlate with one another, and with phenotypic values for corresponding traits in a large reference sample (n = 4,450; Generation Scotland). The ‘MethylDetectR’ platform allows for the fast and secure calculation of DNA methylation-derived estimates for many human traits. This platform also helps to show the correlations between methylation-based scores and corresponding traits at the level of a sample, report estimated health profiles at an individual level, demonstrate how scores relate to important binary outcomes of interest and highlight the current limitations of molecular health predictors.
Keywords: epigenetics, DNA methylation, epidemiology, translation, Shiny, prediction
Introduction
DNA methylation (DNAm) is an epigenetic mechanism in which methyl groups are added to the genome sequence. Inter-individual variability in DNAm profiles results from differences in both underlying genetics and environmental influences 1. Factors such as diet, stress and smoking behaviours may influence the process of methylation. Typically, methyl groups are added to cytosine residues in the context of a cytosine-guanine dinucleotide (CpG site) 2. The addition of these chemical tags can alter whether, and to what extent, a gene is active. In contrast to genetics, these molecular modifications are dynamic, tissue-specific and reversible 3.
Biological data may be harnessed to estimate or predict a variety of human characteristics and disease risk profiles ( Figure 1). There is a growing body of evidence demonstrating the effective creation and application of DNAm-based predictors of human traits and health 4– 17. Additionally, methylation-based predictors of traits such as smoking status may provide more accurate measurements than self-reported information, thereby allowing for improved disease prediction and risk stratification 18. Blood DNAm data is often used as it is minimally-invasive to collect and it provides a good index of the overall health status of the body.
Increased training sample sizes and refinements in statistical and machine learning methodologies have improved the accuracy of DNAm-based predictors 19, 20. Furthermore, there has been a rapid increase in the commercialisation and scalability of DNAm assays for direct-to-consumer use or for use in clinical, research or industrial settings. A major goal of using these predictors is to aid in prediction strategies and provide better clinical outcomes for individuals. Therefore, translational platforms for methylation-based health profiling are warranted in order to communicate the applications and limitations of DNAm-based predictors of human traits.
To address this need, we have created a web-based platform called ‘MethylDetectR’ that allows for an interactive demonstration of state-of-the-art DNAm-based predictors. The DNAm-based predictors in this platform include a highly accurate predictor of chronological age trained in > 13,000 individuals across 14 cohorts with a root mean squared error of 2.04 years in the original publication 19. We also include six DNAm-based predictors of lifestyle and biochemical traits: alcohol consumption per week, body fat percentage, body mass index, high-density lipoprotein (HDL) cholesterol, smoking status and waist-to-hip ratio 21. These predictors were generated in 5,087 individuals who are members of the Generation Scotland: Scottish Family Health Study (GS) which represents one of the largest DNAm resources in the world. Furthermore, we have recently developed DNAm-based proxies for the levels of 160 blood proteins that are related to inflammatory or neurological processes and disease states (Olink® inflammation and neurology panels) in 875 and 706 members of the Lothian Birth Cohort 1936 (LBC1936) study, respectively (Gadd et al. 2020 (unpublished report)). Of these DNAm-based proxies, 27 predictors of protein levels correlated well (see Methods, P < 0.1) with measured protein levels in an LBC1936 holdout sample and external test sets. These 27 DNAm-based predictors of plasma protein levels are included in the ‘MethylDetectR’ platform ( Figure 2).
Briefly, the ‘MethylDetectR’ platform consists of two applications. The first application, named ‘MethylDetectR – Calculate Your Scores’, allows users to securely upload Illumina 450k or EPIC DNAm array data and obtain blood-based methylation predicted scores (or values) for the aforementioned traits. No data are stored by the application. Furthermore, predicted scores are often returned on arbitrary scales. In the main ‘MethylDetectR’ application, users can compare DNAm-derived scores for any individual in their input dataset against other individuals in the input dataset. Percentile ranks for individuals in the input dataset may be downloaded. Users can also upload an optional file containing binary phenotypes whereby individuals are coded as ‘0’ for control status and ‘1’ for case status. This information allows users to view how distributions of the DNAm-derived traits vary by cases and controls. Users can also view how the various DNAm-based predictors correlate with one another in the input dataset and in a separate reference sample. This reference sample comprises 4,450 individuals who are members of the GS study. These individuals are unrelated to each other and distinct from those in the original training samples in which the predictors for age, HDL cholesterol and lifestyle traits were developed. They are also unrelated to those included in the original training sample. Furthermore, information is provided on how well the predicted scores correlate with phenotypic values for corresponding traits that are available in GS. Lastly, the user can subset the reference samples by sex, age range or case vs. control status.
This platform communicates important information relating to the generation and applicability of DNAm-based predictors of human traits and health. The ‘MethylDetectR’ platform also represents a research tool for fast and automatic generation of DNAm-derived estimates for human traits. This platform can show that DNAm-based scores for traits may correlate well with measured values for a given trait at the level of the cohort. For example, a predictor for epigenetic age correlates strongly with true age with a root mean squared error of 2.04 years 19. However, this platform also helps to show that predictors may report inaccurate values at an individual level. For instance, although the age predictor correlates well with true age at the level of the cohort, an individual’s predicted age may differ from their true age by a number of years or decades. The optional incorporation of binary phenotype data allows users to view how well established or putative risk factors, as estimated by DNAm data, are stratified according to cases and controls for a given trait of interest. Together, the functionalities of ‘MethylDetectR’ begin to address the translational gap in the development and implementation of molecular-based health predictors by highlighting their performance and limitations in advance of their potential utility in diagnostic and stratification paradigms.
Methods
Implementation
The ‘MethylDetectR’ platform. The ‘MethylDetectR’ platform consists of two applications. The first application is called ‘MethylDetectR - Calculate Your Scores’. Users may upload DNAm data as an R object (.rds file) and obtain estimated values or scores for a variety of traits across individuals in their input dataset ( https://shiny.igmm.ed.ac.uk/Calculate_Your_Scores/). The upload limit is 3 gigabytes; however, files greater than 500 megabytes may take a considerable amount of time to upload. An optional ‘SexAgeInfo’ file may also be uploaded in order to include sex and age information in the output file. This should be a .csv file and have three columns: one column for the IDs of individuals in the methylation file (‘ID’ column), one column should list the sex of these individuals written as ‘Male’ or ‘Female’ or ‘NA’ (‘Sex’ column) and one column should report the actual or chronological age of individuals (‘Age’ column). This functionality is important given that users can subset the input dataset and GS cohort by sex in the main ‘MethylDetectR’ application. Furthermore, if true age is included, then the application will use this information to subset the sample according to the age slider function on the sidebar panel. If this information is not uploaded, then epigenetic or predicted age will be used to subset the data by age range. In the case where some individuals have true age available and others have missing data, true age will be used for those who have such data and epigenetic age will be used for those without age data in order to the subset the sample. It is strongly recommended that anonymised or pseudonymised IDs are used where possible. For the user’s own convenience in preparing the methylation object, it is recommended that individuals are included as columns and CpG sites as rows. However, this version or a transposed version are accepted and automatically processed by the software. The following features also aid with automation in generating DNAm-based scores for traits:
Beta values or M values are accepted with the latter converted to beta values by the software.
Missing methylation values are accepted and mean imputed across input individuals by the software.
CpG sites that are necessary for the estimation of a trait but are missing in the uploaded dataset are allowed. In this case, each individual in the input dataset receives the mean beta value for a given missing CpG site from either GS (n = 4,450; for age, lifestyle traits or HDL cholesterol) or LBC1936 (for protein proxies). In effect, this gives every individual in the uploaded dataset a constant that brings their score closer to that of the reference sample. In this way, all CpG sites are used for any sample uploaded.
Predicted values or scores for the aforementioned traits can be downloaded as a .csv file which may be uploaded to the main ‘MethylDetectR’ application. Alternatively, an R script is provided to generate these DNAm-based scores if the user does not wish to upload DNAm data or if the DNAm file is too large for upload ( https://doi.org/10.5281/zenodo.4291880 22).
The main ‘MethylDetectR’ application can be described in four modules or panels ( https://shiny.igmm.ed.ac.uk/MethylDetectR/). A demo version of the app which does not require the upload of data is available at: https://shiny.igmm.ed.ac.uk/MethylDetectR_Demo/. In the first panel, users can view DNAm-based scores corresponding to various traits for any individual in their input dataset. The user can interactively compare scores for any individual to the remainder of the input dataset. A .csv file may be downloaded which shows percentile ranks for each individual in the uploaded dataset. Percentile ranks are reported for each estimated trait with the exception of the age predictor which is reported in years. If the user uploads an optional binary phenotype file, they can also view the distributions of the DNAm-based scores according to case vs. control status for a given trait of interest. This file should contain IDs of individuals in the methylation file (‘ID’ column) as the first column and the remaining columns may contain any names or traits of interest with individuals coded as ‘0’ for controls and ‘1’ for cases. In the second panel, a plot shows the percentile ranks for a given individual for up to 10 traits of interest. Users can also choose to view the spread of percentile ranks for cases versus controls. Here, the median percentile for cases, along with the first and third quartiles (interquartile range), are plotted for each selected DNAm-based estimated trait. In the third panel, the user can view how different DNAm-based predictors correlate with one another in both the input and GS datasets. In the fourth panel, users can view how DNAm-based scores for age, lifestyle and biochemical traits correlate with phenotypic values for these traits in GS participants (n = 4,450). The DNAm-based proxies for protein levels are not shown as phenotypic values for protein levels are not available for all 27 proteins in GS. However, these 27 proxies were all correlated r > 0.1 with measured plasma protein levels in the Lothian Birth Cohort 1936 (Gadd et al. 2020 (unpublished report)). In each panel, users can subset the samples by age range and sex. In panels 1-3, options to subset the dataset by case vs. control status are present.
No data are stored in either application and are deleted upon closing the application. Applications are also timed out after three minutes of inactivity and are hosted on patched and secure servers within the Institute of Genetics and Molecular Medicine, University of Edinburgh. This research and translational tool complies with GDPR guidelines and has been designed to ensure the highest level of data security and privacy. The applications are also available at the following website: https://www.ed.ac.uk/centre-genomic-medicine/research-groups/marioni-group/methyldetectr. Information relating to participant consent is also available at this website. Given that no data are stored, this information pertains to general risk surrounding the upload of biological data to online software and the measures taken to mitigate the risk of motivated intruders gaining access to such data.
Development of a DNAm-based predictor. To develop ‘omics’-based predictors, such as DNAm-based predictors, statistical or machine learning methodologies are commonly applied. In the case of DNAm, the process begins with the quantification of DNAm across individuals using a tissue or cell-type of interest. Many studies focus on blood as it integrates information from various tissues around the body and represents an inexpensive and minimally-invasive approach to gather molecular data. Many cohort studies also have DNA from historic blood samples stored and available to analyse. The collection of saliva and buccal samples is becoming increasingly popular as a relatively low cost and non-invasive method for cohort studies interested in epigenetic epidemiology. The number of CpG sites which are measured depends on the array used but typically includes up to around 800,000 unique sites. Following quantification and quality control, a researcher may wish to study the association between DNAm and a trait of interest, such as smoking status. The average methylation level at a given CpG site across individuals will be correlated with the trait of interest. Methylation levels may be reported between 0-100% for convenience, and a level of 50% means that 50% of cells or DNA molecules within an individual’s sample show methylation at that CpG site. One approach is to correlate each CpG site, in turn, with the trait of interest thereby considering each CpG site in isolation. This is approach is referred to as an epigenome-wide or methylome-wide association study (EWAS or MWAS). Alternatively, methods such as penalised regression can be used to model all CpG sites simultaneously producing parsimonious models that account for correlated features/sites (e.g., least absolute shrinkage and selection operator or LASSO regression) or models that apply small weights to all features/sites (e.g., ridge regression). Elastic net regression is a commonly used intermediate of these two approaches. Correlations among CpGs may arise from sites which lie near each other in a genomic region or from a shared environmental influence; for instance, inhalation of cigarette smoke may affect many CpG sites across different chromosomes. A subset of CpG sites may show a strong relationship with the trait of interest and therefore be informative for predicting the trait in other individuals. The strength of the correlation, or association, is represented by an effect size and provides a weighting for that CpG site’s importance in predicting the trait. In a separate or test group of individuals, predicted values or scores for the trait can be obtained by multiplying DNAm levels at each informative CpG site by their weight derived from statistical analyses. The sum of these products provides a predicted or estimated value for the trait. A statistical transformation can be applied to return DNAm-based scores on the original scale for the trait, such as pack years for smoking. Alternatively, comparison to other individuals may provide meaning to the DNAm-based scores. In any case, predicted values or scores in the test sample may be correlated with true values for a given trait to provide an index of predictive power.
DNAm-based predictors in ‘MethylDetectR’. In ‘MethylDetectR’, we include DNAm-based predictors of chronological age, six lifestyle and biochemical traits and the levels of 27 blood proteins.
Age predictor
The age predictor was developed by Zhang et al. using elastic net regression and best linear unbiased prediction (BLUP) 19, 23, 24. The two sets of predictors were both built on the same set of 13,566 training samples spread across 14 cohorts; the elastic net method selected 514 CpG sites as informative for predicting chronological age whereas the BLUP predictor used all CpG sites (319,607 probes). In ‘MethylDetectR’, we apply the elastic net predictor owing to the faster computation and superior performance of this age predictor when compared to the BLUP predictor. The elastic net predictor correlates 0.98 with chronological age (root mean squared error = 2.6 years in GS (n = 4,450) and 2.04 in original publication 19). The age predictor is returned in values of years.
Lifestyle and biochemical traits
Previously, we generated ten predictors of lifestyle and biochemical traits in 5,087 individuals within the GS study using LASSO penalised regression 21, 25. Ten-fold cross-validation was applied and the mixing parameter (alpha) was set to 1. The lambda value corresponding to the minimum mean cross-validation error was selected and applied to generate the optimal models 25. In a test sample consisting of 875 individuals in the Lothian Birth Cohort 1936 study, DNAm-based predictors for four of the traits explained greater than 10% of phenotypic variance in their respective trait. These four traits were alcohol consumption, body mass index, high-density lipoprotein cholesterol and smoking behaviour. There were no phenotypic data available for body fat percentage and waist-to-hip ratio in the Lothian Birth Cohort 1936 study. However, these traits were highly correlated with body mass index in GS (correlation coefficients of 0.6 and 0.4, respectively). Therefore, body fat percentage and waist-to-hip ratio were brought forward to the ‘MethylDetectR’ platform in addition to the four DNAm-based predictors of lifestyle and biochemical traits which demonstrated test R 2 statistics of greater than 10% in the Lothian Birth Cohort 1936 sample. Four other traits demonstrated test R 2 statistics of less than 10%: educational attainment (2.5%), low-density lipoprotein and remnant cholesterol (0.6%), total cholesterol (2.7%) and total-to-high-density lipoprotein cholesterol ratio (4.5%). These four traits were not brought forward to the ‘MethylDetectR’ platform 21.
In the training sample, alcohol intake was assessed in units per week and was only considered in those who reported that their intake was representative of a normal week. A natural log(units + 1) transformation was applied to reduce skewness. For body mass index, extreme values defined as less than 17 kg/m 2 or greater than 50 kg/m 2 were removed and a natural log transformation was applied. Smoking behaviour was assessed using pack years which is calculated by multiplying the number of packs smoked per day by the number of years the participant has smoked. Current and never smokers were included; ex-smokers were removed owing to complications in adjusting for time since cessation when calculating pack years. To reduce skewness, a natural log(pack years + 1) transformation was applied. Further quality control details are available in the original publication 21. Each phenotype was then regressed on age, sex, and ten genetic principal components with the residuals being entered as the dependent variable in the LASSO models. DNAm values at CpG sites were the independent variables (n = 392,843 CpG sites). CpG sites were filtered to include loci present on both the Illumina EPIC and 450k arrays.
Blood protein levels
Recently, we developed DNAm-based predictors for the levels of 160 different proteins that are present on the Olink® inflammation and neurology panels. The panels include proteins that have established links to inflammatory and neurological processes or disease states, as well as exploratory proteins with roles in cellular regulation, cell-to-cell communication and development. Each Olink® panel consisted of 92 proteins. Two proteins on the neurology panel, MAPT and HAGH, were removed from analyses as more than 40% of samples had levels below the lowest limit of detection. In relation to the inflammation panel, one protein, BDNF, failed quality control and was removed from the study. For a further 21 proteins, more than 40% of samples also exhibited levels below the lowest limit of detection and were therefore removed from the study. This left a total of 70 proteins on the inflammation panel and 90 proteins on the neurology panel (n = 160 protein targets). All proteins were measured in plasma. The inflammatory proteins were measured in individuals at Wave 1 of the Lothian Birth Cohort 1936 study (mean age = 69.6 ± 0.8 years, n = 875). Neurology proteins were measured at a later time-point (Wave 2 – mean age = 72.5 ± 0.5 years, n = 706). Protein levels were rank-based inverse normalised and then regressed on age, sex and four genetic principal components with residuals from these models representing the dependent variables in machine learning models.
Predictors were developed using elastic net regression in members of the Lothian Birth Cohort 1936 study. The R package glmnet (version 4.0) was used to perform elastic net regression 25. The mixing parameter (alpha) was set to 0.5 and the lambda value corresponding to the minimum mean cross-validation error was applied to each elastic net model. Each fold included a single methylation processing batch (or combination of batches) with a range of 50–68 and 62–85 individuals per fold for the neurology and inflammation proteins, respectively. Two folds were set aside as the test data and ten-fold cross validation was performed using the remaining data (Neurology: n train = 576 and n test = 130, Inflammation: n train = 725 and n test = 150). Proteins were retained if the corresponding DNAm-based proxy showed an r coefficient > 0.1 and P < 0.1 with measured protein levels in the test set. In total, 18/92 Olink® neurology proteins (20.0%) and 20/70 Olink® inflammation proteins (28.6%) satisfied this criteria. For these 38 proteins, elastic net models were rerun using twelve-fold cross validation to incorporate all available data for the training stage. Each fold represented a single batch and ranged from 49–81 and 62–85 individuals per fold for the neurology and inflammation sets, respectively. For one of the 38 proteins (TRAIL), no features were selected by the elastic net model and thus this protein was excluded. The 37 DNAm protein proxies with more than one nonzero feature selected in the twelve-fold cross validation stage were then tested, where possible, by determining their correlation with measured protein levels in separate cohort studies: STRADL (subset of GS; n = 778, inflammatory and neurology panels, SOMAScan-based measurements) and Lothian Birth Cohort 1921 (LBC1921; n = 162 individuals, Olink® neurology panel). Out-of-sample testing was possible for 14/19 inflammatory and all 18 neurology proteins with overlapping proteins measured in STRADL and/or LBC1921. Proteins were retained if they showed r > 0 and P < 0.05 in at least one of the test sets. Twenty-three protein proxies satisfied this condition. Of the 5 inflammatory protein proxies with no overlapping protein data in STRADL or LBC1921, a further 3 proxies were included in the final proxy set as they showed a significant relationship with measured protein levels in the initial LBC1936 holdout set (P < 0.05). An additional inflammatory protein, IL6, was retained as it showed a significant correlation with IL6 levels measured through ELISA in LBC1936 members. Together, the validation stage resulted in the selection of 27 DNAm-based proxies and the corresponding features for these proteins were included in the ‘MethylDetectR’ platform (n = 13 inflammatory and n = 14 neurology protein proxies).
Operation
Software requirements. Both applications are hosted on a secure, patched server hosted at the University of Edinburgh. The applications are developed using Shiny (version 1.4) in R 26. The version of R used at the time of ‘MethylDetectR’ development was 3.5.0 27. For ‘MethylDetectR – Calculate Your Scores’, the following R packages were utilised: shinyWidgets (version 0.5.4) 28, shinythemes (version 1.1.2) 29, data.table (version 1.12) 30, shinyalert (version 1) 31. For the main ‘MethylDetectR’ application, shinyWidgets (version 0.5.4) 28, shinythemes (version 1.1.2) 29, data.table (version 1.12) 30, shinyalert (version 1) 31 were also used, in addition to ggplot2 (version 3.0) 32, dplyr (version 0.7) 33, forcats (version 0.4.0) 34, wesanderson (version 0.3.6) 35, shinycssloaders (version 0.2) 36, magick (version 2.5.0) 37, corrplot (version 0.84) 38, ggcorrplot (version 0.1) 39 and cowplot (version 0.9.4) 40. Scripts for implementing both applications are available at: https://doi.org/10.5281/zenodo.4291880 22. The script has been designed to ensure automatic installation of missing CRAN packages which are necessary for the operation of ‘MethylDetectR’.
Overview of workflow. The main components of the platform are outlined in the Implementation section and the associated workflow is graphically depicted in Figure 3.
Use cases
MethylDetectR – Calculate Your Scores
The user can upload a DNAm file as an R object (.rds file) to the ‘MethylDetectR – Calculate Your Scores’ application. Beta values or M values may be used. If M values are detected, these are converted to beta values. It is recommended that individuals are included as columns and CpG sites (derived from Illumina arrays) are included as rows and that file sizes of no greater than 500 Mb are uploaded. A ‘SexAgeInfo’ file may also be uploaded as a .csv file so that data pertaining to the ages and sex of the input individuals are included in the output file along with DNAm-based values for age (epigenetic age), lifestyle, biochemical and protein traits. This file should include a column corresponding to the IDs of individuals in the methylation file (‘ID’ column), a column that lists the sex of these individuals written as ‘Male’ or ‘Female’ or ‘NA’ (‘Sex’ column) and a column that reports the chronological or true ages of individuals (‘Age’ column). Alternatively, the user can merge this information into their DNAm score file after running the ‘MethylDetectR – Calculate Your Scores’ step. Examples for the DNAm and ‘SexAgeInfo’ files are available at: https://doi.org/10.5281/zenodo.4291880 22.
The software automatically generates DNAm-based scores for every individual in the dataset (.csv file) and a report for the user is printed on the application detailing quality control steps carried out during the calculation process. For example, the report informs the user whether or not the data had to be transposed, or if M values were converted to beta values ( Figure 4).
Alternatively, the user can download an R script to locally generate DNAm-based scores for the traits. The DNAm object is annotated as ‘data’ and the ‘SexAgeInfo’ input is annotated as ‘sexageinfo’ ( https://doi.org/10.5281/zenodo.4291880 22). In either case, an output .csv file is generated containing DNAm-based scores or values for each trait and for every individual in the input dataset. This output file should be uploaded to the main ‘MethylDetectR’ application. An example output .csv file showing the correct column names and file structure is available at: https://doi.org/10.5281/zenodo.4291880 22.
MethylDetectR
Panel 1. The output file from either the ‘MethylDetectR – Calculate Your Scores’ application or a publicly available script should be uploaded to the main ‘MethylDetectR’ application. Incorrectly assigned column names will be reported to the user, as will files with no individuals or files with non-numeric values. A timeout is triggered following three minutes of inactivity.
The first panel allows users to choose a predictor of interest and view how a selected individual in the input dataset ranks against the remainder of the input dataset (in pink) in the context of that predictor ( Figure 5A). Alternatively, if the user uploads an optional file with binary phenotype information, then users can also subset the data by case vs. control status. In Figure 5B, the user can view where a selected individual’s DNAm-based score for body mass index lies along the sample subset by controls (in pink) and cases (in blue) for diabetes. The user can subset to different age ranges and sex in order to see how the selected individual would compare to the truncated sample selection. Users can also download the percentile ranks for every individual in the input dataset when compared against all other individuals in the dataset. Percentile ranks are available for each trait with the exception of age which is reported in years.
Panel 2. In the second panel, users may select up to 10 traits in order to simultaneously view the percentile ranks for a selected individual in the input dataset when compared against other individuals in the sample ( Figure 6A). Furthermore, the user can view how percentile ranks for a given trait vary according to cases and controls for a selected binary phenotype. In Figure 6B, the median percentile for diabetes cases along with the interquartile range (first to third quartile) are plotted for physical traits, such as body mass index and body fat percentage, as well as three protein proxies which have been shown to predict incidence of diabetes (Gadd et al. 2020 (unpublished report)) Again, the user can use a sidebar functionality to subset by age range and sex.
Panel 3. In the third panel, users can select up to 10 DNAm-based predictors and view how they correlate with one another in order to visualise their interrelationships and the underlying data structure. This is represented for both the input and GS datasets ( Figure 7A). Furthermore, the correlations are updated according to the selected age range and sex. The user can also subset the input dataset to cases, controls or choose to visualise correlation data for cases and controls in the input dataset alongside each other ( Figure 7B).
Panel 4. In the fourth and final panel, users can view how well the DNAm-based predictors for age, lifestyle traits and HDL cholesterol correlate with actual values of their respective traits in GS ( Figure 8). This is demonstrated for age and the six lifestyle or biochemical traits as measured values were available for these traits in GS. Measured protein levels were not available for all 27 blood proteins studied. However, proxies for blood protein levels showed an r coefficient of greater than 0.1 with measured blood protein levels in the initial test set comprising members of the Lothian Birth Cohort 1936 cohort and, where testing was possible, correlated well with measured protein levels in at least one external test cohort ( Figure 1 in Gadd et al. 2020 (unpublished report)). In this final panel, users can also subset by age range and sex to view how the performance of the predictors varies according to the truncated reference dataset.
Discussion
We have created and implemented the first publicly available online translational platform for methylation-based health profiling. The platform includes a wide variety of traits which are estimated from large-scale DNAm data. These include chronological age, lifestyle traits, biochemical data and blood protein levels thereby providing an automatic and comprehensive estimate of individual health profiles from a single blood draw. Users can interactively view how well DNAm-based estimators for various traits perform at an individual level and how DNAm-based estimators stratify according to case and control status for binary phenotypes of interest. The ‘MethylDetectR’ platform communicates key messages surrounding the development and present limitations of DNAm-based health profiling to the wider research community and public. Furthermore, the platform is designed to ensure the highest level of data security and safety and is publicly available with open source code and example input and output files.
DNAm-based predictors can integrate biological and environmental information to provide important indices of an individual’s health status and well-being. These predictors must display high degrees of sensitivity and specificity in order to accurately distinguish individuals on trajectories toward disease and adverse clinical endpoints from those who will remain healthy in a given clinical context. Recently, a blood-based DNAm test was developed that could detect five separate types of cancer up to four years before conventional diagnosis. The assay measured circulating tumour DNA methylation and predicted disease in 88% of post-diagnosis patients, with a specificity of 96% 17. By focusing on blood DNAm data, these predictors may allow for minimally-invasive estimates relating to the risk of various disease states. The dynamic nature of DNAm profiles means that health and disease risk profiles may change in response to altered lifestyle behaviours and environmental inputs. DNAm-based predictors of human traits, such as chronological age and body mass index, may also aid in forensic contexts.
Distributions of DNAm values and subsequent DNAm-based scores may vary across different methylation datasets. This may result from biological differences, for example cases for a given disease may have altered DNAm values for a number of probes relative to controls, or result from technical or normalisation differences. As a result, DNAm-based scores may vary greatly across datasets and projecting an individual onto a reference sample to view where they would lie along a reference sample is therefore challenging. Future work will focus on developing methods which can appropriately account for variability across datasets and allow for a projection of individuals onto disparate DNAm samples or datasets.
Increased sample sizes through recruitment, consortia or meta-analyses may allow for more sensitive or specific DNAm-based predictors. Advancements in statistical and machine learning approaches used to generate such predictors will also allow for greater accuracy in predicting human traits and health 20. Furthermore, if the outcomes on which the predictors are trained are inaccurate or possess lots of noise, then the predictors themselves will perform poorly in identifying individuals at risk of disease. Therefore, advancements in understanding disease biology and ways to diagnose or stratify different diseases will help to create well-defined outcomes on which predictors can be trained. This is expected to improve their ability in predicting important health and clinical outcomes. However, stringent ethical frameworks are also necessitated prior to widespread application of molecular-based health profiling in health and forensic contexts 41.
DNAm-based predictors represent one avenue within molecular-based health profiling. Genetics-based predictors of human traits may correlate well with true values for traits, such as human height 42. However, genetic predictors of disease may often fail to accuracy classify individuals by disease status 43. Additionally, other ‘omics’ data have been explored in order to predict human traits or disease. For example, a proteomic signature of age correlates 0.94 with chronological age 44. Plasma protein-based predictors of disease states, including dementia and cancer, have been explored 45– 47. Lipid-based predictors of human traits have also been developed using plasma samples 48, 49. Complex and common disease states are multifactorial conditions. Therefore, it is likely that composite predictors using various lines of ‘omics’ data may allow for greater accuracy in predicting disease risk and outcomes when compared to using one line of evidence alone. Furthermore, the incorporation of ‘omics’ data with clinical or demographic data could provide even more refined predictors of human health and disease 43.
Conclusions
Our platform provides an important translational tool which communicates state-of-the-art developments in relation to DNAm-based predictors of human traits and health. The ‘MethylDetectR’ platform also represents a research tool for the convenient and secure generation of DNAm-estimated traits for use in clinical and population studies. Importantly, our platform highlights the applicability and limitations surrounding such predictors prior to their potential deployment in clinical assessment and management paradigms.
Reproducibility: All relevant code is available at: https://doi.org/10.5281/zenodo.4291880 22. The following Research Resource Identifiers (RRIDs) have been generated for ‘MethylDetectR – Calculate Your Scores’, RRID: SCR_018972, and ‘MethylDetectR’, RRID: SCR_018973. The limitations surrounding the resultant datasets are that the predictors may work well for risk stratification relative to others in the dataset, but may fail to accurately predict trait information at an individual level.
Data availability
Underlying data
The underlying methylation and phenotypic data used to generate the original predictors cannot be made available due to the data containing information that could compromise participant consent and confidentiality. According to the terms of consent for GS participants, access to data must be reviewed by the GS Access Committee. Applications should be made to access@generationscotland.org. Lothian Birth Cohort 1936 data can be requested from the Lothian Birth Cohort 1936 research team, following completion of a data request application. More information can be found online ( http://www.lothianbirthcohort.ed.ac.uk/content/collaboration).
Extended data
Zenodo: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling, https://doi.org/10.5281/zenodo.4291880 22
This project contains the following extended data:
DNAm_File_Example.rds. (Example DNAm input file for upload to ‘MethylDetectR – Calculate Your Scores’.)
SexAgeInfo_example.csv. (Example optional ‘SexAgeInfo’ input file for upload to ‘MethylDetectR – Calculate Your Scores’.)
MethylDetectR_Test_for_Upload.csv. (Example input file for upload to main ‘MethylDetectR’ application.)
MethylDetectR_Case_Control_Example.csv. (Example binary phenotype .csv file with controls coded as ‘0’ and cases coded as ‘1’.)
Script_For_User_To_Generate_Scores.R. (R script for user to locally generate DNAm-based estimates of human traits for upload to ‘MethylDetectR’ application.)
Predictors_Shiny_by_Groups.csv. (Associated file for use in ‘Script_For_User_To_Generate_Scores.R.)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Software availability
Zenodo: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling, https://doi.org/10.5281/zenodo.4291880 22.
This project contains the following scripts:
MethylDetectR – Calculate Your Scores.R (R script for running of ‘MethylDetectR – Calculate Your Scores’ application.)
MethylDetectR.R. (R script for running of ‘MethylDetectR’ application.)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Reporting guidelines
Open Science Framework: TRIPOD checklist for ‘MethylDetectR - a translational platform for methylation-based health profiling’, https://doi.org/10.17605/OSF.IO/3MGJT 50
The TRIPOD checklist is available at: https://www.tripod-statement.org/wp-content/uploads/2020/01/Tripod-Checlist-Prediction-Model-Development.pdf 51.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Acknowledgements
We thank the staff and participants involved in the Lothian Birth Cohort 1936 and Generation Scotland studies for their ongoing commitment and contribution to these studies. We thank Dr Qian Zhang and Prof Peter Visscher for their permission to use the age predictor. We thank Dr Daniel McCartney and Ms Danni Gadd for their work in developing the lifestyle, cholesterol and protein proxies. We thank Dr Adam Jackson at the Institute of Genetics and Molecular Medicine, University of Edinburgh for developing the ‘MethylDetectR’ website, and for his expertise and advice in preparing the website content. We thank Mr Stephen Cass and Mr Ewan McDowall at the Institute of Genetics and Molecular Medicine, University of Edinburgh for their advice on security and safety aspects surrounding the platform and for managing and hosting the secure server from which ‘MethylDetectR’ operates. We also thank Dr Rena Gertz, University of Edinburgh Data Protection Officer, and Victoria Rowntree, Assistant Data Protection Officer, for their review and approval of our software as GDPR compliant, and for their advice on Participant Information and Informed Consent documentation.
Funding Statement
This research was supported by funding from the Wellcome 4-year PhD in Translational Neuroscience–training the next generation of basic neuroscientists to embrace clinical research [grant: 108890/Z/15/Z awarded to R.F.H]. R.E.M. is supported by an Alzheimer’s Research UK major project grant [grant: ARUK-PG2017B−10]. LBC1936 methylation typing was supported by Centre for Cognitive Ageing and Cognitive Epidemiology (Pilot Fund award), Age UK, The Wellcome Trust Institutional Strategic Support Fund, The University of Edinburgh, and The University of Queensland. Lothian Birth Cohort 1921 and 1936 proteomic analyses were supported by a National Institutes of Health (NIH) research grant (R01AG054628). Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6) and the Scottish Funding Council (HR03006). DNA methylation profiling of the GS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland and was funded by the Wellcome Trust (Wellcome Trust Strategic Award “STratifying Resilience and Depression Longitudinally” ([STRADL; Reference 104036/Z/14/Z]). Proteomic analyses in STRADL were supported by Dementias Platform UK (DPUK). DPUK funded this work through core grant support from the Medical Research Council [MR/L023784/2].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; peer review: 2 approved with reservations]
References
- 1. Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet. 2003;33(Suppl):245–54. 10.1038/ng1089 [DOI] [PubMed] [Google Scholar]
- 2. Beck S, Rakyan VK: The methylome: approaches for global DNA methylation profiling. Trends Genet. 2008;24(5):231–7. 10.1016/j.tig.2008.01.006 [DOI] [PubMed] [Google Scholar]
- 3. Bestor TH, Edwards JR, Boulard M: Notes on the role of dynamic DNA methylation in mammalian development. Proc Natl Acad Sci U S A. 2015;112(22):6796–9. 10.1073/pnas.1415301111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Stevenson AJ, McCartney DL, Hillary RF, et al. : Characterisation of an inflammation-related epigenetic score and its association with cognitive ability. Clin Epigenetics. 2020;12(1):113. 10.1186/s13148-020-00903-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Hillary RF, Stevenson AJ, McCartney DL, et al. : Epigenetic measures of ageing predict the prevalence and incidence of leading causes of death and disease burden. Clin Epigenetics. 2020;12(1):115. 10.1186/s13148-020-00905-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Langdon RJ, Beynon RA, Ingarfield K, et al. : Epigenetic prediction of complex traits and mortality in a cohort of individuals with oropharyngeal cancer. Clin Epigenetics. 2020;12(1):58. 10.1186/s13148-020-00850-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Lu AT, Seeboth A, Tsai PC, et al. : DNA methylation-based estimator of telomere length. Aging (Albany NY). 2019;11(16):5895–923. 10.18632/aging.102173 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Lu AT, Quach A, Wilson JG, et al. : DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging (Albany NY). 2019;11(2):303–327. 10.18632/aging.101684 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Hannum G, Guinney J, Zhao L, et al. : Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell. 2013;49(2):359–67. 10.1016/j.molcel.2012.10.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Levine ME, Lu AT, Quach A, et al. : An epigenetic biomarker of aging for lifespan and healthspan. Aging (Albany NY). 2018;10(4):573–91. 10.18632/aging.101414 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Horvath S: DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):R115. 10.1186/gb-2013-14-10-r115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Rosen AD, Robertson KD, Hlady RA, et al. : DNA methylation age is accelerated in alcohol dependence. Transl Psychiatry. 2018;8(1):182. 10.1038/s41398-018-0233-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Horvath S, Garagnani P, Bacalini MG, et al. : Accelerated epigenetic aging in Down syndrome. Aging Cell. 2015;14(3):491–5. 10.1111/acel.12325 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. McCrory C, Fiorito G, Hernandez B, et al. : Association of 4 epigenetic clocks with measures of functional health, cognition, and all-cause mortality in The Irish Longitudinal Study on Ageing (TILDA). BioRxiv. 2020. 10.1101/2020.04.27.063164 [DOI] [Google Scholar]
- 15. Zhao W, Ammous F, Ratliff S, et al. : Education and Lifestyle Factors Are Associated with DNA Methylation Clocks in Older African Americans. Int J Environ Res Public Health. 2019;16(17):3141. 10.3390/ijerph16173141 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Barbu MC, Shen X, Walker RM, et al. : Epigenetic prediction of major depressive disorder. Mol Psychiatry. 2020. 10.1038/s41380-020-0808-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Chen X, Gole J, Gore A, et al. : Non-invasive early detection of cancer four years before conventional diagnosis using a blood test. Nat Commun. 2020;11(1):3475. 10.1038/s41467-020-17316-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Zhang Y, Elgizouli M, Schöttker B, et al. : Smoking-associated DNA methylation markers predict lung cancer incidence. Clin Epigenetics. 2016;8:127. 10.1186/s13148-016-0292-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Zhang Q, Vallerga CL, Walker RM, et al. : Improved precision of epigenetic clock estimates across tissues and its implication for biological ageing. Genome Med. 2019;11(1):54. 10.1186/s13073-019-0667-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Trejo Banos D, McCartney DL, Patxot M, et al. : Bayesian reassessment of the epigenetic architecture of complex traits. Nat Commun. 2020;11(1):2865. 10.1038/s41467-020-16520-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. McCartney DL, Hillary RF, Stevenson AJ, et al. : Epigenetic prediction of complex traits and death. Genome Biol. 2018;19(1):136. 10.1186/s13059-018-1514-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Hillary: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling (Version 5.0) [Data set].Wellcome Open. Zenodo.2020. 10.5281/zenodo.4291880 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc B. 2005;67(2):301–20. 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]
- 24. Robinson GK: That BLUP is a good thing: the estimation of random effects. Statistical Science. 1991;6(1):15–32. Reference Source [Google Scholar]
- 25. Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22. [PMC free article] [PubMed] [Google Scholar]
- 26. Chang W, Cheng J, Allaire J, et al. : Shiny: Web Application Framework for R; R package version 1.4. 0.2.2020. [Google Scholar]
- 27. Team RC: R version 3.5. 0. R: A language and environment for statistical computing R Foundation for Statistical Computing. Vienna, Austria.2018. Reference Source [Google Scholar]
- 28. Perrier V, Meyer F, Granjon D: shinyWidgets: Custom Inputs Widgets for Shiny. R package version.2019. [Google Scholar]
- 29. Chang W, Park T, Dziedzic L, et al. : shinythemes: Themes for Shiny. R package version. 2015;1(1):144. [Google Scholar]
- 30. Dowle M, Srinivasan A: data. table: Extension of ‘data. frame. R package version 1.12. 8.2019. [Google Scholar]
- 31. Attali D, Edwards T: shinyalert: Easily Create Pretty Popup Messages (Modals) in'Shiny'. R package version 10.2018. Reference Source [Google Scholar]
- 32. Wickham H: ggplot2: elegant graphics for data analysis. springer.2016. Reference Source [Google Scholar]
- 33. Hadley Wickham RF, Henry L, Müller K: dplyr: A Grammar of Data Manipulation. R package version 0.7. 4.2017. [Google Scholar]
- 34. Wickham H: Tools for working with categorical variables (factors)(R package Version 0.4. 0)[Computer software].2019. [Google Scholar]
- 35. Ram K, Wickham H: wesanderson: A Wes Anderson palette generator. R package version 0.3. 6.2018. Reference Source [Google Scholar]
- 36. Sali A: shinycssloaders: Add CSS Loading Animations to “shiny” Outputs. R Package Version 02 0.2017. [Google Scholar]
- 37. Ooms J: Magick: advanced graphics and image-processing in R. CRAN R package version.2018;1. Reference Source [Google Scholar]
- 38. Wei T, Simko V: R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84).2017. Reference Source [Google Scholar]
- 39. Kassambara A: ggcorrplot: Visualization of a Correlation Matrix using’ggplot2’. R package version 01.2016;1. Reference Source [Google Scholar]
- 40. Wilke CO: cowplot: Streamlined Plot Theme and Plot Annotations for “ggplot2. R package version 0.9. 4.2019. [Google Scholar]
- 41. Bell CG, Lowe R, Adams PD, et al. : DNA methylation aging clocks: challenges and recommendations. Genome Biol. 2019;20(1):249. 10.1186/s13059-019-1824-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Lello L, Avery SG, Tellier L, et al. : Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–97. 10.1534/genetics.118.301267 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Lewis CM, Vassos E: Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12(1):44. 10.1186/s13073-020-00742-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Tanaka T, Biancotto A, Moaddel R, et al. : Plasma proteomic signature of age in healthy humans. Aging Cell. 2018;17(5):e12799. 10.1111/acel.12799 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Tanaka T, Lavery R, Varma V, et al. : Plasma proteomic signatures predict dementia and cognitive impairment. Alzheimers Dement (N Y). 2020;6(1):e12018. 10.1002/trc2.12018 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Nedjadi T, Benabdelkamal H, Albarakati N, et al. : Circulating proteomic signature for detection of biomarkers in bladder cancer patients. Sci Rep. 2020;10(1):10999. 10.1038/s41598-020-67929-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Cuvelliez M, Vandewalle V, Brunin M, et al. : Circulating proteomic signature of early death in heart failure patients with reduced ejection fraction. Sci Rep. 2019;9(1):19202. 10.1038/s41598-019-55727-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Mundra PA, Barlow CK, Nestel PJ, et al. : Large-scale plasma lipidomic profiling identifies lipids that predict cardiovascular events in secondary prevention. JCI Insight. 2018;3(17):e121326. 10.1172/jci.insight.121326 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Gerl MJ, Klose C, Surma MA, et al. : Machine learning of human plasma lipidomes for obesity estimation in a large population cohort. PLoS Biol. 2019;17(10):e3000443. 10.1371/journal.pbio.3000443 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Hillary RF: TRIPOD Checklist for MethylDetectR - A Translational Tool For Methylation-Based Health Profiling.2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Collins GS, Reitsma JB, Altman DG, et al. : Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) The TRIPOD Statement. Circulation. 2015;131(2):211–9. [DOI] [PMC free article] [PubMed] [Google Scholar]