Version Changes
Revised. Amendments from Version 1
First, we have updated the three applications associated with the 'MethylDetectR' platform to include 'info' buttons which, when pressed, show information on the presented data, limitations of methylation-based predictors available in 'MethylDetectR' and links to all other elements of the platform to allow for quick and convenient navigation across the platform. Second, we have removed methylation-based predictors of 27 blood protein levels which were included in the previous version of 'MethylDetectR'. The pipeline for generating these predictors has been refined. We will include the refined predictors for a larger set of proteins once they become available and are published. The predictors for chronological age and six lifestyle and biochemical traits are still available. Indeed, users can use our platform for the quick and convenient generation of methylation-based scores for these traits and interactively view how scores compare across individuals in their dataset. Third, the discussion on the limitations of methylation-based predictors available in 'MethylDetectR' has been refined and expanded. The aim of this amendment is to emphasise that methylation-based scores cannot make consistently accurate predictions at an individual level, instead, they work well at a population level. This limits their clinical utility, however, they will improve through the employment of larger-scale studies and more refined prediction methods. Fourth, we have included information on 'Version Control' detailing how and when we will update the 'MethylDetectR' platform. We will update the platform every three months in an effort to include new methylation-based predictors of human traits as they are generated by our group and others. Updates will be managed by the authors Robert F. Hillary and Riccardo E. Marioni. Researchers are invited to contact the authors to discuss the inclusion of new methylation-based predictors in the 'MethylDetectR' platform.
Abstract
DNA methylation is an important biological process that involves the reversible addition of chemical tags called methyl groups to DNA and affects whether genes are active or inactive. Individual methylation profiles are determined by both genetic and environmental influences. Inter-individual variation in DNA methylation profiles can be exploited to estimate or predict a wide variety of human characteristics and disease risk profiles. Indeed, a number of methylation-based predictors of human traits have been developed and linked to important health outcomes. However, there is an unmet need to communicate the applicability and limitations of state-of-the-art methylation-based predictors to the wider community. To address this need, we have created a secure, web-based interactive platform called ‘MethylDetectR’ which automates the calculation of estimated values or scores for a variety of human traits using blood methylation data. These traits include age, lifestyle traits and high-density lipoprotein cholesterol. Methylation-based predictors often return scores on arbitrary scales. To provide meaning to these scores, users can interactively view how estimated trait scores for a given individual compare against other individuals in the sample. Users can optionally upload binary phenotypes and investigate how estimated traits vary according to case vs. control status for these phenotypes. Users can also view how different methylation-based predictors correlate with one another, and with phenotypic values for corresponding traits in a large reference sample (n = 4,450; Generation Scotland). The ‘MethylDetectR’ platform allows for the fast and secure calculation of DNA methylation-derived estimates for several human traits. This platform also helps to show the correlations between methylation-based scores and corresponding traits at the level of a sample, report estimated health profiles at an individual level, demonstrate how scores relate to important binary outcomes of interest and highlight the current limitations of molecular health predictors.
Keywords: epigenetics, DNA methylation, epidemiology, translation, Shiny, prediction
Introduction
DNA methylation (DNAm) is an epigenetic mechanism in which methyl groups are added to the genome sequence. Inter-individual variability in DNAm profiles results from differences in both underlying genetics and environmental influences 1. Factors such as diet, stress and smoking behaviours may influence the process of methylation. Typically, methyl groups are added to cytosine residues in the context of a cytosine-guanine dinucleotide (CpG site) 2. The addition of these chemical tags can alter whether, and to what extent, a gene is active. In contrast to genetics, these molecular modifications are dynamic, tissue-specific and reversible 3. Further, methylation at many CpG sites is tissue-specific though some show strong concordance across multiple tissues 4. In addition, CpG modifications induced through environmental factors, such as smoking, may be reversible or show persistent alterations 5.
Biological data may be harnessed to estimate or predict a variety of human characteristics and disease risk profiles. There is a growing body of evidence demonstrating the effective creation and application of DNAm-based predictors of human traits and health 6– 19. Additionally, methylation-based predictors of traits such as smoking status may provide more accurate measurements than self-reported information, thereby allowing for improved disease prediction and risk stratification 20. Blood DNAm data is often used as it is minimally-invasive to collect and it provides a good index of the overall health status of the body 21.
Increased training sample sizes and refinements in statistical and machine learning methodologies have improved the accuracy of DNAm-based predictors 22, 23. Furthermore, there has been an increase in the commercialisation and scalability of DNAm assays for direct-to-consumer use or for use in clinical, research or industrial settings 24. A major goal of using these predictors is to aid in prediction strategies and provide better clinical outcomes for individuals. Therefore, translational platforms for methylation-based health profiling are warranted in order to communicate the applications and limitations of DNAm-based predictors of human traits.
To address this need, we have created a web-based platform called ‘MethylDetectR’ that allows for an interactive demonstration of state-of-the-art DNAm-based predictors. A demo version of the app which does not require the upload of data is available at: https://shiny.igmm.ed.ac.uk/MethylDetectR_Demo/. The DNAm-based predictors in this platform include a highly accurate predictor of chronological age trained in > 13,000 individuals across 14 cohorts with a root mean squared error of 2.04 years in the original publication 22. We also include six DNAm-based predictors of lifestyle and biochemical traits: alcohol consumption per week, body fat percentage, body mass index, high-density lipoprotein (HDL) cholesterol, smoking status and waist-to-hip ratio 25. These predictors were generated in 5,087 individuals who are members of the Generation Scotland: Scottish Family Health Study (GS) which represents one of the largest DNAm resources in the world.
Briefly, the ‘MethylDetectR’ platform consists of two applications. The first application named ‘MethylDetectR – Calculate Your Scores’ allows users to securely upload Illumina 450k or EPIC DNAm array data and obtain blood-based methylation predicted scores (or values) for the aforementioned traits. No data are stored by the application. Furthermore, predicted scores are often returned on arbitrary scales. The use of this application is optional as users may instead use R scripts which we have made publicly available if they so wish or if their DNAm files are too large for upload to the online application (>3 GB) ( https://doi.org/10.5281/zenodo.4646300). However, users can access a file called ‘Truncate_to_these_CpGs.csv’ to subset the list of CpG sites in their DNAm files to those required by the ‘MethylDetectR – Calculate Your Scores’ application. This should reduce file size and upload time. The second and main application named ‘MethylDetectR’ allows users to compare DNAm-derived scores for any individual in their input dataset against other individuals in the input dataset. Percentile ranks for individuals in the input dataset may be downloaded. Users can also upload an optional file containing binary phenotypes whereby individuals are coded as ‘0’ for control status and ‘1’ for case status. This information allows users to view how distributions of the DNAm-derived traits vary by cases and controls. Users can also view how the various DNAm-based predictors correlate with one another in the input dataset and in a separate reference sample. This reference sample comprises 4,450 individuals who are members of the GS study. These individuals are unrelated to each other and distinct from those in the original training samples in which the predictors for age, HDL cholesterol and lifestyle traits were developed. They are also unrelated to those included in the original training sample. Furthermore, information is provided on how well the predicted scores correlate with phenotypic values for corresponding traits that are available in GS. Lastly, the user can subset the input sample by sex, age range or case vs. control status determined by the case-control variables uploaded by the user. Further, the user can subset the GS reference sample by age and sex.
This platform communicates important information relating to the generation and applicability of DNAm-based predictors of human traits and health. The ‘MethylDetectR’ platform also represents a research tool for fast and automatic generation of DNAm-derived estimates for human traits. This platform can show that DNAm-based scores for traits may correlate well with measured values for a given trait at the level of the cohort. For example, a predictor for epigenetic age correlates strongly with true age with a root mean squared error of 2.04 years 22. However, this platform also helps to show that predictors may report inaccurate values at an individual level. For instance, although the age predictor correlates well with true age at the level of the cohort, an individual’s predicted age may differ from their true age by a number of years or decades. The optional incorporation of binary phenotype data allows users to view how well established or putative risk factors, as estimated by DNAm data, are stratified according to cases and controls for a given trait of interest. Together, the functionalities of ‘MethylDetectR’ begin to address the translational gap in the development and implementation of molecular-based health predictors by highlighting their performance and limitations in advance of their potential utility in diagnostic and stratification paradigms.
Methods
Implementation
Data protection and privacy. No data are stored in ‘MethylDetectR’ and are deleted upon closing the applications. Applications are also timed out after three minutes of inactivity and are hosted on patched and secure servers within the Institute of Genetics and Cancer, University of Edinburgh. This research and translational tool complies with GDPR guidelines and has been designed to ensure the highest level of data security and privacy. The ‘MethylDetectR’ applications and information on their usage are also available at the following website: https://www.ed.ac.uk/centre-genomic-medicine/research-groups/marioni-group/methyldetectr. Information relating to participant consent is also available at this website. Given that no data are stored, this information pertains to general risk surrounding the upload of biological data to online software and the measures taken to mitigate the risk of motivated intruders gaining access to such data.
The ‘MethylDetectR’ platform. The ‘MethylDetectR’ platform consists of two applications. The first application is called ‘MethylDetectR - Calculate Your Scores’. Users may upload DNAm data as an R object (.rds file) and obtain estimated values or scores for a variety of traits across individuals in their input dataset ( https://shiny.igmm.ed.ac.uk/Calculate_Your_Scores/). The upload limit is 3 gigabytes; however, files greater than 500 megabytes may take a considerable amount of time to upload. Users can make these upload files smaller by subsetting to CpG sites used in ‘MethylDetectR – Calculate Your Scores’. These CpG sites are available in the ‘Truncate_to_these_CpGs.csv’ file in Zenodo (Zenodo link). An optional ‘SexAgeInfo’ file may also be uploaded in order to include sex and age information in the output file. This should be a .csv file and have three columns: one column for the IDs of individuals in the methylation file (‘ID’ column), one column should list the sex of these individuals written as ‘Male’ or ‘Female’ or ‘NA’ (‘Sex’ column) and one column should report the actual or chronological age of individuals (‘Age’ column). This functionality is important given that users can subset the input dataset and GS cohort by sex in the main ‘MethylDetectR’ application. Furthermore, if true age is included, then the application will use this information to subset the sample according to the age slider function on the sidebar panel. If this information is not uploaded, then epigenetic or predicted age will be used to subset the data by age range. In the case where some individuals have true age available and others have missing data, true age will be used for those who have such data and epigenetic age will be used for those without age data in order to the subset the sample. It is strongly recommended that anonymised or pseudonymised IDs are used where possible. For the user’s own convenience in preparing the methylation object, it is recommended that individuals are included as columns and CpG sites as rows. However, this version or a transposed version are accepted and automatically processed by the software. The following features also aid with automation in generating DNAm-based scores for traits:
Beta values or M values are accepted with the latter converted to beta values by the software.
Missing methylation values are accepted and mean imputed across input individuals by the software.
CpG sites that are necessary for the estimation of a trait but are missing in the uploaded dataset are allowed. In this case, each individual in the input dataset receives the mean beta value for a given missing CpG site derived from GS DNAm data. In effect, this gives every individual in the uploaded dataset a constant that brings their score closer to that of the reference sample. In this way, all CpG sites are used for any sample uploaded.
Predicted values or scores for the aforementioned traits can be downloaded as a .csv file which may be uploaded to the main ‘MethylDetectR’ application. Alternatively, an R script is provided to generate these DNAm-based scores if the user does not wish to upload DNAm data or if the DNAm file is too large for upload ( https://doi.org/10.5281/zenodo.4646300 26).
The main ‘MethylDetectR’ application can be described in four modules or panels ( https://shiny.igmm.ed.ac.uk/MethylDetectR/). In the first panel, users can view DNAm-based scores corresponding to various traits for any individual in their input dataset. The user can interactively compare scores for any individual to the remainder of the input dataset. A .csv file may be downloaded which shows percentile ranks for each individual in the uploaded dataset. Percentile ranks are reported for each estimated trait with the exception of the age predictor which is reported in years. If the user uploads an optional binary phenotype file, they can also view the distributions of the DNAm-based scores according to case vs. control status for a given trait of interest. This file should contain IDs of individuals in the methylation file (‘ID’ column) as the first column and the remaining columns may contain any names or traits of interest with individuals coded as ‘0’ for controls and ‘1’ for cases. In the second panel, a plot shows the percentile ranks for a given individual for selected traits. Users can also choose to view the spread of percentile ranks for cases versus controls. Here, the median percentile for cases, along with the first and third quartiles (interquartile range), are plotted for each selected DNAm-based estimated trait. In the third panel, the user can view how different DNAm-based predictors correlate with one another in both the input and GS datasets. In the fourth panel, users can view how DNAm-based scores for age, lifestyle and biochemical traits correlate with phenotypic values for these traits in GS participants (n = 4,450). In each panel, users can subset the samples by age range and sex. In panels 1-3, options to subset the input dataset by case vs. control status are present.
Development of a DNAm-based predictor. Readers are referred to a review on the development of DNAm-based scores and the challenges surrounding their generation 27. To develop ‘omics’-based predictors, such as DNAm-based predictors, statistical or machine learning methodologies are commonly applied. In the case of DNAm, the process begins with the quantification of DNAm across individuals using a tissue or cell-type of interest. Many studies focus on blood as it integrates information from various tissues around the body and represents an inexpensive and minimally-invasive approach to gather molecular data. Many cohort studies also have DNA from historic blood samples stored and available to analyse. The collection of saliva and buccal samples is becoming increasingly popular as a relatively low cost and non-invasive method for cohort studies interested in epigenetic epidemiology. The number of CpG sites which are measured depends on the array used but typically includes up to around 800,000 unique sites. Following quantification and quality control, a researcher may wish to study the association between DNAm and a trait of interest, such as smoking status. The average methylation level at a given CpG site across individuals will be correlated with the trait of interest. Methylation levels may be reported between 0-100% for convenience, and a level of 50% means that 50% of cells or DNA molecules within an individual’s sample show methylation at that CpG site. One approach is to correlate each CpG site, in turn, with the trait of interest thereby considering each CpG site in isolation. This is approach is referred to as an epigenome-wide or methylome-wide association study (EWAS or MWAS). Alternatively, methods such as penalised regression can be used to model all CpG sites simultaneously producing parsimonious models that account for correlated features/sites (e.g., least absolute shrinkage and selection operator or LASSO regression) or models that apply small weights to all features/sites (e.g., ridge regression). Elastic net regression is a commonly used intermediate of these two approaches. Correlations among CpGs may arise from sites which lie near each other in a genomic region or from a shared environmental influence; for instance, inhalation of cigarette smoke may affect many CpG sites across different chromosomes. A subset of CpG sites may show a strong relationship with the trait of interest and therefore be informative for predicting the trait in other individuals. The strength of the correlation, or association, is represented by an effect size and provides a weighting for that CpG site’s importance in predicting the trait. In a separate or test group of individuals, predicted values or scores for the trait can be obtained by multiplying DNAm levels at each informative CpG site by their weight derived from statistical analyses. The sum of these products provides a predicted or estimated value for the trait. A statistical transformation can be applied to return DNAm-based scores on the original scale for the trait, such as pack years for smoking. Alternatively, comparison to other individuals may provide meaning to the DNAm-based scores. In any case, predicted values or scores in the test sample may be correlated with true values for a given trait to provide an index of predictive power.
DNAm-based predictors in ‘MethylDetectR’. In ‘MethylDetectR’, we include DNAm-based predictors of chronological age, and six lifestyle and biochemical traits.
Age predictor
The age predictor was developed by Zhang et al. using elastic net regression and best linear unbiased prediction (BLUP) 22, 28, 29. The two sets of predictors were both built on the same set of 13,566 training samples spread across 14 cohorts. The age predictor was generated using data from individuals with an age range of 2 to 104 years. The elastic net method selected 514 CpG sites as informative for predicting chronological age whereas the BLUP predictor used all CpG sites (319,607 probes). In ‘MethylDetectR’, we apply the elastic net predictor owing to the faster computation and superior performance of this age predictor when compared to the BLUP predictor. The elastic net predictor correlates 0.98 with chronological age (root mean squared error = 2.6 years in GS (n = 4,450) and 2.04 in original publication 22). The age predictor is returned in values of years.
Lifestyle and biochemical traits
Previously, we generated ten predictors of lifestyle and biochemical traits in 5,087 individuals within the GS study using LASSO penalised regression 25, 30. Ten-fold cross-validation was applied and the mixing parameter (alpha) was set to 1. The lambda value corresponding to the minimum mean cross-validation error was selected and applied to generate the optimal models 30. In a test sample consisting of 875 individuals in the Lothian Birth Cohort 1936 study, DNAm-based predictors for four of the traits explained greater than 10% of phenotypic variance in their respective trait. These four traits were alcohol consumption, body mass index, high-density lipoprotein cholesterol and smoking behaviour. There were no phenotypic data available for body fat percentage and waist-to-hip ratio in the Lothian Birth Cohort 1936 study. However, these traits were highly correlated with body mass index in GS (correlation coefficients of 0.6 and 0.4, respectively). Therefore, body fat percentage and waist-to-hip ratio were brought forward to the ‘MethylDetectR’ platform in addition to the four DNAm-based predictors of lifestyle and biochemical traits which demonstrated test R 2 statistics of greater than 10% in the Lothian Birth Cohort 1936 sample. Four other traits demonstrated test R 2 statistics of less than 10%: educational attainment (2.5%), low-density lipoprotein and remnant cholesterol (0.6%), total cholesterol (2.7%) and total-to-high-density lipoprotein cholesterol ratio (4.5%). These four traits were not brought forward to the ‘MethylDetectR’ platform 25.
In the training sample, alcohol intake was assessed in units per week and was only considered in those who reported that their intake was representative of a normal week. A natural log(units + 1) transformation was applied to reduce skewness. For body mass index, extreme values defined as less than 17 kg/m 2 or greater than 50 kg/m 2 were removed and a natural log transformation was applied. Smoking behaviour was assessed using pack years which is calculated by multiplying the number of packs smoked per day by the number of years the participant has smoked. Current and never smokers were included; ex-smokers were removed owing to complications in adjusting for time since cessation when calculating pack years. To reduce skewness, a natural log(pack years + 1) transformation was applied. In generating the predictors, phenotypes were pre-corrected to remove the influence of age, sex and ancestry using ten genetic principal components. Phenotypic data used to train the predictors were not corrected for cell-type heterogeneity. Further quality control details are available in the original publication 25. DNAm values at CpG sites were the independent variables (n = 392,843 CpG sites). CpG sites were filtered to include loci present on both the Illumina EPIC and 450k arrays.
Version Control. We will update ‘MethylDetectR’ every three months to include new DNAm-based predictors of human traits as they are generated by our own group and others. Updates will be managed by Robert F. Hillary or Riccardo E. Marioni. If researchers wish to have their predictors considered for inclusion in ‘MethylDetectR’, please use the corresponding author email address in this manuscript or the contact details available at the ‘MethylDetectR’ website ( https://www.ed.ac.uk/centre-genomic-medicine/research-groups/marioni-group/methyldetectr). The current and historical versions of ‘MethylDetectR’ are available in the Zenodo repository, updated versions will also be made available in this repository ( https://doi.org/10.5281/zenodo.4646300).
Operation
Software requirements. Both applications are hosted on a secure, patched server hosted at the University of Edinburgh. The applications are developed using Shiny (version 1.4) in R 31. The version of R used at the time of ‘MethylDetectR’ development was 3.5.0 32. For ‘MethylDetectR – Calculate Your Scores’, the following R packages were utilised: shinyWidgets (version 0.5.4) 33, shinythemes (version 1.1.2) 34, data.table (version 1.12) 35, shinyalert (version 1) 36. For the main ‘MethylDetectR’ application, shinyWidgets (version 0.5.4) 33, shinythemes (version 1.1.2) 34, data.table (version 1.12) 35, shinyalert (version 1) 36 were also used, in addition to ggplot2 (version 3.0) 37, dplyr (version 0.7) 38, forcats (version 0.4.0) 39, wesanderson (version 0.3.6) 40, shinycssloaders (version 0.2) 41, magick (version 2.5.0) 42, corrplot (version 0.84) 43, ggcorrplot (version 0.1) 44 and cowplot (version 0.9.4) 45. Scripts for implementing both applications are available at: https://doi.org/10.5281/zenodo.4646300 26. The script has been designed to ensure automatic installation of missing CRAN packages which are necessary for the operation of ‘MethylDetectR’.
Overview of workflow. The main components of the platform are outlined in the Implementation section and the associated workflow is graphically depicted in Figure 1.
Use cases
MethylDetectR – Calculate Your Scores
The user can upload a DNAm file as an R object (.rds file) to the ‘MethylDetectR – Calculate Your Scores’ application. Beta values or M values may be used. If M values are detected, these are converted to beta values. It is recommended that individuals are included as columns and CpG sites (derived from Illumina arrays) are included as rows and that file sizes of no greater than 500 MB are uploaded. The application details general information on the platform, as well as information on how to format files and links to all elements of the platform. To make DNAm files smaller prior to upload, users can access a file called ‘Truncate_to_these_CpG.csv’ which is available in the Zenodo repository. This file allows users to subset CpGs measured in their dataset to those used in ‘MethylDetectR – Calculate Your Scores’ making files considerably smaller. A ‘SexAgeInfo’ file may also be uploaded as a .csv file so that data pertaining to the ages and sex of the input individuals are included in the output file along with DNAm-based values for age (epigenetic age), lifestyle and biochemical traits. This file should include a column corresponding to the IDs of individuals in the methylation file (‘ID’ column), a column that lists the sex of these individuals written as ‘Male’ or ‘Female’ or ‘NA’ (‘Sex’ column) and a column that reports the chronological or true ages of individuals (‘Age’ column). Alternatively, the user can merge this information into their DNAm score file after running the ‘MethylDetectR – Calculate Your Scores’ step. Examples for the DNAm and ‘SexAgeInfo’ files are available at: https://doi.org/10.5281/zenodo.4646300 26.
The software automatically generates DNAm-based scores for every individual in the dataset (.csv file) and a report for the user is printed on the application detailing quality control steps carried out during the calculation process. For example, the report informs the user whether or not the data had to be transposed, or if M values were converted to beta values ( Figure 2).
Alternatively, the user can download an R script to locally generate DNAm-based scores for the traits. The DNAm object is annotated as ‘data’ and the ‘SexAgeInfo’ input is annotated as ‘sexageinfo’ ( https://doi.org/10.5281/zenodo.4646300 26). In either case, an output .csv file is generated containing DNAm-based scores or values for each trait and for every individual in the input dataset. This output file should be uploaded to the main ‘MethylDetectR’ application. An example output .csv file showing the correct column names and file structure is available at: https://doi.org/10.5281/zenodo.4646300 26.
MethylDetectR
Panel 1. The output file from either the ‘MethylDetectR – Calculate Your Scores’ application or a publicly available script should be uploaded to the main ‘MethylDetectR’ application. Incorrectly assigned column names will be reported to the user, as will files with no individuals or files with non-numeric values. A timeout is triggered following three minutes of inactivity. All panels contain information on the data shown in the panel, and Panel 1 details information on how to format files. Links to all other elements of the platform are shown in each panel via ‘info’ buttons located in the sidebar panels.
The first panel allows users to choose a predictor of interest and view how a selected individual in the input dataset ranks against the remainder of the input dataset (in pink) in the context of that predictor ( Figure 3A). Alternatively, if the user uploads an optional file with binary phenotype information, then users can also subset the data by case vs. control status. In Figure 3B, the user can view where a selected individual’s DNAm-based score for body mass index lies along the sample subset by controls (in pink) and cases (in blue) for diabetes. The user can subset to different age ranges and sex in order to see how the selected individual would compare to the truncated sample selection. Users can also download the percentile ranks for every individual in the input dataset when compared against all other individuals in the dataset. Percentile ranks are available for each trait with the exception of age which is reported in years.
Panel 2. In the second panel, users may select multiple traits in order to simultaneously view the percentile ranks for a selected individual in the input dataset when compared against other individuals in the sample ( Figure 4A). Furthermore, the user can view how percentile ranks for a given trait vary according to cases and controls for a selected binary phenotype. In Figure 4B, the median percentile for diabetes cases along with the interquartile range (first to third quartile) are plotted for multiple traits, such as body mass index and body fat percentage. Again, the user can use a sidebar functionality to subset by age range and sex.
Panel 3. In the third panel, users can select multiple DNAm-based predictors and view how they correlate with one another in order to visualise their interrelationships and the underlying data structure. This is represented for both the input and GS datasets ( Figure 5A). Furthermore, the correlations are updated according to the selected age range and sex. The user can also subset the input dataset to cases, controls or choose to visualise correlation data for cases and controls in the input dataset alongside each other ( Figure 5B).
Panel 4. In the fourth and final panel, users can view how well the DNAm-based predictors for age, lifestyle traits and HDL cholesterol correlate with actual values of their respective traits in GS ( Figure 6). In this final panel, users can also subset by age range and sex to view how the performance of the predictors varies according to the truncated reference dataset.
Discussion
We have created and implemented the first publicly available online translational platform for methylation-based health profiling. The platform includes a wide variety of traits which are estimated from large-scale DNAm data. These include chronological age, lifestyle traits and biochemical data thereby providing an automatic and comprehensive estimate of individual health profiles from a single blood draw. Users can interactively view how well DNAm-based estimators for various traits perform at an individual level and how DNAm-based estimators stratify according to case and control status for binary phenotypes of interest. The ‘MethylDetectR’ platform communicates key messages surrounding the development and present limitations of DNAm-based health profiling to the wider research community and public. This is achieved by including ‘info’ buttons in the sidebar panels of each application that lead to important information for interpreting the presented results, key limitations and general information on DNAm-based scores. Furthermore, the platform is designed to ensure the highest level of data security and safety and is publicly available with open source code and example input and output files. We will continue to update ‘MethylDetectR’ every three months with the aim of including new DNAm-based predictors of human traits when they come available.
DNAm-based predictors can integrate biological and environmental information to provide important indices of an individual’s health status and well-being. These predictors must display high degrees of sensitivity and specificity in order to accurately distinguish individuals on trajectories toward disease and adverse clinical endpoints from those who will remain healthy in a given clinical context. Currently, the DNAm-based predictors in ‘MethylDetectR’ cannot make consistently accurate predictions at an individual level and therefore cannot yet be reliably applied in a diagnostic or forensic context. Highly-accurate DNAm-based scores can aid in research environments as they may provide more accurate information than self-report data 20. For instance, the DNAm-based predictor of smoking can provide a more accurate profile of smoking history than responses from participants in questionnaires. Further, the use of DNAm-based scores for a variety of human traits can proxy many phenotypes such as biochemical and lifestyle traits using a single blood draw. Together, these data can help researchers determine relationships between putative risk factors and important health outcomes, and aid in patient stratification paradigms. Further, ‘MethylDetectR’ serves as an important translational tool showing an interactive, demo version of the platform and substantial information within each application regarding the interpretation and limitations of DNAm-based predictors. As these predictors become refined, they may be of clinical value. For instance, a blood-based DNAm test was recently developed that could detect five separate types of cancer up to four years before conventional diagnosis. The assay measured circulating tumour DNA methylation and predicted disease in 88% of post-diagnosis patients, with a specificity of 96% 19.
Distributions of DNAm values and subsequent DNAm-based scores may vary across different methylation datasets. In relation to the biochemical and lifestyle traits, the predictors were generated using an adult sample of individuals with European ancestry. Therefore, it is possible that the predictors may not be generalisable to datasets comprising different age ranges, such as cohorts of children, and individuals with different ancestries. Differences between datasets may also arise from biological differences, for example cases for a given disease may have altered DNAm values for a number of probes relative to controls, or result from technical or normalisation differences. As a result, DNAm-based scores may vary greatly across datasets and projecting an individual onto a reference sample to view where their DNAm-based score would lie along the reference sample is therefore challenging. Future work will focus on developing methods which can appropriately account for variability across datasets and allow for a projection of individuals onto disparate DNAm samples or datasets.
Increased sample sizes through recruitment, consortia or meta-analyses may allow for more sensitive or specific DNAm-based predictors. Advancements in statistical and machine learning approaches used to generate such predictors will also allow for greater accuracy in predicting human traits and health 23. Furthermore, if the outcomes on which the predictors are trained are inaccurate or possess lots of noise, then the predictors themselves will perform poorly in identifying individuals at risk of disease. Therefore, advancements in understanding disease biology and ways to diagnose or stratify different diseases will help to create well-defined outcomes on which predictors can be trained. This is expected to improve their ability in predicting important health and clinical outcomes. However, stringent ethical frameworks are also necessitated prior to widespread application of molecular-based health profiling in health and forensic contexts 46.
DNAm-based predictors represent one avenue within molecular-based health profiling. Genetics-based predictors of human traits may correlate well with true values for traits, such as human height 47. However, genetic predictors of disease may often fail to accuracy classify individuals by disease status 48. Additionally, other ‘omics’ data have been explored in order to predict human traits or disease. For example, a proteomic signature of age correlates 0.94 with chronological age 49. Plasma protein-based predictors of disease states, including dementia and cancer, have been explored 50– 52. Lipid-based predictors of human traits have also been developed using plasma samples 53, 54. Complex and common disease states are multifactorial conditions. Therefore, it is likely that composite predictors using various lines of ‘omics’ data may allow for greater accuracy in predicting disease risk and outcomes when compared to using one line of evidence alone. Furthermore, the incorporation of ‘omics’ data with clinical or demographic data could provide even more refined predictors of human health and disease 48.
Conclusions
Our platform provides an important translational tool which communicates state-of-the-art developments in relation to DNAm-based predictors of human traits and health. The ‘MethylDetectR’ platform also represents a research tool for the convenient and secure generation of DNAm-estimated traits for use in clinical and population studies. Importantly, our platform highlights the applicability and limitations surrounding such predictors prior to their potential deployment in clinical assessment and management paradigms.
Reproducibility: All relevant code is available at: https://doi.org/10.5281/zenodo.4646300 26. The following Research Resource Identifiers (RRIDs) have been generated for ‘MethylDetectR – Calculate Your Scores’, RRID: SCR_018972, and ‘MethylDetectR’, RRID: SCR_018973. The limitations surrounding the resultant datasets are that the predictors may work well for risk stratification relative to others in the dataset, but may fail to accurately predict trait information at an individual level.
Data availability
Underlying data
The underlying methylation and phenotypic data used to generate the original predictors cannot be made available due to the data containing information that could compromise participant consent and confidentiality. According to the terms of consent for GS participants, access to data must be reviewed by the GS Access Committee. Applications should be made to access@generationscotland.org. Lothian Birth Cohort 1936 data can be requested from the Lothian Birth Cohort 1936 research team, following completion of a data request application. More information can be found online ( http://www.lothianbirthcohort.ed.ac.uk/content/collaboration).
Extended data
Zenodo: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling, https://doi.org/10.5281/zenodo.4646300 26
This project contains the following extended data:
DNAm_File_Example.rds. (Example DNAm input file for upload to ‘MethylDetectR – Calculate Your Scores’.)
SexAgeInfo_example.csv. (Example optional ‘SexAgeInfo’ input file for upload to ‘MethylDetectR – Calculate Your Scores’.)
Truncate_to_these_CpGs.csv. (File for truncating DNAm input file to CpG sites used in ‘MethylDetectR – Calculate Your Scores’.)
MethylDetectR_Test_for_Upload.csv. (Example input file for upload to main ‘MethylDetectR’ application.)
MethylDetectR_Case_Control_Example.csv. (Example binary phenotype .csv file with controls coded as ‘0’ and cases coded as ‘1’.)
Script_For_User_To_Generate_Scores.R. (R script for user to locally generate DNAm-based estimates of human traits for upload to ‘MethylDetectR’ application.)
Predictors_Shiny_by_Groups.csv. (Associated file for use in ‘Script_For_User_To_Generate_Scores.R.)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Software availability
Zenodo: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling, https://doi.org/10.5281/zenodo.4646300 26.
This project contains the following scripts:
MethylDetectR – Calculate Your Scores.R (R script for running of ‘MethylDetectR – Calculate Your Scores’ application.)
MethylDetectR.R. (R script for running of ‘MethylDetectR’ application.)
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Reporting guidelines
Open Science Framework: TRIPOD checklist for ‘MethylDetectR - a translational platform for methylation-based health profiling’, https://doi.org/10.17605/OSF.IO/3MGJT 55
The TRIPOD checklist is available at: https://www.tripod-statement.org/wp-content/uploads/2020/01/Tripod-Checlist-Prediction-Model-Development.pdf 56.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Acknowledgements
We thank the staff and participants involved in the Lothian Birth Cohort 1936 and Generation Scotland studies for their ongoing commitment and contribution to these studies. We thank Dr Qian Zhang and Prof Peter Visscher for their permission to use the age predictor. We thank Dr Adam Jackson at the Institute of Genetics and Cancer, University of Edinburgh for developing the ‘MethylDetectR’ website, and for his expertise and advice in preparing the website content. We thank Mr Stephen Cass and Mr Ewan McDowall at the Institute of Genetics and Cancer, University of Edinburgh for their advice on security and safety aspects surrounding the platform and for managing and hosting the secure server from which ‘MethylDetectR’ operates. We also thank Dr Rena Gertz, University of Edinburgh Data Protection Officer, and Victoria Rowntree, Assistant Data Protection Officer, for their review and approval of our software as GDPR compliant, and for their advice on Participant Information and Informed Consent documentation.
Funding Statement
This research was supported by funding from the Wellcome 4-year PhD in Translational Neuroscience–training the next generation of basic neuroscientists to embrace clinical research [grant: 108890/Z/15/Z awarded to R.F.H]. R.E.M. is supported by an Alzheimer’s Research UK major project grant [grant: ARUK-PG2017B−10]. LBC1936 methylation typing was supported by Centre for Cognitive Ageing and Cognitive Epidemiology (Pilot Fund award), Age UK, The Wellcome Trust Institutional Strategic Support Fund, The University of Edinburgh, and The University of Queensland. Lothian Birth Cohort 1921 and 1936 proteomic analyses were supported by a National Institutes of Health (NIH) research grant (R01AG054628). Generation Scotland received core support from the Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6) and the Scottish Funding Council (HR03006). DNA methylation profiling of the GS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland and was funded by the Wellcome Trust (Wellcome Trust Strategic Award “STratifying Resilience and Depression Longitudinally” ([STRADL; Reference 104036/Z/14/Z]). Proteomic analyses in STRADL were supported by Dementias Platform UK (DPUK). DPUK funded this work through core grant support from the Medical Research Council [MR/L023784/2].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 2 approved]
References
- 1. Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet. 2003;33(Suppl):245–54. 10.1038/ng1089 [DOI] [PubMed] [Google Scholar]
- 2. Beck S, Rakyan VK: The methylome: approaches for global DNA methylation profiling. Trends Genet. 2008;24(5):231–7. 10.1016/j.tig.2008.01.006 [DOI] [PubMed] [Google Scholar]
- 3. Bestor TH, Edwards JR, Boulard M: Notes on the role of dynamic DNA methylation in mammalian development. Proc Natl Acad Sci U S A. 2015;112(22):6796–9. 10.1073/pnas.1415301111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Hannon E, Lunnon K, Schalkwyk L, et al. : Interindividual methylomic variation across blood, cortex, and cerebellum: implications for epigenetic studies of neurological and neuropsychiatric phenotypes. Epigenetics. 2015;10(11):1024–32. 10.1080/15592294.2015.1100786 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Joehanes R, Just AC, Marioni RE, et al. : Epigenetic Signatures of Cigarette Smoking. Circ Cardiovasc Genet. 2016;9(5):436–447. 10.1161/CIRCGENETICS.116.001506 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Stevenson AJ, McCartney DL, Hillary RF, et al. : Characterisation of an inflammation-related epigenetic score and its association with cognitive ability. Clin Epigenetics. 2020;12(1):113. 10.1186/s13148-020-00903-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Hillary RF, Stevenson AJ, McCartney DL, et al. : Epigenetic measures of ageing predict the prevalence and incidence of leading causes of death and disease burden. Clin Epigenetics. 2020;12(1):115. 10.1186/s13148-020-00905-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Langdon RJ, Beynon RA, Ingarfield K, et al. : Epigenetic prediction of complex traits and mortality in a cohort of individuals with oropharyngeal cancer. Clin Epigenetics. 2020;12(1):58. 10.1186/s13148-020-00850-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Lu AT, Seeboth A, Tsai PC, et al. : DNA methylation-based estimator of telomere length. Aging (Albany NY). 2019;11(16):5895–923. 10.18632/aging.102173 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Lu AT, Quach A, Wilson JG, et al. : DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging (Albany NY). 2019;11(2):303–327. 10.18632/aging.101684 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Hannum G, Guinney J, Zhao L, et al. : Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol Cell. 2013;49(2):359–67. 10.1016/j.molcel.2012.10.016 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Levine ME, Lu AT, Quach A, et al. : An epigenetic biomarker of aging for lifespan and healthspan. Aging (Albany NY). 2018;10(4):573–91. 10.18632/aging.101414 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Horvath S: DNA methylation age of human tissues and cell types. Genome Biol. 2013;14(10):R115. 10.1186/gb-2013-14-10-r115 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Rosen AD, Robertson KD, Hlady RA, et al. : DNA methylation age is accelerated in alcohol dependence. Transl Psychiatry. 2018;8(1):182. 10.1038/s41398-018-0233-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Horvath S, Garagnani P, Bacalini MG, et al. : Accelerated epigenetic aging in Down syndrome. Aging Cell. 2015;14(3):491–5. 10.1111/acel.12325 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. McCrory C, Fiorito G, Hernandez B, et al. : Association of 4 epigenetic clocks with measures of functional health, cognition, and all-cause mortality in The Irish Longitudinal Study on Ageing (TILDA). BioRxiv. 2020. 10.1101/2020.04.27.063164 [DOI] [Google Scholar]
- 17. Zhao W, Ammous F, Ratliff S, et al. : Education and Lifestyle Factors Are Associated with DNA Methylation Clocks in Older African Americans. Int J Environ Res Public Health. 2019;16(17):3141. 10.3390/ijerph16173141 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Barbu MC, Shen X, Walker RM, et al. : Epigenetic prediction of major depressive disorder. Mol Psychiatry. 2020. 10.1038/s41380-020-0808-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Chen X, Gole J, Gore A, et al. : Non-invasive early detection of cancer four years before conventional diagnosis using a blood test. Nat Commun. 2020;11(1):3475. 10.1038/s41467-020-17316-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Zhang Y, Elgizouli M, Schöttker B, et al. : Smoking-associated DNA methylation markers predict lung cancer incidence. Clin Epigenetics. 2016;8:127. 10.1186/s13148-016-0292-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Salameh Y, Bejaoui Y, El Hajj N, et al. : DNA Methylation Biomarkers in Aging and Age-Related Diseases. Front Genet. 2020;11:171. 10.3389/fgene.2020.00171 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Zhang Q, Vallerga CL, Walker RM, et al. : Improved precision of epigenetic clock estimates across tissues and its implication for biological ageing. Genome Med. 2019;11(1):54. 10.1186/s13073-019-0667-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Trejo Banos D, McCartney DL, Patxot M, et al. : Bayesian reassessment of the epigenetic architecture of complex traits. Nat Commun. 2020;11(1):2865. 10.1038/s41467-020-16520-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Dupras C, Beauchamp E, Joly Y, et al. : Selling direct-to-consumer epigenetic tests: are we ready? Nat Rev Genet. 2020;21(6):335–336. 10.1038/s41576-020-0215-2 [DOI] [PubMed] [Google Scholar]
- 25. McCartney DL, Hillary RF, Stevenson AJ, et al. : Epigenetic prediction of complex traits and death. Genome Biol. 2018;19(1):136. 10.1186/s13059-018-1514-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Hillary: MethylDetectR - A Translational Tool for Methylation-Based Health Profiling (Version 5.0) [Data set].Wellcome Open. Zenodo.2020. 10.5281/zenodo.4646300 [DOI] [PMC free article] [PubMed]
- 27. Hüls A, Czamara D: Methodological challenges in constructing DNA methylation risk scores. Epigenetics. 2020;15(1–2):1–11. 10.1080/15592294.2019.1644879 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc B. 2005;67(2):301–20. 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]
- 29. Robinson GK: That BLUP is a good thing: the estimation of random effects. Stat Sci. 1991;6(1):15–32. Reference Source [Google Scholar]
- 30. Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22. [PMC free article] [PubMed] [Google Scholar]
- 31. Chang W, Cheng J, Allaire J, et al. : Shiny: Web Application Framework for R; R package version 1.4. 0.2.2020. [Google Scholar]
- 32. Team RC: R version 3.5. 0. R: A language and environment for statistical computing R Foundation for Statistical Computing. Vienna, Austria.2018. Reference Source [Google Scholar]
- 33. Perrier V, Meyer F, Granjon D: shinyWidgets: Custom Inputs Widgets for Shiny. R package version.2019. [Google Scholar]
- 34. Chang W, Park T, Dziedzic L, et al. : shinythemes: Themes for Shiny. R package version. 2015;1(1):144. [Google Scholar]
- 35. Dowle M, Srinivasan A: data. table: Extension of ‘data. frame. R package version 1.12. 8.2019. [Google Scholar]
- 36. Attali D, Edwards T: shinyalert: Easily Create Pretty Popup Messages (Modals) in'Shiny'. R package version 10.2018. Reference Source [Google Scholar]
- 37. Wickham H: ggplot2: elegant graphics for data analysis. springer.2016. Reference Source [Google Scholar]
- 38. Hadley Wickham RF, Henry L, Müller K: dplyr: A Grammar of Data Manipulation. R package version 0.7. 4.2017. [Google Scholar]
- 39. Wickham H: Tools for working with categorical variables (factors)(R package Version 0.4. 0)[Computer software].2019. [Google Scholar]
- 40. Ram K, Wickham H: wesanderson: A Wes Anderson palette generator. R package version 0.3. 6.2018. Reference Source [Google Scholar]
- 41. Sali A: shinycssloaders: Add CSS Loading Animations to “shiny” Outputs. R Package Version 02 0.2017. [Google Scholar]
- 42. Ooms J: Magick: advanced graphics and image-processing in R. CRAN R package version.2018;1. Reference Source [Google Scholar]
- 43. Wei T, Simko V: R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84).2017. Reference Source [Google Scholar]
- 44. Kassambara A: ggcorrplot: Visualization of a Correlation Matrix using’ggplot2’. R package version 01.2016;1. Reference Source [Google Scholar]
- 45. Wilke CO: cowplot: Streamlined Plot Theme and Plot Annotations for “ggplot2. R package version 0.9. 4.2019. [Google Scholar]
- 46. Bell CG, Lowe R, Adams PD, et al. : DNA methylation aging clocks: challenges and recommendations. Genome Biol. 2019;20(1):249. 10.1186/s13059-019-1824-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Lello L, Avery SG, Tellier L, et al. : Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–97. 10.1534/genetics.118.301267 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Lewis CM, Vassos E: Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12(1):44. 10.1186/s13073-020-00742-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Tanaka T, Biancotto A, Moaddel R, et al. : Plasma proteomic signature of age in healthy humans. Aging Cell. 2018;17(5):e12799. 10.1111/acel.12799 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Tanaka T, Lavery R, Varma V, et al. : Plasma proteomic signatures predict dementia and cognitive impairment. Alzheimers Dement (N Y). 2020;6(1):e12018. 10.1002/trc2.12018 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Nedjadi T, Benabdelkamal H, Albarakati N, et al. : Circulating proteomic signature for detection of biomarkers in bladder cancer patients. Sci Rep. 2020;10(1):10999. 10.1038/s41598-020-67929-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Cuvelliez M, Vandewalle V, Brunin M, et al. : Circulating proteomic signature of early death in heart failure patients with reduced ejection fraction. Sci Rep. 2019;9(1):19202. 10.1038/s41598-019-55727-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Mundra PA, Barlow CK, Nestel PJ, et al. : Large-scale plasma lipidomic profiling identifies lipids that predict cardiovascular events in secondary prevention. JCI Insight. 2018;3(17):e121326. 10.1172/jci.insight.121326 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Gerl MJ, Klose C, Surma MA, et al. : Machine learning of human plasma lipidomes for obesity estimation in a large population cohort. PLoS Biol. 2019;17(10):e3000443. 10.1371/journal.pbio.3000443 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Hillary RF: TRIPOD Checklist for MethylDetectR - A Translational Tool For Methylation-Based Health Profiling.2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Collins GS, Reitsma JB, Altman DG, et al. : Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) The TRIPOD Statement. Circulation. 2015;131(2):211–9. [DOI] [PMC free article] [PubMed] [Google Scholar]