PLOS Medicine. 2023 Jul 6;20(7):e1004247. doi: 10.1371/journal.pmed.1004247

Blood-based epigenome-wide analyses of 19 common disease states: A longitudinal, population-based linked cohort study of 18,413 Scottish individuals

Robert F Hillary 1, Daniel L McCartney 1, Hannah M Smith 1, Elena Bernabeu 1, Danni A Gadd 1, Aleksandra D Chybowska 1, Yipeng Cheng 1, Lee Murphy 2, Nicola Wrobel 2, Archie Campbell 1, Rosie M Walker 1,3, Caroline Hayward 1,4, Kathryn L Evans 1, Andrew M McIntosh 1,5, Riccardo E Marioni 1,*
PMCID: PMC10325072  PMID: 37410739

Abstract

Background

DNA methylation is a dynamic epigenetic mechanism that occurs at cytosine-phosphate-guanine dinucleotide (CpG) sites. Epigenome-wide association studies (EWAS) investigate the strength of association between methylation at individual CpG sites and health outcomes. Although blood methylation may act as a peripheral marker of common disease states, previous EWAS have typically focused only on individual conditions and have had limited power to discover disease-associated loci. This study examined the association of blood DNA methylation with the prevalence of 14 disease states and the incidence of 19 disease states in a single population of over 18,000 Scottish individuals.

Methods and findings

DNA methylation was assayed at 752,722 CpG sites in whole-blood samples from 18,413 volunteers in the family-structured, population-based cohort study Generation Scotland (age range 18 to 99 years). EWAS tested for cross-sectional associations between baseline CpG methylation and 14 prevalent disease states, and for longitudinal associations between baseline CpG methylation and 19 incident disease states. Prevalent cases were self-reported on health questionnaires at the baseline. Incident cases were identified using linkage to Scottish primary (Read 2) and secondary (ICD-10) care records, and the censoring date was set to October 2020. The mean time-to-diagnosis ranged from 5.0 years (for chronic pain) to 11.7 years (for Coronavirus Disease 2019 (COVID-19) hospitalisation). The 19 disease states considered in this study were selected if they were present on the World Health Organisation’s 10 leading causes of death and disease burden or included in baseline self-report questionnaires. EWAS models were adjusted for age at methylation typing, sex, estimated white blood cell composition, population structure, and 5 common lifestyle risk factors. A structured literature review was also conducted to identify existing EWAS for all 19 disease states tested. The MEDLINE, Embase, Web of Science, and preprint servers were searched to retrieve relevant articles indexed as of March 27, 2023. Fifty-four of approximately 2,000 indexed articles met our inclusion criteria: assayed blood-based DNA methylation, had >20 individuals in each comparison group, and examined one of the 19 conditions considered. First, we assessed whether the associations identified in our study were reported in previous studies. We identified 69 associations between CpGs and the prevalence of 4 conditions, of which 58 were newly described. The conditions were breast cancer, chronic kidney disease, ischemic heart disease, and type 2 diabetes mellitus. 
We also uncovered 64 CpGs that associated with the incidence of 2 disease states (COPD and type 2 diabetes), of which 56 were not reported in the surveyed literature. Second, we assessed replication across existing studies, which was defined as the reporting of at least 1 common site in ≥2 studies that examined the same condition. Only 6/19 disease states had evidence of such replication. The limitations of this study include the nonconsideration of medication data and a potential lack of generalizability to individuals who are not of Scottish and European ancestry.

Conclusions

We discovered over 100 associations between blood methylation sites and common disease states that were independent of major confounding risk factors, and we highlight a need for greater standardisation among EWAS on human disease.


In an epigenome-wide association study using population-based linked records of over 18,000 people in Scotland, Robert F. Hillary and colleagues explore how differential DNA methylation correlates with incident and prevalent disease states.

Author summary

Why was this study done?

  • Blood DNA methylation can inform us about the biological mechanisms that underlie common disease states. Epigenome-wide association studies (EWAS) investigate whether the proportion of methylation at loci termed CpG sites (cytosine-phosphate-guanine dinucleotides) associates with health outcomes of interest.

  • There is a need for large-scale EWAS that probe for epigenetic signals across a wide range of conditions as well as a structured literature review to inform the utility of this approach in identifying disease-relevant loci.

What did the researchers do and find?

  • DNA methylation was assayed at 752,722 CpG sites using whole-blood samples from 18,413 volunteers, which were collected at the study baseline of Generation Scotland (2006 to 2011).

  • EWAS tested for associations between differential methylation at CpG sites and the prevalence and incidence of 14 and 19 disease states, respectively. Prevalence and incidence data were derived from self-report questionnaires and electronic health record linkage, respectively.

  • We identified over 100 CpG associations with 4 prevalent conditions (breast cancer, chronic kidney disease, ischemic heart disease, and type 2 diabetes) and 2 incident conditions (chronic obstructive pulmonary disease and type 2 diabetes). We also found poor replicability among existing studies with lung cancer showing the highest degree of replication (17% of sites replicated in at least 2 studies).

What do these findings mean?

  • Blood DNA methylation could act as a peripheral marker of several common disease states including breast cancer, cardiopulmonary disease, and type 2 diabetes.

  • As population biobank resources expand, studies that examine the same condition should reach consensus on covariate strategies, phenotype definitions, and reporting guidelines.

1. Introduction

Epigenetic modifications to DNA represent an important mechanism by which the environment interacts with the genome [1]. DNA methylation (DNAm) is one of the best-studied epigenetic mechanisms and involves the addition of chemical tags termed methyl groups to DNA, typically in the context of cytosine-phosphate-guanine dinucleotides (CpG sites). Factors such as diet, stress, and smoking behaviours may influence the process of methylation. The addition of these chemical tags can alter whether, and to what extent, a gene is active. In contrast to genetic sequence variation, these modifications are reversible and can modulate gene expression in cell- and tissue-specific manners [2]. Genome-wide patterns of DNAm are most commonly assayed using microarray-based technologies such as the Illumina HumanMethylation 450K and HumanMethylationEPIC arrays. The arrays permit a cost-effective assessment of DNAm at a scale required for large-scale population health studies [3,4].

Epigenome-wide association studies (EWAS) examine associations between the proportion of methylation at CpG sites and health outcomes of interest, such as chronic disease states [5]. EWAS have primarily been conducted using whole-blood DNAm. Patterns of DNAm identified in blood do not necessarily mirror DNAm patterns in distal or disease-relevant tissues such as nervous tissue for Alzheimer’s disease [6,7]. However, blood sampling represents a minimally invasive route for scalable biomarker measurement. Blood-based EWAS have also implicated differential methylation at individual loci as candidate markers of disease risk. For example, TXNIP and ABCG1 are important regulators of glucose and cholesterol metabolism, respectively. Hypomethylation within TXNIP (cg19693031) and hypermethylation within ABCG1 (cg06500161) have been associated with type 2 diabetes risk across individuals of multiple ancestries [8–11].

Existing EWAS on common diseases can be broadly categorised into prevalence analyses (i.e., cross-sectional) and incidence analyses (i.e., longitudinal assessment of incident cases in unaffected individuals). EWAS have often relied on modest sample sizes (<1,000 individuals), which has limited the discovery of loci that associate with disease states. Meta-analyses can increase power but may be vulnerable to between-study heterogeneities. There is a need for large-scale EWAS that examine the prevalence and incidence of multiple disease states in a single population. These analyses would help to establish the relevance of blood methylation as a peripheral marker of common disease states. Furthermore, there is a need for structured literature reviews to assess the level of agreement in locus discovery among existing EWAS that examine the same condition. A synthesis of the level of concordance between published association studies would aid in evaluating the utility of epigenome-wide analyses as an avenue for identifying risk mechanisms underlying common disease states.

Here, we utilise Generation Scotland: the Scottish Family Health Study (GS), a large cohort with DNAm data (n = 18,413). We hypothesise that differential methylation at CpG sites associates with the prevalence of 14 conditions and the incidence of 19 disease states. First, we integrate blood DNAm and self-reported disease data from questionnaires answered at the study baseline to perform EWAS on 14 prevalent disease states (cross-sectional analyses). Second, we conduct EWAS on 19 incident disease states ascertained through electronic health record linkage over up to 14 years of follow-up (longitudinal analyses). Third, we perform a structured literature review to identify blood-based EWAS findings on all 19 disease states considered in this study. We examine whether findings in this study replicate previous analyses and quantify the level of agreement within previously published studies. Fourth, we employ genetic colocalisation analyses to determine whether DNAm levels at the loci identified in our EWAS and disease risk mechanisms are likely influenced by shared or distinct genetic variants. These analyses would help to determine whether DNAm is an important molecular mechanism connecting genetic risk to disease endpoints. Fig 1 provides a visual summary of the study design.

Fig 1. Study design for epigenome-wide analyses on prevalent and incident disease states in Generation Scotland.

(A) Recruitment for Generation Scotland took place between 2006 and 2011. Prevalence analyses: participants self-reported disease status and donated blood samples at the study baseline. Incidence analyses: linked healthcare data were used to determine if participants who were free from a particular condition at baseline went on to develop the condition over up to 14 years of follow-up. Controls were free of the disease at the baseline and during follow-up. (B). Blood DNAm at baseline was available for 18,413 participants. The mean age was 47.5 years and the sample was 58.8% female. EWAS tested for associations between blood CpG methylation and the prevalence of 14 disease states at baseline or the incidence (time-to-onset) of 19 disease states during follow-up. The mean time-to-diagnosis ranged from 5.0 years (for chronic pain) to 11.7 years (for COVID-19 hospitalisation). Significant findings were tested for replication in existing studies via a structured literature review. Replication within existing studies was also investigated. Colocalisation analyses were employed to help dissect whether associations between DNAm and disease states reflected shared or distinct genetic architectures. (C). The first box lists the 14 self-reported disease states at the study baseline, which were included in this study. The second box lists the 19 incident disease states identified through electronic health record linkage. They include the same 14 conditions listed in the first box along with 5 additional disease states. Of note, prevalent AD reflected family history of the disease due to the young mean age of the sample at baseline, whereas incident AD reflected diagnosed disease. Image created using Biorender.com. AD, Alzheimer’s dementia; COVID-19, Coronavirus Disease 2019; CpG, cytosine-phosphate-guanine dinucleotide; DNAm, DNA methylation; EWAS, epigenome-wide association studies.

2. Methods

2.1. Ethics statement

All components of Generation Scotland received ethical approval from the NHS Tayside Committee on Medical Research Ethics (REC Reference Number: 05/S1401/89). Generation Scotland has also been granted Research Tissue Bank status by the East of Scotland Research Ethics Service (REC Reference Number: 20-ES-0021), providing generic ethical approval for a wide range of uses within medical research. Written informed consent was obtained from all participants. This study was performed in accordance with the Declaration of Helsinki.

2.2. Generation Scotland cohort

Generation Scotland, or GS, is a large family-structured cohort study that consists of approximately 24,000 individuals from across Scotland. Participants were identified via Community Health Index numbers, with the support of Scottish Practices and Professionals Involved in Research. The initial phase of recruitment (2006 to 2010) focussed on the Glasgow and Tayside regions of Scotland and was later extended to Ayrshire, Arran, and the Northeast of Scotland. To be eligible, individuals had to be aged between 35 and 65 years and have ≥1 first-degree relative and ≥1 full sibling. The age range was later broadened to 18 to 65 years. Family members of probands were also invited to partake in the study. In total, 23,960 individuals were recruited, which encompassed 6,665 probands, 16,007 family members, and 1,288 individuals who self-volunteered without invitation. There were 5,573 families with a mean size of 4 members and 1,400 participants without relatives. The median age at baseline was 47 years and the sample was 59% female [12,13]. Detailed health and lifestyle information was collected via questionnaires at the study baseline alongside venepuncture to obtain whole blood samples from which DNAm was assayed. This study is reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline (S1 STROBE Checklist).

The present study does not have a registered prospective protocol. An unpublished, informal analysis plan was made and discussed among study authors prior to the implementation of statistical analyses (August 2022). There were no significant changes to the analysis plan following informal review among the study authors, with the exception of pathway enrichment and outlier sensitivity analyses following peer review.

2.3. Preparation of DNA methylation data

Whole-blood DNAm was measured using the Illumina Infinium MethylationEPIC array. DNAm profiling of the GS samples was carried out by the Genetics Core Laboratory at the Edinburgh Clinical Research Facility, Edinburgh, Scotland. Methylation typing was performed in 3 distinct sets. Quality control steps are detailed in full in S1 Text. Following quality control, there were 5,087, 4,450, and 8,876 individuals within Sets 1, 2, and 3, respectively. Set 1 contained related individuals. Set 2 consisted of individuals who were unrelated to each other and to those in Set 1. Set 3 consisted of related individuals, and individuals related to those in Sets 1 and 2. The sets were combined and dasen normalisation was performed across all individuals [14]. Linear regression models were used to adjust methylation M-values for chronological age, sex, and experimental batch (factor with 121 levels, i.e., individuals were assayed across 121 unique batches). Residualised M-values were taken forward for analyses. In total, 752,722 probes and 18,413 individuals passed quality control criteria and were considered as a single analytical sample in our analyses.
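The M-value adjustment described above can be sketched as follows. This is a minimal illustration rather than the study pipeline: it assumes that methylation is supplied as beta values (proportions between 0 and 1), uses ordinary least squares from NumPy, and runs on simulated data. The function names, the clipping constant, and the toy inputs are hypothetical.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert beta values (0-1) to M-values: log2(beta / (1 - beta))."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)  # avoid log(0)
    return np.log2(b / (1 - b))

def residualise(m_values, age, sex, batch):
    """Regress M-values on age, sex, and batch (as a factor); return residuals."""
    batch = np.asarray(batch)
    levels = np.unique(batch)
    # Dummy-code batch, dropping the first level so the design stays full rank.
    dummies = (batch[:, None] == levels[None, 1:]).astype(float)
    X = np.column_stack([np.ones(len(age)), age, sex, dummies])
    coef, *_ = np.linalg.lstsq(X, m_values, rcond=None)
    return m_values - X @ coef

# Simulated example (assumed values, 4 batches instead of 121).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 99, n)
sex = rng.integers(0, 2, n).astype(float)
batch = rng.integers(0, 4, n)
beta = np.clip(0.5 + 0.002 * (age - 50) + rng.normal(0, 0.05, n), 0.01, 0.99)
resid = residualise(beta_to_m(beta), age, sex, batch)
```

By construction, the residuals are uncorrelated with the covariates in the design matrix, which is the property the downstream association models rely on.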

2.4. Preparation of disease phenotypes

Nineteen common disease states were considered across prevalence and incidence analyses: (i) Alzheimer’s dementia (AD); (ii) breast cancer; (iii) chronic kidney disease (CKD); (iv) chronic neck and/or back pain; (v) chronic obstructive pulmonary disease (COPD); (vi) colorectal cancer; (vii) Coronavirus Disease 2019 (COVID-19) severity (requiring hospitalisation); (viii) inflammatory bowel disease (IBD); (ix) ischemic heart disease; (x) liver cirrhosis; (xi) long COVID; (xii) lung cancer; (xiii) osteoarthritis; (xiv) ovarian cancer; (xv) Parkinson’s disease; (xvi) prostate cancer; (xvii) rheumatoid arthritis; (xviii) stroke; and (xix) type 2 diabetes. Outcomes were selected if they were present among the 10 leading causes of death in high-income countries, the 10 leading causes of disease burden (disability-adjusted life years (DALYs)) in high-income countries or self-reported conditions at the baseline [15–17]. Depression was not considered as it is included in an ongoing meta-analysis EWAS. Although asthma can occur at any age, it has a higher prevalence among children aged 0 to 17 years than in adults. It was therefore excluded from the present analyses that used an adult sample with a broad age profile [18].

Self-report data were used for 12 disease states in cross-sectional analyses of disease prevalence. Self-reported parental history of AD was used as a proxy variable for AD. Analyses on self-reported parental history of AD were restricted to participants who were >45 years at baseline. This ensured that only participants whose parents were likely old enough at baseline to be at risk of AD were considered (i.e., >65 years). The CKD Epidemiology Collaboration, or CKD-EPI, equation was implemented to estimate glomerular filtration rate (eGFR) at baseline. Individuals with an eGFR <60 ml/min/1.73 m² were deemed to have CKD [19]. Therefore, 14 disease phenotypes were considered in prevalent analyses.
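The eGFR-based CKD definition can be illustrated with a short sketch. The text does not state which version of the CKD-EPI equation was applied, so the code below assumes the 2009 creatinine-based equation with serum creatinine in mg/dL; the function names and example inputs are hypothetical.

```python
def ckd_epi_2009(scr_mg_dl, age_years, female, black=False):
    """eGFR (ml/min/1.73 m^2) from the CKD-EPI 2009 creatinine equation (assumed version)."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr

def has_ckd(egfr, threshold=60.0):
    """Apply the study's case definition: eGFR below 60 ml/min/1.73 m^2."""
    return egfr < threshold

# Hypothetical participants.
egfr_a = ckd_epi_2009(0.7, 50, female=True)    # creatinine at the reference value
egfr_b = ckd_epi_2009(2.0, 70, female=False)   # raised creatinine in an older male
print(round(egfr_a, 1), has_ckd(egfr_a))
print(round(egfr_b, 1), has_ckd(egfr_b))
```

Only the threshold of 60 ml/min/1.73 m² comes from the text; the equation coefficients are those of the assumed 2009 formula.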

All 19 phenotypes were included in longitudinal analyses via linkage to electronic health records (with the exception of self-reported long COVID). The primary and secondary care codes used to define incident phenotypes are available in S1 Appendix. Prevalent cases from the study baseline were excluded for these analyses as were those where record linkage provided evidence of a diagnosis prior to baseline. Therefore, incident cases included those diagnosed after baseline who had died and those who received a diagnosis and remained alive. Controls were censored if they were free of a diagnosis at the time of death or at the end of the follow-up period. Further information on the preprocessing of incident phenotypes, including COVID phenotypes, is available in S2 Text.

2.5. Epigenome-wide association studies on prevalent disease

First, logistic regression models were used to adjust prevalent phenotypes for chronological age and sex, with the exception of breast cancer and prostate cancer, which were adjusted for age after restricting the cohort to females and males, respectively. Second, linear regression models were used for EWAS via the OSCA (OmicS-data-based Complex trait Analysis) software [20]. Residuals from logistic regression models were entered as the dependent variable and age-, sex-, and batch-adjusted CpG M-values represented the independent variable. This strategy was employed to reduce computational burden. A Bonferroni significance threshold was set at p < 2.6 × 10⁻⁹ (= 3.6 × 10⁻⁸/14 phenotypes) [21]. Two models with different covariate strategies were employed, as described below:

  • 1. Basic model: Phenotype and CpG M-values, processed as described above, and 5 Houseman-estimated white blood cell (WBC) proportions as fixed effect covariates [22]. Six cell types are estimated from the Houseman method. However, their proportions sum to 100%. Therefore, the percentage of granulocytes was not included in this analysis given that it is collinear with the other 5 cell types. The basic model was as follows:

Phenotype (residuals) ~ CpG M-values (residuals) + 5 methylation-predicted WBC proportions.

  • 2. Fully adjusted model: additional adjustments for 5 common lifestyle factors, which were alcohol consumption, body mass index, deprivation index (Scottish Index of Multiple Deprivation), methylation-based smoking score (EpiSmokEr) [23], and years of education. Body mass index was log transformed prior to analysis. Furthermore, multidimensional scaling was applied to GS genotype data to obtain an estimate of population structure. The first 20 genetic principal components were extracted and included in our analytical models. The fully adjusted model was as follows:

Phenotype (residuals) ~ CpG M-values (residuals) + 5 methylation-predicted WBC proportions + alcohol consumption (units/week) + log(body mass index (kg/m²)) + deprivation index (Scottish Index of Multiple Deprivation) + education (an 11-category ordinal variable) + methylation-based smoking score (EpiSmokEr) + 20 genetic PCs (population structure).

Results from basic and fully adjusted models are presented within the main text. Both models are included to assess the effects of lifestyle factors on associations between methylation sites and common disease states. Some covariates may be more appropriate for one disease state over another (e.g., body mass index for type 2 diabetes versus cigarette smoking for COPD). However, all 5 risk factors are included in an effort to capture the most common environmental and lifestyle risk factors across a broad range of disparate conditions. We do not further present unadjusted analyses (i.e., using DNAm data that are not adjusted for age, sex, and batch effects) given the potentially strong confounding effects of age and technical variation on associations between CpG methylation and age-related disease states. We also did not initially adjust for family structure in our models. However, we later ran a series of sensitivity analyses (outlined in Section 3.5), including adjustment for relatedness between participants.
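The two-stage strategy for the basic model can be sketched on simulated data. This is an illustration under stated assumptions and not the OSCA implementation: the logistic model is fitted by Newton-Raphson in NumPy, the residuals are taken to be response residuals (observed minus fitted probability, which the text does not specify), and the p-value uses a normal approximation. All function names and simulated values are hypothetical.

```python
import math
import numpy as np

def logistic_residuals(y, covars, n_iter=25):
    """Stage 1: logistic regression of phenotype on covariates; return y - fitted p."""
    X = np.column_stack([np.ones(len(y)), covars])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):  # Newton-Raphson updates
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return y - 1.0 / (1.0 + np.exp(-X @ beta))

def linear_assoc(resid_y, cpg, covars):
    """Stage 2: OLS of phenotype residuals on one CpG plus covariates."""
    X = np.column_stack([np.ones(len(cpg)), cpg, covars])
    coef, *_ = np.linalg.lstsq(X, resid_y, rcond=None)
    err = resid_y - X @ coef
    sigma2 = err @ err / (len(resid_y) - X.shape[1])
    se = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t = coef[1] / se
    p = math.erfc(abs(t) / math.sqrt(2))  # normal approximation for large n
    return coef[1], t, p

rng = np.random.default_rng(1)
n = 2000
age_z = rng.normal(size=n)                      # standardised age
sex = rng.integers(0, 2, n).astype(float)
cpg = rng.normal(size=n)                        # stands in for residualised M-values
wbc = rng.dirichlet(np.ones(6), size=n)[:, :5]  # 5 of 6 cell proportions (sixth dropped)
prob = 1 / (1 + np.exp(-(-1.0 + 0.5 * age_z + 0.8 * cpg)))
y = (rng.random(n) < prob).astype(float)

resid = logistic_residuals(y, np.column_stack([age_z, sex]))
beta_hat, t_stat, p_val = linear_assoc(resid, cpg, wbc)
```

Dropping the sixth cell proportion mirrors the collinearity argument in the basic model: the six proportions sum to one, so including all of them alongside an intercept would make the design matrix rank deficient.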

2.6. Epigenome-wide association studies on incident disease

First, Cox proportional hazards models were used to adjust incident phenotypes for age at baseline and sex (17/19 phenotypes). Only age was included for breast, ovarian, and prostate cancer. Time-to-onset for the disease, or censoring, was the survival outcome in Cox proportional hazards models. Only individuals with an age at event or censoring ≥65 years were considered for AD. As outlined above, controls were censored at the time of death or at the end of the follow-up period. Logistic regression models were used to adjust 2 remaining COVID phenotypes prior to EWAS analyses. Cox models were not employed for COVID phenotypes owing to the limited differences in time-to-event data between individuals with positive COVID diagnoses. Whereas DNAm was corrected for age at baseline (as well as sex and batch), COVID phenotypes were adjusted for sex and age at COVID testing or diagnosis. Here, age at COVID testing or diagnosis was considered given the variation in time elapsed between baseline visits (between 2006 and 2011) and the onset of the COVID pandemic. Second, martingale residuals or logistic regression residuals were extracted and included as dependent variables in OSCA. A Bonferroni-corrected significance threshold was set at p < 1.9 × 10⁻⁹ (= 3.6 × 10⁻⁸/19 phenotypes). Basic and fully adjusted models were employed, as described in the previous section. Methods for sensitivity EWAS analyses are detailed under S3 Text.
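To make the martingale residual concrete, the sketch below computes it for a covariate-free model, where the Breslow baseline cumulative hazard reduces to the Nelson-Aalen estimator. This is a deliberate simplification: the study adjusted for age and sex, in which case the cumulative hazard for each person would also be scaled by the exponentiated linear predictor. The function name and the toy follow-up data are hypothetical.

```python
import numpy as np

def martingale_residuals_null(time, event):
    """Residual_i = event_i - H(t_i), with H the Nelson-Aalen cumulative hazard."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    uniq = np.unique(time[event == 1])                      # distinct event times
    at_risk = np.array([(time >= t).sum() for t in uniq])   # risk set sizes
    cases = np.array([event[time == t].sum() for t in uniq])
    cumhaz = np.cumsum(cases / at_risk)
    # H(t_i): cumulative hazard at the last event time <= t_i (0 if there is none).
    idx = np.searchsorted(uniq, time, side="right")
    H = np.concatenate([[0.0], cumhaz])[idx]
    return event - H

# Toy data: years to diagnosis or censoring; event = 1 for incident cases.
time = [2, 3, 3, 5, 8]
event = [1, 0, 1, 1, 0]
res = martingale_residuals_null(time, event)
print(res)  # approximately [0.8, -0.45, 0.55, 0.05, -0.95]
```

Cases diagnosed early receive residuals close to 1, while controls followed for a long time receive increasingly negative values, which is what allows the residual to serve as a continuous outcome in a linear model.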

2.7. Pathway enrichment analyses

Enrichment was assessed among Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms using the gometh() function in the R package missMethyl [24]. This function models the relationship between the number of probes per gene and the probability of being selected, accounting for the selection bias associated with probe-dense genes. The top 100 CpGs (i.e., smallest EWAS p-values) from each fully adjusted model were included as input features. There were 33 such models for consideration (14 prevalent and 19 incident models). Pathways with an FDR-adjusted p-value < 0.05 were deemed significant.
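The feature selection and multiple-testing steps around the enrichment analysis can be sketched as below. This does not reproduce gometh() or its correction for probe density; it only shows how the 100 CpGs with the smallest p-values might be selected and how a Benjamini-Hochberg adjustment works, assuming that is the FDR procedure used. The function names and CpG identifiers are hypothetical.

```python
import numpy as np

def top_n_cpgs(cpg_ids, pvalues, n=100):
    """Return the n CpG identifiers with the smallest EWAS p-values."""
    order = np.argsort(pvalues)[:n]
    return [cpg_ids[i] for i in order]

def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values (assumed FDR method)."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.minimum(ranked, 1.0)
    return adj

selected = top_n_cpgs(["cgA", "cgB", "cgC"], [0.5, 0.001, 0.2], n=2)
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
significant = adjusted < 0.05
```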

2.8. Structured literature review on blood-based EWAS of common disease

MEDLINE, Embase (Ovid interface, 1980 onwards), Web of Science (core collection, Thomson Reuters), and preprint servers were searched to identify relevant articles indexed as of March 27, 2023. The initial searches were run between August 1 and August 31, 2022, and were updated on March 27, 2023. We used the following search terms or their synonyms appropriate to each database: (“blood”.mp OR “whole blood”.mp OR “peripheral blood”.mp) AND (“EWAS” OR exp “epigenome-wide*” / OR exp “epigenome-wide association” /) AND (the disease of interest, e.g., “COPD” OR “chronic obstructive pulmonary disease”). The search strategy returned approximately 2,000 unique articles, of which 54 passed inclusion criteria. Inclusion criteria were as follows: (i) original research article; (ii) EWAS performed with blood DNAm; (iii) there were at least 20 individuals in each comparison group (i.e., cases and controls); and (iv) the study examined at least one of the 19 common disease states outlined in our study.

Here, we make an important distinction between systematic reviews and our structured literature review. The structured search of the literature was intended to identify appropriate studies for look-up analyses using a predefined and agreed list of search terms. This is similar to systematic reviews in that search terms are used to systematically screen literature databases. However, our approach differed from a systematic review in that no original or meta-analyses were performed using data from the literature beyond a look-up analysis of CpGs identified in these studies. Unlike a systematic review, the approach also does not provide an estimate for a clinical question and rather summarises the current EWAS literature.

First, we wished to examine whether the CpG associations identified in our study had been previously described. A CpG site was declared as novel in our study if it had not previously been reported at the experiment-wise significance threshold set by each of the 54 studies. Of note, these studies used different significance thresholds. Several studies did not make their full summary statistics available, which precluded the use of a common significance threshold for look-up analyses. The studies also differed from one another with respect to methylation arrays, phenotype definitions, and covariate strategies. We focussed on unique CpGs rather than unique genomic locations. Look-up analyses were performed separately for each condition following our structured literature review. Second, we aimed to determine the level of agreement among studies that examined the same condition with respect to locus discovery. Here, our study was ignored as we were only interested in the previous literature for this analysis. A CpG site or its gene (if available) was considered to be replicated if it was reported as significant (at thresholds set by each study) in at least 2 studies that examined the same condition. While focusing on genes alone may neglect intergenic CpGs, the CpG-level and gene-level look-up analyses are included together in an effort to capture as much information as possible from disparate studies in the literature.
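The replication rule can be expressed as a small tally. The sketch below assumes that each literature finding is recorded as a (condition, study, CpG) triple; the function name, study labels, and example records are hypothetical and do not represent the reviewed literature.

```python
from collections import defaultdict

def replicated_sites(reports, min_studies=2):
    """Return {condition: CpGs reported as significant in >= min_studies distinct studies}."""
    studies = defaultdict(set)
    for condition, study, cpg in reports:
        studies[(condition, cpg)].add(study)  # distinct studies per condition-CpG pair
    out = defaultdict(set)
    for (condition, cpg), seen in studies.items():
        if len(seen) >= min_studies:
            out[condition].add(cpg)
    return dict(out)

# Illustrative records only.
reports = [
    ("type 2 diabetes", "study_A", "cg19693031"),
    ("type 2 diabetes", "study_B", "cg19693031"),
    ("type 2 diabetes", "study_A", "cg06500161"),
    ("COPD", "study_C", "cg19693031"),
]
replicated = replicated_sites(reports)
```

A site reported for two different conditions does not count as replicated, because the tally is keyed on the condition as well as the CpG.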

2.9. Colocalisation analyses

Colocalisation analyses required GWAS summary statistics for CpG sites (i.e., methylation Quantitative Trait Loci–mQTLs, trait 1) and for respective disease states (trait 2; [25–30]). The GoDMC mQTL resource represents the largest mQTL study to date in terms of sample size but only focused on 450k array sites [31]. Therefore, the GoDMC resource was utilised for sites that are common to the EPIC and 450k arrays. However, mQTL analyses were also conducted in GS due to the need to generate mQTL summary statistics for sites present on the EPIC array only (S4 Text). In instances where CpGs had associations in both GS and GoDMC, we used the following criteria to determine which dataset to retain: (i) the dataset must have >10 genetic variants available and (ii) if both datasets satisfy (i), then retain the dataset with the larger sample size. Of note, GS served as the replication cohort within the original GoDMC analyses. Effect sizes in GS and GoDMC showed correlation coefficients of 0.97 and 0.96 for cis and trans variants, respectively, in the original GoDMC publication [31]. We observed a similar coefficient of 0.97 between effect sizes for the subset of CpGs used in our colocalisation analyses. Therefore, there was likely little heterogeneity between the data sources used in our workflow.
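The rule for choosing between the two mQTL sources can be written as a short function. This is a sketch of the two stated criteria only; the class, function name, variant counts, and sample sizes below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MqtlDataset:
    name: str
    n_variants: int   # variants available for the CpG of interest
    sample_size: int

def choose_mqtl_source(gs: Optional[MqtlDataset],
                       godmc: Optional[MqtlDataset],
                       min_variants: int = 10) -> Optional[MqtlDataset]:
    """Criterion (i): keep datasets with > min_variants variants.
    Criterion (ii): if both qualify, keep the one with the larger sample size."""
    eligible = [d for d in (gs, godmc)
                if d is not None and d.n_variants > min_variants]
    if not eligible:
        return None
    return max(eligible, key=lambda d: d.sample_size)

# Placeholder numbers for illustration.
gs = MqtlDataset("GS", n_variants=50, sample_size=17000)
godmc_sparse = MqtlDataset("GoDMC", n_variants=8, sample_size=27000)
godmc_dense = MqtlDataset("GoDMC", n_variants=40, sample_size=27000)

pick_1 = choose_mqtl_source(gs, godmc_sparse)  # only GS passes criterion (i)
pick_2 = choose_mqtl_source(gs, godmc_dense)   # both pass; larger sample wins
```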

The coloc.abf() function in the R package coloc was used to test for colocalisation and default parameters were applied (version 5.1.0) [32]. SNPs ±1 Mb surrounding each CpG site were extracted from mQTL datasets (i.e., GS or GoDMC, trait 1) and disease GWAS summary statistics (trait 2). The method tests for 5 mutually exclusive hypotheses, H0: there are no causal variants for either trait in the tested region; H1 and H2: causal variant for trait 1 and trait 2 only, respectively; H3: distinct causal variants for both traits; and H4: the traits share a causal variant. Posterior probabilities ≥95% for H4 provided strong evidence in favour of colocalisation.
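The way posterior probabilities for the 5 hypotheses arise from per-variant evidence can be sketched as follows. This is an illustrative re-implementation in Python of the approximate Bayes factor combination step, not the coloc package itself. It takes per-SNP log Bayes factors for each trait as given inputs and assumes prior probabilities of 1e-4, 1e-4, and 1e-5 for a variant being causal for trait 1, trait 2, or both, which are believed to match the package defaults. The function names and toy inputs are hypothetical.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def logdiffexp(a, b):
    """log(exp(a) - exp(b)) for a > b; returns -inf otherwise."""
    if a <= b:
        return -np.inf
    return a + np.log1p(-np.exp(b - a))

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities for H0..H4 from per-SNP log ABFs of two traits."""
    labf1 = np.asarray(labf1, dtype=float)
    labf2 = np.asarray(labf2, dtype=float)
    s1, s2 = logsumexp(labf1), logsumexp(labf2)
    s12 = logsumexp(labf1 + labf2)            # same SNP causal for both traits
    log_h = np.array([
        0.0,                                              # H0: no causal variant
        np.log(p1) + s1,                                  # H1: trait 1 only
        np.log(p2) + s2,                                  # H2: trait 2 only
        np.log(p1) + np.log(p2) + logdiffexp(s1 + s2, s12),  # H3: distinct variants
        np.log(p12) + s12,                                # H4: shared variant
    ])
    return np.exp(log_h - logsumexp(log_h))

# Toy region of 3 SNPs: the same SNP carries the signal for both traits...
pp_shared = coloc_posteriors([10, 0, 0], [10, 0, 0])
# ...versus different SNPs carrying the signal for each trait.
pp_distinct = coloc_posteriors([10, 0, 0], [0, 10, 0])
```

In the first toy region the posterior mass concentrates on H4, whereas in the second it shifts towards H3, which is the contrast that the ≥95% threshold on H4 is designed to capture.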

3. Results

3.1. Demographics and disease counts in Generation Scotland

The mean age of the sample was 47.5 years (n = 18,413, standard deviation (SD) = 14.9) and the sample was 58.8% female. Summary data for demographic variables are presented in Table 1. Additional data on covariates and disease counts are displayed in S1–S3 Tables. The number of self-reported cases for prevalent disease at baseline ranged from 34 participants with Parkinson’s disease to 5,296 with chronic neck and/or back pain (basic model). Further, the number of cases with incident disease since baseline (derived from health record linkage) ranged from 31 for severe COVID (hospitalisation from COVID-19 infection) to 1,886 for chronic neck and/or back pain. Associations between covariates and disease states are displayed in S4 and S5 Tables for prevalent and incident disease states, respectively (also available in S1 and S2 Figs).

Table 1. Summary of demographic variables in the Generation Scotland cohort.

Phenotype | Units | n | Mean | SD
Age | years | 18,413 | 47.5 | 14.9
Alcohol Consumption | units/week | 16,705 | 11.0 | 13.0
Body Mass Index | kg/m² | 18,299 | 27.0 | 5.2
DNAm smoking score (EpiSmokEr) | - | 18,413 | 1.4 | 4.3
Phenotype | Units | n | Median | IQR
Education | 11-category ordinal variable | 17,389 | 4 | 3
Scottish Index of Multiple Deprivation | rank | 17,287 | 4,331 | 3,115
Phenotype | Units | n | n-female | % female
Sex | - | 18,413 | 10,833 | 58.8

DNAm, DNA methylation; IQR, interquartile range; SD, standard deviation.

Education was measured as an ordinal variable: 0, 0 years; 1, 1–4 years; 2, 5–9 years; 3, 10–11 years; 4, 12–13 years; 5, 14–15 years; 6, 16–17 years; 7, 18–19 years; 8, 20–21 years; 9, 22–23 years; 10, ≥24 years.

3.2. Epigenome-wide analyses of prevalent disease

We first tested for cross-sectional associations between blood CpG methylation and 14 disease states at the study baseline. There were 1,340 significant associations across 10 diseases in a basic model that adjusted for age, sex, and estimated blood cell proportions (p < 2.6 × 10⁻⁹; Fig 2A, S6 Table). Over 90% of these associations (n = 1,246) were attributed to type 2 diabetes (n = 703 associations, 52.5%), COPD (n = 301, 22.5%), and chronic pain (n = 242, 18.1%). Genomic inflation factors ranged from 0.8 to 1.6 across all basic models (S7 Table). Look-up analyses in the EWAS Catalog showed that 617/1,340 associations involve CpGs that were previously associated with common disease risk factors including body mass index, smoking, and alcohol consumption [33]. For clarity, we do not present summary statistics (i.e., 95% CIs and p-values) for all individual CpG associations in the main text given the large number of associations present in basic and fully adjusted models. However, these are made available in S6 and S8 Tables, respectively.
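For readers unfamiliar with the genomic inflation factor, the sketch below shows the conventional definition: the median observed chi-square statistic divided by the median of a chi-square distribution with 1 degree of freedom. The text does not describe how the factor was computed in this study, so this definition is an assumption, and the function name and simulated test statistics are hypothetical.

```python
import numpy as np

CHI2_MEDIAN_1DF = 0.4549364  # median of a chi-square distribution with 1 df

def genomic_inflation(z_scores):
    """Lambda = median(z^2) / median of chi-square(1)."""
    z = np.asarray(z_scores, dtype=float)
    return float(np.median(z ** 2) / CHI2_MEDIAN_1DF)

rng = np.random.default_rng(1)
z_null = rng.normal(size=200_000)               # well-calibrated test statistics
lam_null = genomic_inflation(z_null)            # expected to be close to 1
lam_inflated = genomic_inflation(z_null * np.sqrt(1.6))  # variance inflated by 1.6
```

Values near 1 indicate well-calibrated test statistics, while values well above 1 suggest inflation from confounding, relatedness, or widespread true signal.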

Fig 2. Epigenome-wide association studies on 14 prevalent disease states in Generation Scotland.

(A) Diseases that had CpG associations in only the basic model or the fully adjusted model are shown in bold. Colorectal cancer was present in both the basic and fully adjusted model, but no CpGs were common to both models for this condition. (B) Ideogram showing 69 sites that were common to both the basic and fully adjusted models. These loci associated with 4 unique disease states. Full information is available in S8 Table. Image created using Biorender.com. CKD, chronic kidney disease; COPD, chronic obstructive pulmonary disease; CpG, cytosine-phosphate-guanine dinucleotide; WBC, white blood cells.

Next, we conducted a fully adjusted model that further accounted for 5 common lifestyle risk factors and population structure. The 5 risk factors were alcohol consumption, body mass index, deprivation (Scottish Index of Multiple Deprivation), a methylation-based proxy for tobacco smoking [23], and years of education. The fully adjusted model returned 78 associations across 8 disease states (p < 2.6 × 10⁻⁹; Fig 2B, S8 Table). Sixty-nine associations from the basic model were also present in the fully adjusted analysis. The 69 associations were spread across 4 disease states: CKD (n = 1); ischemic heart disease (n = 6); breast cancer (n = 10); and type 2 diabetes (n = 52). Genomic inflation factors ranged from 0.8 to 1.8 across all fully adjusted models and were 1.1, 1.8, 1.0, and 1.1 for CKD, ischemic heart disease, breast cancer, and type 2 diabetes, respectively (S7 Table). The significant findings included associations between self-reported history of breast cancer and hypomethylation at cg06072257 and cg06123699, which are located near UBIAD1 and TPRG1 on chromosomes 1 and 3, respectively (p = 6.5 × 10⁻¹⁰³ and p = 2.4 × 10⁻¹⁰¹, respectively). The site cg17944885, located between ZNF788 and ZNF20 on chromosome 19, associated with prevalent CKD (p = 1.7 × 10⁻¹²). Furthermore, CpGs annotated to ABCG1, DHCR24, and MYLIP were common to ischemic heart disease and type 2 diabetes (Fig 2B). We also examined where the 69 associations of interest were located in relation to CpG islands. CpG islands are clusters of methylation sites that typically occur at or near transcription start sites. Only 1 CpG was annotated to a CpG island (cg00994936), 20 were located in shores (0 to 2 kb from islands), 11 were in shelves (2 to 4 kb from islands), and the remaining 37 were annotated to the “open sea” (isolated sites outside of islands) (S8 Table).
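The island, shore, shelf, and open sea categories above can be expressed as a simple distance rule. The sketch below is illustrative only: the function name, the treatment of a distance of 0 as "inside an island", and the handling of the exact 2 kb and 4 kb boundaries are assumptions, not the annotation procedure used in the study.

```python
def cpg_context(distance_to_island_bp):
    """Classify a CpG by its distance (base pairs) to the nearest CpG island."""
    if distance_to_island_bp == 0:
        return "island"      # assumed: distance 0 means the site lies within an island
    if distance_to_island_bp <= 2000:
        return "shore"       # 0 to 2 kb from an island
    if distance_to_island_bp <= 4000:
        return "shelf"       # 2 to 4 kb from an island
    return "open sea"        # isolated sites outside of islands

# Counts reported in the text for the 69 prevalent associations.
reported_counts = {"island": 1, "shore": 20, "shelf": 11, "open sea": 37}
total_sites = sum(reported_counts.values())
print(cpg_context(1500), cpg_context(3200), cpg_context(25000), total_sites)
```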

Genetic colocalisation analyses provided weak evidence for a shared causal variant underlying methylation at cg00857282 (MYLIP) and risk of ischemic heart disease (PP = 63%; S9 Table). There was also moderate evidence for distinct causal variants underlying 10 of the 69 prevalent associations (PP > 75%).

3.3. Epigenome-wide analyses on incident disease

Using health record linkage, we tested whether CpGs measured at baseline associated with the future onset of 19 disease states. We observed 14,237 associations between baseline CpG methylation and the incidence of 11 disease states in the basic model (p < 1.9 × 10⁻⁹; Fig 3A, S10 Table). Of these, 11,305 (79.4%) and 2,657 (18.7%) were attributed to COPD and type 2 diabetes, respectively. Well-established smoking-associated probes (e.g., cg14391737 within PRSS23 and cg05575921 within AHRR) associated with the incidence of COPD, lung cancer, ischemic heart disease, stroke, pain, and/or CKD. Genomic inflation factors ranged from 0.8 to 3.8 across all basic incidence models (S11 Table).
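The two significance thresholds used for the prevalence and incidence analyses are consistent with a Bonferroni correction across traits. The sketch below assumes a per-trait threshold of 3.6 × 10⁻⁸, a value commonly used for the EPIC array; that base value is not stated in this section and is an assumption here, as are the variable names.

```python
# Assumed per-trait epigenome-wide threshold (not stated in this section).
BASE_THRESHOLD = 3.6e-8

def bonferroni_threshold(base, n_traits):
    """Divide the per-trait threshold by the number of traits tested."""
    return base / n_traits

prevalent = bonferroni_threshold(BASE_THRESHOLD, 14)  # 14 prevalent disease states
incident = bonferroni_threshold(BASE_THRESHOLD, 19)   # 19 incident disease states
print(f"{prevalent:.1e}", f"{incident:.1e}")  # 2.6e-09 1.9e-09
```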

Fig 3. Epigenome-wide association studies on 19 incident disease states in Generation Scotland.

(A) Diseases that were identified in only the basic model or only the fully adjusted model are shown in bold. COVID severity, liver cirrhosis, and ovarian cancer were present in both the basic and fully adjusted models, but no CpGs overlapped between the two models for these disease states. (B) Ideogram showing 64 associations that were common to the basic and fully adjusted models. Full information is available in S12 Table. Image created using Biorender.com. COPD, chronic obstructive pulmonary disease; CpG, cytosine-phosphate-guanine dinucleotide; WBC, white blood cells.

There were 79 unique associations in the fully adjusted model, which were spread across 5 disease states (Fig 3B, S12 Table). However, only 64 associations for COPD (n = 6) and type 2 diabetes (n = 58) were present across both basic and fully adjusted models. One site was annotated to a CpG island (cg14334350), 10 were in shores, 12 were in shelves, and 41 were located in the “open sea.” Genomic inflation factors ranged from 0.8 to 1.8 across all fully adjusted incidence models and were 1.1 and 1.8 for COPD and type 2 diabetes, respectively (S11 Table). Genes annotated to CpGs that associated with COPD included ALPG, C11orf91, CPOX, GPR15, HLA-DRB5, and PRSS23. Genes annotated to CpGs that were associated with type 2 diabetes included ABCA1, ABCG1, CPT1A, SREBF1, SLC7A11, SLC7A5, and TXNIP among others (see S12 Table for full details). Only type 2 diabetes had CpGs common to cross-sectional and longitudinal analyses and reflected 17 CpGs annotated to 11 unique genes.

There was only moderate evidence for distinct causal variants underlying 11/64 incident associations (PP > 75%). No associations showed strong evidence of colocalisation (S13 Table).

As a further analysis, we examined the contribution of each of the 5 common lifestyle risk factors in attenuating the 1,340 prevalent associations and 14,237 incident associations that were brought forward to the fully adjusted stage. The findings are outlined in full in S5 Text and S14 Table. In brief, the mean attenuation in effect sizes by each of the covariates ranged from 5.5% (for body mass index) to 63.1% (for smoking). However, there was heterogeneity across disease states given their distinct risk profiles.
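As a rough illustration of how covariate-specific attenuation might be summarised, the sketch below computes the percentage reduction in absolute effect size between a basic model and a model with one added covariate, then averages across CpGs. The exact definition used in the study is given in S5 Text and may differ; the formula, function names, and toy effect sizes here are assumptions.

```python
def percent_attenuation(beta_basic, beta_adjusted):
    """Percentage reduction in absolute effect size after adding a covariate."""
    if beta_basic == 0:
        raise ValueError("basic-model effect size must be non-zero")
    return 100.0 * (abs(beta_basic) - abs(beta_adjusted)) / abs(beta_basic)

def mean_attenuation(pairs):
    """Average attenuation over (basic, adjusted) effect-size pairs."""
    values = [percent_attenuation(basic, adjusted) for basic, adjusted in pairs]
    return sum(values) / len(values)

# Hypothetical effect sizes for three CpGs: (basic model, covariate-adjusted model).
toy_pairs = [(0.10, 0.05), (-0.20, -0.15), (0.40, 0.10)]
toy_mean = mean_attenuation(toy_pairs)
print(round(toy_mean, 1))
```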

3.4. Pathway enrichment analysis for methylation sites associated with common disease states

The top 100 CpGs (i.e., with the smallest EWAS p-values) for each fully adjusted model were assessed for enrichment in KEGG pathways and GO terms (see Methods). Thirty-three models were considered and reflected 14 prevalent and 19 incident phenotypes (S15 Table). Significant pathways were returned only for prevalent type 2 diabetes and ischemic heart disease (FDR-corrected p-value <0.05). The overrepresented terms included cholesterol and metabolic processes as well as alcohol metabolic pathways, which may indicate residual confounding despite adjustment for self-reported alcohol consumption.
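The text states only that enrichment p-values were FDR-corrected. A minimal sketch of one common approach, the Benjamini-Hochberg procedure, is shown below; whether this specific procedure was used is an assumption, and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical enrichment p-values for five pathways.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted_p = benjamini_hochberg(toy_p)
significant = [p < 0.05 for p in adjusted_p]
print(significant)
```

In this toy example only the first two pathways remain below 0.05 after correction, although four of the five raw p-values were below 0.05.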

3.5. Associations between CpG methylation and disease states are robust in sensitivity analyses

Mixed-effects models that included a kinship matrix were used to account for relatedness as sensitivity analyses. Effect sizes correlated >0.99 with associations from the standard EWAS, which included related individuals (S16 and S17 Tables, S3 Fig). Further, Cox proportional hazard models are often used to conduct incidence analyses. This model relies on the proportional hazard assumption, which in effect states that the hazard ratio remains constant over time and implies that the effect of a risk variable is also constant over the length of follow-up. The assumption is supported by a nonsignificant relationship between Schoenfeld residuals and time and refuted by a significant association. Fourteen of the 64 incident associations violated the proportional hazard assumption (p < 0.05 between Schoenfeld residuals and time; S18 Table). However, we also restricted the analyses to each possible year of follow-up and found that there were minimal differences in hazard ratios between time-points that failed the assumption versus those that did not (S19 Table). This suggested the hazards were proportional over time and all associations were therefore retained. Furthermore, death was considered as a censoring event within our study rather than a competing risk. Effect sizes were correlated >0.99 when incidence models were repeated with death as a competing event, and when individuals who had died were excluded from analyses (S20 Table).

The large number of association models employed in EWAS renders it challenging to examine the potential influence of outlying values for each CpG site, particularly where multiple phenotypes are evaluated. In an effort to highlight possible influential outliers, we computed Cook’s distance measurements across all 69 prevalent associations (4 prevalent phenotypes) and 64 incident associations (2 incident phenotypes) that were present in basic and fully adjusted models. There were therefore 133 association models for which Cook’s distance was computed. Cook’s distance is a measure of the effect of deleting an observation on the estimated coefficients, and the associated plots for all 133 models are shown in S2 Appendix [34,35]. Two separate criteria were used to identify influential outliers: (i) individuals were deemed outliers if their distance was greater than 3 times the mean distance across data points (standard rule of thumb) or (ii) a smaller subset of “extreme outliers” was identified based on visual inspection of the plots. There were between 174 and 565 outliers across models using the first criterion and 0 to 4 extreme outliers identified by the second criterion. Effect sizes were correlated 0.7 with those from the original EWAS when outliers from the first criterion were removed and 0.99 when those from the second criterion were omitted (S21 Table).
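The first outlier criterion (Cook’s distance greater than 3 times the mean) can be sketched for a simple linear regression with one predictor. This is a simplified illustration on made-up data: the study models included several covariates, and the function names and toy values are hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    n_params = 2  # intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - n_params)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (n_params * mse) * leverage / (1.0 - leverage) ** 2)
    return distances

def flag_outliers(distances, multiplier=3.0):
    """Indices whose Cook's distance exceeds multiplier times the mean distance."""
    cutoff = multiplier * sum(distances) / len(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]

# Toy data: a perfect line with one artificial outlier at the last point.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
y[19] += 30.0
flagged = flag_outliers(cooks_distances(x, y))
print(flagged)
```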

Fully adjusted models were repeated using logistic regression (prevalent disease) or Cox models (incident disease) with age and sex included as fixed-effect covariates. This differs from the main analytical strategy that used linear regression models with adjusted phenotype and methylation variables and allowed us to return effect sizes on an interpretable scale. Fig 4 shows odds ratios and hazard ratios associated with a per-1 SD increase in adjusted CpG methylation M-values for all 69 and 64 prevalent and incident disease associations (S22 Table). We also computed the Harrell’s C-statistic for each of the 64 incident associations, which is a measure of goodness of fit within survival analyses. Specifically, we calculated the difference in the C-statistic between a fully adjusted model with and without each CpG of interest. The model without the CpG included age, sex, estimated blood cell proportions, population structure, and 5 common lifestyle factors as outlined previously. The C-statistic from this model was 0.87 and 0.80 for COPD and type 2 diabetes, respectively. All CpGs increased the concordance index. The increment obtained from CpGs ranged from 0.1% to 1.2% (for cg00163198, type 2 diabetes) across all 64 loci (S22 Table).
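The comparison of C-statistics with and without a CpG can be illustrated with a minimal implementation of Harrell’s C for right-censored data. This sketch ignores tied event times and uses hypothetical risk scores in place of fitted Cox model predictions; the increment it prints is far larger than anything reported in the study and serves only to show the calculation.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's C: share of comparable pairs where higher risk fails earlier."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

# Toy follow-up data: 1 = event observed, 0 = censored.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk_without_cpg = [0.9, 0.3, 0.5, 0.4, 0.1]  # hypothetical covariate-only model
risk_with_cpg = [0.9, 0.7, 0.5, 0.4, 0.1]     # hypothetical model that adds a CpG

c_without = harrell_c(times, events, risk_without_cpg)
c_with = harrell_c(times, events, risk_with_cpg)
print(c_without, c_with, c_with - c_without)
```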

Fig 4. Blood CpGs associated with prevalent or incident disease states showing effect sizes on interpretable scale.

Effect sizes were reestimated using logistic regression (prevalent disease, blue points) or Cox proportional hazards models (incident disease, violet points) to return more interpretable effect sizes. Effect sizes represent a per-1 SD increase in age-, sex-, and experimental batch-adjusted CpG methylation M-values (or age- and batch-adjusted for breast cancer). CpGs shown were significant in both basic and fully adjusted models. Odds ratios and hazard ratios are detailed in S22 Table. CI, confidence interval; CpG, cytosine-phosphate-guanine dinucleotide; SD, standard deviation.

3.6. Structured literature review on existing epigenome-wide analyses of common diseases

We performed a structured review of the literature to identify blood-based EWAS on the 19 disease states considered in our study (n = 54 studies; Fig 5). Characteristics for each of the 54 studies are outlined (S23 Table). The studies were deemed to be of high quality. However, there was a high risk of selection bias among epigenome-wide analyses as well as attrition bias (i.e., in the incidence analyses). Fourteen disease states had at least 1 EWAS reported in the literature. The number of studies ranged from 1 (for long COVID) to 7 (for type 2 diabetes and lung cancer) (S24 and S25 Tables). Four studies used the Illumina 27k array (7.4%), 36 used the 450k array (66.7%), 12 employed the EPIC array (22.2%), and 2 implemented alternative arrays (Infinium Multi-Ethnic Global-8 and PyroMark Q24, 3.7%). Sixteen studies examined incident disease, while the remaining 38 focused on prevalent disease.

Fig 5. Look-up and replication analyses within EWAS on common disease states.

A structured literature search was performed to identify existing EWAS on 19 common disease states (either prevalent or incident). (1) We first determined whether associations in our study replicated those of previous studies. We focussed only on associations that were common to basic and fully adjusted models. There were 69 prevalent associations across 4 conditions (breast cancer, CKD, ischemic heart disease, and type 2 diabetes), and 64 incident associations across 2 conditions (COPD and type 2 diabetes). We found that 11/69 prevalent associations and 8/64 incident associations were reported in the literature. (2) We then turned our attention to the existing studies and asked whether studies that examined the same trait (e.g., incident type 2 diabetes) reported the same loci in their studies. We omit our study here as we are only interested in the previous literature. We required that a CpG site was reported in at least 2 studies that examined the same trait. There was a limited amount of replication in the literature as indicated in the right-hand side of the figure. Image created using Biorender.com. CKD, chronic kidney disease; CpG, cytosine-phosphate-guanine dinucleotide; COPD, chronic obstructive pulmonary disease; IBD, inflammatory bowel disease.

First, we performed look-up analyses to determine whether CpGs identified in our study were previously reported at significance thresholds deemed by each individual study. Only 11/69 prevalent associations in this study (including 1 for CKD and 10 for type 2 diabetes) and 8/64 incident associations (for type 2 diabetes only) were reported in the literature (at p < 2 × 10⁻⁵, which represented the least conservative threshold across studies for these traits; Fig 5). The replicated associations for type 2 diabetes implicated genes including ABCG1, CPT1A, SREBF1, and TXNIP.

Second, we assessed how well previous studies that examined the same trait (e.g., the 7 studies on type 2 diabetes) agreed with one another in terms of locus discovery. The present study was not included in this analysis as here we were interested only in the previous literature. A CpG was considered to be replicated in the literature if 2 or more studies reported it as significant at the threshold defined in their study. As different arrays may not have the same CpG sites, we also considered whether a given gene was replicated in at least 2 studies examining the same condition. There were 10 disease states that were available for testing (i.e., had 2 or more studies with available summary statistic data). The number of unique CpGs that were reported as significant by the authors ranged from 7 (for COPD) to 2,746 (for ovarian cancer). Six of the 10 disease states had evidence of replication across existing studies with respect to the CpGs identified by EWAS. They were IBD (1.1% of CpGs replicated), stroke (1.8%), ovarian cancer (2.2%), CKD (5.2%), type 2 diabetes (6.5%), and lung cancer (16.8%) (Fig 5). Similar percentages were observed for genes, with the exception of CKD, which had no common genes across studies as all of the replicated CpGs were intergenic (S25 Table).
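The replication rule described above (a CpG counts as replicated if 2 or more studies of the same trait report it as significant) can be sketched as follows. The study labels and CpG identifiers are placeholders, not data from the review.

```python
from collections import Counter

def replicated_cpgs(studies):
    """Return CpGs reported as significant in at least 2 studies of one trait."""
    # Use set() so a CpG listed twice within one study is counted once.
    counts = Counter(cpg for cpgs in studies.values() for cpg in set(cpgs))
    return {cpg for cpg, n_studies in counts.items() if n_studies >= 2}

def percent_replicated(studies):
    """Percentage of unique reported CpGs that replicate across studies."""
    unique = set().union(*studies.values())
    return 100.0 * len(replicated_cpgs(studies)) / len(unique)

# Placeholder significant-CpG lists for three studies of the same trait.
toy_studies = {
    "study_A": ["cpg_1", "cpg_2", "cpg_3"],
    "study_B": ["cpg_2", "cpg_4"],
    "study_C": ["cpg_3", "cpg_2", "cpg_5"],
}
shared = replicated_cpgs(toy_studies)
pct = percent_replicated(toy_studies)
print(sorted(shared), pct)
```

The same function could be applied to gene annotations instead of CpG identifiers to mirror the gene-level comparison, since different arrays may not share the same CpG sites.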

Discussion

Using one of the world’s largest methylation datasets, we perform a series of EWAS on the prevalence and incidence of a broad range of conditions. We undertook a large-scale, comprehensive review of the literature and highlight the poor agreement that exists across previous epigenome-wide analyses that examine the same condition. By comparing these data with our own findings, we uncover 58 novel associations with the prevalence of 3 self-reported disease states at the study baseline (breast cancer, ischemic heart disease, and type 2 diabetes). We also identify 56 novel associations between CpGs and the time-to-onset of 2 disease states (COPD and type 2 diabetes). These associations were independent of common lifestyle risk factors. However, we also observe a vast number of additional associations whereby CpGs index or track associations between lifestyle factors and common disease states, further highlighting the appropriateness of DNAm as a biomarker of lifestyle behaviours.

The novel associations observed in this study could strengthen evidence for candidate molecular pathways underlying peripheral disease states, e.g., self-reported history of breast cancer associated with differential methylation at cg06072257 (UBIAD1) and cg06123699 (TPRG1). UBIAD1 (UbiA Prenyltransferase Domain Containing 1) is a biosynthetic enzyme that converts vitamin K1 (phylloquinone) to menaquinone, which is the most abundant form of vitamin K2 in human tissue [36]. Low expression of UBIAD1 in human breast tumours correlates with reduced survival [37] and also associates with risk for bladder cancer [38]. TPRG1 encodes for Tumour protein P63 Regulated 1 and its expression is associated with estrogen receptor-positive and triple-negative breast cancers [39,40]. Furthermore, in relation to COPD, cg23353945 (C11orf91) correlated with incidence of the disease and has been associated in trans with CCL21 protein levels [41]. Serum CCL21 levels are elevated in COPD patients and may contribute to the development of lung cancer [42,43]. This may suggest that a C11orf91-CCL21 axis contributes to risk of pulmonary disease independently from lifestyle risk factors. However, these findings warrant further investigation in mechanistic in vitro and in vivo studies.

The most consistent associations across models and look-up analyses were for type 2 diabetes. This is likely attributable to the strong correlation between metabolic processes (e.g., glucose and lipid metabolism) and DNAm in blood [44]. The condition with the highest degree of replication within the existing literature alone was lung cancer. This may reflect the strong influence of smoking on DNAm. From these analyses, it is apparent that EWAS possess a generally low level of replicability, in particular when compared to genome-wide association studies (or GWAS), which show replication rates of 50% to 90% [45,46]. However, unlike DNAm, genetic factors remain fixed across the life-course and large sample sizes in GWAS have ensured adequate power. Epigenetic analyses are also highly susceptible to adjustments for environmental exposures as indicated above. Care should be taken with covariate strategies, particularly where the primary objective is to identify causal molecular mechanisms that connect genetic risk to disease endpoints, which should mandate high replicability. Furthermore, in our study, EWAS were conducted using linear regression models, which examined each CpG site in isolation. The risk of overfitting was low due to the large number of observations compared to the number of model parameters. However, the vast number of associations observed in our analyses may be attributable to the large sample size and possibly to the correlation structure among CpG sites within the same genomic region or distal sites influenced by similar lifestyle factors. As sample sizes grow, it may be necessary to employ additional methods that permit the joint and conditional estimation of probe effects while accounting for correlation structure and unknown confounders [20,47].

The generally poor replication across existing EWAS reflects a number of possible factors. These include the use of (i) different statistical models and significance thresholds; (ii) arrays with different CpG content (e.g., 450k versus EPIC arrays); (iii) different study designs (e.g., community-based designs with no enrichment for a particular disease versus targeted case/control designs); (iv) heterogeneities in genetic backgrounds; (v) variation in phenotype definitions for health record linkage analyses; and (vi) the use of disparate covariate strategies. Some studies also did not make full summary statistics available. Nevertheless, our review is critical and timely given that the scale of EWAS continues to rise in tandem with enhancements in array technologies, population biobank sizes, and health record phenotyping algorithms.

We highlight a number of further considerations in addition to those arising from the structured literature review. First, there was limited overlap between methylation sites identified in the prevalence and incidence analyses. Prevalence analyses relied on self-report data, which may have been prone to recall bias, whereas incidence analyses considered diagnosed disease. A subset of controls within the prevalence analyses will also have been reassigned to cases in the incidence analyses, which could attenuate common signal between these analyses. Second, the majority of disease states showed weak associations with differential methylation at CpG sites despite the large sample size employed. This is further highlighted by the lack of consistency in coefficient estimates across models. It is important to note that while the overall sample size was large, the number of cases in many conditions was modest, which may have limited power. The analyses also emphasise that epigenome-wide analyses are highly sensitive to adjustments for environmental exposures. Third, colocalisation analyses did not provide evidence that altered methylation and disease risk mechanisms shared common genetic variants. The CpG associations may instead reflect distinct genetic aetiologies or unknown confounding factors, and some of the associations could capture subclinical disease in the participants. Fourth, we did not consider multimorbidity in this study. There are a number of possible trajectories that a particular participant may have shown, as well as a number of recorded events for a given condition (e.g., stroke). Indeed, we focused only on time-to-first-event in this study. Future research will focus on applying sophisticated statistical methods to model all possible multimorbidity trajectories from linked healthcare data and disentangle their relationships with peripheral methylation.

Our study has a number of limitations. First, winsorization of methylation values was not applied in our study. Winsorizing limits extreme values in the data, e.g., in M-values for a given CpG site, and can reduce the effect of possibly spurious outliers [34]. However, sensitivity analyses using Cook’s distance metrics suggested that regression coefficients were largely stable when influential data points were removed, particularly where extreme outliers were excluded. Second, we did not adjust for medication data, which may confound associations between peripheral methylation and disease. Third, we did not consider disease subtypes as this may have reduced power to detect associations. Fourth, we utilised family history of Alzheimer’s disease as a proxy for prevalent disease due to the young mean age of the sample at baseline. This complicates its comparability with incident analyses on Alzheimer’s disease, which relied on diagnosed disease. Our phenotype definitions may also have neglected potential cases for other disorders such as CKD, including individuals with proteinuria and normal eGFR or with tubular disorders. Indeed, there is stark heterogeneity in clinical presentations among all conditions considered in our study given their multifactorial aetiologies. Future research may benefit from focussing on precise common endpoints in the disease process, such as fibrosis for CKD and liver cirrhosis. Fifth, our findings in blood might not reflect important changes in distal, disease-relevant tissues. Sixth, our analyses consisted of individuals with European ancestry and might not be generalisable to individuals of other ancestries. Seventh, the look-up analyses in our structured literature review relied on genome-wide significant p-value thresholds set by individual studies. This metric is not fully informative given that significant associations will be tightly coupled to characteristics such as the sample size of the study.
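For readers unfamiliar with the winsorization mentioned in the first limitation, the sketch below clamps values outside chosen percentiles to the percentile values. Winsorization was not applied in this study; the percentile cut-offs, the simple index-based percentile rule, and the toy values are all assumptions for illustration.

```python
def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values below/above the given percentiles to those percentile values."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple index-based percentiles using integer arithmetic (no interpolation).
    low = ordered[(lower_pct * (n - 1)) // 100]
    high = ordered[(upper_pct * (n - 1)) // 100]
    return [min(max(v, low), high) for v in values]

# Toy M-values: 99 ordinary values plus one extreme value at each end.
toy_values = [-50.0] + [float(i) for i in range(1, 100)] + [500.0]
clamped = winsorize(toy_values)
print(clamped[0], clamped[50], clamped[-1])
```

Unlike removing outliers, this keeps every observation in the analysis while limiting the leverage of extreme values.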

Moving forward, we recommend that studies examining the same condition engage in consortium efforts, which may provide an opportunity to reach consensus on covariate strategies and normalisation methods. Furthermore, it is essential that all studies report clearly the output of nested models, such as models with and without adjustments for lifestyle risk factors, and provide full publicly available summary statistics where possible.

Our epigenome-wide analyses uncovered over 100 novel associations between blood CpGs and common disease states that act independently of major confounding risk factors. Our summary data and synthesis of the literature provide a timely foundation that will expedite discoveries into the role of blood DNAm in common disease states.

Supporting information

S1 STROBE Checklist. STROBE statement—Checklist of items that should be included in reports of observational studies.

(DOCX)

S1 Appendix. Disease code lists.

(XLSX)

S2 Appendix. Cook’s distance plots for 133 associations in basic and fully adjusted models.

Outliers are highlighted in green (COPD) and blue (type 2 diabetes).

(PDF)

S1 Text. Supplementary methods for methylation quality control.

(DOCX)

S2 Text. Supplementary methods for preparation of phenotypes.

(DOCX)

S3 Text. Supplementary methods for sensitivity EWAS.

(DOCX)

S4 Text. Supplementary methods for methylation QTL analyses.

(DOCX)

S5 Text. Supplementary note on covariate-specific attenuation of effect sizes in basic model.

(DOCX)

S1 Fig. Associations between covariates and prevalent disease states in univariable and multivariable logistic regression models.

(DOCX)

S2 Fig. Associations between covariates and incident disease states in univariable and multivariable Cox proportional hazards models.

(DOCX)

S3 Fig. Correlation between effect sizes from linear regression EWAS and sensitivity linear mixed effects analyses that further accounted for relatedness.

(DOCX)

S1 Table. Summary data for demographic variables and covariates.

(XLSX)

S2 Table. Counts for prevalent disease states.

(XLSX)

S3 Table. Counts for incident disease states.

(XLSX)

S4 Table. Associations between covariates and prevalent disease states.

(XLSX)

S5 Table. Associations between covariates and incident disease states.

(XLSX)

S6 Table. Significant associations from basic model—Epigenome-wide association studies on prevalent disease states.

(XLSX)

S7 Table. Genomic inflation factors for epigenome-wide association studies on prevalent disease states.

(XLSX)

S8 Table. Significant associations from fully adjusted model—Epigenome-wide association studies on prevalent disease states.

(XLSX)

S9 Table. Genetic colocalisation analyses for prevalent disease associations.

(XLSX)

S10 Table. Significant associations from basic model—Epigenome-wide association studies on incident disease states.

(XLSX)

S11 Table. Genomic inflation factors for epigenome-wide association studies on incident disease states.

(XLSX)

S12 Table. Significant associations from fully adjusted model—Epigenome-wide association studies on incident disease states.

(XLSX)

S13 Table. Genetic colocalisation analyses for incident disease associations.

(XLSX)

S14 Table. Sensitivity analysis to test for the effects of each of the 5 lifestyle risk factors included in this study on attenuating associations from the basic model.

(XLSX)

S15 Table. Pathway enrichment analyses.

(XLSX)

S16 Table. Sensitivity analysis to test for effect of relatedness on associations with prevalent disease states.

(XLSX)

S17 Table. Sensitivity analysis to test for effect of relatedness on associations with incident disease states.

(XLSX)

S18 Table. Sensitivity analysis to test for proportional hazard assumption.

(XLSX)

S19 Table. Sensitivity analysis to estimate hazard ratios during each year of follow-up for associations that violated proportional hazard assumption in S18 Table.

(XLSX)

S20 Table. Sensitivity analysis to assess the impact of all-cause mortality as a competing risk in incidence models.

(XLSX)

S21 Table. Sensitivity analysis to identify influential observations based on Cook’s distance statistics.

(XLSX)

S22 Table. Odds ratios and hazard ratios for prevalent and incident disease associations, including Harrell’s C-statistic for the latter.

(XLSX)

S23 Table. Characteristics of 54 studies identified in structured literature review.

(XLSX)

S24 Table. Look-up analyses to assess whether associations identified in the present study are newly described.

(XLSX)

S25 Table. Replication within existing epigenome-wide association studies that examined the same condition in the literature.

(XLSX)

Abbreviations

AD, Alzheimer’s dementia
CKD, chronic kidney disease
COPD, chronic obstructive pulmonary disease
COVID-19, Coronavirus Disease 2019
CpG, cytosine-phosphate-guanine dinucleotide
DALY, disability-adjusted life year
DNAm, DNA methylation
eGFR, estimated glomerular filtration rate
EWAS, epigenome-wide association studies
GO, Gene Ontology
GS, Generation Scotland
GWAS, genome-wide association studies
IBD, inflammatory bowel disease
KEGG, Kyoto Encyclopaedia of Genes and Genomes
OSCA, OmicS-data-based Complex trait Analysis
SD, standard deviation
TPRG1, Tumour protein P63 Regulated 1
UBIAD1, UbiA Prenyltransferase Domain Containing 1
WBC, white blood cell

Data Availability

EWAS summary statistics are available on the EWAS Catalog (http://ewascatalog.org/?query=10.1101/2023.01.10.23284387). According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to access@generationscotland.org.

Funding Statement

This research was funded in whole, or in part, by the Wellcome Trust (104036/Z/14/Z, 216767/Z/19/Z, 220857/Z/20/Z to AMM; 108890/Z/15/Z to DAG and 218493/Z/19/Z to HMS). This work was supported by the British Heart Foundation (Immediate Fellowship FS/IPBSRF/22/27042 to RFH), Alzheimer’s Society (AS-PG-19b-010 to REM and supports EB), and the Medical Research Council (U. MC_UU_00007/10 to CH). The Generation Scotland study was also awarded funding and supported by the Chief Scientist Office of the Scottish Government Health Directorates (CZD/16/6), the Scottish Funding Council (HR03006) and Wellcome (104036/Z/14/Z to AMM). AMM acknowledges further support by Wellcome (216767/Z/19/Z and 220857/Z/20/Z), United Kingdom Research and Innovation Medical Research Council (MC_PC_17209, MR/W014386/1 and MR/S035818/1) and the European Union H2020 (SEP-210574971). HMS and DAG are supported by Wellcome through the Translational Neuroscience PhD Programme (218493/Z/19/Z to HMS and 108890/Z/15/Z to DAG). YC is supported by the University of Edinburgh and University of Helsinki joint PhD program in Human Genomics. ADC is supported by a Medical Research Council PhD Studentship in Precision Medicine with funding by the Medical Research Council Doctoral Training Programme and the University of Edinburgh College of Medicine and Veterinary Medicine. RFH receives salary support from the British Heart Foundation (FS/IPBSRF/22/27042), EB receives salary support through Alzheimer’s Society (AS-PG-19b-010). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 
  • Wellcome: https://wellcome.org/
  • British Heart Foundation: https://www.bhf.org.uk/
  • Alzheimer's Society: https://www.alzheimers.org.uk/
  • Medical Research Council: https://www.ukri.org/councils/mrc/
  • Chief Scientist Office: https://www.cso.scot.nhs.uk/
  • Scottish Funding Council: https://www.sfc.ac.uk/
  • UK Research and Innovation: https://www.ukri.org/
  • European Union H2020: https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en

Decision Letter 0

Philippa Dodd

10 Jan 2023

Dear Dr Hillary,

Thank you for submitting your manuscript entitled "Blood-based epigenome-wide analyses on the prevalence and incidence of nineteen common disease states" for consideration by PLOS Medicine.

Your manuscript has now been evaluated by the PLOS Medicine editorial staff as well as by an academic editor with relevant expertise and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Jan 12 2023 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pmedicine

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Feel free to email us at plosmedicine@plos.org if you have any queries relating to your submission.

Kind regards,

Philippa Dodd, MBBS MRCP PhD

PLOS Medicine

Decision Letter 1

Philippa Dodd

4 Mar 2023

Dear Dr. Hillary,

Thank you very much for submitting your manuscript "Blood-based epigenome-wide analyses on the prevalence and incidence of nineteen common disease states" (PMEDICINE-D-22-04026R1) for consideration at PLOS Medicine.

Your paper was evaluated by a senior editor and discussed among all the editors here. It was also sent to independent reviewers, including a statistical reviewer. The reviews are appended at the bottom of this email and any accompanying reviewer attachments can be seen via the link below:

[LINK]

In light of these reviews, I am afraid that we will not be able to accept the manuscript for publication in the journal in its current form, but we would like to consider a revised version that addresses the reviewers' and editors' comments. Obviously we cannot make any decision about publication until we have seen the revised manuscript and your response, and we plan to seek re-review by one or more of the reviewers.

In revising the manuscript for further consideration, your revisions should address the specific points made by each reviewer and the editors. Please also check the guidelines for revised papers at http://journals.plos.org/plosmedicine/s/revising-your-manuscript for any that apply to your paper. In your rebuttal letter you should indicate your response to the reviewers' and editors' comments, the changes you have made in the manuscript, and include either an excerpt of the revised text or the location (eg: page and line number) where each change can be found. Please submit a clean version of the paper as the main article file; a version with changes marked should be uploaded as a marked up manuscript.

In addition, we request that you upload any figures associated with your paper as individual TIF or EPS files with 300dpi resolution at resubmission; please read our figure guidelines for more information on our requirements: http://journals.plos.org/plosmedicine/s/figures. While revising your submission, please upload your figure files to the PACE digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at PLOSMedicine@plos.org.

We expect to receive your revised manuscript by Mar 27 2023 11:59PM. Please email us (plosmedicine@plos.org) if you have any questions or concerns.

***Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.***

We ask every co-author listed on the manuscript to fill in a contributing author statement, making sure to declare all competing interests. If any of the co-authors have not filled in the statement, we will remind them to do so when the paper is revised. If all statements are not completed in a timely fashion this could hold up the re-review process. If new competing interests are declared later in the revision process, this may also hold up the submission. Should there be a problem getting one of your co-authors to fill in a statement we will be in contact. YOU MUST NOT ADD OR REMOVE AUTHORS UNLESS YOU HAVE ALERTED THE EDITOR HANDLING THE MANUSCRIPT TO THE CHANGE AND THEY SPECIFICALLY HAVE AGREED TO IT. You can see our competing interests policy here: http://journals.plos.org/plosmedicine/s/competing-interests.

Please use the following link to submit the revised manuscript:

https://www.editorialmanager.com/pmedicine/

Your article can be found in the "Submissions Needing Revision" folder.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please ensure that the paper adheres to the PLOS Data Availability Policy (see http://journals.plos.org/plosmedicine/s/data-availability), which requires that all data underlying the study's findings be provided in a repository or as Supporting Information. For data residing with a third party, authors are required to provide instructions with contact information for obtaining the data. PLOS journals do not allow statements supported by "data not shown" or "unpublished results." For such statements, authors must provide supporting data or cite public sources that include it.

We look forward to receiving your revised manuscript.

Sincerely,

Philippa Dodd, MBBS MRCP PhD

PLOS Medicine

plosmedicine.org

-----------------------------------------------------------

Requests from the editors:

GENERAL

Please respond to all editor and reviewer comments detailed below in full.

Please number the lines starting at 1 and in continuous sequence throughout, thereafter.

Your study combines a cohort study and a systematic review/meta-analysis (SRMA). Each of these study designs has its own reporting guidance. Please review both the STROBE guideline (cohort studies) and the PRISMA guideline (SRMAs) and report the relevant parts of your study according to the guidance.

The STROBE guideline can be found here: http://www.equator-network.org/reporting-guidelines/strobe/

Please include the completed STROBE checklist as Supporting Information. Please add the following statement, or similar, to the relevant section of the Methods: "This study is reported as per the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline (S1 Checklist)." Please use section and paragraph numbers, rather than page or line numbers as these often change at the time of publication.

The PRISMA guideline can be found here: https://www.equator-network.org/reporting-guidelines/prisma/

Please provide the completed PRISMA checklist as supporting information.

When completing the checklist, please use section and paragraph numbers, rather than page or line numbers as these often change at the time of publication. Please add the following statement, or similar, to the relevant section of the Methods: "This study is reported as per the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline (S1 Checklist)."

PLOS Medicine appeals to a wide general medical audience, to whom we think your study would appeal. To make your manuscript widely accessible we suggest ensuring clear use of language and particular attention to the explanation and definition of terms that are likely to be less familiar to the more general reader, as opposed to the epigeneticist. We also suggest caution with your use of causal language and claims of primacy.

TITLE

Please revise your title according to PLOS Medicine's style. Your title must be nondeclarative and not a question. It should begin with main concept if possible. "Effect of" should be used only if causality can be inferred, i.e., for an RCT. Please place the study design ("A randomized controlled trial," "A retrospective study," "A modelling study," etc.) in the subtitle (ie, after a colon).

ABSTRACT

Please structure your abstract using the PLOS Medicine headings (Background, Methods and Findings, Conclusions).

Please combine the Methods and Findings sections into one section, “Methods and findings”.

Abstract Background:

Please provide the context of why the study is important.

The final sentence should clearly state the study question.

“…epigenome-wide analyses…”: we suggest defining as “(EWAS)” here. It may also be helpful to further elaborate on the importance and potential uses of epigenetic data.

Abstract Methods and Findings:

Please include further details of the study population and setting, number of participants, years during which the study took place, length of follow up, and main outcome measures.

What health record database did you link the epigenetics data to?

Please define “CpGs” at first use in the abstract.

“We identify 69 associations between CpGs and the prevalence of four disease states”: before you present the results, some further brief details of what you did would be helpful. What diseases did you investigate? What associations did you investigate? How did you decide which diseases to investigate?

Please provide additional details of your literature review: dates of search, data sources, number of studies included, types of study designs included, eligibility criteria, and synthesis/appraisal methods.

Please quantify the main results with 95% CIs and p values. When reporting p values please report as p<0.001 or where higher as p=0.002, for example

Please ensure that all numbers presented in the abstract are present and identical to numbers presented in the main manuscript text.

Please include any important dependent variables that are adjusted for in the analyses.

In the last sentence of the Abstract Methods and Findings section, please describe the main limitation(s) of the study's methodology.

Abstract Conclusions:

Please address the study implications without overreaching what can be concluded from the data; the phrase "In this study, we observed ..." may be useful.

Please interpret the study based on the results presented in the abstract, emphasizing what is new without overstating your conclusions.

Please avoid vague statements such as "these results have major implications for policy/clinical care". Mention only specific implications substantiated by the results.

Please avoid assertions of primacy ("We report for the first time....")

AUTHOR SUMMARY

At this stage, we ask that you include a short, non-technical Author Summary of your research to make findings accessible to a wide audience that includes both scientists and non-scientists. The Author Summary should immediately follow the Abstract in your revised manuscript. This text is subject to editorial change and should be distinct from the scientific abstract. Please see our author guidelines for more information: https://journals.plos.org/plosmedicine/s/revising-your-manuscript#loc-author-summary

The summary should include 2-3 single sentence bullet points under each individual question. We encourage you to review some published articles on our website for examples here https://journals.plos.org/plosmedicine/

INTRODUCTION

Please further explain the need for and potential importance of your study. If there has been a systematic review (other than that you have conducted for this study) of the evidence related to your study, please refer to and reference that review and indicate whether it supports the need for your study.

“…To date, no study…”: claims of primacy can be risky; we suggest “to our knowledge” or similar.

METHODS and RESULTS

Please ensure that your methods section includes details of the study cohort without the need to refer to another article (basic demographics, age, sex, enrollment criteria, etc.). Please include these details in the section labelled 2.1.

Did your study have a prospective protocol or analysis plan? Please state this (either way) early in the Methods section.

a) If a prospective analysis plan (from your funding proposal, IRB or other ethics committee submission, study protocol, or other planning document written before analyzing the data) was used in designing the study, please include the relevant prospectively written document with your revised manuscript as a Supporting Information file to be published alongside your study, and cite it in the Methods section. A legend for this file should be included at the end of your manuscript.

b) If no such document exists, please make sure that the Methods section transparently describes when analyses were planned, and when/why any data-driven changes to analyses took place.

c) In either case, changes in the analysis-- including those made in response to peer review comments-- should be identified as such in the Methods section of the paper, with rationale.

For all observational studies, we request that in the manuscript text, authors please indicate the following:

(1) the specific hypotheses you intended to test,

(2) the analytical methods by which you planned to test them,

(3) the analyses you actually performed, and

(4) when reported analyses differ from those that were planned, transparent explanations for differences that affect the reliability of the study's results. If a reported analysis was performed based on an interesting but unanticipated pattern in the data, please be clear that the analysis was data-driven.

Please ensure that the main results are quantified with 95% CIs and p values. Please report p as p<0.001 and where higher as p=0.002, for example. If they are not reported, please clearly state why not, for the purpose of transparent data reporting.

LITERATURE REVIEW

Please move the details of your literature review to the main manuscript, such that the following are included: dates of search, data sources, types of study designs included, eligibility criteria, and synthesis/appraisal methods.

We require that SRs are updated to within roughly 6 months of the expected publication date. Please update your search to the present time. We also ask for an evaluation of study quality and risk of bias and for an evaluation for evidence of publication bias. Please include.

TABLES

Please include a table containing the baseline characteristics of your study population

FIGURES

Please ensure that CpG is defined within the figure captions where relevant, figure 1, for example.

To help facilitate transparent data reporting, PLOS Medicine requests that where adjusted analyses are presented unadjusted analyses are presented for comparison. Please include unadjusted analyses. If not including unadjusted analyses, then please clearly state the reasons why not.

The + and * symbols, as well as the text in part B of the figures, are very difficult to read even with the figure enlarged. Please revise to improve accessibility to the reader.

Please consider avoiding the use of red and/or green to improve accessibility of your figures to those with color blindness

Please quantify the main results with 95% CIs and p values. Please report p values as p<0.001 or where higher as p=0.002, for example. If they are not reported, please clearly state why not, for the purpose of transparent data reporting.

FIGURE 5: please revise the statement “our study is the first with”

DISCUSSION

Please present and organize the Discussion as follows: a short, clear summary of the article's findings; what the study adds to existing research and where and why the results may differ from previous research; strengths and limitations of the study; implications and next steps for research, clinical practice, and/or public policy; one-paragraph conclusion. Please ensure that the discussion reads as a single piece of continuous prose without any sub-headings.

Please remove the sub-heading “Conclusions”

Please move the ethics statement to the methods section of the main manuscript.

Please remove data availability, funding and competing interest statements from the end of the manuscript and include only in the manuscript submission form when you resubmit.

REFERENCES

Please ensure that in-text reference callouts are placed within square parentheses preceding punctuation, as follows, “For example [1,3,6].” Please note the presence of a space preceding the opening parenthesis and the absence of spaces between citations.

In your bibliography (including in the supporting files) please ensure that up to but no more than 6 author names are listed followed by et al., where more than 6 authors contribute to an individual study.

Journal name abbreviations should be those found in the National Center for Biotechnology Information (NCBI) databases.

Please see our website for further reference guidelines https://journals.plos.org/plosmedicine/s/submission-guidelines#loc-references

SUPPORTING FILES

Please include the PRISMA and STROBE checklists, as detailed above

Please apply the same suggestions above for tables and figures to those in the supporting files as relevant.

Comments from the reviewers:

Reviewer #1: Hillary et al. report the results of a phenome-wide EWAS analysis in Generation Scotland (N≤18,413). Using Illumina EPIC methylation arrays, associations of individual CpGs were assessed with the prevalence of 14 disease states at baseline (mean age 47.5) and the incidence of 19 disease states over follow-up (~9-14 years).

Association analysis was carried out using linear regression models for EWAS via the OSCA (OmicS-data-based Complex trait Analysis) software. Basic (Houseman cell type proportions) and adjusted (cell types, genetic principal components, and lifestyle factors) models were run. Cox proportional-hazards models were used for survival analysis for incidence phenotypes.

Adjustment was for the number of CpGs and the number of phenotypes (14 for logistic regression, 19 for incidence analysis). M values were used.
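For readers unfamiliar with the term, M values are the base-2 logit transform of methylation beta values (the proportion methylated at a site). A minimal sketch of the conversion follows; the function name `beta_to_m` and the clipping constant `eps` are illustrative assumptions and are not taken from the authors' pipeline.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value (proportion in [0, 1]) to an M value.

    M = log2(beta / (1 - beta)). `eps` is an assumed clipping constant that
    keeps the logarithm finite when beta is exactly 0 or 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

# Hypothetical beta values for three CpG sites.
m_values = [beta_to_m(b) for b in (0.2, 0.5, 0.8)]
```

A beta of 0.5 maps to an M value of zero, and the transform is symmetric around that point, which is one reason M values are often preferred for regression-based analyses.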

The methods for DNA methylation data processing and adjustment for age / sex are appropriate. The level of significance adjusts for the number of phenotypes examined. Sensitivity analyses were carried out to assess the effect of relatedness between subjects in the cohort.

In total, 69 associations were found between CpGs and the prevalence of four disease states at baseline (58 are novel). 64 CpGs were associated with the incidence of two disease states (COPD and type 2 diabetes; 56 are novel).

Strengths of the study include the sample size and adjustment for lifestyle factors. Weaknesses include the lack of genomic inflation analysis to assess if there is residual confounding and the self report and EPR nature of phenotyping.

Main points

1. Increasingly, winsorisation of data is being adopted in EWAS to reduce the influence of outlier probes. Was this considered?

2. What was the genomic inflation of the models?

3. When determining whether associations were novel, was the EWAS Atlas used (https://ngdc.cncb.ac.cn/ewas/atlas)? What were the criteria for novel? Unique CpG or unique CpG at unique genomic location?

4. Did the authors consider incorporating enrichment analyses to look at enriched & depleted genomic locations, enriched GO & KEGG terms, related traits in EWAS Atlas for say top 100 CpGs for each analysis? This could give insight into whether these CpGs have previously been correlated with specific environmental exposures for example.

5. It would have been nice to see what the clinical significance of significantly associated CpGs was - how predictive of new onset disease for example?

6. It could be acknowledged in the discussion that, for incident disease, CpGs might not be on the causal pathway and could reflect subclinical disease.
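As background to point 2, the genomic inflation factor (lambda) is commonly computed as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The sketch below is a minimal standard-library illustration under that definition; the function name and the simulated p values are assumptions for demonstration and do not reproduce any analysis from the manuscript.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Return lambda: median observed 1-df chi-square / expected null median.

    p values must lie in (0, 1]; they are treated as two-sided.
    """
    norm = NormalDist()
    # Two-sided p value -> z score -> 1-df chi-square statistic.
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    # Median of a 1-df chi-square distribution (about 0.455).
    expected_median = norm.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median

# Hypothetical p values evenly spaced on (0, 1), mimicking a calibrated null.
null_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_null = genomic_inflation(null_p)

# Squaring shifts p values toward zero, mimicking inflated test statistics.
lambda_inflated = genomic_inflation([p ** 2 for p in null_p])
```

Values of lambda close to 1 suggest little systematic inflation, whereas values well above 1 can indicate residual confounding or, in well-powered studies, a large number of true associations.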

Minor points

1. Mean age and average length of follow up could be included in figure 1 legend for clarity to reader.

2. The lifestyle factors adjusted for should be explicitly stated in the manuscript, not just by providing a reference.

Reviewer #2: The present study characterizes epigenome-wide associations between DNA methylation (DNAm) patterns derived from peripheral blood and both the prevalence and incidence of a range of different diseases, based on data from the Generation Scotland study. The authors provide results from both a core model and a fully-adjusted model controlling for several lifestyle factors, and perform a literature review of published EWASs to investigate the extent to which findings replicate. The authors also conduct co-localization analyses to test whether the top DNAm loci and associated traits are linked to common vs distinct genetic variants. The manuscript is very well-written, it addresses an important topic and has multiple strengths, including the investigation of both disease prevalence (cross-sectional analyses) and incidence (longitudinal analyses) across a broad range of diseases and the use of one of the largest epigenetic datasets in the world. Overall, I believe this is an excellent study and poised to make a significant contribution to the field. Specific comments that could be addressed to strengthen the manuscript are provided below.

* Abstract

o The authors could also mention in the background that previous EWAS studies typically focus on single diseases (as opposed to the wide range of outcomes examined here)

o Could you provide some key information, such as the age range of the sample, the fact that this is a family-structured study and the prevalence/incidence range across diseases examined.

o It would be worth stating a bit more explicitly that, based on this large sample, most diseases show rather weak associations (e.g., as indicated by the lack of [consistent] EWAS-significant associations across models)

* Introduction: in the last sentence, the authors state that colocalization analyses are performed to determine 'whether CpG methylation…causally associate with disease risk'. Can the authors clarify whether this is indeed what the colocalization analyses indicate? If both a CpG site and a trait are influenced by the same genetic variant, does it mean that the CpG site is causally involved in the disease, or could it be that the genetic variant has a pleiotropic effect on the CpG and disease, without the CpG being necessarily on the pathway?

* Figure 1 is very clear, but it would be helpful to list which diseases were included in the prevalence vs incidence analyses (or both).

* Methods

o The prevalence analyses are based on self-report disease status whereas the incidence analyses are based on linkage with health records. Can the authors comment on how comparable/concordant these measures are, and to what extent this may also contribute to differences in findings between incidence vs prevalence analyses (with only Type 2 diabetes showing some overlap between these EWASs)

o In the methods, the authors describe the family structure of the Generation Scotland cohort and how methylation data was available in three different sample sets with different familial/relatedness characteristics. It was unclear to me from the methods though how this complex kinship structure was taken into account in the analyses. Later in the results section the authors state that sensitivity analyses were performed using mixed-effects models to account for family relatedness but I would suggest making this clearer earlier in the manuscript. With regards to the three analytical sets, I was also unsure whether these were analysed separately and results pooled via meta-analysis (given differences between the sets in terms of selection criteria and also quality control procedures for example between set 1 vs set 2 and 3), or whether this was treated as a single analytical sample?

o Section 2.3. The format for the numbering (x) of the diseases here resembles that used for the references which is a bit confusing.

o Section 2.6. The number of eligible articles after inclusion criteria was 56, out of 2000 articles identified in the search. This is a very large reduction - can the authors mention some of the main reasons for articles dropping out? Perhaps I missed it in the supplementary but it would be helpful to add a table listing the articles included and key characteristics.

* Results

o Section 3.1. I appreciate that the authors provide information on the prevalence and incidence of the diseases in the supplementary materials due to space restrictions, but could the authors (i) indicate in the text the range of prevalence/incidence of the diseases in this sample, and (ii) add percentages in the supplementary table in addition to number of cases/controls?

o Section 3.5. I understand it is not straightforward to establish whether findings between EWAS studies replicate, particularly when full summary statistics may not be provided. My two main concerns with the strategy taken though are that (i) focusing on genes themselves excludes the possibility of testing whether e.g. intergenic CpG sites, which may still be functionally relevant, replicate, as they are not annotated to genes; and (ii) that using a genome-wide significant p-value threshold may be only partially informative, as it depends on sample characteristics such as power. I wonder whether the authors could also utilize other estimates to assess replication based on summary statistics, such as concordance in the direction of associations and correlations between effect sizes.

* Discussion: The discussion is clear and concise, but could be expanded to cover a few more important themes emerging from the paper, including:

o The findings indicate that most diseases seem to show rather weak associations with DNAm, even when using such a large methylation dataset

o Type 2 diabetes emerges as one of the diseases with the strongest/most consistent associations (with some convergence across core/extended models and across prevalence/incidence analyses)

o Several sites are only significant in fully adjusted models - how do the authors interpret this?

o Commenting on the findings, meaning and implications of the colocalization analyses (i.e. these inform about shared/distinct genetic effects on DNAm and disease, but do they also inform about the [lack of] causal effects of DNAm on the disease?).

Reviewer #3: This study seeks to identify epigenetic signals of disease prevalence or incidence in the Generation Scotland study. The paper is concisely written and reports some novel EWAS which is a significant contribution to the existing literature. A particularly interesting feature of the paper is the effort to examine concordance of previous EWAS results but this needs to be strengthened to help the reader gain an informed insight into what these results show.

The concordance between the EWAS in GS and previous literature is not currently very detailed. For example, it is not clear what the authors accept as replication between studies - for disease states where there is a number of EWAS (eg T2D) are they accepting CpG sites that are replicated across all studies or do they accept any CpG that is reported in 2 or more studies? Do the authors consider direction of effect or potential heterogeneity across genetic ancestries (although most studies are predominantly European ancestry)? Do they consider the statistical models used in different EWAS? What is the replication in studies using GS data eg Bermingham et al PMID 30935889?

The information being reported here is of consequence as it tells us if EWAS is an avenue of research worth following at all - if concordance between studies is very low and it is not explainable it suggests there is a problem in running these kinds of studies. It would also help if this was put into context - how much replication do we see between well powered GWAS studies? The other comment here is that the authors do not seem to present an appraisal of overlap amongst previous studies on the same disease phenotypes even though they say this will be presented in the introduction.

I have some minor areas needing clarification or improvement:

In 2.4 the adjustment for age, sex and batch could be more clearly described. How are age/sex/batch adjusted M values generated? What are the batches? The batch correction is likely to be incomplete if only one batch factor was included and this needs to be acknowledged. Other commonly used approaches such as SVA may not be feasible in this analysis model but the authors need to justify their approach as it may impact how they interpret replication with other studies.

In the fully adjusted model, can the authors justify their model choice? For example, is adjustment for BMI appropriate in T2D model?

For the mQTL analyses, did the authors check if the mQTL associations were similar between GS and GoDMC for the CpGs that were present on the 450k array? This would give an estimate of how well powered GS is to detect mQTLs within GS and whether there is substantial heterogeneity between GS and GoDMC mQTLs.

Figure 5 is very difficult to interpret - For example in group 4 (EPIC array, n=2) is this prevalent and/or incident disease? For replication with existing studies is "genes replicate" any CpGs at an annotated locus or something else? For "Replication of our study with existing studies" which of the numerator /denominator are from the current study? Why is only CKD and T2D in this box?

The study lists a number of limitations but these could be discussed in more detail if word count constraints allowed. For example, discuss case ascertainment issues of using parental history of AD as a proxy variable for AD. The replication between studies could also be discussed in more detail if space allowed.

Reviewer #4: The manuscript entitled "Blood-based epigenome-wide analyses on the prevalence and incidence of nineteen common disease states" by Hillary et al. conducts a large and well powered epigenome-wide association analysis of several disease traits using both a cross-sectional and longitudinal approach. They identify 69 significant associations in four out of the nineteen disease states. They additionally identify 64 CpGs that colocalise with two diseases, COPD and type 2 diabetes, of which 56 they consider novel and independent of the five lifestyle factors they include. They also note the poor replication in the majority of previous EWAS and can only replicate these findings in 4/19 diseases investigated. Overall, I enjoyed reviewing the article and found it of high interest; however, I have some constructive comments that I hope the authors will find improve the manuscript.

1) The first of these is the lack of replication of their findings in an independent cohort. Given that this is one of the many criticisms of EWAS, I am curious as to why the authors chose not to find a suitable replication cohort. I understand that they tried to use the existing literature and EWAS catalog, but not all of the nineteen disease states they investigated are well studied in the literature. So are their findings here somehow biased by the amount of DNA methylation work that has been done in cancer versus other more heterogeneous diseases with less research done?

2) In their initial regression they identify 1,340 associations versus 78 in the fully adjusted model. Did they look to see which of these added covariates in their models may be accounting for this drop in signal? Does this represent some sort of environmental influence on the CpGs associated in the basic model?

3) I was a little confused by the colocalisation mQTL analyses. It was unclear if they were differentiating between these, and did they look at GWAS-based SNPs or only use GoDMC and their own GWAS data for these? This question comes from their results, which show most of those CpGs that colocalise with disease coming from Generation Scotland versus GoDMC. Is this because of the difference in array used?

4) Also, were the 20 PCs used in the model from GWAS data or from the EPIC data? I'm assuming GWAS but this wasn't clearly laid out in the manuscript.

5) It would be extremely useful to know where in relation to gene these CpGs are located. Are they in the TSS or are they in islands - this may give some insight into the biology.

6) I really liked the paragraph in the discussion in regards to standardizing practice for EWAS, especially in meta-analysis. I guess my suggestion here would be for some more specific recommendations, so normalisation methods, handling of batch effects, population structure, etc.

Minor comments

In the abstract the first paragraph of the results section says 14 common disease states but everywhere else it is 19. I figure this is a typo.

In the methods you mention five estimated WBC cell types but in all the models this is six. I know there is some question around collinearity and these cell types but I am not sure if this was just an error.

In the abstract you mention CpGs - given the more general medical audience for this journal, I feel it would be better to keep the language more general.

Will these summary data be placed into the EWAS catalogue?

Reviewer #5: In this work, the authors describe an epigenome-wide association study (EWAS) on a cohort of individuals from Generation Scotland. In consideration were 14 baseline self-reported common disease states and also the incidence of 19 disease states inferred by utilising health record data. The study design is split into "Prevalence analysis", a logistic regression approach for the 14 self-reported baseline disease states and "Incidence analysis", a censored Cox survival analysis for the 19 health record inferred disease states.

The study considered 18,413 individuals across a set of 752,722 filtered CpG sites on the Illumina MethylationEPIC array. In total, 69 associations were identified across 4 disease states, of which 58 are novel. Also, a total of 64 CpG sites associate with both COPD and diabetes. The authors also undertook a literature analysis and compared the results of this very large Generation Scotland study with the literature. They find poor replication.

Comments

=======

This very large study is a welcome addition to the EWAS literature. It examines not just self-reported disease state at baseline, but increases the value of the epigenetic data in the study by making use of health records to infer the diagnosis of disease states over time. The authors also take the time to compare the results in the context of the current literature and comment on the degree of replication. I agree with the authors that this review is a critical and timely analysis and commentary.

The inclusion of the basic and fully-adjusted model in the paper is useful, and the large reduction in the number of significant CpG sites after covariate adjustment is illustrative of the importance of such adjustment. Figures S1 and S2 are also very illustrative for demonstrating the difference between correlation and causation.

To address:

* The fully-adjusted model needs more discussion around the covariates.

1) For body mass index, there is typically a long tail and these severely obese individuals can place substantial leverage on the regression. It is unclear from the manuscript or Supplementary methods whether BMI or BMI z-scores were used in the model.

2) For alcohol consumption or deprivation index, were these also regularised or normalised in some way?

* For the methylation-predicted WBC proportions in both the basic and fully-adjusted model, this is compositional data where the increase in one blood cell type reduces the others and all add to 1. How was this data specified to the model? Including all the cell types may introduce some multicollinearity. Did the authors use Pearson correlation or Variance inflation factor scores to determine the degree of multicollinearity? How was multicollinearity handled?

* For each of the significant CpG sites, was a diagnostic such as Cook's distance used to look for highly influential data points? Often this diagnostic is useful to find allele-specific methylation.

* In section 3.4, there is some assumed knowledge. Please explain the phrase "proportional hazard assumption".
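To illustrate the Cook's distance diagnostic raised above, the following minimal sketch computes it for a one-predictor least-squares fit using only the standard library. It is a simplified illustration under stated assumptions: the data are hypothetical, the function name is invented here, and the manuscript's EWAS models (with many covariates) would require a matrix-based calculation instead.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    leverage = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    return [
        (e ** 2 / (p * s2)) * (h / (1.0 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]

# Hypothetical data: ten points near y = 2x + 1 plus one discordant point at x = 20.
x = [float(i) for i in range(10)] + [20.0]
y = [2.0 * xi + 1.0 + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x[:10])]
y.append(10.0)

d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=d.__getitem__)
```

Points with a large distance relative to the rest (a common rule of thumb flags values above 1) combine a large residual with high leverage and merit closer inspection.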

Reviewer #6: This is a well-conducted study on blood-based epigenome-wide analyses on the prevalence and incidence of 19 common disease states. The study design, datasets, statistical methods and analyses, and presentation (tables and figures) and interpretation of the results are mostly adequate and of a good standard. However, there are still a few issues needing attention.

1) In section 2.5, it says "Cox proportional-hazards models were used to adjust incident phenotypes for age at baseline and sex (17/19 phenotypes)". However, as the outcome is incident disease rather than all-cause mortality, the death becomes a competing risk in the analysis. Have the authors considered a competing risk analysis instead?

2) The poor replication across existing studies is a concern but the authors have discussed this comprehensively in the discussion.

3) Are there any multi-morbidity issues in the study, e.g. an individual developed more than one disease in the follow-up? If so, what is the impact of this interaction between diseases on the findings, and how has this been addressed in the analyses?

Any attachments provided with reviews can be seen via the following link:

[LINK]

Attachment

Submitted filename: hillary Plos medicine 23.docx

Decision Letter 2

Philippa Dodd

12 May 2023

Dear Dr. Marioni,

Thank you very much for re-submitting your manuscript "Blood-based epigenome-wide analyses of nineteen common disease states: A longitudinal, population-based linked cohort study of 18,413 Scottish individuals" (PMEDICINE-D-22-04026R2) for review by PLOS Medicine.

I have discussed the paper with my colleagues and it was also seen again by 5 reviewers. I am pleased to say that provided the remaining editorial and production issues are dealt with we are planning to accept the paper for publication in the journal.

The remaining issues that need to be addressed are listed at the end of this email. Any accompanying reviewer attachments can be seen via the link below. Please take these into account before resubmitting your manuscript:

[LINK]

***Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.***

In revising the manuscript for further consideration here, please ensure you address the specific points made by each reviewer and the editors. In your rebuttal letter you should indicate your response to the reviewers' and editors' comments and the changes you have made in the manuscript. Please submit a clean version of the paper as the main article file. A version with changes marked must also be uploaded as a marked up manuscript file.

Please also check the guidelines for revised papers at http://journals.plos.org/plosmedicine/s/revising-your-manuscript for any that apply to your paper. If you haven't already, we ask that you provide a short, non-technical Author Summary of your research to make findings accessible to a wide audience that includes both scientists and non-scientists. The Author Summary should immediately follow the Abstract in your revised manuscript. This text is subject to editorial change and should be distinct from the scientific abstract.

We expect to receive your revised manuscript within 1 week. Please email us (plosmedicine@plos.org) if you have any questions or concerns.

We ask every co-author listed on the manuscript to fill in a contributing author statement. If any of the co-authors have not filled in the statement, we will remind them to do so when the paper is revised. If all statements are not completed in a timely fashion this could hold up the re-review process. Should there be a problem getting one of your co-authors to fill in a statement we will be in contact. YOU MUST NOT ADD OR REMOVE AUTHORS UNLESS YOU HAVE ALERTED THE EDITOR HANDLING THE MANUSCRIPT TO THE CHANGE AND THEY SPECIFICALLY HAVE AGREED TO IT.

Please ensure that the paper adheres to the PLOS Data Availability Policy (see http://journals.plos.org/plosmedicine/s/data-availability), which requires that all data underlying the study's findings be provided in a repository or as Supporting Information. For data residing with a third party, authors are required to provide instructions with contact information for obtaining the data. PLOS journals do not allow statements supported by "data not shown" or "unpublished results." For such statements, authors must provide supporting data or cite public sources that include it.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

Please note, when your manuscript is accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you've already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosmedicine@plos.org.

If you have any questions in the meantime, please contact me or the journal staff on plosmedicine@plos.org.  

We look forward to receiving the revised manuscript by May 19 2023 11:59PM.   

Sincerely,

Philippa Dodd, MBBS MRCP PhD

PLOS Medicine

plosmedicine.org

------------------------------------------------------------

Requests from Editors:

GENERAL

Thank you for your very detailed and considered responses to previous editor and reviewer comments, which the editorial team very much appreciate. Please see below for further comments that we require you address prior to publication.

*** From the Editor-in-chief – Please discuss methodological choices and approaches in respect of overfitting and the potential to find associations due to large sample size. ***

AUTHOR SUMMARY

Thank you for including an author summary, which reads very nicely but is rather long. Some points currently detailed could be combined and made more concise to improve brevity while minimizing loss of information. Please revise with the guidance below in mind.

The authors summary should consist of 2-3 succinct bullet points under each of the following headings:

• Why Was This Study Done? Authors should reflect on what was known about the topic before the research was published and why the research was needed.

• What Did the Researchers Do and Find? Authors should briefly describe the study design that was used and the study’s major findings. Do include the headline numbers from the study, such as the sample size and key findings.

• What Do These Findings Mean? Authors should reflect on the new knowledge generated by the research and the implications for practice, research, policy, or public health. Authors should also consider how the interpretation of the study’s findings may be affected by the study limitations. In the final bullet point of ‘What Do These Findings Mean?’, please describe the main limitations of the study in non-technical language.

METHODS

Line 244-258 – during our technical checks this portion of text was highlighted as overlapping with a source identified as an author PhD thesis. We appreciate that there are only so many ways that methods can be detailed and so do not find this overly concerning but would appreciate it if the authors could consider re-wording this text.

RESULTS

Are there any additional data on the underlying cause of CKD in this population? If so, it might be interesting to see how the association may appear if stratified accordingly. We suspect those data may be unavailable but if they are it could be a worthwhile exploration.

DISCUSSION

Given the multifactorial causes of CKD might the reported associations warrant further discussion perhaps in relation to the common endpoint (fibrosis)? The same would apply to liver cirrhosis. It may also be worth noting that your definition of CKD may fail to capture some individuals who would otherwise fulfil the criteria (those with proteinuria and normal eGFR, tubular disorders etc)

SUPPLEMENTARY FIGURES

We usually advise against the use of asterisks to depict p values to improve clarity, but I think in this case the converse would apply.

STROBE Checklist – I could see a title referencing its presence within the manuscript but in my version, I couldn’t find the checklist attached, please include.

SOCIAL MEDIA

If not already done so, to help us extend the reach of your research, please detail any Twitter handles you wish to be included when we tweet this paper (including your own, your coauthors’, your institution, funder, or lab) in the manuscript submission form when you re-submit the manuscript.

Comments from Reviewers:

Reviewer #1: The authors have now significantly revised this manuscript in line with both my comments and those of the other reviewers. The revised manuscript is still an excellent study with interesting findings relevant both to the disease phenotypes studied and future application of EWAS in common disease. The discussion now clearly highlights the potential limitations of the study population and analytical approach.

I have no further substantive comments to make.

Reviewer #3: I am happy that the authors have addressed the comments raised in my earlier review. Furthermore, I believe they have addressed comments raised by other reviewers.

Reviewer #4: I thank the authors for their consideration of my comments.

Reviewer #5: The reviewers have sufficiently addressed my comments.

Reviewer #6: Many thanks to the authors for their great effort to improve the manuscript. All my comments/concerns were comprehensively addressed. I am satisfied with the response and revision. No further issues needing attention.

Any attachments provided with reviews can be seen via the following link:

[LINK]

Decision Letter 3

Philippa Dodd

25 May 2023

Dear Dr Marioni, 

On behalf of my colleagues and the Academic Editor, Professor John W. Holloway, I am pleased to inform you that we have agreed to publish your manuscript "Blood-based epigenome-wide analyses of nineteen common disease states: A longitudinal, population-based linked cohort study of 18,413 Scottish individuals" (PMEDICINE-D-22-04026R3) in PLOS Medicine.

Prior to publication please ensure that you update your data availability statement to indicate that your data are now available in the EWAS Catalogue (currently 'will be made available on publication').

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Once you have received these formatting requests, please note that your manuscript will not be scheduled for publication until you have made the required changes.

In the meantime, please log into Editorial Manager at http://www.editorialmanager.com/pmedicine/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process. 

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with medicinepress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for submitting to PLOS Medicine; it has been a pleasure handling your manuscript. We look forward to publishing your paper.

Best wishes,

Pippa 

Philippa Dodd, MBBS MRCP PhD 

PLOS Medicine

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 STROBE Checklist. STROBE statement—Checklist of items that should be included in reports of observational studies.

    (DOCX)

    S1 Appendix. Disease code lists.

    (XLSX)

    S2 Appendix. Cook’s distance plots for 133 associations in basic and fully adjusted models.

    Outliers are highlighted in green (COPD) and blue (type 2 diabetes).

    (PDF)

    S1 Text. Supplementary methods for methylation quality control.

    (DOCX)

    S2 Text. Supplementary methods for preparation of phenotypes.

    (DOCX)

    S3 Text. Supplementary methods for sensitivity EWAS.

    (DOCX)

    S4 Text. Supplementary methods for methylation QTL analyses.

    (DOCX)

    S5 Text. Supplementary note on covariate-specific attenuation of effect sizes in basic model.

    (DOCX)

    S1 Fig. Associations between covariates and prevalent disease states in univariable and multivariable logistic regression models.

    (DOCX)

    S2 Fig. Associations between covariates and incident disease states in univariable and multivariable Cox proportional hazards models.

    (DOCX)

    S3 Fig. Correlation between effect sizes from linear regression EWAS and sensitivity linear mixed effects analyses that further accounted for relatedness.

    (DOCX)

    S1 Table. Summary data for demographic variables and covariates.

    (XLSX)

    S2 Table. Counts for prevalent disease states.

    (XLSX)

    S3 Table. Counts for incident disease states.

    (XLSX)

    S4 Table. Associations between covariates and prevalent disease states.

    (XLSX)

    S5 Table. Associations between covariates and incident disease states.

    (XLSX)

    S6 Table. Significant associations from basic model—Epigenome-wide association studies on prevalent disease states.

    (XLSX)

    S7 Table. Genomic inflation factors for epigenome-wide association studies on prevalent disease states.

    (XLSX)

    S8 Table. Significant associations from fully adjusted model—Epigenome-wide association studies on prevalent disease states.

    (XLSX)

    S9 Table. Genetic colocalisation analyses for prevalent disease associations.

    (XLSX)

    S10 Table. Significant associations from basic model—Epigenome-wide association studies on incident disease states.

    (XLSX)

    S11 Table. Genomic inflation factors for epigenome-wide association studies on incident disease states.

    (XLSX)

    S12 Table. Significant associations from fully adjusted model—Epigenome-wide association studies on incident disease states.

    (XLSX)

    S13 Table. Genetic colocalisation analyses for incident disease associations.

    (XLSX)

    S14 Table. Sensitivity analysis to test for the effects of each of the 5 lifestyle risk factors included in this study on attenuating associations from the basic model.

    (XLSX)

    S15 Table. Pathway enrichment analyses.

    (XLSX)

    S16 Table. Sensitivity analysis to test for effect of relatedness on associations with prevalent disease states.

    (XLSX)

    S17 Table. Sensitivity analysis to test for effect of relatedness on associations with incident disease states.

    (XLSX)

    S18 Table. Sensitivity analysis to test for proportional hazard assumption.

    (XLSX)

    S19 Table. Sensitivity analysis to estimate hazard ratios during each year of follow-up for associations that violated proportional hazard assumption in S18 Table.

    (XLSX)

    S20 Table. Sensitivity analysis to assess the impact of all-cause mortality as a competing risk in incidence models.

    (XLSX)

    S21 Table. Sensitivity analysis to identify influential observations based on Cook’s distance statistics.

    (XLSX)

    S22 Table. Odds ratios and hazard ratios for prevalent and incident disease associations, including Harrell’s C-statistic for the latter.

    (XLSX)

    S23 Table. Characteristics of 54 studies identified in structured literature review.

    (XLSX)

    S24 Table. Look-up analyses to assess whether associations identified in the present study are newly described.

    (XLSX)

    S25 Table. Replication within existing epigenome-wide association studies that examined the same condition in the literature.

    (XLSX)

    Attachment

    Submitted filename: hillary Plos medicine 23.docx

    Data Availability Statement

    EWAS summary statistics are available on the EWAS Catalog (http://ewascatalog.org/?query=10.1101/2023.01.10.23284387). According to the terms of consent for Generation Scotland participants, access to data must be reviewed by the Generation Scotland Access Committee. Applications should be made to access@generationscotland.org.


    Articles from PLOS Medicine are provided here courtesy of PLOS
