Abstract
Methylome-wide association studies (MWASs) have identified many 5′-cytosine-phosphate-guanine-3′ (CpG) sites associated with complex traits. Several methods have been developed to predict CpG methylation levels from genotypes when direct measurements of methylation are unavailable. To date, the published methods have mostly trained their methylation prediction models on datasets from populations of European ancestry, which limits the generalizability of MWASs to non-European populations. To address this gap, we proposed a new model that incorporates local ancestry (LA) information, called LA Methylation Predictor with Preselection (LAMPP), to improve the prediction accuracy of DNA methylation in admixed populations. We showed that LAMPP outperformed the conventional model and other LA models in prediction accuracy using an admixed African American population. We further applied our model to identify significant CpG sites for seven complex traits. Together, our results show that LAMPP is a valuable tool for revealing the epigenetic underpinnings of complex traits in admixed populations.
Keywords: methylome-wide association studies, prediction models of DNA methylation, local ancestry, admixed populations
Introduction
To better understand the biological mechanisms behind identified genome-wide association study (GWAS) signals, it is important to link GWAS findings with gene expression or DNA methylation data, which helps elucidate the function of significant loci for complex traits. However, gene expression and DNA methylation data are not always available, and large-scale transcriptome and methylome profiling are costly. Computational methods can be used to impute gene expression or DNA methylation from individual genotypes. To date, many models have been developed to impute gene expression [1–6], enabling transcriptome-wide association studies (TWAS) to successfully identify genes associated with complex traits [7–10]. However, the framework to impute epigenome-wide DNA methylation has not yet been widely applied, which limits our understanding of epigenetic contributions to the mechanisms underlying GWAS loci.
DNA methylation is a key epigenetic modification linked to disease pathology [11–13]. Similar to TWAS, methylome-wide association studies (MWASs) have successfully identified CpG (5′-cytosine-phosphate-guanine-3′) sites associated with complex traits [5, 6]. In addition, genetic variants that influence DNA methylation levels, known as methylation quantitative trait loci (meQTLs), have been widely characterized across cell types and populations, providing insight into the genetic architecture of methylation [14–16]. It also indicates the potential of utilizing genetic variants to predict methylations and to further reveal the role of genetically predicted methylations in complex traits. Such prediction models of CpG methylation can provide extensive and valuable resources when direct measurement of methylation is not available. However, the existing prediction models have limited utility for non-European populations (e.g. Africans, Asians, Latinos) since these models are mostly built using data generated from European populations [5, 6]. To address this gap, multi- and cross-ancestry methods have been developed [17–21] to improve prediction accuracy in understudied populations by leveraging information from different ancestry groups with larger GWAS datasets. For example, Zhao et al. utilized transfer learning to leverage large-scale European GWASs to boost the prediction accuracy in populations of South Asian or African ancestry [19]. However, these methods may not perform well in admixed cohorts since they still rely on ancestry discretization. Admixture is the result of inheriting genomic segments from at least two parental populations [22], leading to within-population heterogeneity in terms of both global ancestry (GA) (i.e. average ancestry over the entire genome) and local ancestry (LA) (i.e. ancestry for a small segment of the genome) [23]. 
Thus, the genetic diversity inherent to admixed populations makes it inappropriate to treat them as one homogeneous group [17–20]. To fully account for ancestral heterogeneity, Sun et al. proposed Genetic Ancestry Utilization in polygenic risk scores for aDmixed Individuals (GAUDI) [24], which uses LA information to improve polygenic risk prediction in admixed populations, and demonstrated its advantages over other methods. Taken together, these studies suggest the importance and potential of incorporating LA when building a framework to predict DNA methylation levels for admixed populations.
Another challenge with existing prediction methods is that they typically use methylation data measured from array-based platforms, including the Illumina Human Methylation 450K (450K) and Infinium MethylationEPIC (EPIC) BeadChips, which cover only ~1.6%–3% of the DNA methylome. This leaves a large proportion of CpGs unmeasured and unimputed [25]. Next-generation sequencing approaches, such as Methylation Capture Sequencing (MC-seq) [26], can address this gap by profiling significantly more CpGs. Models developed from MC-seq data can provide higher coverage of the methylome, and more biological insights for the CpG sites that are not included in the current array-based methods.
In this manuscript, we propose a novel model that incorporates LA information to improve the prediction accuracy of DNA methylation levels for admixed populations, referred to as the “LA Methylation Predictor with Preselection” (LAMPP). We demonstrate that our method achieves higher prediction accuracy than the conventional model (without LA) and other LA models. Our model was built using MC-seq data and validated using an independent dataset. The application of our model to an admixed cohort (the Population Architecture using Genomics and Epidemiology (PAGE) study) identified genetically-regulated CpGs that are associated with several complex traits. Together, our results show that the LAMPP model improves the prediction accuracy for DNA methylation and is a useful tool for MWAS in admixed populations.
Results
Overview of the LAMPP model
To estimate genetically driven DNA methylation, we proposed a new model by extending the conventional and basic LA models (Fig. 1). Compared to the basic LA model, which partitions all genotypes by African (AFR) and European (EUR) ancestry, our LAMPP model features an additional preselection step: we first determined whether each single nucleotide polymorphism (SNP) had an LA-specific effect (i.e. its effect on methylation differed between AFR and EUR ancestry, $\beta^{AFR} \neq \beta^{EUR}$). For SNPs without this effect, we kept the original genotype, while for SNPs with an LA-specific effect, we dissected the genotype by ancestry and modeled the two components separately (Fig. 1a). This preselection step was conducted marginally for each SNP by comparing two nested models (null model and ancestry model) (Fig. 1b). For example, assuming there are 1000 SNPs in the conventional model and 100 of them carry LA-specific effects, the total number of features in the LAMPP model would be 1100 instead of doubling to 2000 as in the basic LA model. In this way, our model could achieve a balance between model simplicity and prediction accuracy.
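The nested-model comparison behind the preselection step can be illustrated with a minimal sketch. The test statistic is not specified in this section, so the sketch assumes a Gaussian likelihood-ratio test with one degree of freedom on a continuous methylation value; the function names (`ols_rss`, `la_specific_pvalue`) and the toy data are hypothetical.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus `columns`."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    M = [[sum(X[i][r] * X[i][s] for i in range(n)) for s in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(p):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [a - f * b for a, b in zip(M[r], M[k])]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def la_specific_pvalue(x_afr, x_eur, y):
    """P-value for H0: one shared effect, versus separate AFR and EUR effects."""
    x_total = [a + e for a, e in zip(x_afr, x_eur)]
    rss_null = ols_rss([x_total], y)        # null model: original genotype
    rss_anc = ols_rss([x_afr, x_eur], y)    # ancestry model: dissected genotype
    lrt = len(y) * math.log(rss_null / rss_anc)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))

# Toy data: ancestry-specific allele counts for 12 individuals (hypothetical).
x_afr = [0, 1, 2, 0, 1, 0, 2, 1, 0, 1, 2, 0]
x_eur = [1, 0, 0, 2, 1, 0, 0, 1, 2, 0, 0, 1]
noise = [0.01, -0.02, 0.015, -0.005, 0.0, 0.02,
         -0.01, 0.005, -0.015, 0.01, -0.02, 0.01]
y_la = [0.3 + 0.10 * a - 0.05 * e + z for a, e, z in zip(x_afr, x_eur, noise)]
y_shared = [0.3 + 0.05 * (a + e) + z for a, e, z in zip(x_afr, x_eur, noise)]

p_la = la_specific_pvalue(x_afr, x_eur, y_la)          # effects differ by ancestry
p_shared = la_specific_pvalue(x_afr, x_eur, y_shared)  # one shared effect
```

In this toy setting, only the first SNP would pass a preselection threshold such as 0.005 and be dissected by ancestry; the second would keep its original genotype coding.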
Figure 1.
Overview of the local ancestry methylation predictor with preselection (LAMPP) step model. (a) The workflow of the LAMPP model. Based on the conventional model, LA is firstly incorporated by dissecting the original genotype by two ancestries. Then based on the basic LA model, a preselection step is added to identify SNPs with LA-specific effects. The LAMPP model takes the methylation count data as input and uses regularized logistic regression to output prediction coefficients. (b) In the preselection step, the two nested models (null model, ancestry model) are compared to determine if a SNP has LA-specific effect on the DNA methylation.
Currently, modeling the methylation beta-value (percentage of methylation) as one continuous variable is the common practice [5, 6]. However, it cannot take the difference in sequence depths into consideration [27–30]. Therefore, with count data (methylated count and unmethylated count) measured in our MC-seq data, we directly modeled the count data as the response variable using regularized logistic regression (binomial link). After building the LAMPP model, we could obtain the prediction weights/effect sizes. In the application step, the effect sizes could be applied to the genotype data from a new dataset to get the predicted methylation level in percentage (equivalent to beta-value). To validate the performance of the model, the predicted methylation data could be contrasted with the measured methylation data for the new dataset, regardless of the profiling approach.
We note the following key features for the LAMPP model (see Methods for more details): (1) a preselection step is used to select SNPs with LA-specific effects among all cis SNPs, and the original genotype is dissected by two ancestries for these SNPs; and (2) count data are modeled directly to account for the difference in sequence depths.
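To see why modeling counts retains information that a beta-value discards, consider the following sketch. It evaluates the binomial log-likelihood of read counts under a logistic predictor; the function names and the example counts are hypothetical, and the regularization used in the actual model is omitted.

```python
import math

def binomial_loglik(methylated, unmethylated, eta):
    """Log-likelihood of read counts given a linear predictor eta on the logit scale.

    The binomial coefficient is dropped because it does not depend on eta.
    """
    p = 1.0 / (1.0 + math.exp(-eta))
    return methylated * math.log(p) + unmethylated * math.log(1.0 - p)

def beta_value(methylated, unmethylated):
    """Proportion of methylated reads (the usual beta-value)."""
    return methylated / (methylated + unmethylated)

shallow = (6, 4)    # 10 reads, 60% methylated
deep = (60, 40)     # 100 reads, 60% methylated

eta_good = math.log(0.6 / 0.4)  # predictor implying 60% methylation
eta_bad = math.log(0.4 / 0.6)   # predictor implying 40% methylation

# Log-likelihood gap between the good and the bad predictor for each sample.
gap_shallow = binomial_loglik(*shallow, eta_good) - binomial_loglik(*shallow, eta_bad)
gap_deep = binomial_loglik(*deep, eta_good) - binomial_loglik(*deep, eta_bad)
```

Both samples have the same beta-value, but the deeply sequenced sample penalizes a wrong prediction ten times more strongly, so it carries more weight in the fit.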
LAMPP improves DNA methylation prediction accuracy
To investigate the effects of incorporating LA on DNA methylation prediction accuracy, we compared the conventional model without LA and three LA models: LAMPP, the basic LA model, and GAUDI [24]. Across all 1.8 million CpGs, the prediction R2 of each LA model was compared to the R2 of the conventional model using a paired t-test [31]. Demographic characteristics for all datasets are summarized in Fig. 2 and Supplementary Table S1.
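The paired comparison of per-CpG R2 values can be sketched as follows. This is a minimal illustration with hypothetical names and toy values; it uses a normal quantile for the confidence interval, which approximates the t quantile only when the number of CpGs is very large, as it is here.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_mean_difference(r2_model, r2_baseline, level=0.95):
    """Mean paired difference, t statistic, and an approximate confidence interval."""
    diffs = [a - b for a, b in zip(r2_model, r2_baseline)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    t_stat = d_bar / se
    z = NormalDist().inv_cdf(0.5 + level / 2)  # large-sample stand-in for the t quantile
    return d_bar, t_stat, (d_bar - z * se, d_bar + z * se)

# Toy R2 values for five CpGs (hypothetical; the study compares about 1.8 million).
r2_conventional = [0.10, 0.20, 0.05, 0.30, 0.15]
r2_lampp = [0.12, 0.23, 0.06, 0.31, 0.18]

d_bar, t_stat, ci = paired_mean_difference(r2_lampp, r2_conventional)
```

A positive mean difference whose interval excludes zero corresponds to an LA model outperforming the conventional model.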
Figure 2.

Overview of the studied cohorts. (a) The sample size and available data type for the model building set (VACS-Methyl-seq), external test set (MWCCS), Application set 1 (PAGE), and Application set 2 (VACS-genotype). (b) The global ancestry estimates for the admixed individuals in each cohort. Yoruba in Ibadan, Nigeria (YRI) and Utah Residents (CEPH) with Northern and Western European Ancestry (CEU) samples from the 1000 genomes project are used as the African and European reference panels.
We first trained prediction models in 80% of the Veterans Aging Cohort Study (VACS) samples with MC-seq data and applied the models to the remaining 20% of the samples (internal test set). As heritability represents the theoretical upper limit on how accurately methylation can be predicted from genotypes [5], we divided all CpGs into different groups based on whether the estimated heritability was statistically significant or not. The 1 816 705 available CpGs were divided into CpGs with significant heritability (88 521, 5%) and those without significant heritability (1 728 184, 95%), and the prediction accuracy was compared separately. For the CpGs with significant heritability, we further divided them into CpGs with and without LA-specific effects, according to whether the preselection step identified SNPs with LA-specific effects (P < threshold) among all cis SNPs near a certain CpG (see Methods). To determine the optimal threshold, a five-fold cross-validation was conducted within the training sample. At the optimal threshold (0.005) (Supplementary Fig. S1), 77 763 (88%) of the CpGs with significant heritability had SNPs with LA-specific effects, while 10 758 (12%) did not have detectable LA-specific effects. The overall results of comparing all models are summarized in Fig. 3a. The LAMPP model outperformed the conventional model in all CpG groups (mean difference in R2 >0), while GAUDI and the basic LA model generally did not perform as well as the conventional model. Specifically, for CpGs with significant heritability, LAMPP increased the prediction accuracy R2 by 0.02 (Supplementary Table S2). Further looking into CpGs with and without LA-specific effects, R2 was increased by 0.021 and 0.014 (Supplementary Table S2), respectively. Of note, the increase of 0.014 was attributed to modeling of count data, while the increase of 0.021 in CpGs with LA-specific effects showed an additional improvement, indicating the importance of incorporating LA.
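Choosing the preselection threshold by cross-validation amounts to picking the candidate with the highest average cross-validated R2. The sketch below uses the threshold grid given in Methods; the per-fold R2 values are made up for illustration and the function name is hypothetical.

```python
def select_threshold(cv_r2):
    """Return the threshold whose mean cross-validated R2 is highest.

    cv_r2 maps each candidate threshold to a list of per-fold R2 values,
    each already averaged across CpGs.
    """
    def mean_r2(threshold):
        folds = cv_r2[threshold]
        return sum(folds) / len(folds)
    return max(cv_r2, key=mean_r2)

# Hypothetical five-fold results. A threshold of 0 selects no SNPs
# (no P-value is below 0), so every SNP keeps its original genotype coding.
cv_r2 = {
    0:      [0.040, 0.041, 0.039, 0.040, 0.040],
    0.0001: [0.042, 0.043, 0.041, 0.042, 0.042],
    0.001:  [0.044, 0.045, 0.043, 0.044, 0.044],
    0.005:  [0.046, 0.047, 0.045, 0.046, 0.046],
    0.01:   [0.045, 0.046, 0.044, 0.045, 0.045],
    0.05:   [0.043, 0.044, 0.042, 0.043, 0.043],
    0.1:    [0.041, 0.042, 0.040, 0.041, 0.041],
    0.2:    [0.039, 0.040, 0.038, 0.039, 0.039],
}

best = select_threshold(cv_r2)
```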
Figure 3.
Comparison of the prediction accuracy (R2) between LA models and the conventional model. Forest plot for the mean difference in R2 between each LA model and the conventional model in the (a) internal test set (20% of the model building set (VACS-Methyl-seq)) and (b) external test set (MWCCS). All methylation sites were divided into CpGs with significant heritability and CpGs with non-significant heritability, while the CpGs with significant heritability were further divided into CpGs with LA-specific effects and CpGs without LA-specific effects. The comparison was conducted separately in these CpG groups. The mean differences and their 95% confidence intervals are from the paired t-test.
To further compare the prediction accuracy of different models in an independent dataset, the previous 80% and 20% samples were combined into a single full model building set, and the final prediction models were then applied to the Multicenter AIDS Cohort Study (MACS)/Women's Interagency HIV Study (WIHS) Combined Cohort Study (MWCCS) samples as an external test set. Similar patterns were observed: only LAMPP outperformed the conventional model in all CpG groups (Fig. 3b) (Supplementary Table S3).
While Fig. 3 shows the absolute improvement (mean difference in R2) when comparing the LAMPP model to the conventional model, the relative improvement (percentage of increment in R2) is shown in Fig. 4. We note that the percentage of increment was particularly high for CpGs with non-significant heritability (on average, the LAMPP model improved R2 by 25.9% in the internal test set, and 14.3% in the external test set) (Supplementary Tables S4 and S5), even though the absolute improvements were small in these CpGs. In addition, Fig. 4 displays the R2 of each CpG group for the conventional model and the LAMPP model. We observed a similar pattern for the two models: both had the highest R2 for CpGs with LA-specific effects (average R2 was 0.269 and 0.15 in the internal and external test sets for the conventional model, and 0.29 and 0.16 for the LAMPP model) (Supplementary Table S4).
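A small sketch shows how a small absolute gain can coincide with a large relative gain when baseline R2 is low. The values and the function name are hypothetical, and the sketch assumes every baseline R2 is positive.

```python
from statistics import mean

def improvement_summary(r2_conventional, r2_lampp):
    """Mean absolute gain in R2 and mean per-CpG percentage increment."""
    pairs = list(zip(r2_conventional, r2_lampp))
    abs_gain = [new - old for old, new in pairs]
    rel_gain = [100.0 * (new - old) / old for old, new in pairs]
    return mean(abs_gain), mean(rel_gain)

# Toy CpGs with low baseline accuracy (e.g. non-significant heritability).
low_abs, low_rel = improvement_summary([0.004, 0.008], [0.005, 0.010])

# Toy CpGs with high baseline accuracy (e.g. LA-specific effects).
high_abs, high_rel = improvement_summary([0.25, 0.30], [0.27, 0.32])
```

Here the low-accuracy group gains far less in absolute terms but about 25% in relative terms, against roughly 7% for the high-accuracy group.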
Figure 4.

Improvement in DNA methylation prediction accuracy. The average gain of R2 by the LAMPP model across CpGs, and the relative improvement compared to the conventional model (average percentage of increment in R2) in the (a) internal test set (20% of the model building set (VACS-Methyl-seq)), and (b) external test set (MWCCS). All methylation sites were divided into CpGs with significant heritability and CpGs with non-significant heritability, while the CpGs with significant heritability were further divided into CpGs with LA-specific effects and CpGs without LA-specific effects.
Enrichment analyses characterizing well-predicted CpGs
A subset of 435 854 CpGs were identified as well-predicted CpGs by the LAMPP model (prediction accuracy R2 > 0.01) [5]. An example is shown in Fig. 5a for cg23505766, which had a prediction accuracy of 0.54 in the internal test set and 0.52 in the external test set. These well-predicted CpGs are taken forward in the following analyses.
Figure 5.
Example and enrichment for the well-predicted CpGs. (a) An example of a well-predicted CpG (cg23505766). The observed DNA methylation levels were plotted against the imputed DNA methylation levels, separated by the internal test set (20% of the model building set (VACS-Methyl-seq)), and external test set (MWCCS). (b) Functional enrichment for well-predicted CpGs in CpG island (CGI) regions, gene body regions, and gene regulatory regions. The logarithm of odds ratio (OR) with 95% confidence interval is presented. UTR: Untranslated exon region.
Enrichment analysis was performed using genomic features to characterize the well-predicted CpGs. Compared to the background CpGs (R2 ≤ 0.01), the set of well-predicted CpGs was significantly depleted in CpG islands (odds ratio (OR) = 0.688, 95% confidence interval (CI) = [0.683, 0.693], P-value < 1E-300) and promoters (OR = 0.615, 95% CI = [0.609, 0.620], P-value <1E-300) (Fig. 5b and Supplementary Table S6). We also observed significant enrichment in strong enhancer (OR = 1.281, 95% CI = [1.264, 1.299], P-value = 2.93E-289), weak enhancer (OR = 1.198, 95% CI = [1.180, 1.215], P-value = 3.73E-131), and insulator (OR = 1.282, 95% CI = [1.254, 1.310], P-value = 6.69E-108) regions (Fig. 5b and Supplementary Table S6). These are consistent with Fryett et al. who reported enrichments in enhancers and depletions in CpG islands and promoters for their well-predicted CpGs [5]; and Huan et al. who suggested that CpGs with heritability >0.1 are depleted in promoters and enriched in enhancers [32].
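An odds ratio of this kind can be computed from a 2×2 table of well-predicted versus background CpGs, inside versus outside an annotation. The sketch below assumes a Wald interval on the log odds ratio, which is one common choice and not necessarily the test used here; the counts and the function name are hypothetical.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.959964):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a: well-predicted CpGs inside the annotation
    b: well-predicted CpGs outside the annotation
    c: background CpGs inside the annotation
    d: background CpGs outside the annotation
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical counts: the annotation is rarer among well-predicted CpGs.
odds_ratio, lower, upper = odds_ratio_wald(200, 800, 300, 700)
```

An interval lying entirely below 1 indicates depletion, and one lying entirely above 1 indicates enrichment.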
Applications to the PAGE study to identify trait-associated CpGs
We first applied our prediction models to the Application set 1 (PAGE consortium). We imputed the 435 854 well-predicted CpG sites, and then tested associations between those CpGs and multiple traits. Significant trait-associated CpGs were declared using Bonferroni correction (P < 0.05/435 854 = 1.15E-07).
We identified 1173 CpGs associated with white blood cell (WBC) count, 4 with hemoglobin, 1 with platelet count, 1 with BMI, 2 with height, 30 with stroke, and 2 with heart attack (Supplementary Tables S7–S13). We note that WBC had the largest number of associations, and the associated CpGs were concentrated in one region on chromosome 1 (107.68–169.34 Mb) (Fig. 6a). The well-predicted methylation site shown in Fig. 5a, cg23505766 (chr1_153582541_153582541, mapped to S100A16), was among the WBC-associated CpGs (P = 1.15E-14) (Supplementary Table S7). Multiple studies on WBC [33–35] also reported that the GWAS signals in African Americans had a broad peak around a similar region (90.38–177.81 Mb) [33], and suggested that one gene in this region, ACKR1 (Duffy antigen receptor for chemokines DARC as the former name), could confer selective advantage against malaria [35–37] and thus made this region important in African Americans. Among our WBC-associated CpGs, seven of them mapped to the promoter region of ACKR1 (marked in Supplementary Table S7). The Application set 2 (VACS-genotype) (n = 1867) was used to replicate our identified CpGs for WBC, and we defined replicated signals as those with P < .05/1173 = 4.26E-05 and consistent direction of effect. We showed that among the 1173 WBC-associated CpGs, 62.6% were significant in the Application set 2, 71.6% had a consistent direction of effect and 59.0% were replicated (Supplementary Table S14). The correlation of the effect sizes among the replicated CpGs was 0.69 (Fig. 6b). Of note, 94.3% of the significant signals had a consistent direction of effect (59.0/62.6), indicating a high level of consistency between our results in the two application sets. The results for CpGs associated with other traits are summarized in Supplementary Figs S2–S8 [38] and Supplementary Tables S8–S13. 
Among them, several stroke-associated CpGs, chr1_39573220_39573220 (MACF1), chr12_111843383_111843383 (SH2B3), and chr20_33731429_33731429 (EDEM2) (Supplementary Table S12) had corresponding genes identified by other epigenome-wide association studies of stroke [39]. It is noteworthy that SH2B3 has been reported by multiple studies as a differentially expressed gene for stroke [40, 41]. For heart attack, the top CpG chr6_91297221_91297221 was mapped to MAP3K7 (Supplementary Table S13), which encodes for transforming growth factor β–activated kinase 1 and plays a key role in the maintenance of myocardial homeostasis [42, 43].
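The discovery and replication criteria described above for WBC reduce to a Bonferroni threshold plus a sign check. A minimal sketch follows; the function names are hypothetical, while the numbers of tests are taken from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def is_replicated(beta_discovery, beta_replication, p_replication, threshold):
    """Replicated means significant in the replication set with the same effect direction."""
    same_direction = (beta_discovery > 0) == (beta_replication > 0)
    return p_replication < threshold and same_direction

discovery_threshold = bonferroni_threshold(0.05, 435854)   # well-predicted CpGs tested
replication_threshold = bonferroni_threshold(0.05, 1173)   # WBC-associated CpGs

print(f"{discovery_threshold:.2E}")    # 1.15E-07
print(f"{replication_threshold:.2E}")  # 4.26E-05
```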
Figure 6.
Identification of trait-associated CpGs in methylome-wide association studies (MWASs). (a) Manhattan plot of MWAS for white blood cell (WBC) counts. The two lines mark the significance level after Bonferroni correction (P < .05/435 854 = 1.15E-07), and the suggestive significance level (P < 1E-05). A total of 1173 significant CpGs were identified using the Application set 1 (PAGE). (b) Scatter plot comparing the effect sizes between Application set 1 (PAGE) and Application set 2 (VACS-genotype) for the 1173 significant CpGs. The correlations for all signals and for replicated signals (significant and same effect direction in the replication set) are shown. (c) The number and proportion of the trait-associated CpGs that have nearby (±500 kb) genome-wide association studies (GWAS) loci, or transcriptome-wide association studies (TWAS) loci. In-sample GWAS (derived from PAGE) and independent GWAS were used separately.
We further investigated the intersections between our identified CpGs and the corresponding GWAS and TWAS loci. Using the in-sample GWAS derived from the same PAGE cohort, we found that for the majority (90.8%) of identified CpGs, at least one SNP reaching genome-wide significance (GWS) (P < 5E-08) was located within ±500 kb of the CpG site (Fig. 6c and Supplementary Table S15). However, for some of the traits (i.e. BMI, height, stroke, heart attack), the limited sample size for the in-sample GWAS restricted the power to identify GWS SNPs. We then utilized independent GWAS data from the GWAS Catalog [44, 45] and identified more CpGs with nearby GWS signals: 1105 out of the 1213 trait-associated CpGs (91.1%) were located within ±500 kb of the corresponding GWS loci (Fig. 6c and Supplementary Table S15). In addition, using the TWAS hub database (http://twas-hub.org/), we found that many of the trait-associated CpGs (70.7%) also had nearby TWAS loci and they were for traits WBC, hemoglobin, platelet count, height, and stroke (Fig. 6c and Supplementary Table S16). Thus, leveraging the MWAS with GWAS and TWAS studies, we were able to gain more insight into the interplay between DNA methylation, gene expression, and complex traits.
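The ±500 kb proximity check between CpGs and genome-wide significant SNPs can be sketched as below. The data layout (one sorted list of positions per chromosome) and all names and coordinates are assumptions made for illustration.

```python
from bisect import bisect_left

def has_nearby_locus(cpg_pos, sorted_positions, window=500_000):
    """True if any position lies within +/- window bp of the CpG (bounds inclusive)."""
    i = bisect_left(sorted_positions, cpg_pos - window)
    return i < len(sorted_positions) and sorted_positions[i] <= cpg_pos + window

def fraction_with_nearby_locus(cpgs, gws_by_chrom, window=500_000):
    """Share of CpGs, given as (chrom, pos), with a significant SNP in the window."""
    hits = sum(
        has_nearby_locus(pos, gws_by_chrom.get(chrom, []), window)
        for chrom, pos in cpgs
    )
    return hits / len(cpgs)

# Hypothetical significant SNP positions, sorted within each chromosome.
gws_by_chrom = {"chr1": [1_000_000, 5_000_000]}

# Hypothetical trait-associated CpGs.
cpgs = [("chr1", 1_400_000), ("chr1", 2_000_000), ("chr1", 5_500_000), ("chr2", 100)]

fraction = fraction_with_nearby_locus(cpgs, gws_by_chrom)
```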
Discussion
In this study, we proposed a novel LAMPP model to improve the prediction accuracy of DNA methylation levels in admixed populations. We demonstrated that it achieved better prediction accuracy than the conventional model and two other LA models (basic LA model and GAUDI). Employing MC-seq data to train our model, we were able to reliably impute a total of 435 854 CpGs. The prediction models for these well-predicted CpGs were then applied to PAGE study to impute genetically regulated DNA methylations for this admixed cohort and to further identify trait-associated CpGs. We identified significant CpGs for seven complex traits in this admixed population. Together, our results show that LAMPP is a robust and accurate prediction tool to impute DNA methylation levels and perform downstream analyses for the admixed populations.
LA information has been incorporated into prediction models to account for within-population heterogeneity in admixed populations [24]. However, our results showed that not all LA models outperformed the conventional model without LA information. For example, the basic LA model did not perform as well as the conventional model in either the internal or the external test set. The potential reason is that the basic LA model dissects all genotypes by AFR and EUR ancestry, which increases the model complexity by doubling the number of features and can be redundant for SNPs that do not carry LA-specific effects. The GAUDI model [24] also did not perform better than the conventional model. For CpGs with non-significant heritability, the GAUDI model showed the largest improvement in the internal test set, but no substantial improvement in the external test set, potentially due to overfitting. GAUDI relied on linkage disequilibrium (LD) pruning to reduce the number of variables, which is efficient in constructing polygenic risk scores from all genome-wide SNPs [24], but can be problematic in the MWAS framework where only local SNPs (within ±500 kb of the CpG site) were included. To achieve a balance between incorporating useful features (LA) and reducing model complexity, the LAMPP model adds a preselection step to determine whether a SNP has an LA-specific effect on DNA methylation and only dissects the genotype by ancestry for those SNPs with an LA-specific effect. We demonstrated that LAMPP outperformed other models in both the internal and external test sets. Furthermore, we found that the 435 854 CpGs well predicted by LAMPP were depleted in promoter regions but enriched in enhancers and insulators, which is consistent with the previous findings of Fryett et al. [5] and Huan et al. [32]. Taken together, our results along with others indicated that the genetically regulated CpGs may play a crucial role in maintaining epigenome stability [23]. 
It also implied that genetic variants may exert their regulatory effects through modulating the epigenetic state of distal regulatory elements (i.e. enhancers and insulators) [46].
Applying our models to the Application set 1 (PAGE), we identified 1213 CpG-trait associations. We also revealed an important region for WBC in African Americans, where the associations showed a high replication rate in the Application set 2. This finding was consistent with multiple studies [33–35] reporting that this region contained a broad peak of GWAS signals for WBC and could confer selective advantage against malaria. It is also noteworthy that the majority of trait-associated CpGs were located near (within ±500 kb) GWS SNPs for the corresponding trait. This suggests shared genetic contributions between variation in DNA methylation and the specific trait. Taken together, these results highlight the potential of our prediction model to aid in the interpretation of GWAS data: it helps to decipher, among the large number of GWAS signals, the ones that lie near the trait-associated CpGs and may impact the disease risk through regulation of DNA methylation [6].
We acknowledge several limitations of this study. Our model building set was based on moderately sized samples. Using this dataset, only 24% of the 1.8 million investigated CpGs were well-predicted. A larger admixed population will be important to validate our results and improve the number of well-predicted CpGs. Another limitation is that LAMPP was built on bulk methylation in blood tissue. This could partially account for the small number of associated CpGs for some traits (e.g. heart attack), as the genetically regulated DNA methylation in other causal tissues, rather than the blood tissue, might impact the disease risk. In future studies, training tissue-specific or cell type-specific prediction models of DNA methylation will shed more light on the role of DNA methylation in complex traits at tissue or cell type levels. Finally, due to the inclusion of ancestry-specific genetic effects, the applicability of our models is restricted to individual-level data for admixed populations. The framework in S-PrediXcan [2] or UTMOST [4] cannot be directly used to apply our prediction models to a diverse range of GWAS summary data.
Despite these limitations, we demonstrate the effectiveness and precision of LAMPP to improve the prediction accuracy in admixed populations and its pivotal role in advancing our understanding of DNA methylation on complex traits and enhancing the interpretability of genetic studies by integrating methylation data.
Methods
Study cohorts
Three independent cohorts were considered in our study, serving as model building, validation, and application sets. We included participants with admixed ancestry background from both AFR and EUR. Genotype data were available for all samples.
Veterans Aging Cohort Study Biomarker Cohort
The Veterans Aging Cohort Study Biomarker Cohort (VACS-BC, referred to as VACS) is a multi-center, prospective, observational cohort study from the Veteran Healthcare System in the United States [47, 48]. VACS-BC included a total of 2244 samples with genotyping data. Among them, a subset of samples (n = 377) had DNA methylation data profiled using MC-seq. This subset was used for model development (model building set). The prediction model built from the model building set was applied to the remaining samples without MC-seq data (n = 1867) to impute methylation levels and perform downstream analyses (one of the application sets).
MACS/WIHS combined cohort study
MWCCS is also a large longitudinal prospective cohort of persons living with HIV [47, 48]. DNA methylation from a subset of MWCCS (n = 213) was profiled with the EPIC array. These samples served as an external test set to independently validate the prediction models from the model building set.
PAGE consortium
The PAGE consortium was established to conduct genetic research in ancestrally diverse populations within the United States [49]. We selected African American individuals from the Women’s Health Initiative, Multiethnic Cohort, and the Icahn School of Medicine at Mount Sinai BioMe biobank in New York City (BioMe) in our analysis (n = 13 173). This study had genotype data available and was used as another application set.
Demographic characteristics for the model building set (VACS-Methyl-seq), external test set (MWCCS), Application set 1 (PAGE), and Application set 2 (VACS-genotype) are summarized in Fig. 2 and Supplementary Table S1.
Genotyping, imputation, and quality control
The VACS samples were genotyped using the Illumina HumanOmniExpress BeadChip and imputed with IMPUTE2 [50] using the 1000 Genomes Project Phase 3 as the reference panel [51]. The MWCCS samples were genotyped with the Infinium Omni2.5 BeadChip and imputed with Minimac4 [52] using the same reference panel [51]. The PAGE samples were genotyped using the Multi-Ethnic Genotyping Array [49] and imputed to the TopMed reference panel via the TopMed imputation server [52]. In the three cohorts, SNPs with minor allele frequency < 0.05, missing rate > 5%, imputation quality r2 < 0.8, or significant deviation from Hardy–Weinberg equilibrium (P < 1E-6) were removed. Approximately 4.6 million SNPs passed QC and were used in the following analyses.
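The SNP filters listed above translate directly into a keep rule. The sketch below uses the thresholds from the text; the record layout and field names (`maf`, `missing_rate`, `imputation_r2`, `hwe_p`) are hypothetical.

```python
def passes_qc(snp, maf_min=0.05, missing_max=0.05, r2_min=0.8, hwe_p_min=1e-6):
    """Keep a SNP only if it fails none of the four removal criteria."""
    return (
        snp["maf"] >= maf_min                   # removed if MAF < 0.05
        and snp["missing_rate"] <= missing_max  # removed if missing rate > 5%
        and snp["imputation_r2"] >= r2_min      # removed if imputation r2 < 0.8
        and snp["hwe_p"] >= hwe_p_min           # removed if HWE P < 1E-6
    )

# Hypothetical SNP records.
snps = [
    {"id": "snp_a", "maf": 0.20, "missing_rate": 0.01, "imputation_r2": 0.95, "hwe_p": 0.30},
    {"id": "snp_b", "maf": 0.02, "missing_rate": 0.01, "imputation_r2": 0.95, "hwe_p": 0.30},
    {"id": "snp_c", "maf": 0.20, "missing_rate": 0.01, "imputation_r2": 0.60, "hwe_p": 0.30},
    {"id": "snp_d", "maf": 0.20, "missing_rate": 0.01, "imputation_r2": 0.95, "hwe_p": 1e-9},
]

kept = [snp["id"] for snp in snps if passes_qc(snp)]
```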
DNA methylation
In the model building set (VACS-Methyl-seq), the Agilent SureSelectXT Methyl-seq was used for DNA methylation profiling (n = 377). Quality control (QC) on the MC-seq data was conducted following standard procedure [53, 54], and CpGs with missing rate > 5% were excluded. To ensure data quality, CpG sites with sequencing coverage >10× depth were kept. A total of 1.8 million CpGs passed QC steps and were used to train prediction models. In MWCCS, the Illumina Infinium MethylationEPIC BeadChip (EPIC) was used for DNA methylation profiling. We followed methods described in Lehne et al. [55] to perform methylation normalization and adjust for potential batch effects.
Ancestry estimation for all samples
A two-way admixture of AFR and EUR ancestry was used to model the ancestry composition for our admixed samples [23, 56–58]. From the 1000 Genomes Project, Utah residents with Northern and Western European ancestry (n = 98) and Yoruba from Ibadan, Nigeria (n = 97) served as reference groups for EUR and AFR descent [51, 59]. GA was estimated with ADMIXTURE 1.3.0 [60] with the number of ancestral groups set to 2. For LA estimation, SHAPEIT2 [61] was first used to phase genotype data for both admixed samples and reference, and RFMix 1.5.4 [62] was then used to infer LA at the haplotype level.
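The relationship between the two ancestry measures can be illustrated by averaging LA calls over the genome. This sketch only illustrates the definition of GA as average ancestry; it is not how ADMIXTURE estimates GA, and the segment representation and function name are hypothetical.

```python
def afr_proportion(segments):
    """Length-weighted share of AFR ancestry across LA segments.

    segments: iterable of (length_bp, ancestry_label) pairs pooled over
    both haplotypes of one individual, with labels "AFR" or "EUR".
    """
    segments = list(segments)
    total = sum(length for length, _ in segments)
    afr = sum(length for length, ancestry in segments if ancestry == "AFR")
    return afr / total

# Toy individual: three LA segments pooled over both haplotypes.
toy_segments = [(60, "AFR"), (40, "EUR"), (100, "AFR")]
toy_ga = afr_proportion(toy_segments)
```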
Estimation of DNA methylation heritability
We estimated SNP-based heritability for the 1.8 million CpG sites using the model building set. The DNA methylation heritability is defined as the proportion of the variation in methylation levels explained by genetic effects. Genome-wide Complex Trait Analysis (GCTA) 1.93.2 was used to estimate the heritability [63]. For each CpG, we included SNPs from 500 kb upstream to 500 kb downstream [23] and all SNPs located in this window were used for heritability estimation.
Statistical models
Conventional prediction models use an additive genetic model to characterize methylation levels [5]:

$$Y = \sum_{j=1}^{J} \beta_j X_j + \epsilon \qquad (1)$$

where $Y$ is the observed methylation, $\beta_j$ is the effect size of the $j$th SNP, $X_j$ is the genotype of the SNP (the number of alternative alleles), the sum runs over the $J$ local SNPs (within ±500 kb of the CpG), and $\epsilon$ represents other factors influencing $Y$, assumed to be independent of the genetic component. The effect sizes can be estimated using penalized approaches such as elastic-net [64].
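When applied to an admixed population, Model (1) does not consider that the genetic effect $\beta_j$ may differ by ancestry for the $j$th SNP. To address this problem, we propose to incorporate LA and dissect the original genotype $X_j$ by AFR and EUR ancestry:

$$X_j^{AFR} = g_{j1}\, I(l_{j1} = \mathrm{AFR}) + g_{j2}\, I(l_{j2} = \mathrm{AFR}), \qquad X_j^{EUR} = g_{j1}\, I(l_{j1} = \mathrm{EUR}) + g_{j2}\, I(l_{j2} = \mathrm{EUR})$$

where $g_{j1}$ and $g_{j2}$ denote the genotypes of the two alleles at the $j$th SNP, $l_{j1}$ and $l_{j2}$ denote the LA of the two alleles, respectively, and $I(\cdot)$ is the indicator function, so that $X_j = X_j^{AFR} + X_j^{EUR}$. With these definitions, Model (1) can be extended to:

$$Y = \sum_{j=1}^{J} \left( \beta_j^{AFR} X_j^{AFR} + \beta_j^{EUR} X_j^{EUR} \right) + \epsilon \qquad (2)$$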
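The dissection of a genotype into ancestry-specific allele counts can be sketched for a single SNP in a single individual as below. The function name `dissect_genotype` and the string labels "AFR"/"EUR" are illustrative assumptions; in practice the per-haplotype LA calls would come from the phasing and LA inference steps described earlier.

```python
from typing import Tuple


def dissect_genotype(alleles: Tuple[int, int],
                     ancestries: Tuple[str, str]) -> Tuple[int, int]:
    """Split one genotype into (AFR count, EUR count) of alternative alleles.

    alleles:    0/1 indicator of the alternative allele on each haplotype.
    ancestries: local ancestry ("AFR" or "EUR") of each haplotype at this SNP.
    """
    x_afr = sum(a for a, anc in zip(alleles, ancestries) if anc == "AFR")
    x_eur = sum(a for a, anc in zip(alleles, ancestries) if anc == "EUR")
    return x_afr, x_eur


# Heterozygous carrier whose alternative allele sits on the EUR haplotype.
x_afr, x_eur = dissect_genotype((1, 0), ("EUR", "AFR"))
```

By construction the two counts add up to the original genotype, so no information is lost relative to the conventional coding.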
Compared to Model (1), which can be seen as a special case of Model (2) with $\beta_j^{AFR} = \beta_j^{EUR}$ for all SNPs, Model (2) takes the ancestral heterogeneity at the SNP level into consideration. However, dissecting all SNPs by AFR and EUR ancestry doubles the number of features, and is unnecessary for SNPs that do not carry ancestry-specific effects ($\beta_j^{AFR} = \beta_j^{EUR}$). Therefore, based on Model (2), we propose the LAMPP model, which adds a preselection step: we dissect the original genotype by AFR and EUR ancestry only for the SNPs that do carry LA-specific effects ($\beta_j^{AFR} \neq \beta_j^{EUR}$). In the preselection step, for each SNP, we infer whether it has an LA-specific effect using two nested models [23]:

Null: $$Y = \beta_j X_j + \epsilon$$

Ancestral: $$Y = \beta_j^{AFR} X_j^{AFR} + \beta_j^{EUR} X_j^{EUR} + \epsilon$$

For the $j$th SNP, we test the null hypothesis $H_0: \beta_j^{AFR} = \beta_j^{EUR}$ to assess whether the ancestral model fits significantly better than the null model. If the $j$th SNP has LA-specific effects (P < threshold), we assign it to the LA set. The optimal threshold is determined via five-fold cross-validation in the training set. Specifically, we calculate the cross-validation R² for each tested threshold (0, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2) across all CpGs, and choose the threshold with the best performance as the final threshold.
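The nested-model comparison in the preselection step can be illustrated with a small self-contained sketch. The specific test statistic is not stated above, so this example assumes a Gaussian likelihood-ratio test with one degree of freedom and an intercept in both models; the function names are hypothetical, and the tiny least-squares solver is only meant for well-conditioned toy inputs with a non-zero residual.

```python
import math
from typing import List, Sequence


def _ols_rss(columns: List[Sequence[float]], y: Sequence[float]) -> float:
    """Residual sum of squares of y regressed on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(design[i][r] * design[i][c] for i in range(n)) for c in range(p)]
        + [sum(design[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(p):
            if r != k:
                factor = aug[r][k] / aug[k][k]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[k])]
    beta = [aug[r][p] / aug[r][r] for r in range(p)]
    return sum(
        (y[i] - sum(b * x for b, x in zip(beta, design[i]))) ** 2
        for i in range(n)
    )


def la_specific_pvalue(x_afr: Sequence[float], x_eur: Sequence[float],
                       y: Sequence[float]) -> float:
    """P-value for H0: beta_AFR == beta_EUR (assumed 1-df likelihood-ratio test)."""
    x_total = [a + e for a, e in zip(x_afr, x_eur)]
    rss_null = _ols_rss([x_total], y)            # one shared effect
    rss_anc = _ols_rss([x_afr, x_eur], y)        # ancestry-specific effects
    stat = len(y) * math.log(rss_null / rss_anc)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
```

A SNP would then enter the LA set when this P-value falls below the cross-validated threshold.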
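With the preselection results, we build the following model:

$$Y = \sum_{j \in S_{LA}} \left( \beta_j^{AFR} X_j^{AFR} + \beta_j^{EUR} X_j^{EUR} \right) + \sum_{j \notin S_{LA}} \beta_j X_j + \epsilon \qquad (3)$$

where $S_{LA}$ denotes the LA set, i.e. the SNPs selected in the preselection step.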
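To make the structure of Model (3) concrete, the sketch below assembles the feature vector for one individual: SNPs in the LA set contribute two ancestry-specific columns, and all other SNPs contribute a single conventional genotype column. The dictionary-based layout, the `_AFR`/`_EUR` suffixes, and the function name `lampp_features` are assumptions for illustration.

```python
from typing import Dict, Set


def lampp_features(x_afr: Dict[str, int], x_eur: Dict[str, int],
                   la_set: Set[str]) -> Dict[str, int]:
    """Build Model (3) features for one individual.

    x_afr / x_eur map SNP ID -> ancestry-specific alternative-allele count.
    la_set holds the SNPs flagged as having LA-specific effects.
    """
    features: Dict[str, int] = {}
    for snp in x_afr:
        if snp in la_set:
            # Keep the dissected genotype: two features for this SNP.
            features[f"{snp}_AFR"] = x_afr[snp]
            features[f"{snp}_EUR"] = x_eur[snp]
        else:
            # Collapse back to the conventional genotype: one feature.
            features[snp] = x_afr[snp] + x_eur[snp]
    return features


example = lampp_features({"rs1": 1, "rs2": 0}, {"rs1": 1, "rs2": 2}, {"rs1"})
```

With an empty LA set this reduces to the feature set of Model (1), and with every SNP in the LA set it reduces to that of Model (2).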
Methylation array data measure DNA methylation levels as a continuous variable from 0 to 1 (beta-value) [65], while MC-seq data measure methylation as count data and carry additional sequence depth information (the sum of methylated and unmethylated counts). To account for differences in sequence depth, we fit the LAMPP Model (3) to the methylated and unmethylated counts using regularized logistic regression (elastic-net). For users who want to train their own LAMPP model from methylation array data, regularized linear regression is used instead. The overview of our prediction models is displayed in Fig. 1.
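One way to write the count-based fit is shown below. The notation is assumed for this illustration and does not come from the text above: $m_i$ and $n_i$ are the methylated read count and total depth for individual $i$, $p_i$ is the underlying methylation proportion, $\eta_i$ is the genetic linear predictor of Model (3), and $\lambda$ and $\alpha$ are the elastic-net tuning parameters in the glmnet parameterization. The likelihood term is written up to a normalizing constant.

```latex
% Binomial model for sequencing counts (assumed notation, see lead-in)
m_i \mid n_i \sim \mathrm{Binomial}(n_i,\, p_i), \qquad
\log\frac{p_i}{1 - p_i} = \beta_0 + \eta_i

% Elastic-net penalized negative log-likelihood
\hat{\beta} = \arg\min_{\beta}\;
  -\sum_{i} \Big[ m_i \log p_i + (n_i - m_i)\log(1 - p_i) \Big]
  + \lambda \Big[ \tfrac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1 \Big]
```

Under this formulation, individuals with greater depth contribute more reads to the likelihood and therefore more weight to the fit, which is how differences in sequence depth are accounted for.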
Training and testing prediction models for DNA methylation
To illustrate the improvement of incorporating LA information, we compared the conventional model (1) without LA information and three LA models: the basic LA model (2), LAMPP (3), and GAUDI [24].
To compare these models, the VACS-Methyl-seq data were split into 80% for training and 20% as an internal test set. For our LAMPP model (3), methylation levels were input as count data (methylated and unmethylated counts) and fitted with regularized logistic regression (binomial family with logit link). Models (1)–(3) were all trained using elastic net in the glmnet package ($\alpha$ set to 0.5) [66]. The GAUDI model was trained using its GitHub package [24], which has built-in steps to perform LD pruning and P-value thresholding. When training prediction models for each CpG, only samples with non-missing data for that specific CpG were used. Models were applied to the internal test set, and R² (squared Pearson correlation between observed and predicted methylation levels) was calculated to assess model performance.
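The evaluation procedure can be sketched as follows. This is an illustrative stand-in, not the code used in the study: the random seed, the helper names, and the convention of returning 0 when either vector is constant are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(sample_ids: Sequence[str],
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split samples into 80% training and 20% internal test."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = (len(ids) * 80) // 100
    return ids[:n_train], ids[n_train:]


def squared_pearson(observed: Sequence[float],
                    predicted: Sequence[float]) -> float:
    """R2 defined as the squared Pearson correlation of observed vs predicted."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    if sxx == 0 or syy == 0:
        return 0.0  # assumed convention: a constant vector carries no signal
    return (sxy * sxy) / (sxx * syy)


train_ids, test_ids = split_train_test([f"s{i}" for i in range(10)])
```

Because the correlation is squared, this metric does not distinguish positively from negatively correlated predictions, so the sign of the correlation may be worth checking separately.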
For external, independent validation, we combined the 80% (previous model training set) and the 20% (previous internal test set) into a single set to build the final model. The MWCCS samples (n = 213) served as the external test set to evaluate overall prediction accuracy [5].
Finally, we defined well-predicted CpGs as those with prediction accuracy R² > 0.01 (corresponding to a correlation >0.1) [5]. This yielded 435 854 CpGs for further analyses.
Enrichment analyses for well-predicted CpGs
For all CpGs in the MC-seq data, we used annotatr [67] and its built-in annotation database to assign CpG annotations (CpG islands (CGI), CGI shelves, CGI shores, inter-CGI regions), gene body annotations, gene regulatory annotations, and open chromatin annotations. To test whether the well-predicted CpGs (R² > 0.01) fell in each functional region more often than expected from the background (CpGs with R² ≤ 0.01) [68, 69], we performed functional enrichment analysis using Fisher’s exact test [70, 71]. For each functional region, enrichment estimates were calculated as log odds ratios with 95% confidence intervals.
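For one functional region, the enrichment test reduces to a 2×2 table (well-predicted vs background CpGs, inside vs outside the region). The sketch below is a small pure-Python illustration under stated assumptions: the two-sided P-value sums the probabilities of all tables no more likely than the observed one, and the interval is a Wald (Woolf) interval on the sample log odds ratio, which may differ from the interval reported by a given statistics package. It is only practical for small counts and requires all four cells to be non-zero for the interval.

```python
import math
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def log_odds_ratio_ci(a: int, b: int, c: int, d: int,
                      z: float = 1.96) -> Tuple[float, float, float]:
    """Sample log odds ratio with an approximate 95% Wald interval."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or - z * se, log_or + z * se


# Toy table: rows = well-predicted / background, columns = in / out of region.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
estimate, lower, upper = log_odds_ratio_ci(3, 1, 1, 3)
```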
Applying LAMPP to identify trait-associated CpGs
We applied the LAMPP model to predict methylation at 435 854 well-predicted CpGs using genotype data from 13 173 PAGE samples (Application set 1). We then tested the association between genetically predicted CpGs and the following phenotypes in PAGE: WBC, hemoglobin, platelet count, BMI, height, stroke, and heart attack. For each phenotype, a linear regression model was fitted between the predicted methylation level and the phenotype, adjusting for gender and the first five principal components of the genotype data. Trait-associated CpG sites were selected at a Bonferroni-corrected significance threshold. We also used participants in VACS with only genotype data (Application set 2) to replicate our identified CpGs for the overlapping traits. Among the identified trait-associated CpGs, we calculated the percentage of CpGs that were significant in Application set 2, the percentage of CpGs with directionally consistent effect sizes, and the percentage of replicated CpGs (significant and with the same effect direction). Among the replicated CpGs, we also calculated the correlation of the effect sizes between the two sets.
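The significance threshold and the replication metrics described above can be illustrated with a short sketch. Two assumptions are made here that are not stated in the text: the family-wise error rate is taken as 0.05, and the correction is taken over the 435 854 tested CpGs for a single trait. The function names and the replication significance level are likewise hypothetical.

```python
from typing import Dict, Sequence

N_CPGS = 435_854  # well-predicted CpGs tested per trait
ALPHA = 0.05      # assumed family-wise error rate


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def replication_summary(discovery_beta: Sequence[float],
                        replication_beta: Sequence[float],
                        replication_p: Sequence[float],
                        replication_alpha: float) -> Dict[str, float]:
    """Percentages of discovery CpGs that are significant, directionally
    consistent, and replicated (both) in the replication set."""
    n = len(discovery_beta)
    significant = [p < replication_alpha for p in replication_p]
    same_sign = [b1 * b2 > 0 for b1, b2 in zip(discovery_beta, replication_beta)]
    replicated = [s and d for s, d in zip(significant, same_sign)]
    return {
        "pct_significant": 100.0 * sum(significant) / n,
        "pct_same_direction": 100.0 * sum(same_sign) / n,
        "pct_replicated": 100.0 * sum(replicated) / n,
    }


threshold = bonferroni_threshold(ALPHA, N_CPGS)  # roughly 1.15e-7
summary = replication_summary(
    [0.5, -0.2, 0.1, 0.3], [0.4, 0.1, 0.2, -0.3], [0.01, 0.02, 0.2, 0.001], 0.05
)
```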
Intersections between trait-associated CpGs and corresponding GWAS/TWAS loci
To investigate the relationships between our identified CpGs and the corresponding GWAS loci, we calculated the proportion of CpGs that lay within ±500 kb of the genome-wide significant (GWS) SNPs associated with the corresponding trait (P < 5 × 10⁻⁸). The GWS SNPs were defined using in-sample GWAS and independent GWAS separately. The in-sample GWAS was derived from the same PAGE cohort using PLINK (v1.9) [72]. For the independent GWAS, we used the association data from the GWAS Catalog (https://www.ebi.ac.uk/gwas/), which is a collection of significant SNP–trait associations from 4963 studies [44, 45].
We also investigated the proportion of CpGs that lay within ±500 kb of the TWAS loci associated with the corresponding trait. TWAS hub, a database (http://twas-hub.org/) containing 17 022 associated loci for 342 traits, was used to identify TWAS loci for the phenotypes of interest. Stroke was not among the traits collected in TWAS hub, so a separate study on stroke [41] was included.
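The overlap calculation is the same for GWAS and TWAS loci and can be sketched as below. Representing each CpG and each locus as a `(chromosome, position)` pair, and the function name `proportion_near_loci`, are assumptions for this example; TWAS loci that span an interval would need to be reduced to a representative position or handled with interval logic.

```python
import bisect
from typing import Dict, List, Sequence, Tuple

Site = Tuple[str, int]  # (chromosome, base-pair position)


def proportion_near_loci(cpgs: Sequence[Site], loci: Sequence[Site],
                         window: int = 500_000) -> float:
    """Fraction of CpGs lying within +/-window bp of at least one locus."""
    by_chrom: Dict[str, List[int]] = {}
    for chrom, pos in loci:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    def is_near(chrom: str, pos: int) -> bool:
        positions = by_chrom.get(chrom, [])
        # First locus at or beyond the left edge of the window.
        i = bisect.bisect_left(positions, pos - window)
        return i < len(positions) and positions[i] <= pos + window

    if not cpgs:
        return 0.0
    return sum(is_near(chrom, pos) for chrom, pos in cpgs) / len(cpgs)


example_cpgs = [("1", 1_000_000), ("1", 5_000_000), ("2", 100)]
example_loci = [("1", 1_400_000), ("2", 700_000)]
fraction = proportion_near_loci(example_cpgs, example_loci)
```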
Key Points
We propose a new model, LA Methylation Predictor with Preselection (LAMPP), to predict DNA methylation levels from SNP genotype data.
LAMPP incorporates local ancestry information and improves the prediction accuracy of DNA methylation in the admixed populations.
LAMPP is a useful tool to identify genetically regulated methylation regions that are associated with complex traits, which further enhances the interpretation and prioritization of genome-wide association study results.
Acknowledgments
The authors appreciate the support of the Veteran Aging Study Cohort Biomarker Core, the MWCCS sites, the PAGE consortium, and Yale Center of Genomic Analysis. The views and opinions expressed in this manuscript are those of the authors and do not necessarily represent those of the Department of Veterans Affairs or the US government. Part of data used by this work was provided by patients and collected by the VA as part of their care and support. The authors gratefully acknowledge the contributions of the study participants and dedication of the staff at those sites.
COMpAAAS/VACS, a CHAART Cooperative Agreement, is supported by the National Institutes of Health: National Institute on Alcohol Abuse and Alcoholism (U24-AA020794, U01-AA020790, U01-AA020795, U01-AA020799, U10-AA013566-completed) and in kind by the US Department of Veterans Affairs. Additional grant support from the National Institute on Drug Abuse R01-DA035616 is also acknowledged.
Data in the application part of this manuscript were collected by the MACS/WIHS Combined Cohort Study (MWCCS). The contents of this publication are solely the responsibility of the authors and do not represent the official views of the National Institutes of Health (NIH). MWCCS (Principal Investigators): Atlanta CRS (Ighovwerha Ofotokun, Anandi Sheth, and Gina Wingood), U01-HL146241; Baltimore CRS (Todd Brown and Joseph Margolick), U01-HL146201; Bronx CRS (Kathryn Anastos, David Hanna, and Anjali Sharma), U01-HL146204; Brooklyn CRS (Deborah Gustafson and Tracey Wilson), U01-HL146202; Data Analysis and Coordination Center (Gypsyamber D’Souza, Stephen Gange and Elizabeth Topper), U01-HL146193; Chicago-Cook County CRS (Mardge Cohen, Audrey French, and Ryan Ross), U01-HL146245; Chicago-Northwestern CRS (Steven Wolinsky, Frank Palella, and Valentina Stosor), U01-HL146240; Northern California CRS (Bradley Aouizerat, Jennifer Price, and Phyllis Tien), U01-HL146242; Los Angeles CRS (Roger Detels and Matthew Mimiaga), U01-HL146333; Metropolitan Washington CRS (Seble Kassaye and Daniel Merenstein), U01-HL146205; Miami CRS (Maria Alcaide, Margaret Fischl, and Deborah Jones), U01-HL146203; Pittsburgh CRS (Jeremy Martinson and Charles Rinaldo), U01-HL146208; UAB-MS CRS (Mirjam-Colette Kempf, James B. Brock, Emily Levitan, and Deborah Konkle-Parker), U01-HL146192; UNC CRS (M. Bradley Drummond and Michelle Floris-Moore), U01-HL146194. 
The MWCCS is funded primarily by the National Heart, Lung, and Blood Institute (NHLBI), with additional co-funding from the Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD), National Institute on Aging (NIA), National Institute of Dental & Craniofacial Research (NIDCR), National Institute of Allergy And Infectious Diseases (NIAID), National Institute of Neurological Disorders and Stroke (NINDS), National Institute of Mental Health (NIMH), National Institute on Drug Abuse (NIDA), National Institute of Nursing Research (NINR), National Cancer Institute (NCI), National Institute on Alcohol Abuse and Alcoholism (NIAAA), National Institute on Deafness and Other Communication Disorders (NIDCD), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute on Minority Health and Health Disparities (NIMHD), and in coordination and alignment with the research priorities of the National Institutes of Health, Office of AIDS Research (OAR). MWCCS data collection is also supported by UL1-TR000004 (UCSF CTSA), UL1-TR003098 (JHU ICTR), UL1-TR001881 (UCLA CTSI), P30-AI-050409 (Atlanta CFAR), P30-AI-073961 (Miami CFAR), P30-AI-050410 (UNC CFAR), P30-AI-027767 (UAB CFAR), P30-MH-116867 (Miami CHARM), UL1-TR001409 (DC CTSA), KL2-TR001432 (DC CTSA), and TL1-TR001431 (DC CTSA). The authors gratefully acknowledge the contributions of the study participants and dedication of the staff at the MWCCS sites.
Contributor Information
Youshu Cheng, Department of Biostatistics, Yale School of Public Health, 47 College St, New Haven, CT 06510, United States; VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT 06516, United States.
Geyu Zhou, Department of Biostatistics, Yale School of Public Health, 47 College St, New Haven, CT 06510, United States.
Hongyu Li, Department of Biostatistics, Yale School of Public Health, 47 College St, New Haven, CT 06510, United States.
Xinyu Zhang, VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT 06516, United States; Department of Psychiatry, Yale School of Medicine, 300 George St, New Haven, CT 06510, United States.
Amy Justice, VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT 06516, United States; Department of Internal Medicine, Yale School of Medicine, 333 Cedar St, New Haven, CT 06510, United States.
Claudia Martinez, Cardiovascular Division, Department of Medicine, University of Miami Miller School of Medicine, 1600 NW 10th Ave, Miami, FL 33136, United States.
Bradley E Aouizerat, Bluestone Center for Clinical Research, College of Dentistry, New York University, 421 1st Ave, New York, NY 10010, United States; Department of Oral and Maxillofacial Surgery, College of Dentistry, New York University, 421 1st Ave, New York, NY 10010, United States.
Ke Xu, VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT 06516, United States; Department of Psychiatry, Yale School of Medicine, 300 George St, New Haven, CT 06510, United States.
Hongyu Zhao, Department of Biostatistics, Yale School of Public Health, 47 College St, New Haven, CT 06510, United States; VA Connecticut Healthcare System, 950 Campbell Ave, West Haven, CT 06516, United States.
Conflict of interest: The authors declare that they have no competing interests.
Funding
The project was supported by the National Institute on Drug Abuse (R03DA039745, R01DA038632, R01DA047063, R01DA047820, R01DA061926, R01DA061995); National Institutes of Health grant (P01 AA029545, R01 GM134005, U24 HG012108, U01 HG013840); and National Science Foundation grant (DMS1902903).
Data availability
Demographic and clinical variables and DNA methylation data for the VACS samples were submitted to GEO dataset (GSE117861) and are publicly available. Access to individual-level data from the MACS/WIHS Combined Cohort Study Data (MWCCS) may be obtained upon review and approval of a MWCCS concept sheet. Links and instructions for online concept sheet submission are on the study website. The data for the PAGE consortium can be downloaded from dbGaP (phs000200, phs00925, and phs000227). The independent GWAS summary data can be downloaded from the GWAS Catalog (https://www.ebi.ac.uk/gwas/). The TWAS loci data can be downloaded from TWAS hub (http://twas-hub.org/). LAMPP is publicly available at https://github.com/YoushuCheng/LAMPP.
Ethics approval and consent to participate
IRB approval was received from the institutional review boards of the coordinating center at Yale University, New Haven, CT, the Veterans Affairs (VA) Connecticut Healthcare System, West Haven, CT, and from the participating clinical sites. Informed consent was provided by all WIHS participants via protocols approved by institutional review committees at each affiliated institution.
References
- 1. Gamazon ER, Wheeler HE, Shah KP. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 2015;47:1091–8. doi:10.1038/ng.3367
- 2. Barbeira AN, Dickinson SP, Bonazzola R. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun 2018;9:1825. doi:10.1038/s41467-018-03621-1
- 3. Gusev A, Ko A, Shi H. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet 2016;48:245–52. doi:10.1038/ng.3506
- 4. Hu Y, Li M, Lu Q. et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat Genet 2019;51:568–76. doi:10.1038/s41588-019-0345-7
- 5. Fryett JJ, Morris AP, Cordell HJ. Investigating the prediction of CpG methylation levels from SNP genotype data to help elucidate relationships between methylation, gene expression and complex traits. Genet Epidemiol 2022;46:629–43. doi:10.1002/gepi.22496
- 6. Freytag V, Vukojevic V, Wagner-Thelen H. et al. Genetic estimators of DNA methylation provide insights into the molecular basis of polygenic traits. Transl Psychiatry 2018;8:31. doi:10.1038/s41398-017-0070-x
- 7. Liang Y, Pividori M, Manichaikul A. et al. Polygenic transcriptome risk scores (PTRS) can improve portability of polygenic risk scores across ancestries. Genome Biol 2022;23:23. doi:10.1186/s13059-021-02591-w
- 8. Ioannidis NM, Wang W, Furlotte NA. et al. Gene expression imputation identifies candidate genes and susceptibility loci associated with cutaneous squamous cell carcinoma. Nat Commun 2018;9:4264. doi:10.1038/s41467-018-06149-6
- 9. Mancuso N, Gayther S, Gusev A. et al. Large-scale transcriptome-wide association study identifies new prostate cancer risk regions. Nat Commun 2018;9:4079. doi:10.1038/s41467-018-06302-1
- 10. Khawaja AP, Cooke Bailey JN, Wareham NJ. et al. Genome-wide analyses identify 68 new loci associated with intraocular pressure and improve risk prediction for primary open-angle glaucoma. Nat Genet 2018;50:778–82. doi:10.1038/s41588-018-0126-8
- 11. Hawe JS, Wilson R, Schmid KT. et al. Genetic variation influencing DNA methylation provides insights into molecular mechanisms regulating genomic function. Nat Genet 2022;54:18–29. doi:10.1038/s41588-021-00969-x
- 12. Schübeler D. Function and information content of DNA methylation. Nature 2015;517:321–6. doi:10.1038/nature14192
- 13. Luo C, Hajkova P, Ecker JR. Dynamic DNA methylation: in the right place at the right time. Science 2018;361:1336–40. doi:10.1126/science.aat6806
- 14. Oliva M, Demanelis K, Lu Y. et al. DNA methylation QTL mapping across diverse human tissues provides molecular links between genetic variation and complex traits. Nat Genet 2023;55:112–22. doi:10.1038/s41588-022-01248-z
- 15. Liu H, Doke T, Guo D. et al. Epigenomic and transcriptomic analyses define core cell types, genes and targetable mechanisms for kidney disease. Nat Genet 2022;54:950–62. doi:10.1038/s41588-022-01097-w
- 16. Cheng Y, Cai B, Li H. et al. HBI: a hierarchical Bayesian interaction model to estimate cell-type-specific methylation quantitative trait loci incorporating priors from cell-sorted bisulfite sequencing data. Genome Biol 2024;25:273. doi:10.1186/s13059-024-03411-7
- 17. Ruan Y, Lin Y-F, Feng Y-CA. et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet 2022;54:573–80. doi:10.1038/s41588-022-01054-7
- 18. Tian P, Chan TH, Wang YF. et al. Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front Genet 2022;13:906965. doi:10.3389/fgene.2022.906965
- 19. Zhao Z, Fritsche LG, Smith JA. et al. The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet 2022;109:1998–2008. doi:10.1016/j.ajhg.2022.09.010
- 20. Chen F, Wang X, Jang S-K. et al. Multi-ancestry transcriptome-wide association analyses yield insights into tobacco use biology and drug repurposing. Nat Genet 2023;55:291–300. doi:10.1038/s41588-022-01282-x
- 21. Zhou G, Chen T, Zhao H. SDPRX: a statistical method for cross-population prediction of complex traits. Am J Hum Genet 2023;110:13–22. doi:10.1016/j.ajhg.2022.11.007
- 22. Tan T, Atkinson EG. Strategies for the genomic analysis of admixed populations. Annu Rev Biomed Data Sci 2023;6:105–27. doi:10.1146/annurev-biodatasci-020722-014310
- 23. Li B, Aouizerat BE, Cheng Y. et al. Incorporating local ancestry improves identification of ancestry-associated methylation signatures and meQTLs in African Americans. Commun Biol 2022;5:401. doi:10.1038/s42003-022-03353-5
- 24. Sun Q, Rowland BT, Chen J. et al. Improving polygenic risk prediction in admixed populations by explicitly modeling ancestral-differential effects via GAUDI. Nat Commun 2024;15:1016. doi:10.1038/s41467-024-45135-z
- 25. Shu C, Zhang X, Aouizerat BE. et al. Comparison of methylation capture sequencing and infinium methylationEPIC array in peripheral blood mononuclear cells. Epigenetics Chromatin 2020;13:51. doi:10.1186/s13072-020-00372-6
- 26. Barros-Silva D, Marques CJ, Henrique R. et al. Profiling DNA methylation based on next-generation sequencing approaches: new insights and clinical applications. Genes (Basel) 2018;9. doi:10.3390/genes9090429
- 27. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14:232–43. doi:10.1093/biostatistics/kxs033
- 28. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42:e69. doi:10.1093/nar/gku154
- 29. Wu H, Xu T, Feng H. et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43:e141.
- 30. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32:1446–53. doi:10.1093/bioinformatics/btw026
- 31. Ross A, Willson VL. Paired samples T-test. In: Ross A, Willson VL, editors. Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures. Rotterdam: SensePublishers; 2017. p. 17–9. doi:10.1007/978-94-6351-086-8_4
- 32. Huan T, Joehanes R, Song C. et al. Genome-wide identification of DNA methylation QTLs in whole blood highlights pathways for cardiovascular disease. Nat Commun 2019;10:4267. doi:10.1038/s41467-019-12228-z
- 33. Reiner AP, Lettre G, Nalls MA. et al. Genome-wide association study of white blood cell count in 16,388 African Americans: the continental origins and genetic epidemiology network (COGENT). PLoS Genet 2011;7:e1002108. doi:10.1371/journal.pgen.1002108
- 34. Keller MF, Reiner AP, Okada Y. et al. Trans-ethnic meta-analysis of white blood cell phenotypes. Hum Mol Genet 2014;23:6944–60. doi:10.1093/hmg/ddu401
- 35. Reich D, Nalls MA, Kao WH. et al. Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet 2009;5:e1000360. doi:10.1371/journal.pgen.1000360
- 36. Nalls MA, Wilson JG, Patterson NJ. et al. Admixture mapping of white cell count: genetic locus responsible for lower white blood cell count in the health ABC and Jackson heart studies. Am J Hum Genet 2008;82:81–7. doi:10.1016/j.ajhg.2007.09.003
- 37. Lo KS, Wilson JG, Lange LA. et al. Genetic association analysis highlights new loci that modulate hematological trait variation in Caucasians and African Americans. Hum Genet 2011;129:307–17. doi:10.1007/s00439-010-0925-1
- 38. Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and Manhattan plots. J Open Source Softw 2018;3:731. doi:10.21105/joss.00731
- 39. Soriano-Tárraga C, Lazcano U, Giralt-Steinhauer E. et al. Identification of 20 novel loci associated with ischaemic stroke. Epigenome-wide association study. Epigenetics 2020;15:988–97. doi:10.1080/15592294.2020.1746507
- 40. Islam T, Rahman MR, Khan A. et al. Integration of Mendelian randomisation and systems biology models to identify novel blood-based biomarkers for stroke. J Biomed Inform 2023;141:104345. doi:10.1016/j.jbi.2023.104345
- 41. Yang J, Yan B, Fan Y. et al. Integrative analysis of transcriptome-wide association study and gene expression profiling identifies candidate genes associated with stroke. PeerJ 2019;7:e7435. doi:10.7717/peerj.7435
- 42. Li L, Chen Y, Doan J. et al. Transforming growth factor β–activated kinase 1 signaling pathway critically regulates myocardial survival and remodeling. Circulation 2014;130:2162–72. doi:10.1161/CIRCULATIONAHA.114.011195
- 43. van Woerden GM, Senden R, de Konink C. et al. The MAP3K7 gene: further delineation of clinical characteristics and genotype/phenotype correlations. Hum Mutat 2022;43:1377–95. doi:10.1002/humu.24425
- 44. Buniello A, MacArthur JAL, Cerezo M. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 2019;47:D1005–12. doi:10.1093/nar/gky1120
- 45. Sollis E, Mosaku A, Abid A. et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res 2023;51:D977–85. doi:10.1093/nar/gkac1010
- 46. Bushey AM, Dorman ER, Corces VG. Chromatin insulators: regulatory mechanisms and epigenetic inheritance. Mol Cell 2008;32:1–9. doi:10.1016/j.molcel.2008.08.017
- 47. Barkan SE, Melnick SL, Preston-Martin S. et al. The women's interagency HIV study. WIHS Collaborative Study Group. Epidemiology 1998;9:117–25. doi:10.1097/00001648-199803000-00004
- 48. Justice AC, Dombrowski E, Conigliaro J. et al. Veterans aging cohort study (VACS): overview and description. Med Care 2006;44:S13–24. doi:10.1097/01.mlr.0000223741.02074.66
- 49. Wojcik GL, Graff M, Nishimura KK. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 2019;570:514–8. doi:10.1038/s41586-019-1310-4
- 50. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 2009;5:e1000529. doi:10.1371/journal.pgen.1000529
- 51. Siva N. 1000 Genomes project. Nat Biotechnol 2008;26:256. doi:10.1038/nbt0308-256b
- 52. Das S, Forer L, Schönherr S. et al. Next-generation genotype imputation service and methods. Nat Genet 2016;48:1284–7. doi:10.1038/ng.3656
- 53. Wreczycka K, Gosdschan A, Yusuf D. et al. Strategies for analyzing bisulfite sequencing data. J Biotechnol 2017;261:105–15. doi:10.1016/j.jbiotec.2017.08.007
- 54. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 2011;27:1571–2. doi:10.1093/bioinformatics/btr167
- 55. Lehne B, Drong AW, Loh M. et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol 2015;16:37. doi:10.1186/s13059-015-0600-x
- 56. Atkinson EG, Maihofer AX, Kanai M. et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat Genet 2021;53:195–204. doi:10.1038/s41588-020-00766-y
- 57. Chi C, Shao X, Rhead B. et al. Admixture mapping reveals evidence of differential multiple sclerosis risk by genetic ancestry. PLoS Genet 2019;15:e1007808. doi:10.1371/journal.pgen.1007808
- 58. Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nat Rev Genet 2011;12:523–8. doi:10.1038/nrg3002
- 59. Gazal S, Sahbatou M, Babron M-C. et al. High level of inbreeding in final phase of 1000 genomes project. Sci Rep 2015;5:17453. doi:10.1038/srep17453
- 60. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009;19:1655–64. doi:10.1101/gr.094052.109
- 61. O'Connell J, Gurdasani D, Delaneau O. et al. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet 2014;10:e1004234. doi:10.1371/journal.pgen.1004234
- 62. Maples BK, Gravel S, Kenny EE. et al. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet 2013;93:278–88. doi:10.1016/j.ajhg.2013.06.020
- 63. Yang J, Lee SH, Goddard ME. et al. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet 2011;88:76–82. doi:10.1016/j.ajhg.2010.11.011
- 64. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodology 2005;67:301–20. doi:10.1111/j.1467-9868.2005.00503.x
- 65. Du P, Zhang X, Huang C-C. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinform 2010;11:587. doi:10.1186/1471-2105-11-587
- 66. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 2010;33:1–22. doi:10.18637/jss.v033.i01
- 67. Cavalcante RG, Sartor MA. Annotatr: genomic regions in context. Bioinformatics 2017;33:2381–3. doi:10.1093/bioinformatics/btx183
- 68. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009;4:44–57. doi:10.1038/nprot.2008.211
- 69. Huang da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37:1–13. doi:10.1093/nar/gkn923
- 70. Fisher RA. On the interpretation of χ2 from contingency tables, and the calculation of P. J Roy Stat Soc 1922;85:87–94. doi:10.2307/2340521
- 71. Bedrick EJ, Hill JR. A survey of exact inference for contingency tables Comment. Stat Sci 1992;7:153–7.
- 72. Purcell S, Neale B, Todd-Brown K. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81:559–75. doi:10.1086/519795