Methods for Mediation Analysis with High-Dimensional DNA Methylation Data: Possible Choices and Comparison

Dylan Clark-Boucher; Xiang Zhou; Jiacong Du; Yongmei Liu; Belinda L Needham; Jennifer A Smith; Bhramar Mukherjee

doi:10.1101/2023.02.10.23285764

This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2023 Feb 14:2023.02.10.23285764. [Version 1] doi: 10.1101/2023.02.10.23285764


Dylan Clark-Boucher ¹, Xiang Zhou ², Jiacong Du ², Yongmei Liu ³, Belinda L Needham ⁴, Jennifer A Smith ⁴,⁵, Bhramar Mukherjee ²,⁴

PMCID: PMC9949196 PMID: 36824903

Abstract

Epigenetic researchers often evaluate DNA methylation as a mediator between social/environmental exposures and disease, but modern statistical methods for jointly evaluating many mediators have not been widely adopted. We compare seven methods for high-dimensional mediation analysis with continuous outcomes through both diverse simulations and analysis of DNAm data from a large national cohort in the United States, while providing an R package for their implementation. Among the considered choices, the best-performing methods for detecting active mediators in simulations are the Bayesian sparse linear mixed model by Song et al. (2020) and high-dimensional mediation analysis by Gao et al. (2019); while the superior methods for estimating the global mediation effect are high-dimensional linear mediation analysis by Zhou et al. (2021) and principal component mediation analysis by Huang and Pan (2016). We provide guidelines for epigenetic researchers on choosing the best method in practice and offer suggestions for future methodological development.

Introduction

In this study, we review and evaluate the available methods for performing mediation analysis when the mediators are high-dimensional DNA methylation (DNAm) measurements. DNAm is an epigenomic mechanism describing when a methyl group binds to the DNA, which occurs predominantly at cytosine-guanine dinucleotides, called “CpG sites.” DNAm has an important role in regulating gene expression across the entire genome, and is particularly impactful at CpG sites in the promoter regions of genes, where it can inhibit the binding of enzymes needed for transcription¹.

Recent advancements in technology have made it possible to collect DNAm data on a massive scale². Indeed, microarray technologies have enabled the measurement of over 850,000 CpG sites simultaneously², encouraging broad research on DNAm in the etiology of disease; and studies taking advantage of these tools have identified DNAm as a risk factor in obesity^3,4, type II diabetes⁵ and cardiovascular conditions^6,7. At the same time, however, DNAm has also been linked to exposures such as diet⁸, smoking⁹, alcohol¹⁰, air pollution¹¹, and socioeconomic status (SES)^12,13, which has prompted research on whether the effects of these exposures on health outcomes could be transferred by changes in DNAm. Effect transmission of this nature is called mediation, and it has become popular in epigenomic research to treat DNAm as a high-dimensional mediator between environmental exposures and human disease¹⁴.

As an example of such an analysis, our previous work^15,16 examined the association between low SES and glycated hemoglobin (HbA1c) in the Multi-Ethnic Study of Atherosclerosis (MESA), a United States population-based longitudinal study¹⁷. Indicators of SES, such as education level, are strong predictors of type II diabetes¹⁸, while HbA1c is an important risk factor of cardiovascular disease and a critical biomarker in type II diabetes diagnosis^19-21. Since education level is also associated with DNAm^12,13,22, and DNAm itself with HbA1c level²³, we hypothesized that if low education results in greater HbA1c, part of that effect could be mediated by DNAm (Fig. 1). In the current study, we revisit this hypothesis for the purpose of illustration. Our sample from MESA has 963 individuals and includes DNAm measurements at 402,339 CpG sites, none of which we know for certain are related to education or HbA1c in advance.

The standard statistical tool for addressing such a hypothesis is mediation analysis. Formally, mediation is when an exposure, say A, affects an outcome, Y, in part through its effect on a single mediating variable M. When M is a mediator of the A to Y association, the total effect of A on Y has two components: an indirect effect, from A affecting M and M affecting Y, and a direct effect, from A affecting Y independently of M. In the “traditional mediation analysis” approach proposed by Baron and Kenny (1986), the associations from this mechanism could be measured by fitting a few regression models: one for the effect of A on M (the mediator model), one for the effects of A and M on Y (the outcome model), and sometimes a third model for the total effect of A on Y, M ignored^24-26. The more recently developed “causal mediation analysis,” based on the counterfactual approach^27,28, has established conditions under which the parameters of these models can be interpreted as causal effects²⁹. The causal approach is more flexible when Y or M are binary and when there is A-M interaction in the outcome model³⁰.
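To make the three regressions concrete, the following is a minimal sketch in plain Python on simulated data. The helper `ols_slopes`, the sample size, and the effect sizes (0.5, 0.3, 0.4) are illustrative assumptions and are not taken from the study. It fits the mediator model, the outcome model, and the total-effect model for a single mediator. For linear models fit by least squares, the total effect equals the direct effect plus the product of the two path coefficients.

```python
import random


def ols_slopes(y, xs):
    """Least-squares slopes of y on one or two predictors (intercept included)."""
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    cols = []
    for x in xs:
        x_mean = sum(x) / n
        cols.append([v - x_mean for v in x])
    if len(cols) == 1:
        x1 = cols[0]
        return [sum(a * b for a, b in zip(x1, yc)) / sum(a * a for a in x1)]
    # Two predictors: solve the 2x2 normal equations on centered data.
    x1, x2 = cols
    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, yc))
    s2y = sum(a * b for a, b in zip(x2, yc))
    det = s11 * s22 - s12 * s12
    return [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]


# Simulated data with hypothetical effect sizes (assumption, for illustration).
random.seed(1)
n = 500
A = [random.gauss(0, 1) for _ in range(n)]
M = [0.5 * a + random.gauss(0, 1) for a in A]
Y = [0.3 * a + 0.4 * m + random.gauss(0, 1) for a, m in zip(A, M)]

alpha = ols_slopes(M, [A])[0]         # mediator model: effect of A on M
direct, beta = ols_slopes(Y, [A, M])  # outcome model: effects of A and M on Y
total = ols_slopes(Y, [A])[0]         # total-effect model, M ignored
indirect = alpha * beta               # product-of-coefficients indirect effect
```

In this linear setting `total` and `direct + indirect` agree up to floating-point error, which is why the third regression is optional.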

While standard examples of mediation consider only one exposure, one mediator, and one outcome^31,32, there has been growing interest in methods for mediation that can handle many potential mediators at once. Epigenetic studies have felt this need especially, as DNAm is usually measured at several hundred thousand CpG sites with little prior knowledge of their importance. In settings such as this, a naïve strategy would be to evaluate the potential mediators one at a time, each with their own pair of models; but if the mediators are correlated this approach is inefficient, and the resulting estimates are potentially biased due to confounding from the excluded co-mediators³¹. Instead, so that we leverage these correlations rather than ignore them, the preferred approach is to assess the mediators jointly, in a single multivariable model. Although several methods for fitting such a model have been presented in the literature, none of them are widely used in analyzing DNAm data, a sign that epigenetic research is still catching up to recent developments in mediation analysis with high-dimensional mediators.

Our study aims to bridge this gap and guide researchers in epigenetics to use state-of-the-art methods for mediation analysis with high-dimensional mediators. Despite the recent methodological developments, there are no clear-cut standards for which methods should be applied in which circumstances, making it difficult to select the best-suited method for an analysis in advance. While our prior research examined methods for large-scale single-mediator hypotheses³¹, there is no such work for methods that can incorporate many potential mediators at once. Our study addresses this question first with an extensive simulation study, directly comparing the performance of seven different methods for mediation with high-dimensional mediators across a spectrum of settings. Along with metrics related to identification of key mediators and estimation of mediation effect, we include a computation time comparison to evaluate the scalability of the methods to large datasets. Next, to assess the utility of these methods on real data, we apply the same seven methods, plus two additional methods adapted from them, on the data from MESA to evaluate the mediating role of DNAm in the association between low education level and HbA1c. Our study is the first to address this critical gap in the epigenetic mediation literature, both by providing clarity on the methods available and by assessing their strengths and weaknesses under different settings. Moreover, although the analysis is centered around DNAm, the methods we deploy are not specific to epigenetics, and our results and guidelines should be similarly useful for researchers studying high-dimensional mediation problems in other fields. We include, supplementary to our study, an R package for implementing the methods, called “hdmed,” so that researchers have access to a centralized resource they can draw from in their own high-dimensional mediation analyses.

Notations and General Framework

Before proceeding, it will be useful to provide an overview of the relevant mediation model and to summarize the types of methods which have become available. To begin, suppose we have a dataset of n individuals: an exposure $A_{i}$ , a continuous outcome $Y_{i}$ , and continuous mediators $M_{i}$ measured for the i^th person, i varying from 1 to n. We write $M_{i}$ in bold to indicate its status as a vector—in this case, a set of p mediators $M_{i}^{(j)}$ , j varying from 1 to p. Let $C_{i}$ be a vector of q covariates. When p is greater than 1, we can use the regression models

$$E[Y_{i} \mid A_{i}, M_{i}, C_{i}] = β_{a} A_{i} + β_{m}^{T} M_{i} + β_{c}^{T} C_{i} \qquad (1)$$

and

$$E[M_{i} \mid A_{i}, C_{i}] = α_{a} A_{i} + α_{c} C_{i} \qquad (2)$$

to estimate the mediating role of $M_{i}$ in the causal pathway between the exposure and outcome³³. Model (1) is the outcome model and model (2) is the mediator model. In model (1), $β_{m}$ is a p-vector in which the j^th component, $(β_{m})_{j}$ , is the linear association of the j^th mediator with $Y_{i}$ adjusting for the other variables; while $β_{a}$ is the association between $A_{i}$ and $Y_{i}$ adjusting for mediators and covariates. In model (2), $α_{a}$ is a p-vector of the associations between the exposure and each mediator, $(α_{a})_{j}$ ; and $α_{c}$ is a matrix with the mediator-covariate associations. Also note that in model (1), we have assumed there is no interaction between $A_{i}$ and $M_{i}$ , which is beyond the scope of our present study.

The parameters of these models underlie the causal effects of interest. Under certain assumptions^27,33, the direct effect of $A_{i}$ on $Y_{i}$ is $β_{a}$ , the global indirect effect (or global mediation effect) of $A_{i}$ on $Y_{i}$ through $M_{i}$ is $α_{a}^{T} β_{m}$ , and the total effect of $A_{i}$ on $Y_{i}$ is $β_{a} + α_{a}^{T} β_{m}$ . Another quantity of interest is the proportion mediated, defined as the ratio of the global indirect effect to the total effect, which measures the degree to which the $A_{i}$ to $Y_{i}$ pathway is mediated by $M_{i}$ . We may also seek to measure the product terms $(α_{a})_{j} (β_{m})_{j}$ , which measure the contribution of the j^th mediator to the global indirect effect, since summing these for j from 1 to p yields $α_{a}^{T} β_{m}$ . However, we emphasize that $(α_{a})_{j} (β_{m})_{j}$ cannot be interpreted as a causal effect through the j^th mediator on its own, since we have made no assumptions about the causal ordering of the mediators and can only formally treat them as a joint system. Instead, we call $(α_{a})_{j} (β_{m})_{j}$ the mediation contribution, and describe the j^th mediator as active if its contribution is not zero.
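The quantities defined above can be computed directly once the coefficients are known or estimated. The following is a small sketch in plain Python. The function name `mediation_summary` and the toy coefficient values are assumptions made for illustration only.

```python
def mediation_summary(alpha_a, beta_m, beta_a):
    """Summarize mediation effects from the coefficients of models (1) and (2).

    alpha_a: exposure-mediator associations, one per mediator
    beta_m:  mediator-outcome associations, one per mediator
    beta_a:  direct effect of the exposure on the outcome
    """
    contributions = [a * b for a, b in zip(alpha_a, beta_m)]
    indirect = sum(contributions)  # alpha_a^T beta_m
    total = beta_a + indirect
    return {
        "contributions": contributions,
        # A mediator is "active" if its mediation contribution is not zero.
        "active": [j for j, c in enumerate(contributions) if c != 0],
        "indirect": indirect,
        "direct": beta_a,
        "total": total,
        # The ratio is undefined when the total effect is exactly zero.
        "proportion_mediated": indirect / total if total != 0 else float("nan"),
    }


# Toy example with four mediators (hypothetical values).
summary = mediation_summary(
    alpha_a=[0.5, 0.0, -0.2, 0.3],
    beta_m=[0.4, 0.7, 0.0, -0.1],
    beta_a=0.1,
)
```

Only the first and fourth mediators are active here, because each of the other two has one coefficient equal to zero.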

If the potential mediators are uncorrelated, conditional on the exposure and covariates, or if p is reasonably small relative to n, then it is trivial to fit the above models using linear regression. However, if the mediators are correlated and p is large, the estimates from model (1) may have extremely high variance; and if p is so large as to exceed n, the linear regression model cannot even be fitted. These concerns are relevant to us because DNAm measurements tend to be correlated, while the number of sites that we have measurements on exceeds the number of samples. Addressing these issues has been a focus of the mediation literature, with authors using penalized regression^34-38, dimension reduction^39-41, Bayesian inference^15,42, and latent variables⁴³ to make the outcome model statistically tractable.

We provide a graphical depiction of eleven available methods in Fig. 2, dividing them into three different groups. Each method is described in greater detail in the Methods section and up to nine of them are included in the analysis. In the first group, we consider methods that fit the above pair of models explicitly, allowing one to estimate $α_{a}^{T} β_{m}$ , the global indirect effect, simply by summing the estimated mediation contributions. These include high-dimensional mediation analysis (HIMA) by Zhang et al. 2016³⁴, high-dimensional mediation analysis (HDMA) by Gao et al. 2019³⁵, mediation analysis via fixed effect model (MedFix) by Zhang 2019³⁶, pathway least absolute shrinkage and selection operator (pathway LASSO) by Zhao and Luo 2022³⁷, the Bayesian sparse linear mixed model (BSLMM) by Song et al. 2020¹⁵, and the Gaussian mixture model (GMM) by Song et al. 2021⁴². In the second group, we consider methods that can estimate $α_{a}^{T} β_{m}$ “directly”; in other words, without needing to fit the original pair of models explicitly. These have the drawback of being unable to identify specific active mediators because they do not provide estimates of the mediation contributions. They include principal component mediation analysis (PCMA) by Huang and Pan 2016³⁹, sparse principal component mediation analysis (SPCMA) by Zhao et al. 2020⁴⁰, and high-dimensional linear mediation analysis (HILMA) by Zhou et al. 2021³⁸. Last, in the third group, we consider methods that make no attempt to estimate the mediation effects as originally proposed, but instead reconceptualize the mediation framework with newly-defined parameters based on latent variables. This group includes the methods high-dimensional multivariate mediation analysis (HDMM) by Chén et al. 2018⁴¹ and latent variable mediation analysis (LVMA) by Derkach et al. 2021⁴³.
Within this comparative structure, we evaluate methods from all three groups, identifying their strengths and weaknesses across a wide range of simulation settings and analysis of DNAm data from MESA.

Results

Simulation Results

We begin by comparing the performance of the methods using simulations, where we know and can control the true values of the parameters. On simulated data with 2,000 (potential) mediators and either 1,000 or 2,500 observations, we consider (1) a baseline setting, where the mediators are moderately correlated and their signals are sparse; (2) a high-correlation setting, where the correlations between mediators are enhanced compared to (1); and (3) a non-sparse setting, where every mediator has at least some mediation signal but some of the signals are systematically larger. In Settings (1) and (2), 60 random mediators have $(α_{a})_{j}$ only sampled from a Normal(0,1), 60 have $(β_{m})_{j}$ only sampled from a Normal(0,1), and 20 have both, with the remaining entries of $α_{a}$ and $β_{m}$ fixed at zero. In Setting (3), we use a similar scheme, but sample the previously zero $(α_{a})_{j}$ and $(β_{m})_{j}$ from a Normal(0,0.2²). Our simulations also vary the strength of the signals within each of these settings by changing the proportion of variance that is explained by the associations. We do so by changing PVE_A, the proportion of variance in each mediator that can be explained by A, among those mediators that are affected by A; PVE_IE, the proportion of variance of Y that is explained by the total mediation effect; and PVE_DE, the proportion of variance of Y that is explained by the direct effect of A on Y. Results for varying PVE_IE are presented here while results for varying PVE_DE and PVE_A are included in the supplement (Supplementary Figs 1-4). In addition to the high-dimensional mediation methods, we include a one-at-a-time method⁴⁴ in which the mediators are assessed individually using linear regression.
We evaluate the methods by their true positive rate (TPR) for detecting active mediators, their mean squared error (MSE) for estimating the contributions of active and inactive mediators, and their percent relative bias for estimating the global indirect effect. See Methods for more details.
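The sampling scheme for the coefficient vectors can be sketched as follows in plain Python. This is a simplified illustration under stated assumptions. The function `sample_signals` and its arguments are hypothetical, and the sketch omits the rescaling used to reach target PVE values as well as the generation of correlated mediators.

```python
import random


def sample_signals(p=2000, n_alpha_only=60, n_beta_only=60, n_both=20,
                   background_sd=0.0, seed=0):
    """Draw alpha_a and beta_m following the sparse or non-sparse scheme.

    background_sd = 0.0 gives the sparse settings (remaining entries are zero);
    background_sd = 0.2 mimics the non-sparse setting, Normal(0, 0.2^2).
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(p), n_alpha_only + n_beta_only + n_both)
    both = chosen[:n_both]
    alpha_only = chosen[n_both:n_both + n_alpha_only]
    beta_only = chosen[n_both + n_alpha_only:]
    alpha_large = set(both) | set(alpha_only)
    beta_large = set(both) | set(beta_only)

    def draw(is_large):
        if is_large:
            return rng.gauss(0.0, 1.0)  # large signal, Normal(0, 1)
        if background_sd > 0:
            return rng.gauss(0.0, background_sd)  # small background signal
        return 0.0

    alpha = [draw(j in alpha_large) for j in range(p)]
    beta = [draw(j in beta_large) for j in range(p)]
    return alpha, beta, sorted(both)


# Sparse (baseline-style) draw: exactly 20 mediators have both coefficients nonzero.
alpha_a, beta_m, both_idx = sample_signals()
active = [j for j in range(len(alpha_a)) if alpha_a[j] * beta_m[j] != 0]

# Non-sparse draw: every mediator carries at least a small signal.
alpha_ns, beta_ns, _ = sample_signals(background_sd=0.2)
```

In the sparse draw, 80 entries of each vector are nonzero, but only the 20 mediators with both coefficients nonzero contribute to the global indirect effect.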

True positive rate

Fig. 3 compares the TPR for detecting active mediators of the Group 1 methods and the one-at-a-time method. The value shown is the mean TPR over 100 simulated datasets and a 95% empirical confidence interval (CI). On each dataset and for each method, thresholding was used to keep the false discovery rate (FDR) below 10%. For the non-sparse setting, we show the TPR for detecting mediators whose $(α_{a})_{j}$ and $(β_{m})_{j}$ were both sampled from Normal(0,1) rather than Normal(0,0.2²). We include the Group 1 methods HIMA, HDMA, MedFix, pathway LASSO, and BSLMM. We focus on TPR but not false positive rate (FPR) because the FDR correction was highly conservative, the mean FPR ranging from 0 to 5.1x10⁻⁴ across all settings and methods.

For a sample size of 2,500 and a PVE_IE of 0.10, the most powerful method in the baseline setting was BSLMM (mean TPR: 0.45; CI: 0.25 - 0.63), whose average TPR was 40% higher than that of the second-best method, HDMA. BSLMM also performed best when PVE_IE was 0.05 (mean TPR: 0.25; CI: 0.02 - 0.48), but to a lesser degree, outperforming HDMA by only 13%. BSLMM remained the best method, and HDMA the second best, no matter the signal strength or the degree of correlations, but performed poorly when the signals were non-sparse. In the setting with 1,000 observations, PVE_IE set to 0.05, and non-sparse signals, the best-performing method was HIMA (mean TPR: 0.09; CI: 0.05 - 0.10), its average TPR 3.3 times higher than that of BSLMM, which performed worst.
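For a single simulated dataset, the selection metrics discussed here reduce to simple set arithmetic. The sketch below is in plain Python. The function name `selection_metrics` and the toy index sets are assumptions for illustration.

```python
def selection_metrics(selected, truly_active, p):
    """True positive rate, false positive rate, and false discovery proportion."""
    selected, truly_active = set(selected), set(truly_active)
    true_pos = len(selected & truly_active)
    false_pos = len(selected - truly_active)
    n_inactive = p - len(truly_active)
    return {
        "tpr": true_pos / len(truly_active) if truly_active else float("nan"),
        "fpr": false_pos / n_inactive if n_inactive > 0 else float("nan"),
        # Conventionally 0 when nothing is selected.
        "fdp": false_pos / len(selected) if selected else 0.0,
    }


# Toy example: 20 active mediators out of 2,000; a method selects 9 of them
# plus one inactive mediator (index 100).
metrics = selection_metrics(
    selected=list(range(9)) + [100],
    truly_active=range(20),
    p=2000,
)
```

Averaging `tpr` over replicate datasets would give a mean TPR of the kind reported above. With 1,980 inactive mediators, a single false positive corresponds to an FPR of about 5x10⁻⁴.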

Estimation of contributions of active mediators

Next, we assess the MSE of the methods for estimating mediation contributions, relative to the one-at-a-time approach. In Fig. 4, we show the relative MSE (rMSE) for estimating mediation contributions among the mediators that were either active (in the baseline and high-correlation settings) or had $(α_{a})_{j}$ or $(β_{m})_{j}$ sampled from the larger-variance distribution (in the non-sparse setting). In the baseline setting with 2,500 observations, the best-performing method when the mediation signal was strong was BSLMM, whose mean rMSE of 0.59 (CI: 0.13 - 1.51) was 24% lower than that of HDMA, the second-best method. However, when the PVE_IE was reduced to 0.05 or the sample size reduced to 1,000, the best-performing method was either HDMA or MedFix, with MedFix (mean rMSE: 0.79; CI: 0.31 - 1.53) performing 61% better than BSLMM after reducing both. Similar trends were observed for the high-correlation and non-sparse settings.
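As a rough illustration of the relative MSE, the sketch below divides a method's MSE over a chosen subset of mediators by the MSE of a reference approach on the same subset. The function names and the toy numbers are assumptions. The exact definition used in the study is given in its Methods section.

```python
def mse(estimates, truth, subset):
    """Mean squared error of estimated contributions over a subset of mediators."""
    return sum((estimates[j] - truth[j]) ** 2 for j in subset) / len(subset)


def relative_mse(method_est, reference_est, truth, subset):
    """MSE of a method divided by the MSE of a reference (e.g. one-at-a-time)."""
    return mse(method_est, truth, subset) / mse(reference_est, truth, subset)


# Toy example with four mediators, the first two of which are active.
truth = [0.2, -0.1, 0.0, 0.0]
joint_method = [0.1, -0.1, 0.0, 0.0]
one_at_a_time = [0.4, 0.1, 0.05, 0.0]
rmse_active = relative_mse(joint_method, one_at_a_time, truth, subset=[0, 1])
```

A value below 1 means the method estimates the contributions more accurately than the reference on that subset.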

Estimation of contributions of inactive mediators

Fig. 5 shows the rMSE among the mediators that either were not active (in the baseline and high-correlation settings) or had $(α_{a})_{j}$ or $(β_{m})_{j}$ sampled from the smaller-variance distribution (in the non-sparse setting). We exclude pathway LASSO from Fig. 5 because for the baseline and high-correlation settings it had rMSEs of exactly zero. The reason for this is that pathway LASSO tended to be highly conservative and successfully assigned inactive mediators to have no effect. As for the other methods, in the baseline setting with 2,500 samples, MedFix performed the best when PVE_IE was 0.10, with a mean rMSE of 1.8x10⁻³ (CI: 1.9x10⁻⁴ - 6.4x10⁻³), which was 46% lower than the mean rMSE for the second-best method, HIMA. In contrast, HIMA was the best-performing method when signal was weakened to a PVE_IE of 0.05, attaining a mean rMSE of 2.8x10⁻⁴ (CI: 0.0 - 1.3x10⁻³), which was 94% lower than that of the second-best, MedFix. Results were similar when the correlations between mediators were heightened and when the sample size was reduced. In the settings where mediation signals were non-sparse, the best-performing method was always HIMA, which had a mean rMSE of 3.7x10⁻² (CI: 1.1x10⁻² - 6.5x10⁻²) when PVE_IE was 0.10 and there were 2,500 observations, 2% lower than that of MedFix.

Fig. 5. Y-axis is on a log₁₀ scale. Value shown is the mean of the relative mean-squared error for estimating mediation contributions among inactive mediators (relative to the one-at-a-time approach) across 100 simulated data replicates, with intervals representing the inner 95% range. For baseline and high-correlation-between-mediators settings, inactive mediators are those which do not contribute to the global mediation effect, whereas in the non-sparse setting, inactive mediators are those whose contributions were sampled from a distribution with small variance instead of large.

Estimation of global indirect effect

Lastly, in Fig. 6, we show the percent relative bias for estimating $α_{a}^{T} β_{m}$ , the global indirect effect. We use the same methods as in Figures 3 to 5 along with the Group 2 methods PCMA and HILMA, which obtain an estimate of the global indirect effect without needing to directly fit the original mediation model. (The Group 2 method SPCMA is excluded for computational reasons.) In the baseline setting with 2,500 samples, the best performer when PVE_IE was 0.10 was HILMA, whose mean relative bias of 9% (CI: 0.6% - 20.8%) was 40% lower than that of HDMA, the second-best. Next, when the PVE_IE was reduced to 0.05, the best-performing method was MedFix (mean relative bias: 20.5%; CI: 1.0% - 43.8%), which outperformed HILMA by only 7%. We observed similar results for a sample size of 1,000 and high correlations. In the non-sparse settings, where the biases tended to be much higher, the best-performing methods were either PCMA or HDMA.

Scalability

We evaluated the scalability of the methods by running them 30 times on a common computing platform, and recording their run time (Table 1). This was done in both a small data setting (n = 100, p = 200) and a big data setting (n = 1,000, p = 2,000). On the larger dataset, the methods MedFix, HDMA, and PCMA posed insignificant computational burden; whereas BSLMM took an average of 40.1 minutes per run (assuming 30,000 posterior samples), HILMA an average of 40.9 minutes per run, pathway LASSO an average of 192.6 minutes per run, and SPCMA an average of 842.5 minutes per run (assuming 100 principal components). Run times were substantially lower in the smaller dataset, the slowest method, pathway LASSO, only taking an average of 18.71 minutes. The memory consumption of the methods is included in Supplementary Table 1.
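A timing comparison of this kind can be reproduced in outline with a small harness that repeats a computation and reports the mean and interquartile range of the run times. The sketch below uses only the Python standard library. The function `time_runs` and the placeholder workload are assumptions and do not correspond to any of the mediation methods.

```python
import statistics
import time


def time_runs(fn, n_runs=30):
    """Run fn repeatedly and summarize wall-clock run times in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(durations, n=4)
    return {"mean": statistics.mean(durations), "iqr": (q1, q3)}


# Placeholder workload standing in for one fit of a mediation method.
timing = time_runs(lambda: sum(i * i for i in range(10_000)), n_runs=30)
```

Running each method on a single core, as in the study, keeps the comparison from being affected by differences in parallelization.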

Table 1.

Computation time comparison for high-dimensional mediation analysis methods

Method	Mean (n = 100, p = 200)	Interquartile Range (n = 100, p = 200)	Mean (n = 1,000, p = 2,000)	Interquartile Range (n = 1,000, p = 2,000)
BSLMM	39.17s	(38.84s - 39.54s)	40.14m	(39.74m - 40.34m)
HDMA	1.40s	(1.37s - 1.40s)	29.76s	(29.55s - 29.92s)
HDMM	24.85s	(24.80s - 24.89s)	12.36m	(12.33m - 12.37m)
HILMA	24.42s	(24.13s - 24.63s)	40.85m	(38.22m - 40.65m)
HIMA	0.25s	(0.25s - 0.25s)	3.55s	(3.47s - 3.62s)
MedFix	0.61s	(0.60s - 0.61s)	7.33s	(7.22s - 7.42s)
PCMA	2.77s	(2.74s - 2.79s)	58.97s	(58.08s - 59.35s)
Pathway LASSO	18.71m	(18.19m - 19.23m)	192.62m	(188.10m - 195.83m)
SPCMA	16.05m	(15.94m - 16.04m)	842.54m	(827.26m - 855.21m)


Methods were run 30 times each on a single core of an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz processor.

DNAm data analysis results from MESA

For our real data analysis, we applied the methods on a dataset with high-dimensional epigenetic mediators. Our exposure of interest was low SES—measured by educational attainment below a four-year degree—while our outcome variable was HbA1c level and our potential mediators were DNAm measurements at 402,339 CpG sites. Since the methods are incapable of handling so many CpG sites at once, we reduced our scope to only include the 2,000 sites with the strongest association with low SES. This was based on a linear mixed-model adjusting for age, sex, race, and the estimated proportions of residual non-monocytes as fixed effects and methylation chip and position as random effects. Our final dataset contained these 2,000 CpG sites and 963 samples. HbA1c, DNAm, and all other continuous variables were standardized prior to analysis.
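The two preprocessing steps described here, standardizing continuous variables and keeping the sites most strongly associated with the exposure, can be sketched in plain Python. The association statistics are taken as given, since the sketch does not fit the linear mixed model used in the study. The function names and toy values are assumptions.

```python
import statistics


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def top_k_by_abs(association_stats, k):
    """Indices of the k sites with the largest absolute association statistic."""
    order = sorted(range(len(association_stats)),
                   key=lambda j: abs(association_stats[j]),
                   reverse=True)
    return order[:k]


# Toy example: four sites with hypothetical exposure-association statistics.
kept_sites = top_k_by_abs([0.5, -3.0, 1.2, 2.5], k=2)
scaled = standardize([1.0, 2.0, 3.0])
```

In the analysis above, `k` would be 2,000 and the statistics would come from the exposure-to-CpG association models.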

Identification of noteworthy CpG sites

We identified CpG sites that potentially mediated the relationship between low SES and HbA1c using the Group 1 methods HIMA, HDMA, MedFix, pathway LASSO, and BSLMM. In HIMA, HDMA, MedFix, and pathway LASSO, which involve feature selection, we describe a CpG site as “active” if its estimated mediation contribution is not zero; whereas in BSLMM, we do so if the estimated posterior inclusion probability is not zero (see Methods). We also included a one-at-a-time method in which the CpG sites were assessed individually with linear mixed models, identifying active mediators with the joint significance test⁴⁴. Out of 2,000 CpG sites, HIMA found 3 sites to be noteworthy, HDMA found 11, MedFix found 3, pathway LASSO found 141, and BSLMM found 3, amounting to 144 unique CpG sites in total. The one-at-a-time method identified zero CpG sites as noteworthy at an FDR threshold of 10%. Eleven CpG sites were identified as noteworthy by at least two of the methods (Table 2). Among these 11, the estimated mediation contributions were similar across methods in direction and size except for BSLMM, for which the estimates were an order of magnitude smaller than the others but in the same direction.
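One common way to carry out a one-at-a-time analysis with the joint significance test is to take, for each mediator, the larger of the two p-values (exposure to mediator, mediator to outcome) and then apply a multiple-testing correction. The sketch below assumes the Benjamini-Hochberg procedure for FDR control. The source states only that a 10% FDR threshold was used, so this choice, the function names, and the toy p-values are all assumptions.

```python
def joint_significance_pvalues(p_alpha, p_beta):
    """Joint significance test: both paths must be significant, so use the max."""
    return [max(a, b) for a, b in zip(p_alpha, p_beta)]


def benjamini_hochberg(pvalues, fdr=0.10):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    cutoff_rank = 0
    for rank, j in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if pvalues[j] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy example with four mediators (hypothetical p-values).
p_alpha = [0.001, 0.20, 0.004, 0.03]
p_beta = [0.002, 0.01, 0.50, 0.04]
joint_p = joint_significance_pvalues(p_alpha, p_beta)
selected = benjamini_hochberg(joint_p, fdr=0.10)
```

The second and third mediators are not selected because each has one weak path, even though the other path is strongly significant.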

Table 2.

Estimated contributions of noteworthy CpG sites on the mediation pathway between low education and HbA1c

CpG Name	Chromosome	Nearby Gene(s)	UCSC RefGene Group	Univariate (0 sites identified)	HIMA (3 sites identified)	HDMA (11 sites identified)	MedFix (3 sites identified)	Pathway LASSO (141 sites identified)	BSLMM (3 sites identified)
cg10508317	17	SOCS3	Body	3.48 x10⁻²	1.59 x10⁻²*	3.56 x10⁻²*	2.90 x10⁻²*	2.35 x10⁻²*	0.25 x10⁻²
cg01288337	14	RIN3	Body	3.35 x10⁻²	1.47 x10⁻²*	2.82 x10⁻²*	2.70 x10⁻²*	4.43 x10⁻²*	0.21 x10⁻²
cg10244976	16	LMF1	Body	3.00 x10⁻²	0	2.78 x10⁻²*	0	2.23 x10⁻²*	0.19 x10⁻²
cg07516252	14	REC8	TSS200	2.72 x10⁻²	0	2.24 x10⁻²*	0	2.26 x10⁻²*	0.26 x10⁻²
cg07571519	10	C10orf105; CDH23	3'UTR; Body	2.53 x10⁻²	0.33 x10⁻²*	3.67 x10⁻²*	1.47 x10⁻²*	2.81 x10⁻²*	0.21 x10⁻²
cg23079012	2	LINC00299	Body	2.27 x10⁻²	0	1.99 x10⁻²*	0	1.98 x10⁻²*	0.29 x10⁻²
cg01587454	8	DCAF4L2	1stExon	1.77 x10⁻²	0	2.10 x10⁻²*	0	1.99 x10⁻²*	0.38 x10⁻²
cg27527503	4	HADH	TSS1500	1.75 x10⁻²	0	1.86 x10⁻²*	0	1.27 x10⁻²*	0.23 x10⁻²
cg25891647	11	GRAMD1B	Body	−1.27 x10⁻²	0	−3.42 x10⁻²*	0	−3.02 x10⁻²*	−0.33 x10⁻²
cg08473752	17	NLK	Body	−0.70 x10⁻²	0	−2.34 x10⁻²*	0	−2.32 x10⁻²*	−0.22 x10⁻²
cg12644059	15	BLM	N/A¹	−0.03 x10⁻²	0	−2.31 x10⁻²*	0	−1.84 x10⁻²*	−0.22 x10⁻²


* Selected as noteworthy by the given method.

¹ CpG site cg12644059 is 3.240kb from the final base pair of the BLM gene.

Table includes all CpG sites that were selected as having a noteworthy mediation contribution by at least two of the implemented methods out of 2,000 CpG sites in total. Criteria for CpG identification varied by method. All estimates are adjusted for age, sex, race, and the estimated proportions of residual non-monocytes as fixed effects, along with methylation chip and position as random effects to address potential batch effects. Note that for HIMA, HDMA, MedFix, and pathway LASSO, which fit high-dimensional regression models, we used additional pre-screening to reduce the number of mediators in advance to only n/log(n) ≈ 141 CpG sites, which is the approach recommended by the HIMA and HDMA authors and helps with statistical and computational efficiency (see Methods). Pathway LASSO selected all of these 141.
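The pre-screening size quoted in this note follows from the sample size alone. The short check below assumes a natural logarithm and rounding up to the next integer. The source gives only the approximate value, so the rounding rule and the function name `screening_size` are assumptions.

```python
import math


def screening_size(n):
    """Number of mediators retained by pre-screening, taken as ceil(n / ln n)."""
    return math.ceil(n / math.log(n))


# With the 963 MESA samples, n / ln(n) is about 140.2, which rounds up to 141.
k = screening_size(963)
```

Because the number of retained mediators grows almost linearly with n, larger cohorts pass more CpG sites to the multivariable model.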

Some of these CpG sites were on or nearby genes that are potentially related to HbA1c. Site cg10508317 is in the body of the SOCS3 gene, for which a rich body of literature has established links between overexpression and insulin resistance⁴⁵. The same site has also been identified in MESA as a mediator between adult SES and BMI⁴⁶ and adult SES and HbA1c³¹ based on previous one-at-a-time analyses. Site cg01288337, in the body of the RIN3 gene, has been identified in MESA as a potential mediator between adult SES and HbA1c based on one-at-a-time analysis as well³¹. The RIN3 gene itself is proximal to the SLC24A4 gene, both of which have been linked to brain glucose metabolism in human population studies⁴⁷. In addition, site cg27527503 is in the promoter region of the HADH gene, which is differentially expressed with respect to diabetes status⁴⁸ and is a primary driver of hyperinsulinism⁴⁹ and hyperinsulinaemic hypoglycemia (low blood sugar due to excess insulin)⁵⁰. A Venn diagram of genes identified by the methods is included in Supplementary Fig. 5, and results for every noteworthy CpG site are listed in Supplement File 1.

Global mediation through DNAm

Next, we estimated the direct effect of low education on HbA1c, the global indirect effect of low education on HbA1c through DNAm, and the total effect of low education on HbA1c using the Group 1 methods HIMA, HDMA, MedFix, pathway LASSO, and BSLMM, as well as the Group 2 methods PCMA, SPCMA, and HILMA (Table 3). Results across methods varied considerably, with the estimated global indirect effect ranging from 0.03 in HILMA to 0.17 in SPCMA. The estimated total effect ranged from 0.02 (HILMA) to 0.198 (HIMA, HDMA, and MedFix). While HILMA appeared to be an outlier, some of the other methods were consistent, with HDMA, BSLMM, pathway LASSO, PCMA, and SPCMA all estimating the global indirect effect to be close to 0.15. The variability in the estimated indirect effect and estimated total effect led to variability in the proportion mediated as well, from 17.1% in HIMA to 100% in HILMA.

Table 3.

Estimated effects in the mediation mechanism from low education to DNAm to HbA1c

Method	Estimated Global indirect Effect	Estimated Direct Effect	Estimated Total Effect	Estimated Proportion Mediated
HIMA	0.03	0.16	0.20	0.17
HDMA	0.13	0.07	0.20	0.65
MedFix	0.07	0.13	0.20	0.36
BSLMM	0.14	0.05	0.18	0.78
Pathway LASSO	0.13	0.05	0.18	0.74
PCMA	0.15	0.02	0.17	0.91
SPCMA	0.17	0.00	0.17	1.00
HILMA	0.03	0.00	0.03	1.00


All estimates are adjusted for age, sex, race, and the estimated proportions of residual non-monocytes as fixed effects, along with methylation chip and position as random effects to address potential batch effects. We provide only point estimates, not interval estimates, because some of the methods are either not capable of producing interval estimates or do not provide the code for producing them in their software. For HIMA, HDMA, and MedFix, which as coded do not directly provide estimates of the direct effect, we first estimate the total effect by fitting the outcome model with the CpG sites omitted, then estimate the direct effect by subtracting the indirect effect from the total effect. Note also that, for HIMA, HDMA, MedFix, and pathway LASSO, we used additional screening to reduce the number of mediators in advance for the sake of statistical and computational efficiency, so only n/log(n) ≈ 141 CpG sites were seen by the multivariate model rather than 2,000 (this approach is recommended by the HIMA and HDMA authors).

Additional Findings

In addition to estimating the global indirect effect, method SPCMA is also able to identify potentially-mediating CpG sites in groups. It does so by linearly combining the mediators using sparse principal component-defined weights, then evaluating the resulting principal components as mediators themselves⁴⁰. However, out of 100 computed principal components, only three of them had significant mediation contributions after 10% FDR correction, the first representing a linear combination of 762 CpG sites, the second a combination of 782 sites, and the third a combination of 797 sites. Since the transformed mediators are functions of so many CpG sites at once, one cannot make claims about which particular CpG sites are active mediators, but the method still provides insight into whether there is statistical mediation at all.

We finish our analysis by deploying HDMM, a method from Group 3. Unlike the methods in Groups 1 and 2, HDMM cannot be used to estimate the global indirect effect from the proposed mediation structure, nor to estimate the mediation contributions of specific CpG sites. Rather, HDMM uses a likelihood-based approach to compute “directions of mediation”, which are weights that can be used to linearly combine the observed mediators into unobserved, latent mediators that replace the observed mediators in the mediation models (similar to PCMA). The estimated effect of the first latent mediator on average HbA1c was 0.13, the estimated total effect 0.18, and the proportion mediated 0.715. The three CpG sites with the largest directions of mediation were cg01288337 (0.36) on the RIN3 gene, cg16162970 (−0.22) near the PACS2 gene, and cg25891647 (−0.21) on the GRAMD1B gene; the first and last of which were among the 11 CpG sites identified by other methods in Table 2. Although the size and direction of these estimates are not interpretable, they offer evidence that these CpG sites are potentially involved in mediation.

Discussion

In this study, we reviewed and evaluated statistical methods for performing mediation analysis with high-dimensional DNAm data, so that researchers in epigenetics have the information they need to choose the most appropriate method for their data sample, subject matter, and research objectives. In extensive simulations, we found that the most powerful method for identifying active mediators was generally BSLMM, with HDMA close behind; though the former performed poorly in settings where the mediation signals were non-sparse. No method was uniformly better than the others at estimating the mediation contributions, though pathway LASSO was always the weakest. For estimating the global indirect effect, the best-performing method was HILMA in sparse mediation settings and PCMA or HDMA in non-sparse settings. Our scalability comparison revealed that HIMA, HDMA, MedFix, and PCMA were easily scalable to large datasets (e.g., n = 1,000 and p = 2,000), whereas SPCMA and pathway LASSO were extremely computationally costly.

On DNAm data from MESA, 11 CpG sites were selected by at least two of the methods as mediators between low SES and HbA1c level. Of the many genes related to these sites, SOCS3, RIN3, and HADH have the strongest potential biological connections to HbA1c^{45,47,48,50-52}, which contributes to the already rich literature on DNAm as a mediator between the exposome and health outcomes. Moreover, the methods generally produced similar estimates of the mediation contributions, with the exception of BSLMM. It is possible that since BSLMM is non-sparse, the estimated mediation contributions end up severely shrunken compared to the methods which directly select features.

Estimates of the global indirect effect were highly variable. Part of this can be explained by the fact that HDMA, MedFix, HIMA, and pathway LASSO are sparse models that can set mediation contributions to be exactly zero, resulting in a rigid and unstable estimation of the global indirect effect. The method HILMA, which is built specifically for estimating the global indirect effect and direct effect, produced estimates that were sharply different from those of the other methods, possibly because, as our simulations indicated, it struggles in non-sparse mediation settings.

In practice, the optimal method for mediation analysis with high-dimensional mediators will depend both on the data and the objective. If the goal is to identify specific CpG sites that are involved in mediation, one preferred method may be HDMA, which performed well at detecting active mediators in our simulations and was not overly conservative when applied to the observed data. If one’s focus is the global indirect effect, our simulations suggested that the optimal method is HILMA; but considering the variability we observed in our DNAm analysis, it may be worthwhile to apply BSLMM and HDMA as well to ensure the results are robust. If the results of multiple methods disagree substantially, it may be difficult to say with confidence which is closest to the truth, and the estimates should be interpreted with caution. Next, if there is interest in latent, unmeasured mediators, either HDMM or LVMA is worth attempting, though HDMM is computationally simpler. A detailed decision tree for selecting the optimal method is included in Fig. 7.

Some strengths of our study include its broad coverage of the available methods, the breadth of its simulation settings, and the comprehensive set of evaluation criteria. Our analysis of real DNAm data is especially valuable because it elucidates the potential limitations of using these methods in practice, as it is impossible to incorporate the full complexity of real data sources into contrived simulation settings. However, our study also has weaknesses. First, since DNAm measurements and HbA1c data were collected concurrently, and represent only single time points, we cannot interpret the parameters we have estimated as causal effects. Nor can we interpret the mediation contributions estimated in Table 2 as causal, since DNAm was correlated across CpG sites and we have made no assumptions about their causal ordering. Moreover, although it would be optimal to address our research question longitudinally, with measurements at multiple time points, there is a dearth of mediation analysis methods which can handle that type of data, and longitudinal mediation analysis with high-dimensional mediators should be a focus of future methodological development. Second, we limited our analysis to the situation in which Y and $M$ are continuous, $M$ and A do not interact, and only one A is of interest. However, we note that the methods HIMA and HDMA can also be applied to identify active mediators when Y is binary, while PCMA can be applied to infer the global indirect effect when there is A-M interaction in the outcome model. MedFix, along with the simultaneously-proposed MedMix (mediation analysis with mixed effect model, also by Zhang (2021)), can be applied when both the exposures and mediators are high-dimensional, while Huang and VanderWeele (2014) proposed a variance component test of the global indirect effect when only A is high-dimensional⁵³.

As the landscape of methods for high-dimensional mediation analysis continues to expand, future review studies should consider exploring additional mediation settings (e.g., in the presence of non-linearity or interaction) for which statistical methods are continuing to become available.

Methods

Mediation Model with Multiple Mediators

Let $M$ be a set of p variables, $M^{(1)}$ , $M^{(2)}$ , to $M^{(p)}$ , each a potential mediator in the causal pathway between A and Y. We assume that the ordering of the potential mediators is arbitrary and that Y is continuous. Given a dataset of n individuals, with $A_{i}$ , $Y_{i}$ , $M_{i}$ , and q covariates $C_{i}$ measured for each subject i, we can evaluate the mediating role of $M$ with the models

$$E [Y_{i} \mid A_{i}, M_{i}, C_{i}] = β_{a} A_{i} + β_{m}^{T} M_{i} + β_{c}^{T} C_{i} \qquad (3)$$

and

$$E [M_{i} \mid A_{i}, C_{i}] = α_{a} A_{i} + α_{c} C_{i} . \qquad (4)$$

We refer to these as the outcome and mediator models. Bolded terms distinguish vectors from scalars. Under certain assumptions, the parameters of this model can be used to derive causal effects of interest: Namely, in addition to the baseline assumption of temporality, we assume (1) that there is no unmeasured confounding in the exposure-outcome association after conditioning on $C$ , (2) that there is no unmeasured confounding in the mediator-outcome associations after adjusting for the exposure and $C$ , (3) that there is no unmeasured confounding of the exposure-mediator associations after conditioning on $C$ , and (4) that the measured confounders of the mediator-outcome associations are not caused by the exposure (which would make those confounders mediators themselves). In these circumstances only can $β_{a}$ be interpreted as the natural direct effect of A on Y, $α_{a}^{T} β_{m}$ the natural indirect effect of A on Y through $M$ , and $β_{a} + α_{a}^{T} β_{m}$ the total effect of A on Y³³. We say a mediator $M^{(j)}$ is active if $(α_{a})_{j} (β_{m})_{j}$ is not zero, since it contributes mathematically to the indirect effect, but this contribution itself cannot be formally interpreted causally unless the mediators are independent conditional on A and $C$ . Extensions of this framework cover cases when Y is binary, when $M$ is binary, or when the outcome model requires an interaction effect between $M$ and A³³.
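
As an illustration only, the product-of-coefficients quantities defined above can be computed from a set of coefficients as in the following Python sketch. The function name `decompose_effects` and all numerical values are hypothetical and are not taken from the analysis; the sketch assumes the coefficients are already known or estimated.

```python
def decompose_effects(beta_a, alpha_a, beta_m, tol=0.0):
    """Product-of-coefficients summaries for a multiple-mediator model.

    beta_a  : coefficient of A in the outcome model (direct effect)
    alpha_a : exposure-mediator coefficients, one per mediator
    beta_m  : mediator-outcome coefficients, one per mediator
    tol     : a mediator counts as active if |contribution| > tol
    """
    if len(alpha_a) != len(beta_m):
        raise ValueError("alpha_a and beta_m must have the same length")
    # Mediator-specific contributions (alpha_a)_j * (beta_m)_j.
    contributions = [a * b for a, b in zip(alpha_a, beta_m)]
    indirect = sum(contributions)  # alpha_a^T beta_m
    total = beta_a + indirect      # beta_a + alpha_a^T beta_m
    active = [j for j, c in enumerate(contributions) if abs(c) > tol]
    return {
        "direct": beta_a,
        "indirect": indirect,
        "total": total,
        "contributions": contributions,
        "active": active,
    }


# Hypothetical coefficients for p = 4 mediators (illustrative values only).
result = decompose_effects(
    beta_a=0.5,
    alpha_a=[0.5, 0.0, -0.25, 1.0],
    beta_m=[0.5, 2.0, -1.0, 0.0],
)
print(result["indirect"], result["total"], result["active"])
```

In this toy example only the first and third mediators are active (zero-based indices 0 and 2), because each of the other two has one coefficient equal to zero.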

A summary of the methods that can evaluate $M$ as a mediator is provided in Table 4, using the above pair of models as a frame of reference. We describe each of the methods in greater detail in the following three sections.

Table 4.

Methods Summary.

Name and Author	Estimation of global indirect effect	Estimation of mediation contributions	Mediator identification	Y Data Type	Summary
Group 1 Methods
HIMA; Zhang, 2016	Point estimation	Point estimation	Yes	Continuous or binary	Fits the outcome model with the minimax concave penalty. Requires subsequent fitting of ordinary least squares regression to test the statistical significance of the mediation contributions.
HDMA; Gao, 2019	Point estimation	Point, interval estimation	Yes	Continuous or binary	Fits the outcome model with the de-sparsified LASSO penalty.
MedFix; Zhang, 2021	Point estimation	Point, interval estimation	Yes	Continuous	Fits the outcome model with the adaptive LASSO penalty. Can also be applied when the exposure is high-dimensional in addition to the mediators.
Pathway LASSO; Zhao and Luo, 2022	Point estimation	Point estimation	Yes	Continuous	Fits the outcome model and mediator models with a jointly penalized likelihood, directly applying shrinkage to the mediation contributions $(α_{a})_{j} (β_{m})_{j}$ .
BSLMM; Song, 2020	Bayesian point, interval estimation	Bayesian interval estimation	Yes	Continuous	Bayesian mixed-model in which the mediator-outcome associations $(β_{m})_{j}$ and the exposure-mediator associations $(α_{a})_{j}$ are assumed to independently follow sparse normal distributions.
GMM; Song, 2021	Bayesian point, interval estimation	Bayesian interval estimation	Yes	Continuous	Bayesian mixed-model in which the mediator-outcome associations $(β_{m})_{j}$ and the exposure-mediator associations $(α_{a})_{j}$ are assumed to jointly follow a sparse multivariate normal distribution.
Group 2 Methods
PCMA; Huang and Pan, 2016	Point, interval estimation	No	No	Continuous or binary	Applies principal component analysis on the mediator model residuals, transforming the mediators so they are independent. Can be applied when there is A-M interaction in the outcome model.
SPCMA; Zhao, 2019	Point, interval estimation	No	Identifies whether subsets of the mediators are jointly active	Continuous	Similar to PCMA but applies sparse PCA, resulting in transformed mediators that are more interpretable.
HILMA; Zhou, 2020	Point, interval estimation	No	No	Continuous	Uses a debiased penalized regression approach to directly estimate the global indirect effect $α_{a}^{T} β_{m}$ . Can be applied for multiple exposures simultaneously.
Group 3 Methods
HDMM; Chen, 2018	No	No	Nonspecifically identifies groups of active mediators	Continuous	Estimates “directions of mediation” by which the observed mediators can be linearly combined to form latent mediators. The latent mediators replace the true mediators in the analysis.
LVMA; Derkach, 2019	No	No	Identifies inputted mediators associated with latent mediators	Continuous or binary	Reformulates the causal structure of the mediation problem. Assumes that $M$ itself is not responsible for mediation, but rather that the effect of A on Y is mediated by latent, unmeasured factors, $F$ , which also cause changes in $M$ .


Group 1 Methods

This group of methods can estimate both the global indirect effect $α_{a}^{T} β_{m}$ and the mediator-specific contributions $(α_{a})_{j} (β_{m})_{j}$ , j from 1 to p.

HIMA

High-dimensional mediation analysis (HIMA), proposed by Zhang et al. (2016), is a penalized regression approach with two main steps: First, the outcome model is fitted with a minimax concave penalty⁵⁴, performing feature selection on the mediators by setting some of them to have no effect on Y³⁴. Then, among the remaining mediators, they fit the mediator models individually using ordinary regression. The authors test the significance of $(α_{a})_{j} (β_{m})_{j}$ by applying Bonferroni correction to the maximum of the $(β_{m})_{j}$ and $(α_{a})_{j}$ p-values. To obtain p-values for the $(β_{m})_{j}$ estimates, the authors re-fit the reduced outcome model by ordinary least squares, which statistically may be overconfident. The authors also recommend an initial screening step to reduce the number of mediators at the start, as the outcome model will still be unstable if p is extremely large compared to n.
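
The significance rule described above (the larger of the two p-values for each mediator, followed by Bonferroni correction) can be sketched in Python as follows. This is a minimal illustration rather than the HIMA implementation: the p-values are hypothetical, and taking the number of tests to be the number of mediators passed in is an assumption of the sketch.

```python
def max_p_bonferroni(p_alpha, p_beta, level=0.05):
    """Joint significance of each mediation contribution.

    p_alpha : p-values for the exposure-mediator coefficients (alpha_a)_j
    p_beta  : p-values for the mediator-outcome coefficients (beta_m)_j
    Returns one (joint p-value, adjusted p-value, significant?) tuple per mediator.
    """
    if len(p_alpha) != len(p_beta):
        raise ValueError("p_alpha and p_beta must have the same length")
    n_tests = len(p_alpha)  # assumption: one test per mediator supplied
    results = []
    for pa, pb in zip(p_alpha, p_beta):
        p_joint = max(pa, pb)                   # both paths must be significant
        p_adjusted = min(1.0, p_joint * n_tests)  # Bonferroni correction
        results.append((p_joint, p_adjusted, p_adjusted < level))
    return results


# Hypothetical p-values for three mediators retained after selection.
tests = max_p_bonferroni(
    p_alpha=[0.001, 0.2, 0.004],
    p_beta=[0.002, 0.0001, 0.03],
)
for j, (p_joint, p_adjusted, significant) in enumerate(tests):
    print(j, p_joint, p_adjusted, significant)
```

Only the first mediator is declared significant here: the second fails because its exposure-mediator p-value is large, and the third because its joint p-value does not survive the correction.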

HDMA

High-dimensional mediation analysis (HDMA), proposed by Gao et al. (2019), is the same as HIMA except for its penalty function, replacing the minimax concave penalty with the recently-proposed de-sparsified LASSO^35,55. The advantage of this penalty is that the resulting estimates of $β_{m}$ are asymptotically normal, so one can test their statistical significance without needing to subsequently apply ordinary least squares. HDMA is also less biased than HIMA when the mediators are highly-correlated.

MedFix

Mediation analysis via fixed effect model (MedFix) is another extension of HIMA, proposed by Zhang (2021)³⁶. MedFix was originally proposed for a setting where there are not only multiple mediators, but also multiple exposures, which it handles by applying adaptive LASSO to both the outcome model and the mediator models. If there is only one exposure, feature selection in the mediator models is not necessary, and applying MedFix is analogous to applying HDMA except with adaptive LASSO instead of de-sparsified LASSO.

Pathway LASSO

Pathway LASSO is another penalized regression approach, proposed by Zhao and Luo (2022)³⁷. Whereas HIMA, HDMA, and MedFix use a two-step design—the outcome model and mediator models fitted separately—this method fits the models all together, with a jointly penalized likelihood. The penalty not only applies shrinkage to the mediator-outcome associations, like the other methods, but also to the exposure-mediator associations and the mediation contributions.

BSLMM

The Bayesian sparse linear mixed model (BSLMM) is a Bayesian approach proposed by Song et al. (2020)¹⁵. The model assumes $α_{a}$ and $β_{m}$ are random vectors, both independently following mixtures of normal distributions. Most of the effects are presumed to be small, owing to a normal distribution with mean zero and small variance, while the others are allowed to be larger, resulting from a normal distribution with higher variance. We estimate the effects with their posterior mean, and we distinguish active mediators from inactive with their posterior inclusion probability of belonging to the distribution with higher variance.
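
To make the prior structure concrete, the following Python sketch draws effects from a two-component, zero-mean normal mixture of the kind described above. It illustrates the prior only; posterior computation and posterior inclusion probabilities are not shown. The mixing probability, the two standard deviations, and the function name `draw_mixture_effects` are hypothetical choices for illustration and are not values from BSLMM.

```python
import random


def draw_mixture_effects(p, prob_large, sd_small, sd_large, seed=0):
    """Draw p effects from a two-component, zero-mean normal mixture.

    Returns the effects and, for each effect, a flag marking whether it
    was drawn from the large-variance component.
    """
    rng = random.Random(seed)
    effects, is_large = [], []
    for _ in range(p):
        large = rng.random() < prob_large
        sd = sd_large if large else sd_small
        effects.append(rng.gauss(0.0, sd))
        is_large.append(large)
    return effects, is_large


# Hypothetical hyperparameters, chosen only for illustration.
p = 2000
alpha_a, alpha_large = draw_mixture_effects(p, 0.05, 0.01, 1.0, seed=1)
beta_m, beta_large = draw_mixture_effects(p, 0.05, 0.01, 1.0, seed=2)

# Mediators whose two effects both come from the large-variance component.
both_large = [j for j in range(p) if alpha_large[j] and beta_large[j]]
print(len(both_large))
```

Because the two sets of effects are drawn independently, only a small share of mediators is expected to have both effects in the large-variance component.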

GMM

The Gaussian mixed model (GMM), proposed by Song et al. (2021), is an extension of BSLMM in which the $(α_{a})_{j}$ , $(β_{m})_{j}$ pairs are treated as correlated, following a mixture of multivariate normal distributions instead of two independent normal distributions⁴². Thus, GMM may be more useful than BSLMM if the true size of each $(β_{m})_{j}$ is related to the size of the corresponding $(α_{a})_{j}$ , and vice-versa.

Group 2 Methods

This group of methods directly estimates the global indirect effect without producing estimates of its mediator-specific contributions.

PCMA

Principal component mediation analysis (PCMA), proposed by Huang and Pan (2016), was an early method for multiple-mediator mediation using principal component analysis (PCA)³⁹. The authors perform PCA on the residual matrix of the mediator models, then use the p by r loading matrix $Q$ to transform the matrix $M$ into a new set of mediators, $M^{*}$ , which are uncorrelated conditional on A and $C$ . The transformed mediators then replace the original mediators in the analysis, and because they are uncorrelated, the outcome and mediator models can be fit without issue. Although the mediators have been transformed, and the mediator-specific contributions $(α_{a})_{j} (β_{m})_{j}$ no longer correspond to the original j-th mediator, the global indirect effect $α_{a}^{T} β_{m}$ can still be estimated with its original interpretation. The authors set r to equal p, though this is only possible if p is less than n.
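
The reason the global indirect effect keeps its interpretation can be checked numerically. When r = p and $Q$ is orthogonal, the transformed coefficients are $Q^{T} α_{a}$ and $Q^{T} β_{m}$ , so their inner product equals $α_{a}^{T} β_{m}$ . The Python sketch below verifies this for p = 2, using an arbitrary rotation as a stand-in for the PCA loading matrix; the coefficient values and the rotation angle are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def apply_q_transpose(vec, theta):
    """Return Q^T vec, where Q = [[cos, -sin], [sin, cos]] is a 2 x 2 rotation."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x + s * y, -s * x + c * y)


# Hypothetical coefficients for p = 2 mediators and an arbitrary rotation.
alpha_a = (0.4, -0.3)
beta_m = (0.5, 0.2)
theta = 0.7

alpha_star = apply_q_transpose(alpha_a, theta)
beta_star = apply_q_transpose(beta_m, theta)

original_contributions = [a * b for a, b in zip(alpha_a, beta_m)]
transformed_contributions = [a * b for a, b in zip(alpha_star, beta_star)]

# The global indirect effect agrees up to floating-point rounding ...
print(dot(alpha_a, beta_m), dot(alpha_star, beta_star))
# ... while the mediator-specific contributions do not.
print(original_contributions, transformed_contributions)
```

In PCMA itself the matrix $Q$ is obtained from PCA on the mediator-model residuals rather than chosen arbitrarily, but the same algebra applies to any orthogonal $Q$ .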

SPCMA

Zhao et al. (2019) proposed sparse principal component mediation analysis (SPCMA) to improve the interpretability of the results from PCMA⁴⁰. In PCMA, the transformed mediators are difficult to interpret because they are sums of all p original mediators; whereas in SPCMA, the loading matrix $Q$ is sparsified, meaning that each transformed mediator is only a sum of a few of the original mediators. The results are easier to interpret because, if a specific transformed mediator has a large effect, it can potentially be traced back to the original mediators which were used to construct it. SPCMA induces bias in its estimation compared to PCMA, but it can be helpful for identifying groups of mediators which may be active.

HILMA

High-dimensional linear mediation analysis (HILMA), proposed by Zhou (2020), estimates $α_{a}^{T} β_{m}$ with a complex, de-biased penalized regression approach³⁸. The mathematics of the procedure are beyond the scope of this text, but the proposed estimator has asymptotic properties for testing whether $α_{a}^{T} β_{m}$ is zero, and can also be applied when there are multiple (but not high-dimensional) exposures.

Group 3 Methods

The last group of methods is fundamentally distinct from the others: Instead of fitting the original mediation models (Group 1), or estimating the mediation effect without fitting the models (Group 2), they reconceptualize the causal structure of the problem to produce results with unique interpretations. Like any method, they should only be applied when their assumptions about the causal structure are reasonable.

HDMM

High-dimensional multivariate mediation (HDMM), proposed by Chén et al. (2018), is similar to PCMA in that it uses dimension reduction, but chooses the loading vectors with a likelihood-based approach instead of PCA⁴¹. The loading vectors are referred to as “directions of mediation,” each vector specifying a linear combination of mediators which contribute to the likelihood of the mediation models. Hence, HDMM implicitly assumes that there are latent, unmeasured mediating variables that can be represented as linear combinations of the observed mediators. The results of HDMM are difficult to interpret, but it can still be useful for identifying whether there is any mediation through $M$ at all, and for identifying large subsets of mediators that contribute to that mediation.

LVMA

Latent variable mediation analysis (LVMA), proposed by Derkach et al. (2019), assumes that $M$ itself is not involved in mediation, but rather, that there are a small number of unmeasured mediators, $F$ , which transmit the effect of A to Y and which also cause changes in $M$ ⁴³. In other words, LVMA assumes explicitly what HDMM assumes implicitly, and the results of the two methods have a similar structure. A key feature of LVMA is that the $F \to M$ associations are sparsified, meaning that the method can be used for detecting relevant mediators in $M$ . An observed mediator would be considered active if it is associated with a latent mediator that is itself associated with A and Y.

Simulation study

Simulation settings

We evaluate the above methods with a simulation study. To contrast them under diverse conditions, we consider three different settings of mediation: (1) a baseline setting in which the mediation signals are sparse and the (potential) mediators are moderately correlated, (2) a high-correlation setting with sparse signals, and (3) a moderate correlation setting in which the signals are non-sparse. Within each of these settings, we also vary the degree of mediation by modifying three parameters: the proportion of variance in $M$ that is explained by A among those associated with A (PVE_A), the proportion of the variance of Y that is explained by the direct effect (PVE_DE), and the proportion of the variance of Y that is explained by the global indirect effect (PVE_IE). For a baseline case, we let PVE_A equal 0.20 and PVE_DE and PVE_IE both equal 0.10; then, in three additional cases, we sequentially decrease one of these parameters by half, weakening the signal, and set the other two parameters to their values from the baseline. Between Settings (1) to (3), this amounted to 12 unique data-generating mechanisms in total. Each of these was evaluated with a sample size of 1,000 and 2,500, with the number of potential mediators fixed at 2,000. All combinations of settings are listed below in Table 5.

Table 5.

Complete list of settings in simulation study

Number of potential mediators (p)	Sample Size (n)	Sparsity of signals	Degree of correlation	PVE_A	PVE_IE	PVE_DE
2000	2500	Sparse	Baseline	0.20	0.10	0.10
2000	2500	Sparse	Baseline	0.20	0.05	0.10
2000	2500	Sparse	Baseline	0.10	0.10	0.10
2000	2500	Sparse	Baseline	0.20	0.10	0.05
2000	2500	Sparse	High	0.20	0.10	0.10
2000	2500	Sparse	High	0.20	0.05	0.10
2000	2500	Sparse	High	0.10	0.10	0.10
2000	2500	Sparse	High	0.20	0.10	0.05
2000	2500	Non-sparse	Baseline	0.20	0.10	0.10
2000	2500	Non-sparse	Baseline	0.20	0.05	0.10
2000	2500	Non-sparse	Baseline	0.10	0.10	0.10
2000	2500	Non-sparse	Baseline	0.20	0.10	0.05
2000	1000	Sparse	Baseline	0.20	0.10	0.10
2000	1000	Sparse	Baseline	0.20	0.05	0.10
2000	1000	Sparse	Baseline	0.10	0.10	0.10
2000	1000	Sparse	Baseline	0.20	0.10	0.05
2000	1000	Sparse	High	0.20	0.10	0.10
2000	1000	Sparse	High	0.20	0.05	0.10
2000	1000	Sparse	High	0.10	0.10	0.10
2000	1000	Sparse	High	0.20	0.10	0.05
2000	1000	Non-sparse	Baseline	0.20	0.10	0.10
2000	1000	Non-sparse	Baseline	0.20	0.05	0.10
2000	1000	Non-sparse	Baseline	0.10	0.10	0.10
2000	1000	Non-sparse	Baseline	0.20	0.10	0.05
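
The settings in Table 5 follow a simple crossing of sample size, mediation structure, and signal strength, which the Python sketch below reproduces. The dictionary keys and the helper `pve_cases` are hypothetical names introduced for illustration.

```python
from itertools import product

BASELINE = {"PVE_A": 0.20, "PVE_IE": 0.10, "PVE_DE": 0.10}


def pve_cases(baseline):
    """Baseline case plus one case per parameter with that parameter halved."""
    cases = [dict(baseline)]
    for name in ("PVE_IE", "PVE_A", "PVE_DE"):
        case = dict(baseline)
        case[name] = baseline[name] / 2
        cases.append(case)
    return cases


# (sparsity of signals, degree of correlation) for Settings (1) to (3).
STRUCTURES = [("Sparse", "Baseline"), ("Sparse", "High"), ("Non-sparse", "Baseline")]
SAMPLE_SIZES = [2500, 1000]

settings = [
    {"p": 2000, "n": n, "sparsity": sparsity, "correlation": correlation, **case}
    for n, (sparsity, correlation), case in product(
        SAMPLE_SIZES, STRUCTURES, pve_cases(BASELINE)
    )
]

mechanisms = {
    (s["sparsity"], s["correlation"], s["PVE_A"], s["PVE_IE"], s["PVE_DE"])
    for s in settings
}
print(len(settings), len(mechanisms))
```

The crossing yields 24 rows in total, corresponding to 12 data-generating mechanisms at each of the two sample sizes.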


Simulated dataset creation

First, to obtain sparse mediation effects for Settings (1) and (2), we assume that 1,920 of the 2,000 coefficients $(α_{a})_{j}$ and $(β_{m})_{j}$ are zero and the remaining 80 are standard normal. Twenty of the nonzero $(α_{a})_{j}$ and $(β_{m})_{j}$ are chosen to overlap and have $(α_{a})_{j} (β_{m})_{j}$ not equal to zero. To obtain non-sparse signals for Setting (3), we sample the previously zero coefficients from a normal distribution with mean zero and standard deviation 0.2. (These parameter vectors are sampled only once, at the start of the simulations, so that the global mediation effect is held constant, but we shuffle the mediators in each dataset so that different mediators are assigned the effects each time.) Once we have these, we obtain a single simulated dataset by sampling $A_{i}$ from a standard normal distribution, then produce $M_{i}$ from model (4) assuming there are no covariates. We add noise to $M_{i}$ by sampling residuals from a multivariate normal distribution with mean $0_{p}$ and variance $Σ$ , where $Σ$ is derived by shuffling and then tuning the variance-covariance of the observed methylation data (see supplementary section 1). In Settings (1) and (3), we tune $Σ$ so that the correlations between mediators range from −0.37 to 0.49, and in Setting (2), so that they range from −0.58 to 0.75. We fix PVE_A by scaling $Σ$ appropriately based on $α_{a}$ . Finally, we define $Y_{i}$ based on model (3) assuming the residuals are Normal(0, $σ^{2}$ ), choosing $β_{a}$ and $σ^{2}$ to yield the desired PVE_DE and PVE_IE.
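
The sparse coefficient structure for Settings (1) and (2) can be sketched in Python as follows: 80 nonzero standard normal entries in each of $α_{a}$ and $β_{m}$ , with 20 positions shared. How the nonzero positions are chosen, the seed, and the function name `sparse_coefficients` are assumptions of this sketch; the generation of $M_{i}$ , $Σ$ , and $Y_{i}$ is not shown.

```python
import random


def sparse_coefficients(p=2000, n_nonzero=80, n_overlap=20, seed=1):
    """Draw sparse alpha_a and beta_m vectors with a fixed number of shared positions."""
    rng = random.Random(seed)
    # Distinct positions: shared ones first, then alpha-only, then beta-only.
    indices = rng.sample(range(p), 2 * n_nonzero - n_overlap)
    shared = indices[:n_overlap]
    alpha_only = indices[n_overlap:n_nonzero]
    beta_only = indices[n_nonzero:]

    alpha = [0.0] * p
    beta = [0.0] * p
    for j in shared + alpha_only:
        alpha[j] = rng.gauss(0.0, 1.0)
    for j in shared + beta_only:
        beta[j] = rng.gauss(0.0, 1.0)
    return alpha, beta


alpha_a, beta_m = sparse_coefficients()
n_active = sum(1 for a, b in zip(alpha_a, beta_m) if a * b != 0.0)
print(n_active)
```

With these defaults, each vector has 1,920 zeros and the number of active mediators (nonzero products) is 20.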

Evaluation

We evaluate the methods by applying them to 100 replicates of each setting in Table 5. We omit SPCMA, GMM, and LVMA for computational reasons, as they are too computationally costly to deploy on so many replicates, and omit HDMM because it does not have an estimand that is comparable to the others. We include a one-at-a-time approach (in which the mediators are assessed individually using traditional mediation analysis and the joint significance test⁴⁴) as a baseline for comparison. When running HIMA, HDMA, MedFix, and pathway LASSO, we pre-screen the mediators to only include the $n / \log(n)$ mediators with the strongest associations with Y adjusting for A, which is the approach recommended by the HIMA and HDMA authors^34,35 (see supplementary section 2 for more details). For comparison metrics, we use (1) the true positive rate for detecting active mediators, $\text{TPR} = \frac{\text{number of true mediators identified}}{\text{number of true mediators}}$ ; (2) the mean squared error in estimating the mediation contributions of inactive mediators, $\text{MSE}_{\text{Inactive}} = \text{mean}_{j : \text{Inactive}} \left((\hat{α}_{a})_{j} (\hat{β}_{m})_{j} - (α_{a})_{j} (β_{m})_{j}\right)^{2}$ ; (3) the mean squared error in estimating the mediation contributions of active mediators, $\text{MSE}_{\text{Active}} = \text{mean}_{j : \text{Active}} \left((\hat{α}_{a})_{j} (\hat{β}_{m})_{j} - (α_{a})_{j} (β_{m})_{j}\right)^{2}$ ; and (4) the percent relative bias in estimating the global indirect effect, $\frac{\left| \widehat{α_{a}^{T} β_{m}} - α_{a}^{T} β_{m} \right|}{α_{a}^{T} β_{m}} \times 100$ . In the non-sparse setting, since all the mediators contribute to the indirect effect, we consider the “active” ones to be those whose mediator-outcome and exposure-mediator effects both come from the distribution with higher variance, and the others inactive. Each metric is computed for each dataset and each applicable method, and we report the average and a 95% empirical confidence interval over the 100 replicates.
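
The four comparison metrics translate directly into code. The Python sketch below implements them as written above and applies them to a toy example with p = 4 mediators; the function names and all coefficient values are hypothetical.

```python
def true_positive_rate(selected, active):
    """Share of truly active mediators that were identified."""
    active = set(active)
    return len(active & set(selected)) / len(active)


def mse_contributions(alpha_hat, beta_hat, alpha, beta, indices):
    """Mean squared error of the estimated contributions over the given mediators."""
    errors = [
        (alpha_hat[j] * beta_hat[j] - alpha[j] * beta[j]) ** 2 for j in indices
    ]
    return sum(errors) / len(errors)


def percent_relative_bias(ie_hat, ie_true):
    """Absolute error in the global indirect effect, relative to the truth, in percent."""
    return abs(ie_hat - ie_true) / ie_true * 100


# Hypothetical truth and estimates for p = 4 mediators.
alpha = [1.0, 0.5, 0.0, 0.0]
beta = [0.5, 1.0, 0.0, 2.0]
alpha_hat = [1.0, 0.0, 0.0, 0.5]
beta_hat = [0.5, 1.0, 0.0, 0.5]

active = [j for j in range(4) if alpha[j] * beta[j] != 0.0]
inactive = [j for j in range(4) if j not in active]
selected = [0, 3]  # mediators a hypothetical method declared active

ie_true = sum(a * b for a, b in zip(alpha, beta))
ie_hat = sum(a * b for a, b in zip(alpha_hat, beta_hat))

print(true_positive_rate(selected, active))
print(mse_contributions(alpha_hat, beta_hat, alpha, beta, inactive))
print(mse_contributions(alpha_hat, beta_hat, alpha, beta, active))
print(percent_relative_bias(ie_hat, ie_true))
```

In a simulation study these functions would be applied to each replicate and then summarized across replicates.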

Scalability comparison

We compare the scalability of the methods by assessing their processing time on simulated datasets of two sizes: one with 100 observations and 200 mediators and one with 1,000 observations and 2,000 mediators. For the larger dataset, we use one of the datasets created for the simulation study, and for the smaller dataset, we subset the rows and columns of $M$ and the entries in $A$ and $Y$ . Run times are assessed on a single core of an Intel(R) Xeon(R) Gold 6242R CPU @ 3.10GHz processor. We attempt each method 30 times and report the mean and interquartile range of the computation times. Since SPCMA and BSLMM tend to be time-consuming, we approximate their run times by downscaling the appropriate parameters: In particular, since the desired number of principal components in SPCMA is 100, we use only 2 principal components and scale the computing time by 50; and since the desired number of posterior samples in BSLMM is 30,000, we draw only 750 samples and scale the result by 40. Ad hoc experimentation confirmed that the methods were approximately linear with respect to these inputs.
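
The scaling factors quoted above follow from a linear extrapolation, as in this short Python sketch. The measured time of 12 seconds is a hypothetical value used only to show the arithmetic, and the function name is illustrative.

```python
def extrapolate_runtime(measured_seconds, desired_units, used_units):
    """Scale a measured run time linearly from used_units up to desired_units."""
    return measured_seconds * (desired_units / used_units)


spcma_factor = 100 / 2       # desired vs. computed principal components
bslmm_factor = 30000 / 750   # desired vs. drawn posterior samples
print(spcma_factor, bslmm_factor)

# Hypothetical measurement: 12 seconds for 2 principal components.
print(extrapolate_runtime(12.0, desired_units=100, used_units=2))
```

The extrapolation is only as good as the assumption that run time grows approximately linearly in the scaled input.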

Data application with MESA

To demonstrate how these methods can be applied to observed DNAm data, we evaluate the association between SES and HbA1c and its potential mediation through DNAm. For the exposure, we use a binary variable that indicates low educational attainment (less than a 4-year college degree); for the outcome, we use HbA1c, a continuous variable that reflects average three-month blood glucose level. Our data for this portion come from the Multi-Ethnic Study of Atherosclerosis (MESA), a US population-based longitudinal study¹⁷. Out of 6,814 total participants, a random subsample of 1,264 had their DNAm measured at 484,882 CpG sites. We limit our analysis to the 963 participants who (1) had methylation data, (2) had no missing data for the required variables, (3) consented to genetic and phenotypic use through the database of Genotypes and Phenotypes (dbGaP) (phs000209.v13.p3), and (4) were not on diabetes medication, which can cause changes in HbA1c (Fig. 6). Standard quality control filters reduced the number of CpG sites to 402,339. Since it is not statistically or computationally feasible to include so many mediators at once, we used a screening procedure to reduce that number further, fitting model (6) below for each mediator separately to choose the 2,000 CpG sites at which DNAm was most strongly associated with education based on the $(α_{a})_{j}$ p-value. These 2,000 formed the baseline set of CpGs for our analysis. DNAm was measured using M-values, defined as the log-2 ratio of the methylated to unmethylated probe intensities, which has the advantage of occurring on a continuous and unbounded scale⁵⁶. For more details see supplementary section 3. A model for the proposed mechanism is given by

$$E [\text{HbA1c}_{i} \mid \text{Education}_{i}, \text{DNAm}_{i}, \text{Covariates}_{i}] = β_{a} \text{Education}_{i} + β_{m}^{T} \text{DNAm}_{i} + β_{c}^{T} \text{Covariates}_{i} \qquad (5)$$

and

$$E [\text{DNAm}_{i} \mid \text{Education}_{i}, \text{Covariates}_{i}] = α_{a} \text{Education}_{i} + α_{c} \text{Covariates}_{i}, \qquad (6)$$

where the covariates include age, sex, race, and the estimated proportions of residual non-monocytes (i.e., neutrophils, B cells, T cells, and natural killer cells) as fixed effects and methylation chip and position as random effects.
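
For reference, the M-value described above, the log-2 ratio of methylated to unmethylated probe intensities, can be computed as in the following Python sketch. The intensities are hypothetical, and the sketch omits any offset or other preprocessing that a methylation pipeline may apply.

```python
import math


def m_value(methylated, unmethylated):
    """Log-2 ratio of methylated to unmethylated probe intensity."""
    if methylated <= 0 or unmethylated <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(methylated / unmethylated)


# Hypothetical probe intensities.
print(m_value(4000, 1000))  # more methylated than unmethylated signal
print(m_value(1000, 1000))  # equal signal
print(m_value(500, 2000))   # less methylated than unmethylated signal
```

Positive values indicate more methylated than unmethylated signal, negative values the reverse, and the scale is unbounded in both directions.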

We performed mediation analysis on the final dataset of 963 individuals and 2,000 CpG sites. All of the mediation methods described above were included except for GMM and LVMA, which again were too costly computationally. Although it is reasonable for some of the methods to include all 2,000 CpG sites directly in the multivariable model, HIMA and HDMA involve sure independence screening⁵⁷ to reduce the number of mediators in advance to n/log(n), where n is the sample size. For the sake of consistency across the penalized regression methods, we do so with not only HIMA and HDMA, but also MedFix and pathway LASSO, including only the 141 (963/log(963)) CpG sites most associated with low education (a direct extension of the initial screening). (Note that, for HIMA and HDMA, this screening is part of the proposed method, not separate from it, but for MedFix and pathway LASSO the additional screening is still beneficial for the sake of comparing methods and for statistical and computational efficiency.) Additional pre-screening is not necessary for PCMA, SPCMA, BSLMM, and HILMA, and we include all 2,000 CpG sites directly; however, in HDMM, which cannot readily accommodate p > n, we again use only the twice-screened subset of 141 sites. For the sake of comparison with multivariate methods, we also include a one-at-a-time mediation method based on linear regression and the joint significance test. For estimating the total effect, the methods PCMA, SPCMA, BSLMM, and pathway LASSO all produce estimates of the direct effect, so we can estimate the total effect by summing the estimated direct and global indirect effects. Since the methods HIMA, HDMA, and MedFix do not produce estimates of the direct effect, we first estimate the total effect on its own by fitting model (5) with the mediators excluded, then subtract the estimated global indirect effect from this value to obtain an estimate of the direct effect.

As none of the high-dimensional methods are built to directly handle random effects as covariates, we regress these out of the outcome variable and potential mediators in advance. For the fixed effect covariates, HIMA, HDMA, MedFix, and BSLMM allow one to include them directly; whereas in PCMA, SPCMA, HILMA, HDMM, and pathway LASSO, we regress them out in advance from the outcome and mediators. Continuous variables (including HbA1c and the mediators) were standardized for all methods. All analysis was conducted using R version 4.2.1.
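
The screening size of 141 can be reproduced with a short calculation, shown here in Python together with a generic helper that keeps the mediators with the smallest p-values. Rounding n/log(n) up to the next integer is an assumption that matches the reported value; the p-values and function names are hypothetical.

```python
import math


def screening_size(n):
    """Number of mediators retained: n / log(n) with the natural log, rounded up."""
    return math.ceil(n / math.log(n))


def keep_strongest(p_values, k):
    """Indices of the k mediators with the smallest p-values, in original order."""
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    return sorted(order[:k])


print(screening_size(963))  # 141

# Hypothetical screening p-values for six mediators, keeping the top three.
p_values = [0.20, 0.001, 0.03, 0.50, 0.0004, 0.07]
print(keep_strongest(p_values, 3))
```

The same rounding gives 145 retained mediators for n = 1,000 and 320 for n = 2,500, the two sample sizes used in the simulations.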

Supplementary Material

Supplement 1

media-1.pdf (1.6 MB, PDF)

Supplement 2

media-2.xlsx (43.1 KB, XLSX)

Acknowledgements

MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR-001881, and DK063491. The MESA Epigenomics & Transcriptomics Studies were funded by NIH grants 1R01HL101250, 1RF1AG054474, R01HL126477, R01DK101921, and R01HL135009. Co-authors of this manuscript were partially supported by NHLBI grant R01HL141292, NSF grant DMS1712933, and NIH grants R01HG008773 and 1UG3CA267907.

Data Availability

Data used for the simulation study are available from the authors upon request. Data used in the DNAm analysis can be obtained through the MESA Data Coordinating Center (https://www.mesanhlbi.org/).

Code Availability

R scripts for the analysis are available at https://github.com/dclarkboucher/mediation_DNAm. Our R package “hdmed” can be found at https://github.com/dclarkboucher/hdmed.

References

1. Moore L. D., Le T. & Fan G. DNA methylation and its basic function. Neuropsychopharmacol. Off. Publ. Am. Coll. Neuropsychopharmacol. 38, 23–38 (2013).
2. Kurdyukov S. & Bullock M. DNA Methylation Analysis: Choosing the Right Method. Biology (Basel). 5, (2016).
3. Dick K. J. et al. DNA methylation and body-mass index: a genome-wide analysis. Lancet (London, England) 383, 1990–1998 (2014).
4. Wahl S. et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature 541, 81–86 (2017).
5. Volkmar M. et al. DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients. EMBO J. 31, 1405–1426 (2012).
6. Chilunga F. P. et al. Genome-wide DNA methylation analysis on C-reactive protein among Ghanaians suggests molecular links to the emerging risk of cardiovascular diseases. NPJ genomic Med. 6, 46 (2021).
7. Nakatochi M. et al. Epigenome-wide association of myocardial infarction with DNA methylation sites at loci related to cardiovascular disease. Clin. Epigenetics 9, 54 (2017).
8. Fujii R. et al. Dietary fish and ω-3 polyunsaturated fatty acids are associated with leukocyte ABCA1 DNA methylation levels. Nutrition 81, 110951 (2021).
9. Sun Y. V et al. Epigenomic association analysis identifies smoking-related DNA methylation sites in African Americans. Hum. Genet. 132, 1027–1037 (2013).
10. Philibert R. A., Plume J. M., Gibbons F. X., Brody G. H. & Beach S. R. H. The impact of recent alcohol use on genome wide DNA methylation signatures. Front. Genet. 3, 54 (2012).
11. Rider C. F. & Carlsten C. Air pollution and DNA methylation: effects of exposure in humans. Clin. Epigenetics 11, 131 (2019).
12. Lam L. L. et al. Factors underlying variable DNA methylation in a human community cohort. Proc. Natl. Acad. Sci. U. S. A. 109 Suppl, 17253–17260 (2012).
13. Needham B. L. et al. Life course socioeconomic status and DNA methylation in genes related to stress reactivity and inflammation: The multi-ethnic study of atherosclerosis. Epigenetics 10, 958–969 (2015).
14. Fujii R., Sato S., Tsuboi Y., Cardenas A. & Suzuki K. DNA methylation as a mediator of associations between the environment and chronic diseases: A scoping review on application of mediation analysis. Epigenetics 1–27 (2021) doi: 10.1080/15592294.2021.1959736.
15. Song Y. et al. Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies. Biometrics 76, 700–710 (2020).
16. Du J. et al. Methods for Large-scale Single Mediator Hypothesis Testing: Possible Choices and Comparisons. (2022) doi: 10.48550/arxiv.2203.13293.
17. Bild D. E. et al. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am. J. Epidemiol. 156, 871–881 (2002).
18. Whitaker S. M. et al. The Association Between Educational Attainment and Diabetes Among Men in the United States. American journal of men’s health vol. 8 (2014).
19. Sakurai M. et al. HbA1c and the risks for all-cause and cardiovascular mortality in the general Japanese population: NIPPON DATA90. Diabetes Care 36, 3759–3765 (2013).
20. Singer D. E., Nathan D. M., Anderson K. M., Wilson P. W. & Evans J. C. Association of HbA1c with prevalent cardiovascular disease in the original cohort of the Framingham Heart Study. Diabetes 41, 202–208 (1992).
21. Yeung S. L. A., Luo S. & Schooling C. M. The Impact of Glycated Hemoglobin (HbA(1c)) on Cardiovascular Disease Risk: A Mendelian Randomization Study Using UK Biobank. Diabetes Care 41, 1991–1997 (2018).
22. Borghol N. et al. Associations with early-life socio-economic position in adult DNA methylation. Int. J. Epidemiol. 41, 62–74 (2012).
23. Chen Z. et al. DNA methylation mediates development of HbA1c-associated complications in type 1 diabetes. Nat. Metab. 2, 744–762 (2020).
24. Baron R. M. & Kenny D. A. The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations. Journal of personality and social psychology vol. 51.
25. MacKinnon D. Introduction to statistical mediation analysis. (New York, NY: Erlbaum).
26. VanderWeele T. J. Marginal Structural Models for the Estimation of Direct and Indirect Effects. Epidemiology 20, (2009).
27. Pearl J. Direct and Indirect Effects. in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence 411–420 (Morgan Kaufmann Publishers Inc., 2001).
28. Robins J. M. & Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology 3, 143–155 (1992).
29. VanderWeele T. J. Mediation Analysis: A Practitioner’s Guide. Annu. Rev. Public Health 37, 17–32 (2016).
30. VanderWeele T. J. Explanation in causal inference: methods for mediation and interaction. (Oxford University Press, 2015).
31. Du J. et al. Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons. Genet. Epidemiol. n/a, (2022).
32. Aung M. T. et al. Application of an analytical framework for multivariate mediation analysis of environmental data. Nat. Commun. 11, 5624 (2020).
33. VanderWeele T. J. & Vansteelandt S. Mediation Analysis with Multiple Mediators. Epidemiol. Method. 2, 95–115 (2014).
34. Zhang H. et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics 32, 3150–3154 (2016).
35. Gao Y. et al. Testing Mediation Effects in High-Dimensional Epigenetic Studies. Front. Genet. 10, 1195 (2019).
36. Zhang Q. High-Dimensional Mediation Analysis with Applications to Causal Gene Identification. Statistics in biosciences (2021).
37. Zhao Y. & Luo X. Pathway LASSO: pathway estimation and selection with high-dimensional mediators. Stat. Interface 15, 39–50 (2022).
38. Zhou R. R., Wang L. & Zhao S. D. Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika 107, 573–589 (2020).
39. Huang Y.-T. & Pan W.-C. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics 72, 402–413 (2016).
40. Zhao Y., Lindquist M. A. & Caffo B. S. Sparse principal component based high-dimensional mediation analysis. Comput. Stat. Data Anal. 142, 106835 (2020).
41. Chén O. Y. et al. High-dimensional multivariate mediation with application to neuroimaging data. Biostatistics vol. 19 (2018).
42. Song Y. et al. Bayesian sparse mediation analysis with targeted penalization of natural indirect effects. J. R. Stat. Soc. Ser. C (Applied Stat.) 70, 1391–1412 (2021).
43. Derkach A., Pfeiffer R. M., Chen T.-H. & Sampson J. N. High dimensional mediation analysis with latent variables. Biometrics 75, 745–756 (2019).
44. MacKinnon D. P., Lockwood C. M., Hoffman J. M., West S. G. & Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol. Methods 7, 83–104 (2002).
45. Pedroso J. A. B., Ramos-Lobo A. M. & Donato J. J. SOCS3 as a future target to treat metabolic disorders. Hormones (Athens). 18, 127–136 (2019).
46. Wang Y. Z. et al. DNA Methylation Mediates the Association Between Individual and Neighborhood Social Disadvantage and Cardiovascular Risk Factors. Front. Cardiovasc. Med. 9, 848768 (2022).
47. Stage E. et al. The effect of the top 20 Alzheimer disease risk genes on gray-matter density and FDG PET brain metabolism. Alzheimer’s Dement. (Amsterdam, Netherlands) 5, 53–66 (2016).
48. Mei H. et al. Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis. Genes (Basel). 8, (2017).
49. Rahman S. A., Nessa A. & Hussain K. Molecular mechanisms of congenital hyperinsulinism. J. Mol. Endocrinol. 54, R119–R129 (2015).
50. Galcheva S., Al-Khawaga S. & Hussain K. Diagnosis and management of hyperinsulinaemic hypoglycaemia. Best Pract. Res. Clin. Endocrinol. Metab. 32, 551–573 (2018).
51. Pedroso J. A. B. et al. Inactivation of SOCS3 in leptin receptor-expressing cells protects mice from diet-induced insulin resistance but does not prevent obesity. Mol. Metab. 3, 608–618 (2014).
52. Senniappan S., Shanti B., James C. & Hussain K. Hyperinsulinaemic hypoglycaemia: genetic mechanisms, diagnosis and management. J. Inherit. Metab. Dis. 35, 589–601 (2012).
53. Huang Y.-T., Vanderweele T. J. & Lin X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann. Appl. Stat. 8, 352–376 (2014).
54. Zhang C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010).
55. Zhang S. S. & Zhang C.-H. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society. Series B, Statistical methodology vol. 76 (2014).
56. Du P. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010).
57. Fan J. & Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. 70, 849–911 (2008).
