Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons

Dylan Clark-Boucher; Xiang Zhou; Jiacong Du; Yongmei Liu; Belinda L Needham; Jennifer A Smith; Bhramar Mukherjee

doi:10.1371/journal.pgen.1011022

PLoS Genet. 2023 Nov 7;19(11):e1011022. doi: 10.1371/journal.pgen.1011022

Editor: Zoltán Kutalik

PMCID: PMC10655967 PMID: 37934796

Abstract

Epigenetic researchers often evaluate DNA methylation as a potential mediator of the effect of social/environmental exposures on a health outcome. Modern statistical methods for jointly evaluating many mediators have not been widely adopted. We compare seven methods for high-dimensional mediation analysis with continuous outcomes through both diverse simulations and analysis of DNAm data from a large multi-ethnic cohort in the United States, while providing an R package for their seamless implementation and adoption. Among the considered choices, the best-performing methods for detecting active mediators in simulations are the Bayesian sparse linear mixed model (BSLMM) and high-dimensional mediation analysis (HDMA); while the preferred methods for estimating the global mediation effect are high-dimensional linear mediation analysis (HILMA) and principal component mediation analysis (PCMA). We provide guidelines for epigenetic researchers on choosing the best method in practice and offer suggestions for future methodological development.

Author summary

DNA methylation is an epigenetic mechanism that regulates the expression of genes, turning them “on” or “off” to meet the needs of the cell. Changes in methylation activity are associated with both health conditions and socioeconomic factors like education and access to healthcare. Recently, researchers have been interested in whether DNA methylation may act as a link between socioeconomic disadvantage and health. Standard methods to investigate whether DNA methylation is a link, or a mediator, between disadvantage and health do not work well when there are multiple mediators—in this case, DNA methylation sites—under consideration. Our study reviews 12 statistical methods for mediation analysis that can be used to analyze many methylation sites simultaneously. We compare the methods on simulated data and provide guidelines and software for their implementation. We then demonstrate how the methods can be applied to real methylation data by testing whether DNA methylation sites across the genome mediate the effect of lower educational attainment on HbA1c, an important marker of type II diabetes.

Introduction

In this study, we review and evaluate several available methods for performing mediation analysis when the mediators are high-dimensional DNA methylation (DNAm) measurements. DNAm is an epigenomic mechanism in which a methyl group binds to the DNA—a process that most often occurs at cytosine-guanine dinucleotides, called “CpG sites.” One of the primary functions of DNAm is to regulate gene expression. For example, when CpG sites in the promoter regions of genes become methylated, it can discourage gene expression by inhibiting the binding of enzymes needed for transcription [1].

Advancements in modern technology have made it possible to measure DNAm on a massive scale. Indeed, microarray techniques have been used to measure more than 850,000 CpG sites at once, producing rich, detailed data that has encouraged broad research on DNAm in the etiology of disease [2]. Owing greatly to this technology and others, DNAm has been established as a risk factor in obesity [3], type II diabetes [4], schizophrenia [5], preterm birth [6], breast cancer [7], cardiovascular disease [8], and countless other conditions spanning physical and mental health. A focus of research in genetic epidemiology has been to interrogate these relationships for their predictive utility [9], biological mechanisms [10], and causality in relation to medical phenotypes [11].

However, in addition to its well-established connections to disease and health outcomes, DNAm is also associated with environmental exposures which themselves are known to affect human health. Factors such as diet [12], smoking [13], alcohol [14], air pollution [15], and socioeconomic status (SES) [16] are only a handful of the many environmental exposures that have been shown to be associated with differences in DNAm. As each of these exposures has its own health risks, there have been mounting hypotheses that DNAm serves as a conduit through which assaults from the exposome are able to affect health. Effect transmission of this nature is called mediation, and it has become popular in epigenetic research to treat DNAm as a mediator between environmental exposures and human disease [17].

As an example of such an analysis, our previous work [18] showed associations between low SES and glycated hemoglobin (HbA1c) in the Multi-Ethnic Study of Atherosclerosis (MESA), a United States population-based longitudinal study [19]. Indicators of SES, such as education level, are strong predictors of type II diabetes [20], while HbA1c is an important risk factor of cardiovascular disease and a critical biomarker in type II diabetes diagnosis [21]. Since education level is also associated with DNAm [16,22,23], and DNAm itself with HbA1c level [24], we hypothesized that if low education results in greater HbA1c, part of that effect could be mediated by DNAm (Fig 1). The present study revisits this hypothesis for the purpose of illustration. Our sample from MESA has 963 individuals and includes DNAm measurements at 402,339 CpG sites, none of which we know for certain are related to education or HbA1c in advance.

The standard statistical tool for addressing such a hypothesis is mediation analysis. Formally, mediation is when an exposure, say A, affects an outcome, Y, in part through its effect on a single mediating variable M. When M is a mediator of the A to Y association, the total effect of A on Y has two components: an indirect effect, from A affecting M and M affecting Y, and a direct effect, from A affecting Y independently of M. In the “traditional mediation analysis” approach proposed by Baron and Kenny (1986), the associations from this mechanism could be measured by fitting linear regression models: one for the effect of A on M (the mediator model), one for the effects of A and M on Y (the outcome model), and sometimes a third model for the total effect of A on Y, M ignored [25–27]. The more recently developed “causal mediation analysis,” based on the counterfactual approach [28,29], has established conditions under which the parameters of these models can be interpreted as causal effects [30]. The causal approach is more flexible when Y or M are binary and when there is A-M interaction in the outcome model [31].

While standard examples of mediation consider only one exposure, one mediator, and one outcome [18,32], there has been growing interest in methods for mediation that can handle many potential mediators at once. Epigenetic studies have felt this need especially, as DNAm is usually measured at several hundred thousand CpG sites with little prior knowledge of their importance. Moreover, although a naïve strategy in such settings would be to evaluate the potential mediators one-at-a-time, in separate models, this approach can be problematic when the mediators are correlated conditional on the exposure variable and covariates, since the resulting estimates may be biased due to confounding from the co-mediators that were excluded [18]. There could also be a loss in efficiency from failing to exploit the joint multivariable structure. To reduce the risk of bias and to increase precision, it is better to evaluate the mediators jointly and fit a single, multivariable outcome model that adjusts the effect of each mediator for the others, rather than fitting multiple one-at-a-time models. Though several methods for fitting such a model have been presented in the literature, they have yet to be widely adopted by practitioners and researchers for investigating substantive questions on high-dimensional mediation analysis with DNAm.

Our study aims to bridge this gap and guide researchers in epigenetics to use state-of-the-art methods for mediation analysis with high-dimensional mediators. Despite the recent methodological developments, there are no clear-cut standards for which methods should be applied in which circumstances, making it difficult to select the best-suited method for an analysis in advance. While our prior research examined methods for large scale single-mediator hypotheses [18], there is no such work for methods that can incorporate many potential mediators simultaneously. Our study first addresses this question with a simulation study, directly comparing the performance of seven different methods for mediation analysis with high-dimensional mediators across a spectrum of settings. Along with metrics related to identification of key mediators and estimation of mediation effect, we include a computation time comparison that tests the scalability of the methods to large datasets. In addition, to assess the utility of these methods on real, large-scale DNAm data, we apply the same seven methods from the simulation study, plus two additional methods, to the DNAm data provided by MESA, where we evaluate the mediating role of DNAm in the association between low education level and HbA1c. Our study is the first to address this critical gap in the applied epigenetic literature, both by providing clarity on the available methods and by assessing their strengths and weaknesses under real and simulated conditions. Although the focus of our study is applications involving DNAm, the methods explored in the text are not specific to epigenetics, and our results and guidelines should be similarly useful for researchers studying high-dimensional mediation problems in other fields.

Another key feature of our study is the presentation of a versatile, user-friendly, and well-documented R package for implementing the methods described in the text. Computer code for the methods has been made available previously, but is found in a varyingly functioning and de-centralized state across the many repositories, supplementary files, and R packages assembled by the methods’ authors. Our work centralizes these resources into a single, stand-alone R package hdmed (https://cran.r-project.org/package=hdmed), which has the flexibility to apply multiple methods for high-dimensional mediation analysis in one place. It is our hope that by synthesizing these methods into a confined, usable package, we will catalyze the translation of our study and findings into practical, insightful research pursuits in genetic epidemiology and other fields.

Notations and general framework

Before proceeding, it will be useful to provide an overview of the relevant mediation model and to summarize the types of methods which have become available. To begin, suppose we have a dataset of n individuals: an exposure A_i, a continuous outcome Y_i, and continuous mediators M_i measured for the i^th person, i varying from 1 to n. We write M_i in bold to indicate its status as a vector—in this case, a set of p mediators M_i^(j), j varying from 1 to p. Let C_i be a vector of q covariates. When p is greater than 1 (and possibly greater than n), we can use the regression models

$$E [Y_{i} \mid A_{i}, M_{i}, C_{i}] = β_{a} A_{i} + β_{m}^{T} M_{i} + β_{c}^{T} C_{i} \tag{1}$$

and

$$E [M_{i} \mid A_{i}, C_{i}] = α_{a} A_{i} + α_{c} C_{i} \tag{2}$$

to estimate the mediating role of M_i in the causal pathway between the exposure and outcome [33]. Model (1) is the outcome model and model (2) is the mediator model. In model (1), β_m is a p-vector in which the j^th component, (β_m)_j, is the linear association of the j^th mediator with Y_i adjusting for the other variables, while β_a is the association between A_i and Y_i adjusting for mediators and covariates. In model (2), α_a is a p-vector of the associations between the exposure and each mediator, (α_a)_j, and α_c is a matrix of the mediator-covariate associations. Also note that in model (1), we have assumed there is no interaction between A_i and M_i; modeling such an interaction is beyond the scope of our present study.

The parameters of these models underlie the causal effects of interest. Under certain assumptions (Section 1 in S1 Text) [28,33], the direct effect of A_i on Y_i is β_a, the global indirect effect (or global mediation effect) of A_i on Y_i through M_i is α_a^Tβ_m, and the total effect of A_i on Y_i is β_a + α_a^Tβ_m. Another quantity of interest is the proportion mediated, defined as the ratio of the global indirect effect to the total effect, which measures the degree to which the A_i to Y_i pathway is mediated by M_i. Lastly, we may also seek to measure the product terms (α_a)_j(β_m)_j, which we will call the mediation contributions. The mediation contribution of the j^th mediator reflects the mathematical contribution of that mediator to the global mediation effect, since the sum of (α_a)_j(β_m)_j over all j equals α_a^Tβ_m. These parameters are intuitive to estimate, but difficult to interpret. Though it is tempting to refer to (α_a)_j(β_m)_j as a causal effect corresponding to the j^th mediator, we emphasize that this parameter cannot generally be interpreted as the natural indirect effect through that mediator specifically. Identifying the indirect effects of specific mediators, in settings with multiple mediators, requires strong assumptions about whether the group of mediators is sequentially ignorable; these conditions would be violated, for example, in situations where a subset of mediators have causal effects on some of the others. (The exact assumptions are not described here, as they would require a digression into counterfactual inference; see [34].) Despite the limited interpretability of the mediation contributions, we will refer to a mediator as inactive if its mediation contribution is zero, and active otherwise. This has the caveat that if a mediation contribution is zero, that mediator could still be involved in the causal path from A to Y, since complex causal relationships among the set of mediators might exist.
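As a minimal sketch of how these quantities relate, the following Python function computes the effect decomposition from given coefficients of models (1) and (2). The function name, the dictionary keys, and the numeric inputs are illustrative assumptions, not part of hdmed or of any method reviewed here, and the causal reading of each number still depends on the assumptions referenced above.

```python
def mediation_effects(beta_a, alpha_a, beta_m):
    """Effect decomposition from the coefficients of models (1) and (2).

    beta_a  : direct effect of A on Y (scalar)
    alpha_a : exposure-mediator associations, length p
    beta_m  : mediator-outcome associations, length p
    """
    if len(alpha_a) != len(beta_m):
        raise ValueError("alpha_a and beta_m must both have length p")
    # Mediation contributions (alpha_a)_j * (beta_m)_j, one per mediator.
    contributions = [a * b for a, b in zip(alpha_a, beta_m)]
    indirect = sum(contributions)          # global indirect effect
    total = beta_a + indirect              # total effect
    proportion = indirect / total if total != 0 else float("nan")
    # A mediator is "active" when its mediation contribution is nonzero.
    active = [j for j, c in enumerate(contributions) if c != 0]
    return {
        "direct": beta_a,
        "indirect": indirect,
        "total": total,
        "proportion_mediated": proportion,
        "contributions": contributions,
        "active": active,
    }


# Hypothetical coefficients for p = 3 mediators.
effects = mediation_effects(beta_a=0.5,
                            alpha_a=[1.0, 0.0, 2.0],
                            beta_m=[0.25, 3.0, 0.0])
```

In this toy input only the first mediator is active: the second affects Y but is unaffected by A, and the third is affected by A but does not affect Y.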

If the potential mediators are uncorrelated, conditional on the exposure and covariates, or if p is reasonably small relative to n, then it is trivial to fit the above models using linear regression. However, if the mediators are correlated and p is large, the estimates from model (1) may have extremely high variance; and if p is so large as to exceed n, the linear regression model cannot even be fitted. These concerns are relevant to us because DNAm measurements tend to be correlated, while the number of sites that we have measurements on exceeds the number of samples. Addressing these issues has been a recent focus of the mediation literature, with authors using penalized regression [35–40], dimension reduction [41–43], Bayesian inference [44,45], and latent variables [46] to make the outcome model statistically tractable.

We provide a graphical depiction of 12 available methods in Fig 2. Eight of them are assessed in the simulation study, ten of them are used in the DNAm analysis, and all of them are described in the Methods section. To help elucidate the differences between methods, we partition them into three distinct groups based on their approaches and objectives. In the first group, we consider methods that explicitly fit the outcome and mediator models as we have defined them so that one can estimate α_a^Tβ_m, the global indirect effect, simply by summing the estimates of the mediation contributions. The methods for doing so are high-dimensional mediation analysis (HIMA) by Zhang et al. 2016 [35], high-dimensional mediation analysis (HDMA) by Gao et al. 2019 [36], mediation analysis via fixed effect model (MedFix) by Zhang 2019 [37], pathway least absolute shrinkage and selection operator (pathway LASSO) by Zhao and Luo 2022 [38], the Bayesian sparse linear mixed model (BSLMM) by Song et al. 2020 [44], and the Gaussian mixture model (GMM) by Song et al. 2021 [45]. In contrast, the second group of methods considers those that can estimate α_a^Tβ_m “directly,” without needing to fit the mediation models we began with in their original form. These methods have the drawback of being unable to estimate the mediation contributions of specific active mediators. They include principal component mediation analysis (PCMA) by Huang and Pan 2016 [41], sparse principal component mediation analysis (SPCMA) by Zhao et al. 2020 [42], high-dimensional linear mediation analysis (HILMA) by Zhou et al. 2021 [39], and a method we will call partial penalized high-dimensional mediation analysis (PMED), proposed by Guo et al. 2022 [40]. Lastly, the third group of methods are those that make no attempt to estimate the mediation effects as originally proposed, but that instead reconceptualize the mediation framework with newly-defined parameters based on latent variables. These methods include high-dimensional multivariate mediation analysis (HDMM) by Chén et al. 2018 [43] and latent variable mediation analysis (LVMA) by Derkach et al. 2021 [46]. Within this comparative structure, we evaluate methods from all three groups, identifying their strengths and weaknesses across a wide range of simulation settings and analysis of DNAm data from MESA.

Materials and methods

Overview of methods

Let A be an exposure, Y be a continuous outcome, and M be a set of p continuous variables that potentially mediate the causal path from A to Y. Then with A_i, Y_i, M_i, and covariates C_i measured for n subjects, i from 1 to n, we can evaluate the mediating role of M with models (1) and (2) as presented in the Introduction section. We provide an overview of 12 methods for mediation analysis that can accommodate this multivariate framework below. A tabular summary is given in the supplement (S1 Table).

Group 1 methods: Penalized regression to estimate mediator-specific contributions

HIMA

High-dimensional mediation analysis (HIMA) by Zhang et al. (2016) is a penalized regression approach in which the outcome model is fit with a minimax concave penalty [47], performing feature selection on the potential mediators [35]. The mediator models are then fit among the remaining mediators using ordinary linear regression. Finally, the “significance” of the mediation contributions is tested by taking the maximum of the (β_m)_j and (α_a)_j p-values, where the p-values for (β_m)_j are obtained by refitting the reduced outcome model with ordinary least squares (an approach which is likely to cause the p-values to be overconfident [48]). The authors also recommend an initial screening step to reduce the number of mediators at the start, as the outcome model will still be unstable if p is extremely large compared to n. A new version of HIMA, called HIMA2, was published recently [49]. HIMA2 is similar to HDMA, but suggests a p-value correction procedure that maintains the false discovery rate for detecting active mediators. HIMA2 is excluded from our comparison due to its similarity to HDMA.
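The maximum-of-p-values rule in the final testing step can be sketched in a few lines of Python. This is an illustration of the rule only, not the HIMA implementation: the function name, the example p-values, and the 0.05 threshold are assumptions, and no multiple-testing correction is applied here.

```python
def max_p_test(p_alpha, p_beta, level=0.05):
    """Joint test of each mediation contribution via the larger p-value.

    p_alpha : p-values for the exposure-mediator associations (alpha_a)_j
    p_beta  : p-values for the mediator-outcome associations (beta_m)_j
    level   : hypothetical significance threshold
    """
    if len(p_alpha) != len(p_beta):
        raise ValueError("need one p-value pair per mediator")
    # A contribution is flagged only if BOTH of its components are
    # significant, which is equivalent to thresholding the maximum.
    p_joint = [max(pa, pb) for pa, pb in zip(p_alpha, p_beta)]
    flagged = [p <= level for p in p_joint]
    return p_joint, flagged


# Hypothetical p-values for three mediators that survived selection.
p_joint, flagged = max_p_test([0.001, 0.20, 0.03], [0.04, 0.001, 0.60])
```

Only the first mediator is flagged, because it is the only one whose exposure-mediator and mediator-outcome p-values both fall below the threshold.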

HDMA

High-dimensional mediation analysis (HDMA) by Gao et al. (2019) is the same as HIMA except for its penalty function, replacing the minimax concave penalty with the recently proposed de-sparsified LASSO [36,50]. The advantage of this penalty is that the resulting estimates of β_m are asymptotically normally distributed, so one can test their statistical significance without a subsequent application of ordinary least squares.

MedFix

Mediation analysis via fixed effect model (MedFix) is another extension of HIMA, proposed by Zhang (2021) [37]. MedFix was originally proposed for a setting where there are multiple exposures in addition to multiple mediators, which it handles by applying adaptive LASSO [51] to both the outcome model and the mediator models. If there is only one exposure, linear regression can replace adaptive LASSO in the mediator models, and applying MedFix is analogous to applying HDMA except with adaptive LASSO instead of the de-sparsified (debiased) LASSO.

Pathway LASSO

Pathway LASSO is a penalized regression approach by Zhao and Luo (2022) [38]. Whereas HIMA, HDMA, and MedFix handle the outcome and mediator models separately, this method fits the models simultaneously with a jointly penalized likelihood that directly applies shrinkage to the mediator-outcome associations, exposure-mediator associations, and their products (the mediation contributions).

BSLMM

The Bayesian sparse linear mixed model (BSLMM) by Song et al. (2020) assumes α_a and β_m are random vectors that independently follow mixtures of normal distributions [44]. Most of the effects are assumed to be small, resulting from a normal distribution with low variance, while a minority are assumed to be larger and follow a normal distribution with higher variance. Active mediators are discriminated from inactive by their posterior inclusion probability of belonging to the higher-variance distribution.
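The two-component prior described above can be written schematically as follows. The symbols π, σ_0², σ_1², and the indicator γ_j are notation assumed here for illustration and may not match the parameterization in [44]; the same form would apply to (α_a)_j with its own mixing proportion and variances.

```latex
% Schematic two-component normal mixture prior (assumed notation).
\[
  (\beta_m)_j \sim \pi \,\mathcal{N}(0, \sigma_1^2)
               + (1 - \pi)\,\mathcal{N}(0, \sigma_0^2),
  \qquad \sigma_1^2 > \sigma_0^2 ,
\]
% gamma_j = 1 indicates membership in the higher-variance component;
% the posterior inclusion probability for mediator j is
\[
  \mathrm{PIP}_j = \Pr(\gamma_j = 1 \mid \text{data}).
\]
```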

GMM

The Gaussian mixture model (GMM) by Song et al. (2021) is an extension of BSLMM in which the (α_a)_j, (β_m)_j pairs are treated as correlated, following a mixture of multivariate normal distributions instead of two independent normal distributions [45]. Thus, GMM may be more useful than BSLMM if the true size of each (β_m)_j is related to the size of the corresponding (α_a)_j, and vice-versa.

Group 2 methods: Dimension reduction and direct estimation of global indirect effect

PCMA

Principal component mediation analysis (PCMA) by Huang and Pan (2016) is a mediation analysis method based on principal component analysis (PCA) [41]. The authors perform PCA on the residual matrix of the mediator models, then use the resulting loading matrix to transform M into a new set of mediators which are uncorrelated conditional on A and C. The transformed mediators then replace the original mediators in the analysis and are evaluated in a one-at-a-time fashion. In spite of the transformation, the global indirect effect α_a^Tβ_m can still be estimated with its original interpretation as the global mediation effect through M. The authors set the number of transformed mediators to be p, though this is only possible if p is less than n.
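The reason the global indirect effect survives the transformation can be seen with a short derivation. Here W denotes the p × p loading matrix and is assumed to be orthogonal (W W^T = I_p); the tilde notation for the transformed quantities is introduced only for this illustration.

```latex
% Transformed mediators and the coefficients they inherit
% from models (1) and (2), assuming W W^T = I_p.
\begin{align*}
  \tilde{M}_i &= W^{T} M_i, \\
  E[\tilde{M}_i \mid A_i, C_i] &= (W^{T}\alpha_a) A_i + (W^{T}\alpha_c) C_i
      && \Rightarrow\ \tilde{\alpha}_a = W^{T}\alpha_a, \\
  \beta_m^{T} M_i &= \beta_m^{T} W W^{T} M_i = (W^{T}\beta_m)^{T} \tilde{M}_i
      && \Rightarrow\ \tilde{\beta}_m = W^{T}\beta_m, \\
  \tilde{\alpha}_a^{T} \tilde{\beta}_m
      &= \alpha_a^{T} W W^{T} \beta_m = \alpha_a^{T} \beta_m .
\end{align*}
```

The individual products of the transformed coefficients, however, no longer correspond to the mediation contributions of the original mediators, which is why methods of this group report only the global effect.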

SPCMA

Zhao et al. (2019) proposed sparse principal component mediation analysis (SPCMA) to improve the interpretability of the results from PCMA [42]. In PCMA, the transformed mediators are difficult to interpret because they are sums of all p original mediators; whereas in SPCMA, the loading matrix is sparsified so that the transformed mediators are sums of only a subset of the original mediators. Thus, if a specific transformed mediator has a large effect, it can potentially be traced back to the original mediators which were used to construct it. Though the added sparsity induces bias, it can be helpful for identifying groups of mediators which may be active.

HILMA

High-dimensional linear mediation analysis (HILMA) by Zhou et al. (2020) estimates α_a^Tβ_m with a complex, de-biased penalized regression procedure whose details are beyond the scope of this article [39]. The proposed estimator has asymptotic properties for testing whether α_a^Tβ_m is zero and can also be applied when there are multiple (but not more than n) exposures.

PMED

Partial penalized high-dimensional mediation analysis (PMED) is a two-step estimation and inference procedure for the global mediation effect, proposed by Guo et al. (2022) [40]. In the first step, the outcome model is fitted with the mediators penalized by the smoothly-clipped absolute deviation (SCAD) penalty. In the second step, the estimated direct effect from the outcome model is subtracted from an estimated total effect, which is obtained by fitting an unpenalized outcome model with the mediators omitted. The method reports the global mediation effect and a set of potentially active mediators selected in step one, but does not provide estimates of the mediator-specific mediation contributions. PMED can also be applied when there are multiple, but fewer than n, exposure variables.
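The subtraction in the second step follows from substituting model (2) into model (1). In the sketch below, θ_a and θ_c are symbols assumed here for the coefficients of the outcome model with the mediators omitted; they are not notation from [40].

```latex
% Marginalizing the mediators out of model (1) using model (2):
\begin{align*}
  E[Y_i \mid A_i, C_i]
    &= \beta_a A_i + \beta_m^{T}\bigl(\alpha_a A_i + \alpha_c C_i\bigr)
       + \beta_c^{T} C_i \\
    &= \underbrace{(\beta_a + \alpha_a^{T}\beta_m)}_{\theta_a} A_i
       + \underbrace{(\beta_c + \alpha_c^{T}\beta_m)^{T}}_{\theta_c^{T}} C_i ,
\end{align*}
% so the global mediation effect is total minus direct:
\[
  \alpha_a^{T}\beta_m = \theta_a - \beta_a .
\]
```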

Group 3 methods: Latent variable representation to summarize mediators

HDMM

High-dimensional multivariate mediation analysis (HDMM) by Chén et al. (2018) uses dimension reduction similar to PCMA, but chooses the loading vectors with a likelihood-based approach instead of PCA [43]. The loading vectors are referred to as “directions of mediation,” each vector specifying a linear combination of mediators which contribute to the likelihood of the mediation models. This implicitly assumes there are latent, unmeasured mediators that can be represented as linear combinations of the observed mediators. The results of HDMM are difficult to interpret, but it can still be useful for identifying whether there is any mediation through M at all and for identifying large subsets of mediators that contribute to that mediation. A limitation of HDMM is that it cannot directly be applied when p exceeds n.

LVMA

Latent variable mediation analysis (LVMA) by Derkach et al. (2019) assumes there are a small number of latent, unmeasured mediators F which transmit the effect of A to Y and which also cause changes in M [46]. Thus, LVMA assumes explicitly what HDMM assumes implicitly, and the results of the methods have a similar interpretation. Another feature of LVMA is that the F → M associations are sparsified, which is useful for detecting relevant mediators in M. Indeed, an observed mediator would be considered active if it is associated with a latent mediator that itself is associated with A and Y.

Simulation study

Simulation settings

Primary simulation settings

We evaluated the above methods with a simulation study. To contrast them under diverse conditions, we considered three different settings of mediation: (1) a baseline setting in which the mediation signals are sparse and the error terms of the (potential) mediators are moderately correlated, (2) a high-correlation setting with sparse signals, and (3) a moderate correlation setting in which the signals are non-sparse. We also varied the signal strength of the mediation by modifying three parameters: the proportion of variance explained by A in mediators affected by A (PVE_A); the proportion of variance in Y explained by the direct effect (PVE_DE); and the proportion of variance in Y explained by the global indirect effect (PVE_IE). For a baseline case, we let PVE_A be 0.2, PVE_DE be 0.1, and PVE_IE be 0.1. Then, for three additional cases, we sequentially decreased one of these parameters by half, weakening the signal, and set the other two parameters to their values from the baseline. Each of the four signal strengths was evaluated in each of settings (1) to (3) with sample sizes of 1,000 and 2,500, with the number of mediators fixed at 2,000. This amounted to 24 simulation settings in total (three settings, four signal strengths, and two sample sizes). A complete list of the primary simulation settings is provided in the supplement (S2 Table), as are the numerical results underlying the figures along with code for generating the simulated data (S2 and S3 Files). None of the settings adjusted for confounding variables.
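The design described above can be enumerated with a short Python sketch, which also confirms the count of 24. The variable names and setting labels are assumptions made for illustration; the authoritative list of settings is the one in S2 Table.

```python
from itertools import product

# Baseline signal strength, as stated in the text.
BASELINE = {"PVE_A": 0.2, "PVE_DE": 0.1, "PVE_IE": 0.1}
SETTINGS = ("baseline", "high-correlation", "non-sparse")  # settings (1)-(3)
SAMPLE_SIZES = (1000, 2500)
N_MEDIATORS = 2000


def signal_strengths(baseline):
    """Baseline case plus three cases that each halve one parameter."""
    cases = [dict(baseline)]
    for name in baseline:
        weakened = dict(baseline)
        weakened[name] = baseline[name] / 2
        cases.append(weakened)
    return cases


def simulation_grid():
    """Cross mediation setting, signal strength, and sample size."""
    return [
        {"setting": setting, "n": n, "p": N_MEDIATORS, **strength}
        for setting, strength, n in product(
            SETTINGS, signal_strengths(BASELINE), SAMPLE_SIZES
        )
    ]


grid = simulation_grid()
```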

Additional simulation settings

To broaden the variety of simulation conditions, we consider two additional sets of simulations that involve specific changes to the data-generating mechanism. In the first additional scenario, we consider cases in which the coefficients of the outcome and mediator models are not mixed in sign, but strictly non-negative (as explained below, the coefficients in the primary simulation settings had both positive and negative signs). The non-negative effect simulations are analogous to simulation Setting (1) above, but with the coefficients of the model converted to their absolute values. They include each of the four signal strength settings explored previously.

Finally, the second additional scenario considers data-generating mechanisms in which there is an unmeasured confounding variable, U, that directly influences the exposure, the outcome, and a subset of the mediators. For these simulations, we begin with Setting (1) (as described above) with the first set of signal strength parameters (PVE_A = 0.2, PVE_DE = 0.1, PVE_IE = 0.1), then perturb the data-generating mechanism by adding confounding effects of U to the generation of A, M, and Y. We explore different levels of confounding by setting the sensitivity analysis parameter, namely the variance of U, to be 1, 2, or 3, while holding the effects of U on the other variables constant. In both additional simulation scenarios we set n to be 2,500. A list of the additional scenarios is provided in the supplement (S3 Table). Results for both scenarios are reported in the supplement as well (S1 and S2 Figs).

Simulated dataset creation

Primary simulation settings

To obtain sparse mediation effects for Settings (1) and (2), we let 1,920 of the 2,000 coefficients (α_a)_j and (β_m)_j be zero and the remaining 80 be standard normal. Twenty of the nonzero (α_a)_j and (β_m)_j were chosen to overlap, so that the product (α_a)_j(β_m)_j was nonzero for exactly those twenty mediators. To obtain non-sparse signals for Setting (3), we sampled the previously zero coefficients from a normal distribution with mean zero and standard deviation 0.2. Once these parameters were fixed, we obtained a single simulated dataset by sampling A_i from a standard normal distribution, then producing M_i from model (2) assuming there are no covariates. We added noise to M_i by sampling residuals from a multivariate normal distribution with mean 0_p and variance S, where S was derived by shuffling, then tuning, the variance-covariance matrix of the observed methylation data (Section 2 in S1 Text). In Settings (1) and (3), we tuned S so that the error correlations between mediators ranged from -0.37 to 0.49, and in Setting (2), so that they ranged from -0.68 to 0.89. We fixed PVE_A by scaling S appropriately based on α_a. Finally, we defined Y_i based on model (1) assuming the residuals are Normal(0, s²), choosing β_a and s² to yield the desired PVE_DE and PVE_IE.
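The coefficient pattern described above can be sketched in Python as follows. This is not the authors' simulation code (which is provided in the supplement): the function name, the seed, and the way the overlapping indices are chosen are assumptions, and the sketch covers only the coefficients, not the correlated residuals or the PVE tuning.

```python
import random


def draw_coefficients(p=2000, n_nonzero=80, n_overlap=20,
                      background_sd=0.0, seed=1):
    """Draw alpha_a and beta_m with a controlled overlap of nonzero entries.

    background_sd = 0.0 gives the sparse pattern of Settings (1) and (2);
    background_sd = 0.2 mimics the non-sparse Setting (3).
    """
    rng = random.Random(seed)

    def background():
        return rng.gauss(0.0, background_sd) if background_sd > 0 else 0.0

    alpha = [background() for _ in range(p)]
    beta = [background() for _ in range(p)]

    # Pick disjoint index sets: shared, alpha-only, and beta-only.
    chosen = rng.sample(range(p), 2 * n_nonzero - n_overlap)
    shared = chosen[:n_overlap]
    alpha_only = chosen[n_overlap:n_nonzero]
    beta_only = chosen[n_nonzero:]

    # Standard normal effects on the selected positions.
    for j in shared + alpha_only:
        alpha[j] = rng.gauss(0.0, 1.0)
    for j in shared + beta_only:
        beta[j] = rng.gauss(0.0, 1.0)
    return alpha, beta, sorted(shared)


alpha, beta, active = draw_coefficients()
```

With the defaults, each vector has 80 nonzero entries, and only the 20 shared positions have a nonzero mediation contribution.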

Additional simulation settings

For the simulations with non-negative effects, the (α_a)_j and (β_m)_j coefficients from Setting (1) are converted to their absolute values. Since this also changes the global indirect effect, α_a^Tβ_m, we update the direct effect, β_a, to equal $α_{a}^{T} β_{m} \sqrt{\mathrm{PVE}_{DE} / \mathrm{PVE}_{IE}}$, so that the ratio of the variance of Y explained by the direct effect to the variance of Y explained by the indirect effect is the same as previously. No other parameters used to generate the data are modified; for example, the residual variance of the outcome model (s²) is the same as before.
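The form of this update can be checked with a short calculation. The sketch assumes that the variance of Y explained by the direct and indirect effects is β_a² Var(A) and (α_a^Tβ_m)² Var(A), respectively; this is our reading of the PVE definitions, not a formula quoted from the supplement.

```latex
% Ratio of explained variances; Var(A) cancels.
\[
  \frac{\mathrm{PVE}_{DE}}{\mathrm{PVE}_{IE}}
  = \frac{\beta_a^{2}\,\mathrm{Var}(A)}
         {(\alpha_a^{T}\beta_m)^{2}\,\mathrm{Var}(A)}
  = \frac{\beta_a^{2}}{(\alpha_a^{T}\beta_m)^{2}}
  \quad\Longrightarrow\quad
  \beta_a = \alpha_a^{T}\beta_m
            \sqrt{\mathrm{PVE}_{DE} / \mathrm{PVE}_{IE}} ,
\]
% taking the root with the sign of alpha_a^T beta_m, which is
% non-negative once all coefficients are non-negative.
```

In the baseline case PVE_DE and PVE_IE are both 0.1, so the update simply sets β_a equal to α_a^Tβ_m.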

For the unmeasured confounding simulations, the modified data-generating mechanism is described in models (3), (4), and (5), which are shown below:

$$A_{i} = γ U_{i} + δ_{i} \tag{3}$$

$$M_{i} = α_{a} A_{i} + α_{u} U_{i} + ε_{i} \tag{4}$$

$$Y_{i} = β_{a} A_{i} + β_{u} U_{i} + β_{m}^{T} M_{i} + ζ_{i} \tag{5}$$

Here, δ_i and ζ_i are independent normal random variables with mean zero, and their variances are chosen to be equal to their values from the baseline setting (1 and s², respectively). In model (4), ε_i is a multivariate normal random vector, independent of δ_i and ζ_i, with variance-covariance matrix set to be S from the baseline setting. The confounder-exposure effect γ is set to be 1/3, and the confounder-outcome effect β_u is set to β_a/2. For the vector of confounder-mediator effects, α_u, we set the j^th entry to be (α_a)_j/2 if (α_a)_j is not zero, and set it to be 1/2 if (α_a)_j is zero but (β_m)_j is not zero. (That is, only the mediators that are affected by A, affect Y, or both, are directly affected by U.) The choice of these fractions (e.g., β_a/2) is somewhat arbitrary, but it ensures the confounding effects are on a similar scale to the coefficients of interest, only slightly weaker. The remaining parameters are held at their values from Setting (1). The confounding variable U is sampled from a normal distribution with mean zero and variance τ, the sensitivity analysis parameter, which is set to be 1, 2, or 3. This varies the intensity of the confounding.
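The rule for the confounder-mediator effects can be stated compactly in code. The function below is a sketch of that rule only; its name is hypothetical, and setting the effect to zero for mediators unrelated to both A and Y is our reading of the parenthetical remark above.

```python
def confounder_mediator_effects(alpha_a, beta_m):
    """Build alpha_u, the effects of the confounder U on each mediator."""
    alpha_u = []
    for a, b in zip(alpha_a, beta_m):
        if a != 0:
            alpha_u.append(a / 2)   # mediator affected by A: half of (alpha_a)_j
        elif b != 0:
            alpha_u.append(0.5)     # affects Y only: fixed effect of 1/2
        else:
            alpha_u.append(0.0)     # unrelated to A and Y: no confounding
    return alpha_u


# Hypothetical coefficients for four mediators.
alpha_u = confounder_mediator_effects(alpha_a=[2.0, 0.0, 0.0, -1.0],
                                      beta_m=[0.0, 3.0, 0.0, 1.0])
```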

Simulation analysis

We performed mediation analysis on 100 simulated datasets in each setting. We omitted the methods SPCMA, GMM, and LVMA because they were too computationally costly, and omitted HDMM because its estimand is not comparable to the others. We also included a one-at-a-time approach in which the mediators are assessed individually using traditional mediation analysis and the joint significance test [52]. When running HIMA, HDMA, MedFix, and pathway LASSO, we pre-screened the mediators to include only the n/log(n) mediators most associated with Y adjusting for A, which is recommended by the HIMA and HDMA authors [35,36]. For comparison metrics, we used the true positive rate for detecting active mediators, $TPR = \frac{\text{number of true mediators identified}}{\text{number of true mediators}}$; the mean squared error in estimating the mediation contributions of inactive mediators, $\text{MSE}_{\text{Inactive}} = \text{mean}_{j:\,\text{Inactive}} \left( (\hat{α}_{a})_{j} (\hat{β}_{m})_{j} - (α_{a})_{j} (β_{m})_{j} \right)^{2}$; the mean squared error in estimating the mediation contributions of active mediators, $\text{MSE}_{\text{Active}} = \text{mean}_{j:\,\text{Active}} \left( (\hat{α}_{a})_{j} (\hat{β}_{m})_{j} - (α_{a})_{j} (β_{m})_{j} \right)^{2}$; and the percent relative bias in estimating the global indirect effect, $\frac{| \widehat{α_{a}^{T} β_{m}} - α_{a}^{T} β_{m} |}{α_{a}^{T} β_{m}} \times 100$. In the non-sparse setting, “active” mediators were considered those with both effects sampled from the high-variance distribution. We provide additional details on how the methods were applied in the supplement (Section 3 in S1 Text).
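The three comparison metrics translate directly into code. The sketch below follows the formulas as stated; the function names and the toy inputs are hypothetical and are not taken from the study's software.

```python
def true_positive_rate(selected, active):
    """Share of truly active mediators that a method selected."""
    active = set(active)
    return len(active & set(selected)) / len(active)

def mse_contributions(alpha_hat, beta_hat, alpha, beta, indices):
    """Mean squared error of the estimated contributions alpha_j * beta_j
    over the mediators listed in `indices` (active or inactive)."""
    errors = [(alpha_hat[j] * beta_hat[j] - alpha[j] * beta[j]) ** 2
              for j in indices]
    return sum(errors) / len(errors)

def percent_relative_bias(ie_hat, ie_true):
    """Absolute error of the global indirect effect, as a percentage of
    the true global indirect effect."""
    return abs(ie_hat - ie_true) / ie_true * 100

# Toy example: mediators 1-4 are active, a method selects 1, 2, and 5
tpr = true_positive_rate(selected=[1, 2, 5], active=[1, 2, 3, 4])
mse = mse_contributions([0.5, 0.0], [0.4, 0.0], [0.5, 0.2], [0.2, 0.5], [0, 1])
bias = percent_relative_bias(ie_hat=0.09, ie_true=0.10)
```

In the toy example the method recovers two of the four active mediators, so the TPR is 0.5, and an estimate of 0.09 for a true global indirect effect of 0.10 corresponds to a relative bias of about 10%.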

Data application with MESA

Study design

Data were provided by the Multi-Ethnic Study of Atherosclerosis (MESA), a United States population-based longitudinal study on the progression of subclinical cardiovascular disease [19]. Briefly, MESA recruitment ran from July 2000 to August 2002 and comprised 6,814 participants ages 45 to 84. From 2010 to 2012, a subsample of 1,264 random patients had their DNAm measured at 484,882 CpG sites. Standard quality control filters reduced the number of CpGs considered to 402,339 [53]. To demonstrate an application of high-dimensional mediation analysis methods, we evaluated whether DNAm mediates the association between SES and HbA1c in MESA. For the exposure, we used a binary variable that indicates low educational attainment (less than a 4-year college degree). For the outcome, we used HbA1c, a continuous variable that reflects average three-month blood glucose level. We limit our analysis to the 963 participants who (1) had methylation data, (2) had no missing data for the required variables, (3) consented to genetic and phenotypic use through the database of Genotypes and Phenotypes (dbGaP) (phs000209.v13.p3), and (4) were not on diabetes medication, which can cause changes in HbA1c. See supplement for more details (Section 4 in S1 Text). DNAm was measured using M-values, defined as the log-2 ratio of the methylated to unmethylated probe intensities [54].
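For readers unfamiliar with M-values, the definition above can be written out as follows. The sketch applies the plain log-2 ratio as defined in the text; processing pipelines often add a small offset to both intensities, which is omitted here. The function names are hypothetical, and the conversion from a methylation proportion assumes the proportion is computed without an offset.

```python
import math

def m_value(methylated, unmethylated):
    """Log-2 ratio of methylated to unmethylated probe intensities."""
    return math.log2(methylated / unmethylated)

def beta_to_m(beta):
    """M-value implied by a methylation proportion
    beta = methylated / (methylated + unmethylated)."""
    return math.log2(beta / (1 - beta))

# A probe with three times more methylated than unmethylated signal
m = m_value(3000.0, 1000.0)
```

An M-value of 0 corresponds to equal methylated and unmethylated intensities, and the scale is unbounded in both directions, unlike a proportion.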

Statistical analysis

We performed mediation analysis with the methods HIMA, HDMA, HILMA, MedFix, pathway LASSO, PMED, BSLMM, PCMA, SPCMA, and HDMM, based on the models

E [{HbA1c}_{i} {| Education}_{i}, {DNAm}_{i}, {Covariates}_{i}] = β_{a} {Education}_{i} + β_{m}^{T} {DNAm}_{i} + β_{c}^{T} {Covariates}_{i}

(6)

and

E [{DNAm}_{i} {| Education}_{i}, {Covariates}_{i}] = α_{a} {Education}_{i} + α_{c} {Covariates}_{i},

(7)

with the same parameters as models (1) and (2). The covariates included age, sex, race, methylation chip, methylation position, and the estimated proportions of residual non-monocytes (i.e., neutrophils, B cells, T cells, and natural killer cells). Since it is not statistically feasible to include 402,339 mediators at once, we used model (7) to select the 2,000 CpG sites most strongly associated with education based on the (α_a)_j p-value from a linear mixed model in which methylation chip and position were treated as random effects. These 2,000 formed the baseline set of CpGs for our analysis. Although it is reasonable for some of the methods to include all 2,000 CpG sites directly in the multivariable model, HIMA and HDMA require sure independence screening [55] to reduce the number of mediators in advance to n/log(n), where n is the sample size. For the sake of consistency across the penalized regression methods, we also do this extra screening with MedFix and pathway LASSO, including only the 141 (963/log(963)) CpG sites most associated with low education (a direct extension of the initial screening). We also use this twice-screened subset for HDMM, which requires that p is less than n. For the sake of comparison with multivariate methods, we include a one-at-a-time mediation method based on linear mixed models and the joint significance test. For the methods PCMA, SPCMA, BSLMM, PMED, and Pathway LASSO, which produce estimates of the direct effect, the total effect is estimated by summing the direct effect and global indirect effect. For the methods HIMA, HDMA, and MedFix, which do not estimate the direct effect, we estimate the total effect by fitting model (6) with the mediators excluded, then subtract the estimated global indirect effect from this value to estimate the direct effect.
Since none of the high-dimensional methods can handle random effects as covariates, we regress methylation chip and position out of Y and M in advance with a linear mixed model, while fixed-effect covariates are either regressed out as well (in PCMA, SPCMA, HILMA, HDMM, and pathway LASSO) or included directly in the method (in HIMA, HDMA, MedFix, PMED, and BSLMM). Continuous variables (including HbA1c and the mediators) were standardized for all methods. The methods LVMA and GMM were too computationally costly to implement. All analysis was conducted using R version 4.2.1.
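The size of the twice-screened subset follows from the n/log(n) rule. The sketch below assumes the natural logarithm and rounding up, which reproduces the 141 sites reported for n = 963; the ranking helper is a hypothetical illustration of keeping the mediators with the smallest p-values, not the screening code used in the analysis.

```python
import math

def screening_size(n):
    """Number of mediators kept by the n / log(n) rule.
    Assumes natural log and rounding up (matches 141 for n = 963)."""
    return math.ceil(n / math.log(n))

def top_k_by_pvalue(pvalues, k):
    """Indices of the k mediators with the smallest p-values."""
    order = sorted(range(len(pvalues)), key=lambda j: pvalues[j])
    return order[:k]

k_mesa = screening_size(963)
kept = top_k_by_pvalue([0.5, 0.01, 0.2, 0.03], k=2)
```

Because n/log(n) grows almost linearly in n, the screened subset stays well below the sample size, which is also what allows the same subset to be used for HDMM.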

Results

Simulation results

We begin by comparing the performance of the methods using simulations. On simulated data with 2,000 potential mediators, we consider (1) a baseline setting, where the error terms of the mediators are moderately correlated and the signals of the mediators are sparse; (2) a high-correlation setting, where the error correlations between mediators are enhanced compared to (1); and (3) a non-sparse setting, where every mediator has at least some mediation signal but some of the signals are systematically larger. In Settings (1) and (2), 60 random mediators have (α_a)_j only sampled from a Normal(0,1), 60 have (β_m)_j only sampled from a Normal(0,1), and 20 have both, with the remaining entries of α_a and β_m fixed at zero. In Setting (3), we use a similar scheme but sample the previously zero (α_a)_j and (β_m)_j from a Normal(0,0.2²). Our simulations also vary the strength of the signals within each of these settings by changing the proportion of variance that is explained by the associations. We do so by changing PVE_A, the proportion of variance in each mediator that can be explained by A, among those mediators that are affected by A; PVE_IE, the proportion of variance of Y that is explained by the total mediation effect; and PVE_DE, the proportion of variance of Y that is explained by the direct effect of A on Y. Results for varying PVE_IE are presented here while results for varying PVE_DE and PVE_A are included in the supplement (S3–S6 Figs). In addition to the high-dimensional mediation methods, we include a one-at-a-time method [52] in which the mediators are assessed individually using linear regression. We evaluate the methods by their true positive rate (TPR) for detecting active mediators, their mean squared error (MSE) for estimating the contributions of active mediators, and their percent relative bias for estimating the global indirect effect.
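The sparsity pattern of Settings (1) to (3) can be sketched as below. The code only draws the raw coefficients; the rescaling that achieves the target PVE_A, PVE_IE, and PVE_DE is not shown, and the function name and arguments are hypothetical.

```python
import random

def sample_effects(p=2000, n_alpha_only=60, n_beta_only=60, n_both=20,
                   background_sd=0.0, seed=0):
    """Draw alpha_a and beta_m for p mediators.

    background_sd = 0 gives the sparse pattern of Settings (1) and (2);
    background_sd = 0.2 fills the remaining entries from Normal(0, 0.2^2),
    as in the non-sparse Setting (3).
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(p), n_alpha_only + n_beta_only + n_both)
    alpha_idx = set(chosen[:n_alpha_only])
    beta_idx = set(chosen[n_alpha_only:n_alpha_only + n_beta_only])
    both_idx = set(chosen[n_alpha_only + n_beta_only:])

    def draw(strong):
        if strong:
            return rng.gauss(0.0, 1.0)
        return rng.gauss(0.0, background_sd) if background_sd > 0 else 0.0

    alpha_a = [draw(j in alpha_idx or j in both_idx) for j in range(p)]
    beta_m = [draw(j in beta_idx or j in both_idx) for j in range(p)]
    # Mediators with both effects drawn from Normal(0, 1) are "active"
    return alpha_a, beta_m, sorted(both_idx)

alpha_a, beta_m, active = sample_effects()
alpha_ns, beta_ns, active_ns = sample_effects(background_sd=0.2)
```

In the sparse call, 80 entries of each vector are non-zero but only the 20 mediators in `active` have a non-zero product (α_a)_j(β_m)_j, which is what makes them contribute to the global indirect effect.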

Detection of active mediators

Our first evaluation metric is TPR, which is the proportion of the true active mediators the method successfully detected on simulated data. In Fig 3, we show the mean TPR over 100 simulated datasets, with an empirical 95% confidence interval (CI), for both the Group 1 methods and the one-at-a-time approach. To choose significance cutoffs for discriminating active mediators from inactive, we used a thresholding procedure within each dataset and each method that fixed the false discovery rate (FDR) below 10% (see Methods). For the non-sparse setting, in which every mediator is active, we show the mean TPR for detecting mediators whose (α_a)_j and (β_m)_j were both sampled from Normal(0,1) rather than Normal(0,0.2²). We focus on TPR but not false positive rate (FPR) because the FDR correction was highly conservative, and the mean FPR ranged only from 0 to 5.0x10^-4 across all settings and methods.

Fig 3 — Value shown is the mean TPR across 100 simulated data replicates, with intervals representing the inner 95% range. False discovery proportion was capped below 10% by a proper choice of the p-value threshold (one-at-a-time, HIMA, HDMA, MedFix), posterior inclusion probability threshold (BSLMM), or tuning parameter (pathway LASSO).
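One way to read the capping of the false discovery proportion is sketched below: mediators are ranked by a score, and the most lenient cutoff whose false discovery proportion stays at or below 10% is used to compute the TPR. This is an assumed illustration, not the exact procedure from the Methods; ties between scores are ignored, and the function name is hypothetical.

```python
def tpr_at_fdp_cap(scores, is_active, cap=0.10):
    """TPR at the most lenient cutoff with false discovery proportion <= cap.

    scores: smaller values mean stronger evidence (e.g., p-values); for
    posterior inclusion probabilities, pass 1 - PIP instead.
    is_active: true status of each mediator, known only in simulations.
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    n_active = sum(1 for flag in is_active if flag)
    best_tp = 0
    tp = fp = 0
    for j in order:
        if is_active[j]:
            tp += 1
        else:
            fp += 1
        # Keep the largest selection whose FDP respects the cap
        if fp / (tp + fp) <= cap:
            best_tp = tp
    return best_tp / n_active

tpr = tpr_at_fdp_cap(
    scores=[0.001, 0.002, 0.01, 0.02, 0.5],
    is_active=[True, True, False, True, False],
)
```

In this toy example the third-ranked mediator is a false discovery, so only the first two selections satisfy the cap and two of the three active mediators are detected.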

For a sample size of 2,500 and a PVE_IE of 0.10, the most powerful method in the baseline setting was BSLMM (mean TPR: 0.45; CI: 0.25–0.63), whose average TPR was 40% higher than that of HDMA, the second-best method. BSLMM also performed best when PVE_IE was 0.05 (mean TPR: 0.25; CI: 0.02–0.48), but to a lesser degree, outperforming HDMA by only 13%. BSLMM remained the best method, and HDMA the second best, no matter the signal strength or the degree of correlations, but performed poorly when the signals were non-sparse. In the setting with 1,000 observations, PVE_IE set to 0.05, and non-sparse signals, the best-performing method was HIMA (mean TPR: 0.09; CI: 0.05–0.10), its average TPR 3.3 times higher than that of BSLMM, which performed worst.

Estimation of contributions of active mediators

We now assess the MSE of the methods for estimating mediation contributions of active mediators relative to the one-at-a-time approach. In Fig 4, we show the relative MSE (rMSE) for estimating mediation contributions among the mediators that were either active (in the baseline and high-correlation settings) or had (α_a)_j or (β_m)_j sampled from the larger-variance distribution (in the non-sparse setting). In the baseline setting with 2,500 observations, the best-performing method when the mediation signal was strong was BSLMM, whose mean rMSE of 0.59 (CI: 0.13–1.51) was 24% lower than that of HDMA, the second-best method. However, when the PVE_IE was reduced to 0.05 or the sample size reduced to 1,000, the best-performing method was either HDMA or MedFix, with MedFix (mean rMSE: 0.79; CI: 0.31–1.53) performing 61% better than BSLMM after reducing both. Similar trends were observed for the high-correlation and non-sparse settings. Relative MSE for inactive mediators is provided in the supplement (S5 Fig).

Estimation of global indirect effect

Lastly, Fig 5 shows the percent relative bias for estimating the global mediation effect, α_a^Tβ_m. We use the same methods as in Figs 3 and 4 along with the Group 2 methods PCMA and HILMA, which obtain an estimate of the global indirect effect without directly fitting the original mediation model. In the baseline setting with 2,500 samples, the best performer when PVE_IE was 0.10 was HILMA, whose mean relative bias of 9% (CI: 0.6%–20.8%) was 40% lower than that of HDMA, the second-best. When PVE_IE was reduced to 0.05, the best-performing method was MedFix (mean relative bias: 20.5%; CI: 1.0%–43.8%), which outperformed HILMA by only 7%. We observed similar results for a sample size of 1,000 and for high correlations. In the non-sparse settings, where the biases tended to be much higher, the best-performing methods were either PCMA or HDMA.

DNAm data analysis results from MESA

On an epigenetic dataset with 402,339 CpG sites, we applied SPCMA, HDMM, and every method from our simulation study to infer whether the association between SES and HbA1c is mediated by changes in DNAm. For SES, we used a binary variable representing low education level (i.e., education below a 4-year degree), and for DNAm we used M-values [54]. All continuous variables (including HbA1c and the methylation values) were standardized before analysis. Since the methods are incapable of handling so many CpG sites at once, we reduced our scope to include only the 2,000 sites with the strongest association with low SES based on linear mixed model p-values (see Materials and Methods). Our final dataset contained these 2,000 CpG sites and 963 samples.

Identification of noteworthy CpG sites

We identified CpG sites that potentially mediated the relationship between low SES and HbA1c using methods from Group 1. In HIMA, HDMA, MedFix, and pathway LASSO, which involve feature selection, we describe a CpG site to be “active” if its estimated mediation contribution is not zero; whereas in BSLMM, we do so if the estimated posterior inclusion probability is not zero (see Materials and Methods). We also included a one-at-a-time method in which the CpG sites were assessed individually with linear mixed models, identifying active mediators with the joint significance test [52]. Out of 2,000 CpG sites, HIMA found 3 sites to be noteworthy, HDMA found 11, MedFix found 3, pathway LASSO found 141, and BSLMM found 3, amounting to 144 unique CpG sites in total. The one-at-a-time method identified zero CpG sites as noteworthy at an FDR threshold of 10%. Eleven CpG sites were identified as noteworthy by at least two of the methods (Table 1). Among these 11, the estimated mediation contributions were similar across methods in direction and size except for BSLMM, for which the estimates were an order of magnitude smaller than the others but in the same direction.

Table 1. Estimated contributions of noteworthy CpG sites on the mediation pathway between low education and HbA1c.

CpG Name	Chromosome	Nearby Gene(s)	UCSC RefGene Group	Univariate (0 sites identified)	HIMA (3 sites identified)	HDMA (11 sites identified)	MedFix (3 sites identified)	Pathway LASSO (141 sites identified)	BSLMM (3 sites identified)
cg10508317	17	SOCS3	Body	3.48x10^-2	1.59x10^-2*	3.56x10^-2*	2.90x10^-2*	2.35x10^-2*	0.25x10^-2
cg01288337	14	RIN3	Body	3.35x10^-2	1.47x10^-2*	2.82x10^-2*	2.70x10^-2*	4.43x10^-2*	0.21x10^-2
cg10244976	16	LMF1	Body	3.00x10^-2	0	2.78x10^-2*	0	2.23x10^-2*	0.19x10^-2
cg07516252	14	REC8	TSS200	2.72x10^-2	0	2.24x10^-2*	0	2.26x10^-2*	0.26x10^-2
cg07571519	10	C10orf105; CDH23	3’UTR; Body	2.53x10^-2	0.33x10^-2*	3.67x10^-2*	1.47x10^-2*	2.81x10^-2*	0.21x10^-2
cg23079012	2	LINC00299	Body	2.27x10^-2	0	1.99x10^-2*	0	1.98x10^-2*	0.29x10^-2
cg01587454	8	DCAF4L2	1stExon	1.77x10^-2	0	2.10x10^-2*	0	1.99x10^-2*	0.38x10^-2
cg27527503	4	HADH	TSS1500	1.75x10^-2	0	1.86x10^-2*	0	1.27x10^-2*	0.23x10^-2
cg25891647	11	GRAMD1B	Body	-1.27x10^-2	0	-3.42x10^-2*	0	-3.02x10^-2*	-0.33x10^-2
cg08473752	17	NLK	Body	-0.70x10^-2	0	-2.34x10^-2*	0	-2.32x10^-2*	-0.22x10^-2
cg12644059	15	BLM	N/A¹	-0.03x10^-2	0	-2.31x10^-2*	0	-1.84x10^-2*	-0.22x10^-2

*Selected as noteworthy by given method

¹CpG site cg12644059 is 3.240kb from the final base pair of the BLM gene

Includes all CpG sites that were selected as having a noteworthy mediation contribution by at least two of the implemented methods out of 2,000 CpG sites in total. All estimates are adjusted for age, sex, race, and the estimated proportions of residual non-monocytes as fixed effects, along with methylation chip and position as random effects to address potential batch effects. Note that for HIMA, HDMA, MedFix, and pathway LASSO, which fit high-dimensional regression models, we used additional pre-screening to reduce the number of mediators in advance to only n/log(n) ≈ 141 CpG sites, which is the approach recommended by the HIMA and HDMA authors and helps with statistical and computational efficiency (see Materials and Methods). Pathway LASSO selected all of these.

Some of these CpG sites are on or near genes that are potentially related to HbA1c. Site cg10508317 is in the body of the SOCS3 gene, for which a rich body of literature has established links between overexpression and insulin resistance [56], and has previously been identified in MESA as a mediator between adult SES and BMI [11] and adult SES and HbA1c [18] based on single-mediator analysis. Site cg01288337, which is in the body of the RIN3 gene, has been identified in MESA as a potential mediator between adult SES and HbA1c based on one-at-a-time analysis as well [18]. The RIN3 gene itself is proximal to the SLC24A4 gene, both of which have been linked to brain glucose metabolism in human population studies [57]. In addition, site cg27527503 is in the promoter region of the HADH gene, which is differentially expressed with respect to diabetes status [58] and is a primary driver of hyperinsulinism [59] and hyperinsulinaemic hypoglycemia [60]. A Venn diagram of genes identified by the methods is included in the supplement (S5 Fig), and results for every noteworthy CpG site are provided in the supplement (S1 File).

Global mediation through DNAm

Next, we estimated the direct effect of low education on HbA1c, the global indirect effect of low education on HbA1c through DNAm, and the total effect of low education on HbA1c using the Group 1 methods HIMA, HDMA, MedFix, pathway LASSO, and BSLMM, as well as the Group 2 methods PCMA, SPCMA, and HILMA (Table 2). Results across methods varied considerably, with the estimated global indirect effect ranging from 0 in PMED to 0.17 in SPCMA. The estimated total effect ranged from 0.03 (HILMA) to 0.198 (HIMA, HDMA, and MedFix). Despite the variability in the estimated global indirect effect, some of the other methods were consistent, with HDMA, BSLMM, pathway LASSO, PCMA, and SPCMA all estimating the global indirect effect to be close to 0.15. The variability in the estimated indirect effect and estimated total effect led to variability in the proportion mediated as well, from 17.1% in HIMA to 100% in HILMA.

Table 2. Estimated effects in the mediation mechanism from low education to DNAm to HbA1c.

Method	Estimated Global Indirect Effect	Estimated Direct Effect	Estimated Total Effect	Estimated Proportion Mediated
HIMA	0.03	0.16	0.20	0.17
HDMA	0.13	0.07	0.20	0.65
MedFix	0.07	0.13	0.20	0.36
BSLMM	0.14	0.05	0.18	1.00
Pathway LASSO	0.13	0.05	0.18	0.74
PCMA	0.15	0.02	0.17	0.91
SPCMA	0.17	0.00	0.17	1.00
HILMA	0.03	0.00	0.03	1.00
PMED	0.00	0.20	0.20	0.00

All estimates are adjusted for age, sex, race, and the estimated proportions of residual non-monocytes as fixed effects, along with methylation chip and position as random effects to address potential batch effects. We provide only point estimates, not interval estimates, because some of the methods are either not capable of producing interval estimates or do not provide the code for producing them in their software. Note also that for HIMA, HDMA, MedFix, and pathway LASSO, we used additional screening to reduce the number of mediators in advance for the sake of statistical and computational efficiency, so only n/log(n) ≈ 141 CpG sites were seen by the multivariable outcome model rather than 2,000.
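The columns of Table 2 are linked by simple identities: the total effect is the sum of the direct and global indirect effects, and the proportion mediated is the global indirect effect divided by the total effect. The sketch below applies them to the rounded HDMA row, where the direct effect is obtained by subtraction; the function names are hypothetical, and rounded table entries will only reproduce the reported proportions approximately.

```python
def direct_from_total(total, indirect):
    """Direct effect when a method reports only the indirect effect
    (the route used for HIMA, HDMA, and MedFix)."""
    return total - indirect

def proportion_mediated(indirect, total):
    """Share of the total effect that flows through the mediators."""
    return indirect / total

# Rounded HDMA entries from Table 2
hdma_total = 0.20
hdma_indirect = 0.13
hdma_direct = direct_from_total(hdma_total, hdma_indirect)
hdma_pm = proportion_mediated(hdma_indirect, hdma_total)
```

For HDMA this gives a direct effect of about 0.07 and a proportion mediated of about 0.65, matching the table.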

Additional findings

In addition to estimating the global indirect effect, the method SPCMA is also able to identify potentially mediating CpG sites in groups. It does so by linearly combining the mediators using sparse principal component-defined weights, then evaluating the resulting principal components as mediators themselves [42]. However, out of 100 computed principal components, only three had significant mediation contributions after 10% FDR correction, the first representing a linear combination of 762 CpG sites, the second a combination of 782 sites, and the third a combination of 797 sites. Since the transformed mediators are functions of so many CpG sites at once, one cannot make claims about which particular CpG sites are active mediators, but the method still provides insight into whether there is statistical mediation at all.

We conclude our analysis by applying HDMM, a method from Group 3. Unlike the methods in Groups 1 and 2, HDMM cannot be used to estimate the global indirect effect from the proposed mediation structure, nor to estimate the mediation contributions of specific CpG sites. Rather, HDMM uses a likelihood-based approach to compute “directions of mediation”, which are weights that can be used to linearly combine the observed mediators into unobserved, latent mediators that replace the observed mediators in the mediation models. The estimated effect of the first latent mediator on average HbA1c was 0.13, the estimated total effect 0.18, and the proportion mediated 0.715. The three CpG sites with the largest directions of mediation were cg01288337 (0.36) on the RIN3 gene, cg16162970 (-0.22) near the PACS2 gene, and cg25891647 (-0.21) on the GRAMD1B gene; the first and last of these were among the 11 CpG sites identified by other methods in Table 1. Although the size and direction of these estimates are not interpretable, they offer evidence that these CpG sites are potentially involved in mediation.

Discussion

In this study, we reviewed and evaluated eight statistical methods for performing mediation analysis with high-dimensional DNAm data, so that researchers in epigenetics have the information they need to choose the most appropriate method for their data sample, subject matter, and research objectives. In extensive simulations, we found that the most powerful method for identifying active mediators was generally BSLMM, with HDMA as a close comparator. However, BSLMM performed poorly in settings where the mediation signals were non-sparse. No method was uniformly better than the others at estimating the mediation contributions though Pathway LASSO appeared to be a sub-optimal choice. For estimating the global indirect effect, the best-performing method was HILMA in sparse mediation settings and PCMA or HDMA in non-sparse mediation settings. In simulation settings where the effects are strictly non-negative, BSLMM tended to perform best for detecting active mediators and estimating their mediation contributions, while HILMA was again the strongest method for estimating the global mediation effect (S1 Fig). In simulation scenarios with an unmeasured confounder, the performance of the multiple-mediator methods became worse as the severity of the confounding effects increased, in terms of estimating the global mediation effect or inferring the mediation contributions (S2 Fig). However, the relative performance of these methods compared to the one-at-a-time approach improved substantially with more confounding, which emphasizes the importance of evaluating the mediators simultaneously rather than one-by-one. Our comparison of the scalability of the methods revealed that HIMA, HDMA, MedFix, PMED, and PCMA were easily scalable to large datasets (e.g., n = 1,000 and p = 2,000), whereas SPCMA and pathway LASSO were computationally expensive (Section 5 in S1 Text).

On DNAm data from MESA, 11 CpG sites were selected by at least two of the methods as mediators between low SES and HbA1c level. Of the many genes related to these sites, SOCS3, RIN3, and HADH have the strongest potential biological connections to HbA1c [56–58,60–62], which contributes to the already rich literature on DNAm as a mediator between the exposome and health outcomes. Moreover, the methods generally produced similar estimates of the mediation contributions, with the exception of BSLMM. It is possible that, since the coefficient estimates from BSLMM are non-sparse, the estimated mediation contributions end up severely shrunken compared to the methods that directly select features.

Estimates of the global indirect effect were highly variable. Part of this can be explained by the fact that HDMA, MedFix, HIMA, and pathway LASSO are sparse models that can set mediation contributions to be exactly zero, resulting in a rigid and unstable estimation of the global indirect effect. The method HILMA, which is built specifically for estimating the global indirect effect and direct effect, produced estimates that were sharply different from those of the other methods, possibly because, as our simulations indicated, it struggles in non-sparse mediation settings.

In practice, the optimal method for mediation analysis with high-dimensional mediators will depend both on the data and the objective. If the goal is to identify specific CpG sites that are involved in mediation, one preferred method may be HDMA, which performed well at detecting active mediators in our simulations and was not overly conservative when applied to the observed DNAm data. If one’s focus is the global indirect effect, our simulations suggested that the optimal method is HILMA; but considering the variability we observed in our DNAm analysis, it may be worthwhile to apply BSLMM and HDMA as well to ensure the results are robust. If the results of multiple methods disagree substantially, it may be difficult to say with confidence which is closest to the truth, and the estimates should be interpreted with caution. Next, if there is interest in latent, unmeasured mediators, either HDMM or LVMA is worth attempting, although HDMM is simpler computationally. A detailed decision tree to aid the user in selecting the optimal method is included in Fig 6.

Some strengths of our study include its broad coverage of the available methods, the breadth of its simulation settings, and the comprehensive set of evaluation criteria. Our analysis of real DNAm data is especially essential because it elucidates the potential limitations of using these methods in practice, as it is impossible to incorporate the full complexity of real data sources into contrived simulation settings. Another strength of our study is the presentation of an R package, as the lack of readily available, centralized software for implementing methods for high-dimensional mediation analysis is a potential reason for their so far limited permeation into epigenetic research. We are hopeful that our package, hdmed, will facilitate and encourage the application and adoption of these methods to epigenetic datasets in future studies.

However, our study also has weaknesses. First, since DNAm measurements and HbA1c data were collected concurrently, and represent only single time points, we cannot interpret the parameters we have estimated as causal effects. Although it would be optimal to address our research question longitudinally, with measurements at multiple time points, there is a dearth of mediation analysis methods that can handle that type of data, and longitudinal mediation analysis with high-dimensional mediators should be a focus of future methodological development. Second, the validity of the mediation analysis depends on the strong assumption that the causal mechanism is correctly specified; that is, that the exposure affects the mediators, that the mediators affect the outcome, and that confounding of this relationship is accounted for by the model. If there is unmeasured confounding of the causal pathway, or if some of the variables treated as mediators are, in fact, colliders, the parameters of the high-dimensional mediation model become difficult to interpret, and the estimate of the global indirect effect may be highly biased. Though recent work [63] has directly considered the issue of unmeasured confounding in a high-dimensional mediation model, the issue of collider bias is an important area for future research.

Third, we limited our analysis to the situation in which Y and M are continuous, M and A do not interact, and only one A is of interest. However, we note that the methods HIMA and HDMA can also be applied to identify active mediators when Y is binary, while PCMA can be applied to infer the global indirect effect when there is A-M interaction in the outcome model. MedFix, along with the simultaneously proposed MedMix (mediation analysis with mixed effect model by Zhang (2021)), can be applied when both the exposures and mediators are high-dimensional, while Huang and Vanderweele (2014) proposed a variance component test of the global indirect effect when only A is high-dimensional [64]. If one has prior knowledge that the signs of the outcome model coefficients are in the same direction, a reasonable approach might be to use sign-constrained optimization rather than standard penalties such as the LASSO [65]. In terms of data type, methods that can accommodate non-continuous Y and M are in general scarce, and represent an important direction for future methodological development. As the landscape of methods for high-dimensional mediation analysis continues to expand, future review studies should consider exploring additional mediation settings (e.g., in the presence of non-linearity or interaction) for which statistical methods are continuing to become available.

Supporting information

S1 Text. Methods for Mediation Analysis with High-Dimensional DNA Methylation Data: Possible Choices and Comparison.

(PDF)

S1 File. Estimated mediation contributions of CpG sites identified as mediators by any method.

(XLSX)

S2 File. Numerical results underlying the figures from the simulation study.

(XLSX)

S3 File. Zip file of computer code for performing the simulation study.

(ZIP)

S1 Fig. Results for simulations with strictly non-negative effects.

(A) True positive rate for detecting active mediators. (B) Relative mean squared error for estimating the mediation contributions of active mediators, relative to the one-at-a-time method. (C) Relative mean squared error for estimating the mediation contributions of inactive mediators, relative to the one-at-a-time method. (D) Percent relative bias for inferring the global mediation effect. The simulation settings were created by taking the absolute values of the exposure-mediator and mediator-outcome effects in the original baseline simulation settings, which had four different proportion-of-variance-explained (PVE) settings: (1) PVE_A = 0.2, PVE_DE = 0.1, PVE_IE = 0.1; (2) PVE_A = 0.1, PVE_DE = 0.1, PVE_IE = 0.1; (3) PVE_A = 0.2, PVE_DE = 0.05, PVE_IE = 0.1; and (4) PVE_A = 0.2, PVE_DE = 0.1, PVE_IE = 0.05.

(PNG)

S2 Fig. Results for simulations with an unmeasured confounder U.

(A) True positive rate for detecting active mediators. (B) Mean squared error for inferring the mediation contributions of active mediators. (C) Relative mean squared error for inferring the mediation contributions of active mediators, relative to the one-at-a-time method. (D) Percent relative bias for inferring the global mediation effect.

(PNG)

S3 Fig. True positive rate for detecting mediation signals at a false discovery rate of 10%.

Mean true positive rate (TPR) and 95% empirical confidence interval for detecting active mediators in 100 simulated datasets. In the baseline and high-correlation-among-mediators settings, TPR is for distinguishing mediators which contribute to the global mediation effect from those which do not, whereas in the non-sparse setting, where all mediators contribute, TPR is for distinguishing mediators whose contributions were sampled from a high-variance distribution from those whose contributions were sampled from a low-variance distribution. False discovery rate was capped below 10% by a proper choice of the p-value threshold (one-at-a-time, HIMA, HDMA, MedFix), posterior inclusion probability threshold (BSLMM), or method tuning parameter (P-LASSO). PVE(A): Percent of variance in each exposure-affected mediator explained by the exposure. PVE(IE): Percent of variance in Y explained by the indirect effect. PVE(DE): Percent of variance in Y explained by the direct effect.

(PNG)
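
The quantity summarized in this figure, TPR at a capped false discovery rate, can be sketched in a few lines. The following is a minimal Python illustration under stated assumptions, not the authors' code: the function name `tpr_at_fdr`, the use of an empirical FDR computed from the known simulation truth, the "at or below the cap" rule, and the handling of ties are all hypothetical choices. Scores are taken to be "smaller is stronger" (as with p-values); for posterior inclusion probabilities one could pass 1 − PIP instead.

```python
import numpy as np

def tpr_at_fdr(scores, is_active, fdr_cap=0.10):
    """TPR at the most lenient cutoff whose empirical FDR is <= fdr_cap.

    scores    : per-mediator evidence, smaller = stronger (assumption).
    is_active : True for mediators that are truly active in the simulation.
    """
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)

    # Rank mediators from strongest to weakest evidence.
    order = np.argsort(scores, kind="stable")
    hits = is_active[order]

    # Running counts of true and false positives if the top-k are selected.
    tp = np.cumsum(hits)
    fp = np.cumsum(~hits)
    fdr = fp / (tp + fp)

    # Keep the largest selection whose empirical FDR respects the cap.
    ok = np.nonzero(fdr <= fdr_cap)[0]
    if ok.size == 0:
        return 0.0
    return float(tp[ok[-1]] / is_active.sum())

# Toy example: three active mediators among six.
p_values = [0.001, 0.002, 0.003, 0.2, 0.5, 0.9]
truth = [True, True, False, True, False, False]
print(tpr_at_fdr(p_values, truth))
```

In this toy example only the two top-ranked mediators can be selected without the empirical FDR exceeding 10%, so two of the three active mediators are recovered.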

S4 Fig. MSE in estimating mediation contributions of active mediators, relative to one-at-a-time method.

Mean relative mean squared error (rMSE) and 95% empirical confidence interval for estimating mediation contributions among active mediators in 100 simulated datasets, relative to the one-at-a-time method. Y-axis is on a log₁₀ scale. For the baseline and high-correlation-between-mediators settings, active mediators are those which contribute to the global mediation effect, whereas in the non-sparse setting, where all mediators have some contribution, active mediators are those whose contributions were sampled from a distribution with large variance instead of small. PVE(A): Percent of variance in Y explained by the exposure. PVE(IE): Percent of variance in Y explained by the indirect effect. PVE(DE): Percent of variance in Y explained by the direct effect.

(PNG)
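
As a rough sketch of how a relative MSE of this kind could be computed for one simulated dataset, the snippet below divides a method's MSE over a set of mediators by the MSE of the one-at-a-time estimates over the same mediators. The function name `relative_mse`, the toy numbers, and the choice to form the ratio within each dataset before averaging are assumptions for illustration, not details taken from the authors' scripts.

```python
import numpy as np

def relative_mse(est, est_ref, truth):
    """MSE of `est` divided by MSE of the reference estimates `est_ref`.

    est     : estimated mediation contributions from the method of interest.
    est_ref : estimates from the reference (one-at-a-time) method.
    truth   : true mediation contributions for the same mediators.
    """
    est, est_ref, truth = (np.asarray(a, dtype=float) for a in (est, est_ref, truth))
    mse = np.mean((est - truth) ** 2)
    mse_ref = np.mean((est_ref - truth) ** 2)
    return float(mse / mse_ref)

# Toy example with two active mediators (hypothetical values).
truth = [0.2, -0.1]
joint_method = [0.1, -0.1]
one_at_a_time = [0.4, -0.3]

rmse = relative_mse(joint_method, one_at_a_time, truth)
print(rmse, np.log10(rmse))  # values below 1 (negative on the log10 scale) favor the joint method
```

On a log₁₀ axis, a value of zero means the method matches the one-at-a-time reference, and negative values mean it has lower error.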

S5 Fig. MSE in estimating mediation contributions of inactive mediators, relative to one-at-a-time method.

Mean relative mean squared error (rMSE) and 95% empirical confidence interval for estimating mediation contributions among inactive mediators in 100 simulated datasets, relative to the one-at-a-time method. Y-axis is on a log₁₀ scale. For the baseline and high-correlation-between-mediators settings, inactive mediators are those which have no mediation contribution, whereas in the non-sparse setting, where all mediators have some contribution, inactive mediators are those whose contributions were sampled from a distribution with small variance instead of large. Pathway LASSO is excluded from this figure because in multiple settings it had rMSEs of exactly zero, which cannot be shown on a log scale. This happened because pathway LASSO tended to be highly conservative and successfully assigned inactive mediators to have no effect. PVE(A): Percent of variance in Y explained by the exposure. PVE(IE): Percent of variance in Y explained by the indirect effect. PVE(DE): Percent of variance in Y explained by the direct effect.

(PNG)

S6 Fig. Percent relative bias in estimated global indirect effect.

Mean percentage relative bias in estimating the global mediation effect across 100 simulated data replicates, with intervals representing the inner 95% range. PVE(A): Percent of variance in Y explained by the exposure. PVE(IE): Percent of variance in Y explained by the indirect effect. PVE(DE): Percent of variance in Y explained by the direct effect.

(PNG)
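
The summary described in this caption can be illustrated with a short Python sketch. It assumes that percent relative bias is defined as 100 × (estimate − truth) / truth for each replicate, and that the "inner 95% range" is the interval between the 2.5th and 97.5th percentiles across replicates; both the definition and the function name `percent_relative_bias` are assumptions rather than details confirmed by the source.

```python
import numpy as np

def percent_relative_bias(estimates, true_effect):
    """Mean percent relative bias and its inner 95% range across replicates.

    estimates   : estimated global mediation effects, one per replicate.
    true_effect : the true global mediation effect (must be nonzero).
    """
    estimates = np.asarray(estimates, dtype=float)
    prb = 100.0 * (estimates - true_effect) / true_effect
    lower, upper = np.percentile(prb, [2.5, 97.5])
    return float(prb.mean()), float(lower), float(upper)

# Toy example: four replicates around a true global effect of 1.0.
mean_prb, lower, upper = percent_relative_bias([0.9, 1.0, 1.1, 1.2], 1.0)
print(mean_prb, lower, upper)
```

A mean near zero with a narrow range would indicate an estimator of the global mediation effect that is both nearly unbiased and stable across replicates.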

S7 Fig. Genes containing or near CpG sites selected as active mediators between low education and HbA1c by methods for high-dimensional mediation analysis.

CpG sites were linked to genes using the R Bioconductor package “IlluminaHumanMethylation450kanno.ilmn12.hg19”. Additional genes detected by pathway LASSO are listed in S1 File.

(JPG)

S1 Table. Summary of methods for high-dimensional mediation analysis.

(PDF)

S2 Table. Complete list of primary simulation settings.

(PDF)

S3 Table. Complete list of additional simulation settings.

(PDF)

Acknowledgments

The authors wish to thank the MESA staff and participants.

Data Availability

Instructions for generating our simulated data can be found on our GitHub site (https://github.com/dclarkboucher/mediation_DNAm), which includes R scripts and a ReadMe file that explains how to implement our simulation study. The GitHub site also includes code for implementing our DNAm data analysis, using pseudo-data in place of the MESA data. The exact data used in the DNAm analysis can be obtained through the MESA Data Coordinating Center (https://www.mesa-nhlbi.org/) (accession: phs000209.v13.p3). Access to MESA’s data requires a specific application process depending on the type of project; see https://www.mesa-nhlbi.org/ancillary.aspx and https://www.mesa-nhlbi.org/Publications.aspx for more details.

Funding Statement

MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts 75N92020D00001, HHSN268201500003I, N01-HC-95159, 75N92020D00005, N01-HC-95160, 75N92020D00002, N01-HC-95161, 75N92020D00003, N01-HC-95162, 75N92020D00006, N01-HC-95163, 75N92020D00004, N01-HC-95164, 75N92020D00007, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR-001881, and DK063491. The MESA Epigenomics & Transcriptomics Studies were funded by NIH grants 1R01HL101250, 1RF1AG054474, R01HL126477, R01DK101921, and R01HL135009. Co-authors of this manuscript were partially supported by NHLBI grant R01HL141292, NSF grant DMS1712933, and NIH grants R01HG008773 and 1UG3CA267907. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Moore LD, Le T, Fan G. DNA Methylation and Its Basic Function. Neuropsychopharmacology. 2013 Jan;38(1):23–38. doi: 10.1038/npp.2012.112
2. Kurdyukov S, Bullock M. DNA Methylation Analysis: Choosing the Right Method. Biology (Basel). 2016 Jan;5(1):3. doi: 10.3390/biology5010003
3. Dick KJ, Nelson CP, Tsaprouni L, Sandling JK, Aïssi D, Wahl S, et al. DNA methylation and body-mass index: a genome-wide analysis. Lancet. 2014 Jun;383(9933):1990–8. doi: 10.1016/S0140-6736(13)62674-4
4. Volkmar M, Dedeurwaerder S, Cunha DA, Ndlovu MN, Defrance M, Deplus R, et al. DNA methylation profiling identifies epigenetic dysregulation in pancreatic islets from type 2 diabetic patients. EMBO J. 2012 Mar;31(6):1405–26. doi: 10.1038/emboj.2011.503
5. Abdolmaleky HM, Cheng K-H, Faraone S V, Wilcox M, Glatt SJ, Gao F, et al. Hypomethylation of MB-COMT promoter is a major risk factor for schizophrenia and bipolar disorder. Hum Mol Genet. 2006 Nov;15(21):3132–45. doi: 10.1093/hmg/ddl253
6. Menon R, Conneely KN, Smith AK. DNA methylation: an epigenetic risk factor in preterm birth. Reprod Sci. 2012 Jan;19(1):6–13. doi: 10.1177/1933719111424446
7. Hopper JL, Nguyen TL, Li S. Blood DNA methylation score predicts breast cancer risk: applying OPERA in molecular, environmental, genetic and analytic epidemiology. Mol Oncol. 2022 Jan;16(1):8–10. doi: 10.1002/1878-0261.13117
8. Nakatochi M, Ichihara S, Yamamoto K, Naruse K, Yokota S, Asano H, et al. Epigenome-wide association of myocardial infarction with DNA methylation sites at loci related to cardiovascular disease. Clin Epigenetics. 2017;9:54. doi: 10.1186/s13148-017-0353-3
9. Yousefi PD, Suderman M, Langdon R, Whitehurst O, Davey Smith G, Relton CL. DNA methylation-based predictors of health: applications and statistical considerations. Nat Rev Genet. 2022 Jun;23(6):369–83. doi: 10.1038/s41576-022-00465-w
10. Stoll S, Wang C, Qiu H. DNA Methylation and Histone Modification in Hypertension. Int J Mol Sci. 2018;19(4):1174. doi: 10.3390/ijms19041174
11. Wang YZ, Zhao W, Ammous F, Song Y, Du J, Shang L, et al. DNA Methylation Mediates the Association Between Individual and Neighborhood Social Disadvantage and Cardiovascular Risk Factors. Front Cardiovasc Med. 2022;9:848768. doi: 10.3389/fcvm.2022.848768
12. Fujii R, Yamada H, Munetsuna E, Yamazaki M, Mizuno G, Ando Y, et al. Dietary fish and ω-3 polyunsaturated fatty acids are associated with leukocyte ABCA1 DNA methylation levels. Nutrition. 2021 Jan;81:110951.
13. Sun Y V, Smith AK, Conneely KN, Chang Q, Li W, Lazarus A, et al. Epigenomic association analysis identifies smoking-related DNA methylation sites in African Americans. Hum Genet. 2013 Sep;132(9):1027–37. doi: 10.1007/s00439-013-1311-6
14. Philibert RA, Plume JM, Gibbons FX, Brody GH, Beach SRH. The impact of recent alcohol use on genome wide DNA methylation signatures. Front Genet. 2012;3:54. doi: 10.3389/fgene.2012.00054
15. Rider CF, Carlsten C. Air pollution and DNA methylation: effects of exposure in humans. Clin Epigenetics. 2019;11(1):131. doi: 10.1186/s13148-019-0713-2
16. Needham BL, Smith JA, Zhao W, Wang X, Mukherjee B, Kardia SLR, et al. Life course socioeconomic status and DNA methylation in genes related to stress reactivity and inflammation: The multi-ethnic study of atherosclerosis. Epigenetics. 2015;10(10):958–69. doi: 10.1080/15592294.2015.1085139
17. Fujii R, Sato S, Tsuboi Y, Cardenas A, Suzuki K. DNA methylation as a mediator of associations between the environment and chronic diseases: A scoping review on application of mediation analysis. Epigenetics. 2021 Aug 12;1–27.
18. Du J, Zhou X, Clark-Boucher D, Hao W, Liu Y, Smith JA, et al. Methods for large-scale single mediator hypothesis testing: Possible choices and comparisons. Genet Epidemiol. 2022 Dec 5;47(2):167–84. doi: 10.1002/gepi.22510
19. Bild DE, Bluemke DA, Burke GL, Detrano R, Diez-Roux A V, Folsom AR, et al. Multi-Ethnic Study of Atherosclerosis: objectives and design. Am J Epidemiol. 2002 Nov;156(9):871–81. doi: 10.1093/aje/kwf113
20. Whitaker SM, Bowie JV, McCleary R, Gaskin DJ, LaVeist TA, Thorpe RJ Jr. The Association Between Educational Attainment and Diabetes Among Men in the United States. Am J Mens Health. 2014;8(4):349–56. doi: 10.1177/1557988313520034
21. Singer DE, Nathan DM, Anderson KM, Wilson PW, Evans JC. Association of HbA1c with prevalent cardiovascular disease in the original cohort of the Framingham Heart Study. Diabetes. 1992 Feb;41(2):202–8. doi: 10.2337/diab.41.2.202
22. Lam LL, Emberly E, Fraser HB, Neumann SM, Chen E, Miller GE, et al. Factors underlying variable DNA methylation in a human community cohort. Proc Natl Acad Sci U S A. 2012 Oct;109(Suppl 2):17253–60. doi: 10.1073/pnas.1121249109
23. Borghol N, Suderman M, McArdle W, Racine A, Hallett M, Pembrey M, et al. Associations with early-life socio-economic position in adult DNA methylation. Int J Epidemiol. 2012 Feb;41(1):62–74. doi: 10.1093/ije/dyr147
24. Dawes K, Philibert W, Darbro B, Simons RL, Philibert R. Additive and Interactive Genetically Contextual Effects of HbA1c on cg19693031 Methylation in Type 2 Diabetes. Genes (Basel). 2022;13(4):683. doi: 10.3390/genes13040683
25. Baron RM, Kenny DA. The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations. J Pers Soc Psychol. 1986;51(6):1173–82. doi: 10.1037//0022-3514.51.6.1173
26. MacKinnon D. Introduction to statistical mediation analysis. New York, NY: Erlbaum.
27. VanderWeele TJ. Marginal Structural Models for the Estimation of Direct and Indirect Effects. Epidemiology. 2009;20(1):18–26. doi: 10.1097/EDE.0b013e31818f69ce
28. Pearl J. Direct and Indirect Effects. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 2001. p. 411–420.
29. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992 Mar;3(2):143–55. doi: 10.1097/00001648-199203000-00013
30. VanderWeele TJ. Mediation Analysis: A Practitioner’s Guide. Annu Rev Public Health. 2016;37:17–32. doi: 10.1146/annurev-publhealth-032315-021402
31. VanderWeele T. Explanation in causal inference: methods for mediation and interaction. New York: Oxford University Press; 2015.
32. Aung MT, Song Y, Ferguson KK, Cantonwine DE, Zeng L, McElrath TF, et al. Application of an analytical framework for multivariate mediation analysis of environmental data. Nat Commun. 2020;11(1):5624. doi: 10.1038/s41467-020-19335-2
33. VanderWeele TJ, Vansteelandt S. Mediation Analysis with Multiple Mediators. Epidemiol Method. 2014 Jan;2(1):95–115. doi: 10.1515/em-2012-0010
34. Imai K, Yamamoto T. Identification and Sensitivity Analysis for Multiple Causal Mechanisms: Revisiting Evidence from Framing Experiments. Polit Anal. 2013;21(2):141–71.
35. Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics. 2016 Oct 15;32(20):3150–4. doi: 10.1093/bioinformatics/btw351
36. Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing Mediation Effects in High-Dimensional Epigenetic Studies. Front Genet. 2019;10:1195. doi: 10.3389/fgene.2019.01195
37. Zhang Q. High-Dimensional Mediation Analysis with Applications to Causal Gene Identification. Stat Biosci. 2021;14:432–51.
38. Zhao Y, Luo X. Pathway LASSO: pathway estimation and selection with high-dimensional mediators. Stat Interface. 2022;15(1):39–50. doi: 10.4310/21-sii673
39. Zhou RR, Wang L, Zhao SD. Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika. 2020 Sep 1;107(3):573–89. doi: 10.1093/biomet/asaa016
40. Guo X, Li R, Liu J, Zeng M. High-Dimensional Mediation Analysis for Selecting DNA Methylation Loci Mediating Childhood Trauma and Cortisol Stress Reactivity. J Am Stat Assoc. 2022 Jul 3;117(539):1110–21.
41. Huang Y-T, Pan W-C. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics. 2016 Jun;72(2):402–13. doi: 10.1111/biom.12421
42. Zhao Y, Lindquist MA, Caffo BS. Sparse principal component based high-dimensional mediation analysis. Comput Stat Data Anal. 2020;142:106835. doi: 10.1016/j.csda.2019.106835
43. Chén OY, Crainiceanu C, Ogburn EL, Caffo BS, Wager TD, Lindquist MA. High-dimensional multivariate mediation with application to neuroimaging data. Biostatistics. 2018;19(2):121–36. doi: 10.1093/biostatistics/kxx027
44. Song Y, Zhou X, Zhang M, Zhao W, Liu Y, Kardia SLR, et al. Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies. Biometrics. 2020 Sep;76(3):700–10. doi: 10.1111/biom.13189
45. Song Y, Zhou X, Kang J, Aung MT, Zhang M, Zhao W, et al. Bayesian sparse mediation analysis with targeted penalization of natural indirect effects. J R Stat Soc Ser C Appl Stat. 2021 Nov 1;70(5):1391–412. doi: 10.1111/rssc.12518
46. Derkach A, Pfeiffer RM, Chen T-H, Sampson JN. High dimensional mediation analysis with latent variables. Biometrics. 2019 Sep 1;75(3):745–56. doi: 10.1111/biom.13053
47. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010 Apr 1;38(2):894–942.
48. Lee JD, Sun DL, Sun Y, Taylor JE. Exact Post-Selection Inference, with application to the LASSO. Ann Stat. 2016;44(3):907–27.
49. Perera C, Zhang H, Zheng Y, Hou L, Qu A, Cheng Z, et al. HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data. BMC Bioinformatics. 2022;23:1–14.
50. Zhang SS, Zhang C-H. Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Series B Stat Methodol. 2014;76(1):217–42.
51. Zou H. The Adaptive Lasso and Its Oracle Properties. J Am Stat Assoc. 2006 Dec 1;101(476):1418–29.
52. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002 Mar;7(1):83–104. doi: 10.1037/1082-989x.7.1.83
53. Liu Y, Ding J, Reynolds LM, Lohman K, Register TC, De La Fuente A, et al. Methylomics of gene expression in human monocytes. Hum Mol Genet. 2013 Dec;22(24):5065–74. doi: 10.1093/hmg/ddt356
54. Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11(1):587. doi: 10.1186/1471-2105-11-587
55. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc. 2008 Nov 1;70(5):849–911.
56. Pedroso JAB, Ramos-Lobo AM, Donato JJ. SOCS3 as a future target to treat metabolic disorders. Hormones (Athens). 2019 Jun;18(2):127–36. doi: 10.1007/s42000-018-0078-5
57. Stage E, Duran T, Risacher SL, Goukasian N, Do TM, West JD, et al. The effect of the top 20 Alzheimer disease risk genes on gray-matter density and FDG PET brain metabolism. Alzheimer’s Dement (Amsterdam, Netherlands). 2016;5:53–66. doi: 10.1016/j.dadm.2016.12.003
58. Mei H, Li L, Liu S, Jiang F, Griswold M, Mosley T. Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis. Genes (Basel). 2017 Jan;8(1):44. doi: 10.3390/genes8010044
59. Rahman SA, Nessa A, Hussain K. Molecular mechanisms of congenital hyperinsulinism. J Mol Endocrinol. 2015;54(2):R119–29. doi: 10.1530/JME-15-0016
60. Galcheva S, Al-Khawaga S, Hussain K. Diagnosis and management of hyperinsulinaemic hypoglycaemia. Best Pract Res Clin Endocrinol Metab. 2018;32(4):551–73. doi: 10.1016/j.beem.2018.05.014
61. Pedroso JAB, Buonfiglio DC, Cardinali LI, Furigo IC, Ramos-Lobo AM, Tirapegui J, et al. Inactivation of SOCS3 in leptin receptor-expressing cells protects mice from diet-induced insulin resistance but does not prevent obesity. Mol Metab. 2014 Sep;3(6):608–18. doi: 10.1016/j.molmet.2014.06.001
62. Senniappan S, Shanti B, James C, Hussain K. Hyperinsulinaemic hypoglycaemia: genetic mechanisms, diagnosis and management. J Inherit Metab Dis. 2012 Jul 1;35(4):589–601. doi: 10.1007/s10545-011-9441-2
63. Wickramarachchi DS, Lim LHM, Sun B. Mediation analysis with multiple mediators under unmeasured mediator-outcome confounding. Stat Med. 2023 Feb 20;42(4):422–32. doi: 10.1002/sim.9624
64. Huang Y-T, Vanderweele TJ, Lin X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann Appl Stat. 2014 Mar;8(1):352–76. doi: 10.1214/13-AOAS690
65. Meinshausen N. Sign-constrained least squares estimation for high-dimensional regression. Electron J Stat. 2013;7:1607–31.

PLoS Genet. doi: 10.1371/journal.pgen.1011022.r001

Decision Letter 0

John M Greally, Zoltán Kutalik

21 Jun 2023

Dear Dr Mukherjee,

Thank you very much for submitting your Research Article entitled 'Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Zoltán Kutalik, PhD

Academic Editor

PLOS Genetics

John Greally

Section Editor

PLOS Genetics

As the authors can see, the reviewers raised some important points but appreciated the usefulness of the findings and the topic addressed. Beyond their comments, I'd like to ask the authors to address the point of unmeasured confounders: in the simulations it is always assumed that there is no unmeasured confounder and that the linear model fit reflects causation. Also, what happens if variables treated as mediators are actually colliders? Such simulation scenarios (including model misspecifications) and the presence of unmeasured confounding factors should be explored and commented on in the Discussion.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript is a timely evaluation and comparison of the methods for high dimensional mediation with a single exposure, continuous mediators and continuous outcome. The R package is also a valuable resource for the practitioners working with high dimensional mediation analysis. While the paper focuses on methylation data, it is expected to benefit the researchers in other areas as well. The following are my specific comments.

1. Correlation: the issue of “correlations” or correlated mediators is mentioned in many places in this manuscript. The mediators may be correlated for different reasons. They can be causally related, or their noise terms may be correlated, or they are conditionally independent given the exposure and the confounders, and merely marginally correlated due to the exposures and the confounders. The authors did not make any distinction between them, which may unintentionally mislead the readers. One such example (among many) is lines 94-96 on page 5. It reads “Instead, so that we leverage these correlations rather than ignore them, the preferred approach is to evaluate the mediators jointly through a single, multivariable statistical model.” This is largely not true. Using the multivariate outcome model only addresses the marginal correlation due to their dependence on common exposure and confounders. Most of the surveyed methods assume that the mediators are not causally interdependent, and some of the inference procedures further assume independent noise for mediators. The authors need to clarify what types of correlations these methods can “leverage”, and what their common limitations are.

2. Interpretations of the mediation contribution: This is partially related to the correlation issue. The authors have presented the causal assumptions for the total mediation effect but not the mediation effect of the individual mediators. Instead, they say that they cannot be interpreted as a causal effect through the jth mediator. This is true in general. But it will be more helpful if the authors can present the assumptions under which alpha_j*beta_j can be interpreted as a causal effect through the jth mediator. I believe that some of the surveyed papers have presented such assumptions. The authors also need to clarify further the meaning of “mediation contribution” and its limitations. For example, if a mediator’s mediation effect is completely mediated by other mediators, it may not be significant in the outcome model, but it is still “active” somewhere in the causal network among the mediators.

3. The simulation setting appears to resemble a setting with inconsistent mediation in which half of the mediators have positive mediation effects and the other half have negative mediation effects, and they partially cancel each other out. While it is common in multivariate mediation analysis, it will be better to include a case where the mediation effects of the individual mediators are more consistent. It means that the direct effect, total effect, and the mediation effects of most true mediators have the same sign.

4. It will benefit the readers the most if these methods are evaluated using a real dataset that they have direct access to, instead of one behind the dbGaP wall.

5. What are the average or median magnitudes of the correlations in the moderate and high correlation simulation settings? They should have meaningful differences.

6. The following paper is a method similar to HILMA published in JASA in 2022. It is faster than HILMA, and the application involves methylation data. It also reports a set of “important mediators” based on the variable selection of the outcome model. But it is not exactly mediator selection and there is no inference for it. They provided code and script to reproduce their results. It is NOT my paper.

Guo, X., Li, R., Liu, J., & Zeng, M. (2022). High-dimensional mediation analysis for selecting DNA methylation Loci mediating childhood trauma and cortisol stress reactivity. Journal of the American Statistical Association, 117(539), 1110-1121.

7. The website for the real data https://www.mesanhlbi.org/ may contain typos. The data availability statement should mention that dbGaP application is needed.

8. Line 130 on page 6, can p, q be larger than n?

Reviewer #2: Referee report on “Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons”

General comment: In this paper, the authors reviewed and evaluated seven statistical methods of mediation analysis using simulations and high-dimensional DNA methylation (DNAm) data. The authors also centralize the computer codes of all the methods into a single, stand-alone R package. In addition, the authors provide some guidelines for the usage of these statistical methods. In particular, the authors created a decision tree for selecting a high-dimensional mediation analysis.

The following are some detailed comments.

Comments:

1. In introduction, the authors said that “Though several methods for fitting such a model have been presented in the literature, none of them are widely used in analyzing DNAm data.” However, there are some high-dimensional mediation methods that have been used in DNAm data, e.g., HIMA2 in [1].

2. Moreover, HIMA2 is an extension of the HIMA method. The authors have evaluated the HIMA methods. I was wondering whether HIMA2 will perform better if we add it into comparison.

3. In data application, the authors mentioned that they used model (6) to select 2000 CpG sites. But I did not find the model (6) in the paper. Can the authors write or locate the model (6) more clearly?

4. In simulation analysis on Page 16, the authors mentioned not only the MSE in mediation contributions of active mediators but also MSE in mediation contributions of mediators. However, in simulation results, only the MSE in contributions of active mediators is presented. I was wondering why the MSE in mediation contributions of all mediators is ignored.

5. In the DNAm data analysis, the authors calculated the estimated mediation contributions of each method. What is a clear definition of the estimated mediation contributions?

6. In simulations, the authors generated the exposure, mediators, and outcome from continuous distribution. However, the decision tree for selecting a high-dimensional mediation analysis in Figure 6 says that we should choose HDMA and HIMA methods when the outcome is binary. Should we provide some simulations for the performance of the two methods when the outcome is binary?

References

[1] Perera, C., Zhang, H., Zheng, Y., Hou, L., Qu, A., Zheng, C., Xie, K., and Liu, L. (2022). Hima2: high-dimensional mediation analysis and its application in epigenome-wide dna methylation data. BMC bioinformatics, 23(1):1–14.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: No: I am not sure whether it is necessary for this data analysis paper, but I did not see any "numerical data that underlies graphs" in the supplementary materials.

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Genet. 2023 Nov 7;19(11):e1011022. doi: 10.1371/journal.pgen.1011022.r002

Author response to Decision Letter 0

30 Sep 2023

Attachment

Submitted filename: response_to_reviewers.pdf

PLoS Genet. doi: 10.1371/journal.pgen.1011022.r003

Decision Letter 1

John M Greally, Zoltán Kutalik

18 Oct 2023

Dear Dr Mukherjee,

We are pleased to inform you that your manuscript entitled "Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Please note the comment from reviewer 2:

"In the responses, the authors said that they included a binary case in Figure 6. But I did not find that case in Figure 6. Is this table explained in the manuscript?"

Please address this in your editing process, in case there is something missing that would compromise your publication.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Zoltán Kutalik, PhD

Academic Editor

PLOS Genetics

John Greally

Section Editor

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The revised manuscript has addressed my original concerns.

Reviewer #2: The authors have appropriately addressed most of my comments. I just have the following minor comments.

In the responses, the authors said that they included a binary case in Figure 6. But I did not find that case in Figure 6. Is this table explained in the manuscript?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: None

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-23-00581R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

PLoS Genet. doi: 10.1371/journal.pgen.1011022.r004

Acceptance letter

John M Greally, Zoltán Kutalik

30 Oct 2023

PGENETICS-D-23-00581R1

Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons

Dear Dr Mukherjee,

We are pleased to inform you that your manuscript entitled "Methods for mediation analysis with high-dimensional DNA methylation data: Possible choices and comparisons" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Judit Kozma

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Text. Methods for Mediation Analysis with High-Dimensional DNA Methylation Data: Possible Choices and Comparison.

(PDF)

S1 File. Estimated mediation contributions of CpG sites identified as mediators by any method.

(XLSX)

S2 File. Numerical results underlying the figures from the simulation study.

(XLSX)

S3 File. Zip file of computer code for performing the simulation study.

(ZIP)

S1 Fig. Results for simulations with strictly non-negative effects.

(PNG)

S2 Fig. Results for simulations with an unmeasured confounder U.

(PNG)

S3 Fig. True positive rate for detecting mediation signals at a false discovery rate of 10%.

(PNG)

S4 Fig. MSE in estimating mediation contributions of active mediators, relative to one-at-a-time method.

(PNG)

S5 Fig. MSE in estimating mediation contributions of inactive mediators, relative to one-at-a-time method.

(PNG)

S6 Fig. Percent relative bias in estimated global indirect effect.

(PNG)

S7 Fig. Genes containing or near CpG sites selected as active mediators between low education and HbA1c by methods for high-dimensional mediation analysis.

CpG sites were linked to genes using the R Bioconductor package “IlluminaHumanMethylation450kanno.ilmn12.hg19”. Additional genes detected by Pathway LASSO are listed in S1 File.

(JPG)

S1 Table. Summary of methods for high-dimensional mediation analysis.

(PDF)

S2 Table. Complete list of primary simulation settings.

(PDF)

S3 Table. Complete list of additional simulation settings.

(PDF)

Data Availability Statement
