Author manuscript; available in PMC: 2022 Mar 7.
Published in final edited form as: Genet Epidemiol. 2021 Oct 19;46(1):32–50. doi: 10.1002/gepi.22433

Penalized mediation models for multivariate data

Daniel J Schaid 1, Ozan Dikilitas 2, Jason P Sinnwell 1, Iftikhar J Kullo 2
PMCID: PMC8900147  NIHMSID: NIHMS1780191  PMID: 34664742

Abstract

Statistical methods to integrate multiple layers of data, from exposures to intermediate traits to outcome variables, are needed to guide interpretation of complex data sets for which variables are likely contributing in a causal pathway from exposure to outcome. Statistical mediation analysis based on structural equation models provides a general modeling framework, yet such models can be difficult to apply to high-dimensional data and are not automated to select the best-fitting model. To overcome these limitations, we developed novel algorithms and software to simultaneously evaluate multiple exposure variables, multiple intermediate traits, and multiple outcome variables. Our penalized mediation models are computationally efficient, and simulations demonstrate that they produce reliable results for large data sets. Application of our methods to a study of vascular disease demonstrates their utility to identify novel direct effects of single-nucleotide polymorphisms (SNPs) on coronary heart disease and peripheral artery disease, while disentangling the effects of SNPs on the intermediate risk factors including lipids, cigarette smoking, systolic blood pressure, and type 2 diabetes.

Keywords: cardiovascular disease, data integration, L1 penalty, mediation analysis, structural equation model

1 |. INTRODUCTION

Mediation analysis attempts to determine the relationships among exposure variables, intermediate “mediator” variables, and outcome variables, with an aim to infer causal relationships (D. MacKinnon, 2008; T. VanderWeele, 2015). This is easily viewed as a causal diagram such as exposure → mediator → outcome, where the arrows denote the direction of causation. For quantitative variables, linear regression models are often used to partition the total effect of an exposure variable on the outcome variable into a direct effect of exposure on the outcome, and an indirect effect that passes through the mediator (D. P. MacKinnon et al., 2002; T. J. VanderWeele, 2016). The power and popularity of mediation analysis have grown in different fields, such as social and political science research, epidemiology, and bioinformatics.

Recent efforts to model high-dimensional biomarkers (e.g., metabolite levels or gene expression) that could potentially act as mediators have focused on latent factors, whereby the low-dimensional latent factors capture the variation in the high-dimensional biomarkers and the unobserved latent factors are treated as mediators (Albert et al., 2016; Derkach et al., 2019; Huang & Pan, 2016). Although this approach provides a computational advantage, it does not disentangle the effects of the biomarkers that directly mediate the effect of exposure on the outcome. Alternatively, structural equation models (SEMs) are preferred for multiple mediators because they can be used to simultaneously estimate the parameters representing direct and indirect effects, they provide explicit ways to model the relationships among the variables, and they account for the correlations among the mediators (Hoyle, 2012; Preacher & Hayes, 2008; T. J. VanderWeele & Vansteelandt, 2014). This approach offers several advantages. First, the regression of the outcome on all mediators, and the regression of each mediator on exposure, can all be achieved with a single model, allowing one to assess the fit of a model, to contrast with competing models. Second, the correlations among the residuals after regressing each mediator on exposure can be incorporated into the model to improve statistical efficiency. When only measured variables are included in a model, SEMs reduce to the approach of path analysis (Li, 1975), originated by Sewall Wright (Wright, 1921, 1923).

We recently developed a penalized SEM model tailored for mediation analysis for a single exposure, multiple mediators, and a single outcome, focused on the pairs of parameters that link the exposure, a mediator, and the outcome variables (Schaid & Sinnwell, 2020). This type of penalized model provided computational advantages over more general penalized SEMs (Jacobucci et al., 2016; Serang et al., 2017), but the restriction to a single exposure and a single outcome can be limiting. The purpose of this current work was to expand our past efforts to penalized SEMs for multiple exposures, multiple mediators, and multiple outcomes while maintaining computational efficiency.

In this article, we describe the framework of SEMs and how parameters are penalized, the computational method to fit the penalized log-likelihood, and demonstrate the statistical properties of the penalized model by simulations. We then illustrate the utility of our approach by applying it to a study of vascular disease. In this study, the multiple exposure variables are genetic variants; the multiple mediators are risk factors for vascular disease—risk factors that are also associated with genetic variants; the multiple outcomes are coronary heart disease (CHD) and peripheral artery disease (PAD). The aim of this analysis was to disentangle the effects of various genetic and nongenetic factors on the risk of vascular disease—an attempt to determine the direct effects of genetic variants on vascular disease, versus their indirect effects through risk factors known to be associated with vascular disease.

2 |. METHODS

2.1 |. Statistical models

For the ith subject (i = 1, …, n), let $y_i = (y_{i1}, \ldots, y_{ip})$ denote a vector of p traits, $m_i = (m_{i1}, \ldots, m_{iq})$ a vector of q potential mediators, and $x_i = (x_{i1}, \ldots, x_{ir})$ a vector of r exposures. These vectors can be arranged in the respective matrices $Y_{n \times p}$, $M_{n \times q}$, and $X_{n \times r}$, where the data for the ith subject are in the ith row of each of these matrices. We wish to fit an SEM that captures mediation effects, as in x → m → y, and direct effects, as in x → y. In addition, we need to account for residual variances and covariances among the variables in the assumed models. To illustrate the assumed SEM, we present in Figure 1 a model for three x exposure variables, three m mediators, and two y outcomes. Although the complete graph includes both directed and undirected edges, we split the graph into the directed edges and undirected edges to aid visualization. The directed edges capture the direct effects of x on y and the indirect effects of x through the mediators on y. In Figure 1, it can be seen that arrows point out of each x variable and into each of the m mediator variables and into each of the y outcome variables. Also, arrows point out of each of the m mediators and into each of the y outcome variables. The statistical goal is to determine which arrows to select for a model. The undirected edges in Figure 1 represent the residual variances and covariances, after accounting for the directed edges. The variances are represented by the undirected loops of a variable connected with itself, and the covariances by undirected edges between two different variables. The model in Figure 1 illustrates that, conditional on the selected directed edges, the model includes the residual variances and covariances of the m mediator variables (a q × q covariance matrix Vm), the residual variances and covariances of the y outcome variables (a p × p covariance matrix Vy), and the variances and covariances of the x exposure variables (an r × r covariance matrix Vx). This figure illustrates that there are no assumed residual covariances between the x, m, and y variables. Based on these assumptions, our SEM approach will account for the covariance structure of all variables.

FIGURE 1.

Directed acyclic graph on the left depicting all possible relationships among exposures in green (x1, x2, x3), mediators in blue (m1, m2, m3), and outcomes in red (y1, y2). The sub-graphs with undirected edges on the right represent the variances and covariances within each of the x, m, and y sets of variables

Let the random vector of all variables be arranged as $z_i = (x_i, m_i, y_i)$. We assume that $z_i$ is centered about its mean and has a multivariate normal distribution with covariance matrix Σ. The structure of Σ is determined by the assumed SEM. With this setup, we can use the reticular action model (RAM), which provides a 1:1 mapping of the parameters in a graphical representation of an SEM to matrices that determine the implied covariance matrix Σ (Jacobucci et al., 2016; McArdle, 2005). Although the general RAM allows for latent variables, they do not exist in our proposed mediation model and so these terms are ignored in our presentation. The RAM includes two matrices, A and S. The asymmetric A matrix represents the parameters for the directional effects of one-headed arrows (conceptually viewed as column labels of matrix A pointing down to row labels of matrix A). The symmetric S matrix represents all undirected edges (i.e., all variances and covariances). Based on how variables are arranged in the vector z and the parameters in the SEM of Figure 1, the matrix A has the block form below, with rows and columns both ordered as (x, m, y):

$$
A = \begin{pmatrix} 0 & 0 & 0 \\ \alpha & 0 & 0 \\ \delta & \beta & 0 \end{pmatrix}
$$

Because each element of vector x points to each of the mediators in vector m, the matrix $\alpha_{q \times r}$ of effects of x on m occupies the columns pertaining to x and the rows pertaining to m. Likewise, because each element of vector x points to each element of vector y, the matrix $\delta_{p \times r}$ of direct effects occupies the columns pertaining to x and the rows pertaining to y. The effects of each of the mediators in vector m on each of the traits in vector y are in the matrix $\beta_{p \times q}$. All other cells of matrix A contain 0. The indirect effect of xi on yk can be depicted by xi – [αji] → mj – [βkj] → yk, where the influence of xi on mj is represented by αji, and the influence of mj on yk is represented by βkj. The direct effect of xi on yk is measured by δki and can be depicted by xi – [δki] → yk.
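As a compact restatement of the path notation above, and assuming the usual product-of-coefficients convention for linear models (an assumption here, since the text describes the paths but does not write out the product), the indirect and total effects of exposure i on outcome k can be collected in matrix form:

```latex
% (\beta\alpha)_{ki} sums the path products over all q mediators
\text{indirect}_{ki} = \sum_{j=1}^{q} \beta_{kj}\,\alpha_{ji} = (\beta\alpha)_{ki},
\qquad
\text{total}_{ki} = \delta_{ki} + (\beta\alpha)_{ki}.
```

Here βα is a p × r matrix, matching the dimensions of δ.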

The symmetric S matrix, for variances and covariances, is block diagonal with matrices Vx, Vm, and Vy along the diagonal:

$$
S = \begin{pmatrix} V_x & 0 & 0 \\ 0 & V_m & 0 \\ 0 & 0 & V_y \end{pmatrix}
$$

Based on the above matrices A and S, the implied covariance matrix can be expressed as $\Sigma = (I - A)^{-1} S \, [(I - A)^{-1}]'$. In Appendix A we illustrate how the special structure of these matrices allows analytic formulas to rapidly compute $(I - A)^{-1}$.
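The mapping from (α, β, δ, Vx, Vm, Vy) to Σ can be sketched in a few lines of NumPy. The function name and toy values below are illustrative and are not taken from the regmed package. Because A is strictly block lower triangular with three blocks, A³ = 0, so (I − A)⁻¹ = I + A + A²; this may or may not be the analytic formula used in Appendix A, which is not reproduced here.

```python
import numpy as np

def implied_covariance(alpha, beta, delta, Vx, Vm, Vy):
    """Return (A, S, Sigma) for variables ordered as z = (x, m, y)."""
    q, r = alpha.shape          # alpha is q x r (x -> m)
    p = beta.shape[0]           # beta is p x q (m -> y); delta is p x r (x -> y)
    d = r + q + p
    A = np.zeros((d, d))
    A[r:r + q, :r] = alpha
    A[r + q:, :r] = delta
    A[r + q:, r:r + q] = beta
    S = np.zeros((d, d))
    S[:r, :r] = Vx
    S[r:r + q, r:r + q] = Vm
    S[r + q:, r + q:] = Vy
    # A is nilpotent (A @ A @ A == 0), so the inverse of (I - A) is a finite sum.
    B = np.eye(d) + A + A @ A
    Sigma = B @ S @ B.T
    return A, S, Sigma

# Toy example with made-up values: r = 2 exposures, q = 2 mediators, p = 1 outcome.
alpha = np.array([[0.5, 0.0], [0.0, 0.3]])
beta = np.array([[0.4, 0.2]])
delta = np.array([[0.1, 0.0]])
Vx = np.array([[1.0, 0.5], [0.5, 1.0]])
Vm = np.eye(2)
Vy = np.eye(1)
A, S, Sigma = implied_covariance(alpha, beta, delta, Vx, Vm, Vy)
```

In this toy example the (m, x) block of Σ equals αVx, which is the covariance one expects when m = αx plus an independent residual.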

Based on mapping the SEM to an implied covariance matrix, we can now consider ways to measure the fit of an SEM to data. The minus twice log-likelihood of the SEM, divided by n, is

$$
\text{lnlike} = \log\det(\Sigma) + \operatorname{tr}(\Sigma^{-1} D), \tag{1}
$$

where log det(Σ) is the log of the determinant of Σ, tr () is the trace of a matrix, and D is the sample covariance matrix. To fit the SEM, we propose the following penalized model based on L1 penalties for each of the αij, βij, and δij parameters:

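$$
P(\alpha, \beta, \delta; \lambda) = \lambda \left( w(q, r) \sum_{i=1}^{q} \sum_{j=1}^{r} |\alpha_{ij}| + w(p, r) \sum_{i=1}^{p} \sum_{j=1}^{r} |\delta_{ij}| + w(p, q) \sum_{i=1}^{p} \sum_{j=1}^{q} |\beta_{ij}| \right). \tag{2}
$$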

The penalty parameter λ governs the overall amount of shrinkage, and for the elements in each of the matrices α, β, and δ we use a lasso L1 penalty. We balance the penalty across matrices of different sizes by the function w(d1, d2), where d1 and d2 are the dimensions of a matrix. Our initial work did not include this weight function, and simulations (not shown) found that an imbalance in the number of elements across matrices led to large matrices dominating the penalty, such that parameters in other matrices were shrunk to zero too often. By including this weight function and by additional simulations (not shown), we found that w(d1, d2) = (d1d2).4 provided good performance, and this weight is used throughout this study. Adding expressions (1) and (2) results in the penalized ln like that we need to minimize,

$$
\text{pen-lnlike} = \log\det(\Sigma) + \operatorname{tr}(\Sigma^{-1} D) + P(\alpha, \beta, \delta; \lambda). \tag{3}
$$
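Expressions (1) to (3) translate directly into code. The sketch below is a plain NumPy transcription and not the regmed implementation; the function names are hypothetical, and the three weights w(q, r), w(p, r), and w(p, q) are passed in as precomputed numbers rather than derived from a particular form of w.

```python
import numpy as np

def lnlike(Sigma, D):
    """Expression (1): log det(Sigma) + tr(Sigma^{-1} D)."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    # solve(Sigma, D) gives Sigma^{-1} D without forming the inverse explicitly
    return logdet + np.trace(np.linalg.solve(Sigma, D))

def penalty(alpha, beta, delta, lam, w_alpha, w_delta, w_beta):
    """Expression (2): weighted L1 penalty over the three parameter matrices."""
    return lam * (w_alpha * np.abs(np.asarray(alpha)).sum()
                  + w_delta * np.abs(np.asarray(delta)).sum()
                  + w_beta * np.abs(np.asarray(beta)).sum())

def pen_lnlike(Sigma, D, alpha, beta, delta, lam, w_alpha, w_delta, w_beta):
    """Expression (3): the objective to be minimized."""
    return lnlike(Sigma, D) + penalty(alpha, beta, delta, lam,
                                      w_alpha, w_delta, w_beta)
```

Using `slogdet` and `solve` instead of `det` and `inv` is a numerical-stability choice made for this sketch.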

The penalty function in expression (2) differs from our prior penalty for the case of a single exposure and a single outcome (Schaid & Sinnwell, 2020). In that case, each mediator has only two parameters (αj and βj), and we treated each of these as a pair in a group, for a sparse-group lasso penalty, with a grouping effect $\sqrt{\alpha_j^2 + \beta_j^2}$ and a penalty to encourage sparseness (|αj| + |βj|). However, when there are multiple exposures and multiple outcomes, the sparse-group model becomes more complicated because of overlapping groupings. For this reason, we chose the penalty model of expression (2), which achieves a sparse model while balancing across the parameter matrices of different sizes.

2.2 |. Optimization methods

The parameters to update to minimize the penalized ln like are the penalized parameters αij, βij, and δij. In addition, because $Y_{n \times p}$ can depend on $M_{n \times q}$ and $X_{n \times r}$, and $M_{n \times q}$ can depend on $X_{n \times r}$, the un-penalized parameters Vm and Vy could potentially require updating. In contrast, Vx does not require updating because $X_{n \times r}$ does not depend on any other variables. We assume that the number of traits (p) is relatively small and update un-penalized items of Vy by a gradient descent algorithm. However, because we allow for a potentially large number of mediators (q), optimizing the penalized ln like for all the terms in the matrix Vm can lead to long computation times and numerical instability. For this reason, we first use concepts from seemingly unrelated regression models (Zellner, 1962) to provide a robust estimate of Vm and then perform a single regularization of this matrix to ensure that it is of full rank. The Vm matrix is then fixed during our optimization algorithm.

To estimate Vm, we first perform ordinary least squares regression of each mj on all x's to create a vector of residuals for each subject (vector of length q for the mediators). These residuals are used to estimate the sample variance matrix $\hat{V}_m$, a method used by seemingly unrelated regression (Zellner, 1962) when computing generalized least squares. Because this matrix is not full rank when some mediators are highly correlated, or when q > n, we regularize this matrix by using the glasso R package (Friedman et al., 2008). This places an L1 penalty on terms in $\hat{V}_m$, shrinking small values to zero, based on a penalty parameter. We use a small penalty (i.e., penalty of 0.02) to achieve a full rank matrix, denoted $\tilde{V}_m$. Simulations with different penalties (not shown) suggest that the choice of 0.02 for the glasso penalty had little impact on results.
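The first step, regressing each mediator on all exposures and taking the covariance of the residuals, can be sketched as follows. This is an illustration with simulated toy data, not the regmed code; the function name is hypothetical, the divisor n is an assumption (the text does not state whether n or a degrees-of-freedom correction is used), and the subsequent glasso regularization is not shown.

```python
import numpy as np

def residual_covariance(X, M):
    """Covariance of residuals from OLS of each mediator on all exposures."""
    Xc = X - X.mean(axis=0)     # center, matching the centered-data assumption
    Mc = M - M.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, Mc, rcond=None)   # r x q coefficient matrix
    resid = Mc - Xc @ coef                           # n x q residuals
    n = X.shape[0]
    return resid.T @ resid / n, resid                # divisor n is an assumption

# Toy data (made-up effect sizes): 3 exposures, 2 mediators, 200 subjects.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
effects = np.array([[0.5, 0.0], [0.0, 0.3], [0.0, 0.0]])
M = X @ effects + rng.normal(size=(200, 2))
Vm_hat, resid = residual_covariance(X, M)
```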

To estimate the SEM parameters, we use an algorithm that iteratively updates all terms of α, followed by β, δ, and Vy. Some key aspects of our gradient descent algorithm are: (1) the use of sub-gradient equations to account for penalty functions that are not differentiable for some values of the parameters, and (2) use of majorization-minimization, a way to expand the objective function such that the new function dominates the original objective function, yet minimizing the new function is easier and leads to the same solution (Lange, 2004).
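Key aspect (1) can be illustrated with the soft-thresholding operator, which solves the sub-gradient equation for a quadratic term plus an L1 penalty. The sketch below is a generic proximal-gradient step and not the authors' exact update (their derivatives and updates are given in Appendix A, which is not reproduced here); `grad`, `step`, and `lam_w` (λ multiplied by the matrix weight) are hypothetical inputs. A step of this form can be read as minimizing a quadratic majorizer of the smooth part of the objective, which is one common instance of majorization-minimization.

```python
def soft_threshold(z, t):
    """Minimizer over theta of 0.5 * (theta - z)**2 + t * abs(theta), for t >= 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0   # the sub-gradient condition holds at exactly zero

def proximal_step(theta, grad, step, lam_w):
    """One penalized update: a gradient step on the smooth part, then shrinkage."""
    return [soft_threshold(th - step * g, step * lam_w)
            for th, g in zip(theta, grad)]

# Toy values: the middle parameter is shrunk exactly to zero.
updated = proximal_step([1.0, 0.1, -2.0], [0.5, 0.2, -1.0], step=1.0, lam_w=0.3)
```

The exact zeros produced by the shrinkage step are what make the fitted model sparse.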

The algorithm is a sequence of nested loops. The outer loop cycles over several inner loops, with an inner loop for each of α, β, δ, and Vy. When updating parameters within a loop, all other parameters remain fixed. Each of the inner loops follows a similar algorithm, with differences determined by how the derivatives of the ln like differ across the different parameters, and how the parameters are updated. Appendix A provides details on derivatives and parameter updates for all parameters, as well as the steps of the inner loop.

We created an R package (regmed) that implements the optimization algorithm with efficient C++ code, based on the linear algebra library in the package RcppArmadillo. To choose a value of penalty parameter λ, we search a grid of values from largest to smallest. The estimated parameters for a specified value of λ are used as initial values to provide a warm start for the next smaller grid value of λ. The lower limit of λ was determined by maintaining the number of estimated parameters less than the sample size. This is because the implied covariance matrix Σ depends on the estimated parameters α, β, δ, and Vy, and determining the inverse of matrix Σ can become numerically unstable when the number of estimated parameters exceeds the sample size. The grid λ value that results in the smallest Bayesian information criterion (BIC) is chosen for the best fitting model. Although our models are conceptually similar to regularization of general SEMs (Jacobucci et al., 2016), there are key differences that make our tailored approach more computationally efficient and stable: (1) our lasso penalty accounts for the differing number of parameters in the matrices α, β, δ such that matrices with a large number of parameters do not dominate the penalty; (2) we were able to capitalize on the structure of the assumed SEM to reduce computation time; (3) by experimentation, we found the majorization-minimization step to reduce the number of required iterations, in contrast to standard gradient descent with half-step back-tracking.
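The grid search with warm starts can be sketched as below. Everything here is a stand-in: `fit` and `criterion` are hypothetical callables, the toy criterion is only BIC-like (the text does not give the exact BIC formula), and the parameter count ignores Vy, which the authors include in their count.

```python
import math

def select_lambda(lambdas, fit, criterion, n, init):
    """Search lambdas from largest to smallest with warm starts.

    Returns (score, lam, params) for the smallest criterion value.
    """
    best = None
    params = init
    for lam in sorted(lambdas, reverse=True):
        params = fit(lam, params)                 # warm start from previous fit
        n_nonzero = sum(1 for v in params if v != 0.0)
        if n_nonzero >= n:                        # keep parameter count below n
            break
        score = criterion(params, n_nonzero)
        if best is None or score < best[0]:
            best = (score, lam, params)
    return best

# Toy stand-ins: the "fit" is a closed-form lasso on fixed made-up values z.
z = [3.0, 1.0, 0.2]
n = 10

def toy_fit(lam, start):
    # A real fitter would iterate from `start`; this closed form ignores it.
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in z]

def toy_bic(params, k):
    rss = sum((a - b) ** 2 for a, b in zip(z, params))
    return rss + math.log(n) * k                  # fit term plus log(n) per parameter

score, lam, params = select_lambda([0.1, 0.5, 2.0, 4.0], toy_fit, toy_bic, n, [0.0] * 3)
```

In this toy run the criterion is smallest at λ = 0.5, where two of the three parameters are retained.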

2.3 |. Simulation studies

To evaluate the statistical properties of the proposed penalized SEM, we simulated data under five scenarios. For each scenario, we simulated 80 quantitative exposure variables, 10 quantitative potential mediators, and 5 quantitative outcome variables. We assumed that the exposure variables had a compound symmetric covariance structure, with diagonal variances of 1 and off-diagonal covariances all equal to 0.5; the potential mediator variables had a compound symmetric covariance structure, with diagonal variances of 1 and off-diagonal covariances all equal to 0.2; and the outcome variables had a compound symmetric covariance structure, with diagonal variances of 1 and off-diagonal covariances all equal to 0.1. Scenario 1 is a null model with all α = β = δ = 0, a model used to evaluate the rate of false-positive selection of any parameters. Scenarios 2–5 are illustrated in Figure 2. Scenarios 2–4 represent models where only one set of parameters is nonzero: scenario 2 for a subset of α's; scenario 3 for a subset of β's; scenario 4 for a subset of δ's. Scenario 5 is a true mediation model for a subset of the x's, m's, and y's. For all simulations, the assumed covariance structure for each of the x's, m's, and y's, together with the model parameters (α, β, δ), determined an implied covariance matrix for all random variables. This implied covariance matrix was used to simulate data from a multivariate normal distribution, for sample sizes of 100, 500, or 1000. Simulations were repeated 100 times for each simulation experiment, and the penalized SEM was fit to each data set. The results for each simulation experiment were summarized by the average number of selected parameters of each type (α, β, δ) and the average values of the selected parameters.
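A minimal sketch of the data generation for the null model (scenario 1) is given below. With α = β = δ = 0 the implied covariance matrix is simply block diagonal, so no SEM algebra is needed; the function names and the random seed are arbitrary choices for this illustration.

```python
import numpy as np

def compound_symmetric(d, rho):
    """d x d matrix with unit variances and common covariance rho."""
    return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

def null_scenario_covariance(r=80, q=10, p=5):
    """Scenario 1: alpha = beta = delta = 0, so Sigma is block diagonal."""
    d = r + q + p
    Sigma = np.zeros((d, d))
    Sigma[:r, :r] = compound_symmetric(r, 0.5)            # exposures
    Sigma[r:r + q, r:r + q] = compound_symmetric(q, 0.2)  # mediators
    Sigma[r + q:, r + q:] = compound_symmetric(p, 0.1)    # outcomes
    return Sigma

rng = np.random.default_rng(2021)   # arbitrary seed
Sigma = null_scenario_covariance()
Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=100)
X, M, Y = Z[:, :80], Z[:, 80:90], Z[:, 90:]
```

For scenarios 2 to 5 the block-diagonal matrix would be replaced by the implied covariance matrix computed from the nonzero α, β, and δ.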

FIGURE 2.

Scenarios for simulations. For each simulation scenario, the nonzero model parameters were fixed at common values (αij = α, βij = β, δij = δ). The parameters with nonzero values are illustrated by the directed arrows in the figure. Scenario 1 is not shown, because all αij, βij, and δij were set to 0. Scenario 2 is for a subset of x's associated with a subset of m's; scenario 3, a subset of m's associated with a subset of y's; scenario 4, a subset of x's having direct effects on a subset of y's. Scenario 5 is for x1 having both mediated and direct effects on y1 as well as mediated effects on y2, and x3 having mediated effects on y3.

2.4 |. Mayo Vascular Disease Biorepository

We wished to perform a multivariate analysis to understand the joint effects of multiple single-nucleotide polymorphisms (SNPs) and multiple risk factors on both CHD and PAD. Because CHD and PAD share common risk factors and have a similar biological basis, we needed to allow for correlation of CHD and PAD. Furthermore, we wanted to evaluate whether SNPs directly influence either CHD or PAD, or whether SNPs have indirect effects by influencing risk factors (i.e., mediators), which in turn influence CHD or PAD. Hence, our analytic approach can be viewed as multivariate mediation analysis with multiple exposures (i.e., SNPs), multiple potential mediators (i.e., risk factors such as elevated lipids and hypertension), and multiple traits (CHD and PAD).

We first identified genetic variants that could potentially have indirect effects via mediators on CHD or PAD by selecting SNPs from publicly available large-scale genome-wide association studies (GWAS), based on their statistical significance for association with potential mediators, as well as their association with CHD or PAD. We then fit penalized multivariate mediation models to an independent data set, the Mayo Vascular Disease Biorepository (Ye et al., 2013) maintained at the Mayo Clinic. These steps are described next.

2.5 |. Selection of SNPs from public SNP summary statistics

Our goal was to evaluate whether SNPs that are associated with risk factors (i.e., mediators) have an indirect effect on the traits CHD or PAD or have a direct effect on the traits. For this objective, we chose SNPs from large-scale GWAS that were associated with at least one mediator and at least one trait. The potential mediators were history of type 2 diabetes (T2D), low-density lipoprotein cholesterol (LDL), high-density lipoprotein cholesterol (HDL), triglycerides (TGL), systolic blood pressure (SBP), and smoking status (SMK). To select SNPs associated with traits or mediators, we chose SNP variants that had a p value < 1e−9. The SNP summary statistics used to select SNPs were from large published GWAS, as described in Table 1. After selecting SNPs for each mediator and trait, we subset to SNPs that were associated with at least one trait (CHD or PAD) and at least one mediator, resulting in 238 SNPs. SNPs associated with a mediator and not a trait were not included in analyses because the aim was to determine the joint effects of SNPs and mediators on traits.
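The selection rule amounts to a set intersection, as in the sketch below. The SNP identifiers and p values are made up for illustration, and the input layout (a dictionary of per-phenotype p values) is an assumption, not the format of the published summary statistics.

```python
P_THRESHOLD = 1e-9
TRAITS = {"CHD", "PAD"}
MEDIATORS = {"T2D", "LDL", "HDL", "TGL", "SBP", "SMK"}

def select_snps(pvalues):
    """pvalues maps phenotype -> {snp_id: p value}.

    Returns SNPs significant for at least one trait and at least one mediator.
    """
    def significant(phenotypes):
        hits = set()
        for ph in phenotypes:
            hits |= {snp for snp, p in pvalues.get(ph, {}).items()
                     if p < P_THRESHOLD}
        return hits
    return significant(TRAITS) & significant(MEDIATORS)

# Toy summary statistics (made-up identifiers and values).
toy = {
    "CHD": {"snpA": 1e-12, "snpB": 1e-10, "snpC": 0.2},
    "PAD": {"snpD": 5e-11},
    "LDL": {"snpA": 3e-15, "snpC": 1e-20},
    "SBP": {"snpD": 1e-9},   # not below the strict threshold
}
selected = select_snps(toy)
```

Only snpA passes both filters in this toy input: snpB and snpD are associated with a trait but no mediator, and snpC with a mediator but no trait.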

TABLE 1.

Source of summary statistics for SNP associations with vascular traits and mediators

| Phenotype | Study description | Reference |
| --- | --- | --- |
| CHD | Interim release of UK Biobank data based on a CAD phenotype inclusive of angina (SOFT phenotype) with 10,801 cases and 137,914 controls | Nelson et al. (2017) |
| PAD | PheWeb includes genome-wide associations for EHR-derived ICD billing codes from the white British participants of the UK Biobank. Phenotypes were classified into 1403 broad PheWAS codes with counts ranging from 51 to 77,977 cases and 330,366 to 408,908 controls. Summary statistics for PAD (Phecode 443) were obtained from the PheWeb website UKBIO_PAD_phenocode-443 http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/443 | Zhou et al. (2018) |
| T2D | Meta-analysis of GWAS in 62,892 T2D cases and 596,424 controls of European ancestry, by combining three GWAS data sets: DIAGRAM, GERA, and the full cohort UK Biobank (UKB). | Xue et al. (2018) |
| HDL, LDL, TGL | 188,577 European-ancestry individuals, including 94,595 individuals from 23 studies genotyped with GWAS arrays and 93,982 individuals from 37 studies genotyped with the Metabochip array. | Willer et al. (2013) |
| SBP | 1,006,863 subjects with European ancestry from UK Biobank (UKB), International Consortium of Blood Pressure Genome Wide Association Studies (ICBP), US Million Veteran Program (MVP) and the Estonian Genome Centre, University of Tartu Biobank (EGCUT) | Evangelou et al. (2018) |
| SMK | 1,232,091 subjects for binary phenotype of smoking initiation (ever smoking in lifetime). Meta-analysis of all GSCAN (GWAS [Genome Wide Association Studies] & Sequencing Consortium of Alcohol and Nicotine use) cohorts summary statistics except 23andMe summary statistics. | Liu et al. (2019) |

Abbreviations: CHD, coronary heart disease; HDL, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; PAD, peripheral artery disease; SBP, systolic blood pressure; SMK, smoking status; SNP, single-nucleotide polymorphism; T2D, type 2 diabetes; TGL, triglycerides.

2.6 |. Mayo Vascular Disease Biorepository

The Mayo Vascular Disease Biorepository (VDB) data set includes diverse measures of risk factors, disease status, GWAS SNP array data, and SNPs imputed by the HRC Reference panel (McCarthy et al., 2016). The VDB enrolled patients referred for noninvasive vascular evaluation and exercise stress testing at the Mayo Clinic from January 14, 2006 to July 24, 2020. We restricted the study cohort to genetically unrelated adult participants (≥18 years of age) with European ancestry, given the low proportion of non-European ancestry individuals, and individuals that qualified either as a case for CHD or PAD or as a control without any history of vascular disease, resulting in 4895 subjects. Atherosclerotic cardiovascular disease (ASCVD) phenotypes and related comorbidities were ascertained using previously validated electronic phenotyping algorithms based on both structured data elements (International Classification of Diseases diagnosis codes, current procedural terminology codes, laboratory measurements) and natural language processing of unstructured data elements, such as vascular laboratory and imaging reports in the electronic health record.

2.7 |. Vascular traits

PAD includes various diseases that affect noncardiac, non-intracranial arteries, of which the most common is atherosclerosis. PAD was defined as either an ankle-brachial index ≤0.9 at rest or 1-min post exercise, presence of poorly compressible arteries, or history of lower extremity revascularization. Risk factors for PAD are similar to those for other atherosclerotic vascular diseases, with smoking and diabetes mellitus the strongest (Fowkes et al., 2013).

CHD was determined by presence of myocardial infarction, coronary atherosclerosis/chronic ischemic heart disease, or history of coronary revascularization. Controls were individuals without any vascular phenotypes at the time of study enrollment.

2.8 |. Mediators

The following intermediate risk factors were available in the VDB and used as potential mediators between SNPs and either CHD or PAD:

  • HDL (log HDL).

  • LDL (log LDL): Corrected highest LDL-cholesterol before enrollment; if a participant was on a statin medication, the level was adjusted by dividing the values by the coefficient 0.7.

  • TGL (log Triglycerides).

  • T2D (yes vs. no).

  • SBP (log systolic BP): if a participant was on an antihypertensive medication, the SBP value was corrected by adding 15 mmHg.

  • SMK (ever smoker, yes/no).
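The medication corrections for LDL and SBP listed above can be sketched as two small functions. The function names are hypothetical, and two details are assumptions because the text does not state them: the correction is applied before the log transform, and the log is the natural logarithm.

```python
import math

def adjust_ldl(ldl, on_statin):
    """Divide LDL by 0.7 for statin users, then take the natural log."""
    corrected = ldl / 0.7 if on_statin else ldl
    return math.log(corrected)

def adjust_sbp(sbp, on_antihypertensive):
    """Add 15 mmHg to SBP for treated participants, then take the natural log."""
    corrected = sbp + 15.0 if on_antihypertensive else sbp
    return math.log(corrected)
```

For example, a statin user with a measured LDL of 70 is treated as having an untreated LDL of about 100 before the log transform.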

2.9 |. SNPs

Genotyping was performed using three different Illumina platforms (Illumina Human660W-Quad V1, Infinium HumanCoreExome Beadchip, and Human 610 Quad V1). Imputation of SNPs was performed by the University of Michigan Imputation Server (McCarthy et al., 2016). High quality imputed SNPs (INFO > 0.3) that matched the SNPs chosen based on prior large-scale GWAS were selected for analysis. A total of 237 out of 238 SNPs matched, and were used in our analyses.

3 |. RESULTS

3.1 |. Simulation results

Based on simulations with 80 quantitative exposure variables, 10 quantitative potential mediators, and 5 quantitative outcome variables, there were 800 possible α's, 50 possible β's, and 400 possible δ's. Results from the simulations across all five scenarios are presented in Table 2. The results are summarized according to selection of true (true nonzero) parameters and false (true zero) parameters, in terms of the average number of selected parameters, averaged over all simulations, and the average size of the parameters among those selected. The most striking limitation of our method is the large number of falsely selected α's when the sample size was 100: with 800 possible α's, the average number falsely selected ranged from 101 to 130. However, the average sizes of the selected α's were quite small, ranging from −0.007 to 0.014, and often rounded to 0.000. For larger sample sizes of 500 or 1000, the average number of falsely selected α's was often near 0. Notable exceptions occurred when other true α's had large effect sizes (e.g., scenarios 2 and 5), presumably due to the correlations among the x's and among the m's. Nonetheless, the size of the falsely selected α's was small (e.g., 0.014). In these cases, the average number of falsely selected α's decreased from approximately 3 for a sample size of 500 to approximately 0.15 for a sample size of 1000. In contrast to false selection of α's, our simulations show that true nonzero α's were always selected, and that the average sizes of the selected values were shrunken toward zero compared with the true values, as expected for the penalized model.

TABLE 2.

Simulation results for different scenarios determined by true effect sizes of α, β, and δ

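Columns 3–5 give the true effect sizes. For each parameter type, “no.” is the average number selected and “size” is the average size of the selected parameters.

| Scenario | N | α | β | δ | False α no. | False α size | True α no. | True α size | False β no. | False β size | True β no. | True β size | False δ no. | False δ size | True δ no. | True δ size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 100 | 0 | 0 | 0 | 101.60 | −0.002 | 0 | | 2.60 | 0.007 | 0 | | 0 | – | 0 | |
| 1 | 500 | 0 | 0 | 0 | 0.00 | – | 0 | | 0.01 | 0.000 | 0 | | 0.00 | – | 0 | |
| 1 | 1000 | 0 | 0 | 0 | 0.00 | – | 0 | | 0.04 | 0.000 | 0 | | 0.00 | – | 0 | |
| 2 | 100 | 0.1 | 0 | 0 | 116.78 | 0.001 | 3 | 0.017 | 2.18 | −0.002 | 0 | | 0.33 | 0.003 | 0 | |
| 2 | 500 | 0.1 | 0 | 0 | 0.01 | 0.001 | 3 | 0.000 | 0.04 | 0.000 | 0 | | 0.01 | 0.036 | 0 | |
| 2 | 1000 | 0.1 | 0 | 0 | 0.00 | – | 3 | 0.000 | 0.09 | −0.009 | 0 | | 0.00 | – | 0 | |
| 2 | 100 | 0.2 | 0 | 0 | 128.56 | 0.003 | 3 | 0.053 | 3.12 | −0.001 | 0 | | 0.41 | −0.006 | 0 | |
| 2 | 500 | 0.2 | 0 | 0 | 0.42 | 0.009 | 3 | 0.017 | 1.12 | −0.005 | 0 | | 0.01 | 0.013 | 0 | |
| 2 | 1000 | 0.2 | 0 | 0 | 0.03 | 0.006 | 3 | 0.019 | 0.78 | 0.000 | 0 | | 0.00 | – | 0 | |
| 2 | 100 | 0.5 | 0 | 0 | 111.57 | 0.004 | 3 | 0.283 | 1.84 | −0.003 | 0 | | 0.14 | 0.008 | 0 | |
| 2 | 500 | 0.5 | 0 | 0 | 2.36 | 0.014 | 3 | 0.303 | 5.56 | 0.001 | 0 | | 0.07 | 0.006 | 0 | |
| 2 | 1000 | 0.5 | 0 | 0 | 0.11 | 0.006 | 3 | 0.294 | 1.02 | 0.002 | 0 | | 0.00 | – | 0 | |
| 3 | 100 | 0 | 0.1 | 0 | 128.29 | 0.000 | 0 | | 2.87 | 0.000 | 3 | 0.007 | 0.4 | −0.018 | 0 | |
| 3 | 500 | 0 | 0.1 | 0 | 0.00 | – | 0 | | 0.02 | 0.006 | 3 | 0.002 | 0.00 | – | 0 | |
| 3 | 1000 | 0 | 0.1 | 0 | 0.00 | – | 0 | | 0.27 | 0.003 | 3 | 0.015 | 0.00 | – | 0 | |
| 3 | 100 | 0 | 0.2 | 0 | 125.85 | 0.000 | 0 | | 2.64 | 0.006 | 3 | 0.030 | 0.45 | −0.007 | 0 | |
| 3 | 500 | 0 | 0.2 | 0 | 0.04 | −0.007 | 0 | | 1.31 | 0.003 | 3 | 0.090 | 0.03 | 0.018 | 0 | |
| 3 | 1000 | 0 | 0.2 | 0 | 0.00 | – | 0 | | 1.28 | 0.004 | 3 | 0.124 | 0.00 | – | 0 | |
| 3 | 100 | 0 | 0.5 | 0 | 118.38 | 0.000 | 0 | | 2.80 | 0.005 | 3 | 0.241 | 0.54 | −0.001 | 0 | |
| 3 | 500 | 0 | 0.5 | 0 | 0.01 | 0.009 | 0 | | 0.88 | 0.005 | 3 | 0.352 | 0.02 | 0.009 | 0 | |
| 3 | 1000 | 0 | 0.5 | 0 | 0.00 | – | 0 | | 1.49 | 0.007 | 3 | 0.386 | 0.00 | – | 0 | |
| 4 | 100 | 0 | 0 | 0.1 | 123.54 | 0.000 | 0 | | 2.98 | −0.006 | 0 | | 0.25 | 0.015 | 3 | 0.000 |
| 4 | 500 | 0 | 0 | 0.1 | 0.00 | – | 0 | | 0.06 | −0.010 | 0 | | 0.00 | – | 3 | 0.000 |
| 4 | 1000 | 0 | 0 | 0.1 | 0.00 | – | 0 | | 0.11 | 0.006 | 0 | | 0.00 | – | 3 | 0.000 |
| 4 | 100 | 0 | 0 | 0.2 | 126.25 | 0.000 | 0 | | 3.21 | 0.001 | 0 | | 0.76 | 0.024 | 3 | 0.002 |
| 4 | 500 | 0 | 0 | 0.2 | 0.00 | – | 0 | | 0.41 | 0.001 | 0 | | 0.15 | 0.020 | 3 | 0.008 |
| 4 | 1000 | 0 | 0 | 0.2 | 0.00 | – | 0 | | 0.59 | 0.000 | 0 | | 0.09 | 0.012 | 3 | 0.018 |
| 4 | 100 | 0 | 0 | 0.5 | 121.40 | 0.001 | 0 | | 2.92 | 0.011 | 0 | | 1.05 | 0.043 | 3 | 0.028 |
| 4 | 500 | 0 | 0 | 0.5 | 0.15 | 0.003 | 0 | | 5.93 | 0.000 | 0 | | 2.34 | 0.014 | 3 | 0.294 |
| 4 | 1000 | 0 | 0 | 0.5 | 0.00 | – | 0 | | 1.44 | 0.003 | 0 | | 0.27 | 0.008 | 3 | 0.302 |
| 5 | 100 | 0.1 | 0.1 | 0.1 | 130.70 | 0.002 | 3 | 0.020 | 3.47 | 0.004 | 3 | 0.011 | 0.54 | 0.011 | 1 | 0.001 |
| 5 | 500 | 0.1 | 0.1 | 0.1 | 0.00 | – | 3 | 0.000 | 0.04 | 0.012 | 3 | 0.004 | 0.00 | – | 1 | 0.000 |
| 5 | 1000 | 0.1 | 0.1 | 0.1 | 0.00 | – | 3 | 0.000 | 0.27 | 0.013 | 3 | 0.015 | 0.00 | – | 1 | 0.000 |
| 5 | 100 | 0.2 | 0.2 | 0.2 | 116.55 | 0.003 | 3 | 0.052 | 2.05 | 0.006 | 3 | 0.031 | 0.51 | 0.028 | 1 | 0.001 |
| 5 | 500 | 0.2 | 0.2 | 0.2 | 0.70 | 0.012 | 3 | 0.022 | 2.73 | 0.009 | 3 | 0.112 | 0.34 | 0.015 | 1 | 0.028 |
| 5 | 1000 | 0.2 | 0.2 | 0.2 | 0.04 | 0.008 | 3 | 0.017 | 1.31 | 0.007 | 3 | 0.135 | 0.06 | 0.016 | 1 | 0.038 |
| 5 | 100 | 0.5 | 0.5 | 0.5 | 101.80 | 0.004 | 3 | 0.253 | 2.51 | 0.016 | 3 | 0.291 | 0.35 | 0.029 | 1 | 0.025 |
| 5 | 500 | 0.5 | 0.5 | 0.5 | 2.85 | 0.014 | 3 | 0.299 | 7.19 | 0.004 | 3 | 0.422 | 1.81 | 0.012 | 1 | 0.263 |
| 5 | 1000 | 0.5 | 0.5 | 0.5 | 0.15 | 0.008 | 3 | 0.287 | 2.62 | 0.007 | 3 | 0.424 | 0.58 | 0.010 | 1 | 0.273 |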

Note: When no parameters were selected in the models (Ave. no. selected = 0), Ave. size is not relevant and is indicated by “–”.

The patterns of false selection as a function of sample size that were observed for the α's were similar for the β's and δ's, but much less extreme. The average number of falsely selected β's was largest (range, 1.84–3.47) for the smaller sample size of 100 and decreased (range, 0.04–2.62) for the larger sample size of 1000. Over all scenarios, the average size of the falsely selected β's ranged from −0.01 to 0.016, emphasizing shrinkage toward zero. The true nonzero β's were always selected, and their average size tended to be larger than that of the falsely selected β's and shrunken toward zero compared with the true values. In contrast to the selection of α's and β's, the average number of falsely selected δ's was quite small (range, 0–2.34), and the true nonzero δ's were always selected with estimated effect sizes shrunken toward zero compared with the true values.

3.2 |. Results from vascular disease study

A total of 4895 subjects from the VDB data were included in our study; 2790 had CHD and 2131 had PAD. The association between CHD and PAD was weak, with an odds ratio of 1.08. Characteristics of the subjects according to CHD and PAD status are provided in Table 3. Based on the imputed SNPs from the VDB data, there were 237 that matched the 238 SNPs selected from prior GWAS summary statistics. These 237 SNPs were from 11 different chromosomes, and the range of the minor allele frequency was 0.012–0.495. Among the 237 SNPs, all were associated with CHD; 3 were associated with PAD; 45 with T2D; 21 with HDL; 76 with LDL; 7 with TGL; and 115 with SBP. None were associated with SMK after we restricted to the set of SNPs associated with either CHD or PAD.

TABLE 3.

Characteristics of patients in the Vascular Disease Biorepository

Characteristic CHD controls (N = 2105) CHD cases (N = 2790) PAD controls (N = 2764) PAD cases (N = 2131)
Sex
 F 45% 27% 34% 35%
 M 55% 73% 67% 65%

T2D
 Present 18% 32% 20% 33%

Smoking
 Ever 57% 68% 54% 77%

Age
 Median 65.00 71.00 66.00 71.00
 Range 24.00–92.00 32.00–96.00 24.00–92.00 25.00–96.00

BMI
 Median 28.10 29.05 28.60 28.70
 Range 15.80–58.10 16.00–55.60 15.80–55.00 16.00–58.10

Systolic BP
 Median 136.00 140.00 135.00 143.00
 Range 86.00–228.00 83.00–265.00 85.00–235.00 83.00–265.00

HDL
 Median 52.00 46.00 50.00 47.00
 Range 16.00–169.00 9.00–138.00 14.00–169.00 9.00–119.00

LDL
 Median 138.00 140.00 142.00 135.71
 Range 21.00–325.71 24.00–444.29 39.00–444.29 21.00–430.00

TGL
 Median 130.00 144.00 129.00 152.00
 Range 28.00–399.00 26.00–400.00 26.00–400.00 32.00–399.00

Abbreviations: BMI, body mass index; BP, blood pressure; CHD, coronary heart disease; HDL, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; PAD, peripheral artery disease; T2D, type 2 diabetes; TGL, triglycerides.

Some of the SNPs occurred in clusters with extreme linkage disequilibrium. Because our statistical methods consider the correlation among the exposure variables, that is, SNPs, extreme correlations near 1 would result in singular or near-singular matrices, making the computation of the penalized log-likelihood model unstable. For this reason, we used hierarchical clustering such that SNPs with absolute pair-wise correlation of at least 0.98 were grouped into clusters; the first principal component from each cluster was subsequently used in our penalized models. The number of SNPs and clusters per chromosome are summarized in Table 4. This process resulted in 82 “exposure” variables that were summaries of each of the 82 SNP clusters. The chromosome and position of each of the 237 SNPs, and the cluster groupings, are available in Table S1.
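The grouping step can be illustrated with a small sketch. It is not the authors' implementation: the linkage method is not stated in the text, so the sketch assumes that any two SNPs with absolute correlation at or above the threshold are joined (equivalent to single linkage cut at 0.98), and it omits the first-principal-component summary. The function names `pearson` and `cluster_by_correlation` and the toy genotypes are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_by_correlation(columns, threshold=0.98):
    """Group column indices whose |correlation| >= threshold (assumed single linkage)."""
    n = len(columns)
    parent = list(range(n))  # union-find forest over column indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(columns[i], columns[j])) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy genotype dosages (hypothetical): SNPs 0-2 are perfectly (anti)correlated.
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 1, 0, 1, 2, 0],
    [0, 0, 1, 2, 1, 0],
]
groups = cluster_by_correlation(snps)
```

Under a complete-linkage rule the clusters could be smaller, because every pair within a cluster would then have to meet the threshold.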

TABLE 4.

Number of SNPs and clusters of SNPs per chromosome that were used in the penalized models for VDB data

Chromosome No. SNPs No. clusters^a
1 16   8
2   3   2
3   5   1
4   6   2
6 56 16
7 11   7
8   4   3
9 52 17
12 25   5
15 27   8
19 32 13

Abbreviation: SNP, single-nucleotide polymorphism.

^a Clusters of SNPs were created by hierarchical clustering such that SNPs with absolute pair-wise correlation ≥ 0.98 were in the same cluster.

The six potential mediators included HDL, LDL, TGL, T2D, SBP, and SMK. Each of these mediators was regressed on sex, age, and body mass index (BMI) to create residuals that were used in analyses as adjusted mediator variables. The binary mediators (T2D, SMK) were analyzed by logistic regression and the other quantitative traits by linear regression. The two outcome variables (CHD and PAD) were each regressed on sex, age, and BMI by use of logistic regression to create residuals that were used in analyses as adjusted outcome variables.

For each of the 4895 subjects, the 82 SNP measures, the six potential mediators and the two outcomes were concatenated into vectors and these vectors were centered about their means and scaled by their standard deviations. Hence, our analyses focused on the correlations among variables. The penalized SEM was fit to the data over a grid of values for the λ penalty parameter and the model with the smallest BIC was chosen as the best fit (see Figure 3 for values of BIC). The best fitting model is illustrated in Figure 4 with directed edges for parameters that were selected. The SNP variables are labeled according to their chromosome location and SNP cluster. For example, s19.10 is for SNPs on chromosome 19 in cluster 10 (see Table S1 for details). Because the penalized models shrink the parameter estimates, we refit the selected best model by an un-penalized SEM with the sem function in the lavaan R package. The resulting unpenalized parameter estimates are provided in Table 5.
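The two preprocessing and selection steps described above, standardizing each variable and choosing the penalty with the smallest BIC over a grid, can be sketched as follows. This is only an illustration under stated assumptions: `standardize`, `select_lambda_by_bic`, and `toy_bic` are hypothetical names, the sample standard deviation is assumed for scaling, and `toy_bic` is a made-up stand-in for fitting the penalized SEM and computing its BIC.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a variable about its mean and scale by its (sample) standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def select_lambda_by_bic(lambdas, bic_for):
    """Evaluate bic_for(lambda) on a grid and return (best_lambda, best_bic)."""
    scored = [(bic_for(lam), lam) for lam in lambdas]
    best_bic, best_lam = min(scored)
    return best_lam, best_bic


def toy_bic(lam):
    """Hypothetical BIC curve; a real one would come from fitting the model at lam."""
    return 10.0 + (lam - 0.3) ** 2


z = standardize([1.0, 2.0, 4.0, 7.0])          # toy data
grid = [0.1, 0.2, 0.3, 0.4, 0.5]               # toy penalty grid
best_lam, best_bic = select_lambda_by_bic(grid, toy_bic)
```

Once every variable is standardized, the covariance matrix of the data equals the correlation matrix, which is why the analyses focus on correlations.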

FIGURE 3.


Bayes information criterion (BIC) for fit of penalized model to VDB data versus regularization penalty parameter λ

FIGURE 4.


Illustration of best fitting penalized SEM for the VDB data. Green circles represent SNP clusters, blue circles represent intermediate risk factors (LDL, HDL, TRI, SBP, SMK, and T2D), and red circles represent the traits CHD and PAD. CHD, coronary heart disease; HDL, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; PAD, peripheral artery disease; SBP, systolic blood pressure; SEM, structural equation model; SMK, smoking status; SNP, single-nucleotide polymorphism; T2D, type 2 diabetes

TABLE 5.

Un-penalized parameter estimates for the model selected by minimum BIC of penalized models for VDB data

Dependent Predictor Genes in region (see Table S1 for details of SNPs in the SNP clusters) Type of parameter Effect size SE
HDL s19.9 APOE α −0.052 0.013
LDL s1.5 CELSR2 α −0.127 0.079
LDL s1.7 CELSR2 α   0.035 0.079
LDL s19.5 LDLR α   0.068 0.014
LDL s19.10 APOE α −0.134 0.014
TGL s8.3 rs2954038 α   0.063 0.012
TGL s19.10 APOE α   0.037 0.028
TGL s19.11 APOE α   0.029 0.024
TGL s19.13 APOE α   0.031 0.020
CHD s1.3 CELSR2 δ −0.027 0.018
CHD s1.8 CELSR2 δ −0.032 0.018
CHD s6.5 PHACTR1 δ   0.042 0.014
CHD s6.11 SLC22A3, LPAL2 δ   0.047 0.014
CHD s6.13 SLC22A3, LPAL2 δ   0.058 0.014
CHD s9.12 CDKN2B-AS1 δ   0.032 0.020
CHD s9.13 CDKN2B-AS1 δ −0.028 0.020
CHD s12.5 SH2B3, ATXN2, BRAP, ACAD10, ALDH2, MAPKAPK5, ADAM1A, TMEM116, ERP29, NAA25 δ   0.054 0.014
CHD HDL β −0.105 0.014
CHD LDL β   0.026 0.014
CHD T2D β   0.078 0.014
CHD SBP β   0.025 0.014
CHD SMK β   0.057 0.014
PAD s1.4 CELSR2 δ   0.041 0.013
PAD s12.3 SH2B3, ATXN2, BRAP, ACAD10, ALDH2, MAPKAPK5, ADAM1A, TMEM116, ERP29, NAA25 δ −0.042 0.013
PAD HDL β −0.062 0.015
PAD LDL β −0.096 0.014
PAD TGL β   0.071 0.015
PAD T2D β   0.083 0.014
PAD SBP β   0.150 0.014
PAD SMK β   0.201 0.014

Note: The arrows in Figure 4 originate at the predictor variables and point to the dependent variables.

Abbreviations: BIC, Bayesian information criterion; CHD, coronary heart disease; HDL, high-density lipoprotein cholesterol; LDL, low-density lipoprotein cholesterol; PAD, peripheral artery disease; SBP, systolic blood pressure; SMK, smoking status; SNP, single-nucleotide polymorphism; T2D, type 2 diabetes.

Figure 4 and Table 5 illustrate that CHD and PAD share the common risk factors of SBP, T2D, and SMK, and that these risk factors do not appear to serve as mediators for the effects of SNPs. In contrast, LDL and HDL serve as mediators between the effects of SNPs and both CHD and PAD, while TGL serves as a mediator between SNPs and only PAD. Multiple SNPs on chromosome 19 are associated with LDL, TGL, and HDL, while additional SNPs on chromosome 1 are associated with LDL, and additional SNPs on chromosome 8 are associated with TGL. SNPs on chromosomes 1 and 12 have direct effects on PAD, and SNPs on chromosomes 1, 6, 9, and 12 have direct effects on CHD. Note that the interpretation of direct effects of SNPs on CHD and PAD is within the context of the data available, which means that they do not appear to have effects on the available risk factors, but could nonetheless have effects on unmeasured intermediate traits that subsequently influence CHD or PAD. The scaled traits CHD and PAD each had variance of 1.0 before the fit of the model, and their estimated variances in the fitted model were 0.96 for CHD and 0.88 for PAD. This implies that the selected model explains 4% of the variability of CHD and 12% of the variability of PAD.
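As a worked illustration of these statements, the sketch below multiplies selected α and β estimates from Table 5 to obtain indirect effects (in a linear SEM the indirect effect along exposure → mediator → outcome is the product βα, consistent with γ = δ + βα in Appendix A), and recovers the explained variance as one minus the fitted residual variance. Only a subset of Table 5 is copied, and the helper name `indirect_effects` is hypothetical.

```python
# Subset of unpenalized estimates copied from Table 5.
alpha = {  # (SNP cluster, mediator): effect of SNP cluster on mediator
    ("s19.9", "HDL"): -0.052,
    ("s19.5", "LDL"): 0.068,
    ("s19.10", "LDL"): -0.134,
    ("s8.3", "TGL"): 0.063,
}
beta = {  # (mediator, outcome): effect of mediator on outcome
    ("HDL", "CHD"): -0.105,
    ("LDL", "CHD"): 0.026,
    ("HDL", "PAD"): -0.062,
    ("LDL", "PAD"): -0.096,
    ("TGL", "PAD"): 0.071,
}


def indirect_effects(alpha, beta):
    """Product-of-coefficients indirect effect for every SNP -> mediator -> outcome path."""
    paths = {}
    for (snp, mediator), a in alpha.items():
        for (med, outcome), b in beta.items():
            if med == mediator:
                paths[(snp, mediator, outcome)] = a * b
    return paths


paths = indirect_effects(alpha, beta)

# Standardized outcomes have variance 1.0, so explained variance = 1 - fitted variance.
fitted_variance = {"CHD": 0.96, "PAD": 0.88}
explained = {trait: 1.0 - v for trait, v in fitted_variance.items()}
```

No path through TGL reaches CHD here because no TGL → CHD edge was selected, which matches the statement that TGL mediates SNP effects only on PAD.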

4 |. DISCUSSION

We developed an approach to fit penalized SEMs to multivariate data with the intention of discovering mediation effects of intermediate risk factors while allowing direct effects of multivariate exposures on multivariate outcomes, as well as effects of multivariate intermediate risk factors on multivariate outcomes. This type of integrative analysis provides a focused evaluation of potential causal pathways, from exposure variables to mediator variables to outcome variables, while accounting for the various types of correlations within and among the different types of variables. Our penalized models used L1 penalties on the three types of parameters, α and β for indirect effects and δ for direct effects, and balanced the penalty applied across these three types of parameters to account for the differing numbers of parameters. By searching across a grid of λ penalty values, we proposed choosing the model with the minimum BIC. Finally, to avoid over-shrinkage of parameter estimates, we proposed refitting the selected model without a penalty.

Our approach based on SEMs is beneficial when the measured variables can be partitioned into exposures, mediators, and outcome variables, and prior knowledge supports the assumption of the direction of the edges: exposure → mediator → outcome. For genetic data, it is reasonable to treat genetic data as root cause exposures, clinical phenotypes as outcomes, and intermediate variables as mediators (e.g., gene expression, methylation, etc.). In contrast, when such prior knowledge is not available, an alternative approach is to treat all variables equally and use Gaussian graphical models to “learn” the network connections. Recent developments of penalized directed acyclic Gaussian graphical models provide a general approach to select a broader class of models than SEMs (Li et al., 2020; Yuan et al., 2019). However, when the assumptions of an SEM are correct, our approach would be more powerful, more computationally efficient, and more likely to result in biologically correct models than a general graphical model. This is because the parameter space that is inconsistent with the SEM would not be evaluated, such as mediator → gene, reducing the model search space. To illustrate, when there are r exposures, q mediators, and p traits, the number of additional parameters that a Gaussian graphical model would need to evaluate, beyond those for an SEM, is q² + r² + pq + pr + qr, demonstrating a dramatic increase in the number of parameters to evaluate as the number of exposures or mediators increases.
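To give a sense of scale, the count stated above can be evaluated directly. The sketch below simply applies the formula from the text; the function name `extra_graphical_parameters` is hypothetical, and the dimensions used are those of the VDB analysis (82 SNP clusters, 6 mediators, 2 outcomes).

```python
def extra_graphical_parameters(r, q, p):
    """Additional parameters beyond the SEM, per the formula q^2 + r^2 + pq + pr + qr."""
    return q**2 + r**2 + p * q + p * r + q * r


# Dimensions of the VDB analysis: r = 82 exposures, q = 6 mediators, p = 2 outcomes.
vdb_extra = extra_graphical_parameters(r=82, q=6, p=2)

# Doubling the number of exposures shows the quadratic growth driven by the r^2 term.
doubled_extra = extra_graphical_parameters(r=164, q=6, p=2)
```

For the VDB dimensions this gives 7428 additional parameters, and doubling the number of exposures raises it to 28256, with the r² term dominating.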

Simulations were used to evaluate the properties of our methods. Simulation results showed that our method was able to accurately select true effects with few false-positive parameters when the sample size was large (at least 500 in our simulations). However, small sample sizes (e.g., 100) in the presence of a potentially large number of parameters (1250 in our simulations) can lead to a high rate of false-positive parameters, particularly the α′s that measure the association of a large number of exposure variables with potential mediators. The average sizes of the falsely selected parameters were close to zero, suggesting that use of a hard threshold to select parameters when sample size is small might be a useful approach to guard against false-positives, at the risk of removing true-positives with small effects. A potential limitation of using a multivariate normal distribution when some variables are binary or categorical is the assumption of constant residual variance. In limited simulations for a sample size of 500 with genetic markers as exposures, multivariate normal mediators, and binary outcome variables (results not shown), we found general trends to be consistent with the results shown in Table 2 for multivariate normal distributions. Future developments using a latent multivariate normal distribution to model categorical traits (e.g., probit models) may prove beneficial, as they have been used in some software for unpenalized SEMs (Hoyle, 2012).

Application of our methods to the VDB data revealed known relationships of SNPs with lipid measurements and of lipid measurements with CHD and PAD. SNPs in regions with genes that are known to be associated with HDL (APOE [Sorli et al., 2006]), LDL (CELSR2 [Rizk et al., 2015], LDLR [Aulchenko et al., 2009], APOE [Saito et al., 2004]), and TGL (APOE [Maxwell et al., 2013]) were in our selected model. However, our selected model included SNPs in the region of the gene CELSR2 that had direct effects on CHD, apparently contributing to the risk of CHD independently of the risk conveyed by LDL. This suggests that the multiple SNPs in the region of the CELSR2 gene on chromosome 1 are worthy of further study to determine the risk imposed by their indirect effects on CHD as mediated by LDL versus the direct effects of other SNPs in the same region. After accounting for the effects of LDL, HDL, TGL, and the risk factors SBP, SMK, and T2D on the risk of CHD, additional SNPs on chromosomes 6, 9, and 12 were identified to have direct effects on the risk of CHD, with details provided in Table 5. Although different SNP clusters were identified for the risk of CHD and PAD, the SNP clusters were in the regions of the same genes (CELSR2 on chromosome 1 and 10 genes on chromosome 12), suggesting that CHD and PAD share a common genetic basis. The gene CELSR2 is in the CELSR2-PSRC1-SORT1 gene cluster, and SNPs in this region, with high linkage disequilibrium with each other, have been associated with CHD, presumably due to SNPs that regulate plasma LDL cholesterol levels (Arvind et al., 2014). Yet, after adjusting for HDL, LDL, and TGL, our selected model suggested that there were additional direct effects on CHD due to SNPs in the region of the CELSR2 gene. Direct effects of SNPs on CHD were also found for those on chromosome 6 in the region of the gene PHACTR1, a gene associated with coronary artery calcification, which is a manifestation of CHD (Pechlivanis et al., 2013).
Two groups of SNPs on chromosome 6 were found to have direct effects on CHD: group 6.11 (containing rs117791490, rs117733303, and rs3798220) and group 6.13 (rs10455872). These SNPs are in the LPA gene (e.g., missense rs3798220; Ile4399Met), which encodes Lipoprotein (a), an established risk factor for CHD (Clarke et al., 2009). After we adjusted for the effects of LDL on CHD, we found a residual direct effect of this group of SNPs on CHD, suggesting that additional fine-mapping and functional studies of this genomic region are warranted. Finally, even though we included a large number of SNPs that were previously reported to be associated with T2D and SBP, neither of these risk factors was identified as a mediator between the SNPs and CHD or PAD.

A potential limitation of the VDB data is the relatively modest size of the study cohort which can limit the statistical power to detect weaker associations. Although we used previously validated electronic phenotyping algorithms to ascertain phenotypes and related risk factors, some degree of misclassification may persist. Since the risk factor measurements and phenotype ascertainment were obtained using data from the electronic health record, as opposed to being prospectively measured at enrollment, it is possible that time-related biases could exist as well as biases due to missing data and confounding. Despite the potential misclassification errors (which if random would reduce our power to detect associations), we identified a multivariate model with lipids having mediating effects as well as SNPs having direct effects in PAD and CHD, while accounting for direct effects of risk factors on both traits. Although our methods simultaneously accounted for correlations among the SNPs, and in this sense can be viewed as multivariate fine-mapping of the effects of SNPs on the traits and intermediate risk factors, it is important to emphasize that our goal was to disentangle the effects of SNPs that have been previously reported to be associated with at least one disease (PAD or CHD) and at least one risk factor. There are many more SNPs that have been reported to be associated with either PAD or CHD and not the risk factors, and vice versa, which were not included in our analyses. Overall, our joint analysis of multiple SNPs, multiple intermediate risk factors, and the bivariate outcomes of CHD and PAD provides refined targeted genetic regions worth pursuing to improve understanding of the genetic basis of CHD and PAD.

In conclusion, our new algorithm to compute penalized SEMs is computationally efficient and provides a new approach to integrate multivariate exposures, intermediate risk factors, and outcome variables. Our approach provides advantages in terms of computational feasibility and interpretation. On the other hand, our parametric approach to mediation analysis has limitations, such as not accounting for nonlinear effects or interactions. More general models and nonparametric approaches have been developed (Daniel et al., 2015; Imai, Keele, & Tingley, 2010; Imai, Keele, & Yamamoto, 2010), yet are limited to a small number of mediators. To account for a large number of mediators, an approach that linearly combines potential mediators into a smaller number of orthogonal mediators has been proposed (Chen et al., 2018), or a small number of unobserved latent factors could be treated as mediators, with the latent factors assumed to influence the high-dimensional biomarkers (Albert et al., 2016; Derkach et al., 2019; Huang & Pan, 2016). Although these strategies provide a computational approach by reducing the number of potential mediators, the resulting “mediators” are linear combinations of biomarkers, making it challenging to sort out those that directly mediate the effect of exposure on the outcome. Alternatively, for very high-dimensional mediators a Bayesian approach has been proposed, although computation time can be long due to the large number of sampling iterations required for reasonable convergence (Song et al., 2020). An advantage of our penalized SEM is that it models the observed covariance matrix, so sample size is not a computational limitation. Rather, the size of the covariance matrix determines computation speed and computer memory requirements, and this size is determined by the number of exposures, mediators, and outcome variables. As the dimension of this matrix increases, computational challenges arise.
When there are many mediators, prefiltering to reduce the number of potential mediators by using sure independence screening (Fan & Lv, 2008) can be applied. This approach is based on ranking marginal correlations and then selecting the highest ranked values such that the number of parameters is less than the sample size. Because mediation depends on the two correlations, cor (xi, mj) and cor (mj, yk), one can rank the absolute values of their products, |cor (xi, mj)cor(mj, yk)|, and choose the highest ranked values to determine which potential mediators to include in the penalized mediation models. An alternative approach would be to determine if the large covariance matrix is block-diagonal, in which case there could be opportunities in the future to capitalize on this special structure to further increase computational efficiency.
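The screening rule described above can be sketched in a few lines. This is an illustration, not a tested part of the authors' software: the names `pearson` and `screen_mediators` and the toy data are hypothetical, and scoring each mediator by its largest product over all exposure and outcome pairs is an assumption about how the ranked products map back to mediators.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_mediators(exposures, mediators, outcomes, keep):
    """Rank mediators by max |cor(x_i, m_j) * cor(m_j, y_k)| and keep the top `keep`."""
    scores = {}
    for name, m in mediators.items():
        scores[name] = max(
            abs(pearson(x, m) * pearson(m, y))
            for x in exposures.values()
            for y in outcomes.values()
        )
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy data (hypothetical): m1 tracks both x and y, m2 is close to noise.
exposures = {"x1": [1, 2, 3, 4, 5, 6]}
mediators = {
    "m1": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "m2": [1, -1, 1, -1, 1, -1],
}
outcomes = {"y1": [2.0, 2.1, 3.9, 4.2, 5.8, 6.1]}
kept, scores = screen_mediators(exposures, mediators, outcomes, keep=1)
```

In practice `keep` would be chosen so that the number of parameters in the resulting model stays below the sample size, as described above.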

Supplementary Material

Supplemental Table S1

ACKNOWLEDGMENTS

This study was supported by the U.S. Public Health Service and National Institutes of Health (contract grant number GM065450 and GM140487).

Funding information

National Institute of General Medical Sciences, Grant/Award Number: GM065450 and GM140487

APPENDIX A

The algorithm for gradient descent depends on gradients of the ln like, and a step-size multiplier (t) that moderates the updates so that the optimization function decreases. The details for sub-gradients of the penalized ln like with respect to each of the parameters, as well as methods to update parameter estimates, are provided below. The ln like depends on the implied variance matrix Σ, which in turn depends on the parameters of interest. As shown elsewhere (von Oertzen & Brick, 2014), the derivative of ln like with respect to a parameter θ that resides in Σ is

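$$\frac{\partial \ln \mathrm{like}}{\partial \theta} = \mathrm{tr}\left(\Sigma^{-1}\,\frac{\partial \Sigma}{\partial \theta}\,C\right),$$

where

$$\frac{\partial \Sigma}{\partial \theta} = \left[B\,\frac{\partial A}{\partial \theta}\,E\right]_{sym} + B\,\frac{\partial S}{\partial \theta}\,B', \qquad \text{(A1)}$$

$C = (I - \Sigma^{-1}D)$, $B = (I - A)^{-1}$, $E = BSB'$, and $X_{sym} = X + X'$. Note that $\partial A/\partial \theta$ is a sparse matrix, with entries 0 except for a single value of 1 where a parameter resides (for parameters α, β, and δ). Because the matrix S does not depend on α, β, or δ, $\partial S/\partial \theta$ drops out of the derivatives in expression (A1) for these parameters. Likewise, $\partial S/\partial \theta$ is a sparse matrix, with entries 0 except for a single value of 1 where an item from $V_y$ resides, and $\partial A/\partial \theta$ drops out of the derivatives in expression (A1) for $V_y$.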

A.1 |. Updates for αij, βij, and δij

The methods to update αij, βij, and δij are similar, so we describe the method in terms of αij. Based on the current value for αij and expression (A1), the derivative $g_{\alpha_{ij}} = \partial \ln \mathrm{like}/\partial \alpha_{ij}$ requires the matrix $A^* = \partial A/\partial \alpha_{ij}$ with a value of 1 in the position $A^*[k_i, k_j]$, and 0′s elsewhere, where $k_i$ and $k_j$ are the row and column indices of the position of αij in matrix Σ. To update αij we need to consider the direction guided by the derivative $g_{\alpha_{ij}}$, the step size t, the penalty λ, and the weight w. The update based on the soft thresholding required for the lasso L1 penalty is

$$\alpha_{ij,new} = S\left(\alpha_{ij} - t\,g_{\alpha_{ij}},\; t\lambda w\right).$$

The soft-thresholding function is S(z, λ) = sign(z) · max(|z| − λ, 0), the same kind of update used in the usual lasso estimation.
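A minimal sketch of this update follows. The function names `soft_threshold` and `lasso_update` are hypothetical, and the gradient is passed in as a plain number rather than being computed from expression (A1).

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    sign = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)
    return sign * max(abs(z) - lam, 0.0)


def lasso_update(alpha, grad, t, lam, w):
    """One proximal gradient step: move along -grad with step t, then soft-threshold by t*lam*w."""
    return soft_threshold(alpha - t * grad, t * lam * w)


# A gradient step from 0.4 to 0.9 is shrunk toward zero by t*lam*w = 0.1.
updated = lasso_update(alpha=0.4, grad=-1.0, t=0.5, lam=0.2, w=1.0)

# A small value is set exactly to zero, which is how the L1 penalty removes edges.
zeroed = soft_threshold(0.1, 0.2)
```

Values whose magnitude after the gradient step is below tλw are set exactly to zero, which is what removes the corresponding edge from the model.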

A.2 |. Updates for Vy,ij

The derivative $g_{V_{y,ij}} = \partial \ln \mathrm{like}/\partial V_{y,ij}$ requires the square matrix $S^* = \partial S/\partial V_{y,ij}$ with a value of 1 in the position $S^*[k_i, k_j]$, and 0′s elsewhere, where $k_i$ and $k_j$ are the row and column indices of the position of $V_{y,ij}$ in matrix Σ. Because $V_{y,ij}$ is not penalized, the update follows the usual gradient update: $V_{y,ij,new} = V_{y,ij} - t\,g_{V_{y,ij}}$.

A.2.1 |. Computational efficiency

Potential computational bottlenecks in the iterative algorithms are computations of the matrices $B = (I - A)^{-1}$, $\Sigma$, $\Sigma^{-1}$, and of $\log\det(\Sigma)$. Naive inverse operations would be on the order of $O((p + q + r)^3)$. However, because of the structure of these matrices, computations can be rapidly completed by closed formulas.

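A.3 |. Computation of $B = (I - A)^{-1}$

The matrix $(I - A)$ is a lower triangular matrix arranged as

$$(I - A) = L = \begin{pmatrix} I_{r \times r} & 0 & 0 \\ -\alpha_{q \times r} & I_{q \times q} & 0 \\ -\delta_{p \times r} & -\beta_{p \times q} & I_{p \times p} \end{pmatrix}$$

By noting that $LL^{-1} = I$ and using forward substitution, the solution for $B = L^{-1}$ can be shown to be

$$B = \begin{pmatrix} I & 0 & 0 \\ \alpha & I & 0 \\ \delta + \beta\alpha & \beta & I \end{pmatrix} \qquad \text{(A2)}$$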

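A.4 |. Computation of $\Sigma$ and $\Sigma^{-1}$

By noting that $\Sigma = BSB'$, that the matrix S is block diagonal with diagonal blocks $V_x$, $V_m$, and $V_y$, and by use of (A2), the following expression results

$$\Sigma = \begin{pmatrix} V_x & V_x\alpha' & V_x\gamma' \\ \alpha V_x & \alpha V_x \alpha' + V_m & \alpha V_x \gamma' + V_m \beta' \\ \gamma V_x & \gamma V_x \alpha' + \beta V_m & \gamma V_x \gamma' + \beta V_m \beta' + V_y \end{pmatrix},$$

where γ = δ + βα.

Following the approach to compute Σ, we note that $\Sigma^{-1} = (I - A)' S^{-1} (I - A)$, and with $S^{-1}$ having block-diagonal terms $V_x^{-1}$, $V_m^{-1}$, and $V_y^{-1}$, the expression for $\Sigma^{-1}$ is

$$\Sigma^{-1} = \begin{pmatrix} V_x^{-1} + \alpha' V_m^{-1} \alpha + \delta' V_y^{-1} \delta & -\alpha' V_m^{-1} + \delta' V_y^{-1} \beta & -\delta' V_y^{-1} \\ -V_m^{-1} \alpha + \beta' V_y^{-1} \delta & V_m^{-1} + \beta' V_y^{-1} \beta & -\beta' V_y^{-1} \\ -V_y^{-1} \delta & -V_y^{-1} \beta & V_y^{-1} \end{pmatrix}$$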

By reusing terms that occur multiple times and avoiding redundant computations because Σ and Σ−1 are symmetric, we are able to efficiently compute both matrices.
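The closed forms for B, Σ, and Σ⁻¹ can be checked numerically on a tiny example. The sketch below assumes one exposure, one mediator, and one outcome (r = q = p = 1), so every block is a scalar and transposes are trivial; the parameter values are arbitrary, and the helper names `matmul`, `transpose`, and `max_abs_diff` are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def max_abs_diff(X, Y):
    return max(abs(a - b) for rx, ry in zip(X, Y) for a, b in zip(rx, ry))


# Arbitrary toy values; scalar blocks (r = q = p = 1).
alpha, beta, delta = 0.5, 0.3, 0.2
vx, vm, vy = 1.0, 0.8, 0.6
gamma = delta + beta * alpha

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
L = [[1.0, 0.0, 0.0], [-alpha, 1.0, 0.0], [-delta, -beta, 1.0]]   # I - A
B = [[1.0, 0.0, 0.0], [alpha, 1.0, 0.0], [gamma, beta, 1.0]]      # closed form (A2)
S = [[vx, 0.0, 0.0], [0.0, vm, 0.0], [0.0, 0.0, vy]]

# Sigma computed directly as B S B' and by the closed-form blocks.
sigma_direct = matmul(matmul(B, S), transpose(B))
sigma_closed = [
    [vx, vx * alpha, vx * gamma],
    [alpha * vx, alpha * vx * alpha + vm, alpha * vx * gamma + vm * beta],
    [gamma * vx, gamma * vx * alpha + beta * vm,
     gamma * vx * gamma + beta * vm * beta + vy],
]

# Closed-form inverse, built from (I - A)' S^{-1} (I - A).
sigma_inv_closed = [
    [1 / vx + alpha * alpha / vm + delta * delta / vy,
     -alpha / vm + delta * beta / vy, -delta / vy],
    [-alpha / vm + beta * delta / vy, 1 / vm + beta * beta / vy, -beta / vy],
    [-delta / vy, -beta / vy, 1 / vy],
]
```

Because the blocks are scalars, this check does not exercise the placement of transposes; a fuller check would use blocks with more than one row or column.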

A.5 |. Computation of $\log\det(\Sigma)$

Naive calculation of $\log\det(\Sigma)$ for the ln like would be on the order of $O((p + q + r)^3)$ operations. However, we show how it can be rapidly updated. Note that $\Sigma = BSB'$. By Cholesky decomposition, $\Sigma = LL'$, and $\log\det(\Sigma) = 2\sum_i \log(L_{ii})$, where $L_{ii}$ are the diagonal terms of the lower triangular matrix L. Now, if we take the Cholesky decomposition of S as $S = L_S L_S'$, and because B is a lower triangular matrix, we can see that $L = BL_S$. By noting that S is block diagonal, $L_S$ is also block diagonal with diagonal blocks $L_x$, $L_m$, and $L_y$, where these matrices come from Cholesky decompositions: $V_x = L_x L_x'$, $\tilde{V}_m = L_m L_m'$, and $V_y = L_y L_y'$. Because of the structure of B (see A2), the diagonal terms of $L = BL_S$ are $L_{x,ii}$, $L_{m,ii}$, and $L_{y,ii}$. This means that

$$\log\det(\Sigma) = 2\sum_{j=1}^{r} \log(L_{x,jj}) + 2\sum_{j=1}^{q} \log(L_{m,jj}) + 2\sum_{j=1}^{p} \log(L_{y,jj}).$$

Since $V_x$ and $\tilde{V}_m$ are constant over iterations, the first two summands need to be computed only once and reused over iterations, and so the updated computations can focus only on the Cholesky decomposition of the smaller matrix $V_y$.
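The identity above can be verified on a small example: because B has a unit diagonal, the log determinant of Σ = BSB′ equals the sum of the log determinants of the diagonal blocks of S. The sketch assumes r = 2, q = 1, p = 1 with arbitrary toy values; `cholesky`, `logdet`, `matmul`, and `transpose` are hypothetical helpers written in plain Python.

```python
from math import log, sqrt


def cholesky(M):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L


def logdet(M):
    """log det(M) = 2 * sum of logs of the Cholesky diagonal."""
    L = cholesky(M)
    return 2.0 * sum(log(L[i][i]) for i in range(len(M)))


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


# Toy blocks: Vx is 2 x 2, Vm and Vy are 1 x 1.
Vx = [[1.0, 0.3], [0.3, 1.0]]
Vm = [[0.8]]
Vy = [[0.6]]
alpha = [0.5, -0.2]                      # 1 x 2
delta = [0.2, 0.1]                       # 1 x 2
beta = 0.3                               # 1 x 1
gamma = [d + beta * a for d, a in zip(delta, alpha)]

B = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [alpha[0], alpha[1], 1.0, 0.0],
    [gamma[0], gamma[1], beta, 1.0],
]
S = [
    [1.0, 0.3, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
Sigma = matmul(matmul(B, S), transpose(B))

full = logdet(Sigma)                         # Cholesky of the whole matrix
blocks = logdet(Vx) + logdet(Vm) + logdet(Vy)  # block-wise shortcut
```

In an iterative fit, only the `logdet(Vy)` term would need to be recomputed, since the other two blocks stay fixed.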

A.5.1 |. Steps of inner loop of iterative algorithm to update parameters

The steps of the inner loop to update αij are described below; similar steps are used for βij, δij.

Inner loop (begin with iter = 1, repeat until convergence):

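  • Update gradient: $g = \partial \ln \mathrm{like}/\partial \alpha_{ij}$, where $\alpha_{ij}$ is the current value of the parameter

  • Optimize step size so direction of descent with majorization scheme holds:

    • start with step t = 1.0, and iterate by shrinking step as t = 0.5 × t. The value of 0.5 is arbitrary and can be any value between 0 and 1.

    • denote updated parameter $\alpha_{ij}^*$ (see above for how parameter is updated)

    • change in parameter: $\Delta = (\alpha_{ij}^* - \alpha_{ij})$

    • update $\ln \mathrm{like}_{new}(\alpha_{ij}^*, \Phi)$, where Φ is the vector of all parameters except $\alpha_{ij}$

    • Majorize the ln like by use of quadratic approximation in place of using the Hessian matrix:
      $\ln \mathrm{like\text{-}majorize} = \ln \mathrm{like}_{old}(\alpha_{ij}, \Phi) + g'\Delta + \Delta'\Delta/(2t)$

    stop if $\ln \mathrm{like}_{new} \le \ln \mathrm{like\text{-}majorize}$; majorization is achieved

  • Check convergence, based on specified tolerance (tol)

    • $\mathrm{pen} \ln \mathrm{like}_{new} = \ln \mathrm{like}_{new}(\alpha_{ij}^*, \Phi) + P_{new}(\alpha_{ij}^*, \Phi; w, \lambda, f)$

    • converge if $|\mathrm{pen} \ln \mathrm{like}_{new} - \mathrm{pen} \ln \mathrm{like}_{old}| < tol\,(|\mathrm{pen} \ln \mathrm{like}_{old}| + 1.0)$

  • If not converged

    • iter = iter + 1

    • $\mathrm{pen} \ln \mathrm{like}_{old} = \mathrm{pen} \ln \mathrm{like}_{new}$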

The inner loop for Vy,ij follows the steps above, except that these parameters are not penalized and the change in ln like without majorization is used to determine convergence.
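The inner loop above can be sketched for a single penalized parameter. This is a toy illustration under stated assumptions, not the regmed implementation: the smooth function `loss` stands in for the ln like as a function of one parameter with all others held fixed, the penalty is taken to be λw|α|, and the names `inner_loop`, `loss`, and `grad` are hypothetical.

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    sign = 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)
    return sign * max(abs(z) - lam, 0.0)


def inner_loop(alpha, loss, grad, lam, w, tol=1e-12, shrink=0.5, max_iter=1000):
    """Proximal gradient updates with backtracking until the penalized loss stabilizes."""
    pen_old = loss(alpha) + lam * w * abs(alpha)
    for _ in range(max_iter):
        g = grad(alpha)
        t = 1.0                                   # restart the step size each iteration
        while t > 1e-12:
            alpha_new = soft_threshold(alpha - t * g, t * lam * w)
            d = alpha_new - alpha
            majorizer = loss(alpha) + g * d + d * d / (2.0 * t)
            if loss(alpha_new) <= majorizer:      # majorization achieved
                break
            t *= shrink                           # otherwise shrink the step
        pen_new = loss(alpha_new) + lam * w * abs(alpha_new)
        alpha = alpha_new
        if abs(pen_new - pen_old) < tol * (abs(pen_old) + 1.0):
            break
        pen_old = pen_new
    return alpha


# Toy smooth loss 2 * (a - 1)^2, whose lasso solution is soft_threshold(1, lam * w / 4).
def loss(a):
    return 2.0 * (a - 1.0) ** 2


def grad(a):
    return 4.0 * (a - 1.0)


estimate = inner_loop(alpha=0.0, loss=loss, grad=grad, lam=0.8, w=1.0)
removed = inner_loop(alpha=0.0, loss=loss, grad=grad, lam=10.0, w=1.0)
```

With λ = 0.8 the estimate settles near 0.8, a shrunken version of the unpenalized minimizer 1.0, and with λ = 10 the parameter stays at exactly zero.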

Footnotes

SOFTWARE

Software implementing the proposed tests for mediation for quantitative traits is available as an R package called “regmed” in the Comprehensive R Archive Network (CRAN).

SUPPORTING INFORMATION

Additional supporting information may be found in the online version of the article at the publisher’s website.

DATA AVAILABILITY STATEMENT

Genome-wide summary statistics for the intermediate risk factors and the outcome variables (CHD and PAD) that were used to select SNPs for this study are publicly available (see references in Table 1). Due to institutional review board regulations, individual level data from the Mayo Vascular Disease biorepository are not available.

REFERENCES

  1. Albert JM, Geng C, & Nelson S (2016). Causal mediation analysis with a latent mediator. Biometrical Journal, 58(3), 535–548. 10.1002/bimj.201400124 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Arvind P, Nair J, Jambunathan S, Kakkar VV, & Shanker J (2014). CELSR2-PSRC1-SORT1 gene expression and association with coronary artery disease and plasma lipid levels in an Asian Indian cohort. Journal of Cardiology, 64(5), 339–346. 10.1016/j.jjcc.2014.02.012 [DOI] [PubMed] [Google Scholar]
  3. Aulchenko YS, Ripatti S, Lindqvist I, Boomsma D, Heid IM, Pramstaller PP, Penninx BW, Janssens AC, Wilson JF, Spector T, Martin NG, Pedersen NL, Kyvik KO, Kaprio J, Hofman A, Freimer NB, Jarvelin MR, Gyllensten U, Campbell H, … ENGAGE C (2009). Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nature Genetics, 41(1), 47–55. 10.1038/ng.269 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Chen OY, Crainiceanu C, Ogburn EL, Caffo BS, Wager TD, & Lindquist MA (2018). High-dimensional multivariate mediation with application to neuroimaging data. Biostatistics, 19(2), 121–136. 10.1093/biostatistics/kxx027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Clarke R, Peden JF, Hopewell JC, Kyriakou T, Goel A, Heath SC, Parish S, Barlera S, Franzosi MG, Rust S, Bennett D, Silveira A, Malarstig A, Green FR, Lathrop M, Gigante B, Leander K, de Faire U, Seedorf U, … Procardis C (2009). Genetic variants associated with Lp(a) lipoprotein level and coronary disease. The New England Journal of Medicine, 361(26), 2518–2528. 10.1056/NEJMoa0902604 [DOI] [PubMed] [Google Scholar]
  6. Daniel RM, De Stavola BL, Cousens SN, & Vansteelandt S (2015). Causal mediation analysis with multiple mediators. Biometrics, 71(1), 1–14. 10.1111/biom.12248 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Derkach A, Pfeiffer RM, Chen TH, & Sampson JN (2019). High dimensional mediation analysis with latent variables. Biometrics, 75(3), 745–756. 10.1111/biom.13053 [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, Ntritsos G, Dimou N, Cabrera CP, Karaman I, Ng FL, Evangelou M, Witkowska K, Tzanis E, Hellwege JN, Giri A, Velez Edwards DR, Sun YV, Cho K, ..the Million Veteran Program. (2018). Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nature genetics, 50(10), 1412–1425. 10.1038/s41588-018-0205-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fan J, & Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Fowkes FG, Rudan D, Rudan I, Aboyans V, Denenberg JO, McDermott MM, Norman PE, Sampson UK, Williams LJ, Mensah GA, & Criqui MH (2013). Comparison of global estimates of prevalence and risk factors for peripheral artery disease in 2000 and 2010: A systematic review and analysis. Lancet, 382(9901), 1329–1340. 10.1016/S0140-6736(13)61249-0 [DOI] [PubMed] [Google Scholar]
  11. Friedman J, Hastie T, & Tibshirani R (2008). Sparse inverse covariance estimation with the lasso. Biostatistics, 9, 432–441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hoyle R, (Ed.). (2012). Handbook of structural equation modeling. The Guilford Press. [Google Scholar]
  13. Huang YT, & Pan WC (2016). Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediation. Biometrics, 72(2), 402–413. 10.1111/biom.12421 [DOI] [PubMed] [Google Scholar]
  14. Imai K, Keele L, & Tingley D (2010). A general approach to causal mediation analysis. Psychological Methods, 15(4), 309–334. 10.1037/a0020761 [DOI] [PubMed] [Google Scholar]
  15. Imai K, Keele L, & Yamamoto T (2010). Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science, 25, 51–71. [Google Scholar]
  16. Jacobucci R, Grimm KJ, & McArdle JJ (2016). Regularized Structural Equation Modeling. Structural Equation Modeling, 23(4), 555–566. 10.1080/10705511.2016.1154793 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Lange K (2004). Optimization. Springer. [Google Scholar]
  18. Li C (1975). Path analysis—A primer. Boxwood Press. [Google Scholar]
  19. Li C, Shen X, & Pan W (2020). Likelihood ratio tests for a large directed acyclic graph. Journal of the American Statistical Association, 115(531), 1304–1319. 10.1080/01621459.2019.1623042 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Liu M, Jiang Y, Wedow R, Li Y, Brazel DM, Chen F, Datta G, Davila-Velderrain J, McGuire D, Tian C, Zhan X 23andMe Research Team, HUNT All-In Psychiatry, Choquet H, Docherty AR, Faul JD, Foerster JR, Fritsche LG, Gabrielsen ME, Gordon SD, .. Vrieze S (2019). Association studies of up to 1.2 million individuals yield new insights into the genetic etiology of tobacco and alcohol use. Nature genetics, 51(2), 237–244. 10.1038/s41588-018-0307-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. MacKinnon D (2008). Inroduction to statistical mediation analysis. Taylor and Francis Group. [Google Scholar]
  22. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, & Sheets V (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83–104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Maxwell TJ, Ballantyne CM, Cheverud JM, Guild CS, Ndumele CE, & Boerwinkle E (2013). APOE modulates the correlation between triglycerides, cholesterol, and CHD through pleiotropy, and gene-by-gene interactions. Genetics, 195(4), 1397–1405. 10.1534/genetics.113.157719 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. McArdle J (2005). The development of the RAM rules for latent variable structural equation modeling. In Maydeu-Olivares A, & McArdle J (Eds.), Contemporary psychometrics: A festschrift for Roderick P. McDonald (pp. 225–273). Lawrence Erlbaum. [Google Scholar]
  25. McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, Luo Y, Sidore C, Kwong A, Timpson N, Koskinen S, Vrieze S, Scott LJ, Zhang H, Mahajan A, … Haplotype Reference Consortium (2016). A reference panel of 64,976 haplotypes for genotype imputation. Nature Genetics, 48(10), 1279–1283. 10.1038/ng.3643 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Nelson CP, Goel A, Butterworth AS, Kanoni S, Webb TR, Marouli E, Zeng L, Ntalla I, Lai FY, Hopewell JC, Giannakopoulou O, Jiang T, Hamby SE, Di Angelantonio E, Assimes TL, Bottinger EP, Chambers JC, Clarke R, Palmer CNA, … Deloukas P (2017). Association analyses based on false discovery rate implicate new loci for coronary artery disease. Nature Genetics, 49(9), 1385–1391. 10.1038/ng.3913 [DOI] [PubMed] [Google Scholar]
  27. Pechlivanis S, Mühleisen TW, Möhlenkamp S, Schadendorf D, Erbel R, Jöckel KH, Hoffmann P, Nöthen MM, Scherag A, Moebus S, & Heinz Nixdorf Recall Study Investigative Group (2013). Risk loci for coronary artery calcification replicated at 9p21 and 6q24 in the Heinz Nixdorf Recall Study. BMC Medical Genetics, 14, 23. 10.1186/1471-2350-14-23 [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Preacher KJ, & Hayes AF (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879–891. 10.3758/brm.40.3.879 [DOI] [PubMed] [Google Scholar]
  29. Rizk NM, El-Menyar A, Egue H, Souleman Wais I, Mohamed Baluli H, Alali K, Farag F, Younes N, & Al Suwaidi J (2015). The association between serum LDL cholesterol and genetic variation in chromosomal locus 1p13.3 among coronary artery disease patients. BioMed Research International, 2015, 678924. 10.1155/2015/678924 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Saito M, Eto M, Nitta H, Kanda Y, Shigeto M, Nakayama K, Tawaramoto K, Kawasaki F, Kamei S, Kohara K, Matsuda M, Matsuki M, & Kaku K (2004). Effect of apolipoprotein E4 allele on plasma LDL cholesterol response to diet therapy in type 2 diabetic patients. Diabetes Care, 27(6), 1276–1280. 10.2337/diacare.27.6.1276 [DOI] [PubMed] [Google Scholar]
  31. Schaid DJ, & Sinnwell JP (2020). Penalized models for analysis of multiple mediators. Genetic Epidemiology, 44(5), 408–424. 10.1002/gepi.22296 [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Serang S, Jacobucci R, Brimhall KC, & Grimm KJ (2017). Exploratory mediation analysis via regularization. Structural Equation Modeling: A Multidisciplinary Journal, 24(5), 733–744. 10.1080/10705511.2017.1311775 [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Song Y, Zhou X, Zhang M, Zhao W, Liu Y, Kardia S, Roux A, Needham BL, Smith JA, & Mukherjee B (2020). Bayesian shrinkage estimation of high dimensional causal mediation effects in omics studies. Biometrics, 76(3), 700–710. 10.1111/biom.13189 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Sorli JV, Corella D, Frances F, Ramirez JB, Gonzalez JI, Guillen M, & Portoles O (2006). The effect of the APOE polymorphism on HDL-C concentrations depends on the cholesterol ester transfer protein gene variation in a Southern European population. Clinica Chimica Acta, 366(1–2), 196–203. 10.1016/j.cca.2005.10.001 [DOI] [PubMed] [Google Scholar]
  35. VanderWeele T (2015). Explanation in causal inference. Oxford University Press. [Google Scholar]
  36. VanderWeele TJ (2016). Mediation analysis: A practitioner's guide. Annual Review of Public Health, 37, 17–32. 10.1146/annurev-publhealth-032315-021402 [DOI] [PubMed] [Google Scholar]
  37. VanderWeele TJ, & Vansteelandt S (2014). Mediation analysis with multiple mediators. Epidemiologic Methods, 2(1), 95–115. 10.1515/em-2012-0010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. von Oertzen T, & Brick T (2014). Efficient Hessian computation using sparse matrix derivatives in RAM notation. Behavior Research Methods, 46(2), 385–395. [DOI] [PubMed] [Google Scholar]
  39. Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, Ganna A, Chen J, Buchkovich ML, Mora S, Beckmann JS, Bragg-Gresham JL, Chang H-Y, Demirkan A, Den Hertog HM, Do R, Donnelly LA, Ehret GB, Esko T, … Abecasis GR (2013). Discovery and refinement of loci associated with lipid levels. Nature Genetics, 45(11), 1274–1283. 10.1038/ng.2797 [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Wright S (1921). Correlation and causation. Journal of Agricultural Research, 20, 557–585. [Google Scholar]
  41. Wright S (1923). The theory of path coefficients: A reply to Niles's criticism. Genetics, 8, 239–255. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Xue A, Wu Y, Zhu Z, Zhang F, Kemper KE, Zheng Z, Yengo L, Lloyd-Jones LR, Sidorenko J, Wu Y, eQTLGen Consortium, McRae AF, Visscher PM, Zeng J, & Yang J (2018). Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nature Communications, 9(1), 2941. 10.1038/s41467-018-04951-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ye Z, Kalloo FS, Dalenberg AK, & Kullo IJ (2013). An electronic medical record-linked biorepository to identify novel biomarkers for atherosclerotic cardiovascular disease. Global Cardiology Science & Practice, 2013(1), 82–90. 10.5339/gcsp.2013.10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Yuan Y, Shen X, Pan W, & Wang Z (2019). Constrained likelihood for reconstructing a directed acyclic Gaussian graph. Biometrika, 106(1), 109–125. 10.1093/biomet/asy057 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Zellner A (1962). An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 57(298), 348–368. [Google Scholar]
  46. Zhou W, Nielsen JB, Fritsche LG, Dey R, Gabrielsen ME, Wolford BN, LeFaive J, VandeHaar P, Gagliano SA, Gifford A, Bastarache LA, Wei W-Q, Denny JC, Lin M, Hveem K, Kang HM, Abecasis GR, Willer CJ, & Lee S (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature Genetics, 50(9), 1335–1341. 10.1038/s41588-018-0184-y [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplemental Table S1

Data Availability Statement

Genome-wide summary statistics for the intermediate risk factors and the outcome variables (CHD and PAD) that were used to select SNPs for this study are publicly available (see references in Table 1). Due to institutional review board regulations, individual-level data from the Mayo Vascular Disease biorepository are not available.
