Abstract
Motivation
The biological pathways linking exposures and disease risk are often poorly understood. To gain insight into these pathways, studies may try to identify biomarkers that mediate the exposure/disease relationship. Such studies often simultaneously test hundreds or thousands of biomarkers.
Results
We consider a set of m biomarkers and a corresponding set of null hypotheses, where the jth null hypothesis states that biomarker j does not mediate the exposure/disease relationship. We propose a Multiple Comparison Procedure (MCP) that rejects a set of null hypotheses or, equivalently, identifies a set of mediators, while asymptotically controlling the Family-Wise Error Rate (FWER) or False Discovery Rate (FDR). We use simulations to show that, compared to currently available methods, our proposed method has higher statistical power to detect true mediators. We then apply our method to a breast cancer study and identify nine metabolites that may mediate the known relationship between an increased BMI and an increased risk of breast cancer.
Availability and implementation
R package MultiMed on https://github.com/SiminaB/MultiMed.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Mediation analysis can be used to study how an exposure, E, affects a disease, Y (Baron and Kenny, 1986; MacKinnon, 2008; Ten Have and Joffe, 2012). In the simplest scenario, there is only a single putative mediator, M. In this scenario, mediation analysis tests whether M is a true mediator and, if so, decomposes the total effect of E on Y into a direct and indirect (i.e. via M) effect (Pearl, 2012; Robins and Greenland, 1992; VanderWeele and Vansteelandt, 2014). In other scenarios, there may be more than one possible mediator (Daniel et al., 2015; Nguyen et al., 2015; Taguri et al., 2015). We consider the scenario where a large number of biomarkers may potentially mediate an exposure/disease association (Boca et al., 2014; Chen et al., 2017; Huang and Pan, 2016) and we introduce procedures for selecting a subset of those biomarkers to be designated as probable mediators. The key is that the proposed procedures, developed for replicability analyses (Bogomolov and Heller, 2018), can asymptotically control the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR). Our motivation is an 836-person case/control study of ER+ breast cancer where the goal is to identify the subset of the 478 measured metabolites that are likely to be mediators of the well-known association between higher BMI and an increased risk of breast cancer (van den Brandt et al., 2000).
Although mediation analysis initially focused on scenarios with a single mediator, mediation analysis has recently been extended for scenarios with a small number of mediators. In this setting, the possible causal paths (e.g. E → M1 → M2 → Y) can be fully enumerated, the various indirect effects can be well defined using the language of causal inference, and the assumptions needed to obtain unbiased estimates can be formulated (Daniel et al., 2015; Taguri et al., 2015; VanderWeele and Vansteelandt, 2014). The next step is to extend mediation analysis to scenarios where there is a large number of mediators, such as when the potential mediator is a high-dimensional vector of voxels in an fMRI image (Chen et al., 2017; Zhao and Luo, 2016), serum metabolite levels (Boca et al., 2014), gene expression levels (Huang and Pan, 2016) or methylation levels (Zhang et al., 2016). Towards this aim, methods have been designed to test whether a set of biomarkers, considered together, mediate an exposure/outcome association (Huang and Pan, 2016), to identify the Direction of Mediation or the linear combination of biomarkers that best captures the mediating effect (Chen et al., 2017), and to model the relationship between exposure, biomarkers and outcome (Zhang et al., 2016). Here, our objective is to add to this growing body of literature by introducing a new multiple testing procedure that identifies probable mediators, while controlling for false positive findings.
Our paper proceeds as follows. In Section 2, we start by describing our newly proposed procedures for identifying probable mediators and competing procedures. We continue by describing the simulations used to compare the procedures and then finish by describing the motivating breast cancer case/control study. In Section 3, we assess the performance of these procedures on the simulated datasets and identify possible mediators in our motivating study. In Section 4, we summarize our findings, describe the novelty of our method in the context of the current literature, and explain why our method can only identify ‘probable’ mediators.
2 Materials and methods
2.1 Definitions
Let us consider n individuals. For individual i, let $E_i$ be the exposure, $Y_i$ be the outcome, and $\vec{M}_i = (M_{i1}, \ldots, M_{im})$ be a vector of m potential mediators. For this paper, our potential mediators will always be biomarkers and we will use the terms interchangeably. We will say that biomarker j is a mediator if $E_i$ is associated with $M_{ij}$ and, conditional on $E_i$, $Y_i$ is associated with $M_{ij}$. To formalize this statement, we define two null hypotheses, where $\perp$ denotes independence:
(1) $H_{0j}^{(1)}: E_i \perp M_{ij}$
(2) $H_{0j}^{(2)}: Y_i \perp M_{ij} \mid E_i$
We therefore say that biomarker j is a mediator if and only if the two null hypotheses, $H_{0j}^{(1)}$ and $H_{0j}^{(2)}$, are false. We note that, in our primary discussion, we are not considering the stricter null hypothesis that outcome and biomarker j are independent conditional on the exposure and the set of all other biomarkers, as defined by
(3) $H_{0j}^{(3)}: Y_i \perp M_{ij} \mid E_i, \vec{M}_{i(-j)}$
where $\vec{M}_{i(-j)} = (M_{i1}, \ldots, M_{i,j-1}, M_{i,j+1}, \ldots, M_{im})$.
Let the combined data for individual i be denoted by $\vec{D}_i = (E_i, \vec{M}_i, Y_i)$, and let the complete dataset be denoted by the matrix $D = (\vec{D}_1, \ldots, \vec{D}_n)^T$. The arrows (e.g. $\vec{D}_i$) indicate the corresponding variable is a vector. Furthermore, let $\mathcal{M} = \{1, \ldots, m\}$ and let Ω be the possible subsets of $\mathcal{M}$. Then we define a Multiple Comparison Procedure (MCP) to be a function, from the space of possible datasets to Ω, that inputs the data and outputs the set of biomarkers that are likely to be mediators.
Let ω1 be the set of m11 biomarkers that are mediators and ω0 be the set of biomarkers that are not mediators. For a set $\omega \in \Omega$, we let $|\omega|$ be the number of elements in ω and we let $|\omega \cap \omega_0|$ be the number of elements in $\omega \cap \omega_0$. We next define the Family-Wise Error Rate (FWER) of an MCP to be $P(V \ge 1)$ and the False-Discovery Rate (FDR) to be $E[V / \max(R, 1)]$, where the expectation is over D and we have used the abbreviations $V = |\mathrm{MCP}(D) \cap \omega_0|$ and $R = |\mathrm{MCP}(D)|$.
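As an illustration of these two error rates, the following minimal Python sketch estimates them empirically from a collection of simulated datasets. The function names and the toy selections are hypothetical and are not part of the MultiMed package; the FWER is estimated as the share of datasets with at least one false selection and the FDR as the average share of selected biomarkers that are not true mediators.

```python
def false_discovery_proportion(selected, true_mediators):
    """Share of selected biomarkers that are not true mediators (0 if none selected)."""
    selected = set(selected)
    false_positives = len(selected - set(true_mediators))
    return false_positives / max(len(selected), 1)


def empirical_error_rates(selections, true_mediators):
    """Average over simulated datasets to estimate (FWER, FDR)."""
    truth = set(true_mediators)
    n_sim = len(selections)
    # FWER: proportion of datasets in which at least one non-mediator is selected.
    fwer = sum(1 for s in selections if set(s) - truth) / n_sim
    # FDR: mean false discovery proportion across datasets.
    fdr = sum(false_discovery_proportion(s, truth) for s in selections) / n_sim
    return fwer, fdr


# Hypothetical output of an MCP on three simulated datasets;
# biomarkers 0 and 1 are the true mediators.
selections = [{0, 1}, {0, 2}, set()]
fwer, fdr = empirical_error_rates(selections, {0, 1})
```

Only the second toy dataset contains a false selection, so it alone contributes to both estimates.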
2.2 Models and assumptions
We will first assume that the biomarkers and outcome are continuous variables that can be expressed as
(4) |
(5) |
where ϵMij and are random error terms with j and . Equation 5 further implies
(6) |
Assuming Equations 4 and 6 are true, the two null hypotheses can be restated as
(7) |
(8) |
For each biomarker, we can test the two hypotheses by first fitting Equations 4 and 6 using linear regression to estimate βj and γj and their standard errors. We can next calculate their Wald test statistics, $Z_{1j}$ and $Z_{2j}$, each being an estimate divided by its standard error. We can then calculate the corresponding P-values, $P_{1j}$ and $P_{2j}$, assuming, if appropriate, that the test statistics follow a t-distribution with the appropriate degrees of freedom or, more generally, that the asymptotic normal approximation holds, $P_{1j} = 2\{1 - \Phi(|Z_{1j}|)\}$ and $P_{2j} = 2\{1 - \Phi(|Z_{2j}|)\}$, where Φ is the cumulative normal distribution. We note that the estimated parameters from linear regression are consistent estimates for βj and γj even if Mij and Yi are not normally distributed (Lumley et al., 2002). Furthermore, we note that stating that the marginal relationship between a single biomarker and the outcome can be described by Equation 6 does not preclude a more complex relationship where multiple correlated biomarkers affect the outcome, as shown in our simulations.
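The two Wald tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MultiMed implementation: it assumes a single continuous exposure with no additional covariates, uses the normal approximation with two-sided P-values, and obtains the biomarker/outcome coefficient by residualizing both variables on the exposure (the Frisch-Waugh-Lovell identity). All function names and the simulated data are hypothetical.

```python
import math
import random


def two_sided_normal_p(z):
    """Two-sided P-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def _slope_and_residuals(x, y):
    """Least-squares slope of y on x (with intercept), residuals, and centered sum of squares of x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]
    sxx = sum(v * v for v in xc)
    slope = sum(a * b for a, b in zip(xc, yc)) / sxx
    resid = [b - slope * a for a, b in zip(xc, yc)]
    return slope, resid, sxx


def mediation_p_values(e, m, y):
    """Return (P for exposure/biomarker, P for biomarker/outcome given exposure)."""
    n = len(e)
    # Exposure/biomarker model: M = b0 + b * E + error.
    b, resid_m, see = _slope_and_residuals(e, m)
    se_b = math.sqrt(sum(r * r for r in resid_m) / (n - 2) / see)
    # Biomarker/outcome model: Y = g0 + gE * E + g * M + error.
    # The coefficient g equals the slope of (Y residualized on E)
    # regressed on (M residualized on E).
    _, resid_y, _ = _slope_and_residuals(e, y)
    smm = sum(r * r for r in resid_m)
    g = sum(rm * ry for rm, ry in zip(resid_m, resid_y)) / smm
    rss = sum((ry - g * rm) ** 2 for rm, ry in zip(resid_m, resid_y))
    se_g = math.sqrt(rss / (n - 3) / smm)
    return two_sided_normal_p(b / se_b), two_sided_normal_p(g / se_g)


# Hypothetical data in which the biomarker is a true mediator.
rng = random.Random(1)
e = [rng.gauss(0, 1) for _ in range(200)]
m = [0.8 * ei + rng.gauss(0, 1) for ei in e]
y = [0.5 * ei + 0.8 * mi + rng.gauss(0, 1) for ei, mi in zip(e, m)]
p_exposure_biomarker, p_biomarker_outcome = mediation_p_values(e, m, y)
```

With covariate adjustment or a binary outcome, a regression library would replace the closed-form slopes, but the resulting pair of P-values plays the same role in the procedures of Section 2.3.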
We will also relax the assumptions and allow the outcome to be a binary variable. We will assume that a probit model holds, where and
(9) |
which implies
(10) |
We now let $\hat{\gamma}_j$ be the estimate from fitting model 10. In this scenario, the two null hypotheses can, again, be restated by Equations 7 and 8. In fact, the requirement that the probit model holds is unnecessary, and we only require that the for biomarkers satisfying assumption . Then, the only change for a prospectively collected binary outcome is that we would obtain the estimate and its standard error by probit regression. For retrospective sampling (i.e. case/control studies), we must also perform weighted regressions to estimate βEj, where a sample’s weight is inversely proportional to the probability of being sampled. In practice, epidemiologists will often choose to use logistic regression instead of probit regression for estimating the conditional association between biomarker and outcome. We have chosen to present our theoretical results using the probit link since the multi-variable model (i.e. Equation 9) and the marginal model (i.e. Equation 10) are consistent with each other in probit regression and implies the null hypothesis of Equation 2. However, in practice, and as further discussed in the Supplementary Material, we have found that the MCPs still perform well when using the logit link.
With the above assumptions, by a combination of theoretical and empirical results, we are able to show that FWER and FDR for our proposed procedures are maintained in practical settings.
2.3 Multiple comparison procedures
We first describe two existing MCPs that are designed to achieve a specified FWER: MCPB and MCPP, where the subscripts ‘B’ and ‘P’ abbreviate ‘Bonferroni’ and ‘Permutation’, respectively. We then introduce three new MCPs, MCPS, $\mathrm{MCP}_S^{WY}$ and $\mathrm{MCP}_S^{MV}$, that are designed to achieve, asymptotically, a specified FWER, where the subscript ‘S’ abbreviates ‘Subset’ and the superscripts ‘WY’ and ‘MV’ abbreviate ‘Westfall-Young’ and ‘Multivariate’, respectively. Finally, we introduce an MCP, MCPD, that is designed to achieve, asymptotically, a specified FDR in realistic scenarios, where the ‘D’ abbreviates ‘false Discovery rate’.
2.3.1 MCP—Bonferroni
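We claim biomarker j to be a mediator if $P_{1j} \le \alpha/m$ and $P_{2j} \le \alpha/m$, where $P_{1j}$ and $P_{2j}$ are the P-values for the exposure/biomarker and biomarker/outcome associations and α is the targeted FWER. We further define a Bonferroni-adjusted P-value, $P_{Bj} = \min\{1, m \max(P_{1j}, P_{2j})\}$, and restate the definition as $P_{Bj} \le \alpha$.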
2.3.2 MCP—permutation
We claim biomarker j to be a mediator if $P_{Pj} \le \alpha$, where α is a constant and PPj is the P-value calculated by our prior permutation approach (Boca et al., 2014). Briefly, in this approach, we focus on the product $\hat{\rho}_{EM_j}\hat{\rho}_{M_jY|E}$, where $\hat{\rho}_{EM_j}$ is the Pearson correlation between E and Mj, and $\hat{\rho}_{M_jY|E}$ is the Pearson correlation between Mj and Y given E. We then use permutations to estimate the distribution of the maximum absolute product across biomarkers under the hypothesis that there is no mediator and define PPj to be the probability of observing a value larger than the observed absolute product for biomarker j under this distribution.
2.3.3 MCP—subset
For the Bonferroni procedure, MCPB, each P-value must meet the strict threshold of $\alpha/m$. Here, we suggest a different method, described in Figure 1 and based on work by Bogomolov and Heller (2018), that restricts the testing of each hypothesis to a subset of biomarkers and therefore requires dividing α by a number smaller than m. We let $t_1$ be a threshold (i.e. scalar value) for a significant exposure/mediator relationship, and define $\mathcal{S}_1 = \{j : P_{1j} \le t_1\}$ and $S_1 = |\mathcal{S}_1|$, where $|\cdot|$ is the cardinality of a set and $P_{1j}$ is the P-value for the exposure/biomarker association. Similarly, let $t_2$ be a threshold for a significant mediator/outcome association, with P-value $P_{2j}$, and define $\mathcal{S}_2 = \{j : P_{2j} \le t_2\}$ and $S_2 = |\mathcal{S}_2|$. We then claim biomarker j to be a mediator if $P_{1j} \le \min\{t_1, \alpha/(2S_2)\}$ and $P_{2j} \le \min\{t_2, \alpha/(2S_1)\}$. We further define a subset-adjusted P-value, $P_{Sj} = \min\{1, \max(2S_2P_{1j}, 2S_1P_{2j})\}$ if $P_{1j} \le t_1$ and $P_{2j} \le t_2$, and $P_{Sj} = 1$ otherwise. In practice, we set $t_1 = t_2 = \alpha/2$ because, in order to be discovered by this procedure, a biomarker must satisfy $P_{1j} \le \alpha/2$ and $P_{2j} \le \alpha/2$. We stress that our use of a Bonferroni-type threshold on a set of prescreened metabolites is only valid because the P-values used for screening (e.g. P-values assessing exposure/metabolite association) are effectively independent of the P-values used in the second step (e.g. P-values assessing the metabolite/outcome association). Furthermore, we gain statistical power, compared to traditional Bonferroni approaches, by eliminating tests that are irrelevant (e.g. testing for exposure/biomarker associations for biomarkers not associated with the outcome).
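The contrast between the Bonferroni and subset procedures can be sketched in Python. This is a minimal illustration and not the MultiMed code: it assumes the thresholds of Bogomolov and Heller (2018) with equal weights, i.e. screening at t1 = t2 = α/2 and second-step thresholds of α divided by twice the size of the opposite screened set. The function names and the ten pairs of P-values are hypothetical.

```python
def bonferroni_adjusted_p(p1, p2):
    """Bonferroni-adjusted mediation P-values: m * max(P1j, P2j), capped at 1."""
    m = len(p1)
    return [min(1.0, m * max(a, b)) for a, b in zip(p1, p2)]


def subset_adjusted_p(p1, p2, alpha=0.05, t1=None, t2=None):
    """Subset-adjusted P-values with screening thresholds t1 and t2 (default alpha / 2)."""
    t1 = alpha / 2 if t1 is None else t1
    t2 = alpha / 2 if t2 is None else t2
    s1 = sum(1 for p in p1 if p <= t1)  # biomarkers passing the exposure screen
    s2 = sum(1 for p in p2 if p <= t2)  # biomarkers passing the outcome screen
    adjusted = []
    for a, b in zip(p1, p2):
        if a <= t1 and b <= t2:
            # Each P-value is multiplied by twice the size of the *other* screened set.
            adjusted.append(min(1.0, max(2 * s2 * a, 2 * s1 * b)))
        else:
            adjusted.append(1.0)
    return adjusted


# Hypothetical P-values for m = 10 biomarkers.
p1 = [0.0001, 0.001, 0.5, 0.0002, 0.3, 0.6, 0.8, 0.9, 0.45, 0.7]
p2 = [0.0002, 0.4, 0.001, 0.006, 0.7, 0.2, 0.35, 0.55, 0.65, 0.95]
alpha = 0.05
selected_bonferroni = [j for j, p in enumerate(bonferroni_adjusted_p(p1, p2)) if p <= alpha]
selected_subset = [j for j, p in enumerate(subset_adjusted_p(p1, p2, alpha)) if p <= alpha]
```

In this toy example three biomarkers pass each screen, so the second-step multiplier is 6 rather than m = 10, and the biomarker at index 3 is selected by the subset procedure but not by Bonferroni.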
2.3.4 MCP—subset, Westfall-Young
In the previously defined version of MCPS, the P-values must meet Bonferroni-type thresholds, $\alpha/(2S_2)$ and $\alpha/(2S_1)$, in the second step. However, we note that when the $S_2$ metabolites in $\mathcal{S}_2$ are highly correlated, the threshold $\alpha/(2S_2)$ would be conservative. A similar statement applies to $\mathcal{S}_1$ and $\alpha/(2S_1)$. To construct an alternative threshold, we borrow ideas from Westfall and Young (1993). We estimate the null distribution of the minimum P-value by a permutation procedure, and then replace $S_2$ by an approximate number of independent biomarkers derived from a quantile of this distribution. We similarly define a replacement for $S_1$ and denote the resulting procedure by $\mathrm{MCP}_S^{WY}$.
2.3.5 MCP—subset, multivariate
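For MCPS and $\mathrm{MCP}_S^{WY}$, we aim to detect mediators as defined by $H_{0j}^{(1)}$ (no exposure/biomarker association) and $H_{0j}^{(2)}$ (no biomarker/outcome association given the exposure). However, we might also consider replacing $H_{0j}^{(2)}$ by $H_{0j}^{(3)}$ (no biomarker/outcome association given the exposure and all other biomarkers). Unfortunately, we have no procedure that offers theoretical guarantees under the null $H_{0j}^{(1)}$ and $H_{0j}^{(3)}$. Instead, we offer an ad-hoc procedure that performs well in practice. Instead of modeling each biomarker/outcome association marginally, we use stepwise regression. Specifically, we define $P_{2j}^{MV}$ as the P-value for biomarker j when it is added to the multivariable biomarkers/outcome model (i.e. if biomarker j is the third biomarker added, then $P_{2j}^{MV}$ is the P-value for biomarker j when there are three biomarkers in the model). We then define the screened set and its cardinality using $P_{2j}^{MV}$ in place of the marginal P-value $P_{2j}$.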
2.3.6 MCP—subset, other modifications
Here, it is worth commenting on two other possible modifications to MCPS, although neither will be further discussed in this paper. First, we could claim biomarker j to be a mediator if $P_{1j} \le c\alpha/S_2$ and $P_{2j} \le (1-c)\alpha/S_1$, where c is any value in (0, 1). For example, letting c < 0.5 could be advantageous if exposure/mediator associations were far stronger and would be easily detectable at more stringent thresholds. Second, instead of prespecifying t1 and t2, we could choose thresholds so that the selected sets coincide with the rejected hypotheses in the second step.
2.3.7 MCP—FDR (MCPD)
This procedure, based on work by Bogomolov and Heller (2018), builds on the adjusted P-values of MCPS. For a given dataset, we calculate our subset-adjusted P-values as in Section 2.3.3. We then claim biomarker j to be a mediator if
(11) $P_{Dj} = \min_{k:\, r_k \ge r_j} \{P_{Sk}/r_k\} \le \alpha$
where PDj is the FDR-adjusted P-value and $r_j$ is the rank of PSj for biomarker j. Note that .
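A minimal Python sketch of this step-up adjustment follows. It reflects one reading of the procedure rather than the MultiMed implementation: each subset-adjusted P-value is divided by its rank, and a running minimum taken from the largest rank downward keeps the adjusted values monotone. The function name is hypothetical, and the input is assumed to contain only biomarkers that passed both screening steps. The example uses the six smallest pS values reported (rounded) in Table 3.

```python
def fdr_adjusted_p(p_subset):
    """Step-up adjustment: min over ranks k >= rank(j) of P_S(k) / k."""
    order = sorted(range(len(p_subset)), key=lambda j: p_subset[j])
    adjusted = [0.0] * len(p_subset)
    running_min = float("inf")
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(len(order), 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, p_subset[j] / rank)
        adjusted[j] = running_min
    return adjusted


# Six smallest subset-adjusted P-values from Table 3 (rounded as published).
table3_ps = [0.014, 0.015, 0.06, 0.083, 0.11, 0.16]
table3_pd = fdr_adjusted_p(table3_ps)
```

Because the inputs are rounded, the results agree with the pD column of Table 3 only up to rounding. The first two values coincide because the running minimum carries the rank-2 value down to rank 1.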
2.4 Theoretical properties
We show that the asymptotic FWER for MCPS is less than or equal to α (see Appendix for details):
Theorem 1: For , if A1 holds and follow Equations 4 and either 6 or 10, then . where Assumption A1 is defined by
Assumption A1: If then .
We note that A1 is satisfied by many parametric models. Moreover, in the Appendix, we show that the asymptotic FDR of MCPD is less than or equal to α:
Theorem 2: For , if holds, .
We note, without proof, that a similar claim will also hold if blocks of putative mediators are conditionally independent and we let the number of blocks go to infinity.
In Section 3.1 of the Results and in the Supplementary Material, we show in simulations that the FWER and FDR are controlled at the nominal level for finite n and dependent mediators.
2.5 Simulations
We compare the performance of the MCPs in variations of the following simulated study. We consider a study with 500 individuals and m biomarkers, where . In the first set of simulations, we let , Mij follow Equation 4 with , and
(12) |
with where and were chosen so . We chose to fix the marginal variance at 1 because we believe that it reflects real datasets, where biomarkers and outcome are normalized. We let m00 be the number of biomarkers with , m10 be the number with , m01 be the number with and m11 be the number with (i.e. m11 is the number of true mediators). We vary the chosen values , allowing m10 to be large to reflect our motivating example where many biomarkers were likely associated with the exposure. For the primary analysis, all biomarkers are independent conditional on E. For the secondary analyses, with the exception of true mediators, blocks of biomarkers were correlated. We let be block diagonal, with blocks of size 5 (m = 110) or 20 (m = 1010), and let the off-diagonal elements be either 0, 0.5 or 0.9. If the correlation would result in var(Y) exceeding 1, we reduced . Biomarkers associated with E or Y were maximally spread across blocks with the rule that no block contained biomarkers associated with both E and Y. This restriction prevents the creation of ‘potential mediators’, where hypotheses 1 and 2 are both false, that would not qualify as ‘true mediators’ (see Section 4 for details). In a second set of simulations, we consider a binary outcome, , with if , 0 otherwise. For each scenario, defined by outcome type, m and , we run 1000 simulations. In null simulations, we let and calculate the FWER as the proportion of simulations where our MCP selects at least one biomarker at . In non-null simulations, we calculate power as the average proportion of the 10 true mediators that are selected by our MCP set to and we calculate the observed FDR when the FDR threshold is set to 0.2. Note, with 1000 simulations, the standard error of our estimated FWER should be no larger than .
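The Monte Carlo precision of an estimated FWER follows from the binomial standard error. The short Python sketch below (the function name is hypothetical) evaluates it for 1000 simulations at the targeted rate of 0.05 and at the worst case of 0.5, which bounds the standard error for any true rate.

```python
import math


def proportion_standard_error(p, n_sim):
    """Standard error of a proportion estimated from n_sim independent simulations."""
    return math.sqrt(p * (1.0 - p) / n_sim)


se_at_target = proportion_standard_error(0.05, 1000)   # true FWER equal to the 0.05 target
se_worst_case = proportion_standard_error(0.5, 1000)   # upper bound over all possible rates
```

Under these assumptions the standard error is roughly 0.007 at the target and cannot exceed roughly 0.016.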
2.6 Breast cancer study
This study, nested inside the Prostate, Lung, Colorectal and Ovarian Cancer Screening Study (PLCO), includes 418 estrogen-receptor positive (ER+) breast cancer cases and 418 controls matched on study entry age (±2 years), date of blood collection (±3 months) and hormone therapy use at baseline (Moore et al., 2017). Non-fasting serum samples were collected at the first follow-up visit, one-year post-baseline. Serum metabolites (<1 Kilodalton molecular weight) were measured by Metabolon Inc. using liquid chromatography-tandem mass-spectrometry. Of the 1057 serum metabolites measured, 478 were identified and present in at least 90% of the population. Metabolite peaks were normalized by dividing by batch median and then log transformed. All models were adjusted for age at serum collection, race, hormone use, age of menarche, parity, age of menopause, smoking and diabetes status. For purposes of sample weighting, the prevalence of ER+ breast cancer was 0.016.
3 Results
3.1 Simulations
The simulations demonstrate that the newly proposed MCPs have good operating characteristics. First, under the null scenarios with m11 = 0, most MCPs achieved their targeted FWER of 0.05. These results are summarized in Table 1 for conditionally independent biomarkers and in Supplementary Tables for blocks of dependent biomarkers. The exception is that the FWER for the permutation approach could be as high as 0.07, an undesirable consequence of not having theoretical guarantees on the error rate. In general, the FWER was smallest for MCPB, which relied on Bonferroni correction for determining significance. Second, under all scenarios, MCPD achieved its targeted FDR (Supplementary Tables).
Table 1.
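m00 | m10 | m01 | m11 | Continuous outcome: MCPB | Continuous outcome: MCPP | Continuous outcome: MCPS | Continuous outcome: $\mathrm{MCP}_S^{MV}$ | Binary outcome: MCPB | Binary outcome: MCPP | Binary outcome: MCPS | Binary outcome: $\mathrm{MCP}_S^{MV}$ |
---|---|---|---|---|---|---|---|---|---|---|---|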
110 | 0 | 0 | 0 | 0.00 | 0.03 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 |
95 | 15 | 0 | 0 | 0.00 | 0.04 | 0.01 | 0.01 | 0.00 | 0.03 | 0.01 | 0.01 |
70 | 40 | 0 | 0 | 0.01 | 0.07 | 0.03 | 0.03 | 0.02 | 0.06 | 0.02 | 0.02 |
95 | 0 | 15 | 0 | 0.00 | 0.04 | 0.02 | 0.03 | 0.00 | 0.07 | 0.01 | 0.02 |
80 | 15 | 15 | 0 | 0.01 | 0.05 | 0.04 | 0.04 | 0.01 | 0.08 | 0.05 | 0.08 |
55 | 40 | 15 | 0 | 0.02 | 0.04 | 0.05 | 0.04 | 0.01 | 0.02 | 0.03 | 0.02 |
1010 | 0 | 0 | 0 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 |
995 | 15 | 0 | 0 | 0.00 | 0.04 | 0.00 | 0.01 | 0.00 | 0.05 | 0.01 | 0.01 |
700 | 310 | 0 | 0 | 0.00 | 0.06 | 0.01 | 0.01 | 0.00 | 0.04 | 0.00 | 0.00 |
995 | 0 | 15 | 0 | 0.00 | 0.04 | 0.00 | 0.01 | 0.00 | 0.05 | 0.00 | 0.00 |
980 | 15 | 15 | 0 | 0.00 | 0.03 | 0.01 | 0.01 | 0.00 | 0.03 | 0.01 | 0.03 |
685 | 310 | 15 | 0 | 0.02 | 0.05 | 0.04 | 0.05 | 0.00 | 0.07 | 0.00 | 0.06 |
Note: The first four columns show the number (m00) of biomarkers associated with neither exposure nor outcome, the number (m10) associated with only the exposure, the number (m01) associated with only the outcome and the number (m11) associated with both exposure and outcome. The remaining columns show the FWER, defined to be the mean proportion of simulations with at least one biomarker identified as a mediator, when . Details of the simulation can be found in Section 2.
The newly proposed MCPs tended to have higher power for detecting true mediators. These results are summarized in Table 2 for conditionally independent biomarkers and in Supplementary Tables for blocks of dependent biomarkers. In the majority of scenarios, among all univariate approaches, MCPS and $\mathrm{MCP}_S^{WY}$ slightly outperformed MCPP despite having lower FWER in the null simulations, but their relative performance depended on the exact scenario. The MCPP test can select biomarkers with a single strong association (i.e. the m10 and m01 biomarkers with one true association) and biomarkers with two non-significant, but still modest, associations (i.e. the m00 biomarkers with ). Note, we omit $\mathrm{MCP}_S^{WY}$ from Tables 1 and 2, where biomarkers are independent, because it yields identical results to MCPS. Furthermore, we found that the multivariate approach resulted in higher power, as compared to the univariate approaches, with $\mathrm{MCP}_S^{MV}$ having the highest power in all scenarios. We emphasize that there is no equivalent to the multivariate approach, and no means for obtaining the corresponding increase in power, using a permutation-based method. As expected, $\mathrm{MCP}_S^{WY}$ only increased power, compared to MCPS, when there was significant correlation (e.g. 0.9) among biomarkers and the total number of biomarkers was large (e.g. 1010) (Supplementary Material). Given its limited benefit and its failure to strictly control FWER in two simulations with higher correlation (Supplementary Tables S3 and S7), we do not recommend using $\mathrm{MCP}_S^{WY}$ in practice. In general, results were similar for the binary and continuous outcome.
Table 2.
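m00 | m10 | m01 | m11 | Continuous outcome: MCPB | Continuous outcome: MCPP | Continuous outcome: MCPS | Continuous outcome: $\mathrm{MCP}_S^{MV}$ | Binary outcome: MCPB | Binary outcome: MCPP | Binary outcome: MCPS | Binary outcome: $\mathrm{MCP}_S^{MV}$ |
---|---|---|---|---|---|---|---|---|---|---|---|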
100 | 0 | 0 | 10 | 0.54 | 0.68 | 0.72 | 0.81 | 0.30 | 0.49 | 0.49 | 0.58 |
85 | 15 | 0 | 10 | 0.54 | 0.61 | 0.68 | 0.80 | 0.28 | 0.37 | 0.40 | 0.49 |
60 | 40 | 0 | 10 | 0.55 | 0.54 | 0.64 | 0.79 | 0.28 | 0.30 | 0.34 | 0.43 |
85 | 0 | 15 | 10 | 0.54 | 0.58 | 0.68 | 0.77 | 0.28 | 0.42 | 0.44 | 0.65 |
70 | 15 | 15 | 10 | 0.54 | 0.54 | 0.64 | 0.77 | 0.25 | 0.33 | 0.36 | 0.66 |
45 | 40 | 15 | 10 | 0.55 | 0.50 | 0.60 | 0.76 | 0.29 | 0.30 | 0.33 | 0.66 |
1000 | 0 | 0 | 10 | 0.28 | 0.69 | 0.61 | 0.77 | 0.11 | 0.47 | 0.34 | 0.44 |
985 | 15 | 0 | 10 | 0.28 | 0.60 | 0.58 | 0.74 | 0.10 | 0.37 | 0.31 | 0.41 |
690 | 310 | 0 | 10 | 0.27 | 0.33 | 0.45 | 0.66 | 0.10 | 0.15 | 0.18 | 0.23 |
985 | 0 | 15 | 10 | 0.27 | 0.57 | 0.57 | 0.77 | 0.11 | 0.43 | 0.35 | 0.65 |
970 | 15 | 15 | 10 | 0.26 | 0.53 | 0.54 | 0.75 | 0.11 | 0.36 | 0.32 | 0.63 |
675 | 310 | 15 | 10 | 0.27 | 0.33 | 0.44 | 0.76 | 0.08 | 0.12 | 0.16 | 0.42 |
Note: The first four columns show the number (m00) of biomarkers associated with neither exposure nor outcome, the number (m10) associated with only the exposure, the number (m01) associated with only the outcome and the number (m11) associated with both exposure and outcome. The remaining columns show the power, defined to be the mean proportion of true mediators identified, when . Details of the simulation can be found in Section 2.
In an attempt to mimic our breast cancer study, we simulated data with a large number of exposure/metabolite associations. When m = 1010 and m10 = 310, we found that the benefit of our newly proposed MCPS approach, as compared to the Bonferroni approach, was less pronounced. Intuitively, this decline occurs because when S1 is large, the biomarker/outcome P-value will have to achieve near Bonferroni-level significance for the biomarker to qualify as a mediator.
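The size of this effect can be gauged with simple arithmetic. The Python sketch below compares per-test thresholds, assuming the second-step threshold α/(2·S1) of Bogomolov and Heller (2018) and, purely for illustration, taking S1 to be equal to the number of exposure-associated biomarkers (15 or 310); in practice S1 is data dependent. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-test threshold when correcting for all m biomarkers."""
    return alpha / m


def subset_threshold(alpha, n_screened):
    """Second-step threshold when n_screened biomarkers pass the opposite screen."""
    return alpha / (2 * max(n_screened, 1))


alpha, m = 0.05, 1010
# How many times more lenient the subset threshold is than Bonferroni.
gain_few_associations = subset_threshold(alpha, 15) / bonferroni_threshold(alpha, m)
gain_many_associations = subset_threshold(alpha, 310) / bonferroni_threshold(alpha, m)
```

With 15 screened biomarkers the subset threshold is about 34 times the Bonferroni threshold, but with 310 it is less than twice as large, which is consistent with the smaller power gain described above.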
3.2 Breast cancer study
The 478 metabolites were strongly associated with both BMI and breast cancer status, with 218 of the BMI/metabolite associations having a P-value below 0.05 and 103 of the breast cancer/metabolite associations having a P-value below 0.05. We found 24 metabolites, listed in Table 3, which were potential mediators connecting BMI and breast cancer risk (FDR < 0.2). Of those 24, only two, 16a-hydroxy DHEA 3-sulfate and 3-methylglutarylcarnitine-1, were significant at FWER = 0.05. We note that the P-values from our new methods were lower than the P-values produced by alternative methods. However, as seen in the simulations with a large number of exposure/biomarker associations, the P-values from these methods were not dramatically smaller.
Table 3.
Name | pB | pP | pS | $p_S^{MV}$ | pD |
---|---|---|---|---|---|
16a-Hydroxy DHEA 3-sulfate | 0.021 | 0.055 | 0.014 | 0.008 | 0.0075 |
3-Methylglutarylcarnitine | 0.046 | 0.29 | 0.015 | 0.018 | 0.0075 |
4A3B17B disulfate | 0.31 | 0.67 | 0.06 | 0.094 | 0.02 |
Allo-isoleucine | 0.12 | 0.2 | 0.083 | 0.056 | 0.021 |
4A3B17B monosulfate | 0.59 | 0.61 | 0.11 | 0.14 | 0.023 |
Urate | 0.24 | 0.084 | 0.16 | 0.086 | 0.027 |
3-Methyl-2-oxobutyrate | 0.6 | 0.22 | 0.41 | 0.23 | 0.054 |
4A3B17B disulfate | 1 | 0.99 | 0.43 | 0.38 | 0.054 |
Gamma-glutamylvaline | 0.84 | 0.56 | 0.57 | 0.32 | 0.063 |
Alpha-hydroxyisovalerate | 1 | 1 | 0.73 | 0.6 | 0.073 |
2-Methylbutyrylcarnitine | 1 | 1 | 1 | 0.52 | 0.096 |
21-Hydroxypregnenolone disulfate | 1 | 1 | 1 | 0.92 | 0.1 |
7-Methylguanine | 1 | 0.98 | 1 | 0.66 | 0.1 |
Histidine | 1 | 1 | 1 | 0.7 | 0.1 |
N-acetylalanine | 1 | 0.86 | 1 | 0.68 | 0.1 |
Lactate | 1 | 1 | 1 | 1 | 0.12 |
Succinylcarnitine | 1 | 1 | 1 | 1 | 0.12 |
4A3B17B monosulfate | 1 | 1 | 1 | 1 | 0.13 |
Alpha-tocopherol | 1 | 0.79 | 1 | 1 | 0.15 |
Octanoylcarnitine | 1 | 1 | 1 | 1 | 0.16 |
Dihomo-linolenate | 1 | 1 | 1 | 1 | 0.17 |
Decanoylcarnitine | 1 | 1 | 1 | 1 | 0.18 |
Euricoyl sphingomyelin | 1 | 1 | 1 | 1 | 0.18 |
N1-methylguanosine | 1 | 0.25 | 1 | 1 | 0.18 |
4 Discussion
We introduced a new method for testing multiple putative mediators. This computationally efficient method can maintain specified family-wise error rates (FWER) and false discovery rates (FDR), and should be very useful in modern studies evaluating high dimensional biomarkers. We then applied this new method to a study evaluating the mechanistic relationship between increased BMI and an increased risk of breast cancer.
We note that MCPS and $\mathrm{MCP}_S^{WY}$ test each biomarker individually. Therefore, we can only use these methods to claim that marginally, when considered in isolation, each selected biomarker has the defining characteristics of a mediator. We claim neither that the exposure nor that the outcome is associated with the selected biomarker, conditional on all other biomarkers. We aim only to reject the two marginal null hypotheses of Equations 1 and 2. Hence, the markers selected by these procedures may not all be true biological mediators. Consider the following example. Let and . Our MCP is designed to select M2, but M2 is not a true biological mediator. For this reason, we opted to call our selected biomarkers ‘probable mediators’ and not ‘true mediators’. Given this limitation, when using either MCPS or $\mathrm{MCP}_S^{WY}$, we suggest a second step, following variable selection, that builds a graphical model containing the exposure, outcome and selected variables. A second option is to use $\mathrm{MCP}_S^{MV}$, which identifies biomarkers marginally associated with the exposure and, to some extent, conditionally associated with the outcome. The caveat is that the association is only conditional on those biomarkers that were included in the stepwise regression and this method does not carry theoretical guarantees.
The newly proposed MCP is an important contribution to the current literature on multivariate mediation analysis. First, the new methods improve upon our previous permutation approach in four ways. The new MCP is more powerful, provides theoretical guarantees on FWER, requires less computational time and can easily be extended to a multivariable analysis. Moreover, this new MCP provides a means for controlling FDR, in addition to FWER. Second, this MCP complements those procedures that fit mediation models where the majority of putative mediators are presumed to be true mediators. Our MCP can be considered a preprocessing step to model fitting. Third, this paper brings the mathematical theory developed for the field of replicability to mediation analysis. The proofs guaranteeing asymptotic FWER and FDR control extend the theory to mediation analysis where the two P-values for each biomarker (exposure/biomarker and biomarker/outcome) are calculated from a common dataset.
The theory developed here builds upon the theory developed by Bogomolov and Heller (2018) for demonstrating replicability. In their work, the pair of P-values summarize the association between biomarker and outcome (e.g. SNP and disease) in two distinct study populations. Then, showing that their MCP selects biomarker j would be equivalent to stating that the biomarker/outcome association is replicable (i.e. the association is significant in both datasets). The common feature in their application and ours is that the two sets of P-values can be considered independent. This requirement limits further extensions, preventing, for example, its use in cases where the P-values are for two correlated traits in a common population.
In our breast cancer study, we identified 16a-hydroxy DHEA 3-sulfate and 3-methylglutarylcarnitine-1 as potential mediators of the association between BMI and ER+ breast cancer. 16a-hydroxy DHEA 3-sulfate is the 16a-hydroxylated metabolite of DHEA and has been found in laboratory studies to be estrogenic and to be capable of binding and activating the estrogen receptor (ER). However, it has not been previously linked with breast cancer risk. 3-Methylglutarylcarnitine-1 is a marker indicative of incomplete degradation of leucine. Specifically, when the 3-hydroxy-3-methylglutaryl-coenzyme A lyase enzyme, which catalyzes the final step in leucine catabolism, is insufficiently active, 3-methylglutarylcarnitine-1 accumulates in the blood. For this reason, 3-methylglutarylcarnitine-1 is sometimes used in clinical settings to diagnose errors in leucine metabolism. No prior studies have examined this metabolite in relation to breast cancer risk. These findings point toward potentially new metabolic pathways that may link a high BMI with breast cancer risk.
Funding
Ruth Heller acknowledges support by Israel Science Foundation (grant no. 0603616831).
Conflict of Interest: none declared.
References
- Baron R.M., Kenny D.A. (1986) The moderator–mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol., 51, 1173–1182.
- Boca S.M. et al. (2014) Testing multiple biological mediators simultaneously. Bioinformatics, 30, 214–220.
- Bogomolov M., Heller R. (2018) Assessing replicability of findings across two studies of multiple features. Biometrika.
- Chen O.Y. et al. (2017) High-dimensional multivariate mediation: with application to neuroimaging data. Biostatistics, doi:10.1093/biostatistics/kxx027.
- Daniel R.M. et al. (2015) Causal mediation analysis with multiple mediators. Biometrics, 71, 1–14.
- Huang Y.-T., Pan W.-C. (2016) Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics, 72, 402–413.
- Lumley T. et al. (2002) The importance of the normality assumption in large public health data sets. Annu. Rev. Public Health, 23, 151–169.
- MacKinnon D.P. (2008) Introduction to Statistical Mediation Analysis. Erlbaum Psych Press.
- Moore S. et al. (2017) A metabolomics analysis of body mass index and postmenopausal breast cancer risk. JNCI, 110, 1–10.
- Nguyen Q.C. et al. (2015) Practical guidance for conducting mediation analysis with multiple mediators using inverse odds ratio weighting. Am. J. Epidemiol., 181, 349–356.
- Pearl J. (2012) The causal mediation formula: a guide to the assessment of pathways and mechanisms. Prevent. Sci., 13, 426–436.
- Robins J.M., Greenland S. (1992) Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3, 143–155.
- Taguri M. et al. (2015) Causal mediation analysis with multiple causally non-ordered mediators. Stat. Methods Med. Res., 27, 3–19.
- Ten Have T.R., Joffe M.M. (2012) A review of causal estimation of effects in mediation analyses. Stat. Methods Med. Res., 21, 77–107.
- van den Brandt P.A. et al. (2000) Pooled analysis of prospective cohort studies on height, weight, and breast cancer risk. Am. J. Epidemiol., 152, 514.
- VanderWeele T., Vansteelandt S. (2014) Mediation analysis with multiple mediators. Epidemiol. Methods, 2, 95–115.
- Westfall P.H., Young S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Vol. 279. Wiley-Interscience.
- Zhang H. et al. (2016) Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics, 32, 3150–3154.
- Zhao Y., Luo X. (2016) Pathway lasso: estimate and select sparse mediation pathways with high dimensional mediators. https://arxiv.org/abs/1603.07749.