Bioinformatics. 2018 Feb 6;34(14):2418–2424. doi: 10.1093/bioinformatics/bty064

FWER and FDR control when testing multiple mediators

Joshua N Sampson 1, Simina M Boca 2, Steven C Moore 1, Ruth Heller 3
Editor: Jonathan Wren
PMCID: PMC6355102  PMID: 29420693

Abstract

Motivation

The biological pathways linking exposures and disease risk are often poorly understood. To gain insight into these pathways, studies may try to identify biomarkers that mediate the exposure/disease relationship. Such studies often simultaneously test hundreds or thousands of biomarkers.

Results

We consider a set of m biomarkers and a corresponding set of null hypotheses, where the jth null hypothesis states that biomarker j does not mediate the exposure/disease relationship. We propose a Multiple Comparison Procedure (MCP) that rejects a set of null hypotheses or, equivalently, identifies a set of mediators, while asymptotically controlling the Family-Wise Error Rate (FWER) or False Discovery Rate (FDR). We use simulations to show that, compared to currently available methods, our proposed method has higher statistical power to detect true mediators. We then apply our method to a breast cancer study and identify nine metabolites that may mediate the known relationship between an increased BMI and an increased risk of breast cancer.

Availability and implementation

R package MultiMed on https://github.com/SiminaB/MultiMed.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Mediation analysis can be used to study how an exposure, E, affects a disease, Y (Baron and Kenny, 1986; MacKinnon, 2008; Ten Have and Joffe, 2012). In the simplest scenario, there is only a single putative mediator, M. In this scenario, mediation analysis tests whether M is a true mediator and, if so, decomposes the total effect of E on Y into a direct and indirect (i.e. via M) effect (Pearl, 2012; Robins and Greenland, 1992; VanderWeele and Vansteelandt, 2014). In other scenarios, there may be more than one possible mediator (Daniel et al., 2015; Nguyen et al., 2015; Taguri et al., 2015). We consider the scenario where a large number of biomarkers may potentially mediate an exposure/disease association (Boca et al., 2014; Chen et al., 2017; Huang and Pan, 2016) and we introduce procedures for selecting a subset of those biomarkers to be designated as probable mediators. The key is that the proposed procedures, developed for replicability analyses (Bogomolov and Heller, 2018), can asymptotically control the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR). Our motivation is an 836-person case/control study of ER+ breast cancer where the goal is to identify the subset of the 478 measured metabolites that are likely to be mediators of the well-known association between higher BMI and an increased risk of breast cancer (van den Brandt et al., 2000).

Although mediation analysis initially focused on scenarios with a single mediator, it has recently been extended to scenarios with a small number of mediators. In this setting, the possible causal paths (e.g. E → M1 → M2 → Y) can be fully enumerated, the various indirect effects can be well defined using the language of causal inference, and the assumptions needed to obtain unbiased estimates can be formulated (Daniel et al., 2015; Taguri et al., 2015; VanderWeele and Vansteelandt, 2014). The next step is to extend mediation analysis to scenarios where there is a large number of mediators, such as when the potential mediator is a high-dimensional vector of voxels in an fMRI image (Chen et al., 2017; Zhao and Luo, 2016), serum metabolite levels (Boca et al., 2014), gene expression levels (Huang and Pan, 2016) or methylation levels (Zhang et al., 2016). Towards this aim, methods have been designed to test whether a set of biomarkers, considered together, mediate an exposure/outcome association (Huang and Pan, 2016), to identify the Direction of Mediation or the linear combination of biomarkers that best captures the mediating effect (Chen et al., 2017), and to model the relationship between exposure, biomarkers and outcome (Zhang et al., 2016). Here, our objective is to add to this growing body of literature by introducing a new multiple testing procedure that identifies probable mediators, while controlling for false positive findings.

Our paper proceeds as follows. In Section 2, we start by describing our newly proposed procedures for identifying probable mediators and competing procedures. We continue by describing the simulations used to compare the procedures and then finish by describing the motivating breast cancer study. In Section 3, we assess the performance of these procedures on the simulated datasets and identify possible mediators in our motivating study. In Section 4, we summarize our findings, describe the novelty of our method in the context of the current literature, and explain why our method can only identify ‘probable’ mediators.

2 Materials and methods

2.1 Definitions

Let us consider n individuals. For individual i, let $E_i$ be the exposure, $Y_i$ be the outcome, and $M_{i\cdot} = \{M_{i1}, \ldots, M_{im}\}^T$ be a vector of m potential mediators. For this paper, our potential mediators will always be biomarkers and we will use the terms interchangeably. We will say that a biomarker j is a mediator if $E_i$ is associated with $M_{ij}$ and, conditional on $E_i$, $Y_i$ is associated with $M_{ij}$. To formalize this statement, we define two null hypotheses

(A) $H_{01j}: E_i \perp M_{ij}$ (1)
(B) $H_{02j}: Y_i \perp M_{ij} \mid E_i$ (2)

We therefore say that biomarker j is a mediator if and only if the two null hypotheses, $H_{01j}$ and $H_{02j}$, are false. We note that, in our primary discussion, we are not considering the stricter null hypothesis that outcome and biomarker j are independent conditional on the exposure and the set of all other biomarkers, as defined by

(B*) $H_{02j}^*: Y_i \perp M_{ij} \mid E_i, M_{i(-j)}$ (3)

where $M_{i(-j)} = \{M_{i1}, \ldots, M_{i(j-1)}, M_{i(j+1)}, \ldots, M_{im}\}$.

Let the combined data for individual i be denoted by $\vec{D}_i = [E_i, Y_i, M_{i1}, \ldots, M_{im}]^T$, and let the complete dataset be denoted by the $n \times (m+2)$ matrix $\vec{D} = [\vec{D}_1, \ldots, \vec{D}_n]^T$. The arrows (e.g. $\vec{D}$) indicate that the corresponding variable is a vector. Furthermore, let $\omega^* = \{1, \ldots, m\}$ and let $\Omega$ be the set of $2^m$ possible subsets of $\omega^*$. Then we define a Multiple Comparison Procedure (MCP) to be a function, from $\mathbb{R}^{n(m+2)}$ to $\Omega$, that inputs the data and outputs the set of biomarkers that are likely to be mediators.

Let $\omega_1$ be the set of $m_{11}$ biomarkers that are mediators and $\omega_0$ be the set of $m_0 = m - m_{11}$ biomarkers that are not mediators. For a set $\omega \in \Omega$, we let $C(\omega)$ be the number of elements in $\omega$ and we let $V(\omega)$ be the number of elements in $\omega \cap \omega_0$. We next define the Family-Wise Error Rate (FWER) of an MCP to be $E[1(V > 0)]$ and the False-Discovery Rate (FDR) to be $E[V/\max(C, 1)]$, where the expectation is over $\vec{D}$ and we have used the abbreviations $C = C(\mathrm{MCP}(\vec{D}))$ and $V = V(\mathrm{MCP}(\vec{D}))$.
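As an illustration of these definitions, the following Python sketch computes V and C for one selected set and then averages 1(V > 0) and V/max(C, 1) over repeated datasets, which is how an empirical FWER and FDR could be estimated in a simulation. The function names and the toy selections are hypothetical and are not taken from the MultiMed package.

```python
def false_and_total(selected, non_mediators):
    """Return (V, C): false selections and total selections for one dataset."""
    selected = set(selected)
    return len(selected & set(non_mediators)), len(selected)


def empirical_fwer_fdr(selections, non_mediators):
    """Average 1(V > 0) and V / max(C, 1) over repeated datasets."""
    fwer_terms, fdr_terms = [], []
    for selected in selections:
        v, c = false_and_total(selected, non_mediators)
        fwer_terms.append(1.0 if v > 0 else 0.0)
        fdr_terms.append(v / max(c, 1))
    n = len(selections)
    return sum(fwer_terms) / n, sum(fdr_terms) / n


# Hypothetical example: biomarkers 1..6, of which only 1 and 2 are mediators.
omega_0 = {3, 4, 5, 6}
runs = [{1, 2}, {1, 3}, set(), {1, 2, 4, 5}]
fwer, fdr = empirical_fwer_fdr(runs, omega_0)
```

In this toy example two of the four runs contain at least one false selection, and the max(C, 1) term keeps the empty selection from producing a division by zero.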

2.2 Models and assumptions

We will first assume that the biomarkers and outcome are continuous variables that can be expressed as

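$M_{ij} = \beta_{0j} + \beta_j E_i + \epsilon_{Mij}$ (4)
$Y_i = \gamma_0^* + \gamma_E^* E_i + \sum_j \gamma_j^* M_{ij} + \epsilon_{Yi}^*$ (5)

where $\epsilon_{Mij}$ and $\epsilon_{Yi}^*$ are random error terms with $\epsilon_{Mij} \perp E_i$ and $\epsilon_{Yi}^* \perp \{M_{i\cdot}, E_i\}$. Equation 5 further implies

$Y_i = \gamma_0 + \gamma_E E_i + \gamma_j M_{ij} + \epsilon_{Yij}$ (6)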

Assuming Equations 4 and 6 are true, the two null hypotheses can be restated as

(A) $H_{01j}: \beta_j = 0$ (7)
(B) $H_{02j}: \gamma_j = 0$ (8)

For each biomarker, we can test the two hypotheses by first fitting Equations 4 and 6 using linear regression to estimate $\hat{\beta}_j$ and $\hat{\gamma}_j$ and their standard errors $\hat{\sigma}_{\beta_j}$ and $\hat{\sigma}_{\gamma_j}$. We can next calculate their Wald test statistics, $Z_{1j} = \sqrt{n}\hat{\beta}_j/\hat{\sigma}_{\beta_j}$ and $Z_{2j} = \sqrt{n}\hat{\gamma}_j/\hat{\sigma}_{\gamma_j}$. We can then calculate the corresponding P-values assuming, if appropriate, that the test statistics follow a t-distribution with the appropriate degrees of freedom or, more generally, that the asymptotic normal approximation holds, $P_{1j} = 2\Phi(-|Z_{1j}|)$ and $P_{2j} = 2\Phi(-|Z_{2j}|)$, where $\Phi(\cdot)$ is the cumulative normal distribution. We note that the estimated parameters from linear regression are consistent estimates for $\beta_j$ and $\gamma_j$ even if $M_{ij}$ and $Y_i$ are not normally distributed (Lumley et al., 2002). Furthermore, we note that stating that the marginal relationship between a single biomarker and the outcome can be described by Equation 6 does not preclude a more complex relationship where multiple correlated biomarkers affect the outcome, as shown in our simulations.
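The two tests for a single biomarker can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the MultiMed implementation: each Wald statistic is formed as the estimate divided by its usual least-squares standard error (so no separate $\sqrt{n}$ factor appears), the normal approximation is used for the P-values, and the coefficient of $M_{ij}$ in Equation 6 is obtained by residualizing on $E_i$ (the Frisch-Waugh-Lovell identity). All function names and the simulated data are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Normal-approximation P-value 2 * Phi(-|z|), written via erfc."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def slope_and_residuals(x, y):
    """Least-squares slope of centered y on centered x, plus residuals."""
    sxx = sum(v * v for v in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    return slope, [b - slope * a for a, b in zip(x, y)]


def mediation_p_values(e, m, y):
    """Return (P1, P2) for one biomarker from Equations 4 and 6."""
    n = len(e)
    e_c, m_c, y_c = center(e), center(m), center(y)
    # Equation 4: regress M on E (the intercept is handled by centering).
    beta, m_res = slope_and_residuals(e_c, m_c)
    see = sum(v * v for v in e_c)
    se_beta = math.sqrt(sum(r * r for r in m_res) / (n - 2) / see)
    # Equation 6: the coefficient of M given E equals the slope of the
    # E-residualized Y on the E-residualized M.
    _, y_res = slope_and_residuals(e_c, y_c)
    gamma, final_res = slope_and_residuals(m_res, y_res)
    smm = sum(r * r for r in m_res)
    se_gamma = math.sqrt(sum(r * r for r in final_res) / (n - 3) / smm)
    return two_sided_p(beta / se_beta), two_sided_p(gamma / se_gamma)


# Hypothetical data: one biomarker that is a true mediator.
rng = random.Random(0)
e = [rng.gauss(0, 1) for _ in range(500)]
m = [0.5 * ei + rng.gauss(0, 1) for ei in e]
y = [0.3 * ei + 0.5 * mi + rng.gauss(0, 1) for ei, mi in zip(e, m)]
p1, p2 = mediation_p_values(e, m, y)
```

Applying `mediation_p_values` to each of the m biomarkers in turn would give the vectors of P-values that the multiple comparison procedures take as input.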

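We will also relax the assumptions and allow the outcome to be a binary variable. We will assume that a probit model holds, where $Y_i = 1(\tilde{Y}_i > 0)$ for a latent continuous variable $\tilde{Y}_i$ and

$\tilde{Y}_i = \gamma_0^* + \gamma_E^* E_i + \sum_j \gamma_j^* M_{ij} + \epsilon_{Yi}^*$ (9)

which implies

$\tilde{Y}_i = \gamma_0 + \gamma_E E_i + \gamma_j M_{ij} + \epsilon_{Yij}$ (10)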

We now let $\hat{\gamma}_j$ be the estimate from fitting model 10. In this scenario, the two null hypotheses can, again, be restated by Equations 7 and 8. In fact, the requirement that the probit model holds is unnecessary, and we only require that $E[\hat{\gamma}_j] = 0$ for biomarkers satisfying $H_{02j}$. Then, the only change for a prospectively collected binary outcome is that we would obtain $\hat{\gamma}_j$ and $P_{2j}$ by probit regression. For retrospective sampling (i.e. case/control studies), we must also perform weighted regressions to estimate $\beta_j$, where a sample’s weight is inversely proportional to its probability of being sampled. In practice, epidemiologists will often choose to use logistic regression instead of probit regression for estimating the conditional association between biomarker and outcome. We have chosen to present our theoretical results using the probit link since the multi-variable model (i.e. Equation 9) and the marginal model (i.e. Equation 10) are consistent with each other in probit regression and $\gamma_j = 0$ implies the null hypothesis of Equation 2. However, in practice, and as further discussed in the Supplementary Material, we have found that the MCPs still perform well when using the logit link.

With the above assumptions, by a combination of theoretical and empirical results, we are able to show that FWER and FDR for our proposed procedures are maintained in practical settings.

2.3 Multiple comparison procedures

We first describe two existing MCPs that are designed to achieve a specified FWER: $\mathrm{MCP}_B$ and $\mathrm{MCP}_P$, where the subscripts ‘B’ and ‘P’ abbreviate ‘Bonferroni’ and ‘Permutation’, respectively. We then introduce three new MCPs, $\mathrm{MCP}_S$, $\mathrm{MCP}_S^{WY}$ and $\mathrm{MCP}_S^{MV}$, that are designed to achieve, asymptotically, a specified FWER, where the subscript ‘S’ abbreviates ‘Subset’ and the superscripts ‘WY’ and ‘MV’ abbreviate ‘Westfall-Young’ and ‘Multivariate’, respectively. Finally, we introduce an MCP, $\mathrm{MCP}_D$, that is designed to achieve, asymptotically, a specified FDR in realistic scenarios, where the ‘D’ abbreviates ‘false Discovery rate’.

2.3.1 MCP—Bonferroni (MCPB)

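We claim biomarker j to be a mediator if $P_{1j} \le \alpha/m$ and $P_{2j} \le \alpha/m$, where $\alpha$ is the targeted FWER: $\mathrm{MCP}_B(\vec{D} \mid \alpha) = \{j : P_{1j} \le \alpha/m,\ P_{2j} \le \alpha/m\}$. We further define a Bonferroni-adjusted P-value $P_{Bj} = m \times \max(P_{1j}, P_{2j})$ and restate the definition as $\mathrm{MCP}_B(\vec{D} \mid \alpha) = \{j : P_{Bj} \le \alpha\}$.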
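The Bonferroni rule is short enough to state directly in code. The sketch below assumes the two P-value vectors have already been computed; the function names and the example P-values are hypothetical.

```python
def bonferroni_adjusted(p1, p2):
    """P_Bj = m * max(P_1j, P_2j) for each biomarker j."""
    m = len(p1)
    return [m * max(a, b) for a, b in zip(p1, p2)]


def mcp_b(p1, p2, alpha=0.05):
    """Indices j with P_Bj <= alpha, i.e. both P-values at most alpha / m."""
    return [j for j, p in enumerate(bonferroni_adjusted(p1, p2)) if p <= alpha]


# Hypothetical P-values for m = 4 biomarkers.
p1 = [0.001, 0.004, 0.20, 0.010]
p2 = [0.002, 0.030, 0.001, 0.012]
selected = mcp_b(p1, p2)
```

Here the third biomarker has a very small $P_{2j}$ but is not selected, because both P-values must fall below $\alpha/m$.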

2.3.2 MCP—permutation (MCPP)

We claim biomarker j to be a mediator if $P_{Pj} \le \alpha$, where $\alpha$ is a constant and $P_{Pj}$ is the P-value calculated by our prior permutation approach (Boca et al., 2014): $\mathrm{MCP}_P(\vec{D} \mid \alpha) = \{j : P_{Pj} \le \alpha\}$. Briefly, in this approach, we focus on the product $|\hat{\rho}(E, M_j)\,\hat{\rho}(M_j, Y \mid E)|$, where $\hat{\rho}(E, M_j)$ is the Pearson correlation between E and $M_j$, and $\hat{\rho}(M_j, Y \mid E)$ is the Pearson correlation between $M_j$ and Y given E. We then use permutations to estimate the distribution of $\max_j |\hat{\rho}(E, M_j)\,\hat{\rho}(M_j, Y \mid E)|$ under the hypothesis that there is no mediator and define $P_{Pj}$ to be the probability of observing a value larger than $|\hat{\rho}(E, M_j)\,\hat{\rho}(M_j, Y \mid E)|$ under this distribution.

2.3.3 MCP—subset (MCPS)

For the Bonferroni procedure, $\mathrm{MCP}_B$, each P-value must meet the strict threshold of $\alpha/m$. Here, we suggest a different method, described in Figure 1 and based on work by Bogomolov and Heller (2018), that restricts the testing of each hypothesis to a subset of biomarkers and therefore requires dividing $\alpha$ by a number smaller than m. We let $t_1$ be a threshold (i.e. scalar value) for a significant exposure/mediator relationship, and define $\omega_{S1} = \{j : P_{1j} \le t_1\}$ and $S_1 = C(\omega_{S1})$, where $C(\cdot)$ is the cardinality of a set. Similarly, let $t_2$ be a threshold for a significant mediator/outcome association, and define $\omega_{S2} = \{j : P_{2j} \le t_2\}$ and $S_2 = C(\omega_{S2})$. We then claim biomarker j to be a mediator if $P_{1j} \le 0.5\alpha/S_2$, $P_{2j} \le 0.5\alpha/S_1$ and $j \in \omega_{S1} \cap \omega_{S2}$: $\mathrm{MCP}_S(\vec{D} \mid t_1, t_2, \alpha) = \{j : P_{1j} \le \min(t_1, 0.5\alpha/S_2),\ P_{2j} \le \min(t_2, 0.5\alpha/S_1)\}$. We further define a subset-adjusted P-value $P_{Sj} = 2\max(S_2 P_{1j}, S_1 P_{2j})$ if $P_{1j} \le t_1$ and $P_{2j} \le t_2$, and $P_{Sj} = 1$ otherwise. Note that $\mathrm{MCP}_S(\vec{D} \mid t_1, t_2, \alpha) = \{j : P_{Sj} \le \alpha\}$. In practice, we set $t_1 = t_2 = \alpha/2$ because, in order to be discovered by this procedure, a biomarker must satisfy $(P_{1j}, P_{2j}) \le (\alpha/2, \alpha/2)$. We stress that our use of a Bonferroni-type threshold on a set of prescreened metabolites is only valid because the P-values used for screening (e.g. P-values assessing exposure/metabolite association) are effectively independent of the P-values used in the second step (e.g. P-values assessing the metabolite/outcome association). Furthermore, we gain statistical power, compared to traditional Bonferroni approaches, by eliminating tests that are irrelevant (e.g. testing for exposure/biomarker associations for biomarkers not associated with the outcome).

Fig. 1. Diagram of the MCPS approach showing that the procedure first selects two sets of biomarkers and then selects the shared subset that meets additional criteria
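The subset procedure of Section 2.3.3 can be sketched as follows. This is an illustrative Python function written from the definitions in the text, not code from the MultiMed package; the function name, the default thresholds argument handling and the example P-values are assumptions.

```python
def mcp_s(p1, p2, alpha=0.05, t1=None, t2=None):
    """Return (selected indices, subset-adjusted P-values P_Sj)."""
    t1 = alpha / 2 if t1 is None else t1
    t2 = alpha / 2 if t2 is None else t2
    in_s1 = [p <= t1 for p in p1]  # exposure/biomarker screen
    in_s2 = [p <= t2 for p in p2]  # biomarker/outcome screen
    s1, s2 = sum(in_s1), sum(in_s2)
    adjusted = []
    for a, b, k1, k2 in zip(p1, p2, in_s1, in_s2):
        # P_Sj = 2 * max(S2 * P1j, S1 * P2j) inside both sets, 1 otherwise.
        adjusted.append(2 * max(s2 * a, s1 * b) if (k1 and k2) else 1.0)
    selected = [j for j, p in enumerate(adjusted) if p <= alpha]
    return selected, adjusted


# Hypothetical P-values for m = 10 biomarkers.
p1 = [0.001, 0.004, 0.20, 0.010, 0.50, 0.70, 0.33, 0.90, 0.45, 0.61]
p2 = [0.002, 0.030, 0.001, 0.003, 0.80, 0.15, 0.52, 0.27, 0.66, 0.95]
selected, p_s = mcp_s(p1, p2)
```

In this example $S_1 = S_2 = 3$, so the two biomarkers that pass both screens are compared with $0.5\alpha/3$ rather than $\alpha/10$. Their subset-adjusted P-values are 0.012 and 0.06, whereas the Bonferroni-adjusted values $m \times \max(P_{1j}, P_{2j})$ for the same inputs would be 0.02 and 0.10.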

2.3.4 MCP—subset, Westfall-Young (MCPSWY)

In the previously defined version of $\mathrm{MCP}_S$, the P-values must meet Bonferroni-type thresholds, $0.5\alpha/S_2$ and $0.5\alpha/S_1$, in the second step. However, we note that when the $S_2$ metabolites in $\omega_{S2}$ are highly correlated, the threshold $0.5\alpha/S_2$ would be conservative. A similar statement applies to $\omega_{S1}$ and $0.5\alpha/S_1$. To construct an alternative threshold, we borrow ideas from Westfall and Young (1993). We estimate the null distribution of $\min_{j \in \omega_{S2}}(P_{1j})$ by a permutation procedure, and then replace $S_2$ by the approximate number of independent biomarkers $S_2^{WY} = 0.5\alpha/q_{2,0.5\alpha}$, where $q_{2,0.5\alpha}$ is the $(0.5 \times \alpha)$th quantile of this distribution. We similarly define $S_1^{WY}$ and denote the resulting procedure by $\mathrm{MCP}_S^{WY}(\vec{D} \mid t_1, t_2, \alpha) = \{j : P_{1j} \le \min(t_1, 0.5\alpha/S_2^{WY}),\ P_{2j} \le \min(t_2, 0.5\alpha/S_1^{WY})\}$.
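The last step of this construction, turning permuted minima into an approximate number of independent tests, can be sketched as below. The permutation scheme itself is not detailed in this section, so the sketch takes the permuted minimum P-values as an input; the function name, the empirical-quantile convention and the artificial null values are assumptions made for illustration.

```python
import math


def effective_number(null_min_p, alpha=0.05):
    """Approximate number of independent tests, 0.5 * alpha / q."""
    level = 0.5 * alpha
    ordered = sorted(null_min_p)
    # Empirical quantile: the smallest value with at least level * B of
    # the B permuted minima at or below it (guarded against round-off).
    index = max(0, math.ceil(level * len(ordered) - 1e-9) - 1)
    return level / ordered[index]


# Artificial null minima from B = 1000 hypothetical permutations.
null_min = [k / 100000 for k in range(1, 1001)]
s2_wy = effective_number(null_min, alpha=0.05)
```

With these artificial values the 0.025 quantile is 0.00025, giving an effective number of about 100. When the screened biomarkers are highly correlated, the permuted minima tend to be larger, so the effective number falls below the raw count and the resulting threshold is less strict.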

2.3.5 MCP—subset, multivariate (MCPSMV)

For $\mathrm{MCP}_S$ and $\mathrm{MCP}_S^{WY}$, we aim to detect mediators as defined by $H_{01j}$ and $H_{02j}$. However, we might also consider replacing $H_{02j}$ by $H_{02j}^*$. Unfortunately, we have no procedure that offers theoretical guarantees under the null $H_{01j}$ and $H_{02j}^*$. We therefore offer an ad-hoc procedure that performs well in practice: rather than modeling each biomarker/outcome association marginally, we use stepwise regression. Specifically, we define $P_{2j}^*$ as the P-value for biomarker j when it is added to the multivariable biomarkers/outcome model (i.e. if biomarker j is the third biomarker added, then $P_{2j}^*$ is the P-value for biomarker j when there are three biomarkers in the model). We then define $\omega_{S2}^{MV} = \{j : P_{2j}^* \le t_2\}$, $S_2^{MV} = C(\omega_{S2}^{MV})$ and $\mathrm{MCP}_S^{MV}(\vec{D} \mid t_1, t_2, \alpha) = \{j : P_{1j} \le \min(t_1, 0.5\alpha/S_2^{MV}),\ P_{2j}^* \le \min(t_2, 0.5\alpha/S_1)\}$.

2.3.6 MCP—subset, other modifications

Here, it is worth commenting on two other possible modifications to $\mathrm{MCP}_S$, although neither will be further discussed in this paper. First, we could claim biomarker j to be a mediator if $P_{1j} \le c\alpha/S_2$, $P_{2j} \le (1-c)\alpha/S_1$ and $j \in \omega_{S1} \cap \omega_{S2}$, where c is any value in (0, 1). For example, letting c < 0.5 could be advantageous if exposure/mediator associations were far stronger and would be easily detectable at more stringent thresholds. Second, instead of prespecifying $t_1$ and $t_2$, we could choose thresholds so that the selected sets coincide with the rejected hypotheses in the second step.
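The first modification can be illustrated with a small variant of the subset rule in which the split of $\alpha$ between the two tests is controlled by c. This is a sketch only; the function name is hypothetical, and keeping the screening thresholds at $\alpha/2$ when $c \ne 0.5$ is an assumption, since the text does not say how $t_1$ and $t_2$ would be chosen in that case.

```python
def mcp_s_weighted(p1, p2, alpha=0.05, c=0.5, t1=None, t2=None):
    """Select j in both screened sets with
    P1j <= c * alpha / S2 and P2j <= (1 - c) * alpha / S1."""
    if not 0.0 < c < 1.0:
        raise ValueError("c must lie strictly between 0 and 1")
    t1 = alpha / 2 if t1 is None else t1
    t2 = alpha / 2 if t2 is None else t2
    in_s1 = [p <= t1 for p in p1]
    in_s2 = [p <= t2 for p in p2]
    s1, s2 = sum(in_s1), sum(in_s2)
    selected = []
    for j, (a, b) in enumerate(zip(p1, p2)):
        # Membership is checked first, so s1 and s2 are at least 1 below.
        if (in_s1[j] and in_s2[j]
                and a <= c * alpha / s2
                and b <= (1 - c) * alpha / s1):
            selected.append(j)
    return selected


# Hypothetical P-values: strong exposure associations, weaker outcome ones.
p1 = [0.0001, 0.0002, 0.30, 0.50, 0.70, 0.90]
p2 = [0.004, 0.015, 0.001, 0.40, 0.60, 0.80]
equal_split = mcp_s_weighted(p1, p2, c=0.5)
uneven_split = mcp_s_weighted(p1, p2, c=0.2)
```

In this toy example the second biomarker misses the outcome threshold under the equal split but is selected when c = 0.2, because its exposure association is strong enough to survive the stricter $c\alpha/S_2$ threshold.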

2.3.7 MCP—FDR (MCPD)

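This procedure, based on work by Bogomolov and Heller (2018), builds on the adjusted P-values of $\mathrm{MCP}_S$. For a given dataset, we calculate our subset-adjusted P-values $\{P_{S1}, \ldots, P_{Sm}\}$ as in Section 2.3.3. We then claim biomarker $j \in \omega_{S1} \cap \omega_{S2}$ to be a mediator if

$P_{Dj} = \min_{j' : P_{Sj'} \ge P_{Sj}} \frac{P_{Sj'}}{\mathrm{rank}(P_{Sj'})} \le \alpha$, (11)

where $P_{Dj}$ is the FDR-adjusted P-value and $\mathrm{rank}(P_{Sj})$ is the rank of $P_{Sj}$ among the biomarkers $j \in \omega_{S1} \cap \omega_{S2}$. Note that $\mathrm{MCP}_D(\vec{D} \mid t_1, t_2, \alpha) = \{j : P_{Dj} \le \alpha\}$.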
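The FDR adjustment is a step-up calculation on the subset-adjusted P-values: each one is divided by its rank among the screened biomarkers, and a running minimum is taken from the largest value downwards. The Python sketch below is an illustration, not the MultiMed code. The function name is hypothetical, and assigning 1 to biomarkers outside $\omega_{S1} \cap \omega_{S2}$ is an assumption. The example inputs are the rounded pS values from the first six rows of Table 3, so the outputs agree with the pD column only up to rounding.

```python
def fdr_adjusted(p_s, in_both):
    """P_Dj = min over j' with P_Sj' >= P_Sj of P_Sj' / rank(P_Sj')."""
    idx = [j for j, keep in enumerate(in_both) if keep]
    order = sorted(idx, key=lambda j: p_s[j])
    # Assumption: biomarkers outside the intersection get the value 1.
    p_d = [1.0] * len(p_s)
    running_min = float("inf")
    # Walk from the largest P_S down, so the running minimum covers
    # every j' ranked at or above the current biomarker.
    for rank in range(len(order), 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, p_s[j] / rank)
        p_d[j] = running_min
    return p_d


# Rounded pS values from the first six rows of Table 3.
p_s = [0.014, 0.015, 0.06, 0.083, 0.11, 0.16]
p_d = fdr_adjusted(p_s, [True] * len(p_s))
```

The first two values both become 0.0075 because the second-ranked value, 0.015/2, is smaller than the first-ranked value, 0.014/1.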

2.4 Theoretical properties

We show that the asymptotic FWER for MCPS is less than or equal to α (see Appendix for details):

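Theorem 1: For $\mathrm{MCP}_S(\cdot \mid t_1, t_2, \alpha)$, if A1 holds and $\{M_{i1}, \ldots, M_{im}, Y_i\}$ follow Equations 4 and either 6 or 10, then $\lim_{n \to \infty} \mathrm{FWER} \le \alpha$, where Assumption A1 is defined by

Assumption A1: If $Y_i \perp M_{ij} \mid E_i$ then $Y_i \perp M_{ij} \mid \{E_i, M_{ij'}\}$ for all $\{j, j' : \gamma_j = 0, \beta_{j'} = 0\}$.

We note that A1 is satisfied by many parametric models. Moreover, in the Appendix, we show that the asymptotic FDR of $\mathrm{MCP}_D$ is less than or equal to $\alpha$:

Theorem 2: For $\mathrm{MCP}_D(\cdot \mid t_1, t_2, \alpha)$, if $M_{ij} \perp M_{ij'} \mid E_i$ holds for all $j \ne j' \in \{1, \ldots, m\}$, then $\lim_{n \to \infty} \mathrm{FDR} \le \alpha$.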

We note, without proof, that a similar claim will also hold if blocks of putative mediators are conditionally independent and we let the number of blocks go to infinity.

In Section 3.1 of the Results and in the Supplementary Material, we show in simulations that the FWER and FDR are controlled at the nominal level for finite n and dependent mediators.

2.5 Simulations

We compare the performance of the MCPs in variations of the following simulated study. We consider a study with 500 individuals and m biomarkers, where $m \in \{110, 1010\}$. In the first set of simulations, we let $E_i \sim N(0, 1)$, $M_{ij}$ follow Equation 4 with $\epsilon_{Mij} \sim N(0, \sigma_{Mj}^2)$, and

$Y_i = \gamma_0^* + \gamma_E^* E_i + \sum_j \gamma_j^* M_{ij} + \epsilon_{Yi}^*$ (12)

with $\epsilon_{Yi}^* \sim N(0, \sigma_Y^2)$, where $\sigma_{Mj}^2$ and $\sigma_Y^2$ were chosen so that $\mathrm{var}(M_{ij}) = \mathrm{var}(Y_i) = 1$. We chose to fix the marginal variance at 1 because we believe that it reflects real datasets, where biomarkers and outcome are normalized. We let $m_{00}$ be the number of biomarkers with $\beta_j = \gamma_j^* = 0$, $m_{10}$ be the number with $\beta_j = 0.18, \gamma_j^* = 0$, $m_{01}$ be the number with $\beta_j = 0, \gamma_j^* = 0.18$ and $m_{11}$ be the number with $\beta_j = \gamma_j^* = 0.18$ (i.e. $m_{11}$ is the number of true mediators). We vary the chosen values $\{m_{00}, m_{10}, m_{01}, m_{11}\}$, allowing $m_{10}$ to be large to reflect our motivating example where many biomarkers were likely associated with the exposure. For the primary analysis, all biomarkers are independent conditional on E. For the secondary analyses, with the exception of true mediators, blocks of biomarkers were correlated. We let $\Sigma_0$ be block diagonal, with blocks of size 5 (m = 110) or 20 (m = 1010), and let the off-diagonal elements be either 0, 0.5 or 0.9. If the correlation would result in var(Y) exceeding 1, we reduced $\gamma_j^*$. Biomarkers associated with E or Y were maximally spread across blocks with the rule that no block contained biomarkers associated with both E and Y. This restriction prevents the creation of ‘potential mediators’, where hypotheses 1 and 2 are both false, that would not qualify as ‘true mediators’ (see Section 4 for details). In a second set of simulations, we consider a binary outcome, $Y^*$, with $Y_i^* = 1$ if $Y_i > 0$, 0 otherwise. For each scenario, defined by outcome type, m and $\{m_{00}, m_{10}, m_{01}, m_{11}\}$, we run 1000 simulations. In null simulations, we let $m_{11} = 0$ and calculate the FWER as the proportion of simulations where our MCP selects at least one biomarker at $\alpha = 0.05$. In non-null simulations, we calculate power as the average proportion of the 10 true mediators that are selected by our MCP set to $\alpha = 0.05$ and we calculate the observed FDR when the FDR threshold is set to 0.2. Note, with 1000 simulations, the standard error of our estimated FWER should be no larger than $0.007 = \sqrt{0.05 \times (1 - 0.05)/1000}$.
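Two pieces of arithmetic in this design can be made concrete. With $E_i \sim N(0, 1)$, Equation 4 gives $\mathrm{var}(M_{ij}) = \beta_j^2 + \sigma_{Mj}^2$, so a unit marginal variance requires $\sigma_{Mj}^2 = 1 - \beta_j^2$; and the Monte Carlo standard error of a proportion estimated from a number of simulations follows the usual binomial formula. The Python sketch below illustrates both. The function names are hypothetical, the intercept $\beta_{0j}$ is set to 0 as an assumption, and the outcome model is not simulated because $\gamma_E^*$ is not given in this section.

```python
import math
import random


def noise_variance(beta):
    """sigma_Mj^2 giving var(M_ij) = 1 when E_i ~ N(0, 1) in Equation 4."""
    return 1.0 - beta ** 2


def simulate_biomarker(exposure, beta, rng):
    """Draw M_ij = beta * E_i + error, with the intercept assumed to be 0."""
    sd = math.sqrt(noise_variance(beta))
    return [beta * e + rng.gauss(0.0, sd) for e in exposure]


def monte_carlo_se(rate, n_sims):
    """Standard error of a proportion estimated from n_sims simulations."""
    return math.sqrt(rate * (1.0 - rate) / n_sims)


rng = random.Random(1)
exposure = [rng.gauss(0.0, 1.0) for _ in range(500)]
mediator = simulate_biomarker(exposure, 0.18, rng)
se_fwer = monte_carlo_se(0.05, 1000)
```

For $\beta_j = 0.18$ the error variance is 0.9676, and the standard error for a true FWER of 0.05 over 1000 simulations rounds to 0.007, matching the value quoted in the text.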

2.6 Breast cancer study

This study, nested inside the Prostate, Lung, Colorectal and Ovarian Cancer Screening Study (PLCO), includes 418 estrogen-receptor positive (ER+) breast cancer cases and 418 controls matched on study entry age (±2 years), date of blood collection (±3 months) and hormone therapy use at baseline (Moore et al., 2017). Non-fasting serum samples were collected at the first follow-up visit, one-year post-baseline. Serum metabolites (<1 Kilodalton molecular weight) were measured by Metabolon Inc. using liquid chromatography-tandem mass-spectrometry. Of the 1057 serum metabolites measured, 478 were identified and present in at least 90% of the population. Metabolite peaks were normalized by dividing by batch median and then log transformed. All models were adjusted for age at serum collection, race, hormone use, age of menarche, parity, age of menopause, smoking and diabetes status. For purposes of sample weighting, the prevalence of ER+ breast cancer was 0.016.
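To illustrate how a prevalence figure can be turned into sample weights for the weighted regressions described in Section 2.2, the following Python sketch computes weights that are inversely proportional to the probability of being sampled. It is a simplified illustration under stated assumptions: sampling is assumed to depend only on case status, the matching factors are ignored, and the function name is hypothetical. The weighting actually used in the analysis is not detailed in this section.

```python
def case_control_weights(n_cases, n_controls, prevalence):
    """Weights proportional to 1 / P(sampled | case status).

    Cases make up `prevalence` of the population but n_cases / n of the
    sample, so each case is weighted by the ratio of the two shares.
    Controls are treated in the same way.
    """
    n = n_cases + n_controls
    w_case = prevalence / (n_cases / n)
    w_control = (1.0 - prevalence) / (n_controls / n)
    return w_case, w_control


# Study sizes and prevalence as reported in Section 2.6.
w_case, w_control = case_control_weights(418, 418, 0.016)
weighted_case_share = (418 * w_case) / (418 * w_case + 418 * w_control)
```

With 418 cases and 418 controls, each case receives weight 0.032 and each control 1.968, so the weighted share of cases in the sample equals the assumed population prevalence of 0.016.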

3 Results

3.1 Simulations

The simulations demonstrate that the newly proposed MCPs have good operating characteristics. First, under the null scenarios with m11=0, most MCPs achieved their targeted FWER of 0.05. These results are summarized in Table 1 for conditionally independent biomarkers and in Supplementary Tables for blocks of dependent biomarkers. The exception is that the FWER for the permutation approach could be as high as 0.07, an undesirable consequence of not having theoretical guarantees on the error rate. In general, the FWER was smallest for MCPB, which relied on Bonferroni correction for determining significance. Second, under all scenarios, MCPD achieved its targeted FDR (Supplementary Tables).

Table 1. FWER from four multiple comparison procedures MCPB, MCPP, MCPS and MCPSMV

In the header and rows below, columns 5 to 8 refer to the continuous outcome and columns 9 to 12 to the binary outcome.
m00 m10 m01 m11 MCPB MCPP MCPS MCPSMV MCPB MCPP MCPS MCPSMV
110 0 0 0 0.00 0.03 0.01 0.01 0.00 0.01 0.00 0.00
95 15 0 0 0.00 0.04 0.01 0.01 0.00 0.03 0.01 0.01
70 40 0 0 0.01 0.07 0.03 0.03 0.02 0.06 0.02 0.02
95 0 15 0 0.00 0.04 0.02 0.03 0.00 0.07 0.01 0.02
80 15 15 0 0.01 0.05 0.04 0.04 0.01 0.08 0.05 0.08
55 40 15 0 0.02 0.04 0.05 0.04 0.01 0.02 0.03 0.02
1010 0 0 0 0.00 0.02 0.00 0.00 0.00 0.02 0.00 0.00
995 15 0 0 0.00 0.04 0.00 0.01 0.00 0.05 0.01 0.01
700 310 0 0 0.00 0.06 0.01 0.01 0.00 0.04 0.00 0.00
995 0 15 0 0.00 0.04 0.00 0.01 0.00 0.05 0.00 0.00
980 15 15 0 0.00 0.03 0.01 0.01 0.00 0.03 0.01 0.03
685 310 15 0 0.02 0.05 0.04 0.05 0.00 0.07 0.00 0.06

Note: The first four columns show the number (m00) of biomarkers associated with neither exposure nor outcome, the number (m10) associated with only the exposure, the number (m01) associated with only the outcome and the number (m11) associated with both exposure and outcome. The remaining columns show the FWER, defined to be the mean proportion of simulations with at least one biomarker identified as a mediator, when α=0.05. Details of the simulation can be found in Section 2.

The newly proposed MCPs tended to have higher power for detecting true mediators. These results are summarized in Table 2 for conditionally independent biomarkers and in Supplementary Tables for blocks of dependent biomarkers. In the majority of scenarios, among all univariate approaches, MCPS and MCPSWY slightly outperformed MCPP despite having lower FWER in the null simulations, but their relative performance depended on the exact scenario. The MCPP test can select biomarkers with a single strong association (i.e. the m10 and m01 biomarkers with one true association) and biomarkers with two non-significant, but still modest, associations (i.e. the m00 biomarkers with $P_{1j}, P_{2j} \approx 0.1$). Note, we omit MCPSWY from Tables 1 and 2, where biomarkers are independent, because it yields identical results to MCPS. Furthermore, we found that the multivariate approach resulted in higher power, as compared to the univariate approaches, with MCPSMV having the highest power in all scenarios. We emphasize that there is no equivalent to the multivariate approach, and no means for obtaining the corresponding increase in power, using a permutation-based method. As expected, MCPSWY only increased power, compared to MCPS, when there was significant correlation (e.g. 0.9) among biomarkers and the total number of biomarkers was large (e.g. 1010) (Supplementary Material). Given its limited benefit and its failure to strictly control FWER in two simulations (m = m00 = 110) with higher correlation (Supplementary Tables S3 and S7), we do not recommend using MCPSWY in practice. In general, results were similar for the binary and continuous outcome.

Table 2. Power from four multiple comparison procedures MCPB, MCPP, MCPS and MCPSMV

In the header and rows below, columns 5 to 8 refer to the continuous outcome and columns 9 to 12 to the binary outcome.
m00 m10 m01 m11 MCPB MCPP MCPS MCPSMV MCPB MCPP MCPS MCPSMV
100 0 0 10 0.54 0.68 0.72 0.81 0.30 0.49 0.49 0.58
85 15 0 10 0.54 0.61 0.68 0.80 0.28 0.37 0.40 0.49
60 40 0 10 0.55 0.54 0.64 0.79 0.28 0.30 0.34 0.43
85 0 15 10 0.54 0.58 0.68 0.77 0.28 0.42 0.44 0.65
70 15 15 10 0.54 0.54 0.64 0.77 0.25 0.33 0.36 0.66
45 40 15 10 0.55 0.50 0.60 0.76 0.29 0.30 0.33 0.66
1000 0 0 10 0.28 0.69 0.61 0.77 0.11 0.47 0.34 0.44
985 15 0 10 0.28 0.60 0.58 0.74 0.10 0.37 0.31 0.41
690 310 0 10 0.27 0.33 0.45 0.66 0.10 0.15 0.18 0.23
985 0 15 10 0.27 0.57 0.57 0.77 0.11 0.43 0.35 0.65
970 15 15 10 0.26 0.53 0.54 0.75 0.11 0.36 0.32 0.63
675 310 15 10 0.27 0.33 0.44 0.76 0.08 0.12 0.16 0.42

Note: The first four columns show the number (m00) of biomarkers associated with neither exposure nor outcome, the number (m10) associated with only the exposure, the number (m01) associated with only the outcome and the number (m11) associated with both exposure and outcome. The remaining columns show the power, defined to be the mean proportion of true mediators identified, when α=0.05. Details of the simulation can be found in Section 2.

In an attempt to mimic our Breast Cancer study, we simulated data with a large number of exposure/metabolite associations. When m = 1010 and m10=310, we found that the benefit of our newly proposed MCPS approach, as compared to the Bonferroni approach, was less pronounced. Intuitively, this decline occurs because when S1 is large, P2j will have to achieve near Bonferroni-level significance for the biomarker to qualify as a mediator.

3.2 Breast cancer study

The 478 metabolites were strongly associated with both BMI and breast cancer status, with 218 of the BMI/metabolite associations having a P-value below 0.05 and 103 of the breast cancer/metabolite associations having a P-value below 0.05. We found 24 metabolites, listed in Table 3, which were potential mediators connecting BMI and breast cancer risk (FDR < 0.2). Of those 24, only 2, 16-α-hydroxy-DHEA-3-sulfate and 3-methyl-glutaryl carnitine 1, were significant at FWER = 0.05. We note that the P-values from our new methods were lower than the P-values produced by alternative methods. However, as seen in the simulations with a large number of exposure/biomarker associations, the P-values from these methods were not dramatically smaller.

Table 3.

We list metabolites with an FDR-adjusted P-value < 0.2 using MCPD (pD), along with their adjusted P-values based on MCPB (pB), MCPP (pp), MCPS (pS) and MCPSWY (pSWY); 4A3B17B = 4-androsten-3beta, 17beta-diol

Name pB pp pS pSWY pD
16a-Hydroxy DHEA 3-sulfate 0.021 0.055 0.014 0.008 0.0075
3-Methylglutarylcarnitine 0.046 0.29 0.015 0.018 0.0075
4A3B17B disulfate 0.31 0.67 0.06 0.094 0.02
Allo-isoleucine 0.12 0.2 0.083 0.056 0.021
4A3B17B monosulfate 0.59 0.61 0.11 0.14 0.023
Urate 0.24 0.084 0.16 0.086 0.027
3-Methyl-2-oxobutyrate 0.6 0.22 0.41 0.23 0.054
4A3B17B disulfate 1 0.99 0.43 0.38 0.054
Gamma-glutamylvaline 0.84 0.56 0.57 0.32 0.063
Alpha-hydroxyisovalerate 1 1 0.73 0.6 0.073
2-Methylbutyrylcarnitine 1 1 1 0.52 0.096
21-Hydroxypregnenolone disulfate 1 1 1 0.92 0.1
7-Methylguanine 1 0.98 1 0.66 0.1
Histidine 1 1 1 0.7 0.1
N-acetylalanine 1 0.86 1 0.68 0.1
Lactate 1 1 1 1 0.12
Succinylcarnitine 1 1 1 1 0.12
4A3B17B monosulfate 1 1 1 1 0.13
Alpha-tocopherol 1 0.79 1 1 0.15
Octanoylcarnitine 1 1 1 1 0.16
Dihomo-linolenate 1 1 1 1 0.17
Decanoylcarnitine 1 1 1 1 0.18
Euricoyl sphingomyelin 1 1 1 1 0.18
N1-methylguanosine 1 0.25 1 1 0.18

4 Discussion

We introduced a new method for testing multiple putative mediators. This computationally efficient method can maintain specified family-wise error rates (FWER) and false discovery rates (FDR), and should be very useful in modern studies evaluating high dimensional biomarkers. We then applied this new method to a study evaluating the mechanistic relationship between increased BMI and an increased risk of breast cancer.

We note that MCPS and MCPSWY test each biomarker individually. Therefore, we can only use these methods to claim that marginally, when considered in isolation, each selected biomarker has the defining characteristics of a mediator. We claim neither that the exposure nor that the outcome is correlated with the selected biomarker conditional on all other biomarkers. We aim only to reject H01j and H02j. Hence, the markers selected by these procedures may not all be true biological mediators. Consider the following example. Let E → M1 → Y and M1 → M2. Our MCP is designed to select M2, but M2 is not a true biological mediator. For this reason, we opted to call our selected biomarkers ‘probable mediators’ and not ‘true mediators’. Given this limitation, when using either MCPS or MCPSWY, we suggest a second step, following variable selection, that builds a graphical model containing the exposure, outcome and selected variables. A second option is to use MCPSMV, which identifies biomarkers marginally associated with the exposure and, to some extent, conditionally associated with the outcome. The caveat is that the association is only conditional on those biomarkers that were included in the stepwise regression and this method does not carry theoretical guarantees.

The newly proposed MCP is an important contribution to the current literature on multivariate mediation analysis. First, the new methods improve upon our previous permutation approach in four ways. The new MCP is more powerful, provides theoretical guarantees on FWER, requires less computational time and can easily be extended to a multivariable analysis. Moreover, this new MCP provides a means for controlling FDR, in addition to FWER. Second, this MCP complements those procedures that fit mediation models where the majority of putative mediators are presumed to be true mediators. Our MCP can be considered a preprocessing step to model fitting. Third, this paper brings the mathematical theory developed for the field of replicability to mediation analysis. The proofs guaranteeing asymptotic FWER and FDR control extend the theory to mediation analysis where the P-values, $\{P_{11}, \ldots, P_{1m}\}$ and $\{P_{21}, \ldots, P_{2m}\}$, are calculated from a common dataset.

The theory developed here builds upon the theory developed by Bogomolov and Heller (2018) for demonstrating replicability. In their work, the pair of P-values (p1j,p2j) summarize the association between biomarker and outcome (e.g. SNP and disease) in two distinct study populations. Then, showing that their MCP selects biomarker j would be equivalent to stating that the biomarker/outcome association is replicable (i.e. the association is significant in both datasets). The common feature in their application and ours is that the two sets of P-values can be considered independent. This requirement limits further extensions, preventing, for example, its use in cases where the P-values are for two correlated traits in a common population.

In our breast cancer study, we identified 16a-hydroxy DHEA 3-sulfate and 3-methylglutarylcarnitine-1 as potential mediators of the BMI and ER+ breast cancer association. 16a-hydroxy DHEA 3-sulfate is the 16a-hydroxylated metabolite of DHEA and has been found in laboratory studies to be estrogenic and to be capable of binding and activating the ERβ estrogen receptor. However, it has not been previously linked with breast cancer risk. 3-Methylglutarylcarnitine-1 is a marker indicative of incomplete degradation of leucine. Specifically, when the 3-hydroxy-3-methylglutaryl-coenzyme A lyase enzyme, which catalyzes the final step in leucine catabolism, is insufficiently active, 3-methylglutarylcarnitine-1 accumulates in the blood. For this reason, 3-methylglutarylcarnitine-1 is sometimes used in clinical settings to diagnose errors in leucine metabolism. No prior studies have examined this metabolite in relation to breast cancer risk. These findings point toward potentially new metabolic pathways that may link a high BMI with breast cancer risk.

Funding

Ruth Heller acknowledges support by Israel Science Foundation (grant no. 0603616831).

Conflict of Interest: none declared.

References

  1. Baron R.M., Kenny D.A. (1986) The moderator–mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol., 51, 1173–1182.
  2. Boca S.M. et al. (2014) Testing multiple biological mediators simultaneously. Bioinformatics, 30, 214–220.
  3. Bogomolov M., Heller R. (2018) Assessing replicability of findings across two studies of multiple features. Biometrika.
  4. Chen O.Y. et al. (2017) High-dimensional multivariate mediation: with application to neuroimaging data. Biostatistics, doi:10.1093/biostatistics/kxx027.
  5. Daniel R.M. et al. (2015) Causal mediation analysis with multiple mediators. Biometrics, 71, 1–14.
  6. Huang Y.-T., Pan W.-C. (2016) Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics, 72, 402–413.
  7. Lumley T. et al. (2002) The importance of the normality assumption in large public health data sets. Annu. Rev. Public Health, 23, 151–169. PMID: 11910059.
  8. MacKinnon D.P. (2008) Introduction to Statistical Mediation Analysis. Erlbaum Psych Press.
  9. Moore S. et al. (2017) A metabolomics analysis of body mass index and postmenopausal breast cancer risk. JNCI., 110, 1–10.
  10. Nguyen Q.C. et al. (2015) Practical guidance for conducting mediation analysis with multiple mediators using inverse odds ratio weighting. Am. J. Epidemiol., 181, 349–356.
  11. Pearl J. (2012) The causal mediation formula: a guide to the assessment of pathways and mechanisms. Prevent. Sci., 13, 426–436.
  12. Robins J.M., Greenland S. (1992) Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3, 143–155.
  13. Taguri M. et al. (2015) Causal mediation analysis with multiple causally non-ordered mediators. Stat. Methods Med. Res., 27, 3–19.
  14. Ten Have T.R., Joffe M.M. (2012) A review of causal estimation of effects in mediation analyses. Stat. Methods Med. Res., 21, 77–107.
  15. van den Brandt P.A. et al. (2000) Pooled analysis of prospective cohort studies on height, weight, and breast cancer risk. Am. J. Epidemiol., 152, 514.
  16. VanderWeele T., Vansteelandt S. (2014) Mediation analysis with multiple mediators. Epidemiol. Methods, 2, 95–115.
  17. Westfall P.H., Young S.S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Vol. 279. Wiley-Interscience.
  18. Zhang H. et al. (2016) Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinformatics, 32, 3150–3154.
  19. Zhao Y., Luo X. (2016) Pathway lasso: estimate and select sparse mediation pathways with high dimensional mediators. https://arxiv.org/abs/1603.07749.
