Abstract
We consider the scenario where there is an exposure, multiple biologically-defined sets of biomarkers, and an outcome. We propose a new two-step procedure that tests if any of the sets of biomarkers mediate the exposure/outcome relationship, while maintaining a prespecified Family-Wise Error Rate (FWER). The first step of the proposed procedure is a screening step that removes all groups that are unlikely to be strongly associated with both the exposure and the outcome. The second step adapts recent advances in post-selection inference to test if there are true mediators in each of the remaining, candidate sets. We use simulation to show that this simple two-step procedure has higher statistical power to detect true mediating sets when compared with existing procedures. We then use our two-step procedure to identify a set of Lysine-related metabolites that potentially mediate the known relationship between increased BMI and the increased risk of ER+ breast cancer in post-menopausal women.
Keywords: group testing, high dimensional mediation, pathway analysis
1. Introduction
Mediation analysis explores how an exposure (E) is associated with an outcome (Y) 1,2. Traditionally, mediation analysis assumes the exposure influences a single mediating variable (M) which, in turn, influences the outcome (Figure 1A), and then aims to decompose the total effect of the exposure into a direct and an indirect (i.e. via M) effect 2–4. Initial methods explored this decomposition using parametric models 5,6, while more modern methods use a counterfactual framework 2,7. Recently, epidemiological studies have considered high-dimensional biomarkers as potential mediators linking an exposure and disease 8–10. Here, we focus on the scenario in which the biomarkers can be split into disjoint, biologically defined sets (i.e. groups) and the specific goal is to test whether any of these predefined sets potentially mediate the exposure/outcome relationship (Figure 1B). By pooling signals from multiple biomarkers within a set, these group-level tests may increase the power to detect true mediators. Group-level tests have already been demonstrated to increase the power 18–20 for testing associations when looking at groups of rare variants in genes 12,13, groups of genetic variants in pathways 14,15, and groups of biomarkers with shared function or structure 16,17. We note, however, that we can only increase power when the mediating biomarkers belong to the same group, a condition that will strongly depend on the type and quality of set definitions.
There is a growing body of literature exploring how high-dimensional biomarkers can be used in mediation analysis, specifically discussing how to identify sets of biomarkers that collectively mediate the exposure/outcome association 9,21,22, test if individual biomarkers mediate the association 8,21,23,24, and test if predefined groups of biomarkers mediate the association 10,25. Here, we add to the literature by proposing a new procedure for testing groups of biomarkers. Our motivating study 26 is an 843-individual case-control study of ER+ breast cancer where the goal is to identify if one of the 38 biologically defined sets of metabolites mediates the relationship between higher body mass index (BMI) and the increased risk of breast cancer.
The first step of our two-step (TS) procedure is a screening step that removes all sets unlikely to be strongly associated with both the exposure and the outcome. The second step of the procedure adapts a method for post-selection inference 27–29 to attach a corrected or conditional p-value to each biomarker in the remaining sets. We then claim a set to be a mediating set if one of the included biomarkers is a statistically significant mediator based on this conditional p-value. Specifically, we define our p-value for a set of biomarkers to be the minimum of these conditional p-values. This approach builds upon two recent advances in the genetics literature, improved group tests of rare variants30,31 and post-selection inference to identify specific associated variants within the group27–29. This approach, screening by group and testing individual biomarkers, has higher power to detect mediating sets than the standard methods used to test groups. Moreover, our approach has the additional benefit of identifying the specific biomarkers in a set that are the actual mediators. We note that we defined our set-level p-value to be the minimum p-value, as opposed to using Fisher’s method, because the statistical distribution for the sum of the logged-conditional p-values could not be easily described.
The remainder of the paper is organized as follows. In the Materials and Methods, we describe the proposed TS procedure, the simulations used to evaluate the procedure, and the motivating study of breast cancer. In the Results, we compare the performance of the TS procedure with comparators in the simulations and report our findings from the breast cancer study. Finally, in the Discussion, we offer insights about the differences between the TS and other procedures.
2. Materials and methods
2.1. Notation
Let us consider individuals. For individual , let be the exposure, be the outcome, and be a vector of biomarkers or potential mediators. For a given biomarker j, we denote the m-1 set of biomarkers without j by . For this paper, our potential mediators will always be biomarkers and we will use the terms interchangeably. We classify all biomarkers into predefined disjoint sets, where the biomarkers of set } are indexed by , and we then define and . Finally, we let be the set containing biomarker be the indices for all biomarkers, other than , in the set and .
2.2. Causal Inference
We introduce counterfactual notation. We define to be the value of the biomarkers in subject if is set to and we define to be the value of the outcome if is set to and is set to . Given the number of biomarkers and their unknown and potentially bidirectional relationships, we cannot allow the biomarkers to be functions of each other (i.e. is only a function of e) when using counterfactual notation.
We can then define the total effect from changing to to be , the Natural Indirect Effect (NIE(s)) through a given set to be , and the Natural Indirect Effect (NIE(j)) through a given biomarker to be .
We would like to claim to be a mediating set if NIE(j) ≠ 0 for at least one . We emphasize that, as shown below, this statement would differ from claiming that is a mediating set if NIE(s) ≠ 0. Our formal definition of mediating set is offered in Section 2.3.
2.3. Continuous Outcome
We will first assume that the biomarkers and outcome are continuous random variables defined by
(3) |
(4) |
where , and . Equation (4) further implies that for any set
(5) |
We can define the causal effects from Section 2.2 in terms of the parameters from equations 3–5: , , and . When equations 3–5 hold, we also note that the following assumptions, provided by Imai et al 32, will also hold and allow all causal effects to be estimable.
Assumption 1 (Sequential ignorability)
where denotes the vector of observed pre-treatment covariates.
However, when m > n, we may not be able to estimate the parameters in equation (4). Therefore, we may not be able to estimate and test for its presence. Instead, as a pragmatic compromise, we will use the parametric models from equations (3) and (5), and formally define s to be a mediating-set if for at least one , or equivalently, if . We offer three comments regarding this definition. First, although not easily stated using the language of causal inference, our definition for a mediating-set is still well defined. Second, when the biomarkers from different sets are independent given and our definition has the desired meaning. Third, we note that there are other powerful methods for detecting . One approach10 would be transforming the biomarkers using spectral decomposition and then individually testing each of the resulting linear combinations
We will fit models (3) and (5) using linear regressions to obtain the Maximum Likelihood Estimates (MLE) for set . We denote the MLE by and ; we denote the combined vector by . Furthermore, we denote the estimates of the covariances for , and by , and , the standard errors of the jth element of by and the standard error of the jth element of by . Note, the , for all The latter is true because the likelihood, , can be factored into two components, , each containing only one set of parameters, as described in previous literature 8,10,24.
We can define biomarker-level p-values to test the null hypotheses and by and , where , and is the Cumulative Distribution Function (CDF) for the standard normal distribution. We choose the normal distribution as opposed to the t-distribution because relevant studies of high dimensional biomarkers typically have significantly more subjects than markers per set, n >> .
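As a minimal sketch of the biomarker-level test described above, the snippet below computes a p-value from a coefficient estimate and its standard error using the standard normal CDF. It assumes a two-sided Wald test; the function names `normal_cdf` and `wald_p_value` are illustrative and do not come from the original work.

```python
import math


def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution, written via erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def wald_p_value(estimate: float, std_error: float) -> float:
    """Two-sided p-value for H0: coefficient = 0 with a normal reference.

    Assumption: the test is two-sided, so p = 2 * (1 - Phi(|z|)),
    where z = estimate / std_error.
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = abs(estimate / std_error)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) and is stable for large z.
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical estimate and standard error, for illustration only.
    print(wald_p_value(0.065, 0.02))
```

Using the normal rather than the t reference matches the large-sample argument in the text; with small samples per set the two would differ.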
We can further define a weighted set-level p-value to test the set’s association with the exposure and outcome using one of two variance component tests 30,31. For the first method, we test for an association between the group of biomarkers and E or Y using the following pooled test statistic
(10) |
where the complementary effect estimates are used as weights. Thus, when testing for the association between the set of biomarkers and the exposure, we treat as fixed weights and, similarly, when testing for the association with the outcome we treat as fixed weights. Both test statistics explicitly upweight biomarkers that have large effects with exposure or outcome. The corresponding p-values for the set s, and , are calculated from two functions, and , that are the CDFs for a linear combination of distributions with weights determined by and . For the second method, we test for an association between the group of biomarkers and the exposure or outcome without any weighting, using the statistics
(11) |
(12) |
The corresponding p-values for the set s, and , are now calculated from two functions, and , that are the CDFs for a linear combination of distributions with weights set to 1. Note that sets of biomarkers only associated with the outcome or exposure will have relatively small values of , compared to , making the former potentially more powerful at detecting true signals. However, we incorporate the unweighted tests in parts of the proposed methodology because they are statistically independent of each other.
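The set-level p-values above are tail probabilities of a weighted sum of random variables. The sketch below estimates such a tail probability by simulation, assuming the summands are independent chi-square variables with one degree of freedom and that the weights are already known. Both are assumptions made for illustration. The function name `lincomb_chisq_pvalue` is hypothetical, and variance-component tests are typically evaluated with numerical inversion rather than Monte Carlo.

```python
import random


def lincomb_chisq_pvalue(q_obs, weights, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(sum_k w_k * Z_k**2 >= q_obs), Z_k iid N(0, 1).

    Assumptions (not taken from the source): each component is chi-square
    with 1 degree of freedom, and `weights` are non-negative and fixed.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if draw >= q_obs:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_draws + 1)


if __name__ == "__main__":
    # Unweighted case: all weights equal to 1 (hypothetical set of 5 markers).
    print(lincomb_chisq_pvalue(q_obs=11.07, weights=[1.0] * 5))
```

With all weights equal to 1 the statistic reduces to an ordinary chi-square variable, which gives a simple way to check the simulation against known quantiles.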
2.4. Binary Outcome
Although we have thus-far considered continuous outcomes, we will also consider the scenario where the outcome is a binary random variable. We define the binary outcome, , by the probit model and following equations (4) and (5). Here, our definitions of the null hypotheses (i.e. equations 7–9), test statistics, and p-values remain essentially the same. The exception is that we estimate by probit regression and for retrospective sampling (i.e. case/control models), we estimate by weighted linear regression, where the weights are proportional to the probability of being sampled. Note, the choice of the probit model ensures both that equations (4) and (5) are consistent and that the hypotheses stated in equations (2) and (7) are identical, although the only true requirement is for for biomarkers satisfying . We note that, in practice, the procedure performs equally satisfactorily when the coefficients and p-values for the outcome associations are calculated using logistic regression, a model more familiar in epidemiology 24.
2.5. Testing procedures for groups of biomarkers
We first describe existing procedures and then introduce our new, more powerful, two-step procedures for testing sets of biomarkers.
2.5.1. Minimum-P Procedure (MIN)
This procedure is a direct modification of the approach introduced by Sampson and others 24. We start by calculating the FWER-corrected p-value for each biomarker using the MCPS approach. Briefly, define and . Let and be the cardinality of each set (i.e. the number of elements in that set). We define a FWER-corrected p-value for each biomarker by if and 1 otherwise. The MIN procedure then claims a set to be a mediating set if . In other words, the MIN procedure claims a set to be a mediating set if one of the biomarkers included in that set qualifies as a mediator after adjusting for multiple testing.
2.5.2. Linear Procedures (LIN)
These procedures were introduced by Huang 25. They suggest two test statistics, each with a normal distribution under the null hypotheses and some additional assumptions. For set , the two statistics are and , where , and the max is over all binary 2ms-length vectors . The LIN-1 and LIN-2 procedures will, respectively, claim a set to be a mediating set if and . For the comparison below, we consider only the more powerful, LIN-2, which we abbreviate by LIN. We note that the LIN procedure is designed to answer a slightly different problem and test the null hypothesis that .
2.5.3. Quadratic Procedure (QUAD)
This procedure was introduced by Huang and Pan 10. To account for the possibility of effects in different directions, they suggest the statistic and a novel parametric bootstrap to obtain the corresponding p-value. Specifically, they randomly generate B bootstrap replicates of from a normal distribution with mean and variance . They then calculate for each bootstrapped set and define the p-value to be . The QUAD procedure will claim a set to be a mediating set if . We note that the original QUAD procedure offered modified methods that could handle large sets, with , a scenario not considered here.
2.5.4. Marginal Procedure (MARG)
This procedure, the first to be introduced here, is a set-level modification of the MCPs statistic 24. Importantly, for reasons discussed at the end of this section, we do not suggest that this overly-simplified approach controls the FWER and include it because it provides a reference for the maximum possible power that can be reasonably expected from a test. MARG is based on p-values and calculated from the unweighted pooled association test statistics . Define and so that they are the sets potentially associated with the exposure and outcome. Let and be the cardinality of each set (i.e. the number of elements in that set) and the marginal p-value be if and 1 otherwise. The MARG procedure will claim a set to be a mediating set if . However, the problem is that this procedure only marginally tests if the set is associated with both the exposure and the outcome; the procedure does not ensure that there is a common set of mediating biomarkers associated with both the exposure and the outcome (i.e. that is a true mediating set). We do note that MARG uses p-values from group tests without weights (i.e. and ) because the p-values and are not independent under the null hypothesis (see Proposition 1 in Supplemental Material) and therefore can occasionally have lower power than the TS method proposed below. We note that Huang proposed another marginal procedure, JTV-comp33, which again tests for sets associated with both the exposure and the outcome without ensuring that there are true mediators in that set. We offer comparisons with JTV-comp in Supplementary Material Section 4.6 and note that the method did tend to have higher statistical power, albeit with a slightly inflated type-I error rate.
2.5.5. Two-Step Procedure (TS)
This novel procedure is described in Figure 2. In parallel analyses, we identify biomarkers associated with the exposure and we identify biomarkers conditionally associated with the outcome. Note, in each of these analyses, we perform two steps: (i) a screening step to remove sets of biomarkers that are unlikely to be strongly associated with both the exposure and outcome, and (ii) a testing step that assigns individual p-values to each biomarker in the remaining sets. After these parallel analyses, we define the mediating sets to be those sets with biomarkers associated with both the exposure and outcome. Below we describe the links in Figure 2 for identifying the biomarkers associated with the exposure, while omitting the near identical descriptions for identifying biomarkers associated with the outcome.
Step 1: Define based on p-values from weighted and unweighted group test statistics. Here, we screen out those sets that are unlikely to be strongly associated with both the exposure and outcome. Note, this set differs from . Further note that choices of 0.025 and 0.1 can be modified, but we have found these thresholds to work well in practice for the FWER-corrected p-value of 0.05.
Step 2a: For each biomarker in one of the remaining sets, , we calculate a conditional p-value for association (i.e. conditioned on its set passing the screening step). Here, we define for where is the truncated normal distribution described in the Supplementary Material Section 1.3, and, for completeness, define for .
Step 2b: We then divide into two complementary sets of biomarkers, , where is the set of candidate biomarkers. Let be the number of biomarkers in .
Step 2c: For each biomarker in , we now calculate an adjusted conditional p-value, where the adjustment is needed to account for multiple testing. In its simplest form, the adjusted p-value would be However, we find it beneficial to decrease the multiple-testing penalty for those biomarkers strongly associated with the outcome (i.e. potentially true mediating biomarkers). Therefore, we define the adjusted p-value to be , , if and, for completeness, define for
Step 2d: After completing steps 2a-2c for both the exposure and the outcome, we can now define an adjusted p-value for mediation by for markers and, for completeness, define =1 for . We then say a set, , is a mediating set if . Let us formally define the FWER for the TS procedure by FWERTS . Then, in Supplementary Material Section 1.5, we prove the following theorem.
Theorem 1.
If , then
We note that the assumption, , is unlikely to hold in practice and violations of this assumption can lead to the TS procedure having an inflated type I error. Consider the example where E affects ) and a second biomarker ) also affects . Furthermore, of the two biomarkers, only affects the outcome. Then, may be mistakenly classified as a mediating set. In Supplementary Material Section 3.1, we offer simulations to show that the effect of on must be large for there to be an inflated type I error rate. We note that when , we can modify our approach so that its performance does not require this assumption (Supplementary Material Section 2).
We offer a couple of remarks. First, the initial screening uses a quadratic (i.e. ), as opposed to a linear (i.e. ), test statistic. This choice offers increased power when the proportion of associated biomarkers is low or biomarkers within a set have opposing effects. Second, the initial screening steps select sets using one weighted and one unweighted test statistic. Ideally, we would have used two weighted statistics, but that would greatly complicate post-selection inference because of the dependence between (see Supplemental Material Section 1.2). Post-selection inference in the second step of TS allows us to quantify mediating effects (i.e. ) for detected potential mediators (see Supplemental Material Section 1.4).
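The final decision rule of the TS procedure can be sketched as follows. The set-level p-value is the minimum of the biomarker-level p-values, as stated in the Introduction. Two details are assumptions made for illustration: that a biomarker's mediation p-value is the maximum of its adjusted exposure and outcome p-values, and that a set is declared mediating when its p-value is at most 0.05. Biomarkers that did not pass screening are assigned a p-value of 1, following the "for completeness" convention in Steps 2a-2d. All names and numbers below are hypothetical.

```python
def set_level_p_values(p_exposure, p_outcome, sets):
    """Combine adjusted biomarker p-values into one p-value per set.

    p_exposure, p_outcome: dict mapping biomarker -> adjusted p-value;
        biomarkers absent from a dict are treated as having p = 1.
    sets: dict mapping set name -> list of biomarker names.
    Assumption: the per-biomarker mediation p-value is the maximum of the
    two adjusted p-values; the set-level p-value is the minimum over members.
    """
    set_p = {}
    for name, members in sets.items():
        per_marker = [
            max(p_exposure.get(j, 1.0), p_outcome.get(j, 1.0)) for j in members
        ]
        set_p[name] = min(per_marker) if per_marker else 1.0
    return set_p


def mediating_sets(set_p, alpha=0.05):
    """Names of sets whose set-level p-value is at most alpha (assumed rule)."""
    return sorted(name for name, p in set_p.items() if p <= alpha)


if __name__ == "__main__":
    sets = {"set_a": ["m1", "m2"], "set_b": ["m3", "m4"]}
    p_exposure = {"m1": 0.001, "m3": 0.002}
    p_outcome = {"m1": 0.01, "m4": 0.003}
    print(mediating_sets(set_level_p_values(p_exposure, p_outcome, sets)))
```

In this toy input, `set_b` contains one biomarker linked to the exposure and a different one linked to the outcome, so it is not declared a mediating set. That is the situation a purely marginal set-level test cannot distinguish from true mediation.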
2.6. Simulations
We compared the performance of the five previously defined procedures (MIN, LIN, QUAD, MARG, TS) for testing sets of biomarkers. Specifically, we used simulations to estimate the power to detect a mediating set of biomarkers in various scenarios defined by equations (3–5) and the parameters in Table 1.
Table 1:
| Parameter | Interpretation | Possible Values |
|---|---|---|
| | Number of sets of biomarkers | 15, 20, 50 |
| | Number of one- or two-dimensional sets | 0, 4, 6 |
| | Number of biomarkers per set | 15, 20, 50 |
| | Number of mediating biomarkers in the mediating set | 1, 3, 5 |
| | Number of noise biomarkers in the mediating set associated only with exposure | 0, 2, 4 |
| | Number of noise biomarkers in the mediating set associated only with outcome | 0, 2, 4 |
| | Number of noise biomarkers in the mediating set with half associated only with the exposure and half associated only with the outcome | 0, 2, 4 |
| | Number of associated biomarkers in one- or two-dimensional sets | 6 |
| | Correlation between biomarkers within a set | 0, 0.25, 0.5 |
| | Non-null effect of exposure on metabolite | 0.065 (A), 0.045 (B) |
| | Non-null effect of metabolite on outcome | 0.065 (A), 0.085 (B) |

(A) Models with a continuous outcome
(B) Models with a binary outcome
Bold values indicate default settings.
We assumed there were a total of sets of biomarkers, each containing biomarkers.
We assumed that there was mediating set, one-dimensional sets, and two-dimensional sets, where we say a set is one-dimensional if it contains biomarkers associated with only the exposure or only the outcome and two-dimensional if it contains biomarkers associated with only the exposure and biomarkers associated with only the outcome. For the one mediating set, we assumed there were true mediators, “noise” biomarkers associated with only the exposure, “noise” biomarkers associated with only the outcome, and . In Supplemental Figure S1, we provide additional details on the simulation model.
Unless otherwise designated, we used the default parameters shown in bold for the simulations.
In all scenarios, the exposure followed a normal distribution defined by and the biomarkers followed a multivariate normal distribution defined by equation 3. The correlation (i.e. off-diagonal element in ) for metabolites within a set was chosen to have AR(1) structure (i.e. ) with . For continuous outcomes, the sample contained a total of individuals and the outcome followed the normal distribution defined by equation 4. For non-null associations, we let the magnitudes of all effects be the same with . For a binary outcome, the sample contained cases and controls retrospectively sampled from a large cohort with outcomes generated from a logistic model and the incidence defined by , the non-null exposure effects defined by and the non-null outcome effects defined by . The choice of effect sizes ensured similar power when testing continuous and binary outcomes. We generated 1000 simulations per scenario to estimate power of five methods at a
We also conducted additional sensitivity analyses. We explore the effect of confounding in Supplementary Material Section 3.1. Specifically, we consider the scenario where there are no mediators, but there are confounders that link the exposure, biomarkers, and outcome. We also explore the methods under a wider set of parameter values in Supplementary Material Section 3.2. Specifically, we further explore the power when varying the proportion of biomarkers in a set that are mediators and the strength of the biomarkers’ combined association with both the exposure and outcome.
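The within-set AR(1) correlation used in the simulations can be illustrated with a short helper. This is a sketch under the usual AR(1) convention that the correlation between biomarkers i and j in the same set is rho raised to the power |i - j|; the function name `ar1_correlation` is illustrative.

```python
def ar1_correlation(size, rho):
    """Return a size x size AR(1) correlation matrix as nested lists.

    Entry (i, j) equals rho ** |i - j|, so the diagonal is 1 and the
    correlation decays geometrically with distance between indices.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(size)] for i in range(size)]


if __name__ == "__main__":
    # rho = 0.25 is one of the correlation values listed in Table 1.
    for row in ar1_correlation(4, 0.25):
        print(row)
```

Setting rho to 0 recovers independent biomarkers within a set, which corresponds to the lowest correlation setting in Table 1.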
2.7. Metabolomic Study of Breast Cancer
Our motivating study aims to identify metabolites that mediate the known relationship between high BMI and the increased risk of estrogen-receptor positive (ER+) breast cancer. This study, nested inside the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Study (PLCO), includes 410 ER+ breast cancer cases and 433 controls matched on study age (+/− 2 years), date of blood collection (+/− 3 months), and hormone therapy use at baseline. The study collected serum samples at the first follow-up visit, one year post-baseline, and, using these specimens, measured the serum metabolites (< 1 Kilodalton molecular weight) with liquid chromatography-tandem mass-spectrometry. Metabolite peaks were normalized by dividing by the batch median and then log transformed. Of the 1057 measured serum metabolites, we consider the 481 metabolites that have known identities and are present in at least 90% of the case-control population. These 481 metabolites can be divided into 38 disjoint sets defined by their biologic properties 34. Details on the study have been previously published 26.
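The preprocessing described above (batch-median normalization, log transformation, and the 90% presence filter) can be sketched as follows. Several details are assumptions for illustration: the natural logarithm is used because the base is not stated, undetected values are represented by `None`, and the function names are hypothetical.

```python
import math
from statistics import median


def normalize_peaks(peaks_by_batch):
    """Divide each peak by its batch median, then log transform.

    peaks_by_batch: dict mapping batch id -> list of positive peak values
        for one metabolite, with None marking undetected values (assumed).
    Assumption: natural log; the source does not state the base.
    """
    normalized = {}
    for batch, peaks in peaks_by_batch.items():
        observed = [x for x in peaks if x is not None]
        batch_median = median(observed)
        normalized[batch] = [
            None if x is None else math.log(x / batch_median) for x in peaks
        ]
    return normalized


def keep_metabolite(values, min_fraction=0.9):
    """True if the metabolite is detected in at least min_fraction of samples."""
    detected = sum(1 for x in values if x is not None)
    return detected / len(values) >= min_fraction


if __name__ == "__main__":
    # Hypothetical peak intensities for one metabolite in two batches.
    print(normalize_peaks({"batch1": [1.0, 2.0, 4.0], "batch2": [10.0, None, 30.0]}))
```

After this transformation a value of 0 means the peak equals its batch median, which puts batches with different overall intensity on a common scale.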
3. Results
3.1. Simulations
In general, TS successfully controlled the FWER at the targeted level (0.05). The MIN, LIN, and QUAD procedures similarly controlled the FWER. However, we note that all four procedures tend to have conservative FWERs when most biomarkers are associated with neither the exposure nor the outcome. The conservative nature of the tests observed here is consistent with previous observations when testing individual biomarkers21. As expected, the MARG procedure had inflated FWER when the number of two-dimensional sets was greater than 0 (Supplemental Figures S3F and S8F). Again, this inflated type I error results from falsely declaring a set as a mediating set if it is marginally associated with both the exposure and outcome, regardless of whether true mediating biomarkers are present. Additionally, TS, MIN, LIN and QUAD were relatively robust to the presence of unmeasured confounders affecting the biomarkers, exposure and outcome (see Supplemental Figures S3-S12).
The simulations demonstrate that the newly proposed TS procedure has comparable or better performance characteristics than the competing methods in all tested scenarios. Note, we show the results for only a selection of illustrative scenarios in the main text and show additional results in the supplementary material. We focus on a baseline scenario with disjoint sets of size mediating set, one-dimensional sets, two-dimensional sets, mediating biomarkers, and noise-biomarkers, and a correlation of . We then vary individual parameters to assess their specific impact on the relative performance of the five procedures (Figures 3, 4 and Supplemental Figures S13-S22).
The MARG procedure unexpectedly did not have the highest power under all settings. The main disadvantage of MARG is that it uses suboptimal statistics and to detect associations between the group of biomarkers and either the exposure or outcome. Nevertheless, MARG did have the highest power in most settings, and performed significantly better when the number of associated markers, , in the mediating set was large (Figures 3D and 3G-3I and Figures 4D and 4G-4I). JTV-comp 33, a similar procedure to MARG, also had high, if not the highest, power when included in simulations (see Supplemental Figure S21). However, this procedure had an inflated type I error not only when the number of two-dimensional sets was greater than 0 but also, in some scenarios, when the number of one-dimensional sets was greater than 0 (see Supplemental Figure S22).
QUAD generally had the lowest power to detect the mediating set across our simulations. The power tends to be low because the parametric bootstrap procedure 10 is known to be overly conservative. LIN generally has power only slightly lower than TS. However, LIN has higher power when either the number of mediators in a set is large (see Figures 3D and 4D for the setting with and Supplemental Figures S15 and S16) or the number of test sets is small (see Figures 3A and 4A for the setting with ). The power for LIN was significantly affected by the number of noise biomarkers in a mediating set (Figures 3G-3I and 4G-4I) because large values of and increase the variance in the denominator of the statistic. We note that we did not evaluate LIN in examples where the biomarkers in the mediating set have opposing effects (i.e. ), as LIN would clearly have little to no power in this potentially unrealistic scenario.
The MIN procedure had reasonable power, achieving more than 80% of the power of the TS procedure for the baseline scenario. Moreover, the MIN procedure had the highest power when there was only one mediating biomarker and 20 disjoint sets (Figures 3D and 4D). However, even when there was only a single mediating biomarker, increasing the number of sets and, therefore, biomarkers, decreased the difference in power achieved by the MIN and TS procedures (see Supplemental Figures S13 and S14).
The final observation is that the TS procedure had consistently higher power than the other methods for the realistic scenarios evaluated here. We note that changing the number of one- and two-dimensional sets did not have a meaningful impact on the overall or relative performance of the TS procedure (Figures 3E-3F, 4E-4F). We note that increasing the number of noise biomarkers resulted in a slight loss of power for the TS procedure but had no effect on the MIN procedure (Figures 3D-3I, 4D-4I). In contrast to other tests, increasing the total number of sets had minimal effect on the TS procedure, but greatly reduced the power of the MIN, LIN and QUAD procedures. In the supplementary material, we compared the four procedures in a wider set of scenarios that have been more rarely observed in practice. We note that when the proportion of mediating biomarkers () was large or close to 1, LIN and QUAD did have higher power (see Supplemental Figures S17 and S18). However, we also note that when the proportion was low or the number of sets was large, the TS procedure had notably higher power. Specifically, when there were 100 disjoint sets and , the TS procedure had power above 80% while the other procedures had power below 40% (see Supplemental Figures S13 and S14). As expected, the power for all procedures, including TS, is significantly improved by increasing the number of mediating biomarkers in the set and/or reducing the correlation between biomarkers (Figures 3C, 4C). We also saw similar results when the overall effect of mediators increased (i.e. 0.1% to 2% of phenotypic variation) while the number of mediators remained constant (see Supplemental Figures S19 and S20). We note that for our TS procedure, which is based on the minimum conditional p-value, this increase in power can be mainly attributed to the increased probability that the set is selected in the first step.
Increased correlation reduces the power for all tests because increases significantly when the regression for the outcome includes highly correlated biomarkers. Moreover, for the TS approach, the p-values from post-selection inference tend to be larger because, in theory, an association with an outcome may be caused by the association with another mediating biomarker. Finally, compared to MIN, the TS procedure identified a larger number of true mediators (see Supplemental Figures S23 and S24), consistent with the overall improvement in power (see Figures 3 and 4).
3.2. Breast Cancer Study
We tested the 38 sets of metabolites to determine if any mediated the relationship between BMI and risk of breast cancer. We observed that TS selected three pathways associated with the exposure and seven pathways associated with the outcome at step one, but only one pathway, Lysine metabolism, was identified as a mediating set with adjusted p-value . TS identified that the specific mediating biomarker, 3-Methylglutarylcarnitine-1, drove the association. LIN and MIN also detected the same pathway, with adjusted p-values of 0.0001 and 0.041, respectively. Associations detected by LIN, TS and MIN were largely driven by the single biomarker 3-Methylglutarylcarnitine-1 (see Table 2). On the other hand, MARG discovered a different pathway, Sterol/steroid (), that contains several sex hormones, suggesting that non-overlapping sets of hormones may be associated with the exposure and outcome. Lastly, QUAD did not identify any pathways that are potential mediators. However, its lowest adjusted p-value, 0.18, was for the Lysine metabolism group. In Table 2, we present the metabolites in the pathway discovered by LIN, TS and MIN, and we present similar results for the Sterol/steroid pathway in the Supplementary material (Table S2).
Table 2:
Metabolite | P-value for association with risk of breast cancer | P-value for association with BMI |
---|---|---|
2-Aminoadipate | 0.43 | 5.3 × 10⁻⁴ |
3-Methylglutarylcarnitine-1* | 9.5 × 10⁻⁵ | 2.0 × 10⁻⁸ |
3-Methylglutarylcarnitine-2 | 0.02 | 0.31 |
Glutarate pentanedioate | 0.82 | 0.96 |
Lysine | 0.35 | 0.25 |
N6-Trimethyl-L-lysine | 0.13 | 0.02 |
N2-Acetyl-L-lysine | 0.28 | 0.09 |
N6-Acetyl-L-lysine | 0.17 | 0.55 |
Pipecolate | 0.64 | 0.26 |
3-Methylglutaconate | 0.62 | 0.82 |
*Metabolite discovered by TS and MIN to mediate the effect of BMI
4. Discussion
We introduced a new procedure, TS, to test if sets of biomarkers mediate the relationship between an exposure and an outcome. The TS procedure is computationally efficient, controls the family-wise error rate (FWER), and has high statistical power for detecting potentially mediating sets of biomarkers. Additionally, TS identifies individual biomarkers that are potential mediators. The strength of the method comes from the first, screening, step that removes all sets that are unlikely to be strongly associated with both the exposure and the outcome. Compared with standard association tests, this screening step removes a significantly larger number of sets (e.g. 100 × (1 − 0.025 × 0.1) = 99.75% of sets removed, compared to 100 × (1 − 0.025) = 97.5% of sets removed). The statistical complication, which is solved and discussed in the supplementary material, is to calculate the adjusted, conditional p-values based on this screening approach. In addition to higher power, the added benefit of our approach is that it identifies the individual mediating biomarkers and allows practitioners to make more precise claims about the underlying biology by measuring effects mediated through biomarkers. In the remainder of this section, we focus on providing insight into the trends observed in our simulations, potential extensions to the TS procedure, and a discussion of the benefits of group testing.
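The screening percentages quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a null set yields independent, uniformly distributed p-values in the two screening tests, so that the chance of passing both thresholds is their product; the function name `fraction_removed` and the assignment of 0.025 and 0.1 to particular tests are illustrative assumptions, not details taken from the text.

```python
def fraction_removed(first_threshold, second_threshold=None):
    """Expected fraction of null sets removed by screening.

    Assumption for illustration: p-values of a null set are uniform and,
    when two thresholds are given, independent across the two tests.
    """
    pass_probability = first_threshold
    if second_threshold is not None:
        pass_probability *= second_threshold
    return 1.0 - pass_probability


double_screen = fraction_removed(0.025, 0.1)  # must pass both thresholds
single_screen = fraction_removed(0.025)       # one standard association test

print(f"screening on both tests removes {100 * double_screen:.2f}% of null sets")
print(f"screening on one test removes   {100 * single_screen:.2f}% of null sets")
```

Under these assumptions, ten times fewer null sets survive to the second step, which is why the multiplicity burden on the post-selection p-values is lighter.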
One observation is that the QUAD and LIN procedures had lower power in many tested scenarios. Here, we offer some comments about those tests. First, and importantly, we note that these procedures were not designed for scenarios where only a small subset of the biomarkers in the mediating set qualify as actual mediators. Second, despite having lower power, these procedures still have the advantage of offering a set-level p-value. In contrast, the TS procedure offers biomarker-level p-values for elements of the set, and then uses the minimum biomarker-level p-value as the set-level p-value. Third, the test statistic used in the QUAD procedure is similar to the statistics used in the first step of the TS procedure, and therefore the similarity in the procedures' statistics needs to be reconciled with the dissimilarity in their performance. For the null distribution, the QUAD procedure assumes that both of the estimated vectors, and , follow multivariate normal distributions with non-zero means, while the TS procedure assumes one vector is a group of fixed weights and the other vector follows a multivariate normal distribution with zero mean. As further illustrated by an example in the Supplementary material of the original paper 10, the variance of under its associated null is an order of magnitude larger than the variance of or under their associated nulls. Fourth, when testing a set of biomarkers, the quadratic versions of the test statistics (see corresponding references 30,31) generally have higher statistical power than their linear counterparts when only a subset of biomarkers are true positives, regardless of the sign of the effect. A second observation, which applies to all procedures, is that testing groups will only have higher power, as compared to testing individual biomarkers, when the groupings combine mediators together. Although we did not state it explicitly, our examples showed the cost of forming the groups randomly.
In this extreme scenario, a set would likely have at most one mediating biomarker (i.e. ). As Figures 3D, 4D, S17, and S18 illustrate, the power of all group tests is noticeably lower than the power of MIN, where we recall that MIN is equivalent to testing each biomarker individually. A third observation is that TS and MIN had low power when the majority of biomarkers in a set each had a small mediating effect. The last point, which we did not show by simulation, is that in the presence of alternating effects, where can be both positive and negative, LIN will clearly have lower power to detect mediating sets.
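The point about alternating effects can be illustrated with a small simulation. The sketch below does not implement the LIN or QUAD mediation statistics themselves; it assumes a generic set of independent, unit-variance z-scores and compares a linear (sum) statistic with a quadratic (sum of squares) statistic, with critical values calibrated empirically under the null. The function name `empirical_power` and the effect sizes are hypothetical choices made for illustration.

```python
import math
import random


def empirical_power(effects, n_rep=5000, alpha=0.05, seed=0):
    """Estimate the power of a linear and a quadratic set-level statistic.

    Assumption for illustration: each biomarker contributes an independent
    z-score with unit variance and mean given by `effects`.
    """
    rng = random.Random(seed)
    k = len(effects)

    def draw(means):
        z = [rng.gauss(m, 1.0) for m in means]
        linear = abs(sum(z)) / math.sqrt(k)   # effects of opposite sign cancel
        quadratic = sum(v * v for v in z)     # the sign of each effect is ignored
        return linear, quadratic

    null_draws = [draw([0.0] * k) for _ in range(n_rep)]
    alt_draws = [draw(effects) for _ in range(n_rep)]

    # Empirical (1 - alpha) quantiles of the null distributions.
    idx = int((1 - alpha) * n_rep)
    crit_linear = sorted(s[0] for s in null_draws)[idx]
    crit_quadratic = sorted(s[1] for s in null_draws)[idx]

    power_linear = sum(s[0] > crit_linear for s in alt_draws) / n_rep
    power_quadratic = sum(s[1] > crit_quadratic for s in alt_draws) / n_rep
    return power_linear, power_quadratic


for label, effects in [("same sign", [2.0, 2.0, 2.0, 2.0]),
                       ("alternating", [2.0, -2.0, 2.0, -2.0])]:
    lin, quad = empirical_power(effects)
    print(f"{label:12s} linear power = {lin:.2f}, quadratic power = {quad:.2f}")
```

In the alternating configuration the effects sum to zero, so the linear statistic has the same distribution as under the null and its power stays near the nominal level, while the quadratic statistic is unaffected by the signs.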
There are potential modifications to the TS procedure. First, as remarked previously, in the initial step of the TS method, we used a combination of weighted statistics, and unweighted statistics, and , to test the marginal null. A more powerful approach would be to select candidate sets by weighted tests only (i.e. ). However, in the second step of our new procedure, the adjustment of p-values would need to take into consideration selection by both of these test statistics, a complication that requires further thought. A second modification of TS would be to calculate the adjusted p-values under the global null 28. Such a test can provide better power when the proportion of mediators in a set is very low. A third modification is to allow one biomarker to appear in multiple sets. However, such a modification is not straightforward, as the assumptions in Theorem 1 would clearly fail to hold. Another potential modification is to allow exposure/biomarker interactions in the outcome model. As a final note, we refer to each selected set as a potentially mediating set, as opposed to simply referring to it as a mediating set. We note that the test for an association between a set of biomarkers and the outcome ignores biomarkers from all other sets. Therefore, for these association tests, the other biomarkers are “unmeasured” confounders that could potentially bias our findings. Note that, for large m, such bias is difficult to avoid because models cannot easily include all biomarkers.
We expect there to be growing interest in testing whether groups of biomarkers are mediators. In some scenarios, we might expect the exposure to affect an underlying, latent process that affects both the individual biomarkers and the outcome22. In other scenarios, we might expect the exposure to directly affect the biomarkers and the biomarkers to directly influence the outcome8,10,22,33. Recently, for example, Chen and colleagues9 investigated whether a thermal stimulus excited a region of the brain (e.g. sets of fMRI voxels in common areas), which in turn affected the reaction. Huang33 recently investigated whether smoking affected methylation levels in a gene (e.g. sets of probes linked to a common gene), which in turn affected cancer risk. As another example, a study8 investigated whether high intake of fish, as measured by a questionnaire, influenced serum levels of sets of metabolites (e.g. sets that were associated with consumption of specific fish), which in turn were associated with a reduced risk of colorectal cancer.
Data availability statement
The data and code have been submitted to the journal to be made openly available.
5. References
- 1. Steen J, Loeys T, Moerkerke B, Vansteelandt S. Flexible Mediation Analysis With Multiple Mediators. Am J Epidemiol. 2017;186(2):184–193.
- 2. VanderWeele TJ, Vansteelandt S. Mediation Analysis with Multiple Mediators. Epidemiol Methods. 2014;2(1):95–115.
- 3. Pearl J. Interpretation and identification of causal mediation. Psychol Methods. 2014;19(4):459.
- 4. Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–155.
- 5. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173–1182.
- 6. MacKinnon DP. Introduction to statistical mediation analysis. New York: Lawrence Erlbaum Associates; 2008.
- 7. Imai K, Keele L, Tingley D. A general approach to causal mediation analysis. Psychol Methods. 2010;15(4):309–334.
- 8. Boca SM, Sinha R, Cross AJ, Moore SC, Sampson JN. Testing multiple biological mediators simultaneously. Bioinformatics. 2014;30(2):214–220.
- 9. Chen OY, Crainiceanu C, Ogburn EL, Caffo BS, Wager TD, Lindquist MA. High-dimensional multivariate mediation with application to neuroimaging data. Biostatistics. 2018;19(2):121–136.
- 10. Huang YT, Pan WC. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics. 2016;72(2):402–413.
- 11. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–15550.
- 12. Flannick J, Mercader JM, Fuchsberger C, et al. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls. Nature. 2019;570(7759):71–76.
- 13. Vidmar L, Maver A, Drulovic J, et al. Multiple Sclerosis patients carry an increased burden of exceedingly rare genetic variants in the inflammasome regulatory genes. Sci Rep. 2019;9(1):9171.
- 14. Saunders EJ, Dadaev T, Leongamornlert DA, et al. Gene and pathway level analyses of germline DNA-repair gene variants and prostate cancer susceptibility using the iCOGS-genotyping array. Br J Cancer. 2018;118(6):e9.
- 15. Fransen E, Bonneux S, Corneveaux JJ, et al. Genome-wide association analysis demonstrates the highly polygenic character of age-related hearing impairment. Eur J Hum Genet. 2015;23(1):110–115.
- 16. Wu C, Delano DL, Mitro N, et al. Gene set enrichment in eQTL data identifies novel annotations and pathway regulators. PLoS Genet. 2008;4(5):e1000070.
- 17. Huang J, Weinstein SJ, Moore SC, et al. Pre-diagnostic Serum Metabolomic Profiling of Prostate Cancer Survival. J Gerontol A Biol Sci Med Sci. 2019;74(6):853–859.
- 18. Moutsianas L, Agarwala V, Fuchsberger C, et al. The Power of Gene-Based Rare Variant Methods to Detect Disease-Associated Variation and Test Hypotheses About Complex Disease. PLoS Genet. 2015;11(4):e1005165.
- 19. Derkach A, Lawless JF, Merico D, Paterson AD, Sun L. Evaluation of gene-based association tests for analyzing rare variants using Genetic Analysis Workshop 18 data. BMC Proc. 2014;8(Suppl 1):S9.
- 20. Derkach A, Zhang H, Chatterjee N. Power Analysis for Genetic Association Test (PAGEANT) provides insights to challenges for rare variant association studies. Bioinformatics. 2018;34(9):1506–1513.
- 21. Barfield R, Shen J, Just AC, et al. Testing for the indirect effect under the null for genome-wide mediation analyses. Genet Epidemiol. 2017;41(8):824–833.
- 22. Derkach A, Pfeiffer RM, Chen TH, Sampson JN. High dimensional mediation analysis with latent variables. Biometrics. 2019;75(3):745–756.
- 23. Chakrabortty A, Nandy P, Li H. Inference for Individual Mediation Effects and Interventional Effects in Sparse High-Dimensional Causal Graphical Models. arXiv preprint arXiv:10652. 2018.
- 24. Sampson JN, Boca SM, Moore SC, Heller R. FWER and FDR control when testing multiple mediators. Bioinformatics. 2018;34(14):2418–2424.
- 25. Huang YT. Joint significance tests for mediation effects of socioeconomic adversity on adiposity via epigenetics. Ann Appl Stat. 2018;12(3):1535–1557.
- 26. Moore SC, Playdon MC, Sampson JN, et al. A Metabolomics Analysis of Body Mass Index and Postmenopausal Breast Cancer Risk. J Natl Cancer Inst. 2018;110(6):588–597.
- 27. Heller R, Chatterjee N, Krieger A, Shi J. Post-Selection Inference Following Aggregate Level Hypothesis Testing in Large-Scale Genomic Data. J Am Stat Assoc. 2018:1–14.
- 28. Heller R, Meir A, Chatterjee N. Post-selection estimation and testing following aggregated association tests. arXiv preprint arXiv:00497. 2017.
- 29. Lee JD, Sun DL, Sun Y, Taylor JE. Exact post-selection inference, with application to the lasso. Ann Stat. 2016;44(3):907–927.
- 30. Derkach A, Lawless JF, Sun L. Pooled Association Tests for Rare Genetic Variants: A Review and Some New Results. Statist Sci. 2014;29(2):302–321.
- 31. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
- 32. Imai K, Keele L, Yamamoto T. Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statist Sci. 2010;25(1):51–71.
- 33. Huang YT. Variance component tests of multivariate mediation effects under composite null hypotheses. Biometrics. 2019.
- 34. Derkach A, Sampson J, Joseph J, Playdon MC, Stolzenberg-Solomon RZ. Effects of dietary sodium on metabolites: the Dietary Approaches to Stop Hypertension (DASH)-Sodium Feeding Study. Am J Clin Nutr. 2017;106(4):1131–1141.