PLOS Genetics. 2024 Apr 22;20(4):e1011246. doi: 10.1371/journal.pgen.1011246

Collider bias correction for multiple covariates in GWAS using robust multivariable Mendelian randomization

Peiyao Wang 1, Zhaotong Lin 1,2, Haoran Xue 1,3, Wei Pan 1,*
Editor: Xiang Zhou 4
PMCID: PMC11065275  PMID: 38648211

Abstract

Genome-wide association studies (GWAS) have identified many genetic loci associated with complex traits and diseases in the past 20 years. Multiple heritable covariates may be added to GWAS regression models to estimate direct effects of genetic variants on a focal trait, or to improve power by accounting for environmental effects and other sources of trait variation. When one or more covariates are causally affected by both genetic variants and hidden confounders, adjusting for them in GWAS will produce biased estimates of SNP effects, known as collider bias. Several approaches have been developed to correct collider bias by estimating the bias with Mendelian randomization (MR). However, these methods handle only one covariate, and some of them rely on MR methods with relatively strong assumptions; both restrictions can be limiting in practice. In this paper, we extend the bias-correction approaches in two respects: first, we derive an analytical expression for the collider bias in the presence of multiple covariates; second, we propose estimating the bias using a robust multivariable MR (MVMR) method based on constrained maximum likelihood (called MVMR-cML), which allows for invalid instrumental variables (IVs) and correlated pleiotropy. We also establish the consistency and asymptotic normality of the new bias-corrected estimator. We conducted simulations to show that all methods mitigated collider bias under various scenarios. In real data analyses, we applied the methods to two GWAS examples, the first a GWAS of waist-hip ratio with adjustment for only one covariate, body-mass index (BMI), and the second a GWAS of BMI adjusting for metabolomic principal components as multiple covariates, illustrating the effectiveness of bias correction.

Author summary

Genome-wide association studies (GWAS) are powerful in identifying genetic variants influencing complex traits and diseases. However, adjusting for heritable covariates in GWAS may introduce collider bias when both genetic variants and confounders may causally influence these covariates. In this study, for the first time we derived the analytical form of the bias term in GWAS with multiple covariates, enabling bias estimation and correction using any MVMR method. On the other hand, many existing MVMR methods may not be robust to invalid IVs and are designed for independent samples. Since GWAS data of multiple traits are needed, overlapping samples become inevitable. Hence, while investigating the performance of many MVMR methods, we mainly adopt MVMR-cML, a novel MVMR approach robust to invalid IVs and sample overlap. Our simulations underscore that most MVMR methods effectively reduce collider bias across various scenarios. Furthermore, by accounting for correlations among GWAS statistics, as well as the linkage disequilibrium (LD) between the target SNP and IVs, we establish the consistency and asymptotic normality of the bias-corrected estimator based on MVMR-cML. The application of our bias-correction approach to two published GWAS data examples illustrates its utility and efficacy.

Introduction

Genome-wide association studies (GWAS) have revolutionized genetics of complex diseases and traits by identifying many novel associations [1]. This breakthrough has not only paved the way for the development of new therapeutics but has also enabled early prevention strategies [2]. A notable example is a mega-GWAS conducted in 2014, which revealed 108 loci linked to schizophrenia, providing valuable insights for the development of novel drugs [3]. Moreover, the publication of GWAS results has facilitated secondary genetic epidemiological analyses, such as Mendelian randomization (MR) [4]. Despite these advancements, some limitations of GWAS remain.

Trait-associated single nucleotide polymorphisms (SNPs) detected through GWAS can only account for a small to modest portion of trait variability [5]. To address this, researchers have incorporated additional covariates in GWAS regression models to reduce residual variation and thereby increase statistical power [5]. In addition, to distinguish total and direct effects of SNPs, conditional GWAS analysis has been proposed to adjust for some covariates [6]. In either case, a conditional SNP-trait association estimate is produced, which, however, may be biased [6, 7]: it is possible that both the SNP and unknown confounders causally affect one of the covariates, making it a so-called collider; including a collider in the analysis will induce conditional associations between SNPs and confounders that are otherwise truly independent. Furthermore, when the confounder is causal to both the trait and the covariate, this becomes problematic because it opens an indirect association pathway SNP → confounder → trait in addition to the direct association SNP → trait, biasing the conditional effect away from its true value and inducing so-called collider bias. The magnitude of the bias depends on the association between the trait and covariates, as well as the SNP effect on the covariates [6].

Collider bias may result in spurious associations and false positives [8]. It will also have a negative impact on downstream applications of GWAS results. For example, in Mendelian randomization (MR) analysis, biased SNP effect estimates can produce misleading causal estimates of the exposure on the outcomes [9].

Collider bias can also manifest in other scenarios, though not the focus of this paper. For instance, when analyzing disease progression using a case-only sample and conditioning on disease incidence, the presence of a shared confounder between disease incidence and progression can lead to spurious associations between SNPs associated with disease onset and that confounder [7, 10]. Consequently, the estimation of the direct SNP-on-progression effect will be biased by the indirect association path of SNP → confounder → disease prognosis [7].

Several approaches can be employed to correct collider bias [7, 11–13]. Under a simple model, a least squares approach reveals that the biased SNP-to-trait effect is equal to the true direct effect plus a bias term that is the product of the SNP-to-covariate effect and a slope [7]. Thus, estimating the bias term primarily involves approximating the slope. This estimation process resembles estimating the causal parameter of an exposure on an outcome using genetic variants as instrumental variables (IVs), for which various MR methods can be applied under certain IV assumptions [11]. Once a slope estimate is obtained, subtracting the bias term from the conditional effect estimate yields an (almost) unbiased estimate of the true direct effect. Previous studies have demonstrated that different MR approaches, such as inverse-variance weighted (IVW) regression and MR-Egger regression, can reduce type-I errors under certain conditions [11]. However, these approaches have limitations.

Firstly and most importantly, previous methods can only accommodate a single covariate and cannot effectively address collider bias induced by multiple covariates. In this paper, our primary objective is to mitigate bias resulting from the inclusion of multiple covariates in GWAS. Accordingly we propose application of multivariable Mendelian randomization (MVMR) methods to estimate the slope vector for multiple covariates being adjusted. Secondly, all MR methods impose more or less strong assumptions, especially on their requirements of valid IVs, to obtain a valid estimate of the slope vector [14]. Accordingly, we propose the use of multivariable MR constrained maximum likelihood (MVMR-cML), which is robust in the presence of invalid IVs as demonstrated before [14]. Consequently, we propose a bias correction approach utilizing MVMR-cML. In particular, here we prove the consistency and asymptotic normality of the collider bias-corrected estimator when MVMR-cML is applied, facilitating its valid use even in the presence of overlapping GWAS samples and when the SNP being tested and the SNPs being used as IVs in MVMR are in linkage disequilibrium (LD). Our proposed method with MVMR-cML can be applied to 1-sample, 2-sample and overlapping-sample settings with GWAS summary data. Nevertheless, for comparison, we also studied several other state-of-the-art MVMR methods.

The rest of the paper is structured as follows. Section 2 provides a comprehensive introduction to our bias-correction approach, outlining its key principles and methodology. In section 3, we assess the performance of the proposed approach combined with various MVMR methods across a range of scenarios through extensive simulations. The simulation results demonstrate that when a small number of covariates are adjusted, our approach effectively reduces collider bias while satisfactorily controlling type-I errors. However, as the number of covariates increases, the correction approach becomes less effective due to increased errors and uncertainties in parameter estimation. In section 4, we apply the methods to two UK Biobank GWAS datasets [5]. The first is a well-known example of a GWAS of waist-hip ratio (WHR) with adjustment for BMI as a single covariate, where our analysis indicated that collider bias had a minor impact on SNP-WHR association estimates. The second study utilized UK Biobank metabolomic data as multiple covariates to enhance the statistical power of a GWAS of BMI. Upon applying bias correction, we observed that many previously significant SNPs were no longer significant and had reduced effect size estimates, suggesting the likely presence of collider bias. More discussion is given in the final section.

Description of the method

Collider bias with multiple covariates

We consider a general problem of conditional association between a SNP (or any variable of interest) G and an outcome Y, conditional on a vector of (quantitative) covariates X. This type of conditional analysis is often performed when investigating the direct effect of G on Y through pathways not mediated via X [11], or when X can explain a large proportion of the total variation of Y to boost statistical power [5]. For the purpose of presentation, we first consider causal models, then at the end generalize the results to association analysis. We adopt the same terminology as used for collider bias in previous research [7, 11].

We first consider causal relationships among the variables in a simpler scenario with two covariates $X_1$ and $X_2$, as illustrated in Fig 1. Although we do not specify any causal relationship between covariates $X_1$ and $X_2$, they are allowed to be correlated. The total effect of $G$ on $Y$, denoted by $\beta_{GY}^{\mathrm{Total}}$, consists of the direct effect $\beta_{GY}$ and the indirect effect mediated through $X$, represented by $\beta_{GX}^T \beta_{XY} = \beta_{GX_1}\beta_{X_1Y} + \beta_{GX_2}\beta_{X_2Y}$. To estimate the direct effect of $G$ on $Y$, one would condition on $X$. When there is no confounder present ($\beta_{UY} = 0$ and/or $\beta_{UX} = (\beta_{UX_1}, \beta_{UX_2})^T = 0$), or when $X$ and $G$ are uncorrelated, conditioning on $X$ provides unbiased estimation of $\beta_{GY}$. However, when $X$ lies on the causal pathway from both $G$ and $U$, it becomes a collider. Consequently, conditioning on $X$ leads to an association between $G$ and $U$, resulting in an indirect association between $G$ and $Y$ through the path $G \to U \to Y$. This indirect association biases the estimation of $\beta_{GY}$ [7].

Fig 1. Directed acyclic graph for outcome Y, SNP G, covariates X1 and X2, and confounder U.

Conditioning on colliders $X_1$ and $X_2$ induces a conditional association between $G$ and $U$, represented by the dashed line. This creates an indirect association between $G$ and $Y$ through the path $G \to U \to Y$, which will bias the estimation of the direct effect $\beta_{GY}$.

Throughout the paper, we assume that SNP $G$ and confounder $U$ are independent in the population [7], and that confounders of $G$ and $U$, such as population structure, are not present (or can be suitably adjusted for). With GWAS summary statistics for $\beta_{GY}^{\mathrm{Total}}$ and $\beta_{GX}$, the direct effect $\beta_{GY}$ can be estimated using multitrait conditional/joint analysis (mtCOJO) [15], a previously proposed approach to estimate the SNP effects on $Y$ conditioning on multiple covariates $X_1, \cdots, X_p$. Specifically, the direct effect can be estimated as $\hat\beta_{GY} = \hat\beta_{GY}^{\mathrm{Total}} - \hat\beta_{GX}^T \beta_{XY}$, where $\beta_{XY}$ represents the effects of the covariates on the outcome when all covariates are jointly fitted, and can be transformed to $\beta_{XY} = D^{1/2} R^{-1} D^{1/2} d_{XY}$, where $D$ is a $p \times p$ diagonal matrix containing the SNP-based heritability of the covariates, and $R$ is a $p \times p$ matrix of genetic correlations of $X_1, \cdots, X_p$. Using GWAS summary data, $D$ and $R$ are estimated by linkage disequilibrium score (LDSC) regression [16, 17]. The vector $d_{XY}$ contains the marginal effects of the covariates on $Y$ and can be estimated using MR. For more details about the mtCOJO approach, see Zhu et al. (2018) [15]. However, if only collider-biased conditional effect estimates are provided, mtCOJO cannot be employed to correct these effect estimates towards unbiasedness [11]. Consequently, other bias-correction approaches are necessary.

Previous studies have primarily focused on scenarios where only one covariate is included in the analysis, leading to the development of bias-correction approaches utilizing univariable Mendelian randomization (UVMR) [7, 11, 12]. In this paper, we consider more general situations where multiple covariates are incorporated into a GWAS regression, some or all of which may introduce collider bias. Our proposed bias-correction approach is applicable to GWAS summary data (for both the outcome and covariates). The problem is formulated as follows.

Suppose we have a SNP denoted as $G_i$, a vector of covariates represented as $X$, an unmeasured confounder $U$, and an outcome/trait $Y$. The vector $X$ consists of two components: $J = (J_1, \cdots, J_{p_1})^T$ and $H = (H_1, \cdots, H_{p_2})^T$. The variables in $J$ are not associated with $G_i$ or $U$, but the elements in $H$ may be influenced by both the SNP and the confounder, therefore inducing collider bias if included [6]. Fig 2 provides an illustration of the collider-bias problem in a multivariable scenario. When conditioning on the colliders $H_1$ and $H_2$, they induce an indirect association path $G_i \to U \to Y$, which can bias the estimation of the direct effect $\beta_{G_iY}$. However, the covariates $J_1$ and $J_2$ are not associated with $G_i$ and $U$, and therefore do not introduce collider bias.

Fig 2. Directed acyclic graph for outcome Y, SNP Gi, covariates H and J, and confounder U.

Conditioning on the colliders $H_1$ and $H_2$ induces a conditional association between $G_i$ and $U$, represented by the dashed line. This creates an indirect association between $G_i$ and $Y$ through the path $G_i \to U \to Y$, which will bias the estimation of the direct effect $\beta_{G_iY}$. $J_1$ and $J_2$ are not associated with $G_i$ and $U$, and therefore do not introduce bias.

We assume the following true causal model for SNP Gi, which extends previous studies to a more general multivariable scenario [7]:

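$$H = \beta_{G_iH} G_i + B_{VH} V + \beta_{UH} U + E_H, \tag{1}$$
$$Y = \beta_{G_iY} G_i + \beta_{XY}^T X + \beta_{UY} U + E_Y \tag{2}$$
$$\phantom{Y} = \beta_{G_iY} G_i + \beta_{JY}^T J + \beta_{HY}^T H + \beta_{UY} U + E_Y. \tag{3}$$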

In practice, one may include different covariates in the GWAS regression of $H$; we denote these covariates as $V = (v_1, \cdots, v_q)^T$. To establish our analytical expression of the bias term, we need to assume that $V$ is uncorrelated with $U$ and the SNP, so that the effect estimates $\hat\beta_{G_iH}$ are not biased. The matrix $B_{VH}$ consists of columns $\beta_{v_sH}$ representing the effects of $v_s$ on $H$. Additionally, $E_H$ and $E_Y$ represent the noise terms for $H$ and $Y$, respectively. For simplicity, we assume that $J$, $G_i$, $U$, $E_H$, and $E_Y$ are pairwise uncorrelated.

For each SNP $G_i$, the parameter of interest is its direct effect $\beta_{G_iY}$. However, since we do not observe the confounder $U$ in practice, we can only regress $Y$ on $G_i$ and $X$, and $U$ is absorbed into the error term $\epsilon$, which is thus correlated with $H$:

$$Y = \beta^C_{G_iY} G_i + \beta_{JY}^T J + \beta_{HY}^T H + \epsilon. \tag{4}$$

Consequently, the conditional effect $\beta^C_{G_iY}$ is biased for the true direct effect $\beta_{G_iY}$. In Section A in S1 Text, we derive that the bias term is the inner product of a vector $b$ and $\beta_{G_iH}$:

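$$\beta^C_{G_iY} = \beta_{G_iY} + b^T \beta_{G_iH}, \tag{5}$$
$$b = -\beta_{UY} \mathrm{Var}(U) M_{22} \beta_{UH}, \tag{6}$$
$$M_{22} = \left\{ \beta_{UH} \beta_{UH}^T \mathrm{Var}(U) + \mathrm{Cov}(E_H) + B_{VH} \mathrm{Cov}(V) B_{VH}^T - B_{VH} \mathrm{Cov}(J, V)^T \mathrm{Cov}(J)^{-1} \mathrm{Cov}(J, V) B_{VH}^T \right\}^{-1}. \tag{7}$$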

This new result extends a previous one [7], which covered only univariate (i.e. one-dimensional) $H$, to multivariate (i.e. multi-dimensional) $H$. As previously [7], it holds under the assumption that the effects of the confounders $\beta_{UH}$ and $\beta_{UY}$ remain constant across different SNPs, which is reasonable. In order to eliminate the bias term from $\beta^C_{G_iY}$, we need to estimate the slope vector $b$, which can be done via MR.
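As a numerical sanity check of Eqs (5)-(7), the following minimal sketch simulates the simplest setting (two colliders, no $V$ and no $J$, so that $M_{22}$ reduces to its first two terms) and compares the fitted conditional effect with the analytical value. All parameter values, the allele frequency and the sample size are hypothetical choices made for illustration only; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # individuals (illustrative)

# Hypothetical parameter values, chosen only for illustration
beta_GH = np.array([0.5, 0.3])   # SNP effects on H1, H2
beta_UH = np.array([0.5, 0.4])   # confounder effects on H1, H2
beta_HY = np.array([0.2, 0.1])   # covariate effects on Y
beta_GY, beta_UY = 0.1, 0.6      # direct SNP effect and confounder effect on Y
var_U = 1.0
cov_EH = np.eye(2)               # Cov(E_H)

# Simulate from Eqs (1) and (3) with no V and no J
G = rng.binomial(2, 0.3, n).astype(float)
G -= G.mean()                    # centred genotype
U = rng.normal(0.0, np.sqrt(var_U), n)
E_H = rng.normal(size=(n, 2))    # independent unit-variance noise, matching cov_EH
H = np.outer(G, beta_GH) + np.outer(U, beta_UH) + E_H
Y = beta_GY * G + H @ beta_HY + beta_UY * U + rng.normal(size=n)

# Analytical conditional effect from Eqs (5)-(7)
M22 = np.linalg.inv(np.outer(beta_UH, beta_UH) * var_U + cov_EH)
b = -beta_UY * var_U * (M22 @ beta_UH)
beta_C_theory = beta_GY + b @ beta_GH

# Empirical conditional effect: regress Y on (1, G, H), as in Eq (4)
design = np.column_stack([np.ones(n), G, H])
beta_C_hat = np.linalg.lstsq(design, Y, rcond=None)[0][1]

print(f"true direct effect : {beta_GY:.4f}")
print(f"theory, Eq (5)     : {beta_C_theory:.4f}")
print(f"fitted conditional : {beta_C_hat:.4f}")
```

Under these assumptions the fitted conditional effect should track the value predicted by Eq (5) rather than the true direct effect, which is the collider bias the correction aims to remove.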

In practice, when conducting GWAS marginally as usual for each SNP Gi, Gi may not be causal to the outcome Y (and H); instead, it may be simply in linkage disequilibrium with one or more causal SNPs. In this context, as in our real data examples, our above proposed analysis is to estimate the SNP’s marginal and conditional associations, rather than its causal total and direct effects. But for simplicity we will still loosely use direct and total effects to refer to marginal and conditional associations in the sequel.

Collider bias and MR

Eq (5) elucidates a relationship between collider bias correction and MVMR: both aim to estimate a linear relationship between instrumental effects on the outcome (conditioned on covariates) and instrumental effects on multiple covariates. The vector b in this context can be seen as the causal parameters in MVMR, while βGiY represents the pleiotropic effect of a SNP. Any MVMR approach can be employed to estimate b, given that independent and valid IVs are available, although sample overlap should be carefully considered [18]. In MVMR, a valid IV must satisfy three assumptions:

  • A1: the IV is associated with at least one exposure conditional on the other exposures included in the model;

  • A2: the IV is independent of any confounder of each exposure-outcome pair;

  • A3: the IV is independent of the outcome conditional on all exposures included in the model and the confounders.

Under the model assumption in Eqs (1) and (2), as SNPs are assumed to be independent of the confounder $U$, valid instruments $G_i$ satisfy the conditions $\beta_{G_iH} \neq 0$ (A1) and $\beta_{G_iY} = 0$ (A3). In practice, GWAS summary statistics can be utilized to select instruments that affect the covariates. However, among the selected instruments, Condition A3 is likely to be violated since some SNPs are expected to have direct effects on $Y$. Hence, it is necessary to account for pleiotropy when estimating $b$ in MVMR. In other words, although the direct effect $\beta_{G_iY}$ is our target throughout the paper, in the usual MVMR setting it plays the role of a pleiotropic effect that one would rather avoid. This is a difference between collider bias correction and MVMR estimation [11].
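To make the link between Eq (5) and MVMR concrete, the sketch below estimates $b$ with the simplest MVMR estimator, inverse-variance weighted regression without an intercept (MVMR-IVW). It is an illustration under the assumption that all IVs are valid and that the SNP-covariate effects are known without error; it is not the robust estimator proposed in this paper, and the function name `mvmr_ivw` and the toy effect sizes are hypothetical.

```python
import numpy as np

def mvmr_ivw(beta_H, beta_Yc, se_Yc):
    """Weighted regression of outcome effects on covariate effects, no intercept.

    beta_H  : (l, p2) IV effects on the covariates
    beta_Yc : (l,) conditional IV effects on the outcome
    se_Yc   : (l,) standard errors of beta_Yc
    Returns the slope estimate and its fixed-effect covariance matrix.
    """
    w = 1.0 / se_Yc**2                       # inverse-variance weights
    XtWX = (beta_H.T * w) @ beta_H
    b_hat = np.linalg.solve(XtWX, (beta_H.T * w) @ beta_Yc)
    return b_hat, np.linalg.inv(XtWX)

# Toy data: 50 valid IVs (no direct effect on Y), hypothetical effect sizes
rng = np.random.default_rng(2)
l, b_true = 50, np.array([-0.2, 0.15])
beta_H = rng.normal(0.0, 0.1, size=(l, 2))
se_Yc = np.full(l, 0.005)
beta_Yc = beta_H @ b_true + rng.normal(0.0, se_Yc)

b_hat, cov_b = mvmr_ivw(beta_H, beta_Yc, se_Yc)
print("estimated slope vector:", b_hat)
```

With invalid IVs (nonzero direct effects on $Y$) this estimator is generally biased, which motivates the robust alternatives discussed next.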

Previous studies mainly focused on the univariable case, where the GWAS of $Y$ includes only one covariate and $b$ reduces to a scalar. Weighted regression, incorporating an intercept to accommodate pleiotropy, was employed to obtain $\hat b$ [7, 11]. This approach is analogous to univariable MR-Egger (UV-Egger) regression and can be extended to the multivariable case. However, this approach assumes independence between instrumental effects on covariates and the direct effects $\beta_{G_iY}$ (referred to as the InSIDE assumption) [19]. When there are correlated pleiotropic effects, this assumption is violated, leading to biased estimation of $b$ [19]. In practice, the InSIDE assumption may be violated when the outcome and covariates share common biological mechanisms [11]. Other robust MVMR methods, such as MVMR-Lasso and MVMR-median, can be used to estimate $b$ [20]. A previous numerical study shows that, compared to multivariable MR-Egger (MVMR-Egger) regression, these methods produce a more precise estimate under correlated pleiotropy [14, 20]. However, they may still be biased when too many invalid IVs are used for analysis [14]. Another bias-correction method, known as Slope-Hunter, has been developed to account for correlated pleiotropy [12]. Numerical examples demonstrate its efficacy in reducing collider bias unless there is a strong negative correlation between pleiotropic effects [12]. However, it (along with the aforementioned and many other robust MR methods) is limited to the univariable case and has not yet been generalized to situations with multiple covariates (or exposures).

It is important to note other drawbacks of MR-Egger regression. In the univariate case, the performance of Egger regression depends on the orientation of SNPs [21]. To ensure the independence of the analysis from the reported reference alleles, SNPs were flipped to have positive effects on the risk factor [19], known as default coding. However, violations of the InSIDE assumption can occur when some, but not all, SNPs are re-oriented to achieve positive associations with the exposure [21]. This problem may also arise in a multivariate scenario. The previous literature also indicated that under unbalanced pleiotropy, default coding may severely bias the causal estimate of Egger regression [21]. In our simulation, where only balanced pleiotropy was present, Egger regression was nearly unbiased in the univariate case, but the variance of the causal estimate was large. This is consistent with a previous study [21]. Furthermore, the implementation of UV-Egger regression requires no measurement error in the SNP-exposure association (known as the NOME assumption). Hence, the SNP effects on exposures are assumed to be known without accounting for their uncertainty. When the InSIDE assumption holds, the presence of uncertainty in the SNP effects on exposures can bias the estimate towards 0 [22]. In the context of collider bias, where the SNP effects on covariates are obtained from GWAS summary data, this can lead to underestimation of the magnitude of bias and subsequently result in under-adjustment of the conditional estimates $\hat\beta^C_{G_iY}$ [11]. This issue has not been extensively studied for MVMR-Egger regression [19]. In Table O in S1 Text, some simulation results indicate that, when the InSIDE assumption was violated, MVMR-Egger yielded biased estimates $\hat b$.
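The attenuation caused by ignoring uncertainty in the SNP-covariate effects can be illustrated with a toy example. The sketch below uses a plain regression through the origin with a single covariate (not MR-Egger itself), and every numerical value in it is a hypothetical choice; it only demonstrates the classical regression-dilution effect that underlies the NOME issue.

```python
import numpy as np

rng = np.random.default_rng(3)
l, b_true = 5000, 0.4            # number of SNPs and true slope (hypothetical)
sd_effect, se = 0.05, 0.05       # spread of true effects and their standard error

beta_H = rng.normal(0.0, sd_effect, l)            # true SNP-covariate effects
beta_H_hat = beta_H + rng.normal(0.0, se, l)      # estimated effects, with error
beta_Yc_hat = b_true * beta_H + rng.normal(0.0, 0.01, l)

def slope_through_origin(x, y):
    """Least-squares slope of y on x without an intercept."""
    return float(x @ y / (x @ x))

slope_true_x = slope_through_origin(beta_H, beta_Yc_hat)       # error-free regressor
slope_noisy_x = slope_through_origin(beta_H_hat, beta_Yc_hat)  # regressor with error
expected_shrinkage = sd_effect**2 / (sd_effect**2 + se**2)     # classical dilution factor

print(f"slope with true effects      : {slope_true_x:.3f}")
print(f"slope with estimated effects : {slope_noisy_x:.3f}")
print(f"b_true * dilution factor     : {b_true * expected_shrinkage:.3f}")
```

In this toy setting the slope obtained from the noisy regressor is pulled towards 0 by roughly the dilution factor, so a correction based on it would under-adjust the conditional estimates.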

The previous discussion highlights the limitations of existing approaches. Therefore, it is crucial to employ a robust MVMR method that can accurately estimate the slope vector $b$ and thus correct for collider bias. MVMR-cML emerges as a strong competitor as it remains robust even in the presence of invalid IVs that violate all three IV assumptions [14]. Previous simulations have demonstrated the competitive performance of MVMR-cML among robust MVMR approaches in the presence of correlated and uncorrelated pleiotropy [14]. Hence, in our proposal, we estimate $b$ using MVMR-cML, and obtain the estimate of $\beta_{G_iY}$ by subtracting the bias from the conditional estimate $\hat\beta^C_{G_iY}$. We refer to this method as MVMR-cML-bias-correction. More details are given below.

Bias correction via MVMR-cML

Suppose we are given the GWAS summary datasets $\{\hat\beta_{G_iH_1}, \cdots, \hat\beta_{G_iH_{p_2}}, \hat\beta^C_{G_iY}, \hat\sigma_{G_iH_1}, \cdots, \hat\sigma_{G_iH_{p_2}}, \hat\sigma^C_{G_iY}\}_{i=1}^{m}$ of covariates $H_1, \cdots, H_{p_2}$ and the outcome $Y$, computed by the regressions in Eqs (1) and (4). We first select (approximately) independent IVs $Z_j$ through a pruning process. Subsequently, we identify the corresponding instrumental effect estimates $\{\hat\beta_{Z_jH_1}, \cdots, \hat\beta_{Z_jH_{p_2}}, \hat\beta^C_{Z_jY}, \hat\sigma_{Z_jH_1}, \cdots, \hat\sigma_{Z_jH_{p_2}}, \hat\sigma^C_{Z_jY}\}_{j=1}^{l}$ based on the significant marginal associations between each $Z_j$ and at least one of the covariates in $H$. Let $V^* = \{Z_1^*, \cdots, Z_{l_0}^*\}$ represent the true (unknown) set of valid IVs, where $|V^*| = l_0$. Throughout this paper we use a superscript 0 to denote true values of parameters.

As previously [14], given the usual large sample size of GWAS, it is reasonable to assume a multivariate normal model for SNP effect estimates:

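$$\begin{pmatrix} \hat\beta^C_{Z_jY} \\ \hat\beta_{Z_jH} \end{pmatrix} \sim N\left\{ \begin{pmatrix} (b^0)^T \beta^0_{Z_jH} + \beta^0_{Z_jY} \\ \beta^0_{Z_jH} \end{pmatrix}, \Sigma_j \right\}, \quad j = 1, \cdots, l. \tag{8}$$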

The off-diagonal elements of the covariance matrix $\Sigma_j$ capture the correlations among the summary statistics for the outcome and covariates due to overlapping samples.

The correlation parameters in $\Sigma_j$ can be estimated using either null z-scores in GWAS summary data [23] or LDSC regression [24]. Throughout this paper, we approximate $\Sigma_j$ using null z-scores. In the context of MVMR-cML, we assume that $\Sigma_j$ is either known or well-estimated using GWAS summary data [14].
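One way to picture the null z-score approach is the sketch below, which estimates the pairwise correlation of summary statistics from SNPs that look null for both traits. The selection rule (two-sided p-value above a threshold for both traits), the function name `null_z_correlation` and the toy data are assumptions made for illustration, not a description of the exact procedure in [23].

```python
import numpy as np
from statistics import NormalDist

def null_z_correlation(z, p_threshold=0.1):
    """Estimate trait-by-trait correlations of summary statistics.

    z : (m, t) array of z-scores for m SNPs and t traits.
    For each pair of traits, only SNPs whose two-sided p-value exceeds
    p_threshold for BOTH traits are used.
    """
    cutoff = NormalDist().inv_cdf(1.0 - p_threshold / 2.0)  # |z| < cutoff means p > threshold
    t = z.shape[1]
    corr = np.eye(t)
    for a in range(t):
        for c in range(a + 1, t):
            null = (np.abs(z[:, a]) < cutoff) & (np.abs(z[:, c]) < cutoff)
            corr[a, c] = corr[c, a] = np.corrcoef(z[null, a], z[null, c])[0, 1]
    return corr

# Toy z-scores for two traits measured on overlapping samples (hypothetical)
rng = np.random.default_rng(4)
true_corr = np.array([[1.0, 0.5], [0.5, 1.0]])
z = rng.multivariate_normal(np.zeros(2), true_corr, size=20_000)
corr_hat = null_z_correlation(z)
print(corr_hat)
```

Because thresholding truncates the z-scores, the estimate in this toy example comes out somewhat below the generating correlation of 0.5; the choice of threshold therefore trades off this attenuation against contamination by non-null SNPs.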

The log-likelihood function of $b$, $\beta_{Z_jH}$ and $\beta_{Z_jY}$ is:

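$$L(b, \beta_{Z_jH}, \beta_{Z_jY} \mid \hat\beta_j, \Sigma_j) = -\frac{1}{2} \sum_{j=1}^{l} (\hat\beta_j - \beta_j)^T \Sigma_j^{-1} (\hat\beta_j - \beta_j), \tag{9}$$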

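where $\beta_j = (b^T \beta_{Z_jH} + \beta_{Z_jY}, \beta_{Z_jH_1}, \cdots, \beta_{Z_jH_{p_2}})^T$ and $\hat\beta_j = (\hat\beta^C_{Z_jY}, \hat\beta_{Z_jH_1}, \cdots, \hat\beta_{Z_jH_{p_2}})^T$. Under the constraint that the number of invalid IVs is $K$, we estimate the unknown parameters $b$, $\beta_{Z_jH}$ and $\beta_{Z_jY}$ by solving the following constrained maximum likelihood problem: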

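$$\{\hat b, \beta_{Z_jH}, \beta_{Z_jY}\} = \arg\max_{\{b, \beta_{Z_jH}, \beta_{Z_jY}\}} L(b, \beta_{Z_jH}, \beta_{Z_jY} \mid \hat\beta_j, \Sigma_j) \tag{10}$$
$$\text{subject to} \quad \sum_{j=1}^{l} I(\beta_{Z_jY} \neq 0) = K. \tag{11}$$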

For a given number of invalid IVs, $K$, a coordinate descent-like algorithm is implemented to obtain the estimates $\hat b(K)$ and $\{\tilde\beta_{Z_jH}(K), \tilde\beta_{Z_jY}(K)\}_{j=1}^{l}$. The selection of $K$ is performed using the Bayesian information criterion (BIC) from a candidate set $\mathcal{K} = \{0, 1, \cdots, l - p_2 - 1\}$:

$$\mathrm{BIC}(K) = -2 L(\hat b(K), \tilde\beta_{Z_jH}(K), \tilde\beta_{Z_jY}(K) \mid \hat\beta_j, \Sigma_j) + K \log(N),$$

where $N$ is the minimum sample size of all GWAS datasets used for analysis. The value of $K$ ranges from 0 to $l - p_2 - 1$, taking into account the multivariable plurality condition [14]. $K = 0$ indicates that all IVs are valid. By minimizing the BIC, we determine the estimate $\hat K$, the final estimate $\hat b = \hat b(\hat K)$, and $\hat V^* = \{Z_j \mid \hat\beta_{Z_jY}(\hat K) = 0\}$. A consistently estimated covariance matrix of $\hat b$, denoted as $\hat\Sigma_{\hat b}$, can be obtained from the observed Fisher information matrix of the likelihood using all selected valid IVs in $\hat V^*$. This approach is referred to as MVMR-cML-BIC, following the previous terminology [14]. Under mild conditions, MVMR-cML consistently selects the set of valid IVs. Specifically, as $N \to +\infty$, we have $P(\hat V^* = V^*) \to 1$. Additionally, the distribution of the standardized difference $\Sigma_{\hat b}^{-1/2}(\hat b - b^0)$ approaches a multivariate standard normal distribution [14]; $\Sigma_{\hat b}$ is substituted by $\hat\Sigma_{\hat b}$ in practice. For better finite-sample performance, a data perturbation approach was proposed to account for statistical uncertainty in model selection [14]. However, in the current paper, this approach is not utilized because it is time-consuming.
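The constrained likelihood problem and the BIC selection of $K$ can be sketched as follows. This is a deliberately simplified illustration and not the authors' implementation: it assumes no sample overlap (each $\Sigma_j$ is diagonal, built from the reported standard errors), uses a single starting point, and omits the covariance estimate and data perturbation. The function names `cml_fixed_K` and `cml_bic` and the toy data are hypothetical.

```python
import numpy as np

def cml_fixed_K(bH_hat, bY_hat, seH, seY, K, n_iter=200, tol=1e-10):
    """Block coordinate ascent for Eqs (10)-(11) with K invalid IVs (diagonal Sigma_j)."""
    l, p = bH_hat.shape
    wY, wH = 1.0 / seY**2, 1.0 / seH**2
    bH = bH_hat.copy()
    b = np.linalg.solve((bH.T * wY) @ bH, (bH.T * wY) @ bY_hat)   # IVW start
    r = np.zeros(l)
    for _ in range(n_iter):
        b_old = b.copy()
        # (a) direct effects: nonzero only for the K largest standardized residuals
        resid = bY_hat - bH @ b
        r = np.zeros(l)
        if K > 0:
            idx = np.argsort(-(resid**2) * wY)[:K]
            r[idx] = resid[idx]
        # (b) IV-covariate effects: one small weighted least-squares solve per IV
        A = wY[:, None, None] * np.outer(b, b)[None] + wH[:, :, None] * np.eye(p)[None]
        rhs = (wY * (bY_hat - r))[:, None] * b[None, :] + wH * bH_hat
        bH = np.linalg.solve(A, rhs[..., None])[..., 0]
        # (c) slope vector b
        b = np.linalg.solve((bH.T * wY) @ bH, (bH.T * wY) @ (bY_hat - r))
        if np.max(np.abs(b - b_old)) < tol:
            break
    loglik = -0.5 * (np.sum(wY * (bY_hat - bH @ b - r) ** 2)
                     + np.sum(wH * (bH_hat - bH) ** 2))          # Eq (9), diagonal case
    return b, r, loglik

def cml_bic(bH_hat, bY_hat, seH, seY, N):
    """Select K in {0, ..., l - p2 - 1} by minimizing BIC(K) = -2 L + K log(N)."""
    l, p = bH_hat.shape
    best = None
    for K in range(l - p):
        b, r, loglik = cml_fixed_K(bH_hat, bY_hat, seH, seY, K)
        bic = -2.0 * loglik + K * np.log(N)
        if best is None or bic < best[0]:
            best = (bic, K, b, r)
    return best

# Toy data: 40 IVs, the first 8 invalid with balanced direct effects (hypothetical values)
rng = np.random.default_rng(1)
l, p, N, se = 40, 2, 50_000, 0.005
b_true = np.array([0.3, -0.2])
bH_true = rng.normal(0.0, 0.1, size=(l, p))
r_true = np.zeros(l)
r_true[:8] = 0.1 * np.where(np.arange(8) % 2 == 0, 1.0, -1.0)
bH_hat = bH_true + rng.normal(0.0, se, size=(l, p))
bY_hat = bH_true @ b_true + r_true + rng.normal(0.0, se, size=l)
seH, seY = np.full((l, p), se), np.full(l, se)

bic, K_hat, b_hat, r_hat = cml_bic(bH_hat, bY_hat, seH, seY, N)
print("selected K:", K_hat, "estimated b:", b_hat)
```

Each of the three updates maximizes the likelihood over one block of parameters with the others held fixed, so the objective never decreases; as with any such scheme, the result can depend on the starting point, which is one reason practical implementations typically use several starts.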

After obtaining the estimated slope vector $\hat b$, we can calculate the bias-corrected estimate as

$$\hat\beta_{G_iY} = \hat\beta^C_{G_iY} - \hat b^T \hat\beta_{G_iH}. \tag{12}$$

We have the following result for the desired statistical property of $\hat\beta_{G_iY}$.

Theorem 1 Under some mild conditions, as the sample size $N \to +\infty$, the bias-corrected estimator $\hat\beta_{G_iY} = \hat\beta^C_{G_iY} - \hat b^T \hat\beta_{G_iH}$ is consistent and asymptotically normal:

$$(\hat\beta_{G_iY} - \beta^0_{G_iY}) / \sigma_{G_iY} \xrightarrow{D} N(0, 1), \tag{13}$$

where $\beta^0_{G_iY}$ is the true direct effect of $G_i$ on $Y$, and $\sigma^2_{G_iY}$ is the variance of $\hat\beta_{G_iY}$.

A consistent estimator $\hat\sigma^2_{G_iY}$ of $\sigma^2_{G_iY}$ is shown in Section B.2 in S1 Text. For MVMR-cML, the variance estimator $\hat\sigma^2_{G_iY}$ properly accounts for correlations among the GWAS summary data, as well as LD between the target SNP $G_i$ and the IVs. More details, including the proof and conditions of Theorem 1 and the analytical expression of $\hat\sigma^2_{G_iY}$, are given in Section B in S1 Text.

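Denote the elements of $\hat b$ as $\hat b_k$, and the diagonal elements of $\hat\Sigma_{\hat b}$ as $\hat\sigma^2_{\hat b_k}$. Assuming that the estimated slope vector $\hat b$, $\hat\beta^C_{G_iY}$, and the elements in $\hat\beta_{G_iH}$ are mutually independent, we have the simplified expression of $\hat\sigma^2_{G_iY}$ as follows [7, 12]:

$$\hat\sigma^2_{G_iY} = \hat\beta_{G_iH}^T \hat\Sigma_{\hat b} \hat\beta_{G_iH} + \sum_{k=1}^{p_2} \hat b_k^2 \hat\sigma^2_{G_iH_k} + \sum_{k=1}^{p_2} \hat\sigma^2_{\hat b_k} \hat\sigma^2_{G_iH_k} + (\hat\sigma^C_{G_iY})^2. \tag{14}$$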

Although the independence assumption does not hold in practice, in our simulation and real data applications, Eq (14) often yielded variance estimates similar to those of the full variance estimator. In general, the variance increases as more covariates are adjusted. Hence, as expected, adjusting for too many heritable covariates leads to a loss of power.
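Eqs (12) and (14) translate directly into a few lines of code. The sketch below applies them to one SNP, using the simplified (independence-based) variance rather than the full estimator from S1 Text; the function name `bias_correct` and the input values are hypothetical.

```python
import math
import numpy as np

def bias_correct(beta_Yc, se_Yc, beta_H, se_H, b_hat, cov_b):
    """Bias-corrected direct effect (Eq 12) with the simplified variance (Eq 14).

    beta_Yc, se_Yc : conditional SNP-outcome estimate and its standard error
    beta_H, se_H   : (p2,) SNP-covariate estimates and standard errors
    b_hat, cov_b   : (p2,) slope estimate and its (p2, p2) covariance matrix
    """
    beta_H, se_H = np.asarray(beta_H, float), np.asarray(se_H, float)
    b_hat, cov_b = np.asarray(b_hat, float), np.asarray(cov_b, float)
    beta_GY = beta_Yc - b_hat @ beta_H                    # Eq (12)
    var = (beta_H @ cov_b @ beta_H                        # uncertainty in b_hat
           + np.sum(b_hat**2 * se_H**2)                   # uncertainty in beta_H
           + np.sum(np.diag(cov_b) * se_H**2)             # product of the two
           + se_Yc**2)                                    # conditional estimate
    se = math.sqrt(var)
    p_value = math.erfc(abs(beta_GY / se) / math.sqrt(2.0))  # two-sided normal test
    return float(beta_GY), se, p_value

# Hypothetical inputs for a single SNP with two covariates
est, se, p_value = bias_correct(
    beta_Yc=0.05, se_Yc=0.01,
    beta_H=[0.1, -0.2], se_H=[0.01, 0.02],
    b_hat=[0.5, 0.1], cov_b=np.diag([0.01, 0.04]),
)
print(est, se, p_value)
```

The first term of the variance grows with the size of the SNP-covariate effects, so SNPs with strong effects on the covariates carry the largest extra uncertainty after correction.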

Note that our method is applicable to GWAS summary data for both the heritable covariates and the trait of interest.

Other methods

Note that our proposed bias-correction method can be applied with various MVMR methods other than MVMR-cML, such as MVMR-Egger, MVMR-IVW, MVMR-Lasso, and MVMR-median [20]; we compared the performance of these methods in simulations and real data analyses. To make the other MVMR methods applicable here as far as possible, we assumed the asymptotic normality of their estimates and applied Eq (14) to estimate the variance of their bias-corrected SNP effect estimates, where the covariance matrix of $\hat b$ was estimated by data perturbation [14]. Note that it is unknown how to estimate the correlations between $\hat b$ and $\hat\beta_{G_iH}$ or $\hat\beta^C_{G_iY}$ for other MVMR methods, while they can be estimated for MVMR-cML. The corresponding bias-correction procedure was named after the MVMR method, for example, MVMR-Egger-bias-correction. In MVMR-Egger regression, we reoriented the SNPs such that they all had positive effects on the first covariate $H_1$ [19].
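The reorientation step for MVMR-Egger can be sketched as below: each SNP's effects are sign-flipped so that its effect on the first covariate is non-negative, and a weighted regression with an intercept is then fitted. This is a minimal illustration with hypothetical function names and noise-free toy data, not the implementation used in the analyses.

```python
import numpy as np

def orient_to_first_covariate(beta_H, beta_Yc):
    """Flip alleles so that every SNP has a non-negative effect on H1."""
    sign = np.where(beta_H[:, 0] < 0, -1.0, 1.0)
    return beta_H * sign[:, None], beta_Yc * sign

def mvmr_egger(beta_H, beta_Yc, se_Yc):
    """Weighted regression with an intercept after reorientation."""
    beta_H, beta_Yc = orient_to_first_covariate(beta_H, beta_Yc)
    X = np.column_stack([np.ones(len(beta_Yc)), beta_H])
    w = 1.0 / se_Yc**2
    coef = np.linalg.solve((X.T * w) @ X, (X.T * w) @ beta_Yc)
    return coef[0], coef[1:]          # intercept, slope vector

# Noise-free toy data with a constant directional effect of 0.02 after orientation
rng = np.random.default_rng(5)
beta_H = rng.normal(0.0, 0.1, size=(30, 2))
b_true = np.array([0.3, -0.1])
sign = np.where(beta_H[:, 0] < 0, -1.0, 1.0)
beta_Yc = beta_H @ b_true + 0.02 * sign
intercept, b_hat = mvmr_egger(beta_H, beta_Yc, np.full(30, 0.01))
print(intercept, b_hat)
```

Because the flip multiplies both the covariate and outcome effects of a SNP by the same sign, the slope vector is unchanged by reorientation; only the interpretation of the intercept depends on it.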

In addition, in cases with only one heritable covariate (i.e. $p_2 = 1$), we also applied the two bias-correction approaches of Dudbridge et al. (2019) [7] and Mahmoud et al. (2022) [12]; following the previous literature [12], we denote these two methods as DHO and Slope-Hunter (SH), respectively.

Simulation setups

We conducted a simulation study to assess the performance of different MVMR methods in the one-sample setting (i.e. where all GWAS data were based on the same sample of individuals) under two scenarios: one involving independent SNPs with no pleiotropy, and the other using (weakly) correlated SNPs with pleiotropy. In the former, independent SNPs were randomly generated; in the latter, we first pruned the UK Biobank (UKB) genotype data with a correlation-coefficient threshold of 0.1, then randomly sampled 1000 SNPs from chromosome 1. Among the 1000 SNPs, 30 independent SNPs were randomly selected as IVs; 50 SNPs affected $H$ only, 50 affected $Y$ only, and 50 affected both $H$ and $Y$; the remaining ones were null SNPs having no effect on either the covariates or the outcome. Before drawing the SNPs, we partitioned chromosome 1 into 133 independent blocks [25]. The 150 non-null SNPs were drawn from the first 10 blocks, and the 850 null SNPs were drawn from the last 10 blocks. Hence, the non-null SNPs were independent of the null SNPs. We also drew 3000 independent null SNPs from other chromosomes, whose z-scores were used to estimate the correlations of summary statistics [24]. These 3000 null SNPs are not shown in the simulation results. When estimating the correlation of summary statistics, we used a p-value threshold of 0.1 to select null z-scores for the covariates or the outcome, and approximately 2000 null z-scores were selected for each pair of summary data. We utilized correlated SNPs in order to compare the variance estimator in Eq (14) with our newly proposed variance estimator for MVMR-cML, the latter of which can account for LD among SNPs. However, in our simulation, the correlations among SNPs had only a minor influence on variance estimation. All genotypes $G_i$ were centered to have a sample mean of 0.

In order to simulate correlated pleiotropy, for the 50 SNPs affecting both $H = (H_1, \cdots, H_{p_2})^T$ and $Y$, their effects on $Y$ ($\beta_{G_iY}$) and the first covariate $H_1$ ($\beta_{G_iH_1}$) were generated from a bivariate normal distribution with mean 0, variance 1, and a constant correlation $\rho$ of 0, 0.5 or −0.5. All other SNP effects, as well as the causal effects from the covariates to the outcome ($\beta_{HY}$), were independently drawn from a standard normal distribution. All SNP effects were fixed before starting the simulation. The confounder $U$ and error terms $E_H$, $E_Y$ were drawn from normal distributions with a mean of 0. $U$ accounted for 40% of the unknown variance in both $Y$ and the covariates within $H$. Each error term contributed 10% of the total variance in the covariates or outcome. The values of $H$ and $Y$ were subsequently determined based on the equations below:

$$H = \sum_{i=1}^{4000} \beta_{G_iH} G_i + \beta_{UH} U + E_H, \tag{15}$$
$$Y = \sum_{i=1}^{4000} \beta_{G_iY} G_i + \beta_{UY} U + \beta_{HY}^T H + E_Y. \tag{16}$$

We conducted the simulation under two scenarios with 30% and 50% invalid IVs respectively. The invalid IVs were taken from those having pleiotropic effects, while the valid IVs were taken from those affecting the covariates only. The GWAS of Y was performed using H as covariates, while the GWAS of each element in H only included the SNPs. The simulation setups closely followed those in a previous study [7]. We mainly presented the results for the SNPs suffering from collider bias, i.e., those SNPs affecting the covariates.
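
The mechanism behind this setup can be illustrated with a much smaller toy simulation. The sketch below is not the simulation code used in this study: it assumes a single SNP, a single covariate, no direct effect of either on Y, and arbitrary effect sizes and variances chosen for illustration. All function names are hypothetical.

```python
import random

def slope(y, x):
    """Simple-regression slope of y on x (intercept handled by centering)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def slopes_two_predictors(y, x1, x2):
    """Least-squares slopes of y on x1 and x2, solved from the 2x2 normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def simulate(n=20000, beta_gh=0.5, maf=0.3, seed=1):
    """Toy data: G affects H, U confounds H and Y, nothing else affects Y."""
    rng = random.Random(seed)
    g = [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]  # genotype 0/1/2
    u = [rng.gauss(0, 1) for _ in range(n)]                            # hidden confounder
    h = [beta_gh * gi + ui + rng.gauss(0, 1) for gi, ui in zip(g, u)]  # heritable covariate
    y = [ui + rng.gauss(0, 1) for ui in u]                             # true SNP effect on Y is 0
    return g, h, y

g, h, y = simulate()
beta_g_marginal = slope(y, g)                                  # GWAS of Y without H
beta_g_conditional, beta_h_conditional = slopes_two_predictors(y, g, h)  # GWAS of Y adjusting for H
print(f"marginal SNP effect:    {beta_g_marginal:.3f}")
print(f"conditional SNP effect: {beta_g_conditional:.3f}")
print(f"conditional H effect:   {beta_h_conditional:.3f}")
```

Under these toy settings the marginal SNP estimate is centered at 0, whereas the H-adjusted estimate is centered at about −0.25 (minus the conditional H coefficient, 0.5, times the SNP effect on H, 0.5), even though the SNP has no effect on Y.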

Verification and comparison

In each simulation scenario, we varied the dimension of $H = (H_1, \ldots, H_{p_2})$, denoted as p2, considering dimensions of 1, 2 and 4. Here we only present the main simulation with correlated SNPs and uncorrelated pleiotropy (30% invalid IVs), while additional details regarding other scenarios can be found in Section F in S1 Text.

We applied our bias-correction method to the simulated GWAS summary data, employing a significance level of 0.05 to calculate type-I error rates and power. To gauge the impact of collider bias and assess the effectiveness of the bias-correction methods, we computed the empirical type-I error rate, i.e., the empirical probability of rejecting the null hypothesis H0 : βGiY = 0 for each SNP that exerted no direct effect on Y, both before and after bias correction. Recognizing that different SNPs might exhibit distinct type-I error rates, we calculated the average across all these SNPs, providing an overall measure. In evaluating the power of detecting the effect βGiY, we computed the empirical probability of rejecting the null hypothesis. Power calculations were performed individually for each SNP affecting Y, and the results were averaged over different SNPs to derive an overall measure. Given that our bias-correction approach may impact power, we specifically identified the SNPs exhibiting the largest increase or decrease in power after applying MVMR-cML-bias-correction. This allowed us to scrutinize the bias-correction methods’ impact on power in extreme cases.
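
The per-SNP rejection rates and their averages described above can be sketched as follows. This is an illustration only: it assumes a two-sided Wald z-test on each estimate and its standard error, which is an assumption about the test rather than a detail stated here, and the function names are hypothetical.

```python
from statistics import NormalDist

def wald_p_value(estimate, se):
    """Two-sided p-value for H0: effect = 0, using a normal approximation."""
    z = estimate / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def rejection_rate(estimates, ses, alpha=0.05):
    """Fraction of simulation replicates in which H0 is rejected for one SNP."""
    rejected = [wald_p_value(b, s) < alpha for b, s in zip(estimates, ses)]
    return sum(rejected) / len(rejected)

def average_rate(per_snp_replicates, alpha=0.05):
    """Average the per-SNP rejection rates over a set of SNPs.

    per_snp_replicates: list of (estimates, ses) pairs, one pair per SNP.
    For SNPs with no direct effect on Y this is an average type-I error rate;
    for SNPs affecting Y it is an average power.
    """
    rates = [rejection_rate(est, se, alpha) for est, se in per_snp_replicates]
    return sum(rates) / len(rates)

# Made-up replicate results for two SNPs (four and two replicates).
snp_a = ([0.0, 0.5, -0.5, 0.1], [0.1, 0.1, 0.1, 0.1])
snp_b = ([0.0, 0.0], [0.1, 0.1])
rate_a = rejection_rate(*snp_a)
overall = average_rate([snp_a, snp_b])
```

The same function serves for type-I error and power; only the set of SNPs passed in differs.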

To highlight the bias-correction results for a single SNP, we randomly selected two SNPs with collider bias (i.e., SNPs affecting covariates) and provided their corresponding effect estimates, type-I error rates and power in Tables P-R and Tables W-Y in S1 Text.

Dramatically inflated Type-I errors were better controlled after bias correction

In Table 1, we provide a comprehensive overview of empirical type-I error rates and power across different SNPs, along with the sample standard deviations (SD) of the point estimates. Overall, when p2 = 1, all MVMR methods were effective in mitigating the inflated type-I errors resulting from collider bias. Notably, for SNPs affecting only the covariates, type-I error rates were close to the nominal level after bias correction by each MVMR method. For example, when p2 = 1, the empirical type-I error rate of SNPs affecting H but not Y decreased from 0.73 to 0.05 after applying MVMR-cML for bias correction. As pointed out by one reviewer, the method of Dudbridge et al. (2019) [7] is also robust to overlapping samples, as confirmed here by the good performance of DHO.

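Table 1. Empirical type-I error rate (for null SNPs and SNPs affecting H but not Y) and power (for the other SNP sets) with and without bias correction in the presence of 30% invalid IVs.

Sample standard deviations (SD) are given in parentheses.

Dimension of H = 1

| Bias correction | No | cML | Egger | IVW | Lasso | Median | DHO | SH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Null SNPs | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
| (SD) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
| All SNPs affecting H but not Y | 0.73 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.07 | 0.05 |
| (SD) | (0.04) | (0.03) | (0.03) | (0.03) | (0.03) | (0.03) | (0.05) | (0.03) |
| All SNPs affecting Y only | 0.34 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 | 0.26 |
| (SD) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) |
| All SNPs affecting both H and Y | 0.76 | 0.26 | 0.24 | 0.26 | 0.26 | 0.25 | 0.26 | 0.25 |
| (SD) | (0.04) | (0.05) | (0.04) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) |
| SNP with greatest increase in power | 0.23 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
| (SD) | (0.42) | (0.34) | (0.34) | (0.34) | (0.34) | (0.34) | (0.33) | (0.34) |
| SNP with greatest decrease in power | 1.00 | 0.05 | 0.01 | 0.03 | 0.03 | 0.02 | 0.03 | 0.05 |
| (SD) | (0.00) | (0.22) | (0.10) | (0.18) | (0.18) | (0.13) | (0.18) | (0.21) |

Dimension of H = 2

| Bias correction | No | cML | Egger | IVW | Lasso | Median |
| --- | --- | --- | --- | --- | --- | --- |
| Null SNPs | 0.05 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
| (SD) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
| All SNPs affecting H but not Y | 0.74 | 0.06 | 0.07 | 0.09 | 0.09 | 0.06 |
| (SD) | (0.03) | (0.04) | (0.04) | (0.04) | (0.04) | (0.04) |
| All SNPs affecting Y only | 0.65 | 0.54 | 0.56 | 0.55 | 0.55 | 0.56 |
| (SD) | (0.04) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) |
| All SNPs affecting both H and Y | 0.69 | 0.52 | 0.54 | 0.53 | 0.53 | 0.52 |
| (SD) | (0.04) | (0.05) | (0.05) | (0.05) | (0.05) | (0.05) |
| SNP with greatest increase in power | 0.05 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 |
| (SD) | (0.22) | (0.13) | (0.09) | (0.13) | (0.13) | (0.14) |
| SNP with greatest decrease in power | 1.00 | 0.05 | 0.04 | 0.06 | 0.06 | 0.04 |
| (SD) | (0.00) | (0.23) | (0.19) | (0.24) | (0.24) | (0.20) |

Dimension of H = 4

| Bias correction | No | cML | Egger | IVW | Lasso | Median |
| --- | --- | --- | --- | --- | --- | --- |
| Null SNPs | 0.05 | 0.06 | 0.10 | 0.10 | 0.10 | 0.10 |
| (SD) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) | (0.01) |
| All SNPs affecting H but not Y | 0.71 | 0.08 | 0.17 | 0.12 | 0.12 | 0.09 |
| (SD) | (0.04) | (0.04) | (0.07) | (0.04) | (0.04) | (0.04) |
| All SNPs affecting Y only | 0.77 | 0.70 | 0.72 | 0.73 | 0.73 | 0.73 |
| (SD) | (0.03) | (0.04) | (0.04) | (0.04) | (0.04) | (0.04) |
| All SNPs affecting both H and Y | 0.82 | 0.66 | 0.63 | 0.66 | 0.66 | 0.62 |
| (SD) | (0.03) | (0.05) | (0.04) | (0.04) | (0.04) | (0.05) |
| SNP with greatest increase in power | 0.28 | 0.95 | 0.99 | 0.98 | 0.98 | 0.97 |
| (SD) | (0.45) | (0.21) | (0.11) | (0.13) | (0.13) | (0.16) |
| SNP with greatest decrease in power | 1.00 | 0.06 | 0.07 | 0.09 | 0.09 | 0.03 |
| (SD) | (0.00) | (0.24) | (0.26) | (0.28) | (0.28) | (0.18) |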

When p2 ≥ 2, all MVMR methods were less effective: all the type-I error rates remained slightly above 0.05 after correction, but MVMR-cML and MVMR-Median gave smaller type-I error rates than the other methods. Two reasons contributed to the minor inflation of type-I errors. First, for SNPs solely influencing the covariates, their summary statistics β^GiH exhibited slight biases owing to inter-SNP correlations. When p2 = 1, this bias was minimal and thus had negligible impact on bias correction. However, with the inclusion of more covariates in the analysis, the collective estimation error of the entire vector β^GiH increased. Second, the same problem exists in the estimation of b. As shown in Table 2, b^ was slightly biased from the true values due to invalid IVs. Even though the deviation was small for each element, the aggregated estimation error of b^ increased and became influential as more covariates joined the analysis. According to Eq (12), this would reduce the accuracy of our bias-correction method. As for null SNPs, although their summary statistics remained unbiased, integrating additional covariates in GWAS augmented the uncertainty in estimating βGiH, making it more challenging for bias correction to approximate the true direct effect βGiY. Consequently, the effect estimates of a few null SNPs might deviate from 0 after bias correction, thereby slightly inflating type-I errors.

Table 2. True values and mean estimates of b with ρ = 0 and 30% invalid IVs.

Sample standard deviations (SD) and mean standard errors (Mean SE) are given in parentheses.

| Dimension of b | 1 | 2 | 2 | 4 | 4 | 4 | 4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| b | b1 | b1 | b2 | b1 | b2 | b3 | b4 |
| True value | −3.26 | −1.19 | −1.22 | −0.48 | −0.64 | −0.46 | −0.39 |
| MVMR-cML | −3.36 | −1.17 | −1.26 | −0.51 | −0.83 | −0.41 | −0.37 |
| (SD) | (0.10) | (0.07) | (0.07) | (0.06) | (0.06) | (0.04) | (0.07) |
| (Mean SE) | (0.09) | (0.05) | (0.06) | (0.04) | (0.05) | (0.04) | (0.03) |
| MVMR-Egger | −3.41 | −1.05 | −1.25 | −0.69 | −0.74 | −0.36 | −0.42 |
| (SD) | (0.16) | (0.10) | (0.05) | (0.07) | (0.05) | (0.04) | (0.03) |
| (Mean SE) | (0.21) | (0.11) | (0.06) | (0.08) | (0.05) | (0.05) | (0.03) |
| MVMR-IVW | −3.32 | −1.26 | −1.24 | −0.56 | −0.72 | −0.41 | −0.42 |
| (SD) | (0.09) | (0.05) | (0.05) | (0.04) | (0.05) | (0.03) | (0.03) |
| (Mean SE) | (0.09) | (0.05) | (0.06) | (0.03) | (0.05) | (0.04) | (0.03) |
| MVMR-Lasso | −3.32 | −1.26 | −1.24 | −0.56 | −0.72 | −0.41 | −0.42 |
| (SD) | (0.09) | (0.05) | (0.05) | (0.04) | (0.05) | (0.03) | (0.03) |
| (Mean SE) | (0.09) | (0.05) | (0.06) | (0.03) | (0.05) | (0.04) | (0.03) |
| MVMR-Median | −3.36 | −1.21 | −1.22 | −0.56 | −0.73 | −0.42 | −0.43 |
| (SD) | (0.10) | (0.06) | (0.06) | (0.04) | (0.05) | (0.04) | (0.04) |
| (Mean SE) | (0.13) | (0.08) | (0.09) | (0.06) | (0.08) | (0.05) | (0.06) |
| DHO | −3.54 | NA | NA | NA | NA | NA | NA |
| (SD) | (0.23) | | | | | | |
| (Mean SE) | (0.17) | | | | | | |
| SH | −3.33 | NA | NA | NA | NA | NA | NA |
| (SD) | (0.11) | | | | | | |
| (Mean SE) | (0.12) | | | | | | |

Collider bias was mitigated after bias correction

Fig 3 illustrates the mean estimates of βGiY (averaged over 1000 simulations) both before and after bias correction against the true (direct or conditional) effects. Here we exclusively present figures for MVMR-cML, representing our proposed method. The complete set of results is available in Section F in S1 Text. Across various scenarios, all MVMR methods produced similar figures. Before correction, numerous points deviated from the identity line, indicating the presence of collider bias in the conditional effects βGiYC for these SNPs. Fig 3 demonstrates the effectiveness of MVMR-cML in eliminating or reducing bias, particularly when p2 = 1. This is evident as most points aligned closely around the identity line after bias correction. However, when p2 > 1, our proposed method might not entirely eliminate bias, and many points slightly deviated from the identity line. This deviation was attributed to the introduction of more covariates in H. As mentioned before, this increased the collective estimation error in the vector β^GiH and reduced the accuracy of β^GiY. Additionally, invalid IVs might introduce slight bias into the estimate of b, as shown in Table 2. Even though the bias for each element of b^ was small, as more covariates were included in the analysis, the aggregated estimation error of the entire vector b^ became large. Consequently, achieving precise estimation became challenging, resulting in a slight bias even after correction. Furthermore, our bias-correction approach, as described in Eq (14), increased the variance of the corrected effect estimate. This was evident in the figures, where the vertical bars representing mean standard errors (SEs) became longer after correction. Hence there is a trade-off between type-I error control and statistical power, as larger SEs reduce the power to detect true associations between SNPs and the outcome Y. Therefore, while our correction approach effectively addressed collider bias, it came at the expense of some power, especially in scenarios with more covariates in H.
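
The trade-off between bias removal and variance inflation can be sketched on summary statistics. The block below is a rough illustration, not a reproduction of Eq (12) or Eq (14): it assumes the conditional effect decomposes linearly as the direct effect plus the sum over covariates of b_k times the SNP effect on covariate k, and its variance formula is a simple delta-method approximation that ignores all covariances between the estimates (the estimators discussed in this paper account for such correlations). The function name and the example numbers are hypothetical.

```python
import math

def correct_collider_bias(beta_c, se_c, beta_gh, se_gh, b_hat, se_b):
    """Subtract an estimated collider bias from a conditional SNP effect.

    beta_c, se_c   : conditional (covariate-adjusted) SNP effect on Y and its SE
    beta_gh, se_gh : SNP effects on each covariate and their SEs (lists)
    b_hat, se_b    : estimated bias slopes (e.g. from an MVMR method) and their SEs
    Assumed model (not quoted from the paper): beta_c = beta_GY + sum_k b_k * beta_GH_k.
    """
    corrected = beta_c - sum(b * g for b, g in zip(b_hat, beta_gh))
    variance = se_c ** 2
    for b, sb, g, sg in zip(b_hat, se_b, beta_gh, se_gh):
        # First-order terms only; every covariance is treated as zero (simplification).
        variance += (b * sg) ** 2 + (g * sb) ** 2
    return corrected, math.sqrt(variance)

# One covariate, illustrative numbers only.
corrected, se_corrected = correct_collider_bias(
    beta_c=-0.25, se_c=0.013, beta_gh=[0.5], se_gh=[0.015], b_hat=[-0.5], se_b=[0.02]
)
```

In this example the corrected estimate returns to 0, while the standard error grows from 0.013 to about 0.018. Each additional covariate adds further non-negative terms to the variance, which is in line with the longer error bars observed after correction when p2 is larger.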

Fig 3. Mean estimates of the effects of SNPs having collider bias with ρ = 0 and 30% invalid IVs.

Horizontal coordinates are for the true effects, vertical coordinates are for the estimated effects. Vertical bars are the means of standard errors averaged over 1000 repetitions. In the legends, “H only” means the SNPs affecting only the covariates; “H and Y” means the SNPs affecting both the covariates and outcome. (a)-(b): p2 = 1; (c)-(d): p2 = 2; (e)-(f): p2 = 4.

In S1 Text, Tables P-R present a summary of mean effect estimates, empirical type-I error rates and power for randomly selected SNPs exhibiting collider bias. Sample standard deviation (SD) and mean standard error (Mean SE) are provided in parentheses. Without bias correction, the empirical type-I error rates were dramatically inflated due to collider bias. All MVMR methods effectively addressed the bias, thus reducing type-I errors. For any single SNP, its estimation accuracy was improved, and the effect estimates were closer to the true value after bias correction. For example, in Table Q in S1 Text, when ρ = 0, the effect estimate of the first SNP, whose true effect was 0, changed from −3.54 to 0.09 after MVMR-cML-bias-correction, but with the standard error increased from 0.22 to 0.34. Correspondingly, the type-I error rate decreased from 1 to 0.02; other MVMR methods produced similar results. For the SNPs affecting both the covariates and outcome, most MVMR methods provided a more accurate effect estimate after bias correction. For example, in Table R in S1 Text, when ρ = 0.5, the effect estimate of the third SNP (whose true effect was −0.78) was −0.3 before correction; after MVMR-Lasso-bias-correction, it was −0.55.

Estimation challenges with more covariates

Table 2 provides the mean estimates of b for each MVMR method. As mentioned earlier, all MVMR methods yielded relatively accurate estimates of b. However, when p2 = 4, even though the estimation bias for each element was small, the collective estimation error of the entire vector b^, as well as that of β^GiH, might be large. Consequently, the effect estimates of some SNPs were inaccurate after bias correction. For example, in Table R in S1 Text, when ρ = 0, the mean effect estimate of the fourth SNP (whose true value was 0.65) was 0.59 before correction, and became 1.09 after MVMR-Egger-bias-correction. When p2 = 1, Egger regression had a larger standard error of b^ than other methods. This is consistent with the previous literature: Egger regression with the default coding yielded a larger variance for the causal estimate, compared to other methods, such as IVW regression [21].

Other simulation results

In Section F in S1 Text, additional simulation results are presented. When the analysis involved a higher proportion of invalid IVs (50%), or when the InSIDE assumption was violated, certain MVMR methods, such as MVMR-Egger and MVMR-IVW, generated a slightly biased estimate of b and were unable to completely eliminate collider bias for a few SNPs, especially when p2 = 4. Note that, if the variance estimator (14), instead of the correct and default one, was used for MVMR-cML, the results were similar (shown in Table L and Table M in S1 Text). When 50% invalid IVs were used, some methods, such as MVMR-cML, tended to underestimate the standard errors of b^ [14], and hence the standard errors of the bias-corrected estimator β^GiY, as shown in Table Y in S1 Text. In the scenario without pleiotropy (Section C in S1 Text), a small p-value threshold of 5e−8 was utilized for IV selection, so only valid IVs were employed for estimation. Consequently, all MVMR methods performed equally well and successfully mitigated collider bias. Following the reviewers’ suggestions, we have included two additional simulations in Section D and Section E in S1 Text. The first simulation illustrates a “null” scenario without collider bias, wherein the mean effect estimates remained unchanged after bias correction. The second simulation demonstrates the inadequacy of UVMR in mitigating the collider bias induced by two covariates.

Applications

We applied our bias-correction approach to two previous GWAS applications. In the first, we examined a GWAS of waist-to-hip ratio (WHR) adjusted for body mass index (BMI) [11]. In the second, we considered a GWAS of BMI, utilizing principal components (PCs) of metabolomic variables as heritable covariates [5]. The GWAS results were derived from individual-level UK Biobank (UKB) data using the same subsample. By applying our bias-correction method to these GWAS applications, we aimed to assess the impact of collider bias and thus enhance the reliability of estimated SNP effects.

We followed the same data cleaning procedure described previously [5]. The analysis was performed on a dataset comprising self-reported, unrelated White individuals. For the genotype data, we first removed individuals who were identified as outliers for heterozygosity or had a high rate of missing data. Additionally, individuals with abnormal numbers of sex chromosomes and those with inconsistent self-reported sex and genetic sex were also excluded. Next, genetic variants with a minor allele frequency less than 0.01, a missing genotype rate exceeding 0.05, or failing the Hardy-Weinberg equilibrium test at a p-value threshold of 1e−6 were removed. After the data cleaning process, missing values were imputed using the corresponding mean values. Following the data cleaning steps, the principal components (PCs) for the 249 metabolomic biomarkers were calculated for the GWAS of BMI. The resulting dataset comprised approximately 500,000 SNPs and 100,000 individuals. All features were standardized prior to the GWAS. Moreover, the two phenotypes, WHR and BMI, underwent inverse-rank normalization to facilitate analysis [5]. For BMI, following a previous study [5], significant SNPs were identified using the conventional p-value threshold of 5e−8 and mapped to 1,703 independent linkage disequilibrium (LD) blocks [25], with each treated as an independent genomic locus.
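
For readers unfamiliar with inverse-rank normalization, a minimal sketch follows. It assumes the common rank-based inverse normal transform with a Blom-type offset of 3/8 and no tied values; neither the offset nor the tie handling is specified in this paper, so both are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist

def inverse_rank_normalize(values, offset=0.375):
    """Map values to normal quantiles of their ranks (assumes no ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest
    standard_normal = NormalDist()
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # Rank r is mapped to the quantile (r - offset) / (n - 2 * offset + 1).
        transformed[idx] = standard_normal.inv_cdf((rank - offset) / (n - 2 * offset + 1))
    return transformed

phenotype = [3.0, 1.0, 2.0]          # illustrative values only
normalized = inverse_rank_normalize(phenotype)
```

The transform keeps the ordering of individuals but forces the phenotype onto an approximately standard normal scale, which limits the influence of outlying measurements in the subsequent linear regressions.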

To select (approximately) independent IVs, we applied a pruning procedure using a window size of 50 SNPs and an LD threshold of r2 = 0.001. From the remaining SNPs, we selected relevant IVs for the covariates based on a significance threshold of 5e−10. We then obtained b^ using different MVMR methods and performed the bias correction for each SNP. Initially, different methods gave different results. We therefore performed a sensitivity analysis by leaving one IV out at a time and identified some influential SNPs, most of which were also identified as invalid IVs by MVMR-cML. After removing these influential IVs, most or all MVMR methods gave consistent results.
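
The leave-one-out idea can be sketched with a deliberately simple estimator. The block below uses a univariable inverse-variance weighted (IVW) slope through the origin as a stand-in for the MVMR methods; it is not MVMR-cML, the function names are hypothetical, and the input numbers are invented for illustration.

```python
def ivw_slope(beta_gh, beta_gy_cond, se_gy_cond):
    """IVW slope of conditional SNP-outcome effects on SNP-covariate effects (no intercept)."""
    num = sum(x * y / s ** 2 for x, y, s in zip(beta_gh, beta_gy_cond, se_gy_cond))
    den = sum(x * x / s ** 2 for x, s in zip(beta_gh, se_gy_cond))
    return num / den

def leave_one_out(beta_gh, beta_gy_cond, se_gy_cond):
    """Return the full-data slope and the shift in the slope when each IV is dropped."""
    full = ivw_slope(beta_gh, beta_gy_cond, se_gy_cond)
    shifts = []
    for k in range(len(beta_gh)):
        keep = [i for i in range(len(beta_gh)) if i != k]
        est = ivw_slope([beta_gh[i] for i in keep],
                        [beta_gy_cond[i] for i in keep],
                        [se_gy_cond[i] for i in keep])
        shifts.append(est - full)
    return full, shifts

# Three IVs consistent with a slope of -0.5 and one outlying IV (index 3).
beta_gh = [0.1, 0.2, 0.3, 0.4]
beta_gy_cond = [-0.05, -0.10, -0.15, 0.40]
se_gy_cond = [0.01, 0.01, 0.01, 0.01]
full, shifts = leave_one_out(beta_gh, beta_gy_cond, se_gy_cond)
most_influential = max(range(len(shifts)), key=lambda k: abs(shifts[k]))
```

An IV whose removal moves the estimate far more than the others is a candidate influential (and possibly invalid) instrument; how large a shift counts as influential is a judgment call that this sketch does not settle.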

GWAS of WHR with adjustment for BMI

In this section, we applied the bias-correction methods to a GWAS of WHR, adjusting for BMI as a single covariate [26]. The previous analysis of WHR utilized BMI as a covariate with the aim of uncovering SNPs influencing WHR through pathways distinct from those mediated by BMI [26]. However, it is important to note that this approach might introduce collider bias in the estimation of the direct effect of a SNP on WHR, primarily due to the heritability of BMI [6].

To obtain the GWAS summary data of WHR, we conducted a regression analysis using Eq (4). In our analysis, the covariate vector $J = (J_1, \ldots, J_{p_1})^T$ consisted of sex, age, and the top 10 genetic PCs provided by UKB [5]. The vector H became a scalar, BMI. The error term was denoted as ϵ. It is important to note that the conditional effects βJYC and βHYC could also be biased relative to their true values βJY and βHY, respectively, due to the unmeasured confounder U. As our bias-correction approach requires GWAS summary data of H, in this part, we also conducted a GWAS of BMI (with adjustment for J).

Table 3 provides the estimate b^ (a scalar), its standard error (SE), and the numbers of significant SNPs and loci produced by each method. For DHO, the “Hedges-Olkin” method was employed to reduce regression dilution [7], and the SE was estimated through data perturbation. With the exception of Egger regression, all methods yielded consistent results, where b^ was negative. We suspect that the positive value reported by Egger was incorrect for two reasons. First, Egger regression is not robust to correlated and directional pleiotropy [20], which might persist even after the removal of possibly invalid IVs. Second, the SE provided by Egger regression was extremely large compared to those of other methods, suggesting a substantial degree of uncertainty in the corresponding estimate. Therefore, we accepted the values given by other methods. The estimates of b aligned with those in the literature, where most MVMR approaches yielded a negative value close to 0 [11].

Table 3. Point estimates of b and number of significant loci obtained by applying different MVMR methods.

Standard errors (SE) are given in parentheses.

| | No correction | MVMR-cML | MVMR-Egger | MVMR-IVW | MVMR-Median | MVMR-Lasso | SH | DHO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| b^ | NA | −0.079 | 0.047 | −0.058 | −0.063 | −0.063 | −0.11 | −0.068 |
| (SE) | NA | (0.001) | (0.322) | (0.003) | (0.004) | (0.003) | (0.002) | (0.013) |
| # of significant SNPs | 315 | 269 | 11 | 260 | 260 | 260 | 209 | 229 |
| # of significant loci | 47 | 42 | 2 | 42 | 42 | 42 | 37 | 39 |

In Table 3, fewer loci were identified after correcting collider bias. For the 315 significant SNPs before adjustment, Fig 4 shows that their effect estimates did not change significantly after bias correction, suggesting that adjusting for BMI as a covariate did not introduce severe bias. The decrease in the number of significant SNPs was mainly due to the slightly inflated variances of the effect estimates. This conclusion is consistent with a previous study showing that adjusting for BMI in a GWAS of WHR introduced only minor collider bias [11]. For Egger regression, due to the large SE of b^, the variance of β^GiY was severely inflated after bias correction, leading to a dramatic decrease in power. Only 2 significant loci were identified.

Fig 4. Effect estimates of WHR before and after bias correction.

Horizontal and vertical bars represent 1 SE of an estimate before and after correction respectively. SEs are given in the right column. In the legends, “before” refers to the SNPs that were significant only before bias correction, “after” refers to the SNPs significant only after bias correction, “both” refers to the SNPs significant both before and after bias correction.

Fig 4 compares the SNP effect estimates before and after bias correction, along with their corresponding standard errors (SEs) as vertical or horizontal bars. Here we only include figures for MVMR-cML since all MVMR methods yielded similar figures; the complete set of results can be found in Section G.1 in S1 Text.

The Manhattan plots in Fig 5 depict the significant loci before and after MVMR-cML bias correction. The upper/lower panel represents the results after/before bias correction. The overall profiles of significant SNPs/loci before and after correction are similar, but fewer SNPs remain significant after correction.

Fig 5. Manhattan plot of WHR before (upper panel) and after (lower panel) applying MVMR-cML-bias-correction.

GWAS of BMI with adjustment for metabolomic PCs

We obtained the GWAS result of BMI using Eq (4) [5]. The vector H contained the principal components (PCs) derived from the 249 metabolomic variables. We employed PCs for dimension reduction since many of the metabolites exhibit high correlations [5]. J and ϵ were the same as in the previous example. This regression model for BMI was referred to as model 1 (M1) previously [5] and subsequently. To correct potential collider bias, the GWAS of metabolomic PCs was conducted using J as covariates.

Instead of seeking loci that act on BMI not through metabolic pathways, the previous study aimed to enhance the power of discovering marginal SNP effects by using omic data to account for environmental/residual variation. However, it is crucial to note that including metabolomic PCs can potentially introduce collider bias for marginal effects, given both genetic and environmental influences on metabolomic measurements. Table 4 presents the number of associated SNPs for the top 10 PCs at the genome-wide significance level of 5e − 8. To mitigate collider bias, we applied our bias-correction approach and compared the results with previous findings. Specifically, we explored scenarios where the dimension of the vector H, denoted as p2, varied from 1 to 5. In other words, we first fitted M1 with the top p2 metabolomic PCs and then adjusted possible collider bias induced by these p2 covariates using different MVMR methods. This allowed us to examine the results of including up to 5 metabolomic PCs in the analysis. Additionally, as suggested by previous research [5], we also provided results using the top 20 metabolomic PCs in Section G.2 in S1 Text.

Table 4. The proportion of the total variance explained by and the number of SNPs significantly associated with each metabolomic PC.

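| Metabolomic PC | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Proportion of variance | 12% | 10% | 7% | 5% | 3% | 3% | 3% | 3% | 2% | 2% |
| # of associated SNPs | 680 | 595 | 668 | 497 | 316 | 200 | 304 | 297 | 338 | 182 |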

To validate the identified loci for BMI, we followed a previously established validation approach [5] and utilized two GWAS summary datasets as validation data based on their larger sample sizes. Specifically, we used the UK Biobank (UKB) GWAS round 2 results (sample size 361,194) for partial validation, which were published by the Neale lab in 2018 (http://www.nealelab.is/uk-biobank). We also incorporated the summary results from a meta-analysis of GIANT and UKB data, which included around 700,000 samples [27].

Table 5 provides the estimates of b and their standard errors obtained by applying various methods on M1. Overall, the values were consistent. Therefore, different methods yielded similar estimates β^GiY after bias correction. We performed the conditional F-test for weak IVs when p2 > 1 [28], and all F-statistics were above the threshold value of 10, indicating no severe weak instrument bias. The set of complete results is available in Section G.2.2 in S1 Text.

Table 5. Point estimates of b when applying different MVMR methods on M1.

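Standard errors (SE) are given in parentheses.

| Dimension of b | 1 | 2 | 2 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 5 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| b^ | b^1 | b^1 | b^2 | b^1 | b^2 | b^3 | b^1 | b^2 | b^3 | b^4 | b^1 | b^2 | b^3 | b^4 | b^5 |
| MVMR-cML | 0.23 | 0.24 | −0.17 | 0.24 | −0.25 | 0.16 | 0.28 | −0.35 | 0.16 | 0.15 | 0.25 | −0.29 | 0.14 | 0.22 | 0.16 |
| (SE) | (0.03) | (0.03) | (0.05) | (0.03) | (0.05) | (0.03) | (0.03) | (0.07) | (0.03) | (0.04) | (0.04) | (0.06) | (0.03) | (0.04) | (0.04) |
| MVMR-Egger | 0.16 | 0.22 | −0.14 | 0.28 | −0.20 | 0.16 | 0.24 | −0.27 | 0.14 | 0.16 | 0.23 | −0.23 | 0.13 | 0.22 | 0.16 |
| (SE) | (0.08) | (0.06) | (0.05) | (0.05) | (0.05) | (0.03) | (0.06) | (0.06) | (0.03) | (0.04) | (0.05) | (0.06) | (0.03) | (0.04) | (0.04) |
| MVMR-IVW | 0.23 | 0.23 | −0.14 | 0.23 | −0.21 | 0.16 | 0.26 | −0.27 | 0.14 | 0.17 | 0.23 | −0.23 | 0.13 | 0.22 | 0.16 |
| (SE) | (0.03) | (0.03) | (0.05) | (0.03) | (0.05) | (0.03) | (0.04) | (0.06) | (0.03) | (0.04) | (0.03) | (0.06) | (0.02) | (0.04) | (0.04) |
| MVMR-Lasso | 0.26 | 0.23 | −0.14 | 0.23 | −0.21 | 0.16 | 0.26 | −0.27 | 0.14 | 0.17 | 0.24 | −0.24 | 0.13 | 0.23 | 0.16 |
| (SE) | (0.04) | (0.03) | (0.05) | (0.03) | (0.05) | (0.03) | (0.04) | (0.06) | (0.03) | (0.04) | (0.03) | (0.06) | (0.02) | (0.04) | (0.04) |
| MVMR-Median | 0.23 | 0.23 | −0.12 | 0.23 | −0.18 | 0.15 | 0.26 | −0.25 | 0.16 | 0.15 | 0.23 | −0.20 | 0.14 | 0.19 | 0.15 |
| (SE) | (0.03) | (0.04) | (0.08) | (0.04) | (0.08) | (0.05) | (0.05) | (0.10) | (0.05) | (0.06) | (0.05) | (0.09) | (0.04) | (0.06) | (0.05) |
| DHO | 0.16 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| (SE) | (0.01) | | | | | | | | | | | | | | |
| SH | 0.30 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| (SE) | (0.04) | | | | | | | | | | | | | | |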

Table 6 provides the numbers of significant SNPs and loci identified with and without bias correction. In general, we observed that the number of loci decreased after the correction. This reduction could be attributed to two main reasons. First, some SNPs were identified as significant in M1 due to collider bias, as certain metabolomic PCs were associated with these SNPs. For instance, when only the first metabolomic PC was involved, out of the 306 SNPs identified as significant in M1, 99 became non-significant after MVMR-cML-bias-correction. Among these 99 SNPs, 58 were associated with the first metabolomic PC. Similarly, when two metabolomic PCs were included, M1 identified 483 significant SNPs, but after correction by MVMR-cML, 316 of them were no longer significant. Among these 316 SNPs, 170 were associated with at least one of the two metabolomic PCs. As p2 increased, the number of non-significant SNPs after correction that were associated with metabolomic PCs also increased. For example, when p2 = 5, out of 558 SNPs identified by M1, 417 were no longer significant after correction by MVMR-cML, and among these 417 SNPs, 267 were associated with metabolomic PCs. Second, any bias-correction method yielded a larger variance of β^GiY, leading to a reduction in power. On the other hand, our bias-correction method also uncovered some significant SNPs that were missed without bias correction. For instance, when 5 metabolomic PCs were included and MVMR-cML was used to obtain b^, 44 SNPs were identified as significant only after the bias correction. Overall, for each number of covariates, all MVMR methods yielded similar numbers of significant loci. In Table AC in S1 Text, we show the results when 6 to 10 metabolomic PCs were adjusted: few or no loci were identified after bias correction, possibly due to variance inflation.
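
As a quick consistency check on the counts quoted above for p2 = 5 with MVMR-cML, the number of SNPs significant after correction equals the number significant in M1, minus those that lost significance, plus those that became significant only after correction:

$$\underbrace{558}_{\text{significant in M1}} - \underbrace{417}_{\text{no longer significant}} + \underbrace{44}_{\text{significant only after correction}} = 185,$$

which matches the number of significant SNPs reported for MVMR-cML with 5 metabolomic PCs in Table 6.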

Table 6. Number of significant loci after applying different bias-correction methods on M1.

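| # of metabolomic PCs in analysis | Measure | No correction | MVMR-cML | MVMR-Egger | MVMR-Lasso | MVMR-Median | MVMR-IVW | DHO | SH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | # of significant SNPs | 306 | 221 | 220 | 223 | 219 | 223 | 219 | 214 |
| 1 | # of significant loci | 49 | 33 | 34 | 35 | 33 | 35 | 34 | 30 |
| 1 | UKB validation | 45 | 33 | 34 | 35 | 33 | 35 | 34 | 30 |
| 1 | other validation | 42 | 30 | 31 | 32 | 30 | 32 | 31 | 29 |
| 2 | # of significant SNPs | 483 | 192 | 193 | 194 | 179 | 192 | NA | NA |
| 2 | # of significant loci | 60 | 32 | 31 | 32 | 28 | 32 | NA | NA |
| 2 | UKB validation | 45 | 32 | 31 | 32 | 28 | 32 | NA | NA |
| 2 | other validation | 46 | 29 | 28 | 29 | 26 | 29 | NA | NA |
| 3 | # of significant SNPs | 468 | 195 | 187 | 196 | 180 | 196 | NA | NA |
| 3 | # of significant loci | 51 | 30 | 28 | 30 | 27 | 30 | NA | NA |
| 3 | UKB validation | 42 | 30 | 28 | 30 | 27 | 30 | NA | NA |
| 3 | other validation | 43 | 28 | 26 | 27 | 25 | 27 | NA | NA |
| 4 | # of significant SNPs | 553 | 195 | 191 | 190 | 174 | 190 | NA | NA |
| 4 | # of significant loci | 62 | 29 | 28 | 27 | 26 | 27 | NA | NA |
| 4 | UKB validation | 44 | 29 | 28 | 27 | 26 | 27 | NA | NA |
| 4 | other validation | 46 | 27 | 25 | 25 | 24 | 25 | NA | NA |
| 5 | # of significant SNPs | 558 | 185 | 183 | 183 | 169 | 183 | NA | NA |
| 5 | # of significant loci | 63 | 24 | 23 | 23 | 23 | 23 | NA | NA |
| 5 | UKB validation | 45 | 14 | 24 | 22 | 23 | 22 | NA | NA |
| 5 | other validation | 47 | 23 | 22 | 22 | 23 | 22 | NA | NA |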

Fig 6 compares the SNP effect estimates before and after bias correction, along with their corresponding standard errors as vertical or horizontal bars. We only include MVMR-cML since all MVMR methods produced similar figures, and we only present the results for p2 = 1, 2, or 5 as representative cases. The complete results can be found in Section G.2 in S1 Text.

Fig 6. Effect estimates (in M1) of BMI before and after bias correction.

Horizontal and vertical bars represent 1 SE of an estimate before and after correction respectively. SEs are given in the right column. (a)-(b):1 metabolomic PC is used. (c)-(d): 2 metabolomic PCs are used. (e)-(f): 5 metabolomic PCs are used. In the legends, “before” refers to the SNPs that are significant only before bias correction, “after” refers to the SNPs that are significant only after bias correction, “both” refers to the SNPs that are significant both before and after bias correction.

Fig 6(a) shows the results when only the first metabolomic PC was included. For the SNPs that were significant only before the correction, our bias-correction approach pulled their effect estimates closer to 0, indicating that collider bias may have led to false positive results for these SNPs. Fig 6(c) and 6(e) demonstrate that as more covariates were included, a greater number of SNPs became non-significant after the correction. In each figure, many points had their effect estimates closer to 0 after the correction. For the SNPs that were significant both with and without the bias correction, their effect estimates were hardly affected by collider bias and remained the same after the correction. When only one covariate was adjusted, the SEs of β^GiY increased only slightly after bias correction: as shown in Fig 6(b), all points are close to the identity line. However, as more covariates/PCs were included, bias-correction led to much larger SEs.

The Manhattan plots in Fig 7 illustrate the significant loci before and after bias correction by MVMR-cML. An upper/lower panel represents the results after/before bias correction. After correction, the p-values were generally less significant. As more metabolomic PCs were included, fewer significant SNPs and loci remained after bias correction.

Fig 7. Manhattan plot of BMI before (upper panel) and after (lower panel) applying MVMR-cML bias correction (in M1).

(a): one metabolomic PC is adjusted, (b): two metabolomic PCs are adjusted, (c): five metabolomic PCs are adjusted.

Results after removing genetic components of covariates

To mitigate collider bias, one strategy is to remove the genetic components from the covariates before their inclusion in GWAS analysis [5]. We removed the genetic components of the metabolomic PCs and used the corresponding residuals as the covariates in the GWAS of BMI. Then we applied the bias-correction methods, where each MVMR method yielded different estimates of b due to the presence of weak instrument bias. This bias arose because, after removing genetic components, few or no SNPs were strongly associated with the covariates. The conditional F-test (shown in Section G.2.2 in S1 Text) also indicated the presence of weak instruments. However, it is noteworthy that different MVMR methods consistently produced similar results after bias correction. As shown in Section G.2.1 in S1 Text, the effect estimates in such a model were close to each other before and after bias correction; in other words, there was little or no collider bias. On the other hand, as before, bias correction for more covariates led to more inflated variances and possible loss of power.

Discussion

In this study, we have addressed the general issue of collider bias in conditional GWAS analysis when multiple heritable covariates are included or adjusted for. We have proposed an extension of an existing method [11] which, like other existing methods, can handle only one covariate. Our derivation demonstrates that estimating the potential collider bias corresponds to a multivariable instrumental effect regression problem, allowing for the application of any valid MVMR method. We have specifically employed MVMR-cML because of its robustness to invalid IVs violating any or all of the three valid IV assumptions [14]. More importantly here, because of its constrained maximum likelihood framework, we can derive its various distributional characteristics, including correlations among various statistics, in the presence of overlapping samples across the GWAS of the outcome and of the heritable covariates. In contrast, it is unknown how to do so with other MVMR methods. Thus, our method with MVMR-cML is applicable to GWAS summary data for both the covariates and the trait of interest in a 1-sample, 2-sample or overlapping-sample setting. Through extensive simulations, we compared our method to other state-of-the-art MVMR methods, allowing for the inclusion of invalid IVs violating one of the three valid IV assumptions (i.e. Assumption A3, but not A2). The results indicated that most MVMR methods can reduce collider bias under different scenarios. These findings align with previous conclusions in the literature [11, 14], highlighting the effectiveness of our approach in eliminating or reducing collider bias. By incorporating our bias-correction approach into GWAS regression analyses, researchers can mitigate the potential bias introduced by including multiple covariates, obtaining less biased estimates of SNP-trait associations. However, as more covariates are included in the analysis, estimating and thus correcting the collider bias becomes more challenging due to increasing estimation errors and uncertainties.
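The idea that bias estimation reduces to a multivariable regression of SNP effects can be sketched on simulated summary statistics. The sketch below rests on several assumptions that are not taken from this paper: the covariate-adjusted SNP-trait effect is modeled as the direct effect plus a linear combination `b` of the SNP-covariate effects (the sign convention is assumed), the SNP-covariate effects are treated as known without error, and plain least squares is swapped in for MVMR-cML, so it offers no robustness to correlated pleiotropy. All variable names and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, p = 500, 3                       # SNPs used as instruments, covariates

b_true = np.array([0.3, -0.2, 0.1])      # assumed bias slopes, one per covariate
beta_GH = rng.normal(0.0, 0.05, size=(n_snps, p))   # SNP-covariate effects

# Sparse direct SNP-trait effects: about 10% of SNPs act as invalid instruments.
beta_GY = np.where(rng.random(n_snps) < 0.1, rng.normal(0.0, 0.02, n_snps), 0.0)

# Covariate-adjusted GWAS estimates = direct effect + collider bias + noise.
beta_adj = beta_GY + beta_GH @ b_true + rng.normal(0.0, 0.005, n_snps)

# Estimate b by regressing adjusted effects on SNP-covariate effects (no intercept).
b_hat, *_ = np.linalg.lstsq(beta_GH, beta_adj, rcond=None)

# Bias-corrected SNP-trait effects.
beta_corrected = beta_adj - beta_GH @ b_hat

err_before = np.abs(beta_adj - beta_GY).mean()
err_after = np.abs(beta_corrected - beta_GY).mean()
print("estimated b:", b_hat)
print("mean abs error before/after:", err_before, err_after)
```

Because every additional covariate adds one more slope to estimate from the same set of SNPs, the uncertainty in the estimated slopes propagates into each corrected effect, which is consistent with the variance inflation noted for larger numbers of covariates.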

In this paper, our main focus has been on correcting collider bias in conditional analyses of GWAS when adjusting for one or more heritable covariates. There are also wide-ranging applications in mediation analysis with one or more heritable covariates as potential mediators, where hidden confounding and thus collider bias are often ignored [29–38]. As mentioned by one reviewer, one limitation of our study is the restriction to linear models with quantitative traits and covariates. For binary traits, logistic regression is typically employed. While a logistic regression model can be well approximated by a linear regression model in marginal GWAS analysis of SNPs because of their small effect sizes, the approximation may be inadequate in conditional GWAS analysis with covariates of potentially large effects. Consequently, our bias-correction method cannot be directly applied in the latter case. It remains open how to extend our proposed bias-correction approach to accommodate binary traits. Furthermore, it is important to note that collider bias can arise in other scenarios. For instance, when the trait of interest is a precursor of, rather than subsequent to, a heritable disease, conditioning on disease incidence may bias the estimation of the direct SNP effects, since there are shared confounders between the trait and the disease [7]. One specific example is that, since BMI is a cause of type-2 diabetes, a SNP causing type-2 diabetes may have a biased association with BMI when studied within a case-only sample of type-2 diabetes [7]. In addition, participation in a GWAS of disease prognosis is often conditional on survival until the time of recruitment, and possibly on other health conditions, but there may be unknown common causes of survival and prognosis that create further biases [7]. It would be useful to extend our method to investigate these problems.
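The case-only example can be made concrete with a toy simulation. This is a minimal sketch assuming a hypothetical liability-threshold model: the effect sizes, the threshold, the allele frequency and the helper `slope` are illustrative choices, not estimates from any real study. The SNP has no effect on BMI in the population, yet a negative SNP-BMI association appears once the sample is restricted to cases.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

G = rng.binomial(2, 0.3, size=n).astype(float)   # SNP with no effect on BMI
bmi = rng.normal(size=n)                          # standardized BMI, independent of G

# Hypothetical liability model: both the SNP and BMI raise disease risk.
liability = 0.5 * G + 0.5 * bmi + rng.normal(size=n)
case = liability > 1.5                            # disease status (the collider)

def slope(x, y):
    """OLS slope of y on x."""
    return np.polyfit(x, y, 1)[0]

slope_all = slope(G, bmi)                  # close to zero in the full sample
slope_cases = slope(G[case], bmi[case])    # negative among cases only
print("SNP-BMI slope, full sample:", slope_all)
print("SNP-BMI slope, cases only:", slope_cases)
```

Among cases, carrying more risk alleles means that a lower BMI suffices to cross the liability threshold, which is what induces the spurious negative association.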

Supporting information

S1 Text. Supplementary file with theory and derivations/proofs, more simulation results and additional real data analysis results.

(PDF)

pgen.1011246.s001.pdf (17.4MB, pdf)

Acknowledgments

This study was supported by the Minnesota Supercomputing Institute at the University of Minnesota.

Data Availability

One needs to apply to UK Biobank (https://www.ukbiobank.ac.uk/) for approval to access the individual-level data used here. All public datasets used or obtained in our simulations and real data analyses are available from the Zenodo repository (https://doi.org/10.5281/zenodo.10947055). The code to reproduce the results of this study is available at https://github.com/peiyao2017/MV-cML-bias-Adjustment.

Funding Statement

This research was supported by NIH grants R01 AG065636 (to PW, ZL, HX, WP), R01 AG069895 (WP), RF1 AG067924 (WP), U01 AG073079 (PW, WP), R01 HL116720 (WP), and R01 GM126002 (WP). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: Biology, function, and translation. Am J Hum Genet. 2017;101(1):5–22. doi: 10.1016/j.ajhg.2017.06.005
  • 2. Wang Z, Emmerich A, Pillon NJ, Moore T, Hemerich D, Cornelis MC, et al. Genome-wide association analyses of physical activity and sedentary behavior provide insights into underlying mechanisms and roles in disease prevention. Nat Genet. 2022;54(9):1332–1344. doi: 10.1038/s41588-022-01165-1
  • 3. Giacomini KM, Yee SW, Mushiroda T, Weinshilboum RM, Ratain MJ, Kubo M. Genome-wide association studies of drug response and toxicity: an opportunity for genome medicine. Nat Rev Drug Discov. 2017;16(1):1. doi: 10.1038/nrd.2016.234
  • 4. Boehm FJ, Zhou X. Statistical methods for Mendelian randomization in genome-wide association studies: A review. Comput Struct Biotechnol J. 2022;20:2338–2351. doi: 10.1016/j.csbj.2022.05.015
  • 5. Lin Z, Knutson KA, Pan W. Leveraging omics data to boost the power of genome-wide association studies. HGG Adv. 2022;3(4):100144. doi: 10.1016/j.xhgg.2022.100144
  • 6. Aschard H, Vilhjálmsson BJ, Joshi AD, Price AL, Kraft P. Adjusting for heritable covariates can bias effect estimates in genome-wide association studies. Am J Hum Genet. 2015;96(2):329–339. doi: 10.1016/j.ajhg.2014.12.021
  • 7. Dudbridge F, Allen RJ, Sheehan NA, Schmidt AF, Lee JC, Jenkins RG, et al. Adjustment for index event bias in genome-wide association studies of subsequent events. Nat Commun. 2019;10(1):1561. doi: 10.1038/s41467-019-09381-w
  • 8. Munafò MR, Tilling K, Taylor AE, Evans DM, Davey Smith G. Collider scope: when selection bias can substantially influence observed associations. Int J Epidemiol. 2018;47(1):226–235. doi: 10.1093/ije/dyx206
  • 9. Mitchell RE, Hartley AE, Walker VM, Gkatzionis A, Yarmolinsky J, Bell JA, et al. Strategies to investigate and mitigate collider bias in genetic and Mendelian randomisation studies of disease progression. PLoS Genet. 2023;19(2):e1010596. doi: 10.1371/journal.pgen.1010596
  • 10. Gkatzionis A, Burgess S. Contextualizing selection bias in Mendelian randomization: how bad is it likely to be? Int J Epidemiol. 2019;48(3):691–701. doi: 10.1093/ije/dyy202
  • 11. Cai S, Hartley A, Mahmoud O, Tilling K, Dudbridge F. Adjusting for collider bias in genetic association studies using instrumental variable methods. Genet Epidemiol. 2022;46(5):1–14. doi: 10.1002/gepi.22455
  • 12. Mahmoud O, Dudbridge F, Davey Smith G, Munafo M, Tilling K. A robust method for collider bias correction in conditional genome-wide association studies. Nat Commun. 2022;13(1):619. doi: 10.1038/s41467-022-28119-9
  • 13. Gilbody J, Borges MC, Smith GD, Sanderson E. Multivariable MR can mitigate bias in two-sample MR using covariable-adjusted summary associations. medRxiv. 2022.
  • 14. Lin Z, Xue H, Pan W. Robust multivariable Mendelian randomization based on constrained maximum likelihood. Am J Hum Genet. 2023;110(4):592–605. doi: 10.1016/j.ajhg.2023.02.014
  • 15. Zhu Z, Zheng Z, Zhang F, Wu Y, Trzaskowski M, Maier R, et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nat Commun. 2018;9(1):224. doi: 10.1038/s41467-017-02317-2
  • 16. Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh PR, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47(11):1236–1241. doi: 10.1038/ng.3406
  • 17. Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Schizophrenia Working Group of the Psychiatric Genomics Consortium, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–295. doi: 10.1038/ng.3211
  • 18. Burgess S, Davies NM, Thompson SG. Bias due to participant overlap in two-sample Mendelian randomization. Genet Epidemiol. 2016;40(7):597–608. doi: 10.1002/gepi.21998
  • 19. Rees JMB, Wood AM, Burgess S. Extending the MR-Egger method for multivariable Mendelian randomization to correct for both measured and unmeasured pleiotropy. Stat Med. 2017;36(29):4705–4718. doi: 10.1002/sim.7492
  • 20. Grant AJ, Burgess S. Pleiotropy robust methods for multivariable Mendelian randomization. Stat Med. 2021;40(26):5813–5830. doi: 10.1002/sim.9156
  • 21. Lin Z, Pan I, Pan W. A practical problem with Egger regression in Mendelian randomization. PLoS Genet. 2022;18(5):e1010166. doi: 10.1371/journal.pgen.1010166
  • 22. Bowden J, Del Greco M F, Minelli C, Davey Smith G, Sheehan NA, Thompson JR. Assessing the suitability of summary data for two-sample Mendelian randomization analyses using MR-Egger regression: the role of the I2 statistic. Int J Epidemiol. 2016;45(6):1961–1974. doi: 10.1093/ije/dyw220
  • 23. Li T, Ning Z, Shen X. Improved estimation of phenotypic correlations using summary association statistics. Front Genet. 2021;12:665252. doi: 10.3389/fgene.2021.665252
  • 24. Lin Z, Xue H, Pan W. Combining Mendelian randomization and network deconvolution for inference of causal networks with GWAS summary data. PLoS Genet. 2023;19(5):e1010762. doi: 10.1371/journal.pgen.1010762
  • 25. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32(2):283–285. doi: 10.1093/bioinformatics/btv546
  • 26. Pulit SL, Stoneman C, Morris AP, Wood AR, Glastonbury CA, Tyrrell J, et al. Meta-analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum Mol Genet. 2019;28(1):166–174. doi: 10.1093/hmg/ddy327
  • 27. Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of European ancestry. Hum Mol Genet. 2018;27(20):3641–3649. doi: 10.1093/hmg/ddy271
  • 28. Sanderson E, Spiller W, Bowden J. Testing and correcting for weak and pleiotropic instruments in two-sample multivariable Mendelian randomization. Stat Med. 2021;40(25):5434–5452. doi: 10.1002/sim.9133
  • 29. Schaid DJ, Sinnwell JP. Penalized models for analysis of multiple mediators. Genet Epidemiol. 2020;44(5):408–424. doi: 10.1002/gepi.22296
  • 30. Sohn MB, Li H. Compositional mediation analysis for microbiome studies. Ann Appl Stat. 2019;13(1):661–681. doi: 10.1214/18-AOAS1210
  • 31. Zhang H, Chen J, Li Z, Liu L. Testing for Mediation Effect with Application to Human Microbiome Data. Stat Biosci. 2021;13(2):313–328. doi: 10.1007/s12561-019-09253-3
  • 32. Yang T, Niu J, Chen H, Wei P. Estimation of total mediation effect for high-dimensional omics mediators. BMC Bioinformatics. 2021;22(1):414. doi: 10.1186/s12859-021-04322-1
  • 33. Zhang J, Wei Z, Chen J. A distance-based approach for testing the mediation effect of the human microbiome. Bioinformatics. 2018;34(11):1875–1883. doi: 10.1093/bioinformatics/bty014
  • 34. Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing mediation effects in high-dimensional epigenetic studies. Front Genet. 2019;10:1195. doi: 10.3389/fgene.2019.01195
  • 35. Huang YT, Pan WC. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics. 2016;72(2):402–413. doi: 10.1111/biom.12421
  • 36. Liu Z, Shen J, Barfield R, Schwartz J, Baccarelli AA, Lin X. Large-scale hypothesis testing for causal mediation effects with applications in genome-wide epigenetic studies. J Am Stat Assoc. 2022;117(537):67–81. doi: 10.1080/01621459.2021.1914634
  • 37. Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213. doi: 10.1080/01621459.2020.1765785
  • 38. Hou L, Yu Y, Sun X, Liu X, Yu Y, Li H, et al. Causal mediation analysis with multiple causally non-ordered and ordered mediators based on summarized genetic data. Stat Methods Med Res. 2022;31(7):1263–1279. doi: 10.1177/09622802221084599

Decision Letter 0

Michael P Epstein, Xiang Zhou

25 Jan 2024

Dear Dr Pan,

Thank you very much for submitting your Research Article entitled 'Collider Bias Correction for Multiple Covariates in GWAS Using Robust Multivariable Mendelian Randomization' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool.  PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Xiang Zhou, Ph.D.

Academic Editor

PLOS Genetics

Michael Epstein

Section Editor

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The manuscript outlines a novel approach for estimating collider bias in GWAS, utilizing a robust multivariable Mendelian Randomization (MVMR) technique that's based on constrained maximum likelihood, referred to as MVMR-cML. The objective of MVMR-cML is to infer the causal effects of multiple exposures on the outcome in scenarios where the instrumental variables (IVs) are invalid and correlated pleiotropy exists. I have some concerns that are listed below to further enhance the manuscript.

1. In Figure 2, H1 is both the mediator in the pathway G→H1→Y and a collider in the pathway G→H1←U→Y. I wonder how this can be determined in practice, that is, how can we test whether H1 plays both roles as those depicted in Figure 2?

2. I suggest that the authors conduct additional simulations to compare the performance of the traditional GWAS and the proposed bias-correction method when H is not a collider.

3. If H1 is a mediator, it should not be accounted for in GWAS analysis. In addition, if we remove the edge H→Y and render H1 a complete collider, it is generally not recommended to adjust for it in GWAS. Thus, Figure 2 is confusing.

4. In Figure 4, the effect estimates of WHR before and after bias correction appear to be quite similar. I suggest that the authors try a different data set to highlight the advantage of the proposed method.

5. The manuscript attempts to utilize MVMR to correct collider bias in GWAS, but the depiction of direct effect or indirect effect in Figure 2 is confusing.

6. The symbols in the manuscript are highly confusing and most of them are not explained, which seriously impacts the readability of the article. For example, in formula (3), K represents variables, but in formula (9), K denotes the number of invalid IVs. Furthermore, in section 2.5, rho represents correlation, whereas in section 3.1, rho denotes the proportion of invalid instrumental variables. What does the sudden appearance of V in formula (1) represent? What are its practical implications? In the simulation settings, V disappears again. Is V a vector of covariates or SNPs?

7. If Y is a binary variable, such as diseases, does this method remain effective?

8. MVMR assumes the IV is associated with at least one of H, and the SNP directly associated with Y is invalid SNP. GWAS aims to find the SNPs directly associated with Y, but MVMR-cML assume the plurality condition, and in formula (9) the number of invalid IV is constrained to K. How to interpret this?

9. How should this method be used in practice? Was this method applied to thousands of SNPs one by one? What is the computational time?

10. I strongly suggest that the authors reorganize the introduction of the manuscript to further highlight the significance and motivation of the proposed method.

Reviewer #2: This paper develops instrument variables correction for collider bias when there are multiple colliders, such as GWAS when conditioning on multiple heritable covariates. This extends earlier work that corrected for a single collider such as disease incidence. The authors show that it works well in conjunction with their own MVMR-cML approach for multivariate Mendelian randomisation. This is a good contribution to the literature. I have the following comments.

1. In several places the authors state that their method can account for overlapping samples, and even say (section 2.4) that they have simulations confirming that previous methods have inflated type-1 errors in this case. However I could find no such simulations in the paper. Moreover, Dudbridge et al (2019) argued that their approach was robust to overlapping samples, and confirmed it in their simulations. This was further established by Barry et al. (PLoS Genet 2021). Please explain how your situation differs from theirs, and provide simulations to back up any new claims.

2. The authors claim to have derived the variance of their estimator, and that this goes beyond previous work. However, across all the results the new method’s SE is consistently overestimated compared to the empirical SD, and the type-1 errors are conservative, whereas previous methods appear to be well calibrated. So something isn’t right here.

3. In some places the superscript T means “transpose”, and in others “total” – this is confusing. See eg equations 2-3 and the text above.

4. P9 “The correlation parameters in \Sigma_j…” repeats earlier text.

5. The simulation used 1000 SNPs from chromosome 6. Why was this chromosome used, and was the HLA region avoided?

6. Section 3.1.1, I couldn’t understand why the type-1 errors were not correctly controlled, not just “better controlled”, especially at p2=4. I see that the simulation had invalid IVs, but the pleiotropy appears to be balanced with the InSIDE assumption valid, so the methods should all be OK. What is happening here? It would help to give a simulation under the assumptions of the method, showing that it does work in that case.

7. I was surprised that the authors have considered Egger regression, since the same authors have published a convincing refutation of that method.

8. P15 “This deviation was attributed to … more covariates”. This seems to attribute bias to an increase in sampling variance, which isn’t right.

9. Supp P6. Second line of proof, “negative likelihood” should be log likelihood.

Reviewer #3: The authors proposed using an MVMR approach (MVMR-cML) that they recently developed to perform collider bias correction in obtaining GWAS summary statistics. I think the idea is interesting and can potentially be more reliable than a univariate MR approach. However, I also have some concerns related to the setup of the framework, method details and data evaluation. Here are my major comments:

1. Foundation of the causal framework setup. In Figure 2, the authors draw the causal DAG and claim that the true parameter to estimate should be the direct effect of SNP j, \beta_{G_iY}. I think this is very confusing and misleading, as \beta_{G_iY} will never be the direct effect of SNP j even when there is no collider bias, as other SNPs are not jointly considered. Even without collider bias, the marginal associations between SNP j and Y without adjusting for other SNPs are never claimed to be any causal effect of SNP j. Figure 2 is also not qualified to be a DAG as only a single SNP is considered here. Given that what GWAS summary statistics provide are never any causal effect of SNP j, I think the target of the problem in the paper to adjust for collider bias is very vague. I do agree that collider bias exists but I'm only convinced that it should be adjusted when we want to identify causal SNPs using methods that jointly consider multiple SNPs such as SUSIE. I'm not convinced for GWAS summary statistics. Thus, I think the authors should clarify the setup and maybe give a few more concrete examples to address the necessity.

2. Another problem for the DAG in Figure 2 is that covariates such as population ancestry (maybe denoted as V in the paper?) that are confounders of U and G are not included. In the structural equations (1) the authors simplify that V is not correlated with U nor G. If V are population ancestries, this assumption will not be true and I'm not sure whether it will affect the subsequent calculations or not.

3. For the MVMR-cML method the authors claim that in equation (9), the number of invalid IVs K can be as large as l-p_2 -1. I'm not sure if this is possible. I think the slope b in (5) would not be identifiable if beta_{G_iY} is not sparse enough. I hope the authors can clarify more.

4. Since the authors need to work with individual GWAS data anyway, what is the benefit of using summary GWAS MR?

5. Related to the previous question, if the trait is binary, then a linear model might not be appropriate; is the method proposed in the paper still applicable?

6. To make the paper more convincing, I think the authors need to have some real data example to illustrate that univariate MR is not enough in practice.

A minor comment:

Section 2.1 the authors briefly mentioned mtCOJO. I think that part is very hard to follow without reading the mtCOJO paper and the authors should make the section more self-contained.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Frank Dudbridge

Reviewer #3: No

Decision Letter 1

Michael P Epstein, Xiang Zhou

17 Mar 2024

Dear Dr Pan,

Thank you very much for submitting your Research Article entitled 'Collider Bias Correction for Multiple Covariates in GWAS Using Robust Multivariable Mendelian Randomization' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Xiang Zhou, Ph.D.

Academic Editor

PLOS Genetics

Michael Epstein

Section Editor

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed all my comments.

Reviewer #2: The authors responded to my comments, and I now understand the paper better. Thank you.

There are a number of typos in the new text, and some remaining in the old. The most serious of which is "casually" instead of "causally".

Reviewer #3: The revised paper has resolved most of my previous concerns. I only have two remaining comments.

1. Foundation of the casual framework setup. I appreciate authors' effort in providing better description of the setup, but frankly speaking, it is still very confusing. The description now is a mixture of causal inference language and associations where the definitions are not coherent. If the direct effects that the authors care about are just conditional associations of a single SNP G on Y conditional on X, which has a clear definition by its own, seems that the casual concepts like mediation and collider bias, and the DAG in Fig1 are not relevant. The authors have also created new terms like direct association, indirect associations and spurious associations, which I don't think are clearly defined concepts and can be confusing. While the authors change from causal effects to different types of associations in the text, they still call models (1)-(3) on a single SNP as the "true causal model for each SNP G_i". These descriptions seem to be conflicting with each other.

I feel that the authors could describe the problem clearly by starting with a causal model considering all SNPs, discussing the collider bias and the mediation in that model, and then discussing how it affects the single-SNP associations. This might be beyond the scope of the paper. The authors may add the limitation of their framework in the discussion section.

2. For my previous point 3, I did not understand the authors' response. Do the authors mean that b is still identifiable when "invalid IVs K can be as large as l − p2 − 1"? How is this relevant to the "multivariable plurality condition"?

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: None

Reviewer #2: None

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Frank Dudbridge

Reviewer #3: No

Decision Letter 2

Michael P Epstein, Xiang Zhou

2 Apr 2024

Dear Dr Pan,

We are pleased to inform you that your manuscript entitled "Collider Bias Correction for Multiple Covariates in GWAS Using Robust Multivariable Mendelian Randomization" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Xiang Zhou, Ph.D.

Academic Editor

PLOS Genetics

Michael Epstein

Section Editor

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-23-01347R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Michael P Epstein, Xiang Zhou

16 Apr 2024

PGENETICS-D-23-01347R2

Collider Bias Correction for Multiple Covariates in GWAS Using Robust Multivariable Mendelian Randomization

Dear Dr Pan,

We are pleased to inform you that your manuscript entitled "Collider Bias Correction for Multiple Covariates in GWAS Using Robust Multivariable Mendelian Randomization" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


    Supplementary Materials

    S1 Text. Supplementary file with theory and derivations/proofs, more simulation results and additional real data analysis results.

    (PDF)

    pgen.1011246.s001.pdf (17.4MB, pdf)
    Attachment

    Submitted filename: author_one_to_one_response.pdf

    pgen.1011246.s002.pdf (194.9KB, pdf)
    Attachment

    Submitted filename: author_one_to_one_response.pdf

    pgen.1011246.s003.pdf (142.4KB, pdf)

    Data Availability Statement

    One needs to apply to UK Biobank (https://www.ukbiobank.ac.uk/) for approval to access the individual-level data used here. All public datasets used or obtained in our simulations and real data analyses are available from the Zenodo repository (https://doi.org/10.5281/zenodo.10947055). The code to reproduce the results of this study is available at https://github.com/peiyao2017/MV-cML-bias-Adjustment.


    Articles from PLOS Genetics are provided here courtesy of PLOS
