Briefings in Bioinformatics. 2026 Jan 7;27(1):bbaf649. doi: 10.1093/bib/bbaf649

A novel two-sample Mendelian randomization framework integrating common and rare variants: application to assess the effect of HDL-C on preeclampsia risk

Yu Zhang 1, Ming Li 2, David M Haas 3, C Noel Bairey Merz 4, Tsegaselassie Workalemahu 5, Kelli Ryckman 6, Janet M Catov 7,8, Lisa D Levine 9, Alexa Freedman 10, George R Saade 11, Jiaqi Hu 12, Hongyu Zhao 13, Xihao Li 14,15, Nianjun Liu 16, Qi Yan 17
PMCID: PMC12777983  PMID: 41499219

Abstract

Mendelian randomization (MR) has become an important technique for establishing causal relationships between risk factors and health outcomes. By using genetic variants as instrumental variables, it can mitigate bias due to confounding and reverse causation in observational studies. Current MR analyses have predominantly used common genetic variants as instruments, which represent only part of the genetic architecture of complex traits. Rare variants, which can have larger effect sizes and provide unique biological insights, have been understudied due to statistical and methodological challenges. We introduce MR-common and annotation-informed rare variants (MR-CARV), a novel framework integrating common and rare genetic variants in two-sample MR. This method leverages comprehensive genetic data made available by high-throughput sequencing technologies and large-scale consortia. Rare variants are aggregated into functional categories, such as gene-coding, gene-noncoding, and nongene regions, by leveraging variant annotations and biological impact as weights. The effects of rare variant sets are then estimated with STAARpipeline and combined with the estimated effects of common variants by the existing MR methods. Simulation studies demonstrate that MR-CARV maintains robust type I error and achieves higher statistical power, with up to a 66.3% relative increase compared with existing methods only based on common variants. Consistent with these findings, application to real data on high-density lipoprotein cholesterol (HDL-C) and preeclampsia showed that MR-CARV [inverse variance weighted (IVW)] yielded a more precise and statistically significant effect estimate (−0.020, SE = 0.0102, P = .0470) than IVW using only common variants (−0.023, SE = 0.0123, P = .0659).

Keywords: Mendelian randomization, rare variants, common variants, STAARpipeline, lipids, preeclampsia

Introduction

Mendelian randomization (MR) has emerged as a powerful tool for inferring causal relationships between risk factors and health outcomes, leveraging genetic variants as instrumental variables to circumvent the inherent issues of confounding and reverse causation in observational studies [1, 2]. Traditionally, MR analyses have primarily utilized common genetic variants as instruments, which account for a limited spectrum of the genetic architecture underlying complex diseases and traits. Rare variants, despite their low minor allele frequencies (MAFs) (<1% or <5%, depending on the threshold used), may have larger effect sizes and provide insights into novel biological mechanisms and pathways [3, 4], yet they have been largely overlooked in MR.

Integrating rare variants into two-sample MR analysis has been limited by several key challenges. First, individual rare variants contribute minimal variation in allele frequency across the population, leading to reduced statistical power in genome-wide association studies (GWAS) to detect associations with exposures [5]. Second, even when rare variants exhibit moderate associations with exposures, they often fail to pass stringent genome-wide significance thresholds, leading to their exclusion during instrument selection in conventional MR pipelines. Third, rare variants typically require aggregation strategies—such as burden tests or kernel methods—to capture their collective effects [6–8], which are not directly compatible with standard MR techniques that rely on independent, variant-level effect estimates. As a result, rare variants are rarely selected as instruments in two-sample MR studies [9–12], limiting the statistical power of MR analysis by overlooking an important aspect of the genetic architecture.

The advent of high-throughput whole-genome sequencing (WGS) technologies [13] and large-scale genomic consortia [14] has enabled the generation of extensive genetic data, including rare variants. These developments have driven the need for scalable statistical tools capable of leveraging rare variant information from WGS data. For instance, STAARpipeline [15] performs genome-wide rare variant association tests using gene-based burden [6] or kernel methods [7, 8] while incorporating functional annotations and controlling for population structure. As a result, functionally informed burden-level summary statistics for rare variants are becoming increasingly available. However, despite these advances, current MR methods are not designed to integrate burden-level rare variant statistics, as they typically rely on SNP-level GWAS summary data.

In this study, we introduce a novel two-sample MR framework that integrates both Common and Annotation-informed Rare Variants, referred to as MR-CARV. By using both common and rare variants as instruments, MR-CARV addresses limitations of existing MR methods and maximizes the utility of genetic data. Our simulation studies demonstrate that this new framework maintains the correct type I error rates and achieves higher statistical power than existing methods only based on common variants. We applied the MR-CARV framework to explore the causal relationship between lipid levels and preeclampsia, given lipid metabolism's critical role in pregnancy and its links to complications like preeclampsia [16–18]. Although lipid levels are known to be associated with pregnancy complications [19], whether these associations are causal has not been well established. Our analysis underscores the value of integrating rare variants in MR methods and offers new insights into the biological connection between lipids and preeclampsia.

Material and methods

Suppose we have $J$ common variants as valid instruments for evaluating the causal effect ($\beta$) of an exposure variable, $X$, on an outcome variable, $Y$. For the $j$th common variant, we obtain its estimated effect on the exposure, denoted as $\hat{\gamma}_j$, and its standard error (SE), $\sigma_{Xj}$, from one study. Similarly, we obtain its estimated effect on the outcome, denoted as $\hat{\Gamma}_j$, and its SE, $\sigma_{Yj}$, from another study that is independent of the first. There are various MR techniques designed to estimate $\beta$. In this study, we selected some commonly used methods and extended them using the MR-CARV framework.

Inverse variance weighted method

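For each common variant, $\beta$ is estimated as the ratio $\hat{\beta}_j = \hat{\Gamma}_j / \hat{\gamma}_j$ of its estimated outcome effect $\hat{\Gamma}_j$ to its estimated exposure effect $\hat{\gamma}_j$. The inverse variance weighted (IVW) method [20], one of the most widely used MR methods, combines these estimates weighted by the inverse of their variances. Its estimator [20] is given by:

$$\hat{\beta}_{\mathrm{IVW}} = \frac{\sum_{j=1}^{J} \hat{\gamma}_j \hat{\Gamma}_j \sigma_{Yj}^{-2}}{\sum_{j=1}^{J} \hat{\gamma}_j^{2} \sigma_{Yj}^{-2}} \tag{1}$$

with variance being

$$\mathrm{Var}\left(\hat{\beta}_{\mathrm{IVW}}\right) = \frac{1}{\sum_{j=1}^{J} \hat{\gamma}_j^{2} \sigma_{Yj}^{-2}} \tag{2}$$

where $\sigma_{Yj}$ is the SE of $\hat{\Gamma}_j$. This approach assumes that the genetic variants are valid instruments, namely, they are associated with the exposure, influence the outcome only through the exposure, and are not associated with any confounders of the exposure–outcome relationship.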
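As an illustration of the IVW calculation from summary statistics, the following is a minimal Python sketch. It assumes the commonly used first-order variance (the reciprocal of the weighted sum of squared exposure effects) and a normal approximation for the P-value; the function name `ivw` and the toy numbers are hypothetical and are not taken from the mr.carv package.

```python
import math

def ivw(gamma_hat, Gamma_hat, se_Y):
    """IVW estimate, SE, and two-sided P-value from per-instrument summary statistics.

    gamma_hat: estimated instrument-exposure effects
    Gamma_hat: estimated instrument-outcome effects
    se_Y:      standard errors of the instrument-outcome effects
    """
    num = sum(g * G / s**2 for g, G, s in zip(gamma_hat, Gamma_hat, se_Y))
    den = sum(g**2 / s**2 for g, s in zip(gamma_hat, se_Y))
    beta = num / den
    se = math.sqrt(1.0 / den)  # first-order variance (assumption)
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))  # two-sided normal P-value
    return beta, se, p

# Toy summary statistics, made up for illustration only:
# each outcome effect is exactly half the exposure effect.
gamma_hat = [0.10, 0.20, 0.15]
Gamma_hat = [0.05, 0.10, 0.075]
se_Y = [0.02, 0.02, 0.02]
beta_ivw, se_ivw, p_ivw = ivw(gamma_hat, Gamma_hat, se_Y)
```

Because every toy outcome effect is half the corresponding exposure effect, the pooled estimate recovers a causal effect of 0.5 regardless of the weights.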

In the following, we describe our proposed strategy to further integrate rare variants into the MR-CARV framework. Suppose there are $K$ rare variant sets and the $k$th rare variant set has $m_k$ rare variants. We combine rare variants within each set into a Burden variable using a weighted method as employed in STAAR [21]. We start by identifying $L$ functional annotations for rare variants, defined as variants with MAF <1% or <5%. These annotations provide insights into the biological relevance and potential impact of each variant, guiding the weighting process. The Burden variable, $B_{kl}$, for the $k$th rare variant set with the $l$th annotation is constructed by weighting the genotype of each rare variant within the set, $G_{ki}$, with the product of the functional annotation weight, $\hat{\pi}_{kil}$, and a weight based on the allele frequency, $w_{ki}$. As such, $B_{kl} = \sum_{i=1}^{m_k} \hat{\pi}_{kil} w_{ki} G_{ki}$. Following STAAR, we set $w_{ki}$ to $\mathrm{Beta}(\mathrm{MAF}_{ki}; a_1, a_2)$, with $(a_1, a_2) = (1, 25)$ to upweight rarer variants and $(a_1, a_2) = (1, 1)$ for equal weighting of all rare variants. We also set $\hat{\pi}_{ki0} = 1$ for $l = 0$ to only include $w_{ki}$ as the weight. With the Burden variable representing the aggregated effect of rare variants in each set, we conduct a rare variant association study (RVAS) to estimate the genetic effect of each rare variant set on the exposure ($\hat{\gamma}^{R}_{kl}$) and outcome ($\hat{\Gamma}^{R}_{kl}$), respectively. Note that the $K$ Burden variables should satisfy the IVW assumptions to be valid instruments, the same as for the common variants. The RVAS results for the $K$ rare variant sets with the $l$th annotation, along with those for the common variants, are then incorporated into equation (1) to get the IVW estimator of the proposed framework ($\hat{\beta}^{(l, a_1, a_2)}_{\mathrm{MR\text{-}CARV(IVW)}}$) that considers both common and rare variants, which is

$$\hat{\beta}^{(l, a_1, a_2)}_{\mathrm{MR\text{-}CARV(IVW)}} = \frac{\sum_{j=1}^{J} \hat{\gamma}_j \hat{\Gamma}_j \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \hat{\gamma}^{R}_{kl} \hat{\Gamma}^{R}_{kl} \left(\sigma^{R}_{Ykl}\right)^{-2}}{\sum_{j=1}^{J} \hat{\gamma}_j^{2} \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \left(\hat{\gamma}^{R}_{kl}\right)^{2} \left(\sigma^{R}_{Ykl}\right)^{-2}} \tag{3}$$

with variance being

$$\mathrm{Var}\left(\hat{\beta}^{(l, a_1, a_2)}_{\mathrm{MR\text{-}CARV(IVW)}}\right) = \frac{1}{\sum_{j=1}^{J} \hat{\gamma}_j^{2} \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \left(\hat{\gamma}^{R}_{kl}\right)^{2} \left(\sigma^{R}_{Ykl}\right)^{-2}} \tag{4}$$
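The Burden construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the MAF weight is taken to be the Beta density evaluated at the variant's MAF, the annotation weights are supplied by the caller (all ones reproduces the MAF-only setting), and the function names and toy genotypes are hypothetical rather than part of STAARpipeline or mr.carv.

```python
import math

def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def burden_scores(genotypes, maf, annot_weight, a1, a2):
    """Weighted burden score per individual for one rare variant set.

    genotypes:    one row per individual, one allele count (0/1/2) per variant
    maf:          minor allele frequency of each variant
    annot_weight: annotation-based weight of each variant
                  (all 1.0 means only the MAF weight is used)
    a1, a2:       parameters of the Beta density used as the MAF weight
    """
    weights = [pi * beta_density(f, a1, a2) for pi, f in zip(annot_weight, maf)]
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Toy data, made up for illustration: three individuals, three rare variants.
maf = [0.001, 0.01, 0.04]
genotypes = [[0, 1, 0], [1, 0, 0], [0, 0, 2]]
no_annotation = [1.0, 1.0, 1.0]

burden_1_1 = burden_scores(genotypes, maf, no_annotation, 1, 1)    # equal weights
burden_1_25 = burden_scores(genotypes, maf, no_annotation, 1, 25)  # upweights rarer variants
```

With Beta(1, 1) the score reduces to a plain count of minor alleles, whereas Beta(1, 25) gives the carrier of the rarest variant the largest per-allele contribution.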

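Analogous to $\sigma_{Xj}$ and $\sigma_{Yj}$ for common variants, $\sigma^{R}_{Xkl}$ and $\sigma^{R}_{Ykl}$ denote the SEs of $\hat{\gamma}^{R}_{kl}$ and $\hat{\Gamma}^{R}_{kl}$. The P-value of $\hat{\beta}^{(l, a_1, a_2)}_{\mathrm{MR\text{-}CARV(IVW)}}$ is denoted by $p_{l, a_1, a_2}$. We then combine the P-values ($p_{l, a_1, a_2}$) under the $L$ annotations (together with the setting $l = 0$) and different settings of $(a_1, a_2)$ into one P-value using the Cauchy combination test [22]. The combined statistic is

$$T_{\mathrm{MR\text{-}CARV(IVW)}} = \sum_{(a_1, a_2) \in \mathcal{A}} \sum_{l=0}^{L} \frac{1}{|\mathcal{A}|(L+1)} \tan\left\{\left(0.5 - p_{l, a_1, a_2}\right)\pi\right\} \tag{5}$$

where $\mathcal{A}$ is the set of specified values of $(a_1, a_2)$, and $|\mathcal{A}|$ is the size of set $\mathcal{A}$. We set $\mathcal{A} = \{(1, 25), (1, 1)\}$ in practice as in STAAR [21]. The P-value of MR-CARV(IVW) can be approximated by

$$p_{\mathrm{MR\text{-}CARV(IVW)}} \approx \frac{1}{2} - \frac{\arctan\left(T_{\mathrm{MR\text{-}CARV(IVW)}}\right)}{\pi} \tag{6}$$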
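The Cauchy combination step can be sketched in a few lines of Python. The sketch assumes equal weights across all annotation and MAF-weight settings; the function name and the example P-values are hypothetical.

```python
import math

def cauchy_combination(p_values):
    """Combine P-values with equal weights using the Cauchy combination test."""
    weight = 1.0 / len(p_values)
    # Each P-value is mapped to a standard Cauchy variate and averaged.
    t_stat = sum(weight * math.tan((0.5 - p) * math.pi) for p in p_values)
    # Tail probability of the standard Cauchy distribution at t_stat.
    return 0.5 - math.atan(t_stat) / math.pi

# Made-up P-values, e.g. one per (annotation, MAF-weight) setting.
p_values = [0.03, 0.20, 0.04, 0.50]
p_combined = cauchy_combination(p_values)
```

A useful sanity check is that combining a single P-value returns that P-value unchanged, since the tangent and arctangent transformations invert each other.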

Debiased inverse variance weighted method

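The IVW method is, however, subject to several sources of bias, which can compromise the validity of causal inference. One significant source of bias in IVW estimation arises from weak instruments, that is, genetic variants that have weak associations with the exposure. In such cases, the estimated SNP-exposure effects ($\hat{\gamma}_j$) are small and imprecise, leading to unstable ratio estimates $\hat{\Gamma}_j / \hat{\gamma}_j$ and inflated variance. The debiased inverse variance weighted (dIVW) method [5] addresses this issue with a bias correction factor. The dIVW estimator [5] for common variants is given by

$$\hat{\beta}_{\mathrm{dIVW}} = \frac{\sum_{j=1}^{J} \hat{\gamma}_j \hat{\Gamma}_j \sigma_{Yj}^{-2}}{\sum_{j=1}^{J} \left(\hat{\gamma}_j^{2} - \sigma_{Xj}^{2}\right) \sigma_{Yj}^{-2}} \tag{7}$$

Its variance [5] is

$$\mathrm{Var}\left(\hat{\beta}_{\mathrm{dIVW}}\right) = \frac{\sum_{j=1}^{J} \sigma_{Yj}^{-2} \left[\hat{\gamma}_j^{2} + \hat{\beta}_{\mathrm{dIVW}}^{2} \sigma_{Xj}^{2} \sigma_{Yj}^{-2} \left(\hat{\gamma}_j^{2} + \sigma_{Xj}^{2}\right)\right]}{\left[\sum_{j=1}^{J} \left(\hat{\gamma}_j^{2} - \sigma_{Xj}^{2}\right) \sigma_{Yj}^{-2}\right]^{2}} \tag{8}$$

Similar to MR-CARV(IVW), we construct MR-CARV(dIVW) by incorporating the summary statistics of rare variant sets with the $l$th annotation and the parameter combination $(a_1, a_2)$. Writing $\hat{\beta}_{d}$ for $\hat{\beta}^{(l, a_1, a_2)}_{\mathrm{MR\text{-}CARV(dIVW)}}$, the formula would be

$$\hat{\beta}_{d} = \frac{\sum_{j=1}^{J} \hat{\gamma}_j \hat{\Gamma}_j \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \hat{\gamma}^{R}_{kl} \hat{\Gamma}^{R}_{kl} \left(\sigma^{R}_{Ykl}\right)^{-2}}{\sum_{j=1}^{J} \left(\hat{\gamma}_j^{2} - \sigma_{Xj}^{2}\right) \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \left[\left(\hat{\gamma}^{R}_{kl}\right)^{2} - \left(\sigma^{R}_{Xkl}\right)^{2}\right] \left(\sigma^{R}_{Ykl}\right)^{-2}} \tag{9}$$

with variance being

$$\mathrm{Var}\left(\hat{\beta}_{d}\right) = \frac{\sum_{j=1}^{J} \sigma_{Yj}^{-2} \left[\hat{\gamma}_j^{2} + \hat{\beta}_{d}^{2} \sigma_{Xj}^{2} \sigma_{Yj}^{-2} \left(\hat{\gamma}_j^{2} + \sigma_{Xj}^{2}\right)\right] + \sum_{k=1}^{K} \left(\sigma^{R}_{Ykl}\right)^{-2} \left[\left(\hat{\gamma}^{R}_{kl}\right)^{2} + \hat{\beta}_{d}^{2} \left(\sigma^{R}_{Xkl}\right)^{2} \left(\sigma^{R}_{Ykl}\right)^{-2} \left(\left(\hat{\gamma}^{R}_{kl}\right)^{2} + \left(\sigma^{R}_{Xkl}\right)^{2}\right)\right]}{\left\{\sum_{j=1}^{J} \left(\hat{\gamma}_j^{2} - \sigma_{Xj}^{2}\right) \sigma_{Yj}^{-2} + \sum_{k=1}^{K} \left[\left(\hat{\gamma}^{R}_{kl}\right)^{2} - \left(\sigma^{R}_{Xkl}\right)^{2}\right] \left(\sigma^{R}_{Ykl}\right)^{-2}\right\}^{2}} \tag{10}$$

We then combine the P-values of MR-CARV(dIVW) under all the annotations and settings of $(a_1, a_2)$ using the Cauchy combination test as in MR-CARV(IVW).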
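To make the pooling of common variants and Burden variables concrete, here is a minimal Python sketch of a dIVW-style calculation. It assumes the dIVW estimator and variance in their usual published form, and it treats each Burden variable as one extra instrument appended to the common-variant lists; the function name and all numbers are hypothetical.

```python
import math

def divw(gamma_hat, se_X, Gamma_hat, se_Y):
    """Debiased IVW estimate and SE from per-instrument summary statistics."""
    rows = list(zip(gamma_hat, se_X, Gamma_hat, se_Y))
    num = sum(g * G / sy**2 for g, sx, G, sy in rows)
    # Subtracting se_X**2 removes the upward bias of gamma_hat**2 in the denominator.
    den = sum((g**2 - sx**2) / sy**2 for g, sx, G, sy in rows)
    beta = num / den
    v_num = sum(
        g**2 / sy**2 + beta**2 * (sx**2 / sy**2) * (g**2 / sy**2 + sx**2 / sy**2)
        for g, sx, G, sy in rows
    )
    return beta, math.sqrt(v_num) / abs(den)

# Made-up summary statistics: two common variants followed by one Burden variable.
common = {"gamma": [0.10, 0.20], "se_X": [0.01, 0.01],
          "Gamma": [0.05, 0.10], "se_Y": [0.02, 0.02]}
rare = {"gamma": [0.30], "se_X": [0.05], "Gamma": [0.15], "se_Y": [0.04]}

pooled = {key: common[key] + rare[key] for key in common}
beta_divw, se_divw = divw(pooled["gamma"], pooled["se_X"],
                          pooled["Gamma"], pooled["se_Y"])
```

When the exposure SEs are zero the correction vanishes and the result coincides with the IVW point estimate; with nonzero exposure SEs the denominator shrinks, which moves the estimate away from zero.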

Robust adjusted profile score method

Another important source of bias in IVW estimation is horizontal pleiotropy, which occurs when genetic instruments influence the outcome through pathways other than the exposure. This violates the exclusion restriction assumption in MR and can lead to biased causal effect estimates, particularly when pleiotropic effects are heterogeneous. The robust adjusted profile score (RAPS) method [23] addresses pleiotropy by modeling it explicitly through a random effects framework. It assumes that the estimated effects of instruments on the outcome ($\hat{\Gamma}_j$) follow a normal distribution with mean $\beta\gamma_j$ and variance $\sigma_{Yj}^{2} + \tau^{2}$, where $\tau^{2}$ captures the overdispersion due to pleiotropy. Meanwhile, the estimated effects on the exposure are modeled as $\hat{\gamma}_j \sim N(\gamma_j, \sigma_{Xj}^{2})$. The causal effect is estimated by maximizing the adjusted profile likelihood, and robustness is enhanced by replacing the standard $l_2$-loss with alternative loss functions (Huber or Tukey's biweight loss) to mitigate the influence of outlier SNPs with large pleiotropic effects, referred to as idiosyncratic pleiotropy. To extend RAPS for rare variant analysis in our framework [MR-CARV(RAPS)], we include the summary statistics of rare variant sets under the $l$th annotation and the parameter combination $(a_1, a_2)$ in the random effects models. Specifically, for common variants, we have $\hat{\gamma}_j \sim N(\gamma_j, \sigma_{Xj}^{2})$ and $\hat{\Gamma}_j \sim N(\beta\gamma_j, \sigma_{Yj}^{2} + \tau^{2})$, and for rare variant sets, we have $\hat{\gamma}^{R}_{kl} \sim N(\gamma^{R}_{kl}, (\sigma^{R}_{Xkl})^{2})$ and $\hat{\Gamma}^{R}_{kl} \sim N(\beta\gamma^{R}_{kl}, (\sigma^{R}_{Ykl})^{2} + \tau^{2})$ for the Burden variables. After obtaining the P-values under all the annotations and settings of $(a_1, a_2)$, we again use the Cauchy combination test to calculate the P-value of MR-CARV(RAPS). In addition to IVW, dIVW, and RAPS, this framework is flexible and can be extended to other MR methods in a similar way as described above. In order to make causal inference, both common variants and Burden variables constructed from rare variants should satisfy the corresponding model assumptions.

Find the uncorrelated instruments

For the IVW, dIVW, and RAPS methods in the MR-CARV framework, independent instruments are essential [24]. Independent instruments can be selected through established methods, such as choosing a single variant per locus [25] or linkage disequilibrium (LD) pruning [26]. For LD pruning, Burden variables can be created for rare variant sets and pruning can be applied to both common variants and Burden variables using the STAARpipeline [15]. Alternatively, an automated approach can use individual-level data to calculate correlations between common variants and Burden variables, followed by a graph-based method [27] to select uncorrelated instruments. Detailed steps with an example are provided in Supplementary Section S1. This process ensures uncorrelated instruments, meeting the requirements of the IVW, dIVW, and RAPS methods. Individual-level data for this purpose can come from the study samples or from a large representative reference panel with whole-genome sequencing data, such as the UK Biobank [28], provided that the panel matches the study population.
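One simple way to picture a graph-based selection is a greedy pass over a correlation graph, sketched below in Python. This is an illustrative assumption and not necessarily the algorithm of [27] or of Supplementary Section S1: instruments are nodes, an edge joins any pair whose squared correlation exceeds a hypothetical threshold `r2_max`, and instruments are kept in order of a user-supplied strength score. All names and numbers are made up.

```python
def select_uncorrelated(names, corr, strength, r2_max=0.01):
    """Greedy selection of mutually uncorrelated instruments.

    names:    instrument identifiers (common variants and Burden variables)
    corr:     dict mapping frozenset({a, b}) to the correlation between a and b;
              missing pairs are treated as uncorrelated
    strength: dict mapping each name to a strength score (stronger kept first)
    r2_max:   squared-correlation threshold above which two instruments are linked
    """
    def linked(a, b):
        return corr.get(frozenset((a, b)), 0.0) ** 2 > r2_max

    selected = []
    for candidate in sorted(names, key=lambda n: -strength[n]):
        # Keep the candidate only if it is not linked to anything already kept.
        if all(not linked(candidate, kept) for kept in selected):
            selected.append(candidate)
    return selected

# Made-up example: two correlated SNPs, and a Burden variable correlated with a third SNP.
names = ["rs1", "rs2", "geneA_burden", "rs3"]
corr = {frozenset(("rs1", "rs2")): 0.8, frozenset(("geneA_burden", "rs3")): 0.3}
strength = {"rs1": 6.0, "rs2": 5.0, "geneA_burden": 7.0, "rs3": 4.0}
selected = select_uncorrelated(names, corr, strength)
```

Here the Burden variable and rs1 are retained, while rs2 and rs3 are dropped because each is linked to a stronger instrument that was already kept.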

Simulation setting

To evaluate the performance of the methods that consider either common variants alone or both common and rare variants, we conducted a simulation study using genetic data from the 1000 Genomes Project [29]. To avoid issues caused by LD, we selected one independent instrumental variable per chromosome from chromosomes 15 through 22, designating chromosomes 15–18 for common variants and 19–22 for rare variant sets to assess the contribution of each. Utilizing the R package sim1000G [30], we randomly selected 50 SNPs per chromosome from the 10 kb region between 30 000 000 and 30 010 000 base pairs (bp), with MAF spanning from 0.001 to 0.5. For common variants, we randomly selected one SNP per chromosome from chromosomes 15–18 with MAF > 0.05. For rare variant sets on chromosomes 19–22, we included all variants with MAFs between 0.001 and 0.05. This approach resulted in four common variants and four rare variant sets.

Using these variants, we generated a dataset comprising $n_X$ individuals with exposure and $n_Y$ individuals with outcome. The exposure $X$ of the $n_X$ individuals was modeled as a linear function of genotypic data for continuous exposure:

$$X = \sum_{j=1}^{J} \gamma_j G_j + \sum_{k=1}^{K} \sum_{i=1}^{m_k} c_{ki} \eta_{ki} G_{ki} + \epsilon_X \tag{11}$$

and as an expit function for binary exposure:

$$P(X = 1) = \mathrm{expit}\left(\sum_{j=1}^{J} \gamma_j G_j + \sum_{k=1}^{K} \sum_{i=1}^{m_k} c_{ki} \eta_{ki} G_{ki}\right) \tag{12}$$

where $\mathrm{expit}(x) = e^{x} / (1 + e^{x})$. Here, $G_j$, for $j = 1, \ldots, J$, was the genotype of the $j$th common variant, and $G_{ki}$, for $k = 1, \ldots, K$ and $i = 1, \ldots, m_k$, denoted the genotype of the $i$th rare variant in the $k$th rare variant set. The binary variable $c_{ki}$ indicated whether a rare variant was causal, determined using a Bernoulli distribution with the probability given by $\mathrm{logit}\,P(c_{ki} = 1) = \delta_0 + \sum_{l=1}^{5} \delta_l A_{ki,l}$, where $A_{ki,1}, \ldots, A_{ki,5}$ represented five randomly selected functional annotations from a set of ten. Each of these 10 annotations was drawn from an independent standard normal distribution ($N(0, 1)$). Here, $\delta_l = \log(5)$ for each selected annotation, and $\delta_0$ was set to $\mathrm{logit}(0.015)$ or $\mathrm{logit}(0.18)$ to achieve approximately 15% and 35% proportions of causal rare variants (causal variant ratio), respectively [21]. This method ensured that the causality of a rare variant was influenced by a specific set of functional annotations. The effect sizes of these variants on the exposure were represented by $\gamma_j$ for common variants and $\eta_{ki}$ for rare variants. In our simulations, $\gamma_j$ was fixed at 0.5 for each common variant. For rare variants, the magnitude of the effect size was determined by $|\eta_{ki}| = c_0 \left|\log_{10} \mathrm{MAF}_{ki}\right|$, where $c_0$ was set to 0.5. This produced a maximum effect size of 1.5 for variants with $\mathrm{MAF} = 0.001$, and a minimum effect size of 0.65 for variants with $\mathrm{MAF} = 0.05$, capturing the influence of MAF on the exposure. Additionally, we assessed the impact of effect direction consistency of rare variants by varying the proportion of rare variants with positive effects on the exposure in a set (i.e. positive effect ratio) to 100%, 80%, and 50%. The error term, $\epsilon_X$, encapsulating all random variation, was modeled to follow a standard normal distribution, $N(0, 1)$.
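The way annotations drive causal status and MAF drives effect size can be sketched in Python as below. The numerical choices are assumptions in the style of the STAAR simulations that this section follows [21]: a log(5) coefficient per selected annotation, an intercept of logit(0.015), and an effect magnitude of 0.5 times the absolute log10 of the MAF. Function names are hypothetical and genotype simulation is omitted.

```python
import math
import random

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def rare_effect_size(maf, c0=0.5):
    """Effect magnitude c0 * |log10(MAF)|: rarer variants get larger effects."""
    return c0 * abs(math.log10(maf))

def simulate_causal_flags(n_variants, delta0, delta=math.log(5),
                          n_annot=10, n_used=5, rng=None):
    """Draw a causal indicator per variant from a logistic model on annotations."""
    rng = rng or random.Random(0)
    used = rng.sample(range(n_annot), n_used)  # annotations that influence causality
    flags = []
    for _ in range(n_variants):
        annot = [rng.gauss(0.0, 1.0) for _ in range(n_annot)]
        prob = expit(delta0 + sum(delta * annot[l] for l in used))
        flags.append(1 if rng.random() < prob else 0)
    return flags

flags = simulate_causal_flags(20000, delta0=logit(0.015), rng=random.Random(1))
causal_fraction = sum(flags) / len(flags)
effect_at_rarest = rare_effect_size(0.001)
effect_at_least_rare = rare_effect_size(0.05)
```

Under these assumptions the intercept is much lower than the realized causal fraction, because the annotation term adds substantial spread on the logit scale; the effect magnitudes at MAF 0.001 and 0.05 bracket the range used for rare variants.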

To generate $n_Y$ individuals with outcome, we first simulated the exposure variable $X$ as described previously. For a continuous outcome, $Y$ is modeled as a linear function of $X$:

$$Y = \beta X + \epsilon_Y \tag{13}$$

whereas for a binary outcome, the probability of $Y = 1$ is given by

$$P(Y = 1) = \mathrm{expit}(\beta X) \tag{14}$$

Here, $\beta$ represents the true causal effect of $X$ on $Y$, and $\epsilon_Y$ is a normally distributed error term following $N(0, 1)$.

We conducted simulation studies using large-scale genetic datasets ($n_X = n_Y = 10\,000$) to evaluate our framework across both continuous and binary exposure–outcome combinations. Type I error was assessed under the null hypothesis ($\beta = 0$) across 10 000 iterations, and power was evaluated over 1000 simulations using a range of positive and negative $\beta$ values (Supplementary Table S1). We generated summary statistics using the STAARpipeline on independent samples and compared existing MR methods (IVW, dIVW, and RAPS) with our proposed MR-CARV methods [MR-CARV(IVW), MR-CARV(dIVW), and MR-CARV(RAPS)]. The MR-CARV methods incorporated MAF weighting [Beta(1, 25) and Beta(1, 1)] and 10 functional annotations. To assess robustness of type I error and power under varying sample sizes, we additionally simulated smaller datasets with combinations of $n_X$ and $n_Y$ set to 1000 and 2000 for both continuous and binary traits. Further evaluations under LD structure and horizontal pleiotropy were conducted using continuous traits, as representative scenarios to illustrate the core method performance.

Results

Simulation results

The MR-CARV methods [MR-CARV(IVW), MR-CARV(dIVW), and MR-CARV(RAPS)] effectively controlled the type I error rate, comparable to existing methods (IVW, dIVW, and RAPS) under large sample size (Fig. 1). The type I error rates of IVW, dIVW, MR-CARV(IVW) and MR-CARV(dIVW) were consistently below 0.05 across all settings. In contrast, RAPS and MR-CARV(RAPS) maintained type I error rate near 0.05.

Figure 1. Type I error rates (β = 0) for different methods across four exposure–outcome scenarios (continuous–continuous, continuous–binary, binary–continuous, binary–binary), evaluated using 10 000 simulations with sample size 10 000 per trait, shown as bar plots for causal-variant proportions of 15% and 35% and for settings where 100%, 80%, or 50% of causal variants have positive effects on the exposure.

When both the exposure and outcome were continuous and the true effect size $\beta$ was positive, the MR-CARV methods consistently demonstrated higher statistical power than the existing methods (first row in Fig. 2). This superior performance can be attributed to the inclusion of both common and rare variants in the MR-CARV framework, which captures a broader spectrum of genetic variation. By leveraging the additional information provided by rare variants, the MR-CARV methods enhanced the detection of true causal relationships. The power gap between the MR-CARV methods and the existing methods widened as the causal variant ratio and the positive effect ratio increased. Notably, MR-CARV(IVW) achieved the highest power of 95.8% when the causal variant ratio was 35% and the positive effect ratio was 100%, compared with 57.6% for IVW using only common variants, a 66.3% relative increase in power. This is likely because these conditions strengthen the contribution of rare variants, thereby providing larger gains in detection capability for the MR-CARV methods. The power of the MR-CARV methods themselves also rose with the causal variant ratio and the positive effect ratio. In contrast, the power of the existing methods remained relatively stable across causal variant ratios and positive effect ratios, because these methods use only common variants and are therefore largely insensitive to the proportion of causal or positive-effect rare variants. In the scenario of continuous exposure and binary outcome, the performance was similar to that observed in the continuous exposure and continuous outcome setting (second row in Fig. 2).
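As a quick check of the arithmetic, the relative increase is the difference between the two power values divided by the power of IVW with common variants only:

$$\frac{95.8\% - 57.6\%}{57.6\%} = \frac{38.2}{57.6} \approx 0.663 = 66.3\%$$

In absolute terms this corresponds to a gain of 38.2 percentage points.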

Figure 2. Power for different methods across four exposure–outcome scenarios (continuous–continuous, continuous–binary, binary–continuous, binary–binary) with a nonzero causal effect, evaluated using 1000 simulations with sample size 10 000 per trait, shown as bar plots for causal-variant proportions of 15% and 35% and for settings where 100%, 80%, or 50% of causal variants have positive effects on the exposure.

For binary exposure and continuous outcome, the MR-CARV methods [MR-CARV(IVW), MR-CARV(dIVW), and MR-CARV(RAPS)] consistently exhibited higher statistical power than the existing methods (IVW, dIVW, and RAPS) (third row in Fig. 2), and the power gap between the MR-CARV methods and the existing methods again widened as the causal variant ratio and positive effect ratio increased. However, unlike in the previous settings, the existing methods had higher power when the causal and positive effect ratios of rare variants were smaller. For the MR-CARV methods, power increased slightly with higher causal ratios, except when the positive effect ratio was 80%. This is likely due to the increased variability and noise introduced by higher causal and positive effect ratios of rare variants when simulating $X$ and $Y$, which affects the existing methods even though they only consider common variants in the analysis. The binary nature of the exposure adds further complexity because of the logit transformation, which may lead to issues such as noncollapsibility. The existing methods, which are not designed to handle this increased noise, may struggle to detect true causal effects, resulting in lower statistical power. Although the MR-CARV methods utilize rare variant information, the noncollapsibility issue still presents challenges that can affect the estimates. Nevertheless, including rare variants improves performance compared with the existing methods. In the case of binary exposure and binary outcome, the performance was similar to that in the binary exposure and continuous outcome setting (fourth row in Fig. 2).

The power performance under negative true effect size (Supplementary Fig. S2) is similar to that in Fig. 2. Supplementary Figs S3–S14 demonstrate that the MR-CARV methods still maintained the correct type I error rate and had higher statistical power than the existing methods even under small sample sizes. Although a larger sample size generally leads to higher statistical power, increasing the sample size of the outcome notably enhances power, whereas increasing the sample size of the exposure has less impact. The details of the MR-CARV evaluation under LD structure are provided in Supplementary Section S5. MR-CARV consistently maintained well-controlled type I error and demonstrated improved power compared with existing MR methods that use only common variants. Similarly, the evaluation of MR-CARV under horizontal pleiotropy is presented in Supplementary Section S6 and Supplementary Table S3. MR-CARV maintained appropriate type I error under InSIDE-valid and balanced pleiotropy scenarios, and achieved higher statistical power than existing methods, consistent with the respective assumptions of each method.

Application to real data

To demonstrate the practical utility of our proposed MR-CARV framework, we applied the existing methods (IVW, dIVW, and RAPS) and the MR-CARV methods [MR-CARV(IVW), MR-CARV(dIVW), and MR-CARV(RAPS)] to explore the causal effects of lipids on preeclampsia. We utilized summary statistics for lipid traits, including HDL-C, low-density lipoprotein cholesterol (LDL-C), triglycerides (TG), and total cholesterol (TC), from Selvaraj et al. [31]. In that study, the STAARpipeline was applied to perform both common and rare variant analyses, investigating their effects on blood lipid levels in a cohort of over 66 000 individuals. For rare variants, we included gene-coding, gene-noncoding, and nongene regions. Here, we only set $(a_1, a_2) = (1, 1)$ and do not include functional annotations in the weighting scheme, because Selvaraj et al. [31] presented only the summary effect size of rare variants with Burden(1,1). The preeclampsia data used in our analysis were obtained from the Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-Be Heart Health Study (nuMoM2b-HHS) [32]. The nuMoM2b-HHS dataset provided individual-level WGS and phenotype data. We used the STAARpipeline to assess the association between preeclampsia and both common and rare variants in 486 preeclampsia cases and 2821 controls. Covariates included maternal age, age squared, the first 10 principal components calculated from genotype data, race, and study site [33]. A genetic relationship matrix was included as a random effect. We only selected uncorrelated common variants and Burden variables from rare variant sets as instruments.

Our analysis revealed a consistent negative association between HDL-C levels and preeclampsia, suggesting that higher HDL-C may confer a protective effect against the development of this condition (Table 1). When incorporating rare variants using the MR-CARV framework, the effect estimate remained similar but had a smaller SE and narrower 95% CI, reflecting increased precision. This improvement can be attributed to the additional genetic information contributed by rare variants, which enhances statistical power. Notably, the P-value of IVW decreased from .0659 (common variants only) to .0470 (common + rare variants), further supporting the benefit of including rare variants in MR analysis.

Table 1.

Estimates, SE, 95% confidence interval (CI), and P-values for the causal effect of HDL-C on preeclampsia from different methods

Method           Estimate   SE       95% CI                P-value
MR-CARV(IVW)     −0.020     0.0102   [−0.0401, −0.0003]    .0470
IVW              −0.023     0.0123   [−0.0469, 0.0015]     .0659
MR-CARV(dIVW)    −0.021     0.0104   [−0.0408, −0.0003]    .0472
dIVW             −0.023     0.0125   [−0.0476, 0.0015]     .0660
MR-CARV(RAPS)    −0.020     0.0103   [−0.0406, −0.0002]    .0472
RAPS             −0.023     0.0125   [−0.0475, 0.0015]     .0660
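The entries in Table 1 can be cross-checked against one another with a short Python sketch. It assumes the intervals are symmetric Wald-type 95% CIs and that the P-values are two-sided normal P-values, which the table does not state explicitly; the function names are hypothetical.

```python
import math

def two_sided_p(estimate, se):
    """Two-sided P-value from a normal approximation."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))

def from_ci(lower, upper, z=1.959964):
    """Recover the point estimate and SE implied by a symmetric 95% CI."""
    return (lower + upper) / 2.0, (upper - lower) / (2.0 * z)

# CI bounds copied from the MR-CARV(IVW) and IVW rows of Table 1.
est_carv, se_carv = from_ci(-0.0401, -0.0003)
est_ivw, se_ivw = from_ci(-0.0469, 0.0015)

p_carv = two_sided_p(est_carv, se_carv)
p_ivw = two_sided_p(est_ivw, se_ivw)
```

Up to rounding of the published bounds, the implied SEs and P-values agree with the tabulated values, which suggests that the reported estimates are rounded versions of slightly larger magnitudes than the three-decimal figures shown.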

Figure 3 displays the estimates for IVW and MR-CARV(IVW), which appear similar. However, the higher precision of MR-CARV(IVW) makes the estimate more statistically significant. With the graph-based method to select the uncorrelated instruments, we identified 9 instruments in gene-coding regions, 1 instrument in gene-noncoding regions, 2 instruments in nongene regions, and 63 instruments from common variants. Including these 12 instruments from rare variants already increased the statistical power of our analysis, and we expect that incorporating more rare variant instruments would further enhance it. Notably, the summary statistics for rare variants are more dispersed than those for common variants, particularly for HDL-C. This greater spread is likely due to the smaller number of rare variants or the larger variance associated with the Burden variable in the set. The effect estimates of LDL-C, TG, and TC on preeclampsia are shown in Supplementary Table S2. All P-values are > .05 using either the MR-CARV methods or the existing methods.

Figure 3. MR scatter plot showing the per-allele HDL-C effects versus ln(OR) for preeclampsia with CIs, with common variants in black, rare variant sets in gray, and regression lines indicating the MR-CARV(IVW) (solid) and IVW (dashed) estimates.

Discussion

In this study, we developed a novel two-sample MR framework, MR-CARV, which incorporates annotation-weighted rare variants in addition to common variants, enhancing the performance of existing two-sample MR methods. Rare variants were aggregated into Burden variables using functional annotations and MAF-based weighting, similar to the STAAR approach [21], and the Cauchy combination test was applied to integrate results across multiple annotation-informed models. Extensive simulations demonstrated that the MR-CARV framework effectively controls type I error and achieves higher statistical power than existing methods, particularly in settings with binary exposures and limited instrument strength from common variants. We also found that increasing the outcome sample size yields greater power gains than increasing the exposure sample size.

Applying the MR-CARV framework, we found that HDL-C shows a statistically significant protective effect against preeclampsia only when both common and rare variants are included. Incorporating rare variants led to more precise estimates, as indicated by smaller SEs, even though fewer rare variant sets were used compared with common variants. This suggests potential for further gains with more rare variant data. Statistically, the estimated effect was marginally significant but consistent with prior studies [19, 34, 35]. HDL-C’s known anti-inflammatory and antioxidant properties [36, 37] may reduce oxidative stress and improve endothelial function, which are critical in preventing preeclampsia [38].

The growing availability of WGS data from large-scale initiatives such as the UK Biobank [28], All of Us [39], TOPMed [40], and the Million Veteran Program [41] has enabled broader analysis of rare variants. This expansion is supported by advanced tools like STAARpipeline [15], SAIGE [42, 43], and REGENIE [44]. MR-CARV currently leverages STAARpipeline to construct annotation-informed burden scores for rare variant sets. As WGS-based summary statistics become increasingly available, MR-CARV is well positioned to capitalize on these data. Moreover, our framework is flexible and can incorporate outputs from other rare variant association tools such as REGENIE.

The MR-CARV framework relies on the same key assumptions for valid causal inference as existing MR studies [45, 46]. For example, to ensure the independence of instruments, we use an automated graph-based method that calculates correlations among candidate instruments and selects only uncorrelated ones. However, more advanced techniques, such as Bayesian approaches, may offer better solutions in complex scenarios [47]. To address horizontal pleiotropy, the MR-CARV framework can be extended to incorporate rare variants into existing methods designed to account for pleiotropy, such as MR-Egger regression [48] and multivariable MR [49]. Overall, MR-CARV is a highly flexible tool that integrates both common and rare variants, offering a unified framework for causal inference in observational studies. Its adaptability allows the incorporation of advanced methodologies to address complex challenges, making it a valuable resource for exploring causal relationships between risk factors and complex diseases.
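The idea of graph-based instrument selection can be sketched as follows. Instruments are treated as nodes, an edge joins any pair whose absolute correlation exceeds a threshold, and a set of mutually unconnected nodes is retained. The greedy rule used here (keep the instrument with the smallest P-value first), the function name `select_uncorrelated`, and the threshold of 0.1 are assumptions made for illustration; the procedure implemented in mr.carv may differ.

```python
def select_uncorrelated(corr, p_values, threshold=0.1):
    """Greedily pick instruments whose pairwise |correlation| <= threshold.

    corr is a symmetric matrix (list of lists) of instrument correlations;
    p_values are exposure-association P-values used to prioritize instruments.
    Returns the sorted indices of the retained instruments.
    """
    n = len(p_values)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("corr must be an n x n matrix matching p_values")

    # Visit instruments from strongest to weakest exposure association.
    order = sorted(range(n), key=lambda i: p_values[i])
    selected = []
    for i in order:
        # Keep i only if it has no edge to an already selected instrument.
        if all(abs(corr[i][j]) <= threshold for j in selected):
            selected.append(i)
    return sorted(selected)


# Hypothetical correlation matrix for four candidate instruments.
corr = [
    [1.00, 0.80, 0.00, 0.00],
    [0.80, 1.00, 0.05, 0.00],
    [0.00, 0.05, 1.00, 0.60],
    [0.00, 0.00, 0.60, 1.00],
]
p_values = [1e-8, 1e-10, 1e-6, 1e-9]
kept = select_uncorrelated(corr, p_values)
```

A greedy pass returns one maximal independent set rather than enumerating all of them, so it trades optimality for speed; enumerating all maximal independent sets [27] would allow choosing among alternatives at higher computational cost.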

Key Points

  • The Mendelian randomization with common and annotation-informed rare variants (MR-CARV) framework addresses key limitations in MR by integrating both common and rare variants into established methods like inverse variance weighted, debiased inverse variance weighted, and robust adjusted profile score, enhancing their statistical power.

  • Rare variants are incorporated into the MR-CARV framework by weighting them with functional annotations, improving the biological relevance and precision of the analysis.

  • The MR-CARV framework is highly flexible and can be adapted to extend other two-sample MR methods, providing broad applicability across traits and diseases.

  • We offer a publicly available R package, mr.carv, at https://github.com/yu-zhang-oYo/mr.carv, which enables researchers to integrate rare and common variants seamlessly into MR analyses.

  • Applying MR-CARV to data on blood lipid levels and preeclampsia reveals a significant protective effect of high-density lipoprotein cholesterol on preeclampsia risk, demonstrating its practical utility in uncovering meaningful causal relationships.

Supplementary Material

supplements_bbaf649
supplementary_Table_S3_bbaf649

Acknowledgements

The authors would like to thank Stephen Burgess, PhD and Xiaofeng Zhu, PhD for their insightful discussions. This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute.

Contributor Information

Yu Zhang, Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, 1025 E. 7th Street, Bloomington, IN 47405, United States.

Ming Li, Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, 1025 E. 7th Street, Bloomington, IN 47405, United States.

David M Haas, Department of Obstetrics and Gynecology, Indiana University School of Medicine, 1130 W. Michigan St., Indianapolis, IN 46202, United States.

C Noel Bairey Merz, Barbra Streisand Women’s Heart Center, Smidt Heart Institute, Cedars-Sinai Medical Center, 8700 Beverly Blvd., Los Angeles, CA 90048, United States.

Tsegaselassie Workalemahu, Department of Obstetrics and Gynecology, University of Utah Health, 30 N. Mario Capecchi Dr. Level 5, South (HELIX), Salt Lake City, UT 84112, United States.

Kelli Ryckman, Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, 1025 E. 7th Street, Bloomington, IN 47405, United States.

Janet M Catov, Department of Obstetrics, Gynecology and Reproductive Sciences, University of Pittsburgh, 300 Halket Street, Pittsburgh, PA 15213, United States; Department of Epidemiology, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA 15213, United States.

Lisa D Levine, Department of Obstetrics and Gynecology, The University of Pennsylvania Perelman School of Medicine, 3400 Spruce Street, Philadelphia, PA 19104, United States.

Alexa Freedman, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, 303 E. Superior Street, Chicago, IL 60611, United States.

George R Saade, Department of Obstetrics and Gynecology, Eastern Virginia Medical School, 621 W. 21st Street, Norfolk, VA 23507, United States.

Jiaqi Hu, Department of Chronic Disease Epidemiology, Yale School of Public Health, 60 College Street, New Haven, CT 06510, United States.

Hongyu Zhao, Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT 06520, United States.

Xihao Li, Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, NC 27599, United States; Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, Chapel Hill, NC 27599, United States.

Nianjun Liu, Department of Epidemiology and Biostatistics, Indiana University School of Public Health-Bloomington, 1025 E. 7th Street, Bloomington, IN 47405, United States.

Qi Yan, Department of Obstetrics and Gynecology, Columbia University, 622 West 168th Street, New York, NY 10032, United States.

Author contributions

Yu Zhang (Conceptualization, Methodology, Investigation, Software, Writing—original draft, Writing—review & editing), Ming Li (Writing—review & editing), David M. Haas (Writing—review & editing), C. Noel Bairey Merz (Writing—review & editing), Tsegaselassie Workalemahu (Writing—review & editing), Kelli Ryckman (Writing—review & editing), Janet M. Catov (Writing—review & editing), Lisa D. Levine (Writing—review & editing), Alexa Freedman (Writing—review & editing), George R. Saade (Writing—review & editing), Xihao Li (Investigation, Writing—review & editing), Nianjun Liu (Supervision, Writing—review & editing), and Qi Yan (Conceptualization, Methodology, Investigation, Supervision, Writing—review & editing)

Conflict of interest: C.N.B.M. serves as a board director and receives stock from iRhythm Technologies. During the preparation of this work the authors used ChatGPT (OpenAI, San Francisco, CA) in order to refine language and improve the clarity of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Funding

This work was partially supported by a National Heart, Lung, and Blood Institute grant (R56HL164477). Support for collection of nuMoM2b-HHS data was provided by cooperative agreement funding from the National Heart, Lung, and Blood Institute and the Eunice Kennedy Shriver National Institute of Child Health and Human Development grants (U10-HL119991 to RTI International, U10-HL119989 to Case Western Reserve University, U10-HL120034 to Columbia University, U10-HL119990 to Indiana University, U10-HL120006 to the University of Pittsburgh, U10-HL119992 to Northwestern University, U10-HL120019 to the University of California, Irvine, U10-HL119993 to the University of Pennsylvania, U10-HL120018 to the University of Utah); and by the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, to Clinical and Translational Science Institutes at Indiana University (UL1TR001108) and the University of California, Irvine (UL1TR000153).

Data availability

The summary statistics for lipid traits are available from the published paper by Selvaraj et al. [31]. The nuMoM2b-HHS individual-level WGS and phenotype data can be accessed through dbGaP (study ID phs002808.v1.p1). MR-CARV is implemented as an open-source R package available at https://github.com/yu-zhang-oYo/mr.carv.

References

  • 1. Xu J, Li M, Gao Y. et al. Using Mendelian randomization as the cornerstone for causal inference in epidemiology. Environ Sci Pollut Res 2022; 29:5827–39. doi: 10.1007/s11356-021-15939-3
  • 2. Sanderson E, Glymour MM, Holmes MV. et al. Mendelian randomization. Nat Rev Methods Primers 2022; 2:6. doi: 10.1038/s43586-021-00092-5
  • 3. Lee S, Abecasis GR, Boehnke M. et al. Rare-variant association analysis: study designs and statistical tests. Amer J Hum Genet 2014; 95:5–23. doi: 10.1016/j.ajhg.2014.06.009
  • 4. Wang Q, Dhindsa RS, Carss K. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 2021; 597:527–32. doi: 10.1038/s41586-021-03855-y
  • 5. Ye T, Shao J, Kang H. Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization. Ann Stat 2021; 49:2079–100. doi: 10.1214/20-AOS2027
  • 6. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009; 5:e1000384. doi: 10.1371/journal.pgen.1000384
  • 7. Wu MC, Lee S, Cai T. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Amer J Hum Genet 2011; 89:82–93. doi: 10.1016/j.ajhg.2011.05.029
  • 8. Liu Y, Chen S, Li Z. et al. ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. Amer J Hum Genet 2019; 104:410–21. doi: 10.1016/j.ajhg.2019.01.002
  • 9. Tada H, Kawashiri M-A, Yamagishi M. Comprehensive genotyping in dyslipidemia: Mendelian dyslipidemias caused by rare variants and Mendelian randomization studies using common variants. J Hum Genet 2017; 62:453–8. doi: 10.1038/jhg.2016.159
  • 10. Rao AR, Nelson SF. Calculating the statistical significance of rare variants causal for Mendelian and complex disorders. BMC Med Genomics 2018; 11:53–17. doi: 10.1186/s12920-018-0371-9
  • 11. Hsu L-A, Teng M-S, Wu S. et al. Common and rare PCSK9 variants associated with low-density lipoprotein cholesterol levels and the risk of diabetes mellitus: a Mendelian randomization study. Int J Mol Sci 2022; 23:10418. doi: 10.3390/ijms231810418
  • 12. Triozzi JL, Hsi RS, Wang G. et al. Mendelian randomization analysis of genetic proxies of thiazide diuretics and the reduction of kidney stone risk. JAMA Netw Open 2023; 6:e2343290–0. doi: 10.1001/jamanetworkopen.2023.43290
  • 13. Abdi G, Tarighat MA, Jain M. et al. Revolutionizing genomics: exploring the potential of next-generation sequencing. In: Singh V, Kumar A, (eds). Advances in Bioinformatics. Singapore: Springer; 2024. p. 1–33. doi: 10.1007/978-981-99-8401-5_1
  • 14. Austin MA, Hair MS, Fullerton SM. Research guidelines in the era of large-scale collaborations: an analysis of genome-wide association study consortia. Am J Epidemiol 2012; 175:962–9. doi: 10.1093/aje/kwr441
  • 15. Li Z, Li X, Zhou H. et al. A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies. Nat Methods 2022; 19:1599–611. doi: 10.1038/s41592-022-01640-x
  • 16. Yang Y, Wang Y, Lv Y. et al. Dissecting the roles of lipids in preeclampsia. Metabolites 2022; 12:590. doi: 10.3390/metabo12070590
  • 17. Zhou X, Han T-L, Chen H. et al. Impaired mitochondrial fusion, autophagy, biogenesis and dysregulated lipid metabolism is associated with preeclampsia. Exp Cell Res 2017; 359:195–204. doi: 10.1016/j.yexcr.2017.07.029
  • 18. Spracklen CN, Smith CJ, Saftlas AF. et al. Maternal hyperlipidemia and the risk of preeclampsia: a meta-analysis. Am J Epidemiol 2014; 180:346–58. doi: 10.1093/aje/kwu145
  • 19. Spracklen CN, Saftlas AF, Triche EW. et al. Genetic predisposition to dyslipidemia and risk of preeclampsia. Am J Hypertens 2015; 28:915–23. doi: 10.1093/ajh/hpu242
  • 20. Burgess S, Butterworth A, Thompson SG. Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol 2013; 37:658–65. doi: 10.1002/gepi.21758
  • 21. Li X, Li Z, Zhou H. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat Genet 2020; 52:969–83. doi: 10.1038/s41588-020-0676-4
  • 22. Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc 2020; 115:393–402. doi: 10.1080/01621459.2018.1554485
  • 23. Zhao Q, Wang J, Hemani G. et al. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Ann Stat 2020; 48:1742–69. doi: 10.1214/19-AOS1866
  • 24. Burgess S, Thompson SG. Mendelian Randomization: Methods for Causal Inference Using Genetic Variants. 2nd ed. New York: Chapman & Hall/CRC; 2021. doi: 10.1201/9780429324352
  • 25. Swerdlow DI, Preiss D, Kuchenbaecker KB. et al. HMG-coenzyme A reductase inhibition, type 2 diabetes, and bodyweight: evidence from genetic analysis and randomised trials. The Lancet 2015; 385:351–61. doi: 10.1016/S0140-6736(14)61183-1
  • 26. Schmidt AF, Finan C, Gordillo-Marañón M. et al. Genetic drug target validation using Mendelian randomisation. Nat Commun 2020; 11:3255. doi: 10.1038/s41467-020-16969-0
  • 27. Tsukiyama S, Ide M, Ariyoshi H. et al. A new algorithm for generating all the maximal independent sets. SIAM J Comput 1977; 6:505–17. doi: 10.1137/0206036
  • 28. Hofmeister RJ, Ribeiro DM, Rubinacci S. et al. Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nat Genet 2023; 55:1243–9. doi: 10.1038/s41588-023-01415-w
  • 29. Clarke L, Zheng-Bradley X, Smith R. et al. The 1000 Genomes Project: data management and community access. Nat Methods 2012; 9:459–62. doi: 10.1038/nmeth.1974
  • 30. Dimitromanolakis A, Xu J, Krol A. et al. sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinform 2019; 20:1–9. doi: 10.1186/s12859-019-2611-1
  • 31. Selvaraj MS, Li X, Li Z. et al. Whole genome sequence analysis of blood lipid levels in > 66,000 individuals. Nat Commun 2022; 13:5995. doi: 10.1038/s41467-022-33510-7
  • 32. Haas DM, Ehrenthal DB, Koch MA. et al. Pregnancy as a window to future cardiovascular health: design and implementation of the nuMoM2b heart health study. Am J Epidemiol 2016; 183:519–30. doi: 10.1093/aje/kwv309
  • 33. Yan Q, Blue NR, Truong B. et al. Genetic associations with placental proteins in maternal serum identify biomarkers for hypertension in pregnancy. medRxiv 2023:2023–05. doi: 10.1101/2023.05.25.23290460
  • 34. Ding Y, Yao M, Liu J. et al. Association between human blood metabolome and the risk of pre-eclampsia. Hypertens Res 2024; 47:1063–72. doi: 10.1038/s41440-024-01586-x
  • 35. Hosier H, Lipkind HS, Rasheed H. et al. Dyslipidemia and risk of preeclampsia: a multiancestry Mendelian randomization study. Hypertension 2023; 80:1067–76. doi: 10.1161/HYPERTENSIONAHA.122.20426
  • 36. Barter PJ, Nicholls S, Rye K-A. et al. Antiinflammatory properties of HDL. Circ Res 2004; 95:764–72. doi: 10.1161/01.RES.0000146094.59640.13
  • 37. Tran-Dinh A, Diallo D, Delbosc S. et al. HDL and endothelial protection. Br J Pharmacol 2013; 169:493–511. doi: 10.1111/bph.12174
  • 38. Sánchez-Aranguren LC, Prada CE, Riaño-Medina CE. et al. Endothelial dysfunction and preeclampsia: role of oxidative stress. Front Physiol 2014; 5:372. doi: 10.3389/fphys.2014.00372
  • 39. The All of Us Research Program Investigators. The “All of Us” Research Program. New Engl J Med 2019; 381:668–76. doi: 10.1056/NEJMsr1809937
  • 40. Taliun D, Harris DN, Kessler MD. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature 2021; 590:290–9. doi: 10.1038/s41586-021-03205-y
  • 41. Gaziano JM, Concato J, Brophy M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J Clin Epidemiol 2016; 70:214–23. doi: 10.1016/j.jclinepi.2015.09.016
  • 42. Zhou W, Nielsen JB, Fritsche LG. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet 2018; 50:1335–41. doi: 10.1038/s41588-018-0184-y
  • 43. Zhou W, Zhao Z, Nielsen JB. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat Genet 2020; 52:634–9. doi: 10.1038/s41588-020-0621-6
  • 44. Mbatchou J, Barnard L, Backman J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet 2021; 53:1097–103. doi: 10.1038/s41588-021-00870-7
  • 45. VanderWeele TJ, Tchetgen EJT, Cornelis M. et al. Methodological challenges in Mendelian randomization. Epidemiology 2014; 25:427–35. doi: 10.1097/EDE.0000000000000081
  • 46. Leeuw Cd, Savage J, Bucur IG. et al. Understanding the assumptions underlying Mendelian randomization. Eur J Hum Genet 2022; 30:653–60. doi: 10.1038/s41431-022-01038-5
  • 47. Herce Á, Salvador M. Instrument selection in panel data models with endogeneity: a Bayesian approach. Econometrics 2024; 12:36. doi: 10.3390/econometrics12040036
  • 48. Burgess S, Thompson SG. Interpreting findings from Mendelian randomization using the MR-Egger method. Eur J Epidemiol 2017; 32:377–89. doi: 10.1007/s10654-017-0255-x
  • 49. Rees JM, Wood AM, Burgess S. Extending the MR-Egger method for multivariable Mendelian randomization to correct for both measured and unmeasured pleiotropy. Stat Med 2017; 36:4705–18. doi: 10.1002/sim.7492


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
