Abstract
Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Despite the widespread availability of genome-wide data, existing methods to analyze genetic data still primarily focus on marginal association models, which fall short of fully capturing the polygenic nature of complex traits and elucidating biological causal mechanisms. Here we present a computationally efficient causal inference framework for genome-wide detection of putative causal variants underlying genetic associations. Our approach utilizes summary statistics from potentially overlapping studies as input, constructs in silico knockoff copies of summary statistics as negative controls to attenuate confounding effects induced by linkage disequilibrium, and employs efficient ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome. Our method is computationally efficient, requiring less than 15 minutes on a single CPU to analyze genome-wide summary statistics. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer’s disease (AD) we identified 82 loci associated with AD, including 37 additional loci missed by conventional GWAS pipeline via marginal association testing. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of large-scale genome-wide association studies (GWAS) summary statistics from 2013 to 2022. Results reveal the method’s capacity to robustly discover additional loci for polygenic traits beyond conventional GWAS and pinpoint potential causal variants underpinning each locus (on average, 22.7% more loci and 78.7% fewer proxy variants), contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses. 
We are making the discoveries and software freely available to the community and anticipate that routine end-to-end in silico identification of putative causal genetic variants will become an important tool that will facilitate downstream functional experiments and future research into disease etiology, as well as the exploration of novel therapeutic avenues.
Introduction
Uncovering the precise causal genetic determinants of complex traits is integral to advancing the understanding of disease etiologies and the development of targeted therapies. An increasing body of evidence underscores the efficacy of therapeutic interventions grounded in genetic insights.1,2 Notably, approximately two-thirds of the FDA-approved new drugs in 2021 are based on genetic loci linked to their respective therapeutic indications or related phenotypes.3 Genome-wide association studies (GWAS) serve as one of the most popular methodologies for identifying genetic variants correlated with disease phenotypes. Nonetheless, the genetic variants identified through GWAS generally account for a limited proportion of heritability and are mostly proxies for causal variants, which may impede their applicability in functional genomics and the informed selection of drug targets and indications.4,5
Until recently, most GWAS have predominantly focused on marginal association models, correlating a phenotype of interest with the genotype of a single genetic variant, while also accounting for non-genetic risk factors and covariates (referred to as conventional GWAS pipeline in this paper). The linear model can be conveniently adapted to encompass generalized linear models for non-Gaussian phenotypes, such as dichotomized disease status, or extended to a model that includes a collection of genetic variants within a specific gene or region. Although this approach has yielded considerable success, the community is transitioning into the post-GWAS era, characterized by two fundamental paradigm shifts. First, prompted by the growing evidence supporting the high polygenicity of numerous complex traits, contemporary GWAS increasingly aim to elucidate genetic architecture encompassing a multitude of loci, each with small effects.6,7 However, the statistical power of a marginal association model may prove suboptimal to detect weaker associations, whose signal needs to be compared with a “large” error term, which encapsulates the combined effect of all other variants and other environmental factors. Second, the transition from genetic discoveries to elucidating biological mechanisms presents a challenge, partly made even more formidable by the fact that marginal tests can identify any proxy feature correlated with the true causal variants. In response to this challenge, researchers have developed second-stage fine-mapping approaches to focus the researcher’s attention on smaller sets of variants which have some guarantee of harboring the causal variant with high probability.8,9 However, these methods are inherently constrained by their focus on strong associations detected with marginal models, and cannot lead to discoveries of causal variants at other loci. 
Additionally, when multiple credible sets are identified at the same locus, these do not unequivocally represent independent causal effects. This ambiguity poses substantial challenges towards identifying causal variants, thereby hindering the potential for a mechanistic understanding. This aspect is particularly relevant in light of recent findings suggesting that genetic association signals at a locus can be the result of multiple causal variants.10
In contrast to marginal models, recent work has shown that assessing conditional independent effects with False Discovery Rate (FDR) control in a high-dimensional regression model (e.g., genome-wide regression model) can enhance the detection of variants with weaker effect sizes and improve the chances of identifying putatively causal variants.11 A conditional independence hypothesis evaluates the effect of each genetic variant on a phenotype, conditioning on all other genetic variants throughout the genome. The knockoffs methodology is a recently proposed statistical framework for testing this conditional independence hypothesis in high-dimensional settings.12 This approach involves generating synthetic, noisy replicas (knockoffs) of the original genetic variants, which function as negative controls for the conditional tests. The knockoffs aid in the selection of significant genetic variants and help mitigate the confounding effect of linkage disequilibrium (LD).13–15 Several knockoffs-based methodologies have been proposed for genetic research, including those by Candès et al. (2018), Sesia et al. (2019), Sesia et al. (2020), He et al. (2021), and Sesia et al. (2021).11,12,14,16,17 Motivated by the frequent unavailability of individual-level data in large meta-analyses of GWAS, He et al. (2022) introduced GhostKnockoffs, which can take as input summary statistics readily available from conventional GWAS, enabling a knockoff approach without access to individual genotypes.18 Chen et al. (2024) paired GhostKnockoffs with modern regression statistics (such as the Lasso or other types of penalized regression methods) to improve statistical power.19 The connections between conditional testing and causal inference are further discussed by Bates et al. (2020) for genetic trio studies and Li et al. (2022) in search of consistent conditional associations across environments.20,21 Despite these theoretical advantages, there are still considerable challenges in performing computationally efficient conditional independence tests in the context of large-scale GWAS.
In this study, we introduce an analytical pipeline for genome-wide detection of putative causal variants. Our method defines and quantifies the causal effect of a genetic variant based on causal inference principles, which is conceptually equivalent to quantifying the change in phenotypic value observed by introducing a sequence change in a functional experiment, such as a massively parallel reporter assay (MPRA) or CRISPR-Cas9. To achieve this, the pipeline integrates several technical advancements from a series of companion papers we released at the same time, which allow: (1) the use of summary statistics (e.g., p-values and direction of effects) from potentially overlapping studies, thereby facilitating the integration of multiple studies to the maximum extent feasible;18 (2) the optimized construction of group knockoffs to enhance the power to identify tightly linked causal variants;19 and (3) the use of efficient ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome based on summary statistics.18,22 The pipeline requires less than 15 minutes on a single central processing unit (CPU) to analyze genome-wide summary statistics. It can be flexibly applied on top of the standard GWAS pipeline without changing any of the processing steps to enhance the discovery of additional loci and to localize conditionally independent causal effects.
We illustrate performance via simulations and by applying our new analysis pipeline to a meta-analysis of ten large-scale genetic studies of Alzheimer’s disease (AD) from 2017 to 2022. This analysis identified 82 loci associated with AD, including 37 additional loci missed by conventional GWAS pipeline via marginal association tests. We used existing data on MPRA + CRISPR-Cas9 experiments to validate these identified variants and show that the identified putative causal variants achieve good agreement with the experimental results. Furthermore, using functional genomics data from single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) in excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, astrocytes and OPCs, we show functional enrichment of these variants in microglia. Finally, we further applied the method to summary statistics from large-scale GWAS from 2013 to 2022 for a variety of phenotypes. On average, the method identifies 22.7% more loci and 78.7% fewer proxy variants per locus compared to conventional GWAS pipeline via marginal association tests. The results highlight the appealing attributes of the proposed method for robustly uncovering additional loci and pinpointing putative causal variants that underlie each locus. We are making the discoveries and software freely available to the community and anticipate that routine end-to-end in silico identification of putative causal genetic variants will become an important tool that will facilitate downstream functional experiments and future research into disease mechanisms and potential therapies.
Results
Overview of the method
We assume a study population of $n$ independent individuals including $p$ genetic variants. Let $G = (G_1, \ldots, G_p)$ be a vector of genotypes, and $Y$ be the phenotype. To evaluate the causal effect of the $j$-th genetic variant on $Y$, we are interested in testing the following nonparametric conditional independence (CIT) hypothesis:
$$H_0: G_j \perp Y \mid G_{-j},$$
where $G_{-j}$ are the genotypes for all variants across the genome except the $j$-th. Essentially, we test if each variant $G_j$ is independent of $Y$ conditional on all the other variants $G_{-j}$. The CIT hypothesis has a causal inference interpretation. Under standard identifiability conditions in causal inference, we show in the Methods section that this hypothesis is equivalent to testing whether a nonparametric conditional causal effect (CCE) in causal inference is zero,
$$Y(g_j) \mid G_{-j} \overset{d}{=} Y(g_j') \mid G_{-j} \quad \text{for all } g_j, g_j',$$
where the causal effect of $G_j$ on $Y$ is defined on the basis of the counterfactual (or potential) outcomes $Y(g_j)$ and $Y(g_j')$, which are mutually unobservable quantities.23,24 Intuitively, the CCE evaluates the effect of $G_j$ on $Y$ by considering changing the value of $G_j$ from $g_j$ to $g_j'$, without altering any other variants. If the distribution of $Y(g_j)$ changes for any stratum $G_{-j} = g_{-j}$, the $j$-th variant may be considered to have a causal effect on $Y$. We adopt this measure of causal effect because it is conceptually similar to a functional experiment (e.g. MPRA or CRISPR) that edits a particular sequence and then looks for any change in the phenotype. A more detailed discussion about the connections with other measures of causal effect is included in the Methods section.
The knockoff methodology was designed to test precisely the conditional independence hypothesis above.11–15,18 Here, we leverage several recent advances in the model-X knockoffs framework. Our method, CIT-Lasso, contains six main steps: (1) Collect summary statistics, namely, marginal association Z-scores from a target study and LD structure of genetic variants from a reference panel; (2) Perform average-linkage hierarchical clustering to define mutually exclusive LD groups of variants; (3) Construct group knockoffs of the Z-scores as described in He et al. (2022) and Chu et al. (2023);18,22 (4) With the original Z-scores, knockoff Z-scores and LD structure as input, fit an ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome;19,25 (5) Calculate the test statistic by contrasting the feature importance scores for the original and knockoff variants; (6) Implement a knockoff filter to select statistically significant, putatively causal variants with empirical FDR control. We refer to the selected variants in each LD group as a “catching set”, analogous to the “credible set” in Bayesian fine-mapping methods. Our catching sets are guaranteed to be mutually exclusive, and they represent conditionally independent effects on the outcome of interest. We have developed a computational pipeline to enable scalable implementation of the proposed method, including: 1. An efficient sampling procedure to generate multiple knockoffs; 2. A fast and scalable batch screening iterative lasso (BASIL) algorithm to fit the ultrahigh-dimensional sparse regression with summary statistics.22,25 Empirically, it only requires 860.5 seconds to meta-analyze ten AD genetic studies with a single Intel Xeon E5-2640 CPU (2.4GHz) with 24GB of requested RAM. We present the details in the Methods section.
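Steps (5) and (6) can be illustrated with a minimal sketch. It assumes a single knockoff copy per variant, importance scores given as plain lists, and the standard knockoff+ threshold; the pipeline described here uses multiple group knockoffs, so this is a simplified illustration and not the authors' implementation. The function names are hypothetical.

```python
def knockoff_statistics(importance_original, importance_knockoff):
    """W_j = T_j - T~_j; a large positive W_j favors the original variant."""
    return [t - tk for t, tk in zip(importance_original, importance_knockoff)]


def knockoff_threshold(w, target_fdr=0.10):
    """Knockoff+ threshold: the smallest t > 0 such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= target_fdr."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)
        positives = sum(1 for x in w if x >= t)
        if (1 + negatives) / max(1, positives) <= target_fdr:
            return t
    return float("inf")  # no threshold meets the target: select nothing


def knockoff_filter(w, target_fdr=0.10):
    """Return the indices of variants selected at the target FDR."""
    t = knockoff_threshold(w, target_fdr)
    return [j for j, x in enumerate(w) if x >= t]
```

Variants whose knockoff copy looks more important than the original (negative W) serve as the estimate of false positives among the selected variants, which is why the threshold adapts to the data rather than being fixed in advance.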
End-to-end discovery and prioritization of causal variants
Unlike the usual two-stage marginal association test + fine-mapping procedure, the conditional independence testing in a high-dimensional regression model simultaneously performs discovery and prioritization of causal variants with improved power and reduced LD confounding as demonstrated in Sesia et al. (2020).11 In this paper, we propose a method that utilizes summary statistics as input (e.g., p-values and direction of effects) to achieve these appealing properties. The ability to use summary statistics facilitates the integration of multiple studies to the maximum extent feasible, and it allows the method to be flexibly applied on top of the standard GWAS pipeline without changing any of the processing steps. We conducted extensive simulation studies to compare the proposed method with the conventional two-stage marginal association test + fine-mapping procedure (e.g., SuSiE).8 The marginal association test was implemented by a score test in linear regression. The simulation study is based on 500 replicates. Each replicate includes 500 approximately independent 200kb regions across the genome, where 50 of them contain one causal variant. We restricted the simulations to directly genotyped variants with MAF ≥ 0.01; details about the simulation setting can be found in the Methods section. To evaluate performance, we focus on four metrics of interest:
Precision (positive predictive value): overall proportion of variants that are causal among all variants in the catching/credible sets.
Recall (statistical power): proportion of causal variants that are covered by any catching/credible sets.
Size: number of variants in the catching/credible sets; we report the maximum and average size of all identified catching/credible sets for each replicate.
Purity: the smallest squared correlation among all pairs of variants within a catching/credible set; we report the minimum purity and average purity of all identified catching/credible sets for each replicate.
While the average size and purity reflect the average performance, the maximum size and minimum purity reflect the performance for more challenging tasks. We are looking for a method with higher precision and recall, with smaller catching/credible sets and with higher purity.
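The four metrics can be computed for one replicate as in the following sketch. The input layout (lists of variant ids, a dictionary of pairwise squared correlations) and the convention that a single-variant set has purity 1.0 are assumptions made for illustration, not details taken from the simulation code.

```python
from itertools import combinations


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def evaluate_sets(sets, causal, r2):
    """Score catching/credible sets against the known causal variants.

    sets   : list of lists of variant ids (assumed mutually exclusive)
    causal : set of truly causal variant ids
    r2     : dict mapping frozenset({a, b}) to the squared correlation of a and b
    """
    selected = [v for s in sets for v in s]
    # Precision: share of selected variants that are causal.
    precision = _mean([v in causal for v in selected])
    # Recall: share of causal variants covered by at least one set.
    covered = set(selected) & set(causal)
    recall = len(covered) / len(causal) if causal else float("nan")
    sizes = [len(s) for s in sets]
    # Purity: smallest pairwise r^2 within a set (1.0 for a singleton, by assumption).
    purities = [
        min((r2[frozenset(pair)] for pair in combinations(s, 2)), default=1.0)
        for s in sets
    ]
    return {
        "precision": precision,
        "recall": recall,
        "max_size": max(sizes, default=0),
        "mean_size": _mean(sizes),
        "min_purity": min(purities, default=float("nan")),
        "mean_purity": _mean(purities),
    }
```

Reporting both the extreme (maximum size, minimum purity) and the average values mirrors the distinction drawn in the text between typical performance and performance on the hardest sets.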
Figure 1 shows that the proposed method (CIT-Lasso) exhibits substantially higher precision and recall in discovering causal genetic variants compared to the marginal association test (MAT). The higher precision than MAT is expected because the proposed method tests for conditional independence, while the marginal association test is subject to LD confounding. The higher recall arises because CIT-Lasso controls the empirical FDR, an error criterion that is more liberal than the family-wise error rate (FWER) commonly used in the conventional GWAS pipeline. In addition, the joint modeling of the effects of all variants avoids comparing the signal of each individual variant with a “large” error term, which encapsulates the combined effect of all other variants. Consequently, CIT-Lasso also has higher power than a two-stage marginal association test + fine-mapping procedure using SuSiE (MAT + SuSiE), because fine-mapping is not designed for new causal discoveries and is only applied to regions already discovered by MAT. More importantly, the improvement in precision remains when it is compared to MAT + SuSiE. In addition, CIT-Lasso exhibits catching sets with substantially smaller size and higher purity than MAT + SuSiE.
Figure 1: Genome-wide simulation study.
The simulation study is based on 500 replicates. Each replicate includes 500 approximately independent 200kb regions across the genome, among which 50 contain one causal variant. CIT-Lasso: the proposed conditional independence test paired with a Lasso-type model. MAT: marginal association test. MAT + SuSiE: marginal association test followed by SuSiE fine-mapping. Similar to credible sets reported by SuSiE, CIT-Lasso identifies mutually exclusive catching sets of genetic variants that have independent conditional causal effects on the outcome of interest. We compare the catching/credible sets based on precision, recall, size and purity. Precision: the proportion of variants that are causal among all variants in the catching/credible sets. Recall: proportion of causal variants that are covered by any catching/credible sets (statistical power). Size: size of the catching/credible sets; we report the maximum size and average size of all identified catching/credible sets for each replicate. Purity: square of the smallest correlation among all pairs of variants within a catching/credible set; we report the minimum purity and average purity of all identified catching/credible sets for each replicate. While the average size and purity reflect the average performance, the maximum size and minimum purity reflect the performance for more challenging tasks.
In Supplemental Figure 1, we present additional comparisons with alternative implementations of SuSiE. We found that the improved precision remains when SuSiE is applied to the same regions discovered by CIT-Lasso. Therefore, the improvement compared to MAT + SuSiE is not just due to the less than ideal performance of the first stage MAT. When an additional 50% posterior inclusion probability (PIP) threshold is applied to SuSiE credible sets to further filter out variants, the precision becomes higher than CIT-Lasso while sacrificing the power. Unfortunately, the appropriate PIP threshold can vary from region to region, and it depends on LD. Current fine-mapping methods can only provide a relative importance of the variants (e.g. PIP) without a clear criterion for determining an appropriate threshold for the PIP.
The results also demonstrate an intrinsic difference in the way catching/credible sets are constructed. SuSiE’s credible sets aim to construct the smallest set that ensures a high probability (e.g. 95%) to cover the causal variants, and therefore each set may include variants with low probability to be causal. A catching set for CIT-Lasso aims to exclude variants with low chance to be causal. Empirically, CIT-Lasso seems to be selecting the top PIP variants from a SuSiE credible set, while identifying an adaptive threshold to control the empirical FDR. In Supplemental Figure 1, we further demonstrated that the proposed framework can also flexibly leverage the SuSiE model and utilize its predictions as a test statistic in lieu of the Lasso. With the same definition of catching sets that empirically control for variant-level FDR, CIT-Lasso and CIT-SuSiE exhibit almost the same precision and recall. In summary, the proposed method exhibits higher statistical power, with the ability to prioritize causal variants while empirically controlling FDR in contrast to the two-stage marginal association test + fine-mapping procedure.
Application to Alzheimer’s disease genetic studies
Alzheimer’s disease (AD) is the most common cause of dementia among people over the age of 65, affecting an estimated 5.5 million Americans. To study genetic risk and to identify molecular mechanisms of AD, we applied the proposed method to summary statistics from ten overlapping large-scale array-based genome-wide association studies, and whole-exome/-genome sequencing studies from 2017-2022. The details are described in the Methods section. We used LD matrices estimated using the Pan-UK Biobank data.26 We restrict the analyses to directly genotyped common and low-frequency variants with minor allele frequency >1%.
We propose a meta-analysis strategy that adaptively estimates weights to combine the ten studies allowing for sample overlap, as described in the Methods section. The meta-analysis Z-scores serve as the input of the proposed method. We present the estimated study correlations and the estimated optimal weights in Figure 2. The correlation results are consistent with our knowledge of overlap and other factors, such as differences in phenotype definition, analysis strategies, and quality control. Similarly, the weighting scheme up-weighted studies that are large in size and carry independent information, and down-weighted studies that largely overlap with others. Notably, the three major AD genetic studies, Jansen et al. 2019, Schwartzentruber et al. 2021 and Bellenguez et al. 2022, are estimated to have larger weights compared to the other studies which were integrated into the three meta-analyses.27–29 In addition to the meta-analysis, we applied the method separately to summary statistics from the three major AD genetic studies for comparison. We present the main meta-analysis results as a Manhattan plot in Figure 3. The Manhattan plot for conventional GWAS meta-analysis is in Supplementary Figure 2. We define two loci as different if they are at least 1Mb away from each other, where each locus may contain one or multiple sets of putative causal variants with conditionally independent effects. We adopt the most proximal gene’s name as the locus name, recognizing that it is not necessarily the causal gene.
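To make the role of study correlations concrete, the sketch below combines per-study Z-scores for one variant using a generic weighted statistic, $Z_{meta} = \sum_k w_k Z_k / \sqrt{w^\top C w}$, where $C$ is the correlation matrix of the study Z-scores under the null. The weights are taken as given inputs here; the adaptive estimation of weights is described in the Methods section and is not reproduced. Function names are hypothetical.

```python
import math


def combine_z_scores(z, weights, corr):
    """Weighted meta-analysis Z-score for possibly overlapping studies.

    z       : per-study Z-scores for one variant
    weights : per-study weights (assumed given, not estimated here)
    corr    : null correlation matrix of the study Z-scores (nested lists)
    """
    k = len(z)
    numerator = sum(w * x for w, x in zip(weights, z))
    # Variance of the weighted sum under the null: w' C w.
    variance = sum(
        weights[a] * corr[a][b] * weights[b] for a in range(k) for b in range(k)
    )
    return numerator / math.sqrt(variance)


def weights_in_percent(weights):
    """Express each weight as a percentage of the sum of all weights."""
    total = sum(weights)
    return [100.0 * w / total for w in weights]
```

With two independent studies that each report Z = 2, the combined Z-score is about 2.83; if the two studies fully overlap (correlation 1), the combined Z-score stays at 2, reflecting that the second study carries no additional information.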
Figure 2: Study design.
Our analysis of Alzheimer’s disease genetics is an aggregation of ten possibly overlapping studies from 2017 to 2022. Left panel: estimated study correlations; for each study, we present sequencing technology and sample size. Right panel: estimated adaptive combination of studies; studies with larger sample size and less correlation with other studies have larger weights. Each bar presents the weight per study in percentage, i.e. weight per study divided by the summation of all weights.
Figure 3: Meta-analysis of Alzheimer’s disease genetic studies.
Top panel: we present the Manhattan plot from the proposed conditional independence tests (CIT-Lasso). The dotted lines present the FDR threshold of 0.10. We define two loci as different if they are at least 1Mb away from each other. For each locus, we annotated the variant with the largest W statistic and adopted the most proximal gene’s name as the locus name. Loci that are less than 1Mb away from conventional GWAS loci (p ≤ 5 × 10^−8) are highlighted in red (45 loci). Additional loci identified by CIT-Lasso are highlighted in blue (37 loci). Variant density is shown at the bottom of the plot (number of variants per 1Mb). Mid panels: we present fine-mapping examples for rs6701713 (CR1; MPRA p = 8 × 10^−33; CRISPR-Microglia p = 2.9 × 10^−3), rs13025717 (BIN1; MPRA p = 2.4 × 10^−41; CRISPR-Microglia p = 1.3 × 10^−5) and rs6064392 (CASS4; MPRA p = 1.04 × 10^−22; CRISPR-Microglia p = 4.2 × 10^−2), three causal variants validated by Cooper et al. (2022) using massively parallel reporter assays (MPRA) coupled with CRISPR in neurons and microglia. We compare the marginal association test (MAT), MAT followed by SuSiE fine-mapping, and CIT-Lasso. Different colors represent different catching sets (for CIT-Lasso, they represent independent conditional causal effects). The legend presents the genes that each catching set potentially regulates, mapped by the cS2G method. The red dotted lines represent the p-value threshold of 5 × 10^−8 and the FDR threshold of 0.10 for MAT and CIT-Lasso, respectively. The blue dotted lines represent the location of the validated causal variants.
We observed that CIT-Lasso (target FDR=0.1) consistently identified more loci compared to marginal association tests as performed in conventional GWAS (35 loci vs. 23 loci for Jansen et al. 2019; 31 loci vs. 28 loci for Schwartzentruber et al. 2021; 68 loci vs. 42 loci for Bellenguez et al. 2022; 82 loci vs. 51 loci for our meta-analysis; results are summarized in Supplementary Figure 3). Meanwhile, the conditional independence test results in a substantially smaller number of proxy variants per locus (on average, 3.3 vs. 19.3 for Jansen et al. 2019; 1.6 vs. 18.8 for Schwartzentruber et al. 2021; 2.3 vs. 15.8 for Bellenguez et al. 2022; 2.6 vs. 18.8 for our meta-analysis). Overall, the results demonstrate that the proposed method is able to robustly discover additional loci that are missed by the marginal association test, and at the same time pinpoint potential causal variants underpinning each locus as a fine-mapping method would.
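The locus counts above depend on the rule that two loci are different if they are at least 1Mb apart. One way to apply such a rule is sketched below; it assumes that selected variants on the same chromosome are chained into one locus whenever consecutive variants lie less than 1Mb apart, which is an interpretation for illustration rather than a confirmed detail of the pipeline. The function name and input layout are hypothetical.

```python
def group_into_loci(variants, min_gap=1_000_000):
    """Group selected variants into loci.

    variants : iterable of (chromosome, position) tuples
    min_gap  : distance in base pairs at or beyond which a new locus starts
    """
    loci = []
    previous = None
    for chrom, pos in sorted(variants):
        same_locus = (
            previous is not None
            and chrom == previous[0]
            and pos - previous[1] < min_gap
        )
        if same_locus:
            loci[-1].append((chrom, pos))
        else:
            loci.append([(chrom, pos)])
        previous = (chrom, pos)
    return loci


def variants_per_locus(loci):
    """Average number of selected variants per locus."""
    return sum(len(locus) for locus in loci) / len(loci)
```

Given the resulting groups, the number of loci is `len(loci)` and the average number of selected variants per locus follows from `variants_per_locus`.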
Concordance with MPRA+CRISPR experiments
To further evaluate the validity of the identified putative causal variants, we leveraged data and results from Cooper et al. (2022), where the authors used MPRA to screen noncoding variants reported in previous AD GWAS, followed by CRISPR functional validation in neurons and microglia.30 In this experimental approach, the functional effect of a variant is measured by evaluating the effect on gene expression of the corresponding sequence alteration. This is considered to be the gold-standard approach to validate functional consequences of a variant (referred to as causal variant in this section). We evaluated how the variants identified by CIT-Lasso overlap with the nine MPRA+CRISPR validated variants (eight variants reported by Cooper et al. 2022 and one additional variant with p-value ≤ 0.05 in both MPRA and CRISPR experiments), and additionally compared results with SuSiE.
Among the nine causal variants, three are directly genotyped and they are all correctly identified by CIT-Lasso. We present them in Figure 3, and they correspond to the genes CR1, BIN1 and CASS4. For the CR1 locus, rs6701713 is a causal variant validated by MPRA (p = 8 × 10^−33) and CRISPR (microglia; p = 2.9 × 10^−3). SuSiE is able to include the causal variant as part of its credible set, but the PIP is close to 0 and one cannot determine whether it should be selected. CIT-Lasso is able to provide a selection that empirically controls the FDR. This functional variant is identified by CIT-Lasso as a putative causal variant. For the BIN1 locus, rs13025717 is a causal variant validated by MPRA (p = 2.4 × 10^−41) and CRISPR (microglia; p = 1.3 × 10^−5). This variant is not covered by any credible sets identified by SuSiE, while CIT-Lasso correctly identified it as a putative causal variant. For the CASS4 locus, rs6064392 is a causal variant validated by MPRA (p = 1.04 × 10^−22) and CRISPR (microglia; p = 4.2 × 10^−2). Both SuSiE and CIT-Lasso are able to pinpoint the causal variant.
The other six causal variants are not directly genotyped and therefore they are not present in our primary analysis. We present them in Supplementary Figure 4. We found that SuSiE and CIT-Lasso are able to identify tightly linked neighboring variants as the causal variants in some cases. For instance, rs1532277 is a causal variant validated by MPRA (p = 1.5 × 10^−6) and CRISPR (Neuron; p = 6.4 × 10^−5) at the CLU locus. While rs1532277 is not directly genotyped in the dataset we considered, both SuSiE and CIT-Lasso identify its tightly linked neighbor rs1532278 (134 base pairs away). The credible set defined by SuSiE contains two other variants, while CIT-Lasso pinpoints the exact variant alone. This is consistent with our simulation study showing that CIT-Lasso exhibits smaller sets and higher purity.
Validation of new loci solely identified by CIT-Lasso
Since the proposed meta-analysis aims to incorporate all major studies to date and they are possibly overlapping with each other, we do not have a hold-out independent dataset for replication. To validate the proposed method, we adopted three alternative strategies: 1. We investigated whether the variants identified in an earlier study (e.g. Jansen et al. 2019) were replicated with smaller p-values in a future study with increased sample size (e.g. the meta-analysis); 2. We evaluated whether the identified variants were more likely to be functional compared to genome background variants; 3. We confirmed that variants identified by the meta-analysis exhibited consistent associations across the ten studies. We are particularly interested in performing these validation strategies on the additional loci solely identified by the proposed method but missed by marginal association tests.
We first evaluate whether the variants identified by our proposed method can be replicated in a future larger study with smaller marginal p-values. We compared the distribution of p-values in the meta-analysis with those of Jansen et al. 2019 and Schwartzentruber et al. 2021 respectively. We present the results in Figure 4. We observe that the p-values of the variants identified in previous studies become generally smaller in the meta-analysis with an increased sample size (Figure 4, left panel). For variants from the additional loci solely identified by our proposed method but missed by conventional GWAS (p ≥ 5 × 10^−8), a large proportion of them are identified as statistically significant in the meta-analysis (Figure 4, right panel; 39% for Jansen et al. 2019; 60% for Schwartzentruber et al. 2021). This demonstrates the power of the proposed method, which allows identification of the weaker associations in earlier studies with smaller sample sizes.
Figure 4. Variants identified by CIT-Lasso are replicated in a larger study with smaller marginal p-values.
For variants identified by CIT-Lasso, we compare the marginal p-values in the original study and the marginal p-values in the meta-analysis. Results show that a large proportion of variants identified by CIT-Lasso but missed by the conventional marginal association test (loci that are more than 1Mb away from conventional GWAS loci) reach genome-wide significance level in a future study with larger sample size (38.9% for Jansen et al. 2019; 60.0% for Schwartzentruber et al. 2021). We excluded the APOE locus and truncated p-values at 10^−50 for better visualization.
Second, we investigate whether the identified variants are functionally enriched. We leveraged results from single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) in excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, astrocytes and OPCs.31 We evaluated how our identified variants overlap with the scATAC-seq peaks identified by Corces et al. (2020). In Figure 5, we present the enrichment relative to the background genome, defined as
$$\text{Enrichment} = \frac{\text{proportion of identified variants that overlap with scATAC-seq peaks}}{\text{proportion of background genome variants that overlap with scATAC-seq peaks}}.$$
Figure 5. Variants identified by CIT-Lasso are functionally enriched. The additional variants identified by CIT-Lasso have similar level of enrichment.
Top-left panel: variants identified by CIT-Lasso are more likely to overlap with scATAC-seq peaks in microglia. The enrichment is calculated as the proportion of identified variants that overlap with scATAC-seq peaks divided by that of the background genome. Top-right panel: variants identified by CIT-Lasso have higher cS2G scores, indicating more confident functional mapping to a gene that they potentially regulate. The figure presents the distribution of cS2G scores of the identified variants relative to that of all background genome variants. Bottom panel: variants identified by CIT-Lasso are more likely to be in a cS2G functional category.
We observed that the identified variants exhibit strong enrichment (>10x) in microglia relative to the background genome. Interestingly, the additional variants identified by the proposed method but missed by conventional GWAS (p ≥ 5 × 10⁻⁸) have similar levels of enrichment in microglia compared to those exhibiting stronger associations.
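As an illustration of the enrichment ratio used here, the following minimal sketch computes the proportion of identified variants falling in peaks divided by the same proportion for background variants. The function names, the interval representation of peaks and all coordinates are hypothetical and chosen only for this example; they are not taken from the analysis pipeline.

```python
def overlaps_any_peak(pos, peaks):
    """Return True if a position falls inside any (start, end) peak interval."""
    return any(start <= pos <= end for start, end in peaks)


def fold_enrichment(identified, background, peaks):
    """Proportion of identified variants in peaks divided by the background proportion."""
    prop_identified = sum(overlaps_any_peak(p, peaks) for p in identified) / len(identified)
    prop_background = sum(overlaps_any_peak(p, peaks) for p in background) / len(background)
    return prop_identified / prop_background


# Toy example on a single chromosome (all coordinates are hypothetical).
peaks = [(100, 200), (500, 600)]
background = list(range(0, 1000, 10))  # 100 background variants, 22 of them inside a peak
identified = [150, 520, 550, 900]      # 3 of 4 identified variants fall inside a peak

enrichment = fold_enrichment(identified, background, peaks)
print(f"fold enrichment: {enrichment:.2f}")
```

With real data the linear scan over peaks would typically be replaced by an interval index, but the ratio being computed is the same.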
We also leveraged a recent combined SNP-to-gene linking strategy (cS2G) proposed by Gazal et al. (2022) to annotate the identified variants and map them to the genes they potentially regulate.32 The cS2G method combines seven S2G functional categories: exon, promoter, GTEx fine-mapped cis-eQTL, eQTLGen fine-mapped blood cis-eQTL, EpiMap enhancer-gene linking, Activity-By-Contact (ABC), and scATAC-seq Cicero blood/basal. A cS2G linking score is then computed to summarize the functional consequences and the confidence of the gene mapping. In Figure 5, we present the distribution of the cS2G linking score and the proportion of variants that fall in each of the functional categories. We observed that variants identified by the proposed method have substantially higher cS2G linking scores relative to genome background variants, and a substantially higher proportion of them belong to one of the functional categories. Similar enrichment is found for the additional variants identified solely by the proposed method. The results show that the proposed method can identify putatively functional variants with weaker effects that are missed by conventional association tests.
Finally, we checked whether the variants identified solely by the proposed method but missed by conventional marginal association tests exhibit concordant Z-scores and directions of effect across the ten studies. We calculated the Spearman correlation of Z-scores between each pair of studies involved in the meta-analysis. For each identified variant, we also computed the proportion of studies where the direction of effect is concordant with that of the meta-analysis. We present the results in Supplementary Figure 6. We observed that Z-scores of the identified variants are positively correlated across the ten studies (median correlation = 0.7). In addition, almost all identified variants have concordant directions of effect when compared to the meta-analysis (for 94% of variants, the direction of effect in the meta-analysis is concordant with >80% of individual studies). This demonstrates that the proposed method, paired with the meta-analysis strategy, can robustly identify genetic variants that share consistent associations across the ten AD genetic studies from 2017 to 2022.
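The two concordance measures described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the array layout (variants in rows, studies in columns), the helper names and the toy Z-scores are assumptions, and the rank-based Spearman helper assumes no ties.

```python
import numpy as np


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of ranks (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return np.corrcoef(ranks_a, ranks_b)[0, 1]


def sign_concordance(z_studies, z_meta):
    """Per-variant proportion of studies whose Z-score sign matches the meta-analysis."""
    return (np.sign(z_studies) == np.sign(z_meta)[:, None]).mean(axis=1)


# Toy data: 4 variants (rows) x 3 studies (columns); all values are hypothetical.
z_studies = np.array([[ 2.1,  1.8,  2.5],
                      [-1.2, -0.9,  0.3],
                      [ 0.7,  1.1,  0.9],
                      [-2.0, -2.4, -1.7]])
z_meta = np.array([3.6, -1.0, 1.5, -3.5])

concordance = sign_concordance(z_studies, z_meta)
rho_01 = spearman(z_studies[:, 0], z_studies[:, 1])
print(concordance, rho_01)
```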
Retrospective analysis of large-scale GWAS summary statistics from 2013 to 2022
The proposed method simultaneously performs discovery and prioritization of causal variants in a few hours with only a single CPU. This opens up the exciting possibility of identifying putative causal variants at the phenome scale and beyond. In this section, we demonstrate the application of the proposed method to summary statistics from large-scale GWAS from 2013 to 2022. Specifically, we curated GWAS summary statistics from 400+ publications. We restricted our analysis to studies with sample size >100,000, with individuals of European ancestry and with at least ten loci passing the genome-wide significance level (marginal p-value ≤ 5 × 10⁻⁸). This results in 67 studies. We applied CIT-Lasso to the corresponding summary statistics and compared the results with those of the original GWAS based on marginal association tests (MAT); specifically, we compared the number of discovered loci and the average number of proxy variants per locus. We additionally evaluated the proportion of studies where the proposed method identifies more loci than MAT as a function of phenotype polygenicity (quantified by the number of loci discovered by the original GWAS). We present the results at FDR = 0.1 and 0.2 in Figure 6.
Figure 6. Retrospective analysis of large-scale GWAS summary statistics from 2013 to 2022.
Left panel: number of loci identified by CIT-Lasso vs. marginal association test (MAT); Middle panel: average number of proxy variants per locus by CIT-Lasso vs. MAT; Right panel: proportion of studies in which CIT-Lasso identifies more loci than MAT among all studies with a polygenicity of the trait greater than a particular level. The polygenicity is quantified by the number of loci identified by MAT as in the original GWAS.
We observe that CIT-Lasso generally identifies a substantially smaller number of proxy variants per locus compared to MAT (top mid panel). On average (over the 67 studies), CIT-Lasso identifies 1.83 proxy variants per locus whereas MAT identifies 8.61 (down 78.7%). We also observe that CIT-Lasso identifies more loci in most studies (top left panel). On average (over the 67 studies), the proposed method identifies 143.6 loci whereas MAT identifies 117.1 loci (up 22.7%). This trend is more pronounced at FDR = 0.2 and for phenotypes with higher polygenicity, as expected (top right panel). For example, the proposed method identifies more loci in 67.2% (89.6% at FDR = 0.2) of studies that have more than ten GWAS loci. The proportion increases to 81.0% (100% at FDR = 0.2) for studies with more than 100 GWAS loci, where the average number of loci identified by CIT-Lasso is 349.3 vs. 279.6 by MAT (up 24.9%). We present specific results for several common polygenic traits in Figure 6 (bottom three panels), including BMI (Yengo et al. 2018; 610 vs. 376 loci), height (Yengo et al. 2018; 714 vs. 639 loci), total cholesterol (Willer et al. 2013; 127 vs. 76 loci), bone mineral density (Kemp et al. 2017; 262 vs. 203 loci), hypertension (Zhu et al. 2019; 260 vs. 185 loci) and type 2 diabetes (Xue et al. 2018; 172 vs. 94 loci).33–37 The results demonstrate the generalizability of the proposed method to the phenome, and that it is particularly useful for complex traits with high polygenicity.
Discussion
The described method CIT-Lasso for identifying putative causal effects combines ideas from causal inference, model-X knockoffs and ultrahigh-dimensional sparse regression. By applying principles of causal inference, we designed a high-dimensional hypothesis testing problem that mimics experimentally editing a particular genomic sequence. We subsequently transformed the initial causal inference hypothesis into a conditional independence hypothesis, which can be rigorously tested by the modern model-X knockoffs framework. We facilitated this test by fitting an innovative ultrahigh-dimensional sparse regression, which enables the simultaneous modeling of all genomic variants present in the human genome. This results in a data driven method that learns far more efficiently and accurately from genetic data compared to the conventional marginal association models that analyze one genetic variant at a time, and then proceed toward a second stage fine-mapping procedure.
The proposed method identifies more loci compared to conventional marginal association tests. The superior power can be attributed to several factors, including: 1. the use of FDR control as opposed to the stringent genome-wide significance level for marginal association tests (5 × 10⁻⁸); and 2. the use of an ultrahigh-dimensional sparse regression that jointly models all genetic variants, thereby making inference more efficient. Both are especially helpful for complex traits with high polygenicity. In particular, ultrahigh-dimensional sparse regression has been employed recently in genetic research to establish polygenic risk scores that predict risk of complex diseases.25 It is well known that including genetic variants with relatively weaker associations that do not meet the stringent genome-wide significance level in such a comprehensive model can significantly improve prediction accuracy. This aligns with our observation that the inference derived from the proposed ultrahigh-dimensional sparse regression exhibits improved power relative to marginal association tests. However, it is highly nontrivial to conduct hypothesis testing in an ultrahigh-dimensional sparse regression.38 The integration with model-X knockoffs makes rigorous inference possible. This is a salient advantage of our proposed methodology compared to the original GhostKnockoffs method, which conducts similar conditional independence tests with marginal association test statistics.18
Existing fine-mapping methods calculate relative likelihoods of causality using the posterior inclusion probability (PIP).8 Interpretations of PIPs calculated through different methods can be inconsistent, and there is no systematic approach to defining a cutoff for determining causal variants.8,9,39 By contrast, the proposed method is an end-to-end method that simultaneously performs discovery and prioritization of causal variants. While it makes more discoveries, it is also superior to state-of-the-art second-stage fine-mapping methods in pinpointing the causal variants, and it automatically determines a cutoff that controls the FDR. Moreover, the current implementation of the ultrahigh-dimensional sparse regression can be replaced by any existing or future fine-mapping model, as illustrated in the SuSiE example, if it is demonstrated (or presumed) to offer enhanced performance in localizing causal variants.
Another appealing feature of the proposed method is that the identified variants per locus are grouped into different sets that presumably exhibit conditionally independent effects on the phenotype of interest. This allows the identification of independent causal effects underlying association signals with FDR control. As a proof of concept, in Supplementary Figure 7, we present the number of identified variants, the number of identified conditionally independent effects (defined as groups of identified variants) and the number of mapped genes via cS2G for each locus. We observed that as the number of conditionally independent effects increases, the number of mapped genes correspondingly increases (correlation coefficient = 0.81). This shows that the different conditionally independent effects tend to be mapped to different genes that they potentially regulate. In Supplementary Figure 8, we present some concrete examples for BIN1, PTK2B (CLU) and the HLA complex. For BIN1, the proposed method identified three conditionally independent effects, mapped to PROC (2 variants), BIN1 (1 variant) and ERCC3/BIN1 (3 variants), respectively. For PTK2B, the proposed method identified two conditionally independent effects, mapped to PTK2B (2 variants) and CLU (1 variant). For the HLA complex, the proposed method identified five conditionally independent effects that can be mapped via cS2G to five different genes: HLA-DRB5 (1 variant), HLA-DQA2 (1 variant), SLC44A4 (1 variant), AIF1 (1 variant) and POU5F1 (1 variant). The findings are consistent with previous studies showing that genetic association signals can manifest through multiple causal variants with conditionally independent effects on the phenotype of interest via regulating different genes.10
A limitation of our primary analysis is that we only considered directly genotyped variants, and hence the causal variants can be missing from the analysis. While imputation has been shown to improve the statistical power of marginal association models, the situation changes when our goal is to pinpoint the causal variants. As discussed in Sesia et al. (2020), the imputed genotypes are independent of the phenotype after conditioning on the directly genotyped variants, and all the conditional independence hypotheses corresponding to the imputed variants are null.11 From a causal inference perspective, formal causal inference is solely viable for variants that are directly sequenced. This is due to the violation of the positivity condition (see Methods) by imputed variants, since they always operate as a function of other variants and remain independent of the phenotype of interest when conditioned on the variants used for imputation. Nevertheless, with appropriate qualifications, it is possible to introduce imputed variants in the study, and we additionally performed a comparison between the analyses with directly genotyped variants (our primary analysis) and the analyses with all imputed variants. Although it is commonly assumed that imputation can facilitate fine-mapping the causal variant, we found that imputation does not improve fine-mapping in the nine examples considered here for SuSiE and CIT-Lasso. For the three examples where the causal variants are directly genotyped, SuSiE and CIT-Lasso can no longer identify the causal variants (Supplementary Figure 5). For the six examples where the causal variants are not directly genotyped, SuSiE and CIT-Lasso also fail to pinpoint the causal variants (causal variants are not included in any credible/catching sets), except for one example, rs636317 (Supplementary Figure 4). Our results emphasize the importance of whole-genome sequencing data, rather than imputation alone, in pinpointing causal variants.
As whole genome sequencing data becomes increasingly available, we hope that the proposed method and computational approaches that apply its techniques will become essential tools for identifying causal biological pathways. This, in turn, should expedite the creation of new therapeutic interventions.
Methods
Causal inference of genetic variants to attenuate LD confounding
Let $G = (G_1, \ldots, G_p)$ be a vector of genotypes, and $Y$ the phenotype. In conventional genome-wide association studies, we test the null hypothesis $H_0: Y \perp G_j$ for each genetic variant $j$, referred to as marginal association testing. It is essential to note that, without randomization-based assignment of $G_j$ to level $g_j$, the usual marginal association between $G_j$ and $Y$ commonly assessed in GWAS cannot uncover causation. One major confounding effect in genetic studies arises through linkage disequilibrium (LD), where a non-causal variant can be identified by a marginal association test if it is correlated with a causal variant. In causal inference, the causal effect of $G_j$ on $Y$ is defined on the basis of the counterfactual (or potential) outcomes $Y^{G_j = g_j}$ and $Y^{G_j = g_j'}$, which are mutually unobservable quantities.23,24 $Y^{G_j = g_j}$ is the outcome variable that would have been observed under the genotype value $G_j = g_j$; $Y^{G_j = g_j'}$ is the analogous quantity when $G_j = g_j'$. To evaluate the causal effect of the $j$-th genetic variant on $Y$, we are interested in testing the nonparametric conditional causal effect (CCE) for all joint strata $g_{-j}$ of the remaining variants $G_{-j}$:
$$H_0^{\mathrm{CCE}}: \; Y^{G_j = g_j} \mid G_{-j} = g_{-j} \;\overset{d}{=}\; Y^{G_j = g_j'} \mid G_{-j} = g_{-j} \quad \text{for all } g_j,\, g_j',\, g_{-j}.$$
The conditional causal effect ensures that the confounding effect through other genetic variants has been adjusted for.24 Intuitively, the CCE evaluates the effect of $G_j$ on $Y$ by changing the value of $G_j$ from $g_j$ to $g_j'$, without altering any other variants. If the distribution of $Y$ changes for any stratum $g_{-j}$, the $j$-th variant may be said to have a causal effect on $Y$. It is conceptually similar to a functional experiment that edits a particular sequence and then looks for any change in the phenotype (e.g. MPRA or CRISPR).
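While it is natural to consider this causal inference hypothesis $H_0^{\mathrm{CCE}}$, it is unclear how one should go about testing it. To address this, we propose to consider an alternative conditional independence hypothesis:
$$H_0^{\mathrm{CI}}: \; Y \perp G_j \mid G_{-j},$$
where $G_{-j}$ denotes the genotypes of all variants other than the $j$-th. This hypothesis is equivalent to testing the conditional causal effect for all strata $g_{-j}$ with $P(G_{-j} = g_{-j}) > 0$, i.e., $H_0^{\mathrm{CCE}}$, under the following conditions:
1. Unconfoundedness (conditional exchangeability): $Y^{G_j = g_j} \perp G_j \mid G_{-j}$ for all $g_j$.
2. Positivity: $P(G_j = g_j \mid G_{-j} = g_{-j}) > 0$ for all $g_j$ and all $g_{-j}$ with $P(G_{-j} = g_{-j}) > 0$.
3. Consistency: $Y^{G_j = g_j} = Y$ for every individual with $G_j = g_j$, where $Y$ is the observed outcome.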
The three conditions are the same as the usual identification conditions in causal inference (Rubin, 1974), which characterize the extent to which a statistical claim (based on observable quantities) can be interpreted as a causal effect. Condition 1 assumes no unmeasured confounders beyond $G_{-j}$, the genotypes of all variants other than the $j$-th. This means that the proposed test will attenuate confounding effects induced by linkage disequilibrium (the major confounding effect in genetic studies), but will not account for other unmeasured confounders, such as confounding effects from genetic variants that are not genotyped or environmental factors. Condition 2 assumes that $G_j$ and $G_{-j}$ are not deterministically linked (i.e. there must be enough variation in $G_j$ for strata defined by $G_{-j}$). Condition 3 is a rule in the logic of counterfactuals, as discussed by Pearl (2010): the potential outcome under the assignment $G_j = g_j$ is the outcome that will actually be observed in the event $G_j = g_j$.40,41 Under the unconfoundedness assumption and the consistency condition,
$$P(Y^{G_j = g_j} \le y \mid G_{-j} = g_{-j}) = P(Y^{G_j = g_j} \le y \mid G_j = g_j, G_{-j} = g_{-j}) = P(Y \le y \mid G_j = g_j, G_{-j} = g_{-j}).$$
Therefore, the null hypothesis for the conditional causal effect for all joint strata $g_{-j}$,
$$Y^{G_j = g_j} \mid G_{-j} = g_{-j} \;\overset{d}{=}\; Y^{G_j = g_j'} \mid G_{-j} = g_{-j},$$
can be written in terms of observable quantities as
$$Y \mid G_j = g_j, G_{-j} = g_{-j} \;\overset{d}{=}\; Y \mid G_j = g_j', G_{-j} = g_{-j},$$
which is equivalent to
$$Y \perp G_j \mid G_{-j}.$$
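To make the contrast between marginal association and conditional independence concrete, the following toy simulation may help. It is only a sketch under simplifying assumptions that are not part of the method: two Gaussian variables stand in for genotypes, the correlation of 0.8 and the effect size of 0.3 are arbitrary, and ordinary least squares is used as a stand-in for a conditional test. The non-causal variant shows a clear marginal association purely through LD, while its coefficient vanishes once the causal variant is conditioned on.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Two correlated "variants" (Gaussian stand-ins for genotypes) with correlation 0.8.
g1 = rng.standard_normal(n)
g2 = 0.8 * g1 + np.sqrt(1 - 0.8**2) * rng.standard_normal(n)

# Only g1 has an effect on the phenotype.
y = 0.3 * g1 + rng.standard_normal(n)

# Marginal association of the non-causal variant g2 with y (simple regression slope).
marginal_beta_g2 = np.dot(g2, y) / np.dot(g2, g2)

# Joint regression: the coefficient of g2 after conditioning on g1.
X = np.column_stack([g1, g2])
joint_beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"marginal slope of g2: {marginal_beta_g2:.3f}")
print(f"joint coefficients (g1, g2): {joint_beta.round(3)}")
```

The marginal slope of g2 is close to 0.8 × 0.3 = 0.24, whereas its joint coefficient is close to zero, which is the behavior the conditional independence hypothesis is designed to capture.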
Conditional independence test via GhostKnockoffs
To test the conditional independence hypothesis, He et al. (2022) proposed GhostKnockoffs and showed that for a particular feature importance defined by score test statistics in marginal association models, one can directly generate the knockoff feature importance score per variant without the need to generate individual-level knockoffs for hundreds of thousands of samples.18 In this paper, we propose an extension of GhostKnockoffs to feature importance from a joint model for improved power and prioritization of causal variants. Our proposed method is built on the method of Chen et al. (2024), which contains four main steps: (1) generate multiple ($M$) knockoff Z-scores per variant, $\tilde{Z}^1, \ldots, \tilde{Z}^M$. The knockoff Z-scores can serve as negative controls in any model to perform variable selection with FDR control; (2) calculate the feature importance score via a genome-wide sparse regression for both original and knockoff variants; (3) calculate the test statistic by contrasting the feature importance scores for the original and knockoff variants; (4) implement the knockoff filter procedure to select significant variants with FDR control.19 Here, steps (1) - (3) are applied separately to each approximately independent LD block. The LD blocks are defined by Berisa and Pickrell (2016).40
Step 1: Generate knockoff Z-scores
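Assuming we generate $M$ knockoff copies, He et al. (2022) showed that one can generate knockoff Z-scores by
$$\tilde{Z} = P Z + E, \quad E \sim N(0, V), \quad P = \begin{pmatrix} I - D\Sigma^{-1} \\ \vdots \\ I - D\Sigma^{-1} \end{pmatrix}, \quad V = \begin{pmatrix} C & C - D & \cdots & C - D \\ C - D & C & \cdots & C - D \\ \vdots & \vdots & \ddots & \vdots \\ C - D & C - D & \cdots & C \end{pmatrix}, \quad C = 2D - D\Sigma^{-1}D,$$
where $\tilde{Z} = (\tilde{Z}^1, \ldots, \tilde{Z}^M)$ is a $pM$-dimensional vector; $Z$ is the $p$-dimensional vector of marginal Z-scores; $I$ is a $p \times p$ identity matrix; $\Sigma$ is the correlation matrix of $G$ that characterizes the linkage disequilibrium; $D = \mathrm{diag}(s_1, \ldots, s_p)$ is a diagonal matrix obtained by solving a convex optimization problem.12,18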
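A direct (non-optimized) sketch of this knockoff Z-score generation is given below. It assumes the form $\tilde{Z} = PZ + E$ with $E \sim N(0, V)$, where $P$ stacks $M$ copies of $I - D\Sigma^{-1}$ and $V$ has diagonal blocks $C = 2D - D\Sigma^{-1}D$ and off-diagonal blocks $C - D$. The function name is hypothetical, and the choice of $D$ as a constant multiple of the identity is a simplifying assumption made only so that the example is self-contained; in practice $D$ would come from a convex optimization problem.

```python
import numpy as np


def ghost_knockoff_z(z, sigma, M=1, rng=None):
    """Generate M knockoff copies of a Z-score vector; returns an array of shape (M, p)."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(z)

    # Assumed choice of D: equal diagonal entries, shrunk so that V stays positive definite.
    lam_min = np.linalg.eigvalsh(sigma)[0]
    s = 0.9 * min(1.0, (M + 1) / M * lam_min)
    D = s * np.eye(p)

    sigma_inv = np.linalg.inv(sigma)
    P_block = np.eye(p) - D @ sigma_inv              # I - D Sigma^{-1}
    C = 2 * D - D @ sigma_inv @ D

    # V has C on the diagonal blocks and C - D on the off-diagonal blocks.
    V = np.kron(np.ones((M, M)), C - D) + np.kron(np.eye(M), D)

    # Direct sampling with a (pM x pM) Cholesky factor; fine for small toy problems.
    E = np.linalg.cholesky(V) @ rng.standard_normal(p * M)
    z_tilde = np.tile(P_block @ z, M) + E
    return z_tilde.reshape(M, p)


# Toy LD matrix and Z-scores (hypothetical values).
sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.6],
                  [0.3, 0.6, 1.0]])
z = np.array([4.0, 2.5, 1.0])
z_tilde = ghost_knockoff_z(z, sigma, M=5, rng=np.random.default_rng(1))
print(z_tilde.round(2))
```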
Although the knockoff method helps to prioritize causal variants over associations due to LD, it is difficult or impossible to distinguish causal genetic variants from highly correlated variants. The presence of tightly linked variants can diminish the power to identify the causal ones. To address this issue, we applied the group knockoff construction recently proposed by Chu et al. (2023).22 Conceptually, the group knockoff construction allows testing whether variants in a group have an effect on the response, conditional on all variants outside the group. The object of inference is shifted from single-variable to sets of highly correlated variables, and the statistical power to detect causal variants improves. In practice, groups are defined by applying average linkage hierarchical clustering with correlation coefficient cutoff 0.5. We refer to the selected variants in each LD group as the “catching set”.
Assuming we generate $M$ knockoff copies, the group knockoff construction ensures that the covariance matrix of $(G, \tilde{G}^1, \ldots, \tilde{G}^M)$ takes the form
$$\begin{pmatrix} \Sigma & \Sigma - D & \cdots & \Sigma - D \\ \Sigma - D & \Sigma & \cdots & \Sigma - D \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma - D & \Sigma - D & \cdots & \Sigma \end{pmatrix}.$$
Unlike the single-variant knockoff construction where $D$ is a diagonal matrix, here $D$ is a block diagonal matrix where each block corresponds to a predefined group of tightly linked variants. Group FDR can be controlled for any choice of $D$ as long as the matrix above is a valid covariance matrix (symmetric and positive definite). Intuitively, we aim to find “large” $D$ such that the knockoffs are sufficiently different from the original data to achieve high power, under the constraint that they are exchangeable with the original data at the group level to achieve FDR control. Among various possibilities considered in the literature, maximizing the entropy of the joint distribution of $(G, \tilde{G}^1, \ldots, \tilde{G}^M)$ exhibits promising empirical power. It translates to solving the following convex optimization problem:
$$\max_{D} \; \log\det\!\left(\frac{M+1}{M}\Sigma - D\right) + M \log\det(D) \quad \text{subject to} \quad 0 \preceq D \preceq \frac{M+1}{M}\Sigma,$$
where the objective function equals the entropy of the joint distribution up to constants and the constraint ensures that the joint covariance matrix is valid. To solve this problem, we apply the coordinate descent algorithm for maximum entropy group knockoffs and accelerate its convergence by exploiting the conditional independence among groups, following Chu et al. (2023).22
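To generate the noise term $E \sim N(0, V)$ of the knockoff Z-scores, where $V$ has diagonal blocks $C = 2D - D\Sigma^{-1}D$ and off-diagonal blocks $C - D$, one classically uses a Cholesky decomposition of $V$, which can be computationally intensive since $V$ is a $pM \times pM$ matrix. Here, we utilize the structure of $V$ and propose an efficient sampling method as follows.
Compute the Cholesky factorization $L L^\top = \frac{M+1}{M} D - D\Sigma^{-1}D$.
Generate $e_0 \sim N(0, I_p)$; let $E_0 = L e_0$.
Generate $e_1, \ldots, e_M$ i.i.d. from $N(0, D)$; compute $\bar{e} = \frac{1}{M}\sum_{m=1}^{M} e_m$; let $E_m = e_m - \bar{e}$.
Calculate $E = (E_0 + E_1, \ldots, E_0 + E_M)$.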
As a result, the computational complexity is reduced from to .
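The idea of exploiting the block structure can be illustrated with the sketch below. It assumes that $V$ has diagonal blocks $C = 2D - D\Sigma^{-1}D$ and off-diagonal blocks $C - D$, and uses one decomposition consistent with that structure: a component shared by all copies with covariance $\frac{M+1}{M}D - D\Sigma^{-1}D$, plus i.i.d. $N(0, D)$ draws centered at their average. The function name and toy inputs are hypothetical, and the sketch is not claimed to match the authors' implementation line by line. The final lines check algebraically that the implied covariance equals $V$.

```python
import numpy as np


def sample_noise_structured(D, sigma_inv, M, rng):
    """Draw E ~ N(0, V) using p x p factorizations only; returns an array of shape (M, p)."""
    p = D.shape[0]
    A = (M + 1) / M * D - D @ sigma_inv @ D         # covariance of the shared component
    L = np.linalg.cholesky(A)
    shared = L @ rng.standard_normal(p)             # same draw added to every knockoff copy

    L_D = np.linalg.cholesky(D)                     # cheap when D is (block) diagonal
    e = (L_D @ rng.standard_normal((p, M))).T       # M i.i.d. draws from N(0, D)
    centered = e - e.mean(axis=0)                   # subtract the average over the M copies
    return shared + centered                        # row m is the noise for copy m


# Toy inputs (hypothetical values); D must satisfy D < (M + 1) / M * Sigma.
p, M = 3, 4
sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.6],
                  [0.3, 0.6, 1.0]])
sigma_inv = np.linalg.inv(sigma)
D = 0.2 * np.eye(p)
E = sample_noise_structured(D, sigma_inv, M, np.random.default_rng(0))

# Algebraic check: the covariance implied by the decomposition equals the target V.
C = 2 * D - D @ sigma_inv @ D
V = np.kron(np.ones((M, M)), C - D) + np.kron(np.eye(M), D)
A = (M + 1) / M * D - D @ sigma_inv @ D
V_implied = np.kron(np.ones((M, M)), A) + np.kron(np.eye(M) - np.ones((M, M)) / M, D)
print(np.allclose(V, V_implied))
```

The only dense factorization is of a $p \times p$ matrix, independent of the number of knockoff copies $M$.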
Step 2: Calculate feature importance scores
We consider an ultrahigh-dimensional sparse regression as the “working” model. It is worth noting that knockoff inference holds without assuming that this working model is correctly specified. Without loss of generality, we assume that both phenotype and genotype variables are centered and standardized to have mean 0 and variance 1. If there are additional covariates, $Y$ can be centered at the conditional mean given the covariates. Specifically, we consider a joint model of the original variants and knockoffs:
$$Y = G\beta + \sum_{m=1}^{M} \tilde{G}^m \tilde{\beta}^m + \varepsilon.$$
The corresponding penalized regression to estimate $b = (\beta, \tilde{\beta}^1, \ldots, \tilde{\beta}^M)$ minimizes:
$$\frac{1}{2n}\left\| Y - [G, \tilde{G}^1, \ldots, \tilde{G}^M]\, b \right\|_2^2 + \lambda \|b\|_1.$$
In this paper, we consider the setting where individual-level data $G$ and $Y$ are not available but summary statistics are available, including: 1. Sample size $n$ of the target study; 2. Marginal Z-scores $Z$ from a target study; 3. Correlation matrix $\Sigma$ from a reference panel. With this, Chen et al. (2024) proposed an alternative objective function that minimizes:
$$\frac{1}{2} b^\top A\, b - b^\top r + \lambda \|b\|_1,$$
where $A$ is the expectation of $\frac{1}{n}[G, \tilde{G}^1, \ldots, \tilde{G}^M]^\top [G, \tilde{G}^1, \ldots, \tilde{G}^M]$, with $r = \frac{1}{\sqrt{n}}(Z, \tilde{Z}^1, \ldots, \tilde{Z}^M)$.19 Under the condition that $G$ follows a multivariate Gaussian distribution, Chen et al. (2024) show that substituting $\frac{1}{n}[G, \tilde{G}^1, \ldots, \tilde{G}^M]^\top [G, \tilde{G}^1, \ldots, \tilde{G}^M]$ by $A$ and solving this alternative optimization problem guarantees FDR control, although it is no longer the exact solution of the original penalized regression using individual-level data.19 Also, it is substantially more powerful than the original version of GhostKnockoffs developed by He et al. (2022).18 In application to genetic variants where $G$ is not Gaussian, our empirical results show that FDR control remains valid, which is consistent with the results on robustness of knockoffs inference discussed in Barber et al. (2020).41 To address the issue that $A$ can be nearly singular and the solution can be numerically unstable, in practice we work with a shrunken version of $A$ that combines it with an identity matrix $I$. The shrinkage version is numerically equivalent to an elastic-net problem, since the identity component adds a ridge penalty proportional to $\|b\|_2^2$ to the objective.
Once $b$ is estimated, we can then calculate the feature importance scores as the magnitude of effect size: $T_j^0 = |\hat{\beta}_j|$ for the original variants and $T_j^m = |\hat{\tilde{\beta}}_j^m|$ for the $m$-th knockoff copy.
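To illustrate how a lasso can be fitted from summary statistics alone, the sketch below minimizes an objective of the form $\frac{1}{2} b^\top A b - b^\top r + \lambda \|b\|_1$ by plain cyclic coordinate descent. This is a minimal illustration under assumed notation ($A$ a correlation-type matrix, $r$ the scaled Z-scores), not the scalable solver used in this work; the function names and the toy inputs are hypothetical.

```python
import numpy as np


def soft_threshold(x, lam):
    """Soft-thresholding operator: sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * max(abs(x) - lam, 0.0)


def summary_lasso(A, r, lam, n_iter=200):
    """Cyclic coordinate descent for 0.5 * b'Ab - r'b + lam * ||b||_1."""
    b = np.zeros(len(r))
    for _ in range(n_iter):
        for j in range(len(r)):
            # r_j - sum_{k != j} A_jk b_k
            partial = r[j] - A[j] @ b + A[j, j] * b[j]
            b[j] = soft_threshold(partial, lam) / A[j, j]
    return b


# Toy example: two variants in strong LD (correlation 0.9); only the first has an effect.
A_ld = np.array([[1.0, 0.9],
                 [0.9, 1.0]])
r_ld = np.array([0.30, 0.27])           # equals A_ld @ [0.3, 0.0]
b_ld = summary_lasso(A_ld, r_ld, lam=0.05)
importance = np.abs(b_ld)                # feature importance = magnitude of effect size
print(b_ld, importance)

# With an identity A the solution reduces to soft-thresholding of r.
b_identity = summary_lasso(np.eye(3), np.array([0.5, -0.2, 0.05]), lam=0.1)
```

In the toy LD example the joint model assigns a nonzero coefficient only to the first variant, even though both have sizeable marginal statistics.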
Solving the optimization problem defined by the aforementioned objective function is a non-trivial task when the number of genetic variants is very large. We propose a fast and scalable batch screening iterative lasso (BASIL) algorithm to solve , where . Compared to the previous BASIL algorithm based on individual level data, our BASIL algorithm requires only summary statistics and leverages the structure of the matrix for substantially improved computational efficiency.25 The algorithm is described as follows.
Initialization:
Let .
Compute and .
Create a sequence of equally spaced in log-scale. Let be the first values.
Find variants with the largest . Denote the index set as .
Iteration:
for the -th iteration, ,
Solve the optimization problem for to get .
Compute , for each .
Find smallest , denoted as , such that for all .
Let denote the index at which appears in the original sequence . Create a new sequence .
Find the variables in with largest . Add the variables to . Denote the updated index set as for the next iteration.
Repeat the iterations until reaches the smallest value .
To select the tuning parameter in absence of individual level data, we adopted the method proposed by Chen et al. (2024):
where .19 For data from genetic association studies, we observed that is often estimated to be very close to 1; the extreme value can be well approximated by due to the approximately block-diagonal LD structure, where is an identity matrix; when is sufficiently large. Therefore, we used the approximation below:
We calculated by Monte Carlo simulation with 10 replicates. Notably, this approximation can be calculated prior to data analysis, enabling efficient and robust parallel analysis of approximately independent LD blocks with the same tuning parameter.
Step 3: Calculate test statistics
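After the feature importance scores are calculated, we compute the statistics
$$\kappa_j = \arg\max_{0 \le m \le M} T_j^m, \qquad \tau_j = T_j^{(0)} - \operatorname{median}_{1 \le m \le M} T_j^{(m)},$$
where $m$ indicates the $m$-th knockoff copy (with $m = 0$ for the original variant) and $T_j^{(0)} \ge T_j^{(1)} \ge \cdots \ge T_j^{(M)}$ are the order statistics of the $T_j^m$’s, where $T_j^0 = |\hat{\beta}_j|$ and $T_j^m = |\hat{\tilde{\beta}}_j^m|$ as described in Step 2. For the $j$-th variant, $\kappa_j$ denotes the index of the original (denoted as 0) or the knockoff feature that has the largest importance score; $\tau_j$ denotes the difference between the largest importance score and the median of the remaining importance scores. Here, knockoff scores $\kappa_j$ and $\tau_j$ obey a property similar to the “flip-sign” property in the case of a single knockoff copy.14,42 In the multiple knockoff scenario, $\kappa_j$ plays the role of the sign, and $\tau_j$ quantifies the magnitude of the contrast. Subsequently, we define a test statistic to quantify the magnitude of effect on the outcome as
$$W_j = \tau_j \cdot I(\kappa_j = 0).$$
Variants with $W_j \ge t$ are selected, where $t$ is the threshold calculated by the knockoff filter at the target FDR level $q$.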
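The statistics in this step can be sketched as follows. The code assumes the definitions $\kappa_j = \arg\max_m T_j^m$, $\tau_j$ = largest score minus the median of the remaining scores, and $W_j = \tau_j I(\kappa_j = 0)$, together with the standard (unmodified) multiple-knockoff filter threshold, i.e. the smallest $t$ such that $\{1/M + (1/M)\,\#(\kappa_j \ge 1, \tau_j \ge t)\} / \max\{\#(\kappa_j = 0, \tau_j \ge t), 1\} \le q$. Function names and the toy importance scores are hypothetical; the modified variant-level filter of Step 4 is not implemented here.

```python
import numpy as np


def knockoff_statistics(T):
    """T has shape (p, M + 1): column 0 is the original, columns 1..M are the knockoffs."""
    kappa = T.argmax(axis=1)                        # 0 means the original has the largest score
    T_sorted = -np.sort(-T, axis=1)                 # order statistics, largest first
    tau = T_sorted[:, 0] - np.median(T_sorted[:, 1:], axis=1)
    W = np.where(kappa == 0, tau, 0.0)
    return kappa, tau, W


def knockoff_threshold(kappa, tau, M, fdr):
    """Smallest t > 0 whose estimated false discovery proportion is at most fdr."""
    for t in np.sort(tau[tau > 0]):
        false_est = (1 + np.sum((kappa >= 1) & (tau >= t))) / M
        n_selected = max(np.sum((kappa == 0) & (tau >= t)), 1)
        if false_est / n_selected <= fdr:
            return t
    return np.inf


# Toy scores: 12 variants, M = 5 knockoffs (all values hypothetical).
M = 5
T = np.zeros((12, M + 1))
T[:10, 0] = np.linspace(1.0, 2.0, 10)   # ten variants where the original clearly wins
T[10, 3] = 0.5                          # two variants where a knockoff copy wins
T[11, 1] = 0.2

kappa, tau, W = knockoff_statistics(T)
t = knockoff_threshold(kappa, tau, M, fdr=0.1)
selected = np.flatnonzero(W >= t)
print(t, selected)
```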
Step 4: Apply modified variant-level group knockoffs filter
To provide variant-level interpretation given group knockoffs, we proposed a modified knockoff filter to test for
$$H_0: \; Y \perp G_j \mid G_{-B(j)},$$
where $B(j)$ represents the group where the $j$-th variant resides; $G_{-B(j)}$ are the genotypes for all variants across the genome except variants in $B(j)$. Intuitively, it performs conditional independence tests between groups and marginal association tests within each group (so the variants within a group are not conditioned on each other). The modified knockoff filter at the target FDR level $q$ is defined as
In addition, we define the -value for a variant with statistics and as
where is an estimate of the proportion of false discoveries if we are to select all variants with feature statistics . For variants with , we define and they will never be selected. Selecting variants with where is calculated at target , is equivalent to selecting variants with . The details are provided in Supplementary Materials. Compared to the group knockoff filter that selects the entire group that contains the causal variants as the catching set, this variant level filter highlights variants within each group that exhibit larger feature importance. In practice, we found that this significantly reduces the size and improves the purity of the catching set when it is paired with our proposed ultrahigh-dimensional sparse regression which additionally introduces sparsity to the feature importance scores within each group.
Meta-analysis of potentially overlapping studies
Our proposed method takes marginal Z-scores from a target study as input. Here we describe a simple and effective method to calculate meta-analysis Z-scores by aggregating Z-scores from possibly overlapping studies. The calculated meta-analysis Z-score directly serves as the input of the proposed knockoff inference, which allows the integration of multiple studies to the maximum extent feasible. We note that this significantly simplifies the meta-analysis pipeline proposed by He et al. (2022) because one no longer needs to perform knockoff inference for each study separately.18
Let be the Z-scores from the -th study and be the sample size of the -th study. We define meta-analysis Z-score as:
where the optimal weights are given by solving
is a diagonal matrix where if is observed and otherwise; where . We calculate the study correlation matrix
In practice, we use variants with |Z-score| ≤ 1.96 to calculate the study correlation matrix, in order to remove the correlation due to polygenic effects. Similar methods have been proposed by Lin and Sullivan (2009) and implemented in METAL (https://genome.sph.umich.edu/wiki/METAL_Documentation).43,44
Connections with other causal inference methods
Our task differs from conventional applications of causal inference on treatment effects. Although many standardization and inverse probability weighting methods have been developed to adjust for potential bias in observational studies, most existing approaches focus on inference on the effects of binary treatments in a one-by-one (i.e., marginal) manner.24,45–53 Such approaches are not applicable to large-scale genetic studies where the goal is to simultaneously infer the causal effect of a large number of genetic variants on the outcome of interest; indeed, the extension of causal inference methodology to high-dimensional biology applications is a limited but growing area.54–57 This motivates our proposed method based on CCE and knockoffs to mimic a functional experiment that edits a particular sequence and looks for any change in the phenotype. In this section, we clarify the connections between our method and other existing causal inference methods.
Unconditional causal effect (UCE).
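One popular measure of causal effect in epidemiologic studies is the UCE.24 The corresponding hypothesis is defined as
$$H_0^{\mathrm{UCE}}: \; Y^{G_j = g_j} \;\overset{d}{=}\; Y^{G_j = g_j'} \quad \text{for all } g_j,\, g_j'.$$
If a variant obeys the null hypothesis $H_0^{\mathrm{CCE}}$, it also obeys the null $H_0^{\mathrm{UCE}}$ because
$$P(Y^{G_j = g_j} \le y) = \sum_{g_{-j}} P(Y^{G_j = g_j} \le y \mid G_{-j} = g_{-j})\, P(G_{-j} = g_{-j}) = \sum_{g_{-j}} P(Y^{G_j = g_j'} \le y \mid G_{-j} = g_{-j})\, P(G_{-j} = g_{-j}) = P(Y^{G_j = g_j'} \le y).$$
The implication is naturally one-sided: the null hypothesis $H_0^{\mathrm{UCE}}$ does not imply the null $H_0^{\mathrm{CCE}}$. That is, the proposed method may successfully identify a genetic variant exhibiting a causal effect in one stratum, but the said variant may not have an unconditional causal effect (i.e., when marginalizing across strata).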
Average causal effect (ACE).24
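An analogous derivation can be applied to the average causal effect, whose null hypothesis $H_0^{\mathrm{ACE}}$ is defined as
$$H_0^{\mathrm{ACE}}: \; E\!\left[Y^{G_j = g_j}\right] = E\!\left[Y^{G_j = g_j'}\right] \quad \text{for all } g_j,\, g_j'.$$
If the underlying model is linear, $Y = \sum_{j} \beta_j G_j + \varepsilon$, testing conditional independence is equivalent to testing the unconditional causal effect and testing the average causal effect under the same three identification conditions.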
Double robustness.
When seeking to draw causal inferences, statistical bias can occur, even under the unconfoundedness assumption, if the working model is mis-specified. One appealing feature of applying model-X knockoffs to test for the conditional causal effect is that the inference via model-X knockoffs is robust to the working model assumption. That is, although the test statistic employs lasso-type regression, the FDR remains under control even if the true disease model is nonlinear. This requires that the joint distribution of $G$ is estimated accurately in the construction of valid knockoffs. This is similar to doubly robust estimation in causal inference, which requires that the exposure model $P(G_j \mid G_{-j})$ be correctly specified if the outcome model $E(Y \mid G)$ is mis-specified, although the guarantee in this case is a consequence of the knockoffs framework rather than of semi-parametric statistical theory based on efficient influence functions.49,58
Mendelian randomization.
Our task of finding causal variants substantially differs from that of Mendelian randomization, another popular causal inference method for genetic data.59–61 Unlike MR which uses genetic variation as an instrumental variable to investigate the causal effect of a modifiable exposure (e.g., demographic features, gene expressions, proteins) on the outcome of interest subject to unmeasured confounders, our target directly lies on the causal pathway between genetic variation and the outcome of interest under the unconfoundedness assumption.
Simulation studies
We performed simulations to empirically evaluate the performance of the proposed conditional independence tests in terms of precision, recall, and the size and purity of the catching/credible sets as defined earlier. We present the results in Figure 1. The primary comparison methods include: 1) the proposed conditional independence test paired with a Lasso-type model (CIT-Lasso); 2) the marginal association test (MAT); 3) the marginal association test followed by SuSiE fine-mapping (MAT + SuSiE) based on reported credible sets. The SuSiE credible sets are based on the default settings of the susieR package, with the maximum number of non-zero effects in the SuSiE regression model equal to 10, coverage equal to 95%, and the minimum absolute correlation allowed in a credible set equal to 0.5. We additionally compared CIT-Lasso with alternative definitions of SuSiE credible sets and present results in Supplementary Figure 1, including: 1) the proposed conditional independence test paired with a SuSiE model (CIT-SuSiE); 2) application of SuSiE fine-mapping to the same regions discovered by CIT-Lasso (Matched loci + SuSiE); 3) application of SuSiE fine-mapping to the same regions discovered by CIT-Lasso, with an additional 50% PIP threshold to filter out variants in the reported credible sets (Matched loci + SuSiE + 50%PIP).
We simulated genetic data directly using unimputed genotype data from the UK Biobank. The simulation study contains 500 replicates. For each replicate, we sampled 15,000 individuals with real genotype data on variants with minor allele frequency (MAF) ≥0.01. We sampled 200kb regions from 500 nearly independent LD blocks (one region per block). We restricted the simulations to variants with MAF ≥0.01 to ensure stable calculation of summary statistics (e.g., p-values). We used 5,000 of the 15,000 individuals as the reference panel to estimate the LD structure and considered the remaining 10,000 individuals as the target study to compute the Z-scores. We considered a relatively small sample size for the reference panel to demonstrate the potential application of the proposed method to underrepresented populations, where current sample sizes are substantially smaller (e.g., the Pan-UKBB panel has 420,531 samples for European ancestry but only 6,636 samples for African ancestry).
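The reference/target split described above can be sketched as follows. This is an illustration only: it uses synthetic toy genotypes rather than UK Biobank data, a scaled-down 100/200 split in place of 5,000/10,000, and hypothetical function names. LD is estimated as the Pearson correlation between variants in the reference individuals.

```python
import math
import random


def simulate_genotypes(n, maf=0.3, copy_prob=0.9, seed=1):
    """Toy 0/1/2 genotypes: variants 0 and 1 are in LD, variant 2 is independent."""
    rng = random.Random(seed)

    def draw():
        # Two independent alleles, each carrying the minor allele with prob. maf.
        return sum(rng.random() < maf for _ in range(2))

    rows = []
    for _ in range(n):
        g0 = draw()
        g1 = g0 if rng.random() < copy_prob else draw()
        rows.append([g0, g1, draw()])
    return rows


def split_reference_target(n_total, n_reference, seed=0):
    """Randomly partition individuals into a reference panel and a target study."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return sorted(idx[:n_reference]), sorted(idx[n_reference:])


def ld_matrix(genotypes):
    """Pearson correlation matrix between variant columns."""
    n, p = len(genotypes), len(genotypes[0])
    scaled = []
    for j in range(p):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        centered = [g - mean for g in col]
        norm = math.sqrt(sum(v * v for v in centered))
        if norm == 0:
            raise ValueError(f"variant {j} is monomorphic in this panel")
        scaled.append([v / norm for v in centered])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


genotypes = simulate_genotypes(300)
ref_idx, target_idx = split_reference_target(len(genotypes), n_reference=100)
ld = ld_matrix([genotypes[i] for i in ref_idx])  # LD from the reference panel only
```

Keeping the two sets disjoint mimics the practical setting in which the LD reference panel and the study that produced the Z-scores consist of different individuals.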
To simulate the trait, we randomly set 50 regions to be causal, each with one directly genotyped causal variant. The quantitative trait is then simulated as follows:
where and they are independent; is the observed covariate that is adjusted in the analysis; reflect variation due to unobserved covariates; are selected risk variants. We set the effect , where is the MAF for the -th variant. We define such that the variance due to the risk variants, , is 1.2. We performed a marginal score test for each variant to compute Z-scores, adjusting for the observed covariate .
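As an illustration of the last step, the sketch below computes a marginal score-test Z-score for a quantitative trait while adjusting for a single observed covariate. It assumes a linear null model with an intercept and one covariate. The trait model, the covariate effect (0.8), and the variant effect (0.5) are hypothetical choices for this example and are not the settings used in the simulations.

```python
import math
import random


def residualize(v, x):
    """Residuals of v after least-squares regression on an intercept and x."""
    n = len(v)
    mean_v, mean_x = sum(v) / n, sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (vi - mean_v) for xi, vi in zip(x, v)) / sxx
    return [vi - mean_v - slope * (xi - mean_x) for xi, vi in zip(x, v)]


def score_z(g, y, x):
    """Score-test Z-score for variant g in the model y ~ 1 + x + g,
    evaluated under the null hypothesis that g has no effect."""
    n = len(y)
    y_res = residualize(y, x)  # trait residuals under the null model
    g_res = residualize(g, x)  # genotype with the covariate projected out
    sigma2 = sum(r * r for r in y_res) / (n - 2)  # null residual variance
    score = sum(gi * ri for gi, ri in zip(g_res, y_res))
    return score / math.sqrt(sigma2 * sum(gi * gi for gi in g_res))


rng = random.Random(2024)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
g_causal = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
g_null = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n)]
# Hypothetical trait: covariate effect 0.8, causal variant effect 0.5, unit noise.
y = [0.8 * xi + 0.5 * gi + rng.gauss(0, 1) for xi, gi in zip(x, g_causal)]

z_causal = score_z(g_causal, y, x)
z_null = score_z(g_null, y, x)
```

Under the null the Z-score is approximately standard normal, so the causal variant should produce a large positive value while the independent null variant should not.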
Summary statistics from Alzheimer’s disease genetic studies
We consider summary statistics from ten overlapping studies: 1. The genome-wide survival association study by Huang et al. 2017 (14,406 cases, 25,849 controls); 2. The genome-wide meta-analysis by Jansen et al. 2019 (71,880 clinically diagnosed/proxy AD cases, 383,378 controls); 3. The genome-wide meta-analysis by Kunkle et al. 2019 (21,982 cases, 41,944 controls); 4. The genome-wide meta-analysis by Schwartzentruber et al. 2021, aggregating Kunkle et al. 2019 and UK Biobank based on a proxy AD phenotype; 5. An in-house genome-wide association study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels; 6-7. Two whole-exome sequencing analyses of data from the Alzheimer’s Disease Sequencing Project (ADSP) by Bis et al. 2020 (5,740 cases, 5,096 controls), and Le Guen et al. 2021 (6,008 cases, 5,119 controls); 8. An in-house whole-exome sequencing analysis of ADSP (6,155 cases, 5,418 controls); 9. An in-house whole-genome sequencing analysis of the 2021 ADSP release (3,584 cases, 2,949 controls); 10. The genome-wide meta-analysis by Bellenguez et al. 2022 (discovery phase; 85,934 clinically diagnosed/proxy AD cases and 401,577 controls).27–29,62–67 All studies focused on individuals with European ancestry.
Supplementary Material
Acknowledgements
This research was additionally supported by NIH/NIA award AG066206 (ZH), AG066515 (ZH), AG075238 (MEB), EB001988-21 (TH, JY), and by the Simons Foundation under award 814641 (ZC). We gratefully acknowledge the studies which provided summary statistics.
Footnotes
Code Availability
Our overall method is implemented in the software GhostKnockoffGWAS, which is available at https://github.com/biona001/GhostKnockoffGWAS. The high-dimensional Lasso regression for summary statistics data is implemented separately in ghostbasil, a standalone R package available at https://github.com/JamesYang007/ghostbasil. Pre-computed knockoff statistics from the European panel of Pan-UKB are freely available for download. Finally, scripts to reproduce the results in this paper can be accessed at https://github.com/biona001/ghostknockoff-gwas-reproducibility.
Competing interests
The authors declare no competing interests.
Data Availability
Summary statistics for Alzheimer’s disease: 1. The genome-wide survival association study performed by Huang et al. 2017 (https://www.niagads.org/datasets; NIAGADS ID: NG00058); 2. The genome-wide meta-analysis by Jansen et al. 2019 (ref. 44) (https://ctg.cncr.nl/software/summary_statistics); 3. The genome-wide meta-analysis by Kunkle et al. 2019 (NIAGADS ID: NG00075); 4. The genome-wide meta-analysis by Schwartzentruber et al. 2021 (https://www.ebi.ac.uk/gwas/; GWAS Catalog ID: GCST90012877); 5. In-house genome-wide association study imputed using the TOPMed reference panels (see Supplementary Table 3); 6-7. Two whole-exome sequencing analyses of data from ADSP by Bis et al. 2020 (NIAGADS ID: NG00065), and Le Guen et al. 2021 (NIAGADS ID: NG000112); 8. In-house whole-exome sequencing analysis of ADSP (NIAGADS ID: NG00067.v5); 9. In-house whole-genome sequencing analysis of ADSP (NIAGADS ID: NG00067.v5); 10. The genome-wide meta-analysis by Bellenguez et al. 2022 (GWAS Catalog ID: GCST90027158).
Summary statistics of the 400 GWAS: https://github.com/mikegloudemans/gwas-download.
Pan-UKBB reference panel: https://pan.ukbb.broadinstitute.org/.
The results of our analysis can be downloaded at https://github.com/biona001/ghostknockoff-gwas-reproducibility.
References
- 1. Nelson M. R. et al. The support of human genetic evidence for approved drug indications. Nat Genet 47, 856–860 (2015).
- 2. King E. A., Davis J. W. & Degner J. F. Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS Genet 15, e1008489 (2019).
- 3. Ochoa D. et al. Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nature Reviews Drug Discovery 21, (2022). doi: 10.1038/d41573-022-00120-3.
- 4. Schaid D. J., Chen W. & Larson N. B. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat Rev Genet 19, 491–504 (2018).
- 5. Manolio T. A. et al. Finding the missing heritability of complex diseases. Nature 461, (2009). doi: 10.1038/nature08494.
- 6. Boyle E. A., Li Y. I. & Pritchard J. K. An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169, 1177–1186 (2017).
- 7. Yang J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42, 565–569 (2010).
- 8. Wang G., Sarkar A., Carbonetto P. & Stephens M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J R Stat Soc Series B Stat Methodol 82, 1273–1300 (2020).
- 9. Benner C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, (2016).
- 10. Abell N. S. et al. Multiple causal variants underlie genetic associations in humans. Science 375, (2022).
- 11. Sesia M., Katsevich E., Bates S., Candès E. & Sabatti C. Multi-resolution localization of causal variants across the genome. Nat Commun 11, 1093 (2020).
- 12. Candès E., Fan Y., Janson L. & Lv J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J R Stat Soc Series B Stat Methodol 80, 551–577 (2018).
- 13. Katsevich E. & Sabatti C. Multilayer knockoff filter: Controlled variable selection at multiple resolutions. Ann Appl Stat 13, (2019).
- 14. He Z. et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat Commun 12, 3152 (2021).
- 15. He Z. et al. Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics. The American Journal of Human Genetics 108, 2336–2353 (2021).
- 16. Sesia M., Sabatti C. & Candès E. J. Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).
- 17. Sesia M., Bates S., Candès E., Marchini J. & Sabatti C. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences 118, e2105841118 (2021).
- 18. He Z. et al. GhostKnockoff inference empowers identification of putative causal variants in genome-wide association studies. Nat Commun 13, (2022).
- 19. Chen Z. et al. Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression. (2024).
- 20. Bates S., Sesia M., Sabatti C. & Candès E. Causal inference in genetic trio studies. Proc Natl Acad Sci U S A 117, (2020).
- 21. Li S., Sesia M., Romano Y., Candès E. & Sabatti C. Searching for consistent associations with a multi-environment knockoff filter. Biometrika (2021) doi: 10.1093/biomet/asab055.
- 22. Chu B. et al. Second-order group knockoffs with applications to GWAS. arXiv:2310.15069 (2023).
- 23. Rubin D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66, (1974).
- 24. Hernán M. A. & Robins J. M. Causal Inference: What If. (2020).
- 25. Qian J. et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK biobank. PLoS Genet 16, (2020).
- 26. Pan-UKB team. https://pan.ukbb.broadinstitute.org (2020).
- 27. Jansen I. E. et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat Genet 51, 404–413 (2019).
- 28. Schwartzentruber J. et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer’s disease risk genes. Nat Genet 53, 392–402 (2021).
- 29. Bellenguez C. et al. New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nat Genet 54, (2022).
- 30. Cooper Y. A. et al. Functional regulatory variants implicate distinct transcriptional networks in dementia. Science 377, (2022).
- 31. Corces M. R. et al. Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases. Nat Genet 52, (2020).
- 32. Gazal S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet 54, (2022).
- 33. Yengo L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700 000 individuals of European ancestry. Hum Mol Genet 27, (2018).
- 34. Xue A. et al. Genome-wide association analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes. Nat Commun 9, (2018).
- 35. Willer C. J. et al. Discovery and refinement of loci associated with lipid levels. Nat Genet 45, (2013).
- 36. Kemp J. P. et al. Identification of 153 new loci associated with heel bone mineral density and functional involvement of GPC6 in osteoporosis. Nat Genet 49, (2017).
- 37. Zhu Z. et al. Genetic overlap of chronic obstructive pulmonary disease and cardiovascular disease-related traits: A large-scale genome-wide cross-trait analysis. Respir Res 20, (2019).
- 38. Lee J. D., Sun D. L., Sun Y. & Taylor J. E. Exact post-selection inference, with application to the lasso. Ann Stat 44, (2016).
- 39. Yang Z. et al. CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses. Nat Genet 55, (2023).
- 40. Berisa T. & Pickrell J. K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, (2016).
- 41. Barber R. F., Candès E. J. & Samworth R. J. Robust inference with knockoffs. Ann Stat 48, (2020).
- 42. Gimenez J. R. & Zou J. Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization. (2018).
- 43. Lin D.-Y. & Sullivan P. F. Meta-Analysis of Genome-wide Association Studies with Overlapping Subjects. The American Journal of Human Genetics 85, 862–872 (2009).
- 44. Willer C. J., Li Y. & Abecasis G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
- 45. Rubin D. B. & Thomas N. Combining propensity score matching with additional adjustments for prognostic covariates. J Am Stat Assoc 95, (2000).
- 46. Rubin D. B. Matching to Remove Bias in Observational Studies. Biometrics 29, (1973).
- 47. Ben-Michael E., Feller A. & Rothstein J. The Augmented Synthetic Control Method. J Am Stat Assoc 116, (2021).
- 48. Abadie A. & L’Hour J. A Penalized Synthetic Control Estimator for Disaggregated Data. J Am Stat Assoc 116, (2021).
- 49. Bang H. & Robins J. M. Doubly robust estimation in missing data and causal inference models. Biometrics 61, (2005).
- 50. Robins J. M., Rotnitzky A. & Zhao L. P. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89, (1994).
- 51. Rosenbaum P. R. & Rubin D. B. The central role of the propensity score in observational studies for causal effects. Biometrika 70, (1983).
- 52. van der Laan M. J. & Rubin D. Targeted maximum likelihood learning. International Journal of Biostatistics 2, (2006).
- 53. Rose S. & van der Laan M. J. Targeted Learning: Causal Inference for Observational and Experimental Data. (2011).
- 54. Cefalu M., Dominici F., Arvold N. & Parmigiani G. Model averaged double robust estimation. Biometrics 73, (2017).
- 55. Reifeis S. A., Hudgens M. G., Civelek M., Mohlke K. L. & Love M. I. Assessing exposure effects on gene expression. Genet Epidemiol 44, (2020).
- 56. Hejazi N. S., Boileau P., van der Laan M. J. & Hubbard A. E. A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology. Stat Methods Med Res 32, (2023).
- 57. Boileau P., Qi N. T., Van Der Laan M. J., Dudoit S. & Leng N. A flexible approach for predictive biomarker discovery. Biostatistics 24, (2023).
- 58. Funk M. J. et al. Doubly robust estimation of causal effects. Am J Epidemiol 173, (2011).
- 59. Zhu H. & Zhou X. Transcriptome-wide association studies: a view from Mendelian randomization. Quantitative Biology 1–15 (2020) doi: 10.1007/S40484-020-0207-4.
- 60. Porcu E. et al. Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nature Communications 10, 1–12 (2019).
- 61. Yuan Z. et al. Testing and controlling for horizontal pleiotropy with probabilistic Mendelian randomization in transcriptome-wide association studies. Nature Communications 11, 1–14 (2020).
- 62. Huang K. et al. A common haplotype lowers PU.1 expression in myeloid cells and delays onset of Alzheimer’s disease. Nat Neurosci 20, 1052–1061 (2017).
- 63. Kunkle B. W. et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat Genet 51, 414–430 (2019).
- 64. Bis J. C. et al. Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation. Mol Psychiatry 25, 1859–1875 (2020).
- 65. Belloy M. E. et al. Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping. Alzheimers Res Ther 14, (2022).
- 66. Le Guen Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers Res Ther 13, 72 (2021).
- 67. Belloy M. E. et al. A fast and robust strategy to remove variant-level artifacts in Alzheimer disease sequencing project data. Neurol Genet 8, (2022).