Abstract
Multi-ancestry statistical fine-mapping of cis-molecular quantitative trait loci (cis-molQTL) aims to improve the precision of distinguishing causal cis-molQTLs from tagging variants. Here, we present the Sum of Shared Single Effects (SuShiE) model, which leverages linkage disequilibrium heterogeneity to improve fine-mapping precision, infer cross-ancestry effect size correlations, and estimate ancestry-specific expression prediction weights. Through extensive simulations, we find SuShiE consistently outperforms existing methods. We apply SuShiE to 36,907 molecular phenotypes including mRNA expression and protein levels from individuals of diverse ancestries in the TOPMed MESA and GENOA studies. SuShiE fine-maps cis-molQTLs for 18.2% more genes compared with existing methods while prioritizing fewer variants and exhibiting greater functional enrichment. While SuShiE infers highly consistent cis-molQTL architectures across ancestries, it finds evidence of heterogeneity at genes with predicted loss-of-function intolerance. Lastly, using SuShiE-derived cis-molQTL effect-sizes, we perform TWAS and PWAS on six white blood cell-related traits in the AOU Biobank, and identify 25.4% more genes compared with existing methods. Overall, SuShiE provides new insights into the cis-genetic architecture of molecular traits.
Introduction
Characterizing the functional consequences of genetic variation remains a crucial task in deciphering the mechanisms underlying complex disease risk1. To this end, cis-molecular quantitative trait loci (cis-molQTL) mapping seeks to identify genetic variants associated with genomically proximal molecular features measured across diverse cellular, tissue, and environmental contexts2. However, due to linkage disequilibrium (LD), it is challenging to distinguish causal cis-molQTLs from tagging variants within a genomic region. Statistical fine-mapping aims to resolve precisely this issue3, yet pervasive LD signals limit the resolution of these approaches. Previous efforts have demonstrated that leveraging the heterogeneity of LD and minor allele frequency (MAF) across diverse ancestries improves the precision of statistical fine-mapping and therefore enhances our biological understanding of molecular traits4.
While existing multi-ancestry fine-mapping frameworks have been proposed for the analysis of complex traits5–9, they have several limitations in the context of large-scale cis-molQTL data. First, many approaches do not model the correlation of causal variant effect sizes across ancestries or assume that they are a-priori independent across ancestries, which fails to reflect shared or similar genetic architectures5–7. Second, some approaches scale poorly, which precludes their application to thousands of molecular traits commonly measured in cis-molQTL studies5,6,9. Third, several approaches lack ancestry-specific effect size estimates5–7, which neglects their potential use in post-Genome-wide Association Studies (GWASs) frameworks (e.g., Transcriptome- and Proteome-wide Association Studies; TWASs and PWASs)10,11. Lastly, while recent approaches7–9 address some of these limitations, there is a need for a fine-mapping method that comprehensively addresses each of these limitations.
Here, we describe the Sum Of Shared Single Effects (SuShiE) approach to fine-map genetic variants shared across diverse ancestries for thousands of molecular traits. SuShiE integrates genotypic and molecular data or summary statistics from multiple ancestries to identify cis-molQTLs while modeling and learning the covariance structures of shared/non-shared signals. SuShiE leverages four key insights. First, SuShiE improves fine-mapping precision of the shared cis-molQTLs by leveraging LD across different ancestries. Second, it estimates ancestry-specific effect sizes at shared cis-molQTLs. Third, it infers the prior effect size correlation across ancestries to shed light on genetic similarities and differences. Lastly, SuShiE can analyze individual-level or summary statistics directly and is implemented using a scalable variational inference algorithm that runs seamlessly on CPUs, GPUs, or tensor-processing units (TPUs).
Through extensive simulations, we show that SuShiE outputs higher posterior inclusion probabilities (PIPs) at causal cis-molQTLs, outputs smaller credible set sizes, and exhibits better calibration compared with existing methods7–9. We apply SuShiE to 36,907 molecular phenotypes including mRNA expression and protein levels measured in diverse ancestries from the TOPMed-MESA12,13 (nmRNA=956 and nprotein=854) and GENOA14 (nmRNA=814) studies. SuShiE fine-maps cis-molQTLs for 3,068 (18.2%) more genes with smaller credible sets and greater enrichment in relevant functional annotations compared with existing methods. In addition, SuShiE infers a shared genetic architecture of cis-molQTLs in significantly heritable genes and shows that effect-size heterogeneity across ancestries is associated with multiple measures of loss-of-function (LOF) intolerance. Last, we integrate ancestry-specific cis-molQTL effects inferred by SuShiE with six white blood cell-related traits to perform individual-level TWAS and PWAS in the All of Us Biobank (average n=86,345)15 and observe that SuShiE-based prediction models identify on average 39.5 (25.4%) additional associated genes compared with existing approaches. Overall, our approach sheds light on the cis-genetic architecture of molecular data across multiple ancestries.
Results
SuShiE overview
Here, we briefly introduce the SuShiE model (for a detailed description, see Methods and Supplementary Note 1). SuShiE assumes cis-molQTLs are present in all ancestries (defined as shared cis-molQTLs) while allowing for effect sizes at causal cis-molQTLs to covary across ancestries a-priori, in contrast to previous multi-ancestry approaches5–7. These assumptions provide enough flexibility to model a variety of cis-genetic architectures across ancestries, including cases when effects are present only in a subset of ancestries. For instance, when effects are observed only in a subset of ancestries, prior variances can be shrunk towards zero to effectively allow for ancestry-specific causal cis-molQTLs.
Focusing on the $i$-th out of $k$ ancestries, SuShiE models the normalized levels of a molecular trait $y_i$ measured in $n_i$ individuals as a linear combination of genotyped variants $X_i$ as
$$y_i = X_i \beta_i + \epsilon_i = X_i \sum_{l=1}^{L} \gamma_l \, b_{i,l} + \epsilon_i,$$
where $L$ is the number of shared effects, $\gamma_l$ is a $p \times 1$ binary vector selecting the causal cis-molQTL shared across ancestries, $b_{i,l}$ is the effect size in the $i$-th ancestry, and $\epsilon_i$ is environmental noise distributed as $N(0, \sigma_i^2 I_{n_i})$ (Figure 1a). Following previous work3,16, we place a $\mathrm{Multi}(1, \pi)$ prior over $\gamma_l$, where $\pi$ is a $p \times 1$ vector representing the prior probability for each SNP to be a shared cis-molQTL; however, unlike existing approaches5–7, we organize ancestry-specific effect sizes $b_l = (b_{1,l}, \dots, b_{k,l})^{\top}$ under a multivariate normal prior $N(0, C_l)$, where $C_l$ is the $k \times k$ prior effect size covariance matrix. To perform scalable inference, we use a variational Bayesian approach and compute, for each of the $L$ shared effects, the posterior probability of a shared causal cis-molQTL ($\alpha_l$), the ancestry-specific posterior effect sizes, and covariances, in addition to prior effect-size correlations (Figure 1b) inferred through a procedure analogous to Empirical Bayes. Through learning prior effect-size correlations, SuShiE can quantify genes’ heterogeneity in cis-molQTL effects across ancestries.
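As an illustration only, the generative model described above can be sketched in a few lines of NumPy. This is not the SuShiE implementation: the function name `simulate_shared_effects`, its arguments, and the choice to draw distinct causal SNPs under a uniform prior are assumptions made for the example, and a single covariance matrix is reused for every shared effect.

```python
import numpy as np

def simulate_shared_effects(X_list, L, C, noise_sd, rng):
    """Simulate traits for k ancestries from L shared single effects.

    X_list   : list of k genotype matrices, each (n_i, p), over the same p SNPs
    C        : (k, k) prior effect-size covariance across ancestries (assumed
               identical for all L effects in this sketch)
    noise_sd : length-k residual standard deviations
    """
    k = len(X_list)
    p = X_list[0].shape[1]
    # Simplification: choose L distinct causal SNPs uniformly at random.
    causal = rng.choice(p, size=L, replace=False)
    # Ancestry-specific effect sizes per shared effect, drawn from N(0, C).
    b = rng.multivariate_normal(np.zeros(k), C, size=L)  # shape (L, k)
    y_list = []
    for i, X in enumerate(X_list):
        beta_i = np.zeros(p)
        beta_i[causal] = b[:, i]          # same SNPs, ancestry-specific sizes
        eps = rng.normal(0.0, noise_sd[i], size=X.shape[0])
        y_list.append(X @ beta_i + eps)
    return causal, b, y_list

rng = np.random.default_rng(0)
# Toy "genotypes": standard-normal columns stand in for standardized dosages.
X_list = [rng.normal(size=(400, 50)), rng.normal(size=(300, 50))]
C = np.array([[1.0, 0.8], [0.8, 1.0]])    # effect-size correlation of 0.8
causal, b, y_list = simulate_shared_effects(
    X_list, L=2, C=C, noise_sd=[1.0, 1.0], rng=rng
)
```

Setting a diagonal entry of `C` close to zero makes the corresponding ancestry's effects vanish, which mirrors how shrinking a prior variance can accommodate ancestry-specific signals.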
Figure 1: SuShiE infers PIPs, credible sets, and ancestry-specific effect sizes by leveraging shared genetic architectures and LD heterogeneity.

a) SuShiE takes individual-level phenotypic and genotypic data as input and assumes the shared cis-molQTL effects as a linear combination of single effects.
b) For each single shared effect, SuShiE assumes that the cis-molQTL effect sizes follow a multivariate normal prior distribution with a covariance matrix, and that the probability for each SNP to be a cis-molQTL follows a uniform prior distribution; through inference, SuShiE outputs a credible set that includes putative causal cis-molQTLs, learns the effect-size covariance prior, and estimates the ancestry-specific effect sizes.
SuShiE constructs a 95% credible set for each of the effects along with a posterior inclusion probability (PIP) for each SNP to be a putative causal cis-molQTL (Methods). SuShiE is implemented as open-source command-line Python software that uses JAX with just-in-time compilation to run inference on CPUs, GPUs, or TPUs, and is available at https://github.com/mancusolab/sushie.
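For readers unfamiliar with these quantities, the sketch below shows one common way to derive PIPs and credible sets from per-effect posterior probabilities. It assumes the convention of the sum-of-single-effects framework (PIP as one minus the product of per-effect exclusion probabilities, and the smallest set reaching the target coverage); the function names are hypothetical, and the purity pruning mentioned later in the text is not included.

```python
import numpy as np

def pips_from_alpha(alpha):
    """alpha: (L, p) per-effect posterior probabilities; each row sums to 1."""
    # A SNP is included if at least one of the L effects selects it.
    return 1.0 - np.prod(1.0 - alpha, axis=0)

def credible_set(alpha_l, coverage=0.95):
    """Smallest set of SNP indices whose posterior mass reaches `coverage`."""
    order = np.argsort(alpha_l)[::-1]          # most probable SNPs first
    cum = np.cumsum(alpha_l[order])
    n_keep = int(np.searchsorted(cum, coverage) + 1)
    return order[:n_keep]

alpha = np.array([
    [0.60, 0.30, 0.06, 0.04],   # effect 1: mass spread over SNPs in LD
    [0.01, 0.01, 0.97, 0.01],   # effect 2: one SNP dominates
])
pips = pips_from_alpha(alpha)
cs1 = credible_set(alpha[0])    # needs three SNPs to reach 95%
cs2 = credible_set(alpha[1])    # a single SNP suffices
```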
In addition, we note that several multi-ancestry methods leveraging the sum of single effect framework3,16 have been proposed, including SuSiEx7, MESuSiE8, and XMAP9. We list the similarities and differences in terms of study design, model assumptions, and software functionality in Supplementary Table 1, and showcase performance comparisons using both simulation and real data in later sections.
SuShiE outperforms other methods in simulations
First, to recapitulate the benefits of multi-ancestry study design, we performed simulations varying the number of contributing ancestries under a fixed total sample size (Methods). As the number of ancestries increased, SuShiE produced higher PIPs at causal cis-molQTLs, smaller credible set sizes, and better calibration (P < 1.94e-11 for all tests; Supplementary Figure 1), reaffirming that increasing genetic diversity refines fine-mapping results compared with expanding the sample size of a single ancestry. Next, we evaluated the performance of SuShiE in simulations by varying different parameters and compared against seven methods: SuShiE-Indep (i.e., SuShiE assuming no a-priori correlation of effect sizes across ancestries), meta-SuSiE (i.e., a meta-analysis on single-ancestry SuSiE), SuSiE (i.e., SuSiE performed over data aggregated across ancestries), SuSiEx7, MESuSiE8, XMAP9, and XMAP-IND (individual-level data as input; Methods). First, we observed that inferences for SuShiE, SuShiE-Indep, meta-SuSiE, SuSiE, SuSiEx, and MESuSiE successfully converged and output fine-mapping results for nearly all simulations whereas XMAP-methods completed in only 25.8% of simulations likely due to strong model misspecification in the cis-molQTL regime. Indeed, XMAP-methods are designed for GWAS fine-mapping and require pre-computed genome-wide polygenic effect covariance structure and stratification bias estimates (Supplementary Table 2).
For all simulations, SuShiE output higher PIPs at causal cis-molQTLs (0.10 higher on average; P<1.00e-300; Figure 2a, Supplementary Figure 2), smaller credible set sizes (0.14 SNPs smaller on average; P=2.96e-22; Figure 2b, Supplementary Figure 3), and better calibration (8.15% higher on average; P<1.00e-300; Figure 2c, Supplementary Figure 4) compared to all other methods. We noted that SuShiE produced larger credible sets on average compared to XMAP and XMAP-IND (P=2.19e-12; Figure 2b, Supplementary Figure 3); however, we found similar sizes when restricting to the ~18.5% of cases where both SuShiE and XMAP-methods produced credible sets (P=0.60). In addition, SuShiE similarly outperformed all other methods under simulations with differential sample sizes (Supplementary Figure 5) and cis-SNP heritabilities across ancestries (Supplementary Figure 6). Next, we evaluated SuShiE’s performance on genes with different degrees of LD diversity across ancestries, where LD diversity was defined by comparing variance and mean of LD scores across ancestries, respectively (Methods). We found that as the LD diversity increased, SuShiE output smaller credible set sizes (P=0.02 and 0.09 for Levene’s test and ANOVA, respectively), similar PIPs of causal cis-molQTLs (P=0.79 and 0.35), and similar calibration (P=0.37 and 0.12), suggesting better precision when fine-mapping genes with high LD diversity. Next, we evaluated the ability of SuShiE to infer prior effect size correlations from data (Methods). On average, SuShiE accurately estimated primary effect size correlations (Figure 2d), with non-primary (L>1) effects having diminishing accuracies. This result was likely due to decreasing statistical power, as evidenced by simulations under increased sample sizes (Supplementary Figure 7).
Furthermore, among methods capable of inferring the cis-molQTL effect size correlation prior, XMAP frequently produced biased estimates in the cis-molQTL regime, and MESuSiE performed comparably to SuShiE (Figure 2d), highlighting SuShiE’s ability to investigate molecular heterogeneity across ancestries. We also conducted additional comparisons using alternative metrics (Supplementary Note 2; Supplementary Figures 8–16) and found that SuShiE either outperformed other methods or performed comparably, consistent with our primary metrics. In addition, we assessed SuShiE’s performance under various model misspecifications and found it to be robust and consistently outperform the existing approaches (Supplementary Note 2; Supplementary Figures 17–19).
Figure 2: SuShiE outperforms other methods, accurately estimates effect-size correlations, and boosts TWAS power in realistic simulations.

a-c) SuShiE outputs higher posterior inclusion probabilities (PIPs; a; 0.09 on average; P=3.65e-233), smaller credible set sizes (b; 0.15 on average; P=2.02e-12), and higher frequency of cis-molQTLs in the credible sets (calibration; c; 7.6% on average; P=4.59e-126) compared to other methods.
d) SuShiE accurately estimates the true effect-size correlation across ancestries while XMAP and XMAP-IND frequently produce biased estimates.
e-f) SuShiE outputs higher ancestry-specific prediction accuracy (P=1.39e-144) and induces higher TWAS power (P=1.98e-285) compared against other methods at a fixed sample size. The plots aggregate results across two ancestries.
By default for (a)-(f), the simulation assumes that there are 2 causal cis-molQTLs, the per-ancestry training sample size is 400, the testing sample size is 200, the cis-SNP heritability is 0.05, the effect size correlation is 0.8 across ancestries, and the proportion of cis-SNP heritability of the complex trait explained by gene expression is 1.5e-4. P values are two-sided, not adjusted for multiple testing, and calculated using meta-analysis across all comparisons (Methods). The points are the means across simulations, and the error bars are their corresponding 95% confidence intervals.
Next, we evaluated the use of SuShiE-derived ancestry-specific effect sizes in cis-molQTL data as a means to predict the genetic component of gene expression for downstream TWAS10,11. Briefly, we performed simulations under a model in which gene expression mediates disease risk and compared SuShiE predictions with commonly used approaches for prediction-based TWAS (e.g., LASSO17, Elastic Net18, and gBLUP19) as well as other available multi-ancestry fine-mapping methods (e.g., MESuSiE, XMAP, and XMAP-IND) to identify susceptibility genes (Methods). Overall, SuShiE-derived prediction models more accurately recapitulated gene expression levels compared with existing approaches and exhibited higher statistical power for TWAS with various study sample sizes and proportion of trait heritability mediated by gene expression (Figures 2e–f, Supplementary Figure 20).
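To make the prediction-based TWAS idea concrete, the following sketch predicts expression from genotypes with a weight vector and tests the predicted expression against a trait using simple linear regression. It is a simplified illustration rather than the procedure used in the paper: `twas_z` is a hypothetical helper, covariates and ancestry matching are ignored, and the simulated genotypes and weights are toy values.

```python
import numpy as np

def twas_z(genotypes, weights, trait):
    """z-statistic for association between predicted expression and a trait."""
    pred = genotypes @ weights               # genetically predicted expression
    pred = pred - pred.mean()
    y = trait - trait.mean()
    slope = (pred @ y) / (pred @ pred)       # simple linear regression slope
    resid = y - slope * pred
    dof = len(y) - 2                         # intercept and slope estimated
    se = np.sqrt((resid @ resid) / dof / (pred @ pred))
    return slope / se

rng = np.random.default_rng(1)
G = rng.normal(size=(1000, 10))              # toy standardized genotypes
w = np.zeros(10)
w[3] = 0.5                                   # one SNP carries the eQTL effect
expr_component = G @ w
z_pos = twas_z(G, w, expr_component + rng.normal(size=1000))
z_neg = twas_z(G, w, -expr_component + rng.normal(size=1000))
```

In this toy setting the sign of the statistic follows the direction in which predicted expression relates to the trait, and more accurate weights yield larger absolute statistics at a fixed sample size.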
Lastly, we extended SuShiE to analyze cis-molQTL summary statistics (e.g., marginal z-scores) directly (Supplementary Note 1). We compared the results of SuShiE between individual-level and summary-level versions and found highly consistent outcomes (i.e., PIPs of causal cis-molQTLs; ), thus validating our algorithm and implementation.
Overall, SuShiE outperforms existing approaches in realistic cis-molQTL settings, remains robust under model misspecifications, and improves statistical power in post-GWAS integrative analyses.
SuShiE identifies functionally relevant cis-molQTL signals
Having demonstrated that SuShiE outperforms other methods under realistic simulations, we next sought to perform fine-mapping on 36,907 molecular phenotypes from diverse ancestries. Specifically, from the Trans-Omics for Precision Medicine program Multi-Ethnic Study of Atherosclerosis12,13 (TOPMed-MESA), we analyzed mRNA expression data of 21,743 genes measured in peripheral blood mononuclear cells (PBMCs; visit-1; n=956) and protein abundance data of 1,274 genes measured in plasma (visit-1; n=854) for American European, African, and Hispanic ancestries (EUR, AFR, and HIS), together with mRNA expression data of 13,890 genes measured in lymphoblastoid cell lines (LCLs; n=814) for EUR and AFR from the Genetic Epidemiology Network of Arteriopathy study14 (GENOA; Methods; Supplementary Table 3). We also applied SuShiE-Indep, Meta-SuSiE, SuSiE, SuSiEx, and MESuSiE to these datasets for comparison with SuShiE. We excluded XMAP and XMAP-IND from comparisons, due to previously noted limitations for cis-molQTL data (e.g., XMAP completing only ~4% of genes, and XMAP-IND requiring < 3 ancestries).
Focusing on 1Mb windows for each gene (i.e., cis-region), SuShiE fine-mapped cis-molQTLs for 19,881 phenotypes (e/pGenes), representing an average increase of 3,068 (18.2%) compared with existing methods (P<2.33e-67 for all tests; Methods). For example, SuShiE fine-mapped 26.5% more e/pGenes compared to single-ancestry SuSiE followed by meta-analysis (i.e., Meta-SuSiE; P=1.66e-208), again highlighting the benefit of multi-ancestry study design. SuShiE-based credible sets maintained higher average PIPs (~0.04 higher on average) and higher frequency of cis-molQTLs with PIPs > 0.95 (~1.9% higher on average), as well as smaller credible sets in most cases (~5.84 SNPs smaller on average; Supplementary Table 4). We found the performance advantage slightly diminished in TOPMed-MESA protein and GENOA mRNA datasets, likely due to lower statistical power. Using the number of credible sets identified after purity pruning (Methods), SuShiE estimated most (90.9%) molecular phenotypes to exhibit 1–3 cis-molQTL signals (Figure 3a) with PIPs localizing near the transcription start site (TSS; Figure 3b), consistent with previous studies20–22.
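The localization of PIPs around the TSS can be summarized by averaging PIPs within fixed-width bins, as in the 500-bp binning reported for Figure 3b. The sketch below is one way to do this under stated assumptions: the window is taken to be centered on the TSS, positions are integer base-pair coordinates, and the function name is hypothetical.

```python
import numpy as np

def mean_pip_by_tss_bin(pos, pips, tss, window=1_000_000, bin_size=500):
    """Average PIP per bin across a window assumed to be centered on the TSS."""
    n_bins = window // bin_size               # 2,000 bins for the defaults
    offset = pos - tss + window // 2          # shift so the window starts at 0
    keep = (offset >= 0) & (offset < window)  # drop SNPs outside the window
    idx = offset[keep] // bin_size
    sums = np.bincount(idx, weights=pips[keep], minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    out = np.full(n_bins, np.nan)             # bins without SNPs stay NaN
    has_snp = counts > 0
    out[has_snp] = sums[has_snp] / counts[has_snp]
    return out

pos = np.array([1_000_100, 1_000_200, 999_900])
pips = np.array([0.9, 0.5, 0.1])
binned = mean_pip_by_tss_bin(pos, pips, tss=1_000_000)
```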
Figure 3: SuShiE reveals cis-regulatory mechanisms for mRNA and protein expression.

a) SuShiE identifies cis-molQTLs for 13,818, 515, and 5,548 genes, of which 89.1%, 86.8%, and 96.0% contain 1–3 cis-molQTLs, for the TOPMed-MESA mRNA, TOPMed-MESA protein, and GENOA mRNA datasets, respectively.
b) Posterior inclusion probabilities (PIPs) of cis-molQTLs inferred by SuShiE are mainly enriched around the TSS region of genes. We group SNPs into 500-bp-long bins and compute their PIP average. There are 2,000 bins to cover a one-million-bp-long genomic window around the genes’ TSS.
c) Across all three studies, cis-molQTLs identified by SuShiE are enriched in four out of five candidate cis-regulatory elements (cCREs) from ENCODE23, with the promoter (PLS) as the most enriched category. Specifically, the mRNA expression from TOPMed-MESA and GENOA shows enrichment in the promoter, proximal enhancer (pELS), CTCF, and distal enhancer (dELS) but depletion in DNase-H3K4me3. Protein abundance from TOPMed-MESA shows enrichment in PLS and pELS but non-significant enrichment in CTCF and dELS because of the low number of genes identified with pQTLs (n=515). The points are meta-analyzed log-enrichment across genes and the error bars are their corresponding 95% confidence intervals.
To characterize the regulatory function of identified cis-molQTL signals, we performed enrichment analysis using PIPs with 89 genomic functional annotations (Methods). We observed that PIPs inferred by SuShiE were enriched in 83/89 annotations across all three datasets, with the highest enrichment occurring in promoter regions (Supplementary Table 5). For example, PIPs were enriched in 4/5 candidate cis-regulatory elements (cCREs) from ENCODE Registry v323 (Figure 3c) and in all 10 cell-type/tissue-specific cCREs using single-nucleus(sn) or single-cell(sc) ATAC-Seq24,25 (Supplementary Figure 21). Importantly, PIPs inferred by SuShiE were more enriched across functional annotations compared with those computed from existing fine-mapping methods (all P<8.13e-3; Supplementary Table 6), highlighting SuShiE’s ability to better prioritize functionally relevant cis-molQTLs. In addition, we investigated how potential regulatory function may differ among cis-molQTLs contributing to the same gene using per-effect posterior probabilities (), rather than PIPs, and found consistent results (Supplementary Note 2; Supplementary Figures 22–24; Supplementary Table 7). Last, we validated our fine-mapping results by applying SuShiE on molecular phenotypes from three independent datasets from TOPMed-MESA, GEUVADIS26, and INTERVAL27 studies and found SuShiE exhibited better validation performance compared to the existing methods (Supplementary Note 2; Supplementary Tables 8–9).
Overall, by jointly modeling multi-ancestry data, SuShiE identifies additional cis-regulatory mechanisms for molecular traits.
SuShiE identifies a putative cis-eQTL for URGCP
Here, we showcase a putative cis-eQTL for URGCP, a gene on chromosome 7 that has been implicated in tumor growth and progression28. SuShiE fine-mapped a single SNP in TOPMed-MESA mRNA (rs2528382; GRCh38: 7:43926148; PIP=0.96; Figure 4a; Supplementary Table 10), while alternative multi-ancestry fine-mapping methods (i.e., SuSiEx, MESuSiE, and XMAP) failed to identify any cis-eQTLs for this gene. Importantly, SuShiE validated rs2528382 in TOPMed-MESA visit-5 mRNA data (PIP=0.96). We found rs2528382 was reported as significant in whole blood eQTL data from the eQTLGen Consortium21 (P=2.46e-70; EUR-based), the Study of African Americans, Asthma, Genes, and Environments29 (SAGE; P=3.50e-6; AFR-based), and the Genes-Environments and Admixture in Latino Asthmatics (GALA II; P=1.42e-9; HIS-based) study29, further supporting its role in regulating URGCP expression levels. Investigating the functional consequences using genomic annotations, we found rs2528382 represents a non-coding exon variant within the 5’ UTR and localizes within a proximal enhancer (pELS) region, as evidenced by strong signals of H3K27ac in PBMCs23 within 2kb of the TSS (262bp; Figure 4b). Lastly, through snATAC-seq24 and scATAC-seq25, we found rs2528382 localizes within an open chromatin region measured in different cell types, such as PBMCs, naïve T cells, naïve B cells, cytotoxic NK cells, and monocytes. Altogether, these results suggest that rs2528382 regulates URGCP expression levels in PBMCs through disruption of regulatory activity. We present a second example in Supplementary Note 2, where SuShiE identified a putative cis-eQTL (rs1059307) for the SNHG5 gene (Supplementary Figure 25; Supplementary Table 11), whereas other multi-ancestry methods similarly failed to identify any cis-eQTLs.
Figure 4: SuShiE identifies eQTL rs2528382 for URGCP with functional support.

a) LD patterns of the region across EUR, AFR, and HIS. The blue color indicates LD scales (), the red color indicates the prioritized SNP location, and the green color indicates SNPs’ LD with the prioritized SNP. Manhattan plot of cis-eQTL scans of URGCP (denoted in orange) for each ancestry with SuShiE fine-mapping results. SuShiE is the only method to output credible sets for URGCP and prioritize a single SNP (rs2528382; denoted in red).
b) Functional annotations at the URGCP locus show colocalization of active enhancer activity and chromatin accessibility with rs2528382. H3K27ac ChIP-seq peaks are measured in PBMCs (intensity denoted in blue), 0/1 accessibility annotations determined from scATAC-seq are measured in PBMCs, and 0/1 accessibility annotations determined from snATAC-seq are measured in naïve T cells, naïve B cells, cytotoxic NK (cNK) cells, and monocytes. Blue rectangles denote putative cCREs called from sc/snATAC-seq data that colocalize with rs2528382 (gray denotes no colocalization).
SuShiE reveals heterogeneity at LOF intolerant genes
After validating cis-molQTLs identified by SuShiE, we next sought to characterize genetic architectures of molecular traits across ancestries. First, we computed cis-SNP heritability for all e/pGenes in each ancestry and observed that 89.4% of genes were significantly heritable in at least one ancestry across studies (Supplementary Figure 26), with highly correlated heritability estimates across ancestries (Supplementary Figure 27). Next, using SuShiE-derived estimates of cis-molQTL correlation across ancestries (Methods), we found highly consistent effect-size correlations on average (0.83, 0.88, and 0.90 for EUR-AFR, EUR-HIS, and AFR-HIS, respectively), which further increased when focusing on genes whose heritabilities are significant in all ancestries (0.94, 0.98, and 0.99, respectively; 9,822 genes; 49.4%; Supplementary Figures 28–29). Altogether, these results further affirm previous results30–34 demonstrating primarily shared genetic architectures for molecular traits across ancestries.
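As a small worked example of how a pairwise effect-size correlation relates to an effect-size covariance matrix, the sketch below rescales a covariance matrix by its standard deviations. The helper name and the numerical values are illustrative assumptions and are not estimates from the paper; how SuShiE aggregates such correlations across effects is described in Methods.

```python
import numpy as np

def cov_to_corr(C):
    """Convert a (k, k) effect-size covariance matrix to a correlation matrix."""
    sd = np.sqrt(np.diag(C))
    # The ratio is undefined when a variance is zero; such entries become NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return C / np.outer(sd, sd)

# Toy two-ancestry covariance: variances 0.04 and 0.01, covariance 0.016.
C = np.array([[0.040, 0.016],
              [0.016, 0.010]])
R = cov_to_corr(C)   # off-diagonal: 0.016 / (0.2 * 0.1) = 0.8
```

Because the correlation divides by both standard deviations, it becomes unstable when a prior variance is shrunk toward zero, which is one reason results based on covariances can be a useful complement.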
Despite this evidence, we observed a long tail of heterogeneous effect sizes (i.e., SuShiE-estimated effect size correlation <1), suggesting the presence of ancestry-specific cis-molQTL effects (Supplementary Figure 30), which is consistent with previous multi-ancestry cis-molQTL studies29,35. To characterize this apparent heterogeneity across ancestries, we correlated the estimated correlation signals with multiple measures of constraint (pLI36, LOEUF37, EDS38, RVIS39, and shet40) and found highly significant associations (Table 1; Methods). Overall, genes with lower effect-size correlations across ancestries exhibited higher intolerance to loss-of-function mutations on average. For example, using the TOPMed-MESA mRNA dataset, we observed an average cis-molQTL effect size correlation of 0.89 (when ; SE=0.02) between EUR and AFR individuals at genes that exhibited pLI <0.1, which decreased to 0.84 (when ; SE=0.01) when focusing on genes with pLI >0.9. Genes with high constraint exhibited lower estimates of cis-SNP heritability on average (Supplementary Table 12), which may result in apparent heterogeneity arising from low statistical power. Given this, we re-analyzed putative relationships using estimated covariances, only primary signals (L=1), and bootstrapped standard errors and found broadly consistent results (Table 1). In addition, we observed that our results were robust to adjusting for Wright’s fixation index (Fst; Table 1; Methods), suggesting heterogeneity/constraint associations are not driven solely by allele frequency differences across ancestries.
Table 1: Across-ancestry cis-molQTL effect size correlations are negatively associated with gene constraint scores.
Estimates and corresponding P values from linear regression framework testing associations between inferred effect size correlations across ancestries and constraint scores (Methods). “Bootstrap SE” refers to re-estimating standard error using bootstrapping. “Primary Effect” considers only estimates from L=1. “Effect Covariance” replaces the estimated correlation with the estimated effect size covariance across ancestries. “Adjusted Fst” refers to additional adjustment for Fst in the base model. Higher values of pLI, shet, and EDS indicate stronger constraint, while lower values of LOEUF and RVIS suggest greater constraint. The reported P values are one-sided and not adjusted for multiple testing.
| | pLI | LOEUF | shet | RVIS | EDS |
|---|---|---|---|---|---|
| Base Model | −0.018 (4.49e–19) | 0.018 (4.02e–12) | −0.004 (1.96e–22) | 0.046 (1.86e–13) | −0.002 (2.75e–02) |
| Bootstrap SE | −0.018 (2.79e–23) | 0.018 (4.01e–12) | −0.004 (2.12e–21) | 0.046 (2.49e–16) | −0.002 (1.23e–02) |
| Primary Effect | −0.032 (2.25e–18) | 0.027 (9.62e–10) | −0.008 (6.63e–22) | 0.044 (1.04e–05) | −0.003 (3.93e–02) |
| Effect Covariance | −0.320 (2.82e–158) | 0.322 (2.73e–101) | −0.064 (2.94e–125) | 0.516 (3.62e–44) | −0.049 (4.98e–22) |
| Adjusted Fst | −0.018 (1.90e–18) | 0.017 (7.17e–12) | −0.004 (1.01e–21) | 0.045 (5.87e–13) | −0.002 (2.33e–02) |
To investigate the relationship between cis-molQTLs identified by SuShiE and gene constraint, we first observed inverse associations between the number of fine-mapped cis-molQTLs per gene and constraint (Supplementary Figure 31), consistent with several previous studies showing the depletion of cis-molQTLs for high constraint genes22,34,38. We also observed positive associations between expected cis-molQTLs’ distance to TSS and constraint, affirming previous results that high constraint genes tended to have more complex regulatory regions22,38 (Supplementary Figure 32; Methods). In addition, we correlated gene enrichment scores from ENCODE23 cCREs with constraint scores. We found that putative causal cis-molQTLs for high constraint genes tended to be enriched for distal enhancers (dELS) and depleted for promoter (PLS) and proximal enhancers (pELS) compared with weakly constrained genes, consistent with several previous studies22,38 (Supplementary Figure 33). We found these associations remained significant after accounting for Fst, suggesting average allele frequency differences across ancestries cannot solely explain the observed heterogeneity (Supplementary Figures 31–33).
Overall, SuShiE recapitulates the finding of primarily shared genetic architectures of molecular traits and shows that effect size heterogeneity is associated with gene LOF intolerance.
SuShiE improves T/PWAS power in white blood cell traits
Lastly, to showcase the downstream benefits of SuShiE, we performed TWAS and PWAS10,11 on six white blood cell-related traits in the All of Us Biobank15 (AOU; average n=86,345; Supplementary Table 13; Methods). First, we assessed the predictive performance of SuShiE-based weights compared to alternative expression-prediction methods and found that SuShiE obtained higher prediction accuracy (Supplementary Note 2; Supplementary Figures 34–35). Given this, we predicted the expression levels of 19,366 genes (mRNA) and 515 proteins using ancestry-matched SuShiE cis-molQTL prediction weights from the above analyses and AOU genotypes. Overall, we identified 195 T/PWAS significant associations in white blood cell count (WBC), eosinophil count (EOS), and monocyte count (MON; Supplementary Table 14; Supplementary Figure 36). Of these associations, ~92% were identified in WBC due to substantially increased statistical power (i.e., 21,476 more participants on average). We found no significant associations in lymphocyte count (LYM), neutrophil count (NEU), and basophil count (BAS), likely due to lower statistical power, similar to previous studies41,42 that identified fewer associations for these traits than for WBC.
Consistent with our simulation results (Figure 2f), SuShiE demonstrated higher T/PWAS chi-square statistics (bootstrapped P=1.43e-39 and P=3.21e-2) and identified 27 and 52 (16% and 36%) more T/PWAS associations compared to results driven by SuSiE- and MESuSiE-based prediction weights (Figure 5a). To demonstrate the increase of T/PWAS power is not driven by false discoveries, we applied the same T/PWAS procedure on two traits in AOU Biobank, a-priori unlikely to be driven by white blood cells: biological sex (n=105,521) and smoking (n=50,144; defined as having smoked at least 100 cigarettes in lifetime; Supplementary Table 13). We observed no significant T/PWAS genes (with Bonferroni correction) and uniformly distributed unadjusted T/PWAS P values for each method (Supplementary Figure 37). In addition, we observed that the SuShiE T/PWAS signals associated with multiple measures of LOF intolerance (Supplementary Table 15), partially mitigating a limitation in previous works demonstrating that high LOF intolerance genes are typically depleted in TWAS models due to weak eQTL signals22,38 (Figure 5b; Methods). We found less support for a relationship between SuSiE-/MESuSiE-based TWAS signals and LOF intolerance (one-sided P=3.18e-4 and 3.42e-5; Supplementary Table 15), further demonstrating SuShiE’s advantage. To validate our results, we compared our TWAS statistics with multiple independent white blood cell-related TWASs29,41–44 and found that SuShiE-based TWAS replicated at rates similar to SuSiE and MESuSiE (two-sample proportion test: P=0.74 and 0.59), suggesting that its improved power is unlikely due to false positives and further highlighting its benefit in identifying disease-related genes. Lastly, using independent white blood cell-related GWASs45, we found that SuShiE-derived cis-molQTLs are enriched for heritability (P=1.70e-11; Supplementary Note 2; Supplementary Figure 38).
Figure 5: SuShiE identifies more T/PWAS genes and higher chi-square statistics compared with SuSiE and MESuSiE.

a) Scatter plot of T/PWAS t-statistics comparing SuShiE (y-axis) with SuSiE and MESuSiE across all phenotypes and contributing cis-molQTL studies. SuShiE identifies 27 and 52 more T/PWAS-significant genes than SuSiE and MESuSiE, respectively. Overall, SuShiE displays higher T/PWAS chi-square statistics compared to SuSiE by 0.08 and MESuSiE by 0.01, with bootstrapped p-values of 1.43e-39 and 3.21e-2, respectively (one-sided and not adjusted for multiple testing; Methods). The black dashed line represents the identity line (y = x). Genes identified as significant by both methods are shown in purple (Both), while those not identified by either method are shown in grey (Neither).
b) Average T/PWAS chi-square statistics across all phenotypes and contributing cis-molQTL studies within low, middle, and high constraint scores for SuShiE, SuSiE, and MESuSiE (Methods). Error bars represent 95% confidence intervals.
Overall, our work has shown that by jointly modeling the molecular data across different ancestries while allowing effect sizes to covary, SuShiE outputs more accurate cis-molQTL prediction weights, thus boosting downstream statistical power for integrative analyses with GWASs.
Discussion
In this paper, we present the Sum of Shared Single Effects (SuShiE) model, a novel approach for multi-ancestry SNP fine-mapping of molecular traits that uses scalable variational inference, infers cross-ancestry effect size correlations, and estimates expression prediction weights. Through extensive simulations, we demonstrate that SuShiE outperforms existing approaches. We apply SuShiE to 36,907 molecular phenotypes of diverse ancestries from the TOPMed-MESA and GENOA studies. SuShiE fine-maps 18.2% more genes on average compared to the existing methods, exhibiting smaller credible set sizes and higher enrichment in relevant functional annotations. SuShiE infers highly correlated cis-molQTL effect sizes across ancestries on average in significantly heritable genes, reflecting primarily shared cis-molQTL architectures. In addition, we observe cis-molQTL effect size heterogeneity across ancestries associated with multiple constraint measurements. Last, we perform TWAS and PWAS on six white blood cell-related traits from the AOU biobank using SuShiE-derived ancestry-specific cis-molQTL prediction weights and identify 25.4% more significant genes compared to the existing methods.
We describe several caveats in our real data analysis. First, SuShiE approximates ancestry as a discrete category, allowing us to model cis-molQTL effect sizes using a multivariate normal distribution (Methods). While this simplifies modeling and inference tasks, we emphasize that this is a heuristic approach that neglects the complex and shared demographic histories underlying all humans. Indeed, recent work has demonstrated the importance of viewing genetic ancestries as a continuous spectrum rather than discrete categories46.
Second, we note that our data consist of African- (AFR) and Hispanic-American (HIS) individuals, which reflect recent admixture events. To account for complex diversity within ancestries, we include genotyping PCs as covariates in our models. Several works have suggested that admixture can be sufficiently corrected for using global ancestry information (i.e., genotyping PCs) in association testing, especially when causal effect sizes are largely consistent across ancestries47 (Supplementary Figures 27–29). On the other hand, accounting for local ancestry may increase association testing power when causal effects are highly different across ancestries47 or aid fine-mapping in post-GWAS analysis48, which can be one of the future directions for SuShiE.
Third, we observe significant associations between gene LOF intolerance and several SuShiE-estimated metrics, including effect size heterogeneity across ancestries, the number of cis-molQTLs, cis-molQTL distance to TSS, and functional enrichments. The relationship remains significant after adjusting for Fst, suggesting allele frequency differences across ancestries are not sufficient to fully explain estimated heterogeneity. As a result, we hypothesize that cis-molQTL effect size heterogeneity can be in part due to gene-by-environment (GxE) interactions31,38. Highly constrained genes exhibit more complex regulatory landscapes with fewer cis-molQTLs (or apparent cis-molQTLs due to smaller effect sizes)22,38. As a result, these genes may have higher variability in their expression levels across individuals and be less resistant to environmental perturbations38, which may induce effect-size heterogeneity across different ancestries. On the other hand, it is possible that our Fst estimates are underpowered to detect subtle allele frequency differences across ancestries. Therefore, these associations may provide indirect evidence for natural selection partially driving cis-molQTL effect size heterogeneity across ancestries. To explicitly investigate the role of selection in molecular differences across ancestries, we likely require a more principled modeling procedure based in population genetics together with higher-resolution molecular data measured in diverse ancestries22. We note that the genetic architectures of cis-molQTLs and complex traits are different in many ways22, and our constraint analysis should be interpreted within the context of cis-molQTL settings. As a result, investigating the relationship between inferred posterior covariance structures and constraint in the context of complex traits and diseases can be valuable and interesting.
Fourth, in our T/PWAS analysis, we select six white blood cell-related traits to best match PBMC and LCL contexts. However, alternative cell types not included in our analyses may better capture relevant contexts. For example, PBMCs and LCLs do not contain neutrophils, basophils, and eosinophils, and LCLs additionally do not include monocytes, which may result in a loss in statistical power. As single-cell RNA-seq datasets become more available, one possible direction is to perform TWAS in fine-grained cellular contexts and backgrounds49. In addition, after predicting expression levels using ancestry-matched weights for each individual, we perform individual-level T/PWAS by concatenating the predicted expression levels across ancestries rather than performing ancestry-specific TWAS followed by meta-analysis50. The premise of the meta-analysis approach is that researchers obtain ancestry-specific GWAS and then integrate them with corresponding eQTL weights. Because the causal genes for complex traits are likely shared across ancestries30–35,41, a regression framework with individual-level data concatenated across ancestries (the largest sample size) can maximize power. Lastly, caution is needed when interpreting TWAS findings. Several factors, such as correlations in predicted expression due to cis-molQTL LD and effect sizes, tissue-specific biases, and SNP pleiotropy, may lead to false positive hits51. As a result, downstream statistical fine-mapping52 or experimental validation methods are essential for further prioritizing candidate causal genes of complex traits. We discuss several additional caveats and potential directions for future work in Supplementary Note 2.
Overall, SuShiE, together with its application to large-scale molecular data of diverse ancestries, identifies more cis-regulatory mechanisms and reveals their genetic architecture. We anticipate considerable demand for our approach as multi-ancestry and multi-omics research expands in the genetics field.
Methods
Ethics statement
This study utilizes individual-level data from TOPMed-MESA, GENOA, GEUVADIS, INTERVAL, and the All of Us (AOU) Biobank. We confirm that our research complies with all relevant ethical regulations and protocols associated with each respective study.
Sum of Shared Single Effects Model
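Here, we describe the statistical model underlying SuShiE (see Supplementary Note 1 for a detailed description). SuShiE assumes that cis-molQTLs are present in all ancestries (i.e., shared cis-molQTLs) while allowing effect sizes at causal cis-molQTLs to covary across ancestries a priori. For the $i^{\text{th}}$ of $k$ total ancestries, SuShiE models the centered and standardized levels of a molecular trait $g_i \in \mathbb{R}^{n_i \times 1}$ measured in $n_i$ individuals as a linear combination of $p$ genotyped variants $X_i \in \mathbb{R}^{n_i \times p}$ as
$$g_i = X_i \beta_i + \epsilon_i,$$
where $\beta_i$ is a $p \times 1$ vector of ancestry-specific cis-molQTL effects, and $\epsilon_i \sim N(0, \sigma^2_{i,e} I_{n_i})$ is environmental noise. In addition, we model $\beta_i$ as the sum of $L$ effects, $\beta_i = \sum_{l=1}^{L} \gamma_l b_{i,l}$, where $\gamma_l$ is a $p \times 1$ binary vector indicating which variant is the shared cis-molQTL for the $l^{\text{th}}$ effect while allowing ancestry-specific effect sizes $b_{i,l}$. Furthermore, we model $\gamma_l \sim \text{Multi}(1, \pi)$, where $\pi$ is a $p \times 1$ vector representing the prior probability for each SNP to be a cis-molQTL, and model $b_l = [b_{1,l}, \ldots, b_{k,l}]^\top \sim N(0, C_l)$, where
$$C_l = \begin{bmatrix} \sigma^2_{1,b,l} & \cdots & \rho_{1k,l}\,\sigma_{1,b,l}\,\sigma_{k,b,l} \\ \vdots & \ddots & \vdots \\ \rho_{k1,l}\,\sigma_{k,b,l}\,\sigma_{1,b,l} & \cdots & \sigma^2_{k,b,l} \end{bmatrix}$$
is the prior effect size covariance matrix with $\sigma^2_{i,b,l}$ as variance and $\rho_{ii',l}$ as correlation. We provide a detailed description for our software implementation of SuShiE in Supplementary Note 2.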
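To make the generative assumptions concrete, the following is a minimal sketch (not the SuShiE implementation) that draws one gene from a two-ancestry, single-effect version of the model: one shared causal variant with ancestry-specific effect sizes that are correlated through a 2x2 prior covariance. The function name, the use of independent standard-normal "genotypes" (i.e., no LD), and all argument names are illustrative assumptions.

```python
import math
import random

def simulate_sushie_gene(n_per_ancestry, p, causal_idx, sigma_b, rho, sigma_e, seed=0):
    """Draw one gene from a two-ancestry, single-effect (L = 1) version of the model.

    Hypothetical helper for illustration; not part of the SuShiE software.
    """
    rng = random.Random(seed)
    # Correlated ancestry-specific effect sizes (b_1, b_2) ~ N(0, C), where C has
    # variances sigma_b[i]^2 and correlation rho (Cholesky factor written out by hand).
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    b = [sigma_b[0] * z1,
         sigma_b[1] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)]
    data = []
    for i, n in enumerate(n_per_ancestry):
        # Toy genotypes: independent standard-normal columns (no LD); an assumption.
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        # g_i = X_i beta_i + eps_i, where beta_i is zero except at the shared causal SNP.
        g = [row[causal_idx] * b[i] + rng.gauss(0, sigma_e[i]) for row in X]
        data.append((X, g))
    return b, data
```

Setting `rho` close to 1 yields nearly identical effects across ancestries, whereas `rho = 0` corresponds to the independent-effects setting while the causal variant remains shared.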
Variational inference of model parameters
To infer the cis-molQTL effects, we seek to estimate the posterior distribution of $b_l$ and $\gamma_l$, where $l = 1, \ldots, L$. We regard $b_l$ and $\gamma_l$ as latent variables, $X_i$ and $g_i$ as observed data, and $\sigma^2_{i,e}$, $C_l$, and $\pi$ as the hyperparameters. However, inferring the exact distributions of latent variables is computationally intractable due to non-conjugacy with the prior distribution. Therefore, we seek a surrogate distribution $Q(b_l, \gamma_l)$ that minimizes the Kullback–Leibler (KL) divergence with the true posterior $\Pr(b_l, \gamma_l \mid X_1, \ldots, X_k, g_1, \ldots, g_k)$. Through the principles of coordinate-ascent variational inference (CAVI)53, we can identify each surrogate as
$$Q(b_l \mid \gamma_{l,j} = 1) = N(\mu_{l,j}, \Sigma_{l,j}), \qquad Q(\gamma_l) = \text{Multi}(1, \alpha_l),$$
where $\mu_{l,j}$ and $\Sigma_{l,j}$ are the corresponding posterior mean and covariance, and $\alpha_l$ is each SNP’s posterior probability to explain the $l^{\text{th}}$ effect. We provide the complete mathematical derivations, inference algorithms, and detailed definitions in Supplementary Note 1.
Computing PIPs and credible sets
We define the posterior inclusion probability (PIP) for SNP $j$ with $L$ single effects as $\text{PIP}_j = 1 - \prod_{l=1}^{L} (1 - \alpha_{l,j})$. To compute an $\alpha$-credible set for each $l$, where $\alpha$ represents the desired probability that the set contains cis-molQTLs, we sort $\alpha_l$ in decreasing order and take a greedy approach to include SNPs until their cumulative sum exceeds $\alpha$. To refine the final inference results, we exclude the credible sets with a “purity” below 0.5, defined as the lowest absolute pairwise correlation among SNPs in the credible set for a single ancestry3. In a multi-ancestry setting, let $r_i$ be the purity of ancestry $i$ out of $k$ ancestries with sample size $n_i$. We compute the purity as $\sum_{i=1}^{k} w_i r_i$, where $w_i = n_i / \sum_{i'=1}^{k} n_{i'}$.
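As an illustration of the PIP and credible-set definitions, the sketch below computes PIPs from per-effect posterior probabilities, builds a credible set greedily, and averages per-ancestry purities. The function names are hypothetical, and the sample-size weighting used for the multi-ancestry purity is an assumption made for this example.

```python
def compute_pips(alphas):
    """alphas[l][j]: posterior probability that SNP j explains single effect l."""
    p = len(alphas[0])
    pips = []
    for j in range(p):
        prob_not_included = 1.0
        for alpha_l in alphas:
            prob_not_included *= 1.0 - alpha_l[j]
        pips.append(1.0 - prob_not_included)
    return pips

def credible_set(alpha_l, level=0.95):
    """Add SNPs in decreasing order of alpha until the cumulative sum exceeds `level`."""
    order = sorted(range(len(alpha_l)), key=lambda j: alpha_l[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += alpha_l[j]
        if total > level:
            break
    return chosen

def weighted_purity(purities, sample_sizes):
    """Sample-size-weighted average of per-ancestry purities (weighting is an assumption)."""
    total_n = sum(sample_sizes)
    return sum(n / total_n * pur for pur, n in zip(purities, sample_sizes))
```

For example, with two ancestries of sizes 300 and 100 and purities 0.8 and 0.4, the weighted purity is 0.7, so the credible set would pass a 0.5 threshold.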
Inferring hyperparameters
We use an Empirical Bayes-like procedure to infer our hyperparameters (the environmental/residual variances and the cis-molQTL effect size covariance matrices across ancestries) by directly maximizing the evidence lower bound (ELBO)54. As reflected in Algorithms 1 and 2 in Supplementary Note 1, inference alternates between inferring variational parameters while keeping hyperparameters fixed (e.g., the residual variances $\sigma^2_{i,e}$ and the prior covariance matrices $C_l$), and inferring hyperparameters while keeping variational parameters fixed (e.g., the posterior means $\mu_l$, covariances $\Sigma_l$, and inclusion probabilities $\alpha_l$), which is akin to a coordinate ascent procedure that includes both variational parameters and hyperparameters54. For details, see Supplementary Note 1.
Default parameters and performance metrics
We provide detailed procedures for simulating genotype data, quantitative molecular traits, GWAS, and TWAS in Supplementary Note 2. We performed SNP fine-mapping using SuShiE on simulated genotypes and molecular data across EUR and AFR individuals. In terms of variational inference parameters, we specified to match the actual number of simulated effects and initialized cis-molQTL effects as , their covariance matrix as , the prior estimates of environmental noises as 0.001, the prior probability for SNPs to be cis-molQTLs as where is the number of common SNPs.
To evaluate the gain in parametrizing the effect size correlation across ancestries, we compared our method SuShiE to “SuShiE-Indep”, which assumes the cis-molQTL effect sizes are independent across ancestries; that is, SuShiE-Indep does not model cis-molQTL effect size correlations across ancestries (i.e., $\rho$ is fixed at 0 rather than learned through the Empirical Bayes-like procedure) but only assumes causal variants are shared across ancestries. To demonstrate that SuShiE’s improvement does not result merely from the accumulation of samples across ancestries, we compared SuShiE’s performance to two methods. First, we performed single-ancestry SuSiE and then meta-analyzed the resulting PIPs by $\text{PIP}_{\text{meta}} = 1 - \prod_{i=1}^{k} (1 - \text{PIP}_i)$; we refer to this method as “meta-SuSiE”. In this case, meta-SuSiE neither models effect size correlations across ancestries nor assumes shared causal variants. Instead, it aggregates signals across ancestries, identifying variants that are causal in at least one ancestry. Second, we row-stacked the genotype matrices and molecular trait vectors across ancestries and then performed single-ancestry SuSiE on the stacked data; we refer to this method as “SuSiE”.
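The two baseline constructions can be sketched as follows. This is a hedged illustration rather than the code used in the analyses: `meta_pip` combines per-ancestry PIPs into the probability that a SNP is causal in at least one ancestry (treating ancestries as independent, which is an assumption of this sketch), and `row_stack` concatenates individuals across ancestries before a single-ancestry fit. Both function names are hypothetical.

```python
def meta_pip(pips_by_ancestry):
    """Probability that each SNP is causal in at least one ancestry."""
    p = len(pips_by_ancestry[0])
    combined = []
    for j in range(p):
        prob_causal_in_none = 1.0
        for pips in pips_by_ancestry:
            prob_causal_in_none *= 1.0 - pips[j]
        combined.append(1.0 - prob_causal_in_none)
    return combined

def row_stack(genotypes_by_ancestry, traits_by_ancestry):
    """Concatenate genotype rows and trait values across ancestries."""
    X = [row for G in genotypes_by_ancestry for row in G]
    g = [v for t in traits_by_ancestry for v in t]
    return X, g
```

Note that `meta_pip` can only increase a SNP's PIP relative to any single ancestry, which is consistent with aggregating signals rather than requiring them to be shared.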
To demonstrate SuShiE’s advantages in simulations over other recent methods that leverage a similar sum of single effects assumption3,16, we compared our method to SuSiEx7, MESuSiE8, XMAP9, and XMAP-IND (XMAP with individual-level data as input). All four methods focus on complex traits and provide software that runs with summary data and an LD reference matrix, while XMAP additionally has the capability to accept individual-level data (i.e., XMAP-IND). We set the initial function parameter values to be the same across all methods, including cis-molQTL effect size variance and correlation priors, maximum iteration, minimum tolerance, the number of pre-specified causal effects ($L$), the credible set confidence level, and the purity threshold. We provide detailed descriptions of the function parameter settings for each method in Supplementary Note 2. After simulating individual-level genotype and phenotype data of each ancestry, we computed the summary z scores using simple linear regression. We also used the LD matrix estimated from the 1000G project as the in-sample reference.
Overall, we ran eight methods (i.e., SuShiE, SuShiE-Indep, meta-SuSiE, SuSiE, SuSiEx, XMAP, XMAP-IND, and MESuSiE) on 500 genes’ simulated genotypes and molecular traits to output corresponding PIPs, credible sets, and ancestry-specific effect size estimates. We varied four parameters: per-ancestry cis-molQTL study sample size ($n$), the number of cis-molQTLs ($L$), the cis-SNP heritability of molecular traits ($h^2_g$) for each ancestry, and the effect size correlation ($\rho$). To reflect a practical study design, the default parameters were fixed at , and unless stated otherwise. Furthermore, we evaluated the fine-mapping performance with three metrics across 500 simulated genes: PIPs at causal cis-molQTLs, credible set size, and frequency that causal cis-molQTLs are contained in 95% credible sets (calibration). We computed the metrics of meta-SuSiE based on the union of the credible sets across the two single-ancestry SuSiE runs. We computed the credible set size metric using only the credible sets that passed purity pruning. To quantify the performance of SuShiE compared with other methods under a specified metric, we conducted linear regression analyses for each metric and each method. Specifically, we fit a linear regression model $y = \beta_0 + m \beta_m + Z \beta_Z + \epsilon$, where $y$ is the computed metric for each method, $\beta_0$ is an intercept, $m$ is the method label (effect $\beta_m$), $Z$ is the simulation parameter matrix reflecting various scenarios (i.e., per-ancestry sample size, cis-SNP heritability, the number of cis-molQTLs, and effect size correlation across ancestries), and $\beta_Z$ is their effects. We limited this regression to include SuShiE and a single other method (thus setting SuShiE to be the reference method), and then meta-analyzed the effects across methods to produce an aggregate P value.
To quantify the LD diversity across ancestries for each gene, we first calculated the LD scores of each SNP within the cis-region of genes for each ancestry based on computed from 1000G. Then, we performed ANOVA and Levene’s test to assess the equality of mean and variance in LD scores, respectively, between the two ancestries. The resulting statistics were used as a metric for LD diversity, where a higher value indicates greater LD diversity across ancestries.
Real-data analyses overview
We applied SuShiE and other methods (i.e., SuShiE-Indep, meta-SuSiE, SuSiE, SuSiEx, MESuSiE, and XMAP) to three datasets: mRNA expression (visit-1) measured in peripheral blood mononuclear cells (PBMCs) and protein abundance measured in plasma of three ancestries (EUR, AFR, and HIS) from the Trans-Omics for Precision Medicine program Multi-Ethnic Study of Atherosclerosis (TOPMed MESA)12,13, and mRNA expression measured in lymphoblastoid cell lines (LCLs) of EUR and AFR ancestries from the Genetic Epidemiology Network of Arteriopathy (GENOA) study14. We excluded the mRNA expression data measured in T cells and monocytes from the TOPMed MESA study due to relatively smaller sample sizes. We provide the detailed quality control (QC) information for the genotypes and phenotypes of our three main datasets and three validation datasets, along with a description of the validation metrics, in Supplementary Note 2. We conducted pairwise comparisons of methods on four basic summary statistics, focusing on the genes for which both methods output credible sets; the summary statistics included the number of genes identified with cis-molQTLs (e/pGenes), the average PIPs of the SNPs in the credible sets, the average single-effect-specific credible set sizes, and the frequency of having genes whose credible sets contained SNPs with PIPs greater than 0.95. We defined the number of cis-molQTLs as the number of credible sets output after pruning for purity.
For all the fine-mapping analysis, we used the SNPs that are shared across ancestries on the genomic window of each gene that is 500kb upstream and downstream of each gene’s TSS and TES (one million bp in total), respectively, based on the GENCODE v3455,56. In addition, we only included genes that are located on the autosomes, do not overlap with the major histocompatibility complex (MHC) region, have more than 100 SNPs on the genomic window present in all ancestries, and whose ENSEMBL gene IDs match the records in GENCODE v3455,56. We removed all the ambiguous SNPs (i.e., A/T, T/A, C/G, and G/C). We adjusted for covariates by regressing them from both mRNA/protein levels and each SNP. In addition, we computed the cis-SNP heritability using the limix python package (Code Availability) for each analyzed molecule within each ancestry. We used PLINK2.0, vcftools, and bcftools for genotype manipulation57–60. For SuSiEx, XMAP, and MESuSiE, we computed the summary z scores using PLINK2.057,58 and LD computed from individual-level genotypes as in-sample reference panels, while adjusting for their corresponding covariates. For MESuSiE, we used its default settings for “ancestry_weight” in its R function. For molecular trait prediction models used in TWAS and PWAS, we obtained the eQTL prediction weights of EUR, AFR, and HIS in the TOPMed MESA mRNA dataset, the pQTL prediction weights of EUR, AFR, and HIS in the TOPMed MESA protein dataset, and the eQTL prediction weights of EUR and AFR in the GENOA mRNA dataset. Similarly, we provide a detailed description of how we evaluated expression prediction accuracy across methods in Supplementary Note 2.
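The SNP-level filters described above (shared across ancestries, within the cis-window, and not strand-ambiguous) can be sketched as follows. This is an illustrative outline rather than the pipeline used here; the input layout (one dictionary per ancestry mapping SNP ID to position and alleles) and the function names are assumptions, and taking the minimum and maximum of the TSS and TES to handle genes on either strand is a simplification.

```python
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def is_ambiguous(ref, alt):
    """True for strand-ambiguous allele pairs (A/T, T/A, C/G, G/C)."""
    return (ref.upper(), alt.upper()) in AMBIGUOUS

def cis_window(tss, tes, flank=500_000):
    """Window extending `flank` bp beyond the gene body on both sides."""
    start = max(0, min(tss, tes) - flank)
    end = max(tss, tes) + flank
    return start, end

def shared_cis_snps(snps_by_ancestry, tss, tes, flank=500_000):
    """snps_by_ancestry: list of dicts {snp_id: (pos, ref, alt)}, one per ancestry."""
    start, end = cis_window(tss, tes, flank)
    shared_ids = set.intersection(*(set(d) for d in snps_by_ancestry))
    first = snps_by_ancestry[0]
    kept = []
    for snp_id in sorted(shared_ids, key=lambda s: first[s][0]):
        pos, ref, alt = first[snp_id]
        if start <= pos <= end and not is_ambiguous(ref, alt):
            kept.append(snp_id)
    return kept
```

A gene would then be retained only if the returned list contains more than 100 SNPs, matching the inclusion criterion stated above.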
Functional enrichment analyses and case study
We ran functional enrichment analysis only on the genes identified with cis-molQTLs (i.e., SuShiE outputs credible sets; e/pGenes). To visualize the relationship between the PIPs inferred by SuShiE and their distance to the TSS, we grouped fine-mapped SNPs into 2,000 bins that are 500 bp long to cover the one-million-bp window around the TSS for each gene and computed the average PIPs within each bin. To visualize the relationship between single effects’ posterior probabilities and their distance to the TSS, we performed the same procedure focusing on the shared effects that had credible set output (i.e., passed the purity threshold).
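The binning step can be sketched as below: SNPs are assigned to 500-bp bins by their signed distance to the TSS, and PIPs are averaged within each bin. The function name and the convention that distances are signed integers in base pairs are assumptions for this example.

```python
def average_pip_by_bin(distances, pips, bin_size=500, half_window=500_000):
    """Return the mean PIP per bin (None for empty bins).

    distances: signed integer bp distance of each SNP to the TSS.
    """
    n_bins = 2 * half_window // bin_size  # 2,000 bins with the defaults
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, pip in zip(distances, pips):
        if not -half_window <= d < half_window:
            continue  # SNP falls outside the window around the TSS
        idx = (d + half_window) // bin_size
        sums[idx] += pip
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

The same function can be applied to single-effect posterior probabilities in place of PIPs by passing those values as `pips`.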
We performed enrichment analysis using 89 functional annotations. First, we downloaded 5 candidate cis-regulatory elements (cCREs) from ENCODE Registry v323. Then, we obtained 9 cell-type specific cCREs measured in PBMC using snATAC-Seq24 and one cCRE measured in frozen PBMC using scATAC-seq25. Last, we obtained the 74 categorical functional annotations from LDSC baseline annotations v2.261,62, and remapped to GRCh38 using LiftOver (Code Availability). To compute the functional enrichment scores, we employed an approach that is similar to TORUS63. Briefly, for each functional annotation and each gene, we performed the logistic regression where is the logit link function, is the vector for the PIPs of all the SNPs, is the binary vector indicating whether the SNPs fall into the annotation, and is the desired log-enrichment scores. After removing the genes on which logistic regression does not converge, we meta-analyzed the log-enrichment scores across genes by and where is the inverse of the squared standard error for gene . When comparing enrichment results across methods, we focused on e/pGenes fine-mapped by both methods. We computed the comparison z score as for method and . For the enrichment analyses focusing on individual shared effect using , rather than PIPs, we limited analyses to those single effects that had corresponding credible sets (i.e., were not pruned).
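For the meta-analysis step, a minimal sketch of an inverse-variance-weighted (fixed-effect) combination of per-gene log-enrichment estimates is given below, using weights equal to the inverse of the squared standard errors as described above. The pooled standard error and the form of the between-method comparison z score (difference of estimates over the root sum of squared standard errors, which treats the two estimates as independent) are standard choices assumed for this example, not formulas confirmed by the text.

```python
import math

def inverse_variance_meta(estimates, standard_errors):
    """Fixed-effect meta-analysis with weights 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    return pooled, pooled_se

def comparison_z(est_a, se_a, est_b, se_b):
    """z score for the difference between two estimates, assuming independence."""
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
```

Because the compared methods are run on the same genes, their estimates are likely positively correlated, in which case the independence assumption in `comparison_z` would tend to be conservative.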
To perform a case study, we selected URGCP and SNHG5, which were fine-mapped by SuShiE but missed by other multi-ancestry methods. We used annoQ64 to annotate the fine-mapped SNPs (Code Availability). To annotate the genomic region around URGCP, we downloaded the ChIP-seq H3K27ac data of ENCODE23 from the WashU Epigenome Browser65 (Code Availability) and obtained proximal enhancer (pELS) cCREs from ENCODE Registry v3, the PBMC annotation using scATAC-seq in Satpathy et al.25, and the naive T cell, naive B cell, cytotoxic natural killer (cNK) cell, and monocyte annotations using snATAC-seq in Chiou et al.24 We used plotgardener66 for visualization (Code Availability).
Prior cis-molQTL correlation analyses
To shed light on the relationship between heterogeneity of effect-sizes across ancestries and genes’ constraint, using all the credible sets output by SuShiE, we tested for association between SuShiE-inferred effect size correlations across ancestries () and five measures of constraint () using all the fine-mapped e/pGenes: probability of being Loss-of-Function Intolerant (pLI)36, loss-of-function observed/expected upper bound fraction (LOEUF)37, enhancer-domain score (EDS)38, the Residual Variation Intolerance Score (RVIS)39, and shet40. We downloaded pLI and LOEUF from gnomAD browser v4.0 (Code Availability), we downloaded EDS, RVIS, and shet from their original papers. Our base model is according to:
where is the intercept term, is the ordered and categorical single effect index representing the order of variance explained, is the corresponding ancestry pair indicator (e.g., the correlation of EUR-AFR, EUR-HIS, or HIS-AFR), is the study indicator (e.g., TOPMed MESA mRNA, TOPMed MESA proteins, or GENOA mRNA), s are the corresponding coefficients. We test the significance of in a linear regression framework. A negative value of for pLI, EDS, and shet is taken to indicate stronger associations between cis-molQTL effect size heterogeneity across ancestries and gene constraint, while a positive value of for LOEUF and RVIS is suggestive of stronger associations.
In addition, to show robustness, we re-tested these associations using estimated covariance by replacing by . We also focused on correlations estimated from the primary effect (i.e., $l = 1$); in this case, we removed the single-effect index term from the base model. We also re-computed the standard error using bootstrap. Specifically, for each study, each ancestry pair, and each , we sampled the genes with replacement and computed the . We repeated this procedure 100 times to construct the null distributions for and used its standard deviation as a new standard error. In addition, to adjust for allele frequency differences across ancestries, we added Wright’s fixation index (Fst) as an additional term. To compute Fst for each gene, we used only the fine-mapped SNPs and PLINK257,58 with the “Hudson” method67,68. To investigate the relationship between expected cis-molQTLs’ distance to TSS and genes’ constraint, we computed the expected distance to TSS for each gene according to where is the distance (absolute value) to the TSS for SNP .
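The bootstrap standard error described above can be sketched generically: resample the genes with replacement, recompute the statistic of interest on each resample, and take the standard deviation across replicates. Here `estimator` is a hypothetical placeholder for whatever statistic is being re-estimated (for example, a regression coefficient); the function name and the fixed seed are assumptions for this example.

```python
import random
import statistics

def bootstrap_se(values, estimator, n_boot=100, seed=0):
    """Standard deviation of `estimator` across bootstrap resamples of `values`."""
    rng = random.Random(seed)
    n = len(values)
    replicates = []
    for _ in range(n_boot):
        # Sample n items with replacement (one bootstrap resample of genes)
        resample = [values[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)
```

For instance, `bootstrap_se(per_gene_values, statistics.mean)` would give a bootstrap standard error for a simple mean; with 100 replicates, as used here, the estimate itself carries some Monte Carlo noise.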
TWAS and PWAS analyses in All of Us biobank
We performed individual-level Transcriptome- and Proteome-wide Association Studies (TWASs and PWASs)10,11 on 6 white blood cell-related traits: basophil count (BAS), eosinophil count (EOS), lymphocyte count (LYM), monocyte count (MON), neutrophil count (NEU), and white blood cell count (WBC; Supplementary Table 13) measured in All Of Us biobank (AOU) Controlled Tier Dataset v715. For all the traits, we excluded individuals with certain pre-existing conditions (Supplementary Note 2) and excluded measurements that were 3 standard deviations away from the mean, resulting in a total of 86,345 individuals on average. We identified individual ancestry information based on AOU precomputed information (i.e., “eur”, “afr”, and “amr” labels in “ancestry_pred” column), resulting in 53,268 EUR, 16,748 AFR, and 16,329 HIS individuals on average.
To perform T/PWAS, we first predicted expression levels (either mRNA or proteins) for EUR, AFR, and HIS individuals in AOU using ancestry-matched e/pQTL prediction weights with the score function in PLINK257,58. Then, we standardized the expression vector (centered by mean and scaled by standard deviation) within each ancestry and concatenated the results into a single vector across ancestries. Next, we regressed out biological sex, age, squared age, and ten genotype PCs from the trait measurements45. Last, we regressed the inverse-rank normalized residuals on the predicted expression levels to compute the TWAS or PWAS statistics. We re-performed the procedure using SuSiE- and MESuSiE-derived e/pQTL prediction weights as comparisons. We applied the Bonferroni correction to adjust the reported P values with n=23,000. We conducted an ANCOVA to compare T/PWAS chi-square statistics between SuShiE and SuSiE, as well as between SuShiE and MESuSiE, adjusting for white blood cell traits and study effects. Recognizing that nearby genes may have correlated T/PWAS statistics due to LD, potentially violating ANCOVA assumptions, we addressed this issue by computing the standard error using a bootstrap approach. Specifically, we performed ANCOVA on random samples of the same number of genes, selected with replacement, repeating this process 100 times to build a null distribution of the comparison estimates. We then used the standard deviation of this null distribution as the corrected standard error.
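The core of the individual-level T/PWAS test can be sketched as below, assuming the predicted expression and the covariate-adjusted trait residuals are already available per ancestry. Several details are assumptions of this sketch rather than statements about the analysis: the population standard deviation is used for scaling, the rank-based inverse normal transform uses the offset (rank - 0.5)/n with ties broken by input order, and the P value uses a normal approximation to the regression t statistic. All function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, pstdev

def standardize(values):
    """Center by the mean and scale by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def inverse_rank_normalize(values):
    """Rank-based inverse normal transform; ties are broken by input order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out

def twas_z(pred_by_ancestry, resid_by_ancestry):
    """Regress normalized trait residuals on concatenated predicted expression."""
    # Standardize predicted expression within each ancestry, then concatenate
    x = [v for pred in pred_by_ancestry for v in standardize(pred)]
    y = inverse_rank_normalize([v for res in resid_by_ancestry for v in res])
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

def bonferroni_p(z, n_tests=23_000):
    """Two-sided normal-approximation P value, Bonferroni-adjusted and capped at 1."""
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return min(1.0, p * n_tests)
```

Standardizing within ancestry before concatenation puts the predicted expression on a common scale, so that differences in prediction-weight magnitude across ancestries do not by themselves drive the pooled regression.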
To validate our TWAS results, we compared them to five independent TWAS studies: Lu and Gopalan et al.41, Kachuri et al.29, Tapia et al.42, Wen et al.43, and Rowland et al.44 We released our cis-molQTL prediction weights to the public (Data availability). To test the association between T/PWAS chi-square statistics and genes’ constraint scores: pLI36, LOEUF37, EDS38, RVIS39, and shet40, we used linear regression adjusted for phenotype and study and reported P values. To compare significance of these associations between SuShiE, SuSiE, and MESuSiE, we computed the z score as where is the method (i.e., SuSiE and MESuSiE) and is the chi-square statistics for constraint score (i.e., the square of coefficients in the linear regression). For visualization purposes, we classified genes into three groups: Low, Middle, and High based on different scores, respectively. For pLI, we labeled genes with pLI >0.9 as High, <0.1 as Low, and otherwise middle. For other scores, we labeled genes whose value is greater than 90% quantile as High, smaller than 10% quantile as Low, and otherwise middle. Lastly, we provide detailed descriptions of how covariates were defined using AOU datasets in Supplementary Note 2.
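The grouping used for visualization can be sketched as follows: fixed thresholds for pLI and empirical 10% and 90% quantiles for the other scores. The function names are hypothetical, and the "inclusive" quantile interpolation is an assumption, since the quantile definition is not specified above.

```python
import statistics

def classify_pli(pli):
    """Label a gene by pLI: High (> 0.9), Low (< 0.1), otherwise Middle."""
    if pli > 0.9:
        return "High"
    if pli < 0.1:
        return "Low"
    return "Middle"

def classify_by_quantile(scores):
    """Label each score using the 10% and 90% quantiles of `scores`."""
    cuts = statistics.quantiles(scores, n=10, method="inclusive")
    low_cut, high_cut = cuts[0], cuts[-1]
    labels = []
    for s in scores:
        if s > high_cut:
            labels.append("High")
        elif s < low_cut:
            labels.append("Low")
        else:
            labels.append("Middle")
    return labels
```

Note that the labels refer to the value of each score, not to the degree of constraint: for LOEUF and RVIS, larger values indicate less constrained genes, so the "High" group has the opposite interpretation from pLI.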
Supplementary Material
Acknowledgements
The authors would like to thank members of the Mancuso and Gazal labs for fruitful discussions regarding this manuscript. The authors would also like to specially thank Dr. Michael D. Edge for his thoughtful comments and suggestions. This work was funded in part by National Institutes of Health (NIH) under awards R01HG012133 (N.M.), R01CA258808 (N.M.), R01GM140287 (P.M.), R35GM142783 (N.M.), U54HG013243 (L.W.), R35GM147789 (S.G.), K08HL159346 (J.P.), R00CA246076 (L.K.) and R01MH125252 (A.G.).
MESA phenotypes (dbGaP: phs000209.v13.p3): MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts HHSN268201500003I, N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-001079, UL1-TR000040, UL1-TR-001420, UL1-TR-001881, and DK063491. Funding for SHARe genotyping was provided by NHLBI Contract N02-HL-64278. TOPMed MESA WGS genotype, mRNA, and protein expression data (dbGaP: phs001416.v3.p1): Molecular data for the Trans-Omics in Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). WGS genotype data for NHLBI TOPMed: MESA (phs001416.v3.p1) was performed at Broad Genomics (HHSN268201600034I). mRNA expression data for NHLBI TOPMed: MESA (phs001416.v3.p1) was performed at NWGC (HHSN268201600032I). SOMAscan proteomics for NHLBI TOPMed: Multi-Ethnic Study of Atherosclerosis (MESA) (phs001416.v1.p1) was performed at the Broad Institute and Beth Israel Proteomics Platform (HHSN268201600034I). Core support including centralized genomic read mapping and genotype calling, along with variant quality metrics and filtering were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Core support including phenotype harmonization, data management, sample-identity QC, and general program coordination were provided by the TOPMed Data Coordinating Center (R01HL-120393; U01HL-120393; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed.
GENOA genotype (dbGaP: phs001238.v2.p1) and gene expression (GEO: GSE138914) data were supported by grants from NIH NHLBI (HL054457, HL054464, HL054481, HL119443, and HL087660). The authors would like to acknowledge Drs. Sharon Kardia and Jennifer Smith in preparing GENOA eQTL data.
The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the All of Us Research Program would not be possible without the partnership of its participants.
Competing interests
L.W. provided consulting service to Pupil Bio Inc. and reviewed manuscripts for Gastroenterology Report, not related to this study, and received honorarium. S.G. received consulting fees from Eleven Therapeutics unrelated to this work. No potential conflicts of interest were disclosed by the other authors.
Data availability
SuShiE-derived prediction models (in both tsv format and FUSION format) for TWAS, PWAS, fine-mapping, and other analyzed results across cis-molQTL datasets can be found at https://zenodo.org/records/10963034 version 7.
The TOPMed-MESA data can be found and requested at dbGaP: phs000209.v13.p3, phs001416.v3.p1, phs001416.v1.p1. The GENOA data can be found and requested at dbGaP: phs001238.v2.p1 and GEO: GSE138914. The GEUVADIS data can be found at https://www.internationalgenome.org/data-portal/data-collection/geuvadis. The INTERVAL data can be found and requested at https://ega-archive.org/datasets/EGAD00001004080. The summary statistics in Chen et al. can be found at https://doi.org/10.1016/j.cell.2020.06.045. The LDSC annotation files can be found at https://console.cloud.google.com/storage/browser/broad-alkesgroup-public-requester-pays/. The ENCODE cCRE v3 can be found at https://screen.encodeproject.org/index/cversions. The snATAC-seq cCRE can be found at https://doi.org/10.1038/s41586-021-03552-w. The scATAC-seq cCRE can be found at https://doi.org/10.1038/s41587-019-0206-z. The All of Us data can be requested through https://allofus.nih.gov. The 1000G project data can be found at https://www.internationalgenome.org. The gnomAD v4.0 dataset for pLI and LOEUF is at https://gnomad.broadinstitute.org/news/2023-11-gnomad-v4-0/. The RVIS dataset can be found at https://doi.org/10.1371/journal.pgen.1003709. The s_het dataset can be found at https://doi.org/10.1038/s41588-024-01820-9. The EDS dataset can be found at https://doi.org/10.1016/j.ajhg.2020.01.012.
Code availability
SuShiE v0.16 software is available at https://github.com/mancusolab/sushie. The analysis code for the simulations and real-data analyses in this manuscript is available at https://github.com/mancusolab/sushie-project-codes and https://zenodo.org/records/10963034 (version 7). The twas_sim software is available at https://github.com/mancusolab/twas_sim. Instructions for the TOPMed RNA-seq harmonization pipeline are available at https://github.com/broadinstitute/gtex-pipeline/blob/master/TOPMed_RNAseq_pipeline.md. The GTEx eQTL analysis pipeline is at https://www.gtexportal.org/home/methods. The PLINK2 software is at https://www.cog-genomics.org/plink/2.0. The BCFTOOLS v1.21 software is at https://samtools.github.io/bcftools/bcftools.html. The FUSION pipeline is at http://gusevlab.org/projects/fusion/. The LiftOver software is at https://genome.ucsc.edu/cgi-bin/hgLiftOver. The WashU Epigenome Browser is at https://epigenomegateway.wustl.edu/. The Plotgardener v1.8.3 software is at https://github.com/PhanstielLab/plotgardener/. AnnoQ is at http://annoq.org. The SuSiEx v1.1.2 software is at https://github.com/getian107/SuSiEx. The MESuSiE software is at https://github.com/borangao/MESuSiE. The XMAP v1.0.1 software is at https://github.com/YangLabHKUST/XMAP.
References
- 1. Cheung VG et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369 (2005).
- 2. Aguet F et al. Molecular quantitative trait loci. Nat. Rev. Methods Primers 3, 1–22 (2023).
- 3. Wang G, Sarkar A, Carbonetto P & Stephens M A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol 82, 1273–1300 (2020).
- 4. Wen X, Luca F & Pique-Regi R Cross-population joint analysis of eQTLs: fine mapping and functional annotation. PLoS Genet. 11, e1005176 (2015).
- 5. Kichaev G & Pasaniuc B Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies. Am. J. Hum. Genet 97, 260–271 (2015).
- 6. LaPierre N et al. Identifying causal variants by fine mapping across multiple studies. PLoS Genet. 17, e1009733 (2021).
- 7. Yuan K et al. Fine-mapping across diverse ancestries drives the discovery of putative causal variants underlying human complex traits and diseases. Nat. Genet (2024) doi: 10.1038/s41588-024-01870-z.
- 8. Gao B & Zhou X MESuSiE enables scalable and powerful multi-ancestry fine-mapping of causal variants in genome-wide association studies. Nat. Genet (2024) doi: 10.1038/s41588-023-01604-7.
- 9. Cai M et al. XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias. Nat. Commun 14, 6870 (2023).
- 10. Gusev A et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet 48, 245–252 (2016).
- 11. Gamazon ER et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet 47, 1091–1098 (2015).
- 12. Bild DE et al. Ethnic differences in coronary calcification: the Multi-Ethnic Study of Atherosclerosis (MESA). Circulation 111, 1313–1320 (2005).
- 13. Taliun D et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
- 14. Shang L et al. Genetic Architecture of Gene Expression in European and African Americans: An eQTL Mapping Study in GENOA. Am. J. Hum. Genet 106, 496–512 (2020).
- 15. All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature (2024) doi: 10.1038/s41586-023-06957-x.
- 16. Zou Y, Carbonetto P, Wang G & Stephens M Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genet. 18, e1010299 (2022).
- 17. Tibshirani R Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Series B Stat. Methodol 58, 267–288 (1996).
- 18. Zou H & Hastie T Regularization and Variable Selection Via the Elastic Net. J. R. Stat. Soc. Series B Stat. Methodol 67, 301–320 (2005).
- 19. Clark SA & van der Werf J Genomic best linear unbiased prediction (gBLUP) for the estimation of genomic breeding values. Methods Mol. Biol 1019, 321–330 (2013).
- 20. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
- 21. Võsa U et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet 53, 1300–1310 (2021).
- 22. Mostafavi H, Spence JP, Naqvi S & Pritchard JK Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet 55, 1866–1875 (2023).
- 23. ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
- 24. Chiou J et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 594, 398–402 (2021).
- 25. Satpathy AT et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol 37, 925–936 (2019).
- 26. Lappalainen T et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
- 27. Sun BB et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
- 28. Cai J et al. URGCP promotes non-small cell lung cancer invasiveness by activating the NF-κB-MMP-9 pathway. Oncotarget 6, 36489–36504 (2015).
- 29. Kachuri L et al. Gene expression in African Americans, Puerto Ricans and Mexican Americans reveals ancestry-specific patterns of genetic architecture. Nat. Genet 55, 952–963 (2023).
- 30. Shi H et al. Localizing Components of Shared Transethnic Genetic Architecture of Complex Traits from GWAS Summary Data. Am. J. Hum. Genet 106, 805–817 (2020).
- 31. Shi H et al. Population-specific causal disease effect sizes in functionally important regions impacted by selection. Nat. Commun 12, 1098 (2021).
- 32. Hou K et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet 55, 549–558 (2023).
- 33. Saito S et al. Gene-specific somatic epigenetic mosaicism of FDFT1 underlies a non-hereditary localized form of porokeratosis. Am. J. Hum. Genet 111, 896–912 (2024).
- 34. Taylor DJ et al. Sources of gene expression variation in a globally diverse human cohort. Nature 632, 122–130 (2024).
- 35. Brown BC, Asian Genetic Epidemiology Network Type 2 Diabetes Consortium, Ye CJ, Price AL & Zaitlen N Transethnic Genetic-Correlation Estimates from Summary Statistics. Am. J. Hum. Genet 99, 76–88 (2016).
- 36. Lek M et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
- 37. Karczewski KJ et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
- 38. Wang X & Goldstein DB Enhancer Domains Predict Gene Pathogenicity and Inform Gene Discovery in Complex Disease. Am. J. Hum. Genet 106, 215–233 (2020).
- 39. Petrovski S, Wang Q, Heinzen EL, Allen AS & Goldstein DB Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).
- 40. Zeng T, Spence JP, Mostafavi H & Pritchard JK Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat. Genet 56, 1632–1643 (2024).
- 41. Lu Z et al. Multi-ancestry fine-mapping improves precision to identify causal genes in transcriptome-wide association studies. Am. J. Hum. Genet 109, 1388–1404 (2022).
- 42. Tapia AL et al. A large-scale transcriptome-wide association study (TWAS) of 10 blood cell phenotypes reveals complexities of TWAS fine-mapping. Genet. Epidemiol 46, 3–16 (2022).
- 43. Wen J et al. Transcriptome-Wide Association Study of Blood Cell Traits in African Ancestry and Hispanic/Latino Populations. Genes 12, (2021).
- 44. Rowland B et al. Transcriptome-wide association study in UK Biobank Europeans identifies associations with blood cell traits. Hum. Mol. Genet 31, 2333–2347 (2022).
- 45. Chen M-H et al. Trans-ethnic and Ancestry-Specific Blood-Cell Genetics in 746,667 Individuals from 5 Global Populations. Cell 182, 1198–1213.e14 (2020).
- 46. Ding Y et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature 618, 774–781 (2023).
- 47. Mester R et al. Impact of cross-ancestry genetic architecture on GWASs in admixed populations. Am. J. Hum. Genet 110, 927–939 (2023).
- 48. Zhang J & Stram DO The role of local ancestry adjustment in association studies using admixed populations. Genet. Epidemiol 38, 502–515 (2014).
- 49. Wang L et al. Integrating single cell expression quantitative trait loci summary statistics to understand complex trait risk genes. Nat. Commun 15, 4260 (2024).
- 50. Bhattacharya A et al. Best practices for multi-ancestry, meta-analytic transcriptome-wide association studies: Lessons from the Global Biobank Meta-analysis Initiative. Cell Genom 2, (2022).
- 51. Wainberg M et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet 51, 592–599 (2019).
- 52. Mancuso N et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nat. Genet 51, 675–682 (2019).
- 53. Blei DM, Kucukelbir A & McAuliffe JD Variational inference: A review for statisticians. J. Am. Stat. Assoc 112, 859–877 (2017).
- 54. Blei DM, Ng AY & Jordan MI Latent Dirichlet Allocation. in Advances in Neural Information Processing Systems 14 601–608 (The MIT Press, 2002).
- 55. Frankish A et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
- 56. Frankish A et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).
- 57. Purcell S et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am. J. Hum. Genet 81, 559–575 (2007) doi: 10.1086/519795.
- 58. Chang CC et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, (2015) doi: 10.1186/s13742-015-0047-8.
- 59. Li H et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009) doi: 10.1093/bioinformatics/btp352.
- 60. Danecek P et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011) doi: 10.1093/bioinformatics/btr330.
- 61. Finucane HK et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet 47, 1228–1235 (2015) doi: 10.1038/ng.3404.
- 62. Hujoel MLA, Gazal S, Hormozdiari F, van de Geijn B & Price AL Disease Heritability Enrichment of Regulatory Elements Is Concentrated in Elements with Ancient Sequence Age and Conserved Function across Species. Am. J. Hum. Genet 104, 611–624 (2019).
- 63. Wen X Molecular QTL discovery incorporating genomic annotations using Bayesian false discovery rate control. Ann. Appl. Stat 10, 1619–1638 (2016).
- 64. Liu Z et al. Annotation Query (AnnoQ): an integrated and interactive platform for large-scale genetic variant annotation. Nucleic Acids Res. 50, W57–W65 (2022).
- 65. Li D et al. WashU Epigenome Browser update 2022. Nucleic Acids Res. 50, W774–W781 (2022).
- 66. Kramer NE et al. Plotgardener: cultivating precise multi-panel figures in R. Bioinformatics 38, 2042–2045 (2022).
- 67. Hudson RR, Slatkin M & Maddison WP Estimation of levels of gene flow from DNA sequence data. Genetics 132, 583–589 (1992).
- 68. Bhatia G, Patterson N, Sankararaman S & Price AL Estimating and interpreting FST: the impact of rare variants. Genome Res. 23, 1514–1521 (2013).
