Abstract
Molecular quantitative trait locus (QTL) studies seek to identify the causal variants affecting molecular traits like DNA methylation and histone modifications. However, existing fine-mapping tools are not well suited to molecular traits, and so molecular QTL analyses typically proceed by considering each variant and each molecular measurement independently, ignoring the LD among variants and the spatial correlation in effects between nearby sites. This severely limits accuracy in identifying causal variants and quantifying their molecular trait effects. Here, we introduce fSuSiE (“functional Sum of Single Effects”), a fine-mapping method that addresses these challenges by explicitly modeling the spatial structure of genetic effects on molecular traits. fSuSiE integrates wavelet-based functional regression with the computationally efficient “Sum of Single Effects” framework to simultaneously fine-map causal variants and identify the molecular traits they affect. In simulations, fSuSiE identified causal variants and affected CpGs more accurately than methods that ignore spatial structure. In applications to DNA methylation and histone acetylation (H3K9ac) data from the ROSMAP study of the dorsolateral prefrontal cortex, fSuSiE achieved dramatically higher resolution than existing methods (e.g., identifying 6,355 single-variant methylation credible sets compared to only 328 from an existing approach). Applied to Alzheimer's disease (AD) risk loci, fSuSiE identified potential causal variants colocalizing with AD GWAS signals for established genes, including CASS4 and CR1/CR2, suggesting specific potential regulatory mechanisms underlying these AD risk loci.
Introduction
Molecular quantitative trait locus (QTL) mapping [1] aims to identify genetic variants associated with a wide variety of molecular traits measured across the genome, including RNA expression [2, 3], protein expression [4, 5], DNA methylation [6–11], chromatin accessibility [12, 13], transcription-factor binding [14, 15] and histone modification [16, 17]. A critical next step, known as fine-mapping, is to pinpoint the individual causal variants, determine how many act independently, and estimate their effects on the (typically large number of) molecular traits [18–30]. While sophisticated fine-mapping methods exist to handle these challenges for single traits, they are ill-suited to molecular traits, which are inherently high-dimensional. Consequently, the most common approach in practice remains large-scale univariate testing, where each SNP is tested for association with a single molecular trait one at a time. This approach identifies many trait associations with non-causal SNPs due to linkage disequilibrium (LD) with causal SNPs. It also fails to leverage the fact that causal SNPs may often affect multiple nearby molecular traits in a coordinated way (e.g., methylation QTLs frequently affect multiple CpGs within regions spanning several kb [7, 10, 31]) and this limits power to detect such effects. These limitations of existing analysis tools greatly hinder progress in understanding the way that genetic variants impact molecular traits. Furthermore, since most genetic effects on complex disease risk are likely mediated through effects on molecular traits, these limitations of existing analysis tools also hinder progress in understanding the genetics of complex disease.
To address these limitations, we introduce functional SuSiE (fSuSiE), a new method for fine-mapping spatially structured molecular traits. fSuSiE models the effects of each putative causal SNP as a function that varies along the genome so as to capture the coordinated changes across multiple molecular traits, drawing on ideas from functional data analysis [32–46]. It overcomes the high computational burden of fine-mapping high-dimensional molecular traits by building on the “Sum of Single Effects” (SuSiE) model [23, 24], which decomposes the complex multi-SNP computations into a series of simpler single-SNP computations. The resulting method is computationally practical for fine-mapping molecular measurements made across moderately large genetic regions containing hundreds or thousands of trait measurements and thousands of genetic variants.
We validate fSuSiE's performance using simulated methylation QTL datasets, where it simultaneously identifies causal variants and affected CpGs more accurately than methods ignoring spatial structure. To demonstrate its potential for biological discovery, we applied fSuSiE genome-wide to DNA methylation (DNAm) and H3K9ac data [47] from the ROSMAP study [48], analyzing over 400,000 CpG sites and 90,000 histone peaks. Applied to Alzheimer's disease risk loci, fSuSiE pinpointed putative causal variants that colocalized with AD GWAS signals for established genes including CASS4 and CR1/CR2, suggesting biologically coherent regulatory mechanisms underlying genetic risk. While we showcase fSuSiE using DNAm and H3K9ac, it is widely applicable to any trait that varies along the genome, such as other histone marks, chromatin accessibility, expression level variation among exons and introns, and other traits measured by high-throughput sequencing technologies.
Results
Overview of fSuSiE
fSuSiE takes as input an $N \times T$ matrix, $Y$, containing molecular trait data on $N$ samples at $T$ genomic locations, and an $N \times J$ matrix, $X$, of genotypes for the same $N$ samples at $J$ genetic variants (typically SNPs). It then attempts to identify which SNPs affect which molecular traits by fitting a multivariate linear model:
$$Y = XB + E \tag{1}$$
where $B$ is a $J \times T$ matrix of effects whose element $b_{jt}$ represents the effect of SNP $j$ on molecular trait $t$, and $E$ is an $N \times T$ matrix of residuals.
Like most fine-mapping methods, fSuSiE assumes that only a few genetic variants affect the trait: that is, most rows of $B$ are zero. It encodes this assumption by modelling $B$ using a “sum of single effects” (SuSiE) model [23, 24], $B = \sum_{l=1}^{L} B^{(l)}$, where each matrix $B^{(l)}$ is assumed to capture the effect of a single SNP, and so has exactly one non-zero row. (One or more $B^{(l)}$ may be entirely zero if the data support fewer than $L$ effects.) fSuSiE uses a wavelet-based model [36, 42, 49, 50] for the effects of each causal SNP, that is, for the non-zero row in each $B^{(l)}$. The wavelet model encodes the assumption that the effects on molecular traits are spatially correlated along the genome. In practice, this is achieved by applying a wavelet transform to (1), then applying sparse prior distributions to the wavelet-transformed effects to impose shrinkage. (The wavelet-transformed molecular trait measurements and effects are shown in Fig. 1.) This induces spatial correlation on the original effects, $B$. We implemented two different priors: a simple “independent shrinkage” (IS) prior that applies the same prior to all wavelet-transformed effects; and a more flexible “shrinkage-per-scale” (SPS) prior that uses a different prior for each wavelet scale. The more flexible SPS prior is expected to perform better, but at an increased computational cost.
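To give some intuition for why sparsity in the wavelet domain corresponds to spatially structured effects, the following minimal sketch applies an orthonormal Haar transform to a hypothetical block-shaped effect over 8 locations. The choice of the Haar basis, the function name `haar_transform` and the example effect are illustrative assumptions; this is not the fSuSiE implementation, which may use a different wavelet family and handles unevenly spaced locations.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a sequence of length 2^k.

    Returns the overall (scaling) coefficient followed by the detail
    coefficients, ordered from the coarsest to the finest scale.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details  # coarser scales are placed in front
    return approx + details

# Hypothetical effect of one SNP: a constant shift at locations 4..7.
effect = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
coefs = haar_transform(effect)
nonzero = [c for c in coefs if abs(c) > 1e-12]
print(len(nonzero), "of", len(coefs), "wavelet coefficients are non-zero")
```

In this toy case the eight-location effect is captured by only two wavelet coefficients, so a sparse prior on the coefficients favors effects that are shared across neighboring locations.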
Fig. 1. Overview of fine-mapping of molecular traits using fSuSiE.
The inputs to fSuSiE are an $N \times T$ matrix, $Y$, containing $N$ samples of the molecular trait measured at $T$ locations (a), and an $N \times J$ genotype matrix, $X$ (b). First, the wavelet transform is applied to $Y$ (c), resulting in a matrix of wavelet coefficients (WCs) (d). The fSuSiE model is fit to the WCs and $X$ (e); this model fitting includes estimating the matrix of effects of the SNPs on the WCs (f) and fitting the adaptive shrinkage priors for the effects (g). Key fSuSiE outputs include: credible sets (h), each of which is intended to capture a distinct causal SNP; a posterior inclusion probability (PIP) for each SNP giving the probability that the SNP is causal for at least one trait (h); and posterior estimates of the SNP effects on $Y$, credible bands, and estimates of affected locations (i). These quantities are defined in the Online Methods.
The key analysis questions answered by fSuSiE are: (i) which SNPs causally affect the observed molecular phenotypes? and (ii) which molecular phenotypes are affected by each causal SNP? In terms of the model above, question (i) asks which rows of the effect matrix $B$ (or rows of the wavelet-transformed effect matrix) are non-zero, whereas question (ii) relates to the values of the effect estimates in these non-zero rows. To answer (i), fSuSiE outputs a posterior inclusion probability (PIP) for each SNP and a “credible set” (CS) [20, 23] for each non-zero single-effect matrix $B^{(l)}$. The PIP for SNP $j$ is the probability that the $j$th row of $B$ is non-zero. The CS for each $B^{(l)}$ is a subset of (highly correlated) SNPs that has high probability (e.g., >0.95) of containing the non-zero row. Within each CS, we refer to the SNP with the highest PIP as the “sentinel SNP”. To answer (ii), fSuSiE outputs, for each sentinel SNP, an estimate of its effect on each molecular trait, along with an assessment of its uncertainty: a pointwise credible band that is designed to have high probability (e.g., >0.95) of containing the effect at each location. If the credible band for a particular molecular phenotype excludes zero then we say that the CS containing that sentinel SNP is “significant” for that molecular phenotype.
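The relationship between these outputs can be sketched in a few lines. The example below assumes the standard SuSiE construction [23], in which each single effect has a vector of posterior probabilities over SNPs: a CS is the smallest set of SNPs whose probabilities reach the target coverage, and the PIP combines the single effects as one minus the probability that the SNP is excluded from all of them. The function names and the probabilities are hypothetical; the exact fSuSiE definitions are those given in the Online Methods.

```python
def credible_set(alpha, coverage=0.95):
    """Smallest set of SNP indices whose probabilities sum to >= coverage."""
    order = sorted(range(len(alpha)), key=lambda j: alpha[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += alpha[j]
        if total >= coverage:
            break
    return sorted(chosen)

def pips(alphas):
    """PIP for SNP j: 1 - prod over single effects l of (1 - alpha[l][j])."""
    n_snps = len(alphas[0])
    out = []
    for j in range(n_snps):
        prob_excluded = 1.0
        for alpha in alphas:
            prob_excluded *= 1.0 - alpha[j]
        out.append(1.0 - prob_excluded)
    return out

# Hypothetical posterior probabilities for two single effects over 5 SNPs.
alphas = [
    [0.99, 0.01, 0.00, 0.00, 0.00],  # effect 1: SNP 0 is almost certainly causal
    [0.00, 0.00, 0.55, 0.42, 0.03],  # effect 2: SNPs 2 and 3 are hard to separate
]
sets = [credible_set(a) for a in alphas]
pip = pips(alphas)
# Sentinel SNP: highest-probability SNP within each CS.
sentinels = [max(cs, key=lambda j: a[j]) for cs, a in zip(sets, alphas)]
print(sets, sentinels)
```

Here the first effect yields a single-variant CS, whereas the second yields a two-SNP CS whose sentinel is SNP 2.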
Fig. 1 summarizes the overall fSuSiE pipeline and Fig. 2 illustrates fSuSiE in a simple simulated example, and contrasts it with a standard QTL mapping approach (univariate SNP-trait association testing). This simple simulated example illustrates how fSuSiE directly addresses the key questions of which SNPs are impacting the molecular traits (D), and which traits are impacted by each SNP (E), whereas standard QTL mapping does not (B and C). Further details of the fSuSiE implementation, and guidance on applying fSuSiE to genome-scale molecular trait data, are given in the Online Methods and in the Supplementary Note.
Fig. 2. Molecular QTL mapping vs. fine-mapping with fSuSiE: an illustration.
In this simulated example, methylation levels of the samples are measured at 32 CpGs within a locus of interest (A). The methylation levels at 8 CpGs are affected by the genotypes at 3 of the 12 candidate SNPs. The effect of one of the causal SNPs is depicted in A. (The three curves show the mean methylation levels at the three SNP genotypes.) To map methylation QTLs, typically one performs an association test for each SNP-CpG pair (e.g., [10, 47, 51]). Here we have 32 × 12 = 384 association tests. (The p-values are from two-sided tests, and were not corrected for multiple testing.) Then the associations can be examined by SNP (B) or by CpG (C). The 3 causal SNPs and corresponding 8 affected CpGs are indicated by Δ or ▲. The association tests find signals corresponding to 2 of the 3 causal SNPs (and perhaps all three with a more careful multiple testing correction; here we just used the Bonferroni criterion). In C, the affected CpGs appear to cluster into two contiguous regions, but it is less clear which CpGs in the clusters are affected. In an fSuSiE fine-mapping analysis of these data (D, E), fSuSiE identifies three 95% credible sets (CSs), one for each causal SNP, and expresses uncertainty in which SNP is causal by the posterior inclusion probabilities, or “PIPs” (D). (Note that SNPs 3 and 4 are strongly correlated, so fSuSiE was unable to determine with high certainty that SNP 3 was the causal SNP.) fSuSiE also tells us which CpGs are affected by which causal SNPs (E): the 95% credible bands (depicted by the error bars) quantify uncertainty in the effects of the “sentinel SNPs” (defined as the SNP with the highest PIP in each CS). Locations whose credible bands do not include zero are the predicted affected CpGs (•) for each CS. Code implementing this example is available at https://stephenslab.github.io/fsusieR/articles/methyl_demo.html.
Performance assessment in simulated methylation data sets
To validate fSuSiE we used simulated molecular QTL data sets where we could assess accuracy of the estimates by comparing to the true values. We assessed fSuSiE's ability to perform the two key tasks highlighted above: (i) correctly identify the causal SNPs that affect the molecular trait at one or more locations; and (ii) correctly identify the molecular locations affected by the causal SNPs.
We simulated data sets containing both SNP genotypes and CpG methylation levels, in which one or more SNPs affected the methylation levels at one or more CpGs. We simulated genotypes with realistic patterns of linkage disequilibrium (LD) using sim1000G [52], with genotype data from the 1000 Genomes Phase 3 whole-genome sequencing [53] as the input source. The simulations varied in size of genomic region (median: ~1 Mb), number of SNPs (1,500–4,000 SNPs), and LD patterns. Consistent with real patterns of LD, simulated SNPs were often very strongly correlated with each other, and sometimes perfectly correlated, and therefore one should not expect 100% accuracy in identifying the causal SNPs, even in the simulation settings that were most favorable to fSuSiE.
We simulated realistic whole-genome bisulfite sequencing (WGBS) methylation levels using WGBSSuite [54], with different types of spatial structure in the methylation changes that might be motivated by different epigenetic mechanisms: constant changes in methylation across contiguous windows, with abrupt changes between windows (“WGBS block”); and changes that varied smoothly within the locus (“WGBS decay”). To benchmark fSuSiE in the “ideal” setting in which the data conformed to fSuSiE's modeling assumptions, we performed a third set of simulations in which the methylation levels were simulated from the wavelet model assumed by fSuSiE (“wavelet simulations”). We varied the number of CpGs, the number of SNPs, the number of causal SNPs, and the total amount of variation explained by the SNPs. Detailed descriptions of the simulation procedures are given in the Online Methods.
As far as we are aware, fSuSiE is the first available tool to perform statistical fine-mapping of functional traits. Among existing approaches, perhaps the most obvious fine-mapping competitors are approaches that first reduce the high-dimensional trait to a single dimension, e.g. using the top principal component (PC), and then apply a univariate fine-mapping method (e.g. [55, 56]). Here we implemented a version of this approach, “SuSiE-topPC”, by running SuSiE on the top PC computed from the column-centered molecular trait matrix. This type of approach has important limitations compared with fSuSiE. First, in real applications the top PC may primarily capture non-genetic effects such as experimental batch effects or population structure [57, 58]. Second, it only identifies causal SNPs that impact some aspect of the high-dimensional trait, without identifying which specific molecular locations (CpGs) are affected by these SNPs. Nonetheless, this existing approach may work well for detecting causal SNPs that have the biggest effect on a locus, and serves as a helpful benchmark against which to compare the ability of fSuSiE to identify causal SNPs.
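The dimension-reduction step of such an approach can be sketched as follows. This example only computes the top-PC scores that would then be passed to a univariate fine-mapping method; the fine-mapping step itself is not shown. Power iteration is used here purely to keep the sketch dependency-free (a standard SVD routine would normally be used instead), and the function names, toy genotypes, effect pattern and noise level are all hypothetical.

```python
import math
import random

def top_pc_scores(Y, n_iter=200):
    """Scores of each sample on the top PC of the column-centered matrix Y."""
    n, t = len(Y), len(Y[0])
    means = [sum(row[k] for row in Y) / n for k in range(t)]
    Yc = [[row[k] - means[k] for k in range(t)] for row in Y]
    v = [1.0 / math.sqrt(t)] * t  # starting direction for power iteration
    for _ in range(n_iter):
        scores = [sum(r[k] * v[k] for k in range(t)) for r in Yc]            # Yc v
        w = [sum(Yc[i][k] * scores[i] for i in range(n)) for k in range(t)]  # Yc' Yc v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[k] * v[k] for k in range(t)) for r in Yc]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data: one SNP shifts the trait at the last two of four locations.
rng = random.Random(1)
genotype = [0, 1, 2, 0, 1, 2, 1, 0]
effect = [0.0, 0.0, 1.0, 1.0]
Y = [[g * e + rng.gauss(0.0, 0.05) for e in effect] for g in genotype]

scores = top_pc_scores(Y)   # one value per sample: the reduced, univariate trait
r = pearson(scores, genotype)
print(round(abs(r), 3))
```

In this toy setting the genetic effect dominates the variation, so the top PC tracks the genotype closely; as noted above, in real data the top PC may instead capture non-genetic variation.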
Both fSuSiE and SuSiE-topPC produce posterior inclusion probabilities (PIPs) that assess the probability of each SNP being a causal SNP, and 95% credible sets (CSs) that are designed to contain (with probability > 0.95) at least one causal SNP. In fine-mapping applications, due to high LD, it may often be impossible to confidently identify the causal SNP, but it may nonetheless be possible to identify a small set of SNPs, in high LD with one another (which we call high “purity”), that likely contains a causal SNP. That is, while no single SNP may have a PIP near 1, there may be a small, high-purity, 95% CS. For this reason CSs are perhaps the most relevant fine-mapping output; however, comparing PIPs is also informative, and so we assessed both PIPs and CSs here.
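As an illustration of purity, the sketch below assumes the definition used by SuSiE [23], namely the smallest absolute pairwise correlation among the SNPs in a CS. The function names and the toy genotypes are hypothetical, and the precise definition used in this work is the one given in the Online Methods.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def purity(genotypes, cs):
    """Minimum absolute pairwise correlation among the SNPs indexed by cs."""
    if len(cs) < 2:
        return 1.0  # a single-variant CS is treated as perfectly pure
    return min(abs(pearson(genotypes[i], genotypes[j]))
               for i, j in combinations(cs, 2))

# Toy genotypes (one list per SNP, one entry per individual).
genotypes = [
    [0, 1, 2, 0, 1, 2, 0, 1],  # SNP 0
    [0, 1, 2, 0, 1, 2, 0, 2],  # SNP 1: differs from SNP 0 in one individual
    [2, 0, 1, 1, 0, 2, 1, 0],  # SNP 2: nearly uncorrelated with SNP 0
]
print(round(purity(genotypes, [0, 1]), 3))     # high-purity CS
print(round(purity(genotypes, [0, 1, 2]), 3))  # low-purity CS
```

A CS made up of SNPs 0 and 1 has high purity, whereas adding the weakly correlated SNP 2 lowers the purity sharply.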
Comparing PIPs (Fig. 3A), both variants of fSuSiE (IS prior and SPS prior) achieved higher power than SuSiE-topPC at the same false discovery rate. All methods produced well calibrated PIPs (Supplementary Fig. 2). Comparing CSs (Fig. 3B, Supplementary Fig. 1), fSuSiE consistently achieved better power, with smaller and higher purity CSs, while also maintaining coverage slightly closer to the target value (95%). As expected, the performance of fSuSiE also tended to improve as the density of the molecular trait measurements increased (i.e., more CpGs), and as the SNPs produced larger changes in methylation (Supplementary Figures 3–8). Interestingly, the performance gains over SuSiE-topPC were greatest in the WGBS simulations that did not conform to fSuSiE's modeling assumptions.
Fig. 3. Performance assessment in simulated data sets.
Performance was summarized in two tasks: identification of the causal SNPs (A, B) and identification of the affected molecular trait locations (C). In A–C, we considered three simulation scenarios. In each of the three simulation scenarios, methylation data sets were simulated with samples, locations (CpGs), SNPs, among which 1–20 were causal, and 10% total variance explained. (In C, there was always 1 causal SNP.) Panels A, C show power vs. FDR (i.e., precision-recall with a flipped x-axis) in identifying causal SNPs (A) and affected CpGs (C). In each simulation setting, FDR and power were calculated by varying a measure threshold from 0 to 1. In A, the measure was PIP for all methods; in C, it was an fSuSiE-specific measure (defined in the Online Methods) for fSuSiE, and the p-value for the SNP-CpG tests. Power and FDR were calculated as $\mathrm{power} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and $\mathrm{FDR} = \mathrm{FP}/(\mathrm{TP} + \mathrm{FP})$, where FP, TP, FN, TN denote, respectively, the number of false positives, true positives, false negatives and true negatives. Note that the yellow and red curves (fSuSiE, IS and SPS priors) closely overlap in some plots and therefore are not always visible. Panel B evaluates the 95% CSs returned by fSuSiE and SuSiE-topPC: power, the proportion of the true causal SNPs included in at least one CS; coverage, the proportion of CSs containing a true causal SNP; and CS size, the number of SNPs in a CS. The dotted horizontal line shows the target coverage (95%). Error bars capture the range of values in 95% of the simulations. Panel D summarizes the running times in all simulated data sets containing exactly 1,000 SNPs. The box plot whiskers depict 1.5× the interquartile range, the box bounds represent the upper and lower quartiles (25th and 75th percentiles), the center line represents the median (50th percentile), and points represent outliers. See Supplementary Figures 1–9 for additional results.
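The power-vs-FDR curves described in this caption can be computed along the following lines. This is a minimal sketch using the standard definitions power = TP/(TP + FN) and FDR = FP/(TP + FP); the function name, the example PIPs and the causal labels are hypothetical and are not taken from the simulations.

```python
def power_fdr(scores, is_causal, threshold):
    """Power and FDR when every SNP with score >= threshold is called causal."""
    called = [s >= threshold for s in scores]
    tp = sum(c and t for c, t in zip(called, is_causal))
    fp = sum(c and not t for c, t in zip(called, is_causal))
    fn = sum((not c) and t for c, t in zip(called, is_causal))
    power = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return power, fdr

# Hypothetical PIPs for six SNPs, of which the first two are truly causal.
pip = [0.99, 0.60, 0.40, 0.05, 0.02, 0.90]
truth = [True, True, False, False, False, False]

# Sweeping the threshold traces out one power-vs-FDR curve.
for threshold in (0.95, 0.5, 0.1):
    power, fdr = power_fdr(pip, truth, threshold)
    print(threshold, power, round(fdr, 3))
```

Lowering the threshold increases power but admits more false positives, which is the trade-off displayed in panels A and C.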
One particularly striking result is the increase in the number of single-variant CSs returned by fSuSiE vs. SuSiE-topPC (histograms in Fig. 3B). In most cases (5,502 out of about 5,730), these single-variant CSs contained the causal SNP, and in many cases single-variant CSs occurred even when there were other SNPs in very high LD with the causal SNP (including in cases where the other SNP differed from the causal SNP in only a single individual). To verify this striking result, we performed additional simulations, which confirmed that, given sufficiently strong signals, fSuSiE can indeed accurately produce a single-variant CS even in the presence of very strong LD (see Online Methods, “Additional simulations to assess single-variant CSs” and Supplementary Fig. 15). These results highlight an important general benefit of joint analysis of multiple traits: when a SNP affects multiple traits, joint analysis not only increases power to detect effects, but also substantially improves the fine-mapping resolution by reducing the typical sizes of CSs (see also [59]).
As noted above, a limitation of SuSiE-topPC (and related methods) is that it does not identify which locations (CpGs) are affected by the causal SNPs identified. One way to address this would be to follow SuSiE-topPC with simple univariate SNP-CpG tests to determine which CpGs are significantly associated with the causal SNPs identified. However, a benefit of functional regression methods, like fSuSiE, is that they exploit spatial correlations in effects to increase power to detect such associations. To demonstrate this, we performed simulations with a single (causal) SNP affecting a subset of CpGs, and applied both fSuSiE and standard SNP-CpG tests to detect the associated CpGs. The results (Fig. 3C, Supplementary Fig. 9) showed a consistent gain in power for fSuSiE, with the gains being greatest in the more challenging WGBS decay simulations. (We did not use the wavelet simulations here because in these simulations the causal SNPs produced changes, mostly small, to all CpGs.) This illustrates the ability of fSuSiE to combine borderline association signals across spatially-proximal affected CpGs to more accurately identify these signals.
Overall, the two variants of fSuSiE—one with the simpler IS prior, the other with a more flexible SPS prior—performed similarly, with a slight power advantage to the SPS prior. The tradeoff is that the SPS prior also increases running time (Fig. 3D, Supplementary Fig. 10). Thus, the choice of prior depends on weighing the benefits of additional discoveries against the additional computation. While both variants of fSuSiE take much longer to run than SuSiE-topPC, they remain quite practical for data sets of the scale we expect to analyze in molecular QTL studies (thousands of SNPs and thousands of molecular trait measurements).
Genome-wide fine-mapping of DNA methylation and histone acetylation QTL
To demonstrate the potential of fSuSiE to generate insights into genetic regulation of molecular trait profiles, and to aid in interpretation of disease-associated genetic variants, we applied fSuSiE to DNA methylation (DNAm) and histone acetylation (H3K9ac) data from the Religious Orders Study/Memory and Aging Project (ROSMAP) with harmonized genotypes from the Alzheimer's Disease Sequencing Project [48, 60, 61]. Array-based DNA methylation profiles of donors and H3K9ac ChIP-seq profiles of donors were obtained from the dorsolateral prefrontal cortex (DLPFC), a brain tissue affected by AD. The molecular trait data for fine-mapping with fSuSiE were the methylation proportions (“β values”) at 413,433 CpGs and the read counts at 92,401 H3K9ac peaks. (H3K9ac peak calling and other data processing steps are detailed in the Online Methods.) We used fSuSiE to fine-map SNPs affecting methylation (mSNPs) and H3K9ac (haSNPs), as described below. We also performed SuSiE fine-mapping of gene expression and protein abundance (from the same DLPFC samples) to obtain a more complete picture of the genetic regulation in this brain region.
To apply fSuSiE genome-wide, we partitioned the genome into large genomic regions, and then (after filtering out regions containing too few CpGs or peaks) applied fSuSiE to each region separately, using the SNP genotype data and molecular trait data at all SNPs and locations within the region. To minimize regulatory interactions between regions, we defined the regions based on topologically associating domains (TADs) [62]. For brevity we refer to these regions as TADs, although they also contain boundaries between TADs; see Online Methods. In total, we decomposed the genome (autosomal chromosomes only) into 1,329 TADs ranging in size from 2 to 35 Mb (median: 4 Mb), each containing 18–10,035 CpGs (median: 450 CpGs) and 2–827 H3K9ac peaks (median: 116 peaks). This genome-wide approach to fine-mapping implicitly assumes that each region contains at least one molecular QTL, which we empirically found to be justified for the molecular traits analyzed (see results below).
To illustrate the inferences produced by fSuSiE, Fig. 4 presents a detailed example of H3K9ac fine-mapping in a 3.7-Mb TAD on chromosome 1. (See also Supplementary Fig. 11 for a detailed methylation example.) We used fSuSiE to analyze the genotypes at 16,642 SNPs and H3K9ac measurements at 269 peaks within this TAD, for individuals. fSuSiE produced 4 CSs (Fig. 4A) corresponding to 4 independent signals of genetic effects on histone acetylation levels at one or more peaks (Fig. 4B). One SNP (rs12757179 in CS 1) was identified as a haSNP with high confidence (PIP = 0.99). For the other haSNPs, fSuSiE was less certain about which SNP was causal; this uncertainty is represented by CSs containing several SNPs each. Some of the haSNPs produce coordinated changes in histone acetylation across several sites, where a single variant simultaneously modulates histone acetylation levels at multiple nearby peaks in a spatially-correlated pattern; for example, the CS1 haSNP (rs12757179) affects histone acetylation (in both directions) at 17 H3K9ac peaks (that is, their 95% credible bands did not include zero). Interestingly, all inferred changes occur at peaks near an haSNP. (The CS3 haSNP produces the longest-range changes, but they remain within 200 kb of the SNP.) We also observed this tendency genome-wide in both methylation and H3K9ac (Fig. 5B). Since the fSuSiE model does not take location of SNPs into account, the fact that it identified mostly short-range effects provides some independent support that these effects are likely real rather than false positives.
Fig. 4. Example fSuSiE fine-mapping of H3K9ac.
Panel A shows the posterior inclusion probabilities (PIPs) for all SNPs analyzed in the TAD (chromosome 1, 205.1–208.8 Mb, 16,642 SNPs, 269 H3K9ac peaks). The PIP is an estimate of the probability that the SNP affects H3K9ac levels at one or more peaks within the TAD. The colors indicate the different haSNPs, and the uncertainty in which SNP is the haSNP is represented by the 95% credible set (CS): each SNP in the CS is a candidate haSNP, with probability of being causal given by the PIP. CS size is the number of SNPs in the CS; sentinel SNP is the SNP with the highest PIP in the CS. Panel B shows the effect of each haSNP on H3K9ac levels at each of the peaks analyzed in the TAD. (When the CS contains more than one SNP, the effect of the sentinel SNP is shown.) Since the effects were estimated at different scales in this analysis, to show the effect estimates in a single plot, we divided the effects by the credible band sizes. Base-pair positions are based on Genome Reference Consortium human genome assembly 38 (hg38). The results within the blue-shaded region are examined in more detail in Fig. 6.
Fig. 5. Fine-mapping of mSNPs and haSNPs in DLPFC.
Panel A compares: the number of CSs returned by SuSiE-topPC, fSuSiE (for H3K9ac and DNA methylation) and SuSiE (for RNA and protein expression, also in DLPFC; and 416, respectively); and the fine-mapping resolution (“CS size” is the number of SNPs included in the 95% CS). In Panel B, functional enrichment of the mSNPs and haSNPs is assessed by distance between fine-mapped SNP and the nearest affected molecular location (for single-variant CSs only), and by distance to the nearest gene transcription start site (TSS). Panel C compares the number of affected molecular locations per SNP from fSuSiE and the association tests. Notes: in C, a small fraction of SNPs and CSs (<2%) with a larger number of affected locations were not included in the histograms; in the distance-to-TSS plots, the SuSiE-topPC and fSuSiE SNP counts were weighted by the PIPs; the TSSs are obtained from the annotated gene transcripts in the GENCODE combined Ensembl/HAVANA database [63]; all the CS statistics do not double-count CSs that overlap in different (overlapping) TADs. See Supplementary Fig. 12 for additional supporting results.
Fig. 5 summarizes the genome-wide results from applying fSuSiE to all the methylation and H3K9ac data. Consistent with the increases in power shown in the simulations, fSuSiE identified many more putative causal SNPs than SuSiE-topPC, and at a much higher resolution; for example, fSuSiE identified 1,372 single-variant CSs for H3K9ac and 6,355 single-variant CSs for DNA methylation, compared to just 56 and 328, respectively, from SuSiE-topPC (Fig. 5A; see also Supplementary Fig. 12A, B). Among the SNPs in single-variant CSs, 55 were identified as affecting both histone acetylation and methylation. Furthermore, 50 of the mSNPs and 27 of the haSNPs affect transcript levels of one or more genes. These observations suggest potential coordinated cis-regulatory effects (Supplementary Fig. 13). fSuSiE also produced a much higher proportion of single-variant CSs compared to SuSiE fine-mapping analysis of RNA and protein expression (Fig. 5A), again highlighting the benefits of joint analysis of multiple traits [59]. (Although the increase in single-variant CSs outputted by fSuSiE in the real data is consistent with results in the simulations, and consistent with the expectation that joint analysis will improve fine-mapping resolution, it is also prudent to point out a limitation of fSuSiE: fSuSiE assumes that, once genetic effects are accounted for, the molecular trait data at different locations are independent. Deviations from this assumption in real data could plausibly cause fSuSiE to be overconfident in its inferences, and potentially produce CSs that are too small.)
The many additional haSNPs and mSNPs discovered by fSuSiE tend to be located close to the inferred affected peaks/CpGs, and remain concentrated near gene TSSs (Fig. 5B). This provides evidence that the additional effects detected are mostly real signals rather than false positives. More generally, enrichments of haSNPs and mSNPs in predefined functional annotations [64, 65] were largely consistent between fSuSiE and SuSiE-topPC, suggesting that the additional discoveries of fSuSiE maintain similar functional characteristics (Supplementary Fig. 14). One exception was the stronger enrichments of SuSiE-topPC mSNPs in regions involved more directly in gene transcription (TSS, 5' UTR, promoters and coding regions). This may be because the strongest methylation signals (the top methylation PC) tend to occur near regions of direct transcriptional regulation. In comparison, fSuSiE, by looking beyond the strongest methylation signals, provides a more balanced picture of mSNPs involved in both direct and indirect gene regulation.
As a fine-mapping method, fSuSiE provides a much more parsimonious explanation for observed associations than does conventional (univariate or “SNP-by-SNP”) QTL association testing (Fig. 5C). Conventional QTL testing results in large numbers of SNPs, many in high LD with one another, being associated with a (usually) small number of peaks/CpGs; in contrast, fSuSiE identifies a much smaller number of potentially causal SNPs associated with (usually) a larger number of peaks/CpGs. In other words, fSuSiE finds many more large coordinated changes in methylation and histone acetylation (some mSNPs are estimated to affect methylation levels at more than 100 CpGs). The vast majority of mSNPs and haSNPs identified by fSuSiE are associated with methylation and histone acetylation at locations very close to the SNP; for 96% of mSNPs and 54% of haSNPs, the nearest affected location is within 100 kb (Fig. 5B).
fSuSiE reveals epigenetic regulatory mechanisms colocalizing with Alzheimer's disease risk loci
Alzheimer's disease (AD) is a complex neurodegenerative disorder with substantial genetic heritability. Recent large-scale genome-wide association studies have identified dozens of AD risk loci [66, 67], but the molecular mechanisms underlying most genetic associations remain unclear. Since GWAS variants typically affect gene regulation rather than protein coding sequences, integration of molecular QTL data with GWAS through colocalization analysis offers a powerful approach to identify regulatory mechanisms and target genes underlying disease associations. To demonstrate this, we performed colocalization analysis using COLOC (version 5) [68] between our fSuSiE fine-mapping results for methylation and histone acetylation and AD GWAS summary statistics from two large-scale meta-analyses involving 788,989 individuals (111,326 cases and 677,663 controls) [66] and 455,258 individuals (71,880 cases and 383,378 controls) [67].
This genome-wide colocalization analysis identified 14 methylation and 4 histone acetylation loci colocalizing with AD risk variants in 95% colocalization credible sets. These colocalized variants involved 231 genes (25 from haQTL and 226 from mQTL) located within 500 kb of the top colocalized variants, suggesting potential regulatory targets for AD-associated epigenetic changes. In an attempt to pinpoint the specific genes regulated by these epigenetic changes, we integrated our results with eQTL data from the same ROSMAP DLPFC samples. This multi-modal integration provides substantial advantages over eQTL analysis alone: while eQTL colocalization with GWAS can identify genes affected by disease variants, the association itself does not provide mechanistic insights into how the disease variants may operate through regulation. To illustrate how integrating epigenomic QTL data can help address this, we show results for two well-studied AD risk loci (near genes CASS4 and CR1/CR2), where the fSuSiE fine-mapping results provide mechanistic insights into disease-associated genetic variants.
CASS4 is a potential AD risk gene supported by association analyses [66, 69, 70] and colocalization of AD SNPs with CASS4 expression SNPs [70, 71]. Previous colocalization and fine-mapping analyses have suggested rs6014724 and rs17462136 in the 5' UTR of CASS4 as putative causal variants [71]. The fSuSiE methylation fine-mapping suggests another potential causal variant: rs1884913 (PIP > 0.99, MAF = 23.2%), an mSNP that lies within the CS for AD and affects methylation levels at 6 CpGs near the CASS4 TSS (Fig. 6A, Supplementary Fig. 11). Notably, rs1884913 shows moderate correlation with the previously reported variant rs6014724 (, ) but high correlation with rs17462136 (, ). This finding, supported by convergent evidence from multiple molecular QTL analyses, nominates rs1884913 as a potential causal variant and may help clarify the regulatory mechanisms underlying this disease association. It also illustrates how multivariate functional analysis of molecular phenotypes may achieve higher fine-mapping resolution than univariate analysis of complex disease phenotypes, by distinguishing between functionally distinct variants (rs17462136 and rs1884913 are separated by 11.3 kb) that remain indistinguishable using GWAS data alone due to high linkage disequilibrium.
Fig. 6. Results of methylation fine-mapping at the CASS4 AD locus (A) and H3K9ac fine-mapping at the CR1/CR2 AD locus (B).
AD (top) plots: association p-values (two-sided -test, ) from Alzheimer's disease (AD) GWAS [66, 67], and 95% CS from SuSiE fine-mapping using the AD GWAS summary statistics [23, 24]. eQTL plots: association p-values (two-sided -test, ) from eQTL analysis and 95% CS from SuSiE fine-mapping. The top SNPs for AD and gene expression are labeled. PIP plots: PIPs obtained from fSuSiE fine-mapping of methylation or histone acetylation. Effect plots: SNP effects on methylation or histone acetylation estimated by fSuSiE. CpGs or peaks unaffected by the SNP are drawn on the line. Error bars depict 95% credible bands. For each CS, the sentinel SNP (the SNP with the highest PIP in the CS) is labeled. If the CS contains more than 1 SNP, the total number of SNPs in the 95% CS is given. For the gene transcript annotations, the arrow indicates the direction of transcription (source: GENCODE Ensembl/HAVANA [63]). Read count plot (B only): the “raw” ChIP-seq peaks (average read counts) for each genotype at SNP rs679515. Average read counts less than 2 are not shown.
The CR1/CR2 AD risk locus [70, 72, 73] has also been intensely investigated [74–77], including in a recent proteome-wide association study that suggested both CR1 and CR2 as causal proteins [78]. However, none of these studies have pinpointed the causal variant. Our SuSiE and fSuSiE analyses suggest that the top association for AD, rs679515 (, MAF = 20.0%), is a solid candidate causal variant: first, rs679515 is the only SNP that overlaps CSs for both CR1 expression and CR2 expression; second, this SNP is one of the top SNPs in a CS for H3K9ac, with PIP = 0.14 (Fig. 6B, Supplementary Fig. 11). Furthermore, fSuSiE results suggest rs679515 has moderately long-range effects on histone acetylation, including at H3K9ac peaks near the TSSs of both CR1 and CR2 (Fig. 6B). Therefore, the fSuSiE H3K9ac results not only nominate a potential causal variant (rs679515), but also implicate both CR1 and CR2 via histone modification associations. This finding is further supported by multi-trait colocalization analysis through other FunGen-xQTL efforts, where rs679515 showed colocalization with CR2 in DLPFC bulk eQTLs, YOD1 in oligodendrocytes, and CR1 across multiple cortical contexts, demonstrating the complex regulatory architecture at this locus [79].
Discussion
We have introduced fSuSiE, a method that extends SuSiE to high-dimensional, spatially structured traits by borrowing ideas from functional regression. In simulated and real data sets, we demonstrated the benefits of fSuSiE for fine-mapping molecular traits; most notably, fSuSiE is the only method that can simultaneously estimate the genetic variants affecting the molecular trait and the specific changes to the molecular trait that are produced by the genetic variants. Although fSuSiE is more computationally demanding than conventional QTL association analysis, or than fine-mapping of a univariate trait (e.g., SuSiE-topPC), as we have shown here it remains practical for data sets containing hundreds of samples, thousands of molecular locations, and thousands of SNPs.
The molecular data sets we considered here consist of measurements made at predefined genomic locations (CpGs or peaks called by MACS). However, fSuSiE does not require predefined molecular features; for sequencing data (e.g., RNA-seq, ChIP-seq, WGBS), fSuSiE could instead be applied directly to the base-pair-level sequencing data (or data combined into small bins), in which case the relevant molecular features would be inferred from the fSuSiE effect estimates and credible bands. Applying fSuSiE to high-resolution sequencing data has the potential to quantify gene regulatory effects in greater detail and discover additional gene regulatory effects in regions outside the called peaks, opening new avenues for analyzing and integrating multiome data. In fact, beyond the ROSMAP methylation and histone modification analyses presented here, we have deployed fSuSiE across diverse molecular contexts within the FunGen-xQTL consortium, including integration with methylation data from multiple cohorts and analysis of single-nucleus ATAC-seq data from six major brain cell types for chromatin accessibility using recently published data on ROSMAP individuals [80].
fSuSiE is also a very flexible model framework, and includes additional modeling features we did not highlight in this paper. First, one could incorporate functional annotations to guide discovery of the causal variants and affected molecular features. Second, fSuSiE is not limited to using wavelet-based functional regression; our implementation also includes hidden Markov model (HMM) alternatives that, while computationally slower, can sometimes offer more refined results for certain data types, and other functional regression approaches might be better suited to certain types of molecular data. A key advantage of embedding our approach within the SuSiE framework is the seamless integration with numerous available downstream analytical methods based on SuSiE. This design choice makes fSuSiE readily available for data integration pipelines, as demonstrated through our colocalization analysis using SuSiE-based COLOC methods [68] in Alzheimer's disease GWAS applications. Additionally, fSuSiE-estimated effect sizes can be directly incorporated into transcriptome-wide association studies (TWAS) to combine multiple genetic regulatory effects across molecular profile regions with GWAS signals.
Our work has several limitations representing directions for future research. First, the current methods do not handle missing data: the molecular trait needs to be measured at every location in every sample. When some measurements are missing, they could potentially be imputed provided that the imputation is accurate and/or there are not too many missing measurements. Second, the current framework requires analyzing one molecular context at a time. Future work includes developing joint modeling approaches for multiple molecular contexts to fine-map shared effects without requiring separate colocalization analyses. This would be particularly valuable for multiome data where both single-nucleus RNA-seq and single-nucleus ATAC-seq data are available, enabling simultaneous analysis of gene expression and chromatin accessibility as spatially structured molecular profiles. Third, while fSuSiE can handle both normalized and count-based data, our applications focused on normalized molecular phenotypes (pre-processed methylation proportions and normalized histone modification peak intensities from the FunGen-xQTL project). Although our method and software implementation includes count-based models specifically designed for high-throughput sequencing data that could be beneficial for other applications, we did not apply these alternatives in the current work.
The fSuSiE methods are implemented in an R package, fsusieR, available on GitHub and distributed under the MIT open-source license. The R package includes detailed documentation, examples, and guidance on applying the methods in practice.
Online Methods
fSuSiE
Wavelet regression.
Consider a data set consisting of $N$ samples of a molecular trait measured at $T$ locations. The measurements are stored in an $N \times T$ matrix $\mathbf{Y}$ with matrix elements $y_{it}$, in which $i$ indexes samples and $t$ indexes locations. Let $\mathbf{x}$ be a covariate of interest (e.g., a genotype) measured in the $N$ samples.
We begin with a standard multivariate linear regression model of $\mathbf{Y}$ given $\mathbf{x}$,
$$\mathbf{Y} = \mathbf{x}\mathbf{b}^{\top} + \mathbf{E}^{\mathrm{m}}, \tag{2}$$
where $\mathbf{b}$ is a column vector of $T$ regression coefficients (“effects”) to be estimated, and $\mathbf{E}^{\mathrm{m}}$ is an $N \times T$ matrix of residuals (which we will assume are normally distributed). The “m” in the superscript here refers to the fact that the residuals are defined in “measurement space” (as opposed to “wavelet space”; see below).
Transforming to the wavelet space involves a simple linear transformation [49, 50], written as $\mathbf{D} = \mathbf{Y}\mathbf{W}$, where $\mathbf{W}$ is the $T \times T$ wavelet projection matrix. The resulting transformed data, $\mathbf{D}$, also an $N \times T$ matrix, is called the matrix of “wavelet coefficients” (WCs). The wavelet projection matrix has the property that it is an orthogonal matrix, and therefore it is invertible. This orthogonality property is very useful; for example, it makes it easy to recover statistical quantities on the measurement scale after they have been estimated on the wavelet scale. See the Supplementary Note for additional background on wavelet regression.
Applying this transformation to the multivariate linear regression (2) results in a wavelet regression model,
$$\mathbf{D} = \mathbf{x}(\mathbf{b}^{\mathrm{w}})^{\top} + \mathbf{E}^{\mathrm{w}}, \tag{3}$$
where $\mathbf{b}^{\mathrm{w}} = \mathbf{W}^{\top}\mathbf{b}$ is the vector of effects in the wavelet space, and $\mathbf{E}^{\mathrm{w}} = \mathbf{E}^{\mathrm{m}}\mathbf{W}$ is an $N \times T$ matrix of residuals in the wavelet space. We make the standard assumption that these residuals are independent and normally distributed with mean zero and variances $\sigma_t^2$. The independence of the residuals across locations (conditional on the variances $\sigma_t^2$) can be motivated by the “whitening” property of the wavelet transform [49, 50]. By contrast, assuming independence of the residuals in the measurement space would be entirely inappropriate due to the spatial structure.
Next we extend these ideas to a multiple linear regression model,
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}^{\mathrm{m}}, \tag{4}$$
where $\mathbf{X}$ is an $N \times J$ matrix containing observations of $J$ covariates of interest (e.g., genotypes measured at $J$ SNPs), and $\mathbf{B}$ is the $J \times T$ matrix of effects to be estimated. The corresponding multiple wavelet regression model is
$$\mathbf{D} = \mathbf{X}\mathbf{B}^{\mathrm{w}} + \mathbf{E}^{\mathrm{w}}, \tag{5}$$
where $\mathbf{B}^{\mathrm{w}} = \mathbf{B}\mathbf{W}$ is the $J \times T$ matrix of effects in the wavelet space. fSuSiE is developed using this model. (To simplify notation, in the equations below we drop the “w” superscript from $\mathbf{E}^{\mathrm{w}}$ when it is clear from the context that these are the residuals in the wavelet space.)
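As a minimal illustration of why orthogonality of the wavelet projection matrix is convenient, the following Python sketch builds an orthonormal Haar basis for four locations and checks that an effect transformed to the wavelet space can be mapped back exactly. The Haar wavelet, the function names and the toy effect vector are assumptions chosen for brevity; they are not part of fsusieR, which relies on the wavethresh R package.

```python
import math

def haar_rows(T):
    """Orthonormal Haar basis vectors as the rows of a T x T list of lists.

    T must be a power of two. Taking the wavelet projection matrix to be
    the transpose of this matrix, the wavelet coefficients of a row
    vector y are the inner products of y with each row.
    """
    if T == 1:
        return [[1.0]]
    half = haar_rows(T // 2)
    s = 1.0 / math.sqrt(2.0)
    # Coarser-scale vectors: each entry of the half-size basis is duplicated.
    coarse = [[s * v for v in row for _ in (0, 1)] for row in half]
    # Finest-scale vectors: a scaled (+1, -1) pair at each pair of positions.
    fine = [[s * (1 if t == 2 * k else -1 if t == 2 * k + 1 else 0)
             for t in range(T)] for k in range(T // 2)]
    return coarse + fine

def forward(y, H):
    """Map a measurement-space vector to its wavelet coefficients."""
    return [sum(h * v for h, v in zip(row, y)) for row in H]

def inverse(d, H):
    """Map wavelet coefficients back; valid because the basis is orthonormal."""
    T = len(d)
    return [sum(d[k] * H[k][t] for k in range(T)) for t in range(T)]

H = haar_rows(4)
b = [0.0, 0.0, 1.0, 1.0]      # a blocky effect in measurement space
b_w = forward(b, H)           # the same effect in wavelet space
b_back = inverse(b_w, H)      # recovered measurement-space effect
```

In this toy case the blocky effect is represented by only two nonzero wavelet coefficients, which is the kind of sparsity that the priors described below are designed to exploit.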
Multiple wavelet regression model with an intercept and additional covariates.
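In practice, we augment the multiple wavelet regression model (5) to include an intercept, and we often include additional covariates (e.g., sex, age, batch effects, genotype PCs):
$$\mathbf{D} = \mathbf{1}\boldsymbol{\mu}^{\top} + \mathbf{Z}\mathbf{A} + \mathbf{X}\mathbf{B}^{\mathrm{w}} + \mathbf{E}, \tag{6}$$
where $\mathbf{D}$, $\mathbf{X}$, $\mathbf{B}^{\mathrm{w}}$ and $\mathbf{E}$ are the wavelet coefficients, covariates of interest, wavelet-space effects and residuals from model (5), $\mathbf{1}$ is a vector of ones of length $N$, $\mathbf{Z}$ is an $N \times K$ matrix containing data about $K$ additional covariates, $\boldsymbol{\mu}$ is the (unknown) intercept, and $\mathbf{A}$ is the $K \times T$ matrix of (unknown) coefficients corresponding to the additional covariates $\mathbf{Z}$. To simplify presentation, we store the intercept in the first row of $\mathbf{A}$, and set the first column of $\mathbf{Z}$ to be all ones, so the same model can be written more simply as
$$\mathbf{D} = \mathbf{Z}\mathbf{A} + \mathbf{X}\mathbf{B}^{\mathrm{w}} + \mathbf{E}. \tag{7}$$
This model can be reduced to the previous model (5) if we remove the linear effects of $\mathbf{Z}$ on $\mathbf{X}$ and $\mathbf{D}$ by making the substitutions $\mathbf{X} \leftarrow \tilde{\mathbf{X}}$, $\mathbf{D} \leftarrow \tilde{\mathbf{D}}$, in which we define
$$\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{Z}(\mathbf{Z}^{\top}\mathbf{Z})^{-1}\mathbf{Z}^{\top}\mathbf{X}, \tag{8}$$
$$\tilde{\mathbf{D}} = \mathbf{D} - \mathbf{Z}(\mathbf{Z}^{\top}\mathbf{Z})^{-1}\mathbf{Z}^{\top}\mathbf{D}. \tag{9}$$
This is an extension of the reduction used for univariate linear regression [93]. Note that we have sometimes found it more convenient to remove the linear effects of $\mathbf{Z}$ from $\mathbf{Y}$ instead of from $\mathbf{D}$, then compute $\tilde{\mathbf{D}} = \tilde{\mathbf{Y}}\mathbf{W}$, but this is equivalent to (9) because $\mathbf{W}$ is orthogonal.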
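The covariate-removal step can be sketched for a single column as follows. This is a minimal Python illustration, not the fsusieR implementation: it orthonormalizes the covariate columns by modified Gram-Schmidt and subtracts the projections, which gives the same residual as a least-squares regression on the covariates. The function name and the toy covariate values are assumptions made for the example.

```python
def residualize(v, Z_cols):
    """Remove the linear effects of the covariate columns in Z_cols from v.

    Orthonormalizes the covariate columns (modified Gram-Schmidt) and
    subtracts the projection of v onto each, which equals the residual
    from a least-squares regression of v on the covariates.
    """
    basis = []
    for z in Z_cols:
        q = list(z)
        for u in basis:
            c = sum(a * b for a, b in zip(u, q))
            q = [a - c * b for a, b in zip(q, u)]
        norm = sum(a * a for a in q) ** 0.5
        if norm > 1e-12:                  # skip linearly dependent columns
            basis.append([a / norm for a in q])
    r = list(v)
    for u in basis:
        c = sum(a * b for a, b in zip(u, r))
        r = [a - c * b for a, b in zip(r, u)]
    return r

ones = [1.0, 1.0, 1.0, 1.0]               # intercept column
age = [60.0, 70.0, 80.0, 90.0]            # illustrative covariate
x = [0.0, 1.0, 1.0, 2.0]                  # illustrative genotype
x_resid = residualize(x, [ones, age])
```

The same function would be applied to every column of the genotype matrix and of the molecular trait (or wavelet coefficient) matrix.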
Standardization.
In practice, we scale each column of to have unit variance before computing the wavelet transform (this is often called “standardization”). This simplifies modeling because it allows us to reasonably assume that the wavelet coefficients have homoskedastic residual variances; consider that . Therefore, in fSuSiE we make the additional modeling assumption that the residual variances do not depend on location; .
Although not strictly necessary, in practice we also standardize the genotypes before running fSuSiE. This simplifies many of the underlying statistical computations in fSuSiE. Standardizing the genotypes is common practice in QTL mapping analyses, and can be justified in some settings [81] [94].
Single function regression model.
The “single function regression” (SFR) model is the functional counterpart of the “single effect regression” (SER) model [23]. The SFR model is a multiple wavelet regression model (5) with the constraint that exactly one of the covariates has a nonzero effect on the wavelet coefficients:
| (10) |
where , is a matrix with entries , denotes the multinomial distribution with sample size and multinomial probabilities , denotes the prior distribution for the effects at location (specified in greater detail below), and denotes the diagonal matrix in which the diagonal elements are given by the vector .
Defined in this way, the SFR model has the following properties: (i) is a binary vector in which exactly one of the elements is one and the rest are zeros; (ii) at most one of the rows of the effects matrix contains nonzero values; and (iii) the ’s are a priori independent across locations (again motivated by the “whitening” property of the wavelet transform).
Elements of are the prior probabilities that each covariate has a nonzero effect on the wavelet coefficients . Unless otherwise stated, the prior probabilities are the same for all covariates (SNPs); , .
fSuSiE model.
The fSuSiE model extends the SFR model to allow at most nonzero effects. It is the functional counterpart to the Sum of Single Effects (SuSiE) model [23], so we call it the “functional Sum of Single Effects” (fSuSiE) model. The fSuSiE model is the wavelet regression model (5) together with the following prior on :
| (11) |
Since each binary vector contains exactly one 1, by construction at most rows of contain nonzeros. The fSuSiE model reduces to the SFR model when . Note that this model allows for a different prior on the effects for each ; these priors are automatically adapted to the data using the algorithms described below.
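The "sum of single effects" construction can be illustrated with a short Python sketch that draws an effects matrix from a simplified version of this prior. The normal effect distribution, the uniform choice of SNP, and all names and dimensions below are assumptions made for illustration; the actual priors on the effects are the mixture priors described in the next section.

```python
import random

def draw_single_effect(J, T, rng, effect_sd=1.0):
    """Draw one J x T single-effect matrix: one nonzero row, the rest zeros."""
    j = rng.randrange(J)                   # uniform prior over the J SNPs
    B = [[0.0] * T for _ in range(J)]
    B[j] = [rng.gauss(0.0, effect_sd) for _ in range(T)]
    return B

def draw_sum_of_single_effects(J, T, L, rng):
    """Sum L single-effect matrices; at most L rows of the result are nonzero."""
    B = [[0.0] * T for _ in range(J)]
    for _ in range(L):
        B_l = draw_single_effect(J, T, rng)
        for j in range(J):
            for t in range(T):
                B[j][t] += B_l[j][t]
    return B

rng = random.Random(1)
B = draw_sum_of_single_effects(J=50, T=8, L=3, rng=rng)
nonzero_rows = [j for j, row in enumerate(B) if any(v != 0.0 for v in row)]
```

Because two single effects may select the same SNP, the number of nonzero rows can be smaller than the number of single effects, which is why the model bounds the number of effect SNPs from above.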
Scale-dependent mixture priors for fSuSiE.
Above, we did not specify the exact form of the priors and in the SFR and fSuSiE models. To define a prior, we expand on a detail that was previously hidden: the location represents a wavelet scale and location. Our convention is to use to index scales and to index locations.
Previous wavelet regression approaches have exploited the scale-dependent sparsity of the wavelet coefficients [49, 82] [95, 96]. One of these approaches used a spike-and-slab prior [95, 96]. Building on this spike-and-slab prior as well as our recent work on adaptive shrinkage methods [81] [97, 98], we extend this scale-dependent spike-and-slab prior to a scale-dependent mixture prior, which we refer to as the “shrinkage-per-scale” (SPS) prior:
| (12) |
Here, the denote the mixture weights (, ) for each single function , and the denote the variances of the mixture components. The number of mixture components, , and the component variances are fixed, user-specified quantities, whereas the mixture weights are estimated. Note that this prior depends on the scale but not on the location .
The component variances are assumed to be increasing (, ), and the first component is assumed to be a Dirac “delta” mass at zero, denoted here by a normal distribution with zero variance, . We use procedures similar to those implemented in the ashr R package [81] to automatically determine a suitable as well as the component variances .
We also consider a special case of the SPS prior in which the mixture proportions and variances do not depend on the scale, [97]; that is, . We call this the “independent shrinkage” (IS) prior. While clearly less flexible than the SPS prior, the IS prior reduces the number of parameters to estimate and reduces overall computational effort, and therefore is sometimes more convenient in practice. We empirically assess the benefits of the SPS and IS priors in simulations.
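To make the difference between the two priors concrete, the following Python sketch draws wavelet-space effects from a scale-dependent mixture of zero-mean normals. It is a simplified illustration under stated assumptions: the mixture weights, the component standard deviations (used here in place of variances), the number of coefficients per scale and the function name are all invented for the example and are not defaults of the fSuSiE software.

```python
import random

def draw_sps_effects(weights_by_scale, component_sds, n_by_scale, rng):
    """Draw wavelet-space effects from a scale-dependent mixture of normals.

    weights_by_scale[s][k] is the weight of mixture component k at scale s;
    component_sds[k] is that component's standard deviation, with
    component_sds[0] == 0 acting as the point mass ("spike") at zero.
    """
    effects = []
    for weights, n in zip(weights_by_scale, n_by_scale):
        scale_effects = []
        for _ in range(n):
            k = rng.choices(range(len(weights)), weights=weights)[0]
            sd = component_sds[k]
            scale_effects.append(rng.gauss(0.0, sd) if sd > 0 else 0.0)
        effects.append(scale_effects)
    return effects

component_sds = [0.0, 0.5, 2.0]       # spike at zero, then increasing sds
weights_sps = [[0.1, 0.3, 0.6],       # coarsest scale: mostly nonzero
               [0.5, 0.4, 0.1],
               [1.0, 0.0, 0.0]]       # finest scale: all mass on the spike
n_by_scale = [1, 2, 4]                # wavelet coefficients per scale
effects = draw_sps_effects(weights_sps, component_sds, n_by_scale,
                           random.Random(0))
# The IS prior is the special case with the same weights at every scale.
weights_is = [[0.5, 0.4, 0.1]] * 3
```

Passing `weights_is` instead of `weights_sps` to the same function gives draws from the scale-independent special case.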
Posterior statistics.
Here we define the key posterior quantities used in an fSuSiE analysis (Fig. 1). We do not explain here how these quantities are computed; these details are given in the Supplementary Note.
As we briefly explained in the Methods overview, we have three main inference aims:
Variable selection: identify the causal SNPs.
Feature annotation: identify the molecular features and locations that are affected by one or more SNPs.
Feature selection: identify the molecular features and locations that are affected by a given causal SNP.
For variable selection, we compute credible sets (CSs): a CS is defined as a subset of containing an effect SNP with high probability [20, 23]. More precisely, a level- CS is defined as a set of SNPs that is as small as possible such that it has probability at least of containing an effect SNP (a row of containing at least one non-zero). The number of CSs should reflect the number of causal SNPs, and the size of a CS should reflect the number of plausible candidate effect SNPs. We calculate CSs as described in [23].
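As a minimal sketch of the CS definition, the following Python function builds the smallest set of SNPs whose posterior probabilities for one single effect sum to at least the requested level, by adding SNPs in decreasing order of probability. The function name and the toy probabilities are illustrative assumptions; the full procedure in [23] includes further details not shown here.

```python
def credible_set(alpha, level=0.95):
    """Smallest set of SNP indices whose posterior probabilities sum to >= level.

    alpha[j] is the posterior probability that SNP j is the effect SNP for
    one single effect; the entries are assumed to sum to one.
    """
    order = sorted(range(len(alpha)), key=lambda j: alpha[j], reverse=True)
    cs, total = [], 0.0
    for j in order:
        cs.append(j)
        total += alpha[j]
        if total >= level:
            break
    return sorted(cs)

alpha = [0.01, 0.60, 0.02, 0.36, 0.01]
cs = credible_set(alpha)       # SNPs 1 and 3 together carry 0.96 of the mass
single = credible_set([0.97, 0.03])   # one SNP alone exceeds the level
```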
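To determine which SNPs within a CS are the strongest candidates for being an effect SNP, we compute a posterior inclusion probability (PIP) for each SNP. The PIP for SNP $j$ is defined as the posterior probability that at least one of the entries in the $j$th row of the effects matrix is nonzero:
$$\mathrm{PIP}_j = 1 - \prod_{l=1}^{L} (1 - \alpha_{lj}), \tag{13}$$
in which $\alpha_{lj}$ denotes the posterior probability that the $l$th single effect is nonzero for SNP $j$,
$$\alpha_{lj} = \Pr(\gamma_{lj} = 1 \mid \text{data}), \tag{14}$$
where $\gamma_{lj} = 1$ indicates that SNP $j$ is the SNP selected by the $l$th single effect.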
See Fig. 2D for an illustration of PIPs and CSs.
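The PIP calculation can be sketched in a few lines of Python, assuming the standard SuSiE combination rule in which a SNP is excluded only if every single effect excludes it. The function name and the toy inclusion probabilities are illustrative.

```python
def pip(alpha):
    """PIPs from a table of single-effect inclusion probabilities.

    alpha[l][j] is the posterior probability that single effect l is at
    SNP j. SNP j is included if at least one single effect selects it, so
    its PIP is one minus the product over l of (1 - alpha[l][j]).
    """
    J = len(alpha[0])
    out = []
    for j in range(J):
        prob_excluded = 1.0
        for row in alpha:
            prob_excluded *= 1.0 - row[j]
        out.append(1.0 - prob_excluded)
    return out

alpha = [[0.9, 0.1, 0.0],      # first single effect
         [0.0, 0.5, 0.5]]      # second single effect
pips = pip(alpha)
```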
The estimated effects of the SNPs on the molecular features in wavelet space are given by the posterior mean of . The SNP effects in measurement space are given by the posterior mean of . By elementary properties of expectations, this is simply
| (15) |
where denotes the inverse wavelet transform.
To identify affected features and locations, we compute pointwise Bayesian credible intervals [83, 84] for elements and at selected SNPs . We define “affected” as those elements in which the interval does not include zero. We refer to these intervals as “credible bands.” See Fig. 2E for an example of the posterior means and the credible bands of for all locations (CpGs) and for all sentinel SNPs . The affected locations are defined as the locations in which at least one of the credible bands for the sentinel SNPs at does not contain zero.
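The rule for calling a location "affected" can be illustrated as follows. This Python sketch assumes, purely for illustration, that the posterior at each location is approximately normal, so that a 95% pointwise interval is the posterior mean plus or minus 1.96 posterior standard deviations; the way fSuSiE actually computes credible bands is described in the Supplementary Note. The function name and numbers are invented for the example.

```python
def affected_locations(post_mean, post_sd, z=1.96):
    """Flag locations whose pointwise credible interval excludes zero.

    Assumes (for illustration only) an approximately normal posterior at
    each location, so the 95% interval is mean +/- 1.96 * sd.
    """
    flags = []
    for m, s in zip(post_mean, post_sd):
        lower, upper = m - z * s, m + z * s
        flags.append(lower > 0.0 or upper < 0.0)
    return flags

post_mean = [0.02, 0.80, 0.75, -0.40, 0.05]
post_sd = [0.10, 0.20, 0.25, 0.15, 0.10]
flags = affected_locations(post_mean, post_sd)
```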
Outline of an fSuSiE analysis.
Briefly, the minimal requirements for performing an fSuSiE analysis are:
An matrix, , containing molecular trait measurements at locations after the linear effects of selected covariates are removed. The specific steps that were taken to prepare for the molecular data sets analyzed in this paper are detailed below.
An matrix, , containing the genotype information for the SNPs to be fine-mapped after removing the linear effects of the selected covariates.
, an upper limit on , the number of single functions in the fSuSiE model. Unless specifically mentioned, we set .
The prior for the wavelet effects. In this paper, we considered two priors: the independent shrinkage (IS) prior and the shrinkage-per-scale (SPS) prior (see “Scale-dependent mixture priors for fSuSiE” for explanations). We used the IS prior unless otherwise mentioned.
An additional input is optional but recommended:
The positions (e.g., base-pair positions) corresponding to the locations . If not specified, the positions are assumed to be evenly spaced. Molecular traits are typically not measured at evenly spaced positions, so providing this information will produce more accurate results.
Beyond this, the fSuSiE software includes many other settings and tuning parameters which may be adjusted as needed.
The basic steps of an fSuSiE analysis are as follows:
Compute the wavelet coefficients. Compute the WCs, , from the molecular trait data, . For computational efficiency, is obtained using the standard discrete wavelet transform (DWT). (In later steps, a second set of WCs are computed using the translation-invariant wavelet transform to improve accuracy.) For all the results presented in this paper, we used the (undecimated) wavelet transform with Daubechies least-asymmetric orthonormal compactly supported wavelets and with 10 vanishing moments (see Chapter 2 of [99]). Note that the fSuSiE software currently supports any wavelet transform that is implemented in the wavethresh R package [99].
Search for a good upper limit on R (optional). determines the number of single functions and, correspondingly, the number of causal SNPs. If is too small, fSuSiE may miss some causal SNPs; on the other hand, if is too large, fSuSiE may take a long time to run. We have implemented an ad hoc procedure to find a reasonable initial estimate of . By default, this procedure starts by fitting an fSuSiE model with . If all single functions are kept, is increased by one. This procedure iterates until one or more single functions are pruned, or until the upper limit, , is reached.
Fit fSuSiE model. We use an iterative algorithm, described below, to fit the fSuSiE model to the data , . This includes estimating the priors and the residual variance . During estimation of the priors, some single functions may be pruned, and therefore is an upper bound on the final number of single functions (and CSs).
Compute SNP-level posterior quantities: CSs and PIPs. Compute CSs and PIPs from the fSuSiE model fitted in the previous step.
Filter credible sets (optional). One may filter out the CSs with low “purity” (purity is defined as the smallest absolute correlation among all pairs of SNPs in the CS). This often improves quality or interpretability of the fine-mapping results.
Compute posterior effect estimates and credible bands. The final step is to compute posterior mean estimates of , and corresponding pointwise credible bands, , for each location and sentinel SNP (the SNP with the largest PIP in each CS). To address possible inaccuracies with the DWT (see Chapter 9 of [50]), we compute a new WC matrix using the translation-invariant wavelet transform (TIWT), also known as the stationary wavelet transform. (In brief, the TIWT modifies the original DWT by applying it to shifted copies of the signal, then the WCs are averaged across the shifted copies.) Since the TIWT greatly increases computational effort, we use this TIWT only in this final step.
All these steps are implemented by the susiF function of the fsusieR R package.
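The optional purity filter in the steps above can be sketched in Python as follows. Purity is computed exactly as defined there (the smallest absolute correlation among all pairs of SNPs in the CS); the threshold of 0.5, the SNP names and the genotype values are illustrative assumptions, and the helper functions are not part of fsusieR.

```python
def correlation(u, v):
    """Pearson correlation between two equal-length genotype vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

def purity(genotypes, cs):
    """Smallest absolute pairwise correlation among the SNPs in a CS."""
    if len(cs) < 2:
        return 1.0                     # a single-SNP CS is perfectly pure
    return min(abs(correlation(genotypes[a], genotypes[b]))
               for i, a in enumerate(cs) for b in cs[i + 1:])

genotypes = {                          # SNP id -> genotypes of 6 individuals
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 1],
    "snp_c": [2, 0, 1, 1, 0, 2],
}
min_purity = 0.5                       # illustrative threshold
kept = [cs for cs in (["snp_a", "snp_b"], ["snp_a", "snp_c"])
        if purity(genotypes, cs) >= min_purity]
```

In this toy example the first CS is retained (its two SNPs are highly correlated) and the second is filtered out (its two SNPs are uncorrelated).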
Molecular QTL simulations
Simulation of genotype data.
The fine-mapping regions for the simulations were selected uniformly at random from 94 breast cancer loci on autosomal chromosomes reported in [100] (see Supplementary Table 1 of that paper). The median size of a fine-mapping region was about 1 Mb.
Similar to [101], we used sim1000G [52] to simulate genotypes of unrelated individuals based on the genotypes from the 1000 Genomes Phase 3 whole-genome sequencing [53]. First, we randomly selected a continent-of-origin label (EUR, AMR, AFR, EAS, SAS), then we simulated SNP genotypes using individuals chosen uniformly at random from the 1000 Genomes samples with the selected continent-of-origin label.
Within the fine-mapping region, we kept all biallelic SNPs with minor allele frequencies (MAFs) of 5% or greater (that is, SNPs in which the minor allele was observed at least 10 times out of the chromosomes).
For a single simulation, the genotype matrix, , was a matrix with rows (individuals) and columns (SNPs), in which ranged from approximately 1,500 to 4,000.
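The MAF filter described above amounts to the following minimal Python sketch, assuming biallelic SNPs with genotypes coded as 0, 1 or 2 copies of one allele. The function names and the toy genotypes are illustrative only.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1 or 2 copies of one allele."""
    freq = sum(genotypes) / (2.0 * len(genotypes))
    return min(freq, 1.0 - freq)

def filter_by_maf(snps, min_maf=0.05):
    """Keep SNPs (id -> genotype vector) with MAF of at least min_maf."""
    return {snp: g for snp, g in snps.items()
            if minor_allele_frequency(g) >= min_maf}

snps = {
    "common": [0, 1, 1, 2, 0, 1, 0, 0, 1, 0],        # allele frequency 0.30
    "monomorphic": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # MAF 0
    "flipped": [2, 2, 2, 1, 2, 2, 2, 2, 2, 1],       # allele frequency 0.90
}
kept = filter_by_maf(snps)
```

Taking the minimum of the allele frequency and its complement makes the filter independent of which allele is counted.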
Simulation of molecular trait data—wavelet simulations.
In these simulations, the molecular trait data were simulated from an fSuSiE model (more precisely, a multiple wavelet regression model with an SPS prior for the causal SNP effects). This involved the following steps: the effects of all causal SNPs (on the wavelet scale) were simulated from an SPS prior, , in which the first mixture component was always a “spike” at zero (); the effects of all non-causal SNPs were set to zero, ; the effects in the measurement space were then obtained as ; and then the molecular trait data, , were simulated from the multiple regression model (4), with . The residual variance was chosen so that explained a specified total variance in .
Similar to [49] [95, 96], we simulated molecular trait data with different levels of smoothness: first, we drew a “smoothness parameter”, , uniformly at random between 0 and 1; then we set the (“spike”) mixture component at each scale to be . The intuition is that settings of closer to 1 produce “smoother” signals (larger effects at the largest scales), while settings of closer to 0 produce effects of similar size across all scales.
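The choice of residual variance to reach a target proportion of variance explained can be sketched for a single location as follows. This Python example assumes the usual definition in which the proportion explained equals the signal variance divided by the sum of signal and residual variances; how the proportion is aggregated across locations in the actual simulations is not shown, and the effect size, sample size and target value are invented for the example.

```python
import random

def sample_variance(v):
    """Unbiased sample variance of a list of numbers."""
    n = len(v)
    m = sum(v) / n
    return sum((a - m) ** 2 for a in v) / (n - 1)

def residual_variance_for_pve(signal, pve):
    """Residual variance such that the signal explains a fraction pve of the
    total variance: pve = var(signal) / (var(signal) + sigma2)."""
    return sample_variance(signal) * (1.0 - pve) / pve

rng = random.Random(0)
genotype = [rng.choice([0, 1, 2]) for _ in range(100)]
signal = [0.5 * g for g in genotype]        # genetic effect at one location
sigma2 = residual_variance_for_pve(signal, pve=0.1)
y = [s + rng.gauss(0.0, sigma2 ** 0.5) for s in signal]
```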
Simulation of molecular trait data—WGBS block and WGBS decay simulations.
In these simulations, we simulated methylation data from WGBS similar to [54]. The methylation profiles, , were simulated using the following logistic regression model:
| (16) |
where denotes a baseline methylation level (a “Beta value” [102]) simulated using OmicsSIMLA [103], denotes the sigmoid function, denotes the probit link function in which is the cumulative distribution function (CDF) of the standard normal distribution, denotes the multivariate normal distribution on with mean and covariance matrix , and the other notation was defined above. We transform the Beta values using the probit as this makes the data approximately homoskedastic with a Gaussian noise distribution [102]. The residual variance was chosen so that explained a specified total variance in the unobserved .
The effects of the causal SNPs were simulated from a multivariate normal distribution, , in which is a correlation matrix. (The effects of the non-causal SNPs were set to zero, .) The correlation matrix determines the spatial structure of the methylation changes. We considered two different designs of that resulted in two different sets of WGBS simulations, which we referred to as “WGBS block” and “WGBS decay” in the text.
WGBS block simulations.
In a “WGBS block” simulation, the fine-mapping region was divided into 5 blocks with the same number of CpGs in each block. The SNP effects were the same within a block: if CpGs , were in the same block; otherwise, .
WGBS decay simulations.
In a “WGBS decay” simulation, the fine-mapping region was divided into 5 blocks of the same size (same number of CpGs). The SNP effects were similar within a block, but differed more as the CpGs were more distant within the block: if CpGs , were in the same block, otherwise , with .
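The two correlation designs can be sketched in Python as follows. The block design follows directly from the description (identical effects within a block, independent effects across blocks). For the decay design, the geometric form and the value 0.9 used below are assumptions made only for illustration, since any within-block correlation that decreases with distance would match the verbal description; the function names are also hypothetical.

```python
def block_index(t, T, n_blocks=5):
    """Index of the block containing CpG t when T CpGs are split evenly."""
    return t * n_blocks // T

def block_correlation(T, n_blocks=5):
    """Correlation 1 within a block (identical effects) and 0 across blocks."""
    return [[1.0 if block_index(s, T, n_blocks) == block_index(t, T, n_blocks)
             else 0.0 for t in range(T)] for s in range(T)]

def decay_correlation(T, rho=0.9, n_blocks=5):
    """Within-block correlation rho ** |s - t| (an assumed decay form);
    0 across blocks."""
    return [[rho ** abs(s - t)
             if block_index(s, T, n_blocks) == block_index(t, T, n_blocks)
             else 0.0 for t in range(T)] for s in range(T)]

R_block = block_correlation(10)    # 10 CpGs in 5 blocks of 2
R_decay = decay_correlation(10)
```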
Methods compared in the simulations.
fSuSiE.
We ran two variants of fSuSiE: one using the IS prior, and another using the SPS prior. fSuSiE was applied as described above (“Outline of an fSuSiE analysis”) by calling the susiF() function from the fsusieR R package with the following settings: L = 20, which sets the upper limit on the number of CSs to 20; prior = "mixture_normal" for the IS prior or prior = "mixture_normal_per_scale" for the SPS prior; and all other options were kept at their defaults.
For a given CpG, we defined as the largest confidence level for which the corresponding credible band at that CpG did not include zero. For a given threshold, a CpG was defined as “affected” if was less than this threshold.
SuSiE-topPC.
The top PC, denoted here by , was obtained from the methylation data, , after centering the columns of . (Note that the columns of were not scaled to unit variance.) Then we ran function susie() from the susieR R package [23] on the data , with the following settings: standardize = TRUE; L = X, which sets the upper limit on the number of CSs to X; and all other options were kept at their defaults. Note that SuSiE-topPC was not used to identify affected CpGs.
SNP-CpG association tests.
The SNP-CpG association tests were implemented using the lm() function in R. For a given p-value threshold, a CpG was defined as “affected” if the (two-sided) p-value from the association test for that CpG was less than the threshold.
Additional simulations to assess single-variant CSs.
In the mSNP fine-mapping and haSNP fine-mapping, fSuSiE reported a large number of CSs with 1 SNP, a surprisingly high number based on our previous experience with fine-mapping other quantitative traits such as RNA and protein expression. Furthermore, we found that some of these single-variant CSs identified a SNP that was in very high LD with another SNP that was not in the CS. We wondered whether fSuSiE might be incorrectly generating CSs that were too small and/or incorrectly identifying the causal SNP, and therefore we performed an additional experiment to check this. Specifically, we examined the extreme case of fine-mapping two SNPs, one of which is causal, and the genotypes of the two SNPs differ in only a single individual. This is the strongest LD one can observe between two SNPs without being in perfect LD (i.e., a correlation of 1).
In these simulations, we chose the causal SNP at random from European ancestry WGS samples in 1000 Genomes phase 3 . The second SNP was identical to the first except for a single individual chosen at random; for this individual, we set . SNP 1 affected molecular locations , and had no effect on all other locations (in total, there were 128 locations, ). The effects of SNP 1, , were the same for all affected locations, and were varied from simulation to simulation from 0.01 to 1. The molecular trait for individual was simulated as , with i.i.d. standard normal. We performed 100 simulations at each setting of the effect, 0.01, 0.02, …, 1, for a total of 10,000 simulations.
We compared the results from SuSiE and fSuSiE in these simulations. For SuSiE, we used as a univariate trait the most associated location; that is, the location with the smallest association p-value when testing for association between the location and SNP 1.
In each simulation, we recorded the CS configuration: SNP 1 only; SNP 2 only; or both SNPs 1 and 2. The results of these simulations are summarized in Supplementary Fig. 15. Note that both SuSiE and fSuSiE rarely returned a CS containing the wrong SNP only (SNP 2): out of the 10,000 simulations, SuSiE returned this result in just 8 simulations, and fSuSiE returned this result in just 36 simulations.
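The design of this two-SNP experiment can be sketched in Python as follows. Several details are assumptions made only so that the example runs: genotypes are drawn at random rather than taken from 1000 Genomes samples, the sample size of 100 and the affected locations 40 to 59 are invented, and the genotype of the perturbed individual is changed to an arbitrary different value. Only the total of 128 locations, the shared effect size across affected locations and the standard normal noise follow the description above.

```python
import random

def simulate_two_snp_dataset(n, n_locations, affected, effect, rng):
    """Simulate two near-identical SNPs and a trait affected only by SNP 1."""
    x1 = [rng.choice([0, 1, 2]) for _ in range(n)]   # stand-in for real genotypes
    x2 = list(x1)
    i = rng.randrange(n)
    x2[i] = (x1[i] + 1) % 3      # assumed perturbation: any different genotype
    Y = [[(effect * x1[k] if t in affected else 0.0) + rng.gauss(0.0, 1.0)
          for t in range(n_locations)] for k in range(n)]
    return x1, x2, Y

rng = random.Random(42)
x1, x2, Y = simulate_two_snp_dataset(n=100, n_locations=128,
                                     affected=set(range(40, 60)),
                                     effect=0.5, rng=rng)
n_differences = sum(a != b for a, b in zip(x1, x2))
```

Each simulated data set would then be passed to the fine-mapping methods being compared, and the returned CS classified as containing SNP 1 only, SNP 2 only, or both.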
Fine-mapping of molecular traits and Alzheimer's disease
Data sources overview.
We analyzed molecular trait data—DNA methylation and histone acetylation (H3K9ac)—in DLPFC donors from the ROSMAP cohort [48]. We also analyzed DLPFC RNA and protein expression data generated by the ADSP Functional Genomics Consortium xQTL Project (FunGen-xQTL) [85, 86]. SNP genotypes were obtained in the DLPFC donors by WGS [86]. Fine-mapping of AD risk loci was performed using data from two recent AD GWASs [66, 67].
WGS genotype data.
Whole-genome sequencing data for ROSMAP subjects were obtained from the Alzheimer's Disease Sequencing Project (ADSP) release 4 (R4), comprising a subset of the NIAGADS NG00067 dataset [86]. All sequencing data in ADSP R4 were centrally processed by the Genome Center for Alzheimer's Disease (GCAD) using the variant calling pipeline and data management tool (VCPA), a standardized pipeline functionally equivalent to the CCDG/TOPMed workflow [87]. VCPA, implemented in Workflow Description Language (WDL) and optimized for the Amazon EC2 cloud environment, accepts sequencing data in FASTQ, BAM, or CRAM formats.
Sequence reads were aligned to the GRCh38 human genome assembly (hg38) using BWA-MEM [104], followed by variant calling of single nucleotide variants (SNVs) and short insertion-deletions (indels) using GATK HaplotypeCaller [105], with joint calling workflows for improved alignment quality through local realignment of insertions/deletions and base quality score recalibration using GATK modules. Genotype-level quality control (QC) set each genotype to missing if the read depth (DP) was less than 10 or the genotype quality (GQ) score was less than 20. Variant-level QC flags were applied in the following order: variants in GATK low sequence quality tranches (lacking FILTER “PASS” value above 99.8% VQSR Tranche); monomorphic variants; variants with high missing rate; and variants with high depth. ABHet annotation estimated whether biallelic variants matched expected allelic ratios, with ideal heterozygous variants having values near 0.5. For downstream analysis, variants were retained if they were not assigned any of these QC flags and satisfied the following criteria: ABHet between 0.25 and 0.75; missing rate < 5%; MAF ≥ 1%; and no significant deviation from Hardy-Weinberg equilibrium (). Additional sample-level quality control steps included verifying genetic sex, flagging duplicated and related individuals, and identifying outliers by heterozygosity levels and genotype call rates. The final data set contained approximately 30 million biallelic SNVs and 3.5 million indels.
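The genotype-level and variant-level thresholds stated above can be summarized in a short Python sketch. This is only an illustration of the stated cutoffs, not the VCPA pipeline: the function names and example values are hypothetical, and the Hardy-Weinberg and tranche-based filters are omitted because they require information not given in this section.

```python
def mask_low_quality(genotype, depth, quality, min_dp=10, min_gq=20):
    """Set a genotype call to missing (None) if DP < 10 or GQ < 20."""
    return None if depth < min_dp or quality < min_gq else genotype

def passes_variant_filters(abhet, missing_rate, maf):
    """Apply the ABHet, missingness and MAF criteria described above.

    The Hardy-Weinberg filter is omitted because its threshold is not
    given in this section.
    """
    return 0.25 <= abhet <= 0.75 and missing_rate < 0.05 and maf >= 0.01

# (genotype, read depth, genotype quality) for three illustrative calls
calls = [mask_low_quality(g, dp, gq)
         for g, dp, gq in [(1, 30, 99), (2, 8, 99), (0, 25, 15)]]
keep = passes_variant_filters(abhet=0.48, missing_rate=0.01, maf=0.12)
drop = passes_variant_filters(abhet=0.48, missing_rate=0.08, maf=0.12)
```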
To control for population stratification, PCA was performed on LD-pruned common variants using PLINK (MAF > 5%, ). We retained the first 15 PCs for use in subsequent analyses.
Defining regions for cis-eQTL mapping and fine-mapping of molecular traits.
Since methylation and H3K9ac lack gene-centric reference points, we adopted a unified approach using topologically associating domains (TADs) and their boundaries to define regions for QTL mapping and fine-mapping for all molecular traits analyzed. The TADs and their boundaries were derived from combined brain and blood Hi-C data [62]. In total, we defined 1,449 “extended TADs” for our analyses ranging in size from 1 to 17 Mb. For fine-mapping of methylation and H3K9ac, we used these extended TADs as the fine-mapping regions. For QTL mapping and fine-mapping of RNA and protein expression, the analysis region was defined as the extended TAD containing the gene, plus any additional part of the chromosome needed to include ±1 Mb from the gene's transcription start and end sites.
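As a minimal sketch of the region definition used for RNA and protein expression, the analysis region can be computed as the union of the extended TAD and the gene body padded by 1 Mb on each side. The function name, the coordinate convention, and the optional chromosome-length clamp are assumptions for illustration, not part of the published pipeline.

```python
def expression_analysis_region(tad_start, tad_end, gene_start, gene_end,
                               flank=1_000_000, chrom_length=None):
    """Return (start, end) covering the extended TAD plus gene +/- flank.

    Coordinates are base-pair positions on a single chromosome. The
    chrom_length clamp is an optional convenience (assumption).
    """
    start = max(0, min(tad_start, gene_start - flank))
    end = max(tad_end, gene_end + flank)
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end
```

For a gene lying 200 kb inside the left edge of its extended TAD, the region is extended to the left by 800 kb and is otherwise identical to the TAD.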
RNA-seq data and analyses of RNA expression.
The DLPFC bulk RNA-seq data were generated in three different batches (). Total RNA () from DLPFC was used for library preparation via poly(A) selection (strand-specific dUTP protocol, ) or rRNA depletion (KAPA Stranded RNA-Seq Kit with RiboErase, , RiboGold ). Libraries were sequenced at 30–50 million paired-end reads per donor, with exact depth varying by batch as described in [47]. Following standard QC via FastQC and adapter trimming using fastp [106], reads were aligned to the human reference genome (hg38) using STAR [107] with WASP correction to reduce reference bias [108], with further quality control via Picard [109]. Gene-level RNA expression was quantified with RNA-SeQC [110], removing genes if over 20% of samples had TPM expression level of 10% or less. Sample-level RNA QC was performed following the methods outlined by GTEx V8 [111], using three metrics to remove outliers: relative log expression; Mahalanobis distance to hierarchical clustering of samples; and D-statistics quantifying average correlation between pairs of samples. After these QC steps, a total of samples were retained for subsequent analyses of RNA expression in DLPFC.
After these quality control steps, the transcript abundance (TPM) matrices were quantile-normalized. Technical factors (batch, RNA integrity number, post-mortem interval) and biological covariates (sex, age at death) were included as covariates in QTL analyses, along with the top 15 genotype PCs to account for population stratification. Additional hidden confounders included as covariates were estimated by PCA performed on the quantile-normalized RNA expression matrix, with the number of PCs (34) determined by the Marchenko-Pastur limit [112].
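One common way to apply a Marchenko-Pastur cutoff is sketched below, under assumptions that are not stated in the text: each feature is standardized to unit variance, and a PC is counted when its eigenvalue exceeds the upper edge of the Marchenko-Pastur distribution for pure noise, (1 + sqrt(p/n))^2. The function name is hypothetical, and the exact variant used in the analysis pipeline may differ.

```python
import numpy as np

def n_pcs_marchenko_pastur(X):
    """Count PCs whose eigenvalue exceeds the Marchenko-Pastur upper edge.

    X is a (samples x features) matrix. Features are centered and scaled to
    unit variance, so the noise variance is taken to be 1 (assumption).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = X - X.mean(axis=0)
    sd = Z.std(axis=0, ddof=1)
    keep = sd > 0                      # drop constant features
    Z = Z[:, keep] / sd[keep]
    p = Z.shape[1]
    # Eigenvalues of the sample correlation matrix from the singular values.
    s = np.linalg.svd(Z, compute_uv=False)
    eigvals = s ** 2 / (n - 1)
    upper_edge = (1.0 + np.sqrt(p / n)) ** 2
    return int(np.sum(eigvals > upper_edge))
```

The returned count would then be used as the number of expression PCs included as hidden-confounder covariates.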
QTL mapping for each gene was performed using TensorQTL [88]. Fine-mapping for each gene was performed using SuSiE [23] after removing the linear effects of the aforementioned covariates from both the phenotype and genotype matrices. We took an adaptive approach to determine L, the number of CSs, for each gene: starting from an initial value of L, we iteratively increased L by 2 if all the single effect regressions (SERs) in the fitted model had a prior variance greater than zero. We continued in this way until at least one SER had a prior variance of zero. This adaptive strategy increased detection of causal SNPs while avoiding an unnecessarily large setting of L (which increases computation). Using this adaptive approach, the largest setting of L for a gene was 12. We filtered the CSs returned by SuSiE in two ways: we removed a CS if its “purity” [23], defined as the smallest absolute correlation (Pearson's r) among all pairs of SNPs in the CS, was less than 0.8; and we removed a CS if the sentinel SNP had a MAF less than 5%. (See below for the rationale for filtering based on MAF.)
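The adaptive choice of the number of single effects and the purity filter can be sketched as follows. This is an illustration only: `fit_fn` is a hypothetical callable standing in for a SuSiE fit that returns the estimated prior variance of each SER, the starting value is left as a required argument because it is not given here, and `L_max` is an optional safety cap added for the sketch.

```python
import numpy as np

def fit_with_adaptive_L(fit_fn, L_init, step=2, L_max=None):
    """Increase L by `step` until at least one SER has zero prior variance.

    fit_fn(L) must return a sequence of L estimated prior variances,
    one per single effect regression (hypothetical interface).
    """
    L = L_init
    while True:
        prior_variances = list(fit_fn(L))
        all_used = all(v > 0 for v in prior_variances)
        if not all_used or (L_max is not None and L + step > L_max):
            return L, prior_variances
        L += step

def cs_purity(genotypes, cs_indices):
    """Smallest absolute pairwise Pearson correlation among SNPs in a CS.

    genotypes is a (samples x SNPs) matrix; cs_indices selects the CS columns.
    """
    idx = list(cs_indices)
    if len(idx) == 1:
        return 1.0                      # a single-variant CS is fully pure
    R = np.corrcoef(np.asarray(genotypes, dtype=float)[:, idx], rowvar=False)
    return float(np.min(np.abs(R)))
```

A CS would then be kept only if `cs_purity(...)` is at least 0.8 and its sentinel SNP passes the MAF threshold.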
Proteomic data and analyses of protein abundance.
DLPFC protein abundance was quantified using selected reaction monitoring (SRM) proteomics [113–116]. Gray matter tissue from DLPFC was homogenized and proteins were extracted, followed by trypsin digestion. Targeted peptides were selected based on prior discovery proteomics experiments, and SRM assays were performed with manual inspection to ensure correct peak assignment and peak boundaries. After QC and filtering for proteins quantified in at least 50% of samples, subjects with matched genotype and protein abundance data of 7,710 proteins were retained for association analysis and fine-mapping. Peptide relative abundances were log-transformed (base 2) and centered at the median; missing protein levels were then imputed using grouped empirical Bayes matrix factorization (gEBMF) [117]. QTL mapping and fine-mapping were performed as described for RNA expression (see above).
H3K9ac ChIP-seq data, association analysis and fine-mapping.
H3K9ac ChIP-seq [89] was performed on DLPFC donors using a well-validated H3K9ac antibody (Millipore #06–942). Approximately 50 mg of gray matter tissue was dissected, cross-linked with 1% formaldehyde, and sonicated prior to overnight immunoprecipitation. Purified DNA was used to construct libraries (including end repair, adapter ligation and size selection), and single-end 36-bp reads were generated. Reads were aligned to human genome assembly hg38, and broad peaks were called with MACS [118, 119] (, broad cutoff = 0.1). Stringent per-sample quality control criteria were also applied (≥ 15 × 10⁶ unique reads, non-redundant fraction ≥ 0.3, fraction of reads in peaks ≥ 0.05). A union peak set of 92,401 peaks was established to leverage information across donors. Peak counts were computed for the final set of donors, then normalized using a limma-voom pipeline [120, 121]. Batch effects for 62 libraries were removed using ComBat [122, 123]. Other technical factors (RNA integrity number, post-mortem interval) and biological covariates (sex, age at death), along with the top 15 genotype PCs, were included as covariates in the association and fine-mapping analyses. Additional hidden confounders included as covariates in these analyses were estimated from PCA applied to the normalized methylation matrix. The number of hidden confounders, 113, was determined by the Marchenko-Pastur limit based on the input methylation matrix.
SNP-peak association testing was performed using TensorQTL [88]. Significance of the SNP-peak associations was determined by computing q-values [124] across all tested variants, separately for each H3K9ac peak.
For fine-mapping, SuSiE-topPC and fSuSiE with the IS prior were applied to the data from each fine-mapping region (the TADs) in the same way as in the simulations, except that we set L = 20. Since fSuSiE requires a minimum of 16 peaks, we did not finemap the TADs that had fewer than 16 peaks. We also did not consider SNPs with minor allele count (MAC) ≤ 5 (which corresponds to ).
In examining the fSuSiE CSs, we found that many of the single-variant CSs (i.e., the predicted haSNPs) affected H3K9ac peaks that were far away (>100 kb) from the haSNP. These long-range SNP-peak interactions were heavily concentrated on lower frequency SNPs; in particular, we observed a strong excess of SNPs with MAF < 5%. This strongly suggests that the fSuSiE results for low-frequency SNPs contain a high proportion of false positives. (These results align with our independent findings that SuSiE is poorly calibrated and fails to achieve good coverage at very small sample sizes or with low-frequency SNPs [90].) In light of these observations, we removed all single-variant SuSiE-topPC and fSuSiE CSs in which the SNP had a MAF < 5%. For CSs with more than one SNP, we removed the CS if the sentinel SNP had a MAF < 5%. For consistency, we also removed all SNP-peak associations in which the SNP had a MAF < 5%.
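The MAF-based filtering rule for CSs can be written compactly if the sentinel SNP is taken to be the SNP with the largest PIP in the CS, which is an assumption of this sketch; a single-variant CS is then simply the case where the sentinel is the only SNP. The function name and the dictionary inputs are hypothetical.

```python
def keep_credible_set(cs_snps, pip, maf, min_maf=0.05):
    """Return True if the CS passes the MAF filter.

    cs_snps: SNP identifiers in the credible set.
    pip, maf: dicts mapping SNP identifier -> PIP and -> minor allele frequency.
    The sentinel is defined here as the highest-PIP SNP (assumption).
    """
    sentinel = max(cs_snps, key=lambda snp: pip[snp])
    return maf[sentinel] >= min_maf
```

For example, a two-SNP CS whose highest-PIP SNP has MAF 3% would be removed even if the other SNP is common.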
DNA methylation data, association analysis and fine-mapping.
DNA methylation was assayed using the Illumina 450K array [47, 91] and processed via the SeSAMe pipeline [125]. SeSAMe applies NOOB background correction, nonlinear dye-bias normalization, and pOOBAH-based probe filtering to generate high-quality methylation proportions (β values) [125]. The resulting β values were then transformed using a logit transformation. Missing data in the transformed matrix were imputed using flashier, a factor analysis method that estimates latent factors from the observed methylation data and predicts missing entries. As with other molecular datasets, additional hidden confounders were estimated by applying PCA to the imputed methylation matrix, with the number of hidden confounders (38) determined by the Marchenko-Pastur limit. Technical covariates and biological covariates were included as covariates in the association and fine-mapping analyses. Association analysis (TensorQTL) and fine-mapping (SuSiE-topPC, fSuSiE) were performed as described for the H3K9ac ChIP-seq data.
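The logit transformation of methylation proportions can be sketched as below. The natural logarithm and the clipping constant `eps` (used to keep proportions of exactly 0 or 1 finite) are assumptions of this sketch; the text does not specify the log base or how boundary values were handled.

```python
import numpy as np

def logit_methylation(beta, eps=1e-6):
    """Map methylation proportions in [0, 1] to the real line via the logit.

    Values are clipped to [eps, 1 - eps] before transforming (assumption).
    Missing values (NaN) are passed through unchanged for later imputation.
    """
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log(beta / (1.0 - beta))
```

A proportion of 0.5 maps to 0, and the transform is antisymmetric about that point, so hypo- and hypermethylated sites are treated symmetrically.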
An additional issue specific to the Illumina 450K array is that SNPs within CpG probe sequences can create false methylation signals [126, 127]. Such polymorphisms can interfere with probe hybridization or affect signal detection, particularly when located within the CpG dinucleotide or the adjacent single base extension site, producing false intermediate methylation values often misinterpreted as epigenetic differences. To address this issue, we removed all SuSiE-topPC and fSuSiE CSs in which one or more of the SNPs overlapped with a CpG probe in the Illumina 450K array (source: IlluminaHumanMethylation450K.rda from https://github.com/Yang9704/MethylCallR/ [128]). Reassuringly, most of the CpG probes overlapping the methylation CSs were also included in a list of “suggested probes for removal” from [127] (specifically, these are the CpG probes flagged as “discard” in “Additional file 2”). Most of the mSNPs overlapping with CpG probes had low MAFs and so had already been removed by the previous MAF-filtering step. We also did the same for the SNP-CpG associations, although this removed only a very small number of SNP-CpG associations meeting the -value threshold.
AD GWAS fine-mapping and colocalization.
To finemap AD risk loci, we constructed an LD reference panel using genotype data from the Alzheimer's Disease Sequencing Project (ADSP), which consisted of approximately 17,000 whole-genome sequenced individuals of European ancestry. QC on two recent AD GWAS meta-analysis summary statistics [66, 67] was then performed against this reference panel, including allele harmonization and LD mismatch detection using SLALOM [129]; variants flagged as suspicious were removed. We then ran SuSiE-RSS [23, 24] using the z-scores from each of the AD GWAS and the LD from the LD reference panel as input data. To prioritize AD risk loci for further investigation, we ran COLOC (version 5) [68] to assess colocalization of the putative causal AD SNPs with RNA expression, protein expression, DNA methylation and H3K9ac SNPs.
Excess-of-overlap enrichment analysis.
We performed an excess-of-overlap (EOO) enrichment analysis of the haSNPs and mSNPs using a collection of predefined functional annotations from the Baseline-LD v2.2 model among common SNPs (MAF > 0.05) in the 1000 Genomes Project [64, 65]. For each molecular trait (H3K9ac, DNA methylation) and method (fSuSiE, SuSiE-topPC), we defined the “positive set” for enrichment analysis using variants within 95% credible sets (CSs) that had PIP > 0.5 and complete functional annotations available in Baseline-LD v2.2 common SNPs. This filtering process yielded 8,264 CS variants from a total of 74,065 variants across DNA methylation and H3K9ac credible sets for fSuSiE, and 749 out of 97,595 for SuSiE-topPC. Following [79], we tested whether the selected SNPs showed greater overlap with functional annotations than expected by chance, with the control set defined as follows: for each selected SNP, 5 SNPs outside the selected set were chosen to match on LD score and minor allele frequency. For each annotation, enrichment was calculated as the ratio of the proportion of selected SNPs overlapping the annotation to the proportion of control SNPs overlapping the annotation. Statistical significance and confidence intervals were estimated using a jackknife resampling procedure, in which the enrichment ratio was recalculated after removing each chromosome in turn. The mean and standard error of the enrichment estimates were computed from the jackknife samples; annotations were considered significantly enriched if the 95% jackknife confidence interval for the enrichment ratio did not include 1.
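The enrichment ratio and its leave-one-chromosome-out jackknife can be sketched as follows. The standard delete-one jackknife formula for the standard error and the normal-approximation 95% interval are assumptions of this sketch, as are the function names; the exact computation used in the analysis may differ. Matching of control SNPs on LD score and MAF is assumed to have been done beforehand.

```python
import numpy as np

def enrichment_ratio(selected_in_annot, control_in_annot):
    """Proportion of selected SNPs in the annotation over that of control SNPs.

    Inputs are 0/1 indicators. Raises ZeroDivisionError if no control SNP
    overlaps the annotation.
    """
    return float(np.mean(selected_in_annot)) / float(np.mean(control_in_annot))

def jackknife_enrichment(sel_annot, sel_chrom, ctl_annot, ctl_chrom, z=1.96):
    """Leave-one-chromosome-out jackknife for the enrichment ratio.

    Returns (mean of jackknife estimates, standard error, (lower, upper)).
    """
    sel_annot = np.asarray(sel_annot, dtype=float)
    ctl_annot = np.asarray(ctl_annot, dtype=float)
    sel_chrom = np.asarray(sel_chrom)
    ctl_chrom = np.asarray(ctl_chrom)
    chroms = np.unique(np.concatenate([sel_chrom, ctl_chrom]))
    # Recompute the ratio with each chromosome removed in turn.
    estimates = np.array([
        enrichment_ratio(sel_annot[sel_chrom != c], ctl_annot[ctl_chrom != c])
        for c in chroms
    ])
    g = len(estimates)
    mean = float(estimates.mean())
    se = float(np.sqrt((g - 1) / g * np.sum((estimates - mean) ** 2)))
    return mean, se, (mean - z * se, mean + z * se)
```

An annotation would be called significantly enriched in this sketch when the returned interval lies entirely above 1.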
Acknowledgments
We thank Kevin Luo, Xin He, and members of the Wang and Stephens labs for discussions and support, and Yuqi Miao for initial input on simulation study design. We thank the staff at the Research Computing Center at the University of Chicago for providing the high-performance computing resources used to implement the numerical experiments. We thank Angela Helfrich and Mark Bronnimann from Amazon Web Services for providing cloud computing support. We also thank the members of the Alzheimer's Disease Sequencing Project Functional Genomics Consortium (FunGen-AD) for providing the FunGen-xQTL resource. This work was supported in part by NIH grants R01HG002585 and R35GM153249 (to M.S.), NIH grants R01AG076901 and R01AG086467 (to G.W., H.S., A.L.), U01AF072572 (to P.L.D.) and a grant from the Urbut Family Foundation (to G.W.). This project is supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Sciences, LLC program. Additional support came from the University of Chicago Data Science Institute through the 2024 AI+Science Research Initiative. This research was conducted using data from the Religious Orders Study and the Rush Memory and Aging Project (ROSMAP). We thank the participants and investigators of these studies.
Footnotes
Competing interests
The authors declare no competing interests.
Data availability
The datasets analyzed, including the ADSP R4 WGS genotype data, are available for application and download from the NIAGADS Data Sharing Service, https://dss.niagads.org. The code implementing the data processing and analysis pipelines is available at https://statfungen.github.io/xqtl-protocol. Data generated from the numerical studies and analyses to prepare figures for the paper are available at https://github.com/stephenslab/fsusie-experiments/. A subset of fine-mapped QTL obtained from our analyses is available for peer review purposes at https://github.com/stephenslab/fsusie-experiments/blob/main/data/README.md. The complete set of QTL data and QTL-GWAS integration models will be made publicly available at https://synapse.org prior to publication for registered Synapse users, as per the Data Management and Sharing policy of the FunGen-xQTL project. Other data sets used include: 1000 Genomes Phase 3 whole-genome sequencing data, https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/; GENCODE Ensembl/Havana database, https://useast.ensembl.org/info/genome/genebuild/annotation_merge.html; MethylCallR R package, https://github.com/Yang9704/MethylCallR.
Code availability
The fsusieR R package is available on GitHub at https://github.com/stephenslab/fsusieR (3-clause BSD license). The code used to perform the simulations and generate the manuscript figures is available at https://github.com/stephenslab/fsusie-experiments/. Other software and R packages used in this work include: VCPA 1.1 (http://www.niagads.org/VCPA/); BWA-MEM 0.7.15 (https://github.com/lh3/bwa/); GATK 4.1.1 (https://gatk.broadinstitute.org); FastQC 0.12.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); fastp 0.23.4 (https://github.com/OpenGene/fastp/); STAR 2.7.11b (https://github.com/alexdobin/STAR/); WASP (https://github.com/bmvdgeijn/WASP/); Picard 3.1.0 (https://broadinstitute.github.io/picard/); RNA-SeQC 2.4.2 (https://github.com/getzlab/rnaseqc/); MACS2 (https://github.com/macs3-project/MACS/); ComBat (https://github.com/epigenelabs/pyComBat/); PLINK 1.9 (https://www.cog-genomics.org/plink/); TensorQTL 1.0.8 (https://github.com/broadinstitute/tensorqtl/); sim1000G 1.40 (https://github.com/adimitromanolakis/sim1000G); WGBSSuite 0.4 (https://github.com/SystemsGeneticsSG/WGBSSuite/); SeqSIMLA (https://seqsimla.sourceforge.net/); susieR 0.14.7 (https://github.com/stephenslab/susieR/); coloc 5.2.3 (https://github.com/chr1swallace/coloc/); ashr 2.2-63 (https://github.com/stephens999/ashr/); mixsqp 0.3.18 (https://github.com/stephenslab/mixsqp/); wavethresh 4.7.2 (https://cran.r-project.org/package=wavethresh); limma 3.56.2 (https://bioconductor.org/packages/limma/); qvalue 2.32.0 (https://github.com/StoreyLab/qvalue/); flashier 1.0.21 (https://github.com/willwerscheid/flashier/); R 4.3.3 (https://www.r-project.org).
References
- [1].Aguet F., Alasoo K., Li Y. I., Battle A., Im H. K., Montgomery S. B., and Lappalainen T. (2023). Molecular quantitative trait loci. Nature Reviews Methods Primers 3, 4. [Google Scholar]
- [2].Mortazavi A., Williams B. A., McCue K., Schaeffer L., and Wold B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5 (7), 621–628. [DOI] [PubMed] [Google Scholar]
- [3].Pickrell J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. American Journal of Human Genetics 94(4), 559–573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Ferkingstad E., Sulem P., Atlason B. A., Sveinbjornsson G., Magnusson M. I., Styrmisdottir E. L., et al. (2021). Large-scale integration of the plasma proteome with genetics and disease. Nature Genetics 53 (12), 1712–1721. [DOI] [PubMed] [Google Scholar]
- [5].Sun B. B., Chiou J., Traylor M., Benner C., Hsu Y.-H., Richardson T. G., et al. (2023). Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622 (7982), 329–338. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [6].Lister R., Pelizzola M., Dowen R. H., Hawkins R. D., Hon G., Tonti-Filippini J., et al. (2009). Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462 (7271), 315–322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].Bock C. (2012). Analysing and interpreting DNA methylation data. Nature Reviews Genetics 13 (10), 705–719. [DOI] [PubMed] [Google Scholar]
- [8].Banovich N. E., Lan X., McVicker G., van de Geijn B., Degner J. F., Blischak J. D., Roux J., Pritchard J. K., and Gilad Y. (2014). Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels. PLoS Genetics 10 (9), e1004663. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Teschendorff A. E. and Relton C. L. (2018). Statistical and integrative system-level analysis of DNA methylation data. Nature Reviews Genetics 19 (3), 129–147. [DOI] [PubMed] [Google Scholar]
- [10].Oliva M., Demanelis K., Lu Y., Chernoff M., Jasmine F., Ahsan H., Kibriya M. G., Chen L. S., and Pierce B. L. (2023). DNA methylation QTL mapping across diverse human tissues provides molecular links between genetic variation and complex traits. Nature Genetics 55, 112–122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Perzel Mandell K. A., Eagles N. J., Wilton R., Price A. J., Semick S. A., Collado-Torres L., Ulrich W. S., Tao R., Han S., Szalay A. S., Hyde T. M., Kleinman J. E., Weinberger D. R., and Jaffe A. E. (2021). Genome-wide sequencing-based identification of methylation quantitative trait loci and their role in schizophrenia risk. Nature Communications 12, 5251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Buenrostro J. D., Wu B., Chang H. Y., and Greenleaf W. J. (2015). ATAC-seq: a method for assaying chromatin accessibility genome-wide. Current Protocols in Molecular Biology 109, 21.29.1–21.29.9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Degner J. F., Pai A. A., Pique-Regi R., Veyrieras J.-B., Gaffney D. J., Pickrell J. K., De Leon S., Michelini K., Lewellen N., Crawford G. E., Stephens M., Gilad Y., and Pritchard J. K. (2012). DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482 (7385), 390–394. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Spitz F. and Furlong E. E. M. (2012). Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics 13 (9), 613–626. [DOI] [PubMed] [Google Scholar]
- [15].Yan J., Qiu Y., Ribeiro dos Santos A. M., Yin Y., Li Y. E., Vinckier N., et al. (2021). Systematic analysis of binding of transcription factors to noncoding variants. Nature 591 (7848), 147–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Zhou V. W., Goren A., and Bernstein B. E. (2011). Charting histone modifications and the functional organization of mammalian genomes. Nature Reviews Genetics 12, 7–18. [DOI] [PubMed] [Google Scholar]
- [17].Morgan M. A. J. and Shilatifard A. (2020). Reevaluating the roles of histone-modifying enzymes and their associated chromatin modifications in transcriptional regulation. Nature Genetics 52 (12), 1271–1281. [DOI] [PubMed] [Google Scholar]
- [18].Hutchinson A., Asimit J., and Wallace C. (2020). Fine-mapping genetic associations. Human Molecular Genetics 29 (R1), R81–R88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Schaid D. J., Chen W., and Larson N. B. (2018). From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics 19 (8), 491–504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Maller J. B., McVean G., Byrnes J., Vukcevic D., Palin K., Su Z., et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics 44 (12), 1294–1301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Kote-Jarai Z., Saunders E. J., Leongamornlert D. A., Tymrakiewicz M., Dadaev T., Jugurnauth-Little S., et al. (2013). Fine-mapping identifies multiple prostate cancer risk loci at 5p15, one of which associates with TERT expression. Human Molecular Genetics 22 (12), 2520–2528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Spain S. L. and Barrett J. C. (2015). Strategies for fine-mapping complex traits. Human Molecular Genetics 24 (R1), R111–R119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Wang G., Sarkar A., Carbonetto P., and Stephens M. (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society, Series B 82 (5), 1273–1300. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Zou Y., Carbonetto P., Wang G., and Stephens M. (2022). Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genetics 18 (7), e1010299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Servin B. and Stephens M. (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3 (7), e114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Wallace C., Cutler A. J., Pontikos N., Pekalski M. L., Burren O. S., Cooper J. D., et al. (2015). Dissection of a complex disease susceptibility region using a Bayesian stochastic search approach to fine mapping. PLoS Genetics 11 (6), e1005272. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Benner C., Havulinna A. S., Järvelin M.-R., Salomaa V., Ripatti S., and Pirinen M. (2017). Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. American Journal of Human Genetics 101(4), 539–551. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Kichaev G., Roytman M., Johnson R., Eskin E., Lindström S., Kraft P., and Pasaniuc B. (2017). Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 33 (2), 248–255. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Lewin A., Saadi H., Peters J. E., Moreno-Moral A., Lee J. C., Smith K. G. C., Petretto E., Bottolo L., and Richardson S. (2016). MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues. Bioinformatics 32 (4), 523–532. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Arvanitis M., Tayeb K., Strober B. J., and Battle A. (2022). Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity. American Journal of Human Genetics 109 (2), 223–239. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Liu Y., Li X., Aryee M. J., Ekström T. J., Padyukov L., Klareskog L., Vandiver A., Moore A. Z., Tanaka T., Ferrucci L., Fallin M. D., and Feinberg A. P. (2014). GeMes, clusters of DNA methylation under genetic control, can inform genetic and epigenetic analysis of disease. American Journal of Human Genetics 94 (4), 485–495. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [32].Jaffe A. E., Murakami P., Lee H., Leek J. T., Fallin M. D., Feinberg A. P., and Irizarry R. A. (2012). Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. International Journal of Epidemiology 41 (1), 200–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [33].Pedersen B. S., Schwartz D. A., Yang I. V., and Kechris K. J. (2012). Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 28 (22), 2986–2988. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Lent S., Xu H., Wang L., Wang Z., Sarnowski C., Hivert M.-F., and Dupuis J. (2018). Comparison of novel and existing methods for detecting differentially methylated regions. BMC Genetics 19 (Suppl 1), 84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [35].Peters T. J., Buckley M. J., Statham A. L., Pidsley R., Samaras K., V Lord R., Clark S. J., and Molloy P. L. (2015). De novo identification of differentially methylated regions in the human genome. Epigenetics & Chromatin 8, 6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Shim H. and Stephens M. (2015). Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. Annals of Applied Statistics 9 (2), 665–686. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [37].Yu X. and Sun S. (2016). HMM-DM: identifying differentially methylated regions using a hidden Markov model. Statistical Applications in Genetics and Molecular Biology 15 (1), 69–81. [DOI] [PubMed] [Google Scholar]
- [38].Shen L., Zhu J., Robert Li S.-Y., and Fan X. (2017). Detect differentially methylated regions using non-homogeneous hidden Markov model for methylation array data. Bioinformatics 33 (23), 3701–3708. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [39].Shim H., Xing Z., Pantaleo E., Luca F., Pique-Regi R., and Stephens M. (2024). Multiscale Poisson process approaches for detecting and estimating differences from high-throughput sequencing assays. Annals of Applied Statistics 18 (3), 1773–1788. [Google Scholar]
- [40].Fernández L., Pérez M., Olanda R., Orduña J. M., and Marquez-Molins J. (2020). HPG-DHunter: an ultrafast, friendly tool for DMR detection and visualization. BMC Bioinformatics 21, 287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Denault W. R. P. and Jugessur A. (2021). Detecting differentially methylated regions using a fast wavelet-based approach to functional association analysis. BMC Bioinformatics 22, 61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Lee W. and Morris J. S. (2016). Identification of differentially methylated loci using wavelet-based functional mixed models. Bioinformatics 32 (5), 664–672. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Schor I. E., Degner J. F., Harnett D., Cannavò E., Casale F. P., Shim H., Garfield D. A., Birney E., Stephens M., Stegle O., and Furlong E. E. M. (2017). Promoter shape varies across populations and affects promoter evolution and expression noise. Nature Genetics 49 (4), 550–558. [DOI] [PubMed] [Google Scholar]
- [44].Fusi N. and Listgarten J. (2017). Flexible modeling of genetic effects on function-valued traits. Journal of Computational Biology 24 (6), 524–535. [DOI] [PubMed] [Google Scholar]
- [45].Collado-Torres L., Nellore A., Frazee A. C., Wilks C., Love M. I., Langmead B., Irizarry R. A., Leek J. T., and Jaffe A. E. (2016). Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Research 45 (2), e9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [46].Collado-Torres L., Nellore A., Kammers K., Ellis S. E., Taub M. A., Hansen K. D., Jaffe A. E., Langmead B., and Leek J. T. (2017). Reproducible RNA-seq analysis using recount2. Nature Biotechnology 35 (4), 319–321. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [47].Ng B., White C. C., Klein H.-U., Sieberts S. K., McCabe C., Patrick E., Xu J., Yu L., Gaiteri C., Bennett D. A., Mostafavi S., and De Jager P. L. (2017). An xQTL map integrates the genetic architecture of the human brain's transcriptome and epigenome. Nature Neuroscience 20 (10), 1418–1426. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Bennett D. A., Buchman A. S., Boyle P. A., Barnes L. L., Wilson R. S., and Schneider J. A. (2018). Religious Orders Study and Rush Memory and Aging Project. Journal of Alzheimer's disease 64 (Suppl 1), S161–S189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Mallat S. (1989). A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (7), 674–693. [Google Scholar]
- [50].Mallat S. G. (2009). A wavelet tour of signal processing: the sparse way (3rd ed.). Boston, MA: Elsevier/Academic Press. [Google Scholar]
- [51].Bell J. T., Pai A. A., Pickrell J. K., Gaffney D. J., Pique-Regi R., Degner J. F., Gilad Y., and Pritchard J. K. (2011). DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology 12, R10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [52].Dimitromanolakis A., Xu J., Krol A., and Briollais L. (2019). sim1000G: a user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinformatics 20, 26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [53].The 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526 (7571), 68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [54].Rackham O. J. L., Dellaportas P., Petretto E., and Bottolo L. (2015). WGBSSuite: simulating whole-genome bisulphite sequencing data and benchmarking differential DNA methylation analysis tools. Bioinformatics 31 (14), 2371–2373. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [55].Eulalio T., Sun M. W., Gevaert O., Greicius M. D., Montine T. J., Nachun D., and Montgomery S. B. (2025). regionalpcs improve discovery of DNA methylation associations with complex traits. Nature Communications 16, 368. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [56].Yuan D. and Mancuso N. (2023). SuSiE PCA: a scalable Bayesian variable selection technique for principal component analysis. iScience 26 (11), 108181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [57].Dahl A., Guillemot V., Mefford J., Aschard H., and Zaitlen N. (2019). Adjusting for principal components of molecular phenotypes induces replicating false positives. Genetics 211 (4), 1179–1189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [58].Zhou H. J., Li L., Li Y., Li W., and Li J. J. (2022). PCA outperforms popular hidden variable inference methods for molecular QTL mapping. Genome Biology 23 (1), 210. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [59].Zou Y., Carbonetto P., Xie D., Wang G., and Stephens M. (2023). Fast and flexible joint fine-mapping of multiple traits via the Sum of Single Effects model. bioRxiv, doi: 10.1101/2023.04.14.536893. [DOI] [Google Scholar]
- [60]. De Jager P. L., Ma Y., McCabe C., Xu J., Vardarajan B. N., Felsky D., Klein H.-U., White C. C., Peters M. A., Logsdon B., Nejad P., Tang A., Mangravite L. M., Yu L., Gaiteri C., Mostafavi S., Schneider J. A., and Bennett D. A. (2018). A multi-omic atlas of the human frontal cortex for aging and Alzheimer's disease research. Scientific Data 5, 180142.
- [61]. Ng B., White C. C., Klein H.-U., et al. (2017). An xQTL map integrates the genetic architecture of the human brain's transcriptome and epigenome. Nature Neuroscience 20 (10), 1418–1426.
- [62]. McArthur E. and Capra J. A. (2021). Topologically associating domain boundaries that are stable across diverse cell types are evolutionarily constrained and enriched for heritability. American Journal of Human Genetics 108 (2), 269–283.
- [63]. Frankish A., Diekhans M., Ferreira A.-M., Johnson R., Jungreis I., Loveland J., et al. (2019). GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research 47 (D1), D766–D773.
- [64]. Finucane H. K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.-R., et al. (2015). Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics 47 (11), 1228–1235.
- [65]. Gazal S., Finucane H. K., Furlotte N. A., Loh P.-R., Palamara P. F., Liu X., Schoech A., Bulik-Sullivan B., Neale B. M., Gusev A., and Price A. L. (2017). Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature Genetics 49 (10), 1421–1427.
- [66]. Bellenguez C., Küçükali F., Jansen I. E., Kleineidam L., Moreno-Grau S., Amin N., et al. (2022). New insights into the genetic etiology of Alzheimer's disease and related dementias. Nature Genetics 54 (4), 412–436.
- [67]. Jansen I. E., Savage J. E., Watanabe K., Bryois J., Williams D. M., Steinberg S., et al. (2019). Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nature Genetics 51 (3), 404–413.
- [68]. Wallace C. (2021). A more accurate method for colocalisation analysis allowing for multiple causal variants. PLoS Genetics 17 (9), e1009440.
- [69]. Lambert J.-C., Ibrahim-Verbaas C. A., Harold D., Naj A. C., Sims R., Bellenguez C., et al. (2013). Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease. Nature Genetics 45 (12), 1452–1458.
- [70]. Wightman D. P., Jansen I. E., Savage J. E., Shadrin A. A., Bahrami S., Holland D., et al. (2021). A genome-wide association study with 1,126,563 individuals identifies new risk loci for Alzheimer's disease. Nature Genetics 53 (9), 1276–1282.
- [71]. Schwartzentruber J., Cooper S., Liu J. Z., Barrio-Hernandez I., Bello E., Kumasaka N., Young A. M. H., Franklin R. J. M., Johnson T., Estrada K., Gaffney D. J., Beltrao P., and Bassett A. (2021). Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer's disease risk genes. Nature Genetics 53 (3), 392–402.
- [72]. Lambert J.-C., Heath S., Even G., Campion D., Sleegers K., Hiltunen M., et al. (2009). Genome-wide association study identifies variants at CLU and CR1 associated with Alzheimer's disease. Nature Genetics 41 (10), 1094–1099.
- [73]. Jun G., Naj A. C., Beecham G. W., Wang L.-S., Buros J., Gallins P. J., et al. (2010). Meta-analysis confirms CR1, CLU, and PICALM as Alzheimer Disease risk loci and reveals interactions with APOE genotypes. Archives of Neurology 67 (12), 1473–1484.
- [74]. Belbasis L., Morris S., van Duijn C., Bennett D., and Walters R. (2025). Mendelian randomization identifies proteins involved in neurodegenerative diseases. Brain 148 (7), 2412–2428.
- [75]. Fonseca M. I., Chu S., Pierce A. L., Brubaker W. D., Hauhart R. E., Mastroeni D., Clarke E. V., Rogers J., Atkinson J. P., and Tenner A. J. (2016). Analysis of the putative role of CR1 in Alzheimer's disease: genetic association, expression and function. PLoS ONE 11 (2), e0149792.
- [76]. Kucukkilic E., Brookes K., Barber I., Guetta-Baranes T., Morgan K., Hollox E. J., and ARUK Consortium (2018). Complement receptor 1 gene (CR1) intragenic duplication and risk of Alzheimer's disease. Human Genetics 137 (4), 305–314.
- [77]. Brouwers N., Van Cauwenberghe C., Engelborghs S., Lambert J.-C., Bettens K., Le Bastard N., Pasquier F., Montoya A. G., Peeters K., Mattheijssens M., Vandenberghe R., De Deyn P. P., Cruts M., Amouyel P., Sleegers K., and Van Broeckhoven C. (2012). Alzheimer risk associated with a copy number variation in the complement receptor 1 increasing C3b/C4b binding sites. Molecular Psychiatry 17 (2), 223–233.
- [78]. Western D., Timsina J., Wang L., Wang C., Yang C., Phillips B., et al. (2024). Proteogenomic analysis of human cerebrospinal fluid identifies neurologically relevant regulation and implicates causal proteins for Alzheimer's disease. Nature Genetics 56 (12), 2672–2684.
- [79]. Cao X., Sun H., Feng R., Mazumder R., Buen Abad Najar C. F., Li Y. I., de Jager P. L., Bennett D., Dey K. K., and Wang G. (2025). Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. medRxiv, doi: 10.1101/2025.04.17.25326042.
- [80]. Xiong X., James B. T., Boix C. A., Park Y. P., Galani K., Victor M. B., Sun N., Hou L., Ho L.-L., Mantero J., Scannail A. N., Dileep V., Dong W., Mathys H., Bennett D. A., Tsai L.-H., and Kellis M. (2023). Epigenomic dissection of Alzheimer's disease pinpoints causal variants and reveals epigenome erosion. Cell 186 (20), 4422–4437.e21.
- [81]. Stephens M. (2017). False discovery rates: a new deal. Biostatistics 18 (2), 275–294.
- [82]. Donoho D. L. and Johnstone I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 (3), 425–455.
- [83]. Sniekers S. and van der Vaart A. (2020). Adaptive Bayesian credible bands in regression with a Gaussian process prior. Sankhya A 82 (2), 386–425.
- [84]. Barber S., Nason G. P., and Silverman B. W. (2002). Posterior probability intervals for wavelet thresholding. Journal of the Royal Statistical Society, Series B 64 (2), 189–205.
- [85]. Cruchaga C. (2022). ADSP functional genomics: from gene, to function to mechanisms and targets. Alzheimer's & Dementia 18 (S4), e066436.
- [86]. Leung Y. Y., Lee W.-P., Kuzma A. B., Nicaretta H., Valladares O., Gangadharan P., et al. (2025). Alzheimer's Disease Sequencing Project release 4 whole genome sequencing dataset. Alzheimer's & Dementia 21 (5), e70237.
- [87]. Leung Y. Y., Valladares O., Chou Y.-F., Lin H.-J., Kuzma A. B., Cantwell L., Qu L., Gangadharan P., Salerno W. J., Schellenberg G. D., and Wang L.-S. (2019). VCPA: genomic variant calling pipeline and data management tool for Alzheimer's Disease Sequencing Project. Bioinformatics 35 (10), 1768–1770.
- [88]. Taylor-Weiner A., Aguet F., Haradhvala N. J., Gosai S., Anand S., Kim J., Ardlie K., Van Allen E. M., and Getz G. (2019). Scaling computational genomics to millions of individuals with GPUs. Genome Biology 20, 228.
- [89]. Heintzman N. D., Stuart R. K., Hon G., Fu Y., Ching C. W., Hawkins R. D., Barrera L. O., Van Calcar S., Qu C., Ching K. A., Wang W., Weng Z., Green R. D., Crawford G. E., and Ren B. (2007). Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genetics 39 (3), 311–318.
- [90]. Denault W. R., Carbonetto P., Li R., The Alzheimer's Disease Functional Genomics Consortium, Wang G., and Stephens M. (2025). Accounting for uncertainty in residual variances improves calibration of the Sum of Single Effects model for small sample sizes. bioRxiv, doi: 10.1101/2025.05.16.654543.
- [91]. Bibikova M., Barnes B., Tsan C., Ho V., Klotzle B., Le J. M., Delano D., Zhang L., Schroth G. P., Gunderson K. L., Fan J.-B., and Shen R. (2011). High density DNA methylation array with single CpG site resolution. Genomics 98 (4), 288–295.
- [92]. Morris J. S. (2015). Functional regression. Annual Review of Statistics and its Application 2, 321–359.
- [93]. George E. I. and McCulloch R. E. (1997). Approaches to Bayesian variable selection. Statistica Sinica 7, 339–373.
- [94]. Stephens M. and Balding D. J. (2009). Bayesian statistical methods for genetic association studies. Nature Reviews Genetics 10, 681–690.
- [95]. Crouse M., Nowak R., and Baraniuk R. (1998). Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing 46 (4), 886–902.
- [96]. Ma L. and Soriano J. (2018). Efficient functional ANOVA through wavelet-domain Markov Groves. Journal of the American Statistical Association 113 (522), 802–818.
- [97]. Xing Z., Carbonetto P., and Stephens M. (2021). Flexible signal denoising via flexible empirical Bayes shrinkage. Journal of Machine Learning Research 22 (93), 1–28.
- [98]. Wang W. and Stephens M. (2021). Empirical Bayes matrix factorization. Journal of Machine Learning Research 22 (120), 1–40.
- [99]. Nason G. (2008). Wavelet Methods in Statistics with R. New York, NY: Springer.
- [100]. Fachal L., Aschard H., Beesley J., Barnes D. R., Allen J., Kar S., et al. (2020). Fine-mapping of 150 breast cancer risk regions identifies 191 likely target genes. Nature Genetics 52, 56–73.
- [101]. Yang Z., Wang C., Liu L., Khan A., Lee A., Vardarajan B., Mayeux R., Kiryluk K., and Ionita-Laza I. (2023). CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses. Nature Genetics 55 (6), 1057–1065.
- [102]. Du P., Zhang X., Huang C.-C., Jafari N., Kibbe W. A., Hou L., and Lin S. M. (2010). Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587.
- [103]. Chung R.-H. and Kang C.-Y. (2019). A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. GigaScience 8 (5).
- [104]. Li H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, doi: 10.48550/arXiv.1303.3997.
- [105]. DePristo M. A., Banks E., Poplin R., Garimella K. V., Maguire J. R., Hartl C., Philippakis A. A., del Angel G., Rivas M. A., Hanna M., McKenna A., Fennell T. J., Kernytsky A. M., Sivachenko A. Y., Cibulskis K., Gabriel S. B., Altshuler D., and Daly M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43 (5), 491–498.
- [106]. Chen S., Zhou Y., Chen Y., and Gu J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34 (17), i884–i890.
- [107]. Dobin A., Davis C. A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., and Gingeras T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 (1), 15–21.
- [108]. Van de Geijn B., McVicker G., Gilad Y., and Pritchard J. K. (2015). WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nature Methods 12 (11), 1061–1063.
- [109]. Broad Institute (2019). Picard toolkit. https://broadinstitute.github.io/picard/.
- [110]. Graubert A., Aguet F., Ravi A., Ardlie K. G., and Getz G. (2021). RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts. Bioinformatics 37 (18), 3048–3050.
- [111]. The GTEx Consortium (2020). The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369 (6509), 1318–1330.
- [112]. Blighe K. and Lun A. (2024). PCAtools: everything principal components analysis. R package version 2.18.0.
- [113]. Johnson E. C. B., Dammer E. B., Duong D. M., Yin L., Thambisetty M., Troncoso J. C., Lah J. J., Levey A. I., and Seyfried N. T. (2018). Deep proteomic network analysis of Alzheimer's disease brain reveals alterations in RNA binding proteins and RNA splicing associated with disease. Molecular Neurodegeneration 13, 52.
- [114]. Johnson E. C. B., Dammer E. B., Duong D. M., Ping L., Zhou M., Yin L., et al. (2020). Large-scale proteomic analysis of Alzheimer's disease brain and cerebrospinal fluid reveals early changes in energy metabolism associated with microglia and astrocyte activation. Nature Medicine 26, 769–780.
- [115]. Ping L., Duong D. M., Yin L., Gearing M., Lah J. J., Levey A. I., and Seyfried N. T. (2018). Global quantitative analysis of the human brain proteome in Alzheimer's and Parkinson's disease. Scientific Data 5, 180036.
- [116]. Ping L., Kundinger S. R., Duong D. M., Yin L., Gearing M., Lah J. J., Levey A. I., and Seyfried N. T. (2020). Global quantitative analysis of the human brain proteome and phosphoproteome in Alzheimer's disease. Scientific Data 7, 315.
- [117]. Qi Z., Pelletier A., Willwerscheid J., Cao X., Wen X., Cruchaga C., De Jager P., TCW J., and Wang G. (2023). Novel missing data imputation approaches enhance quantitative trait loci discovery in multi-omics analysis.
- [118]. Zhang Y., Liu T., Meyer C. A., Eeckhoute J., Johnson D. S., Bernstein B. E., Nusbaum C., Myers R. M., Brown M., Li W., and Liu X. S. (2008). Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9, R137.
- [119]. Feng J., Liu T., Qin B., Zhang Y., and Liu X. S. (2012). Identifying ChIP-seq enrichment using MACS. Nature Protocols 7, 1728–1740.
- [120]. Law C. W., Chen Y., Shi W., and Smyth G. K. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology 15, R29.
- [121]. Law C. W., Alhamdoosh M., Su S., Dong X., Tian L., Smyth G. K., and Ritchie M. E. (2018). RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved]. F1000Research 5 (1408).
- [122]. Johnson W. E., Li C., and Rabinovic A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 (1), 118–127.
- [123]. Leek J. T., Johnson W. E., Parker H. S., Jaffe A. E., and Storey J. D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28 (6), 882–883.
- [124]. Storey J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 31 (6), 2013–2035.
- [125]. Zhou W., Triche T. J. Jr., Laird P. W., and Shen H. (2018). SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Research 46 (20), e123.
- [126]. Chen Y.-a., Lemire M., Choufani S., Butcher D. T., Grafodatskaya D., Zanke B. W., Gallinger S., Hudson T. J., and Weksberg R. (2013). Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics 8 (2), 203–209.
- [127]. Naeem H., Wong N. C., Chatterton Z., Hong M. K. H., Pedersen J. S., Corcoran N. M., Hovens C. M., and Macintyre G. (2014). Reducing the risk of false discovery enabling identification of biologically significant genome-wide methylation status using the HumanMethylation450 array. BMC Genomics 15, 51.
- [128]. Yang H.-H. and Han M.-R. (2024). MethylCallR: a comprehensive analysis framework for Illumina Methylation Beadchip. Scientific Reports 14, 27026.
- [129]. Kanai M., Elzur R., and Zhou W. (2022). Meta-analysis fine-mapping is often miscalibrated at single-variant resolution. Cell Genomics 2 (12), 100210.
- [130]. Nason G. P., von Sachs R., and Kroisandt G. (2000). Wavelet processes and adaptive estimation of the evolutionary wavelet spectrum. Journal of the Royal Statistical Society, Series B 62 (2), 271–292.
Data Availability Statement
The datasets analyzed, including the ADSP R4 WGS genotype data, are available for application and download from the NIAGADS Data Sharing Service, https://dss.niagads.org. The code implementing the data processing and analysis pipelines is available at https://statfungen.github.io/xqtl-protocol. Data generated from the numerical studies and analyses used to prepare the figures for the paper are available at https://github.com/stephenslab/fsusie-experiments/. A subset of the fine-mapped QTLs obtained from our analyses is available for peer review purposes at https://github.com/stephenslab/fsusie-experiments/blob/main/data/README.md. The complete set of QTL data and QTL-GWAS integration models will be made publicly available at https://synapse.org prior to publication for registered Synapse users, as per the Data Management and Sharing policy of the FunGen-xQTL project. Other data sets used include: 1000 Genomes Phase 3 whole-genome sequencing data, https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/; GENCODE Ensembl/Havana database, https://useast.ensembl.org/info/genome/genebuild/annotation_merge.html; MethylCallR R package, https://github.com/Yang9704/MethylCallR.
The fsusieR R package is available on GitHub at https://github.com/stephenslab/fsusieR (3-clause BSD license). The code used to perform the simulations and generate the manuscript figures is available at https://github.com/stephenslab/fsusie-experiments/. Other software and R packages used in this work include: VCPA 1.1 (http://www.niagads.org/VCPA/); BWA-MEM 0.7.15 (https://github.com/lh3/bwa/); GATK 4.1.1 (https://gatk.broadinstitute.org); FastQC 0.12.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/); fastp 0.23.4 (https://github.com/OpenGene/fastp/); STAR 2.7.11b (https://github.com/alexdobin/STAR/); WASP (https://github.com/bmvdgeijn/WASP/); Picard 3.1.0 (https://broadinstitute.github.io/picard/); RNA-SeQC 2.4.2 (https://github.com/getzlab/rnaseqc/); MACS2 (https://github.com/macs3-project/MACS/); ComBat (https://github.com/epigenelabs/pyComBat/); PLINK 1.9 (https://www.cog-genomics.org/plink/); TensorQTL 1.0.8 (https://github.com/broadinstitute/tensorqtl/); sim1000G 1.40 (https://github.com/adimitromanolakis/sim1000G); WGBSSuite 0.4 (https://github.com/SystemsGeneticsSG/WGBSSuite/); SeqSIMLA (https://seqsimla.sourceforge.net/); susieR 0.14.7 (https://github.com/stephenslab/susieR/); coloc 5.2.3 (https://github.com/chr1swallace/coloc/); ashr 2.2-63 (https://github.com/stephens999/ashr/); mixsqp 0.3.18 (https://github.com/stephenslab/mixsqp/); wavethresh 4.7.2 (https://cran.r-project.org/package=wavethresh); limma 3.56.2 (https://bioconductor.org/packages/limma/); qvalue 2.32.0 (https://github.com/StoreyLab/qvalue/); flashier 1.0.21 (https://github.com/willwerscheid/flashier/); R 4.3.3 (https://www.r-project.org).






